Babel Street Analytics API

Text Analytics Endpoints

Language Identifier

 https://analytics.babelstreet.com/rest/v1/language

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/language.curl

Language Identifier identifies the language or languages of the input text. The endpoint returns a list of languages identified in descending order of confidence, so the first result is the best. Language Identifier can also detect different language regions in a multilingual document. When multilingual is set to true, it returns a list of language regions in addition to the whole-document results.

The input data may be in any of 364 language–encoding–script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.

When input text is submitted for detection, the endpoint builds a similar n-gram profile from that data. It then compares the input profile with every built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of distance, so the closest match comes first.
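The profile-and-distance approach described above can be illustrated with a toy sketch. This is not the endpoint's implementation: the actual profiles and distance measure are not documented here, so the sketch assumes cosine distance over byte quad-gram frequencies, and the names `ngram_profile`, `cosine_distance`, `rank_profiles`, and `TOY_PROFILES` are hypothetical.

```python
import math
from collections import Counter


def ngram_profile(data: bytes, n: int = 4, top_k: int = 5000) -> dict:
    """Relative frequencies of the top_k most common byte n-grams in data."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    most_common = counts.most_common(top_k)
    total = sum(count for _, count in most_common)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in most_common}


def cosine_distance(a: dict, b: dict) -> float:
    """1 - cosine similarity (an assumed measure; the real one is not documented)."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_profiles(text: str, profiles: dict, encoding: str = "utf-8") -> list:
    """Return (profile name, distance) pairs, closest profile first."""
    input_profile = ngram_profile(text.encode(encoding))
    scored = [(name, cosine_distance(input_profile, profile))
              for name, profile in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy profiles built from single sentences; real profiles come from large corpora.
TOY_PROFILES = {
    "eng": ngram_profile(
        b"the quick brown fox jumps over the lazy dog and then the dog sleeps"),
    "fra": ngram_profile(
        b"le renard brun rapide saute par-dessus le chien paresseux et le chien dort"),
}

ranking = rank_profiles("the dog jumps over the fox", TOY_PROFILES)
```

Here `ranking` lists the toy English profile first because it shares many quad-grams with the input, while the toy French profile shares none.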

For all supported languages, the endpoint uses a separate proprietary algorithm to detect the language of short strings (140 characters or fewer).

Tip

If the endpoint is unable to detect a language, it returns Unknown (xxx).

Query Parameters

| Name | Value | Description |
| --- | --- | --- |
| output | rosette | Returns the response in ADM format. |
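As a small illustration, the query parameter can be appended to the endpoint URL with the standard library. The helper name `language_url` is hypothetical.

```python
from urllib.parse import urlencode

LANGUAGE_URL = "https://analytics.babelstreet.com/rest/v1/language"


def language_url(adm_output: bool = False) -> str:
    """Return the endpoint URL, adding output=rosette when ADM output is wanted."""
    if adm_output:
        return f"{LANGUAGE_URL}?{urlencode({'output': 'rosette'})}"
    return LANGUAGE_URL
```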

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

| Name | Type | Description |
| --- | --- | --- |
| content | string | Text to process |
| contentUri | string | URI to accessible content |

Notice

content and contentUri are mutually exclusive; only one can be specified per call.

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| multilingual | boolean | If true, detect regions in multilingual documents. In addition to the whole-document language identification results, it returns a list of language region results in descending order of confidence. | false |
| koreanDialects | boolean | If true, language classification for North Korean and South Korean is enabled. | false |

{
  "content": "string",
  "options": {
    "koreanDialects": false,
    "multilingual": false
  }
}
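A minimal sketch of building and sending this request body in Python follows. It rests on assumptions not stated in this section: that the request is sent with POST as JSON, that exactly one of content or contentUri must be present, and that authentication headers are supplied by the caller. The function names `build_language_request` and `detect_language` are hypothetical.

```python
import json
import urllib.request
from typing import Optional

LANGUAGE_URL = "https://analytics.babelstreet.com/rest/v1/language"


def build_language_request(content: Optional[str] = None,
                           content_uri: Optional[str] = None,
                           multilingual: bool = False,
                           korean_dialects: bool = False) -> dict:
    """Build the request body, enforcing that content and contentUri are exclusive."""
    if (content is None) == (content_uri is None):
        raise ValueError("Specify exactly one of content or content_uri")
    body = {"content": content} if content is not None else {"contentUri": content_uri}
    body["options"] = {
        "koreanDialects": korean_dialects,
        "multilingual": multilingual,
    }
    return body


def detect_language(body: dict, headers: dict) -> dict:
    """Send the body and decode the JSON reply.

    Assumptions: the endpoint accepts POST with a JSON body, and `headers`
    carries whatever authentication your account requires.
    """
    request = urllib.request.Request(
        LANGUAGE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Validating the body before sending surfaces the content/contentUri conflict locally instead of as a server-side error.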

Response

Language Identifier returns a confidence score with each language result, ranging from 0 to 1. You can use this score as a threshold for filtering out low-confidence results.

{
  "languageDetections": [
    {
      "language": "string",
      "confidence": number
    }
  ]
}
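Threshold filtering on a decoded response might look like the sketch below. The sample response values and the 0.5 threshold are invented for illustration, the helper name `confident_languages` is hypothetical, and the sketch assumes that an undetected language appears with the code xxx in the language field.

```python
def confident_languages(response: dict, threshold: float = 0.5) -> list:
    """Return language codes at or above the threshold, skipping Unknown (xxx)."""
    return [
        detection["language"]
        for detection in response.get("languageDetections", [])
        if detection["language"] != "xxx" and detection["confidence"] >= threshold
    ]


# Hypothetical response, shaped like the schema above.
sample_response = {
    "languageDetections": [
        {"language": "spa", "confidence": 0.83},
        {"language": "por", "confidence": 0.12},
    ]
}

kept = confident_languages(sample_response)
```

With this sample, `kept` contains only "spa" because the second detection falls below the threshold.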

Supported Languages

GET /language/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
| --- | --- | --- |
| language | string | ISO 639 language code |
| script | string | Four-letter ISO-15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
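Once decoded, the response can be reduced to the language and script pairs you are licensed for. The sketch below assumes the response has already been parsed into a dict; the sample entries and the helper name `licensed_languages` are hypothetical.

```python
def licensed_languages(response: dict) -> list:
    """Return sorted (language, script) pairs whose licensed flag is true."""
    pairs = {
        (entry["language"], entry["script"])
        for entry in response.get("supportedLanguages", [])
        if entry.get("licensed")
    }
    return sorted(pairs)


# Hypothetical response, shaped like the schema above.
sample_supported = {
    "supportedLanguages": [
        {"language": "srp", "script": "Cyrl", "licensed": True},
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
    ]
}

available = licensed_languages(sample_supported)
```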

| Language (code) | Short Strings Support |
| --- | --- |
| Albanian (sqi) | |
| Arabic (ara) | [a] |
| Bengali (ben) | |
| Bulgarian (bul) | |
| Catalan (cat) | |
| Chinese: Simplified & Traditional (zho) | |
| Croatian (hrv) | |
| Czech (ces) | |
| Danish (dan) | |
| Dutch (nld) | |
| English (eng) [b] | |
| Estonian (est) | |
| Finnish (fin) | |
| French (fra) | |
| German (deu) | |
| Greek (ell) | |
| Gujarati (guj) | |
| Hebrew (heb) | |
| Hindi (hin) | |
| Hungarian (hun) | |
| Icelandic (isl) | |
| Indonesian (ind) | |
| Italian (ita) | |
| Japanese (jpn) | |
| Kannada (kan) | |
| Korean (kor) [c] | |
| Korean (North) (qkp) [c] | |
| Korean (South) (qkr) [c] | |
| Kurdish (kur) | [a] |
| Latvian (lav) | |
| Lithuanian (lit) | |
| Macedonian (mkd) | |
| Malayalam (mal) | |
| Norwegian (nor) | |
| Persian (fas) | [a] |
| Polish (pol) | |
| Portuguese (por) | |
| Pashto (pus) | [a] |
| Romanian (ron) | |
| Russian (rus) | |
| Serbian (srp) [d] | [e] |
| Slovak (slk) | |
| Slovenian (slv) | |
| Somali (som) | |
| Spanish (spa) | |
| Standard Malay (zsm) | |
| Swedish (swe) | |
| Tagalog (tgl) | |
| Tamil (tam) | |
| Telugu (tel) | |
| Thai (tha) | |
| Turkish (tur) | |
| Ukrainian (ukr) | |
| Urdu (urd) | [a] |
| Uzbek (uzb) | |
| Vietnamese (vie) | |

[a] Short-string support is only provided for Arabic script

[b] The language endpoint also provides specialized support for English text in which letter case should be ignored. When processing such text, specify the English language code uen.

[c] North Korean (qkp) and South Korean (qkr) are enabled only if koreanDialects is true. Korean (kor) is only enabled if koreanDialects is false. By default, koreanDialects is set to false.

[d] The language endpoint detects Cyrillic Serbian, but disables, by default, detection of Latin script Serbian. To enable detection of Serbian written in Latin script, set its weight to a positive number (such as 100) in the options when calling the endpoint.

"options": {
  "languageWeightAdjustments": [
    { "language": "srp", "script": "Latn", "weight": 100 }
  ]
}

[e] The short-string algorithm supports Latin Script Serbian by default.
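The weight adjustment from note [d] can be added to an options dict programmatically. The sketch below only assembles the option structure shown in that note; the helper name `with_weight_adjustment` is hypothetical, and how the endpoint treats weights other than the documented positive value for Latin-script Serbian is not covered here.

```python
def with_weight_adjustment(options: dict, language: str, script: str, weight: int) -> dict:
    """Return a copy of options with one languageWeightAdjustments entry appended."""
    updated = dict(options)
    adjustments = list(updated.get("languageWeightAdjustments", []))
    adjustments.append({"language": language, "script": script, "weight": weight})
    updated["languageWeightAdjustments"] = adjustments
    return updated


# Enable Latin-script Serbian, using the values from note [d].
serbian_options = with_weight_adjustment({"multilingual": False}, "srp", "Latn", 100)
```

Returning a copy leaves the caller's original options untouched, which helps when the same base options are reused across requests.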