Text Analytics Endpoints

Language Identifier

https://analytics.babelstreet.com/rest/v1/language

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/language.curl

Language Identifier identifies the language or languages of the input text. The endpoint returns a list of languages identified in descending order of confidence, so the first result is the best. Language Identifier can also detect different language regions in a multilingual document. When multilingual is set to true, it returns a list of language regions in addition to the whole-document results.

The input data may be in any of 364 language–encoding–script combinations, involving 56 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.

When input text is submitted for detection, the endpoint builds a similar n-gram profile from that data. It then compares the input profile with each built-in profile by calculating a vector distance between the two. The built-in profiles are returned in ascending order of distance, so the closest match comes first.
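The profile comparison described above can be sketched roughly as follows. This is an illustration only: the source does not say which distance measure the endpoint uses, so a simple rank-based "out-of-place" distance stands in for it here. The function names, the sample sentences, and the tiny two-language profile set are all hypothetical.

```python
from collections import Counter


def ngram_profile(data: bytes, n: int = 4, top_k: int = 5000) -> list[bytes]:
    """Return the top_k most frequent byte n-grams, most frequent first."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Sort by descending count, then by the n-gram itself so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(input_profile: list[bytes],
                          reference_profile: list[bytes]) -> int:
    """Sum of rank differences (assumed stand-in for the real measure).

    N-grams missing from the reference receive the maximum penalty.
    """
    ref_rank = {gram: rank for rank, gram in enumerate(reference_profile)}
    max_penalty = len(reference_profile)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else max_penalty
        for rank, gram in enumerate(input_profile)
    )


def rank_languages(text: bytes,
                   profiles: dict[str, list[bytes]]) -> list[tuple[str, int]]:
    """Return (profile name, distance) pairs, closest profile first."""
    input_profile = ngram_profile(text)
    distances = [
        (name, out_of_place_distance(input_profile, profile))
        for name, profile in profiles.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])


# Toy reference profiles built from one sentence each (illustrative data).
profiles = {
    "eng": ngram_profile(b"the quick brown fox jumps over the lazy dog and the cat"),
    "spa": ngram_profile(b"el rapido zorro marron salta sobre el perro perezoso y el gato"),
}
print(rank_languages(b"the dog and the fox", profiles))
```

Real profiles are built from large corpora and hold thousands of n-grams, so the distances produced by this toy example are not comparable to anything the endpoint returns.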

For short strings (140 characters or fewer), the endpoint uses a separate proprietary detection algorithm, which is available for all supported languages.

Tip

If the endpoint is unable to detect a language, it returns Unknown (xxx).

Query Parameters

Name	Value	Description
output	rosette	Returns the response in ADM format.

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

Name	Type	Description
`content`	string	Text to process
`contentUri`	string	URI to accessible content

Notice

`content` and `contentUri` are mutually exclusive; specify only one per call.

Option	Type	Description	Default
`multilingual`	boolean	If true, detect regions in multilingual documents. In addition to the whole-document language identification results, it returns a list of language region results in descending order of confidence.	false
`koreanDialects`	boolean	If true, language classification for North Korean and South Korean is enabled.	false

{
  "content": "string",
  "options": {
    "koreanDialects": false,
    "multilingual": false
  }
}
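A minimal sketch of assembling this request body with the Python standard library is shown below. The helper name `build_language_request` is hypothetical, and both the `X-BabelStreetAPI-Key` header name and the use of POST are assumptions that should be checked against the authentication documentation. The helper only builds the request; it does not send it.

```python
import json
import urllib.request

LANGUAGE_URL = "https://analytics.babelstreet.com/rest/v1/language"


def build_language_request(api_key, content=None, content_uri=None,
                           multilingual=False, korean_dialects=False):
    """Build (but do not send) a request for the language endpoint."""
    # content and contentUri are mutually exclusive: require exactly one.
    if (content is None) == (content_uri is None):
        raise ValueError("Specify exactly one of content or content_uri")

    body = {
        "options": {
            "multilingual": multilingual,
            "koreanDialects": korean_dialects,
        }
    }
    if content is not None:
        body["content"] = content
    else:
        body["contentUri"] = content_uri

    return urllib.request.Request(
        LANGUAGE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-BabelStreetAPI-Key": api_key,  # assumed header name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",  # assumed HTTP method
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the response body with `json.load`.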

Response

Language Identifier returns a confidence score with each language result, ranging from 0 to 1. You can use this score as a threshold for filtering out low-confidence results.

{
  "languageDetections": [
    {
      "language": "string",
      "confidence": number
    }
  ]
}
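The confidence threshold mentioned above could be applied along these lines. The helper name, the default threshold of 0.5, and the sample response values are illustrative assumptions, not recommended settings or real output.

```python
def filter_detections(response: dict, threshold: float = 0.5) -> list[dict]:
    """Keep only detections whose confidence meets the threshold."""
    detections = response.get("languageDetections", [])
    return [d for d in detections if d["confidence"] >= threshold]


# Made-up response in the documented shape.
sample = {
    "languageDetections": [
        {"language": "spa", "confidence": 0.82},
        {"language": "por", "confidence": 0.11},
    ]
}
print(filter_detections(sample))
```

Because the endpoint returns results in descending order of confidence, the filtered list keeps that order and its first entry remains the best match.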

Supported Languages

GET /language/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO-15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
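As a small sketch, the licensed language and script pairs can be pulled out of this response as shown below. The helper name and the sample entries are hypothetical and only follow the documented field names.

```python
def licensed_pairs(response: dict) -> list[tuple[str, str]]:
    """Return sorted (language, script) pairs that are licensed."""
    entries = response.get("supportedLanguages", [])
    return sorted(
        (entry["language"], entry["script"])
        for entry in entries
        if entry["licensed"]
    )


# Made-up response in the documented shape.
sample_supported = {
    "supportedLanguages": [
        {"language": "srp", "script": "Cyrl", "licensed": True},
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
    ]
}
print(licensed_pairs(sample_supported))
```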

Language (`code`)	Short Strings Support
Albanian (`sqi`)	✓
Arabic (`ara`)	✓ ^[a]
Bengali (`ben`)	✓
Bulgarian (`bul`)	✓
Catalan (`cat`)	✓
Chinese: Simplified & Traditional (`zho`)	✓
Croatian (`hrv`)	✓
Czech (`ces`)	✓
Danish (`dan`)	✓
Dutch (`nld`)	✓
English (`eng`) ^[b]	✓
Estonian (`est`)	✓
Finnish (`fin`)	✓
French (`fra`)	✓
German (`deu`)	✓
Greek (`ell`)	✓
Gujarati (`guj`)	✓
Hebrew (`heb`)	✓
Hindi (`hin`)	✓
Hungarian (`hun`)	✓
Icelandic (`isl`)	✓
Indonesian (`ind`)	✓
Italian (`ita`)	✓
Japanese (`jpn`)	✓
Kannada (`kan`)	✓
Korean (`kor`) ^[c]	✓
Korean (North) (`qkp`) ^[c]	✓
Korean (South) (`qkr`) ^[c]	✓
Kurdish (`kur`)	✓ ^[a]
Latvian (`lav`)	✓
Lithuanian (`lit`)	✓
Macedonian (`mkd`)	✓
Malayalam (`mal`)	✓
Norwegian (`nor`)	✓
Persian (`fas`)	✓ ^[a]
Polish (`pol`)	✓
Portuguese (`por`)	✓
Pashto (`pus`)	✓ ^[a]
Romanian (`ron`)	✓
Russian (`rus`)	✓
Serbian (`srp`) ^[d]	✓ ^[e]
Slovak (`slk`)	✓
Slovenian (`slv`)	✓
Somali (`som`)	✓
Spanish (`spa`)	✓
Standard Malay (`zsm`)	✓
Swedish (`swe`)	✓
Tagalog (`tgl`)	✓
Tamil (`tam`)	✓
Telugu (`tel`)	✓
Thai (`tha`)	✓
Turkish (`tur`)	✓
Ukrainian (`ukr`)	✓
Urdu (`urd`)	✓ ^[a]
Uzbek (`uzb`)	✓
Vietnamese (`vie`)	✓
^[a] Short-string support is provided only for Arabic script.
^[b] The language endpoint also provides specialized support for English text in which letter case should be ignored. When processing such text, specify the English language code `uen`.
^[c] North Korean (`qkp`) and South Korean (`qkr`) are enabled only if `koreanDialects` is `true`. Korean (`kor`) is enabled only if `koreanDialects` is `false`. By default, `koreanDialects` is set to `false`.
^[d] The language endpoint detects Cyrillic Serbian but, by default, disables detection of Latin-script Serbian. To enable detection of Serbian written in Latin script, set its weight to a positive number (such as 100) in the options when calling the endpoint: `"options": { "languageWeightAdjustments": [ { "language": "srp", "script": "Latn", "weight": 100 } ] }`
^[e] The short-string algorithm supports Latin-script Serbian by default.

Babel Street Analytics API