Base Linguistics

Base Linguistics provides language-specific tools to prepare your data for analysis through three endpoints.

The following table indicates the type of support that Base Linguistics provides for each supported language. The tokenizer provides normalization, tokenization, and sentence boundary detection. The analyzer provides lemma lookup (including orthographic normalization for Japanese) and, when the lookup fails, lemma guessing.

For languages with compound words, words are divided into components, which improves recall for search engines.
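
As a rough illustration of why compound components help recall, the sketch below indexes each token together with its components, so that a query for a component also finds documents containing the full compound. The decomposition table, the `build_index` helper, and the sample documents are all hypothetical; the components are hand-written stand-ins, not actual analyzer output.

```python
from collections import defaultdict

# Hypothetical, hand-written decompositions standing in for analyzer output.
COMPOUND_COMPONENTS = {
    "fußballspiel": ["fußball", "spiel"],
    "kinderbuch": ["kinder", "buch"],
}

def build_index(docs):
    """Map each token, and each of its compound components, to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            # Also index the components so component queries match the compound.
            for component in COMPOUND_COMPONENTS.get(token, []):
                index[component].add(doc_id)
    return index

docs = {1: "Das Fußballspiel beginnt", 2: "Ein Kinderbuch"}
index = build_index(docs)
```

With this index, a search for `spiel` returns document 1 even though the word never appears on its own in the text.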

For Chinese tokens in Han script, pinyin transcriptions are returned as the Han reading. For Japanese tokens in Han script (kanji), hiragana transcriptions are returned as the Han reading.

For unknown languages (language code `xxx`), the tokenizer applies generic rules, such as whitespace and punctuation delimitation. It also identifies some common acronyms and abbreviations, as well as sentence boundaries. Segmentation user dictionaries are supported for unknown languages.
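
To give a feel for what whitespace and punctuation delimitation means, here is a minimal sketch using a regular expression. It is an illustration only, not the product's tokenizer: the `generic_tokenize` name and the pattern are assumptions, and the sketch does not attempt acronym, abbreviation, or sentence boundary handling.

```python
import re

# A token is either a run of word characters (optionally joined by internal
# apostrophes or hyphens) or a single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def generic_tokenize(text):
    """Split text on whitespace and emit punctuation as separate tokens."""
    return TOKEN_RE.findall(text)

tokens = generic_tokenize("Hello, world! It's well-known.")
# tokens == ["Hello", ",", "world", "!", "It's", "well-known", "."]
```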

Language (code)	Tokenization	Parts of Speech	Lemmas	Compound Components	Han Readings	Sentence Boundary
Arabic (`ara`)	✓	✓	✓			✓
Catalan (`cat`)	✓		✓			✓
Chinese (`zho`)	✓	✓	✓		✓	✓
Czech (`ces`)	✓	✓	✓			✓
Danish (`dan`)	✓		✓	✓		✓
Dutch (`nld`)	✓	✓	✓	✓		✓
English (`eng`)	✓	✓	✓			✓
Estonian (`est`)	✓		✓			✓
Finnish (`fin`)	✓
French (`fra`)	✓	✓	✓			✓
German (`deu`)	✓	✓	✓	✓		✓
Greek (`ell`)	✓	✓	✓			✓
Hebrew (`heb`)	✓	✓	✓			✓
Hungarian (`hun`)	✓	✓	✓	✓		✓
Indonesian (`ind`)	✓	✓	✓
Italian (`ita`)	✓	✓	✓			✓
Japanese (`jpn`)	✓	✓	✓		✓	✓
Korean (`kor`)	✓	✓	✓	✓		✓
Korean-North (`qkp`)	✓	✓	✓	✓		✓
Korean-South (`qkr`)	✓	✓	✓	✓		✓
Latvian (`lav`)	✓		✓			✓
Malay, Standard (`zsm`)	✓	✓	✓
Norwegian (`nor`)	✓		✓	✓		✓
Norwegian-Bokmål (`nob`)	✓		✓	✓		✓
Norwegian-Nynorsk (`nno`)	✓		✓	✓		✓
Pashto (`pus`)	✓
Persian (`fas`)	✓	✓	✓			✓
Persian-Afghan (`prs`)	✓	✓	✓			✓
Persian-Iranian (`pes`)	✓	✓	✓			✓
Polish (`pol`)	✓	✓	✓			✓
Portuguese (`por`)	✓	✓	✓			✓
Romanian (`ron`)	✓		✓			✓
Russian (`rus`)	✓	✓	✓			✓
Serbian (`srp`)^[a]	✓		✓			✓
Slovak (`slk`)	✓		✓			✓
Spanish (`spa`)	✓	✓	✓			✓
Swedish (`swe`)	✓		✓	✓		✓
Tagalog (`tgl`)	✓	✓	✓
Thai (`tha`)	✓		✓			✓
Turkish (`tur`)	✓		✓			✓
Ukrainian (`ukr`)	✓
Urdu (`urd`)	✓	✓				✓
^[a] The /morphology endpoint supports only Serbian text written in Latin script. However, by default, it identifies Serbian only when the text is written in Cyrillic script. To take advantage of the morphological analysis feature for Serbian, you must therefore explicitly include the language code `srp` in your request.
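
The sketch below shows one way a request body with an explicit language code might be assembled. The `build_morphology_request` helper and the JSON field names `content` and `language` are assumptions made for illustration; confirm the actual request schema in the API reference before relying on it.

```python
import json

def build_morphology_request(text, language=None):
    """Return a JSON request body, adding the language code only if given."""
    # "content" and "language" are assumed field names, not confirmed here.
    payload = {"content": text}
    if language is not None:
        payload["language"] = language
    return json.dumps(payload, ensure_ascii=False)

# Latin-script Serbian text with the language code set explicitly.
body = build_morphology_request("Dobar dan, kako ste?", language="srp")
```

Setting the code explicitly avoids depending on the default identification behavior described in the note above.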
