Base Linguistics
Base Linguistics provides language-specific tools to prepare your data for analysis through three endpoints:
The following table indicates the type of support that Base Linguistics provides for each supported language. The tokenizer provides normalization, tokenization, and sentence boundary detection. The analyzer provides lemma lookup (including orthographic normalization for Japanese) and, when the lookup fails, lemma guessing.
For languages with compound words, words are divided into components, which improves recall for search engines.
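To illustrate why decompounding helps recall, here is a minimal sketch of dictionary-based compound splitting. It is not the product's algorithm: the `decompound` function, the `min_len` parameter, and the tiny lexicon are all hypothetical, and the sketch ignores linking elements (such as the German "s" between components).

```python
def decompound(word, lexicon, min_len=3):
    """Split a compound into known components (illustrative only).

    `lexicon` is a hypothetical set of known lowercase words.
    Returns the lowercased word unchanged if no full split is found.
    """
    w = word.lower()

    def split(s):
        if s in lexicon:
            return [s]
        # Try the longest head first, keeping both parts at least min_len long.
        for i in range(len(s) - min_len, min_len - 1, -1):
            head, tail = s[:i], s[i:]
            if head in lexicon:
                rest = split(tail)
                if rest is not None:
                    return [head] + rest
        return None

    return split(w) or [w]


# Hypothetical lexicon for the example.
lexicon = {"hand", "schuh", "fach", "geschaeft"}
components = decompound("Handschuhfachgeschaeft", lexicon)
```

If a search index stores the components alongside the full compound, a query for "schuh" can match a document that contains only "Handschuh", which is the recall improvement described above.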
For Chinese tokens in Han script, pinyin transcriptions are returned as the Han reading. For Japanese tokens in Han script (kanji), hiragana transcriptions are returned as the Han reading.
For unknown languages (language code xxx), generic rules such as whitespace and punctuation delimitation are used to tokenize the text. The tokenizer also identifies some common acronyms and abbreviations, as well as sentence boundaries. Segmentation user dictionaries are supported for unknown languages.
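The generic rules described above can be approximated with a short sketch. This is only an illustration of whitespace and punctuation delimitation with an abbreviation list and naive sentence splitting. The function names and the `ABBREVIATIONS` set are assumptions, not the service's actual rules or data.

```python
import re

# Hypothetical abbreviation list; the real list is not documented here.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "U.S."}

# A run of word characters, or a single punctuation character.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split on whitespace, then separate punctuation from words."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep known abbreviations intact
        else:
            tokens.extend(WORD_OR_PUNCT.findall(chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenize("Dr. Smith arrived. He left!")
sentences = split_sentences(tokens)
```

Because "Dr." is kept as a single token, its period is not mistaken for a sentence boundary, so the example text yields two sentences instead of three.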