Tech Specs

Application Developer's Guide - Babel Street Match

Application Developer's Guide - Babel Street Match for Elasticsearch

Application Developer's Guide - Babel Street Match for OpenSearch

Match Evaluation and Configuration Guide

Try the demo: demo.babelstreet.com/match-demo

Try Analytics Cloud: developer.babelstreet.com

Key features

Cross-lingual name matching
Intuitive match confidence score from 0 to 1 to enable automation of workflow
Explainable AI shows how each match score is calculated
Intelligent address and date matching
Flexible and extensive configurations to optimize accuracy on your data
See the impact of configuration changes in real time through the administrative interface, Match Studio
Supports matching of just a name or a name plus any number of identifying attributes (address, date of birth, ID number, etc.)

Languages

Match matches names within the following languages and scripts.

Language (ISO 639-3)	Scripts (ISO 15924)	Real world ID dictionary
Arabic (ara)	Arabic (Arab)	✓
Burmese (mya)	Burmese (Mymr)	✓
Chinese (zho)^[1]	Han (Hanzi) (Hani), Han (Simplified variant) (Hans), Han (Traditional variant) (Hant)	✓
English (eng)	Latin (Latn)	✓
French (fra)	Latin (Latn)	✓
German (deu)	Latin (Latn)	✓
Greek (ell)	Greek (Grek)	✓
Hebrew (heb)	Hebrew (Hebr)	✓
Hindi (hin)	Devanagari (Deva)	✓
Hungarian (hun)	Latin (Latn)	✓
Italian (ita)	Latin (Latn)	✓
Japanese (jpn)	Han (Kanji) (Hani), Hiragana (Hira), Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt), Japanese (alias for Han + Hiragana + Katakana) (Jpan), Katakana (Kana)	✓
Khmer (khm)	Khmer (Khmr)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang), Han (Hanja) (Hani), Korean (alias for Hangul + Han) (Kore)	✓
Malay (zsm)	Latin (Latn)
Pashto (pus)	Arabic (Arab)
Persian (fas) ^[2]	Arabic (Arab)
Persian, Afghan (prs)	Arabic (Arab)
Persian, Iranian (pes)	Arabic (Arab)
Portuguese (por)	Latin (Latn)	✓
Russian (rus)	Cyrillic (Cyrl)	✓
Spanish (spa)	Latin (Latn)	✓
Thai (tha)	Thai (Thai)	✓
Turkish (tur)	Latin (Latn)
Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	Latin (Latn)	✓
^[1]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[2]Persian is the macro language that includes Afghan Persian (prs) and Iranian Persian (pes).

This table identifies the range of cross-language matching that Match fully supports.

Query Domain	Index Domain / Match Domain
Language (ISO 639-3)	Language (ISO 639-3)	Scripts (ISO 15924)
Arabic (ara)	Arabic (ara)	Arabic (Arab)
Arabic (ara)	English (eng)	Latin (Latn)
Burmese (mya)	Burmese (mya)	Burmese (Mymr)
Burmese (mya)	English (eng)	Latin (Latn)
Chinese (zho)^[1]	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
English (eng)	Arabic (ara)	Arabic (Arab)
	Burmese (mya)	Burmese (Mymr)
	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	French (fra)	Latin (Latn)
	German (deu)	Latin (Latn)
	Greek (ell)	Greek (Grek)
	Hebrew (heb)	Hebrew (Hebr)
	Hindi (hin)	Devanagari (Deva)
	Hungarian (hun)	Latin (Latn)
	Italian (ita)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Khmer (khm)	Khmer (Khmr)
	Korean (kor)	(Hani), (Hang), (Kore)
	Malay (zsm)	Latin (Latn)
	Pashto (pus)	Arabic (Arab)
	Persian (fas)	Arabic (Arab)
	Persian, Afghan (prs)	Arabic (Arab)
	Persian, Iranian (pes)	Arabic (Arab)
	Portuguese (por)	Latin (Latn)
	Russian (rus)	Cyrillic (Cyrl)
	Spanish (spa)	Latin (Latn)
	Thai (tha)	Thai (Thai)
	Urdu (urd)	Arabic (Arab)
	Turkish (tur)	Latin (Latn)
	Vietnamese (vie)	Latin (Latn)
French (fra)	English (eng)	Latin (Latn)
French (fra)	French (fra)	Latin (Latn)
German (deu)	English (eng)	Latin (Latn)
German (deu)	German (deu)	Latin (Latn)
Greek (ell)	English (eng)	Latin (Latn)
Greek (ell)	Greek (ell)	Greek (Grek)
Hebrew (heb)	English (eng)	Latin (Latn)
Hebrew (heb)	Hebrew (heb)	Hebrew (Hebr)
Hindi (hin)	English (eng)	Latin (Latn)
Hindi (hin)	Hindi (hin)	Devanagari (Deva)
Hungarian (hun)	English (eng)	Latin (Latn)
Hungarian (hun)	Hungarian (hun)	Latin (Latn)
Italian (ita)	English (eng)	Latin (Latn)
Italian (ita)	Italian (ita)	Latin (Latn)
Japanese (jpn)	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Khmer (khm)	English (eng)	Latin (Latn)
Khmer (khm)	Khmer (khm)	Khmer (Khmr)
Korean (kor)	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Malay (zsm)	English (eng)	Latin (Latn)
Malay (zsm)	Malay (zsm)	Latin (Latn)
Pashto (pus)	English (eng)	Latin (Latn)
Pashto (pus)	Pashto (pus)	Arabic (Arab)
Persian^[2] (fas)	English (eng)	Latin (Latn)
Persian^[2] (fas)	Persian (fas)	Arabic (Arab)
Persian, Afghan (prs)	Afghan Persian (prs)	Arabic (Arab)
Persian, Afghan (prs)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	Iranian Persian (pes)	Arabic (Arab)
Portuguese (por)	English (eng)	Latin (Latn)
Portuguese (por)	Portuguese (por)	Latin (Latn)
Russian (rus)	English (eng)	Latin (Latn)
Russian (rus)	Russian (rus)	Cyrillic (Cyrl)
Spanish (spa)	English (eng)	Latin (Latn)
Spanish (spa)	Spanish (spa)	Latin (Latn)
Thai (tha)	English (eng)	Latin (Latn)
Thai (tha)	Thai (tha)	Thai (Thai)
Turkish (tur)	English (eng)	Latin (Latn)
Turkish (tur)	Turkish (tur)	Latin (Latn)
Urdu (urd)	English (eng)	Latin (Latn)
Urdu (urd)	Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	English (eng)	Latin (Latn)
Vietnamese (vie)	Vietnamese (vie)	Latin (Latn)
^[1]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[2]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").

Name variations

These are some of the many name variations that Name Match considers in every name comparison.

Example 1. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 2. Integrations

Elasticsearch
OpenSearch
Solr

Example 3. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 4. Sample output

{
  "translation": "Mu'ammar Muhammad Abu-Minyar al-Qadhaf",
  "targetLanguage": "eng",
  "targetScript": "Latn",
  "targetScheme": "IC",
  "confidence": 0.06856099342585828
}

Name Translator

Translates names consistently according to transliteration standards using knowledge of language-specific name conventions.

Documentation and Resources

Application Developer's Guide - Babel Street Match

Try the demo: demo.babelstreet.com/match-demo

Try Analytics Cloud: developer.babelstreet.com

Key features

Transliterates names consistently according to user-selected transliteration standards.
Recognizes the language of origin of a name to produce the “conventional spelling” of well-known names when possible (for example, translating the Arabic جورج دبليو بوش to “George Bush” instead of transliterating it to “Jurj Bush”).
Recognizes when to translate words (such as titles) instead of transliterating.

Arabic translation sample

Example of Arabic name translation from Name Translator:

Transliteration type	Input	Output
Person name – Arabic-origin	ابو يوسف يعقوب‎‎	Abu-Yusif Ya’qub
Person name – English-origin	رذرفورد بي هايز	Rutherford B. Hayes
Place name – Arabic-origin	باقة الشرقية	Baqah al-Sharqiyyah
Organization Acronym – English-origin	بي بي سي	B.B.C.

Scripts and transliterations

Rosette Name Translator translates names between these writing systems and transliteration standards.

Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...	Language(s) of Origin
Language (ISO639-3)	Script (ISO15924)	Language (ISO639-3)	Script (ISO15924)	Transliteration Scheme(s) (name),...	Language(s) of Origin
Afghan Persian (prs)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)	prs
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)	ara, eng
Burmese (mya)	Burmese (Mymr)	English (eng)	Latin (Latn)	Folk (folk), MLCTS (mlcts)	mya
Chinese (zho)	Han (Hanzi) (Hani)	English (eng)	Latin (Latn)	BGN (bgn)^[1], IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[2], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[3]	zho, eng
Chinese (zho)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	BGN (bgn)^[1], IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[2], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[3]	zho, eng
Chinese (zho)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	BGN (bgn)^[1], IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[2], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[3]	zho, eng
English (eng)	Latin (Latn)	Afghan Persian (prs)	Arabic (Arab)	Folk (folk)	prs
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	Basis (basis), BGN (bgn), Buckwalter (buckwalter), Folk (folk), SATTS (satts)	ara
English (eng)	Latin (Latn)	Chinese (zho)	Han (Hanzi, Kanji, Hanja) (Hani)	Chinese Telegraph Code (ctc), Folk (folk), Jyutping (LSHK)^[3]	zho, eng
English (eng)	Latin (Latn)	English (eng)	Latin (Latn)	BGN (bgn), Basis (basis), IC (ic)^[4]	ara
English (eng)	Latin (Latn)	English (eng)	Latin (Latn)	Undiacritized BGN (und_bgn)	eng, ara, pus, urd, prs, pes, fas, rus, zho, kor
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	BGN (bgn)	pes
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	Folk (folk)	pes
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	BGN (bgn)	kor
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Folk (folk)	kor, eng
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	McCune-Reischauer (mcr)	kor
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Revised Romanization of Korean (moct)	kor
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	BGN (bgn)	pus
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	Folk (folk)	pus
English (eng)	Latin (Latn)	Persian (fas)	Arabic (Arab)	BGN (bgn)	fas
English (eng)	Latin (Latn)	Russian (rus)	Cyrillic (Cyrl)	Folk (folk)	rus, eng
English (eng)	Latin (Latn)	Urdu (urd)	Arabic (Arab)	BGN (bgn)	urd
Greek (ell)	Greek (Grek)	English (eng)	Latin (Latn)	ISO 843:1997 (iso843_1997), ICU (icu)	eng, ell
Hebrew (heb)	Hebrew (Hebr)	English (eng)	Latin (Latn)	ISO 259-2:1994 (iso259_2_1994), Folk (folk), ICU (icu)	heb, eng
Hindi (hin)	Devanagari (Nagari) (Deva)	English (eng)	Latin (Latn)	IC (ic)	hin
Iranian Persian (pes)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)	pes
Japanese (jpn)	Han (Kanji) (Hani)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Hiragana (Hira)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng, zho, kor
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng
Japanese (jpn)	Katakana (Kana)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng
Khmer (khm)	Khmer (Khmr)	English (eng)	Latin (Latn)	Folk (folk)	khm, eng
Korean (kor)	Han (Hanja) (Hani)	English (eng)	Latin (Latn)	BGN (bgn)^[5], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	English (eng)	Latin (Latn)	BGN (bgn)^[5], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	English (eng)	Latin (Latn)	BGN (bgn)^[5], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)	pus
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	IC (ic)	pus, prs
Persian (fas)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)	fas
Russian (rus)	Cyrillic (Cyrl)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn)	rus, eng
Thai (tha)	Thai (Thai)	English (eng)	Latin (Latn)	ICU (icu), ISO 11940-2 (iso_11940_2), ISO 11940-2:2007 (iso11940_2_2007)	eng, tha
Urdu (urd)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn)^[6], IC (ic), Undiacritized BGN (und_bgn), Folk (folk)	urd
^[1]For Chinese, BGN uses the Pinyin transliteration scheme.
^[2]Pinyin (hypy) is the only transliteration scheme used for name matching.
^[3]For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.
^[4]Standardizes names of Arabic origin in Latin script to conform to the target transliteration scheme.
^[5]For Korean, BGN uses the McCune-Reischauer transliteration scheme.
^[6]For Urdu, we implemented a BGN transliteration scheme based on an unofficial specification prior to the specification being officially adopted.

Example 5. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 6. Integrations

Elasticsearch
Solr

Example 7. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 8. Sample output

{
  "name1": {
    "text": "Влади́мир Влади́мирович Пу́тин",
    "language": "rus",
    "entityType": "PERSON"
  },
  "name2": {
    "text": "Vladimir Putin",
    "language": "eng",
    "entityType": "PERSON"
  }
}

{
  "result": {
    "score": 0.9486632809417912
  }
}
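A downstream workflow typically compares the returned score against a cut-off. The following is a minimal Python sketch of that step, not part of any Babel Street binding: it parses the response body shown above and applies a threshold. The `MATCH_THRESHOLD` value and the `is_match` helper are hypothetical and only illustrate the idea.

```python
import json

# Response body copied from the sample output above.
response_text = '{"result": {"score": 0.9486632809417912}}'

# Hypothetical cut-off; choose a value by evaluating on your own data.
MATCH_THRESHOLD = 0.80


def is_match(response_body: str, threshold: float = MATCH_THRESHOLD) -> bool:
    """Return True when the similarity score is above the threshold."""
    score = json.loads(response_body)["result"]["score"]
    return score > threshold


print(is_match(response_text))
```

Raising the threshold trades recall for precision, so the value is best chosen against labeled examples rather than fixed in advance.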

Match Studio

Match Studio is a user-friendly administrative interface that helps Name Match users try out different configurations and see their impact on match behavior and scores in real time.

Documentation and Resources

Match Studio User Guide

Video Tutorials

Key features

Analyze precision and recall to help users choose a match threshold, the score above which two names are considered “a match”
User-friendly configuration testing in a GUI environment
Compare match results using differently weighted fields when comparing records
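To make the threshold analysis concrete, here is a small Python sketch of how precision and recall can be computed for candidate thresholds. It is independent of Match Studio; the `precision_recall` function and the labeled pairs are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical evaluation data: (match score, is the pair truly the same name?)
labeled_pairs: List[Tuple[float, bool]] = [
    (0.95, True),
    (0.88, True),
    (0.82, False),
    (0.74, True),
    (0.60, False),
    (0.41, False),
]


def precision_recall(pairs: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Treat every pair scoring above the threshold as a predicted match."""
    tp = sum(1 for score, truth in pairs if score > threshold and truth)
    fp = sum(1 for score, truth in pairs if score > threshold and not truth)
    fn = sum(1 for score, truth in pairs if score <= threshold and truth)
    # Define the ratio as 1.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


for candidate in (0.70, 0.80, 0.90):
    p, r = precision_recall(labeled_pairs, candidate)
    print(f"threshold={candidate:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data a lower threshold recovers every true match at the cost of a false positive, while a higher one removes false positives but misses true matches.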

Languages

Match Studio matches names within the following languages and scripts.

Language (ISO 639-3)	Scripts (ISO 15924)	Real world ID dictionary
Arabic (ara)	Arabic (Arab)	✓
Burmese (mya)	Burmese (Mymr)	✓
Chinese (zho)^[1]	Han (Hanzi) (Hani), Han (Simplified variant) (Hans), Han (Traditional variant) (Hant)	✓
English (eng)	Latin (Latn)	✓
French (fra)	Latin (Latn)	✓
German (deu)	Latin (Latn)	✓
Greek (ell)	Greek (Grek)	✓
Hebrew (heb)	Hebrew (Hebr)	✓
Hindi (hin)	Devanagari (Deva)	✓
Hungarian (hun)	Latin (Latn)	✓
Italian (ita)	Latin (Latn)	✓
Japanese (jpn)	Han (Kanji) (Hani), Hiragana (Hira), Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt), Japanese (alias for Han + Hiragana + Katakana) (Jpan), Katakana (Kana)	✓
Khmer (khm)	Khmer (Khmr)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang), Han (Hanja) (Hani), Korean (alias for Hangul + Han) (Kore)	✓
Malay (zsm)	Latin (Latn)
Pashto (pus)	Arabic (Arab)
Persian (fas) ^[2]	Arabic (Arab)
Persian, Afghan (prs)	Arabic (Arab)
Persian, Iranian (pes)	Arabic (Arab)
Portuguese (por)	Latin (Latn)	✓
Russian (rus)	Cyrillic (Cyrl)	✓
Spanish (spa)	Latin (Latn)	✓
Thai (tha)	Thai (Thai)	✓
Turkish (tur)	Latin (Latn)
Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	Latin (Latn)	✓
^[1]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[2]Persian is the macro language that includes Afghan Persian (prs) and Iranian Persian (pes).

This table identifies the range of cross-language matching that Match fully supports.

Query Domain	Index Domain / Match Domain
Language (ISO 639-3)	Language (ISO 639-3)	Scripts (ISO 15924)
Arabic (ara)	Arabic (ara)	Arabic (Arab)
Arabic (ara)	English (eng)	Latin (Latn)
Burmese (mya)	Burmese (mya)	Burmese (Mymr)
Burmese (mya)	English (eng)	Latin (Latn)
Chinese (zho)^[1]	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
English (eng)	Arabic (ara)	Arabic (Arab)
	Burmese (mya)	Burmese (Mymr)
	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	French (fra)	Latin (Latn)
	German (deu)	Latin (Latn)
	Greek (ell)	Greek (Grek)
	Hebrew (heb)	Hebrew (Hebr)
	Hindi (hin)	Devanagari (Deva)
	Hungarian (hun)	Latin (Latn)
	Italian (ita)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Khmer (khm)	Khmer (Khmr)
	Korean (kor)	(Hani), (Hang), (Kore)
	Malay (zsm)	Latin (Latn)
	Pashto (pus)	Arabic (Arab)
	Persian (fas)	Arabic (Arab)
	Persian, Afghan (prs)	Arabic (Arab)
	Persian, Iranian (pes)	Arabic (Arab)
	Portuguese (por)	Latin (Latn)
	Russian (rus)	Cyrillic (Cyrl)
	Spanish (spa)	Latin (Latn)
	Thai (tha)	Thai (Thai)
	Urdu (urd)	Arabic (Arab)
	Turkish (tur)	Latin (Latn)
	Vietnamese (vie)	Latin (Latn)
French (fra)	English (eng)	Latin (Latn)
French (fra)	French (fra)	Latin (Latn)
German (deu)	English (eng)	Latin (Latn)
German (deu)	German (deu)	Latin (Latn)
Greek (ell)	English (eng)	Latin (Latn)
Greek (ell)	Greek (ell)	Greek (Grek)
Hebrew (heb)	English (eng)	Latin (Latn)
Hebrew (heb)	Hebrew (heb)	Hebrew (Hebr)
Hindi (hin)	English (eng)	Latin (Latn)
Hindi (hin)	Hindi (hin)	Devanagari (Deva)
Hungarian (hun)	English (eng)	Latin (Latn)
Hungarian (hun)	Hungarian (hun)	Latin (Latn)
Italian (ita)	English (eng)	Latin (Latn)
Italian (ita)	Italian (ita)	Latin (Latn)
Japanese (jpn)	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Khmer (khm)	English (eng)	Latin (Latn)
Khmer (khm)	Khmer (khm)	Khmer (Khmr)
Korean (kor)	Chinese (zho)^[1]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Malay (zsm)	English (eng)	Latin (Latn)
Malay (zsm)	Malay (zsm)	Latin (Latn)
Pashto (pus)	English (eng)	Latin (Latn)
Pashto (pus)	Pashto (pus)	Arabic (Arab)
Persian^[2] (fas)	English (eng)	Latin (Latn)
Persian^[2] (fas)	Persian (fas)	Arabic (Arab)
Persian, Afghan (prs)	Afghan Persian (prs)	Arabic (Arab)
Persian, Afghan (prs)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	Iranian Persian (pes)	Arabic (Arab)
Portuguese (por)	English (eng)	Latin (Latn)
Portuguese (por)	Portuguese (por)	Latin (Latn)
Russian (rus)	English (eng)	Latin (Latn)
Russian (rus)	Russian (rus)	Cyrillic (Cyrl)
Spanish (spa)	English (eng)	Latin (Latn)
Spanish (spa)	Spanish (spa)	Latin (Latn)
Thai (tha)	English (eng)	Latin (Latn)
Thai (tha)	Thai (tha)	Thai (Thai)
Turkish (tur)	English (eng)	Latin (Latn)
Turkish (tur)	Turkish (tur)	Latin (Latn)
Urdu (urd)	English (eng)	Latin (Latn)
Urdu (urd)	Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	English (eng)	Latin (Latn)
Vietnamese (vie)	Vietnamese (vie)	Latin (Latn)
^[1]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[2]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").

Example 9. Availability

Windows
macOS
Docker
Linux

Text Analytics

Entity Extractor

Documentation and Resources

Try the demo: demo.babelstreet.com/text-analytics-demo

Try Analytics Cloud: developer.babelstreet.com

Fast, accurate named entity recognition using an ensemble of algorithms – statistical models and deep learning models (AI), pattern matching (regular expressions), and entity lists (gazetteers) for high accuracy across languages. Internally, an adjudication module scores and remediates conflicting results between the different processors.

Linking to knowledge bases enables Babel Street Analytics to distinguish between similarly named entities, such as Neil Armstrong (astronaut) and Neil Armstrong (hockey referee).

Key features

Extraction of person, location, organization, and other entity types.
Entity linking to Wikidata by default, but customizable to other knowledge bases.
Coreference resolution – linking entities and their pronoun mentions.
Salience scoring – highlighting entities most relevant to document content.

Languages

Entity Extractor supports the languages listed below.

Language	Code
Arabic	`ara`
Chinese, Script-insensitive	`zho`
Chinese, Simplified	`zhs`
Chinese, Traditional	`zht`
Dutch	`nld`
English	`eng`
English Uppercase	`uen`
French	`fra`
German	`deu`
Hebrew	`heb`
Hungarian	`hun`
Indonesian	`ind`
Italian	`ita`
Japanese	`jpn`
Korean	`kor`
Malay, Standard	`zsm`
Persian (Iranian Persian, Afghan Persian)	`fas`
Portuguese	`por`
Pashto	`pus`
Russian	`rus`
Spanish	`spa`
Swedish	`swe`
Tagalog	`tgl`
Urdu	`urd`
Vietnamese	`vie`
Generic, Cross-Language	`xxx`

Entity types

Entity Extractor is pre-trained to extract the following entity types:

Location
Organization
Person
Title
Nationality
Religion
Credit Card
Distance
Email
Latitude/Longitude
Money
Currency
ID number
Phone number
URL
UTM
Date
Time

In addition to the entity types above, Analytics recognizes over 450 sub-entity types and will link to a WikiData QID and DBpedia parse tree when they are available. As an example: “Ibuprofen” will be tagged as “SUBSTANCE”, linked to the WikiData ID: Q186969, and assigned the DBpedia tree “ChemicalSubstance/Drug”.

Linked knowledge bases

By default, Entity Extractor is pre-trained to link entities to entries in these knowledge bases, which enables it to distinguish between similarly named entities by examining the context in which the entity appears. Users can also specify their own knowledge bases for linking.

WikiData
DBpedia

Example 10. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 11. Integrations

Elasticsearch
Solr

Example 12. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 13. Sample output

{
  "entities": [
    {
      "type": "ORGANIZATION",
      "mention": "Securities and Exchange Commission",
      "normalized": "Securities and Exchange Commission",
      "count": 3,
      "mentionOffsets": [
        {
          "startOffset": 4,
          "endOffset": 38
        },
        {
          "startOffset": 166,
          "endOffset": 169
        },
        {
          "startOffset": 536,
          "endOffset": 539
        }
      ],
      "entityId": "Q953944",
      "confidence": 0.67070782,
      "linkingConfidence": 0.27190905,
      "dbpediaType": "Agent/Organisation/GovernmentAgency"
    },
    {
      "type": "PERSON",
      "mention": "Bridget Fitzpatrick",
      "normalized": "Bridget Fitzpatrick",
      "count": 2,
      "mentionOffsets": [
        {
          "startOffset": 99,
          "endOffset": 118
        },
        {
          "startOffset": 287,
          "endOffset": 298
        }
      ],
      "entityId": "T1",
      "confidence": 0.92063326
    },
    {
      "type": "PERSON",
      "mention": "David Gottesman",
      "normalized": "David Gottesman",
      "count": 2,
      "mentionOffsets": [
        {
          "startOffset": 174,
          "endOffset": 189
        },
        {
          "startOffset": 307,
          "endOffset": 316
        }
      ],
      "entityId": "Q5234268",
      "confidence": 0.92488831,
      "linkingConfidence": 0.47211223,
      "dbpediaType": "Agent/Person"
    },
    {
      "type": "TITLE",
      "mention": "Chief Litigation Counsel",
      "normalized": "Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 134,
          "endOffset": 158
        }
      ],
      "entityId": "T2",
      "confidence": 0.3306601
    },
    {
      "type": "TITLE",
      "mention": "Deputy Chief Litigation Counsel",
      "normalized": "Deputy Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 229,
          "endOffset": 260
        }
      ],
      "entityId": "T5",
      "confidence": 0.81287289
    },
    {
      "type": "TEMPORAL:DATE",
      "mention": "December 2016",
      "normalized": "December 2016",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 268,
          "endOffset": 281
        }
      ],
      "entityId": "T6"
    },
    {
      "type": "TITLE",
      "mention": "Ms.",
      "normalized": "Ms.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 283,
          "endOffset": 286
        }
      ],
      "entityId": "T7",
      "confidence": 0.76600134
    },
    {
      "type": "TITLE",
      "mention": "Mr.",
      "normalized": "Mr.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 303,
          "endOffset": 306
        }
      ],
      "entityId": "T9",
      "confidence": 0.72353458
    },
    {
      "type": "TITLE",
      "mention": "Co-Acting Chief Litigation Counsel",
      "normalized": "Co-Acting Chief Litigation Counsel",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 332,
          "endOffset": 366
        }
      ],
      "entityId": "T11",
      "confidence": 0.03582656
    },
    {
      "type": "LOCATION",
      "mention": "Washington D.C.",
      "normalized": "Washington D.C.",
      "count": 1,
      "mentionOffsets": [
        {
          "startOffset": 460,
          "endOffset": 475
        }
      ],
      "entityId": "Q61",
      "linkingConfidence": 0.66086622,
      "dbpediaType": "Place/PopulatedPlace/Settlement"
    }
  ]
}
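As a rough illustration of consuming this output, the Python sketch below groups mentions by entity type and drops low-confidence results. It uses a trimmed copy of the sample above. The `MIN_CONFIDENCE` value and the `group_confident_mentions` helper are hypothetical, and keeping entities that carry no `confidence` field (as some entries in the sample do not) is an assumption made for this example.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample output above (only the fields used here).
sample = json.loads("""
{
  "entities": [
    {"type": "ORGANIZATION", "mention": "Securities and Exchange Commission",
     "entityId": "Q953944", "confidence": 0.67070782},
    {"type": "PERSON", "mention": "Bridget Fitzpatrick",
     "entityId": "T1", "confidence": 0.92063326},
    {"type": "TITLE", "mention": "Co-Acting Chief Litigation Counsel",
     "entityId": "T11", "confidence": 0.03582656},
    {"type": "LOCATION", "mention": "Washington D.C.",
     "entityId": "Q61"}
  ]
}
""")

# Hypothetical cut-off for this illustration.
MIN_CONFIDENCE = 0.5


def group_confident_mentions(doc: dict, min_confidence: float = MIN_CONFIDENCE) -> dict:
    """Group mentions by entity type, skipping entities below the cut-off."""
    grouped = defaultdict(list)
    for entity in doc.get("entities", []):
        confidence = entity.get("confidence")
        # Assumption: an entity without a confidence value is kept.
        if confidence is not None and confidence < min_confidence:
            continue
        grouped[entity["type"]].append(entity["mention"])
    return dict(grouped)


print(group_confident_mentions(sample))
```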

Event Extractor

Effectively triage the overflow of data to get alerts and key details about critical events specific to your use case when you need them, even from data in languages that your analysts don’t know.

Documentation and Resources

Try Analytics Cloud: developer.babelstreet.com

Key features

Quick training of AI models for events specific to you, in less than a day, with Model Training Suite (MTS).
Extraction of the who, what, when and where for each event – not just the source text.
True context understanding, more accurate than keyword alerting.

Languages

Event Extractor supports event extraction models for the following languages. Model Training Suite enables users to train models for events they require in as little as a day.

Arabic
Chinese
Dutch
English
German
Hungarian
Japanese
Korean
Russian

Example 14. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 15. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 16. Sample

Request

{
  "content": "I want flights from Boston to New York",
  "language": "eng",
  "options": {
    "workspaceId": "multi-1"
  }
}

Response

{
    "events": [
        {
            "eventType": "flight_booking_schema.flight_booking",
            "mentions": [
                {
                    "startOffset": 7,
                    "endOffset": 38,
                    "roles": [
                        {
                            "startOffset": 7,
                            "endOffset": 14,
                            "name": "key",
                            "id": "E1",
                            "dataSpan": "flights",
                            "obsolete": false,
                            "roleType": "flight_booking_schema.flight_booking_key",
                            "extractorName": "flight_booking_schema.flight-key-morphological"
                        },
                        {
                            "startOffset": 20,
                            "endOffset": 26,
                            "name": "origin",
                            "id": "T0",
                            "dataSpan": "Boston",
                            "obsolete": false,
                            "roleType": "generic_schema.location",
                            "extractorName": "generic_schema.location-entity"
                        },
                        {
                            "startOffset": 30,
                            "endOffset": 38,
                            "name": "destination",
                            "id": "T1",
                            "dataSpan": "New York",
                            "obsolete": false,
                            "roleType": "generic_schema.location",
                            "extractorName": "generic_schema.location-entity"
                        }
                    ]
                }
            ],
            "confidence": 0.93891401,
            "workspaceId": "multi-1"
        }
    ]
}
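
As a minimal sketch of how a client might consume this response, the Python below walks the `events` array and checks that each role's offsets index back into the request `content`. The response is abbreviated to the fields the sketch uses, and the helper name `roles_by_name` is hypothetical, not part of any Babel Street binding.

```python
import json

# Request text and an abbreviated copy of the sample response above.
content = "I want flights from Boston to New York"
response = json.loads("""
{
  "events": [
    {
      "eventType": "flight_booking_schema.flight_booking",
      "mentions": [
        {
          "startOffset": 7,
          "endOffset": 38,
          "roles": [
            {"startOffset": 7, "endOffset": 14, "name": "key", "dataSpan": "flights"},
            {"startOffset": 20, "endOffset": 26, "name": "origin", "dataSpan": "Boston"},
            {"startOffset": 30, "endOffset": 38, "name": "destination", "dataSpan": "New York"}
          ]
        }
      ],
      "confidence": 0.93891401
    }
  ]
}
""")


def roles_by_name(event):
    """Map each role name to its text span (hypothetical helper)."""
    found = {}
    for mention in event["mentions"]:
        for role in mention["roles"]:
            # In this sample, offsets are character positions with an exclusive end.
            span = content[role["startOffset"]:role["endOffset"]]
            assert span == role["dataSpan"]
            found[role["name"]] = span
    return found


for event in response["events"]:
    print(event["eventType"], round(event["confidence"], 3), roles_by_name(event))
```

Slicing the original text with the returned offsets reproduces each `dataSpan`, which is a cheap sanity check before storing or highlighting the spans.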

Model Training Suite

Model Training Suite (MTS) is a framework with a user-friendly graphical user interface for quickly creating machine learning models or adapting existing models to a particular domain or data set. Compared to traditional data annotation methods, MTS reduces the number of training documents that need to be annotated to achieve a reasonable level of accuracy, shortening the annotation process.

Documentation and Resources

Support

Key features

Model Training Suite is a framework to quickly annotate, train, and deploy NLP models. Inside, Adaptation Studio provides a user-friendly interface for annotators to annotate and for project managers to easily check cross-annotation agreement and adjudicate conflicts to produce high-quality training data.

AI-assisted data preprocessing: Instead of starting each model from scratch, MTS builds on top of existing natural language models.
Iterative model evaluation: Project managers can continually monitor the accuracy of an interim model built from the documents tagged so far, so that tagging can stop as soon as the model reaches the target accuracy or hits a point of diminishing returns.
Efficient annotation: Based on the interim model’s confidence scores on untagged documents, active learning recommends that the documents most likely to be informative be tagged first.
Computer-assisted tagging: The interim model pre-tags documents for the human annotator to accept, reject, or correct; adjusting tags is much faster than tagging from scratch.
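
The efficient-annotation idea above can be illustrated with least-confidence sampling, a common active learning strategy. This is a generic sketch, not MTS's actual selection logic (which the text does not specify); the document IDs, confidence values, and the function `next_to_annotate` are made up for illustration.

```python
def next_to_annotate(confidences, batch_size=2):
    """Return the IDs of the untagged documents the interim model is least sure about.

    `confidences` maps a document ID to the interim model's confidence (0 to 1)
    in its own prediction for that document.
    """
    # Lowest confidence first: these are assumed to be the most informative to tag.
    ranked = sorted(confidences, key=lambda doc_id: confidences[doc_id])
    return ranked[:batch_size]


# Hypothetical confidence scores from an interim model on untagged documents.
untagged = {"doc-a": 0.97, "doc-b": 0.41, "doc-c": 0.88, "doc-d": 0.55}
print(next_to_annotate(untagged))
```

Documents the model already handles confidently (here `doc-a` and `doc-c`) are deferred, so annotator time goes to the examples expected to change the model most.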

NLP models

Model Training Suite is set up to train models for the following NLP tasks. It is extensible to train models for other tasks, too.

Entity extraction
Event extraction

Languages

Model Training Suite trains custom models for entity extraction and event extraction in the following languages. Language support can be extended to any language supported by Base Linguistics.

Language	Entities model	Events model
Arabic (`ara`)	✓	✓
Chinese (`zho`)	✓	✓
Dutch (`nld`)	✓	✓
English (`eng`)	✓	✓
French (`fra`)	✓
German (`deu`)	✓	✓
Hebrew (`heb`)	✓
Hungarian (`hun`)	✓	✓
Indonesian (`ind`)	✓
Italian (`ita`)	✓
Japanese (`jpn`)	✓	✓
Korean (`kor`)	✓	✓
Malay, Standard (`zsm`)	✓
Persian (`fas`)	✓
Portuguese (`por`)	✓
Pashto (`pus`)	✓
Russian (`rus`)	✓	✓
Spanish (`spa`)	✓
Swedish (`swe`)	✓
Tagalog (`tgl`)	✓
Urdu (`urd`)	✓
Vietnamese (`vie`)	✓

Example 17. Availability

Server

Semantic Similarity

Semantic Similarity can compare the meaning of words and text within a supported language (for example, orchestra and symphony) and across supported languages (for example, king and roi). It can also find semantically similar words for a given input word. The sample output for this product shows the word “spy” and its semantically similar terms in Spanish, German, and Japanese.

Documentation and Resources

Try Analytics Cloud: developer.babelstreet.com

Supported languages

Arabic (ara)
Chinese (zho)
English (eng)
French (fra)
German (deu)
Hebrew (heb)
Hungarian (hun)
Italian (ita)
Japanese (jpn)
Korean (kor)
Korean - North (qkp)
Korean - South (qkr)
Persian (fas)
Portuguese (por)
Russian (rus)
Spanish (spa)
Tagalog (tgl)
Urdu (urd)

Example 18. Availability

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 19. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 20. Sample output (/semantics/similar)

{"content": "spy", "options": {"resultLanguages": ["spa", "deu", "jpn"]}}

{
  "similarTerms": {
    "spa": [
      {
        "term": "espía",
        "similarity": 0.61295485
      },
      {
        "term": "cia",
        "similarity": 0.46201307
      },
      {
        "term": "desertor",
        "similarity": 0.42849663
      },
      {
        "term": "cómplice",
        "similarity": 0.36646274
      },
      {
        "term": "subrepticiamente",
        "similarity": 0.36629659
      },
      {
        "term": "asesino",
        "similarity": 0.36264464
      },
      {
        "term": "misterioso",
        "similarity": 0.35466132
      },
      {
        "term": "fugitivo",
        "similarity": 0.35033143
      },
      {
        "term": "informante",
        "similarity": 0.34707013
      },
      {
        "term": "mercenario",
        "similarity": 0.34658083
      }
    ],
    "jpn": [
      {
        "term": "スパイ",
        "similarity": 0.5544399
      },
      {
        "term": "諜報",
        "similarity": 0.46903181
      },
      {
        "term": "MI6",
        "similarity": 0.46344957
      },
      {
        "term": "殺し屋",
        "similarity": 0.41098994
      },
      {
        "term": "正体",
        "similarity": 0.40109193
      },
      {
        "term": "プレデター",
        "similarity": 0.39433435
      },
      {
        "term": "レンズマン",
        "similarity": 0.3918637
      },
      {
        "term": "S.H.I.E.L.D.",
        "similarity": 0.38338536
      },
      {
        "term": "サーシャ",
        "similarity": 0.37628397
      },
      {
        "term": "黒幕",
        "similarity": 0.37256041
      }
    ],
    "deu": [
      {
        "term": "Deckname",
        "similarity": 0.51391315
      },
      {
        "term": "GRU",
        "similarity": 0.50809389
      },
      {
        "term": "Spion",
        "similarity": 0.50051737
      },
      {
        "term": "KGB",
        "similarity": 0.49981388
      },
      {
        "term": "Informant",
        "similarity": 0.48774603
      },
      {
        "term": "Geheimagent",
        "similarity": 0.48700801
      },
      {
        "term": "Geheimdienst",
        "similarity": 0.48512384
      },
      {
        "term": "Spionin",
        "similarity": 0.47224587
      },
      {
        "term": "MI6",
        "similarity": 0.46969846
      },
      {
        "term": "Decknamen",
        "similarity": 0.44730526
      }
    ]
  }
}
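
A response of this shape is straightforward to post-process. The sketch below, using a trimmed copy of the sample data, keeps only the terms at or above a similarity cutoff for each language. The cutoff of 0.5 and the function name `filter_terms` are illustrative choices, not API parameters.

```python
# Trimmed copy of the sample response: the first three terms per language.
response = {
    "similarTerms": {
        "spa": [
            {"term": "espía", "similarity": 0.61295485},
            {"term": "cia", "similarity": 0.46201307},
            {"term": "desertor", "similarity": 0.42849663},
        ],
        "jpn": [
            {"term": "スパイ", "similarity": 0.5544399},
            {"term": "諜報", "similarity": 0.46903181},
            {"term": "MI6", "similarity": 0.46344957},
        ],
        "deu": [
            {"term": "Deckname", "similarity": 0.51391315},
            {"term": "GRU", "similarity": 0.50809389},
            {"term": "Spion", "similarity": 0.50051737},
        ],
    }
}


def filter_terms(similar_terms, cutoff=0.5):
    """Keep terms whose similarity is at or above `cutoff`, per language."""
    return {
        lang: [t["term"] for t in terms if t["similarity"] >= cutoff]
        for lang, terms in similar_terms.items()
    }


print(filter_terms(response["similarTerms"]))
```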

Base Linguistics

Text analytics fundamentals to prepare your data for multilingual search and advanced NLP analysis.

Documentation and Resources

Try the demo: demo.babelstreet.com/text-analytics-demo

Try Analytics Cloud: developer.babelstreet.com

Key features

Base Linguistics outputs these morphological analyses, some of which are language-specific.

Tokenization
Sentence boundary detection
Part of speech tagging
Lemmatization
Noun decompounding
Chinese readings (Pinyin pronunciation)
Chinese script conversion (traditional <=> simplified)
Japanese readings (pronunciation)
Japanese spelling normalization (Katakana and modern/old-style Kanji)
Arabic script languages: text and token normalization, Semitic root and stem analysis

Languages and analyses

Base Linguistics outputs the following analyses for each supported language as shown in the table below.

Language (code)	Tokenization	Parts of Speech	Lemmas	Compound Components	Han Readings	Sentence Boundary
Arabic (`ara`)	✓	✓	✓			✓
Catalan (`cat`)	✓		✓			✓
Chinese (`zho`)	✓	✓	✓		✓	✓
Czech (`ces`)	✓	✓	✓			✓
Danish (`dan`)	✓		✓	✓		✓
Dutch (`nld`)	✓	✓	✓	✓		✓
English (`eng`)	✓	✓	✓			✓
Estonian (`est`)	✓		✓			✓
Finnish (`fin`)	✓
French (`fra`)	✓	✓	✓			✓
German (`deu`)	✓	✓	✓	✓		✓
Greek (`ell`)	✓	✓	✓			✓
Hebrew (`heb`)	✓	✓	✓			✓
Hungarian (`hun`)	✓	✓	✓	✓		✓
Indonesian (`ind`)	✓	✓	✓
Italian (`ita`)	✓	✓	✓			✓
Japanese (`jpn`)	✓	✓	✓		✓	✓
Korean (`kor`)	✓	✓	✓	✓		✓
Korean-North (`qkp`)	✓	✓	✓	✓		✓
Korean-South (`qkr`)	✓	✓	✓	✓		✓
Latvian (`lav`)	✓		✓			✓
Malay, Standard (`zsm`)	✓	✓	✓
Norwegian (`nor`)	✓		✓	✓		✓
Norwegian-Bokmål (`nob`)	✓		✓	✓		✓
Norwegian-Nynorsk (`nno`)	✓		✓	✓		✓
Pashto (`pus`)	✓
Persian (`fas`)	✓	✓	✓			✓
Persian-Afghan (`prs`)	✓	✓	✓			✓
Persian-Iranian (`pes`)	✓	✓	✓			✓
Polish (`pol`)	✓	✓	✓			✓
Portuguese (`por`)	✓	✓	✓			✓
Romanian (`ron`)	✓		✓			✓
Russian (`rus`)	✓	✓	✓			✓
Serbian (`srp`)^[1]	✓		✓			✓
Slovak (`slk`)	✓		✓			✓
Spanish (`spa`)	✓	✓	✓			✓
Swedish (`swe`)	✓		✓	✓		✓
Tagalog (`tgl`)	✓	✓	✓
Thai (`tha`)	✓		✓			✓
Turkish (`tur`)	✓		✓			✓
Ukrainian (`ukr`)	✓
Urdu (`urd`)	✓	✓				✓
^[1]The /morphology endpoint only supports Serbian text written in Latin script. However, by default, it only identifies Serbian text written in Cyrillic script. To take advantage of the morphological analysis feature for Serbian, you must explicitly include the language code srp in your request.
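
To illustrate the footnote, the sketch below builds a request body that names Serbian explicitly instead of relying on automatic identification. It assumes the /morphology endpoint accepts the same `content` and `language` fields that appear in other request bodies in this document; the sample sentence and the helper `morphology_request` are hypothetical.

```python
import json


def morphology_request(text, language=None):
    """Build a JSON request body; `language` is omitted when not given."""
    body = {"content": text}
    if language is not None:
        # Explicit ISO 639-3 code, so no language identification step is needed.
        body["language"] = language
    return json.dumps(body, ensure_ascii=False)


# Serbian written in Latin script, tagged explicitly as `srp`.
print(morphology_request("Ovo je rečenica na srpskom jeziku.", language="srp"))
```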

Example 21. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 22. Integrations

Elasticsearch
Solr

Example 23. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 24. Sample output

{
"tokens": [
"The",
"fact",
"is",
"that",
"the",
"geese",
"just",
"went",
"back",
"to",
"get",
"a",
"rest",
"and",
"I",
"'m",
"not",
"banking",
"on",
"their",
"return",
"soon"
],
"lemmas": [
"the",
"fact",
"be",
"that",
"the",
"goose",
"just",
"go",
"back",
"to",
"get",
"a",
"rest",
"and",
"I",
"be",
"not",
"bank",
"on",
"they",
"return",
"soon"
]
}
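
Because `tokens` and `lemmas` are parallel arrays in the sample above, they can be paired by position. The sketch below uses a shortened copy of the sample and lists only the tokens whose lemma differs from the surface form; the variable names are illustrative.

```python
# First eight entries of the sample output above.
output = {
    "tokens": ["The", "fact", "is", "that", "the", "geese", "just", "went"],
    "lemmas": ["the", "fact", "be", "that", "the", "goose", "just", "go"],
}

# Pair each token with the lemma at the same index.
pairs = list(zip(output["tokens"], output["lemmas"]))

# Keep pairs where lemmatization changed more than letter case.
changed = [(tok, lem) for tok, lem in pairs if tok.lower() != lem.lower()]
print(changed)
```

Indexing on lemmas rather than surface forms is what lets a search for "goose" also retrieve documents that contain "geese".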

Language Identifier

Instantly identify the language of whole documents or multiple language regions within each document.

Documentation and Resources

Try the demo: demo.babelstreet.com/text-analytics-demo

Try Analytics Cloud: developer.babelstreet.com

Key features

Detects North Korean vs. South Korean text, and transliterated Arabic, Kurdish, Persian, Pashto, and Urdu
Detects language of short strings (3 words to a sentence)
Detects sections of text in different languages within a single multilingual document

Languages

Language Identifier detects text in the following languages and encodings.

Language (ISO 639-3)	Script (ISO 15924)	Short-String Detection	Encodings
Albanian (sqi)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Arabic (ara)	Arabic (Arab)	✓	ISO-8859-6, UTF-16BE, UTF-16LE, UTF-8, windows-1256, windows-720
Arabic (ara)	Latin (Latn)		ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252, windows-1256
Bengali (ben)	Bengali (Beng)	✓	ISCII-Bengali, UTF-16BE, UTF-16LE, UTF-8
Bulgarian (bul)	Cyrillic (Cyrl)	✓	ISO-8859-5, KOI8-R, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Catalan (cat)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Chinese (zho)	Han, Simplified (Hans)	✓^[1]	GB18030, GB2312, HZ-GB-2312, ISO-2022-CN, UTF-16BE, UTF-16LE, UTF-8
Chinese (zho)	Han, Traditional (Hant)	✓^[1]	Big5, UTF-16BE, UTF-16LE, UTF-8
Croatian (hrv)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Czech (ces)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Danish (dan)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Dutch (nld)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
English Uppercase (uen^[2])	Latin (Latn)		ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
English (eng)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Estonian (est)	Latin (Latn)	✓	ISO-8859-13, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1257
Finnish (fin)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
French (fra)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
German (deu)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Greek (ell)	Greek (Grek)	✓	ISO-8859-7, UTF-16BE, UTF-16LE, UTF-8, windows-1253
Gujarati (guj)	Gujarati (Gujr)	✓	ISCII-Gujarati, UTF-16BE, UTF-16LE, UTF-8
Hebrew (heb)	Hebrew (Hebr)	✓	ISO-8859-8, UTF-16BE, UTF-16LE, UTF-8, windows-1255
Hindi (hin)	Devanagari (Deva)	✓	ISCII-Devanagari, UTF-16BE, UTF-16LE, UTF-8
Hungarian (hun)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Icelandic (isl)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Indonesian (ind)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Italian (ita)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Japanese (jpn)	Japanese [Han + Hiragana + Katakana] (Jpan)	✓	EUC-JP, ISO-2022-JP, Shift_JIS, Shift_JIS-2004, UTF-16BE, UTF-16LE, UTF-8
Japanese (jpn)	Katakana (Kana)	✓	EUC-JP, Shift_JIS, Shift_JIS-2004, UTF-16BE, UTF-16LE, UTF-8
Kannada (kan)	Kannada (Knda)	✓	ISCII-Kannada, UTF-16BE, UTF-16LE, UTF-8
Korean (kor^[3]^[4])	Korean [Hangul + Han] (Kore)	✓	EUC-KR, ISO-2022-KR, UTF-16BE, UTF-16LE, UTF-8
Kurdish (kur)	Arabic (Arab)	✓	UTF-16BE, UTF-16LE, UTF-8, windows-1256
Kurdish (kur)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252, windows-1256
Latvian (lav)	Latin (Latn)	✓	ISO-8859-13, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1257
Lithuanian (lit)	Latin (Latn)	✓	ISO-8859-13, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1257
Macedonian (mkd)	Cyrillic (Cyrl)	✓	ISO-8859-5, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Malay, Standard (zsm)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Malayalam (mal)	Malayalam (Mlym)	✓	ISCII-Malayalam, UTF-16BE, UTF-16LE, UTF-8
North Korean (qkp^[4])	Korean [Hangul + Han] (Kore)	✓	UTF-8
Norwegian (nor)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Persian (fas)	Arabic (Arab)	✓	UTF-16BE, UTF-16LE, UTF-8, windows-1256
Persian (fas)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252, windows-1256
Polish (pol)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Portuguese (por)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Pashto (pus)	Arabic (Arab)	✓	UTF-16BE, UTF-16LE, UTF-8, windows-1256
Pashto (pus)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252, windows-1256
Romanian (ron)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Russian (rus)	Cyrillic (Cyrl)	✓	IBM866, ISO-8859-5, KOI8-R, UTF-16BE, UTF-16LE, UTF-8, windows-1251, x-mac-cyrillic
Serbian (srp)	Cyrillic (Cyrl)	✓	ISO-8859-5, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Serbian (srp)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Slovak (slk)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Slovenian (slv)	Latin (Latn)	✓	ISO-8859-2, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1250
Somali (som)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
South Korean (qkr^[4])	Korean [Hangul + Han] (Kore)	✓	UTF-8
Spanish (spa)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Swedish (swe)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Tagalog (tgl)	Latin (Latn)	✓	ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252
Tamil (tam)	Tamil (Taml)	✓	ISCII-Tamil, UTF-16BE, UTF-16LE, UTF-8
Telugu (tel)	Telugu (Telu)	✓	ISCII-Telugu, UTF-16BE, UTF-16LE, UTF-8
Thai (tha)	Thai (Thai)	✓	UTF-16BE, UTF-16LE, UTF-8, windows-874
Turkish (tur)	Latin (Latn)	✓	ISO-8859-9, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1254
Ukrainian (ukr)	Cyrillic (Cyrl)	✓	ISO-8859-5, KOI8-R, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Urdu (urd)	Arabic (Arab)	✓	UTF-16BE, UTF-16LE, UTF-8, windows-1256
Urdu (urd)	Latin (Latn)		ISO-8859-1, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1252, windows-1256
Uzbek (uzb)	Cyrillic (Cyrl)	✓	ISO-8859-5, KOI8-R, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Uzbek (uzb)	Latin (Latn)	✓	US-ASCII, UTF-16BE, UTF-16LE, UTF-8, windows-1251
Vietnamese (vie)	Latin (Latn)	✓	TCVN, US-ASCII, UTF-16BE, UTF-16LE, UTF-8, VIQR, VISCII, VNI, VPS
Unknown (xxx^[2])^[5]	[Any script]	✓
^[1]For Chinese, short-string analysis returns Han script. ^[2]Non-standard language code. ^[3]For Korean, short-string analysis returns Hangul script. ^[4]North Korean and South Korean are enabled only if `koreanDialects` is true. Korean is enabled only if `koreanDialects` is false. ^[5]Returned if analysis cannot identify a language.

Example 25. Availability

Java SDK 
For on-premises systems that need the low-latency, high-speed integration of an SDK, Java is the way to go. It has been deployed in the most demanding, high-transaction environments, including web search engines, financial compliance, and border security. 

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 26. Integrations

Elasticsearch
Solr

Example 27. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 28. Sample output

{
  "languageDetections": [
    {
      "language": "spa",
      "confidence": 0.38719602327387076
    },
    {
      "language": "eng",
      "confidence": 0.32699986625091865
    },
    {
      "language": "por",
      "confidence": 0.05569054210624943
    },
    {
      "language": "deu",
      "confidence": 0.030069489878380328
    },
    {
      "language": "swe",
      "confidence": 0.027734757034048835
    }
  ]
}
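
The detections in this sample are close: the top two candidates are about 0.06 apart. A caller might therefore want to treat such results as ambiguous. The sketch below shows one way to do that; the margin of 0.1 and the function `pick_language` are assumptions for illustration, not part of the API.

```python
detections = [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865},
    {"language": "por", "confidence": 0.05569054210624943},
    {"language": "deu", "confidence": 0.030069489878380328},
    {"language": "swe", "confidence": 0.027734757034048835},
]


def pick_language(detections, min_margin=0.1):
    """Return (language, is_ambiguous) for a list of detections."""
    ranked = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    best = ranked[0]
    runner_up = ranked[1]["confidence"] if len(ranked) > 1 else 0.0
    # Flag the result when the runner-up is within `min_margin` of the best.
    return best["language"], (best["confidence"] - runner_up) < min_margin


print(pick_language(detections))
```

An ambiguous result could be routed to multilingual region detection or to manual review instead of being trusted outright.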

Sentiment Analyzer

Detects positive, negative or neutral sentiment of a whole document or emotional hotspots in text about companies, people, and products. Sentiment Analyzer rates sentiment on a scale of 1 (positive) to –1 (negative) with 0 being neutral.

Documentation and Resources

Try the demo: demo.babelstreet.com/text-analytics-demo

Try Analytics Cloud: developer.babelstreet.com

Languages

Sentiment Analyzer identifies the emotion within text written in these languages.

Arabic (ara)
English (eng)
French (fra)
Japanese (jpn)
Persian (fas)
Spanish (spa)

Entity types

Sentiment Analyzer can identify sentiment about a particular entity. Entity-centric sentiment analysis is possible for the entity types listed below and any other type supported by Entity Extractor.

Person
Organization
Location
Nationality
Religion

Example 29. Availability

Analytics Server 
This on-premises private cloud deployment puts all the functionality of the Analytics API behind your secure firewall, and enables advanced user settings, access to custom profiles (user-specific configuration setups), and deployment of custom models. 

Hosted Services 
The SaaS version of Babel Street Analytics is rapidly implemented, low maintenance and ideal for users who wish to pay based on monthly call volume. Numerous bindings through a RESTful API are supported.

Example 30. Bindings

 Visit our GitHub pages for bindings and documentation. 

cURL
Python
PHP
Java
Ruby
C#
Node.js

Example 31. Sample Output

For the document and for each identified entity, only the highest-scoring sentiment is returned, along with a confidence value between 0 and 1. Details about each entity are also returned.

{
  "document": {
    "label": "string", 
    "confidence": number
  },
  "entities": [
  {
   "type": "string",
   "mention": "string",
   "normalized": "string",
   "count": 0,
   "mentionOffsets": [
    {
      "startOffset": number,
      "endOffset": number
    }
   ],
   "entityId": "string",
   "confidence": 0,
   "linkingConfidence": 0,   
   "sentiment": {
     "label": "string",
     "confidence": number
   }
  }
 ]
}
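
The template above can be read with a few lines of code. The sketch below fills it with made-up values (the label strings `pos` and `neg`, the entity, and all numbers are placeholders, not real output) and collects the document sentiment followed by each entity's sentiment.

```python
# Hypothetical response following the template above; all values are invented.
response = {
    "document": {"label": "pos", "confidence": 0.81},
    "entities": [
        {
            "type": "ORGANIZATION",
            "mention": "Acme Corp",
            "normalized": "Acme Corp",
            "count": 1,
            "mentionOffsets": [{"startOffset": 0, "endOffset": 9}],
            "sentiment": {"label": "neg", "confidence": 0.64},
        }
    ],
}


def summarize(response):
    """Return (label, confidence) for the document and for each entity mention."""
    doc = response["document"]
    summary = {"document": (doc["label"], doc["confidence"])}
    for entity in response["entities"]:
        sentiment = entity["sentiment"]
        summary[entity["mention"]] = (sentiment["label"], sentiment["confidence"])
    return summary


for name, (label, confidence) in summarize(response).items():
    print(f"{name}: {label} ({confidence:.2f})")
```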

Topic Extractor

Identify keywords and significant phrases in your text, as well as topics that are not explicitly named, to capture the essence of a document’s content.

Documentation and Resources