Morphology
https://analytics.babelstreet.com/rest/v1/morphology/{morphoFeature}
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_complete.curl
The morphological analysis endpoint provides language-specific tools that return the part of speech, lemma (dictionary form), compound components, and Han readings for each token in the input.
Append a morphoFeature to the morphology/ endpoint to specify which feature you want returned, or complete to return all features.
morphoFeature | Description |
---|---|
complete | Returns all results for all features available for the language of the input text. |
lemmas | Returns the lemmas or dictionary forms. |
parts-of-speech | Returns the part-of-speech (POS) tags. Each language has its own set of POS tags. |
compound-components | Decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components. This can improve recall for search engines. |
han-readings | For Chinese tokens in Han script, pinyin transcriptions are returned as the Han reading. For Japanese tokens in Han script (kanji), hiragana transcriptions are returned as the Han reading. |
Do you know the language of your input?
If you know the language of your input, include the three-letter language code in your call. This will speed up the response time.
Otherwise, the endpoint will identify the language automatically.
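As a minimal sketch of the call described above, the following Python builds (without sending) a POST request for one morphoFeature, adding the language code only when it is known. The function name build_morphology_request is hypothetical, and the X-BabelStreetAPI-Key header name is an assumption that should be checked against your account documentation.

```python
import json
import urllib.request

BASE_URL = "https://analytics.babelstreet.com/rest/v1/morphology"
FEATURES = {
    "complete",
    "lemmas",
    "parts-of-speech",
    "compound-components",
    "han-readings",
}


def build_morphology_request(feature, content, api_key, language=None):
    """Build, but do not send, a POST request for one morphology feature."""
    if feature not in FEATURES:
        raise ValueError(f"unknown morphoFeature: {feature!r}")
    body = {"content": content}
    if language is not None:
        # Three-letter language code, e.g. "eng"; omit it to let the
        # endpoint identify the language automatically.
        body["language"] = language
    return urllib.request.Request(
        f"{BASE_URL}/{feature}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the API key header name is not given in this page.
            "X-BabelStreetAPI-Key": api_key,
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request; that step is left out here so the sketch runs without a key or network access.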
Complete
https://analytics.babelstreet.com/rest/v1/morphology/complete
When you call complete, Rosette returns the lemmas, compound components, Han readings, and part-of-speech tags for the input text.
The table above shows the features supported for each input language.
Lemmas
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_lemmas.curl
https://analytics.babelstreet.com/rest/v1/morphology/lemmas
A lemma is the dictionary form of a word. Morphology determines the lemma of each word in the input based on its usage and context.
For example, saw can be used as a noun or a past tense verb.
In the sentence “The carpenter picked up the saw from the workbench,” the lemma “saw”, a noun, is returned.
However, in the sentence “The bird saw the worm in the shade of the tree,” the lemma “see”, the dictionary-form of the verb, is returned.
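The saw/see example can be sketched as a small helper that pairs each token with its lemma. It assumes the tokens and lemmas arrays in the response are parallel (aligned by index), which this page does not state explicitly. The sample dictionary is hand-written to mirror the example above and is not captured API output.

```python
def pair_tokens_with_lemmas(response):
    """Return (token, lemma) pairs from a morphology/lemmas response."""
    tokens = response.get("tokens", [])
    lemmas = response.get("lemmas", [])
    # Assumption: the two arrays are aligned by index.
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas differ in length")
    return list(zip(tokens, lemmas))


# Hand-written sample, not real API output.
sample = {
    "tokens": ["The", "bird", "saw", "the", "worm"],
    "lemmas": ["the", "bird", "see", "the", "worm"],
}
pairs = pair_tokens_with_lemmas(sample)
```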
Parts of Speech
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_parts-of-speech.curl
https://analytics.babelstreet.com/rest/v1/morphology/parts-of-speech
The morphology endpoint grammatically analyzes text to determine the role of each word within the input. The Parts of Speech feature returns a part-of-speech (POS) tag for each word, based on how the word is used in context.
For example, spoke can be classified as a noun or a verb: it is a noun in “The wheel spoke creaked” and a verb in “She spoke the truth”.
Compound Components
https://analytics.babelstreet.com/rest/v1/morphology/compound-components
With the decompounding feature, compound words are broken into sub-components and returned as individual elements. This is useful for increasing search relevancy in languages such as German and Korean.
For example, the German compound word “Rechtsschutzversicherungsgesellschaften” means “legal expenses insurance companies”.
The compound components “Recht”, “Schutz”, “Versicherung”, and “Gesellschaft” are returned.
Han Readings
https://analytics.babelstreet.com/rest/v1/morphology/han-readings
The Han readings feature provides pronunciation information for Han script in both Chinese and Japanese input text. The algorithm you select affects which readings are returned.
For Chinese tokens in Han script, by default pinyin transcriptions are returned using diacritics. Multiple possible readings may be returned for each word. If you call morphology with “美国大选中的”, it returns these Han readings: [[“měiguó”], [“dàxuǎn”], [“zhōng”, “zhòng”], [“de”, “di”, “dī”, “dí”, “dì”]]. Note that the last two tokens each contain multiple possible readings.
For Japanese tokens in Han script (kanji), by default hiragana transcriptions are returned. If you call morphology with “医療番組”, it returns these Han readings: “いりょう”, “ばんぐみ”.
To use the perceptron algorithm instead of the default algorithm, set the modelType option to perceptron.
For Chinese tokens, using the perceptron algorithm, one pinyin transcription per token, using digits, is returned. If you call with “美国大选中的”, it returns these Han readings: “Mei3-guo2”, “da4-xuan3”, “zhong1”, “de0”.
For Japanese, using the perceptron algorithm, katakana transcriptions are returned. If you call with “医療番組”, it returns these Han readings: “イリョウ”, “バングミ”.
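The two behaviours above can be sketched in Python: one helper builds a request body that optionally selects the perceptron algorithm, and another finds tokens whose default reading is ambiguous. Both function names are hypothetical. The readings list is copied from the Chinese example above, and the list-of-lists shape is taken from that example rather than from a formal schema.

```python
import json


def han_readings_body(content, language=None, perceptron=False):
    """Serialize a request body for morphology/han-readings."""
    body = {"content": content}
    if language is not None:
        body["language"] = language
    if perceptron:
        # Switch from the default algorithm to the perceptron algorithm.
        body["options"] = {"modelType": "perceptron"}
    # ensure_ascii=False keeps the Han characters readable in the payload.
    return json.dumps(body, ensure_ascii=False)


def ambiguous_positions(readings):
    """Return indexes of tokens that have more than one candidate reading."""
    return [i for i, candidates in enumerate(readings) if len(candidates) > 1]


# Default-algorithm readings for "美国大选中的", as listed above.
default_readings = [
    ["měiguó"],
    ["dàxuǎn"],
    ["zhōng", "zhòng"],
    ["de", "di", "dī", "dí", "dì"],
]
ambiguous = ambiguous_positions(default_readings)
```

Here ambiguous holds the positions of the last two tokens, the ones the text notes as having multiple possible readings.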
Query Parameters
Name | Value | Description |
---|---|---|
output | rosette | Returns the response in ADM format. |
Note
All input parameters, including the text being analyzed and any relevant options, are defined in the request body.
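The split between the query string and the request body can be illustrated with a short sketch: output goes in the URL, while the text and options stay in the body. The helper name morphology_url and its adm_output flag are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://analytics.babelstreet.com/rest/v1/morphology"


def morphology_url(feature, adm_output=False):
    """Return the endpoint URL, optionally requesting ADM-format output."""
    url = f"{BASE_URL}/{feature}"
    if adm_output:
        # output=rosette is the only query parameter listed in the table.
        url += "?" + urlencode({"output": "rosette"})
    return url
```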
Request
Option | Type | Description | Default |
---|---|---|---|
modelType | string | Model type to use for Thai analyses. Valid values For Korean input without spaces, use | |
disambiguatorType | string | For Hebrew only, determines the disambiguator used. Valid values are | |
disambiguate | Boolean | Indicates whether the analyzers should disambiguate the results. | |
query | Boolean | Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior. | |
partOfSpeechTagSet | string | Selects which part of speech tag sets to return. Valid values are | |
{ "content": "string", "language": "string", "options": { "modelType": "string", "disambiguatorType": "string", "disambiguate": boolean, "query": boolean, "partOfSpeechTagSet": "string" } }
Response
{ "tokens": [ "string" ], "posTags": [ "string" ], "lemmas": [ "string" ], "compoundComponents": [ "string" ], "hanReadings": [ "string" ] }
Supported languages
GET /morphology/supported-languages
Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.
Response
Field | Type | Description |
---|---|---|
language | string | ISO 639 language code |
script | string | Four-letter ISO-15924 script code |
licensed | boolean | Indicates if you are licensed for this language |
{ "supportedLanguages": [ { "language": "string", "script": "string", "licensed": boolean } ] }
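As a sketch of how this response might be consumed, the helper below collects the language codes you are licensed for, collapsing entries that differ only by script. The function name is hypothetical and the sample data is made up for illustration; it is not real API output.

```python
def licensed_languages(response):
    """Return the sorted, de-duplicated codes of licensed languages."""
    entries = response.get("supportedLanguages", [])
    return sorted({e["language"] for e in entries if e.get("licensed")})


# Made-up sample following the response shape shown above.
sample = {
    "supportedLanguages": [
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
        {"language": "deu", "script": "Latn", "licensed": False},
    ]
}
codes = licensed_languages(sample)
```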