Babel Street Analytics API

Morphology

https://analytics.babelstreet.com/rest/v1/morphology/{morphoFeature}

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_complete.curl

The morphological analysis endpoint provides language-specific tools for returning part of speech, lemmas (dictionary form), compound components, and Han readings for each token in the input.

Append a morphoFeature to the morphology/ endpoint to specify which feature you want returned, or append complete to return all features.

| morphoFeature | Description |
| --- | --- |
| complete | Returns all results for all features available for the language of the input text. |
| lemmas | Returns the lemmas, or dictionary forms. |
| parts-of-speech | Returns the part-of-speech (POS) tag for each token. Each language has its own set of POS tags. |
| compound-components | Decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components. This can improve recall for search engines. |
| han-readings | For Chinese tokens in Han script, pinyin transcriptions are returned as the Han reading. For Japanese tokens in Han script (kanji), hiragana transcriptions are returned as the Han reading. |

Do you know the language of your input?

If you know the language of your input, include the three-letter language code in your call. This will speed up the response time.

Otherwise, the endpoint will identify the language automatically.
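As a minimal sketch of the two points above, the following Python builds the endpoint URL for a chosen morphoFeature and a request body that includes the language code only when it is known. The helper names `morphology_url` and `request_body` are hypothetical, and the code `eng` for English is an assumed example of a three-letter language code.

```python
from typing import Optional

BASE_URL = "https://analytics.babelstreet.com/rest/v1/morphology"

# The morphoFeature values listed in the table above.
FEATURES = {
    "complete",
    "lemmas",
    "parts-of-speech",
    "compound-components",
    "han-readings",
}


def morphology_url(feature: str) -> str:
    """Return the endpoint URL for one morphoFeature."""
    if feature not in FEATURES:
        raise ValueError(f"unknown morphoFeature: {feature!r}")
    return f"{BASE_URL}/{feature}"


def request_body(content: str, language: Optional[str] = None) -> dict:
    """Build the request body; omit 'language' to let the endpoint detect it."""
    body = {"content": content}
    if language is not None:
        body["language"] = language
    return body


# Known language: include the code (assumed here to be "eng").
print(morphology_url("lemmas"))
print(request_body("The bird saw the worm.", language="eng"))

# Unknown language: leave the field out.
print(request_body("The bird saw the worm."))
```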

Complete

https://analytics.babelstreet.com/rest/v1/morphology/complete 

Calling complete returns every available morphology feature for the input text: lemmas, compound components, Han readings, and part-of-speech tags.

The table above shows the features supported for each input language.

Lemmas

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_lemmas.curl

https://analytics.babelstreet.com/rest/v1/morphology/lemmas

A lemma is the dictionary form of a word. Morphology determines the lemma of each word in the input based on its usage and context.

For example, saw can be used as a noun or a past tense verb.

In the sentence “The carpenter picked up the saw from the workbench,” the lemma “saw”, a noun, is returned.

However, in the sentence “The bird saw the worm in the shade of the tree,” the lemma “see”, the dictionary form of the verb, is returned.

Parts of Speech

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/morphology_parts-of-speech.curl

https://analytics.babelstreet.com/rest/v1/morphology/parts-of-speech

The morphology endpoint grammatically analyzes text to determine the role of each word within the input. The Parts of Speech feature returns a part-of-speech (POS) tag for each word, based on how the word is used in context.

For example, spoke can be classified as a noun or a verb. It is a noun in “The wheel spoke creaked” and a verb in “She spoke the truth.”

Compound Components

https://analytics.babelstreet.com/rest/v1/morphology/compound-components

With the decompounding feature, compound words are broken into sub-components and returned as individual elements. This is useful for increasing search relevancy in languages such as German and Korean.

For example, the German compound word “Rechtsschutzversicherungsgesellschaften” means “legal expenses insurance companies”.

The compound components “Recht”, “Schutz”, “Versicherung”, and “Gesellschaft” are returned.

Han Readings

https://analytics.babelstreet.com/rest/v1/morphology/han-readings

The Han readings feature provides pronunciation information for Han script, in both Chinese and Japanese input text. The algorithm selected will impact the Han reading returned.

For Chinese tokens in Han script, by default pinyin transcriptions are returned using diacritics. Multiple possible readings may be returned for each word. If you call morphology with “美国大选中的”, it returns these Han readings: [[“měiguó”], [“dàxuǎn”], [“zhōng”, “zhòng”], [“de”, “di”, “dī”, “dí”, “dì”]]. Note that the last two tokens each contain multiple possible readings.

For Japanese tokens in Han script (kanji), by default hiragana transcriptions are returned. If you call morphology with “医療番組”, it returns these Han readings: “いりょう”, “ばんぐみ”.

You can specify that the perceptron algorithm should be used instead of the default algorithm by setting the modelType to perceptron.

  • For Chinese tokens, the perceptron algorithm returns a single pinyin transcription per token, with tones written as digits. If you call with “美国大选中的”, it returns these Han readings: “Mei3-guo2”, “da4-xuan3”, “zhong1”, “de0”.

  • For Japanese, using the perceptron algorithm, katakana transcriptions are returned. If you call with “医療番組”, it returns these Han readings: “イリョウ”, “バングミ”.
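The sketch below illustrates the two behaviors described above: selecting the perceptron algorithm through the modelType option, and handling the default Chinese output, where each token may carry several candidate readings. The helper names are hypothetical. Taking the first candidate is only a naive illustration; this section does not state that the readings are ordered by likelihood.

```python
import json


def han_readings_body(content, use_perceptron=False):
    """Build a request body, optionally selecting the perceptron algorithm."""
    body = {"content": content}
    if use_perceptron:
        body["options"] = {"modelType": "perceptron"}
    return body


def first_readings(han_readings):
    """Pick the first candidate reading per token (None if a token has none)."""
    return [candidates[0] if candidates else None for candidates in han_readings]


# Default-algorithm readings for "美国大选中的", copied from the example above.
default_readings = [
    ["měiguó"],
    ["dàxuǎn"],
    ["zhōng", "zhòng"],
    ["de", "di", "dī", "dí", "dì"],
]

print(json.dumps(han_readings_body("美国大选中的", use_perceptron=True), ensure_ascii=False))
print(first_readings(default_readings))
```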

Query Parameters

| Name | Value | Description |
| --- | --- | --- |
| output | rosette | Returns the response in ADM format. |

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| modelType | string | Model type to use for Thai analyses. Valid values are default, perceptron, and DNN. For Korean input without spaces, use DNN. | default |
| disambiguatorType | string | For Hebrew only, determines the disambiguator used. Valid values are perceptron, DNN, and dictionary. | perceptron |
| disambiguate | Boolean | Indicates whether the analyzers should disambiguate the results. | true |
| query | Boolean | Indicates that the input consists of queries, which are likely to be incomplete sentences. If true, analyzers may change their behavior. | false |
| partOfSpeechTagSet | string | Selects which part-of-speech tag set to return. Valid values are basis and upt16. | upt16 |

{
  "content": "string",
  "language": "string",
  "options": {
    "modelType": "string",
    "disambiguatorType": "string",
    "disambiguate": boolean,
    "query": boolean,
    "partOfSpeechTagSet": "string"
  }
}
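A minimal sketch of assembling this request body in Python follows. It checks option values against the valid values listed in the table and builds, but does not send, a `urllib.request.Request`. The function names are hypothetical, the use of POST is an assumption based on the parameters being carried in the request body, and authentication is not covered in this section, so any required headers are passed in by the caller.

```python
import json
import urllib.request

# Valid values as listed in the options table above.
VALID_STRING_OPTIONS = {
    "modelType": {"default", "perceptron", "DNN"},
    "disambiguatorType": {"perceptron", "DNN", "dictionary"},
    "partOfSpeechTagSet": {"basis", "upt16"},
}
BOOLEAN_OPTIONS = {"disambiguate", "query"}


def build_options(**options):
    """Validate option names and values; return them as a dict."""
    checked = {}
    for name, value in options.items():
        if name in VALID_STRING_OPTIONS:
            if value not in VALID_STRING_OPTIONS[name]:
                raise ValueError(f"invalid value for {name}: {value!r}")
        elif name in BOOLEAN_OPTIONS:
            if not isinstance(value, bool):
                raise TypeError(f"{name} must be a boolean")
        else:
            raise KeyError(f"unknown option: {name}")
        checked[name] = value
    return checked


def build_request(url, content, language=None, headers=None, **options):
    """Build (but do not send) a JSON request for a morphology endpoint."""
    body = {"content": content}
    if language is not None:
        body["language"] = language
    checked = build_options(**options)
    if checked:
        body["options"] = checked
    all_headers = {"Content-Type": "application/json", "Accept": "application/json"}
    all_headers.update(headers or {})  # caller supplies any auth headers
    data = json.dumps(body).encode("utf-8")
    # POST is assumed here; the section only says parameters go in the body.
    return urllib.request.Request(url, data=data, headers=all_headers, method="POST")


req = build_request(
    "https://analytics.babelstreet.com/rest/v1/morphology/complete",
    "She spoke the truth.",
    partOfSpeechTagSet="upt16",
    query=False,
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Options left unset are omitted from the body, so the defaults listed in the table apply.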

Response

{
  "tokens": [
    "string"
  ],
  "posTags": [
    "string"
  ],
  "lemmas": [
    "string"
  ],
  "compoundComponents": [
    "string"
  ],
  "hanReadings": [
    "string"
  ]
}
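One way to consume this response is to line the per-token arrays up with the tokens array. The sketch below assumes the arrays are parallel (entry i of each array describes token i), which matches the statement that results are returned for each token but is not spelled out in the schema. The function name `token_rows` and the sample response values are hypothetical.

```python
FEATURE_FIELDS = ("posTags", "lemmas", "compoundComponents", "hanReadings")


def token_rows(response):
    """Combine the parallel arrays of a morphology response into one dict per token."""
    rows = []
    for i, token in enumerate(response.get("tokens", [])):
        row = {"token": token}
        for field in FEATURE_FIELDS:
            values = response.get(field)
            # A feature that was not requested is simply absent from the response.
            if values is not None and i < len(values):
                row[field] = values[i]
        rows.append(row)
    return rows


# Illustrative response only; the tag and lemma values are made up for this sketch.
sample = {
    "tokens": ["She", "spoke"],
    "posTags": ["PRON", "VERB"],
    "lemmas": ["she", "speak"],
}

for row in token_rows(sample):
    print(row)
```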

Supported languages

GET /morphology/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
| --- | --- | --- |
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
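As a small sketch of using this response, the Python below extracts the language and script pairs that are marked as licensed. The function name `licensed_languages` and the sample entries are hypothetical and are not actual output from the endpoint.

```python
def licensed_languages(response):
    """Return sorted (language, script) pairs that are marked as licensed."""
    entries = response.get("supportedLanguages", [])
    return sorted(
        {(e["language"], e["script"]) for e in entries if e.get("licensed")}
    )


# Illustrative data only.
sample = {
    "supportedLanguages": [
        {"language": "zho", "script": "Hani", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
        {"language": "deu", "script": "Latn", "licensed": True},
    ]
}

print(licensed_languages(sample))
```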