Babel Street Analytics API

Tokenizer

https://analytics.babelstreet.com/rest/v1/tokens

cURL example: https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/tokens.curl

The tokens endpoint uses statistical models to split input text into tokens. A token is an atomic element such as a word, number, possessive affix, or punctuation mark.

The resulting token output can reduce index size and improve search accuracy and relevance.

We offer two algorithms for Chinese and Japanese tokenization and morphological analysis. Prior to August 2018, the default algorithm was a perceptron. To return to that algorithm, set modelType to perceptron.

Query Parameters

| Name | Value | Description |
| --- | --- | --- |
| output | rosette | Returns the response in ADM format. |

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| modelType | string | Model type to use for Thai analyses. Valid values are default, perceptron, and DNN. For Korean input without spaces, use DNN. | default |
| disambiguatorType | string | For Hebrew only, determines the disambiguator used. Valid values are perceptron, DNN, and dictionary. | perceptron |
| disambiguate | boolean | Indicates whether the analyzers should disambiguate the results. | true |
| query | boolean | Indicates that the input consists of queries, which are likely to be incomplete sentences. If true, analyzers may change their behavior. | false |
| partOfSpeechTagSet | string | Selects which part-of-speech tag set to return. Valid values are basis and upt16. | upt16 |

{
  "content": "string",
  "language": "string",
  "options": {
    "modelType": "string",
    "disambiguatorType": "string",
    "disambiguate": boolean,
    "query": boolean,
    "partOfSpeechTagSet": "string"
  }
}
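
The request body above can be assembled in a few lines. The following is a minimal sketch using only the Python standard library. It builds the request without sending it. The helper name `build_tokens_request`, the `X-BabelStreetAPI-Key` header name, and the `jpn` language code are assumptions for illustration and are not defined in this reference; check your account documentation for the actual authentication header.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL as given in this reference.
TOKENS_URL = "https://analytics.babelstreet.com/rest/v1/tokens"


def build_tokens_request(content, language, api_key, options=None, adm_output=False):
    """Build, without sending, a POST request for the tokens endpoint."""
    body = {"content": content, "language": language}
    if options:
        # Only include the options object when at least one option is set,
        # so the service falls back to the documented defaults otherwise.
        body["options"] = options

    url = TOKENS_URL
    if adm_output:
        # The output=rosette query parameter requests the ADM response format.
        url += "?" + urllib.parse.urlencode({"output": "rosette"})

    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        # Assumed header name (hypothetical); not specified in this reference.
        "X-BabelStreetAPI-Key": api_key,
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


# Example: select the perceptron model through the modelType option.
request = build_tokens_request(
    "東京都に住んでいます", "jpn", "YOUR_API_KEY", options={"modelType": "perceptron"}
)
# To send it: urllib.request.urlopen(request).read()
```

Keeping construction separate from sending makes the payload easy to inspect or log before any network call is made.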

Response

{
  "tokens": [
    "string"
  ]
}
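
Reading the default response is a matter of pulling out the `tokens` array. The sketch below assumes the response body has already been read into a string; `extract_tokens` is a hypothetical helper name and the sample body is invented for illustration, not real service output.

```python
import json


def extract_tokens(response_text):
    """Return the list of token strings from a tokens response body."""
    payload = json.loads(response_text)
    tokens = payload.get("tokens", [])
    if not isinstance(tokens, list):
        raise ValueError("expected 'tokens' to be a JSON array")
    return tokens


# Invented sample body that follows the response schema above.
sample_body = '{"tokens": ["The", "cat", "sat", "."]}'
for token in extract_tokens(sample_body):
    print(token)
```

This applies to the default response only. A request made with output=rosette returns ADM format, which this helper does not handle.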

Supported languages

GET /tokens/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
| --- | --- | --- |
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates whether you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
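
A common use of this response is to check which languages an account can actually use. The sketch below filters the `supportedLanguages` array on the `licensed` flag. The helper name `licensed_languages` and the sample entries are hypothetical and only follow the schema shown above.

```python
import json


def licensed_languages(response_text):
    """Return (language, script) pairs whose 'licensed' flag is true."""
    payload = json.loads(response_text)
    return [
        (entry["language"], entry["script"])
        for entry in payload.get("supportedLanguages", [])
        if entry.get("licensed")
    ]


# Invented sample body; real responses list the languages for your account.
sample_languages = json.dumps(
    {
        "supportedLanguages": [
            {"language": "eng", "script": "Latn", "licensed": True},
            {"language": "jpn", "script": "Jpan", "licensed": False},
        ]
    }
)
print(licensed_languages(sample_languages))  # [('eng', 'Latn')]
```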