Tokenizer

https://analytics.babelstreet.com/rest/v1/tokens

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/tokens.curl

The Tokens endpoint uses statistical modeling to split input text into tokens; a single word may yield more than one token. A token is an atomic element such as a word, number, possessive affix, or punctuation mark.

Indexing on the resulting tokens can reduce index size and improve search accuracy and relevancy.

We offer two algorithms for Chinese and Japanese tokenization and morphological analysis. Prior to August 2018, the default algorithm was a perceptron. To return to that algorithm, set `modelType` to `perceptron`.

Tip

Try it in the interactive documentation

Query Parameters

Name	Value	Description
output	rosette	Returns the response in ADM format.
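As a minimal sketch, the `output` query parameter could be attached to the endpoint URL listed at the top of this page as follows. The helper name `tokens_url` is hypothetical and not part of any official client; only the Python standard library is used.

```python
from urllib.parse import urlencode

# Endpoint URL as listed at the top of this page.
TOKENS_ENDPOINT = "https://analytics.babelstreet.com/rest/v1/tokens"


def tokens_url(adm_output: bool = False) -> str:
    """Return the endpoint URL, adding output=rosette when ADM output is wanted.

    `tokens_url` is a hypothetical helper written for illustration only.
    """
    if not adm_output:
        return TOKENS_ENDPOINT
    return TOKENS_ENDPOINT + "?" + urlencode({"output": "rosette"})
```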

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

Name	Type	Description
`content`	string	Text to process
`contentUri`	string	URI to accessible content
`language`	string	ISO 639 language code

Notice

`content` and `contentUri` are mutually exclusive; only one can be specified per call.
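One way to guard against sending both fields is to build the request body through a small helper. The sketch below is illustrative: `build_payload` is a hypothetical name, and treating `language` as optional and rejecting a call with neither source are assumptions not stated on this page.

```python
from typing import Optional


def build_payload(content: Optional[str] = None,
                  content_uri: Optional[str] = None,
                  language: Optional[str] = None) -> dict:
    """Build a request body with exactly one of content / contentUri.

    Hypothetical helper; `language` is assumed to be optional.
    """
    # True when both are set or both are missing: either way, reject.
    if (content is None) == (content_uri is None):
        raise ValueError("Specify exactly one of content or contentUri")

    if content is not None:
        payload = {"content": content}
    else:
        payload = {"contentUri": content_uri}

    if language is not None:
        payload["language"] = language
    return payload
```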

Option	Type	Description	Default
`modelType`	string	Model type to use for Chinese and Japanese analyses. Valid values are `default`, `perceptron`, `DNN`. For Korean input without spaces, use `DNN`.	`default`
`disambiguatorType`	string	For Hebrew only, determines the disambiguator used. Valid values are `perceptron`, `DNN`, `dictionary`.	`perceptron`
`disambiguate`	Boolean	Indicates whether the analyzers should disambiguate the results.	`true`
`query`	Boolean	Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior.	`false`
`partOfSpeechTagSet`	string	Selects which part of speech tag sets to return. Valid values are `basis` and `upt16`.	`upt16`

{
  "content": "string",
  "language": "string",
  "options": {
    "modelType": "string",
    "disambiguatorType": "string",
    "disambiguate": boolean,
    "query": boolean,
    "partOfSpeechTagSet": "string"
  }
}
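The request shape above could be assembled in Python roughly as follows. This is a sketch under stated assumptions: the use of POST and the `X-BabelStreetAPI-Key` header name are not given on this page and should be confirmed against your account documentation, and `build_tokens_request` is a hypothetical helper. The option values it accepts are copied from the options table. The request is only constructed, not sent.

```python
import json
import urllib.request

TOKENS_ENDPOINT = "https://analytics.babelstreet.com/rest/v1/tokens"

# Valid string values copied from the options table above.
VALID_OPTION_VALUES = {
    "modelType": {"default", "perceptron", "DNN"},
    "disambiguatorType": {"perceptron", "DNN", "dictionary"},
    "partOfSpeechTagSet": {"basis", "upt16"},
}
BOOLEAN_OPTIONS = {"disambiguate", "query"}


def build_tokens_request(content, language, options, api_key):
    """Validate options and return an unsent urllib Request (hypothetical helper)."""
    for name, value in options.items():
        if name in VALID_OPTION_VALUES:
            if value not in VALID_OPTION_VALUES[name]:
                raise ValueError(f"Invalid value for {name}: {value!r}")
        elif name in BOOLEAN_OPTIONS:
            if not isinstance(value, bool):
                raise ValueError(f"{name} must be a Boolean")
        else:
            raise ValueError(f"Unknown option: {name}")

    body = json.dumps(
        {"content": content, "language": language, "options": options}
    ).encode("utf-8")

    return urllib.request.Request(
        TOKENS_ENDPOINT,
        data=body,
        method="POST",  # assumption: the HTTP method is not stated on this page
        headers={
            "Content-Type": "application/json",
            "X-BabelStreetAPI-Key": api_key,  # assumption: header name
        },
    )


# Example: request the pre-August-2018 perceptron model for Japanese text.
request = build_tokens_request(
    "東京は日本の首都です。", "jpn", {"modelType": "perceptron"}, "YOUR_API_KEY"
)
# urllib.request.urlopen(request) would send it.
```

Validating option names and values client-side is a design choice for this sketch; the service may apply its own validation.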

Response

{
  "tokens": [
    "string"
  ]
}
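A response of this shape could be unpacked as in the sketch below. The helper name `extract_tokens` is hypothetical, and the sample body is hand-written to match the schema above; it is not actual API output.

```python
import json


def extract_tokens(response_text: str) -> list:
    """Return the list under "tokens", or raise if the body lacks it."""
    data = json.loads(response_text)
    tokens = data.get("tokens") if isinstance(data, dict) else None
    if not isinstance(tokens, list):
        raise ValueError("Response has no 'tokens' array")
    return tokens


# Hand-written sample shaped like the schema above (not real API output).
sample_body = json.dumps({"tokens": ["Hello", ",", "world", "!"]})
tokens = extract_tokens(sample_body)
```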

Supported languages

GET /tokens/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO 15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
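To pick out only the languages you are licensed for, the response could be filtered as in this sketch. The helper name `licensed_languages` is hypothetical, and the sample entries are hand-written to match the schema above rather than taken from a real response.

```python
import json


def licensed_languages(response_text: str) -> list:
    """Return (language, script) pairs for entries whose "licensed" is true."""
    data = json.loads(response_text)
    return [
        (entry["language"], entry["script"])
        for entry in data.get("supportedLanguages", [])
        if entry.get("licensed")
    ]


# Hand-written sample shaped like the schema above (not real API output).
sample_body = json.dumps({
    "supportedLanguages": [
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
    ]
})
licensed = licensed_languages(sample_body)
```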
