Semantic Similarity

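Semantic Similarity provides tools for generating and using text vectors to identify semantically similar words in multiple supported languages through two endpoints: semantics/vector, which returns the text vector for your input, and semantics/similar, which returns terms that are semantically similar to your input.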

Text vectors provide a mechanism for comparing documents or words based on their semantic similarity. For any given term (which may be a document or a single word), a location in semantic space, as represented by a vector of floating point numbers, is calculated. This vector can be mathematically compared with other term or document vectors. Words with similar meanings have similar contexts, so they are mapped close to each other. The terms being compared can be in the same or different languages, providing cross-lingual semantic similarity evaluation without the need for any translation.

In the semantic space, corresponding words in different languages have similar meanings and therefore similar mappings. Take, for example, “Washington was the first president of the United States” and “Washington fue el primer presidente de los Estados Unidos”. These sentences have roughly equivalent meanings in English and Spanish, which is reflected in the proximity of their text vectors.
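To make "proximity" concrete, here is a minimal sketch of comparing two vectors with cosine similarity. This page does not state which measure the service uses internally; cosine similarity is assumed here because it is a common choice for text vectors and its range of -1 to 1 matches the similarity values described for semantics/similar. The three-dimensional vectors are invented toy values, not real output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented toy vectors; real text vectors have many more dimensions.
english = [0.2, 0.1, 0.7]
spanish = [0.25, 0.05, 0.65]
unrelated = [-0.6, 0.8, 0.0]

print(cosine_similarity(english, spanish))    # close to 1: similar meaning
print(cosine_similarity(english, unrelated))  # near or below 0: dissimilar
```

In practice the two inputs would be the vectors returned for each sentence, compared in the same way regardless of the language of either input.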

Supported Languages

The semantics/vector and semantics/similar endpoints support the same languages.

You can specify the language of your input with the three-letter language code. If you do not specify the language, then the endpoint automatically detects it.

Arabic (ara)
Chinese (zho)
English (eng)
French (fra)
German (deu)
Hebrew (heb)
Hungarian (hun)
Italian (ita)
Japanese (jpn)
Korean (kor)
Korean - North (qkp)
Korean - South (qkr)
Persian (fas)
Portuguese (por)
Russian (rus)
Spanish (spa)
Tagalog (tgl)
Urdu (urd)

Semantics/Similar

https://analytics.babelstreet.com/rest/v1/semantics/similar

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/similar_terms.curl

Semantics/similar uses text vectors to identify semantically similar terms. By calculating the text vector for a term, the endpoint can find similar terms in the semantic space in multiple supported languages.

To find semantically similar terms, provide an input term and one or more result languages. By default, the endpoint will return 10 similar terms in each requested language. You can request up to 50 terms for each language. For each returned term, a similarity value between -1 and 1 is returned, where a value closer to 1 indicates a higher degree of similarity.

Tip

Try it in the interactive documentation

Query Parameters

Name	Value	Description
output	rosette	Returns the response in ADM format.

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

The desired output languages are supplied with the required resultLanguages option. For example, {"options": {"resultLanguages": ["spa", "jpn"]}} returns Spanish and Japanese terms.

To specify the number of terms to return (between 1 and 50), use the count option. For example, to return 20 terms in Spanish and 20 terms in Japanese, add {"options": {"resultLanguages": ["spa", "jpn"], "count": 20}} to your call.

You can specify the language of your input by including the three-letter language code within your request. If you do not include this field, the endpoint will automatically detect the language of the input.

Name	Type	Description
`content`	string	Text to process
`contentUri`	string	URI to accessible content
`language`	string	ISO 639 language code

Notice

content and contentUri are mutually exclusive; only one can be specified per call.

Option	Type	Description	Required	Default
`resultLanguages`	array of strings	Language code(s) for the results.	yes
`count`	number	Number of terms to return in each language. Valid values 1-50.	no	10
`embeddingsMode`	string	Determines the embeddings (`GEN_1` or `GEN_2`) used to generate the results.	no	`GEN_2`

{
  "content": "string",
  "language": "string",
  "options": {
    "count": 0,
    "resultLanguages": [
      "string"
    ]
  }
}
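As an illustration of the request rules above, the following sketch assembles a semantics/similar request body and checks the documented constraints before anything is sent. The helper name build_similar_request is hypothetical, and sending the request (including authentication) is deliberately left out because it is not covered in this section.

```python
import json


def build_similar_request(term, result_languages, count=10, language=None):
    """Build a semantics/similar request body (hypothetical helper)."""
    if not result_languages:
        raise ValueError("resultLanguages is required")
    if not 1 <= count <= 50:
        raise ValueError("count must be between 1 and 50")
    body = {
        "content": term,
        "options": {
            "resultLanguages": list(result_languages),
            "count": count,
        },
    }
    # Omitting "language" lets the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    return body


body = build_similar_request("spy", ["spa", "jpn"], count=20)
print(json.dumps(body, indent=2))
```

Validating count on the client side is optional; it simply surfaces an out-of-range value before the call is made.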

Response

Semantics/similar returns between 1 and 50 terms for each language, as set by the count option. For each returned term, a similarity value between -1 and 1 is returned, where a value closer to 1 indicates a higher degree of similarity.

{
  "similarTerms": {
    "language": [
      {
        "term": "string",
        "similarity": 0
      }
    ]
  }
}
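A response of this shape can be read as a mapping from language code to a list of term and similarity pairs. The sketch below assumes the response has already been parsed into a Python dictionary; the helper name top_terms is hypothetical, and the sample terms and similarity values are invented for illustration.

```python
def top_terms(response, language, limit=3):
    """Return up to `limit` (term, similarity) pairs, most similar first."""
    terms = response.get("similarTerms", {}).get(language, [])
    ranked = sorted(terms, key=lambda t: t["similarity"], reverse=True)
    return [(t["term"], t["similarity"]) for t in ranked[:limit]]


# Invented sample data in the documented response shape.
sample = {
    "similarTerms": {
        "spa": [
            {"term": "espía", "similarity": 0.61},
            {"term": "agente", "similarity": 0.55},
            {"term": "espionaje", "similarity": 0.58},
        ]
    }
}

for term, similarity in top_terms(sample, "spa", limit=2):
    print(term, similarity)
```

Sorting explicitly avoids relying on the order in which the service lists terms, which this section does not specify.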

Supported languages

GET /semantics/similar/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO-15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": true
    }
  ]
}
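One use of this response is to check which languages you can actually request. The sketch below assumes the response has been parsed into a Python dictionary; the helper name licensed_languages is hypothetical, and the sample entries are invented.

```python
def licensed_languages(response):
    """Return the sorted, de-duplicated codes of licensed languages."""
    entries = response.get("supportedLanguages", [])
    return sorted({e["language"] for e in entries if e.get("licensed")})


# Invented sample data in the documented response shape.
sample_languages = {
    "supportedLanguages": [
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
    ]
}

print(licensed_languages(sample_languages))
```

A language can appear once per script, so the codes are collected in a set before sorting.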

Semantics/Vector

https://analytics.babelstreet.com/rest/v1/semantics/vector

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/semantic_vectors.curl

Semantics/vector returns a single vector of floating point numbers for your input, representing the location of the input in semantic space. The length of the input can range from a single word to an entire document. Among other uses, a text vector enables you to calculate the similarity between two documents or two words.

Tip

Try it in the interactive documentation

Query Parameters

Name	Value	Description
output	rosette	Returns the response in ADM format.

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

By default, a vector representative of the entire input document is returned, regardless of length. To return a vector for each individual token, set perToken to true.

Name	Type	Description
`content`	string	Text to process
`contentUri`	string	URI to accessible content
`language`	string	ISO 639 language code

Notice

content and contentUri are mutually exclusive; only one can be specified per call.

Option	Type	Description	Default
`perToken`	boolean	If true, returns a vector for each individual token.	false
`embeddingsMode`	string	Determines the embeddings (`GEN_1` or `GEN_2`) used to generate the results.	`GEN_2`

{
  "content": "string",
  "language": "string",
  "options": {
    "perToken": false
  }
}
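The following sketch builds a semantics/vector request body while enforcing the rule that content and contentUri are mutually exclusive. The helper name build_vector_request is hypothetical. The perToken option is emitted as a JSON boolean, matching the type given in the options table.

```python
import json


def build_vector_request(content=None, content_uri=None, language=None,
                         per_token=False):
    """Build a semantics/vector request body (hypothetical helper)."""
    # Exactly one of content / contentUri may be supplied per call.
    if (content is None) == (content_uri is None):
        raise ValueError("provide exactly one of content or contentUri")
    if content is not None:
        body = {"content": content}
    else:
        body = {"contentUri": content_uri}
    # Omitting "language" lets the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    body["options"] = {"perToken": bool(per_token)}
    return body


request_body = build_vector_request(content="Cambridge, Massachusetts",
                                    per_token=True)
print(json.dumps(request_body))
```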

Response

{
  "documentEmbedding": [
    0
  ],
  "tokenEmbeddings": [
    0
  ],
  "tokens": [
    "string"
  ]
}
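When perToken is true, the per-token fields can be paired up for further processing. The sketch below assumes that tokenEmbeddings holds one vector per entry in tokens, in the same order; the schema above does not spell this out, so treat it as an assumption to verify against a real response. The helper name token_vectors and the sample values are invented.

```python
def token_vectors(response):
    """Pair each token with its vector, assuming parallel lists."""
    tokens = response.get("tokens", [])
    embeddings = response.get("tokenEmbeddings", [])
    if len(tokens) != len(embeddings):
        raise ValueError("tokens and tokenEmbeddings differ in length")
    # A list of pairs is used because the same token can occur more than once.
    return list(zip(tokens, embeddings))


# Invented sample data; real vectors have many more dimensions.
sample_vectors = {
    "documentEmbedding": [0.1, 0.2],
    "tokens": ["Cambridge", "Massachusetts"],
    "tokenEmbeddings": [[0.3, 0.1], [-0.1, 0.3]],
}

for token, vector in token_vectors(sample_vectors):
    print(token, vector)
```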

Supported languages

GET /semantics/vector/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO-15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": true
    }
  ]
}
