Babel Street Analytics API

Semantic Similarity

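Semantic Similarity provides tools for generating and using text vectors to identify semantically similar words in multiple supported languages through two endpoints: semantics/vector, which returns the text vector for an input, and semantics/similar, which returns terms that are semantically similar to an input term.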

Text vectors provide a mechanism for comparing documents or words based on their semantic similarity. For any given term (which may be a document or a single word), a location in semantic space, as represented by a vector of floating point numbers, is calculated. This vector can be mathematically compared with other term or document vectors. Words with similar meanings have similar contexts, so they are mapped close to each other. The terms being compared can be in the same or different languages, providing cross-lingual semantic similarity evaluation without the need for any translation.

In the semantic space, corresponding words in different languages have similar meanings and therefore similar mappings. Take, for example, “Washington was the first president of the United States” and “Washington fue el primer presidente de los Estados Unidos”. These sentences have roughly equivalent meanings in English and Spanish, which is reflected in the proximity of their text vectors.
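
One common way to compare two text vectors mathematically is cosine similarity, which yields a value between -1 and 1. The sketch below is illustrative only: the page does not state which measure the service uses internally, and the three-dimensional vectors are made-up stand-ins for real text vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real API output).
english = [0.12, -0.40, 0.33]
spanish = [0.10, -0.38, 0.35]
unrelated = [-0.50, 0.20, -0.10]

print(cosine_similarity(english, spanish))    # close to 1: similar meaning
print(cosine_similarity(english, unrelated))  # negative: dissimilar
```

Vectors that point in nearly the same direction score close to 1, which matches the intuition that equivalent sentences in two languages map to nearby locations.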

Supported Languages

The semantics/vector and semantics/similar endpoints support the same languages.

You can specify the language of your input with the three-letter language code. If you do not specify the language, then the endpoint automatically detects it.

  • Arabic (ara)

  • Chinese (zho)

  • English (eng)

  • French (fra)

  • German (deu)

  • Hebrew (heb)

  • Hungarian (hun)

  • Italian (ita)

  • Japanese (jpn)

  • Korean (kor)

  • Korean - North (qkp)

  • Korean - South (qkr)

  • Persian (fas)

  • Portuguese (por)

  • Russian (rus)

  • Spanish (spa)

  • Tagalog (tgl)

  • Urdu (urd)

Semantics/Similar

https://analytics.babelstreet.com/rest/v1/semantics/similar

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/similar_terms.curl

Semantics/similar uses text vectors to identify semantically similar terms. By calculating the text vector for a term, the endpoint can find similar terms in the semantic space in multiple supported languages.

To find semantically similar terms, provide an input term and one or more result languages. By default, the endpoint will return 10 similar terms in each requested language. You can request up to 50 terms for each language. For each returned term, a similarity value between -1 and 1 is returned, where a value closer to 1 indicates a higher degree of similarity.

Query Parameters

| Name | Value | Description |
| --- | --- | --- |
| output | rosette | Returns the response in ADM format. |

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

The desired output languages are supplied with the required resultLanguages option. For example, {"options": {"resultLanguages": ["spa", "jpn"]}} returns Spanish and Japanese terms.

To specify the number of terms to return (between 1 and 50), use the count option. For example, to return 20 terms in Spanish and 20 terms in Japanese, add {"options": {"resultLanguages": ["spa", "jpn"], "count": 20}} to your call.

You can specify the language of your input by including the three-letter language code within your request. If you do not include this field, the endpoint will automatically detect the language of the input.

| Option | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| resultLanguages | string | Language code(s) for the results. | yes | |
| count | number | Number of terms to return in each language. Valid values 1-50. | no | 10 |
| embeddingsMode | string | Determines the embeddings (GEN_1 or GEN_2) used to generate the results. | no | GEN_2 |

{
  "content": "string",
  "language": "string",
  "options": {
    "count": 0,
    "resultLanguages": [
      "string"
    ]
  }
}
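
As a minimal sketch of the request rules described above, the following helper assembles a semantics/similar request body and checks the documented limits before anything is sent. The function name `build_similar_request` and the sample term are hypothetical; only the field names and the 1 to 50 range come from this page.

```python
import json


def build_similar_request(content, result_languages, count=10, language=None):
    """Build a request body for semantics/similar (hypothetical helper)."""
    if not result_languages:
        raise ValueError("resultLanguages is required")
    if not 1 <= count <= 50:
        raise ValueError("count must be between 1 and 50")
    body = {
        "content": content,
        "options": {
            "resultLanguages": list(result_languages),
            "count": count,
        },
    }
    # Omit "language" to let the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    return body


payload = build_similar_request("spy", ["spa", "jpn"], count=20)
print(json.dumps(payload, indent=2))
```

Validating count on the client side simply avoids a round trip for a request that falls outside the documented range.
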
Response

Semantics/similar returns between 1 and 50 terms for each language, as set by the count option. Each returned term has a similarity value between -1 and 1, where a value closer to 1 indicates a higher degree of similarity.

{
  "similarTerms": {
    "language": [
      {
        "term": "string",
        "similarity": 0
      }
    ]
  }
}
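
The sketch below shows one way to read a response of this shape: it ranks the terms for each language and drops those below a threshold. It assumes the "language" key in the schema is a placeholder for a language code such as "spa"; the sample terms and similarity values are invented for illustration, and `top_terms` is a hypothetical helper.

```python
import json

# Invented sample shaped like the schema above (not real API output).
sample = json.loads("""
{
  "similarTerms": {
    "spa": [
      {"term": "agente", "similarity": 0.48},
      {"term": "informante", "similarity": 0.61}
    ],
    "fra": [
      {"term": "espion", "similarity": 0.50}
    ]
  }
}
""")


def top_terms(response, min_similarity=0.0):
    """Map each language code to (term, similarity) pairs, best first."""
    ranked = {}
    for language, entries in response.get("similarTerms", {}).items():
        pairs = [
            (entry["term"], entry["similarity"])
            for entry in entries
            if entry["similarity"] >= min_similarity
        ]
        ranked[language] = sorted(pairs, key=lambda p: p[1], reverse=True)
    return ranked


for language, pairs in top_terms(sample, min_similarity=0.5).items():
    print(language, pairs)
```
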
Supported languages

GET /semantics/similar/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
| --- | --- | --- |
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
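
A response of this shape can be reduced to the set of languages you are licensed for. The following is a sketch under stated assumptions: the sample entries are invented, and `licensed_languages` is a hypothetical helper, not part of any client library.

```python
# Invented sample shaped like the schema above (not real API output).
sample = {
    "supportedLanguages": [
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "jpn", "script": "Jpan", "licensed": False},
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
    ]
}


def licensed_languages(response):
    """Return the sorted, de-duplicated codes of licensed languages."""
    return sorted(
        {
            entry["language"]
            for entry in response.get("supportedLanguages", [])
            if entry.get("licensed")
        }
    )


print(licensed_languages(sample))
```

A language can appear once per script, so the helper collects codes in a set before sorting.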

Semantics/Vector

https://analytics.babelstreet.com/rest/v1/semantics/vector

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/semantic_vectors.curl

Semantics/vector returns a single vector of floating point numbers for your input, representing the location of the input in semantic space. The length of the input can range from a single word to an entire document. Among other uses, a text vector enables you to calculate the similarity between two documents or two words.

Query Parameters

| Name | Value | Description |
| --- | --- | --- |
| output | rosette | Returns the response in ADM format. |

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

By default, a vector representative of the entire input document is returned, regardless of length. To return a vector for each individual token, set perToken to true.

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| perToken | boolean | If true, returns a vector for each individual token. | false |
| embeddingsMode | string | Determines the embeddings (GEN_1 or GEN_2) used to generate the results. | GEN_2 |

{
  "content": "string",
  "language": "string",
  "options": {
    "perToken": false
  }
}
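
To illustrate how such a request might be assembled, the sketch below prepares a POST to the semantics/vector URL shown above using Python's standard library, without sending it. The `X-BabelStreetAPI-Key` header name and the `YOUR_API_KEY` placeholder are assumptions not confirmed by this page, and `build_vector_request` is a hypothetical helper.

```python
import json
import urllib.request

VECTOR_URL = "https://analytics.babelstreet.com/rest/v1/semantics/vector"


def build_vector_request(content, api_key, per_token=False, language=None):
    """Prepare (but do not send) a semantics/vector request."""
    body = {"content": content, "options": {"perToken": bool(per_token)}}
    # Omit "language" to let the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    return urllib.request.Request(
        VECTOR_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-BabelStreetAPI-Key": api_key,  # assumed header name
        },
        method="POST",
    )


request = build_vector_request(
    "Cambridge, Massachusetts", "YOUR_API_KEY", per_token=True
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared object to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without credentials or network access.
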
Response
{
  "documentEmbedding": [
    0
  ],
  "tokenEmbeddings": [
    0
  ],
  "tokens": [
    "string"
  ]
}
Supported languages

GET /semantics/vector/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
| --- | --- | --- |
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}