Semantic Similarity
Semantic Similarity provides tools for generating and using text vectors to identify semantically similar words in multiple supported languages through two endpoints, semantics/similar and semantics/vector.
Text vectors provide a mechanism for comparing documents or words based on their semantic similarity. For any given term (which may be a document or a single word), a location in semantic space, as represented by a vector of floating point numbers, is calculated. This vector can be mathematically compared with other term or document vectors. Words with similar meanings have similar contexts, so they are mapped close to each other. The terms being compared can be in the same or different languages, providing cross-lingual semantic similarity evaluation without the need for any translation.
In the semantic space, corresponding words in different languages have similar meanings and therefore similar mappings. Take, for example, “Washington was the first president of the United States” and “Washington fue el primer presidente de los Estados Unidos”. These sentences have roughly equivalent meanings in English and Spanish, which is reflected in the proximity of their text vectors.
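As an illustration of how two text vectors can be compared mathematically, the sketch below computes cosine similarity, one common measure of closeness between vectors. This section does not state which measure the service itself uses, so treat the choice as an assumption. The three-dimensional vectors are made up for the example and are far shorter than real text vectors.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only (not real service output).
english_vec = [0.2, 0.7, 0.1]
spanish_vec = [0.25, 0.65, 0.05]
unrelated_vec = [-0.6, 0.1, 0.8]

# Vectors pointing in nearly the same direction score close to 1.
print(cosine_similarity(english_vec, spanish_vec))
# Vectors pointing in unrelated directions score close to 0.
print(cosine_similarity(english_vec, unrelated_vec))
```

With this measure, two inputs whose vectors point in nearly the same direction score close to 1, regardless of the language each input was written in.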
Supported Languages
The semantics/vector and semantics/similar endpoints support the same languages.
You can specify the language of your input with the three-letter language code. If you do not specify the language, then the endpoint automatically detects it.
Arabic (ara)
Chinese (zho)
English (eng)
French (fra)
German (deu)
Hebrew (heb)
Hungarian (hun)
Italian (ita)
Japanese (jpn)
Korean (kor)
Korean - North (qkp)
Korean - South (qkr)
Persian (fas)
Portuguese (por)
Russian (rus)
Spanish (spa)
Tagalog (tgl)
Urdu (urd)
Semantics/Similar
https://analytics.babelstreet.com/rest/v1/semantics/similar
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/similar_terms.curl
Semantics/similar uses text vectors to identify semantically similar terms. By calculating the text vector for a term, the endpoint can find similar terms in the semantic space in multiple supported languages.
To find semantically similar terms, provide an input term and one or more result languages. By default, the endpoint will return 10 similar terms in each requested language. You can request up to 50 terms for each language. For each returned term, a similarity value between -1 and 1 is returned, where a value closer to 1 indicates a higher degree of similarity.
Query Parameters
Name | Value | Description |
---|---|---|
output | rosette | Returns the response in ADM format. |
Note
All input parameters, including the text being analyzed and any relevant options, are defined in the request body.
Request
The desired output languages are supplied with the required resultLanguages option. The statement {"options": {"resultLanguages": ["spa", "jpn"]}} returns Spanish and Japanese terms.
To specify the number of terms to return (between 1 and 50), use the count option. For example, to return 20 terms in Spanish and 20 terms in Japanese, add {"options": {"resultLanguages": ["spa", "jpn"], "count": 20}} to your call.
You can specify the language of your input by including the three-letter language code within your request. If you do not include this field, the endpoint will automatically detect the language of the input.
Option | Type | Description | Required | Default |
---|---|---|---|---|
resultLanguages | string | Language code(s) for the results. | yes | |
count | number | Number of terms to return in each language. Valid values 1-50. | no | 10 |
| string | Determines the embeddings ( | no | |
{ "content": "string", "language": "string", "options": { "count": 0, "resultLanguages": [ "string" ] } }
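The request body described above can be assembled and sent as in the following sketch, which uses only the Python standard library. The helper names build_similar_request and post_json are hypothetical. Authentication is not covered in this section, so any required headers are left to the caller and nothing is sent when the block runs.

```python
import json
import urllib.request
from typing import Any, Dict, Optional, Sequence

SIMILAR_URL = "https://analytics.babelstreet.com/rest/v1/semantics/similar"


def build_similar_request(
    content: str,
    result_languages: Sequence[str],
    count: int = 10,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a semantics/similar request body (hypothetical helper)."""
    if not result_languages:
        raise ValueError("resultLanguages is required")
    if not 1 <= count <= 50:
        raise ValueError("count must be between 1 and 50")
    body: Dict[str, Any] = {
        "content": content,
        "options": {"resultLanguages": list(result_languages), "count": count},
    }
    # Omit "language" to let the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    return body


def post_json(url: str, body: Dict[str, Any], headers: Dict[str, str]) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON response.

    The caller supplies any authentication headers; this section of the
    documentation does not specify them.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", **headers},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) a request for 20 Spanish and 20 Japanese terms.
payload = build_similar_request("spy", ["spa", "jpn"], count=20)
print(json.dumps(payload))
# To send it: post_json(SIMILAR_URL, payload, headers=your_auth_headers)
```

Validating count before the call means an out-of-range value fails locally instead of costing a request to the service.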
Response
Semantics/similar returns between 1 and 50 terms for each language, as set by the count option. For each returned term, a similarity value between -1 and 1 is returned, where a value closer to 1 indicates a higher degree of similarity.
{ "similarTerms": { "language": [ { "term": "string", "similarity": 0 } ] } }
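A response of this shape can be post-processed as in the sketch below, which ranks the returned terms in each language by similarity and optionally drops those below a threshold. The function name top_terms, the threshold parameter, and the sample terms and scores are made up for illustration and are not real service output.

```python
from typing import Any, Dict, List, Tuple


def top_terms(
    response: Dict[str, Any], min_similarity: float = -1.0
) -> Dict[str, List[Tuple[str, float]]]:
    """Return (term, similarity) pairs per language, highest similarity first."""
    ranked: Dict[str, List[Tuple[str, float]]] = {}
    for language, entries in response.get("similarTerms", {}).items():
        kept = [
            (entry["term"], float(entry["similarity"]))
            for entry in entries
            if float(entry["similarity"]) >= min_similarity
        ]
        kept.sort(key=lambda pair: pair[1], reverse=True)
        ranked[language] = kept
    return ranked


# Made-up sample in the documented response shape.
sample_response = {
    "similarTerms": {
        "spa": [
            {"term": "agente", "similarity": 0.44},
            {"term": "espionaje", "similarity": 0.61},
        ],
        "deu": [{"term": "Spion", "similarity": 0.5}],
    }
}

for language, terms in top_terms(sample_response, min_similarity=0.45).items():
    print(language, terms)
```

The default threshold of -1.0 keeps every term, since that is the lowest similarity value the endpoint is documented to return.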
Supported languages
GET /semantics/similar/supported-languages
Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.
Response
Field | Type | Description |
---|---|---|
language | string | ISO 639 language code |
script | string | Four-letter ISO-15924 script code |
licensed | boolean | Indicates if you are licensed for this language |
{ "supportedLanguages": [ { "language": "string", "script": "string", "licensed": boolean } ] }
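One possible use of this response is to check which languages are licensed before sending requests. The sketch below assumes the response has already been decoded into a Python dictionary. The function name licensed_languages and the sample entries are hypothetical.

```python
from typing import Any, Dict, List


def licensed_languages(response: Dict[str, Any]) -> List[str]:
    """Return the sorted, de-duplicated codes of languages marked as licensed."""
    codes = {
        entry["language"]
        for entry in response.get("supportedLanguages", [])
        if entry.get("licensed") is True
    }
    return sorted(codes)


# Made-up sample in the documented response shape. A language may appear
# more than once if it is listed with several scripts.
sample_response = {
    "supportedLanguages": [
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
        {"language": "rus", "script": "Cyrl", "licensed": False},
    ]
}

print(licensed_languages(sample_response))
```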
Semantics/Vector
https://analytics.babelstreet.com/rest/v1/semantics/vector
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/semantic_vectors.curl
Vector returns a single vector of floating point numbers for your input, representing the location of the input in semantic space. The length of the input can range from a single word to an entire document. Among other uses, a text vector enables you to calculate the similarity between two documents or two words.
Query Parameters
Name | Value | Description |
---|---|---|
output | rosette | Returns the response in ADM format. |
Note
All input parameters, including the text being analyzed and any relevant options, are defined in the request body.
Request
By default, a vector representative of the entire input document is returned, regardless of length. To return a vector for each individual token, set perToken to true.
Option | Type | Description | Default |
---|---|---|---|
perToken | boolean | If true, returns a vector for each individual token. | false |
| string | Determines the embeddings ( | |
{ "content": "string", "language": "string", "options": { "perToken": false } }
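A semantics/vector request body can be built as in the sketch below. The option table lists perToken as a boolean, so the helper serializes it as a JSON boolean, not a quoted string. The function name build_vector_request is hypothetical, and the resulting body can be sent with any HTTP client.

```python
import json
from typing import Any, Dict, Optional


def build_vector_request(
    content: str, per_token: bool = False, language: Optional[str] = None
) -> Dict[str, Any]:
    """Build a semantics/vector request body (hypothetical helper)."""
    if not isinstance(per_token, bool):
        raise TypeError("per_token must be a bool")
    body: Dict[str, Any] = {"content": content, "options": {"perToken": per_token}}
    # Omit "language" to let the endpoint detect the input language.
    if language is not None:
        body["language"] = language
    return body


# Request one vector per token for a short English input.
vector_payload = build_vector_request(
    "Washington was the first president of the United States",
    per_token=True,
    language="eng",
)
print(json.dumps(vector_payload))
```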
Response
{ "documentEmbedding": [ 0 ], "tokenEmbeddings": [ 0 ], "tokens": [ "string" ] }
Supported languages
GET /semantics/vector/supported-languages
Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.
Response
Field | Type | Description |
---|---|---|
language | string | ISO 639 language code |
script | string | Four-letter ISO-15924 script code |
licensed | boolean | Indicates if you are licensed for this language |
{ "supportedLanguages": [ { "language": "string", "script": "string", "licensed": boolean } ] }