Babel Street Analytics API

Name Deduplication

https://analytics.babelstreet.com/rest/v1/name-deduplication 

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/name_deduplication.curl

Name Deduplication takes a list of names and groups similar names into clusters. It uses a linguistic, statistically based system to match name variations, such as misspellings and nicknames, across languages. A threshold parameter determines how much variation between names is permitted in any given cluster. Each call accepts a list of at most 1000 names.

Given a list of names as input, the output is a cluster ID for each name. Similar names are given the same cluster ID. The output may then be sorted by cluster ID to group together possible duplicate names.

Important

Do not use the Analytics language identification endpoint to determine the source language of names. Name deduplication uses an algorithm specifically tuned to identify the language of names rather than general text.

Request

A name-deduplication request requires only a list of names; the threshold is optional. We strongly recommend specifying the source language when it is known, for better accuracy. If you do not specify the source language, the name similarity algorithm guesses it.

Option | Type | Description | Required
threshold | number | Determines how much variation between names is permitted in any given cluster. A higher value permits less variation within a cluster. | no

{
  "names": [
    {
      "entityType": "PERSON",
      "language": "string",
      "script": "string",
      "text": "string"
    }
  ],
  "threshold": 0.75
}
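
The request body above could be assembled and posted from Python roughly as follows. This is a minimal sketch, not an official client: the helper names `build_request` and `deduplicate` are hypothetical, and it assumes the endpoint accepts a JSON POST authenticated with an `X-BabelStreetAPI-Key` header, which this page does not state. Confirm the header against your account documentation.

```python
import json
import urllib.request

ENDPOINT = "https://analytics.babelstreet.com/rest/v1/name-deduplication"
MAX_NAMES = 1000  # per-call limit stated in the overview


def build_request(names, threshold=None, language=None, entity_type=None):
    """Build the JSON body for a name-deduplication call (hypothetical helper)."""
    if len(names) > MAX_NAMES:
        raise ValueError(f"at most {MAX_NAMES} names per call, got {len(names)}")
    entries = []
    for text in names:
        entry = {"text": text}
        # Optional fields are sent only when the caller supplies them.
        if language is not None:
            entry["language"] = language
        if entity_type is not None:
            entry["entityType"] = entity_type
        entries.append(entry)
    body = {"names": entries}
    if threshold is not None:
        body["threshold"] = threshold
    return body


def deduplicate(api_key, body):
    """POST the body and return the parsed JSON response.

    The authentication header name is an assumption, not taken from this page.
    """
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-BabelStreetAPI-Key": api_key,  # assumed header name
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `build_request(["John Smith", "Jon Smythe"], threshold=0.75, language="eng")` yields a body in the shape shown above; passing the language follows the recommendation to specify it when known.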

Response

The response contains a cluster ID for each input name. Names that share a cluster ID are possible duplicates, and sorting the output by cluster ID groups them together.

Input (names) | Output (cluster ID)
John Smith | 1
Cyndi McBoysen | 2
Dmitri Shostakovich | 4
Jim Hockenberry | 3
Takeshi Suzuki | 5
Jon Smythe | 1
James Hawkenbury | 3
Cindy MacBoysen | 2
Дми́трий Шостако́вич | 4

{
  "results": [
    "string"
  ]
}
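
Grouping names by cluster ID could look like the sketch below. It assumes that `results` lists cluster IDs in the same order as the input names, which this page implies but does not state outright. The function name `group_by_cluster` and the sample values are illustrative only.

```python
from collections import defaultdict


def group_by_cluster(names, results):
    """Group input names by cluster ID.

    Assumes results[i] is the cluster ID assigned to names[i].
    """
    if len(names) != len(results):
        raise ValueError("expected one cluster ID per input name")
    clusters = defaultdict(list)
    for name, cluster_id in zip(names, results):
        clusters[cluster_id].append(name)
    return dict(clusters)


names = ["John Smith", "Cyndi McBoysen", "Takeshi Suzuki", "Jon Smythe", "Cindy MacBoysen"]
response = {"results": ["1", "2", "5", "1", "2"]}  # illustrative values only

clusters = group_by_cluster(names, response["results"])
# Clusters with more than one member are the possible duplicates.
duplicates = {cid: members for cid, members in clusters.items() if len(members) > 1}
```

Here `duplicates` keeps clusters "1" and "2" and drops the single-member cluster "5".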

Supported languages

GET /name-deduplication/supported-languages 

Retrieve the language pairs supported by the name deduplication endpoint. The endpoint supports matching between the source and target of each pair. The language, script, and transliteration scheme are listed for each source and target.

Response

Field | Type | Description
transliterationScheme | string |
script | string | Four-letter ISO 15924 script code
language | string | ISO 639 language code
licensed | boolean | Indicates if you are licensed for this language

{
  "supportedLanguagePairs": [
    {
      "source": {
        "transliterationScheme": "string",
        "script": "string",
        "language": "string"
      },
      "target": {
        "transliterationScheme": "string",
        "script": "string",
        "language": "string"
      },
      "licensed": true
    }
  ]
}
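
Once the supported-languages response has been parsed into a dictionary, the licensed pairs could be selected as in this minimal sketch. It assumes each pair object holds `source`, `target`, and `licensed` keys as described in the field table. The function name `licensed_pairs` and the sample language and script codes are illustrative, not real API output.

```python
def licensed_pairs(payload):
    """Return (source language, target language) tuples for licensed pairs."""
    pairs = []
    for pair in payload.get("supportedLanguagePairs", []):
        # Skip pairs the account is not licensed for.
        if pair.get("licensed"):
            pairs.append((pair["source"]["language"], pair["target"]["language"]))
    return pairs


# Illustrative payload only; real responses come from
# GET /name-deduplication/supported-languages.
sample = {
    "supportedLanguagePairs": [
        {
            "source": {"transliterationScheme": "NATIVE", "script": "Latn", "language": "eng"},
            "target": {"transliterationScheme": "NATIVE", "script": "Cyrl", "language": "rus"},
            "licensed": True,
        },
        {
            "source": {"transliterationScheme": "NATIVE", "script": "Latn", "language": "eng"},
            "target": {"transliterationScheme": "NATIVE", "script": "Jpan", "language": "jpn"},
            "licensed": False,
        },
    ]
}

available = licensed_pairs(sample)
```

A check like this before submitting names can show early whether a given source language is covered by your license.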