Name Deduplication

https://analytics.babelstreet.com/rest/v1/name-deduplication

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/name_deduplication.curl

Name Deduplication takes a list of names and groups similar names into clusters. A linguistic, statistically based system matches name variations, such as misspellings and nicknames, across languages. A threshold parameter determines how much variation between names is permitted within a cluster. Each call accepts a list of at most 1000 names.

Given a list of names as input, the output is a cluster ID for each name. Similar names are given the same cluster ID. The output may then be sorted by cluster ID to group together possible duplicate names.

Important

Do not use the Analytics language identification endpoint to determine the source language of names. Name deduplication uses an algorithm specifically tuned to identify the language of names rather than of general text.

Tip

Try it in the interactive documentation

Request

A name-deduplication request requires only a list of names; the threshold is optional. We strongly recommend specifying the source language when it is known, for better accuracy. If you do not specify the source language, the name similarity algorithm guesses it.

Field	Type	Description	Required
`text`	string	Name to match	yes
`language`	string	Three-letter ISO 639-3 language code	no (but strongly recommended if source language is known)
`entityType`	string	The type of name being matched. The most common types are `PERSON` (default), `LOCATION`, and `ORGANIZATION`. Different types of identifiers can also be matched.	no (defaults to `PERSON` if not specified)
`script`	string	Four-letter ISO 15924 script code	no

Option	Type	Description	Required
`threshold`	number	Determines how much variation between names is permitted in any given cluster. A higher value permits less variation within a cluster.	no

{
  "names": [
    {
      "entityType": "PERSON",
      "language": "string",
      "script": "string",
      "text": "string"
    }
  ],
  "threshold": 0.75
}
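
As a minimal sketch, the request body above could be assembled and sent from Python with only the standard library. The helper names `build_payload` and `dedupe_names` are hypothetical, and both the use of POST and the `X-BabelStreetAPI-Key` header name are assumptions to verify against your account's authentication instructions.

```python
import json
import urllib.request

ENDPOINT = "https://analytics.babelstreet.com/rest/v1/name-deduplication"
MAX_NAMES = 1000  # documented limit of names per list per call


def build_payload(names, language=None, entity_type="PERSON", threshold=None):
    """Build the request body; optional fields are omitted when not given."""
    if len(names) > MAX_NAMES:
        raise ValueError(f"at most {MAX_NAMES} names per call")
    entries = []
    for text in names:
        entry = {"text": text, "entityType": entity_type}
        if language is not None:
            entry["language"] = language  # three-letter ISO 639-3 code
        entries.append(entry)
    payload = {"names": entries}
    if threshold is not None:
        payload["threshold"] = threshold
    return payload


def dedupe_names(api_key, payload):
    """Send the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-BabelStreetAPI-Key": api_key,  # assumed header name
        },
        method="POST",  # assumed HTTP method
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping payload construction separate from the network call makes it possible to check the body, including the 1000-name limit, before anything is sent.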

Response

Input (names)	Output (cluster ID)
John Smith	1
Cyndi McBoysen	2
Dmitri Shostakovich	4
Jim Hockenberry	3
Takeshi Suzuki	5
Jon Smythe	1
James Hawkenbury	3
Cindy MacBoysen	2
Дми́трий Шостако́вич	4

{
  "results": [
    "string"
  ]
}
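
To group possible duplicates, the cluster IDs can be paired back with the input names. The sketch below assumes that `results` holds one cluster ID per name, in the same order as the input list; the function name `group_by_cluster` is hypothetical, and the sample data reuses six rows from the table above.

```python
from collections import defaultdict


def group_by_cluster(names, results):
    """Map each cluster ID to the input names that received it."""
    if len(names) != len(results):
        raise ValueError("expected one cluster ID per input name")
    clusters = defaultdict(list)
    for name, cluster_id in zip(names, results):
        clusters[cluster_id].append(name)
    return dict(clusters)


names = ["John Smith", "Cyndi McBoysen", "Jim Hockenberry",
         "Jon Smythe", "James Hawkenbury", "Cindy MacBoysen"]
results = ["1", "2", "3", "1", "3", "2"]  # cluster IDs as listed in the table

clusters = group_by_cluster(names, results)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```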

Supported languages

GET /name-deduplication/supported-languages

Retrieve the language pairs supported by the name deduplication endpoint. The endpoint supports matching between the source and target of each pair. The language, script, and transliteration scheme are listed for each source and target.

Response

Field	Type	Description
transliterationScheme	string
script	string	Four-letter ISO 15924 script code
language	string	ISO 639 language code
licensed	boolean	Indicates if you are licensed for this language

{
  "supportedLanguagePairs": [
    {
      "source": {
        "transliterationScheme": "string",
        "script": "string",
        "language": "string"
      },
      "target": {
        "transliterationScheme": "string",
        "script": "string",
        "language": "string"
      },
      "licensed": true
    }
  ]
}
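
A minimal Python sketch of retrieving the supported pairs and keeping only the licensed ones follows. It assumes that each pair object carries `source`, `target`, and `licensed` keys as the schema suggests, that the path is appended to the same base URL as the main endpoint, and that authentication uses an `X-BabelStreetAPI-Key` header. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed full URL: base of the main endpoint plus the documented GET path.
URL = ("https://analytics.babelstreet.com/rest/v1"
       "/name-deduplication/supported-languages")


def fetch_supported_pairs(api_key):
    """GET the supported language pairs and return the decoded JSON body."""
    request = urllib.request.Request(
        URL,
        headers={"X-BabelStreetAPI-Key": api_key},  # assumed header name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def licensed_pairs(body):
    """Return (source language, target language) tuples for licensed pairs."""
    return [
        (pair["source"]["language"], pair["target"]["language"])
        for pair in body.get("supportedLanguagePairs", [])
        if pair.get("licensed")
    ]
```

Checking this list before submitting names can show whether a given source language is covered by your license.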
