Name Deduplication
https://analytics.babelstreet.com/rest/v1/name-deduplication
https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/name_deduplication.curl
Name Deduplication takes a list of names and groups similar names into clusters. A linguistic, statistically based system matches name variations (such as misspellings and nicknames) across languages. A threshold parameter determines how much variation between names is permitted in any given cluster. Each call accepts a list of at most 1000 names.
Given a list of names as input, the output is a cluster ID for each name. Similar names are given the same cluster ID. The output may then be sorted by cluster ID to group together possible duplicate names.
Important
Do not use the Analytics language identification endpoint to determine the source language of names. Name deduplication uses an algorithm specifically tuned to identify the language of names rather than general text.
Request
A name-deduplication request requires only a list of names; the threshold is optional. For better accuracy, we strongly recommend specifying the source language when it is known. If you do not specify the source language, the name similarity algorithm guesses the language.
Option | Type | Description | Required |
---|---|---|---|
threshold | number | Determines how much variation between names is permitted in any given cluster. A higher value permits less variation within a cluster. | no |
{ "names": [ { "entityType": "PERSON", "language": "string", "script": "string", "text": "string" } ], "threshold": 0.75 }
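The request body above can be assembled programmatically. The following Python sketch builds the payload and prepares a request without sending it. Several details are assumptions rather than documented facts: that the endpoint accepts a JSON `POST`, that only `text` is needed for each name, that the API key travels in a header named `X-BabelStreetAPI-Key`, and that `eng` is an acceptable language code. The helper names `build_payload` and `build_request` are hypothetical.

```python
import json
import urllib.request

ENDPOINT = "https://analytics.babelstreet.com/rest/v1/name-deduplication"
MAX_NAMES_PER_CALL = 1000  # limit stated in this document


def build_payload(names, threshold=None, language=None):
    """Return the request body as a dict.

    `names` is a list of plain strings. `language` (optional) is applied to
    every name, which assumes the whole list shares one source language.
    """
    if len(names) > MAX_NAMES_PER_CALL:
        raise ValueError(f"at most {MAX_NAMES_PER_CALL} names per call")
    entries = []
    for text in names:
        entry = {"text": text}
        if language is not None:
            entry["language"] = language
        entries.append(entry)
    payload = {"names": entries}
    if threshold is not None:
        payload["threshold"] = threshold
    return payload


def build_request(payload, api_key, key_header="X-BabelStreetAPI-Key"):
    """Prepare (but do not send) the HTTP request.

    The POST method and the header name are assumptions; verify both
    against your account documentation before use.
    """
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", key_header: api_key},
        method="POST",
    )


payload = build_payload(["John Smith", "Jon Smythe"], threshold=0.75, language="eng")
request = build_request(payload, api_key="YOUR_API_KEY")
```

Passing `request` to `urllib.request.urlopen` would send it. Checking the 1000-name limit on the client side avoids a round trip for lists that are too long.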
Response
Input (names) | Output (cluster ID) |
---|---|
John Smith | 1 |
Cyndi McBoysen | 2 |
Dmitri Shostakovich | 4 |
Jim Hockenberry | 3 |
Takeshi Suzuki | 5 |
Jon Smythe | 1 |
James Hawkenbury | 3 |
Cindy MacBoysen | 2 |
Дми́трий Шостако́вич | 4 |
{ "results": [ "string" ] }
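To find possible duplicates, the cluster IDs can be paired back up with the input names. The Python sketch below does this with the example table from this section. It assumes that `results` lists one cluster ID per input name, in the same order as the request; the function name `group_by_cluster` is hypothetical.

```python
from collections import defaultdict


def group_by_cluster(names, results):
    """Map each cluster ID to the input names that received it.

    Assumes results[i] is the cluster ID assigned to names[i].
    """
    if len(names) != len(results):
        raise ValueError("expected one cluster ID per input name")
    clusters = defaultdict(list)
    for name, cluster_id in zip(names, results):
        clusters[cluster_id].append(name)
    return dict(clusters)


# Names and cluster IDs taken from the example table in this section.
names = [
    "John Smith", "Cyndi McBoysen", "Dmitri Shostakovich",
    "Jim Hockenberry", "Takeshi Suzuki", "Jon Smythe",
    "James Hawkenbury", "Cindy MacBoysen", "Дми́трий Шостако́вич",
]
response = {"results": ["1", "2", "4", "3", "5", "1", "3", "2", "4"]}

clusters = group_by_cluster(names, response["results"])

# Clusters with more than one member are the possible duplicates.
duplicates = {cid: members for cid, members in clusters.items() if len(members) > 1}
```

Here `duplicates` holds four clusters, and "Takeshi Suzuki" is left out because no other name shares its cluster ID.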
Supported languages
GET /name-deduplication/supported-languages
Retrieve the language pairs supported by the name deduplication endpoint. The endpoint supports matching between the source and target of each pair. The language, script, and transliteration scheme are listed for each source and target.
Response
Field | Type | Description |
---|---|---|
transliterationScheme | string | |
script | string | Four-letter ISO-15924 script code |
language | string | ISO 639 language code |
licensed | boolean | Indicates if you are licensed for this language |
{ "supportedLanguagePairs": [ { "source": { "transliterationScheme": "string", "script": "string", "language": "string" }, "target": { "transliterationScheme": "string", "script": "string", "language": "string" }, "licensed": true } ] }
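A client may want to check which pairs it is licensed for before sending names. The Python sketch below filters a parsed response down to its licensed pairs. The sample data is illustrative only and was not returned by the service; in particular the `transliterationScheme` value is a placeholder. The function name `licensed_pairs` is hypothetical.

```python
def licensed_pairs(response):
    """Return ((language, script), (language, script)) for each licensed pair."""
    pairs = []
    for pair in response.get("supportedLanguagePairs", []):
        if pair.get("licensed"):
            source, target = pair["source"], pair["target"]
            pairs.append(
                ((source["language"], source["script"]),
                 (target["language"], target["script"]))
            )
    return pairs


# Illustrative sample shaped like the response schema; not real service output.
sample = {
    "supportedLanguagePairs": [
        {
            "source": {"transliterationScheme": "NATIVE", "script": "Latn", "language": "eng"},
            "target": {"transliterationScheme": "NATIVE", "script": "Cyrl", "language": "rus"},
            "licensed": True,
        },
        {
            "source": {"transliterationScheme": "NATIVE", "script": "Latn", "language": "eng"},
            "target": {"transliterationScheme": "NATIVE", "script": "Latn", "language": "eng"},
            "licensed": False,
        },
    ]
}

available = licensed_pairs(sample)
```

With this sample, `available` contains only the first pair, because the second is marked as unlicensed.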