Identity Deduplication
Introduction
Identity Deduplication analyzes your data, identifies duplicate records, and creates a report listing the duplicates.
The deduplication process is highly configurable, allowing you to optimize it for your data and deduplication guidelines.
Each record is normalized and enriched with phonetic and linguistic data.
Similar records are identified using a locality sensitive hashing (LSH) algorithm.
A similarity score is calculated for each pair of similar records using Babel Street Match for Elasticsearch.
Duplicates are identified by finding the clusters of records that score above a given threshold.
The process can be paused and resumed at any time.
Getting Started
Requirements
An Elasticsearch cluster with Match for Elasticsearch installed. The cluster should be robust enough to index all your data. Refer to the Elasticsearch documentation for instructions on setting up a cluster and increasing virtual memory.
As a rule of thumb, each node should have at least 10x as much disk space as the size of the input data file. For example, if you have a 5GB data file, it will require at least 50GB to store as an Elasticsearch index.
A jsonl (newline-delimited JSON) file containing your data. The data can be in a single file or split among multiple files.
Important
The input data cannot contain any curly braces ({ }) within field values.
Note
The data must be valid JSON (no illegal backslash escape sequences, extraneous curly braces, etc.). We also recommend removing any "|" characters. The Match for Elasticsearch plugin uses this character to indicate a fielded name; it may result in unexpected behavior, especially when included in non-person name fields.
Basic (username/password) and API authentication for the Elasticsearch cluster are supported.
Installing Identity Deduplication
When you obtain Identity Deduplication, you will receive the following files:
The product file: identity-dedupe-<version>.zip
The license file: rlp-license.xml
This documentation file: identity-deduplication-<version>.pdf
Download and unpack the zip file identity-dedupe-<version>.zip.
Place the license file in the bt-root/rlp/rlp/licenses directory.
Configuring Identity Deduplication
Identity deduplication requires tuning to get optimal results for your specific data. We recommend you spend some time on a sample of the data to configure the mapping weights, the Match parameters, and the Elasticsearch cluster size to ensure that the results are what you expect.
The max-records parameter can limit the number of records used in your initial runs. By running Identity Deduplication on only 1K - 20K records, you'll have a large enough sample set to understand the speed and accuracy you can expect with your configuration.
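For example, an initial tuning run might cap the sample size as sketched below. This is a minimal illustration; only the max-records parameter name comes from this guide, and the assumption that it is set in dedupe.yaml alongside the other deduplication parameters, as well as the value shown, are illustrative.
# dedupe.yaml (excerpt) - illustrative value for an initial tuning run
max-records: 5000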
You can modify any of the following to get your optimal results:
Weight: Mapping weight controls how much impact each field of a record has on the similarity score. You may decide to decrease the weight of less important fields, or ignore some completely.
Match parameters: Identity Deduplication uses Match for Elasticsearch to calculate the similarity scores between two records. Match contains many parameters to tune similarity scores for your specific languages and data.
Elasticsearch cluster size: The cluster must be large enough to index all the data. The speed of the cluster will directly impact the speed of deduplication; a larger cluster will be faster.
Deduplication parameters: Set the similarity threshold, batch size, and other deduplication parameters in the dedupe.yaml file.
LSH parameters: Optimize the LSH algorithm for speed and accuracy.
Configuration files
The conf directory contains the following configuration files:
dedupe.yaml: This is where you set the value for the similarity threshold used to determine a duplicate, as well as other product settings. Update the input-data-directory property to point to the data files to be deduplicated. Update the dedupe-mapping-file property to point to the mapping file for the data.
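For example, a minimal dedupe.yaml excerpt might look like the following. The two property names are taken from this guide; the values are placeholders for your own paths.
# dedupe.yaml (excerpt) - point the product at your data and mapping
input-data-directory: /path/to/jsonl/data
dedupe-mapping-file: conf/my-mapping.yaml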
elasticsearch.yaml: This is where you set the Elasticsearch configuration settings, including the URL of the host and the authentication scheme. If your Elasticsearch cluster requires authentication in order to connect, uncomment and fill in the appropriate properties with authentication credentials.
Note
Authentication credentials can also be set as environment variables or system properties instead of storing the authentication information in plain text.
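As a rough sketch, the file pairs the host URL with optional, commented-out credentials. The property names below are hypothetical and may differ from those in the shipped elasticsearch.yaml; refer to the comments in the file itself for the exact names.
# elasticsearch.yaml (sketch; property names are hypothetical)
url: https://my-cluster.example.com:9200
# Uncomment and fill in if the cluster requires authentication:
# username: dedupe_user
# password: changeme
# api-key: <API key value>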
lsh_parameters.yaml: Various parameters used by the LSH algorithm. Modify these parameters to balance accuracy, precision, and speed. Each parameter is described fully in the file.
Creating a mapping file
The mapping file is a yaml file describing the data type and the weight of each field in your data. The file should be in the following format:
fields:
  <field_name_1>:
    type: <type>
    weight: <weight>
  <field_name_2>:
    type: <type>
    weight: <weight>
  ...
Example:
# MAPPING file:
name:
  type: "PERSON"
  weight: 1
dob:
  type: "DATE"
  weight: 1
where the data file contains the following sample data:
# JSONL file:
{"name": "Sara Connor", "dob": "2004-05-06"}
{"name": "Alice Smith", "dob": "1988-07-08"}
{"name": "Bob Johnson", "dob": "1975-09-10"}
{"name": "Adam Smith", "dob": "1988-07-08"}
{"name": "Robert Johnson", "dob": "1975-09-10"}
{"name": "Emily Parker", "dob": "1995-03-15"}
The weight controls how strongly each field influences the similarity score between the two records. The weight is an integer. A larger weight will have a larger impact on the score. A weight of 0 means the field will be ignored. The weights are relative to each other; that is, a field with a weight of 2 impacts the score twice as much as a field with a weight of 1.
Example: Given the following mapping file, the weights will be calculated from a full weight of 8. The name field is 4/8 or 50% of the score, the date field is 1/8 or 12.5% of the score, and the address field is 3/8 or 37.5% of the score.
name:
  type: "PERSON"
  weight: 4
date:
  type: "DATE"
  weight: 1
address:
  type: "ADDRESS"
  weight: 3
LSH weights
You can specify a separate LSH weight in the mapping file if you want the field treated differently by the LSH algorithm than in final scoring. By default, LSH uses the same weight as regular scoring. The LSH algorithm only cares whether the field is included or not; the only valid values for lshWeight are 0 and 1.
To set the LSH value, add a third line under type and weight: lshWeight: <0/1>.
To ignore a field that has a weight > 0, set lshWeight: 0.
To include a field that has a weight of 0, set lshWeight: 1.
Example:
To remove the date from the LSH calculation:
name:
  type: "PERSON"
  weight: 4
date:
  type: "DATE"
  weight: 1
  lshWeight: 0
address:
  type: "ADDRESS"
  weight: 3
The data type determines how each field is handled during processing and scoring. The supported data types are:
| Data type | Description | Example |
|---|---|---|
| PERSON | Name of a person | John H. Smith |
| ORGANIZATION | Name of a company, institution, or other organization | ACME Co. |
| ADDRESS | A street address | 123 Main St., Smallville, USA 12345 |
| LOCATION | A geographical location, such as a country, city, landmass, or body of water | San Francisco or Atlantic Ocean |
| DATE | A calendar date | April 1, 1985 or 2001-05-06 |
| EMAIL | An email address | johnsmith28@yahoo.com |
| PHONE | A phone number | +86 13996753420 or (564) 332-8457 |
| TEXT | A block of unstructured text, such as a description | Babel Street provides the most advanced data analytics and intelligence platform for the world’s most trusted government and commercial brands. |
| MISC | Any data that doesn't fit into one of the above categories, such as numerical data, unique identifiers, URLs, etc. MISC data is not normalized and is scored using edit distance | 43, www.worldwidestuff.com, XXX-DR-3245 |
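For instance, a hypothetical mapping that uses some of these additional types might look like the following, following the same pattern as the earlier examples; the field names and weights are illustrative.
email:
  type: "EMAIL"
  weight: 2
phone:
  type: "PHONE"
  weight: 2
customer_id:
  type: "MISC"   # MISC data is not normalized; it is scored using edit distance
  weight: 1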
Running Identity Deduplication
Install Identity Deduplication.
Create the mapping file. All fields in the input file must have a corresponding field with the same type in the mapping file.
Modify the configuration files.
Start the Elasticsearch cluster.
From the unzipped dedupe directory, run
bin/dedupe.
Analyze the reports. The report directory is set in the dedupe.yaml file. Two files are generated:
A JSONL file containing detailed information about identified duplicate clusters
A summary JSON file with statistics about the deduplication process
Reports
The deduplication_report_<timestamp>.jsonl file contains detailed information about the duplicate clusters identified:
{
"records": [
{
"recordId": "9DITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8jITU5kB8ADBDGQaM4mA": 0.95867
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "Jim Jones"
}
}
},
{
"recordId": "8zITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8jITU5kB8ADBDGQaM4mA": 0.90151066
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "James Earl Jones"
}
}
},
{
"recordId": "8jITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8zITU5kB8ADBDGQaM4mA": 0.90151066,
"9DITU5kB8ADBDGQaM4mA": 0.95867
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "James Jones"
}
}
}
]
}
The deduplication_statistics_<timestamp>.json file contains summary statistics about the deduplication process, including the percentage of duplicates identified (in the example below, 3 duplicates out of 11 records processed, or roughly 27%):
{
"totalDuplicateClusters" : 2,
"percentageDuplicatesFound" : 0.2727272727272727,
"totalDuplicatesFound" : 3,
"averageDuplicateGroupSize" : 2.5,
"totalRecordsProcessed" : 11
}