Identity Deduplication
Introduction
Identity Deduplication analyzes your data, identifies duplicate records, and creates a report listing the duplicates.
The deduplication process is highly configurable, allowing you to optimize it for your data and deduplication guidelines.
Each record is normalized and enriched with phonetic and linguistic data.
Similar records are identified using a locality sensitive hashing (LSH) algorithm.
A similarity score is calculated for each pair of similar records using Babel Street Match for Elasticsearch.
Duplicates are identified by finding the clusters of records that score above a given threshold.
The process can be paused and resumed at any time.
Getting Started
Requirements
An Elasticsearch cluster with Match for Elasticsearch installed. The cluster should be robust enough to index all your data. Refer to the Elasticsearch documentation for instructions on setting up a cluster and increasing virtual memory.
As a rule of thumb, each node should have at least 10x as much disk space as the size of the input data file. For example, if you have a 5GB data file, it will require at least 50GB to store as an Elasticsearch index.
A jsonl (newline-delimited JSON) file containing your data. The data can be in a single file or split among multiple files.
Important
The input data cannot contain any curly braces ({ }) within field values.
Note
The data must be valid JSON (no illegal backslash escape sequences, extraneous curly braces, etc.). We also recommend removing any "|" characters. The Match for Elasticsearch plugin uses this character to indicate a fielded name; it may result in unexpected behavior, especially when included in non-person name fields.
Basic (username/password) and API authentication for the Elasticsearch cluster are supported.
Installing Identity Deduplication
When you obtain Identity Deduplication, you will receive the following files:
The product file: identity-dedupe-<version>.zip
The license file: rlp-license.xml
This documentation file: identity-deduplication-<version>.pdf
Download and unpack the zip file identity-dedupe-<version>.zip.
Place the license file in the bt-root/rlp/rlp/licenses directory.
Configuring Identity Deduplication
Identity deduplication requires tuning to get optimal results for your specific data. We recommend you spend some time on a sample of the data to configure the mapping weights, the Match parameters, and the Elasticsearch cluster size to ensure that the results are what you expect.
The max-records parameter can limit the number of records used in your initial runs. By running Identity Deduplication on only 1K - 20K records, you'll have a large enough sample set to understand the speed and accuracy you can expect with your configuration.
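For example, an initial tuning run might cap the sample size as sketched below. This is a minimal illustration; only the max-records parameter name comes from this guide, and the assumption that it is set in dedupe.yaml alongside the other deduplication parameters, as well as the value shown, are illustrative.
# dedupe.yaml (excerpt) - illustrative value for an initial tuning run
max-records: 5000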
You can modify any of the following to get your optimal results:
Weight: Mapping weight controls how much impact each field of a record has on the similarity score. You may decide to decrease the weight of less important fields, or ignore some completely.
Match parameters: Identity Deduplication uses Match for Elasticsearch to calculate the similarity scores between two records. Match contains many parameters to tune similarity scores for your specific languages and data.
Elasticsearch cluster size: The cluster must be large enough to index all the data. The speed of the cluster will directly impact the speed of deduplication; a larger cluster will be faster.
Deduplication parameters: Set the similarity threshold, batch size, and other deduplication parameters in the dedupe.yaml file.
LSH parameters: Optimize the LSH algorithm for speed and accuracy.
Configuration files
The conf directory contains the following configuration files:
dedupe.yaml: This is where you set the value for the similarity threshold used to determine a duplicate, as well as other product settings. Update the input-data-directory property to point to the data files to be deduplicated. Update the dedupe-mapping-file property to point to the mapping file for the data.
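For example, a minimal dedupe.yaml excerpt might look like the following. The two property names are taken from this guide; the values are placeholders for your own paths.
# dedupe.yaml (excerpt) - point the product at your data and mapping
input-data-directory: /path/to/jsonl/data
dedupe-mapping-file: conf/my-mapping.yaml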
elasticsearch.yaml: This is where you set the Elasticsearch configuration settings, including the URL of the host and the authentication scheme. If your Elasticsearch cluster requires authentication in order to connect, uncomment and fill in the appropriate properties with authentication credentials.
Note
Authentication credentials can also be set as environment variables or system properties instead of storing the authentication information in plain text.
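As a rough sketch, the file pairs the host URL with optional, commented-out credentials. The property names below are hypothetical and may differ from those in the shipped elasticsearch.yaml; refer to the comments in the file itself for the exact names.
# elasticsearch.yaml (sketch; property names are hypothetical)
url: https://my-cluster.example.com:9200
# Uncomment and fill in if the cluster requires authentication:
# username: dedupe_user
# password: changeme
# api-key: <API key value>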
lsh_parameters.yaml: Various parameters used by the LSH algorithm. Modify these parameters to balance accuracy, precision, and speed. Each parameter is described fully in the file.
Creating a mapping file
The mapping file is a yaml file describing the data type and the weight of each field in your data. The file should be in the following format:
fields:
  <field_name_1>:
    type: <type>
    weight: <weight>
  <field_name_2>:
    type: <type>
    weight: <weight>
  ...
Example:
# MAPPING file:
name:
  type: "PERSON"
  weight: 1
dob:
  type: "DATE"
  weight: 1
where the data file contains the following sample data:
# JSONL file:
{"name": "Sara Connor", "dob": "2004-05-06"}
{"name": "Alice Smith", "dob": "1988-07-08"}
{"name": "Bob Johnson", "dob": "1975-09-10"}
{"name": "Adam Smith", "dob": "1988-07-08"}
{"name": "Robert Johnson", "dob": "1975-09-10"}
{"name": "Emily Parker", "dob": "1995-03-15"}
The weight controls how strongly each field influences the similarity score between the two records. The weight is an integer. A larger weight will have a larger impact on the score. A weight of 0 means the field will be ignored. The weights are relative to each other; that is, a field with a weight of 2 impacts the score twice as much as a field with a weight of 1.
Example: Given the following mapping file, the weights will be calculated from a full weight of 8. The name field is 4/8 or 50% of the score, the date field is 1/8 or 12.5% of the score, and the address field is 3/8 or 37.5% of the score.
name:
  type: "PERSON"
  weight: 4
date:
  type: "DATE"
  weight: 1
address:
  type: "ADDRESS"
  weight: 3
LSH weights
You can specify a separate LSH weight in the mapping file if you want the field treated differently by the LSH algorithm than in final scoring. By default, LSH uses the same weight as regular scoring. The LSH algorithm only cares whether the field is included or not; the only valid values for lshWeight are 0 and 1.
To set the LSH value, add a third line under type and weight: lshWeight: <0/1>.
To ignore a field that has a weight > 0, set lshWeight: 0.
To include a field that has a weight of 0, set lshWeight: 1.
Example:
To remove the date from the LSH calculation:
name:
  type: "PERSON"
  weight: 4
date:
  type: "DATE"
  weight: 1
  lshWeight: 0
address:
  type: "ADDRESS"
  weight: 3
The data type determines how each field is handled during processing and scoring. The supported data types are:
| Data type | Description | Example |
|---|---|---|
| PERSON | Name of a person | John H. Smith |
| ORGANIZATION | Name of a company, institution, or other organization | ACME Co. |
| ADDRESS | A street address | 123 Main St., Smallville, USA 12345 |
| LOCATION | A geographical location, such as a country, city, landmass, or body of water | San Francisco or Atlantic Ocean |
| DATE | A calendar date | April 1, 1985 or 2001-05-06 |
| EMAIL | An email address | johnsmith28@yahoo.com |
| PHONE | A phone number | +86 13996753420 or (564) 332-8457 |
| TEXT | A block of unstructured text, such as a description | Babel Street provides the most advanced data analytics and intelligence platform for the world’s most trusted government and commercial brands. |
| MISC | Any data that doesn't fit into one of the above categories, such as numerical data, unique identifiers, URLs, etc. MISC data is not normalized and is scored using edit distance | 43, www.worldwidestuff.com, XXX-DR-3245 |
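For instance, a hypothetical mapping that uses some of these additional types might look like the following, following the same pattern as the earlier examples; the field names and weights are illustrative.
email:
  type: "EMAIL"
  weight: 2
phone:
  type: "PHONE"
  weight: 2
customer_id:
  type: "MISC"   # MISC data is not normalized; it is scored using edit distance
  weight: 1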
Running Identity Deduplication
Install Identity Deduplication.
Create the mapping file. All fields in the input file must have a corresponding field with the same type in the mapping file.
Modify the configuration files.
Start the Elasticsearch cluster.
From the unzipped dedupe directory, run
bin/dedupe.
Analyze the reports. The report directory is set in the dedupe.yaml file. Two files are generated:
A JSONL file containing detailed information about identified duplicate clusters
A summary JSON file with statistics about the deduplication process
Reports
The deduplication_report_<timestamp>.jsonl file contains detailed information about the duplicate clusters identified:
{
"records": [
{
"recordId": "9DITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8jITU5kB8ADBDGQaM4mA": 0.95867
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "Jim Jones"
}
}
},
{
"recordId": "8zITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8jITU5kB8ADBDGQaM4mA": 0.90151066
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "James Earl Jones"
}
}
},
{
"recordId": "8jITU5kB8ADBDGQaM4mA",
"similarityScores": {
"8zITU5kB8ADBDGQaM4mA": 0.90151066,
"9DITU5kB8ADBDGQaM4mA": 0.95867
},
"customerRecord": {
"fields": {
"dob": "01/23/1965",
"name": "James Jones"
}
}
}
]
}
The deduplication_statistics_<timestamp>.json file contains summary statistics about the deduplication process, including the percentage of duplicates identified (in the example below, 3 duplicates out of 11 records processed, or roughly 27%):
{
"totalDuplicateClusters" : 2,
"percentageDuplicatesFound" : 0.2727272727272727,
"totalDuplicatesFound" : 3,
"averageDuplicateGroupSize" : 2.5,
"totalRecordsProcessed" : 11
}