Match Identity

RNI-Elasticsearch Plugin Guide

Introduction

RNI-Elasticsearch is an Elasticsearch[1] plugin for building fuzzy name retrieval and name matching applications for persons, locations, and organizations. It uses the Rosette Name Indexer (RNI) to provide high-speed, scalable, cross-language, and cross-script searches, and uses the Elasticsearch full-text search engine to store the names and search keys.

This guide describes how to use the RNI-Elasticsearch plugin and RNI features, and is not intended to be a complete guide to Elasticsearch.

Supported Platforms

RNI-Elasticsearch is supported on the following operating systems and CPUs.

Table 1. Supported Platforms

OS                             CPU
Mac OS X v10.9+ (Darwin 13)    AMD64
Linux                          AMD64
Linux                          AARCH64
Windows                        AMD64
Java Only                      Any OS and CPU with 64-bit Java SDK 17 through 20



Language support

RNI can match names in any language. For the languages listed in Fully supported text domains for name matching, RNI calculates a match score using a variety of techniques, as described in Understanding name match scores. For languages not listed in those tables, RNI provides limited support, as described in Language support parameters.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Interpreting RNI scores

Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.

RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).

The match score is a relative indication of how similar the two names are; it is not an absolute value. When comparing different name matches, the relative ranking of the scores is more meaningful than the actual score values. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding name match scores.

Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:

Variation                              Example(s)
Phonetic and/or spelling differences   Nayif Hawatmeh and Nayif Hawatma
Missing name components                Mohammad Salah and Mohammad Abd El-Hamid Salah
Rarity of a shared name component      Two English names that contain Ditters are more likely to match than two names that contain Smith
Initials                               John F. Kennedy and John Fitzgerald Kennedy
Nicknames                              Bobby Holguin and Robert Holguin
"Cousin" or cognate names              Pedro Calzon and Peter Calzon
Uppercase/Lowercase                    Rosa Elena PACHECO and Rosa Elena Pacheco
Reordered name components              Zedong Mao and Mao Zedong
Variable Segmentation                  Henry Van Dick and Henri VanDick, Robert Smith and Robert JohnSmyth
Corresponding name fields              For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox]
Truncation of name elements            For Sawyer, the similarity with Sawy is higher than the similarity with Sawi.

Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

You can configure RNI to customize how it scores different match phenomena.

The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).

Installing RNI-Elasticsearch

Important

The RNI Elasticsearch plugin does not work with the AWS managed elastic service.

Note

You will need read/write permissions on the host system to install RNI-Elasticsearch.

Note

For security reasons, Elasticsearch is only accessible locally by default; this is suitable for a local dev/test environment. For accessible dev/test/production server instances, users need to follow the steps outlined in the Elasticsearch reference manual to enable the appropriate network settings for their specific instance.

Note

When installing on top of an existing version of RNI-ES without reindexing, some changes to the software may not function with an index created by the previous installation, and many newly developed features won't work until you reindex. For the best experience, Babel Street recommends reindexing on every installation.

To use RNI-Elasticsearch you need the RNI Elasticsearch plugin, an RLP license file (rlp-license.xml) and Elasticsearch. [2]

Note

If you are using the Linux distribution of RNI-ES, note that glibc is required. The version of glibc that the native libraries are built against can be found in the filename of the distributed package.

  1. If you do not already have it, install Elasticsearch using the setup instructions for the appropriate version.

    Download and unzip Elasticsearch-<version>.zip.

    Important

    The version of Elasticsearch must match the first three digits of the version of the RNI-ES plugin. If your version of Elasticsearch does not match the plugin version, the plugin will not install.

    Example:

    • Elasticsearch version: 7.5.2

    • RNI-ES plugin version: 7.5.2.x where x is an integer

  2. Install the plugin.

    Navigate to the elasticsearch-<version> root directory and run the install command.

    On Unix, Linux, and macOS:

    bin/elasticsearch-plugin install file:///path/to/rni-es-<version>.x.zip

    On Windows:

    bin\elasticsearch-plugin install file:///C:\path\to\rni-es-<version>.x.zip

    Note

    You must use the absolute file path to refer to the plugin zip file. For example, if the file is in the home directory of rniUser on macOS, the command would be:

    bin/elasticsearch-plugin install file:///Users/rniUser/rni-es-<version>.x.zip

    You may be prompted to grant permissions necessary for the plugin to function.

    The plugin is now in plugins/rni.

    Note

    For Windows users, you must add

    bin\elasticsearch-<version>\plugins\rni\bt_root\rlp\bin\*

    to your PATH environment variable. In this case, you must replace * with the name of the subdirectory which contains platform-specific binary library files (for example, amd64-w64-msvc120).

    Additionally, the RNI-Elasticsearch plugin cannot be installed into distributions of Elasticsearch found in the C:\Program Files directory.

  3. Copy the RLP License (rlp-license.xml) to plugins/rni/bt_root/rlp/rlp/licenses.

    This license must be in place before you can use the RNI-Elasticsearch plugin.
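The version rule from step 1 can be checked before you run the installer. The following is a minimal sketch, assuming version strings of the form shown in the example above (7.5.2 for Elasticsearch and 7.5.2.x for the plugin). The function name is hypothetical and is not part of the plugin or of Elasticsearch.

```python
def plugin_matches_elasticsearch(es_version: str, plugin_version: str) -> bool:
    """Return True when the first three components of the plugin version
    equal the Elasticsearch version (for example 7.5.2 and 7.5.2.4)."""
    es_parts = es_version.strip().split(".")
    plugin_parts = plugin_version.strip().split(".")
    # Assumption: the Elasticsearch version has exactly three components and
    # the plugin version carries at least one extra component after them.
    if len(es_parts) != 3 or len(plugin_parts) < 4:
        return False
    return plugin_parts[:3] == es_parts


print(plugin_matches_elasticsearch("7.5.2", "7.5.2.4"))  # True
print(plugin_matches_elasticsearch("7.5.1", "7.5.2.4"))  # False
```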

Note

If your index contains complex mappings or searches, including many fields or nested fields, you may need to increase the heap size as described in the Elasticsearch documentation.

To start the Elasticsearch server, run:

bin/elasticsearch

Note

When starting Elasticsearch with the plugin you may see some non-fatal error messages. If the error is followed by a message stating "Cluster health status changed from [RED] to [YELLOW]", the error can be ignored. This may occur when enableDynamicConfiguration is set to true.

Note

To utilize the overlay directory functionality, you must specify the overlay root location with the bt.overlay.root or bt.overlay.relative.root (relative to the plugin installation directory) system property at plugin startup time.

Example of an overlay root location:

ES_JAVA_OPTS="-Dbt.overlay.root=<overlay-root-path>" bin/elasticsearch

Example of a relative overlay root directory:

ES_JAVA_OPTS="-Dbt.overlay.relative.root=<relative-overlay-root-path>" bin/elasticsearch

If the overlay directory is not located within the Elasticsearch installation directory, you must add an entry to RNI's plugin-security.policy file giving read permissions to the directory.

libpostal data directory

RNI uses libpostal to parse addresses; libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.

RNI packages libpostal data in plugins/rni/bt_root/rlpnc/data/libpostal. The data directory is relatively large (~2 GB). If you are certain that you won't use address matching of unfielded addresses, you can safely delete the libpostal data directory without affecting any other RNI functionality.

Prepare the index

Important

Unless otherwise specified, all inputs to RNI need to be UTF-8 encoded.

Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
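One way to run this check is to try decoding the raw bytes before indexing. The sketch below is illustrative only; the function name is hypothetical. Note that it cannot detect text that was converted to plain ASCII with characters dropped or replaced, because ASCII is itself valid UTF-8.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


name = "Joaquín Guzmán"
print(is_valid_utf8(name.encode("utf-8")))    # True
print(is_valid_utf8(name.encode("utf-16")))   # False
print(is_valid_utf8(name.encode("latin-1")))  # False
```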

Elasticsearch provides real-time search and analytics for all kinds of data. The data is stored in documents, each having a set of fields, some of which are defined as search fields. An Elasticsearch index is a collection of these documents.

The RNI-Elasticsearch plugin uses an Elasticsearch index to store documents containing names, dates, addresses, or other fields to be matched.

Before using RNI to search for matches, you must create the index, define mappings, and load the index with documents.

  1. Create an index, or a searchable container for your documents.

  2. Define a mapping for fields that contain person, location, organization, or identifier entity types. The type of a name field to be searched by RNI is "rni_name". A mapping defines the data types of each of the searchable fields in a document. The mapping does not have to include every field in the document, just the searchable fields.

  3. Index documents that contain one or more name fields along with other fields of interest. This step loads the documents into the index.

  4. Test the RNI integration before continuing.

Once you've completed the above steps, you are ready to query the index.

The following snippets use the cURL command-line tool to illustrate the Elasticsearch commands for running the plugin.

You can also use Kibana, an open source dashboard for Elasticsearch.

Create an index

An Elasticsearch index consists of one or more documents, and a document contains one or more fields. A name index is an indexed list of names.

By default, Elasticsearch running locally listens at localhost:9200.

The following cURL statement creates an index named rni-test.

curl -XPUT 'http://localhost:9200/rni-test'

Define a mapping

A mapping defines how a document, along with the fields it contains, is stored and indexed and sets the types of the search fields. For name search fields, set the "type" of the name fields to "rni_name".

The following statement maps the "primary_name" and "aka" (also known as) fields in the document to the "rni_name" type in the "rni-test" index.

curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
    "properties" : {
        "primary_name" : { "type" : "rni_name" },
        "aka" : { "type" : "rni_name" },
        "occupation" : { "type" : "text" }
    }
}'

Previous to the RNI-RNT 7.35.1.c65.0 release (September 2021), entityType was not considered when querying. For example, a name with an entityType of PERSON could match a name with an entityType of ORGANIZATION. To return to this behavior, set the mapping parameter testEntityType to false. This will allow indexed names with any or no entityType to be returned, regardless of the entityType in the search.

Index documents

This is the step where you add your data, or documents, to the index. A document is a JSON object containing one or more fields. Each field in a document is defined as a key-value pair, where the key is the field and the value is the data.

Documents may include fields other than name fields.

curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
    "primary_name" : "Joe Schmoe",
    "aka" : "Bossman",
    "occupation" : "business owner"
}'

Name fields can include properties in addition to the name string (or "data" property). Properties are used when searching to optimize the search algorithms for the data. The "entityType" property is particularly important for name searching and customizations.

Property            Required    Description
data                Yes         The name string.
language                        ISO 639-3 code for the language of use: the language of the document in which the name was found.
languageOfOrigin                ISO 639-3 code for the language of origin of the name. For example, a name of Spanish origin (spa) may be found in an English (eng) document.
script                          ISO 15924 code for the script.
entityType                      Type of the name.
uid                             Unique string identifier for the document.
gender                          An explicitly defined gender for a name.

Example:

curl -XPUT 'http://localhost:9200/rni-test/_doc/3' -H'Content-Type: application/json' -d '{ "primary_name" : { "data" : "Joe Schmoe", "language" : "eng", "script" : "Latn", "entityType" : "PERSON" } }'
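If you build documents programmatically, the name object can be assembled from its properties. The following is a minimal sketch; the helper name make_name_field is hypothetical, and only the property names (data, entityType, language, script) come from the table above.

```python
import json
from typing import Optional


def make_name_field(data: str,
                    entity_type: Optional[str] = None,
                    language: Optional[str] = None,
                    script: Optional[str] = None) -> dict:
    """Build an rni_name field value, leaving out properties that are not set."""
    field = {"data": data}
    optional = {"entityType": entity_type, "language": language, "script": script}
    field.update({key: value for key, value in optional.items() if value is not None})
    return field


# Same document body as the curl example above.
document = {"primary_name": make_name_field("Joe Schmoe", "PERSON", "eng", "Latn")}
print(json.dumps(document))
```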

Tip

When creating a large set of documents, use Bulk insert for optimal performance.

Tip

You may need to wait a few minutes for the documents to be ready to query. Documents are not always immediately available from Elasticsearch after being added to the index.

Entity types

The entityType field identifies the type of name being matched and is used to select the algorithms for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.

Important

The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.

Table 2. Entity Types

Type: PERSON
Description: A human identified by name, nickname, or alias.
Features: Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported.

Type: LOCATION
Description: A city, state, country, region or other location.
Features: Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported.

Type: ORGANIZATION
Description: A corporation, institution, government agency, or other group of people defined by an established organizational structure.
Features: Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World IDs are supported.

Type: IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM
Description: An alphanumeric identifier.
Features: Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance.



Fielded names

You can process fielded names by separating the fields with "|". RNI assigns no explicit semantics to each field (such as given name or surname), but it does pay attention to the order of the fields when comparing two fielded names. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in name1 matches the second field in name2). Fields within a name can be empty.

When scoring a potential match between a name with fields and a name without fields, RNI treats the name without fields as if it were a name with a single field.

RNI treats trailing empty fields as if they were not present. For example "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".

Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.
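As an illustration of the fielded format, the sketch below joins separate field values into a single fielded name. The helper name is hypothetical. It drops trailing empty fields, which RNI treats as absent anyway, and it assumes that the field values themselves contain no "|" character.

```python
from typing import Sequence


def join_name_fields(fields: Sequence[str]) -> str:
    """Join name fields with "|" and drop trailing empty fields."""
    trimmed = list(fields)
    while trimmed and trimmed[-1] == "":
        trimmed.pop()
    return "|".join(trimmed)


print(join_name_fields(["Rosanne", "Taylor Smith", ""]))  # Rosanne|Taylor Smith
print(join_name_fields(["Rosanne", "", "Smith"]))         # Rosanne||Smith
```

Empty fields in the middle are kept, because the order of the fields affects the comparison.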

Names containing special characters

When using JSON objects with RNI, special characters in strings must be properly escaped. RNI requires a backslash to escape the special character, and JSON then requires another backslash to escape the first backslash. Thus, in RNI, the proper escape sequence for names containing a special character is a double backslash (\\).

The | used to separate the fields of a fielded name is one example of a special character that can also appear within a name. To process the vertical bar correctly, RNI needs to distinguish between a fielded name and a name that contains a literal vertical bar character.

Suppose a name includes a | that does not indicate a fielded name: "John|Smith". RNI requires that you escape the vertical bar with a backslash, e.g. "John\|Smith". JSON then requires that the backslash character itself be escaped with a backslash. The correct syntax for the name "John|Smith" is therefore "John\\|Smith". If the entry represented a fielded name, the correct syntax would be "John|Smith" without any backslashes.
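The two escaping layers can be seen separately in code. In this sketch, the hypothetical helper escape_vertical_bar adds the backslash that RNI expects, and the standard json module adds the second backslash during serialization. It assumes the input is a raw, unescaped name.

```python
import json


def escape_vertical_bar(name: str) -> str:
    """Escape literal "|" characters so RNI does not read them as field separators."""
    return name.replace("|", "\\|")


rni_name = escape_vertical_bar("John|Smith")
print(rni_name)              # John\|Smith
print(json.dumps(rni_name))  # "John\\|Smith"
```

If you serialize request bodies with a JSON library, you only need to add the RNI backslash yourself; the library supplies the JSON-level escape.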

Verify the RNI SDK version

To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version:

curl -XGET 'localhost:9200/rni-test/rni_plugin/_get_version'

This call also verifies that the RNI plugin is installed and running successfully.

Bulk insert

Bulk insert allows you to add multiple documents to Elasticsearch in a single API call, improving the throughput for uploading documents by orders of magnitude. We recommend you use bulk indexing to create and index your data wherever possible.

  1. Create the index.

  2. Define the mapping.

  3. Run Bulk Insert.

Tip

Do not perform any queries or searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.

The structure for all Elasticsearch bulk API calls is:

{ action_to_be_performed: { metadata_related_to_action}}\n
{ request_body_data_to_index }\n

Bulk insert example

We're going to continue the example that we started. The index is rni-test. The mapping defines a primary_name, aka, and occupation.

  1. Create the index.

    curl -X PUT http://localhost:9200/rni-test

  2. Define the mapping.

    The previously defined mapping:

    {
        "properties" : {
           "primary_name" : { "type" : "rni_name" },
           "aka" : { "type" : "rni_name" },
           "occupation" : { "type" : "text" }
        }
    }

    You can put the mapping in a JSON file and create it from the command line. The following curl command creates the mapping using a file (mapping.json in this example):

    curl -X PUT -H"Content-Type:application/json" -d @mapping.json http://localhost:9200/rni-test/_mapping

  3. Create a data file in newline delimited JSON (NDJSON) format. Save the file as bulknames.json. The file MUST end with a newline after the final record.

    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Joaquín Guzmán","entityType":"PERSON"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "René Lindström Jones","entityType":"PERSON"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Guadalupe Hernandez","entityType":"PERSON"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Chris Joseph Arsenault","entityType":"PERSON"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "ABC","entityType":"ORGANIZATION"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Basis Technology","entityType":"ORGANIZATION"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Australian Broadcasting Corporation","entityType":"ORGANIZATION"}}
    {"index":{"_index":"rni-test","_id":null}}
    {"primary_name":{"data": "Amazon","entityType":"ORGANIZATION"}}

  4. Use the _bulk method to load the data file with curl using the following command:

    curl -X POST -H"Content-Type:application/json" --data-binary @bulknames.json http://localhost:9200/rni-test/_bulk

    Note

    If you're providing text file input to curl, use the --data-binary flag instead of plain -d to preserve the newlines.
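For larger data sets you will probably generate the NDJSON file rather than write it by hand. The following sketch shows one way to do that; the function name is hypothetical. Unlike the file above, it omits "_id" from the action line, in which case Elasticsearch generates an ID for each document.

```python
import json
from typing import Iterable, Tuple


def to_bulk_ndjson(index: str, names: Iterable[Tuple[str, str]]) -> str:
    """Return an NDJSON bulk body: one action line and one document line per name."""
    lines = []
    for data, entity_type in names:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(
            {"primary_name": {"data": data, "entityType": entity_type}},
            ensure_ascii=False))  # keep non-ASCII names readable
    # The bulk API requires a newline after the final record.
    return "\n".join(lines) + "\n"


body = to_bulk_ndjson("rni-test", [("Joaquín Guzmán", "PERSON"),
                                   ("Amazon", "ORGANIZATION")])
print(body, end="")
```

When saving the result to a file such as bulknames.json, open the file with encoding="utf-8" so that the names stay UTF-8 encoded.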

Searching with queries

At this point you've created an index and loaded data. Now you can start using RNI to search for matches.

A query searches the index and returns a match score. In RNI, the query for a name consists of two parts, a base query and a rescorer.

The base query is a standard Elasticsearch query against a name field. The rescorer takes the results of the base query, and uses Elasticsearch rescoring to select the top candidates and perform pairwise matching on the top candidates.

The query returns an RNI match score (max_score), the score of the top scoring document.

Important

The entityType (PERSON, LOCATION, ORGANIZATION) must be added to a name query to utilize all RNI features. If you don't specify an entityType, the type NONE will be used and RNI may return less accurate results.

Base query

The base query is a standard query against a name field:

    "query" : {
        "match" : {
            "primary_name" : "Jo Shmoe"
        }
    }

Querying supports the same name properties that you may use when indexing documents. Unlike during document creation, you must pass the JSON object containing the name fields as a string. You should always include the entityType property in your query.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : {
            "primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}"
        }
    }
}'

Much like during indexing, RNI creates a set of keys based on the name and then generates a more complex internal query to match against the indexed keys.
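Because the name object must be passed as a string inside the match query, it is serialized once on its own and then again as part of the request body. The sketch below shows this double serialization; the function name is hypothetical.

```python
import json


def build_base_query(field: str, data: str, entity_type: str) -> dict:
    """Build a match query whose value is the name object serialized as a string."""
    name_as_string = json.dumps({"data": data, "entityType": entity_type})
    return {"query": {"match": {field: name_as_string}}}


body = build_base_query("primary_name", "Jo Shmoe", "PERSON")
# Serializing the whole body adds the backslash-escaped quotes seen in the curl example.
print(json.dumps(body, indent=2))
```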

Rescore with the RNI pairwise name match

The base query returns a ranked list of matching documents. The rescorer takes the top documents from the list, runs pairwise matching algorithms on those documents, and returns a re-ranked list. RNI has a custom rescorer that lets you further tune the candidates passed to the RNI pairwise matcher. Because pairwise matching is computationally intensive, you want to rescore just enough documents to find the best matches.

Elasticsearch rescoring includes the following parameters:

  • window_size (an integer, defaults to 10) specifies how many documents from the base query should be passed to the RNI pairwise matcher.

    Use this parameter to limit the number of compute-intensive name matches that need to be performed. If you set the value too high, the query will take too long, but if you set the value too low, you will increase the number of false negatives.

    Tip

    A good starting point for window_size is to make it the square root of the size of the index. For example, an index of 10,000 entries would use a window_size of 100.

  • query_weight (a float, defaults to 1.0) specifies the weighting of the score returned by the base query.

    In the context of RNI pairwise matching, the base query score has little meaning, so we suggest you set it to 0.0.

  • rescore_query_weight (a float, defaults to 1.0) specifies the weighting of the maximum RNI pairwise match score.

    If query_weight is 0.0 and rescore_query_weight is 1.0, the score that is returned by rescoring is the RNI pairwise match score.

  • score_mode controls how the query and rescore query scores are combined. The default value is total meaning that both scores are added together after being multiplied by their respective weights.
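The arithmetic behind these parameters is simple enough to sketch. The two functions below are illustrative helpers with hypothetical names: the first applies the square-root starting point from the tip above, and the second reproduces how the "total" score mode combines the two scores. The base score of 7.3 is a made-up value.

```python
import math


def suggested_window_size(index_size: int) -> int:
    """Starting point from the tip: the square root of the index size, rounded up."""
    return max(1, math.ceil(math.sqrt(index_size)))


def combined_score(base_score: float, rni_score: float,
                   query_weight: float = 1.0,
                   rescore_query_weight: float = 1.0) -> float:
    """Combine scores as score_mode "total" does: the weighted sum of both scores."""
    return query_weight * base_score + rescore_query_weight * rni_score


print(suggested_window_size(10_000))              # 100
print(combined_score(7.3, 0.80217975, 0.0, 1.0))  # 0.80217975
```

With query_weight set to 0.0, the base query score drops out and only the RNI pairwise score remains.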

In the following example, pairwise matching is performed on the top 200 names returned by the base query.

Example with RNI Rescorer:

    "rescore" : {
        "window_size" : 200,
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "name_score" : {
                        "field" : "primary_name",
                        "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }

The "name_score" function matches every name in the given field against the query name and returns the maximum score to the rescorer.

The "name_score" function score query must be given at least one object that specifies:

  • field: the search field being rescored which must be of type rni_name.

  • query_name: the name to match against the values of the search field.

The object passed to the name_score function can also include any of the name properties.

This example illustrates the full query incorporating both match and rescore, using RNI query parameters.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : {
            "primary_name" : "{\"data\" : \"Jo Shmoe\",\"entityType\" : \"PERSON\"}" 
        }
    },
    "rescore" : {
        "window_size" : 200,
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "name_score" : {
                        "field" : "primary_name",
                        "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

This query returns an RNI match score against "Joe Shmoe" in the "_score" field:

{
  "_index": "rni-test",
  "_type": "_doc",
  "_id": "1",
  "_score": 0.80217975,
  "_source": {
    "primary_name": "Joe Shmoe",
    "aka": "Bossman",
    "occupation": "business owner"
  }
}
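The full request shown earlier can also be assembled in code. The following is a sketch that mirrors the curl example; the function name and its parameters are hypothetical, and sending the body to Elasticsearch is left to whichever HTTP client you use.

```python
import json


def build_rni_search(field: str, data: str, entity_type: str,
                     window_size: int = 200) -> dict:
    """Build a search body with a base match query and the RNI name_score rescorer."""
    name = {"data": data, "entityType": entity_type}
    return {
        # The base query takes the name object as a JSON string.
        "query": {"match": {field: json.dumps(name)}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # The rescorer takes the name object directly.
                        "name_score": {"field": field, "query_name": name}
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


search_body = build_rni_search("primary_name", "Jo Shmoe", "PERSON")
print(json.dumps(search_body, indent=2))
```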

Representing arrays in Elasticsearch

If the name field in your documents is structured as an array, such as first name and last name fields, wrap the field in a nested object. The nested datatype allows arrays of objects to be indexed and queried independently of each other.

Since Elasticsearch flattens object hierarchies into a simple list of field names and values, if you don't use the nested type, you can lose the relationship between the fields. For example, the following document:

  "names" : [ 
    {
      "first" : "Joe",
      "last" :  "Smith"
    },
    {
      "first" : "Mike",
      "last" :  "Shmoe"
    }
  ]

would be transformed internally into a document that looks more like this:

{
  "names.first" : [ "mike", "joe" ],
  "names.last" :  [ "smith", "shmoe" ]
} 

The names.first and names.last fields are flattened into multi-value fields, and the association between Joe and Smith is lost. This document would incorrectly match a query for mike and smith.
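The effect can be reproduced with a small model. The code below is a toy simulation of flattening, not how Elasticsearch is implemented, and all function names are hypothetical. It shows why the flattened form matches "mike" with "smith" while a per-object comparison does not.

```python
from collections import defaultdict


def flatten(prefix, objects):
    """Mimic object flattening: collect each key's lowercased values into one list."""
    flat = defaultdict(list)
    for obj in objects:
        for key, value in obj.items():
            flat[f"{prefix}.{key}"].append(value.lower())
    return dict(flat)


def flat_match(flat, first, last):
    """Match against the flattened multi-value fields; object boundaries are lost."""
    return first in flat["names.first"] and last in flat["names.last"]


def nested_match(objects, first, last):
    """Match only when both values occur in the same object, as nested does."""
    return any(o["first"].lower() == first and o["last"].lower() == last
               for o in objects)


names = [{"first": "Joe", "last": "Smith"}, {"first": "Mike", "last": "Shmoe"}]
flat = flatten("names", names)
print(flat_match(flat, "mike", "smith"))     # True (incorrect cross-object match)
print(nested_match(names, "mike", "smith"))  # False
```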

If you wrap an array field in a nested object, you will get more accurate search results.

Include a field of type "nested" containing the name field in the mapping:

  "nested_names" : {
      "type" : "nested",
      "properties" : {
          "name" : { "type" :"rni_name" }
      }
  }

Multiple names can be added to the nested field:

{ 
    "nested_names" : [
        {
            "name" : "Joe Smith"
        },
        {
            "name" : "Mike Shmoe"
        }
    ]
}

Update the query to refer to the nested object. Set the "score_mode" to "max".

{
    "query" : {
        "nested" : {
            "path" : "nested_names",
            "query" : {
                "match" : {
                    "nested_names.name" : "Mike Shmoe"
                }
            }
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "nested" : {
                    "path" : "nested_names",
                    "score_mode" : "max",
                    "query" : {
                        "function_score" : {
                            "name_score" : {
                                "field" : "nested_names.name",
                                "query_name" : "Mike Shmoe"
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}

See the Elasticsearch documentation for more detailed information on nested objects and queries.

Nested names example

Let's consider an example of a database that includes alias names along with a primary name.

Nested Mapping:

"properties" : {
   "primary_name" : {"type" : "rni_name"},
   "aliases" : { 
      "type" : "nested",
      "properties" : { 
        "alias_name" : { "type" : "rni_name" } 
      }
   }
}

The curl command to create the mapping:

curl -XPUT "http://localhost:9200/rni-test/_mapping" -H 'Content-Type: application/json' -d '{
    "properties" : {
        "primary_name" : { "type" : "rni_name" },
        "aliases" : {
            "type" : "nested",
            "properties" : {
                "alias_name" : { "type" : "rni_name" }
            }
        }
    }
}'

Each record includes a primary name. Each primary name can have multiple aliases.

"primary_name" : "John Smith", 
"aliases": [
  {"alias_name": "John Shark"},
  {"alias_name": "Smithy"},
  {"alias_name": "Johnny boy"}
]

The curl command to add the data:

curl -XPOST "http://localhost:9200/rni-test/_doc" -H 'Content-Type: application/json' -d '{
    "primary_name" : "John Smith",
    "aliases" : [
        {"alias_name": "John Shark"},
        {"alias_name": "Smithy"},
        {"alias_name": "Johnny boy"}
    ]
}'

The query will try to match one of the aliases. Specify score_mode: max to return the highest match score of the aliases.

curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d '{
    "query" : {
        "nested" : {
            "path" : "aliases",
            "query" : {
                "match" : { "aliases.alias_name": "Johnny" }
            }
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "nested" : {
                    "path" : "aliases",
                    "score_mode" : "max",
                    "query" : {
                        "function_score" : {
                            "name_score" : {
                                "field" : "aliases.alias_name",
                                "query_name" : "Johnny"
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Sorting results by rni_name

Elasticsearch supports the ability to sort search results by the values of their document fields. In the case of RNI, one may want to sort on an rni_name field. Because these fields are internally composed of many subfields, it is necessary to specify the subfield to sort on. Below are a couple of subfields that you may be interested in:

IndexFields enum value    Raw subfield name         Example for "John' Smith"    Explanation
ORIGINAL_NAME_FIELD       bt_rni_name_original      John Smith                   original input data for the name
NORMALIZED_DATA_FIELD     bt_rni_name_normalized    john smith                   normalized name

As an example, if your field's name is primaryName, you can sort on the original name data by referring to primaryName.bt_rni_name_original in your sort specification.
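A search body that applies this sort might look like the sketch below. The function name is hypothetical, and the match_all query is only a placeholder for your own query; the sort clause uses the standard Elasticsearch field-and-order form with the subfield name from the table above.

```python
import json


def sort_by_original_name(field: str, order: str = "asc") -> dict:
    """Build a search body sorted on the original-name subfield of an rni_name field."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "query": {"match_all": {}},  # placeholder query
        "sort": [{f"{field}.bt_rni_name_original": {"order": order}}],
    }


print(json.dumps(sort_by_original_name("primaryName"), indent=2))
```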

In the Java API, these fields can be referenced through the IndexFields enum. Regarding the previous example, one could refer to the same subfield in Java:

"primaryName." + IndexFields.ORIGINAL_NAME_FIELD.fieldName()

Configuring name matching

There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are modifying match parameters and editing overrides. You can also train a custom language model.

Tuning match parameters

The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.

The typical process for tuning parameters is as follows:

  1. Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.

  2. After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.

  3. Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.

  4. Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.

  5. Tune the match parameters to change the match scores of the test set of undesirable results, so that each score falls on the correct side of your threshold. For name or address pairs that must match in a specific way and are very dissimilar (e.g., aliases), we recommend adding them as token or full-name overrides.

  6. Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
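Steps 3 and 4 can be sketched in Python as follows. This is only an illustration: the ScoredPair structure, the find_misses helper, and the scores in the sample data are hypothetical and are not RNI output.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    query: str
    candidate: str
    score: float        # match score returned by RNI (0 to 1)
    should_match: bool  # your own judgment for this pair


def find_misses(pairs, threshold):
    """Split tuning candidates into pairs scored too low and pairs scored too high."""
    too_low = [p for p in pairs if p.should_match and p.score < threshold]
    too_high = [p for p in pairs if not p.should_match and p.score >= threshold]
    return too_low, too_high


# Illustrative data only; the scores are invented for this sketch.
results = [
    ScoredPair("John Mike Smith", "John M Smith", 0.78, True),
    ScoredPair("John Mike Smith", "John Joe Smith", 0.83, False),
    ScoredPair("Abigail Harris", "Abby Harris", 0.91, True),
]

too_low, too_high = find_misses(results, threshold=0.80)
for pair in too_low:
    print("scored too low:", pair.query, "/", pair.candidate, pair.score)
for pair in too_high:
    print("scored too high:", pair.query, "/", pair.candidate, pair.score)
```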

Parameter configuration files

Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.

The parameters are stored in two .yaml files located in plugins/rni/bt_root/rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.

  • parameter_defs.yaml lists each match parameter along with its default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them could cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you not to exceed.

  • parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.

Important

Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.

Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.

Parameter profiles

The parameters in the parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the eng_eng profile. There is also an any profile which applies to all language pairs.

Parameter profiles have the following characteristics:

  • Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples:

    • spa_eng 

    • ara_jpn 

    • eng_eng 

  • They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.

  • Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.

  • The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.

  • Specific language profiles inherit values from global profiles. The profile for matching person names is named any_PERSON. The profile for matching Spanish person names against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
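The naming and inheritance rules above can be sketched in Python. This is one reading of those rules, not RNI's implementation: the function names are hypothetical, the parameter values are illustrative, and the lookup order (any, then any_TYPE, then the language pair, then the language pair with type) is an assumption based on the statement that any_PERSON does not override spa_eng.

```python
def profile_name(lang1, lang2, entity_type=None):
    """Build a profile name: codes in alphabetical order, with eng always last."""
    langs = sorted([lang1, lang2], key=lambda code: (code == "eng", code))
    name = "_".join(langs)
    return f"{name}_{entity_type}" if entity_type else name


def inheritance_chain(lang1, lang2, entity_type=None):
    """List profile names from least specific to most specific (assumed order)."""
    pair = profile_name(lang1, lang2)
    chain = ["any"]
    if entity_type:
        chain.append(f"any_{entity_type}")
    chain.append(pair)
    if entity_type:
        chain.append(f"{pair}_{entity_type}")
    return chain


def resolve(profiles, lang1, lang2, entity_type=None):
    """Merge parameter values so that more specific profiles win."""
    merged = {}
    for name in inheritance_chain(lang1, lang2, entity_type):
        merged.update(profiles.get(name, {}))
    return merged


# Illustrative values only.
profiles = {
    "any": {"reorderPenalty": 0.5, "deletionScore": 0.2},
    "any_PERSON": {"reorderPenalty": 0.3, "initialsScore": 0.9},
    "spa_eng": {"reorderPenalty": 0.4},
}

print(profile_name("eng", "spa"))
print(inheritance_chain("eng", "spa", "PERSON"))
print(resolve(profiles, "eng", "spa", "PERSON"))
```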

Important

Global changes are made with the any profile.

Any changes to address parameters should go under the any profile, and will affect all fields for all addresses.

Any changes to date parameters must go under the any profile.

Parameter universe

A parameter universe is a named profile containing a set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles.

For example, the MyParameterUniverse universe may include the following parameter profiles:

  • "name": "MyParameterUniverse/any" applies to all language pairs.

  • "name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.

  • "name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.

Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.

A parameter universe can also be defined dynamically. We recommend that you use dynamic parameter universes for testing and tuning only. For production use, add all parameter universes to the parameter_profiles.yaml file.

Tip

You can define multiple named parameter profiles.

Define the parameter universe in the parameter_profiles.yaml file. Example:

parameterUniverseOne/spa_eng_PERSON:
    reorderPenalty: 0.4
    HMMUsageThreshold: 0.8
    stringDistanceThreshold: 0.1
    useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
    reorderPenalty: 0.6
    
Using a parameter universe

To use a parameter universe, add it as part of the name_score function when rescoring names queried from the index. All parameter values defined in the parameter universe will be used, where appropriate.

curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{ 
  "query": {
    "match": {
      "full_name": "A Ely Taylor"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'

Parameter universes can also be used in the query phase. To do so, specify the query name as a JSON string and include the universe in the body.

Note

The parameter universe can only be used in the query phase in RNI-ES 8.6.2.0 and later.

curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{ 
  "query": {
    "match": {
      "full_name": "{ \"data\": \"A Ely Taylor\", \"universe\": \"parameterUniverseOne\"}"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'

Dynamic parameter universes

When tuning RNI, you can use the Parameters REST API endpoint to dynamically create or update a parameter universe, overriding the existing parameter values without having to restart Elasticsearch. Once the optimum values are determined for each parameter, add the parameter universe to the parameter_profiles.yaml file for production use.

Tip

Dynamic parameter universes are best suited for testing and tuning the RNI match parameters. Once you determine the best set of parameters, add the parameter universe to the parameter_profiles.yaml file for production use. Using dynamic parameter universes can slow your system down considerably.

Use the Parameters endpoint to create a parameter universe, with parameters and values.

curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'{ 
  "profiles": [
    {
      "name": "parameterUniverseOne/spa_eng_PERSON",
      "parameters": {
        "reorderPenalty": 0.4,
        "HMMUsageThreshold": 0.8,
        "stringDistanceThreshold": 0.1,
        "useEditDistanceTokenScorer": true
      }
    }
  ]
}'

The name of the parameter universe is parameterUniverseOne and it applies to matching person names between Spanish and English.
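When a universe contains several profiles, the request body can be generated rather than written by hand. The following Python sketch assumes the payload shape shown in the curl example above; the helper name universe_payload is hypothetical, and the eng_eng profile is included only to show a second entry.

```python
import json


def universe_payload(universe, profiles):
    """Build a request body for the Parameters endpoint (hypothetical helper).

    profiles maps a profile name such as "spa_eng_PERSON" to its parameter values.
    """
    return {
        "profiles": [
            {"name": f"{universe}/{profile}", "parameters": dict(parameters)}
            for profile, parameters in profiles.items()
        ]
    }


payload = universe_payload(
    "parameterUniverseOne",
    {
        "spa_eng_PERSON": {
            "reorderPenalty": 0.4,
            "HMMUsageThreshold": 0.8,
            "stringDistanceThreshold": 0.1,
            "useEditDistanceTokenScorer": True,
        },
        "eng_eng": {"reorderPenalty": 0.6},
    },
)
print(json.dumps(payload, indent=2))
```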

Modifying name parameters

To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons will serve as a guide for which parameters to tune, which are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.

Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Selected name parameters

Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.

The following examples describe the impact of parameter changes in more detail.

Example 1. Token Conflict Score (conflictScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Joe Smith’. The ‘John’ tokens match each other, as do the ‘Smith’ tokens. This leaves the unmatched tokens ‘Mike’ and ‘Joe’. These two tokens are in direct conflict with each other, and this parameter determines how the conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores. Alternatively, you may decide that if two of the three tokens are the same, the third token (for example, a middle name) is less important.



Example 2. Initials Score (initialsScore)

Consider the following two names:  'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match.  You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know there is a lot of initialism in your data sets.



Example 3. Token Deletion Score (deletionScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.



Example 4. Token Reorder Penalty (reorderPenalty)

This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’, and ‘John Smith Mike’. This parameter will control the extent to which the token ordering ( ‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores last name in the last token position, you may want to penalize token reordering more by increasing the penalty.  If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.



Example 5. Right End Boost/Left End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds)

These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when you are dealing with English names and are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the most important.

The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.



Language support parameters

RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully supported text domains for name matching lists the languages and scripts with complete support. For all other languages, RNI has limited support.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Limited support uses two match score computations:

  • Exact matches return a score of 1. This is the same for all languages.

  • A score is calculated based on string edit distance.
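To illustrate what an edit-distance-based score looks like, here is a minimal Python sketch. RNI's actual formula is not described in this guide; the use of Levenshtein distance and the normalization by the length of the longer string are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def limited_support_score(name1, name2):
    """Illustrative score: 1 for exact matches, otherwise scaled edit distance."""
    if name1 == name2:
        return 1.0
    longest = max(len(name1), len(name2))
    return 1.0 - levenshtein(name1, name2) / longest


print(limited_support_score("abcd", "abcd"))
print(limited_support_score("abcd", "abcf"))
```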

Two parameters control the level of language support.

Table 3. Language Support Parameters

Parameter | Description | Default

allLanguageSupport | When set to true, all languages are supported. | true

limitedLanguageEditDistance | When set to true, edit distance match scores are enabled for limited support languages. allLanguageSupport must be true. | true



Neural model for matching

When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.

To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.

To use the neural model:

  • Extract the appropriate library files from the platform-specific tensorflow JAR provided in the rni-es-<version>-seq2seq-libraries.zip bundle.

  • Elasticsearch must be started with an additional Java property and point to the directory containing the extracted libraries:

     ES_JAVA_OPTS="-Dorg.bytedeco.javacpp.cacheLibraries=false -Djava.library.path=<path-to-extracted-libraries>"

Note

The neural model is currently only available on macOS and Linux platforms in RNI-ES versions 7.10.2.x and all plugins including RNI-RNT 7.38.1.67.0 or later.

Matching Korean names

If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.

To enable Korean readings of names in Han script you need to edit the parameter files as follows:

  1. Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the list in the ignoreTranslationOrigins parameter.

  2. Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1 to compensate for the additional reading.

Matching names with Han characters

We've added experimental support to leverage mechanisms within the Unicode data to improve matching of Han characters.

The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.

The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown accuracy improvements when setting the value of the parameter to 1.

To enable the feature, add the following lines to your parameter_profiles.yaml file:

zho_zho_PERSON:  
  haniFourCornerCodeMismatchPenalty: 1

Note

This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.

Matching Turkish and Vietnamese names

Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.

  1. Edit the parameter_profiles.yaml file.

  2. To enable Turkish detection, add:

    detectableLanguagesRuleBased:
      [tur]

    To enable Vietnamese detection, add:

    detectableLanguagesRuleBased:
      [vie]

  3. Restart the system.

Ignore malformed and null value parameters for RNI types

You can index null values and empty strings by updating the allowNullValue parameter. If the allowNullValue parameter is enabled, any document containing null values or empty strings in fields of the rni_name, rni_address, and rni_date types will be successfully indexed, but search capabilities will be limited to valid values.

You can direct RNI to index documents with malformed language strings by updating the ignoreBadData parameter. If the ignoreBadData parameter is enabled, any document containing a malformed language string will be successfully indexed, but search capabilities will be limited to valid languages.

By default these parameters are disabled. These features are useful when performing bulk operations in Elasticsearch.

Both parameters are set in the parameter_profiles.yaml file, located in plugins/rni/bt_root/rlpnc/data/etc/.

To turn on either feature, set the value of the ignoreBadData or allowNullValue parameter in that file to true.

Evaluating parameter configuration

To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.
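Both kinds of evaluation can be sketched in a few lines of Python. The helper names and the sample scores are hypothetical; the annotated results are assumed to be available as (score, is_correct) tuples.

```python
def count_above(scores, threshold):
    """General evaluation: number of pair matches scoring at or above the threshold."""
    return sum(1 for score in scores if score >= threshold)


def precision_recall(annotated, threshold):
    """Exact evaluation on annotated (score, is_correct) results."""
    true_pos = sum(1 for score, ok in annotated if score >= threshold and ok)
    false_pos = sum(1 for score, ok in annotated if score >= threshold and not ok)
    false_neg = sum(1 for score, ok in annotated if score < threshold and ok)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Illustrative scores only, before and after tuning.
before = [0.90, 0.85, 0.82, 0.60, 0.30]
after = [0.90, 0.75, 0.82, 0.60, 0.30]
print(count_above(before, 0.8), "->", count_above(after, 0.8))

annotated = [(0.90, True), (0.85, False), (0.60, True), (0.30, False)]
print(precision_recall(annotated, 0.8))
```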

If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.

Configuring name overrides

RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:

  • Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.

  • Name pair matches specify scores to be assigned for specified full-name pairs.

  • Token pair overrides specify name token pairs that match along with a match score.

  • Token normalization files specify the normalized form for tokens and variants to normalize to that form.

  • Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.

The name matching override files are in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override directory.

You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.

Stop patterns and stop word prefixes

Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.

For each name, RNI performs the following steps in order:

  1. Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.

  2. Stop patterns are applied.

  3. Stop words are applied.

RNI cycles through the stop patterns and then the stop words. In each cycle, the patterns and words that strip nothing are dropped from the list, and the cycling continues until the list of stop patterns and stop words is empty.
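The processing order above can be approximated with the following Python sketch. It is a simplification under stated assumptions, not RNI's implementation: it repeats the stop-pattern and stop-word passes until the name stops changing, it treats a stop word prefix as a whole leading token, and the function names are hypothetical.

```python
import re
import unicodedata


def normalize(name):
    """Character-level normalization as described in step 1 (approximation)."""
    # Remove diacritical marks by decomposing and dropping combining characters.
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strip punctuation except periods, commas, and hyphens.
    kept = "".join(
        ch for ch in no_marks
        if not unicodedata.category(ch).startswith("P") or ch in ".,-"
    )
    # Reduce white space to single spaces and lower-case everything.
    return re.sub(r"\s+", " ", kept).strip().lower()


def strip_stops(name, stop_patterns, stop_prefixes):
    """Apply stop patterns, then stop word prefixes, until nothing changes."""
    patterns = sorted(stop_patterns, key=len, reverse=True)  # longer first
    prefixes = sorted(stop_prefixes, key=len, reverse=True)
    changed = True
    while changed:
        before = name
        for pattern in patterns:
            name = re.sub(pattern, "", name)
        name = re.sub(r"\s+", " ", name).strip()
        for prefix in prefixes:
            if name.startswith(prefix + " "):
                name = name[len(prefix) + 1:]
        changed = name != before
    return name


cleaned = strip_stops(normalize("Mr  José O'Brien LNU"), [r"\blnu$"], ["mr"])
print(cleaned)
```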

Stop Pattern

A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern; see the Javadoc for detailed documentation.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt

where LANG is a three-letter language code.

Each row in the file, except for rows that begin with #[3], is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.

RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries

^fnu\b
\blnu$ 

indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name, are to be removed.

You can also specify which field the regex is to be applied to when processing a fielded name. Add a tab character followed by n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,

\blnu$    2
\blnu$    3

indicates that the regex is to be applied to fields 2 and 3 in fielded names.
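As a rough illustration of the file format, the following Python sketch reads stop pattern entries with an optional field number. The function name and the returned structure are hypothetical; Python's re module is used here only for illustration, and its syntax differs in places from java.util.regex.Pattern.

```python
import re


def parse_stop_regex_lines(lines):
    """Parse stop pattern rows into (compiled_regex, field_number_or_None) pairs."""
    entries = []
    for raw in lines:
        line = raw.strip()  # leading and trailing whitespace is removed
        if not line or line.startswith("#"):
            continue  # skip blank rows and comment rows
        pattern, _, field = line.partition("\t")
        field_number = int(field) if field.strip() else None
        entries.append((re.compile(pattern.strip()), field_number))
    return entries


sample = [
    "# first-name-unknown and last-name-unknown markers",
    "^fnu\\b",
    "\\blnu$\t2",
    "\\blnu$\t3",
]
for regex, field_number in parse_stop_regex_lines(sample):
    print(regex.pattern, field_number)
```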

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop Word Prefixes

A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.

Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt 

where LANG is a three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable when colonel is also a stop word prefix.

RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.

Overriding name pair matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:

name1 Tab name2 Tab score

The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.

Tip

Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.

The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,

John Doe	Joe Bloggs	1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
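The file format and the commutative lookup can be sketched in Python as follows. The function names are hypothetical, and names are compared literally here; any normalization RNI applies before looking up an override is not modeled.

```python
def load_fullname_overrides(lines):
    """Parse tab-delimited 'name1<Tab>name2<Tab>score' rows into a lookup table."""
    overrides = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank rows and commented-out entries
        name1, name2, score_text = (part.strip() for part in line.split("\t"))
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # A frozenset key makes the lookup independent of name order.
        overrides[frozenset((name1, name2))] = score
    return overrides


def override_score(overrides, query_name, index_name):
    """Return the override score for a pair, or None if no override exists."""
    return overrides.get(frozenset((query_name, index_name)))


table = load_fullname_overrides([
    "# sample entry",
    "John Doe\tJoe Bloggs\t1.0",
])
print(override_score(table, "John Doe", "Joe Bloggs"))
print(override_score(table, "Joe Bloggs", "John Doe"))
```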

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example the following entries could appear in fullnames_jpn_eng.txt:

外山恒	   Toyama Koichi    1.0
ヒラリークリントン    Hillary Clinton    1.0

Overriding token pair matches

You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported[4] for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

tokens_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:

Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]

A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
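The entry syntax described above can be summarized with a small Python parser. This is a sketch only: the TokenOverride structure and the function name are hypothetical, and the parser covers just the forms mentioned in this section.

```python
from typing import NamedTuple, Union

INDICATORS = {"NICKNAME", "COGNATE", "VARIANT"}


class TokenOverride(NamedTuple):
    token1: str
    token2: str
    value: Union[float, str]  # raw score or indicator name
    force: bool               # True when the score is forced rather than a minimum


def parse_token_override(line):
    """Parse one tab-delimited token pair entry."""
    parts = [part.strip() for part in line.split("\t")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited fields: {line!r}")
    token1, token2 = parts[0], parts[1]
    # With no score or indicator, NICKNAME is assumed.
    value_text = parts[2] if len(parts) == 3 else "NICKNAME"
    if value_text == "SUPPRESS":
        return TokenOverride(token1, token2, 0.0, True)  # alias for 0.0/force
    base, separator, suffix = value_text.partition("/")
    if separator and suffix != "force":
        raise ValueError(f"unknown suffix: {suffix!r}")
    if base in INDICATORS:
        value = base
    else:
        value = float(base)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range: {value}")
    return TokenOverride(token1, token2, value, suffix == "force")


for entry in ["Peter\tPete\tNICKNAME", "Peter\tPedro\tCOGNATE", "Abigail\tAbby\t0.8/force"]:
    print(parse_token_override(entry))
```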

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:

Peter    Pete    NICKNAME
Peter    Pedro   COGNATE

This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.

When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.

We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.

Multiple sets of token overrides

There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.

  • The value of overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.

  • The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the overrideSelector of OverrideGroup1 would be named:

    tokens_eng_eng_PERSON-OverrideGroup1.txt

  • If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
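The filename convention can be illustrated with a short Python sketch that splits a token override filename into its parts. The regular expression and the function name are assumptions made for illustration; RNI's own filename handling may differ in details such as how invalid names are treated.

```python
import re

TOKEN_FILE = re.compile(
    r"tokens_([a-z]{3})_([a-z]{3})(?:_([A-Z]+))?(?:-(.*))?\.txt"
)


def parse_token_override_filename(filename):
    """Return (lang1, lang2, entity_type, selector) for a token override file."""
    match = TOKEN_FILE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a token override filename: {filename!r}")
    lang1, lang2, entity_type, selector = match.groups()
    # The selector is an alphanumeric, case-insensitive string; anything else
    # falls back to the "default" selector.
    if selector is None or not (selector.isascii() and selector.isalnum()):
        selector = "default"
    return lang1, lang2, entity_type, selector.lower()


print(parse_token_override_filename("tokens_eng_eng_PERSON-OverrideGroup1.txt"))
print(parse_token_override_filename("tokens_eng_eng.txt"))
```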

Note

Overrides that are associated with a specific selector are not additive to the base overrides. If a custom overrideSelector value is specified, RNI will only consider overrides in that specific selector. As with the base overrides, for a given selector, RNI will consider non-entity-type overrides for that selector if no entity-type-specific override pair is found for that selector.

Normalizing token variants

You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:

equivalenceclasses_LANG[_TYPE].txt

For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.

Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:

[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:

[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd

You can add lists of variants to this file, including the normalized form in square brackets to start each list.
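Before deploying an edited file, it can help to confirm that it parses the way you expect. The sketch below reads the bracketed syntax described above into a variant-to-normal-form lookup. It is an illustration only; the function name and the error handling are assumptions, and RNI's own loader may treat edge cases differently.

```python
def parse_equivalence_classes(text):
    """Map each variant to the normalized form whose list it appears in."""
    variant_to_normal = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A bracketed line starts a new list and names its normalized form.
            current = line[1:-1]
        elif current is None:
            raise ValueError(f"variant listed before any [normal_form]: {line}")
        else:
            variant_to_normal[line] = current
    return variant_to_normal


sample = """[muhammad]
mohammed
mohd
"""
mapping = parse_equivalence_classes(sample)
```

Here mapping associates both mohammed and mohd with muhammad.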

Unimportant tokens

You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.

The file name is lowWeightTokens_LANG.txt.

For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".

Matching organizations with real world IDs

Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.

RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name matching within a language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.

Table 4. Real World ID Parameters

Parameter

Description

Default

useRealWorldIds

Enables real world IDs and indexes the real world IDs as corporation names are added to the index. If you enable this parameter after indexing, you must reindex.

true (enabled)

doQueryRealWorldIds

Enables querying with real world IDs; set by language pair.

true (enabled)

realWorldIdScore

Sets the match score when two names match due to matching real world IDs. Set by language pair.

0.98

nameRealWorldQueryBoost

Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair.

35



Building a real world ID file

Many companies maintain their own file of organizations and their alternate names. To improve matching between organization names, you can supplement the real world IDs provided in RNI by building your own real world ID file. The provided builder creates a binary file in the specified output directory named <LANG>_ORGANIZATION_ids.bin, where <LANG> is the three-letter language code of the file.

The input file is a tab separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92
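If your organization names live in another system, a small script can export them in this layout. The following sketch writes name and ID pairs as tab-separated lines and rejects values that would break the format. The function name and the validation rules are assumptions for illustration, not part of the builder.

```python
import io


def write_real_world_id_tsv(rows, stream):
    """Write (organization name, ID) pairs as tab-separated lines."""
    for name, entity_id in rows:
        if "\t" in name or "\n" in name:
            raise ValueError(f"name contains a tab or newline: {name!r}")
        if not entity_id.isalnum():
            raise ValueError(f"ID is not alphanumeric: {entity_id!r}")
        stream.write(f"{name}\t{entity_id}\n")


buffer = io.StringIO()
write_real_world_id_tsv(
    [
        ("IBM", "WE1X92"),
        ("Big Blue", "WE1X92"),
        ("International Business Machines", "WE1X92"),
    ],
    buffer,
)
tsv_text = buffer.getvalue()
```

In practice you would pass a file opened with UTF-8 encoding instead of the in-memory buffer, and keep one file per language.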

Unzip the file realWorldIDBuilder.zip found in the plugins/rni/bt_root directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.

Omit real world IDs

You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.

The omit file is a tab separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.

  • Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.

  • Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.

  • Lone QID: A QID is preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
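The three line types can be told apart mechanically, which is useful for validating an omit file before installing it. This is a minimal sketch; the function name and the returned labels are hypothetical and are not part of RNI.

```python
def classify_omit_line(line):
    """Return (kind, name, qid) for one tab-separated omit file line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 2:
        raise ValueError(f"expected two tab-separated columns: {line!r}")
    name, qid = columns
    if name == "*" and qid == "*":
        raise ValueError("a line cannot use an asterisk in both columns")
    if qid == "*":
        return ("lone_name", name, None)  # name is never used for RWID matching
    if name == "*":
        return ("lone_qid", None, qid)    # QID is never used for matching
    return ("pair", name, qid)            # QID is not used against this name
```

Applied to the example above, the three lines classify as a pair, a lone name, and a lone QID respectively.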

To enable an omit file in RNI:

  1. Place the omit file in the BT_ROOT directory.

  2. Open omit_ids.datafiles, which is in the plugins/rni/bt_root/rlpnc/data/real_world_ids/ref/omit_ids directory by default.

  3. Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:

    ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
  4. Save omit_ids.datafiles.

Address matching

The RNI plugin can match addresses in English, Traditional Chinese, and Simplified Chinese, returning a match score reflecting the similarity of two addresses.

In the RNI context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.

As with name and date matching, the process is to create an index containing addresses, then query an address against the index.

Note

Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.

Address definition

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library[5] is used to parse the address string into address fields.

When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.

RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields such as house number, postal code, and unit, are matched using character-based methods.

Table 5. Supported Address Fields

Field Name

Description

Example(s)

house

venue and building names

"Brooklyn Academy of Music", "Empire State Building"

houseNumber

usually refers to the external (street-facing) building number

"123"

road

street name(s)

"Harrison Avenue"

unit

an apartment, unit, office, lot, or other secondary unit designator

"Apt. 123"

level

expressions indicating a floor number

"3rd Floor", "Ground Floor"

staircase

numbered/lettered staircase

"2"

entrance

numbered/lettered entrance

"front gate"

suburb

usually an unofficial neighborhood name

"Harlem", "South Bronx", "Crown Heights"

cityDistrict

these are usually boroughs or districts within a city that serve some official purpose

"Brooklyn", "Hackney", "Bratislava IV"

city

any human settlement including cities, towns, villages, hamlets, localities, etc.

"Boston"

island

named islands

"Maui"

stateDistrict

usually a second-level administrative division or county

"Saratoga"

state

a first-level administrative division

"Massachusetts"

countryRegion

informal subdivision of a country without any political status

"South/Latin America"

country

sovereign nations and their dependent territories, which have a designated ISO-3166 code

"United States of America"

worldRegion

currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean

"Jamaica, West Indies"

postCode

postal codes used for mail sorting

"02110"

poBox

post office box: typically found in non-physical (mail-only) addresses

"28"



Address field groups

When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.

When matching two fields, the following penalties are applied:

  • If the fields are the same, no penalty is applied. (road - road)

  • If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)

  • If the fields are in different field groups, a large penalty is applied. (road - city)

Table 6. Address Groups

Group

Fields

house

house

house_number

houseNumber

road

road

unit

unit

level

staircase

entrance

city

suburb

cityDistrict

city

state

island

stateDistrict

state

country

countryRegion

country

worldRegion

post_code

postCode

po_box

poBox
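The penalty rules can be expressed as a lookup over the groups in Table 6. The sketch below only identifies which penalty tier applies to a pair of fields. The tier labels and function name are hypothetical, the field names follow Table 5, and the actual penalty amounts are internal to RNI and not shown here.

```python
# Groups and their member fields, transcribed from Table 6.
FIELD_GROUPS = {
    "house": ["house"],
    "house_number": ["houseNumber"],
    "road": ["road"],
    "unit": ["unit", "level", "staircase", "entrance"],
    "city": ["suburb", "cityDistrict", "city"],
    "state": ["island", "stateDistrict", "state"],
    "country": ["countryRegion", "country", "worldRegion"],
    "post_code": ["postCode"],
    "po_box": ["poBox"],  # field name as given in Table 5
}

# Reverse lookup: field name -> group name.
FIELD_TO_GROUP = {
    field: group for group, fields in FIELD_GROUPS.items() for field in fields
}


def penalty_tier(field_a, field_b):
    """Return which penalty tier applies when two field values match."""
    if field_a == field_b:
        return "none"
    if FIELD_TO_GROUP[field_a] == FIELD_TO_GROUP[field_b]:
        return "small"
    return "large"
```

For instance, penalty_tier("suburb", "city") returns "small" because both fields belong to the city group, while penalty_tier("road", "city") returns "large".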



How Rosette calculates address match scores

The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.

  • Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the addresses into address fields. This process works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.

    For example, most addresses are formatted from specific to general:

    houseNumber road city state postCode
    
    • The parser would provide predictable results for an address in an expected order:

      38 Concord Road, Apt. B Arlington MA

    • The parser would have more difficulty if the address format was in an unexpected order:

      Arlington MA Concord Road #38 Apt B

    If you are getting unexpected match values, check how the addresses are being parsed into address fields.

  • Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.

  • Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:

    • Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.

    • Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.

    • Postal codes: Rosette uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.

  • Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.

  • Field Weights: Some fields in an address are considered more important than other fields. The score from each selected match is weighted by field type. These field type weightings can be modified based on the type of address data in your system.
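To make the edit distance step concrete, the following sketch computes the classic Levenshtein distance, which counts character additions, substitutions, and deletions. Converting the distance to a 0 to 1 similarity by dividing by the longer string's length is an assumption made for illustration; the exact formula Rosette uses is not documented in this guide.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum additions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # add char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

With this normalization, house numbers 123 and 128 differ by one substitution out of three characters, so a single wrong digit still leaves a relatively high similarity.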

Using address matching

Index addresses
  1. Create an index.

    curl -XPUT 'http://localhost:9200/rni-test'
  2. Define a mapping for fields that will contain addresses. The type for each of these fields is "rni_address".

    curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
      "properties" : {
        "primary_name" : { "type" : "rni_name" },
        "residence" : { "type" : "rni_address" }
      }
    }'
  3. Index documents containing an address field.

    curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
      "primary_name" : "Joe Schmoe",
      "residence" : {
        "houseNumber" : "123",
        "road" : "Main St",
        "city" : "Boston",
        "state" : "Massachusetts",
        "postCode" : "02110"
      }
    }'

    The address in the document can also be defined as a string.

    curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
        "primary_name" : "Joe Schmoe",
        "residence" : "123 Main St, Boston, Massachusetts, 02110"
    }'
Query field addresses

RNI compares the fields in the query with the fields in the index, matching each non-blank field. Addresses do not have to contain all the same fields to be compared and matched.

As with other objects, the query for an address consists of two parts: the base query and the RNI pairwise address match rescore query.

Base Query. The base query is a standard query against the address field. Refer to Query the Index.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {
      "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}"
    }
  }
}'

RNI Rescore with Addresses. Refer to Rescoring with RNI Pairwise Name Match.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : {
              "road" : "Main", 
              "state" : "MA"
            }
          }
        }
      }, 
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

The query returns a hit with the RNI address match score.

"hits": {
 "total" : 1,
  "max_score" : 0.6057692,
  "hits" : [
    {
      "_index" : "rni-test",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.6057692,
      "_source" : {
        "primary_name" : "Joe Schmoe",
        "residence" : {
          "houseNumber" : "123",
          "road" : "Main St",
          "city" : "Boston",
          "state" : "Massachusetts",
          "postCode" : "02110"
        }
      }
    }
  ]
}

The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.

Query string addresses

The address can be structured as a string for queries. The address structure for the query is independent of the format of the address in the original document. A string can be used in the query regardless of whether the indexed address was formatted with fields or as a string.

Base Query. The base query constructed with an address string.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {"residence" : "Main, MA"}
  }
}'

RNI Rescore with Addresses. The rescore query with an address string.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "Main, MA" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : "Main, MA"
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

The response displayed here returns the address as a string because the indexed document used in this example represented the address as a string. The response returns the address in the same format as the indexed document. The format of the query does not have to match the format of the indexed documents.

"hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.4552421,
    "hits" : [
      {
        "_index" : "rni-test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.4552421,
        "_source" : {
          "primary_name" : "Joe Schmoe",
          "residence" : "123 Main St, Boston, Massachusetts, 02110"
        }
      }
    ]
  }

The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
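When queries are issued from application code rather than curl, the request body can be assembled programmatically. The sketch below builds the same base query and address_score rescore shown in the examples above, accepting either a fielded address or a string. The function name is hypothetical; the JSON structure mirrors the curl examples.

```python
import json


def address_search_body(field, query_address):
    """Build a search body with a base match query and an RNI address rescore."""
    # The base match query takes a string, so a fielded address is passed as
    # serialized JSON, as in the curl examples.
    if isinstance(query_address, dict):
        match_value = json.dumps(query_address)
    else:
        match_value = query_address
    return {
        "query": {"match": {field: match_value}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "function_score": {
                        "address_score": {
                            "field": field,
                            "query_address": query_address,
                        }
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


fielded = address_search_body("residence", {"road": "Main", "state": "MA"})
as_string = address_search_body("residence", "Main, MA")
```

Either body can be serialized with json.dumps and sent to the _search endpoint of the index with the HTTP client of your choice.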

Configuring address matching

Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.

There are two types of override files for addresses:

  • Stop patterns and stop word prefixes designate address field elements to strip during indexing and queries.

  • Token pair overrides specify address field elements pairs that match.

File Directories

  • The parameters are modified in the plugins/rni/bt_root/rlpnc/data/etc/parameter_profiles.yaml file.

  • The address matching override files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/overrides directory.

  • The address stop word files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords directory.

Modifying address parameters

To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter configuration files.

Note

Changes made to the "any" profile apply to all supported languages.

An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify the value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.

Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.

Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Address parameters

Stop patterns and stop word prefixes

RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries before matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as the, that you do not want to include in address matching.

For each address field, RNI performs the following steps in order:

  1. Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.

  2. Stop patterns are applied.

  3. Stop words are applied.

Stop pattern

A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.

Stop patterns for a given address field are specified in a UTF-8 file with the AddressField name:

stopregexes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[6] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at beginning and end where needed.

Note

The delimiter before FIELD is a double underscore (__)

Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.

Stop pattern files are arranged by field in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop word prefixes

A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.

Stop word prefixes for a given address field are specified in a UTF-8 file with the AddressField name:

stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[7] is a string literal.

Note

The delimiter before FIELD is a double underscore (__)

Prefixes in the address field matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.

RNI includes files with stop word prefixes for selected address fields in English and Chinese. These files are in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stopword prefixes for use on CITY address field for English.
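The three normalization steps described in this section can be approximated as follows. This is a rough sketch for reasoning about how your own stop patterns and prefixes interact, not RNI's implementation: it uses Python's re module rather than java.util.regex.Pattern, removes punctuation outright rather than replacing it, and treats "longest first" as sorting by string length. All of these are assumptions, as are the function name and the sample values.

```python
import re

_PUNCTUATION = re.compile(r"[.,\-#]")
_WHITESPACE = re.compile(r"\s+")


def normalize_field(value, stop_patterns=(), stop_prefixes=()):
    """Approximate the per-field normalization order described above."""
    # Step 1: strip punctuation, collapse whitespace, lower-case.
    text = _PUNCTUATION.sub("", value)
    text = _WHITESPACE.sub(" ", text).strip().lower()
    # Step 2: apply stop patterns, longer patterns before shorter ones.
    for pattern in sorted(stop_patterns, key=len, reverse=True):
        text = re.sub(pattern, "", text)
    # Step 3: strip the longest matching stop word prefix.
    for prefix in sorted(stop_prefixes, key=len, reverse=True):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    # Tidy any whitespace left behind by the removals.
    return _WHITESPACE.sub(" ", text).strip()
```

For example, normalize_field("The  United States.", stop_prefixes=["the "]) yields united states, and normalize_field("Apt. #12-B", stop_patterns=[r"\bapt\b"]) yields 12b.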

Overriding token pair matches

You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

LANG1_LANG2_FIELD.txt

where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value will be used.

Token1 Tab Token2 Tab [0.0-1.0]

A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [0.0-1.0]/force

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
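The line syntax above can be validated with a small parser before the file is deployed. The sketch below handles the optional score, the /force suffix, the SUPPRESS alias, and comment rows. The function names are hypothetical, and default_score stands in for the addressOverrideDefaultScore parameter, whose actual value is not stated here; the 0.8 used in the sample call is a placeholder.

```python
def parse_override_line(line, default_score):
    """Return (token1, token2, score, forced) for one tab-delimited entry."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-delimited columns: {line!r}")
    token1, token2 = parts[0], parts[1]
    spec = parts[2].strip() if len(parts) == 3 else ""
    if spec == "":
        return (token1, token2, default_score, False)
    if spec == "SUPPRESS":
        return (token1, token2, 0.0, True)  # alias for "0.0/force"
    forced = spec.endswith("/force")
    if forced:
        spec = spec[: -len("/force")]
    score = float(spec)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return (token1, token2, score, forced)


def load_overrides(text, default_score):
    """Parse every entry, skipping blank rows and rows that begin with #."""
    return [
        parse_override_line(line, default_score)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


sample = "# sample entries\nMassachusetts\tMA\nroad\trd\t0.9\nst\tsaint\t0.7/force\nmain\tmaine\tSUPPRESS\n"
overrides = load_overrides(sample, default_score=0.8)  # 0.8 is a placeholder
```

An unforced score acts as a minimum for the pair, whereas a forced entry (including SUPPRESS) fixes the score exactly.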

RNI includes plugins/rni/bt_root/rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:

Massachusetts  MA
California  CA

When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field each token element in the pair pertains to. For example zho_eng_cityDistrict.txt indicates that the contents match Chinese - English cityDistrict address fields.

Date matching

RNI can match dates, returning a date match score reflecting the time similarity of the two dates. Dates that are closer together are considered a stronger match and return a match score closer to 1.

For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.

The process is similar to name matching:

  • Index the dates in connection to the related names.

  • Query the date and name, receiving back a match score.

The query will return separate match scores for the name and for the associated date of birth. You may decide that the name is more important than the birth date. Within your system, you can weight and combine the name and date match scores to determine the final match score.

Date definition

A date contains a year, month, and day, but not all fields are required for matching. All common delimiters for English dates are supported, and dates can be expressed with various orderings. RNI will filter out some non-date related words. Formats that include time of day are not supported.

You can specify an Elasticsearch date format that includes time information in the mapping. The time component will be ignored.

RNI supports a wide variety of date formats. The best date format will always be the ISO standard of YYYY-MM-DD, where March 7, 1984 is written as 1984-03-07. RNI will attempt to interpret any date provided, although the less standard the format, the less guarantee that its interpretation will be the one you might expect.

Dates can be represented as YYYY-MM-DD. When some fields are unspecified, the letters represent the unknown values. For example, March 7 is YYYY-03-07, since the year is unspecified. Two-digit years are assumed to have unknown centuries, so 3/7/84 is interpreted as YY84-03-07. It matches March 7, 1984 equally as well as March 7, 2084 and March 7, 1884.

When a date is provided, RNI will attempt to identify the year, month, and day within it, leaving blank any fields it cannot determine. You can omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12- are all valid dates.

If RNI encounters an invalid date in an acceptable format, such as March 38, 1984, it does not return an error. Instead, it treats the impossible value as unknown and interprets the date as March 1984.
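The handling of omitted and impossible values can be illustrated with a small parser. The sketch below covers only hyphen-delimited year-month-day strings, which is far narrower than what RNI accepts, and it does not model two-digit years. The function name and the choice to assume a leap year when the year is unknown are assumptions for illustration.

```python
import calendar


def parse_partial_date(text):
    """Parse 'Y-M-D' with optional empty fields into (year, month, day)."""
    parts = text.split("-")
    if len(parts) > 3:
        raise ValueError(f"too many fields: {text!r}")
    parts += [""] * (3 - len(parts))  # "1955" becomes ["1955", "", ""]
    year, month, day = (int(part) if part else None for part in parts)
    # Impossible values become unknown instead of raising an error.
    if month is not None and not 1 <= month <= 12:
        month = None
    if day is not None:
        if month is None:
            max_day = 31
        else:
            # With no year given, assume a leap year so February 29 is allowed.
            known_year = year if year is not None else 2000
            max_day = calendar.monthrange(known_year, month)[1]
        if not 1 <= day <= max_day:
            day = None
    return (year, month, day)
```

With this sketch, 1955--03 parses to a known year and day with an unknown month, and 1984-03-38 keeps the year and month but drops the impossible day.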

Supported date formats

RNI supports a wide variety of date formats. 

  • Days can be represented by 1 or 2 digits.

  • Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).

  • Years can be represented by 1, 2, 3 or 4 digits.

  • Supported delimiters include , . - /, as well as a space.

  • Partial fields can be entered.

  • At this time, only English month names and abbreviations are recognized.

  • All words are case-insensitive; upper and lower case are interpreted the same.

The following table shows different acceptable formats for the date March 7, 1984.

Format

Valid Examples

Notes

Y-M-D

1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7

M-D

03-07; 3/7; Mar-07; March 7

Y-M

1984-03; 1984 March; 1984-Mar

YYYYMMDD

19840307

All 8 digits must be included

M-D-Y

03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984

M-YYYY

03-1984; March 1984; Mar-1984

The year must include 4 digits. March-84 will not be recognized.

D-M-Y

07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984

D-M

07-03; 7/3; 07-Mar; 7 March

D(MONTH)Y

7MAR84; 07March1984

The month is a word or abbreviation

YYYY

1984

Month

March

Using date matching

Index dates
  1. Create an index.

    curl -XPUT 'http://localhost:9200/rni-test'

  2. Define a mapping for fields that will contain dates. The type for a date field when matching is "rni_date".

    curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date" }, "primary_name" : { "type" : "rni_name" } } }'

    Optionally, in the mapping, you can specify an Elasticsearch date format. All dates must adhere to the specified format. If you specify a format that includes time information, RNI ignores the time component of the date.

    Warning

    Specifying an Elasticsearch format disables support for unspecified fields. If, for example, you select a format that does not include a day field ("MM-yyyy"), you will get an error when you use the date format in a query.

    curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date", "format" : "MM-yyyy-dd" }, "primary_name" : { "type" : "rni_name" } } }'

  3. Index documents containing a date field.

    curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "birth_date" : "07-1955-24" }'

Query dates

There are many ways to incorporate date matching within your query. Here are two examples, one with date matching by itself, and one with date and name matching.

Basic Date Matching

Base Query. The base query is a standard query against the date field. Refer to Query the Index.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : {
            "birth_date" : "08-1955-25"
        }
    }
}'

RNI Rescore with Dates. Refer to Rescoring with RNI Pairwise Name Match.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : { "birth_date" : "08-1955-25" }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "date_score" : {
                        "field" : "birth_date",
                        "query_date" : "08-1955-25"
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

The query returns a hit, with the RNI date match score.

"hits": {
   "total": 1,
   "max_score": 1.618923,
   "hits": [
     {
       "_index": "test",
       "_type": "_doc",
       "_id": "AVXMepnorGuybmuiQtQr",
       "_score": 0.8120856,
       "_source": {
         "primary_name": "Joe Schmoe",
         "birth_date": "07-1955-24"
       }
     }
   ]
 }
Date and Name Match

Base Query. The base query is a standard query against the date and name fields.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "primary_name": "Joe S."
          }
        },
        {
          "match": {
            "birth_date": "08-1955-25"
          }
        }
      ]
    }
  }
}'

RNI Rescore with Dates. Use the doc_score function in the rescore when matching a combination of Elasticsearch field types instead of the functions for a single type (name_score and date_score). The name field is also added to the rescore.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "primary_name": "Joe S."
          }
        },
        {
          "match": {
            "birth_date": "08-1955-25"
          }
        }
      ]
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "function_score": {
          "doc_score": {
            "fields": {
              "primary_name": {
                "query_value": "Joe S."
              },
              "birth_date": {
                "query_value": "08-1955-25"
              }
            }
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

Date match parameters

As with name matching, date matching is controlled by a set of parameters. You can edit the parameter values in the plugins/rni/bt_root/rlpnc/data/etc/parameter_defs.yaml file.

Record matching

A search can include multiple fields and return a single match and match score. The fields can be any combination of type rni_name, rni_date, rni_address, or any other Elasticsearch field type.

Each field can be assigned a weight to reflect its importance in the overall matching logic. When searching for a match, some fields are more important in determining a match than others. For example, the name field is likely more important in determining a match than an address field. If no weights are defined, each field is weighted equally.

When matching records, a similarity score is calculated for each field. The final match score is then calculated as the weighted arithmetic mean of these similarity scores. If a field is missing from a document, that field is removed from the score calculation and its weight is evenly distributed across the other fields. You can override this behavior by using the score_if_null option to specify a score to be returned if the field is null in the index document.
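The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the plugin's implementation: the record_score function and the "score" key are hypothetical names, and the sketch handles a null field by dropping it and renormalizing over the remaining weights, which may differ in detail from how the plugin redistributes weight.

```python
from typing import Optional

def record_score(fields: dict[str, dict]) -> Optional[float]:
    """Weighted arithmetic mean over per-field similarity scores.

    Each entry holds a hypothetical 'score' (None when the indexed field is
    null), an optional 'weight' (default 1.0), and an optional 'score_if_null'.
    """
    total = 0.0
    weight_sum = 0.0
    for spec in fields.values():
        score = spec.get("score")
        if score is None:
            # Null in the index: use score_if_null if given, else skip the field.
            score = spec.get("score_if_null")
            if score is None:
                continue
        weight = spec.get("weight", 1.0)
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else None

fields = {
    "name": {"score": 0.8, "weight": 4},
    "dob": {"score": 0.5, "weight": 2},
    "nationality": {"score": None, "weight": 1},  # null in the index document
}
print(record_score(fields))
```

In this sketch, adding "score_if_null": 0.0 to the nationality entry keeps the field in the calculation and lowers the overall score, instead of dropping the field.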

Use the doc_score function in the rescore query when matching records that include multiple field types, instead of the functions for a single type, such as the name_score and date_score functions. The doc_score function has built-in similarity functions for many core types. It does not, however, currently support multiple nested fields.

If your record query contains types which the doc_score function doesn't support, you can create a custom similarity function using the Elasticsearch script_score function in the rescore query.

Supported field types

The doc_score function has default support for rni_name, rni_date, rni_address, and many of the Elasticsearch core field types. All default similarity scores are between 0.0 and 1.0.

| Field Type(s) | Default Similarity Function | Example(s) |
| --- | --- | --- |
| rni_name | name_score (refer to RNI pairwise match score) | 'John David Smith' vs 'Jon D Smith' = 0.88 |
| rni_date, date | date_score (refer to Date matching) | '2010-11-4' vs '2010-5-11' = 0.92 |
| rni_address | address_score (refer to Address matching) | 'Red Cedar Ct' vs 'Cedar Ct' = 0.53 |
| keyword, text, string | Normalized edit distance | '37 Congress St.' vs '35 Congres St.' = 0.875 |
| integer, long, short, double, float | Normalized difference (e.g., percentage) | '65' vs '59' = 0.908 |
| boolean | Equality | 'true' vs 'true' = 1.0, 'true' vs 'false' = 0.0 |
| geo_point | Log function over Haversine distance | '[lat=42.361145, lon=-71.057083]' vs '[lat=42.3736, lon=-71.1097]' = 0.83 |
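To make the numeric and boolean rows concrete, here is a small Python sketch. The formula in numeric_similarity is an assumption chosen because it reproduces the '65' vs '59' example; the plugin's actual normalization is not documented here and may differ. Both function names are hypothetical.

```python
def numeric_similarity(a: float, b: float) -> float:
    """1 minus the absolute difference relative to the larger magnitude."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    # Clamp at 0.0 so values with opposite signs never score below zero.
    return max(0.0, 1.0 - abs(a - b) / largest)

def boolean_similarity(a: bool, b: bool) -> float:
    """Equality check: 1.0 when the values agree, 0.0 otherwise."""
    return 1.0 if a == b else 0.0

print(round(numeric_similarity(65, 59), 3))  # 0.908
print(boolean_similarity(True, False))       # 0.0
```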

Using record matching

Index records
  1. Create an index with a mapping containing fields with different types

    curl -XPUT 'http://localhost:9200/rni-test' -H'Content-Type: application/json' -d '{
        "mappings" : {
             "properties" : {
                "name" : { "type" : "rni_name" },
                "dob" : { "type" : "rni_date" },
                "address" : { "type" : "rni_address" },
                "height" : { "type" : "integer" },
                "nationality" : { "type" : "keyword" }
            }
        }
    }'
  2. Index documents that contain those fields

    curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
        "name" : "Ryan McDonagh", 
        "dob" : "11/19/1987",
        "address" : {
            "houseNumber" : "47",
            "road" : "Park St",
            "city" : "Boston",
            "state" : "MA"
        },
        "nationality" : "USA", 
        "height" : 65 
    }'
Basic multi-field query

The query can be a record containing multiple fields. The fields in the query record must be mapped to those of the indexed documents.

Base Query. The base query is a standard Elasticsearch query containing multiple fields that will return candidates for rescoring.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "{\"data\" : \"Brian McDonough\", \"entityType\": \"PERSON\"}" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    }
}'

RNI Rescore with Records. Use the doc_score function to rescore the indexed documents against a query record.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "rni_query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : { "query_value": "Brian McDonough" },
                             "dob" : { "query_value": "10/19/87" },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   }
                             },
                             "height" : { "query_value": 67 },
                             "nationality" : { "query_value": "CANADA" }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

As with addresses, the query_value of names can be an object to match additional name information. The rescore query above can easily be modified to additionally match against a name's entityType field:

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "rni_query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : {
                                  "query_value": {
                                      "data": "Brian McDonough",
                                      "entityType": "PERSON"
                                  }
                             },
                             "dob" : { "query_value": "10/19/87" },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   }
                             },
                             "height" : { "query_value": 67 },
                             "nationality" : { "query_value": "CANADA" }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Note

The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.
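If you build requests programmatically, a JSON library can produce the escaped string for you. The following Python sketch assumes only the standard json module and the field names used in this section; it serializes the address object to a string before placing it in the base query, so the inner quotes are escaped when the whole body is serialized.

```python
import json

address = {"houseNumber": "48", "road": "Parker St", "city": "Boston", "state": "MA"}

body = {
    "query": {
        "bool": {
            "should": [
                {"match": {"name": "Brian McDonough"}},
                {"match": {"dob": "10/19/87"}},
                # The base query needs a string, so serialize the object first.
                {"match": {"address": json.dumps(address)}},
            ]
        }
    }
}

payload = json.dumps(body)  # the inner quotes are escaped automatically
print(payload)
```

The resulting payload can be sent as the request body in place of the hand-escaped string in the curl examples.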

Weighted multi-field query

Each field can be given a weight to reflect its importance in the overall matching logic.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                         "fields": {
                             "name": { "query_value": "Brian McDonough", "weight": 4 },
                             "dob": { "query_value": "10/19/87", "weight": 2 },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                             },
                             "height" : { "query_value": 67, "weight": 0.5},
                             "nationality" : { "query_value": "CANADA", "weight": 1 }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'
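For applications that assemble this request in code, the weighted rescore section can be generated from a simple mapping. This Python sketch is one possible approach; doc_score_rescore is a hypothetical helper name, while the JSON keys are the ones used in the example above.

```python
import json

def doc_score_rescore(fields: dict) -> dict:
    """Build the "rescore" section; `fields` maps field name -> (query_value, weight)."""
    return {
        "query": {
            "rescore_query": {
                "function_score": {
                    "doc_score": {
                        "fields": {
                            name: {"query_value": value, "weight": weight}
                            for name, (value, weight) in fields.items()
                        }
                    }
                }
            },
            # Ignore the base query score and keep only the doc_score result.
            "query_weight": 0.0,
            "rescore_query_weight": 1.0,
        }
    }

rescore = doc_score_rescore({
    "name": ("Brian McDonough", 4),
    "dob": ("10/19/87", 2),
    "height": (67, 0.5),
})
print(json.dumps({"rescore": rescore}, indent=2))
```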

By default, if a queried-for field is null in the index, the field is removed from the score calculation, and the weights of the other fields are redistributed. However, you can override this behavior by using the score_if_null option to specify what score should be returned for this field if it is null in the index document.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : { "query_value": "Brian McDonough", "weight": 4, "score_if_null" : 0.0  },
                             "dob": { "query_value": "10/19/87", "weight": 2 },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                             },
                             "height" : { "query_value": 67, "weight": 0.5},
                             "nationality" : { "query_value": "CANADA", "weight": 1 ,  "score_if_null" : 1.0  }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Note

The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.

Multi-field query with multiple nested fields

The doc_score function for rescoring does not currently support search queries containing multiple nested fields. To perform these queries, chain multiple rescorers and adjust the query_weight and rescore_query_weight parameters to control the relative importance of the original query and of the rescore query, respectively. When chaining multiple RNI advanced rescorers, be sure to add "score_mode":"total" to each rni_query object to ensure the final score is properly accumulated.

This example expands the previous examples by adding alias names (the aliases field) and replacing the single date of birth (the dob field) with a list of dates of birth (the dobs field), one for each alias.

  1. Create an index with a mapping containing multiple nested fields

    curl -XPUT "http://localhost:9200/rni-test" -H 'Content-Type: application/json' -d'{
        "mappings": {
        "properties": {
          "name": {
            "type": "rni_name"
          },
          "aliases": {
            "type": "nested",
            "properties": {
              "alias_name": {
                "type": "rni_name"
              }
            }
          },
          "dobs": {
            "type": "nested",
            "properties": {
              "dob": {
                "type": "rni_date"
              }
            }
          },
          "address": {
            "type": "rni_address"
          },
          "height": {
            "type": "integer"
          },
          "nationality": {
            "type": "keyword"
          }
        }
      }
    }'
  2. Index documents that contain the fields

    curl -XPUT "http://localhost:9200/rni-test/_doc/1" -H 'Content-Type: application/json' -d'{
      "name": "Ryan McDonagh",
      "aliases": [
        {
          "alias_name": "Rayan McDonagh"
        },
        {
          "alias_name": "R. McDonagh"
        },
        {
          "alias_name": "Rayan M."
        }
      ],
      "dobs": [
        {
          "dob": "11/19/1987"
        },
        {
          "dob": "11/20/1987"
        },
        {
          "dob": "10/19/1987"
        }
      ],
      "address": {
        "houseNumber": "47",
        "road": "Park St",
        "city": "Boston",
        "state": "MA"
      },
      "nationality": "USA",
      "height": 65
    }'
  3. Query index with chained multiple rescorers

    curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{
      "query": {
        "bool": {
         "should": [
          {
            "nested": {
              "path": "dobs",
              "query": {
                "bool": {
                  "should": {
                    "match": { "dobs.dob": "10/19/87" }
                  }
                }
              }
           } 
         },
         {
            "nested": {
              "path":"aliases",
              "query": {
                "bool": {
                  "should": {
                    "match": { "aliases.alias_name": "Brian McDonough" }
                  }
                }
              }
           }
         },
         {
            "match":{
              "address": "{\"houseNumber\": \"48\", \"road\": \"Parker St\", \"city\": \"Boston\", \"state\": \"MA\" }"
            }
          }
        ]
      }
     },
     "rescore": [
       {      
         "rni_query": {
           "rescore_query": {
             "nested": {
               "score_mode": "max",
               "path": "aliases",
               "query": {
                 "rni_function_score": {
                   "name_score": {
                     "field": "aliases.alias_name",
                     "query_name": "Brian McDonough",
                     "window_size_allowance": 1
                   }
                 }
               }
             }
           },
           "score_mode": "total",
           "query_weight": 0.0,
           "rescore_query_weight": 1.0
           }
         },
         {
          "rni_query": {
            "rescore_query": {
              "nested": {
                "score_mode": "max",
                "path": "dobs",
                "query": {
                  "rni_function_score": {
                    "date_score": {
                      "field": "dobs.dob",
                      "query_date": "10/19/87"
                    }
                  }
                }
              }
            },
            "score_mode": "total",
            "query_weight": 0.67,
            "rescore_query_weight": 0.33
            }
          },
          {
           "rni_query": {
             "rescore_query": {
               "rni_function_score": {
                 "address_score": {
                   "field": "address",
                   "query_address": {
                     "houseNumber": "48",
                      "road": "Parker St",
                      "city": "Boston",
                      "state": "MA"
                    }
                  }
                }
              },
              "score_mode": "total",
              "query_weight": 0.75,
              "rescore_query_weight": 0.25
             }
           },
          {      
            "query": {
            "rescore_query": {
              "match": {
                "height": 67
              }
            },
            "query_weight": 0.89,
            "rescore_query_weight": 0.11
            }
           },
          {      
            "query": {
            "rescore_query": {
              "match": {
                "nationality": "CANADA"
              }
            },
            "query_weight": 0.9,
            "rescore_query_weight": 0.1
          }
        }
      ]
    }'

To calculate the rescore_query_weight for each rescorer, work from the last rescorer to the first, dividing each field's desired weight by the product of the query_weight values already calculated for the rescorers that follow it. Each query_weight is calculated by subtracting the rescore_query_weight from 1.

For the last rescorer there are no previously calculated query_weight values, so the rescore_query_weight is simply the desired field weight.

In this example, the desired field weights are 0.4, 0.2, 0.2, 0.1, and 0.1 for the alias, dob, address, height, and nationality fields, respectively. The rescorers are listed below in the order they appear in the query.

  1. Rescore based on alias

    Alias field weight = 0.4

    rescore_query_weight = 0.4 / (0.667 x 0.75 x 0.89 x 0.9) = 1

    query_weight = 1 - 1 = 0

  2. Rescore based on date of birth

    DOB field weight = 0.2

    rescore_query_weight = 0.2 / (0.75 x 0.89 x 0.9) = 0.333

    query_weight = 1 - 0.333 = 0.667

  3. Rescore based on address

    Address field weight = 0.2

    rescore_query_weight = 0.2 / (0.89 x 0.9) = 0.25

    query_weight = 1 - 0.25 = 0.75

  4. Rescore based on height

    Height field weight = 0.1

    rescore_query_weight = 0.1 / 0.9 = 0.11

    query_weight = 1 - 0.11 = 0.89

  5. Rescore based on nationality

    Nationality field weight = 0.1

    rescore_query_weight = 0.1

    query_weight = 1 - 0.1 = 0.9
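The bottom-to-top calculation above can be automated. The following Python sketch is an illustration only; chained_weights is a hypothetical helper name, and it assumes the desired field weights are given in rescorer order and sum to 1.

```python
def chained_weights(desired: list[float]) -> list[tuple[float, float]]:
    """Return (query_weight, rescore_query_weight) per rescorer, in request order."""
    result = []
    product = 1.0  # product of the query_weight values computed so far
    for weight in reversed(desired):
        # Clamp to 1.0 so floating-point rounding never yields a negative query_weight.
        rescore_w = min(1.0, weight / product)
        query_w = 1.0 - rescore_w
        result.append((query_w, rescore_w))
        product *= query_w
    result.reverse()
    return result

weights = chained_weights([0.4, 0.2, 0.2, 0.1, 0.1])
for query_w, rescore_w in weights:
    print(f"query_weight={query_w:.3f}  rescore_query_weight={rescore_w:.3f}")
```

Working from unrounded values, as this sketch does, avoids the small drift that comes from rounding each intermediate weight to two decimal places.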

Weighted multi-field query with custom similarity function

While the doc_score function has built-in similarity functions for many core field types, a custom similarity function can be provided at query time. In this manufactured example, we'll use a simple script_score function that matches CANADA and USA with a high score. Refer to the Elasticsearch documentation for more details about Elasticsearch scripting. Any other function can also be used.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { "query_value": "Brian McDonough", "weight": 4 },
                            "dob": { "query_value": "10/19/87", "weight": 2 },
                            "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                            },
                            "height": { "query_value": 67, "weight": 0.5 },
                            "nationality": { 
                               "function": {
                                   "function_score": {
                                       "script_score": {
                                           "script": {
                                               "lang": "painless",
                                               "params": {
                                                   "query_value": "CANADA"
                                               },
                                               "inline": "if (params.query_value == '\''CANADA'\'' && doc['\''nationality'\''].value == '\''USA'\'') {return 0.8} else {return 0.2}"
                                           }
                                       }
                                   }
                                },
                                "weight": 1
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Note

The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.

Explainability of RNI Matching

Getting a match score is important, but understanding how the system calculated that score can be just as important. When matching two names or records, RNI returns a JSON response explaining in detail how the two names, dates, addresses, or records were matched. With this information, you can understand how the score was calculated and, if necessary, modify the matching parameters to better solve your matching problems.

The following concepts are helpful when reviewing the explainInfo JSON file.

  • When two objects are being compared, one is referred to as the left input, one as the right input.

  • Every token of the left object is compared to every token of the right object. Token strings, made up of multiple tokens, may also be compared.

  • Names are usually composed of multiple tokens. For example, John Fitzgerald Kennedy is 3 tokens.

Common Terms

The response JSON contains sections for each type of object: names, addresses, and dates. While each object has its own criteria for comparison, there are common terms used for all comparisons, as shown below.

Table 9. Definitions of Terms

| Term | Definition | Note |
| --- | --- | --- |
| bin | A number representing the frequency of the token in the language. A lower bin indicates that the token is unusual and should therefore be weighted more highly when calculating the similarity score. | |
| biasedBin | The bin raised to a power from 0.1 to 10 (default 0.970). The power is set by the frequencyRankBias parameter. | |
| scoreInIsolation | The matching score of just the tuples being compared, ignoring things like position in the name, name weighting, etc. This shows a match score of 1.000 for an exact match of tokens, even if there are biases that will lower the score in context. | |
| scoreInContext | The matching score between the tuples, taking into account the placement in the overall query and any biases related to the overall query. | |
| (left/right)MinTokenIndex | The index of the first token in the string of tokens. For single tokens, the min and max tokenIndex have the same value. | An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
| (left/right)MaxTokenIndex | The index of the last token in the string of tokens. For single tokens, the min and max tokenIndex have the same value. | An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
| unbiasedScore | The raw score before any calculations using finalBias, adjustOneSidedDeletionScores, or other such bias parameters. | |
| score | The final score after finalBias, adjustOneSidedDeletionScores, and other such bias parameters are applied to the calculation. | |



Response structure

All match responses contain the same sections. The details contained within each section can change based on the match object (names, dates, addresses).

  • Left/right input information: The input information for each input along with the properties for each token in the input. Properties depend on the type of object being matched.

    For example, the name matching example contains the following properties:

    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH"

    While a date comparison would contain different properties:

    "century": 20,
    "month": 10,
    "canonicalForm": "2024-10-01",
    "yearWithoutCentury": 24,
    "dayMonthSwapped": true,
    "originalString": "10 January 2024",
    "modifiedJulianDay": 60584,
    "day": 1
  • Tuple scores: The score for every tuple, where a tuple is a token string from the left input and a token string from the right input. Every token in the left input is matched to every token in the right input, along with some token strings (multiple tokens combined together).

  • Score adjustments: The score adjustments list the parameters applied, and the score calculated with those parameters.

    For example, the name example here contains the following parameters:

    "unbiasedScore": 0.6829129823127231,
    "score": 0.6919264820086959,
    "parameter": "adjustOneSidedDeletionScores"
    
    "unbiasedScore": 0.6919264820086959,
    "score": 0.8435140063279181,
    "parameter": "finalBias"

    Meanwhile, a date comparison would contain different parameters. In this case, a different matching scheme, tryDayMonthSwap, is tried to see if a better result is returned.

    "score": 0.95,
    "unbiasedScore": 0.5926523220980572,
    "parameter": "tryDayMonthSwap"
    
    "score": 0.95,
    "unbiasedScore": 0.95,
    "parameter": "dateFinalBias"
  • Final score: The similarity score for the two names.
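The date properties shown above can be cross-checked with standard date arithmetic. This Python sketch uses the standard definition of the Modified Julian Day (days since 17 November 1858); that RNI derives modifiedJulianDay this way is an assumption, although the values in the example agree.

```python
from datetime import date

MJD_EPOCH = date(1858, 11, 17)  # day 0 of the Modified Julian Day count

def modified_julian_day(d: date) -> int:
    """Number of days between the MJD epoch and the given date."""
    return d.toordinal() - MJD_EPOCH.toordinal()

d = date.fromisoformat("2024-10-01")   # the canonicalForm in the example
print(modified_julian_day(d))          # 60584, matching modifiedJulianDay
print(d.year // 100, d.year % 100)     # 20 24: century and yearWithoutCentury
```

Note that the canonical form is 1 October while the original string reads 10 January, which is consistent with dayMonthSwapped being true in the example.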

Example: matching names

Let's take a look at an example. In this example we're matching the following 2 names:

  • John Smith

  • Jon J Smyth

The JSON output is broken down by section.

Example 6. Left Input: John Smith
"leftInput": {
    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.41435888604672094,
        "tokenType": "GIVEN"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.585641113953279,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  • The name is tokenized. Each token is evaluated.

  • The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.

  • The tokenTypes are identified. Even if the name was provided as Smith John, Smith would be identified as a SURNAME and John as a GIVEN name.
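The bin and biasedBin values in this example are consistent with the definition in the Common Terms table: the bin raised to the frequencyRankBias power. The Python sketch below assumes the default value of 0.97 is in effect; biased_bin is a hypothetical helper name, not an RNI function.

```python
FREQUENCY_RANK_BIAS = 0.97  # documented default for frequencyRankBias

def biased_bin(bin_value: float, bias: float = FREQUENCY_RANK_BIAS) -> float:
    """Raise the frequency bin to the bias power."""
    return bin_value ** bias

print(biased_bin(5))    # compare with the biasedBin reported for "john"
print(biased_bin(3.5))  # compare with the biasedBin reported for "smith"
```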



Example 7. Right input: Jon J Smyth
"rightInput": {
    "data": "Jon J. Smyth",
    "normalizedData": "jon j. smyth",
    "latnData": "jon j. smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "jon",
        "latnToken": "jon",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.2083122782666673,
        "tokenType": "UNKNOWN"
      },
     {
        "token": "j",
        "latnToken": "j",
        "bin": 8,
        "biasedBin": 7.5161819937120935,
        "tokenWeight": 0.08948764635417582,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.702200075379157,
        "tokenType": "UNKNOWN"
      }
    ],
    "entityType": "PERSON"
  },
  • The name is tokenized. Each token is evaluated.

  • The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.

  • The tokenTypes are identified. Since both Jon and Smyth are unusual spellings, the tokenType is not identified.



Example 8. Score tuples
"scoreTuples": [
    {
      "scoreInIsolation": 0.7595918889283346,
      "scoreInContext": 0.7595918889283346,
      "left": "john",
      "right": "jon",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.4912303477031893,
      "scoreInContext": 0.4666688303180298,
      "left": "john",
      "right": "jonj",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.542,
      "scoreInContext": 0.4743439389212776,
      "left": "john",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.2941408383164158,
      "scoreInContext": 0.279433796400595,
      "left": "johnsmith",
      "right": "jon",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.46557800000000005,
      "scoreInContext": 0.4422991,
      "left": "johnsmith",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 2,
      "rightMaxTokenIndex": 2
    },
    {
      "scoreInIsolation": 0.27169000000000004,
      "scoreInContext": 0.27169000000000004,
      "left": "",
      "right": "j",
      "marked": true,
      "reason": "DELETION",
      "leftMinTokenIndex": -1,
      "leftMaxTokenIndex": -1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],

All candidate tuples are compared. Only the tuples with marked set to true are used to calculate the score.

Matching tuples:

  1. John : Jon

  2. Smith : Smyth

  3. J : deleted (no match)
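As a minimal sketch of reading this output, the following Python keeps only the tuples with marked set to true and prints the aligned token pairs. The function and variable names are illustrative, and the tuple list is an abbreviated copy of Example 8; only the scoreTuples field names come from the response.

```python
import json

# Abbreviated copy of the scoreTuples in Example 8 (only the fields used here).
response_text = """
{"scoreTuples": [
  {"left": "john", "right": "jon", "marked": true,
   "reason": "HMM_MATCH", "scoreInContext": 0.7595918889283346},
  {"left": "john", "right": "j", "marked": false,
   "reason": "INITIAL_MATCH", "scoreInContext": 0.4743439389212776},
  {"left": "smith", "right": "smyth", "marked": true,
   "reason": "HMM_MATCH", "scoreInContext": 0.7237045473947534},
  {"left": "", "right": "j", "marked": true,
   "reason": "DELETION", "scoreInContext": 0.27169000000000004}
]}
"""


def marked_tuples(response):
    """Return only the tuples that were used for scoring (marked is true)."""
    return [t for t in response.get("scoreTuples", []) if t.get("marked")]


selected = marked_tuples(json.loads(response_text))
for t in selected:
    left = t["left"] or "(none)"
    right = t["right"] or "(none)"
    print(f'{left} : {right}  {t["reason"]}  {t["scoreInContext"]:.3f}')
```

An empty left or right value together with the DELETION reason indicates a token that had no counterpart in the other name.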



Example 9. Score adjustments
"scoreAdjustments": [
    {
      "unbiasedScore": 0.6829129823127231,
      "score": 0.6919264820086959,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6919264820086959,
      "score": 0.8435140063279181,
      "parameter": "finalBias"
    }
  ],

The unbiasedScore is the score before the parameter is applied. The score is the value after the parameter is applied.
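To see how much each parameter moved the score, you can subtract unbiasedScore from score for each entry. The sketch below uses the values from Example 9; the helper names are hypothetical and not part of the plugin. It also checks that, in this example, each adjustment starts from the score produced by the previous one.

```python
# Values copied from Example 9.
adjustments = [
    {"unbiasedScore": 0.6829129823127231, "score": 0.6919264820086959,
     "parameter": "adjustOneSidedDeletionScores"},
    {"unbiasedScore": 0.6919264820086959, "score": 0.8435140063279181,
     "parameter": "finalBias"},
]


def adjustment_deltas(entries):
    """Return (parameter, change) pairs, where change = score - unbiasedScore."""
    return [(e["parameter"], e["score"] - e["unbiasedScore"]) for e in entries]


def is_chained(entries):
    """True if each adjustment starts from the score the previous one produced."""
    return all(prev["score"] == cur["unbiasedScore"]
               for prev, cur in zip(entries, entries[1:]))


deltas = adjustment_deltas(adjustments)
for name, change in deltas:
    print(f"{name}: {change:+.4f}")
print("chained:", is_chained(adjustments))
```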



Example 10. Final score
"finalScore": 0.8435140063279181

The finalScore is the calculated score with all parameters applied. This is the similarity score returned by RNI.



Response schemas by object

The following sections list the JSON schema for each object type.

Name response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftMinTokenIndex": { "type": "integer", "default": 0 },
          "leftMaxTokenIndex": { "type": "integer", "default": 0 },
          "rightMinTokenIndex": { "type": "integer", "default": 0 },
          "rightMaxTokenIndex": { "type": "integer", "default": 0 }
        },
        "required": ["left", "right", "reason"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}
Address response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "leftMinTokenIndex": { "type": "number", "default": 0 },
          "leftMaxTokenIndex": { "type": "number", "default": 0 },
          "rightMinTokenIndex": { "type": "number", "default": 0 },
          "rightMaxTokenIndex": { "type": "number", "default": 0 }
        },
        "required": ["left", "right", "reason", "leftField", "rightField"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" },
    "fieldScores": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "score": { "type": "number", "default": 0.0 },
          "marked": { "type": "boolean", "default": false }
        },
        "required": ["leftField", "rightField"]
      }
    }
  }
}
Date response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "weight": { "type": "number", "default": 0.0 },
          "component": { "type": "string" },
          "differenceInDays": { "type": "integer" }
        },
        "required": ["left", "right", "component"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}
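
The schemas above can be used to sanity-check a response before processing it. The following is a minimal sketch, not a full JSON Schema validator: it handles only the type, required, properties, and items keywords, and the schema shown is a cut-down fragment of the scoreTuples definition. The check function and the sample data are illustrative.

```python
# Type predicates for the JSON Schema types used in the schemas above.
TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def check(instance, schema, path="$"):
    """Return a list of problems for a small subset of JSON Schema keywords."""
    expected = schema.get("type")
    if expected in TYPE_CHECKS and not TYPE_CHECKS[expected](instance):
        return [f"{path}: expected {expected}"]
    problems = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                problems.append(f"{path}: missing required '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                problems.extend(check(instance[key], subschema, f"{path}.{key}"))
    elif isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            problems.extend(check(item, schema["items"], f"{path}[{index}]"))
    return problems


# Fragment of the name response schema (scoreTuples only).
score_tuples_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "left": {"type": "string"},
            "right": {"type": "string"},
            "marked": {"type": "boolean"},
            "reason": {"type": "string"},
            "leftMinTokenIndex": {"type": "integer"},
        },
        "required": ["left", "right", "reason"],
    },
}

good = [{"left": "smith", "right": "smyth", "reason": "HMM_MATCH",
         "marked": True, "leftMinTokenIndex": 1}]
bad = [{"left": "smith", "marked": "yes"}]

print(check(good, score_tuples_schema))
print(check(bad, score_tuples_schema))
```

For production use, a complete JSON Schema validator library would cover the keywords this sketch ignores, such as default.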

Dynamic configuration endpoints

The plugin includes Elasticsearch REST APIs to customize and tune matching through stop words, token overrides, and parameter universes. These endpoints let you add and modify these configuration values without restarting Elasticsearch.

Tip

To use any of the configuration REST APIs, the parameter enableDynamicConfigurationEndpoints must be set to true in the parameter_profiles.yaml file in the any: profile. By default, this parameter is set to false. These endpoints should be used for testing and tuning only. When the dynamic configuration endpoints are enabled, they can slow the system down considerably.

The RNI-ES plugin relies on having an active primary or replica shard for each dynamic index available on every node. As a result, it is up to users to ensure each node has enough disk space to not exceed Elasticsearch watermark thresholds. Outside of this, users should never interact directly with the underlying dynamic configuration indices; all requests should go through the appropriate /rni_plugin endpoints.

Request timeouts

Whenever the RNI-ES plugin detects that one of its underlying dynamic indices has changed, it must fetch the entire index contents before the next name matching or indexing request. The timeout threshold for this fetch request is managed individually for each class of endpoints, and defaults to 60,000 ms. If this value is insufficient, you can configure it at plugin startup time with the bt.{override,stopword,parameter}.timeout Java property.

Tip

To use dynamic configuration endpoints in an Elasticsearch deployment using SSL encryption, the RNI-Elasticsearch plugin must be aware of the server's certificate file. To accomplish this, start Elasticsearch with:

ES_JAVA_OPTS="-Dbt.ssl.certificate=<path_to_certificate>"

Stop words

The _stopwords endpoint allows you to ADD, GET and DELETE stop words without restarting the Elasticsearch server. See Stop patterns and stop word prefixes for more detailed information on stop words.

The following properties are used when creating stop words. The entity_type is optional; all other fields are required when adding stop words through the API.

Table 10. Stop Word Properties

Property        Required   Description
lang            Yes        ISO 639-3 code for the language of the stop word(s).
stopword_type   Yes        Type of stop word(s), either regexes or prefixes.
entity_type     No         Entity type to which the stop word(s) apply; defaults to ALL.
stopwords       Yes        List of stop words to be added.



Note

Stop words are applied whenever a token is normalized, meaning stop words affect the name content that is included in the index. Therefore, changes to dynamic stop words require data to be reindexed to take effect.

Create stop words

The POST _stopwords method adds one or more stop words. The entity_type field is optional, but the other fields are all required.

curl -XPOST "http://localhost:9200/rni_plugin/_stopwords" -H 'Content-Type: application/json' -d '{
    "lang": "eng",
    "stopword_type": "prefixes",
    "entity_type": "PERSON",
    "stopwords": [
        "honorable",
        "senior correspondent"
    ]
}'
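
The same request can be prepared programmatically. The sketch below validates the required fields from Table 10 and wraps the payload in an unsent POST request using Python's standard library. The function name build_stopword_request and the base_url default are assumptions for illustration; the endpoint path and field names are taken from the curl example above.

```python
import json
import urllib.request

REQUIRED_FIELDS = ("lang", "stopword_type", "stopwords")
STOPWORD_TYPES = ("regexes", "prefixes")


def build_stopword_request(payload, base_url="http://localhost:9200"):
    """Validate a stop word payload and wrap it in an unsent POST request."""
    missing = [key for key in REQUIRED_FIELDS if not payload.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if payload["stopword_type"] not in STOPWORD_TYPES:
        raise ValueError("stopword_type must be 'regexes' or 'prefixes'")
    return urllib.request.Request(
        url=f"{base_url}/rni_plugin/_stopwords",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_stopword_request({
    "lang": "eng",
    "stopword_type": "prefixes",
    "entity_type": "PERSON",  # optional
    "stopwords": ["honorable", "senior correspondent"],
})
# To send the request: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```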
Get stop words

The GET _stopwords method returns all stop words for a given language and stop word type. You can search by just language or by language and type.

When no entity type is specified, the stop word is applied to all names in the language, those with and without entity types. Therefore, calls that specify a type such as PERSON or ORGANIZATION will also return all stop words that don't have an entity type specified.

Returns all prefix stop words for PERSON types in English:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON"

Returns all regex stop words for ORGANIZATION types in Spanish:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/regexes_spa_ORGANIZATION"

Returns all prefix stop words in English with no type specified. For some languages, this list is empty by default. In these cases, data will only be returned if you've populated the file with values:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng"
Delete stop words

The DELETE _stopwords method deletes a specified stop word. Deleting a stop word from a specific profile will also delete it from the any profile.

curl -XDELETE "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON/doctor"
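
The GET and DELETE examples above share one path pattern: the stop word type, the language, and an optional entity type joined with underscores, followed by the stop word itself when deleting. The following sketch builds those URLs. The helper names are hypothetical, and percent-encoding the stop word is an assumption rather than documented behavior.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:9200/rni_plugin/_stopwords"


def stopword_key(stopword_type, lang, entity_type=None):
    """Build the path segment, for example prefixes_eng or prefixes_eng_PERSON."""
    parts = [stopword_type, lang]
    if entity_type:
        parts.append(entity_type)
    return "_".join(parts)


def stopword_url(stopword_type, lang, entity_type=None, word=None):
    """Return the URL for a GET (no word) or a DELETE (with word) call."""
    url = f"{BASE_URL}/{stopword_key(stopword_type, lang, entity_type)}"
    if word is not None:
        # Assumption: the stop word is percent-encoded in the path.
        url += "/" + quote(word, safe="")
    return url


print(stopword_url("prefixes", "eng", "PERSON"))
print(stopword_url("prefixes", "eng"))
print(stopword_url("prefixes", "eng", "PERSON", word="doctor"))
```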

Token overrides

The _overrides endpoint allows you to ADD, GET and DELETE token pair overrides without restarting the Elasticsearch server. See Overriding token pair matches for more detailed information on token pair overrides.

The following properties are used when creating token overrides.

Table 11. Token Overrides Properties

Property      Required   Description
lang1         Yes        ISO 639-3 code for the language of the first name in the override pair.
lang2         Yes        ISO 639-3 code for the language of the second name in the override pair.
entity_type   No         Entity type of the list of token override pairs; defaults to "ALL".
selector      No         An alphanumeric string that specifies the selector value to apply for these overrides. NOTE: This property is only available in RNI-ES 8.6.2.0 and later.
token_pairs   Yes        List of token override pairs to be added.
token1        Yes        Token of the first name in the override pair; it should be in lang1.
token2        Yes        Token of the second name in the override pair; it should be in lang2.
type          No         The specific override type for the token pair. If omitted, the nickname type is used.
score         No         Raw score of the token pair, between 0.0 and 1.0. If omitted, the value of the nicknameOverrideScore parameter is used.
force         No         Whether to force the score to be exactly the given value for the token pair; defaults to false.



Note

RNI is designed so that override information is not included with indexed names. Therefore, changes to dynamic overrides do not require data to be reindexed to take effect.

Create override index

The override index must exist before you can start adding token overrides. To create the index:

curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_create"
Refresh override index

To force a refresh of the dynamic override index:

curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_refresh"
Create token overrides

The POST _overrides method adds one or more token overrides. The entity_type, selector, type, score, and force fields are optional, but the other fields are required.

curl -XPOST "http://localhost:9200/rni_plugin/_overrides" -H 'Content-Type: application/json' -d'{
    "lang1": "eng",
    "lang2": "eng",
    "entity_type": "PERSON",
    "token_pairs": [
        {
            "token1": "Abigail",
            "token2": "Abbey",
            "score": 0.74,
            "force": true
        },
        {
            "token1": "Aleksander",
            "token2": "Alex",
            "score": 0.74
        },
        {
            "token1": "Alfonso",
            "token2": "Alphonse",
            "type": "COGNATE"
        },
        {
            "token1": "Frederica",
            "token2": "Federica"
        }
    ]
}'
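
When override lists are generated from other data, building the body in code avoids hand-written JSON mistakes such as trailing commas. The sketch below assembles a token_pairs payload and leaves optional fields out when they are not set. The token_pair helper is hypothetical; the field names and the 0.0 to 1.0 score range come from Table 11.

```python
import json


def token_pair(token1, token2, score=None, override_type=None, force=None):
    """Build one token_pairs entry, omitting optional fields that are unset."""
    if score is not None and not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    pair = {"token1": token1, "token2": token2}
    if score is not None:
        pair["score"] = score
    if override_type is not None:
        pair["type"] = override_type
    if force is not None:
        pair["force"] = force
    return pair


payload = {
    "lang1": "eng",
    "lang2": "eng",
    "entity_type": "PERSON",
    "token_pairs": [
        token_pair("Abigail", "Abbey", score=0.74, force=True),
        token_pair("Alfonso", "Alphonse", override_type="COGNATE"),
        token_pair("Frederica", "Federica"),
    ],
}

# json.dumps always emits valid JSON, suitable as the POST body.
body = json.dumps(payload, indent=2)
print(body)
```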
Get token overrides

The GET _overrides method returns the overrides of a given language profile.

curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON"

You can also retrieve the score of a given override pair.

curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON?token1=abigel&token2=abigail"
Delete token overrides

The DELETE _overrides method deletes a given override pair. Deleting an override from a specific profile will also delete it from the any profile.

curl -XDELETE "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON/abigel+abigail"
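
The override URLs above follow a pattern: a profile key of the form lang1_lang2_ENTITYTYPE, then either token1 and token2 query parameters (GET) or a token1+token2 path segment (DELETE). The following sketch reproduces the URLs from the examples. The function names are illustrative, and the sketch assumes tokens that need no additional escaping in the DELETE path.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200/rni_plugin/_overrides"


def profile_key(lang1, lang2, entity_type):
    """Build the profile segment, for example hun_eng_PERSON."""
    return f"{lang1}_{lang2}_{entity_type}"


def get_override_url(lang1, lang2, entity_type, token1=None, token2=None):
    """URL for all overrides of a profile, or for one pair when both tokens are given."""
    url = f"{BASE_URL}/{profile_key(lang1, lang2, entity_type)}"
    if token1 and token2:
        url += "?" + urlencode({"token1": token1, "token2": token2})
    return url


def delete_override_url(lang1, lang2, entity_type, token1, token2):
    """URL for deleting one override pair, as in the DELETE example."""
    return f"{BASE_URL}/{profile_key(lang1, lang2, entity_type)}/{token1}+{token2}"


print(get_override_url("hun", "eng", "PERSON"))
print(get_override_url("hun", "eng", "PERSON", "abigel", "abigail"))
print(delete_override_url("hun", "eng", "PERSON", "abigel", "abigail"))
```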

Parameters

The _parameter_universe endpoint allows you to ADD, GET and DELETE parameters through parameter universes, without restarting the Elasticsearch server. See Parameter universe for more information on tuning parameters with parameter universes.

Note

While some parameters can impact the data that is included in the index, these parameters cannot be dynamically specified. Therefore, changes to dynamic parameters do not require data to be reindexed to take effect.

Add parameter(s)

The POST _parameter_universe method creates a parameter universe and the parameter profiles within the universe. Use this method to add or update a parameter value in a parameter universe. If you add a parameter universe that already exists, the existing universe is overwritten with the new values. Profile names in the request use the following syntax:

SomeParameterUniverseName/xxx_yyy where xxx_yyy is the language profile the parameters belong to, expressed in ISO 639-3 codes. The parameters field expects a list of parameters for the given profile, where the naming of the parameters should match the ones declared in parameter_defs.yaml.

curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'
  {
    "profiles": [
      {
        "name": "SomeParameterUniverseName/any",
        "parameters": {
          "translatorResultsToKeep": 4,
          "deletionScore": 0.269,
          "doQueryTokenOverrides": true,
          "fieldDeletionScore": 0.27,
          "yearDistanceWeight": 0.2
        }
      },
      {
        "name": "SomeParameterUniverseName/eng_eng",
        "parameters": {
          "HMMUsageThreshold": 0.8,
          "stringDistanceThreshold": 0.1,
          "useEditDistanceTokenScorer": true,
          "finalBias": 2.4,
          "reorderPenalty": 0.2
        }
      }
    ]
  }'
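
The profiles list in the request body can be generated from a simple mapping of language profile to parameters. The following is a sketch under that assumption; universe_payload is a hypothetical helper, while the "UniverseName/profile" naming and the parameter names are taken from the example above.

```python
import json


def universe_payload(universe, profiles):
    """Build the request body from a {profile: {parameter: value}} mapping.

    Each profile key is a language profile such as 'any' or 'eng_eng'.
    """
    return {
        "profiles": [
            {"name": f"{universe}/{profile}", "parameters": dict(parameters)}
            for profile, parameters in profiles.items()
        ]
    }


payload = universe_payload("SomeParameterUniverseName", {
    "any": {"deletionScore": 0.269, "doQueryTokenOverrides": True},
    "eng_eng": {"finalBias": 2.4, "reorderPenalty": 0.2},
})
print(json.dumps(payload, indent=2))
```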
Get parameter(s)

The GET _parameter_universe method retrieves parameter universes.

To retrieve a given parameter universe, the name of the parameter universe is provided as a path parameter:

curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"

If you include the name of the profile and a parameter, it returns the value of the parameter:

curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"
Delete parameter(s)

The DELETE _parameter_universe method deletes parameter universes.

To delete a specific parameter universe:

curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"

To delete a parameter for a specific profile within a parameter universe:

curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"
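
The GET and DELETE calls address either a whole universe or a single parameter written as profile.parameter. The sketch below builds both forms; universe_url is an illustrative name, and only the two path shapes shown in the examples above are supported.

```python
BASE_URL = "http://localhost:9200/rni_plugin/_parameter_universe"


def universe_url(universe, profile=None, parameter=None):
    """Return the URL for a whole universe or for one profile.parameter value."""
    if (profile is None) != (parameter is None):
        raise ValueError("provide both profile and parameter, or neither")
    url = f"{BASE_URL}/{universe}"
    if profile is not None:
        url += f"/{profile}.{parameter}"
    return url


print(universe_url("SomeParameterUniverseName"))
print(universe_url("SomeParameterUniverseName", "eng_eng", "reorderPenalty"))
```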

Note

Deleting a parameter from a specific parameter profile will also delete it from the any profile. The default value from the parameter_defs.yaml file is then used.

Pairwise match endpoint

You can perform a pairwise match between two rni_names, rni_dates, rni_addresses, or other datatypes through the POST _pair_match method. The results provide insight into how the match scores were calculated, including tokens and token scores. This endpoint can help you understand the impact a specific match parameter has on the final score, and can aid in testing and debugging RNI.

The type of pairwise match being performed is provided to the query, along with the values being compared (data1 and data2). You can also specify one or more parameters and see how they impact the match scores.

You may use the optional responseFormat URL parameter to control the format of the response. The default value is explainInfo, which produces the output format of plugin versions using SDK 7.43.0.c71.0 and later. Setting this parameter to legacyExplainInfo will produce the output of previous plugin versions.

Tip

We strongly recommend sending in complete strings and allowing RNI to perform tokenization. RNI includes weighting and other calculations which operate on the full string, enhancing the token matching scoring algorithms to improve match scores.

Request

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d'
{"dataPair": {"data1": "12/25/19","data2": "1/15/20"},
 "parameters": {
     "timeDistanceWeight": ".8",
     "stringDistanceWeight": "0"}}'

Response

{
  "leftInput": {
    "originalString": "12/25/19",
    "day": 25,
    "month": 12,
    "yearWithoutCentury": 19,
    "century": -1000,
    "modifiedJulianDay": -671643,
    "canonicalForm": "YY19-12-25"
  },
  "rightInput": {
    "originalString": "1/15/20",
    "day": 15,
    "month": 1,
    "yearWithoutCentury": 20,
    "century": -1000,
    "modifiedJulianDay": -671622,
    "canonicalForm": "YY20-01-15"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 0.9330329915368074,
      "scoreInContext": 0.9330329915368074,
      "left": "YY19",
      "right": "YY20",
      "marked": true,
      "weight": 0.2,
      "component": "YEAR_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6830201283771977,
      "scoreInContext": 0.6830201283771977,
      "left": "12",
      "right": "01",
      "marked": true,
      "weight": 0.2,
      "component": "MONTH_DISTANCE"
    },
    {
      "scoreInIsolation": 0.7071067811865476,
      "scoreInContext": 0.7071067811865476,
      "left": "25",
      "right": "15",
      "marked": true,
      "weight": 0.1,
      "component": "DAY_DISTANCE"
    },
    {
      "scoreInIsolation": 0.375,
      "scoreInContext": 0.375,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0,
      "component": "STRING_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6949591099211685,
      "scoreInContext": 0.6949591099211685,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0.8,
      "component": "TIME_PROXIMITY"
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.730683530798762,
      "score": 0.730683530798762,
      "parameter": "dateFinalBias"
    }
  ],
  "finalScore": 0.730683530798762
}
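
In this response, the weighted mean of the scoreInContext values, using each tuple's weight, reproduces the reported unbiasedScore. The sketch below shows the arithmetic with the values copied from the response. This is an observation about this one example, not a documented formula, and the weighted_mean helper is illustrative.

```python
# Component scores and weights copied from the response above.
score_tuples = [
    {"component": "YEAR_DISTANCE", "scoreInContext": 0.9330329915368074, "weight": 0.2},
    {"component": "MONTH_DISTANCE", "scoreInContext": 0.6830201283771977, "weight": 0.2},
    {"component": "DAY_DISTANCE", "scoreInContext": 0.7071067811865476, "weight": 0.1},
    {"component": "STRING_DISTANCE", "scoreInContext": 0.375, "weight": 0},
    {"component": "TIME_PROXIMITY", "scoreInContext": 0.6949591099211685, "weight": 0.8},
]


def weighted_mean(tuples):
    """Weighted mean of scoreInContext, using each tuple's weight."""
    total_weight = sum(t["weight"] for t in tuples)
    if total_weight == 0:
        raise ValueError("at least one tuple must have a non-zero weight")
    weighted_sum = sum(t["scoreInContext"] * t["weight"] for t in tuples)
    return weighted_sum / total_weight


combined = weighted_mean(score_tuples)
print(combined)
```

Because stringDistanceWeight was set to 0 in the request, the STRING_DISTANCE component does not contribute, while TIME_PROXIMITY carries the largest weight (0.8).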

Supported types

The following data types are supported by the pairwise match endpoint.

  • rni_name

  • rni_date

  • rni_address

  • date

  • keyword

  • text

  • string

  • integer

  • long

  • short

  • double

  • float

  • boolean

  • geo_point

Request

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=text" -H 'Content-Type: application/json' -d'
{
   "dataPair":
   {
    "data1": "word1",
    "data2": "word2"
  }
}'

Response

{
  "score" : 0.8333333333333334
}

Name matching example

Request

Parameters are specified directly in the request. The source language (language) of the name is optional, but recommended if known.

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_name" -H 'Content-Type: application/json' -d'
{
  "dataPair":   
   {    
     "data1":     
     {
      "data": "John Robert Edward Smith",
      "language": "eng",
      "entityType": "PERSON"

    },
    "data2":
    {
      "data": "John Smyth",
      "language": "eng",
      "entityType": "PERSON"
    }
  },
  "parameters": {
    "deletionScore": 0.469
  }
}'
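
As a sketch of calling the endpoint from code, the following Python assembles the same request: the type (and optional responseFormat) go in the query string, and the dataPair and parameters go in the JSON body. The pair_match_request function and its base_url default are assumptions for illustration; the request is built but not sent.

```python
import json
import urllib.request
from urllib.parse import urlencode


def pair_match_request(match_type, data1, data2, parameters=None,
                       response_format=None, base_url="http://localhost:9200"):
    """Build an unsent POST request for the _pair_match endpoint."""
    query = {"type": match_type}
    if response_format:
        query["responseFormat"] = response_format
    body = {"dataPair": {"data1": data1, "data2": data2}}
    if parameters:
        body["parameters"] = parameters
    return urllib.request.Request(
        url=f"{base_url}/rni_plugin/_pair_match?{urlencode(query)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = pair_match_request(
    "rni_name",
    {"data": "John Robert Edward Smith", "language": "eng", "entityType": "PERSON"},
    {"data": "John Smyth", "language": "eng", "entityType": "PERSON"},
    parameters={"deletionScore": 0.469},
)
# To send the request: urllib.request.urlopen(request)
print(request.full_url)
```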

Response

The response includes detailed information on how the names were matched.

{
  "leftInput": {
    "data": "John Robert Edward Smith",
    "normalizedData": "john robert edward smith",
    "latnData": "john robert edward smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.20817481793666764,
        "tokenType": "GIVEN"
      },
      {
        "token": "robert",
        "latnToken": "robert",
        "bin": 4,
        "biasedBin": 3.8370564773010574,
        "tokenWeight": 0.24879889758995669,
        "tokenType": "MIDDLE"
      },
      {
        "token": "edward",
        "latnToken": "edward",
        "bin": 4,
        "biasedBin": 3.8370564773010574,
        "tokenWeight": 0.24879889758995669,
        "tokenType": "MIDDLE"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.294227386883419,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  "rightInput": {
    "data": "John Smyth",
    "normalizedData": "john smyth",
    "latnData": "john smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.17348100675885905,
        "tokenType": "GIVEN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.8265189932411409,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 1,
      "scoreInContext": 1,
      "left": "john",
      "right": "john",
      "marked": true,
      "reason": "MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.469,
      "scoreInContext": 0.469,
      "left": "robertedward",
      "right": "",
      "marked": true,
      "reason": "DELETION",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 2,
      "rightMinTokenIndex": -1,
      "rightMaxTokenIndex": -1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 3,
      "leftMaxTokenIndex": 3,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.6686154265898526,
      "score": 0.6972609000299432,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6972609000299432,
      "score": 0.8468047657291401,
      "parameter": "finalBias"
    }
  ],
  "finalScore": 0.8468047657291401
}
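
In the responses shown in this guide, the tokenWeight values within each name sum to 1. The sketch below checks this for the two inputs above; it is an observation from these examples rather than a documented guarantee, and the helper name is hypothetical.

```python
import math

# tokenWeight values copied from leftInput and rightInput in the response above.
left_input = {"tokens": [
    {"token": "john", "tokenWeight": 0.20817481793666764},
    {"token": "robert", "tokenWeight": 0.24879889758995669},
    {"token": "edward", "tokenWeight": 0.24879889758995669},
    {"token": "smith", "tokenWeight": 0.294227386883419},
]}
right_input = {"tokens": [
    {"token": "john", "tokenWeight": 0.17348100675885905},
    {"token": "smyth", "tokenWeight": 0.8265189932411409},
]}


def total_token_weight(name_input):
    """Sum the tokenWeight values of one leftInput or rightInput object."""
    return sum(token["tokenWeight"] for token in name_input["tokens"])


for label, name_input in (("left", left_input), ("right", right_input)):
    total = total_token_weight(name_input)
    print(label, math.isclose(total, 1.0, abs_tol=1e-9))
```

This explains why "smyth" carries a weight of about 0.83 in the two-token name but "smith" only about 0.29 in the four-token name: the weights are relative to the other tokens in the same name.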

Address matching example

Request

The pairwise match endpoint supports both fielded and unfielded addresses. Fielded addresses must be specified as objects, while unfielded addresses must be specified as strings.

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_address" -H 'Content-Type: application/json' -d'
{
  "dataPair":
   {
    "data1":
     {
      "houseNumber": "101",
      "road": "Main st",
      "city": "Cambridge",
      "state": "Massachusetts",
      "country": "United States of America"
    },
    "data2": "101 Main St, Cambridge, MA, USA"
  }
}'

Response

The response includes a score marking the similarity of the two addresses as well as a type field describing the type of match observed. The response also includes detailed information on how each field was matched. In the example below, only part of the detailed response for HOUSE_NUMBER is included. This is not the complete response.

{
  "leftInput": {
    "fieldInputInfos": [
      {
        "data": "United States of America",
        "latnData": "United States of America",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "united",
            "latnToken": "united",
            "tokenWeight": 0.25
          },
          {
            "token": "states",
            "latnToken": "states",
            "tokenWeight": 0.25
          },
          {
            "token": "of",
            "latnToken": "of",
            "tokenWeight": 0.25
          },
          {
            "token": "america",
            "latnToken": "america",
            "tokenWeight": 0.25
          }
        ],
        "addressField": "COUNTRY",
        "normalizedData": "united states of america"
      },
      {
        "data": "Massachusetts",
        "latnData": "Massachusetts",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "massachusetts",
            "latnToken": "massachusetts",
            "tokenWeight": 1
          }
        ],
        "addressField": "STATE",
        "normalizedData": "massachusetts"
      },
      {
        "data": "Cambridge",
        "latnData": "Cambridge",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "cambridge",
            "latnToken": "cambridge",
            "tokenWeight": 1
          }
        ],
        "addressField": "CITY",
        "normalizedData": "cambridge"
      },
      {
        "data": "Main st",
        "latnData": "Main st",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "main",
            "latnToken": "main",
            "tokenWeight": 0.5
          },
          {
            "token": "st",
            "latnToken": "st",
            "tokenWeight": 0.5
          }
        ],
        "addressField": "ROAD",
        "normalizedData": "main st"
      },
      {
        "data": "101",
        "languageOfUse": "UNKNOWN",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "101",
            "latnToken": "101",
            "tokenWeight": 1
          }
        ],
        "addressField": "HOUSE_NUMBER",
        "normalizedData": "101"
      }
    ]
  },
  "rightInput": {
    "fieldInputInfos": [
      {
        "data": "usa",
        "latnData": "usa",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "usa",
            "latnToken": "usa",
            "tokenWeight": 1
          }
        ],
        "addressField": "COUNTRY",
        "normalizedData": "usa"
      },
      {
        "data": "ma",
        "latnData": "ma",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "ma",
            "latnToken": "ma",
            "tokenWeight": 1
          }
        ],
        "addressField": "STATE",
        "normalizedData": "ma"
      },
      {
        "data": "cambridge",
        "latnData": "cambridge",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "cambridge",
            "latnToken": "cambridge",
            "tokenWeight": 1
          }
        ],
        "addressField": "CITY",
        "normalizedData": "cambridge"
      },
      {
        "data": "main st",
        "latnData": "main st",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "main",
            "latnToken": "main",
            "tokenWeight": 0.5
          },
          {
            "token": "st",
            "latnToken": "st",
            "tokenWeight": 0.5
          }
        ],
        "addressField": "ROAD",
        "normalizedData": "main st"
      },
      {
        "data": "101",
        "languageOfUse": "UNKNOWN",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "101",
            "latnToken": "101",
            "tokenWeight": 1
          }
        ],
        "addressField": "HOUSE_NUMBER",
        "normalizedData": "101"
      }
    ]
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 1,
      "scoreInContext": 1,
      "left": "101",
      "right": "101",
      "marked": true,
      "reason": "MATCH",
      "leftField": "HOUSE_NUMBER",
      "rightField": "HOUSE_NUMBER",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.3,
      "scoreInContext": 0.3,
      "left": "101",
      "right": "",
      "marked": false,
      "reason": "DELETION",
      "leftField": "HOUSE_NUMBER",
      "rightField": "HOUSE_NUMBER",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": -1,
      "rightMaxTokenIndex": -2
    },
...
  }
}
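
To work with the detailed response programmatically, a client can walk the scoreTuples array and report which token spans were compared and why. The following is a minimal Python sketch, not part of the plugin. The helper name summarize_score_tuples is hypothetical, and the sample data is copied from the two HOUSE_NUMBER tuples shown above. This guide does not define the marked flag in this section, so using it as a filter is an assumption.

```python
import json

# Two score tuples copied from the truncated example response above.
SAMPLE = json.loads("""
{
  "scoreTuples": [
    {"scoreInIsolation": 1, "scoreInContext": 1, "left": "101", "right": "101",
     "marked": true, "reason": "MATCH",
     "leftField": "HOUSE_NUMBER", "rightField": "HOUSE_NUMBER"},
    {"scoreInIsolation": 0.3, "scoreInContext": 0.3, "left": "101", "right": "",
     "marked": false, "reason": "DELETION",
     "leftField": "HOUSE_NUMBER", "rightField": "HOUSE_NUMBER"}
  ]
}
""")


def summarize_score_tuples(response, marked_only=True):
    """Return (leftField, left, right, reason, scoreInContext) per tuple.

    Hypothetical helper. By default it keeps only tuples whose "marked"
    flag is true (an assumption about how a client might filter).
    """
    rows = []
    for t in response.get("scoreTuples", []):
        if marked_only and not t.get("marked", False):
            continue
        rows.append(
            (t["leftField"], t["left"], t["right"], t["reason"], t["scoreInContext"])
        )
    return rows


for row in summarize_score_tuples(SAMPLE):
    print(row)
```

Passing marked_only=False returns every tuple, including the DELETION entry.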

Date matching example

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d'
{"dataPair": {"data1": "12/25/19","data2": "1/15/20"},
 "parameters": {
     "timeDistanceWeight": ".8",
     "stringDistanceWeight": "0"}}'

Response

The response includes detailed information on how the dates were matched.

{
  "leftInput": {
    "originalString": "12/25/19",
    "day": 25,
    "month": 12,
    "yearWithoutCentury": 19,
    "century": -1000,
    "modifiedJulianDay": -671643,
    "canonicalForm": "YY19-12-25"
  },
  "rightInput": {
    "originalString": "1/15/20",
    "day": 15,
    "month": 1,
    "yearWithoutCentury": 20,
    "century": -1000,
    "modifiedJulianDay": -671622,
    "canonicalForm": "YY20-01-15"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 0.6949591099211685,
      "scoreInContext": 0.6949591099211685,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0.8,
      "component": "TIME_DISTANCE",
      "differenceInDays": 21
    },
    {
      "scoreInIsolation": 0.9330329915368074,
      "scoreInContext": 0.9330329915368074,
      "left": "YY19",
      "right": "YY20",
      "marked": true,
      "weight": 0.2,
      "component": "YEAR_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6830201283771977,
      "scoreInContext": 0.6830201283771977,
      "left": "12",
      "right": "01",
      "marked": true,
      "weight": 0.2,
      "component": "MONTH_DISTANCE"
    },
    {
      "scoreInIsolation": 0.7071067811865476,
      "scoreInContext": 0.7071067811865476,
      "left": "25",
      "right": "15",
      "marked": true,
      "weight": 0.1,
      "component": "DAY_DISTANCE"
    },
    {
      "scoreInIsolation": 0.375,
      "scoreInContext": 0.375,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0,
      "component": "STRING_DISTANCE"
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.730683530798762,
      "score": 0.730683530798762,
      "parameter": "dateFinalBias"
    }
  ],
  "finalScore": 0.730683530798762
}
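
In this example, finalScore is numerically consistent with a weighted average of the scoreInContext values, using each tuple's weight. This is an observation about the sample output above, not a documented formula. Note also that unbiasedScore and score are equal here, so dateFinalBias did not change the result. The Python sketch below, with the hypothetical helper weighted_average, reproduces the arithmetic and checks that differenceInDays equals the gap between the two modifiedJulianDay values.

```python
# (scoreInContext, weight) pairs copied from the scoreTuples above.
SCORE_TUPLES = [
    (0.6949591099211685, 0.8),  # TIME_DISTANCE
    (0.9330329915368074, 0.2),  # YEAR_DISTANCE
    (0.6830201283771977, 0.2),  # MONTH_DISTANCE
    (0.7071067811865476, 0.1),  # DAY_DISTANCE
    (0.375, 0.0),               # STRING_DISTANCE (weight 0, so it contributes nothing)
]
FINAL_SCORE = 0.730683530798762  # finalScore from the response above


def weighted_average(pairs):
    """Weighted mean of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in pairs)
    return sum(score * weight for score, weight in pairs) / total_weight


estimate = weighted_average(SCORE_TUPLES)
print("weighted average:", estimate)
print("agrees with finalScore:", abs(estimate - FINAL_SCORE) < 1e-9)

# differenceInDays compared with the two modifiedJulianDay values.
day_gap = -671622 - (-671643)
print("day gap:", day_gap)
```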

Fully supported text domains for name matching

The following tables describe the domain pairings for which RNI provides full support. All other domain pairings have limited support, as described in Language support parameters. A domain refers to the language and script of a piece of text. For example, one domain might be Latin (Latn) script in the English (eng) language.

Note

"Language" in this appendix refers to the language of use, the language of the document in which the name is found, which may not be the language of origin associated with the name. If the language of use is undetermined, use unknown (xxx).

Note

Prior to release 7.36.0, RNI did not support any limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Name matching within a language

The first table identifies the languages, and for each language the writing scripts that Rosette Name Indexer fully supports.

Cross-language matches

This table identifies the range of cross-language searching and matching that Rosette Name Indexer and name matching fully support. If your query is a name in an Arabic document in Arabic script, the query may return one or more names in English documents in Latin script, in addition to names from Arabic documents in Arabic script. If the query is a name in English and Latin script, it may return documents from any of the supported languages and their native scripts.

Note

For the supported scripts for each language, see the table in Name matching within a language.

Appendix

Match phenomena

Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.

Name

Description

Example

CONFLICT

The tokens do not match.

When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT.

DELETION

The token is unmatched.

When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION.

EMBEDDING_MATCH

The tokens are semantically similar as determined by word-embedding vectors.

When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH.

FIELD_BLOCKED

This field cannot be matched because of a cross-field match involving the same field in the other name.

When comparing "Bob|William|Smith" with "William||Smith", "bob" is a FIELD_BLOCKED since the cross-field william match prevents it from matching with its corresponding field.

FIELD_CONFLICT

When comparing two names that are divided into fields, these fields do not match.

When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT.

FIELD_DELETION

When comparing two names that are divided into fields, this field is unmatched.

When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION.

GIVEN_NAME_DELETION

When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched.

When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a Given_name field.

HANI_ABBREVIATION

One Hani token appears to be an abbreviation of another Hani token.

"北京大学" and "北大" are a HANI_ABBREVIATION match.

HMM_MATCH

The tokens are similar but not identical, and the match was determined by a hidden Markov model (HMM). This is a type of fuzzy match.

"richard" and "richerd" are an HMM_MATCH.

INITIALISM

One token is a name and the other token is the initials of the words which make up the name.

"john fitzgerald kennedy" and "JFK" are an INITIALISM.

"consumer value stores" and "CVS" are an INITIALISM.

INITIAL_MATCH

One token is the first initial of the other.

"w" and "william" are an INITIAL_MATCH.

LANGUAGE_SPECIFIC_MATCH

The match was determined by a language-specific matcher.

"laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH.

MATCH

The tokens are identical (after stop word elimination and normalization).

"john" and "john" are a MATCH.

NULL

The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object.

N/A

OUT_OF_ORDER_DELETION

This unmatched token still leaves the remaining tokens out of order when it is removed.

When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION.

OVERRIDE

The tokens appear as a pair on the override list. This is often used for nicknames.

"john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list.

PREFIX_INITIAL

One token is an initial that matches a prefix in the other token.

In practice, the PREFIX_INITIAL phenomenon is rare.

If the initialsScore parameter is set to 0.1, "E Silva" and "EduardoSil" will be a PREFIX_INITIAL match.

STRING_SIMILARITY

The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match.

"akcd" and "xkcd" are a STRING_SIMILARITY match.

STUCK_INITIAL

One name appears to have an initial mistakenly attached to a preceding token.

"DavidK" and "David Keith" are a STUCK_INITIAL match.

SURNAME_DELETION

When comparing two names that are divided into fields, the SURNAME field is unmatched.

When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a Surname field.

TRAILING_PATRONYMIC_DELETION[a]

The unmatched token is a patronymic which has been truncated in the other name.

When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION.

TRUNCATED_EXACT_MATCH

The tokens are identical except that one has been slightly truncated.

"murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH.

TRUNCATED_HMM_MATCH

The tokens are similar, but not identical, and one has been slightly truncated.

"gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH.

UNKNOWN_FIELD_MATCH

One of the tokens is part of an "unknown" field in a fielded name.

The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API.

When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH.

[a] Only applies to Latin script names of Arabic origin.
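
When post-processing score tuples, it can be useful to separate phenomena that describe a successful token match from those that describe unmatched, conflicting, or blocked tokens. The Python sketch below is one reading of the table above, not an official classification. The set UNMATCHED_PHENOMENA and the helper split_by_outcome are hypothetical names, and the grouping is inferred from the descriptions in the table. The input is a list of phenomenon names, such as the reason values returned in scoreTuples.

```python
# Phenomena whose descriptions in the table say the tokens or fields are
# unmatched, in conflict, or blocked. This grouping is an assumption
# inferred from the table, not an RNI API.
UNMATCHED_PHENOMENA = {
    "CONFLICT",
    "DELETION",
    "FIELD_BLOCKED",
    "FIELD_CONFLICT",
    "FIELD_DELETION",
    "GIVEN_NAME_DELETION",
    "OUT_OF_ORDER_DELETION",
    "SURNAME_DELETION",
    "TRAILING_PATRONYMIC_DELETION",
}


def split_by_outcome(reasons):
    """Split phenomenon names into (matched, unmatched), preserving order."""
    matched = [r for r in reasons if r not in UNMATCHED_PHENOMENA]
    unmatched = [r for r in reasons if r in UNMATCHED_PHENOMENA]
    return matched, unmatched


matched, unmatched = split_by_outcome(
    ["MATCH", "DELETION", "HMM_MATCH", "FIELD_CONFLICT"]
)
print("matched:", matched)
print("unmatched:", unmatched)
```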

Parameters

This table lists the parameters that can be configured via parameter_profiles.yaml. Where applicable, each parameter is linked to the specific match phenomena that it impacts. You can find more information on each parameter in parameter_defs.yaml.

Table 12. Parameter impacts

Parameter name

Applies to

Impacts

addressCrossFieldScoreThreshold

Addresses

Can affect any kind of match

addressDeletionScore

Addresses

DELETION match phenomenon

addressDifferentGroupPenalty

Addresses

Can affect any kind of match

addressFinalBias

Addresses

All address match scores

addressJoinedTokenLimit

Addresses

Concatenation[a]

addressOverrideDefaultScore

Addresses

OVERRIDE match phenomenon

addressOverrideTablePath

File locations

Internal engineering detail

addressReorderPenalty

Addresses

Reordering[e]

addressSameGroupPenalty

Addresses

Can affect any kind of match

addressStopPatternsPath

File locations

Internal engineering detail

addressUnpairedFieldScore

Addresses

FIELD_DELETION match phenomenon

adjustOneSidedDeletionScores

All names

DELETION match phenomenon

allowNullValue

Elasticsearch

Elasticsearch setting

alternativePairsToCheck

All names

Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)

alternativeTimeProximityMatch

Dates

All date match scores

boostWeightAtBothEnds

All names

All name match scores

boostWeightAtLeftEnd

All names

All name match scores

boostWeightAtRightEnd

All names

All name match scores

caseSensitiveData

All names

INITIALISM match phenomenon

cityAddressFieldWeight

Addresses

Weighting

cityDistrictAddressFieldWeight

Addresses

Weighting

cognateOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as COGNATE in the override file

conflictScore

All names

CONFLICT match phenomenon

conflictThreshold

All names

CONFLICT match phenomenon

countryAddressFieldWeight

Addresses

Weighting

countryRegionAddressFieldWeight

Addresses

Weighting

crossFieldInitialsPenalty

Fielded names

INITIAL_MATCH match phenomenon

crossFieldJoinInitialPenalty

Fielded names

Concatenation[a]

crossFieldJoinPenalty

Fielded names

Concatenation[a]

crossFieldMatchPenalty

Fielded names

Can affect any kind of match

crossLanguageGenderConflictPenalty

All names

Gender mismatch[b]

dateFinalBias

Dates

All date match scores

dateOrdering

Dates

All date match scores

dayDistanceWeight

Dates

All date match scores

deletionScore

All names

DELETION match phenomenon

detectableLanguagesModelBased

All names

Can affect any kind of match where one or more names is in Latin script and the language is not already specified.

detectableLanguagesRuleBased

All names

Currently, you can only enable detection of Latin script as Turkish or Vietnamese.

editDistanceScoreBias

All names

Can affect any kind of match

enableDynamicConfigurationEndpoints

Elasticsearch

Elasticsearch setting

enablePromisingTermFiltering

Speed/Accuracy

Performance only

enableYueReadings

All names

Names written in Han script

entranceAddressFieldWeight

Addresses

Weighting

equivalenceClassesPath

File locations

Internal engineering detail

estimatedConflictOrDeletionScore

All names

Internal engineering detail

exactLatnMatchScore

All names

Token normalization

expensiveScorerJoinedTokenLimit

All names

Concatenation[a]

fieldBlockedScore

Fielded names

OUT_OF_ORDER_DELETION match phenomenon

fieldConflictScore

Fielded names

CONFLICT match phenomenon

fieldDeletionScore

Fielded names

DELETION match phenomenon

finalBias

All names

All name match scores

frequencyRankBias

All names

Can affect any kind of match

genderConflictPenalty

All names

Gender mismatch[b]

genderConflictPenaltyThreshold

All names

Gender mismatch[b]

globalTokenCacheConfig

Speed/Accuracy

Performance only

globalTokenPairCacheConfig

Speed/Accuracy

Performance only

haniAbbreviationScore

All names

INITIALISM match phenomena in Han script

haniAbbreviationThreshold

All names

INITIALISM match phenomena in Han script

haniFourCornerCodeMismatchPenalty

All names

Names written in Han script

hmmNormalizationAlternative

All names

HMM_MATCH phenomenon

hmmScoreBias

All names

HMM_MATCH phenomenon

hmmScoreLimit

All names

HMM_MATCH phenomenon

houseAddressFieldWeight

Addresses

Weighting

houseNumberAddressFieldWeight

Addresses

Weighting

ignoreBadData

Elasticsearch

Elasticsearch setting

improveSingleDigitManipulationMatch

Dates

Date match scores containing exactly one instance of digit manipulation[c] and no other differences

initialFrequencyRank

All names

INITIAL_MATCH match phenomenon

initialismMismatchPenalty

All names

HMM_MATCH phenomenon

initialismScore

All names

INITIALISM match phenomenon

initialsConflictScore

All names

CONFLICT match phenomenon

initialsDeletionPenalty

All names

DELETION match phenomenon

initialsScore

All names

INITIAL_MATCH match phenomenon

islandAddressFieldWeight

Addresses

Weighting

joinedTokenInitialsPenalty

All names

Concatenation[a]

INITIAL_MATCH match phenomenon

joinedTokenLimit

All names

Concatenation[a]

joinedTokenPenalty

All names

Concatenation[a]

levelAddressFieldWeight

Addresses

Weighting

libpostalDataDirPath

File locations

Internal engineering detail

lowWeightTokenFrequencyRank

All names

Can affect any kind of match

lowWeightTokenPath

File locations

Internal engineering detail

maximumAlternateTokenizationRelativeDistance

All names

Affects tokenization and therefore any potential score

maximumOrganizationInitialismLength

Organization names

INITIALISM match phenomenon

maximumPersonInitialismLength

Person names

INITIALISM match phenomenon

maxYearDistanceForDigitManipulation

Dates

Date match scores containing exactly one instance of digit manipulation[c] and no other differences.

minFieldWeightFactor

Fielded names

Weighting

minimumAlternateTokenizationLength

All names

Affects tokenization and therefore any potential score

minimumOrganizationInitialismLength

Organization names

INITIALISM match phenomenon

minimumPersonInitialismLength

Person names

INITIALISM match phenomenon

monthDistanceWeight

Dates

All date match scores

nameBigramQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameDoubleMetaphoneQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameGluedQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameInitialQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameLengthMismatchPenalty

All names

DELETION match phenomenon

Concatenation[a]

Any phenomenon that changes the number of tokens in a name

nameRealWorldIdQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

ngramLMPath

File locations

Internal engineering detail

ngramThresholdPath

File locations

Internal engineering detail

nicknameOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as NICKNAME in override file

numericTokenFrequencyRank

All names

Can affect any kind of match

outOfOrderDeletionScore

All names

OUT_OF_ORDER_DELETION match phenomenon

parseUnknownFieldMarker

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

poBoxAddressFieldWeight

Addresses

Weighting

postCodeAddressFieldWeight

Addresses

Weighting

queryAlternativeOriginLanguages

Speed/Accuracy

Can affect any kind of match

realWorldIdsPath

File locations

Internal engineering detail

realWorldIdsPathUser

File locations

Internal engineering detail

reorderCorrection

All names

Rotation[d]

reorderCorrectionThreshold

All names

Rotation[d]

reorderPenalty

All names

Reordering[e]

rniFullnameOverridesPath

File locations

Internal engineering detail

rntFullnameOverridesPath

File locations

Internal engineering detail

roadAddressFieldWeight

Addresses

Weighting

sameNameUnknownFieldMatchInterpolator

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

staircaseAddressFieldWeight

Addresses

Weighting

stateAddressFieldWeight

Addresses

Weighting

stateDistrictAddressFieldWeight

Addresses

Weighting

stopPatternsPath

File locations

Internal engineering detail

stringDistanceWeight

Dates

All date match scores

stuckInitialAffixMinLength

All names

STUCK_INITIAL match phenomenon

stuckInitialScore

All names

STUCK_INITIAL match phenomenon

suburbAddressFieldWeight

Addresses

Weighting

thresholdToDropoffBiasMapping

Dates

All date match scores

timeDistanceWeight

Dates

All date match scores

timeProximityYearInterval

Dates

All date match scores

tokenizeOrganizationsWithNumbers

Organization names

Affects tokenization and therefore any potential score

tokenOverridesPath

File locations

Internal engineering detail

trailingPatronymicDeletionScore

Person names

TRAILING_PATRONYMIC_DELETION match phenomenon

truncationFractionLimit

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationScorerBias

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

tryAlternateTokenization

All names

Affects tokenization and therefore any potential score

tryDayMonthSwap

Dates

All date match scores

unigramLMPath

File locations

Internal engineering detail

unitAddressFieldWeight

Addresses

Weighting

unknownFieldFrequencyRank

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

unknownVsKnownScore

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

unknownVsUnknownScore

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

useEmbeddings

Organization names

EMBEDDING_MATCH match phenomenon

useSolrPhraseQueries

Solr

Solr plugin setting

variantOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as VARIANT in the override file

worldRegionAddressFieldWeight

Addresses

Weighting

yearDistanceWeight

Dates

All date match scores

[a] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.

[b] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.

[c] A digit manipulation is a transformation of one digit into another that can be accomplished by adding a minimal number of lines (strokes). Digit manipulations may be intentional or the result of an OCR error. The list of possible digit manipulations includes: 0↔8, 1↔7, 3↔8, 5↔8, 5↔6, 6↔8, 7↔2.

[d] If the tokens in a name have been rotated, the reorder penalty will negatively impact the match score. RNI detects and compensates for this error.

[e] Tokens that match, but that appear to be out-of-order, have their match scores adjusted to reflect that fact.
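
To make footnote [c] concrete, the Python sketch below tests whether two equal-length digit strings differ by exactly one of the listed digit manipulations. It is a string-level illustration only: this guide does not describe how RNI detects digit manipulations internally, the helper name is_single_digit_manipulation is hypothetical, and because the footnote says the list "includes" these pairs, the set may not be exhaustive.

```python
# Digit pairs listed in footnote [c]; each pair is treated as symmetric.
_PAIRS = [("0", "8"), ("1", "7"), ("3", "8"), ("5", "8"),
          ("5", "6"), ("6", "8"), ("7", "2")]
DIGIT_MANIPULATIONS = {frozenset(pair) for pair in _PAIRS}


def is_single_digit_manipulation(a, b):
    """Return True if a and b have the same length, differ in exactly one
    position, and the differing characters form a listed digit pair."""
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and frozenset(diffs[0]) in DIGIT_MANIPULATIONS


print(is_single_digit_manipulation("1971", "1977"))  # differs by 1 vs 7
print(is_single_digit_manipulation("1971", "1972"))  # 1 vs 2 is not listed
```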



Internal parameters

This table lists the parameters that can be configured via internal_param_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in internal_param_defs.yaml.

Important

We recommend against modifying these parameters unless advised to by Rosette support.

Name

Applies to

Impacts

affixGlueThreshold[a]

All names and addresses

Concatenation[b]

allLanguageSupport

All names

Can affect any kind of match

allowCacheBonuses

All names

Internal engineering detail

alwaysComputeSuffixes[a]

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

araRNISpeedOption

Translated names

Speed and accuracy tradeoff

crossSurnameMatchPenalty

All names

Matches in languages with onomastic information

debuggableIndex

N/A

Internal engineering detail

Has no effect on matching

debugPrintTuples

N/A

Internal engineering detail

Has no effect on matching

defaultScoreToCheckRestriction

All names

Dates

Addresses

First-pass scoring

disabledLanguages

All names

doFrontTruncations

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

doQueryBigrams

All names

First-pass accuracy

doQueryCompleted

All names

First-pass accuracy

doQueryFullnameOverrides

All names

First-pass accuracy

doQueryFuzzy

All names

First-pass accuracy

doQueryGlued

All names

First-pass accuracy

doQueryIndexKeys

All names

First-pass accuracy

doQueryInitials

All names

First-pass accuracy

doQueryNormalized

All names

First-pass accuracy

doQueryPersonInitialisms

All names

First-pass accuracy

doQueryPhrase

All names

First-pass accuracy

doQueryRealWorldIds

All names

First-pass accuracy

doQueryTokenOverrides

All names

First-pass accuracy

doQueryTranslated

All names

First-pass accuracy

doViterbiRescaling

All names

Fuzzy match[c]

editDistanceTokenScorerPenalty

All names

STRING_SIMILARITY match phenomenon

embeddingBias

Organization names

EMBEDDING_MATCH match phenomenon

embeddingZeroScore

Organization names

EMBEDDING_MATCH match phenomenon

enableAdditionalOnomastics

All names

Matches in languages with onomastic information

enableRemoteTokenScorer

All names

Japanese/English fuzzy match[c]

enableSeq2SeqTokenScorer

All names

Japanese/English fuzzy match[c]

enableTokenPairLogging

N/A

Internal engineering detail

Has no effect on matching

engEngFastMode

All names

English/English fuzzy match[c]

expandedLanguages

All names

Fuzzy match[c]

OVERRIDE match phenomenon

expansionLimit

All names

Fuzzy match[c]

OVERRIDE match phenomenon

expansionScoreThreshold

All names

Fuzzy match[c]

OVERRIDE match phenomenon

familiarTokenMismatchPenalty

All names

Can affect any kind of match

familiarTokenThreshold

All names

Can affect any kind of match

firstPassDayRange

Dates

Performance only

firstPassMonthRange

Dates

Performance only

firstPassYearRange

Dates

Performance only

foreignAddressFinalBias

Addresses

All English-to-non-English address matches

genderPenaltyMinimumLength

All names

Gender mismatch[d]

givenFieldDeletionScore

Fielded names

DELETION match phenomenon

HMMCachePerProcess

All names

Internal engineering detail

HMM_MATCH phenomenon

HMMCachePerThread

All names

Internal engineering detail

HMM_MATCH phenomenon

hmmNormBias

All names

Internal engineering detail

Fuzzy match[c]

HMMUsageThreshold

All names

Internal engineering detail

HMM_MATCH phenomenon

identifierEditDistanceTokenScorerPenalty

Identifiers

STRING_SIMILARITY match phenomenon

ignoreTranslationOrigins

All names

Can affect any kind of match that uses English transliteration

includeExtraKatakanaPersonReadings

Translated names

Can affect any kind of match

initialAndSuffixMinLength

All names

Fuzzy match[c]

INITIAL_MATCH match phenomenon

initialAndSuffixScore

All names

Fuzzy match[c]

INITIAL_MATCH match phenomenon

jniBias

All names

Can affect any kind of match in languages that use a JNI scorer

jpnRNISpeedOption

Translated names

Speed and accuracy tradeoff

kanjiMismatchPenalty

All names

Normalization of tokens that include kanji

katakanaTransliterationsOnly

Translated names

Can affect any kind of match

korRNISpeedOption

Translated names

Speed and accuracy tradeoff

latinDataAlternativesToCheck

All names

Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)

limitedLanguageEditDistance

All names

STRING_SIMILARITY match phenomenon

maxIdentifierEditDistance

All names

First-pass accuracy

notExactMatchPenalty

All names

Normalization

_postCodePathAddressFieldWeight

Addresses

Weighting

promisingFuzzyTermFrequencyFactor

Speed/Accuracy

Performance only

promisingTermFrequencyFactor

Speed/Accuracy

Performance only

queryMaxResults

All names

Dates

Addresses

First-pass scoring

queryMaxToCheck

All names

Dates

Addresses

First-pass scoring

queryMaxToConsider

All names

Dates

Addresses

First-pass scoring

queryToCheckAllowance

All names

Dates

Addresses

First-pass scoring

realWorldIdScore

Organization names

Real-world id match[e]

remoteTokenScorerURL

All names

Internal engineering detail

Japanese/English fuzzy match[c]

rntTokenOverridesPath

File locations

Internal engineering detail

rusRNISpeedOption

Translated names

Speed and accuracy tradeoff

secondarySurnameTokenTypeWeight

All names

Matches in languages with onomastic information

seq2seqCachePerProcess

All names

Internal engineering detail

Japanese/English fuzzy match[c]

seq2seqCachePerThread

All names

Internal engineering detail

Japanese/English fuzzy match[c]

seq2seqTokenOverridesPath

File locations

Internal engineering detail

seq2seqUsageThreshold

All names

Internal engineering detail

Japanese/English fuzzy match[c]

splitTokens

All names

Internal engineering detail

stringDistanceThreshold[a]

All names

Fuzzy match[c]

surnameFieldDeletionScore

Fielded names

DELETION match phenomenon

surnameTokenTypeWeight

All names

Matches in languages with onomastic information

taggerMinimumConfidenceThreshold

All names

Matches in languages with onomastic information

translatorResultsToKeep

Translated names

Can affect any kind of match

truncationAffixSimilarityLength

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationAffixSimilarityThreshold

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationLengthLimit

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

useCharacterLM

All names

Can affect any kind of match

useEditDistanceTokenScorer

All names

STRING_SIMILARITY match phenomenon

useIdentifierEditDistanceTokenScorer

Identifiers

STRING_SIMILARITY match phenomenon

useLM

All names

Can affect any kind of match

useOldAndNewNameSegmentationForJapanese

All names

Can affect any kind of match involving Japanese translations

useRealWorldIds

Organization names

Real-world id match[e]

zhoRNISpeedOption

Translated names

Speed and accuracy tradeoff

[a] Unlike public parameters for this feature, this is a speed/accuracy tradeoff, not a science-tuning parameter.

[b] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.

[c] A fuzzy match is a match between tokens that are similar but not identical. The HMM_MATCH and SEQ2SEQ_MATCH phenomena are examples of this.

[d] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.

[e] RNI contains real-world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name.




[1] Copyright by Elasticsearch BV. Dual licensed under Server Side Public License (SSPL version 1) and the Elastic License 2.0 (ELv2).

[2] For RNI plugins that support earlier versions of Elasticsearch (such as 1.x.y), contact support@rosette.com.

[3] # may also be used after an entry on the same line to begin a comment.

[4] Override files are not provided for all supported languages. Specifically, while no files are provided for Russian or Korean, you can create token pair files for these languages.

[5] RNI depends on the jpostal binding for the open source libpostal library to parse unfielded addresses as a pre-processing step. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support@basistech.com if you discover any issues.

[6] # may also be used after an entry on the same line to begin a comment.

[7] # may also be used after an entry on the same line to begin a comment.