RNI-Elasticsearch Plugin Guide

Introduction

RNI-Elasticsearch is an Elasticsearch^[1] plugin for building fuzzy name retrieval and name matching applications for persons, locations, and organizations. It uses Rosette Name Indexer (RNI), implementing high-speed, scalable, cross-language, and cross-script searches with the Elasticsearch full-text search engine to store the names and search keys.

This guide describes how to use the RNI-Elasticsearch plugin and RNI features, and is not intended to be a complete guide to Elasticsearch.

Supported Platforms

RNI-Elasticsearch is supported on the following operating systems and CPUs.

Table 1. Supported Platforms

OS	CPU
MAC OS X v10.9+ (Darwin 13)	AMD64
Linux	AMD64
Linux	AARCH64
Windows	AMD64
Java Only	Any OS and CPU with 64-bit Java SDK 17 through 20

Language support

RNI can match names in any language. For the languages listed in Fully supported text domains for name matching, RNI calculates a match score using a variety of techniques, as described in Understanding name match scores. For names in languages not listed in those tables, RNI provides limited support, as described in Language support parameters.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Interpreting RNI scores

Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.

RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).

The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, how the scores rank relative to one another is more relevant than the actual score values. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding name match scores.

Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:

Variation	Example(s)
Phonetic and/or spelling differences	Nayif Hawatmeh and Nayif Hawatma
Missing name components	Mohammad Salah and Mohammad Abd El-Hamid Salah
Rarity of a shared name component	Two English names that contain Ditters are more likely to match than two names that contain Smith
Initials	John F. Kennedy and John Fitzgerald Kennedy
Nicknames	Bobby Holguin and Robert Holguin
"Cousin" or cognate names	Pedro Calzon and Peter Calzon
Uppercase/Lowercase	Rosa Elena PACHECO and Rosa Elena Pacheco
Reordered name components	Zedong Mao and Mao Zedong
Variable Segmentation	Henry Van Dick and Henri VanDick, Robert Smith and Robert JohnSmyth
Corresponding name fields	For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox]
Truncation of name elements	For Sawyer, the similarity with Sawy is higher than the similarity with Sawi.

Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

You can configure RNI to customize how it scores different match phenomena.

The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).

Installing RNI-Elasticsearch

Important

The RNI Elasticsearch plugin does not work with the AWS managed elastic service.

Note

You will need read/write permissions on the host system to install RNI-Elasticsearch.

Note

For security reasons, Elasticsearch is only accessible locally by default; this is suitable for a local dev/test environment. For accessible dev/test/production server instances, users need to follow the steps outlined in the Elasticsearch reference manual to enable the appropriate network settings for their specific instance.

Note

When installing on top of an existing version of RNI-ES without reindexing, some changes to the software may not function with an index created on the previous installation, and many newly developed features won't work without reindexing. For the best experience, Babel Street recommends reindexing on every installation.

To use RNI-Elasticsearch you need the RNI Elasticsearch plugin, an RLP license file (rlp-license.xml), and Elasticsearch.^[2]

Note

If you are using the Linux distribution of RNI-ES, note that glibc is required. The version of glibc that the native libraries are built against can be found in the filename of the distributed package.

If you do not already have it, install Elasticsearch using the setup instructions for the appropriate version.
Download and unzip Elasticsearch-<version>.zip.
Important
The version of Elasticsearch must match the first three digits of the version of the RNI-ES plugin. If your version of Elasticsearch does not match the plugin version, the plugin will not install.
Example:
- Elasticsearch version: 7.5.2
- RNI-ES plugin version: 7.5.2.x where x is an integer
Install the plugin.
Navigate to the elasticsearch-<version> root directory and run the install command.
On Unix, Linux, and macOS:
```
bin/elasticsearch-plugin install file:///path/to/rni-es-<version>.x.zip
```
On Windows:
```
bin\elasticsearch-plugin install file:///C:\path\to\rni-es-<version>.x.zip
```
Note
You must use the absolute file path to refer to the plugin zip file. For example, if the file is in the home directory of rniUser on macOS, the command would be:
bin/elasticsearch-plugin install file:///Users/rniUser/rni-es-<version>.x.zip
You may be prompted to grant permissions necessary for the plugin to function.
The plugin is now in plugins/rni.
Note
For Windows users, you must add
```
bin\elasticsearch-<version>\plugins\rni\bt_root\rlp\bin\*
```
to your PATH environment variable. In this case, you must replace * with the name of the subdirectory which contains platform-specific binary library files (for example, amd64-w64-msvc120).
Additionally, the RNI-Elasticsearch plugin cannot be installed into distributions of Elasticsearch found in the C:\Program Files directory.
Copy the RLP License (rlp-license.xml) to plugins/rni/bt_root/rlp/rlp/licenses.
This license must be in place before you can use the RNI-Elasticsearch plugin.
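The version rule in the steps above (the first three components of the plugin version must equal the Elasticsearch version) can be checked before running the installer. The following is a minimal sketch; the function name versions_compatible is hypothetical and is not part of the plugin or of Elasticsearch, and it assumes the plugin version has exactly one or more extra components after the first three, as in the 7.5.2.x example.

```python
def versions_compatible(es_version: str, plugin_version: str) -> bool:
    """Return True when the first three components of the plugin version
    equal the Elasticsearch version (for example 7.5.2 and 7.5.2.0)."""
    es_parts = es_version.split(".")
    plugin_parts = plugin_version.split(".")
    # Assumption: Elasticsearch versions have three components and the
    # plugin version adds at least one more component after them.
    if len(es_parts) != 3 or len(plugin_parts) < 4:
        return False
    return plugin_parts[:3] == es_parts


if __name__ == "__main__":
    print(versions_compatible("7.5.2", "7.5.2.0"))
    print(versions_compatible("7.5.3", "7.5.2.0"))
```

A check like this only mirrors the rule stated in this guide; the installer itself remains the authority on whether a given plugin build is accepted.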

Note

If your index contains complex mappings or searches, including many fields or nested fields, you may need to increase the heap size as described in the Elasticsearch documentation.

To start the Elasticsearch server, run:

bin/elasticsearch

Note

When starting Elasticsearch with the plugin you may see some non-fatal error messages. If the error is followed by a message stating that “Cluster health status changed from [RED] to [YELLOW]”, the error can be ignored. This may occur when enableDynamicConfiguration is set to true.

Note

To utilize the overlay directory functionality, you must specify the overlay root location with the bt.overlay.root or bt.overlay.relative.root (relative to the plugin installation directory) system property at plugin startup time.

Example of an overlay root location:

ES_JAVA_OPTS="-Dbt.overlay.root=<overlay-root-path>" bin/elasticsearch

Example of a relative overlay root directory:

ES_JAVA_OPTS="-Dbt.overlay.relative.root=<relative-overlay-root-path>" bin/elasticsearch

If the overlay directory is not located within the Elasticsearch installation directory, you must add an entry to RNI's plugin-security.policy file giving read permissions to the directory.

libpostal data directory

RNI uses libpostal to parse addresses; libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.

RNI packages libpostal data in plugins/rni/bt_root/rlpnc/data/libpostal. The data directory is relatively large (~2 GB). If you are certain that you won't be matching unfielded addresses, you can safely delete the libpostal data directory without impacting any other RNI functionality.

Prepare the index

Important

Unless otherwise specified, all inputs to RNI need to be UTF-8 encoded.

Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
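One way to verify the encoding of copied data is to attempt a strict UTF-8 decode of the raw bytes. The sketch below is an illustration only; is_valid_utf8 is a hypothetical helper name, not something RNI provides. Note that pure ASCII text is also valid UTF-8, so this check catches encodings such as UTF-16 or Latin-1 but cannot tell you whether accented characters were lost in an earlier conversion to ASCII.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid byte sequences
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    name = "Joaquín Guzmán"
    print(is_valid_utf8(name.encode("utf-8")))
    print(is_valid_utf8(name.encode("utf-16")))
```

To check a file, read it in binary mode (for example with pathlib.Path(path).read_bytes()) and pass the result to the function.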

Elasticsearch provides real-time search and analytics for all kinds of data. The data is stored in documents, each having a set of fields, some of which are defined as search fields. An Elasticsearch index is a collection of these documents.

The RNI-Elasticsearch plugin uses an Elasticsearch index to store documents containing names, dates, addresses, or other fields to be matched.

Before using RNI to search for matches, you must create the index, define mappings, and load the index with documents.

Create an index, or a searchable container for your documents.
Define a mapping for fields that contain person, location, organization, or identifier entity types. The type of a name field to be searched by RNI is "rni_name". A mapping defines the data types of each of the searchable fields in a document. The mapping does not have to include every field in the document, just the searchable fields.
Index documents that contain one or more name fields along with other fields of interest. This step loads the documents into the index.
Test the RNI integration before continuing.

Once you've completed the above steps, you are ready to query the index.

The following snippets use the cURL command-line tool to illustrate the Elasticsearch commands for running the plugin.

You can also use Kibana, an open source dashboard for Elasticsearch.

Create an index

An Elasticsearch index consists of one or more documents, and a document contains one or more fields. A name index is an indexed list of names.

When running locally, Elasticsearch listens on localhost:9200 by default.

The following cURL statement creates an index named rni-test.

curl -XPUT 'http://localhost:9200/rni-test'

Define a mapping

A mapping defines how a document, along with the fields it contains, is stored and indexed and sets the types of the search fields. For name search fields, set the "type" of the name fields to "rni_name".

The following statement maps the "primary_name" and "aka" (also known as) fields in the document to the "rni_name" type in the "rni-test" index, and maps the "occupation" field to the standard "text" type.

curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
    "properties" : {
        "primary_name" : { "type" : "rni_name" },
        "aka" : { "type" : "rni_name" },
        "occupation" : { "type" : "text" }
    }
}'

Prior to the RNI-RNT 7.35.1.c65.0 release (September 2021), entityType was not considered when querying. For example, a name with an entityType of PERSON could match a name with an entityType of ORGANIZATION. To return to this behavior, set the mapping parameter testEntityType to false. This will allow indexed names with any or no entityType to be returned, regardless of the entityType in the search.

Index documents

This is the step where you add your data, or documents, to the index. A document is a JSON object containing one or more fields. Each field in a document is defined as a key-value pair, where the key is the field and the value is the data.

Documents may include fields other than name fields.

curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
    "primary_name" : "Joe Schmoe",
    "aka" : "Bossman",
    "occupation" : "business owner"
}'

Name fields can include properties in addition to the name string (or "data" property). Properties are used when searching to optimize the search algorithms for the data. The "entityType" property is particularly important for name searching and customizations.

Property	Required	Description
`data`	✓	The name string.
`language`		ISO 639-3 Code for the language of use: the language of the document in which the name was found.
`languageOfOrigin`		ISO 639-3 Code for the language of origin of the name. For example, a name of Spanish origin (spa) may be found in an English (eng) document.
`script`		ISO 15924 code for the script.
`entityType`		Type of the name.
`uid`		Unique string identifier for the document.
`gender`		An explicitly defined gender for a name.

Example:

curl -XPUT 'http://localhost:9200/rni-test/_doc/3' -H'Content-Type: application/json' -d '{ "primary_name" : { "data" : "Joe Schmoe", "language" : "eng", "script" : "Latn", "entityType" : "PERSON" } }'
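When documents are generated by a program rather than typed by hand, the name field object can be assembled from whichever properties are known. The following sketch builds the same document body as the example above; name_field is a hypothetical helper, and only a subset of the properties from the table is shown.

```python
import json
from typing import Optional


def name_field(data: str,
               entity_type: Optional[str] = None,
               language: Optional[str] = None,
               script: Optional[str] = None) -> dict:
    """Build an rni_name field value; only "data" is required."""
    field = {"data": data}
    optional = {"language": language, "script": script, "entityType": entity_type}
    # Leave out properties that were not supplied instead of sending nulls.
    field.update({key: value for key, value in optional.items() if value is not None})
    return field


doc = {"primary_name": name_field("Joe Schmoe", entity_type="PERSON",
                                  language="eng", script="Latn")}
print(json.dumps(doc))
```

The printed JSON can be used as the request body of the PUT call shown above.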

Tip

When creating a large set of documents, use bulk insert for optimal performance.

Tip

You may need to wait a few minutes for the documents to be ready to query. Documents are not always immediately available from Elasticsearch after being added to the index.

Entity types

The entityType field identifies the type of name being matched and is used to select the algorithms to use for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.

Important

The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.

Table 2. Entity Types

Type	Description	Features
PERSON	A human identified by name, nickname, or alias.	Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported.
LOCATION	A city, state, country, region or other location.	Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported.
ORGANIZATION	A corporation, institution, government agency, or other group of people defined by an established organizational structure.	Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World IDs are supported.
IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM	An alphanumeric identifier.	Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance.

Fielded names

You can process fielded names by separating the fields with "|". RNI assigns no explicit semantics to each field (such as given name or surname), but it does pay attention to the order of the fields when comparing two fielded names. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in name1 matches the second field in name2). Fields within a name can be empty.

When scoring a potential match between a name with fields and a name without fields, RNI treats the name without fields as if it were a name with a single field.

RNI treats trailing empty fields as if they were not present. For example "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".

Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.

Names containing special characters

When using JSON objects with RNI, special characters must be properly escaped when used in strings. RNI requires a backslash to escape the special character and then JSON requires another backslash to escape the first backslash. Thus, in RNI, the proper escape character for names containing a special character is a double backslash (\\).

The | used in fielded names is one example of a special character embedded within a name, where | is used to separate the fields. For proper processing of the vertical bar character, RNI needs to be able to distinguish when the user intends to build a fielded name and a name which contains the vertical bar character.

Let's assume we have a name that includes a | that does not indicate a fielded name: "John|Smith". RNI requires that you escape the vertical bar with a backslash; e.g. "John\|Smith". Then, JSON requires that the backslash character be escaped with a backslash. The correct syntax for the name "John|Smith" is "John\\|Smith". If the entry were representing a fielded name, the correct syntax would be "John|Smith" without any backslashes.
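The two escaping layers are easier to see when a JSON library adds the second backslash for you. In the sketch below, escape_bar and fielded_name are hypothetical helpers written for this illustration; they cover only the vertical bar and the unknown-field marker described above, and they do not attempt to handle names that already contain backslashes.

```python
import json

UNKNOWN_FIELD = "*?*"  # marker this guide gives for an unknown field value


def escape_bar(text: str) -> str:
    """Add the RNI-level backslash in front of each literal vertical bar."""
    return text.replace("|", "\\|")


def fielded_name(fields) -> str:
    """Join name fields with "|"; a None entry becomes the unknown marker."""
    parts = [UNKNOWN_FIELD if field is None else escape_bar(field) for field in fields]
    return "|".join(parts)


# A single name that contains a literal bar: json.dumps adds the JSON-level backslash.
print(json.dumps({"data": escape_bar("John|Smith")}))  # prints {"data": "John\\|Smith"}

# A fielded name: the separator is left unescaped.
print(json.dumps({"data": fielded_name(["John", "Smith"])}))  # prints {"data": "John|Smith"}
```

In Python source the string returned by escape_bar contains one backslash; the doubled backslash appears only in the serialized JSON text, which is what is sent to Elasticsearch.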

Verify the RNI SDK version

To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version:

curl -XGET 'localhost:9200/rni-test/rni_plugin/_get_version'

This call also verifies that the RNI plugin is installed and running successfully.

Bulk insert

Bulk insert allows you to add multiple documents to Elasticsearch in a single API call, improving the throughput for uploading documents by orders of magnitude. We recommend you use bulk indexing to create and index your data wherever possible.

Create the index.
Define the mapping.
Run Bulk Insert.

Tip

Do not perform any queries or searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.

The structure for all Elasticsearch bulk API calls is:

{ action_to_be_performed: { metadata_related_to_action}}\n
{ request_body_data_to_index }\n

Bulk insert example

We're going to continue the example that we started. The index is rni-test. The mapping defines a primary_name, aka, and occupation.

Create the index.
curl -X PUT http://localhost:9200/rni-test

Define the mapping.

The previously defined mapping:

{
    "properties" : {
       "primary_name" : { "type" : "rni_name" },
       "aka" : { "type" : "rni_name" },
       "occupation" : { "type" : "text" }
    }
}

You can put the mapping in a JSON file and create it from the command line. The following curl command creates the mapping using a file (mapping.json in this example):

curl -X PUT -H"Content-Type:application/json" -d @mapping.json http://localhost:9200/rni-test/_mapping

Create a data file in newline delimited JSON (NDJSON) format. Save the file as bulknames.json. The file MUST end with a newline after the final record.

{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Joaquín Guzmán","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "René Lindström Jones","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Guadalupe Hernandez","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Chris Joseph Arsenault","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "ABC","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Basis Technology","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Australian Broadcasting Corporation","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}} 
{"primary_name":{"data": "Amazon","entityType":"ORGANIZATION"}}

Use the _bulk method to load the data file with curl using the following command:
```
curl -X POST -H"Content-Type:application/json" --data-binary @bulknames.json http://localhost:9200/rni-test/_bulk
```
Note
If you're providing text file input to curl, use the --data-binary flag instead of plain -d to preserve the newlines.
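For larger data sets the NDJSON file is usually generated rather than written by hand. The following sketch produces the action-line and document-line pairs described above; bulk_payload is a hypothetical helper. It omits "_id" from the action line, which lets Elasticsearch generate document IDs, and it always ends the payload with a newline.

```python
import json


def bulk_payload(index: str, names) -> str:
    """Build an NDJSON bulk body from (name, entity_type) pairs."""
    lines = []
    for data, entity_type in names:
        # Action line: index the next document into the given index.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Document line: ensure_ascii=False keeps accented characters readable.
        document = {"primary_name": {"data": data, "entityType": entity_type}}
        lines.append(json.dumps(document, ensure_ascii=False))
    # The bulk API requires a newline after the final record.
    return "\n".join(lines) + "\n"


payload = bulk_payload("rni-test", [
    ("Joaquín Guzmán", "PERSON"),
    ("Basis Technology", "ORGANIZATION"),
])
print(payload, end="")
```

Write the payload to a file with UTF-8 encoding (for example bulknames.json) and load it with the curl command shown above.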

Searching with queries

At this point you've created an index and loaded data. Now you can start using RNI to search for matches.

A query searches the index and returns a match score. In RNI, the query for a name consists of two parts, a base query and a rescorer.

The base query is a standard Elasticsearch query against a name field. The rescorer takes the results of the base query, and uses Elasticsearch rescoring to select the top candidates and perform pairwise matching on the top candidates.

The query returns an RNI match score (max_score), which is the score of the top-scoring document.

Important

The entityType (PERSON, LOCATION, ORGANIZATION) must be added to a name query to utilize all RNI features. If you don't specify an entityType, the type NONE will be used and RNI may return less accurate results.

Base query

The base query is a standard query against a name field:

    "query" : {
        "match" : {
            "primary_name" : "Jo Shmoe"
        }
    }

Querying supports the same name properties that you may use when indexing documents. Unlike during document creation, you must pass the JSON object containing the name fields as a string. You should always include the entityType property in your query.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
        "match" : {
            "primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}" 
        }
    }
}'

Much like during indexing, RNI creates a set of keys based on the name and then generates a more complex internal query to match against the indexed keys.

Rescore with the RNI pairwise name match

The base query returns a ranked list of matching documents. The rescorer takes the top documents from the list and performs pairwise matching algorithms on those documents, and returns a re-ranked list. RNI has a custom rescorer which allows you to further tune the candidates passed to the RNI pairwise matcher. Since the pairwise matcher is a computationally intensive process, you want to rescore just enough documents to find the best matches.

Elasticsearch rescoring includes the following parameters:

window_size (an integer, defaults to 10) specifies how many documents from the base query should be passed to the RNI pairwise matcher.
Use this parameter to limit the number of compute-intensive name matches that need to be performed. If you set the value too high, the query will take too long, but if you set the value too low, you will increase the number of false negatives.
Tip
A good starting point for window_size is to make it the square root of the size of the index. For example, an index of 10,000 entries would use a window_size of 100.
query_weight (a float, defaults to 1.0) specifies the weighting of the score returned by the base query.
In the context of RNI pairwise matching, the base query score has little meaning, so we suggest you set it to 0.0.
rescore_query_weight (a float, defaults to 1.0) specifies the weighting of the maximum RNI pairwise match score.
If query_weight is 0.0 and rescore_query_weight is 1.0, the score that is returned by rescoring is the RNI pairwise match score.
score_mode controls how the query and rescore query scores are combined. The default value is total meaning that both scores are added together after being multiplied by their respective weights.
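The arithmetic behind these parameters is small enough to sketch. The helpers below are hypothetical and only restate the rules above: the square-root starting point for window_size (rounded down here, which is a choice made for this sketch) and the weighted sum used when score_mode is total.

```python
import math


def suggested_window_size(index_size: int) -> int:
    """Starting point from the tip above: the square root of the index size."""
    return max(1, math.isqrt(index_size))  # isqrt rounds down to an integer


def combined_score(base_score: float, rni_score: float,
                   query_weight: float = 1.0,
                   rescore_query_weight: float = 1.0) -> float:
    """score_mode "total": add the two scores after applying their weights."""
    return query_weight * base_score + rescore_query_weight * rni_score


print(suggested_window_size(10_000))  # prints 100
print(combined_score(12.5, 0.8, query_weight=0.0, rescore_query_weight=1.0))  # prints 0.8
```

With query_weight set to 0.0, the base query score (12.5 in this made-up example) drops out and only the RNI pairwise score remains.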

In the following example, pairwise matching is performed on the top 200 names returned by the base query.

Example with RNI Rescorer:

    "rescore" : {
        "window_size" : 200,
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "name_score" : {
                        "field" : "primary_name",
                        "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }

The "name_score" function matches every name in the given field against the query name and returns the maximum score to the rescorer.

The "name_score" function score query must be given at least one object that specifies:

field: the search field being rescored which must be of type rni_name.
query_name: the name to match against the search field.

The object passed to the name_score function can also include any of the name properties.

This example illustrates the full query incorporating both match and rescore, using RNI query parameters.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : {
            "primary_name" : "{\"data\" : \"Jo Shmoe\",\"entityType\" : \"PERSON\"}" 
        }
    },
    "rescore" : {
        "window_size" : 200,
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "name_score" : {
                        "field" : "primary_name",
                        "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"}
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

This query returns an RNI match score against "Joe Schmoe" in the "_score" field:

{
  "_index": "rni-test",
  "_type": "_doc",
  "_id": "1",
  "_score": 0.80217975,
  "_source": {
    "primary_name": "Joe Schmoe",
    "aka": "Bossman",
    "occupation": "business owner"
  }
}
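The difference between the two places the query name appears (a JSON string in the match clause, a JSON object in the rescorer) is easy to get wrong by hand. The sketch below assembles the same request body as the full query above; name_search_body is a hypothetical helper, and sending the body to Elasticsearch is left to whichever HTTP client you use.

```python
import json


def name_search_body(field: str, name: str, entity_type: str,
                     window_size: int = 200) -> dict:
    """Build a match + rescore request body for one rni_name field."""
    query_name = {"data": name, "entityType": entity_type}
    return {
        "query": {
            # The base query takes the name object serialized as a string.
            "match": {field: json.dumps(query_name)}
        },
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        # The rescorer takes the name object itself.
                        "name_score": {"field": field, "query_name": query_name}
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


body = name_search_body("primary_name", "Jo Shmoe", "PERSON")
print(json.dumps(body, indent=2))
```

Serializing the whole body with json.dumps takes care of escaping the inner quotes in the match value.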

Representing arrays in Elasticsearch

If the name field in your documents is structured as an array, such as first name and last name fields, wrap the field in a nested object. The nested datatype allows arrays of objects to be indexed and queried independently of each other.

Since Elasticsearch flattens object hierarchies into a simple list of field names and values, if you don't use the nested type, you can lose the relationship between the fields. For example, the following document:

  "names" : [ 
    {
      "first" : "Joe",
      "last" :  "Smith"
    },
    {
      "first" : "Mike",
      "last" :  "Shmoe"
    }
  ]

would be transformed internally into a document that looks more like this:

{
  "names.first" : [ "mike", "joe" ],
  "names.last" :  [ "smith", "shmoe" ]
}

The names.first and names.last fields are flattened into multi-value fields, and the association between Joe and Smith is lost. This document would incorrectly match a query for mike and smith.
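The effect of flattening can be reproduced with a few lines of code. This is a simplified simulation written for illustration, not Elasticsearch's actual implementation: flatten collects the values of each subfield into one set per field name, and the two match functions compare how a query for mike and smith behaves against the flattened and the per-object views.

```python
def flatten(objects, prefix):
    """Collapse a list of objects into multi-value fields, losing the pairing."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", set()).add(value.lower())
    return flat


def flat_match(flat, prefix, first, last):
    """Each term only has to appear somewhere in its own field."""
    return first in flat[f"{prefix}.first"] and last in flat[f"{prefix}.last"]


def nested_match(objects, first, last):
    """Both terms have to appear in the same object."""
    return any(obj["first"].lower() == first and obj["last"].lower() == last
               for obj in objects)


names = [{"first": "Joe", "last": "Smith"}, {"first": "Mike", "last": "Shmoe"}]
flat = flatten(names, "names")
print(flat_match(flat, "names", "mike", "smith"))  # prints True (a false match)
print(nested_match(names, "mike", "smith"))        # prints False
```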

If you wrap an array field in a nested object, you will get more accurate search results.

Include a field of type "nested" containing the name field in the mapping:

  "nested_names" : {
      "type" : "nested",
      "properties" : {
          "name" : { "type" :"rni_name" }
      }
  }

Multiple names can be added to the nested field:

{ 
    "nested_names" : [
        {
            "name" : "Joe Smith"
        },
        {
            "name" : "Mike Shmoe"
        }
    ]
}

Update the query to refer to the nested object. Set the "score_mode" to "max".

{
    "query" : {
        "nested" : {
            "path" : "nested_names",
            "query" : {
                "match" : {
                    "nested_names.name" : "Mike Shmoe"
                }
            }
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "nested" : {
                    "path" : "nested_names",
                    "score_mode" : "max",
                    "query" : {
                        "function_score" : {
                            "name_score" : {
                                "field" : "nested_names.name",
                                "query_name" : "Mike Shmoe"
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}

See the Elasticsearch documentation for more detailed information on nested objects and queries.

Nested names example

Let's consider an example of a database that includes alias names along with a primary name.

Nested Mapping:

"properties" : {
   "primary_name" : {"type" : "rni_name"},
   "aliases" : { 
      "type" : "nested",
      "properties" : { 
        "alias_name" : { "type" : "rni_name" } 
      }
   }
}

The curl command to create the mapping:

curl -XPUT "http://localhost:9200/rni-test/_mapping" -H 'Content-Type: application/json' -d '{
    "properties" : {
        "primary_name" : { "type" : "rni_name" },
        "aliases" : {
            "type" : "nested",
            "properties" : {
                "alias_name" : { "type" : "rni_name" }
            }
        }
    }
}'

Each record includes a primary name. Each primary name can have multiple aliases.

"primary_name" : "John Smith", 
"aliases": [
  {"alias_name": "John Shark"},
  {"alias_name": "Smithy"},
  {"alias_name": "Johnny boy"}
]

The curl command to add the data:

curl -XPUT "http://localhost:9200/rni-test/_doc/null" -H 'Content-Type: application/json' -d '{
    "primary_name" : "John Smith", 
    "aliases" : [
        {"alias_name": "John Shark"},  
        {"alias_name": "Smithy"},
        {"alias_name": "Johnny boy"}
    ]
}'

The query will try to match one of the aliases. Specify score_mode: max to return the highest match score of the aliases.

curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d '{
    "query" : {
        "nested" : {
            "path" : "aliases",
            "query" : {
                "match" : { "aliases.alias_name": "Johnny" }
            }
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "nested" : {
                    "path" : "aliases",
                    "score_mode" : "max",
                    "query" : {
                        "function_score" : {
                            "name_score" : {
                                "field" : "aliases.alias_name",
                                "query_name" : "Johnny"
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'
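The nested query repeats the path and field name in several places, so a small builder can help keep them consistent. The following is a sketch only; nested_name_search_body is a hypothetical helper that reproduces the request body of the query above.

```python
import json


def nested_name_search_body(path: str, field: str, name: str) -> dict:
    """Build a nested match + rescore body for an rni_name field under `path`."""
    full_field = f"{path}.{field}"
    return {
        "query": {
            "nested": {"path": path, "query": {"match": {full_field: name}}}
        },
        "rescore": {
            "query": {
                "rescore_query": {
                    "nested": {
                        "path": path,
                        # Report the best-scoring nested name for each document.
                        "score_mode": "max",
                        "query": {
                            "function_score": {
                                "name_score": {"field": full_field, "query_name": name}
                            }
                        },
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


body = nested_name_search_body("aliases", "alias_name", "Johnny")
print(json.dumps(body, indent=2))
```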

Sorting results by rni_name

Elasticsearch supports the ability to sort search results by the values of their document fields. In the case of RNI, one may want to sort on an rni_name field. Because these fields are internally composed of many subfields, it is necessary to specify the subfield to sort on. Below are a couple of subfields that you may be interested in:

IndexFields enum value	Raw subfield name	Example for "John Smith"	Explanation
ORIGINAL_NAME_FIELD	bt_rni_name_original	John Smith	original input data for the name
NORMALIZED_DATA_FIELD	bt_rni_name_normalized	john smith	normalized name

As an example, if your field's name is primaryName, you can sort on the original name data by referring to primaryName.bt_rni_name_original in your sort specification.
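As a sketch of what such a sort specification might look like, the helper below builds a standard Elasticsearch sort clause that targets the original-name subfield. sort_by_original_name is a hypothetical name, and the clause would be merged into your search request body alongside the query.

```python
import json


def sort_by_original_name(field: str, order: str = "asc") -> dict:
    """Build a sort clause on the original-name subfield of an rni_name field."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {"sort": [{f"{field}.bt_rni_name_original": {"order": order}}]}


print(json.dumps(sort_by_original_name("primaryName")))
```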

In the Java API, these fields can be referenced through the IndexFields enum. Regarding the previous example, one could refer to the same subfield in Java:

"primaryName." + IndexFields.ORIGINAL_NAME_FIELD.fieldName()

Configuring name matching

There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are by modifying match parameters and editing overrides. You can also train a custom language model.

Tuning match parameters

The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.

The typical process for tuning parameters is as follows:

Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.
Tune the match parameters to change the match scores of the test set of undesirable results, so that each score is correctly above or below your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (e.g., aliases), we recommend you add them as token or full-name overrides.
Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
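The analysis step in the process above can be sketched in Python as follows. This is an illustration only: the ScoredPair structure, the should_match annotation, and the sample scores are hypothetical, and in practice the scores would come from your own RNI queries.

```python
from typing import NamedTuple

class ScoredPair(NamedTuple):
    query: str
    candidate: str
    score: float        # match score returned by RNI (sample values are made up)
    should_match: bool  # your own judgment for this pair

def find_tuning_examples(pairs, threshold):
    """Split scored pairs into those scored too low and those scored too high."""
    too_low = [p for p in pairs if p.should_match and p.score < threshold]
    too_high = [p for p in pairs if not p.should_match and p.score >= threshold]
    return too_low, too_high

pairs = [
    ScoredPair("John M Smith", "John Mike Smith", 0.78, True),
    ScoredPair("John Smith", "Jane Smyth", 0.83, False),
    ScoredPair("Abby Harris", "Abigail Harris", 0.91, True),
]
too_low, too_high = find_tuning_examples(pairs, threshold=0.80)
```

The two lists are the candidates to use as tuning examples; pairs that already fall on the correct side of the threshold are left out.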

Parameter configuration files

Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.

The parameters are contained in two .yaml files located in plugins/rni/bt_root/rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.

parameter_defs.yaml lists each match parameter along with its default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them could cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you not to exceed.
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.

Important

Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.

Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.

Parameter profiles

The parameters in the parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the eng_eng profile. There is also an any profile which applies to all language pairs.

Parameter profiles have the following characteristics:

Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples:
- spa_eng
- ara_jpn
- eng_eng
They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.
Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.
The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.
Specific language profiles inherit values from global profiles. The profile for matching person names is named any_PERSON. The profile for matching Spanish person names against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
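The naming rules above can be sketched in Python as follows. The helper names profile_name and inheritance_chain are hypothetical, and the least-to-most-specific ordering in inheritance_chain is one reading of the description above, not a statement of how RNI resolves profiles internally.

```python
from typing import List, Optional

def profile_name(lang_a: str, lang_b: str, entity_type: Optional[str] = None) -> str:
    """Form a profile name: codes in alphabetical order, except eng always last."""
    first, second = sorted([lang_a, lang_b], key=lambda code: (code == "eng", code))
    name = f"{first}_{second}"
    return f"{name}_{entity_type}" if entity_type else name

def inheritance_chain(lang_a: str, lang_b: str, entity_type: Optional[str] = None) -> List[str]:
    """List candidate profiles from least to most specific (assumed ordering)."""
    pair = profile_name(lang_a, lang_b)
    chain = ["any"]
    if entity_type:
        chain.append(f"any_{entity_type}")
    chain.append(pair)
    if entity_type:
        chain.append(f"{pair}_{entity_type}")
    return chain

print(profile_name("eng", "spa"))                 # spa_eng
print(inheritance_chain("eng", "spa", "PERSON"))  # any, any_PERSON, spa_eng, spa_eng_PERSON
```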

Important

Global changes are made with the any profile.

Any changes to address parameters should go under the any profile, and will affect all fields for all addresses.

Any changes to date parameters must go under the any profile.

Parameter universe

A parameter universe is a named profile containing a set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles.

For example, the MyParameterUniverse universe may include the following parameter profiles:

"name": "MyParameterUniverse/any" applies to all language pairs.
"name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.
"name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.

Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.

A parameter universe can also be defined dynamically. We recommend that you use dynamic parameter universes for testing and tuning only. For production use, add all parameter universes to the parameter_profiles.yaml file.

Tip

You can define multiple named parameter profiles.

Define the parameter universe in the parameter_profiles.yaml file. Example:

parameterUniverseOne/spa_eng_PERSON:
    reorderPenalty: 0.4
    HMMUsageThreshold: 0.8
    stringDistanceThreshold: 0.1
    useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
    reorderPenalty: 0.6

Using a parameter universe

To use a parameter universe, add it as part of the name_score function when rescoring names queried from the index. All parameter values defined in the parameter universe will be used, where appropriate.

curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{ 
  "query": {
    "match": {
      "full_name": "A Ely Taylor"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'

Parameter universes can also be used in the query phase. To do so, specify the query name as a JSON string and include the universe in the body.

Note

The parameter universe can only be used in the query phase in RNI-ES 8.6.2.0 and later.

curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{ 
  "query": {
    "match": {
      "full_name": "{ \"data\": \"A Ely Taylor\", \"universe\": \"parameterUniverseOne\"}"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
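Escaping the inner quotes of the query-phase JSON string by hand is easy to get wrong. As a minimal sketch, the match query can be built in Python by serializing the inner object first; the helper name query_phase_body is hypothetical, and the rescore section is omitted for brevity.

```python
import json

def query_phase_body(field: str, name: str, universe: str) -> dict:
    """Build a match query whose value is a JSON string carrying the universe."""
    # The inner object is serialized to a string; serializing the outer body
    # later escapes its quotes automatically.
    query_value = json.dumps({"data": name, "universe": universe})
    return {"query": {"match": {field: query_value}}}

body = query_phase_body("full_name", "A Ely Taylor", "parameterUniverseOne")
print(json.dumps(body))
```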

Dynamic parameter universes

When tuning RNI, you can use the Parameters REST API endpoint to dynamically create or update a parameter universe, overriding the existing parameter values without having to restart Elasticsearch. Once the optimum values are determined for each parameter, add the parameter universe to the parameter_profiles.yaml file for production use.

Tip

Dynamic parameter universes are best suited for testing and tuning the RNI match parameters. Once you determine the best set of parameters, add the parameter universe to the parameter_profiles.yaml file for production use. Using dynamic parameter universes can slow your system down considerably.

Use the Parameters endpoint to create a parameter universe, with parameters and values.

curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'{ 
  "profiles": [
    {
      "name": "parameterUniverseOne/spa_eng_PERSON",
      "parameters": {
        "reorderPenalty": 0.4,
        "HMMUsageThreshold": 0.8,
        "stringDistanceThreshold": 0.1,
        "useEditDistanceTokenScorer": true
      }
    }
  ]
}'

The name of the parameter universe is parameterUniverseOne and it applies to matching person names between Spanish and English.
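When you are trying out several candidate values during tuning, it can help to generate the request body rather than edit it by hand. The following Python sketch builds the same structure as the request above; the helper name universe_payload is hypothetical, and sending the request is left out.

```python
import json

def universe_payload(universe: str, profiles: dict) -> dict:
    """Build the request body for a dynamic parameter universe.

    profiles maps a profile name such as "spa_eng_PERSON" to a dict of
    parameter names and values.
    """
    return {
        "profiles": [
            {"name": f"{universe}/{profile}", "parameters": dict(params)}
            for profile, params in profiles.items()
        ]
    }

payload = universe_payload(
    "parameterUniverseOne",
    {"spa_eng_PERSON": {"reorderPenalty": 0.4, "useEditDistanceTokenScorer": True}},
)
print(json.dumps(payload, indent=2))
```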

Modifying name parameters

To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons will serve as a guide for which parameters to tune, which are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.

Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Selected name parameters

Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.

Parameter	Description	Behavior
`conflictScore`	The score that is assigned to unmatched conflict tokens	Increasing leads to higher final score
`initialsConflictScore`	The score that is assigned to unmatched conflict initials	Increasing leads to higher final score
`initialsScore`	The score that is assigned to an initial matching a token	Increasing leads to higher final score
`initialismScore`	Score assigned to initialism matching a name	Increasing leads to higher final score
`stuckInitialScore`	Score applied when initial is “stuck” to previous token	Increasing leads to higher final score
`deletionScore`	Score applied to an unmatched token when surrounding tokens are matched	Increasing leads to higher final score
`outOfOrderDeletionScore`	Score applied to an unmatched token when surrounding tokens are also unmatched	Increasing leads to higher final score
`reorderPenalty`	Penalty applied to matching tokens with different positions	Increasing leads to lower final score
`initialsDeletionPenalty`	Multiplier on token deletion score when deleted token is an initial	Increasing leads to higher final score
`genderConflictPenalty`	Penalty applied when name genders don’t match	Increasing leads to lower final score
`crossLanguageGenderConflictPenalty`	Penalty applied when name genders of different languages don’t match	Increasing leads to lower final score
`boostWeightAtRightEnd`	Boost applied to tokens at the right end of the name (i.e. surnames in English)
`boostWeightAtLeftEnd`	Boost applied to tokens at the left end of the name (i.e. given names in English)
`boostWeightAtBothEnds`	Boost applied to tokens at either end of the name (i.e. less weight for middle names in English)
`adjustOneSideDeletionScores`	Multiplier on the token deletion score when all deleted tokens are on one side of the name	Increasing leads to higher score
`reorderCorrection`	Boost to final score if one name’s tokens are a reordering of the other's	Increasing leads to higher final score
`finalBias`	Helps normalize scores	Increasing leads to a higher score for ALL names

The following examples describe the impact of parameter changes in more detail.

Example 1. Token Conflict Score (conflictScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Joe Smith’. The ‘John’ tokens are matched with each other, as are the ‘Smith’ tokens. This leaves the unmatched tokens ‘Mike’ and ‘Joe’, which are in direct conflict with each other; conflictScore determines how that conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores. Alternatively, you may decide that if two of the tokens are the same, the third token (for example, a middle name) is not as important.

Example 2. Initials Score (initialsScore)

Consider the following two names: 'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know there is a lot of initialism in your data sets.

Example 3. Token Deletion Score (deletionScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.

Example 4. Token Reorder Penalty (reorderPenalty)

This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’, and ‘John Smith Mike’. This parameter will control the extent to which the token ordering ( ‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.

Example 5. Right End Boost/Left End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds)

These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when dealing with English names, and you are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the more important tokens.

The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.

Language support parameters

RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully supported text domains for name matching lists the languages and scripts with complete support. For all other languages, RNI has limited support.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Limited support uses two match score computations:

Exact matches return a score of 1. This is the same for all languages.
A score is calculated based on string edit distance.

Two parameters control the level of language support.

Table 3. Language Support Parameters

Parameter	Description	Default
`allLanguageSupport`	When set to `true`, all languages are supported.	`true`
`limitedLanguageEditDistance`	When set to `true`, edit distance match scores are enabled for limited support languages. `allLanguageSupport` must be `true`.	`true`

Neural model for matching

When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.

To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.

To use the neural model:

Extract the appropriate library files from the platform-specific tensorflow JAR provided in the rni-es-<version>-seq2seq-libraries.zip bundle.
Start Elasticsearch with the following additional Java properties, setting java.library.path to the directory containing the extracted libraries:
```
 ES_JAVA_OPTS="-Dorg.bytedeco.javacpp.cacheLibraries=false -Djava.library.path=<path-to-extracted-libraries>"
```

Note

The neural model is currently only available on macOS and Linux platforms in RNI-ES versions 7.10.2.x and all plugins including RNI-RNT 7.38.1.67.0 or later.

Matching Korean names

If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.

To enable Korean readings of names in Han script you need to edit the parameter files as follows:

Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the list in the ignoreTranslationOrigins parameter.
Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1 to compensate for the additional reading.

Matching names with Han characters

We've added experimental support for leveraging mechanisms within the Unicode data to improve matching of Han characters.

The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.

The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown accuracy improvements when the parameter is set to 1.

To enable the feature, add the following lines to your parameter_profiles.yaml file:

zho_zho_PERSON:
  haniFourCornerCodeMismatchPenalty: 1

Note

This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.

Matching Turkish and Vietnamese names

Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.

Edit the parameter_profiles.yaml file.

To enable Turkish detection, add:

detectableLanguagesRuleBased:
  [tur]

To enable Vietnamese detection, add:

detectableLanguagesRuleBased:
  [vie]

Restart the system.

Ignore malformed and null value parameters for RNI types

You can index null values and empty strings by updating the allowNullValue parameter. If the allowNullValue parameter is enabled, any document containing null values or empty strings in fields of the rni_name, rni_address, and rni_date types will be successfully indexed, but search capabilities will be limited to valid values.

You can direct RNI to index documents with malformed language strings by updating the ignoreBadData parameter. If the ignoreBadData parameter is enabled, any document containing a malformed language string will be successfully indexed, but search capabilities will be limited to valid languages.

By default these parameters are disabled. These features are useful when performing bulk operations in Elasticsearch.

These parameters are set in the parameter_profiles.yaml file, located in plugins/rni/bt_root/rlpnc/data/etc/.

To turn any of these features on, set the value of the parameter ignoreBadData or allowNullValue in the above file to true.

Evaluating parameter configuration

To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.

If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.

Configuring name overrides

RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:

Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.
Name pair matches specify scores to be assigned for specified full-name pairs.
Token pair overrides specify name token pairs that match along with a match score.
Token normalization files specify the normalized form for tokens and variants to normalize to that form.
Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.

The name matching override files are in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override directory.

You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.

Stop patterns and stop word prefixes

Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.

For each name, RNI performs the following steps in order:

Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.
Stop patterns are applied.
Stop words are applied.

RNI cycles its way through the stop patterns then the stop words, each cycle removing the patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
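To make the order of operations concrete, here is a simplified Python sketch of the steps above. It is not RNI's implementation: the function names are hypothetical, Python's re module stands in for Java regular expressions, stop word prefixes are only stripped from the start of the whole name, and the cycling is approximated by repeating until a pass changes nothing.

```python
import re
import unicodedata

def normalize(name: str) -> str:
    """Lower-case, drop diacritics, strip punctuation except . , - and collapse spaces."""
    decomposed = unicodedata.normalize("NFD", name.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = re.sub(r"[^\w\s.,\-]", "", no_marks)
    return re.sub(r"\s+", " ", kept).strip()

def strip_stops(name, stop_patterns, stop_prefixes):
    """Apply stop patterns, then stop word prefixes, until nothing more is stripped."""
    text = normalize(name)
    # Longer patterns and prefixes are tried before shorter ones.
    patterns = sorted(stop_patterns, key=len, reverse=True)
    prefixes = sorted(stop_prefixes, key=len, reverse=True)
    changed = True
    while changed:
        before = text
        for pattern in patterns:
            text = re.sub(pattern, "", text)
        text = re.sub(r"\s+", " ", text).strip()
        for prefix in prefixes:
            if text.startswith(prefix + " "):
                text = text[len(prefix) + 1:]
        changed = text != before
    return text

print(strip_stops("Brigadier-General  José Pérez", [r"brigadier[-]general", r"general"], ["mr."]))
print(strip_stops("Mr. FNU Smith", [r"^fnu\b"], ["mr."]))
```

In the second call, the ^fnu\b pattern only matches on the second pass, after the mr. prefix has been removed, which is why more than one cycle is needed.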

Stop Pattern

A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern; see the Javadoc for detailed documentation.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt

where LANG is a three-letter language code.

Each row in the file, except for rows that begin with #^[3], is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.

RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries

^fnu\b
\blnu$

indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name, are to be removed.

You can also specify which field the regex is to be applied to when processing a fielded name. After the regex, add a tab followed by n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,

\blnu$    2
\blnu$    3

indicates that the regex is to be applied to fields 2 and 3 in fielded names.
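The row format described above can be read with a few lines of Python. This is a sketch for checking your own files, not the plugin's parser: the function name is hypothetical, blank rows are skipped as an assumption, and patterns are compiled with Python's re module, which differs from java.util.regex.Pattern in some details.

```python
import re

def parse_stop_regex_lines(lines):
    """Return (compiled_pattern, field_number_or_None) for each non-comment row."""
    entries = []
    for raw in lines:
        line = raw.strip()  # leading and trailing whitespace is removed
        if not line or line.startswith("#"):
            continue
        # An optional tab separates the regex from the field number.
        pattern, _, field = line.partition("\t")
        field_number = int(field) if field.strip() else None
        entries.append((re.compile(pattern.strip()), field_number))
    return entries

rows = ["# sample rows", r"^fnu\b", r"\blnu$" + "\t2", r"\blnu$" + "\t3"]
for compiled, field_number in parse_stop_regex_lines(rows):
    print(compiled.pattern, field_number)
```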

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop Word Prefixes

A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.

Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt

where LANG is a three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable when colonel is also a stop word prefix.

RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.

Overriding name pair matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first name and LANG2 is the three letter language code for the second name.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:

name1 Tab name2 Tab score

The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.

Tip

Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.

The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,

John Doe	Joe Bloggs	1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example the following entries could appear in fullnames_jpn_eng.txt:

外山恒	   Toyama Koichi    1.0
ヒラリークリントン    Hillary Clinton    1.0

Overriding token pair matches

You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported^[4] for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

tokens_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:

Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]

A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:

Peter    Pete    NICKNAME
Peter    Pedro   COGNATE
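The entry syntax above, including /force, SUPPRESS, and the NICKNAME default, can be illustrated with a small Python parser. This is a sketch for validating your own override files; the TokenOverride structure and function name are hypothetical, and the parser does not reflect how RNI reads the files internally.

```python
from typing import NamedTuple, Optional

INDICATORS = {"NICKNAME", "COGNATE", "VARIANT"}

class TokenOverride(NamedTuple):
    token1: str
    token2: str
    score: Optional[float]    # raw score, if one was given
    indicator: Optional[str]  # NICKNAME, COGNATE, or VARIANT, if one was given
    force: bool               # True when the score is exact rather than a minimum

def parse_token_override(line: str) -> TokenOverride:
    """Parse one tab-delimited token pair entry."""
    parts = [part.strip() for part in line.split("\t")]
    token1, token2 = parts[0], parts[1]
    # With no third column, NICKNAME is assumed.
    spec = parts[2] if len(parts) > 2 and parts[2] else "NICKNAME"
    if spec == "SUPPRESS":  # alias for "0.0/force"
        return TokenOverride(token1, token2, 0.0, None, True)
    value, _, flag = spec.partition("/")
    force = flag == "force"
    if value in INDICATORS:
        return TokenOverride(token1, token2, None, value, force)
    return TokenOverride(token1, token2, float(value), None, force)

print(parse_token_override("Peter\tPete\tNICKNAME"))
print(parse_token_override("Peter\tPedro\t0.9/force"))
```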

This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.

When you create an additional file in the same location, use the ISO 639-3 three-letter language name in the filename to identify the language of each name element in the pair. For example tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.

We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.

Multiple sets of token overrides

There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.

The value of overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.
The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the overrideSelector of OverrideGroup1 would be named:
```
tokens_eng_eng_PERSON-OverrideGroup1.txt
```
If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
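As an illustration of the naming convention, the sketch below splits an override file name into its language pair, optional entity type, and selector. The function name and the regular expression are assumptions made for this example; RNI's actual file-name handling may differ in details such as which characters it accepts.

```python
import re

# tokens_LANG1_LANG2[_TYPE], with an optional "-selector" suffix handled separately.
_BASE = re.compile(r"tokens_([a-z]{3})_([a-z]{3})(?:_([A-Z]+))?")


def parse_override_filename(filename: str):
    """Return (lang1, lang2, entity_type or None, selector) for an override file name."""
    if not filename.endswith(".txt"):
        raise ValueError(f"not a .txt override file: {filename}")
    stem = filename[: -len(".txt")]
    base, _, raw_selector = stem.partition("-")
    match = _BASE.fullmatch(base)
    if match is None:
        raise ValueError(f"unrecognized override file name: {filename}")
    lang1, lang2, entity_type = match.groups()
    # The selector is alphanumeric and case-insensitive; anything else
    # falls back to the "default" selector.
    if raw_selector.isascii() and raw_selector.isalnum():
        selector = raw_selector.lower()
    else:
        selector = "default"
    return lang1, lang2, entity_type, selector
```

With this sketch, `tokens_eng_eng_PERSON-OverrideGroup1.txt` maps to `("eng", "eng", "PERSON", "overridegroup1")`, while `tokens_eng_eng.txt` maps to the "default" selector with no entity type.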

Note

Overrides that are associated with a specific selector are not additive to the base overrides. If a custom overrideSelector value is specified, RNI will only consider overrides in that specific selector. As with the base overrides, for a given selector, RNI will consider non-entity-type overrides for that selector if no entity-type-specific override pair is found for that selector.
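The lookup order in the note can be pictured as follows. This is only a sketch of the documented behavior, using a hypothetical in-memory table keyed by selector and entity type; it is not how RNI stores overrides internally.

```python
def find_override(overrides, selector, entity_type, token1, token2):
    """Look up a token pair for one selector only.

    `overrides` maps (selector, entity_type or None) to a dict keyed by an
    unordered token pair. Pairs are unordered because scoring is commutative.
    """
    selector = selector.lower()  # selector values are case-insensitive
    pair = frozenset((token1, token2))
    # Entity-type-specific overrides first, then the selector's generic ones.
    for key in ((selector, entity_type), (selector, None)):
        value = overrides.get(key, {}).get(pair)
        if value is not None:
            return value
    # No fallback to the "default" selector: selectors are not additive.
    return None


EXAMPLE_OVERRIDES = {
    ("default", None): {frozenset(("Peter", "Pete")): "NICKNAME"},
    ("overridegroup1", "PERSON"): {frozenset(("Peter", "Pedro")): "COGNATE"},
    ("overridegroup1", None): {frozenset(("Bob", "Robert")): "NICKNAME"},
}
```

In this example, a query that uses the OverrideGroup1 selector finds the Peter/Pedro and Bob/Robert pairs but not Peter/Pete, which exists only under the "default" selector.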

Normalizing token variants

You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:

equivalenceclasses_LANG_[TYPE].txt

For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.

Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:

[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:

[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd

You can add lists of variants to this file, including the normalized form in square brackets to start each list.
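To check what a custom equivalence class file will do, you can load it into a variant-to-normal-form table. The sketch below assumes only the syntax shown above (a bracketed normal form followed by one variant per line); the function names are hypothetical.

```python
def parse_equivalence_classes(text: str) -> dict:
    """Map each variant to the normalized form that heads its list."""
    variant_to_normal = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]  # start of a new list
            continue
        if current is None:
            raise ValueError("variant listed before any [normal_form] header")
        variant_to_normal[line] = current
    return variant_to_normal


def normalize_token(token: str, variant_to_normal: dict) -> str:
    """Return the normalized form of a token, or the token itself if unlisted."""
    return variant_to_normal.get(token, token)
```

Given the muhammad list above, `normalize_token("mohd", table)` returns `"muhammad"`, and a token that is not listed is returned unchanged.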

Unimportant tokens

You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.

The file name is lowWeightTokens_LANG.txt.

For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".

Matching organizations with real world IDs

Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.

RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name matching within a language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.

Table 4. Real World ID Parameters

Parameter	Description	Default
`useRealWorldIds`	Enables real world IDs and indexes the real world IDs as corporation names are added to the index. You must reindex if you enable it after indexing.	`true` (enabled)
`doQueryRealWorldIds`	Enables querying with real world IDs; set by language pair.	`true` (enabled)
`realWorldIdScore`	Sets the match score when two names match due to matching real world IDs. Set by language pair.	`0.98`
`nameRealWorldQueryBoost`	Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair.	`35`

Building a real world ID file

Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world IDs provided in RNI and build your own file of real world IDs. The provided builder creates a binary file in the specified output directory named <LANG>_ORGANIZATION_ids.bin, where <LANG> is the three-letter language code of the file.

The input file is a tab separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92

Unzip the file realWorldIDBuilder.zip found in the plugins/rni/bt_root directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.
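Before running the builder, it can be useful to confirm that the input file has exactly two tab-separated columns and that every ID is alphanumeric. The following sketch performs that check and groups names by ID; it is an independent helper with a hypothetical name, not part of realWorldIDBuilder.

```python
from collections import defaultdict


def group_names_by_id(tsv_text: str) -> dict:
    """Validate name<TAB>id rows and return {id: [names...]}."""
    names_by_id = defaultdict(list)
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(columns)}")
        name, entity_id = columns[0].strip(), columns[1].strip()
        if not name:
            raise ValueError(f"line {line_number}: empty organization name")
        if not entity_id.isalnum():
            raise ValueError(f"line {line_number}: ID must be alphanumeric: {entity_id!r}")
        names_by_id[entity_id].append(name)
    return dict(names_by_id)
```

For the three sample rows above, the result is a single entry, `WE1X92`, listing IBM, Big Blue, and International Business Machines.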

Omit real world IDs

You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.

The omit file is a tab separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.

Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID is preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
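The three line types can be told apart mechanically by where the asterisk appears. The sketch below classifies the rows of an omit file; the function name and return shape are choices made for this example only.

```python
def parse_omit_file(text: str):
    """Return (pairs, lone_names, lone_qids) from tab-separated omit rows."""
    pairs, lone_names, lone_qids = set(), set(), set()
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each row must have exactly two tab-separated columns.
        name, qid = (part.strip() for part in line.split("\t"))
        if name == "*" and qid == "*":
            raise ValueError("a row cannot omit both the name and the QID")
        if name == "*":
            lone_qids.add(qid)        # no names may match through this QID
        elif qid == "*":
            lone_names.add(name)      # this name is never matched by real world ID
        else:
            pairs.add((name, qid))    # this QID is not used for this name
    return pairs, lone_names, lone_qids
```

Applied to the example above, the IBM row is a pair, the Nintendo row is a lone name, and the Q45700 row is a lone QID.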

To enable an omit file in RNI:

Place the omit file in the BT_ROOT directory.
Open omit_ids.datafiles, which is in the plugins/rni/bt_root/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:
```
ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
```
Save omit_ids.datafiles.

Address matching

The RNI plugin can match addresses in English, Traditional Chinese, and Simplified Chinese, returning a match score reflecting the similarity of two addresses.

In the RNI context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.

As with name and date matching, the process is to create an index containing addresses, then query an address against the index.

Note

Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.

Address definition

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library^[5] is used to parse the address string into address fields.

When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.

RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields such as house number, postal code, and unit, are matched using character-based methods.

Table 5. Supported Address Fields

Field Name	Description	Example(s)
`house`	venue and building names	"Brooklyn Academy of Music", "Empire State Building"
`houseNumber`	usually refers to the external (street-facing) building number	"123"
`road`	street name(s)	"Harrison Avenue"
`unit`	an apartment, unit, office, lot, or other secondary unit designator	"Apt. 123"
`level`	expressions indicating a floor number	"3rd Floor", "Ground Floor"
`staircase`	numbered/lettered staircase	"2"
`entrance`	numbered/lettered entrance	"front gate"
`suburb`	usually an unofficial neighborhood name	"Harlem", "South Bronx", "Crown Heights"
`cityDistrict`	these are usually boroughs or districts within a city that serve some official purpose	"Brooklyn", "Hackney", "Bratislava IV"
`city`	any human settlement including cities, towns, villages, hamlets, localities, etc.	"Boston"
`island`	named islands	"Maui"
`stateDistrict`	usually a second-level administrative division or county	"Saratoga"
`state`	a first-level administrative division	"Massachusetts"
`countryRegion`	informal subdivision of a country without any political status	"South/Latin America"
`country`	sovereign nations and their dependent territories, which have a designated ISO-3166 code	"United States of America"
`worldRegion`	currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean	"Jamaica, West Indies"
`postCode`	postal codes used for mail sorting	"02110"
`poBox`	post office box: typically found in non-physical (mail-only) addresses	"28"

Address field groups

When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.

When matching two fields, the following penalties are applied:

If the fields are the same, no penalty is applied. (road - road)
If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
If the fields are in different field groups, a large penalty is applied. (road - city)

Table 6. Address Groups

Group	Fields
house	house
house_number	houseNumber
road	road
unit	unit level staircase entrance
city	suburb cityDistrict city
state	island stateDistrict state
country	countryRegion country worldRegion
post_code	postCode
po_box	poBox
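The three penalty cases follow directly from the group table. The sketch below encodes Table 6 and reports which case applies to a pair of fields. It returns only the category; the actual penalty sizes are governed by the addressSameGroupPenalty and addressDifferentGroupPenalty parameters. The names `FIELD_GROUPS` and `penalty_kind` are hypothetical.

```python
# Table 6 as data: group name -> address fields in that group.
FIELD_GROUPS = {
    "house": ["house"],
    "house_number": ["houseNumber"],
    "road": ["road"],
    "unit": ["unit", "level", "staircase", "entrance"],
    "city": ["suburb", "cityDistrict", "city"],
    "state": ["island", "stateDistrict", "state"],
    "country": ["countryRegion", "country", "worldRegion"],
    "post_code": ["postCode"],
    "po_box": ["poBox"],
}

# Reverse index: field name -> group name.
GROUP_OF = {field: group for group, fields in FIELD_GROUPS.items() for field in fields}


def penalty_kind(field_a: str, field_b: str) -> str:
    """Return "none", "small", or "large" for a comparison between two fields."""
    if field_a == field_b:
        return "none"
    if GROUP_OF[field_a] == GROUP_OF[field_b]:
        return "small"
    return "large"
```

For example, `penalty_kind("suburb", "city")` is `"small"` because both fields belong to the city group, while `penalty_kind("road", "city")` is `"large"`.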

How Rosette calculates address match scores

The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.

Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the addresses into address fields. This process works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.
For example, most addresses are formatted from specific to general:
```
houseNumber road city state postCode
```
- The parser would provide predictable results for an address in an expected order:
  38 Concord Road, Apt. B Arlington MA
- The parser would have more difficulty if the address format was in an unexpected order:
  Arlington MA Concord Road #38 Apt B
If you are getting unexpected match values, check how the addresses are being parsed into address fields.
Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.
Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:
- Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.
- Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.
- Postal codes: Rosette uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.
Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.
Field Weights: Some fields in an address are considered more important than other fields. The scores from the selected matches are weighted by field type. These field type weightings can be modified based on the type of address data in your system.
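The last two steps, selecting the best field mapping and applying field weights, can be illustrated with a toy example. The sketch below is not Rosette's algorithm: it assumes the per-pair scores are already computed, tries every one-to-one pairing by brute force, ignores unpaired fields, and combines the selected scores as a weighted average. All names in it are hypothetical.

```python
from itertools import permutations


def best_field_mapping(fields_a, fields_b, pair_score, field_weight):
    """Pick the one-to-one pairing of fields with the highest weighted score.

    fields_a must be the address with fewer fields (and at least one).
    pair_score(a, b) returns the score for comparing field a with field b;
    field_weight(a) returns the weight of field a.
    """
    if not fields_a or len(fields_a) > len(fields_b):
        raise ValueError("pass the address with fewer (but at least one) fields first")
    total_weight = sum(field_weight(a) for a in fields_a)
    best_score, best_mapping = float("-inf"), []
    # Brute force is fine for a handful of fields; it grows factorially.
    for chosen in permutations(fields_b, len(fields_a)):
        mapping = list(zip(fields_a, chosen))
        weighted = sum(field_weight(a) * pair_score(a, b) for a, b in mapping)
        if weighted > best_score:
            best_score, best_mapping = weighted, mapping
    return best_score / total_weight, best_mapping


# Made-up pair scores for two addresses that each have a road and a city.
scores = {
    ("road", "road"): 0.9, ("road", "city"): 0.1,
    ("city", "road"): 0.2, ("city", "city"): 1.0,
}
score, mapping = best_field_mapping(
    ["road", "city"], ["road", "city"],
    lambda a, b: scores[(a, b)],
    lambda field: 1.0,  # every field weighted evenly
)
```

Here the pairing road-road and city-city wins over the crossed pairing, because its combined score is higher.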

Using address matching

Index addresses

Create an index.

curl -XPUT 'http://localhost:9200/rni-test'

Define a mapping for fields that will contain addresses. The type for each of these fields is "rni_address".

curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{
  "properties" : {
    "primary_name" : { "type" : "rni_name" },
    "residence" : { "type" : "rni_address" }
  }
}'

Index documents containing an address field.

curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
  "primary_name" : "Joe Schmoe",
  "residence" : {
    "houseNumber" : "123",
    "road" : "Main St",
    "city" : "Boston",
    "state" : "Massachusetts",
    "postCode" : "02110"
  }
}'

The address in the document can also be defined as a string.

curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
    "primary_name" : "Joe Schmoe",
    "residence" : "123 Main St, Boston, Massachusetts, 02110"
}'

Query field addresses

RNI compares the fields in the query with the fields in the index, matching each non-blank field. Addresses do not have to contain all the same fields to be compared and matched.

As with other objects, the query for an address consists of two parts: the base query and the RNI pairwise address match rescore query.

Base Query. The base query is a standard query against the address field. Refer to Query the Index.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {
      "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}"
    }
  }
}'

RNI Rescore with Addresses. Refer to Rescoring with RNI Pairwise Name Match.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : {
              "road" : "Main", 
              "state" : "MA"
            }
          }
        }
      }, 
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

The query returns a hit with the RNI address match score.

"hits": {
  "total" : 1,
  "max_score" : 0.6057692,
  "hits" : [
    {
      "_index" : "rni-test",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.6057692,
      "_source" : {
        "primary_name" : "Joe Schmoe",
        "residence" : {
          "houseNumber" : "123",
          "road" : "Main St",
          "city" : "Boston",
          "state" : "Massachusetts",
          "postCode" : "02110"
        }
      }
    }
  ]
}

The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.

Query string addresses

The address can be structured as a string for queries. The address structure for the query is independent of the format of the address in the original document. A string can be used in the query regardless of whether the indexed address was formatted with fields or as a string.

Base Query. The base query constructed with an address string.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : {"residence" : "Main, MA"}
  }
}'

RNI Rescore with Addresses. The rescore query with an address string.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query" : {
    "match" : { "residence" : "Main, MA" }
  },
  "rescore" : {
    "query" : {
      "rescore_query" : {
        "function_score" : {
          "address_score" : {
            "field" : "residence",
            "query_address" : "Main, MA"
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

The response shown here returns the address as a string because the indexed document in this example represented the address as a string. The response returns the address in the same format as the indexed document; the format of the query does not have to match the format of the indexed documents.

"hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.4552421,
    "hits" : [
      {
        "_index" : "rni-test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.4552421,
        "_source" : {
          "primary_name" : "Joe Schmoe",
          "residence" : "123 Main St, Boston, Massachusetts, 02110"
        }
      }
    ]
  }

The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
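If you build these requests programmatically, the base query and the rescore query can be generated from a single address value. The sketch below mirrors the JSON bodies shown in the curl examples above for both fielded and string addresses. The function name is hypothetical, and sending the body to Elasticsearch is left to whichever HTTP client you use.

```python
import json


def address_rescore_body(field, query_address, query_weight=0.0, rescore_query_weight=1.0):
    """Build the search body for a base match query plus an RNI address rescore.

    query_address is either a dict of address fields or a single address string.
    """
    # In the base query, a fielded address is passed as a JSON-encoded string.
    if isinstance(query_address, str):
        base_value = query_address
    else:
        base_value = json.dumps(query_address)
    return {
        "query": {"match": {field: base_value}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "function_score": {
                        "address_score": {
                            "field": field,
                            "query_address": query_address,
                        }
                    }
                },
                "query_weight": query_weight,
                "rescore_query_weight": rescore_query_weight,
            }
        },
    }


body = address_rescore_body("residence", {"road": "Main", "state": "MA"})
print(json.dumps(body, indent=2))
```

Passing `"Main, MA"` instead of the dictionary produces the string form of the same request.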

Configuring address matching

Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.

There are two types of override files for addresses:

Stop patterns and stop word prefixes designate address field elements to strip during indexing and queries.
Token pair overrides specify address field element pairs that match.

File Directories

The parameters are modified in the plugins/rni/bt_root/rlpnc/data/etc/parameter_profiles.yaml file.
The address matching override files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/overrides directory.
The address stop word files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords directory.

Modifying address parameters

To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter configuration files.

Note

Changes made to the "any" profile apply to all supported languages.

An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify the value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.

Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.

Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Address parameters

Parameter	Description	Behavior
`addressFinalBias`	Helps normalize scores	Increasing leads to a higher score for ALL addresses
`addressReorderPenalty`	Penalty for token reordering within a comparison	Increasing leads to lower final score
`addressDeletionScore`	Score for deleted token within a comparison
`addressUnpairedFieldScore`	Score for an unpaired field
`addressOverrideDefaultScore`	Score for override matches
`addressJoinedTokenLimit`	Maximum sum of the number of tokens considered when matching two address fields
`addressCrossFieldScoreThreshold`	Minimum value a cross-field score must have to be included in the final score
`addressSameGroupPenalty`	Multiplier on field comparisons from the same group	Increasing leads to lower final score
`addressDifferentGroupPenalty`	Multiplier on field comparisons from different groups.	Increasing leads to lower final score
`houseAddressFieldWeight`	Weight used during comparison of the house field
`houseNumberAddressFieldWeight`	Weight used during comparison of the house number field
`roadAddressFieldWeight`	Weight used during comparison of the road field
`unitAddressFieldWeight`	Weight used during comparison of the unit field
`levelAddressFieldWeight`	Weight used during comparison of the level field
`staircaseAddressFieldWeight`	Weight used during comparison of the staircase field
`entranceAddressFieldWeight`	Weight used during comparison of the entrance field
`suburbAddressFieldWeight`	Weight used during comparison of the suburb field
`cityDistrictAddressFieldWeight`	Weight used during comparison of the cityDistrict field
`cityAddressFieldWeight`	Weight used during comparison of the city field
`islandAddressFieldWeight`	Weight used during comparison of the island field
`stateDistrictAddressFieldWeight`	Weight used during comparison of the stateDistrict field
`stateAddressFieldWeight`	Weight used during comparison of the state field
`countryRegionAddressFieldWeight`	Weight used during comparison of the countryRegion field
`countryAddressFieldWeight`	Weight used during comparison of the country field
`worldRegionAddressFieldWeight`	Weight used during comparison of the worldRegion field
`postCodeAddressFieldWeight`	Weight used during comparison of the postCode field
`poBoxAddressFieldWeight`	Weight used during comparison of the poBox field

Stop patterns and stop word prefixes

RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries before matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as the, that you do not want to include in address matching.

For each address field, RNI performs the following steps in order:

Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.
Stop patterns are applied.
Stop words are applied.
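The three steps can be sketched as a small pipeline. This is an approximation for experimenting with your own stop patterns and prefixes, not RNI's implementation: it strips only the four punctuation characters named above, deletes them rather than replacing them with spaces, uses Python regular expressions rather than Java's, and matches prefixes as literal strings. The function name is hypothetical.

```python
import re

_PUNCTUATION = re.compile(r"[.,\-#]")  # periods, commas, hyphens, number sign
_WHITESPACE = re.compile(r"\s+")


def normalize_field(value, stop_patterns=(), stop_prefixes=()):
    """Apply character normalization, then stop patterns, then stop word prefixes."""
    # Step 1: strip punctuation, collapse white space, lower-case.
    text = _PUNCTUATION.sub("", value)
    text = _WHITESPACE.sub(" ", text).strip().lower()

    # Step 2: remove stop patterns, longer patterns before shorter ones.
    for pattern in sorted(stop_patterns, key=len, reverse=True):
        text = re.sub(pattern, "", text)

    # Step 3: strip the longest matching stop word prefix.
    for prefix in sorted(stop_prefixes, key=len, reverse=True):
        if text.startswith(prefix):
            text = text[len(prefix):]
            break

    # Tidy any white space left behind by the removals.
    return _WHITESPACE.sub(" ", text).strip()
```

For example, with the stop prefix `"the "` the value `The  United States` becomes `united states`.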

Stop pattern

A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.

Stop patterns for a given address field are specified in a UTF-8 file with the AddressField name:

stopregexes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,^[6] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at beginning and end where needed.

Note

The delimiter before FIELD is a double underscore (__)

Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.

Stop pattern files are arranged by field in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop word prefixes

A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.

Stop word prefixes for a given address field are specified in a UTF-8 file with the AddressField name:

stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt

Note

The delimiter before FIELD is a double underscore (__)

Prefixes in the address field matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.

RNI includes files with stop word prefixes for selected address fields in English and Chinese. These files are in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stopword prefixes for use on CITY address field for English.

Overriding token pair matches

You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

LANG1_LANG2_FIELD.txt

where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value will be used.

Token1 Tab Token2 Tab [0.0-1.0]

A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [0.0-1.0]/force

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".

RNI includes plugins/rni/bt_root/rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:

Massachusetts  MA
California  CA

When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field each token element in the pair pertains to. For example, zho_eng_cityDistrict.txt indicates that the contents match Chinese - English cityDistrict address fields.

Date matching

RNI can match dates, returning a date match score reflecting the time similarity of the two dates. Dates that are closer together are considered a stronger match and return a match score closer to 1.

For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.

The process is similar to name matching:

Index the dates in connection to the related names.
Query the date and name, receiving back a match score.

The query will return separate match scores for the name and for the associated date of birth. You may decide that the name is more important than the birth date. Within your system, you can weight and combine the name and date match scores to determine the final match score.

Date definition

A date contains a year, month, and day, but not all fields are required for matching. All common delimiters for English dates are supported, and dates can be expressed with various orderings. RNI will filter out some non-date related words. Formats that include time of day are not supported.

You can specify an Elasticsearch date format that includes time information in the mapping. The time component will be ignored.

RNI supports a wide variety of date formats. The best date format will always be the ISO standard of YYYY-MM-DD, where March 7, 1984 is written as 1984-03-07. RNI will attempt to interpret any date provided, although the less standard the format, the less guarantee that its interpretation will be the one you might expect.

Dates can be represented as YYYY-MM-DD. When some fields are unspecified, the letters represent the unknown values. For example, March 7 is YYYY-03-07, since the year is unspecified. Two-digit years are assumed to have unknown centuries: 3/7/84 is interpreted as YY84-03-07, so it matches March 7, 1984 just as well as March 7, 2084 or March 7, 1884.

When a date is provided, RNI will attempt to identify the year, month, and day within it, leaving blank any fields it cannot determine. You can omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12- are all valid dates.

If RNI encounters an invalid date in an acceptable format, such as March 38, 1984, it does not return an error. Instead, it treats the impossible value as unknown, interpreting the date as March 1984.
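The placeholder notation can be made concrete with a small sketch. It handles only hyphen-delimited year-month-day strings with four-digit years, checks days against a simple 1 to 31 range, and is not RNI's date parser, which accepts many more formats. The function names are hypothetical.

```python
def parse_partial_date(text):
    """Split a hyphen-delimited Y-M-D string into (year, month, day); None means unknown."""
    parts = text.split("-")
    parts += [""] * (3 - len(parts))  # "1955" becomes ["1955", "", ""]
    if len(parts) != 3:
        raise ValueError(f"expected at most three fields: {text!r}")
    if parts[0] and len(parts[0]) != 4:
        raise ValueError("this sketch only handles four-digit years")
    year, month, day = (int(p) if p else None for p in parts)
    # Impossible values are treated as unknown rather than as errors.
    if month is not None and not 1 <= month <= 12:
        month = None
    if day is not None and not 1 <= day <= 31:
        day = None
    return year, month, day


def format_partial_date(year, month, day):
    """Render a partial date, using letters for the unknown fields."""
    y = f"{year:04d}" if year is not None else "YYYY"
    m = f"{month:02d}" if month is not None else "MM"
    d = f"{day:02d}" if day is not None else "DD"
    return f"{y}-{m}-{d}"
```

With this sketch, `1955--03` is rendered as `1955-MM-03`, and `1984-03-38` as `1984-03-DD`, since 38 is not a possible day.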

Supported date formats

RNI supports a wide variety of date formats.

Days can be represented by 1 or 2 digits.
Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).
Years can be represented by 1, 2, 3 or 4 digits.
Supported delimiters include , . - /, as well as a space.
Partial fields can be entered.
At this time, only English month names and abbreviations are recognized.
All words are case-insensitive; upper and lower case are interpreted the same.

The following table shows different acceptable formats for the date March 7, 1984.

Format	Valid Examples	Notes
Y-M-D	1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7
M-D	03-07; 3/7; Mar-07; March 7
Y-M	1984-03; 1984 March; 1984-Mar
YYYYMMDD	19840307	All 8 digits must be included
M-D-Y	03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984
M-YYYY	03-1984; March 1984; Mar-1984	The year must include 4 digits. March-84 will not be recognized.
D-M-Y	07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984
D-M	07-03; 7/3; 07-Mar; 7 March
D(MONTH)Y	7MAR84; 07March1984	The month is a word or abbreviation
YYYY	1984
Month	March

Using date matching

Index dates

Create an index.
curl -XPUT 'http://localhost:9200/rni-test'
Define a mapping for fields that will contain dates. The type for a date field when matching is "rni_date".
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date" }, "primary_name" : { "type" : "rni_name" } } }'
Optionally, in the mapping, you can specify an Elasticsearch date format. All dates must adhere to the specified format. If you specify a format that includes time information, RNI ignores the time component of the date.
Warning
Specifying an Elasticsearch format disables support for unspecified fields. If, for example, you select a format that does not include a day field ("MM-yyyy"), you will get an error when you use the date format in a query.
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date", "format" : "MM-yyyy-dd" }, "primary_name" : { "type" : "rni_name" } } }'
Index documents containing a date field.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "birth_date" : "07-1955-24" }'
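
Because every date must adhere to a format declared in the mapping, it can be useful to check values on the client side before indexing. The sketch below assumes that the Elasticsearch pattern "MM-yyyy-dd" corresponds to the Python `strptime` pattern `%m-%Y-%d`; `strptime` is slightly more lenient (it accepts a one-digit month), so treat this as a rough pre-check rather than a reproduction of the server's validation. The helper name `conforms_to_format` is hypothetical.

```python
from datetime import datetime

# Assumed strptime equivalent of the Elasticsearch pattern "MM-yyyy-dd".
STRPTIME_PATTERN = "%m-%Y-%d"


def conforms_to_format(value: str, pattern: str = STRPTIME_PATTERN) -> bool:
    """Return True if the value parses as a complete date in the pattern."""
    try:
        datetime.strptime(value, pattern)
    except ValueError:
        return False
    return True


print(conforms_to_format("07-1955-24"))  # True: month, year, and day present
print(conforms_to_format("07-1955"))     # False: the day field is missing
```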

Query dates

There are many ways to incorporate date matching within your query. Here are two examples, one with date matching by itself, and one with date and name matching.

Basic Date Matching

Base Query. The base query is a standard query against the date field. Refer to Query the Index.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : {
            "birth_date" : "08-1955-25"
        }
    }
}'

RNI Rescore with Dates. Refer to Rescoring with RNI Pairwise Name Match.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "match" : { "birth_date" : "08-1955-25" }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "date_score" : {
                        "field" : "birth_date",
                        "query_date" : "08-1955-25"
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

The query returns a hit, with the RNI date match score.

"hits": {
   "total": 1,
   "max_score": 1.618923,
   "hits": [
     {
       "_index": "test",
       "_type": "_doc",
       "_id": "AVXMepnorGuybmuiQtQr",
       "_score": 0.8120856,
       "_source": {
         "primary_name": "Joe Schmoe",
         "birth_date": "07-1955-24"
       }
     }
   ]
 }

Date and Name Match

Base Query. The base query is a standard query against the date and name fields.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "primary_name": "Joe S."
          }
        },
        {
          "match": {
            "birth_date": "08-1955-25"
          }
        }
      ]
    }
  }
}'

RNI Rescore with Dates. Use the doc_score function in the rescore when matching a combination of Elasticsearch field types instead of the functions for a single type (name_score and date_score). The name field is also added to the rescore.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "primary_name": "Joe S."
          }
        },
        {
          "match": {
            "birth_date": "08-1955-25"
          }
        }
      ]
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "function_score": {
          "doc_score": {
            "fields": {
              "primary_name": {
                "query_value": "Joe S."
              },
              "birth_date": {
                "query_value": "08-1955-25"
              }
            }
          }
        }
      },
      "query_weight" : 0.0,
      "rescore_query_weight" : 1.0
    }
  }
}'

Date match parameters

As with name matching, a set of parameters controls date matching. The parameter values can be edited in the plugins/rni/bt_root/rlpnc/data/etc/parameter_defs.yaml file.

Table 7. Date General Parameters

Parameter Name

Description

Behavior

alternativeTimeProximityMatch

When enabled, computes the chronological distance between dates in years. The default is false.

Instead of using the distance between the dates in days, the score is calculated from the distance between the dates in years.

dateOrdering

Sets the default date representations. Valid values are YMD, DMY, and MDY. The default value is MDY.

Supports different date formats. For example, UK dates tend to be DMY; US dates tend to be MDY.

improveSingleDigitManipulationMatch

Controls how much the score is increased if there is exactly one instance of digit manipulation^[a] and there are no other differences. The default value is 0 (off).

If the parameter is set to 0 (minimum), then the score is not increased at all. If the parameter is set to 1 (maximum), then two dates with a single digit manipulation will be an exact match.

maxYearDistanceForDigitManipulation

Sets the maximum number of years beyond which two dates will not be affected by improveSingleDigitManipulationMatch. The default value is 10.

By default, two dates that are more than ten years apart do not have their match score increased by improveSingleDigitManipulationMatch, even if they contain exactly one instance of digit manipulation^[a] and no other differences.

thresholdToDropoffBiasMapping

Specifies the points at which scores should drop, and by how much, based on the difference in years between the two dates. By default, this object is empty, which means no dropoff bias applies.

The match score is decreased based on the difference in years.

If the parameter is {2: 0.7, 5: 0.1} then the following biases will be applied:

Years differ by	Bias applied
1	none
2-4	0.7
5 or more	0.1

timeProximityYearInterval

Specifies the time interval in years that alternativeTimeProximityMatch uses to determine a score. The default is 10.

By default, dates within 10 years of each other score above a threshold of 0.8.

tryDayMonthSwap

Allows for date matching with swapped day and month fields. It is on by default.

This parameter attempts to correct for parsing errors by swapping the day and month. Turn it off if you only want to match the dates exactly as indexed.

^[a]A digit manipulation is a transformation to a digit that can be accomplished with minimal additional lines. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0<>8, 1<>7, 3<>8, 5<>8, 5<>6, 6<>8, 7<>2.
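
As a rough illustration of the footnote, the Python sketch below checks whether two digit strings differ by exactly one of the listed manipulations. The function name and the comparison of fixed-length strings are assumptions for this example; the guide does not describe how RNI detects manipulations internally, and the sketch ignores maxYearDistanceForDigitManipulation.

```python
# Pairs taken from the footnote above; each pair applies in both directions.
_PAIRS = [("0", "8"), ("1", "7"), ("3", "8"), ("5", "8"),
          ("5", "6"), ("6", "8"), ("7", "2")]
MANIPULATIONS = {frozenset(pair) for pair in _PAIRS}


def is_single_digit_manipulation(left: str, right: str) -> bool:
    """True if the strings differ in exactly one position by a listed pair."""
    if len(left) != len(right):
        return False
    differences = [(a, b) for a, b in zip(left, right) if a != b]
    return len(differences) == 1 and frozenset(differences[0]) in MANIPULATIONS


print(is_single_digit_manipulation("19840307", "19840301"))  # True  (7 <> 1)
print(is_single_digit_manipulation("19840307", "19840309"))  # False (7 <> 9 is not listed)
```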

Because dates are sometimes written month-day and other times day-month, tryDayMonthSwap tries matching the date fields as written as well as with the month and day fields switched. The best score is returned as the match score. For example, if the dates in question are 1970-3-5 and 1970-6-4, this feature compares the following four pairs:

1970-3-5	↔	1970-6-4
1970-3-5	↔	1970-4-6
1970-5-3	↔	1970-6-4
1970-5-3	↔	1970-4-6
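
The four pairs can be generated mechanically, as in the following Python sketch. The names `swap_candidates`, `candidate_pairs`, and `best_score` are hypothetical, and `score_fn` stands in for whatever date scoring is in use; the sketch also does not check whether a swapped date is valid (for example, a day above 12 moved into the month field).

```python
from itertools import product
from typing import Callable, List, Tuple

Date = Tuple[int, int, int]  # (year, month, day)


def swap_candidates(date: Date) -> List[Date]:
    """Return the date as written and with month and day exchanged."""
    year, month, day = date
    return [(year, month, day), (year, day, month)]


def candidate_pairs(left: Date, right: Date) -> List[Tuple[Date, Date]]:
    """All combinations of as-written and swapped forms of both dates."""
    return list(product(swap_candidates(left), swap_candidates(right)))


def best_score(left: Date, right: Date,
               score_fn: Callable[[Date, Date], float]) -> float:
    """Score every candidate pair and keep the best result."""
    return max(score_fn(a, b) for a, b in candidate_pairs(left, right))


for a, b in candidate_pairs((1970, 3, 5), (1970, 6, 4)):
    print(a, "<->", b)
```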

Table 8. Date Weighting Parameters

Parameter Name	Description	Behavior
`dayDistanceWeight`	Weight for the day field comparison of the dates	1 and 30 are far, even if they are close in time. They will have a low match score.
`monthDistanceWeight`	Weight for the month field comparison of the dates	1 and 12 are far, even if they are close in time. They will have a low match score.
`stringDistanceWeight`	Weight for the edit distance between the two dates when converted to a standard string (05021974 for 5/2/1974)	1979-12-31 and 1980-1-1 will be 19791231 and 19800101. They will have a low match score.
`timeDistanceWeight`	Weight for the time distance (i.e., the number of days) between two dates. The score is based on the number of days between the two dates.	1979-12-31 and 1980-1-1 look different, but their time difference is very close. They will have a high match score.
`yearDistanceWeight`	Weight for the year field comparison of the dates.	Close years will have a high match score.

The date weighting fields control the relative strength of each aspect of the date-matching algorithm. A separate score is calculated for each match type. The final match score is calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is ignored and its weight is evenly distributed across the other fields.

Dates with a high time match score may have a very low string match score. Time finds dates that are close together; string gives high scores to similarly formatted dates.
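
The contrast between the time and string views can be seen with plain Python. This sketch only computes the raw quantities (days apart and Levenshtein edit distance over YYYYMMDD strings); how RNI converts them into scores is not specified here, and the choice of Levenshtein distance is an assumption.

```python
from datetime import date


def days_apart(a: date, b: date) -> int:
    """Absolute number of days between two dates."""
    return abs((a - b).days)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


left, right = date(1979, 12, 31), date(1980, 1, 1)
print(days_apart(left, right))                                            # 1
print(edit_distance(left.strftime("%Y%m%d"), right.strftime("%Y%m%d")))  # 5
```

One day apart in time, yet five of the eight characters differ, which is why the two weights can pull the final score in opposite directions.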

Record matching

A search can include multiple fields and return a single match and match score. The fields can be any combination of type rni_name, rni_date, rni_address, or any other Elasticsearch field type.

Each field can be assigned a weight to reflect its importance in the overall matching logic. When searching for a match, some fields are more important in determining a match than others. For example, the name field is likely more important in determining a match than an address field. If no weights are defined, each field is weighted equally.

When matching records, a similarity score is calculated for each field. The final match score is then calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a document, that field is removed from the score calculation and its weight is evenly distributed across the other fields. You can override this behavior by using the score_if_null option to specify a score to be returned if the field is null in the index document.
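
The calculation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function `record_score` is hypothetical, and "evenly distributed" is read literally, meaning the weight of a missing field is split in equal shares among the remaining fields.

```python
from typing import Dict, Optional


def record_score(scores: Dict[str, Optional[float]],
                 weights: Dict[str, float],
                 score_if_null: Optional[Dict[str, float]] = None) -> float:
    """Weighted arithmetic mean of per-field scores; None marks a null field."""
    score_if_null = score_if_null or {}
    resolved = {}
    freed_weight = 0.0
    for field, weight in weights.items():
        score = scores.get(field)
        if score is None and field in score_if_null:
            score = score_if_null[field]   # explicit override for null fields
        if score is None:
            freed_weight += weight         # field drops out of the calculation
        else:
            resolved[field] = (score, weight)
    if not resolved:
        return 0.0
    bonus = freed_weight / len(resolved)   # equal share for each remaining field
    total_weight = sum(w + bonus for _, w in resolved.values())
    return sum(s * (w + bonus) for s, w in resolved.values()) / total_weight


weights = {"name": 4, "dob": 2, "address": 2}
scores = {"name": 0.9, "dob": None, "address": 0.5}
print(record_score(scores, weights))                # dob removed, weight shared
print(record_score(scores, weights, {"dob": 0.0}))  # dob scored as 0.0 instead
```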

Use the doc_score function in the rescore query when matching records that include multiple field types, instead of the functions for a single type, such as the name_score and date_score functions. The doc_score function has built-in similarity functions for many core types. It does not, however, currently support multiple nested fields.

If your record query contains types which the doc_score function doesn't support, you can create a custom similarity function using the Elasticsearch script_score function in the rescore query.

Supported field types

The doc_score function has default support for rni_name, rni_date, rni_address, and many of the Elasticsearch core field types. All default similarity scores are between 0.0 and 1.0.

Field Type(s)	Default Similarity Function	Example(s)
rni_name	name_score (refer to RNI pairwise match score)	'John David Smith' vs 'Jon D Smith' = 0.88
rni_date, date	date_score (refer to Date matching)	'2010-11-4' vs '2010-5-11' = 0.92
rni_address	address_score (refer to Address matching)	'Red Cedar Ct' vs 'Cedar Ct' = 0.53
keyword, text, string	Normalized edit distance	'37 Congress St.' vs '35 Congres St.' = 0.875
integer, long, short, double, float	Normalized difference (e.g., percentage)	'65' vs '59' = 0.908
boolean	Equality	'true' vs 'true' = 1.0, 'true' vs 'false' = 0.0
geo_point	Log function over Haversine distance	'[lat=42.361145, lon=-71.057083]' vs '[lat=42.3736, lon=-71.1097]' = 0.83
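
To make the simpler rows of the table concrete, here is a Python sketch of a normalized numeric difference and a boolean equality score. The exact formulas used by doc_score are not given in this guide; the numeric formula below (one minus the absolute difference divided by the larger magnitude) is an assumption that happens to reproduce the 65 vs 59 example.

```python
def numeric_similarity(a: float, b: float) -> float:
    """Assumed formula: 1 - |a - b| / max(|a|, |b|), clamped at 0."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    return max(0.0, 1.0 - abs(a - b) / largest)


def boolean_similarity(a: bool, b: bool) -> float:
    """Equality: 1.0 when the values agree, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(round(numeric_similarity(65, 59), 3))  # 0.908
print(boolean_similarity(True, False))       # 0.0
```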

Using record matching

Index records

Create an index with a mapping containing fields with different types

curl -XPUT 'http://localhost:9200/rni-test' -H'Content-Type: application/json' -d '{
    "mappings" : {
         "properties" : {
            "name" : { "type" : "rni_name" },
            "dob" : { "type" : "rni_date" },
            "address" : { "type" : "rni_address" },
            "height" : { "type" : "integer" },
            "nationality" : { "type" : "keyword" }
        }
    }
}'

Index documents that contain those fields

curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{
    "name" : "Ryan McDonagh", 
    "dob" : "11/19/1987",
    "address" : {
        "houseNumber" : "47",
        "road" : "Park St",
        "city" : "Boston",
        "state" : "MA"
    },
    "nationality" : "USA", 
    "height" : 65 
}'

Basic multi-field query

The query can be a record containing multiple fields. The fields in the query record must be mapped to those of the indexed documents.

Base Query. The base query is a standard Elasticsearch query containing multiple fields that will return candidates for rescoring.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "{\"data\" : \"Brian McDonough\", \"entityType\": \"PERSON\"}" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    }
}'

RNI Rescore with Records. Use the doc_score function to rescore the indexed documents against a query record.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "rni_query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : { "query_value": "Brian McDonough" },
                             "dob" : { "query_value": "10/19/87" },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   }
                             },
                             "height" : { "query_value": 67 },
                             "nationality" : { "query_value": "CANADA" }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

As with addresses, the query_value of names can be an object to match additional name information. The rescore query above can easily be modified to additionally match against a name's entityType field:

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "rni_query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : {
                                  "query_value": {
                                      "data": "Brian McDonough",
                                      "entityType": "PERSON"
                                  }
                             },
                             "dob" : { "query_value": "10/19/87" },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   }
                             },
                             "height" : { "query_value": 67 },
                             "nationality" : { "query_value": "CANADA" }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Note

The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.
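
If you build requests programmatically, a JSON library can produce the escaped string for you. The Python sketch below serializes the address object to a string before placing it in the match clause; the helper name `build_base_query` is hypothetical, and sending the request to Elasticsearch is left out.

```python
import json


def build_base_query(name: str, dob: str, address: dict) -> str:
    """Return the base query body, with the address embedded as a JSON string."""
    body = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": name}},
                    {"match": {"dob": dob}},
                    # The match clause needs a string, so serialize the object.
                    {"match": {"address": json.dumps(address)}},
                ]
            }
        }
    }
    return json.dumps(body, indent=2)


address = {"houseNumber": "48", "road": "Parker St", "city": "Boston", "state": "MA"}
print(build_base_query("Brian McDonough", "10/19/87", address))
```

The outer `json.dumps` call escapes the inner quotes, so the printed body contains `\"houseNumber\"` in the same form as the curl examples.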

Weighted multi-field query

Each field can be given a weight to reflect its importance in the overall matching logic.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                         "fields": {
                             "name": { "query_value": "Brian McDonough", "weight": 4 },
                             "dob": { "query_value": "10/19/87", "weight": 2 },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                             },
                             "height" : { "query_value": 67, "weight": 0.5},
                             "nationality" : { "query_value": "CANADA", "weight": 1 }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

By default, if a queried-for field is null in the index, the field is removed from the score calculation, and the weights of the other fields are redistributed. However, you can override this behavior by using the score_if_null option to specify what score should be returned for this field if it is null in the index document.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : {
            "should" : [
                { "match" : { "name" : "Brian McDonough" } },
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score" : {
                         "fields" : {
                             "name" : { "query_value": "Brian McDonough", "weight": 4, "score_if_null" : 0.0  },
                             "dob": { "query_value": "10/19/87", "weight": 2 },
                             "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                             },
                             "height" : { "query_value": 67, "weight": 0.5},
                             "nationality" : { "query_value": "CANADA", "weight": 1 ,  "score_if_null" : 1.0  }
                         }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Multi-field query with multiple nested fields

The doc_score function for rescoring does not currently support search queries containing multiple nested fields. To perform these queries, chain multiple rescorers and adjust the query_weight and rescore_query_weight parameters to control the relative importance of the original query and of the rescore query, respectively. When chaining multiple RNI advanced rescorers, be sure to add "score_mode":"total" to each rni_query object to ensure the final score is properly accumulated.

This example expands the previous examples, adding alias names (aliases field) and replacing the single date of birth (dob field) with a list of dates of birth, one for each alias (dobs field).

Create an index with a mapping containing multiple nested fields

curl -XPUT "http://localhost:9200/rni-test" -H 'Content-Type: application/json' -d'{
    "mappings": {
    "properties": {
      "name": {
        "type": "rni_name"
      },
      "aliases": {
        "type": "nested",
        "properties": {
          "alias_name": {
            "type": "rni_name"
          }
        }
      },
      "dobs": {
        "type": "nested",
        "properties": {
          "dob": {
            "type": "rni_date"
          }
        }
      },
      "address": {
        "type": "rni_address"
      },
      "height": {
        "type": "integer"
      },
      "nationality": {
        "type": "keyword"
      }
    }
  }
}'

Index documents that contain the fields

curl -XPUT "http://localhost:9200/rni-test/_doc/1" -H 'Content-Type: application/json' -d'{
  "name": "Ryan McDonagh",
  "aliases": [
    {
      "alias_name": "Rayan McDonagh"
    },
    {
      "alias_name": "R. McDonagh"
    },
    {
      "alias_name": "Rayan M."
    }
  ],
  "dobs": [
    {
      "dob": "11/19/1987"
    },
    {
      "dob": "11/20/1987"
    },
    {
      "dob": "10/19/1987"
    }
  ],
  "address": {
    "houseNumber": "47",
    "road": "Park St",
    "city": "Boston",
    "state": "MA"
  },
  "nationality": "USA",
  "height": 65
}'

Query index with chained multiple rescorers

curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{
  "query": {
    "bool": {
     "should": [
      {
        "nested": {
          "path": "dobs",
          "query": {
            "bool": {
              "should": {
                "match": { "dobs.dob": "10/19/87"}
              }
            }
          }
       } 
     },
     {
        "nested": {
          "path":"aliases",
          "query": {
            "bool": {
              "should": {
                "match": {"aliases.alias_name": "Brian McDonough"}
              }
            }
          }
       }
     },
     {
        "match":{
          "address": "{\"houseNumber\": \"48\", \"road\": \"Parker St\", \"city\": \"Boston\", \"state\": \"MA\" }"
        }
      }
    ]
  }
 },
 "rescore": [
   {      
     "rni_query": {
       "rescore_query": {
         "nested": {
           "score_mode": "max",
           "path": "aliases",
           "query": {
             "rni_function_score": {
               "name_score": {
                 "field": "aliases.alias_name",
                 "query_name": "Brian McDonough",
                 "window_size_allowance": 1
               }
             }
           }
         }
       },
       "score_mode": "total",
       "query_weight": 0.0,
       "rescore_query_weight": 1.0
       }
     },
     {
      "rni_query": {
        "rescore_query": {
          "nested": {
            "score_mode": "max",
            "path": "dobs",
            "query": {
              "rni_function_score": {
                "date_score": {
                  "field": "dobs.dob",
                  "query_date": "10/19/87"
                }
              }
            }
          }
        },
        "score_mode": "total",
        "query_weight": 0.67,
        "rescore_query_weight": 0.33
        }
      },
      {
       "rni_query": {
         "rescore_query": {
           "rni_function_score": {
             "address_score": {
               "field": "address",
               "query_address": {
                 "houseNumber": "48",
                  "road": "Parker St",
                  "city": "Boston",
                  "state": "MA"
                }
              }
            }
          },
          "score_mode": "total",
          "query_weight": 0.75,
          "rescore_query_weight": 0.25
         }
       },
      {      
        "query": {
        "rescore_query": {
          "match": {
            "height": 67
          }
        },
        "query_weight": 0.89,
        "rescore_query_weight": 0.11
        }
       },
      {      
        "query": {
        "rescore_query": {
          "match": {
            "nationality": "CANADA"
          }
        },
        "query_weight": 0.9,
        "rescore_query_weight": 0.1
      }
    }
  ]
}'

To calculate the rescore_query_weight for each nested field, you have to work from bottom to top, dividing each field's desired weight by the product of the already-calculated query_weight values. The query_weight is calculated by subtracting the rescore_query_weight from 1.

If there are no previous query_weight values, the rescore_query_weight is simply the desired field weight.

In this example, the desired field weights are 0.4, 0.2, 0.2, 0.1, and 0.1 for the alias, dob, address, height, and nationality fields, respectively. The numbered steps follow the order of the rescorers in the query, from first to last.

1	First rescorer (alias name): field weight = 0.4; rescore_query_weight = 0.4 / (0.667 x 0.75 x 0.89 x 0.9) = 1; query_weight = 1 - 1 = 0
2	Second rescorer (date of birth): field weight = 0.2; rescore_query_weight = 0.2 / (0.75 x 0.89 x 0.9) = 0.333; query_weight = 1 - 0.333 = 0.667
3	Third rescorer (address): field weight = 0.2; rescore_query_weight = 0.2 / (0.89 x 0.9) = 0.25; query_weight = 1 - 0.25 = 0.75
4	Fourth rescorer (height): field weight = 0.1; rescore_query_weight = 0.1 / 0.9 = 0.11; query_weight = 1 - 0.11 = 0.89
5	Fifth rescorer (nationality): field weight = 0.1; rescore_query_weight = 0.1; query_weight = 1 - 0.1 = 0.9
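
The bottom-to-top calculation can be automated. The Python sketch below derives the weight pairs from the desired field weights and then checks that chaining them reproduces the intended weighted sum. The function names are hypothetical, and the combination rule (each rescorer computes query_weight x previous score + rescore_query_weight x rescore score) is the assumed behavior when score_mode is total.

```python
from typing import List, Tuple


def chained_rescore_weights(field_weights: List[float]) -> List[Tuple[float, float]]:
    """Return (query_weight, rescore_query_weight) per rescorer, first to last."""
    pairs = []
    product_of_query_weights = 1.0
    for weight in reversed(field_weights):           # work from bottom to top
        rescore_weight = weight / product_of_query_weights
        query_weight = 1.0 - rescore_weight
        pairs.append((query_weight, rescore_weight))
        product_of_query_weights *= query_weight
    pairs.reverse()
    return pairs


def chained_score(pairs: List[Tuple[float, float]],
                  field_scores: List[float], base_score: float) -> float:
    """Apply the rescorers in order, starting from the base query score."""
    score = base_score
    for (query_weight, rescore_weight), field_score in zip(pairs, field_scores):
        score = query_weight * score + rescore_weight * field_score
    return score


pairs = chained_rescore_weights([0.4, 0.2, 0.2, 0.1, 0.1])
for query_weight, rescore_weight in pairs:
    print(f"query_weight={query_weight:.3f} rescore_query_weight={rescore_weight:.3f}")
```

With unrounded intermediate values, the first rescorer comes out at a rescore_query_weight of 1 and a query_weight of 0, matching the example.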

Weighted multi-field query with custom similarity function

While the doc_score function has built-in similarity functions for many core field types, a custom similarity function can be provided at query time. In this manufactured example, we'll use a simple script_score function that matches CANADA and USA with a high score. Refer to the Elasticsearch documentation for more details about Elasticsearch scripting. Any other function can also be used.

curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{
    "query" : {
        "bool" : { 
            "should" : [ 
                { "match" : { "name" : "Brian McDonough" } }, 
                { "match" : { "dob" : "10/19/87" } },
                { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } }
            ]
        }
    },
    "rescore" : {
        "query" : {
            "rescore_query" : {
                "function_score" : {
                    "doc_score": {
                        "fields": {
                            "name": { "query_value": "Brian McDonough", "weight": 4 },
                            "dob": { "query_value": "10/19/87", "weight": 2 },
                            "address" : {
                                  "query_value" : { 
                                      "houseNumber" : "48", 
                                      "road" : "Parker St", 
                                      "city" : "Boston", 
                                      "state" : "MA" 
                                   },
                                   "weight" : 2
                            },
                            "height": { "query_value": 67, "weight": 0.5 },
                            "nationality": { 
                               "function": {
                                   "function_score": {
                                       "script_score": {
                                           "script": {
                                               "lang": "painless",
                                               "params": {
                                                   "query_value": "CANADA"
                                               },
                                               "inline": "if (params.query_value == '\''CANADA'\'' && doc['\''nationality'\''].value == '\''USA'\'') {return 0.8} else {return 0.2}"
                                           }
                                       }
                                   }
                                },
                                "weight": 1
                            }
                        }
                    }
                }
            },
            "query_weight" : 0.0,
            "rescore_query_weight" : 1.0
        }
    }
}'

Explainability of RNI Matching

Understanding how a match score was calculated can be as important as the score itself. When matching two names or records, RNI returns a JSON response explaining in detail how the two names, dates, addresses, or records were matched. With this information, you can understand how the score was calculated and, if necessary, modify the matching parameters to better solve your matching problems.

The following concepts are helpful when reviewing the explainInfo JSON file.

When two objects are being compared, one is referred to as the left input, one as the right input.
Every token of the left object is compared to every token of the right object. Token strings, made up of multiple tokens, may also be compared.
Names are usually composed of multiple tokens. For example, John Fitzgerald Kennedy is 3 tokens.

Common Terms

The response JSON contains sections for each type of object: names, addresses, and dates. While each object has its own criteria for comparison, there are common terms used for all comparisons, as shown below.

Table 9. Definitions of Terms

Term	Definition	Note
bin	A number representing the frequency of the token in the language. A lower bin indicates the token in unusual and therefore should be more highly weighted when calculating the similarity score.
biasedBin	The bin raised to a power from .1 to 10 (default 0.970). This value is set by the `frequencyRankBias` parameter.
scoreInIsolation	The matching score of just the tuples being compared, ignoring things like position in the name, name weighting, etc. This will show a match core of 1.000 if it is an exact match of tokens, even if if there are biases that will lower the score in context.
scoreInContext	The matching score between the tuples taking into account the placement in the overall query and any biases related to the overall query.
(left/right)MinTokenIndex	The index of the first token in the string of tokens. For single tokens, the min and max token indexes have the same value.	An index of -1 or -2 means the token isn't in the name and the token is considered a deletion.
(left/right)MaxTokenIndex	The index of the last token in the string of tokens. For single tokens, the min and max token indexes have the same value.	An index of -1 or -2 means the token isn't in the name and the token is considered a deletion.
unbiasedScore	The raw score before any calculations using `finalBias`, `adjustOneSidedDeletionScores`, or other such bias parameters.
score	The final score after `finalBias`, `adjustOneSidedDeletionScores`, and other such bias parameters are applied.
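
The relationship between bin and biasedBin can be checked against the sample values shown in the name matching examples in this guide. The following is a minimal Python sketch, assuming the default `frequencyRankBias` of 0.970; the function name `biased_bin` is hypothetical and is not part of RNI.

```python
import math

# Assumption: the default exponent from the frequencyRankBias description.
FREQUENCY_RANK_BIAS = 0.970


def biased_bin(bin_value, bias=FREQUENCY_RANK_BIAS):
    """Raise a token's frequency bin to the frequencyRankBias power."""
    return bin_value ** bias


# (bin, biasedBin) pairs copied from the sample responses in this guide.
samples = [
    (5, 4.764319787410581),     # "john"
    (3.5, 3.3709010396413017),  # "smith" and "jon"
    (8, 7.5161819937120935),    # "j"
    (1, 1.0),                   # "smyth"
]

for bin_value, reported in samples:
    computed = biased_bin(bin_value)
    # Compare with a tolerance, since floating-point libraries can differ slightly.
    print(bin_value, computed, math.isclose(computed, reported, rel_tol=1e-9))
```

Because the default exponent is slightly below 1, the bias compresses large bins a little more than small ones, while a bin of 1 is unchanged.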

Response structure

All match responses contain the same sections. The details within each section vary with the type of object being matched (names, dates, addresses).

Left/right input information: The input information for each input along with the properties for each token in the input. Properties depend on the type of object being matched.

For example, the name matching example contains the following properties:

"data": "John Smith",
"normalizedData": "john smith",
"latnData": "john smith",
"script": "Latn",
"languageOfUse": "ENGLISH",
"languageOfOrigin": "ENGLISH"

While a date comparison would contain different properties:

"century": 20,
"month": 10,
"canonicalForm": "2024-10-01",
"yearWithoutCentury": 24,
"dayMonthSwapped": true,
"originalString": "10 January 2024",
"modifiedJulianDay": 60584,
"day": 1

Tuple scores: The score for every tuple, where a tuple is a token string from the left input and a token string from the right input. Every token in the left input is matched to every token in the right input, along with some token strings (multiple tokens combined together).
Score adjustments: The score adjustments list the parameters applied, and the score calculated with those parameters.
For example, the name example here contains the following parameters:
```
"unbiasedScore": 0.6829129823127231,
"score": 0.6919264820086959,
"parameter": "adjustOneSidedDeletionScores"

"unbiasedScore": 0.6919264820086959,
"score": 0.8435140063279181,
"parameter": "finalBias"
```
Meanwhile, a date comparison would contain different parameters. In this case, a different matching scheme, tryDayMonthSwap, is tried to see if a better result is returned.
```
"score": 0.95,
"unbiasedScore": 0.5926523220980572,
"parameter": "tryDayMonthSwap"

"score": 0.95,
"unbiasedScore": 0.95,
"parameter": "dateFinalBias"
```
Final score: The similarity score for the two names.

Example: matching names

Let's take a look at an example. In this example we're matching the following 2 names:

John Smith
Jon J. Smyth

The JSON output is broken down by section.

Example 6. Left input: John Smith

"leftInput": {
    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.41435888604672094,
        "tokenType": "GIVEN"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.585641113953279,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },

The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Even if the name was provided as Smith John, Smith would be identified as a SURNAME and John as a GIVEN name.

Example 7. Right input: Jon J. Smyth

"rightInput": {
    "data": "Jon J. Smyth",
    "normalizedData": "jon j. smyth",
    "latnData": "jon j. smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "jon",
        "latnToken": "jon",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.2083122782666673,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "j",
        "latnToken": "j",
        "bin": 8,
        "biasedBin": 7.5161819937120935,
        "tokenWeight": 0.08948764635417582,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.702200075379157,
        "tokenType": "UNKNOWN"
      }
    ],
    "entityType": "PERSON"
  },

The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Since both Jon and Smyth are unusual spellings, the tokenType is not identified.

Example 8. Score tuples

"scoreTuples": [
    {
      "scoreInIsolation": 0.7595918889283346,
      "scoreInContext": 0.7595918889283346,
      "left": "john",
      "right": "jon",
      "marked": true,  (1)
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.4912303477031893,
      "scoreInContext": 0.4666688303180298,
      "left": "john",
      "right": "jonj",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.542,
      "scoreInContext": 0.4743439389212776,
      "left": "john",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.2941408383164158,
      "scoreInContext": 0.279433796400595,
      "left": "johnsmith",
      "right": "jon",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.46557800000000005,
      "scoreInContext": 0.4422991,
      "left": "johnsmith",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,  (2)
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 2,
      "rightMaxTokenIndex": 2
    },
    {
      "scoreInIsolation": 0.27169000000000004,
      "scoreInContext": 0.27169000000000004,
      "left": "",
      "right": "j",
      "marked": true,  (3)
      "reason": "DELETION",
      "leftMinTokenIndex": -1,
      "leftMaxTokenIndex": -1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],

All tuples are compared. The tuples that are marked as true are the matches that are used to calculate the scores.

Matching tuples:

1	John : Jon
2	Smith : Smyth
3	J : deleted (no match)
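
To pull the matching tuples out of a response programmatically, you can filter on the marked field. The following is a minimal Python sketch using an abbreviated copy of the tuples from Example 8; the helper names `marked_pairs` and `deletions` are hypothetical.

```python
import json

# Abbreviated copy of the scoreTuples array from Example 8
# (callout numbers removed, some fields and tuples omitted).
score_tuples = json.loads("""
[
  {"left": "john", "right": "jon", "marked": true, "reason": "HMM_MATCH",
   "leftMinTokenIndex": 0, "rightMinTokenIndex": 0},
  {"left": "john", "right": "j", "marked": false, "reason": "INITIAL_MATCH",
   "leftMinTokenIndex": 0, "rightMinTokenIndex": 1},
  {"left": "smith", "right": "smyth", "marked": true, "reason": "HMM_MATCH",
   "leftMinTokenIndex": 1, "rightMinTokenIndex": 2},
  {"left": "", "right": "j", "marked": true, "reason": "DELETION",
   "leftMinTokenIndex": -1, "rightMinTokenIndex": 1}
]
""")


def marked_pairs(tuples):
    """Return (left, right, reason) for every tuple used in the score."""
    return [(t["left"], t["right"], t["reason"]) for t in tuples if t["marked"]]


def deletions(tuples):
    """Return marked tuples where one side has a negative token index."""
    return [
        t for t in tuples
        if t["marked"]
        and (t["leftMinTokenIndex"] < 0 or t["rightMinTokenIndex"] < 0)
    ]


for left, right, reason in marked_pairs(score_tuples):
    print(f"{left or '(none)'} : {right or '(none)'} [{reason}]")
```

The negative index check follows the definition of the token index fields in Table 9, where -1 or -2 marks a token that is absent from the name.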

Example 9. Score adjustments

"scoreAdjustments": [
    {
      "unbiasedScore": 0.6829129823127231,
      "score": 0.6919264820086959,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6919264820086959,
      "score": 0.8435140063279181,
      "parameter": "finalBias"
    }
  ],

The unbiased score is the score before the parameter is applied. The score is the value after the parameter is applied.
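
In this example, each adjustment starts from the score produced by the previous one, and the last score equals the final score. The following Python sketch checks that pattern on the sample values. It is an observation about this sample output, not a documented guarantee, and the function names are hypothetical.

```python
import math

# Values copied from Example 9 and from the finalScore field of the same response.
score_adjustments = [
    {"unbiasedScore": 0.6829129823127231, "score": 0.6919264820086959,
     "parameter": "adjustOneSidedDeletionScores"},
    {"unbiasedScore": 0.6919264820086959, "score": 0.8435140063279181,
     "parameter": "finalBias"},
]
final_score = 0.8435140063279181


def adjustments_are_chained(adjustments, final):
    """Check that each step starts at the previous score and ends at final."""
    if not adjustments:
        return False
    for previous, current in zip(adjustments, adjustments[1:]):
        if not math.isclose(previous["score"], current["unbiasedScore"]):
            return False
    return math.isclose(adjustments[-1]["score"], final)


def describe(adjustments):
    """Return (parameter, change) pairs showing how far each step moved the score."""
    return [(a["parameter"], a["score"] - a["unbiasedScore"]) for a in adjustments]


print(adjustments_are_chained(score_adjustments, final_score))
for parameter, change in describe(score_adjustments):
    print(f"{parameter}: {change:+.4f}")
```

Listing the change per parameter makes it easier to see which parameter has the largest effect when tuning.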

Example 10. Final score

"finalScore": 0.8435140063279181

The final calculated score with all parameters applied. This is the similarity score returned by RNI.

Response schemas by object

The following sections list the JSON schema for each object type.

Name response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftMinTokenIndex": { "type": "integer", "default": 0 },
          "leftMaxTokenIndex": { "type": "integer", "default": 0 },
          "rightMinTokenIndex": { "type": "integer", "default": 0 },
          "rightMaxTokenIndex": { "type": "integer", "default": 0 }
        },
        "required": ["left", "right", "reason"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}

Address response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "leftMinTokenIndex": { "type": "number", "default": 0 },
          "leftMaxTokenIndex": { "type": "number", "default": 0 },
          "rightMinTokenIndex": { "type": "number", "default": 0 },
          "rightMaxTokenIndex": { "type": "number", "default": 0 }
        },
        "required": ["left", "right", "reason", "leftField", "rightField"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" },
    "fieldScores": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "score": { "type": "number", "default": 0.0 },
          "marked": { "type": "boolean", "default": false }
        },
        "required": ["leftField", "rightField"]
      }
    }
  }
}

Date response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "weight": { "type": "number", "default": 0.0 },
          "component": { "type": "string" },
          "differenceInDays": { "type": "integer" }
        },
        "required": ["left", "right", "component"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}

Dynamic configuration endpoints

The plugin includes Elasticsearch REST APIs to customize and tune matching through stop words, token overrides, and parameter universes. These endpoints allow you to add and modify these configuration values, without having to restart Elasticsearch.

Tip

To use any of the configuration REST APIs, the parameter enableDynamicConfigurationEndpoints must be set to true in the parameter_profiles.yaml file in the any: profile. By default, this parameter is set to false. These endpoints should be used for testing and tuning only. When the dynamic configuration endpoints are enabled, they can slow the system down considerably.

The RNI-ES plugin relies on having an active primary or replica shard for each dynamic index available on every node. As a result, it is up to users to ensure each node has enough disk space to not exceed Elasticsearch watermark thresholds. Outside of this, users should never interact directly with the underlying dynamic configuration indices; all requests should go through the appropriate /rni_plugin endpoints.

Request timeouts

Whenever the RNI-ES plugin detects that one of its underlying dynamic indices has changed, it must fetch the entire index contents before the next name matching or indexing request. The timeout threshold for this fetch request is managed individually for each class of endpoints, and defaults to 60,000 ms. If this value is insufficient, users can configure it at plugin startup time with the bt.{override,stopword,parameter}.timeout Java system property.

Tip

To use dynamic configuration endpoints in an Elasticsearch deployment using SSL encryption, the RNI Elasticsearch plugin must be aware of the server's certificate file. To accomplish this, start Elasticsearch with:

ES_JAVA_OPTS="-Dbt.ssl.certificate=<path_to_certificate>"

Stop words

The _stopwords endpoint allows you to ADD, GET and DELETE stop words without restarting the Elasticsearch server. See Stop patterns and stop word prefixes for more detailed information on stop words.

The following properties are used when creating stop words. The entity_type is optional; all other fields are required when adding stop words through the API.

Table 10. Stop Word Properties

Property	Required	Description
`lang`	✓	ISO 639-3 code for the language of the stop word(s).
`stopword_type`	✓	Type of stop word(s), either `regexes` or `prefixes`
`entity_type`		Entity type for which to apply the stop word(s), defaults to `ALL`.
`stopwords`	✓	List of stop words to be added.

Note

Stop words are applied whenever a token is normalized, meaning stop words affect the name content that is included in the index. Therefore, changes to dynamic stop words do require data to be reindexed to take effect.

Create stop words

The POST _stopwords method adds one or more stop words. The entity_type field is optional, but the other fields are all required.

curl -XPOST "http://localhost:9200/rni_plugin/_stopwords" -H 'Content-Type: application/json' -d '{
    "lang": "eng",
    "stopword_type": "prefixes",
    "entity_type": "PERSON",
    "stopwords": [
        "honorable",
        "senior correspondent"
    ]
}'
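
The same request can be assembled in code. The following is a minimal Python sketch using only the standard library. The base URL is an assumption taken from the curl example, and the function name `build_stopwords_request` is hypothetical. The sketch builds the request without sending it.

```python
import json
import urllib.request

# Assumption: the plugin is reachable at the address used in the curl example.
BASE_URL = "http://localhost:9200/rni_plugin"


def build_stopwords_request(lang, stopword_type, stopwords, entity_type=None):
    """Build (but do not send) a POST _stopwords request."""
    if stopword_type not in ("regexes", "prefixes"):
        raise ValueError("stopword_type must be 'regexes' or 'prefixes'")
    if not stopwords:
        raise ValueError("at least one stop word is required")
    body = {
        "lang": lang,
        "stopword_type": stopword_type,
        "stopwords": list(stopwords),
    }
    if entity_type is not None:
        # Optional field; per Table 10 the default is ALL.
        body["entity_type"] = entity_type
    return urllib.request.Request(
        f"{BASE_URL}/_stopwords",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_stopwords_request(
    "eng", "prefixes", ["honorable", "senior correspondent"], entity_type="PERSON"
)
# To send the request: urllib.request.urlopen(request)
```

Validating the stop word type before sending avoids a round trip for a value the endpoint does not accept.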

Get stop words

The GET _stopwords method returns all stop words for a given language and stop word type. You can search by just language or by language and type.

When no entity type is specified, the stop word is applied to all names in the language, those with and without entity types. Therefore, calls that specify a type such as PERSON or ORGANIZATION will also return all stop words that don't have an entity type specified.

Returns all prefix stop words for PERSON types in English:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON"

Returns all regex stop words for ORGANIZATION types in Spanish:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/regexes_spa_ORGANIZATION"

Returns all prefix stop words in English with no type specified. For some languages, this list is empty by default. In these cases, data will only be returned if you've populated the file with values:

curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng"

Delete stop words

The DELETE _stopwords method deletes a specified stop word. Deleting a stop word from a specific profile will also delete it from the any profile.

curl -XDELETE "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON/doctor"
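
The GET and DELETE paths above follow a type_language_entity pattern, with the entity type left off when none applies. The following Python sketch builds those path segments; the function names are hypothetical, and the pattern is inferred from the curl examples in this section.

```python
def stopwords_profile(stopword_type, lang, entity_type=None):
    """Build the profile segment, for example prefixes_eng_PERSON."""
    parts = [stopword_type, lang]
    if entity_type:
        parts.append(entity_type)
    return "_".join(parts)


def stopword_delete_path(stopword_type, lang, entity_type, stopword):
    """Build the path used to delete a single stop word."""
    profile = stopwords_profile(stopword_type, lang, entity_type)
    return f"/rni_plugin/_stopwords/{profile}/{stopword}"


print(stopwords_profile("prefixes", "eng", "PERSON"))
print(stopwords_profile("prefixes", "eng"))
print(stopword_delete_path("prefixes", "eng", "PERSON", "doctor"))
```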

Token overrides

The _overrides endpoint allows you to ADD, GET and DELETE token pair overrides without restarting the Elasticsearch server. See Overriding token pair matches for more detailed information on token pair overrides.

The following properties are used when creating token overrides.

Table 11. Token Overrides Properties

Property	Required	Description
`lang1`	✓	ISO 639-3 code for the language of the first name in the override pair.
`lang2`	✓	ISO 639-3 code for the language of the second name in the override pair.
`entity_type`		Entity type of the list of token override pairs, defaults to "ALL".
`selector`		An alphanumeric string which specifies the selector value to apply for these overrides. NOTE: This property is only available in RNI-ES 8.6.2.0 and later.
`token_pairs`	✓	List of token override pairs to be added.
`token1`	✓	Token of the first name in the override pair; it should be in `lang1`.
`token2`	✓	Token of the second name in the override pair; it should be in `lang2`.
`type`		The specific override type for the token pair. If omitted, the `nickname` type is used.
`score`		Raw score of the token pair between 0.0 and 1.0. If omitted, the value from the `nicknameOverrideScore` parameter is used.
`force`		Indicates whether to force the score to be exactly the given value for the token pair; defaults to `false`.

Note

RNI is designed so that override information is not included with indexed names. Therefore, changes to dynamic overrides do not require data to be reindexed to take effect.

Create override index

The override index must exist before you can start adding token overrides. To create the index:

curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_create"

Refresh override index

To force a refresh of the dynamic override index:

curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_refresh"

Create token overrides

The POST _overrides method adds one or more token overrides. As shown in the table above, entity_type, selector, type, score, and force are optional, but the other fields are required.

curl -XPOST "http://localhost:9200/rni_plugin/_overrides" -H 'Content-Type: application/json' -d'{
    "lang1": "eng",
    "lang2": "eng",
    "entity_type": "PERSON",
    "token_pairs": [
        {
            "token1": "Abigail",
            "token2": "Abbey",
            "score": 0.74,
            "force": true
        },
        {
            "token1": "Aleksander",
            "token2": "Alex",
            "score": 0.74
        },
        {
            "token1": "Alfonso",
            "token2": "Alphonse",
            "type": "COGNATE"
        },
        {
            "token1": "Frederica",
            "token2": "Federica"
        }
    ]
}'
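
When override lists are generated from other data, building the request body in code helps keep the JSON well-formed and leaves out optional fields that are not set. The following is a minimal Python sketch; the helper names `token_pair` and `overrides_body` are hypothetical, and the field names come from Table 11.

```python
import json


def token_pair(token1, token2, score=None, pair_type=None, force=None):
    """Build one token_pairs entry, omitting optional fields that are not set."""
    if score is not None and not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    pair = {"token1": token1, "token2": token2}
    if pair_type is not None:
        pair["type"] = pair_type
    if score is not None:
        pair["score"] = score
    if force is not None:
        pair["force"] = force
    return pair


def overrides_body(lang1, lang2, pairs, entity_type=None):
    """Build the JSON body for POST _overrides."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one token pair is required")
    body = {"lang1": lang1, "lang2": lang2, "token_pairs": pairs}
    if entity_type is not None:
        body["entity_type"] = entity_type
    return body


body = overrides_body(
    "eng",
    "eng",
    [
        token_pair("Abigail", "Abbey", score=0.74, force=True),
        token_pair("Alfonso", "Alphonse", pair_type="COGNATE"),
        token_pair("Frederica", "Federica"),
    ],
    entity_type="PERSON",
)
payload = json.dumps(body, indent=2)
print(payload)
```

The resulting string can be passed as the request body to the _overrides endpoint.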

Get token overrides

The GET _overrides method returns the overrides of a given language profile.

curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON"

You can also retrieve the score of a given override pair.

curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON?token1=abigel&token2=abigail"
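
The lookup URLs above can be built with the standard library so that the query string is encoded consistently. The following is a minimal Python sketch; the base URL is an assumption taken from the curl examples, and the function name `overrides_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the plugin is reachable at the address used in the curl examples.
BASE_URL = "http://localhost:9200/rni_plugin"


def overrides_url(lang1, lang2, entity_type, token1=None, token2=None):
    """Build the GET _overrides URL for a profile, optionally for one pair."""
    url = f"{BASE_URL}/_overrides/{lang1}_{lang2}_{entity_type}"
    if token1 is not None and token2 is not None:
        # urlencode percent-encodes characters that are not URL-safe.
        url += "?" + urlencode({"token1": token1, "token2": token2})
    return url


print(overrides_url("hun", "eng", "PERSON"))
print(overrides_url("hun", "eng", "PERSON", token1="abigel", token2="abigail"))
```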

Delete token overrides

The DELETE _overrides method deletes a given override pair. Deleting an override from a specific profile will also delete it from the any profile.

curl -XDELETE "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON/abigel+abigail"

Parameters

The _parameter_universe endpoint allows you to ADD, GET and DELETE parameters through parameter universes, without restarting the Elasticsearch server. See Parameter universe for more information on tuning parameters with parameter universes.

Note

While some parameters can impact the data that is included in the index, these parameters cannot be dynamically specified. Therefore, changes to dynamic parameters do not require data to be reindexed to take effect.

Add parameter(s)

The POST _parameter_universe method creates a parameter universe and the parameter profiles within the universe. Use this method to add or update a parameter value in a parameter universe. If the parameter universe already exists, the new values override the existing ones. Profile names use the following syntax:

SomeParameterUniverseName/xxx_yyy where xxx_yyy is the language profile the parameters belong to, expressed in ISO 639-3 codes. The parameters field expects a list of parameters for the given profile, where the naming of the parameters should match the ones declared in parameter_defs.yaml.

curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'
  {
    "profiles": [
      {
        "name": "SomeParameterUniverseName/any",
        "parameters": {
          "translatorResultsToKeep": 4,
          "deletionScore": 0.269,
          "doQueryTokenOverrides": true,
          "fieldDeletionScore": 0.27,
          "yearDistanceWeight": 0.2
        }
      },
      {
        "name": "SomeParameterUniverseName/eng_eng",
        "parameters": {
          "HMMUsageThreshold": 0.8,
          "stringDistanceThreshold": 0.1,
          "useEditDistanceTokenScorer": true,
          "finalBias": 2.4,
          "reorderPenalty": 0.2
        }
      }
    ]
  }'
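
The profile naming rule can be applied in code when a universe contains several profiles. The following is a minimal Python sketch that builds the request body from a mapping of language profile to parameters; the function name `universe_body` is hypothetical, and the parameter values are copied from the example above.

```python
import json


def universe_body(universe, profiles):
    """Build the POST _parameter_universe body.

    profiles maps a language profile such as "any" or "eng_eng"
    to a dictionary of parameter names and values.
    """
    return {
        "profiles": [
            {"name": f"{universe}/{profile}", "parameters": dict(parameters)}
            for profile, parameters in profiles.items()
        ]
    }


body = universe_body(
    "SomeParameterUniverseName",
    {
        "any": {"deletionScore": 0.269, "doQueryTokenOverrides": True},
        "eng_eng": {"finalBias": 2.4, "reorderPenalty": 0.2},
    },
)
print(json.dumps(body, indent=2))
```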

Get parameter(s)

The GET _parameter_universe method retrieves parameter universes.

To retrieve a given parameter universe, the name of the parameter universe is provided as a path parameter:

curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"

If you include the name of the profile and a parameter, it returns the value of the parameter:

curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"

Delete parameter(s)

The DELETE _parameter_universe method deletes parameter universes.

To delete a specific parameter universe:

curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"

To delete a parameter for a specific profile within a parameter universe:

curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"
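
The GET and DELETE examples address either a whole universe or a single parameter, written as the universe name, a slash, the profile, a period, and the parameter name. The following Python sketch builds that path segment; the function name `parameter_path` is hypothetical, and it only covers the two forms shown in this section.

```python
def parameter_path(universe, profile=None, parameter=None):
    """Build the path segment that follows /_parameter_universe/."""
    if (profile is None) != (parameter is None):
        # Only the two forms shown in this guide are supported by this sketch.
        raise ValueError("give both profile and parameter, or neither")
    if profile is None:
        return universe
    return f"{universe}/{profile}.{parameter}"


print(parameter_path("SomeParameterUniverseName"))
print(parameter_path("SomeParameterUniverseName", "eng_eng", "reorderPenalty"))
```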

Note

Deleting a parameter from a specific parameter profile will also delete it from the any profile. The default value of the parameter from the parameter_defs.yaml file will then be used.

Pairwise match endpoint

You can perform a pairwise match between two rni_names, rni_dates, rni_addresses, or other datatypes through the POST _pair_match method. The results provide insight into how the match scores were calculated, including tokens and token scores. This endpoint can help you understand the impact a specific match parameter has on the final score, and can aid in testing and debugging RNI.

The type of pairwise match being performed is provided to the query, along with the values being compared (data1 and data2). You can also specify one or more parameters and see how they impact the match scores.

You may use the optional responseFormat URL parameter to control the format of the response. The default value is explainInfo, which produces the output format of plugin versions using SDK 7.43.0.c71.0 and later. Setting this parameter to legacyExplainInfo will produce the output of previous plugin versions.
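
A pairwise match request with these options can be assembled as follows. This is a minimal Python sketch using only the standard library. The base URL is an assumption, the function name `build_pair_match_request` is hypothetical, and the body field names (dataPair, data1, data2, parameters) are the ones used in the request examples in this section. The request is built but not sent.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: the plugin is reachable at this address.
BASE_URL = "http://localhost:9200/rni_plugin"


def build_pair_match_request(match_type, data1, data2,
                             parameters=None, response_format=None):
    """Build (but do not send) a POST _pair_match request."""
    query = {"type": match_type}
    if response_format is not None:
        query["responseFormat"] = response_format
    body = {"dataPair": {"data1": data1, "data2": data2}}
    if parameters:
        body["parameters"] = parameters
    return urllib.request.Request(
        f"{BASE_URL}/_pair_match?{urlencode(query)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_pair_match_request(
    "rni_date",
    "12/25/19",
    "1/15/20",
    parameters={"timeDistanceWeight": ".8", "stringDistanceWeight": "0"},
    response_format="legacyExplainInfo",
)
print(request.full_url)
# To send the request: urllib.request.urlopen(request)
```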

Tip

We strongly recommend sending in complete strings and allowing RNI to perform tokenization. RNI includes weighting and other calculations which operate on the full string, enhancing the token matching scoring algorithms to improve match scores.

Request

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d'
{"dataPair": {"data1": "12/25/19","data2": "1/15/20"},
 "parameters": {
     "timeDistanceWeight": ".8",
     "stringDistanceWeight": "0"}}'

Response

{
  "leftInput": {
    "originalString": "12/25/19",
    "day": 25,
    "month": 12,
    "yearWithoutCentury": 19,
    "century": -1000,
    "modifiedJulianDay": -671643,
    "canonicalForm": "YY19-12-25"
  },
  "rightInput": {
    "originalString": "1/15/20",
    "day": 15,
    "month": 1,
    "yearWithoutCentury": 20,
    "century": -1000,
    "modifiedJulianDay": -671622,
    "canonicalForm": "YY20-01-15"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 0.9330329915368074,
      "scoreInContext": 0.9330329915368074,
      "left": "YY19",
      "right": "YY20",
      "marked": true,
      "weight": 0.2,
      "component": "YEAR_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6830201283771977,
      "scoreInContext": 0.6830201283771977,
      "left": "12",
      "right": "01",
      "marked": true,
      "weight": 0.2,
      "component": "MONTH_DISTANCE"
    },
    {
      "scoreInIsolation": 0.7071067811865476,
      "scoreInContext": 0.7071067811865476,
      "left": "25",
      "right": "15",
      "marked": true,
      "weight": 0.1,
      "component": "DAY_DISTANCE"
    },
    {
      "scoreInIsolation": 0.375,
      "scoreInContext": 0.375,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0,
      "component": "STRING_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6949591099211685,
      "scoreInContext": 0.6949591099211685,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0.8,
      "component": "TIME_PROXIMITY"
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.730683530798762,
      "score": 0.730683530798762,
      "parameter": "dateFinalBias"
    }
  ],
  "finalScore": 0.730683530798762
}
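
In this response, the unbiased score is numerically consistent with a weight-normalized average of the marked tuple scores. The following Python sketch reproduces the value from the numbers above. This is an observation about this sample output, not a documented formula, and the function name `weighted_average` is hypothetical.

```python
import math

# (scoreInContext, weight) for each marked tuple in the response above.
tuples = [
    (0.9330329915368074, 0.2),  # YEAR_DISTANCE
    (0.6830201283771977, 0.2),  # MONTH_DISTANCE
    (0.7071067811865476, 0.1),  # DAY_DISTANCE
    (0.375, 0.0),               # STRING_DISTANCE (weight set to 0 in the request)
    (0.6949591099211685, 0.8),  # TIME_PROXIMITY (weight set to .8 in the request)
]


def weighted_average(pairs):
    """Sum score * weight and divide by the total weight."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("total weight must be greater than zero")
    return sum(score * weight for score, weight in pairs) / total_weight


combined = weighted_average(tuples)
reported = 0.730683530798762  # unbiasedScore in the response above
print(combined, math.isclose(combined, reported, rel_tol=1e-9))
```

Setting stringDistanceWeight to 0 removes the string distance tuple from the average, which is why its low score of 0.375 does not pull the result down.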

Supported types

The following data types are supported by the pairwise match endpoint.

rni_name
rni_date
rni_address
date
keyword
text
string
integer
long
short
double
float
boolean
geo_point

Request

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=text" -H 'Content-Type: application/json' -d'
{
   "dataPair":
   {
    "data1": "word1",
    "data2": "word2"
  }
}'

Response

{
  "score" : 0.8333333333333334
}

Name matching example

Request

Parameters are specified directly in the request. The source language (language) of the name is optional, but recommended if known.

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_name" -H 'Content-Type: application/json' -d'
{
  "dataPair":   
   {    
     "data1":     
     {
      "data": "John Robert Edward Smith",
      "language": "eng",
      "entityType": "PERSON"

    },
    "data2":
    {
      "data": "John Smyth",
      "language": "eng",
      "entityType": "PERSON"
    }
  },
  "parameters": {
    "deletionScore": 0.469
  }
}'

Response

The response includes detailed information on how the names were matched.

{
  "leftInput": {
    "data": "John Robert Edward Smith",
    "normalizedData": "john robert edward smith",
    "latnData": "john robert edward smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.20817481793666764,
        "tokenType": "GIVEN"
      },
      {
        "token": "robert",
        "latnToken": "robert",
        "bin": 4,
        "biasedBin": 3.8370564773010574,
        "tokenWeight": 0.24879889758995669,
        "tokenType": "MIDDLE"
      },
      {
        "token": "edward",
        "latnToken": "edward",
        "bin": 4,
        "biasedBin": 3.8370564773010574,
        "tokenWeight": 0.24879889758995669,
        "tokenType": "MIDDLE"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.294227386883419,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  "rightInput": {
    "data": "John Smyth",
    "normalizedData": "john smyth",
    "latnData": "john smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.17348100675885905,
        "tokenType": "GIVEN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.8265189932411409,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 1,
      "scoreInContext": 1,
      "left": "john",
      "right": "john",
      "marked": true,
      "reason": "MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.469,
      "scoreInContext": 0.469,
      "left": "robertedward",
      "right": "",
      "marked": true,
      "reason": "DELETION",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 2,
      "rightMinTokenIndex": -1,
      "rightMaxTokenIndex": -1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 3,
      "leftMaxTokenIndex": 3,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.6686154265898526,
      "score": 0.6972609000299432,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6972609000299432,
      "score": 0.8468047657291401,
      "parameter": "finalBias"
    }
  ],
  "finalScore": 0.8468047657291401
}
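
A detailed response of this shape can be condensed into a short trace of how the score was built. The sketch below is illustrative only: it works on a trimmed copy of the response above (just the fields it reads), and the summarize function is a hypothetical helper, not part of the plugin.

```python
import json

# Trimmed copy of the response above: only the fields this sketch reads.
response = json.loads("""
{
  "scoreTuples": [
    {"left": "john", "right": "john", "reason": "MATCH", "scoreInContext": 1},
    {"left": "robertedward", "right": "", "reason": "DELETION", "scoreInContext": 0.469},
    {"left": "smith", "right": "smyth", "reason": "HMM_MATCH", "scoreInContext": 0.7237045473947534}
  ],
  "scoreAdjustments": [
    {"unbiasedScore": 0.6686154265898526, "score": 0.6972609000299432,
     "parameter": "adjustOneSidedDeletionScores"},
    {"unbiasedScore": 0.6972609000299432, "score": 0.8468047657291401,
     "parameter": "finalBias"}
  ],
  "finalScore": 0.8468047657291401
}
""")


def summarize(resp):
    """Return one line per token alignment, per adjustment, and for the final score."""
    lines = []
    for t in resp["scoreTuples"]:
        # An empty "right" string means the left span had no counterpart.
        right = t["right"] or "(unmatched)"
        lines.append(f'{t["left"]} ~ {right}: {t["reason"]} ({t["scoreInContext"]:.3f})')
    for adj in resp["scoreAdjustments"]:
        lines.append(f'{adj["parameter"]}: {adj["unbiasedScore"]:.3f} -> {adj["score"]:.3f}')
    lines.append(f'finalScore: {resp["finalScore"]:.3f}')
    return lines


for line in summarize(response):
    print(line)
```

In this example, the DELETION tuple carries the deletionScore value passed in the request (0.469), each adjustment starts from the score produced by the previous one, and the last adjusted score equals finalScore.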

Address matching example

Request

The pairwise match endpoint supports both fielded and unfielded addresses. Fielded addresses must be specified as objects, while unfielded addresses must be specified as strings.

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_address" -H 'Content-Type: application/json' -d'
{
  "dataPair":
   {
    "data1":
     {
      "houseNumber": "101",
      "road": "Main st",
      "city": "Cambridge",
      "state": "Massachusetts",
      "country": "United States of America"
    },
    "data2": "101 Main St, Cambridge, MA, USA"
  }
}'
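
When request bodies are built programmatically, the fielded/unfielded distinction maps onto the JSON type of each side of the pair. The following Python sketch shows one way to enforce that before sending; address_value and build_address_pair are hypothetical helper names, and the field names are the ones used in the request above.

```python
import json


def address_value(address):
    """Validate one side of an rni_address pair (hypothetical helper)."""
    if isinstance(address, str):
        return address        # unfielded address: sent as a JSON string
    if isinstance(address, dict):
        return dict(address)  # fielded address: sent as a JSON object
    raise TypeError("address must be a string (unfielded) or an object (fielded)")


def build_address_pair(data1, data2):
    """Build the request body; the two sides may mix fielded and unfielded forms."""
    return {"dataPair": {"data1": address_value(data1), "data2": address_value(data2)}}


fielded = {
    "houseNumber": "101",
    "road": "Main st",
    "city": "Cambridge",
    "state": "Massachusetts",
    "country": "United States of America",
}
body = build_address_pair(fielded, "101 Main St, Cambridge, MA, USA")
print(json.dumps(body, indent=2))
```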

Response

The response includes a score for the similarity of the two addresses, as well as a type field describing the type of match observed. It also includes detailed information on how each field was matched. The example below is truncated: it shows only part of the detailed response for HOUSE_NUMBER.

{
  "leftInput": {
    "fieldInputInfos": [
      {
        "data": "United States of America",
        "latnData": "United States of America",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "united",
            "latnToken": "united",
            "tokenWeight": 0.25
          },
          {
            "token": "states",
            "latnToken": "states",
            "tokenWeight": 0.25
          },
          {
            "token": "of",
            "latnToken": "of",
            "tokenWeight": 0.25
          },
          {
            "token": "america",
            "latnToken": "america",
            "tokenWeight": 0.25
          }
        ],
        "addressField": "COUNTRY",
        "normalizedData": "united states of america"
      },
      {
        "data": "Massachusetts",
        "latnData": "Massachusetts",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "massachusetts",
            "latnToken": "massachusetts",
            "tokenWeight": 1
          }
        ],
        "addressField": "STATE",
        "normalizedData": "massachusetts"
      },
      {
        "data": "Cambridge",
        "latnData": "Cambridge",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "cambridge",
            "latnToken": "cambridge",
            "tokenWeight": 1
          }
        ],
        "addressField": "CITY",
        "normalizedData": "cambridge"
      },
      {
        "data": "Main st",
        "latnData": "Main st",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "main",
            "latnToken": "main",
            "tokenWeight": 0.5
          },
          {
            "token": "st",
            "latnToken": "st",
            "tokenWeight": 0.5
          }
        ],
        "addressField": "ROAD",
        "normalizedData": "main st"
      },
      {
        "data": "101",
        "languageOfUse": "UNKNOWN",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "101",
            "latnToken": "101",
            "tokenWeight": 1
          }
        ],
        "addressField": "HOUSE_NUMBER",
        "normalizedData": "101"
      }
    ]
  },
  "rightInput": {
    "fieldInputInfos": [
      {
        "data": "usa",
        "latnData": "usa",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "usa",
            "latnToken": "usa",
            "tokenWeight": 1
          }
        ],
        "addressField": "COUNTRY",
        "normalizedData": "usa"
      },
      {
        "data": "ma",
        "latnData": "ma",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "ma",
            "latnToken": "ma",
            "tokenWeight": 1
          }
        ],
        "addressField": "STATE",
        "normalizedData": "ma"
      },
      {
        "data": "cambridge",
        "latnData": "cambridge",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "cambridge",
            "latnToken": "cambridge",
            "tokenWeight": 1
          }
        ],
        "addressField": "CITY",
        "normalizedData": "cambridge"
      },
      {
        "data": "main st",
        "latnData": "main st",
        "script": "Latn",
        "languageOfUse": "ENGLISH",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "main",
            "latnToken": "main",
            "tokenWeight": 0.5
          },
          {
            "token": "st",
            "latnToken": "st",
            "tokenWeight": 0.5
          }
        ],
        "addressField": "ROAD",
        "normalizedData": "main st"
      },
      {
        "data": "101",
        "languageOfUse": "UNKNOWN",
        "languageOfOrigin": "UNKNOWN",
        "tokens": [
          {
            "token": "101",
            "latnToken": "101",
            "tokenWeight": 1
          }
        ],
        "addressField": "HOUSE_NUMBER",
        "normalizedData": "101"
      }
    ]
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 1,
      "scoreInContext": 1,
      "left": "101",
      "right": "101",
      "marked": true,
      "reason": "MATCH",
      "leftField": "HOUSE_NUMBER",
      "rightField": "HOUSE_NUMBER",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.3,
      "scoreInContext": 0.3,
      "left": "101",
      "right": "",
      "marked": false,
      "reason": "DELETION",
      "leftField": "HOUSE_NUMBER",
      "rightField": "HOUSE_NUMBER",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": -1,
      "rightMaxTokenIndex": -2
    },
...
  }
}

Date matching example

Request

curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d'
{
  "dataPair": {
    "data1": "12/25/19",
    "data2": "1/15/20"
  },
  "parameters": {
    "timeDistanceWeight": ".8",
    "stringDistanceWeight": "0"
  }
}'

Response

The response includes detailed information on how the dates were matched.

{
  "leftInput": {
    "originalString": "12/25/19",
    "day": 25,
    "month": 12,
    "yearWithoutCentury": 19,
    "century": -1000,
    "modifiedJulianDay": -671643,
    "canonicalForm": "YY19-12-25"
  },
  "rightInput": {
    "originalString": "1/15/20",
    "day": 15,
    "month": 1,
    "yearWithoutCentury": 20,
    "century": -1000,
    "modifiedJulianDay": -671622,
    "canonicalForm": "YY20-01-15"
  },
  "scoreTuples": [
    {
      "scoreInIsolation": 0.6949591099211685,
      "scoreInContext": 0.6949591099211685,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0.8,
      "component": "TIME_DISTANCE",
      "differenceInDays": 21
    },
    {
      "scoreInIsolation": 0.9330329915368074,
      "scoreInContext": 0.9330329915368074,
      "left": "YY19",
      "right": "YY20",
      "marked": true,
      "weight": 0.2,
      "component": "YEAR_DISTANCE"
    },
    {
      "scoreInIsolation": 0.6830201283771977,
      "scoreInContext": 0.6830201283771977,
      "left": "12",
      "right": "01",
      "marked": true,
      "weight": 0.2,
      "component": "MONTH_DISTANCE"
    },
    {
      "scoreInIsolation": 0.7071067811865476,
      "scoreInContext": 0.7071067811865476,
      "left": "25",
      "right": "15",
      "marked": true,
      "weight": 0.1,
      "component": "DAY_DISTANCE"
    },
    {
      "scoreInIsolation": 0.375,
      "scoreInContext": 0.375,
      "left": "YY19-12-25",
      "right": "YY20-01-15",
      "marked": true,
      "weight": 0,
      "component": "STRING_DISTANCE"
    }
  ],
  "scoreAdjustments": [
    {
      "unbiasedScore": 0.730683530798762,
      "score": 0.730683530798762,
      "parameter": "dateFinalBias"
    }
  ],
  "finalScore": 0.730683530798762
}
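
The numbers in this response are consistent with the unbiased score being a weight-normalized average of the scoreInContext values of the score tuples. This is an observation about the sample values only, not a statement of how RNI computes date scores internally. The Python sketch below reproduces the arithmetic; the weighted_average function is a hypothetical helper.

```python
# scoreInContext and weight for each tuple in the response above.
score_tuples = [
    {"component": "TIME_DISTANCE", "scoreInContext": 0.6949591099211685, "weight": 0.8},
    {"component": "YEAR_DISTANCE", "scoreInContext": 0.9330329915368074, "weight": 0.2},
    {"component": "MONTH_DISTANCE", "scoreInContext": 0.6830201283771977, "weight": 0.2},
    {"component": "DAY_DISTANCE", "scoreInContext": 0.7071067811865476, "weight": 0.1},
    {"component": "STRING_DISTANCE", "scoreInContext": 0.375, "weight": 0},
]


def weighted_average(tuples):
    """Sum of score * weight, divided by the sum of the weights."""
    total_weight = sum(t["weight"] for t in tuples)
    weighted_sum = sum(t["scoreInContext"] * t["weight"] for t in tuples)
    return weighted_sum / total_weight


combined = weighted_average(score_tuples)
print(round(combined, 6))  # agrees with unbiasedScore in the response to six decimal places
```

Because stringDistanceWeight was set to 0 in the request, the STRING_DISTANCE component (0.375) contributes nothing to the result, while TIME_DISTANCE, weighted 0.8, dominates it.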

Fully supported text domains for name matching

The following tables describe the domain pairings for which RNI provides full support. All other domain pairings have limited support, as described in Language support parameters. A domain refers to the language and script of a piece of text. For example, one domain might be Latin (Latn) script in the English (eng) language.

Note

"Language" in this appendix refers to the language of use, the language of the document in which the name is found, which may not be the language of origin associated with the name. If the language of use is undetermined, use unknown (xxx).

Note

Prior to release 7.36.0, RNI did not support any limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Name matching within a language

The following table lists the languages that Rosette Name Indexer fully supports and, for each language, the supported writing scripts.

Language (ISO 639-3)	Scripts (ISO 15924)	Real World Id Dictionary
Arabic (ara)	Arabic (Arab)	✓
Burmese (mya)	Burmese (Mymr)	✓
Chinese (zho)^[a]	Han (Hanzi) (Hani), Han (Simplified variant) (Hans), Han (Traditional variant) (Hant)	✓
English (eng)	Latin (Latn)	✓
French (fra)	Latin (Latn)	✓
German (deu)	Latin (Latn)	✓
Greek (ell)	Greek (Grek)	✓
Hebrew (heb)	Hebrew (Hebr)	✓
Hungarian (hun)	Latin (Latn)	✓
Italian (ita)	Latin (Latn)	✓
Japanese (jpn)	Han (Kanji) (Hani), Hiragana (Hira), Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt), Japanese (alias for Han + Hiragana + Katakana) (Jpan), Katakana (Kana)	✓
Khmer (khm)	Khmer (Khmr)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang), Han (Hanja) (Hani), Korean (alias for Hangul + Han) (Kore)	✓
Malay (zsm)	Latin (Latn)
Pashto (pus)	Arabic (Arab)
Persian (fas) ^[b]	Arabic (Arab)
Persian, Afghan (prs)	Arabic (Arab)
Persian, Iranian (pes)	Arabic (Arab)
Portuguese (por)	Latin (Latn)	✓
Russian (rus)	Cyrillic (Cyrl)	✓
Spanish (spa)	Latin (Latn)	✓
Thai (tha)	Thai (Thai)	✓
Turkish (tur)	Latin (Latn)
Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	Latin (Latn)	✓
^[a]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[b]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").

Cross-language matches

This table identifies the range of cross-language searching and matching that Rosette Name Indexer and name matching fully support. If your query is a name in an Arabic document in Arabic script, the query may return one or more names in English documents in Latin script, in addition to names from Arabic documents in Arabic script. If the query is a name in English and Latin script, it may return documents from any of the supported languages and their native scripts.

Note

For the supported scripts for each language, see the table in Name matching within a language.

Query Domain	Index Domain / Match Domain
Language (ISO 639-3)	Language (ISO 639-3)	Scripts (ISO 15924)
Arabic (ara)	Arabic (ara)	Arabic (Arab)
Arabic (ara)	English (eng)	Latin (Latn)
Burmese (mya)	Burmese (mya)	Burmese (Mymr)
Burmese (mya)	English (eng)	Latin (Latn)
Chinese (zho)^[a]	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
English (eng)	Arabic (ara)	Arabic (Arab)
	Burmese (mya)	Burmese (Mymr)
	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	French (fra)	Latin (Latn)
	German (deu)	Latin (Latn)
	Greek (ell)	Greek (Grek)
	Hebrew (heb)	Hebrew (Hebr)
	Hungarian (hun)	Latin (Latn)
	Italian (ita)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Khmer (khm)	Khmer (Khmr)
	Korean (kor)	(Hani), (Hang), (Kore)
	Malay (zsm)	Latin (Latn)
	Pashto (pus)	Arabic (Arab)
	Persian (fas)	Arabic (Arab)
	Persian, Afghan (prs)	Arabic (Arab)
	Persian, Iranian (pes)	Arabic (Arab)
	Portuguese (por)	Latin (Latn)
	Russian (rus)	Cyrillic (Cyrl)
	Spanish (spa)	Latin (Latn)
	Thai (tha)	Thai (Thai)
	Urdu (urd)	Arabic (Arab)
	Turkish (tur)	Latin (Latn)
	Vietnamese (vie)	Latin (Latn)
French (fra)	English (eng)	Latin (Latn)
French (fra)	French (fra)	Latin (Latn)
German (deu)	English (eng)	Latin (Latn)
German (deu)	German (deu)	Latin (Latn)
Greek (ell)	English (eng)	Latin (Latn)
Greek (ell)	Greek (ell)	Greek (Grek)
Hebrew (heb)	English (eng)	Latin (Latn)
Hebrew (heb)	Hebrew (heb)	Hebrew (Hebr)
Hungarian (hun)	English (eng)	Latin (Latn)
Hungarian (hun)	Hungarian (hun)	Latin (Latn)
Italian (ita)	English (eng)	Latin (Latn)
Italian (ita)	Italian (ita)	Latin (Latn)
Japanese (jpn)	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Khmer (khm)	English (eng)	Latin (Latn)
Khmer (khm)	Khmer (khm)	Khmer (Khmr)
Korean (kor)	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Malay (zsm)	English (eng)	Latin (Latn)
Malay (zsm)	Malay (zsm)	Latin (Latn)
Pashto (pus)	English (eng)	Latin (Latn)
Pashto (pus)	Pashto (pus)	Arabic (Arab)
Persian^[b] (fas)	English (eng)	Latin (Latn)
Persian^[b] (fas)	Persian (fas)	Arabic (Arab)
Persian, Afghan (prs)	Persian, Afghan (prs)	Arabic (Arab)
Persian, Afghan (prs)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	Persian, Iranian (pes)	Arabic (Arab)
Portuguese (por)	English (eng)	Latin (Latn)
Portuguese (por)	Portuguese (por)	Latin (Latn)
Russian (rus)	English (eng)	Latin (Latn)
Russian (rus)	Russian (rus)	Cyrillic (Cyrl)
Spanish (spa)	English (eng)	Latin (Latn)
Spanish (spa)	Spanish (spa)	Latin (Latn)
Thai (tha)	English (eng)	Latin (Latn)
Thai (tha)	Thai (tha)	Thai (Thai)
Turkish (tur)	English (eng)	Latin (Latn)
Turkish (tur)	Turkish (tur)	Latin (Latn)
Urdu (urd)	English (eng)	Latin (Latn)
Urdu (urd)	Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	English (eng)	Latin (Latn)
Vietnamese (vie)	Vietnamese (vie)	Latin (Latn)
^[a]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[b]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").
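
The pairings in the table follow a compact pattern: every listed language pairs with itself and with English, English pairs with every listed language, and Chinese, Japanese, and Korean additionally pair with one another. The Python sketch below encodes that reading of the table as a lookup. It is a condensed illustration only: it checks language codes and ignores scripts, and the function name is hypothetical.

```python
# Language codes (ISO 639-3) that appear in the cross-language table above.
FULLY_SUPPORTED = {
    "ara", "mya", "zho", "eng", "fra", "deu", "ell", "heb", "hun", "ita",
    "jpn", "khm", "kor", "zsm", "pus", "fas", "prs", "pes", "por", "rus",
    "spa", "tha", "tur", "urd", "vie",
}

# Chinese, Japanese, and Korean are also paired with each other in the table.
CJK = {"zho", "jpn", "kor"}


def is_fully_supported_pair(query_lang, index_lang):
    """Return True if the table lists this query/index language pairing."""
    if query_lang not in FULLY_SUPPORTED or index_lang not in FULLY_SUPPORTED:
        return False
    if query_lang == index_lang:
        return True
    if "eng" in (query_lang, index_lang):
        return True
    return query_lang in CJK and index_lang in CJK


print(is_fully_supported_pair("ara", "eng"))  # Arabic query against English index
print(is_fully_supported_pair("ara", "fra"))  # not listed in the table
```

Pairings for which this returns False fall under limited support, as described in Language support parameters.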

Appendix

Match phenomena

Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.

Name	Description	Example
CONFLICT	The tokens do not match.	When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT.
DELETION	The token is unmatched.	When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION.
EMBEDDING_MATCH	The tokens are semantically similar as determined by word-embedding vectors.	When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH.
FIELD_BLOCKED	This field cannot be matched because of a cross-field match involving the same field in the other name.	When comparing "Bob|William|Smith" with "William||Smith", "bob" is a FIELD_BLOCKED since the cross-field william match prevents it from matching with its corresponding field.
FIELD_CONFLICT	When comparing two names that are divided into fields, these fields do not match.	When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT.
FIELD_DELETION	When comparing two names that are divided into fields, this field is unmatched.	When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION.
GIVEN_NAME_DELETION	When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched.	When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a `Given_name` field.
HANI_ABBREVIATION	One Hani token appears to be an abbreviation of another Hani token.	"北京大学" and "北大" are a HANI_ABBREVIATION match.
HMM_MATCH	The tokens are similar but not identical, and the match was determined by a particular model (hidden Markov model). This is a type of fuzzy match.	"richard" and "richerd" are an HMM_MATCH.
INITIALISM	One token is a name and the other token is the initials of the words which make up the name.	"john fitzgerald kennedy" and "JFK" are an INITIALISM. "consumer value stores" and "CVS" are an INITIALISM.
INITIAL_MATCH	One token is the first initial of the other.	"w" and "william" are an INITIAL_MATCH.
LANGUAGE_SPECIFIC_MATCH	The match was determined by a language-specific matcher.	"laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH.
MATCH	The tokens are identical (after stop word elimination and normalization).	"john" and "john" are a MATCH.
NULL	The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object.	N/A
OUT_OF_ORDER_DELETION	This unmatched token still leaves the remaining tokens out of order when it is removed.	When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION.
OVERRIDE	The tokens appear as a pair on the override list. This is often used for nicknames.	"john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list.
PREFIX_INITIAL	One token is an initial that matches a prefix in the other token. In practice, the PREFIX_INITIAL phenomenon is rare.	If the `initialsScore` parameter is set to 0.1, "E Silva" and "EduardoSil" will be a PREFIX_INITIAL match.
STRING_SIMILARITY	The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match.	"akcd" and "xkcd" are a STRING_SIMILARITY match.
STUCK_INITIAL	One name appears to have an initial mistakenly attached to a preceding token.	"DavidK" and "David Keith" are a STUCK_INITIAL match.
SURNAME_DELETION	When comparing two names that are divided into fields, the SURNAME field is unmatched.	When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a `Surname` field.
TRAILING_PATRONYMIC_DELETION^[a]	The unmatched token is a patronymic which has been truncated in the other name.	When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION.
TRUNCATED_EXACT_MATCH	The tokens are identical except that one has been slightly truncated.	"murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH.
TRUNCATED_HMM_MATCH	The tokens are similar, but not identical, and one has been slightly truncated.	"gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH.
UNKNOWN_FIELD_MATCH	One of the tokens is part of an "unknown" field in a fielded name. The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API.	When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH.
^[a]Only applies to Latin script names of Arabic origin.

Parameters

This table lists the parameters that can be configured via parameter_profiles.yaml. Where applicable, each parameter is linked to the specific match phenomena that it impacts. You can find more information on each parameter in parameter_defs.yaml.

Table 12. Parameter impacts

Parameter name	Applies to	Impacts
addressCrossFieldScoreThreshold	Addresses	Can affect any kind of match
addressDeletionScore	Addresses	DELETION match phenomenon
addressDifferentGroupPenalty	Addresses	Can affect any kind of match
addressFinalBias	Addresses	All address match scores
addressJoinedTokenLimit	Addresses	Concatenation^[a]
addressOverrideDefaultScore	Addresses	OVERRIDE match phenomenon
addressOverrideTablePath	File locations	Internal engineering detail
addressReorderPenalty	Addresses	Reordering^[e]
addressSameGroupPenalty	Addresses	Can affect any kind of match
addressStopPatternsPath	File locations	Internal engineering detail
addressUnpairedFieldScore	Addresses	FIELD_DELETION match phenomenon
adjustOneSidedDeletionScores	All names	DELETION match phenomenon
allowNullValue	Elasticsearch	Elasticsearch setting
alternativePairsToCheck	All names	Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)
alternativeTimeProximityMatch	Dates	All date match scores
boostWeightAtBothEnds	All names	All name match scores
boostWeightAtLeftEnd	All names	All name match scores
boostWeightAtRightEnd	All names	All name match scores
caseSensitiveData	All names	INITIALISM match phenomenon
cityAddressFieldWeight	Addresses	Weighting
cityDistrictAddressFieldWeight	Addresses	Weighting
cognateOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as COGNATE in the override file
conflictScore	All names	CONFLICT match phenomenon
conflictThreshold	All names	CONFLICT match phenomenon
countryAddressFieldWeight	Addresses	Weighting
countryRegionAddressFieldWeight	Addresses	Weighting
crossFieldInitialsPenalty	Fielded names	INITIAL_MATCH match phenomenon
crossFieldJoinInitialPenalty	Fielded names	Concatenation^[a]
crossFieldJoinPenalty	Fielded names	Concatenation^[a]
crossFieldMatchPenalty	Fielded names	Can affect any kind of match
crossLanguageGenderConflictPenalty	All names	Gender mismatch^[b]
dateFinalBias	Dates	All date match scores
dateOrdering	Dates	All date match scores
dayDistanceWeight	Dates	All date match scores
deletionScore	All names	DELETION match phenomenon
detectableLanguagesModelBased	All names	Can affect any kind of match where one or more names is in Latin script and the language is not already specified.
detectableLanguagesRuleBased	All names	Currently, you can only enable detection of Latin script as Turkish or Vietnamese.
editDistanceScoreBias	All names	Can affect any kind of match
enableDynamicConfigurationEndpoints	Elasticsearch	Elasticsearch setting
enablePromisingTermFiltering	Speed/Accuracy	Performance only
enableYueReadings	All names	Names written in Han script
entranceAddressFieldWeight	Addresses	Weighting
equivalenceClassesPath	File locations	Internal engineering detail
estimatedConflictOrDeletionScore	All names	Internal engineering detail
exactLatnMatchScore	All names	Token normalization
expensiveScorerJoinedTokenLimit	All names	Concatenation^[a]
fieldBlockedScore	Fielded names	OUT_OF_ORDER_DELETION match phenomenon
fieldConflictScore	Fielded names	CONFLICT match phenomenon
fieldDeletionScore	Fielded names	DELETION match phenomenon
finalBias	All names	All name match scores
frequencyRankBias	All names	Can affect any kind of match
genderConflictPenalty	All names	Gender mismatch^[b]
genderConflictPenaltyThreshold	All names	Gender mismatch^[b]
globalTokenCacheConfig	Speed/Accuracy	Performance only
globalTokenPairCacheConfig	Speed/Accuracy	Performance only
haniAbbreviationScore	All names	INITIALISM match phenomena in Han script
haniAbbreviationThreshold	All names	INITIALISM match phenomena in Han script
haniFourCornerCodeMismatchPenalty	All names	Names written in Han script
hmmNormalizationAlternative	All names	HMM_MATCH phenomenon
hmmScoreBias	All names	HMM_MATCH phenomenon
hmmScoreLimit	All names	HMM_MATCH phenomenon
houseAddressFieldWeight	Addresses	Weighting
houseNumberAddressFieldWeight	Addresses	Weighting
ignoreBadData	Elasticsearch	Elasticsearch setting
improveSingleDigitManipulationMatch	Dates	Date match scores containing exactly one instance of digit manipulation^[c] and no other differences
initialFrequencyRank	All names	INITIAL_MATCH match phenomenon
initialismMismatchPenalty	All names	HMM_MATCH phenomenon
initialismScore	All names	INITIALISM match phenomenon
initialsConflictScore	All names	CONFLICT match phenomenon
initialsDeletionPenalty	All names	DELETION match phenomenon
initialsScore	All names	INITIAL_MATCH match phenomenon
islandAddressFieldWeight	Addresses	Weighting
joinedTokenInitialsPenalty	All names	Concatenation^[a]; INITIAL_MATCH match phenomenon
joinedTokenLimit	All names	Concatenation^[a]
joinedTokenPenalty	All names	Concatenation^[a]
levelAddressFieldWeight	Addresses	Weighting
libpostalDataDirPath	File locations	Internal engineering detail
lowWeightTokenFrequencyRank	All names	Can affect any kind of match
lowWeightTokenPath	File locations	Internal engineering detail
maximumAlternateTokenizationRelativeDistance	All names	Affects tokenization and therefore any potential score
maximumOrganizationInitialismLength	Organization names	INITIALISM match phenomenon
maximumPersonInitialismLength	Person names	INITIALISM match phenomenon
maxYearDistanceForDigitManipulation	Dates	Date match scores containing exactly one instance of digit manipulation^[c] and no other differences
minFieldWeightFactor	Fielded names	Weighting
minimumAlternateTokenizationLength	All names	Affects tokenization and therefore any potential score
minimumOrganizationInitialismLength	Organization names	INITIALISM match phenomenon
minimumPersonInitialismLength	Person names	INITIALISM match phenomenon
monthDistanceWeight	Dates	All date match scores
nameBigramQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameDoubleMetaphoneQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameGluedQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameInitialQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameLengthMismatchPenalty	All names	DELETION match phenomenon; Concatenation^[a]; any phenomenon that changes the number of tokens in a name
nameRealWorldIdQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
ngramLMPath	File locations	Internal engineering detail
ngramThresholdPath	File locations	Internal engineering detail
nicknameOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as NICKNAME in override file
numericTokenFrequencyRank	All names	Can affect any kind of match
outOfOrderDeletionScore	All names	OUT_OF_ORDER_DELETION match phenomenon
parseUnknownFieldMarker	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
poBoxAddressFieldWeight	Addresses	Weighting
postCodeAddressFieldWeight	Addresses	Weighting
queryAlternativeOriginLanguages	Speed/Accuracy	Can affect any kind of match
realWorldIdsPath	File locations	Internal engineering detail
realWorldIdsPathUser	File locations	Internal engineering detail
reorderCorrection	All names	Rotation^[d]
reorderCorrectionThreshold	All names	Rotation^[d]
reorderPenalty	All names	Reordering^[e]
rniFullnameOverridesPath	File locations	Internal engineering detail
rntFullnameOverridesPath	File locations	Internal engineering detail
roadAddressFieldWeight	Addresses	Weighting
sameNameUnknownFieldMatchInterpolator	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
staircaseAddressFieldWeight	Addresses	Weighting
stateAddressFieldWeight	Addresses	Weighting
stateDistrictAddressFieldWeight	Addresses	Weighting
stopPatternsPath	File locations	Internal engineering detail
stringDistanceWeight	Dates	All date match scores
stuckInitialAffixMinLength	All names	STUCK_INITIAL match phenomenon
stuckInitialScore	All names	STUCK_INITIAL match phenomenon
suburbAddressFieldWeight	Addresses	Weighting
thresholdToDropoffBiasMapping	Dates	All date match scores
timeDistanceWeight	Dates	All date match scores
timeProximityYearInterval	Dates	All date match scores
tokenizeOrganizationsWithNumbers	Organization names	Affects tokenization and therefore any potential score
tokenOverridesPath	File locations	Internal engineering detail
trailingPatronymicDeletionScore	Person names	TRAILING_PATRONYMIC_DELETION match phenomenon
truncationFractionLimit	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
truncationScorerBias	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
tryAlternateTokenization	All names	Affects tokenization and therefore any potential score
tryDayMonthSwap	Dates	All date match scores
unigramLMPath	File locations	Internal engineering detail
unitAddressFieldWeight	Addresses	Weighting
unknownFieldFrequencyRank	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
unknownVsKnownScore	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
unknownVsUnknownScore	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
useEmbeddings	Organization names	EMBEDDING_MATCH match phenomenon
useSolrPhraseQueries	Solr	Solr plugin setting
variantOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as VARIANT in the override file
worldRegionAddressFieldWeight	Addresses	Weighting
yearDistanceWeight	Dates	All date match scores
^[a]Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
^[b]Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
^[c]A digit manipulation is a transformation to a digit that can be accomplished with minimal additional lines. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0↔8, 1↔7, 3↔8, 5↔8, 5↔6, 6↔8, 7↔2.
^[d]If the tokens in a name have been rotated, the reorder penalty will negatively impact the match score. RNI detects and compensates for this error.
^[e]Tokens that match, but that appear to be out-of-order, have their match scores adjusted to reflect that fact.
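
The digit manipulation pairs listed in the footnote can be expressed as a small symmetric lookup. The following Python sketch is illustrative only and does not reflect how RNI implements or scores digit manipulations; the function names are hypothetical, and the sketch assumes that each listed pair applies in both directions and that only same-length strings are compared.

```python
# Pairs copied from the footnote; each is assumed to be symmetric (0<->8, etc.).
DIGIT_MANIPULATIONS = {
    frozenset(pair)
    for pair in [("0", "8"), ("1", "7"), ("3", "8"), ("5", "8"),
                 ("5", "6"), ("6", "8"), ("7", "2")]
}


def is_digit_manipulation(a: str, b: str) -> bool:
    """Return True if digits a and b form one of the listed manipulation pairs."""
    return frozenset((a, b)) in DIGIT_MANIPULATIONS


def count_digit_manipulations(x: str, y: str):
    """Count positions where x and y differ by a listed manipulation.

    Returns None if the strings differ in length or differ at any position
    in a way that is not a listed manipulation (hypothetical helper).
    """
    if len(x) != len(y):
        return None
    count = 0
    for a, b in zip(x, y):
        if a == b:
            continue
        if not is_digit_manipulation(a, b):
            return None
        count += 1
    return count
```

For example, `count_digit_manipulations("1058", "7056")` finds two manipulations (1↔7 and 8↔6), while `count_digit_manipulations("123", "124")` returns `None` because 3↔4 is not in the list.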

Internal parameters

This table lists the parameters that can be configured via internal_param_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in internal_param_defs.yaml.

Important

We recommend against modifying these parameters unless Rosette support advises you to do so.

Name	Applies to	Impacts
affixGlueThreshold^[a]	All names and addresses	Concatenation^[b]
allLanguageSupport	All names	Can affect any kind of match
allowCacheBonuses	All names	Internal engineering detail
alwaysComputeSuffixes^[a]	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
araRNISpeedOption	Translated names	Speed and accuracy tradeoff
crossSurnameMatchPenalty	All names	Matches in languages with onomastic information
debuggableIndex	N/A	Internal engineering detail; has no effect on matching
debugPrintTuples	N/A	Internal engineering detail; has no effect on matching
defaultScoreToCheckRestriction	All names, dates, and addresses	First-pass scoring
disabledLanguages	All names
doFrontTruncations	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
doQueryBigrams	All names	First-pass accuracy
doQueryCompleted	All names	First-pass accuracy
doQueryFullnameOverrides	All names	First-pass accuracy
doQueryFuzzy	All names	First-pass accuracy
doQueryGlued	All names	First-pass accuracy
doQueryIndexKeys	All names	First-pass accuracy
doQueryInitials	All names	First-pass accuracy
doQueryNormalized	All names	First-pass accuracy
doQueryPersonInitialisms	All names	First-pass accuracy
doQueryPhrase	All names	First-pass accuracy
doQueryRealWorldIds	All names	First-pass accuracy
doQueryTokenOverrides	All names	First-pass accuracy
doQueryTranslated	All names	First-pass accuracy
doViterbiRescaling	All names	Fuzzy match^[c]
editDistanceTokenScorerPenalty	All names	STRING_SIMILARITY match phenomenon
embeddingBias	Organization names	EMBEDDING_MATCH match phenomenon
embeddingZeroScore	Organization names	EMBEDDING_MATCH match phenomenon
enableAdditionalOnomastics	All names	Matches in languages with onomastic information
enableRemoteTokenScorer	All names	Japanese/English fuzzy match^[c]
enableSeq2SeqTokenScorer	All names	Japanese/English fuzzy match^[c]
enableTokenPairLogging	N/A	Internal engineering detail; has no effect on matching
engEngFastMode	All names	English/English fuzzy match^[c]
expandedLanguages	All names	Fuzzy match^[c]; OVERRIDE match phenomenon
expansionLimit	All names	Fuzzy match^[c]; OVERRIDE match phenomenon
expansionScoreThreshold	All names	Fuzzy match^[c]; OVERRIDE match phenomenon
familiarTokenMismatchPenalty	All names	Can affect any kind of match
familiarTokenThreshold	All names	Can affect any kind of match
firstPassDayRange	Dates	Performance only
firstPassMonthRange	Dates	Performance only
firstPassYearRange	Dates	Performance only
foreignAddressFinalBias	Addresses	All English-to-non-English address matches
genderPenaltyMinimumLength	All names	Gender mismatch^[d]
givenFieldDeletionScore	Fielded names	DELETION match phenomenon
HMMCachePerProcess	All names	Internal engineering detail; HMM_MATCH phenomenon
HMMCachePerThread	All names	Internal engineering detail; HMM_MATCH phenomenon
hmmNormBias	All names	Internal engineering detail; Fuzzy match^[c]
HMMUsageThreshold	All names	Internal engineering detail; HMM_MATCH phenomenon
identifierEditDistanceTokenScorerPenalty	Identifiers	STRING_SIMILARITY match phenomenon
ignoreTranslationOrigins	All names	Can affect any kind of match that uses English transliteration
includeExtraKatakanaPersonReadings	Translated names	Can affect any kind of match
initialAndSuffixMinLength	All names	Fuzzy match^[c]; INITIAL_MATCH match phenomenon
initialAndSuffixScore	All names	Fuzzy match^[c]; INITIAL_MATCH match phenomenon
jniBias	All names	Can affect any kind of match in languages that use a JNI scorer
jpnRNISpeedOption	Translated names	Speed and accuracy tradeoff
kanjiMismatchPenalty	All names	Normalization of tokens that include kanji
katakanaTransliterationsOnly	Translated names	Can affect any kind of match
korRNISpeedOption	Translated names	Speed and accuracy tradeoff
latinDataAlternativesToCheck	All names	Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)
limitedLanguageEditDistance	All names	STRING_SIMILARITY match phenomenon
maxIdentifierEditDistance	All names	First-pass accuracy
notExactMatchPenalty	All names	Normalization
_postCodePathAddressFieldWeight	Addresses	Weighting
promisingFuzzyTermFrequencyFactor	Speed/Accuracy	Performance only
promisingTermFrequencyFactor	Speed/Accuracy	Performance only
queryMaxResults	All names, dates, and addresses	First-pass scoring
queryMaxToCheck	All names, dates, and addresses	First-pass scoring
queryMaxToConsider	All names, dates, and addresses	First-pass scoring
queryToCheckAllowance	All names, dates, and addresses	First-pass scoring
realWorldIdScore	Organization names	Real-world id match^[e]
remoteTokenScorerURL	All names	Internal engineering detail; Japanese/English fuzzy match^[c]
rntTokenOverridesPath	File locations	Internal engineering detail
rusRNISpeedOption	Translated names	Speed and accuracy tradeoff
secondarySurnameTokenTypeWeight	All names	Matches in languages with onomastic information
seq2seqCachePerProcess	All names	Internal engineering detail; Japanese/English fuzzy match^[c]
seq2seqCachePerThread	All names	Internal engineering detail; Japanese/English fuzzy match^[c]
seq2seqTokenOverridesPath	File locations	Internal engineering detail
seq2seqUsageThreshold	All names	Internal engineering detail; Japanese/English fuzzy match^[c]
splitTokens	All names	Internal engineering detail
stringDistanceThreshold^[a]	All names	Fuzzy match^[c]
surnameFieldDeletionScore	Fielded names	DELETION match phenomenon
surnameTokenTypeWeight	All names	Matches in languages with onomastic information
taggerMinimumConfidenceThreshold	All names	Matches in languages with onomastic information
translatorResultsToKeep	Translated names	Can affect any kind of match
truncationAffixSimilarityLength	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
truncationAffixSimilarityThreshold	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
truncationLengthLimit	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
useCharacterLM	All names	Can affect any kind of match
useEditDistanceTokenScorer	All names	STRING_SIMILARITY match phenomenon
useIdentifierEditDistanceTokenScorer	Identifiers	STRING_SIMILARITY match phenomenon
useLM	All names	Can affect any kind of match
useOldAndNewNameSegmentationForJapanese	All names	Can affect any kind of match involving Japanese translations
useRealWorldIds	All names	Real-world id match^[e]
zhoRNISpeedOption	Translated names	Speed and accuracy tradeoff
^[a]Unlike public parameters for this feature, this is a speed/accuracy tradeoff, not a science-tuning parameter.
^[b]Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
^[c]A fuzzy match is a match between tokens that are similar but not identical. The HMM_MATCH and SEQ2SEQ_MATCH phenomena are examples of this.
^[d]Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
^[e]RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name.
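
The concatenation idea described in the footnotes (joining adjacent tokens and checking the compound against the other name) can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions, not RNI's implementation: the function names are hypothetical, tokens are split on whitespace, and exact equality stands in for whatever RNI considers a "good match".

```python
def concatenation_candidates(tokens):
    """Return (start_index, joined_token) for each pair of adjacent tokens."""
    return [(i, tokens[i] + tokens[i + 1]) for i in range(len(tokens) - 1)]


def find_concatenation_matches(name_a: str, name_b: str):
    """Return compounds built from adjacent tokens of name_a that equal a token of name_b.

    Assumption: lowercase whitespace tokenization and exact string equality
    are used here only to keep the sketch small.
    """
    tokens_a = name_a.lower().split()
    tokens_b = set(name_b.lower().split())
    return [joined for _, joined in concatenation_candidates(tokens_a)
            if joined in tokens_b]
```

For example, `find_concatenation_matches("Mary Ann Smith", "Maryann Smith")` returns `["maryann"]`, because joining the adjacent tokens "mary" and "ann" produces a token that appears in the other name.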

^[1]Copyright by Elasticsearch BV. Dual licensed under Server Side Public License (SSPL version 1) and the Elastic License 2.0 (ELv2).

^[2]For RNI plugins that support earlier versions of Elasticsearch (such as 1.x.y), contact support@rosette.com.

^[3]# may also be used after an entry on the same line to begin a comment.

^[4]Override files are not provided for all supported languages. Specifically, while no files are provided for Russian or Korean, you can create token pair files for these languages.

^[5]RNI depends on the jpostal binding for the open source libpostal library to parse unfielded addresses as a pre-processing step. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support@basistech.com if you discover any issues.

^[6]# may also be used after an entry on the same line to begin a comment.

^[7]# may also be used after an entry on the same line to begin a comment.

Match Identity