RNI-Elasticsearch Plugin Guide
Introduction
RNI-Elasticsearch is an Elasticsearch[1] plugin for building fuzzy name retrieval and name matching applications for persons, locations, and organizations. It uses Rosette Name Indexer (RNI), implementing high-speed, scalable, cross-language, and cross-script searches with the Elasticsearch full-text search engine to store the names and search keys.
This guide describes how to use the RNI-Elasticsearch plugin and RNI features, and is not intended to be a complete guide to Elasticsearch.
Supported Platforms
RNI-Elasticsearch is supported on the following operating systems and CPUs.
OS | CPU |
---|---|
macOS v10.9+ (Darwin 13) | AMD64 |
Linux | AMD64 |
Linux | AARCH64 |
Windows | AMD64 |
Java Only | Any OS and CPU with 64-bit Java SDK 17 through 20 |
Language support
RNI can match names in any language. For the languages listed in Fully supported text domains for name matching, RNI calculates a match score using a variety of techniques, as described in Understanding name match scores. For names not listed in those tables, RNI provides limited support, as described in Language support parameters.
Note
Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.
To set RNI to behave as it did previously, set allLanguageSupport to false.
Interpreting RNI scores
Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.
RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).
The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, the relative ranking of the scores is more meaningful than the actual score values. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding name match scores.
Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:
Variation | Example(s) |
---|---|
Phonetic and/or spelling differences | Nayif Hawatmeh and Nayif Hawatma |
Missing name components | Mohammad Salah and Mohammad Abd El-Hamid Salah |
Rarity of a shared name component | Two English names that contain Ditters are more likely to match than two names that contain Smith |
Initials | John F. Kennedy and John Fitzgerald Kennedy |
Nicknames | Bobby Holguin and Robert Holguin |
"Cousin" or cognate names | Pedro Calzon and Peter Calzon |
Uppercase/Lowercase | Rosa Elena PACHECO and Rosa Elena Pacheco |
Reordered name components | Zedong Mao and Mao Zedong |
Variable Segmentation | Henry Van Dick and Henri VanDick, Robert Smith and Robert JohnSmyth |
Corresponding name fields | For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox] |
Truncation of name elements | For Sawyer, the similarity with Sawy is higher than the similarity with Sawi. |
Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.
You can configure RNI to customize how it scores different match phenomena.
The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).
Installing RNI-Elasticsearch
Important
The RNI Elasticsearch plugin does not work with the AWS managed elastic service.
Note
You will need read/write permissions on the host system to install RNI-Elasticsearch.
Note
For security reasons, Elasticsearch is only accessible locally by default; this is suitable for a local dev/test environment. For accessible dev/test/production server instances, users need to follow the steps outlined in the Elasticsearch reference manual to enable the appropriate network settings for their specific instance.
Note
When installing on top of an existing version of RNI-ES without reindexing, some changes to the software may not function with an index created on the previous installation, and many newly developed features won't work without reindexing. For the best experience, Babel Street recommends reindexing on every installation.
To use RNI-Elasticsearch you need the RNI Elasticsearch plugin, an RLP license file (rlp-license.xml), and Elasticsearch. [2]
Note
If you are using the Linux distribution of RNI-ES, note that glibc is required. The version of glibc that the native libraries are built against can be found in the filename of the distributed package.
If you do not already have it, install Elasticsearch using the setup instructions for the appropriate version.
Download and unzip Elasticsearch-<version>.zip.
Important
The version of Elasticsearch must match the first three digits of the version of the RNI-ES plugin. If your version of Elasticsearch does not match the plugin version, the plugin will not install.
Example:
Elasticsearch version: 7.5.2
RNI-ES plugin version: 7.5.2.x where x is an integer
Install the plugin.
Navigate to the elasticsearch-<version> root directory and run the install command.
On Unix, Linux and macOS:
bin/elasticsearch-plugin install file:///path/to/rni-es-<version>.x.zip
On Windows:
bin\elasticsearch-plugin install file:///C:\path\to\rni-es-<version>.x.zip
Note
You must use the absolute file path to refer to the plugin zip file. For example, if the file is in the home directory of rniUser on macOS, the command would be:
bin/elasticsearch-plugin install file:///Users/rniUser/rni-es-<version>.x.zip
You may be prompted to grant permissions necessary for the plugin to function.
The plugin is now in plugins/rni.
Note
For Windows users, you must add bin\elasticsearch-<version>\plugins\rni\bt_root\rlp\bin\* to your PATH environment variable, replacing * with the name of the subdirectory that contains the platform-specific binary library files (for example, amd64-w64-msvc120).
Additionally, the RNI-Elasticsearch plugin cannot be installed into distributions of Elasticsearch found in the C:\Program Files directory.
Copy the RLP license (rlp-license.xml) to plugins/rni/bt_root/rlp/rlp/licenses.
This license must be in place before you can use the RNI-Elasticsearch plugin.
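The version rule in the Important note above (the first three digits of the Elasticsearch version must match those of the plugin) can be checked before attempting an install. The following is a minimal sketch; the function name versions_compatible is hypothetical and not part of RNI-Elasticsearch.

```python
def versions_compatible(es_version: str, plugin_version: str) -> bool:
    """Return True when the first three version components of both strings match."""
    es_parts = es_version.split(".")[:3]
    plugin_parts = plugin_version.split(".")[:3]
    # Require a full major.minor.patch triple on the Elasticsearch side.
    return len(es_parts) == 3 and es_parts == plugin_parts


print(versions_compatible("7.5.2", "7.5.2.4"))  # True
print(versions_compatible("7.5.1", "7.5.2.4"))  # False
```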
Note
If your index contains complex mappings or searches, including many fields or nested fields, you may need to increase the heap size as described in the Elasticsearch documentation.
To start the Elasticsearch server, run:
bin/elasticsearch
Note
When starting Elasticsearch with the plugin you may see some non-fatal error messages. If a message follows the error stating that "Cluster health status changed from [RED] to [YELLOW]", the error can be ignored. This may occur when enableDynamicConfiguration is set to true.
Note
To utilize the overlay directory functionality, you must specify the overlay root location with the bt.overlay.root or bt.overlay.relative.root (relative to the plugin installation directory) system property at plugin startup time.
Example of an overlay root location:
ES_JAVA_OPTS="-Dbt.overlay.root=<overlay-root-path>" bin/elasticsearch
Example of a relative overlay root directory:
ES_JAVA_OPTS="-Dbt.overlay.relative.root=<relative-overlay-root-path>" bin/elasticsearch
If the overlay directory is not located within the Elasticsearch installation directory, you must add an entry to RNI's plugin-security.policy file giving read permissions to the directory.
libpostal data directory
RNI uses libpostal to parse addresses; libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.
RNI packages libpostal data in plugins/rni/bt_root/rlpnc/data/libpostal. The data directory is relatively large (~2 GB). If you are certain that you won't be using address matching of unfielded addresses, you can safely delete the libpostal data directory without impacting any other RNI functionality.
Prepare the index
Important
Unless otherwise specified, all inputs to RNI need to be UTF-8 encoded.
Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
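One way to verify the encoding of copied documents is to attempt a strict UTF-8 decode of the raw bytes. This is only a sketch; the helper name is_valid_utf8 is hypothetical, and a successful decode does not prove that characters were not already lost in an earlier conversion.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A name encoded as UTF-8 passes; the same name encoded as UTF-16 does not.
print(is_valid_utf8("Joaquín Guzmán".encode("utf-8")))   # True
print(is_valid_utf8("Joaquín Guzmán".encode("utf-16")))  # False
```

To check a file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the function.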
Elasticsearch provides real-time search and analytics for all kinds of data. The data is stored in documents, each having a set of fields, some of which are defined as search fields. An Elasticsearch index is a collection of these documents.
The RNI-Elasticsearch plugin uses an Elasticsearch index to store documents containing names, dates, addresses, or other fields to be matched.
Before using RNI to search for matches, you must create the index, define mappings, and load the index with documents.
Create an index, or a searchable container for your documents.
Define a mapping for fields that contain person, location, organization, or identifier entity types. The type of a name field to be searched by RNI is "rni_name". A mapping defines the data types of each of the searchable fields in a document. The mapping does not have to include every field in the document, just the searchable fields.
Index documents that contain one or more name fields along with other fields of interest. This step loads the documents into the index.
Test the RNI integration before continuing on.
Once you've completed the above steps, you are ready to query the index.
The following snippets use the cURL command-line tool to illustrate the Elasticsearch commands for running the plugin.
You can also use Kibana, an open source dashboard for Elasticsearch.
Create an index
An Elasticsearch index consists of one or more documents, and a document contains one or more fields. A name index is an indexed list of names.
By default, a local Elasticsearch instance listens on localhost:9200.
The following cURL statement creates an index named rni-test.
curl -XPUT 'http://localhost:9200/rni-test'
Define a mapping
A mapping defines how a document, along with the fields it contains, is stored and indexed and sets the types of the search fields. For name search fields, set the "type" of the name fields to "rni_name".
The following statement maps the "primary_name" and "aka" (also known as) fields in the document to the "rni_name" type in the "rni-test" index.
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "primary_name" : { "type" : "rni_name" }, "aka" : { "type" : "rni_name" }, "occupation" : { "type" : "text" } } }'
Prior to the RNI-RNT 7.35.1.c65.0 release (September 2021), entityType was not considered when querying. For example, a name with an entityType of PERSON could match a name with an entityType of ORGANIZATION. To return to this behavior, set the mapping parameter testEntityType to false. This will allow indexed names with any or no entityType to be returned, regardless of the entityType in the search.
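The index-creation and mapping calls shown with cURL can also be prepared from a script. The sketch below only builds the requests with Python's standard library and does not send them; the helper names build_put and name_mapping are hypothetical, and the base URL assumes the local default instance.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local default instance


def build_put(path, body=None):
    """Build (but do not send) a PUT request, with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method="PUT")


def name_mapping(name_fields, text_fields=()):
    """Map each name field to the rni_name type and each other field to text."""
    properties = {field: {"type": "rni_name"} for field in name_fields}
    properties.update({field: {"type": "text"} for field in text_fields})
    return {"properties": properties}


create_index = build_put("/rni-test")
put_mapping = build_put("/rni-test/_mapping", name_mapping(["primary_name", "aka"], ["occupation"]))
# To send them against a running instance:
#   urllib.request.urlopen(create_index); urllib.request.urlopen(put_mapping)
```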
Index documents
This is the step where you add your data, or documents, to the index. A document is a JSON object containing one or more fields. Each field in a document is defined as a key-value pair, where the key is the field and the value is the data.
Documents may include fields other than name fields.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "aka" : "Bossman", "occupation" : "business owner" }'
Name fields can include properties in addition to the name string (or "data" property). Properties are used when searching to optimize the search algorithms for the data. The "entityType" property is particularly important for name searching and customizations.
Property | Required | Description |
---|---|---|
data | ✓ | The name string. |
language | | ISO 639-3 code for the language of use: the language of the document in which the name was found. |
 | | ISO 639-3 code for the language of origin of the name. For example, a name of Spanish origin (spa) may be found in an English (eng) document. |
script | | ISO 15924 code for the script. |
entityType | | Type of the name. |
 | | Unique string identifier for the document. |
 | | An explicitly defined gender for a name. |
Example:
curl -XPUT 'http://localhost:9200/rni-test/_doc/3' -H'Content-Type: application/json' -d '{ "primary_name" : { "data" : "Joe Schmoe", "language" : "eng", "script" : "Latn", "entityType" : "PERSON" } }'
Tip
When creating a large set of documents, use bulk insert for optimal performance.
Tip
You may need to wait a few minutes for the documents to be ready to query. Documents are not always immediately available from Elasticsearch after being added to the index.
Entity types
The entityType field identifies the type of name being matched and is used to select the algorithms to use for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.
Important
The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.
Type | Description | Features |
---|---|---|
PERSON | A human identified by name, nickname, or alias. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported. |
LOCATION | A city, state, country, region or other location. | Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported. |
ORGANIZATION | A corporation, institution, government agency, or other group of people defined by an established organizational structure. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World IDs are supported. |
IDENTIFIER IDENTIFIER:DRIVERS_LICENSE IDENTIFIER:LICENSE_PLATE IDENTIFIER:NATIONAL_ID_NUM | An alphanumeric identifier. | Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance. |
Fielded names
You can process fielded names by separating the fields with "|". RNI assigns no explicit semantics to each field (such as given name or surname), but it does pay attention to the order of the fields when comparing two fielded names. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in name1 matches the second field in name2). Fields within a name can be empty.
When scoring a potential match between a name with fields and a name without fields, RNI treats the name without fields as if it were a name with a single field.
RNI treats trailing empty fields as if they were not present. For example "Rosanne|Taylor Smith|" is treated the same as "Rosanne|Taylor Smith".
Alternatively, you can specify that a field has an unknown value. To specify an unknown name field, replace the field with *?*.
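The fielded-name conventions above can be sketched as a small helper that joins fields with "|" and substitutes the unknown-value marker. The function name build_fielded_name and the use of None to mean "unknown" are assumptions made for this illustration.

```python
UNKNOWN_FIELD = "*?*"  # marker for an unknown name field, as described above


def build_fielded_name(fields):
    """Join name fields with '|'; None marks an unknown field, '' an empty one."""
    return "|".join(UNKNOWN_FIELD if field is None else field for field in fields)


print(build_fielded_name(["Rosanne", "Taylor Smith"]))  # Rosanne|Taylor Smith
print(build_fielded_name(["Rosanne", None, "Smith"]))   # Rosanne|*?*|Smith
```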
Names containing special characters
When using JSON objects with RNI, special characters must be properly escaped when used in strings. RNI requires a backslash to escape the special character and then JSON requires another backslash to escape the first backslash. Thus, in RNI, the proper escape sequence for names containing a special character is a double backslash (\\).
The | used in fielded names is one example of a special character embedded within a name, where | is used to separate the fields. For proper processing of the vertical bar character, RNI needs to be able to distinguish when the user intends to build a fielded name and a name which contains the vertical bar character.
Let's assume we have a name that includes a | that does not indicate a fielded name: "John|Smith". RNI requires that you escape the vertical bar with a backslash; e.g. "John\|Smith". Then, JSON requires that the backslash character be escaped with a backslash. The correct syntax for the name "John|Smith" is "John\\|Smith". If the entry were representing a fielded name, the correct syntax would be "John|Smith" without any backslashes.
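When the request body is produced by a JSON library, the second level of escaping is applied automatically, so only the RNI-level backslash has to be added by hand. The following sketch shows this with Python's json module; the helper name escape_literal_bars is hypothetical.

```python
import json


def escape_literal_bars(name: str) -> str:
    """Prefix each literal '|' with a backslash so RNI does not treat it as a field separator."""
    return name.replace("|", "\\|")


rni_name = escape_literal_bars("John|Smith")      # the string John\|Smith (one backslash)
payload = json.dumps({"primary_name": rni_name})  # JSON serialization doubles the backslash
print(payload)  # {"primary_name": "John\\|Smith"}
```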
Verify the RNI SDK version
To verify the version of the RNI SDK being used by the plugin, send a GET request to {index_name}/rni_plugin/_get_version:
curl -XGET 'localhost:9200/rni-test/rni_plugin/_get_version'
This call also verifies that the RNI plugin is installed and running successfully.
Bulk insert
Bulk insert allows you to add multiple documents to Elasticsearch in a single API call, improving the throughput for uploading documents by orders of magnitude. We recommend you use bulk indexing to create and index your data wherever possible.
Create the index.
Define the mapping.
Run Bulk Insert.
Tip
Do not perform any queries or searches on the cluster while indexing data via the bulk index API. Doing so can cause significant performance issues.
The structure for all Elasticsearch bulk API calls is:
{ action_to_be_performed: { metadata_related_to_action}}\n { request_body_data_to_index }\n
Bulk insert example
We're going to continue the example that we started. The index is rni-test. The mapping defines a primary_name, aka, and occupation.
Create the index.
curl -X PUT http://localhost:9200/rni-test
Define the mapping.
The previously defined mapping:
{ "properties" : { "primary_name" : { "type" : "rni_name" }, "aka" : { "type" : "rni_name" }, "occupation" : { "type" : "text" } } }
You can put the mapping in a JSON file and create it from the command line. The following curl command creates the mapping using a file (mapping.json in this example):
curl -X PUT -H"Content-Type:application/json" -d @mapping.json http://localhost:9200/rni-test/_mapping
Create a data file in newline delimited JSON (NDJSON) format. Save the file as bulknames.json. The file MUST end with a newline after the final record.
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Joaquín Guzmán","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "René Lindström Jones","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Guadalupe Hernandez","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Chris Joseph Arsenault","entityType":"PERSON"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "ABC","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Basis Technology","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Australian Boradcasting Corporation","entityType":"ORGANIZATION"}}
{"index":{"_index":"rni-test","_id":null}}
{"primary_name":{"data": "Amazon","entityType":"ORGANIZATION"}}
Use the _bulk method to load the data file with curl using the following command:
curl -X POST -H"Content-Type:application/json" --data-binary @bulknames.json http://localhost:9200/rni-test/_bulk
Note
If you're providing text file input to curl, use the --data-binary flag instead of plain -d to preserve the newlines.
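For larger data sets, the NDJSON file is usually generated rather than written by hand. The sketch below builds a bulk body with one action line and one document line per name and guarantees the trailing newline. The function name build_bulk_body is hypothetical; the action lines omit _id, in which case Elasticsearch assigns document IDs itself.

```python
import json


def build_bulk_body(index, names, field="primary_name"):
    """Build an NDJSON bulk body from (name, entity_type) pairs."""
    lines = []
    for data, entity_type in names:
        # Action line: index into the target index, letting Elasticsearch assign the ID.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Document line: keep non-ASCII characters as-is (write the file as UTF-8).
        lines.append(json.dumps({field: {"data": data, "entityType": entity_type}}, ensure_ascii=False))
    # The bulk API requires a newline after the final record.
    return "\n".join(lines) + "\n"


body = build_bulk_body("rni-test", [("Joaquín Guzmán", "PERSON"), ("Amazon", "ORGANIZATION")])
```

Write the result with open("bulknames.json", "w", encoding="utf-8") and load it with the curl command shown above.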
Searching with queries
At this point you've created an index and loaded data. Now you can start using RNI to search for matches.
A query searches the index and returns a match score. In RNI, the query for a name consists of two parts, a base query and a rescorer.
The base query is a standard Elasticsearch query against a name field. The rescorer takes the results of the base query, and uses Elasticsearch rescoring to select the top candidates and perform pairwise matching on the top candidates.
The query returns an RNI match score (max_score), the score of the top scoring document.
Important
The entityType (PERSON, LOCATION, ORGANIZATION) must be added to a name query to utilize all RNI features. If you don't specify an entityType, the type NONE will be used and RNI may return less accurate results.
Base query
The base query is a standard query against a name field:
"query" : { "match" : { "primary_name" : "Jo Shmoe" } }
Querying supports the same name properties that you may use when indexing documents. Unlike during document creation, you must pass the JSON object containing the name fields as a string. You should always include the entityType property in your query.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "primary_name" : "{\"data\" : \"Jo Shmoe\", \"entityType\" : \"PERSON\"}" } } }'
Much like during indexing, RNI creates a set of keys based on the name and then generates a more complex internal query to match against the indexed keys.
Rescore with the RNI pairwise name match
The base query returns a ranked list of matching documents. The rescorer takes the top documents from the list, runs the pairwise matching algorithms on those documents, and returns a re-ranked list. RNI has a custom rescorer which allows you to further tune the candidates passed to the RNI pairwise matcher. Since pairwise matching is computationally intensive, you want to rescore just enough documents to find the best matches.
Elasticsearch rescoring includes the following parameters:
window_size (an integer, defaults to 10) specifies how many documents from the base query should be passed to the RNI pairwise matcher.
Use this parameter to limit the number of compute-intensive name matches that need to be performed. If you set the value too high, the query will take too long, but if you set the value too low, you will increase the number of false negatives.
Tip
A good starting point for window_size is the square root of the size of the index. For example, an index of 10,000 entries would use a window_size of 100.
query_weight (a float, defaults to 1.0) specifies the weighting of the score returned by the base query.
In the context of RNI pairwise matching, the base query score has little meaning, so we suggest you set it to 0.0.
rescore_query_weight (a float, defaults to 1.0) specifies the weighting of the maximum RNI pairwise match score.
If query_weight is 0.0 and rescore_query_weight is 1.0, the score that is returned by rescoring is the RNI pairwise match score.
score_mode controls how the query and rescore query scores are combined. The default value is total, meaning that both scores are added together after being multiplied by their respective weights.
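The square-root rule of thumb and the default total score mode described above amount to simple arithmetic, sketched below. Both function names are hypothetical, and the base score passed in the example is an invented value used only to show that it drops out when query_weight is 0.0.

```python
import math


def suggested_window_size(index_size: int) -> int:
    """Starting point for window_size: the integer square root of the index size."""
    return max(1, math.isqrt(index_size))


def combined_score(base_score, rni_score, query_weight=1.0, rescore_query_weight=1.0):
    """score_mode 'total': each score is multiplied by its weight, then the two are added."""
    return query_weight * base_score + rescore_query_weight * rni_score


print(suggested_window_size(10_000))  # 100
# With query_weight 0.0 and rescore_query_weight 1.0, only the RNI score remains.
print(combined_score(12.3, 0.8, query_weight=0.0, rescore_query_weight=1.0))  # 0.8
```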
In the following example, pairwise matching is performed on the top 200 names returned by the base query.
Example with RNI Rescorer:
"rescore" : { "window_size" : 200, "query" : { "rescore_query" : { "function_score" : { "name_score" : { "field" : "primary_name", "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"} } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } }
The "name_score" function matches every name in the given field against the query name and returns the maximum score to the rescorer.
The "name_score" function score query must be given at least one object that specifies:
field: the search field being rescored, which must be of type rni_name.
query_name: the value of the search field.
The object passed to the name_score function can also include any of the name properties.
This example illustrates the full query incorporating both match and rescore, using RNI query parameters.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "primary_name" : "{\"data\" : \"Jo Shmoe\",\"entityType\" : \"PERSON\"}" } }, "rescore" : { "window_size" : 200, "query" : { "rescore_query" : { "function_score" : { "name_score" : { "field" : "primary_name", "query_name" : {"data" : "Jo Shmoe", "entityType":"PERSON"} } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
This query returns an RNI match score against "Joe Shmoe" in the "_score" field:
{ "_index": "rni-test", "_type": "_doc", "_id": "1", "_score": 0.80217975, "_source": { "primary_name": "Joe Shmoe", "aka": "Bossman", "occupation": "business owner" } }
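The full query has one detail that is easy to get wrong: the base match query takes the name object serialized as a string, while the rescorer's query_name takes it as a plain JSON object. The sketch below builds the same request body as the cURL example; the function name build_name_search is hypothetical.

```python
import json


def build_name_search(field, name, entity_type, window_size=10):
    """Build a search body with an RNI base query and an RNI rescorer."""
    name_obj = {"data": name, "entityType": entity_type}
    return {
        # Base query: the name object must be passed as a JSON string.
        "query": {"match": {field: json.dumps(name_obj)}},
        "rescore": {
            "window_size": window_size,
            "query": {
                # Rescorer: the name object is passed as a regular JSON object.
                "rescore_query": {
                    "function_score": {"name_score": {"field": field, "query_name": name_obj}}
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }


search_body = build_name_search("primary_name", "Jo Shmoe", "PERSON", window_size=200)
print(json.dumps(search_body, indent=2))
```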
Representing arrays in Elasticsearch
If the name field in your documents is structured as an array, such as first name and last name fields, wrap the field in a nested object. The nested datatype allows arrays of objects to be indexed and queried independently of each other.
Since Elasticsearch flattens object hierarchies into a simple list of field names and values, if you don't use the nested type, you can lose the relationship between the fields. For example, the following document:
"names" : [ { "first" : "Joe", "last" : "Smith" }, { "first" : "Mike", "last" : "Shmoe" } ]
would be transformed internally into a document that looks more like this:
{ "names.first" : [ "mike", "joe" ], "names.last" : [ "smith", "shmoe" ] }
The names.first and names.last fields are flattened into multi-value fields, and the association between Joe and Smith is lost. This document would incorrectly match a query for mike and smith.
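The effect of flattening can be reproduced with a small simulation. This is a simplified model written for illustration, not Elasticsearch's implementation: it lowercases values to mimic text analysis and treats a query as a match when each requested value appears somewhere in the corresponding flattened field.

```python
def flatten(field, objects):
    """Collapse an array of objects into multi-value fields, as described above."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value.lower())
    return flat


def flattened_match(flat, field, **wanted):
    """Match when every requested value appears in its flattened field."""
    return all(value.lower() in flat.get(f"{field}.{key}", []) for key, value in wanted.items())


names = [{"first": "Joe", "last": "Smith"}, {"first": "Mike", "last": "Shmoe"}]
flat = flatten("names", names)
# True, even though the document contains no one named Mike Smith.
print(flattened_match(flat, "names", first="Mike", last="Smith"))
```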
If you wrap an array field in a nested object, you will get more accurate search results.
Include a field of type "nested" containing the name field in the mapping:
"nested_names" : { "type" : "nested", "properties" : { "name" : { "type" :"rni_name" } } }
Multiple names can be added to the nested field:
{ "nested_names" : [ { "name" : "Joe Smith" }, { "name" : "Mike Shmoe" } ] }
Update the query to refer to the nested object. Set the "score_mode" to "max".
{ "query" : { "nested" : { "path" : "nested_names", "query" : { "match" : { "nested_names.name" : "Mike Shmoe" } } } }, "rescore" : { "query" : { "rescore_query" : { "nested" : { "path" : "nested_names", "score_mode" : "max", "query" : { "function_score" : { "name_score" : { "field" : "nested_names.name", "query_name" : "Mike Shmoe" } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }
See the Elasticsearch documentation for more detailed information on nested objects and queries.
Nested names example
Let's consider an example of a database that includes alias names along with a primary name.
Nested Mapping:
"properties" : { "primary_name" : {"type" : "rni_name"}, "aliases" : { "type" : "nested", "properties" : { "alias_name" : { "type" : "rni_name" } } } }
The curl command to create the mapping:
curl -XPUT "http://localhost:9200/rni-test/_mapping" -H 'Content-Type: application/json' -d '{ "properties" : { "primary_name" : { "type" : "rni_name"}, "aliases" : { "type" : "nested", "properties": { "alias_name" : { "type" : "rni_name" } } } } }'
Each record includes a primary name. Each primary name can have multiple aliases.
"primary_name" : "John Smith", "aliases": [ {"alias_name": "John Shark"}, {"alias_name": "Smithy"}, {"alias_name": "Johnny boy"} ]
The curl command to add the data:
curl -XPUT "http://localhost:9200/rni-test/_doc/null" -H 'Content-Type: application/json' -d '{ "primary_name" : "John Smith", "aliases" : [ {"alias_name": "John Shark"}, {"alias_name": "Smithy"}, {"alias_name": "Johnny boy"} ] }'
The query will try to match one of the aliases. Specify score_mode: max to return the highest match score of the aliases.
curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d '{ "query" : { "nested" : { "path" : "aliases", "query" : { "match" : { "aliases.alias_name": "Johnny" } } } }, "rescore" : { "query" : { "rescore_query" : { "nested" : { "path" : "aliases", "score_mode" : "max", "query" : { "function_score" : { "name_score" : { "field" : "aliases.alias_name", "query_name" : "Johnny" } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
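The nested query and rescorer repeat the same path and field name in several places, so it can help to generate the body from those two values. The following sketch mirrors the structure of the cURL example above; the function name build_nested_name_search is hypothetical.

```python
def build_nested_name_search(path, field, name):
    """Build a nested base query plus a nested RNI rescorer using score_mode 'max'."""
    full_field = f"{path}.{field}"
    return {
        "query": {"nested": {"path": path, "query": {"match": {full_field: name}}}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "nested": {
                        "path": path,
                        "score_mode": "max",  # keep the best-scoring alias
                        "query": {
                            "function_score": {"name_score": {"field": full_field, "query_name": name}}
                        },
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


alias_search = build_nested_name_search("aliases", "alias_name", "Johnny")
```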
Sorting results by rni_name
Elasticsearch supports the ability to sort search results by the values of their document fields. In the case of RNI, one may want to sort on an rni_name field. Because these fields are internally composed of many subfields, it is necessary to specify the subfield to sort on. Below are a couple of subfields that you may be interested in:
IndexFields enum value | Raw subfield name | Example for "John' Smith" | Explanation |
---|---|---|---|
ORIGINAL_NAME_FIELD | bt_rni_name_original | John Smith | original input data for the name |
NORMALIZED_DATA_FIELD | bt_rni_name_normalized | john smith | normalized name |
As an example, if your field's name is primaryName, you can sort on the original name data by referring to primaryName.bt_rni_name_original in your sort specification.
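A sort specification for such a subfield might be assembled as follows. This is a sketch using the standard Elasticsearch sort syntax and the raw subfield names from the table above; the helper name sort_clause is hypothetical.

```python
ORIGINAL_SUBFIELD = "bt_rni_name_original"
NORMALIZED_SUBFIELD = "bt_rni_name_normalized"


def sort_clause(field, subfield=ORIGINAL_SUBFIELD, order="asc"):
    """Build a sort clause that targets one subfield of an rni_name field."""
    return [{f"{field}.{subfield}": {"order": order}}]


# Body fragment to merge into a search request.
sort_body = {"sort": sort_clause("primaryName")}
print(sort_body)
```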
In the Java API, these fields can be referenced through the IndexFields enum. Regarding the previous example, one could refer to the same subfield in Java:
"primaryName." + IndexFields.ORIGINAL_NAME_FIELD.fieldName()
Configuring name matching
There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are by modifying match parameters and editing overrides. You can also train a custom language model.
Tuning match parameters
The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.
The typical process for tuning parameters is as follows:
Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.
Tune the match parameters to change the match scores of the test set of undesirable results, so that each score falls correctly above or below your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (e.g. aliases), we recommend you add them as token or full-name overrides.
Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
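The analysis step of this process can be sketched as a function that splits scored name pairs into false positives and false negatives for a given threshold. The function name, the tuple layout, and all score values below are invented for illustration; they are not RNI output.

```python
def evaluate_threshold(results, threshold):
    """Split (query, candidate, score, should_match) tuples into errors at a threshold."""
    false_positives = []
    false_negatives = []
    for query, candidate, score, should_match in results:
        matched = score >= threshold
        if matched and not should_match:
            false_positives.append((query, candidate, score))
        elif should_match and not matched:
            false_negatives.append((query, candidate, score))
    return false_positives, false_negatives


# Hypothetical scores for illustration only.
sample = [
    ("Jo Shmoe", "Joe Schmoe", 0.80, True),
    ("Bobby Holguin", "Robert Holguin", 0.72, True),
    ("John Smith", "Joan Smythe", 0.78, False),
]
fp, fn = evaluate_threshold(sample, threshold=0.75)
print("scored too high:", fp)
print("scored too low:", fn)
```

Re-running the same function after each parameter change shows whether the tuned values fix the chosen pairs without introducing new errors elsewhere in the test set.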
Parameter configuration files
Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.
The parameters are stored in two .yaml files located in plugins/rni/bt_root/rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.
parameter_defs.yaml lists each match parameter along with the default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them could cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you do not exceed.
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.
Important
Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.
Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.
Parameter profiles
The parameters in the parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the eng_eng profile. There is also an any profile which applies to all language pairs.
Parameter profiles have the following characteristics:
Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples:
spa_eng
ara_jpn
eng_eng
They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.
Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.
The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.
Specific language profiles inherit values from global profiles. The profile for matching person names is named any_PERSON. The profile for matching Spanish person names against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
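The naming and inheritance rules above can be sketched as follows. The helper names (`profile_name`, `lookup_order`, `resolve`) are hypothetical, the parameter values are placeholders, and the precedence order (language pair with entity type, then language pair, then any with entity type, then any) is one reading of the rules above rather than a description of RNI's internals.

```python
def profile_name(lang1, lang2, entity_type=None):
    """Build a profile name: codes in alphabetical order, with eng always last."""
    first, second = sorted([lang1, lang2], key=lambda code: (code == "eng", code))
    name = f"{first}_{second}"
    return f"{name}_{entity_type}" if entity_type else name


def lookup_order(lang1, lang2, entity_type=None):
    """List profiles from most specific to least specific (assumed precedence)."""
    pair = profile_name(lang1, lang2)
    order = []
    if entity_type:
        order.append(f"{pair}_{entity_type}")
    order.append(pair)
    if entity_type:
        order.append(f"any_{entity_type}")
    order.append("any")
    return order


def resolve(profiles, lang1, lang2, entity_type=None):
    """Merge parameter values so that more specific profiles win."""
    merged = {}
    for name in reversed(lookup_order(lang1, lang2, entity_type)):
        merged.update(profiles.get(name, {}))
    return merged


# Placeholder values, chosen only to show which profile wins.
profiles = {
    "any": {"reorderPenalty": 0.2, "conflictScore": 0.5},
    "any_PERSON": {"reorderPenalty": 0.3},
    "spa_eng": {"reorderPenalty": 0.4},
    "spa_eng_PERSON": {"HMMUsageThreshold": 0.8},
}
print(profile_name("eng", "spa"))                   # spa_eng
print(resolve(profiles, "eng", "spa", "PERSON"))
```

In this sketch the spa_eng value for reorderPenalty is kept over the any_PERSON value, matching the statement that any_PERSON does not override more specific profiles.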
Important
Global changes are made with the any profile.
Any changes to address parameters should go under the any profile, and will affect all fields for all addresses.
Any changes to date parameters must go under the any profile.
Parameter universe
A parameter universe is a named profile containing a set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles.
For example, the MyParameterUniverse universe may include the following parameter profiles:
"name": "MyParameterUniverse/any" applies to all language pairs.
"name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.
"name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.
Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.
A parameter universe can also be defined dynamically. We recommend that you use dynamic parameter universes for testing and tuning only. For production use, add all parameter universes to the parameter_profiles.yaml file.
Tip
You can define multiple named parameter profiles.
Define the parameter universe in the parameter_profiles.yaml file. Example:
parameterUniverseOne/spa_eng_PERSON:
  reorderPenalty: 0.4
  HMMUsageThreshold: 0.8
  stringDistanceThreshold: 0.1
  useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
  reorderPenalty: 0.6
Using a parameter universe
To use a parameter universe, add it as part of the name_score function when rescoring names queried from the index. All parameter values defined in the parameter universe will be used, where appropriate.
curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "full_name": "A Ely Taylor"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
Parameter universes can also be used in the query phase. To do so, specify the query name as a JSON string and include the universe in the body.
Note
The parameter universe can only be used in the query phase in RNI-ES 8.6.2.0 and later.
curl -XPOST "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "full_name": "{ \"data\": \"A Ely Taylor\", \"universe\": \"parameterUniverseOne\"}"
    }
  },
  "rescore": {
    "window_size": 3,
    "rni_query": {
      "rescore_query": {
        "rni_function_score": {
          "name_score": {
            "field": "full_name",
            "query_name": "A Ely Taylor",
            "score_to_rescore_restriction": 1,
            "window_size_allowance": 0.5,
            "universe": "parameterUniverseOne"
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}'
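Because the query-phase value is a JSON document embedded in a JSON string, the escaping is easy to get wrong by hand. The following is a minimal sketch that builds the same request body with Python's standard json module; the helper name `universe_query_body` is hypothetical, and all field names and values are taken from the example above. It only constructs the body and does not send a request.

```python
import json


def universe_query_body(field, name, universe, window_size=3):
    """Build a search body that uses a parameter universe in both phases."""
    # The match value is itself JSON, serialized to a string.
    query_value = json.dumps({"data": name, "universe": universe})
    return {
        "query": {"match": {field: query_value}},
        "rescore": {
            "window_size": window_size,
            "rni_query": {
                "rescore_query": {
                    "rni_function_score": {
                        "name_score": {
                            "field": field,
                            "query_name": name,
                            "score_to_rescore_restriction": 1,
                            "window_size_allowance": 0.5,
                            "universe": universe,
                        }
                    }
                },
                "query_weight": 0,
                "rescore_query_weight": 1,
            },
        },
    }


body = universe_query_body("full_name", "A Ely Taylor", "parameterUniverseOne")
# json.dumps escapes the inner quotes when the whole body is serialized.
print(json.dumps(body, indent=2))
```

The printed document can be passed to curl with -d or sent with any HTTP client.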
Dynamic parameter universes
When tuning RNI, you can use the Parameters REST API endpoint to dynamically create or update a parameter universe, overriding the existing parameter values without having to restart Elasticsearch. Once the optimum values are determined for each parameter, add the parameter universe to the parameter_profiles.yaml file for production use.
Tip
Dynamic parameter universes are best suited for testing and tuning the RNI match parameters. Once you determine the best set of parameters, add the parameter universe to the parameter_profiles.yaml file for production use. Using dynamic parameter universes can slow your system down considerably.
Use the Parameters endpoint to create a parameter universe, with parameters and values.
curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d'
{
  "profiles": [
    {
      "name": "parameterUniverseOne/spa_eng_PERSON",
      "parameters": {
        "reorderPenalty": 0.4,
        "HMMUsageThreshold": 0.8,
        "stringDistanceThreshold": 0.1,
        "useEditDistanceTokenScorer": true
      }
    }
  ]
}'
The name of the parameter universe is parameterUniverseOne and it applies to matching person names between Spanish and English.
Modifying name parameters
To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons indicate which parameters to tune; the parameters themselves are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.
Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited parameter_profiles.yaml file.
Selected name parameters
Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.
The following examples describe the impact of parameter changes in more detail.
conflictScore
Consider the two names ‘John Mike Smith’ and ‘John Joe Smith’. The token ‘John’ in the first name is matched with ‘John’ in the second, as is the token ‘Smith’ from each name. This leaves the unmatched tokens ‘Mike’ and ‘Joe’. These two tokens are in direct conflict with each other, and you can determine how the conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores. Alternatively, you may decide that if two of the tokens are the same, the third token (a middle name, for example) is not as important.
initialsScore
Consider the following two names: ‘John Mike Smith’ and ‘John M Smith’. ‘Mike’ and ‘M’ trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know that initials are common in your data sets.
deletionScore
Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example, a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when the number of tokens per name varies widely in your name set.
reorderPenalty
This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter controls the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores the last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.
boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds
These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when dealing with English names, and you are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the more important tokens.
The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.
Language support parameters
RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully supported text domains for name matching lists the languages and scripts with complete support. For all other languages, RNI has limited support.
Note
Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.
To set RNI to behave as it did previously, set allLanguageSupport to false.
Limited support uses two match score computations:
Exact matches return a score of 1. This is the same for all languages.
A score is calculated based on string edit distance.
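To make the two computations concrete, here is a minimal sketch of a score of this kind. The formula RNI uses for the edit distance score is not given in this guide; the normalization below (one minus the Levenshtein distance divided by the longer string's length) is an assumption chosen only for illustration, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def limited_support_score(a, b):
    """Exact matches score 1.0; otherwise use a normalized edit distance."""
    if a == b:
        return 1.0
    # Assumed normalization, not RNI's documented formula.
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


print(limited_support_score("abcd", "abcd"))
print(limited_support_score("abcd", "abce"))
```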
Two parameters control the level of language support.
Parameter | Description | Default |
---|---|---|
| When set to |
|
| When set to |
|
Neural model for matching
When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.
To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.
To use the neural model:
Extract the appropriate library files from the platform-specific TensorFlow JAR provided in the rni-es-<version>-seq2seq-libraries.zip bundle.
Start Elasticsearch with an additional Java property, and point java.library.path to the directory containing the extracted libraries:
ES_JAVA_OPTS="-Dorg.bytedeco.javacpp.cacheLibraries=false -Djava.library.path=<path-to-extracted-libraries>"
Note
The neural model is currently only available on MacOS and Linux platforms in RNI-ES versions 7.10.2.x and all plugins including RNI-RNT 7.38.1.67.0 or later.
Matching Korean names
If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.
To enable Korean readings of names in Han script, edit the parameter files as follows:
Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the list in the ignoreTranslationOrigins parameter.
Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1 to compensate for the additional reading.
Matching names with Han characters
We've added experimental support to leverage mechanisms within the Unicode data to improve matching of Han characters.
The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.
The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown accuracy improvements when setting the value of the parameter to 1.
To enable the feature, add the following lines to your parameter_profiles.yaml file:
zho_zho_PERSON:
  haniFourCornerCodeMismatchPenalty: 1
Note
This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.
Matching Turkish and Vietnamese names
Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.
Edit the parameter_profiles.yaml file.
To enable Turkish detection, add:
detectableLanguagesRuleBased: [tur]
To enable Vietnamese detection, add:
detectableLanguagesRuleBased: [vie]
Restart the system.
Ignore malformed and null value parameters for RNI types
You can index null values and empty strings by updating the allowNullValue parameter. If the allowNullValue parameter is enabled, any document containing null values or empty strings in fields of the rni_name, rni_address, and rni_date types will be successfully indexed, but search capabilities will be limited to valid values.
You can direct RNI to index documents with malformed language strings by updating the ignoreBadData parameter. If the ignoreBadData parameter is enabled, any document containing a malformed language string will be successfully indexed, but search capabilities will be limited to valid languages.
By default these parameters are disabled. These features are useful when performing bulk operations in Elasticsearch.
Both parameters are set in the parameter_profiles.yaml file, located in plugins/rni/bt_root/rlpnc/data/etc/.
To turn either feature on, set the value of the ignoreBadData or allowNullValue parameter in that file to true.
Evaluating parameter configuration
To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.
If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
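For the exact evaluation described above, precision and recall at your threshold are a common way to summarize the results. The sketch below assumes you have a list of (score, is_correct_match) tuples from an annotated dataset; the function name and the sample values are hypothetical.

```python
def precision_recall(results, threshold):
    """results: list of (score, is_correct_match) tuples from an annotated set."""
    tp = sum(1 for score, ok in results if score >= threshold and ok)
    fp = sum(1 for score, ok in results if score >= threshold and not ok)
    fn = sum(1 for score, ok in results if score < threshold and ok)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up scores standing in for results before and after tuning.
before = [(0.90, True), (0.85, False), (0.70, True), (0.40, False)]
after = [(0.92, True), (0.75, False), (0.83, True), (0.40, False)]

print("before:", precision_recall(before, 0.80))
print("after: ", precision_recall(after, 0.80))
```

Comparing the two tuples shows whether tuning removed incorrect matches above the threshold (precision) without losing correct ones (recall).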
Configuring name overrides
RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:
Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.
Name pair matches specify scores to be assigned for specified full-name pairs.
Token pair overrides specify name token pairs that match along with a match score.
Token normalization files specify the normalized form for tokens and variants to normalize to that form.
Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.
The name matching override files are in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override directory.
You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.
Stop patterns and stop word prefixes
Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.
For each name, RNI performs the following steps in order:
Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.
Stop patterns are applied.
Stop words are applied.
RNI cycles its way through the stop patterns then the stop words, each cycle removing the patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
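The steps above can be approximated with the following sketch. It is a simplification under stated assumptions, not RNI's implementation: Python's re module stands in for java.util.regex, a stop word prefix is assumed to match only a whole leading word, and the loop simply repeats both passes until nothing more is stripped. The function names are hypothetical.

```python
import re
import unicodedata


def normalize(name):
    """Character-level normalization as described in the first step."""
    # Remove diacritical marks: decompose, then drop combining characters.
    decomposed = unicodedata.normalize("NFD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strip punctuation except periods, commas, and hyphens.
    kept = "".join(
        ch for ch in no_marks
        if ch in ".,-" or not unicodedata.category(ch).startswith("P")
    )
    # Reduce white space to single spaces and lower-case everything.
    return " ".join(kept.split()).lower()


def strip_stops(name, stop_patterns, stop_prefixes):
    """Apply stop patterns, then stop word prefixes, until nothing changes."""
    # Longer entries are tried before shorter ones.
    patterns = sorted(stop_patterns, key=len, reverse=True)
    prefixes = sorted(stop_prefixes, key=len, reverse=True)
    while True:
        before = name
        for pattern in patterns:
            name = re.sub(pattern, "", name)
        name = " ".join(name.split())
        for prefix in prefixes:
            # Assumption: a prefix must be a whole leading word.
            if name == prefix or name.startswith(prefix + " "):
                name = name[len(prefix):].strip()
        if name == before:
            return name


cleaned = normalize("  FNU  Mr. José  O'Neil-Smith! ")
print(cleaned)
print(strip_stops(cleaned, [r"^fnu\b"], ["mr.", "dr."]))
```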
Stop Pattern
A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.
Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopregexes_LANG[_TYPE].txt
where LANG is a three-letter language code.
Each row in the file, except for rows that begin with #[3], is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.
Tip
Include _TYPE, where TYPE designates an entity type, such as PERSON if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.
RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries
^fnu\b
\blnu$
indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name are to be removed.
You can also specify which field the regex is applied to when processing a fielded name. After the regex, add a tab followed by n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,
\blnu$ 2
\blnu$ 3
indicates that the regex is to be applied to fields 2 and 3 in fielded names.
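As an illustration of the file layout, the sketch below reads stop pattern rows into (pattern, field) tuples. It is an assumption-laden reading of the format: comment rows start with #, an optional tab-separated field number follows the regex, and blank rows are skipped. Python's re module is used only for illustration; RNI itself uses Java regular expressions, so syntax beyond simple patterns may differ. The function name is hypothetical.

```python
import re


def parse_stop_regex_lines(lines):
    """Return (compiled_pattern, field_number_or_None) for each regex row."""
    entries = []
    for raw in lines:
        if raw.startswith("#"):
            continue  # comment row
        line = raw.strip()  # leading and trailing whitespace is removed
        if not line:
            continue  # assumption: blank rows are ignored
        regex, _, field = line.partition("\t")
        field_number = int(field) if field.strip() else None
        entries.append((re.compile(regex.strip()), field_number))
    return entries


rows = ["# unknown-name markers", "^fnu\\b", "\\blnu$\t2", "\\blnu$\t3"]
for pattern, field in parse_stop_regex_lines(rows):
    print(pattern.pattern, "applies to", "all fields" if field is None else f"field {field}")
```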
You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
Stop Word Prefixes
A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.
Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:
stopprefixes_LANG[_TYPE].txt
where LANG is a three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable when colonel is also a stop word prefix.
RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.
Overriding name pair matches
You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:
fullnames_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first name and LANG2 is the three letter language code for the second name.
Tip
Include _TYPE, where TYPE designates an entity type, such as PERSON if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.
Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:
name1 Tab name2 Tab score
The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.
Tip
Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.
The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,
John Doe Joe Bloggs 1.0
indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.
These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
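The file format and the commutative lookup can be sketched as follows. This is an illustration of the row layout only; the function names are hypothetical, and keying the table on an unordered pair is simply one way to model the commutative behavior described above.

```python
def parse_full_name_overrides(lines):
    """Read tab-delimited 'name1<TAB>name2<TAB>score' rows into a lookup table."""
    overrides = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank rows (assumption) and comment rows
        name1, name2, score_text = line.split("\t")
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range in row: {line!r}")
        # An unordered key makes the lookup commutative.
        overrides[frozenset((name1.strip(), name2.strip()))] = score
    return overrides


def override_score(overrides, query_name, index_name):
    """Return the override score for a pair, or None if no override exists."""
    return overrides.get(frozenset((query_name, index_name)))


table = parse_full_name_overrides(["# sample entries", "John Doe\tJoe Bloggs\t1.0"])
print(override_score(table, "John Doe", "Joe Bloggs"))
print(override_score(table, "Joe Bloggs", "John Doe"))
```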
You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example, the following entries could appear in fullnames_jpn_eng.txt:
外山恒 Toyama Koichi 1.0
ヒラリークリントン Hillary Clinton 1.0
Overriding token pair matches
You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported[4] for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English, and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
tokens_LANG1_LANG2[_TYPE].txt
where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:
Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]
A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:
Peter Pete NICKNAME
Peter Pedro COGNATE
This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.
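The token override row syntax described above can be read with a small parser. This sketch is an interpretation of the documented syntax, not RNI's own parser: the third column is optional and defaults to NICKNAME, a "/force" suffix marks the value as exact, and SUPPRESS is treated as 0.0 with force. The function name and the returned dictionary layout are hypothetical.

```python
INDICATORS = {"NICKNAME", "COGNATE", "VARIANT"}


def parse_token_override(line):
    """Parse 'Token1<TAB>Token2<TAB>[score or indicator][/force]'."""
    fields = line.rstrip("\n").split("\t")
    token1, token2 = fields[0], fields[1]
    spec = fields[2].strip() if len(fields) > 2 else ""
    if not spec:
        spec = "NICKNAME"  # documented default when no indicator is given
    if spec == "SUPPRESS":
        # Documented alias for "0.0/force".
        return {"tokens": (token1, token2), "value": 0.0, "force": True}
    force = spec.endswith("/force")
    if force:
        spec = spec[: -len("/force")]
    value = spec if spec in INDICATORS else float(spec)
    return {"tokens": (token1, token2), "value": value, "force": force}


for row in ["Peter\tPete\tNICKNAME", "Peter\tPedro\tCOGNATE", "Peter\tPetra\tSUPPRESS"]:
    print(parse_token_override(row))
```

Without "/force" the value acts as a minimum score; the sketch only records the flag and leaves the scoring itself to RNI.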
When you create an additional file in the same location, use the ISO 639-3 three-letter language codes in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.
We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.
Multiple sets of token overrides
There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.
The value of overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.
The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the overrideSelector of OverrideGroup1 would be named: tokens_eng_eng_PERSON-OverrideGroup1.txt
If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
Note
Overrides that are associated with a specific selector are not additive to the base overrides. If a custom overrideSelector value is specified, RNI will only consider overrides in that specific selector. As with the base overrides, for a given selector, RNI will consider non-entity-type overrides for that selector if no entity-type-specific override pair is found for that selector.
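The filename convention can be illustrated with a regular expression that splits a token override filename into its parts. This is a sketch under assumptions: the pattern, the function name, and the choice to lower-case the selector (to reflect that it is case-insensitive) are not taken from RNI. Unlike RNI, which falls back to the "default" selector when no valid selector is found, this simplified version rejects filenames it cannot parse.

```python
import re

# tokens_LANG1_LANG2[_TYPE][-SELECTOR].txt
TOKEN_FILE = re.compile(
    r"^tokens_(?P<lang1>[a-z]{3})_(?P<lang2>[a-z]{3})"
    r"(?:_(?P<type>[A-Z]+))?"
    r"(?:-(?P<selector>[A-Za-z0-9]+))?\.txt$"
)


def parse_token_override_filename(filename):
    """Return (lang1, lang2, entity_type_or_None, selector)."""
    match = TOKEN_FILE.match(filename)
    if match is None:
        raise ValueError(f"not a token override filename: {filename}")
    # Files without a selector belong to the "default" selector.
    selector = (match.group("selector") or "default").lower()
    return match.group("lang1"), match.group("lang2"), match.group("type"), selector


print(parse_token_override_filename("tokens_eng_eng_PERSON-OverrideGroup1.txt"))
print(parse_token_override_filename("tokens_eng_eng.txt"))
```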
Normalizing token variants
You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:
equivalenceclasses_LANG[_TYPE].txt
For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.
Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:
[normal_form1] variant1_1 variant1_2 variant1_3 [normal_form2] variant2_1 variant2_2 variant2_3 ...
RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:
[muhammad] mohammed mahamed mohamed mohamad mohammad muhammed muhamed muhammet muhamet md mohd muhd
You can add lists of variants to this file, including the normalized form in square brackets to start each list.
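To show how such a file maps variants to a normalized form, the sketch below builds a variant lookup table. It makes two assumptions that are not confirmed by this guide: tokens are separated by whitespace (spaces or line breaks), and a bracketed token starts a new list. The function name is hypothetical.

```python
def parse_equivalence_classes(text):
    """Map each variant to the bracketed normalized form that precedes it."""
    variant_to_normal = {}
    current = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            current = token[1:-1]  # a new normalized form starts a new list
        elif current is not None:
            variant_to_normal[token] = current
    return variant_to_normal


sample = "[muhammad] mohammed mahamed mohamed mohamad mohammad muhammed muhamed muhammet muhamet md mohd muhd"
mapping = parse_equivalence_classes(sample)
print(mapping["mohd"])
print(len(mapping), "variants")
```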
Unimportant tokens
You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.
The file name is lowWeightTokens_LANG.txt.
For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".
Matching organizations with real world IDs
Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.
RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name matching within a language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.
Parameter | Description | Default |
---|---|---|
| Enables real world IDs and indexes them as corporation names are added to the index. You must reindex if you enable it after indexing. |
|
| Enables querying with real world IDs; set by language pair. |
|
realWorldIdScore | Sets the match score when two names match due to matching real world IDs. Set by language pair. | 0.98 |
nameRealWorldQueryBoost | Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair. | 35 |
Building a real world ID file
Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world IDs provided in RNI and build your own file of real world IDs. The provided builder creates a binary file in the specified output directory named <LANG>_ORGANIZATION_ids.bin, where <LANG> is the three-letter language code of the file.
The input file is a tab-separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. Each file can contain only a single language and script, so you must create a separate file for each language.
IBM WE1X92
Big Blue WE1X92
International Business Machines WE1X92
Unzip the file realWorldIDBuilder.zip, found in the plugins/rni/bt_root directory, and run the build command. Instructions on how to run the program are in the README.md file in the zip file.
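As a hedged illustration of the input format described above, the following Python sketch writes the three example rows to a tab-separated UTF-8 file. The helper name write_rwid_tsv and the file name eng_orgs.tsv are hypothetical; the builder's actual command line is documented in the README.md inside the zip file and is not reproduced here.

```python
import os
import tempfile


def write_rwid_tsv(path, rows):
    """Write (organization name, ID) pairs as tab-separated UTF-8 lines."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name, rwid in rows:
            handle.write(f"{name}\t{rwid}\n")


# Example rows from the guide: three names that share one ID.
ROWS = [
    ("IBM", "WE1X92"),
    ("Big Blue", "WE1X92"),
    ("International Business Machines", "WE1X92"),
]


def demo():
    # Hypothetical file name; use one file per language and script.
    path = os.path.join(tempfile.mkdtemp(), "eng_orgs.tsv")
    write_rwid_tsv(path, ROWS)
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

Generating the file from code rather than by hand makes it easier to guarantee that the separator is a real tab character and not a run of spaces.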
Omit real world IDs
You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.
The omit file is a tab-separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv, where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.
Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.
Example:
IBM	Q37156
Nintendo	*
*	Q45700
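The three line types can be told apart mechanically. The sketch below is an illustration only, not RNI's own parser: the function name classify_omit_line and the labels it returns are hypothetical, and it assumes exactly two tab-separated columns per line.

```python
def classify_omit_line(line):
    """Return (kind, name, qid) for one tab-separated omit-file line."""
    name, qid = line.rstrip("\n").split("\t")
    if name != "*" and qid != "*":
        return ("pair", name, qid)          # this QID is not used for this name
    if name != "*" and qid == "*":
        return ("lone_name", name, None)    # name is never matched by RWID
    if name == "*" and qid != "*":
        return ("lone_qid", None, qid)      # QID is never used in this language
    raise ValueError("a line must contain a name, a QID, or both")


EXAMPLE = "IBM\tQ37156\nNintendo\t*\n*\tQ45700\n"
parsed = [classify_omit_line(row) for row in EXAMPLE.splitlines()]
```

Running a check like this over an omit file before deploying it can catch lines where a space was typed instead of a tab.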
To enable an omit file in RNI:
Place the omit file in the BT_ROOT directory.
Open omit_ids.datafiles, which is in the plugins/rni/bt_root/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:
ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
Save omit_ids.datafiles.
Address matching
The RNI plugin can match addresses in English, Traditional Chinese, and Simplified Chinese, returning a match score reflecting the similarity of two addresses.
In the RNI context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.
As with name and date matching, the process is to create an index containing addresses, then query an address against the index.
Note
Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.
Address definition
Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library[5] is used to parse the address string into address fields.
When entered as a set of fields, the address may include any of the fields below. At least one field must be specified, but no specific fields are required.
RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields such as house number, postal code, and unit, are matched using character-based methods.
Field Name | Description | Example(s) |
---|---|---|
| venue and building names | "Brooklyn Academy of Music", "Empire State Building" |
| usually refers to the external (street-facing) building number | "123" |
| street name(s) | "Harrison Avenue" |
| an apartment, unit, office, lot, or other secondary unit designator | "Apt. 123" |
| expressions indicating a floor number | "3rd Floor", "Ground Floor" |
| numbered/lettered staircase | "2" |
| numbered/lettered entrance | "front gate" |
| usually an unofficial neighborhood name | "Harlem", "South Bronx", "Crown Heights" |
| these are usually boroughs or districts within a city that serve some official purpose | "Brooklyn", "Hackney", "Bratislava IV" |
| any human settlement including cities, towns, villages, hamlets, localities, etc. | "Boston" |
| named islands | "Maui" |
| usually a second-level administrative division or county | "Saratoga" |
| a first-level administrative division | "Massachusetts" |
| informal subdivision of a country without any political status | "South/Latin America" |
| sovereign nations and their dependent territories, which have a designated ISO-3166 code | "United States of America" |
| currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | "Jamaica, West Indies" |
| postal codes used for mail sorting | "02110" |
| post office box: typically found in non-physical (mail-only) addresses | "28" |
Address field groups
When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.
When matching two fields, the following penalties are applied:
If the fields are the same, no penalty is applied. (street - street)
If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
If the fields are in different field groups, a large penalty is applied. (road - city)
Group | Fields |
---|---|
house | house |
house_number | houseNumber |
road | road |
unit | unit level staircase entrance |
city | suburb cityDistrict city |
state | island stateDistrict state |
country | countryRegion country worldRegion |
post_code | postCode |
po_box | po_box |
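The three penalty tiers can be read directly off the group table. The following Python sketch encodes the table and returns which tier applies to a pair of fields. The tier labels ("none", "small", "large") and the function name are hypothetical; the actual penalty values RNI applies are not stated in this guide.

```python
# Field groups copied from the table above.
FIELD_GROUPS = {
    "house": ["house"],
    "house_number": ["houseNumber"],
    "road": ["road"],
    "unit": ["unit", "level", "staircase", "entrance"],
    "city": ["suburb", "cityDistrict", "city"],
    "state": ["island", "stateDistrict", "state"],
    "country": ["countryRegion", "country", "worldRegion"],
    "post_code": ["postCode"],
    "po_box": ["po_box"],
}

# Reverse lookup: field name -> group name.
GROUP_OF = {
    field: group for group, fields in FIELD_GROUPS.items() for field in fields
}


def penalty_tier(field_a, field_b):
    """Return the penalty tier for matching a value in field_a with field_b."""
    if field_a == field_b:
        return "none"
    if GROUP_OF[field_a] == GROUP_OF[field_b]:
        return "small"
    return "large"
```

For example, penalty_tier("suburb", "city") falls in the small tier because both fields belong to the city group, while penalty_tier("road", "city") falls in the large tier.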
How Rosette calculates address match scores
The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.
Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the addresses into address fields. This process works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.
For example, most addresses are formatted from specific to general:
houseNumber road city state postCode
The parser would provide predictable results for an address in an expected order:
38 Concord Road, Apt. B Arlington MA
The parser would have more difficulty if the address format was in an unexpected order:
Arlington MA Concord Road #38 Apt B
If you are getting unexpected match values, check how the addresses are being parsed into address fields.
Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.
Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:
Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.
Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.
Postal codes: Rosette uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.
Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.
Field Weights: Some fields in an address are considered more important than other fields. The score from each selected match is weighted by field type. These field type weightings can be modified based on the type of address data in your system.
Using address matching
Index addresses
Create an index.
curl -XPUT 'http://localhost:9200/rni-test'
Define a mapping for fields that will contain addresses. The type for each of these fields is "rni_address".
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "primary_name" : { "type" : "rni_name" }, "residence" : { "type" : "rni_address" } } }'
Index documents containing an address field.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "residence" : { "houseNumber" : "123", "road" : "Main St", "city" : "Boston", "state" : "Massachusetts", "postCode" : "02110" } }'
The address in the document can also be defined as a string.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "residence" : "123 Main St, Boston, Massachusetts, 02110" }'
Query field addresses
RNI compares the fields in the query with the fields in the index, matching each non-blank field. Addresses do not have to contain all the same fields to be compared and matched.
As with other objects, the query for an address consists of two parts: the base query and the RNI pairwise address match rescore query.
Base Query. The base query is a standard query against the address field. Refer to Query the Index.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}" } } }'
RNI Rescore with Addresses. Refer to Rescoring with RNI Pairwise Name Match.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "residence" : "{\"road\" : \"Main\", \"state\" : \"MA\"}" } }, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "address_score" : { "field" : "residence", "query_address" : { "road" : "Main", "state" : "MA" } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
The query returns a hit with the RNI address match score.
"hits": { "total" : 1, "max_score" : 0.6057692, "hits" : [ { "_index" : "rni-test", "_type" : "_doc", "_id" : "1", "_score" : 0.6057692, "_source" : { "primary_name" : "Joe Schmoe", "residence" : { "houseNumber" : "123", "road" : "Main St", "city" : "Boston", "state" : "Massachusetts", "postCode" : "02110" } } } ] }
The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
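When queries are generated from application code, it can be easier to build the request body as a data structure than to hand-escape JSON inside a shell string. The sketch below builds the same base query plus address_score rescore shown above. The function name address_rescore_body is hypothetical, and sending the body to the _search endpoint (with any HTTP client) is left out.

```python
import json


def address_rescore_body(field, query_address):
    """Build a search body: base match query plus address_score rescore.

    query_address may be a dict of address fields or a plain string.
    """
    # The base match query takes a string, so a fielded address is
    # serialized to JSON text; the rescore accepts the object directly.
    if isinstance(query_address, str):
        base_value = query_address
    else:
        base_value = json.dumps(query_address)
    return {
        "query": {"match": {field: base_value}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "function_score": {
                        "address_score": {
                            "field": field,
                            "query_address": query_address,
                        }
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            }
        },
    }


body = address_rescore_body("residence", {"road": "Main", "state": "MA"})
payload = json.dumps(body)  # text suitable for the -d argument of curl
```

Letting json.dumps produce the escaped string for the base query avoids the quoting mistakes that are easy to make when writing the escaped form by hand.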
Query string addresses
The address can be structured as a string for queries. The address structure for the query is independent of the format of the address in the original document. A string can be used in the query regardless of whether the indexed address was formatted with fields or as a string.
Base Query. The base query constructed with an address string.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : {"residence" : "Main, MA"} } }'
RNI Rescore with Addresses. The rescore query with an address string.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "residence" : "Main, MA" }}, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "address_score" : { "field" : "residence", "query_address" : "Main, MA" } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
The response displayed here returns the address as a string because the indexed document used in this example represented the address as a string. The response will return the address in the same format as the indexed document. The format of the query does not have to match the format of the indexed documents.
"hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.4552421, "hits" : [ { "_index" : "rni-test", "_type" : "_doc", "_id" : "1", "_score" : 0.4552421, "_source" : { "primary_name" : "Joe Schmoe", "residence" : "123 Main St, Boston, Massachusetts, 02110" } } ] }
The address match score is a measure of how similar the addresses are. Similar addresses have a stronger match and their address match score is closer to 1.
Configuring address matching
Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.
There are two types of override files for addresses:
Stop patterns and stop word prefixes designate address field elements to strip during indexing and queries.
Token pair overrides specify address field elements pairs that match.
File Directories
The parameters are modified in the plugins/rni/bt_root/rlpnc/data/etc/parameter_profiles.yaml file.
The address matching override files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/overrides directory.
The address stop word files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords directory.
Modifying address parameters
To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter configuration files.
Note
Changes made to the any profile apply to all supported languages.
An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify the value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.
Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields; each is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.
Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.
Address parameters
Stop patterns and stop word prefixes
RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries before matching algorithms are applied. Using string literals to strip prefixes can be performed more quickly than the application of stop patterns (regular expressions), so you should use stop words for the efficient removal of prefixes, such as the, that you do not want to include in address matching.
For each address field, RNI performs the following steps in order:
Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.
Stop patterns are applied.
Stop words are applied.
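The order of these steps matters: because normalization runs first, stop patterns and stop word prefixes see lower-cased text with punctuation already removed. The Python sketch below illustrates the pipeline under stated assumptions: it uses Python's re module rather than java.util.regex, it deletes punctuation outright rather than replacing it with a space, and it strips at most one prefix. None of these details are confirmed by this guide, and the function name normalize_field is hypothetical.

```python
import re

_PUNCTUATION = re.compile(r"[.,\-#]")  # periods, commas, hyphens, number sign
_WHITESPACE = re.compile(r"\s+")


def _squeeze(text):
    """Reduce runs of white space to single spaces and trim the ends."""
    return _WHITESPACE.sub(" ", text).strip()


def normalize_field(value, stop_patterns=(), stop_prefixes=()):
    # Step 1: character-level normalization.
    text = _squeeze(_PUNCTUATION.sub("", value)).lower()
    # Step 2: stop patterns (regular expressions), longest pattern first.
    for pattern in sorted(stop_patterns, key=len, reverse=True):
        text = _squeeze(re.sub(pattern, "", text))
    # Step 3: stop word prefixes (string literals), longest prefix first.
    for prefix in sorted(stop_prefixes, key=len, reverse=True):
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    return text
```

With this sketch, normalize_field("The United States, ", stop_prefixes=["the "]) yields "united states", and normalize_field("Apt. #12B", stop_patterns=[r"\bapt\b"]) yields "12b".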
Stop pattern
A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.
Stop patterns for a given address field are specified in a UTF-8 file with the AddressField name:
stopregexes_LANG_ADDRESS_FIELD__FIELD.txt
where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[6] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end where needed.
Note
The delimiter before FIELD is a double underscore (__).
Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.
Stop pattern files are arranged by field in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.
Use of complex patterns may increase processing time. When possible, use stop word prefixes.
Stop word prefixes
A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.
Stop word prefixes for a given address field are specified in a UTF-8 file with the AddressField name:
stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt
where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[7] is a string literal.
Note
The delimiter before FIELD is a double underscore (__).
Prefixes in the address field matching any of these string literals are removed.
Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.
RNI includes files with stop word prefixes for selected address fields in English and Chinese. These files are in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stop word prefixes for use on the CITY address field for English.
Overriding token pair matches
You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.
The token pairs may be within a language or cross-lingual, as indicated by the file name:
LANG1_LANG2_FIELD.txt
where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three-letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value will be used.
Token1 Tab Token2 Tab [0.0-1.0]
A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:
Token1 Tab Token2 Tab [0.0-1.0]/force
If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
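To make the entry syntax concrete, the following Python sketch parses one override row into its parts. It is an illustration of the format described above, not RNI's parser. The function name parse_override_line is hypothetical, and default_score stands in for the value of the addressOverrideDefaultScore parameter, which this guide does not state.

```python
def parse_override_line(line, default_score):
    """Parse one override row into (token1, token2, score, forced).

    Returns None for comment rows (starting with #) and blank rows.
    """
    if line.startswith("#") or not line.strip():
        return None
    parts = line.rstrip("\n").split("\t")
    token1, token2 = parts[0], parts[1]
    spec = parts[2].strip() if len(parts) > 2 else ""
    if spec == "SUPPRESS":
        # SUPPRESS is an alias for "0.0/force".
        return (token1, token2, 0.0, True)
    forced = spec.endswith("/force")
    if forced:
        spec = spec[: -len("/force")]
    score = float(spec) if spec else default_score
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    return (token1, token2, score, forced)
```

A row without a score falls back to default_score and is treated as a minimum, a row ending in /force is marked as an exact score, and a SUPPRESS row is returned as a forced score of 0.0.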
RNI includes plugins/rni/bt_root/rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:
Massachusetts	MA
California	CA
When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field each token element in the pair pertains to. For example, zho_eng_cityDistrict.txt indicates that the contents match Chinese - English cityDistrict address fields.
Date matching
RNI can match dates, returning a date match score reflecting the time similarity of the two dates. Dates that are closer together are considered a stronger match and return a match score closer to 1.
For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.
The process is similar to name matching:
Index the dates in connection to the related names.
Query the date and name, receiving back a match score.
The query will return separate match scores for the name and for the associated date of birth. You may decide that the name is more important than the birth date. Within your system, you can weight and combine the name and date match scores to determine the final match score.
Date definition
A date contains a year, month, and day, but not all fields are required for matching. All common delimiters for English dates are supported, and dates can be expressed with various orderings. RNI will filter out some non-date related words. Formats that include time of day are not supported.
You can specify an Elasticsearch date format that includes time information in the mapping. The time component will be ignored.
RNI supports a wide variety of date formats. The best date format will always be the ISO standard of YYYY-MM-DD, where March 7, 1984 is written as 1984-03-07. RNI will attempt to interpret any date provided, although the less standard the format, the less certain it is that the interpretation will be the one you expect.
Dates can be represented as YYYY-MM-DD. When some fields are unspecified, the letters represent the unknown values. For example, March 7 is YYYY-03-07, since the year is unspecified. Two-digit years are assumed to have unknown centuries, so 3/7/84 is interpreted as YY84-03-07 and matches March 7, 1984 equally as well as March 7, 2084 or March 7, 1884.
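The placeholder notation can be sketched in a few lines of Python. This only illustrates how unknown parts are written in this guide; it says nothing about how RNI stores or parses dates internally. The function name partial_date is hypothetical, and the year is passed as a string so that a two-digit year keeps its digit count.

```python
def partial_date(year=None, month=None, day=None):
    """Render a possibly incomplete date in the guide's YYYY-MM-DD notation.

    year is a string of two or four digits (or None); month and day are
    integers (or None). Unknown parts are written as letter placeholders.
    """
    if year is None:
        year_part = "YYYY"
    elif year.isdigit() and len(year) in (2, 4):
        # A two-digit year has an unknown century: "84" becomes "YY84".
        year_part = year.rjust(4, "Y")
    else:
        raise ValueError("this sketch handles only two- or four-digit years")
    month_part = "MM" if month is None else f"{month:02d}"
    day_part = "DD" if day is None else f"{day:02d}"
    return f"{year_part}-{month_part}-{day_part}"
```

For example, partial_date(month=3, day=7) gives "YYYY-03-07" and partial_date("84", 3, 7) gives "YY84-03-07", matching the two cases described above.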
When a date is provided, RNI will attempt to identify the year, month, and day within it, leaving blank any fields it cannot determine. You can omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12- are all valid dates.
If RNI encounters an invalid date in an acceptable format, such as March 38, 1984, it will not return an error. Instead, it treats the impossible value as unknown, yielding March 1984.
Supported date formats
RNI supports a wide variety of date formats.
Days can be represented by 1 or 2 digits.
Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).
Years can be represented by 1, 2, 3 or 4 digits.
Supported delimiters include , . - / as well as a space.
Partial fields can be entered.
At this time, only English month names and abbreviations are recognized.
All words are case-insensitive; upper and lower case are interpreted the same.
The following table shows different acceptable formats for the date March 7, 1984.
Format | Valid Examples | Notes |
---|---|---|
Y-M-D | 1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7 | |
M-D | 03-07; 3/7; Mar-07; March 7 | |
Y-M | 1984-03; 1984 March; 1984-Mar | |
YYYYMMDD | 19840307 | All 8 digits must be included |
M-D-Y | 03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984 | |
M-YYYY | 03-1984; March 1984; Mar-1984 | The year must include 4 digits. March-84 will not be recognized. |
D-M-Y | 07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984 | |
D-M | 07-03; 7/3; 07-Mar; 7 March | |
D(MONTH)Y | 7MAR84; 07March1984 | The month is a word or abbreviation |
YYYY | 1984 | |
Month | March |
Using date matching
Index dates
Create an index.
curl -XPUT 'http://localhost:9200/rni-test'
Define a mapping for fields that will contain dates. The type for a date field when matching is "rni_date".
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date" }, "primary_name" : { "type" : "rni_name" } } }'
Optionally, in the mapping, you can specify an Elasticsearch date format. All dates must adhere to the specified format. If you specify a format that includes time information, RNI ignores the time component of the date.
Warning
Specifying an Elasticsearch format disables support for unspecified fields. If, for example, you select a format that does not include a day field ("MM-yyyy"), you will get an error when you use the date format in a query.
curl -XPUT 'http://localhost:9200/rni-test/_mapping' -H'Content-Type: application/json' -d '{ "properties" : { "birth_date" : { "type" : "rni_date", "format" : "MM-yyyy-dd" }, "primary_name" : { "type" : "rni_name" } } }'
Index documents containing a date field.
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "primary_name" : "Joe Schmoe", "birth_date" : "07-1955-24" }'
Query dates
There are many ways to incorporate date matching within your query. Here are two examples, one with date matching by itself, and one with date and name matching.
Basic Date Matching
Base Query. The base query is a standard query against the date field. Refer to Query the Index.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "birth_date" : "08-1955-25" } } }'
RNI Rescore with Dates. Refer to Rescoring with RNI Pairwise Name Match.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "match" : { "birth_date" : "08-1955-25" } }, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "date_score" : { "field" : "birth_date", "query_date" : "08-1955-25" } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
The query returns a hit, with the RNI date match score.
"hits": { "total": 1, "max_score": 1.618923, "hits": [ { "_index": "test", "_type": "_doc", "_id": "AVXMepnorGuybmuiQtQr", "_score": 0.8120856, "_source": { "primary_name": "Joe Schmoe", "birth_date": "07-1955-24" } } ] }
Date and Name Match
Base Query. The base query is a standard query against the date and name fields.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query": { "bool": { "should": [ { "match": { "primary_name": "Joe S." } }, { "match": { "birth_date": "08-1955-25" } } ] } } }'
RNI Rescore with Dates. Use the doc_score function in the rescore when matching a combination of Elasticsearch field types instead of the functions for a single type (name_score and date_score). The name field is also added to the rescore.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query": { "bool": { "should": [ { "match": { "primary_name": "Joe S." } }, { "match": { "birth_date": "08-1955-25" } } ] } }, "rescore": { "query": { "rescore_query": { "function_score": { "doc_score": { "fields": { "primary_name": { "query_value": "Joe S." }, "birth_date": { "query_value": "08-1955-25" } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
Date match parameters
As with name matching, there is a series of date matching parameters. The parameter values can be edited in the plugins/rni/bt_root/rlpnc/data/etc/parameter_defs.yaml file.
Record matching
A search can include multiple fields and return a single match and match score. The fields can be any combination of type rni_name, rni_date, rni_address, or any other Elasticsearch field type.
Each field can be assigned a weight to reflect its importance in the overall matching logic. When searching for a match, some fields are more important in determining a match than others. For example, the name field is likely more important in determining a match than an address field. If no weights are defined, each field is weighted equally.
When matching records, a similarity score is calculated for each field. The final match score is then calculated as a weighted arithmetic mean of the similarity scores. If a field is missing from a document, that field is removed from the score calculation and its weight is evenly distributed across other fields. You can override this behavior by using the score_if_null option to specify a score to be returned if the field is null in the index document.
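The weighted mean can be sketched as follows. This is a minimal illustration under stated assumptions, not RNI's implementation: a missing field is handled by dropping it and dividing by the sum of the remaining weights, which is one simple reading of the redistribution described above, and RNI's exact redistribution may differ. The function name record_score and its argument shapes are hypothetical.

```python
def record_score(field_scores, weights=None, score_if_null=None):
    """Weighted arithmetic mean of per-field similarity scores.

    field_scores maps field name -> similarity in [0, 1], or None when the
    field is missing from the indexed document. weights maps field name ->
    weight (default 1.0). score_if_null maps field name -> score to use
    when that field is missing.
    """
    weights = weights or {}
    score_if_null = score_if_null or {}
    total = 0.0
    weight_sum = 0.0
    for field, score in field_scores.items():
        if score is None:
            if field not in score_if_null:
                continue  # assumption: drop the field and renormalize
            score = score_if_null[field]
        weight = weights.get(field, 1.0)
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no scorable fields")
    return total / weight_sum
```

For instance, with scores {"name": 1.0, "dob": 0.5, "address": None} and weights {"name": 4, "dob": 2, "address": 2}, the sketch returns (4 × 1.0 + 2 × 0.5) / 6 ≈ 0.833. Supplying score_if_null={"address": 0.0} keeps the address weight in the denominator and lowers the result to 5 / 8 = 0.625.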
Use the doc_score function in the rescore query when matching records that include multiple field types, instead of the functions for a single type, such as the name_score and date_score functions. The doc_score function has built-in similarity functions for many core types. It does not, however, currently support multiple nested fields.
If your record query contains types which the doc_score function doesn't support, you can create a custom similarity function using the Elasticsearch script_score function in the rescore query.
Supported field types
The doc_score function has default support for rni_name, rni_date, rni_address, and many of the Elasticsearch core field types. All default similarity scores are between 0.0 and 1.0.
Field Type(s) | Default Similarity Function | Example(s) |
---|---|---|
rni_name | name_score (refer to RNI pairwise match score) | 'John David Smith' vs 'Jon D Smith' = 0.88 |
rni_date, date | date_score (refer to Date matching) | '2010-11-4' vs '2010-5-11' = 0.92 |
rni_address | address_score (refer to Address matching) | 'Red Cedar Ct' vs 'Cedar Ct' = 0.53 |
keyword, text, string | Normalized edit distance | '37 Congress St.' vs '35 Congres St.' = 0.875 |
integer, long, short, double, float | Normalized difference (e.g., percentage) | '65' vs '59' = 0.908 |
boolean | Equality | 'true' vs 'true' = 1.0, 'true' vs 'false' = 0.0 |
geo_point | Log function over Haversine distance | '[lat=42.361145, lon=-71.057083]' vs '[lat=42.3736, lon=-71.1097]' = 0.83 |
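The two simplest rows of the table can be illustrated in Python. The numeric formula below, 1 - |a - b| / max(|a|, |b|), is an assumption: it reproduces the table's example ('65' vs '59' = 0.908 after rounding), but this guide does not confirm that RNI uses exactly this normalization. Both function names are hypothetical.

```python
def numeric_similarity(a, b):
    """Assumed normalized difference: 1 - |a - b| / max(|a|, |b|)."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    # Clamp at 0.0 so values of opposite sign never go negative.
    return max(0.0, 1.0 - abs(a - b) / largest)


def boolean_similarity(a, b):
    """Equality: 1.0 when the values are the same, otherwise 0.0."""
    return 1.0 if a == b else 0.0
```

Under this assumption, 65 versus 59 gives 1 - 6/65, which rounds to 0.908.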
Using record matching
Index records
Create an index with a mapping containing fields with different types
curl -XPUT 'http://localhost:9200/rni-test' -H'Content-Type: application/json' -d '{ "mappings" : { "properties" : { "name" : { "type" : "rni_name" }, "dob" : { "type" : "rni_date" }, "address" : { "type" : "rni_address" }, "height" : { "type" : "integer" }, "nationality" : { "type" : "keyword" } } } }'
Index documents that contain those fields
curl -XPUT 'http://localhost:9200/rni-test/_doc/1' -H'Content-Type: application/json' -d '{ "name" : "Ryan McDonagh", "dob" : "11/19/1987", "address" : { "houseNumber" : "47", "road" : "Park St", "city" : "Boston", "state" : "MA" }, "nationality" : "USA", "height" : 65 }'
Basic multi-field query
The query can be a record containing multiple fields. The fields in the query record must be mapped to those of the indexed documents.
Base Query. The base query is a standard Elasticsearch query containing multiple fields that will return candidates for rescoring.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "{\"data\" : \"Brian McDonough\", \"entityType\": \"PERSON\"}" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\"}" } } ] } } }'
RNI Rescore with Records. Use the doc_score function to rescore the indexed documents against a query record.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "Brian McDonough" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\"}" } } ] } }, "rescore" : { "rni_query" : { "rescore_query" : { "function_score" : { "doc_score" : { "fields" : { "name" : { "query_value": "Brian McDonough" }, "dob" : { "query_value": "10/19/87" }, "address" : { "query_value" : { "houseNumber" : "48", "road" : "Parker St", "city" : "Boston", "state" : "MA" } }, "height" : { "query_value": 67 }, "nationality" : { "query_value": "CANADA" } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
As with addresses, the query_value of names can be an object to match additional name information. The rescore query above can be modified to additionally match against a name's entityType field:
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "Brian McDonough" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{\"houseNumber\":\"48\",\"road\":\"Parker St\",\"city\":\"Boston\",\"state\": \"MA\" }" } } ] } }, "rescore" : { "rni_query" : { "rescore_query" : { "function_score" : { "doc_score" : { "fields" : { "name" : { "query_value": { "data": "Brian McDonough", "entityType": "PERSON" } }, "dob" : { "query_value": "10/19/87" }, "address" : { "query_value" : { "houseNumber" : "48", "road" : "Parker St", "city" : "Boston", "state" : "MA" } }, "height" : { "query_value": 67 }, "nationality" : { "query_value": "CANADA" } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
Note
The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.
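When the request body is generated by a program rather than typed by hand, the escaping described in the note can be left to a JSON library. The following is a minimal sketch, not part of the plugin: the helper name build_record_query is hypothetical, and the rescore layout simply mirrors the doc_score example above. It serializes the address to a string for the base query and passes the same address as an object to the rescore.

```python
import json


def build_record_query(name, dob, address):
    """Build a search body with a base query and a doc_score rescore.

    Hypothetical helper for illustration; the key layout mirrors the
    doc_score example in this guide.
    """
    # The base match query needs the address as a string, so serialize it.
    address_as_string = json.dumps(address)
    return {
        "query": {"bool": {"should": [
            {"match": {"name": name}},
            {"match": {"dob": dob}},
            {"match": {"address": address_as_string}},
        ]}},
        "rescore": {"rni_query": {
            "rescore_query": {"function_score": {"doc_score": {"fields": {
                "name": {"query_value": name},
                "dob": {"query_value": dob},
                # The rescore accepts the address as a plain object.
                "address": {"query_value": address},
            }}}},
            "query_weight": 0.0,
            "rescore_query_weight": 1.0,
        }},
    }


address = {"houseNumber": "48", "road": "Parker St", "city": "Boston", "state": "MA"}
body = build_record_query("Brian McDonough", "10/19/87", address)
# json.dumps escapes the quotes inside the address string automatically.
print(json.dumps(body))
```

The printed body can be sent as the payload of the _search request in place of the hand-escaped string.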
Weighted multi-field query
Each field can be given a weight to reflect its importance in the overall matching logic.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "Brian McDonough" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } } ] } }, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "doc_score": { "fields": { "name": { "query_value": "Brian McDonough", "weight": 4 }, "dob": { "query_value": "10/19/87", "weight": 2 }, "address" : { "query_value" : { "houseNumber" : "48", "road" : "Parker St", "city" : "Boston", "state" : "MA" }, "weight" : 2 }, "height" : { "query_value": 67, "weight": 0.5}, "nationality" : { "query_value": "CANADA", "weight": 1 } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
By default, if a queried-for field is null in the index, the field is removed from the score calculation, and the weights of the other fields are redistributed. However, you can override this behavior by using the score_if_null option to specify what score should be returned for this field if it is null in the index document.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "Brian McDonough" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } } ] } }, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "doc_score" : { "fields" : { "name" : { "query_value": "Brian McDonough", "weight": 4, "score_if_null" : 0.0 }, "dob": { "query_value": "10/19/87", "weight": 2 }, "address" : { "query_value" : { "houseNumber" : "48", "road" : "Parker St", "city" : "Boston", "state" : "MA" }, "weight" : 2 }, "height" : { "query_value": 67, "weight": 0.5}, "nationality" : { "query_value": "CANADA", "weight": 1 , "score_if_null" : 1.0 } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
Note
The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.
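To see how weights and score_if_null interact, the sketch below models the overall record score as a weighted average of per-field scores. This is an assumption made for illustration; the guide does not state the exact formula doc_score uses. The function name and the per-field scores are hypothetical.

```python
def combine_field_scores(fields):
    """Weighted average of field scores, skipping null fields.

    Illustrative model only. Each entry has a 'weight', a 'score'
    (None when the field is null in the indexed document), and an
    optional 'score_if_null'.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for field in fields:
        score = field.get("score")
        if score is None:
            if "score_if_null" not in field:
                # Null field without an override: drop it, which
                # redistributes its weight across the remaining fields.
                continue
            score = field["score_if_null"]
        total_weight += field["weight"]
        weighted_sum += field["weight"] * score
    return weighted_sum / total_weight if total_weight else 0.0


# nationality is null in the index and has no override: it is ignored.
dropped = combine_field_scores([
    {"weight": 4, "score": 0.8},
    {"weight": 1, "score": None},
])
# With score_if_null set to 0.0, the null field pulls the result down.
penalized = combine_field_scores([
    {"weight": 4, "score": 0.8},
    {"weight": 1, "score": None, "score_if_null": 0.0},
])
print(dropped, penalized)
```

Under this model, leaving score_if_null unset treats missing data as neutral, while setting it to a low value treats missing data as evidence against a match.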
Multi-field query with multiple nested fields
The doc_score function for rescoring does not currently support search queries containing multiple nested fields. To perform these queries, chain multiple rescorers and adjust the query_weight and rescore_query_weight parameters to control the relative importance of the original query and of the rescore query, respectively. When chaining multiple RNI advanced rescorers, be sure to add "score_mode":"total" to each rni_query object to ensure the final score is properly accumulated.
This example expands the previous examples, adding alias names and replacing the single date of birth (dob field) with a list of dates of birth, one for each alias (dobs field).
Create an index with a mapping containing multiple nested fields
curl -XPUT "http://localhost:9200/rni-test" -H 'Content-Type: application/json' -d'{ "mappings": { "properties": { "name": { "type": "rni_name" }, "aliases": { "type": "nested", "properties": { "alias_name": { "type": "rni_name" } } }, "dobs": { "type": "nested", "properties": { "dob": { "type": "rni_date" } } }, "address": { "type": "rni_address" }, "height": { "type": "integer" }, "nationality": { "type": "keyword" } } } }'
Index documents that contain the fields
curl -XPUT "http://localhost:9200/rni-test/_doc/1" -H 'Content-Type: application/json' -d'{ "name": "Ryan McDonagh", "aliases": [ { "alias_name": "Rayan McDonagh" }, { "alias_name": "R. McDonagh" }, { "alias_name": "Rayan M." } ], "dobs": [ { "dob": "11/19/1987" }, { "dob": "11/20/1987" }, { "dob": "10/19/1987" } ], "address": { "houseNumber": "47", "road": "Park St", "city": "Boston", "state": "MA" }, "nationality": "USA", "height": 65 }'
Query index with chained multiple rescorers
curl -XGET "http://localhost:9200/rni-test/_search" -H 'Content-Type: application/json' -d'{ "query": { "bool": { "should": [ { "nested": { "path": "dobs", "query": { "bool": { "should": { "match": { "dobs.dob": "10/19/87"} } } } } }, { "nested": { "path":"aliases", "query": { "bool": { "should": { "match": {"aliases.alias_name": "Brian McDonough"} } } } } }, { "match":{ "address": "{\"houseNumber\": \"48\", \"road\": \"Parker St\", \"city\": \"Boston\", \"state\": \"MA\" }" } } ] } }, "rescore": [ { "rni_query": { "rescore_query": { "nested": { "score_mode": "max", "path": "aliases", "query": { "rni_function_score": { "name_score": { "field": "aliases.alias_name", "query_name": "Brian McDonough", "window_size_allowance": 1 } } } } }, "score_mode": "total", "query_weight": 0.0, "rescore_query_weight": 1.0 } }, { "rni_query": { "rescore_query": { "nested": { "score_mode": "max", "path": "dobs", "query": { "rni_function_score": { "date_score": { "field": "dobs.dob", "query_date": "10/19/87" } } } } }, "score_mode": "total", "query_weight": 0.67, "rescore_query_weight": 0.33 } }, { "rni_query": { "rescore_query": { "rni_function_score": { "address_score": { "field": "address", "query_address": { "houseNumber": "48", "road": "Parker St", "city": "Boston", "state": "MA" } } } }, "score_mode": "total", "query_weight": 0.75, "rescore_query_weight": 0.25 } }, { "query": { "rescore_query": { "match": { "height": 67 } }, "query_weight": 0.89, "rescore_query_weight": 0.11 } }, { "query": { "rescore_query": { "match": { "nationality": "CANADA" } }, "query_weight": 0.9, "rescore_query_weight": 0.1 } } ] }'
To calculate the rescore_query_weight for each rescorer, work from the last rescorer to the first, dividing each field's desired weight by the product of the already-calculated query_weight values. The query_weight is calculated by subtracting the rescore_query_weight from 1.
If there are no previous query_weight values, the rescore_query_weight is simply the desired field weight.
In this example, the desired field weights are 0.4, 0.2, 0.2, 0.1, and 0.1 for the alias, dob, address, height, and nationality fields, respectively.
Rescorer | Field weight | rescore_query_weight | query_weight |
---|---|---|---|
1. Alias name | 0.4 | 0.4 / (0.667 x 0.75 x 0.89 x 0.9) = 1 | 1 - 1 = 0 |
2. Date of birth | 0.2 | 0.2 / (0.75 x 0.89 x 0.9) = 0.333 | 1 - 0.333 = 0.667 |
3. Address | 0.2 | 0.2 / (0.89 x 0.9) = 0.25 | 1 - 0.25 = 0.75 |
4. Height | 0.1 | 0.1 / 0.9 = 0.11 | 1 - 0.11 = 0.89 |
5. Nationality | 0.1 | 0.1 | 1 - 0.1 = 0.9 |
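The bottom-to-top calculation can be automated. The sketch below is a hypothetical helper, not a plugin API; it takes the desired field weights in rescorer order and returns the query_weight and rescore_query_weight for each rescorer. It relies on the fact that, with score_mode set to total, a rescorer's contribution to the final score is its rescore_query_weight multiplied by the query_weight of every rescorer that runs after it.

```python
def chained_rescore_weights(field_weights):
    """Return [(query_weight, rescore_query_weight), ...] in chain order.

    field_weights lists the desired weight of each rescorer, first
    rescorer first. Hypothetical helper for illustration.
    """
    results = []
    product = 1.0  # product of query_weight values of later rescorers
    for weight in reversed(field_weights):
        rescore_query_weight = weight / product
        query_weight = 1.0 - rescore_query_weight
        results.append((query_weight, rescore_query_weight))
        product *= query_weight
    results.reverse()
    return results


# alias, dob, address, height, nationality
weights = chained_rescore_weights([0.4, 0.2, 0.2, 0.1, 0.1])
for query_weight, rescore_query_weight in weights:
    print(round(query_weight, 3), round(rescore_query_weight, 3))
```

When the desired weights sum to 1, the first rescorer ends up with a query_weight of 0, which discards the base query score entirely, as in the example above.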
Weighted multi-field query with custom similarity function
While the doc_score function has built-in similarity functions for many core field types, a custom similarity function can be provided at query time. In this manufactured example, we'll use a simple script_score function that matches CANADA and USA with a high score. Refer to the Elasticsearch documentation for more details about Elasticsearch scripting. Any other function can also be used.
curl -XGET 'http://localhost:9200/rni-test/_search' -H'Content-Type: application/json' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "name" : "Brian McDonough" } }, { "match" : { "dob" : "10/19/87" } }, { "match" : { "address" : "{ \"houseNumber\" : \"48\", \"road\" : \"Parker St\", \"city\" : \"Boston\", \"state\" : \"MA\" }" } } ] } }, "rescore" : { "query" : { "rescore_query" : { "function_score" : { "doc_score": { "fields": { "name": { "query_value": "Brian McDonough", "weight": 4 }, "dob": { "query_value": "10/19/87", "weight": 2 }, "address" : { "query_value" : { "houseNumber" : "48", "road" : "Parker St", "city" : "Boston", "state" : "MA" }, "weight" : 2 }, "height": { "query_value": 67, "weight": 0.5 }, "nationality": { "function": { "function_score": { "script_score": { "script": { "lang": "painless", "params": { "query_value": "CANADA" }, "inline": "if (params.query_value == '\''CANADA'\'' && doc['\''nationality'\''].value == '\''USA'\'') {return 0.8} else {return 0.2}" } } } }, "weight": 1 } } } } }, "query_weight" : 0.0, "rescore_query_weight" : 1.0 } } }'
Note
The quotes in the query above are escaped because you can't pass an object to the basic Elasticsearch query; it requires a string. The rescore queries can handle objects because they are using RNI functions to parse the values.
Explainability of RNI Matching
As important as getting a match score is, understanding how the system calculated the score can be just as important. When matching two names or records, RNI returns a JSON response explaining in detail how the two names, dates, addresses, or records were matched. With this information, you can understand how the score was calculated and, if necessary, modify the matching parameters to better solve your matching problems.
The following concepts are helpful when reviewing the explainInfo JSON file.
When two objects are being compared, one is referred to as the left input, one as the right input.
Every token of the left object is compared to every token of the right object. Token strings, made up of multiple tokens, may also be compared.
Names are usually composed of multiple tokens. For example, John Fitzgerald Kennedy is 3 tokens.
Common Terms
The response JSON contains sections for each type of object: names, addresses, and dates. While each object has its own criteria for comparison, there are common terms used for all comparisons, as shown below.
Term | Definition | Note |
---|---|---|
bin | A number representing the frequency of the token in the language. A lower bin indicates the token is unusual and therefore should be weighted more highly when calculating the similarity score. | |
biasedBin | The bin raised to a power from .1 to 10 (default 0.970). This value is set by the | |
scoreInIsolation | The matching score of just the tuples being compared, ignoring things like position in the name, name weighting, etc. This will show a match score of 1.000 if it is an exact match of tokens, even if there are biases that will lower the score in context. | |
scoreInContext | The matching score between the tuples, taking into account the placement in the overall query and any biases related to the overall query. | |
(left/right)MinTokenIndex | The index of the first token in the string of tokens. For single tokens, the min and max tokenIndex will have the same value. | An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
(left/right)MaxTokenIndex | The index of the last token in the string of tokens. For single tokens, the min and max tokenIndex will have the same value. | An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
unbiasedScore | The raw score before any calculations using | |
score | The final score after |
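As a quick check of the biasedBin definition, the sketch below raises a bin to the default power of 0.970 and compares the result with the biasedBin values that appear in the name matching example in this guide. The function name and the exponent argument are hypothetical; the actual parameter that sets the power is configured in RNI.

```python
def biased_bin(bin_value, exponent=0.97):
    """Raise a token's bin to the bias exponent (default 0.970).

    Hypothetical helper; 'exponent' stands in for the RNI parameter
    that controls the power, which must lie between 0.1 and 10.
    """
    if not 0.1 <= exponent <= 10:
        raise ValueError("exponent must be between 0.1 and 10")
    return bin_value ** exponent


# (bin, biasedBin) pairs taken from the name matching example.
example_tokens = [
    (5, 4.764319787410581),     # john
    (3.5, 3.3709010396413017),  # smith, jon
    (8, 7.5161819937120935),    # j
    (1, 1),                     # smyth
]
for bin_value, reported in example_tokens:
    print(bin_value, round(biased_bin(bin_value), 4), reported)
```

An exponent below 1 compresses the bins toward 1, which reduces the difference in weight between common and unusual tokens; an exponent above 1 widens it.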
Response structure
All match responses contain the same sections. The details contained within each section can change based on the match object (names, dates, addresses).
Left/right input information: The input information for each input along with the properties for each token in the input. Properties depend on the type of object being matched.
For example, the name matching example contains the following properties:
"data": "John Smith", "normalizedData": "john smith", "latnData": "john smith", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "ENGLISH"
While a date comparison would contain different properties:
"century": 20, "month": 10, "canonicalForm": "2024-10-01", "yearWithoutCentury": 24, "dayMonthSwapped": true, "originalString": "10 January 2024", "modifiedJulianDay": 60584, "day": 1
Tuple scores: The score for every tuple, where a tuple is a token string from the left input and a token string from the right input. Every token in the left input is matched to every token in the right input, along with some token strings (multiple tokens combined together).
Score adjustments: The score adjustments list the parameters applied, and the score calculated with those parameters.
For example, the name example here contains the following parameters:
"unbiasedScore": 0.6829129823127231, "score": 0.6919264820086959, "parameter": "adjustOneSidedDeletionScores"
"unbiasedScore": 0.6919264820086959, "score": 0.8435140063279181, "parameter": "finalBias"
Meanwhile, a date comparison would contain different parameters. In this case, a different matching scheme, tryDayMonthSwap, is tried to see if a better result is returned.
"score": 0.95, "unbiasedScore": 0.5926523220980572, "parameter": "tryDayMonthSwap"
"score": 0.95, "unbiasedScore": 0.95, "parameter": "dateFinalBias"
Final score: The similarity score for the two names.
Example: matching names
Let's take a look at an example. In this example we're matching the following 2 names:
John Smith
Jon J. Smyth
The JSON output is broken down by section.
"leftInput": { "data": "John Smith", "normalizedData": "john smith", "latnData": "john smith", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "ENGLISH", "tokens": [ { "token": "john", "latnToken": "john", "bin": 5, "biasedBin": 4.764319787410581, "tokenWeight": 0.41435888604672094, "tokenType": "GIVEN" }, { "token": "smith", "latnToken": "smith", "bin": 3.5, "biasedBin": 3.3709010396413017, "tokenWeight": 0.585641113953279, "tokenType": "SURNAME" } ], "entityType": "PERSON" },
The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Even if the name was provided as Smith John, Smith would be identified as a SURNAME and John as a GIVEN name.
"rightInput": { "data": "Jon J. Smyth", "normalizedData": "jon j. smyth", "latnData": "jon j. smyth", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "ENGLISH", "tokens": [ { "token": "jon", "latnToken": "jon", "bin": 3.5, "biasedBin": 3.3709010396413017, "tokenWeight": 0.2083122782666673, "tokenType": "UNKNOWN" }, { "token": "j", "latnToken": "j", "bin": 8, "biasedBin": 7.5161819937120935, "tokenWeight": 0.08948764635417582, "tokenType": "UNKNOWN" }, { "token": "smyth", "latnToken": "smyth", "bin": 1, "biasedBin": 1, "tokenWeight": 0.702200075379157, "tokenType": "UNKNOWN" } ], "entityType": "PERSON" },
The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Since both Jon and Smyth are unusual spellings, the tokenType is not identified.
"scoreTuples": [ { "scoreInIsolation": 0.7595918889283346, "scoreInContext": 0.7595918889283346, "left": "john", "right": "jon", "marked": true, "reason": "HMM_MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": 0, "rightMaxTokenIndex": 0 }, { "scoreInIsolation": 0.4912303477031893, "scoreInContext": 0.4666688303180298, "left": "john", "right": "jonj", "marked": false, "reason": "HMM_MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": 0, "rightMaxTokenIndex": 1 }, { "scoreInIsolation": 0.542, "scoreInContext": 0.4743439389212776, "left": "john", "right": "j", "marked": false, "reason": "INITIAL_MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": 1, "rightMaxTokenIndex": 1 }, { "scoreInIsolation": 0.2941408383164158, "scoreInContext": 0.279433796400595, "left": "johnsmith", "right": "jon", "marked": false, "reason": "HMM_MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 1, "rightMinTokenIndex": 0, "rightMaxTokenIndex": 0 }, { "scoreInIsolation": 0.46557800000000005, "scoreInContext": 0.4422991, "left": "johnsmith", "right": "j", "marked": false, "reason": "INITIAL_MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 1, "rightMinTokenIndex": 1, "rightMaxTokenIndex": 1 }, { "scoreInIsolation": 0.7237045473947534, "scoreInContext": 0.7237045473947534, "left": "smith", "right": "smyth", "marked": true, "reason": "HMM_MATCH", "leftMinTokenIndex": 1, "leftMaxTokenIndex": 1, "rightMinTokenIndex": 2, "rightMaxTokenIndex": 2 }, { "scoreInIsolation": 0.27169000000000004, "scoreInContext": 0.27169000000000004, "left": "", "right": "j", "marked": true, "reason": "DELETION", "leftMinTokenIndex": -1, "leftMaxTokenIndex": -1, "rightMinTokenIndex": 1, "rightMaxTokenIndex": 1 } ],
All tuples are compared. The tuples that are marked as true are the matches that are used to calculate the scores.
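When reviewing a long scoreTuples array, it can help to filter it down to the tuples that contributed to the score. The sketch below is a hypothetical helper that keeps only the marked tuples and flags deletions using the -1 and -2 token index convention described under Common Terms. The sample input is trimmed from the response above.

```python
def marked_tuples(explain_info):
    """Return (left, right, reason) for each tuple used in the score."""
    return [
        (t["left"], t["right"], t["reason"])
        for t in explain_info.get("scoreTuples", [])
        if t.get("marked")
    ]


def is_deletion(score_tuple):
    """True when one side of the tuple has a token index of -1 or -2."""
    indexes = (
        score_tuple.get("leftMinTokenIndex", 0),
        score_tuple.get("rightMinTokenIndex", 0),
    )
    return any(index in (-1, -2) for index in indexes)


# Trimmed from the example response: only the fields used here are kept.
sample = {
    "scoreTuples": [
        {"left": "john", "right": "jon", "marked": True, "reason": "HMM_MATCH",
         "leftMinTokenIndex": 0, "rightMinTokenIndex": 0},
        {"left": "john", "right": "j", "marked": False, "reason": "INITIAL_MATCH",
         "leftMinTokenIndex": 0, "rightMinTokenIndex": 1},
        {"left": "", "right": "j", "marked": True, "reason": "DELETION",
         "leftMinTokenIndex": -1, "rightMinTokenIndex": 1},
    ]
}
print(marked_tuples(sample))
```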
"scoreAdjustments": [ { "unbiasedScore": 0.6829129823127231, "score": 0.6919264820086959, "parameter": "adjustOneSidedDeletionScores" }, { "unbiasedScore": 0.6919264820086959, "score": 0.8435140063279181, "parameter": "finalBias" } ],
The unbiased score is the score before the parameter is applied. The score is after the parameter is applied.
"finalScore": 0.8435140063279181
The final calculated score with all parameters applied. This is the similarity score returned by RNI.
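In the example above, each adjustment starts from the score produced by the previous adjustment, and the last adjusted score equals finalScore. The sketch below checks that relationship for a parsed response. It is a hypothetical helper, and treating this chaining as a property of every response is an assumption based only on the examples in this guide.

```python
import math


def adjustments_are_chained(explain_info, tol=1e-12):
    """Check that score adjustments feed into each other and into finalScore."""
    adjustments = explain_info.get("scoreAdjustments", [])
    for previous, current in zip(adjustments, adjustments[1:]):
        # Each step should start from the score the previous step produced.
        if not math.isclose(previous["score"], current["unbiasedScore"], abs_tol=tol):
            return False
    if not adjustments:
        return True
    return math.isclose(adjustments[-1]["score"], explain_info["finalScore"], abs_tol=tol)


explain_info = {
    "scoreAdjustments": [
        {"unbiasedScore": 0.6829129823127231, "score": 0.6919264820086959,
         "parameter": "adjustOneSidedDeletionScores"},
        {"unbiasedScore": 0.6919264820086959, "score": 0.8435140063279181,
         "parameter": "finalBias"},
    ],
    "finalScore": 0.8435140063279181,
}
print(adjustments_are_chained(explain_info))
```

Reading the adjustments in order shows which parameter moved the score the most, which is a reasonable starting point when deciding which parameter to tune.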
Response schemas by object
The following sections list the JSON schema for each object type.
Name response schema
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "leftInput": { "type": "object", "properties": { "data": { "type": "string" }, "normalizedData": { "type": "string" }, "latnData": { "type": "string" }, "script": { "type": "string" }, "languageOfUse": { "type": "string" }, "languageOfOrigin": { "type": "string" }, "tokens": { "type": "array", "items": { "type": "object", "properties": { "token": { "type": "string" }, "latnToken": { "type": "string" }, "bin": { "type": "number", "default": 0.0 }, "biasedBin": { "type": "number", "default": 0.0 }, "tokenWeight": { "type": "number", "default": 0.0 }, "tokenType": { "type": "string", "default": null } } } }, "entityType": { "type": "string" }, "realWorldIds": { "type" : "array", "items": { "type": "string" } } }, "required": ["entityType"] }, "rightInput": { "type": "object", "properties": { "data": { "type": "string" }, "normalizedData": { "type": "string" }, "latnData": { "type": "string" }, "script": { "type": "string" }, "languageOfUse": { "type": "string" }, "languageOfOrigin": { "type": "string" }, "tokens": { "type": "array", "items": { "type": "object", "properties": { "token": { "type": "string" }, "latnToken": { "type": "string" }, "bin": { "type": "number", "default": 0.0 }, "biasedBin": { "type": "number", "default": 0.0 }, "tokenWeight": { "type": "number", "default": 0.0 }, "tokenType": { "type": "string", "default": null } } } }, "entityType": { "type": "string" }, "realWorldIds": { "type" : "array", "items": { "type": "string" } } }, "required": ["entityType"] }, "scoreTuples": { "type": "array", "items": { "type": "object", "properties": { "scoreInIsolation": { "type": "number", "default": 0.0 }, "scoreInContext": { "type": "number", "default": 0.0 }, "left": { "type": "string" }, "right": { "type": "string" }, "marked": { "type": "boolean", "default": false }, "reason": { "type": "string" }, "leftMinTokenIndex": { "type": "integer", "default": 0 }, 
"leftMaxTokenIndex": { "type": "integer", "default": 0 }, "rightMinTokenIndex": { "type": "integer", "default": 0 }, "rightMaxTokenIndex": { "type": "integer", "default": 0 } }, "required": ["left", "right", "reason"] } }, "scoreAdjustments": { "type": "array", "items": { "type": "object", "properties": { "unbiasedScore": { "type": "number", "default": 0.0 }, "score": { "type": "number", "default": 0.0 }, "parameter": { "type": "string" } } } }, "finalScore": { "type": "number" } } }
Address response schema
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "leftInput": { "type": "object", "properties": { "fieldInputInfos": { "type": "array", "items": { "type": "object", "properties": { "data": { "type": "string" }, "latnData": { "type": "string" }, "script": { "type": "string" }, "languageOfUse": { "type": "string" }, "languageOfOrigin": { "type": "string" }, "tokens": { "type": "array", "items": { "type": "object", "properties": { "token": { "type": "string" }, "latnToken": { "type": "string" }, "tokenWeight": { "type": "number", "default": 0.0 } } } }, "addressField": { "type": "string" }, "normalizedData": { "type": "string" } } } } }, "required": ["fieldInputInfos"] }, "rightInput": { "type": "object", "properties": { "fieldInputInfos": { "type": "array", "items": { "type": "object", "properties": { "data": { "type": "string" }, "latnData": { "type": "string" }, "script": { "type": "string" }, "languageOfUse": { "type": "string" }, "languageOfOrigin": { "type": "string" }, "tokens": { "type": "array", "items": { "type": "object", "properties": { "token": { "type": "string" }, "latnToken": { "type": "string" }, "tokenWeight": { "type": "number", "default": 0.0 } } } }, "addressField": { "type": "string" }, "normalizedData": { "type": "string" } } } } }, "required": ["fieldInputInfos"] }, "scoreTuples": { "type": "array", "items": { "type": "object", "properties": { "scoreInIsolation": { "type": "number", "default": 0.0 }, "scoreInContext": { "type": "number", "default": 0.0 }, "left": { "type": "string" }, "right": { "type": "string" }, "marked": { "type": "boolean", "default": false }, "reason": { "type": "string" }, "leftField": { "type": "string" }, "rightField": { "type": "string" }, "leftMinTokenIndex": { "type": "number", "default": 0 }, "leftMaxTokenIndex": { "type": "number", "default": 0 }, "rightMinTokenIndex": { "type": "number", "default": 0 }, "rightMaxTokenIndex": { "type": "number", "default": 0 } }, 
"required": ["left", "right", "reason", "leftField", "rightField"] } }, "scoreAdjustments": { "type": "array", "items": { "type": "object", "properties": { "unbiasedScore": { "type": "number", "default": 0.0 }, "score": { "type": "number", "default": 0.0 }, "parameter": { "type": "string" }, "leftField": { "type": "string" }, "rightField": { "type": "string" } } } }, "finalScore": { "type": "number" }, "fieldScores": { "type": "array", "items": { "type": "object", "properties": { "leftField": { "type": "string" }, "rightField": { "type": "string" }, "score": { "type": "number", "default": 0.0 }, "marked": { "type": "boolean", "default": false } }, "required": ["leftField", "rightField"] } } } }
Date response schema
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "leftInput": { "type": "object", "properties": { "originalString": { "type": "string" }, "day": { "type": "integer" }, "month": { "type": "integer" }, "yearWithoutCentury": { "type": "integer" }, "century": { "type": "integer" }, "modifiedJulianDay": { "type": "integer" }, "canonicalForm": { "type": "string" }, "dayMonthSwapped": { "type": "boolean" } } }, "rightInput": { "type": "object", "properties": { "originalString": { "type": "string" }, "day": { "type": "integer" }, "month": { "type": "integer" }, "yearWithoutCentury": { "type": "integer" }, "century": { "type": "integer" }, "modifiedJulianDay": { "type": "integer" }, "canonicalForm": { "type": "string" }, "dayMonthSwapped": { "type": "boolean" } } }, "scoreTuples": { "type": "array", "items": { "type": "object", "properties": { "scoreInIsolation": { "type": "number", "default": 0.0 }, "scoreInContext": { "type": "number", "default": 0.0 }, "left": { "type": "string" }, "right": { "type": "string" }, "marked": { "type": "boolean", "default": false }, "weight": { "type": "number", "default": 0.0 }, "component": { "type": "string" }, "differenceInDays": { "type": "integer" } }, "required": ["left", "right", "component"] } }, "scoreAdjustments": { "type": "array", "items": { "type": "object", "properties": { "unbiasedScore": { "type": "number", "default": 0.0 }, "score": { "type": "number", "default": 0.0 }, "parameter": { "type": "string" } } } }, "finalScore": { "type": "number" } } }
Dynamic configuration endpoints
The plugin includes Elasticsearch REST APIs to customize and tune matching through stop words, token overrides, and parameter universes. These endpoints allow you to add and modify these configuration values, without having to restart Elasticsearch.
Tip
To use any of the configuration REST APIs, the parameter enableDynamicConfigurationEndpoints must be set to true in the parameter_profiles.yaml file in the any: profile. By default, this parameter is set to false. These endpoints should be used for testing and tuning only. When the dynamic configuration endpoints are enabled, they can slow the system down considerably.
The RNI-ES plugin relies on having an active primary or replica shard for each dynamic index available on every node. As a result, it is up to users to ensure each node has enough disk space to not exceed Elasticsearch watermark thresholds. Outside of this, users should never interact directly with the underlying dynamic configuration indices; all requests should go through the appropriate /rni_plugin endpoints.
Request timeouts
Whenever the RNI-ES plugin detects that one of its underlying dynamic indices has changed, it must fetch the entire index contents before the next name matching or indexing request. The timeout threshold for this fetch request is managed individually for each class of endpoints, and defaults to 60,000 ms. If this value is insufficient, you can configure it at plugin startup time with the bt.{override,stopword,parameter}.timeout Java property.
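The brace notation above is shorthand for three separate properties: bt.override.timeout, bt.stopword.timeout, and bt.parameter.timeout. The sketch below is a hypothetical helper that expands a set of desired timeouts into -D flags. It assumes the values are given in milliseconds, matching the unit of the documented default, and that the flags are passed through ES_JAVA_OPTS in the same way as other Java properties; verify both against your deployment.

```python
ENDPOINT_CLASSES = ("override", "stopword", "parameter")


def timeout_java_opts(timeouts_ms):
    """Build -D flags for the per-endpoint-class timeout properties.

    timeouts_ms maps an endpoint class to a timeout. The millisecond
    unit is an assumption based on the documented default of 60,000 ms.
    """
    unknown = set(timeouts_ms) - set(ENDPOINT_CLASSES)
    if unknown:
        raise ValueError(f"unknown endpoint class: {sorted(unknown)}")
    flags = [
        f"-Dbt.{endpoint_class}.timeout={int(timeouts_ms[endpoint_class])}"
        for endpoint_class in ENDPOINT_CLASSES
        if endpoint_class in timeouts_ms
    ]
    return " ".join(flags)


opts = timeout_java_opts({"override": 120000, "parameter": 90000})
print(f'ES_JAVA_OPTS="{opts}"')
```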
Tip
To use dynamic configuration endpoints in an Elasticsearch deployment using SSL encryption, the RNI-Elasticsearch plugin must be aware of the server's certificate file. To accomplish this, start Elasticsearch with:
ES_JAVA_OPTS="-Dbt.ssl.certificate=<path_to_certificate>"
Stop words
The _stopwords endpoint allows you to ADD, GET and DELETE stop words without restarting the Elasticsearch server. See Stop patterns and stop word prefixes for more detailed information on stop words.
The following properties are used when creating stop words. The entity_type is optional; all other fields are required when adding stop words through the API.
Property | Required | Description |
---|---|---|
lang | ✓ | ISO 639-3 code for the language of the stop word(s). |
stopword_type | ✓ | Type of stop word(s), either |
entity_type | | Entity type for which to apply the stop word(s), defaults to |
stopwords | ✓ | List of stop words to be added. |
Note
Stop words are applied whenever a token is normalized, meaning stop words affect the name content that is included in the index. Therefore, changes to dynamic stop words require data to be reindexed to take effect.
Create stop words
The POST _stopwords method adds one or more stop words. The entity_type field is optional, but the other fields are all required.
curl -XPOST "http://localhost:9200/rni_plugin/_stopwords" -H 'Content-Type: application/json' -d '{ "lang": "eng", "stopword_type": "prefixes", "entity_type": "PERSON", "stopwords": [ "honorable", "senior correspondent" ] }'
Get stop words
The GET _stopwords method returns all stop words for a given language and stop word type. You can search by just language or by language and type.
When no entity type is specified, the stop word is applied to all names in the language, those with and without entity types. Therefore, calls that specify a type such as PERSON or ORGANIZATION will also return all stop words that don't have an entity type specified.
Returns all prefix stop words for PERSON types in English:
curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON"
Returns all regex stop words for ORGANIZATION types in Spanish:
curl -XGET "http://localhost:9200/rni_plugin/_stopwords/regexes_spa_ORGANIZATION"
Returns all prefix stop words in English with no type specified. For some languages, this list is empty by default. In these cases, data will only be returned if you've populated the file with values:
curl -XGET "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng"
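The three GET examples share a path pattern: the stop word type, the language code, and an optional entity type, joined by underscores. The sketch below builds those paths with a hypothetical helper; the pattern is inferred from the examples above rather than from a formal specification.

```python
def stopwords_path(stopword_type, lang, entity_type=None):
    """Build the _stopwords path segment used in the GET examples.

    Hypothetical helper; the <type>_<lang>[_<ENTITY>] pattern is
    inferred from the examples in this section.
    """
    profile = f"{stopword_type}_{lang}"
    if entity_type:
        profile += f"_{entity_type}"
    return f"/rni_plugin/_stopwords/{profile}"


print(stopwords_path("prefixes", "eng", "PERSON"))
print(stopwords_path("regexes", "spa", "ORGANIZATION"))
print(stopwords_path("prefixes", "eng"))
```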
Delete stop words
The DELETE _stopwords method deletes a specified stop word. Deleting a stop word from a specific profile will also delete it from the any profile.
curl -XDELETE "http://localhost:9200/rni_plugin/_stopwords/prefixes_eng_PERSON/doctor"
Token overrides
The _overrides endpoint allows you to ADD, GET and DELETE token pair overrides without restarting the Elasticsearch server. See Overriding token pair matches for more detailed information on token pair overrides.
The following properties are used when creating token overrides.
Property | Required | Description |
---|---|---|
lang1 | ✓ | ISO 639-3 code for the language of the first name in the override pair. |
lang2 | ✓ | ISO 639-3 code for the language of the second name in the override pair. |
entity_type | | Entity type of the list of token override pairs, defaults to "ALL". |
 | | An alphanumeric string which specifies the selector value to apply for these overrides. NOTE: This property is only available in RNI-ES 8.6.2.0 and later. |
token_pairs | ✓ | List of token override pairs to be added. |
token1 | ✓ | Tokens of the first name in the override pair; they should be of |
token2 | ✓ | Token of the second name in the override pair; they should be of |
type | | The specific override type for the token pair. If omitted, the |
score | | Raw score of the token pair between 0.0 and 1.0. If omitted, the value from the |
force | | Indicates whether to force this score to be exactly that value for the given token pair, defaults to |
Note
RNI is designed so that override information is not included with indexed names. Therefore, changes to dynamic overrides do not require data to be reindexed to take effect.
Create override index
The override index must exist before you can start adding token overrides. To create the index:
curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_create"
Refresh override index
To force a refresh of the dynamic override index:
curl -s -XPOST "localhost:9200/rni_plugin/_overrides/_refresh"
Create token overrides
The POST _overrides method adds one or more token overrides. As shown in the table above, entity_type, force, and score are optional, but the other fields are required.
curl -XPOST "http://localhost:9200/rni_plugin/_overrides" -H 'Content-Type: application/json' -d'{ "lang1": "eng", "lang2": "eng", "entity_type": "PERSON", "token_pairs": [{ "token1": "Abigail", "token2": "Abbey", "score": 0.74, "force": true}, { "token1": "Aleksander", "token2": "Alex", "score": 0.74}, { "token1": "Alfonso", "token2": "Alphonse", "type": "COGNATE"}, { "token1": "Frederica", "token2": "Federica" }]}'
Get token overrides
The GET _overrides method returns the overrides of a given language profile.
curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON"
You can also retrieve the score of a given override pair.
curl -XGET "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON?token1=abigel&token2=abigail"
Delete token overrides
The DELETE _overrides method deletes a given override pair. Deleting an override from a specific profile will also delete it from the any profile.
curl -XDELETE "http://localhost:9200/rni_plugin/_overrides/hun_eng_PERSON/abigel+abigail"
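The GET and DELETE examples address a profile as hun_eng_PERSON and a token pair either as query parameters or as abigel+abigail in the path. The Python sketch below builds those URLs. The lang1_lang2_ENTITYTYPE pattern is inferred from the examples rather than stated as a rule, and the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

# Host and port as used in the examples in this guide.
BASE = "http://localhost:9200/rni_plugin/_overrides"


def profile_name(lang1, lang2, entity_type):
    # Pattern inferred from "hun_eng_PERSON" in the examples (an assumption).
    return f"{lang1}_{lang2}_{entity_type}"


def get_pair_url(profile, token1, token2):
    # GET: the token pair is passed as query parameters.
    query = urlencode({"token1": token1, "token2": token2})
    return f"{BASE}/{quote(profile, safe='')}?{query}"


def delete_pair_url(profile, token1, token2):
    # DELETE: the token pair is part of the path, joined with "+".
    return f"{BASE}/{quote(profile, safe='')}/{quote(token1, safe='')}+{quote(token2, safe='')}"


profile = profile_name("hun", "eng", "PERSON")
print(get_pair_url(profile, "abigel", "abigail"))
print(delete_pair_url(profile, "abigel", "abigail"))
```

Percent-encoding the tokens keeps the URLs valid when a token contains non-ASCII characters.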
Parameters
The _parameter_universe endpoint allows you to add, get, and delete parameters through parameter universes, without restarting the Elasticsearch server. See Parameter universe for more information on tuning parameters with parameter universes.
Note
While some parameters can impact the data that is included in the index, these parameters cannot be dynamically specified. Therefore, changes to dynamic parameters do not require data to be reindexed to take effect.
Add parameter(s)
The POST _parameter_universe method creates a parameter universe and the parameter profiles within the universe. Use this method to add or update a parameter value in a parameter universe. If you add a parameter universe that already exists, the existing universe is overridden with the new values. Each profile name in the request uses the following syntax:
SomeParameterUniverseName/xxx_yyy
where xxx_yyy is the language profile the parameters belong to, expressed in ISO 639-3 codes. The parameters field expects a list of parameters for the given profile, whose names must match the ones declared in parameter_defs.yaml.
curl -XPOST "http://localhost:9200/rni_plugin/_parameter_universe" -H 'Content-Type: application/json' -d' { "profiles": [ { "name": "SomeParameterUniverseName/any", "parameters": { "translatorResultsToKeep": 4, "deletionScore": 0.269, "doQueryTokenOverrides": true, "fieldDeletionScore": 0.27, "yearDistanceWeight": 0.2 } }, { "name": "SomeParameterUniverseName/eng_eng", "parameters": { "HMMUsageThreshold": 0.8, "stringDistanceThreshold": 0.1, "useEditDistanceTokenScorer": true, "finalBias": 2.4, "reorderPenalty": 0.2 } } ] }'
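The same body can be generated from a plain mapping of profile names to parameters. This is a minimal Python sketch, assuming the profiles/name/parameters layout shown in the request above; the helper name build_universe_request is hypothetical, and the parameter values are copied from the example.

```python
import json


def build_universe_request(universe, profiles):
    """Build the body for POST rni_plugin/_parameter_universe (hypothetical helper).

    profiles maps a language profile such as "any" or "eng_eng" to a dict
    of parameter names and values.
    """
    return json.dumps(
        {
            "profiles": [
                {"name": f"{universe}/{profile}", "parameters": params}
                for profile, params in profiles.items()
            ]
        }
    )


body = build_universe_request(
    "SomeParameterUniverseName",
    {
        "any": {"deletionScore": 0.269, "doQueryTokenOverrides": True},
        "eng_eng": {"finalBias": 2.4, "reorderPenalty": 0.2},
    },
)
print(body)
```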
Get parameter(s)
The GET _parameter_universe method retrieves parameter universes.
To retrieve a given parameter universe, the name of the parameter universe is provided as a path parameter:
curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"
If you include the name of the profile and a parameter, it returns the value of the parameter:
curl -XGET "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"
Delete parameter(s)
The DELETE _parameter_universe method deletes parameter universes.
To delete a specific parameter universe:
curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName"
To delete a parameter for a specific profile within a parameter universe:
curl -XDELETE "http://localhost:9200/rni_plugin/_parameter_universe/SomeParameterUniverseName/eng_eng.reorderPenalty"
Note
Deleting a parameter from a specific parameter profile will also delete it from the any profile. The default value from the parameter_defs.yaml file will then be used.
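The GET and DELETE examples use two URL shapes: the universe on its own, and universe/profile.parameter. A minimal Python sketch that builds both follows; the helper name universe_url is hypothetical, and only the two shapes shown in the examples are produced.

```python
from urllib.parse import quote

# Host and port as used in the examples in this guide.
BASE = "http://localhost:9200/rni_plugin/_parameter_universe"


def universe_url(universe, profile=None, parameter=None):
    """Return the URL for a whole universe or for one parameter in one profile."""
    # The examples show the profile and parameter together, so require both or neither.
    if (profile is None) != (parameter is None):
        raise ValueError("give both profile and parameter, or neither")
    url = f"{BASE}/{quote(universe, safe='')}"
    if profile is not None:
        url += f"/{quote(profile, safe='')}.{quote(parameter, safe='')}"
    return url


print(universe_url("SomeParameterUniverseName"))
print(universe_url("SomeParameterUniverseName", "eng_eng", "reorderPenalty"))
```

The same URL serves both GET and DELETE; only the HTTP method differs.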
Pairwise match endpoint
You can perform a pairwise match between two rni_names, rni_dates, rni_addresses, or other datatypes through the POST _pair_match method. The results provide insight into how the match scores were calculated, including tokens and token scores. This endpoint can help you understand the impact a specific match parameter has on the final score, and can aid in testing and debugging RNI.
The type of pairwise match is passed in the type URL parameter, and the values being compared (data1 and data2) are passed in the request body. You can also specify one or more parameters and see how they impact the match scores.
You may use the optional responseFormat URL parameter to control the format of the response. The default value is explainInfo, which produces the output format of plugin versions using SDK 7.43.0.c71.0 and later. Setting this parameter to legacyExplainInfo produces the output of previous plugin versions.
Tip
We strongly recommend sending in complete strings and allowing RNI to perform tokenization. RNI includes weighting and other calculations which operate on the full string, enhancing the token matching scoring algorithms to improve match scores.
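A pairwise match request combines the type and optional responseFormat URL parameters with a JSON body that holds dataPair and, optionally, parameters. The Python sketch below assembles both parts. The helper name build_pair_match is hypothetical, and the body layout is taken from the request examples in this section.

```python
import json
from urllib.parse import urlencode

# Host and port as used in the examples in this guide.
BASE = "http://localhost:9200/rni_plugin/_pair_match"


def build_pair_match(match_type, data1, data2, parameters=None, response_format=None):
    """Return (url, body) for a POST _pair_match request (hypothetical helper)."""
    query = {"type": match_type}
    if response_format is not None:
        query["responseFormat"] = response_format
    # data1 and data2 may be strings or objects, depending on the type being matched.
    body = {"dataPair": {"data1": data1, "data2": data2}}
    if parameters:
        body["parameters"] = parameters
    return f"{BASE}?{urlencode(query)}", json.dumps(body)


url, body = build_pair_match(
    "rni_date",
    "12/25/19",
    "1/15/20",
    parameters={"timeDistanceWeight": ".8", "stringDistanceWeight": "0"},
    response_format="legacyExplainInfo",
)
print(url)
print(body)
```

Passing whole strings for data1 and data2, as done here, follows the tip above and leaves tokenization to RNI.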
Request
curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d' {"dataPair": {"data1": "12/25/19","data2": "1/15/20"}, "parameters": { "timeDistanceWeight": ".8", "stringDistanceWeight": "0"}}'
Response
{ "leftInput": { "originalString": "12/25/19", "day": 25, "month": 12, "yearWithoutCentury": 19, "century": -1000, "modifiedJulianDay": -671643, "canonicalForm": "YY19-12-25" }, "rightInput": { "originalString": "1/15/20", "day": 15, "month": 1, "yearWithoutCentury": 20, "century": -1000, "modifiedJulianDay": -671622, "canonicalForm": "YY20-01-15" }, "scoreTuples": [ { "scoreInIsolation": 0.9330329915368074, "scoreInContext": 0.9330329915368074, "left": "YY19", "right": "YY20", "marked": true, "weight": 0.2, "component": "YEAR_DISTANCE" }, { "scoreInIsolation": 0.6830201283771977, "scoreInContext": 0.6830201283771977, "left": "12", "right": "01", "marked": true, "weight": 0.2, "component": "MONTH_DISTANCE" }, { "scoreInIsolation": 0.7071067811865476, "scoreInContext": 0.7071067811865476, "left": "25", "right": "15", "marked": true, "weight": 0.1, "component": "DAY_DISTANCE" }, { "scoreInIsolation": 0.375, "scoreInContext": 0.375, "left": "YY19-12-25", "right": "YY20-01-15", "marked": true, "weight": 0, "component": "STRING_DISTANCE" }, { "scoreInIsolation": 0.6949591099211685, "scoreInContext": 0.6949591099211685, "left": "YY19-12-25", "right": "YY20-01-15", "marked": true, "weight": 0.8, "component": "TIME_PROXIMITY" } ], "scoreAdjustments": [ { "unbiasedScore": 0.730683530798762, "score": 0.730683530798762, "parameter": "dateFinalBias" } ], "finalScore": 0.730683530798762 }
Supported types
The following data types are supported by the pairwise match endpoint.
rni_name
rni_date
rni_address
date
keyword
text
string
integer
long
short
double
float
boolean
geo_point
Request
curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=text" -H 'Content-Type: application/json' -d' { "dataPair": { "data1": "word1", "data2": "word2" } }'
Response
{ "score" : 0.8333333333333334 }
Name matching example
Request
Parameters are specified directly in the request. The source language (language) of the name is optional, but recommended if known.
curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_name" -H 'Content-Type: application/json' -d' { "dataPair": { "data1": { "data": "John Robert Edward Smith", "language": "eng", "entityType": "PERSON" }, "data2": { "data": "John Smyth", "language": "eng", "entityType": "PERSON" } }, "parameters": { "deletionScore": 0.469 } }'
Response
The response includes detailed information on how the names were matched.
{ "leftInput": { "data": "John Robert Edward Smith", "normalizedData": "john robert edward smith", "latnData": "john robert edward smith", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "ENGLISH", "tokens": [ { "token": "john", "latnToken": "john", "bin": 5, "biasedBin": 4.764319787410581, "tokenWeight": 0.20817481793666764, "tokenType": "GIVEN" }, { "token": "robert", "latnToken": "robert", "bin": 4, "biasedBin": 3.8370564773010574, "tokenWeight": 0.24879889758995669, "tokenType": "MIDDLE" }, { "token": "edward", "latnToken": "edward", "bin": 4, "biasedBin": 3.8370564773010574, "tokenWeight": 0.24879889758995669, "tokenType": "MIDDLE" }, { "token": "smith", "latnToken": "smith", "bin": 3.5, "biasedBin": 3.3709010396413017, "tokenWeight": 0.294227386883419, "tokenType": "SURNAME" } ], "entityType": "PERSON" }, "rightInput": { "data": "John Smyth", "normalizedData": "john smyth", "latnData": "john smyth", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "ENGLISH", "tokens": [ { "token": "john", "latnToken": "john", "bin": 5, "biasedBin": 4.764319787410581, "tokenWeight": 0.17348100675885905, "tokenType": "GIVEN" }, { "token": "smyth", "latnToken": "smyth", "bin": 1, "biasedBin": 1, "tokenWeight": 0.8265189932411409, "tokenType": "SURNAME" } ], "entityType": "PERSON" }, "scoreTuples": [ { "scoreInIsolation": 1, "scoreInContext": 1, "left": "john", "right": "john", "marked": true, "reason": "MATCH", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": 0, "rightMaxTokenIndex": 0 }, { "scoreInIsolation": 0.469, "scoreInContext": 0.469, "left": "robertedward", "right": "", "marked": true, "reason": "DELETION", "leftMinTokenIndex": 1, "leftMaxTokenIndex": 2, "rightMinTokenIndex": -1, "rightMaxTokenIndex": -1 }, { "scoreInIsolation": 0.7237045473947534, "scoreInContext": 0.7237045473947534, "left": "smith", "right": "smyth", "marked": true, "reason": "HMM_MATCH", "leftMinTokenIndex": 3, "leftMaxTokenIndex": 3, 
"rightMinTokenIndex": 1, "rightMaxTokenIndex": 1 } ], "scoreAdjustments": [ { "unbiasedScore": 0.6686154265898526, "score": 0.6972609000299432, "parameter": "adjustOneSidedDeletionScores" }, { "unbiasedScore": 0.6972609000299432, "score": 0.8468047657291401, "parameter": "finalBias" } ], "finalScore": 0.8468047657291401 }
Address matching example
Request
The pairwise match endpoint supports both fielded and unfielded addresses. Fielded addresses must be specified as objects, while unfielded addresses must be specified as strings.
curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_address" -H 'Content-Type: application/json' -d' { "dataPair": { "data1": { "houseNumber": "101", "road": "Main st", "city": "Cambridge", "state": "Massachusetts", "country": "United States of America" }, "data2": "101 Main St, Cambridge, MA, USA" } }'
Response
The response includes a score marking the similarity of the two addresses as well as a type field describing the type of match observed. The response also includes detailed information on how each of the fields was matched. The example below is not the complete response: of the detailed score output, only part of the HOUSE_NUMBER portion is included.
{ "leftInput": { "fieldInputInfos": [ { "data": "United States of America", "latnData": "United States of America", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "united", "latnToken": "united", "tokenWeight": 0.25 }, { "token": "states", "latnToken": "states", "tokenWeight": 0.25 }, { "token": "of", "latnToken": "of", "tokenWeight": 0.25 }, { "token": "america", "latnToken": "america", "tokenWeight": 0.25 } ], "addressField": "COUNTRY", "normalizedData": "united states of america" }, { "data": "Massachusetts", "latnData": "Massachusetts", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "massachusetts", "latnToken": "massachusetts", "tokenWeight": 1 } ], "addressField": "STATE", "normalizedData": "massachusetts" }, { "data": "Cambridge", "latnData": "Cambridge", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "cambridge", "latnToken": "cambridge", "tokenWeight": 1 } ], "addressField": "CITY", "normalizedData": "cambridge" }, { "data": "Main st", "latnData": "Main st", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "main", "latnToken": "main", "tokenWeight": 0.5 }, { "token": "st", "latnToken": "st", "tokenWeight": 0.5 } ], "addressField": "ROAD", "normalizedData": "main st" }, { "data": "101", "languageOfUse": "UNKNOWN", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "101", "latnToken": "101", "tokenWeight": 1 } ], "addressField": "HOUSE_NUMBER", "normalizedData": "101" } ] }, "rightInput": { "fieldInputInfos": [ { "data": "usa", "latnData": "usa", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "usa", "latnToken": "usa", "tokenWeight": 1 } ], "addressField": "COUNTRY", "normalizedData": "usa" }, { "data": "ma", "latnData": "ma", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": 
"UNKNOWN", "tokens": [ { "token": "ma", "latnToken": "ma", "tokenWeight": 1 } ], "addressField": "STATE", "normalizedData": "ma" }, { "data": "cambridge", "latnData": "cambridge", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "cambridge", "latnToken": "cambridge", "tokenWeight": 1 } ], "addressField": "CITY", "normalizedData": "cambridge" }, { "data": "main st", "latnData": "main st", "script": "Latn", "languageOfUse": "ENGLISH", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "main", "latnToken": "main", "tokenWeight": 0.5 }, { "token": "st", "latnToken": "st", "tokenWeight": 0.5 } ], "addressField": "ROAD", "normalizedData": "main st" }, { "data": "101", "languageOfUse": "UNKNOWN", "languageOfOrigin": "UNKNOWN", "tokens": [ { "token": "101", "latnToken": "101", "tokenWeight": 1 } ], "addressField": "HOUSE_NUMBER", "normalizedData": "101" } ] }, "scoreTuples": [ { "scoreInIsolation": 1, "scoreInContext": 1, "left": "101", "right": "101", "marked": true, "reason": "MATCH", "leftField": "HOUSE_NUMBER", "rightField": "HOUSE_NUMBER", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": 0, "rightMaxTokenIndex": 0 }, { "scoreInIsolation": 0.3, "scoreInContext": 0.3, "left": "101", "right": "", "marked": false, "reason": "DELETION", "leftField": "HOUSE_NUMBER", "rightField": "HOUSE_NUMBER", "leftMinTokenIndex": 0, "leftMaxTokenIndex": 0, "rightMinTokenIndex": -1, "rightMaxTokenIndex": -2 }, ... } }
Date matching example
curl -XPOST "http://localhost:9200/rni_plugin/_pair_match?type=rni_date" -H 'Content-Type: application/json' -d' {"dataPair": {"data1": "12/25/19","data2": "1/15/20"}, "parameters": { "timeDistanceWeight": ".8", "stringDistanceWeight": "0"}}'
Response
The response includes detailed information on how the dates were matched.
{ "leftInput": { "originalString": "12/25/19", "day": 25, "month": 12, "yearWithoutCentury": 19, "century": -1000, "modifiedJulianDay": -671643, "canonicalForm": "YY19-12-25" }, "rightInput": { "originalString": "1/15/20", "day": 15, "month": 1, "yearWithoutCentury": 20, "century": -1000, "modifiedJulianDay": -671622, "canonicalForm": "YY20-01-15" }, "scoreTuples": [ { "scoreInIsolation": 0.6949591099211685, "scoreInContext": 0.6949591099211685, "left": "YY19-12-25", "right": "YY20-01-15", "marked": true, "weight": 0.8, "component": "TIME_DISTANCE", "differenceInDays": 21 }, { "scoreInIsolation": 0.9330329915368074, "scoreInContext": 0.9330329915368074, "left": "YY19", "right": "YY20", "marked": true, "weight": 0.2, "component": "YEAR_DISTANCE" }, { "scoreInIsolation": 0.6830201283771977, "scoreInContext": 0.6830201283771977, "left": "12", "right": "01", "marked": true, "weight": 0.2, "component": "MONTH_DISTANCE" }, { "scoreInIsolation": 0.7071067811865476, "scoreInContext": 0.7071067811865476, "left": "25", "right": "15", "marked": true, "weight": 0.1, "component": "DAY_DISTANCE" }, { "scoreInIsolation": 0.375, "scoreInContext": 0.375, "left": "YY19-12-25", "right": "YY20-01-15", "marked": true, "weight": 0, "component": "STRING_DISTANCE" } ], "scoreAdjustments": [ { "unbiasedScore": 0.730683530798762, "score": 0.730683530798762, "parameter": "dateFinalBias" } ], "finalScore": 0.730683530798762 }
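In this particular response, the weighted average of the component scores (each scoreInContext multiplied by its weight, divided by the sum of the weights) reproduces the reported finalScore. The Python sketch below checks that arithmetic. Treat it as an observation about this one example, not as a documented description of how RNI combines date scores; the function name weighted_average is hypothetical.

```python
# (component, scoreInContext, weight) copied from the scoreTuples in the response above.
score_tuples = [
    ("TIME_DISTANCE", 0.6949591099211685, 0.8),
    ("YEAR_DISTANCE", 0.9330329915368074, 0.2),
    ("MONTH_DISTANCE", 0.6830201283771977, 0.2),
    ("DAY_DISTANCE", 0.7071067811865476, 0.1),
    ("STRING_DISTANCE", 0.375, 0.0),
]

# finalScore as reported in the response above.
reported_final_score = 0.730683530798762


def weighted_average(tuples):
    """Sum of score * weight divided by the sum of the weights."""
    total_weight = sum(weight for _, _, weight in tuples)
    return sum(score * weight for _, score, weight in tuples) / total_weight


combined = weighted_average(score_tuples)
print(f"weighted average: {combined:.6f}")
print(f"reported score:   {reported_final_score:.6f}")
```

Setting stringDistanceWeight to 0 in the request removes STRING_DISTANCE from the average, which is why its low score of 0.375 does not pull the result down.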
Fully supported text domains for name matching
The following tables describe the domain pairings for which RNI provides full support. All other domain pairings have limited support, as described in Language support parameters. A domain refers to the language and script of a piece of text. For example, one domain might be Latin (Latn) script in the English (eng) language.
Note
"Language" in this appendix refers to the language of use, the language of the document in which the name is found, which may not be the language of origin associated with the name. If the language of use is undetermined, use unknown (xxx).
Note
Prior to release 7.36.0, RNI did not support any limited languages; when presented with names in those languages, an "unsupported language" error would be returned.
To set RNI to behave as it did previously, set allLanguageSupport to false.
Name matching within a language
The first table identifies the languages, and for each language the writing scripts that Rosette Name Indexer fully supports.
Cross-language matches
This table identifies the range of cross-language searching and matching that Rosette Name Indexer and name matching fully support. If your query is a name in an Arabic document in Arabic script, the query may return one or more names in English documents in Latin script, in addition to names from Arabic documents in Arabic script. If the query is a name in English and Latin script, it may return documents from any of the supported languages and their native scripts.
Note
For supported scripts for each language, see the table in section 13.1.
Appendix
Match phenomena
Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.
Name | Description | Example |
---|---|---|
CONFLICT | The tokens do not match. | When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT. |
DELETION | The token is unmatched. | When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION. |
EMBEDDING_MATCH | The tokens are semantically similar as determined by word-embedding vectors. | When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH. |
FIELD_BLOCKED | This field cannot be matched because of a cross-field match involving the same field in the other name. | When comparing "Bob|William|Smith" with "William||Smith", "bob" is a FIELD_BLOCKED since the cross-field william match prevents it from matching with its corresponding field. |
FIELD_CONFLICT | When comparing two names that are divided into fields, these fields do not match. | When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT. |
FIELD_DELETION | When comparing two names that are divided into fields, this field is unmatched. | When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION. |
GIVEN_NAME_DELETION | When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched. | When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a |
HANI_ABBREVIATION | One Hani token appears to be an abbreviation of another Hani token. | "北京大学" and "北大" are a HANI_ABBREVIATION match. |
HMM_MATCH | The tokens are similar but not identical, and the match was determined by a particular model (hidden Markov model). This is a type of fuzzy match. | "richard" and "richerd" are an HMM_MATCH. |
INITIALISM | One token is a name and the other token is the initials of the words which make up the name. | "john fitzgerald kennedy" and "JFK" are an INITIALISM. "consumer value stores" and "CVS" are an INITIALISM. |
INITIAL_MATCH | One token is the first initial of the other. | "w" and "william" are an INITIAL_MATCH. |
LANGUAGE_SPECIFIC_MATCH | The match was determined by a language-specific matcher. | "laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH. |
MATCH | The tokens are identical (after stop word elimination and normalization). | "john" and "john" are a MATCH. |
NULL | The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object. | N/A |
OUT_OF_ORDER_DELETION | This unmatched token still leaves the remaining tokens out of order when it is removed. | When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION. |
OVERRIDE | The tokens appear as a pair on the override list. This is often used for nicknames. | "john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list. |
PREFIX_INITIAL | One token is an initial that matches a prefix in the other token. In practice, the PREFIX_INITIAL phenomenon is rare. | If the |
STRING_SIMILARITY | The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match. | "akcd" and "xkcd" are a STRING_SIMILARITY match. |
STUCK_INITIAL | One name appears to have an initial mistakenly attached to a preceding token. | "DavidK" and "David Keith" are a STUCK_INITIAL match. |
SURNAME_DELETION | When comparing two names that are divided into fields, the SURNAME field is unmatched. | When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a |
TRAILING_PATRONYMIC_DELETION[a] | The unmatched token is a patronymic which has been truncated in the other name. | When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION. |
TRUNCATED_EXACT_MATCH | The tokens are identical except that one has been slightly truncated. | "murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH. |
TRUNCATED_HMM_MATCH | The tokens are similar, but not identical, and one has been slightly truncated. | "gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH. |
UNKNOWN_FIELD_MATCH | One of the tokens is part of an "unknown" field in a fielded name. The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API. | When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH. |
[a] Only applies to Latin script names of Arabic origin.
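In a pairwise match response, the reason field of each scoreTuples entry carries one of the phenomenon names from this table. The Python sketch below tallies the phenomena in a response and finds the lowest-scoring span, which is often the quickest way to see what is holding a score down. The data is abbreviated from the name matching example in this guide, and the helper names are hypothetical.

```python
from collections import Counter

# scoreTuples abbreviated from the name matching example in this guide.
score_tuples = [
    {"left": "john", "right": "john", "reason": "MATCH", "scoreInContext": 1},
    {"left": "robertedward", "right": "", "reason": "DELETION", "scoreInContext": 0.469},
    {"left": "smith", "right": "smyth", "reason": "HMM_MATCH", "scoreInContext": 0.7237045473947534},
]


def tally_reasons(tuples):
    """Count how often each match phenomenon occurs."""
    return Counter(t["reason"] for t in tuples)


def weakest_tuple(tuples):
    """Return the span with the lowest scoreInContext."""
    return min(tuples, key=lambda t: t["scoreInContext"])


counts = tally_reasons(score_tuples)
worst = weakest_tuple(score_tuples)
print(dict(counts))
print(f'{worst["reason"]}: "{worst["left"]}" vs "{worst["right"]}" ({worst["scoreInContext"]})')
```

Here the weakest span is the DELETION of "robertedward", whose score equals the deletionScore of 0.469 passed in that request.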
Parameters
This table lists the parameters that can be configured via parameter_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in parameter_defs.yaml.
Parameter name | Applies to | Impacts | |||||||||||||||||||||||||||||||||||||||||||||||
addressCrossFieldScoreThreshold | Addresses | Can affect any kind of match | |||||||||||||||||||||||||||||||||||||||||||||||
addressDeletionScore | Addresses | DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
addressDifferentGroupPenalty | Addresses | Can affect any kind of match | |||||||||||||||||||||||||||||||||||||||||||||||
addressFinalBias | Addresses | All address match scores | |||||||||||||||||||||||||||||||||||||||||||||||
addressJoinedTokenLimit | Addresses | Concatenation[a] | |||||||||||||||||||||||||||||||||||||||||||||||
addressOverrideDefaultScore | Addresses | OVERRIDE match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
addressOverrideTablePath | File locations | Internal engineering detail | |||||||||||||||||||||||||||||||||||||||||||||||
addressReorderPenalty | Addresses | Reordering[e] | |||||||||||||||||||||||||||||||||||||||||||||||
addressSameGroupPenalty | Addresses | Can affect any kind of match | |||||||||||||||||||||||||||||||||||||||||||||||
addressStopPatternsPath | File locations | Internal engineering detail | |||||||||||||||||||||||||||||||||||||||||||||||
addressUnpairedFieldScore | Addresses | FIELD_DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
adjustOneSidedDeletionScores | All names | DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
allowNullValue | Elasticsearch | Elasticsearch setting | |||||||||||||||||||||||||||||||||||||||||||||||
alternativePairsToCheck | All names | Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names) | |||||||||||||||||||||||||||||||||||||||||||||||
alternativeTimeProximityMatch | Dates | All date match scores | |||||||||||||||||||||||||||||||||||||||||||||||
boostWeightAtBothEnds | All names | All name match scores | |||||||||||||||||||||||||||||||||||||||||||||||
boostWeightAtLeftEnd | All names | All name match scores | |||||||||||||||||||||||||||||||||||||||||||||||
boostWeightAtRightEnd | All names | All name match scores | |||||||||||||||||||||||||||||||||||||||||||||||
caseSensitiveData | All names | INITIALISM match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
cityAddressFieldWeight | Addresses | Weighting | |||||||||||||||||||||||||||||||||||||||||||||||
cityDistrictAddressFieldWeight | Addresses | Weighting | |||||||||||||||||||||||||||||||||||||||||||||||
cognateOverrideScore | All names | OVERRIDE match phenomenon for tokens marked as COGNATE in the override file | |||||||||||||||||||||||||||||||||||||||||||||||
conflictScore | All names | CONFLICT match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
conflictThreshold | All names | CONFLICT match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
countryAddressFieldWeight | Addresses | Weighting | |||||||||||||||||||||||||||||||||||||||||||||||
countryRegionAddressFieldWeight | Addresses | Weighting | |||||||||||||||||||||||||||||||||||||||||||||||
crossFieldInitialsPenalty | Fielded names | INITIAL_MATCH match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
crossFieldJoinInitialPenalty | Fielded names | Concatenation[a] | |||||||||||||||||||||||||||||||||||||||||||||||
crossFieldJoinPenalty | Fielded names | Concatenation[a] | |||||||||||||||||||||||||||||||||||||||||||||||
crossFieldMatchPenalty | Fielded names | Can affect any kind of match | |||||||||||||||||||||||||||||||||||||||||||||||
crossLanguageGenderConflictPenalty | All names | Gender mismatch[b] | |||||||||||||||||||||||||||||||||||||||||||||||
dateFinalBias | Dates | All date match scores | |||||||||||||||||||||||||||||||||||||||||||||||
dateOrdering | Dates | All date match scores | |||||||||||||||||||||||||||||||||||||||||||||||
dayDistanceWeight | Dates | All date match scores | |||||||||||||||||||||||||||||||||||||||||||||||
deletionScore | All names | DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
detectableLanguagesModelBased | All names | Can affect any kind of match where one or more names is in Latin script and the language is not already specified. | |||||||||||||||||||||||||||||||||||||||||||||||
detectableLanguagesRuleBased | All names | Currently, you can only enable detection of Latin script as Turkish or Vietnamese. | |||||||||||||||||||||||||||||||||||||||||||||||
editDistanceScoreBias | All names | Can affect any kind of match | |||||||||||||||||||||||||||||||||||||||||||||||
enableDynamicConfigurationEndpoints | Elasticsearch | Elasticsearch setting | |||||||||||||||||||||||||||||||||||||||||||||||
enablePromisingTermFiltering | Speed/Accuracy | Performance only | |||||||||||||||||||||||||||||||||||||||||||||||
enableYueReadings | All names | Names written in Han script | |||||||||||||||||||||||||||||||||||||||||||||||
entranceAddressFieldWeight | Addresses | Weighting | |||||||||||||||||||||||||||||||||||||||||||||||
equivalenceClassesPath | File locations | Internal engineering detail | |||||||||||||||||||||||||||||||||||||||||||||||
estimatedConflictOrDeletionScore | All names | Internal engineering detail | |||||||||||||||||||||||||||||||||||||||||||||||
exactLatnMatchScore | All names | Token normalization | |||||||||||||||||||||||||||||||||||||||||||||||
expensiveScorerJoinedTokenLimit | All names | Concatenation[a] | |||||||||||||||||||||||||||||||||||||||||||||||
fieldBlockedScore | Fielded names | OUT_OF_ORDER_DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
fieldConflictScore | Fielded names | CONFLICT match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
fieldDeletionScore | Fielded names | DELETION match phenomenon | |||||||||||||||||||||||||||||||||||||||||||||||
finalBias | All names | All name match scores | |||||||||||||||||||||||||||||||||||||||||||||||
frequencyRankBias | All names | Can affect any kind of match |
genderConflictPenalty | All names | Gender mismatch[b] |
genderConflictPenaltyThreshold | All names | Gender mismatch[b] |
globalTokenCacheConfig | Speed/Accuracy | Performance only |
globalTokenPairCacheConfig | Speed/Accuracy | Performance only |
haniAbbreviationScore | All names | INITIALISM match phenomena in Han script |
haniAbbreviationThreshold | All names | INITIALISM match phenomena in Han script |
haniFourCornerCodeMismatchPenalty | All names | Names written in Han script |
hmmNormalizationAlternative | All names | HMM_MATCH phenomenon |
hmmScoreBias | All names | HMM_MATCH phenomenon |
hmmScoreLimit | All names | HMM_MATCH phenomenon |
houseAddressFieldWeight | Addresses | Weighting |
houseNumberAddressFieldWeight | Addresses | Weighting |
ignoreBadData | Elasticsearch | Elasticsearch setting |
improveSingleDigitManipulationMatch | Dates | Date match scores containing exactly one instance of digit manipulation[c] and no other differences |
initialFrequencyRank | All names | INITIAL_MATCH match phenomenon |
initialismMismatchPenalty | All names | HMM_MATCH phenomenon |
initialismScore | All names | INITIALISM match phenomenon |
initialsConflictScore | All names | CONFLICT match phenomenon |
initialsDeletionPenalty | All names | DELETION match phenomenon |
initialsScore | All names | INITIAL_MATCH match phenomenon |
islandAddressFieldWeight | Addresses | Weighting |
joinedTokenInitialsPenalty | All names | Concatenation[a]; INITIAL_MATCH match phenomenon |
joinedTokenLimit | All names | Concatenation[a] |
joinedTokenPenalty | All names | Concatenation[a] |
levelAddressFieldWeight | Addresses | Weighting |
libpostalDataDirPath | File locations | Internal engineering detail |
lowWeightTokenFrequencyRank | All names | Can affect any kind of match |
lowWeightTokenPath | File locations | Internal engineering detail |
maximumAlternateTokenizationRelativeDistance | All names | Affects tokenization and therefore any potential score |
maximumOrganizationInitialismLength | Organization names | INITIALISM match phenomenon |
maximumPersonInitialismLength | Person names | INITIALISM match phenomenon |
maxYearDistanceForDigitManipulation | Dates | Date match scores containing exactly one instance of digit manipulation[c] and no other differences |
minFieldWeightFactor | Fielded names | Weighting |
minimumAlternateTokenizationLength | All names | Affects tokenization and therefore any potential score |
minimumOrganizationInitialismLength | Organization names | INITIALISM match phenomenon |
minimumPersonInitialismLength | Person names | INITIALISM match phenomenon |
monthDistanceWeight | Dates | All date match scores |
nameBigramQueryBoost | Lucene searches | First-pass accuracy; can affect any kind of match |
nameDoubleMetaphoneQueryBoost | Lucene searches | First-pass accuracy; can affect any kind of match |
nameGluedQueryBoost | Lucene searches | First-pass accuracy; can affect any kind of match |
nameInitialQueryBoost | Lucene searches | First-pass accuracy; can affect any kind of match |
nameLengthMismatchPenalty | All names | DELETION match phenomenon; Concatenation[a]; any phenomenon that changes the number of tokens in a name |
nameRealWorldIdQueryBoost | Lucene searches | First-pass accuracy; can affect any kind of match |
ngramLMPath | File locations | Internal engineering detail |
ngramThresholdPath | File locations | Internal engineering detail |
nicknameOverrideScore | All names | OVERRIDE match phenomenon for tokens marked as NICKNAME in override file |
numericTokenFrequencyRank | All names | Can affect any kind of match |
outOfOrderDeletionScore | All names | OUT_OF_ORDER_DELETION match phenomenon |
parseUnknownFieldMarker | Fielded names | UNKNOWN_FIELD_MATCH match phenomenon |
poBoxAddressFieldWeight | Addresses | Weighting |
postCodeAddressFieldWeight | Addresses | Weighting |
queryAlternativeOriginLanguages | Speed/Accuracy | Can affect any kind of match |
realWorldIdsPath | File locations | Internal engineering detail |
realWorldIdsPathUser | File locations | Internal engineering detail |
reorderCorrection | All names | Rotation |
reorderCorrectionThreshold | All names | Rotation[d] |
reorderPenalty | All names | Reordering[e] |
rniFullnameOverridesPath | File locations | Internal engineering detail |
rntFullnameOverridesPath | File locations | Internal engineering detail |
roadAddressFieldWeight | Addresses | Weighting |
sameNameUnknownFieldMatchInterpolator | Fielded names | UNKNOWN_FIELD_MATCH match phenomenon |
staircaseAddressFieldWeight | Addresses | Weighting |
stateAddressFieldWeight | Addresses | Weighting |
stateDistrictAddressFieldWeight | Addresses | Weighting |
stopPatternsPath | File locations | Internal engineering detail |
stringDistanceWeight | Dates | All date match scores |
stuckInitialAffixMinLength | All names | STUCK_INITIAL match phenomenon |
stuckInitialScore | All names | STUCK_INITIAL match phenomenon |
suburbAddressFieldWeight | Addresses | Weighting |
thresholdToDropoffBiasMapping | Dates | All date match scores |
timeDistanceWeight | Dates | All date match scores |
timeProximityYearInterval | Dates | All date match scores |
tokenizeOrganizationsWithNumbers | Organization names | Affects tokenization and therefore any potential score |
tokenOverridesPath | File locations | Internal engineering detail |
trailingPatronymicDeletionScore | Person names | TRAILING_PATRONYMIC_DELETION match phenomenon |
truncationFractionLimit | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
truncationScorerBias | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
tryAlternateTokenization | All names | Affects tokenization and therefore any potential score |
tryDayMonthSwap | Dates | All date match scores |
unigramLMPath | File locations | Internal engineering detail |
unitAddressFieldWeight | Addresses | Weighting |
unknownFieldFrequencyRank | Fielded names | UNKNOWN_FIELD_MATCH match phenomenon |
unknownVsKnownScore | Fielded names | UNKNOWN_FIELD_MATCH match phenomenon |
unknownVsUnknownScore | Fielded names | UNKNOWN_FIELD_MATCH match phenomenon |
useEmbeddings | Organization names | EMBEDDING_MATCH match phenomenon |
useSolrPhraseQueries | Solr | Solr plugin setting |
variantOverrideScore | All names | OVERRIDE match phenomenon for tokens marked as VARIANT in the override file |
worldRegionAddressFieldWeight | Addresses | Weighting |
yearDistanceWeight | Dates | All date match scores |
[a] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
[b] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
[c] A digit manipulation is a transformation to a digit that can be accomplished with minimal additional lines. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0↔8, 1↔7, 3↔8, 5↔8, 5↔6, 6↔8, 7↔2.
[d] If the tokens in a name have been rotated, the reorder penalty will negatively impact the match score. RNI detects and compensates for this error.
[e] Tokens that match, but that appear to be out-of-order, have their match scores adjusted to reflect that fact.
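The definition of digit manipulation in footnote [c] can be made concrete with a short sketch. The following Python is only an illustration of that definition, not RNI's implementation: the function names are hypothetical, the pair list is copied from the footnote, and parameters such as maxYearDistanceForDigitManipulation are not modeled. It checks whether two equal-length date strings differ in exactly one position and whether that difference is one of the listed digit pairs.

```python
# Digit pairs copied from footnote [c]; each pair is treated as symmetric.
DIGIT_MANIPULATIONS = {
    frozenset(pair) for pair in ("08", "17", "38", "58", "56", "68", "72")
}


def is_digit_manipulation(a: str, b: str) -> bool:
    """Return True if characters a and b form one of the listed digit pairs."""
    return frozenset((a, b)) in DIGIT_MANIPULATIONS


def has_single_digit_manipulation(date1: str, date2: str) -> bool:
    """Return True if the two strings differ in exactly one position
    and that difference is a listed digit manipulation.

    Hypothetical helper for illustration; it compares raw strings and
    does not parse or validate dates.
    """
    if len(date1) != len(date2):
        return False
    diffs = [(x, y) for x, y in zip(date1, date2) if x != y]
    return len(diffs) == 1 and is_digit_manipulation(*diffs[0])
```

For example, under these assumptions `has_single_digit_manipulation("1987-03-15", "1981-03-15")` holds because the only difference is 7↔1, while a pair of dates that differ in two positions, or by an unlisted pair such as 7↔4, does not qualify.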
Internal parameters
This table lists the parameters that can be configured via internal_param_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in internal_param_defs.yaml.
Important
We recommend against modifying these parameters unless Rosette support advises you to do so.
Name | Applies to | Impacts |
---|---|---|
affixGlueThreshold[a] | All names and addresses | Concatenation[b] |
allLanguageSupport | All names | Can affect any kind of match |
allowCacheBonuses | All names | Internal engineering detail |
alwaysComputeSuffixes[a] | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
araRNISpeedOption | Translated names | Speed and accuracy tradeoff |
crossSurnameMatchPenalty | All names | Matches in languages with onomastic information |
debuggableIndex | N/A | Internal engineering detail; has no effect on matching |
debugPrintTuples | N/A | Internal engineering detail; has no effect on matching |
defaultScoreToCheckRestriction | All names; Dates; Addresses | First-pass scoring |
disabledLanguages | All names | |
doFrontTruncations | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
doQueryBigrams | All names | First-pass accuracy |
doQueryCompleted | All names | First-pass accuracy |
doQueryFullnameOverrides | All names | First-pass accuracy |
doQueryFuzzy | All names | First-pass accuracy |
doQueryGlued | All names | First-pass accuracy |
doQueryIndexKeys | All names | First-pass accuracy |
doQueryInitials | All names | First-pass accuracy |
doQueryNormalized | All names | First-pass accuracy |
doQueryPersonInitialisms | All names | First-pass accuracy |
doQueryPhrase | All names | First-pass accuracy |
doQueryRealWorldIds | All names | First-pass accuracy |
doQueryTokenOverrides | All names | First-pass accuracy |
doQueryTranslated | All names | First-pass accuracy |
doViterbiRescaling | All names | Fuzzy match[c] |
editDistanceTokenScorerPenalty | All names | STRING_SIMILARITY match phenomenon |
embeddingBias | Organization names | EMBEDDING_MATCH match phenomenon |
embeddingZeroScore | Organization names | EMBEDDING_MATCH match phenomenon |
enableAdditionalOnomastics | All names | Matches in languages with onomastic information |
enableRemoteTokenScorer | All names | Japanese/English fuzzy match[c] |
enableSeq2SeqTokenScorer | All names | Japanese/English fuzzy match[c] |
enableTokenPairLogging | N/A | Internal engineering detail; has no effect on matching |
engEngFastMode | All names | English/English fuzzy match[c] |
expandedLanguages | All names | Fuzzy match[c]; OVERRIDE match phenomenon |
expansionLimit | All names | Fuzzy match[c]; OVERRIDE match phenomenon |
expansionScoreThreshold | All names | Fuzzy match[c]; OVERRIDE match phenomenon |
familiarTokenMismatchPenalty | All names | Can affect any kind of match |
familiarTokenThreshold | All names | Can affect any kind of match |
firstPassDayRange | Dates | Performance only |
firstPassMonthRange | Dates | Performance only |
firstPassYearRange | Dates | Performance only |
foreignAddressFinalBias | Addresses | All English-to-non-English address matches |
genderPenaltyMinimumLength | All names | Gender mismatch[d] |
givenFieldDeletionScore | Fielded names | DELETION match phenomenon |
HMMCachePerProcess | All names | Internal engineering detail; HMM_MATCH phenomenon |
HMMCachePerThread | All names | Internal engineering detail; HMM_MATCH phenomenon |
hmmNormBias | All names | Internal engineering detail; Fuzzy match[c] |
HMMUsageThreshold | All names | Internal engineering detail; HMM_MATCH phenomenon |
identifierEditDistanceTokenScorerPenalty | Identifiers | STRING_SIMILARITY match phenomenon |
ignoreTranslationOrigins | All names | Can affect any kind of match that uses English transliteration |
includeExtraKatakanaPersonReadings | Translated names | Can affect any kind of match |
initialAndSuffixMinLength | All names | Fuzzy match[c]; INITIAL_MATCH match phenomenon |
initialAndSuffixScore | All names | Fuzzy match[c]; INITIAL_MATCH match phenomenon |
jniBias | All names | Can affect any kind of match in languages that use a JNI scorer |
jpnRNISpeedOption | Translated names | Speed and accuracy tradeoff |
kanjiMismatchPenalty | All names | Normalization of tokens that include kanji |
katakanaTransliterationsOnly | Translated names | Can affect any kind of match |
korRNISpeedOption | Translated names | Speed and accuracy tradeoff |
latinDataAlternativesToCheck | All names | Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names) |
limitedLanguageEditDistance | All names | STRING_SIMILARITY match phenomenon |
maxIdentifierEditDistance | All names | First-pass accuracy |
notExactMatchPenalty | All names | Normalization |
_postCodePathAddressFieldWeight | Addresses | Weighting |
promisingFuzzyTermFrequencyFactor | Speed/Accuracy | Performance only |
promisingTermFrequencyFactor | Speed/Accuracy | Performance only |
queryMaxResults | All names; Dates; Addresses | First-pass scoring |
queryMaxToCheck | All names; Dates; Addresses | First-pass scoring |
queryMaxToConsider | All names; Dates; Addresses | First-pass scoring |
queryToCheckAllowance | All names; Dates; Addresses | First-pass scoring |
realWorldIdScore | Organization names | Real-world ID match[e] |
remoteTokenScorerURL | All names | Internal engineering detail; Japanese/English fuzzy match[c] |
rntTokenOverridesPath | File locations | Internal engineering detail |
rusRNISpeedOption | Translated names | Speed and accuracy tradeoff |
secondarySurnameTokenTypeWeight | All names | Matches in languages with onomastic information |
seq2seqCachePerProcess | All names | Internal engineering detail; Japanese/English fuzzy match[c] |
seq2seqCachePerThread | All names | Internal engineering detail; Japanese/English fuzzy match[c] |
seq2seqTokenOverridesPath | File locations | Internal engineering detail |
seq2seqUsageThreshold | All names | Internal engineering detail; Japanese/English fuzzy match[c] |
splitTokens | All names | Internal engineering detail |
stringDistanceThreshold[a] | All names | Fuzzy match[c] |
surnameFieldDeletionScore | Fielded names | DELETION match phenomenon |
surnameTokenTypeWeight | All names | Matches in languages with onomastic information |
taggerMinimumConfidenceThreshold | All names | Matches in languages with onomastic information |
translatorResultsToKeep | Translated names | Can affect any kind of match |
truncationAffixSimilarityLength | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
truncationAffixSimilarityThreshold | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
truncationLengthLimit | All names | TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon |
useCharacterLM | All names | Can affect any kind of match |
useEditDistanceTokenScorer | All names | STRING_SIMILARITY match phenomenon |
useIdentifierEditDistanceTokenScorer | Identifiers | STRING_SIMILARITY match phenomenon |
useLM | All names | Can affect any kind of match |
useOldAndNewNameSegmentationForJapanese | All names | Can affect any kind of match involving Japanese translations |
useRealWorldIds | Organization names | Real-world ID match[e] |
zhoRNISpeedOption | Translated names | Speed and accuracy tradeoff |
[a] Unlike public parameters for this feature, this is a speed/accuracy tradeoff, not a science-tuning parameter.
[b] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
[c] A fuzzy match is a match between tokens that are similar but not identical. The HMM_MATCH and SEQ2SEQ_MATCH phenomena are examples of this.
[d] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
[e] RNI contains real-world identifiers for corporations, which pair an entity ID with nicknames and common permutations of the corporation name.
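The concatenation behavior described in footnote [b] can be pictured with a small sketch. This Python is a simplified illustration under stated assumptions, not RNI's algorithm: the function names and the `max_join` argument are hypothetical (and are not claimed to correspond to joinedTokenLimit), and an exact case-insensitive comparison stands in for RNI's actual token scoring. It joins runs of adjacent tokens from one name and reports those that equal a token in the other name.

```python
from typing import Iterator, List, Tuple


def joined_tokens(tokens: List[str], max_join: int = 2) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, joined) for every run of 2..max_join adjacent tokens.

    max_join is a hypothetical cap on how many tokens are glued together.
    """
    for start in range(len(tokens)):
        for size in range(2, max_join + 1):
            end = start + size
            if end > len(tokens):
                break
            yield start, end, "".join(tokens[start:end])


def concatenation_matches(
    name_a: List[str], name_b: List[str], max_join: int = 2
) -> List[Tuple[int, int, str]]:
    """Return joined runs from name_a that equal some token of name_b.

    Exact, case-insensitive equality is an assumption made for brevity;
    a real matcher would score similarity rather than test equality.
    """
    other = {token.lower() for token in name_b}
    return [
        (start, end, joined)
        for start, end, joined in joined_tokens(name_a, max_join)
        if joined.lower() in other
    ]
```

With these assumptions, comparing the tokens `["Mary", "Ann", "Smith"]` against `["Maryann", "Smith"]` reports the run covering the first two tokens, since "Mary" and "Ann" joined together equal "Maryann" when case is ignored.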
[1] Copyright by Elasticsearch BV. Dual licensed under Server Side Public License (SSPL version 1) and the Elastic License 2.0 (ELv2).
[2] For RNI plugins that support earlier versions of Elasticsearch (such as 1.x.y), contact support@rosette.com.
[3] # may also be used after an entry on the same line to begin a comment.
[4] Override files are not provided for all supported languages. Specifically, while no files are provided for Russian or Korean, you can create token pair files for these languages.
[5] RNI depends on the jpostal binding for the open source libpostal library to parse unfielded addresses as a pre-processing step. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support@basistech.com if you discover any issues.
[6] # may also be used after an entry on the same line to begin a comment.
[7] # may also be used after an entry on the same line to begin a comment.