Match for OpenSearch

Babel Street Match

Introduction

Babel Street Match (Match) provides the linguistic infrastructure and Java APIs to perform name matches, name searches, and translations across an expanding collection of languages and scripts.

You can use Match to perform the following tasks:

For information about other Babel Street products that can help with processing documents, extracting names, and additional text analytics, contact analyticssupport@babelstreet.com.

Overview of name matching

Match uses machine learning and other natural language processing (NLP) techniques to perform name matching. A match score is a relative indication of how similar two names are, or of how similar a search name is to a name in an index; the higher the score, the stronger the match. You can tune and configure Match to fit your business and data.

There are two common usage patterns in name and address matching: pairwise and index.

  • In pairwise matching, you have two names or addresses that you are comparing directly to one another. This comparison results in a single similarity score that indicates how similar the two names are.

  • With index matching, you have a single name or address that you are comparing to a list. This can be thought of as a search problem. You have a name and want to search a large list of records to find a match.

Index matching includes pairwise matching. When querying an index, Match performs a two-pass search:

  1. Generate candidates: The first pass is designed to quickly generate a set of candidates for the second pass to consider.

  2. Pairwise match: The query value is compared with each value returned by the first pass and a similarity score is calculated for each pair.
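The two-pass flow above can be sketched in a few lines of Java. This is a toy illustration only: the class name TwoPassSearchSketch, the token-based candidate lookup, and the Jaccard token overlap used as the pairwise score are all assumptions made for the example, not Match's actual candidate generation or scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy two-pass index search; the scoring is a stand-in, not Match's algorithm. */
final class TwoPassSearchSketch {
    /** A candidate name together with its second-pass score. */
    static final class Hit {
        final String name;
        final double score;
        Hit(String name, double score) { this.name = name; this.score = score; }
    }

    // Inverted index from lowercase token to the names that contain it.
    private final Map<String, Set<String>> tokenToNames = new HashMap<>();

    private static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(name.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    void add(String name) {
        for (String token : tokens(name)) {
            tokenToNames.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(name);
        }
    }

    // Pass 1: cheaply collect every indexed name that shares a token with the query.
    private Set<String> candidates(String query) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : tokens(query)) {
            result.addAll(tokenToNames.getOrDefault(token, Set.of()));
        }
        return result;
    }

    // Stand-in pairwise score: Jaccard overlap of the two token sets, in [0, 1].
    static double pairwiseScore(String a, String b) {
        Set<String> shared = tokens(a);
        Set<String> union = new HashSet<>(shared);
        Set<String> other = tokens(b);
        union.addAll(other);
        shared.retainAll(other);
        return (double) shared.size() / union.size();
    }

    // Pass 2: score the query against each candidate and sort by descending score.
    List<Hit> search(String query) {
        List<Hit> hits = new ArrayList<>();
        for (String candidate : candidates(query)) {
            hits.add(new Hit(candidate, pairwiseScore(query, candidate)));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

With the names John Smith, Jon Smith, and Mary Jones in the toy index, a query for John Smith retrieves only the two names that share a token with it, and the second pass ranks John Smith first.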

Language support

Match can match names in any language. For the languages listed in Fully supported text domains for name matching, Match calculates a match score using a variety of techniques, as described in Understanding name match scores. For languages not listed in those tables, Match provides limited support, as described in Language support parameters.

Note

Prior to release 7.36.0, Match did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set Match to behave as it did previously, set allLanguageSupport to false.

Documentation

This guide provides information on installing, running the sample applications included in Match, setting up a development environment, and creating applications that use the runtime environment to incorporate Match functionality.

The Java API is documented in HTML Javadoc pages generated from the source code, found at api-reference/index.html.

For instructions on using Match with RLP, see Installing RLP with Match.

Getting started

Requirements

  • Java SDK 11 through 23. Match is tested with OpenJDK.

  • Apache Ant 1.7.1 or later to use the Ant build scripts we provide to build and run the samples.

  • The compressed SDK package file for your platform.

    See Supported Platforms and Match Package File Names.

  • The Match documentation set includes the following:

    • Release Notes with up-to-date information about new features and bug fixes in this release

    • The Match Application Developer's Guide (this document)

    • Online reference to the Java API

  • The license file: rlp-license.xml.

Important

Unless otherwise specified, all inputs to Match need to be UTF-8 encoded.

Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
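One way to check a file before passing it to Match is to decode it strictly as UTF-8 and reject it if decoding fails. The sketch below uses only the standard java.nio.charset API; the class and method names are illustrative and are not part of Match.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of Match) for strict UTF-8 validation. */
final class Utf8Check {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII is valid UTF-8, so this check cannot detect text that was lossily converted to ASCII (for example, with accented characters replaced by ?). It catches only byte sequences that are not well-formed UTF-8, such as UTF-16 data that starts with a byte order mark.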

Supported platforms

You must install an SDK package that is appropriate for your platform with respect to operating system and CPU. Since the public API for Match is Java, the C++ compiler that appears in the following list is irrelevant.

Table 18. Supported Platforms

| OS | CPU | Compiler | $BT_BUILD [a] |
|---|---|---|---|
| macOS v15+ (Sequoia) | AMD64 | xcode 16 | amd64-darwin24-xcode16 |
| macOS v15+ (Sequoia) | AARCH64 | xcode 16 | aarch64-darwin24-xcode16 |
| Linux | AMD64 | gcc 9.4 | amd64-glibc231-gcc9 |
| Linux | AARCH64 | gcc 11 | aarch64-glibc234-gcc11 |
| Linux [b][c] | AMD64 | gcc 7.3 | amd64-glibc226-gcc73 |
| Linux [b][c] | AARCH64 | gcc 7.3 | aarch64-glibc226-gcc73 |
| Windows | AMD64 | Visual Studio 2013 | amd64-w64-msvc120 |
| Java Only [d] | n/a | n/a | jvm |

[a] $BT_BUILD is embedded in the name of the downloaded package. It is also the subdirectory name used in various locations for platform-specific files, such as binary library files.

[b] Valid for Match for ElasticSearch 8.4.1.x

[c] Valid for Match for OpenSearch

[d] The Java-only SDK runs on any OS and CPU with 64-bit Java SDK 11 through 23.



The compressed SDK package file names take the form:

rni-rnt-<version>-sdk-$BT_BUILD.<ext> 

where <version> is the Match version (in the format x.xx.x.cxx.x), $BT_BUILD is the platform designator from the table above, and <ext> is zip for Windows and Java-only packages or tar.gz for Unix platforms.
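As a small illustration of this naming pattern, the following sketch assembles a package file name from a version string and a $BT_BUILD value. The helper is hypothetical, and recognizing Windows builds by the -w64- fragment is an assumption based on the single Windows row in the table above.

```java
/** Hypothetical helper that builds an SDK package file name from its parts. */
final class PackageNames {
    static String sdkPackageName(String version, String btBuild) {
        // Assumption: Windows builds contain "-w64-"; the Java-only build is "jvm".
        boolean zip = btBuild.equals("jvm") || btBuild.contains("-w64-");
        return "rni-rnt-" + version + "-sdk-" + btBuild + (zip ? ".zip" : ".tar.gz");
    }
}
```

For example, passing the placeholder <version> with aarch64-glibc234-gcc11 yields rni-rnt-<version>-sdk-aarch64-glibc234-gcc11.tar.gz.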

Note

The version number is embedded in the package file name.

Documentation Files
  • Match-<version>-api-reference.zip 

  • Match-<version>-ReleaseNotes.pdf 

  • Match-<version>-AppDevGuide.pdf 

Installing Match

When you obtain Match, you should receive the following files:

  1. The SDK package listed above for your platform: e.g., rni-rnt-<version>-sdk-amd64-glibc231-gcc9.tar.gz

  2. The license file: rlp-license.xml.

Expand the SDK into the install directory, which we will call $BT_ROOT, and copy the license to the $BT_ROOT/rlp/rlp/licenses subdirectory.

Once you have installed Match, you can install RLP. See the instructions in Installing RLP with Match.

Note

Windows users must add

$BT_ROOT\rlp\bin\*

to the PATH environment variable, replacing * with the name of the subdirectory that contains the platform-specific binary library files (for example, amd64-w64-msvc120).

Note on logging

Match uses the Logging Facade for Java (SLF4J) to log Match activities. See http://www.slf4j.org/.

SLF4J is a facade for various logging APIs. Using SLF4J, the developer or an administrator can determine which one of many popular logging systems to use at runtime.

This is done by including one and only one adapter jar on the classpath, such as slf4j-log4j-1.7.36.jar, for the logging system of your choice, and the jar for that logging system (such as log4j-2.19.0.jar). You also need to include the SLF4J API jar, slf4j-api-1.7.36.jar, on the classpath.

By default, all activity is logged to the console. To log to a file and to control the level of logging, place an adapter jar, a logging library, an SLF4J API jar, and the appropriate properties file (e.g., log4j.properties if you are using log4j) on your classpath.

The adapter, logging, and API jars mentioned above are in samples/java/lib. A copy of log4j.properties, which is used by our samples, is in samples/java/logging. You should adjust the copy of log4j.properties that you place on your classpath to meet your specific runtime logging needs.

libpostal data directory

Match uses libpostal to parse addresses; libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.

Match packages libpostal data in plugins/babel-street-match/bt_root/rlpnc/data/libpostal. The data directory is relatively large (about 2 GB). If you are certain that you won't be matching unfielded addresses, you can safely delete the libpostal data directory without affecting any other Match functionality.

RLP with Match

If you are using both RLP and Match together, first install Match. Next, install RLP in the same location. When you are installing RLP there will be some overlap of directories, which is expected. Allow the RLP directories and files to replace the existing directories and files.

Setting up your development environment

When building or running a Match application, you must include the following JAR files on your classpath:

  • btrlpnc.jar

  • bt-adm-model-<admversion>.jar

  • btcommon-api-<apiversion>.jar

  • btcommon-api-jackson-<jacksonversion>.jar

  • btcommon-lib-<libversion>.jar

  • icu4j-<icuversion>.jar

If using the seq2seq model for Katakana-English matching, you must also include the following JAR files on your classpath:

  • btrlpnc-seq2seq.jar

  • tensorflow-core-native-<tensorflowversion>.jar

    If you need GPU support, replace the tensorflow file with the version compiled for your platform. macOS must be at version 10.13 or higher.

  • jna-<jnaversion>.jar

These files are in $BT_ROOT/rlpnc/lib/jvm.

For information about $BT_ROOT (the Match root directory) and $BT_BUILD (the platform designator), see Installing Match.

To use the Ant scripts described in Building and running the sample applications, make sure you have Ant (1.7.1 or later), the JAVA_HOME environment variable is set to the root of your Java SDK, and the Java SDK bin directory is on your PATH.

Handling the runtime environment

Match uses data resources stored in the file system in standard locations relative to the Basis root directory ($BT_ROOT). Accordingly, you must follow a few basic rules when you are assembling an application that includes Match functionality.

  • Prior to accessing the Match API, you must set the Basis root directory.

  • Match uses singleton Environment objects to hold read-only shared data. Depending on the operations you perform, you may need to explicitly instantiate an Environment object before you perform these operations and close the Environment object when you are done.

Setting the Match root directory

The API provides two ways of performing this action:

  • Use com.basistech.names.internal.Pathnames.setBTRootDirectory(String BT_ROOT).

  • Set the bt.root system property. You can do this from the command line when you launch the Java virtual machine:

    java -Dbt.root=$BT_ROOT ...

    where $BT_ROOT is the path to the Match root directory.
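If you prefer to set the property from code rather than on the command line, the standard System.setProperty call can be used. The sketch below is a minimal example that assumes Match reads bt.root after this call has run, so it should execute before any Match class is used; the class name, method name, and directory check are illustrative.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative startup helper; not part of the Match API. */
final class BtRootSetup {
    /** Sets the bt.root system property after checking that the directory exists. */
    static Path configureBtRoot(String btRoot) {
        Path root = Paths.get(btRoot).toAbsolutePath().normalize();
        if (!Files.isDirectory(root)) {
            throw new IllegalArgumentException("Match root directory not found: " + root);
        }
        // Assumption: Match consults this property when it first needs the root directory.
        System.setProperty("bt.root", root.toString());
        return root;
    }
}
```

Failing fast on a missing directory gives a clearer error than a later failure to load data files.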

You can also set up an overlay directory. This directory must have an identical structure to the normal root directory outside of the rlp/lib and rlp/bin directories. License files will only be considered from BT_ROOT and should not be moved over to the overlay root.

Important

If a location for this overlay directory is specified, either in Java with com.basistech.names.internal.Pathnames.setOverlayRootDirectory or the bt.overlay.root system property, Match will look in that location for every data/configuration file instead of the root directory. If no location is specified, Match will use the normal root directory.

Note

Libpostal data (controlled by the libpostalDataDirPath parameter, defaulting to rlpnc/data/libpostal) and word embedding data (rlpnc/data/tvec/filtered-vectors) will only be considered from BT_ROOT and should not be moved over as part of the overlay root.

Manipulating the environment

Before you use Babel Street Name Translator (RNT), you must instantiate a com.basistech.rnt.RNTEnvironment object. For example:

RNTEnvironment rntEnv = new RNTEnvironment();

The RNTEnvironment uses data files stored in the file system according to the standard RLP release hierarchy. Accordingly, you must set the Basis root directory prior to instantiating RNTEnvironment.

If your RLP license is not found in the appropriate location (rlp/rlp/licenses/rlp-license.xml) under your BT_ROOT directory, RNIConfiguration and RNTEnvironment include a setLicenseXML() method that you can use to provide the license as a string.

When you have finished performing translations, you should close the RNTEnvironment object to free resources. For example:

rntEnv.close();

When you use Match, an RNTEnvironment object is instantiated as required. If Match instantiates an RNTEnvironment object, it also closes it at the appropriate time.

A quick look at Match: running a sample program

Building and running the sample applications

To build and run the sample applications, you must have the Java SDK (11 or later). To use the Ant build files we provide to build and run the samples, you need Ant (1.7.1 or later) with the JAVA_HOME environment variable set to the root of your Java SDK. For more information, see http://ant.apache.org.

The source files for these applications and the Ant build file for compiling and running them (build.xml) are located in $BT_ROOT/rlpnc/samples/java.

Table 19. Sample Applications

| Source File | Description |
|---|---|
| AddNamesSample.java | Adds names from a UTF-8 file to a Match Index. |
| LoadGazetteerSample.java | Loads an XML gazetteer into a Match Index. |
| IndexQuerySample.java | Submits a series of queries (names) to an index and reports on the results. |
| DistributedTransactionSample.java | Queries an index, deletes the names returned from that index, and adds the names to a second index. The deletions and additions are performed in a single distributed transaction with two-phase commit. |
| MatchNamesSample.java | Determines the similarity of two or more names. |
| MatchPhenomenaSample.java | Demonstrates the different name matching phenomena that Match supports. |
| AutomatedTranslationSample.java | Translates one or more names. |
| InteractiveTranslationSample.java | Simulates a series of user interactions resulting in the translation of an Arabic name. |
| RNISolrjSample.java | Integrates Match with Solr to add and query Solr documents with multiple and multivalued name fields. |
| AddressIndexQuerySample.java | Submits a series of queries (addresses) to an index and reports on the results. |
| AddressMatchPhenomenaSample.java | Demonstrates the different address matching phenomena that Match supports. |



Your License

You must copy the license file you obtained from Babel Street to $BT_ROOT/rlp/rlp/licenses. If the license is not in place, you cannot access any Match functionality. The license defines the scope of the activities you may perform with Match.

Using the Ant build script

Tip

The Ant scripts and build files require one input property: bt.arch=$BT_BUILD (bt.arch=amd64-glibc231-gcc9, for example). If you set this property in the script (build.xml), you do not need to include it on the command line.

Change directory to $BT_ROOT/rlpnc/samples/java and run Ant:

ant -Dbt.arch=$BT_BUILD target

where target is one of the Ant build targets in the following table.

| target | Description |
|---|---|
| compile | Compiles the samples and places the class files in $BT_ROOT/rlpnc/samples/java/obj/$BT_BUILD. |
| compile.Class [a] | Compiles the specified sample. |
| run | Compiles (if necessary) and runs the samples with the command-line arguments defined in the Ant build file. Each sample prints a message to the console indicating what it has done, including any file it has created. |
| run.Class [a] | Runs the specified sample. |
| clean | Removes the class files and any files created by the samples. |
| clean.Class [a] | Removes the sample class file(s) and any file created by the sample. |
| all | Calls compile and run. |

[a] Class is the sample class name. Use the Class targets to compile, run, or clean a single sample. For example, to run LoadGazetteerSample, the target is run.LoadGazetteerSample.

As you create your own applications, you can use the Ant build file as the starting point for establishing your own build procedures.

Matching names

Match provides a Java API for matching names across the boundaries of writing scripts. For the complete list of the languages and writing scripts that name matching supports, see Fully supported text domains for name matching.

In the Match context, name matching means comparing two names, performing linguistic analysis, and returning a score (a double from 0.0 to 1.0, inclusive) that indicates how similar the two names are. A value of 1.0 is returned if and only if the two names are identical (the strings, languages, languages of origin, and entity types match). A score of less than 1.0 is returned for names that potentially match but differ in one or more ways.

Interpreting Match scores

Names are complex to match because of the large number of variations that occur within a language and across languages. Match breaks a name into tokens and compares the matching tokens. Match can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.

Match scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).

The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, the relative values of the scores are more relevant than the actual score. Similar name matches in different languages may generate different match scores. To understand how Match calculates the score, see Understanding name match scores.

Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:

| Variation | Example(s) |
|---|---|
| Phonetic and/or spelling differences | Nayif Hawatmeh and Nayif Hawatma |
| Missing name components | Mohammad Salah and Mohammad Abd El-Hamid Salah |
| Rarity of a shared name component | Two English names that contain Ditters are more likely to match than two names that contain Smith |
| Initials | John F. Kennedy and John Fitzgerald Kennedy |
| Nicknames | Bobby Holguin and Robert Holguin |
| "Cousin" or cognate names | Pedro Calzon and Peter Calzon |
| Uppercase/Lowercase | Rosa Elena PACHECO and Rosa Elena Pacheco |
| Reordered name components | Zedong Mao and Mao Zedong |
| Variable Segmentation | Henry Van Dick and Henri VanDick; Robert Smith and Robert JohnSmyth |
| Corresponding name fields | For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox] |
| Truncation of name elements | For Sawyer, the similarity with Sawy is higher than the similarity with Sawi. |

Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

You can configure Match to customize how it scores different match phenomena.

The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).

Entity types

The entityType identifies the type of name being matched and is used to select the algorithms to use for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types. Each of these entity types can be used in conjunction with the rni_name field type.

Important

The entityType should always be specified to utilize all available methods when indexing and matching names. The rni_name field type can be used in conjunction with all entity types, including PERSON, LOCATION, ORGANIZATION, and IDENTIFIER. If you don't specify an entityType, the type PERSON will be used.

Table 20. Entity types

| Entity type | Description | Features |
|---|---|---|
| PERSON | A human identified by name, nickname, or alias. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported. |
| LOCATION | A city, state, country, region or other location. | Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported. |
| ORGANIZATION | A corporation, institution, government agency, or other group of people defined by an established organizational structure. | Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World IDs are supported. |
| IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM | An alphanumeric identifier. | Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance. |
| IDENTIFIER:EMAIL | An email address. | Values are validated, split, and scored. |



Names with data fields

By using a string array (such as String[] nameData = {"John", "Smith"};), you can create a name with data fields. The maximum number of data fields is 5. We assign no explicit semantics to each field (such as given name or surname), but the order of the fields does matter when comparing two names that have fields. Match assigns lower scores to matches that cross field boundaries (e.g., the first field in one name matches the second field in another name). The use of fields may enhance accuracy when you are performing queries and matches with PERSON names in languages where standard name ordering is not the norm. By dictating a consistent name ordering, you can avoid penalties for mis-ordered tokens.

For consistency, you may want to adopt a paradigm for name fields, such as {title, given names, surname, suffix}. Include empty fields in the appropriate position for names that do not contain all these elements. If a trailing field is empty, you can leave it out. For example:

{"Mr", "John Miles", "Doe", "Jr"}

{"Queen", "Elizabeth", "", "II"}

{"Mr", "Anthony Charles", "Blair"}

{"Ms", "Rosanne Christine", "Atwood"}

{"", "Martin Luther", "King", "Jr"}

Note

When scoring a potential match between a name with data fields and a name without data fields, Match treats the name without data fields as if it were a name with one data field.

Match treats trailing empty fields as if they were not present. For example, {"Rosanne", "Taylor Smith",""} is treated the same as {"Rosanne", "Taylor Smith"}.

Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with Name.UNKNOWN_FIELD_MARKER.
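The field conventions above can be enforced before names are handed to Match. The following helper is a hypothetical sketch, not part of the Match API: it rejects arrays with more than five fields and drops trailing empty fields, mirroring how Match is described as treating them. Applying the five-field limit before trimming is an assumption.

```java
import java.util.Arrays;

/** Hypothetical pre-processing helper for names with data fields. */
final class NameFields {
    static final int MAX_FIELDS = 5;

    /** Validates the field count and removes trailing empty (or null) fields. */
    static String[] normalize(String... fields) {
        if (fields.length > MAX_FIELDS) {
            throw new IllegalArgumentException(
                    "A name may have at most " + MAX_FIELDS + " data fields, got " + fields.length);
        }
        int end = fields.length;
        // Only trailing empties are dropped; interior empties keep later fields in position.
        while (end > 0 && (fields[end - 1] == null || fields[end - 1].isEmpty())) {
            end--;
        }
        return Arrays.copyOf(fields, end);
    }
}
```

For example, {"Rosanne", "Taylor Smith", ""} normalizes to {"Rosanne", "Taylor Smith"}, while {"Queen", "Elizabeth", "", "II"} is returned unchanged because its empty field is not trailing.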

Name matching usage model

Identify two names to compare. They may be in different languages (languages of use) and writing scripts.

Use MatchScorer to score the similarity of two Name objects. MatchScorer and Name are in the com.basistech.rni.match package.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_2names.java

For the Arabic name نايف أبو شرخ and its IC transliteration Nayif Abu-Sharakh, this comparison returns a score of 0.99.

If you want to compare one name to many names, for improved efficiency you can cache the scorer with the one name (the query name) and use the cached scorer to compare that name to multiple names. As illustrated in the following code snippet, you must prepare each name that you use with the cached scorer.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_1name_tomany.java

For a sample Java application that matches two names and matches a query name against multiple reference names, see MatchNamesSample.

Configuring name matching

There are many ways to configure Match to better fit your use case and data. The two primary mechanisms are by modifying match parameters and editing overrides. You can also train a custom language model.

Match Plugin for OpenSearch Restriction

This feature requires access to system parameters not currently available in this release.

We are currently working with OpenSearch to provide access and enable these features.

Tuning match parameters

The default values of the match parameters are tuned to perform well on most queries and datasets. However, every use case involves different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.

The typical process for tuning parameters is as follows:

  1. Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.

  2. After indexing the data, run the queries using Match and determine a match score threshold that appears to provide the best results.

  3. Analyze the results to discover cases that Match failed to score high enough or that Match incorrectly scored higher than the threshold.

  4. Choose a subset of these name pairs that Match scored too low or too high that will be used as examples to tune your parameters.

  5. Tune the match parameters to change the match scores of the test set of undesirable results, so that the score is correctly above or below your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (e.g., aliases), we recommend you add them as token or full-name overrides.

  6. Run the large set of queries through Match again to test that the new parameter values still return the desired matches, and not new undesired results.
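Steps 2 through 4 amount to comparing scores against a threshold and collecting the pairs that fall on the wrong side. A minimal sketch of that bookkeeping follows; the ScoredPair type and the method names are hypothetical, and any scores you feed it would come from your own test runs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bookkeeping for reviewing scored test pairs against a threshold. */
final class ThresholdReview {
    /** One query/index pair with its score and the expected outcome from your test data. */
    static final class ScoredPair {
        final String query;
        final String indexedName;
        final double score;
        final boolean shouldMatch;

        ScoredPair(String query, String indexedName, double score, boolean shouldMatch) {
            this.query = query;
            this.indexedName = indexedName;
            this.score = score;
            this.shouldMatch = shouldMatch;
        }
    }

    /** Pairs that should match but scored below the threshold (scored too low). */
    static List<ScoredPair> missed(List<ScoredPair> pairs, double threshold) {
        List<ScoredPair> out = new ArrayList<>();
        for (ScoredPair p : pairs) {
            if (p.shouldMatch && p.score < threshold) {
                out.add(p);
            }
        }
        return out;
    }

    /** Pairs that should not match but scored at or above the threshold (scored too high). */
    static List<ScoredPair> falseAlarms(List<ScoredPair> pairs, double threshold) {
        List<ScoredPair> out = new ArrayList<>();
        for (ScoredPair p : pairs) {
            if (!p.shouldMatch && p.score >= threshold) {
                out.add(p);
            }
        }
        return out;
    }
}
```

The two lists are the natural source for the tuning examples in step 4, and rerunning both methods after a parameter change is a quick way to carry out the regression check in step 6.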

Parameter configuration files

Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.

The parameters are contained in two .yaml files located in /rlpnc/data/etc. They are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.

  • parameter_defs.yaml lists each match parameter along with the default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them could cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you do not exceed.

  • parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.

Important

Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.

Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.

Parameter profiles

The parameters in the parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the eng_eng profile. There is also an any profile which applies to all language pairs.

Parameter profiles have the following characteristics:

  • Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples:

    • spa_eng 

    • ara_jpn 

    • eng_eng 

  • They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.

  • Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.

  • The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.

  • Specific language profiles inherit values from global profiles. The profile for matching person names in any language is named any_PERSON. The profile for matching Spanish person names against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
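The naming and inheritance rules above can be expressed compactly. The sketch below is an illustration under assumptions, not Match's implementation: profileName orders the two language codes as described, and resolve layers profiles from least to most specific (any, any_<TYPE>, the language pair, then the language pair with entity type), which is one reading of the inheritance rules. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of profile naming and inheritance; not Match's implementation. */
final class ParameterProfiles {
    /** Builds a profile name such as spa_eng or spa_eng_PERSON (entityType may be null). */
    static String profileName(String lang1, String lang2, String entityType) {
        // Alphabetical order, except that eng always comes last.
        boolean swap = lang1.equals("eng")
                || (!lang2.equals("eng") && lang1.compareTo(lang2) > 0);
        String pair = swap ? lang2 + "_" + lang1 : lang1 + "_" + lang2;
        return entityType == null ? pair : pair + "_" + entityType;
    }

    /** Merges parameter values so that more specific profiles override more general ones. */
    static Map<String, Object> resolve(Map<String, Map<String, Object>> profiles,
                                       String lang1, String lang2, String entityType) {
        List<String> chain = new ArrayList<>();
        chain.add("any");
        if (entityType != null) {
            chain.add("any_" + entityType);
        }
        chain.add(profileName(lang1, lang2, null));
        if (entityType != null) {
            chain.add(profileName(lang1, lang2, entityType));
        }
        Map<String, Object> merged = new LinkedHashMap<>();
        for (String name : chain) {
            // Later (more specific) profiles overwrite values set by earlier ones.
            merged.putAll(profiles.getOrDefault(name, Map.of()));
        }
        return merged;
    }
}
```

Under this reading, a reorderPenalty set in spa_eng wins over one set in any_PERSON when matching Spanish and English PERSON names, while parameters set only in any still apply.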

Important

Global changes are made with the any profile.

Any changes to address parameters should go under the any profile, and will affect all fields for all addresses.

Any changes to date parameters must go under the any profile.

Parameter universe

A parameter universe is a named profile containing a set of Match parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles.

For example, the MyParameterUniverse universe may include the following parameter profiles:

  • "name": "MyParameterUniverse/any" applies to all language pairs.

  • "name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.

  • "name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.

Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.

Tip

You can define multiple named parameter profiles.

Define the parameter universe in the parameter_profiles.yaml file. Example:

parameterUniverseOne/spa_eng_PERSON:
    reorderPenalty: 0.4
    HMMUsageThreshold: 0.8
    stringDistanceThreshold: 0.1
    useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
    reorderPenalty: 0.6
    
Modifying name parameters

To start tuning the parameters, run the Match pairwise match on the test set and look at the match reasons in the response. The match reasons indicate which parameters to tune; the parameters themselves are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact analyticssupport@babelstreet.com.

Once you define a profile and set a parameter value, rerun the Match pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Selected name parameters

Given the large number of configurable name match parameters in Match, you should start by looking at the impact of modifying a few parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.

The following examples describe the impact of parameter changes in more detail.

Example 12. Token Conflict Score (conflictScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Joe Smith’. The ‘John’ tokens match, as do the ‘Smith’ tokens. This leaves the unmatched tokens ‘Mike’ and ‘Joe’, which are in direct conflict with each other; this parameter determines how that conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores. Alternatively, you may decide that if two of the tokens are the same, the third token (a middle name, for example) is not as important.



Example 13. Initials Score (initialsScore)

Consider the following two names: ‘John Mike Smith’ and ‘John M Smith’. ‘Mike’ and ‘M’ trigger an initial match, and this parameter controls how it is scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know that your data sets contain many initials.



Example 14. Token Deletion Score (deletionScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.



Example 15. Token Reorder Penalty (reorderPenalty)

This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter will control the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.



Example 16. Right End Boost/Left End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds)

These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when you are dealing with English names and are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the more important ones.

The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.



Language support parameters

Match currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. Fully supported text domains for name matching lists the languages and scripts with complete support. For all other languages, Match has limited support.

Limited support uses two match score computations:

  • Exact matches return a score of 1. This is the same for all languages.

  • A score is calculated based on string edit distance.
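As a rough illustration of the two computations above, the following Java sketch returns 1.0 for an exact match and otherwise derives a score from the Levenshtein edit distance. The normalization used here (1 minus the distance divided by the longer string's length) and the class and method names are assumptions made for illustration; the formula Match actually applies is not described in this section.

```java
// Hypothetical sketch; not the Match implementation.
class LimitedSupportScoreSketch {

    // Classic Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Exact matches score 1.0; otherwise the score shrinks as the distance grows.
    // The normalization is an assumption, chosen only to keep scores in [0, 1].
    static double score(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int longest = Math.max(a.length(), b.length());
        return 1.0 - (double) editDistance(a, b) / longest;
    }
}
```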

Two parameters control the level of language support.

Table 21. Language Support Parameters

Parameter | Description | Default
allLanguageSupport | When set to true, all languages are supported. | true
limitedLanguageEditDistance | When set to true, edit distance match scores are enabled for limited support languages. allLanguageSupport must be true. | true



Neural model for matching

When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.

To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.

Matching Korean names

If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.

To enable Korean readings of names in Han script you need to edit the parameter files as follows:

  1. Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the ignoreTranslationOrigins list.

  2. Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1 to compensate for the additional reading.

Matching names with Han characters

We've added experimental support that uses mechanisms within the Unicode data to improve matching of Han characters.

The four-corner system is a method for encoding Hani script characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify the character, it does limit the list of possibilities.

The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown accuracy improvements when setting the value of the parameter to 1.

To enable the feature, add the following line to your parameter_profiles.yaml file:

zho_zho_PERSON:  
  haniFourCornerCodeMismatchPenalty: 1

This method can be used for matching any names where both languages are Hani script (kor_kor, jpn_kor, etc.). The parameter can be enabled globally, for all Hani script, or using a specific language profile as shown above.

Note

This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.

Matching Turkish and Vietnamese names

Vietnamese and Turkish have their own detectors which must be enabled. If your data includes Turkish and/or Vietnamese names, then you must enable the respective detector.

  1. Edit the parameter_profiles.yaml file.

  2. To enable Turkish detection, add:

    detectableLanguagesRuleBased:
      [tur]

    To enable Vietnamese detection, add:

    detectableLanguagesRuleBased:
      [vie]

  3. Restart the system.

Evaluating parameter configuration

To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.

If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
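For the general evaluation described above, the comparison can be as simple as counting the scores above your threshold before and after tuning. The following Java sketch assumes you have already collected the pair match scores from both runs into arrays; the class name, method names, and any sample scores are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for comparing match counts before and after tuning.
class ThresholdEvaluationSketch {

    // Number of pair matches whose score is above the threshold.
    static long countAbove(double[] scores, double threshold) {
        return Arrays.stream(scores).filter(s -> s > threshold).count();
    }

    // Relative change in the number of matches; negative means fewer matches after tuning.
    static double relativeChange(double[] before, double[] after, double threshold) {
        long b = countAbove(before, threshold);
        long a = countAbove(after, threshold);
        if (b == 0) {
            return a == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return (double) (a - b) / b;
    }
}
```

A large relative change in either direction is a signal to inspect the pairs near the threshold before accepting the new parameter values.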

Configuring name overrides

Match includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:

  • Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.

  • Name pair matches specify scores to be assigned for specified full-name pairs.

  • Token pair overrides specify name token pairs that match along with a match score.

  • Token normalization files specify the normalized form for tokens and variants to normalize to that form.

  • Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.

The name matching override files are in the /rlpnc/data/rnm/ref/override directory.

You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.

Stop patterns and stop word prefixes

Before running any matching algorithms, the names are transformed into tokens that can be compared. Match uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.

For each name, Match performs the following steps in order:

  1. Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.

  2. Stop patterns are applied.

  3. Stop words are applied.

Match cycles its way through the stop patterns then the stop words, each cycle removing the patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
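The order of operations above can be sketched in Java with java.text.Normalizer and java.util.regex.Pattern. This is an illustrative approximation, not the Match implementation: the class and method names are hypothetical, the loop simply repeats until a full pass strips nothing, and stop word prefixes are assumed to be followed by a space.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the preprocessing order; not the Match implementation.
class NamePreprocessingSketch {

    // Step 1: character-level normalization.
    static String normalize(String name) {
        String s = Normalizer.normalize(name, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");           // remove diacritical marks
        s = s.replaceAll("[\\p{P}&&[^.,\\-]]", "");   // strip punctuation except . , -
        s = s.replaceAll("\\s+", " ").trim();         // reduce white space to single spaces
        return s.toLowerCase(Locale.ROOT);
    }

    // Steps 2 and 3, repeated until a full pass strips nothing more.
    static String strip(String name, List<Pattern> stopPatterns, List<String> stopPrefixes) {
        String current = normalize(name);
        String previous;
        do {
            previous = current;
            for (Pattern pattern : stopPatterns) {
                current = pattern.matcher(current).replaceAll("");
                current = current.replaceAll("\\s+", " ").trim();
            }
            for (String prefix : stopPrefixes) {
                if (current.startsWith(prefix + " ")) {
                    current = current.substring(prefix.length()).trim();
                }
            }
        } while (!current.equals(previous));
        return current;
    }
}
```

Because the input is lower-cased before any stripping, stop patterns and stop word prefixes in this sketch must be written in lower case.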

Stop Pattern

A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern; see the Javadoc for detailed documentation.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt

where LANG is a three-letter language code.

Each row in the file, except for rows that begin with # [5], is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.

Tip

Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, the override is applied to all names, regardless of the entity type.

Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.

Match includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in /rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries

^fnu\b
\blnu$ 

indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name, are to be removed.

You can also specify which field the regex is applied to when processing a fielded name. Append a tab character followed by n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,

\blnu$    2
\blnu$    3

indicates that the regex is to be applied to fields 2 and 3 in fielded names.

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop Word Prefixes

A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.

Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt 

where LANG is a three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable when colonel is also a stop word prefix.

Match includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in /rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.

Overriding name pair matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.

Tip

Include _TYPE, where TYPE designates an entity type such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, the override is applied to all names, regardless of the entity type.

Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:

name1 Tab name2 Tab score

The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.

Tip

Since the minimum score for names returned by Match queries must be greater than 0, a Match query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.

The installation includes a sample file with sample entries commented out: /rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in a Match index. For example,

John Doe	Joe Bloggs	1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
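To make the file format and the commutative behavior concrete, here is a minimal Java sketch that loads override rows and looks up a pair in either order. The class and method names are hypothetical, and the lookup compares exact strings; how Match normalizes names before consulting the overrides is not covered here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical reader for fullnames_LANG1_LANG2[_TYPE].txt rows; not the Match implementation.
class FullNameOverridesSketch {
    private final Map<String, Double> scores = new HashMap<>();

    // Order-independent key, so (a, b) and (b, a) resolve to the same entry.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    // Accepts one row of the override file: name1 TAB name2 TAB score.
    void addLine(String line) {
        if (line.trim().isEmpty() || line.startsWith("#")) {
            return; // blank or comment row
        }
        String[] parts = line.split("\t");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-delimited fields: " + line);
        }
        double score = Double.parseDouble(parts[2].trim());
        if (score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("Score must be between 0 and 1.0: " + line);
        }
        scores.put(key(parts[0].trim(), parts[1].trim()), score);
    }

    // Returns the override score for the pair, if one was defined.
    OptionalDouble lookup(String queryName, String indexName) {
        Double score = scores.get(key(queryName, indexName));
        return score == null ? OptionalDouble.empty() : OptionalDouble.of(score);
    }
}
```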

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example the following entries could appear in fullnames_jpn_eng.txt:

外山恒	Toyama Koichi	1.0
ヒラリークリントン	Hillary Clinton	1.0

Overriding token pair matches

You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported[6] for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. When Match evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

tokens_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:

Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]

A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, Match assumes NICKNAME.
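The row syntax above, including the /force suffix, the SUPPRESS alias, and the NICKNAME default, can be summarized in a small Java parser. This is a sketch for checking your own override files: the class and field names are hypothetical, and it does not model the numeric scores that Match assigns to the NICKNAME, COGNATE, and VARIANT indicators.

```java
// Hypothetical parser for one row of a token pair override file; not the Match implementation.
class TokenOverrideSketch {
    final String token1;
    final String token2;
    final String value;    // raw score as text, or NICKNAME, COGNATE, VARIANT
    final boolean forced;  // true when the score is exact rather than a minimum

    TokenOverrideSketch(String token1, String token2, String value, boolean forced) {
        this.token1 = token1;
        this.token2 = token2;
        this.value = value;
        this.forced = forced;
    }

    static TokenOverrideSketch parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 tab-delimited fields: " + line);
        }
        String t1 = parts[0].trim();
        String t2 = parts[1].trim();
        String spec = parts.length == 3 ? parts[2].trim() : "";
        if (spec.isEmpty()) {
            return new TokenOverrideSketch(t1, t2, "NICKNAME", false); // documented default
        }
        if (spec.equals("SUPPRESS")) {
            return new TokenOverrideSketch(t1, t2, "0.0", true);       // alias for 0.0/force
        }
        boolean forced = spec.endsWith("/force");
        String value = forced ? spec.substring(0, spec.length() - "/force".length()) : spec;
        boolean indicator = value.equals("NICKNAME") || value.equals("COGNATE") || value.equals("VARIANT");
        if (!indicator) {
            double score = Double.parseDouble(value); // throws NumberFormatException on bad input
            if (score < 0.0 || score > 1.0) {
                throw new IllegalArgumentException("Score must be between 0.0 and 1.0: " + line);
            }
        }
        return new TokenOverrideSketch(t1, t2, value, forced);
    }
}
```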

Match includes /rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:

Peter    Pete    NICKNAME
Peter    Pedro   COGNATE

This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.

When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.

We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.

Multiple sets of token overrides

There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.

  • The value of overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.

  • The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English - English matching using the overrideSelector of OverrideGroup1 would be named:

    tokens_eng_eng_PERSON-OverrideGroup1.txt
  • If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
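The filename convention above can be illustrated with a short Java sketch that extracts the selector from an override filename and falls back to "default". The class name and the regular expression are assumptions for illustration; lower-casing the result is simply one way to model the case-insensitive comparison.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the Match implementation.
class OverrideSelectorSketch {
    // An alphanumeric selector preceded by a dash, directly before the .txt extension.
    private static final Pattern SELECTOR = Pattern.compile("-([A-Za-z0-9]+)\\.txt$");

    static String selectorOf(String filename) {
        Matcher m = SELECTOR.matcher(filename);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : "default";
    }
}
```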

Note

Overrides that are associated with a specific selector are not additive to the base overrides. If a custom overrideSelector value is specified, Match will only consider overrides in that specific selector. As with the base overrides, for a given selector, Match will consider non-entity-type overrides for that selector if no entity-type-specific override pair is found for that selector.

Normalizing token variants

You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:

equivalenceclasses_LANG[_TYPE].txt

For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.

Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:

[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...

Match includes /rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:

[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd

You can add lists of variants to this file, including the normalized form in square brackets to start each list.
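If you want to sanity-check an equivalence class file before deploying it, the format is simple to parse. The following Java sketch builds a variant-to-normal-form map from the lines of such a file; the class and method names are hypothetical, and the sketch does not reflect how Match loads the file internally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for equivalence class files; not the Match implementation.
class EquivalenceClassesSketch {

    // Maps each variant (and each normal form itself) to its normal form.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> normalForms = new HashMap<>();
        String current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                current = line.substring(1, line.length() - 1); // start of a new list
                normalForms.put(current, current);
            } else if (current != null) {
                normalForms.put(line, current);
            } else {
                throw new IllegalArgumentException("Variant before any [normal_form]: " + line);
            }
        }
        return normalForms;
    }

    // Tokens without an entry are returned unchanged.
    static String normalize(Map<String, String> normalForms, String token) {
        return normalForms.getOrDefault(token, token);
    }
}
```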

Unimportant tokens

You can edit the list of tokens that are given low influence in Match. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.

The file name is lowWeightTokens_LANG.txt.

For example, /rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".

Matching organizations with real world IDs

Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.

Match contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name matching within a language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.

Table 22. Real World ID Parameters

Parameter | Description | Default
useRealWorldIds | Enables real world IDs and indexes them as corporation names are added to the index. You must reindex if you enable it after indexing. | true (enabled)
doQueryRealWorldIds | Enables querying with real world IDs. Set by language pair. | true (enabled)
realWorldIdScore | Sets the match score when two names match due to matching real world IDs. Set by language pair. | 0.98
nameRealWorldQueryBoost | Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair. | 35



Building a real world ID file

Many companies maintain their own file of organizations and their alternate names. To improve matching between organization names, you can supplement the real world IDs provided in Match by building your own file of real world IDs. The provided builder creates a binary file in the specified output directory named <LANG>_ORGANIZATION_ids.bin, where <LANG> is the three-letter language code of the file.

The input file is a tab separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92
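Before running the builder, it can help to confirm that your input file groups names the way you expect. The following Java sketch reads rows in the name, tab, ID layout shown above and checks whether two names share an ID. It is a hypothetical validation helper, not part of realWorldIDBuilder, and its class and method names are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical validation helper for a real world ID input file.
class RealWorldIdSketch {

    // Builds a map from organization name to the IDs listed for it.
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> idsByName = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected name TAB id: " + line);
            }
            idsByName.computeIfAbsent(parts[0].trim(), k -> new HashSet<>()).add(parts[1].trim());
        }
        return idsByName;
    }

    // True when both names are listed with at least one common ID.
    static boolean shareId(Map<String, Set<String>> idsByName, String a, String b) {
        Set<String> idsA = idsByName.getOrDefault(a, Collections.emptySet());
        Set<String> idsB = idsByName.getOrDefault(b, Collections.emptySet());
        return !Collections.disjoint(idsA, idsB);
    }
}
```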

Unzip the file realWorldIDBuilder.zip found in the directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.

Omit real world IDs

You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.

The omit file is a tab separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.

  • Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.

  • Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.

  • Lone QID: A QID is preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
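The three line types can be told apart mechanically, which is useful when validating an omit file. The Java sketch below classifies a single tab-separated row as a pair, lone name, or lone QID; the class, enum, and field names are hypothetical and do not correspond to Match classes.

```java
// Hypothetical classifier for rows of an omit file; not the Match implementation.
class OmitEntrySketch {
    enum Kind { PAIR, LONE_NAME, LONE_QID }

    final Kind kind;
    final String name; // null for a lone QID
    final String qid;  // null for a lone name

    OmitEntrySketch(Kind kind, String name, String qid) {
        this.kind = kind;
        this.name = name;
        this.qid = qid;
    }

    static OmitEntrySketch parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected name TAB qid: " + line);
        }
        String name = parts[0].trim();
        String qid = parts[1].trim();
        if (name.equals("*") && qid.equals("*")) {
            throw new IllegalArgumentException("A row cannot omit everything: " + line);
        }
        if (qid.equals("*")) {
            return new OmitEntrySketch(Kind.LONE_NAME, name, null); // name never used for RWID matching
        }
        if (name.equals("*")) {
            return new OmitEntrySketch(Kind.LONE_QID, null, qid);   // QID never used for this language
        }
        return new OmitEntrySketch(Kind.PAIR, name, qid);           // QID not used for this name
    }
}
```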

To enable an omit file in Match:

  1. Place the omit file in the BT_ROOT directory.

  2. Open omit_ids.datafiles, which is in the /rlpnc/data/real_world_ids/ref/omit_ids directory by default.

  3. Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:

    ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
  4. Save omit_ids.datafiles.

Custom language model training

You can train a language model on your own name data. Match uses language models in which common names score differently than rare names. For example, "John Jingleheimer" should match "Jingleheimer" better than "John", because Jingleheimer is a rarer name than John. Match already comes with language models for many supported languages, but you might find it best to train a new language model so that it reflects the statistics of your data. Note that a large number of full names is required to train an effective language model.

Installation

Unpack frequencyModelTrainer.zip to any desired location. Ensure that the JAVA_HOME environment variable is set and points to a Java version of 11 or higher.

Simple usage example

bin/buildLM.sh -root rni-rnt -in eng_PER_LM.tsv \
  -out rni-rnt/data/rnm/ref/user_models/eng_PERSON_unigram.bin \
  -lang eng -script Latn

See README.txt in frequencyModelTrainer.zip for more details, including the full description of arguments.

L337/OCR token scorer

Names may include modified spellings which include symbols. The main type of intentional modified spelling is l337 or leetspeak, a system of modified spellings using character replacements that often play on the similarity of their glyphs. One example is Ke$ha for Kesha. Another way symbols may be introduced is through optical character recognition (OCR) errors. When typed, handwritten, or printed text is converted into machine-encoded text, errors may be introduced. Name match can handle both of these "misspellings" through a L337/OCR token scorer, also known as the generalized edit distance token scorer. This feature is turned off by default.

The generalized edit distance token scorer extends the string similarity match scoring. It applies to Latin script names only. Both names must be of the same language and the same entity type. Only the languages ENGLISH, SPANISH, FRENCH, ITALIAN, GERMAN, and PORTUGUESE are supported at this time.

The token scorer consists of a set of rules files, which define a set of symbol substitutions, along with the penalty applied for making the substitutions. The rules files are found in the data/rnm/ref/edit_distance directory. You can edit them as needed, changing the substitution, defining new substitutions, and changing the penalty values for each substitution. There are 3 files in the directory, though you can add additional rule files.

  • edit_distance.datafiles: lists all files containing rules along with the entity types they apply to. Rules can be general and work across all entity types. If you add a new rule file, it must be added to this file.

  • edits_eng.tsv: the predefined list of pair rules, implementing the L337/OCR rule below.

  • edits_eng_PERSON: a stub file, intended for rules specific to PERSON entity types. Add any rules specific to PERSON name matches in this file.
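To show the idea behind a generalized edit distance, the following Java sketch extends Levenshtein distance with a table of cheaper substitutions, so that replacing $ with s costs less than an arbitrary substitution. The rule table, its two-character key format, and the class name are assumptions made for illustration; they do not reflect the format of the shipped rules files, and the sketch handles single-character substitutions only.

```java
import java.util.Map;

// Hypothetical sketch of an edit distance with per-pair substitution costs.
class GeneralizedEditDistanceSketch {

    // Looks up the cost of substituting a with b; rule keys are two-character strings.
    static double substitutionCost(char a, char b, Map<String, Double> rules, double defaultCost) {
        if (a == b) {
            return 0.0;
        }
        Double cost = rules.get("" + a + b);
        if (cost == null) {
            cost = rules.get("" + b + a); // treat rules as symmetric
        }
        return cost != null ? cost : defaultCost;
    }

    // Insertions and deletions cost 1; substitutions use the rule table.
    static double distance(String s, String t, Map<String, Double> rules, double defaultCost) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= t.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double substitute = d[i - 1][j - 1]
                        + substitutionCost(s.charAt(i - 1), t.charAt(j - 1), rules, defaultCost);
                double insertOrDelete = Math.min(d[i - 1][j], d[i][j - 1]) + 1.0;
                d[i][j] = Math.min(substitute, insertOrDelete);
            }
        }
        return d[s.length()][t.length()];
    }
}
```

With an empty rule table and a default cost of 1, this reduces to ordinary Levenshtein distance.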

The following parameters control the scorer. The parameters are defined in the internal_param_defs.yaml file and can be applied to profiles in the parameter_profiles.yaml file.

Table 23. L337/OCR Parameters

Parameter | Description | Type | Default
useGeneralizedEditDistanceTokenScorer | Turns the L337/OCR token scorer on or off. | boolean | false
generalizedEditDistanceSubstitutionCost | The substitution cost. A value of 1 is Levenshtein distance. | real | 1
generalizedEditDistanceTokenScorerPenalty | A penalty multiplied into the score inside the L337/OCR token scorer. | real | 0.85
generalizedEditDistanceScoreBias | A rescaling factor for scores produced by the L337/OCR token scorer. | real | 1



OCR_Rules-060924-150948.pdf

Email scorer

When a name is given the entity type of IDENTIFIER:EMAIL, the email scorer is used, allowing the system to compare, index, and query email addresses.

The following steps are used to score an email:

  1. The value is validated using a regex based on the Internet Message Format RFC 5322. An error is returned if it fails.

  2. The email is split into two parts by the @ sign:

    • Local (before @)

    • Domain (after @)

  3. The components are scored individually:

    • The local part is scored using the person name match scorer with automatic language detection.

    • The domain part is scored using string edit distance.

  4. The scores are combined:

    • The two scores are combined using a weighted average.

    • The weight parameter is emailUsernameWeightInScoring (range: 0 to 1).

      • 0: only the domain score is used

      • 1: only the local score is used.

      • Default value is 0.85: 85% local score, 15% domain score.
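The split and the weighted average in steps 2 and 4 can be written out in a few lines of Java. In this sketch the component scores are passed in as plain numbers, since the person name scorer and the edit distance scorer are not reproduced here. The class and method names are hypothetical, and splitting at the last @ is an assumption.

```java
// Hypothetical sketch of the score combination; not the Match implementation.
class EmailScoreSketch {
    // Documented default for emailUsernameWeightInScoring.
    static final double DEFAULT_USERNAME_WEIGHT = 0.85;

    // Splits an address into its local part and domain part at the last '@'.
    static String[] split(String email) {
        int at = email.lastIndexOf('@');
        if (at <= 0 || at == email.length() - 1) {
            throw new IllegalArgumentException("Not a valid email address: " + email);
        }
        return new String[] { email.substring(0, at), email.substring(at + 1) };
    }

    // Weighted average: a weight of 1 uses only the local score, 0 only the domain score.
    static double combine(double localScore, double domainScore, double usernameWeight) {
        if (usernameWeight < 0.0 || usernameWeight > 1.0) {
            throw new IllegalArgumentException("Weight must be between 0 and 1");
        }
        return usernameWeight * localScore + (1.0 - usernameWeight) * domainScore;
    }
}
```

For example, with a local score of 0.9, a domain score of 0.5, and the default weight, the combined score is 0.85 × 0.9 + 0.15 × 0.5 = 0.84.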

Indexing and querying names

Name Match enables high-speed, scalable, cross-language, and cross-script searches for names.

Match uses the Apache Lucene full-text search engine to store names with their search keys and a key index. Match updates and queries with Lucene are transactional.

When you search for a name, Match generates a search key for each component of the name, locates all the names indexed by those search keys, and uses linguistic matching algorithms to filter that set of names down to the most similar names.

For a list of the languages and writing scripts that Match supports, see Fully supported text domains for name matching.

Match provides a Java API that you can use to embed it in your applications. The Match classes are in com.basistech.rni.index. Unqualified class names that appear in this section are in com.basistech.rni.index.

For detailed information about the API, see the Java API Reference shipped with Match.

Constructing a name index

A name index is an indexed list of names. The list includes a collection of Name objects and associated keys.

The Name object includes the name, language, [7] and script (script and language are inferred if not included in the name definition). It may also include the entity type (such as person or place), language of origin, and additional information; with place names, for example, you may want to store the geocoordinates.

Tip

You can also create an index in memory that is never stored on disk.

To create an indexed list of names on disk, you must specify a pathname for the data store, and you must use an IndexStoreDataModelFlags object (the default is fine).

Example:

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_index.java

Once the index is created, use NameBuilder to create Name objects and add them to the index. NameBuilder provides a fluent interface that supports method chaining. The following fragment illustrates the syntax for creating and adding a name to the index.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_name.java

When you are finished adding names, close the name index, as in the preceding fragment.

Note

NameBuilder also includes static methods that you can use for determining the language and script for a name prior to creating the Name object: guessLanguage(String nameData) and guessScript(String nameData).

You can use hintLanguage(com.basistech.util.LanguageCode hintLanguage) to suggest the language when you create a Name. The NameBuilder uses the suggestion if it is compatible with the script, otherwise it uses its own language guess.

When you are adding a large number of names to an index, you can use an INameIndexSession object to batch these additions into a single transaction. A single transaction is faster than adding each name in a separate transaction. For information, see Match Sessions and Transactions, and for a sample application that adds multiple names in a single transaction, see AddNamesSample.

Querying a name index

Once you have an index created, you can use queries to search the index for similar names.

Opening a name index

The primary role of a name index is to perform queries. You can also perform updates (insertions and deletions).

StandardNameIndex provides a static method for opening a name index.

INameIndex index = StandardNameIndex.open(String indexPathname);

indexPathname is the path to the directory that contains the name index.

To optimize the index for more efficient queries, call

index.optimize();

When you are done using the name index, you must close it:

index.close();

Defining a name search query

A query includes a Name object and may also include settings to constrain the query. For example, the query can specify the entity type, language, and/or script of the names that it returns. For the details, see the Javadoc for com.basistech.rni.index.IndexStoreDataModelFlags.

You can also define a query to return all the names associated with a specified entity.

Set up a NameIndexQuery object. For example:

// Define a query.
NameIndexQuery defineQuery(Name queryName)
        throws NameIndexException, NameIndexStoreException, RNTException {
    NameIndexQuery query = new NameIndexQuery(queryName);
    // Return only names whose match score is at least 0.30.
    query.setNameDataMinimumMatchScore(0.30);
    return query;
}

Running the query and accessing the query results

INameIndex includes a query method that takes as its parameter the defined NameIndexQuery.

The query returns a NameIndexQueryResult iterator. Each NameIndexQueryResult object provides a Name object and a similarity score. As the following fragment illustrates, you can obtain and process each name and its score. The higher the score (greater than 0 and less than or equal to 1), the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical. Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/query_index.java

Span Matches. Each query result may contain information about spans (one or more tokens) in the query name that match or do not match spans in each result name. The NameIndexQueryResult provides a MatchResult object, which in turn provides match type and a list of SpanMatch objects. For more information, see the Javadoc for com.basistech.rni.match.SpanMatch and com.basistech.rni.match.Span. The Javadoc for MatchResult#getSpanMatches() provides information about the scope and limitations on what is returned for names in various text domains.

Cleanup

When you are done running queries, close the index:

index.close();

Sample

For a sample Java application that defines a query, runs the query, and reports the results, see IndexQuerySample.

Retrieving groups of names

You may want to retrieve a group of names that share some common characteristic other than name similarity. Perhaps you even want to retrieve all the names in a Match index.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/name_groups.java

The query returns all the names for which the Extra field contains the token used in the query.

Optimizing query performance

By adjusting NameIndexQuery parameters, you can optimize queries for your use case.

Tradeoffs between accuracy and speed

Match passes a subset of the highest scoring names from the first-pass high-recall search to the second-pass high-precision filter. The namesToCheckAllowance and maximumNamesToCheck parameters can be adjusted to control how many names are included in that subset.

maximumNamesToCheck

The maximumNamesToCheck parameter sets a hard limit on the number of names passed to the high-precision filter for each query. Use it to control the maximum query latency. The appropriate value is largely determined by the size of your index and should increase as your index grows.

namesToCheckAllowance

The namesToCheckAllowance parameter is a value between 0.0 and 1.0 used at query time to dynamically calculate the most efficient number of names to pass to the high-precision filter based on the commonality of the query name in the index. When set to 1.0, the value of maximumNamesToCheck is used for every query. After determining a good value for maximumNamesToCheck, adjust this parameter to fine-tune the performance.

In general, for greater speed and less accuracy (particularly recall), decrease the value of these parameters using:

  • setNamesToCheckAllowance(double namesToCheckAllowance)

  • setMaximumNamesToCheck(int maxNamesToCheck)

For greater recall and less speed, increase those settings.

To pass all names found by the high-recall search to the high-precision filter, set:

  • namesToCheckAllowance to 1.0

  • maximumNamesToCheck to NameIndexQuery.UNLIMITED_RESULTS.

Optimizing for Duplicate Names. If your index contains duplicate names, use setMaximumNamesToConsider(int maxNamesToConsider) to set the maximum number of names to consider to a value higher than the maximum number of names to check. The first-pass high-recall search returns up to maximumNamesToConsider names, and up to maximumNamesToCheck of them are sent to the second-pass high-precision filter. Duplicates among the names returned by the first pass are not sent to the second pass; instead, the score that the second pass assigns to the first instance of a name is also assigned to its duplicates. For optimal behavior, the ratio of maximumNamesToConsider to maximumNamesToCheck should be approximately the average number of times that a name is repeated in the Match index. For example, if each name is entered twice on average, maximumNamesToConsider should be twice maximumNamesToCheck. If your index does not include duplicates, you can use IndexStoreDataModelFlags to set optimizeDuplicateNames to false (the default is true), in which case Match does not perform this optimization.
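The ratio rule above is simple arithmetic. The following is a minimal sketch, not part of the Match API: the class and method names are hypothetical, and it assumes that "duplicate" means an exactly equal name string, which may be narrower than what Match treats as a duplicate.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a Match class.
class ConsiderToCheckRatio {
    // Average number of times each distinct name string appears in the list.
    static double averageRepetition(List<String> names) {
        Set<String> distinct = new HashSet<>(names);
        return distinct.isEmpty() ? 1.0 : (double) names.size() / distinct.size();
    }

    // Suggested maximumNamesToConsider for a chosen maximumNamesToCheck.
    static int suggestedMaxNamesToConsider(int maxNamesToCheck, double averageRepetition) {
        return (int) Math.ceil(maxNamesToCheck * averageRepetition);
    }

    public static void main(String[] args) {
        // Each name appears twice, so the average repetition is 2.0.
        List<String> indexed = List.of("John Smith", "John Smith", "Jane Doe", "Jane Doe");
        double ratio = averageRepetition(indexed);
        System.out.println(suggestedMaxNamesToConsider(500, ratio));
    }
}
```

The computed value would then be passed to setMaximumNamesToConsider on the query.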

Constraints on maximum settings. maximumNamesToCheck and maximumResultsToReturn must be less than or equal to maximumNamesToConsider. As described above, maximumNamesToCheck may be less than maximumResultsToReturn. Accordingly, the order in which you make these settings is important. For example, you cannot set maximumResultsToReturn to a value higher than maximumNamesToConsider, so you may need to reset maximumNamesToConsider before you can reset maximumResultsToReturn.
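The ordering issue can be illustrated with a small stand-alone model. This is a sketch under assumptions: QueryLimits is a hypothetical class, not NameIndexQuery, and it assumes that an invalid setting is rejected with an exception; the text above does not say how Match reports a violation.

```java
// Hypothetical model of the three limits; not a Match class.
class QueryLimits {
    private int maximumNamesToConsider;
    private int maximumNamesToCheck;
    private int maximumResultsToReturn;

    QueryLimits(int consider, int check, int results) {
        if (check > consider || results > consider) {
            throw new IllegalArgumentException("check and results must be <= consider");
        }
        maximumNamesToConsider = consider;
        maximumNamesToCheck = check;
        maximumResultsToReturn = results;
    }

    void setMaximumNamesToConsider(int value) {
        if (value < maximumNamesToCheck || value < maximumResultsToReturn) {
            throw new IllegalArgumentException("consider must be >= check and results");
        }
        maximumNamesToConsider = value;
    }

    void setMaximumNamesToCheck(int value) {
        if (value > maximumNamesToConsider) {
            throw new IllegalArgumentException("check must be <= consider");
        }
        maximumNamesToCheck = value;
    }

    void setMaximumResultsToReturn(int value) {
        if (value > maximumNamesToConsider) {
            throw new IllegalArgumentException("results must be <= consider");
        }
        maximumResultsToReturn = value;
    }

    int getMaximumNamesToConsider() { return maximumNamesToConsider; }
    int getMaximumResultsToReturn() { return maximumResultsToReturn; }

    public static void main(String[] args) {
        QueryLimits limits = new QueryLimits(1000, 500, 100);
        try {
            limits.setMaximumResultsToReturn(2000); // too high for the current consider limit
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        limits.setMaximumNamesToConsider(4000);     // raise the ceiling first
        limits.setMaximumResultsToReturn(2000);     // now accepted
        System.out.println(limits.getMaximumResultsToReturn());
    }
}
```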

To simulate a high-recall search with perfect recall:

  1. Retrieve all names in the index as described in Retrieving Groups of Names.

  2. Apply the high-precision filter to each name by matching it against the query with a MatchScorer (see Matching Names).

This is not recommended for a production environment due to the high amount of computation such a procedure requires, but it can be useful during development to identify recall errors (false negatives) made by the high-recall search but not the high-precision filter.

Tradeoffs between false positives and false negatives

For fewer false positives (bad matches) and more false negatives (missing good matches) in your query results, you can:

  • increase the minimum match score that a candidate must reach to be returned

  • decrease the number of results that are returned (candidates with the highest scores are included)

The default minimum match score is NameIndexQuery.DEFAULT_MINIMUM_MATCH_SCORE. To reset this threshold, use setNameDataMinimumMatchScore(double nameDataMinimumMatchScore), where nameDataMinimumMatchScore is greater than 0 and less than or equal to 1.

The default maximum number of results to return is NameIndexQuery.DEFAULT_MAXIMUM_RESULTS_TO_RETURN. To reset this value, use setMaximumResultsToReturn(int maximumResultsToReturn).

To return an unlimited number of results, use setMaximumResultsToReturn(NameIndexQuery.UNLIMITED_RESULTS).
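The combined effect of the two settings can be pictured as a filter followed by a cutoff. The following is an illustrative sketch only: Scored and ResultCutoff are hypothetical stand-ins rather than Match classes, and the unlimited-results case is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

class ResultCutoff {
    // Hypothetical stand-in for a scored query result; not a Match class.
    static final class Scored {
        final String name;
        final double score;
        Scored(String name, double score) { this.name = name; this.score = score; }
    }

    // Keep candidates that reach minimumScore, then keep the highest-scoring maximumResults.
    static List<Scored> select(List<Scored> candidates, double minimumScore, int maximumResults) {
        List<Scored> kept = new ArrayList<>();
        for (Scored c : candidates) {
            if (c.score >= minimumScore) kept.add(c);
        }
        kept.sort((a, b) -> Double.compare(b.score, a.score)); // highest score first
        return new ArrayList<>(kept.subList(0, Math.min(maximumResults, kept.size())));
    }

    public static void main(String[] args) {
        List<Scored> candidates = List.of(
                new Scored("Jon Smith", 0.95),
                new Scored("John Smythe", 0.80),
                new Scored("Joan Smits", 0.40));
        // A higher threshold or a smaller limit returns fewer, stronger matches.
        for (Scored s : select(candidates, 0.50, 10)) {
            System.out.println(s.name + " " + s.score);
        }
    }
}
```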

Querying name frequency

The getNameFrequency method of the NameFrequency class returns the frequency of each token in a name within a given language context, based on our internal datasets.

It can be useful to know how frequently a name appears in a language. When an extremely common name is matched, the names returned by the first pass may be cropped to fit the window size. Depending on where the search name fits in the results, it may not be returned on the list, resulting in lower accuracy for extremely common names. If you know the frequency of a name, you can dynamically adjust the window size for common names.

Note

A larger window size will increase latency.

Name name = NameBuilder.data("John Smith").buildAndComplete();
boolean includeExplainInfo = true;
NameFrequency scorer = new NameFrequency(ParameterProfile.defaultProfile());
NameFrequencyResult result = scorer.getNameFrequency(name, includeExplainInfo);

The NameFrequencyResult object contains the following fields:

Name | Type | Description
name | String | Name for which frequency information is returned.
script | ISO 15924 script code | Script of the name.
language | ISO 639-3 language code | Language of the name.
tokenFrequencies | List | List of frequency information for each token in the name.

The TokenFrequencyResult object contains the following fields:

Name | Type | Description
token | String | Token for which the frequency information is returned.
frequency | double | Frequency of the token.
error | String | Error message if an error occurred for the token; otherwise null.
frequencyScoreExplainInfo | FrequencyScoreExplainInfo | Explanation of how the final frequency was calculated, or null if not requested.

The FrequencyScoreExplainInfo object is a list of parameters that affected the calculation. Each parameter in the list consists of the parameterName and the parameterValue.

Match sessions and transactions

In addition to using the INameIndex API for performing operations on a Match index, you can use the INameIndexSession API for finer-grained control. Sessions allow a set of operations to happen atomically (all occur or nothing occurs), and, especially for write operations, more efficiently. For those familiar with relational databases and SQL, the Match concept of a session is similar to the JDBC concept of a connection with auto-commit mode off.

  • To start a session, call INameIndex.openSession().

  • To end the session, call close() on the resultant INameIndexSession object.

While INameIndexSession provides many of the same operations as INameIndex, such as query() and addName(), the difference is when changes to the index become permanent. INameIndex update operations are immediately flushed to disk, but INameIndexSession operations are not made permanent until you call commit(). At any time, you can invoke rollback() to undo all the operations since the last commit(). If you call rollback() before ever calling commit(), all of the operations of the session are undone.

You can run multiple sessions concurrently by having multiple threads call openSession() on the same INameIndex object. When multiple sessions are acting concurrently in separate threads, they are logically isolated from each other in order to not interfere with each other's operations. The isolation level is equivalent to READ COMMITTED, as outlined in the SQL-1992 Specification. This guarantees that one session will not see any uncommitted changes to the index performed by another session. In addition, a session will not see any uncommitted changes that it has made itself. For example, if a session adds a name to the index and then searches for that name before committing, it will not find the name it has added. You can also perform INameIndex auto-commit operations in the midst of one or more sessions; each INameIndex update or query is performed in its own session.

The session objects themselves are thread-safe; a session object may be shared by multiple threads.

The INameIndexSession API is recommended for doing bulk adds to the index. It is much more efficient to create a single session for adding all the names of a bulk add than to use the INameIndex API. The following fragment shows an example.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_names.java

A Sample. For a sample application that adds multiple names in a single transaction, see AddNamesSample.

Local vs. Distributed Transactions. A local transaction is a set of operations performed atomically (all occur or nothing occurs) on a single index. A distributed transaction is a set of operations performed atomically on multiple data sources, such as a relational database and a Match index. All the operations on all the data sources must take place, or none of the operations take place.

For local transactions, use the INameIndexSession API, as illustrated above. The transaction object is managed internally and is not visible to the user.

In order to participate in a distributed transaction, an INameIndexTransaction object must be created from the session by calling INameIndexSession.startTransaction(). This transaction object is linked with the session internally. There is a division of labor between the two objects: the session object can only be used for adding/removing/searching, and the transaction object can only be used for committing or rolling back. A typical use case would be to provide the session object to the user application while handing over the transaction object to a transaction manager.

One side effect of this division of labor between the session and transaction objects is that a session cannot call commit() or rollback() once it is associated with a distributed transaction. These operations are only allowed by the linked transaction object. Specifically, after calling INameIndexSession.startTransaction(), you should not call INameIndexSession.commit(). You must call INameIndexTransaction.commit() instead.

A session can be associated with multiple distributed transactions, one at a time. When the work for one transaction is finished, you may call INameIndexSession.startTransaction() again to start a new one.

Two-Phase Commit. INameIndexTransaction supports two-phase commits, a standard protocol for managing transactions robustly among multiple data sources. INameIndexTransaction provides the prepare(), commit(), and rollback() operations necessary for a transaction manager to effectively execute the protocol. Match does not include a transaction manager.

The following simplified example illustrates the use of INameIndexTransaction in a distributed transaction with a two-phase commit. In this example, both transactions are Match transactions.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/distributed_transaction.java

A Sample. For a sample application that illustrates a distributed transaction with a two-phase commit involving two Match indexes, see DistributedTransactionSample.
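Independent of the Match API, the protocol that a transaction manager follows can be sketched in a few lines. This is a toy coordinator under stated assumptions: Participant is a hypothetical interface that mirrors the prepare(), commit(), and rollback() operations named above, and the boolean return value of prepare() is an assumption, not the INameIndexTransaction signature.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPhaseCommitSketch {
    // Hypothetical participant; stands in for any data source in the transaction.
    interface Participant {
        boolean prepare();  // phase 1: vote on whether the work can be committed
        void commit();      // phase 2: make the work permanent
        void rollback();    // undo the work
    }

    // Returns true if every participant committed, false if all were rolled back.
    static boolean run(List<Participant> participants) {
        for (Participant p : participants) {
            if (!p.prepare()) {
                // One "no" vote aborts the whole transaction for everyone.
                for (Participant q : participants) q.rollback();
                return false;
            }
        }
        for (Participant p : participants) p.commit();
        return true;
    }

    // Demo participant that records the calls it receives.
    static final class Recording implements Participant {
        final String name;
        final boolean vote;
        final List<String> log;
        Recording(String name, boolean vote, List<String> log) {
            this.name = name; this.vote = vote; this.log = log;
        }
        public boolean prepare() { log.add(name + ":prepare"); return vote; }
        public void commit() { log.add(name + ":commit"); }
        public void rollback() { log.add(name + ":rollback"); }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean committed = run(List.of(
                new Recording("indexA", true, log),
                new Recording("indexB", false, log)));
        System.out.println(committed + " " + log);
    }
}
```

A production transaction manager also has to handle timeouts and recovery after a crash between the two phases, which this sketch omits.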

Multithreading

No more than one INameIndex object may exist for a given name index on disk at any time.

Queries and updates may be performed in multiple threads on a single INameIndex object.

One write session at a time

While a write session (which may be shared by multiple threads) is open, all other writing sessions (including optimization) are blocked. If there is an operation that is expected to take a long time (e.g., batch document adds or calls to optimize), care should be taken to ensure it is the only active writing session. If a write attempt needs to wait too long, a timeout exception is thrown, and the transaction is aborted.

Matching organizations with real world IDs

Organizations and companies often have nicknames that are very different from their official names. For example, International Business Machines, or IBM, is known by the nickname Big Blue. Because there is no phonetic similarity between the two names, a match query between them would result in a low score. A real world identifier links a company, together with its nicknames and name permutations, to a single ID. When real world IDs are enabled, a search between two company names includes a comparison of the real world IDs for the two names, so that dissimilar names for the same corporate entity can match.

Match contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name. Name matching within a language lists the languages with provided real-world ID dictionaries. Customers can also generate their own real-world ID dictionaries to supplement the provided dictionaries.

Table 24. Real World ID Parameters

Parameter | Description | Default
useRealWorldIds | Enables real world IDs and indexes them as corporation names are added to the index. You must reindex if you enable this parameter after indexing. | true (enabled)
doQueryRealWorldIds | Enables querying with real world IDs. Set by language pair. | true (enabled)
realWorldIdScore | Sets the match score when two names match due to matching real world IDs. Set by language pair. | 0.98
nameRealWorldQueryBoost | Boosts the value of the real world ID results from the first pass, which increases the likelihood that real world ID matches are returned from the first pass. Set by language pair. | 35



Building a real world ID file

Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world IDs provided in Match by building your own file of real world IDs. The provided build tool generates a binary file named <LANG>_ORGANIZATION_ids.bin in the specified output directory, where <LANG> is the three-letter language code of the file.

The input file is a tab-separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can contain only a single language and script; you must create a separate file for each language.

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92

Unzip the file realWorldIDBuilder.zip found in the directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.
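Before running the builder, it can help to sanity-check the input file. The following sketch is not part of Match or the builder: the class name is hypothetical, and skipping blank lines is an assumption. It verifies that each line is a name, a tab, and an alphanumeric ID, and groups the names by ID so that you can review which names will be linked.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-check for a real world ID input file; not part of the builder.
class RealWorldIdInputCheck {
    // Groups organization names by ID; throws if a line is not name<TAB>alphanumeric ID.
    static Map<String, List<String>> namesById(List<String> lines) {
        Map<String, List<String>> byId = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) continue; // assumption: blank lines are ignored
            String[] cols = line.split("\t", -1);
            if (cols.length != 2 || cols[0].trim().isEmpty()
                    || !cols[1].trim().matches("[A-Za-z0-9]+")) {
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + " is not name<TAB>alphanumeric ID: " + line);
            }
            byId.computeIfAbsent(cols[1].trim(), k -> new ArrayList<>()).add(cols[0].trim());
        }
        return byId;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "IBM\tWE1X92",
                "Big Blue\tWE1X92",
                "International Business Machines\tWE1X92");
        System.out.println(namesById(lines));
    }
}
```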

Omit real world IDs

You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.

The omit file is a tab-separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv, where <LANG> is the three-letter language code of the file. Each omit file can contain names in only one language, and a separate file must be made for each language. Three types of lines can appear in an omit file, each with a different effect on omission: pairs, lone names, and lone QIDs.

  • Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.

  • Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for real world ID matching.

  • Lone QID: A QID preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
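The three line types can be told apart by which column holds the asterisk. The following sketch is illustrative only: OmitLineClassifier is a hypothetical helper, not a Match class, and it assumes the two columns are separated by a single tab.

```java
// Hypothetical classifier for omit file lines; not a Match class.
class OmitLineClassifier {
    enum Kind { PAIR, LONE_NAME, LONE_QID }

    static Kind classify(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 2) {
            throw new IllegalArgumentException("Expected name<TAB>QID: " + line);
        }
        String name = cols[0].trim();
        String qid = cols[1].trim();
        boolean nameIsStar = name.equals("*");
        boolean qidIsStar = qid.equals("*");
        if (name.isEmpty() || qid.isEmpty() || (nameIsStar && qidIsStar)) {
            throw new IllegalArgumentException("Not a valid omit line: " + line);
        }
        if (qidIsStar) return Kind.LONE_NAME; // name with * in the QID column
        if (nameIsStar) return Kind.LONE_QID; // * in the name column, then a QID
        return Kind.PAIR;                     // name and QID on the same line
    }

    public static void main(String[] args) {
        for (String line : new String[] {"IBM\tQ37156", "Nintendo\t*", "*\tQ45700"}) {
            System.out.println(line.replace('\t', ' ') + " -> " + classify(line));
        }
    }
}
```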

To enable an omit file in Match:

  1. Place the omit file in the BT_ROOT directory.

  2. Open omit_ids.datafiles, which is in the /rlpnc/data/real_world_ids/ref/omit_ids directory by default.

  3. Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:

    ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv

  4. Save omit_ids.datafiles.

Matching addresses

Match provides a Java API for matching addresses in English, Traditional Chinese, and Simplified Chinese.

In the Match context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than or equal to zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.

Note

Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.

Address definition

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.

When entered as a set of fields, the address may include any of the supported address fields. At least one field must be specified, but no specific fields are required.

Match optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.

Address field groups

When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, Match uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, Match applies a penalty to that match, reducing the score for that pair.

When matching two fields, the following penalties are applied:

  • If the fields are the same, no penalty is applied. (street - street)

  • If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)

  • If the fields are in different field groups, a large penalty is applied. (road - city)

Table 26. Address Groups

Group | Fields
house | house
house_number | houseNumber
road | road
unit | unit, level, staircase, entrance
city | suburb, cityDistrict, city
state | island, stateDistrict, state
country | countryRegion, country, worldRegion
post_code | postCode
po_box | po_box
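The penalty rules and the groups in Table 26 can be expressed as a small lookup. This is a sketch, not Match code: the class and enum names are hypothetical, and because the actual penalty values are not documented here, the sketch returns only the penalty tier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup built from Table 26; not a Match class.
class FieldGroupPenalty {
    enum Penalty { NONE, SMALL, LARGE }

    // Maps each address field to its group.
    static final Map<String, String> GROUP_OF = new HashMap<>();

    private static void group(String group, String... fields) {
        for (String field : fields) GROUP_OF.put(field, group);
    }

    static {
        group("house", "house");
        group("house_number", "houseNumber");
        group("road", "road");
        group("unit", "unit", "level", "staircase", "entrance");
        group("city", "suburb", "cityDistrict", "city");
        group("state", "island", "stateDistrict", "state");
        group("country", "countryRegion", "country", "worldRegion");
        group("post_code", "postCode");
        group("po_box", "po_box");
    }

    static Penalty penalty(String fieldA, String fieldB) {
        String groupA = GROUP_OF.get(fieldA);
        String groupB = GROUP_OF.get(fieldB);
        if (groupA == null || groupB == null) {
            throw new IllegalArgumentException("Unknown address field");
        }
        if (fieldA.equals(fieldB)) return Penalty.NONE;             // same field
        return groupA.equals(groupB) ? Penalty.SMALL : Penalty.LARGE; // same group or not
    }

    public static void main(String[] args) {
        System.out.println(penalty("road", "road"));
        System.out.println(penalty("suburb", "city"));
        System.out.println(penalty("road", "city"));
    }
}
```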



Address matching usage model

Identify two addresses to compare.

Use MatchScorer to score the similarity of two AddressSpec objects. MatchScorer and AddressSpec are in the com.basistech.rni.match and com.basistech.rni.match.address packages respectively.

// Use MatchScorer to match two addresses.
void match2Addresses(AddressSpec addr1, AddressSpec addr2) {
    MatchScorer ms = new MatchScorer();
    double score = ms.score(addr1, addr2);
    // Handle the score.
    System.out.println("Score: " + score);
    // Release resources used by the match scorer.
    ms.close();
}

How Match calculates address match scores

The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.

  • Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Match uses the jpostal library to parse the addresses into address fields. This process works well for well-formatted addresses, but may have difficulty when an address is irregularly formatted.

    For example, most addresses are formatted from specific to general:

    houseNumber road city state postCode
    
    • The parser would provide predictable results for an address in an expected order:

      38 Concord Road, Apt. B Arlington MA

    • The parser would have more difficulty if the address format was in an unexpected order:

      Arlington MA Concord Road #38 Apt B

    If you are getting unexpected match values, check how the addresses are being parsed into address fields.

  • Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.

  • Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:

    • Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.

    • Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.

    • Postal codes: Match uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Match can recognize and score the match correctly.

  • Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.

  • Apply field weights. Some fields in an address are considered more important than others. The scores from the selected matches are weighted by field type. These field type weightings can be modified based on the type of address data in your system.
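For the edit distance technique mentioned in the list above, the count of character additions, substitutions, and deletions is the classic Levenshtein distance. The following is a generic textbook sketch, not Match's implementation; how Match converts the count into a score between 0.0 and 1.0 is not described here, so the sketch returns only the raw count.

```java
// Generic Levenshtein distance; an illustration, not Match's scoring code.
class EditDistanceSketch {
    // Minimum number of single-character additions, substitutions, and deletions
    // needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("38", "38a"));    // one addition
        System.out.println(distance("1234", "1243")); // two substitutions
    }
}
```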

Configuring address matching

Match Plugin for OpenSearch Restriction

This feature requires access to system parameters not currently available in this release.

We are currently working with OpenSearch to provide access and enable these features.

Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.

There are two types of override files for addresses:

  • Stop patterns and stop word prefixes designate address field elements to strip during indexing and queries.

  • Token pair overrides specify address field element pairs that match.

File Directories

  • The parameters are modified in the /rlpnc/data/etc/parameter_profiles.yaml file.

  • The address matching override files are in the /rlpnc/data/addresses/ref/overrides directory.

  • The address stop word files are in the /rlpnc/data/addresses/ref/stopwords directory.

Modifying address parameters

To start tuning the parameters, run address matching on a test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter configuration files.

Note

Changes made to the any profile apply to all supported languages.

An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify the value. Increasing the value of addressJoinedTokenLimit allows Match to merge more tokens.

Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.

Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Address parameters

Stop patterns and stop word prefixes

Match uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries, before the matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as the, that you do not want to include in address matching.

For each address field, Match performs the following steps in order:

  1. Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.

  2. Stop patterns are applied.

  3. Stop words are applied.
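The three steps above can be sketched as a single function. This is an approximation for illustration, not Match's implementation. The class name is hypothetical, and the sketch makes several assumptions: stripped punctuation is deleted rather than replaced with a space, "longer" is measured by the length of the pattern or prefix string, whitespace is collapsed again after stop patterns are applied, and a stop word prefix must match whole leading tokens of the field value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical approximation of the per-field steps; not a Match class.
class AddressFieldNormalizer {
    static String normalize(String field, List<String> stopPatterns, List<String> stopPrefixes) {
        // 1. Character-level normalization: drop . , - #, collapse whitespace, lower-case.
        String s = field.replaceAll("[.,#-]", "")
                .replaceAll("\\s+", " ")
                .trim()
                .toLowerCase(Locale.ROOT);

        // 2. Stop patterns (regular expressions), longer patterns first.
        List<String> patterns = new ArrayList<>(stopPatterns);
        patterns.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String p : patterns) {
            s = Pattern.compile(p).matcher(s).replaceAll("");
        }
        s = s.replaceAll("\\s+", " ").trim();

        // 3. Stop word prefixes (string literals), longest matching prefix wins.
        List<String> prefixes = new ArrayList<>(stopPrefixes);
        prefixes.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String prefix : prefixes) {
            if (s.equals(prefix) || s.startsWith(prefix + " ")) {
                s = s.substring(prefix.length()).trim();
                break;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String result = normalize("The  City of Boston.",
                List.of("\\bof\\b"),
                List.of("the", "the city"));
        System.out.println(result);
    }
}
```

In this example the longer prefix "the city" is tried before "the", which mirrors the precedence rule for overlapping entries.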

Stop pattern

A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.

Stop patterns for a given address field are specified in a UTF-8 file with the AddressField name:

stopregexes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[8] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at beginning and end where needed.

Note

The delimiter before FIELD is a double underscore (__).

Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.

Stop pattern files are arranged by field in /rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop word prefixes

A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.

Stop word prefixes for a given address field are specified in a UTF-8 file with the AddressField name:

stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,[9] is a string literal.

Note

The delimiter before FIELD is a double underscore (__).

Prefixes in the address field matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.

Match includes files with stop word prefixes for selected address fields in English and Chinese. These files are in /rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stop word prefixes for the CITY address field for English.

Overriding token pair matches

You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. When Match evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

LANG1_LANG2_FIELD.txt

where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three-letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value is used.

Token1 Tab Token2 Tab [0.0-1.0]

A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [0.0-1.0]/force

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
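The row format described above can be read with a small parser. The following is a sketch under stated assumptions, not the parser Match uses: the class TokenPairOverrideSketch is hypothetical, and the defaultScore argument stands in for whatever value addressOverrideDefaultScore has in your configuration (0.5 below is an arbitrary placeholder).

```java
// Hypothetical parser, not part of the Match API.
final class TokenPairOverrideSketch {
    final String token1;
    final String token2;
    final double score;   // raw score between 0.0 and 1.0
    final boolean forced; // true: exact score; false: minimum score

    TokenPairOverrideSketch(String token1, String token2, double score, boolean forced) {
        this.token1 = token1;
        this.token2 = token2;
        this.score = score;
        this.forced = forced;
    }

    // Parses one row; returns null for comment rows.
    // defaultScore stands in for the addressOverrideDefaultScore parameter value.
    static TokenPairOverrideSketch parse(String row, double defaultScore) {
        if (row.startsWith("#")) {
            return null;
        }
        String[] cols = row.split("\t");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected two tab-delimited tokens: " + row);
        }
        if (cols.length == 2) {
            return new TokenPairOverrideSketch(cols[0], cols[1], defaultScore, false);
        }
        String spec = cols[2].trim();
        if (spec.equals("SUPPRESS")) {
            // SUPPRESS is an alias for "0.0/force".
            return new TokenPairOverrideSketch(cols[0], cols[1], 0.0, true);
        }
        boolean forced = spec.endsWith("/force");
        if (forced) {
            spec = spec.substring(0, spec.length() - "/force".length());
        }
        double score = Double.parseDouble(spec);
        if (score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("Score out of range: " + row);
        }
        return new TokenPairOverrideSketch(cols[0], cols[1], score, forced);
    }

    public static void main(String[] args) {
        TokenPairOverrideSketch pair = parse("road\trd\t0.9/force", 0.5);
        System.out.println(pair.token1 + " ~ " + pair.token2 + " = " + pair.score
                + (pair.forced ? " (forced)" : " (minimum)"));
    }
}
```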

Match includes /rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:

Massachusetts  MA
California  CA

When you create an additional file in the same location, use the AddressField name in the filename to identify the address field that the token pairs apply to. For example, zho_eng_cityDistrict.txt indicates that the contents match Chinese-English cityDistrict address fields.

Indexing addresses

Match enables high-speed, scalable searches for addresses in English using the Apache Lucene full-text search engine to store addresses with their search keys and a key index.

When you search for an address, Match generates a search key for each component of each address field, locates all addresses indexed by those search keys, and uses linguistic matching algorithms to filter that set of addresses down to the most similar addresses.

Match provides a Java API that you can use to embed it in your applications.

Java packages: The address indexing classes are in com.basistech.rni.index.internal. Unqualified class names that appear in this section are in com.basistech.rni.index.internal.

For detailed information about the API, see the Java API Reference.

Using Match to index addresses

Reminder: If you have not already done so, you must set the Basis root directory.

  1. Construct an address index.

  2. Open the address index.

  3. Define a query.

  4. Run the query and access the results.

  5. Close the address index.

Constructing an address index

An address index is an indexed list of addresses. The list includes a collection of AddressSpec objects and associated keys.

The AddressSpec object may include house, house number, road, unit, level, staircase, entrance, suburb, city district, city, island, state district, state, country region, country, world region, post code, post office box and additional fields.

Note

You can also create an index in memory that is never stored on disk.

To create an indexed list of addresses on disk, you must specify a pathname for the data store.

For example:

// Create an Address index.
// indexPathname specifies the directory where the index will be created.
StandardAddressIndex createIndex(String indexPathname) throws NameIndexStoreException,
        RNTException {
    StandardAddressIndex index = StandardAddressIndex.create(indexPathname);
    return index;
}

Now you can use AddressSpecBuilder to create AddressSpec objects and add them to the index. AddressSpecBuilder provides a fluent interface that supports method chaining.

You can also create an AddressSpec object by parsing an address using AddressSpecBuilder.parse(String str) which internally utilizes the jpostal library. The following fragment illustrates the syntax for creating and adding an AddressSpec to the index.

// Add an address to the index.
void addAddress(StandardAddressIndex index, Integer id) throws NameIndexException, IOException {
    // Give the address a unique identifier. Must be a string.
    String uid = Integer.toString(id);
    // AddressSpecBuilder provides methods for adding address fields,
    // and a build method that returns the AddressSpec.
    AddressSpec addr = new AddressSpecBuilder()
            .house("101")
            .road("Stuart Street")
            .city("Boston")
            .state("MA")
            .countryRegion("New England")
            .uid(uid)
            .build();
    // AddressSpecBuilder also provides a method for parsing addresses which uses jpostal,
    // and a build method that returns the AddressSpec.
    AddressSpec addr2 = AddressSpecBuilder.parse("101 Stuart Street, Boston, MA").build();

    index.addAddress(addr);
    index.close();
}

When you are done adding addresses, be sure to close the address index, as in the preceding fragment.

Querying an address index

You can define and run queries that search an index for similar addresses.

Opening an address index

The primary role of an address index is to perform queries. You can also perform updates (insertions and deletions).

StandardAddressIndex provides a static method for opening an address index.

StandardAddressIndex index = StandardAddressIndex.open(indexPathname);

indexPathname is the path to the directory that contains the address index.

To optimize the index for more efficient queries, call:

index.optimize();

When you are done using the address index, you must close it:

index.close();

Defining an address search query

A query includes an AddressSpec object and several settings that you can use to constrain the query.

Set up an AddressIndexQuery object. For example:

// Define a query.
AddressIndexQuery defineQuery(AddressSpec address) {
    AddressIndexQuery query = new AddressIndexQuery(address);
    query.setAddressDataMinimumMatchScore(0.30);
    return query;
}

Query performance tradeoffs

You can make tradeoffs between different dimensions of performance by adjusting certain AddressIndexQuery parameters.

For more information about tradeoffs between accuracy and speed and between false positives and false negatives, refer to Query Performance Tradeoffs for names. For addresses, you will adjust the addressesToCheckAllowance and maximumAddressesToCheck AddressIndexQuery parameters.

Running the query and accessing the query results

StandardAddressIndex includes a query method that takes as its parameter the AddressIndexQuery you have set up.

The query returns an AddressIndexQueryResult list. Each AddressIndexQueryResult object provides an AddressSpec object and a similarity score. As the following fragment illustrates, you can obtain and process each AddressSpec and its score. The higher the score (greater than 0 and less than or equal to 1), the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query address and result address are identical. Scoring is commutative: the scores for two given addresses are always the same, regardless of which address is in the index and which address is in the query.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/address_query_index.java

AddressMatchResult. The AddressIndexQueryResult provides an AddressMatchResult object, which in turn provides a match type and score.

Cleanup

When you are done running queries, close the index:

index.close();

Sample

For a sample Java application that defines a Match address query, runs the query, and reports the results, see AddressIndexQuerySample.

Multithreading

No more than one StandardAddressIndex object may exist for a given address index on disk at any time.

Queries and updates may be performed in multiple threads on a single StandardAddressIndex object.

Matching dates

Match can match dates, returning a date match score that reflects how close in time the two dates are. Dates that are closer together are considered a stronger match and return a match score closer to 1.

For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.

Date definition

A date contains a year, month, and day, but not all fields are required for matching. All common delimiters for English dates are supported, and dates can be expressed with various orderings. Match will filter out some non-date related words. Formats that include time of day are not supported.

Match supports a wide variety of date formats. The preferred format is the ISO standard YYYY-MM-DD, in which March 7, 1984 is written as 1984-03-07. Match attempts to interpret any date provided, but the less standard the format, the less certain it is that the interpretation will be the one you expect.

Dates can be represented as YYYY-MM-DD. When some fields are unspecified, the letters stand in for the unknown values. For example, March 7 is YYYY-03-07, since the year is unspecified. Two-digit years are assumed to have unknown centuries: 3/7/84 is interpreted as YY84-03-07, so March 7, 1984 is an equally good match as March 7, 2084 and March 7, 1884.

When a date is provided, Match will attempt to identify the year, month, and day within it, leaving blank any fields it cannot determine. You can omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12- are all valid dates.

If Match encounters an invalid date in an acceptable format, such as March 38, 1984, it does not return an error. Instead, it treats the impossible value as unknown and interprets the date as March 1984.
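The handling of unknown fields can be illustrated with a small Java model. This is a minimal sketch, not Match's date parser: the class PartialDateSketch is hypothetical, it accepts only hyphen-delimited Y-M-D text, and it assumes that one- and two-digit years carry no century. A null field means unknown, and an impossible month or day is replaced by unknown rather than rejected.

```java
import java.time.Month;
import java.time.YearMonth;

// Hypothetical model, not part of the Match API. A null field is unknown.
final class PartialDateSketch {
    final Integer century;
    final Integer yearWithoutCentury;
    final Integer month;
    final Integer day;

    private PartialDateSketch(Integer century, Integer yearWithoutCentury, Integer month, Integer day) {
        this.century = century;
        this.yearWithoutCentury = yearWithoutCentury;
        this.month = month;
        this.day = day;
    }

    // Parses hyphen-delimited Y-M-D text in which any field may be left empty.
    static PartialDateSketch parse(String text) {
        String[] parts = text.split("-", -1); // -1 keeps trailing empty fields
        String yearText = parts[0].trim();
        Integer century = null;
        Integer yearWithoutCentury = null;
        if (!yearText.isEmpty()) {
            int year = Integer.parseInt(yearText);
            yearWithoutCentury = year % 100;
            // Assumption: one- and two-digit years carry no century.
            if (yearText.length() > 2) {
                century = year / 100;
            }
        }
        Integer month = field(parts, 1);
        if (month != null && (month < 1 || month > 12)) {
            month = null; // impossible month becomes unknown
        }
        Integer day = field(parts, 2);
        if (day != null && (day < 1 || day > maxDay(century, yearWithoutCentury, month))) {
            day = null; // impossible day becomes unknown
        }
        return new PartialDateSketch(century, yearWithoutCentury, month, day);
    }

    private static Integer field(String[] parts, int i) {
        if (i >= parts.length || parts[i].trim().isEmpty()) {
            return null;
        }
        return Integer.parseInt(parts[i].trim());
    }

    private static int maxDay(Integer century, Integer yearWithoutCentury, Integer month) {
        if (month == null) {
            return 31;
        }
        if (century != null && yearWithoutCentury != null) {
            return YearMonth.of(century * 100 + yearWithoutCentury, month).lengthOfMonth();
        }
        return Month.of(month).maxLength(); // allows February 29 when the year is unknown
    }

    private static boolean agree(Integer a, Integer b) {
        return a == null || b == null || a.equals(b);
    }

    // True when no field that is known on both sides differs.
    boolean compatibleWith(PartialDateSketch other) {
        return agree(century, other.century) && agree(yearWithoutCentury, other.yearWithoutCentury)
                && agree(month, other.month) && agree(day, other.day);
    }

    @Override
    public String toString() {
        String c = century == null ? "YY" : String.format("%02d", century);
        String y = yearWithoutCentury == null ? "YY" : String.format("%02d", yearWithoutCentury);
        String m = month == null ? "MM" : String.format("%02d", month);
        String d = day == null ? "DD" : String.format("%02d", day);
        return c + y + "-" + m + "-" + d;
    }

    public static void main(String[] args) {
        System.out.println(parse("84-03-07"));   // prints YY84-03-07
        System.out.println(parse("1984-03-38")); // prints 1984-03-DD
    }
}
```

In this model, parse("84-03-07") is compatible with 1984-03-07, 2084-03-07, and 1884-03-07 alike, which mirrors the unknown-century behavior described above.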

Supported date formats

Match supports a wide variety of date formats. 

  • Days can be represented by 1 or 2 digits. Alphanumerics, such as 2nd or 1st, are not supported.

  • Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).

  • Years can be represented by 1, 2, 3 or 4 digits.

  • Supported delimiters include , . - /, as well as a space.

  • Partial fields can be entered.

  • At this time, only English month names and abbreviations are recognized.

  • All words are case-insensitive; upper and lower case are interpreted the same.

The following table shows different acceptable formats for the date March 7, 1984.

| Format | Valid examples | Notes |
|---|---|---|
| Y-M-D | 1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7 | |
| M-D | 03-07; 3/7; Mar-07; March 7 | |
| Y-M | 1984-03; 1984 March; 1984-Mar | |
| YYYYMMDD | 19840307 | All 8 digits must be included |
| M-D-Y | 03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984 | |
| M-YYYY | 03-1984; March 1984; Mar-1984 | The year must include 4 digits. March-84 will not be recognized. |
| D-M-Y | 07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984 | |
| D-M | 07-03; 7/3; 07-Mar; 7 March | |
| D(MONTH)Y | 7MAR84; 07March1984 | The month is a word or abbreviation |
| YYYY | 1984 | |
| Month | March | |

Date match parameters

Match Plugin for OpenSearch Restriction

This feature requires access to system parameters not currently available in this release.

We are currently working with OpenSearch to provide access and enable these features.

As with name matching, a set of parameters controls date matching. You can edit the parameter values in the /rlpnc/data/etc/parameter_defs.yaml file.

Japanese date support

Japanese dates are written in a unique format that combines both western and kanji characters. Match supports matching Japanese-Japanese dates as well as Japanese-Gregorian dates. A Japanese date is automatically detected by identifying the kanji characters.

  • The most common date format in Japan is YMD. This is the only format currently supported.

  • Eras can be represented as the full kanji (e.g. 令和), or an abbreviation (e.g. R).

  • Date numbers (for year, month, or day) can be represented with digits or kanji.

  • Japanese dates are only supported for dates after Meiji 6 (1873).

The formats currently supported are:

  • Gregorian dates:

    YYYY年MM月DD日, YYYY年MM月, MM月DD日, YYYY年, MM月, DD日

  • Era-based dates:

    [ERA]YY年MM月DD日, [ERA]YY/MM/DD, [ERA]YYYY年MM月DD日, [ERA]YY年MM月, [ERA]YY/MM, [ERA]YY年
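To see how an era-based date relates to a Gregorian date, the conversion can be sketched as follows. This is an illustration only, not how Match detects or converts Japanese dates: the class JapaneseEraSketch is hypothetical, it handles only complete [ERA]YY年MM月DD日 and [ERA]YY/MM/DD dates written with ASCII digits, and it uses the well-known first Gregorian year of each era (Meiji 1868, Taisho 1912, Showa 1926, Heisei 1989, Reiwa 2019). Kanji are written as Unicode escapes so the source stays ASCII.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical converter, not part of the Match API.
final class JapaneseEraSketch {

    // Gregorian year of each era's year 1, keyed by kanji name and letter abbreviation.
    private static final Map<String, Integer> ERA_START = new HashMap<>();
    static {
        register("\u4EE4\u548C", "R", 2019); // Reiwa
        register("\u5E73\u6210", "H", 1989); // Heisei
        register("\u662D\u548C", "S", 1926); // Showa
        register("\u5927\u6B63", "T", 1912); // Taisho
        register("\u660E\u6CBB", "M", 1868); // Meiji
    }

    // Era, era year, year sign or slash, month, month sign or slash, day, optional day sign.
    private static final Pattern ERA_DATE = Pattern.compile(
            "^(\\D+?)(\\d{1,2})[\u5E74/](\\d{1,2})[\u6708/](\\d{1,2})\u65E5?$");

    private static void register(String kanji, String letter, int startYear) {
        ERA_START.put(kanji, startYear);
        ERA_START.put(letter, startYear);
    }

    // Converts an era-based date to a Gregorian LocalDate.
    static LocalDate toGregorian(String text) {
        Matcher m = ERA_DATE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported format: " + text);
        }
        Integer start = ERA_START.get(m.group(1).toUpperCase(Locale.ROOT));
        if (start == null) {
            throw new IllegalArgumentException("Unknown era: " + m.group(1));
        }
        // Year 1 of an era falls in the era's first Gregorian year.
        int year = start + Integer.parseInt(m.group(2)) - 1;
        if (year < 1873) {
            throw new IllegalArgumentException("Dates before Meiji 6 (1873) are not supported");
        }
        return LocalDate.of(year, Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        System.out.println(toGregorian("R6/10/1")); // prints 2024-10-01
    }
}
```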

Matching records

Record similarity refers to a pairwise match between two lists of records which can include multiple fields and return a single match and match score. The fields can be any combination of RecordFieldType.RNI_NAME, RecordFieldType.RNI_DATE, and RecordFieldType.RNI_ADDRESS. The records do not have to contain the same fields; only fields with the same field name are compared. If one record has three fields and the other has two fields, the missing field will be ignored and the other two fields compared.

Each field can be assigned a weight to reflect its importance in the overall matching logic. When matching two records, some fields are more important in determining a match than others. For example, the name field is likely more important in determining a match than an address field. If no weights are defined, each field is weighted equally.

You can specify individual parameter values or a parameter universe string in the record similarity properties object to set tuning variables for the record similarity call.

When matching records, a similarity score is calculated for each field. The final match score is then calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is removed from the score calculation and its weight is evenly distributed across the other fields. Set the score_if_null parameter to a value between 0 and 1 to include missing fields in the score. When set, that value is returned when the field is missing from the record.
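The calculation described above can be written out as a short sketch. It follows the paragraph literally and is not RecordScorer's implementation: the class RecordScoreSketch and its combine method are hypothetical, the weights are invented, and the scoreIfNull argument plays the role of the score_if_null parameter (pass null to leave it unset).

```java
// Hypothetical calculation, not part of the Match API.
final class RecordScoreSketch {

    // scores[i] == null means field i is missing from one of the records.
    // scoreIfNull == null means the score_if_null parameter is not set.
    static double combine(Double[] scores, double[] weights, Double scoreIfNull) {
        int present = 0;
        double missingWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] != null) {
                present++;
            } else {
                missingWeight += weights[i];
            }
        }
        // Without score_if_null, the weight of missing fields is spread evenly
        // across the fields that are present.
        double bonus = (scoreIfNull == null && present > 0) ? missingWeight / present : 0.0;

        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            Double s = scores[i] != null ? scores[i] : scoreIfNull;
            if (s == null) {
                continue; // missing field removed from the calculation
            }
            double w = weights[i] + (scores[i] != null ? bonus : 0.0);
            weightedSum += w * s;
            totalWeight += w;
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Name 0.88 (weight 2), date 0.92 (weight 1), address missing (weight 1).
        Double[] scores = {0.88, 0.92, null};
        double[] weights = {2.0, 1.0, 1.0};
        System.out.println(combine(scores, weights, null)); // missing field dropped
        System.out.println(combine(scores, weights, 0.5));  // missing field scored as 0.5
    }
}
```

With these inputs, dropping the address field gives (2.5 × 0.88 + 1.5 × 0.92) / 4 = 0.895, while a score_if_null of 0.5 gives (2 × 0.88 + 0.92 + 0.5) / 4 = 0.795.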

Record matching usage model

Use RecordScorer to score the similarity of two records that include multiple field types, instead of the functions for a single type, such as the MatchScorer, DateScorer and AddressScorer functions for names, dates and address matching respectively.

Supported field types

The RecordScorer has default support for the RecordFieldType.RNI_NAME, RecordFieldType.RNI_DATE, and RecordFieldType.RNI_ADDRESS field types. All default similarity scores are between 0.0 and 1.0.

Table 29. RecordScorer Supported Field Types

| Field Type | Entity Type | Examples |
|---|---|---|
| RecordFieldType.RNI_NAME | PERSON | 'John David Smith' vs. 'Jon D Smith' = 0.88 |
| RecordFieldType.RNI_NAME | IDENTIFIER:DRIVERS_LICENSE | 'S82062270' vs. 'S82062272' = 0.9 |
| RecordFieldType.RNI_NAME | IDENTIFIER:LICENSE_PLATE | 'E23 2IN' vs. 'E23 2IM' = 0.875 |
| RecordFieldType.RNI_NAME | IDENTIFIER:NATIONAL_ID_NUM | '691-84-8999' vs. '691-84-9999' = 0.9167 |
| RecordFieldType.RNI_DATE | N/A | '2010-11-4' vs. '2010-5-11' = 0.92 |
| RecordFieldType.RNI_ADDRESS | N/A | 'Red Cedar Ct' vs. 'Cedar Ct' = 0.53 |



Explainability of matching

As important as getting a match score is, understanding how the system calculated the score can be just as important. When matching two names or records, Match returns a JSON response explaining in detail how the two names, dates, addresses, or records were matched. With this information, you can understand how the score was calculated and, if necessary, modify the matching parameters to better solve your matching problems.

The following concepts are helpful when reviewing the explainInfo JSON file.

  • When two objects are being compared, one is referred to as the left input, one as the right input.

  • Every token of the left object is compared to every token of the right object. Token strings, made up of multiple tokens, may also be compared.

  • Names are usually composed of multiple tokens. For example, John Fitzgerald Kennedy is 3 tokens.

Common terms

The response JSON contains sections for each type of object: names, addresses, and dates. While each object has its own criteria for comparison, there are common terms used for all comparisons, as shown below.

Table 30. Definitions of Terms

| Term | Definition | Note |
|---|---|---|
| bin | A number representing the frequency of the token in the language. A lower bin indicates the token is unusual and therefore should be more highly weighted when calculating the similarity score. | |
| biasedBin | The bin raised to a power from 0.1 to 10 (default 0.970). This value is set by the frequencyRankBias parameter. | |
| scoreInIsolation | The matching score of just the tuples being compared, ignoring factors such as position in the name and name weighting. This shows a match score of 1.000 for an exact match of tokens, even if there are biases that will lower the score in context. | |
| scoreInContext | The matching score between the tuples, taking into account the placement in the overall query and any biases related to the overall query. | |
| (left/right)MinTokenIndex | The index of the first token in the string of tokens. | For single tokens, the min and max tokenIndex have the same value. An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
| (left/right)MaxTokenIndex | The index of the last token in the string. | For single tokens, the min and max tokenIndex have the same value. An index of -1 or -2 means the token isn't in the name and the token is considered a deletion. |
| unbiasedScore | The raw score before any calculations using finalBias, adjustOneSidedDeletionScores, or other such bias parameters. | |
| score | The final score after finalBias, adjustOneSidedDeletionScores, and other such bias parameters are added to the calculation. | |



Response structure

All match responses contain the same sections. The details within each section vary with the type of object matched (names, dates, or addresses).

  • Left/right input information: The input information for each input along with the properties for each token in the input. Properties depend on the type of object being matched.

    For example, the name matching example contains the following properties:

    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH"

    While a date comparison would contain different properties:

    "century": 20,
    "month": 10,
    "canonicalForm": "2024-10-01",
    "yearWithoutCentury": 24,
    "dayMonthSwapped": true,
    "originalString": "10 January 2024",
    "modifiedJulianDay": 60584,
    "day": 1
  • Tuple scores: The score for every tuple, where a tuple is a token string from the left input and a token string from the right input. Every token in the left input is matched to every token in the right input, along with some token strings (multiple tokens combined together).

  • Score adjustments: The score adjustments list the parameters applied, and the score calculated with those parameters.

    For example, the name example here contains the following parameters:

    "unbiasedScore": 0.6829129823127231,
    "score": 0.6919264820086959,
    "parameter": "adjustOneSidedDeletionScores"
    
    "unbiasedScore": 0.6919264820086959,
    "score": 0.8435140063279181,
    "parameter": "finalBias"

    Meanwhile, a date comparison would contain different parameters. In this case, a different matching scheme, tryDayMonthSwap, is tried to see if a better result is returned.

    "score": 0.95,
    "unbiasedScore": 0.5926523220980572,
    "parameter": "tryDayMonthSwap"
    
    "score": 0.95,
    "unbiasedScore": 0.95,
    "parameter": "dateFinalBias"
  • Final score: The similarity score for the two names.

Example: matching names

Let's take a look at an example. In this example we're matching the following 2 names:

  • John Smith

  • Jon J Smyth

The JSON output is broken down by section.

Example 17. Left Input: John Smith
"leftInput": {
    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.41435888604672094,
        "tokenType": "GIVEN"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.585641113953279,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },
  • The name is tokenized. Each token is evaluated.

  • The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.

  • The tokenTypes are identified. Even if the name was provided as Smith John, Smith would be identified as a SURNAME and John as a GIVEN name.
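The biasedBin values in this example are consistent with the definition of biasedBin as the bin raised to the frequencyRankBias power, using the default of 0.970. The following sketch reproduces them; the class BiasedBinSketch is hypothetical and only restates that arithmetic.

```java
// Hypothetical helper, not part of the Match API.
final class BiasedBinSketch {

    // biasedBin is the bin raised to the frequencyRankBias power.
    static double biasedBin(double bin, double frequencyRankBias) {
        return Math.pow(bin, frequencyRankBias);
    }

    public static void main(String[] args) {
        double bias = 0.970; // default frequencyRankBias
        System.out.println("john:  " + biasedBin(5.0, bias)); // about 4.7643
        System.out.println("smith: " + biasedBin(3.5, bias)); // about 3.3709
    }
}
```

Because the exponent is slightly below 1, the bias compresses high bins a little more than low ones, and a bin of 1 is left unchanged.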



Example 18. Right input: Jon J Smyth
"rightInput": {
    "data": "Jon J. Smyth",
    "normalizedData": "jon j. smyth",
    "latnData": "jon j. smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "jon",
        "latnToken": "jon",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.2083122782666673,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "j",
        "latnToken": "j",
        "bin": 8,
        "biasedBin": 7.5161819937120935,
        "tokenWeight": 0.08948764635417582,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.702200075379157,
        "tokenType": "UNKNOWN"
      }
    ],
    "entityType": "PERSON"
  },
  • The name is tokenized. Each token is evaluated.

  • The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.

  • The tokenTypes are identified. Since both Jon and Smyth are unusual spellings, the tokenType is not identified.



Example 19. Score tuples
"scoreTuples": [
    {
      "scoreInIsolation": 0.7595918889283346,
      "scoreInContext": 0.7595918889283346,
      "left": "john",
      "right": "jon",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.4912303477031893,
      "scoreInContext": 0.4666688303180298,
      "left": "john",
      "right": "jonj",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.542,
      "scoreInContext": 0.4743439389212776,
      "left": "john",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.2941408383164158,
      "scoreInContext": 0.279433796400595,
      "left": "johnsmith",
      "right": "jon",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.46557800000000005,
      "scoreInContext": 0.4422991,
      "left": "johnsmith",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 2,
      "rightMaxTokenIndex": 2
    },
    {
      "scoreInIsolation": 0.27169000000000004,
      "scoreInContext": 0.27169000000000004,
      "left": "",
      "right": "j",
      "marked": true,
      "reason": "DELETION",
      "leftMinTokenIndex": -1,
      "leftMaxTokenIndex": -1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],

All tuples are compared. The tuples that are marked as true are the matches that are used to calculate the scores.

Matching tuples:

  1. John : Jon

  2. Smith : Smyth

  3. J : deleted (no match)
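When reading scoreTuples programmatically, the same selection can be made by filtering on marked and checking for negative token indexes. The sketch below assumes the tuples have already been parsed from JSON into a simple holder; the class ScoreTupleSketch is hypothetical and carries only the fields needed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for a few scoreTuples fields; not part of the Match API.
final class ScoreTupleSketch {
    final String left;
    final String right;
    final boolean marked;
    final int leftMinTokenIndex;
    final int rightMinTokenIndex;

    ScoreTupleSketch(String left, String right, boolean marked,
                     int leftMinTokenIndex, int rightMinTokenIndex) {
        this.left = left;
        this.right = right;
        this.marked = marked;
        this.leftMinTokenIndex = leftMinTokenIndex;
        this.rightMinTokenIndex = rightMinTokenIndex;
    }

    // A negative token index means the token has no counterpart on that side.
    boolean isDeletion() {
        return leftMinTokenIndex < 0 || rightMinTokenIndex < 0;
    }

    String describe() {
        if (isDeletion()) {
            String unmatched = leftMinTokenIndex < 0 ? right : left;
            return unmatched + ": deleted (no match)";
        }
        return left + " : " + right;
    }

    // Keeps only the tuples that contribute to the score.
    static List<ScoreTupleSketch> markedOnly(List<ScoreTupleSketch> tuples) {
        List<ScoreTupleSketch> kept = new ArrayList<>();
        for (ScoreTupleSketch tuple : tuples) {
            if (tuple.marked) {
                kept.add(tuple);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // A subset of the tuples from the example above.
        List<ScoreTupleSketch> tuples = Arrays.asList(
                new ScoreTupleSketch("john", "jon", true, 0, 0),
                new ScoreTupleSketch("john", "jonj", false, 0, 0),
                new ScoreTupleSketch("smith", "smyth", true, 1, 2),
                new ScoreTupleSketch("", "j", true, -1, 1));
        for (ScoreTupleSketch tuple : markedOnly(tuples)) {
            System.out.println(tuple.describe());
        }
    }
}
```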



Example 20. Score adjustments
"scoreAdjustments": [
    {
      "unbiasedScore": 0.6829129823127231,
      "score": 0.6919264820086959,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6919264820086959,
      "score": 0.8435140063279181,
      "parameter": "finalBias"
    }
  ],

The unbiasedScore is the score before the parameter is applied. The score is the result after the parameter is applied.
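In this example the adjustments form a chain: each unbiasedScore equals the score of the previous adjustment, and the last score equals the finalScore. The following sketch checks that property for already-parsed adjustments. The class ScoreAdjustmentSketch is hypothetical, and the chaining is an observation from the examples in this section rather than a documented guarantee.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one scoreAdjustments entry; not part of the Match API.
final class ScoreAdjustmentSketch {
    final double unbiasedScore;
    final double score;
    final String parameter;

    ScoreAdjustmentSketch(double unbiasedScore, double score, String parameter) {
        this.unbiasedScore = unbiasedScore;
        this.score = score;
        this.parameter = parameter;
    }

    // True when every step starts from the previous step's score
    // and the last step's score equals finalScore.
    static boolean chains(List<ScoreAdjustmentSketch> steps, double finalScore) {
        double tolerance = 1e-12;
        for (int i = 1; i < steps.size(); i++) {
            if (Math.abs(steps.get(i).unbiasedScore - steps.get(i - 1).score) > tolerance) {
                return false;
            }
        }
        return !steps.isEmpty()
                && Math.abs(steps.get(steps.size() - 1).score - finalScore) <= tolerance;
    }

    public static void main(String[] args) {
        List<ScoreAdjustmentSketch> steps = Arrays.asList(
                new ScoreAdjustmentSketch(0.6829129823127231, 0.6919264820086959,
                        "adjustOneSidedDeletionScores"),
                new ScoreAdjustmentSketch(0.6919264820086959, 0.8435140063279181, "finalBias"));
        System.out.println(chains(steps, 0.8435140063279181)); // prints true
    }
}
```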



Example 21. Final score
"finalScore": 0.8435140063279181

The final calculated score with all parameters applied. This is the similarity score returned by Match.



Response schemas by object

The following sections list the JSON schema for each object type.

Name response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftMinTokenIndex": { "type": "integer", "default": 0 },
          "leftMaxTokenIndex": { "type": "integer", "default": 0 },
          "rightMinTokenIndex": { "type": "integer", "default": 0 },
          "rightMaxTokenIndex": { "type": "integer", "default": 0 }
        },
        "required": ["left", "right", "reason"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}

Additional fields for translated names

The explain info contains all alternate readings or translations for the input name when the input name is in non-Latin script.

Example 22. Alternate Latin Data Fields
{
    "leftInput": {
        "data": "温家宝",
        "normalizedData": "温家宝",
        "latnData": "wen jiabao",
        "script": "Hani",
        "languageOfUse": "CHINESE",
        "languageOfOrigin": "UNKNOWN",
        "entityType": "PERSON",
        "alternativeLatnData": [
            "on kahou",
            "on ka-po"
        ]
    },
    "rightInput": {
        "data": "عبد الله بن عبد الرحمن بن جبرين",
        "normalizedData": "عبد الله بن عبد الرحمن بن جبرين",
        "latnData": "abd allah bin abd alrahman bin jabrin",
        "script": "Arab",
        "languageOfUse": "ARABIC",
        "languageOfOrigin": "UNKNOWN",
        "entityType": "PERSON",
        "alternativeLatnData": [
            "abid allah bin abd alrahman bin jabrin",
            "abd allah bin abid alrahman bin jabrin",
            "abd allah bunn abd alrahman bin jabrin",
            "abd allah bin abd alrahman bunn jabrin"
        ]
    },
    "finalScore": 0.0
}


Address response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "leftMinTokenIndex": { "type": "number", "default": 0 },
          "leftMaxTokenIndex": { "type": "number", "default": 0 },
          "rightMinTokenIndex": { "type": "number", "default": 0 },
          "rightMaxTokenIndex": { "type": "number", "default": 0 }
        },
        "required": ["left", "right", "reason", "leftField", "rightField"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" },
    "fieldScores": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "score": { "type": "number", "default": 0.0 },
          "marked": { "type": "boolean", "default": false }
        },
        "required": ["leftField", "rightField"]
      }
    }
  }
}

Date response schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "weight": { "type": "number", "default": 0.0 },
          "component": { "type": "string" },
          "differenceInDays": { "type": "integer" }
        },
        "required": ["left", "right", "component"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}
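
As a rough illustration of how a client might check a date response against the schema above, the following Python sketch verifies only the keys that the schema marks as required for each scoreTuples item. The helper name missing_tuple_keys and the sample response are hypothetical; the sample values are invented and are not real Match output.

```python
import json

# Required keys for each scoreTuples item, copied from the date response schema.
REQUIRED_TUPLE_KEYS = {"left", "right", "component"}


def missing_tuple_keys(response: dict) -> list:
    """Return (index, missing_keys) pairs for scoreTuples items lacking required keys."""
    problems = []
    for i, item in enumerate(response.get("scoreTuples", [])):
        missing = REQUIRED_TUPLE_KEYS - item.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems


# Hypothetical response body; every value here is invented for illustration.
sample = json.loads("""
{
  "leftInput": {"originalString": "1/7/1942"},
  "rightInput": {"originalString": "1/7/1940"},
  "scoreTuples": [
    {"left": "1942", "right": "1940", "component": "year"},
    {"left": "7", "right": "7"}
  ],
  "finalScore": 0.5
}
""")

print(missing_tuple_keys(sample))  # [(1, ['component'])]
```

A full validation would use a JSON Schema validator; this sketch only shows how the required list maps onto a response.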

Using Match with Solr

Match includes plugins for Solr 8.11.3 and Solr 9.6.0 that support the use of Match with Solr documents that contain names, addresses, and dates along with other data. The plugins support single-valued and multi-valued name, address, and date fields. With them, you can run Solr queries against documents that include, but are not limited to, name, address, and date fields.

Getting started with the Solr plugin

To index and search documents with Match in a Solr application, you must add JARs to the Solr classpath, add the name, address, and date fields to schema.xml, and modify solrconfig.xml.

Placing the Solr plugin JAR

The Solr plugin JAR should be in your Solr sharedLib directory.

JAR files used by all of the cores in your Solr application (including bt-rni-solr<version>-plugin.jar) should be placed in a sharedLib directory that is defined in solr.xml.

We have placed the Solr plugin JAR in rlpnc/data/rnm/sample/solr_shared_lib and included the corresponding sharedLib setting in solr.xml in our sample Solr home.

Example, from rlpnc/data/rnm/sample/solr9x_home/solr.xml:

<!-- Adjust the sharedLib setting if you move bt-rni-solr9x-plugins.jar to a different location. -->
<str name="sharedLib">${bt.root}/rlpnc/data/rnm/sample/solr_shared_lib</str>

Modifying schema.xml

Add fieldType and field definitions to schema.xml.

In types, define the NameField field type.

<fieldType name="bt_rni_name" class="com.basistech.rni.solr.NameField" needNameStore="true"/>

In types, define the AddressField field type.

<fieldType name="bt_rni_addr" class="com.basistech.rni.solr.AddressField" needAddressStore="true"/>

In types, define the DateField field type.

<fieldType name="bt_rni_date" class="com.basistech.rni.solr.DateField" needDateStore="true"/>

Add your name, address, and date fields in fields. For example:

<field name="primaryName" type="bt_rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="bt_rni_name" indexed="true" stored="true" multiValued="true"/>
<field name="residence" type="bt_rni_addr" indexed="true" stored="true" multiValued="false"/>
<field name="dateOfBirth" type="bt_rni_date" indexed="true" stored="true" multiValued="false"/>

You can copy fragments from rlpnc/data/rnm/sample/solr8x_home/collection1/conf/schema-xml-sample-fragments.xml.

These changes can also be made using the Solr Schema API in the Solr Admin page.

Modifying solrconfig.xml

Add the reRank queryParser elements and the match valueSourceParser elements included in the Match release to solrconfig.xml as top-level elements.

<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<queryParser name="rniAddrRerank" class="com.basistech.rni.solr.RNIAddressReRankQParserPlugin"/>
<queryParser name="rniDateRerank" class="com.basistech.rni.solr.RNIDateReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
<valueSourceParser name="rniAddrMatch" class="com.basistech.rni.solr.AddressMatchValueSourceParser"/>
<valueSourceParser name="rniDateMatch" class="com.basistech.rni.solr.DateMatchValueSourceParser"/>

If your documents include one or more multivalued name fields, include an updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIName">
  <processor class="com.basistech.rni.solr.MultiValueNameUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the update chain.

<requestHandler name="/update" 
      class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">RNIName</str>
  </lst>
</requestHandler>

If your documents include one or more multivalued address fields, include an updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIAddr">
 <!--Custom processor required when using multivalued address fields-->
  <processor class="com.basistech.rni.solr.MultiValueAddressUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
   <str name="update.chain">RNIAddr</str>
 </lst>
</requestHandler>

If your documents include one or more multivalued date fields, include an updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIDate">
  <!--Custom processor required when using multivalued date fields-->
  <processor class="com.basistech.rni.solr.MultiValueDateUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">RNIDate</str>
 </lst>
</requestHandler>

If your documents include multivalued fields of more than one type (name, address, or date), include a single updateRequestProcessorChain that contains the processor for each type you use.

<updateRequestProcessorChain name="RNI">
  <!--Custom processor required when using multivalued name fields-->
  <processor class="com.basistech.rni.solr.MultiValueNameUpdateRequestProcessorFactory"/>
  <!--Custom processor required when using multivalued address fields-->
  <processor class="com.basistech.rni.solr.MultiValueAddressUpdateRequestProcessorFactory"/>
  <!--Custom processor required when using multivalued date fields-->
  <processor class="com.basistech.rni.solr.MultiValueDateUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">RNI</str>
 </lst>
</requestHandler>

You can copy fragments from rlpnc/data/rnm/sample/solr8x_home/collection1/conf/solrconfig-xml-sample-fragments.xml.

Starting Solr

When starting Solr, you must include a Java system property that points to the root of the Match SDK and increase the heap size. If you are running JDK 17, you must also enable the security manager by including -Djava.security.manager=allow in the -a options. For example:

bin/solr -a "-Dbt.root=$BT_ROOT -Djava.security.manager=allow" -m 2g

On Windows, use the following PowerShell command:

Start-Process -FilePath bin\solr -ArgumentList "-f -s `"$Env:BT_ROOT\rlpnc\data\rnm\sample\ofac_solr_home`" -m 4g -a `"-Dbt.root=$Env:BT_ROOT`"" -NoNewWindow

For all systems, the BT_ROOT environment variable must be set.

Loading data into Solr

The data model

Documents or records typically contain multiple names, and not all are of the same type. For instance, in the OFAC Specially Designated Nationals list, a record may contain a primary name and a list of AKAs (also-known-as names). Ideally these would all be stored in a single Solr document to efficiently process complex queries involving multiple document fields, especially in a distributed setting.

For example, a Solr document might contain the following data:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>

Solr documents may also include multiple names referring to different persons, locations, or organizations. A single news document, for example, may contain references to a number of individuals.

Address fields

An address may include any of the fields in Table 31, "Supported Address Fields", below. At least one field must be specified, but no specific fields are required.

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library is used to parse the address string into address fields.

A fielded address is written as a sequence of consecutive, non-empty address fields. Each field consists of an AddressField name in lower camel case (house, houseNumber, road, unit, level, staircase, entrance, suburb, cityDistrict, city, island, stateDistrict, state, countryRegion, country, worldRegion, postCode, poBox) followed by the field value enclosed in angle brackets. Special characters in the value are hex-encoded and preceded by a percent sign.

Match optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.

Table 31. Supported Address Fields

Field name | Description | Example
house | venue and building names | house<Brooklyn Academy of Music>
houseNumber | usually refers to the external (street-facing) building number | houseNumber<123>
road | street name(s) | road<Harrison Avenue>
unit | an apartment, unit, office, lot, or other secondary unit designator | unit<Apt. 123>
level | expressions indicating a floor number | level<3rd Floor>
staircase | numbered/lettered staircase | staircase<2>
entrance | numbered/lettered entrance | entrance<front gate>
suburb | usually an unofficial neighborhood name | suburb<Crown Heights>
cityDistrict | these are usually boroughs or districts within a city that serve some official purpose | cityDistrict<Brooklyn>
city | any human settlement including cities, towns, villages, hamlets, localities, etc. | city<Boston>
island | named islands | island<Maui>
stateDistrict | usually a second-level administrative division or county | stateDistrict<Saratoga>
state | a first-level administrative division | state<Massachusetts>
countryRegion | informal subdivision of a country without any political status | countryRegion<South/Latin America>
country | sovereign nations and their dependent territories, which have a designated ISO-3166 code | country<United States of America>
worldRegion | currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean | worldRegion<Jamaica, West Indies>
postCode | postal codes used for mail sorting | postCode<02110>
poBox | post office box: typically found in non-physical (mail-only) addresses | poBox<28>

In the string that defines the content of an address field, place a tilde (~) after the address, followed by the attribute-value pair fielded=true or fielded=false, to specify whether the address consists of a set of fields or a single string.

The example Solr document above might also include an address that is defined as a set of fields:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">houseNumber<3302>road<Grand Av.>city<West Louisville>state<KY>~fielded=true</field>

The address field can also consist of a single string, and the above example of a Solr document would look like this:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">3302 Grand Av., West Louisville, KY~fielded=false</field>
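
The fielded address format described in this section can be sketched in Python as follows. This is a minimal illustration, not part of the Match SDK: the helper names are hypothetical, and because the text does not list which characters count as special, the sketch assumes that %, <, >, and ~ are encoded, using uppercase hex digits. The last lines show that when a document is built as XML, a standard XML library escapes the angle brackets so that the update message stays well-formed.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the source does not say which characters are "special".
# This sketch encodes the characters that would collide with the syntax itself.
SPECIAL_CHARS = "%<>~"


def encode_value(value: str) -> str:
    """Replace each assumed-special character with % plus two hex digits."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in SPECIAL_CHARS else ch for ch in value
    )


def fielded_address(fields: dict) -> str:
    """Build name<value> pairs for the non-empty fields, then append ~fielded=true."""
    body = "".join(
        "{}<{}>".format(name, encode_value(value))
        for name, value in fields.items()
        if value
    )
    return body + "~fielded=true"


address = fielded_address(
    {"houseNumber": "3302", "road": "Grand Av.", "city": "West Louisville", "state": "KY"}
)
print(address)
# houseNumber<3302>road<Grand Av.>city<West Louisville>state<KY>~fielded=true

# ElementTree escapes < and > in element text when it serializes the field.
field = ET.Element("field", {"name": "address"})
field.text = address
print(ET.tostring(field, encoding="unicode"))
```

Fields with empty values are skipped, which matches the requirement that the sequence contain only non-empty fields.
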

Date fields

Documents or records may also contain one or more dates, and you can specify a format for each. To specify a date format, place a tilde (~) after the date in the string that defines the content of the date field, followed by a format attribute-value pair, for example format=dd-MM-yyyy or format=MMdd-yyyy. Match uses the format to parse the date string.

An example including a date which is defined without specifying a format:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">houseNumber<3302>road<Grand Av.>city<West Louisville>state<KY>~fielded=true</field>

The date field can also consist of a date string with a specified format:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">01/07/42~format=dd/MM/yy</field>
<field name="address">3302 Grand Av., West Louisville, KY~fielded=false</field>
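
A date value with a format attribute can be assembled as in the following Python sketch. The helper date_field_value is hypothetical, and the translation from the pattern letters to Python strftime codes is an assumption that covers only the four tokens yyyy, yy, MM, and dd.

```python
import re
from datetime import date

# ASSUMPTION: only these four pattern tokens are handled; any other text in
# the pattern (such as / or -) is passed through unchanged.
_TOKENS = {"yyyy": "%Y", "yy": "%y", "MM": "%m", "dd": "%d"}
_TOKEN_RE = re.compile("yyyy|yy|MM|dd")  # yyyy is listed first so it wins over yy


def date_field_value(value: date, pattern: str) -> str:
    """Render the date with the pattern and append the ~format attribute."""
    strftime_pattern = _TOKEN_RE.sub(lambda m: _TOKENS[m.group(0)], pattern)
    return "{}~format={}".format(value.strftime(strftime_pattern), pattern)


print(date_field_value(date(1942, 7, 1), "dd/MM/yy"))    # 01/07/42~format=dd/MM/yy
print(date_field_value(date(1942, 7, 1), "dd-MM-yyyy"))  # 01-07-1942~format=dd-MM-yyyy
```
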

Fielded names

You can process names with data fields. Use "|" to separate the fields. For example, "Mr|Jon|Q|Smith" has four fields. You can define names with empty fields: in "|Jon|Q|Smith", the first field is ""; in "Mr|Jon||Smith", the third field is "".

You have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.
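
The fielded name syntax can be sketched in Python as follows. The helper fielded_name is hypothetical; it treats None as an unknown field and an empty string as an empty field.

```python
UNKNOWN = "*?*"  # marker for an unknown field, as described above


def fielded_name(fields) -> str:
    """Join name fields with "|"; None marks an unknown field, "" an empty one."""
    return "|".join(UNKNOWN if f is None else f for f in fields)


print(fielded_name(["Mr", "Jon", "Q", "Smith"]))   # Mr|Jon|Q|Smith
print(fielded_name(["", "Jon", "Q", "Smith"]))     # |Jon|Q|Smith
print(fielded_name(["Mr", "Jon", None, "Smith"]))  # Mr|Jon|*?*|Smith
```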

Specifying name attributes

In addition to the name itself, a Match Name object may contain attributes that you can specify when you index the name or perform a query.

Name Attribute | Example | Description
language | language="kor" | ISO 639-3 code for the language of use in which the name appears.
hintLanguage | hintLanguage="jpn" | Hint (ISO 639-3 code) for the language of use. Only used if language is not specified or "xxx". The hint is used if compatible with the script. Otherwise, Match makes its own language guess.
languageOfOrigin | languageOfOrigin="eng" | ISO 639-3 code for the name's language of origin.
script | script="Hani" | ISO 15924 code for the script in which the name appears.
entityType | entityType="PERSON" | The entity type, such as "PERSON", "LOCATION" or "ORGANIZATION".
uid | uid="4072" | Unique identifier for the name.
gender | gender="male" | Explicit gender for the name, such as "male", "female", or "nonbinary".

Note

The entityType field in the query must match the entityType field in the indexed name. If the query does not specify an entity type, the indexed name must also not specify an entity type.

In the string that defines the content of a name field, place a tilde (~) after the name, followed by a comma-delimited list of attribute-value pairs.

Examples:

When posting a Solr document:

<field name="primaryName">Muhammad Ali~language=eng,languageOfOrigin=ara,entityType=PERSON</field>

In a query:

primaryName:"Muhamid Ali~language=eng,languageOfOrigin=ara,entityType=PERSON"

In a pairwise name match:

&rq={!rniRerank reRankQuery=$rrq} 
&rrq={!func}rniMatch(primaryName,"Muhammad Ali~language=eng,languageOfOrigin=ara,entityType=PERSON")

Attributes in the bt Namespace. You can include bt attributes as query parameters. These attributes are then used in both the base query and the reRank pairwise match query. For example: &bt.language=jpn &bt.script=Kana

Setting Default Attribute Values. You can include name attributes in a field or field type definition as defaults that can be overridden by individual name entries. For example:

<field name="primaryKoreanName" type="bt_rni_name" indexed="true" stored="true" multiValued="false"
       language="kor" script="Hang" entityType="PERSON"/>

Then you only need to include these attributes in name entries when you want to override the defaults.
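
The tilde-and-attribute syntax for names can be generated with a small helper such as the following Python sketch. The function name_with_attributes is hypothetical and performs no validation of attribute names or values.

```python
def name_with_attributes(name: str, **attributes: str) -> str:
    """Append ~key=value,key=value to a name; return the bare name if none are given."""
    if not attributes:
        return name
    pairs = ",".join("{}={}".format(k, v) for k, v in attributes.items())
    return "{}~{}".format(name, pairs)


value = name_with_attributes(
    "Muhammad Ali", language="eng", languageOfOrigin="ara", entityType="PERSON"
)
print(value)  # Muhammad Ali~language=eng,languageOfOrigin=ara,entityType=PERSON

# The same string can be placed inside a field query.
print('primaryName:"{}"'.format(value))
```

When the field definition already supplies defaults, the helper can be called with only the attributes that differ from those defaults.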

Query enhancements

It is often necessary to query on fields other than name fields, such as date of birth and address. The plugin enables the seamless integration of Match into your Solr queries. To apply Boolean logic to queries, combine multiple fields with Boolean operators. The plugin supports all Boolean operators supported by the standard Lucene query parser (AND, OR, NOT, +, -). The OR operator is the default conjunction operator; if there is no Boolean operator between two terms (fields), the OR operator is used.

Example of a query with name and date fields:

primaryName:"Chuy Lopez A Deyas~entityType=PERSON" AND dateOfBirth:"1960-09-30"

Example of a query including an address field:

primaryName:"Chuy Lopez A Deyas~entityType=PERSON" AND residence:"road<Avenida Const. Pedro L Zavala 1957>
 house<Colonia Libertad>city<Culiacan>state<Sinaloa>postCode<80180>country<Mexico>~fielded=true"

You can include name fields and other fields in your base query in conjunction with a Match Solr reRank query and a custom valueSourceParser. The base query identifies candidate documents. The reRank query sends the top N candidates to the rniMatch valueSourceParser for pairwise matching. You can combine multiple fields in function queries to generate a relevancy score from those fields. The plugin supports all the functions available for function queries in Solr.

The following pairwise name match returns the maximum of the scores for primaryName and contactName:

&rq={!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
  &rrq={!func}max(rniMatch(primaryName, "Chuy Lopez A Deyas"), rniMatch(contactName, "Chuy Lopez"))

The following pairwise address match returns the maximum of the scores for primaryAddress and residence:

&rq={!rniAddrRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
  &rrq={!func}max(rniAddrMatch(primaryAddress, "road<Calle Lago Cuitzeo 1394>house<Colonia Las Quintas>
 city<Culiacan>state<Sinaloa>postCode<80060>country<Mexico>~fielded=true"),
rniAddrMatch(residence, "road<Avenida Const. Pedro L Zavala 1957>house<Colonia Libertad>
 city<Culiacan>state<Sinaloa>postCode<80180>country<Mexico>~fielded=true"))

The following pairwise date match returns the maximum of the scores for dateOfBirth and dob:

&rq={!rniDateRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
&rrq={!func}max(rniDateMatch(dateOfBirth, "01/07/42~format=dd/MM/yy"), rniDateMatch(dob, "1/7/1940"))

You can combine the score of multiple Match fields of the same type where each field can be given a weight to reflect its importance in the overall matching logic.

For example, a pairwise name match can return the combined score for aka and primaryName, where aka has a weight of 0.3 and the remaining 0.7 is assigned to the primaryName field:

&rq={!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0} 
  &rrq={!func}sum(linear(rniMatch(aka, "Jesus Alfonso Diaz"), 0.3, 0),
  linear(rniMatch(primaryName, "Jesus Diaz"), 0.7, 0))
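
The arithmetic behind the weighted query above can be worked through in a few lines of Python. Solr's linear(x, m, c) function evaluates m * x + c, so the query computes 0.3 times the aka score plus 0.7 times the primaryName score. The two input scores below are hypothetical and are not real Match output.

```python
def linear(x: float, m: float, c: float) -> float:
    """Mirror Solr's linear(x, m, c) function query: m * x + c."""
    return m * x + c


# Hypothetical pairwise scores for one document; invented for illustration.
aka_score = 0.90
primary_name_score = 0.60

combined = linear(aka_score, 0.3, 0) + linear(primary_name_score, 0.7, 0)
print(round(combined, 2))  # 0.69
```

Because the weights sum to 1, the combined score stays in the same range as the individual scores.
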

Setting reRank parameters

The RNIReRankQParserPlugin provides parameters that you can set to customize your reRank query:

  • reRankDocs (an integer) specifies the maximum number of documents from the base query to pass to the Match pairwise match.

    Use this parameter to limit the number of compute-intensive name matches that need to be performed, thus decreasing maximum query latency.

  • reRankMode ("add" or "replace") specifies whether the Match score is added to the Solr score (the default) or replaces the Solr score.

  • reRankWeight (a float) specifies the weighting of the maximum Match pairwise match score when it is combined with the Solr score. This parameter is ignored if reRankMode is set to "replace".

    The Match score, multiplied by the reRankWeight (the Solr default is 2.0), is added to the Solr score to provide the document score that is used to determine the ordering of the documents in the result set. Use this parameter to influence the role that the Match pairwise match plays in the ordering of the result set. If you want to prioritize the Match score and de-emphasize the Solr score, specify a large reRankWeight.

  • reRankDocsAllowance (a float from 0 to 1) controls the general proportion of documents from the base query to pass to the Match pairwise match. This is used at query time to dynamically determine the number of documents to rescore based on the commonality of the query name in the index. Setting this to 1.0 will ensure that the maximum number of documents (reRankDocs) are always rescored.

    Use this parameter to limit the number of compute-intensive name matches that need to be performed, thus decreasing query latency.

  • scoreToRerankRestriction (a float from 0 to 1) influences the minimum similarity score, calculated based on the results of the base query, that documents returned by the base query must have in order to be passed to the Match pairwise match for rescoring.

  • reRankFilter (a Solr query) further restricts which results from the base query are passed to the Match pairwise match.
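
The way reRankMode and reRankWeight determine the document score can be sketched as follows. This Python function only restates the rule given in the list above; it is not the plugin's code, and the input scores are hypothetical.

```python
def document_score(solr_score: float, match_score: float,
                   mode: str = "add", weight: float = 2.0) -> float:
    """Combine the scores as described for reRankMode and reRankWeight."""
    if mode == "replace":
        return match_score  # reRankWeight is ignored in replace mode
    if mode == "add":
        return solr_score + weight * match_score
    raise ValueError("reRankMode must be 'add' or 'replace'")


# Hypothetical scores, invented for illustration.
print(document_score(1.5, 0.8))                  # 1.5 + 2.0 * 0.8
print(document_score(1.5, 0.8, weight=3.0))      # 1.5 + 3.0 * 0.8
print(document_score(1.5, 0.8, mode="replace"))  # 0.8
```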

In the following example, pairwise matching is performed on the top 200 names returned by the base query, and the Match score is multiplied by 3 before it is added to the Solr score.

q=primaryName:"Lopez Diaz"
&fl=primaryName,aka,score
&rq={!rniRerank reRankQuery=$rrq reRankDocs=200 reRankWeight=3}
&rrq={!func}rniMatch(primaryName, "Lopez Diaz")
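
When these parameters are sent over HTTP, the braces, quotes, spaces, and dollar sign need URL encoding. The following Python sketch builds such a request URL with the standard library. The host, port, core name (collection1), and the /select handler are assumptions about a default local installation; adjust them for your deployment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# ASSUMPTIONS: a local Solr on the default port, a core named collection1,
# and the standard /select request handler.
BASE_URL = "http://localhost:8983/solr/collection1/select"

params = {
    "q": 'primaryName:"Lopez Diaz"',
    "fl": "primaryName,aka,score",
    "rq": "{!rniRerank reRankQuery=$rrq reRankDocs=200 reRankWeight=3}",
    "rrq": '{!func}rniMatch(primaryName, "Lopez Diaz")',
}

# urlencode percent-encodes braces, quotes, and $, and turns spaces into +.
url = BASE_URL + "?" + urlencode(params)
print(url)

# Round-trip check: decoding the query string recovers the original parameters.
decoded = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(decoded == params)  # True
```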

Example with Solr Admin

This example walks you through the steps for using the Solr 9 Admin example to perform queries.

Basic Procedure
  1. Download and expand Solr 9.6.0.

  2. Start the Solr webserver.

    You can point it at a Solr core included in the Match package that contains the OFAC list already indexed. From the solr-9.6.0 directory, run the following:

    bin/solr -f -s $BT_ROOT/rlpnc/data/rnm/sample/ofac_solr_home -a \
    "-Dbt.root=$BT_ROOT -Djava.security.manager=allow" -m 2g

    For a Windows environment, you can use the following PowerShell command:

    Start-Process -FilePath bin\solr -ArgumentList "-f -s `"$Env:BT_ROOT\rlpnc\data\rnm\sample\ofac_solr_home`" -m 4g -a `"-Dbt.root=$Env:BT_ROOT`"" -NoNewWindow

    Note

    The BT_ROOT environment variable must be set. For more information, see Installing Match.

  3. Use a Web browser to navigate to http://localhost:8983/solr/#/collection1/query. This form provides the full interface for submitting queries in Solr Admin.

  4. Submit a Solr Query

    1. Fill in the q (query) textbox with a query that includes a name string and a date-of-birth range starting at 9/30/1960:

      name:"Chuy Lopez A Deyas~entityType=PERSON" AND dateOfBirth:[1960-09-30T00:00:00Z TO *]
    2. Fill in the fl (fields to return) textbox:

      name,aka,dateOfBirth,address,nationality,score
    3. Set raw query parameters to define the reRankQuery

      &rq={!rniRerank reRankQuery=$rrq reRankMode=replace
       reRankWeight=1.0} &rrq={!func}rniMatch(name, "Chuy Lopez A Deyas~entityType=PERSON")
    4. Click Execute Query.

Solr Admin displays a response.

For this query, Solr returns the appropriate Diaz document.

{
  "responseHeader": {
    "status": 0,
    "QTime": 387,
    "params": {
      "rrq": "{!func}rniMatch(name, \"Chuy Lopez A Deyas~entityType=PERSON\")",
      "q": "name:\"Chuy Lopez A Deyas~entityType=PERSON\" AND dateOfBirth:[1960-09-30T00:00:00Z TO *]",
      "fl": "name,aka,dateOfBirth,address,nationality,score",  
      "_": "1631115490163",   
      "rq": "{!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0} "
    }
  },
  "response": {
    "numFound": 145,
    "start": 0,
    "maxScore":0.6820811,
    "numFoundExact":true,
    "docs": [
      {
        "name": "Jesus Alfonso LOPEZ DIAZ~uid=10353,entityType=PERSON",
        "address": [
          "c/o ESTABLO PUERTO RICO S.A. DE C.V.\nCuliacan Sinaloa\nMexico",
          "Avenida Const. Pedro L Zavala 1957\nColonia Libertad\nCuliacan Sinaloa 80180\nMexico"
        ],
        "nationality": [
          "Mexico"
        ],
        "dateOfBirth": [
          "1962-09-30T00:00:00Z"
        ],
        "score": 0.6820811
      },
      ...
    ]
  }
}
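
The name value in the response carries its attributes after the tilde. A client can separate the two as in the following Python sketch. The helper split_name is hypothetical and assumes that the name contains no tilde and that attribute values contain no commas.

```python
def split_name(value: str):
    """Split a stored name value into the name and a dict of its attributes."""
    name, sep, attr_text = value.partition("~")
    attributes = {}
    if sep:
        for pair in attr_text.split(","):
            key, _, val = pair.partition("=")
            attributes[key] = val
    return name, attributes


name, attrs = split_name("Jesus Alfonso LOPEZ DIAZ~uid=10353,entityType=PERSON")
print(name)   # Jesus Alfonso LOPEZ DIAZ
print(attrs)  # {'uid': '10353', 'entityType': 'PERSON'}
```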

Example using the solrj API

The Match SDK ships with an example that illustrates the use of the org.apache.solr.client.solrj API to integrate the Match Solr plugin into a Solr application. See MatchSolrjSample. This sample also illustrates a procedure for posting Solr documents from an XML file.

The basic steps are as follows:

  1. Add bt-rni-<solr-version>-plugin.jar (distributed in rlpnc/data/rnm/sample/solr_shared_lib) to the classpath.

  2. Set solr.solr.home to a Solr home directory that contains a collection with a modified schema.xml and solrconfig.xml as described in previous sections.

  3. Instantiate a SolrClient (named SolrServer in SolrJ releases before Solr 5) and use it to add documents to a Solr index. The documents should contain one or more name fields along with any other fields of interest. Name, address, and date fields may be multivalued.

  4. Define a Solr query that involves name fields and other fields of interest, and that reranks the documents according to Match's pairwise name match score.

  5. Run the query and examine the documents that are returned.

For convenience utilities for working with Match names in a Solrj environment, see the Javadoc for com.basistech.rni.solr.index.

Translating names

Name Translator (RNT) translates names from languages written in non-Latin scripts, such as Arabic and Chinese. See Supported languages of origin for the complete list of supported languages and scripts. RNT supports multiple transliteration standards for translating from non-Latin scripts to English.

Text domains

Name Translator translates a name from one text domain to another. A text domain is specified by three parameters:

  • Language (ISO 639)

    The language of the document in which the name is found.

  • Writing script (ISO 15924)

    The script used to represent the name, such as the Latin alphabet, Arabic script, or Chinese Han characters.

  • Transliteration scheme

    The transliteration system in which the name is represented. If the name is in its native script, the transliteration scheme is native.

The source domain is the text domain of the document in which the name is found. The target domain is the text domain to which the name is to be translated.

Supported translation domains provides a list of supported source and target domains.
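
A text domain can be pictured as a simple three-part value. The following Python sketch is only an illustration of that structure. TextDomain is a hypothetical class, not an RNT type, and the scheme strings are taken from the prose in this section rather than from the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextDomain:
    """Illustrative container for the three text domain parameters."""
    language: str  # ISO 639 language code, for example "ara"
    script: str    # ISO 15924 script code, for example "Arab"
    scheme: str    # transliteration scheme; "native" for a name in its native script


# Source and target domains for translating an Arabic name to English.
source = TextDomain(language="ara", script="Arab", scheme="native")
target = TextDomain(language="eng", script="Latn", scheme="IC")
print(source)
print(target)
```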

Types of translation

The type of translation depends on the characteristics of the source and target domains and the language of origin of the name to be translated. RNT supports the following types of translations:

Translation of a person name to English

How the name is translated depends on whether the language of origin of the name is the source language.

  • If the language of origin of the person name is the same as the source language, the name is translated according to the specified target transliteration scheme. For example, a Japanese name that appears in Japanese text.

  • If the language of origin is not the source language, the name is translated to its conventional English form. For example, a non-Japanese name that appears in Japanese.

RNT supports the following translations of names to their conventional English representation:

  • non-Arabic names that appear in the Arabic language

  • non-Chinese names that appear in the Chinese language

  • non-Hebrew names that appear in the Hebrew language

  • non-Japanese names that appear in the Japanese language

  • non-Korean names that appear in the Korean language

  • non-Russian names that appear in the Russian language

Use the languageOfOrigin Name field to inform RNT that the language of origin is not the language of use in which the name appears.

  • If the language of origin is Unknown (the default), the language model may classify the name as foreign (for Japanese, the script must be Katakana).

  • If the language of use is Japanese, the script is Kanji, and the language of origin is Chinese or Korean, RNT attempts to translate the name, using Pinyin for Chinese, and Revised Romanization of Korean for Korean.

  • If the language of use is Chinese and the language of origin is anything other than Chinese, RNT attempts to translate the name to its standard English representation.

  • If the language of use is Korean, the script is Hangul, and the language of origin is any language other than Korean, RNT attempts to translate the name to its standard English representation.

  • For other languages, RNT uses the specified target transliteration scheme to transliterate the name to Latin script, regardless of whether or not the name is etymologically native to the respective source language.

Example - Arabic:

Source domain: Arabic language, Arabic script, native transliteration scheme.

Target domain: English language, Latin script, IC transliteration scheme.

The translation of جورج بوش is George Bush. Note: The IC transliteration is Jwrj Bwsh.

The translation of صفية طالب السهيل (an Arabic name) is the IC transliteration: Safiyyah Talib al-Suhayl.

Example - Pashto with IC transliteration scheme:

For Pashto, if you are using the IC transliteration scheme and the language of origin is Afghan Persian, RNT provides special handling of two short vowels, using 'e' and 'o' in place of 'i' and 'u', as designated in the IC Pashto Standardized Transliteration System for Personal Names.

Source domain: Pashto language, Arabic script, native transliteration scheme.

Target domain: English language, Latin script, IC transliteration scheme.

The standard translation of اسحاق is Ishaq. If the language of origin is Afghan Persian, the translation is Eshaq.

Example - Japanese, Katakana:

Source domain: Japanese language, Katakana script, native transliteration scheme.

Target domain: English language, Latin script, Hebon transliteration scheme.

The translation of ウィリアム・シェイクスピアー is William Shakespeare. Note: The Hebon transliteration is Iriamu Shieikusupiaa.

Example - Japanese, Kanji:

Source domain: Japanese language, Kanji script, native transliteration scheme.

Target domain: English language, Latin script, Hebon transliteration scheme.

With Chinese as the language of origin, the translation (Pinyin transliteration) of 温家宝 is Wen Jiabao. Note: The Hebon transliteration of 温家宝 is On Kahou.

Example - Russian:

Source domain: Russian language, Cyrillic script, native transliteration scheme.

Target domain: English language, Latin script, BGN transliteration scheme.

The translation of Маргарет Этвуд is Margaret Atwood. Note: The BGN transliteration is Margaret Etvud.

The translation of Алекса́ндр Солжени́цын (a Russian name) is the BGN transliteration: Aleksándr Solzhenítsyn.

Example - Thai:

Source domain: Thai language, Thai script, native transliteration scheme.

Target domain: English language, Latin script, ISO11940_2_2007 transliteration scheme.

The translation of นายก รัฐมนตรี (a Thai name) is the ISO11940_2_2007 transliteration: Nayok Ratthamontri.

Example - Greek:

Source domain: Greek language, Greek script, native transliteration scheme.

Target domain: English language, Latin script, ISO843_1997 transliteration scheme.

The translation of Γεώργιος Αθανασιάδης-Νόβας (a Greek name) is the ISO843_1997 transliteration: Geōrgios Athanasiadīs-Novas.

Example - Hebrew:

Source domain: Hebrew language, Hebrew script, English language of origin, native transliteration scheme.

Target domain: English language, Latin script, ISO259_2_1994 transliteration scheme.

The translation of ברברה סטרייסנד is Barbara Streisand. Note: The ISO259_2_1994 transliteration is Brbrah Sṭriysnd.

Note that the translation to Barbara Streisand will be returned only if the user specifies the language of origin as English.

Translation from Native script to Latin Script

This translation is used when the source script and transliteration scheme are native, the target script is Latin, the target transliteration scheme is anything other than native, and the language of origin of the name is the source language.

Examples:

Source domain: Arabic language, Arabic script, native transliteration.

Target domain: English language, Latin script, IC transliteration.

The translation of صفية طالب السهيل is Safiyyah Talib al-Suhayl.

Reverse transliterations from Latin script to native script

Some transliteration schemes provide enough information to enable reverse transliteration, going from English in Latin script back to a native script.

Examples:

Source domain: English language, Latin script, Basis transliteration.

Target domain: Arabic language, Arabic script, native transliteration.

The translation of naayif abuu sharkh is نَايِف أَبُو شَرْخ.

Source domain: English language, Latin script, Basis transliteration.

Target domain: Russian language, Cyrillic script, native transliteration.

The translation of Dmitry Medvedev is Дмитрий Медведев.

Standardization of Arabic-origin names in English

This translation takes a name in English that is of Arabic origin and translates the Arabic components according to the specified transliteration scheme.

Example:

Source domain: English language, Latin script, native transliteration.

Target domain: English language, Latin script, IC transliteration.

The IC standardization of Moustephah Ehmed ben Samire is Mustafa Ahmad Bin-Samir.

Orthographic completion

This is available if the source and target languages are Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu, the source and target scripts are Arabic or Hebrew, and the source and target transliteration schemes are native.

  • Language: Source and target language is Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu.

  • Script: Source and target script is Arabic or Hebrew script.

  • Transliteration: Source and target transliteration scheme is native.

In conventional Arabic and Hebrew script, short vowels and other diacritics are not included. In orthographic completion, the translator attempts to vocalize the names by adding the short vowels and other diacritics that don't appear in conventional Arabic and Hebrew script.

You can also perform orthographic completion as part of the translation process. See Translation options.

Segmentation

Arabic, Chinese, Japanese, and Korean names are often unsegmented; that is, there are no spaces between the words in the name. The translator attempts to segment these names by adding spaces between the words.

Segmentation is available when the source and target languages are Arabic, Chinese, Japanese, or Korean and the source and target transliteration schemes are native.

You can also perform segmentation as part of the translation process. See Translation options.

Variant Latin-script representations of names in non-Latin script

  • Language: Source and target language is Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu.

  • Script: Source script is Arabic script; the target script is Latin script.

  • Transliteration: Source transliteration scheme is native; the target transliteration scheme is folk.

The variants of a multi-word name include the cross product of the variants of each word. If, for example, each word in a two-word name has 10 variants, the name has 100 variants. Accordingly, it is a good idea to translate one word at a time. A two-word translation must produce 100 variants to provide the same information as two one-word translations producing 10 variants each.

Use the com.basistech.rnt.ITranslator setMaximumResults method to control the number of variants that are returned.

Example:

Source domain: Arabic language, Arabic script, native transliteration.

Target domain: Arabic language, Latin script, folk transliteration.

نبيل شعث contains two words. Ten variants of نبيل are nabil, nabile, nabille, nabeel, nabiyl, nabiyle, nabiylle, nebil, nebile, nebille. Ten variants of شعث are sha`ath, sha'ath, shaath, sha`th, sha'th, shath, cha`ath, cha'ath, chaath, cha`th. These translations use the orthographic completion option, which is turned on by default.
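The cross-product growth described above can be sketched in a few lines of Java. This is a minimal illustration, not RNT code: the class VariantCrossProduct and its combine method are hypothetical names, and the maxResults cap only mimics the idea of limiting the number of returned variants.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the RNT API.
final class VariantCrossProduct {

    /**
     * Builds every full-name variant by combining one variant per word,
     * then keeps at most maxResults of them.
     */
    static List<String> combine(List<List<String>> perWordVariants, int maxResults) {
        List<String> names = new ArrayList<>();
        if (perWordVariants.isEmpty() || maxResults <= 0) {
            return names;
        }
        names.add("");
        for (List<String> variants : perWordVariants) {
            List<String> next = new ArrayList<>();
            for (String prefix : names) {
                for (String variant : variants) {
                    // Join words with a single space; the seed prefix is empty.
                    next.add(prefix.isEmpty() ? variant : prefix + " " + variant);
                }
            }
            names = next;
        }
        if (names.size() > maxResults) {
            return new ArrayList<>(names.subList(0, maxResults));
        }
        return names;
    }
}
```

With two words of 10 variants each, combine produces 100 strings, whereas translating the words separately yields 20 strings that carry the same information.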

Automated translation

With automated translation, the client provides one or more names, input and output text domains, and types of translation desired. For each name, the application generates a list of translations and associated confidence scores.

Automated usage model for performing RNT translations:

  1. Set up your environment.

    You must define the directory in which you installed Match ($BT_ROOT), and instantiate an Environment object. See Handling the Runtime Environment.

  2. Create a Translator Factory and use it to instantiate a Translator.

    A given translator can perform translations from one source text domain to one target text domain. The Java API includes support for creating a Translator wrapper that can handle multiple source and target domains.

  3. Set translation options (or use the default option settings).

    For a listing of the source and target language domains to which each of these translation options applies, see Supported Translation Option Domains.

  4. Use the Translator to translate names from the source domain to the target domain.

  5. Handle the list of one or more translation results that the Translator generates for each translation. Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest. The sum of the confidence scores for all the results that the Translator can generate is less than or equal to 1.0.

  6. Release resources, such as the Translator and Environment.

    translator.close();
    environment.close();

Multithreading

RNT translators are multithreadable.

Using the API

Java Packages: The RNT classes are in com.basistech.rnt and com.basistech.rnt.options (translation options). Utility classes that RNT uses are in com.basistech.util.

Note: Unqualified class names that appear in this section are in the com.basistech.rnt and com.basistech.rnt.options packages.

For detailed information about the API, see the Java API Reference.

Sample

For a sample Java application that translates a name, see AutomatedTranslationSample.

Creating a translator

RNT provides a factory class for creating translators. The factory is responsible for instantiating the correct RNT internal implementation class, which may vary depending on the source and target text domains you specify. For a table that maps input domains to output domains, see Supported Translation Domains.

The following fragment uses the factory to create a translator for translating names from Arabic documents in Arabic script to their standard English form in Latin script, using the IC transliteration scheme:

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_translator.java

When you are done using the translator, close it:

translator.close();

To create a wrapper object that packages a number of Translators, use RuleSetTranslator and define a list of TranslationRules. Each TranslationRule specifies the transliteration scheme for the specified language domain and entity type (NEConstants.NE_TYPE_NONE for all entity types).

Translation options

The translation options are defined in the package com.basistech.rnt.options.

  • Orthographic Completion. Class: CompleteOrthographyOption

    For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to perform orthographic completion (add short vowels and other diacritics) prior to translation. The translator infers an orthographic completion if it cannot locate the name in its Arabic dictionary. For Pashto or Urdu, the translator omits orthographic completion for any named elements it cannot locate in the appropriate dictionary. The default setting for this option is true.

    Given the lack of clear diacritization standards for Iranian Persian, Afghan Persian, Pashto and Urdu, the orthographic completion for these languages reflects BasisTech standards to assist the translation process, and is not intended for external use.

    Suppose you are processing نايف أبو شرخ. As is the case in conventional Arabic, this text is not vocalized. For some transliteration schemes (such as IC) the transliteration of unvocalized Arabic is undefined. The translator produces NAyf 'Bw Shrkh. With the orthographic completion option, the Translator adds the missing vowels (giving نَايِف أَبُو شَرْخ) and produces the correct IC transliteration: Nayif Abu-Sharkh.

    Orthographic completion is performed for Hebrew names in Hebrew script, but it is not controlled by this option. It is always enabled.

    For supported languages and scripts, see Orthographic Completion.

  • Orthographic Minimization. Class: MinimizeOrthographyOption

    For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to devocalize (remove short vowel diacritics). The source domain and the target domain must be one of these languages, Arabic script, and Native transliteration. You can use this option to generate the Arabic script representation of names found in most media, such as news articles. The default setting for this option is false.

    For supported languages and scripts, see Orthographic Minimization.

  • Statistical Methods. Class: StatisticalMethodsOption

    For personal names in Arabic, Hebrew, Japanese, and Russian, use statistical methods to establish information that is not found in a dictionary. Statistical methods are used to do the following:

    • Classify unknown personal names as native or foreign.

    • (Arabic, Hebrew) Vocalize unknown personal names classified as native, or unknown personal names classified as foreign.

    • (Arabic, Hebrew, Japanese, Russian) Translate unknown personal names classified as foreign.

    For Arabic this option is set to true by default. If statistical methods are turned off, performance with Arabic input is faster, but for personal names not found in its dictionary, the translator can only mechanically transliterate the input.

    For Hebrew, Japanese, and Russian, statistical analysis is always performed.

    For supported languages and scripts, see Statistical Methods.

  • Performance Tradeoff. Class: PerformanceTradeoff

    For personal names in Arabic, Japanese, or Russian, you can control the tradeoff the translator makes between speed and correctness when it is performing statistical analysis. For Arabic, statistical methods must be turned on (the default). As mentioned above, statistical analysis is always performed for Japanese and Russian. Four settings are defined in com.basistech.rnt.options.TradeoffEnum:

    • FAST: Greatest speed; least correctness

    • NORMAL: Even tradeoff between speed and correctness (the default)

    • CAREFUL: More correctness; less speed

    • PRECISE: Greatest correctness; least speed

    For supported languages and scripts, see Performance Tradeoff.

  • Segmentation. Class: SegmentOption

    For Chinese, Japanese, or Korean names, you can set the option to segment unsegmented names. Unsegmented Thai names are always segmented; Thai segmentation is not controlled by this option.

    Suppose you are processing 胡錦濤. This name is not segmented and the Pinyin transliteration is hujintao. With the segmentation option, the Translator segments the name into 胡 and 錦濤, and produces the correct Pinyin transliteration: hu jintao.

    For Korean, the name may be in Hangul or Han script. For example, the following Hangul and Han representations of the same name are not segmented: 김정일 and 金正日. With the segmentation option, the Translator uses Hangul to segment either of these forms into 김 and 정일.

    By default, the Segmentation option is set to true.

    For supported languages and scripts, see Segmentation.

  • Normalization. Class: NormalizeOption

    The normalization option applies to Arabic, Chinese, and Japanese names.

    For Arabic native names, the normalizer applies a set of standardization rules. For example, the normalizer inserts a space in عبدالمجيد, producing the more standard representation: عبد المجيد (the IC transliteration is 'Abd-al-Majid).

    For Chinese, normalization converts any characters in the traditional Chinese variant to the simplified Chinese variant (the standard for China). For example, the normalizer converts to .

    For Japanese names, normalization converts Kanji variants (including old Kanji) to their standard form. For example, the normalizer converts to .

    By default, the Normalization option is set to true.

    For supported languages and scripts, see Normalization.

  • Pashto IC: Variant Spelling and Region. For Pashto, when applying the IC standard, these two options implement variations specified in the IC Pashto Standardized Transliteration System for Personal Names. For supported languages and scripts, see Variant Spelling and Region.

  • Korean Geography. For Korean, when applying the BGN standard, the standard that is actually used depends on the Korean Geography Option. For North Korea (the default) McKune-Reischauer is used. For South Korea, Revised Romanization of Korean is used. For supported languages and scripts, see Korean Geography.
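To make the Orthographic Minimization option in the list above more concrete, the following sketch devocalizes Arabic-script text by deleting combining marks. It is an assumption-laden illustration, not the RNT implementation: ArabicDevocalizer is a hypothetical class, and it strips the whole Unicode range U+064B to U+0652 (tanwin, fatha, damma, kasra, shadda, sukun), although the text above only states that short vowel diacritics are removed.

```java
// Illustrative only: a hypothetical helper, not part of the RNT API.
final class ArabicDevocalizer {

    /**
     * Removes Arabic combining marks in the range U+064B..U+0652.
     * Assumption: shadda and sukun are stripped along with the short vowels.
     */
    static String devocalize(String vocalized) {
        return vocalized.replaceAll("[\\u064B-\\u0652]", "");
    }
}
```

Applied to the vocalized form نَايِف أَبُو شَرْخ, this returns the conventional spelling نايف أبو شرخ used in the orthographic completion example.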

Performing a translation

You can set various parameters for the ITranslator, and you must instantiate an ITranslatable object with which the Translator performs the translation.

ITranslator Methods for Setting Translation Parameters

void setMaximumResults (int maxResults)

Sets the maximum number of candidate translations that RNT generates. If you are only interested in the best or most likely result, set this to 1.

void setMinimumConfidence (double confidence)

Sets the minimum confidence score that a candidate translation must have to be returned. Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest.

<T> void setOption (optionValue)

Each option is defined by a class. The options are defined in com.basistech.rnt.options.

By default, PerformanceTradeoff is set to TradeoffEnum.NORMAL, VariantSpellingOption is false, RegionOption is RegionEnum.DEFAULT (region unknown), and the other options are set to true. You can use this method to reset an option. For example:

setOption(new CompleteOrthographyOption(false));
setOption(new PerformanceTradeoff(TradeoffEnum.FAST));
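As a rough mental model of how a maximum result count and a minimum confidence could combine, consider the sketch below. It does not use the RNT classes: Candidate and CandidateSelection are hypothetical stand-ins, and the inclusive threshold and the descending sort are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical types, not part of the RNT API.
final class CandidateSelection {

    /** Stand-in for a translation result: a string and its confidence. */
    static final class Candidate {
        final String text;
        final double confidence;

        Candidate(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /**
     * Keeps candidates whose confidence is at least minConfidence (assumed
     * inclusive), orders them best first, and returns at most maxResults.
     */
    static List<Candidate> select(List<Candidate> candidates, int maxResults, double minConfidence) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.confidence >= minConfidence) {
                kept.add(c);
            }
        }
        kept.sort(Comparator.comparingDouble((Candidate c) -> c.confidence).reversed());
        int limit = Math.max(0, Math.min(maxResults, kept.size()));
        return new ArrayList<>(kept.subList(0, limit));
    }
}
```

Calling select with maxResults set to 1 corresponds to asking only for the single most likely result.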

RNT performs the translation on an ITranslatable object. An ITranslatable object contains several properties: data (the name), language, script, and entity type (person, location, organization, etc.). Language and script should match the language and script of the source text domain. Entity type may be unknown (com.basistech.util.NEConstants.NE_TYPE_NONE). The ITranslatable object may be extended to include additional information, such as geocoordinates for locations. For more information, see the Javadoc for the implementation of ITranslatable: com.basistech.rni.match.Name.

The following example translates an Arabic name: "صفية طالب السهيل".

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/translate.java

Inspecting translation results

The ITranslator translate() method returns a list of TranslationResult objects.

Each TranslationResult provides access to the translation, the confidence associated with that translation (a double from 0 to 1.0), and may provide additional information with an associated confidence score. The sum of the confidence of all the results returned by a translation is less than or equal to 1.0. The additional information may include orthographic completion (the diacritization of names in Arabic or Hebrew script), segmentation (of names in Chinese, Korean, Japanese, or Thai), and language of origin (for Arabic, Chinese, or Japanese Katakana script). By default, these options are set to true, in which case the Translator attempts to infer the additional information. You can turn off one or all of these options.

For names in Arabic script, orthographic completion means the addition of short-vowel markers and other diacritics that are absent in conventional Arabic script but required for accurate transliteration.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/inspect_translation_results.java

This translation returns a list containing one result.

Translation (IC Transliteration): Safiyyah Talib al-Suhayl

Translation Confidence: 1.0

Orthographic Completion (Diacritization): صَفِيَّة طَالِب اَلسُّهَيْل

Orthographic Completion Confidence: 1.0

Language of Origin: LanguageCode.ARABIC

Overriding name pair translations

You can create UTF-8 files that specify how names are to be translated. The filenames specify the language of the source and target domains, and may specify an entity type. The file entries specify the text of the source and target names, the script of the target domain (required if the target language may be written in multiple scripts), and may specify a confidence score for the translation. RNT applies Unicode NFD normalization to the name strings, and performs the translations specified in these files. RNT supports both fullname and token overrides.

Filenames. The filenames use ISO 639 three-letter codes to specify the language of the source domain and the language of the target domain. The filename may also specify an entity type.

fullnames_SRCLANG_TARGETLANG[_TYPE].txt
tokens_SRCLANG_TARGETLANG[_TYPE].txt

For example, fullnames_eng_zho_PERSON.txt would contain entries for translating English PERSON names to Chinese. fullnames_ara_eng.txt would contain entries for translating Arabic names of any or no entity type to English.
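The naming convention can be checked mechanically. The sketch below parses an override filename into its parts with a regular expression. OverrideFileName is a hypothetical helper, not an RNT class, and the pattern assumes lowercase three-letter language codes and an uppercase entity type, as in the examples above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of the RNT API.
final class OverrideFileName {
    // Assumed shape: (fullnames|tokens)_SRC_TGT[_TYPE].txt
    private static final Pattern NAME =
        Pattern.compile("(fullnames|tokens)_([a-z]{3})_([a-z]{3})(?:_([A-Z]+))?\\.txt");

    final String kind;           // "fullnames" or "tokens"
    final String sourceLanguage; // three-letter code, e.g. "ara"
    final String targetLanguage; // three-letter code, e.g. "eng"
    final String entityType;     // e.g. "PERSON", or null when absent

    private OverrideFileName(String kind, String src, String tgt, String type) {
        this.kind = kind;
        this.sourceLanguage = src;
        this.targetLanguage = tgt;
        this.entityType = type;
    }

    /** Returns the parsed parts, or null if the name does not follow the convention. */
    static OverrideFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new OverrideFileName(m.group(1), m.group(2), m.group(3), m.group(4));
    }
}
```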

Sample Fullname Override Files. fullnames_ara_eng_LOCATION.txt, fullnames_jpn_eng_LOCATION.txt, fullnames_rus_eng_LOCATION.txt contain entries respectively for translating LOCATION names from Arabic, Japanese, and Russian to English. Entity-specific override files are additive to non-entity type overrides. Fullname overrides only take effect if the entire contents of an entry in the first column matches the entire input name.

Sample Token Override Files. tokens_ara_eng_ORGANIZATION.txt, tokens_jpn_eng_ORGANIZATION.txt, tokens_rus_eng_ORGANIZATION.txt contain entries respectively for translating ORGANIZATION names from Arabic, Japanese, and Russian to English. Entity-specific override files are additive to non-entity type overrides.
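The whole-name matching rule for fullname overrides, combined with NFD normalization, can be pictured as a normalized map lookup. The sketch below is only a model of that behavior: FullnameOverrides is a hypothetical class, and normalizing only the source side of each entry is an assumption.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of the RNT API.
final class FullnameOverrides {
    private final Map<String, String> table = new HashMap<>();

    private static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Registers an override, keyed by the NFD-normalized source name. */
    void put(String sourceName, String targetName) {
        table.put(nfd(sourceName), targetName);
    }

    /**
     * Returns the target only when the entire input equals an entire source
     * entry after NFD normalization; a partial match returns null.
     */
    String lookup(String inputName) {
        return table.get(nfd(inputName));
    }
}
```

Because both sides are normalized, a precomposed character and its decomposed equivalent match the same entry, while an input that contains only part of a source entry does not match.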

File entries. Each row in the file, except for rows beginning with #, contains tab-delimited fields with source name, target name, target script (not required if the target language is only written in one script), and optional confidence score.

source name Tab target name[ Tab target_script] [Tab confidence_score]

The confidence score must be between 0 and 1.0. If it is not included, RNT sets the confidence score to 1.0.

The following entry in fullnames_eng_zho_PERSON.txt specifies that Ho Lide should be translated to 贺 利得 with a confidence score of 0.99 if the entity type for the source name is PERSON and the script for the target domain is Hans (simplified Chinese).

Ho Lide Tab 贺 利得 Tab Hans Tab 0.99

The translations you specify apply in one direction only, so the preceding entry has no influence on the translation of the Chinese 贺 利得 to English.

The following entry in fullnames_ara_eng.txt specifies that علي سعيد should be translated to 'Ali Sa'id with a confidence score of 1.0.

علي سعيد Tab 'Ali Sa'id
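A reader for this entry format might look like the following sketch. OverrideEntry is a hypothetical class, not an RNT type. Because both trailing fields are optional, the sketch assumes that a lone third field is a confidence score if it parses as a number and a target script otherwise; the source format does not state how RNT resolves this case.

```java
// Illustrative only: a hypothetical helper, not part of the RNT API.
final class OverrideEntry {
    final String source;
    final String target;
    final String targetScript; // null when omitted
    final Double confidence;   // null when omitted

    private OverrideEntry(String source, String target, String script, Double confidence) {
        this.source = source;
        this.target = target;
        this.targetScript = script;
        this.confidence = confidence;
    }

    /** Parses one tab-delimited row; returns null for blank rows and # comments. */
    static OverrideEntry parse(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] f = line.split("\t", -1);
        if (f.length < 2 || f.length > 4) {
            throw new IllegalArgumentException("Expected 2 to 4 tab-delimited fields");
        }
        String script = null;
        Double confidence = null;
        if (f.length == 4) {
            script = f[2];
            confidence = Double.valueOf(f[3]);
        } else if (f.length == 3) {
            // Assumption: a numeric third field is a confidence score.
            try {
                confidence = Double.valueOf(f[2]);
            } catch (NumberFormatException e) {
                script = f[2];
            }
        }
        if (confidence != null && !(confidence >= 0.0 && confidence <= 1.0)) {
            throw new IllegalArgumentException("Confidence must be between 0 and 1.0");
        }
        return new OverrideEntry(f[0], f[1], script, confidence);
    }
}
```

A missing score is kept as null rather than defaulted here, so that a later step can apply the default once all targets for a source name are known.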

You can include multiple entries with the same source name, in which case it translates to multiple target names. The sum of the confidence scores for the source name must be between 0 and 1.0.

If you do not include confidence scores, and a source name translates to multiple target names, RNT sets the confidence score for each pair to 1 divided by the number of targets. If, for example, a source name translates to two targets, the confidence score for each translation is 0.5. For fullname overrides, if a source name has multiple targets and only some have a specified confidence score, the remaining confidence (1 minus the total of the specified scores) is split evenly among the targets without a specified score.
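The default-score arithmetic can be written out as a small function. This is a sketch of the rule as described, not RNT code: OverrideConfidence and fillDefaults are hypothetical names, and a null list element stands for a target with no specified score.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of the RNT API.
final class OverrideConfidence {

    /**
     * Fills in missing scores for the targets of one source name.
     * Unspecified targets share (1 - sum of specified scores) evenly.
     */
    static List<Double> fillDefaults(List<Double> specified) {
        double total = 0.0;
        int missing = 0;
        for (Double c : specified) {
            if (c == null) {
                missing++;
            } else {
                total += c;
            }
        }
        if (total > 1.0 + 1e-9) {
            throw new IllegalArgumentException("Specified scores sum to more than 1.0");
        }
        double share = missing == 0 ? 0.0 : (1.0 - total) / missing;
        List<Double> filled = new ArrayList<>();
        for (Double c : specified) {
            filled.add(c == null ? share : c);
        }
        return filled;
    }
}
```

For two targets with no scores this yields 0.5 each; for scores of 0.6, unspecified, and unspecified, it yields 0.6, 0.2, and 0.2.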

Note

Multiple override entries for a single token are not supported. If a token has multiple override entries, the resulting translations contain only the first entry in the file; the others are ignored.

Note

Specifying a confidence score in token override files is only supported for the following [script/language/transliteration scheme] pairs:

  • source: [Khmr/khm/native] target: [Latn/eng/folk]

  • source: [Latn/eng/folk] target: [Cyrl/rus/native]

  • source: [Thai/tha/native] target: [Latn/eng/iso_11940_2]

  • source: [Thai/tha/native] target: [Latn/eng/iso11940_2_2007]

  • source: [Thai/tha/native] target: [Latn/eng/icu]

  • source: [Mymr/mya/native] target: [Latn/eng/folk]

  • source: [Mymr/mya/native] target: [Latn/eng/mlcts]

  • source: [Grek/ell/native] target: [Latn/eng/iso843_1997]

  • source: [Grek/ell/native] target: [Latn/eng/icu]

  • source: [Hebr/heb/native] target: [Latn/eng/folk]

  • source: [Hebr/heb/native] target: [Latn/eng/iso259_2_1994]

  • source: [Hebr/heb/native] target: [Latn/eng/icu]

  • source: [Hebr/heb/native] target: [Hebr/heb/native]

  • source: [Cyrl/rus/native] target: [Latn/eng/ic]

  • source: [Cyrl/rus/native] target: [Latn/eng/bgn]

  • source: [Cyrl/rus/native] target: [Latn/eng/und_bgn]

  • source: [Cyrl/rus/native] target: [Latn/eng/iso9_1995]

  • source: [Deva/hin/native] target: [Latn/eng/ic]

  • source: [Hani/yue/native] target: [Latn/eng/jyutping]

  • source: [Hans/yue/native] target: [Latn/eng/jyutping]

  • source: [Hant/yue/native] target: [Latn/eng/jyutping]

For all other pairs, specifying a confidence score in the override file will not affect the score of the final result.

Location of Override Files. Place your override files in the $BT_ROOT/rlpnc/data/rnt/ref/override directory.

Tip

To define your own override tables (character streams) in place of the tables in the default directory, see the HTML API documentation for the com.basistech.rnt.DictionaryService.replaceConfiguration method.

Interactive translation

RNT provides an API that you can use to build interactive applications to translate Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and Russian names from English or native script to standardized English. For supported transliteration schemes, see Supported Translation Domains.

The input is an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name in native script or in English that the user wants to translate. In the common case, the name is in English but may not conform to the desired transliteration standard.

  • For Arabic names, the application walks users through the procedure of generating the name in fully vocalized Arabic script (conventional Arabic does not include short vowels), and transliterating the name.

  • For Iranian Persian, Afghan Persian and Pashto names, the application walks the user through the process of generating the names in standard Arabic script (no short-vowel markers).

  • For Chinese, Korean, or Russian names, the application walks the user through the process of generating the name in Hani, Hangul, or Cyrillic.

To take full advantage of the resources that RNT Interactive provides, the user should have some familiarity with Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and/or Russian.

For detailed information about the API, see the com.basistech.rnt.assistant package in the Java API Reference.

Overview of an interactive application

An interactive application that walks the user through the process of translating an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name does the following:

  1. Sets the Basis root directory and instantiates a TranslationAssistant.

  2. Collects user input: a name to transliterate and a description of the desired output.

    The name is in English (a 'folk' transliteration) or in native script.

    The output is defined as one foreign language text domain (such as Arabic, Arab, NATIVE) and one or more English text domains (such as English, Latn, BGN). See Supported Translation Domains.

  3. Asks RNT to initialize an output object, which includes segmentation information about the input.

    The segmentation may not match the segmentation implied by the user input, and needs to be recirculated to RNT as part of each user interaction. For example, in Arabic the definite article or family/clan indicator 'al' may or may not be joined to the element that follows. In some cases, whether it should be joined is unambiguous. In other cases, either segmentation is possible. The selection you make for one element may undo the selection you have already made for another element and/or may change the options available for that other element.

  4. For each segment in the name, provides the user with a set of output alternatives. Each alternative includes the segment transliteration for the specified output text domains. For Arabic input only, the alternative may include a brief gloss and part of speech to assist the user in making a choice: either may be 'Name'; the gloss may contain a Buckwalter annotation, such as 's' for surname or 'f' for feminine name.

    When the user selects an alternative, the application passes it to the output object, and passes the current segmentation (which the selection may have changed) back to the input.

  5. Publishes the final output: for each output text domain, the combination of alternatives that the user has selected.

  6. When the user is done, closes interactive RNT to free resources.

Sample application

For a sample application that simulates the interactive process described above, see the source code in $BT_ROOT/rlpnc/samples/java/InteractiveTranslationSample.java.

As shipped (you can modify the sample), the input is an Arabic name: Safiyah Talib Al Suhail.

RNT divides this input into a number of segments and generates alternatives for each segment. RNT returns these alternatives in descending order of confidence (the best alternative is the first). For Arabic input only, as the following table shows, RNT provides additional information about each alternative to help the user make the best selection.

As the table also indicates, Al could be an individual component but, in the context of the word that follows, should be joined with Suhail.

Arabic (native) | English (IC) | Gloss | Part of Speech
صَفِيَّة | Safiyyah | pure/clear/sincere | Adj
صَافِيَة | Safiyah | net | Noun
صَافِيَة | Safiyah | pure/clear/sincere | Adjective
سَافِيَاء | Safiya' | fine dust | Noun
??? | Safiyah | original input |
تَلِيب | Talib | Talib s Libyan | Name
طَلِيب | Talib | Talib s | Name
طَالِب | Talib | requesting | Adj
تَعْلِيب | Ta'lib | canning | Noun
تَأْلِيب | Ta'lib | rallying/assembling | Noun
طَلَب | Talab | quest/search // request/demand | Noun
تَأَلُّب | Ta'allub | gathering/rally/assembly | Noun
طَلِيبَة | Talibah | Talibah s | Name
طُلَيْب | Tulayb | Tulayb | Name
تَعْلَب | Ta'lab | Ta'lab | Name
??? | Talib | original input |
اَل | Al | al- definite article | Definite Article
آل | Al | Al family/clan of | Name
??? | Al | original input |
اَلسُّهَيْل | al-Suhayl | Suheil // Canopus | Name
اَلصَّهِيل | al-Sahil | neighing | Noun
اَلسُّهَيْلَة | al-Suhaylah | Suhaylah f | Name
اَلسُّحَيْل | al-Suhayl | Suhayl | Name
اَلسُّهِيل | al-Suhil | Suhil | Name
??? | Suhail | original input |

The final output (choosing the first alternative for each segment) is as follows:

IC transliteration: Safiyyah Talib al-Suhayl

Native transliteration: صَفِيَّة تَلِيب اَلسُّهَيْل

Fully supported text domains for name matching

The following tables describe the domain pairings for which Match provides full support. All other domain pairings have limited support, as described in Language support parameters. A domain refers to the language and script of a piece of text. For example, one domain might be Latin (Latn) script in the English (eng) language.

Note

"Language" in this appendix refers to the language of use, the language of the document in which the name is found, which may not be the language of origin associated with the name. If the language of use is undetermined, use unknown (xxx).

Note

Prior to release 7.36.0, Match did not support any limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set Match to behave as it did previously, set allLanguageSupport to false.

Name matching within a language

Cross-language matches

Supported translation domains

This section specifies supported translation domain pairs and languages of origin for names to be translated with each domain pair.

Source and target translation domains

The following table identifies the translations (target text domains) that Name Translator supports for each source text domain.

For each source domain, the target domains for each language and script combination are presented in a single row.

Translation. If the languages in the source domain and target domain do not match, Name Translator translates the name to the target language.

Language of Origin. For Arabic, Chinese, Hebrew, Japanese, and Korean, if the target language is English and the language of origin does not match the source language, Name Translator attempts to translate the name to its standard English representation. If the language of origin is the source language, Name Translator transliterates the name with the specified transliteration scheme. If the language of origin for the name object is unspecified (UNKNOWN), Name Translator guesses the language of origin. For the supported languages of origin for each domain pair, see Supported Languages of Origin.

Orthographic Enhancement. When the source and target domains match (the transliteration schemes are native), the translations are orthographic enhancements in the native script (vocalization in Arabic and Hebrew script languages, segmentation in Chinese, Japanese, Korean, or Thai).

Name Variants. When the target transliteration scheme is folk, Name Translator generates a list of variant representations of the name in Latin script.
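The four cases above can be read as a decision procedure. The following Java sketch is one possible reading, not Name Translator's actual logic: the class, enum, and method names are hypothetical, the order in which the rules are checked is an assumption, and same-language conversions between different scripts are not characterized in the text, so they get a placeholder result.

```java
import java.util.Set;

// Hypothetical classifier, not part of the Name Translator API.
class TranslationKindSketch {

    enum Kind {
        ORTHOGRAPHIC_ENHANCEMENT, NAME_VARIANTS, GUESS_ORIGIN_FIRST,
        STANDARD_ENGLISH_FORM, TRANSLITERATION, TRANSLATION,
        SAME_LANGUAGE_CONVERSION // placeholder: not described in the text
    }

    // Source languages for which the language of origin affects an English target.
    static final Set<String> ORIGIN_SENSITIVE = Set.of("ara", "zho", "heb", "jpn", "kor");

    static Kind classify(String srcLang, String srcScript,
                         String tgtLang, String tgtScript,
                         String tgtScheme, String origin) {
        // Matching domains with a native scheme: orthographic enhancement.
        if (srcLang.equals(tgtLang) && srcScript.equals(tgtScript)
                && tgtScheme.equals("native")) {
            return Kind.ORTHOGRAPHIC_ENHANCEMENT;
        }
        // A folk target scheme yields a list of Latin-script variants.
        if (tgtScheme.equals("folk")) {
            return Kind.NAME_VARIANTS;
        }
        // Language-of-origin rule for the five listed languages with an English target.
        if (ORIGIN_SENSITIVE.contains(srcLang) && tgtLang.equals("eng")) {
            if (origin.equals("UNKNOWN")) {
                return Kind.GUESS_ORIGIN_FIRST;
            }
            return origin.equals(srcLang) ? Kind.TRANSLITERATION : Kind.STANDARD_ENGLISH_FORM;
        }
        // Remaining cases: differing languages mean translation.
        return srcLang.equals(tgtLang) ? Kind.SAME_LANGUAGE_CONVERSION : Kind.TRANSLATION;
    }

    public static void main(String[] args) {
        System.out.println(classify("ara", "Arab", "eng", "Latn", "bgn", "eng"));
    }
}
```

For example, an Arabic-script name whose language of origin is English is classified as STANDARD_ENGLISH_FORM, while the same request with an Arabic language of origin is classified as TRANSLITERATION.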

Note

When you specify the source script, you must also specify the full target domain, including the target script. Otherwise, only the language is used to guess the source domain.

Supported languages of origin

The following table displays the languages of origin that are supported for each source and target domain. For the full English names for the scripts, languages, and transliteration schemes, see the preceding table.

| Source Language | Source Script | Source Transliteration Scheme | Target Language | Target Script | Target Transliteration Scheme(s) | Language(s) of Origin |
| --- | --- | --- | --- | --- | --- | --- |
| ara | Arab | native | ara | Arab | native | ara, eng |
| ara | Arab | native | eng | Latn | fbis, bgn, basis, ic, satts, buckwalter, und_bgn, ext_ic, folk | ara, eng |
| ell | Grek | native | eng | Latn | iso843_1997, icu | eng, ell |
| eng | Latn | bgn | ara | Arab | native | ara |
| eng | Latn | bgn | pus | Arab | native | pus |
| eng | Latn | bgn | fas | Arab | native | fas |
| eng | Latn | bgn | urd | Arab | native | urd |
| eng | Latn | bgn | pes | Arab | native | pes |
| eng | Latn | bgn | eng | Latn | und_bgn | eng, ara, pus, urd, prs, pes, fas, rus, zho, kor |
| eng | Latn | bgn | kor | Hang | native | kor |
| eng | Latn | basis | ara | Arab | native | ara |
| eng | Latn | satts | ara | Arab | native | ara |
| eng | Latn | buckwalter | ara | Arab | native | ara |
| eng | Latn | mcr | kor | Hang | native | kor |
| eng | Latn | moct | kor | Hang | native | kor |
| eng | Latn | ctc | zho | Hani | native | zho |
| eng | Latn | folk | ara | Arab | native | ara |
| eng | Latn | folk | pus | Arab | native | pus |
| eng | Latn | folk | prs | Arab | native | prs |
| eng | Latn | folk | pes | Arab | native | pes |
| eng | Latn | folk | rus | Cyrl | native | rus, eng |
| eng | Latn | folk | kor | Hang | native | kor, eng |
| eng | Latn | folk | zho | Hani | native | zho, eng |
| eng | Latn | native | eng | Latn | bgn, basis, ic | ara |
| fas | Arab | native | eng | Latn | bgn, und_bgn, folk | fas |
| heb | Hebr | native | eng | Latn | native | heb, eng |
| hin | Deva | native | eng | Latn | ic | hin |
| jpn | Hani | native | eng | Latn | hebon, kunrei | jpn, zho, kor |
| jpn | Hani | native | jpn | Hira | native | jpn |
| jpn | Hani | native | jpn | Hani | native | jpn, zho, kor |
| jpn | Hans | native | eng | Latn | hebon, kunrei | jpn, zho, kor |
| jpn | Hant | native | eng | Latn | hebon, kunrei | jpn, zho, kor |
| jpn | Hira | native | eng | Latn | hebon, kunrei | jpn |
| jpn | Hrkt | native | eng | Latn | hebon, kunrei | jpn, eng |
| jpn | Kana | native | eng | Latn | hebon, kunrei | jpn, eng |
| jpn | Jpan | native | eng | Latn | hebon, kunrei | jpn, eng, zho, kor |
| kmr | Khmr | native | eng | Latn | native | kmr, eng |
| kor | Hang | native | eng | Latn | bgn, ic, und_bgn, korda, mcr, moct, folk | kor, eng |
| kor | Hang | native | kor | Hang | native | kor, eng |
| kor | Hani | native | eng | Latn | bgn, ic, und_bgn, korda, mcr, moct, folk | kor, eng |
| kor | Hani | native | kor | Hang | native | kor, eng |
| kor | Kore | native | eng | Latn | bgn, ic, und_bgn, korda, mcr, moct, folk | kor, eng |
| kor | Kore | native | kor | Hang | native | kor, eng |
| mya | Burmese | folk | eng | Latn | icu | mya, eng |
| mya | Mymr | native | eng | Latn | folk, mlcts | mya |
| pes | Arab | native | pes | Arab | native | pes |
| pes | Arab | native | eng | Latn | bgn, ic, und_bgn | pes |
| prs | Arab | native | prs | Arab | native | prs |
| prs | Arab | native | eng | Latn | bgn, ic, und_bgn | prs |
| pus | Arab | native | pus | Arab | native | pus |
| pus | Arab | native | eng | Latn | bgn | pus |
| pus | Arab | native | eng | Latn | ic | pus, prs |
| pus | Arab | native | eng | Latn | und_bgn, folk | pus |
| rus | Cyrl | native | eng | Latn | bgn, ic, iso9_1995, und_bgn | rus, eng |
| rus | Cyrl | native | rus | Cyrl | native | rus, eng |
| tha | Thai | native | eng | Latn | icu, iso_11940_2, iso11940_2_2007 | eng, tha |
| urd | Arab | native | urd | Arab | native | urd |
| urd | Arab | native | eng | Latn | bgn, ic, und_bgn, folk | urd |
| zho | Hani | native | eng | Latn | bgn, ic, und_bgn, hypy, hypy_toned, wade_giles, ctc | zho, eng |
| zho | Hani | native | zho | Hani | native | zho, eng |
| zho | Hans | native | eng | Latn | bgn, ic, und_bgn, hypy, hypy_toned, wade_giles | zho, eng |
| zho | Hant | native | eng | Latn | bgn, ic, und_bgn, hypy, hypy_toned, wade_giles | zho, eng |
| zho | Hant | native | zho | Hans | native | zho, eng |

Supported translation option domains

This section specifies the translation domain pairs to which each of the RNT translation options applies.

Orthographic completion

For Arabic-script names in Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu, this option (com.basistech.rnt.options.CompleteOrthographyOption) adds short vowels and other diacritics. If the language is Arabic or Hebrew[10] and statistical methods are turned on (the default), the translator statistically infers an orthographic completion if the name does not appear in its dictionary. For the other languages, orthographic completion only takes place if the name appears in the relevant dictionary. By default, this option is set to true. See Translation Options.

Orthographic Completion

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Afghan Persian (prs) | Arabic (Arab) | Afghan Persian (prs) | Arabic (Arab) | Native (native) |
| Afghan Persian (prs) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn) |
| Afghan Persian (prs) | Arabic (Arab) | Afghan Persian (prs) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Arabic (Arab) | Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Latin (Latn) [Deprecated domain] | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Arabic (ara) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Hebrew (heb) | Hebrew (Hebr) | Hebrew (heb) | Hebrew (Hebr) | Native (native) |
| Iranian Persian (pes) | Arabic (Arab) | Iranian Persian (pes) | Arabic (Arab) | Native (native) |
| Iranian Persian (pes) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn) |
| Iranian Persian (pes) | Arabic (Arab) | Iranian Persian (pes) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Arabic (Arab) | Native (native) |
| Pashto (pus) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |
| Persian (fas) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk) |
| Persian (fas) | Arabic (Arab) | Persian (fas) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk) |
| Urdu (urd) | Arabic (Arab) | Urdu (urd) | Arabic (Arab) | Native (native) |
| Urdu (urd) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[a], IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |
| Urdu (urd) | Arabic (Arab) | Urdu (urd) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[a], IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |

[a] For Urdu, Basis implemented a BGN transliteration scheme based on an unofficial specification prior to the specification being officially adopted.

Orthographic minimalization

For Arabic-script names in Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu, this option (com.basistech.rnt.options.MinimizeOrthographyOption) removes diacritics for short vowels. The translation produces the Arabic script representation of names found in most print media, including news articles. By default this option is set to false. See Translation Options.
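The effect of orthographic minimalization can be approximated with a plain character filter. The Java sketch below is not the MinimizeOrthographyOption implementation: the class and method names are hypothetical, and the set of marks removed (U+064B through U+0652, which also covers shadda and sukun) is an assumption about what the option strips.

```java
// Hypothetical illustration, not part of the RNT API.
class MinimizeOrthographySketch {

    // Removes Arabic combining marks in the assumed range U+064B..U+0652
    // (tanwin, fatha, damma, kasra, shadda, sukun).
    static String stripHarakat(String name) {
        return name.replaceAll("[\\u064B-\\u0652]", "");
    }

    public static void main(String[] args) {
        // Fully vocalized "al-Suhayl", written with escapes to avoid encoding issues.
        String vocalized = "\u0627\u064E\u0644\u0633\u0651\u064F\u0647\u064E\u064A\u0652\u0644";
        // Prints the six base letters only: alef, lam, seen, heh, yeh, lam.
        System.out.println(stripHarakat(vocalized));
    }
}
```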

Orthographic Minimalization

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Afghan Persian (prs) | Arabic (Arab) | Afghan Persian (prs) | Arabic (Arab) | Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Arabic (Arab) | Native (native) |
| Iranian Persian (pes) | Arabic (Arab) | Iranian Persian (pes) | Arabic (Arab) | Native (native) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Arabic (Arab) | Native (native) |
| Urdu (urd) | Arabic (Arab) | Urdu (urd) | Arabic (Arab) | Native (native) |

Statistical methods

For personal names in Arabic, Hebrew, Japanese, and Russian this option (com.basistech.rnt.options.StatisticalMethodsOption) uses statistical methods to establish information that is not found in a dictionary. For Arabic, this option is set to true by default. For Hebrew, Japanese and Russian it is always set to true. See Translation Options.

Statistical Methods

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Afghan Persian (prs) | Latin (Latn) [Deprecated domain] | Afghan Persian (prs) | Arabic (Arab) | BGN (bgn), Native (native) |
| Afghan Persian (prs) | Latin (Latn) [Deprecated domain] | Afghan Persian (prs) | Arabic (Arab) | Folk (folk), Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Arabic (Arab) | Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Latin (Latn) [Deprecated domain] | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Arabic (ara) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Arabic (ara) | Latin (Latn) [Deprecated domain] | Arabic (ara) | Arabic (Arab) | Basis (basis), Native (native) |
| Arabic (ara) | Latin (Latn) [Deprecated domain] | Arabic (ara) | Arabic (Arab) | BGN (bgn), Native (native) |
| Arabic (ara) | Latin (Latn) [Deprecated domain] | Arabic (ara) | Arabic (Arab) | Buckwalter (buckwalter), Native (native) |
| Arabic (ara) | Latin (Latn) [Deprecated domain] | Arabic (ara) | Arabic (Arab) | Folk (folk), Native (native) |
| Arabic (ara) | Latin (Latn) [Deprecated domain] | Arabic (ara) | Arabic (Arab) | SATTS (satts), Native (native) |
| English (eng) | Latin (Latn) | Afghan Persian (prs) | Arabic (Arab) | Folk (folk), Native (native) |
| English (eng) | Latin (Latn) | Arabic (ara) | Arabic (Arab) | Basis (basis), Native (native) |
| English (eng) | Latin (Latn) | Arabic (ara) | Arabic (Arab) | BGN (bgn), Native (native) |
| English (eng) | Latin (Latn) | Arabic (ara) | Arabic (Arab) | Buckwalter (buckwalter), Native (native) |
| English (eng) | Latin (Latn) | Arabic (ara) | Arabic (Arab) | Folk (folk), Native (native) |
| English (eng) | Latin (Latn) | Arabic (ara) | Arabic (Arab) | SATTS (satts), Native (native) |
| English (eng) | Latin (Latn) | Iranian Persian (pes) | Arabic (Arab) | Folk (folk), Native (native) |
| English (eng) | Latin (Latn) | Iranian Persian (pes) | Arabic (Arab) | BGN (bgn), Native (native) |
| English (eng) | Latin (Latn) | Pashto (pus) | Arabic (Arab) | BGN (bgn), Native (native) |
| English (eng) | Latin (Latn) | Pashto (pus) | Arabic (Arab) | Folk (folk), Native (native) |
| English (eng) | Latin (Latn) | Persian (fas) | Arabic (Arab) | BGN (bgn), Native (native) |
| English (eng) | Latin (Latn) | Urdu (urd) | Arabic (Arab) | BGN (bgn), Native (native) |
| Hebrew (heb) | Hebrew (Hebr) | English (eng) | Latin (Latn) | Native (native), ISO 259-2:1994 (iso259_2_1994), Folk (folk), ICU (icu) |
| Iranian Persian (pes) | Latin (Latn) [Deprecated domain] | Iranian Persian (pes) | Arabic (Arab) | BGN (bgn), Native (native) |
| Iranian Persian (pes) | Latin (Latn) [Deprecated domain] | Iranian Persian (pes) | Arabic (Arab) | Folk (folk), Native (native) |
| Pashto (pus) | Latin (Latn) [Deprecated domain] | Pashto (pus) | Arabic (Arab) | Folk (folk), Native (native) |
| Pashto (pus) | Latin (Latn) [Deprecated domain] | Pashto (pus) | Arabic (Arab) | BGN (bgn), Native (native) |
| Persian (fas) | Latin (Latn) [Deprecated domain] | Persian (fas) | Arabic (Arab) | BGN (bgn), Native (native) |
| Urdu (urd) | Latin (Latn) [Deprecated domain] | Urdu (urd) | Arabic (Arab) | BGN (bgn), Native (native) |

Performance tradeoff

For personal names in Arabic, Japanese, or Russian, this option (com.basistech.rnt.options.PerformanceTradeoff) controls the tradeoff the translator makes between speed and correctness. See Translation Options.

Performance Tradeoff

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Arabic (Arab) | Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Latin (Latn) [Deprecated domain] | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Arabic (ara) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Han (Hanzi) (Hani) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Han (Simplified variant) (Hans) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| English (eng) | Latin (Latn) | Russian (rus) | Cyrillic (Cyrl) | Folk (folk), Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Han (Kanji) (Hani) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Hiragana (Hira) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Korean (kor) | Han (Hanja) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Han (Hanja) (Hani) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Han (Hanja) (Hani) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Russian (rus) | Cyrillic (Cyrl) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn) |
| Russian (rus) | Cyrillic (Cyrl) | Russian (rus) | Cyrillic (Cyrl) | Native (native) |
| Russian (rus) | Cyrillic (Cyrl) | Russian (rus) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn) |

[a] For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.

[b] For how BGN is handled for Korean, see Korean Geography.

Segmentation

For Chinese, Japanese, and Korean[11] names, this option (com.basistech.rnt.options.SegmentOption) segments unsegmented names. By default, this option is set to true. See Translation Options.

Segmentation

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Han (Hanzi) (Hani) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Han (Simplified variant) (Hans) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Japanese (jpn) | Han (Kanji) (Hani) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Han (Kanji) (Hani) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Hiragana (Hira) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Korean (kor) | Han (Hanja) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Han (Hanja) (Hani) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Han (Hanja) (Hani) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | English (eng) | Latin (Latn) | Native (native), BGN (bgn) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Native (native) |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk) |

[a] For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.

Normalization

This option (com.basistech.rnt.options.NormalizeOption) applies a set of rules to normalize Arabic native names to a standard form, converts characters in the traditional Chinese variant to the corresponding simplified Chinese variant, and converts Japanese Kanji variants (including old Kanji) to their standard form. Normalization occurs before any other name processing. By default, this option is set to true. See Translation Options.

Normalization

| Source Language (ISO 639-3) | Source Script (ISO 15924) | Target Language (ISO 639-3) | Target Script (ISO 15924) | Transliteration Scheme(s) (name) |
| --- | --- | --- | --- | --- |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Arabic (Arab) | Native (native) |
| Arabic (ara) | Arabic (Arab) | Arabic (ara) | Latin (Latn) [Deprecated domain] | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Arabic (ara) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk) |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Han (Hanzi) (Hani) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Hanzi) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Han (Simplified variant) (Hans) | Native (native), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | Chinese (zho) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Chinese (zho) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)[a] |
| Japanese (jpn) | Han (Kanji) (Hani) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Han (Kanji) (Hani) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Hiragana (Hira) | Native (native) |
| Japanese (jpn) | Han (Kanji) (Hani) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Simplified variant) (Hans) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Han (Traditional variant) (Hant) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Hiragana (Hira) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese (alias for Han + Hiragana + Katakana) (Jpan) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | English (eng) | Latin (Latn) | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Japanese (jpn) | Katakana (Kana) | Japanese (jpn) | Latin (Latn) [Deprecated domain] | Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei) |
| Pashto (pus) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Arabic (Arab) | Native (native) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk) |

[a] For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.

Variant spelling

When using the IC transliteration standard for personal names in Pashto, this option (com.basistech.rnt.options.VariantSpellingOption) specifies how long vowels are transliterated. When variant spelling is false (the default), long vowels are transliterated with a single vowel. When true, long vowels are transliterated with double vowels. For example, Hamid vs. Hamiid. See Translation Options.

Variant Spelling

| Source language (ISO 639-3) | Source script (ISO 15924) | Target language (ISO 639-3) | Target script (ISO 15924) | Transliteration scheme(s) (name) |
|---|---|---|---|---|
| Pashto (pus) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), IC (ic) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Latin (Latn) [Deprecated domain] | Native (native), IC (ic) |
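The effect of the variant spelling option can be pictured with a small, self-contained sketch. It is not the Match translator and does not use com.basistech.rnt.options.VariantSpellingOption; the class name, the render method, and the macron-marked intermediate form (for example "Hamīd") are assumptions made only to show single versus double vowels.

```java
import java.util.Map;

// Illustration only: a hypothetical helper, not part of the Match API.
final class VariantSpellingSketch {
    // Assumed intermediate form: long vowels carry a macron (a, i, u, e, o with macron).
    private static final Map<Character, Character> LONG_TO_BASE = Map.of(
            '\u0101', 'a', '\u012B', 'i', '\u016B', 'u', '\u0113', 'e', '\u014D', 'o');

    // Renders each long vowel as one base vowel (variantSpelling = false)
    // or as two base vowels (variantSpelling = true).
    static String render(String marked, boolean variantSpelling) {
        StringBuilder out = new StringBuilder();
        for (char c : marked.toCharArray()) {
            Character base = LONG_TO_BASE.get(c);
            if (base == null) {
                out.append(c);          // not a long vowel: copy unchanged
            } else {
                out.append(base);
                if (variantSpelling) {
                    out.append(base);   // double the vowel
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Ham\u012Bd", false));
        System.out.println(render("Ham\u012Bd", true));
    }
}
```

With the default (false) the sketch produces the single-vowel form, and with true it produces the double-vowel form, mirroring the Hamid vs. Hamiid example above.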

Region

When using the IC transliteration standard for personal names in Pashto, this option (com.basistech.rnt.options.RegionOption) designates how the Pashto letter ږ (ģe) is transliterated. If the region is set to DEFAULT (unknown) or SOUTH, the transliteration is 'zh'. If the region is set to NORTH, the transliteration is 'g'. See Translation Options.

Region

| Source language (ISO 639-3) | Source script (ISO 15924) | Target language (ISO 639-3) | Target script (ISO 15924) | Transliteration scheme(s) (name) |
|---|---|---|---|---|
| Pashto (pus) | Arabic (Arab) | English (eng) | Latin (Latn) | Native (native), IC (ic) |
| Pashto (pus) | Arabic (Arab) | Pashto (pus) | Latin (Latn) [Deprecated domain] | Native (native), IC (ic) |
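The region rule described above amounts to a simple lookup. The sketch below restates it in Java. The enum and method are hypothetical stand-ins that reuse the documented values DEFAULT, SOUTH, and NORTH; they are not the com.basistech.rnt.options.RegionOption class, whose actual signature is not shown here.

```java
// Illustration only: hypothetical types, not part of the Match API.
final class PashtoRegionSketch {
    // Mirrors the documented option values.
    enum Region { DEFAULT, SOUTH, NORTH }

    // Returns the IC transliteration of the Pashto letter described above:
    // "g" for NORTH, "zh" for DEFAULT (unknown) and SOUTH.
    static String transliterateLetter(Region region) {
        return region == Region.NORTH ? "g" : "zh";
    }

    public static void main(String[] args) {
        for (Region region : Region.values()) {
            System.out.println(region + " -> " + transliterateLetter(region));
        }
    }
}
```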

Korean geography

When using the BGN transliteration standard for names in Korean, this option (com.basistech.rnt.options.KorGeographyOption) designates which transliteration scheme is used. See Translation Options.

Korean Geography Option

| Source language (ISO 639-3) | Source script (ISO 15924) | Target language (ISO 639-3) | Target script (ISO 15924) | Transliteration scheme(s) (name) |
|---|---|---|---|---|
| Korean (kor) | Han (Hanja) (Hani) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[a] |
| Korean (kor) | Han (Hanja) (Hani) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[a] |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[a] |
| Korean (kor) | Hangul (Hangŭl, Hangeul) (Hang) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[a] |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | English (eng) | Latin (Latn) | Native (native), BGN (bgn)[a] |
| Korean (kor) | Korean (alias for Hangul + Han) (Kore) | Korean (kor) | Latin (Latn) [Deprecated domain] | Native (native), BGN (bgn)[a] |

[a] For Korean, BGN uses the McCune-Reischauer (mcr) transliteration scheme if the option is NORTHKOREAN (the default) and Revised Romanization of Korean (moct) if the option is SOUTHKOREAN.
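The footnote's rule can be restated as a two-way mapping. The following sketch is only an illustration: the enum and method names are hypothetical and do not represent com.basistech.rnt.options.KorGeographyOption. Only the option values and the scheme codes (mcr, moct) come from the text above.

```java
// Illustration only: hypothetical types, not part of the Match API.
final class KorGeographySketch {
    // Mirrors the documented option values; NORTHKOREAN is the documented default.
    enum Geography { NORTHKOREAN, SOUTHKOREAN }

    // Returns the scheme code that BGN resolves to for Korean names.
    static String bgnSchemeFor(Geography geography) {
        switch (geography) {
            case SOUTHKOREAN:
                return "moct"; // Revised Romanization of Korean
            case NORTHKOREAN:
            default:
                return "mcr";  // McCune-Reischauer
        }
    }

    public static void main(String[] args) {
        System.out.println(bgnSchemeFor(Geography.NORTHKOREAN));
        System.out.println(bgnSchemeFor(Geography.SOUTHKOREAN));
    }
}
```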

Appendix

Match phenomena

Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.

Note

The examples provided represent match phenomena when using the default parameter values unless indicated otherwise. Parameters can be configured to adjust the score for each match phenomenon.

Name

Description

Example

CONFLICT

The tokens do not match.

When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT.

DELETION

The token is unmatched.

When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION.

EMBEDDING_MATCH

The tokens are semantically similar as determined by word-embedding vectors.

When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH.

FIELD_BLOCKED

This field cannot be matched because of a cross-field match involving the same field in the other name.

When comparing "Bob|William|Smith" with "William||Smith", "bob" is FIELD_BLOCKED because the cross-field "william" match prevents it from matching with its corresponding field.

FIELD_CONFLICT

When comparing two names that are divided into fields, these fields do not match.

When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT.

FIELD_DELETION

When comparing two names that are divided into fields, this field is unmatched.

When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION.

GIVEN_NAME_DELETION

When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched.

When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a Given_name field.

HANI_ABBREVIATION

One Hani token appears to be an abbreviation of another Hani token.

"北京大学" and "北大" are a HANI_ABBREVIATION match.

HMM_MATCH

The tokens are similar but not identical, and the match was determined by a hidden Markov model (HMM). This is a type of fuzzy match.

"richard" and "richerd" are an HMM_MATCH.

INITIALISM

One token is a name and the other token is the initials of the words which make up the name.

"john fitzgerald kennedy" and "JFK" are an INITIALISM.

"consumer value stores" and "CVS" are an INITIALISM.

INITIAL_MATCH

One token is the first initial of the other.

"w" and "william" are an INITIAL_MATCH.

LANGUAGE_SPECIFIC_MATCH

The match was determined by a language-specific matcher.

"laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH.

MATCH

The tokens are identical (after stop word elimination and normalization).

"john" and "john" are a MATCH.

NULL

The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object.

N/A

OUT_OF_ORDER_DELETION

This unmatched token still leaves the remaining tokens out of order when it is removed.

When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION.

OVERRIDE

The tokens appear as a pair on the override list. This is often used for nicknames.

"john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list.

PREFIX_INITIAL

One token is an initial that matches a prefix in the other token.

In practice, the PREFIX_INITIAL phenomenon is rare.

If the initialsScore parameter is set to 0.1, "E Silva" and "EduardoSil" will be a PREFIX_INITIAL match.

STRING_SIMILARITY

The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match.

"akcd" and "xkcd" are a STRING_SIMILARITY match.

STUCK_INITIAL

One name appears to have an initial mistakenly attached to a preceding token.

"DavidK" and "David Keith" are a STUCK_INITIAL match.

SURNAME_DELETION

When comparing two names that are divided into fields, the SURNAME field is unmatched.

When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a Surname field.

TRAILING_PATRONYMIC_DELETION[a]

The unmatched token is a patronymic which has been truncated in the other name.

When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION.

TRUNCATED_EXACT_MATCH

The tokens are identical except that one has been slightly truncated.

"murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH.

TRUNCATED_HMM_MATCH

The tokens are similar, but not identical, and one has been slightly truncated.

"gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH.

UNKNOWN_FIELD_MATCH

One of the tokens is part of an "unknown" field in a fielded name.

The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API.

When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH.

[a] Only applies to Latin script names of Arabic origin.
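To make a few of the phenomena in this table more concrete, the sketch below implements toy versions of three checks: an initial against a token (INITIAL_MATCH), a name against its initials (INITIALISM), and the string edit distance mentioned under STRING_SIMILARITY. These are simplified illustrations under our own assumptions, not the scorers Match uses. Match assigns phenomena based on scores and configurable parameters, and all class and method names here are hypothetical.

```java
import java.util.Locale;

// Illustration only: toy checks, not the Match token scorers.
final class PhenomenaSketch {
    // True when 'initial' is a single letter equal to the first letter of a longer token.
    static boolean isInitialOf(String initial, String token) {
        return initial.length() == 1 && token.length() > 1
                && Character.toLowerCase(initial.charAt(0)) == Character.toLowerCase(token.charAt(0));
    }

    // Concatenates the first letter of each whitespace-separated word, lowercased.
    static String initialsOf(String name) {
        StringBuilder sb = new StringBuilder();
        for (String word : name.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                sb.append(word.charAt(0));
            }
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // True when 'candidate' equals the initials of 'name', ignoring case.
    static boolean isInitialism(String name, String candidate) {
        return initialsOf(name).equals(candidate.toLowerCase(Locale.ROOT));
    }

    // Levenshtein distance: minimum number of insertions, deletions, and substitutions.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(isInitialOf("w", "william"));
        System.out.println(isInitialism("john fitzgerald kennedy", "JFK"));
        System.out.println(editDistance("akcd", "xkcd"));
    }
}
```

A small edit distance alone does not decide the phenomenon: in the table, "richard" and "richerd" are an HMM_MATCH while "akcd" and "xkcd" are a STRING_SIMILARITY match, even though both pairs differ by a single substitution.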

Parameters

Match Plugin for OpenSearch Restriction

This feature requires access to system parameters not currently available in this release.

We are currently working with OpenSearch to provide access and enable these features.

This table lists the parameters that can be configured via parameter_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in parameter_defs.yaml.

Table 32. Parameter impacts

Parameter name

Applies to

Impacts

addressCrossFieldScoreThreshold

Addresses

Can affect any kind of match

addressDeletionScore

Addresses

DELETION match phenomenon

addressDifferentGroupPenalty

Addresses

Can affect any kind of match

addressFinalBias

Addresses

All address match scores

addressJoinedTokenLimit

Addresses

Concatenation[a]

addressOverrideDefaultScore

Addresses

OVERRIDE match phenomenon

addressOverrideTablePath

File locations

Internal engineering detail

addressReorderPenalty

Addresses

Reordering[e]

addressSameGroupPenalty

Addresses

Can affect any kind of match

addressStopPatternsPath

File locations

Internal engineering detail

addressUnpairedFieldScore

Addresses

FIELD_DELETION match phenomenon

adjustOneSidedDeletionScores

All names

DELETION match phenomenon

allowNullValue

ES/OS plugin

Plugin setting

alternateEditDistanceTokenScorerMechanism

All names

Defaults to false. When set to true, enables the AlternateDistanceTokenScorerMechanism.

alternateEditDistanceTokenScorerMechanismScore

All names

When turned on, sets the EditDistanceToken score for all name matches where the edit distance is 1

alternativePairsToCheck

All names

Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)

alternativeTimeProximityMatch

Dates

All date match scores

boostSwappedDigits

Dates

Date scores containing two adjacent swapped digits and no other differences

boostWeightAtBothEnds

All names

All name match scores

boostWeightAtLeftEnd

All names

All name match scores

boostWeightAtRightEnd

All names

All name match scores

bothEndsBoostTokens

All names

All name match scores

caseSensitiveData

All names

INITIALISM match phenomenon

charactersToAlwaysNormalizeToSpace

All names and addresses

Defines a set of characters that will be replaced with a space in the normalization process for names and addresses

Example: if charactersToAlwaysNormalizeToSpace is set to &-, then all ampersands, hyphens, and commas are replaced with spaces

cityAddressFieldWeight

Addresses

Weighting

cityDistrictAddressFieldWeight

Addresses

Weighting

cognateOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as COGNATE in the override file

conflictScore

All names

CONFLICT match phenomenon

conflictThreshold

All names

CONFLICT match phenomenon

countryAddressFieldWeight

Addresses

Weighting

countryRegionAddressFieldWeight

Addresses

Weighting

crossFieldInitialsPenalty

Fielded names

INITIAL_MATCH match phenomenon

crossFieldJoinInitialPenalty

Fielded names

Concatenation[a]

crossFieldJoinPenalty

Fielded names

Concatenation[a]

crossFieldMatchPenalty

Fielded names

Can affect any kind of match

crossLanguageGenderConflictPenalty

All names

Gender mismatch[b]

dateFinalBias

Dates

All date match scores

dateOrdering

Dates

All date match scores

dayDistanceWeight

Dates

All date match scores

deletionScore

All names

DELETION match phenomenon

detectableLanguagesModelBased

All names

Can affect any kind of match where one or more names is in Latin script and the language is not already specified

detectableLanguagesRuleBased

All names

Currently, you can only enable detection of Latin script as Turkish or Vietnamese

editDistanceScoreBias

All names

Can affect any kind of match

enableDynamicConfigurationEndpoints

ES/OS plugin

Plugin setting

enablePromisingTermFiltering

Speed/Accuracy

Performance only

enableYueReadings

All names

Names written in Han script

entranceAddressFieldWeight

Addresses

Weighting

equivalenceClassesPath

File locations

Internal engineering detail

estimatedConflictOrDeletionScore

All names

Internal engineering detail

exactLatnMatchScore

All names

Token normalization

expensiveScorerJoinedTokenLimit

All names

Concatenation[a]

fieldBlockedScore

Fielded names

OUT_OF_ORDER_DELETION match phenomenon

fieldConflictScore

Fielded names

CONFLICT match phenomenon

fieldDeletionScore

Fielded names

DELETION match phenomenon

finalBias

All names

All name match scores

frequencyRankBias

All names

Can affect any kind of match

genderConflictPenalty

All names

Gender mismatch[b]

genderConflictPenaltyThreshold

All names

Gender mismatch[b]

globalTokenCacheConfig

Speed/Accuracy

Performance only

globalTokenPairCacheConfig

Speed/Accuracy

Performance only

haniAbbreviationScore

All names

INITIALISM match phenomena in Han script

haniAbbreviationThreshold

All names

INITIALISM match phenomena in Han script

haniFourCornerCodeMismatchPenalty

All names

Names written in Han script

hmmNormalizationAlternative

All names

HMM_MATCH phenomenon

hmmScoreBias

All names

HMM_MATCH phenomenon

hmmScoreLimit

All names

HMM_MATCH phenomenon

houseAddressFieldWeight

Addresses

Weighting

houseNumberAddressFieldWeight

Addresses

Weighting

ignoreBadData

ES/OS plugin

Plugin setting

improveSingleDigitManipulationMatch

Dates

Date match scores containing exactly one instance of digit manipulation[c] and no other differences

initialFrequencyRank

All names

INITIAL_MATCH match phenomenon

initialismMismatchPenalty

All names

HMM_MATCH phenomenon

initialismScore

All names

INITIALISM match phenomenon

initialsConflictScore

All names

CONFLICT match phenomenon

initialsDeletionPenalty

All names

DELETION match phenomenon

initialsScore

All names

INITIAL_MATCH match phenomenon

islandAddressFieldWeight

Addresses

Weighting

joinedTokenInitialsPenalty

All names

Concatenation[a]

INITIAL_MATCH match phenomenon

joinedTokenLimit

All names

Concatenation[a]

joinedTokenPenalty

All names

Concatenation[a]

leftBoostTokens

All names

All name match scores

levelAddressFieldWeight

Addresses

Weighting

libpostalDataDirPath

File locations

Internal engineering detail

lowWeightTokenFrequencyRank

All names

Can affect any kind of match

lowWeightTokenPath

File locations

Internal engineering detail

maxExpansions

All names

First-pass accuracy

maximumAlternateTokenizationRelativeDistance

All names

Affects tokenization and therefore any potential score

maximumOrganizationInitialismLength

Organization names

INITIALISM match phenomenon

maximumPersonInitialismLength

Person names

INITIALISM match phenomenon

maxYearDistanceForDigitManipulation

Dates

Date match scores containing exactly one instance of digit manipulation[c] and no other differences

minFieldWeightFactor

Fielded names

Weighting

minimumAlternateTokenizationLength

All names

Affects tokenization and therefore any potential score

minimumOrganizationInitialismLength

Organization names

INITIALISM match phenomenon

minimumPersonInitialismLength

Person names

INITIALISM match phenomenon

monthDistanceWeight

Dates

All date match scores

nameBigramQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameDoubleMetaphoneQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameGluedQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameInitialQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

nameLengthMismatchPenalty

All names

DELETION match phenomenon

Concatenation[a]

Any phenomenon that changes the number of tokens in a name

namePairCacheConfig

Speed/Accuracy

Performance

nameRealWorldIdQueryBoost

Lucene searches

First-pass accuracy

Can affect any kind of match

ngramLMPath

File locations

Internal engineering detail

ngramThresholdPath

File locations

Internal engineering detail

nicknameOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as NICKNAME in override file

numericTokenFrequencyRank

All names

Can affect any kind of match

outOfOrderDeletionScore

All names

OUT_OF_ORDER_DELETION match phenomenon

parseUnknownFieldMarker

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

poBoxAddressFieldWeight

Addresses

Weighting

postCodeAddressFieldWeight

Addresses

Weighting

queryAlternativeOriginLanguages

Speed/Accuracy

Can affect any kind of match

realWorldIdsPath

File locations

Internal engineering detail

realWorldIdsPathUser

File locations

Internal engineering detail

reorderCorrection

All names

Rotation[d]

reorderCorrectionThreshold

All names

Rotation[d]

reorderPenalty

All names

Reordering[e]

rightBoostTokens

All names

All name match scores

rniFullnameOverridesPath

File locations

Internal engineering detail

rntFullnameOverridesPath

File locations

Internal engineering detail

roadAddressFieldWeight

Addresses

Weighting

sameNameUnknownFieldMatchInterpolator

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

staircaseAddressFieldWeight

Addresses

Weighting

stateAddressFieldWeight

Addresses

Weighting

stateDistrictAddressFieldWeight

Addresses

Weighting

stopPatternsPath

File locations

Internal engineering detail

stringDistanceWeight

Dates

All date match scores

stuckInitialAffixMinLength

All names

STUCK_INITIAL match phenomenon

stuckInitialScore

All names

STUCK_INITIAL match phenomenon

suburbAddressFieldWeight

Addresses

Weighting

thresholdToDropoffBiasMapping

Dates

All date match scores

timeDistanceWeight

Dates

All date match scores

timeProximityYearInterval

Dates

All date match scores

tokenizeOrganizationsWithNumbers

Organization names

Affects tokenization and therefore any potential score

tokenOverridesPath

File locations

Internal engineering detail

trailingPatronymicDeletionScore

Person names

TRAILING_PATRONYMIC_DELETION match phenomenon

truncationFractionLimit

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationScorerBias

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

tryAlternateTokenization

All names

Affects tokenization and therefore any potential score

tryDayMonthSwap

Dates

All date match scores

unigramLMPath

File locations

Internal engineering detail

unitAddressFieldWeight

Addresses

Weighting

unknownFieldFrequencyRank

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

unknownVsKnownScore

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

unknownVsUnknownScore

Fielded names

UNKNOWN_FIELD_MATCH match phenomenon

useEmbeddings

Organization names

EMBEDDING_MATCH match phenomenon

useNamePairCache

Speed/Accuracy

Performance

useSolrPhraseQueries

Solr

Solr plugin setting

variantOverrideScore

All names

OVERRIDE match phenomenon for tokens marked as VARIANT in the override file

worldRegionAddressFieldWeight

Addresses

Weighting

yearDistanceWeight

Dates

All date match scores

[a] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.

[b] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.

[c] A digit manipulation is a change to a written digit that takes only a few additional lines (strokes), for example turning a 1 into a 7. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0↔8, 1↔7, 3↔8, 5↔8, 5↔6, 6↔8, 7↔2.

[d] If the tokens in a name have been rotated, the reorder penalty will negatively impact the match score. Match detects and compensates for this error.

[e] Tokens that match, but that appear to be out-of-order, have their match scores adjusted to reflect that fact.
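The date-related entries in this table refer to two specific kinds of difference: exactly one digit manipulation from the list in footnote [c], and two adjacent swapped digits (boostSwappedDigits). The sketch below shows one way such differences could be detected on plain digit strings. It is an assumption-laden illustration, not Match's date scorer: the class and method names are hypothetical, and date parsing and scoring are not shown.

```java
import java.util.Set;

// Illustration only: detects the two kinds of digit difference on equal-length strings.
final class DateDigitSketch {
    // Unordered pairs from footnote [c], stored with the smaller digit first.
    private static final Set<String> MANIPULATIONS =
            Set.of("08", "17", "38", "58", "56", "68", "27");

    static boolean isManipulation(char x, char y) {
        char lo = (char) Math.min(x, y);
        char hi = (char) Math.max(x, y);
        return MANIPULATIONS.contains("" + lo + hi);
    }

    // True when a and b differ in exactly one position and that difference
    // is one of the listed digit manipulations.
    static boolean isSingleDigitManipulation(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int diffIndex = -1;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                if (diffIndex >= 0) {
                    return false; // more than one difference
                }
                diffIndex = i;
            }
        }
        return diffIndex >= 0 && isManipulation(a.charAt(diffIndex), b.charAt(diffIndex));
    }

    // True when a and b differ only by two adjacent characters being swapped.
    static boolean isAdjacentSwap(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int first = -1;
        int last = -1;
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                if (count == 0) {
                    first = i;
                }
                last = i;
                count++;
            }
        }
        return count == 2 && last == first + 1
                && a.charAt(first) == b.charAt(last)
                && a.charAt(last) == b.charAt(first);
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigitManipulation("19700101", "19780101"));
        System.out.println(isAdjacentSwap("19870612", "19780612"));
    }
}
```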



Internal parameters

Match Plugin for OpenSearch Restriction

This feature requires access to system parameters not currently available in this release.

We are currently working with OpenSearch to provide access and enable these features.

This table lists the parameters that can be configured via internal_param_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in internal_param_defs.yaml.

Important

We recommend against modifying these parameters unless advised to by support.

Name

Applies to

Impacts

affixGlueThreshold[a]

All names and addresses

Concatenation[b]

allLanguageSupport

All names

Can affect any kind of match

allowCacheBonuses

All names

Internal engineering detail

alwaysComputeSuffixes[a]

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

arabicScriptScorerBias

Ara/Ara name matches

Arabic-script names

araRNISpeedOption

Translated names

Speed and accuracy tradeoff

crossSurnameMatchPenalty

All names

Matches in languages with onomastic information

debuggableIndex

N/A

Internal engineering detail

Has no effect on matching

debugPrintTuples

N/A

Internal engineering detail

Has no effect on matching

defaultScoreToCheckRestriction

All names

Dates

Addresses

First-pass scoring

disableHMMMatching

All names

Speed and accuracy tradeoff

doFrontTruncations

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

doQueryBigrams

All names

First-pass accuracy

doQueryCompleted

All names

First-pass accuracy

doQueryFullnameOverrides

All names

First-pass accuracy

doQueryFuzzy

All names

First-pass accuracy

doQueryGlued

All names

First-pass accuracy

doQueryIndexKeys

All names

First-pass accuracy

doQueryInitials

All names

First-pass accuracy

doQueryNormalized

All names

First-pass accuracy

doQueryPersonInitialisms

All names

First-pass accuracy

doQueryPhrase

All names

First-pass accuracy

doQueryRealWorldIds

All names

First-pass accuracy

doQueryTokenOverrides

All names

First-pass accuracy

doQueryTranslated

All names

First-pass accuracy

doViterbiRescaling

All names

Fuzzy match[c]

editDistanceFalloff

All names

STRING_SIMILARITY match phenomenon

editDistanceTokenScorerPenalty

All names

STRING_SIMILARITY match phenomenon

embeddingBias

Organization names

EMBEDDING_MATCH match phenomenon

embeddingZeroScore

Organization names

EMBEDDING_MATCH match phenomenon

enableAdditionalOnomastics

All names

Matches in languages with onomastic information

enableRemoteTokenScorer

All names

Japanese/English fuzzy match[c]

enableSeq2SeqTokenScorer

All names

Japanese/English fuzzy match[c]

enableTokenPairLogging

N/A

Internal engineering detail

Has no effect on matching

engEngFastMode

All names

English/English fuzzy match[c]

exactDataMatchCorrection

All names

Controls whether two names whose data match exactly have their similarity score set to 1

expandedLanguages

All names

Fuzzy match[c]

OVERRIDE match phenomenon

expansionLimit

All names

Fuzzy match[c]

OVERRIDE match phenomenon

expansionScoreThreshold

All names

Fuzzy match[c]

OVERRIDE match phenomenon

familiarTokenMismatchPenalty

All names

Can affect any kind of match

familiarTokenThreshold

All names

Can affect any kind of match

firstPassDayRange

Dates

Performance only

firstPassMonthRange

Dates

Performance only

firstPassYearRange

Dates

Performance only

foreignAddressFinalBias

Addresses

All English-to-non-English address matches

genderPenaltyMinimumLength

All names

Gender mismatch[d]

generalizedEditDistanceScoreBias

All names

Leet Speak/OCR error token scorer

generalizedEditDistanceSubstitutionCost

All names

Leet Speak/OCR error token scorer

generalizedEditDistanceTokenScorerPenalty

All names

Leet Speak/OCR error token scorer

givenFieldDeletionScore

Fielded names

DELETION match phenomenon

HMMCachePerProcess

All names

Internal engineering detail

HMM_MATCH phenomenon

HMMCachePerThread

All names

Internal engineering detail

HMM_MATCH phenomenon

hmmNormBias

All names

Internal engineering detail

Fuzzy match[c]

HMMUsageThreshold

All names

Internal engineering detail

HMM_MATCH phenomenon

identifierEditDistanceTokenScorerPenalty

Identifiers

STRING_SIMILARITY match phenomenon

ignoreTranslationOrigins

All names

Can affect any kind of match that uses English transliteration

includeExtraKatakanaPersonReadings

Translated names

Can affect any kind of match

initialAndSuffixMinLength

All names

Fuzzy match[c]

INITIAL_MATCH match phenomenon

initialAndSuffixScore

All names

Fuzzy match[c]

INITIAL_MATCH match phenomenon

jniBias

All names

Can affect any kind of match in languages that use a JNI scorer

jpnRNISpeedOption

Translated names

Speed and accuracy tradeoff

kanjiMismatchPenalty

All names

Normalization of tokens that include kanji

katakanaTransliterationsOnly

Translated names

Can affect any kind of match

korRNISpeedOption

Translated names

Speed and accuracy tradeoff

latinDataAlternativesToCheck

All names

Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)

limitedLanguageEditDistance

All names

STRING_SIMILARITY match phenomenon

maxIdentifierEditDistance

All names

First-pass accuracy

notExactMatchPenalty

All names

Normalization

_postCodePathAddressFieldWeight

Addresses

Weighting

promisingFuzzyTermFrequencyFactor

Speed/Accuracy

Performance only

promisingTermFrequencyFactor

Speed/Accuracy

Performance only

queryMaxResults

All names

Dates

Addresses

First-pass scoring

queryMaxToCheck

All names

Dates

Addresses

First-pass scoring

queryMaxToConsider

All names

Dates

Addresses

First-pass scoring

queryToCheckAllowance

All names

Dates

Addresses

First-pass scoring

realWorldIdScore

Organization names

Real-world id match[e]

remoteTokenScorerURL

All names

Internal engineering detail

Japanese/English fuzzy match[c]

rntTokenOverridesPath

File locations

Internal engineering detail

rusRNISpeedOption

Translated names

Speed and accuracy tradeoff

secondarySurnameTokenTypeWeight

All names

Matches in languages with onomastic information

seq2seqCachePerProcess

All names

Internal engineering detail

Japanese/English fuzzy match[c]

seq2seqCachePerThread

All names

Internal engineering detail

Japanese/English fuzzy match[c]

seq2seqTokenOverridesPath

File locations

Internal engineering detail

seq2seqUsageThreshold

All names

Internal engineering detail

Japanese/English fuzzy match[c]

splitTokens

All names

Internal engineering detail

stringDistanceThreshold[a]

All names

Fuzzy match[c]

surnameFieldDeletionScore

Fielded names

DELETION match phenomenon

surnameTokenTypeWeight

All names

Matches in languages with onomastic information

taggerMinimumConfidenceThreshold

All names

Matches in languages with onomastic information

tokenByTokenArabLatnFolkTranslation

Arabic-language Arabic-script names when translated to English Latin-script names with the FOLK transliteration scheme.

Performance and accuracy of name translation by translating token-by-token.

translatorResultsToKeep

Translated names

Can affect any kind of match

truncationAffixSimilarityLength

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationAffixSimilarityThreshold

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

truncationLengthLimit

All names

TRUNCATED_EXACT_MATCH match phenomenon

TRUNCATED_HMM_MATCH match phenomenon

useCharacterLM

All names

Can affect any kind of match

useEditDistanceTokenScorer

All names

STRING_SIMILARITY match phenomenon

useGeneralizedEditDistanceTokenScorer

All names

Leet Speak/OCR error token scorer

useIdentifierEditDistanceTokenScorer

Identifiers

STRING_SIMILARITY match phenomenon

useLM

All names

Can affect any kind of match

useOldAndNewNameSegmentationForJapanese

All names

Can affect any kind of match involving Japanese translations

useRealWorldIds

Organization names

Real-world id match[e]

zhoRNISpeedOption

Translated names

Speed and accuracy tradeoff

[a] Unlike public parameters for this feature, this is a speed/accuracy tradeoff, not a science-tuning parameter.

[b] Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.

[c] A fuzzy match is a match between tokens that are similar but not identical. The HMM_MATCH and SEQ2SEQ_MATCH phenomena are examples of this.

[d] Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.

[e] Match contains real-world identifiers for corporations, which pair an entity ID with nicknames and common permutations of the corporation name.

Directory structure

This appendix describes the location of the parts of Match that developers may need to access. It is not exhaustive. Directories not described here, such as those containing internally used compiled code referenced by other parts of the product, will not need to be accessed by most developers.

BT_ROOT directory

BT_ROOT is the Basis Root directory. This is the high-level structure of a Match installation. As a developer, you will not need to access most of these files and folders. Some files and folders of note are listed below. For more information on this directory, see Setting the Match root directory.

[Figure: BT_ROOT directory structure]

$BT_ROOT\rlp\rlp\licenses

This is where you should copy the license file included in your product shipment.

$BT_ROOT\rlpnc\lib\jvm

Library of jar files you may need to include on your classpath for logging purposes or when building and running Match applications.

$BT_ROOT\rlpnc\samples\java

Source files for sample applications and the Ant build file for compiling and running them (build.xml).

$BT_ROOT\rlpnc\samples\java\logging

Contains a copy of log4j.properties, which is used by our samples. Adjust the copy that you place on your classpath to meet your specific runtime logging needs.

$BT_ROOT\rlpnc\copyright.txt

Copyright information.

$BT_ROOT\rlpnc\ThirdPartyLicenses.txt

Information on the third party components included in the software.

$BT_ROOT\frequencyModelTrainer.zip

Use this to train a language model on your own name data. Instructions and a full description of arguments are in the README.txt file in the zip file.

$BT_ROOT\realWorldIDBuilder.zip

Use this to build a real world ID binary file. Instructions on how to run the program are in the README.md file in the zip file.

$BT_ROOT\RLPNC-version.txt

A text file containing the version number.

data directory

This is the low-level structure of the $BT_ROOT\rlpnc\data folder. As a developer, most of the files you will need to access are contained in this folder. File paths and descriptions of important files and folders are included below.

[Figure: data directory structure]

$BT_ROOT\rlpnc\data\addresses\ref

Override and stop word files for matching addresses.

$BT_ROOT\rlpnc\data\etc

Files pertaining to match parameters. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml. You can also define parameter universes in parameter_profiles.yaml.

$BT_ROOT\rlpnc\data\libpostal

Data for libpostal, a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. If you are certain that you won't be matching unfielded addresses, you can safely delete the libpostal data directory without impacting any other Match functionality.

$BT_ROOT\rlpnc\data\real_world_ids\ref\omit_ids

Contains omit_ids.datafiles, which is where you can enable and point to any omit files you have placed in the $BT_ROOT directory.

$BT_ROOT\rlpnc\data\rnm\ref\override

Files for name matching overrides, stop patterns, stop word prefixes, normalizing token variants, and unimportant tokens.

$BT_ROOT\rlpnc\data\rnm\sample

Sample data for name matching.

$BT_ROOT\rlpnc\data\rnt\ref\override

Name translation override files.
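When an application needs to locate these files, the paths listed above can be built relative to the Match root directory. The following is a minimal sketch using java.nio.file. The class and method names are hypothetical, the root is passed in by the caller (how you obtain it depends on your installation), and placing parameter_profiles.yaml directly under data\etc is our reading of the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration only: hypothetical helper for building paths under BT_ROOT.
final class MatchLayoutSketch {
    // $BT_ROOT\rlpnc\data
    static Path dataDir(Path btRoot) {
        return btRoot.resolve("rlpnc").resolve("data");
    }

    // $BT_ROOT\rlpnc\data\etc\parameter_profiles.yaml (assumed file location)
    static Path parameterProfiles(Path btRoot) {
        return dataDir(btRoot).resolve("etc").resolve("parameter_profiles.yaml");
    }

    // $BT_ROOT\rlpnc\data\rnm\ref\override
    static Path nameMatchOverrides(Path btRoot) {
        return dataDir(btRoot).resolve("rnm").resolve("ref").resolve("override");
    }

    // $BT_ROOT\rlp\rlp\licenses
    static Path licensesDir(Path btRoot) {
        return btRoot.resolve("rlp").resolve("rlp").resolve("licenses");
    }

    public static void main(String[] args) {
        // The root is taken from the first argument; "BT_ROOT" is only a placeholder.
        Path btRoot = Paths.get(args.length > 0 ? args[0] : "BT_ROOT");
        System.out.println(parameterProfiles(btRoot));
        System.out.println(nameMatchOverrides(btRoot));
        System.out.println(licensesDir(btRoot));
    }
}
```

Using Path.resolve instead of hard-coded separators keeps the same code usable on both Windows and Unix-like systems.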




[5] # may also be used after an entry on the same line to begin a comment.

[6] Override files are not provided for all supported languages. Specifically, while no files are provided for Russian or Korean, you can create token pair files for these languages.

[7] Language of use, the language of the document in which the name appears

[8] # may also be used after an entry on the same line to begin a comment.

[9] # may also be used after an entry on the same line to begin a comment.

[10] Hebrew orthographic completion occurs automatically; it is not controlled by this option.

[11] Arabic, Burmese, Khmer, and Thai segmentation occurs automatically; they are not controlled by this option.