Name Indexer and Name Translator

Introduction

Rosette Name Indexer and Rosette Name Translator (RNI-RNT) provides the linguistic infrastructure and Java APIs to perform name matches, name searches, and translations across an expanding collection of languages and scripts.

You can use RNI-RNT to perform the following tasks:

Determine the similarity of two names.
Assemble multilingual name indexes.
Search name indexes across languages for similar entries for a given name.
Build automated applications to translate names from one language to another.
Build interactive applications to translate names in conformance with a number of transliteration standards.

For information about other Rosette products that can help with processing documents, extracting names, and additional text analytics, contact support@rosette.com.

Overview of name matching

RNI uses machine learning and natural language processing techniques to perform name matching. The match scores it produces are a relative indication of how similar two names are, or how similar a search name is to a name in an index; the higher the score, the stronger the match. Customizations are available to tune and configure RNI to fit your business and data.

There are two common usage patterns in name and address matching: pairwise and index.

In pairwise matching, you have two names or addresses that you are comparing directly to one another. This comparison results in a single similarity score that indicates how similar the two names are.
With index matching, you have a single name or address that you are comparing to a list. This can be thought of as a search problem: you have a name and want to search a large list of records to find a match.

Index matching includes pairwise matching. When querying an index, RNI performs a two-pass search:

Generate candidates: The first pass is designed to quickly generate a set of candidates for the second pass to consider.
Pairwise match: The query value is compared with each value returned by the first pass and a similarity score is calculated for each pair.
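The two passes can be illustrated with a minimal sketch. This is not RNI's implementation: the first-letter candidate filter, the edit-distance similarity, and every class and method name below are assumptions chosen only to show the shape of a "cheap filter, then pairwise scoring" search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy two-pass search; the filter and score functions are stand-ins, not RNI's algorithms. */
final class TwoPassSearchSketch {

    /** One scored result from the second pass. */
    static final class Hit {
        final String name;
        final double score;
        Hit(String name, double score) { this.name = name; this.score = score; }
    }

    /** First pass: a cheap filter that keeps names sharing the query's first letter. */
    static List<String> generateCandidates(String query, List<String> index) {
        List<String> candidates = new ArrayList<>();
        if (query.isEmpty()) {
            return candidates;
        }
        char key = Character.toLowerCase(query.charAt(0));
        for (String name : index) {
            if (!name.isEmpty() && Character.toLowerCase(name.charAt(0)) == key) {
                candidates.add(name);
            }
        }
        return candidates;
    }

    /** Second pass: toy similarity = 1 - (edit distance / length of the longer string). */
    static double pairwiseScore(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int[][] d = new int[x.length() + 1][y.length() + 1];
        for (int i = 0; i <= x.length(); i++) d[i][0] = i;
        for (int j = 0; j <= y.length(); j++) d[0][j] = j;
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                int cost = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                int edit = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1);
                d[i][j] = Math.min(edit, d[i - 1][j - 1] + cost);
            }
        }
        int longer = Math.max(x.length(), y.length());
        return longer == 0 ? 1.0 : 1.0 - (double) d[x.length()][y.length()] / longer;
    }

    /** Runs both passes and returns the candidates sorted by descending score. */
    static List<Hit> search(String query, List<String> index) {
        List<Hit> hits = new ArrayList<>();
        for (String candidate : generateCandidates(query, index)) {
            hits.add(new Hit(candidate, pairwiseScore(query, candidate)));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

The point of the split is cost: the first pass must be fast enough to run against the whole index, so the more expensive pairwise comparison only runs on the short list it returns.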

Language support

RNI can match names in any language. For the languages listed in Fully supported text domains for name matching, RNI calculates a match score using a variety of techniques, as described in Understanding name match scores. For names in languages not listed in those tables, RNI provides limited support, as described in Language support parameters.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Documentation

This guide provides information on installing, running the sample applications included in RNI-RNT, setting up a development environment, and creating applications that use the runtime environment to incorporate RNI-RNT functionality.

The Java API is documented in HTML Javadoc pages generated from the source code, found at api-reference/index.html.

For instructions on using RNI-RNT with RLP, see Installing RLP with RNI.

Getting started

Requirements

Java SDK 11 through 19. RNI-RNT is tested with OpenJDK.
Apache Ant 1.7.1 or later to use the Ant build scripts we provide to build and run the samples.
The compressed SDK package file for your platform.
See Supported Platforms and RNI-RNT Package File Names.
The RNI-RNT documentation set includes the following:
- Release Notes with up-to-date information about new features and bug fixes in this release
- The RNI-RNT Application Developer's Guide (this document)
- Online reference to the Java API
The Rosette license file: rlp-license.xml.

Important

Unless otherwise specified, all inputs to RNI need to be UTF-8 encoded.

Verify that documents that have been copied from another system maintain UTF-8 encoding and have not been converted to another encoding scheme such as ASCII or UTF-16.
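One way to check an input before passing it to RNI is to decode it strictly as UTF-8 and reject anything malformed. The sketch below uses only the standard Java library; the class and method names are illustrative and are not part of the RNI API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Strict UTF-8 validation using the JDK decoder (illustrative helper, not an RNI class). */
final class Utf8Check {

    /** Returns true only if the bytes decode as UTF-8 without malformed sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a file, you might call `Utf8Check.isValidUtf8(Files.readAllBytes(path))`. Note that plain ASCII is itself valid UTF-8, so a check like this cannot tell you whether non-ASCII characters were lost in an earlier conversion; it only catches byte sequences that are not UTF-8 at all, such as UTF-16 data.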

Supported platforms

You must install an SDK package that is appropriate for your platform with respect to operating system and CPU. Since the public API for RNI-RNT is Java, the C++ compiler that appears in the following list is irrelevant.

Table 13. Supported Platforms

OS	CPU	Compiler	$BT_BUILD^[a]
Mac OS X v10.9+ (Darwin 13)	AMD64	xcode 5	amd64-darwin13-xcode5
Linux	AMD64	gcc 4.8	amd64-glibc217-gcc48
Linux	AARCH64	gcc 7.3	aarch64-glibc226-gcc73
Windows	AMD64	Visual Studio 2013	amd64-w64-msvc120
Java Only^[b]	n/a	n/a	jvm
^[a] `$BT_BUILD` is embedded in the name of the downloaded package. It is also the subdirectory name used in various locations for platform-specific files, such as binary library files.
^[b] The Java-only SDK runs on any OS and CPU with 64-bit Java SDK 11 through 19.

The compressed SDK package file names take the form:

rni-rnt-<version>-sdk-$BT_BUILD.<ext>

where <version> is the RNI-RNT version (in the format x.xx.x.cxx.x), $BT_BUILD is the platform designator from the table above, and <ext> is .zip for Windows or Java-only packages and .tar.gz for Unix platforms.

SDK Package Names for RNI-RNT

rni-rnt-<version>-sdk-amd64-darwin13-xcode5.tar.gz
rni-rnt-<version>-sdk-amd64-glibc217-gcc48.tar.gz
rni-rnt-<version>-sdk-aarch64-glibc226-gcc73.tar.gz
rni-rnt-<version>-sdk-amd64-w64-msvc120.zip
rni-rnt-<version>-sdk-jvm.zip
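The naming rule can be expressed as a small helper, for example in a download or install script. This is a sketch: the class name is hypothetical, and detecting Windows by the `-w64-` substring is an assumption based on the single Windows designator in the table.

```java
/** Builds SDK package file names from the pattern rni-rnt-<version>-sdk-$BT_BUILD.<ext>. */
final class SdkPackageName {

    /**
     * @param version the RNI-RNT version string
     * @param btBuild the platform designator ($BT_BUILD) from the Supported Platforms table
     */
    static String packageFileName(String version, String btBuild) {
        // Assumption: Windows designators contain "-w64-"; "jvm" is the Java-only SDK.
        boolean zip = btBuild.equals("jvm") || btBuild.contains("-w64-");
        return "rni-rnt-" + version + "-sdk-" + btBuild + (zip ? ".zip" : ".tar.gz");
    }
}
```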

Note

The version number is embedded in the package file name.

Documentation Files

RNI-RNT-<version>-api-reference.zip
RNI-RNT-<version>-ReleaseNotes.pdf
RNI-RNT-<version>-AppDevGuide.pdf

Installing RNI-RNT

When you obtain RNI-RNT, you should receive the following files:

The SDK package listed above for your platform: e.g., rni-rnt-<version>-sdk-amd64-glibc217-gcc48.tar.gz
The Rosette License: rlp-license.xml.

Expand the SDK into the install directory, which we will call $BT_ROOT, and copy the license to the $BT_ROOT/rlp/rlp/licenses subdirectory.

Once you have installed RNI-RNT, you can install RLP. See instructions for Installing RLP with RNI-RNT.

Note

For Windows users, you must add

\rlp\bin\*

to your PATH environment variable. In this case, you must replace * with the name of the subdirectory which contains the platform-specific binary library files (for example, amd64-w64-msvc120).

Note on logging

RNI uses the Simple Logging Facade for Java (SLF4J) to log RNI activities. See http://www.slf4j.org/.

SLF4J is a facade for various logging APIs. Using SLF4J, the developer or an administrator can determine which of many popular logging systems to use at runtime.

To select a logging system, include exactly one adapter jar on the classpath, such as slf4j-log4j-1.7.36.jar, for the logging system of your choice, along with the jar for that logging system (such as log4j-2.19.0.jar). You also need to include the SLF4J API jar, slf4j-api-1.7.36.jar, on the classpath.

By default, all activity is logged to the console. To log to a file and to control the level of logging, place an adapter jar, a logging library, an SLF4J API jar, and the appropriate properties file (e.g., log4j.properties if you are using log4j) on your classpath.

The adapter, logging, and API jars mentioned above are in samples/java/lib. A copy of log4j.properties, which is used by our samples, is in samples/java/logging. You should adjust the copy of log4j.properties that you place on your classpath to meet your specific runtime logging needs.

libpostal data directory

RNI uses libpostal to parse addresses; libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data.

RNI packages libpostal data in plugins/rni/bt_root/rlpnc/data/libpostal. The data directory is relatively large (about 2 GB). If you are certain that you will not use address matching of unfielded addresses, you can safely delete the libpostal data directory without affecting any other RNI functionality.

RLP with RNI-RNT

If you are using both RLP and RNI-RNT together, first install RNI-RNT. Next, install RLP in the same location. When you are installing RLP there will be some overlap of directories, which is expected. Allow the RLP directories and files to replace the existing directories and files.

Setting up your development environment

When building or running an RNI-RNT application, you must include the following JAR files on your classpath:

btrlpnc.jar
btcommon-api-<apiversion>.jar
btcommon-api-jackson-<jacksonversion>.jar
btcommon-lib-<libversion>.jar
icu4j-<icuversion>.jar

If using the seq2seq model for Katakana-English matching, you must also include the following JAR files on your classpath:

btrlpnc-seq2seq.jar
tensorflow-core-api-<tensorflowversion>-native.jar
If you need GPU support, replace the tensorflow file with the version compiled for your platform. macOS must be at version 10.13 or higher.
jna-<jnaversion>.jar

These files are in $BT_ROOT/rlpnc/lib/jvm.

For information about $BT_ROOT (the Basis root directory) and $BT_BUILD (the platform designator), see Installing RNI-RNT.

To use the Ant scripts described in Building and running the sample applications, make sure you have Ant (1.7.1 or later), the JAVA_HOME environment variable is set to the root of your Java SDK, and the Java SDK bin directory is on your PATH.

Handling the runtime environment

RNI uses data resources stored in the file system in standard locations relative to the Basis root directory ($BT_ROOT). Accordingly, you must follow a few basic rules when you are assembling an application that includes RNI functionality.

Prior to accessing the RNI API, you must set the Basis root directory.
RNI maintains singleton Environment objects that hold read-only shared data. Depending on the operations you perform, you may need to explicitly instantiate an Environment object before you perform them and close it when you are done.

Setting the Basis root directory

The API provides two ways of performing this action:

Use com.basistech.names.internal.Pathnames.setBTRootDirectory(String BT_ROOT).
Set the bt.root system property. You can do this from the command line when you launch the Java virtual machine:
```
java -Dbt.root=$BT_ROOT ...
```
where $BT_ROOT is the path to the Basis root directory.
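If you prefer to set the system property from code rather than on the command line, a sketch along the following lines may help. It uses only standard Java calls; the helper class is hypothetical, and it assumes the property is set before any RNI class reads it. It also returns the path where the license file is expected, so the caller can check that the file is present.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sets the bt.root system property from code (illustrative helper, not an RNI class). */
final class BasisRootSetup {

    /**
     * Validates the directory, sets bt.root, and returns the expected license location
     * (rlp/rlp/licenses/rlp-license.xml under the root).
     */
    static Path configureRoot(String btRoot) {
        Path root = Paths.get(btRoot).toAbsolutePath().normalize();
        if (!Files.isDirectory(root)) {
            throw new IllegalArgumentException("Not a directory: " + root);
        }
        // Assumption: this runs before RNI is first used, so RNI sees the property.
        System.setProperty("bt.root", root.toString());
        return root.resolve("rlp").resolve("rlp").resolve("licenses").resolve("rlp-license.xml");
    }
}
```

A caller could then test `Files.isRegularFile(licensePath)` and report a clear error early, instead of discovering a missing license on the first RNI call.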

You can also set up an overlay directory. This directory must have an identical structure to the normal root directory outside of the rlp/lib and rlp/bin directories. License files will only be considered from BT_ROOT and should not be moved over to the overlay root.

Important

If a location for this overlay directory is specified, either in Java with com.basistech.names.internal.Pathnames.setOverlayRootDirectory or the bt.overlay.root system property, RNI will look in that location for every data/configuration file instead of the root directory. If no location is specified, RNI will use the normal root directory.

Note

Libpostal data (controlled by the libpostalDataDirPath parameter, defaulting to rlpnc/data/libpostal) and word embedding data (rlpnc/data/tvec/filtered-vectors) will only be considered from BT_ROOT and should not be moved over as part of the overlay root.

Manipulating the environment

Before you use Rosette Name Translator (RNT), you must instantiate a com.basistech.rnt.RNTEnvironment object. For example:

RNTEnvironment rntEnv = new RNTEnvironment();

The RNTEnvironment uses data files stored in the file system according to the standard RLP release hierarchy. Accordingly, you must set the Basis root directory prior to instantiating RNTEnvironment.

By default, the RLP license is read from rlp/rlp/licenses/rlp-license.xml under your BT_ROOT directory. If your license is not in that location, use the setLicenseXML() method, which both RNIConfiguration and RNTEnvironment provide, to supply the license as a string.

When you have finished performing translations, you should close the RNTEnvironment object to free resources. For example:

rntEnv.close();

When you use Rosette Name Indexer, RNI-RNT instantiates an RNTEnvironment object as required. If RNI-RNT instantiates an RNTEnvironment object, it also closes it at the appropriate time.

A quick look at RNI-RNT: running a sample program

Building and running the sample applications

To build and run the sample applications, you must have the Java SDK (11 through 19). To use the Ant build files we provide to build and run the samples, you need Ant (1.7.1 or later) with the JAVA_HOME environment variable set to the root of your Java SDK. For more information, see http://ant.apache.org.

The source files for these applications and the Ant build file for compiling and running them (build.xml) are located in $BT_ROOT/rlpnc/samples/java.

Tip

The Ant scripts and build files require one input property: bt.arch=$BT_BUILD (bt.arch=amd64-glibc217-gcc48, for example). If you set this property in the script (build.xml), you do not need to include it on the command line.

Table 14. Sample Applications

Source File	Description
`AddNamesSample.java`	Adds names from a UTF-8 file to an RNI Index.
`LoadGazetteerSample.java`	Loads an XML gazetteer into an RNI Index.
`IndexQuerySample.java`	Submits a series of queries (names) to an index and reports on the results.
`DistributedTransactionSample.java`	Queries an index, deletes the names returned from that index, and adds the names to a second index. The deletions and additions are performed in a single distributed transaction with two-phase commit.
`MatchNamesSample.java`	Determines the similarity of two or more names.
`MatchPhenomenaSample.java`	Demonstrates the different name matching phenomena that RNI supports.
`AutomatedTranslationSample.java`	Translates one or more names.
`InteractiveTranslationSample.java`	Simulates a series of user interactions resulting in the translation of an Arabic name.
`RNISolrjSample.java`	Integrates RNI with Solr to add and query Solr documents with multiple and multivalued name fields.
`AddressIndexQuerySample.java`	Submits a series of queries (addresses) to an index and reports on the results.
`AddressMatchPhenomenaSample.java`	Demonstrates the different address matching phenomena that RNI supports.

Your License

You must copy the license file you obtained from BasisTech to $BT_ROOT/rlp/rlp/licenses. If the license is not in place, you cannot access any RNI-RNT functionality. The license defines the scope of the activities you may perform with RNI-RNT.

Using the Ant build script

Tip

Change directory to $BT_ROOT/rlpnc/samples/java and run Ant:

ant -Dbt.arch=$BT_BUILD target

where target is one of the Ant build targets in the following table.

`target`	Description
`compile`	Compiles the samples and places the class files in `$BT_ROOT/rlpnc/samples/java/obj/$BT_BUILD`.
compile.`Class`^[a]	Compiles the specified sample.
`run`	Compiles (if necessary) and runs the samples with the command-line arguments defined in the Ant build file. Each sample prints a message to the console indicating what it has done, including any file it has created.
run.`Class` ^[a]	Runs the specified sample.
`clean`	Removes the class files and any files created by the samples.
clean.`Class` ^[a]	Removes the sample class file(s) and any file created by the sample.
`all`	Calls `compile` and `run`.
^[a] `Class` is the sample class name. Use the `Class` targets to compile, run, or clean a single sample. For example, to run `LoadGazetteerSample`, the `target` is `run.LoadGazetteerSample`.

As you create your own applications, you can use the Ant build file as the starting point for establishing your own build procedures.

Matching names

RNI provides a Java API for matching names across the boundaries of writing scripts. For the complete list of the languages and writing scripts that name matching supports, see Supported Text Domains for Rosette Name Indexer and Name Matching.

In the RNI context, name matching means comparing two names, performing linguistic analysis, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two names are. A value of 1.0 is returned if and only if the two names are identical (the strings, languages, languages of origin, and entity types match). A score of less than 1.0 is returned for names that potentially match but differ in one or more ways.

Interpreting RNI scores

Names are complex to match because of the large number of variations that occur within a language and across languages. RNI breaks a name into tokens and compares the matching tokens. RNI can identify variations between matching tokens including, but not limited to, typographical errors, phonetic spelling variations, transliteration differences, initials, and nicknames.

RNI scores range from 0 to 1. The higher the score, the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical (including all name properties).

The match score is a relative indication of how similar the match is; it is not an absolute value. When comparing different name matches, how the scores rank relative to one another is more meaningful than the actual values. Similar name matches in different languages may generate different match scores. To understand how RNI calculates the score, see Understanding name match scores.

Scores less than 1.0 for similar names indicate the query name and index name vary with respect to one or more properties (such as language of origin) and/or one or more of the following:

Variation	Example(s)
Phonetic and/or spelling differences	Nayif Hawatmeh and Nayif Hawatma
Missing name components	Mohammad Salah and Mohammad Abd El-Hamid Salah
Rarity of a shared name component	Two English names that contain Ditters are more likely to match than two names that contain Smith
Initials	John F. Kennedy and John Fitzgerald Kennedy
Nicknames	Bobby Holguin and Robert Holguin
"Cousin" or cognate names	Pedro Calzon and Peter Calzon
Uppercase/Lowercase	Rosa Elena PACHECO and Rosa Elena Pacheco
Reordered name components	Zedong Mao and Mao Zedong
Variable Segmentation	Henry Van Dick and Henri VanDick, Robert Smith and Robert JohnSmyth
Corresponding name fields	For [Katherine][Anne][Cox], the similarity with [Katherine][Ann][Cox] is higher than the similarity with [Katherine Ann][Cox]
Truncation of name elements	For Sawyer, the similarity with Sawy is higher than the similarity with Sawi.

Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

You can configure RNI to customize how it scores different match phenomena.

The score weighting associated with a token may vary depending on the token's characteristics, such as the frequency with which it appears in the language model (the more frequent, the lower the weighting).

Entity types

The entityType field identifies the type of name being matched and is used to select the algorithms for matching. Where supported, stop words and override files are specific to an entity type. Parameters can be set for specific languages and entity types.

Important

The entityType should always be specified to utilize all available methods when indexing and matching names. If you don't specify an entityType, the type PERSON will be used.

Table 15. Entity Types

Type	Description	Features
PERSON	A human identified by name, nickname, or alias.	Values are tokenized and token pairs are compared. Stop words, overrides, frequency and gender models are supported.
LOCATION	A city, state, country, region or other location.	Values are tokenized and token pairs are compared. Stop words, overrides, and frequency models are supported.
ORGANIZATION	A corporation, institution, government agency, or other group of people defined by an established organizational structure.	Values are tokenized and token pairs are compared. Stop words, overrides, frequency models, and embeddings are supported. Real World IDs are supported.
IDENTIFIER, IDENTIFIER:DRIVERS_LICENSE, IDENTIFIER:LICENSE_PLATE, IDENTIFIER:NATIONAL_ID_NUM	An alphanumeric identifier.	Values are not tokenized. The entire identifier is treated as a string. Scoring is primarily by string edit distance.

Names with data fields

By using a string array (such as String[] nameData = {"John", "Smith"};), you can create a name with data fields. The maximum number of data fields is 5. We assign no explicit semantics to each field (such as given name or surname), but the order of the fields does matter when comparing two names that have fields. RNI assigns lower scores to matches that cross field boundaries (e.g., the first field in one name matches the second field in another name). The use of fields may enhance accuracy when you are performing queries and matches with PERSON names in languages where standard name ordering is not the norm. By dictating a consistent name ordering, you can avoid penalties for mis-ordered tokens.

For consistency, you may want to adopt a paradigm for name fields, such as {title, given names, surname, suffix}. Include empty fields in the appropriate position for names that do not contain all these elements. If a trailing field is empty, you can leave it out. For example:

{"Mr", "John Miles", "Doe", "Jr"}
{"Queen", "Elizabeth", "", "II"}
{"Mr", "Anthony Charles", "Blair"}
{"Ms", "Rosanne Christine", "Atwood"}
{"", "Martin Luther", "King", "Jr"}

Note

When scoring a potential match between a name with data fields and a name without data fields, RNI treats the name without data fields as if it were a name with one data field.

RNI treats trailing empty fields as if they were not present. For example, {"Rosanne", "Taylor Smith",""} is treated the same as {"Rosanne", "Taylor Smith"}.

Alternatively, you have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with Name.UNKNOWN_FIELD_MARKER.
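The field conventions above (at most five fields, trailing empty fields ignored, interior empty fields kept as placeholders) can be sketched with plain string arrays. The helper below is illustrative only and does not call the RNI API; RNI applies the trailing-field rule itself, so a helper like this is mainly useful for keeping your own records consistent before you build Name objects.

```java
import java.util.Arrays;

/** Plain-Java illustration of the data field conventions (not an RNI class). */
final class NameFieldsSketch {

    /** The maximum number of data fields stated in this guide. */
    static final int MAX_FIELDS = 5;

    /** Drops trailing empty fields; interior empty fields keep their position. */
    static String[] normalizeFields(String... fields) {
        if (fields.length > MAX_FIELDS) {
            throw new IllegalArgumentException("At most " + MAX_FIELDS + " data fields are allowed");
        }
        int end = fields.length;
        while (end > 0 && fields[end - 1].isEmpty()) {
            end--;
        }
        return Arrays.copyOf(fields, end);
    }
}
```

For example, `normalizeFields("Rosanne", "Taylor Smith", "")` yields the two-field array `{"Rosanne", "Taylor Smith"}`, while `normalizeFields("Queen", "Elizabeth", "", "II")` keeps all four fields because the empty surname is not trailing.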

Name matching usage model

Identify two names to compare. They may be in different languages (languages of use) and writing scripts.

Use MatchScorer to score the similarity of two Name objects. MatchScorer and Name are in the com.basistech.rni.match package.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_2names.java

For the Arabic name نايف أبو شرخ and its IC transliteration Nayif Abu-Sharakh, this comparison returns a score of 0.99.

If you want to compare one name to many names, for improved efficiency you can cache the scorer with the one name (the query name) and use the cached scorer to compare that name to multiple names. As illustrated in the following code snippet, you must prepare each name that you use with the cached scorer.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/match_1name_tomany.java

For a sample Java application that matches two names and matches a query name against multiple reference names, see MatchNamesSample.

Configuring name matching

There are many ways to configure RNI to better fit your use case and data. The two primary mechanisms are by modifying match parameters and editing overrides. You can also train a custom language model.

Tuning match parameters

The default values of the RNI match parameters are tuned to perform well on most queries and datasets. However, every use case uses different data with distinct match requirements. You can modify match parameters to optimize match results for your data and business case.

The typical process for tuning parameters is as follows:

Gather a list of names to index and queries to run against them to use as a set of test data. Ideally the test data set should be big enough to reflect the diversity in your real data with at least 100 queries.
After indexing the data, run the queries using RNI and determine a match score threshold that appears to provide the best results.
Analyze the results to discover cases that RNI failed to score high enough or that RNI incorrectly scored higher than the threshold.
Choose a subset of these name pairs that RNI scored too low or too high that will be used as examples to tune your parameters.
Tune the match parameters to change the match scores of the test set of undesirable results, so that each score is correctly above or below your threshold. For name or address pairs that have to match in a specific way and are very dissimilar (e.g., aliases), we recommend you add them as token or full-name overrides.
Run the large set of queries through RNI again to test that the new parameter values still return the desired matches, and not new undesired results.
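The analysis step of this process, counting the pairs that land on the wrong side of a candidate threshold, can be sketched as follows. The class names and the labeled-pair structure are assumptions for illustration; the scores would come from your own RNI runs.

```java
import java.util.List;

/** Counts threshold errors over labeled test pairs (illustrative, not an RNI class). */
final class ThresholdEvaluation {

    /** A match score for one test pair plus the expected outcome from your labeled data. */
    static final class ScoredPair {
        final double score;
        final boolean shouldMatch;
        ScoredPair(double score, boolean shouldMatch) {
            this.score = score;
            this.shouldMatch = shouldMatch;
        }
    }

    /** Returns {falseNegatives, falsePositives} for the given threshold. */
    static int[] countErrors(List<ScoredPair> pairs, double threshold) {
        int falseNegatives = 0;
        int falsePositives = 0;
        for (ScoredPair pair : pairs) {
            boolean predictedMatch = pair.score >= threshold;
            if (pair.shouldMatch && !predictedMatch) {
                falseNegatives++;   // scored too low
            } else if (!pair.shouldMatch && predictedMatch) {
                falsePositives++;   // scored too high
            }
        }
        return new int[] {falseNegatives, falsePositives};
    }
}
```

Running this for several candidate thresholds, before and after each parameter change, gives a simple regression check for the final step: a change that fixes your chosen examples should not raise the error counts on the full test set.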

Parameter configuration files

Individual name tokens are scored by a number of algorithms or rules. These algorithms can be optimized by modifying configuration parameters, thus changing the final similarity score.

The parameter files are contained in two .yaml files located in plugins/rni/bt_root/rlpnc/data/etc. The parameters are defined in parameter_defs.yaml and modified in parameter_profiles.yaml.

parameter_defs.yaml lists each match parameter along with its default value and a description. Each parameter may also have a minimum and maximum value; these are system limits, and exceeding them can cause an error. A parameter may also have a recommended minimum (sane_minimum) and recommended maximum (sane_maximum) value, which we advise you not to exceed.
parameter_profiles.yaml is where you change parameter values based on the language pairs in the match.

Important

Do not modify the parameter_defs.yaml file. All changes should be made in the parameter_profiles.yaml file.

Do refer to the parameter_defs.yaml file for definitions and usage of all available parameters.

Parameter profiles

The parameters in the parameter_profiles.yaml file are organized by parameter profiles. Each profile contains parameter values for a specific language pair. For example, matching "Susie Johnson" and "Susanne Johnson" will use the eng_eng profile. There is also an any profile which applies to all language pairs.

Parameter profiles have the following characteristics:

Parameter profile names are formed from the language pairs they apply to. The three-letter language codes are always written in alphabetical order, except for English (eng), which always comes last. The two languages can be the same. Examples:
- spa_eng
- ara_jpn
- eng_eng
They can include the entity type being matched, such as eng_eng_PERSON. The parameter values in this profile will only be used when matching English names with English names, where the entity type is specified as PERSON. Any entity type listed in the table can be used.
Parameter profiles can inherit mappings from other parameter profiles. The global any profile applies to all languages; all profiles inherit its values.
The any profile can include an entity type; any_PERSON applies to all PERSON matches regardless of language.
Specific language profiles inherit values from global profiles. The profile matching person names is named any_PERSON. The profile for matching Spanish person against English person names is named spa_eng_PERSON. It inherits parameter values from the spa_eng profile and the any_PERSON profile. The any_PERSON profile will not override parameter values from more specific profiles, such as the spa_eng profile.
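The naming and inheritance rules above can be summarized in a short sketch. It is an illustration of the rules as stated in this section, not RNI's implementation: the class and method names are hypothetical, and the exact lookup order (any, then any_TYPE, then the language pair, then the language pair with entity type) is an assumption derived from the statement that any_PERSON does not override more specific profiles such as spa_eng.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of profile naming and inheritance as described in this guide (not an RNI class). */
final class ParameterProfileSketch {

    /** Builds a profile name: codes in alphabetical order, except that eng always comes last. */
    static String profileName(String langA, String langB, String entityType) {
        boolean swap;
        if (langA.equals("eng")) {
            swap = true;                        // eng always goes last
        } else if (langB.equals("eng")) {
            swap = false;                       // eng is already last
        } else {
            swap = langA.compareTo(langB) > 0;  // otherwise alphabetical order
        }
        String pair = swap ? langB + "_" + langA : langA + "_" + langB;
        return entityType == null ? pair : pair + "_" + entityType;
    }

    /** Profiles from least to most specific; later entries win when values collide. */
    static List<String> lookupOrder(String langA, String langB, String entityType) {
        String pair = profileName(langA, langB, null);
        List<String> order = new ArrayList<>();
        order.add("any");
        if (entityType != null) order.add("any_" + entityType);
        order.add(pair);
        if (entityType != null) order.add(pair + "_" + entityType);
        return order;
    }

    /** Merges the applicable profiles into the effective parameter values. */
    static Map<String, Object> resolve(Map<String, Map<String, Object>> profiles,
                                       String langA, String langB, String entityType) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (String name : lookupOrder(langA, langB, entityType)) {
            Map<String, Object> profile = profiles.get(name);
            if (profile != null) {
                effective.putAll(profile);      // more specific values replace general ones
            }
        }
        return effective;
    }
}
```

With this sketch, `profileName("eng", "spa", "PERSON")` gives `spa_eng_PERSON`, and a value set in spa_eng takes precedence over the same parameter set in any_PERSON.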

Important

Global changes are made with the any profile.

Any changes to address parameters should go under the any profile, and will affect all fields for all addresses.

Any changes to date parameters must go under the any profile.

Parameter universe

A parameter universe is a named set of RNI parameter profiles with values. Each universe has a name and can contain multiple parameter profiles, including the global any profile. A parameter universe profile can also include the entity type being matched, just like regular parameter profiles.

For example, the MyParameterUniverse universe may include the following parameter profiles:

"name": "MyParameterUniverse/any" applies to all language pairs.
"name": "MyParameterUniverse/spa_eng" applies to English - Spanish name pairs.
"name": "MyParameterUniverse/spa_eng_PERSON" applies to all PERSON English - Spanish name pairs.

Each parameter in the profile must match the name of a parameter declared in the parameter_defs.yaml file, along with a value. Parameter universes are added to the parameter_profiles.yaml file.

A parameter universe can also be defined dynamically. We recommend that you use dynamic parameter universes for testing and tuning only. For production use, add all parameter universes to the parameter_profiles.yaml file.

Tip

You can define multiple named parameter profiles.

Define the parameter universe in the parameter_profiles.yaml file. Example:

parameterUniverseOne/spa_eng_PERSON:
    reorderPenalty: 0.4
    HMMUsageThreshold: 0.8
    stringDistanceThreshold: 0.1
    useEditDistanceTokenScorer: true
parameterUniverseOne/eng_eng:
    reorderPenalty: 0.6

Modifying name parameters

To start tuning the parameters, run the RNI pairwise match on the test set and look at the match reasons in the response. These match reasons will serve as a guide for which parameters to tune, which are defined in parameter_defs.yaml. For additional support on tuning the parameters, contact support@rosette.com.

Once you define a profile and set a parameter value, rerun the RNI pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Selected name parameters

Given the large number of configurable name match parameters in RNI, you should start by looking at the impact of modifying a small number of parameters. The complete definition of all available parameters is found in the parameter_defs.yaml file.

Parameter	Description	Behavior
`conflictScore`	The score that is assigned to unmatched conflict tokens	Increasing leads to higher final score
`initialsConflictScore`	The score that is assigned to unmatched conflict initials	Increasing leads to higher final score
`initialsScore`	The score that is assigned to an initial matching a token	Increasing leads to higher final score
`initialismScore`	Score assigned to initialism matching a name	Increasing leads to higher final score
`stuckInitialScore`	Score applied when initial is “stuck” to previous token	Increasing leads to higher final score
`deletionScore`	Score applied to an unmatched token when surrounding tokens are matched	Increasing leads to higher final score
`outOfOrderDeletionScore`	Score applied to an unmatched token when surrounding tokens are also unmatched	Increasing leads to higher final score
`reorderPenalty`	Penalty applied to matching tokens with different positions	Increasing leads to lower final score
`initialsDeletionPenalty`	Multiplier on token deletion score when deleted token is an initial	Increasing leads to higher final score
`genderConflictPenalty`	Penalty applied when name genders don’t match	Increasing leads to lower final score
`crossLanguageGenderConflictPenalty`	Penalty applied when name genders of different languages don’t match	Increasing leads to lower final score
`boostWeightAtRightEnd`	Boost applied to tokens at the right end of the name (i.e. surnames in English)
`boostWeightAtLeftEnd`	Boost applied to tokens at the left end of the name (i.e. given names in English)
`boostWeightAtBothEnds`	Boost applied to tokens at either end of the name (i.e. less weight for middle names in English)
`adjustOneSideDeletionScores`	Multiplier on the token deletion score when all deleted tokens are on one side of the name	Increasing leads to higher score
`reorderCorrection`	Boost to final score if one name’s tokens are a reordering of the other's	Increasing leads to higher final score
`finalBias`	Helps normalize scores	Increasing leads to a higher score for ALL names

The following examples describe the impact of parameter changes in more detail.

Example 11. Token Conflict Score (conflictScore)

Consider the two names ‘John Mike Smith’ and ‘John Joe Smith’. The token ‘John’ from the first name is matched with ‘John’ from the second, as is the token ‘Smith’ from each name. This leaves the unmatched tokens ‘Mike’ and ‘Joe’. These two tokens are in direct conflict with each other, and you can determine how the conflict is scored. A value closer to 1.0 will treat ‘Mike’ and ‘Joe’ as equal. A value closer to 0.0 will have the opposite effect. This parameter is important when you decide that names with dissimilar tokens should have lower final scores. Alternatively, you may decide that if two of the tokens are the same, the third token (perhaps a middle name) is not as important.
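To see the direction of the effect, consider a deliberately simplified toy model in which the final score is the unweighted mean of the per-token scores. This is not how RNI computes scores (RNI weights tokens and applies other parameters); the class and method are hypothetical and exist only to show why raising conflictScore raises the final score.

```java
/** Toy model only: final score as the plain mean of token scores. Not RNI's algorithm. */
final class ConflictScoreToy {

    /** Scores 'John Mike Smith' against 'John Joe Smith' under the toy model. */
    static double toyFinalScore(double conflictScore) {
        double john = 1.0;              // 'John' matches 'John'
        double smith = 1.0;             // 'Smith' matches 'Smith'
        double middle = conflictScore;  // 'Mike' and 'Joe' are unmatched conflict tokens
        return (john + middle + smith) / 3.0;
    }
}
```

Under this toy model, a conflictScore of 0.1 gives a final score of 0.7, while a conflictScore of 0.9 gives roughly 0.97, which mirrors the "increasing leads to higher final score" behavior listed in the table.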

Example 12. Initials Score (initialsScore)

Consider the following two names: 'John Mike Smith' and 'John M Smith'. 'Mike' and 'M' trigger an initial match. You can control how this gets scored. A value closer to 1.0 will treat ‘Mike’ and ‘M’ as equal and increase the overall match score. A value closer to 0.0 will have the opposite effect. This parameter is important when you know that many names in your data sets are written with initials.

Example 13. Token Deletion Score (deletionScore)

Consider the following two names: ‘John Mike Smith’ and ‘John Smith’. The name token ‘Mike’ is left unpaired with a token from the second name. In this example a value closer to 1.0 will not penalize the missing token. A value closer to 0.0 will have the opposite effect. This parameter is important when you have a lot of variation of token length in your name set.

Example 14. Token Reorder Penalty (reorderPenalty)

This parameter is applied when tokens match but are in different positions in the two names. Consider the following two names: ‘John Mike Smith’ and ‘John Smith Mike’. This parameter will control the extent to which the token ordering (‘Mike Smith’ vs. ‘Smith Mike’) decreases the final match score. A value closer to 1.0 will penalize the final score, driving it lower. A value closer to 0.0 will not penalize the order. This parameter is important when the order of tokens in the name is known. If you know that all your name data stores last name in the last token position, you may want to penalize token reordering more by increasing the penalty. If your data is not well-structured, with some last names first but not all, you may want to lower the penalty.

Example 15. Right End Boost/Left End Boost/Both Ends Boost (boostWeightAtRightEnd, boostWeightAtLeftEnd, boostWeightAtBothEnds)

These parameters boost the weights of tokens in the first and/or last position of a name. They are useful when you are dealing with English names and are confident of the placement of the surname. Consider the following two names: ‘John Mike Smith’ and ‘John Jay M Smith’. By boosting both ends you effectively give more weight to the ‘John’ and ‘Smith’ tokens. These parameters are important when you have several tokens in a name and are confident that the first and last tokens are the more important tokens.

The parameters boostWeightAtRightEnd and boostWeightAtLeftEnd should not be used together.

Language support parameters

RNI currently has two levels of language support: complete and limited. Complete support uses a comprehensive set of algorithms to calculate match scores. The languages and scripts with complete support are listed in Fully supported text domains for name matching. For all other languages, RNI has limited support.

Note

Prior to release 7.36.0, RNI did not support the limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Limited support uses two match score computations:

Exact matches return a score of 1. This is the same for all languages.
A score is calculated based on string edit distance.
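The two computations above can be illustrated with a minimal sketch. It assumes a Levenshtein edit distance normalized by the length of the longer string. This document does not state the exact formula RNI uses, so the class name, method names, and normalization are assumptions made for illustration only.

```java
// Illustrative sketch only: not the RNI implementation.
class LimitedSupportScoreSketch {

    // Classic Levenshtein edit distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Exact matches score 1; otherwise the score shrinks as the edit distance grows.
    // The normalization (1 - distance / longer length) is an assumption of this sketch.
    static double score(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        int maxLen = Math.max(a.length(), b.length());
        return 1.0 - (double) editDistance(a, b) / maxLen;
    }
}
```

With this normalization, two three-letter names that differ in one character would score about 0.67, and identical names always score 1.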

Two parameters control the level of language support.

Table 16. Language Support Parameters

Parameter	Description	Default
`allLanguageSupport`	When set to `true`, all languages are supported.	`true`
`limitedLanguageEditDistance`	When set to `true`, edit distance match scores are enabled for limited support languages. `allLanguageSupport` must be `true`.	`true`

Neural model for matching

When matching Japanese names in Katakana to English names, you can replace the HMM with a neural model. This model should improve accuracy, but will have an impact on performance.

To enable the neural model, set enableSeq2SeqTokenScorer to true in the jpn_eng profile in the parameter_profiles.yaml file. This applies to Japanese names in Katakana only. Japanese names in other scripts will still use the HMM.

To use the neural model:

Extract the appropriate library files from the platform-specific tensorflow JAR provided in the rni-es-<version>-seq2seq-libraries.zip bundle.
Start Elasticsearch with additional Java properties, including one that points to the directory containing the extracted libraries:
```
 ES_JAVA_OPTS="-Dorg.bytedeco.javacpp.cacheLibraries=false -Djava.library.path=<path-to-extracted-libraries>"
```

Note

The neural model is currently only available on macOS and Linux platforms in RNI-ES versions 7.10.2.x and all plugins including RNI-RNT 7.38.1.67.0 or later.

Matching Korean names

If your data includes a lot of Korean names written in Han script mixed in with Chinese and/or Japanese names, you may want to enable Korean readings. This is only used when the language (languageOfUse) of the document is not specified for each request. The following steps may increase accuracy for Korean names, at the cost of decreased throughput.

To enable Korean readings of names in Han script you need to edit the parameter files as follows:

Edit the zho_eng profile in the internal_param_profiles.yaml file and remove kor from the list in the ignoreTranslationOrigins parameter.
Edit the zho_eng profile in the parameter_profiles.yaml file to increase the alternativePairsToCheck parameter by 1 to compensate for the additional reading.

Matching names with Han characters

We've added experimental support to leverage mechanisms within the Unicode data to improve matching of Han characters.

The four-corner system is a method for encoding Chinese characters using four numerical digits per character. The digits encode the shapes found in the corners of the symbol, from the top-left to the bottom-right. While this does not uniquely identify a Chinese character, it does limit the list of possibilities.

The parameter haniFourCornerCodeMismatchPenalty applies a penalty if the names have different four-corner codes. By default, haniFourCornerCodeMismatchPenalty is set to 0, which turns it off. Experiments have shown positive accuracy improvements when setting the value of the parameter to 1.

To enable the feature, add the following lines to your parameter_profiles.yaml file:

zho_zho_PERSON:  
  haniFourCornerCodeMismatchPenalty: 1

Note

This is an experimental feature. As with any experimental feature, we highly recommend experimenting in your environment with your data.

Matching Turkish and Vietnamese names

Vietnamese and Turkish each have their own detector, and both detectors are disabled by default. If your data includes Turkish and/or Vietnamese names, you must enable the respective detector.

Edit the parameter_profiles.yaml file.

To enable Turkish detection, add:

detectableLanguagesRuleBased:
  [tur]

To enable Vietnamese detection, add:

detectableLanguagesRuleBased:
  [vie]

Restart the system.

Evaluating parameter configuration

To evaluate the newly tuned parameter values, query a large dataset of names or addresses that does not include your test set. For an exact evaluation, query an annotated dataset that includes the correct answers for a number of queries. For a general evaluation, measure the number of pair matches that have scores above your threshold, compared to before tuning the parameter values. If there were too many matches before, now there should be fewer matches. If there were too few matches before, there should be more now. If the number of matches increases or decreases dramatically, then there is a higher chance of missing correct matches below the threshold or including incorrect matches above the threshold.

If you find new pair matches that you want to score above or below your threshold, collect them into a test set to retune the parameters. Then evaluate the parameters again using a large dataset to review results. It is important to frequently evaluate new parameter settings on separate test data to ensure the parameters continue to return correct results.
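The general evaluation described above can be sketched as a count of pair scores above the threshold before and after tuning. The class and method names are hypothetical, and the scores would come from your own pairwise or index queries.

```java
import java.util.List;

// Illustrative sketch only: compares how many pair scores clear a threshold.
class ThresholdCountSketch {

    // Number of scores strictly above the threshold.
    static long countAbove(List<Double> scores, double threshold) {
        return scores.stream().filter(s -> s > threshold).count();
    }

    // Relative change in the number of matches after tuning (0.5 means 50% more).
    // Returns 0 when there were no matches before, to avoid dividing by zero.
    static double relativeChange(List<Double> before, List<Double> after, double threshold) {
        long b = countAbove(before, threshold);
        long a = countAbove(after, threshold);
        return b == 0 ? 0.0 : (double) (a - b) / b;
    }
}
```

A large relative change in either direction is the signal, noted above, that correct matches may be falling below the threshold or incorrect matches rising above it.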

Configuring name overrides

RNI includes override files (UTF-8 encoded) to improve name matching. There are different types of override files:

Stop patterns and stop word prefixes designate name elements to strip during indexing and queries, and before running any matching algorithms.
Name pair matches specify scores to be assigned for specified full-name pairs.
Token pair overrides specify name token pairs that match along with a match score.
Token normalization files specify the normalized form for tokens and variants to normalize to that form.
Low weight tokens specify parts of names (such as suffixes) that don't contribute much to name matching accuracy.

The name matching override files are in the plugins/rni/bt_root/rlpnc/data/rnm/ref/override directory.

You can modify these files and add additional files in the same subdirectory to extend coverage to additional supported languages. You can also create files that only apply to a specified entity type, such as PERSON.

Stop patterns and stop word prefixes

Before running any matching algorithms, the names are transformed into tokens that can be compared. RNI uses stop patterns and stop word prefixes to remove patterns, including titles such as Mr., Senator, or General, that you do not want to include in name matching. Both stop patterns and stop word prefixes are used to strip matching name elements during indexing and querying. Stop words are string literals and are processed much more quickly than stop patterns, which are regular expressions. You should use stop words for the most efficient removal of prefixes, such as titles. Stop words are language-dependent.

For each name, RNI performs the following steps in order:

Character-level normalization, stripping punctuation (except for periods, commas, and hyphens). White space is reduced to single spaces and all characters are lower-cased. Diacritical marks are removed.
Stop patterns are applied.
Stop words are applied.

RNI cycles its way through the stop patterns then the stop words, each cycle removing the patterns and words that strip nothing, until the list of stop patterns and stop words is empty.
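The steps above can be sketched roughly as follows. This is an approximation written for illustration, not RNI's code: the class and method names are hypothetical, stop words are treated as whole leading words followed by a space, and the exact normalization rules RNI applies may differ.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative sketch only: normalization followed by cycles of stop patterns and stop words.
class StopStrippingSketch {

    // Step 1: remove diacritics, strip punctuation except . , -, collapse whitespace, lower-case.
    static String normalize(String name) {
        String s = Normalizer.normalize(name, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        s = s.replaceAll("[\\p{P}&&[^.,\\-]]", "");
        s = s.replaceAll("\\s+", " ").trim();
        return s.toLowerCase(Locale.ROOT);
    }

    // Steps 2 and 3, repeated: entries that strip nothing are dropped after each cycle,
    // and the loop ends when no entries remain.
    static String strip(String name, List<Pattern> stopPatterns, List<String> stopWords) {
        String current = normalize(name);
        List<Pattern> patterns = new ArrayList<>(stopPatterns);
        List<String> words = new ArrayList<>(stopWords);
        while (!patterns.isEmpty() || !words.isEmpty()) {
            List<Pattern> keptPatterns = new ArrayList<>();
            for (Pattern p : patterns) {
                String next = p.matcher(current).replaceAll("").replaceAll("\\s+", " ").trim();
                if (!next.equals(current)) {
                    keptPatterns.add(p);
                    current = next;
                }
            }
            List<String> keptWords = new ArrayList<>();
            for (String word : words) {
                if (current.startsWith(word + " ")) {
                    current = current.substring(word.length() + 1).trim();
                    keptWords.add(word);
                }
            }
            patterns = keptPatterns;
            words = keptWords;
        }
        return current;
    }
}
```

Because every entry that survives a cycle has shortened the name, the loop always terminates.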

Stop Pattern

A stop pattern is a regular expression that excludes matching name elements during indexing and queries. You can use any regular expression supported by Java's java.util.regex.Pattern class; see the Javadoc for detailed documentation.

Stop patterns for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopregexes_LANG[_TYPE].txt

where LANG is a three-letter language code.

Each row in the file, except for rows that begin with #,^[8] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at the beginning and end as needed.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name, matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Name elements matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern. For example, the brigadier[-]general stop pattern is applied first, but general is also a stop pattern and will be applied as well.

RNI includes files with stop patterns for names in English (generic and ORGANIZATION), Japanese (PERSON), Spanish (generic), and Chinese (PERSON). These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. The generic (non-entity-specific) English file is stopregexes_eng.txt. For example, the entries

^fnu\b
\blnu$

indicate that the common indicators for first-name-unknown at the start of a name and last-name-unknown at the end of a name are to be removed.

You can also specify which field the regex is to be applied to when processing a fielded name. After the regex, add a tab followed by n, where n is the field number. To search multiple fields, include an entry for each field, as illustrated below. When processing a name without fields, the field parameter is ignored. For example,

\blnu$    2
\blnu$    3

indicates that the regex is to be applied to fields 2 and 3 in fielded names.
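As an illustration of the file format described above, the following sketch reads stop pattern rows into compiled patterns. The class, the Entry type, and the convention of using 0 for "no field given" are hypothetical; RNI's own loader is not shown in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: parses rows of a stopregexes_LANG[_TYPE].txt style file.
class StopRegexFileSketch {

    static final class Entry {
        final Pattern pattern;
        final int field; // 0 means no field number was given (a convention of this sketch)

        Entry(Pattern pattern, int field) {
            this.pattern = pattern;
            this.field = field;
        }
    }

    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            if (raw.startsWith("#")) {
                continue; // comment row
            }
            String line = raw.trim(); // leading and trailing whitespace is not part of the regex
            if (line.isEmpty()) {
                continue;
            }
            int field = 0;
            int tab = line.lastIndexOf('\t');
            if (tab >= 0 && line.substring(tab + 1).matches("\\d+")) {
                field = Integer.parseInt(line.substring(tab + 1));
                line = line.substring(0, tab).trim();
            }
            entries.add(new Entry(Pattern.compile(line), field));
        }
        return entries;
    }
}
```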

You can modify the contents of this file. To add stop patterns for a different language, create an additional UTF-8 file in the same subdirectory with the three-letter language code in the filename. For example, stopregexes_ara.txt would include regular expressions with Arabic text; stopregexes_eng_PERSON.txt would include regular expressions to remove elements from PERSON names in English text.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop Word Prefixes

A stop word prefix is a string literal that strips the matching prefix from name elements during indexing and querying.

Stop word prefixes for a given language are specified in a UTF-8 file with the ISO 639-3 three-letter language code in the filename:

stopprefixes_LANG[_TYPE].txt

where LANG is a three-letter language code. Each row in the file, except for rows that begin with #, is a string literal. Prefixes matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes contained within the longer stop word. For example, the lieutenant colonel stop word prefix is applied where applicable when colonel is also a stop word prefix.
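The precedence rule can be illustrated by trying the prefixes from longest to shortest. This is a minimal sketch with hypothetical names; it assumes the name has already been normalized and that a prefix is followed by a space.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: longer stop word prefixes are tried before shorter ones.
class LongestPrefixSketch {

    static String stripPrefix(String name, List<String> stopPrefixes) {
        List<String> ordered = new ArrayList<>(stopPrefixes);
        // Sort by descending length so that "lieutenant colonel" is tried before "lieutenant".
        ordered.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String prefix : ordered) {
            if (name.startsWith(prefix + " ")) {
                return name.substring(prefix.length() + 1);
            }
        }
        return name;
    }
}
```

With the prefixes lieutenant, colonel, and lieutenant colonel, the name lieutenant colonel john smith is reduced to john smith in one step, rather than to colonel john smith.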

RNI includes files with generic stop word prefixes for names in Arabic, English, Greek, Hebrew, Hungarian, Khmer, Spanish, Thai, Turkish, and Vietnamese. These files are in plugins/rni/bt_root/rlpnc/data/rnm/ref/override. You can modify the contents of these files. To add stop word prefixes for another language, create a UTF-8 file in the same directory with the three-letter language code in the filename. For example, stopprefixes_rus.txt would include stop word prefixes for use with Russian text.

Overriding name pair matches

You can create UTF-8 text files that specify the scores to be assigned for specified full-name pairs. The filename uses the ISO 639-3 three-letter language codes to designate the language of each full name in each of the full-name pairs:

fullnames_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first name and LANG2 is the three-letter language code for the second name.

Tip

Include _TYPE, where TYPE designates an entity type, such as PERSON, if you want the override to apply only if the name (for stop patterns), matching names, or matching tokens have been assigned this entity type. If the filename does not include _TYPE, it will be applied to all names, regardless of the entity type.

Each row in the file, except for rows that begin with #, is a tab-delimited full-name pair and score:

name1 Tab name2 Tab score

The scores must be between 0 and 1.0, where 0 indicates no match, and 1.0 indicates a perfect match.

Tip

Since the minimum score for names returned by RNI queries must be greater than 0, an RNI query will not return the name if the override score is 0. Name match operations, on the other hand, will return an override score of 0.

The installation includes a sample file with sample entries commented out: plugins/rni/bt_root/rlpnc/data/rnm/ref/override/fullnames_eng_eng.txt. Any non-commented-out entries in this file assign scores to English queries applied to English names in an RNI index. For example,

John Doe	Joe Bloggs	1.0

indicates that the query name John Doe matches the index name Joe Bloggs (both used in different regions to indicate 'person unknown') with a score of 1.0.

These match patterns are commutative. The previous entry also specifies a match score of 1.0 if the query name is Joe Bloggs and the index includes a document with an rni_name field containing John Doe.
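The file format and the commutative lookup can be sketched as follows. The class and method names are hypothetical, and the sketch compares names as exact strings, whereas RNI may normalize names before applying overrides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: loads name1 <Tab> name2 <Tab> score rows and looks them up in either order.
class FullNameOverrideSketch {

    private final Map<String, Double> scores = new HashMap<>();

    // Order-independent key, so that (a, b) and (b, a) share one entry.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    void load(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("#") || line.trim().isEmpty()) {
                continue; // comment or blank row
            }
            String[] cols = line.split("\t");
            if (cols.length != 3) {
                continue; // skip malformed rows in this sketch
            }
            double score = Double.parseDouble(cols[2].trim());
            if (score < 0.0 || score > 1.0) {
                throw new IllegalArgumentException("Score must be between 0 and 1.0: " + line);
            }
            scores.put(key(cols[0].trim(), cols[1].trim()), score);
        }
    }

    // Returns the override score, or null when no override exists for the pair.
    Double lookup(String queryName, String indexName) {
        return scores.get(key(queryName, indexName));
    }
}
```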

You can add entries for English to English name matches to fullnames_eng_eng.txt, and create additional override files, using the filename to specify the languages. For example, the following entries could appear in fullnames_jpn_eng.txt:

外山恒	   Toyama Koichi    1.0
ヒラリークリントン    Hillary Clinton    1.0

Overriding token pair matches

You can create text files that specify token (name-element) pairs that match. Token pair overrides are supported^[9] for English-English, Japanese-English, Chinese-English, Russian-English, Spanish-English, Japanese-Japanese, Russian-Russian, English-Korean, Korean-Korean, Spanish-Spanish, Greek-English and Hungarian-English token pairs. Such pairs may include proper name and nickname, such as Peter and Pete, and cognate names such as Peter and Pedro. When RNI evaluates two names, each of which contains an element from the pair, it enhances the value of the resulting name match score. For example, if Abigail and Abby constitute a token pair, then the match score for Abigail Harris and Abby Harris will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

tokens_LANG1_LANG2[_TYPE].txt

where LANG1 is the three-letter language code for the first token in each pair and LANG2 is the three-letter language code for the second token in each pair. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0 or an indicator that at least one of the tokens is a nickname or that the tokens are cognates:

Token1 Tab Token2 Tab [[0.0-1.0]|NICKNAME|COGNATE|VARIANT]

A token pair override score (raw score or indicator) serves as a minimum score, but you can write "/force" after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [([0.0-1.0]|NICKNAME|COGNATE|VARIANT)/force]

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force". If you do not include NICKNAME, COGNATE, VARIANT, or SUPPRESS, RNI assumes NICKNAME.
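The row syntax described above can be illustrated with a small parser. The class and field names are hypothetical. The sketch records only what the row says; the numeric scores that RNI associates with NICKNAME, COGNATE, and VARIANT are not given in this document and are not modeled.

```java
// Illustrative sketch only: parses one row of a tokens_LANG1_LANG2[_TYPE].txt style file.
class TokenOverrideSketch {

    final String token1;
    final String token2;
    final String indicator; // NICKNAME, COGNATE, or VARIANT; null when a raw score is given
    final Double rawScore;  // raw score between 0.0 and 1.0; null when an indicator is given
    final boolean forced;   // true: exact score; false: the override is a minimum score

    private TokenOverrideSketch(String t1, String t2, String indicator, Double rawScore, boolean forced) {
        this.token1 = t1;
        this.token2 = t2;
        this.indicator = indicator;
        this.rawScore = rawScore;
        this.forced = forced;
    }

    static TokenOverrideSketch parse(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least two tab-delimited tokens: " + line);
        }
        String spec = cols.length > 2 ? cols[2].trim() : "";
        boolean forced = spec.endsWith("/force");
        if (forced) {
            spec = spec.substring(0, spec.length() - "/force".length());
        }
        if (spec.isEmpty()) {
            spec = "NICKNAME"; // default when no score or indicator is present
        }
        if (spec.equals("SUPPRESS")) {
            return new TokenOverrideSketch(cols[0], cols[1], null, 0.0, true); // alias for 0.0/force
        }
        if (spec.equals("NICKNAME") || spec.equals("COGNATE") || spec.equals("VARIANT")) {
            return new TokenOverrideSketch(cols[0], cols[1], spec, null, forced);
        }
        double score = Double.parseDouble(spec);
        if (score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("Raw score must be between 0.0 and 1.0: " + line);
        }
        return new TokenOverrideSketch(cols[0], cols[1], null, score, forced);
    }
}
```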

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/tokens_eng_eng.txt, which contains a list of English/English token pairs. For example:

Peter    Pete    NICKNAME
Peter    Pedro   COGNATE

This directory also contains Chinese to English token overrides for LOCATION and ORGANIZATION: tokens_zho_eng_LOCATION.txt, tokens_zho_eng_ORGANIZATION.txt.

When you create an additional file in the same location, use the ISO 639-3 three-letter language code in the filename to identify the language of each name element in the pair. For example, tokens_eng_eng.txt indicates that the contents match English names to English names; tokens_eng_eng_ORGANIZATION.txt indicates that the contents match English ORGANIZATION names to English ORGANIZATION names. The SDK includes a sample file for matching English/English tokens in LOCATION entities: tokens_eng_eng_LOCATION.txt.

We recommend that you enter the language names in alphabetical order in the filename and token pairs. Keep in mind that the order has no influence on the resulting score, since the scoring is commutative.

Multiple sets of token overrides

There may be situations in which you want to define multiple sets of token overrides for an index. This can be accomplished by combining override file names with the overrideSelector parameter.

The value of overrideSelector is an alphanumeric string, and it controls which set of overrides will be considered during querying and matching. The value is case-insensitive. By default, it will read overrides for the "default" selector.
The value of overrideSelector can be appended to the name of the override text file containing the token pairs, preceded by a dash (-). For example, a file for person name overrides in English-English matching using the overrideSelector of OverrideGroup1 would be named:
```
tokens_eng_eng_PERSON-OverrideGroup1.txt
```
If no valid selector name is found in the override text file filename, overrides for that file will be applied to the "default" selector.
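The naming rule can be sketched as a small helper that extracts the selector from an override filename. The class and method names are hypothetical, and the regular expression is one reading of "an alphanumeric string preceded by a dash"; RNI's own parsing may differ.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: derives the selector from an override filename.
class OverrideSelectorSketch {

    // A dash, then an alphanumeric selector, then the .txt extension at the end of the name.
    private static final Pattern SELECTOR = Pattern.compile("-([A-Za-z0-9]+)\\.txt$");

    static String selectorOf(String fileName) {
        Matcher m = SELECTOR.matcher(fileName);
        // Selectors are case-insensitive, so they are lower-cased here for comparison.
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : "default";
    }
}
```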

Note

Overrides that are associated with a specific selector are not additive to the base overrides. If a custom overrideSelector value is specified, RNI will only consider overrides in that specific selector. As with the base overrides, for a given selector, RNI will consider non-entity-type overrides for that selector if no entity-type-specific override pair is found for that selector.

Normalizing token variants

You can create text files that specify the normalized form for tokens (name elements) and variants to normalize to that form. The file name indicates the language and optionally the entity type for the tokens to be normalized:

equivalenceclasses_LANG[_TYPE].txt

For example, equivalenceclasses_jpn.txt would contain entries for normalizing Japanese token variants for any entity type to a normalized form.

Each entry in the file contains a normalized form followed by one or more variant forms. The syntax is as follows:

[normal_form1]
variant1_1
variant1_2
variant1_3
[normal_form2]
variant2_1
variant2_2
variant2_3
...

RNI includes plugins/rni/bt_root/rlpnc/data/rnm/ref/override/equivalenceclasses_eng_PERSON.txt, which contains a list of variant renderings to normalize to muhammad:

[muhammad]
mohammed
mahamed
mohamed
mohamad
mohammad
muhammed
muhamed
muhammet
muhamet
md
mohd
muhd

You can add lists of variants to this file, including the normalized form in square brackets to start each list.
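The bracketed syntax can be illustrated with a sketch that maps each variant to its normalized form. The class and method names are hypothetical, and the sketch only skips blank lines; how RNI treats other rows is not described here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: reads [normal_form] headers followed by variant rows.
class EquivalenceClassSketch {

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> variantToNormal = new HashMap<>();
        String normal = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                normal = line.substring(1, line.length() - 1); // start of a new list
            } else if (normal != null) {
                variantToNormal.put(line, normal);
            }
        }
        return variantToNormal;
    }

    // Tokens without an entry are returned unchanged.
    static String normalizeToken(String token, Map<String, String> variantToNormal) {
        return variantToNormal.getOrDefault(token, token);
    }
}
```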

Unimportant tokens

You can edit the list of tokens that are given low influence in RNI. These low weight tokens are parts of a name (such as suffixes) that don't contribute much to the name matching accuracy.

The file name is lowWeightTokens_LANG.txt.

For example, plugins/rni/bt_root/rlpnc/data/rnm/ref/lowWeightTokens_eng.txt contains entries for tokens in English that you may want to put less emphasis on: "jr", "sr", "ii", "iii", "iv", "de".

Matching organizations with real world IDs

Organizations and companies often have nicknames which are very different from the company's official name. For example, International Business Machines, or IBM, is known by the nickname Big Blue. As there is no phonetic similarity between the two names, a match query between those two organization names would result in a low score. A real world identifier associates companies, along with their associated nicknames and permutations, with an identifier. When enabled, a search between two company names will include a comparison between the real world identifiers for the two names, thus matching dissimilar names for the same corporate entity.

RNI contains real world identifiers for corporations, which pair an entity ID with nicknames and common permutations of the corporation name. The languages with provided real world ID dictionaries are listed in Name matching within a language. Customers can also generate their own real world ID dictionaries to supplement the provided dictionaries.

Table 17. Real World ID Parameters

Parameter	Description	Default
`useRealWorldIds`	Enables real world IDs and indexes them as corporation names are added to the index. You must reindex if you enable this parameter after indexing.	`true` (enabled)
`doQueryRealWorldIds`	Enables querying with real world IDs; set by language pair.	`true` (enabled)
`realWorldIdScore`	Sets the match score when two names match due to matching real world IDs. Set by language pair.	`0.98`
`nameRealWorldQueryBoost`	Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair.	`35`

Building a real world ID file

Many companies have their own file of organizations with their different names. To improve matching between organization names, you can supplement the real world IDs provided in RNI and build your own file of real world IDs. The provided builder creates a binary file named <LANG>_ORGANIZATION_ids.bin in the specified output directory, where <LANG> is the three-letter language code of the file.

The input file is a tab separated file (.tsv). Each line contains an organization name and a corresponding alphanumeric ID. The file can only contain a single language and script. You must create a separate file for each language.

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92
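To show what the input file expresses, the following sketch loads name and ID rows and checks whether two names share an ID. It is a conceptual illustration with hypothetical names; it does not reproduce the builder or the binary format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: two names are related when they share a real world ID.
class RealWorldIdSketch {

    // Maps each organization name to the set of IDs listed for it.
    static Map<String, Set<String>> load(List<String> lines) {
        Map<String, Set<String>> idsByName = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // skip malformed rows in this sketch
            }
            idsByName.computeIfAbsent(cols[0].trim(), k -> new HashSet<>()).add(cols[1].trim());
        }
        return idsByName;
    }

    static boolean shareId(Map<String, Set<String>> idsByName, String a, String b) {
        Set<String> idsA = idsByName.getOrDefault(a, Collections.emptySet());
        Set<String> idsB = idsByName.getOrDefault(b, Collections.emptySet());
        return !Collections.disjoint(idsA, idsB);
    }
}
```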

Unzip the file realWorldIDBuilder.zip found in the plugins/rni/bt_root directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.

Omit real world IDs

You may want to use real world ID matching even if there are some entities which you do not want to match via real world IDs. You can omit specific organizations and QIDs (Wikidata's identifier for entities) from matching by creating an omit file listing the organization names and QIDs you would like to omit.

The omit file is a tab separated file (.tsv) named <LANG>_ORGANIZATION_ids.tsv where <LANG> is the three-letter language code of the file. Each omit file can only contain names in one language and separate files must be made for each language. There are three types of lines that can appear in an omit file, which have different effects on omission: pairs, lone names, and lone QIDs.

Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
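The three line types can be sketched as three sets that are consulted before a name and QID are used for matching. The class and method names are hypothetical, and the sketch assumes tab-separated columns as described above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: interprets pair, lone name, and lone QID rows of an omit file.
class OmitFileSketch {

    private final Set<String> omittedPairs = new HashSet<>();
    private final Set<String> omittedNames = new HashSet<>();
    private final Set<String> omittedIds = new HashSet<>();

    void load(List<String> lines) {
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 2) {
                continue; // skip malformed rows in this sketch
            }
            String name = cols[0].trim();
            String qid = cols[1].trim();
            if (name.equals("*")) {
                omittedIds.add(qid);                 // lone QID
            } else if (qid.equals("*")) {
                omittedNames.add(name);              // lone name
            } else {
                omittedPairs.add(name + "\t" + qid); // pair
            }
        }
    }

    // True when the name may still be matched through the given QID.
    boolean mayUse(String name, String qid) {
        return !omittedNames.contains(name)
                && !omittedIds.contains(qid)
                && !omittedPairs.contains(name + "\t" + qid);
    }
}
```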

To enable an omit file in RNI:

Place the omit file in the BT_ROOT directory.
Open omit_ids.datafiles, which is in the plugins/rni/bt_root/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:
```
ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
```
Save omit_ids.datafiles.

Custom language model training

You can train a language model on your own name data. RNI uses language models in which common names score differently than rare names. For example, "John Jingleheimer" should match "Jingleheimer" better than "John", because Jingleheimer is a rarer name than John. RNI already comes with language models for many supported languages, but you might find it best to train a new language model so that it reflects the statistics of your data. Note that a large number of full names is required to train an effective language model.
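The idea that rare names carry more weight than common ones can be illustrated with a toy unigram model. Everything in this sketch is an assumption made for illustration: the class name, the add-one smoothing, and the use of negative log probability as a weight. The internal form of RNI's language models is not described in this document.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a toy token-frequency model in which rarer tokens weigh more.
class UnigramWeightSketch {

    private final Map<String, Integer> counts = new HashMap<>();
    private long total = 0;

    // Counts every whitespace-separated token in the training names.
    void train(List<String> fullNames) {
        for (String name : fullNames) {
            for (String token : name.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
                total++;
            }
        }
    }

    // Add-one smoothed negative log probability: the rarer the token, the larger the weight.
    double weight(String token) {
        int count = counts.getOrDefault(token.toLowerCase(Locale.ROOT), 0);
        return -Math.log((count + 1.0) / (total + counts.size() + 1.0));
    }
}
```

Trained on a list in which John appears more often than Jingleheimer, the sketch assigns Jingleheimer the larger weight, which mirrors the example above.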

Installation

Unpack frequencyModelTrainer.zip to any desired location. Ensure that the JAVA_HOME environment variable is set and points to Java 11 or higher.

Simple usage example

bin/buildLM.sh -root rni-rnt -in eng_PER_LM.tsv \
  -out rni-rnt/data/rnm/ref/user_models/eng_PERSON_unigram.bin \
  -lang eng -script Latn

See README.txt in frequencyModelTrainer.zip for more details, including the full description of arguments.

Indexing and querying names

The Rosette Name Indexer (RNI) enables high-speed, scalable, cross-language, and cross-script searches for names.

RNI uses the Apache Lucene full-text search engine to store names with their search keys and a key index. RNI updates and queries with Lucene are transactional.

When you search for a name, RNI generates a search key for each component of the name, locates all the names indexed by those search keys, and uses linguistic matching algorithms to filter that set of names down to the most similar names.

For a list of the languages and writing scripts that RNI supports, see Fully supported text domains for name matching.

RNI provides a Java API that you can use to embed it in your applications. The RNI classes are in com.basistech.rni.index. Unqualified class names that appear in this section are in com.basistech.rni.index.

For detailed information about the API, see the Java API Reference shipped with RNI.

Using Rosette Name Indexer

Note

If you have not already done so, you must set the Basis root directory.

Constructing a name index

A name index is an indexed list of names. The list includes a collection of Name objects and associated keys.

The Name object includes the name, language,^[10] and script (script and language are inferred if they are not included in the name definition). It may also include entity type (such as person or place), language of origin, and additional information (with place names, for example, you may want to store the geocoordinates).

Tip

You can also create an index in memory that is never stored on disk.

To create an indexed list of names on disk, you must specify a pathname for the data store, and you must use an IndexStoreDataModelFlags object (the default is fine).

Example:

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_index.java

Once the index is created, use NameBuilder to create Name objects and add them to the index. NameBuilder provides a fluent interface that supports method chaining. The following fragment illustrates the syntax for creating and adding a name to the index.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_name.java

When you are finished adding names, close the name index, as in the preceding fragment.

Note

NameBuilder also includes static methods that you can use for determining the language and script for a name prior to creating the Name object: guessLanguage(String nameData) and guessScript(String nameData).

You can use hintLanguage(com.basistech.util.LanguageCode hintLanguage) to suggest the language when you create a Name. The NameBuilder uses the suggestion if it is compatible with the script, otherwise it uses its own language guess.

When you are adding a large number of names to an index, you can use an INameIndexSession object to batch these additions into a single transaction. A single transaction is faster than adding each name in a separate transaction. For information, see RNI Sessions and Transactions, and for a sample application that adds multiple names in a single transaction, see AddNamesSample.

Querying a name index

Once you have an index created, you can use queries to search the index for similar names.

Opening a name index

The primary role of a name index is to perform queries. You can also perform updates (insertions and deletions).

StandardNameIndex provides a static method for opening a name index.

INameIndex index = StandardNameIndex.open(String indexPathname);

indexPathname is the path to the directory that contains the name index.

To optimize the index for more efficient queries, call

index.optimize();

When you are done using the name index, you must close it:

index.close();

Defining a name search query

A query includes a Name object and may also include settings to constrain the query. For example, the query can specify the entity type, language, and/or script of the names that it returns. For the details, see the Javadoc for com.basistech.rni.index.IndexStoreDataModelFlags.

You can also define a query to return all the names associated with a specified entity.

Set up a NameIndexQuery object. For example:

// Define a query.
NameIndexQuery defineQuery(Name queryName)
        throws NameIndexException, NameIndexStoreException, RNTException {
    NameIndexQuery query = new NameIndexQuery(queryName);
    query.setNameDataMinimumMatchScore(0.30);
    return query;
}

Running the query and accessing the query results

INameIndex includes a query method that takes as its parameter the defined NameIndexQuery.

The query returns a NameIndexQueryResult iterator. Each NameIndexQueryResult object provides a Name object and a similarity score. As the following fragment illustrates, you can obtain and process each name and its score. The higher the score (greater than 0 and less than or equal to 1), the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query name string and result name string are identical. The types of variations matched by RNI are described in Name Variations. Scoring is commutative: the scores for two given names are always the same, regardless of which name is in the index and which name is in the query.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/query_index.java

SpanMatches. Each query result may contain information about spans (one or more tokens) in the query name that match or do not match spans in each result name. The NameIndexQueryResult provides a MatchResult object, which in turn provides match type and a list of SpanMatch objects. For more information, see the Javadoc for com.basistech.rni.match.SpanMatch and com.basistech.rni.match.Span. The Javadoc for MatchResult#getSpanMatches() provides information about the scope and limitations on what is returned for names in various text domains.

Cleanup

When you are done running queries, close the index:

index.close();

Sample

For a sample Java application that defines a query, runs the query, and reports the results, see IndexQuerySample.

Retrieving groups of names

You may want to retrieve a group of names that share some common characteristic other than name similarity. Perhaps you even want to retrieve all the names in an RNI index.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/name_groups.java

The query returns all the names for which the Extra field contains the token used in the query.

Optimizing query performance

By adjusting NameIndexQuery parameters, you can optimize queries for your use case.

Tradeoffs between accuracy and speed

RNI passes a subset of the highest scoring names from the first-pass high-recall search to the second-pass high-precision filter. The namesToCheckAllowance and maximumNamesToCheck parameters can be adjusted to control how many names are included in that subset.

maximumNamesToCheck: The maximumNamesToCheck parameter sets a hard limit on the number of names passed to the high-precision filter for each query. Use it to control the maximum query latency. The appropriate value is largely determined by the size of your index and should increase as your index grows.
namesToCheckAllowance: The namesToCheckAllowance parameter is a value between 0.0 and 1.0 used at query time to dynamically calculate the most efficient number of names to pass to the high-precision filter based on the commonality of the query name in the index. When set to 1.0, the value of maximumNamesToCheck is used for every query. After determining a good value for maximumNamesToCheck, adjust this parameter to fine-tune the performance.

In general, for greater speed and less accuracy (particularly recall), decrease the value of these parameters using:

setNamesToCheckAllowance(double namesToCheckAllowance)
setMaximumNamesToCheck(int maxNamesToCheck)

For greater recall and less speed, increase those settings.

To pass all names found by the high-recall search to the high-precision filter, set:

namesToCheckAllowance to 1.0
maximumNamesToCheck to NameIndexQuery.UNLIMITED_RESULTS.

Optimizing for Duplicate Names. If your index contains duplicate names, you should use setMaximumNamesToConsider(int maxNamesToConsider) to set the maximum number of names to consider to a value higher than the maximum number of names to check. RNI returns the maximum names to consider in the first-pass high-recall search and sends the maximum names to check to the second-pass high-precision filter. If there are any duplicates in the names returned by the first pass, the duplicates are not passed to the second-pass. In other words, the score assigned by the second pass to the first instance of a given name is assigned to its duplicates without spending time sending them through the second pass. For optimal behavior, the ratio of maximumNamesToConsider to maximumNamesToCheck should be approximately the same as the average number of times that a name is repeated in the RNI index. So, for example, if each name is entered twice (on average), maximumNamesToConsider should be twice as big as maximumNamesToCheck. If your index does not include duplicates, you can use IndexStoreDataModelFlags to set optimizeDuplicateNames to false (the default setting is true), in which case RNI does not perform this optimization procedure.
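
The ratio rule above can be worked through as a short calculation. The following is a minimal sketch, assuming you can list the names in your index as plain strings; DuplicateRatio and its methods are hypothetical helpers, not part of the RNI API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of the RNI API): estimates a value for
// maximumNamesToConsider from the average number of times a name repeats.
class DuplicateRatio {

    // Average repetitions = total names / distinct names.
    static double averageRepetitions(List<String> names) {
        if (names.isEmpty()) {
            return 1.0;
        }
        Set<String> distinct = new HashSet<>(names);
        return (double) names.size() / distinct.size();
    }

    // Scales maximumNamesToCheck by the average repetition count, rounding up.
    static int suggestedMaximumNamesToConsider(int maximumNamesToCheck, List<String> names) {
        return (int) Math.ceil(maximumNamesToCheck * averageRepetitions(names));
    }

    public static void main(String[] args) {
        // Each name is entered twice, so the suggested value is twice maximumNamesToCheck.
        List<String> names = List.of("John Smith", "John Smith", "Mary Jones", "Mary Jones");
        System.out.println(suggestedMaximumNamesToConsider(500, names));
    }
}
```

The resulting value is what you would pass to setMaximumNamesToConsider(int maxNamesToConsider).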

Constraints on maximum settings. maximumNamesToCheck and maximumResultsToReturn must be less than or equal to maximumNamesToConsider. As described above, maximumNamesToCheck may be less than maximumResultsToReturn. Accordingly, the order in which you make these settings is important. For example, you cannot set maximumResultsToReturn to a value higher than maximumNamesToConsider, so you may need to reset maximumNamesToConsider before you can reset maximumResultsToReturn.
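
The ordering issue can be illustrated with a small model of the constraints. This is a sketch only: QueryLimits is a hypothetical class, not NameIndexQuery, and throwing IllegalArgumentException on a violation is an assumption made for illustration.

```java
// Hypothetical model of the documented constraints. It is not the RNI
// NameIndexQuery class, and the exception type is an assumption.
class QueryLimits {
    private int maximumNamesToConsider;
    private int maximumNamesToCheck;
    private int maximumResultsToReturn;

    QueryLimits(int consider, int check, int results) {
        if (check > consider || results > consider) {
            throw new IllegalArgumentException("check and results must not exceed consider");
        }
        this.maximumNamesToConsider = consider;
        this.maximumNamesToCheck = check;
        this.maximumResultsToReturn = results;
    }

    void setMaximumNamesToConsider(int value) {
        if (value < maximumNamesToCheck || value < maximumResultsToReturn) {
            throw new IllegalArgumentException("consider must not drop below check or results");
        }
        maximumNamesToConsider = value;
    }

    void setMaximumNamesToCheck(int value) {
        if (value > maximumNamesToConsider) {
            throw new IllegalArgumentException("check must not exceed consider");
        }
        maximumNamesToCheck = value;
    }

    void setMaximumResultsToReturn(int value) {
        if (value > maximumNamesToConsider) {
            throw new IllegalArgumentException("results must not exceed consider");
        }
        maximumResultsToReturn = value;
    }

    int getMaximumResultsToReturn() {
        return maximumResultsToReturn;
    }
}
```

In this model, raising maximumResultsToReturn from 100 to 2000 on limits of (1000, 500, 100) fails until maximumNamesToConsider has first been raised to at least 2000.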

To simulate a high-recall search with perfect recall:

Retrieve all names in the index as described in Retrieving Groups of Names.
Apply the high-precision filter to each name by matching it against the query with a MatchScorer (see Matching Names).

This is not recommended for a production environment due to the high amount of computation such a procedure requires, but it can be useful during development to identify recall errors (false negatives) made by the high-recall search but not the high-precision filter.

Tradeoffs between false positives and false negatives

For fewer false positives (bad matches) and more false negatives (missing good matches) in your query results, you can:

increase the minimum match score that a candidate must reach to be returned
decrease the number of results that are returned (candidates with the highest scores are included)

The default minimum match score is NameIndexQuery.DEFAULT_MINIMUM_MATCH_SCORE. To reset this threshold, use setNameDataMinimumMatchScore(double nameDataMinimumMatchScore), where nameDataMinimumMatchScore is greater than 0 and less than or equal to 1.

The default maximum number of results to return is NameIndexQuery.DEFAULT_MAXIMUM_RESULTS_TO_RETURN. To reset this value, use setMaximumResultsToReturn(int maximumResultsToReturn).

To return an unlimited number of results, use setMaximumResultsToReturn(NameIndexQuery.UNLIMITED_RESULTS).
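
The effect of the two settings can be sketched outside of RNI. The example below assumes that a candidate is kept when its score is greater than or equal to the threshold; ScoredName and ResultFilter are hypothetical classes, and the sketch does not model NameIndexQuery.UNLIMITED_RESULTS.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a scored query result; not an RNI class.
final class ScoredName {
    final String name;
    final double score;

    ScoredName(String name, double score) {
        this.name = name;
        this.score = score;
    }
}

class ResultFilter {
    // Keeps candidates that reach the minimum score, then returns the
    // highest-scoring ones, up to maximumResults.
    static List<ScoredName> filter(List<ScoredName> candidates,
                                   double minimumMatchScore,
                                   int maximumResults) {
        List<ScoredName> kept = new ArrayList<>();
        for (ScoredName candidate : candidates) {
            if (candidate.score >= minimumMatchScore) {
                kept.add(candidate);
            }
        }
        kept.sort(Comparator.comparingDouble((ScoredName s) -> s.score).reversed());
        return new ArrayList<>(kept.subList(0, Math.min(maximumResults, kept.size())));
    }
}
```

Raising minimumMatchScore or lowering maximumResults both shrink the returned list, which trades false positives for false negatives.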

RNI sessions and transactions

In addition to using the INameIndex API for performing operations on an RNI Index, you can use the INameIndexSession API for finer-grained control. Sessions allow a set of operations to happen atomically (all occur or nothing occurs), and, especially for write operations, more efficiently. For those familiar with relational databases and SQL, the RNI concept of a session is similar to the JDBC concept of a connection with auto-commit mode off.

To start a session, call INameIndex.openSession().
To end the session, call close() on the resultant INameIndexSession object.

While INameIndexSession provides many of the same operations as INameIndex, such as query() and addName(), the difference is when changes to the index become permanent. INameIndex update operations are immediately flushed to disk, but INameIndexSession operations are not made permanent until you call commit(). At any time, you can invoke rollback() to undo all the operations since the last commit(). If you call rollback() before ever calling commit(), all of the operations of the session are undone.

You can run multiple sessions concurrently by having multiple threads call openSession() on the same INameIndex object. When multiple sessions are acting concurrently in separate threads, they are logically isolated from each other in order to not interfere with each other's operations. The isolation level is equivalent to READ COMMITTED, as outlined in the SQL-1992 Specification. This guarantees that one session will not see any uncommitted changes to the index performed by another session. In addition, a session will not see any uncommitted changes that it has made itself. For example, if a session adds a name to the index and then searches for that name before committing, it will not find the name it has added. You can also perform INameIndex auto-commit operations in the midst of one or more sessions; each INameIndex update or query is performed in its own session.

The session objects themselves are thread-safe; a session object may be shared by multiple threads.
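
The commit, rollback, and isolation behavior described above can be modeled with a toy in-memory index. This is a sketch of the semantics only: ToyIndex and ToySession are hypothetical classes and do not reflect how INameIndex or INameIndexSession are implemented.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the documented session semantics; not an RNI class.
class ToyIndex {
    private final Set<String> committed = new HashSet<>();

    ToySession openSession() {
        return new ToySession(this);
    }

    synchronized boolean contains(String name) {
        return committed.contains(name);
    }

    synchronized void apply(List<String> pending) {
        committed.addAll(pending);
    }
}

class ToySession {
    private final ToyIndex index;
    private final List<String> pending = new ArrayList<>();

    ToySession(ToyIndex index) {
        this.index = index;
    }

    synchronized void addName(String name) {
        pending.add(name);
    }

    // Queries read committed data only, so pending adds are invisible,
    // even to the session that made them.
    synchronized boolean query(String name) {
        return index.contains(name);
    }

    // Makes all pending operations permanent.
    synchronized void commit() {
        index.apply(pending);
        pending.clear();
    }

    // Discards everything since the last commit.
    synchronized void rollback() {
        pending.clear();
    }
}
```

In this model, a name added in a session is not found by query() until commit() is called, and a rollback() before any commit() leaves the index unchanged.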

The INameIndexSession API is recommended for doing bulk adds to the index. It is much more efficient to create a single session for adding all the names of a bulk add than to use the INameIndex API. The following fragment shows an example.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/add_names.java

A Sample. For a sample application that adds multiple names in a single transaction, see AddNamesSample.

Local vs. Distributed Transactions. A local transaction is a set of operations performed atomically (all occur or nothing occurs) on a single index. A distributed transaction is a set of operations performed atomically on multiple data sources, such as a relational database and an RNI index. All the operations on all the data sources must take place, or none of the operations take place.

For local transactions, use the INameIndexSession API, as illustrated above. The transaction object is managed internally and is not visible to the user.

In order to participate in a distributed transaction, an INameIndexTransaction object must be created from the session by calling INameIndexSession.startTransaction(). This transaction object is linked with the session internally. There is a division of labor between the two objects: the session object can only be used for adding/removing/searching, and the transaction object can only be used for committing or rolling back. A typical use case would be to provide the session object to the user application while handing over the transaction object to a transaction manager.

One side effect of this division of labor between the session and transaction objects is that a session cannot call commit() or rollback() once it is associated with a distributed transaction. These operations are only allowed by the linked transaction object. Specifically, after calling INameIndexSession.startTransaction(), you should not call INameIndexSession.commit(). You must call INameIndexTransaction.commit() instead.

A session can be associated with multiple distributed transactions, one at a time. When the work for one transaction is finished, you may call INameIndexSession.startTransaction() again to start a new one.

Two-Phase Commit. INameIndexTransaction supports two-phase commits, a standard protocol for managing transactions robustly among multiple data sources. INameIndexTransaction provides the prepare(), commit(), and rollback() operations necessary for a transaction manager to effectively execute the protocol. RNI does not include a transaction manager.

The following simplified example illustrates the use of INameIndexTransaction in a distributed transaction with a two-phase commit. In this example, both transactions are RNI transactions.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/distributed_transaction.java

A Sample. For a sample application that illustrates a distributed transaction with a two-phase commit involving two RNI indexes, see DistributedTransactionSample.
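
Because RNI does not include a transaction manager, it may help to see what the coordinating side of a two-phase commit looks like in outline. The following is a minimal sketch under stated assumptions: Participant is a hypothetical interface, and the boolean vote returned by prepare() is an assumption rather than the INameIndexTransaction signature. A production transaction manager also needs logging and recovery, which are omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical participant contract for this sketch.
interface Participant {
    boolean prepare();   // true means "ready to commit"
    void commit();
    void rollback();
}

class TwoPhaseCoordinator {
    // Returns true if every participant committed, false if all were rolled back.
    static boolean run(List<? extends Participant> participants) {
        // Phase 1: ask every participant to prepare.
        for (Participant p : participants) {
            boolean ready;
            try {
                ready = p.prepare();
            } catch (RuntimeException e) {
                ready = false;
            }
            if (!ready) {
                // Any "no" vote aborts the whole transaction.
                for (Participant q : participants) {
                    q.rollback();
                }
                return false;
            }
        }
        // Phase 2: every participant voted yes, so commit them all.
        for (Participant p : participants) {
            p.commit();
        }
        return true;
    }
}

// Simple participant that records the calls it receives.
class RecordingParticipant implements Participant {
    final List<String> calls = new ArrayList<>();
    private final boolean vote;

    RecordingParticipant(boolean vote) {
        this.vote = vote;
    }

    public boolean prepare() {
        calls.add("prepare");
        return vote;
    }

    public void commit() {
        calls.add("commit");
    }

    public void rollback() {
        calls.add("rollback");
    }
}
```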

Multithreading

No more than one INameIndex object may exist for a given name index on disk at any time.

Queries and updates may be performed in multiple threads on a single INameIndex object.

One write session at a time

While a write session (which may be shared by multiple threads) is open, all other writing sessions (including optimization) are blocked. If there is an operation that is expected to take a long time (e.g., batch document adds or calls to optimize), care should be taken to ensure it is the only active writing session. If a write attempt needs to wait too long, a timeout exception is thrown, and the transaction is aborted.

Matching organizations with real world IDs

Table 18. Real World ID Parameters

Parameter	Description	Default
`useRealWorldIds`	Enables real world IDs; the real world IDs are indexed as corporation names are added to the index. You must reindex if you enable this parameter after indexing.	`true` (enabled)
`doQueryRealWorldIds`	Enables querying with real world IDs; set by language pair.	`true` (enabled)
realWorldIdScore	Sets the match score when two names match due to matching real world IDs. Set by language pair.	0.98
nameRealWorldQueryBoost	Boosts the value of the real world ID results from the first pass. Increases the likelihood of real world ID matches being returned from the first pass. Set by language pair.	35

Building a real world ID file

IBM    WE1X92
Big Blue    WE1X92
International Business Machines    WE1X92

Unzip the file realWorldIDBuilder.zip found in the plugins/rni/bt_root directory and run the build command. Instructions on how to run the program are in the README.md file in the zip file.

Omit real world IDs

Pair: A name and a QID on the same line. The QID will no longer be used for matching against the name. The same name can be associated with multiple QIDs to omit by placing each pair on its own line.
Lone name: A name followed by an asterisk in the QID column. The name will not be used at all for RWID matching.
Lone QID: A QID is preceded by an asterisk in the name column. No names in the specified language will be able to match against each other using this QID.

Example:

IBM    Q37156
Nintendo    *
*    Q45700
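
To check an omit file before enabling it, you can classify each line into the three entry types listed above. The sketch below assumes the name and QID columns are separated by a tab or by a run of two or more spaces; OmitFileParser is a hypothetical helper, not an RNI class.

```java
// Hypothetical helper for validating omit file lines; not part of RNI.
class OmitFileParser {
    enum Kind { PAIR, LONE_NAME, LONE_QID }

    static final class Entry {
        final Kind kind;
        final String name; // null for a lone QID
        final String qid;  // null for a lone name

        Entry(Kind kind, String name, String qid) {
            this.kind = kind;
            this.name = name;
            this.qid = qid;
        }
    }

    // Assumption: columns are separated by a tab or by two or more spaces,
    // so names that contain single spaces stay in one column.
    static Entry parseLine(String line) {
        String[] cols = line.trim().split("\\s{2,}|\\t");
        if (cols.length != 2) {
            throw new IllegalArgumentException("Expected two columns: " + line);
        }
        String name = cols[0].trim();
        String qid = cols[1].trim();
        if (name.equals("*") && qid.equals("*")) {
            throw new IllegalArgumentException("Only one column may be an asterisk: " + line);
        }
        if (qid.equals("*")) {
            return new Entry(Kind.LONE_NAME, name, null);
        }
        if (name.equals("*")) {
            return new Entry(Kind.LONE_QID, null, qid);
        }
        return new Entry(Kind.PAIR, name, qid);
    }
}
```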

To enable an omit file in RNI:

Place the omit file in the BT_ROOT directory.
Open omit_ids.datafiles, which is in the plugins/rni/bt_root/rlpnc/data/real_world_ids/ref/omit_ids directory by default.
Add a new entry for your omit file following the format <LANG>_ORGANIZATION tab * tab <file path>, where LANG is the three-letter language code of the file. File paths must be relative to BT_ROOT, meaning absolute paths will not work. For example:
```
ara_ORGANIZATION	*	rlpnc/data/real_world_ids/ref/omit_ids/ara_ORGANIZATION_ids.tsv
```
Save omit_ids.datafiles.

Matching addresses

RNI provides a Java API for matching addresses in English, Traditional Chinese, and Simplified Chinese.

In the RNI context, address matching means comparing two addresses, performing linguistic analysis per address field, and returning a score (a double greater than zero and less than or equal to one) that indicates how similar the two addresses are. A value of 1.0 is returned if and only if the two addresses are identical (each address field matches exactly). A score of less than 1.0 is returned for addresses that potentially match, with a score indicating the relative similarity of the two addresses.

Note

Address matching in Latin script is optimized for addresses in English. Non-English addresses in Latin script may also be matched; results will vary by language.

Address definition

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library^[11] is used to parse the address string into address fields.

When entered as a set of fields, the address may include any of the fields in Table 19, “Supported Address Fields”. At least one field must be specified, but no specific fields are required.

RNI optimizes the matching algorithm to the field type. Named entity fields, such as street name, city, and state, are matched using a linguistic, statistically-based algorithm that handles name variations. Numeric and alphanumeric fields, such as house number, postal code, and unit, are matched using character-based methods.

Table 19. Supported Address Fields

Field Name	Description	Example(s)
`house`	venue and building names	"Brooklyn Academy of Music", "Empire State Building"
`houseNumber`	usually refers to the external (street-facing) building number	"123"
`road`	street name(s)	"Harrison Avenue"
`unit`	an apartment, unit, office, lot, or other secondary unit designator	"Apt. 123"
`level`	expressions indicating a floor number	"3rd Floor", "Ground Floor"
`staircase`	numbered/lettered staircase	"2"
`entrance`	numbered/lettered entrance	"front gate"
`suburb`	usually an unofficial neighborhood name	"Harlem", "South Bronx", "Crown Heights"
`cityDistrict`	these are usually boroughs or districts within a city that serve some official purpose	"Brooklyn", "Hackney", "Bratislava IV"
`city`	any human settlement including cities, towns, villages, hamlets, localities, etc.	"Boston"
`island`	named islands	"Maui"
`stateDistrict`	usually a second-level administrative division or county	"Saratoga"
`state`	a first-level administrative division	"Massachusetts"
`countryRegion`	informal subdivision of a country without any political status	"South/Latin America"
`country`	sovereign nations and their dependent territories, which have a designated ISO-3166 code	"United States of America"
`worldRegion`	currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean	"Jamaica, West Indies"
`postCode`	postal codes used for mail sorting	"02110"
`poBox`	post office box: typically found in non-physical (mail-only) addresses	"28"

Address field groups

When an address is parsed into address fields, values can get put into the wrong field. Address field groups encapsulate common transpositions between fields. When scoring matching values in fields, RNI uses address field groups to group related or similar fields. If two field values match, but they are dissimilar fields, RNI applies a penalty to that match, reducing the score for that pair.

When matching two fields, the following penalties are applied:

If the fields are the same, no penalty is applied. (road - road)
If the fields are different, but the fields are in the same group, a small penalty is applied. (suburb - city)
If the fields are in different field groups, a large penalty is applied. (road - city)

Table 20. Address Groups

Group	Fields
house	house
house_number	houseNumber
road	road
unit	unit level staircase entrance
city	suburb cityDistrict city
state	island stateDistrict state
country	countryRegion country worldRegion
post_code	postCode
po_box	poBox
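
The three penalty cases can be expressed as a lookup over the groups in Table 20. This is a minimal sketch: AddressFieldGroups is a hypothetical class, the field names follow Table 19 (including poBox), and the actual penalty amounts are controlled by the addressSameGroupPenalty and addressDifferentGroupPenalty parameters rather than by this code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup that mirrors Table 20; not an RNI class.
class AddressFieldGroups {
    enum Penalty { NONE, SMALL, LARGE }

    private static final Map<String, String> GROUP_OF = new HashMap<>();

    private static void group(String groupName, String... fields) {
        for (String field : fields) {
            GROUP_OF.put(field, groupName);
        }
    }

    static {
        group("house", "house");
        group("house_number", "houseNumber");
        group("road", "road");
        group("unit", "unit", "level", "staircase", "entrance");
        group("city", "suburb", "cityDistrict", "city");
        group("state", "island", "stateDistrict", "state");
        group("country", "countryRegion", "country", "worldRegion");
        group("post_code", "postCode");
        group("po_box", "poBox");
    }

    // Same field: no penalty. Same group: small penalty. Otherwise: large penalty.
    static Penalty penaltyFor(String fieldA, String fieldB) {
        String groupA = GROUP_OF.get(fieldA);
        String groupB = GROUP_OF.get(fieldB);
        if (groupA == null || groupB == null) {
            throw new IllegalArgumentException("Unknown address field");
        }
        if (fieldA.equals(fieldB)) {
            return Penalty.NONE;
        }
        return groupA.equals(groupB) ? Penalty.SMALL : Penalty.LARGE;
    }
}
```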

Address matching usage model

Identify two addresses to compare.

Use MatchScorer to score the similarity of two AddressSpec objects. MatchScorer and AddressSpec are in the com.basistech.rni.match and com.basistech.rni.match.address packages respectively.

// Use MatchScorer to match two addresses.
void match2Addresses(AddressSpec addr1, AddressSpec addr2) {
    MatchScorer ms = new MatchScorer();
    double score = ms.score(addr1, addr2);
    // Handle the score.
    System.out.println("Score: " + score);
    // Release resources used by the match scorer.
    ms.close();
}

How Rosette calculates address match scores

The address match score is a value between 0.0 and 1.0; the higher the score, the stronger the match. The score is a relative indication of how similar two addresses are; it is not an absolute value. Calculating the match score is a complex process that utilizes multiple matching techniques and algorithms, as explained below.

Identify the address fields. This step is only performed if the address is provided as an unparsed string. In that case, Rosette uses the jpostal library to parse the addresses into address fields. This process works well for well-formatted addresses, but may have difficulty when addresses are irregularly formatted.
For example, most addresses are formatted from specific to general:
```
houseNumber road city state postCode
```
- The parser would provide predictable results for an address in an expected order:
  38 Concord Road, Apt. B Arlington MA
- The parser would have more difficulty if the address format was in an unexpected order:
  Arlington MA Concord Road #38 Apt B
If you are getting unexpected match values, check how the addresses are being parsed into address fields.
Normalize the fields in each address. Address fields are normalized so they can be compared. Normalization includes removing stop words, such as The from The United States.
Compare each address field. For the addresses being compared, every field in each address is compared to every field in the other address, with a match score calculated for each comparison. The algorithm used will depend on the field type. Scoring algorithms include:
- Edit distance: Alphanumeric fields, such as house number, are scored based on the number of character additions, substitutions, and deletions.
- Fuzzy match: Text fields, such as street names, are scored with intelligent name comparison algorithms to determine how similar they are.
- Postal codes: Rosette uses meanings of US, UK, and Canadian postal codes to provide scores for these fields. Even if a postal code is poorly formatted, Rosette can recognize and score the match correctly.
Select the best scores. Once all scores have been calculated, the best mapping of fields between the two addresses is selected to maximize the complete score.
Field Weights: Some fields in an address are considered more important than other fields. The scores from the selected matches are weighted by field type. These field type weightings can be modified based on the type of address data in your system.
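
To make the edit distance technique concrete, the sketch below computes a standard Levenshtein distance and turns it into a similarity value. It illustrates the general idea only: the normalization by the longer string's length is an assumption, EditDistanceScore is a hypothetical class, and this is not the scoring formula that Rosette uses.

```java
// Illustrative Levenshtein-based similarity; not Rosette's scoring formula.
class EditDistanceScore {

    // Counts the minimum number of character additions, substitutions,
    // and deletions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }
}
```

Under this assumed normalization, house numbers 123 and 128 differ by one substitution out of three characters, so they score about 0.67.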

Configuring address matching

Addresses have their own match parameters and override files that you can customize to achieve the best results for your data.

There are two types of override files for addresses:

Stop patterns and stop word prefixes designate address field elements to strip during indexing and queries.
Token pair overrides specify address field element pairs that match.

File Directories

The parameters are modified in the plugins/rni/bt_root/rlpnc/data/etc/parameter_profiles.yaml file.
The address matching override files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/overrides directory.
The address stop word files are in the plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords directory.

Modifying address parameters

To start tuning the parameters, run address matching on the test set and look for any unexpected results. Tunable parameters are defined in parameter_defs.yaml. The parameter files are described in Parameter configuration files.

Note

Changes made to the any profile apply to all supported languages.

An example parameter to tune is addressJoinedTokenLimit, which controls leniency towards joining or separating tokens. For some use cases, you may decide that joining many tokens within a field is acceptable. To adjust this parameter, find an existing parameter profile or define a new one, add the parameter, and modify the value. Increasing the value of addressJoinedTokenLimit allows more tokens to be merged.

Another example parameter is houseNumberAddressFieldWeight, which controls the weight of the houseNumber score when calculating the overall score. This type of parameter is available for all address fields, and is weighted evenly at 1 by default. For example, cityAddressFieldWeight controls the weight of the city field when matching addresses.

Once you define a profile and set a parameter value, rerun the address pairwise match, scoring the match with the edited parameter_profiles.yaml file.

Address parameters

Parameter	Description	Behavior
`addressFinalBias`	Helps normalize scores	Increasing leads to a higher score for ALL names
`addressReorderPenalty`	Penalty for token reordering within a comparison	Increasing leads to lower final score
`addressDeletionScore`	Score for deleted token within a comparison
`addressUnpairedFieldScore`	Score for an unpaired field
`addressOverrideDefaultScore`	Score for override matches
`addressJoinedTokenLimit`	Maximum sum of the number of tokens considered when matching two address fields
`addressCrossFieldScoreThreshold`	Minimum value a cross-field score must have to be included in the final score
`addressSameGroupPenalty`	Multiplier on field comparisons from the same group	Increasing leads to lower final score
`addressDifferentGroupPenalty`	Multiplier on field comparisons from different groups.	Increasing leads to lower final score
`houseAddressFieldWeight`	Weight used during comparison of the house field
`houseNumberAddressFieldWeight`	Weight used during comparison of the house number field
`roadAddressFieldWeight`	Weight used during comparison of the road field
`unitAddressFieldWeight`	Weight used during comparison of the unit field
`levelAddressFieldWeight`	Weight used during comparison of the level field
`staircaseAddressFieldWeight`	Weight used during comparison of the staircase field
`entranceAddressFieldWeight`	Weight used during comparison of the entrance field
`suburbAddressFieldWeight`	Weight used during comparison of the suburb field
`cityDistrictAddressFieldWeight`	Weight used during comparison of the cityDistrict field
`cityAddressFieldWeight`	Weight used during comparison of the city field
`islandAddressFieldWeight`	Weight used during comparison of the island field
`stateDistrictAddressFieldWeight`	Weight used during comparison of the stateDistrict field
`stateAddressFieldWeight`	Weight used during comparison of the state field
`countryRegionAddressFieldWeight`	Weight used during comparison of the countryRegion field
`countryAddressFieldWeight`	Weight used during comparison of the country field
`worldRegionAddressFieldWeight`	Weight used during comparison of the worldRegion field
`postCodeAddressFieldWeight`	Weight used during comparison of the postCode field
`poBoxAddressFieldWeight`	Weight used during comparison of the poBox field

Stop patterns and stop word prefixes

RNI uses stop patterns and stop word prefixes to remove patterns from address fields during indexing and queries before matching algorithms are applied. Stripping prefixes with string literals is faster than applying stop patterns (regular expressions), so use stop word prefixes to efficiently remove prefixes, such as the, that you do not want to include in address matching.

For each address field, RNI performs the following steps in order:

Character-level normalization, stripping punctuation including periods, commas, hyphens, and the number sign. White space is reduced to single spaces and all characters are lower-cased.
Stop patterns are applied.
Stop words are applied.
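
The three steps above can be sketched as a small pipeline. This is an approximation for experimenting with your own stop patterns and prefixes, not RNI's implementation: AddressFieldNormalizer is a hypothetical class, the sketch assumes punctuation is deleted rather than replaced with a space, and it assumes that only the longest matching prefix is stripped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical approximation of the documented normalization order.
class AddressFieldNormalizer {
    private final List<Pattern> stopPatterns = new ArrayList<>();
    private final List<String> stopPrefixes;

    AddressFieldNormalizer(List<String> stopRegexes, List<String> stopPrefixes) {
        // Longer entries are applied before shorter ones.
        List<String> regexes = new ArrayList<>(stopRegexes);
        regexes.sort(Comparator.<String>comparingInt(String::length).reversed());
        for (String regex : regexes) {
            stopPatterns.add(Pattern.compile(regex));
        }
        this.stopPrefixes = new ArrayList<>(stopPrefixes);
        this.stopPrefixes.sort(Comparator.<String>comparingInt(String::length).reversed());
    }

    // Step 1: strip periods, commas, hyphens, and number signs,
    // collapse white space, and lower-case.
    static String normalizeCharacters(String value) {
        String stripped = value.replaceAll("[.,\\-#]", "");
        return stripped.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    String normalize(String value) {
        String result = normalizeCharacters(value);
        // Step 2: apply stop patterns (regular expressions).
        for (Pattern pattern : stopPatterns) {
            result = pattern.matcher(result).replaceAll("");
        }
        // Step 3: strip the longest matching stop word prefix.
        // Prefixes must be lower case because step 1 lower-cases the value.
        for (String prefix : stopPrefixes) {
            if (result.startsWith(prefix)) {
                result = result.substring(prefix.length());
                break;
            }
        }
        return result.trim().replaceAll("\\s+", " ");
    }
}
```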

Stop pattern

A stop pattern is a regular expression that excludes matching address field elements during indexing and queries. You can use any regular expression supported by the Java java.util.regex.Pattern class; see the Javadoc for detailed documentation.

Stop patterns for a given address field are specified in a UTF-8 file with the AddressField name:

stopregexes_LANG_ADDRESS_FIELD__FIELD.txt

where LANG is a three-letter language code and FIELD is an AddressField name. Currently, the only supported values for LANG are eng and zho. Each row in the file, except for rows that begin with #,^[12] is a regular expression. Leading and trailing whitespace is removed from regex lines, so use \s at beginning and end where needed.

Note

The delimiter before FIELD is a double underscore (__).

Elements in the address fields matching any of these regular expressions are removed. Longer stop patterns are applied before shorter stop patterns, so the presence of a shorter stop pattern does not prevent the stripping of a longer pattern that includes the shorter pattern.

Stop pattern files are arranged by field in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can add patterns to existing files, or if the file doesn't exist, create a UTF-8 file in the directory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopregexes_eng_ADDRESS_FIELD__CITY.txt would include regular expressions to remove elements from the CITY address field for English.

Use of complex patterns may increase processing time. When possible, use stop word prefixes.

Stop word prefixes

A stop word prefix is a string literal that strips the matching prefix from address field elements during indexing and queries.

Stop word prefixes for a given address field are specified in a UTF-8 file with the AddressField name:

stopprefixes_LANG_ADDRESS_FIELD__FIELD.txt

Note

The delimiter before FIELD is a double underscore (__).

Prefixes in the address field matching any of these string literals are removed.

Like stop patterns, longer stop word prefixes take precedence over shorter prefixes that the longer stop word contains.

RNI includes files with stop word prefixes for selected address fields in English and Chinese. These files are in plugins/rni/bt_root/rlpnc/data/addresses/ref/stopwords. You can modify the contents of these files. To add stop word prefixes for a different address field, create an additional UTF-8 file in the same subdirectory and include the full address field identifier (ADDRESS_FIELD__FIELD) in the filename. For example, stopprefixes_eng_ADDRESS_FIELD__CITY.txt would include stop word prefixes for use on the CITY address field for English.

Overriding token pair matches

You can create text files that specify token (address field element) pairs that match. Token pair overrides are supported for English-English, Chinese-English, and Chinese-Chinese. When RNI evaluates two address fields, each of which contains an element from the pair, it enhances the value of the resulting address match score. For example, if road and rd constitute a token pair, then the match score for Stuart Road and Stuart Rd will be higher than it would be if the token pair had not been specified.

The token pairs may be within a language or cross-lingual, as indicated by the file name:

LANG1_LANG2_FIELD.txt

where LANG1 is the three-letter language code for the first token in each pair, LANG2 is the three-letter language code for the second token in each pair, and FIELD is the AddressField name. Each entry in the file, except for rows that begin with #, is a tab-delimited token pair and may include a raw score between 0.0 and 1.0. If no score is provided, the addressOverrideDefaultScore parameter value will be used.

Token1 Tab Token2 Tab [0.0-1.0]

A token pair override score serves as a minimum score, but you can write /force after a token score to force it to be exactly that value:

Token1 Tab Token2 Tab [0.0-1.0]/force

If you would like to prevent a token pair from matching, you can use the SUPPRESS indicator as an alias for "0.0/force".
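
The entry syntax described above can be checked with a small parser before you deploy an override file. The following is a sketch under stated assumptions: TokenPairOverride is a hypothetical class, blank lines are assumed to be ignorable, and the default score is passed in by the caller to stand in for the addressOverrideDefaultScore parameter.

```java
import java.util.Optional;

// Hypothetical parser for token pair override entries; not an RNI class.
class TokenPairOverride {
    final String token1;
    final String token2;
    final double score;
    final boolean forced; // true if the score is exact rather than a minimum

    TokenPairOverride(String token1, String token2, double score, boolean forced) {
        this.token1 = token1;
        this.token2 = token2;
        this.score = score;
        this.forced = forced;
    }

    // Returns empty for comment rows (and, by assumption, blank lines).
    static Optional<TokenPairOverride> parse(String line, double defaultScore) {
        if (line.startsWith("#") || line.trim().isEmpty()) {
            return Optional.empty();
        }
        String[] cols = line.split("\t");
        if (cols.length < 2 || cols.length > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 tab-delimited columns: " + line);
        }
        String t1 = cols[0].trim();
        String t2 = cols[1].trim();
        if (cols.length == 2) {
            // No score given: fall back to the default override score.
            return Optional.of(new TokenPairOverride(t1, t2, defaultScore, false));
        }
        String spec = cols[2].trim();
        if (spec.equals("SUPPRESS")) {
            // SUPPRESS is an alias for 0.0/force.
            return Optional.of(new TokenPairOverride(t1, t2, 0.0, true));
        }
        boolean forced = spec.endsWith("/force");
        String number = forced ? spec.substring(0, spec.length() - "/force".length()) : spec;
        double score = Double.parseDouble(number);
        if (score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("Score must be between 0.0 and 1.0: " + line);
        }
        return Optional.of(new TokenPairOverride(t1, t2, score, forced));
    }
}
```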

RNI includes plugins/rni/bt_root/rlpnc/data/addresses/ref/override/eng_eng_state.txt, which contains a list of U.S. state abbreviations. For example:

Massachusetts  MA
California  CA

When you create an additional file in the same location, use the respective AddressField name in the filename to identify the address field each token element in the pair pertains to. For example, zho_eng_cityDistrict.txt indicates that the contents match Chinese-English cityDistrict address fields.

Indexing addresses

RNI enables high-speed, scalable searches for addresses in English using the Apache Lucene full-text search engine to store addresses with their search keys and a key index.

When you search for an address, RNI generates a search key for each component of each address field, locates all addresses indexed by those search keys, and uses linguistic matching algorithms to filter that set of addresses down to the most similar addresses.

RNI provides a Java API that you can use to embed it in your applications.

Java packages: The address indexing classes are in com.basistech.rni.index.internal. Unqualified class names that appear in this section are in com.basistech.rni.index.internal.

For detailed information about the API, see the Java API Reference.

Using RNI to Index Addresses

Reminder: If you have not already done so, you must set the Basis root directory.

Constructing an address index

An address index is an indexed list of addresses. The list includes a collection of AddressSpec objects and associated keys.

The AddressSpec object may include house, house number, road, unit, level, staircase, entrance, suburb, city district, city, island, state district, state, country region, country, world region, post code, post office box and additional fields.

Note

You can also create an index in memory that is never stored on disk.

To create an indexed list of addresses on disk, you must specify a pathname for the data store.

For example:

// Create an Address index.
// indexPathname specifies the directory where the index will be created.
StandardAddressIndex createIndex(String indexPathname) throws NameIndexStoreException,
        RNTException {
    StandardAddressIndex index = StandardAddressIndex.create(indexPathname);
    return index;
}

Now you can use AddressSpecBuilder to create AddressSpec objects and add them to the index. AddressSpecBuilder provides a fluent interface that supports method chaining.

You can also create an AddressSpec object by parsing an address using AddressSpecBuilder.parse(String str) which internally utilizes the jpostal library. The following fragment illustrates the syntax for creating and adding an AddressSpec to the index.

// Add an address to the index.
void addAddress(StandardAddressIndex index, Integer id) throws NameIndexException, IOException {
    // Give the address a unique identifier. Must be a string.
    String uid = Integer.toString(id);
    // AddressSpecBuilder provides methods for adding address fields,
    // and a build method that returns the AddressSpec.
    AddressSpec addr = new AddressSpecBuilder()
            .house("101")
            .road("Stuart Street")
            .city("Boston")
            .state("MA")
            .countryRegion("New England")
            .uid(uid)
            .build();
    // AddressSpecBuilder also provides a method for parsing addresses which uses jpostal,
    // and a build method that returns the AddressSpec.
    AddressSpec addr2 = AddressSpecBuilder.parse("101 Stuart Street, Boston, MA").build();

    index.addAddress(addr);
    index.close();
}

When you are done adding addresses, be sure to close the address index, as in the preceding fragment.

Querying an address index

You can define and run queries that search an index for similar addresses.

Opening an address index

The primary role of an address index is to perform queries. You can also perform updates (insertions and deletions).

StandardAddressIndex provides a static method for opening an address index.

StandardAddressIndex index = StandardAddressIndex.open(String indexPathname);

indexPathname is the path to the directory that contains the address index.

To optimize the index for more efficient queries, call

index.optimize();

When you are done using the address index, you must close it:

index.close();

Defining an address search query

A query includes an AddressSpec object and several settings that you can use to constrain the query.

Set up an AddressIndexQuery object. For example:

// Define a query.
AddressIndexQuery defineQuery(AddressSpec address){
    AddressIndexQuery query = new AddressIndexQuery(address);
    query.setAddressDataMinimumMatchScore(.30);   
    return query;
}

Query performance tradeoffs

You can make tradeoffs between different dimensions of performance by adjusting certain AddressIndexQuery parameters.

For more information about tradeoffs between accuracy and speed and between false positives and false negatives, refer to Query Performance Tradeoffs for names. For addresses, you will adjust the addressesToCheckAllowance and maximumAddressesToCheck AddressIndexQuery parameters.

Running the query and accessing the query results

StandardAddressIndex includes a query method that takes as its parameter the AddressIndexQuery you have set up.

The query returns an AddressIndexQueryResult list. Each AddressIndexQueryResult object provides an AddressSpec object and a similarity score. As the following fragment illustrates, you can obtain and process each AddressSpec and its score. The higher the score (greater than 0 and less than or equal to 1), the greater the confidence that this is a relevant match. A score of 1.0 indicates that the query address and result address are identical. See Address Variations. Scoring is commutative: the scores for two given addresses are always the same, regardless of which address is in the index and which address is in the query.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/address_query_index.java

AddressMatchResult. The AddressIndexQueryResult provides an AddressMatchResult object, which in turn provides a match type and score.

Cleanup

When you are done running queries, close the index:

index.close();

Sample

For a sample Java application that defines a Rosette Address Indexer query, runs the query, and reports the results, see AddressIndexQuerySample.

Multithreading

No more than one StandardAddressIndex object may exist for a given address index on disk at any time.

Queries and updates may be performed in multiple threads on a single StandardAddressIndex object.

Matching dates

RNI can match dates, returning a match score that reflects the time similarity of the two dates. Dates that are closer together are considered a stronger match and return a match score closer to 1.

For example, 11/05/1993 and 11/07/1993 have a high score, as they are very similar and just two days apart. However, 11/05/1993 and 11/05/1995 yield a low score as they differ by two years.

Date definition

A date contains a year, month, and day, but not all fields are required for matching. All common delimiters for English dates are supported, and dates can be expressed with various orderings. RNI will filter out some non-date related words. Formats that include time of day are not supported.

You can specify an Elasticsearch date format that includes time information in the mapping. The time component will be ignored.

RNI supports a wide variety of date formats. The best date format will always be the ISO standard of YYYY-MM-DD, where March 7, 1984 is written as 1984-03-07. RNI will attempt to interpret any date provided, although the less standard the format, the less guarantee that its interpretation will be the one you might expect.

Dates can be represented as YYYY-MM-DD. When some fields are unspecified, the letters represent the unknown values. For example, March 7 is YYYY-03-07, since the year is unspecified. Two-digit years are assumed to have unknown centuries, so 3/7/84 is interpreted as YY84-03-07. March 7, 1984 will be an equally good match as March 7, 2084 and March 7, 1884.

When a date is provided, RNI will attempt to identify the year, month, and day within it, leaving blank any fields it cannot determine. You can omit fields if you do not have the value for one or more fields. For example: 1955-12-30, 1955--03, 12/30, -12-, --30, 1955, 1955-12- are all valid dates.

If RNI encounters an invalid date in an acceptable format, such as March 38, 1984, it does not return an error. Instead, it treats the impossible value as unknown and interprets the date as March 1984.
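
The partial-date notation described above can be sketched in a few lines. This is an illustration only, not RNI's date parser: the class and method names are hypothetical, null stands for an unknown field, and the placeholders MM and DD for an unknown month or day are an assumption extrapolated from the YYYY-MM-DD notation.

```java
import java.time.Month;

// Illustrative only: renders a partially known date with letters for unknown fields.
final class PartialDateSketch {

    /** Each argument may be null, meaning "unknown". */
    static String canonical(Integer century, Integer yearWithoutCentury, Integer month, Integer day) {
        // Impossible values are dropped rather than reported as errors.
        if (month != null && (month < 1 || month > 12)) {
            month = null;
        }
        if (day != null) {
            int maxDay = (month == null) ? 31 : Month.of(month).maxLength();
            if (day < 1 || day > maxDay) {
                day = null;
            }
        }
        String cc = (century == null) ? "YY" : String.format("%02d", century);
        String yy = (yearWithoutCentury == null) ? "YY" : String.format("%02d", yearWithoutCentury);
        String mm = (month == null) ? "MM" : String.format("%02d", month);
        String dd = (day == null) ? "DD" : String.format("%02d", day);
        return cc + yy + "-" + mm + "-" + dd;
    }

    public static void main(String[] args) {
        System.out.println(canonical(null, 84, 3, 7));   // 3/7/84: century unknown
        System.out.println(canonical(null, null, 3, 7)); // March 7: year unknown
        System.out.println(canonical(19, 84, 3, 38));    // March 38, 1984: day dropped
    }
}
```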

Supported date formats

RNI supports a wide variety of date formats.

Days can be represented by 1 or 2 digits.
Months can be numerics (1 or 2 digits) or English characters (full name or 3 character abbreviation).
Years can be represented by 1, 2, 3 or 4 digits.
Supported delimiters include , . - /, as well as a space.
Partial fields can be entered.
At this time, only English month names and abbreviations are recognized.
All words are case-insensitive; upper and lower case are interpreted the same.

The following table shows different acceptable formats for the date March 7, 1984.

Format	Valid Examples	Notes
Y-M-D	1984-03-07; 1984/3/7; 1984.3.07; 1984 Mar 07; 1984-March-7
M-D	03-07; 3/7; Mar-07; March 7
Y-M	1984-03; 1984 March; 1984-Mar
YYYYMMDD	19840307	All 8 digits must be included
M-D-Y	03-07-1984; 3/7/84; March 7 84; Mar. 7, 1984
M-YYYY	03-1984; March 1984; Mar-1984	The year must include 4 digits. March-84 will not be recognized.
D-M-Y	07 03 1984; 7/3/84; 07 March 84; 7/Mar/1984
D-M	07-03; 7/3; 07-Mar; 7 March
D(MONTH)Y	7MAR84; 07March1984	The month is a word or abbreviation
YYYY	1984
Month	March

Date match parameters

As with name matching, a set of parameters controls date matching. You can edit the parameter values in the plugins/rni/bt_root/rlpnc/data/etc/parameter_defs.yaml file.

Table 21. Date General Parameters

Parameter Name	Description	Behavior
alternativeTimeProximityMatch	When enabled, computes the chronological distance between dates in years. The default is false.	Instead of using the distance between dates in days, the score is calculated based on the distance between the dates in unit time of years.
dateOrdering	Sets the default date representations. Valid values are YMD, DMY, and MDY. The default value is MDY.	Supports different date formats. For example, UK dates tend to be DMY; US dates tend to be MDY.
improveSingleDigitManipulationMatch	Controls how much the score is increased if there is exactly one instance of digit manipulation^[a] and there are no other differences. The default value is 0 (off).	If the parameter is set to 0 (minimum), then the score is not increased at all. If the parameter is set to 1 (maximum), then two dates with a single digit manipulation will be an exact match.
maxYearDistanceForDigitManipulation	Sets the maximum number of years beyond which two dates will not be affected by improveSingleDigitManipulation. The default value is 10.	By default, two dates that are more than ten years apart do not have their match score increased by improveSingleDigitManipulation, even if they contain exactly one instance of digit manipulation^[a] and no other differences.
thresholdToDropoffBiasMapping	Specifies the points at which scores should drop, and by how much, based on the difference in years between the two dates. By default, this object is empty, which means no dropoff bias applies.	The match score is decreased based on the difference in years. If the parameter is {2: 0.7, 5: 0.1}, the following biases are applied: years differ by 1: none; 2-4: 0.7; 5 or more: 0.1.
timeProximityYearInterval	Specifies the time interval in years that alternativeTimeProximityMatch uses to determine a score. The default is 10.	By default, dates within 10 years of each other score above a threshold of 0.8.
tryDayMonthSwap	Allows for date matching with swapped day and month fields. It is on by default.	This parameter attempts to correct for parsing errors by swapping the day and month. Turn it off if you only want to match the dates exactly as indexed.
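
The thresholdToDropoffBiasMapping example in Table 21 amounts to a "largest threshold not exceeding the year difference" lookup. The following sketch shows that lookup only; the class and method names are hypothetical, and how RNI combines the selected bias with the match score is not covered here.

```java
import java.util.Map;
import java.util.OptionalDouble;
import java.util.TreeMap;

// Illustrative only: selects the dropoff bias for a given difference in years.
final class DropoffBiasSketch {

    /** Returns the bias for the largest threshold <= yearsApart, or empty if none applies. */
    static OptionalDouble biasFor(TreeMap<Integer, Double> mapping, int yearsApart) {
        Map.Entry<Integer, Double> entry = mapping.floorEntry(yearsApart);
        return (entry == null) ? OptionalDouble.empty() : OptionalDouble.of(entry.getValue());
    }

    public static void main(String[] args) {
        TreeMap<Integer, Double> mapping = new TreeMap<>();
        mapping.put(2, 0.7);
        mapping.put(5, 0.1);
        for (int years : new int[] {1, 2, 4, 5, 12}) {
            OptionalDouble bias = biasFor(mapping, years);
            System.out.println(years + " year(s): "
                    + (bias.isPresent() ? String.valueOf(bias.getAsDouble()) : "none"));
        }
    }
}
```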

^[a]A digit manipulation is a transformation to a digit that can be accomplished with minimal additional lines. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0<>8, 1<>7, 3<>8, 5<>8, 5<>6, 6<>8, 7<>2.
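
As a rough illustration of the footnote above, the following sketch checks whether two date strings differ by exactly one of the listed digit manipulations. It assumes a position-by-position comparison of two equal-length strings; the class and method names are hypothetical, and RNI's own detection may work differently.

```java
// Illustrative only: detects a single digit manipulation between two strings.
final class DigitManipulationSketch {

    // The pairs listed in the footnote; each pair applies in both directions.
    private static final String[] PAIRS = {"08", "17", "38", "58", "56", "68", "72"};

    static boolean isManipulation(char a, char b) {
        for (String pair : PAIRS) {
            char x = pair.charAt(0);
            char y = pair.charAt(1);
            if ((a == x && b == y) || (a == y && b == x)) {
                return true;
            }
        }
        return false;
    }

    /** True when exactly one position differs and that difference is a listed manipulation. */
    static boolean isSingleDigitManipulation(String left, String right) {
        if (left.length() != right.length()) {
            return false;
        }
        int manipulations = 0;
        for (int i = 0; i < left.length(); i++) {
            char a = left.charAt(i);
            char b = right.charAt(i);
            if (a != b) {
                if (!isManipulation(a, b)) {
                    return false; // some other kind of difference
                }
                manipulations++;
            }
        }
        return manipulations == 1;
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigitManipulation("1984-03-07", "1984-03-01"));
        System.out.println(isSingleDigitManipulation("1984-03-07", "1984-03-04"));
    }
}
```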

Because dates are sometimes written month day and other times written day month, tryDayMonthSwap tries matching the date fields as written as well as with the month and day fields switched. The best score is returned as the match score. For example, if the dates in question are 1970-3-5 and 1970-6-4, this feature will match the following four pairs:

1970-3-5	↔	1970-6-4
1970-3-5	↔	1970-4-6
1970-5-3	↔	1970-6-4
1970-5-3	↔	1970-4-6
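
The four pairs above can be enumerated mechanically. The sketch below does so and returns the best score from a scorer supplied by the caller. All names are hypothetical, the scorer is a stand-in for RNI's date scoring, and skipping the swap when the day is greater than 12 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative only: enumerates as-written and day/month-swapped date pairs.
final class DayMonthSwapSketch {

    static final class Ymd {
        final int year;
        final int month;
        final int day;

        Ymd(int year, int month, int day) {
            this.year = year;
            this.month = month;
            this.day = day;
        }

        Ymd swapped() {
            return new Ymd(year, day, month);
        }

        @Override
        public String toString() {
            return year + "-" + month + "-" + day;
        }
    }

    /** The date as written, plus the swapped form when the day could be a month. */
    static List<Ymd> variants(Ymd date) {
        List<Ymd> out = new ArrayList<>();
        out.add(date);
        if (date.day >= 1 && date.day <= 12) {
            out.add(date.swapped());
        }
        return out;
    }

    static List<String> candidatePairs(Ymd left, Ymd right) {
        List<String> pairs = new ArrayList<>();
        for (Ymd l : variants(left)) {
            for (Ymd r : variants(right)) {
                pairs.add(l + " <-> " + r);
            }
        }
        return pairs;
    }

    /** Scores every candidate pair and keeps the best result. */
    static double bestScore(Ymd left, Ymd right, ToDoubleBiFunction<Ymd, Ymd> scorer) {
        double best = 0.0;
        for (Ymd l : variants(left)) {
            for (Ymd r : variants(right)) {
                best = Math.max(best, scorer.applyAsDouble(l, r));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        candidatePairs(new Ymd(1970, 3, 5), new Ymd(1970, 6, 4)).forEach(System.out::println);
    }
}
```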

Table 22. Date Weighting Parameters

Parameter Name	Description	Behavior
`dayDistanceWeight`	Weight for the day field comparison of the dates.	1 and 30 are far, even if they are close in time. They will have a low match score.
`monthDistanceWeight`	Weight for the month field comparison of the dates.	1 and 12 are far, even if they are close in time. They will have a low match score.
`stringDistanceWeight`	Weight for the edit distance between the two dates when each is converted to a standard string (05021974 for 5/2/1974).	1979-12-31 and 1980-1-1 become 19791231 and 19800101. They will have a low match score.
`timeDistanceWeight`	Weight for the time distance (that is, the number of days) between two dates. The score is based on the number of days between the two dates.	1979-12-31 and 1980-1-1 look different, but they are very close in time. They will have a high match score.
`yearDistanceWeight`	Weight for the year field comparison of the dates.	Close years will have a high match score.

The date weighting fields control the relative strength of each aspect of the date-matching algorithm. A separate score is calculated for each match type. The final match score is calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is ignored and its weight evenly distributed across other fields.

Dates with a high time match score may have a very low string match score. Time finds dates that are close together; string gives high scores to similarly formatted dates.

Matching records

Record similarity is a pairwise match between two lists of records. Each record can include multiple fields, and each comparison returns a single match score. The fields can be any combination of RecordFieldType.RNI_NAME, RecordFieldType.RNI_DATE, and RecordFieldType.RNI_ADDRESS. The records do not have to contain the same fields; only fields with the same field name are compared. If one record has three fields and the other has two fields, the missing field will be ignored and the other two fields compared.

Each field can be assigned a weight to reflect its importance in the overall matching logic. When matching two records, some fields are more important in determining a match than others. For example, the name field is likely more important in determining a match than an address field. If no weights are defined, each field is weighted equally.

You can specify individual parameter values or a parameter universe string in the record similarity properties object to set tuning variables for the record similarity call.

When matching records, a similarity score is calculated for each field. The final match score is then calculated by performing a weighted arithmetic mean over each of the similarity scores. If a field is missing from a record, that field is removed from the score calculation and its weight is evenly distributed across the other fields. Set the score_if_null parameter to a value between 0 and 1 to include missing fields in the score. When set, that value is returned when the field is missing from the record.
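
The calculation described above can be sketched as follows. This is an illustration of a weighted arithmetic mean with the missing-field handling as described, not RNI's implementation: the class and method names are hypothetical, "evenly distributed" is read as adding an equal share of the missing weight to each remaining field, and returning 0 when every field is missing is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: combines per-field similarity scores into one record score.
final class RecordScoreSketch {

    /**
     * weights: field name to weight. scores: field name to similarity score; a field
     * with no entry is treated as missing. scoreIfNull: null to drop missing fields,
     * otherwise the score to use for them.
     */
    static double combine(Map<String, Double> weights, Map<String, Double> scores, Double scoreIfNull) {
        Map<String, Double> present = new LinkedHashMap<>();
        double missingWeight = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            Double score = scores.get(w.getKey());
            if (score == null && scoreIfNull != null) {
                score = scoreIfNull; // missing field takes part with the fallback score
            }
            if (score == null) {
                missingWeight += w.getValue();
            } else {
                present.put(w.getKey(), score);
            }
        }
        if (present.isEmpty()) {
            return 0.0; // assumption: nothing to compare
        }
        double share = missingWeight / present.size();
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> p : present.entrySet()) {
            double weight = weights.get(p.getKey()) + share;
            weightedSum += weight * p.getValue();
            totalWeight += weight;
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("name", 0.5);
        weights.put("dob", 0.3);
        weights.put("address", 0.2);
        Map<String, Double> scores = new HashMap<>();
        scores.put("name", 0.9);
        scores.put("dob", 0.8); // address is missing from one record
        System.out.println(combine(weights, scores, null));
        System.out.println(combine(weights, scores, 0.5));
    }
}
```

In this example the address weight of 0.2 is split between name and dob when the field is dropped, so the name field counts for 0.6 and the date field for 0.4.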

Record matching usage model

Use RecordScorer to score the similarity of two records that include multiple field types, instead of the functions for a single type, such as the MatchScorer, DateScorer and AddressScorer functions for names, dates and address matching respectively.

Supported field types

The RecordScorer has default support for the RecordFieldType.RNI_NAME, RecordFieldType.RNI_DATE, and RecordFieldType.RNI_ADDRESS field types. All default similarity scores are between 0.0 and 1.0.

Table 23. RecordScorer Supported Field Types

Field Type	Entity Type	Examples
`RecordFieldType.RNI_NAME`	PERSON	'John David Smith' vs. 'Jon D Smith' = 0.88
`RecordFieldType.RNI_NAME`	IDENTIFIER:DRIVERS_LICENSE	'S82062270' vs. 'S82062272' = 0.9
`RecordFieldType.RNI_NAME`	IDENTIFIER:LICENSE_PLATE	'E23 2IN' vs. 'E23 2IM' = 0.875
`RecordFieldType.RNI_NAME`	IDENTIFIER:NATIONAL_ID_NUM	'691-84-8999' vs. '691-84-9999' = 0.9167
`RecordFieldType.RNI_DATE`	N/A	'2010-11-4' vs. '2010-5-11' = 0.92
`RecordFieldType.RNI_ADDRESS`	N/A	'Red Cedar Ct' vs. 'Cedar Ct' = 0.53

Explainability of RNI Matching

As important as getting a match score is, understanding how the system calculated the score can be just as important. When matching two names or records, RNI returns a JSON response explaining in detail how the two names, dates, addresses, or records were matched. With this information, you can understand how the score was calculated and, if necessary, modify the matching parameters to better solve your matching problems.

The following concepts are helpful when reviewing the explainInfo JSON file.

When two objects are being compared, one is referred to as the left input, one as the right input.
Every token of the left object is compared to every token of the right object. Token strings, made up of multiple tokens, may also be compared.
Names are usually composed of multiple tokens. For example, John Fitzgerald Kennedy is 3 tokens.

Common Terms

The response JSON contains sections for each type of object: names, addresses, and dates. While each object has its own criteria for comparison, there are common terms used for all comparisons, as shown below.

Table 24. Definitions of Terms

Term	Definition	Note
bin	A number representing the frequency of the token in the language. A lower bin indicates that the token is unusual and should therefore be weighted more highly when calculating the similarity score.
biasedBin	The bin raised to a power from 0.1 to 10 (default 0.970). The power is set by the `frequencyRankBias` parameter.
scoreInIsolation	The matching score of just the tuples being compared, ignoring things like position in the name, name weighting, etc. This will show a match score of 1.000 for an exact match of tokens, even if there are biases that will lower the score in context.
scoreInContext	The matching score between the tuples taking into account the placement in the overall query and any biases related to the overall query.
(left/right)MinTokenIndex	The index of the first token in the string of tokens. For single tokens, the min and max tokenIndex will have the same value.	An index of -1 or -2 means the token isn't in the name and the token is considered a deletion.
(left/right)MaxTokenIndex	The index of the last token in the string of tokens. For single tokens, the min and max tokenIndex will have the same value.	An index of -1 or -2 means the token isn't in the name and the token is considered a deletion.
unbiasedScore	The raw score before any calculations using `finalBias`, `adjustOneSidedDeletionScores`, or other such bias parameters.
score	The final score after `finalBias`, `adjustOneSidedDeletionScores`, and other such bias parameters are applied to the calculation.
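
The relationship between bin and biasedBin can be checked with a one-line calculation. The sketch below assumes that the bias is exactly 0.97; with that value it reproduces the biasedBin values in the name matching example later in this section (for instance, a bin of 5 gives approximately 4.7643). The class and method names are hypothetical.

```java
// Illustrative only: biasedBin as the bin raised to the frequencyRankBias power.
final class BiasedBinSketch {

    static double biasedBin(double bin, double frequencyRankBias) {
        return Math.pow(bin, frequencyRankBias);
    }

    public static void main(String[] args) {
        double bias = 0.97; // default value stated in Table 24
        for (double bin : new double[] {5, 3.5, 8, 1}) {
            System.out.printf("bin=%.1f biasedBin=%.6f%n", bin, biasedBin(bin, bias));
        }
    }
}
```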

Response structure

All match responses contain the same sections. The details contained within each section can change based on the match object (names, dates, addresses).

Left/right input information: The input information for each input along with the properties for each token in the input. Properties depend on the type of object being matched.

For example, the name matching example contains the following properties:

"data": "John Smith",
"normalizedData": "john smith",
"latnData": "john smith",
"script": "Latn",
"languageOfUse": "ENGLISH",
"languageOfOrigin": "ENGLISH"

While a date comparison would contain different properties:

"century": 20,
"month": 10,
"canonicalForm": "2024-10-01",
"yearWithoutCentury": 24,
"dayMonthSwapped": true,
"originalString": "10 January 2024",
"modifiedJulianDay": 60584,
"day": 1

Tuple scores: The score for every tuple, where a tuple is a token string from the left input and a token string from the right input. Every token in the left input is matched to every token in the right input, along with some token strings (multiple tokens combined together).
Score adjustments: The score adjustments list the parameters applied, and the score calculated with those parameters.
For example, the name example here contains the following parameters:
```
"unbiasedScore": 0.6829129823127231,
"score": 0.6919264820086959,
"parameter": "adjustOneSidedDeletionScores"

"unbiasedScore": 0.6919264820086959,
"score": 0.8435140063279181,
"parameter": "finalBias"
```
Meanwhile, a date comparison would contain different parameters. In this case, a different matching scheme, tryDayMonthSwap, is tried to see if a better result is returned.
```
"score": 0.95,
"unbiasedScore": 0.5926523220980572,
"parameter": "tryDayMonthSwap"

"score": 0.95,
"unbiasedScore": 0.95,
"parameter": "dateFinalBias"
```
Final score: The similarity score for the two names.
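
The date properties shown in the date comparison example above (century, yearWithoutCentury, canonicalForm, and modifiedJulianDay) can be reproduced with the standard java.time classes. This is only a sketch for checking values by hand, with hypothetical class and method names; it does not show how RNI computes them internally.

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

// Illustrative only: derives the date properties for a fully specified date.
final class DatePropertiesSketch {

    static int century(LocalDate date) {
        return date.getYear() / 100;
    }

    static int yearWithoutCentury(LocalDate date) {
        return date.getYear() % 100;
    }

    static long modifiedJulianDay(LocalDate date) {
        return date.getLong(JulianFields.MODIFIED_JULIAN_DAY);
    }

    public static void main(String[] args) {
        // "10 January 2024" with day and month swapped is 2024-10-01.
        LocalDate date = LocalDate.of(2024, 10, 1);
        System.out.println("canonicalForm=" + date);
        System.out.println("century=" + century(date));
        System.out.println("yearWithoutCentury=" + yearWithoutCentury(date));
        System.out.println("modifiedJulianDay=" + modifiedJulianDay(date));
    }
}
```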

Example: matching names

Let's take a look at an example. In this example we're matching the following 2 names:

John Smith
Jon J Smyth

The JSON output is broken down by section.

Example 16. Left Input: John Smith

"leftInput": {
    "data": "John Smith",
    "normalizedData": "john smith",
    "latnData": "john smith",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "john",
        "latnToken": "john",
        "bin": 5,
        "biasedBin": 4.764319787410581,
        "tokenWeight": 0.41435888604672094,
        "tokenType": "GIVEN"
      },
      {
        "token": "smith",
        "latnToken": "smith",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.585641113953279,
        "tokenType": "SURNAME"
      }
    ],
    "entityType": "PERSON"
  },

The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Even if the name was provided as Smith John, Smith would be identified as a SURNAME and John as a GIVEN name.

Example 17. Right input: Jon J Smyth

"rightInput": {
    "data": "Jon J. Smyth",
    "normalizedData": "jon j. smyth",
    "latnData": "jon j. smyth",
    "script": "Latn",
    "languageOfUse": "ENGLISH",
    "languageOfOrigin": "ENGLISH",
    "tokens": [
      {
        "token": "jon",
        "latnToken": "jon",
        "bin": 3.5,
        "biasedBin": 3.3709010396413017,
        "tokenWeight": 0.2083122782666673,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "j",
        "latnToken": "j",
        "bin": 8,
        "biasedBin": 7.5161819937120935,
        "tokenWeight": 0.08948764635417582,
        "tokenType": "UNKNOWN"
      },
      {
        "token": "smyth",
        "latnToken": "smyth",
        "bin": 1,
        "biasedBin": 1,
        "tokenWeight": 0.702200075379157,
        "tokenType": "UNKNOWN"
      }
    ],
    "entityType": "PERSON"
  },

The name is tokenized. Each token is evaluated.
The entityType is identified as PERSON. We recommend always providing the entityType in your search for the best results.
The tokenTypes are identified. Since both Jon and Smyth are unusual spellings, the tokenType is not identified.

Example 18. Score tuples

"scoreTuples": [
    {
      "scoreInIsolation": 0.7595918889283346,
      "scoreInContext": 0.7595918889283346,
      "left": "john",
      "right": "jon",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.4912303477031893,
      "scoreInContext": 0.4666688303180298,
      "left": "john",
      "right": "jonj",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.542,
      "scoreInContext": 0.4743439389212776,
      "left": "john",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 0,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.2941408383164158,
      "scoreInContext": 0.279433796400595,
      "left": "johnsmith",
      "right": "jon",
      "marked": false,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 0,
      "rightMaxTokenIndex": 0
    },
    {
      "scoreInIsolation": 0.46557800000000005,
      "scoreInContext": 0.4422991,
      "left": "johnsmith",
      "right": "j",
      "marked": false,
      "reason": "INITIAL_MATCH",
      "leftMinTokenIndex": 0,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    },
    {
      "scoreInIsolation": 0.7237045473947534,
      "scoreInContext": 0.7237045473947534,
      "left": "smith",
      "right": "smyth",
      "marked": true,
      "reason": "HMM_MATCH",
      "leftMinTokenIndex": 1,
      "leftMaxTokenIndex": 1,
      "rightMinTokenIndex": 2,
      "rightMaxTokenIndex": 2
    },
    {
      "scoreInIsolation": 0.27169000000000004,
      "scoreInContext": 0.27169000000000004,
      "left": "",
      "right": "j",
      "marked": true,
      "reason": "DELETION",
      "leftMinTokenIndex": -1,
      "leftMaxTokenIndex": -1,
      "rightMinTokenIndex": 1,
      "rightMaxTokenIndex": 1
    }
  ],

All tuples are compared. The tuples that are marked as true are the matches that are used to calculate the scores.

Matching tuples:

1	John : Jon
2	Smith: Smyth
3	J: deleted (no match)
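
The left and right strings in each tuple correspond to the tokens between the min and max token indexes, joined without a separator (for example, right tokens 0 to 1 give "jonj"). The following sketch reconstructs those strings. The class and method names are hypothetical, and treating any negative index as an empty string is an assumption based on the DELETION tuple above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: rebuilds a tuple's token string from its token index range.
final class TokenSpanSketch {

    static String span(List<String> tokens, int minIndex, int maxIndex) {
        if (minIndex < 0 || maxIndex < 0) {
            return ""; // the token is not in the name (deletion)
        }
        return String.join("", tokens.subList(minIndex, maxIndex + 1));
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("john", "smith");
        List<String> right = Arrays.asList("jon", "j", "smyth");
        System.out.println(span(left, 0, 1));
        System.out.println(span(right, 0, 1));
        System.out.println(span(right, 2, 2));
    }
}
```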

Example 19. Score adjustments

"scoreAdjustments": [
    {
      "unbiasedScore": 0.6829129823127231,
      "score": 0.6919264820086959,
      "parameter": "adjustOneSidedDeletionScores"
    },
    {
      "unbiasedScore": 0.6919264820086959,
      "score": 0.8435140063279181,
      "parameter": "finalBias"
    }
  ],

The unbiased score is the score before the parameter is applied. The score is the value after the parameter is applied. The score of one adjustment becomes the unbiased score of the next.

Example 20. Final score

"finalScore": 0.8435140063279181

The final calculated score with all parameters applied. This is the similarity score returned by RNI.

Response schemas by object

The following sections list the JSON schema for each object type.

Name response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "data": { "type": "string" },
        "normalizedData": { "type": "string" },
        "latnData": { "type": "string" },
        "script": { "type": "string" },
        "languageOfUse": { "type": "string" },
        "languageOfOrigin": { "type": "string" },
        "tokens": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "token": { "type": "string" },
              "latnToken": { "type": "string" },
              "bin": { "type": "number", "default": 0.0 },
              "biasedBin": { "type": "number", "default": 0.0 },
              "tokenWeight": { "type": "number", "default": 0.0 },
              "tokenType": { "type": "string", "default": null }
            }
          }
        },
        "entityType": { "type": "string" },
        "realWorldIds": { "type" : "array", "items": { "type": "string" } }
      },
      "required": ["entityType"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftMinTokenIndex": { "type": "integer", "default": 0 },
          "leftMaxTokenIndex": { "type": "integer", "default": 0 },
          "rightMinTokenIndex": { "type": "integer", "default": 0 },
          "rightMaxTokenIndex": { "type": "integer", "default": 0 }
        },
        "required": ["left", "right", "reason"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}

Address response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "fieldInputInfos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "data": { "type": "string" },
              "latnData": { "type": "string" },
              "script": { "type": "string" },
              "languageOfUse": { "type": "string" },
              "languageOfOrigin": { "type": "string" },
              "tokens": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "token": { "type": "string" },
                    "latnToken": { "type": "string" },
                    "tokenWeight": { "type": "number", "default": 0.0 }
                  }
                }
              },
              "addressField": { "type": "string" },
              "normalizedData": { "type": "string" }
            }
          }
        }
      },
      "required": ["fieldInputInfos"]
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "reason": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "leftMinTokenIndex": { "type": "number", "default": 0 },
          "leftMaxTokenIndex": { "type": "number", "default": 0 },
          "rightMinTokenIndex": { "type": "number", "default": 0 },
          "rightMaxTokenIndex": { "type": "number", "default": 0 }
        },
        "required": ["left", "right", "reason", "leftField", "rightField"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" },
          "leftField": { "type": "string" },
          "rightField": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" },
    "fieldScores": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "leftField": { "type": "string" },
          "rightField": { "type": "string" },
          "score": { "type": "number", "default": 0.0 },
          "marked": { "type": "boolean", "default": false }
        },
        "required": ["leftField", "rightField"]
      }
    }
  }
}

Date response schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "leftInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "rightInput": {
      "type": "object",
      "properties": {
        "originalString": { "type": "string" },
        "day": { "type": "integer" },
        "month": { "type": "integer" },
        "yearWithoutCentury": { "type": "integer" },
        "century": { "type": "integer" },
        "modifiedJulianDay": { "type": "integer" },
        "canonicalForm": { "type": "string" },
        "dayMonthSwapped": { "type": "boolean" }
      }
    },
    "scoreTuples": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "scoreInIsolation": { "type": "number", "default": 0.0 },
          "scoreInContext": { "type": "number", "default": 0.0 },
          "left": { "type": "string" },
          "right": { "type": "string" },
          "marked": { "type": "boolean", "default": false },
          "weight": { "type": "number", "default": 0.0 },
          "component": { "type": "string" },
          "differenceInDays": { "type": "integer" }
        },
        "required": ["left", "right", "component"]
      }
    },
    "scoreAdjustments": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "unbiasedScore": { "type": "number", "default": 0.0 },
          "score": { "type": "number", "default": 0.0 },
          "parameter": { "type": "string" }
        }
      }
    },
    "finalScore": { "type": "number" }
  }
}

Using RNI with Solr

RNI includes plugins for Solr 8.11.3 and Solr 9.6.0 that support the use of RNI with Solr documents that contain names, addresses, and dates along with other data. The plugins support single-valued and multi-valued name, address, and date fields. With them, you can run Solr queries against documents that include, but are not limited to, name, address, and date fields.

Getting started with the Solr plugin

To index and search documents with RNI in a Solr application, you must add JARs to the Solr classpath, add the name fields to the schema.xml, and modify the solrconfig.xml.

Placing the Solr plugin jar

The Solr plugin JAR should be in your Solr sharedLib directory.

JAR files used by all of the cores in your Solr application (including bt-rni-solr<version>-plugin.jar) should be placed in a sharedLib directory that is defined in solr.xml.

We have placed the Solr plugin jar in rlpnc/data/rnm/sample/solr_shared_lib and included the corresponding sharedLib setting in solr.xml in our sample solr home.

Example, from rlpnc/data/rnm/sample/solr9x_home/solr.xml:

 <!--Adjust the sharedlib setting if you move bt-rni-solr9x-plugins.jar to a different location.-->  
<str name="sharedLib">${bt.root}/rlpnc/data/rnm/sample/solr_shared_lib</str>

Modifying `schema.xml`

Add fieldType and field definitions to schema.xml.

In types, define the NameField field type.

<fieldType name="bt_rni_name" class="com.basistech.rni.solr.NameField" needNameStore="true"/>

In types, define the AddressField field type.

<fieldType name="bt_rni_addr" class="com.basistech.rni.solr.AddressField" needAddressStore="true"/>

In types, define the DateField field type.

<fieldType name="bt_rni_date" class="com.basistech.rni.solr.DateField" needDateStore="true"/>

Add your name, address, and date fields in fields. For example:

<field name="primaryName" type="bt_rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="bt_rni_name" indexed="true" stored="true" multiValued="true"/>
<field name="residence" type="bt_rni_addr" indexed="true" stored="true" multiValued="false"/>
<field name="dateOfBirth" type="bt_rni_date" indexed="true" stored="true" multiValued="false"/>

You can copy fragments from rlpnc/data/rnm/sample/solr8x_home/collection1/conf/schema-xml-sample-fragments.xml.

These changes can also be made using the Solr Schema API in the Solr Admin page.

Modifying `solrconfig.xml`

As top-level elements in solrconfig.xml, add the reRank queryParser elements and the match valueSourceParser elements included in the RNI release.

<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<queryParser name="rniAddrRerank" class="com.basistech.rni.solr.RNIAddressReRankQParserPlugin"/>
<queryParser name="rniDateRerank" class="com.basistech.rni.solr.RNIDateReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
<valueSourceParser name="rniAddrMatch" class="com.basistech.rni.solr.AddressMatchValueSourceParser"/>
<valueSourceParser name="rniDateMatch" class="com.basistech.rni.solr.DateMatchValueSourceParser"/>

If your documents include one or more multivalued name fields, include an RNI updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIName">
  <processor class="com.basistech.rni.solr.MultiValueNameUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the RNI update chain.

<requestHandler name="/update" 
      class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">RNIName</str>
  </lst>
</requestHandler>

If your documents include one or more multivalued address fields, include an RNI updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIAddr">
 <!--Custom processor required when using multivalued address fields-->
  <processor class="com.basistech.rni.solr.MultiValueAddressUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the RNI update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
   <str name="update.chain">RNIAddr</str>
 </lst>
</requestHandler>

If your documents include one or more multivalued date fields, include an RNI updateRequestProcessorChain.

<updateRequestProcessorChain name="RNIDate">
  <!--Custom processor required when using multivalued date fields-->
  <processor class="com.basistech.rni.solr.MultiValueDateUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the RNI update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">RNIDate</str>
 </lst>
</requestHandler>

If your documents include a combination of multivalued name, address, and date fields, include a single RNI updateRequestProcessorChain that lists the custom processor for each field type.

<updateRequestProcessorChain name="RNI">
  <!--Custom processor required when using multivalued name fields-->
  <processor class="com.basistech.rni.solr.MultiValueNameUpdateRequestProcessorFactory"/>
  <!--Custom processor required when using multivalued address fields-->
  <processor class="com.basistech.rni.solr.MultiValueAddressUpdateRequestProcessorFactory"/>
  <!--Custom processor required when using multivalued date fields-->
  <processor class="com.basistech.rni.solr.MultiValueDateUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Modify the /update requestHandler to use the RNI update chain.

<requestHandler name="/update"
      class="solr.UpdateRequestHandler">
 <lst name="defaults">
  <str name="update.chain">RNI</str>
 </lst>
</requestHandler>

You can copy fragments from rlpnc/data/rnm/sample/solr8x_home/collection1/conf/solrconfig-xml-sample-fragments.xml.

Starting Solr

When starting Solr, you must include a Java property setting that points to the root of the RNI SDK and increase the heap size. If you are running JDK 17, you also need to enable the security manager by including -Djava.security.manager=allow in the -a options. For example:

bin/solr -a "-Dbt.root=$BT_ROOT -Djava.security.manager=allow" -m 2g

Loading data into Solr

The data model

Documents or records typically contain multiple names, and not all are the same type. For instance, in the OFAC Specially Designated Nationals list, a record may contain a primary name and a list of AKAs (also-known-as names). Ideally these would all be stored in a single Solr document to efficiently process complex queries involving multiple document fields, especially in a distributed setting.

For example, a Solr document might contain the following data:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>

Solr documents may also include multiple names referring to different persons, locations, or organizations. A single news document, for example, may contain references to a number of individuals.

Address fields

An address may include any of the fields in Table 25, “Supported Address Fields” below. At least one field must be specified, but no specific fields are required.

Addresses can be defined either as a set of address fields or as a single string. When defined as a string, the jpostal library^[14] is used to parse the address string into address fields.

To represent an address as a set of fields, concatenate the non-empty address fields. Write each field as the AddressField's name in lower camel case (house, houseNumber, road, unit, level, staircase, entrance, suburb, cityDistrict, city, island, stateDistrict, state, countryRegion, country, worldRegion, postCode, poBox), followed by the value of the field enclosed in angle brackets. Within the value, special characters are hex encoded and preceded by the percent sign.

Table 25. Supported Address Fields

Field Name	Description	Example
`house`	venue and building names	house<Brooklyn Academy of Music>
`houseNumber`	usually refers to the external (street-facing) building number	houseNumber<123>
`road`	street name(s)	road<Harrison Avenue>
`unit`	an apartment, unit, office, lot, or other secondary unit designator	unit<Apt. 123>
`level`	expressions indicating a floor number	level<3rd Floor>
`staircase`	numbered/lettered staircase	staircase<2>
`entrance`	numbered/lettered entrance	entrance<front gate>
`suburb`	usually an unofficial neighborhood name	suburb<Crown Heights>
`cityDistrict`	these are usually boroughs or districts within a city that serve some official purpose	cityDistrict<Brooklyn>
`city`	any human settlement including cities, towns, villages, hamlets, localities, etc.	city<Boston>
`island`	named islands	island<Maui>
`stateDistrict`	usually a second-level administrative division or county	stateDistrict<Saratoga>
`state`	a first-level administrative division	state<Massachusetts>
`countryRegion`	informal subdivision of a country without any political status	countryRegion<South/Latin America>
`country`	sovereign nations and their dependent territories, which have a designated ISO-3166 code	country<United States of America>
`worldRegion`	currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean	worldRegion<Jamaica, West Indies>
`postCode`	postal codes used for mail sorting	postCode<02110>
`poBox`	post office box: typically found in non-physical (mail-only) addresses	poBox<28>

In the string that defines the content of an address field, place a tilde (~) after the address, followed by an attribute-value pair, either (fielded=true) or (fielded=false), to specify whether the address consists of a set of fields or a single string.

The above example of a Solr document might contain the following additional data where the address is defined as a set of fields:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">houseNumber<3302>road<Grand Av.>city<West Louisville>state<KY>~fielded=true</field>

The address field can also consist of a single string, and the above example of a Solr document would look like this:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">3302 Grand Av., West Louisville, KY~fielded=false</field>
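
The fielded address format can also be assembled programmatically. The following is a minimal sketch, not part of the RNI API: the FieldedAddress class and its methods are hypothetical, and the set of characters treated as special (%, <, >, and ~) is an assumption, because the list of special characters is not given here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of the RNI API) that formats a fielded address value. */
final class FieldedAddress {

    private FieldedAddress() { }

    /** Percent-encodes the characters this sketch ASSUMES are special: % < > ~ */
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '%' || c == '<' || c == '>' || c == '~') {
                // Percent sign followed by the two-digit hexadecimal character code
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Joins the non-empty fields as name<value> and appends the fielded=true attribute. */
    static String format(Map<String, String> fields) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String value = entry.getValue();
            if (value == null || value.isEmpty()) {
                continue; // only non-empty fields are written
            }
            out.append(entry.getKey()).append('<').append(encode(value)).append('>');
        }
        if (out.length() == 0) {
            throw new IllegalArgumentException("At least one address field must be specified");
        }
        return out.append("~fielded=true").toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("houseNumber", "3302");
        fields.put("road", "Grand Av.");
        fields.put("city", "West Louisville");
        fields.put("state", "KY");
        System.out.println(format(fields));
    }
}
```

The resulting string can be passed as the value of an address field when a document is posted or queried.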

Date fields

Documents or records may also contain one or more dates, and you can specify the format of each date. To specify a date format, place a tilde (~) after the date in the string that defines the content of the date field, followed by an attribute-value pair, for example (format=dd-MM-yyyy) or (format=MMdd-yyyy), that gives the format to parse the date string with.

An example including a date which is defined without specifying a format:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">1/7/1942</field>
<field name="address">houseNumber<3302>road<Grand Av.>city<West Louisville>state<KY>~fielded=true</field>

The date field can also consist of a date string with a specified format:

<field name="primary">Muhammad Ali</field>
<field name="aka">Cassius Clay Jr</field>
<field name="aka">The Greatest</field>
<field name="dob">01/07/42~format=dd/MM/yy</field>
<field name="address">3302 Grand Av., West Louisville, KY~fielded=false</field>
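
When date values are generated by an application, it can help to check locally that a date string agrees with the format you attach to it. The following is a minimal sketch: the DateFieldValue class is hypothetical, and it assumes that the format patterns you use (such as dd/MM/yy) are also valid java.time patterns. It does not reproduce how RNI itself parses or interprets the date.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper (not part of the RNI API) that builds a date field value. */
final class DateFieldValue {

    private DateFieldValue() { }

    /**
     * Returns date~format=pattern after checking that the date string can be
     * parsed with the pattern. Throws DateTimeParseException if it cannot.
     */
    static String withFormat(String date, String pattern) {
        // ASSUMPTION: the pattern letters are compatible with java.time patterns.
        LocalDate.parse(date, DateTimeFormatter.ofPattern(pattern));
        return date + "~format=" + pattern;
    }

    public static void main(String[] args) {
        System.out.println(withFormat("01/07/42", "dd/MM/yy"));
    }
}
```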

Fielded names

You can process names with data fields. Use "|" to separate the fields. For example, "Mr|Jon|Q|Smith" has four fields. You can define names with empty fields: in "|Jon|Q|Smith", the first field is ""; in "Mr|Jon||Smith", the third field is "".

You have the option of specifying that there is an unknown value in a field. To specify an unknown name field, replace the field with *?*.
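
As an illustration of the "|" separator, the following minimal sketch joins and splits fielded names in plain Java. The FieldedName class is hypothetical and is not part of the RNI API. Note the -1 limit passed to split, which keeps empty fields at the end of the name instead of discarding them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the RNI API) for "|"-separated name fields. */
final class FieldedName {

    private FieldedName() { }

    /** Joins name fields with "|"; an empty string produces an empty field. */
    static String join(List<String> fields) {
        return String.join("|", fields);
    }

    /** Splits a fielded name, preserving empty fields in every position. */
    static List<String> split(String fieldedName) {
        // "|" is a regex metacharacter and must be escaped.
        return Arrays.asList(fieldedName.split("\\|", -1));
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("Mr", "Jon", "", "Smith")));
        System.out.println(split("|Jon|Q|Smith").size());
    }
}
```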

Specifying name attributes

In addition to the name itself, an RNI Name object may contain attributes that you can specify when you index the name or perform a query.

Name Attribute	Example	Description
`language`	`language="kor"`	ISO 639-3 code for the language of use in which the name appears.
`hintLanguage`	`hintLanguage="jpn"`	Hint (ISO 639-3 code) for the language of use. Only used if `language` is not specified or "xxx". The hint is used if compatible with the script. Otherwise, RNI makes its own language guess.
`languageOfOrigin`	`languageOfOrigin="eng"`	ISO 639-3 code for the name's language of origin.
`script`	`script="Hani"`	ISO 15924 code for the script in which the name appears.
`entityType`	`entityType="PERSON"`	The entity type, such as "PERSON", "LOCATION" or "ORGANIZATION".
`uid`	`uid="4072"`	Unique identifier for the name.
`gender`	`gender="male"`	Explicit gender for the name, such as "male", "female", or "nonbinary".

Note

The entityType field in the query must match the entityType field in the indexed name. If the query does not specify an entity type, the indexed name must also not specify an entity type.

In the string that defines the content of a name field, place a tilde (~) after the name, followed by a comma-delimited list of attribute-value pairs.

Examples:

When posting a Solr document:

<field name="primaryName">Muhammad Ali~language=eng,languageOfOrigin=ara,entityType=PERSON</field>

In a query:

primaryName:"Muhamid Ali~language=eng,languageOfOrigin=ara,entityType=PERSON"

In a pairwise name match:

&rq={!rniRerank reRankQuery=$rrq} 
&rrq={!func}rniMatch(primaryName,"Muhammad Ali~language=eng,languageOfOrigin=ara,entityType=PERSON")
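
The name-plus-attributes string used in these examples can be built with a small helper. The following is a minimal sketch: the NameWithAttributes class is hypothetical, is not part of the RNI API, and does not handle names or attribute values that themselves contain a tilde or a comma.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of the RNI API) that appends attributes to a name. */
final class NameWithAttributes {

    private NameWithAttributes() { }

    /** Returns name~key=value,key=value, or the bare name if there are no attributes. */
    static String format(String name, Map<String, String> attributes) {
        if (attributes.isEmpty()) {
            return name;
        }
        // Prefix is the name plus the tilde; attribute pairs are comma-delimited.
        StringJoiner joiner = new StringJoiner(",", name + "~", "");
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("language", "eng");
        attributes.put("languageOfOrigin", "ara");
        attributes.put("entityType", "PERSON");
        System.out.println(format("Muhammad Ali", attributes));
    }
}
```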

Attributes in the bt Namespace. You can include bt attributes as query parameters. These attributes are then used in both the base query and the reRank pairwise match query. For example: &bt.language=jpn &bt.script=Kana

Setting Default Attribute Values. You can include name attributes in field or field type definition as defaults that can be overridden by individual name entries. For example:

<field name="primaryKoreanName" type="bt_rni_name" indexed="true" stored="true" multiValued="false"
       language="kor" script="Hang" entityType="PERSON"/>

Then you only need to include these attributes in name entries when you want to override the defaults.

Query enhancements

It is often necessary to query on other fields besides name fields, such as date of birth and address. The plugin enables the seamless integration of RNI into your Solr queries. To apply Boolean logic to queries, combine multiple fields with Boolean operators. The plugin supports all Boolean operators supported by the standard Lucene query parser (AND, OR, NOT, +, -). The OR operator is the default conjunction operator; if there is no Boolean operator between two terms (fields), the OR operator is used.

Example of a query with name and date fields:

primaryName:"Chuy Lopez A Deyas~entityType=PERSON" AND dateOfBirth:"1960-09-30"

Example of a query including an address field:

primaryName:"Chuy Lopez A Deyas~entityType=PERSON" AND residence:"road<Avenida Const. Pedro L Zavala 1957>
 house<Colonia Libertad>city<Culiacan>region<Sinaloa>postalCode<80180>country<Mexico>~fielded=true"
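
Queries like these are often assembled from user input. The following is a minimal sketch of building the query string in Java: the RniQuery class is hypothetical and is not part of the RNI API. It assumes the standard Lucene query parser rule that a double quote or backslash inside a quoted value is escaped with a backslash.

```java
/** Hypothetical helper (not part of the RNI API) that assembles a Boolean query string. */
final class RniQuery {

    private RniQuery() { }

    /** Returns field:"value", escaping backslashes and double quotes in the value. */
    static String phrase(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    /** Combines clauses with the AND operator. */
    static String and(String... clauses) {
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        String query = and(
                phrase("primaryName", "Chuy Lopez A Deyas~entityType=PERSON"),
                phrase("dateOfBirth", "1960-09-30"));
        System.out.println(query);
    }
}
```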

You can include name fields and other fields in your base query in conjunction with an RNI Solr reRank query and a custom valueSourceParser. The base query identifies candidate documents. The reRank query sends the top N candidates to the rniMatch valueSourceParser for pairwise matching. You can combine multiple fields in function queries which enable you to generate a relevancy score of those fields. The plugin supports all the functions available for function queries in Solr.

In a pairwise name match we return the maximum score of querying for primaryName and contactName:

&rq={!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
  &rrq={!func}max(rniMatch(primaryName, "Chuy Lopez A Deyas"), rniMatch(contactName, "Chuy Lopez"))

In a pairwise address match we return the maximum score of querying for primaryAddress and residence:

&rq={!rniAddrRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
  &rrq={!func}max(rniAddrMatch(primaryAddress, "road<Calle Lago Cuitzeo 1394>house<Colonia Las Quintas>
 city<Culiacan>region<Sinaloa>postalCode<80060>country<Mexico>~fielded=true"),
rniAddrMatch(residence, "road<Avenida Const. Pedro L Zavala 1957>house<Colonia Libertad>
 city<Culiacan>region<Sinaloa>postalCode<80180>country<Mexico>~fielded=true"))

In a pairwise date match we return the maximum score of querying for dateOfBirth and dob:

&rq={!rniDateRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
&rrq={!func}max(rniDateMatch(dateOfBirth, "01/07/42~format=dd/MM/yy"), rniDateMatch(dob, "1/7/1940"))

You can combine the score of multiple RNI fields of the same type where each field can be given a weight to reflect its importance in the overall matching logic.

For example, in a pairwise name match we can return the combined score of querying for aka and primaryName where aka has a weight of 0.3 and the remaining 0.7 is assigned to primaryName field:

&rq={!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0} 
  &rrq={!func}sum(linear(rniMatch(aka, "Jesus Alfonso Diaz"), 0.3, 0),
  linear(rniMatch(primaryName, "Jesus Diaz"), 0.7, 0))
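
To make the weighting concrete: Solr's linear(x, m, c) function computes m*x + c, so the function query above yields 0.3 times the aka score plus 0.7 times the primaryName score. The following minimal sketch reproduces that arithmetic and builds the same function query string. The WeightedMatch class is hypothetical, and the scores in the example are invented for illustration, not actual RNI output.

```java
import java.util.Locale;

/** Hypothetical illustration (not part of the RNI API) of the weighted score combination. */
final class WeightedMatch {

    private WeightedMatch() { }

    /** Mirrors Solr's linear(x, m, c) function: m * x + c. */
    static double linear(double x, double m, double c) {
        return m * x + c;
    }

    /** Combined score with a weight of 0.3 on aka and 0.7 on primaryName. */
    static double combined(double akaScore, double primaryNameScore) {
        return linear(akaScore, 0.3, 0) + linear(primaryNameScore, 0.7, 0);
    }

    /** Builds the rrq function query for the two query names. */
    static String functionQuery(String akaQuery, String primaryNameQuery) {
        return String.format(Locale.ROOT,
                "{!func}sum(linear(rniMatch(aka, \"%s\"), 0.3, 0), "
                        + "linear(rniMatch(primaryName, \"%s\"), 0.7, 0))",
                akaQuery, primaryNameQuery);
    }

    public static void main(String[] args) {
        // Invented example scores: 0.3 * 0.8 + 0.7 * 0.9
        System.out.println(combined(0.8, 0.9));
        System.out.println(functionQuery("Jesus Alfonso Diaz", "Jesus Diaz"));
    }
}
```

Because the weights sum to 1.0, the combined score stays in the same range as the individual match scores.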

Setting `reRank` parameters

The RNIReRankQParserPlugin provides parameters that you can set to customize your reRank query:

reRankDocs (an integer) specifies the maximum number of documents from the base query to pass to the RNI pairwise match. Use this parameter to limit the number of compute-intensive name matches that need to be performed, thus decreasing maximum query latency.
reRankMode ("add" or "replace") specifies whether the RNI match score is added to the Solr score (the default) or replaces the Solr score.
reRankWeight (a float) specifies the weighting of the maximum RNI pairwise match score when it is combined with the Solr score. This parameter is ignored if reRankMode is set to "replace". The RNI score, multiplied by the reRankWeight (the Solr default is 2.0), is added to the Solr score to provide the document score that is used to determine the ordering of the documents in the result set. Use this parameter to influence the role that the RNI pairwise match plays in the ordering of the result set. If you want to prioritize the RNI score and de-emphasize the Solr score, specify a large reRankWeight.
reRankDocsAllowance (a float from 0 to 1) controls the general proportion of documents from the base query to pass to the RNI pairwise match. This is used at query time to dynamically determine the number of documents to rescore based on the commonality of the query name in the index. Setting this to 1.0 will ensure that the maximum number of documents (reRankDocs) are always rescored. Use this parameter to limit the number of compute-intensive name matches that need to be performed, thus decreasing query latency.
scoreToRerankRestriction (a float from 0 to 1) influences the minimum similarity score, calculated based on the results of the base query, that documents returned by the base query must have in order to be passed to the RNI pairwise match for rescoring.
reRankFilter (a Solr query) further filters any results from the main query from being passed to the RNI pairwise match.

In the following example, pairwise matching is performed on the top 200 names returned by the base query, and the RNI score is multiplied by 3 before it is added to the Solr score.

q=primaryName:"Lopez Diaz"
&fl=primaryName,aka,score
&rq={!rniRerank reRankQuery=$rrq reRankDocs=200 reRankWeight=3}
&rrq={!func}rniMatch(primaryName, "Lopez Diaz")
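
The effect of reRankMode and reRankWeight on the final document score can be illustrated with a few lines of arithmetic. The following is a minimal sketch of the combination described above, not the plugin's implementation: the RerankScore class is hypothetical, and the Solr and RNI scores in the example are invented.

```java
/** Hypothetical illustration (not the plugin's code) of how the document score is combined. */
final class RerankScore {

    enum Mode { ADD, REPLACE }

    private RerankScore() { }

    /**
     * ADD: the RNI score multiplied by the weight is added to the Solr score.
     * REPLACE: the RNI score replaces the Solr score and the weight is ignored.
     */
    static double documentScore(double solrScore, double rniScore, Mode mode, double reRankWeight) {
        if (mode == Mode.REPLACE) {
            return rniScore;
        }
        return solrScore + reRankWeight * rniScore;
    }

    public static void main(String[] args) {
        // Invented scores: Solr score 1.5, RNI score 0.8, reRankWeight 3
        System.out.println(documentScore(1.5, 0.8, Mode.ADD, 3.0));
        System.out.println(documentScore(1.5, 0.8, Mode.REPLACE, 3.0));
    }
}
```

With reRankWeight=3, as in the example query, the RNI contribution (3 x 0.8) outweighs the Solr score in the final ordering.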

Example with Solr Admin

This example walks you through the steps for using the Solr 9 Admin example to perform queries.

Basic Procedure

Download and expand Solr 9.6.0.
Start the Solr webserver.
You can point it at a Solr core included in the RNI package that contains the OFAC list already indexed. From the solr-9.6.0 directory, run the following:
```
bin/solr -f -s $BT_ROOT/rlpnc/data/rnm/sample/ofac_solr_home -a \
"-Dbt.root=$BT_ROOT -Djava.security.manager=allow" -m 2g
```
Use a Web browser to navigate to http://localhost:8983/solr/#/collection1/query. This form provides the full interface for submitting queries in Solr Admin.

Submit a Solr Query

Fill in the q (query) textbox with a query that includes a name string and a date-of-birth range starting at 9/30/1960:
```
name:"Chuy Lopez A Deyas~entityType=PERSON" AND dateOfBirth:[1960-09-30T00:00:00Z TO *]
```

Fill in the fl (fields to return) textbox:

name,aka,dateOfBirth,address,nationality,score

Set raw query parameters to define the reRankQuery:

&rq={!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0}
&rrq={!func}rniMatch(name, "Chuy Lopez A Deyas~entityType=PERSON")

Click Execute Query.

Solr Admin displays a response.

For this query, Solr returns the appropriate Diaz document.

{
  "responseHeader": {
    "status": 0,
    "QTime": 387,
    "params": {
      "rrq": "{!func}rniMatch(name, \"Chuy Lopez A Deyas~entityType=PERSON\")",
      "q": "name:\"Chuy Lopez A Deyas~entityType=PERSON\" AND dateOfBirth:[1960-09-30T00:00:00Z TO *]",
      "fl": "name,aka,dateOfBirth,address,nationality,score",
      "_": "1631115490163",
      "rq": "{!rniRerank reRankQuery=$rrq reRankMode=replace reRankWeight=1.0} "
    }
  },
  "response": {
    "numFound": 145,
    "start": 0,
    "maxScore":0.6820811,
    "numFoundExact":true,
    "docs": [
      {
        "name": "Jesus Alfonso LOPEZ DIAZ~uid=10353,entityType=PERSON",
        "address": [
          "c/o ESTABLO PUERTO RICO S.A. DE C.V.\nCuliacan Sinaloa\nMexico",
          "Avenida Const. Pedro L Zavala 1957\nColonia Libertad\nCuliacan Sinaloa 80180\nMexico"
        ],
        "nationality": [
          "Mexico"
        ],
        "dateOfBirth": [
          "1962-09-30T00:00:00Z"
        ],
        "score": 0.6820811
      },
      ...
    ]
  }
}

Example using the `solrj` API

You can use the org.apache.solr.client.solrj API to integrate the RNI Solr plugin into a Solr application. The RNI-RNT SDK ships with an example, RNISolrjSample, that illustrates this integration. The sample also illustrates a procedure for posting Solr documents from an XML file.

The basic steps are as follows:

Add bt-rni-solr8.11-plugin.jar (distributed in rlpnc/data/rnm/sample/solr_shared_lib/lib) to the classpath.
Set solr.solr.home to a Solr home directory that contains a collection with a schema.xml and solrconfig.xml modified as described in the previous sections.
Instantiate a SolrServer and use it to add documents to a Solr index. The documents should contain one or more name fields along with any other fields of interest. Name, address, and date fields may be multivalued.
Define a Solr query that involves name fields and other fields of interest, and that reranks the documents according to RNI's pairwise name match score.
Run the query and examine the documents that are returned.

The following sample code snippets use these imports:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.core.CoreContainer;

Setup:

// Set the bt.root property to point to the RNI installation
String btRoot = args[0];
System.setProperty("bt.root", btRoot);
// Set solr.solr.home to the parent of a collection1/conf directory that contains
// a modified schema.xml and solrconfig.xml.
String solrHome = btRoot + "/rlpnc/data/rnm/sample/solr8x_home";
System.setProperty("solr.solr.home", solrHome);
CoreContainer coreContainer = new CoreContainer(solrHome);
coreContainer.load();
// For simplicity, use an embedded SolrServer rather than an HTTPSolrServer.
EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "");

Add a Solr document with fields of interest, including name fields:

SolrInputDocument doc = new SolrInputDocument();
// Primary name field
doc.addField("primaryName", "Midiam Patricia ZAMBADA NIEBLA");
// Multivalued also-known-as name field
doc.addField("aka", "Midian Patricia ZAMBADA NIEBLA");
doc.addField("aka", "Miriam ZAMBADA NIEBLA");
doc.addField("aka", "Midian Patricia LOPEZ LANDEY");
doc.addField("id", "3");
// Entity id field.
doc.addField("uid", "10358");
// Date field
doc.addField("dob", "1971-03-04");
// Address field
doc.addField("address", "road<Calle Lago Cuitzeo 1394>house<Colonia Las Quintas>"
             + "city<Culiacan>region<Sinaloa>postalCode<80060>"
             + "country<Mexico>~fielded=true");
doc.addField("nationality", "Mexico");

// Add the document to the index.
server.add(doc);

Commit updates:

// When you have completed updates, commit the updates.
server.commit();

Define and run a query against name and other fields, using RNI pairwise matching to rerank the documents returned:

// Define a query that combines name fields and other fields, and uses RNI
// pairwise matching to rerank the documents returned.
String queryName = "Chuy A Lopez";
SolrQuery solrQuery = new SolrQuery("aka" + ":\"" + queryName +
 "\" AND dob:[1960-09-30T00:00:00Z TO *]");
// Set the rerank query parser parameters
solrQuery.set(CommonParams.RQ,
 "{!rniRerank reRankQuery=$rrq reRankDocs=100 reRankWeight=1}");
// Create a rerank query that uses the RNI pairwise matching function
solrQuery.set("rrq", "{!func}rniMatch(" + "aka" + ", \"" + queryName + "\")");
// Set which fields to include in the results
solrQuery.setFields("uid", "primaryName", "address", "dob", "score");
QueryResponse qResults = server.query(solrQuery);
//QueryResponse qResults = server.query(createQuery());

Define and run a query against name, address, and other fields, using RNI pairwise matching to rerank the documents returned:

// Define a query that combines name, address and other fields, and uses RNI
// pairwise matching to rerank the documents returned.
String queryName = "Chuy A Lopez";
String queryAddress = "road<Avenida Const. Pedro L Zavala 1957>house<Colonia Libertad>"
           + "city<Culiacan>country<Mexico>~fielded=true";
SolrQuery solrQuery = new SolrQuery("aka" + ":\"" + queryName +
                    "\" AND dob:\"1960-09-30\"");
// Set the rerank query parser parameters
solrQuery.set(CommonParams.RQ,
        "{!rniAddrRerank reRankQuery=$rrq reRankMode=replace reRankDocs=100 reRankWeight=1}");
// Create a rerank query that uses the RNI pairwise matching function
solrQuery.set("rrq", "{!func}rniAddrMatch(" + "address" + ", \"" + queryAddress + "\")");
// Set which fields to include in the results
solrQuery.setFields("uid", "primaryName", "address", "dob", "score");
QueryResponse qResults = server.query(solrQuery);
//QueryResponse qResults = server.query(createQuery());

Define and run a query against name, date, and other fields, using RNI pairwise matching to rerank the documents returned:

// Define a query that combines name, date and other fields, and uses RNI
// pairwise matching to rerank the documents returned.
String queryName = "Chuy A Lopez";
String queryDate = "04/03/1971~format=dd/MM/yyyy";
SolrQuery solrQuery = new SolrQuery("aka" + ":\"" + queryName +
                    "\" AND dob:\"1960-09-30\"");
// Set the rerank query parser parameters
solrQuery.set(CommonParams.RQ,
          "{!rniDateRerank reRankQuery=$rrq reRankMode=replace reRankDocs=100 reRankWeight=1}");
// Create a rerank query that uses the RNI pairwise matching function
solrQuery.set("rrq", "{!func}rniDateMatch(" + "dob" + ", \"" + queryDate + "\")");
// Set which fields to include in the results
solrQuery.setFields("uid", "primaryName", "address", "dob", "score");
QueryResponse qResults = server.query(solrQuery);
//QueryResponse qResults = server.query(createQuery());

Display the results:

// Print information about the documents returned with their Solr score.
for (SolrDocument rdoc : qResults.getResults()) {
 System.out.println("Returned Entity: " + rdoc.getFieldValue("uid")+ 
       "\n Name: " + rdoc.getFieldValue("primaryName") + 
       "\n Address: " + rdoc.getFieldValue("address") + 
       "\n DOB: " + rdoc.getFieldValue("dob") + 
       "\n Document Score: " + rdoc.getFieldValue("score"));
}

For convenience utilities for working with RNI names in a Solrj environment, see the Javadoc for com.basistech.rni.solr.index.

Translating names

Rosette Name Translator (RNT) supports name translation in complex, non-Latin languages, such as Arabic and Chinese. See Supported languages of origin for the complete list of supported languages and scripts. RNT supports multiple transliteration standards for translating from non-Latin scripts to English.

Text domains

Rosette Name Translator translates a name from one text domain to another. A text domain is specified by three parameters:

Language (ISO 639)
The language of the document in which the name is found.
Writing script (ISO 15924)
The script used to represent the name, such as the Latin alphabet, Arabic script, or Chinese Han characters.
Transliteration scheme
The transliteration system in which the name is represented. If the name is in its native script, the transliteration scheme is native.

The source domain is the text domain of the document in which the name is found. The target domain is the text domain to which the name is to be translated.

Supported translation domains provides a list of supported source and target domains.
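
As an illustration of the three parameters that make up a text domain, the following minimal sketch models a domain as a plain value class. The class TextDomainSketch and its fields are hypothetical and are not part of the RNT API; the bracketed [script/language/scheme] form mirrors the notation used for domain pairs elsewhere in this guide, and the lowercase scheme code "ic" is an assumption.

```java
final class TextDomainSketch {
    final String language; // ISO 639 language code, for example "ara"
    final String script;   // ISO 15924 script code, for example "Arab"
    final String scheme;   // transliteration scheme, "native" for native script

    TextDomainSketch(String language, String script, String scheme) {
        this.language = language;
        this.script = script;
        this.scheme = scheme;
    }

    // True when the name is written in its native script.
    boolean isNative() {
        return "native".equals(scheme);
    }

    @Override
    public String toString() {
        return "[" + script + "/" + language + "/" + scheme + "]";
    }

    public static void main(String[] args) {
        // Source: Arabic language, Arabic script, native transliteration scheme.
        TextDomainSketch source = new TextDomainSketch("ara", "Arab", "native");
        // Target: English language, Latin script, IC transliteration scheme.
        TextDomainSketch target = new TextDomainSketch("eng", "Latn", "ic");
        System.out.println(source + " -> " + target);
    }
}
```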

Types of translation

The type of translation depends on the characteristics of the source and target domains and the language of origin of the name to be translated. RNT supports the following types of translations:

Translation of a person name to English

How the name is translated depends on whether the language of origin of the name is the source language.

If the language of origin of the person name is the same as the source language, the name is translated according to the specified target transliteration scheme. For example, a Japanese name that appears in Japanese text.
If the language of origin is not the source language, the name is translated to its conventional English form. For example, a non-Japanese name that appears in Japanese.

RNT supports the following translations of names to their conventional English representation:

non-Arabic names that appear in the Arabic language
non-Chinese names that appear in the Chinese language
non-Hebrew names that appear in the Hebrew language
non-Japanese names that appear in the Japanese language
non-Korean names that appear in the Korean language
non-Russian names that appear in the Russian language

Use the languageOfOrigin Name field to inform RNT that the language of origin is not the language of use in which the name appears.

If the language of origin is Unknown (the default), the language model may classify the name as foreign (for Japanese, the script must be Katakana).
If the language of use is Japanese, the script is Kanji, and the language of origin is Chinese or Korean, RNT attempts to translate the name, using Pinyin for Chinese, and Revised Romanization of Korean for Korean.
If the language of use is Chinese and the language of origin is anything other than Chinese, RNT attempts to translate the name to its standard English representation.
If the language of use is Korean, the script is Hangul, and the language of origin is any language other than Korean, RNT attempts to translate the name to its standard English representation.
For other languages, RNT uses the specified target transliteration scheme to transliterate the name to Latin script, regardless of whether or not the name is etymologically native to the respective source language.

Example - Arabic:

Source domain: Arabic language, Arabic script, native transliteration scheme.
Target domain: English language, Latin script, IC transliteration scheme.
The translation of جورج بوش is George Bush. Note: The IC transliteration is Jwrj Bwsh.
The translation of صفية طالب السهيل (an Arabic name) is the IC transliteration: Safiyyah Talib al-Suhayl.

Example - Pashto with IC transliteration scheme:

For Pashto, if you are using the IC transliteration scheme and the language of origin is Afghan Persian, RNT provides special handling of two short vowels, using 'e' and 'o' in place of 'i' and 'u', as designated in the IC Pashto Standardized Transliteration System for Personal Names.
Source domain: Pashto language, Arabic script, native transliteration scheme.
Target domain: English language, Latin script, IC transliteration scheme.
The standard translation of اسحاق is Ishaq. If the language of origin is Afghan Persian, the translation is Eshaq.

Example - Japanese, Katakana:

Source domain: Japanese language, Katakana script, native transliteration scheme.
Target domain: English language, Latin script, Hebon transliteration scheme.
The translation of ウィリアム・シェイクスピアー is William Shakespeare. Note: The Hebon transliteration is Iriamu Shieikusupiaa.

Example - Japanese, Kanji:

Source domain: Japanese language, Kanji script, native transliteration scheme.
Target domain: English language, Latin script, Hebon transliteration scheme.
With Chinese as the language of origin, the translation (Pinyin transliteration) of 温家宝 is Wen Jiabao. Note: The Hebon transliteration of 温家宝 is On Kahou.

Example - Russian:

Source domain: Russian language, Cyrillic script, native transliteration scheme.
Target domain: English language, Latin script, BGN transliteration scheme.
The translation of Маргарет Этвуд is Margaret Atwood. Note: The BGN transliteration is Margaret Etvud.
The translation of Алекса́ндр Солжени́цын (a Russian name) is the BGN transliteration: Aleksándr Solzhenítsyn.

Example - Thai:

Source domain: Thai language, Thai script, native transliteration scheme.
Target domain: English language, Latin script, ISO11940_2_2007 transliteration scheme.
The translation of นายก รัฐมนตรี (a Thai name) is the ISO11940_2_2007 transliteration: Nayok Ratthamontri.

Example - Greek:

Source domain: Greek language, Greek script, native transliteration scheme.
Target domain: English language, Latin script, ISO843_1997 transliteration scheme.
The translation of Γεώργιος Αθανασιάδης-Νόβας (a Greek name) is the ISO843_1997 transliteration: Geōrgios Athanasiadīs-Novas.

Example - Hebrew:

Source domain: Hebrew language, Hebrew script, English language of origin, native transliteration scheme.
Target domain: English language, Latin script, ISO259_2_1994 transliteration scheme.
The translation of ברברה סטרייסנד is Barbara Streisand. Note: The ISO259_2_1994 transliteration is Brbrah Sṭriysnd.
Note that the translation to Barbara Streisand will be returned only if the user specifies the language of origin as English.

Translation from Native script to Latin Script

This is used when the source script and the transliteration scheme are native while the target script is Latin, the transliteration scheme is something other than native, and the language of origin of the name is native.

Examples:

Source domain: Arabic language, Arabic script, native transliteration.
Target domain: English language, Latin script, IC transliteration.
The translation of صفية طالب السهيل is Safiyyah Talib al-Suhayl.

Reverse transliterations from Latin script to native script

Some transliteration schemes provide enough information to enable reverse transliteration, going from English and Latin script to a native script.

Examples:

Source domain: English language, Latin script, Basis transliteration.
Target domain: Arabic language, Arabic script, native transliteration.
The translation of naayif abuu sharkh is نَايِف أَبُو شَرْخ.

Source domain: English language, Latin script, Basis transliteration.
Target domain: Russian language, Cyrillic script, native transliteration.
The translation of Dmitry Medvedev is Дмитрий Медведев.

Standardization of Arabic-origin names in English

This translation takes a name in English that is of Arabic origin and translates the Arabic components according to the specified transliteration scheme.

Example:

Source domain: English language, Latin script, native transliteration.
Target domain: English language, Latin script, IC transliteration.
The IC standardization of Moustephah Ehmed ben Samire is Mustafa Ahmad Bin-Samir.

Orthographic completion

This is available if the source and target languages are Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu, the source and target scripts are Arabic or Hebrew, and the source and target transliteration schemes are native.

Language: Source and target language is Arabic, Hebrew, Iranian Persian, Afghan Persian, Pashto, or Urdu.
Script: Source and target script is Arabic or Hebrew script.
Transliteration: Source and target transliteration scheme is native.

In conventional Arabic and Hebrew script, short vowels and other diacritics are not included. In orthographic completion, the translator attempts to vocalize the names by adding the short vowels and other diacritics that don't appear in conventional Arabic and Hebrew script.

You can also perform orthographic completion as part of the translation process. See Translation options.

Segmentation

Arabic, Chinese, Japanese, and Korean names are often unsegmented; that is, there are no spaces between the words in the name. The translator attempts to segment the unsegmented names by adding spaces between the words in the name.

Segmentation is available when the source and target languages are Arabic, Chinese, Japanese, or Korean and the source and target transliteration schemes are native.

You can also perform segmentation as part of the translation process. See Translation options.

Variant Latin-Script representations of a name in non-Latin Script

Language: Source and target language is Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu.
Script: Source script is Arabic script; the target script is Latin script.
Transliteration: Source transliteration scheme is native; the target transliteration scheme is folk.

The variants of a multi-word name include the cross product of the variants of each word. If, for example, each word in a two-word name has 10 variants, the name has 100 variants. Accordingly, it is a good idea to translate one word at a time. A two-word translation must produce 100 variants to provide the same information as two one-word translations producing 10 variants each.

Use the com.basistech.rnt.ITranslator setMaximumResults method to control the number of variants that are returned.

Example:

Source domain: Arabic language, Arabic script, native transliteration.
Target domain: Arabic language, Latin script, folk transliteration.
نبيل شعث contains two words. Ten variants of نبيل are nabil, nabile, nabille, nabeel, nabiyl, nabiyle, nabiylle, nebil, nebile, nebille. Ten variants of شعث are sha`ath, sha'ath, shaath, sha`th, sha'th, shath, cha`ath, cha'ath, chaath, cha`th. These translations use the orthographic completion option, which is turned on by default.
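
The growth in the number of variants can be seen in a few lines of plain Java. The following sketch does not call RNT; it simply forms the cross product of the per-word variant lists from the example above, using a hypothetical helper class named VariantCrossProduct.

```java
import java.util.ArrayList;
import java.util.List;

final class VariantCrossProduct {

    // Combines the variants of each word into every possible full-name variant.
    static List<String> crossProduct(List<List<String>> perWordVariants) {
        List<String> names = new ArrayList<>();
        names.add("");
        for (List<String> wordVariants : perWordVariants) {
            List<String> next = new ArrayList<>();
            for (String prefix : names) {
                for (String variant : wordVariants) {
                    next.add(prefix.isEmpty() ? variant : prefix + " " + variant);
                }
            }
            names = next;
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> first = List.of("nabil", "nabile", "nabille", "nabeel", "nabiyl",
                "nabiyle", "nabiylle", "nebil", "nebile", "nebille");
        List<String> second = List.of("sha`ath", "sha'ath", "shaath", "sha`th", "sha'th",
                "shath", "cha`ath", "cha'ath", "chaath", "cha`th");
        List<String> all = crossProduct(List.of(first, second));
        // Two words with 10 variants each yield 100 full-name variants.
        System.out.println(all.size() + " full-name variants; first: " + all.get(0));
    }
}
```

Translating each word separately returns the same information in 20 results, which is why one-word translations are usually the more economical choice.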

Automated translation

With automated translation, the client provides one or more names, input and output text domains, and types of translation desired. For each name, the application generates a list of translations and associated confidence scores.

Automated usage model for performing RNT translations:

Set up your environment.
You must define the directory in which you installed RNI-RNT ($BT_ROOT), and instantiate an Environment object. See Handling the Runtime Environment.
Create a Translator Factory and use it to instantiate a Translator.
A given translator can perform translations from one source text domain to one target text domain. The Java API includes support for creating a Translator wrapper that can handle multiple source and target domains.
Set translation options (or use the default option settings).
For a listing of the source and target language domains to which each of these translation options applies, see Supported Translation Option Domains.
Use the Translator to translate names from the source domain to the target domain.
Handle the list of one or more translation results that the Translator generates for each translation. Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest. The sum of the confidence scores for all the results that the Translator can generate is less than or equal to 1.0.
Release resources, such as the Translator and Environment.
```
translator.close();
environment.close();
```

Multithreading

RNT translators are multithreadable.

Using the API

Java Packages: The RNT classes are in com.basistech.rnt and com.basistech.rnt.options (translation options). Utility classes that RNT uses are in com.basistech.util.

Note: Unqualified class names that appear in this section are in the com.basistech.rnt and com.basistech.rnt.options packages.

For detailed information about the API, see the Java API Reference.

Sample

For a sample Java application that translates a name, see AutomatedTranslationSample.

Creating a translator

RNT provides a factory class for creating translators. The factory is responsible for instantiating the correct RNT internal implementation class, which may vary depending on the source and target text domains you specify. For a table that maps input domains to output domains, see Supported Translation Domains.

The following fragment uses the factory to create a translator for translating names from Arabic documents in Arabic script to their standard English form in Latin script, using the IC transliteration scheme:

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/create_translator.java

When you are done using the translator, close it:

translator.close();

To create a wrapper object that packages a number of Translators, use RuleSetTranslator and define a list of TranslationRules. Each TranslationRule specifies the transliteration scheme for the specified language domain and entity type (NEConstants.NE_TYPE_NONE for all entity types).

Translation options

The translation options are defined in the package com.basistech.rnt.options.

Orthographic Completion. Class: CompleteOrthographyOption
For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to perform orthographic completion (add short vowels and other diacritics) prior to translation. The translator infers an orthographic completion if it cannot locate the name in its Arabic dictionary. For Pashto or Urdu, the translator omits orthographic completion for any named elements it cannot locate in the appropriate dictionary. The default setting for this option is true.
Given the lack of clear diacritization standards for Iranian Persian, Afghan Persian, Pashto and Urdu, the orthographic completion for these languages reflects BasisTech standards to assist the translation process, and is not intended for external use.
Suppose you are processing نايف أبو شرخ. As is the case in conventional Arabic, this text is not vocalized. For some transliteration schemes (such as IC) the transliteration of unvocalized Arabic is undefined. The translator produces NAyf 'Bw Shrkh. With the orthographic completion option, the Translator adds the missing vowels (giving نَايِف أَبُو شَرْخ) and produces the correct IC transliteration: Nayif Abu-Sharkh.
Orthographic completion is performed for Hebrew names in Hebrew script, but it is not controlled by this option. It is always enabled.
For supported languages and scripts, see Orthographic Completion.
Orthographic Minimization. Class: MinimizeOrthographyOption
For Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu names in Arabic script, you can set the option to devocalize (remove short vowel diacritics). The source domain and the target domain must be one of these languages, Arabic script, and Native transliteration. You can use this option to generate the Arabic script representation of names found in most media, such as news articles. The default setting for this option is false.
For supported languages and scripts, see Orthographic Minimization.
Statistical Methods. Class: StatisticalMethodsOption
For personal names in Arabic, Hebrew, Japanese, and Russian, use statistical methods to establish information that is not found in a dictionary. Statistical methods are used to do the following:
- Classify unknown personal names as native or foreign.
- (Arabic, Hebrew) Vocalize unknown personal names classified as native, or unknown personal names classified as foreign.
- (Arabic, Hebrew, Japanese, Russian) Translate unknown personal names classified as foreign.
For Arabic this option is set to true by default. If statistical methods are turned off, performance with Arabic input is faster, but for personal names not found in its dictionary, the translator can only mechanically transliterate the input.
For Hebrew, Japanese, and Russian, statistical analysis is always performed.
For supported languages and scripts, see Statistical Methods.
Performance Tradeoff. Class: PerformanceTradeoff
For personal names in Arabic, Japanese, or Russian, you can control the tradeoff the translator makes between speed and correctness when it is performing statistical analysis. For Arabic, statistical methods must be turned on (the default). As mentioned above, statistical analysis is always performed for Japanese and Russian. Four settings are defined in com.basistech.rnt.options.TradeoffEnum:
FAST
Greatest speed; least correctness
NORMAL
Even tradeoff between speed and correctness (the default)
CAREFUL
More correctness; less speed
PRECISE
Greatest correctness; least speed
For supported languages and scripts, see Performance Tradeoff.
Segmentation. Class: SegmentOption
For Chinese, Japanese, or Korean names, you can set the option to segment unsegmented names. Unsegmented Thai names are also segmented, but that behavior is not controlled by this option; it is always enabled.
Suppose you are processing 胡錦濤. This name is not segmented and the Pinyin transliteration is hujintao. With the segmentation option, the Translator segments the name into 胡 and 錦濤, and produces the correct Pinyin transliteration: hu jintao.
For Korean, the name may be in Hangul or Han script. For example, the following Hangul and Han representations of the same name are not segmented: 김정일 and 金正日. With the segmentation option, the Translator uses Hangul to segment either of these forms into 김 and 정일.
By default, the Segmentation option is set to true.
For supported languages and scripts, see Segmentation.
Normalization. Class: NormalizeOption
The normalization option applies to Arabic, Chinese, and Japanese names.
For Arabic native names, the normalizer applies a set of standardization rules. For example, the normalizer inserts a space in عبدالمجيد, producing the more standard representation: عبد المجيد (the IC transliteration is 'Abd-al-Majid).
For Chinese, normalization converts any characters in the traditional Chinese variant to the simplified Chinese variant (the standard for China). For example, the normalizer converts 張 to 张.
For Japanese names, normalization converts Kanji variants (including old Kanji) to their standard form. For example, the normalizer converts 亞 to 亜.
By default, the Normalization option is set to true.
For supported languages and scripts, see Normalization.
Pashto IC: Variant Spelling and Region. For Pashto, when applying the IC standard, these two options implement variations specified in the IC Pashto Standardized Transliteration System for Personal Names. For supported languages and scripts, see Variant Spelling and Region.
Korean Geography. For Korean, when applying the BGN standard, the standard that is actually used depends on the Korean Geography Option. For North Korea (the default) McCune-Reischauer is used. For South Korea, Revised Romanization of Korean is used. For supported languages and scripts, see Korean Geography.

Performing a translation

You can set various parameters for the ITranslator, and you must instantiate an ITranslatable object with which the Translator performs the translation.

ITranslator Methods for Setting Translation Parameters

void setMaximumResults (int maxResults)

Sets the maximum number of candidate translations that RNT generates. If you are only interested in the best or most likely result, set this to 1.

void setMinimumConfidence (double confidence)

Sets the minimum confidence score that a candidate translation must have to be returned. Each result is tagged with a confidence score between 0 and 1.0. The higher the number, the higher the confidence that this is a result of interest.

<T> void setOption (optionValue)

Each option is defined by a class. The options are defined in com.basistech.rnt.options.

By default, PerformanceTradeoff is set to TradeoffEnum.NORMAL, VariantSpellingOption is false, RegionOption is RegionEnum.DEFAULT (region unknown), and the other options are set to true. You can use this method to reset an option. For example:

setOption(new CompleteOrthographyOption(false));
setOption(new PerformanceTradeoff(TradeoffEnum.FAST));

RNT performs the translation on an ITranslatable object. An ITranslatable object contains several properties: data (the name), language, script, and entity type (person, location, organization, etc.). Language and script should match the language and script of the source text domain. Entity type may be unknown (com.basistech.util.NEConstants.NE_TYPE_NONE). The ITranslatable object may be extended to include additional information, such as geocoordinates for locations. For more information, see the Javadoc for the implementation of ITranslatable: com.basistech.rni.match.Name.

The following example translates an Arabic name: "صفية طالب السهيل".

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/translate.java

Inspecting translation results

The ITranslator translate() method returns a list of TranslationResult objects.

Each TranslationResult provides access to the translation, the confidence associated with that translation (a double from 0 to 1.0), and may provide additional information with an associated confidence score. The sum of the confidence of all the results returned by a translation is less than or equal to 1.0. The additional information may include orthographic completion (the diacritization of names in Arabic or Hebrew script), segmentation (of names in Chinese, Korean, Japanese, or Thai), and language of origin (for Arabic, Chinese, or Japanese Katakana script). By default, these options are set to true, in which case the Translator attempts to infer the additional information. You can turn off one or all of these options.

For names in Arabic script, orthographic completion means the addition of short-vowel markers and other diacritics that are absent in conventional Arabic script but required for accurate transliteration.

https://raw.githubusercontent.com/basis-technology-corp/rosette-sample-code/master/rni-rnt/inspect_translation_results.java

This translation returns a list containing one result.

Result	Value
Translation (IC Transliteration)	Safiyyah Talib al-Suhayl
Translation Confidence	1.0
Orthographic Completion (Diacritization)	صَفِيَّة طَالِب اَلسُّهَيْل
Orthographic Completion Confidence	1.0
Language of Origin	`LanguageCode.ARABIC`
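
When a translation returns more than one result, client code typically picks the best candidate and may want to check how much of the total confidence the returned list accounts for. The sketch below illustrates this with a hypothetical Candidate class that stands in for a translation and its confidence; it is not the RNT TranslationResult class, and the sample values in main are invented for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class ResultSelectionSketch {

    // Hypothetical stand-in for a translation result; not an RNT class.
    static final class Candidate {
        final String translation;
        final double confidence; // between 0 and 1.0

        Candidate(String translation, double confidence) {
            this.translation = translation;
            this.confidence = confidence;
        }
    }

    // Returns the highest-confidence candidate at or above the threshold, if any.
    static Optional<Candidate> best(List<Candidate> candidates, double minimumConfidence) {
        return candidates.stream()
                .filter(c -> c.confidence >= minimumConfidence)
                .max(Comparator.comparingDouble((Candidate c) -> c.confidence));
    }

    // Sums the confidence of the returned candidates; expected to be at most 1.0.
    static double totalConfidence(List<Candidate> candidates) {
        return candidates.stream().mapToDouble(c -> c.confidence).sum();
    }

    public static void main(String[] args) {
        // Invented sample data for illustration only.
        List<Candidate> results = List.of(
                new Candidate("first candidate", 0.7),
                new Candidate("second candidate", 0.2));
        best(results, 0.5).ifPresent(c -> System.out.println("Best: " + c.translation));
        System.out.println("Total confidence: " + totalConfidence(results));
    }
}
```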

Overriding name pair translations

You can create UTF-8 files that specify how names are to be translated. The filenames specify the language of the source and target domains, and may specify an entity type. The file entries specify the text of the source and target names, the script of the target domain (required if the target language may be written in multiple scripts), and may specify a confidence score for the translation. RNT applies Unicode NFD normalization to the name strings, and performs the translations specified in these files. RNT supports both fullname and token overrides.
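
Because the name strings are normalized to Unicode NFD, two entries that look identical but use different code point sequences are treated as the same name. The following sketch uses the standard java.text.Normalizer class to show the effect; the class name NfdNormalizationSketch and the sample name are illustrative only, and the sketch does not reproduce RNT's internal processing.

```java
import java.text.Normalizer;

final class NfdNormalizationSketch {

    // Converts a string to Unicode Normalization Form D (canonical decomposition).
    static String toNfd(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "Jos\u00E9";  // accented letter as a single code point
        String decomposed = "Jose\u0301";  // base letter followed by a combining acute accent
        // The raw strings differ, but their NFD forms are identical.
        System.out.println(precomposed.equals(decomposed));
        System.out.println(toNfd(precomposed).equals(toNfd(decomposed)));
    }
}
```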

Filenames. The filenames use ISO 639 three-letter codes to specify the language of the source domain and the language of the target domain. The filename may also specify an entity type.

fullnames_SRCLANG_TARGETLANG[_TYPE].txt

tokens_SRCLANG_TARGETLANG[_TYPE].txt

For example, fullnames_eng_zho_PERSON.txt would contain entries for translating English PERSON names to Chinese. fullnames_ara_eng.txt would contain entries for translating Arabic names of any or no entity type to English.
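
The naming convention can be checked mechanically. The following sketch parses an override filename into its parts with a regular expression. The class OverrideFileName is hypothetical and not part of RNT, and the pattern assumes lowercase three-letter language codes and an uppercase entity type, as in the examples above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverrideFileName {
    // Assumed shape: (fullnames|tokens)_SRCLANG_TARGETLANG[_TYPE].txt
    private static final Pattern NAME = Pattern.compile(
            "(fullnames|tokens)_([a-z]{3})_([a-z]{3})(?:_([A-Z]+))?\\.txt");

    final String kind;
    final String sourceLang;
    final String targetLang;
    final String entityType; // null when the filename has no entity type

    private OverrideFileName(String kind, String sourceLang, String targetLang, String entityType) {
        this.kind = kind;
        this.sourceLang = sourceLang;
        this.targetLang = targetLang;
        this.entityType = entityType;
    }

    static OverrideFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an override filename: " + fileName);
        }
        return new OverrideFileName(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"fullnames_eng_zho_PERSON.txt", "fullnames_ara_eng.txt"}) {
            OverrideFileName f = parse(name);
            System.out.println(f.kind + ": " + f.sourceLang + " -> " + f.targetLang
                    + ", entity type: " + (f.entityType == null ? "any" : f.entityType));
        }
    }
}
```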

Sample Fullname Override Files. fullnames_ara_eng_LOCATION.txt, fullnames_jpn_eng_LOCATION.txt, fullnames_rus_eng_LOCATION.txt contain entries respectively for translating LOCATION names from Arabic, Japanese, and Russian to English. Entity-specific override files are additive to non-entity type overrides. Fullname overrides only take effect if the entire contents of an entry in the first column match the entire input name.

Sample Token Override Files. tokens_ara_eng_ORGANIZATION.txt, tokens_jpn_eng_ORGANIZATION.txt, tokens_rus_eng_ORGANIZATION.txt contain entries respectively for translating ORGANIZATION names from Arabic, Japanese, and Russian to English. Entity-specific override files are additive to non-entity type overrides.

File entries. Each row in the file, except for rows beginning with #, contains tab-delimited fields with source name, target name, target script (not required if the target language is only written in one script), and optional confidence score.

source name Tab target name[ Tab target_script] [Tab confidence_score]

The confidence score must be between 0 and 1.0. If it is not included, RNT sets the confidence score to 1.0.
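
To make the row format concrete, the following sketch parses a single row into its fields. It is a hypothetical helper, not RNT's own parser, and it makes one assumption the format description leaves open: an optional field that parses as a number is treated as the confidence score, and any other optional field is treated as the target script.

```java
import java.util.Optional;

final class OverrideEntry {
    final String source;
    final String target;
    final String targetScript; // null when the row omits it
    final double confidence;   // 1.0 when the row omits it

    private OverrideEntry(String source, String target, String targetScript, double confidence) {
        this.source = source;
        this.target = target;
        this.targetScript = targetScript;
        this.confidence = confidence;
    }

    // Returns empty for comment rows (starting with #) and, by assumption, for empty rows.
    static Optional<OverrideEntry> parse(String row) {
        if (row.isEmpty() || row.startsWith("#")) {
            return Optional.empty();
        }
        String[] fields = row.split("\t");
        if (fields.length < 2 || fields.length > 4) {
            throw new IllegalArgumentException("Expected 2 to 4 tab-delimited fields: " + row);
        }
        String script = null;
        double confidence = 1.0;
        for (int i = 2; i < fields.length; i++) {
            Double parsed = tryParse(fields[i]);
            if (parsed == null) {
                script = fields[i];
            } else {
                confidence = parsed;
            }
        }
        if (!(confidence >= 0.0 && confidence <= 1.0)) {
            throw new IllegalArgumentException("Confidence must be between 0 and 1.0: " + row);
        }
        return Optional.of(new OverrideEntry(fields[0], fields[1], script, confidence));
    }

    private static Double tryParse(String s) {
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A four-field row: source, target, target script, confidence.
        OverrideEntry e = parse("Ho Lide\t\u8D3A\u5229\u5F97\tHans\t0.99").get();
        System.out.println(e.source + " -> " + e.target + " (" + e.targetScript + ", " + e.confidence + ")");
    }
}
```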

The following entry in fullnames_eng_zho_PERSON.txt specifies that Ho Lide should be translated to 贺利得 with a confidence score of 0.99 if the entity type for the source name is PERSON and the script for the target domain is Hans (simplified Chinese).

Ho Lide	贺 利得	Hans	0.99

The translations you specify apply in one direction only, so the preceding entry has no influence on the translation of the Chinese 贺利得 to English.

The following entry in fullnames_ara_eng.txt specifies that علي سعيد should be translated to 'Ali Sa'id with a confidence score of 1.0.

علي سعيد	'Ali Sa'id

You can include multiple entries with the same source name, in which case it translates to multiple target names. The sum of the confidence scores for the source name must be between 0 and 1.0.

If you do not include result scores, and a source name translates to multiple target names, RNT sets the confidence score for each pair to 1 divided by the number of targets. If, for example, a source name translates to two targets, the confidence score for each translation is 0.5. For fullname overrides, if a source name has multiple targets and only some have a specified confidence score, the confidence scores for the non-specified target names will be an even split of 1 minus the total confidence of the specified targets.
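
The default scoring rule amounts to a small calculation, sketched below for fullname overrides. The class OverrideConfidenceSplit is hypothetical and only mirrors the arithmetic described above; a null entry stands for a target whose score is not specified in the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverrideConfidenceSplit {

    // Fills in unspecified (null) scores with an even split of 1 minus the specified total.
    static List<Double> resolve(List<Double> specified) {
        double specifiedTotal = 0.0;
        int unspecified = 0;
        for (Double c : specified) {
            if (c == null) {
                unspecified++;
            } else {
                specifiedTotal += c;
            }
        }
        double share = unspecified == 0 ? 0.0 : (1.0 - specifiedTotal) / unspecified;
        List<Double> resolved = new ArrayList<>();
        for (Double c : specified) {
            resolved.add(c == null ? share : c);
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Two targets, no scores given: each target receives 1 / 2.
        System.out.println(resolve(Arrays.asList(null, null)));
        // Three targets, one scored 0.6: the other two share the remainder evenly.
        System.out.println(resolve(Arrays.asList(0.6, null, null)));
    }
}
```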

Note

Multiple override entries for a single token are not supported. If a token has multiple override entries, the resulting translations contain only the first entry in the file; the others are ignored.

Note

Specifying a confidence score in token override files is only supported for the following [script/language/transliteration scheme] pairs:

source: [Khmr/khm/native] target: [Latn/eng/folk]
source: [Latn/eng/folk] target: [Cyrl/rus/native]
source: [Thai/tha/native] target: [Latn/eng/iso_11940_2]
source: [Thai/tha/native] target: [Latn/eng/iso11940_2_2007]
source: [Thai/tha/native] target: [Latn/eng/icu]
source: [Mymr/mya/native] target: [Latn/eng/folk]
source: [Mymr/mya/native] target: [Latn/eng/mlcts]
source: [Grek/ell/native] target: [Latn/eng/iso843_1997]
source: [Grek/ell/native] target: [Latn/eng/icu]
source: [Hebr/heb/native] target: [Latn/eng/folk]
source: [Hebr/heb/native] target: [Latn/eng/iso259_2_1994]
source: [Hebr/heb/native] target: [Latn/eng/icu]
source: [Hebr/heb/native] target: [Hebr/heb/native]
source: [Cyrl/rus/native] target: [Latn/eng/ic]
source: [Cyrl/rus/native] target: [Latn/eng/bgn]
source: [Cyrl/rus/native] target: [Latn/eng/und_bgn]
source: [Cyrl/rus/native] target: [Latn/eng/iso9_1995]
source: [Deva/hin/native] target: [Latn/eng/ic]
source: [Hani/yue/jyutping] target: [Latn/eng/folk]
source: [Hans/yue/jyutping] target: [Latn/eng/folk]
source: [Hant/yue/jyutping] target: [Latn/eng/folk]

For all other pairs, specifying a confidence score in the override file will not affect the score of the final result.

Location of Override Files. Place your override files in the $BT_ROOT/rlpnc/data/rnt/ref/override directory.

Tip

To define your own override tables (character streams) in place of the tables in the default directory, see the HTML API documentation for the com.basistech.rnt.DictionaryService.replaceConfiguration method.

Interactive translation

RNT provides an API that you can use to build interactive applications to translate Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and Russian names from English or native script to standardized English. For supported transliteration schemes, see Supported Translation Domains.

The input is an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name in native script or in English that the user wants to translate. In the common case, the name is in English but may not conform to the desired transliteration standard.

For Arabic names, the application walks users through the procedure of generating the name in fully vocalized Arabic script (conventional Arabic does not include short vowels), and transliterating the name.
For Iranian Persian, Afghan Persian and Pashto names, the application walks the user through the process of generating the names in standard Arabic script (no short-vowel markers).
For Chinese, Korean, or Russian names, the application walks the user through the process of generating the name in Hani, Hangul, or Cyrillic.

To take full advantage of the resources that RNT Interactive provides, the user should have some familiarity with Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, and/or Russian.

For detailed information about the API, see the com.basistech.rnt.assistant package in the Java API Reference.

Overview of an interactive application

An interactive application that walks the user through the process of translating an Arabic, Iranian Persian, Afghan Persian, Pashto, Chinese, Korean, or Russian name does the following:

Sets the Basis root directory and instantiates a TranslationAssistant.
Collects user input: a name to transliterate and a description of the desired output.
The name is in English (a 'folk' transliteration) or in native script.
The output is defined as one foreign language text domain (such as Arabic, Arab, NATIVE) and one or more English text domains (such as English, Latn, BGN). See Supported Translation Domains.
Asks RNT to initialize an output object, which includes segmentation information about the input.
The segmentation may not match the segmentation implied by the user input, and needs to be recirculated to RNT as part of each user interaction. For example, in Arabic the definite article or family/clan indicator 'al' may or may not be joined to the element that follows. In some cases, whether it should be joined is unambiguous. In other cases, either segmentation is possible. The selection you make for one element may undo the selection you have already made for another element and/or may change the options available for that other element.
For each segment in the name, provides the user with a set of output alternatives. Each alternative includes the segment transliteration for the specified output text domains. For Arabic input only, the alternative may include a brief gloss and part of speech to assist the user in making a choice: either may be 'Name'; the gloss may contain a Buckwalter annotation, such as 's' for surname or 'f' for feminine name.
When the user selects an alternative, the application passes it to the output object, and passes the current segmentation (which the selection may have changed) back to the input.
Publishes the final output: for each output text domain, the combination of alternatives that the user has selected.
When the user is done, closes interactive RNT to free resources.

Sample application

For a sample application that simulates the interactive process described above, see InteractiveTranslationSample and the source code in $BT_ROOT/rlpnc/samples/java/InteractiveTranslationSample.java. For more information, see Sample Applications: Details.

As shipped (you can modify the sample), the input is an Arabic name: Safiyah Talib Al Suhail.

RNT divides this input into a number of segments and generates alternatives for each segment. RNT returns these alternatives in descending order of confidence (the best alternative is the first). For Arabic input only, as the following table shows, RNT provides additional information about each alternative to help the user make the best selection.

As the table also indicates, Al could be an individual component, but in the context of the word that follows, it should be joined with Suhail.

Arabic (native)	English (IC)	Gloss	Part of Speech

صَفِيَّة	Safiyyah	pure/clear/sincere	Adj
صَافِيَة	Safiyah	net	Noun
صَافِيَة	Safiyah	pure/clear/sincere	Adjective
سَافِيَاء	Safiya'	fine dust	Noun
???	Safiyah	original input

تَلِيب	Talib	Talib s Libyan	Name
طَلِيب	Talib	Talib s	Name
طَالِب	Talib	requesting	Adj
تَعْلِيب	Ta'lib	canning	Noun
تَأْلِيب	Ta'lib	rallying/assembling	Noun
طَلَب	Talab	quest/search // request/demand	Noun
تَأَلُّب	Ta'allub	gathering/rally/assembly	Noun
طَلِيبَة	Talibah	Talibah s	Name
طُلَيْب	Tulayb	Tulayb	Name
تَعْلَب	Ta'lab	Ta'lab	Name
???	Talib	original input

اَل	Al	al- definite article	Definite Article
آل	Al	Al family/clan of	Name
???	Al	original input

اَلسُّهَيْل	al-Suhayl	Suheil // Canopus	Name
اَلصَّهِيل	al-Sahil	neighing	Noun
اَلسُّهَيْلَة	al-Suhaylah	Suhaylah f	Name
اَلسُّحَيْل	al-Suhayl	Suhayl	Name
اَلسُّهِيل	al-Suhil	Suhil	Name
???	Suhail	original input

The final output (choosing the first alternative for each segment) is as follows:

IC transliteration: Safiyyah Talib al-Suhayl

Native transliteration: صَفِيَّة تَلِيب اَلسُّهَيْل
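Publishing the first alternative for each segment can be sketched as follows. This is an illustration only: FirstAlternativeSketch and publish are hypothetical names, not RNT classes, and the alternative lists are abbreviated copies of the English (IC) column in the table above, kept in the same descending order of confidence.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: these names are not part of the RNT API.
final class FirstAlternativeSketch {

    // Picks the first (highest-confidence) alternative of every segment
    // and joins the picks with spaces, preserving segment order.
    static String publish(Map<String, List<String>> alternativesBySegment) {
        return alternativesBySegment.values().stream()
                .map(alternatives -> alternatives.get(0))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the segments in input order.
        Map<String, List<String>> ic = new LinkedHashMap<>();
        ic.put("Safiyah", List.of("Safiyyah", "Safiyah", "Safiya'"));
        ic.put("Talib", List.of("Talib", "Ta'lib", "Talab"));
        ic.put("Al Suhail", List.of("al-Suhayl", "al-Sahil", "al-Suhaylah"));
        System.out.println(publish(ic)); // Safiyyah Talib al-Suhayl
    }
}
```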

Fully supported text domains for name matching

The following tables describe the domain pairings for which RNI provides full support. All other domain pairings have limited support, as described in Language support parameters. A domain refers to the language and script of a piece of text. For example, one domain might be Latin (Latn) script in the English (eng) language.

Note

"Language" in this appendix refers to the language of use, the language of the document in which the name is found, which may not be the language of origin associated with the name. If the language of use is undetermined, use unknown (xxx).
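As an illustration of the domain concept, the following sketch pairs an ISO 639-3 language code with an ISO 15924 script code and falls back to xxx when the language of use is not known. The class TextDomainSketch is a hypothetical helper written for this example; it is not a class from the RNI or RNT API.

```java
import java.util.Objects;

// Hypothetical helper, not an RNI class: models a domain as language + script.
final class TextDomainSketch {
    static final String UNKNOWN_LANGUAGE = "xxx";

    final String language; // ISO 639-3 code of the language of use
    final String script;   // ISO 15924 code of the writing script

    TextDomainSketch(String language, String script) {
        // An undetermined language of use is recorded as "xxx".
        this.language = (language == null || language.trim().isEmpty())
                ? UNKNOWN_LANGUAGE
                : language;
        this.script = Objects.requireNonNull(script, "script");
    }

    @Override
    public String toString() {
        return language + "/" + script;
    }

    public static void main(String[] args) {
        System.out.println(new TextDomainSketch("eng", "Latn")); // eng/Latn
        System.out.println(new TextDomainSketch(null, "Latn"));  // xxx/Latn
    }
}
```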

Note

Prior to release 7.36.0, RNI did not support any limited languages; when presented with names in those languages, an "unsupported language" error would be returned.

To set RNI to behave as it did previously, set allLanguageSupport to false.

Name matching within a language

The first table identifies the languages, and for each language the writing scripts that Rosette Name Indexer fully supports.

Language (ISO 639-3)	Scripts (ISO 15924)	Real World Id Dictionary
Arabic (ara)	Arabic (Arab)	✓
Burmese (mya)	Burmese (Mymr)	✓
Chinese (zho)^[a]	Han (Hanzi) (Hani), Han (Simplified variant) (Hans), Han (Traditional variant) (Hant)	✓
English (eng)	Latin (Latn)	✓
French (fra)	Latin (Latn)	✓
German (deu)	Latin (Latn)	✓
Greek (ell)	Greek (Grek)	✓
Hebrew (heb)	Hebrew (Hebr)	✓
Hungarian (hun)	Latin (Latn)	✓
Italian (ita)	Latin (Latn)	✓
Japanese (jpn)	Han (Kanji) (Hani), Hiragana (Hira), Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt), Japanese (alias for Han + Hiragana + Katakana) (Jpan), Katakana (Kana)	✓
Khmer (khm)	Khmer (Khmr)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang), Han (Hanja) (Hani), Korean (alias for Hangul + Han) (Kore)	✓
Malay (zsm)	Latin (Latn)
Pashto (pus)	Arabic (Arab)
Persian (fas) ^[b]	Arabic (Arab)
Persian, Afghan (prs)	Arabic (Arab)
Persian, Iranian (pes)	Arabic (Arab)
Portuguese (por)	Latin (Latn)	✓
Russian (rus)	Cyrillic (Cyrl)	✓
Spanish (spa)	Latin (Latn)	✓
Thai (tha)	Thai (Thai)	✓
Turkish (tur)	Latin (Latn)
Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	Latin (Latn)	✓
^[a]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[b]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").

Cross-language matches

This table identifies the range of cross-language searching and matching that Rosette Name Indexer and name matching fully support. If your query is a name in an Arabic document in Arabic script, the query may return one or more names in English documents in Latin script, in addition to names from Arabic documents in Arabic script. If the query is a name in English and Latin script, it may return documents from any of the supported languages and their native scripts.

Note

For the supported scripts for each language, see the table in Name matching within a language.

Query Domain	Index Domain / Match Domain
Language (ISO 639-3)	Language (ISO 639-3)	Scripts (ISO 15924)
Arabic (ara)	Arabic (ara)	Arabic (Arab)
Arabic (ara)	English (eng)	Latin (Latn)
Burmese (mya)	Burmese (mya)	Burmese (Mymr)
Burmese (mya)	English (eng)	Latin (Latn)
Chinese (zho)^[a]	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
English (eng)	Arabic (ara)	Arabic (Arab)
	Burmese (mya)	Burmese (Mymr)
	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	French (fra)	Latin (Latn)
	German (deu)	Latin (Latn)
	Greek (ell)	Greek (Grek)
	Hebrew (heb)	Hebrew (Hebr)
	Hungarian (hun)	Latin (Latn)
	Italian (ita)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Khmer (khm)	Khmer (Khmr)
	Korean (kor)	(Hani), (Hang), (Kore)
	Malay (zsm)	Latin (Latn)
	Pashto (pus)	Arabic (Arab)
	Persian (fas)	Arabic (Arab)
	Persian, Afghan (prs)	Arabic (Arab)
	Persian, Iranian (pes)	Arabic (Arab)
	Portuguese (por)	Latin (Latn)
	Russian (rus)	Cyrillic (Cyrl)
	Spanish (spa)	Latin (Latn)
	Thai (tha)	Thai (Thai)
	Urdu (urd)	Arabic (Arab)
	Turkish (tur)	Latin (Latn)
	Vietnamese (vie)	Latin (Latn)
French (fra)	English (eng)	Latin (Latn)
French (fra)	French (fra)	Latin (Latn)
German (deu)	English (eng)	Latin (Latn)
German (deu)	German (deu)	Latin (Latn)
Greek (ell)	English (eng)	Latin (Latn)
Greek (ell)	Greek (ell)	Greek (Grek)
Hebrew (heb)	English (eng)	Latin (Latn)
Hebrew (heb)	Hebrew (heb)	Hebrew (Hebr)
Hungarian (hun)	English (eng)	Latin (Latn)
Hungarian (hun)	Hungarian (hun)	Latin (Latn)
Italian (ita)	English (eng)	Latin (Latn)
Italian (ita)	Italian (ita)	Latin (Latn)
Japanese (jpn)	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Khmer (khm)	English (eng)	Latin (Latn)
Khmer (khm)	Khmer (khm)	Khmer (Khmr)
Korean (kor)	Chinese (zho)^[a]	(Hani), (Hans), (Hant)
	English (eng)	Latin (Latn)
	Japanese (jpn)	(Hani), (Hira), (Jpan), (Hrkt), (Kana)
	Korean (kor)	(Hani), (Hang), (Kore)
Malay (zsm)	English (eng)	Latin (Latn)
Malay (zsm)	Malay (zsm)	Latin (Latn)
Pashto (pus)	English (eng)	Latin (Latn)
Pashto (pus)	Pashto (pus)	Arabic (Arab)
Persian^[b] (fas)	English (eng)	Latin (Latn)
Persian^[b] (fas)	Persian (fas)	Arabic (Arab)
Persian, Afghan (prs)	Afghan Persian (prs)	Arabic (Arab)
Persian, Afghan (prs)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	English (eng)	Latin (Latn)
Persian, Iranian (pes)	Iranian Persian (pes)	Arabic (Arab)
Portuguese (por)	English (eng)	Latin (Latn)
Portuguese (por)	Portuguese (por)	Latin (Latn)
Russian (rus)	English (eng)	Latin (Latn)
Russian (rus)	Russian (rus)	Cyrillic (Cyrl)
Spanish (spa)	English (eng)	Latin (Latn)
Spanish (spa)	Spanish (spa)	Latin (Latn)
Thai (tha)	English (eng)	Latin (Latn)
Thai (tha)	Thai (tha)	Thai (Thai)
Turkish (tur)	English (eng)	Latin (Latn)
Turkish (tur)	Turkish (tur)	Latin (Latn)
Urdu (urd)	English (eng)	Latin (Latn)
Urdu (urd)	Urdu (urd)	Arabic (Arab)
Vietnamese (vie)	English (eng)	Latin (Latn)
Vietnamese (vie)	Vietnamese (vie)	Latin (Latn)
^[a]This is a macro language consisting of Mandarin (cmn) and Cantonese (yue).
^[b]Persian is the macro language that includes Afghan Persian ("prs") and Iranian Persian ("pes").
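The rows of the cross-language table follow a pattern that can be summarized in a few lines of Java. The sketch below is a reading of the table, not an RNI API: CrossLanguageSketch and isFullySupportedPair are hypothetical names, and the rule (same language, or English on either side, or both languages among Chinese, Japanese, and Korean) is inferred from the rows listed above.

```java
import java.util.Set;

// Hypothetical summary of the cross-language table; not an RNI class.
final class CrossLanguageSketch {

    // ISO 639-3 codes of the languages listed in the tables of this section.
    static final Set<String> FULLY_SUPPORTED = Set.of(
            "ara", "mya", "zho", "eng", "fra", "deu", "ell", "heb", "hun", "ita",
            "jpn", "khm", "kor", "zsm", "pus", "fas", "prs", "pes", "por", "rus",
            "spa", "tha", "tur", "urd", "vie");

    // Chinese, Japanese, and Korean are also paired with one another.
    static final Set<String> CJK = Set.of("zho", "jpn", "kor");

    static boolean isFullySupportedPair(String queryLanguage, String indexLanguage) {
        if (!FULLY_SUPPORTED.contains(queryLanguage)
                || !FULLY_SUPPORTED.contains(indexLanguage)) {
            return false;
        }
        return queryLanguage.equals(indexLanguage)
                || queryLanguage.equals("eng")
                || indexLanguage.equals("eng")
                || (CJK.contains(queryLanguage) && CJK.contains(indexLanguage));
    }

    public static void main(String[] args) {
        System.out.println(isFullySupportedPair("ara", "eng")); // true
        System.out.println(isFullySupportedPair("jpn", "kor")); // true
        System.out.println(isFullySupportedPair("ara", "fra")); // false
    }
}
```

Pairs for which the sketch returns false are not necessarily rejected by RNI; as noted earlier in this section, other domain pairings have limited support.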

Supported translation domains

This section specifies supported translation domain pairs and languages of origin for names to be translated with each domain pair.

Source and target translation domains

The following table identifies the translations (target text domains) that Rosette Name Translator supports for each source text domain.

For each source domain, the target domains for each language and script combination are presented in a single row.

Translation. If the languages in the source domain and target domain do not match, Rosette Name Translator translates the name to the target language.

Language of Origin. For Arabic, Chinese, Hebrew, Japanese, and Korean, if the target language is English and the language of origin does not match the source language, Rosette Name Translator attempts to translate the name to its standard English representation. If the language of origin is the source language, Rosette Name Translator transliterates the name with the specified transliteration scheme. If the language of origin for the name object is unspecified (UNKNOWN), Rosette Name Translator guesses the language of origin. For the supported languages of origin for each domain pair, see Supported Languages of Origin.

Orthographic Enhancement. When the source and target domains match (the transliteration schemes are native), the translations are orthographic enhancements in the native script (vocalization in Arabic and Hebrew script languages, segmentation in Chinese, Japanese, Korean, or Thai).

Name Variants. When the target transliteration scheme is folk, Rosette Name Translator generates a list of variant representations of the name in Latin script.
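The Language of Origin rule above can be expressed as a small decision function. This is a sketch of the rule as described in this section, assuming the target language is English; LanguageOfOriginSketch, Strategy, and forEnglishTarget are hypothetical names, and the string "UNKNOWN" stands in for an unspecified language of origin. It does not reproduce how RNT is implemented.

```java
import java.util.Set;

// Hypothetical sketch of the Language of Origin rule; not an RNT class.
final class LanguageOfOriginSketch {

    enum Strategy {
        GUESS_ORIGIN_FIRST,              // origin unspecified: RNT guesses it
        STANDARD_ENGLISH_REPRESENTATION, // origin differs from the source language
        TRANSLITERATE_WITH_SCHEME,       // origin equals the source language
        NOT_COVERED                      // rule is not stated for this source language
    }

    // Source languages for which this section describes the rule.
    static final Set<String> ORIGIN_AWARE = Set.of("ara", "zho", "heb", "jpn", "kor");

    static Strategy forEnglishTarget(String sourceLanguage, String languageOfOrigin) {
        if (!ORIGIN_AWARE.contains(sourceLanguage)) {
            return Strategy.NOT_COVERED;
        }
        if (languageOfOrigin == null || languageOfOrigin.equals("UNKNOWN")) {
            return Strategy.GUESS_ORIGIN_FIRST;
        }
        return languageOfOrigin.equals(sourceLanguage)
                ? Strategy.TRANSLITERATE_WITH_SCHEME
                : Strategy.STANDARD_ENGLISH_REPRESENTATION;
    }

    public static void main(String[] args) {
        // An English name written in Arabic script is rendered in its
        // standard English form rather than transliterated.
        System.out.println(forEnglishTarget("ara", "eng")); // STANDARD_ENGLISH_REPRESENTATION
        System.out.println(forEnglishTarget("ara", "ara")); // TRANSLITERATE_WITH_SCHEME
    }
}
```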

Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...	Language(s) of Origin
Language (ISO639-3)	Script (ISO15924)	Language (ISO639-3)	Script (ISO15924)	Transliteration Scheme(s) (name),...	Language(s) of Origin
Afghan Persian (prs)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)	prs
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)	ara, eng
Burmese (mya)	Burmese (Mymr)	English (eng)	Latin (Latn)	Folk (folk), MLCTS (mlcts)	mya
Chinese (zho)	Han (Hanzi) (Hani)	English (eng)	Latin (Latn)	BGN (bgn)^[a] , IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[b], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[c]	zho, eng
Chinese (zho)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	BGN (bgn)^[a], IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[b], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[c]	zho, eng
Chinese (zho)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	BGN (bgn)^[a], IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy)^[b], Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[c]	zho, eng
English (eng)	Latin (Latn)	Afghan Persian (prs)	Arabic (Arab)	Folk (folk)	prs
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	Basis (basis), BGN (bgn), Buckwalter (buckwalter), Folk (folk), SATTS (satts)	ara
English (eng)	Latin (Latn)	Chinese (zho)	Han (Hanzi, Kanji, Hanja) (Hani)	Chinese Telegraph Code (ctc), Folk (folk), Jyutping (LSHK)^[c]	zho, eng
English (eng)	Latin (Latn)	English (eng)	Latin (Latn)	BGN (bgn), Basis (basis), IC (ic)^[d]	ara
English (eng)	Latin (Latn)	English (eng)	Latin (Latn)	Undiacritized BGN (und_bgn)	eng, ara, pus, urd, prs, pes, fas, rus, zho, kor
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	BGN (bgn)	pes
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	Folk (folk)	pes
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	BGN (bgn)	kor
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Folk (folk)	kor, eng
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	McCune-Reischauer (mcr)	kor
English (eng)	Latin (Latn)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Revised Romanization of Korean (moct)	kor
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	BGN (bgn)	pus
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	Folk (folk)	pus
English (eng)	Latin (Latn)	Persian (fas)	Arabic (Arab)	BGN (bgn)	fas
English (eng)	Latin (Latn)	Russian (rus)	Cyrillic (Cyrl)	Folk (folk)	rus, eng
English (eng)	Latin (Latn)	Urdu (urd)	Arabic (Arab)	BGN (bgn)	urd
Greek (ell)	Greek (Grek)	English (eng)	Latin (Latn)	ISO 843:1997 (iso843_1997), ICU (icu)	eng, ell
Hebrew (heb)	Hebrew (Hebr)	English (eng)	Latin (Latn)	ISO 259-2:1994 (iso259_2_1994), Folk (folk), ICU (icu)	heb, eng
Hindi (hin)	Devanagari (Nagari) (Deva)	English (eng)	Latin (Latn)	IC (ic)	hin
Iranian Persian (pes)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)	pes
Japanese (jpn)	Han (Kanji) (Hani)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, zho, kor
Japanese (jpn)	Hiragana (Hira)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng, zho, kor
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng
Japanese (jpn)	Katakana (Kana)	English (eng)	Latin (Latn)	Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)	jpn, eng
Khmer (khm)	Khmer (Khmr)	English (eng)	Latin (Latn)	Folk (folk)	khm, eng
Korean (kor)	Han (Hanja) (Hani)	English (eng)	Latin (Latn)	BGN (bgn)^[e], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	English (eng)	Latin (Latn)	BGN (bgn)^[e] , IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	English (eng)	Latin (Latn)	BGN (bgn)^[e], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)	kor, eng
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)	pus
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	IC (ic)	pus, prs
Persian (fas)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)	fas
Russian (rus)	Cyrillic (Cyrl)	English (eng)	Latin (Latn)	BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn)	rus, eng
Thai (tha)	Thai (Thai)	English (eng)	Latin (Latn)	ICU (icu), ISO 11940-2 (iso_11940_2), ISO 11940-2:2007 (iso11940_2_2007)	eng, tha
Urdu (urd)	Arabic (Arab)	English (eng)	Latin (Latn)	BGN (bgn)^[f] , IC (ic), Undiacritized BGN (und_bgn), Folk (folk)	urd
^[a]For Chinese, BGN uses the Pinyin transliteration scheme.
^[b]Pinyin (hypy) is the only transliteration scheme used for name matching.
^[c]For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.
^[d]Standardizes names of Arabic origin in Latin script to conform to the target transliteration scheme.
^[e]For Korean, BGN uses the McCune-Reischauer transliteration scheme.
^[f]For Urdu, Rosette implemented a BGN transliteration scheme based on an unofficial specification prior to the specification being officially adopted.
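An application that validates a request before calling the translator could keep the table above as a lookup. The sketch below copies four rows of the table into a map; SchemeLookupSketch, its key format, and the supports method are hypothetical and exist only for this example. The scheme names are the parenthesized names from the table.

```java
import java.util.List;
import java.util.Map;

// Hypothetical lookup built from a few rows of the table; not an RNT class.
final class SchemeLookupSketch {

    // Key format (an assumption of this sketch): "srcLang/srcScript>tgtLang/tgtScript".
    static final Map<String, List<String>> SCHEMES = Map.of(
            "ell/Grek>eng/Latn", List.of("iso843_1997", "icu"),
            "mya/Mymr>eng/Latn", List.of("folk", "mlcts"),
            "rus/Cyrl>eng/Latn", List.of("bgn", "ic", "iso9_1995", "und_bgn"),
            "hin/Deva>eng/Latn", List.of("ic"));

    // Returns true if the scheme is listed for the given source and target domains.
    static boolean supports(String sourceDomain, String targetDomain, String scheme) {
        return SCHEMES
                .getOrDefault(sourceDomain + ">" + targetDomain, List.of())
                .contains(scheme);
    }

    public static void main(String[] args) {
        System.out.println(supports("rus/Cyrl", "eng/Latn", "iso9_1995")); // true
        System.out.println(supports("hin/Deva", "eng/Latn", "bgn"));       // false
    }
}
```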

Supported languages of origin

The following table displays the languages of origin that are supported for each source and target domain. For the full English names for the scripts, languages, and transliteration schemes, see the preceding table.

Target Domain(s)			Source Domain			Language(s) of Origin
Language	Script	Transliteration Scheme	Language	Script	Transliteration Scheme(s),...	Language(s),...
ara	Arab	native	ara	Arab	native	ara, eng
ara	Arab	native	eng	Latn	fbis, bgn, basis, ic, satts, buckwalter, und_bgn, ext_ic, folk	ara, eng
ell	Grek	native	eng	Latn	iso843_1997, icu	eng, ell
eng	Latn	bgn	ara	Arab	native	ara
eng	Latn	bgn	pus	Arab	native	pus
eng	Latn	bgn	fas	Arab	native	fas
eng	Latn	bgn	urd	Arab	native	urd
eng	Latn	bgn	pes	Arab	native	pes
eng	Latn	bgn	eng	Latn	und_bgn	eng, ara, pus, urd, prs, pes, fas, rus, zho, kor
eng	Latn	bgn	kor	Hang	native	kor
eng	Latn	basis	ara	Arab	native	ara
eng	Latn	satts	ara	Arab	native	ara
eng	Latn	buckwalter	ara	Arab	native	ara
eng	Latn	mcr	kor	Hang	native	kor
eng	Latn	moct	kor	Hang	native	kor
eng	Latn	ctc	zho	Hani	native	zho
eng	Latn	folk	ara	Arab	native	ara
eng	Latn	folk	pus	Arab	native	pus
eng	Latn	folk	prs	Arab	native	prs
eng	Latn	folk	pes	Arab	native	pes
eng	Latn	folk	rus	Cyrl	native	rus, eng
eng	Latn	folk	kor	Hang	native	kor, eng
eng	Latn	folk	zho	Hani	native	zho, eng
eng	Latn	native	eng	Latn	bgn, basis, ic	ara
fas	Arab	native	eng	Latn	bgn, und_bgn, folk	fas
heb	Hebr	native	eng	Latn	native	heb, eng
hin	Deva	native	eng	Latn	ic	hin
jpn	Hani	native	eng	Latn	hebon, kunrei	jpn, zho, kor
jpn	Hani	native	jpn	Hira	native	jpn
jpn	Hani	native	jpn	Hani	native	jpn, zho, kor
jpn	Hans	native	eng	Latn	hebon, kunrei	jpn, zho, kor
jpn	Hant	native	eng	Latn	hebon, kunrei	jpn, zho, kor
jpn	Hira	native	eng	Latn	hebon, kunrei	jpn
jpn	Hrkt	native	eng	Latn	hebon, kunrei	jpn, eng
jpn	Kana	native	eng	Latn	hebon, kunrei	jpn, eng
jpn	Jpan	native	eng	Latn	hebon, kunrei	jpn, eng, zho, kor
khm	Khmr	native	eng	Latn	native	khm, eng
kor	Hang	native	eng	Latn	bgn, ic, und_bgn, korda, mcr, moct, folk	kor, eng
kor	Hang	native	kor	Hang	native	kor, eng
kor	Hani	native	eng	Latn	bgn, ic, und_bgn, korda, mcr, moct, folk	kor, eng
kor	Hani	native	kor	Hang	native	kor, eng
kor	Kore	native	eng	Latn	bgn, ic, und_bgn, korda, mcr, moct, folk	kor, eng
kor	Kore	native	kor	Hang	native	kor, eng
mya	Mymr	folk	eng	Latn	icu	mya, eng
mya	Mymr	native	eng	Latn	folk, mlcts	mya
pes	Arab	native	pes	Arab	native	pes
pes	Arab	native	eng	Latn	bgn, ic, und_bgn	pes
prs	Arab	native	prs	Arab	native	prs
prs	Arab	native	eng	Latn	bgn, ic, und_bgn	prs
pus	Arab	native	pus	Arab	native	pus
pus	Arab	native	eng	Latn	bgn	pus
pus	Arab	native	eng	Latn	ic	pus, prs
pus	Arab	native	eng	Latn	und_bgn, folk	pus
rus	Cyrl	native	eng	Latn	bgn, ic, iso9_1995, und_bgn	rus, eng
rus	Cyrl	native	rus	Cyrl	native	rus, eng
tha	Thai	native	eng	Latn	icu, iso_11940_2, iso11940_2_2007	eng, tha
urd	Arab	native	urd	Arab	native	urd
urd	Arab	native	eng	Latn	bgn, ic, und_bgn, folk	urd
zho	Hani	native	eng	Latn	bgn, ic, und_bgn, hypy, hypy_toned, wade_giles, ctc	zho, eng
zho	Hani	native	zho	Hani	native	zho, eng
zho	Hans	native	eng	Latn	bgn, ic, und_bgn, hypy, hypy_toned, wade_giles	zho, eng
zho	Hant	native	eng	Latn	bgn, ic, und_bgn, hypy, hypy_toned, wade_giles	zho, eng
zho	Hant	native	zho	Hans	native	zho, eng

Supported translation option domains

This section specifies the translation domain pairs to which each of the RNT translation options applies.

Orthographic completion

For Arabic-script names in Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu, this option (com.basistech.rnt.options.CompleteOrthographyOption) adds short vowels and other diacritics. If the language is Arabic or Hebrew^[15] and statistical methods are turned on (the default), the translator statistically infers an orthographic completion if the name does not appear in its dictionary. For the other languages, orthographic completion only takes place if the name appears in the relevant dictionary. By default, this option is set to true. See Translation Options.

Orthographic Completion
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Afghan Persian (prs)	Arabic (Arab)	Afghan Persian (prs)	Arabic (Arab)	Native (native)
Afghan Persian (prs)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)
Afghan Persian (prs)	Arabic (Arab)	Afghan Persian (prs)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Arabic (Arab)	Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Latin (Latn) [Deprecated domain]	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Hebrew (heb)	Hebrew (Hebr)	Hebrew (heb)	Hebrew (Hebr)	Native (native)
Iranian Persian (pes)	Arabic (Arab)	Iranian Persian (pes)	Arabic (Arab)	Native (native)
Iranian Persian (pes)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)
Iranian Persian (pes)	Arabic (Arab)	Iranian Persian (pes)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Arabic (Arab)	Native (native)
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
Persian (fas)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)
Persian (fas)	Arabic (Arab)	Persian (fas)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), Undiacritized BGN (und_bgn), Folk (folk)
Urdu (urd)	Arabic (Arab)	Urdu (urd)	Arabic (Arab)	Native (native)
Urdu (urd)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[a] , IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
Urdu (urd)	Arabic (Arab)	Urdu (urd)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[a], IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
^[a]For Urdu, Basis implemented a BGN transliteration scheme based on an unofficial specification prior to the specification being officially adopted.
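The completion rule described in this section can be sketched as a single predicate. The sketch assumes the completion option itself is enabled (its default) and that the domain pair appears in the table above; CompletionRuleSketch and completes are hypothetical names, not RNT classes, and the dictionary lookup is reduced to a boolean parameter.

```java
import java.util.Set;

// Hypothetical sketch of the orthographic completion rule; not an RNT class.
final class CompletionRuleSketch {

    // Languages for which completion can be inferred statistically.
    static final Set<String> STATISTICAL_LANGUAGES = Set.of("ara", "heb");

    // inDictionary: whether the name was found in the relevant dictionary.
    // statisticalMethods: value of the statistical methods option.
    static boolean completes(String language, boolean inDictionary, boolean statisticalMethods) {
        if (inDictionary) {
            return true; // dictionary entries are completed for every listed language
        }
        // Outside the dictionary, only Arabic and Hebrew can fall back to statistics.
        return statisticalMethods && STATISTICAL_LANGUAGES.contains(language);
    }

    public static void main(String[] args) {
        System.out.println(completes("ara", false, true)); // true
        System.out.println(completes("urd", false, true)); // false
        System.out.println(completes("urd", true, false)); // true
    }
}
```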

Orthographic minimalization

For Arabic-script names in Arabic, Iranian Persian, Afghan Persian, Pashto, or Urdu, this option (com.basistech.rnt.options.MinimizeOrthographyOption) removes diacritics for short vowels. The translation produces the Arabic script representation of names found in most print media, including news articles. By default this option is set to false. See Translation Options.

Orthographic Minimalization
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Afghan Persian (prs)	Arabic (Arab)	Afghan Persian (prs)	Arabic (Arab)	Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Arabic (Arab)	Native (native)
Iranian Persian (pes)	Arabic (Arab)	Iranian Persian (pes)	Arabic (Arab)	Native (native)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Arabic (Arab)	Native (native)
Urdu (urd)	Arabic (Arab)	Urdu (urd)	Arabic (Arab)	Native (native)
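The effect of orthographic minimalization can be approximated with a regular expression over the Unicode Arabic block. This sketch is not the RNT implementation: MinimizeOrthographySketch is a hypothetical class, and it removes only the tanwin and short-vowel marks (U+064B through U+0650). Whether RNT also removes shadda or sukun is not stated in this section, so the sketch leaves them in place.

```java
// Hypothetical approximation of orthographic minimalization; not an RNT class.
final class MinimizeOrthographySketch {

    // Removes fathatan, dammatan, kasratan, fatha, damma, and kasra
    // (code points U+064B to U+0650). Other marks are left untouched.
    static String stripShortVowels(String name) {
        return name.replaceAll("[\\u064B-\\u0650]", "");
    }

    public static void main(String[] args) {
        // "Talib" written with a fatha and a kasra (طَالِب).
        String vocalized = "\u0637\u064E\u0627\u0644\u0650\u0628";
        String minimal = stripShortVowels(vocalized);
        System.out.println(minimal.equals("\u0637\u0627\u0644\u0628")); // true
    }
}
```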

Statistical methods

For personal names in Arabic, Hebrew, Japanese, and Russian, this option (com.basistech.rnt.options.StatisticalMethodsOption) uses statistical methods to establish information that is not found in a dictionary. For Arabic, this option is set to true by default. For Hebrew, Japanese, and Russian, it is always set to true. See Translation Options.

Statistical Methods
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Afghan Persian (prs)	Latin (Latn) [Deprecated domain]	Afghan Persian (prs)	Arabic (Arab)	BGN (bgn), Native (native)
Afghan Persian (prs)	Latin (Latn) [Deprecated domain]	Afghan Persian (prs)	Arabic (Arab)	Folk (folk), Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Arabic (Arab)	Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Latin (Latn) [Deprecated domain]	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Arabic (ara)	Latin (Latn) [Deprecated domain]	Arabic (ara)	Arabic (Arab)	Basis (basis), Native (native)
Arabic (ara)	Latin (Latn) [Deprecated domain]	Arabic (ara)	Arabic (Arab)	BGN (bgn), Native (native)
Arabic (ara)	Latin (Latn) [Deprecated domain]	Arabic (ara)	Arabic (Arab)	Buckwalter (buckwalter), Native (native)
Arabic (ara)	Latin (Latn) [Deprecated domain]	Arabic (ara)	Arabic (Arab)	Folk (folk), Native (native)
Arabic (ara)	Latin (Latn) [Deprecated domain]	Arabic (ara)	Arabic (Arab)	SATTS (satts), Native (native)
English (eng)	Latin (Latn)	Afghan Persian (prs)	Arabic (Arab)	Folk (folk), Native (native)
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	Basis (basis), Native (native)
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	BGN (bgn), Native (native)
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	Buckwalter (buckwalter), Native (native)
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	Folk (folk), Native (native)
English (eng)	Latin (Latn)	Arabic (ara)	Arabic (Arab)	SATTS (satts), Native (native)
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	Folk (folk), Native (native)
English (eng)	Latin (Latn)	Iranian Persian (pes)	Arabic (Arab)	BGN (bgn), Native (native)
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	BGN (bgn), Native (native)
English (eng)	Latin (Latn)	Pashto (pus)	Arabic (Arab)	Folk (folk), Native (native)
English (eng)	Latin (Latn)	Persian (fas)	Arabic (Arab)	BGN (bgn), Native (native)
English (eng)	Latin (Latn)	Urdu (urd)	Arabic (Arab)	BGN (bgn), Native (native)
Hebrew (heb)	Hebrew (Hebr)	English (eng)	Latin (Latn)	Native (native), ISO 259-2:1994 (iso259_2_1994), Folk (folk), ICU (icu)
Iranian Persian (pes)	Latin (Latn) [Deprecated domain]	Iranian Persian (pes)	Arabic (Arab)	BGN (bgn), Native (native)
Iranian Persian (pes)	Latin (Latn) [Deprecated domain]	Iranian Persian (pes)	Arabic (Arab)	Folk (folk), Native (native)
Pashto (pus)	Latin (Latn) [Deprecated domain]	Pashto (pus)	Arabic (Arab)	Folk (folk), Native (native)
Pashto (pus)	Latin (Latn) [Deprecated domain]	Pashto (pus)	Arabic (Arab)	BGN (bgn), Native (native)
Persian (fas)	Latin (Latn) [Deprecated domain]	Persian (fas)	Arabic (Arab)	BGN (bgn), Native (native)
Urdu (urd)	Latin (Latn) [Deprecated domain]	Urdu (urd)	Arabic (Arab)	BGN (bgn), Native (native)

Performance tradeoff

For personal names in Arabic, Japanese, or Russian, this option (com.basistech.rnt.options.PerformanceTradeoff) controls the tradeoff the translator makes between speed and correctness. See Translation Options.

Performance Tradeoff
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Arabic (Arab)	Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Latin (Latn) [Deprecated domain]	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Han (Hanzi) (Hani)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Han (Simplified variant) (Hans)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
English (eng)	Latin (Latn)	Russian (rus)	Cyrillic (Cyrl)	Folk (folk), Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Han (Kanji) (Hani)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Hiragana (Hira)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Korean (kor)	Han (Hanja) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Han (Hanja) (Hani)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Han (Hanja) (Hani)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[b], IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Russian (rus)	Cyrillic (Cyrl)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn)
Russian (rus)	Cyrillic (Cyrl)	Russian (rus)	Cyrillic (Cyrl)	Native (native)
Russian (rus)	Cyrillic (Cyrl)	Russian (rus)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), ISO 9:1995 (iso9_1995), Undiacritized BGN (und_bgn)
^[a]For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.
^[b]For how BGN is handled for Korean, see Korean Geography.

Segmentation

For Chinese, Japanese, and Korean^[16] names, this option (com.basistech.rnt.options.SegmentOption) segments unsegmented names. By default, this option is set to true. See Translation Options.

Segmentation
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Han (Hanzi) (Hani)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Han (Simplified variant) (Hans)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Japanese (jpn)	Han (Kanji) (Hani)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Han (Kanji) (Hani)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Hiragana (Hira)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Korean (kor)	Han (Hanja) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Han (Hanja) (Hani)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Han (Hanja) (Hani)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Native (native)
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), KORDA (korda), McCune-Reischauer (mcr), Revised Romanization of Korean (moct), Folk (folk)
^[a]For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.

Normalization

This option (com.basistech.rnt.options.NormalizeOption) applies a set of rules to normalize Arabic native names to a standard form, converts characters in the traditional Chinese variant to the corresponding simplified Chinese variant, and converts Japanese Kanji variants (including old Kanji) to their standard form. Normalization occurs before any other name processing. By default, this option is set to true. See Translation Options.

Normalization
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Arabic (Arab)	Native (native)
Arabic (ara)	Arabic (Arab)	Arabic (ara)	Latin (Latn) [Deprecated domain]	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Arabic (ara)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), FBIS (fbis), BGN (bgn), Basis (basis), IC (ic), SATTS (satts), Buckwalter (buckwalter), Undiacritized BGN (und_bgn), Extended IC (ext_ic), Folk (folk)
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Han (Hanzi) (Hani)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Hanzi) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Chinese Telegraph Code (ctc), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Han (Simplified variant) (Hans)	Native (native), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	Chinese (zho)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Chinese (zho)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Pinyin (hypy), Hanyu Pinyin Toned (hypy_toned), Wade-Giles (wade_giles), Jyutping (LSHK)^[a]
Japanese (jpn)	Han (Kanji) (Hani)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Han (Kanji) (Hani)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Hiragana (Hira)	Native (native)
Japanese (jpn)	Han (Kanji) (Hani)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Simplified variant) (Hans)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Han (Traditional variant) (Hant)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Hiragana (Hira)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese (alias for Han + Hiragana + Katakana) (Jpan)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Japanese syllabaries (alias for Hiragana + Katakana) (Hrkt)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	English (eng)	Latin (Latn)	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Japanese (jpn)	Katakana (Kana)	Japanese (jpn)	Latin (Latn) [Deprecated domain]	Native (native), Hebon Romaji (hebon), Kunrei-shiki Romaji (kunrei)
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Arabic (Arab)	Native (native)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn), IC (ic), Undiacritized BGN (und_bgn), Folk (folk)
^[a]For transliteration of Cantonese (yue) in Han, Hans, and Hant scripts.
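
The traditional-to-simplified step described above can be pictured as a per-character table lookup, for example 張偉國 becoming 张伟国. The following is a minimal sketch of that idea only. The class name, the method name, and the three-entry table are assumptions made for illustration; they are not part of the RNT API, and the mapping data RNT actually uses is not shown in this document.

```java
import java.util.Map;

/** Illustration only: a tiny traditional-to-simplified lookup, not RNT's normalization data. */
final class HanNormalizationSketch {

    // Hypothetical three-entry table, written with Unicode escapes so the
    // source compiles under any file encoding:
    // U+5F35 -> U+5F20, U+5049 -> U+4F1F, U+570B -> U+56FD.
    private static final Map<Character, Character> TRAD_TO_SIMP = Map.of(
            '\u5F35', '\u5F20',
            '\u5049', '\u4F1F',
            '\u570B', '\u56FD');

    /** Replaces each character that has a table entry and leaves all others unchanged. */
    static String toSimplified(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            out.append(TRAD_TO_SIMP.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the simplified form of the three traditional characters above.
        System.out.println(toSimplified("\u5F35\u5049\u570B"));
    }
}
```

Characters without an entry, such as Latin letters, pass through untouched, which is why a lookup of this kind can run safely before any other name processing.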

Variant spelling

When using the IC transliteration standard for personal names in Pashto, this option (com.basistech.rnt.options.VariantSpellingOption) specifies how long vowels are transliterated. When variant spelling is false (the default), long vowels are transliterated with a single vowel. When true, long vowels are transliterated with double vowels. For example, the same name is rendered as Hamid when the option is false and as Hamiid when it is true. See Translation Options.

Variant Spelling
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), IC (ic)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Latin (Latn) [Deprecated domain]	Native (native), IC (ic)
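
The effect of the variant spelling setting can be sketched as follows. This is not the RNT implementation: the class and method names are hypothetical, and the convention of marking a long vowel with a trailing colon (as in "Hami:d") is an assumption made only so that the example is self-contained.

```java
/** Illustration only: renders long vowels as single or double letters. */
final class VariantSpellingSketch {

    /**
     * @param marked          romanized name in which ':' follows each long vowel
     *                        (a convention assumed for this sketch)
     * @param variantSpelling false writes a long vowel once, true writes it twice
     */
    static String render(String marked, boolean variantSpelling) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < marked.length(); i++) {
            char c = marked.charAt(i);
            if (c == ':') {
                // The marker itself is never emitted; with variant spelling
                // enabled, the preceding vowel is repeated.
                if (variantSpelling && out.length() > 0) {
                    out.append(out.charAt(out.length() - 1));
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Hami:d", false));
        System.out.println(render("Hami:d", true));
    }
}
```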

Region

When using the IC transliteration standard for personal names in Pashto, this option (com.basistech.rnt.options.RegionOption) designates how the Pashto letter ږ (ģe) is transliterated. If the region is set to DEFAULT (unknown) or SOUTH, the transliteration is 'zh'. If the region is set to NORTH, the transliteration is 'g'. See Translation Options.

Region
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Pashto (pus)	Arabic (Arab)	English (eng)	Latin (Latn)	Native (native), IC (ic)
Pashto (pus)	Arabic (Arab)	Pashto (pus)	Latin (Latn) [Deprecated domain]	Native (native), IC (ic)
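
The region rule above reduces to a small lookup. The sketch below restates it in Java. The enum and method here are hypothetical stand-ins that reuse the region names from the text; the actual types exposed by com.basistech.rnt.options.RegionOption are not shown in this document.

```java
/** Illustration only: how the region setting selects the transliteration of the Pashto letter ge. */
final class RegionSketch {

    // Region names taken from the text above; this enum itself is hypothetical.
    enum Region { DEFAULT, SOUTH, NORTH }

    /** Returns "g" for NORTH and "zh" for DEFAULT (unknown) or SOUTH. */
    static String transliterateGe(Region region) {
        return region == Region.NORTH ? "g" : "zh";
    }

    public static void main(String[] args) {
        for (Region r : Region.values()) {
            System.out.println(r + " -> " + transliterateGe(r));
        }
    }
}
```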

Korean geography

When using the BGN transliteration standard for names in Korean, this option (com.basistech.rnt.options.KorGeographyOption) designates which transliteration scheme is used. See Translation Options.

Korean Geography Option
Source Domain		Target Domain(s)		Transliteration Scheme(s) (name),...
Language (ISO 639-3)	Script (ISO 15924)	Language (ISO 639-3)	Script (ISO 15924)	Transliteration Scheme(s) (name),...
Korean (kor)	Han (Hanja) (Hani)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[a]
Korean (kor)	Han (Hanja) (Hani)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[a]
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[a]
Korean (kor)	Hangul (Hangŭl, Hangeul) (Hang)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[a]
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	English (eng)	Latin (Latn)	Native (native), BGN (bgn)^[a]
Korean (kor)	Korean (alias for Hangul + Han) (Kore)	Korean (kor)	Latin (Latn) [Deprecated domain]	Native (native), BGN (bgn)^[a]
^[a]For Korean, BGN uses the McCune-Reischauer (mcr) transliteration scheme if the option is `NORTHKOREAN` (the default) and Revised Romanization of Korean (moct) if the option is `SOUTHKOREAN`.
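
The footnote can be read as a scheme-resolution rule: a request for BGN is redirected to one of two concrete schemes, and any other request is left alone. The following sketch expresses that rule. The class, enum, and method are hypothetical; only the scheme names (bgn, mcr, moct) and the option values come from the tables above.

```java
/** Illustration only: which concrete scheme a Korean BGN request resolves to. */
final class KorGeographySketch {

    // Option values taken from the footnote; this enum itself is hypothetical.
    enum Geography { NORTHKOREAN, SOUTHKOREAN }

    /**
     * Resolves "bgn" to "mcr" (NORTHKOREAN, the default) or "moct" (SOUTHKOREAN).
     * Any other scheme name is returned unchanged.
     */
    static String resolveScheme(String requestedScheme, Geography geography) {
        if (!"bgn".equals(requestedScheme)) {
            return requestedScheme;
        }
        return geography == Geography.SOUTHKOREAN ? "moct" : "mcr";
    }

    public static void main(String[] args) {
        System.out.println(resolveScheme("bgn", Geography.NORTHKOREAN));
        System.out.println(resolveScheme("bgn", Geography.SOUTHKOREAN));
    }
}
```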

Appendix

Match phenomena

Match phenomena describe why token spans did or did not match. For example, some match phenomena, such as HMM_MATCH, occur when tokens are matched by a particular scorer. Others, such as DELETION, occur when tokens cannot be matched at all.

Name	Description	Example
CONFLICT	The tokens do not match.	When comparing "William Omega Stephens" and "William Kappa Stephens", "Omega" and "Kappa" are a CONFLICT.
DELETION	The token is unmatched.	When comparing "Richard William Smith" with "Richard Smith", "william" would be considered a DELETION.
EMBEDDING_MATCH	The tokens are semantically similar as determined by word-embedding vectors.	When comparing "boston building company" and "boston construction company", "building" and "construction" are an EMBEDDING_MATCH.
FIELD_BLOCKED	This field cannot be matched because of a cross-field match involving the same field in the other name.	When comparing "Bob|William|Smith" with "William||Smith", "bob" is a FIELD_BLOCKED since the cross-field william match prevents it from matching with its corresponding field.
FIELD_CONFLICT	When comparing two names that are divided into fields, these fields do not match.	When comparing "Richard|William|Smith" with "Richard|Johnson|Smith", "william" and "johnson" would be considered a FIELD_CONFLICT.
FIELD_DELETION	When comparing two names that are divided into fields, this field is unmatched.	When comparing "Richard|Xi|Smith" with "Richard||Smith", "xi" would be considered a FIELD_DELETION.
GIVEN_NAME_DELETION	When comparing two names that are divided into fields, the GIVEN_NAME field is unmatched.	When comparing "Richard|William|Smith" and "||William|Scott", "Richard" will be a GIVEN_NAME_DELETION if that field in both names is marked as a `Given_name` field.
HANI_ABBREVIATION	One Hani token appears to be an abbreviation of another Hani token.	"北京大学" and "北大" are a HANI_ABBREVIATION match.
HMM_MATCH	The tokens are similar but not identical, and the match was determined by a particular model (hidden Markov model). This is a type of fuzzy match.	"richard" and "richerd" are an HMM_MATCH.
INITIALISM	One token is a name and the other token is the initials of the words which make up the name.	"john fitzgerald kennedy" and "JFK" are an INITIALISM. "consumer value stores" and "CVS" are an INITIALISM.
INITIAL_MATCH	One token is the first initial of the other.	"w" and "william" are an INITIAL_MATCH.
LANGUAGE_SPECIFIC_MATCH	The match was determined by a language-specific matcher.	"laden" and "لادن" are a LANGUAGE_SPECIFIC_MATCH.
MATCH	The tokens are identical (after stop word elimination and normalization).	"john" and "john" are a MATCH.
NULL	The NULL phenomenon is only listed in this table for completeness. It is only used internally and will never be returned in the SpanMatch object.	N/A
OUT_OF_ORDER_DELETION	This unmatched token still leaves the remaining tokens out of order when it is removed.	When comparing "George Herbert Walker Bush" with "George Bush Walker", "herbert" would be considered an OUT_OF_ORDER_DELETION.
OVERRIDE	The tokens appear as a pair on the override list. This is often used for nicknames.	"john" and "jack" will be an OVERRIDE match if they appear as a pair on the override list.
PREFIX_INITIAL	One token is an initial that matches a prefix in the other token. In practice, the PREFIX_INITIAL phenomenon is rare.	If the `initialsScore` parameter is set to 0.1, "E Silva" and "EduardoSil" will be a PREFIX_INITIAL match.
STRING_SIMILARITY	The tokens are similar in string edit distance (number of insertions, deletions, and substitutions) but not similar enough to be a fuzzy match.	"akcd" and "xkcd" are a STRING_SIMILARITY match.
STUCK_INITIAL	One name appears to have an initial mistakenly attached to a preceding token.	"DavidK" and "David Keith" are a STUCK_INITIAL match.
SURNAME_DELETION	When comparing two names that are divided into fields, the SURNAME field is unmatched.	When comparing "Richard|William|Smith" and "Richard|William||", "Smith" will be a SURNAME_DELETION if that field in both names is marked as a `Surname` field.
TRAILING_PATRONYMIC_DELETION^[a]	The unmatched token is a patronymic which has been truncated in the other name.	When comparing "Faisal bin Fahd bin Abdullah" and "Faisal bin Fahd", "bin Abdullah" is considered a TRAILING_PATRONYMIC_DELETION.
TRUNCATED_EXACT_MATCH	The tokens are identical except that one has been slightly truncated.	"murgatroyd" and "murgatroy" are a TRUNCATED_EXACT_MATCH.
TRUNCATED_HMM_MATCH	The tokens are similar, but not identical, and one has been slightly truncated.	"gilpatrickz" and "gillpatrick" are a TRUNCATED_HMM_MATCH.
UNKNOWN_FIELD_MATCH	One of the tokens is part of an "unknown" field in a fielded name. The UNKNOWN_FIELD_MATCH phenomenon is rare and usually requires use of the Java API.	When comparing "Richard|William|Smith" with "Richard|William|Scott", if the first field is an "unknown" field, "richard" and "richard" would be considered an UNKNOWN_FIELD_MATCH.
^[a]Only applies to Latin script names of Arabic origin.
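
The STRING_SIMILARITY row refers to string edit distance, the number of insertions, deletions, and substitutions needed to turn one token into the other. The sketch below computes that count with the standard Levenshtein dynamic program. It only illustrates the measure. How RNI converts an edit distance into a score, and where it draws the line between STRING_SIMILARITY and a fuzzy match, is not described here and is not modeled. The class and method names are hypothetical.

```java
/** Illustration only: classic Levenshtein edit distance between two tokens. */
final class EditDistanceSketch {

    /** Returns the minimum number of single-character insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        // prev holds distances for the previous row of the table, curr for the current row.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("akcd", "xkcd"));       // one substitution
        System.out.println(levenshtein("richard", "richerd")); // one substitution
    }
}
```

Both example pairs from the table are one edit apart, yet the table assigns them different phenomena (STRING_SIMILARITY and HMM_MATCH), so edit distance alone does not determine which phenomenon is reported.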

Parameters

This table lists the parameters that can be configured via parameter_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in parameter_defs.yaml.

Table 26. Parameter impacts

Parameter name	Applies to	Impacts
addressCrossFieldScoreThreshold	Addresses	Can affect any kind of match
addressDeletionScore	Addresses	DELETION match phenomenon
addressDifferentGroupPenalty	Addresses	Can affect any kind of match
addressFinalBias	Addresses	All address match scores
addressJoinedTokenLimit	Addresses	Concatenation^[a]
addressOverrideDefaultScore	Addresses	OVERRIDE match phenomenon
addressOverrideTablePath	File locations	Internal engineering detail
addressReorderPenalty	Addresses	Reordering^[e]
addressSameGroupPenalty	Addresses	Can affect any kind of match
addressStopPatternsPath	File locations	Internal engineering detail
addressUnpairedFieldScore	Addresses	FIELD_DELETION match phenomenon
adjustOneSidedDeletionScores	All names	DELETION match phenomenon
allowNullValue	Elasticsearch	Elasticsearch setting
alternativePairsToCheck	All names	Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)
alternativeTimeProximityMatch	Dates	All date match scores
boostWeightAtBothEnds	All names	All name match scores
boostWeightAtLeftEnd	All names	All name match scores
boostWeightAtRightEnd	All names	All name match scores
caseSensitiveData	All names	INITIALISM match phenomenon
cityAddressFieldWeight	Addresses	Weighting
cityDistrictAddressFieldWeight	Addresses	Weighting
cognateOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as COGNATE in the override file
conflictScore	All names	CONFLICT match phenomenon
conflictThreshold	All names	CONFLICT match phenomenon
countryAddressFieldWeight	Addresses	Weighting
countryRegionAddressFieldWeight	Addresses	Weighting
crossFieldInitialsPenalty	Fielded names	INITIAL_MATCH match phenomenon
crossFieldJoinInitialPenalty	Fielded names	Concatenation^[a]
crossFieldJoinPenalty	Fielded names	Concatenation^[a]
crossFieldMatchPenalty	Fielded names	Can affect any kind of match
crossLanguageGenderConflictPenalty	All names	Gender mismatch^[b]
dateFinalBias	Dates	All date match scores
dateOrdering	Dates	All date match scores
dayDistanceWeight	Dates	All date match scores
deletionScore	All names	DELETION match phenomenon
detectableLanguagesModelBased	All names	Can affect any kind of match where one or more names is in Latin script and the language is not already specified.
detectableLanguagesRuleBased	All names	Currently, you can only enable detection of Latin script as Turkish or Vietnamese.
editDistanceScoreBias	All names	Can affect any kind of match
enableDynamicConfigurationEndpoints	Elasticsearch	Elasticsearch setting
enablePromisingTermFiltering	Speed/Accuracy	Performance only
enableYueReadings	All names	Names written in Han script
entranceAddressFieldWeight	Addresses	Weighting
equivalenceClassesPath	File locations	Internal engineering detail
estimatedConflictOrDeletionScore	All names	Internal engineering detail
exactLatnMatchScore	All names	Token normalization
expensiveScorerJoinedTokenLimit	All names	Concatenation^[a]
fieldBlockedScore	Fielded names	OUT_OF_ORDER_DELETION match phenomenon
fieldConflictScore	Fielded names	CONFLICT match phenomenon
fieldDeletionScore	Fielded names	DELETION match phenomenon
finalBias	All names	All name match scores
frequencyRankBias	All names	Can affect any kind of match
genderConflictPenalty	All names	Gender mismatch^[b]
genderConflictPenaltyThreshold	All names	Gender mismatch^[b]
globalTokenCacheConfig	Speed/Accuracy	Performance only
globalTokenPairCacheConfig	Speed/Accuracy	Performance only
haniAbbreviationScore	All names	INITIALISM match phenomena in Han script
haniAbbreviationThreshold	All names	INITIALISM match phenomena in Han script
haniFourCornerCodeMismatchPenalty	All names	Names written in Han script
hmmNormalizationAlternative	All names	HMM_MATCH phenomenon
hmmScoreBias	All names	HMM_MATCH phenomenon
hmmScoreLimit	All names	HMM_MATCH phenomenon
houseAddressFieldWeight	Addresses	Weighting
houseNumberAddressFieldWeight	Addresses	Weighting
ignoreBadData	Elasticsearch	Elasticsearch setting
improveSingleDigitManipulationMatch	Dates	Date match scores containing exactly one instance of digit manipulation^[c] and no other differences
initialFrequencyRank	All names	INITIAL_MATCH match phenomenon
initialismMismatchPenalty	All names	HMM_MATCH phenomenon
initialismScore	All names	INITIALISM match phenomenon
initialsConflictScore	All names	CONFLICT match phenomenon
initialsDeletionPenalty	All names	DELETION match phenomenon
initialsScore	All names	INITIAL_MATCH match phenomenon
islandAddressFieldWeight	Addresses	Weighting
joinedTokenInitialsPenalty	All names	Concatenation^[a]; INITIAL_MATCH match phenomenon
joinedTokenLimit	All names	Concatenation^[a]
joinedTokenPenalty	All names	Concatenation^[a]
levelAddressFieldWeight	Addresses	Weighting
libpostalDataDirPath	File locations	Internal engineering detail
lowWeightTokenFrequencyRank	All names	Can affect any kind of match
lowWeightTokenPath	File locations	Internal engineering detail
maximumAlternateTokenizationRelativeDistance	All names	Affects tokenization and therefore any potential score
maximumOrganizationInitialismLength	Organization names	INITIALISM match phenomenon
maximumPersonInitialismLength	Person names	INITIALISM match phenomenon
maxYearDistanceForDigitManipulation	Dates	Date match scores containing exactly one instance of digit manipulation^[c] and no other differences.
minFieldWeightFactor	Fielded names	Weighting
minimumAlternateTokenizationLength	All names	Affects tokenization and therefore any potential score
minimumOrganizationInitialismLength	Organization names	INITIALISM match phenomenon
minimumPersonInitialismLength	Person names	INITIALISM match phenomenon
monthDistanceWeight	Dates	All date match scores
nameBigramQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameDoubleMetaphoneQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameGluedQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameInitialQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
nameLengthMismatchPenalty	All names	DELETION match phenomenon; Concatenation^[a]; any phenomenon that changes the number of tokens in a name
nameRealWorldIdQueryBoost	Lucene searches	First-pass accuracy; can affect any kind of match
ngramLMPath	File locations	Internal engineering detail
ngramThresholdPath	File locations	Internal engineering detail
nicknameOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as NICKNAME in override file
numericTokenFrequencyRank	All names	Can affect any kind of match
outOfOrderDeletionScore	All names	OUT_OF_ORDER_DELETION match phenomenon
parseUnknownFieldMarker	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
poBoxAddressFieldWeight	Addresses	Weighting
postCodeAddressFieldWeight	Addresses	Weighting
queryAlternativeOriginLanguages	Speed/Accuracy	Can affect any kind of match
realWorldIdsPath	File locations	Internal engineering detail
realWorldIdsPathUser	File locations	Internal engineering detail
reorderCorrection	All names	Rotation^[d]
reorderCorrectionThreshold	All names	Rotation^[d]
reorderPenalty	All names	Reordering^[e]
rniFullnameOverridesPath	File locations	Internal engineering detail
rntFullnameOverridesPath	File locations	Internal engineering detail
roadAddressFieldWeight	Addresses	Weighting
sameNameUnknownFieldMatchInterpolator	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
staircaseAddressFieldWeight	Addresses	Weighting
stateAddressFieldWeight	Addresses	Weighting
stateDistrictAddressFieldWeight	Addresses	Weighting
stopPatternsPath	File locations	Internal engineering detail
stringDistanceWeight	Dates	All date match scores
stuckInitialAffixMinLength	All names	STUCK_INITIAL match phenomenon
stuckInitialScore	All names	STUCK_INITIAL match phenomenon
suburbAddressFieldWeight	Addresses	Weighting
thresholdToDropoffBiasMapping	Dates	All date match scores
timeDistanceWeight	Dates	All date match scores
timeProximityYearInterval	Dates	All date match scores
tokenizeOrganizationsWithNumbers	Organization names	Affects tokenization and therefore any potential score
tokenOverridesPath	File locations	Internal engineering detail
trailingPatronymicDeletionScore	Person names	TRAILING_PATRONYMIC_DELETION match phenomenon
truncationFractionLimit	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
truncationScorerBias	All names	TRUNCATED_EXACT_MATCH match phenomenon; TRUNCATED_HMM_MATCH match phenomenon
tryAlternateTokenization	All names	Affects tokenization and therefore any potential score
tryDayMonthSwap	Dates	All date match scores
unigramLMPath	File locations	Internal engineering detail
unitAddressFieldWeight	Addresses	Weighting
unknownFieldFrequencyRank	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
unknownVsKnownScore	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
unknownVsUnknownScore	Fielded names	UNKNOWN_FIELD_MATCH match phenomenon
useEmbeddings	Organization names	EMBEDDING_MATCH match phenomenon
useSolrPhraseQueries	Solr	Solr plugin setting
variantOverrideScore	All names	OVERRIDE match phenomenon for tokens marked as VARIANT in the override file
worldRegionAddressFieldWeight	Addresses	Weighting
yearDistanceWeight	Dates	All date match scores
^[a]Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
^[b]Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
^[c]A digit manipulation is a transformation to a digit that can be accomplished with minimal additional lines. Digit manipulations may be intentional or the result of an OCR error. Our list of possible digit manipulations includes: 0↔8, 1↔7, 3↔8, 5↔8, 5↔6, 6↔8, 7↔2.
^[d]If the tokens in a name have been rotated, the reorder penalty will negatively impact the match score. RNI detects and compensates for this error.
^[e]Tokens that match, but that appear to be out-of-order, have their match scores adjusted to reflect that fact.
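
The digit manipulation footnote can be made concrete with a short check that tests whether two date strings differ by exactly one listed digit swap and nothing else. The sketch below hard-codes the pairs from the footnote. The class and method names are hypothetical, and how RNI adjusts the score after detecting such a difference (for example, via improveSingleDigitManipulationMatch or maxYearDistanceForDigitManipulation) is not modeled.

```java
import java.util.Set;

/** Illustration only: detects exactly one digit manipulation between two date strings. */
final class DigitManipulationSketch {

    // The pairs listed in the footnote (0-8, 1-7, 3-8, 5-8, 5-6, 6-8, 7-2),
    // stored in both directions so the check is symmetric.
    private static final Set<String> PAIRS = Set.of(
            "08", "80", "17", "71", "38", "83", "58", "85",
            "56", "65", "68", "86", "72", "27");

    /** True when the strings have equal length and differ in exactly one position by a listed pair. */
    static boolean isSingleDigitManipulation(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int manipulations = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x == y) {
                continue;
            }
            if (!PAIRS.contains("" + x + y)) {
                return false; // a difference that is not a listed manipulation
            }
            manipulations++;
        }
        return manipulations == 1;
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigitManipulation("1980-03-15", "1988-03-15")); // 0 and 8 are a listed pair
        System.out.println(isSingleDigitManipulation("1980-03-15", "1981-03-15")); // 0 and 1 are not
    }
}
```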

Internal parameters

This table lists the parameters that can be configured via internal_param_profiles.yaml. When applicable, each parameter has been linked to the specific match phenomena that it impacts. You can find more information on each parameter in internal_param_defs.yaml.

Important

We recommend against modifying these parameters unless advised to do so by Rosette support.

Name	Applies to	Impacts
affixGlueThreshold^[a]	All names and addresses	Concatenation^[b]
allLanguageSupport	All names	Can affect any kind of match
allowCacheBonuses	All names	Internal engineering detail
alwaysComputeSuffixes^[a]	All names	TRUNCATED_EXACT_MATCH match phenomenon TRUNCATED_HMM_MATCH match phenomenon
araRNISpeedOption	Translated names	Speed and accuracy tradeoff
crossSurnameMatchPenalty	All names	Matches in languages with onomastic information
debuggableIndex	N/A	Internal engineering detail Has no effect on matching
debugPrintTuples	N/A	Internal engineering detail Has no effect on matching
defaultScoreToCheckRestriction	All names Dates Addresses	First-pass scoring
disabledLanguages	All names
doFrontTruncations	All names	TRUNCATED_EXACT_MATCH match phenomenon TRUNCATED_HMM_MATCH match phenomenon
doQueryBigrams	All names	First-pass accuracy
doQueryCompleted	All names	First-pass accuracy
doQueryFullnameOverrides	All names	First-pass accuracy
doQueryFuzzy	All names	First-pass accuracy
doQueryGlued	All names	First-pass accuracy
doQueryIndexKeys	All names	First-pass accuracy
doQueryInitials	All names	First-pass accuracy
doQueryNormalized	All names	First-pass accuracy
doQueryPersonInitialisms	All names	First-pass accuracy
doQueryPhrase	All names	First-pass accuracy
doQueryRealWorldIds	All names	First-pass accuracy
doQueryTokenOverrides	All names	First-pass accuracy
doQueryTranslated	All names	First-pass accuracy
doViterbiRescaling	All names	Fuzzy match^[c]
editDistanceTokenScorerPenalty	All names	STRING_SIMILARITY match phenomenon
embeddingBias	Organization names	EMBEDDING_MATCH match phenomenon
embeddingZeroScore	Organization names	EMBEDDING_MATCH match phenomenon
enableAdditionalOnomastics	All names	Matches in languages with onomastic information
enableRemoteTokenScorer	All names	Japanese/English fuzzy match^[c]
enableSeq2SeqTokenScorer	All names	Japanese/English fuzzy match^[c]
enableTokenPairLogging	N/A	Internal engineering detail Has no effect on matching
engEngFastMode	All names	English/English fuzzy match^[c]
expandedLanguages	All names	Fuzzy match^[c] OVERRIDE match phenomenon
expansionLimit	All names	Fuzzy match^[c] OVERRIDE match phenomenon
expansionScoreThreshold	All names	Fuzzy match^[c] OVERRIDE match phenomenon
familiarTokenMismatchPenalty	All names	Can affect any kind of match
familiarTokenThreshold	All names	Can affect any kind of match
firstPassDayRange	Dates	Performance only
firstPassMonthRange	Dates	Performance only
firstPassYearRange	Dates	Performance only
foreignAddressFinalBias	Addresses	All English-to-non-English address matches
genderPenaltyMinimumLength	All names	Gender mismatch^[d]
givenFieldDeletionScore	Fielded names	DELETION match phenomenon
HMMCachePerProcess	All names	Internal engineering detail HMM_MATCH phenomenon
HMMCachePerThread	All names	Internal engineering detail HMM_MATCH phenomenon
hmmNormBias	All names	Internal engineering detail Fuzzy match^[c]
HMMUsageThreshold	All names	Internal engineering detail HMM_MATCH phenomenon
identifierEditDistanceTokenScorerPenalty	Identifiers	STRING_SIMILARITY match phenomenon
ignoreTranslationOrigins	All names	Can affect any kind of match that uses English transliteration
includeExtraKatakanaPersonReadings	Translated names	Can affect any kind of match
initialAndSuffixMinLength	All names	Fuzzy match^[c] INITIAL_MATCH match phenomenon
initialAndSuffixScore	All names	Fuzzy match^[c] INITIAL_MATCH match phenomenon
jniBias	All names	Can affect any kind of match in languages that use a JNI scorer
jpnRNISpeedOption	Translated names	Speed and accuracy tradeoff
kanjiMismatchPenalty	All names	Normalization of tokens that include kanji
katakanaTransliterationsOnly	Translated names	Can affect any kind of match
korRNISpeedOption	Translated names	Speed and accuracy tradeoff
latinDataAlternativesToCheck	All names	Can affect any kind of match where English transliterations are being used, and where there are multiple possible transliterations (e.g., Chinese/Japanese/Korean readings of Han names)
limitedLanguageEditDistance	All names	STRING_SIMILARITY match phenomenon
maxIdentifierEditDistance	All names	First-pass accuracy
notExactMatchPenalty	All names	Normalization
_postCodePathAddressFieldWeight	Addresses	Weighting
promisingFuzzyTermFrequencyFactor	Speed/Accuracy	Performance only
promisingTermFrequencyFactor	Speed/Accuracy	Performance only
queryMaxResults	All names Dates Addresses	First-pass scoring
queryMaxToCheck	All names Dates Addresses	First-pass scoring
queryMaxToConsider	All names Dates Addresses	First-pass scoring
queryToCheckAllowance	All names Dates Addresses	First-pass scoring
realWorldIdScore	Organization names	Real-world id match^[e]
remoteTokenScorerURL	All names	Internal engineering detail Japanese/English fuzzy match^[c]
rntTokenOverridesPath	File locations	Internal engineering detail
rusRNISpeedOption	Translated names	Speed and accuracy tradeoff
secondarySurnameTokenTypeWeight	All names	Matches in languages with onomastic information
seq2seqCachePerProcess	All names	Internal engineering detail Japanese/English fuzzy match^[c]
seq2seqCachePerThread	All names	Internal engineering detail Japanese/English fuzzy match^[c]
seq2seqTokenOverridesPath	File locations	Internal engineering detail
seq2seqUsageThreshold	All names	Internal engineering detail Japanese/English fuzzy match^[c]
splitTokens	All names	Internal engineering detail
stringDistanceThreshold^[a]	All names	Fuzzy match^[c]
surnameFieldDeletionScore	Fielded names	DELETION match phenomenon
surnameTokenTypeWeight	All names	Matches in languages with onomastic information
taggerMinimumConfidenceThreshold	All names	Matches in languages with onomastic information
translatorResultsToKeep	Translated names	Can affect any kind of match
truncationAffixSimilarityLength	All names	TRUNCATED_EXACT_MATCH match phenomenon TRUNCATED_HMM_MATCH match phenomenon
truncationAffixSimilarityThreshold	All names	TRUNCATED_EXACT_MATCH match phenomenon TRUNCATED_HMM_MATCH match phenomenon
truncationLengthLimit	All names	TRUNCATED_EXACT_MATCH match phenomenon TRUNCATED_HMM_MATCH match phenomenon
useCharacterLM	All names	Can affect any kind of match
useEditDistanceTokenScorer	All names	STRING_SIMILARITY match phenomenon
useIdentifierEditDistanceTokenScorer	Identifiers	STRING_SIMILARITY match phenomenon
useLM	All names	Can affect any kind of match
useOldAndNewNameSegmentationForJapanese	All names	Can affect any kind of match involving Japanese translations
useRealWorldIds	Organization names	Real-world id match^[e]
zhoRNISpeedOption	Translated names	Speed and accuracy tradeoff
^[a]Unlike public parameters for this feature, this is a speed/accuracy tradeoff, not a science-tuning parameter.
^[b]Concatenation occurs when adjacent tokens are joined together to see if the resulting compound token will be a good match for any tokens in the other name.
^[c]A fuzzy match is a match between tokens that are similar but not identical. The HMM_MATCH and SEQ2SEQ_MATCH phenomena are examples of this.
^[d]Gender mismatch occurs when the apparent or specified genders of the two names being compared do not match.
^[e]RNI contains real world identifiers for corporations, which pair an entity id with nicknames and common permutations of the corporation name.
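
As an illustration of the concatenation footnote, the sketch below joins each pair of adjacent tokens in one name and checks whether any resulting compound equals a token in the other name. The class and method names are hypothetical and are not part of the RNI API. It uses case-insensitive exact equality as a stand-in for RNI's similarity scoring, and it does not model `affixGlueThreshold`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: not part of the RNI API.
public class ConcatenationCandidates {
    /** Joins each pair of adjacent tokens, returning one compound candidate per pair. */
    public static List<String> adjacentCompounds(List<String> tokens) {
        List<String> compounds = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            compounds.add(tokens.get(i) + tokens.get(i + 1));
        }
        return compounds;
    }

    /** True if any compound built from `name` equals a token of `other`, ignoring case. */
    public static boolean hasConcatenationMatch(List<String> name, List<String> other) {
        for (String compound : adjacentCompounds(name)) {
            for (String token : other) {
                // Assumption: exact comparison stands in for a real similarity score.
                if (compound.equalsIgnoreCase(token)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("Mary", "Ann", "Smith");
        List<String> b = Arrays.asList("Maryann", "Smith");
        System.out.println(adjacentCompounds(a));        // [MaryAnn, AnnSmith]
        System.out.println(hasConcatenationMatch(a, b)); // true
    }
}
```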

Directory structure

While not exhaustive, this appendix briefly describes and details the location of parts of RNI-RNT that developers may need to access. Any directories not described in this appendix, such as those containing internally used compiled code referenced by other parts of the product, will not need to be accessed by most developers.

`BT_ROOT` directory

BT_ROOT is the Basis Root directory. This is the high-level structure of an RNI-RNT installation. As a developer, you will not need to access most of these files and folders. Some files and folders of note are listed below. For more information on this directory, see Setting the Basis root directory.

`$BT_ROOT\rlp\rlp\licenses`	This is where you should copy the license file included in your product shipment.
`$BT_ROOT\rlpnc\lib\jvm`	Library of jar files you may need to include on your classpath for logging purposes or when building and running RNI-RNT applications.
`$BT_ROOT\rlpnc\samples\java`	Source files for sample applications and the Ant build file for compiling and running them (`build.xml`).
`$BT_ROOT\rlpnc\samples\java\logging`	Contains a copy of `log4j.properties`, which is used by our samples. Adjust the copy that you place on your classpath to meet your specific runtime logging needs.
`$BT_ROOT\rlpnc\copyright.txt`	Copyright information.
`$BT_ROOT\rlpnc\ThirdPartyLicenses.txt`	Information on the third party components included in the software.
`$BT_ROOT\frequencyModelTrainer.zip`	Use this to train a language model on your own name data. Instructions and a full description of arguments are in the `README.txt` file in the zip file.
`$BT_ROOT\realWorldIDBuilder.zip`	Use this to build a real world ID binary file. Instructions on how to run the program are in the `README.md` file in the zip file.
`$BT_ROOT\RLPNC-version.txt`	A text file containing the version number.
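
The locations in the table above can be resolved relative to a root path. The following is a minimal sketch, assuming the root is passed as the first command-line argument; the class and method names are hypothetical and are not part of the RNI-RNT API. It uses `java.nio.file.Path` so the separators work on any platform, and it only reports whether each location exists.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: not part of the RNI-RNT API.
public class BtRootLayout {
    /** Directory where the license file is copied. */
    public static Path licensesDir(Path btRoot) {
        return btRoot.resolve("rlp").resolve("rlp").resolve("licenses");
    }

    /** Directory holding the jar files to place on the classpath. */
    public static Path jarDir(Path btRoot) {
        return btRoot.resolve("rlpnc").resolve("lib").resolve("jvm");
    }

    /** Directory holding the Java sample sources and build.xml. */
    public static Path samplesDir(Path btRoot) {
        return btRoot.resolve("rlpnc").resolve("samples").resolve("java");
    }

    /** Text file containing the version number. */
    public static Path versionFile(Path btRoot) {
        return btRoot.resolve("RLPNC-version.txt");
    }

    public static void main(String[] args) {
        // Assumption: the root is supplied by the caller; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path[] locations = {
            licensesDir(root), jarDir(root), samplesDir(root), versionFile(root)
        };
        for (Path p : locations) {
            System.out.println(p + (Files.exists(p) ? "" : " (missing)"));
        }
    }
}
```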

`data` directory

This is the low-level structure of the $BT_ROOT\rlpnc\data folder. As a developer, most of the files you will need to access are contained in this folder. File paths and descriptions of important files and folders are included below.

`$BT_ROOT\rlpnc\data\addresses\ref`	Override and stop word files for matching addresses.
`$BT_ROOT\rlpnc\data\etc`	Files pertaining to match parameters. The parameters are defined in `parameter_defs.yaml` and modified in `parameter_profiles.yaml`. You can also define parameter universes in `parameter_profiles.yaml`.
`$BT_ROOT\rlpnc\data\libpostal`	Data for libpostal, a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. If you are certain that you will not match unfielded addresses, you can safely delete the libpostal data directory without affecting any other RNI-RNT functionality.
`$BT_ROOT\rlpnc\data\real_world_ids\ref\omit_ids`	Contains `omit_ids.datafiles`, which is where you can enable and point to any omit files you have placed in the `$BT_ROOT` directory.
`$BT_ROOT\rlpnc\data\rnm\ref\override`	Files for name matching overrides, stop patterns, stop word prefixes, normalizing token variants, and unimportant tokens.
`$BT_ROOT\rlpnc\data\rnm\sample`	Sample data for name matching.
`$BT_ROOT\rlpnc\data\rnt\ref\override`	Name translation override files.

^[8]# may also be used after an entry on the same line to begin a comment.

^[9]Override files are not provided for all supported languages. Specifically, while no files are provided for Russian or Korean, you can create token pair files for these languages.

^[10]Language of use: the language of the document in which the name appears.

^[11]RNI depends on the jpostal binding for the open source libpostal library to parse unfielded addresses as a pre-processing step. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support@basistech.com if you discover any issues.

^[12]# may also be used after an entry on the same line to begin a comment.

^[13]# may also be used after an entry on the same line to begin a comment.

^[14]RNI depends on the jpostal binding for the open source libpostal library to parse unfielded addresses as a pre-processing step. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support@basistech.com if you discover any issues.

^[15]Hebrew orthographic completion occurs automatically; it is not controlled by this option.

^[16]Arabic, Burmese, Khmer, and Thai segmentation occurs automatically; they are not controlled by this option.

`FAST`	Greatest speed; least correctness
`NORMAL`	Even tradeoff between speed and correctness (the default)
`CAREFUL`	More correctness; less speed
`PRECISE`	Greatest correctness; least speed

Match Identity