Rosette Base Linguistics (RBL)
Introduction
Rosette Base Linguistics (RBL) provides a set of linguistic tools to prepare your data for analysis. Language-specific modules provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots. RBL also includes a Chinese Script Converter (CSC) which converts tokens in Traditional Chinese text to Simplified Chinese and vice versa.
Using RBL
You can use RBL in your own JVM application, use its Apache Lucene compatible API in a Lucene application, or integrate it directly with either Apache Solr or Elasticsearch. [1]
JVM Applications
To integrate base linguistics functionality into your applications, RBL includes two sets of Java classes and interfaces:
ADM API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis as a set of annotations. This collection is called the Annotated Data Model (ADM) and is used in other Rosette tools, such as Rosette Language Identifier and Rosette Entity Extractor, as well as RBL. There are some advanced features which are only supported in the ADM API and not the classic API.
When using the ADM API, you create an annotator which includes both tokenizer and analyzer functions.
Classic API: A collection of Java classes and interfaces, analogous to the ADM API, that generates and represents Rosette's linguistic analysis, except that it is not compatible with other Rosette products. It also supports streaming: a user can start processing a document before the entire document is available, and it can produce results for pieces of a document without storing the results for the entire document in memory at once.
When using the classic API, you create tokenizers and analyzers.
Lucene
In an Apache Lucene application, you use a Lucene analyzer which incorporates a Base Linguistics tokenizer and token filter to produce an enriched token stream for indexing documents and for queries.
Solr
With the Solr plugin, an Apache Solr search server uses RBL for both indexing documents and for queries.
Elasticsearch
Install the Elasticsearch plugin to use RBL for analysis, indexing, and queries.
Note
The Lucene, Solr, and Elasticsearch plugins use APIs based on the classic API. All options in the enums TokenizerOption or AnalyzerOption are available, along with some additional plugin-specific options.
Linguistic Objects
RBL performs multiple types of analysis. Depending on the language, one or more of the following may be identified in the input text:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object.
In the ADM API, use the BaseLinguisticsFactory to set the linguistic options and instantiate an Annotator, which annotates the input text. The ADM API creates an annotator for all linguistic objects, including tokens.
In the classic API, use the BaseLinguisticsFactory to configure and create tokenizers, analyzers, and CSC analyzers. The classic API creates separate tokenizers and analyzers.
Features
The following table indicates the type of support that RBL provides for each supported language. The RBL tokenizer provides normalization, tokenization, and sentence boundary detection. The RBL analyzer provides lemma lookup (including orthographic normalization for Japanese), lemma guessing (when the lookup fails), decompounding, and supports lemma, segmentation, and many-to-one normalization user dictionaries.
For unknown languages (language code xxx), generic rules, such as whitespace and punctuation delimitation, are used to tokenize. The tokenizer also identifies some common acronyms and abbreviations, as well as sentence boundaries. Segmentation user dictionaries are supported for unknown languages.
Languages | Tokenization | Sentence Boundary | Token Normalization | Lemma Lookup | Part-of-Speech Tagging | Disambiguation | Lemma User Dictionary | Segmentation User Dictionary | Decompounding | Readings | Script Conversion | Stem | Semitic Root | n:1 Normalization User Dictionary
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Arabic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Catalan | ✓ | ✓ | ✓ | ✓ | ✓
Chinese | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Chinese (alternative [k]) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Czech | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Danish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Dutch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
English | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Estonian | ✓ | ✓ | ✓ | ✓ | ✓
Finnish | ✓ | ✓ | ✓ | ✓ | ✓
French | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
German | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Greek | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Hebrew | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Hungarian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Indonesian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Italian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Japanese (statistical) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Japanese (alternative) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Korean | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Latvian | ✓ | ✓ | ✓ | ✓ | ✓
Malay, Standard | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Norwegian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Persian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Polish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Portuguese | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Pashto | ✓ | ✓ | ✓ | ✓
Romanian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Russian | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Serbian | ✓ | ✓ | ✓ | ✓ | ✓
Slovak | ✓ | ✓ | ✓ | ✓ | ✓
Spanish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Swedish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Tagalog | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Thai | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Turkish | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Ukrainian | ✓ | ✓ | ✓ | ✓
Urdu | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Unknown | ✓ | ✓ | ✓ | ✓
[a] With the exception of Hebrew and Arabic, the tokenizer can apply Normalization Form KC (NFKC) to the tokens. For Arabic, Persian, and Urdu, see Arabic, Persian, and Urdu Token Analysis.
[b] For Japanese, Japanese Lemma Normalization is also available.
[c] See Part-of-Speech Tags.
[d] With the exception of Japanese using the statistical model, by default the analyzer returns a disambiguated analysis for the supported languages. For performance, Japanese disambiguation using the statistical model is turned off by default.
[e] The base form of the token to which affixes may be added. For Finnish, this is the Porter stem.
[f] The Semitic root for the token (an empty string if the root cannot be determined).
[g] Maps one or more tokens to a single token.
[h] Simplified and Traditional Chinese.
[i] Using the statistical tokenizer.
[k] Using the alternative tokenizer.
[l] For Hebrew, the tokenizer generates a lemma and a Semitic root for each token.
[m] The base linguistics token filter excludes Japanese lemmas for auxiliary verbs, particles, and adverbs from the token stream.
[n] Transcriptions rendered in Hiragana for Japanese tokens.
[o] North Korean and South Korean.
[p] Bokmål and Nynorsk.
[q] Iranian and Afghan.
[r] Latin alphabet, but not Cyrillic.
Neural Models
Some features in RBL are supported by deep learning (neural) models, which require the native TensorFlow library. The neural network is used for the following features:
- Part-of-speech (POS) tagging for Indonesian, Standard Malay, and Tagalog.
- Hebrew disambiguation, when the disambiguatorType is set to DNN.
- Tokenization for spaceless Korean, when the tokenizerType is set to SPACELESS_STATISTICAL.
Annotator - ADM API
BasisTech has created a collection of Java classes that generate and represent Rosette's linguistic analyses as a set of annotations. This collection of Java classes is called the Annotated Data Model (ADM) and may be used in RLI and Entity Extractor as well as in RBL.
The ADM provides advanced functionality not available with the classic API.
When using the ADM API, you use BaseLinguisticsFactory to set the options and instantiate an Annotator. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings.
The ADM API supports more options than the classic API. Whenever possible, we recommend using this API.
For complete API documentation of the ADM, consult the Javadoc for the package com.basistech.rosette.dm.
ADM Usage Pattern
The standard procedure for using the ADM is as follows:
1. Instantiate an Annotator.
2. Use the Annotator to annotate the input text.
3. Get the analytical data you want from the returned AnnotatedText object.
Create an Annotator
Use BaseLinguisticsFactory to set the BaseLinguisticsOptions and to instantiate an Annotator. Options may be set on the factory itself or passed in to a create method, such as createSingleLanguageAnnotator or createCSCAnnotator.
At a minimum, you should set the rootDirectory and language options.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
Then use the BaseLinguisticsFactory to create the Annotator. This sample sets the language to English (eng).
EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "eng");
Annotator annotator = factory.createSingleLanguageAnnotator(options);
Annotate
Use the Annotator to annotate the input text; this returns an AnnotatedText object.
The AnnotatedText object provides an API for gathering data from the linguistic analysis that RBL performs during the annotation process. Depending on the analysis and the language, you may get information about sentences, layout regions, tokens, compounds, and readings.
getTokens() returns a list of tokens, each of which contains a list of morphological analyses.
AnnotatedText results = annotator.annotate(getInput(inputFilePathname));
int index = 0;
for (Token token : results.getTokens()) {
    outputData.format("token %d:\t%s%n", index, token.getText());
    int aindex = 0;
    List<MorphoAnalysis> analyses = token.getAnalyses();
    if (null != analyses) {
        outputData.format("\tindex\tlemma\tpart-of-speech%n");
        for (MorphoAnalysis ma : analyses) {
            outputData.format("\t%d %s\t%s%n", aindex, ma.getLemma(), ma.getPartOfSpeech());
            aindex++;
        }
    }
    index++;
}
Classic API
When using the classic API, you instantiate separate factories for tokenizers and analyzers.
- BaseLinguisticsFactory#createTokenizer produces a language-specific tokenizer that processes documents, producing a sequence of tokens.
- BaseLinguisticsFactory#createAnalyzer produces a language-specific analyzer that uses dictionaries and statistical analysis to add analysis objects to tokens.
If your application requires streaming, use this API. The Lucene, Solr, and Elasticsearch integrations use these methods.
For the complete API documentation, consult the Javadoc for BaseLinguisticsFactory.
Tokenizer
Use BaseLinguisticsFactory#createTokenizer to create a language-specific tokenizer that extracts tokens from a plain text source. Prior to using the factory to create a tokenizer, use the factory with BaseLinguisticsOption to define the root of your RBL installation, as illustrated in the following sample. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
The Tokenizer uses a word breaker to establish token boundaries and detect sentences. For each token, it also provides offset information, the length of the token, and a tag. Some tokenizers calculate morphological analysis information as part of the tokenization process, filling in the appropriate analysis entries in the token objects that they return. For other languages, you use the analyzer described below to return analysis objects for each token.
Create a factory
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
Set tokenization options
factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true");
Create the tokenizer
Tokenizer tokenizer = factory.createTokenizer();
Analyzer
Use BaseLinguisticsFactory#createAnalyzer to create a language-specific analyzer. Prior to creating the analyzer, use the factory and BaseLinguisticsOption to define the RBL root, as illustrated in the sample below. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Analyzer analyzer = factory.createAnalyzer();
Use the Analyzer to return an array of Analysis objects for each token.
Multithreading
Annotator Management for Multithreaded Applications
For a multithreaded application, we recommend constructing a pool of annotators. Annotators must be used on a per-thread basis, and pooling avoids the overhead of creating an annotator each time one is required. The settings of an annotator cannot be changed after it is built, so if multiple configurations are required, store the differently configured annotators separately.
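A minimal pool sketch built from the factory and annotator calls shown elsewhere in this guide (the pool size, exception handling, and String input type are illustrative assumptions, and the RBL imports are omitted since package names vary by distribution; com.basistech.rosette.dm is the documented ADM package):
import com.basistech.rosette.dm.AnnotatedText;
import java.util.EnumMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Threads borrow an annotator, use it, and return it, so no annotator
// is ever used by two threads at once.
public final class AnnotatorPool {
    private final BlockingQueue<Annotator> pool;

    public AnnotatorPool(BaseLinguisticsFactory factory,
                         EnumMap<BaseLinguisticsOption, String> options,
                         int size) throws Exception {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            // Every annotator in the pool shares one fixed configuration.
            pool.add(factory.createSingleLanguageAnnotator(options));
        }
    }

    public AnnotatedText annotate(String text) throws InterruptedException {
        Annotator annotator = pool.take(); // blocks until an annotator is free
        try {
            return annotator.annotate(text);
        } finally {
            pool.put(annotator); // always return the annotator to the pool
        }
    }
}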
API Threading
RBL integrates easily with multithreaded architectures, but, to avoid performance penalties, it does not make promiscuous use of locks. Instead, most objects and interfaces in RBL are either read-only, reentrant objects or read-write, per-thread objects. BaseLinguisticsFactory objects are hybrids, as they have both thread-safe and per-thread methods. The create methods of this factory are thread-safe because they do not alter any data within the factory, but the setOption, user dictionary, and dynamic user dictionary methods are not thread-safe because they do alter data within the factory. The tokenizers and analyzers created by this factory are always meant to be used on a per-thread basis because they are not reentrant and do alter data within the objects. You can use a factory across multiple threads to create objects as long as calls to the factory methods for setting options or adding user dictionaries are synchronized appropriately. The objects created by the factory must each be created and used by only one thread, which need not be the thread that initialized the factory.
Getting Started
Minimum System Requirements
RBL is a Java SDK and works on any system with Java (including OpenJDK) installed.
- Java Runtime Environment 11, 17, or 21
- Ant 1.7.1 or later (optional; required to run the sample applications with the Ant build scripts)
Memory
The minimum Java heap required to run RBL smoothly is 1.5 GB. We have observed that an -Xmx setting of 8 GB allows for optimal performance in heavily multithreaded environments. This memory setting accounts for the simultaneous use of multiple languages.
We also recommend that you set -Xms equal to your -Xmx setting. This prevents the JVM from having to grow the heap, which is time-consuming.
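For example, to start your application with a fixed 8 GB heap (the classpath and main class here are placeholders for your own application):
java -Xms8g -Xmx8g -cp "lib/*:myapp.jar" com.example.MyRblApplication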
System Requirements for TensorFlow
RBL uses TensorFlow, a native library, when using a neural network. Ubuntu 14.04+, Windows 7+, and macOS 10.11+ are supported, but you should be able to run TensorFlow successfully on other modern Linux flavors as well.
The TensorFlow library for Linux requires a system with:
- libc.so.6 (GLIBC) 2.2.5 or newer
- libstdc++.so.6 (GLIBCXX) 3.4 or newer
- libgcc_s.so.1 (GCC) 3.0 or newer
The version of TensorFlow included with RBL is configured to work on a broad range of systems, which requires it to avoid some features that are not available everywhere. Compiling TensorFlow for your specific system and using it on the classpath instead of the default libtensorflow_jni-<tensorflowversion>.jar will likely improve performance. The included version of TensorFlow supports CUDA on Windows and Linux. Our neural models may perform better after installing CUDA on machines with supported GPUs.
Installing RBL
Your installation of RBL will include the following files:
The SDK package: rbl-je-<version>.zip, where <version> is the version of RBL you are installing, e.g. rbl-je-7.36.0.c62.2.zip. When you unzip the SDK package, the root directory is rbl-je-<version>. It contains text files with license and copyright information, along with the following subdirectories:
- dicts: RBL binary dictionaries.
- lib: The JAR files that the SDK uses. The core SDK .jar files are btrbl-je-<version>.jar and btcommon-api-<btcommonversion>.jar. The SDK also uses the Simple Logging Facade for Java (SLF4J), slf4j-api-<slf4jversion>.jar. You should add these .jar files to your classpath. If you are using Lucene or Solr, you should also add the appropriate JAR file to your classpath; the versions depend on the version of Lucene/Solr you are using.
- licenses: Default location for placing your license file, rlp-license.xml. The samples that accompany this SDK require rlp-license.xml to be in this directory.
- models: RBL binary models.
- samples: RBL sample files. Sample text and query files in all supported languages are in samples/data. These files are used with the code samples.
- tools: Tools for generating user dictionaries and the RBLCmd command line utility.
The documentation: rbl-je-<version>-doc.zip. When you unzip the documentation package to the same location where you have unzipped the SDK package, the root directory contains a doc subdirectory containing:
- Rosette Base Linguistics Application Developer's Guide (this document, rbl-je-<version>-appdev-guide.pdf)
- Release Notes (rbl-je-<version>-release-notes.pdf)
- Java API documentation (apidocs/index.html)
The license file: rlp-license.xml. The samples expect to find it in rbl-je-<version>/licenses.
Lucene/Solr Versions
RBL contains version-specific files for the Lucene and Solr integrations. These files support multiple versions of Lucene/Solr, as indicated below.
RBL supports Lucene versions 7.0 - 10.1.0 and Solr versions 7.0 - 9.8 with the following files, where <version> is the RBL version:
Solr version | Lucene version | RBL file name version | Solr lib JAR file | Lucene sample directory
---|---|---|---|---
7.0 - 8.11 | 7.0 - 8.11 | 7_0 | |
9.0 - 9.8 | 9.0 - 10.1.0 | 9_0 | |
Note on Logging
RBL uses the Simple Logging Facade for Java (SLF4J) to log activities. The SLF4J API JAR is in lib/.
SLF4J is a facade for various logging APIs. Using SLF4J, the developer or an administrator can determine which one of many popular logging systems to use at runtime. In the tools/lib directory, we include the SLF4J binding JARs:
- log4j-api-<version>.jar
- log4j-core-<version>.jar
- log4j-slf4j-impl-2.17.1.jar
- slf4j-api-<version>.jar
These JARs are used by our samples and RBLCmd. When you place these JARs on your classpath, the logging facade is bound to the implementation, and RBL logging is turned on. As defined in the file etc/log4j2.properties, by default WARN messages are output to the console (System.err). As the Javadoc for org.slf4j.impl.SimpleLogger explains, you can use system properties or this properties file to output to a file and control other logging parameters.
If you want to use SLF4J with a different implementation, put the appropriate binding JAR files and properties file on your classpath.
Removing Unnecessary Files
Depending on the scope of your application, you may wish to remove unnecessary files to reduce the size of your application.
Tool Files
The tools directory contains files for:
- Building user dictionaries
- Building CSC user dictionaries
You can delete some or all of the directories, as needed. For example, you may decide to delete the user dictionary directories but keep the RBLCmd utility. If you don't need to build these dictionaries or use the RBL command line utility, you may freely delete the entire tools directory.
Language-Specific Model and Dictionary Files
You can remove files that represent languages your application does not need to support. Some languages require files for other language codes, either because they are canonicalized to the other language, or because there is some internal RBL requirement. Some languages require the files of more than one language.
When removing language files, be sure to check the deletion rules for the language and keep the files for all required language codes.
Language | Language Code | Required Language Code(s)
---|---|---
Chinese (Simplified) | |
Chinese (Traditional) | |
Korean | |
Norwegian | |
Norwegian Bokmål | |
Persian, Afghan | |
Persian, Western | |
Russian | |
Tagalog | |
Additionally, if your distribution platform is of a particular endianness, you can remove the models of the opposite endianness; an example follows the list below. When applicable, the endianness of a file is given at the end of the file name; for example, the file root/dicts/ara/dictLemmas-LE.bin is a little-endian binary storing the Arabic lemma dictionary, whereas root/dicts/ara/dictLemmas-BE.bin is the same but stored in big-endian format.
- root/dicts: Any directory named after a language code (or "csc" for the Chinese Script Converter) and all its contents are used only for that language, and any file with "BE" or "LE" in its name is used only on big- or little-endian systems, respectively.
- root/models: Any directory named after a language code and all its contents are used only for that language.
- root/contractions: Any .yaml file whose name ends with a language code is used only for that language.
- root/upt-16: Any .yaml file whose name ends with a language code is used only for that language.
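For example, on a little-endian Linux system you could remove the big-endian binaries with a command along these lines (a sketch that assumes the -BE file naming shown above; verify what the pattern matches before deleting):
find rbl-je-<version>/dicts rbl-je-<version>/models -name '*-BE.bin' -delete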
The files required for Japanese and Chinese depend on the value of the tokenizerType option, as shown in the table below.
| Files required |
---|---|
Unshaded Dependencies
The JAR jakarta.annotation-api-<version>.jar is included in RBL but cannot be shaded. There may be conflicts if this JAR is used elsewhere in your processing. The JAR is used in disambiguation of Arabic, English, Greek, Hebrew, Japanese, Korean, and Spanish. If you are not using this functionality, the JAR can be removed.
A Quick Look at RBL: Running a Sample Program
After you install RBL and the license file, try running a sample application. RBL contains a samples directory rbl-je-<version>/samples
.
1. At a command prompt, navigate to rbl-je-<version>/samples/<sampleName>.
2. Use the Ant build script to compile and run the sample:
ant run
Initial and Path Options
If the option rootDirectory is specified, then the string ${rootDirectory} takes that value in the dictionaryDirectory, modelDirectory, and licensePath options.
Option | Description | Type (Default) | Supported Languages
---|---|---|---
dictionaryDirectory | The path of the lemma and compound dictionary, if it exists. | Path (${rootDirectory}/dicts) | All
language | The language to process by analyzers or tokenizers created by the factory. | Language code | All
licensePath | The path of the RBL license file. | Path (${rootDirectory}/licenses/rlp-license.xml) | All
| The XML license content; overrides licensePath. | String | All
modelDirectory | The directory containing the model files. | Path (${rootDirectory}/models) | All
rootDirectory | Sets the root directory. Also sets default values for other required options (dictionaryDirectory, modelDirectory, licensePath). | Path | All
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
- CSCAnalyzerOption (except for modelDirectory)
- TokenizerOption
ADM Sample Application
A sample application that illustrates the use of the ADM is in rbl-je-<version>/samples/annotator-tokenize.
1. In a Bash shell (Unix) or command prompt (Windows), navigate to rbl-je-<version>/samples/annotator-tokenize.
2. Use the Ant build script to compile and run the sample:
ant run
Your license (rlp-license.xml) must be in the licenses subdirectory of the RBL installation.
AnnotatorTokenize tokenizes the English string and provides one or more analyses with lemma and part-of-speech for each token.
The output appears in annotator-tokenize.txt.
length: 29
------
Some members spoke yesterday.
------
token 0:  Some
    index  lemma      part-of-speech
    0      some       QUANT
token 1:  members
    index  lemma      part-of-speech
    0      member     NOUN
token 2:  spoke
    index  lemma      part-of-speech
    0      speak      VPAST
    1      spoke      VI
    2      spoke      VPRES
    3      spoke      NOUN
token 3:  yesterday
    index  lemma      part-of-speech
    0      yesterday  ADV
    1      yesterday  NOUN
token 4:  .
    index  lemma      part-of-speech
    0      .          SENT
Classic API Sample Application
A sample application illustrating the use of the classic API is in rbl-je-<version>/samples/tokenize-analyze.
1. In a Bash shell (Unix) or command prompt (Windows), navigate to rbl-je-<version>/samples/tokenize-analyze.
2. Use the Ant build script to compile and run the sample:
ant run
Your license (rlp-license.xml) must be in the licenses subdirectory of the RBL installation.
TokenizeAnalyze tokenizes the sample German document and provides a disambiguated analysis of each token.
The output appears in two files: deu-tokenized.txt and deu-analyzed.txt. The first file contains a token on each line, with a blank line following the end of a sentence.
The second file contains the token, lemma, part of speech, and compound components (where relevant) on each line. For those languages for which disambiguation is not supported,[2] there may be multiple rows for each token (the token appearing in the first column), one for each analysis. Here is a fragment with a sentence from deu-analyzed.txt:
TOKEN           LEMMA       POS     COMPOUNDS
-----           -----       ---     ---------
3.11.06         3.11.06     CARD
-               -           PUNCT
Not             Not         NOUN
und             und         COORD
Elend           Elend       NOUN
in              in          PREP
ihren           ihr         POSDET
Heimatländern   Heimatland  NOUN    [Heimat, Land]
lassen          lassen      VVFIN
immer           immer       ADV
mehr            mehr        INDADJ
Afrikaner       Afrikaner   NOUN
die             der         ART
Reise           Reise       NOUN
nach            nach        PREP
Europa          Europa      NOUN
antreten        antreten    VVINF
.               .           SENT
To run the samples with sample text in a different language, set the test.language parameter to the language code. For example, to tokenize and analyze the Spanish sample, call:
ant -Dtest.language=spa run
RBL Command Line Utility
RBLCmd is a general-purpose command line utility for RBL. It provides a simple way to produce RBL output without writing code. It is also useful for ad hoc speed and thread testing.
A Bash shell script (RBLCmd) and a Windows script (RBLCmd.bat) for running this utility are in rbl-je-<version>/tools/bin. For more information, see RBLCmd's online help: RBLCmd -h.
The command:
echo 'Hola' | ./tools/bin/RBLCmd -outputJson --language spa --rootDirectory . | jq
produces the following output:
{
  "version": "1.1.0",
  "data": "Hola\n",
  "attributes": {
    "sentence": {
      "type": "list",
      "itemType": "sentence",
      "items": [
        { "startOffset": 0, "endOffset": 5 }
      ]
    },
    "scriptRegion": {
      "type": "list",
      "itemType": "scriptRegion",
      "items": [
        { "startOffset": 0, "endOffset": 5, "script": "Latn" }
      ]
    },
    "layoutRegion": {
      "type": "list",
      "itemType": "layoutRegion",
      "items": [
        { "startOffset": 0, "endOffset": 5, "layout": "STRUCTURED" }
      ]
    },
    "token": {
      "type": "list",
      "itemType": "token",
      "items": [
        {
          "startOffset": 0,
          "endOffset": 4,
          "text": "Hola",
          "analyses": [
            {
              "partOfSpeech": "INTERJ",
              "lemma": "hola",
              "raw": "hola[+INTERJ]",
              "tagSet": "BT_SPANISH"
            }
          ]
        }
      ]
    }
  },
  "documentMetadata": {}
}
Tokenizers
The tokenizer is a language-specific processor that evaluates documents and identifies the tokens. RBL supports tokenization and sentence boundary detection for all languages. For many languages, you can choose the tokenizer by setting tokenizerType.
TokenizerType | Description | Supported Languages
---|---|---
ICU | Uses the ICU tokenizer. | All, except for Chinese and Japanese
FST | Uses the FST tokenizer. | Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish
SPACELESS_LEXICAL | Uses a lexicon and rules to tokenize input without spaces. Uses the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA). | Chinese, Japanese
SPACELESS_STATISTICAL | Uses a statistical approach to tokenize input without spaces. | Chinese, Japanese, Korean, Thai
DEFAULT | Selects the default tokenizer for each language. | All
Note
When creating tokenizers and analyzers, the tokenizerType must be the same for both.
Tip
When using the SPACELESS_LEXICAL tokenizer, you must use the CLA/JLA dictionaries instead of the segmentation dictionary. The analysis dictionary is not intended to be used with the SPACELESS_LEXICAL tokenizer.
For most languages, the default tokenizer is referred to as the ICU tokenizer. It implements standard Unicode guidelines for determining boundaries between sentences and for breaking each sentence into individual tokens. Many languages have an alternate tokenizer, the FST tokenizer, enabled by setting the tokenizerType to FST. The FST tokenizer provides somewhat different sentence and token boundaries. For example, the FST tokenizer keeps hyphenated tokens together, while the ICU tokenizer breaks them into separate tokens. For applications that don't want tokens or lemmas that contain spaces, the ICU tokenizer provides the best accuracy. To determine which tokenizer is best for your use case, we recommend running each of them against a test dataset and reviewing the output.
For Chinese, Japanese, and Thai, the default tokenizer determines sentence boundaries and then uses statistical models to segment each sentence into individual tokens. If Latin-script or other non-Chinese, non-Japanese, or non-Thai fragments longer than a certain length (defined by minNonPrimaryScriptRegionLength) are embedded in the Chinese, Japanese, or Thai text, then the tokenizer applies default Unicode tokenization to those fragments. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region.
To use the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA) tokenization algorithm, set the tokenizerType to SPACELESS_LEXICAL. This disables post-tokenization analysis; an analyzer created with this option will leave its input tokens unchanged.
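For example, following the annotator pattern shown earlier, you might select the JLA tokenizer for Japanese as follows (a sketch; the factory setup is the same as in the earlier samples):
EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, "jpn");
// Tokenize with JLA instead of the default statistical model.
options.put(BaseLinguisticsOption.tokenizerType, "SPACELESS_LEXICAL");
Annotator annotator = factory.createSingleLanguageAnnotator(options);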
For all languages, the RBL tokenizer can apply Normalization Form KC (NFKC) as specified in Unicode Standard Annex #15 to normalize the tokens. This normalization includes normalizing a fullwidth numeral to a halfwidth numeral, a fullwidth Latin letter to a halfwidth Latin letter, and a halfwidth Katakana character to a fullwidth Katakana character. NFKC normalization is turned off by default. To turn it on, use the nfkcNormalize option with a tokenizerType of ICU. To apply NFKC for Chinese and Japanese, tokenizerType must be SPACELESS_STATISTICAL or DEFAULT.
Option | Description | Type (Default) | Supported Languages
---|---|---|---
caseSensitive | Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Ukrainian
| Specify language to use for script regions, other than the script of the overall language. | Language code | Chinese, Japanese, Thai
minNonPrimaryScriptRegionLength | Minimum length of sequential characters that are not in the primary script. If a non-primary script region is shorter than this length and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai
nfkcNormalize | Turns on Unicode NFKC normalization before tokenization. | Boolean (false) | All
| Indicates the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All
| Indicates whether to use a different word-breaker for each script. If false, uses a script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai
tokenizerType | Selects the tokenizer to use. | | All
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
Structured Text
A document may contain tables and lists in addition to regular sentences. Structured text is composed of fragments, such as list items, table cells, and short lines of text. The tokenizer emits sentence offsets for each fragment it encounters.
One way fragments are identified is by detecting fragment delimiters. A delimiter is restricted to one character; the default delimiters are U+0009 (tab), U+000B (vertical tab), and U+000C (form feed). To modify the set of recognized delimiters, pass a string containing all desired delimiter values to the fragmentBoundaryDelimiters option. The string must include any default values you want to keep. Non-whitespace delimiters within a token will be ignored.
The following rules determine where fragments are identified, in descending priority:
1. Each line in a list is a fragment, where a list is defined as 3 or more lines containing the same punctuation mark within the first 5 characters of the line.
2. A delimiter, or three or more consecutive whitespace characters, breaks a line into fragments.
3. A short line is a fragment if it is preceded by another short line, preceded by a fragment, or if it's the first line of text. The length of a short line is configurable with the maxTokensForShortLine option; the default is 6 or fewer tokens.
Fragments always include trailing whitespace.
Example:
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.language, "eng");
EnumMap<BaseLinguisticsOption, String> options = Maps.newEnumMap(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.fragmentBoundaryDelimiters, "|~");
options.put(BaseLinguisticsOption.maxTokensForShortLine, "5");
factory.createSingleLanguageAnnotator(options);
By default, fragment detection is enabled. Use the fragmentBoundaryDetection option to disable it.
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
Customizing the ICU Tokenizer
The ICU tokenizer is the default tokenizer for European languages. Its behavior is defined by a rule file. If the default behavior is not exactly what you want, you can supply a custom rule file that determines the behavior of the tokenizer. How to make these customizations is briefly outlined here. Be careful with any changes you make to the tokenizer behavior; BasisTech does not support customizations made by the user.
BaseLinguisticsFactory has a method addCustomTokenizerRules which can be used to specify a custom rule file. RBLCmd also has the -ctr option to specify a path on the command line. All of these methods accept a case-sensitivity value (for -ctr, cs and ci mean case-sensitive and case-insensitive), which is important because a rule file is selected only when BaseLinguisticsOption.caseSensitive matches the value given for that rule file. Custom rule files are not cumulative, i.e. only one set of rules may be used at a time for any one combination of case sensitivity and language.
Note
BasisTech reserves the right to change the version of ICU used in RBL. Thus any rule file provided by BasisTech for a particular version of RBL may or may not work with newer versions.
Tokenization Rule File Format
A tokenization rule file is an ICU break rule file encoded in UTF-8. A custom file replaces BasisTech's tokenization rules, so a custom rule file should include all the rules for basic tokenization as well as the new custom rules. The default rule files that RBL uses can be obtained by contacting BasisTech support, or you can copy the rule file from ICU.
RBL also provides the ability to pass in a subrule file if desired. This is for splitting tokens produced according to rules in the main file. The subrule file is a list of subrules, each of which is a number and a regex separated by a tab character. This number corresponds to the “rule status”[3] of the main rule whose tokens the subrule splits. Each capturing group in the subrule regex corresponds to a token that will be produced by the tokenizer.
The rule file and the subrule file can be placed anywhere. In particular, they need not be placed anywhere within your RBL installation directory.
There is one BasisTech-specific extension, !!btinclude <filename>. This command tells the preprocessor to replace the !!btinclude line with the contents of the specified file. Relative paths are relative to the location of the file containing the !!btinclude line. Recursive inclusion is allowed.
Example
The ICU tokenizer does not normally tokenize with an eye to emoticons, but perhaps that is important to your use case. You could make a copy of the default rule file and add the following.
...
$Smiley = [\:=][)}\]];
!!forward;
$Smiley;
...
For the input:
=)
the BasisTech default rules produce two tokens:
= )
With the custom rule, you instead get back a single token:
=)
Unknown Language Tokenization
RBL provides basic tokenization support when the language is "Unknown" (xxx). The tokenizer uses generic rules to tokenize, such as whitespace and punctuation delimitation.
Supported features when the language is unknown (xxx):
- Tokenization
- Sentence breaking
- Identification of some common acronyms and abbreviations
- Segmentation user dictionaries
Using the language code xxx provides basic tokenization support for languages not supported by RBL.
Analyzers
The analyzer is a language-specific processor that uses dictionaries and statistical analysis to add analysis objects to tokens.
To extend the coverage that RBL provides for each supported language you can create User Dictionaries. Segmentation user dictionaries are supported for all languages. Lemma user dictionaries are supported for Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
A stem is the substring of a word that remains after prefixes and suffixes are removed, while the lemma is the dictionary form of a word. RBL supports stems for Arabic, Finnish, Persian, and Urdu.
Semitic roots are generated for Arabic and Hebrew.
The option name to set the analysis cache depends on the accepting factory. The option analysisCacheSize is a BaseLinguisticsOption, while cacheSize is an option for both AnalyzerOption and CSCAnalyzerOption. They all perform the same function.
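For example, to enlarge the cache when creating objects from a BaseLinguisticsFactory (the size shown is arbitrary):
// Trade memory for throughput: allow up to 200,000 cached analyses.
factory.setOption(BaseLinguisticsOption.analysisCacheSize, "200000");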
Option | Description | Type (Default) | Supported Languages
---|---|---|---
analysisCacheSize / cacheSize | Maximum number of entries in the analysis cache. Larger values increase throughput, but use extra memory. If zero, caching is off. | Integer (100,000) | All
caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish
| Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All
| A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All
| Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g., disable disambiguation). | Boolean (false) | All
tokenizerType | Selects the tokenizer to use with this analyzer. | | All
Note
When creating tokenizers and analyzers, the tokenizerType must be the same for both.
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
- CSCAnalyzerOption (cacheSize only)
Lemma Lookup
For each token and normalized form in the token stream, the analyzer performs a dictionary lookup starting with any user dictionaries followed by the RBL dictionary. During lookup, RBL ignores the context in which the token or normalized form appears.
Once the analyzer has found one or more lemmas in a dictionary, it does not consult additional dictionaries. In other words, if two user dictionaries are specified and the analyzer finds a lemma in the first dictionary, it does not consult the second user dictionary or the RBL dictionary.
Unless overridden by an analysis dictionary, the only lemmatization done in Chinese and Thai is number normalization. Other Chinese and Thai tokens' lemmas are equal to their surface forms.
There is no analysis dictionary available for Finnish, Pashto, or Urdu. All other languages are supported.
Guessing
No dictionary can ever be complete: new words enter languages, and languages change and borrow. So, in general, analysis for each language includes some sort of guessing capability. The job of a guesser is to take a word and come up with some analysis of it. Any kind of fact RBL generates for a language is a possible output of its guesser.
In European languages, guessers deliver lemmas and parts of speech. In Korean, guessers provide morphemes, morpheme tags, compound components, and parts of speech.
Whitespace in Lemmas
By default, the analyzer returns any lemma that contains whitespace as multiple lemmas (each with no whitespace). To allow lemmas with whitespace (such as International Business Machines as a lemma for the token IBM) to be placed as such in the token stream, you can create a user analysis dictionary with an entry that defines the lemma. For example:
IBM International[^_]Business[^_]Machines[+PROP]
Compounds
The analyzer decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components.
The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting 's' is not present in the component list. For this input token, the RBL tokenizer and analyzer return the following entries:
- Original form: Eingangstüren
- Lemma for the compound: Eingangstür
- Component lemmas: Eingang, Tür
Other German examples include letter removal (Rennrad ⇒ rennen + Rad), vowel changes (Mängelliste ⇒ Mangel + Liste), and capitalization changes (Blaugrünalge ⇒ blau + grün + Alge).
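With the ADM API, the component lemmas are carried on each morphological analysis. A minimal sketch, assuming the analysis exposes its compound components as a list of Token objects (check the Javadoc of your RBL version for the exact accessors):
for (Token token : annotator.annotate("Eingangstüren").getTokens()) {
    for (MorphoAnalysis ma : token.getAnalyses()) {
        System.out.println(token.getText() + " -> lemma: " + ma.getLemma());
        if (ma.getComponents() != null) {
            for (Token component : ma.getComponents()) {
                // e.g. Eingang and Tür for Eingangstüren
                System.out.println("  component: " + component.getText());
            }
        }
    }
}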
Option | Description | Type (Default) | Supported Languages
---|---|---|---
| Indicates whether to decompose compounds. For Chinese and Japanese, see Chinese and Japanese Lexical Tokenization. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish
| Indicates whether to return the surface forms of compound components when ADM results are returned. | Boolean (false) | Dutch, German, Hungarian
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
Disambiguation
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object. The disambiguate option enables the disambiguator. When true, the disambiguator determines the best analysis for each word given the context in which it appears.
When using an annotator, the disambiguated result is at the head of the list of all possible analyses; the remainder of the list is ordered randomly. When using a tokenizer/analyzer, use the method getSelectedAnalysis to return the disambiguated result.
For all languages except Japanese, disambiguation is enabled by default. For performance reasons, disambiguation is disabled by default for Japanese when using the statistical model.
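With an annotator, picking the disambiguated analysis is therefore just a matter of taking the first element of the list, as in this sketch (reusing the token loop from the earlier ADM sample):
for (Token token : results.getTokens()) {
    List<MorphoAnalysis> analyses = token.getAnalyses();
    if (analyses != null && !analyses.isEmpty()) {
        // The disambiguated analysis is at the head of the list.
        MorphoAnalysis best = analyses.get(0);
        System.out.println(token.getText() + "\t" + best.getLemma());
    }
}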
Option | Description | Type (Default) | Supported Languages
---|---|---|---
disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish
| Enables faster part-of-speech disambiguation for English. | Boolean (false) | English
| Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek
| Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
Part-of-Speech (POS) Tags
In RBL, each language has its own set of POS tags, and a few languages have multiple tag sets. Each tag set is identified by an identifier, which is a value of the TagSet enum. When RBL outputs a POS tag, it also lists the identifier for the tag set it came from. Output from a single language may contain POS tags from multiple tag sets, including the language-neutral set.
POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu.
Returning Universal Part-of-Speech (POS) Tags
The universalPosTags option converts BasisTech POS tags to universal POS tags, as defined by the Universal Dependencies project. The POS tag mappings are defined by POS tag map files. By default, the annotator uses the map in rootDirectory/upt-16/upt-16-<language>.yaml, where <language> is a language code. customPosTagsUri allows you to specify custom POS tag mappings.
If you want to return universal part-of-speech tags in place of the language-specific tags that RBL ordinarily returns, set universalPosTags to true.
For an ADM sample that follows the same pattern as the preceding sample and returns universal POS tags for each token, see rbl-je-<version>/samples/universal-pos-tags.
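For example (the file URI for the custom map is hypothetical):
options.put(BaseLinguisticsOption.universalPosTags, "true");
// Optional: point at your own mapping instead of the bundled upt-16 files.
options.put(BaseLinguisticsOption.customPosTagsUri, "file:///path/to/my-pos-map.yaml");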
Option | Description | Type (Default) | Supported Languages
---|---|---|---
universalPosTags | Indicates if POS tags should be converted to their universal versions. | Boolean (false) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, Urdu
customPosTagsUri | URI of a POS tag map. | URI | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, Urdu
Enum Classes:
- BaseLinguisticsOption
POS Tag Map File Format
A POS tag map file is a YAML file encoded in UTF-8. It is a sequence of mapping rules.
A mapping rule is a sequence of two elements: the POS tag to be mapped and a sequence of submappings. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules.
A submapping is a mapping with the keys m, s, and t. m is a Java regular expression. s is a surface form. m and s are optional: they can be omitted or null. t specifies the output POS tag to use when the following criteria are met:
- The input token's POS tag equals the POS tag to be mapped.
- m (if any) matches a substring of the input token's raw analysis.
- s (if any) equals the input token's surface form, compared case-insensitively.
Example
- - NUM_VOC
  - - { m: \+Total, t: PRON }
    - { s: moc, t: DET }
    - { s: oba, t: DET }
    - { t: NUM }
This rule maps tokens with BasisTech's NUM_VOC POS tag. If the input token's raw analysis matches the regular expression \+Total, the token becomes a PRON. Otherwise, if the token's surface form is moc or oba, the token becomes a DET. Otherwise, the token becomes a NUM.
Splitting Contractions
You can split contractions and return analyses with tokens, lemmas, and POS tags for each constituent. For example, given the English contraction can't, RBL returns analyses for can and for not. To split contractions, set tokenizeContractions to true.
Contractions are defined by contraction rule files. By default, the tokenizer uses the rules in rootDirectory/contractions/contraction-rules-<language>.yaml, where <language> is the language code. RBL comes with contraction rules for English, German, and Portuguese. To add rules for these languages or add rules to support another language, edit the default files or create a custom rule file. The URI for the custom file is defined by customTokenizeContractionRulesUri.
Enum Classes:
- BaseLinguisticsOption
For a sample, see rbl-je-<version>/samples/contractions.
Contraction Splitting Rule File Format
A contraction rule file is a YAML file encoded in UTF-8. It must be a sequence of contraction rules.
A contraction rule is a sequence of two elements: a contraction key and a contraction replacement. Any token which matches the key is replaced with the replacement. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules. A token which matches no rule is not rewritten.
A contraction key is a sequence of a surface form and a POS tag. A token matches a key if and only if its surface form and POS tag match the key's surface form and POS tag.
A surface form is a string. Surface forms are compared case-insensitively.
A POS tag is a string. POS tags are compared case-sensitively.
A contraction replacement is a sequence of replacement tokens.
A replacement token is a sequence of a replacement surface form, POS tag, lemma, and raw analysis. All four are strings. The raw analysis can also be null.
Example
- - [ "ain't", "VBPRES" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
- - [ "amn't", "ADJ" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
- - [ "amn't", "NOUN" ]
  - - [ "am", "VBPRES", "be", null ]
    - [ "not", "NOT", "not", null ]
The first entry is for ain't with POS tag VBPRES. This splits into am and not. The next is for amn't as an ADJ, and the third is for amn't as a NOUN.
The replacement surface form uses the same capitalization format as the original surface form. Using the first entry of the above example, ain't becomes am not, Ain't becomes Am not, and AIN'T becomes AM NOT.
Chinese and Japanese Lexical Tokenization
For Chinese and Japanese, in addition to the statistical model described above, RBL includes the Chinese Language Analyzer (CLA) and Japanese Language Analyzer (JLA) modules, [4] which are optimized for search. They are activated by setting tokenizerType to SPACELESS_LEXICAL.
Option | Description | Default value | Supported languages
---|---|---|---
| Indicates whether to consider punctuation between alphanumeric characters as a break. | | Chinese
| Indicates whether to provide consistent segmentation of embedded text not in the primary script. | | Chinese, Japanese
| Indicates whether to decompose compounds. | | Chinese, Japanese
| Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as being decomposable. If deep decompounding is enabled, the decomposable tokens will be further decomposed into additional tokens. | | Chinese, Japanese
| Indicates whether to favor words in the user dictionary during segmentation. | | Chinese, Japanese
| Indicates whether to ignore whitespace separators when segmenting input text. | | Japanese
ignoreStopwords | Indicates whether to filter stop words out of the output. | | Chinese, Japanese
| Indicates whether to join sequences of Katakana tokens adjacent to a middle dot token. | true | Japanese
| Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. | 10 | Chinese, Japanese
| Indicates whether to add parts of speech and secondary parts of speech to morphological analyses. | | Chinese, Japanese
| Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. | | Japanese
| Indicates whether to return numbers and counters as separate tokens. | | Japanese
| Indicates whether to segment place names from their suffixes. | | Japanese
| Indicates whether to treat whitespace as a number separator. | | Chinese
| Indicates whether to treat whitespace as a morpheme delimiter. | | Chinese, Japanese
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
Chinese and Japanese Readings
Option | Description | Default value | Supported languages
---|---|---|---
| Indicates whether to return all the readings for a token. | | Chinese
| Indicates whether to skip directly to the fallback behavior of concatenating the readings of individual characters. | | Chinese, Japanese
| Indicates whether to add readings to morphological analyses. The annotator will try to add readings by whole words. If it cannot, it will concatenate the readings of individual characters. | | Chinese, Japanese
| Indicates whether to add a separator character between readings when concatenating readings by character. | | Chinese, Japanese
| Sets the representation of Chinese readings (possible values are case-insensitive). | | Chinese
| Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. | | Chinese
Enum Classes:
- BaseLinguisticsOption
- TokenizerOption
Editing the stop words list
The ignoreStopwords option uses a stop words list to define stop words. The path to the stop words list is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.
You can add stop words to these files. When you edit one of these files, you must follow these rules:
- The file must be encoded in UTF-8.
- The file may include blank lines.
- Comment lines begin with #.
- Each non-blank, non-comment line represents exactly one lexeme (stop word).
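For example, a few added lines in a Japanese stop word file might look like this (the entries themselves are illustrative):
# custom stop words for our search application

こと
もの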
Japanese Lemma Normalization
In Japanese, foreign and borrowed words may vary in their phonetic transcription to Katakana, and some words may be expressed with an older or a modern Kanji form. The Japanese lemma dictionary maps Katakana variants to a standard form and old Kanji forms to their modern forms. Examples:
Katakana Spelling Variants | Normalized Form |
---|---|
ヴァイオリン | バイオリン |
エクスポ | エキスポ |
Older Kanji Form | Normalized Form |
---|---|
渡邊 | 渡辺 |
松濤 | 松涛 |
大學 | 大学 |
You can include orthographic normalization in lemma user dictionaries for Japanese. This information can be accessed at runtime from the Analysis or MorphoAnalysis object.
Hebrew Analyses
The following analyzer options are available for Hebrew.
Enum Classes:
- AnalyzerOption
- BaseLinguisticsOption
Hebrew Disambiguator Types
RBL includes multiple disambiguators for Hebrew. Set the value of the disambiguatorType option to select which type to use. The valid values for DisambiguatorType are:
- PERCEPTRON: a perceptron model
- DICTIONARY: a dictionary-based reranker
- DNN: a deep neural network. TensorFlow, which is not supported on all systems, must be installed. If DNN is selected and TensorFlow is not supported, RBL will throw a RosetteRuntimeException.
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
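For example, with the classic API you might select the perceptron disambiguator when configuring the factory. This is a minimal sketch; it assumes disambiguatorType is settable through BaseLinguisticsOption like the other factory options shown in this guide:

BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
factory.setOption(BaseLinguisticsOption.language, LanguageCode.HEBREW.ISO639_3());
// Select the perceptron-model disambiguator; DICTIONARY and DNN are the other valid values.
factory.setOption(BaseLinguisticsOption.disambiguatorType, "PERCEPTRON");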
Arabic, Persian, and Urdu Token Analysis
For Arabic, Persian (Western Persian and Afghan Persian), and Urdu, RBL may return multiple analyses for each token. Each analysis contains the normalized form of the token, a part-of-speech tag, and a stem. For Arabic, the analysis also includes a lemma and a Semitic root. For Persian, some analyses include a lemma.
This appendix provides information on token normalization and the generation of variant tokens. For Arabic, it also provides information on stems and Semitic roots.
Token normalization is performed in two stages:
Generic Arabic script normalization
Language-specific normalization
Generic Arabic Script Token Normalization
Generic Arabic script normalization includes the following:
The following diacritics are removed: dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
The following characters are removed: kashida, left-to-right marker, right-to-left marker, zero-width joiner, BOM, non-breaking space, soft hyphen, space.
Alef maksura is converted to yeh unless it is at the end of the word or followed by hamza.
All numbers are converted to Arabic numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. [5]
Thousand separators are removed, and the decimal separator is changed to a period (U+002E). The normalizer handles cases where ر (reh) is (incorrectly) used as the decimal separator.
Alef with hamza above: ٵ (U+0675), ٲ (U+0672), or ا (U+0627) combined with hamza above (U+0654) is converted to أ (U+0623).
Alef with madda above: ا (U+0627) combined with madda above (U+0653) is converted to آ (U+0622).
Alef with hamza below: ٳ (U+0673) or ا (U+0627) combined with hamza below (U+0655) is converted to إ (U+0625).
Misra sign to ain: ؏ (U+060F) is converted to ع (U+0639).
Swash kaf to kaf: ڪ (U+06AA) is converted to ک (U+06A9).
Heh: ە (U+06D5) is converted to ه (U+0647).
Yeh with hamza above: The following combinations are converted to ئ (U+0626).
ی (U+06CC) combined with hamza above (U+0654)
ى (U+0649) combined with hamza above (U+0654)
ي (U+064A) combined with hamza above (U+0654)
Waw with hamza above: و (U+0648) combined with hamza above (U+0654), ٷ (U+0677), or ٶ (U+0676) is converted to ؤ (U+0624).
Arabic Token Analysis
Token Normalization
For Arabic input, the following language-specific normalizations are performed on the output of the Arabic script normalization:
Zero-width non-joiner (U+200C) and superscript alef ٰ (U+0670) are removed.
Fathatan (U+064B) is removed.
Persian yeh (U+06CC) is normalized to yeh (U+064A) if it is initial or medial; if final, it is normalized to alef maksura (U+0649).
Persian kaf ک (U+06A9) is converted to ك (U+0643).
Heh ہ (U+06C1) or ھ (U+06BE) is converted to ه (U+0647).
Following morphological analysis, the normalizer does the following:
Alef wasla ٱ (U+0671) is replaced with plain alef ا (U+0627).
If a word starts with the incorrect form of an alef, the normalizer retrieves the correct form: plain alef ا (U+0627), alef with hamza above أ (U+0623), alef with hamza below إ (U+0625), or alef with madda above آ (U+0622).
Token Variants
The analyzer can generate a number of variant forms for each Arabic token to account for the orthographic irregularity seen in contemporary written Arabic. Each token variant is generated in normalized form.
If a token contains a word-final hamza preceded by yeh or alef maksura, then a variant is created that replaces these with hamza seated on yeh.
If a token contains waw followed by hamza on the line, a variant is created that replaces these with hamza seated on waw.
Variants are created where word-final heh is replaced by teh marbuta, and word-final alef maksura is replaced by yeh.
Stems and Semitic Roots
The stem returned is the normalized token with affixes (such as prepositions, conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.
In the process of stripping morphemes (affixes) from a token, the analyzer produces a stem, a lemma, and a Semitic root. Stems and lemmas result from stripping most of the inflectional morphemes, while Semitic roots result from stripping derivational morphemes.
Inflectional morphemes indicate plurality or verb tense. Different forms, such as singular and plural noun, or past and present verb tense share the same stem if the forms are regular. If some of the forms are irregular, they do not share the same stem, but do share the same lemma. Since stems and lemmas preserve the meaning of words, they are very useful in text retrieval and search in general.
Words that have a more distant linguistic relationship share the same Semitic root.
Examples. The singular form الكتابة (al-kitaaba, the writing) and plural form كتابات (kitaabaat, writings) share the same stem: كتاب (kitaab). On the other hand, كُتُب (kutub, books) is an irregular form and does not have the same stem as كِتَاب (kitaab, book). But both forms do share the same lemma, which is the singular form كِتَاب (kitaab). The words مكتبة (maktaba, library), المَكْتَب (al-maktab, the desk), كُتُب (kutub, books), and الكتابة (al-kitaaba, the writing) are related in the sense that a library contains books and desks, a desk is used to write on, and writings are often found in books. All of these words share the same Semitic root: كتب (ktb).
Persian Token Analysis
Persian Token Normalization
The following Persian-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B) and superscript alef (U+0670) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Arabic kaf ك (U+0643) is converted to Persian kaf ک (U+06A9).
Heh goal (U+06C1) or heh doachashmee (U+06BE) is converted to heh (U+0647).
Heh with hamza ۂ (U+06C2) is converted to ۀ (U+06C0).
Arabic yeh ي (U+064A) or ى (U+0649) is converted to Persian yeh ی (U+06CC).
Following morphological analysis:
Zero-width non-joiner (U+200C) and superscript alef (U+0670) are removed.
Token Variants
The analyzer can generate a variant form for some tokens to account for the orthographic irregularity seen in contemporary written Persian. Each variation is generated with the normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains a zero-width non-joiner (U+200C), a variant is generated without the zero-width non-joiner.
If a word ends in teh marbuta (U+0629), two variants are generated. The first replaces the teh marbuta with teh (U+062A); the second replaces the teh marbuta with heh (U+0647).
Stems and Lemmas
The Persian analyzer produces both stems and lemmas. A stem is the substring of a word that remains after all prefixes and suffixes are removed. A lemma is the dictionary form of a word. The lemma may differ from the stem if a word is irregular, or if a word contains regular transformations. The distinction between stems and lemmas is especially important for Persian verbs. The typical verb inflection table for Persian includes a past stem and a present stem that cannot be derived from each other.
Examples. The present subjunctive tense verb بگویم (beguyam, that I say) has the stem گوی (guy). The past tense verb گفتم (goftam, I said) has the stem گفت (goft). These two have different stems, because the word-internal strings are different. They have the same lemma گفت (goft) because they are inflections of the same word.
Urdu Token Analysis
Token Normalization
The following Urdu-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B), zero-width non-joiner (U+200C), and jazm (U+06E1) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Kaf ك (U+0643) is converted to ک (U+06A9).
Heh with hamza ۀ (U+06C0) is converted to ۂ (U+06C2).
Yeh ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
Token Variants
The analyzer can generate a number of variant forms for each Urdu token to account for the orthographic irregularity seen in contemporary written Urdu. Each variation is generated with the normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains heh doachashmee (U+06BE), a variant is generated replacing the heh doachashmee with heh goal (U+06C1).
If a word ends with teh marbuta (U+0629), a variant is generated replacing the teh marbuta with heh goal (U+06C1).
User Dictionaries
User dictionaries are supplementary dictionaries that change the default linguistic analyses. These dictionaries can be static or dynamic.
Static dictionaries are compiled ahead of time and passed to a factory.
Dynamic dictionaries are created and configured at runtime. Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the factory becomes unused, the contents are lost.
In all dictionaries, the entries should be in Unicode Normalization Form KC (NFKC). Japanese Katakana characters, for example, should be full width, and Latin characters, numerals, and punctuation should be half width. Analysis dictionaries can contain characters of any script, while for the most consistent performance in Chinese, Japanese, and Thai, token dictionaries should only contain characters in the Hanzi (Kanji), Hiragana, Katakana, and Thai scripts.
In Chinese, Japanese, and Thai, text in foreign scripts (such as Latin script) in the input that equals or exceeds the length specified by minNonPrimaryScriptRegionLength
(the default is 10) is passed to the standard Tokenizer and not seen by a user segmentation dictionary.
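For example, to require a longer run of foreign-script text before it bypasses your segmentation dictionary, you could raise that option when configuring the factory. A minimal sketch, assuming minNonPrimaryScriptRegionLength is set like the other factory options:

BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
// Non-primary-script runs shorter than 20 characters now stay with the
// regular tokenization (and your user dictionaries) instead of going
// to the standard tokenizer.
factory.setOption(BaseLinguisticsOption.minNonPrimaryScriptRegionLength, "20");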
Types of User Dictionaries
Analysis Dictionary: An analysis dictionary allows users to modify the analysis or add new variations of a word. The analysis associated with a word includes the default lemma as well as part-of-speech tag and additional characteristics for some languages. Use of these dictionaries is not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Segmentation Dictionary: A segmentation dictionary allows users to specify strings that are to be segmented as tokens.
Chinese and Japanese segmentation user dictionary entries may not contain the ideographic full stop.
Many-to-one Normalization Dictionaries: Users can implement a many-to-one normalization dictionary to map multiple spelling variants to a single normalized form.
CLA/JLA Dictionaries: The Chinese Language Analyzer and Japanese Language Analyzer both include the capability to create and use one or more segmentation (tokenization) user dictionaries for vocabulary specific to an industry or application. A common usage for both languages is to add new nouns like organizational and product names. These and existing nouns can have a compounding scheme specified if, for example, you wish to prevent an otherwise compound product name from being segmented as such. When the language is Japanese, you can also create user reading dictionaries with transcriptions rendered in Hiragana. The readings can override the readings returned from the JLA reading dictionary and override readings that are otherwise guessed from segmentation (tokenization) user dictionaries.
CSC Dictionary: Users can specify conversion for use with the Chinese Script Converter (CSC).
Tip
When using the SPACELESS_LEXICAL
tokenizer, you must use the CLA/JLA dictionaries instead of the segmentation dictionary. The analysis dictionary is not intended to be used with the SPACELESS_LEXICAL
tokenizer.
Prioritization of User Dictionaries
All static and dynamic user dictionaries, except for many-to-one normalization dictionaries, are consulted in reverse order of addition. In cases of conflicting information, dictionaries added later take priority over those added earlier. Once a token is found in a user dictionary, RBL stops and will consult neither the remaining user dictionaries nor the RBL dictionary.
Many-to-one normalization dictionaries are consulted in the following order:
1. All dynamic user dictionaries, in reverse order of addition.
2. Static dictionaries, in the order that they appear in the option list for normalizationDictionaryPaths.
Example of non-many-to-one user dictionary priority:

1. User adds dynamic dictionary named dynDict1
2. User adds static dictionary named statDict2
3. User adds static dictionary named statDict3
4. User adds dynamic dictionary named dynDict4

Dictionaries are prioritized in the following order: dynDict4, statDict3, statDict2, dynDict1
Example of many-to-one normalization user dictionary priority:

1. User adds dynamic dictionary named dynDict1
2. User sets normalizationDictionaryPaths = "statDict2;statDict3"
3. User adds dynamic dictionary named dynDict4

Dictionaries are prioritized in the following order: dynDict4, dynDict1, statDict2, statDict3
The Chinese and Japanese language analyzers load all dictionaries with the user dictionaries loaded at the end of the list. To prioritize the user dictionaries and put them at the front of the list, guaranteeing that matches in the user dictionaries will be used, set the option favorUserDictionary to true.
Preparing the Source
The following formatting rules apply to user dictionary source files.
The source file is UTF-8 encoded.
The file may begin with a byte order mark (BOM).
Each entry is a single line.
Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
Dynamic User Dictionaries
A dynamic user dictionary allows users to add user dictionary values at runtime. Instead of creating and compiling the dictionary in advance, the values are added dynamically. Dynamic dictionaries are available for all types of user dictionaries, except the CSC dictionary.
The process for using dynamic dictionaries is the same for each dictionary type:
Create an empty dynamic dictionary for the dictionary type.
Use the appropriate add method to add entries to the dictionary.
Dynamic dictionaries use the same structure as the compiled user dictionaries, but instead of having a single tab-delimited string, they are composed of separate strings. As an example, let's look at a many-to-one normalization dictionary entry:
Static dictionary entry (values are separated by tabs):
norm1 var1 var2 var3
Dynamic dictionary entry:
dictionary.add("norm1", "var1", "var2", "var3");
Caution
Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the annotator, tokenizer, or analyzer becomes unused, the contents are lost.
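A minimal sketch of that flow for a many-to-one normalization dictionary. The creation method name createDynamicNormalizationDictionary and the dictionary type name are assumptions for illustration only; check the Javadoc for the actual factory methods that apply to each dictionary type:

// Hypothetical factory call: create an empty dynamic normalization dictionary.
DynamicNormalizationDictionary dictionary =
        factory.createDynamicNormalizationDictionary(LanguageCode.ENGLISH);
// Add an entry mapping three variants to one normalized form,
// matching the static entry shown above.
dictionary.add("norm1", "var1", "var2", "var3");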
Case
User dictionary lookups are case-sensitive. RBL provides an option, caseSensitive, to control whether the analysis phase is case-sensitive.

- If caseSensitive is true (the default), the token itself is used to query the dictionary.
- If caseSensitive is false, the token is lowercased before consulting the dictionary. If the analysis is intended to be case-insensitive, the words in the user dictionary must all be in lowercase.
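In the classic and ADM APIs, case sensitivity is configured like any other option. A minimal sketch, assuming caseSensitive is exposed through BaseLinguisticsOption as the enum listings in this guide indicate:

BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
// Lowercase each token before dictionary lookup; the user dictionary
// entries must then be all lowercase as well.
factory.setOption(BaseLinguisticsOption.caseSensitive, "false");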
If you are using the BaseLinguisticsTokenFilterFactory
, then the value for AnalyzerOption.caseSensitive
both turns on the corresponding analysis and associates the dictionary with that analysis.
For Danish, Norwegian, and Swedish, the provided dictionaries are lowercase and caseSensitive is automatically set to false.
Valid Characters for Chinese and Japanese User Dictionary Entries
An entry in a Chinese or Japanese user dictionary must contain characters corresponding to the following Unicode code points, to valid surrogate pairs, or to letters or decimal digits in Latin script. In this listing, ..
indicates an inclusive range of valid code points:
0022..007E, 00A2, 00A3, 00A5, 00A6, 00A9, 00AC, 00AF, 00B7, 0370..04FF, 2010..206F, 2160..217B, 2190..2193, 2200..22FF, 2460..24FF, 2502, 25A0..26FF, 2985, 2986, 3001..3007, 300C, 300D, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FE, 3200..33FF, 4E00..9FFF, D800..FA2D, FF00, FF02..FFEF
Compiling a User Dictionary
In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-user-dictionary) and a .bat file for Windows (rbl-build-user-dictionary.bat).
To compile a user dictionary, from the RBL root directory:
tools/bin/rbl-build-user-dictionary -type TYPE_ARGUMENT LANG INPUT_FILE OUTPUT_FILE
where TYPE_ARGUMENT is the dictionary type, LANG is the language code, INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-user-dictionary -type analysis jpn jpn_lemmadict.txt jpn_lemmadict.bin
Dictionary Type | TYPE_ARGUMENT |
---|---|
Analysis | analysis |
Segmentation | |
Many-to-one | |
CLA or JLA segmentation | |
JLA reading | |
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set heap size with the JAVA_OPTS
environment variable. For example, to provide 8 GB of heap size, set JAVA_OPTS
to -Xmx8g
.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Segmentation Dictionaries
The format for a segmentation dictionary source file is very simple. Each word is written on its own line, and that word is guaranteed to be segmented as a single token when seen in the input text, regardless of context. Japanese example:
三菱UFJ銀行
酸素ボンベ
Class | Method | Task |
---|---|---|
| | Adds a user segmentation dictionary for a given language. |
| | Create and load new dynamic segmentation dictionary |
Analysis Dictionaries
Note
Analysis dictionaries are not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Each entry is a word, followed by a tab and an analysis. The analysis must end with a lemma and a part-of-speech (POS) tag.
word lemma[+POS]
For those languages for which RBL does not return POS tags, use DUMMY
.
Variations. You can provide more than one analysis for a word or more than one version of a word for an analysis.
The following example includes two analyses for "telephone" (noun and verb), and two renditions of "dog" for the same analysis (noun).
telephone telephone[+NOUN]
telephone telephone[+VI]
dog dog[+NOUN]
Dog dog[+NOUN]
For some languages, the analysis may include special tags and additional information.
Contracted forms. For English, French, Italian, and Portuguese, ^=
is a separator for a contraction or elision.
English example:
doesn't does[^=]not[+VDPRES]
Multi-Word Analysis. For English, Italian, Spanish, and Dutch, ^_
indicates a space.
English example:
IBM International[^_]Business[^_]Machines[+PROP]
Compound Boundary. For Danish, Dutch, Norwegian, German, and Swedish, ^#
indicates the boundary between elements in a compound word. For Hungarian, the compound boundary tag is ^CB+
.
German example:
heimatländern Heimat[^#]Land[+NOUN]
Compound Linking Element. For German ^/
, indicates a compound linking element. For Dutch, use ^//
.
German example:
arbeitskreis Arbeit[^/]s[^#]Kreis[+NOUN]
Derivation Boundary or Separator for Clitics. For Italian, Portuguese, and Spanish, ^|
indicates a derivation boundary or separator for clitics.
Spanish example with derivation boundary:
duramente duro[^|][+ADV]
Italian example with separator for clitics:
farti fare[^|]tu[+VINF_CLIT]
Japanese Readings and Normalized Forms. For Japanese, [^r]
precedes a reading (there may be more than one), and [^n]
precedes a normalization. For example:
行わ 行う[^r]オコナワ[+V]
tv テレビ[^r]テレビ[^n]テレビ[+NC]
アキュムレータ アキュムレーター[^r]アキュムレータ[^n]アキュムレーター[+NC]
Korean Analysis. A Korean analysis uses a different pattern than the analysis for other languages. The pattern for an analysis in a user Korean dictionary is as follows:
Token Mor1[/Tag1][^+]Mor2[/Tag2][^+]Mor3[/Tag3]
Where each MorN
is a morpheme, consisting of one or more Korean characters, and TagN
is the POS tag for that morpheme. [^+]
indicates the boundary between morphemes.
Here's an example:
유전자이다 유전자[/NPR][^+]이[/CO][^+]다[/ECS]
If the analysis contains one noun morpheme, that morpheme is the lemma and the POS tag is the POS tag for that morpheme. If more than one of the morphemes are nouns, the lemma is the concatenation of those nouns (a compound). Example:
정보검색 정보[/NNC][^+]검색[/NNC]
Otherwise, the lemma is the first morpheme, and the POS tag is the POS tag associated with that morpheme.
You can override this algorithm for identifying the lemma and/or POS tag in a user dictionary entry by placing [^L]lemma
and/or [^P][/Tag] at the end of the analysis. The lemma may or may not correspond to one of the morphemes in the analysis. For example:
유전자이다 유전자[/NNC][^+]이[/CO][^+]다[/ECS][^L]유전[^P][/NPR]
The KoreanAnalysis
interface provides access to the morphemes and tags associated with a given token in either the standard Korean dictionary or a user Korean dictionary.
Class | Method | Task |
---|---|---|
| | Add a user analysis dictionary |
| | Add a dynamic analysis dictionary |
Many-to-one Normalization Dictionaries
A many-to-one normalization dictionary maps one or more variants to a normalized form. The first value on each line is the normalized form. The remainder of the entries on the line are the variants to be mapped to the normalized form. All values on the line are separated by tabs.
Example:
norm1 var1 var2
norm1 var3 var4 var5
Class | Method | Task |
---|---|---|
| | Create and load new dynamic normalization dictionary |
Use the option normalizationDictionaryPaths
to specify the static user normalization dictionaries.
CLA and JLA Dictionaries
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL
followed by Tab and an arbitrary string to set the dictionary's name, which is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where word is the entry or surface form, POS is one of the user-dictionary part-of-speech tags listed below, DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition), and Reading1,... is a comma-delimited list of one or more transcriptions rendered in Hiragana or Katakana (only applicable to Japanese).
The decomposition pattern and readings are optional, but you must include a decomposition pattern if you include readings. In other words, you must include all elements to include the entry in a reading user dictionary, even though the reading user dictionary does not use the POS tag or decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you only need the POS tag and an optional decomposition pattern. Keep in mind that entries that include all elements can be included in both a segmentation (tokenization) user dictionary and a reading user dictionary.
For more information on Chinese and Japanese POS tags, refer to the following sections:
The tables below list the primary and secondary tags emitted when the user POS tag is specified.
User POS tag | Primary POS tag | Secondary POS tag |
---|---|---|
ABBREVIATION | NA | |
ADJECTIVE | A | |
ADVERB | D | |
AFFIX | X | |
CONJUNCTION | J | |
CONSTRUCTION | OC | |
DERIVATIONAL_AFFIX | W | |
DIRECTION_WORD | WL | |
FOREIGN_PERSON | NP | FOREIGN_PERSON |
IDIOM | E | |
INTERJECTION | I | |
MEASURE_WORD | NM | |
NON_DERIVATIONAL_AFFIX | F | |
NOUN | NC | |
NUMERAL | NN | |
ONOMATOPE | M | |
ORGANIZATION | NP | ORGANIZATION |
PARTICLE | PL | |
PERSON | NP | PERSON |
PLACE | NP | PLACE |
PREFIX | XP | |
PREPOSITION | PR | |
PRONOUN | NR | |
PROPER_NOUN | NP | |
PUNCTUATION | PUNCT | |
SUFFIX | XS | |
TEMPORAL_NOUN | NT | |
VERB | V | |
VERB_ELEMENT | WV |
User POS tag | Primary POS tag | Secondary POS Tag |
---|---|---|
AJ (adjective) | AJ | |
AN (adjectival noun) | AN | |
D (adverb) | D | |
FOREIGN_GIVEN_NAME | NP | FOREIGN_GIVEN_NAME |
FOREIGN_PLACE_NAME | NP | FOREIGN_PLACE |
FOREIGN_SURNAME | NP | FOREIGN_SURNAME |
GIVEN_NAME | NP | GIVEN_NAME |
HS (honorific suffix) | HS | |
NOUN | NC | |
ORGANIZATION | NP | ORGANIZATION |
PERSON | NP | PERSON |
PLACE | NP | PLACE |
PROPER_NOUN | NP | |
SURNAME | NP | SURNAME |
UNKNOWN | UNKNOWN | |
V1 (vowel-stem verb) | V1 | |
VN (verbal noun) | VN | |
VS (suru-verb) | VS | |
VX (irregular verb) | VX |
Examples (the last three entries include readings):
!DICT_LABEL New Words 2014
デジタルカメラ NOUN
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
安倍晋三 PERSON 2,2 あべしんぞう
麻垣康三 PERSON 2,2 あさがきこうぞう
商人 NOUN 0 しょうにん,あきんど
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 organization 2,2,3
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into
東京 証券 取引所
Class | Method | Task |
---|---|---|
| | Add a user analysis dictionary |
| | Add a dynamic analysis dictionary |
Class | Method | Task |
---|---|---|
| | Adds a JLA readings dictionary. |
| | Create and load new dynamic JLA readings dictionary |
Chinese Script Converter (CSC)
Overview
There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in Mainland China. TC is used in Taiwan, Hong Kong, and Macau.
Conversion from one script to another is a complex matter. The main problem of SC to TC conversion is that the mapping is one-to-many. For example, the simplified form 发 maps to either of the traditional forms 發 or 髮. Conversion must also deal with vocabulary differences and context-dependence.
The Chinese Script Converter converts text in simplified script to text in traditional script, or vice versa. The conversion can be on any of three levels, listed here from simplest to the most complex:
Codepoint Conversion: Codepoint conversion uses a mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified form 头发 might be converted to a traditional form by first mapping 头 to 頭, and then 发 to either 髮 or 發. Using this approach, however, there is no recognition that 头发 is a word, so the choice could be 發, in which case the end result 頭發 is nonsense. On the other hand, the choice of 髮 leads to errors for other words. So while conversion mapping is straightforward, it is unreliable.
Orthographic Conversion: The second level of conversion is orthographic. This level relies upon identification of the words in the input text. Within each word, orthographic variants of each character may be reflected in the conversion. In the above example, 头发 is identified as a word and is converted to a traditional variant of the word, 頭髮. There is no basis for converting it to 頭發, because the conversion considers the word as a whole rather than as a collection of individual characters.
Lexemic Conversion: The third level of conversion is lexemic. This level also relies upon identification of words. But rather than converting a word to an orthographic variant, the aim here is to convert it to an entirely different word. For example, "computer" is usually 计算机 in SC but 電腦 in TC. Whereas codepoint conversion is strictly character-by-character and orthographic conversion is character-by-character within a word, lexemic conversion is word-by-word.
Note
If the converter cannot provide the level of conversion requested (lexemic or orthographic), the next simpler level of conversion is provided.
For example, if you ask for a lexemic conversion, and none is available for a given token, CSC provides the orthographic conversion unless it is not available, in which case CSC provides a codepoint conversion.
The Chinese input may contain a mixture of TC and SC, and even some non-Chinese text. The Chinese Script Converter converts to the target (SC or TC), leaving any tokens already in the target form and any non-Chinese text unchanged.
CSC Options
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
conversionLevel | Indicates the most complex conversion level to use | | Chinese |
language | The language from which the CSC analyzer converts | | Chinese, Simplified Chinese, Traditional Chinese |
targetLanguage | The language to which the CSC analyzer converts | | Chinese, Simplified Chinese, Traditional Chinese |
See Initial and Path Options for additional options
Enum Classes:
BaseLinguisticsOption
CSCAnalyzerOption
Using CSC with the ADM API
This example uses the BaseLinguisticsFactory
and CSCAnalyzer
.
Create a BaseLinguisticsFactory and set the required options.

final BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
factory.setOption(BaseLinguisticsOption.conversionLevel, CSConversionLevel.orthographic.levelName());
factory.setOption(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
factory.setOption(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
Create an annotator to get translations.
final EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
options.put(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
final Annotator cscAnnotator = factory.createCSCAnnotator(options);
Annotate the input text for tokens and translations.
final AnnotatedText annotatedText = cscAnnotator.annotate(inputText);
final Iterator<Token> tokenIterator = annotatedText.getTokens().iterator();
for (final String translation : annotatedText.getTranslatedTokens().get(0).getTranslations()) {
    final String originalToken = tokenIterator.hasNext() ? tokenIterator.next().getText() : "";
    outputData.format(OUTPUT_FORMAT, originalToken, translation);
}
Using CSC with the Classic API
The RBL distribution includes a sample (CSCAnalyze
) that you can compile and run with an ant
build script.
In a Bash shell script (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/csc-analyze
and call
ant run
The sample reads an input file in SC and prints each token with its TC conversion to standard out.
This example tokenizes Chinese text and converts from TC to SC.
Set up a BaseLinguisticsFactory.

BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
Use the BaseLinguisticsFactory to create a Tokenizer to tokenize Chinese text.

Tokenizer tokenizer = factory.create(new StringReader(tcInput), LanguageCode.CHINESE);
Use the BaseLinguisticsFactory to create a CSCAnalyzer to convert from TC to SC.

CSCAnalyzer cscAnalyzer = factory.createCSCAnalyzer(LanguageCode.TRADITIONAL_CHINESE, LanguageCode.SIMPLIFIED_CHINESE);
Use the CSCAnalyzer to analyze each Token found by the Tokenizer.

Token token;
while ((token = tokenizer.next()) != null) {
    String tokenIn = new String(token.getSurfaceChars(), token.getSurfaceStart(), token.getLength());
    System.out.println("Input: " + tokenIn);
    cscAnalyzer.analyze(token);
Get the conversion (SC or TC) from each Token.

    System.out.println("SC translation: " + token.getTranslation());
}
CSC User Dictionaries
CSC user dictionaries support orthographic and lexemic conversions between Simplified Chinese and Traditional Chinese. They are not used for codepoint conversion.
CSC user dictionaries follow the same format as other user dictionaries:
The source file is UTF-8 encoded.
The file may begin with a byte order mark (BOM).
Each entry is a single line.
Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
Each entry contains two or three tab-delimited elements:
input_token orthographic_translation [lexemic_translation]
The input_token is the form you are converting from and the orthographic_translation and optional lexemic_translation are the form you are converting to.
Sample entries for a TC to SC user dictionary:
電腦 电脑 计算机
宇宙飛船 宇宙飞船
Compiling a CSC User Dictionary. In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-csc-dictionary) and a .bat file for Windows (rbl-build-csc-dictionary.bat).
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set heap size with the JAVA_OPTS
environment variable. For example, to provide 8 GB of heap size, set JAVA_OPTS
to -Xmx8g
.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Compile the CSC user dictionary from the RBL root directory:
tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE
INPUT_FILE
is the pathname of the source file you have created, and OUTPUT_FILE
is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
Class | Method | Task |
---|---|---|
| | Add a user CSC dictionary for a given language. |
| | Add dynamic CSC dictionary |
Using RBL in Apache Lucene
Introduction
RBL provides an API for integrating with Apache Lucene. See the Javadoc that accompanies this package for the complete API documentation.
A Lucene analyzer examines the text of fields and generates a token stream. Rosette provides an RBL base linguistics analyzer for Lucene, which uses RBL's linguistic tools.
RBL supports two usage patterns for incorporating Rosette Base Linguistics in a Lucene application.
- Use the RBL Lucene base linguistics analyzer (BaseLinguisticsAnalyzer) to parse an input stream in one of the languages RBL supports and to generate a token stream with tokens and lemmas.
- Create your own analysis chain, using a language-specific RBL tokenizer and token filter.
Samples
Samples are included in the use cases below. The code comes from the Lucene 7.0 samples found in rbl-je-<version>/samples/lucene-7_0
.
RBL supports Lucene versions 7.0 - 10.1.0 and Solr versions 7.0 - 9.8 with the following files, where <version> is the RBL version:
Solr version | Lucene version | RBL file name version | Solr lib JAR file | Lucene sample directory |
---|---|---|---|---|
7.0 - 8.11 | 7.0 - 8.11 | 7_0 | btrbl-je-lucene-solr-7_0-<version>.jar | samples/lucene-7_0 |
9.0 - 9.8 | 9.0 - 10.1.0 | 9_0 | btrbl-je-lucene-solr-9_0-<version>.jar | samples/lucene-9_0 |
Lucene Options
The following options are passed in through an options Map
.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
| Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
addReadings | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
identifyContractionComponents | Indicates whether the token filter should identify contraction components as contraction components rather than as lemmas. | Boolean (false) | All |
| Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled. | Boolean (false) | All |
| Indicates whether the token filter should replace a surface form with its normalization. Normalization must be enabled. | Boolean (false) | All |
Enum Classes:
FilterOption
Using the RBL Lucene Base Linguistics Analyzer
The RBL Lucene base linguistics analyzer (BaseLinguisticsAnalyzer) provides an analysis chain with a language-specific tokenizer and token filter.
You can configure the analyzer with a number of tokenizer and token filter options.
The analysis chain includes the Lucene filters LowerCaseFilter and CJKWidthFilter. It also provides support for including a StopFilter.
Japanese Analyzer Sample. This Lucene sample, JapaneseAnalyzerSample.java
, uses the base linguistics analyzer to generate an enriched token stream from Japanese text.
Assemble a set of options that include the root directory, the path to the Rosette license, the language to analyze (Japanese), and the generation of readings for the tokens.

File rootPath = new File(rootDirectory);
String licensePath = new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath();
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
options.put("addReadings", "true");
Instantiate a Japanese base linguistics analyzer with these options.
rblAnalyzer = new BaseLinguisticsAnalyzer(options);
Read in a Japanese text file.
Use the analyzer to generate a token stream. The token stream contains tokens, lemmas that are not identical to their tokens, and readings. Disambiguation is turned off by default for Japanese, so multiple analyses may be returned for each token. To turn on disambiguation, add
options.put("disambiguate", "true");
to the construction of the options Map.

Write each element in the token stream with its type attribute to an output file.
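The token stream itself is consumed with the standard Lucene attribute APIs (CharTermAttribute and TypeAttribute from org.apache.lucene.analysis.tokenattributes). A minimal sketch of the read loop, printing to System.out rather than a file; rblAnalyzer comes from the previous step and japaneseText holds the file contents:

try (TokenStream stream = rblAnalyzer.tokenStream("text", japaneseText)) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        // Each element (token, lemma, or reading) carries its type attribute.
        System.out.println(term + " " + type.type());
    }
    stream.end();
}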
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/lucene-7_0
and use the Ant build script:
ant runAnalyzer
The sample reads rbl-je-<version>/samples/data/jpn-text.txt
and writes the output to jpn-Analyzed-byAnalyzer.txt
.
The output includes each token, lemma, and reading in the token stream on a separate line with the type attribute for each element. There may be more than one analysis (hence lemma and reading in the sample output) for a token; lemmas are not put into the token stream when identical to the token.
Type Attribute | Meaning | Option |
---|---|---|
<ALNUM> | token | |
| compound component | |
<CONT> | contraction component | identifyContractionComponents |
<LEMMA> | lemma | |
<READING> | reading | addReadings |
For example:
メルボルン <ALNUM>
メルボルン <READING>
で <ALNUM>
行わ <ALNUM>
行う <LEMMA>
オコナワ <READING>
れ <ALNUM>
る <LEMMA>
れる <LEMMA>
レ <READING>
Creating your own RBL Analysis Chain
When creating an analysis chain, you can do the following:
- Use the BaseLinguisticsTokenizerFactory to generate a language-specific tokenizer that applies Rosette Base Linguistics to tokenize text.
- Use the BaseLinguisticsTokenFilterFactory to generate a language-specific token filter that enhances a stream of tokens.
- Add other token filters to the analysis chain.
One of the tasks of the token filter is to set tokens’ token types. For example, given a lemma token, the token filter gives it the type <LEMMA>
. Given a contraction component token, the token filter has a choice: by default, the type is set to <LEMMA>
, but when the identifyContractionComponents
option is enabled, the type is set to <CONT>
.
Japanese Tokenizer and Filter Sample
This Lucene sample, JapaneseTokenizerAndFilterSample.java
, creates an analysis chain to generate an enriched token stream.
Use a factory to set up a language-specific base linguistics tokenizer, which puts tokens in the token stream.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
Map<String, String> options = new HashMap<>();
options.put("language", "jpn");
options.put("rootDirectory", rootDirectory);
options.put("addReadings", "true");
tokenFilterFactory = new BaseLinguisticsTokenFilterFactory(options);
Use a base linguistics token filter factory to set up a language-specific base linguistics token filter, which adds lemmas and readings to the tokens in the token stream.
Tokenizer tokenizer = new BaseLinguisticsTokenizer(factory.createTokenizer(null, LanguageCode.JAPANESE));
tokenizer.setReader(input);
TokenStream tokens = tokenFilterFactory.create(tokenizer);
To replicate the behavior of the analyzer in the previous example, this sample also includes the LowerCaseFilter and CJKWidthFilter.
tokens = new LowerCaseFilter(tokens);
tokens = new CJKWidthFilter(tokens);
Write each element in the token stream with its type attribute to an output file.
To run the sample, in a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/lucene-7_0
and use the Ant build script:
ant runTokenizerAndFilter
The example reads the same file as the previous sample and writes the output to jpn-analyzed-byTokenizerAndFilter.txt. The content matches the content generated by the previous example.
. The content matches the content generated by the previous example.
Using the BaseLinguisticsSegmentationTokenFilter
If you are using your own whitespace tokenizer and processing text that requires segmenting Chinese, Japanese, or Thai, you can use the BaseLinguisticsSegmentationTokenFilterFactory
to create a BaseLinguisticsSegmentationTokenFilter
, then place the segmentation token filter in an analysis chain following the whitespace tokenizer and preceding other filters, such as a base linguistics token filter.
The segmentation token filter segments each of the tokens from the whitespace tokenizer into individual tokens where necessary. Refer to the Javadoc for the RBL API for Lucene for more information.
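A minimal sketch of such a chain for Thai; it assumes the segmentation filter factory takes the same option map (language, rootDirectory, licensePath) as the other RBL Lucene factories:

Map<String, String> options = new HashMap<>();
options.put("language", "tha");
options.put("rootDirectory", rootDirectory);
options.put("licensePath", licensePath);
BaseLinguisticsSegmentationTokenFilterFactory segmentationFactory =
        new BaseLinguisticsSegmentationTokenFilterFactory(options);

Tokenizer whitespaceTokenizer = new WhitespaceTokenizer(); // your own whitespace tokenizer
whitespaceTokenizer.setReader(new StringReader(thaiText));
// Segment each whitespace-delimited chunk into individual Thai tokens;
// follow with a base linguistics token filter or other filters as needed.
TokenStream tokens = segmentationFactory.create(whitespaceTokenizer);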
Analyses Attributes
You can use the com.basistech.rosette.lucene.AnalysesAttribute
object to gather linguistic data about the text in a document. Depending on the language, the data may include tokens, normalized tokens, lemmas, part-of-speech tags, readings, compound components, and Semitic roots.
The Lucene sample, AnalysesAttributeSample.java
, illustrates this.
To run the sample with the German sample file, navigate to rbl-je-<version>/samples/lucene-<version>
, and call ant as follows:
ant -Dtest.language=deu runAnalysesAttribute
The sample writes the output to deu-analysesAttributes.txt
.
Case Sensitivity During the Analysis
In some languages, case distinctions are meaningful. For example, in German, a word may be a noun if it begins with an upper-case letter, and not a noun if it does not. As a result, RBL delivers higher accuracy in selecting lemmas and splitting compounds when it can process text with correct casing. On the other hand, users typing in queries may be sloppy with capital letters.
For this reason, the default behavior of the Lucene integration is to perform the following analysis steps:
- tokenize
- determine lemmas
- map to lowercase
The result is that the index contains the lowercase form of the most accurately selected lemma.
However, some applications work with text in which case distinctions are not reliably present, even in languages where they are important. These applications need to determine lemmas and compound components even though the spelling is nominally incorrect with respect to case.
To support these applications, RBL provides a 'case-insensitive' mode of operation. In this mode, RBL performs the following analysis steps:
- tokenize, ignoring the case of abbreviations and such
- determine lemmas, ignoring case in choosing lemmas and compound components
- map to lowercase
The mapping is still required to ensure that the index or query ends up with uniformly lowercase text.
To specify case sensitivity for the analysis, set com.basistech.rosette.bl.AnalyzerOption.caseSensitive
to true
or false
. By default, the setting is true
, except for Danish, Norwegian, and Swedish, for which our dictionaries are lowercase and the setting is false
irrespective of the user setting.
When you are making this setting in the com.basistech.rosette.lucene
package, include the caseSensitive
option as a string. For example:
Map<String, String> options = new HashMap<>();
options.put("language", LanguageCode.ITALIAN.ISO639_3());
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
Activating User Dictionaries in Lucene
You can use user dictionaries when running RBL with Lucene.
In the com.basistech.rosette.lucene
package, BaseLinguisticsTokenizerFactory
and BaseLinguisticsTokenFilterFactory
can load segmentation and analysis dictionaries respectively.
The path options are provided as a list of paths, separated by semicolons or the OS-specific path separator.
Option | Description | Type | Supported Languages |
---|---|---|---|
| A list of paths to user lemma dictionaries. | List of Paths | Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai |
| A list of paths to user segmentation dictionaries. | List of Paths | All |
userDefinedDictionaryPath | A list of paths to user dictionaries. | List of Paths | All |
userDefinedReadingDictionaryPath | A list of paths to reading dictionaries. | List of Paths | Japanese |
BaseLinguisticsTokenizerFactory
provides the method addUserDefinedDictionary
for adding a segmentation dictionary. For example:
Map<String, String> args = new HashMap<>();
args.put(TokenizerOption.language.name(), LanguageCode.JAPANESE.ISO639_3());
args.put(TokenizerOption.nfkcNormalize.name(), "true");
BaseLinguisticsTokenizerFactory factory = new BaseLinguisticsTokenizerFactory(args);
factory.addUserDefinedDictionary(LanguageCode.JAPANESE, "/path/to/my/jpn-dict.bin");
The constructor for BaseLinguisticsTokenFilterFactory
takes a Map
of options. Use the userDefinedDictionaryPath
option to load an analysis dictionary:
Map<String, String> options = new HashMap<>();
options.put("language", LanguageCode.ITALIAN.ISO639_3());
options.put("userDefinedDictionaryPath", "/path/to/my/ita-dict.bin");
options.put("caseSensitive", "true");
TokenFilterFactory factory = new BaseLinguisticsTokenFilterFactory(options);
Using CSC with Lucene
- Set up a com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory.
- Use the BaseLinguisticsTokenizerFactory to create a com.basistech.rosette.lucene.BaseLinguisticsTokenizer, which contains a Lucene Tokenizer.
- Set up a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilterFactory.
- Use the BaseLinguisticsCSCTokenFilterFactory to create a com.basistech.rosette.lucene.BaseLinguisticsCSCTokenFilter to convert from TC to SC or vice versa.
- Use the BaseLinguisticsCSCTokenFilter to convert each Token found by the Tokenizer.
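A minimal sketch of those steps. The option keys accepted by the CSC token filter factory are assumptions for illustration (shown here mirroring the language and targetLanguage options used elsewhere in this guide); check the Javadoc for the exact names:

Map<String, String> tokenizerArgs = new HashMap<>();
tokenizerArgs.put("language", "zht");
tokenizerArgs.put("rootDirectory", rootDirectory);
BaseLinguisticsTokenizerFactory tokenizerFactory = new BaseLinguisticsTokenizerFactory(tokenizerArgs);
Tokenizer tokenizer = tokenizerFactory.create();
tokenizer.setReader(new StringReader(tcInput));

Map<String, String> filterArgs = new HashMap<>();
filterArgs.put("rootDirectory", rootDirectory);
filterArgs.put("language", "zht");         // assumed key: conversion source
filterArgs.put("targetLanguage", "zhs");   // assumed key: conversion target
BaseLinguisticsCSCTokenFilterFactory cscFactory = new BaseLinguisticsCSCTokenFilterFactory(filterArgs);
TokenStream converted = cscFactory.create(tokenizer);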
RBL/Lucene Distribution Sample. For supported versions of Lucene, the RBL distribution includes a sample (CSCCharTermAttributeSample
) that you can compile and run with an ant
build script.
In a Bash shell script (Unix) or Command Prompt (Windows), navigate to the sample directory for your version of Lucene (rbl-je-<version>/samples/csc-analyze-<luceneversion>) and call:
ant run
The sample reads an input file in SC and prints the TC conversion for each token to standard out.
Using RBL in Apache Solr
Introduction
You can configure Apache Solr search to use RBL for both indexing documents and for queries.
To index and search documents with RBL in a Solr application, you must add JARs to the Solr classpath and define Solr analysis chains that apply the RBL analysis components to process text at index and query time.
Setting Solr 9.0 Permissions
In Solr 9.0.0 and later, you must update the policy file at server/etc/security.policy
to grant Solr permissions to read the files in RBL-JE's root directory. Add the following line to the file:
grant { permission java.io.FilePermission "<RBLJE_ROOT>${/}-", "read"; };
where you replace <RBLJE_ROOT>
with the path to the root of your RBL installation.
Adding to the Solr Classpath
Add the following lib
elements to the solrconfig.xml
for each Solr collection you are using.
<lib path="<RBLJE_ROOT>/lib/btrbl-je-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/btcommon-api-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/slf4j-api-<version>.jar"/>
<lib path="<RBLJE_ROOT>/lib/btrbl-je-lucene-solr-<version>-<version>.jar"/>
where you replace <RBLJE_ROOT> with the path to the root of your RBL installation and take the full file names, with the correct <version> values, from the RBL lib directory. They can change with each new release.
For example, if the root of the RBL installation is /opt/local/bt/rbl-je
, the version of Solr is 8.6, and the version of RBL is 7.44.0.c67.0, the lib paths are:
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-7.44.0.c67.0.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/btcommon-api-37.0.1.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/slf4j-api-1.7.28.jar"/>
<lib path="/opt/local/bt/rbl-je/lib/btrbl-je-lucene-solr-7_0-7.44.0.c67.0.jar"/>
The correct Lucene/Solr version files are listed in the section Lucene/Solr Versions.
The SLF4J JARs enable logging.
Defining a Solr Analysis Chain
In the Solr schema.xml
or managed-schema.xml
, add a fieldType
element and a corresponding field
element for the language of the documents processed by the application.
Field Type. The fieldType
includes two analyzers: one for indexing documents and one for querying documents. Each analyzer contains a tokenizer and a token filter.
Here, for example, is a fieldType
for Japanese:
<fieldtype name="basis-japanese" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      language="jpn" rootDirectory="<RBLJE_ROOT>"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      language="jpn" rootDirectory="<RBLJE_ROOT>"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      language="jpn" rootDirectory="<RBLJE_ROOT>"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      language="jpn" rootDirectory="<RBLJE_ROOT>" query="true"/>
  </analyzer>
</fieldtype>
where you replace <RBLJE_ROOT>
with the path to the root of your RBL installation. The fieldType name
indicates the language, and each language
attribute is set to the ISO 639-3 language code for Japanese.
Note
You can incorporate any Solr filter you need, such as the Solr lowercase filter; however, they should be added into the chain after the Base Linguistics token filter. If you modify the token stream too radically before RBL, you will degrade its ability to analyze the text.
Field. The analysis chain requires a field
definition with a type
attribute that maps to the fieldType
. For the Japanese example above, add the following field
definition to schema.xml
.
<field name="text-japanese" type="basis-japanese" indexed="true" stored="true"/>
In your Solr application, you can now index and query Japanese documents placed in the text-japanese
field.
Using Options in Solr
Most API options can be used in a Solr analysis chain. In Solr, you do not directly use the option enum classes. Instead, options are specified in the schema using the format option="value"
.
When specifying options for the tokenizer
(class BaseLinguisticsTokenizerFactory
), use options in the TokenizerOption
enum class. When specifying options for a filter
(class BaseLinguisticsTokenFilterFactory
), use options in the AnalyzerOption
and FilterOption
enum classes.
Example:
<fieldType class="solr.TextField" name="basis-french">
  <analyzer type="index">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      language="fra" rootDirectory="${bt_root}"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      language="fra" rootDirectory="${bt_root}"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      language="fra" query="true" rootDirectory="${bt_root}"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      language="fra" query="true" rootDirectory="${bt_root}"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Using Neural Models with Solr
When using the neural models with Solr, you must grant the appropriate permissions and may want to increase the JVM memory.
The security manager is enabled by default in Solr 9. To use any of the neural models, grant the following permissions in server/etc/security.policy
:
grant {
  permission java.io.FilePermission "$HOME/.javacpp/cache", "read,write,delete,execute";
  permission java.lang.RuntimePermission "loadLibrary.*";
};
Neural models may require more memory. We recommend increasing the memory to at least 1 GB by editing bin/solr.in.sh:
:
SOLR_JAVA_MEM="-Xms1g -Xmx1g"
Activating User Dictionaries in Solr
You can use user dictionaries when running RBL with Solr.
Use the option userDefinedDictionaryPath
as shown in this example:
<fieldType class="solr.TextField" name="basis-japanese">
  <analyzer>
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      tokenizerType="spaceless_lexical" language="jpn" rootDirectory="${bt_root}"
      userDefinedDictionaryPath="/path/to/my/jpn-udd.bin"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      tokenizerType="spaceless_lexical" language="jpn" rootDirectory="${bt_root}"/>
  </analyzer>
</fieldType>
Here is an example of using a JLA reading dictionary:
<fieldType class="solr.TextField" name="basis-japanese-rd">
  <analyzer>
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
      tokenizerType="spaceless_lexical" readings="true" language="jpn" rootDirectory="${bt_root}"
      userDefinedReadingDictionaryPath="/path/to/my/readings.bin"/>
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
      tokenizerType="spaceless_lexical" language="jpn" rootDirectory="${bt_root}"/>
  </analyzer>
</fieldType>
Language Codes
Canonical Language Code
RBL canonicalizes some language codes to other language codes. These canonicalization rules apply to all APIs.
LanguageCode.NORWEGIAN
is canonicalized to LanguageCode.NORWEGIAN_BOKMAL
immediately upon any input to the API. This means that there can be no distinguishing between them. In particular, an Analyzer
built from a factory configured to use Norwegian will report its language as Norwegian Bokmål instead.
Similarly, LanguageCode.SIMPLIFIED_CHINESE
and LanguageCode.TRADITIONAL_CHINESE
are canonicalized to LanguageCode.CHINESE
immediately. The one exception is that they are not canonicalized as inputs to or outputs from the Chinese Script Converter.
Those are the only language code canonicalizations. Although RBL internally treats Afghan Persian and Iranian Persian as Persian, they are not considered the same language. This makes it possible to configure different user dictionaries for each variety of Persian, even though they are otherwise processed identically.
ISO 639-3 Language Codes
RBL uses ISO 639-3 language codes to specify languages as strings. There are a few nonstandard language codes, as indicated. RBL also accepts 2-letter codes specified by the ISO-639-1 standard. See the Javadoc for more details on language codes.
Language | Code |
---|---|
Arabic | ara |
Catalan | cat |
Chinese | zho |
Chinese (Simplified) | zhs [a] |
Chinese (Traditional) | zht [a] |
Czech | ces |
Danish | dan |
Dutch | nld |
English | eng |
Estonian | est |
Finnish | fin |
French | fra |
German | deu |
Greek | ell |
Hebrew | heb |
Hungarian | hun |
Indonesian | ind |
Italian | ita |
Japanese | jpn |
Korean | kor |
Latvian | lav |
Malay (Standard) | zsm |
North Korean | |
Norwegian | nor |
Norwegian Bokmål | nob |
Norwegian Nynorsk | nno |
Pashto | pus |
Persian | fas |
Persian, Afghan | prs |
Persian, Iranian | pes |
Polish | pol |
Portuguese | por |
Romanian | ron |
Russian | rus |
Serbian | srp |
Slovak | slk |
South Korean | |
Spanish | spa |
Swedish | swe |
Tagalog | tgl |
Thai | tha |
Turkish | tur |
Ukrainian | ukr |
Urdu | urd |
Unknown | |
[a] Not a real ISO 639-3 code. |
Part-of-Speech Tags
Note
For a mapping of the part-of-speech tags that appear in this appendix to the tags used in the Penn Treebank Project, see the tab-delimited .csv files (one per language) distributed with this Application Developer's Guide in the penn_treebank subdirectory. The RBL Korean part-of-speech tags conform to the Penn Treebank standard.
In RBL, each language has its own set of POS tags, and a few languages have multiple tag sets. Each tag set is identified by a value of the TagSet enum. When RBL outputs a POS tag, it also lists the identifier of the tag set it came from. Output for a single language may contain POS tags from multiple tag sets, including the language-neutral set.

The header of each table lists the tag set identifier. For example, the English tag set is identified as BT_ENGLISH.

For Chinese and Japanese using statistical models for tokenization (tag sets BT_CHINESE_RBLJE_2 and BT_JAPANESE_RBLJE_2), if the analyzer cannot find the lemma for a word in its analysis dictionary, it returns GUESS as the part-of-speech tag. GUESS is not a true part of speech and does not appear in the tables in this appendix.
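To see which tags your text receives, you can annotate it with the ADM API and inspect each token's analyses. The following is a minimal sketch, not a definitive recipe: the sample sentence is illustrative, the package and method names follow the RBL Javadoc as best reconstructed here, and the assumption that the disambiguated analysis comes first in the list should be verified for your release.

```java
import java.util.EnumMap;

import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;
import com.basistech.rosette.dm.AnnotatedText;
import com.basistech.rosette.dm.Annotator;
import com.basistech.rosette.dm.MorphoAnalysis;
import com.basistech.rosette.dm.Token;

public class PosTagDemo {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/path/to/rbl"); // placeholder path
        factory.setOption(BaseLinguisticsOption.language, "eng");
        // The EnumMap carries per-call option overrides; empty here.
        Annotator annotator =
            factory.createSingleLanguageAnnotator(new EnumMap<>(BaseLinguisticsOption.class));
        AnnotatedText text = annotator.annotate("The cats sat quietly.");
        for (Token token : text.getTokens()) {
            // First analysis; with disambiguation on, this is the selected one (assumption).
            MorphoAnalysis analysis = token.getAnalyses().get(0);
            System.out.println(token.getText() + "\t" + analysis.getPartOfSpeech());
        }
    }
}
```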
Arabic POS Tags – BT_ARABIC
Tag | Description | Example |
---|---|---|
ABBREV | abbreviation | ا ف ب |
ADJ | adjective | عَرَبِي اَلأَمْرِيكِيّ |
ADV | adverb | هُنَاكَ ,ثُمَّ |
CONJ | conjunction | وَ |
CV | verb (imperative) | أَضِفْ |
DEM_PRON | demonstrative pronoun | هٰذَا |
DET | determiner | لِل |
EOS | end of sentence | ! ؟ . |
EXCEPT_PART | exception particle | إلا |
FOCUS_PART | focus particle | أما |
FUT_PART | future particle | سَوْفَ |
INTERJ | interjection | آه |
INTERROG_PART | interrogative particle | هَلْ |
IV | verb (imperfect) | يَكْتُبُ ,يَأْكُلُ |
IV_PASS | verb (passive imperfect) | يُضَافُ، يُشَارُ |
NEG_PART | negative particle | لَن |
NON_ARABIC | not Arabic script | a b c |
NOUN | noun | طَائِرْ، كُمْبِيُوتَرْ، بَيْتْ |
NOUN_PROP | proper noun | طَوُنِي، مُحَمَّدْ |
NO_FUNC | unknown part of speech | |
NUM | numbers (Arabic-Indic numbers, Latin, and text-based cardinal) | أَرْبَعَة عَشَرْ، ١٤، 14 |
PART | particle | أَيَّتُهَا، إِيَّاهُ |
PREP | preposition | أَمَامْ، فِي |
PRONOUN | pronoun | هُوَ |
PUNC | punctuation | () ،.: |
PV | perfective verb | كَانَت، قَالَ |
PV_PASS | passive perfective verb | أُعْتَبَر |
RC_PART | resultative clause particle | فَلَمَا |
REL_ADV | relative adverb | حَيْثُ |
REL_PRON | relative pronoun | اَلَّذِي، اَللَّذَانِ |
SUB_CONJ | subordinating conjunction | إِذَا، إِذ |
VERB_PART | verbal particle | لَقَدْ |
Chinese POS Tags – Simplified and Traditional – BT_CHINESE
Tag | Description | Simplified Chinese | Traditional Chinese |
---|---|---|---|
A | adjective | 可爱 | 可愛 |
D | adverb | 必定 | 必定 |
E | idiom/phrase | 胸有成竹 | 胸有成竹 |
EOS | sentence final punctuation | 。 | 。 |
F | non-derivational affix | 鸳 | 鴛 |
FOREIGN | non-Chinese | c123 | c123 |
I | interjection | 吧 | 吧 |
J | conjunction | 但是 | 但是 |
M | onomatopoeia | 丁丁 | 丁丁 |
NA | abbreviation | 日 | 日 |
NC | common noun | 水果 | 水果 |
NM | measure word | 个 | 個 |
NN | numeral | 3, 2, 一 | 3, 2, 一 |
NP | proper noun | 英国 | 英國 |
NR | pronoun | 我 | 我 |
NT | temporal noun | 一月 | 一月 |
OC | construction | 越~越~ | 越~越~ |
PL | particle | 之 | 之 |
PR | preposition | 除了 | 除了 |
PUNCT | non-sentence-final punctuation | , 「」(); | , 《》() |
U | unknown | ||
V | verb | 跳舞 | 跳舞 |
W | derivational suffix | 家 | 家 |
WL | direction word | 下 | 下 |
WV | word element - verb | 以 | 以 |
X | generic affix | 老 | 老 |
XP | generic prefix | 可 | 可 |
XS | generic suffix | 员 | 員 |
Tag |
---|
FOREIGN_PERSON |
ORGANIZATION |
PERSON |
PLACE |
Chinese POS Tags – Simplified and Traditional – BT_CHINESE_RBLJE_2
Tag | Description | Simplified Chinese | Traditional Chinese |
---|---|---|---|
A | adjective | 可爱 | 可愛 |
D | adverb | 必定 | 必定 |
E | idiom/phrase | 胸有成竹 | 胸有成竹 |
EOS | sentence final punctuation | 。 | 。 |
F | non-derivational affix | 鸳 | 鴛 |
I | interjection | 吧 | 吧 |
J | conjunction | 但是 | 但是 |
M | onomatopoeia | 丁丁 | 丁丁 |
NA | abbreviation | 日 | 日 |
NC | common noun | 水果 | 水果 |
NM | measure word | 个 | 個 |
NN | numeral | 3, 2, 一 | 3, 2, 一 |
NP | proper noun | 英国 | 英國 |
NR | pronoun | 我 | 我 |
NT | temporal noun | 一月 | 一月 |
OC | construction | 越~越~ | 越~越~ |
PL | particle | 之 | 之 |
PR | preposition | 除了 | 除了 |
PUNCT | non-sentence-final punctuation | , 「」(); | , 《》() |
U | unknown | ||
V | verb | 跳舞 | 跳舞 |
W | derivational suffix | 家 | 家 |
WL | direction word | 下 | 下 |
WV | word element - verb | 以 | 以 |
X | generic affix | 老 | 老 |
XP | generic prefix | 可 | 可 |
XS | generic suffix | 员 | 員 |
Czech POS Tags – BT_CZECH
Tag | Description | Example |
---|---|---|
ADJ | adjective: nominative | [vál] silný [vítr] |
adjective: genitive | [k uvedení] zahradní [slavnosti] | |
adjective: dative | [k] veselým [lidem] | |
adjective: accusative | [jak zdolat] ekonomické [starosti] [vychutná] jeho [radost] | |
adjective: instrumental | první bushovou [zastávkou] | |
adjective: locative | [na] druhém [okraji silnice] | |
adjective: vocative | ty mladý [muž] | |
ordinal | [obsadil] 12. [místo] | |
ADV | adverb | velmi, nejvíce, daleko, jasno |
CLIT | clitic | bych, by, bychom, byste |
CM | comma | , |
CONJ | conjunction | a, i, ale, aby, nebo, však, protože |
DATE | date | 11. 12. 1996, 11. 12. |
INTJ | interjection | ehm, ach |
NOUN | noun: nominative | [je to] omyl |
noun: genitive | [krize] autority státu | |
noun: dative | [dostala se k] moci | |
noun: accusative | [názory na] privatizaci | |
noun: instrumental | [Marx s naprostou] jistotou | |
noun: locative | [ve vlastním] zájmu | |
noun: vocative | [ty] parlamente | |
abbreviation, initial, unit | v., mudr., km/h, m3 | |
NUM_ACC | numeral: accusative | [máme jen] jednu [velmoc] |
NUM_DAT | numeral: dative | [jsme povinni] mnoha [lidem] |
NUM_DIG | digit | 123, 2:0, 1:23:56, -8.9, -8 909 |
NUM_GEN | numeral: genitive | [po dobu] dvou [let] |
NUM_INS | numeral: instrumental | [s] padesáti [hokejisty] |
NUM_LOC | numeral: locative | [po] dvou [závodech] |
NUM_NOM | numeral: nominative | oba [kluby tají, kde] |
NUM_ROM | Roman numeral | V |
NUM_VOC | numeral: vocative | [vy] dva [, zastavte] |
PREP | preposition | dle [tebe], ke [stolu], do [roku], se [mnou] |
PREPPRON | prepositional pronoun | nač |
PRON_ACC | pronoun: accusative | [nikdo] je [nevyhodí] |
PRON_DAT | pronoun: dative | [kdy je] mu [vytýkána] |
PRON_GEN | pronoun: genitive | [u] nás [i kolem] nás |
PRON_INS | pronoun: instrumental | [mezi] nimi [být] |
PRON_LOC | pronoun: locative | [aby na] ní [stál] |
PRON_NOM | pronoun: nominative | já [jsem jedinou] |
PRON_VOC | pronoun: vocative | vy [dva, zastavte ] |
PROP | proper noun | Pavel, Tigrid, Jacques, Rupnik, Evropy |
PTCL | particle | ano, ne |
PUNCT | punctuation | ( ) { } [ ] ; |
REFL_ACC | reflexive pronoun: accusative | se |
REFL_DAT | reflexive pronoun: dative | si |
REFL_GEN | reflexive pronoun: genitive | sebe |
REFL_INS | reflexive pronoun: instrumental | sebou |
REFL_LOC | reflexive pronoun: locative | sobě |
SENT | sentence final punctuation | . ! ? ... |
VERB_IMP | verb: imperative | odstupte |
VERB_INF | verb: infinitive | [mohli si] koupit |
VERB_PAP | verb: past participle | mohli [si koupit] |
VERB_PRI | verb: present indicative | [trochu nás] mrzí |
VERB_TRA | verb: transgressive | maje [ode mne] |
Dutch POS Tags – BT_DUTCH
Tag | Description | Example |
---|---|---|
ADJA | attributive adjective | [een] snelle [auto] |
ADJD | adverbial or predicative adjective | [hij rijdt] snel |
ADV | non-adjectival adverb | [hij rijdt] vaak |
ART | article | een [bus], het [busje] |
CARD | cardinals | vijf |
CIRCP | right part of circumposition | [hij viel van dit dak] af |
CM | comma | , |
CMPDPART | right truncated part of compound | honden- [kattenvoer] |
COMCON | comparative conjunction | [zo groot] als, [groter] dan |
CON | co-ordinating conjunction | [Jan] en [Marie] |
CWADV | interrogative adverb or subordinate conjunction | wanneer [gaat hij weg ?], wanneer [hij nu weggaat] |
DEMDET | demonstrative determiner | deze [bloemen zijn mooi] |
DEMPRO | demonstrative pronoun | deze [zijn mooi] |
DIG | digits | 1, 1.2 |
INDDET | indefinite determiner | geen [broer] |
INDPOST | indefinite postdeterminer | [de] beide [broers] |
INDPRE | indefinite predeterminer | al [de broers] |
INDPRO | indefinite pronoun | beide [gingen weg] |
INFCON | infinitive conjunction | om [te vragen] |
ITJ | interjections | Jawel, och, ach |
NOUN | common noun or proper noun | [de] hoed, [het goede] gevoel, [de] Betuwelijn |
ORD | ordinals | vijfde, 125ste, 12de |
PADJ | postmodifying adjective | [iets] aardigs |
PERS | personal pronoun | hij [sloeg] hem |
POSDET | possessive pronoun | mijn [boek] |
POSTP | postposition | [hij liep zijn huis] in |
PREP | preposition | [hij is] in [het huis] |
PROADV | pronominal adverb | [hij praat] hierover |
PTKA | adverb modification | [hij wil] te [snel] |
PTKNEG | negation | [hij gaat] niet [snel] |
PTKTE | infinitive particle | [hij hoopt] te [gaan] |
PTKVA | separated prefix of pronominal adverb or verb | [daar niet] mee [hij loopt] mee |
PUNCT | other punctuation | " ' ` { } [ ] < > - --- |
RELPRO | relative pronoun | [de man] die [lachte] |
RELSUB | relative conjunction | [Het kind] dat, [Het feit] dat |
SENT | sentence final punctuation | ; . ? |
SUBCON | subordinating conjunction | Hoewel [hij er was] |
SYM | symbols | @, % |
VAFIN | finite auxiliary verb | [hij] is [geweest] |
VAINF | infinite auxiliary verb | [hij zal] zijn |
VAPP | past participle auxiliary verb | [hij is] geweest |
VVFIN | finite substantive verb | [hij] zegt |
VVINF | infinitive substantive verb | [hij zal] zeggen |
VVPP | past participle substantive verb | [hij heeft] gezegd |
WADV | interrogative adverb | waarom [gaat hij] |
WDET | interrogative or relative determiner | [de vrouw] wier [man....] |
WPRO | interrogative or relative pronoun | [de vraag] wie ... |
English POS Tags – BT_ENGLISH
Tag | Description | Example |
---|---|---|
ADJ | (basic) adjective | [a] blue [book], [he is] big |
ADJCMP | comparative adjective | [he is] bigger, [a] better [question] |
ADJING | adjectival ing-form | [the] working [men] |
ADJPAP | adjectival past participle | [a] locked [door] |
ADJPRON | pronoun (with determiner) or adjective | [the] same; [the] other [way] |
ADJSUP | superlative adjective | [he is the] biggest; [the] best [cake] |
ADV | (basic) adverb | today, quickly |
ADVCMP | comparative adverb | sooner |
ADVSUP | superlative adverb | soonest |
CARD | cardinal (except one) | two, 123, IV |
CARDONE | cardinal one | [at] one [time] ; one [dollar] |
CM | comma | , |
COADV | coordination adverbs either, neither | either [by law or by force]; [he didn't come] either |
COORD | coordinating conjunction | and, or |
COSUB | subordinating conjunction | because, while |
COTHAN | conjunction than | [bigger] than |
DET | determiner | the [house], a [house], this [house], my [house] |
DETREL | relative determiner whose | [the man] whose [hat ...] |
INFTO | infinitive marker to | [he wants] to [go] |
ITJ | interjection | oh! |
MEAS | measure abbreviation | [50] m. [wide], yd |
MONEY | currency plus cardinal | $1,000 |
NOT | negation not | [he will] not [come in] |
NOUN | common noun | house |
NOUNING | nominal ing-form | [the] singing [was pleasant], [the] raising [of the flag] |
ORD | ordinal | 3rd, second |
PARTPAST | past participle (in subclause) | [while] seated[, he instructed the students]; [the car] sold [on Monday] |
PARTPRES | present participle (in subclause), gerund | [while] doing [it];[they began] designing [the ship];having [said this ...] |
POSS | possessive suffix 's | [Peter] 's ; [houses] ' |
PREDET | pre-determiner such | such [a way] |
PREP | preposition | in [the house], on [the table] |
PREPADVAS | preposition or adverbial as | as [big] as |
PRON | (non-personal) pronoun | everybody, this [is ] mine |
PRONONE | pronoun one | one [of them]; [the green] one |
PRONPERS | personal pronoun | I, me, we, you |
PRONREFL | reflexive pronoun | myself, ... |
PRONREL | relative pronoun who, whom, whose; which; that | [the man] who [wrote that book], [the ship] that capsized |
PROP | proper noun | Peter, [Mr.] Brown |
PUNCT | punctuation (other than SENT and CM) | " |
QUANT | quantifier all, any, both, double, each, enough, every, (a) few, half, many, some | many [people]; half [the price]; all [your children]; enough [time]; any [of these] |
QUANTADV | quantifier or adverb much, little | much [better] , [he cares] little |
QUANTCMP | quantifier or comparative adverb more, less | more [people], less [expensive] |
QUANTSUP | quantifier or superlative adverb most, least | most [people], least [expensive] |
SENT | sentence final punctuation | . ! ? : |
TIT | title | Mr., Dr. |
VAUX | auxiliary (modal) | [he] will [run], [I] won't [come] |
VBI | infinitive or imperative of be | [he will] be [running]; be [quiet!] |
VBPAP | past participle of be | [he has] been [there] |
VBPAST | past tense of be | [he] was [running], [he] was [here] |
VBPRES | present tense of be | [he] is [running], [he] is [old] |
VBPROG | ing-form of be | [it is] being [sponsored] |
VDI | infinitive of do | [He will] do [it] |
VDPAP | past participle of do | [he has] done [it] |
VDPAST | past tense of do | [we] did [it], [he] didn't [come] |
VDPRES | present tense of do | [We] do [it], [he] doesn't [go] |
VDPROG | ing-form of do | [He is] doing [it] |
VHI | infinitive or imperative of have | [he would] have [come]; have [a look!] |
VHPAP | past participle of have | [he has] had [a cold] |
VHPAST | past tense of have | [he] had [seen] |
VHPRES | present tense of have | [he] has [been watching] |
VHPROG | ing-form of have | [he is] having [a good time] |
VI | verb infinitive or imperative | [he will] go, [he comes to] see; listen [!] |
VPAP | verb past participle | [he has] played, [it is] written |
VPAST | verb past tense | [I] went, [he] loved |
VPRES | verb present tense | [we] go, [she] loves |
VPROG | verb ing-form | [you are] going |
VS | verbal 's (short for is or has) | [he] 's [coming] |
WADV | interrogative adverb | when [did ...], where [did ...], why [did ...] |
WDET | interrogative determiner | which [book], whose [hat] |
WPRON | interrogative pronoun | who [is], what [is] |
French POS Tags – BT_FRENCH
Tag | Description | Example |
---|---|---|
ADJ2_INV | special number invariant adjective | gros |
ADJ2_PL | special plural adjective | petites, grands |
ADJ2_SG | special singular adjective | petit, grande |
ADJ_INV | number invariant adjective | heureux |
ADJ_PL | plural adjective | gentils, gentilles |
ADJ_SG | singular adjective | gentil, gentille |
ADV | adverb | finalement, aujourd'hui |
CM | comma | , |
COMME | reserved for the word comme | comme |
CONJQUE | reserved for the word que | que |
CONN | connector subordinate conjunction | si, quand |
COORD | coordinate conjunction | et, ou |
DET_PL | plural determiner | les |
DET_SG | singular determiner | le, la |
MISC | miscellaneous | miaou, afin |
NEG | negation particle | ne |
NOUN_INV | number invariant noun | taux |
NOUN_PL | plural noun | chiens, fourmis |
NOUN_SG | singular noun | chien, fourmi |
NUM | numeral | treize, 13, XIX |
PAP_INV | number invariant past participle | soumis |
PAP_PL | plural past participle | finis, finies |
PAP_SG | singular past participle | fini, finie |
PC | clitic pronoun | [donne-]le, [appelle-]moi, [donne-]lui |
PREP | preposition (other than à, au, de, du, des) | dans, après |
PREP_A | preposition "à" | à, au, aux |
PREP_DE | preposition "de" | de, d', du, des |
PRON | pronoun | il, elles, personne, rien |
PRON_P1P2 | 1st or 2nd person pronoun | je, tu, nous |
PUNCT | punctuation (other than comma) | : - |
RELPRO | relative/interrogative pronoun (except "que") | qui, quoi, lequel |
SENT | sentence final punctuation | . ! ? ; |
SYM | symbols | @ % |
VAUX_INF | infinitive auxiliary | être, avoir |
VAUX_P1P2 | 1st or 2nd person auxiliary verb, any tense | suis, as |
VAUX_P3PL | 3rd person plural auxiliary verb, any tense | seraient |
VAUX_P3SG | 3rd person singular auxiliary verb, any tense | aura |
VAUX_PAP | past participle auxiliary | eu, été |
VAUX_PRP | present participle auxiliary verb | ayant |
VERB_INF | infinitive verb | danser, finir |
VERB_P1P2 | 1st or 2nd person verb, any tense | danse, dansiez, dansais |
VERB_P3PL | 3rd person plural verb, any tense | danseront |
VERB_P3SG | 3rd person singular verb, any tense | danse, dansait |
VERB_PRP | present participle verb | dansant |
VOICILA | reserved for voici, voilà | voici, voilà |
German POS Tags – BT_GERMAN
Tag | Description | Example |
---|---|---|
ADJA | (positive) attributive adjective | [ein] schnelles [Auto] |
ADJA2 | comparative attributive adjective | [ein] schnelleres [Auto] |
ADJA3 | superlative attributive adjective | [das] schnellste [Auto] |
ADJD | (positive) predicative or adverbial adjective | [es ist] schnell, [es fährt] schnell |
ADJD2 | comparative predicative or adverbial adjective | [es ist] schneller, [es fährt] schneller |
ADJD3 | superlative predicative or adverbial adjective | [es ist am] schnellsten, [er meint daß er am] schnellsten [fährt]. |
ADV | non-adjectival adverb | oft, heute, bald, vielleicht |
ART | article | der [Mann], eine [Frau] |
CARD | cardinal | 1, eins, 1/8, 205 |
CIRCP | circumposition, right part | [um der Ehre] willen |
CM | comma | , |
COADV | adverbial conjunction | aber, doch, denn |
COALS | conjunction als | als |
COINF | infinitival conjunction | ohne [zu fragen], anstatt [anzurufen] |
COORD | coordinating conjunction | und, oder |
COP1 | coordination 1st part | entweder [... oder] |
COP2 | coordination 2nd part | [weder ...] noch |
COSUB | subordinating conjunction | weil, daß, ob [ich mitgehe] |
COWIE | conjunction wie | wie |
DATE | date | 27.12.2006 |
DEMADJ | demonstrative adjective | solche [Mühe] |
DEMDET | demonstrative determiner | diese [Leute] |
DEMINV | invariant demonstrative | solch [ein schönes Buch] |
DEMPRO | demonstrative pronoun | jener [sagte] |
FM | foreign word | article, communication |
INDADJ | indefinite adjective | [die] meisten [Leute], viele [Leute], [die] meisten [sind da], viele [sind da] |
INDDET | indefinite determiner | kein [Mensch] |
INDINV | invariant indefinite | manch [einer] |
INDPRO | indefinite pronoun | man [sagt] |
ITJ | interjection | oh, ach, weh, hurra |
NOUN | common noun, nominalized adjective, nominalized infinitive, or proper noun | Hut, Leute, [das] Gute, [das] Wollen, Peter, [die] Schweiz |
ORD | ordinal | 2., dritter |
PERSPRO | personal pronoun | ich, du, ihm, mich, uns |
POSDET | possessive determiner | mein [Haus] |
POSPRO | possessive pronoun | [das ist] meins |
POSTP | postposition | [des Geldes] wegen |
PREP | preposition | in, auf, wegen, mit |
PREPART | preposition article | im, ins, aufs |
PTKANT | sentential particle | ja, nein, bitte, danke |
PTKCOM | comparative particle | desto [schneller] |
PTKINF | particle: infinitival zu | [er wagt] zu [sagen] |
PTKNEG | particle: negation nicht | nicht |
PTKPOS | positive modifier | zu [schnell], allzu [schnell] |
PTKSUP | superlative modifier | am [schnellsten] |
PUNCT | other punctuation, bracket | ; : ( ) [ ] - " |
REFLPRO | reflexive sich | sich |
RELPRO | relative pronoun | [der Mann,] der [lacht] |
REZPRO | reciprocal einander | einander |
SENT | sentence final punctuation | . ? ! |
SYMB | symbols | @, %, x311 |
TRUNC | truncated word, (first part of a compound or verb prefix) | Ein- [und Ausgang], Kinder- [und Jugendheim], be- [und entladen] |
VAFIN | finite auxiliary | [er] ist, [sie] haben |
VAINF | auxiliary infinitive | [er will groß] sein |
VAPP | auxiliary past participle | [er ist groß] geworden |
VMFIN | finite modal | [er] kann, [er] mochte |
VMINF | modal infinitive | [er wird kommen] können |
VPREF | separated verbal prefix | [er kauft] ein, [sie sieht] zu |
VVFIN | finite verb form | [er] sagt |
VVINF | infinitive | [er will] sagen, einkaufen |
VVIZU | infinitive with incorporated zu | [um] einzukaufen |
VVPP | past participle | [er hat] gesagt |
WADV | interrogative adverb | wieso [kommt er?] |
WDET | interrogative determiner | welche [Nummer?] |
WINV | invariant interrogative | welch [ein ...] |
WPRO | interrogative pronoun | wer [ist da?] |
Greek POS Tags – BT_GREEK
Tag | Description | Example |
---|---|---|
ADJ | adjective | παιδικό |
ADV | adverb | ευχαρίστως |
ART | article | η, της |
CARD | cardinal | χίλια |
CLIT | clitic (pronoun) | τον, τού |
CM | comma | , |
COORD | coordinating conjunction | και |
COSUBJ | conjunction with subjunctive | αντί [να] |
CURR | currency | $ |
DIG | digits | 123 |
FM | foreign word | article |
FUT | future tense particle | θα |
INTJ | interjection | χμ |
ITEM | item | 1.2 |
NEG | negation particle | μη |
NOUN | common noun | βιβλίο |
ORD | ordinal | τρίτα |
PERS | personal pronoun | εγώ |
POSS | possessive pronoun | μας, τους |
PREP | preposition | άνευ |
PREPART | preposition with article | στο |
PRON | pronoun | αυτοί |
PRONREL | relative pronoun | οποίες |
PROP | proper noun | Μαρία |
PTCL | particle | ας |
PUNCT | punctuation (other than SENT and CM) | : - |
QUOTE | quotation marks | " |
SENT | sentence final punctuation | . ! ? |
SUBJ | subjunctive particle | να |
SUBORD | subordinating conjunction | πως |
SYMB | special symbol | *, % |
VIMP | verb (imperative) | γράψε |
VIND | verb (indicative) | γράφεις |
VINF | verb (infinitive) | γράφει |
VPP | participle | δικασμένο |
Hebrew POS Tags – MILA_HEBREW
Tag | Description | Example |
---|---|---|
adjective | adjective | נפלאי |
adverb | adverb | מאחוריהם |
conjunction | conjunction | אך, ועוד |
copula | copula | איננה, תהיו |
existential | existential | ההיה, יש |
indInf | independent infinitive | היוסדה |
interjection | interjection | אח, חבל"ז |
interrogative | interrogative | לאן |
modal | modal | צריכים |
negation | negation particle | לא |
noun | common noun | אורולוגיה |
numeral | numeral | 613, שמונה |
participle | participle | משרטטה |
passiveParticiple | passive participle | סגורה |
preposition | preposition or prepositional phrase | עם, בפניו |
pronoun | pronoun | זה/ |
properName | proper noun | שרון |
punctuation | punctuation | . |
quantifier | quantifier or determiner | רובן |
title | title or honorific | גב', זצ"ל |
unknown | unknown | ומלמם'מ |
verb | verb | לכתוב |
wPrefix | prefix | פסאודו |
Hungarian POS Tags – BT_HUNGARIAN
Tag | Description | Example |
---|---|---|
ADJ | (invariant) adjective | kis |
ADV | adverb | jól |
ADV_PART | adverbial participle | állva |
ART | article | az |
AUX | auxiliary | szabad |
CM | comma | , |
CONJ | conjunction | és |
DEICT_PRON_NOM | deictic pronoun: nominative | ez |
DEICT_PRON_ACC | deictic pronoun: accusative | ezt |
DEICT_PRON_CASE | deictic pronoun: other case | ebbe |
FUT_PART_NOM | future participle: nominative | teendõ |
FUT_PART_ACC | future participle: accusative | teendõt |
FUT_PART_CASE | future participle: other case | teendõvel |
GENE_PRON_NOM | general pronoun: nominative | minden |
GENE_PRON_ACC | general pronoun: accusative | mindent |
GENE_PRON_CASE | general pronoun: other case | mindenbe |
INDEF_PRON_NOM | indefinite pronoun: nominative | más |
INDEF_PRON_ACC | indefinite pronoun: accusative | mást |
INDEF_PRON_CASE | indefinite pronoun: other case | mással |
INF | infinitive (verb) | csinálni |
INTERJ | interjection | jaj |
LS | list item symbol | 1) |
MEA | measure, unit | km |
NADJ_NOM | noun or adjective: nominative | ifjú |
NADJ_ACC | noun or adjective: accusative | ifjút |
NADJ_CASE | noun or adjective: other case | ifjúra |
NOUN_NOM | noun: nominative | asztal |
NOUN_ACC | noun: accusative | asztalt |
NOUN_CASE | noun: other case | asztalra |
NUM_NOM | numeral: nominative | három |
NUM_ACC | numeral: accusative | hármat |
NUM_CASE | numeral: other case | háromra |
NUM_PRON_NOM | numeral pronoun: nominative | kevés |
NUM_PRON_ACC | numeral pronoun: accusative | keveset |
NUM_PRON_CASE | numeral pronoun: other case | kevéssel |
NUMBER | numerals (digits) | 1 |
ORD_NUMBER | ordinal | 1. |
PAST_PART_NOM | past participle: nominative | meghívott |
PAST_PART_ACC | past participle: accusative | meghívottat |
PAST_PART_CASE | past participle: other case | meghívottakkal |
PERS_PRON | personal pronoun | én |
POSTPOS | postposition | alatt |
PREFIX | prefix | át |
PRES_PART_NOM | present participle: nominative | csináló |
PRES_PART_ACC | present participle: accusative | csinálót |
PRES_PART_CASE | present participle: other case | csinálónak |
PRON_NOM | pronoun: nominative | milyen |
PRON_ACC | pronoun: accusative | milyet |
PRON_CASE | pronoun: other case | milyenre |
PROPN_NOM | proper noun: nominative | Budapest |
PROPN_ACC | proper noun: accusative | Budapestet |
PROPN_CASE | proper noun: other case | Budapestre |
PUNCT | punctuation (other than SENT or CM) | ( ) |
REFL_PRON_NOM | reflexive pronoun: nominative | magad |
REFL_PRON_ACC | reflexive pronoun: accusative | magát |
REFL_PRON_CASE | reflexive pronoun: other case | magadra |
REL_PRON_NOM | relative pronoun: nominative | aki |
REL_PRON_ACC | relative pronoun: accusative | akit |
REL_PRON_CASE | relative pronoun: other case | akire |
ROM_NUMBER | Roman numeral | IV |
SENT | sentence final punctuation | ., ; |
SPEC | special string (URL, email) | www.xzymn.com |
SUFF | suffix | -re |
TRUNC | compound part | asztal- |
VERB | verb | csinál |
Italian POS Tags – BT_ITALIAN
Tag | Description | Example |
---|---|---|
ADJEX | proclitic noun modifier ex | ex |
ADJPL | plural adjective | belle |
ADJSG | singular adjective | buono, narcisistico |
ADV | adverb | lentamente, già, poco |
CLIT | clitic pronoun or adverb | vi, ne, mi, ci |
CM | comma | , |
CONJ | conjunction | e, ed, e/o |
CONNADV | adverbial connector | quando, dove, come |
CONNCHE | relative pronoun or conjunction | ch', che |
CONNCHI | relative or interrogative pronoun chi | chi |
DEMPL | plural demonstrative | quelli |
DEMSG | singular demonstrative | ciò |
DETPL | plural determiner | tali, quei, questi |
DETSG | singular determiner | uno, questo, il |
DIG | digits | +5, iv, 23.05, 3,45, 1997 |
INTERJ | interjection | uhi, perdiana, eh |
ITEM | list item marker | A. |
LET | single letter | [di tipo] C |
NPL | plural noun | case |
NSG | singular noun | casa, balsamo |
ORDPL | plural ordinal | terzi |
ORDSG | singular ordinal | secondo |
POSSPL | plural possessive | mie, vostri, loro |
POSSSG | singular possessive | nostro, sua |
PRECLIT | pre-clitic | me [lo dai], te [la rubo] |
PRECONJ | pre-conjunction | dato [che] |
PREDET | pre-determiner | tutto [il giorno], tutti [i problemi] |
PREP | preposition | tra, di, con, su di |
PREPARTPL | preposition + plural article | sulle, sugl', pegli |
PREPARTSG | preposition + singular article | sullo, nella |
PREPREP | pre-preposition | prima [di], rispetto [a] |
PRON | pronoun (3rd person singular/plural) sé | [disgusto di] sé |
PRONINDPL | plural indefinite pronoun | entrambi, molte |
PRONINDSG | singular indefinite pronoun | troppa |
PRONINTPL | plural interrogative pronoun | quali, quanti |
PRONINTSG | singular interrogative pronoun | cos' |
PRONPL | plural personal pronoun | noi, loro |
PRONREL | invariant relative pronoun | cui |
PRONRELPL | plural relative pronoun | quali, quanti |
PRONRELSG | singular relative pronoun | quale |
PRONSG | singular personal pronoun | esso, io, tu, lei, lui |
PROP | proper noun | Bernardo, Monte Isola |
PUNCT | other punctuation | - ; |
QUANT | invariant quantifier | qualunque, qualsivoglia |
QUANTPL | plural quantifier, numbers | molti, troppe, tre |
QUANTSG | singular quantifier | niuna, nessun |
SENT | sentence final punctuation | . ! ? : |
VAUXF | finite auxiliary essere or avere | è, sarò, saranno, avrete |
VAUXGER | gerund auxiliary essere or avere | essendo, avendo |
VAUXGER_CLIT | gerund auxiliary + clitic | essendogli |
VAUXIMP | imperative auxiliary | sii, abbi |
VAUXIMP_CLIT | imperative auxiliary + clitic | siatene, abbiatemi |
VAUXINF | infinitive auxiliary essere/avere | esser, essere, aver, avere |
VAUXINF_CLIT | infinitive auxiliary essere/avere + clitic | esserle, averle |
VAUXPPPL | plural past participle auxiliary | stati/e, avuti/e |
VAUXPPPL_CLIT | plural past part. auxiliary + clitic | statine, avutiti |
VAUXPPSG | singular past participle auxiliary | stato/a, avuto/a |
VAUXPPSG_CLIT | singular past part. auxiliary + clitic | statone, avutavela |
VAUXPRPL | plural present participle auxiliary | essenti, aventi |
VAUXPRPL_CLIT | plural present participle auxiliary + clitic | aventile |
VAUXPRSG | singular present participle auxiliary | essente, avente |
VAUXPRSG_CLIT | singular present participle auxiliary + clitic | aventela |
VF | finite verb form | blatereremo, mangio |
VF_CLIT | finite verb + clitic | trattansi, leggevansi |
VGER | gerund | adducendo, intervistando |
VGER_CLIT | gerund + clitic | saziandole, appurandolo |
VIMP | imperative | pareggiamo, formulate |
VIMP_CLIT | imperative + clitic | impastategli, accoppiatevele |
VINF | verb infinitive | sciupare, trascinar |
VINF_CLIT | verb infinitive + clitic | spulciarsi, risucchiarsi |
VPPPL | plural past participle | riposti, offuscati |
VPPPL_CLIT | plural past participle + clitic | assestatici, ripostine |
VPPSG | singular past participle | sbudellata, chiesto |
VPPSG_CLIT | singular past participle + clitic | commossosi, ingranditomi |
VPRPL | plural present participle | meditanti, destreggianti |
VPRPL_CLIT | plural present participle + clitic | epurantile, andantivi |
VPRSG | singular present participle | meditante, destreggiante |
VPRSG_CLIT | singular present participle + clitic | epurantelo, andantevi |
Japanese POS Tags – BT_JAPANESE
Tag | Description | Example |
---|---|---|
AA | adnominal adjective | その[人], この[日], 同じ |
AJ | normal adjective | 美し, 嬉し, 易し |
AN | adjectival noun | きれい[だ], 静か[だ], 正確[だ] |
D | adverb | じっと, じろっと, ふと |
EOS | sentence-final punctuation | 。. |
FP | non-derivational prefix | 両[選手], 現[首相] |
FS | non-derivational suffix | [綺麗]な, [派手]だ |
HP | honorific prefix | お[風呂], ご[不在], ご[意見] |
HS | honorific suffix | [小泉]氏, [恵美]ちゃん, [伊藤]さん |
I | interjection | こんにちは, ほら, どっこいしょ |
J | conjunction | すなわち, なぜなら, そして |
NC | common noun | 公園, 電気, デジタルカメラ |
NE | noun before numerals | 約, 翌, 築, 乾元 |
NN | numeral | 3, 2, 五, 二百 |
NP | proper noun | 北海道, 斉藤 |
NR | pronoun | 私, あなた, これ |
NU | classifier | [100]メートル, [3]リットル |
O | others | BASIS |
PL | particle | [雨]が[降る], [そこ]に[座る], [私]は[一人] |
PUNCT | punctuation other than end of sentence | ,「」(); |
UNKNOWN | unknown | デパ[地下], ヴェロ |
V | verb | 書く, 食べます, 来た |
V1 | ichidan verb stem | 食べ[る], 集め[る], 起き[る] |
V5 | godan verb stem | 気負[う], 知り合[う], 行き交[う] |
VN | verbal noun | 議論[する], ドライブ[する], 旅行[する] |
VS | suru-verb | 馳せ参[じる], 相半ば[する] |
VX | irregular verb | 移り行[く], トラブ[る] |
WP | derivational prefix | チョー[綺麗], バカ[正直] |
WS | derivational suffix | [東京]都, [大阪]府, [白]ずくめ |
Tag |
---|
FOREIGN_GIVEN_NAME |
FOREIGN_PLACE |
FOREIGN_SURNAME |
GIVEN_NAME |
ORGANIZATION |
PERSON |
PLACE |
SURNAME |
Japanese POS Tags – BT_JAPANESE_RBLJE_2
Tag | Description | Example |
---|---|---|
AA | adnominal adjective | その[人], この[日], 同じ |
AJ | normal adjective | 美し, 嬉し, 易し |
AN | adjectival noun | きれい[だ], 静か[だ], 正確[だ] |
AUXVB | auxiliary verb | た, ない, らしい |
D | adverb | じっと, じろっと, ふと |
FS | non-derivational suffix | [綺麗]な, [派手]だ |
HS | honorific suffix | [小泉]氏, [恵美]ちゃん, [伊藤]さん |
I | interjection | こんにちは, ほら, どっこいしょ |
J | conjunction | すなわち, なぜなら, そして |
NC | common noun | 公園, 電気, デジタルカメラ |
NE | noun before numerals | 約, 翌, 築, 乾元 |
NN | numeral | 3, 2, 五, 二百 |
NP | proper noun | 北海道, 斉藤 |
NR | pronoun | 私, あなた, これ |
NU | classifier | [100]メートル, [3]リットル |
O | others | BASIS |
PL | particle | [雨]が[降る], [そこ]に[座る], [私]は[一人] |
PUNCT | punctuation | 。,「」(); |
UNKNOWN | unknown | デパ[地下], ヴェロ |
V | verb | 書く, 食べます, 来た |
WP | derivational prefix | チョー[綺麗], バカ[正直] |
WS | derivational suffix | [東京]都, [大阪]府, [白]ずくめ |
Korean POS Tags – BT_KOREAN
Tag | Description | Examples |
---|---|---|
ADC | conjunctive adverb | 그리고, 그러나, 및, 혹은 |
ADV | constituent or clausal adverb | 매우, 조용히, 제발, 만일 |
CO | copula | 이 |
DAN | configurative or demonstrative adnominal | 새, 헌, 그 |
EAN | adnominal ending | 는/ㄴ |
ECS | coordinate, subordinate, adverbial, complementizer ending | 고, 므로, 게, 다고, 라고 |
EFN | final ending | 는다/ㄴ다, 니, 는가, 는지, 어라/라, 자, 구나 |
ENM | nominal ending | 기, 음 |
EPF | pre-final ending (tense, honorific) | 었, 시, 겠 |
IJ | exclamation | 아 |
NFW | word written in foreign characters | Clinton, Computer |
NNC | common noun | 학교, 컴퓨터 |
NNU | ordinal or cardinal number | 하나, 첫째, 1, 세 |
NNX | dependent noun | 것, 등, 년, 달라, 적 |
NPN | personal or demonstrative pronoun | 그, 이것, 무엇 |
NPR | proper noun | 한국, 클린톤 |
PAD | adverbial postposition | 에서, 로 |
PAN | adnominal postposition | 의, 이라는 |
PAU | auxiliary postposition | 만, 도, 는, 마저 |
PCA | case postposition | 가/이, 을/를, 의, 야 |
PCJ | conjunctive postposition | 와/과, 하고 |
SCM | comma | , |
SFN | sentence ending marker | . ? ! |
SLQ | left quotation mark | ‘ ( “ { |
SRQ | right quotation mark | ’ ) ” } |
SSY | symbol | ... ; : - |
UNK | unknown | |
VJ | adjective | 예쁘, 다르 |
VV | verb | 가, 먹 |
VX | auxiliary predicate | 있, 하 |
XPF | prefix | 제 |
XSF | suffix | 님, 들, 적 |
XSJ | adjectivization suffix | 스럽, 답, 하 |
XSV | verbalization suffix | 하, 되, 시키 |
Language Neutral POS Tags – BT_LANGUAGE_NEUTRAL
Tag | Description | Example |
---|---|---|
ATMENTION | @mention | @basistechnology |
EMAIL | email address | email@example.com |
EMO | emoji or emoticon | :-) |
HASHTAG | hashtag | #BlindGuardian |
URL | URL | http://www.babelstreet.com/ |
Persian POS Tags – BT_PERSIAN
Tag | Description | Example |
---|---|---|
ADJ | adjective | بزرگ |
ADV | adverb | تقريبا |
CONJ | conjunction | یا |
DET | indefinite article/determiner | هر |
EOS | end of sentence indicator | . |
INT | interjection or exclamation | عجب |
N | noun | افزايش |
NON_FARSI | not Arabic script | a b c |
NPROP | proper noun | مائیکروسافت |
NUM | number | ده |
PART | particle | می |
PREP | preposition | به |
PRO | pronoun | اين |
PUNC | punctuation, other than end of sentence | , : “ ” |
UNK | unknown | نرژی |
VERB | verb | گفتم |
VINF | infinitive | خریدن |
Polish POS Tags – BT_POLISH
Tag | Description | Example |
---|---|---|
ADV | adverb: adjectival | szybko |
adverb: comparative adjectival | szybciej | |
adverb: superlative adjectival | najszybciej | |
adverb: non-adjectival | trochę, wczoraj | |
ADJ | adjective: attributive (postnominal) | [stopy] procentowe |
adjective: attributive (prenominal) | szybki [samochód] | |
adjective: predicative | [on jest] ogromny | |
adjective: comparative attributive | szybszy [samochód] | |
adjective: comparative predicative | [on jest] szybszy | |
adjective: superlative attributive | najszybszy [samochód] | |
adjective: superlative predicative | [on jest] najszybszy | |
CJ/AUX | conjunction with auxiliary być | [robi wszystko,] żebyśmy [przyszli] |
CM | comma | , |
CMPND | compound part | [ośrodek] naukowo-[badawczy] |
CONJ | conjunction | a, ale, gdy, i, lub |
DATE | date expression | 31.12.99 |
EXCL | interjection | aha, hej |
FRGN | foreign material | cogito, numerus |
NOUN | noun: common | reakcja, weksel |
noun: proper | Krzysztof, Francja | |
noun: nominalized adjective | chory, [pośpieszny z] Krakowa | |
NUM | numeral (cardinal) | 22; 10,25; 5-7; trzy |
ORD | numeral (ordinal) | 12. [maja], 2., 12go, 13go, 28go |
PHRAS | phraseology | [po] polsku, fiku-miku |
PPERS | personal pronoun | ja, on, ona, my, wy, mnie, tobie, jemu, nam, mi, go, nas, was |
PR/AUX | pronoun with auxiliary być | [co] wyście [zrobili] |
PREFL | reflexive pronoun | [nie może] sobie [przypomnieć], [zabierz to ze] sobą, [warto] sobie [zadać pytanie] |
PREL | relative pronoun | który [problem], jaki [problem], co, który [on widzi], jakie [ma muzeum] |
PREP | preposition | od [dzisiaj], na [rynku walutowym] |
PRON | pronoun: demonstrative | [w] tym [czasie] |
pronoun: indefinite | wszystkie [stopy procentowe], jakieś [nienaturalne rozmiary] | |
pronoun: possessive | nasi [dwaj bracia] | |
pronoun: interrogative | Jaki [masz samochód?] | |
PRTCL | particle | także, nie, tylko, już |
PT/AUX | particle with auxiliary być | gdzie [byliście] |
PUNCT | punctuation (other than CM or SENT) | ( ) [ ] " " - '' |
QVRB | quasi-verb | brak, szkoda |
SENT | sentence final punctuation | . ! ? ; |
SYMB | symbol | @ § |
TIME | time expression | 11:00 |
VAUX | auxiliary | być, zostać |
VFIN | finite verb form: present | [Agata] maluje [obraz] |
finite verb form: future | [Piotr będzie] malował [obraz] | |
VGER | gerund | [demonstrują] domagając [się zmian] |
VINF | infinitive | odrzucić, stawić [się] |
VMOD | modal | [wojna] może [trwać nawet rok] |
VPRT | verb participle: predicative | [wynik jest] przesądzony |
verb participle: passive | [postępowanie zostanie] zawieszone | |
verb participle: attributive | [zmiany] będące [wynikiem...] |
Portuguese POS Tags – BT_PORTUGUESE
Tag | Description | Example |
---|---|---|
ADJ | invariant adjective | [duas saias] cor-de-rosa |
ADJPL | plural adjective | [cidadãos] portugueses |
ADJSG | singular adjective | [continente] europeu |
ADV | adverb | directamente |
ADVCOMP | comparison adverb mais and menos | [um país] mais [livre] |
AUXBE | finite "be" (ser or estar) | é, são, estão |
AUXBEINF | infinitive "be" | ser, estar |
AUXBEINFPRON | infinitive "be" with clitic | sê-lo |
AUXBEPRON | finite "be" with clitic | é-lhe |
AUXHAV | finite "have" | tem, haverá |
AUXHAVINF | infinitive "have" (ter, haver) | ter, haver |
AUXHAVINFPRON | infinitive "have" with clitic | ter-se |
AUXHAVPRON | finite "have" with clitic | tinham-se |
CM | comma | , |
CONJ | (coordinating) conjunction | [por fax] ou [correio] |
CONJCOMP | comparison conjunction do que | [mais] do que [uma vez] |
CONJSUB | subordination conjunction | para que, se, que |
DEMPL | plural demonstrative | estas |
DEMSG | singular demonstrative | aquele |
DETINT | interrogative or exclamative que | [demostra a] que [ponto] |
DETINTPL | plural interrogative determiner | quantas [vezes] |
DETINTSG | singular interrogative determiner | qual [reação] |
DETPL | plural definite article | os [maiores aplausos] |
DETRELPL | plural relative determiner | ..., cujas [presações] |
DETRELSG | singular relative determiner | ..., cuja [veia poética] |
DETSG | singular definite article | o [service] |
DIG | digit | 123 |
GER | gerundive | examinando |
GERPRON | gerundive with clitic | deixando-a |
INF | verb infinitive | reunir, conservar |
INFPRON | infinitive with clitic | datar-se |
INTERJ | interjection | oh, aí, claro |
ITEM | list item marker | A. [Introdução] |
LETTER | isolated character | [da seleção] A |
NEG | negation | não, nunca |
NOUN | invariant common noun | caos |
NPL | plural common noun | serviços |
NPROP | proper noun | PS, Lisboa |
NSG | singular common noun | [esta] rede |
POSSPL | plural possessive | seus [investigadores] |
POSSSG | singular possessive | sua [sobrinha] |
PREP | preposition | para, de, com |
PREPADV | preposition + adverb | [venho] daqui |
PREPDEMPL | preposition + plural demonstrative | desses [recursos] |
PREPDEMSG | preposition + singular demonstrative | nesta [placa] |
PREPDETPL | preposition + plural determiner | dos [Grandes Bancos] |
PREPDETSG | preposition + singular determiner | na [construção] |
PREPPRON | preposition + pronoun | [atrás] dela |
PREPQUANTPL | preposition + plural quantifier | nuns [terrenos] |
PREPQUANTSG | preposition + singular quantifier | numa [nuvem] |
PREPREL | preposition + invariant relative pronoun | [nesta praia] aonde |
PREPRELPL | preposition + plural relative pronoun | [alunos] aos quais |
PREPRELSG | preposition + singular relative pronoun | [área] através do qual |
PRON | invariant pronoun | se, si |
PRONPL | plural pronoun | as, eles, os |
PRONSG | singular pronoun | a, ele, ninguém |
PRONREL | invariant relative pronoun | [um ortopedista] que |
PRONRELPL | plural relative pronoun | [as instalações] as quais |
PRONRELSG | singular relative pronoun | [o ensaio] o qual |
PUNCT | other punctuation | : ( ) ; |
QUANTPL | plural quantifier | quinze, alguns, tantos |
QUANTSG | singular quantifier | um, algum, qualquer |
SENT | sentence final punctuation | . ! ? |
SYM | symbols | @ % |
VERBF | finite verb form | corresponde |
VERBFPRON | finite verb form with clitic | deu-lhe |
VPP | past participle (also adjectival use) | penetrado, referida |
Russian POS Tags – BT_RUSSIAN
Tag | Description | Example |
---|---|---|
ADJ | adjective | красивая, зеленый, удобный, темный |
ADJ_CMP | adjective: comparative | красивее, зеленее, удобнее, темнее |
ADV | adverb | быстро, просто, легко, правильно |
ADV_CMP | adverb: comparative | быстрее, проще, легче, правильнее |
AMOUNT | currency + cardinal, percentages | $20.000, 10% |
CM | comma | , |
CONJ | conjunction | что, или, и, а |
DET | determiner | какой, некоторым [из вас], который[час] |
DIG | numerals (digits) | 1, 2000, 346 |
FRGN | foreign word | бутерброд, армия, сопрано |
IREL | relative/interrogative pronoun | кто [сделает это?] каков [результат?], сколько [стоит?], чей |
ITJ | interjection | увы, ура |
MISC | (miscellaneous) | АЛ345, чат, N8 |
NOUN | common noun: nominative case | страна |
common noun: accusative case | [любить] страну | |
common noun: dative case | [посвятить] стране | |
common noun: genitive case | [история] страны | |
common noun: instrumental case | [гордиться] страной | |
common noun: prepositional case | [говорить о] стране | |
NUM | numerals (spelled out) | шестьсот, десять, два |
ORD | ordinal | 12., 1.2.1., IX. |
PERS | personal pronoun | я, ты, они, мы |
PREP | preposition | в, на, из-под [земли], с [горы] |
PRONADV | pronominal adverb | как, там, зачем, никогда, когда-нибудь |
PRON | pronoun | все, тем, этим, себя |
PROP | proper noun | Россия, Арктика, Ивановых, Александра |
PTCL | particle | [но все] же, [постой]-ка, [ну]-ка |
PTCL_INT | introduction particle | вот [она], вон [там], пускай, неужели, ну |
PTCL_MOOD | mood marker | [если] бы, [что] ли, [так] бы [и сделали] |
PTCL_SENT | stand-alone particle | впрочем, однако |
PUNCT | punctuation (other than CM or SENT) | : ; " " ( ) |
SENT | sentence final punctuation | . ? ! |
SYMB | symbol | *, ~ |
VAUX | auxiliary verb | быть, [у меня] есть |
VFIN | finite verb | ходили, любила, сидит |
VGER | verb gerund | бывая, думая, засыпая |
VINF | verb infinitive | ходить, любить, сидеть |
VPRT | verb participle | зависящий [от родителей], сидящего [на стуле] |
Spanish POS Tags – BT_SPANISH
Tag | Description | Example |
---|---|---|
ADJ | invariant adjective | beige, mini |
ADJPL | plural adjective | bonitos, nacionales |
ADJSG | singular adjective | bonito, nacional |
ADV | adverb | siempre, directamente |
ADVADJ | adverb, modifying an adjective | muy [importante] |
ADVINT | interrogative adverb | adónde, cómo, cuándo |
ADVNEG | negation no | no |
ADVREL | relative adverb | cuanta, cuantos |
AUX | finite auxiliary ser or estar | es, fui, estaba |
AUXINF | infinitive ser, estar | estar, ser |
AUXINFCL | infinitive ser, estar with clitic | serme, estarlo |
CM | comma | , |
COMO | reserved for word como | como |
CONADV | adverbial conjunction | adonde, cuando |
CONJ | conjunction | y, o, si, porque, sin que |
DETPL | plural determiner | los, las, estas, tus |
DETQUANT | invariant quantifier | demás, más, menos |
DETQUANTPL | plural quantifier | unas, ambos, muchas |
DETQUANTSG | singular quantifier | un, una, ningún, poca |
DETSG | singular determiner | el, la, este, mi |
DIG | numerals (digits) | 123, XX |
HAB | finite auxiliary haber | han, hubo, hay |
HABINF | infinitive haber | haber |
HABINFCL | infinitive haber with clitic | haberle, habérseme |
INTERJ | interjection | ah, bravo, olé |
ITEM | list item marker | a. |
NOUN | invariant noun | bragazas, fénix |
NOUNPL | plural noun | aguas, vestidos |
NOUNSG | singular noun | agua, vestido |
NUM | numerals (spelled out) | once, tres, cuatrocientos |
PAPPL | past participle, plural | contenidos, hechas |
PAPSG | past participle, singular | privado, fundada |
PREDETPL | plural pre-determiner | todas [las], todos [los] |
PREDETSG | singular pre-determiner | toda [la], todo [el] |
PREP | preposition | en, de, con, para, dentro de |
PREPDET | preposition + determiner | al, del, dentro del |
PRON | pronoun | ellos, todos, nadie, yo |
PRONCLIT | clitic pronoun | le, la, te, me, os, nos |
PRONDEM | demonstrative pronoun | eso, esto, aquello |
PRONINT | interrogative pronoun | qué, quién, cuánto |
PRONPOS | possessive pronoun | (el) mío, (las) vuestras |
PRONREL | relative pronoun | (lo) cual, quien, cuyo |
PROP | proper noun | Pablo, Beralfier |
PUNCT | punctuation (other than CM or SENT) | ' ¡ ¿ : { |
QUE | reserved for word que | que |
SE | reserved for word se | se |
SENT | sentence final punctuation | . ? ; ! |
VERBFIN | finite verb form | tiene, pueda, dicte |
VERBIMP | verb imperative | dejad, oye |
VERBIMPCL | imperative with clitic | déjame, sígueme |
VERBINF | verb infinitive | evitar, tener, conducir |
VERBINFCL | infinitive with clitic | hacerse, suprimirlas |
VERBPRP | present participle | siendo, tocando |
VERBPRPCL | present participle with clitic | haciéndoles, tomándolas |
Urdu POS Tags – BT_URDU
Tag | Description | Example |
---|---|---|
ADJ | adjective | بہترین |
ADV | adverb | تاہم |
CONJ | conjunction | یا |
DET | indefinite article/determiner | ایک |
EOS | end of sentence indicator | . |
INT | interjection or exclamation | جئے |
N | noun | مہینہ |
NON_URDU | not Arabic script | a b c |
NPROP | proper noun | مائیکروسافٹ |
NUM | number | اٹھارہ |
PART | particle | غیر |
PREP | preposition | با |
PRO | pronoun | وہ |
PUNC | punctuation other than end of sentence | ‘ |
UNK | unknown | اپنیاں |
VERB | verb | کرتی |
Universal POS Tags – UPT16_V1
The universal tags are coarser than the language-specific tags, but enable tracking and comparison across languages.
To return universal POS tags in place of language-specific tags, use the Annotated Data Model (ADM) and BaseLinguisticsFactory to set BaseLinguisticsOption.universalPosTag to true. See Returning Universal POS Tags.
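Continuing the factory setup shown in the sketches earlier, the option is set like any other; option values are passed as strings. A hedged sketch:

```java
// Request universal POS tags in place of the language-specific tags.
factory.setOption(BaseLinguisticsOption.universalPosTag, "true");
Annotator annotator =
    factory.createSingleLanguageAnnotator(new EnumMap<>(BaseLinguisticsOption.class));
// Analyses now carry tags such as NOUN, VERB, or PUNCT from the UPT16_V1 set.
```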
Tag | Description |
---|---|
ADJ | adjective |
ADP | adposition |
ADV | adverb |
AUX | auxiliary verb |
CONJ | coordinating conjunction |
DET | determiner |
INTJ | interjection |
NOUN | noun |
NUM | numeral |
PART | particle |
PRON | pronoun |
PROPN | proper noun |
PUNCT | punctuation |
SCONJ | subordinating conjunction |
SYM | symbol |
VERB | verb |
X | other |
Options
General Options
The following options are described in more detail in Initial and Path Options.
If the option rootDirectory is specified, then the string ${rootDirectory} takes that value in the dictionaryDirectory, modelDirectory, and licensePath options.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
dictionaryDirectory | The path of the lemma and compound dictionary, if it exists. | Path (${rootDirectory}/dicts) | All |
language | The language to process by analyzers or tokenizers created by the factory. | Language code | All |
licensePath | The path of the RBL license file. | Path (${rootDirectory}/licenses/rlp-license.xml) | All |
licenseString | The XML license content; overrides licensePath. | String | All |
modelDirectory | The directory containing the model files. | Path (${rootDirectory}/models) | All |
rootDirectory | Sets the root directory. Also sets default values for the other required options (dictionaryDirectory, licensePath, modelDirectory). | Path | All |
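Putting the path options together, a typical setup looks like the following sketch (paths are placeholders):

```java
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
// Setting rootDirectory also fills in the defaults for dictionaryDirectory,
// modelDirectory, and licensePath listed in the table above.
factory.setOption(BaseLinguisticsOption.rootDirectory, "/path/to/rbl");
// Each default can still be overridden individually if needed:
factory.setOption(BaseLinguisticsOption.licensePath, "/path/to/licenses/rlp-license.xml");
```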
Tokenizer Options
The following options are described in more detail in Tokenizers.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
tokenizerType | Selects the tokenizer to use. | | All |
caseSensitive | Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Ukrainian |
| Specifies the language to use for script regions in a script other than that of the overall language. | Language code | Chinese, Japanese, Thai |
| Minimum length of sequential characters that are not in the primary script. If a non-primary-script region is shorter than this length and adjacent to a primary-script region, it is appended to the primary-script region. | Integer (10) | Chinese, Japanese, Thai |
| Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai |
nfkcNormalize | Turns on Unicode NFKC normalization before tokenization. | Boolean (false) | All |
query | Indicates the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All |
The following options are described in more detail in Structured Text.
The following options are described in more detail in Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs.
Option | Description | Default | Supported Languages |
---|---|---|---|
n/a[a] | Enables emoji tokenization | | All |
emoticons | Enables emoticon tokenization | false | All |
atMentions | Enables atMention tokenization | false | All |
hashtags | Enables hashtag tokenization | false | All |
emailAddresses | Enables email address tokenization | false | All |
urls | Enables URL tokenization | false | All |
[a] Emoji tokenization and POS tagging is always enabled and cannot be disabled. |
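Enabling social media tokens follows the same setOption pattern. Of the names above, atMentions also appears in the option index at the end of this guide; treat the others as assumptions by analogy and confirm them against the Javadoc. A sketch:

```java
// Sketch: turn on @mention and hashtag tokenization (both off by default).
factory.setOption(BaseLinguisticsOption.atMentions, "true");
factory.setOption(BaseLinguisticsOption.hashtags, "true"); // name assumed by analogy
```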
Analyzer Options
The following options are described in more detail in Analyzers.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
analysisCacheSize | Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
| Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
| A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All |
query | Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g., disable disambiguation). | Boolean (false) | All |
The following options are described in more detail in Compounds.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
decomposeCompounds | Indicates whether to decompose compounds. For Chinese and Japanese, compound decomposition is controlled by the corresponding option in Chinese and Japanese Options below. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish |
compoundComponentSurfaceForms | Indicates whether to return the surface forms of compound components. When this option is enabled and ADM results are returned, each compound component carries its surface form. This option has no effect when decomposeCompounds is false. | Boolean (false) | Dutch, German, Hungarian |
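When decomposition is on, ADM results expose the components of each compound analysis. A sketch for German, reusing the earlier factory setup; the sample word is illustrative, and getComponents() returning the component tokens is an assumption to verify against the ADM Javadoc:

```java
factory.setOption(BaseLinguisticsOption.language, "deu");
Annotator annotator =
    factory.createSingleLanguageAnnotator(new EnumMap<>(BaseLinguisticsOption.class));
AnnotatedText result = annotator.annotate("Hauptbahnhof");
for (Token token : result.getTokens()) {
    MorphoAnalysis analysis = token.getAnalyses().get(0);
    if (analysis.getComponents() != null) {
        for (Token component : analysis.getComponents()) {
            System.out.println(component.getText()); // e.g. "haupt" and "bahnhof"
        }
    }
}
```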
The following options are described in more detail in Disambiguation.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish |
alternativeEnglishDisambiguation | Enables faster part-of-speech disambiguation for English. | Boolean (false) | English |
alternativeGreekDisambiguation | Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek |
alternativeSpanishDisambiguation | Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish |
The following options are described in more detail in Returning Universal Part-of-Speech (POS) Tags.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
universalPosTag | Indicates whether POS tags should be converted to their universal versions. | Boolean (false) | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu. |
| URI of a POS tag map. | URI | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu. |
The following options are described in more detail in Contraction Splitting Rule File Format.
The following options are only available when using the ADM API.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
analyze | Enables analysis. If false, the annotator will only perform tokenization. | Boolean (true) | All |
| URI of a POS tag map file. | URI | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
Chinese and Japanese Options
The following options are described in more detail in Chinese and Japanese Lexical Tokenization.
Option | Description | Default value | Supported languages |
---|---|---|---|
| Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when | | Chinese |
| Indicates whether to provide consistent segmentation of embedded text not in the primary script. If false, then the setting of | | Chinese, Japanese |
| Indicates whether to decompose compounds. | | Chinese, Japanese |
| Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as being decomposable. If deep decompounding is enabled, the decomposable tokens are further decomposed into additional tokens. Has no effect when | | Chinese, Japanese |
| Indicates whether to favor words in the user dictionary during segmentation. | | Chinese, Japanese |
| Indicates whether to ignore whitespace separators when segmenting input text. If | | Japanese |
| Indicates whether to filter stop words out of the output. | | Chinese, Japanese |
| Indicates whether to join sequences of Katakana tokens adjacent to a middle dot token. | true | Japanese |
| Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. Has no effect when | 10 | Chinese, Japanese |
| Indicates whether to add parts of speech and secondary parts of speech to morphological analyses. | | Chinese, Japanese |
| Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when | | Japanese |
| Indicates whether to return numbers and counters as separate tokens. | | Japanese |
| Indicates whether to segment place names from their suffixes. | | Japanese |
| Indicates whether to treat whitespace as a number separator. Has no effect when | | Chinese |
| Indicates whether to treat whitespace as a morpheme delimiter. | | Chinese, Japanese |
The following options are described in more detail in Chinese and Japanese Readings.
Option | Description | Default value | Supported languages |
---|---|---|---|
generateAll | Indicates whether to return all the readings for a token. Has no effect when | | Chinese |
readingByCharacter | Indicates whether to skip directly to the fallback behavior of | | Chinese, Japanese |
readings | Indicates whether to add readings to morphological analyses. The annotator will try to add readings by whole words. If it cannot, it will concatenate the readings of individual characters. | | Chinese, Japanese |
| Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when | | Chinese, Japanese |
readingType | Sets the representation of Chinese readings. Possible values (case-insensitive) are: | | Chinese |
useVForUDiaeresis | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when | | Chinese |
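The reading options combine in the same way as the tokenization options; a brief hedged sketch for Chinese (the language code string is an assumption):

```java
// Hedged sketch: request readings for Chinese, falling back to
// character-by-character concatenation (option names from the table above).
factory.setOption(BaseLinguisticsOption.language, "zho");  // language code assumed
factory.setOption(BaseLinguisticsOption.readings, "true");
factory.setOption(BaseLinguisticsOption.readingByCharacter, "true");
```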
Hebrew Options
The following options are described in more detail in Hebrew Analyses.
Chinese Script Converter Options
The following options are described in more detail in Chinese Script Converter (CSC).
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
conversionLevel | Indicates the most complex conversion level to use. | | Chinese |
language | The language from which the | | Chinese, Simplified Chinese, Traditional Chinese |
targetLanguage | The language to which the | | Chinese, Simplified Chinese, Traditional Chinese |
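As an illustration of how these three settings fit together, here is a hypothetical Traditional-to-Simplified configuration. CscOption is a stand-in name for whichever enum hosts the CSC settings in your release, and the language codes and conversion level value are assumptions for illustration, not documented values:

```java
import java.util.EnumMap;

// Hypothetical sketch only: CscOption, the language codes, and the
// conversionLevel value below are placeholders, not the documented CSC API.
EnumMap<CscOption, String> cscOptions = new EnumMap<>(CscOption.class);
cscOptions.put(CscOption.language, "zht");            // from Traditional Chinese (assumed code)
cscOptions.put(CscOption.targetLanguage, "zhs");      // to Simplified Chinese (assumed code)
cscOptions.put(CscOption.conversionLevel, "lexemic"); // assumed value for the most complex level
```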
Lucene Options
The following options are described in more detail in Using RBL in Apache Lucene.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
addLemmaTokens | Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
addReadings | Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
identifyContractionComponents | Indicates whether the token filter should identify contraction components as contraction components rather than as lemmas. | Boolean (false) | All |
replaceTokensWithLemmas | Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled. | Boolean (false) | All |
replaceTokensWithNormalizations | Indicates whether the token filter should replace a surface form with its normalization. Normalization must be enabled. | Boolean (false) | All |
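To observe the effect of these flags, walk the token stream that the RBL-backed analyzer produces. The consumption loop below is standard Lucene; the analyzer variable stands for whatever RBL analyzer you have constructed, since its construction depends on your integration:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Standard Lucene consumption loop. With addLemmaTokens=true (the default),
// lemmas appear in the stream alongside the surface tokens.
try (TokenStream stream = analyzer.tokenStream("body", "The children sang")) {
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(term.toString());
    }
    stream.end();
}
```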
Option | Description | Type | Supported Languages |
---|---|---|---|
lemDictionaryPath | A list of paths to user lemma dictionaries. | List of Paths | Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai |
segDictionaryPath | A list of paths to user segmentation dictionaries. | List of Paths | All |
userDefinedDictionaryPath | A list of paths to user dictionaries. | List of Paths | All |
userDefinedReadingDictionaryPath | A list of paths to reading dictionaries. | List of Paths | Japanese |
Alphabetical List of Options
Index
A
- addLemmaTokens, Lucene Options
- addReadings, Lucene Options
- alternativeEnglishDisambiguation, Disambiguation, Analyzer Options
- alternativeGreekDisambiguation, Disambiguation, Analyzer Options
- alternativeSpanishDisambiguation, Disambiguation, Analyzer Options
- analysisCacheSize, Analyzers, Analyzer Options
- analyze, Analyzer Options
- atMentions, Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs, Tokenizer Options
B
- breakAtAlphaNumIntraWordPunct, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
C
- cacheSize, Analyzers, Analyzer Options
- caseSensitive, Tokenizers, Analyzers, Tokenizer Options, Analyzer Options
- compoundComponentSurfaceForms, Compounds, Analyzer Options
- consistentLatinSegmentation, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- conversionLevel, CSC Options, Chinese Script Converter Options
- customPosTagsUri, Returning Universal Part-of-Speech (POS) Tags, Analyzer Options
- customTokenizeContractionRulesUri, Splitting Contractions, Analyzer Options
D
- decomposeCompounds, Compounds, Chinese and Japanese Lexical Tokenization, Analyzer Options, Chinese and Japanese Options
- deepCompoundDecomposition, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- defaultTokenizationLanguage, Tokenizers, Tokenizer Options
- deliverExtendedTags, Analyzers, Analyzer Options
- dictionaryDirectory, Initial and Path Options, General Options
- disambiguate, Disambiguation, Analyzer Options
- disambiguatorType, Hebrew Disambiguator Types, Hebrew Options
E
- emailAddresses, Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs, Tokenizer Options
- emoticons, Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs, Tokenizer Options
F
- favorUserDictionary, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- fragmentBoundaryDelimiters, Structured Text, Tokenizer Options
- fragmentBoundaryDetection, Structured Text, Tokenizer Options
G
- generateAll, Chinese and Japanese Readings, Chinese and Japanese Options
- guessHebrewPrefixes, Hebrew Analyses, Hebrew Options
H
- hashtags, Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs, Tokenizer Options
I
- identifyContractionComponents, Lucene Options
- ignoreSeparators, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- ignoreStopwords, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- includeHebrewRoots, Hebrew Analyses, Hebrew Options
J
- joinKatakanaNextToMiddleDot, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
L
- language, Initial and Path Options, CSC Options, General Options, Chinese Script Converter Options
- lemDictionaryPath, Activating User Dictionaries in Lucene, Lucene Options
- licensePath, Initial and Path Options, General Options
- licenseString, Initial and Path Options, General Options
M
- maxTokensForShortLine, Structured Text, Tokenizer Options
- minLengthForScriptChange, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- minNonPrimaryScriptRegionLength, Tokenizers, Tokenizer Options
- modelDirectory, Initial and Path Options, General Options
N
- nfkcNormalize, Tokenizers, Tokenizer Options
- normalizationDictionaryPaths, Analyzers, Analyzer Options
P
- pos, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
R
- readingByCharacter, Chinese and Japanese Readings, Chinese and Japanese Options
- readings, Chinese and Japanese Readings, Chinese and Japanese Options
- readingType, Chinese and Japanese Readings, Chinese and Japanese Options
- replaceTokensWithLemmas, Lucene Options
- replaceTokensWithNormalizations, Lucene Options
- rootDirectory, Initial and Path Options, General Options
S
- segDictionaryPath, Activating User Dictionaries in Lucene, Lucene Options
- segmentNonJapanese, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- separateNumbersFromCounters, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- separatePlaceNameFromSuffix, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
T
- targetLanguage, CSC Options, Chinese Script Converter Options
- tokenizeContractions, Splitting Contractions, Analyzer Options
- tokenizeForScript, Tokenizers, Tokenizer Options
- tokenizerType, Tokenizers, Analyzers, Tokenizer Options
U
- universalPosTags, Returning Universal Part-of-Speech (POS) Tags, Analyzer Options
- urls, Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs, Tokenizer Options
- userDefinedDictionaryPath, Activating User Dictionaries in Lucene, Lucene Options
- userDefinedReadingDictionaryPath, Activating User Dictionaries in Lucene, Lucene Options
- useVForUDiaeresis, Chinese and Japanese Readings, Chinese and Japanese Options
W
- whiteSpaceIsNumberSep, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
- whitespaceTokenization, Chinese and Japanese Lexical Tokenization, Chinese and Japanese Options
[1] Apache Lucene™, Lucene™, Apache Solr™, and Solr™ are trademarks of the Apache Software Foundation. Elasticsearch™ is a trademark of Elasticsearch BV.
[2] Or for Japanese, where disambiguation is turned off by default.
[3] Essentially an ID number; see the ICU break rule documentation.
[4] These analyzers are compatible with the Chinese and Japanese language processors found in the legacy Rosette (C++) products.
[5] As distinguished from the Arabic-Indic numerals often used in Arabic script (٠, ١, ٢, ٣, ٤, ٥, ٦, ٧, ٨, ٩) or the Eastern Arabic-Indic numerals often used in Persian and Urdu Arabic script (۰, ۱, ۲, ۳, ۴, ۵, ۶, ۷, ۸, ۹).
Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs
RBL supports POS-tagging of emoji, emoticons, @mentions, email addresses, hashtags, and URLs in all supported languages.
Tokenization of emoji is always enabled. The other token types are disabled by default but can be enabled through the options listed below. When tokenization of a given type is disabled, the corresponding characters may be split into multiple tokens.
Option | Description | Default | Supported Languages |
---|---|---|---|
n/a[a] | Enables emoji tokenization | true | All |
emoticons | Enables emoticon tokenization | false | All |
atMentions | Enables @mention tokenization | false | All |
hashtags | Enables hashtag tokenization | false | All |
emailAddresses | Enables email address tokenization | false | All |
urls | Enables URL tokenization | false | All |
[a] Emoji tokenization and POS-tagging are always enabled and cannot be disabled.
Enum Classes:
BaseLinguisticsOption
TokenizerOption
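A hedged configuration sketch tying the table to those enums (the factory API shape is assumed, as in the JVM examples elsewhere in this manual):

```java
// Sketch: enable the optional social-media token types from the table above.
// Emoji tokenization needs no option; it is always on.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.language, "eng");
factory.setOption(BaseLinguisticsOption.emoticons, "true");
factory.setOption(BaseLinguisticsOption.atMentions, "true");
factory.setOption(BaseLinguisticsOption.hashtags, "true");
factory.setOption(BaseLinguisticsOption.emailAddresses, "true");
factory.setOption(BaseLinguisticsOption.urls, "true");
```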
Sample input and output
Type | Input | Tokenization (when option is enabled) | Tokenization (when option is disabled) | POS tag |
---|---|---|---|---|
emoji | | Tokenization of emoji cannot be disabled. | | EMO |
emoticon | :) | :) | : ) | EMO |
@mention | @basistechnology | @basistechnology | @ basistechnology | ATMENTION |
hashtag | #basistechnology | #basistechnology | # basistechnology | HASHTAG |
email address | info@basistech.com | info@basistech.com | info @ basistech.com | EMAIL |
URL | http://www.babelstreet.com | http://www.babelstreet.com | http : / / www.babelstreet.com | URL |
The tokenization when an option is disabled depends on the language and tokenizerType options. The samples provided here are for language=eng and tokenizerType=ICU.
Emoji & Emoticon Recognition
Emoji are defined by Unicode Technical Standard #51. In tokenizing emoji, RBL recognizes the emoji presentation selector (VS16; U+FE0F) and text presentation selector (VS15; U+FE0E), which indicate if the preceding character should be treated as emoji or text.
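In Java source, the two selectors are ordinary code points. A tiny illustration (U+2764 HEAVY BLACK HEART is just a convenient base character, not anything RBL-specific):

```java
// VS16 (U+FE0F) requests emoji presentation; VS15 (U+FE0E) requests text presentation.
String emojiPresentation = "\u2764\uFE0F"; // HEAVY BLACK HEART + VS16
String textPresentation  = "\u2764\uFE0E"; // HEAVY BLACK HEART + VS15
```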
Although RBL detects sideways, Western-style emoticons, it does not currently support Japanese-style emoticons, called kaomoji, such as (o^ ^o).
Emoji Normalization & Lemmatization
RBL normalizes emoji, placing the result into the lemma field. The simplest example is when an emoji presentation selector follows a character that is already an emoji. In this case, RBL simply removes the emoji presentation selector.
Lemmatization applies to an emoji character in multiple ways.
Emoji that depict people or body parts may be followed by an emoji modifier indicating skin tone. Lemmatization simply removes the skin-tone modifier from the emoji character, on the reasoning that skin tone is of secondary importance to the meaning of the emoji (see the sketch after the following table).
Surface form | Lemmatized form |
---|---|
(Boy + Medium Skin Tone) | (Boy) |
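To make the rule concrete, here is a toy sketch, not the RBL API, that strips the Unicode skin-tone modifiers (the Fitzpatrick modifiers, U+1F3FB through U+1F3FF) in the way the lemmatization described above does:

```java
// Toy illustration of the skin-tone rule; not the RBL API.
static String stripSkinTones(String emoji) {
    StringBuilder lemma = new StringBuilder();
    emoji.codePoints()
         .filter(cp -> cp < 0x1F3FB || cp > 0x1F3FF) // drop U+1F3FB..U+1F3FF
         .forEach(lemma::appendCodePoint);
    return lemma.toString();
}
// Example: stripSkinTones("👦🏽") returns "👦" (Boy + Medium Skin Tone → Boy).
```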
Emoji depicting people may be followed by an emoji component indicating hair color or style. Lemmatization removes the hair component from the emoji character.
Surface form | Lemmatized form |
---|---|
(Adult + ZWJ + Red Hair) | (Adult) |
Where a gender symbol has been added to create a gendered occupation emoji, lemmatization removes the gender symbol.
Surface form | Lemmatized form |
---|---|
(Police Officer + ZWJ + ♀ Female Sign + VS16) | (Police Officer) |
Finally, RBL can normalize non-fully-qualified emoji ZWJ sequences to fully-qualified emoji ZWJ sequences. In the above example, it is possible to omit the VS16 (though discouraged by Unicode): since Police Officer is an emoji, anything joined to it by a ZWJ is implicitly an emoji too. RBL adds the missing VS16.