Base Linguistics Elasticsearch Plugin
Introduction
Rosette Base Linguistics (RBL) provides a set of linguistic tools to prepare your data for analysis. Language-specific modules provide base forms (lemmas) of words, parts-of-speech tagging, compound components, normalized tokens, stems, and roots. RBL also includes a Chinese Script Converter (CSC) which converts tokens in Traditional Chinese text to Simplified Chinese and vice versa.
Using RBL
You can use RBL in your own JVM application, use its Apache Lucene-compatible API in a Lucene application, or integrate it directly with either Apache Solr or Elasticsearch.
JVM Applications
To integrate base linguistics functionality in your applications, RBL includes two sets of Java classes and interfaces:
ADM API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis as a set of annotations. This collection is called the Annotated Data Model (ADM) and is used in other Rosette tools, such as Rosette Language Identifier and Rosette Entity Extractor, as well as RBL. There are some advanced features which are only supported in the ADM API and not the classic API.
When using the ADM API, you create an annotator which includes both tokenizer and analyzer functions.
Classic API: A collection of Java classes and interfaces that generate and represent Rosette's linguistic analysis. It is analogous to the ADM API, except that it is not compatible with other Rosette products. It also supports streaming: a user can start processing a document before the entire document is available, and it can produce results for pieces of a document without holding the results for the entire document in memory at once.
When using the classic API, you create tokenizers and analyzers.
Lucene
In an Apache Lucene application, you use a Lucene analyzer which incorporates a Base Linguistics tokenizer and token filter to produce an enriched token stream for indexing documents and for queries.
Solr
With the Solr plugin, an Apache Solr search server uses RBL for both indexing documents and for queries.
Elasticsearch
Install the Elasticsearch plugin to use RBL for analysis, indexing, and queries.
Note
The Lucene, Solr, and Elasticsearch plugins use APIs based on the classic API. All options in the TokenizerOption and AnalyzerOption enums are available, along with some additional plugin-specific options.
Linguistic Objects
RBL performs multiple types of analysis. Depending on the language, one or more of the following may be identified in the input text:
- Lemma
- Part of Speech
- Normalized Token
- Compound Components
- Readings
- Stem
- Semitic Root
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object.
In the ADM API, use the BaseLinguisticsFactory to set the linguistic options and instantiate an Annotator, which annotates the input text. The ADM API creates an annotator for all linguistic objects, including tokens.
In the classic API, use the BaseLinguisticsFactory to configure and create tokenizers, analyzers, and CSC analyzers. The classic API creates separate tokenizers and analyzers.
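For example, a minimal ADM flow looks roughly like the following sketch. It reuses only calls shown elsewhere in this guide (createSingleLanguageAnnotator, annotate, getTokens), and assumes createSingleLanguageAnnotator returns an Annotator, as the CSC variant does; the language code and sample text are illustrative.

```java
// Sketch: create an ADM annotator and print each token it finds.
// Assumes rootDirectory points at a valid RBL installation with a
// license in the default location; "eng" and the input are examples.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.language, "eng");

EnumMap<BaseLinguisticsOption, String> options =
        new EnumMap<>(BaseLinguisticsOption.class);
Annotator annotator = factory.createSingleLanguageAnnotator(options);

AnnotatedText annotated = annotator.annotate("The dogs are barking.");
for (Token token : annotated.getTokens()) {
    System.out.println(token.getText());
}
```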
Classic API
When using the classic API, you instantiate separate factories for tokenizers and analyzers.
BaseLinguisticsFactory#createTokenizer produces a language-specific tokenizer that processes documents, producing a sequence of tokens. BaseLinguisticsFactory#createAnalyzer produces a language-specific analyzer that uses dictionaries and statistical analysis to add analysis objects to tokens.
If your application requires streaming, use this API. The Lucene, Solr, and Elasticsearch integrations use these methods.
For the complete API documentation, consult the Javadoc for BaseLinguisticsFactory.
Tokenizer
Use BaseLinguisticsFactory#createTokenizer to create a language-specific tokenizer that extracts tokens from a plain text source. Prior to using the factory to create a tokenizer, use the factory with BaseLinguisticsOption to define the root of your RBL installation, as illustrated in the following sample. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
The Tokenizer uses a word breaker to establish token boundaries and detect sentences. For each token, it also provides the offset, the length, and a tag. Some tokenizers calculate morphological analysis information as part of the tokenization process, filling in appropriate analysis entries in the token objects they return. For other languages, you use the analyzer described below to return analysis objects for each token.
Create a factory
```java
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
```
Set tokenization options
factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true");
Create the tokenizer
Tokenizer tokenizer = factory.createTokenizer();
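Combining the snippets above with the license tip, a complete classic-API tokenizer setup might look like the following sketch. The language option and license file name follow the analyzer example below; adjust both for your installation.

```java
// Sketch: full factory configuration before creating a tokenizer.
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
// Only needed when the license is not in ${rootDirectory}/licenses.
factory.setOption(BaseLinguisticsOption.licensePath,
        new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
factory.setOption(BaseLinguisticsOption.language, "eng");  // illustrative
factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true");
Tokenizer tokenizer = factory.createTokenizer();
```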
Analyzer
Use BaseLinguisticsFactory#createAnalyzer to create a language-specific analyzer. Prior to creating the analyzer, use the factory and BaseLinguisticsOption to define the RBL root, as illustrated in the sample below. See the Javadoc for other options you may set.
Tip
If the license is not in the default directory (${rootDirectory}/licenses), you need to pass in the licensePath.
```java
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
File rootPath = new File(rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath,
        new File(rootPath, "licenses/rlp-license.xml").getAbsolutePath());
Analyzer analyzer = factory.createAnalyzer();
```
Use the Analyzer to return an array of Analysis objects for each token.
Tokenizers
The tokenizer is a language-specific processor that evaluates documents and identifies the tokens. RBL supports tokenization and sentence boundaries for all languages. For many languages, you can choose the tokenizer by setting tokenizerType.
TokenizerType | Description | Supported Languages |
---|---|---|
ICU | Uses the ICU tokenizer | All, except for Chinese and Japanese |
FST | Uses the FST tokenizer | Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Polish, Portuguese, Romanian, Russian, Spanish |
SPACELESS_LEXICAL | Uses a lexicon and rules to tokenize input without spaces. Uses the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA). | Chinese, Japanese |
SPACELESS_STATISTICAL | Uses a statistical approach to tokenize input without spaces. | Chinese, Japanese, Korean, Thai |
DEFAULT | Selects the default tokenizer for each language: ICU for most languages, and the statistical tokenizer for Chinese, Japanese, and Thai. | All |
Note
When creating tokenizers and analyzers, the tokenizerType must be the same for both.
Tip
When using the SPACELESS_LEXICAL tokenizer, you must use the CLA/JLA dictionaries instead of the segmentation dictionary. The analysis dictionary is not intended to be used with the SPACELESS_LEXICAL tokenizer.
For most languages the default tokenizer is referred to as the ICU tokenizer. It implements standard Unicode guidelines for determining boundaries between sentences and for breaking each sentence into individual tokens. Many languages have an alternate tokenizer, the FST tokenizer, enabled by setting tokenizerType to FST. The FST tokenizer produces somewhat different sentence and token boundaries. For example, the FST tokenizer keeps hyphenated tokens together, while the ICU tokenizer breaks them into separate tokens. For applications that don't want tokens or lemmas that contain spaces, the ICU tokenizer provides the best accuracy. To determine which tokenizer is best for your use case, we recommend running each against a test dataset and reviewing the output.
For Chinese, Japanese, and Thai, the default tokenizer determines sentence boundaries, and then uses statistical models to segment each sentence into individual tokens. If Latin-script or other non-Chinese, non-Japanese, or non-Thai fragments greater than a certain length (defined by minNonPrimaryScriptRegionLength) are embedded in the Chinese, Japanese, or Thai text, then the tokenizer applies default Unicode tokenization to those fragments. If a non-primary script region is less than this length, and adjacent to a primary script region, it is appended to the primary script region.
To use the Chinese Language Analyzer (CLA) or Japanese Language Analyzer (JLA) tokenization algorithm, set tokenizerType to SPACELESS_LEXICAL. This disables post-tokenization analysis; an analyzer created with this option will leave its input tokens unchanged.
For all languages, the RBL tokenizer can apply Normalization Form KC (NFKC) as specified in Unicode Standard Annex #15 to normalize the tokens. This normalization includes normalizing a fullwidth numeral to a halfwidth numeral, a fullwidth Latin letter to a halfwidth Latin letter, and a halfwidth Katakana character to a fullwidth Katakana character. NFKC normalization is turned off by default. To turn it on, use the nfkcNormalize option with a tokenizerType of ICU. To apply NFKC for Chinese and Japanese, tokenizerType must be SPACELESS_STATISTICAL or DEFAULT.
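For example, a sketch of enabling NFKC for Japanese; option values are passed as strings, as in the other samples, and the string form of the tokenizer-type value is assumed to match the TokenizerType enum name.

```java
// Sketch: NFKC normalization for Japanese requires the statistical
// (or default) tokenizer; the enum-name string value is an assumption.
factory.setOption(BaseLinguisticsOption.language, "jpn");
factory.setOption(BaseLinguisticsOption.tokenizerType, "SPACELESS_STATISTICAL");
factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true");
```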
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
caseSensitive | Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Ukrainian |
 | Specifies the language to use for script regions, other than the script of the overall language. | Language code | Chinese, Japanese, Thai |
minNonPrimaryScriptRegionLength | Minimum length of sequential characters that are not in the primary script. If a non-primary script region is less than this length and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai |
nfkcNormalize | Turns on Unicode NFKC normalization before tokenization. | Boolean (false) | All |
query | Indicates the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All |
 | Indicates whether to use a different word-breaker for each script. If false, uses a script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai |
tokenizerType | Selects the tokenizer to use. | TokenizerType (DEFAULT) | All |
Enum Classes:
BaseLinguisticsOption
TokenizerOption
Structured Text
A document may contain tables and lists in addition to regular sentences. Structured text is composed of fragments, such as list items, table cells, and short lines of text. The tokenizer emits sentence offsets for each fragment it encounters.
Fragments are identified in part by detecting fragment delimiters. A delimiter is restricted to one character; the default delimiters are U+0009 (tab), U+000B (vertical tab), and U+000C (form feed). To modify the set of recognized delimiters, pass a string containing all desired delimiter values to the fragmentBoundaryDelimiters option. The string must include any default values you want to keep. Non-whitespace delimiters within a token are ignored.
The following rules determine where fragments are identified, in descending priority:
- Each line in a list is a fragment, where a list is defined as 3 or more lines containing the same punctuation mark within the first 5 characters of the line.
- A delimiter or three or more consecutive whitespace characters breaks a line into fragments.
- A short line is a fragment if it is preceded by another short line, preceded by a fragment, or if it's the first line of text. The length of a short line is configurable with the maxTokensForShortLine option; the default is 6 or fewer tokens.

Fragments always include trailing whitespace.
Example:
```java
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.language, "eng");
EnumMap<BaseLinguisticsOption, String> options = Maps.newEnumMap(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.fragmentBoundaryDelimiters, "|~");
options.put(BaseLinguisticsOption.maxTokensForShortLine, "5");
factory.createSingleLanguageAnnotator(options);
```
By default, fragment detection is enabled. Use the fragmentBoundaryDetection option to disable it.
Enum Classes:
BaseLinguisticsOption
TokenizerOption
Customizing the ICU Tokenizer
The ICU tokenizer is the default tokenizer for European languages. Its behavior is defined by a rule file. If the default behavior is not exactly what you need, RBL lets you supply custom rule files that determine the tokenizer's behavior. How to make these customizations is briefly outlined here. Be careful with any changes you make to tokenizer behavior; BasisTech does not support customizations made by the user.
BaseLinguisticsFactory has a method addCustomTokenizerRules which can be used to specify a custom rule file. RBLCmd also has the -ctr option to specify a path on the command line. Both mechanisms accept a case-sensitivity value (for -ctr, cs and ci mean case-sensitive and case-insensitive, respectively); this matters because a rule file is selected only when BaseLinguisticsOption.caseSensitive matches the value given for that rule file. Custom rule files are not cumulative, i.e. only one set of rules may be used at a time for any one combination of case sensitivity and language.
Note
BasisTech reserves the right to change the version of ICU used in RBL. Thus any rule file provided by BasisTech for a particular version of RBL may or may not work with newer versions.
Tokenization Rule File Format
A tokenization rule file is an ICU break rule file encoded in UTF-8. A custom file replaces BasisTech's tokenization rules, so a custom rule file should include all the rules for basic tokenization as well as the new custom rules. The default rule files that RBL uses can be obtained by contacting BasisTech support, or you can copy the rule file from ICU.
RBL also provides the ability to pass in a subrule file, for splitting tokens produced according to rules in the main file. The subrule file is a list of subrules, each of which is a number and a regex separated by a tab character. This number corresponds to the "rule status" of the main rule whose tokens the subrule splits. Each capturing group in the subrule regex corresponds to a token that the tokenizer will produce.
The rule file and the subrule file can be placed anywhere. In particular, they need not be placed anywhere within your RBL installation directory.
There is one BasisTech-specific extension, !!btinclude <filename>. This command tells the preprocessor to replace the !!btinclude line with the contents of the specified file. Relative paths are relative to the location of the file containing the !!btinclude line. Recursive inclusion is allowed.
Example
The ICU tokenizer does not normally tokenize with an eye to emoticons, but perhaps that is important to your use case. You could make a copy of the default rule file and add the following.
```
...
$Smiley = [\:=][)}\]];
!!forward;
$Smiley;
...
```
For the input =), the BasisTech default rules produce two tokens, = and ). With the custom rule, you would instead get back one token: =).
Unknown Language Tokenization
RBL provides basic tokenization support when the language is "Unknown" (xxx). The tokenizer uses generic rules to tokenize, such as whitespace and punctuation delimitation.
Supported features when the language is unknown (xxx):
- Tokenization
- Sentence breaking
- Identification of some common acronyms and abbreviations
- Segmentation user dictionaries

Using the language code xxx provides basic tokenization support for languages not supported by RBL.
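A sketch of requesting this behavior through the factory, reusing the option calls shown earlier:

```java
// Sketch: generic whitespace/punctuation tokenization for an
// unsupported language, selected with the special code "xxx".
factory.setOption(BaseLinguisticsOption.language, "xxx");
Tokenizer genericTokenizer = factory.createTokenizer();
```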
Analyzers
The analyzer is a language-specific processor that uses dictionaries and statistical analysis to add analysis objects to tokens.
To extend the coverage that RBL provides for each supported language you can create User Dictionaries. Segmentation user dictionaries are supported for all languages. Lemma user dictionaries are supported for Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
A stem is the substring of a word that remains after prefixes and suffixes are removed, while the lemma is the dictionary form of a word. RBL supports stems for Arabic, Finnish, Persian, and Urdu.
Semitic roots are generated for Arabic and Hebrew.
The option name for setting the analysis cache depends on the accepting factory. The option analysisCacheSize is a BaseLinguisticsOption, while cacheSize is an option for both AnalyzerOption and CSCAnalyzerOption. They all perform the same function.
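For example, with a BaseLinguisticsFactory the cache is sized like any other option (a sketch; the value is a string, and "0" turns caching off, per the table below):

```java
// Sketch: double the default analysis cache; use "0" to disable it.
factory.setOption(BaseLinguisticsOption.analysisCacheSize, "200000");
```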
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
analysisCacheSize (cacheSize in AnalyzerOption) | Maximum number of entries in the analysis cache. Larger values increase throughput, but use extra memory. If zero, caching is off. | Integer (100,000) | All |
caseSensitive | Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
 | Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
normalizationDictionaryPaths | A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All |
query | Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g. disable disambiguation). | Boolean (false) | All |
tokenizerType | Selects the tokenizer to use with this analyzer. | TokenizerType (DEFAULT) | All |
Note
When creating tokenizers and analyzers, the tokenizerType must be the same for both.
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
CSCAnalyzerOption (cacheSize only)
Lemma Lookup
For each token and normalized form in the token stream, the analyzer performs a dictionary lookup starting with any user dictionaries followed by the RBL dictionary. During lookup, RBL ignores the context in which the token or normalized form appears.
Once the analyzer has found one or more lemmas in a dictionary, it does not consult additional dictionaries. In other words, if two user dictionaries are specified, and the filter finds a lemma in the first dictionary, it does not consult the second user dictionary or the RBL dictionary.
Unless overridden by an analysis dictionary, the only lemmatization done in Chinese and Thai is number normalization. Other Chinese and Thai tokens' lemmas are equal to their surface forms.
There is no analysis dictionary available for Finnish, Pashto, or Urdu. All other languages are supported.
Guessing
No dictionary can ever be complete: new words are added to languages, and languages change and borrow. So, in general, analysis for each language includes some sort of guessing capability. The job of a guesser is to take a word and come up with some analysis of it; whatever facts RBL generates for a language are all possible outputs of its guesser.
In European languages, guessers deliver lemmas and parts of speech. In Korean, guessers provide morphemes, morpheme tags, compound components, and parts of speech.
Whitespace in Lemmas
By default, the analyzer returns any lemma that contains whitespace as multiple lemmas (each with no whitespace). To allow lemmas with whitespace (such as International Business Machines as a lemma for the token IBM) to be placed as such in the token stream, you can create a user analysis dictionary with an entry that defines the lemma. For example:
IBM International[^_]Business[^_]Machines[+PROP]
Compounds
The analyzer decomposes Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian, and Swedish compounds, returning the lemmas of each of the components.
The lemmas may differ from their surface form in the compound, such that the concatenation of the components is not the same as the original compound (or its lemma). Components are often connected by elements that are present only in the compound form.
For example, the German compound Eingangstüren (entry doors) is made up of two components, Eingang (entry) and Tür (door), and the connecting 's' is not present in the component list. For this input token, the RBL tokenizer and analyzer return the following entries:
- Original form: Eingangstüren
- Lemma for the compound: Eingangstür
- Component lemmas: Eingang, Tür
Other German examples include letter removal (Rennrad ⇒ rennen + Rad), vowel changes (Mängelliste ⇒ Mangel + Liste), and capitalization changes (Blaugrünalge ⇒ blau + grün + Alge).
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
 | Indicates whether to decompose compounds. | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish |
 | Indicates whether to return the surface forms of compound components when ADM results are returned. | Boolean (false) | Dutch, German, Hungarian |
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
Disambiguation
For some languages, the analyzer can disambiguate between multiple analysis objects and return the disambiguated analysis object. The disambiguate option enables the disambiguator. When true, the disambiguator determines the best analysis for each word given the context in which it appears.
When using an annotator, the disambiguated result is at the head of all possible analyses; the remainder of the list is ordered randomly. When using a tokenizer/analyzer, use the method getSelectedAnalysis to return the disambiguated result.
For all languages except Japanese, disambiguation is enabled by default. For performance reasons, disambiguation is disabled by default for Japanese when using the statistical model.
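A sketch of forcing disambiguation on for Japanese (the option name is from the table below; values are passed as strings, following the other samples):

```java
// Sketch: enable disambiguation for Japanese, where it is off by
// default with the statistical model.
factory.setOption(BaseLinguisticsOption.language, "jpn");
factory.setOption(BaseLinguisticsOption.disambiguate, "true");
Analyzer analyzer = factory.createAnalyzer();
```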
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
disambiguate | Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish |
 | Enables faster part-of-speech disambiguation for English. | Boolean (false) | English |
 | Enables faster part-of-speech disambiguation for Greek. | Boolean (false) | Greek |
 | Enables faster part-of-speech disambiguation for Spanish. | Boolean (false) | Spanish |
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
Part-of-Speech (POS) Tags
In RBL, each language has its own set of POS tags, and a few languages have multiple tag sets. Each tag set is identified by an identifier, which is a value of the TagSet enum. When RBL outputs a POS tag, it also lists the identifier for the tag set it came from. Output for a single language may contain POS tags from multiple tag sets, including the language-neutral set.
POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu.
POS Tag Map File Format
A POS tag map file is a YAML file encoded in UTF-8. It is a sequence of mapping rules.
A mapping rule is a sequence of two elements: the POS tag to be mapped and a sequence of submappings. Rules are checked in the order they appear in the rule file. A token which matches a rule is not checked against any further rules.
A submapping is a mapping with the keys m, s, and t. m is a Java regular expression. s is a surface form. m and s are optional: they can be omitted or null. t specifies the output POS tag to use when the following criteria are met:
- The input token's POS tag equals the POS tag to be mapped.
- m (if any) matches a substring of the input token's raw analysis.
- s (if any) equals the input token's surface form, compared case-insensitively.
Example
```
- - NUM_VOC
  - - { m: \+Total, t: PRON }
    - { s: moc, t: DET }
    - { s: oba, t: DET }
    - { t: NUM }
```
This rule maps tokens with BasisTech's NUM_VOC POS tag. If the input token's raw analysis matches the regular expression \+Total, the token becomes a PRON. Otherwise, if the token's surface form is moc or oba, the token becomes a DET. Otherwise, the token becomes a NUM.
Chinese and Japanese Lexical Tokenization
For Chinese and Japanese, in addition to the statistical model described above, RBL includes Chinese Language Analyzer (CLA) and Japanese Language Analyzer (JLA) modules, which are optimized for search. They are activated by setting tokenizerType to SPACELESS_LEXICAL.
Option | Description | Default value | Supported languages |
---|---|---|---|
 | Indicates whether to consider punctuation between alphanumeric characters as a break. | | Chinese |
 | Indicates whether to provide consistent segmentation of embedded text not in the primary script. | | Chinese, Japanese |
 | Indicates whether to decompose compounds. | | Chinese, Japanese |
 | Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as being decomposable. If deep decompounding is enabled, the decomposable tokens will be further decomposed into additional tokens. | | Chinese, Japanese |
favorUserDictionary | Indicates whether to favor words in the user dictionary during segmentation. | | Chinese, Japanese |
 | Indicates whether to ignore whitespace separators when segmenting input text. | | Japanese |
ignoreStopwords | Indicates whether to filter stop words out of the output. | | Chinese, Japanese |
 | Indicates whether to join sequences of Katakana tokens adjacent to a middle dot token. | true | Japanese |
 | Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. | 10 | Chinese, Japanese |
 | Indicates whether to add parts of speech and secondary parts of speech to morphological analyses. | | Chinese, Japanese |
 | Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. | | Japanese |
 | Indicates whether to return numbers and counters as separate tokens. | | Japanese |
 | Indicates whether to segment place names from their suffixes. | | Japanese |
 | Indicates whether to treat whitespace as a number separator. | | Chinese |
 | Indicates whether to treat whitespace as a morpheme delimiter. | | Chinese, Japanese |
Enum Classes:
BaseLinguisticsOption
TokenizerOption
Chinese and Japanese Readings
Option | Description | Default value | Supported languages |
---|---|---|---|
 | Indicates whether to return all the readings for a token. | | Chinese |
 | Indicates whether to skip directly to the fallback behavior of concatenating the readings of individual characters. | | Chinese, Japanese |
 | Indicates whether to add readings to morphological analyses. The annotator will try to add readings by whole words. If it cannot, it will concatenate the readings of individual characters. | | Chinese, Japanese |
 | Indicates whether to add a separator character between readings when concatenating readings by character. | | Chinese, Japanese |
 | Sets the representation of Chinese readings (values are case-insensitive). | | Chinese |
 | Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. | | Chinese |
Enum Classes:
BaseLinguisticsOption
TokenizerOption
Editing the stop words list
The ignoreStopwords option uses a stop words list to define stop words. The path to the stop words list is language-dependent: Chinese uses root/dicts/zho/cla/zh_stop.utf8 and Japanese uses root/dicts/jpn/jla/JP_stop.utf8.
You can add stop words to these files. When you edit one of these files, you must follow these rules:
- The file must be encoded in UTF-8.
- The file may include blank lines.
- Comment lines begin with #.
- Each non-blank, non-comment line represents exactly one lexeme (stop word).
Japanese Lemma Normalization
In Japanese, foreign and borrowed words may vary in their phonetic transcription to Katakana, and some words may be expressed with an older or a modern Kanji form. The Japanese lemma dictionary maps Katakana variants to a standard form and old Kanji forms to their modern forms. Examples:
Katakana Spelling Variant | Normalized Form |
---|---|
ヴァイオリン | バイオリン |
エクスポ | エキスポ |

Older Kanji Form | Normalized Form |
---|---|
渡邊 | 渡辺 |
松濤 | 松涛 |
大學 | 大学 |
You can include orthographic normalization in lemma user dictionaries for Japanese. This information can be accessed at runtime from the Analysis or MorphoAnalysis object.
Hebrew Analyses
The following analyzer options are available for Hebrew.
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
Hebrew Disambiguator Types
RBL includes multiple disambiguators for Hebrew. Set the value of the disambiguatorType option to select which type to use. The valid values for DisambiguatorType are:
- PERCEPTRON: a perceptron model
- DICTIONARY: a dictionary-based reranker
- DNN: a deep neural network. TensorFlow, which is not supported on all systems, must be installed. If DNN is selected and TensorFlow is not supported, RBL will throw a RosetteRuntimeException.
Enum Classes:
AnalyzerOption
BaseLinguisticsOption
Arabic, Persian, and Urdu Token Analysis
For Arabic, Persian (Western Persian and Afghan Persian), and Urdu, RBL may return multiple analyses for each token. Each analysis contains the normalized form of the token, a part-of-speech tag, and a stem. For Arabic, the analysis also includes a lemma and a Semitic root. For Persian, some analyses include a lemma.
This appendix provides information on token normalization and the generation of variant tokens. For Arabic, it also provides information on stems and Semitic roots.
Token normalization is performed in two stages:
Generic Arabic script normalization
Language-specific normalization
Generic Arabic Script Token Normalization
Generic Arabic script normalization includes the following:
The following diacritics are removed: dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
The following characters are removed: kashida, left-to-right marker, right-to-left marker, zero-width joiner, BOM, non-breaking space, soft hyphen, space.
Alef maksura is converted to yeh unless it is at the end of the word or followed by hamza.
All numbers are converted to Arabic numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
Thousand separators are removed, and the decimal separator is changed to a period (U+002E). The normalizer handles cases where ر (reh) is (incorrectly) used as the decimal separator.
Alef with hamza above: ٵ (U+0675), ٲ (U+0672), or ا (U+0627) combined with hamza above (U+0654) is converted to أ (U+0623).
Alef with madda above: ا (U+0627) combined with madda above (U+0653) is converted to آ (U+0622).
Alef with hamza below: ٳ (U+0673) or ا (U+0627) combined with hamza below (U+0655) is converted to إ (U+0625).
Misra sign to ain: ؏ (U+060F) is converted to ع (U+0639).
Swash kaf to kaf: ڪ (U+06AA) is converted to ک (U+06A9).
Heh: ە (U+06D5) is converted to ه (U+0647).
Yeh with hamza above: The following combinations are converted to ئ (U+0626).
ی (U+06CC) combined with hamza above (U+0654)
ى (U+0649) combined with hamza above (U+0654)
ي (U+064A) combined with hamza above (U+0654)
Waw with hamza above: و (U+0648) combined with hamza above (U+0654), ٷ (U+0677), or ٶ (U+0676) is converted to ؤ (U+0624).
Arabic Token Analysis
Token Normalization
For Arabic input, the following language-specific normalizations are performed on the output of the Arabic script normalization:
Zero-width non-joiner (U+200C) and superscript alef ٰ (U+0670) are removed.
Fathatan (U+064B) is removed.
Persian yeh (U+06CC) is normalized to yeh (U+064A) if it is initial or medial; if final, it is normalized to alef maksura (U+0649).
Persian kaf ک (U+06A9) is converted to ك (U+0643).
Heh ہ (U+06C1) or ھ (U+06BE) is converted to ه (U+0647).
Following morphological analysis, the normalizer does the following:
Alef wasla ٱ (U+0671) is replaced with plain alef ا (U+0627).
If a word starts with the incorrect form of an alef, the normalizer retrieves the correct form: plain alef ا (U+0627), alef with hamza above أ (U+0623), alef with hamza below إ (U+0625), or alef with madda above آ (U+0622).
Token Variants
The analyzer can generate a number of variant forms for each Arabic token to account for the orthographic irregularity seen in contemporary written Arabic. Each token variant is generated in normalized form.
If a token contains a word-final hamza preceded by yeh or alef maksura, then a variant is created that replaces these with hamza seated on yeh.
If a token contains waw followed by hamza on the line, a variant is created that replaces these with hamza seated on waw.
Variants are created where word-final heh is replaced by teh marbuta, and word-final alef maksura is replaced by yeh.
Stems and Semitic Roots
The stem returned is the normalized token with affixes (such as prepositions, conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.
In the process of stripping morphemes (affixes) from a token, the analyzer produces a stem, a lemma, and a Semitic root. Stems and lemmas result from stripping most of the inflectional morphemes, while Semitic roots result from stripping derivational morphemes.
Inflectional morphemes indicate plurality or verb tense. Different forms, such as singular and plural noun, or past and present verb tense share the same stem if the forms are regular. If some of the forms are irregular, they do not share the same stem, but do share the same lemma. Since stems and lemmas preserve the meaning of words, they are very useful in text retrieval and search in general.
Words that have a more distant linguistic relationship share the same Semitic root.
Examples. The singular form الكتابة (al-kitaaba, the writing) and plural form كتابات (kitaabaat, writings) share the same stem: كتاب (kitaab). On the other hand, كُتُب (kutub, books) is an irregular form and does not have the same stem as كِتَاب (kitaab, book). But both forms do share the same lemma, which is the singular form كِتَاب (kitaab). The words مكتبة (maktaba, library), المَكْتَب (al-maktab, the desk), كُتُب (kutub, books), and الكتابة (al-kitaaba, the writing) are related in the sense that a library contains books and desks, a desk is used to write on, and writings are often found in books. All of these words share the same Semitic root: كتب (ktb).
Persian Token Analysis
Persian Token Normalization
The following Persian-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B) and superscript alef (U+0670) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Arabic kaf ك (U+0643) is converted to Persian kaf ک (U+06A9).
Heh goal (U+06C1) or heh doachashmee (U+06BE) is converted to heh (U+0647).
Heh with hamza ۂ (U+06C2) is converted to ۀ (U+06C0).
Arabic yeh ي (U+064A) or ى (U+0649) is converted to Persian yeh ی (U+06CC).
Following morphological analysis:
Zero-width non-joiner (U+200C) and superscript alef (U+0670) are removed.
Token Variants
The analyzer can generate a variant form for some tokens to account for the orthographic irregularity seen in contemporary written Persian. Each variation is generated with the normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains a zero-width non-joiner (U+200C), a variant is generated without the zero-width non-joiner.
If a word ends in teh marbuta (U+0629), two variants are generated. The first replaces the teh marbuta with teh (U+062A); the second replaces the teh marbuta with heh (U+0647).
Stems and Lemmas
The Persian analyzer produces both stems and lemmas. A stem is the substring of a word that remains after all prefixes and suffixes are removed. A lemma is the dictionary form of a word. The lemma may differ from the stem if a word is irregular, or if a word contains regular transformations. The distinction between stems and lemmas is especially important for Persian verbs. The typical verb inflection table for Persian includes a past stem and a present stem that cannot be derived from each other.
Examples. The present subjunctive verb بگویم (beguyam, that I say) has the stem گوی (guy). The past tense verb گفتم (goftam, I said) has the stem گفت (goft). These two have different stems because the word-internal strings are different. They have the same lemma گفت (goft) because they are inflections of the same word.
Urdu Token Analysis
Token Normalization
The following Urdu-specific normalizations are performed on the output of the Arabic script normalization:
Fathatan (U+064B), zero-width non-joiner (U+200C), and jazm (U+06E1) are removed.
Alef أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
Kaf ك (U+0643) is converted to ک (U+06A9).
Heh with hamza ۀ (U+06C0) is converted to ۂ (U+06C2).
Yeh ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
Token Variants
The analyzer can generate a number of variant forms for each Urdu token to account for the orthographic irregularity seen in contemporary written Urdu. Each variation is generated with the normalized form.
If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with Persian yeh (U+06CC).
If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw with waw (U+0648).
If a word contains heh doachashmee (U+06BE), a variant is generated replacing the heh doachashmee with heh goal (U+06C1).
If a word ends with teh marbuta (U+0629), a variant is generated replacing the teh marbuta with heh goal (U+06C1).
User Dictionaries
User dictionaries are supplementary dictionaries that change the default linguistic analyses. These dictionaries can be static or dynamic.
Static dictionaries are compiled ahead of time and passed to a factory.
Dynamic dictionaries are created and configured at runtime. Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the factory becomes unused, the contents are lost.
In all dictionaries, the entries should be Form KC normalized. Japanese Katakana characters, for example, should be full width, and Latin characters, numerals, and punctuation should be half width. Analysis dictionaries can contain characters of any script, while for most consistent performance in Chinese, Japanese, and Thai, token dictionaries should only contain characters in the Hanzi (Kanji), Hiragana, Katakana, and Thai scripts.
In Chinese, Japanese, and Thai, text in foreign scripts (such as Latin script) in the input that equals or exceeds the length specified by minNonPrimaryScriptRegionLength (the default is 10) is passed to the standard tokenizer and not seen by a user segmentation dictionary.
Types of User Dictionaries
Analysis Dictionary: An analysis dictionary allows users to modify the analysis or add new variations of a word. The analysis associated with a word includes the default lemma as well as part-of-speech tag and additional characteristics for some languages. Use of these dictionaries is not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Segmentation Dictionary: A segmentation dictionary allows users to specify strings that are to be segmented as tokens.
Chinese and Japanese segmentation user dictionary entries may not contain the ideographic full stop.
Many-to-one Normalization Dictionaries: Users can implement a many-to-one normalization dictionary to map multiple spelling variants to a single normalized form.
CLA/JLA Dictionaries: The Chinese Language Analyzer and Japanese Language Analyzer both include the capability to create and use one or more segmentation (tokenization) user dictionaries for vocabulary specific to an industry or application. A common usage for both languages is to add new nouns like organizational and product names. These and existing nouns can have a compounding scheme specified if, for example, you wish to prevent an otherwise compound product name from being segmented as such. When the language is Japanese, you can also create user reading dictionaries with transcriptions rendered in Hiragana. The readings can override the readings returned from the JLA reading dictionary and override readings that are otherwise guessed from segmentation (tokenization) user dictionaries.
CSC Dictionary: Users can specify conversion for use with the Chinese Script Converter (CSC).
Tip
When using the SPACELESS_LEXICAL tokenizer, you must use the CLA/JLA dictionaries instead of the segmentation dictionary. The analysis dictionary is not intended to be used with the SPACELESS_LEXICAL tokenizer.
Prioritization of User Dictionaries
All static and dynamic user dictionaries, except for many-to-one normalization dictionaries, are consulted in reverse order of addition. In cases of conflicting information, dictionaries added later take priority over those added earlier. Once a token is found in a user dictionary, RBL stops and will consult neither the remaining user dictionaries nor the RBL dictionary.
Many-to-one normalization dictionaries are consulted in the following order:
All dynamic user dictionaries, in reverse order of addition.
Static dictionaries in the order that they appear in the option list for normalizationDictionaryPaths.
Example of non-many-to-one user dictionary priority:
1. User adds dynamic dictionary named dynDict1
2. User adds static dictionary named statDict2
3. User adds static dictionary named statDict3
4. User adds dynamic dictionary named dynDict4

Dictionaries are prioritized in the following order: dynDict4, statDict3, statDict2, dynDict1
Example of many-to-one normalization user dictionary priority:
1. User adds dynamic dictionary named dynDict1
2. User sets normalizationDictionaryPaths = "statDict2;statDict3"
3. User adds dynamic dictionary named dynDict4

Dictionaries are prioritized in the following order: dynDict4, dynDict1, statDict2, statDict3
The Chinese and Japanese language analyzers load all dictionaries with the user dictionaries loaded at the end of the list. To prioritize the user dictionaries and put them at the front of the list, guaranteeing that matches in the user dictionaries will be used, set the option favorUserDictionary to true.
Preparing the Source
The following formatting rules apply to user dictionary source files.
- The source file is UTF-8 encoded.
- The file may begin with a byte order mark (BOM).
- Each entry is a single line.
- Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
Dynamic User Dictionaries
A dynamic user dictionary allows users to add user dictionary values at runtime. Instead of creating and compiling the dictionary in advance, the values are added dynamically. Dynamic dictionaries are available for all types of user dictionaries, except the CSC dictionary.
The process for using dynamic dictionaries is the same for each dictionary type:
Create an empty dynamic dictionary for the dictionary type.
Use the appropriate add method to add entries to the dictionary.
Dynamic dictionaries use the same structure as the compiled user dictionaries, but instead of having a single tab-delimited string, they are composed of separate strings. As an example, let's look at a many-to-one normalization dictionary entry:
Static dictionary entry (values are separated by tabs):
norm1 var1 var2 var3
Dynamic dictionary entry:
dictionary.add("norm1", "var1", "var2", "var3");
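In Java, the flow is: obtain an empty dynamic dictionary from the factory, then populate it with add. In the sketch below, the creation method and dictionary type are hypothetical placeholders; consult the BaseLinguisticsFactory Javadoc for the real names for each dictionary type.

```java
// Sketch only: the factory method and DynamicDictionary type are
// hypothetical stand-ins; see the Javadoc for the actual API.
DynamicDictionary dictionary = factory.createDynamicNormalizationDictionary();
dictionary.add("norm1", "var1", "var2", "var3");
```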
Caution
Dynamic dictionaries are held completely in memory and state is not saved on disk. When the JVM exits, or the annotator, tokenizer, or analyzer becomes unused, the contents are lost.
Case
User dictionary lookups are case-sensitive. RBL provides an option, caseSensitive
, to control whether the analysis phase is case-sensitive.
- If caseSensitive is true (the default), the token itself is used to query the dictionary.
- If caseSensitive is false, the token is lowercased before consulting the dictionary. If the analysis is intended to be case-insensitive, the words in the user dictionary must all be in lowercase.
If you are using the BaseLinguisticsTokenFilterFactory, the value of AnalyzerOption.caseSensitive both turns on the corresponding analysis and associates the dictionary with that analysis.
For Danish, Norwegian, and Swedish, the provided dictionaries are lowercase and caseSensitive is automatically set to false.
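A sketch of configuring a case-insensitive analysis; the string value follows the pattern of the other samples:

```java
// Sketch: make analysis case-insensitive; user dictionary entries
// must then be all lowercase, as noted above.
factory.setOption(BaseLinguisticsOption.caseSensitive, "false");
```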
Valid Characters for Chinese and Japanese User Dictionary Entries
An entry in a Chinese or Japanese user dictionary must contain characters corresponding to the following Unicode code points, to valid surrogate pairs, or to letters or decimal digits in Latin script. In this listing, .. indicates an inclusive range of valid code points:
0022..007E, 00A2, 00A3, 00A5, 00A6, 00A9, 00AC, 00AF, 00B7, 0370..04FF, 2010..206F, 2160..217B, 2190..2193, 2200..22FF, 2460..24FF, 2502, 25A0..26FF, 2985, 2986, 3001..3007, 300C, 300D, 3012, 3020, 3031..3037, 3041..3094, 3099..309E, 30A1..30FE, 3200..33FF, 4E00..9FFF, D800..FA2D, FF00, FF02..FFEF
Compiling a User Dictionary
In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-user-dictionary) and a .bat file for Windows (rbl-build-user-dictionary.bat).
To compile a user dictionary, from the RBL root directory:
tools/bin/rbl-build-user-dictionary -type TYPE_ARGUMENT LANG INPUT_FILE OUTPUT_FILE
where TYPE_ARGUMENT is the dictionary type, LANG is the language code, INPUT_FILE is the pathname of the source file you have created, and OUTPUT_FILE is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-user-dictionary -type analysis jpn jpn_lemmadict.txt jpn_lemmadict.bin
Dictionary Type | TYPE_ARGUMENT |
---|---|
Analysis | analysis |
Segmentation | |
Many-to-one | |
CLA or JLA segmentation | |
JLA reading | |
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Segmentation Dictionaries
The format for a segmentation dictionary source file is very simple. Each word is written on its own line, and that word is guaranteed to be segmented as a single token when seen in the input text, regardless of context. Japanese example:
```
三菱UFJ銀行
酸素ボンベ
```
Class | Method | Task |
---|---|---|
 | | Adds a user segmentation dictionary for a given language. |
 | | Creates and loads a new dynamic segmentation dictionary. |
Analysis Dictionaries
Note
Analysis dictionaries are not supported for Arabic, Persian, Romanian, Turkish, and Urdu.
Each entry is a word, followed by a tab and an analysis. The analysis must end with a lemma and a part-of-speech (POS) tag.
word lemma[+POS]
For those languages for which RBL does not return POS tags, use DUMMY.
Variations. You can provide more than one analysis for a word or more than one version of a word for an analysis.
The following example includes two analyses for "telephone" (noun and verb), and two renditions of "dog" for the same analysis (noun).
```
telephone	telephone[+NOUN]
telephone	telephone[+VI]
dog	dog[+NOUN]
Dog	dog[+NOUN]
```
For some languages, the analysis may include special tags and additional information.
Contracted forms. For English, French, Italian, and Portuguese, ^= is a separator for a contraction or elision.
English example:
doesn't does[^=]not[+VDPRES]
Multi-Word Analysis. For English, Italian, Spanish, and Dutch, ^_ indicates a space.
English example:
IBM International[^_]Business[^_]Machines[+PROP]
Compound Boundary. For Danish, Dutch, Norwegian, German, and Swedish, ^# indicates the boundary between elements in a compound word. For Hungarian, the compound boundary tag is ^CB+.
German example:
heimatländern Heimat[^#]Land[+NOUN]
Compound Linking Element. For German, ^/ indicates a compound linking element. For Dutch, use ^//.
German example:
arbeitskreis Arbeit[^/]s[^#]Kreis[+NOUN]
Derivation Boundary or Separator for Clitics. For Italian, Portuguese, and Spanish, ^| indicates a derivation boundary or separator for clitics.
Spanish example with derivation boundary:
duramente duro[^|][+ADV]
Italian example with separator for clitics:
farti fare[^|]tu[+VINF_CLIT]
Japanese Readings and Normalized Forms. For Japanese, [^r] precedes a reading (there may be more than one), and [^n] precedes a normalization. For example:
```
行わ	行う[^r]オコナワ[+V]
tv	テレビ[^r]テレビ[^n]テレビ[+NC]
アキュムレータ	アキュムレーター[^r]アキュムレータ[^n]アキュムレーター[+NC]
```
Korean Analysis. A Korean analysis uses a different pattern than the analysis for other languages. The pattern for an analysis in a user Korean dictionary is as follows:
Token Mor1[/Tag1][^+]Mor2[/Tag2][^+]Mor3[/Tag3]
Where each MorN is a morpheme, consisting of one or more Korean characters, and TagN is the POS tag for that morpheme. [^+] indicates the boundary between morphemes.
Here's an example:
유전자이다 유전자[/NPR][^+]이[/CO][^+]다[/ECS]
If the analysis contains one noun morpheme, that morpheme is the lemma and the POS tag is the POS tag for that morpheme. If more than one of the morphemes are nouns, the lemma is the concatenation of those nouns (a compound). Example:
정보검색 정보[/NNC][^+]검색[/NNC]
Otherwise, the lemma is the first morpheme, and the POS tag is the POS tag associated with that morpheme.
You can override this algorithm for identifying the lemma and/or POS tag in a user dictionary entry by placing [^L]lemma and/or [^P][/Tag] at the end of the analysis. The lemma may or may not correspond to one of the morphemes in the analysis. For example:
유전자이다 유전자[/NNC][^+]이[/CO][^+]다[/ECS][^L]유전[^P][/NPR]
The KoreanAnalysis interface provides access to the morphemes and tags associated with a given token in either the standard Korean dictionary or a user Korean dictionary.
Class | Method | Task |
---|---|---|
 | | Adds a user analysis dictionary. |
 | | Adds a dynamic analysis dictionary. |
Many-to-one Normalization Dictionaries
A many-to-one normalization dictionary maps one or more variants to a normalized form. The first value on each line is the normalized form. The remainder of the entries on the line are the variants to be mapped to the normalized form. All values on the line are separated by tabs.
Example:
```
norm1	var1	var2
norm1	var3	var4	var5
```
Class | Method | Task |
---|---|---|
 | | Creates and loads a new dynamic normalization dictionary. |
Use the normalizationDictionaryPaths option to specify the static user normalization dictionaries.
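For example (a sketch; the dictionary paths are hypothetical placeholders):

```java
// Sketch: two compiled normalization dictionaries, listed in the
// order they should be consulted; the paths are examples only.
factory.setOption(BaseLinguisticsOption.normalizationDictionaryPaths,
        "/path/to/normDict1.bin;/path/to/normDict2.bin");
```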
CLA and JLA Dictionaries
The source file for a Chinese or Japanese user dictionary is UTF-8 encoded (see Valid Characters for Chinese and Japanese User Dictionary Entries). The file may begin with a byte order mark (BOM). Empty lines are ignored. A comment line begins with #. The first line of a Japanese dictionary may begin with !DICT_LABEL followed by a Tab and an arbitrary string that sets the dictionary's name, which is not currently used anywhere.
Each entry in the dictionary source file is a single line:
word Tab POS Tab DecompPattern Tab Reading1,Reading2,...
where word is the entry or surface form, POS is one of the user-dictionary part-of-speech tags listed below, DecompPattern is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition), and Reading1,... is a comma-delimited list of one or more transcriptions rendered in Hiragana or Katakana (only applicable to Japanese).
The decomposition pattern and readings are optional, but you must include a decomposition pattern if you include readings. In other words, you must include all elements to include the entry in a reading user dictionary, even though the reading user dictionary does not use the POS tag or decomposition pattern. To include an entry in a segmentation (tokenization) user dictionary, you only need POS tag and an optional decomposition pattern. Keep in mind that those entries that include all elements can be included in both a segmentation (tokenization) user dictionary and a reading user dictionary.
The tables below list the primary and secondary tags emitted when each user POS tag is specified.
User POS tag | Primary POS tag | Secondary POS tag |
---|---|---|
ABBREVIATION | NA | |
ADJECTIVE | A | |
ADVERB | D | |
AFFIX | X | |
CONJUNCTION | J | |
CONSTRUCTION | OC | |
DERIVATIONAL_AFFIX | W | |
DIRECTION_WORD | WL | |
FOREIGN_PERSON | NP | FOREIGN_PERSON |
IDIOM | E | |
INTERJECTION | I | |
MEASURE_WORD | NM | |
NON_DERIVATIONAL_AFFIX | F | |
NOUN | NC | |
NUMERAL | NN | |
ONOMATOPE | M | |
ORGANIZATION | NP | ORGANIZATION |
PARTICLE | PL | |
PERSON | NP | PERSON |
PLACE | NP | PLACE |
PREFIX | XP | |
PREPOSITION | PR | |
PRONOUN | NR | |
PROPER_NOUN | NP | |
PUNCTUATION | PUNCT | |
SUFFIX | XS | |
TEMPORAL_NOUN | NT | |
VERB | V | |
VERB_ELEMENT | WV |
User POS tag | Primary POS tag | Secondary POS Tag |
---|---|---|
AJ (adjective) | AJ | |
AN (adjectival noun) | AN | |
D (adverb) | D | |
FOREIGN_GIVEN_NAME | NP | FOREIGN_GIVEN_NAME |
FOREIGN_PLACE_NAME | NP | FOREIGN_PLACE |
FOREIGN_SURNAME | NP | FOREIGN_SURNAME |
GIVEN_NAME | NP | GIVEN_NAME |
HS (honorific suffix) | HS | |
NOUN | NC | |
ORGANIZATION | NP | ORGANIZATION |
PERSON | NP | PERSON |
PLACE | NP | PLACE |
PROPER_NOUN | NP | |
SURNAME | NP | SURNAME |
UNKNOWN | UNKNOWN | |
V1 (vowel-stem verb) | V1 | |
VN (verbal noun) | VN | |
VS (suru-verb) | VS | |
VX (irregular verb) | VX |
Examples (the last three entries include readings):
```
!DICT_LABEL	New Words 2014
デジタルカメラ	NOUN
デジカメ	NOUN	0
東京証券取引所	ORGANIZATION	2,2,3
狩野	SURNAME	0
安倍晋三	PERSON	2,2	あべしんぞう
麻垣康三	PERSON	2,2	あさがきこうぞう
商人	NOUN	0	しょうにん,あきんど
```
The POS and decomposition pattern can be in full-width numerals and Roman letters. For example:
東京証券取引所 organization 2,2,3
The "2,2,3" decomposition pattern instructs the tokenizer to decompose this compound entry into
東京 証券 取引所
Class | Method | Task |
---|---|---|
 | | Adds a user analysis dictionary. |
 | | Adds a dynamic analysis dictionary. |
Class | Method | Task |
---|---|---|
 | | Adds a JLA readings dictionary. |
 | | Creates and loads a new dynamic JLA readings dictionary. |
Chinese Script Converter (CSC)
Overview
There are two standard forms of written Chinese: Simplified Chinese (SC) and Traditional Chinese (TC). SC is used in Mainland China. TC is used in Taiwan, Hong Kong, and Macau.
Conversion from one script to another is a complex matter. The main problem of SC to TC conversion is that the mapping is one-to-many. For example, the simplified form 发 maps to either of the traditional forms 發 or 髮. Conversion must also deal with vocabulary differences and context-dependence.
The Chinese Script Converter converts text in simplified script to text in traditional script, or vice versa. The conversion can be on any of three levels, listed here from simplest to most complex:
Codepoint Conversion: Codepoint conversion uses a mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified form 头发 might be converted to a traditional form by first mapping 头 to 頭, and then 发 to either 髮 or 發. Using this approach, however, there is no recognition that 头发 is a word, so the choice could be 發, in which case the end result 頭發 is nonsense. On the other hand, the choice of 髮 leads to errors for other words. So while conversion mapping is straightforward, it is unreliable.
Orthographic Conversion: The second level of conversion is orthographic. This level relies upon identification of the words in the input text. Within each word, orthographic variants of each character may be reflected in the conversion. In the above example, 头发 is identified as a word and is converted to a traditional variant of the word, 頭髮. There is no basis for converting it to 頭發, because the conversion considers the word as a whole rather than as a collection of individual characters.
Lexemic Conversion: The third level of conversion is lexemic. This level also relies upon identification of words. But rather than converting a word to an orthographic variant, the aim here is to convert it to an entirely different word. For example, "computer" is usually 计算机 in SC but 電腦 in TC. Whereas codepoint conversion is strictly character-by-character and orthographic conversion is character-by-character within a word, lexemic conversion is word-by-word.
Note
If the converter cannot provide the level of conversion requested (lexemic or orthographic), the next simpler level of conversion is provided.
For example, if you ask for a lexemic conversion, and none is available for a given token, CSC provides the orthographic conversion unless it is not available, in which case CSC provides a codepoint conversion.
The Chinese input may contain a mixture of TC and SC, and even some non-Chinese text. The Chinese Script Converter converts to the target (SC or TC), leaving any tokens already in the target form and any non-Chinese text unchanged.
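The requested level is set with the conversionLevel option (see CSC Options below). Here is a minimal sketch, assuming a BaseLinguisticsFactory named factory with rootDirectory and licensePath already set, and assuming CSConversionLevel defines a lexemic member alongside the orthographic member shown later in this chapter:

```java
// Request lexemic conversion; for tokens with no lexemic mapping, RBL falls
// back to orthographic conversion, and then to codepoint conversion.
factory.setOption(BaseLinguisticsOption.conversionLevel, CSConversionLevel.lexemic.levelName());
CSCAnalyzer sc2tc = factory.createCSCAnalyzer(
        LanguageCode.SIMPLIFIED_CHINESE, LanguageCode.TRADITIONAL_CHINESE);
```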
CSC Options
Option | Description | Type (Default) | Supported Languages
---|---|---|---
conversionLevel | Indicates the most complex conversion level to use | CSConversionLevel | Chinese
language | The language from which the CSCAnalyzer converts | Language code | Chinese, Simplified Chinese, Traditional Chinese
targetLanguage | The language to which the CSCAnalyzer converts | Language code | Chinese, Simplified Chinese, Traditional Chinese
See Initial and Path Options for additional options.
Enum Classes:
BaseLinguisticsOption
CSCAnalyzerOption
Using CSC with the ADM API
This example uses the BaseLinguisticsFactory to create and use a CSC annotator.
1. Create a BaseLinguisticsFactory and set the required options.

```java
final BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
factory.setOption(BaseLinguisticsOption.conversionLevel, CSConversionLevel.orthographic.levelName());
factory.setOption(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
factory.setOption(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
```

2. Create an annotator to get translations.

```java
final EnumMap<BaseLinguisticsOption, String> options = new EnumMap<>(BaseLinguisticsOption.class);
options.put(BaseLinguisticsOption.language, LanguageCode.SIMPLIFIED_CHINESE.ISO639_3());
options.put(BaseLinguisticsOption.targetLanguage, LanguageCode.TRADITIONAL_CHINESE.ISO639_3());
final Annotator cscAnnotator = factory.createCSCAnnotator(options);
```

3. Annotate the input text for tokens and translations.

```java
final AnnotatedText annotatedText = cscAnnotator.annotate(inputText);
final Iterator<Token> tokenIterator = annotatedText.getTokens().iterator();
for (final String translation : annotatedText.getTranslatedTokens().get(0).getTranslations()) {
    final String originalToken = tokenIterator.hasNext() ? tokenIterator.next().getText() : "";
    outputData.format(OUTPUT_FORMAT, originalToken, translation);
}
```
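With the conversion level set to orthographic as above, a Simplified Chinese token such as 头发 is returned with its Traditional Chinese translation 頭髮 (see the Overview above).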
Using CSC with the Classic API
The RBL distribution includes a sample (CSCAnalyze) that you can compile and run with an ant build script.
In a Bash shell (Unix) or Command Prompt (Windows), navigate to rbl-je-<version>/samples/csc-analyze and call

ant run

The sample reads an input file in SC and prints each token with its TC conversion to standard out.
The following example tokenizes Chinese text and converts from TC to SC.
1. Set up a BaseLinguisticsFactory.

```java
BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
factory.setOption(BaseLinguisticsOption.rootDirectory, rootDirectory);
factory.setOption(BaseLinguisticsOption.licensePath, licensePath);
```

2. Use the BaseLinguisticsFactory to create a Tokenizer to tokenize Chinese text.

```java
Tokenizer tokenizer = factory.createTokenizer(new StringReader(tcInput), LanguageCode.CHINESE);
```

3. Use the BaseLinguisticsFactory to create a CSCAnalyzer to convert from TC to SC.

```java
CSCAnalyzer cscAnalyzer = factory.createCSCAnalyzer(LanguageCode.TRADITIONAL_CHINESE, LanguageCode.SIMPLIFIED_CHINESE);
```

4. Use the CSCAnalyzer to analyze each Token found by the Tokenizer.

```java
Token token;
while ((token = tokenizer.next()) != null) {
    String tokenIn = new String(token.getSurfaceChars(), token.getSurfaceStart(), token.getLength());
    System.out.println("Input: " + tokenIn);
    cscAnalyzer.analyze(token);
```

5. Get the conversion (SC or TC) from each Token and close the loop.

```java
    System.out.println("SC translation: " + token.getTranslation());
}
```
CSC User Dictionaries
CSC user dictionaries support orthographic and lexemic conversions between Simplified Chinese and Traditional Chinese. They are not used for codepoint conversion.
CSC user dictionaries follow the same format as other user dictionaries:
The source file is UTF-8 encoded.
The file may begin with a byte order mark (BOM).
Each entry is a single line.
Empty lines are ignored.
Once complete, the source file is compiled into a binary format for use in RBL.
Each entry contains two or three tab-delimited elements:
input_token orthographic_translation [lexemic_translation]
The input_token is the form you are converting from and the orthographic_translation and optional lexemic_translation are the form you are converting to.
Sample entries for a TC to SC user dictionary:
電腦	电脑	计算机
宇宙飛船	宇宙飞船
Compiling a CSC User Dictionary
In the tools/bin directory, RBL includes a shell script for Unix (rbl-build-csc-dictionary) and a .bat file for Windows (rbl-build-csc-dictionary.bat).
The script uses Java to compile the user dictionary. The operation is performed in memory, so you may require more than the default heap size. You can set the heap size with the JAVA_OPTS environment variable. For example, to provide 8 GB of heap, set JAVA_OPTS to -Xmx8g.
Unix shell:
export JAVA_OPTS=-Xmx8g
Windows command prompt:
set JAVA_OPTS=-Xmx8g
Compile the CSC user dictionary from the RBL root directory:
tools/bin/rbl-build-csc-dictionary INPUT_FILE OUTPUT_FILE
INPUT_FILE
is the pathname of the source file you have created, and OUTPUT_FILE
is the pathname of the binary compiled dictionary the tool creates. For example:
tools/bin/rbl-build-csc-dictionary my_tc2sc.txt my_tc2sc.bin
Class | Method | Task
---|---|---
 | | Add a user CSC dictionary for a given language
 | | Add a dynamic CSC dictionary
Language Codes
Canonical Language Code
RBL canonicalizes some language codes to other language codes. These canonicalization rules apply to all APIs.
LanguageCode.NORWEGIAN
is canonicalized to LanguageCode.NORWEGIAN_BOKMAL
immediately upon any input to the API, so the two cannot be distinguished. In particular, an Analyzer built from a factory configured to use Norwegian will report its language as Norwegian Bokmål instead.
Similarly, LanguageCode.SIMPLIFIED_CHINESE
and LanguageCode.TRADITIONAL_CHINESE
are canonicalized to LanguageCode.CHINESE
immediately. The one exception is that they are not canonicalized as inputs to or outputs from the Chinese Script Converter.
Those are the only language code canonicalizations. Although RBL internally treats Afghan Persian and Iranian Persian as Persian, they are not considered the same language. This makes it possible to configure different user dictionaries for each variety of Persian, even though they are otherwise processed identically.
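For example, a minimal sketch (assuming a BaseLinguisticsFactory named factory): the two calls below are equivalent, since "nor" is canonicalized on input.

```java
// Equivalent configurations: LanguageCode.NORWEGIAN ("nor") is canonicalized
// to LanguageCode.NORWEGIAN_BOKMAL ("nob") as soon as it enters the API.
factory.setOption(BaseLinguisticsOption.language, LanguageCode.NORWEGIAN.ISO639_3());
factory.setOption(BaseLinguisticsOption.language, LanguageCode.NORWEGIAN_BOKMAL.ISO639_3());
```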
ISO 639-3 Language Codes
RBL uses ISO 639-3 language codes to specify languages as strings. There are a few nonstandard language codes, as indicated. RBL also accepts 2-letter codes specified by the ISO-639-1 standard. See the Javadoc for more details on language codes.
Language | Code
---|---
Arabic | ara
Catalan | cat
Chinese | zho
Chinese (Simplified) | zhs[a]
Chinese (Traditional) | zht[a]
Czech | ces
Danish | dan
Dutch | nld
English | eng
Estonian | est
Finnish | fin
French | fra
German | deu
Greek | ell
Hebrew | heb
Hungarian | hun
Indonesian | ind
Italian | ita
Japanese | jpn
Korean | kor
Latvian | lav
Malay (Standard) | zsm
North Korean | qkp[a]
Norwegian | nor
Norwegian Bokmål | nob
Norwegian Nynorsk | nno
Pashto | pus
Persian | fas
Persian, Afghan | prs
Persian, Iranian | pes
Polish | pol
Portuguese | por
Romanian | ron
Russian | rus
Serbian | srp
Slovak | slk
South Korean | qkr[a]
Spanish | spa
Swedish | swe
Tagalog | tgl
Thai | tha
Turkish | tur
Ukrainian | ukr
Urdu | urd
Unknown | xxx[a]
[a] Not a real ISO 639-3 code.
Part-of-Speech Tags
Note
For a mapping of the part-of-speech tags that appear in this appendix to the tags used in the Penn Treebank Project, see the tab-delimited .csv files (one per language) distributed along with this Application Developer's Guide in the penn_treebank
subdirectory. The RBL Korean part-of-speech tags conform to the Penn Treebank standard.
In RBL, each language has its own set of POS tags and a few languages have multiple tag sets. Each tag set is identified by an identifier, which is a value of the TagSet
enum. When RBL outputs a POS tag, it also lists the identifier for the tag set it came from. Output from a single language may contain POS tags from multiple tag sets, including the language-neutral set.
The header of each table lists the tag set identifier. For example, the English tag set is identified as BT_ENGLISH
.
For Chinese and Japanese using statistical models for tokenization (tag sets BT_CHINESE_RBLJE_2
and BT_JAPANESE_RBLJE_2
), if the analyzer cannot find the lemma for a word in its analysis dictionary, it returns GUESS as the part-of-speech tag. GUESS is not a real part-of-speech tag and does not appear in the tables in this appendix.
Arabic POS Tags – BT_ARABIC
Tag | Description | Example |
---|---|---|
ABBREV | abbreviation | ا ف ب |
ADJ | adjective | عَرَبِي اَلأَمْرِيكِيّ |
ADV | adverb | هُنَاكَ ,ثُمَّ |
CONJ | conjunction | وَ |
CV | verb (imperative) | أَضِفْ |
DEM_PRON | demonstrative pronoun | هٰذَا |
DET | determiner | لِل |
EOS | end of sentence | ! ؟ . |
EXCEPT_PART | exception particle | إلا |
FOCUS_PART | focus particle | أما |
FUT_PART | future particle | سَوْفَ |
INTERJ | interjection | آه |
INTERROG_PART | interrogative particle | هَلْ |
IV | verb (imperfect) | يَكْتُبُ ,يَأْكُلُ |
IV_PASS | verb (passive imperfect) | يُضَافُ، يُشَارُ |
NEG_PART | negative particle | لَن |
NON_ARABIC | not Arabic script | a b c |
NOUN | noun | طَائِرْ، كُمْبِيُوتَرْ، بَيْتْ |
NOUN_PROP | proper noun | طَوُنِي، مُحَمَّدْ |
NO_FUNC | unknown part of speech | |
NUM | numbers (Arabic-Indic numbers, Latin, and text-based cardinal) | أَرْبَعَة عَشَرْ، ١٤، 14 |
PART | particle | أَيَّتُهَا، إِيَّاهُ |
PREP | preposition | أَمَامْ، فِي |
PRONOUN | pronoun | هُوَ |
PUNC | punctuation | () ،.: |
PV | perfective verb | كَانَت، قَالَ |
PV_PASS | passive perfective verb | أُعْتَبَر |
RC_PART | resultative clause particle | فَلَمَا |
REL_ADV | relative adverb | حَيْثُ |
REL_PRON | relative pronoun | اَلَّذِي، اَللَّذَانِ |
SUB_CONJ | subordinating conjunction | إِذَا، إِذ |
VERB_PART | verbal particle | لَقَدْ |
Chinese POS Tags – Simplified and Traditional – BT_CHINESE
Tag | Description | Simplified Chinese | Traditional Chinese |
---|---|---|---|
A | adjective | 可爱 | 可愛 |
D | adverb | 必定 | 必定 |
E | idiom/phrase | 胸有成竹 | 胸有成竹 |
EOS | sentence final punctuation | 。 | 。 |
F | non-derivational affix | 鸳 | 鴛 |
FOREIGN | non-Chinese | c123 | c123 |
I | interjection | 吧 | 吧 |
J | conjunction | 但是 | 但是 |
M | onomatope | 丁丁 | 丁丁 |
NA | abbreviation | 日 | 日 |
NC | common noun | 水果 | 水果 |
NM | measure word | 个 | 個 |
NN | numeral | 3, 2, 一 | 3, 2, 一 |
NP | proper noun | 英国 | 英國 |
NR | pronoun | 我 | 我 |
NT | temporal noun | 一月 | 一月 |
OC | construction | 越~越~ | 越~越~ |
PL | particle | 之 | 之 |
PR | preposition | 除了 | 除了 |
PUNCT | non-sentence-final punctuation | , 「」(); | , 《》() |
U | unknown | ||
V | verb | 跳舞 | 跳舞 |
W | derivational suffix | 家 | 家 |
WL | direction word | 下 | 下 |
WV | word element - verb | 以 | 以 |
X | generic affix | 老 | 老 |
XP | generic prefix | 可 | 可 |
XS | generic suffix | 员 | 員 |
Tag |
---|
FOREIGN_PERSON |
ORGANIZATION |
PERSON |
PLACE |
Chinese POS Tags – Simplified and Traditional – BT_CHINESE_RBLJE_2
Tag | Description | Simplified Chinese | Traditional Chinese |
---|---|---|---|
A | adjective | 可爱 | 可愛 |
D | adverb | 必定 | 必定 |
E | idiom/phrase | 胸有成竹 | 胸有成竹 |
EOS | sentence final punctuation | 。 | 。 |
F | non-derivational affix | 鸳 | 鴛 |
I | interjection | 吧 | 吧 |
J | conjunction | 但是 | 但是 |
M | onomatope | 丁丁 | 丁丁 |
NA | abbreviation | 日 | 日 |
NC | common noun | 水果 | 水果 |
NM | measure word | 个 | 個 |
NN | numeral | 3, 2, 一 | 3, 2, 一 |
NP | proper noun | 英国 | 英國 |
NR | pronoun | 我 | 我 |
NT | temporal noun | 一月 | 一月 |
OC | construction | 越~越~ | 越~越~ |
PL | particle | 之 | 之 |
PR | preposition | 除了 | 除了 |
PUNCT | non-sentence-final punctuation | , 「」(); | , 《》() |
U | unknown | ||
V | verb | 跳舞 | 跳舞 |
W | derivational suffix | 家 | 家 |
WL | direction word | 下 | 下 |
WV | word element - verb | 以 | 以 |
X | generic affix | 老 | 老 |
XP | generic prefix | 可 | 可 |
XS | generic suffix | 员 | 員 |
Czech POS Tags – BT_CZECH
Tag | Description | Example |
---|---|---|
ADJ | adjective: nominative | [vál] silný [vítr] |
adjective: genitive | [k uvedení] zahradní [slavnosti] | |
adjective: dative | [k] veselým [lidem] | |
adjective: accusative | [jak zdolat] ekonomické [starosti] [vychutná] jeho [radost] | |
adjective: instrumental | první bushovou [zastávkou] | |
adjective: locative | [na] druhém [okraji silnice] | |
adjective: vocative | ty mladý [muž] | |
ordinal | [obsadil] 12. [místo] | |
ADV | adverb | velmi, nejvíce, daleko, jasno |
CLIT | clitic | bych, by, bychom, byste |
CM | comma | , |
CONJ | conjunction | a, i, ale, aby, nebo, však, protože |
DATE | date | 11. 12. 1996, 11. 12. |
INTJ | interjection | ehm, ach |
NOUN | noun: nominative | [je to] omyl |
noun: genitive | [krize] autority státu | |
noun: dative | [dostala se k] moci | |
noun: accusative | [názory na] privatizaci | |
noun: instrumental | [Marx s naprostou] jistotou | |
noun: locative | [ve vlastním] zájmu | |
noun: vocative | [ty] parlamente | |
abbreviation, initial, unit | v., mudr., km/h, m3 | |
NUM_ACC | numeral: accusative | [máme jen] jednu [velmoc] |
NUM_DAT | numeral: dative | [jsme povinni] mnoha [lidem] |
NUM_DIG | digit | 123, 2:0, 1:23:56, -8.9, -8 909 |
NUM_GEN | numeral: genitive | [po dobu] dvou [let] |
NUM_INS | numeral: instrumental | [s] padesáti [hokejisty] |
NUM_LOC | numeral: locative | [po] dvou [závodech] |
NUM_NOM | numeral: nominative | oba [kluby tají, kde] |
NUM_ROM | Roman numeral | V |
NUM_VOC | numeral: vocative | [vy] dva [, zastavte] |
PREP | preposition | dle [tebe], ke [stolu], do [roku], se [mnou] |
PREPPRON | prepositional pronoun | nač |
PRON_ACC | pronoun: accusative | [nikdo] je [nevyhodí] |
PRON_DAT | pronoun: dative | [kdy je] mu [vytýkána] |
PRON_GEN | pronoun: genitive | [u] nás [i kolem] nás |
PRON_INS | pronoun: instrumental | [mezi] nimi [být] |
PRON_LOC | pronoun: locative | [aby na] ní [stál] |
PRON_NOM | pronoun: nominative | já [jsem jedinou] |
PRON_VOC | pronoun: vocative | vy [dva, zastavte ] |
PROP | proper noun | Pavel, Tigrid, Jacques, Rupnik, Evropy |
PTCL | particle | ano, ne |
PUNCT | punctuation | ( ) { } [ ] ; |
REFL_ACC | reflexive pronoun: accusative | se |
REFL_DAT | reflexive pronoun: dative | si |
REFL_GEN | reflexive pronoun: genitive | sebe |
REFL_INS | reflexive pronoun: instrumental | sebou |
REFL_LOC | reflexive pronoun: locative | sobě |
SENT | sentence final punctuation | . ! ? ... |
VERB_IMP | verb: imperative | odstupte |
VERB_INF | verb: infinitive | [mohli si] koupit |
VERB_PAP | verb: past participle | mohli [si koupit] |
VERB_PRI | verb: present indicative | [trochu nás] mrzí |
VERB_TRA | verb: transgressive | maje [ode mne] |
Dutch POS Tags – BT_DUTCH
Tag | Description | Example |
---|---|---|
ADJA | attributive adjective | [een] snelle [auto] |
ADJD | adverbial or predicative adjective | [hij rijdt] snel |
ADV | non-adjectival adverb | [hij rijdt] vaak |
ART | article | een [bus], het [busje] |
CARD | cardinals | vijf |
CIRCP | right part of circumposition | [hij viel van dit dak] af |
CM | comma | , |
CMPDPART | right truncated part of compound | honden- [kattenvoer] |
COMCON | comparative conjunction | [zo groot] als, [groter] dan |
CON | co-ordinating conjunction | [Jan] en [Marie] |
CWADV | interrogative adverb or subordinate conjunction | wanneer [gaat hij weg ?], wanneer [hij nu weggaat] |
DEMDET | demonstrative determiner | deze [bloemen zijn mooi] |
DEMPRO | demonstrative pronoun | deze [zijn mooi] |
DIG | digits | 1, 1.2 |
INDDET | indefinite determiner | geen [broer] |
INDPOST | indefinite postdeterminer | [de] beide [broers] |
INDPRE | indefinite predeterminer | al [de broers] |
INDPRO | indefinite pronoun | beide [gingen weg] |
INFCON | infinitive conjunction | om [te vragen] |
ITJ | interjections | Jawel, och, ach |
NOUN | common noun or proper noun | [de] hoed, [het goede] gevoel, [de] Betuwelijn |
ORD | ordinals | vijfde, 125ste, 12de |
PADJ | postmodifying adjective | [iets] aardigs |
PERS | personal pronoun | hij [sloeg] hem |
POSDET | possessive pronoun | mijn [boek] |
POSTP | postposition | [hij liep zijn huis] in |
PREP | preposition | [hij is] in [het huis] |
PROADV | pronominal adverb | [hij praat] hierover |
PTKA | adverb modification | [hij wil] te [snel] |
PTKNEG | negation | [hij gaat] niet [snel] |
PTKTE | infinitive particle | [hij hoopt] te [gaan] |
PTKVA | separated prefix of pronominal adverb or verb | [daar niet] mee [hij loopt] mee |
PUNCT | other punctuation | " ' ` { } [ ] < > - --- |
RELPRO | relative pronoun | [de man] die [lachte] |
RELSUB | relative conjunction | [Het kind] dat, [Het feit] dat |
SENT | sentence final punctuation | ; . ? |
SUBCON | subordinating conjunction | Hoewel [hij er was] |
SYM | symbols | @, % |
VAFIN | finite auxiliary verb | [hij] is [geweest] |
VAINF | infinite auxiliary verb | [hij zal] zijn |
VAPP | past participle auxiliary verb | [hij is] geweest |
VVFIN | finite substantive verb | [hij] zegt |
VVINF | infinitive substantive verb | [hij zal] zeggen |
VVPP | past participle substantive verb | [hij heeft] gezegd |
WADV | interrogative adverb | waarom [gaat hij] |
WDET | interrogative or relative determiner | [de vrouw] wier [man....] |
WPRO | interrogative or relative pronoun | [de vraag] wie ... |
English POS Tags – BT_ENGLISH
Tag | Description | Example |
---|---|---|
ADJ | (basic) adjective | [a] blue [book], [he is] big |
ADJCMP | comparative adjective | [he is] bigger, [a] better [question] |
ADJING | adjectival ing-form | [the] working [men] |
ADJPAP | adjectival past participle | [a] locked [door] |
ADJPRON | pronoun (with determiner) or adjective | [the] same; [the] other [way] |
ADJSUP | superlative adjective | [he is the] biggest; [the] best [cake] |
ADV | (basic) adverb | today, quickly |
ADVCMP | comparative adverb | sooner |
ADVSUP | superlative adverb | soonest |
CARD | cardinal (except one) | two, 123, IV |
CARDONE | cardinal one | [at] one [time] ; one [dollar] |
CM | comma | , |
COADV | coordination adverbs either, neither | either [by law or by force]; [he didn't come] either |
COORD | coordinating conjunction | and, or |
COSUB | subordinating conjunction | because, while |
COTHAN | conjunction than | [bigger] than |
DET | determiner | the [house], a [house], this [house], my [house] |
DETREL | relative determiner whose | [the man] whose [hat ...] |
INFTO | infinitive marker to | [he wants] to [go] |
ITJ | interjection | oh! |
MEAS | measure abbreviation | [50] m. [wide], yd |
MONEY | currency plus cardinal | $1,000 |
NOT | negation not | [he will] not [come in] |
NOUN | common noun | house |
NOUNING | nominal ing-form | [the] singing [was pleasant], [the] raising [of the flag] |
ORD | ordinal | 3rd, second |
PARTPAST | past participle (in subclause) | [while] seated[, he instructed the students]; [the car] sold [on Monday] |
PARTPRES | present participle (in subclause), gerund | [while] doing [it];[they began] designing [the ship];having [said this ...] |
POSS | possessive suffix 's | [Peter] 's ; [houses] ' |
PREDET | pre-determiner such | such [a way] |
PREP | preposition | in [the house], on [the table] |
PREPADVAS | preposition or adverbial as | as [big] as |
PRON | (non-personal) pronoun | everybody, this [is ] mine |
PRONONE | pronoun one | one [of them]; [the green] one |
PRONPERS | personal pronoun | I, me, we, you |
PRONREFL | reflexive pronoun | myself, ... |
PRONREL | relative pronoun who, whom, whose; which; that | [the man] who [wrote that book], [the ship] that capsized |
PROP | proper noun | Peter, [Mr.] Brown |
PUNCT | punctuation (other than SENT and CM) | " |
QUANT | quantifier all, any, both, double, each, enough, every, (a) few, half, many, some | many [people]; half [the price]; all [your children]; enough [time]; any [of these] |
QUANTADV | quantifier or adverb much, little | much [better] , [he cares] little |
QUANTCMP | quantifier or comparative adverb more, less | more [people], less [expensive] |
QUANTSUP | quantifier or superlative adverb most, least | most [people], least [expensive] |
SENT | sentence final punctuation | . ! ? : |
TIT | title | Mr., Dr. |
VAUX | auxiliary (modal) | [he] will [run], [I] won't [come] |
VBI | infinitive or imperative of be | [he will] be [running]; be [quiet!] |
VBPAP | past participle of be | [he has] been [there] |
VBPAST | past tense of be | [he] was [running], [he] was [here] |
VBPRES | present tense of be | [he] is [running], [he] is [old] |
VBPROG | ing-form of be | [it is] being [sponsored] |
VDI | infinitive of do | [He will] do [it] |
VDPAP | past participle of do | [he has] done [it] |
VDPAST | past tense of do | [we] did [it], [he] didn't [come] |
VDPRES | present tense of do | [We] do [it], [he] doesn't [go] |
VDPROG | ing-form of do | [He is] doing [it] |
VHI | infinitive or imperative of have | [he would] have [come]; have [a look!] |
VHPAP | past participle of have | [he has] had [a cold] |
VHPAST | past tense of have | [he] had [seen] |
VHPRES | present tense of have | [he] has [been watching] |
VHPROG | ing-form of have | [he is] having [a good time] |
VI | verb infinitive or imperative | [he will] go, [he comes to] see; listen [!] |
VPAP | verb past participle | [he has] played, [it is] written |
VPAST | verb past tense | [I] went, [he] loved |
VPRES | verb present tense | [we] go, [she] loves |
VPROG | verb ing-form | [you are] going |
VS | verbal 's (short for is or has) | [he] 's [coming] |
WADV | interrogative adverb | when [did ...], where [did ...], why [did ...] |
WDET | interrogative determiner | which [book], whose [hat] |
WPRON | interrogative pronoun | who [is], what [is] |
French POS Tags – BT_FRENCH
Tag | Description | Example |
---|---|---|
ADJ2_INV | special number invariant adjective | gros |
ADJ2_PL | special plural adjective | petites, grands |
ADJ2_SG | special singular adjective | petit, grande |
ADJ_INV | number invariant adjective | heureux |
ADJ_PL | plural adjective | gentils, gentilles |
ADJ_SG | singular adjective | gentil, gentille |
ADV | adverb | finalement, aujourd'hui |
CM | comma | , |
COMME | reserved for the word comme | comme |
CONJQUE | reserved for the word que | que |
CONN | connector subordinate conjunction | si, quand |
COORD | coordinate conjunction | et, ou |
DET_PL | plural determiner | les |
DET_SG | singular determiner | le, la |
MISC | miscellaneous | miaou, afin |
NEG | negation particle | ne |
NOUN_INV | number invariant noun | taux |
NOUN_PL | plural noun | chiens, fourmis |
NOUN_SG | singular noun | chien, fourmi |
NUM | numeral | treize, 13, XIX |
PAP_INV | number invariant past participle | soumis |
PAP_PL | plural past participle | finis, finies |
PAP_SG | singular past participle | fini, finie |
PC | clitic pronoun | [donne-]le, [appelle-]moi, [donne-]lui |
PREP | preposition (other than à, au, de, du, des) | dans, après |
PREP_A | preposition "à" | à, au, aux |
PREP_DE | preposition "de" | de, d', du, des |
PRON | pronoun | il, elles, personne, rien |
PRON_P1P2 | 1st or 2nd person pronoun | je, tu, nous |
PUNCT | punctuation (other than comma) | : - |
RELPRO | relative/interrogative pronoun (except "que") | qui, quoi, lequel |
SENT | sentence final punctuation | . ! ? ; |
SYM | symbols | @ % |
VAUX_INF | infinitive auxiliary | être, avoir |
VAUX_P1P2 | 1st or 2nd person auxiliary verb, any tense | suis, as |
VAUX_P3PL | 3rd person plural auxiliary verb, any tense | seraient |
VAUX_P3SG | 3rd person singular auxiliary verb, any tense | aura |
VAUX_PAP | past participle auxiliary | eu, été |
VAUX_PRP | present participle auxiliary verb | ayant |
VERB_INF | infinitive verb | danser, finir |
VERB_P1P2 | 1st or 2nd person verb, any tense | danse, dansiez, dansais |
VERB_P3PL | 3rd person plural verb, any tense | danseront |
VERB_P3SG | 3rd person singular verb, any tense | danse, dansait |
VERB_PRP | present participle verb | dansant |
VOICILA | reserved for voici, voilà | voici, voilà |
German POS Tags – BT_GERMAN
Tag | Description | Example |
---|---|---|
ADJA | (positive) attributive adjective | [ein] schnelles [Auto] |
ADJA2 | comparative attributive adjective | [ein] schnelleres [Auto] |
ADJA3 | superlative attributive adjective | [das] schnellste [Auto] |
ADJD | (positive) predicative or adverbial adjective | [es ist] schnell, [es fährt] schnell |
ADJD2 | comparative predicative or adverbial adjective | [es ist] schneller, [es fährt] schneller |
ADJD3 | superlative predicative or adverbial adjective | [es ist am] schnellsten, [er meint daß er am] schnellsten [fährt]. |
ADV | non-adjectival adverb | oft, heute, bald, vielleicht |
ART | article | der [Mann], eine [Frau] |
CARD | cardinal | 1, eins, 1/8, 205 |
CIRCP | circumposition, right part | [um der Ehre] willen |
CM | comma | , |
COADV | adverbial conjunction | aber, doch, denn |
COALS | conjunction als | als |
COINF | infinitival conjunction | ohne [zu fragen], anstatt [anzurufen] |
COORD | coordinating conjunction | und, oder |
COP1 | coordination 1st part | entweder [... oder] |
COP2 | coordination 2nd part | [weder ...] noch |
COSUB | subordinating conjunction | weil, daß, ob [ich mitgehe] |
COWIE | conjunction wie | wie |
DATE | date | 27.12.2006 |
DEMADJ | demonstrative adjective | solche [Mühe] |
DEMDET | demonstrative determiner | diese [Leute] |
DEMINV | invariant demonstrative | solch [ein schönes Buch] |
DEMPRO | demonstrative pronoun | jener [sagte] |
FM | foreign word | article, communication |
INDADJ | indefinite adjective | [die] meisten [Leute], viele [Leute], [die] meisten [sind da], viele [sind da] |
INDDET | indefinite determiner | kein [Mensch] |
INDINV | invariant indefinite | manch [einer] |
INDPRO | indefinite pronoun | man [sagt] |
ITJ | interjection | oh, ach, weh, hurra |
NOUN | common noun, nominalized adjective, nominalized infinitive, or proper noun | Hut, Leute, [das] Gute, [das] Wollen, Peter, [die] Schweiz |
ORD | ordinal | 2., dritter |
PERSPRO | personal pronoun | ich, du, ihm, mich, uns |
POSDET | possessive determiner | mein [Haus] |
POSPRO | possessive pronoun | [das ist] meins |
POSTP | postposition | [des Geldes] wegen |
PREP | preposition | in, auf, wegen, mit |
PREPART | preposition article | im, ins, aufs |
PTKANT | sentential particle | ja, nein, bitte, danke |
PTKCOM | comparative particle | desto [schneller] |
PTKINF | particle: infinitival zu | [er wagt] zu [sagen] |
PTKNEG | particle: negation nicht | nicht |
PTKPOS | positive modifier | zu [schnell], allzu [schnell] |
PTKSUP | superlative modifier | am [schnellsten] |
PUNCT | other punctuation, bracket | ; : ( ) [ ] - " |
REFLPRO | reflexive sich | sich |
RELPRO | relative pronoun | [der Mann,] der [lacht] |
REZPRO | reciprocal einander | einander |
SENT | sentence final punctuation | . ? ! |
SYMB | symbols | @, %, x311 |
TRUNC | truncated word, (first part of a compound or verb prefix) | Ein- [und Ausgang], Kinder- [und Jugendheim], be- [und entladen] |
VAFIN | finite auxiliary | [er] ist, [sie] haben |
VAINF | auxiliary infinitive | [er will groß] sein |
VAPP | auxiliary past participle | [er ist groß] geworden |
VMFIN | finite modal | [er] kann, [er] mochte |
VMINF | modal infinitive | [er wird kommen] können |
VPREF | separated verbal prefix | [er kauft] ein, [sie sieht] zu |
VVFIN | finite verb form | [er] sagt |
VVINF | infinitive | [er will] sagen, einkaufen |
VVIZU | infinitive with incorporated zu | [um] einzukaufen |
VVPP | past participle | [er hat] gesagt |
WADV | interrogative adverb | wieso [kommt er?] |
WDET | interrogative determiner | welche [Nummer?] |
WINV | invariant interrogative | welch [ein ...] |
WPRO | interrogative pronoun | wer [ist da?] |
Greek POS Tags – BT_GREEK
Tag | Description | Example |
---|---|---|
ADJ | adjective | παιδικό |
ADV | adverb | ευχαρίστως |
ART | article | η, της |
CARD | cardinal | χίλια |
CLIT | clitic (pronoun) | τον, τού |
CM | comma | , |
COORD | coordinating conjunction | και |
COSUBJ | conjunction with subjunctive | αντί [να] |
CURR | currency | $ |
DIG | digits | 123 |
FM | foreign word | article |
FUT | future tense particle | θα |
INTJ | interjection | χμ |
ITEM | item | 1.2 |
NEG | negation particle | μη |
NOUN | common noun | βιβλίο |
ORD | ordinal | τρίτα |
PERS | personal pronoun | εγώ |
POSS | possessive pronoun | μας, τους |
PREP | preposition | άνευ |
PREPART | preposition with article | στο |
PRON | pronoun | αυτοί |
PRONREL | relative pronoun | οποίες |
PROP | proper noun | Μαρία |
PTCL | particle | ας |
PUNCT | punctuation (other than SENT and CM) | : - |
QUOTE | quotation marks | " |
SENT | sentence final punctuation | . ! ? |
SUBJ | subjunctive particle | να |
SUBORD | subordinating conjunction | πως |
SYMB | special symbol | *, % |
VIMP | verb (imperative) | γράψε |
VIND | verb (indicative) | γράφεις |
VINF | verb (infinitive) | γράφει |
VPP | participle | δικασμένο |
Hebrew POS Tags – MILA_HEBREW
Tag | Description | Example |
---|---|---|
adjective | adjective | נפלאי |
adverb | adverb | מאחוריהם |
conjunction | conjunction | אך, ועוד |
copula | copula | איננה, תהיו |
existential | existential | ההיה, יש |
indInf | independent infinitive | היוסדה |
interjection | interjection | אח, חבל"ז |
interrogative | interrogative | לאן |
modal | modal | צריכים |
negation | negation particle | לא |
noun | common noun | אורולוגיה |
numeral | numeral | 613, שמונה |
participle | participle | משרטטה |
passiveParticiple | passive participle | סגורה |
preposition | preposition or prepositional phrase | עם, בפניו |
pronoun | pronoun | זה/ |
properName | proper noun | שרון |
punctuation | punctuation | . |
quantifier | quantifier or determiner | רובן |
title | title or honorific | גב', זצ"ל |
unknown | unknown | ומלמם'מ |
verb | verb | לכתוב |
wPrefix | prefix | פסאודו |
Hungarian POS Tags – BT_HUNGARIAN
Tag | Description | Example |
---|---|---|
ADJ | (invariant) adjective | kis |
ADV | adverb | jól |
ADV_PART | adverbial participle | állva |
ART | article | az |
AUX | auxiliary | szabad |
CM | comma | , |
CONJ | conjunction | és |
DEICT_PRON_NOM | deictic pronoun: nominative | ez |
DEICT_PRON_ACC | deictic pronoun: accusative | ezt |
DEICT_PRON_CASE | deictic pronoun: other case | ebbe |
FUT_PART_NOM | future participle: nominative | teendõ |
FUT_PART_ACC | future participle: accusative | teendõt |
FUT_PART_CASE | future participle: other case | teendõvel |
GENE_PRON_NOM | general pronoun: nominative | minden |
GENE_PRON_ACC | general pronoun: accusative | mindent |
GENE_PRON_CASE | general pronoun: other case | mindenbe |
INDEF_PRON_NOM | indefinite pronoun: nominative | más |
INDEF_PRON_ACC | indefinite pronoun: accusative | mást |
INDEF_PRON_CASE | indefinite pronoun: other case | mással |
INF | infinitive (verb) | csinálni |
INTERJ | interjection | jaj |
LS | list item symbol | 1) |
MEA | measure, unit | km |
NADJ_NOM | noun or adjective: nominative | ifjú |
NADJ_ACC | noun or adjective: accusative | ifjút |
NADJ_CASE | noun or adjective: other case | ifjúra |
NOUN_NOM | noun: nominative | asztal |
NOUN_ACC | noun: accusative | asztalt |
NOUN_CASE | noun: other case | asztalra |
NUM_NOM | numeral: nominative | három |
NUM_ACC | numeral: accusative | hármat |
NUM_CASE | numeral: other case | háromra |
NUM_PRON_NOM | numeral pronoun: nominative | kevés |
NUM_PRON_ACC | numeral pronoun: accusative | keveset |
NUM_PRON_CASE | numeral pronoun: other case | kevéssel |
NUMBER | numerals (digits) | 1 |
ORD_NUMBER | ordinal | 1. |
PAST_PART_NOM | past participle: nominative | meghívott |
PAST_PART_ACC | past participle: accusative | meghívottat |
PAST_PART_CASE | past participle: other case | meghívottakkal |
PERS_PRON | personal pronoun | én |
POSTPOS | postposition | alatt |
PREFIX | prefix | át |
PRES_PART_NOM | present participle: nominative | csináló |
PRES_PART_ACC | present participle: accusative | csinálót |
PRES_PART_CASE | present participle: other case | csinálónak |
PRON_NOM | pronoun: nominative | milyen |
PRON_ACC | pronoun: accusative | milyet |
PRON_CASE | pronoun: other case | milyenre |
PROPN_NOM | proper noun: nominative | Budapest |
PROPN_ACC | proper noun: accusative | Budapestet |
PROPN_CASE | proper noun: other case | Budapestre |
PUNCT | punctuation (other than SENT or CM) | ( ) |
REFL_PRON_NOM | reflexive pronoun: nominative | magad |
REFL_PRON_ACC | reflexive pronoun: accusative | magát |
REFL_PRON_CASE | reflexive pronoun: other case | magadra |
REL_PRON_NOM | relative pronoun: nominative | aki |
REL_PRON_ACC | relative pronoun: accusative | akit |
REL_PRON_CASE | relative pronoun: other case | akire |
ROM_NUMBER | Roman numeral | IV |
SENT | sentence final punctuation | ., ; |
SPEC | special string (URL, email) | www.xzymn.com |
SUFF | suffix | -re |
TRUNC | compound part | asztal- |
VERB | verb | csinál |
Italian POS Tags – BT_ITALIAN
Tag | Description | Example |
---|---|---|
ADJEX | proclitic noun modifier ex | ex |
ADJPL | plural adjective | belle |
ADJSG | singular adjective | buono, narcisistico |
ADV | adverb | lentamente, già, poco |
CLIT | clitic pronoun or adverb | vi, ne, mi, ci |
CM | comma | , |
CONJ | conjunction | e, ed, e/o |
CONNADV | adverbial connector | quando, dove, come |
CONNCHE | relative pronoun or conjunction | ch', che |
CONNCHI | relative or interrogative pronoun chi | chi |
DEMPL | plural demonstrative | quelli |
DEMSG | singular demonstrative | ciò |
DETPL | plural determiner | tali, quei, questi |
DETSG | singular determiner | uno, questo, il |
DIG | digits | +5, iv, 23.05, 3,45, 1997 |
INTERJ | interjection | uhi, perdiana, eh |
ITEM | list item marker | A. |
LET | single letter | [di tipo] C |
NPL | plural noun | case |
NSG | singular noun | casa, balsamo |
ORDPL | plural ordinal | terzi |
ORDSG | singular ordinal | secondo |
POSSPL | plural possessive | mie, vostri, loro |
POSSSG | singular possessive | nostro, sua |
PRECLIT | pre-clitic | me [lo dai], te [la rubo] |
PRECONJ | pre-conjunction | dato [che] |
PREDET | pre-determiner | tutto [il giorno], tutti [i problemi] |
PREP | preposition | tra, di, con, su di |
PREPARTPL | preposition + plural article | sulle, sugl', pegli |
PREPARTSG | preposition + singular article | sullo, nella |
PREPREP | pre-preposition | prima [di], rispetto [a] |
PRON | pronoun (3rd person singular/plural) sé | [disgusto di] sé |
PRONINDPL | plural indefinite pronoun | entrambi, molte |
PRONINDSG | singular indefinite pronoun | troppa |
PRONINTPL | plural interrogative pronoun | quali, quanti |
PRONINTSG | singular interrogative pronoun | cos' |
PRONPL | plural personal pronoun | noi, loro |
PRONREL | invariant relative pronoun | cui |
PRONRELPL | plural relative pronoun | quali, quanti |
PRONRELSG | singular relative pronoun | quale |
PRONSG | singular personal pronoun | esso, io, tu, lei, lui |
PROP | proper noun | Bernardo, Monte Isola |
PUNCT | other punctuation | - ; |
QUANT | invariant quantifier | qualunque, qualsivoglia |
QUANTPL | plural quantifier, numbers | molti, troppe, tre |
QUANTSG | singular quantifier | niuna, nessun |
SENT | sentence final punctuation | . ! ? : |
VAUXF | finite auxiliary essere or avere | è, sarò, saranno, avrete |
VAUXGER | gerund auxiliary essere or avere | essendo, avendo |
VAUXGER_CLIT | gerund auxiliary + clitic | essendogli |
VAUXIMP | imperative auxiliary | sii, abbi |
VAUXIMP_CLIT | imperative auxiliary + clitic | siatene, abbiatemi |
VAUXINF | infinitive auxiliary essere/avere | esser, essere, aver, avere |
VAUXINF_CLIT | infinitive auxiliary essere/avere + clitic | esserle, averle |
VAUXPPPL | plural past participle auxiliary | stati/e, avuti/e |
VAUXPPPL_CLIT | plural past part. auxiliary + clitic | statine, avutiti |
VAUXPPSG | singular past participle auxiliary | stato/a, avuto/a |
VAUXPPSG_CLIT | singular past part. auxiliary + clitic | statone, avutavela |
VAUXPRPL | plural present participle auxiliary | essenti, aventi |
VAUXPRPL_CLIT | plural present participle auxiliary + clitic | aventile |
VAUXPRSG | singular present participle auxiliary | essente, avente |
VAUXPRSG_CLIT | singular present participle auxiliary + clitic | aventela |
VF | finite verb form | blatereremo, mangio |
VF_CLIT | finite verb + clitic | trattansi, leggevansi |
VGER | gerund | adducendo, intervistando |
VGER_CLIT | gerund + clitic | saziandole, appurandolo |
VIMP | imperative | pareggiamo, formulate |
VIMP_CLIT | imperative + clitic | impastategli, accoppiatevele |
VINF | verb infinitive | sciupare, trascinar |
VINF_CLIT | verb infinitive + clitic | spulciarsi, risucchiarsi |
VPPPL | plural past participle | riposti, offuscati |
VPPPL_CLIT | plural past participle + clitic | assestatici, ripostine |
VPPSG | singular past participle | sbudellata, chiesto |
VPPSG_CLIT | singular past participle + clitic | commossosi, ingranditomi |
VPRPL | plural present participle | meditanti, destreggianti |
VPRPL_CLIT | plural present participle + clitic | epurantile, andantivi |
VPRSG | singular present participle | meditante, destreggiante |
VPRSG_CLIT | singular present participle + clitic | epurantelo, andantevi |
Japanese POS Tags – BT_JAPANESE
Tag | Description | Example |
---|---|---|
AA | adnominal adjective | その[人], この[日], 同じ |
AJ | normal adjective | 美し, 嬉し, 易し |
AN | adjectival noun | きれい[だ], 静か[だ], 正確[だ] |
D | adverb | じっと, じろっと, ふと |
EOS | sentence-final punctuation | 。. |
FP | non-derivational prefix | 両[選手], 現[首相] |
FS | non-derivational suffix | [綺麗]な, [派手]だ |
HP | honorific prefix | お[風呂], ご[不在], ご[意見] |
HS | honorific suffix | [小泉]氏, [恵美]ちゃん, [伊藤]さん |
I | interjection | こんにちは, ほら, どっこいしょ |
J | conjunction | すなわち, なぜなら, そして |
NC | common noun | 公園, 電気, デジタルカメラ |
NE | noun before numerals | 約, 翌, 築, 乾元 |
NN | numeral | 3, 2, 五, 二百 |
NP | proper noun | 北海道, 斉藤 |
NR | pronoun | 私, あなた, これ |
NU | classifier | [100]メートル, [3]リットル |
O | others | BASIS |
PL | particle | [雨]が[降る], [そこ]に[座る], [私]は[一人] |
PUNCT | punctuation other than end of sentence | ,「」(); |
UNKNOWN | unknown | デパ[地下], ヴェロ |
V | verb | 書く, 食べます, 来た |
V1 | ichidan verb stem | 食べ[る], 集め[る], 起き[る] |
V5 | godan verb stem | 気負[う], 知り合[う], 行き交[う] |
VN | verbal noun | 議論[する], ドライブ[する], 旅行[する] |
VS | suru-verb | 馳せ参[じる], 相半ば[する] |
VX | irregular verb | 移り行[く], トラブ[る] |
WP | derivational prefix | チョー[綺麗], バカ[正直] |
WS | derivational suffix | [東京]都, [大阪]府, [白]ずくめ |
Tag |
---|
FOREIGN_GIVEN_NAME |
FOREIGN_PLACE |
FOREIGN_SURNAME |
GIVEN_NAME |
ORGANIZATION |
PERSON |
PLACE |
SURNAME |
Japanese POS Tags – BT_JAPANESE_RBLJE_2
Tag | Description | Example |
---|---|---|
AA | adnominal adjective | その[人], この[日], 同じ |
AJ | normal adjective | 美し, 嬉し, 易し |
AN | adjectival noun | きれい[だ], 静か[だ], 正確[だ] |
AUXVB | auxiliary verb | た, ない, らしい |
D | adverb | じっと, じろっと, ふと |
FS | non-derivational suffix | [綺麗]な, [派手]だ |
HS | honorific suffix | [小泉]氏, [恵美]ちゃん, [伊藤]さん |
I | interjection | こんにちは, ほら, どっこいしょ |
J | conjunction | すなわち, なぜなら, そして |
NC | common noun | 公園, 電気, デジタルカメラ |
NE | noun before numerals | 約, 翌, 築, 乾元 |
NN | numeral | 3, 2, 五, 二百 |
NP | proper noun | 北海道, 斉藤 |
NR | pronoun | 私, あなた, これ |
NU | classifier | [100]メートル, [3]リットル |
O | others | BASIS |
PL | particle | [雨]が[降る], [そこ]に[座る], [私]は[一人] |
PUNCT | punctuation | 。,「」(); |
UNKNOWN | unknown | デパ[地下], ヴェロ |
V | verb | 書く, 食べます, 来た |
WP | derivational prefix | チョー[綺麗], バカ[正直] |
WS | derivational suffix | [東京]都, [大阪]府, [白]ずくめ |
Korean POS Tags – BT_KOREAN
Tag | Description | Examples |
---|---|---|
ADC | conjunctive adverb | 그리고, 그러나, 및, 혹은 |
ADV | constituent or clausal adverb | 매우, 조용히, 제발, 만일 |
CO | copula | 이 |
DAN | configurative or demonstrative adnominal | 새, 헌, 그 |
EAN | adnominal ending | 는/ㄴ |
ECS | coordinate, subordinate, adverbial, complementizer ending | 고, 므로, 게, 다고, 라고 |
EFN | final ending | 는다/ㄴ다, 니, 는가, 는지, 어라/라, 자, 구나 |
ENM | nominal ending | 기, 음 |
EPF | pre-final ending (tense, honorific) | 었, 시, 겠 |
IJ | exclamation | 아 |
NFW | word written in foreign characters | Clinton, Computer |
NNC | common noun | 학교, 컴퓨터 |
NNU | ordinal or cardinal number | 하나, 첫째, 1, 세 |
NNX | dependent noun | 것, 등, 년, 달라, 적 |
NPN | personal or demonstrative pronoun | 그, 이것, 무엇 |
NPR | proper noun | 한국, 클린톤 |
PAD | adverbial postposition | 에서, 로 |
PAN | adnominal postposition | 의, 이라는 |
PAU | auxiliary postposition | 만, 도, 는, 마저 |
PCA | case postposition | 가/이, 을/를, 의, 야 |
PCJ | conjunctive postposition | 와/과, 하고 |
SCM | comma | , |
SFN | sentence ending marker | . ? ! |
SLQ | left quotation mark | ‘ ( “ { |
SRQ | right quotation mark | ’ ) ” } |
SSY | symbol | ... ; : - |
UNK | unknown | |
VJ | adjective | 예쁘, 다르 |
VV | verb | 가, 먹 |
VX | auxiliary predicate | 있, 하 |
XPF | prefix | 제 |
XSF | suffix | 님, 들, 적 |
XSJ | adjectivization suffix | 스럽, 답, 하 |
XSV | verbalization suffix | 하, 되, 시키 |
Language Neutral POS Tags – BT_LANGUAGE_NEUTRAL
Tag | Description | Example |
---|---|---|
ATMENTION | @mention | @basistechnology |
EMAIL | email address | email@example.com |
EMO | emoji or emoticon | :-) |
HASHTAG | hashtag | #BlindGuardian |
URL | URL | http://www.babelstreet.com/ |
Persian POS Tags – BT_PERSIAN
Tag | Description | Example |
---|---|---|
ADJ | adjective | بزرگ |
ADV | adverb | تقريبا |
CONJ | conjunction | یا |
DET | indefinite article/determiner | هر |
EOS | end of sentence indicator | . |
INT | interjection or exclamation | عجب |
N | noun | افزايش |
NON_FARSI | not Arabic script | a b c |
NPROP | proper noun | مائیکروسافت |
NUM | number | ده |
PART | particle | می |
PREP | preposition | به |
PRO | pronoun | اين |
PUNC | punctuation, other than end of sentence | , : “ ” |
UNK | unknown | نرژی |
VERB | verb | گفتم |
VINF | infinitive | خریدن |
Polish POS Tags – BT_POLISH
Tag | Description | Example |
---|---|---|
ADV | adverb: adjectival | szybko |
adverb: comparative adjectival | szybciej | |
adverb: superlative adjectival | najszybciej | |
adverb: non-adjectival | trochę, wczoraj | |
ADJ | adjective: attributive (postnominal) | [stopy] procentowe |
adjective: attributive (prenominal) | szybki [samochód] | |
adjective: predicative | [on jest] ogromny | |
adjective: comparative attributive | szybszy [samochód] | |
adjective: comparative predicative | [on jest] szybszy | |
adjective: superlative attributive | najszybszy [samochód] | |
adjective: superlative predicative | [on jest] najszybszy | |
CJ/AUX | conjunction with auxiliary być | [robi wszystko,] żebyśmy [przyszli] |
CM | comma | , |
CMPND | compound part | [ośrodek] naukowo-[badawczy] |
CONJ | conjunction | a, ale, gdy, i, lub |
DATE | date expression | 31.12.99 |
EXCL | interjection | aha, hej |
FRGN | foreign material | cogito, numerus |
NOUN | noun: common | reakcja, weksel |
noun: proper | Krzysztof, Francja | |
noun: nominalized adjective | chory, [pośpieszny z] Krakowa | |
NUM | numeral (cardinal) | 22; 10,25; 5-7; trzy |
ORD | numeral (ordinal) | 12. [maja], 2., 12go, 13go, 28go |
PHRAS | phraseology | [po] polsku, fiku-miku |
PPERS | personal pronoun | ja, on, ona, my, wy, mnie, tobie, jemu, nam, mi, go, nas, was |
PR/AUX | pronoun with auxiliary być | [co] wyście [zrobili] |
PREFL | reflexive pronoun | [nie może] sobie [przypomnieć], [zabierz to ze] sobą, [warto] sobie [zadać pytanie] |
PREL | relative pronoun | który [problem], jaki [problem], co, który [on widzi], jakie [ma muzeum] |
PREP | preposition | od [dzisiaj], na [rynku walutowym] |
PRON | pronoun: demonstrative | [w] tym [czasie] |
pronoun: indefinite | wszystkie [stopy procentowe], jakieś [nienaturalne rozmiary] | |
pronoun: possessive | nasi [dwaj bracia] | |
pronoun: interrogative | Jaki [masz samochód?] | |
PRTCL | particle | także, nie, tylko, już |
PT/AUX | particle with auxiliary być | gdzie [byliście] |
PUNCT | punctuation (other than CM or SENT) | ( ) [ ] " " - '' |
QVRB | quasi-verb | brak, szkoda |
SENT | sentence final punctuation | . ! ? ; |
SYMB | symbol | @ § |
TIME | time expression | 11:00 |
VAUX | auxiliary | być, zostać |
VFIN | finite verb form: present | [Agata] maluje [obraz] |
finite verb form: future | [Piotr będzie] malował [obraz] | |
VGER | gerund | [demonstrują] domagając [się zmian] |
VINF | infinitive | odrzucić, stawić [się] |
VMOD | modal | [wojna] może [trwać nawet rok] |
VPRT | verb participle: predicative | [wynik jest] przesądzony |
verb participle: passive | [postępowanie zostanie] zawieszone | |
verb participle: attributive | [zmiany] będące [wynikiem...] |
Portuguese POS Tags – BT_PORTUGUESE
Tag | Description | Example |
---|---|---|
ADJ | invariant adjective | [duas saias] cor-de-rosa |
ADJPL | plural adjective | [cidadãos] portugueses |
ADJSG | singular adjective | [continente] europeu |
ADV | adverb | directamente |
ADVCOMP | comparison adverb mais and menos | [um país] mais [livre] |
AUXBE | finite "be" (ser or estar) | é, são, estão |
AUXBEINF | infinitive "be" | ser, estar |
AUXBEINFPRON | infinitive "be" with clitic | sê-lo |
AUXBEPRON | finite "be" with clitic | é-lhe |
AUXHAV | finite "have" | tem, haverá |
AUXHAVINF | infinitive "have" (ter, haver) | ter, haver |
AUXHAVINFPRON | infinitive "have" with clitic | ter-se |
AUXHAVPRON | finite "have" with clitic | tinham-se |
CM | comma | , |
CONJ | (coordinating) conjunction | [por fax] ou [correio] |
CONJCOMP | comparison conjunction do que | [mais] do que [uma vez] |
CONJSUB | subordination conjunction | para que, se, que |
DEMPL | plural demonstrative | estas |
DEMSG | singular demonstrative | aquele |
DETINT | interrogative or exclamative que | [demostra a] que [ponto] |
DETINTPL | plural interrogative determiner | quantas [vezes] |
DETINTSG | singular interrogative determiner | qual [reação] |
DETPL | plural definite article | os [maiores aplausos] |
DETRELPL | plural relative determiner | ..., cujas [presações] |
DETRELSG | singular relative determiner | ..., cuja [veia poética] |
DETSG | singular definite article | o [service] |
DIG | digit | 123 |
GER | gerundive | examinando |
GERPRON | gerundive with clitic | deixando-a |
INF | verb infinitive | reunir, conservar |
INFPRON | infinitive with clitic | datar-se |
INTERJ | interjection | oh, aí, claro |
ITEM | list item marker | A. [Introdução] |
LETTER | isolated character | [da seleção] A |
NEG | negation | não, nunca |
NOUN | invariant common noun | caos |
NPL | plural common noun | serviços |
NPROP | proper noun | PS, Lisboa |
NSG | singular common noun | [esta] rede |
POSSPL | plural possessive | seus [investigadores] |
POSSSG | singular possessive | sua [sobrinha] |
PREP | preposition | para, de, com |
PREPADV | preposition + adverb | [venho] daqui |
PREPDEMPL | preposition + plural demonstrative | desses [recursos] |
PREPDEMSG | preposition + singular demonstrative | nesta [placa] |
PREPDETPL | preposition + plural determiner | dos [Grandes Bancos] |
PREPDETSG | preposition + singular determiner | na [construção] |
PREPPRON | preposition + pronoun | [atrás] dela |
PREPQUANTPL | preposition + plural quantifier | nuns [terrenos] |
PREPQUANTSG | preposition + singular quantifier | numa [nuvem] |
PREPREL | preposition + invariant relative pronoun | [nesta praia] aonde |
PREPRELPL | preposition + plural relative pronoun | [alunos] aos quais |
PREPRELSG | preposition + singular relative pronoun | [área] através do qual |
PRON | invariant pronoun | se, si |
PRONPL | plural pronoun | as, eles, os |
PRONSG | singular pronoun | a, ele, ninguém |
PRONREL | invariant relative pronoun | [um ortopedista] que |
PRONRELPL | plural relative pronoun | [as instalações] as quais |
PRONRELSG | singular relative pronoun | [o ensaio] o qual |
PUNCT | other punctuation | : ( ) ; |
QUANTPL | plural quantifier | quinze, alguns, tantos |
QUANTSG | singular quantifier | um, algum, qualquer |
SENT | sentence final punctuation | . ! ? |
SYM | symbols | @ % |
VERBF | finite verb form | corresponde |
VERBFPRON | finite verb form with clitic | deu-lhe |
VPP | past participle (also adjectival use) | penetrado, referida |
Russian POS Tags – BT_RUSSIAN
Tag | Description | Example |
---|---|---|
ADJ | adjective | красивая, зеленый, удобный, темный |
ADJ_CMP | adjective: comparative | красивее, зеленее, удобнее, темнее |
ADV | adverb | быстро, просто, легко, правильно |
ADV_CMP | adverb: comparative | быстрее, проще, легче, правильнее |
AMOUNT | currency + cardinal, percentages | $20.000, 10% |
CM | comma | , |
CONJ | conjunction | что, или, и, а |
DET | determiner | какой, некоторым [из вас], который[час] |
DIG | numerals (digits) | 1, 2000, 346 |
FRGN | foreign word | бутерброд, армия, сопрано |
IREL | relative/interrogative pronoun | кто [сделает это?] каков [результат?], сколько [стоит?], чей |
ITJ | interjection | увы, ура |
MISC | (miscellaneous) | АЛ345, чат, N8 |
NOUN | common noun: nominative case | страна |
common noun: accusative case | [любить] страну | |
common noun: dative case | [посвятить] стране | |
common noun: genitive case | [история] страны | |
common noun: instrumental case | [гордиться] страной | |
common noun: prepositional case | [говорить о] стране | |
NUM | numerals (spelled out) | шестьсот, десять, два |
ORD | ordinal | 12., 1.2.1., IX. |
PERS | personal pronoun | я, ты, они, мы |
PREP | preposition | в, на, из-под [земли], с [горы] |
PRONADV | pronominal adverb | как, там, зачем, никогда, когда-нибудь |
PRON | pronoun | все, тем, этим, себя |
PROP | proper noun | Россия, Арктика, Ивановых, Александра |
PTCL | particle | [но все] же, [постой]-ка, [ну]-ка |
PTCL_INT | introduction particle | вот [она], вон [там], пускай, неужели, ну |
PTCL_MOOD | mood marker | [если] бы, [что] ли,[так] бы [и сделали] |
PTCL_SENT | stand-alone particle | впрочем, однако |
PUNCT | punctuation (other than CM or SENT) | : ; " " ( ) |
SENT | sentence final punctuation | . ? ! |
SYMB | symbol | *, ~ |
VAUX | auxiliary verb | быть,[у меня] есть |
VFIN | finite verb | ходили, любила, сидит |
VGER | verb gerund | бывая, думая, засыпая |
VINF | verb infinitive | ходить, любить, сидеть |
VPRT | verb participle | зависящий [от родителей], сидящего [на стуле] |
Spanish POS Tags – BT_SPANISH
Tag | Description | Example |
---|---|---|
ADJ | invariant adjective | beige, mini |
ADJPL | plural adjective | bonitos, nacionales |
ADJSG | singular adjective | bonito, nacional |
ADV | adverb | siempre, directamente |
ADVADJ | adverb, modifying an adjective | muy [importante] |
ADVINT | interrogative adverb | adónde, cómo, cuándo |
ADVNEG | negation no | no |
ADVREL | relative adverb | cuanta, cuantos |
AUX | finite auxiliary ser or estar | es, fui, estaba |
AUXINF | infinitive ser, estar | estar, ser |
AUXINFCL | infinitive ser, estar with clitic | serme, estarlo |
CM | comma | , |
COMO | reserved for word como | como |
CONADV | adverbial conjunction | adonde, cuando |
CONJ | conjunction | y, o, si, porque, sin que |
DETPL | plural determiner | los, las, estas, tus |
DETQUANT | invariant quantifier | demás, más, menos |
DETQUANTPL | plural quantifier | unas, ambos, muchas |
DETQUANTSG | singular quantifier | un, una, ningún, poca |
DETSG | singular determiner | el, la, este, mi |
DIG | numerals (digits) | 123, XX |
HAB | finite auxiliary haber | han, hubo, hay |
HABINF | infinitive haber | haber |
HABINFCL | infinitive haber with clitic | haberle, habérseme |
INTERJ | interjection | ah, bravo, olé |
ITEM | list item marker | a. |
NOUN | invariant noun | bragazas, fénix |
NOUNPL | plural noun | aguas, vestidos |
NOUNSG | singular noun | agua, vestido |
NUM | numerals (spelled out) | once, tres, cuatrocientos |
PAPPL | past participle, plural | contenidos, hechas |
PAPSG | past participle, singular | privado, fundada |
PREDETPL | plural pre-determiner | todas [las], todos [los] |
PREDETSG | singular pre-determiner | toda [la], todo [el] |
PREP | preposition | en, de, con, para, dentro de |
PREPDET | preposition + determiner | al, del, dentro del |
PRON | pronoun | ellos, todos, nadie, yo |
PRONCLIT | clitic pronoun | le, la, te, me, os, nos |
PRONDEM | demonstrative pronoun | eso, esto, aquello |
PRONINT | interrogative pronoun | qué, quién, cuánto |
PRONPOS | possessive pronoun | (el) mío, (las) vuestras |
PRONREL | relative pronoun | (lo) cual, quien, cuyo |
PROP | proper noun | Pablo, Beralfier |
PUNCT | punctuation (other than CM or SENT) | ' ¡ ¿ : { |
QUE | reserved for word que | que |
SE | reserved for word se | se |
SENT | sentence final punctuation | . ? ; ! |
VERBFIN | finite verb form | tiene, pueda, dicte |
VERBIMP | verb imperative | dejad, oye |
VERBIMPCL | imperative with clitic | déjame, sígueme |
VERBINF | verb infinitive | evitar, tener, conducir |
VERBINFCL | infinitive with clitic | hacerse, suprimirlas |
VERBPRP | present participle | siendo, tocando |
VERBPRPCL | present participle with clitic | haciéndoles, tomándolas |
Urdu POS Tags – BT_URDU
Tag | Description | Example |
---|---|---|
ADJ | adjective | بہترین |
ADV | adverb | تاہم |
CONJ | conjunction | یا |
DET | indefinite article/determiner | ایک |
EOS | end of sentence indicator | . |
INT | interjection or exclamation | جئے |
N | noun | مہینہ |
NON_URDU | not Arabic script | a b c |
NPROP | proper noun | مائیکروسافٹ |
NUM | number | اٹھارہ |
PART | particle | غیر |
PREP | preposition | با |
PRO | pronoun | وہ |
PUNC | punctuation other than end of sentence | ‘ |
UNK | unknown | اپنیاں |
VERB | verb | کرتی |
Universal POS Tags – UPT16_V1
The universal tags are coarser than the language-specific tags, but enable tracking and comparison across languages.
To return universal POS tags in place of language-specific tags, use the Annotated Data Model (ADM) and BaseLinguisticsFactory to set BaseLinguisticsOption.universalPosTag to true. See Returning Universal POS Tags.
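For example, a minimal sketch using a factory configured as in the earlier examples:

```java
// Request Universal POS tags in place of language-specific tags.
// setOption takes string values, so the boolean is passed as "true".
factory.setOption(BaseLinguisticsOption.universalPosTag, "true");
```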
Tag | Description |
---|---|
ADJ | adjective |
ADP | adposition |
ADV | adverb |
AUX | auxiliary verb |
CONJ | coordinating conjunction |
DET | determiner |
INTJ | interjection |
NOUN | noun |
NUM | numeral |
PART | particle |
PRON | pronoun |
PROPN | proper noun |
PUNCT | punctuation |
SCONJ | subordinating conjunction |
SYM | symbol |
VERB | verb |
X | other |
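The sketch below shows what enabling this option can look like in code. The factory, option, and value come from this document; the package names and the exact setOption signature are assumptions to verify against the API documentation shipped with your RBL release.

```java
// Sketch: returning Universal POS Tags (UPT16_V1) via the ADM API.
// Package names and the setOption signature are assumptions.
import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;

public class UniversalPosTagsExample {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/opt/rbl");
        factory.setOption(BaseLinguisticsOption.language, "spa");
        // Emit ADJ, NOUN, VERB, ... instead of ADJSG, NOUNPL, VERBFIN, ...
        factory.setOption(BaseLinguisticsOption.universalPosTag, "true");
        // ... then create an annotator from the factory and annotate text.
    }
}
```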
Options
General Options
The following options are described in more detail in Initial and Path Options.
If the option rootDirectory is specified, then the string ${rootDirectory} takes that value in the dictionaryDirectory, modelDirectory, and licensePath options.
Option | Description | Type (Default) | Supported Languages |
---|---|---|---|
dictionaryDirectory | The path of the lemma and compound dictionary, if it exists. | Path (${rootDirectory}/dicts) | All |
language | The language to process by analyzers or tokenizers created by the factory. | Language code | All |
licensePath | The path of the RBL license file. | Path (${rootDirectory}/licenses/rlp-license.xml) | All |
licenseString | The XML license content; overrides licensePath. | String | All |
modelDirectory | The directory containing the model files. | Path (${rootDirectory}/models) | All |
rootDirectory | Sets the root directory. Also sets default values for the other required options (dictionaryDirectory, licensePath, modelDirectory). | Path | All |
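The ${rootDirectory} substitution is plain string replacement into the defaults above. The following self-contained snippet only illustrates that rule; it is ordinary Java, not RBL code.

```java
// Illustration: how the documented ${rootDirectory} substitution resolves
// the default values of the other path options.
public class RootDirectoryDefaults {
    public static void main(String[] args) {
        String root = "/opt/rbl";
        String dicts = "${rootDirectory}/dicts".replace("${rootDirectory}", root);
        String models = "${rootDirectory}/models".replace("${rootDirectory}", root);
        String license = "${rootDirectory}/licenses/rlp-license.xml".replace("${rootDirectory}", root);
        System.out.println(dicts);   // /opt/rbl/dicts
        System.out.println(models);  // /opt/rbl/models
        System.out.println(license); // /opt/rbl/licenses/rlp-license.xml
    }
}
```

With a standard installation layout, setting rootDirectory alone is therefore enough to locate the dictionaries, models, and license.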
Tokenizer Options
The following options are described in more detail in Tokenizers.
Description | Type (Default) | Supported Languages |
---|---|---|
Selects the tokenizer to use. | | All |
Indicates whether tokenizers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Malay (Standard), Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Ukrainian |
Specifies the language to use for script regions in a script other than that of the overall language. | Language code | Chinese, Japanese, Thai |
Minimum length of a run of characters that are not in the primary script. If a non-primary script region is shorter than this and adjacent to a primary script region, it is appended to the primary script region. | Integer (10) | Chinese, Japanese, Thai |
Indicates whether to use a different word-breaker for each script. If false, uses the script-specific breaker for the primary script and the default breaker for other scripts. | Boolean (false) | Chinese, Japanese, Thai |
Turns on Unicode NFKC normalization before tokenization. | Boolean (false) | All |
Indicates the input will be queries, likely incomplete sentences. If true, tokenizers may change their behavior. | Boolean (false) | All |
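For example, a search application can put the factory into query mode before creating tokenizers. The option names used below (query, nfkcNormalize) match the descriptions above; the package names and the createTokenizer signature are assumptions to verify against your RBL release.

```java
// Sketch: a query-mode tokenizer via the classic API. The createTokenizer
// factory method is named in this document; its exact signature and the
// setOption calls are assumptions.
import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;

public class QueryTokenizerSetup {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.language, "jpn");
        factory.setOption(BaseLinguisticsOption.query, "true");         // input is search queries
        factory.setOption(BaseLinguisticsOption.nfkcNormalize, "true"); // normalize width variants first
        var tokenizer = factory.createTokenizer(/* query text source */);
        // ... iterate the tokens the tokenizer produces.
    }
}
```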
The following options are described in more detail in Structured Text.
The following options are described in more detail in Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs.
Option | Description | Default | Supported Languages |
---|---|---|---|
n/a[a] | Enables emoji tokenization | true | All |
emoticons | Enables emoticon tokenization | false | All |
atMentions | Enables @mention tokenization | false | All |
hashtags | Enables hashtag tokenization | false | All |
emailAddresses | Enables email address tokenization | false | All |
urls | Enables URL tokenization | false | All |
[a] Emoji tokenization and POS-tagging are always enabled and cannot be disabled.
Analyzer Options
The following options are described in more detail in Analyzers.
Description | Type (Default) | Supported Languages |
---|---|---|
Maximum number of entries in the analysis cache. Larger values increase throughput but use extra memory. If zero, caching is off. | Integer (100,000) | All |
Indicates whether analyzers produced by the factory are case sensitive. If false, they ignore case distinctions. | Boolean (true) | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
Indicates whether the analyzers should return extended tags with the raw analysis. If true, the extended tags are returned. | Boolean (false) | All |
A list of paths to user many-to-one normalization dictionaries, separated by semicolons or the OS-specific path separator. | List of paths | All |
Indicates the input will be queries, likely incomplete sentences. If true, analyzers may change their behavior (e.g., disabling disambiguation). | Boolean (false) | All |
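As an illustration, the sketch below sizes the analysis cache for a high-throughput indexing pipeline and enables query mode for the search side. The analysisCacheSize and query option names match the descriptions above; package names and the createAnalyzer call are assumptions to verify against your release.

```java
// Sketch: analyzer configuration. A larger analysis cache trades memory
// for throughput; "0" disables caching. Package names and the
// createAnalyzer call are assumptions.
import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;

public class AnalyzerSetup {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.rootDirectory, "/opt/rbl");
        factory.setOption(BaseLinguisticsOption.language, "deu");
        factory.setOption(BaseLinguisticsOption.analysisCacheSize, "250000");
        factory.setOption(BaseLinguisticsOption.query, "true"); // analyzing search queries
        var analyzer = factory.createAnalyzer();
        // ... feed tokens to the analyzer.
    }
}
```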
The following options are described in more detail in Compounds.
Description | Type (Default) | Supported Languages |
---|---|---|
Indicates whether to decompose compounds. For Chinese and Japanese, if … | Boolean (true) | Chinese, Danish, Dutch, German, Hungarian, Japanese, Korean, Norwegian (Bokmål, Nynorsk), Swedish |
Indicates whether to return the surface forms of compound components. When this option is enabled and ADM results are returned, … This option has no effect when … | Boolean (false) | Dutch, German, Hungarian |
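Decompounding directly affects recall in compounding languages: a query for a component can only match a compound token if the compound is decomposed at index time. A hedged sketch of the toggle, with the same caveats about package names and setOption as in the earlier examples:

```java
// Sketch: compound handling for German. decomposeCompounds defaults to
// true, so "Fußballspieler" can be analyzed into components such as
// "Fußball" + "Spieler"; set it to "false" to keep only whole tokens.
import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;

public class CompoundSetup {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.language, "deu");
        factory.setOption(BaseLinguisticsOption.decomposeCompounds, "true");
    }
}
```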
The following options are described in more detail in Disambiguation.
Description | Type (Default) | Supported Languages |
---|---|---|
Indicates whether the analyzers should disambiguate the results. | Boolean (true) | Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish |
Enables faster part of speech disambiguation for English. | Boolean (false) | English |
Enables faster part of speech disambiguation for Greek. | Boolean (false) | Greek |
Enables faster part of speech disambiguation for Spanish. | Boolean (false) | Spanish |
The following options are described in more detail in Returning Universal Part-of-Speech (POS) Tags.
Description | Type (Default) | Supported Languages |
---|---|---|
Indicates whether POS tags should be converted to their universal versions. | Boolean (false) | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu. |
URI of a POS tag map. | URI | POS tags are defined for Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Malay (Standard), Persian, Polish, Portuguese, Russian, Spanish, Tagalog, and Urdu. |
The following options are described in more detail in Contraction Splitting Rule File Format.
The following options are only available when using the ADM API.
Description | Type (Default) | Supported Languages |
---|---|---|
Enables analysis. If false, the annotator will only perform tokenization. | Boolean | All |
URI of a POS tag map file for use by the … | URI | Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish |
Chinese and Japanese Options
The following options are described in more detail in Chinese and Japanese Lexical Tokenization.
Description | Default value | Supported languages |
---|---|---|
Indicates whether to consider punctuation between alphanumeric characters as a break. Has no effect when … | | Chinese |
Indicates whether to provide consistent segmentation of embedded text not in the primary script. If false, then the setting of … | | Chinese, Japanese |
Indicates whether to decompose compounds. | | Chinese, Japanese |
Indicates whether to recursively decompose each token into smaller tokens, if the token is marked in the dictionary as being decomposable. If deep decompounding is enabled, the decomposable tokens are further decomposed into additional tokens. Has no effect when … | | Chinese, Japanese |
Indicates whether to favor words in the user dictionary during segmentation. | | Chinese, Japanese |
Indicates whether to ignore whitespace separators when segmenting input text. If … | | Japanese |
Indicates whether to filter stop words out of the output. | | Chinese, Japanese |
Indicates whether to join sequences of Katakana tokens adjacent to a middle dot token. | true | Japanese |
Sets the minimum length of non-native text to be considered for a script change. A script change indicates a boundary between tokens, so the length may influence how a mixed-script string is tokenized. Has no effect when … | 10 | Chinese, Japanese |
Indicates whether to add parts of speech and secondary parts of speech to morphological analyses. | | Chinese, Japanese |
Indicates whether to segment each run of numbers or Latin letters into its own token, without splitting on medial number/word joiners. Has no effect when … | | Japanese |
Indicates whether to return numbers and counters as separate tokens. | | Japanese |
Indicates whether to segment place names from their suffixes. | | Japanese |
Indicates whether to treat whitespace as a number separator. Has no effect when … | | Chinese |
Indicates whether to treat whitespace as a morpheme delimiter. | | Chinese, Japanese |
The following options are described in more detail in Chinese and Japanese Readings.
Description | Default value | Supported languages |
---|---|---|
Indicates whether to return all the readings for a token. Has no effect when … | | Chinese |
Indicates whether to skip directly to the fallback behavior of … | | Chinese, Japanese |
Indicates whether to add readings to morphological analyses. The annotator will try to add readings by whole words. If it cannot, it will concatenate the readings of individual characters. | | Chinese, Japanese |
Indicates whether to add a separator character between readings when concatenating readings by character. Has no effect when … | | Chinese, Japanese |
Sets the representation of Chinese readings. Possible values (case-insensitive) are: … | | Chinese |
Indicates whether to use 'v' instead of 'ü' in pinyin readings, a common substitution in environments that lack diacritics. The value is ignored when … | | Chinese |
Hebrew Options
The following options are described in more detail in Hebrew Analyses.
Chinese Script Converter Options
The following options are described in more detail in Chinese Script Converter (CSC).
Description | Type (Default) | Supported Languages |
---|---|---|
Indicates the most complex conversion level to use. | | Chinese |
The language from which the … | | Chinese, Simplified Chinese, Traditional Chinese |
The language to which the … | | Chinese, Simplified Chinese, Traditional Chinese |
Lucene Options
The following options are described in more detail in Using RBL in Apache Lucene.
Description | Type (Default) | Supported Languages |
---|---|---|
Indicates whether the token filter should add the lemmas (if none, the stems) of each surface token to the tokens being returned. | Boolean (true) | All |
Indicates whether the token filter should add the readings of each surface token to the tokens being returned. | Boolean (false) | Chinese, Japanese |
Indicates whether the token filter should identify contraction components as contraction components rather than as lemmas. | Boolean (false) | All |
Indicates whether the token filter should replace a surface token with its lemma. Disambiguation must be enabled. | Boolean (false) | All |
Indicates whether the token filter should replace a surface form with its normalization. Normalization must be enabled. | Boolean (false) | All |
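The effect of the lemma-adding filter option (first row above) is easiest to see by dumping a token stream: injected lemmas typically appear as extra tokens at the same position as the surface token, i.e., with a position increment of 0. The utility below uses only the standard Lucene API; constructing the RBL analyzer itself is not shown here (see Using RBL in Apache Lucene).

```java
// Dumps each token and its position increment. Injected lemma tokens
// should show an increment of 0, sharing the position of their surface
// token so that lemma-based matching works at query time.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TokenStreamDump {
    public static void dump(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(posIncr.getPositionIncrement() + "\t" + term);
            }
            ts.end();
        }
    }
}
```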
Description | Type | Supported Languages |
---|---|---|---|
A list of paths to user lemma dictionaries. | List of Paths | Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai | |
A list of paths to user segmentation dictionaries. | List of Paths | All | |
A list of paths to user dictionaries. | List of Paths | All | |
A list of paths to reading dictionaries. | List of Paths | Japanese |
[6] Apache Lucene™, Lucene™, Apache Solr™, and Solr™ are trademarks of the Apache Software Foundation. Elasticsearch™ is a trademark of Elasticsearch BV.
[7] Essentially an ID number; see the ICU break rule documentation
[8] These analyzers are compatible with the Chinese and Japanese language processors found in the legacy Rosette (C++) products.
[9] As distinguished from the Arabic-Indic numerals often used in Arabic script (٠, ١, ٢, ٣, ٤, ٥, ٦, ٧, ٨, ٩) or the Eastern Arabic-Indic numerals often used in Persian and Urdu Arabic script (۰, ۱, ۲, ۳, ۴, ۵, ۶, ۷, ۸, ۹).
Social Media Tokens: Emoji & Emoticons, Hashtags, @Mentions, Email Addresses, URLs
RBL supports POS-tagging of emoji, emoticons, @mentions, email addresses, hashtags, and URLs in all supported languages.
Tokenization of emoji is always enabled. The other options are disabled by default but can be enabled through the options listed. When tokenization is disabled, the characters may be split into multiple tokens.
Option | Description | Default | Supported Languages |
---|---|---|---|
n/a[a] | Enables emoji tokenization | true | All |
emoticons | Enables emoticon tokenization | false | All |
atMentions | Enables @mention tokenization | false | All |
hashtags | Enables hashtag tokenization | false | All |
emailAddresses | Enables email address tokenization | false | All |
urls | Enables URL tokenization | false | All |
[a] Emoji tokenization and POS-tagging are always enabled and cannot be disabled.
Enum Classes: BaseLinguisticsOption, TokenizerOption
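A sketch of enabling the optional types follows (emoji tokenization needs no option because it is always on). The option names are from the table above; as elsewhere, the factory package names and setOption usage are assumptions to verify against your release.

```java
// Sketch: turning on the optional social-media token types.
import com.basistech.rosette.bl.BaseLinguisticsFactory;
import com.basistech.rosette.bl.BaseLinguisticsOption;

public class SocialMediaTokensSetup {
    public static void main(String[] args) throws Exception {
        BaseLinguisticsFactory factory = new BaseLinguisticsFactory();
        factory.setOption(BaseLinguisticsOption.language, "eng");
        factory.setOption(BaseLinguisticsOption.emoticons, "true");
        factory.setOption(BaseLinguisticsOption.atMentions, "true");
        factory.setOption(BaseLinguisticsOption.hashtags, "true");
        factory.setOption(BaseLinguisticsOption.emailAddresses, "true");
        factory.setOption(BaseLinguisticsOption.urls, "true");
        // Emoji tokenization is always enabled; there is no option to set.
    }
}
```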
Sample input and output
Type | Input | Tokenization (when option is enabled) | Tokenization (when option is disabled) | POS tag |
---|---|---|---|---|
emoji | | | Tokenization of emoji cannot be disabled. | EMO |
emoticon | :) | :) | : ) | EMO |
@mention | @basistechnology | @basistechnology | @ basistechnology | ATMENTION |
hashtag | #basistechnology | #basistechnology | # basistechnology | HASHTAG |
email address | info@basistech.com | info@basistech.com | info @ basistech.com | EMAIL |
URL | http://www.babelstreet.com | http://www.babelstreet.com | http : / / www.babelstreet.com | URL |
The tokenization when the option is disabled depends on the language and tokenizerType options. The samples provided here are for language=eng and tokenizerType=ICU.
Emoji & Emoticon Recognition
Emoji are defined by Unicode Technical Standard #51. In tokenizing emoji, RBL recognizes the emoji presentation selector (VS16; U+FE0F) and text presentation selector (VS15; U+FE0E), which indicate if the preceding character should be treated as emoji or text.
Although RBL detects sideways, Western-style emoticons, it does not currently support Japanese-style emoticons, called kaomoji, such as (o^ ^o).
Emoji Normalization & Lemmatization
RBL normalizes emoji, placing the result into the lemma field. The simplest example is when an emoji presentation selector follows a character that is already an emoji. In this case, RBL simply removes the emoji presentation selector.
Lemmatization applies to an emoji character in multiple ways.
Emoji that depict people or body parts may be followed by an emoji modifier indicating skin tone. Lemmatization simply removes the emoji modifier skin tone from the emoji character. The reasoning is that the skin tone is of secondary importance to the meaning of the emoji.
Surface form | Lemmatized form |
---|---|
👦🏽 (Boy + Medium Skin Tone) | 👦 (Boy) |
Emoji depicting people may be followed by an emoji component indicating hair color or style. Lemmatizing removes the hair component from the emoji character.
Surface form | Lemmatized form |
---|---|
🧑‍🦰 (Adult + ZWJ + Red Hair) | 🧑 (Adult) |
Where a gender symbol has been added to create a gendered occupation emoji, lemmatization removes the gender symbol.
Surface form | Lemmatized form |
---|---|
👮‍♀️ (Police Officer + ZWJ + ♀ Female Sign + VS16) | 👮 (Police Officer) |
Finally, RBL can normalize non-fully-qualified emoji ZWJ sequences to fully-qualified emoji ZWJ sequences. In the above example, it is possible to omit the VS16 (though discouraged by Unicode): since Police Officer is an emoji, anything joined to it by a ZWJ is implicitly an emoji too. RBL adds the missing VS16.
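For intuition, the self-contained sketch below mimics the two simplest normalizations described here: removing an emoji presentation selector (VS16, U+FE0F) and a skin-tone modifier (U+1F3FB through U+1F3FF). It is an illustration of the idea only, not RBL's implementation, which also handles hair components, gender signs, and re-qualification of ZWJ sequences.

```java
// Illustration: stripping VS16 and skin-tone modifiers from an emoji
// sequence, the flavor of normalization RBL applies when lemmatizing.
public class EmojiLemmaSketch {
    static String strip(String emoji) {
        StringBuilder sb = new StringBuilder();
        emoji.codePoints().forEach(cp -> {
            boolean presentationSelector = cp == 0xFE0F;       // VS16
            boolean skinTone = cp >= 0x1F3FB && cp <= 0x1F3FF; // emoji modifiers
            if (!presentationSelector && !skinTone) {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Boy + medium skin tone -> Boy
        System.out.println(strip("\uD83D\uDC66\uD83C\uDFFD"));
    }
}
```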