Rosette Base Linguistics (RBL) Elasticsearch Plugin
Release Notes
Release 8.11.3.0
February 2024
Supports Elasticsearch 8.11.3 and includes RBL 7.47.1.c72.0
Release 8.4.0.0
October 2022
Supports Elasticsearch 8.4.0 and includes RBL 7.45.0.c67.0
Release 8.2.2.0
September 2022
Supports Elasticsearch 8.2.2 and includes RBL 7.45.0.c67.0
Release 8.1.1.0
June 2022
Supports Elasticsearch 8.1.1 and includes RBL 7.44.1.c67.0
Known Issues
Features that use TensorFlow do not work with Elasticsearch 8.1.1, including:
Hebrew disambiguation with disambiguatorType set to dnn.
Korean tokenization with tokenizerType set to spaceless_statistical.
Indonesian POS tagging.
Malaysian POS tagging.
Release 7.17.2.0
April 2022
Supports Elasticsearch 7.17.2 and includes RBL 7.43.0.c66.0
Release 7.17.1.0
March 2022
Supports Elasticsearch 7.17.1 and includes RBL 7.43.0.c66.0
Release 7.13.4.0
July 2021
Supports Elasticsearch 7.13.4 and includes RBL 7.41.1.c65.0
New
New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3443, ETROG-3465)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. The line is still considered a fragment if it is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
Release 7.13.3.0
July 2021
Supports Elasticsearch 7.13.3 and includes RBL 7.40.1.c64.1
Release 7.13.2.1
July 2022
Supports Elasticsearch 7.13.2 and includes RBL 7.44.2.c67.0
New
TensorFlow supported: This plugin supports TensorFlow. (ESPI-168)
If you will be using any features that require TensorFlow:
1. Copy the file plugins/analysis-rbl-je/analysis-rbl-je.options to config/jvm.options.d/analysis-rbl-je.options.
2. Edit the copied file, uncommenting the line that matches your operating system and CPU.
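The two setup steps above can be sketched as shell commands. This is a minimal sketch, not part of the plugin's documentation: the `ES_HOME` variable and its `/usr/share/elasticsearch` default are assumptions, so adjust them to your installation layout.

```shell
# Sketch of the TensorFlow setup steps; ES_HOME and its default are assumptions.
ES_HOME="${ES_HOME:-/usr/share/elasticsearch}"

# Step 1: copy the JVM options file shipped with the plugin into jvm.options.d/.
cp "$ES_HOME/plugins/analysis-rbl-je/analysis-rbl-je.options" \
   "$ES_HOME/config/jvm.options.d/analysis-rbl-je.options"

# Step 2: open the copied file and uncomment the line matching your OS and CPU,
# then restart Elasticsearch so the new JVM options take effect.
```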
Features that require TensorFlow include:
Hebrew disambiguation with disambiguatorType set to dnn.
Korean tokenization with tokenizerType set to spaceless_statistical.
Indonesian POS tagging.
Malaysian POS tagging.
Release 7.13.2.0
June 2021
Supports Elasticsearch 7.13.2 and includes RBL 7.40.1.c64.1
New
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
NFKC normalization is now supported for Hebrew.
We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
cat/ca-ud-train.downcased.mdl
est/et-ud-train.downcased.mdl
fas/posLemma.mdl
lav/lv-ud-train.downcased.mdl
nno/lemma.mdl
nob/lemma.mdl
slk/sk-ud-train.downcased.mdl
srp/sr-ud-train.downcased.mdl
New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
Bug fixes
Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)
Example: מע'רב
Previously:
Token{text=מע'} MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW} MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
Now:
Token{text=מע'רב} MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[], com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, tagSet=MILA_HEBREW}
Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)
Example: Input: 天津??
Previously:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=??}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}
Now:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Third-party component updates
This release includes the following third-party component changes:
| Package | Version |
| --- | --- |
| JavaCPP | 1.5.4 |

| Package | Old Version | New Version |
| --- | --- | --- |
| TensorFlow | 1.14.0 | 2.3.1 |
Release 7.10.2.0
February 2021
Supports Elasticsearch 7.10.2 and includes RBL 7.38.1.c63.0
Release 7.8.0.1
October 2024
Supports Elasticsearch 7.8.0 and includes RBL 7.47.4.c75.0
Release 7.6.1.3
January 2021
Supports Elasticsearch 7.6.1 and includes RBL 7.38.1.c63.0
New
The tools directory is now included in the RBL-Elasticsearch package. (ESPI-31)
Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
Previously: Selected lemma: δεικνύω (archaic)
Now: Selected lemma: δείχνω (modern)
New Greek disambiguator added: The new Greek disambiguator is more accurate but slower, and is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
Emoji normalization: RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode's efforts to make emoji more gender-neutral by default. (ETROG-3350)
Bug fixes
The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)
Example: Start+
Previously: Possible POS tags: PROP, ADJ, NOUN
Now: POS tag: FM
Third-party component updates
This release includes the following third-party component changes:
| Package | Old Version | New Version |
| --- | --- | --- |
| Jackson Annotations | 2.10.0 | 2.11.1 |
| Jackson Core | 2.10.0 | 2.11.1 |
| Jackson Databind | 2.10.0 | 2.11.1 |
| Jackson Dataformat XML | 2.10.0 | 2.11.1 |
| Jackson dataformats: Text | 2.10.0 | 2.11.1 |
| Jackson modules: Base | 2.10.0 | 2.11.1 |
| Protocol Buffers | 3.6.1 | 3.12.2 |
| Apache Commons IO | 2.6 | 2.7 |
| fastutil | 8.3.0 | 8.4.0 |
| Woodstox Stax2 API | 4.2 | 4.2.1 |
| SnakeYAML | 1.25 | 1.26 |
Release 7.6.1.2
November 2020
Supports Elasticsearch 7.6.1 and includes RBL 7.37.0.c62.2
New
Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)
Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)
Tokenization rule preprocessor: The preprocessor command !!btinclude is now supported in tokenization rule files, allowing rule files to include other files. (ETROG-2497)
Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)
New sample: The csc-annotate sample demonstrates using CSC with the ADM API. (ETROG-3317)
Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)
Bug fixes
Combining characters in Hebrew are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)
Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET
Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
Tokens no longer have null token types. (ETROG-3316)
When an NFKC-normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
Previously: Offsets:
صلى start 0 end 0
الله start 0 end 0
عليه start 0 end 0
وسلم start 0 end 1
Now: Offsets:
صلى start 0 end 1
الله start 0 end 1
عليه start 0 end 1
وسلم start 0 end 1