Base Linguistics (RBL)
Release Notes
Release 7.47.8.c78.0
June 2025
New
Solr and Lucene support: Solr 9.8.1 is now supported. (ETROG-3714)
Bug Fixes
We fixed a bug where long words inside long spans of text without punctuation would sometimes be tokenized incorrectly. (ETROG-3716)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.18.0 | 2.19.0 |
Guava | 33.4.0-jre | 33.4.8-jre |
Jackson | 2.18.2 | 2.19.0 |
JUnit | 5.11.4 | 5.12.2 |
JUnit Platform | 1.11.4 | 1.12.2 |
Protocol Buffers | 4.29.3 | 4.30.2 |
SnakeYAML | 2.3 | 2.4 |
Package | Version | License |
---|---|---|
Guava InternalFutureFailureAccess and InternalFutures | 1.0.3 | Apache 2.0 |
JSpecify | 1.0.0 | Apache 2.0 |
Release 7.47.7.c77.0
March 2025
New
Solr and Lucene support: Solr 8.11.4 and Lucene 8.11.4 are now supported. (ETROG-3703)
Solr and Lucene support: Solr 9.8.0 and Lucene 10.1.0 are now supported. (ETROG-3704)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
@API Guardian | 1.1.0 | 1.1.2 |
Apache Commons IO | 2.17.0 | 2.18.0 |
Apache Log4j | 2.24.1 | 2.24.3 |
Guava | 33.3.1-jre | 33.4.0-jre |
Jackson | 2.17.2 | 2.18.2 |
Jakarta Annotations API | 1.3.3 | 3.0.0 |
JavaCPP | 1.5.10 | 1.5.11 |
JUnit | 5.7.0 | 5.11.4 |
JUnit Platform | 1.7.0 | 1.11.4 |
OpenTest4J | 1.2.0 | 1.3.0 |
Protocol Buffers | 3.25.5 | 4.29.3 |
Metrics Core | 4.2.28 | 4.2.30 |
Package | Version | License |
---|---|---|
Java Architecture for XML Binding | 2.2.12 | CDDL 1.1 & GPL 2 + CE |
Release 7.47.6.c76.0
November 2024
New
Solr and Lucene support: Lucene 9.11.1 and Solr 9.7.0 are now supported. (ETROG-3695)
Java 21 support: Java 21 is now supported. Java 11 and 17 are still supported. (ETROG-3698)
Unicode update: Unicode 16.0 is now supported. (ETROG-3694)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.16.1 | 2.17.0 |
Apache Commons Lang | 3.16.0 | 3.17.0 |
Apache Log4j | 2.23.1 | 2.24.1 |
fastutil | 8.5.14 | 8.5.15 |
Guava | 33.3.0-jre | 33.3.1-jre |
JavaCPP | 1.5.8 | 1.5.10 |
Protocol Buffers | 3.25.3 | 3.25.5 |
SnakeYAML | 2.2 | 2.3 |
Woodstox | 7.0.0 | 7.1.0 |
Release 7.47.5.c74.0
September 2024
Bug Fixes
We fixed a bug where long compound words (70-100 characters) in Danish, Norwegian (Bokmål and Nynorsk), and Swedish could cause an OutOfMemoryError. (ETROG-3696)
Release 7.47.4.c75.0
September 2024
New
Neural model support improved: We upgraded TensorFlow Java to version 1.0.0-rc.1, which adds support for macOS ARM64. Neural models are now supported for macOS ARM64. (ETROG-3554)
Bug Fixes
Trailing decimal points in Chinese are no longer treated as part of a decimal fraction. When “点” or “點” ends a number, it is no longer segmented as part of the number. (ETROG-3680)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons Lang | 3.14.0 | 3.16.0 |
fastutil | 8.5.13 | 8.5.14 |
Guava | 33.2.0-jre | 33.3.0-jre |
Jackson | 2.17.1 | 2.17.2 |
TensorFlow for Java | 0.3.3 | 1.0.0-rc.1 |
Woodstox | 6.6.2 | 7.0.0 |
Release 7.47.3.c74.0
June 2024
New
Solr and Lucene support: Lucene 9.10 and Solr 9.6 are now supported. (ETROG-3683)
CLA lexicon: Two terms have been added to the CLA lexicon: 喷码机 (inkjet printer) and 管理器 (manager, in the context of software). (ETROG-3678)
Improved Readings: Readings are now returned for numeric words in Chinese and Japanese when tokenizerType is set to SPACELESS_LEXICAL. (ETROG-3684)
Bug Fixes
When tokenizerType is set to SPACELESS_LEXICAL, readings for Japanese verbs in lemma form now cover the entire token. (ETROG-3640)
Example: Input: 食べる
Previous Reading: た
Current Reading: たべる
When tokenizerType is set to SPACELESS_LEXICAL, Japanese lemmatization has been corrected for numeric tokens containing both decimal points and multiplier characters.
Example: Input: 2.5亿
Previous Lemma: 2500000000
Current Lemma: 250000000
We fixed a bug where the Chinese word “星期四” would be tokenized incorrectly in certain contexts. (ETROG-3582)
We fixed a bug where a token whose surface form was the empty string could be returned when fragmentBoundaryDetection was set to true (the default). (ETROG-3686)
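The corrected lemma for 2.5亿 above is the mantissa scaled by the multiplier character: 亿 stands for 10^8, so 2.5 × 10^8 = 250000000 (the previous output was ten times too large). The following is a minimal Java sketch of that arithmetic only. The class name, method name, and multiplier table are illustrative assumptions, not part of RBL.

```java
import java.math.BigDecimal;
import java.util.Map;

final class NumericLemmaSketch {
    // Hypothetical multiplier table: character -> power of ten.
    // U+4E07 and U+842C mean ten thousand; U+4EBF and U+5104 mean one hundred million.
    private static final Map<Character, Integer> EXPONENTS =
            Map.of('\u4E07', 4, '\u842C', 4, '\u4EBF', 8, '\u5104', 8);

    /** Returns the plain-digit value of a token made of ASCII digits plus an optional trailing multiplier. */
    static String lemma(String token) {
        char last = token.charAt(token.length() - 1);
        Integer exponent = EXPONENTS.get(last);
        if (exponent == null) {
            // No multiplier character: the digits are already the value.
            return new BigDecimal(token).toPlainString();
        }
        BigDecimal mantissa = new BigDecimal(token.substring(0, token.length() - 1));
        // Shift the decimal point right by the multiplier's exponent.
        return mantissa.scaleByPowerOfTen(exponent).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(lemma("2.5\u4EBF")); // prints 250000000
    }
}
```

Using BigDecimal rather than double keeps the result exact for decimal mantissas.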
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.15.1 | 2.16.1 |
Apache Log4j | 2.21.1 | 2.23.1 |
args4J | 2.33 | 2.37 |
Guava | 33.0.0-jre | 33.2.0-jre |
Jackson | 2.16.1 | 2.17.1 |
Protocol Buffers | 3.25.0 | 3.25.3 |
Woodstox | 4.4.1 | 6.6.2 |
Release 7.47.2.c73.0
March 2024
New
Unicode update: Unicode 15.1 is now supported. (ETROG-3595)
Solr and Lucene support: Lucene 9.9 and Solr 9.5 are now supported. (ETROG-3673)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.15.0 | 2.15.1 |
Apache Commons Lang | 3.12.0 | 3.14.0 |
fastutil | 8.5.12 | 8.5.13 |
Guava | 32.1.3-jre | 33.0.0-jre |
Guava InternalFutureFailureAccess and InternalFutures | 1.0.1 | 1.0.2 |
ICU4J | 70.1 | 74.2 |
Jackson Annotations | 2.15.3 | 2.16.1 |
Jackson Core | 2.15.3 | 2.16.1 |
Jackson Databind | 2.15.3 | 2.16.1 |
Jackson Dataformat XML | 2.15.3 | 2.16.1 |
Jackson Dataformat YAML | 2.15.3 | 2.16.1 |
Jackson Datatype Guava | 2.15.3 | 2.16.1 |
Jackson Old JAXB Annotations | 2.15.3 | 2.16.1 |
Release 7.47.1.c72.0
December 2023
New
Japanese improvements: We have improved and augmented the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3532, 3581, 3535, 3668)
Solr and Lucene support: Lucene 9.5 – 9.8 and Solr 9.4 are now supported. (ETROG-3665)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Jackson Annotations | 2.15.2 | 2.15.3 |
Jackson Core | 2.15.2 | 2.15.3 |
Jackson Databind | 2.15.2 | 2.15.3 |
Jackson Dataformat XML | 2.15.2 | 2.15.3 |
Jackson Dataformat YAML | 2.15.2 | 2.15.3 |
Jackson datatypes: collections | 2.15.2 | 2.15.3 |
Jackson Modules: Base | 2.15.2 | 2.15.3 |
Guava: Google Core Libraries for Java | 32.1.2-jre | 32.1.3-jre |
Protocol Buffers [Core] | 3.23.4 | 3.25.0 |
Apache Commons IO | 2.11.0 | 2.15.0 |
Apache Log4j API | 2.20.0 | 2.21.1 |
Apache Log4j Core | 2.20.0 | 2.21.1 |
Apache Log4j SLF4J Binding | 2.20.0 | 2.21.1 |
Stax2 API | 4.2.1 | 4.2.2 |
SnakeYAML | 2.0 | 2.2 |
Release 7.47.0.c71.0
September 2023
New
Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)
Expanded Japanese lexicon: We have expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)
Added secondary parts of speech: We've added support for secondary parts of speech to Chinese and Japanese when tokenizerType is set to spaceless_lexical. (ETROG-3636)
Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:
Readings are merged into a single reading if the readings become the same string after tone mark removal. (ETROG-3625)
Chinese readings are returned in a list. Previously, a token with multiple possible readings returned a single string with brackets and semicolons. (ETROG-3626)
Example: "蔭權"
Previous readings returned: "【yīn;yìn】quán"
Readings now returned: "yīnquán" and "yìnquán"
Solr and Lucene support: Lucene 9.5 - 9.7 and Solr 9.3 are now supported. (ETROG-3643)
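To illustrate the change in how Chinese readings are returned in this release, the sketch below expands the old bracketed form into the new list form, and shows one way to compare readings after tone mark removal. The class and method names are hypothetical, and the parsing rules are inferred only from the "蔭權" example above. These notes do not show how RBL implements either step.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class ChineseReadingSketch {
    private static final char OPEN = '\u3010';  // left lenticular bracket
    private static final char CLOSE = '\u3011'; // right lenticular bracket

    /** Expands a bracketed reading with semicolon-separated alternatives into a list of full readings. */
    static List<String> expand(String bracketed) {
        List<String> results = new ArrayList<>();
        results.add("");
        int pos = 0;
        while (pos < bracketed.length()) {
            int open = bracketed.indexOf(OPEN, pos);
            if (open < 0) {
                // No more alternative groups: append the remaining literal text.
                results = appendAll(results, bracketed.substring(pos), new String[] {""});
                break;
            }
            int close = bracketed.indexOf(CLOSE, open);
            if (close < 0) {
                throw new IllegalArgumentException("unclosed bracket in: " + bracketed);
            }
            String literal = bracketed.substring(pos, open);
            String[] alternatives = bracketed.substring(open + 1, close).split(";");
            results = appendAll(results, literal, alternatives);
            pos = close + 1;
        }
        return results;
    }

    private static List<String> appendAll(List<String> prefixes, String literal, String[] alternatives) {
        List<String> next = new ArrayList<>();
        for (String prefix : prefixes) {
            for (String alternative : alternatives) {
                next.add(prefix + literal + alternative);
            }
        }
        return next;
    }

    /** Removes the four pinyin tone marks (grave, acute, macron, caron) but keeps the diaeresis. */
    static String stripToneMarks(String reading) {
        String decomposed = Normalizer.normalize(reading, Normalizer.Form.NFD);
        String toneless = decomposed.replaceAll("[\\u0300\\u0301\\u0304\\u030C]", "");
        return Normalizer.normalize(toneless, Normalizer.Form.NFC);
    }
}
```

Two readings would be candidates for merging when `stripToneMarks` returns the same string for both. Which form the merged reading takes is not specified in these notes.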
Bug Fixes
We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)
When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before pinyin syllables that start with "a", "e", or "o" and are not the first syllable. (ETROG-3637)
We fixed a bug in the UPT-16 conversion where the part of speech of some Japanese particles was not tagged correctly. The particles are now tagged correctly as ADP. (ETROG-3526)
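The apostrophe rule for character-by-character pinyin readings (ETROG-3637) can be sketched as follows. This is an illustration of the rule as stated in these notes, not RBL code: the class and method names are hypothetical, and the use of the ASCII apostrophe (U+0027) is an assumption.

```java
import java.text.Normalizer;
import java.util.List;

final class PinyinJoinSketch {
    /** Joins per-character pinyin syllables, inserting an apostrophe before non-initial a/e/o syllables. */
    static String join(List<String> syllables) {
        StringBuilder reading = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String syllable = syllables.get(i);
            if (i > 0 && startsWithAEO(syllable)) {
                reading.append('\'');
            }
            reading.append(syllable);
        }
        return reading.toString();
    }

    /** True if the syllable's first letter is a, e, or o once any tone mark is set aside. */
    private static boolean startsWithAEO(String syllable) {
        if (syllable.isEmpty()) {
            return false;
        }
        // NFD splits a tone-marked vowel into the base letter followed by a combining mark.
        String decomposed = Normalizer.normalize(syllable.substring(0, 1), Normalizer.Form.NFD);
        char base = Character.toLowerCase(decomposed.charAt(0));
        return base == 'a' || base == 'e' || base == 'o';
    }
}
```

For example, joining the syllables xī and ān yields xī'ān, which keeps the reading distinct from the single syllable xiān.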
Known Issues
The plugin for Solr 8.11 may throw an exception when using multiple neural models.
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Jackson Annotations | 2.15.0 | 2.15.2 |
Jackson Core | 2.15.0 | 2.15.2 |
Jackson Databind | 2.15.0 | 2.15.2 |
Jackson Dataformat XML | 2.15.0 | 2.15.2 |
Jackson Dataformat T | 2.15.0 | 2.15.2 |
Jackson Datatype: Guava | 2.15.0 | 2.15.2 |
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2 |
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre |
Protocol Buffers [Core] | 3.21.7 | 3.23.4 |
Release 7.46.4.c70.0
June 2023
New
Lucene and Solr support: RBL-JE now supports Lucene 9.2 to 9.4 and Solr 9.2. (ETROG-3631)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Log4J | 2.19.0 | 2.20.0 |
fastutil | 8.5.9 | 8.5.12 |
Jackson Annotations | 2.14.0 | 2.15.0 |
Jackson Core | 2.14.0 | 2.15.0 |
Jackson Databind | 2.14.0 | 2.15.0 |
Jackson Dataformat XML | 2.14.0 | 2.15.0 |
Jackson dataformats: Text | 2.14.0 | 2.15.0 |
Jackson datatypes: collections | 2.14.0 | 2.15.0 |
Jackson modules: Base | 2.14.0 | 2.15.0 |
SnakeYAML | 1.33 | 2.0 |
Release 7.46.3.c69.0
March 2023
Bug Fixes
We fixed a bug where processing an empty string as Chinese or Japanese with tokenizerType set to spaceless_lexical would throw a NullPointerException. (ETROG-3629)
Known Issues
In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)
Release 7.46.2.c69.0
March 2023
New
We've improved the time it takes to tokenize extremely long (> 10K characters) Japanese sentences. (ETROG-3602)
Known Issues
In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)
Release 7.46.1.c68.0
December 2022
Bug Fixes
Annotating Ukrainian with universalPosTags set to true no longer results in a RosetteUnsupportedLanguageException being thrown. (ETROG-3604)
Release 7.46.0.c68.0
November 2022
New
Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)
Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)
GPU support: Features that use TensorFlow now use a GPU if available. (ETROG-3564)
Emoji support: Emoji 15.0 is now supported. (ETROG-3577)
New option for Katakana: We've added the option joinKatakanaNextToMiddleDot to control whether sequences of Japanese Katakana tokens adjacent to a middle dot should be merged into a single Katakana token. By default, it is true, which matches the behavior in previous versions of RBL-JE. (ETROG-3592)
Solr 9.1 support: Lucene and Solr 9.1 are supported. (ETROG-3597)
Bug Fixes
The Japanese POS tag VN (verbal noun) is now mapped to the UPT-16 POS tag NOUN. It was previously mapped to VERB. (ETROG-3583)
Third-party component updates
Package | Old version | New version |
---|---|---|
Apache Log4j | 2.17.1 | 2.19.0 |
fastutil | 8.5.6 | 8.5.9 |
Jackson | 2.11.1 | 2.14.0 |
JavaCPP | 1.5.8-alpha.20220614.013710.426 | 1.5.8 |
SLF4J | 1.7.33 | 1.7.36 |
SnakeYAML | 1.30 | 1.33 |
Release 7.45.0.c67.0
September 2022
New
Tagalog support:
RBL now supports Part of Speech (POS) tagging in Tagalog. (ETROG-3559)
RBL now supports lemmatization for Tagalog. (ETROG-3570)
The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)
Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)
Bug Fixes
We fixed a bug in Russian where certain uncommon consonant–vowel sequences in words in the lexicon were incorrectly replaced with more common sequences with different vowel letters. (ETROG-3541)
Example: брошюра
Previously: брошура
Now: брошюра
Release 7.44.2.c67.0
July 2022
New
An open source package (JavaCPP) has been updated to allow the Elasticsearch plugin to use TensorFlow. (ESPI-168)
Release 7.44.1.c67.0
June 2022
Bug Fixes
RBL-JE no longer crashes when loading TensorFlow fails with a NoClassDefFoundError. It now throws an exception. (ESPI-169)
Release 7.44.0.c67.0
June 2022
New
Indonesian support added: RBL now supports Part of Speech (POS) tagging in Indonesian. (ETROG-3543)
Malay (Standard) support added: RBL now supports Part of Speech (POS) tagging in Malay (Standard). (ETROG-3545)
Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)
Java 17 support added: Java 17 is now supported. Support for Java 8 and 9 has been removed. (ETROG-3524)
Solr 9 support added: RBL now supports Lucene and Solr 9. (ETROG-3549)
Solr 6 support removed: RBL no longer supports Lucene or Solr 6 or earlier. (ETROG-3519)
Bug Fixes
In Japanese, negative forms of ichidan verbs written all in hiragana are no longer lemmatized to end with “なう”. (ETROG-3534)
Example: Input: くれない
Previously: lemmatized to くれなう
Now: lemmatized to くれる
Third-party component updates
This release includes the following third-party component changes:
Package | Version | License |
---|---|---|
Jakarta Annotations API | 1.3.3 | Eclipse Public License 2.0 and GPL 2 with classpath exception |
Release 7.43.0.c66.0
February 2022
Notice
Solr 6 and earlier support is deprecated as of this release.
Java 8 and Java 9 support is deprecated as of this release.
New
Solr 8.11 support: This release supports Solr 8.11. (ETROG-3502)
Deprecated methods: Token#getType has been deprecated as token types are not used in RBL-JE without the Lucene/Solr plugins and the plugins use a different API. (ETROG-3503)
Solr 6 support deprecated: Support for Solr versions 6.x and earlier is deprecated as of this release and will be removed in the next version.
Permission changes: We removed group and other write permissions from model files. All files are now only writable by the owner. (ETROG-3516)
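If you copy or repackage model files and want to keep them writable only by the owner, as described in the permission change above, a check along these lines may help. It is a sketch using the standard java.nio.file API on a POSIX file system. The class and method names are hypothetical and nothing here is part of RBL.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import java.util.Set;

final class ModelPermissionSketch {
    /** Returns a copy of the given permissions with group and other write access removed. */
    static Set<PosixFilePermission> ownerWritableOnly(Set<PosixFilePermission> current) {
        Set<PosixFilePermission> result = EnumSet.noneOf(PosixFilePermission.class);
        result.addAll(current);
        result.remove(PosixFilePermission.GROUP_WRITE);
        result.remove(PosixFilePermission.OTHERS_WRITE);
        return result;
    }

    /** True if neither the group nor others can write to the file. */
    static boolean isOwnerWritableOnly(Path file) throws IOException {
        Set<PosixFilePermission> permissions = Files.getPosixFilePermissions(file);
        return !permissions.contains(PosixFilePermission.GROUP_WRITE)
                && !permissions.contains(PosixFilePermission.OTHERS_WRITE);
    }

    /** Applies the restricted permission set to a file in place. */
    static void restrict(Path file) throws IOException {
        Files.setPosixFilePermissions(file, ownerWritableOnly(Files.getPosixFilePermissions(file)));
    }
}
```

The path-based methods throw UnsupportedOperationException on file systems without POSIX permissions.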
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.7 | 2.11.0 |
Apache Commons Lang | 2.6 | 3.12.0 |
Apache Log4j | 1.2.17 | 2.17.1 |
ICU4J | 59.1 | 70.1 |
fastutil | 8.4.0 | 8.5.6 |
SLF4J | 1.7.28 | 1.7.33 |
SnakeYAML | 1.26 | 1.30 |
TensorFlow for Java | 0.2.0 | 0.3.3 |
Release 7.42.2.c65.0
November 2021
Bug Fixes
RBL no longer crashes in Arabic when emoticons is enabled. This fixes a bug introduced in 7.42.1. (ETROG-3493)
Release 7.42.1.c65.0
November 2021
Bug Fixes
In Korean, emoji and other language-neutral tokens no longer cause a ClassCastException to be thrown when using the ADM API. (ETROG-3488)
Release 7.42.0.c65.0
November 2021
New
Deprecated factories: TokenizerFactory, AnalyzerFactory, and CSCAnalyzerFactory have been deprecated in favor of BaseLinguisticsFactory. (ETROG-3453)
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts, for Japanese with tokenizerType set to spaceless_lexical. (ETROG-3474)
Example: Input: 三・一四
Previously: tokenization: 三 / ・ / 一四
Now: tokenization: 三・一四
Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)
Emoji support: We now support the emoji in Unicode 14.0. (ETROG-3476)
Japanese tokenization: In Japanese, when tokenizerType is set to spaceless_lexical, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. This is consistent with the default algorithm, spaceless_statistical. (ETROG-3475)
Solr 8.10 support: This release supports Solr 8.10. (ETROG-3482)
Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuation, and symbols instead of being marked as unknown or given some other tag. This applies to all languages with POS tags. (ETROG-3481)
Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)
Bug Fixes
In Japanese, when tokenizerType is set to spaceless_lexical, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
We've reverted two of the POS changes made in version 7.39.0.c63.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)
The values are now:
“|以” (Chinese) “|” tagged as PUNCT
“2对” (Chinese) “对” tagged as NM
RBL-JE no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)
In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
Release 7.41.1.c65.0
July 2021
Bug Fixes
Enabling universalPosTags for Indonesian, Tagalog, or Standard Malay no longer throws a RosetteUnsupportedLanguageException. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3465)
Release 7.41.0.c65.0
July 2021
New
New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
Solr 8.9 support: This release supports Solr 8.9. (ETROG-3457)
Release 7.40.1.c64.1
May 2021
Bug Fixes
We fixed a bug where enabling alternativeSpanishDisambiguation for Spanish caused a NullPointerException to be thrown. (ETROG-3435)
We fixed a bug where setting disambiguatorType to DNN for Hebrew caused a RosetteRuntimeException to be thrown. (ETROG-3437)
Release 7.40.0.c64.1
May 2021
New
New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
Bug Fixes
Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)
Example: Input: 天津??
Previously:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=??}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}
Now:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Third-party component updates
This release includes the following third-party component changes:
Package | Version |
---|---|
JavaCPP | 1.5.4 |
Package | Old Version | New Version |
---|---|---|
TensorFlow | 1.14.0 | 2.3.1 |
Release 7.39.0.c63.0
March 2021
New
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
NFKC normalization is now supported for Hebrew.
We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Solr 8.8: This release supports Solr 8.8. (ETROG-3369)
Improved CSCAnnotator output: The CSCAnnotator now emits tokens in addition to translations, even if no tokens were specified in the input. (ETROG-3356)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
cat/ca-ud-train.downcased.mdl
est/et-ud-train.downcased.mdl
fas/posLemma.mdl
lav/lv-ud-train.downcased.mdl
nno/lemma.mdl
nob/lemma.mdl
slk/sk-ud-train.downcased.mdl
srp/sr-ud-train.downcased.mdl
Bug Fixes
Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)
Example: מע'רב
Previously:
Token{text=מע'} MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW} MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
Now:
Token{text=מע'רב} MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[], com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, tagSet=MILA_HEBREW}
A structured region containing two new lines is now properly labeled as STRUCTURED. Previously, the layout region would be labeled as UNSTRUCTURED. (ETROG-3378)
Example: * item\n* item\n* item\n\n
Previously:
{"startOffset": 0,"endOffset": 14,"layout": "STRUCTURED"} {"startOffset": 14,"endOffset": 22,"layout": "UNSTRUCTURED"}
Now:
{"startOffset": 0,"endOffset": 22,"layout": "STRUCTURED"}
Release 7.38.1.c63.0
January 2021
New
RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
Bug Fixes
We reverted some changes to Korean disambiguation from the 7.37.0.c62.2 release as the changes introduced new disambiguation errors. (ETROG-3349)
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Jackson Annotations | 2.10.0 | 2.11.1 |
Jackson Core | 2.10.0 | 2.11.1 |
Jackson Databind | 2.10.0 | 2.11.1 |
Jackson Dataformat XML | 2.10.0 | 2.11.1 |
Jackson dataformats: Text | 2.10.0 | 2.11.1 |
Jackson modules: Base | 2.10.0 | 2.11.1 |
Protocol Buffers | 3.6.1 | 3.12.2 |
Apache Commons IO | 2.6 | 2.7 |
fastutil | 8.3.0 | 8.4.0 |
Woodstox Stax2 API | 4.2 | 4.2.1 |
SnakeYAML | 1.25 | 1.26 |
Release 7.38.0.c62.2
December 2020
New
Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
Previously: Selected lemma: δεικνύω (archaic)
Now: Selected lemma: δείχνω (modern)
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
Deprecated classes: The classes BufferWordBreaker and WordBreakResults have been deprecated. (ETROG-3318)
Bug Fixes
The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)
Example: Start+
Previously: POS tags: possible PROP, ADJ, NOUN
Now: POS tag: FM
GenericTokenizer#hasNext is now implemented to be consistent with the documentation for Iterator#hasNext. Previously it always returned false. (ETROG-2140)
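The Iterator#hasNext contract referred to in the GenericTokenizer fix above says that hasNext returns true exactly when next would return an element rather than throw. The following sketch shows a conforming iterator over whitespace-separated tokens. It is a stand-in written for illustration: WhitespaceTokenIterator is a hypothetical class, not GenericTokenizer, and it does not reflect how RBL tokenizes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

final class WhitespaceTokenIterator implements Iterator<String> {
    private final String[] tokens;
    private int nextIndex = 0;

    WhitespaceTokenIterator(String text) {
        String trimmed = text.trim();
        // Splitting an empty string would yield one empty token, so handle it explicitly.
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        // True exactly when next() would succeed.
        return nextIndex < tokens.length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens[nextIndex++];
    }
}
```

A hasNext that always returns false makes the usual `while (it.hasNext())` loop skip every token, which is why the earlier behavior was a bug even though next itself worked.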
Release 7.37.0.c62.2
November 2020
New
Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)
Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)
Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, allowing rule files to include other files. (ETROG-2497)
Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)
New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)
Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)
Bug Fixes
Combining characters in Hebrew are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)
Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET
Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
Tokens no longer have null token types. (ETROG-3316)
When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
Previously: Offsets:
صلى start 0 end 0
الله start 0 end 0
عليه start 0 end 0
وسلم start 0 end 1
Now: Offsets:
صلى start 0 end 1
الله start 0 end 1
عليه start 0 end 1
وسلم start 0 end 1
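The NFKC offset example above depends on the fact that a single character can normalize to several words. The sketch below uses the standard java.text.Normalizer to show the expansion of U+FDFA and reports each resulting word against the span of the one original character, mirroring the corrected offsets. The class and method names are hypothetical. The code assumes the whole input is a single character that expands under NFKC, and it does not call RBL.

```java
import java.text.Normalizer;

final class NfkcOffsetSketch {
    /** Returns the space-separated words produced by NFKC-normalizing the input. */
    static String[] normalizedWords(String original) {
        return Normalizer.normalize(original, Normalizer.Form.NFKC).split(" ");
    }

    /** Lists each normalized word with the offsets of the original text it came from. */
    static String describe(String original) {
        // Assumption: every word comes from the same original span, [0, original.length()).
        int start = 0;
        int end = original.length();
        StringBuilder report = new StringBuilder();
        for (String word : normalizedWords(original)) {
            report.append(word).append(" start ").append(start).append(" end ").append(end).append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) {
        // U+FDFA normalizes to four Arabic words under NFKC.
        System.out.print(describe("\uFDFA"));
    }
}
```

Because offsets refer to the original text, all four tokens share the span 0 to 1. A token with start 0 and end 0 would be empty in the original text, which is what the fix avoids.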
Release 7.36.0.c62.2
September 2020
New
Lucene/Solr: Versions up through 8.6.0 are now supported. (ETROG-3250)
Decompose compounds: The option to control decomposition of compounds is now available in Dutch, German, Hungarian, Danish, Bokmål, Nynorsk, Swedish, and Korean. The default for decomposeCompounds is true. (ETROG-3263, ETROG-3264, ETROG-3265)
Performance improvement: English and Spanish disambiguation is now faster. The alternative disambiguation option (alternativeEnglishDisambiguation or alternativeSpanishDisambiguation) must be set to false. (ETROG-3246, ETROG-3243)
Bug Fixes
In Hebrew, prefixes in some acronym tokens are now listed correctly in the list of prefixes, instead of being duplicated in the lemma. (ETROG-3214)
Example: “ומש"ס”
Previously: lemma: “ומומש"ס”, empty prefix list
Now: lemma: “ש"ס”, prefix list = [“ו”, “מ”]
Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled. (ETROG-3241)
Example: "a very very very very long line\nshort\n\n"
Previously: 2 sentences
{"startOffset":0,"endOffset":20} {"startOffset":20,"endOffset":27}
Now: 1 sentence
{"startOffset":0,"endOffset":26}
In Hebrew, lemmas starting or ending with spaces now have the spaces removed. (ETROG-3248)
Example: "אאורקה"
Previously: “אאורקה ”
Now: "אאורקה"
Analysis of unknown Hebrew words with guessed prefixes no longer have duplicate prefixes in their prefix list. (ETROG-3253)
Example: "בפיירפוקס"
Previously: prefix list: [ב, ב]
Now: prefix list: [ב]
In Chinese and Japanese, the system no longer crashes when both fragmentBoundaryDetection and alternativeTokenization are enabled. (ETROG-3260)
In Japanese, adjacent tokens are no longer erroneously joined when alternativeTokenization is enabled. (ETROG-3261)
When universalPosTags is enabled, the UPT-16 POS tags are now marked as having the tag set UPT16_V1 instead of the default tag set of the language. (ETROG-3273)
Example: French
Previously: tag set: BT_FRENCH
Now: tag set: UPT16_V1
We've fixed the tokenize-analyze example in the samples directory. It now correctly produces results for Hebrew analysis. (ETROG-3252)
Release 7.35.0.c62.2
July 2020
New Features
Layout regions added: Layout regions, describing each section of input text as STRUCTURED or UNSTRUCTURED, are now identified by the annotator. In order to detect layout regions, fragment boundary detection must be enabled. (ETROG-3172)
New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)
Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)
Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)
Hebrew user-defined dictionaries added: Static and dynamic user-defined Hebrew analysis dictionaries are now supported. (ETROG-3230)
Deprecated method: HebrewAnalysis#characteristicString is now deprecated. (ETROG-3209)
Order of user-defined dictionaries: The order in which user-defined dictionaries are consulted has been standardized. Refer to the RBL-JE Application Developer's Guide for details. (ETROG-3148)
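As a rough illustration of the short-line test behind maxTokensForShortLine, the sketch below counts tokens on a line and compares the count with the limit. It rests on two assumptions that these notes do not confirm: tokens are approximated by splitting on whitespace (RBL uses its own tokenizer, which matters for languages without spaces), and a line whose token count equals the limit still counts as short. The class and method names are hypothetical.

```java
final class ShortLineSketch {
    // Default value of maxTokensForShortLine stated in these release notes.
    static final int DEFAULT_MAX_TOKENS_FOR_SHORT_LINE = 6;

    /** Counts whitespace-separated tokens; a stand-in for real tokenization. */
    static int countTokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** True if the line has at most maxTokens tokens (inclusive limit is an assumption). */
    static boolean isShortLine(String line, int maxTokens) {
        return countTokens(line) <= maxTokens;
    }

    public static void main(String[] args) {
        int limit = DEFAULT_MAX_TOKENS_FOR_SHORT_LINE;
        System.out.println(isShortLine("short", limit));                           // prints true
        System.out.println(isShortLine("a very very very very long line", limit)); // prints false
    }
}
```

With the default of 6, the seven-token line is not short, while the one-token line is.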
Bug Fixes
Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This occurred only in some languages, and only when fstTokenize was enabled. (ETROG-3159)
Example: "1\n234" (embedded newline within the number string)
Previously: "1 234" (1 token)
Now: "1" "234" (2 tokens)
This example assumes fstTokenize is enabled and the language is French.
Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)
Tokens with digits are now eligible for the Greek guesser. (ETROG-3231)
Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags
Now: "HDMI1" gets FM POS tag
In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only occurred when the guessHebrewPrefixes option was set to true. (ETROG-3221)
Example: "ומפיפרנו"
Previously: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag preposition.
Now: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag unknown.
Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)
Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"
Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"
Now: One analysis, lemmatized to "разложить"
German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)
Example: Input "Mitbehandlung"; "mit" is a separable prefix
Previously: Lemmatized to "mitBehandlung"
Now: Lemmatized to "Mitbehandlung"
In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)
Example: "1" (keycap)
Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.
Now: Tokenized to one token, "1"
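The keycap example above is a two-code-point sequence. As an illustration only, using Python's standard `unicodedata` module rather than any RBL API, the sketch below lists the two code points and shows that the second one is an enclosing combining mark, which is why it belongs in the same token as the digit it follows. The helper name `describe` is hypothetical.

```python
import unicodedata

# The keycap sequence from the example: DIGIT ONE followed by
# COMBINING ENCLOSING KEYCAP. Illustration only; this is not RBL code.
keycap = "\u0031\u20E3"


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


for row in describe(keycap):
    print(row)
```

The general category `Me` (enclosing mark) on the second row marks it as a combining character that attaches to the preceding base character.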
Release 7.34.2.c62.2
May 2020
New Features
Hebrew tokens that have prefixes but not stems now get appropriate parts of speech. Previously, they got the POS tag "unknown". (ETROG-3207)
Example: “ה” from the string “ה70”
Previously: POS tag "unknown"
Now: POS tag "quantifier"
Lucene/Solr up through version 8.5.1 is now supported. (ETROG-3208)
When `guessHebrewPrefixes` is `true`, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes. (ETROG-3188)
Example: Token: "ומפיפרנו"
Previously: 2 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
Now: 3 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
hebrewPrefixes=[] lemma=ומפיפרנו
Bug Fixes
Minimally-qualified emoji are no longer split apart. (ETROG-3185)
Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)
Previously: U+1F481 and <U+200D, U+2642> (2 tokens)
Now: <U+1F481, U+200D, U+2642> (1 token)
Capitalized nouns are no longer being detected as verbs. (ETROG-3186)
Example: The noun "Service" from the phrase "Price and Quality of Service"
Previously: POS tag VI (infinitive or imperative verb)
Now: POS tag PROP (proper noun)
When creating multiple analyzers for Chinese, Japanese, or Thai with `alternativeTokenization` set to `false` (the default), the analyzers will now share the same model data. This reduces memory usage when creating multiple analyzers. (ETROG-3200)
Note: While memory usage has been improved, the process is still memory intensive. If RBL throws an `OutOfMemoryError`, increase the heap space.
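For the minimally-qualified emoji fix in this release (ETROG-3185), it may help to see how the minimally-qualified sequence differs from the fully-qualified one. The sketch below is plain Python, not RBL code: it spells out the three code points named in the note and the fully-qualified form of the same emoji, which appends U+FE0F VARIATION SELECTOR-16. The constant and function names are hypothetical.

```python
# Code points for "man tipping hand" as given in the release note.
# The minimally-qualified form omits the trailing U+FE0F that the
# fully-qualified form carries. Illustration only; this is not RBL code.
MINIMALLY_QUALIFIED = "\U0001F481\u200D\u2642"
FULLY_QUALIFIED = MINIMALLY_QUALIFIED + "\uFE0F"


def code_points(text):
    """Format each character of text as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(MINIMALLY_QUALIFIED))
print(code_points(FULLY_QUALIFIED))
```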
Release 7.34.1.c62.2
March 2020
Bug Fixes
We removed some incorrect entries from the Hebrew lexicon that were added in 7.34.0.c62.2. (ETROG-3182, ETROG-3183)
Release 7.34.0.c62.2
March 2020
New Features
Lucene/Solr: RBL-JE now supports Lucene/Solr up through version 8.4.1. (ETROG-3156)
Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized. (ETROG-3164)
Additional emoji support: Emoji hair components are now lemmatized. (ETROG-3167)
German professions: Additional German professions have been added to the German lexicon. (ETROG-3163)
Spanish performance improvements: Spanish disambiguation is now faster when `alternativeSpanishDisambiguation` is `false`. (ETROG-3169)
Hebrew lemmatization: We increased proper noun coverage in the Hebrew lexicon. (ETROG-3161, ETROG-3162)
Bug Fixes
Low surrogates are no longer stripped from the ends of tokens in Hebrew. (ETROG-3165)
Number tokens with embedded spaces are no longer split into multiple tokens when preceded or followed by a symbol when `fstTokenize` is `true`. (ETROG-3158)
Previously: $800 000 000 was tokenized as two tokens: $800 <TokenBoundary> 000 000
Now: $800 000 000 is tokenized as a single token: $800 000 000
Release 7.33.0.c62.2
January 2020
New Features
The delimiters for the fragment boundary detector are now configurable. (ETROG-3116)
The fragment boundary detector now marks a boundary after any spaces following the fragment boundary delimiter. (ETROG-3116)
An underscore (U+005F) is no longer treated as a token separator in German when `fstTokenize` is enabled. (ETROG-3144)
Bug Fixes
We fixed a bug where tokens from multi-script Russian text sometimes had incorrect offsets if `fstTokenize` was enabled. (ETROG-3142)
We fixed a bug where multi-script Russian text would have a sentence break each time the script changed. (ETROG-3145)
We fixed a bug where there were unexpected sentence breaks after some short lines not ending in whitespace. (ETROG-3146)
We fixed a bug where sentence breaks were missing when the sentence break did not align with a token boundary. (ETROG-3140)
Release 7.32.0.c62.1
December 2019
New Features
Added support for Lucene/Solr up through version 8.3.0. (ETROG-3128)
Added support for tokenizing and lemmatizing Latvian. (ETROG-2798)
Latin-script regions within Russian documents are now tokenized and analyzed as English. (ETROG-3126)
`TokenizerOption.licenseString`, `AnalyzerOption.licenseString`, and `BaseLinguisticsOption.licenseString` may now be passed into a `create` method. Previously, these options had to be set on the factory itself. (ETROG-3134)
Bug Fixes
We fixed a bug where guessed German compounds were sometimes lemmatized as verbs but tagged as nouns. (ETROG-3094)
We fixed a bug where the fragment boundary detector would mark a sentence break after every Windows newline. (ETROG-3133)
Release 7.31.0.c62.0
November 2019
New Features
The Hebrew files `dinflections.bin`, `dprefixes.data`, and `gimatria.data` have been moved from the `root/models` directory to `root/dicts/heb`. (ETROG-3088)
Specifying the `universalPosTags` option now adds the `deliverExtendedTags` option as well. (ETROG-2185)
Dynamic user dictionaries can now be created and populated at runtime. See the section User-Defined Dictionaries in the Application Developer's Guide for details. (ETROG-3086, ETROG-3100, ETROG-3109, ETROG-3110, ETROG-3111)
Fragment boundary detection is now enabled by default. Previously it was disabled by default. (ETROG-3108)
`TokenizerOption.alternativeTokenizationOptions` has been deprecated in favor of a separate option for each YAML key. See the Javadoc for details. (ETROG-3109)
The UPT-16 files `upt-16-pes.yaml` and `upt-16-prs.yaml` have been removed from the distribution package, as they were unused. (ETROG-3122)
The `-order` option in `rbl-build-csc-dictionary` has been removed. All dictionaries are now built as LE, as LE dictionaries still work on BE machines. (ETROG-3120)
We've added imperative forms for 2000 verbs to the Arabic lexicon. (ETROG-3090)
Bug Fixes
Fragment boundary detection is now enabled for Hebrew. (ETROG-1442)
When lemmatizing numbers in Russian, numbers containing spaces will now be lemmatized without the space. For example, "1 234" will now be lemmatized as "1234" instead of "1 234". (ETROG-3101)
We fixed a bug introduced in 7.30.1.c61.0 which raised an `ArrayIndexOutOfBoundsException` when processing Japanese with `alternativeTokenization` and `favorUserDictionary` set to `true`. (ETROG-3118)
We fixed a bug where a middle dot would be ignored if it preceded white space when using `alternativeTokenization` in Japanese. (ETROG-3113)
Third-party component updates
Component | Version | Change |
---|---|---|
Apache Commons IO | 2.6 | Version upgrade |
args4j | 2.33 | Version upgrade |
fastutil | 8.3.0 | Version upgrade |
Jackson Annotations | 2.10.0 | Version upgrade |
Jackson Core | 2.10.0 | Version upgrade |
Jackson Databind | 2.10.0 | Version upgrade |
Jackson Dataformat XML | 2.10.0 | Version upgrade |
Jackson dataformats: Text | 2.10.0 | Version upgrade |
Jackson datatypes: collections | 2.10.0 | Version upgrade |
Jackson modules: Base | 2.10.0 | Version upgrade |
SLF4J | 1.7.28 | Version upgrade |
SnakeYAML | 1.25 | Version upgrade |
TensorFlow for Java | 1.14.0 | Version upgrade |
Woodstox | 4.4.1 | Version upgrade |
Woodstox Stax2 API | 4.2 | Version upgrade |
Release 7.30.2.c61.0
September 2019
Bug Fixes
We fixed a bug where an `AssertionError` might be thrown when analyzing Hungarian with Java assertions enabled.
Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.
Previously: Аполлона-11 (Apollo-11) was tagged as PROP, MISC, and NOUN
Now: Аполлона-11 (Apollo-11) is tagged as NOUN
Correct token offsets are now returned from a Japanese annotator where a non-katakana character precedes a user-defined katakana token and `alternativeTokenization` and `favorUserDictionary` are enabled.
We fixed a bug where constructors of factory classes in the Lucene/Solr plugin would throw an `UnsupportedOperationException` if passed a `Map` that did not support the `remove` method.
Release 7.30.1.c61.0
August 2019
New Features
Added support for Lucene/Solr up through version 8.2.0.
Dictionaries and models that are used on both big- and little-endian machines no longer include `LE` in their file names.
Bug Fixes
When `alternativeTokenization` was set to `true`, the Chinese tokenizer could create tokens at the end of the input string with the part of speech NT without checking that the context was valid for NT.
Analyzing Chinese and Japanese with `alternativeTokenization` enabled is now much faster on sentences that are thousands of characters long.
Release 7.30.0.c61.0
August 2019
New Features
Segmentation user dictionaries can be used for all languages, not just Chinese, Japanese, and Thai.
The option `compoundComponentSurfaceForms` has been added to return the surface forms of the components of compound words. By default, RBL-JE only returns the lemmas.
Added support for Lucene/Solr up through version 8.1.1.
Some Polish words ending in “-cku”, “-ska”, or “-sku” are lemmatized to forms ending in “-cki” or “-ski”.
Bug Fixes
The Japanese POS tag `NE` was not converted correctly to UPT-16.
The French POS tag `CONJQUE` was converted to UPT-16 `CONJ` instead of the more appropriate `SCONJ`.
When `alternativeTokenization` was disabled, Chinese punctuation was tagged as `GUESS` instead of `PUNCT` or `EOS`.
Release 7.29.0.c61.0
June 2019
New Features
Setting `alternativeTokenization` to `true` enables an alternative tokenizer for Thai, for parity with the Thai tokenizer in Basis Technology's C++ API (RLP).
All Hebrew tokens have analyses. The main change was adding the new part of speech `punctuation`. Non-punctuation tokens that formerly had empty analysis lists now have the part of speech `unknown`.
Third-party component updates
Component | Version | Change |
---|---|---|
Jackson Annotations | 2.9.8 | Version upgrade |
Jackson Core | 2.9.8 | Version upgrade |
Jackson Databind | 2.9.8 | Version upgrade |
Jackson Dataformat XML | 2.9.8 | Version upgrade |
Jackson Dataformat YAML | 2.9.8 | Version upgrade |
Jackson Datatype Guava | 2.9.8 | Version upgrade |
Jackson Module JAXB Annotations | 2.9.8 | Version upgrade |
Protocol Buffers | 3.6.1 | Version upgrade |
SnakeYAML | 1.23 | Version upgrade |
Release 7.28.2.c60.0
June 2019
New Features
U+2019 RIGHT SINGLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are now normalized to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew words, to match the normalization of U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM.
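The normalization described above amounts to a four-entry character mapping. The following is a minimal sketch of that mapping in plain Python, not RBL's implementation: RBL applies the normalization only inside Hebrew words, whereas this sketch, for simplicity, applies it to the whole input string. The names `HEBREW_QUOTE_MAP` and `normalize_quotes` are hypothetical.

```python
# Character mapping taken from the release notes. Simplifying assumption:
# the mapping is applied to the entire string, not only to Hebrew words.
HEBREW_QUOTE_MAP = str.maketrans({
    "\u2019": "\u0027",  # RIGHT SINGLE QUOTATION MARK -> APOSTROPHE
    "\u201D": "\u0022",  # RIGHT DOUBLE QUOTATION MARK -> QUOTATION MARK
    "\u05F3": "\u0027",  # HEBREW PUNCTUATION GERESH -> APOSTROPHE
    "\u05F4": "\u0022",  # HEBREW PUNCTUATION GERSHAYIM -> QUOTATION MARK
})


def normalize_quotes(text):
    """Replace the four quotation variants with their ASCII targets."""
    return text.translate(HEBREW_QUOTE_MAP)
```

With this mapping, a word written with GERSHAYIM and the same word written with a right double quotation mark both normalize to the form with U+0022.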
Release 7.28.1.c60.0
May 2019
New Features
Updated the English lexicon.
Added support for Lucene/Solr up through version 8.0.0.
Updated the German lexicon.
Updated the Swedish lexicon.
Arabic analysis will attempt to replace leading hamzated alefs with plain alefs for unrecognized tokens.
Bug Fixes
The surface forms of Hebrew tokens consisting of multiple prefixes without a base, like “מה”, are now the entire token text, instead of just the first prefix.
Russian hyphenated words that end in numbers, like “Аполлона-11”, are no longer tagged as DIG. They are now tagged with the same parts of speech they had before 7.27.2.c60.0.
Closing parentheses, brackets, and braces that follow URLs when urls is enabled are no longer merged into the URLs.
The disambiguator is now more likely to select analyses with the POS tags ATMENTION, EMAIL, HASHTAG, and URL over other analyses.
When the Hebrew tokenizer encounters a character not used in Hebrew immediately following a character used in Hebrew, it starts a new token. Formerly, it would delete that character and any following characters up to the next token separator (e.g. white space).
RBL-JE can now successfully read in ICU tokenization rule files that begin with a BOM.
Hebrew tokens consisting of multiple prefixes without a base are now tagged with the part of speech “unknown”, to match single-prefix tokens.
The English token “than” is tagged only as COTHAN. The candidate part of speech COORD has been removed for this token.
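The BOM fix in this release concerns rule files that start with a byte order mark. As a general illustration of the issue, and not of how RBL-JE reads its files, the sketch below decodes bytes so that a leading UTF-8 BOM is dropped instead of being passed on as the first character. It assumes the file is UTF-8 encoded, which the note does not state; the function name `decode_without_bom` is hypothetical.

```python
import codecs


def decode_without_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading byte order mark if present.

    Assumption: the input is UTF-8. The "utf-8-sig" codec accepts input
    with or without a BOM and never returns the BOM character.
    """
    return data.decode("utf-8-sig")


# A rule file body with and without a leading BOM decodes to the same text.
with_bom = codecs.BOM_UTF8 + b"!!chain;\n"
without_bom = b"!!chain;\n"
print(decode_without_bom(with_bom) == decode_without_bom(without_bom))
```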
Release 7.28.0.c60.0
May 2019
New Features
A perceptron-based disambiguator is available for Hebrew. It is used by default and when the option `disambiguatorType` is set to `DisambiguatorType.PERCEPTRON`. It was measured to have higher lemma and part of speech accuracies than the alternatives. To use the previous default, set `disambiguatorType` to `DisambiguatorType.DICTIONARY`.
Added support for Lucene/Solr up through version 7.6.0.
Running on Java 11 is now supported.
Bug Fixes
Some white space characters could be part of Chinese tokens when alternativeTokenization was enabled.
Tokens that are thousands of characters long slow down the tokenizer.
Polish tokens that can appear in multiword expressions are no longer lemmatized to the full expressions. For example, “dzień” is not lemmatized to “dzień_dobry”.
The non-final components of Russian compound words with more than one hyphen were not lemmatized. The non-final components of Russian hyphenated compound words with the interfix “е” or “о” that coincidentally looked like the short forms of adjectives were lemmatized as if they were short forms.
RBL-JE Release Note Archive 7.27.0.c60.0 and earlier
Release 7.27 and earlier
New Features
Release 7.27.0.c60.0
The Chinese Script Converter must be licensed distinctly from the rest of RBL. Old licenses won’t work for it anymore. (ETROG-2916)
Lemmatization is supported for Persian. (ETROG-2924)
A dictionary-based disambiguator is available for Hebrew and is now the default. To run disambiguation in TensorFlow, set the option `disambiguatorType` to `DisambiguatorType.DNN`. (ETROG-2928)
Release 7.26.5.c59.3
The tokenizer recognizes "百度" as a single token when `alternativeTokenization` is enabled. (ETROG-2909)
Release 7.26.3.c59.3
The tokenizer emits the normalized surface form as the lemma when `alternativeTokenization` is enabled. (ETROG-2892)
Release 7.26.0.c59.3
Analyzing German tokens with default ignorable code points, including U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER, produces the same analyses as if the tokens did not include those characters. (ETROG-2824)
Improved the lemma accuracy of the Spanish disambiguator. (ETROG-2856)
Improved disambiguation of English proper nouns. (ETROG-2867)
The North Korean (qkp) and South Korean (qkr) dialects are both treated as Korean (kor). (ETROG-2878)
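The German change for default ignorable code points (ETROG-2824) means that a token analyzes the same way with or without those characters. One way to picture the effect, as a sketch under stated assumptions rather than RBL's mechanism, is to remove the three code points the note names before comparing tokens. The full Unicode default-ignorable set is larger than these three; the names `IGNORABLE` and `strip_ignorables` are hypothetical.

```python
# Only the three code points named in the release note. Mapping a code
# point to None in a translation table deletes it from the string.
IGNORABLE = {
    0x00AD: None,  # SOFT HYPHEN
    0x200C: None,  # ZERO WIDTH NON-JOINER
    0x200D: None,  # ZERO WIDTH JOINER
}


def strip_ignorables(token):
    """Return the token with the listed ignorable code points removed."""
    return token.translate(IGNORABLE)


# A German word with an embedded soft hyphen compares equal to the
# plain spelling once the ignorable code point is removed.
print(strip_ignorables("Sil\u00ADben") == "Silben")
```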
Release 7.25.0.c59.3
Additional lemma dictionary for each of the two Norwegian languages. (ETROG-2797)
Added support for Lucene/Solr 7.3.1 through 7.4.0. (ETROG-2862)
Release 7.24.6.c59.2
Added support for Lucene/Solr 7.2.1 through 7.3.1. (ETROG-2842)
Release 7.24.3.c59.2
Added support for TensorFlow on more CPUs.
Release 7.24.0.c59.2
Added support for tokenizing and lemmatizing Catalan, Estonian, Serbian, and Slovak. (ETROG-2752, ETROG-2774)
Release 7.23.1.c59.0
Improved the accuracy of the Hebrew disambiguator. (ETROG-2718)
Running on Java 9 is now supported. (ETROG-2722)
Release 7.23.0.c59.0
Added support for Lucene/Solr 7.0.0 through 7.1.0. (ETROG-2706)
POS-tagging and disambiguation are supported for Hebrew. (ETROG-2707, ETROG-2717)
Release 7.22.2.c59.0
Added support for Lucene/Solr 7.0.0 through 7.2.1. (ETROG-2751)
Release 7.22.0.c59.0
German words that are completely uppercase are guessed as acronyms. (ETROG-2684)
Release 7.21.2.c59.0
Modified internal dependency structure. (ETROG-2677)
Release 7.21.1.c59.0
Updated the compatibility number to 59.0. (ETROG-2609)
Release 7.21.0.c58.3
Added the `ArabicMorphoAnalysis` class to allow an Annotated Data Model application to get more information for Arabic, Persian, and Urdu text than the `MorphoAnalysis` class would provide. (ETROG-2623)
Improved speed and memory footprint for English and Spanish disambiguation. (ETROG-2607, ETROG-2618, ETROG-2635)
Added the `alternativeEnglishDisambiguation` and `alternativeSpanishDisambiguation` options to specify the use of the old disambiguator in English and Spanish. The new disambiguator, introduced in version 7.18.0.c58.3, and enhanced in the current release, is more accurate, but slower. (ETROG-2626)
Added the `guessHebrewPrefixes` option to control whether to split possible prefixes off unknown Hebrew words. (ETROG-2642)
Normalized U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew. (ETROG-2647)
Filter out punctuation from Lucene/Solr when `query` is set. (ETROG-2648)
Added support for Lucene/Solr 6.6. (ETROG-2656)
Release 7.20.0.c58.3
Added tokenization and POS-tagging for at-mentions and hashtags in all languages. (ETROG-2571)
Added the options `atMentions`, `emailAddresses`, `emoticons`, `hashtags`, and `urls` to enable tokenization and POS-tagging of @mentions, email addresses, emoticons, hashtags, and URLs. They are all disabled by default. (ETROG-2583)
Release 7.19.0.c58.3
Added tokenization and POS-tagging for URLs and email addresses in all languages. (ETROG-2557)
Release 7.18.0.c58.3
Implemented the many-to-one normalizer. (ETROG-1961)
Deprecated many classes and methods that are for internal use only. (ETROG-2065)
Added `BaseLinguisticsFactory#addUserCscDictionary`. (ETROG-2098)
Removed obsolete big-endian models and dictionaries. (ETROG-2214)
Overhauled RBLCmd. `ANNOTATE` is the default command. `-showTokenDetails`, `-showRawResults`, and `-verboseResults` are removed. `-inputJson` interprets the input as an ADM. `-outputJson` is a boolean option. (ETROG-1392, ETROG-2343)
Decomposed compound verbs in Japanese when using `alternativeTokenization`. (ETROG-2350)
Introduced more advanced disambiguation for English and Spanish. (ETROG-2367, ETROG-2370, ETROG-2372, ETROG-2371, ETROG-2467)
Improved decompounding accuracy in Dutch. (ETROG-2408)
Added tokenization, lemmatization, and POS-tagging for emoticons and emoji in all languages. (ETROG-2474, ETROG-2512, ETROG-2516, ETROG-2520, ETROG-2522, ETROG-2538)
Supplemented analysis dictionaries for English and Spanish. (ETROG-2481, ETROG-2532, ETROG-2535)
Added support for Lucene/Solr 6.3. (ETROG-2501)
Introduced the ability to specify a user-defined reading dictionary in Lucene/Solr (`userDefinedReadingDictionaryPath`). (ETROG-2527)
Release 7.17.2.c58.3
Add support for Lucene/Solr 6.2. (ESPI-77)
Release 7.17.1.c58.3
Version of OSGi (internal use only) upgraded. (ETROG-2441)
Release 7.17.0.c58.2
The FST tokenizer now supports Romanian. (ETROG-2255)
All Chinese parts of speech are supported in CLA user dictionaries. (ETROG-2262)
Modest speed improvement in the disambiguation algorithm used to process the results of the Japanese (statistical), Korean, and Arabic tokenizers. (ETROG-2288)
To become file-system-agnostic, the use of `Path` in the API is now supported. (ETROG-2310, ETROG-2381)
Added the `-outputJson` option to `RBLCmd` to write the ADM as JSON to a file of choice. (ETROG-2332)
Reverted the fix for ETROG-2304, introduced in 7.16.0.c58.2. (ETROG-2376)
Release 7.16.0.c58.2
The Chinese script converter is an entitlement with a standard Chinese license. (ETROG-1605)
Arabic reh is normalized as a decimal separator in numeric contexts. (ETROG-1650)
Provide disambiguation of Dutch compounds. (ETROG-1736)
A custom reading dictionary can be specified on the RBLCmd command line. (ETROG-1938)
Alternative tokenization options are included in `BaseLinguisticsOption`. (ETROG-1946)
Improve speed by caching Arabic analyses. (ETROG-1992)
Added support for alternative Chinese segmentation. (ETROG-2034)
Return Hebrew sentence boundaries. (ETROG-2036)
Added support for POS tag mappings for alternative Japanese and Chinese segmentation. (ETROG-2152)
Changed CompoundDictionary to provide its components in an order that reflects the contents of the lemma it returns. (ETROG-2154)
`AnalyzerFactory#addUserAnalysisDictionary` now throws an informative exception when either the root or dictionary directory is invalid. (ETROG-2166)
Augmented RBLCmd with the ability to return the RBL-JE version number. (ETROG-2168)
Improve handling of hiragana tokens homophonous to verbs in the alternative Japanese tokenizer (JLA). (ETROG-2188)
Improve handling of POS-ambiguous verb stems in the alternative Japanese tokenizer (JLA). (ETROG-2189)
The RBLCmd help command now sorts its options alphabetically. (ETROG-2195)
Han readings now returned for all Katakana tokens. (ETROG-2208)
In the Russian FST tokenizer, initials are tokenized and given the `+Init` morpho-tag. (ETROG-2209)
Memory requirements of the FST tokenizer were reduced. (ETROG-2200, ETROG-2226)
Reduce the memory allocated for tokens by the FST tokenizer. (ETROG-2235)
Terminated support for Lucene/Solr 4.1-4.2. Added support for Lucene/Solr 6.0-6.1. (ETROG-2016, ETROG-2241, ETROG-2299)
Release 7.15.0.c57.2
Note: 7.15.0 was forked directly from 7.14.0 and thus does not have the changes in 7.14.1+.
Introduced the ability to specify an alternative FST tokenizer. See `TokenizerFactory.addCustomTokenizationFst`. (ETROG-2231)
Release 7.14.0.c57.2
The specification of options to `RBLCmd` was refactored. (ETROG-1503)
Added UPT-16 support for Persian and Urdu. (ETROG-1830)
Changed UPT-16 mappings for Czech and Hungarian numbers. (ETROG-1841)
Removed incorrect analyses for Polish adjectives and participles ending in m/mi. (ETROG-1916)
Removed archaic Polish analyses containing "być". (ETROG-1917)
Added raw analyses for English contractions. (ETROG-1944)
The command line tool RBLCmd supports Hebrew tokenization. (ETROG-1973)
Added support for Finnish stemming. (ETROG-2012)
Removed the spurious generation of an accusative case analysis for some Polish nouns. (ETROG-2020)
The Hebrew tokenizer overzealously guessed that periods were part of an abbreviation. (ETROG-2024)
Refactored the position metadata for Lucene tokens of compound components. (ETROG-2042)
Lucene tokens for components of a contraction are identified with type "CONT". To invoke this functionality, set `FilterOption.identifyContractionComponents` to true. (ETROG-2044)
`AnalysesAttribute`s are formatted as JSON in Elasticsearch. (ETROG-2057)
Release 7.13.0.c56.6
Added API support for Lucene & Solr 5.0-5.3. (ETROG-1647)
Added support for Persian and Urdu. (ETROG-1636, ETROG-1667)
The 'nor' (Norwegian) language code is accepted. (ETROG-1690)
Exposed support for using the Rosette Annotated Data Model (ADM) to perform RBL-JE operations. (ETROG-1713)
The Arabic analysis candidate generation code now uses the same algorithm that the Arabic Language Processor in the native (C++) version of Rosette Base Linguistics does. (ETROG-1722)
Provided an alternative Japanese analyzer. This provides parity with the Japanese analyzer in Basis Technology's C++ API (RLP). It offers improved accuracy with query strings and names and provides greater user control of the analysis. (ETROG-1727)
For English, Portuguese, and German text, added ADM support for splitting contractions and analyzing the constituents. (ETROG-1769)
Provided support for returning the set of 16 universal part-of-speech (POS) tags rather than the set of 12 that were introduced in version 7.12.0. (ETROG-1771)
The RBLCmd tool now lists the `BaseLinguisticsOption` options. To use these options you must set `analyzerType=none`, `lang`, and `BaseLinguisticsOption.language`. (ETROG-1862)
Release 7.12.1.c56.6
Version 7.12.1.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis JVM SDK (e.g. RBL-JE, RLI-JE, REX-JE) in a single application, then choose versions that have the same compatibility number. (ETROG-1700)
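The compatibility rule above can be checked mechanically. The sketch below is plain Python, not a Basis tool: it assumes version strings follow the `<release>.c<major>.<minor>` shape seen in these notes (for example `7.12.1.c56.6`), extracts the compatibility number, and reports whether a set of SDK versions shares one. The function names and the second version string in the usage line are hypothetical.

```python
import re

# Assumed shape, based on the version strings in these notes:
# dotted release number, then ".c", then a two-part compatibility number.
VERSION_RE = re.compile(r"^(\d+(?:\.\d+)*)\.c(\d+\.\d+)$")


def compatibility_number(version):
    """Return the compatibility number, e.g. '56.6' for '7.12.1.c56.6'."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"no compatibility number in {version!r}")
    return match.group(2)


def same_compatibility(versions):
    """True when every version string carries the same compatibility number."""
    return len({compatibility_number(v) for v in versions}) == 1


# Hypothetical pairing of two SDK versions with matching compatibility.
print(same_compatibility(["7.12.1.c56.6", "7.10.0.c56.6"]))
```

Versions released before the extension was introduced, such as `7.12.0`, carry no compatibility number, so this sketch raises `ValueError` for them.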
Release 7.12.0
Moved the `Tokenize` and `Analyze` samples into samples/tokenize-analyze and created a single Ant build script to compile and run both samples. (ETROG-1264)
Provided support for returning universal part-of-speech (POS) tags rather than the language-specific POS tags we already return. The universal tags (UPT) are coarser than the language-specific tags, but enable tracking and comparison across languages. (ETROG-1472)
Added support for returning a disambiguated analysis for each token in Japanese text. For performance, this feature is turned off by default. (ETROG-1324)
Added support for returning morphological tags, where available, and placed an example illustrating the procedure for obtaining morphological tags in samples/morpho-tags. (ETROG-1485)
Removed the small number of dubious acronym expansions from the lemmatization of English, French, Italian, German, Spanish, and Portuguese input. (ETROG-1547)
Improved the German lemma parser, which now returns the same lemma for German nouns that differ only in gender. (ETROG-1548)
Added API support for Lucene & Solr 4.10. (ETROG-1571)
Release 2.4.0
Enhanced support for Korean linguistic analysis, and integrated a guesser for generating morphemes, morpheme tags, compound components, and parts of speech. (ETROG-1486, ETROG-1512, ETROG-1528)
Added support for Korean user lemma dictionaries. (ETROG-1518)
Added stop words to the Japanese analysis dictionary. (ETROG-1525)
Release 2.3.0
Added the Chinese Script Converter, which can convert tokens in Traditional Chinese text to Simplified Chinese and vice versa. (ETROG-1462)
Terminated support for Lucene/Solr 3.6. (ETROG-1298)
Implemented support for Chinese part-of-speech (POS) tags and readings. (ETROG-1280)
Added support for normalization of Chinese and Japanese numbers. (ETROG-1310)
Implemented generation of Korean part-of-speech (POS) tags. (ETROG-1357)
Release 2.2.2
Added a tool for building user dictionaries. (ETROG-210)
For those cases in which you want to use your own whitespace tokenizer and you are processing text that requires segmentation (such as Chinese, Japanese, or Thai), we have added support for a base linguistics segmentation token filter to be used after a whitespace tokenizer and before other filters, such as a base linguistics token filter. See the Javadoc for the RBL-JE API for Lucene 4.3-4.7. (ETROG-1240)
For Japanese, modified the base linguistics token filter to exclude lemmas for auxiliary verbs, particles, and adverbs from the token stream. (ETROG-1217)
Added support for using `AnalysesAttribute` to get the analyses and disambiguated analysis for each token in a token stream. (ETROG-1279)
Added SLF4J support for logging RBL-JE applications. (ETROG-1318)
Added support for turning case sensitivity on/off when analyzing text. (ETROG-1365)
Deprecated `void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path)` in favor of `void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path, EnumMap<AnalyzerOption, String> options)` where `options` is used to set `AnalyzerOption.caseSensitive` to "true" or "false".
Unused analyzer parameter removed from the `BaseLinguisticsSegmentationTokenFilter` constructor. (ETROG-1316)
Updated the Japanese normalization dictionary. (ETROG-1229)
Added API support and samples for Lucene 4.9. (ETROG-1446)
Release 2.2.0
Added a Lucene Analyzer that combines the RBL-JE Tokenizer and TokenFilter, along with the LowerCaseFilter, CJKWidth Filter, and optional support for the StopFilter: `com.basistech.rosette.lucene.BaseLinguisticsAnalyzer`. Added a Lucene 4.3-4.7 sample application that illustrates its use. (ETROG-1138, ETROG-1172)
Improved support for returning Japanese Hiragana readings. The API for adding readings to the token stream has moved from `TokenizerFactory#SetOption` to `BaseLinguisticsTokenFilter#setAddReadings`. You can also add `("addReadings", "true")` to the map of options you use to instantiate the `BaseLinguisticsAnalyzer`. (ETROG-1054)
Release 2.1.0
Added support for Japanese Hiragana readings.
Factored in support for Lucene 3.6, 4.1-4.2, and 4.3.
Release 2.0.0
For this release, this product has been refactored and renamed to Rosette Base Linguistics Java Edition. This release concentrates on the core API instead of implementations for different versions of Lucene and Solr. This release returns part-of-speech tags for a core set of European languages and Japanese.
In place of a `LemmatizerFactory`, RBL-JE now provides an `AnalyzerFactory`. Use the `AnalyzerFactory` to generate a language-specific Analyzer that you can use to generate Analysis objects for each token produced by the Tokenizer.
Release 1.11.0
For licensing and business reasons, support for Bulgarian, Catalan, Estonian, Croatian, Indonesian, Latvian, Malay, Slovak, Slovenian, Serbian, Albanian, and Ukrainian has been removed from the RSE package. (ETROG-921)
Release 1.10.0
Added support for tokenizing and lemmatizing Arabic, Czech, Hungarian, Korean, and Turkish. (ETROG-876)
Release 1.9.0
Added support for segmenting (tokenizing) Thai. (ETROG-448)
Added a tokenizer option (turned off by default) for returning Hebrew roots. (ETROG-788)
Changed required Java platform from 1.5 to 1.6. (ETROG-765)
Added support for using RSE with LucidWorks Enterprise 1.7, which supports a pre-release version of Lucene and Solr 4.0.
Release 1.8.0
Added support for tokenizing and lemmatizing Albanian, Bulgarian, Catalan, Croatian, Estonian, Greek, Hebrew, Indonesian, Latvian, Malay, Polish, Serbian, Slovakian, Slovenian, Russian, and Ukrainian. (ETROG-656, 658, 668, 677)
Added a command line driver for running RSE. For usage details, see the Javadoc for `com.basistech.rosette.bl.RBLCmd`. (ETROG-603)
Added support for tokenizing and lemmatizing Norwegian Nynorsk text. (ETROG-637)
Consolidated support for Lucene 2.2, Lucene 2.4, Lucene 2.9, Lucene 3.1, Solr 1.3, Solr 1.4, and Solr 3.1 in a single SDK package with an associated documentation package.
Deprecated support in the `com.basistech.rosette.breaks` package (`GenericTokenizer` and `TokenizerOption`) for returning EOS (end-of-sentence) tokens. `includeEOS` is off by default and should not be turned on; it interferes with Lucene searches. (ETROG-706)
Deprecated Lucene 2.9 `LemmaFilterFactory.supportedLanguages()`. Use `getSupportedLanguages()`. (ETROG-726)
Release 1.7.1
Revised the `SegmentationTokenizer` to provide more consistent handling of punctuation during the tokenization of Chinese, Japanese, and Thai text. (ETROG-640)
Release 1.7.0
Added support for Lucene 3.0.
Improved support for Japanese and Chinese tokenization.
Added the Japanese lemmatization dictionary and support of Japanese lemma user dictionaries. The Japanese lemmatization dictionary also provides orthographic normalization in the case of Katakana spelling variants and input text with archaic Kanji.
Added the production of normalized numbers to the lemmatization process.
Added support for Chinese lemma user dictionaries. Apart from numbers, which are already handled by the lemma guesser, lemmas do not ordinarily apply to Chinese, but a lemma user dictionary may be used for orthographic normalization.
Release 1.6.0
Added support for Danish and Norwegian (Bokmål). Improved support for Chinese token segmentation and Romanian.
To enhance clarity and consistency, and to avoid duplication of package names in class names, made a number of API changes that are not backwards compatible.
Renamed some factory classes: (ETROG-436)
Old | New |
---|---|
`com.basistech.lucene.LuceneTokenizerFactory` | `com.basistech.lucene.TokenizerFactory` |
`com.basistech.lucene.BaseLinguisticsTokenFilterFactory` | `com.basistech.lucene.LemmaFilterFactory` |
`com.basistech.solr.BaseLinguisticsTokenizerFactory` | `com.basistech.solr.TokenizerFactory` |
`com.basistech.solr.BaseLinguisticsTokenFilterFactory` | `com.basistech.solr.LemmaFilterFactory` |
All these factory classes include a create() method for instantiating the Tokenizer or LemmaFilter. The getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods have been removed.
Promoted classes introduced in Release 1.5.beta.1 for setting tokenizer and lemmatizer options from inner Enums to top-level Enums: com.basistech.rosette.breaks.TokenizerOption and com.basistech.rosette.bl.LemmatizerOption. (ETROG-434)
Removed the TokenizerFactory, LemmaFilterFactory and LemmatizerFactory option-specific methods for setting options that predate the introduction of setOption().
The com.basistech.breaks.BreakerFactory methods for creating breakers have been renamed.
Old | New |
---|---|
newScriptRegionBreaker() | createScriptRegionBreaker() |
newBufferSentenceBreaker() | createBufferSentenceBreaker() |
newBufferWordBreaker() | createBufferWordBreaker() |
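Because the factory renames above are not backwards compatible, existing configuration or source text that names the old classes has to be updated. The following is a minimal sketch of one way to do that with plain string replacement. The class FactoryRenameSketch and its migrate() method are hypothetical helpers written for this note and are not part of the product; only the old and new class names come from the rename table. The sample XML fragment in main() is likewise an illustrative assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical migration helper; not part of the RSE distribution.
final class FactoryRenameSketch {
    // Old -> new factory class names, copied from the 1.6.0 rename table.
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("com.basistech.lucene.LuceneTokenizerFactory",
                    "com.basistech.lucene.TokenizerFactory");
        RENAMES.put("com.basistech.lucene.BaseLinguisticsTokenFilterFactory",
                    "com.basistech.lucene.LemmaFilterFactory");
        RENAMES.put("com.basistech.solr.BaseLinguisticsTokenizerFactory",
                    "com.basistech.solr.TokenizerFactory");
        RENAMES.put("com.basistech.solr.BaseLinguisticsTokenFilterFactory",
                    "com.basistech.solr.LemmaFilterFactory");
    }

    // Replaces every occurrence of an old fully qualified name with its new name.
    static String migrate(String text) {
        String result = text;
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative input only; the element and attribute names are assumptions.
        String before = "<tokenizer class=\"com.basistech.solr.BaseLinguisticsTokenizerFactory\"/>";
        System.out.println(migrate(before));
    }
}
```

None of the new names contains an old name, so running migrate() more than once leaves the text unchanged. Calls to the removed getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods still need to be rewritten by hand to use create().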
Release 1.5-beta-1
Added support for Chinese, and limited support for Japanese. For these languages, RSE adds statistically trained models/dictionaries to enable the tokenization of non-whitespace-delimited text. Support for user dictionaries has also been expanded to include token dictionaries for Chinese, Japanese, and Thai.
Enhanced support for Dutch, Italian, and Portuguese.
Replaced Lucene 2.9 and Solr 1.4 packages with Lucene 3.0 package.
Revised the API for defining tokenizer and lemmatizer options.
Reorganized the documentation to reflect standard RSE usage patterns.
Release 1.4.2
To simplify usage in standard search applications, revised the RSE Tokenizer so that it does not put sentence-boundary tokens in the token stream unless you instruct the RSE TokenizerFactory to include them. (ETROG-312)
Release 1.4.1
Addressed a compatibility issue running RSE with RLP. To run RSE 1.4 and RLP 7.1 in the same process follow the instructions in RSE Technical Note: Using RSE 1.4 and RLP 7.1 in a Single Solr Instance.
Compiling a Swedish User Dictionary. As described in the RSE Application Developer's Guide, you must use RLP to create a user dictionary. "Chapter 12. User-Defined Data" in the RLP Application Developer's Guide provides instructions on creating the source file for a user-defined dictionary and compiling the dictionary. The current release of RLP (RLP 7.1.0) does not include support for creating a Swedish user dictionary. To create a Swedish dictionary, you must add a file that we provide in the extras directory to the corresponding location in your RLP installation: rlp/bl1/dicts/sv/tags.txt.
When you create your source file, you can use [+DUMMY] as the POS tag for each entry.
The syntax for compiling a Swedish user dictionary from rlp/bl1/dicts/tools is
build_user_dict.sh sv input output
Release 1.4.0
Removed the Rosette Language Identifier (RLI) 100% Java implementation, which is now a separate product.
Provided separate SDK packages with support for Lucene 2.2, Lucene 2.4, and Lucene 3.0. (ETROG-198)
Added TokenizerFactory, which provides a language-specific Tokenizer for parsing input text. In addition to using the Sentence Breaker and Word Breaker, the Tokenizer normalizes the tokens (Unicode NFC normalization and lowercasing). (ETROG-185)
Added support for Swedish, including tokenization, lemmatization, and decompounding. (ETROG-201)
Added preliminary, limited support for Dutch, Danish, Norwegian, Italian, Portuguese, and Romanian.
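The token normalization mentioned in the TokenizerFactory entry above (Unicode NFC normalization and lowercasing) can be illustrated with the standard Java library. This is only a sketch of the general technique, not the RSE implementation: the class TokenNormalizationSketch is hypothetical, and the order of the two steps and the use of Locale.ROOT for lowercasing are assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration of NFC normalization followed by lowercasing.
final class TokenNormalizationSketch {
    static String normalize(String token) {
        // Compose base characters and combining marks into precomposed forms (NFC).
        String nfc = Normalizer.normalize(token, Normalizer.Form.NFC);
        // Locale.ROOT is an assumption; it avoids locale-specific case mappings.
        return nfc.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cafe" followed by U+0301 COMBINING ACUTE ACCENT: five chars before normalization.
        String decomposed = "Cafe\u0301";
        String normalized = normalize(decomposed);
        System.out.println(decomposed.length()); // 5
        System.out.println(normalized.length()); // 4, because e + U+0301 composes to U+00E9
    }
}
```

After normalization, the decomposed and precomposed spellings of the same word compare equal, which is the property that matters when matching tokens in a search index.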
Release 1.3-beta
Expanded support for German decompounding.
Added support for generating a separate lemma for each space-delimited element in lemmas that contain whitespace.
This distribution provides support for Lucene 2.2.
Release 1.2.0
Revised Java package names to avoid potential collisions with the RLP JNI-supported Java API.
Release 1.1.0
Upgraded Token Filter Factory support from Lucene 2.2 to Lucene 2.4.
Added the Rosette Language Identifier (RLI), Sentence Breaker, and Word Breaker.
Release 1.0.0
Introduced support for the creation of Lucene 2.2 Base Linguistics token filters for English, French, German, and Spanish text.
Bugs Fixed
Bugs fixed in 7.27.2.c60.0
Bug # | Description |
---|---|
ETROG-2981 | The Persian lemmatizer did not add lemmas to the first analyses of many tokens, especially verbs. |
ETROG-2983 | The lemmas of hyphenated Russian compound words now have both pieces lemmatized, not just the final piece. For example, “человека-волка” is lemmatized to “человек-волк”, whereas it was lemmatized to “человека-волк” in previous versions. |
ETROG-2984 | After some sequences of 4096 characters, containing mostly white space and at most one token, if there is no token or the token contains the last character of the sequence, any following tokens have incorrect original offsets. |
Bugs fixed in 7.27.1.c60.0
Bug # | Description |
---|---|
ETROG-2919 | Capitalized words in English are less likely to automatically get the part of speech PROP. |
ETROG-2927 | The English word "people" and its derivatives have the lemma candidate "person". They are no longer analyzed as plural nouns with lemmas equal to their surface forms. |
ETROG-2969 | Ambiguous English words like “second” and “lower” are less likely to be disambiguated as verbs when they should be ordinal numbers and comparative adjectives. |
Bugs fixed in 7.26.6.c60.0
Bug # | Description |
---|---|
ETROG-2958 | Analyzing Dutch writes a cache file to disk, which fails if the file is not writable. |
Bugs fixed in 7.26.5.c59.3
Bug # | Description |
---|---|
ETROG-2904, ETROG-2910, ETROG-2911 | Improved the lemma and part of speech accuracy of the Spanish disambiguator. |
ETROG-2921 | The parts of speech of the German acronyms “MAN” and “MIT” fall back to the parts of speech of the unrelated words “man” and “mit”. They should be NOUN. |
Bugs fixed in 7.26.4.c60.0
Bug # | Description |
---|---|
ETROG-2904, ETROG-2910, ETROG-2911 | Improved the lemma and part of speech accuracy of the Spanish disambiguator. |
ETROG-2914 | RBL-JE depended on Guava 18.0.0, which has a security vulnerability (CVE-2018-10237). Now it depends on Guava 26.0-jre. |
Bugs fixed in 7.26.3.c59.3
Bug # | Description |
---|---|
ETROG-2835 | Dutch compound nouns ending with “-ronde” can be analyzed as adjectives. |
ETROG-2864 | The disambiguator for Dutch non-compound words only considers parts of speech. If a token has multiple analyses with the same part of speech, the disambiguator picks one arbitrarily. |
ETROG-2898 | U+180E MONGOLIAN VOWEL SEPARATOR is not treated as a token separator in Hebrew. |
Bugs fixed in 7.26.2.c59.3
Bug # | Description |
---|---|
ETROG-2889 | Single-letter Spanish conjunctions sometimes get the POS tag ITEM instead of CONJ. |
Bugs fixed in 7.26.1.c59.3
Bug # | Description |
---|---|
ETROG-2888 | Norwegian lemmas for proper nouns are often converted to lowercase. |
Bugs fixed in 7.26.0.c59.3
Bug # | Description |
---|---|
ETROG-2876 | The application developer’s guide’s feature set table in section 1.3 erroneously claims that sentence boundary detection is not supported in Hebrew. |
ETROG-2877 | The Japanese character “々” is unrecognized and splits tokens. |
Bugs fixed in 7.25.0.c59.3
Bug # | Description |
---|---|
ETROG-2791 | In Catalan where an apostrophe, in some contexts, marks a token boundary, the token boundary is omitted. |
ETROG-2857 | Punctuation was not always separated from preceding characters where they should when |
ETROG-2860 | The Hebrew part of speech tag wPrefix is not converted to UPT-16 when |
ETROG-2861 | The documentation lists AUXV as the part of speech tag for Japanese auxiliary verbs, when it is actually AUXVB. |
Bugs fixed in 7.24.6.c59.2
Bug # | Description |
---|---|
ETROG-2829 | Some components of German compound words are incorrect when the surface form of the component could be either a noun or a verb. |
ETROG-2831 | Hebrew part of speech tags are not converted to UPT-16 when |
ETROG-2834 | JLA tokenizer sometimes truncates katakana tokens after non-katakana tokens. |
ETROG-2844 | Email addresses and URLs may contain control or whitespace characters. |
ETROG-2847 | Hebrew tokens can contain control characters or nothing but default ignorable characters. |
ETROG-2848 | Chinese, Japanese, and Thai tokens may contain control characters. |
Bugs fixed in 7.24.5.c59.2
Bug # | Description |
---|---|
ETROG-2781 | Incorrect analysis may be selected for Dutch non-compound words. |
ETROG-2804 | Sentence-final @mentions, email addresses, emoji, emoticons, hashtags, and URLs are not marked as sentence-final. |
ETROG-2820 | Tokenizing German with |
ETROG-2823 | Hebrew prefixes are not exposed correctly. |
Bugs fixed in 7.24.4.c59.2
Bug # | Description |
---|---|
COMN-234 | The Woodstox dependency is not shaded. |
Bugs fixed in 7.24.2.c59.2
Bug # | Description |
---|---|
ETROG-2814 | Running TensorFlow leaks memory. |
Bugs fixed in 7.24.1.c59.2
Bug # | Description |
---|---|
ETROG-2808 | The Hebrew disambiguation model is not cached, potentially leading to high memory pressure. |
Bugs fixed in 7.24.0.c59.2
Bug # | Description |
---|---|
ETROG-2782 | Tokens can be empty or consist of nothing but control characters and white space. |
Bugs fixed in 7.23.3.c59.0
Bug # | Description |
---|---|
ETROG-2744 | Tokenize-analyze sample emits unknown POS tags for Hebrew |
ETROG-2759 | Ending a Lucene token stream in Chinese or Japanese with |
ETROG-2760 | In languages like French and Italian, an apostrophe was parsed as its own token when directly followed by a digit. |
ETROG-2767 | Overlapping tokens, which are valid, are discarded. This particularly affects hyphenated tokens in French when |
ETROG-2776 | German all-caps words are assumed to be acronyms without considering the possibility that they are simply emphasized. |
ETROG-2777 | The application developer’s guide claims support for Java 9. |
ETROG-2778 | The application developer’s guide references btcommon-api-37.1.3.jar instead of btcommon-api-36.1.3.jar. |
ETROG-2779 | The German analysis cache returns analyses without taking the full context into account, leading to unpredictable analyses for unknown words. |
ETROG-2780 | English all-caps words are assumed to be proper nouns, though all-caps may simply denote emphasis. |
ETROG-2790 | The Application Developer's Guide does not mention support for Hebrew part of speech tagging in the Feature Set table. |
Bugs fixed in 7.23.2.c59.0
Bug # | Description |
---|---|
ESPI-110 | Disambiguating Hebrew tokens throws an |
Bugs fixed in 7.23.0.c59.0
Bug # | Description |
---|---|
ETROG-2710 | Some tokens consisting of numbers and Latin letters, such as serial codes, are decompounded into multiple morphemes in Korean. |
ETROG-2716 | “интернет”, the Russian word for "internet", is not in the lexicon, although “Интернет” is. |
Bugs fixed in 7.22.2.c59.0
Bug # | Description |
---|---|
ETROG-2738, ETROG-2754 | German words are assigned parts of speech without taking capitalization into account, leading the disambiguator to often pick the wrong analysis. |
ETROG-2739 | The annotated lemmas for German definite articles were inconsistent. |
ETROG-2740 | Some symbols and punctuation are not tokenized as separate tokens in German when |
ETROG-2745 | In languages like French and Italian where an apostrophe, in some contexts, marks a token boundary, the token boundary is omitted if the following token contains a digit. |
Bugs fixed in 7.22.1.c59.0
Bug # | Description |
---|---|
ETROG-2687 |
|
ETROG-2703 | Many German words ending with "teuer" get decompounded incorrectly. Several German compound words don't get decompounded at all. |
Bugs fixed in 7.22.0.c59.0
Bug # | Description |
---|---|
ETROG-2383 | Korean tokens sometimes include trailing ASCII periods. |
ETROG-2685 | URL tokens in Chinese, Japanese, and Thai run on into the following tokens, without inserting a token boundary. |
ETROG-2686 | "Dep." is not recognized as a Portuguese abbreviation, causing a sentence break. |
Bugs fixed in 7.21.3.c59.0
Bug # | Description |
---|---|
ETROG-2680 | Single quote in English sometimes incorrectly analyzed as possessive when found following whitespace. |
Bugs fixed in 7.21.0.c58.3
Bug # | Description |
---|---|
ETROG-2476 | Lemma for a German compound word may be wrong if the surface form has characters that were normalized in the compound components. |
ETROG-2589 | The non-English FST tokenizers do not properly split text into tokens in which text immediately follows colons or commas, such as ":,test". |
ETROG-2622 | Passing a long script file (over 100,000 rows) to RBLCmd with output paths specified in the third column can cause a “Too many open files” error. |
ETROG-2661 | Tokenizing a string of Katakana throws an |
Bugs fixed in 7.20.4.c58.3
Bug # | Description |
---|---|
ETROG-2638 | Number tokens that are hundreds of characters long can cause |
Bugs fixed in 7.20.3.c58.3
Bug # | Description |
---|---|
ETROG-2632 | A tiny number of Hebrew surface forms (e.g. "ג`ינג`ר") cause |
Bugs fixed in 7.20.2.c58.3
Bug # | Description |
---|---|
ETROG-2615 | Hebrew lemmas are not exposed in Lucene/Solr. |
Bugs fixed in 7.20.1.c58.3
Bug # | Description |
---|---|
ETROG-2584 | Emoticon detection has some false positives. |
Bugs fixed in 7.20.0.c58.3
Bug # | Description |
---|---|
ETROG-2338 | Annotating an |
ETROG-2574 | The English FST tokenizer does not properly split text into tokens in which text immediately follows colons or commas, such as ":,test". |
Bugs fixed in 7.19.0.c58.3
Bug # | Description |
---|---|
ETROG-710 | The Hebrew tokenizer throws an exception or produces incorrect results for tokens that begin with "prefix=". |
ETROG-995 | The Hebrew tokenizer could throw an exception or produce incorrect results for some inputs involving backslashes or multiple tokens with identical surface forms. |
ETROG-2531 | Creating Chinese and Japanese tokenizers with |
ETROG-2552 | Analyzing a zero-length English token throws a |
ETROG-2560 |
|
ETROG-2563 | The Hebrew tokenizer does not split on non-ASCII white space characters. |
ETROG-2565 | Requesting UPT-16 POS tags for a language for which POS tags are not supported throws an exception. |
Bugs fixed in 7.18.0.c58.3
Bug # | Description |
---|---|
ETROG-1689 | The lemmas of Russian compound words contain braces. |
ETROG-2360 | RBLCmd reports 0 bytes/char when reading from standard input. |
ETROG-2409 |
|
ETROG-2243 | All acronyms are tagged as proper nouns in English. |
ETROG-2536, ETROG-2546 | Setting |
ETROG-2543 |
|
ETROG-2547 |
|
ETROG-2548 | SLF4J binding jars shipped in lib/. To avoid classpath conflicts they have been moved to tools/lib/. |
Bugs fixed in 7.17.0.c58.2
Bug # | Description |
---|---|
ETROG-2252 | Word breaking with |
Bugs fixed in 7.16.1.c58.2
Bug # | Description |
---|---|
ETROG-2356 | Token can be missed by GenericTokenizer. (This augments the fix for ETROG-2292 made in 7.16.0.c58.2.) |
Bugs fixed in 7.16.0.c58.2
Bug # | Description |
---|---|
ETROG-676 | ZWNBSP was not treated as whitespace. With this fix, the word breaker treats 0x2060 and 0xFEFF as whitespace. |
ETROG-1100 | Incorrect tokenization of '0901d97c80103109' in Hebrew. |
ETROG-1638 | Fixed English business and place name acronym segmentation regressions. |
ETROG-1640 | SingleLanguageAnnotator does not provide a proper analysis if the language is specified with the legacy code zht or zhs. |
ETROG-1655 | ADMs contain multiple tokens (instead of multiple analysis) for the same Hebrew word. |
ETROG-1766 | Processing Spanish with sequences of the dash character can exhaust memory. |
ETROG-1996 | Period erroneously attached to terminal token |
ETROG-2001 |
|
ETROG-2038 | +int_noun, +int_adj, etc. can inadvertently be returned as a POS tag. |
ETROG-2088 | POS tag and Contraction annotators failed for English uppercase. |
ETROG-2112 | ConcurrentModificationException crash in alternative Japanese tokenization (JLA). |
ETROG-2114 | The double exclamation mark (\u203C, ‼) was not treated as a separate token and could be agglutinated to an adjacent word. |
ETROG-2125 | IndexOutOfBoundsException can be thrown when analyzing Korean. |
ETROG-2136 | Some English words ending in "ss" are erroneously given lemmas ending in 's' |
ETROG-2172 | Korean lemmas for tokens with mixed numerals and letters have incomplete text. |
ETROG-2191 | The return value of BaseLinguisticsFactory.featuresForLanguage(LanguageCode.SIMPLIFIED_CHINESE) was missing CSCANALYSIS. |
ETROG-2192 | Compound nouns should allow numerals as the first component. |
ETROG-2194 | The English word 'metres' is not lemmatized correctly. |
ETROG-2251 | Polish ‘Ł.’ is not treated as an initial. |
ETROG-2253 | Word breaking in FST Tokenizer can fail for initials in Greek. |
ETROG-2254 | Word breaking in FST Tokenizer can fail for initials in Hungarian. |
ETROG-2268 | The order in which options are specified in a |
ETROG-2269 | Inflected forms of the English verb 'mentor' not lemmatized. |
ETROG-2283 | Morpheme information is lost if an annotator is created for Korean and |
ETROG-2292 | Processing text with too many out-of-vocabulary characters may fail. |
ETROG-2297 | Setting ExtendedTags to false does nothing. |
ETROG-2304 | Some pairs of words have each other as disambiguated lemmas. NB: This fix was reverted in 7.17.0.c58.2 as experience showed some unacceptable disambiguation regressions. |
ETROG-2328 | Setting XlaOption separatePlaceNameFromSuffix to false does nothing. |
ETROG-2333 | The alternative Japanese segmenter (JLA) often mishandled もの, ような, and とおり. |
Bugs fixed in 7.14.2.c57.2
Bug # | Description |
---|---|
ETROG-2013 | Error in SBN reader cache key equality function can induce out of memory errors. |
Bugs fixed in 7.14.1.c57.2
Bug # | Description |
---|---|
ETROG-2200 | Use of the fstTokenize option can exhaust memory. |
Bugs fixed in 7.14.0.c57.2
Bug # | Description |
---|---|
ETROG-1681 | Analysis results not produced for legacy |
ETROG-1925 | Fixed UPT-16 mappings for proper nouns in Czech, Dutch, French, German, and Polish. |
ETROG-1933 | U+0022 is no longer removed during Urdu normalization. |
ETROG-1945 | Fixed errors when mapping parts of speech. |
ETROG-1994 | Attempting to use the |
ETROG-2018 | In some cases involving digits, the Hebrew tokenizer truncated part of the word. |
ETROG-2047, ETROG-2059 | Analyses of some inflected forms of Polish nouns were incorrect. |
Bugs fixed in 7.13.0.c56.6
Bug # | Description |
---|---|
ETROG-1567 | Fixed cases in which components for some Danish, Norwegian, and Swedish compound words are returned out of order. |
ETROG-1612 | Ensure that the order of lemmas retrieved from the morpho cache matches that from the FSTs. |
ETROG-1618 | RBLCmd to gracefully handle a leading BOM in a UTF-8 file. |
ETROG-1652 | Crash on the null character (U+0000) |
ETROG-1687 | Arabic prefix lengths sometimes miscalculated. |
ETROG-1703 | Include Semitic roots in Arabic analysis results from an Annotator. |
ETROG-1761 | Given language 'unknown', RBL-JE can produce a token with a whitespace in the middle. |
Bugs fixed in 7.12.1.c56.6
Bug # | Description |
---|---|
ETROG-1628 | Fixed errors in the production of Portuguese and Russian lemmas. |
ETROG-1663, ETROG-1665 | Corrected German regressions relative to the RBL native product (RLP). |
ETROG-1671 | Removed RBLCmd log4j warnings when AnalyzerOption.disambiguate = true |
Bugs fixed in 7.12.0
Bug # | Description |
---|---|
ETROG-1432 | Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters. This completes the fix that we made for version 2.3.0 (ETROG-647). |
ETROG-1552 | Fixed a |
ETROG-1563 | Fixed an error processing Korean text in which the analyzer produces a |
Bugs Fixed in 2.4.0
Bug # | Description |
---|---|
ETROG-1554 | Rebuilt big-endian Chinese analysis dictionary. |
Bugs Fixed in 2.3.0
Bug # | Description |
---|---|
ETROG-647 | Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters. |
Bugs Fixed in 2.2.2
Bug # | Description |
---|---|
ETROG-1295 | Fixed a NullPointerException when attempting to process Korean text. |
ETROG-1271 | Fixed reported errors in the English lemma dictionary:
|
ETROG-1387 | Stopped returning guessed POS tags for languages for which POS tags are not supported. |
ETROG-1311 | Corrected an error tokenizing strings containing certain Unicode Supplementary (non-BMP) characters. |
ETROG-1300 | Fixed |
ETROG-1226 | Fixed occasional duplication of linguistic lookup results. |
Bugs Fixed in 2.2.1
Bug # | Description |
---|---|
ETROG-1261 | Concurrency violation in RBL-JE user defined dictionaries |
Bugs Fixed in 1.10.1
Bug # | Description |
---|---|
ETROG-916 | Eliminated the |
ETROG-917 | Fixed a bug that produced incorrect candidate lemmas for Korean text. The correct lemma candidate generator is now being used. |
Bugs Fixed in 1.7.1
Bug # | Description |
---|---|
ETROG-591 | Fixed buffer management error using a token user dictionary to tokenize components in a long sequence of Chinese or Japanese tokens with no sentence boundaries. |
ETROG-588 | Enabled the use of |
ETROG-596, ETROG-629 | Improved the handling of text that is not Hanzi (Kanji), Hiragana, or Katakana in Chinese and Japanese token user dictionary lookups. For most consistent performance, we recommend that you only include Hanzi (Kanji), Hiragana and Katakana characters in token user dictionary entries. |
Bugs Fixed in 1.6.0
Bug # | Description |
---|---|
ETROG-486, ETROG-495 | Addressed overgeneration of Swedish compound components for unknown words. Applied similar refactoring to Danish and Norwegian. |
Bugs Fixed in 1.4.3
Bug # | Description |
---|---|
ETROG-319 | Speeded up the RSE Tokenizer and LuceneTokenizer by eliminating unnecessary reinitialization. |
Bugs Fixed in 1.4.2
Bug # | Description |
---|---|
ETROG-316 | Avoided a heap overflow by revising the RSE LuceneTokenizer to gracefully handle multiple |
Bugs Fixed in 1.4.1
Bug # | Description |
---|---|
ETROG-191 | Worked around an out-of-memory error processing very long compounds. See Known Problems in 1.4.1. |
Bugs Fixed in 1.4.0
Bug # | Description |
---|---|
ETROG-142 | Corrected out-of-memory error processing very long German words. |
ETROG-88 | Fixed array out of bounds that occurred processing some multi-sentence input. |
ETROG-182 | Adjusted word breaker to avoid returning empty elements at end of the input text being processed. |
Third-Party Components
For a list of third-party components that are used in Basis Technology products, see ThirdPartyLicenses.txt.
Third-party component updates in 7.27.1.c60.0
Component | Version | Change |
---|---|---|
annoy-java | 0.2.5 | New |
Third-party component updates in 7.26.4.c60.0
Component | Version | Change |
---|---|---|
Google Guava | 26.0-jre | Version upgrade |
Third-party component updates in 7.25.0.c59.3
Component | Version | Change |
---|---|---|
Jackson Annotations | 2.9.6 | Version upgrade |
Jackson Core | 2.9.6 | Version upgrade |
Jackson Databind | 2.9.6 | Version upgrade |
Jackson Dataformat XML | 2.9.6 | Version upgrade |
Jackson Dataformat YAML | 2.9.6 | Version upgrade |
Jackson Datatype Guava | 2.9.6 | Version upgrade |
Jackson Module JAXB Annotations | 2.9.6 | Version upgrade |
Third-party component updates in 7.24.6.c59.2
Component | Version | Change |
---|---|---|
Woodstox | 4.0.5 | Version downgrade |
Third-party component updates in 7.24.0.c59.2
Component | Version | Change |
---|---|---|
Google Guava | 18.0 | Version upgrade |
Jackson Annotations | 2.9.4 | Version upgrade |
Jackson Core | 2.9.4 | Version upgrade |
Jackson Databind | 2.9.4 | Version upgrade |
Jackson Dataformat XML | 2.9.4 | Version upgrade |
Jackson Dataformat YAML | 2.9.4 | Version upgrade |
Jackson Datatype Guava | 2.9.4 | Version upgrade |
Jackson Module JAXB Annotations | 2.9.4 | Version upgrade |
SnakeYAML | 1.18 | Version upgrade |
TensorFlow for Java | 1.5.0 | Version upgrade |
Woodstox | 5.0.3 | New |
Third-party component updates in 7.23.0.c59.0
Component | Version | Change |
---|---|---|
Auto Common Libraries | 0.3 | New |
AutoService | 1.0-tc3 | New |
Commons CLI | 1.2 | New |
Easy Plugins | 0.2.2 | New |
Jackson Datatype Guava | 2.7.3 | New |
JavaPoet | 1.9.0 | New |
Metrics Core | 3.2.3 | New |
Protocol Buffers | 3.3.1 | New |
TensorFlow | 1.3.0 | New |
Third-party component updates in 7.21.1.c59.0
Component | Version | Change |
---|---|---|
ICU4J | 59.1 | Version upgrade |
Third-party component updates in 7.18.0.c58.3
Component | Version | Change |
---|---|---|
fastutil | 6.6.1 | Version upgrade |
ICU4J | 58.1 | Version upgrade |
Jackson Annotations, Core, Databind, Dataformat XML, Dataformat YAML, Module JAXB Annotations | 2.7.3 | Version upgrade |
Jackson Dataformat Smile | 2.7.3 | New |
Third-party component updates in 7.16.0.c58.2
Component | Version | Change |
---|---|---|
args4j | 2.32 | Version upgrade |
fastutil | 6.6.0 | Version upgrade |
opencsv | | Removed |
Third-party component updates in 7.14.0.c57.2
Component | Version | Change |
---|---|---|
Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations | 2.6.2 | Version upgrade |
Jackson DataFormat YAML | 2.6.2 | Version upgrade |
SnakeYAML | 1.15 | New |
Snowball | No version or release info available; copied 2015-11-30 | New |
args4j | 2.3.2 | Version added |
Third-party component updates in 7.13.0.c56.6
Component | Version | Change |
---|---|---|
ICU4J | 55.1 | Version upgrade |
Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations | 2.4.4 | Version upgrade |
Jackson DataFormat YAML | 2.4.4 | New |
Known Problems
Known Problems in 2.x
If disambiguate is set to false, or if no disambiguator for the language exists, BaseLinguisticsTokenFilter does not set the type correctly for compound components when adding them to the token stream. It marks compound components as <LEMMA> instead of <COMP> when a non-disambiguating analysis is performed. (ETROG-1552)
Known Problems in 1.8.0
The prefixes and suffixes that the RSE tokenizer returns for Hebrew may include punctuation attached to the underlying tokens, such as parentheses (prefix, suffix) and comma (suffix). Accordingly, prefixes and suffixes are assigned a Token PositionIncrement of 1. A multicharacter prefix or suffix may be reported as a sequence of one-character prefixes or suffixes. (ETROG-697)
Known Problems in 1.7.0
If you use LanguageCode.SIMPLIFIED_CHINESE (zhs) or LanguageCode.TRADITIONAL_CHINESE (zht) when you load a Chinese token user dictionary, the dictionary is not loaded. You must use LanguageCode.CHINESE (zho) to designate the language code for a Chinese token user dictionary. (ETROG-588)
Known Problems in 1.4.1
To avoid a potential out-of-memory error, RSE does not attempt to decompound words longer than 30 characters. For languages with support for decompounding, if a word is longer than 30 characters and is not found in a user dictionary or the standard dictionary, RSE classifies the word as a guessed lemma. (ETROG-191)
Known Problems in 1.4.0
Inconsistent handling of numbers and punctuation during lemmatization. (ETROG-266)
RSE expects valid Unicode strings as input. If the input includes illegal Unicode sequences, such as un-paired UTF-16 surrogate characters, the behavior is undefined. (ETROG-284)
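Because behavior is undefined for input with unpaired UTF-16 surrogates, an application may want to check text before passing it to RSE. The following is a minimal sketch of such a check using only standard Java library calls. The class SurrogateCheckSketch and the method firstUnpairedSurrogate() are hypothetical names introduced for illustration and are not part of the RSE API.

```java
// Hypothetical pre-check for unpaired UTF-16 surrogates; not part of RSE.
final class SurrogateCheckSketch {
    // Returns the index of the first unpaired surrogate, or -1 if the text is well formed.
    static int firstUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (!paired) {
                    return i; // high surrogate with no low surrogate after it
                }
                i++; // skip the low surrogate of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no preceding high surrogate
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("ok \uD83D\uDE00")); // -1
        System.out.println(firstUnpairedSurrogate("bad \uD83D tail")); // 4
    }
}
```

What to do with a string that fails the check (reject it, or replace the offending code unit with U+FFFD) is an application decision that these notes do not prescribe.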
Known Problems in 1.3-beta and 1.4.x
Incorrect capitalization in some lemmas, including some German compounds (e.g., unAbhängigkeit).
Incorrect lemma formation of some words with suffixes (e.g., Brötchen).
Over-generation of German compound components (e.g., übergreifen, über, and greifen as separate components).
Failure to recognize some extended written-out German numbers (e.g., zweitausendzwölf).