Base Linguistics (RBL)

Release Notes

Release 7.47.9.c79.0

September 2025

Note

Java support: Java 11 is no longer supported. Java 17 and Java 21 are supported.

New

Solr and Lucene support: Solr 9.9.0 and Lucene 9.12.2 are now supported. (ETROG-3718)

Third-party component updates

Table 53. Updated

Package	Old Version	New Version
Apache Commons IO	2.19.0	2.20.0
Apache Commons Lang	3.17.0	3.18.0
Apache Log4j	2.24.3	2.25.1
fastutil	8.5.15	8.5.16
Jackson	2.19.0	2.19.2
JavaCPP	1.5.11	1.5.12
Metrics Core	4.2.30	4.2.35
Protocol Buffers	4.30.2	4.31.1
Woodstox	7.1.0	7.1.1

Release 7.47.8.c78.0

June 2025

New

Solr and Lucene support: Solr 9.8.1 is now supported. (ETROG-3714)

Bug Fixes

We fixed a bug where long words inside long spans of text without punctuation would sometimes be tokenized incorrectly. (ETROG-3716)

Third-party component updates

Table 54. Updated

Package	Old Version	New Version
Apache Commons IO	2.18.0	2.19.0
Guava	33.4.0-jre	33.4.8-jre
Jackson	2.18.2	2.19.0
JUnit	5.11.4	5.12.2
JUnit Platform	1.11.4	1.12.2
Protocol Buffers	4.29.3	4.30.2
SnakeYAML	2.3	2.4

Table 55. Added

Package	Version	License
Guava InternalFutureFailureAccess and InternalFutures	1.0.3	Apache 2.0
JSpecify	1.0.0	Apache 2.0

Release 7.47.7.c77.0

March 2025

New

Solr and Lucene support: Solr 8.11.4 and Lucene 8.11.4 are now supported. (ETROG-3703)
Solr and Lucene support: Solr 9.8.0 and Lucene 10.1.0 are now supported. (ETROG-3704)

Third-party component updates

Table 56. Updated

Package	Old Version	New Version
@API Guardian	1.1.0	1.1.2
Apache Commons IO	2.17.0	2.18.0
Apache Log4j	2.24.1	2.24.3
Guava	33.3.1-jre	33.4.0-jre
Jackson	2.17.2	2.18.2
Jakarta Annotations API	1.3.3	3.0.0
JavaCPP	1.5.10	1.5.11
JUnit	5.7.0	5.11.4
JUnit Platform	1.7.0	1.11.4
OpenTest4J	1.2.0	1.3.0
Protocol Buffers	3.25.5	4.29.3
Metrics Core	4.2.28	4.2.30

Table 57. Added

Package	Version	License
Java Architecture for XML Binding	2.2.12	CDDL 1.1 & GPL 2 + CE

Release 7.47.6.c76.0

November 2024

New

Solr and Lucene support: Lucene 9.11.1 and Solr 9.7.0 are now supported. (ETROG-3695)
Java 21 support: Java 21 is now supported. Java 11 and 17 are still supported. (ETROG-3698)
Unicode update: Unicode 16.0 is now supported. (ETROG-3694)

Third-party component updates

Table 58. Updated

Package	Old Version	New Version
Apache Commons IO	2.16.1	2.17.0
Apache Commons Lang	3.16.0	3.17.0
Apache Log4j	2.23.1	2.24.1
fastutil	8.5.14	8.5.15
Guava	33.3.0-jre	33.3.1-jre
JavaCPP	1.5.8	1.5.10
Protocol Buffers	3.25.3	3.25.5
SnakeYAML	2.2	2.3
Woodstox	7.0.0	7.1.0

Release 7.47.5.c74.0

September 2024

Bug Fixes

We fixed a bug where long compound words (70-100 characters) in Danish, Norwegian (Bokmål and Nynorsk), and Swedish could cause an OutOfMemoryError. (ETROG-3696)

Release 7.47.4.c75.0

September 2024

New

Neural model support improved: We upgraded TensorFlow Java to version 1.0.0-rc.1, which adds support for macOS ARM64. Neural models are now supported for macOS ARM64. (ETROG-3554)

Bug Fixes

Trailing decimal points in Chinese are no longer treated as part of a decimal fraction. When “点” or “點” ends a number, it is no longer segmented as part of the number. (ETROG-3680)

Third-party component updates

Table 59. Updated

Package	Old Version	New Version
Apache Commons Lang	3.14.0	3.16.0
fastutil	8.5.13	8.5.14
Guava	33.2.0-jre	33.3.0-jre
Jackson	2.17.1	2.17.2
TensorFlow for Java	0.3.3	1.0.0-rc.1
Woodstox	6.6.2	7.0.0

Release 7.47.3.c74.0

June 2024

New

Solr and Lucene support: Lucene 9.10 and Solr 9.6 are now supported. (ETROG-3683)
CLA lexicon: Two terms have been added to the CLA lexicon: 喷码机 (inkjet printer) and 管理器 (manager, in the context of software). (ETROG-3678)
Improved Readings: Readings are now returned for numeric words in Chinese and Japanese when tokenizerType is set to SPACELESS_LEXICAL. (ETROG-3684)

Bug Fixes

When tokenizerType is set to SPACELESS_LEXICAL, Japanese tokens for verbs in lemma form have had their readings fixed to cover the entire token. (ETROG-3640)
Example: Input: 食べる
- Previous Reading: た
- Current Reading: たべる
When tokenizerType is set to SPACELESS_LEXICAL, Japanese lemmatization has been corrected for numeric tokens containing both decimal points and multiplier characters.
Example: Input: 2.5亿
- Previous Lemma: 2500000000
- Current Lemma: 250000000
We fixed a bug where the Chinese word “星期四” would be tokenized incorrectly in certain contexts. (ETROG-3582)
We fixed a bug where a token whose surface form was the empty string could be returned when fragmentBoundaryDetection was set to true (the default). (ETROG-3686)
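
The decimal-plus-multiplier lemma correction above (2.5亿 now yields 250000000) can be illustrated with a small standalone sketch. This is not the RBL-JE implementation: the class name NumericLemmaSketch, the method name, and the short multiplier table are all assumptions made for illustration.

```java
import java.math.BigDecimal;
import java.util.Map;

// Hypothetical illustration only; not part of the RBL-JE API.
final class NumericLemmaSketch {
    // Assumed multiplier table: U+4E07 (ten thousand), U+4EBF and U+5104 (hundred million).
    private static final Map<Character, BigDecimal> MULTIPLIERS = Map.of(
        '\u4e07', new BigDecimal("10000"),
        '\u4ebf', new BigDecimal("100000000"),
        '\u5104', new BigDecimal("100000000"));

    // Converts a numeric token with an optional trailing multiplier
    // character to a plain-digit lemma.
    static String lemma(String token) {
        char last = token.charAt(token.length() - 1);
        BigDecimal multiplier = MULTIPLIERS.get(last);
        if (multiplier == null) {
            // No multiplier: the token is expected to be a plain decimal number.
            return new BigDecimal(token).stripTrailingZeros().toPlainString();
        }
        BigDecimal mantissa = new BigDecimal(token.substring(0, token.length() - 1));
        // Exact decimal arithmetic avoids the off-by-a-power-of-ten result
        // that ignoring the decimal point would produce.
        return mantissa.multiply(multiplier).stripTrailingZeros().toPlainString();
    }
}
```

Calling NumericLemmaSketch.lemma with the token 2.5亿 returns the string 250000000, matching the corrected lemma in the example.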

Third-party component updates

Table 60. Updated

Package	Old Version	New Version
Apache Commons IO	2.15.1	2.16.1
Apache Log4j	2.21.1	2.23.1
args4j	2.33	2.37
Guava	33.0.0-jre	33.2.0-jre
Jackson	2.16.1	2.17.1
Protocol Buffers	3.25.0	3.25.3
Woodstox	4.4.1	6.6.2

Release 7.47.2.c73.0

March 2024

New

Unicode update: Unicode 15.1 is now supported. (ETROG-3595)
Solr and Lucene support: Lucene 9.9 and Solr 9.5 are now supported. (ETROG-3673)

Third-party component updates

Table 61. Updated

Package	Old Version	New Version
Apache Commons IO	2.15.0	2.15.1
Apache Commons Lang	3.12.0	3.14.0
fastutil	8.5.12	8.5.13
Guava	32.1.3-jre	33.0.0-jre
Guava InternalFutureFailureAccess and InternalFutures	1.0.1	1.0.2
ICU4J	70.1	74.2
Jackson Annotations	2.15.3	2.16.1
Jackson Core	2.15.3	2.16.1
Jackson Databind	2.15.3	2.16.1
Jackson Dataformat XML	2.15.3	2.16.1
Jackson Dataformat YAML	2.15.3	2.16.1
Jackson Datatype Guava	2.15.3	2.16.1
Jackson Old JAXB Annotations	2.15.3	2.16.1

Release 7.47.1.c72.0

December 2023

New

Japanese improvements: We have improved and augmented the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3532, 3581, 3535, 3668)
Solr and Lucene support: Lucene 9.5 – 9.8 and Solr 9.4 are now supported. (ETROG-3665)

Third-party component updates

Table 62. Updated

Package	Old Version	New Version
Jackson Annotations	2.15.2	2.15.3
Jackson Core	2.15.2	2.15.3
Jackson Databind	2.15.2	2.15.3
Jackson Dataformat XML	2.15.2	2.15.3
Jackson Dataformat YAML	2.15.2	2.15.3
Jackson datatypes: collections	2.15.2	2.15.3
Jackson Modules: Base	2.15.2	2.15.3
Guava: Google Core Libraries for Java	32.1.2-jre	32.1.3-jre
Protocol Buffers [Core]	3.23.4	3.25.0
Apache Commons IO	2.11.0	2.15.0
Apache Log4j API	2.20.0	2.21.1
Apache Log4j Core	2.20.0	2.21.1
Apache Log4j SLF4J Binding	2.20.0	2.21.1
Stax2 API	4.2.1	4.2.2
SnakeYAML	2.0	2.2

Release 7.47.0.c71.0

September 2023

New

Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)
Expanded Japanese lexicon: We have expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)
Added secondary parts of speech: We've added support for secondary parts of speech to Chinese and Japanese when tokenizerType is set to spaceless_lexical. (ETROG-3636)
Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:
- Readings are merged into a single reading if the readings become the same string after tone mark removal. (ETROG-3625)
- Chinese readings are now returned as a list. Previously, a token with multiple possible readings returned a single string with brackets and semicolons. (ETROG-3626)
  Example: "蔭權"
  - Previous readings returned: "【yīn;yìn】quán"
  - Readings now returned: "yīnquán" and "yìnquán"
Solr and Lucene support: Lucene 9.5 - 9.7 and Solr 9.3 are now supported. (ETROG-3643)
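
For code that stored readings in the earlier bracketed form, the relationship between the old string and the new list can be sketched as follows. This is a hypothetical helper, not an RBL-JE API: the class and method names are invented, and the parsing rules (alternatives inside U+3010 and U+3011, separated by semicolons) are inferred only from the single example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the RBL-JE API.
final class ReadingExpansionSketch {
    // Expands a bracketed reading string into every combination of alternatives.
    static List<String> expand(String bracketed) {
        List<String> results = new ArrayList<>();
        results.add("");
        int i = 0;
        while (i < bracketed.length()) {
            String[] alternatives;
            if (bracketed.charAt(i) == '\u3010') {           // opening bracket U+3010
                int close = bracketed.indexOf('\u3011', i);   // closing bracket U+3011
                if (close < 0) {
                    throw new IllegalArgumentException("unclosed bracket");
                }
                alternatives = bracketed.substring(i + 1, close).split(";");
                i = close + 1;
            } else {
                // Literal run up to the next bracket group (or end of string).
                int next = bracketed.indexOf('\u3010', i);
                int end = next < 0 ? bracketed.length() : next;
                alternatives = new String[] { bracketed.substring(i, end) };
                i = end;
            }
            // Append each alternative to each reading built so far.
            List<String> extended = new ArrayList<>();
            for (String prefix : results) {
                for (String alternative : alternatives) {
                    extended.add(prefix + alternative);
                }
            }
            results = extended;
        }
        return results;
    }
}
```

Applied to the previous output in the example, expand returns the two readings listed under "Readings now returned", in the same order.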

Bug Fixes

We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)
When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before pinyin syllables that start with "a", "e", or "o" which are not the first syllable. (ETROG-3637)
We fixed a bug in the UPT-16 conversion where some Japanese particles were not tagged with the correct part of speech. The particles are now tagged correctly as ADP. (ETROG-3526)
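
The apostrophe rule for character-by-character Chinese readings described above can be sketched in a few lines. This is an illustration under stated assumptions, not the RBL-JE code: the class and method names are hypothetical, and the check for a leading "a", "e", or "o" ignores tone marks by decomposing the syllable first.

```java
import java.text.Normalizer;
import java.util.List;

// Hypothetical illustration only; not part of the RBL-JE API.
final class PinyinJoinSketch {
    // Joins per-character pinyin syllables into one reading, inserting an
    // apostrophe before any non-initial syllable that starts with a, e, or o.
    static String join(List<String> syllables) {
        StringBuilder reading = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String syllable = syllables.get(i);
            if (i > 0 && startsWithAEO(syllable)) {
                reading.append('\'');
            }
            reading.append(syllable);
        }
        return reading.toString();
    }

    private static boolean startsWithAEO(String syllable) {
        if (syllable.isEmpty()) {
            return false;
        }
        // NFD separates a tone-marked vowel into its base letter plus a
        // combining mark, so the first char is the bare vowel.
        String decomposed = Normalizer.normalize(syllable, Normalizer.Form.NFD);
        char first = Character.toLowerCase(decomposed.charAt(0));
        return first == 'a' || first == 'e' || first == 'o';
    }
}
```

For the syllables xī and ān, join returns xī'ān; for běi and jīng it returns běijīng with no apostrophe.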

Known Issues

The plugin for Solr 8.11 may throw an exception when using multiple neural models.

Third-party component updates

Table 63. Updated

Package	Old Version	New Version
Jackson Annotations	2.15.0	2.15.2
Jackson Core	2.15.0	2.15.2
Jackson Databind	2.15.0	2.15.2
Jackson Dataformat XML	2.15.0	2.15.2
Jackson Dataformat T	2.15.0	2.15.2
Jackson Datatype: Guava	2.15.0	2.15.2
Jackson Module: Old JAXB Annotations	2.15.0	2.15.2
Guava: Google Core Libraries for Java	31.1-jre	32.1.2-jre
Protocol Buffers [Core]	3.21.7	3.23.4

Release 7.46.4.c70.0

June 2023

New

Lucene and Solr support: RBL-JE now supports Lucene 9.2 to 9.4 and Solr 9.2. (ETROG-3631)

Third-party component updates

Table 64. Updated

Package	Old Version	New Version
Apache Log4J	2.19.0	2.20.0
fastutil	8.5.9	8.5.12
Jackson Annotations	2.14.0	2.15.0
Jackson Core	2.14.0	2.15.0
Jackson Databind	2.14.0	2.15.0
Jackson Dataformat XML	2.14.0	2.15.0
Jackson dataformats: Text	2.14.0	2.15.0
Jackson datatypes: collections	2.14.0	2.15.0
Jackson modules: Base	2.14.0	2.15.0
SnakeYAML	1.33	2.0

Release 7.46.3.c69.0

March 2023

Bug Fixes

We fixed a bug where processing an empty string as Chinese or Japanese with tokenizerType set to spaceless_lexical would throw a NullPointerException. (ETROG-3629)

Known Issues

In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)

Release 7.46.2.c69.0

March 2023

New

We've improved the time it takes to tokenize extremely long (> 10K characters) Japanese sentences. (ETROG-3602)

Known Issues

In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)

Release 7.46.1.c68.0

December 2022

Bug Fixes

Annotating Ukrainian with universalPosTags set to true no longer results in a RosetteUnsupportedLanguageException being thrown. (ETROG-3604)

Release 7.46.0.c68.0

November 2022

New

Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)
Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)
GPU support: Features that use TensorFlow now use a GPU if available. (ETROG-3564)
Emoji support: Emoji 15.0 is now supported. (ETROG-3577)
New option for Katakana: We've added the option joinKatakanaNextToMiddleDot to control whether sequences of Japanese Katakana tokens adjacent to a middle dot should be merged into a single Katakana token. By default, it is true, which matches the behavior in previous versions of RBL-JE. (ETROG-3592)
Solr 9.1 support: Lucene and Solr 9.1 are supported. (ETROG-3597)

Bug Fixes

The Japanese POS tag VN (verbal noun) is now mapped to the UPT-16 POS tag NOUN. It was previously mapped to VERB. (ETROG-3583)

Third-party component updates

Table 65. Upgraded

Package	Old version	New version
Apache Log4j	2.17.1	2.19.0
fastutil	8.5.6	8.5.9
Jackson	2.11.1	2.14.0
JavaCPP	1.5.8-alpha.20220614.013710.426	1.5.8
SLF4J	1.7.33	1.7.36
SnakeYAML	1.30	1.33

Release 7.45.0.c67.0

September 2022

New

Tagalog support:
- RBL now supports Part of Speech (POS) tagging in Tagalog. (ETROG-3559)
- RBL now supports lemmatization for Tagalog. (ETROG-3570)
- The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)
Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)

Bug Fixes

We fixed a bug in Russian where certain uncommon consonant–vowel sequences in words in the lexicon were incorrectly replaced with more common sequences with different vowel letters. (ETROG-3541)
Example: брошюра
- Previously: брошура
- Now: брошюра

Release 7.44.2.c67.0

July 2022

New

An open source package (JavaCPP) has been updated to allow the Elasticsearch plugin to use TensorFlow. (ESPI-168)

Release 7.44.1.c67.0

June 2022

Bug Fixes

RBL-JE no longer crashes when loading TensorFlow fails with a NoClassDefFoundError. It now throws an exception. (ESPI-169)

Release 7.44.0.c67.0

June 2022

New

Indonesian support added: RBL now supports Part of Speech (POS) tagging in Indonesian. (ETROG-3543)
Malay (Standard) support added: RBL now supports Part of Speech (POS) tagging in Malay (Standard). (ETROG-3545)
Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)
Java 17 support added: Java 17 is now supported. Java 8 and 9 support has been removed. (ETROG-3524)
Solr 9 support added: RBL now supports Lucene and Solr 9. (ETROG-3549)
Solr 6 support removed: RBL no longer supports Lucene or Solr 6 or earlier. (ETROG-3519)

Bug Fixes

In Japanese, negative forms of ichidan verbs written all in hiragana are no longer lemmatized to end with “なう”. (ETROG-3534)
Example: Input: くれない
- Previously: lemmatized to くれなう
- Now: lemmatized to くれる

Third-party component updates

This release includes the following third-party component changes:

Table 66. Added

Package	Version	License
Jakarta Annotations API	1.3.3	Eclipse Public License 2.0 and GPL 2 with classpath exception

Release 7.43.0.c66.0

February 2022

Notice

Solr 6 and earlier support is deprecated as of this release.

Java 8 and Java 9 support is deprecated as of this release.

New

Solr 8.11 support: This release supports Solr 8.11 (ETROG-3502)
Deprecated methods: Token#getType has been deprecated as token types are not used in RBL-JE without the Lucene/Solr plugins and the plugins use a different API. (ETROG-3503)
Solr 6 support deprecated: Support for Solr versions 6.x and earlier is deprecated as of this release and will be removed in the next version.
Permission changes: We removed group and other write permissions from model files. All files are now only writable by the owner. (ETROG-3516)
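
If model files from an older installation need the same treatment, the permission change described above amounts to clearing two POSIX permission bits. The following is a minimal sketch using only standard java.nio types; the class and method names are hypothetical and nothing here is taken from the RBL-JE distribution or its installer.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; not part of the RBL-JE API.
final class OwnerOnlyWriteSketch {
    // Returns a copy of the given permissions with group and other
    // write access removed; all remaining bits are left unchanged.
    static Set<PosixFilePermission> withoutGroupAndOtherWrite(Set<PosixFilePermission> permissions) {
        Set<PosixFilePermission> result = EnumSet.noneOf(PosixFilePermission.class);
        result.addAll(permissions);
        result.remove(PosixFilePermission.GROUP_WRITE);
        result.remove(PosixFilePermission.OTHERS_WRITE);
        return result;
    }
}
```

On a POSIX file system, the resulting set could be applied to a file with Files.setPosixFilePermissions. A file with permissions rw-rw-rw- would end up as rw-r--r--.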

Third-party component updates

This release includes the following third-party component changes:

Table 67. Upgraded

Package	Old Version	New Version
Apache Commons IO	2.7	2.11.0
Apache Commons Lang	2.6	3.12.0
Apache Log4j	1.2.17	2.17.1
ICU4J	59.1	70.1
fastutil	8.4.0	8.5.6
SLF4J	1.7.28	1.7.33
SnakeYAML	1.26	1.30
TensorFlow for Java	0.2.0	0.3.3

Release 7.42.2.c65.0

November 2021

Bug Fixes

RBL no longer crashes in Arabic when emoticons is enabled. This fixes a bug introduced in 7.42.1. (ETROG-3493)

Release 7.42.1.c65.0

November 2021

Bug Fixes

In Korean, emoji and other language-neutral tokens no longer cause a ClassCastException to be thrown when using the ADM API. (ETROG-3488)

Release 7.42.0.c65.0

November 2021

New

Deprecated factories: TokenizerFactory, AnalyzerFactory, and CSCAnalyzerFactory have been deprecated in favor of BaseLinguisticsFactory. (ETROG-3453)
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts, for Japanese with tokenizerType set to spaceless_lexical. (ETROG-3474)
Example: Input: 三・一四
- Previously: tokenization: 三 / ・ / 一四
- Now: tokenization: 三・一四
Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)
Emoji support: We now support the emoji in Unicode 14.0 (ETROG-3476)
Japanese tokenization: In Japanese, when tokenizerType is set to spaceless_lexical, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. This is consistent with the default algorithm, spaceless_statistical. (ETROG-3475)
Solr 8.10 support: This release supports Solr 8.10. (ETROG-3482)
Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuation, and symbols instead of being marked as unknown or some other tag. This applies to all languages with POS tags. (ETROG-3481)
Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)

Bug Fixes

In Japanese, when tokenizerType is set to spaceless_lexical, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
We've reverted two of the POS changes made in version 7.39.0.c63.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)
The values are now:
- "|以" (Chinese) "|" tagged as PUNCT
- "2对" (Chinese) "对" tagged as NM
RBL-JE no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)
In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
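
The emoji changes in this release depend on the variation selector that follows a character: U+FE0E requests text presentation and U+FE0F requests emoji presentation. The sketch below only shows how to inspect that selector with standard Java string methods. The class, enum, and method names are hypothetical, and it does not reproduce how RBL-JE decides whether a token is an emoji.

```java
// Hypothetical illustration only; not part of the RBL-JE API.
final class EmojiPresentationSketch {
    private static final int TEXT_SELECTOR = 0xFE0E;   // text presentation selector
    private static final int EMOJI_SELECTOR = 0xFE0F;  // emoji presentation selector

    enum Presentation { TEXT, EMOJI, DEFAULT }

    // Reports which presentation, if any, is explicitly requested for the
    // code point that starts at the given char index.
    static Presentation requestedPresentation(String text, int index) {
        int codePoint = text.codePointAt(index);
        int next = index + Character.charCount(codePoint);
        if (next < text.length()) {
            int following = text.codePointAt(next);
            if (following == TEXT_SELECTOR) {
                return Presentation.TEXT;
            }
            if (following == EMOJI_SELECTOR) {
                return Presentation.EMOJI;
            }
        }
        // No selector follows, so the character's default presentation applies.
        return Presentation.DEFAULT;
    }
}
```

Under the behavior described in these notes, a character reported as TEXT here would not be detected as an emoji, while U+3030 and U+303D are tagged as emoji for both EMOJI and DEFAULT.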

Release 7.41.1.c65.0

July 2021

Bug Fixes

Enabling universalPosTags for Indonesian, Tagalog, or Standard Malay no longer throws a RosetteUnsupportedLanguageException. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3465)

Release 7.41.0.c65.0

July 2021

New

New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
Solr 8.9 support: This release supports Solr 8.9. (ETROG-3457)

Release 7.40.1.c64.1

May 2021

Bug Fixes

We fixed a bug where enabling alternativeSpanishDisambiguation for Spanish caused a NullPointerException to be thrown. (ETROG-3435)
We fixed a bug where setting disambiguatorType to DNN for Hebrew caused a RosetteRuntimeException to be thrown. (ETROG-3437)

Release 7.40.0.c64.1

May 2021

New

New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)
New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)

Bug Fixes

Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)
Example: Input: 天津？？
- Previously:
  Token{text=天津}
  HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
  Token{text=？？}
  HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=？？, tagSet=BT_CHINESE}
- Now:
  Token{text=天津}
  HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
  Token{text=？}
  HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=？, tagSet=BT_CHINESE}
  Token{text=？}
  HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=？, tagSet=BT_CHINESE}

Third-party component updates

This release includes the following third-party component changes:

Table 68. Added

Package	Version
JavaCPP	1.5.4

Table 69. Upgraded

Package	Old Version	New Version
TensorFlow	1.14.0	2.3.1

Release 7.39.0.c63.0

March 2021

New

Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)
- TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.
- NFKC normalization is now supported for Hebrew.
- We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
- We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
- Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Solr 8.8: This release supports Solr 8.8. (ETROG-3369)
Improved CSCAnnotator output: The CSCAnnotator now emits tokens in addition to translations, even if no tokens were specified in the input. (ETROG-3356)
Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)
Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)
- cat/ca-ud-train.downcased.mdl
- est/et-ud-train.downcased.mdl
- fas/posLemma.mdl
- lav/lv-ud-train.downcased.mdl
- nno/lemma.mdl
- nob/lemma.mdl
- slk/sk-ud-train.downcased.mdl
- srp/sr-ud-train.downcased.mdl

Bug Fixes

Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)

Example: מע'רב

Previously:

Token{text=מע'}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, 
partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}
MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, 
partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}

Now:

Token{text=מע'רב}
MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[],
com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, 
tagSet=MILA_HEBREW}

A structured region containing two new lines is now properly labeled as STRUCTURED. Previously, the layout region would be labeled as UNSTRUCTURED. (ETROG-3378)
Example: * item\n* item\n* item\n\n
- Previously:
```
{"startOffset": 0,"endOffset": 14,"layout": "STRUCTURED"}
{"startOffset": 14,"endOffset": 22,"layout": "UNSTRUCTURED"}
```
- Now:
```
{"startOffset": 0,"endOffset": 22,"layout": "STRUCTURED"}
```

Release 7.38.1.c63.0

January 2021

New

RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)

Bug Fixes

We reverted some changes to Korean disambiguation from the 7.37.0.c62.2 release as the changes introduced new disambiguation errors. (ETROG-3349)

Third-party component updates

This release includes the following third-party component changes:

Package	Old Version	New Version
Jackson Annotations	2.10.0	2.11.1
Jackson Core	2.10.0	2.11.1
Jackson Databind	2.10.0	2.11.1
Jackson Dataformat XML	2.10.0	2.11.1
Jackson dataformats: Text	2.10.0	2.11.1
Jackson modules: Base	2.10.0	2.11.1
Protocol Buffers	3.6.1	3.12.2
Apache Commons IO	2.6	2.7
fastutil	8.3.0	8.4.0
Woodstox Stax2 API	4.2	4.2.1
SnakeYAML	1.25	1.26

Release 7.38.0.c62.2

December 2020

New

Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)
Example: δείξε
- Previously: Selected lemma: δεικνύω (archaic)
- Now: Selected lemma: δείχνω (modern)
New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)
Deprecated classes: The classes BufferWordBreaker and WordBreakResults have been deprecated. (ETROG-3318)

Bug Fixes

The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)
Example: Start+
- Previously: POS tags: possible PROP, ADJ, NOUN
- Now: POS tag: FM
GenericTokenizer#hasNext is now implemented to be consistent with the documentation for Iterator#hasNext. Previously it always returned false. (ETROG-2140)
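
For reference, the Iterator#hasNext contract mentioned above requires hasNext to return true exactly when next would return an element rather than throw. A minimal sketch of an iterator that satisfies it follows. TokenIteratorSketch is a hypothetical stand-in over a fixed list of strings, not the GenericTokenizer class.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; not the RBL-JE GenericTokenizer.
final class TokenIteratorSketch implements Iterator<String> {
    private final List<String> tokens;
    private int position = 0;

    TokenIteratorSketch(List<String> tokens) {
        this.tokens = List.copyOf(tokens);
    }

    @Override
    public boolean hasNext() {
        // True exactly when next() would return an element.
        return position < tokens.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.get(position++);
    }
}
```

An iterator whose hasNext always returns false, as in the previous behavior, makes a standard while (it.hasNext()) loop exit without visiting any tokens.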

Release 7.37.0.c62.2

November 2020

New

Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)
Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)
Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)
Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)
Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, supporting inclusion of files in rule files. (ETROG-2497)
Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)
New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)
Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)

Bug Fixes

In Hebrew, combining characters are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)
A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)
RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)
Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)
The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)
- Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET
- Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
Tokens no longer have null token types. (ETROG-3316)
When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)
Example: ﷺ
- Previously: Offsets:
  صلى start 0 end 0
  الله start 0 end 0
  عليه start 0 end 0
  وسلم start 0 end 1
- Now: Offsets:
  صلى start 0 end 1
  الله start 0 end 1
  عليه start 0 end 1
  وسلم start 0 end 1
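
The offsets in this example follow from how NFKC expands a single source character into several words. The sketch below reproduces that expansion with the standard java.text.Normalizer and assigns every resulting word the span of the original character. The class names and the whitespace split are assumptions for illustration, not the RBL-JE tokenizer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the RBL-JE API.
final class NfkcOffsetSketch {
    // A word together with its offsets in the original, unnormalized text.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Normalizes a single source character with NFKC and maps every
    // resulting word back to the full span of that character.
    static List<Span> tokenize(String sourceCharacter) {
        String normalized = Normalizer.normalize(sourceCharacter, Normalizer.Form.NFKC);
        List<Span> spans = new ArrayList<>();
        for (String word : normalized.split(" ")) {
            // Every word originates from the same source character, so each
            // one gets that character's start and end rather than an empty span.
            spans.add(new Span(word, 0, sourceCharacter.length()));
        }
        return spans;
    }
}
```

For the ligature U+FDFA used in the example, tokenize produces four spans, each with start 0 and end 1.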

Release 7.36.0.c62.2

September 2020

New

Lucene/Solr: Versions up through 8.6.0 are now supported. (ETROG-3250)
Decompose compounds: The option to control decomposition of compounds is now available in Dutch, German, Hungarian, Danish, Bokmål, Nynorsk, Swedish, and Korean. The default for decomposeCompounds is true. (ETROG-3263, ETROG-3264, ETROG-3265)
Performance improvement: English and Spanish disambiguation is now faster. Alternative disambiguation (alternativeEnglishDisambiguation or alternativeSpanishDisambiguation) must be set to false. (ETROG-3246, ETROG-3243)

Bug Fixes

In Hebrew, prefixes in some acronym tokens are now listed correctly in the list of prefixes, instead of being duplicated in the lemma. (ETROG-3214)
Example: “ומש"ס”
- Previously: lemma: “ומומש"ס”, empty prefix list
- Now: lemma: “ש"ס”, prefix list = [“ו”, “מ”]
Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled. (ETROG-3241)
Example: "a very very very very long line\nshort\n\n"
- Previously: 2 sentences
```
{"startOffset":0,"endOffset":20}
{"startOffset":20,"endOffset":27}
```
- Now: 1 sentence
```
{"startOffset":0,"endOffset":26}
```
In Hebrew, lemmas starting or ending with spaces now have the spaces removed. (ETROG-3248)
Example: "אאורקה"
- Previously: “אאורקה ”
- Now: "אאורקה"
Analyses of unknown Hebrew words with guessed prefixes no longer have duplicate prefixes in their prefix lists. (ETROG-3253)
Example: "בפיירפוקס"
- Previously: prefix list: [ב, ב]
- Now: prefix list: [ב]
In Chinese and Japanese, the system no longer crashes when both fragmentBoundaryDetection and alternativeTokenization are enabled. (ETROG-3260)
In Japanese, adjacent tokens are no longer erroneously joined when alternativeTokenization is enabled. (ETROG-3261)
When universalPosTags is enabled, the UPT-16 POS tags are now marked as having the tag set UPT16_V1 instead of the default tag set of the language. (ETROG-3273)
Example: French
- Previously: tag set: BT_FRENCH
- Now: tag set: UPT16_V1
We've fixed the tokenize-analyze example in the samples directory. It now correctly produces results for Hebrew analysis. (ETROG-3252)

Release 7.35.0.c62.2

July 2020

New Features

Layout regions added: Layout regions, describing each section of input text as STRUCTURED or UNSTRUCTURED, are now identified by the annotator. In order to detect layout regions, fragment boundary detection must be enabled. (ETROG-3172)
New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)
Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)
Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)
Hebrew user-defined dictionaries added: Static and dynamic user-defined Hebrew analysis dictionaries are now supported. (ETROG-3230)
Deprecated method: HebrewAnalysis#characteristicString is now deprecated. (ETROG-3209)
Order of user-defined dictionaries: The order in which user-defined dictionaries are consulted has been standardized. Refer to the RBL-JE Application Developer's Guide for details. (ETROG-3148)
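
The short-line test controlled by maxTokensForShortLine can be pictured with the sketch below. It rests on several assumptions that these notes do not confirm: tokens are approximated by splitting on whitespace (which would not work for Chinese, Japanese, or Thai), the limit is treated as inclusive, and the class and method names are hypothetical.

```java
// Hypothetical illustration only; not part of the RBL-JE API.
final class ShortLineSketch {
    // Default stated in the release notes.
    static final int DEFAULT_MAX_TOKENS_FOR_SHORT_LINE = 6;

    // Returns true when the line has at most maxTokens tokens.
    // Assumption: whitespace splitting stands in for real tokenization,
    // and the bound is inclusive.
    static boolean isShortLine(String line, int maxTokens) {
        String trimmed = line.trim();
        int tokenCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return tokenCount <= maxTokens;
    }
}
```

With the default of 6, the line "short" counts as short, while "a very very very very long line" (7 whitespace-delimited tokens) does not.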

Bug Fixes

Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This only occurred when fstTokenize was enabled and in some languages. (ETROG-3159)
Example: "1\n234" (embedded newline within the number string)
- Previously: "1 234" (1 token)
- Now: "1" "234" (2 tokens)
This example assumes fstTokenize is enabled and the language is French.
Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)
Tokens with digits are now eligible for the Greek guesser. (ETROG-3231)
- Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags
- Now: "HDMI1" gets FM POS tag
In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only occurred when the guessHebrewPrefixes option was set to true. (ETROG-3221)
Example: "ומפיפרנו"
- Previously: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag preposition.
- Now: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag unknown.
Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)
Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"
- Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"
- Now: One analysis, lemmatized to "разложить"
German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)
Example: Input "Mitbehandlung"; "mit" is a separable prefix
- Previously: Lemmatized to "mitBehandlung"
- Now: Lemmatized to "Mitbehandlung"
In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)
Example: "1⃣" (keycap: <U+0031, U+20E3>)
- Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.
- Now: Tokenized to one token, "1⃣".
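The keycap example above depends on U+20E3 being a combining (enclosing) mark that attaches to the preceding character. As a minimal sketch using only the Java standard library (this is not RBL code, and the class and method names are hypothetical), the following checks whether a string ends in such a mark:

```java
final class TerminalMarkCheck {
    /** Returns true if the last code point of s is a mark (Unicode category Mn, Mc, or Me). */
    static boolean endsWithMark(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int last = s.codePointBefore(s.length());
        int type = Character.getType(last);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String keycap = "1\u20E3"; // DIGIT ONE followed by COMBINING ENCLOSING KEYCAP
        System.out.println(endsWithMark(keycap)); // true
        System.out.println(endsWithMark("1"));    // false
    }
}
```

A tokenizer that splits before a terminal mark would separate the mark from its base character, which is the behavior this fix removes.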

Release 7.34.2.c62.2

May 2020

New Features

Hebrew tokens that have prefixes but not stems now get appropriate parts of speech. Previously, they got the POS tag "unknown". (ETROG-3207)
Example: “ה” from the string “ה70”
- Previously: POS tag "unknown"
- Now: POS tag "quantifier"
Lucene/Solr up through version 8.5.1 is now supported. (ETROG-3208)
When guessHebrewPrefixes is true, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes. (ETROG-3188)
Example: Token: "ומפיפרנו"
- Previously: 2 analyses:
  hebrewPrefixes=[ו] lemma=מפיפרנו
  hebrewPrefixes=[ו, מ] lemma=פיפרנו
- Now: 3 analyses:
  hebrewPrefixes=[ו] lemma=מפיפרנו
  hebrewPrefixes=[ו, מ] lemma=פיפרנו
  hebrewPrefixes=[] lemma=ומפיפרנו

Bug Fixes

Minimally-qualified emoji are no longer split apart. (ETROG-3185)
Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)
- Previously: U+1F481 and <U+200D, U+2642> (2 tokens)
- Now: <U+1F481, U+200D, U+2642> (1 token)
Capitalized nouns are no longer being detected as verbs. (ETROG-3186)
Example: The noun "Service" from the phrase "Price and Quality of Service"
- Previously: POS tag VI (infinitive or imperative verb)
- Now: POS tag PROP (proper noun)
When creating multiple analyzers for Chinese, Japanese, or Thai with alternativeTokenization set to false (the default), the analyzers will now share the same model data. This will improve memory usage when creating multiple analyzers. (ETROG-3200)
Note: While memory usage has been improved, the process is still memory intensive. If RBL throws an OutOfMemoryError, increase the heap space.
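For the minimally-qualified emoji fix in this release (ETROG-3185), it may help to see how the example sequence is laid out in a Java string. The sketch below uses only the standard library; the class and method names are hypothetical and it does not call RBL.

```java
final class EmojiSequenceDemo {
    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F481 (written as a surrogate pair), U+200D ZERO WIDTH JOINER, U+2642 MALE SIGN
        String manTippingHand = "\uD83D\uDC81\u200D\u2642";
        System.out.println(describe(manTippingHand)); // U+1F481 U+200D U+2642
        System.out.println(manTippingHand.length());  // 4 UTF-16 code units
    }
}
```

The sequence is three code points but four UTF-16 code units, so token offsets that cover the whole emoji span four chars.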

Release 7.34.1.c62.2

March 2020

Bug Fixes

We removed some incorrect entries from the Hebrew lexicon that were added in 7.34.0.c62.2. (ETROG-3182, ETROG-3183)

Release 7.34.0.c62.2

March 2020

New Features

Lucene/Solr: RBL-JE now supports Lucene/Solr up through version 8.4.1. (ETROG-3156)
Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized. (ETROG-3164)
Additional emoji support: Emoji hair components are now lemmatized. (ETROG-3167)
German professions: Additional German professions have been added to the German lexicon. (ETROG-3163)
Spanish performance improvements: Spanish disambiguation is now faster when alternativeSpanishDisambiguation is false. (ETROG-3169)
Hebrew lemmatization: We increased proper noun coverage in the Hebrew lexicon. (ETROG-3161, ETROG-3162)

Bug Fixes

Low surrogates are no longer stripped from the ends of tokens in Hebrew. (ETROG-3165)
Number tokens with embedded spaces are no longer split into multiple tokens when preceded or followed by a symbol when fstTokenize is true. (ETROG-3158)
- Previously: $800 000 000 was tokenized as two tokens: $800 <TokenBoundary> 000 000
- Now: $800 000 000 is tokenized as a single token: $800 000 000
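As a rough illustration of the expected grouping in the example above, the following sketch matches a number with embedded spaces and an optional leading currency symbol as one unit. It is a plain regular expression written for this note, not the FST tokenizer that RBL uses, and the class name, method name, and pattern are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpacedNumberDemo {
    // Optional currency symbol, 1-3 digits, then any number of groups made of
    // one space plus exactly three digits, ending at a word boundary.
    private static final Pattern SPACED_NUMBER =
            Pattern.compile("\\p{Sc}?\\d{1,3}(?: \\d{3})*\\b");

    /** Returns every spaced-number match found in text, in order. */
    static List<String> findNumbers(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = SPACED_NUMBER.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findNumbers("It cost $800 000 000 in total."));
    }
}
```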

Release 7.33.0.c62.2

January 2020

New Features

The delimiters for the fragment boundary detector are now configurable. (ETROG-3116)
The fragment boundary detector now marks a boundary after any spaces following the fragment boundary delimiter. (ETROG-3116)
An underscore (U+005F) is no longer treated as a token separator in German when fstTokenize is enabled. (ETROG-3144)

Bug Fixes

We fixed a bug where tokens from multi-script Russian text sometimes had incorrect offsets if fstTokenize was enabled. (ETROG-3142)
We fixed a bug where multi-script Russian text would have a sentence break each time the script changed. (ETROG-3145)
We fixed a bug where there were unexpected sentence breaks after some short lines not ending in whitespace. (ETROG-3146)
We fixed a bug where sentence breaks were missing when the sentence break did not align with a token boundary. (ETROG-3140)

Release 7.32.0.c62.1

December 2019

New Features

Added support for Lucene/Solr up through version 8.3.0. (ETROG-3128)
Added support for tokenizing and lemmatizing Latvian. (ETROG-2798)
Latin-script regions within Russian documents are now tokenized and analyzed as English. (ETROG-3126)
TokenizerOption.licenseString, AnalyzerOption.licenseString, and BaseLinguisticsOption.licenseString may now be passed into a create method. Previously, these options had to be set on the factory itself. (ETROG-3134)

Bug Fixes

We fixed a bug where guessed German compounds were sometimes lemmatized as verbs but tagged as nouns. (ETROG-3094)
We fixed a bug where the fragment boundary detector would mark a sentence break after every Windows newline. (ETROG-3133)

Release 7.31.0.c62.0

November 2019

New Features

The Hebrew files dinflections.bin, dprefixes.data, and gimatria.data have been moved from the root/models directory to root/dicts/heb. (ETROG-3088)
Specifying the universalPosTags option now adds the deliverExtendedTags option as well. (ETROG-2185)
Dynamic user dictionaries can now be created and populated at runtime. See the section User-Defined Dictionaries in the Application Developer's Guide for details. (ETROG-3086, ETROG-3100, ETROG-3109, ETROG-3110, ETROG-3111)
Fragment boundary detection is now enabled by default. Previously it was disabled by default. (ETROG-3108)
TokenizerOption.alternativeTokenizationOptions has been deprecated in favor of separate options for each YAML key. See the Javadoc for details. (ETROG-3109)
The UPT-16 files upt-16-pes.yaml and upt-16-prs.yaml have been removed from the distribution package, as they were unused. (ETROG-3122)
The -order option in rbl-build-csc-dictionary has been removed. All dictionaries are now built as LE, as LE dictionaries still work on BE machines. (ETROG-3120)
We've added imperative forms for 2000 verbs to the Arabic lexicon. (ETROG-3090)
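The removal of the -order option (ETROG-3120) rests on the fact that little-endian data can be read on any machine as long as the reader states the byte order explicitly. The sketch below shows this with the Java standard library. The dictionary file format itself is not described in these notes, so the four-byte value and the class and method names are purely illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianReadDemo {
    /** Reads the first four bytes as a little-endian int, regardless of the machine's native order. */
    static int readLittleEndianInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        byte[] stored = {0x78, 0x56, 0x34, 0x12}; // least significant byte first
        System.out.println(Integer.toHexString(readLittleEndianInt(stored))); // 12345678
        System.out.println(ByteOrder.nativeOrder()); // native order does not affect the result above
    }
}
```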

Bug Fixes

Fragment boundary detection is now enabled for Hebrew. (ETROG-1442)
When lemmatizing numbers in Russian, numbers containing spaces will now be lemmatized without the space. For example, "1 234" will now be lemmatized as "1234" instead of "1 234". (ETROG-3101)
We fixed a bug introduced in 7.30.1.c61.0 which raised an ArrayIndexOutOfBoundsException when processing Japanese with alternativeTokenization and favorUserDictionary set to true. (ETROG-3118)
We fixed a bug where a middle dot would be ignored if it preceded white space when using alternativeTokenization in Japanese. (ETROG-3113)

Third-party component updates

Component	Version	Change
Apache Commons IO	2.6	Version upgrade
args4j	2.33	Version upgrade
fastutil	8.3.0	Version upgrade
Jackson Annotations	2.10.0	Version upgrade
Jackson Core	2.10.0	Version upgrade
Jackson Databind	2.10.0	Version upgrade
Jackson Dataformat XML	2.10.0	Version upgrade
Jackson dataformats: Text	2.10.0	Version upgrade
Jackson datatypes: collections	2.10.0	Version upgrade
Jackson modules: Base	2.10.0	Version upgrade
SLF4J	1.7.28	Version upgrade
SnakeYAML	1.25	Version upgrade
TensorFlow for Java	1.14.0	Version upgrade
Woodstox	4.4.1	Version upgrade
Woodstox Stax2 API	4.2	Version upgrade

Release 7.30.2.c61.0

September 2019

Bug Fixes

We fixed a bug where an AssertionError might be thrown when analyzing Hungarian with Java assertions enabled.
Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.
- Previously: Аполлона-11 (Apollo-11) was tagged as PROP, MISC, and NOUN
- Now: Аполлона-11 (Apollo-11) is tagged as NOUN
Correct token offsets are now returned from a Japanese annotator where a non-katakana character precedes a user-defined katakana token and alternativeTokenization and favorUserDictionary are enabled.
We fixed a bug where constructors of factory classes in the Lucene/Solr plugin would throw an UnsupportedOperationException if passed a Map that did not support the remove method.
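The last fix above concerns maps that reject remove. A common way to consume entries from a caller's map without requiring it to be mutable is to work on a copy. The sketch below shows that general pattern with the standard library only. It is not the plugin's code, and the class name, method name, and the "lang" key are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class OptionMapDemo {
    /** Returns a mutable copy of args with key removed; the caller's map is left untouched. */
    static Map<String, String> withoutKey(Map<String, String> args, String key) {
        Map<String, String> copy = new HashMap<>(args);
        copy.remove(key);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> source = new HashMap<>();
        source.put("lang", "rus"); // hypothetical option name
        Map<String, String> readOnly = Collections.unmodifiableMap(source);
        try {
            readOnly.remove("lang"); // unmodifiable views reject removal
        } catch (UnsupportedOperationException e) {
            System.out.println("remove is not supported on this map");
        }
        System.out.println(withoutKey(readOnly, "lang").isEmpty()); // true
    }
}
```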

Release 7.30.1.c61.0

August 2019

New Features

Added support for Lucene/Solr up through version 8.2.0.
Dictionaries and models that are used on both big- and little-endian machines no longer include LE in their file names.

Bug Fixes

When alternativeTokenization was set to true, the Chinese tokenizer could create tokens at the end of the input string with the part of speech NT without checking that the context was valid for NT.
Analyzing Chinese and Japanese with alternativeTokenization enabled is now much faster on sentences that are thousands of characters long.

Release 7.30.0.c61.0

August 2019

New Features

Segmentation user dictionaries can be used for all languages, not just Chinese, Japanese, and Thai.
The option compoundComponentSurfaceForms has been added to return the surface forms of the components of compound words. By default, RBL-JE only returns the lemmas.
Added support for Lucene/Solr up through version 8.1.1.
Some Polish words ending in “-cku”, “-ska”, or “-sku” are lemmatized to forms ending in “-cki” or “-ski”.

Bug Fixes

The Japanese POS tag NE was not converted correctly to UPT-16.
The French POS tag CONJQUE was converted to UPT-16 CONJ instead of the more appropriate SCONJ.
When alternativeTokenization was disabled, Chinese punctuation was tagged as GUESS instead of PUNCT or EOS.

Release 7.29.0.c61.0

June 2019

New Features

Setting alternativeTokenization to true enables an alternative tokenizer for Thai, for parity with the Thai tokenizer in Basis Technology's C++ API (RLP).
All Hebrew tokens have analyses. The main change was adding the new part of speech punctuation. Non-punctuation tokens that formerly had empty analysis lists now have the part of speech unknown.

Third-party component updates

Component	Version	Change
Jackson Annotations	2.9.8	Version upgrade
Jackson Core	2.9.8	Version upgrade
Jackson Databind	2.9.8	Version upgrade
Jackson Dataformat XML	2.9.8	Version upgrade
Jackson Dataformat YAML	2.9.8	Version upgrade
Jackson Datatype Guava	2.9.8	Version upgrade
Jackson Module JAXB Annotations	2.9.8	Version upgrade
Protocol Buffers	3.6.1	Version upgrade
SnakeYAML	1.23	Version upgrade

Release 7.28.2.c60.0

June 2019

New Features

U+2019 RIGHT SINGLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are now normalized to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew words, to match the normalization of U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM.
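The mapping described above can be written out as a small character substitution. This is a sketch only: it applies the four replacements to a whole string, whereas RBL applies them within Hebrew words, and the class and method names are hypothetical.

```java
final class HebrewQuoteNormalizationDemo {
    /** Maps the four quotation-like characters named in the release note to their ASCII targets. */
    static String normalizeQuotes(String s) {
        return s.replace('\u2019', '\'')  // RIGHT SINGLE QUOTATION MARK to APOSTROPHE
                .replace('\u05F3', '\'')  // HEBREW PUNCTUATION GERESH to APOSTROPHE
                .replace('\u201D', '"')   // RIGHT DOUBLE QUOTATION MARK to QUOTATION MARK
                .replace('\u05F4', '"');  // HEBREW PUNCTUATION GERSHAYIM to QUOTATION MARK
    }

    public static void main(String[] args) {
        // Three Hebrew letters with U+201D before the last one, written with escapes
        String input = "\u05E6\u05D4\u201D\u05DC";
        System.out.println(normalizeQuotes(input).equals("\u05E6\u05D4\"\u05DC")); // true
    }
}
```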

Release 7.28.1.c60.0

May 2019

New Features

Updated the English lexicon.
Added support for Lucene/Solr up through version 8.0.0.
Updated the German lexicon.
Updated the Swedish lexicon.
Arabic analysis will attempt to replace leading hamzated alefs with plain alefs for unrecognized tokens.

Bug Fixes

The surface forms of Hebrew tokens consisting of multiple prefixes without a base, like “מה”, are now the entire token text, instead of just the first prefix.
Russian hyphenated words that end in numbers, like “Аполлона-11”, are no longer tagged as DIG. They are now tagged with the same parts of speech they had before 7.27.2.c60.0.
Closing parentheses, brackets, and braces that follow URLs when urls is enabled are no longer merged into the URLs.
The disambiguator is now more likely to select analyses with the POS tags ATMENTION, EMAIL, HASHTAG, and URL over other analyses.
When the Hebrew tokenizer encounters a character not used in Hebrew immediately following a character used in Hebrew, it starts a new token. Formerly, it would delete that character and any following characters up to the next token separator (e.g. white space).
RBL-JE can now successfully read in ICU tokenization rule files that begin with a BOM.
Hebrew tokens consisting of multiple prefixes without a base are now tagged with the part of speech “unknown”, to match single-prefix tokens.
The English token “than” is tagged only as COTHAN. The candidate part of speech COORD has been removed for this token.
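Regarding the BOM fix above: a byte order mark decoded as text appears as a leading U+FEFF character, which a rule parser may not expect. The following sketch shows one simple way to drop it before parsing. It is an illustration with hypothetical names, not how RBL-JE reads rule files.

```java
final class BomDemo {
    /** Removes a single leading U+FEFF (byte order mark) if present. */
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF# rules start here";
        System.out.println(stripBom(withBom)); // # rules start here
    }
}
```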

Release 7.28.0.c60.0

May 2019

New Features

A perceptron-based disambiguator is available for Hebrew. It is used by default and when the option disambiguatorType is set to DisambiguatorType.PERCEPTRON. It was measured to have higher lemma and part of speech accuracies than the alternatives. To use the previous default, set disambiguatorType to DisambiguatorType.DICTIONARY.
Added support for Lucene/Solr up through version 7.6.0.
Running on Java 11 is now supported.

Bug Fixes

Some white space characters could be part of Chinese tokens when alternativeTokenization was enabled.
Tokens that are thousands of characters long slowed down the tokenizer.
Polish tokens that can appear in multiword expressions are no longer lemmatized to the full expressions. For example, “dzień” is not lemmatized to “dzień_dobry”.
The non-final components of Russian compound words with more than one hyphen were not lemmatized. The non-final components of Russian hyphenated compound words with the interfix “е” or “о” that coincidentally looked like the short forms of adjectives were lemmatized as if they were short forms.

RBL-JE Release Note Archive 7.27.0.c60.0 and earlier

Release 7.27 and earlier

New Features

Bugs Fixed

Third-Party Components

Known Problems

New Features

Release 7.27.0.c60.0

The Chinese Script Converter must be licensed distinctly from the rest of RBL. Old licenses won’t work for it anymore. (ETROG-2916)
Lemmatization is supported for Persian. (ETROG-2924)
A dictionary-based disambiguator is available for Hebrew and is now the default. To run disambiguation in TensorFlow, set the option disambiguatorType to DisambiguatorType.DNN. (ETROG-2928)

Release 7.26.5.c59.3

The tokenizer recognizes "百度" as a single token when alternativeTokenization is enabled. (ETROG-2909)

Release 7.26.3.c59.3

The tokenizer emits the normalized surface form as the lemma when alternativeTokenization is enabled. (ETROG-2892)

Release 7.26.0.c59.3

Analyzing German tokens with default ignorable code points, including U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER, produces the same analyses as if the tokens did not include those characters. (ETROG-2824)
Improved the lemma accuracy of the Spanish disambiguator. (ETROG-2856)
Improved disambiguation of English proper nouns. (ETROG-2867)
The North Korean (qkp) and South Korean (qkr) dialects are both treated as Korean (kor). (ETROG-2878)
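For the German item in this release about default ignorable code points (ETROG-2824), the effect is comparable to analyzing the token with those characters removed. The sketch below strips only the three code points named in the note; the full set of default ignorable code points is larger, and the class and method names are hypothetical.

```java
final class IgnorableStripDemo {
    /** Removes U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER. */
    static String stripIgnorables(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
                .filter(cp -> cp != 0x00AD && cp != 0x200C && cp != 0x200D)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String withSoftHyphen = "Schiff\u00ADfahrt"; // soft hyphen between the two parts
        System.out.println(stripIgnorables(withSoftHyphen)); // Schifffahrt
    }
}
```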

Release 7.25.0.c59.3

Added an additional lemma dictionary for each of the two Norwegian languages. (ETROG-2797)
Added support for Lucene/Solr 7.3.1 through 7.4.0. (ETROG-2862)

Release 7.24.6.c59.2

Added support for Lucene/Solr 7.2.1 through 7.3.1. (ETROG-2842)

Release 7.24.3.c59.2

Added support for TensorFlow on more CPUs.

Release 7.24.0.c59.2

Added support for tokenizing and lemmatizing Catalan, Estonian, Serbian, and Slovak. (ETROG-2752, ETROG-2774)

Release 7.23.1.c59.0

Improved the accuracy of the Hebrew disambiguator. (ETROG-2718)
Running on Java 9 is now supported. (ETROG-2722)

Release 7.23.0.c59.0

Added support for Lucene/Solr 7.0.0 through 7.1.0. (ETROG-2706)
POS-tagging and disambiguation are supported for Hebrew. (ETROG-2707, ETROG-2717)

Release 7.22.2.c59.0

Added support for Lucene/Solr 7.0.0 through 7.2.1. (ETROG-2751)

Release 7.22.0.c59.0

German words that are completely uppercase are now guessed as acronyms. (ETROG-2684)

Release 7.21.2.c59.0

Modified internal dependency structure. (ETROG-2677)

Release 7.21.1.c59.0

Updated the compatibility number to 59.0. (ETROG-2609)

Release 7.21.0.c58.3

Added the ArabicMorphoAnalysis class to allow an Annotated Data Model application to get more information for Arabic, Persian, and Urdu text than the MorphoAnalysis class would provide. (ETROG-2623)
Improved speed and memory footprint for English and Spanish disambiguation. (ETROG-2607, ETROG-2618, ETROG-2635)
Added the alternativeEnglishDisambiguation and alternativeSpanishDisambiguation options to specify the use of the old disambiguator in English and Spanish. The new disambiguator, introduced in version 7.18.0.c58.3, and enhanced in the current release, is more accurate, but slower. (ETROG-2626)
Added the guessHebrewPrefixes option to control whether to split possible prefixes off unknown Hebrew words. (ETROG-2642)
Normalized U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew. (ETROG-2647)
Filter out punctuation from Lucene/Solr when query is set. (ETROG-2648)
Added support for Lucene/Solr 6.6. (ETROG-2656)

Release 7.20.0.c58.3

Added tokenization and POS-tagging for at-mentions and hashtags in all languages. (ETROG-2571)
Added the options atMentions, emailAddresses, emoticons, hashtags, and urls to enable tokenization and POS-tagging of @mentions, email addresses, emoticons, hashtags, and URLs. They are all disabled by default. (ETROG-2583)

Release 7.19.0.c58.3

Added tokenization and POS-tagging for URLs and email addresses in all languages. (ETROG-2557)

Release 7.18.0.c58.3

Implemented the many-to-one normalizer. (ETROG-1961)
Deprecated many classes and methods that are for internal use only. (ETROG-2065)
Added BaseLinguisticsFactory#addUserCscDictionary. (ETROG-2098)
Removed obsolete big-endian models and dictionaries. (ETROG-2214)
Overhauled RBLCmd. ANNOTATE is the default command. -showTokenDetails, -showRawResults, and -verboseResults are removed. -inputJson interprets the input as an ADM. -outputJson is a boolean option. (ETROG-1392, ETROG-2343)
Decomposed compound verbs in Japanese when using alternativeTokenization. (ETROG-2350)
Introduced more advanced disambiguation for English and Spanish. (ETROG-2367, ETROG-2370, ETROG-2372, ETROG-2371, ETROG-2467)
Improved decompounding accuracy in Dutch. (ETROG-2408)
Added tokenization, lemmatization, and POS-tagging for emoticons and emoji in all languages. (ETROG-2474, ETROG-2512, ETROG-2516, ETROG-2520, ETROG-2522, ETROG-2538)
Supplemented analysis dictionaries for English and Spanish. (ETROG-2481, ETROG-2532, ETROG-2535)
Added support for Lucene/Solr 6.3. (ETROG-2501)
Introduced the ability to specify a user-defined reading dictionary in Lucene/Solr (userDefinedReadingDictionaryPath). (ETROG-2527)

Release 7.17.2.c58.3

Added support for Lucene/Solr 6.2. (ESPI-77)

Release 7.17.1.c58.3

Version of OSGi (internal use only) upgraded. (ETROG-2441)

Release 7.17.0.c58.2

The FST tokenizer now supports Romanian. (ETROG-2255)

All Chinese parts of speech are supported in CLA user dictionaries. (ETROG-2262)

Modest speed improvement in the disambiguation algorithm used to process the results of the Japanese (statistical), Korean, and Arabic tokenizers. (ETROG-2288)

To become file-system-agnostic, the use of Path in the API is now supported. (ETROG-2310, ETROG-2381)

Added the -outputJson option to RBLCmd to write the ADM as JSON to a file of choice. (ETROG-2332)

Reverted the fix for ETROG-2304, introduced in 7.16.0.c58.2. (ETROG-2376)

Release 7.16.0.c58.2

The Chinese script converter is an entitlement with a standard Chinese license. (ETROG-1605)
Arabic reh is normalized as a decimal separator in numeric contexts. (ETROG-1650)
Provide disambiguation of Dutch compounds. (ETROG-1736)
A custom reading dictionary can be specified on the RBLCmd command line. (ETROG-1938)
Alternative tokenization options are included in BaseLinguisticsOption. (ETROG-1946)
Improve speed by caching Arabic analyses. (ETROG-1992)
Added support for alternative Chinese segmentation. (ETROG-2034)
Return Hebrew sentence boundaries. (ETROG-2036)
Added support for POS tag mappings for alternative Japanese and Chinese segmentation. (ETROG-2152)
Changed CompoundDictionary to provide its components in an order that reflects the contents of the lemma it returns. (ETROG-2154)
AnalyzerFactory#addUserAnalysisDictionary now throws an informative exception when either the root or dictionary directory is invalid. (ETROG-2166)
Augmented RBLCmd with the ability to return the RBL-JE version number. (ETROG-2168)
Improve handling of hiragana tokens homophonous to verbs in the alternative Japanese tokenizer (JLA). (ETROG-2188)
Improve handling of POS-ambiguous verb stems in the alternative Japanese tokenizer (JLA). (ETROG-2189)
The RBLCmd help command now sorts its options alphabetically. (ETROG-2195)
Han readings are now returned for all Katakana tokens. (ETROG-2208)
In the Russian FST tokenizer, initials are tokenized and given the +Init morpho-tag. (ETROG-2209)
Memory requirements of the FST tokenizer were reduced. (ETROG-2200, ETROG-2226)
Reduce the memory allocated for tokens by the FST tokenizer. (ETROG-2235)
Terminated support for Lucene/Solr 4.1-4.2. Added support for Lucene/Solr 6.0-6.1. (ETROG-2016, ETROG-2241, ETROG-2299)

Release 7.15.0.c57.2

Note: 7.15.0 was forked directly from 7.14.0 and thus does not have the changes in 7.14.1+.

Introduced the ability to specify an alternative FST tokenizer. See TokenizerFactory.addCustomTokenizationFst. (ETROG-2231)

Release 7.14.0.c57.2

The specification of options to RBLCmd was refactored. (ETROG-1503)
Added UPT-16 support for Persian and Urdu. (ETROG-1830)
Changed UPT-16 mappings for Czech and Hungarian numbers. (ETROG-1841)
Removed incorrect analyses for Polish adjectives and participles ending in m/mi. (ETROG-1916)
Removed archaic Polish analyses containing "być". (ETROG-1917)
Added raw analyses for English contractions. (ETROG-1944)
The command line tool RBLCmd supports Hebrew tokenization. (ETROG-1973)
Added support for Finnish stemming. (ETROG-2012)
Removed the spurious generation of an accusative case analysis for some Polish nouns. (ETROG-2020)
The Hebrew tokenizer overzealously guessed that periods were part of an abbreviation. (ETROG-2024)
Refactored the position metadata for Lucene tokens of compound components. (ETROG-2042)
Lucene tokens for components of a contraction are identified with type "CONT". To invoke this functionality, set FilterOption.identifyContractionComponents to true. (ETROG-2044)
AnalysesAttributes are now formatted as JSON in Elasticsearch. (ETROG-2057)

Release 7.13.0.c56.6

Added API support for Lucene & Solr 5.0-5.3. (ETROG-1647)
Added support for Persian and Urdu. (ETROG-1636, ETROG-1667)
The 'nor' (Norwegian) language code is accepted. (ETROG-1690)
Exposed support for using the Rosette Annotated Data Model (ADM) to perform RBL-JE operations. (ETROG-1713)
The Arabic analysis candidate generation code now uses the same algorithm that the Arabic Language Processor in the native (C++) version of Rosette Base Linguistics does. (ETROG-1722)
Provided an alternative Japanese analyzer. This provides parity with the Japanese analyzer in Basis Technology's C++ API (RLP). It offers improved accuracy with query strings and names and provides greater user control of the analysis. (ETROG-1727)
For English, Portuguese, and German text, added ADM support for splitting contractions and analyzing the constituents. (ETROG-1769)
Provided support for returning the set of 16 universal part-of-speech (POS) tags rather than the set of 12 that were introduced in version 7.12.0. (ETROG-1771)
The RBLCmd tool now lists the BaseLinguisticsOption options. To use these options you must set analyzerType=none, lang, and BaseLinguisticsOption.language. (ETROG-1862)

Release 7.12.1.c56.6

Version 7.12.1.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis JVM SDK (e.g. RBL-JE, RLI-JE, REX-JE) in a single application, then choose versions that have the same compatibility number. (ETROG-1700)
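The compatibility check described above can be automated by comparing the part of each version string that follows ".c". This is a minimal sketch with hypothetical class and method names, and the second version string in the example is made up for illustration.

```java
final class CompatibilityNumberDemo {
    /** Returns the text after the last ".c" in a version such as "7.12.1.c56.6", or null if absent. */
    static String compatibilityNumber(String version) {
        int i = version.lastIndexOf(".c");
        return i < 0 ? null : version.substring(i + 2);
    }

    /** Returns true only if both versions carry the same compatibility number. */
    static boolean sameCompatibility(String a, String b) {
        String ca = compatibilityNumber(a);
        return ca != null && ca.equals(compatibilityNumber(b));
    }

    public static void main(String[] args) {
        System.out.println(compatibilityNumber("7.12.1.c56.6")); // 56.6
        // "7.10.0.c56.6" is an invented version of another SDK, used only as an example
        System.out.println(sameCompatibility("7.12.1.c56.6", "7.10.0.c56.6")); // true
    }
}
```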

Release 7.12.0

Moved the Tokenize and Analyze samples into samples/tokenize-analyze and created a single Ant build script to compile and run both samples. (ETROG-1264)
Provided support for returning universal part-of-speech (POS) tags rather than the language-specific POS tags we already return. The universal tags (UPT) are coarser than the language-specific tags, but enable tracking and comparison across languages. (ETROG-1472)
Added support for returning a disambiguated analysis for each token in Japanese text. For performance, this feature is turned off by default. (ETROG-1324)
Added support for returning morphological tags, where available, and placed an example illustrating the procedure for obtaining morphological tags in samples/morpho-tags. (ETROG-1485)
Removed the small number of dubious acronym expansions from the lemmatization of English, French, Italian, German, Spanish, and Portuguese input. (ETROG-1547)
Improved the German lemma parser, which now returns the same lemma for German nouns that differ only in gender. (ETROG-1548)
Added API support for Lucene & Solr 4.10. (ETROG-1571)

Release 2.4.0

Enhanced support for Korean linguistic analysis, and integrated a guesser for generating morphemes, morpheme tags, compound components, and parts of speech. (ETROG-1486, ETROG-1512, ETROG-1528)
Added support for Korean user lemma dictionaries. (ETROG-1518)
Added stop words to the Japanese analysis dictionary. (ETROG-1525)

Release 2.3.0

Added the Chinese Script Converter, which can convert tokens in Traditional Chinese text to Simplified Chinese and vice versa. (ETROG-1462)
Terminated support for Lucene/Solr 3.6. (ETROG-1298)
Implemented support for Chinese part-of-speech (POS) tags and readings. (ETROG-1280)
Added support for normalization of Chinese and Japanese numbers. (ETROG-1310)
Implemented generation of Korean part-of-speech (POS) tags. (ETROG-1357)

Release 2.2.2

Added a tool for building user dictionaries. (ETROG-210)
For those cases in which you want to use your own whitespace tokenizer and you are processing text that requires segmentation (such as Chinese, Japanese, or Thai), we have added support for a base linguistics segmentation token filter to be used after a whitespace tokenizer and before other filters, such as a base linguistics token filter. See the Javadoc for the RBL-JE API for Lucene 4.3-4.7. (ETROG-1240)
For Japanese, modified the base linguistics token filter to exclude lemmas for auxiliary verbs, particles, and adverbs from the token stream. (ETROG-1217)
Added support for using AnalysesAttribute to get the analyses and disambiguated analysis for each token in a token stream. (ETROG-1279)
Added SLF4J support for logging RBL-JE applications. (ETROG-1318)
Added support for turning case sensitivity on/off when analyzing text. (ETROG-1365)
Deprecated void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path) in favor of void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path, EnumMap<AnalyzerOption, String> options) where options is used to set AnalyzerOption.caseSensitive to "true" or "false".
Unused analyzer parameter removed from the BaseLinguisticsSegmentationTokenFilter constructor. (ETROG-1316)
Updated the Japanese normalization dictionary. (ETROG-1229)
Added API support and samples for Lucene 4.9. (ETROG-1446)

Release 2.2.0

Added a Lucene Analyzer that combines the RBL-JE Tokenizer and TokenFilter, along with the LowerCaseFilter, CJKWidth Filter, and optional support for the StopFilter: com.basistech.rosette.lucene.BaseLinguisticsAnalyzer. Added a Lucene 4.3-4.7 sample application that illustrates its use. (ETROG-1138, ETROG-1172)
Improved support for returning Japanese Hiragana readings. The API for adding readings to the token stream has moved from TokenizerFactory#SetOption to BaseLinguisticsTokenFilter#setAddReadings. You can also include ("addReadings", "true") to the map of options you use to instantiate the BaseLinguisticsAnalyzer. (ETROG-1054)

Release 2.1.0

Added support for Japanese Hiragana readings.
Factored in support for Lucene 3.6, 4.1-4.2, and 4.3.

Release 2.0.0

For this release, this product has been refactored and renamed to Rosette Base Linguistics Java Edition. This release concentrates on the core API instead of implementations for different versions of Lucene and Solr. This release returns part-of-speech tags for a core set of European languages and Japanese.

In place of a LemmatizerFactory, RBL-JE now provides an AnalyzerFactory. Use the AnalyzerFactory to generate a language-specific Analyzer that you can use to generate Analysis objects for each token produced by the Tokenizer.

Release 1.11.0

For licensing and business reasons, support for Bulgarian, Catalan, Estonian, Croatian, Indonesian, Latvian, Malay, Slovak, Slovenian, Serbian, Albanian, and Ukrainian has been removed from the RSE package. (ETROG-921)

Release 1.10.0

Added support for tokenizing and lemmatizing Arabic, Czech, Hungarian, Korean, and Turkish. (ETROG-876)

Release 1.9.0

Added support for segmenting (tokenizing) Thai. (ETROG-448)
Added a tokenizer option (turned off by default) for returning Hebrew roots. (ETROG-788)
Changed required Java platform from 1.5 to 1.6. (ETROG-765)
Added support for using RSE with LucidWorks Enterprise 1.7, which supports a pre-release version of Lucene and Solr 4.0.

Release 1.8.0

Added support for tokenizing and lemmatizing Albanian, Bulgarian, Catalan, Croatian, Estonian, Greek, Hebrew, Indonesian, Latvian, Malay, Polish, Serbian, Slovakian, Slovenian, Russian, and Ukrainian. (ETROG-656, 658, 668, 677)
Added a command line driver for running RSE. For usage details, see the Javadoc for com.basistech.rosette.bl.RBLCmd. (ETROG-603)
Added support for tokenizing and lemmatizing Norwegian Nynorsk text. (ETROG-637)
Consolidated support for Lucene 2.2, Lucene 2.4, Lucene 2.9, Lucene 3.1, Solr 1.3, Solr 1.4, and Solr 3.1 in a single SDK package with an associated documentation package.
Deprecated support in the com.basistech.rosette.breaks package (GenericTokenizer and TokenizerOption) for returning EOS (end-of-sentence) tokens. includeEOS is off by default and should not be turned on; it interferes with Lucene searches. (ETROG-706)
Deprecated Lucene 2.9 LemmaFilterFactory.supportedLanguages(). Use getSupportedLanguages(). (ETROG-726)

Release 1.7.1

Revised the SegmentationTokenizer to provide more consistent handling of punctuation during the tokenization of Chinese, Japanese, and Thai text. (ETROG-640)

Release 1.7.0

Added support for Lucene 3.0.
Improved support for Japanese and Chinese tokenization.
Added the Japanese lemmatization dictionary and support of Japanese lemma user dictionaries. The Japanese lemmatization dictionary also provides orthographic normalization in the case of Katakana spelling variants and input text with archaic Kanji.
Added the production of normalized numbers to the lemmatization process.
Added support for Chinese lemma user dictionaries. Apart from numbers, which are already handled by the lemma guesser, lemmas do not ordinarily apply to Chinese, but a lemma user dictionary may be used for orthographic normalization.

Release 1.6.0

Added support for Danish and Norwegian (Bokmål). Improved support for Chinese token segmentation and Romanian.

To enhance clarity and consistency, and to avoid duplication of package names in class names, made a number of API changes that are not backwards compatible.

Renamed some factory classes: (ETROG-436)

Old	New
`com.basistech.lucene.LuceneTokenizerFactory`	`com.basistech.lucene.TokenizerFactory`
`com.basistech.lucene.BaseLinguisticsTokenFilterFactory`	`com.basistech.lucene.LemmaFilterFactory`
`com.basistech.solr.BaseLinguisticsTokenizerFactory`	`com.basistech.solr.TokenizerFactory`
`com.basistech.solr.BaseLinguisticsTokenFilterFactory`	`com.basistech.solr.LemmaFilterFactory`

All these factory classes include a create() method for instantiating the Tokenizer or LemmaFilter. The getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods have been removed.

Promoted classes introduced in Release 1.5.beta.1 for setting tokenizer and lemmatizer options from inner Enums to top-level Enums: com.basistech.rosette.breaks.TokenizerOption and com.basistech.rosette.bl.LemmatizerOption. (ETROG-434)
Removed the TokenizerFactory, LemmaFilterFactory and LemmatizerFactory option-specific methods for setting options that predate the introduction of setOption().
The com.basistech.breaks.BreakerFactory methods for creating breakers have been renamed.

Old	New
`newScriptRegionBreaker()`	`createScriptRegionBreaker()`
`newBufferSentenceBreaker()`	`createBufferSentenceBreaker()`
`newBufferWordBreaker()`	`createBufferWordBreaker()`

Release 1.5-beta-1

Added support for Chinese, and limited support for Japanese. For these languages, RSE adds statistically trained models/dictionaries to enable the tokenization of non-whitespace-delimited text. Support for user dictionaries has also been expanded to include token dictionaries for Chinese, Japanese, and Thai.
Enhanced support for Dutch, Italian, and Portuguese.
Replaced Lucene 2.9 and Solr 1.4 packages with Lucene 3.0 package.
Revised the API for defining tokenizer and lemmatizer options.
Reorganized the documentation to reflect standard RSE usage patterns.

Release 1.4.2

To simplify usage in standard search applications, revised the RSE Tokenizer so that it does not put sentence-boundary tokens in the token stream unless you instruct the RSE TokenizerFactory to include them. (ETROG-312)

Release 1.4.1

Addressed a compatibility issue running RSE with RLP. To run RSE 1.4 and RLP 7.1 in the same process, follow the instructions in RSE Technical Note: Using RSE 1.4 and RLP 7.1 in a Single Solr Instance.

Compiling a Swedish User Dictionary. As described in the RSE Application Developer's Guide, you must use RLP to create a user dictionary. "Chapter 12. User-Defined Data" in the RLP Application Developer's Guide provides instructions on creating the source file for a user-defined dictionary and compiling the dictionary. The current release of RLP (RLP 7.1.0) does not include support for creating a Swedish user dictionary. To create a Swedish dictionary, you must add a file that we provide in the extras directory to the corresponding location in your RLP installation: rlp/bl1/dicts/sv/tags.txt.

When you create your source file, you can use [+DUMMY] as the POS tag for each entry.

The syntax for compiling a Swedish user dictionary from rlp/bl1/dicts/tools is

build_user_dict.sh sv input output

Release 1.4.0

Removed the Rosette Language Identifier (RLI) 100% Java implementation, which is now a separate product.
Provided separate SDK packages with support for Lucene 2.2, Lucene 2.4, and Lucene 3.0. (ETROG-198)
Added TokenizerFactory, which provides a language-specific Tokenizer for parsing input text. In addition to using the Sentence Breaker and Word Breaker, the Tokenizer normalizes the tokens (Unicode NFC normalization and lowercasing). (ETROG-185)
Added support for Swedish, including tokenization, lemmatization, and decompounding. (ETROG-201)
Added preliminary, limited support for Dutch, Danish, Norwegian, Italian, Portuguese, and Romanian.
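
The token normalization named in the TokenizerFactory entry above (Unicode NFC normalization and lowercasing) can be illustrated with standard JDK calls. This is only a sketch of the two operations, not RSE code: the class and method names are hypothetical, and lowercasing with `Locale.ROOT` is an assumption, since these notes do not say which locale rules the Tokenizer applies.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration; not part of the RSE API.
final class TokenNormalizationSketch {

    // Compose the token to Unicode NFC, then lowercase it.
    // Locale.ROOT is an assumption made for this sketch.
    static String normalize(String token) {
        String nfc = Normalizer.normalize(token, Normalizer.Form.NFC);
        return nfc.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Cafe" followed by U+0301 COMBINING ACUTE ACCENT: NFC composes
        // the final "e" and the accent into the single character U+00E9.
        String decomposed = "Cafe\u0301";
        String normalized = normalize(decomposed);
        System.out.println(decomposed.length() + " -> " + normalized.length());
    }
}
```

Normalizing before lowercasing means that canonically equivalent spellings of the same word produce the same token.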

Release 1.3-beta

Expanded support for German decompounding.
Added support for generating a separate lemma for each space-delimited element in lemmas that contain whitespace.
This distribution provides support for Lucene 2.2.
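
The effect of generating a separate lemma for each space-delimited element can be pictured with a few lines of plain Java. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the sample lemma is invented for illustration, and splitting on any run of whitespace is an assumption about what "space-delimited" covers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the RSE API.
final class LemmaSplitSketch {

    // Return one element per whitespace-delimited part of the lemma.
    static List<String> splitLemma(String lemma) {
        String trimmed = lemma.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // A two-word lemma yields two separate lemmas.
        System.out.println(splitLemma("ice cream"));
    }
}
```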

Release 1.2.0

Revised Java package names to avoid potential collisions with the RLP JNI-supported Java API.

Release 1.1.0

Upgraded Token Filter Factory support from Lucene 2.2 to Lucene 2.4.
Added the Rosette Language Identifier (RLI), Sentence Breaker, and Word Breaker.

Release 1.0.0

Introduced support for the creation of Lucene 2.2 Base Linguistics token filters for English, French, German, and Spanish text.

Bugs Fixed

Bugs fixed in 7.27.2.c60.0

Bug #	Description
ETROG-2981	The Persian lemmatizer did not add lemmas to the first analyses of many tokens, especially verbs.
ETROG-2983	The lemmas of hyphenated Russian compound words now have both pieces lemmatized, not just the final piece. For example, “человека-волка” is lemmatized to “человек-волк”, whereas it was lemmatized to “человека-волк” in previous versions.
ETROG-2984	After some sequences of 4096 characters, containing mostly white space and at most one token, if there is no token or the token contains the last character of the sequence, any following tokens have incorrect original offsets.

Bugs fixed in 7.27.1.c60.0

Bug #	Description
ETROG-2919	Capitalized words in English are less likely to automatically get the part of speech PROP.
ETROG-2927	The English word "people" and its derivatives have the lemma candidate "person". They are no longer analyzed as plural nouns with lemmas equal to their surface forms.
ETROG-2969	Ambiguous English words like “second” and “lower” are less likely to be disambiguated as verbs when they should be ordinal numbers and comparative adjectives.

Bugs fixed in 7.26.6.c60.0

Bug #	Description
ETROG-2958	Analyzing Dutch writes a cache file to disk, which fails if the file is not writable.

Bugs fixed in 7.26.5.c59.3

Bug #	Description
ETROG-2904, ETROG-2910, ETROG-2911	Improved the lemma and part of speech accuracy of the Spanish disambiguator.
ETROG-2921	The parts of speech of the German acronyms “MAN” and “MIT” fall back to the parts of speech of the unrelated words “man” and “mit”. They should be NOUN.

Bugs fixed in 7.26.4.c60.0

Bug #	Description
ETROG-2904, ETROG-2910, ETROG-2911	Improved the lemma and part of speech accuracy of the Spanish disambiguator.
ETROG-2914	RBL-JE depended on Guava 18.0.0, which has a security vulnerability (CVE-2018-10237). Now it depends on Guava 26.0-jre.

Bugs fixed in 7.26.3.c59.3

Bug #	Description
ETROG-2835	Dutch compound nouns ending with “-ronde” can be analyzed as adjectives.
ETROG-2864	The disambiguator for Dutch non-compound words only considers parts of speech. If a token has multiple analyses with the same part of speech, the disambiguator picks one arbitrarily.
ETROG-2898	U+180E MONGOLIAN VOWEL SEPARATOR is not treated as a token separator in Hebrew.

Bugs fixed in 7.26.2.c59.3

Bug #	Description
ETROG-2889	Single-letter Spanish conjunctions sometimes get the POS tag ITEM instead of CONJ.

Bugs fixed in 7.26.1.c59.3

Bug #	Description
ETROG-2888	Norwegian lemmas for proper nouns are often converted to lowercase.

Bugs fixed in 7.26.0.c59.3

Bug #	Description
ETROG-2876	The application developer’s guide’s feature set table in section 1.3 erroneously claims that sentence boundary detection is not supported in Hebrew.
ETROG-2877	The Japanese character “々” is unrecognized and splits tokens.

Bugs fixed in 7.25.0.c59.3

Bug #	Description
ETROG-2791	In Catalan where an apostrophe, in some contexts, marks a token boundary, the token boundary is omitted.
ETROG-2857	Punctuation was not always separated from preceding characters where it should have been when `fstTokenize` is enabled.
ETROG-2860	The Hebrew part of speech tag wPrefix is not converted to UPT-16 when `universalPosTags` is enabled.
ETROG-2861	The documentation lists AUXV as the part of speech tag for Japanese auxiliary verbs, when it is actually AUXVB.

Bugs fixed in 7.24.6.c59.2

Bug #	Description
ETROG-2829	Some components of German compound words are incorrect when the surface form of the component could be either a noun or a verb.
ETROG-2831	Hebrew part of speech tags are not converted to UPT-16 when `universalPosTags` is enabled.
ETROG-2834	JLA tokenizer sometimes truncates katakana tokens after non-katakana tokens.
ETROG-2844	Email addresses and URLs may contain control or whitespace characters.
ETROG-2847	Hebrew tokens can contain control characters or nothing but default ignorable characters.
ETROG-2848	Chinese, Japanese, and Thai tokens may contain control characters.

Bugs fixed in 7.24.5.c59.2

Bug #	Description
ETROG-2781	Incorrect analysis may be selected for Dutch non-compound words.
ETROG-2804	Sentence-final @mentions, email addresses, emoji, emoticons, hashtags, and URLs are not marked as sentence-final.
ETROG-2820	Tokenizing German with `fstTokenize` enabled can drop tokens.
ETROG-2823	Hebrew prefixes are not exposed correctly.

Bugs fixed in 7.24.4.c59.2

Bug #	Description
COMN-234	The Woodstox dependency is not shaded.

Bugs fixed in 7.24.2.c59.2

Bug #	Description
ETROG-2814	Running TensorFlow leaks memory.

Bugs fixed in 7.24.1.c59.2

Bug #	Description
ETROG-2808	The Hebrew disambiguation model is not cached, potentially leading to high memory pressure.

Bugs fixed in 7.24.0.c59.2

Bug #	Description
ETROG-2782	Tokens can be empty or consist of nothing but control characters and white space.

Bugs fixed in 7.23.3.c59.0

Bug #	Description
ETROG-2744	Tokenize-analyze sample emits unknown POS tags for Hebrew
ETROG-2759	Ending a Lucene token stream in Chinese or Japanese with `alternativeTokenization` enabled throws an exception if none of the token stream's tokens have been consumed.
ETROG-2760	In languages like French and Italian, an apostrophe was parsed as its own token when directly followed by a digit.
ETROG-2767	Overlapping tokens, which are valid, are discarded. This particularly affects hyphenated tokens in French when `fstTokenize` is enabled.
ETROG-2776	German all-caps words are assumed to be acronyms without considering the possibility that they are simply emphasized.
ETROG-2777	The application developer’s guide claims support for Java 9.
ETROG-2778	The application developer’s guide references btcommon-api-37.1.3.jar instead of btcommon-api-36.1.3.jar.
ETROG-2779	The German analysis cache returns analyses without taking the full context into account, leading to unpredictable analyses for unknown words.
ETROG-2780	English all-caps words are assumed to be proper nouns, though all-caps may simply denote emphasis.
ETROG-2790	The Application Developer's Guide does not mention support for Hebrew part of speech tagging in the Feature Set table.

Bugs fixed in 7.23.2.c59.0

Bug #	Description
ESPI-110	Disambiguating Hebrew tokens throws an `IllegalArgumentException` when a Java security manager is enabled.

Bugs fixed in 7.23.0.c59.0

Bug #	Description
ETROG-2710	Some tokens consisting of numbers and Latin letters, such as serial codes, are decompounded into multiple morphemes in Korean.
ETROG-2716	“интернет”, the Russian word for "internet" is not in the lexicon although “Интернет” is there.

Bugs fixed in 7.22.2.c59.0

Bug #	Description
ETROG-2738, ETROG-2754	German words are assigned parts of speech without taking capitalization into account, leading the disambiguator to often pick the wrong analysis.
ETROG-2739	The annotated lemmas for German definite articles were inconsistent.
ETROG-2740	Some symbols and punctuation are not tokenized as separate tokens in German when `fstTokenize` is enabled.
ETROG-2745	In languages like French and Italian where an apostrophe, in some contexts, marks a token boundary, the token boundary is omitted if the following token contains a digit.

Bugs fixed in 7.22.1.c59.0

Bug #	Description
ETROG-2687	`BaseLinguisticsFactory#addUserSegDictionary`, `TokenizerFactory#addUserDefinedDictionary`, `TokenizerFactory#addCustomTokenizationFst`, `TokenizerFactory#create`, and the constructor of `BaseLinguisticsSegmentationTokenFilter` do not convert language codes to their canonical forms, such as `zhs` (simplified Chinese) to `zho` (Chinese).
ETROG-2703	Many German words ending with "teuer" get decompounded incorrectly. Several German compound words don't get decompounded at all.

Bugs fixed in 7.22.0.c59.0

Bug #	Description
ETROG-2383	Korean tokens sometimes include trailing ASCII periods.
ETROG-2685	URL tokens in Chinese, Japanese, and Thai run on into the following tokens, without inserting a token boundary.
ETROG-2686	"Dep." is not recognized as a Portuguese abbreviation, causing a sentence break.

Bugs fixed in 7.21.3.c59.0

Bug #	Description
ETROG-2680	Single quote in English sometimes incorrectly analyzed as possessive when found following whitespace.

Bugs fixed in 7.21.0.c58.3

Bug #	Description
ETROG-2476	Lemma for a German compound word may be wrong if the surface form has characters that were normalized in the compound components.
ETROG-2589	The non-English FST tokenizers do not properly split text into tokens in which text immediately follows colons or commas, such as ":,test".
ETROG-2622	Passing a long script file (over 100,000 rows) to RBLCmd with output paths specified in the third column can cause a “Too many open files” error.
ETROG-2661	Tokenizing a string of Katakana throws an `ArrayIndexOutOfBoundsException` if an unknown word immediately follows a word from a user dictionary and both `alternativeTokenization` and `favorUserDictionary` are `true` and the language is Japanese.

Bugs fixed in 7.20.4.c58.3

Bug #	Description
ETROG-2638	Number tokens that are hundreds of characters long can cause `StackOverflowError`s in languages that use spaces as thousands separators.

Bugs fixed in 7.20.3.c58.3

Bug #	Description
ETROG-2632	A tiny number of Hebrew surface forms (e.g. "ג`ינג`ר") cause `NullPointerException`s.

Bugs fixed in 7.20.2.c58.3

Bug #	Description
ETROG-2615	Hebrew lemmas are not exposed in Lucene/Solr.

Bugs fixed in 7.20.1.c58.3

Bug #	Description
ETROG-2584	Emoticon detection has some false positives.

Bugs fixed in 7.20.0.c58.3

Bug #	Description
ETROG-2338	Annotating an `AnnotatedText` which has an empty list of sentences throws an exception.
ETROG-2574	The English FST tokenizer does not properly split text into tokens in which text immediately follows colons or commas, such as ":,test".

Bugs fixed in 7.19.0.c58.3

Bug #	Description
ETROG-710	The Hebrew tokenizer throws an exception or produces incorrect results for tokens that begin with "prefix=".
ETROG-995	The Hebrew tokenizer could throw an exception or produce incorrect results for some inputs involving backslashes or multiple tokens with identical surface forms.
ETROG-2531	Creating Chinese and Japanese tokenizers with `alternativeTokenization` using a `TokenizerFactory` is not thread-safe.
ETROG-2552	Analyzing a zero-length English token throws a `StringIndexOutOfBoundsException`.
ETROG-2560	`BaseLinguisticsTokenFilter` can overwrite analyses generated by `BaseLinguisticsTokenizer`, causing problems for emoticon detection.
ETROG-2563	The Hebrew tokenizer does not split on non-ASCII white space characters.
ETROG-2565	Requesting UPT-16 POS tags for a language for which POS tags are not supported throws an exception.

Bugs fixed in 7.18.0.c58.3

Bug #	Description
ETROG-1689	The lemmas of Russian compound words contain braces.
ETROG-2360	RBLCmd reports 0 bytes/char when reading from standard input.
ETROG-2409	`customPosTagsUri` is parsed as a file name instead of a URI.
ETROG-2243	All acronyms are tagged as proper nouns in English.
ETROG-2536, ETROG-2546	Setting `alternativeTokenization` to `true`, `consistentLatinSegmentation` to `false`, and `favorUserDictionary` to `true` and specifying a user dictionary can cause an infinite loop when processing Japanese.
ETROG-2543	`HebrewTokenizer#setReader` does not reset enough state, so the tokenizer cannot be reused.
ETROG-2547	`HebrewTokenizer` throws an `ArrayIndexOutOfBoundsException` for inputs with unusually long sentences.
ETROG-2548	SLF4J binding jars shipped in lib/. To avoid classpath conflicts they have been moved to tools/lib/.

Bugs fixed in 7.17.0.c58.2

Bug #	Description
ETROG-2252	Word breaking with `fstTokenize` can fail for initials in Czech.

Bugs fixed in 7.16.1.c58.2

Bug #	Description
ETROG-2356	Token can be missed by GenericTokenizer. (This augments the fix for ETROG-2292 made in 7.16.0.c58.2.)

Bugs fixed in 7.16.0.c58.2

Bug #	Description
ETROG-676	ZWNBSP was not treated as whitespace. With this fix, the word breaker treats 0x2060 and 0xFEFF as whitespace.
ETROG-1100	Incorrect tokenization of '0901d97c80103109' in Hebrew.
ETROG-1638	Fixed English business and place name acronym segmentation regressions.
ETROG-1640	SingleLanguageAnnotator does not provide a proper analysis if the language is specified with the legacy code zht or zhs.
ETROG-1655	ADMs contain multiple tokens (instead of multiple analyses) for the same Hebrew word.
ETROG-1766	Processing Spanish with sequences of the dash character can exhaust memory.
ETROG-1996	Period erroneously attached to terminal token
ETROG-2001	`BaseLinguisticsFactory#createSingleLanguageAnnotator` throws a `NullPointerException`.
ETROG-2038	+int_noun, +int_adj, etc. can inadvertently be returned as a POS tag.
ETROG-2088	POS tag and Contraction annotators failed for English uppercase.
ETROG-2112	ConcurrentModificationException crash in alternative Japanese tokenization (JLA).
ETROG-2114	The double exclamation mark (\u203C, ‼) was not treated as a separate token and could be agglutinated to an adjacent word.
ETROG-2125	IndexOutOfBoundsException can be thrown when analyzing Korean.
ETROG-2136	Some English words ending in "ss" are erroneously given lemmas ending in 's'
ETROG-2172	Korean lemmas for tokens with mixed numerals and letters have incomplete text.
ETROG-2191	The return value of BaseLinguisticsFactory.featuresForLanguage(LanguageCode.SIMPLIFIED_CHINESE) was missing CSCANALYSIS.
ETROG-2192	Compound nouns should allow numerals as the first component.
ETROG-2194	The English word 'metres' is not lemmatized correctly.
ETROG-2251	Polish ‘Ł.’ is not treated as an initial.
ETROG-2253	Word breaking in FST Tokenizer can fail for initials in Greek.
ETROG-2254	Word breaking in FST Tokenizer can fail for initials in Hungarian.
ETROG-2268	The order in which options are specified in a `BaseLinguisticsFactory` may negate the use of a user-defined dictionary.
ETROG-2269	Inflected forms of the English verb 'mentor' not lemmatized.
ETROG-2283	Morpheme information is lost if an annotator is created for Korean and `BaseLinguisticsOption.universalPosTags` is set to true.
ETROG-2292	Processing text with too many out-of-vocabulary characters may fail.
ETROG-2297	Setting ExtendedTags to false does nothing.
ETROG-2304	Some pairs of words have each other as disambiguated lemmas. NB: This fix was reverted in 7.17.0.c58.2 as experience showed some unacceptable disambiguation regressions.
ETROG-2328	Setting XlaOption separatePlaceNameFromSuffix to false does nothing.
ETROG-2333	The alternative Japanese segmenter (JLA) often mishandled もの, ような, and とおり.
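
The ETROG-676 entry in the table above says the word breaker now treats 0x2060 and 0xFEFF as whitespace. The JDK's own `Character.isWhitespace` does not report either code point as whitespace, so a check along the following lines is one way to picture the change. This is a sketch only; the class and method names are hypothetical and do not come from RBL-JE.

```java
// Hypothetical illustration; not part of the RBL-JE API.
final class WordJoinerWhitespaceSketch {

    static final int WORD_JOINER = 0x2060;
    static final int ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;

    // Treat the two code points named in ETROG-676 as whitespace,
    // in addition to whatever the JDK already considers whitespace.
    static boolean isBreakWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
                || codePoint == WORD_JOINER
                || codePoint == ZERO_WIDTH_NO_BREAK_SPACE;
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(ZERO_WIDTH_NO_BREAK_SPACE));
        System.out.println(isBreakWhitespace(ZERO_WIDTH_NO_BREAK_SPACE));
    }
}
```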

Bugs fixed in 7.14.2.c57.2

Bug #	Description
ETROG-2013	Error in SBN reader cache key equality function can induce out of memory errors.

Bugs fixed in 7.14.1.c57.2

Bug #	Description
ETROG-2200	Use of the fstTokenize option can exhaust memory.

Bugs fixed in 7.14.0.c57.2

Bug #	Description
ETROG-1681	Analysis results not produced for legacy `uen` language code.
ETROG-1925	Fixed UPT-16 mappings for proper nouns in Czech, Dutch, French, German, and Polish.
ETROG-1933	U+0022 is no longer removed during Urdu normalization.
ETROG-1945	Fixed errors when mapping parts of speech.
ETROG-1994	Attempting to use the `alternativeJapaneseTokenization` option in Solr caused a crash.
ETROG-2018	In some cases involving digits, the Hebrew tokenizer truncated part of the word.
ETROG-2047, ETROG-2059	Analyses of some inflected forms of Polish nouns were incorrect.

Bugs fixed in 7.13.0.c56.6

Bug #	Description
ETROG-1567	Fixed cases in which components for some Danish, Norwegian, and Swedish compound words are returned out of order.
ETROG-1612	Ensure that the order of lemmas retrieved from the morpho cache matches that from the FSTs.
ETROG-1618	RBLCmd now gracefully handles a leading BOM in a UTF-8 file.
ETROG-1652	Crash on the null character (U+0000)
ETROG-1687	Arabic prefix lengths sometimes miscalculated.
ETROG-1703	Include Semitic roots in Arabic analysis results from an Annotator.
ETROG-1761	Given language 'unknown', RBL-JE can produce a token with a whitespace in the middle.

Bugs fixed in 7.12.1.c56.6

Bug #	Description
ETROG-1628	Fixed errors in the production of Portuguese and Russian lemmas.
ETROG-1663, ETROG-1665	Corrected German regressions relative to the RBL native product (RLP).
ETROG-1671	Removed RBLCmd log4j warnings when AnalyzerOption.disambiguate = true

Bugs fixed in 7.12.0

Bug #	Description
ETROG-1432	Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters. This completes the fix that we made for version 2.3.0 (ETROG-647).
ETROG-1552	Fixed a `BaseLinguisticsTokenFilter` error identifying compound components when a non-disambiguating analysis is performed.
ETROG-1563	Fixed an error processing Korean text in which the analyzer produces a `^KrName` tag for a token.

Bugs Fixed in 2.4.0

Bug #	Description
ETROG-1554	Rebuilt big-endian Chinese analysis dictionary.

Bugs Fixed in 2.3.0

Bug #	Description
ETROG-647	Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters.

Bugs Fixed in 2.2.2

Bug #	Description
ETROG-1295	Fixed a NullPointerException when attempting to process Korean text.
ETROG-1271	Fixed reported errors in the English lemma dictionary: noun plurals ending in "es" with a lemma ending in "is" (the lemma of "emphases" is "emphasis"); "singing" (the lemma is "sing", not "singe"); and "leaves" as a verb (the lemma is "leave", not "leaf").
ETROG-1387	Stopped returning guessed POS tags for languages for which POS tags are not supported.
ETROG-1311	Corrected an error tokenizing strings containing certain Unicode Supplementary (non-BMP) characters.
ETROG-1300	Fixed `AnalyzerOption.query`.
ETROG-1226	Fixed occasional duplication of linguistic lookup results.

Bugs Fixed in 2.2.1

Bug #	Description
ETROG-1261	Concurrency violation in RBL-JE user defined dictionaries

Bugs Fixed in 1.10.1

Bug #	Description
ETROG-916	Eliminated the `RuntimeException` that was thrown when RSE attempted to handle a very long token. RSE now splits extremely long tokens to fit in the token processing buffer.
ETROG-917	Fixed bug that produced incorrect candidate lemmas for Korean text. The correct lemma candidate generator is now being used.
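
The ETROG-916 entry above describes splitting extremely long tokens so that each piece fits in the token processing buffer. The sketch below shows one way such a split could work. It is not the RSE implementation: the class name, the method, and the `maxChars` parameter are hypothetical, and the actual buffer size is not stated in these notes. The only extra care taken is to avoid cutting a UTF-16 surrogate pair in half.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the RSE API.
final class LongTokenSplitSketch {

    // Split a token into pieces of at most maxChars UTF-16 code units.
    static List<String> split(String token, int maxChars) {
        if (maxChars < 2) {
            throw new IllegalArgumentException("maxChars must be at least 2");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < token.length()) {
            int end = Math.min(start + maxChars, token.length());
            // Back up one unit rather than separate a surrogate pair.
            if (end < token.length()
                    && Character.isHighSurrogate(token.charAt(end - 1))
                    && Character.isLowSurrogate(token.charAt(end))) {
                end--;
            }
            pieces.add(token.substring(start, end));
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("abcdefg", 3));
    }
}
```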

Bugs Fixed in 1.7.1

Bug #	Description
ETROG-591	Fixed buffer management error using a token user dictionary to tokenize components in a long sequence of Chinese or Japanese tokens with no sentence boundaries.
ETROG-588	Enabled the use of `LanguageCode.SIMPLIFIED_CHINESE` (`zhs`) or `LanguageCode.TRADITIONAL_CHINESE` (`zht`) when you load a Chinese token user dictionary. RSE maps these language codes to `LanguageCode.CHINESE` (`zho`).
ETROG-596, ETROG-629	Improved the handling of text that is not Hanzi (Kanji), Hiragana, or Katakana in Chinese and Japanese token user dictionary lookups. For most consistent performance, we recommend that you only include Hanzi (Kanji), Hiragana and Katakana characters in token user dictionary entries.
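
The mapping described in the ETROG-588 entry above (`zhs` and `zht` both resolve to `zho`) amounts to a small lookup applied before a dictionary is loaded. The following is a minimal sketch of that idea using plain strings. The class and method names are hypothetical, and the real product works with `LanguageCode` values rather than raw strings.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of the RSE API.
final class LanguageCodeCanonicalizerSketch {

    // Only the two mappings stated in these release notes are listed.
    private static final Map<String, String> LEGACY_TO_CANONICAL = Map.of(
            "zhs", "zho",
            "zht", "zho");

    // Return the canonical code, or the input code if no mapping applies.
    static String canonicalize(String code) {
        String key = code.toLowerCase(Locale.ROOT);
        return LEGACY_TO_CANONICAL.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("zhs"));
        System.out.println(canonicalize("zht"));
    }
}
```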

Bugs Fixed in 1.6.0

Bug #	Description
ETROG-486, ETROG-495	Addressed overgeneration of Swedish compound components for unknown words. Applied similar refactoring to Danish and Norwegian.

Bugs Fixed in 1.4.3

Bug #	Description
ETROG-319	Speeded up the RSE Tokenizer and LuceneTokenizer by eliminating unnecessary reinitialization.

Bugs Fixed in 1.4.2

Bug #	Description
ETROG-316	Avoided a heap overflow by revising the RSE LuceneTokenizer to gracefully handle multiple `next()` calls from Solr after the tokenizer has reached the end of the token stream.

Bugs Fixed in 1.4.1

Bug #	Description
ETROG-191	Worked around an out-of-memory error processing very long compounds. See Known Problems in 1.4.1.

Bugs Fixed in 1.4.0

Bug #	Description
ETROG-142	Corrected out-of-memory error processing very long German words.
ETROG-88	Fixed array out of bounds that occurred processing some multi-sentence input.
ETROG-182	Adjusted word breaker to avoid returning empty elements at end of the input text being processed.

Third-Party Components

For a list of third-party components that are used in Basis Technology products, see ThirdPartyLicenses.txt.

Third-party component updates in 7.27.1.c60.0

Component	Version	Change
annoy-java	0.2.5	New

Third-party component updates in 7.26.4.c60.0

Component	Version	Change
Google Guava	26.0-jre	Version upgrade

Third-party component updates in 7.25.0.c59.3

Component	Version	Change
Jackson Annotations	2.9.6	Version upgrade
Jackson Core	2.9.6	Version upgrade
Jackson Databind	2.9.6	Version upgrade
Jackson Dataformat XML	2.9.6	Version upgrade
Jackson Dataformat YAML	2.9.6	Version upgrade
Jackson Datatype Guava	2.9.6	Version upgrade
Jackson Module JAXB Annotations	2.9.6	Version upgrade

Third-party component updates in 7.24.6.c59.2

Component	Version	Change
Woodstox	4.0.5	Version downgrade

Third-party component updates in 7.24.0.c59.2

Component	Version	Change
Google Guava	18.0	Version upgrade
Jackson Annotations	2.9.4	Version upgrade
Jackson Core	2.9.4	Version upgrade
Jackson Databind	2.9.4	Version upgrade
Jackson Dataformat XML	2.9.4	Version upgrade
Jackson Dataformat YAML	2.9.4	Version upgrade
Jackson Datatype Guava	2.9.4	Version upgrade
Jackson Module JAXB Annotations	2.9.4	Version upgrade
SnakeYAML	1.18	Version upgrade
TensorFlow for Java	1.5.0	Version upgrade
Woodstox	5.0.3	New

Third-party component updates in 7.23.0.c59.0

Component	Version	Change
Auto Common Libraries	0.3	New
AutoService	1.0-tc3	New
Commons CLI	1.2	New
Easy Plugins	0.2.2	New
Jackson Datatype Guava	2.7.3	New
JavaPoet	1.9.0	New
Metrics Core	3.2.3	New
Protocol Buffers	3.3.1	New
TensorFlow	1.3.0	New

Third-party component updates in 7.21.1.c59.0

Component	Version	Change
ICU4J	59.1	Version upgrade

Third-party component updates in 7.18.0.c58.3

Component	Version	Change
fastutil	6.6.1	Version upgrade
ICU4J	58.1	Version upgrade
Jackson Annotations, Core, Databind, Dataformat XML, Dataformat YAML, Module JAXB Annotations	2.7.3	Version upgrade
Jackson Dataformat Smile	2.7.3	New

Third-party component updates in 7.16.0.c58.2

Component	Version	Change
args4j	2.32	Version upgrade
fastutil	6.6.0	Version upgrade
opencsv		Removed

Third-party component updates in 7.14.0.c57.2

Component	Version	Change
Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations	2.6.2	Version upgrade
Jackson DataFormat YAML	2.6.2	Version upgrade
SnakeYAML	1.15	New
Snowball (no version or release info available; copied 2015-11-30)		New
args4j	2.3.2	Version added

Third-party component updates in 7.13.0.c56.6

Component	Version	Change
ICU4J	55.1	Version upgrade
Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations	2.4.4	Version upgrade
Jackson DataFormat YAML	2.4.4	New

Known Problems

Known Problems in 2.x

If disambiguate is set to false, or if no disambiguator for the language exists, BaseLinguisticsTokenFilter does not set the type correctly for compound components when adding them to the token stream. It marks compound components as <LEMMA> instead of <COMP> when a non-disambiguating analysis is performed. (ETROG-1552)

Known Problems in 1.8.0

The prefixes and suffixes that the RSE tokenizer returns for Hebrew may include punctuation attached to the underlying tokens, such as parentheses (prefix, suffix) and comma (suffix). Accordingly, prefixes and suffixes are assigned a Token PositionIncrement of 1. A multicharacter prefix or suffix may be reported as a sequence of one-character prefixes or suffixes. (ETROG-697)

Known Problems in 1.7.0

If you use LanguageCode.SIMPLIFIED_CHINESE (zhs) or LanguageCode.TRADITIONAL_CHINESE (zht) when you load a Chinese token user dictionary, the dictionary is not loaded. You must use LanguageCode.CHINESE (zho) to designate the language code for a Chinese token user dictionary. (ETROG-588)

Known Problems in 1.4.1

To avoid a potential out-of-memory error, RSE does not attempt to decompound words longer than 30 characters. For languages with support for decompounding, if a word is longer than 30 characters and is not found in a user dictionary or the standard dictionary, RSE classifies the word as a guessed lemma. (ETROG-191)

Known Problems in 1.4.0

Inconsistent handling of numbers and punctuation during lemmatization. (ETROG-266)
RSE expects valid Unicode strings as input. If the input includes illegal Unicode sequences, such as un-paired UTF-16 surrogate characters, the behavior is undefined. (ETROG-284)
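
Because behavior is undefined for input with unpaired UTF-16 surrogates (ETROG-284), an application may want to screen text before passing it to RSE. The check below is a sketch of such a pre-filter using only JDK methods. The class and method names are hypothetical and RSE does not provide them.

```java
// Hypothetical illustration; not part of the RSE API.
final class SurrogateCheckSketch {

    // Return true if the text contains a high surrogate that is not
    // followed by a low surrogate, or a low surrogate with no high
    // surrogate before it.
    static boolean hasUnpairedSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String valid = new String(Character.toChars(0x1F600));
        String broken = valid.substring(0, 1); // high surrogate only
        System.out.println(hasUnpairedSurrogate(valid));
        System.out.println(hasUnpairedSurrogate(broken));
    }
}
```

What to do with text that fails the check (reject it, or replace the stray code unit with U+FFFD) is an application decision that these notes do not address.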

Known Problems in 1.3-beta and 1.4.x

Incorrect capitalization in some lemmas, including some German compounds (e.g., unAbhängigkeit).
Incorrect lemma formation of some words with suffixes (e.g., Brötchen).
Over-generation of German compound components (e.g., übergreifen, über, and greifen as separate components).
Failure to recognize some extended written-out German numbers (e.g., zweitausendzwölf).