Release Notes

Base Linguistics (RBL)

Release 7.47.8.c78.0

June 2025

New

  • Solr and Lucene support: Solr 9.8.1 is now supported. (ETROG-3714)

Bug Fixes

  • We fixed a bug where long words inside long spans of text without punctuation would sometimes be tokenized incorrectly. (ETROG-3716)

Third-party component updates

Table 54. Updated

Package            Old Version  New Version
Apache Commons IO  2.18.0       2.19.0
Guava              33.4.0-jre   33.4.8-jre
Jackson            2.18.2       2.19.0
JUnit              5.11.4       5.12.2
JUnit Platform     1.11.4       1.12.2
Protocol Buffers   4.29.3       4.30.2
SnakeYAML          2.3          2.4



Table 55. Added

Package                                                Version  License
Guava InternalFutureFailureAccess and InternalFutures  1.0.3    Apache 2.0
JSpecify                                               1.0.0    Apache 2.0



Release 7.47.7.c77.0

March 2025

New

  • Solr and Lucene support: Solr 8.11.4 and Lucene 8.11.4 are now supported. (ETROG-3703)

  • Solr and Lucene support: Solr 9.8.0 and Lucene 10.1.0 are now supported. (ETROG-3704)

Third-party component updates

Table 56. Updated

Package                  Old Version  New Version
@API Guardian            1.1.0        1.1.2
Apache Commons IO        2.17.0       2.18.0
Apache Log4j             2.24.1       2.24.3
Guava                    33.3.1-jre   33.4.0-jre
Jackson                  2.17.2       2.18.2
Jakarta Annotations API  1.3.3        3.0.0
JavaCPP                  1.5.10       1.5.11
JUnit                    5.7.0        5.11.4
JUnit Platform           1.7.0        1.11.4
OpenTest4J               1.2.0        1.3.0
Protocol Buffers         3.25.5       4.29.3
Metrics Core             4.2.28       4.2.30



Table 57. Added

Package                                Version  License
Java Architecture for XML Binding      2.2.12   CDDL 1.1 & GPL 2 + CE



Release 7.47.6.c76.0

November 2024

New

  • Solr and Lucene support: Lucene 9.11.1 and Solr 9.7.0 are now supported. (ETROG-3695)

  • Java 21 support: Java 21 is now supported. Java 11 and 17 are still supported. (ETROG-3698)

  • Unicode update: Unicode 16.0 is now supported. (ETROG-3694)

Third-party component updates

Table 58. Updated

Package              Old Version  New Version
Apache Commons IO    2.16.1       2.17.0
Apache Commons Lang  3.16.0       3.17.0
Apache Log4j         2.23.1       2.24.1
fastutil             8.5.14       8.5.15
Guava                33.3.0-jre   33.3.1-jre
JavaCPP              1.5.8        1.5.10
Protocol Buffers     3.25.3       3.25.5
SnakeYAML            2.2          2.3
Woodstox             7.0.0        7.1.0



Release 7.47.5.c74.0

September 2024

Bug Fixes

  • We fixed a bug where long compound words (70-100 characters) in Danish, Norwegian (Bokmål and Nynorsk), and Swedish could cause an OutOfMemoryError. (ETROG-3696)

Release 7.47.4.c75.0

September 2024

New

  • Neural model support improved: We upgraded TensorFlow Java to version 1.0.0-rc.1, which adds support for macOS ARM64. Neural models are now supported for macOS ARM64. (ETROG-3554)

Bug Fixes

  • Trailing decimal points in Chinese are no longer treated as part of a decimal fraction. When “点” or “點” ends a number, it is no longer segmented as part of the number. (ETROG-3680)

Third-party component updates

Table 59. Updated

Package              Old Version  New Version
Apache Commons Lang  3.14.0       3.16.0
fastutil             8.5.13       8.5.14
Guava                33.2.0-jre   33.3.0-jre
Jackson              2.17.1       2.17.2
TensorFlow for Java  0.3.3        1.0.0-rc.1
Woodstox             6.6.2        7.0.0



Release 7.47.3.c74.0

June 2024

New

  • Solr and Lucene support: Lucene 9.10 and Solr 9.6 are now supported. (ETROG-3683)

  • CLA lexicon: Two terms have been added to the CLA lexicon: 喷码机 ("inkjet printer") and 管理器 ("manager", in the context of software). (ETROG-3678)

  • Improved Readings: Readings are now returned for numeric words in Chinese and Japanese when tokenizerType is set to SPACELESS_LEXICAL. (ETROG-3684)

Bug Fixes

  • When tokenizerType is set to SPACELESS_LEXICAL, Japanese tokens for verbs in lemma form have had their readings fixed to cover the entire token. (ETROG-3640)

    Example: Input: 食べる

    • Previous Reading:

    • Current Reading: たべる

  • When tokenizerType is set to SPACELESS_LEXICAL, Japanese lemmatization has been corrected for numeric tokens containing both decimal points and multiplier characters.

    Example: Input: 2.5亿

    • Previous Lemma: 2500000000

    • Current Lemma: 250000000

  • We fixed a bug where the Chinese word “星期四” would be tokenized incorrectly in certain contexts. (ETROG-3582)

  • We fixed a bug where a token whose surface form was the empty string could be returned when fragmentBoundaryDetection was set to true (the default). (ETROG-3686)
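The corrected lemma for 2.5亿 follows from scaling the decimal part by the multiplier: 2.5 × 10^8 = 250000000. The Java sketch below reproduces that arithmetic outside of RBL. The class name, the method name, and the small multiplier table are assumptions made for illustration and are not part of the RBL API.

```java
import java.math.BigDecimal;
import java.util.Map;

final class NumericLemmaSketch {
    // Hypothetical subset of multiplier characters mapped to powers of ten.
    private static final Map<Character, Integer> MULTIPLIERS = Map.of(
            '\u4E07', 4,   // 万
            '\u842C', 4,   // 萬
            '\u4EBF', 8,   // 亿
            '\u5104', 8);  // 億

    /** Returns the plain decimal value of a token such as "2.5亿". */
    static String lemmaOf(String token) {
        int end = token.length();
        int exponent = 0;
        // Peel multiplier characters off the end of the token.
        while (end > 0 && MULTIPLIERS.containsKey(token.charAt(end - 1))) {
            exponent += MULTIPLIERS.get(token.charAt(end - 1));
            end--;
        }
        // BigDecimal keeps the decimal point exact, so 2.5 is never read as 25.
        BigDecimal value = new BigDecimal(token.substring(0, end)).scaleByPowerOfTen(exponent);
        return value.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(lemmaOf("2.5\u4EBF")); // prints 250000000
    }
}
```

The sketch only handles ASCII digits followed by multiplier characters; a token with no numeric part raises a NumberFormatException.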

Third-party component updates

Table 60. Updated

Package            Old Version  New Version
Apache Commons IO  2.15.1       2.16.1
Apache Log4j       2.21.1       2.23.1
args4J             2.33         2.37
Guava              33.0.0-jre   33.2.0-jre
Jackson            2.16.1       2.17.1
Protocol Buffers   3.25.0       3.25.3
Woodstox           4.4.1        6.6.2



Release 7.47.2.c73.0

March 2024

New

  • Unicode update: Unicode 15.1 is now supported. (ETROG-3595)

  • Solr and Lucene support: Lucene 9.9 and Solr 9.5 are now supported. (ETROG-3673)

Third-party component updates

Table 61. Updated

Package                                                Old Version  New Version
Apache Commons IO                                      2.15.0       2.15.1
Apache Commons Lang                                    3.12.0       3.14.0
fastutil                                               8.5.12       8.5.13
Guava                                                  32.1.3-jre   33.0.0-jre
Guava InternalFutureFailureAccess and InternalFutures  1.0.1        1.0.2
ICU4J                                                  70.1         74.2
Jackson Annotations                                    2.15.3       2.16.1
Jackson Core                                           2.15.3       2.16.1
Jackson Databind                                       2.15.3       2.16.1
Jackson Dataformat XML                                 2.15.3       2.16.1
Jackson Dataformat YAML                                2.15.3       2.16.1
Jackson Datatype Guava                                 2.15.3       2.16.1
Jackson Old JAXB Annotations                           2.15.3       2.16.1



Release 7.47.1.c72.0

December 2023

New

  • Japanese improvements: We have improved and augmented the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3532, 3581, 3535, 3668)

  • Solr and Lucene support: Lucene 9.5 – 9.8 and Solr 9.4 are now supported. (ETROG-3665)

Third-party component updates

Table 62. Updated

Package                                Old Version  New Version
Jackson Annotations                    2.15.2       2.15.3
Jackson Core                           2.15.2       2.15.3
Jackson Databind                       2.15.2       2.15.3
Jackson Dataformat XML                 2.15.2       2.15.3
Jackson Dataformat YAML                2.15.2       2.15.3
Jackson datatypes: collections         2.15.2       2.15.3
Jackson Modules: Base                  2.15.2       2.15.3
Guava: Google Core Libraries for Java  32.1.2-jre   32.1.3-jre
Protocol Buffers [Core]                3.23.4       3.25.0
Apache Commons IO                      2.11.0       2.15.0
Apache Log4j API                       2.20.0       2.21.1
Apache Log4j Core                      2.20.0       2.21.1
Apache Log4j SLF4J Binding             2.20.0       2.21.1
Stax2 API                              4.2.1        4.2.2
SnakeYAML                              2.0          2.2



Release 7.47.0.c71.0

September 2023

New

  • Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)

  • Expanded Japanese lexicon: We have expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)

  • Added secondary parts of speech: We've added support for secondary parts of speech to Chinese and Japanese when tokenizerType is set to spaceless_lexical. (ETROG-3636)

  • Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:

    • Readings are merged into a single reading if the readings become the same string after tone mark removal. (ETROG-3625)

    • Chinese readings are returned in a list. Previously, a token with multiple possible readings was returned as a single string with brackets and semicolons. (ETROG-3626)

      Example: "蔭權"

      • Previous readings returned: "【yīn;yìn】quán"

      • Readings now returned: "yīnquán" and "yìnquán"

  • Solr and Lucene support: Lucene 9.5 - 9.7 and Solr 9.3 are now supported. (ETROG-3643)
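For code that still holds readings in the earlier bracketed form, the relationship between the two formats can be sketched in plain Java. The example below expands a string such as "【yīn;yìn】quán" into one reading per combination, and shows how two readings collapse into one once tone marks are removed. The class and method names are hypothetical, and the merge step assumes that tone marks are being stripped from the output; RBL's own implementation may differ.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ReadingsSketch {
    /** Expands a legacy bracketed reading into a list with one entry per combination. */
    static List<String> expand(String legacy) {
        List<String> results = new ArrayList<>(List.of(""));
        int i = 0;
        while (i < legacy.length()) {
            int open = legacy.indexOf('\u3010', i);                      // opening bracket
            int close = open < 0 ? -1 : legacy.indexOf('\u3011', open);  // closing bracket
            if (open < 0 || close < 0) {
                // No further bracket group: the rest is a literal suffix.
                results = append(results, List.of(legacy.substring(i)));
                break;
            }
            results = append(results, List.of(legacy.substring(i, open)));
            results = append(results, Arrays.asList(legacy.substring(open + 1, close).split(";")));
            i = close + 1;
        }
        return results;
    }

    private static List<String> append(List<String> prefixes, List<String> alternatives) {
        List<String> out = new ArrayList<>();
        for (String prefix : prefixes) {
            for (String alternative : alternatives) {
                out.add(prefix + alternative);
            }
        }
        return out;
    }

    /** Removes the four pinyin tone marks (grave, acute, macron, caron) and keeps other diacritics. */
    static String stripToneMarks(String reading) {
        String decomposed = Normalizer.normalize(reading, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("[\\u0300\\u0301\\u0304\\u030C]", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    /** Merges readings that become identical after tone mark removal, keeping first-seen order. */
    static List<String> mergeToneless(List<String> readings) {
        Set<String> merged = new LinkedHashSet<>();
        for (String reading : readings) {
            merged.add(stripToneMarks(reading));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> readings = expand("\u3010y\u012Bn;y\u00ECn\u3011qu\u00E1n");
        System.out.println(readings.size());         // prints 2
        System.out.println(mergeToneless(readings)); // prints [yinquan]
    }
}
```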

Bug Fixes

  • We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)

  • When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before pinyin syllables that start with "a", "e", or "o" which are not the first syllable. (ETROG-3637)

  • We fixed a bug in the UPT-16 conversion where some Japanese particles were not tagged with the correct part of speech. The particles are now tagged correctly as ADP. (ETROG-3526)
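The apostrophe rule for character-by-character pinyin readings can be illustrated with a short Java sketch. It joins syllables and inserts an apostrophe before any non-initial syllable that starts with "a", "e", or "o", including tone-marked forms such as "ā". The class and method names are assumptions for illustration only, and uppercase syllables are not handled.

```java
import java.text.Normalizer;
import java.util.List;

final class PinyinJoinSketch {
    /** Joins pinyin syllables, separating a non-initial a/e/o syllable with an apostrophe. */
    static String join(List<String> syllables) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < syllables.size(); i++) {
            String syllable = syllables.get(i);
            if (i > 0 && startsWithAEO(syllable)) {
                joined.append('\'');
            }
            joined.append(syllable);
        }
        return joined.toString();
    }

    private static boolean startsWithAEO(String syllable) {
        if (syllable.isEmpty()) {
            return false;
        }
        // Decompose first so that a tone-marked vowel starts with its plain base letter.
        char first = Normalizer.normalize(syllable, Normalizer.Form.NFD).charAt(0);
        return first == 'a' || first == 'e' || first == 'o';
    }

    public static void main(String[] args) {
        // The two syllables of "Xi'an" in lowercase pinyin with tone marks.
        System.out.println(join(List.of("x\u012B", "\u0101n")));
    }
}
```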

Known Issues

  • The plugin for Solr 8.11 may throw an exception when using multiple neural models.

Third-party component updates

Table 63. Updated

Package                                Old Version  New Version
Jackson Annotations                    2.15.0       2.15.2
Jackson Core                           2.15.0       2.15.2
Jackson Databind                       2.15.0       2.15.2
Jackson Dataformat XML                 2.15.0       2.15.2
Jackson Dataformat T                   2.15.0       2.15.2
Jackson Datatype: Guava                2.15.0       2.15.2
Jackson Module: Old JAXB Annotations   2.15.0       2.15.2
Guava: Google Core Libraries for Java  31.1-jre     32.1.2-jre
Protocol Buffers [Core]                3.21.7       3.23.4



Release 7.46.4.c70.0

June 2023

New

  • Lucene and Solr support: RBL-JE now supports Lucene 9.2 to 9.4 and Solr 9.2. (ETROG-3631)

Third-party component updates

Table 64. Updated

Package                         Old Version  New Version
Apache Log4j                    2.19.0       2.20.0
fastutil                        8.5.9        8.5.12
Jackson Annotations             2.14.0       2.15.0
Jackson Core                    2.14.0       2.15.0
Jackson Databind                2.14.0       2.15.0
Jackson Dataformat XML          2.14.0       2.15.0
Jackson dataformats: Text       2.14.0       2.15.0
Jackson datatypes: collections  2.14.0       2.15.0
Jackson modules: Base           2.14.0       2.15.0
SnakeYAML                       1.33         2.0



Release 7.46.3.c69.0

March 2023

Bug Fixes

  • We fixed a bug where processing an empty string as Chinese or Japanese with tokenizerType set to spaceless_lexical would throw a NullPointerException. (ETROG-3629)

Known Issues

  • In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)

Release 7.46.2.c69.0

March 2023

New

  • We've improved the time it takes to tokenize extremely long (> 10K characters) Japanese sentences. (ETROG-3602)

Known Issues

  • In Solr 8.11.2, it is not possible to consistently load more than one neural model at a time. (ETROG-3608)

Release 7.46.1.c68.0

December 2022

Bug Fixes

  • Annotating Ukrainian with universalPosTags set to true no longer results in a RosetteUnsupportedLanguageException being thrown. (ETROG-3604)

Release 7.46.0.c68.0

November 2022

New

  • Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)

  • Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)

  • GPU support: Features that use TensorFlow now use a GPU if available. (ETROG-3564)

  • Emoji support: Emoji 15.0 is now supported. (ETROG-3577)

  • New option for Katakana: We've added the option joinKatakanaNextToMiddleDot to control whether sequences of Japanese Katakana tokens adjacent to a middle dot should be merged into a single Katakana token. By default, it is true, which matches the behavior in previous versions of RBL-JE. (ETROG-3592)

  • Solr 9.1 support: Lucene and Solr 9.1 are supported. (ETROG-3597)

Bug Fixes

  • The Japanese POS tag VN (verbal noun) is now mapped to the UPT-16 POS tag NOUN. It was previously mapped to VERB. (ETROG-3583)

Third-party component updates

Table 65. Upgraded

Package       Old version                     New version
Apache Log4j  2.17.1                          2.19.0
fastutil      8.5.6                           8.5.9
Jackson       2.11.1                          2.14.0
JavaCPP       1.58-alpha.20220614.013710.426  1.58
SLF4J         1.7.33                          1.7.36
SnakeYAML     1.30                            1.33



Release 7.45.0.c67.0

September 2022

New

  • Tagalog support:

    • RBL now supports Part of Speech (POS) tagging in Tagalog. (ETROG-3559)

    • RBL now supports lemmatization for Tagalog. (ETROG-3570)

    • The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)

  • Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)

  • Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)

Bug Fixes

  • We fixed a bug in Russian where certain uncommon consonant–vowel sequences in words in the lexicon were incorrectly replaced with more common sequences with different vowel letters. (ETROG-3541)

    Example: брошюра

    • Previously: брошура

    • Now: брошюра

Release 7.44.2.c67.0

July 2022

New

  • An open source package (JavaCPP) has been updated to allow the Elasticsearch plugin to use TensorFlow. (ESPI-168)

Release 7.44.1.c67.0

June 2022

Bug Fixes

  • RBL-JE no longer crashes when loading TensorFlow fails with a NoClassDefFoundError. It now throws an exception. (ESPI-169)

Release 7.44.0.c67.0

June 2022

New

  • Indonesian support added: RBL now supports Part of Speech (POS) tagging in Indonesian. (ETROG-3543)

  • Malay (Standard) support added: RBL now supports Part of Speech (POS) tagging in Malay (Standard). (ETROG-3545)

  • Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)

  • Java 17 support added: Java 17 is now supported. Java 8 and 9 support has been removed. (ETROG-3524)

  • Solr 9 support added: RBL now supports Lucene and Solr 9. (ETROG-3549)

  • Solr 6 support removed: RBL no longer supports Lucene or Solr 6 or earlier. (ETROG-3519)

Bug Fixes

  • In Japanese, negative forms of ichidan verbs written all in hiragana are no longer lemmatized to end with “なう”. (ETROG-3534)

    Example: Input: くれない

    • Previously: lemmatized to くれなう

    • Now: lemmatized to くれる

Third-party component updates

This release includes the following third-party component changes:

Table 66. Added

Package                  Version  License
Jakarta Annotations API  1.3.3    Eclipse Public License 2.0 and GPL 2 with classpath exception



Release 7.43.0.c66.0

February 2022

Notice

Solr 6 and earlier support is deprecated as of this release.

Java 8 and Java 9 support is deprecated as of this release.

New

  • Solr 8.11 support: This release supports Solr 8.11. (ETROG-3502)

  • Deprecated methods: Token#getType has been deprecated as token types are not used in RBL-JE without the Lucene/Solr plugins and the plugins use a different API. (ETROG-3503)

  • Solr 6 support deprecated: Support for Solr versions 6.x and earlier is deprecated as of this release and will be removed in the next version.

  • Permission changes: We removed group and other write permissions from model files. All files are now only writable by the owner. (ETROG-3516)
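If model files are copied or repackaged during deployment, the same permission policy can be reapplied with the standard java.nio.file.attribute classes. The sketch below is a hypothetical helper, not part of RBL: it takes a POSIX permission set and drops the group and other write bits, leaving every other bit unchanged.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;
import java.util.Set;

final class ModelPermissionSketch {
    /** Returns a copy of the given permissions without group or other write access. */
    static Set<PosixFilePermission> ownerWritableOnly(Set<PosixFilePermission> current) {
        Set<PosixFilePermission> result = EnumSet.noneOf(PosixFilePermission.class);
        result.addAll(current);
        result.remove(PosixFilePermission.GROUP_WRITE);
        result.remove(PosixFilePermission.OTHERS_WRITE);
        return result;
    }

    public static void main(String[] args) {
        Set<PosixFilePermission> before = PosixFilePermissions.fromString("rw-rw-rw-");
        // prints rw-r--r--
        System.out.println(PosixFilePermissions.toString(ownerWritableOnly(before)));
    }
}
```

On a POSIX file system, the resulting set could be applied to a file with Files.setPosixFilePermissions.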

Third-party component updates

This release includes the following third-party component changes:

Table 67. Upgraded

Package              Old Version  New Version
Apache Commons IO    2.7          2.11.0
Apache Commons Lang  2.6          3.12.0
Apache Log4j         1.2.17       2.17.1
ICU4J                59.1         70.1
fastutil             8.4.0        8.5.6
SLF4J                1.7.28       1.7.33
SnakeYAML            1.26         1.30
TensorFlow for Java  0.2.0        0.3.3



Release 7.42.2.c65.0

November 2021

Bug Fixes

  • RBL no longer crashes in Arabic when emoticons is enabled. This fixes a bug introduced in 7.42.1. (ETROG-3493)

Release 7.42.1.c65.0

November 2021

Bug Fixes

  • In Korean, emoji and other language-neutral tokens no longer cause a ClassCastException to be thrown when using the ADM API. (ETROG-3488)

Release 7.42.0.c65.0

November 2021

New

  • Deprecated factories: TokenizerFactory, AnalyzerFactory, and CSCAnalyzerFactory have been deprecated in favor of BaseLinguisticsFactory. (ETROG-3453)

  • Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts, for Japanese with tokenizerType set to spaceless_lexical. (ETROG-3474)

    Example: Input: 三・一四

    • Previously: tokenization: 三 / ・ / 一四

    • Now: tokenization: 三・一四

  • Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)

  • Emoji support: We now support the emoji in Unicode 14.0. (ETROG-3476)

  • Japanese tokenization: In Japanese, when tokenizerType is set to spaceless_lexical, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. This is consistent with the default algorithm, spaceless_statistical. (ETROG-3475)

  • Solr 8.10 support: This release supports Solr 8.10. (ETROG-3482)

  • Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuation, and symbols instead of being marked as unknown or given some other tag. This applies to all languages with POS tags. (ETROG-3481)

  • Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)
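The numeric items above (the middle dot acting as a decimal point, and 七 being lemmatized to 7) can be pictured with a small positional conversion. The Java sketch below maps kanji digits to ASCII digits and both Katakana middle dots to ".". It is an illustration under stated assumptions, not RBL's algorithm: the class and method names are hypothetical, characters such as 十, 百, and 千 are not handled, and treating "3.14" as the value of 三・一四 is an assumption, since the release note only states that the input is now a single token.

```java
final class KanjiNumberSketch {
    // Kanji digits zero through nine, indexed by their numeric value.
    private static final String KANJI_DIGITS =
            "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    /** Converts positional kanji digits to ASCII, treating a middle dot as a decimal point. */
    static String toAscii(String token) {
        StringBuilder ascii = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            int digit = KANJI_DIGITS.indexOf(c);
            if (digit >= 0) {
                ascii.append((char) ('0' + digit));
            } else if (c == '\u30FB' || c == '\uFF65') {
                // Fullwidth and halfwidth Katakana middle dots.
                ascii.append('.');
            } else {
                throw new IllegalArgumentException("Unsupported character: " + c);
            }
        }
        return ascii.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u4E03"));                   // prints 7
        System.out.println(toAscii("\u4E09\u30FB\u4E00\u56DB")); // prints 3.14
    }
}
```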

Bug Fixes

  • In Japanese, when tokenizerType is set to spaceless_lexical, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)

  • We've reverted two of the POS changes made in version 7.39.0.c63.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)

    The values are now:

    • “|以” (Chinese) “|” tagged as PUNCT

    • “2对” (Chinese) “对” tagged as NM

  • RBL-JE no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)

  • In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)

Release 7.41.1.c65.0

July 2021

Bug Fixes

  • Enabling universalPosTags for Indonesian, Tagalog, or Standard Malay no longer throws a RosetteUnsupportedLanguageException. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3465)

Release 7.41.0.c65.0

July 2021

New

  • New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)

  • Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)

  • Solr 8.9 support: This release supports Solr 8.9. (ETROG-3457)

Release 7.40.1.c64.1

May 2021

Bug Fixes

  • We fixed a bug where enabling alternativeSpanishDisambiguation for Spanish caused a NullPointerException to be thrown. (ETROG-3435)

  • We fixed a bug where setting disambiguatorType to DNN for Hebrew caused a RosetteRuntimeException to be thrown. (ETROG-3437)

Release 7.40.0.c64.1

May 2021

New

  • New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)

  • New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)

Bug Fixes

  • Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)

    Example: Input: 天津??

    • Previously:

      Token{text=天津}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}

      Token{text=??}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}

    • Now:

      Token{text=天津}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}

      Token{text=?}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}

      Token{text=?}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}

Third-party component updates

This release includes the following third-party component changes:

Table 68. Added

Package  Version
JavaCPP  1.5.4



Table 69. Upgraded

Package     Old Version  New Version
TensorFlow  1.14.0       2.3.1



Release 7.39.0.c63.0

March 2021

New

  • Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)

    • TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.

    • NFKC normalization is now supported for Hebrew.

    • We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.

    • We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)

    • Double apostrophes are now treated like gershayim. (ETROG-3249)

  • Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)

  • Solr 8.8: This release supports Solr 8.8. (ETROG-3369)

  • Improved CSCAnnotator output: The CSCAnnotator now emits tokens in addition to translations, even if no tokens were specified in the input. (ETROG-3356)

  • Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)

  • Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)

    • cat/ca-ud-train.downcased.mdl

    • est/et-ud-train.downcased.mdl

    • fas/posLemma.mdl

    • lav/lv-ud-train.downcased.mdl

    • nno/lemma.mdl

    • nob/lemma.mdl

    • slk/sk-ud-train.downcased.mdl

    • srp/sr-ud-train.downcased.mdl

Bug Fixes

  • Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)

    Example: מע'רב

    • Previously:

      Token{text=מע'}
      MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, 
      partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}
      MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, 
      partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
    • Now:

      Token{text=מע'רב}
      MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[],
      com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, 
      tagSet=MILA_HEBREW}
  • A structured region containing two new lines is now properly labeled as STRUCTURED. Previously, the layout region would be labeled as UNSTRUCTURED. (ETROG-3378)

    Example: * item\n* item\n* item\n\n

    • Previously:

      {"startOffset": 0,"endOffset": 14,"layout": "STRUCTURED"}
      {"startOffset": 14,"endOffset": 22,"layout": "UNSTRUCTURED"}
    • Now:

      {"startOffset": 0,"endOffset": 22,"layout": "STRUCTURED"}

Release 7.38.1.c63.0

January 2021

New

  • RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)

Bug Fixes

  • We reverted some changes to Korean disambiguation from the 7.37.0.c62.2 release as the changes introduced new disambiguation errors. (ETROG-3349)

Third-party component updates

This release includes the following third-party component changes:

Package                    Old Version  New Version
Jackson Annotations        2.10.0       2.11.1
Jackson Core               2.10.0       2.11.1
Jackson Databind           2.10.0       2.11.1
Jackson Dataformat XML     2.10.0       2.11.1
Jackson dataformats: Text  2.10.0       2.11.1
Jackson modules: Base      2.10.0       2.11.1
Protocol Buffers           3.6.1        3.12.2
Apache Commons IO          2.6          2.7
fastutil                   8.3.0        8.4.0
Woodstox Stax2 API         4.2          4.2.1
SnakeYAML                  1.25         1.26

Release 7.38.0.c62.2

December 2020

New

  • Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)

  • Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)

    Example: δείξε

    • Previously: Selected lemma: δεικνύω (archaic)

    • Now: Selected lemma: δείχνω (modern)

  • New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)

  • Deprecated classes: The classes BufferWordBreaker and WordBreakResults have been deprecated. (ETROG-3318)

Bug Fixes

  • The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)

    Example: Start+

    • Previously: POS tags: possible PROP, ADJ, NOUN

    • Now: POS tag: FM

  • GenericTokenizer#hasNext is now implemented to be consistent with the documentation for Iterator#hasNext. Previously it always returned false. (ETROG-2140)
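The Iterator contract referred to above requires hasNext to return true whenever a following call to next would return an element rather than throw. The Java sketch below shows a minimal iterator that honors that contract. TokenIteratorSketch is a hypothetical stand-in over a list of strings and does not reflect how GenericTokenizer is implemented.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class TokenIteratorSketch implements Iterator<String> {
    private final List<String> tokens;
    private int position = 0;

    TokenIteratorSketch(List<String> tokens) {
        this.tokens = List.copyOf(tokens);
    }

    @Override
    public boolean hasNext() {
        // True exactly when next() would return a token.
        return position < tokens.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.get(position++);
    }

    public static void main(String[] args) {
        Iterator<String> it = new TokenIteratorSketch(List.of("Start", "+"));
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println(count); // prints 2
    }
}
```

With a hasNext that always returns false, the while loop in main would never run, which is the symptom a caller would have seen before the fix.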

Release 7.37.0.c62.2

November 2020

New

  • Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)

  • Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)

  • Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)

  • Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)

  • Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, supporting inclusion of files in rule files. (ETROG-2497)

  • Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)

  • New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)

  • Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)

  • Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)

Bug Fixes

  • Combining characters in Hebrew are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)

  • A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)

  • RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)

  • Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)

  • The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)

    • Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET

    • Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON

  • Tokens no longer have null token types. (ETROG-3316)

  • When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)

    Example: ﷺ

    • Previously: Offsets: 

      صلى start 0 end 0 

      الله start 0 end 0

      عليه start 0 end 0

      وسلم start 0 end 1

    • Now: Offsets:

      صلى start 0 end 1

      الله start 0 end 1

      عليه start 0 end 1

      وسلم start 0 end 1
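The ﷺ example can be reproduced with the standard java.text.Normalizer class: NFKC expands the single character U+FDFA into four space-separated Arabic words, and each resulting piece maps back to the one-character span of the original. The sketch below is an illustration only. The class name, method name, and output format are assumptions, and it normalizes one code point at a time, which is simpler than what a full tokenizer would do.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class NfkcOffsetSketch {
    /**
     * NFKC-normalizes the code point at {@code start} and returns one entry per
     * whitespace-separated piece, each carrying the offsets of the original character.
     */
    static List<String> expand(String original, int start) {
        int codePoint = original.codePointAt(start);
        int end = start + Character.charCount(codePoint);
        String normalized = Normalizer.normalize(original.substring(start, end), Normalizer.Form.NFKC);
        List<String> spans = new ArrayList<>();
        for (String piece : normalized.trim().split("\\s+")) {
            // Every piece keeps the original span, so start and end are never equal.
            spans.add(piece + " start " + start + " end " + end);
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> spans = expand("\uFDFA", 0);
        System.out.println(spans.size()); // prints 4
        spans.forEach(System.out::println);
    }
}
```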

Release 7.36.0.c62.2

September 2020

New

  • Lucene/Solr: Versions up through 8.6.0 are now supported. (ETROG-3250)

  • Decompose compounds: The option to control decomposition of compounds is now available in Dutch, German, Hungarian, Danish, Bokmål, Nynorsk, Swedish, and Korean. The default for decomposeCompounds is true. (ETROG-3263, ETROG-3264, ETROG-3265)

  • Performance improvement: English and Spanish disambiguation is now faster. Alternative disambiguation (alternativeEnglishDisambiguation or alternativeSpanishDisambiguation) must be set to false. (ETROG-3246, ETROG-3243)

Bug Fixes

  • In Hebrew, prefixes in some acronym tokens are now listed correctly in the list of prefixes, instead of being duplicated in the lemma. (ETROG-3214)

    Example: “ומש"ס”

    • Previously: lemma: “ומומש"ס”, empty prefix list

    • Now: lemma: “ש"ס”, prefix list = [“ו”, “מ”]

  • Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled. (ETROG-3241)

    Example: "a very very very very long line\nshort\n\n"

    • Previously: 2 sentences

      {"startOffset":0,"endOffset":20}
      {"startOffset":20,"endOffset":27}
    • Now: 1 sentence

      {"startOffset":0,"endOffset":26}
  • In Hebrew, lemmas starting or ending with spaces now have the spaces removed. (ETROG-3248)

    Example: "אאורקה"

    • Previously: “אאורקה ”

    • Now: "אאורקה"

  • Analyses of unknown Hebrew words with guessed prefixes no longer have duplicate prefixes in their prefix list. (ETROG-3253)

    Example: "בפיירפוקס"

    • Previously: prefix list: [ב, ב]

    • Now: prefix list: [ב]

  • In Chinese and Japanese, the system no longer crashes when both fragmentBoundaryDetection and alternativeTokenization are enabled. (ETROG-3260)

  • In Japanese, adjacent tokens are no longer erroneously joined when alternativeTokenization is enabled. (ETROG-3261)

  • When universalPosTags is enabled, the UPT-16 POS tags are now marked as having the tag set UPT16_V1 instead of the default tag set of the language. (ETROG-3273)

    Example: French

    • Previously: tag set: BT_FRENCH

    • Now: tag set: UPT16_V1

  • We've fixed the tokenize-analyze example in the samples directory. It now correctly produces results for Hebrew analysis. (ETROG-3252)

Release 7.35.0.c62.2

July 2020

New Features

  • Layout regions added: Layout regions, describing each section of input text as STRUCTURED or UNSTRUCTURED, are now identified by the annotator. In order to detect layout regions, fragment boundary detection must be enabled. (ETROG-3172)

  • New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)

  • Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)

  • Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)

  • Hebrew user-defined dictionaries added: Static and dynamic user-defined Hebrew analysis dictionaries are now supported. (ETROG-3230)

  • Deprecated method: HebrewAnalysis#characteristicString is now deprecated. (ETROG-3209)

  • Order of user-defined dictionaries: The order in which user-defined dictionaries are consulted has been standardized. Refer to the RBL-JE Application Developer's Guide for details. (ETROG-3148)

Bug Fixes

  • Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This only occurred when fstTokenize was enabled and in some languages. (ETROG-3159)

    Example: "1\n234" (embedded newline within the number string)

    • Previously: "1 234" (1 token)

    • Now: "1" "234" (2 tokens)

    This example assumes fstTokenize is enabled and the language is French.

  • Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)

  • Tokens with digits are now eligible for the Greek guesser. (ETROG-3231)

    • Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags

    • Now: "HDMI1" gets FM POS tag

  • In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only occurred when the guessHebrewPrefixes option was set to true. (ETROG-3221)

    Example: "ומפיפרנו"

    • Previously: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag preposition.

    • Now: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag unknown.

  • Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)

    Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"

    • Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"

    • Now: One analysis, lemmatized to "разложить"

  • German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)

    Example: Input "Mitbehandlung"; "mit" is a separable prefix

    • Previously: Lemmatized to "mitBehandlung"

    • Now: Lemmatized to "Mitbehandlung"

  • In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)

    Example: "1⃣" (keycap: U+0031 DIGIT ONE followed by U+20E3 COMBINING ENCLOSING KEYCAP)

    • Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.

    • Now: Tokenized to one token, "1⃣".
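
The keycap case can be reproduced with the standard Java Character API alone. The sketch below is not RBL-JE code; the class and method names are hypothetical. It only shows that U+20E3 is a combining (enclosing) mark, so a tokenizer that attaches combining marks to the preceding character sees one unit rather than two.

```java
class KeycapDemo {
    // True if the code point is a combining mark (Unicode categories Mn, Mc, or Me).
    public static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Counts units after attaching every combining mark to the code point before it.
    public static int countBaseUnits(String text) {
        int units = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (units == 0 || !isCombiningMark(cp)) {
                units++;
            }
            i += Character.charCount(cp);
        }
        return units;
    }

    public static void main(String[] args) {
        String keycap = "1\u20E3"; // DIGIT ONE + COMBINING ENCLOSING KEYCAP
        System.out.println(keycap.codePointCount(0, keycap.length())); // 2 code points
        System.out.println(isCombiningMark(0x20E3));                   // true
        System.out.println(countBaseUnits(keycap));                    // 1 unit
    }
}
```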

Release 7.34.2.c62.2

May 2020

New Features

  • Hebrew tokens that have prefixes but not stems now get appropriate parts of speech. Previously, they got the POS tag "unknown". (ETROG-3207)

    Example: “ה” from the string “ה70”

    • Previously: POS tag "unknown"

    • Now: POS tag "quantifier"

  • Lucene/Solr up through version 8.5.1 is now supported. (ETROG-3208)

  • When guessHebrewPrefixes is true, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes. (ETROG-3188)

    Example: Token: "ומפיפרנו"

    • Previously: 2 analyses: 

      hebrewPrefixes=[ו] lemma=מפיפרנו

      hebrewPrefixes=[ו, מ] lemma=פיפרנו

    • Now: 3 analyses:

      hebrewPrefixes=[ו] lemma=מפיפרנו

      hebrewPrefixes=[ו, מ] lemma=פיפרנו

      hebrewPrefixes=[] lemma=ומפיפרנו

Bug Fixes

  • Minimally-qualified emoji are no longer split apart. (ETROG-3185)

    Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)

    • Previously: U+1F481 and <U+200D, U+2642> (2 tokens)

    • Now: <U+1F481, U+200D, U+2642> (1 token)

  • Capitalized nouns are no longer being detected as verbs. (ETROG-3186)

    Example: The noun "Service" from the phrase "Price and Quality of Service"

    • Previously: POS tag VI (infinitive or imperative verb)

    • Now: POS tag PROP (proper noun)

  • When creating multiple analyzers for Chinese, Japanese, or Thai with alternativeTokenization set to false (the default), the analyzers now share the same model data. This reduces memory usage when creating multiple analyzers. (ETROG-3200)

    Note: While memory usage has been improved, the process is still memory intensive. If RBL throws an OutOfMemoryError, increase the heap space.
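
To inspect the minimally-qualified emoji from ETROG-3185 above, the sequence can be built from its code points with the standard Java API. This is an illustrative sketch with hypothetical class and method names, not RBL-JE code. It shows that the sequence is three code points but four UTF-16 code units, which is why offsets must be handled with care.

```java
import java.util.stream.Collectors;

class EmojiSequenceDemo {
    // Builds "man tipping hand" without the trailing U+FE0F (minimally-qualified form).
    public static String manTippingHand() {
        return new StringBuilder()
                .appendCodePoint(0x1F481) // PERSON TIPPING HAND
                .appendCodePoint(0x200D)  // ZERO WIDTH JOINER
                .appendCodePoint(0x2642)  // MALE SIGN
                .toString();
    }

    // Renders a string as a space-separated list of U+XXXX code points.
    public static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String emoji = manTippingHand();
        System.out.println(describe(emoji));                         // U+1F481 U+200D U+2642
        System.out.println(emoji.length());                          // 4 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 3 code points
    }
}
```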

Release 7.34.1.c62.2

March 2020

Bug Fixes

  • We removed some incorrect entries from the Hebrew lexicon that were added in 7.34.0.c62.2. (ETROG-3182, ETROG-3183)

Release 7.34.0.c62.2

March 2020

New Features

  • Lucene/Solr: RBL-JE now supports Lucene/Solr up through version 8.4.1. (ETROG-3156)

  • Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized. (ETROG-3164)

  • Additional emoji support: Emoji hair components are now lemmatized. (ETROG-3167)

  • German professions: Additional German professions have been added to the German lexicon. (ETROG-3163)

  • Spanish performance improvements: Spanish disambiguation is now faster when alternativeSpanishDisambiguation is false. (ETROG-3169)

  • Hebrew lemmatization: We increased proper noun coverage in the Hebrew lexicon. (ETROG-3161, ETROG-3162)

Bug Fixes

  • Low surrogates are no longer stripped from the ends of tokens in Hebrew. (ETROG-3165)

  • Number tokens with embedded spaces are no longer split into multiple tokens when preceded or followed by a symbol when fstTokenize is true. (ETROG-3158)

    • Previously: $800 000 000 was tokenized as two tokens: $800 <TokenBoundary> 000 000

    • Now: $800 000 000 is tokenized as a single token: $800 000 000

Release 7.33.0.c62.2

January 2020

New Features

  • The delimiters for the fragment boundary detector are now configurable. (ETROG-3116)

  • The fragment boundary detector now marks a boundary after any spaces following the fragment boundary delimiter. (ETROG-3116)

  • An underscore (U+005F) is no longer treated as a token separator in German when fstTokenize is enabled. (ETROG-3144)

Bug Fixes

  • We fixed a bug where tokens from multi-script Russian text sometimes had incorrect offsets if fstTokenize was enabled. (ETROG-3142)

  • We fixed a bug where multi-script Russian text would have a sentence break each time the script changed. (ETROG-3145)

  • We fixed a bug where there were unexpected sentence breaks after some short lines not ending in whitespace. (ETROG-3146)

  • We fixed a bug where sentence breaks were missing when the sentence break did not align with a token boundary. (ETROG-3140)

Release 7.32.0.c62.1

December 2019

New Features

  • Added support for Lucene/Solr up through version 8.3.0. (ETROG-3128)

  • Added support for tokenizing and lemmatizing Latvian. (ETROG-2798)

  • Latin-script regions within Russian documents are now tokenized and analyzed as English. (ETROG-3126)

  • TokenizerOption.licenseString, AnalyzerOption.licenseString, and BaseLinguisticsOption.licenseString may now be passed into a create method. Previously, these options had to be set on the factory itself. (ETROG-3134)

Bug Fixes

  • We fixed a bug where guessed German compounds were sometimes lemmatized as verbs but tagged as nouns. (ETROG-3094)

  • We fixed a bug where the fragment boundary detector would mark a sentence break after every Windows newline. (ETROG-3133)

Release 7.31.0.c62.0

November 2019

New Features

  • The Hebrew files dinflections.bin, dprefixes.data, and gimatria.data have been moved from the root/models directory to root/dicts/heb. (ETROG-3088)

  • Specifying the universalPosTags option now adds the deliverExtendedTags option as well. (ETROG-2185)

  • Dynamic user dictionaries can now be created and populated at runtime. See the section User-Defined Dictionaries in the Application Developer's Guide for details. (ETROG-3086, ETROG-3100, ETROG-3109, ETROG-3110, ETROG-3111)

  • Fragment boundary detection is now enabled by default. Previously it was disabled by default. (ETROG-3108)

  • TokenizerOption.alternativeTokenizationOptions has been deprecated in favor of a separate option for each YAML key. See the Javadoc for details. (ETROG-3109)

  • The UPT-16 files upt-16-pes.yaml and upt-16-prs.yaml have been removed from the distribution package, as they were unused. (ETROG-3122)

  • The -order option in rbl-build-csc-dictionary has been removed. All dictionaries are now built as LE, as LE dictionaries still work on BE machines. (ETROG-3120)

  • We've added imperative forms for 2000 verbs to the Arabic lexicon. (ETROG-3090)

Bug Fixes

  • Fragment boundary detection is now enabled for Hebrew. (ETROG-1442)

  • When lemmatizing numbers in Russian, numbers containing spaces will now be lemmatized without the space. For example, "1 234" will now be lemmatized as "1234" instead of "1 234". (ETROG-3101)

  • We fixed a bug introduced in 7.30.1.c61.0 which raised an ArrayIndexOutOfBoundsException when processing Japanese with alternativeTokenization and favorUserDictionary set to true. (ETROG-3118)

  • We fixed a bug where a middle dot would be ignored if it preceded white space when using alternativeTokenization in Japanese. (ETROG-3113)
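
The effect of the Russian number lemmatization change (ETROG-3101) above can be approximated in plain Java. The helper below is a hypothetical sketch, not the RBL-JE implementation: it assumes that only ordinary spaces directly between two digits are removed.

```java
class NumberLemmaDemo {
    // Removes a space only when it sits between two digits, e.g. "1 234" -> "1234".
    public static String stripDigitGroupSpaces(String token) {
        return token.replaceAll("(?<=\\d) (?=\\d)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDigitGroupSpaces("1 234"));     // 1234
        System.out.println(stripDigitGroupSpaces("1 234 567")); // 1234567
        System.out.println(stripDigitGroupSpaces("a b"));       // a b (unchanged)
    }
}
```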

Third-party component updates

Component                        Version   Change
Apache Commons IO                2.6       Version upgrade
args4j                           2.33      Version upgrade
fastutil                         8.3.0     Version upgrade
Jackson Annotations              2.10.0    Version upgrade
Jackson Core                     2.10.0    Version upgrade
Jackson Databind                 2.10.0    Version upgrade
Jackson Dataformat XML           2.10.0    Version upgrade
Jackson dataformats: Text        2.10.0    Version upgrade
Jackson datatypes: collections   2.10.0    Version upgrade
Jackson modules: Base            2.10.0    Version upgrade
SLF4J                            1.7.28    Version upgrade
SnakeYAML                        1.25      Version upgrade
TensorFlow for Java              1.14.0    Version upgrade
Woodstox                         4.4.1     Version upgrade
Woodstox Stax2 API               4.2       Version upgrade

Release 7.30.2.c61.0

September 2019

Bug Fixes

  • We fixed a bug where an AssertionError might be thrown when analyzing Hungarian with Java assertions enabled.

  • Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.

    • Previously: Аполлона-11 (Apollo-11) was tagged as PROP, MISC, and NOUN

    • Now: Аполлона-11 (Apollo-11) is tagged as NOUN

  • Correct token offsets are now returned from a Japanese annotator where a non-katakana character precedes a user-defined katakana token and alternativeTokenization and favorUserDictionary are enabled.

  • We fixed a bug where constructors of factory classes in the Lucene/Solr plugin would throw an UnsupportedOperationException if passed a Map that did not support the remove method.
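
The UnsupportedOperationException in the last item above is standard Java behavior for read-only maps. The sketch below uses only java.util; the class, the method names, and the "language" key are hypothetical and are not part of the Lucene/Solr plugin. It shows the failure and one common way to avoid it, a defensive mutable copy; whether the plugin uses this approach is not stated in these notes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class OptionMapDemo {
    // Reports whether calling remove() directly on the caller's map fails.
    public static boolean removeThrows(Map<String, String> args, String key) {
        try {
            args.remove(key);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Reads an option by removing it from a mutable copy, leaving the caller's map untouched.
    public static String takeOption(Map<String, String> args, String key) {
        Map<String, String> copy = new HashMap<>(args);
        return copy.remove(key);
    }

    public static void main(String[] args) {
        Map<String, String> mutable = new HashMap<>();
        mutable.put("language", "jpn"); // hypothetical option name and value
        Map<String, String> readOnly = Collections.unmodifiableMap(mutable);

        System.out.println(removeThrows(readOnly, "language")); // true
        System.out.println(takeOption(readOnly, "language"));   // jpn
    }
}
```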

Release 7.30.1.c61.0

August 2019

New Features

  • Added support for Lucene/Solr up through version 8.2.0.

  • Dictionaries and models that are used on both big- and little-endian machines no longer include LE in their file names.

Bug Fixes

  • When alternativeTokenization was set to true, the Chinese tokenizer could create tokens at the end of the input string with the part of speech NT without checking that the context was valid for NT.

  • Analyzing Chinese and Japanese with alternativeTokenization enabled is now much faster on sentences that are thousands of characters long.

Release 7.30.0.c61.0

August 2019

New Features

  • Segmentation user dictionaries can be used for all languages, not just Chinese, Japanese, and Thai.

  • The option compoundComponentSurfaceForms has been added to return the surface forms of the components of compound words. By default, RBL-JE only returns the lemmas.

  • Added support for Lucene/Solr up through version 8.1.1.

  • Some Polish words ending in “-cku”, “-ska”, or “-sku” are lemmatized to forms ending in “-cki” or “-ski”.

Bug Fixes

  • The Japanese POS tag NE was not converted correctly to UPT-16.

  • The French POS tag CONJQUE was converted to UPT-16 CONJ instead of the more appropriate SCONJ.

  • When alternativeTokenization was disabled, Chinese punctuation was tagged as GUESS instead of PUNCT or EOS.

Release 7.29.0.c61.0

June 2019

New Features

  • Setting alternativeTokenization to true enables an alternative tokenizer for Thai, for parity with the Thai tokenizer in Basis Technology's C++ API (RLP).

  • All Hebrew tokens have analyses. The main change was adding the new part of speech punctuation. Non-punctuation tokens that formerly had empty analysis lists now have the part of speech unknown.

Third-party component updates

Component                         Version   Change
Jackson Annotations               2.9.8     Version upgrade
Jackson Core                      2.9.8     Version upgrade
Jackson Databind                  2.9.8     Version upgrade
Jackson Dataformat XML            2.9.8     Version upgrade
Jackson Dataformat YAML           2.9.8     Version upgrade
Jackson Datatype Guava            2.9.8     Version upgrade
Jackson Module JAXB Annotations   2.9.8     Version upgrade
Protocol Buffers                  3.6.1     Version upgrade
SnakeYAML                         1.23      Version upgrade

Release 7.28.2.c60.0

June 2019

New Features

  • U+2019 RIGHT SINGLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are now normalized to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew words, to match the normalization of U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM.
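
The mapping described above can be written as a small stand-alone function. This is a minimal sketch with a hypothetical class name, not the RBL-JE normalizer; it applies the four substitutions to any string and does not check that the surrounding word is Hebrew.

```java
class HebrewQuoteNormalizer {
    // Maps U+2019 and U+05F3 to APOSTROPHE, and U+201D and U+05F4 to QUOTATION MARK.
    public static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            switch (c) {
                case '\u2019': // RIGHT SINGLE QUOTATION MARK
                case '\u05F3': // HEBREW PUNCTUATION GERESH
                    sb.append('\'');
                    break;
                case '\u201D': // RIGHT DOUBLE QUOTATION MARK
                case '\u05F4': // HEBREW PUNCTUATION GERSHAYIM
                    sb.append('"');
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An acronym written with a right double quotation mark and with gershayim.
        System.out.println(normalize("\u05E6\u05D4\u201D\u05DC"));
        System.out.println(normalize("\u05E6\u05D4\u05F4\u05DC"));
    }
}
```

Both calls in main produce the same string, with U+0022 in place of the original mark.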

Release 7.28.1.c60.0

May 2019

New Features

  • Updated the English lexicon.

  • Added support for Lucene/Solr up through version 8.0.0.

  • Updated the German lexicon.

  • Updated the Swedish lexicon.

  • Arabic analysis will attempt to replace leading hamzated alefs with plain alefs for unrecognized tokens.

Bug Fixes

  • The surface forms of Hebrew tokens consisting of multiple prefixes without a base, like “מה”, are now the entire token text, instead of just the first prefix.

  • Russian hyphenated words that end in numbers, like “Аполлона-11”, are no longer tagged as DIG. They are now tagged with the same parts of speech they had before 7.27.2.c60.0.

  • Closing parentheses, brackets, and braces that follow URLs when urls is enabled are no longer merged into the URLs.

  • The disambiguator is now more likely to select analyses with the POS tags ATMENTION, EMAIL, HASHTAG, and URL over other analyses.

  • When the Hebrew tokenizer encounters a character not used in Hebrew immediately following a character used in Hebrew, it starts a new token. Formerly, it would delete that character and any following characters up to the next token separator (e.g. white space).

  • RBL-JE can now successfully read in ICU tokenization rule files that begin with a BOM.

  • Hebrew tokens consisting of multiple prefixes without a base are now tagged with the part of speech “unknown”, to match single-prefix tokens.

  • The English token “than” is tagged only as COTHAN. The candidate part of speech COORD has been removed for this token.

Release 7.28.0.c60.0

May 2019

New Features

  • A perceptron-based disambiguator is available for Hebrew. It is used by default and when the option disambiguatorType is set to DisambiguatorType.PERCEPTRON. It was measured to have higher lemma and part of speech accuracies than the alternatives. To use the previous default, set disambiguatorType to DisambiguatorType.DICTIONARY.

  • Added support for Lucene/Solr up through version 7.6.0.

  • Running on Java 11 is now supported.

Bug Fixes

  • Some white space characters could be part of Chinese tokens when alternativeTokenization was enabled.

  • Tokens that are thousands of characters long slow down the tokenizer.

  • Polish tokens that can appear in multiword expressions are no longer lemmatized to the full expressions. For example, “dzień” is not lemmatized to “dzień_dobry”.

  • The non-final components of Russian compound words with more than one hyphen were not lemmatized. The non-final components of Russian hyphenated compound words with the interfix “е” or “о” that coincidentally looked like the short forms of adjectives were lemmatized as if they were short forms.

RBL-JE Release Note Archive 7.27.0.c60.0 and earlier

Release 7.27 and earlier

New Features

Bugs Fixed

Third-Party Components

Known Problems

New Features

Release 7.27.0.c60.0
  • The Chinese Script Converter must be licensed distinctly from the rest of RBL. Old licenses won’t work for it anymore. (ETROG-2916)

  • Lemmatization is supported for Persian. (ETROG-2924)

  • A dictionary-based disambiguator is available for Hebrew and is now the default. To run disambiguation in TensorFlow, set the option disambiguatorType to DisambiguatorType.DNN. (ETROG-2928)

Release 7.26.5.c59.3
  • The tokenizer recognizes "百度" as a single token when alternativeTokenization is enabled. (ETROG-2909)

Release 7.26.3.c59.3
  • The tokenizer emits the normalized surface form as the lemma when alternativeTokenization is enabled. (ETROG-2892)

Release 7.26.0.c59.3
  • Analyzing German tokens with default ignorable code points, including U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER, produces the same analyses as if the tokens did not include those characters. (ETROG-2824)

  • Improved the lemma accuracy of the Spanish disambiguator. (ETROG-2856)

  • Improved disambiguation of English proper nouns. (ETROG-2867)

  • The North Korean (qkp) and South Korean (qkr) dialects are both treated as Korean (kor). (ETROG-2878)
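
The German change for default ignorable code points (ETROG-2824) above amounts to treating a token as if those characters were absent. The sketch below is a hypothetical stand-alone helper, not RBL-JE code, and it handles only the three code points named in that item; the full Unicode Default_Ignorable_Code_Point set is larger.

```java
class IgnorableStripper {
    // Drops U+00AD SOFT HYPHEN, U+200C ZERO WIDTH NON-JOINER, and U+200D ZERO WIDTH JOINER.
    public static String strip(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        token.codePoints()
                .filter(cp -> cp != 0x00AD && cp != 0x200C && cp != 0x200D)
                .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String withSoftHyphen = "Haus\u00ADt\u00FCr"; // "Haustür" with an embedded soft hyphen
        System.out.println(strip(withSoftHyphen).equals("Haust\u00FCr")); // true
    }
}
```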

Release 7.25.0.c59.3
  • Additional lemma dictionary for each of the two Norwegian languages. (ETROG-2797)

  • Added support for Lucene/Solr 7.3.1 through 7.4.0. (ETROG-2862)

Release 7.24.6.c59.2
  • Added support for Lucene/Solr 7.2.1 through 7.3.1. (ETROG-2842)

Release 7.24.3.c59.2
  • Added support for TensorFlow on more CPUs.

Release 7.24.0.c59.2
  • Added support for tokenizing and lemmatizing Catalan, Estonian, Serbian, and Slovak. (ETROG-2752, ETROG-2774)

Release 7.23.1.c59.0
  • Improved the accuracy of the Hebrew disambiguator. (ETROG-2718)

  • Running on Java 9 is now supported. (ETROG-2722)

Release 7.23.0.c59.0
  • Added support for Lucene/Solr 7.0.0 through 7.1.0. (ETROG-2706)

  • POS-tagging and disambiguation are supported for Hebrew. (ETROG-2707, ETROG-2717)

Release 7.22.2.c59.0
  • Added support for Lucene/Solr 7.0.0 through 7.2.1. (ETROG-2751)

Release 7.22.0.c59.0
  • German words that are completely uppercase are now guessed as acronyms. (ETROG-2684)

Release 7.21.2.c59.0
  • Modified internal dependency structure. (ETROG-2677)

Release 7.21.1.c59.0
  • Updated the compatibility number to 59.0. (ETROG-2609)

Release 7.21.0.c58.3
  • Added the ArabicMorphoAnalysis class to allow an Annotated Data Model application to get more information for Arabic, Persian, and Urdu text than the MorphoAnalysis class would provide. (ETROG-2623)

  • Improved speed and memory footprint for English and Spanish disambiguation. (ETROG-2607, ETROG-2618, ETROG-2635)

  • Added the alternativeEnglishDisambiguation and alternativeSpanishDisambiguation options to specify the use of the old disambiguator in English and Spanish. The new disambiguator, introduced in version 7.18.0.c58.3, and enhanced in the current release, is more accurate, but slower. (ETROG-2626)

  • Added the guessHebrewPrefixes option to control whether to split possible prefixes off unknown Hebrew words. (ETROG-2642)

  • Normalized U+05F3 HEBREW PUNCTUATION GERESH and U+05F4 HEBREW PUNCTUATION GERSHAYIM to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew. (ETROG-2647)

  • Filter out punctuation from Lucene/Solr when query is set. (ETROG-2648)

  • Added support for Lucene/Solr 6.6. (ETROG-2656)

Release 7.20.0.c58.3
  • Added tokenization and POS-tagging for at-mentions and hashtags in all languages. (ETROG-2571)

  • Added the options atMentions, emailAddresses, emoticons, hashtags, and urls to enable tokenization and POS-tagging of @mentions, email addresses, emoticons, hashtags, and URLs. They are all disabled by default. (ETROG-2583)

Release 7.19.0.c58.3
  • Added tokenization and POS-tagging for URLs and email addresses in all languages. (ETROG-2557)

Release 7.18.0.c58.3
  • Implemented the many-to-one normalizer. (ETROG-1961)

  • Deprecated many classes and methods that are for internal use only. (ETROG-2065)

  • Added BaseLinguisticsFactory#addUserCscDictionary. (ETROG-2098)

  • Removed obsolete big-endian models and dictionaries. (ETROG-2214)

  • Overhauled RBLCmd. ANNOTATE is the default command. -showTokenDetails, -showRawResults, and -verboseResults are removed. -inputJson interprets the input as an ADM. -outputJson is a boolean option. (ETROG-1392, ETROG-2343)

  • Decomposed compound verbs in Japanese when using alternativeTokenization. (ETROG-2350)

  • Introduced more advanced disambiguation for English and Spanish. (ETROG-2367, ETROG-2370, ETROG-2372, ETROG-2371, ETROG-2467)

  • Improved decompounding accuracy in Dutch. (ETROG-2408)

  • Added tokenization, lemmatization, and POS-tagging for emoticons and emoji in all languages. (ETROG-2474, ETROG-2512, ETROG-2516, ETROG-2520, ETROG-2522, ETROG-2538)

  • Supplemented analysis dictionaries for English and Spanish. (ETROG-2481, ETROG-2532, ETROG-2535)

  • Added support for Lucene/Solr 6.3. (ETROG-2501)

  • Introduced the ability to specify a user-defined reading dictionary in Lucene/Solr (userDefinedReadingDictionaryPath). (ETROG-2527)

Release 7.17.2.c58.3
  • Add support for Lucene/Solr 6.2. (ESPI-77)

Release 7.17.1.c58.3
  • Version of OSGi (internal use only) upgraded. (ETROG-2441)

Release 7.17.0.c58.2
  • The FST tokenizer now supports Romanian. (ETROG-2255)

  • All Chinese parts of speech are supported in CLA user dictionaries. (ETROG-2262)

  • Modest speed improvement in the disambiguation algorithm used to process the results of the Japanese (statistical), Korean, and Arabic tokenizers. (ETROG-2288)

  • To become file-system-agnostic, the use of Path in the API is now supported. (ETROG-2310, ETROG-2381)

  • Added the -outputJson option to RBLCmd to write the ADM as JSON to a file of choice. (ETROG-2332)

  • Reverted the fix for ETROG-2304, introduced in 7.16.0.c58.2. (ETROG-2376)

Release 7.16.0.c58.2
  • The Chinese script converter is an entitlement with a standard Chinese license. (ETROG-1605)

  • Arabic reh is normalized as a decimal separator in numeric contexts. (ETROG-1650)

  • Provide disambiguation of Dutch compounds. (ETROG-1736)

  • A custom reading dictionary can be specified on the RBLCmd command line. (ETROG-1938)

  • Alternative tokenization options are included in BaseLinguisticsOption. (ETROG-1946)

  • Improve speed by caching Arabic analyses. (ETROG-1992)

  • Added support for alternative Chinese segmentation. (ETROG-2034)

  • Return Hebrew sentence boundaries. (ETROG-2036)

  • Added support for POS tag mappings for alternative Japanese and Chinese segmentation. (ETROG-2152)

  • Changed CompoundDictionary to provide its components in an order that reflects the contents of the lemma it returns. (ETROG-2154)

  • AnalyzerFactory#addUserAnalysisDictionary now throws an informative exception when either the root or dictionary directory is invalid. (ETROG-2166)

  • Augmented RBLCmd with the ability to return the RBL-JE version number. (ETROG-2168)

  • Improve handling of hiragana tokens homophonous to verbs in the alternative Japanese tokenizer (JLA). (ETROG-2188)

  • Improve handling of POS-ambiguous verb stems in the alternative Japanese tokenizer (JLA). (ETROG-2189)

  • The RBLCmd help command now sorts its options alphabetically. (ETROG-2195)

  • Han readings now returned for all Katakana tokens. (ETROG-2208)

  • In the Russian FST tokenizer, initials are tokenized and given the +Init morpho-tag. (ETROG-2209)

  • Memory requirements of the FST tokenizer were reduced. (ETROG-2200, ETROG-2226)

  • Reduce the memory allocated for tokens by the FST tokenizer. (ETROG-2235)

  • Terminated support for Lucene/Solr 4.1-4.2. Added support for Lucene/Solr 6.0-6.1. (ETROG-2016, ETROG-2241, ETROG-2299)

Release 7.15.0.c57.2

Note: 7.15.0 was forked directly from 7.14.0 and thus does not have the changes in 7.14.1+.

  • Introduced the ability to specify an alternative FST tokenizer. See TokenizerFactory.addCustomTokenizationFst. (ETROG-2231)

Release 7.14.0.c57.2
  • The specification of options to RBLCmd was refactored. (ETROG-1503)

  • Added UPT-16 support for Persian and Urdu. (ETROG-1830)

  • Changed UPT-16 mappings for Czech and Hungarian numbers. (ETROG-1841)

  • Removed incorrect analyses for Polish adjectives and participles ending in m/mi. (ETROG-1916)

  • Removed archaic Polish analyses containing "być". (ETROG-1917)

  • Added raw analyses for English contractions. (ETROG-1944)

  • The command line tool RBLCmd supports Hebrew tokenization. (ETROG-1973)

  • Added support for Finnish stemming. (ETROG-2012)

  • Removed the spurious generation of an accusative case analysis for some Polish nouns. (ETROG-2020)

  • The Hebrew tokenizer overzealously guessed that periods were part of an abbreviation. (ETROG-2024)

  • Refactored the position metadata for Lucene tokens of compound components. (ETROG-2042)

  • Lucene tokens for components of a contraction are identified with type "CONT". To invoke this functionality, set FilterOption.identifyContractionComponents to true. (ETROG-2044)

  • AnalysesAttributes formatted as JSON in Elasticsearch. (ETROG-2057)

Release 7.13.0.c56.6
  • Added API support for Lucene & Solr 5.0-5.3. (ETROG-1647)

  • Added support for Persian and Urdu. (ETROG-1636, ETROG-1667)

  • The 'nor' (Norwegian) language code is accepted. (ETROG-1690)

  • Exposed support for using the Rosette Annotated Data Model (ADM) to perform RBL-JE operations. (ETROG-1713)

  • The Arabic analysis candidate generation code now uses the same algorithm that the Arabic Language Processor in the native (C++) version of Rosette Base Linguistics does. (ETROG-1722)

  • Provided an alternative Japanese analyzer. This provides parity with the Japanese analyzer in Basis Technology's C++ API (RLP). It offers improved accuracy with query strings and names and provides greater user control of the analysis. (ETROG-1727)

  • For English, Portuguese, and German text, added ADM support for splitting contractions and analyzing the constituents. (ETROG-1769)

  • Provided support for returning the set of 16 universal part-of-speech (POS) tags rather than the set of 12 that were introduced in version 7.12.0. (ETROG-1771)

  • The RBLCmd tool now lists the BaseLinguisticsOption options. To use these options you must set analyzerType=none, lang, and BaseLinguisticsOption.language. (ETROG-1862)

Release 7.12.1.c56.6
  • Version 7.12.1.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis JVM SDK (e.g. RBL-JE, RLI-JE, REX-JE) in a single application, then choose versions that have the same compatibility number. (ETROG-1700)
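
A check for matching compatibility numbers could look like the sketch below. It assumes that every version string has the form major.minor.patch.cNN.N, as the RBL-JE versions in these notes do; the class and method names are hypothetical, and other Basis SDKs may format their versions differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompatibilityNumber {
    // Matches versions such as "7.12.1.c56.6" and captures the part after ".c".
    private static final Pattern VERSION =
            Pattern.compile("^\\d+\\.\\d+\\.\\d+\\.c(\\d+\\.\\d+)$");

    // Returns the compatibility number, e.g. "56.6" for "7.12.1.c56.6".
    public static String of(String version) {
        Matcher m = VERSION.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized version: " + version);
        }
        return m.group(1);
    }

    // Two SDK versions are treated as compatible when their compatibility numbers are equal.
    public static boolean compatible(String a, String b) {
        return of(a).equals(of(b));
    }

    public static void main(String[] args) {
        System.out.println(of("7.12.1.c56.6"));                         // 56.6
        System.out.println(compatible("7.35.0.c62.2", "7.34.2.c62.2")); // true
        System.out.println(compatible("7.33.0.c62.2", "7.32.0.c62.1")); // false
    }
}
```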

Release 7.12.0
  • Moved the Tokenize and Analyze samples into samples/tokenize-analyze and created a single Ant build script to compile and run both samples. (ETROG-1264)

  • Provided support for returning universal part-of-speech (POS) tags rather than the language-specific POS tags we already return. The universal tags (UPT) are coarser than the language-specific tags, but enable tracking and comparison across languages. (ETROG-1472)

  • Added support for returning a disambiguated analysis for each token in Japanese text. For performance, this feature is turned off by default. (ETROG-1324)

  • Added support for returning morphological tags, where available, and placed an example illustrating the procedure for obtaining morphological tags in samples/morpho-tags. (ETROG-1485)

  • Removed the small number of dubious acronym expansions from the lemmatization of English, French, Italian, German, Spanish, and Portuguese input. (ETROG-1547)

  • Improved the German lemma parser, which now returns the same lemma for German nouns that differ only in gender. (ETROG-1548)

  • Added API support for Lucene & Solr 4.10. (ETROG-1571)

Release 2.4.0
  • Enhanced support for Korean linguistic analysis, and integrated a guesser for generating morphemes, morpheme tags, compound components, and parts of speech. (ETROG-1486, ETROG-1512, ETROG-1528)

  • Added support for Korean user lemma dictionaries. (ETROG-1518)

  • Added stop words to the Japanese analysis dictionary. (ETROG-1525)

Release 2.3.0
  • Added the Chinese Script Converter, which can convert tokens in Traditional Chinese text to Simplified Chinese and vice versa. (ETROG-1462)

  • Terminated support for Lucene/Solr 3.6. (ETROG-1298)

  • Implemented support for Chinese part-of-speech (POS) tags and readings. (ETROG-1280)

  • Added support for normalization of Chinese and Japanese numbers. (ETROG-1310)

  • Implemented generation of Korean part-of-speech (POS) tags. (ETROG-1357)

Release 2.2.2
  • Added a tool for building user dictionaries. (ETROG-210)

  • For those cases in which you want to use your own whitespace tokenizer and you are processing text that requires segmentation (such as Chinese, Japanese, or Thai), we have added support for a base linguistics segmentation token filter to be used after a whitespace tokenizer and before other filters, such as a base linguistics token filter. See the Javadoc for the RBL-JE API for Lucene 4.3-4.7. (ETROG-1240)

  • For Japanese, modified the base linguistics token filter to exclude lemmas for auxiliary verbs, particles, and adverbs from the token stream. (ETROG-1217)

  • Added support for using AnalysesAttribute to get the analyses and disambiguated analysis for each token in a token stream. (ETROG-1279)

  • Added SLF4J support for logging RBL-JE applications. (ETROG-1318)

  • Added support for turning case sensitivity on/off when analyzing text. (ETROG-1365)

  • Deprecated void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path) in favor of void com.basistech.rosette.bl.AnalyzerFactory#addUserDefinedDictionary(LanguageCode language, String path, EnumMap<AnalyzerOption, String> options) where options is used to set AnalyzerOption.caseSensitive to "true" or "false".

  • Unused analyzer parameter removed from the BaseLinguisticsSegmentationTokenFilter constructor. (ETROG-1316)

  • Updated the Japanese normalization dictionary. (ETROG-1229)

  • Added API support and samples for Lucene 4.9. (ETROG-1446)

Release 2.2.0
  • Added a Lucene Analyzer that combines the RBL-JE Tokenizer and TokenFilter, along with the LowerCaseFilter, CJKWidth Filter, and optional support for the StopFilter: com.basistech.rosette.lucene.BaseLinguisticsAnalyzer. Added a Lucene 4.3-4.7 sample application that illustrates its use. (ETROG-1138, ETROG-1172)

  • Improved support for returning Japanese Hiragana readings. The API for adding readings to the token stream has moved from TokenizerFactory#SetOption to BaseLinguisticsTokenFilter#setAddReadings. You can also add ("addReadings", "true") to the map of options you use to instantiate the BaseLinguisticsAnalyzer. (ETROG-1054)

Release 2.1.0
  • Added support for Japanese Hiragana readings.

  • Factored in support for Lucene 3.6, 4.1-4.2, and 4.3.

Release 2.0.0

For this release, this product has been refactored and renamed to Rosette Base Linguistics Java Edition. This release concentrates on the core API instead of implementations for different versions of Lucene and Solr. This release returns part-of-speech tags for a core set of European languages and Japanese.

  • In place of a LemmatizerFactory, RBL-JE now provides an AnalyzerFactory. Use the AnalyzerFactory to generate a language-specific Analyzer that you can use to generate Analysis objects for each token produced by the Tokenizer.

Release 1.11.0
  • For licensing and business reasons, support for Bulgarian, Catalan, Estonian, Croatian, Indonesian, Latvian, Malay, Slovak, Slovenian, Serbian, Albanian, and Ukrainian has been removed from the RSE package. (ETROG-921)

Release 1.10.0
  • Added support for tokenizing and lemmatizing Arabic, Czech, Hungarian, Korean, and Turkish. (ETROG-876)

Release 1.9.0
  • Added support for segmenting (tokenizing) Thai. (ETROG-448)

  • Added a tokenizer option (turned off by default) for returning Hebrew roots. (ETROG-788)

  • Changed required Java platform from 1.5 to 1.6. (ETROG-765)

  • Added support for using RSE with LucidWorks Enterprise 1.7, which supports a pre-release version of Lucene and Solr 4.0.

Release 1.8.0
  • Added support for tokenizing and lemmatizing Albanian, Bulgarian, Catalan, Croatian, Estonian, Greek, Hebrew, Indonesian, Latvian, Malay, Polish, Serbian, Slovakian, Slovenian, Russian, and Ukrainian. (ETROG-656, 658, 668, 677)

  • Added a command line driver for running RSE. For usage details, see the Javadoc for com.basistech.rosette.bl.RBLCmd. (ETROG-603)

  • Added support for tokenizing and lemmatizing Norwegian Nynorsk text. (ETROG-637)

  • Consolidated support for Lucene 2.2, Lucene 2.4, Lucene 2.9, Lucene 3.1, Solr 1.3, Solr 1.4, and Solr 3.1 in a single SDK package with an associated documentation package.

  • Deprecated support in the com.basistech.rosette.breaks package (GenericTokenizer and TokenizerOption) for returning EOS (end-of-sentence) tokens. includeEOS is off by default and should not be turned on; it interferes with Lucene searches. (ETROG-706)

  • Deprecated Lucene 2.9 LemmaFilterFactory.supportedLanguages(). Use getSupportedLanguages(). (ETROG-726)

Release 1.7.1
  • Revised the SegmentationTokenizer to provide more consistent handling of punctuation during the tokenization of Chinese, Japanese, and Thai text. (ETROG-640)

Release 1.7.0
  • Added support for Lucene 3.0.

  • Improved support for Japanese and Chinese tokenization.

  • Added the Japanese lemmatization dictionary and support of Japanese lemma user dictionaries. The Japanese lemmatization dictionary also provides orthographic normalization in the case of Katakana spelling variants and input text with archaic Kanji.

  • Added the production of normalized numbers to the lemmatization process.

  • Added support for Chinese lemma user dictionaries. Apart from numbers, which are already handled by the lemma guesser, lemmas do not ordinarily apply to Chinese, but a lemma user dictionary may be used for orthographic normalization.

Release 1.6.0
  • Added support for Danish and Norwegian (Bokmål). Improved support for Chinese token segmentation and Romanian.

  • To enhance clarity and consistency, and to avoid duplication of package names in class names, made a number of API changes that are not backwards compatible.

    • Renamed some factory classes: (ETROG-436)

      Old

      New

      com.basistech.lucene.LuceneTokenizerFactory

      com.basistech.lucene.TokenizerFactory

      com.basistech.lucene.BaseLinguisticsTokenFilterFactory

      com.basistech.lucene.LemmaFilterFactory

      com.basistech.solr.BaseLinguisticsTokenizerFactory

      com.basistech.solr.TokenizerFactory

      com.basistech.solr.BaseLinguisticsTokenFilterFactory

      com.basistech.solr.LemmaFilterFactory

      All these factory classes include a create() method for instantiating the Tokenizer or LemmaFilter. The getTokenFilter(), getLuceneTokenizer(), and getLemmatizer() methods have been removed.

    • Promoted classes introduced in Release 1.5.beta.1 for setting tokenizer and lemmatizer options from inner Enums to top-level Enums: com.basistech.rosette.breaks.TokenizerOption and com.basistech.rosette.bl.LemmatizerOption. (ETROG-434)

    • Removed the TokenizerFactory, LemmaFilterFactory and LemmatizerFactory option-specific methods for setting options that predate the introduction of setOption().

    • The com.basistech.breaks.BreakerFactory methods for creating breakers have been renamed.

      Old

      New

      newScriptRegionBreaker()

      createScriptRegionBreaker()

      newBufferSentenceBreaker()

      createBufferSentenceBreaker()

      newBufferWordBreaker()

      createBufferWordBreaker()

Release 1.5-beta-1
  • Added support for Chinese, and limited support for Japanese. For these languages, RSE adds statistically trained models/dictionaries to enable the tokenization of non-whitespace-delimited text. Support for user dictionaries has also been expanded to include token dictionaries for Chinese, Japanese, and Thai.

  • Enhanced support for Dutch, Italian, and Portuguese.

  • Replaced Lucene 2.9 and Solr 1.4 packages with Lucene 3.0 package.

  • Revised the API for defining tokenizer and lemmatizer options.

  • Reorganized the documentation to reflect standard RSE usage patterns.

Release 1.4.2
  • To simplify usage in standard search applications, revised the RSE Tokenizer so that it does not put sentence-boundary tokens in the token stream unless you instruct the RSE TokenizerFactory to include them. (ETROG-312)

Release 1.4.1
  • Addressed a compatibility issue running RSE with RLP. To run RSE 1.4 and RLP 7.1 in the same process, follow the instructions in RSE Technical Note: Using RSE 1.4 and RLP 7.1 in a Single Solr Instance.

Compiling a Swedish User Dictionary. As described in the RSE Application Developer's Guide, you must use RLP to create a user dictionary. "Chapter 12. User-Defined Data" in the RLP Application Developer's Guide provides instructions on creating the source file for a user-defined dictionary and compiling the dictionary. The current release of RLP (RLP 7.1.0) does not include support for creating a Swedish user dictionary. To create a Swedish dictionary, you must add a file that we provide in the extras directory to the corresponding location in your RLP installation: rlp/bl1/dicts/sv/tags.txt.

When you create your source file, you can use [+DUMMY] as the POS tag for each entry.

The syntax for compiling a Swedish user dictionary from rlp/bl1/dicts/tools is

build_user_dict.sh sv input output

Release 1.4.0
  • Removed the Rosette Language Identifier (RLI) 100% Java implementation, which is now a separate product.

  • Provided separate SDK packages with support for Lucene 2.2, Lucene 2.4, and Lucene 3.0. (ETROG-198)

  • Added TokenizerFactory, which provides a language-specific Tokenizer for parsing input text. In addition to using the Sentence Breaker and Word Breaker, the Tokenizer normalizes the tokens (Unicode NFC normalization and lowercasing). (ETROG-185)

  • Added support for Swedish, including tokenization, lemmatization, and decompounding. (ETROG-201)

  • Added preliminary, limited support for Dutch, Danish, Norwegian, Italian, Portuguese, and Romanian.
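
The token normalization that the TokenizerFactory item above describes (Unicode NFC normalization and lowercasing) can be approximated with the JDK alone. This is a minimal sketch, not the product's implementation: the class name TokenNormalizationSketch, the order of the two steps, and the use of Locale.ROOT for lowercasing are all assumptions.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustrative only; RSE's own Tokenizer performs its normalization internally.
class TokenNormalizationSketch {

    // Applies NFC composition first, then locale-independent lowercasing.
    static String normalize(String token) {
        String nfc = Normalizer.normalize(token, Normalizer.Form.NFC);
        return nfc.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent (two chars before NFC).
        String decomposed = "Cafe" + (char) 0x0301;
        String normalized = normalize(decomposed);
        System.out.println(decomposed.length() + " chars before, "
                + normalized.length() + " chars after");
    }
}
```

Normalizing to NFC before indexing means that composed and decomposed spellings of the same word produce the same token, which is generally what a search application wants.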

Release 1.3-beta
  • Expanded support for German decompounding.

  • Added support for generating a separate lemma for each space-delimited element in lemmas that contain whitespace.

  • This distribution provides support for Lucene 2.2.

Release 1.2.0
  • Revised Java package names to avoid potential collisions with the RLP JNI-supported Java API.

Release 1.1.0
  • Upgraded Token Filter Factory support from Lucene 2.2 to Lucene 2.4.

  • Added the Rosette Language Identifier (RLI), Sentence Breaker, and Word Breaker.

Release 1.0.0
  • Introduced support for the creation of Lucene 2.2 Base Linguistics token filters for English, French, German, and Spanish text.

Bugs Fixed

Bugs fixed in 7.27.2.c60.0

Bug #

Description

ETROG-2981

The Persian lemmatizer did not add lemmas to the first analyses of many tokens, especially verbs.

ETROG-2983

The lemmas of hyphenated Russian compound words now have both pieces lemmatized, not just the final piece. For example, “человека-волка” is lemmatized to “человек-волк”, whereas it was lemmatized to “человека-волк” in previous versions.

ETROG-2984

After some sequences of 4096 characters, containing mostly white space and at most one token, if there is no token or the token contains the last character of the sequence, any following tokens have incorrect original offsets.

Bugs fixed in 7.27.1.c60.0

Bug #

Description

ETROG-2919

Capitalized words in English are less likely to automatically get the part of speech PROP.

ETROG-2927

The English word "people" and its derivatives have the lemma candidate "person". They are no longer analyzed as plural nouns with lemmas equal to their surface forms.

ETROG-2969

Ambiguous English words like “second” and “lower” are less likely to be disambiguated as verbs when they should be ordinal numbers and comparative adjectives.

Bugs fixed in 7.26.6.c60.0

Bug #

Description

ETROG-2958

Analyzing Dutch writes a cache file to disk, which fails if the file is not writable.

Bugs fixed in 7.26.5.c59.3

Bug #

Description

ETROG-2904, ETROG-2910, ETROG-2911

Improved the lemma and part of speech accuracy of the Spanish disambiguator.

ETROG-2921

The parts of speech of the German acronyms “MAN” and “MIT” fall back to the parts of speech of the unrelated words “man” and “mit”. They should be NOUN.

Bugs fixed in 7.26.4.c60.0

Bug #

Description

ETROG-2904, ETROG-2910, ETROG-2911

Improved the lemma and part of speech accuracy of the Spanish disambiguator.

ETROG-2914

RBL-JE depended on Guava 18.0.0, which has a security vulnerability (CVE-2018-10237). Now it depends on Guava 26.0-jre.

Bugs fixed in 7.26.3.c59.3

Bug #

Description

ETROG-2835

Dutch compound nouns ending with “-ronde” can be analyzed as adjectives.

ETROG-2864

The disambiguator for Dutch non-compound words only considers parts of speech. If a token has multiple analyses with the same part of speech, the disambiguator picks one arbitrarily.

ETROG-2898

U+180E MONGOLIAN VOWEL SEPARATOR is not treated as a token separator in Hebrew.

Bugs fixed in 7.26.2.c59.3

Bug #

Description

ETROG-2889

Single-letter Spanish conjunctions sometimes get the POS tag ITEM instead of CONJ.

Bugs fixed in 7.26.1.c59.3

Bug #

Description

ETROG-2888

Norwegian lemmas for proper nouns are often converted to lowercase.

Bugs fixed in 7.26.0.c59.3

Bug #

Description

ETROG-2876

The application developer’s guide’s feature set table in section 1.3 erroneously claims that sentence boundary detection is not supported in Hebrew.

ETROG-2877

The Japanese character “々” is unrecognized and splits tokens.

Bugs fixed in 7.25.0.c59.3

Bug #

Description

ETROG-2791

In Catalan, where an apostrophe marks a token boundary in some contexts, the token boundary is omitted.

ETROG-2857

Punctuation was not always separated from preceding characters where it should have been when fstTokenize is enabled.

ETROG-2860

The Hebrew part of speech tag wPrefix is not converted to UPT-16 when universalPosTags is enabled.

ETROG-2861

The documentation lists AUXV as the part of speech tag for Japanese auxiliary verbs, when it is actually AUXVB.

Bugs fixed in 7.24.6.c59.2

Bug #

Description

ETROG-2829

Some components of German compound words are incorrect when the surface form of the component could be either a noun or a verb.

ETROG-2831

Hebrew part of speech tags are not converted to UPT-16 when universalPosTags is enabled.

ETROG-2834

JLA tokenizer sometimes truncates katakana tokens after non-katakana tokens.

ETROG-2844

Email addresses and URLs may contain control or whitespace characters.

ETROG-2847

Hebrew tokens can contain control characters or nothing but default ignorable characters.

ETROG-2848

Chinese, Japanese, and Thai tokens may contain control characters.

Bugs fixed in 7.24.5.c59.2

Bug #

Description

ETROG-2781

Incorrect analysis may be selected for Dutch non-compound words.

ETROG-2804

Sentence-final @mentions, email addresses, emoji, emoticons, hashtags, and URLs are not marked as sentence-final.

ETROG-2820

Tokenizing German with fstTokenize enabled can drop tokens.

ETROG-2823

Hebrew prefixes are not exposed correctly.

Bugs fixed in 7.24.4.c59.2

Bug #

Description

COMN-234

The Woodstox dependency is not shaded.

Bugs fixed in 7.24.2.c59.2

Bug #

Description

ETROG-2814

Running TensorFlow leaks memory.

Bugs fixed in 7.24.1.c59.2

Bug #

Description

ETROG-2808

The Hebrew disambiguation model is not cached, potentially leading to high memory pressure.

Bugs fixed in 7.24.0.c59.2

Bug #

Description

ETROG-2782

Tokens can be empty or consist of nothing but control characters and white space.

Bugs fixed in 7.23.3.c59.0

Bug #

Description

ETROG-2744

Tokenize-analyze sample emits unknown POS tags for Hebrew

ETROG-2759

Ending a Lucene token stream in Chinese or Japanese with alternativeTokenization enabled throws an exception if none of the token stream's tokens have been consumed.

ETROG-2760

In languages like French and Italian, an apostrophe was parsed as its own token when directly followed by a digit.

ETROG-2767

Overlapping tokens, which are valid, are discarded. This particularly affects hyphenated tokens in French when fstTokenize is enabled.

ETROG-2776

German all-caps words are assumed to be acronyms without considering the possibility that they are simply emphasized.

ETROG-2777

The application developer’s guide claims support for Java 9.

ETROG-2778

The application developer’s guide references btcommon-api-37.1.3.jar instead of btcommon-api-36.1.3.jar.

ETROG-2779

The German analysis cache returns analyses without taking the full context into account, leading to unpredictable analyses for unknown words.

ETROG-2780

English all-caps words are assumed to be proper nouns, though all-caps may simply denote emphasis.

ETROG-2790

The Application Developer's Guide does not mention support for Hebrew part of speech tagging in the Feature Set table.

Bugs fixed in 7.23.2.c59.0

Bug #

Description

ESPI-110

Disambiguating Hebrew tokens throws an IllegalArgumentException when a Java security manager is enabled.

Bugs fixed in 7.23.0.c59.0

Bug #

Description

ETROG-2710

Some tokens consisting of numbers and Latin letters, such as serial codes, are decompounded into multiple morphemes in Korean.

ETROG-2716

“интернет”, the Russian word for "internet", is not in the lexicon although “Интернет” is there.

Bugs fixed in 7.22.2.c59.0

Bug #

Description

ETROG-2738, ETROG-2754

German words are assigned parts of speech without taking capitalization into account, leading the disambiguator to often pick the wrong analysis.

ETROG-2739

The annotated lemmas for German definite articles were inconsistent.

ETROG-2740

Some symbols and punctuation are not tokenized as separate tokens in German when fstTokenize is enabled.

ETROG-2745

In languages like French and Italian where an apostrophe, in some contexts, marks a token boundary, the token boundary is omitted if the following token contains a digit.

Bugs fixed in 7.22.1.c59.0

Bug #

Description

ETROG-2687

BaseLinguisticsFactory#addUserSegDictionary, TokenizerFactory#addUserDefinedDictionary, TokenizerFactory#addCustomTokenizationFst, TokenizerFactory#create, and the constructor of BaseLinguisticsSegmentationTokenFilter do not convert language codes to their canonical forms, such as zhs (simplified Chinese) to zho (Chinese).

ETROG-2703

Many German words ending with "teuer" get decompounded incorrectly. Several German compound words don't get decompounded at all.

Bugs fixed in 7.22.0.c59.0

Bug #

Description

ETROG-2383

Korean tokens sometimes include trailing ASCII periods.

ETROG-2685

URL tokens in Chinese, Japanese, and Thai run on into the following tokens, without inserting a token boundary.

ETROG-2686

"Dep." is not recognized as a Portuguese abbreviation, causing a sentence break.

Bugs fixed in 7.21.3.c59.0

Bug #

Description

ETROG-2680

A single quote in English is sometimes incorrectly analyzed as possessive when it follows whitespace.

Bugs fixed in 7.21.0.c58.3

Bug #

Description

ETROG-2476

Lemma for a German compound word may be wrong if the surface form has characters that were normalized in the compound components.

ETROG-2589

The non-English FST tokenizers do not properly split text into tokens in which text immediately follows colons or commas, such as ":,test".

ETROG-2622

Passing a long script file (over 100,000 rows) to RBLCmd with output paths specified in the third column can cause a “Too many open files” error.

ETROG-2661

Tokenizing a string of Katakana throws an ArrayIndexOutOfBoundsException if an unknown word immediately follows a word from a user dictionary and both alternativeTokenization and favorUserDictionary are true and the language is Japanese.

Bugs fixed in 7.20.4.c58.3

Bug #

Description

ETROG-2638

Number tokens that are hundreds of characters long can cause StackOverflowErrors in languages that use spaces as thousands separators.

Bugs fixed in 7.20.3.c58.3

Bug #

Description

ETROG-2632

A tiny number of Hebrew surface forms (e.g. "ג`ינג`ר") cause NullPointerExceptions.

Bugs fixed in 7.20.2.c58.3

Bug #

Description

ETROG-2615

Hebrew lemmas are not exposed in Lucene/Solr.

Bugs fixed in 7.20.1.c58.3

Bug #

Description

ETROG-2584

Emoticon detection has some false positives.

Bugs fixed in 7.20.0.c58.3

Bug #

Description

ETROG-2338

Annotating an AnnotatedText which has an empty list of sentences throws an exception.

ETROG-2574

The English FST tokenizer does not properly split text into tokens in which text immediately follows colons or commas, such as ":,test".

Bugs fixed in 7.19.0.c58.3

Bug #

Description

ETROG-710

The Hebrew tokenizer throws an exception or produces incorrect results for tokens that begin with "prefix=".

ETROG-995

The Hebrew tokenizer could throw an exception or produce incorrect results for some inputs involving backslashes or multiple tokens with identical surface forms.

ETROG-2531

Creating Chinese and Japanese tokenizers with alternativeTokenization using a TokenizerFactory is not thread-safe.

ETROG-2552

Analyzing a zero-length English token throws a StringIndexOutOfBoundsException.

ETROG-2560

BaseLinguisticsTokenFilter can overwrite analyses generated by BaseLinguisticsTokenizer, causing problems for emoticon detection.

ETROG-2563

The Hebrew tokenizer does not split on non-ASCII white space characters.

ETROG-2565

Requesting UPT-16 POS tags for a language for which POS tags are not supported throws an exception.

Bugs fixed in 7.18.0.c58.3

Bug #

Description

ETROG-1689

The lemmas of Russian compound words contain braces.

ETROG-2360

RBLCmd reports 0 bytes/char when reading from standard input.

ETROG-2409

customPosTagsUri is parsed as a file name instead of a URI.

ETROG-2243

All acronyms are tagged as proper nouns in English.

ETROG-2536, ETROG-2546

Setting alternativeTokenization to true, consistentLatinSegmentation to false, and favorUserDictionary to true and specifying a user dictionary can cause an infinite loop when processing Japanese.

ETROG-2543

HebrewTokenizer#setReader does not reset enough state, so the tokenizer cannot be reused.

ETROG-2547

HebrewTokenizer throws an ArrayIndexOutOfBoundsException for inputs with unusually long sentences.

ETROG-2548

SLF4J binding jars were shipped in lib/. To avoid classpath conflicts, they have been moved to tools/lib/.

Bugs fixed in 7.17.0.c58.2

Bug #

Description

ETROG-2252

Word breaking with fstTokenize can fail for initials in Czech.

Bugs fixed in 7.16.1.c58.2

Bug #

Description

ETROG-2356

Token can be missed by GenericTokenizer. (This augments the fix for ETROG-2292 made in 7.16.0.c58.2.)

Bugs fixed in 7.16.0.c58.2

Bug #

Description

ETROG-676

ZWNBSP was not treated as whitespace. With this fix, the word breaker treats 0x2060 and 0xFEFF as whitespace.

ETROG-1100

Incorrect tokenization of '0901d97c80103109' in Hebrew.

ETROG-1638

Fixed English business and place name acronym segmentation regressions.

ETROG-1640

SingleLanguageAnnotator does not provide a proper analysis if the language is specified with the legacy code zht or zhs.

ETROG-1655

ADMs contain multiple tokens (instead of multiple analyses) for the same Hebrew word.

ETROG-1766

Processing Spanish with sequences of the dash character can exhaust memory.

ETROG-1996

Period erroneously attached to terminal token

ETROG-2001

BaseLinguisticsFactory#createSingleLanguageAnnotator throws a NullPointerException.

ETROG-2038

+int_noun, +int_adj, etc. can inadvertently be returned as a POS tag.

ETROG-2088

POS tag and Contraction annotators failed for English uppercase.

ETROG-2112

ConcurrentModificationException crash in alternative Japanese tokenization (JLA).

ETROG-2114

The double exclamation mark (\u203C, ‼) was not treated as a separate token and could be agglutinated to an adjacent word.

ETROG-2125

IndexOutOfBoundsException can be thrown when analyzing Korean.

ETROG-2136

Some English words ending in "ss" are erroneously given lemmas ending in 's'

ETROG-2172

Korean lemmas for tokens with mixed numerals and letters have incomplete text.

ETROG-2191

The return value of BaseLinguisticsFactory.featuresForLanguage(LanguageCode.SIMPLIFIED_CHINESE) was missing CSCANALYSIS.

ETROG-2192

Compound nouns should allow numerals as the first component.

ETROG-2194

The English word 'metres' is not lemmatized correctly.

ETROG-2251

Polish ‘Ł.’ is not treated as an initial.

ETROG-2253

Word breaking in FST Tokenizer can fail for initials in Greek.

ETROG-2254

Word breaking in FST Tokenizer can fail for initials in Hungarian.

ETROG-2268

The order in which options are specified in a BaseLinguisticsFactory may negate the use of a user-defined dictionary.

ETROG-2269

Inflected forms of the English verb 'mentor' not lemmatized.

ETROG-2283

Morpheme information is lost if an annotator is created for Korean and BaseLinguisticsOption.universalPosTags is set to true.

ETROG-2292

Processing text with too many out-of-vocabulary characters may fail.

ETROG-2297

Setting ExtendedTags to false does nothing.

ETROG-2304

Some pairs of words have each other as disambiguated lemmas. NB: This fix was reverted in 7.17.0.c58.2 as experience showed some unacceptable disambiguation regressions.

ETROG-2328

Setting XlaOption separatePlaceNameFromSuffix to false does nothing.

ETROG-2333

The alternative Japanese segmenter (JLA) often mishandled もの, ような, and とおり.

Bugs fixed in 7.14.2.c57.2

Bug #

Description

ETROG-2013

Error in SBN reader cache key equality function can induce out of memory errors.

Bugs fixed in 7.14.1.c57.2

Bug #

Description

ETROG-2200

Use of the fstTokenize option can exhaust memory.

Bugs fixed in 7.14.0.c57.2

Bug #

Description

ETROG-1681

Analysis results not produced for legacy uen language code.

ETROG-1925

Fixed UPT-16 mappings for proper nouns in Czech, Dutch, French, German, and Polish.

ETROG-1933

U+0022 is no longer removed during Urdu normalization.

ETROG-1945

Fixed errors when mapping parts of speech.

ETROG-1994

Attempting to use the alternativeJapaneseTokenization option in Solr caused a crash.

ETROG-2018

In some cases involving digits, the Hebrew tokenizer truncated part of the word.

ETROG-2047, ETROG-2059

Analyses of some inflected forms of Polish nouns were incorrect.

Bugs fixed in 7.13.0.c56.6

Bug #

Description

ETROG-1567

Fixed cases in which components for some Danish, Norwegian, and Swedish compound words are returned out of order.

ETROG-1612

Ensure that the order of lemmas retrieved from the morpho cache matches that from the FSTs.

ETROG-1618

RBLCmd to gracefully handle a leading BOM in a UTF-8 file.

ETROG-1652

Crash on the null character (U+0000)

ETROG-1687

Arabic prefix lengths sometimes miscalculated.

ETROG-1703

Include Semitic roots in Arabic analysis results from an Annotator.

ETROG-1761

Given language 'unknown', RBL-JE can produce a token with a whitespace in the middle.

Bugs fixed in 7.12.1.c56.6

Bug #

Description

ETROG-1628

Fixed errors in the production of Portuguese and Russian lemmas.

ETROG-1663, ETROG-1665

Corrected German regressions relative to the RBL native product (RLP).

ETROG-1671

Removed RBLCmd log4j warnings when AnalyzerOption.disambiguate = true

Bugs fixed in 7.12.0

Bug #

Description

ETROG-1432

Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters. This completes the fix that we made for version 2.3.0 (ETROG-647).

ETROG-1552

Fixed a BaseLinguisticsTokenFilter error identifying compound components when a non-disambiguating analysis is performed.

ETROG-1563

Fixed an error processing Korean text in which the analyzer produces a ^KrName tag for a token.

Bugs Fixed in 2.4.0

Bug #

Description

ETROG-1554

Rebuilt big-endian Chinese analysis dictionary.

Bugs Fixed in 2.3.0

Bug #

Description

ETROG-647

Fixed segmentation errors handling strings containing Unicode Supplementary (non-BMP) characters.

Bugs Fixed in 2.2.2

Bug #

Description

ETROG-1295

Fixed a NullPointerException when attempting to process Korean text.

ETROG-1271

Fixed reported errors in the English lemma dictionary:

  • noun plurals ending in "es" with a lemma ending in "is" (lemma of emphases is emphasis)

  • singing (lemma is sing, not singe)

  • leaves as verb (lemma is leave, not leaf).

ETROG-1387

Stopped returning guessed POS tags for languages for which POS tags are not supported.

ETROG-1311

Corrected an error tokenizing strings containing certain Unicode Supplementary (non-BMP) characters.

ETROG-1300

Fixed AnalyzerOption.query.

ETROG-1226

Fixed occasional duplication of linguistic lookup results.

Bugs Fixed in 2.2.1

Bug #

Description

ETROG-1261

Concurrency violation in RBL-JE user defined dictionaries

Bugs Fixed in 1.10.1

Bug #

Description

ETROG-916

Eliminated the RuntimeException that was thrown when RSE attempted to handle a very long token. RSE now splits extremely long tokens to fit in the token processing buffer.

ETROG-917

Fixed a bug that produced incorrect candidate lemmas for Korean text. The correct lemma candidate generator is now being used.

Bugs Fixed in 1.7.1

Bug #

Description

ETROG-591

Fixed buffer management error using a token user dictionary to tokenize components in a long sequence of Chinese or Japanese tokens with no sentence boundaries.

ETROG-588

Enabled the use of LanguageCode.SIMPLIFIED_CHINESE (zhs) or LanguageCode.TRADITIONAL_CHINESE (zht) when you load a Chinese token user dictionary. RSE maps these language codes to LanguageCode.CHINESE (zho).

ETROG-596, ETROG-629

Improved the handling of text that is not Hanzi (Kanji), Hiragana, or Katakana in Chinese and Japanese token user dictionary lookups. For most consistent performance, we recommend that you only include Hanzi (Kanji), Hiragana and Katakana characters in token user dictionary entries.

Bugs Fixed in 1.6.0

Bug #

Description

ETROG-486, ETROG-495

Addressed overgeneration of Swedish compound components for unknown words. Applied similar refactoring to Danish and Norwegian.

Bugs Fixed in 1.4.3

Bug #

Description

ETROG-319

Speeded up the RSE Tokenizer and LuceneTokenizer by eliminating unnecessary reinitialization.

Bugs Fixed in 1.4.2

Bug #

Description

ETROG-316

Avoided a heap overflow by revising the RSE LuceneTokenizer to gracefully handle multiple next() calls from Solr after the tokenizer has reached the end of the token stream.

Bugs Fixed in 1.4.1

Bug #

Description

ETROG-191

Worked around an out-of-memory error processing very long compounds. See Known Problems in 1.4.1.

Bugs Fixed in 1.4.0

Bug #

Description

ETROG-142

Corrected out-of-memory error processing very long German words.

ETROG-88

Fixed array out of bounds that occurred processing some multi-sentence input.

ETROG-182

Adjusted word breaker to avoid returning empty elements at end of the input text being processed.

Third-Party Components

For a list of third-party components that are used in Basis Technology products, see ThirdPartyLicenses.txt.

Third-party component updates in 7.27.1.c60.0

Component

Version

Change

annoy-java

0.2.5

New

Third-party component updates in 7.26.4.c60.0

Component

Version

Change

Google Guava

26.0-jre

Version upgrade

Third-party component updates in 7.25.0.c59.3

Component

Version

Change

Jackson Annotations

2.9.6

Version upgrade

Jackson Core

2.9.6

Version upgrade

Jackson Databind

2.9.6

Version upgrade

Jackson Dataformat XML

2.9.6

Version upgrade

Jackson Dataformat YAML

2.9.6

Version upgrade

Jackson Datatype Guava

2.9.6

Version upgrade

Jackson Module JAXB Annotations

2.9.6

Version upgrade

Third-party component updates in 7.24.6.c59.2

Component

Version

Change

Woodstox

4.0.5

Version downgrade

Third-party component updates in 7.24.0.c59.2

Component

Version

Change

Google Guava

18.0

Version upgrade

Jackson Annotations

2.9.4

Version upgrade

Jackson Core

2.9.4

Version upgrade

Jackson Databind

2.9.4

Version upgrade

Jackson Dataformat XML

2.9.4

Version upgrade

Jackson Dataformat YAML

2.9.4

Version upgrade

Jackson Datatype Guava

2.9.4

Version upgrade

Jackson Module JAXB Annotations

2.9.4

Version upgrade

SnakeYAML

1.18

Version upgrade

TensorFlow for Java

1.5.0

Version upgrade

Woodstox

5.0.3

New

Third-party component updates in 7.23.0.c59.0

Component

Version

Change

Auto Common Libraries

0.3

New

AutoService

1.0-tc3

New

Commons CLI

1.2

New

Easy Plugins

0.2.2

New

Jackson Datatype Guava

2.7.3

New

JavaPoet

1.9.0

New

Metrics Core

3.2.3

New

Protocol Buffers

3.3.1

New

TensorFlow

1.3.0

New

Third-party component updates in 7.21.1.c59.0

Component

Version

Change

ICU4J

59.1

Version upgrade

Third-party component updates in 7.18.0.c58.3

Component

Version

Change

fastutil

6.6.1

Version upgrade

ICU4J

58.1

Version upgrade

Jackson Annotations, Core, Databind, Dataformat XML, Dataformat YAML, Module JAXB Annotations

2.7.3

Version upgrade

Jackson Dataformat Smile

2.7.3

New

Third-party component updates in 7.16.0.c58.2

Component

Version

Change

args4j

2.32

Version upgrade

fastutil

6.6.0

Version upgrade

opencsv

Removed

Third-party component updates in 7.14.0.c57.2

Component

Version

Change

Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations

2.6.2

Version upgrade

Jackson DataFormat YAML

2.6.2

Version upgrade

SnakeYAML

1.15

New

Snowball (no version or release info available; copied 2015-11-30)

New

args4j

2.3.2

Version added

Third-party component updates in 7.13.0.c56.6

Component

Version

Change

ICU4J

55.1

Version upgrade

Jackson Annotations, Core, Databind, DataFormat XML, Module JAXB Annotations

2.4.4

Version upgrade

Jackson DataFormat YAML

2.4.4

New

Known Problems

Known Problems in 2.x
  • If disambiguate is set to false, or if no disambiguator for the language exists, BaseLinguisticsTokenFilter does not set the type correctly for compound components when adding them to the token stream. It marks compound components as <LEMMA> instead of <COMP> when a non-disambiguating analysis is performed. (ETROG-1552)

Known Problems in 1.8.0
  • The prefixes and suffixes that the RSE tokenizer returns for Hebrew may include punctuation attached to the underlying tokens, such as parentheses (prefix, suffix) and comma (suffix). Accordingly, prefixes and suffixes are assigned a Token PositionIncrement of 1. A multicharacter prefix or suffix may be reported as a sequence of one-character prefixes or suffixes. (ETROG-697)

Known Problems in 1.7.0
  • If you use LanguageCode.SIMPLIFIED_CHINESE (zhs) or LanguageCode.TRADITIONAL_CHINESE (zht) when you load a Chinese token user dictionary, the dictionary is not loaded. You must use LanguageCode.CHINESE (zho) to designate the language code for a Chinese token user dictionary. (ETROG-588)
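
One caller-side way to work around the limitation above is to map the two legacy codes to zho before loading a Chinese token user dictionary. The sketch below uses plain strings so that it compiles without the product on the classpath. The class and method names are hypothetical, and real code would work with the LanguageCode values named in the note.

```java
import java.util.Map;

// Hypothetical workaround helper; not part of the RSE API.
class ChineseDictionaryLanguageCode {

    // The two codes named in the note, mapped to the generic Chinese code.
    private static final Map<String, String> LEGACY_TO_CANONICAL =
            Map.of("zhs", "zho", "zht", "zho");

    // Returns the code to use when loading a token user dictionary.
    // Codes that are not listed above are returned unchanged.
    static String canonicalize(String code) {
        return LEGACY_TO_CANONICAL.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"zhs", "zht", "zho", "jpn"}) {
            System.out.println(code + " -> " + canonicalize(code));
        }
    }
}
```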

Known Problems in 1.4.1
  • To avoid a potential out-of-memory error, RSE does not attempt to decompound words longer than 30 characters. For languages with support for decompounding, if a word is longer than 30 characters and is not found in a user dictionary or the standard dictionary, RSE classifies the word as a guessed lemma. (ETROG-191)

Known Problems in 1.4.0
  • Inconsistent handling of numbers and punctuation during lemmatization. (ETROG-266)

  • RSE expects valid Unicode strings as input. If the input includes illegal Unicode sequences, such as un-paired UTF-16 surrogate characters, the behavior is undefined. (ETROG-284)
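
Because behavior is undefined for ill-formed input, an application may want to check text for unpaired surrogates before passing it to RSE. The following is a minimal sketch using only the JDK. The class name Utf16Validator is hypothetical, and the check covers unpaired surrogates only, not other kinds of invalid input.

```java
// Illustrative pre-check; not part of the RSE API.
class Utf16Validator {

    // Returns true if every high surrogate is immediately followed by a low
    // surrogate and no low surrogate appears on its own.
    static boolean isWellFormed(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean hasPartner = i + 1 < text.length()
                        && Character.isLowSurrogate(text.charAt(i + 1));
                if (!hasPartner) {
                    return false; // high surrogate without a following low one
                }
                i += 2; // skip the complete pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            } else {
                i++;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String paired = new String(Character.toChars(0x1F600)); // one supplementary code point
        String unpaired = "a" + (char) 0xD83D + "b";            // lone high surrogate
        System.out.println(isWellFormed(paired));
        System.out.println(isWellFormed(unpaired));
    }
}
```

Input that fails the check can be rejected or repaired (for example, by replacing the stray code unit with U+FFFD) before tokenization; which policy is appropriate depends on the application.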

Known Problems in 1.3-beta and 1.4.x
  • Incorrect capitalization in some lemmas, including some German compounds (e.g., unAbhängigkeit).

  • Incorrect lemma formation of some words with suffixes (e.g., Brötchen).

  • Over-generation of German compound components (e.g., übergreifen, über, and greifen as separate components).

  • Failure to recognize some extended written-out German numbers (e.g., zweitausendzwölf).