Release Notes

Base Linguistics (RBL) Elasticsearch Plugin

Release 8.11.3.0

February 2024

Supports Elasticsearch 8.11.3 and includes RBL 7.47.1.c72.0

Release 8.4.0.0

October 2022

Supports Elasticsearch 8.4.0 and includes RBL 7.45.0.c67.0

Release 8.2.2.0

September 2022

Supports Elasticsearch 8.2.2 and includes RBL 7.45.0.c67.0

Release 8.1.1.0

June 2022

Supports Elasticsearch 8.1.1 and includes RBL 7.44.1.c67.0

Known Issues

  • Features that use TensorFlow do not work with Elasticsearch 8.1.1, including:

    • Hebrew disambiguation with disambiguatorType set to dnn.

    • Korean tokenization with tokenizerType set to spaceless_statistical.

    • Indonesian POS tagging.

    • Malaysian POS tagging.

Release 7.17.2.0

April 2022

Supports Elasticsearch 7.17.2 and includes RBL 7.43.0.c66.0

Release 7.17.1.0

March 2022

Supports Elasticsearch 7.17.1 and includes RBL 7.43.0.c66.0

Release 7.13.4.0

July 2021

Supports Elasticsearch 7.13.4 and includes RBL 7.41.1.c65.0

New

  • New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. POS tags are not supported for these languages, so the universalPosTags option is ignored. (ETROG-3443, ETROG-3465)

  • Fragment definition: A single line followed by an empty line is no longer always considered a fragment. It is still considered a fragment if the line is short, as specified by the maxTokensForShortLine parameter. (ETROG-3431)
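The revised fragment rule can be pictured with a small sketch. It is an illustration only, not RBL's implementation: the function name is hypothetical, tokens are approximated by whitespace splitting, and reading "short" as "at most maxTokensForShortLine tokens" is an assumption.

```python
def is_single_line_fragment(line, next_line, max_tokens_for_short_line):
    """Return True if `line`, followed by an empty line, counts as a fragment.

    Assumption: a line is "short" when it has at most
    max_tokens_for_short_line whitespace-separated tokens.
    """
    followed_by_empty_line = next_line.strip() == ""
    token_count = len(line.split())
    return followed_by_empty_line and 0 < token_count <= max_tokens_for_short_line
```

With an arbitrary limit of 6, a two-word heading followed by an empty line is treated as a fragment, while a full sentence followed by an empty line is not.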

Release 7.13.3.0

July 2021

Supports Elasticsearch 7.13.3 and includes RBL 7.40.1.c64.1

Release 7.13.2.1

July 2022

Supports Elasticsearch 7.13.2 and includes RBL 7.44.2.c67.0

New

  • TensorFlow supported: This plugin supports TensorFlow. (ESPI-168)

    If you will be using any of the features that require TensorFlow:

    1. Copy the file plugins/analysis-rbl-je/analysis-rbl-je.options to config/jvm.options.d/analysis-rbl-je.options

    2. Edit the file, uncommenting the line that matches your operating system and CPU.

    Features that require TensorFlow include:

    • Hebrew disambiguation with disambiguatorType set to dnn.

    • Korean tokenization with tokenizerType set to spaceless_statistical.

    • Indonesian POS tagging.

    • Malaysian POS tagging.
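Step 1 of the TensorFlow setup can be scripted. The following is a minimal sketch, assuming `es_home` is the Elasticsearch installation directory that contains the `plugins/` and `config/` directories named in the steps; the function name is hypothetical. Step 2 (uncommenting the line for your operating system and CPU) is left as a manual edit, because the contents of the options file are not shown in these notes.

```python
import shutil
from pathlib import Path

def install_rbl_jvm_options(es_home):
    """Copy the plugin's options file into config/jvm.options.d/ and return the new path."""
    es_home = Path(es_home)
    source = es_home / "plugins" / "analysis-rbl-je" / "analysis-rbl-je.options"
    target_dir = es_home / "config" / "jvm.options.d"
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    target = target_dir / source.name
    shutil.copyfile(source, target)
    return target
```

After the copy, open the returned file and uncomment the line that matches your platform before restarting Elasticsearch.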

Release 7.13.2.0

June 2021

Supports Elasticsearch 7.13.2 and includes RBL 7.40.1.c64.1

New

  • Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. Hebrew tokenization and analysis are now done in separate steps. (ETROG-3290)

    • TokenizerOption.includeHebrewRoots and TokenizerOption.guessHebrewPrefixes have been deprecated and replaced by AnalyzerOption.includeHebrewRoots and AnalyzerOption.guessHebrewPrefixes.

    • NFKC normalization is now supported for Hebrew.

    • We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.

    • We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)

    • Double apostrophes are now treated like gershayim. (ETROG-3249)

  • Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)

  • Improved directory structure: The contents of the models/ directory are now separated into subdirectories by language. (ETROG-1218)

  • Statistical models moved to models/ directory: The following files have been moved from dicts/ to models/: (ETROG-1218)

    • cat/ca-ud-train.downcased.mdl

    • est/et-ud-train.downcased.mdl

    • fas/posLemma.mdl

    • lav/lv-ud-train.downcased.mdl

    • nno/lemma.mdl

    • nob/lemma.mdl

    • slk/sk-ud-train.downcased.mdl

    • srp/sr-ud-train.downcased.mdl

  • New option for tokenizers: We've added a new option, tokenizerType, to specify which tokenizer to use. The options alternativeTokenization and fstTokenize are deprecated in favor of tokenizerType. (ETROG-3419)

  • New Korean tokenizer: We've added a new tokenizer for spaceless Korean input. The previous tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. Activate it by setting tokenizerType to spaceless_statistical. (ETROG-3392)
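When updating existing configurations for this release, a quick scan for the two deprecated tokenizer options can help. This is a hedged sketch: it assumes options are held in a plain dictionary keyed by option name, and the function name is hypothetical. It only reports deprecated names and does not guess which tokenizerType value should replace them.

```python
# Options these release notes list as deprecated in favor of tokenizerType.
DEPRECATED_TOKENIZER_OPTIONS = ("alternativeTokenization", "fstTokenize")

def find_deprecated_tokenizer_options(options):
    """Return the deprecated option names present in `options`, in a stable order."""
    return [name for name in DEPRECATED_TOKENIZER_OPTIONS if name in options]

# Example configuration (illustrative values only).
legacy = {"alternativeTokenization": True, "nfkcNormalize": True}
for name in find_deprecated_tokenizer_options(legacy):
    print(f"{name} is deprecated; use tokenizerType instead")
```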

Bug fixes

  • Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token text, and the part after the geresh would sometimes be considered a suffix. Now the whole token is returned as the token's text. (ETROG-3262, ETROG-3290)

    Example: מע'רב

    • Previously:

      Token{text=מע'}

      MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW}

      MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}

    • Now:

      Token{text=מע'רב}

      MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[], com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, tagSet=MILA_HEBREW}

  • Multiple punctuation characters are no longer returned as a single token in Chinese when alternativeTokenization is true or tokenizerType is set to spaceless_lexical. Now each character is its own token. (ETROG-3402)

    Example: 天津??

    • Previously:

      Token{text=天津}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}

      Token{text=??}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}

    • Now:

      Token{text=天津}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}

      Token{text=?}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}

      Token{text=?}

      HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}

Third-party component updates

This release includes the following third-party component changes:

Table 70. Added

| Package | Version |
|---|---|
| JavaCPP | 1.5.4 |

Table 71. Upgraded

| Package | Old Version | New Version |
|---|---|---|
| TensorFlow | 1.14.0 | 2.3.1 |

Release 7.10.2.0

February 2021

Supports Elasticsearch 7.10.2 and includes RBL 7.38.1.c63.0

Release 7.8.0.1

October 2024

Supports Elasticsearch 7.8.0 and includes RBL 7.47.4.c75.0

Release 7.6.1.3

January 2021

Supports Elasticsearch 7.6.1 and includes RBL 7.38.1.c63.0

New

  • The tools directory is now included in the RBL-Elasticsearch package. (ESPI-31)

  • Greek lexicon: The Greek lexicon has additional words. (ETROG-3288)

  • Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. alternativeGreekDisambiguation must be set to false, which is the default. (ETROG-3289)

    Example: δείξε

    • Previously: Selected lemma: δεικνύω (archaic)

    • Now: Selected lemma: δείχνω (modern)

  • New Greek disambiguator added: The new Greek disambiguator is more accurate, but slower. The new disambiguator is enabled by default. To use the old disambiguator, set alternativeGreekDisambiguation to true. (ETROG-3304)

  • RBL-JE no longer normalizes certain emoji ZWJ sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)

Bug fixes

  • The Greek guesser now handles tokens with non-alphanumeric characters. (ETROG-3286)

    Example: Start+

    • Previously: POS tags: possible PROP, ADJ, NOUN

    • Now: POS tag: FM

Third-party component updates

This release includes the following third-party component changes:

| Package | Old Version | New Version |
|---|---|---|
| Jackson Annotations | 2.10.0 | 2.11.1 |
| Jackson Core | 2.10.0 | 2.11.1 |
| Jackson Databind | 2.10.0 | 2.11.1 |
| Jackson Dataformat XML | 2.10.0 | 2.11.1 |
| Jackson dataformats: Text | 2.10.0 | 2.11.1 |
| Jackson modules: Base | 2.10.0 | 2.11.1 |
| Protocol Buffers | 3.6.1 | 3.12.2 |
| Apache Commons IO | 2.6 | 2.7 |
| fastutil | 8.3.0 | 8.4.0 |
| Woodstox Stax2 API | 4.2 | 4.2.1 |
| SnakeYAML | 1.25 | 1.26 |

Release 7.6.1.2

November 2020

Supports Elasticsearch 7.6.1 and includes RBL 7.37.0.c62.2

New

  • Performance improvement: Spanish disambiguation with alternativeSpanishDisambiguation set to false is now faster. (ETROG-3271)

  • Performance improvement: Korean disambiguation is now faster. (ETROG-3280, ETROG-3282)

  • Support for unknown language: If the language is unknown (xxx), tokenization and sentence breaking are supported. (ETROG-3278)

  • Solr 8.7.0: We now support Solr 8.7.0. (ETROG-3315)

  • Tokenization rule preprocessor: The preprocessor command !!btinclude is supported in tokenization rule files, supporting inclusion of files in rule files. (ETROG-2497)

  • Updated sample: The tokenize-analyze sample has been changed from two applications running in sequence to a single application that both tokenizes and analyzes. (ETROG-3291)

  • New sample: The sample csc-annotate demonstrates using CSC with the ADM API. (ETROG-3317)

  • Deprecated option: TokenizerOption#includeRoots has been deprecated and replaced with TokenizerOption#includeHebrewRoots. (ETROG-3314)

  • Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property. (ETROG-3311)

Bug fixes

  • Hebrew combining characters are no longer erroneously split into tokens separate from their base characters. (ETROG-3277)

  • A clear exception (RosetteUnsupportedLanguageException) is now thrown when tokenizing some unsupported languages. Previously, these languages appeared to work. The same tokenizer is still available by specifying the unknown language (xxx). The languages impacted are Albanian, Bulgarian, Croatian, Indonesian, Malay, Slovenian, Standard Malay, and Ukrainian. (ETROG-3278, ETROG-3326)

  • RBL no longer crashes when alternativeTokenization and fragmentBoundaryDetection are both enabled for some inputs in Japanese and Chinese. (ETROG-3285)

  • Correct start and end offsets are now produced when fstTokenize is set to true. Previously, some Spanish inputs would produce tokens with start and end offsets of 0. (ETROG-3292)

  • The mappings of default Basis POS tags to universal POS tags (UPT-16) have been corrected for Greek. (ETROG-3306)

    • Previously: COSUBJ maps to CONJ, ORD maps to ADJ, and POSS maps to DET

    • Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON

  • Tokens no longer have null token types. (ETROG-3316)

  • When an NFKC normalized character results in multiple tokens, those tokens no longer have equal start and end offsets. Previously this could occur when nfkcNormalize was set to true. (ETROG-2505)

    Example: ﷺ

    • Previously: Offsets: 

      صلى start 0 end 0 

      الله start 0 end 0

      عليه start 0 end 0

      وسلم start 0 end 1

    • Now: Offsets:

      صلى start 0 end 1

      الله start 0 end 1

      عليه start 0 end 1

      وسلم start 0 end 1
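The corrected offsets in the ﷺ example can be reproduced with a short sketch. It is not RBL's tokenizer: it assumes whitespace tokenization and normalizes one source character at a time (which ignores composition across adjacent characters), and the function name is hypothetical. It shows why every token produced from a single NFKC-expanded character should span that character in the original text.

```python
import unicodedata

def tokens_with_source_offsets(text):
    """Whitespace-tokenize the NFKC form of `text`, with offsets into the original text."""
    # Normalize one source character at a time, remembering which source
    # index each normalized character came from.
    normalized = []
    source_index = []
    for i, ch in enumerate(text):
        for out in unicodedata.normalize("NFKC", ch):
            normalized.append(out)
            source_index.append(i)

    tokens = []
    start = None
    # A trailing space acts as a sentinel that flushes the last token.
    for pos, ch in enumerate(normalized + [" "]):
        if ch.isspace():
            if start is not None:
                token = "".join(normalized[start:pos])
                tokens.append((token, source_index[start], source_index[pos - 1] + 1))
                start = None
        elif start is None:
            start = pos
    return tokens

# U+FDFA expands to four space-separated words under NFKC.
for token, start, end in tokens_with_source_offsets("\uFDFA"):
    print(token, "start", start, "end", end)
```

Each of the four tokens is reported with start 0 and end 1, matching the "Now" behavior listed for this fix.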