Release Notes

Entity Extractor (REX)

Release 7.56.1.c78.0

June 2025

New

  • Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

Third-party component updates

Table 29. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons IO | 2.18.0 | 2.19.0 |
| Apache Commons Text | 1.13.0 | 1.13.1 |
| Apache CXF Core | 4.1.1 | 4.1.2 |
| Apache CXF JAX-RS Client | 4.1.1 | 4.1.2 |
| Apache CXF Runtime HTTP Transport | 4.1.1 | 4.1.2 |
| Apache CXF Runtime JAX-RS Frontend | 4.1.1 | 4.1.2 |
| Apache CXF Runtime Security functionality | 4.1.1 | 4.1.2 |
| Guava InternalFutureFailureAccess and InternalFutures | 1.0.2 | 1.0.3 |
| Guava: Google Core Libraries for Java | 33.4.0-jre | 33.4.8-jre |
| Jackson datatype: Guava | 2.18.2 | 2.19.0 |
| Jackson Jakarta-RS: base | 2.18.2 | 2.19.0 |
| Jackson Jakarta-RS: JSON | 2.18.2 | 2.19.0 |
| Jackson module: Jakarta XML Bind Annotations (jakarta.xml.bind) | 2.18.2 | 2.19.0 |
| Jackson module: Old JAXB Annotations (javax.xml.bind) | 2.18.2 | 2.19.0 |
| Jackson-annotations | 2.18.2 | 2.19.0 |
| Jackson-core | 2.18.2 | 2.19.0 |
| jackson-databind | 2.18.2 | 2.19.0 |
| Jackson-dataformat-XML | 2.18.2 | 2.19.0 |
| Jackson-dataformat-YAML | 2.18.2 | 2.19.0 |
| Jackson-JAXRS: base | 2.18.2 | 2.19.0 |
| Jackson-JAXRS: JSON | 2.18.2 | 2.19.0 |
| Protocol Buffers [Core] | 4.29.3 | 4.30.2 |
| SnakeYAML | 2.3 | 2.4 |



Table 30. Added

| Package | Version | License |
| --- | --- | --- |
| JSpecify annotations | 1.0.0 | Apache-2.0 |
| Project Lombok | 1.18.38 | MIT |



Table 31. Removed

Package

CogComp-NLPy

JVM Integration for Metrics

Jackson-dataformat-CSV

MongoDB Java Driver (unmaintained)

SLF4J JDK14 Binding



Release 7.56.0.c77.0

March 2025

Important

The installation instructions for the Solr plugin have changed.

To install the Solr plugin:

  • Copy all files from the lib directory inside the Entity Extractor Solr plugin installation (rex-je-solr) into the lib directory of your Solr core.

  • Copy all files from the lib directory inside the Entity Extractor installation (rex-je) into the lib directory of your Solr core.
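The two copy steps above can be scripted. The following is a minimal sketch, not an official installer: the function name copy_plugin_libs and the example paths in the comments are hypothetical, and you should substitute the actual locations of your rex-je-solr and rex-je installations and of your Solr core.

```python
import shutil
from pathlib import Path


def copy_plugin_libs(source_lib_dirs, core_lib_dir):
    """Copy every regular file from each source lib directory into the Solr core's lib directory.

    Returns the names of the copied files, in the order they were copied.
    """
    target = Path(core_lib_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_lib_dirs:
        # Sort for a deterministic copy order; subdirectories are skipped.
        for item in sorted(Path(source).iterdir()):
            if item.is_file():
                shutil.copy2(item, target / item.name)
                copied.append(item.name)
    return copied


# Hypothetical usage; these paths are placeholders, not documented defaults:
# copy_plugin_libs(
#     ["/opt/rex-je-solr/lib", "/opt/rex-je/lib"],
#     "/var/solr/data/mycore/lib",
# )
```

If both lib directories contain a file with the same name, the later copy overwrites the earlier one in this sketch.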

New

  • New entity types: You can now extract social media entity types using Entity Extractor. The types extracted are HASHTAG, ATMENTION, URL, and EMAIL. To extract these types, set extractSocialMedia to true in the annotator configuration. (TEJ-2672)

  • Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

  • Solr support: Solr 9.8.1 is now supported. (TEJ-2683)

  • Solr installation change: We've changed how the plugin is installed because Solr has discontinued the lib directives. When installing, you must copy all JAR files from the lib directory inside the Entity Extractor installation into the lib directory of your Solr core and ensure that the solrconfig.xml file points to the REX-JE installation directory. If you are using Solr version 9.7.x or earlier, the previous installation instructions will still work.

Bug Fixes

  • We fixed a bug in the in-document coreference server. In-document coreference chains of entity mentions are now correct. (TEJ-2655)

Third-party component updates

Table 32. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Angus Activation Registries | 2.0.1 | 2.0.2 |
| Auto Common Libraries | 0.8 | 1.2.1 |
| AutoService | 1.0-rc4 | 1.1.1 |
| Apache Commons Codec | 1.17.1 | 1.18.0 |
| Apache Commons IO | 2.17.0 | 2.18.0 |
| Apache Commons Text | 1.12.0 | 1.13.0 |
| Apache CXF | 4.0.4 | 4.1.1 |
| Apache Log4j | 2.24.1 | 2.24.3 |
| Guava | 33.3.1-jre | 33.4.0-jre |
| istack common utility code runtime | 4.0.1 | 4.1.2 |
| Jackson | 2.17.2 | 2.18.2 |
| Jakarta Activation API | 2.1.2 | 2.1.3 |
| Jakarta RESTful WS API (prev. jakarta.ws.rs-api) | 3.0.0 | 3.1.0 |
| Jakarta XML Binding API | 3.0.1 | 4.0.2 |
| JavaCPP | 1.5.10 | 1.5.11 |
| JAXB Core and Runtime | 3.0.2 | 4.0.5 |
| JVM Integration for Metrics | 3.0.1 | 3.0.2 |
| Protocol Buffers [Core] | 3.25.5 | 4.29.3 |
| Metrics Core | 3.0.1, 4.2.28 | 3.0.2, 4.2.30 |
| TXW2 Runtime | 3.0.2 | 4.0.5 |



Table 33. Added

| Package | Version | License |
| --- | --- | --- |
| Java Architecture for XML Binding | 2.2.12 | CDDL 1.1 |
| MongoDB Java Driver (unmaintained) | 3.12.14 | The Apache License, version 2.0 |



Release 7.55.15.c76.0

November 2024

New

  • Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

  • Java 21 support added: Java 21 is now supported. Java 11 and 17 are still supported (TEJ-2484).

Third-party component updates

Table 34. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| annoy | 0.2.5 | 0.2.6 |
| Apache Commons Compress | 1.27.0 | 1.27.1 |
| Commons IO | 2.16.1 | 2.17.0 |
| Apache Commons Lang | 3.16.0 | 3.17.0 |
| Apache Commons Text | 1.10.0 | 1.12.0 |
| Apache Log4j | 2.23.1 | 2.24.1 |
| fastutil | 8.5.14 | 8.5.15 |
| Guava | 33.3.0-jre | 33.3.1-jre |
| JavaCPP | 1.5.8 | 1.5.10 |
| Metrics Core | 3.2.3 | 4.2.28 |
| Protocol Buffers | 3.25.3 | 3.25.5 |
| SnakeYAML | 2.2 | 2.3 |
| Woodstox | 7.0.0 | 7.1.0 |



Table 35. Added

| Package | Version | License |
| --- | --- | --- |
| Apache Commons Codec | 1.16.1 | Apache-2.0 |
| Project Lombok | 1.18.34 | The MIT License |
| Streaming API for XML | 1.0-2 | GNU General Public Library |



Release 7.55.14.c75.0

September 2024

New

  • Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

Bug Fixes

  • Fixed a bug where REX would accept indoc-coref-server entity mentions which violated sentence boundaries. REX will now reject mentions from indoc-coref-server that are not contained within document sentences. (TEJ-2451)
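The containment rule described in the fix above can be illustrated with a small sketch. This is not REX code: the function name and the representation of mentions and sentences as (start, end) character offsets with an exclusive end are assumptions made for illustration.

```python
def mentions_within_sentences(mentions, sentences):
    """Keep only mentions whose [start, end) span lies entirely inside a single sentence span."""
    kept = []
    for start, end in mentions:
        # A mention is accepted if at least one sentence fully contains it.
        if any(s_start <= start and end <= s_end for s_start, s_end in sentences):
            kept.append((start, end))
    return kept


# Two sentences covering offsets 0-10 and 10-25 (hypothetical values).
sentences = [(0, 10), (10, 25)]
# The second mention straddles the boundary at offset 10 and is rejected.
mentions = [(2, 6), (8, 12), (10, 25)]
accepted = mentions_within_sentences(mentions, sentences)
```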

Third-party component updates

Table 36. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons CLI | 1.7.0 | 1.9.0 |
| Apache Commons Compress | 1.26.1 | 1.27.0 |
| Apache Commons Lang | 3.14.0 | 3.16.0 |
| fastutil | 8.5.13 | 8.5.14 |
| Guava | 33.2.0-jre | 33.3.0-jre |
| Jackson | 2.17.1 | 2.17.2 |
| Project Lombok | 1.18.22 | 1.18.34 |
| TensorFlow for Java | 0.3.3 | 1.0.0-rc.1 |
| Woodstox | 6.6.2 | 7.0.0 |



Release 7.55.13.c74.0

June 2024

New

  • Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-507)

  • Linking improvements: We've added a heuristic to help stop generic, unnamed entities, such as "mortgage law", from being linked. (RWIKI-475)

Bug Fixes

  • Aliases are no longer filtered by low normalized link probability; it is now possible to link entities where abbreviations like "MIT", "LA", "WHO", "UN" are the mention text. (RWIKI-389)

  • We fixed a NullPointerException thrown while writing a log entry when processing empty tokens. (TEJ-2361)

  • We updated the REXCmd plain text and console output to display the entityId for all mentions, not just the head mention. We also fixed an ArrayIndexOutOfBoundsException thrown while writing context output. (TEJ-2155)

  • We fixed a bug where news media, such as television programs, were typed as ORG. (RWIKI-483)

  • We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)

Third-party component updates

Table 37. Added

| Package | Version | License |
| --- | --- | --- |
| Angus Activation Registries | 2.0.1 | EDL 1.0 |



Table 38. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons CLI | 1.6.0 | 1.7.0 |
| Apache Commons Codec | 1.11 | 1.16.1 |
| Apache Commons Compress | 1.26.0 | 1.26.1 |
| Apache Commons IO | 2.15.1 | 2.16.1 |
| Apache Log4j | 2.21.1 | 2.23.1 |
| Apache CXF | 3.4.7 | 4.0.4 |
| args4j | 2.33 | 2.37 |
| Guava | 33.0.0-jre | 33.2.0-jre |
| iStack Common Utility Code | 3.0.12 | 4.0.1 |
| JAXB | 2.3.4 | 3.0.2 |
| Jackson | 2.16.1 | 2.17.1 |
| Jakarta Activation API | 1.2.2 | 2.1.2 |
| Jakarta Annotations API | 1.3.5 | 2.1.1 |
| Jakarta RESTful Web Services API | 2.1.6 | 3.0.0 |
| Jakarta XML Binding API | 2.3.3 | 3.0.1 |
| Protocol Buffers | 3.25.0 | 3.25.3 |
| TXW2 | 2.3.4 | 3.0.2 |
| Woodstox | 4.4.1, 6.2.6 | 6.6.2 |
| XmlSchema | 2.2.5 | 2.3.1 |



Removed Packages

  • Jakarta SOAP with Attachments API

  • Jakarta Transaction API

  • Jakarta Web Services Metadata API

  • Jakarta XML Web Services API

Release 7.55.12.c73.0

May 2024

New

  • Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. You should see large improvements in entity linking. (RWIKI-454, RWIKI-507)

    We've changed how some entity types are linked to the provided knowledge base:

    • PERSON: Now only real humans are linked as person entities; fictional, imaginary, and mythical humans are not.

    • PRODUCT: Product entities now exclude most creative works.

  • Linking improvements: We've changed the conflict resolution algorithm to one which tries to link using the longest possible mentions. You should see better linking, especially in cases where the mention of a popular entity is embedded within the mention of interest. (RWIKI-404)

    Example: I studied at the University of Chicago

    • Previously linked: Chicago

    • Now linked: University of Chicago

  • Solr support: Solr 9.6.1 is now supported. (TEJ-2357)
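The preference for the longest possible mention, described under "Linking improvements" above, can be sketched as a greedy overlap resolution. This is an illustration under stated assumptions, not the algorithm REX implements: the function name, the (start, end) span representation, and the greedy strategy are all choices made for this example.

```python
def resolve_longest(candidates):
    """Greedy overlap resolution: prefer longer spans and drop any span that overlaps one already kept."""
    kept = []
    # Longest spans first; ties are broken by earliest start offset.
    for start, end in sorted(candidates, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


text = "I studied at the University of Chicago"


def span(sub):
    """Return the (start, end) offsets of the first occurrence of sub in text."""
    start = text.find(sub)
    return (start, start + len(sub))


# Both candidates overlap; only the longer one survives.
candidates = [span("Chicago"), span("University of Chicago")]
linked = [text[s:e] for s, e in resolve_longest(candidates)]
```

Here linked contains only "University of Chicago", matching the example above.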

Bug Fixes

  • We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)

  • We fixed a bug where Chinese characters were normalized when looking up knowledge base artifacts for linking in Japanese. You should see improved entity linking in Japanese. (RWIKI-406)

  • English terms for half(s) and quarter(s) were removed from the Russian (RUS) and German (DEU) regexes for time. (TEJ-1817)

Release 7.55.11.c73.0

March 2024

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 39. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons CLI | 1.2.0 | 1.6.0 |
| Apache Commons Compress | 1.24.0 | 1.26.0 |
| Apache Commons IO | 2.15.0 | 2.15.1 |
| Apache Commons Lang | 3.12.0 | 3.14.0 |
| fastutil | 8.5.12 | 8.5.13 |
| Guava | 32.1.3-jre | 33.0.0-jre |
| Guava InternalFutureFailureAccess and InternalFutures | 1.0.1 | 1.0.2 |
| ICU4J | 70.1 | 74.2 |
| Jackson Annotations | 2.15.3 | 2.16.1 |
| Jackson Core | 2.15.3 | 2.16.1 |
| Jackson Databind | 2.15.3 | 2.16.1 |
| Jackson Dataformat CSV | 2.15.3 | 2.16.1 |
| Jackson Dataformat XML | 2.15.3 | 2.16.1 |
| Jackson Dataformat YAML | 2.15.3 | 2.16.1 |
| Jackson Datatype: Guava | 2.15.3 | 2.16.1 |
| Jackson JAXRS: Base | 2.15.3 | 2.16.1 |
| Jackson JAXRS: JSON | 2.15.3 | 2.16.1 |



Table 40. Removed

Package

JavaPoet

OSGi Core



Release 7.55.10.c73.0

March 2024

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 41. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons CLI | 1.2.0 | 1.6.0 |
| Apache Commons Compress | 1.24.0 | 1.26.0 |
| Apache Commons IO | 2.15.0 | 2.15.1 |
| Apache Commons Lang | 3.12.0 | 3.14.0 |
| fastutil | 8.5.12 | 8.5.13 |
| Guava | 32.1.3-jre | 33.0.0-jre |
| Guava InternalFutureFailureAccess and InternalFutures | 1.0.1 | 1.0.2 |
| ICU4J | 70.1 | 74.2 |
| Jackson Annotations | 2.15.3 | 2.16.1 |
| Jackson Core | 2.15.3 | 2.16.1 |
| Jackson Databind | 2.15.3 | 2.16.1 |
| Jackson Dataformat CSV | 2.15.3 | 2.16.1 |
| Jackson Dataformat XML | 2.15.3 | 2.16.1 |
| Jackson Dataformat YAML | 2.15.3 | 2.16.1 |
| Jackson Datatype: Guava | 2.15.3 | 2.16.1 |
| Jackson JAXRS: Base | 2.15.3 | 2.16.1 |
| Jackson JAXRS: JSON | 2.15.3 | 2.16.1 |



Table 42. Removed

Package

JavaPoet

OSGi Core



Release 7.55.9.c72.0

December 2023

New

  • Solr support: Solr 9.4 is now supported. (TEJ-2112)

  • Build scripts: The scripts to improve gazetteer performance by compiling them into binary files have been moved to the ./scripts directory. You no longer need the Field Training Kit (FTK) to compile these files. (TEJ-2096)

Bug Fixes

  • We've added a reject gazetteer to ensure that the string "USA, Canada" is extracted correctly as two location entities: USA and Canada. This is active by default. (TEJ-2114)

Third-party component updates

Table 43. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Jackson-annotations | 2.15.2 | 2.15.3 |
| Jackson Core | 2.15.2 | 2.15.3 |
| Jackson Databind | 2.15.2 | 2.15.3 |
| Jackson Dataformat XML | 2.15.2 | 2.15.3 |
| Jackson Dataformat YAML | 2.15.2 | 2.15.3 |
| Jackson datatypes: Guava | 2.15.2 | 2.15.3 |
| Jackson-JAXRS: base | 2.15.2 | 2.15.3 |
| Jackson-JAXRS: JSON | 2.15.2 | 2.15.3 |
| Jackson module: Old JAXB Annotations (javax.xml.bind) | 2.15.2 | 2.15.3 |
| Guava: Google Core Libraries for Java | 32.1.2-jre | 32.1.3-jre |
| Protocol Buffers [Core] | 3.23.4 | 3.25.0 |
| Apache Commons IO | 2.11.0 | 2.15.0 |
| Apache Commons Compress | 1.23.0 | 1.24.0 |
| liblinear | 2.42 | 2.44 |
| Apache Log4j API | 2.20.0 | 2.21.1 |
| Apache Log4j Core | 2.20.0 | 2.21.1 |
| Apache Log4j SLF4J Binding | 2.20.0 | 2.21.1 |
| Stax2 API | 4.2.1 | 4.2.2 |
| SnakeYAML | 2.0 | 2.2 |



Release 7.55.8.c71.0

September 2023

New

  • Solr support: Solr 9.3 is now supported.

Bug Fixes

  • We fixed a bug in licensing for Chinese language codes. (WS-2861)

Third-party component updates

Table 44. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Jackson Annotations | 2.15.0 | 2.15.2 |
| Jackson Core | 2.15.0 | 2.15.2 |
| Jackson Databind | 2.15.0 | 2.15.2 |
| Jackson Dataformat XML | 2.15.0 | 2.15.2 |
| Jackson Dataformat T | 2.15.0 | 2.15.2 |
| Jackson Datatype: Guava | 2.15.0 | 2.15.2 |
| Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2 |
| Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre |
| Protocol Buffers [Core] | 3.21.7 | 3.23.4 |



Release 7.55.7.c70.0

June 2023

Bug Fixes

  • The parameter regexCurrencySplit has been fixed. When set to true, currency values will now extract into two entity types: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE instead of IDENTIFIER:MONEY. (TEJ-1960)
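The effect of regexCurrencySplit can be illustrated with a small sketch. The entity type names come from the release note above; everything else is an assumption for illustration: the regular expression, the function name split_money, and its split argument are hypothetical and do not reflect the patterns REX ships.

```python
import re

# Hypothetical, simplified money pattern: a currency symbol followed by an amount.
MONEY = re.compile(r"(?P<type>[$€£¥])\s?(?P<amt>\d+(?:,\d{3})*(?:\.\d+)?)")


def split_money(text, split=False):
    """Return (entity_type, mention_text) pairs, optionally splitting each money mention in two."""
    entities = []
    for match in MONEY.finditer(text):
        if split:
            # Mimics regexCurrencySplit=true: one entity for the symbol, one for the amount.
            entities.append(("IDENTIFIER:CURRENCY_TYPE", match.group("type")))
            entities.append(("IDENTIFIER:CURRENCY_AMT", match.group("amt")))
        else:
            # Mimics the default: a single IDENTIFIER:MONEY entity.
            entities.append(("IDENTIFIER:MONEY", match.group(0)))
    return entities


unsplit = split_money("The car cost $50,000.")
split = split_money("The car cost $50,000.", split=True)
```

With the default setting, the sketch yields one IDENTIFIER:MONEY entity for "$50,000"; with splitting enabled, it yields "$" as IDENTIFIER:CURRENCY_TYPE and "50,000" as IDENTIFIER:CURRENCY_AMT.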

Known Issues

  • The Solr plugin is not supported on Solr 9.2.x.

Third-party component updates

This release includes the following third-party component changes:

Table 45. Updated

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons Compress | 1.22 | 1.23 |
| Apache Log4J API | 2.19.0 | 2.20.0 |
| Apache Log4J Core | 2.19.0 | 2.20.0 |
| Apache Log4J SLF4J Binding | 2.19.0 | 2.20.0 |
| fastutil | 8.5.9 | 8.5.12 |
| Jackson Annotations | 2.14.0 | 2.15.0 |
| Jackson Core | 2.14.0 | 2.15.0 |
| Jackson Databind | 2.14.0 | 2.15.0 |
| Jackson Dataformat CSV | 2.14.0 | 2.15.0 |
| Jackson Dataformat YAML | 2.14.0 | 2.15.0 |
| Jackson Dataformat XML | 2.14.0 | 2.15.0 |
| Jackson datatype: Guava | 2.14.0 | 2.15.0 |
| Jackson JAXRS:base | 2.14.0 | 2.15.0 |
| Jackson JAXRS:JSON | 2.14.0 | 2.15.0 |
| Jackson module:OLD JAXB Annotations | 2.14.0 | 2.15.0 |
| SnakeYAML | 1.33 | 2.0 |



Release 7.55.6.c69.0

March 2023

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 46. Upgraded

| Package | Old Version | New Version |
| --- | --- | --- |
| Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre |
| Protocol Buffers [Core] | 3.12.2 | 3.21.7 |



Release 7.55.4.c69.0

March 2023

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 47. Upgraded

| Package | Old Version | New Version |
| --- | --- | --- |
| Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre |
| Protocol Buffers [Core] | 3.12.2 | 3.21.7 |



Release 7.55.3.c68.0

January 2023

Bug Fixes

  • We fixed a linking problem introduced in 7.55.0.c68.0, in which the NIL_BIAS parameter value was incorrect. The parameter is now correct for all languages. (TEJ-1899)

Release 7.55.0.c68.0

December 2022

New

  • Wikidata refreshed: We've updated the knowledge base data for the provided linking knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-119, ELK-274, ELK-276)

  • New currency regex: We've introduced a new option, regexCurrencySplit, that, when set to true, will attempt to split entities extracted with the regex engine of type IDENTIFIER:MONEY into two new entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These two new types represent the amount of the currency (50,000) and the currency type ($), respectively. By default, regexCurrencySplit is set to false. (TEJ-1792)

  • Tagalog support: We've added case-insensitive NER support for Tagalog. Previously we released a case-sensitive model and we've now added the case-insensitive model as well. (TEJ-1858)

  • Parameter removed: We've removed the deprecated genre extraction option. This option was used to turn on the linker, which remains available through the linkEntities option. The genre option is no longer available in the REX SDK, the Rosette Server REX configuration, or the Rosette API bindings. (TEJ-1855)

Release 7.54.1.c67.0

October 2022

New

  • This beta release contains a new English statistical extraction model which was trained on financial data. The model has improved accuracy over the default English models for financial domain documents. (TEJ-1837)

Release 7.54.0.c67.0

September 2022

New

  • Tagalog (tgl) support: We've added Tagalog to our list of languages. The following processors are supported: gazetteer, regex, statistical NER, linking. (TEJ-1812, TEJ-1822, TEJ-1785, TEJ-1786)

  • New linking option: We've added a new option for entity linking. When linkMentionMode is set to entities, the linker will attempt to link the entities extracted by other processors (regex, gazetteers, and the statistical processor) instead of using its own processor to extract entity candidates. Depending on your data, this may provide higher accuracy and speed. (TEJ-1806)

  • REXCmd parameter change: The linkEntities parameter can now act as a toggle instead of taking a true/false value, matching how other REXCmd boolean parameters are handled. (TEJ-1806)

    • Previously: REXCmd ... -linkEntities true

    • Now: REXCmd ... -linkEntities

  • Parameter deprecated: The parameter genre is deprecated and will be removed in the next release.

Bug Fixes

  • REX no longer produces an exception when token normalization produces an empty token string. (TEJ-1803)

  • When looking for candidate mentions in text, if there is an overlap between these mentions the linker now resolves the longest spanning mention before disambiguation. (ELK-277)

Release 7.53.3.c67.0

June 2022

New

  • Configure knowledge base linking priority: With multiple knowledge bases, it is possible to set the order in which to try linking against each knowledge base. Set the priority in the redactor configuration file (ne_types.xml). (TEJ-1726, TEJ-1754)

    Example: The following XML element will set the custom-kb priority higher than the default knowledge base (kb-linker) when linking a PRODUCT entity type:

    <ne_type>
      <name>PRODUCT</name>
      <weight name="kb-linker" value="100" />
      <weight name="kb-linker:custom-kb" value="1" />
    </ne_type>

  • relatedEntities renamed to contextWords: When creating a custom knowledge base, the feature contextWords, which was previously called relatedEntities, is required. Context words are language-specific words that are strongly related to the entity. The term relatedEntities has been deprecated. (TEJ-1756)

  • Java 17 support added: Java 8 and 9 support has been removed. (TEJ-1728, TEJ-1763)

  • Solr 9 support added: REX now supports Lucene and Solr 9. (TEJ-1731)

  • Solr 6 support removed: REX no longer supports Solr 6 or earlier. (TEJ-1731)

Bug Fixes

  • Bug fix: An error is no longer generated when there are null prefixes in Arabic morphological analyses. (TEJ-1765)

  • Bug fix: We fixed a bug that prevented the noisy_context_vector feature from being used for disambiguation. (ELK-265, ELK-268, ELS-272, TEJ-1776)

Release 7.53.0.c66.0

March 2022

Notice

Solr 6 and earlier support is deprecated as of this release.

Java 8 and Java 9 support is deprecated as of this release.

Bug Fixes

  • Updated Log4j to version 2.17.1. (TEJ-1724)

Third-party component updates

This release includes the following third-party component changes:

Table 48. Upgraded

| Package | Old Version | New Version |
| --- | --- | --- |
| Apache Commons Compress | 1.9 | 1.21 |
| Apache Commons IO | 2.7 | 2.11.0 |
| Apache Commons Lang3 | 3.32 | 3.12.0 |
| Apache Log4j | 1.2.17 | 2.17.1 |
| Auto Common Libraries | 0.3 | 0.8 |
| AutoService | 1.0-r3 | 0.8 |
| ICU4J | 58.1 | 70.1 |
| fastutil | 8.4.0 | 8.5.6 |
| LibLinear | 2.30 | 2.42 |
| SLF4J | 1.7.28 | 1.7.33 |
| SnakeYAML | 1.26 | 1.30 |
| TensorFlow for Java | 0.2.0 | 0.3.3 |



Table 49. Added

| Package | Version | License |
| --- | --- | --- |
| AOP alliance | 1.0 | Public Domain |
| Apache Commons Logging | 1.2 | Apache License 2.0 |
| Apache Commons Math | 2.0 | Apache License 2.0 |
| Apache POI | 3.9 | Apache License 2.0 |
| DOM4J | 1.6.1 | DOM4J License |
| JCommon | 1.0.17 | GNU Lesser General Public Licence |
| JFreeChart | 1.0.14 | GNU Lesser General Public Licence |
| JUnit | 4.13.2 | Eclipse Public License 1.0 |
| JVM Integration for Metrics | 3.0.4 | Apache License 2.0 |
| Java Architecture for XML Binding | 2.3.2 | Eclipse Distribution License - v 1.0 |
| Java Common Annotations API | 1.3.2 | CDDL + GPLv2 with classpath exception |
| Java Message Service | 1.1 | Common Development and Distribution License (CDDL) v1.0 |
| JavaBeans Activation Framework (JAF) | 1.1 | Common Development and Distribution License (CDDL) v1.0 |
| JavaBeans Activation Framework API jar | 1.2.1 | EDL 1.0 |
| JavaMail API | 1.4 | Common Development and Distribution License (CDDL) v1.0 |
| Javax WS-RS API | 2.1.5 | EPL 2.0 |
| JetBrains Java Annotations | 23.0.0 | Apache License 2.0 |
| Jimfs | 1.1 | Apache License 2.0 |
| Legion of the Bouncy Castle Java Cryptography APIs | 138 | Bouncy Castle License |
| Lib TensorFlow | 1.5.0 | Apache License 2.0 |
| Mockito | 1.9.5 | The MIT License |
| ODFDOM | 0.8.6 | Apache License 2.0 |
| Project Lombok | 1.18.22 | The MIT License |
| Spring | 4.2.4.RELEASE | Apache License 2.0 |
| StAX API | 1.0.1 | Apache License 2.0 |
| Sun Multi-Schema XML Validator | 20050913 | The BSD License |
| TensorFlow | 1.5.0 | Apache License 2.0 |
| XML Commons External Components XML APIs | 1.3.04 | Apache License 2.0 |
| Xerces2 Java Parser | 2.9.4 | Apache License 2.0 |
| XMLBeans | 2.3.0 | Apache License 2.0 |
| ZIP4J | 1.3.2 | Apache License 2.0 |
| iText | 2.1.5 | Mozilla Public License |



Table 50. Removed

Package

Apache Geronimo

JAX-WS

JBoss RMI

JSR203 Hadoop

Jacorb Omg

Jakarta Activation

Jakarta WS-RS API

Jakarta XML Bind API

Javax Activation

Javax Annotation

Javax XML Soap

MIME Pull

SAAJ Impl

STAX-EX



Release 7.52.0.c65.0

December 2021

New

  • New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.42.2.c65.0. (TEJ-1681, TEJ-1693)

Bug Fixes

  • Hungarian dates are now extracted correctly. Previously, dates with embedded periods followed by a space were not being extracted. (TEJ-1681)

  • rexcmd info no longer lists TEMPORAL types by default for SWEDISH. (TEJ-1687)

Release 7.51.1.c65.0

September 2021

Bug Fixes

  • The supported entity types info of the DNN processor now spells PERSON correctly. (TEJ-1670)

Release 7.51.0.c65.0

August 2021

New

  • Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions. (TEJ-1657, TEJ-1658)

  • New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.41.1.c65.0. (TEJ-1667)

Bug Fixes

  • A single line followed by an empty line is no longer always considered a fragment. (ETROG-3431)

  • The Field Training Kit (FTK) no longer returns erroneous error messages from generating wordclasses. (TEJ-1636)

  • The RBL models directory is now correctly specified in the FTK. (TEJ-1655)

  • The REX Training Server (RTS) no longer fails when the request contains the language code msa. msa is now mapped to zsm, the language code supported by REX for Malay. (TEJ-1669)

Release 7.50.0

May 2021

Bug Fixes

  • We fixed a bug where invalid whitespace handling by the DNN processor would cause a runtime exception. (TEJ-1614)

Open Source Changes

Table 51. Upgraded

| Package | Old Version | New Version |
| --- | --- | --- |
| jackson | 2.10.0 | 2.11.1 |
| commons-io | 2.6 | 2.7 |
| fastutil | 8.3.0 | 8.4.0 |
| liblinear | 1.95 | 2.42 |
| snakeyaml | 1.25 | 1.26 |
| stax2-api | 4.2 | 4.2.1 |



Table 52. New

| Package | Version | License |
| --- | --- | --- |
| JavaCPP | 1.5.4 | Apache 2.0 |
| TensorFlow Core API | 0.2.0 | Apache 2.0 |
| TensorFlow NDArray | 0.2.0 | Apache 2.0 |



Table 53. Deleted

Package

libtensorflow

libtensorflow jni

protobuf



Release 7.49.1

April 2021

New

  • Language-specific joiner rules: Custom joiner rules can now be language-specific or apply to all languages. (TEJ-178)

  • New default processing for structured text regions (lists, tables): Structured text is often just words or phrases and so lacks the syntactic context that REX was trained on. Because REX performed poorly on these regions, some users pre-processed input text to remove them. This is no longer necessary: the statistical/DNN model is now turned off by default for structured regions. This mode increases precision but may reduce recall in these regions. Note that the other REX processors (pattern match, exact match, entity linking), which do not rely on context, continue to analyze the structured regions. To turn on the statistical/DNN model for structured regions, set the parameter structuredRegionProcessingType to nerModel. (TEJ-1502)

  • New name classifier model for structured regions (LABS): We've added a new model for processing structured regions. The name classifier classifies a text fragment as PERSON, LOCATION, ORGANIZATION, or NONE. The entire structured region is classified as a single label, an entity type or NONE. It is disabled by default. (TEJ-1613, TEJ-1621)

  • Japanese organization gazetteers: The gazetteers for Japanese organizations have been updated to improve extraction of Japanese organizations. (TEJ-1612)

  • New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.39.0. (TEJ-1618)

  • Rosette Training Server (RTS) results: When using REX with Adaptation Studio (RAS), the results returned by RTS are now preferred by default. (TEJ-1605)

Bug Fixes

  • Entities are no longer extracted when they cross a sentence boundary. To enable entity linking across sentence boundaries, set disableApplySentenceBoundaries to true. (ELK-259)

  • Entities are now checked to ensure they are normalized. (TEJ-1615)

Third-party component updates

This release includes the following third-party component changes:

| Package | Old Version | New Version |
| --- | --- | --- |
| liblinear | 1.94 | 2.42 |

Release 7.48.0

January 2021

Bug Fixes

  • We fixed an offset alignment issue for carriage return normalization. (TEJ-1600)

Release 7.47.0

December 2020

New

  • Updated the internal database for Wikidata linking. QIDs for some entities may change from previous versions, as Wikidata has been refreshed and re-indexed. (TEJ-1579, ELK-249, ELK-251, RWIKI-77)

  • Updated RBL version (TEJ-1579)

Bug Fixes

  • The sqlite-kb-connector sample now works correctly. Runtime issues with sqlite dependencies have been corrected. (ELK-245, ELK-257)

  • Extraction no longer fails when a custom processor returns a NULL annotator; instead a warning is generated. (TEJ-1580)

  • Mentions normalized by the custom processor are no longer ignored. (TEJ-1573)

  • Windows-formatted carriage returns (\r, \r\n) are now handled correctly.

Release 7.46.2

September 2020

New Features

  • Joiner runs before redactor: The joiner now runs before the redactor by default, providing more flexibility and control over the joiner results. Set runJoinerPostRedactor to true to run the joiner after the redactor. (TEJ-1534)

  • Improved phone number recognition: Regular expressions for phone number extraction have been improved and now extract more phone number patterns. (TEJ-1556)

  • REXCmd input from stdin: REXCmd can now accept input from stdin by specifying the command line option -stdin.

    Example:

    $ echo "Basis Technology is a company in Massachusetts" | REXCmd extract -stdin -langCode eng

Bug Fixes

  • We fixed a bug where sometimes a null pointer exception was returned when the custom processor and the linker had overlapping results. (TEJ-1561)

  • Custom processors can now only modify the entity and metadata sections of the ADM. Previously, any modification could be made which could override annotation data. (TEJ-1537)

  • We've partially fixed a problem in Japanese ORG extraction where sometimes the model extracts multiple ORG entities or includes non-related adjacent tokens. (TEJ-1534)

  • The Field Training Kit no longer generates invalid models when creating custom knowledge bases. This occurred for all languages except eng, jpn, and zho. (ELK-252)

Release 7.46.0

June 2020

New Features

  • Improved sample: The sample files to build the SQLite connector described in the Custom Knowledge Base Connectors section now include all files required to build with Maven. The configuration to run the connector with Rosette Enterprise is now provided as well. (TEJ-1508)

  • Language-specific aliases: Custom knowledge bases compiled with the Field Training Kit (FTK) now maintain the language of the alias. Aliases are only extracted in documents of the language the alias is defined for. Aliases can be defined for all languages or for a specific language. (ELK-241)

  • Custom knowledge bases can be compiled without disambiguation. While adding a knowledge base without a disambiguation model will not provide the best results, it will function as an enhanced gazetteer that attaches an assigned ID to each gazetteer entry and supports multiple aliases per entry. To compile a custom knowledge base without compiling a disambiguation model, pass -d as an argument to train-linker-model. (ELK-233)

  • New method: The method getBaseLinguisticsParameters has been added to retrieve the base linguistics parameters that were used in training the model. Use the retrieved parameters to configure an external instance of RBL to produce tokens consistent with the training tokenization. A new sample application, RBLParametersSample.java, is available in the samples directory. (TEJ-1501)

  • Base linguistics added: The FTK can now use input ADM files containing base linguistics annotations, such as tokens, sentence boundaries, and morphological analysis for languages such as Korean and Arabic. For REX to produce the optimal results, tokenize with the options provided by the getBaseLinguisticsParameters method when creating the ADM file from RBL. (APE-1793)

  • Hebrew improvements: REX has improved Hebrew normalization, and the disambiguator can now identify prefixes removed from the entity's normalized form. These improvements are a result of enhancements in Hebrew base linguistics. (ETROG-3189)

Bug Fixes

  • A new line character in a regex (\n) will now also match carriage returns (\r) and a combination of both (\r\n). (TEJ-1525)

  • Confidence scores for entity linking now use the same scale, whether linking to Wikidata or a custom knowledge base. Previously, the confidence scores given for links to custom knowledge bases were much lower than those calculated for the Wikidata knowledge base. (ELK-240)
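The newline behavior described in the first fix above (a \n in a regex also matching \r and \r\n) can be illustrated with a small sketch. It uses Python's re module rather than the REX regex engine, and the helper name widen_newlines is hypothetical; it shows the general idea of widening a newline escape, not how REX implements it.

```python
import re

# Matches a Windows (\r\n), old Mac (\r), or Unix (\n) line ending.
# \r\n is listed first so the pair is consumed as one line ending.
ANY_NEWLINE = r"(?:\r\n|\r|\n)"


def widen_newlines(pattern):
    """Rewrite each escaped newline (backslash followed by n) in a regex source to match any line ending.

    Limitation of this sketch: an escaped backslash followed by a literal n is also rewritten.
    """
    return pattern.replace(r"\n", ANY_NEWLINE)


# A pattern written with \n now matches all three line-ending styles.
rx = re.compile(widen_newlines(r"Name:\nJohn"))
matches = [bool(rx.search(s)) for s in ("Name:\nJohn", "Name:\rJohn", "Name:\r\nJohn", "Name: John")]
```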

Release 7.45.0

March 2020

New Features

  • Connector framework for custom Knowledge Bases added. See section 5.6 in the Application Developer's Guide. (TEJ-1476, TEJ-1477, TEJ-1485)

  • Added Deep Neural Network model for Hebrew for improved accuracy. Replace statistical model with it by using the flag -useDeepNeuralNetworkProcessor. (TEJ-1503)

  • Hebrew normalization improved: Instead of using the lemma form, normalization now removes only the prefixes, except the definite article. (TEJ-1505)

  • New statistical model for Hebrew trained on news and finance data. (TEJ-1497)

  • Solr plugin now available as a Docker container. (TEJ-1492)

  • Supplemental regex support for ISO-6709 geo-coordinates. (TEJ-1431, DATA-761)

  • Support for setting prioritization for multiple custom Knowledge Bases. See section 5.2 in the Application Developer's Guide. (ELK-236)

  • Redactor weights can now be configured for specific subsources. See section 3.2.1 in the Application Developer's Guide. (TEJ-1480)

  • Separate license key required for linker custom Knowledge Bases. Note: extractions against existing custom Knowledge Bases will fail unless licenses are updated. (TEJ-1483)

  • Custom Knowledge Bases can be set in Rosette Enterprise profiles. Note: To support this feature, the flinx directory was moved into {rex-installation}/data. Any custom data inside must also be moved to the new location. (TEJ-1494)

Bug Fixes

  • TEJ-1499 REXAnnotatorFactory failed to assign linking confidence thresholds.

  • TEJ-1479 Fixed dynamic gazetteers for Malay.

  • TEJ-1506 Deep Neural Network extractions failed in REXCmd.

Release 7.44.1

February 2020

Bug Fixes

  • TEJ-1470 Fixed an error when extracting entities in Arabic text.

Release 7.44.0

December 2019

New Features

  • New language: Entity extraction now supports Swedish. (TEJ-1395)

  • Manual cache memory eviction functionality added to the SDK. See the Application Developer Guide for details. (TEJ-1450)

Bug Fixes

  • ELK-169 Fixed an error when reading aliases binaries produced with the FTK.

  • ELK-179 FTK no longer requires English or Chinese models for other languages.

Release 7.43.2

September 2019

Bug Fixes

  • Re-enabled and improved decompounding for Japanese and Chinese. It was disabled in 7.43.1.

Release 7.43.1

August 2019

Bug Fixes

  • Disabled decompounding for Japanese and Chinese

Release 7.43.0

August 2019

New Features

  • Tested and confirmed compatibility with Java 11.

  • Updated internal database for Wikidata linking. The DBPedia Type field now supports multiple subtypes. QIDs for some of the entities may change from previous versions, as Wikidata has been refreshed and re-indexed.

  • Entity linking returns PermIDs (IDs from Thomson Reuters knowledge base) in addition to QIDs (Wikidata IDs) for some of the entities.

Bug Fixes

  • Fixed a potential Null Pointer Exception which could have occurred while using DNN.

Release 7.42.2

July 2019

Bug Fixes

  • Fixed multi-threading bug with custom gazetteer

Release 7.42.1

June 2019

New Features

  • Flinx disambiguation models are packaged with optional parameter files which control some parameters during runtime. These files were missing from several previous distribution packages, which may have affected accuracy. They have now been re-added to the distribution packages.

  • Fixed additional cases where Japanese characters were wrongly normalized into their simplified Chinese equivalents in entity linking, an issue addressed also in the previous release.

  • Chinese language code is now composed of three characters uniformly throughout the file system.

  • Entity extraction and entity linking now consume the latest version of RBL (Rosette Base Linguistics), which includes several improvements and bug fixes.

  • Improved installation by providing a script to facilitate unzip and installation of documentation and language packages.

Bug Fixes

  • Japanese date extraction now includes the new era 令和.

Release 7.41.0 and earlier

New Features

Release 7.41.0

  • In Japanese, a middle dot comes in the middle of Western names and acts as a sort of whitespace separating words in the name. Previously, some entities containing a middle dot were split into two entities, so that only part of the name was extracted. This is now handled correctly, and entities containing a middle dot are not split. (TEJ-1341)

  • In Japanese entity linking, in some cases the last character of the Japanese word changed into a Chinese character. This is now fixed. (ELK-118)

  • Previously, when the includeDbPediaTypes option was off, entity linking occasionally extracted an inaccurate entity type. Now, fine types of linked entities are also identified when the includeDbPediaTypes option is off. (ELK-115)

  • Provided a distribution package per language. (TEJ-1361)

Release 7.39.0

  • Reduced linker data package size (TEJ-1306, TEJ-1321)

  • Updated the linking confidence calculation and thresholds to improve accuracy (TEJ-1343)

Release 7.38.1

  • Improved the accuracy of Korean extraction, largely through better handling of Josa (postpositions) and compound words.

  • Added support for Entity linking to Wikipedia for both the top level types (PERSON, LOCATION, ORGANIZATION, ETC.) as well as the over 700 DBpedia types in the remaining 16 languages supported by Entity Extraction. This is in addition to the languages currently supported by entity linking: Chinese, English, Japanese, and Spanish.

Release 7.36.0

  • The linker process now has the option of returning over 700 new entity types drawn from the DBpedia ontology. To access these entity types, turn on the kbLinker processor and add the includeDBpediaType flag to the factory configuration. You’ll notice more than 10 additional primary types in the type field as well as the all-new DBpedia type field. Note that this is a LABS (experimental) API and subject to change. Send us your feedback!

  • New language: Entity extraction now supports Hungarian.

Release 7.35.0

  • Replaced Japanese tokenizer to improve accuracy (TEJ-1176, TEJ-1180)

Release 7.34.0

  • Enabled string normalization for Hebrew based on the DNN disambiguation model to improve indoc-coref results (chaining Mentions into a single Entity) and to present a more proper form of the name (TEJ-1139, TEJ-1173)

  • Social-media characters such as '@' and '#' are removed from a Mention's normalized string; offsets to the original string data field remain the same. This feature can be disabled by EntityExtractor.setRetainSocialMediaSymbols() (TEJ-418)

  • Improved statistical model confidence score to emit maximal confidence less frequently (TEJ-1146)
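The social-media normalization note above (TEJ-418) separates two things: the normalized string, which loses the symbol, and the offsets, which still point at the original text. A minimal sketch of that separation follows. The classes are hypothetical stand-ins, not the REX Mention class, and the rule of dropping only leading '@' and '#' characters is an assumption made for illustration.

```java
// Illustration only: hypothetical classes, not the REX API.
final class SocialMediaNormalizationSketch {
    static final class MentionSketch {
        final int start;          // offset of the first character in the original text
        final int end;            // offset just past the last character
        final String normalized;  // display form with leading '@' or '#' removed

        MentionSketch(int start, int end, String normalized) {
            this.start = start;
            this.end = end;
            this.normalized = normalized;
        }
    }

    // Assumption: only leading '@' and '#' characters are dropped.
    static String stripSocialSymbols(String surface) {
        int i = 0;
        while (i < surface.length()
                && (surface.charAt(i) == '@' || surface.charAt(i) == '#')) {
            i++;
        }
        return surface.substring(i);
    }

    // Builds a mention whose offsets still cover the symbol in the original text.
    static MentionSketch mentionAt(String text, int start, int end) {
        return new MentionSketch(start, end, stripSocialSymbols(text.substring(start, end)));
    }

    public static void main(String[] args) {
        String text = "Thanks @BabelStreet for #REX";
        MentionSketch m = mentionAt(text, 7, 19);
        // The offsets still select "@BabelStreet" in the original text.
        System.out.println(text.substring(m.start, m.end) + " -> " + m.normalized);
    }
}
```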

Release 7.33.0

  • Added static and dynamic capabilities for adding entries to the custom knowledge base for entity linking (ELK-30, ELK-41, ELK-44, TEJ-1150)

  • Added a new deep neural network processor (BETA) as an alternative entity extraction processor, which can be used in place of the standard statistical extractor for English, Arabic and Korean (TEJ-1132, TEJ-1142, TEJ-1150)

Release 7.32.0

  • Added support for entity type "Title" in Hebrew (APE-1641)

  • Added new option MaxResolvedEntities to REXCmd (TEJ-1137)

Release 7.30.0

  • Accuracy of Korean statistical model is improved (APE-1737)

  • Default linking confidence thresholds are set (TEJ-1080, TEJ-1068)

  • The method setUseDeepNeuralNetworkProcessor() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API that replaces the statistical model with a new deep learning model. Alternatively, pass ProcessorType.deepNeuralNetwork to the method setProcessors. Currently available only for English and Arabic. Some operating systems do not support the deep neural network model, and some do not provide good latency.

Release 7.29.0

  • FTK supports training a disambiguation model for custom knowledge base (ELK-13, ELK-14, ELK-16, ELK-22, ELK-34)

  • Added default linking confidence threshold for linked entities (TEJ-1068)

  • Updated RBL version (TEJ-1048)

  • Application developer’s guide and customization guide are merged into a single manual (ELK-22)

Release 7.28.0

  • Added confidence score for linking results, named linkingConfidence, in addition to statistical model confidence score. Different thresholds apply for the different confidence scores. (TEJ-974)

Release 7.27.0

  • The new salience classifier has been incorporated. The salience calculation is enabled via EntityExtractor.setCalculateSalience() or by setting calculateSalience in either REXFactoryConfiguration or REXAnnotatorConfiguration. (TEJ-936)

  • Added manual custom processor registration API. (TEJ-972, TEJ-982)

  • Deduplicated partially duplicated regexes. (TEJ-785)

  • Improved multiple gazetteer (exact-match) processor support. (TEJ-960, TEJ-1005)

Release 7.26.0

  • Confidence score calculation is improved to correlate well with precision, and may be used for thresholding and removal of false positives (TEJ-910, TEJ-919)

  • Statistical models are trained with new emoticon-sensitive tokenizer (TEJ-924)

  • New script allows repacking REX with minimal configuration per language (TEJ-893)

  • Automatic case sensitivity mode prefers case-sensitive for short text by default (TEJ-931)

Release 7.25.0

  • Added Custom Processor for rejection. (TEJ-840, TEJ-841, TEJ-843, TEJ-880)

  • Redactor improvements: dynamic rules prioritization and subtypes handling (TEJ-863, TEJ-858)

  • Pronominal resolver is fully supported for English. Added as a processor type, as well as indoc-coref (TEJ-867)

  • Indoc-coref allows partial match for ORGANIZATION type (INDOC-26)

Release 7.24.1

  • Added full support for Vietnamese. (APE-1691)

  • Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/MapReduceExample/README.md. (TEJ-807)

  • The method setResolvePronouns() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to resolve pronouns like 'he' and 'she' to entities of type Person. It may be changed or removed in future releases. Available only in English. (TEJ-831)

  • Reject regex and gazetteers allow wildcard entity type. (TEJ-853, TEJ-817)

  • The kb-linker experimental processor now supports Chinese and Japanese in addition to English. This functionality may be changed or removed in future minor releases. (TEJ-857)

  • Automatic case sensitivity improved (English only). (TEJ-861)

Release 7.23.1

  • Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/SparkEntityCount/README.md. (TEJ-155)

  • Information about what languages are licensed can now be accessed. The methods getLanguageInformation() and getSupportedEntityTypes() in com.basistech.rosette.rex.EntityExtractor now take in a flag for whether or not to return information on all languages REX supports, or just those that are licensed. Additionally, REXCmd info now has a -onlyLicensed option. (TEJ-767)

Release 7.22.0

  • Added partial support for Vietnamese to extract phone numbers and dates using regexes. (TEJ-740)

  • REX annotators are now created faster and can feasibly be created on a document level when using the new com.basistech.rosette.rex.REXAnnotatorFactory API. The current com.basistech.rosette.rex.EntityExtractor API now also has a faster startup time. See the javadocs, the API Overview section in the Application Developer’s guide, and the sample program at ./samples/EntityAnnotatorFactorySample.java for additional details. (TEJ-773)

Release 7.21.0

  • REX now reports its results using two new classes, Entity and Mention, such that each Entity in a document has one or more Mentions that refer to the same real-world identity. Moving forward, this API will replace EntityMention and its coreferenceChainId. This version of REX is backwards-compatible and still supports the deprecated EntityMention. (TEJ-702)

  • Improved Malaysian statistical model and added a new Malaysian gazetteer. (TEJ-711, TEJ-715)

  • kbLinker (flinx) is part of a new experimental API to link entities from social media text to knowledge bases and may be changed or removed in future minor releases. (TEJ-722, TEJ-725, TEJ-757)

  • REX has been upgraded to Rosette Platform compatibility level 58.2. If you intend to use more than one Rosette JVM SDK in a single application, then you should choose versions that have the same compatibility number. (TEJ-702)

  • Improved case-insensitivity detection in European languages. (TEJ-687)

  • Entity mentions extracted with the statistical model now also specify the model’s path as a subsource. (TEJ-724)

  • Added a new method, EntityExtractor setOverlayDataDirectory(Path overlayDataDirectory), that allows you to specify an additional data directory for REX to use. (TEJ-731)

  • The REXCmd command line utility now allows you to specify any additional regex files you want to use, besides just the default. (TEJ-628)
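The Entity and Mention model introduced in this release (TEJ-702) can be pictured as a grouping of flat mentions by their chain ID. The sketch below is a minimal illustration under that assumption. FlatMention and groupByChain are hypothetical names, not REX classes or methods, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical types, not the REX Entity and Mention classes.
final class EntityGroupingSketch {
    // A flat mention: surface text plus a chain ID shared by coreferent mentions.
    static final class FlatMention {
        final String text;
        final int chainId;

        FlatMention(String text, int chainId) {
            this.text = text;
            this.chainId = chainId;
        }
    }

    // Groups flat mentions so that each chain ID maps to its mentions, in first-seen order.
    static Map<Integer, List<String>> groupByChain(List<FlatMention> mentions) {
        Map<Integer, List<String>> entities = new LinkedHashMap<>();
        for (FlatMention m : mentions) {
            entities.computeIfAbsent(m.chainId, k -> new ArrayList<>()).add(m.text);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<FlatMention> flat = new ArrayList<>();
        flat.add(new FlatMention("Barack Obama", 0));
        flat.add(new FlatMention("Chicago", 1));
        flat.add(new FlatMention("Obama", 0));
        // prints {0=[Barack Obama, Obama], 1=[Chicago]}
        System.out.println(groupByChain(flat));
    }
}
```

Each map entry plays the role of one entity with one or more mentions, which is the shape the new API reports directly.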

Release 7.20.0

  • Added full support for Standard Malay to extract standard entities using regexes, gazetteers, and statistical processors. (TEJ-704)

Release 7.19.3

  • Added support for extraction using two statistical models operating in tandem. (TEJ-674)

  • In order to reduce disk footprint, Big Endian binaries are no longer shipped. REX will correctly memory map Little Endian models and dictionaries even on Big Endian systems. (TEJ-664)

  • Optional new packaging: RBL and REX classes are available in one combined jar. (TEJ-692)

Release 7.18.0

  • This is a maintenance release to address SUPPO-569.

Release 7.17.0

  • Added new customization for statistical processors with unsupervised field training. See Section 4.5 of the Application Developer Guide.

  • Added partial support for Malaysian. (TEJ-626)

Release 7.16.0

  • Added a new setting for REX to automatically choose the most accurate CaseSensitivity model (case-insensitive or case-sensitive) for the input text. This is not activated by default, see the sample programs or javadocs for reference on how to enable this feature. (TEJ-568)

  • Added case-insensitive models for German, Italian, Dutch, and Spanish. (TEJ-566)

  • REX is now built with JDK 1.7, so users can no longer run REX on Java Virtual Machines versioned 1.6 and earlier. (TEJ-551)

  • Improved accuracy of the English statistical model by using multiple Brown clusters. (TEJ-396, TEJ-559)

  • New disableStatisticalCleaner option added to REXCmd and EntityExtractor. (TEJ-379)

  • You can now reactivate regular expression-based entities that are disabled by default by instructing REX to load the regex files in each language’s supplemental directory. See the Javadoc for EntityExtractor.addRegularExpressions(). (TEJ-587)

  • Refined redaction rules for PERSON entities. (TEJ-115)

  • EntityMentions and the returned fields are now documented in the Application Developer Guide. (TEJ-623)

Release 7.15.0

  • Added a boolean caseSensitive parameter to the EntityExtractor’s addGazetteer and addGazetteerEntity methods, to allow case-insensitive string matching of user-provided textual gazetteer entries. (TEJ-56)

  • A RosetteUnsupportedLanguageException is now thrown when REX cannot find data for the requested language, instead of a generic runtime exception. (TEJ-536)

  • Adding a duplicate gazetteer entry will overwrite the existing one. (TEJ-74)

Release 7.14.0

  • Support for script-insensitive Chinese added: Entities are now extracted from Chinese input documents for which the 'Simplified' or 'Traditional' writing system is not specified. Applications may now submit text using the zho language code instead of specifying zhs or zht. (TEJ-525)

  • Version 7.14.0.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis Technology JVM SDK in a single application, then choose versions that have the same compatibility number. (TEJ-524)

Release 7.13.0

  • Indonesian (Bahasa Indonesia) support added. (TEJ-441)

  • To enhance performance, and in response to customer feedback, we deactivated the regular expressions for extracting the following entity types: IDENTIFIER:DISTANCE, IDENTIFIER:LATITUDE_LONGITUDE, IDENTIFIER:UTM, TEMPORAL:DATE, and TEMPORAL:TIME. You can restore support for any of these entity types by removing the @ignore=rex-je attribute value that appears in front of the relevant regular expressions in the regexes.xml files. (TEJ-510)

  • Improved speed and reduced memory consumption of regex and gazetteer matching. (TEJ-489)

  • Improved the accuracy of the statistical models for case-insensitive Portuguese and French. (TEJ-475)

  • Added the ability to modify the indoc shut-off threshold: setMaxResolvedEntities(). (TEJ-456)

  • Introduced an experimental EntityExtractor API for excluding entity types in which you are not interested. See the Javadoc for {get,set}ExcludedEntityTypes for details. Note that this is an experimental API which can change or be removed in future minor versions of REX. (TEJ-480)

Release 7.12.1

  • REX emits an exception when asked to extract in a language it has no data for. (TEJ-447)

  • Missing values for confidence and coreferenceChainId are now represented as nulls instead of -1. (TEJ-466, TEJ-467)
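The switch from -1 to null for missing values (TEJ-466, TEJ-467) changes how calling code should test for an absent confidence. The following is a minimal sketch with hypothetical helper names, not REX methods. The policy of keeping mentions that carry no score is an assumption made for illustration.

```java
import java.util.Locale;

// Illustration only: hypothetical helpers for handling a nullable confidence value.
final class NullableConfidenceSketch {
    // Assumed policy: keep a mention when it has no score (null)
    // or when its score meets the threshold.
    static boolean keep(Double confidence, double threshold) {
        // The null check must come first: unboxing a null Double throws NullPointerException.
        return confidence == null || confidence >= threshold;
    }

    // Renders a confidence for display without mistaking "missing" for a number.
    static String describe(Double confidence) {
        return confidence == null
                ? "no score"
                : String.format(Locale.ROOT, "%.2f", confidence);
    }

    public static void main(String[] args) {
        System.out.println(keep(null, 0.5));  // true  (no score to compare)
        System.out.println(keep(0.3, 0.5));   // false (below threshold)
        System.out.println(describe(null));   // no score
        System.out.println(describe(0.5));    // 0.50
    }
}
```

With the old -1 sentinel, a plain threshold comparison would silently discard unscored mentions; with null, the absence has to be handled explicitly.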

Release 7.12.0

  • Improved Korean Entity Extraction and In-Document Co-reference Resolution

  • REX uses a new statistical model that achieves higher accuracy (about 25% overall error reduction). (APE-1111)

  • In-document co-reference resolution now recognizes Korean prefixes and suffixes, and will attempt to chain morphological variations of Korean entity mentions. (TEJ-366)

  • Added features to the REXCmd utility

  • Plaintext output now pretty-prints the chain ID for each entity mention it returns. Mentions with the same chain ID refer to the same entity. (TEJ-410)

  • The -context option marks the entities in their original text context, with embedded entity type and chain ID. (TEJ-410)

  • REXCmd now supports pre-annotated input in json-serialized Annotated Text format. (TEJ-455)

  • Modified the reporting of text offsets for partial regular-expression and gazetteer matches to align with the token boundaries of the tokens that contain the matched text. (TEJ-393)

  • The REX Tcl implementation of regular expressions now supports characters in the Supplementary Multilingual Plane (SMP) (TEJ-454). Previous releases represented SMP codepoints as two characters each. (TEJ-330)

  • Provided an EntityExtractor option to instruct the REX statistical processor to ignore the lowercase/uppercase distinction. This feature is currently supported for English, French, and Portuguese. (TEJ-458)

  • Added the EntityExtractor.createDispatchAnnotator method to allow annotating documents from a predefined set of languages. (TEJ-432)

  • Added Portuguese regular expressions for temporal and monetary expressions. (TEJ-444)
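The offset change for partial matches described above (TEJ-393) amounts to widening a match span until it covers every token it touches. The sketch below illustrates that idea under simplifying assumptions: tokens are split on whitespace, which is not how REX tokenizes, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: whitespace tokenization stands in for REX's tokenizer.
final class TokenAlignedOffsetsSketch {
    // Returns {start, end} offset pairs for whitespace-delimited tokens.
    static List<int[]> whitespaceTokens(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                spans.add(new int[] {start, i});
            }
        }
        return spans;
    }

    // Widens the half-open span [matchStart, matchEnd) to cover every token it overlaps.
    static int[] alignToTokens(List<int[]> tokens, int matchStart, int matchEnd) {
        int start = matchStart;
        int end = matchEnd;
        for (int[] t : tokens) {
            boolean overlaps = t[0] < matchEnd && matchStart < t[1];
            if (overlaps) {
                start = Math.min(start, t[0]);
                end = Math.max(end, t[1]);
            }
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "Call 555-0100x12 today";
        // A partial match on "0100" covers offsets [9, 13) inside the token "555-0100x12".
        int[] aligned = alignToTokens(whitespaceTokens(text), 9, 13);
        System.out.println(aligned[0] + "," + aligned[1]); // prints 5,16
    }
}
```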

Release 7.11.0

  • Added the Fragment Boundary Detector to enable the extractor to separate entities in text fragments that do not form sentences. (TEJ-117)

  • Refined the usage pattern for the com.basistech.rosette.rex.REXCmd command-line utility. The JSON output this utility generates has been trimmed to represent the serialization of an AnnotatedText object. (TEJ-374)

Release 2.2.0

  • Deprecated com.basistech.rosette.rex.EntityCursor. Use com.basistech.rosette.rex.EntityExtractor and com.basistech.rosette.dm.Annotator to extract entities from an input document.

  • Re-established pattern matcher support for the regular expressions disabled in 2.1.0.

Release 2.1.0

  • Added a document-level API for extracting entities. Over time, we expect to deprecate EntityCursor (streaming) in favor of EntityExtractor and Annotator (document-level extraction).

  • Added an API (the EntityExtractor setPostConfidence method) for extracting a confidence floating point value for each entity that REX Java Edition finds. A potential entity is ignored if its confidence score is below the threshold set with the setConfidenceThreshold method. (TEJ-60)

  • REX Java Edition returns normalized entities from all sources: statistical, pattern matching, and exact matches (gazetteers). (TEJ-288)

  • We have disabled the use of pattern matcher regular expressions that do not terminate properly, consuming large amounts of CPU time. This change could cause REX to miss some temporal, distance and long/lat expressions. If your use case requires high recall on these numeric types, please contact analyticssupport@babelstreet.com for assistance on enabling these regular expressions. (TEJ-283)

  • The REXCmd command-line utility now includes the REX Java Edition version number in the output. (TEJ-261)

  • Added a shell script (Unix) and .bat file (Windows) to simplify running the REXCmd command-line utility. (TEJ-247)

  • The JSON document generated by the command-line utility is much more verbose than in previous releases. Accordingly, if you are not writing the output to a file, you may want to pipe the output through a JSON parser (e.g., | python -mjson.tool) and concentrate on the EntityMention elements. (TEJ-260)

  • Added support for Arabic, Simplified Chinese, Traditional Chinese, Korean, and Japanese.

  • Support for using the REX Field Training Kit to enhance accuracy handling a particular category of documents and to return new entity types. (TEJ-220)

Release 2.0.0

  • Added support for Dutch, Hebrew, Persian (Western Farsi and Dari), Portuguese, Pashto, and Urdu.

  • Incorporated improvements to the statistical language models, gazetteers, and regular expressions introduced in the Rosette C++ REX implementation since the release of REX Java Edition 1.1.

  • Enhanced the command-line utility with new options. (TEJ-212)

  • Added support for resetting the maximum number of tokens that an entity may include, which defaults to 8. Use the EntityExtractor setMaxEntityTokens(int) method. (TEJ-188)

Release 1.1

  • Added support for Uppercase English, French, German, Italian, Russian, and Spanish.

  • Added support for resolving coreferences to the same entity. Use EntityExtractor setResolveNamedEntities(true) to put coreferences to the same entity in an entity chain: see EntityCursor getChainId(). (TEJ-52)

  • In response to customer feedback, removed IDENTIFIER:NUMBER from the default set of entity types returned by regular expressions. We commented out the IDENTIFIER:NUMBER entries in the regexes.xml files in data/regex/lang/accept, so you can re-activate any of these entries if you wish. (TEJ-171)

  • Added a public EntityCursor hasNext() method that can be used to determine whether there are any more entities in the result set, without advancing to the next entity. (TEJ-150)

  • Added the following EntityExtractor methods:

public void setStatisticalModel(LanguageCode, InputStream);
public void addGazetteer(LanguageCode, InputStream, boolean);
public void addGazetteer(LanguageCode, InputStream);
public void addRegularExpressions(LanguageCode, InputStream, boolean);
public void setRedactorWeights(InputStream);
public void addJoinerRules(InputStream);
public void setLicense(InputStream);
These methods enable access to data files placed in a JAR file (perhaps for use in a Hadoop environment). (TEJ-59)

Bugs Fixed

Bug number is followed by a brief bug description.

Fixed in 7.40.0

  • TEJ-1349 Fixed linker head mention bug

  • TEJ-1353 Fixed missing entity types in linker results

Fixed in 7.39.1

  • TEJ-1281 Set log level to Debug instead of Warn when linkEntities and genre don’t agree

  • TEJ-1319, TEJ-1346, ELK-114 Picked up new TVEC to improve the efficiency of the initial load time

  • TEJ-1324 Fixed a bug where the salience score was not always returned for entities with pronominal mentions, when requested.

  • TEJ-1327 Consumed new RBL to fix null pointer exception with pronoun resolver

  • TEJ-1331 Removed xxx from reported supported languages

  • APE-1766 Fixed a bug where all entity mentions were not always returned by statistical model

Fixed in 7.38.1

  • TEJ-1282 Relocated LIBLINEAR and TVEC, tested with rli

  • TEJ-1283 Fixed cases in which QID was not returned although DBPedia result was available, when using REX’s Kblinker

  • TEJ-1292 Custom processors made configurable via REXFactoryConfiguration

  • ELK-82 Fixed an acronym feature issue.

Fixed in 7.34.0

  • TEJ-1160 Moved ORGANIZATION eng regexes to supplemental directory to improve performance

  • TEJ-1167 Improved failure message for missing flinx data directory

  • TEJ-1168 Fixed null pointer exception with long chains

Fixed in 7.31.0

  • TEJ-1099 Added additional currency symbols to regex

  • TEJ-1108 Fixed hexadecimal number string incorrectly extracted as product

Fixed in 7.29.0

  • TEJ-1067 Fixed calculateConfidence configuration bug

  • TEJ-1020 Added regex for Israeli ID number

  • TEJ-1054 Cleaned MD5 codes extracted as PRODUCT entity type

  • TEJ-1049 Fixed case-insensitive text file gazetteer treated as case-sensitive

Fixed in 7.28.1

  • TEJ-1050 Fixed annotator configuration object being modified while creating.

Fixed in 7.28.0

  • TEJ-1041 Improved redactor rules for extractor and linking overlaps and indoc-coref to prefer chaining based on linking ID.

  • TEJ-1039 Enabled emoticon mode for RBL for RosAPI

  • TEJ-1042 Fixed error handling for requesting salience score while indoc-coref is disabled

Fixed in 7.27.0

  • TEJ-950 Improved custom processor examples.

  • TEJ-951 Apply American SSN regex for English only.

  • TEJ-962 Fixed personal ID entity type in Vietnamese regex.

  • TEJ-1005 Fixed source and subsource for static user gazetteer.

Fixed in 7.26.4

  • TEJ-1006 Improved handling of the case where the Arabic prefix is null.

  • TEJ-1014 Fixed shading of dependencies.

Fixed in 7.26.3

  • TEJ-971 Fixed a concurrency issue in kb-linker.

  • TEJ-977 Fixed to return null for confidence from non-statistical entities.

Fixed in 7.26.2

  • TEJ-952, TEJ-963 Fixed lookup issue in wordclasses

Fixed in 7.25.0

  • TEJ-693 setStatisticalModel without caseSensitivity assumes case-sensitive

Fixed in 7.23.1

  • TEJ-813 Relocated more RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.

Fixed in 7.22.0

  • TEJ-477 REXCmd now emits a usage error when 'info' is used with no info command.

Fixed in 7.21.0

  • TEJ-755 Relocated RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.

  • TEJ-735, TEJ-738 Fixed a bug in which Entity Type Survey didn’t support non-default field training models or customer-defined types.

  • TEJ-747 Fixed a bug in which REXCmd produced unexpected offsets for files with DOS-style line endings.

  • TEJ-718 Added Malaysian sample text.

Fixed in 7.20.2

  • TEJ-717 Minor update to zsm model in 'single' distribution form.

Fixed in 7.20.0

  • TEJ-694 Added a missing file resource release in addGazetteer.

Fixed in 7.19.3

  • TEJ-676, TEJ-691 Shaded and relocated all 3rd party dependencies.

  • TEJ-683 Removed extraneous pom.xml’s from META-INF in distro jar.

Fixed in 7.18.0

  • SUPPO-569 Fixed incompatibility issues with RBL.

Fixed in 7.17.1

  • TEJ-659 Fixed support for custom tags in statistical model parser.

  • APE-1608 Fixed REXCmd’s behavior with case-insensitive models and the -model command line option.

Fixed in 7.17.0

  • SUPPO-536 Fixed partial regex extraction for Korean.

Fixed in 7.16.0

  • TEJ-573 REXCmd now ignores a BOM at the beginning of its input file.

  • TEJ-574 Lookbehind assertions are not supported, and this is now included in the Application Developer Guide.
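For applications that read input files themselves rather than through REXCmd, the BOM handling noted above (TEJ-573) can be approximated as follows. This is a minimal sketch, not the REXCmd implementation, and the class and method names are hypothetical. It relies on the fact that Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: a hypothetical helper, not part of REX or REXCmd.
final class BomSkipSketch {
    static final char BOM = '\uFEFF';

    // Decodes UTF-8 bytes and drops a single leading byte order mark, if present.
    static String decodeWithoutBom(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed here by "OK".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'O', 'K'};
        System.out.println(decodeWithoutBom(withBom));          // OK
        System.out.println(decodeWithoutBom(withBom).length()); // 2
    }
}
```

If the mark is left in place, it counts as one character, so every character offset computed over the text is shifted by one relative to the visible content.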

Fixed in 7.14.0

  • TEJ-512 Regex xml parser misses some CDATA sections (URL extraction issues)

Fixed in 7.13.0

  • TEJ-459, TEJ-507 indoc chaining applied only to PER/LOC/ORG

  • TEJ-498 Application Developer Guide claims that REX extracts full postal addresses

Fixed in 7.12.1

  • TEJ-486 EntityAnnotator reuse problems

  • TEJ-479 RBL dictionary markers in distro

Fixed in 7.12.0

  • TEJ-460 Partial match regex produces wrong offsets

  • TEJ-438 ICU dependency not relocated/shaded

  • TEJ-415 REX requires an RBL license