Entity Extractor (REX)

Release Notes

Release 7.56.1.c78.0

June 2025

New

Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

Third-party component updates

Table 29. Updated

Package	Old Version	New Version
Apache Commons IO	2.18.0	2.19.0
Apache Commons Text	1.13.0	1.13.1
Apache CXF Core	4.1.1	4.1.2
Apache CXF JAX-RS Client	4.1.1	4.1.2
Apache CXF Runtime HTTP Transport	4.1.1	4.1.2
Apache CXF Runtime JAX-RS Frontend	4.1.1	4.1.2
Apache CXF Runtime Security functionality	4.1.1	4.1.2
Guava InternalFutureFailureAccess and InternalFutures	1.0.2	1.0.3
Guava: Google Core Libraries for Java	33.4.0-jre	33.4.8-jre
Jackson datatype: Guava	2.18.2	2.19.0
Jackson Jakarta-RS: base	2.18.2	2.19.0
Jackson Jakarta-RS: JSON	2.18.2	2.19.0
Jackson module: Jakarta XML Bind Annotations (jakarta.xml.bind)	2.18.2	2.19.0
Jackson module: Old JAXB Annotations (javax.xml.bind)	2.18.2	2.19.0
Jackson-annotations	2.18.2	2.19.0
Jackson-core	2.18.2	2.19.0
jackson-databind	2.18.2	2.19.0
Jackson-dataformat-XML	2.18.2	2.19.0
Jackson-dataformat-YAML	2.18.2	2.19.0
Jackson-JAXRS: base	2.18.2	2.19.0
Jackson-JAXRS: JSON	2.18.2	2.19.0
Protocol Buffers [Core]	4.29.3	4.30.2
SnakeYAML	2.3	2.4

Table 30. Added

Package	Version	License
JSpecify annotations	1.0.0	Apache-2.0
Project Lombok	1.18.38	MIT

Table 31. Removed

Package
CogComp-NLPy
JVM Integration for Metrics
Jackson-dataformat-CSV
MongoDB Java Driver (unmaintained)
SLF4J JDK14 Binding

Release 7.56.0.c77.0

March 2025

Important

The installation instructions for the Solr plugin have changed.

To install the Solr plugin:

Copy all files from the lib directory inside the Entity Extractor Solr plugin installation (rex-je-solr) into the lib directory of your Solr core.
Copy all files from the lib directory inside the Entity Extractor installation (rex-je) into the lib directory of your Solr core.

New

New entity types: You can now extract social media entity types using Entity Extractor. The types extracted are HASHTAG, ATMENTION, URL, and EMAIL. To extract these types, set extractSocialMedia to true in the annotator configuration. (TEJ-2672)
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Solr support: Solr 9.8.1 is now supported. (TEJ-2683)
Solr installation change: We've changed how the plugin is installed because Solr has discontinued the lib directives. When installing, you must copy all JAR files from the lib directory inside the Entity Extractor installation into the lib directory of your Solr core and ensure that the solrconfig.xml file points to the REX-JE installation directory. If you are using Solr version 9.7.x or earlier, the previous installation instructions will still work.
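
To give a feel for the four social media entity types listed above (HASHTAG, ATMENTION, URL, and EMAIL), here is a minimal, self-contained Java sketch. It does not call Entity Extractor and does not reproduce its patterns: the class name SocialMediaSketch and the simplified regular expressions are assumptions made purely for illustration. In the product, these types are enabled by setting extractSocialMedia to true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: simplified patterns, not the ones Entity Extractor uses.
final class SocialMediaSketch {
    private static final String[] TYPES = {"EMAIL", "URL", "HASHTAG", "ATMENTION"};

    // EMAIL is tried first so that "info@example.com" is not read as an at-mention.
    private static final Pattern PATTERN = Pattern.compile(
        "(?<EMAIL>[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+)"
        + "|(?<URL>https?://\\S+)"
        + "|(?<HASHTAG>#\\w+)"
        + "|(?<ATMENTION>@\\w+)");

    // Returns "TYPE:text" strings in document order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    found.add(type + ":" + m.group(type));
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Ask @rex or info@example.com, see https://example.com #nlp"));
    }
}
```

The ordering of the alternatives matters in this sketch: without it, the domain part of an email address could be picked up as an at-mention.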

Bug Fixes

We fixed a bug in the in-document coreference server. In-document coreference chains of entity mentions are now correct. (TEJ-2655)

Third-party component updates

Table 32. Updated

Package	Old Version	New Version
Angus Activation Registries	2.0.1	2.0.2
Auto Common Libraries	0.8	1.2.1
AutoService	1.0-rc4	1.1.1
Apache Commons Codec	1.17.1	1.18.0
Apache Commons IO	2.17.0	2.18.0
Apache Commons Text	1.12.0	1.13.0
Apache CXF	4.0.4	4.1.1
Apache Log4j	2.24.1	2.24.3
Guava	33.3.1-jre	33.4.0-jre
istack common utility code runtime	4.0.1	4.1.2
Jackson	2.17.2	2.18.2
Jakarta Activation API	2.1.2	2.1.3
Jakarta RESTful WS API (prev. jakarta.ws.rs-api)	3.0.0	3.1.0
Jakarta XML Binding API	3.0.1	4.0.2
JavaCPP	1.5.10	1.5.11
JAXB Core and Runtime	3.0.2	4.0.5
JVM Integration for Metrics	3.0.1	3.0.2
Protocol Buffers [Core]	3.25.5	4.29.3
Metrics Core	3.0.1, 4.2.28	3.0.2, 4.2.30
TXW2 Runtime	3.0.2	4.0.5

Table 33. Added

Package	Version	License
Java Architecture for XML Binding	2.2.12	CDDL 1.1
MongoDB Java Driver (unmaintained)	3.12.14	The Apache License, version 2.0

Release 7.55.15.c76.0

November 2024

New

Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Java 21 support added: Java 21 is now supported. Java 11 and 17 are still supported (TEJ-2484).

Third-party component updates

Table 34. Updated

Package	Old Version	New Version
annoy	0.2.5	0.2.6
Apache Commons Compress	1.27.0	1.27.1
Apache Commons IO	2.16.1	2.17.0
Apache Commons Lang	3.16.0	3.17.0
Apache Commons Text	1.10.0	1.12.0
Apache Log4j	2.23.1	2.24.1
fastutil	8.5.14	8.5.15
Guava	33.3.0-jre	33.3.1-jre
JavaCPP	1.5.8	1.5.10
Metrics Core	3.2.3	4.2.28
Protocol Buffers	3.25.3	3.25.5
SnakeYAML	2.2	2.3
Woodstox	7.0.0	7.1.0

Table 35. Added

Package	Version	License
Apache Commons Codec	1.16.1	Apache-2.0
Project Lombok	1.18.34	The MIT License
Streaming API for XML	1.0-2	GNU General Public Library

Release 7.55.14.c75.0

September 2024

New

Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.

Bug Fixes

Fixed a bug where REX would accept indoc-coref-server entity mentions which violated sentence boundaries. REX will now reject mentions from indoc-coref-server that are not contained within document sentences. (TEJ-2451)
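
The boundary rule in this fix can be pictured as a containment test: a mention is kept only if its character span lies entirely inside one sentence span. The sketch below is a hypothetical illustration, not REX code. The class SentenceBoundaryFilter, its method names, and the use of half-open [start, end) character offsets are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; REX's internal data structures are not described in these notes.
final class SentenceBoundaryFilter {
    // Spans are int[]{start, end} with half-open character offsets (an assumption).
    static boolean isContained(int[] mention, List<int[]> sentences) {
        for (int[] s : sentences) {
            if (mention[0] >= s[0] && mention[1] <= s[1]) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the mentions that sit inside a single sentence.
    static List<int[]> filter(List<int[]> mentions, List<int[]> sentences) {
        List<int[]> kept = new ArrayList<>();
        for (int[] m : mentions) {
            if (isContained(m, sentences)) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> sentences = List.of(new int[]{0, 20}, new int[]{21, 40});
        // The second mention straddles both sentences and is dropped.
        List<int[]> mentions = List.of(new int[]{5, 12}, new int[]{18, 25});
        for (int[] m : filter(mentions, sentences)) {
            System.out.println(m[0] + "-" + m[1]);
        }
    }
}
```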

Third-party component updates

Table 36. Updated

Package	Old Version	New Version
Apache Commons CLI	1.7.0	1.9.0
Apache Commons Compress	1.26.1	1.27.0
Apache Commons Lang	3.14.0	3.16.0
fastutil	8.5.13	8.5.14
Guava	33.2.0-jre	33.3.0-jre
Jackson	2.17.1	2.17.2
Project Lombok	1.18.22	1.18.34
TensorFlow for Java	0.3.3	1.0.0-rc.1
Woodstox	6.6.2	7.0.0

Release 7.55.13.c74.0

June 2024

New

Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-507)
Linking improvements: We've added a heuristic to help stop generic, unnamed entities, such as "mortgage law", from being linked. (RWIKI-475)

Bug Fixes

Aliases are no longer filtered by low normalized link probability; it is now possible to link entities where abbreviations like "MIT", "LA", "WHO", "UN" are the mention text. (RWIKI-389)
We fixed a NullPointerException that occurred while writing a log entry when processing empty tokens. (TEJ-2361)
We updated the REXCmd plain text and console output to display the entityId for all mentions, not just the head mention. We also fixed an ArrayIndexOutOfBoundsException that occurred while writing context output. (TEJ-2155)
We fixed a bug where news media, such as television programs, were typed as ORG. (RWIKI-483)
We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)

Third-party component updates

Table 37. Added

Package	Version	License
Angus Activation Registries	2.0.1	EDL 1.0

Table 38. Updated

Package	Old Version	New Version
Apache Commons CLI	1.6.0	1.7.0
Apache Commons Codec	1.11	1.16.1
Apache Commons Compress	1.26.0	1.26.1
Apache Commons IO	2.15.1	2.16.1
Apache Log4j	2.21.1	2.23.1
Apache CXF	3.4.7	4.0.4
args4j	2.33	2.37
Guava	33.0.0-jre	33.2.0-jre
iStack Common Utility Code	3.0.12	4.0.1
JAXB	2.3.4	3.0.2
Jackson	2.16.1	2.17.1
Jakarta Activation API	1.2.2	2.1.2
Jakarta Annotations API	1.3.5	2.1.1
Jakarta RESTful Web Services API	2.1.6	3.0.0
Jakarta XML Binding API	2.3.3	3.0.1
Protocol Buffers	3.25.0	3.25.3
TXW2	2.3.4	3.0.2
Woodstox	4.4.1, 6.2.6	6.6.2
XmlSchema	2.2.5	2.3.1

Removed Packages

Jakarta SOAP with Attachments API
Jakarta Transaction API
Jakarta Web Services Metadata API
Jakarta XML Web Services API

Release 7.55.12.c73.0

May 2024

New

Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. You should see large improvements in entity linking. (RWIKI-454, RWIKI-507)
We've made some changes as to how some entity types are linked to the provided knowledge base:
- PERSON: Now only real humans are linked as person entities; fictional, imaginary, and mythical humans are not.
- PRODUCT: Product entities now exclude most creative works.
Linking improvements: We've changed the conflict resolution algorithm to one which tries to link using the longest possible mentions. You should see better linking, especially in cases where the mention of a popular entity is embedded within the mention of interest. (RWIKI-404)
Example: I studied at the University of Chicago
- Previously linked: Chicago
- Now linked: University of Chicago
Solr support: Solr 9.6.1 is now supported. (TEJ-2357)
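
The longest-mention conflict resolution described above (RWIKI-404) can be pictured with a small greedy sketch: candidates are considered from longest to shortest, and a candidate is kept only if it does not overlap one already kept. This is an illustration under stated assumptions, not the linker's actual algorithm. The class name LongestMentionSketch, the int[]{start, end} span representation, and the tie-breaking rule are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of longest-mention-first overlap resolution; not the linker's actual code.
final class LongestMentionSketch {
    // Each candidate is int[]{start, end} with half-open character offsets (an assumption).
    static List<int[]> resolve(List<int[]> candidates) {
        List<int[]> byLength = new ArrayList<>(candidates);
        // Longest first; ties are broken by the earlier start offset.
        byLength.sort(Comparator
            .comparingInt((int[] s) -> s[1] - s[0]).reversed()
            .thenComparingInt((int[] s) -> s[0]));
        List<int[]> kept = new ArrayList<>();
        for (int[] c : byLength) {
            boolean overlaps = false;
            for (int[] k : kept) {
                if (c[0] < k[1] && k[0] < c[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(c);
            }
        }
        // Return the surviving mentions in document order.
        kept.sort(Comparator.comparingInt((int[] s) -> s[0]));
        return kept;
    }

    public static void main(String[] args) {
        String text = "I studied at the University of Chicago";
        int uni = text.indexOf("University of Chicago");
        int city = text.indexOf("Chicago");
        List<int[]> kept = resolve(List.of(
            new int[]{city, city + "Chicago".length()},
            new int[]{uni, uni + "University of Chicago".length()}));
        for (int[] k : kept) {
            System.out.println(text.substring(k[0], k[1]));
        }
    }
}
```

With the example sentence from this release note, the shorter candidate "Chicago" overlaps the longer "University of Chicago" and is therefore discarded.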

Bug Fixes

We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)
We fixed a bug where Chinese characters were normalized when looking up knowledge base artifacts for linking in Japanese. You should see improved entity linking in Japanese. (RWIKI-406)
English terms for half(s) and quarter(s) were removed from the Russian (RUS) and German (DEU) regexes for time. (TEJ-1817)

Release 7.55.11.c73.0

March 2024

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 39. Updated

Package	Old Version	New Version
Apache Commons CLI	1.2.0	1.6.0
Apache Commons Compress	1.24.0	1.26.0
Apache Commons IO	2.15.0	2.15.1
Apache Commons Lang	3.12.0	3.14.0
fastutil	8.5.12	8.5.13
Guava	32.1.3-jre	33.0.0-jre
Guava InternalFutureFailureAccess and InternalFutures	1.0.1	1.0.2
ICU4J	70.1	74.2
Jackson Annotations	2.15.3	2.16.1
Jackson Core	2.15.3	2.16.1
Jackson Databind	2.15.3	2.16.1
Jackson Dataformat CSV	2.15.3	2.16.1
Jackson Dataformat XML	2.15.3	2.16.1
Jackson Dataformat YAML	2.15.3	2.16.1
Jackson Datatype: Guava	2.15.3	2.16.1
Jackson JAXRS: Base	2.15.3	2.16.1
Jackson JAXRS: JSON	2.15.3	2.16.1

Table 40. Removed

Package
JavaPoet
OSGi Core

Release 7.55.10.c73.0

March 2024

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 41. Updated

Package	Old Version	New Version
Apache Commons CLI	1.2.0	1.6.0
Apache Commons Compress	1.24.0	1.26.0
Apache Commons IO	2.15.0	2.15.1
Apache Commons Lang	3.12.0	3.14.0
fastutil	8.5.12	8.5.13
Guava	32.1.3-jre	33.0.0-jre
Guava InternalFutureFailureAccess and InternalFutures	1.0.1	1.0.2
ICU4J	70.1	74.2
Jackson Annotations	2.15.3	2.16.1
Jackson Core	2.15.3	2.16.1
Jackson Databind	2.15.3	2.16.1
Jackson Dataformat CSV	2.15.3	2.16.1
Jackson Dataformat XML	2.15.3	2.16.1
Jackson Dataformat YAML	2.15.3	2.16.1
Jackson Datatype: Guava	2.15.3	2.16.1
Jackson JAXRS: Base	2.15.3	2.16.1
Jackson JAXRS: JSON	2.15.3	2.16.1

Table 42. Removed

Package
JavaPoet
OSGi Core

Release 7.55.9.c72.0

December 2023

New

Solr support: Solr 9.4 is now supported. (TEJ-2112)
Build scripts: The scripts that compile gazetteers into binary files to improve performance have been moved to the ./scripts directory. You no longer need the Field Training Kit (FTK) to compile these files. (TEJ-2096)

Bug Fixes

We've added a reject gazetteer to ensure that the string "USA, Canada" is extracted correctly as two location entities: USA and Canada. This is active by default. (TEJ-2114)

Third-party component updates

Table 43. Updated

Package	Old Version	New Version
Jackson-annotations	2.15.2	2.15.3
Jackson Core	2.15.2	2.15.3
Jackson Databind	2.15.2	2.15.3
Jackson Dataformat XML	2.15.2	2.15.3
Jackson Dataformat YAML	2.15.2	2.15.3
Jackson datatypes: Guava	2.15.2	2.15.3
Jackson-JAXRS: base	2.15.2	2.15.3
Jackson-JAXRS: JSON	2.15.2	2.15.3
Jackson module: Old JAXB Annotations (javax.xml.bind)	2.15.2	2.15.3
Guava: Google Core Libraries for Java	32.1.2-jre	32.1.3-jre
Protocol Buffers [Core]	3.23.4	3.25.0
Apache Commons IO	2.11.0	2.15.0
Apache Commons Compress	1.23.0	1.24.0
liblinear	2.42	2.44
Apache Log4j API	2.20.0	2.21.1
Apache Log4j Core	2.20.0	2.21.1
Apache Log4j SLF4J Binding	2.20.0	2.21.1
Stax2 API	4.2.1	4.2.2
SnakeYAML	2.0	2.2

Release 7.55.8.c71.0

September 2023

New

Solr support: Solr 9.3 is now supported.

Bug Fixes

We fixed a bug in licensing for Chinese language codes. (WS-2861)

Third-party component updates

Table 44. Updated

Package	Old Version	New Version
Jackson Annotations	2.15.0	2.15.2
Jackson Core	2.15.0	2.15.2
Jackson Databind	2.15.0	2.15.2
Jackson Dataformat XML	2.15.0	2.15.2
Jackson Dataformat T	2.15.0	2.15.2
Jackson Datatype: Guava	2.15.0	2.15.2
Jackson Module: Old JAXB Annotations	2.15.0	2.15.2
Guava: Google Core Libraries for Java	31.1-jre	32.1.2-jre
Protocol Buffers [Core]	3.21.7	3.23.4

Release 7.55.7.c70.0

June 2023

Bug Fixes

The parameter regexCurrencySplit has been fixed. When set to true, currency values will now extract into two entity types: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE instead of IDENTIFIER:MONEY. (TEJ-1960)
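
As a rough picture of what the split produces, the following sketch separates a money string into a currency type and an amount. It is a hypothetical illustration that does not use REX. The class name CurrencySplitSketch and the simplified pattern (a leading currency symbol followed by a number) are assumptions, and the actual REX regexes cover far more formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the split; the actual REX regexes are not reproduced here.
final class CurrencySplitSketch {
    // Simplified assumption: one currency symbol ($, euro, or pound) followed by a number.
    private static final Pattern MONEY =
        Pattern.compile("([$\\u20AC\\u00A3])\\s?(\\d[\\d,]*(?:\\.\\d+)?)");

    // Returns {CURRENCY_TYPE text, CURRENCY_AMT text}, or null if the input does not match.
    static String[] split(String money) {
        Matcher m = MONEY.matcher(money);
        if (!m.matches()) {
            return null;
        }
        return new String[]{m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] parts = split("$50,000");
        System.out.println("IDENTIFIER:CURRENCY_TYPE = " + parts[0]);
        System.out.println("IDENTIFIER:CURRENCY_AMT  = " + parts[1]);
    }
}
```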

Known Issues

The Solr plugin is not supported on Solr 9.2.x.

Third-party component updates

This release includes the following third-party component changes:

Table 45. Updated

Package	Old Version	New Version
Apache Commons Compress	1.22	1.23
Apache Log4J API	2.19.0	2.20.0
Apache Log4J Core	2.19.0	2.20.0
Apache Log4J SLF4J Binding	2.19.0	2.20.0
fastutil	8.5.9	8.5.12
Jackson Annotations	2.14.0	2.15.0
Jackson Core	2.14.0	2.15.0
Jackson Databind	2.14.0	2.15.0
Jackson Dataformat CSV	2.14.0	2.15.0
Jackson Dataformat YAML	2.14.0	2.15.0
Jackson Dataformat XML	2.14.0	2.15.0
Jackson datatype: Guava	2.14.0	2.15.0
Jackson JAXRS:base	2.14.0	2.15.0
Jackson JAXRS:JSON	2.14.0	2.15.0
Jackson module:OLD JAXB Annotations	2.14.0	2.15.0
SnakeYAML	1.33	2.0

Release 7.55.6.c69.0

March 2023

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 46. Upgraded

Package	Old Version	New Version
Guava: Google Core Libraries for Java	26.0-jre	31.1-jre
Protocol Buffers [Core]	3.12.2	3.21.7

Release 7.55.4.c69.0

March 2023

This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.

Third-party component updates

This release includes the following third-party component changes:

Table 47. Upgraded

Package	Old Version	New Version
Guava: Google Core Libraries for Java	26.0-jre	31.1-jre
Protocol Buffers [Core]	3.12.2	3.21.7

Release 7.55.3.c68.0

January 2023

Bug Fixes

We fixed a linking problem introduced in 7.55.0.c68.0, in which the NIL_BIAS parameter value was incorrect. The parameter is now correct for all languages. (TEJ-1899)

Release 7.55.0.c68.0

December 2022

New

Wikidata refreshed: We've updated the knowledge base data for the provided linking knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-119, ELK-274, ELK-276)
New currency regex: We've introduced a new option, regexCurrencySplit, that, when set to true, will attempt to split entities extracted with the regex engine of type IDENTIFIER:MONEY into two new entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These two new types represent the amount of the currency (50,000) and the currency type ($), respectively. By default, regexCurrencySplit is set to false. (TEJ-1792)
Tagalog support: We've added case-insensitive NER support for Tagalog. Previously we released a case-sensitive model and we've now added the case-insensitive model as well. (TEJ-1858)
Parameter removed: We've removed the deprecated genre extraction option. This option was used to turn on the linker, which remains available through the linkEntities option. The genre option is no longer available in the REX SDK, the Rosette Server REX configuration, or the Rosette API bindings. (TEJ-1855)

Release 7.54.1.c67.0

October 2022

New

This beta release contains a new English statistical extraction model which was trained on financial data. The model has improved accuracy over the default English models for financial domain documents. (TEJ-1837)

Release 7.54.0.c67.0

September 2022

New

Tagalog (tgl) support: We've added Tagalog to our list of languages. The following processors are supported: gazetteer, regex, statistical NER, linking. (TEJ-1812, TEJ-1822, TEJ-1785, TEJ-1786)
New linking option: We've added a new option for entity linking. When linkMentionMode is set to entities, the linker will attempt to link the entities extracted by other processors (regex, gazetteers, and the statistical processor) instead of using its own processor to extract entity candidates. Depending on your data, this may provide higher accuracy and speed. (TEJ-1806)
REXCmd parameter change: The linkEntities parameter can now act as a toggle instead of taking a true/false value, matching how other REXCmd boolean parameters are handled. (TEJ-1806)
- Previously: REXCmd ... -linkEntities true
- Now: REXCmd ... -linkEntities
Parameter deprecated: The parameter genre is deprecated and will be removed in the next release.

Bug Fixes

REX no longer produces an exception when token normalization produces an empty token string. (TEJ-1803)
When looking for candidate mentions in text, if there is an overlap between these mentions the linker now resolves the longest spanning mention before disambiguation. (ELK-277)

Release 7.53.3.c67.0

June 2022

New

Configure knowledge base linking priority: With multiple knowledge bases it is possible to set the order in which to try linking against each knowledge base. Set the priority in the redactor configuration file (ne_types.xml) (TEJ-1726, TEJ-1754)
Example: The following XML element will set the custom-kb priority higher than the default knowledge base (kb-linker) when linking a PRODUCT entity type:
```
<ne_type>
  <name>PRODUCT</name>
  <weight name="kb-linker" value="100" />
  <weight name="kb-linker:custom-kb" value="1" />
</ne_type>
```
relatedEntities renamed to contextWords: When creating a custom knowledge base, the feature contextWords, which was previously called relatedEntities, is required. Context words are language-specific words that are strongly related to the entity. The term relatedEntities has been deprecated. (TEJ-1756)
Java 17 support added: Java 17 is now supported. Java 8 and 9 are no longer supported. (TEJ-1728, TEJ-1763)
Solr 9 support added: REX now supports Lucene and Solr 9. (TEJ-1731)
Solr 6 support removed: REX no longer supports Solr 6 or earlier. (TEJ-1731)

Bug Fixes

Bug fix: An error is no longer generated when there are null prefixes in Arabic morphological analyses. (TEJ-1765)
Bug fix: We fixed a bug that prevented the noisy_context_vector feature from being used for disambiguation. (ELK-265, ELK-268, ELS-272, TEJ-1776)

Release 7.53.0.c66.0

March 2022

Notice

Solr 6 and earlier support is deprecated as of this release.

Java 8 and Java 9 support is deprecated as of this release.

Bug Fixes

Updated Log4j to version 2.17.1. (TEJ-1724)

Third-party component updates

This release includes the following third-party component changes:

Table 48. Upgraded

Package	Old Version	New Version
Apache Commons Compress	1.9	1.21
Apache Commons IO	2.7	2.11.0
Apache Commons Lang3	3.3.2	3.12.0
Apache Log4j	1.2.17	2.17.1
Auto Common Libraries	0.3	0.8
AutoService	1.0-rc3	1.0-rc4
ICU4J	58.1	70.1
fastutil	8.4.0	8.5.6
LibLinear	2.30	2.42
SLF4J	1.7.28	1.7.33
SnakeYAML	1.26	1.30
TensorFlow for Java	0.2.0	0.3.3

Table 49. Added

Package	Version	License
AOP alliance	1.0	Public Domain
Apache Commons Logging	1.2	Apache License 2.0
Apache Commons Math	2.0	Apache License 2.0
Apache POI	3.9	Apache License 2.0
DOM4J	1.6.1	DOM4J License
JCommon	1.0.17	GNU Lesser General Public Licence
JFreeChart	1.0.14	GNU Lesser General Public Licence
JUnit	4.13.2	Eclipse Public License 1.0
JVM Integration for Metrics	3.0.4	Apache License 2.0
Java Architecture for XML Binding	2.3.2	Eclipse Distribution License - v 1.0
Java Common Annotations API	1.3.2	CDDL + GPLv2 with classpath exception
Java Message Service	1.1	Common Development and Distribution License (CDDL) v1.0
JavaBeans Activation Framework (JAF)	1.1	Common Development and Distribution License (CDDL) v1.0
JavaBeans Activation Framework API jar	1.2.1	EDL 1.0
JavaMail API	1.4	Common Development and Distribution License (CDDL) v1.0
Javax WS-RS API	2.1.5	EPL 2.0
JetBrains Java Annotations	23.0.0	Apache License 2.0
Jimfs	1.1	Apache License 2.0
Legion of the Bouncy Castle Java Cryptography APIs	138	Bouncy Castle License
Lib TensorFlow	1.5.0	Apache License 2.0
Mockito	1.9.5	The MIT License
ODFDOM	0.8.6	Apache License 2.0
Project Lombok	1.18.22	The MIT License
Spring	4.2.4.RELEASE	Apache License 2.0
StAX API	1.0.1	Apache License 2.0
Sun Multi-Schema XML Validator	20050913	The BSD License
TensorFlow	1.5.0	Apache License 2.0
XML Commons External Components XML APIs	1.3.04	Apache License 2.0
Xerces2 Java Parser	2.9.4	Apache License 2.0
XMLBeans	2.3.0	Apache License 2.0
ZIP4J	1.3.2	Apache License 2.0
iText	2.1.5	Mozilla Public License

Table 50. Removed

Package
Apache Geronimo
JAX-WS
JBoss RMI
JSR203 Hadoop
Jacorb Omg
Jakarta Activation
Jakarta WS-RS API
Jakarta XML Bind API
Javax Activation
Javax Annotation
Javax XML Soap
MIME Pull
SAAJ Impl
STAX-EX

Release 7.52.0.c65.0

December 2021

New

New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.42.2.c65.0. (TEJ-1681, TEJ-1693)

Bug Fixes

Hungarian dates are now extracted correctly. Previously, dates with embedded periods followed by a space were not being extracted. (TEJ-1681)
rexcmd info no longer lists TEMPORAL types by default for SWEDISH. (TEJ-1687)

Release 7.51.1.c65.0

September 2021

Bug Fixes

The supported entity types info of the DNN processor now spells PERSON correctly. (TEJ-1670)

Release 7.51.0.c65.0

August 2021

New

Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions. (TEJ-1657, TEJ-1658)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.41.1.c65.0. (TEJ-1667)

Bug Fixes

A single line followed by an empty line is no longer always considered a fragment. (ETROG-3431)
The Field Training Kit (FTK) no longer returns erroneous error messages from generating wordclasses. (TEJ-1636)
The RBL models directory is now correctly specified in the FTK. (TEJ-1655)
The REX Training Server (RTS) no longer fails when the request contains the language code msa. msa is now mapped to zsm, the language code supported by REX for Malay. (TEJ-1669)

Release 7.50.0

May 2021

Bug Fixes

We fixed a bug where invalid whitespace handling by the DNN processor would cause a runtime exception. (TEJ-1614)

Open Source Changes

Table 51. Upgraded

Package	Old Version	New Version
jackson	2.10.0	2.11.1
commons-io	2.6	2.7
fastutil	8.3.0	8.4.0
liblinear	1.95	2.42
snakeyaml	1.25	1.26
stax2-api	4.2	4.2.1

Table 52. New

Package	Version	License
JavaCPP	1.5.4	Apache 2.0
TensorFlow Core API	0.2.0	Apache 2.0
TensorFlow NDArray	0.2.0	Apache 2.0

Table 53. Deleted

Package
libtensorflow
libtensorflow jni
protobuf

Release 7.49.1

April 2021

New

Language-specific joiner rules: Custom joiner rules can now be language-specific or apply to all languages. (TEJ-178)
New default processing for structured text regions (lists, tables): Structured text often consists of isolated words or phrases and lacks the syntactic context that REX was trained on. REX performed poorly on such regions, so some users pre-processed their input to remove them. This pre-processing is no longer necessary: the statistical/DNN model is now turned off by default for structured regions. This mode increases precision but may reduce recall in these regions. Note that the other REX processors (pattern match, exact match, entity linking), which do not rely on context, continue to analyze structured regions. To turn on the statistical/DNN model for structured regions, set the parameter structuredRegionProcessingType to nerModel. (TEJ-1502)
New name classifier model for structured regions (LABS): We've added a new model for processing structured regions. The name classifier classifies a text fragment as PERSON, LOCATION, ORGANIZATION, or NONE. The entire structured region receives a single label: an entity type or NONE. The model is disabled by default. (TEJ-1613, TEJ-1621)
Japanese organization gazetteers: The gazetteers for Japanese organizations have been updated to improve extraction of Japanese organizations. (TEJ-1612)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.39.0. (TEJ-1618)
Rosette Training Server (RTS) results: When using REX with Adaptation Studio (RAS), the results returned by RTS are now preferred by default. (TEJ-1605)

Bug Fixes

Entities are no longer extracted when they cross a sentence boundary. To enable entity linking across sentence boundaries, set disableApplySentenceBoundaries to true. (ELK-259)
Entities are now checked to ensure they are normalized. (TEJ-1615)

Third-party component updates

This release includes the following third-party component changes:

Package	Old Version	New Version
liblinear	1.94	2.42

Release 7.48.0

January 2021

Bug Fixes

We fixed an offset alignment issue for carriage return normalization. (TEJ-1600)
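
To show why carriage return normalization affects offsets, here is a minimal sketch that rewrites "\r\n" and "\r" as "\n" while recording, for every normalized character, its position in the original text. It illustrates the general bookkeeping problem and is not REX's implementation. The class NewlineOffsetSketch and its members are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch of offset bookkeeping during newline normalization; not REX's implementation.
final class NewlineOffsetSketch {
    final String normalized;
    // originalOffset[i] is the index in the original text of normalized character i.
    final int[] originalOffset;

    NewlineOffsetSketch(String original) {
        StringBuilder sb = new StringBuilder();
        int[] map = new int[original.length()];
        int n = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            map[n++] = i;
            if (c == '\r') {
                sb.append('\n');
                // Treat "\r\n" as a single line break by skipping the '\n'.
                if (i + 1 < original.length() && original.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                sb.append(c);
            }
        }
        normalized = sb.toString();
        originalOffset = Arrays.copyOf(map, n);
    }

    public static void main(String[] args) {
        NewlineOffsetSketch s = new NewlineOffsetSketch("a\r\nBoston");
        int start = s.normalized.indexOf("Boston");
        System.out.println("normalized start: " + start);
        System.out.println("original start:   " + s.originalOffset[start]);
    }
}
```

An entity found at a given offset in the normalized text must be mapped back through such a table before it is reported against the original document. Otherwise every "\r\n" before the entity shifts the reported offsets by one character.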

Release 7.47.0

December 2020

New

Updated the internal database for Wikidata linking. QIDs for some entities may change from previous versions, as Wikidata has been refreshed and re-indexed. (TEJ-1579, ELK-249, ELK-251, RWIKI-77)
Updated RBL version (TEJ-1579)

Bug Fixes

The sqlite-kb-connector sample now works correctly. Runtime issues with sqlite dependencies have been corrected. (ELK-245, ELK-257)
Extraction no longer fails when a custom processor returns a NULL annotator; instead a warning is generated. (TEJ-1580)
Mentions normalized by the custom processor are no longer ignored. (TEJ-1573)
Windows-formatted carriage returns (\r, \r\n) are now handled correctly.

Release 7.46.2

September 2020

New Features

Joiner runs before redactor: The joiner now runs before the redactor by default, providing more flexibility and control over the joiner results. Set runJoinerPostRedactor to true to run the joiner after the redactor. (TEJ-1534)
Improved phone number recognition: Regular expressions for phone number extraction have been improved and now extract more phone number patterns. (TEJ-1556)

REXCmd input from stdin: REXCmd can now accept input from stdin by specifying the command line option -stdin.
Example:
```
$ echo "Basis Technology is a company in Massachusetts" | REXCmd extract -stdin -langCode eng
```

Bug Fixes

We fixed a bug where sometimes a null pointer exception was returned when the custom processor and the linker had overlapping results. (TEJ-1561)
Custom processors can now only modify the entity and metadata sections of the ADM. Previously, any modification could be made which could override annotation data. (TEJ-1537)
We've partially fixed a problem in Japanese ORG extraction where sometimes the model extracts multiple ORG entities or includes non-related adjacent tokens. (TEJ-1534)
The Field Training Kit no longer generates invalid models when creating custom knowledge bases. This occurred for all languages except eng, jpn, and zho. (ELK-252)

Release 7.46.0

June 2020

New Features

Improved sample: The sample files to build the SQLite connector described in the Custom Knowledge Base Connectors section now include all files required to build with Maven. The configuration to run the connector with Rosette Enterprise is now provided as well. (TEJ-1508)
Language-specific alias: Custom knowledge bases compiled with the Field Training Kit (FTK) now maintain the language of the alias. Aliases are only extracted in documents of the language the alias is defined for. Aliases can be defined for all languages or for a specific language. (ELK-241)
Custom knowledge bases can be compiled without disambiguation. While adding a knowledge base without a disambiguation model will not provide the best results, it will function as an enhanced gazetteer that attaches an assigned ID to each gazetteer entry and supports multiple aliases per entry. To compile a custom knowledge base without compiling a disambiguation model, pass -d as an argument to train-linker-model. (ELK-233)
New method: A method, getBaseLinguisticsParameters, has been added to retrieve the base linguistics parameters that were used in training the model. Use the retrieved parameters to configure an external instance of RBL to produce tokens consistent with the training tokenization. A new sample application, RBLParametersSample.java, is available in the samples directory. (TEJ-1501)
Base linguistics added: The FTK can now use input ADM files containing base linguistics annotations, such as tokens, sentence boundaries, and morphological analysis for languages such as Korean and Arabic. For REX to produce optimal results, tokenize with the options provided by the getBaseLinguisticsParameters method when creating the ADM file from RBL. (APE-1793)
Hebrew improvements: REX has improved Hebrew normalization, and the disambiguator can now identify prefixes removed from the entity's normalized form. These improvements result from enhancements in Hebrew base linguistics. (ETROG-3189)

Bug Fixes

A new line character in a regex (\n) will now also match carriage returns (\r) and a combination of both (\r\n). (TEJ-1525)
Confidence scores for entity linking now use the same scale, whether linking to Wikidata or a custom knowledge base. Previously, the confidence scores given for links to custom knowledge bases were much lower than those calculated for the Wikidata knowledge base. (ELK-240)
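
The newline behavior described above (TEJ-1525) can be reproduced in plain java.util.regex with an explicit alternation. The sketch below is only an illustration of the matching behavior using the standard Java regex engine. How REX implements it internally is not stated in these notes, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration using standard Java regex; not REX's regex engine.
final class NewlineRegexSketch {
    // Matches "\r\n", "\r", or "\n"; "\r\n" is listed first so it is consumed as one break.
    static final String ANY_NEWLINE = "(?:\\r\\n|\\r|\\n)";

    // Builds a pattern for two literal strings separated by exactly one line break.
    static Pattern twoLines(String first, String second) {
        return Pattern.compile(Pattern.quote(first) + ANY_NEWLINE + Pattern.quote(second));
    }

    public static void main(String[] args) {
        Pattern p = twoLines("Basis", "Technology");
        System.out.println(p.matcher("Basis\nTechnology").matches());
        System.out.println(p.matcher("Basis\r\nTechnology").matches());
        System.out.println(p.matcher("Basis\rTechnology").matches());
    }
}
```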

Release 7.45.0

March 2020

New Features

Connector framework for custom Knowledge Bases added. See section 5.6 in the Application Developer's Guide. (TEJ-1476, TEJ-1477, TEJ-1485)
Added a Deep Neural Network model for Hebrew for improved accuracy. To use it in place of the statistical model, set the flag -useDeepNeuralNetworkProcessor. (TEJ-1503)
Hebrew normalization improved: instead of using the lemma form, normalization now removes only the prefixes, except for the definite article, which is kept. (TEJ-1505)
New statistical model for Hebrew trained on news and finance data. (TEJ-1497)
Solr plugin now available as a Docker container. (TEJ-1492)
Supplemental regex support for ISO-6709 geo-coordinates. (TEJ-1431, DATA-761)
Support for setting prioritization for multiple custom Knowledge Bases. See section 5.2 in the Application Developer's Guide. (ELK-236)
Redactor weights can now be configured for specific subsources. See section 3.2.1 in the Application Developer's Guide. (TEJ-1480)
Separate license key required for linker custom Knowledge Bases. Note: extractions against existing custom Knowledge Bases will fail unless licenses are updated. (TEJ-1483)
Custom Knowledge Bases can be set in Rosette Enterprise profiles. Note: To support this feature, the flinx directory was moved into {rex-installation}/data. Any custom data inside must also be moved to the new location. (TEJ-1494)

Bug Fixes

TEJ-1499 REXAnnotatorFactory failed to assign linking confidence thresholds.
TEJ-1479 Fixed dynamic gazetteers for Malay.
TEJ-1506 Deep Neural Network extractions failed in REXCmd.

Release 7.44.1

February 2020

Bug Fixes

TEJ-1470 Fixed an error when extracting entities in Arabic text.

Release 7.44.0

December 2019

New Features

New language: Entity extraction now supports Swedish. (TEJ-1395)
Manual cache memory eviction functionality added to the SDK. See the Application Developer Guide for details. (TEJ-1450)

Bug Fixes

ELK-169 Fixed an error when reading aliases binaries produced with the FTK.
ELK-179 FTK no longer requires English or Chinese models for other languages.

Release 7.43.2

September 2019

Bug Fixes

Re-enabled and improved decompounding for Japanese and Chinese. It was disabled in 7.43.1.

Release 7.43.1

August 2019

Bug Fixes

Disabled decompounding for Japanese and Chinese

Release 7.43.0

August 2019

New Features

Tested and confirmed compatibility with Java 11.
Updated internal database for Wikidata linking. The DBPedia Type field now supports multiple subtypes. QIDs for some of the entities may change from previous versions, as Wikidata has been refreshed and re-indexed.
Entity linking returns PermIDs (IDs from Thomson Reuters knowledge base) in addition to QIDs (Wikidata IDs) for some of the entities.

Bug Fixes

Fixed a potential Null Pointer Exception which could have occurred while using DNN.

Release 7.42.2

July 2019

Bug Fixes

Fixed multi-threading bug with custom gazetteer

Release 7.42.1

June 2019

New Features

Flinx disambiguation models are packaged with optional parameter files which control some parameters during runtime. These files were missing from several previous distribution packages, which may have affected accuracy. They have now been restored to the distribution packages.
Fixed additional cases where Japanese characters were wrongly normalized into their simplified Chinese equivalents in entity linking, an issue addressed also in the previous release.
Chinese language code is now composed of three characters uniformly throughout the file system.
Entity extraction and entity linking now consume the latest version of RBL (Rosette Base Linguistics), which includes several improvements and bug fixes.
Improved installation by providing a script to facilitate unzip and installation of documentation and language packages.

Bug Fixes

Japanese date extraction now includes the new era 令和.

Release 7.41.0 and earlier

New Features

Release 7.41.0

In Japanese, a middle dot comes in the middle of Western names and acts as a sort of whitespace separating words in the name. Previously, some entities containing a middle dot were split into two entities, so that only part of the name was extracted. This is now handled correctly, and entities with a middle dot are not split. (TEJ-1341)
In Japanese entity linking, in some cases the last character of the Japanese word was changed into a Chinese character. This is now fixed. (ELK-118)
Previously, when the includeDbPediaTypes option was off, entity linking occasionally extracted an inaccurate entity type. Now, fine types of linked entities are also identified when the includeDbPediaTypes option is off. (ELK-115)
Provided a distribution package per language. (TEJ-1361)

Release 7.39.0

Reduced linker data package size (TEJ-1306, TEJ-1321)
Updated the linking confidence calculation and thresholds to improve accuracy (TEJ-1343)

Release 7.38.1

Improved the accuracy of Korean extraction, largely through better handling of Josa (postpositions) and compound words.
Added support for entity linking to Wikipedia for both the top-level types (PERSON, LOCATION, ORGANIZATION, etc.) and the more than 700 DBpedia types in the remaining 16 languages supported by Entity Extraction. This is in addition to the languages currently supported by entity linking: Chinese, English, Japanese, and Spanish.

Release 7.36.0

The linker process now has the option of returning over 700 new entity types drawn from the DBpedia ontology. To access these entity types, turn on the kbLinker processor and add the includeDBpediaType flag to the factory configuration. You’ll notice more than 10 additional primary types in the type field as well as the all-new DBpedia type field. Note that this is a LABS (experimental) API and is subject to change. Send us your feedback!
New language: Entity extraction now supports Hungarian.

Release 7.35.0

Replaced Japanese tokenizer to improve accuracy (TEJ-1176, TEJ-1180)

Release 7.34.0

Enabled string normalization for Hebrew based on the DNN disambiguation model to improve indoc-coref results (chaining Mentions into a single Entity) and to present a more proper form of the name (TEJ-1139, TEJ-1173)
Social-media characters such as '@' and '#' are removed from a Mention's normalized string; offsets into the original string data field remain the same. This feature can be disabled with EntityExtractor.setRetainSocialMediaSymbols() (TEJ-418)
Improved statistical model confidence score to emit maximal confidence less frequently (TEJ-1146)

Release 7.33.0

Added static and dynamic capabilities to adding entries to the custom knowledge base for entity linking (ELK-30, ELK-41, ELK-44, TEJ-1150)
Added a new deep neural network processor (BETA) as an alternative entity extraction processor, which can be used in place of the standard statistical extractor for English, Arabic and Korean (TEJ-1132, TEJ-1142, TEJ-1150)

Release 7.32.0

Added support for entity type "Title" in Hebrew (APE-1641)
Added new option MaxResolvedEntities to REXCmd (TEJ-1137)

Release 7.30.0

Accuracy of Korean statistical model is improved (APE-1737)
Default linking confidence thresholds are set (TEJ-1080, TEJ-1068)
The method setUseDeepNeuralNetworkProcessor() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API that replaces the statistical model with a new deep learning model. Alternatively, pass ProcessorType.deepNeuralNetwork to the setProcessors method. Currently available only for English and Arabic. Some operating systems do not support the deep neural network model, and some do not provide good latency.

Release 7.29.0

FTK supports training a disambiguation model for custom knowledge base (ELK-13, ELK-14, ELK-16, ELK-22, ELK-34)
Added default linking confidence threshold for linked entities (TEJ-1068)
Updated RBL version (TEJ-1048)
Application developer’s guide and customization guide are merged into a single manual (ELK-22)

Release 7.28.0

Added confidence score for linking results, named linkingConfidence, in addition to statistical model confidence score. Different thresholds apply for the different confidence scores. (TEJ-974)

Release 7.27.0

The new salience classifier has been incorporated. The salience calculation is enabled via EntityExtractor.setCalculateSalience() or by setting calculateSalience in either REXFactoryConfiguration or REXAnnotatorConfiguration. (TEJ-936)
Added manual custom processor registration API. (TEJ-972, TEJ-982)
Deduped partial duplicated regex. (TEJ-785)
Improved multiple gazetteer (exact-match) processor support. (TEJ-960, TEJ-1005)

Release 7.26.0

Confidence score calculation is improved to correlate well with precision and may be used for thresholding and removal of false positives (TEJ-910, TEJ-919)
Statistical models are trained with new emoticon-sensitive tokenizer (TEJ-924)
New script allows repacking REX with minimal configuration per language (TEJ-893)
Automatic case sensitivity mode prefers case-sensitive for short text by default (TEJ-931)

Release 7.25.0

Added Custom Processor for rejection. (TEJ-840, TEJ-841, TEJ-843, TEJ-880)
Redactor improvements: dynamic rules prioritization and subtypes handling (TEJ-863, TEJ-858)
Pronominal resolver is fully supported for English. Added as a processor type, as well as indoc-coref (TEJ-867)
Indoc-coref allows partial match for ORGANIZATION type (INDOC-26)

Release 7.24.1

Added full support for Vietnamese. (APE-1691)
Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/MapReduceExample/README.md. (TEJ-807)
The method setResolvePronouns() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to resolve pronouns like 'he' and 'she' to entities of type Person. It may be changed or removed in future releases. Available only in English. (TEJ-831)
Reject regex and gazetteers allow wildcard entity type. (TEJ-853, TEJ-817)
The kb-linker experimental processor now supports Chinese and Japanese in addition to English. This functionality may be changed or removed in future minor releases. (TEJ-857)
Automatic case sensitivity improved (English only). (TEJ-861)

Release 7.23.1

Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/SparkEntityCount/README.md. (TEJ-155)
Information about what languages are licensed can now be accessed. The methods getLanguageInformation() and getSupportedEntityTypes() in com.basistech.rosette.rex.EntityExtractor now take in a flag for whether or not to return information on all languages REX supports, or just those that are licensed. Additionally, REXCmd info now has a -onlyLicensed option. (TEJ-767)

Release 7.22.0

Added partial support for Vietnamese to extract phone numbers and dates using regexes. (TEJ-740)
REX annotators are now created faster and can feasibly be created on a document level when using the new com.basistech.rosette.rex.REXAnnotatorFactory API. The current com.basistech.rosette.rex.EntityExtractor API now also has a faster startup time. See the javadocs, the API Overview section in the Application Developer’s guide, and the sample program at ./samples/EntityAnnotatorFactorySample.java for additional details. (TEJ-773)

Release 7.21.0

REX now reports its results using two new classes, Entity and Mention, such that each Entity in a document has one or more Mentions that refer to the same real-world identity. Moving forward, this API will replace EntityMention and its coreferenceChainId. This version of REX is backwards-compatible and still supports the deprecated EntityMention. (TEJ-702)
Improved Malaysian statistical model and added a new Malaysian gazetteer. (TEJ-711, TEJ-715)
kbLinker (flinx) is part of a new experimental API to link entities from social media text to knowledge bases and may be changed or removed in future minor releases. (TEJ-722, TEJ-725, TEJ-757)
REX has been upgraded to Rosette Platform compatibility level 58.2. If you intend to use more than one Rosette JVM SDK in a single application, then you should choose versions that have the same compatibility number. (TEJ-702)
Improved case-insensitivity detection in European languages. (TEJ-687)
Entity mentions extracted with the statistical model now also specify the model’s path as a subsource. (TEJ-724)
Added a new method, EntityExtractor setOverlayDataDirectory(Path overlayDataDirectory), that allows you to specify an additional data directory for REX to use. (TEJ-731)
The REXCmd command line utility now allows you to specify any additional regex files you want to use, besides just the default. (TEJ-628)
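The relationship between the deprecated flat model (one EntityMention per mention, tied together by coreferenceChainId) and the Entity and Mention model described above can be pictured as a grouping step. The sketch below is illustrative only: FlatMention and groupByChain are hypothetical names, not REX classes or methods, and the real Entity and Mention classes carry more information than shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntityGroupingSketch {
    // Hypothetical stand-in for a flat mention that carries a chain ID.
    static final class FlatMention {
        final String text;
        final int chainId;

        FlatMention(String text, int chainId) {
            this.text = text;
            this.chainId = chainId;
        }
    }

    // Groups mention texts by chain ID: one map entry per entity,
    // holding every mention that refers to it, in document order.
    static Map<Integer, List<String>> groupByChain(List<FlatMention> mentions) {
        Map<Integer, List<String>> entities = new LinkedHashMap<>();
        for (FlatMention m : mentions) {
            entities.computeIfAbsent(m.chainId, k -> new ArrayList<>()).add(m.text);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<FlatMention> flat = Arrays.asList(
                new FlatMention("Marie Curie", 0),
                new FlatMention("Paris", 1),
                new FlatMention("Curie", 0));
        groupByChain(flat).forEach((id, texts) -> System.out.println(id + " -> " + texts));
    }
}
```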

Release 7.20.0

Added full support for Standard Malay to extract standard entities using regexes, gazetteers, and statistical processors. (TEJ-704)

Release 7.19.3

Added support for extraction using two statistical models operating in tandem. (TEJ-674)
In order to reduce disk footprint, Big Endian binaries are no longer shipped. REX will correctly memory map Little Endian models and dictionaries even on Big Endian systems. (TEJ-664)
Optional new packaging: RBL and REX classes are available in one combined jar. (TEJ-692)
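The following minimal sketch shows the general Java technique that makes Little Endian data readable on any platform: the byte order is set explicitly on the buffer instead of relying on the native order. It uses only the standard java.nio API; the class and method names are assumptions for this example, and it does not show how REX itself loads its models. A MappedByteBuffer accepts the same order() call.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LittleEndianReadSketch {
    // Reads a 32-bit integer stored in Little Endian order,
    // regardless of the byte order of the machine running the JVM.
    static int readLittleEndianInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt(offset);
    }

    public static void main(String[] args) {
        // 0x12345678 as it would be laid out in a Little Endian file.
        byte[] onDisk = {0x78, 0x56, 0x34, 0x12};
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println(Integer.toHexString(readLittleEndianInt(onDisk, 0)));
    }
}
```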

Release 7.18.0

This is a maintenance release to address SUPPO-569.

Release 7.17.0

Added new customization for statistical processors with unsupervised field training. See Section 4.5 of the Application Developer Guide.
Added partial support for Malaysian. (TEJ-626)

Release 7.16.0

Added a new setting for REX to automatically choose the most accurate CaseSensitivity model (case-insensitive or case-sensitive) for the input text. This is not activated by default; see the sample programs or javadocs for reference on how to enable this feature. (TEJ-568)
Added case-insensitive models for German, Italian, Dutch, and Spanish. (TEJ-566)
REX is now built with JDK 1.7, so users can no longer run REX on Java Virtual Machines versioned 1.6 and earlier. (TEJ-551)
Improved accuracy of the English statistical model by using multiple Brown clusters. (TEJ-396, TEJ-559)
New disableStatisticalCleaner option added to REXCmd and EntityExtractor. (TEJ-379)
You can now reactivate regular expression-based entities that are disabled by default by instructing REX to load the regex files in each language’s supplemental directory. See the Javadoc for EntityExtractor.addRegularExpressions(). (TEJ-587)
Refined redaction rules for PERSON entities. (TEJ-115)
EntityMentions and the returned fields are now documented in the Application Developer Guide. (TEJ-623)

Release 7.15.0

Added a boolean caseSensitive parameter to the EntityExtractor’s addGazetteer and addGazetteerEntity methods, to allow case-insensitive string matching of user-provided textual gazetteer entries. (TEJ-56)
A RosetteUnsupportedLanguageException is now thrown when REX cannot find data for the requested language, instead of a generic runtime exception. (TEJ-536)
Adding a duplicate gazetteer entry will overwrite the existing one. (TEJ-74)

Release 7.14.0

Support for script-insensitive Chinese added: Entities are now extracted from Chinese input documents for which the 'Simplified' or 'Traditional' writing system is not specified. Applications may now submit text using the zho language code instead of specifying zhs or zht. (TEJ-525)
Version 7.14.0.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis Technology JVM SDK in a single application, then choose versions that have the same compatibility number. (TEJ-524)

Release 7.13.0

Indonesian (Bahasa Indonesia) support added. (TEJ-441)
To enhance performance, and in response to customer feedback, we deactivated the regular expressions for extracting the following entity types: IDENTIFIER:DISTANCE, IDENTIFIER:LATITUDE_LONGITUDE, IDENTIFIER:UTM, TEMPORAL:DATE, and TEMPORAL:TIME. You can restore support for any of these entity types by removing the @ignore=rex-je attribute value that appears in front of the relevant regular expressions in the regexes.xml files. (TEJ-510)
Improved speed and reduced memory consumption of regex and gazetteer matching. (TEJ-489)
Improved the accuracy of the statistical models for case-insensitive Portuguese and French. (TEJ-475)
Added the ability to modify the indoc shut-off threshold: setMaxResolvedEntities(). (TEJ-456)
Introduced an experimental EntityExtractor API for excluding entity types in which you are not interested. See the Javadoc for {get,set}ExcludedEntityTypes for details. Note that this is an experimental API which can change or be removed in future minor versions of REX. (TEJ-480)

Release 7.12.1

REX emits an exception when asked to extract in a language it has no data for. (TEJ-447)
Missing values for confidence and coreferenceChainId are now represented as nulls instead of -1. (TEJ-466, TEJ-467)
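Caller code that previously tested for the -1 sentinel needs a null check instead, because comparing a null Double with a number unboxes it and throws a NullPointerException. The sketch below shows one way to handle the missing value; formatConfidence is a hypothetical caller-side helper, not part of the REX API.

```java
import java.util.Locale;

class NullableConfidenceSketch {
    // Formats an optional confidence value. A null means "no score available",
    // which older releases would have reported as -1.
    static String formatConfidence(Double confidence) {
        if (confidence == null) {
            return "n/a";
        }
        return String.format(Locale.ROOT, "%.2f", confidence);
    }

    public static void main(String[] args) {
        System.out.println(formatConfidence(0.8731));
        System.out.println(formatConfidence(null));
    }
}
```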

Release 7.12.0

Improved Korean Entity Extraction and In-Document Co-reference Resolution
REX uses a new statistical model that achieves higher accuracy (about 25% overall error reduction). (APE-1111)
In-document co-reference resolution now recognizes Korean prefixes and suffixes, and will attempt to chain morphological variations of Korean entity mentions. (TEJ-366)
Added features to the REXCmd utility
Plaintext output now pretty-prints the chain ID for each entity mention it returns. Mentions with the same chain ID refer to the same entity. (TEJ-410)
The -context option marks the entities in their original text context, with embedded entity type and chain ID. (TEJ-410)
REXCmd now supports pre-annotated input in json-serialized Annotated Text format. (TEJ-455)
Modified the reporting of text offsets for partial regular-expression and gazetteer matches to align with the token boundaries of the tokens that contain the matched text. (TEJ-393)
The REX Tcl implementation of regular expressions now supports characters in the Supplementary Multilingual Plane (SMP) (TEJ-454). Previous releases represented SMP codepoints as two characters each. (TEJ-330)
Provided an EntityExtractor option to instruct the REX statistical processor to ignore the lowercase/uppercase distinction. This feature is currently supported for English, French, and Portuguese. (TEJ-458)
Added the EntityExtractor.createDispatchAnnotator method to allow annotating documents from a predefined set of languages. (TEJ-432)
Added Portuguese regular expressions for temporal and monetary expressions. (TEJ-444)
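The Supplementary Multilingual Plane change above concerns the difference between code points and UTF-16 code units. As a general Java illustration (not a view into the REX regex engine), the sketch below shows that an SMP character occupies two char values in a String while counting as a single code point. The class and method names are assumptions for this example.

```java
class SupplementaryPlaneSketch {
    // Number of UTF-16 code units (Java char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies in the Supplementary Multilingual Plane.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println("code units:  " + codeUnits(s));
        System.out.println("code points: " + codePoints(s));
    }
}
```

Offsets computed in code units and offsets computed in code points diverge after the first SMP character, so it matters which of the two a consumer of match offsets assumes.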

Release 7.11.0

Added the Fragment Boundary Detector to enable the extractor to separate entities in text fragments that do not form sentences. (TEJ-117)
Refined the usage pattern for the com.basistech.rosette.rex.REXCmd command-line utility. The JSON output this utility generates has been trimmed to represent the serialization of an AnnotatedText object. (TEJ-374)

Release 2.2.0

Deprecated com.basistech.rosette.rex.EntityCursor. Use com.basistech.rosette.rex.EntityExtractor and com.basistech.rosette.dm.Annotator to extract entities from an input document.
Re-established pattern matcher support for the regular expressions disabled in 2.1.0.

Release 2.1.0

Added a document-level API for extracting entities. Over time, we expect to deprecate EntityCursor (streaming) in favor of EntityExtractor and Annotator (document-level extraction).
Added an API (the EntityExtractor setPostConfidence method) for extracting a confidence floating point value for each entity that REX Java Edition finds. A potential entity is ignored if its confidence score is below the threshold set with the setConfidenceThreshold method. (TEJ-60)
REX Java Edition returns normalized entities from all sources: statistical, pattern matching, and exact matches (gazetteers). (TEJ-288)
We have disabled the use of pattern matcher regular expressions that do not terminate properly, consuming large amounts of CPU time. This change could cause REX to miss some temporal, distance and long/lat expressions. If your use case requires high recall on these numeric types, please contact analyticssupport@babelstreet.com for assistance on enabling these regular expressions. (TEJ-283)
The REXCmd command-line utility now includes the REX Java Edition version number in the output. (TEJ-261)
Added a shell script (Unix) and .bat file (Windows) to simplify the running of the REXCmd command-line utility. (TEJ-247)
The JSON document generated by the command-line utility is much more verbose than in previous releases. Accordingly, if you are not writing the output to a file, you may want to pipe the output through a JSON parser (e.g., | python -mjson.tool) and concentrate on the EntityMention elements. (TEJ-260)
Added support for Arabic, Simplified Chinese, Traditional Chinese, Korean, and Japanese.
Support for using the REX Field Training Kit to enhance accuracy handling a particular category of documents and to return new entity types. (TEJ-220)

Release 2.0.0

Added support for Dutch, Hebrew, Persian (Western Farsi and Dari), Portuguese, Pashto, and Urdu.
Incorporated improvements to the statistical language models, gazetteers, and regular expressions introduced in the Rosette C++ REX implementation since the release of REX Java Edition 1.1.
Enhanced the command-line utility with new options. (TEJ-212)
Added support for resetting the maximum number of tokens that an entity may include, which defaults to 8. Use the EntityExtractor setMaxEntityTokens(int) method. (TEJ-188)

Release 1.1

Added support for Uppercase English, French, German, Italian, Russian, and Spanish.
Added support for resolving coreferences to the same entity. Use EntityExtractor setResolveNamedEntities(true) to put coreferences to the same entity in an entity chain: see EntityCursor getChainId(). (TEJ-52)
In response to customer feedback, removed IDENTIFIER:NUMBER from the default set of entity types returned by regular expressions. We commented out the IDENTIFIER:NUMBER entries in the regexes.xml files in data/regex/lang/accept, so you can re-activate any of these entries if you wish. (TEJ-171)
Added a public EntityCursor hasNext() method that can be used to determine whether there are any more entities in the result set, without advancing to the next entity. (TEJ-150)
Added the following EntityExtractor methods:

public void setStatisticalModel(LanguageCode, InputStream);
public void addGazetteer(LanguageCode, InputStream, boolean);
public void addGazetteer(LanguageCode, InputStream);
public void addRegularExpressions(LanguageCode, InputStream, boolean);
public void setRedactorWeights(InputStream);
public void addJoinerRules(InputStream);
public void setLicense(InputStream);
These methods enable access to data files placed in a JAR file (perhaps for use in a Hadoop environment). (TEJ-59)
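A data file packaged inside a JAR is normally opened as a classpath resource, which yields the InputStream that the methods above accept. The sketch below shows only that standard Java step. The open helper and the resource name used in main are hypothetical, and the call that would hand the stream to EntityExtractor is left as a comment because it depends on your installation.

```java
import java.io.IOException;
import java.io.InputStream;

class ClasspathStreamSketch {
    // Opens a resource from the classpath (for example, from inside a JAR).
    // Fails loudly instead of returning null when the resource is missing.
    static InputStream open(String resource) throws IOException {
        InputStream in = ClasspathStreamSketch.class.getClassLoader()
                .getResourceAsStream(resource);
        if (in == null) {
            throw new IOException("Resource not found on classpath: " + resource);
        }
        return in;
    }

    public static void main(String[] args) {
        // "data/gazetteer-eng.txt" is a hypothetical resource name.
        try (InputStream in = open("data/gazetteer-eng.txt")) {
            // Pass 'in' to one of the InputStream-based methods listed above.
            System.out.println("opened resource");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```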

Bugs Fixed

Bug number is followed by a brief bug description.

Fixed in 7.40.0

TEJ-1349 Fixed linker head mention bug
TEJ-1353 Fixed missing entity types in linker results

Fixed in 7.39.1

TEJ-1281 Set log level to Debug instead of Warn when linkEntities and genre don’t agree
TEJ-1319, TEJ-1346, ELK-114 Picked up new TVEC to improve the efficiency of the initial load time
TEJ-1324 Fixed a bug where the salience score was not always returned for entities with pronominal mentions, when requested.
TEJ-1327 Consumed new RBL to fix null pointer exception with pronoun resolver
TEJ-1331 Removed xxx from reported supported languages
APE-1766 Fixed a bug where all entity mentions were not always returned by statistical model

Fixed in 7.38.1

TEJ-1282 Relocated LIBLINEAR and TVEC, tested with rli
TEJ-1283 Fixed cases in which QID was not returned although DBPedia result was available, when using REX’s Kblinker
TEJ-1292 Custom processors made configurable via REXFactoryConfiguration
ELK-82 Fixed an acronym feature issue.

Fixed in 7.34.0

TEJ-1160 Moved ORGANIZATION eng regexes to supplemental directory to improve performance
TEJ-1167 Improved failure message for missing flinx data directory
TEJ-1168 Fixed null pointer exception with long chains

Fixed in 7.31.0

TEJ-1099 Added additional currency symbols to regex
TEJ-1108 Fixed hexadecimal number string incorrectly extracted as product

Fixed in 7.29.0

TEJ-1067 Fixed calculateConfidence configuration bug
TEJ-1020 Added regex for Israeli ID number
TEJ-1054 Cleaned MD5 codes extracted as PRODUCT entity type
TEJ-1049 Fixed case-insensitive text file gazetteer treated as case-sensitive

Fixed in 7.28.1

TEJ-1050 Fixed annotator configuration object being modified while creating.

Fixed in 7.28.0

TEJ-1041 Improved redactor rules for extractor and linking overlaps and indoc-coref to prefer chaining based on linking ID.
TEJ-1039 Enabled emoticon mode for RBL for RosAPI
TEJ-1042 Fixed error handling for requesting salience score while indoc-coref is disabled

Fixed in 7.27.0

TEJ-950 Improved custom processor examples.
TEJ-951 Apply American SSN regex for English only.
TEJ-962 Fixed personal ID entity type in Vietnamese regex.
TEJ-1005 Fixed source and subsource for static user gazetteer.

Fixed in 7.26.4

TEJ-1006 Improved handling of the case where the Arabic prefix is null.
TEJ-1014 Fixed shading of dependencies.

Fixed in 7.26.3

TEJ-971 Fixed a concurrency issue in kb-linker.
TEJ-977 Fixed confidence for non-statistical entities so that it is returned as null.

Fixed in 7.26.2

TEJ-952, TEJ-963 Fixed lookup issue in wordclasses

Fixed in 7.25.0

TEJ-693 setStatisticalModel without caseSensitivity assumes case-sensitive

Fixed in 7.23.1

TEJ-813 Relocated more RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.

Fixed in 7.22.0

TEJ-477 REXCmd now emits a usage error when 'info' is used with no info command.

Fixed in 7.21.0

TEJ-755 Relocated RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.
TEJ-735, TEJ-738 Fixed a bug in which Entity Type Survey didn’t support non-default field training models or customer-defined types.
TEJ-747 Fixed a bug in which REXCmd produced unexpected offsets for files with DOS-style line endings.
TEJ-718 Added Malaysian sample text.

Fixed in 7.20.2

TEJ-717 Minor update to zsm model in 'single' distribution form.

Fixed in 7.20.0

TEJ-694 Added a missing file resource release in addGazetteer.

Fixed in 7.19.3

TEJ-676, TEJ-691 Shaded and relocated all 3rd party dependencies.
TEJ-683 Removed extraneous pom.xml’s from META-INF in distro jar.

Fixed in 7.18.0

SUPPO-569 Fixed incompatibility issues with RBL.

Fixed in 7.17.1

TEJ-659 Fixed support for custom tags in statistical model parser.
APE-1608 Fixed REXCmd’s behavior with case-insensitive models and the -model command line option.

Fixed in 7.17.0

SUPPO-536 Fixed partial regex extraction for Korean.

Fixed in 7.16.0

TEJ-573 REXCmd now ignores a BOM at the beginning of its input file.
TEJ-574 Lookbehind assertions are not supported, and this is now included in the Application Developer Guide.
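For context on the BOM fix above: Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF, so input read from a file that starts with one carries an extra character unless it is removed. The sketch below shows the general technique of dropping it. The stripBom helper is an assumption for this example and is not the REXCmd implementation.

```java
import java.nio.charset.StandardCharsets;

class BomSketch {
    // Removes a single leading U+FEFF (byte order mark), if present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // UTF-8 BOM (EF BB BF) followed by the text "REX".
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'R', 'E', 'X'};
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println("length before: " + decoded.length());
        System.out.println("length after:  " + stripBom(decoded).length());
    }
}
```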

Fixed in 7.14.0

TEJ-512 Regex xml parser misses some CDATA sections (URL extraction issues)

Fixed in 7.13.0

TEJ-459, TEJ-507 indoc chaining applied only to PER/LOC/ORG
TEJ-498 Application Developer Guide claims that REX extracts full postal addresses

Fixed in 7.12.1

TEJ-486 EntityAnnotator reuse problems
TEJ-479 RBL dictionary markers in distro

Fixed in 7.12.0

TEJ-460 Partial match regex produces wrong offsets
TEJ-438 ICU dependency not relocated/shaded
TEJ-415 REX requires an RBL license