Entity Extractor (REX)
Release Notes
Release 7.56.1.c78.0
June 2025
New
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons IO | 2.18.0 | 2.19.0 |
Apache Commons Text | 1.13.0 | 1.13.1 |
Apache CXF Core | 4.1.1 | 4.1.2 |
Apache CXF JAX-RS Client | 4.1.1 | 4.1.2 |
Apache CXF Runtime HTTP Transport | 4.1.1 | 4.1.2 |
Apache CXF Runtime JAX-RS Frontend | 4.1.1 | 4.1.2 |
Apache CXF Runtime Security functionality | 4.1.1 | 4.1.2 |
Guava InternalFutureFailureAccess and InternalFutures | 1.0.2 | 1.0.3 |
Guava: Google Core Libraries for Java | 33.4.0-jre | 33.4.8-jre |
Jackson datatype: Guava | 2.18.2 | 2.19.0 |
Jackson Jakarta-RS: base | 2.18.2 | 2.19.0 |
Jackson Jakarta-RS: JSON | 2.18.2 | 2.19.0 |
Jackson module: Jakarta XML Bind Annotations (jakarta.xml.bind) | 2.18.2 | 2.19.0 |
Jackson module: Old JAXB Annotations (javax.xml.bind) | 2.18.2 | 2.19.0 |
Jackson-annotations | 2.18.2 | 2.19.0 |
Jackson-core | 2.18.2 | 2.19.0 |
jackson-databind | 2.18.2 | 2.19.0 |
Jackson-dataformat-XML | 2.18.2 | 2.19.0 |
Jackson-dataformat-YAML | 2.18.2 | 2.19.0 |
Jackson-JAXRS: base | 2.18.2 | 2.19.0 |
Jackson-JAXRS: JSON | 2.18.2 | 2.19.0 |
Protocol Buffers [Core] | 4.29.3 | 4.30.2 |
SnakeYAML | 2.3 | 2.4 |
Package | Version | License |
---|---|---|
JSpecify annotations | 1.0.0 | Apache-2.0 |
Project Lombok | 1.18.38 | MIT |
Package |
---|
CogComp-NLPy |
JVM Integration for Metrics |
Jackson-dataformat-CSV |
MongoDB Java Driver (unmaintained) |
SLF4J JDK14 Binding |
Release 7.56.0.c77.0
March 2025
Important
The installation instructions for the Solr plugin have changed.
To install the Solr plugin:
Copy all files from the lib directory inside the Entity Extractor Solr plugin installation (rex-je-solr) into the lib directory of your Solr core.
Copy all files from the lib directory inside the Entity Extractor installation (rex-je) into the lib directory of your Solr core.
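The two copy steps can be scripted. The following is a minimal sketch, not a supported installer: the function names copy_lib_files and install_plugin are hypothetical, and the caller supplies the installation and Solr core paths.

```python
import shutil
from pathlib import Path


def copy_lib_files(source_lib: Path, core_lib: Path) -> list:
    """Copy every regular file from source_lib into core_lib; return the copied names."""
    core_lib.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source_lib.iterdir()):
        if item.is_file():
            shutil.copy2(item, core_lib / item.name)
            copied.append(item.name)
    return copied


def install_plugin(rex_je_solr: Path, rex_je: Path, solr_core: Path) -> list:
    """Apply both copy steps: the plugin's lib first, then the Entity Extractor's lib."""
    core_lib = solr_core / "lib"
    return (copy_lib_files(rex_je_solr / "lib", core_lib)
            + copy_lib_files(rex_je / "lib", core_lib))
```

A call might look like install_plugin(Path("/opt/rex-je-solr"), Path("/opt/rex-je"), Path("/var/solr/data/mycore")), where all three paths are placeholders for your own locations.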
New
New entity types: You can now extract social media entity types using Entity Extractor. The types extracted are HASHTAG, ATMENTION, URL, and EMAIL. To extract these types, set extractSocialMedia to true in the annotator configuration. (TEJ-2672)
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Solr support: Solr 9.8.1 is now supported. (TEJ-2683)
Solr installation change: We've changed how the plugin is installed because Solr has discontinued the lib directives. When installing, you must copy all JAR files from the lib directory inside the Entity Extractor installation into the lib directory of your Solr core and ensure that the solrconfig.xml file points to the REX-JE installation directory. If you are using Solr version 9.7.x or earlier, the previous installation instructions will still work.
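To give a feel for the four social media types (HASHTAG, ATMENTION, URL, EMAIL), here is a rough regex sketch. The patterns and the extract_social_media function are simplified assumptions for illustration only; they are not the rules Entity Extractor applies.

```python
import re

# Simplified, illustrative patterns listed in priority order (earlier wins on overlap).
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s]+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("HASHTAG", re.compile(r"(?<!\w)#\w+")),
    ("ATMENTION", re.compile(r"(?<!\w)@\w+")),
]


def extract_social_media(text):
    """Return (type, mention) pairs in text order, skipping overlapping matches."""
    taken = []  # (start, end, label) spans already claimed
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Keep a match only if it does not overlap a higher-priority span.
            if all(m.end() <= s or m.start() >= e for s, e, _ in taken):
                taken.append((m.start(), m.end(), label))
    return [(label, text[s:e]) for s, e, label in sorted(taken)]
```

The priority order keeps a fragment such as #docs inside a URL from being reported as a separate hashtag.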
Bug Fixes
We fixed a bug with the in-document coreference server. In-document coreference chains of entity mentions will now be correct. (TEJ-2655)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Angus Activation Registries | 2.0.1 | 2.0.2 |
Auto Common Libraries | 0.8 | 1.2.1 |
AutoService | 1.0-rc4 | 1.1.1 |
Apache Commons Codec | 1.17.1 | 1.18.0 |
Apache Commons IO | 2.17.0 | 2.18.0 |
Apache Commons Text | 1.12.0 | 1.13.0 |
Apache CXF | 4.0.4 | 4.1.1 |
Apache Log4j | 2.24.1 | 2.24.3 |
Guava | 33.3.1-jre | 33.4.0-jre |
istack common utility code runtime | 4.0.1 | 4.1.2 |
Jackson | 2.17.2 | 2.18.2 |
Jakarta Activation API | 2.1.2 | 2.1.3 |
Jakarta RESTful WS API (prev. jakarta.ws.rs-api) | 3.0.0 | 3.1.0 |
Jakarta XML Binding API | 3.0.1 | 4.0.2 |
JavaCPP | 1.5.10 | 1.5.11 |
JAXB Core and Runtime | 3.0.2 | 4.0.5 |
JVM Integration for Metrics | 3.0.1 | 3.0.2 |
Protocol Buffers [Core] | 3.25.5 | 4.29.3 |
Metrics Core | 3.0.1, 4.2.28 | 3.0.2, 4.2.30 |
TXW2 Runtime | 3.0.2 | 4.0.5 |
Package | Version | License |
---|---|---|
Java Architecture for XML Binding | 2.2.12 | CDDL 1.1 |
MongoDB Java Driver (unmaintained) | 3.12.14 | The Apache License, version 2.0 |
Release 7.55.15.c76.0
November 2024
New
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Java 21 support added: Java 21 is now supported. Java 11 and 17 are still supported (TEJ-2484).
Third-party component updates
Package | Old Version | New Version |
---|---|---|
annoy | 0.2.5 | 0.2.6 |
Apache Commons Compress | 1.27.0 | 1.27.1 |
Commons IO | 2.16.1 | 2.17.0 |
Apache Commons Lang | 3.16.0 | 3.17.0 |
Apache Commons Text | 1.10.0 | 1.12.0 |
Apache Log4j | 2.23.1 | 2.24.1 |
fastutil | 8.5.14 | 8.5.15 |
Guava | 33.3.0-jre | 33.3.1-jre |
JavaCPP | 1.5.8 | 1.5.10 |
Metrics Core | 3.2.3 | 4.2.28 |
Protocol Buffers | 3.25.3 | 3.25.5 |
SnakeYAML | 2.2 | 2.3 |
Woodstox | 7.0.0 | 7.1.0 |
Package | Version | License |
---|---|---|
Apache Commons Codec | 1.16.1 | Apache-2.0 |
Project Lombok | 1.18.34 | The MIT License |
Streaming API for XML | 1.0-2 | GNU General Public Library |
Release 7.55.14.c75.0
September 2024
New
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Bug Fixes
Fixed a bug where REX would accept indoc-coref-server entity mentions which violated sentence boundaries. REX will now reject mentions from indoc-coref-server that are not contained within document sentences. (TEJ-2451)
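The containment rule described in this fix can be pictured as a simple span check. This is a minimal sketch under assumed conventions (half-open character offsets, a hypothetical filter_mentions helper); it is not REX's internal code.

```python
def filter_mentions(mentions, sentences):
    """Keep only mentions whose [start, end) span lies inside a single sentence span."""
    kept = []
    for start, end in mentions:
        if any(s_start <= start and end <= s_end for s_start, s_end in sentences):
            kept.append((start, end))
    return kept
```

For example, with sentences at (0, 10) and (10, 25), a mention at (8, 12) straddles the boundary and is rejected, while (2, 5) is kept.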
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Apache Commons CLI | 1.7.0 | 1.9.0 |
Apache Commons Compress | 1.26.1 | 1.27.0 |
Apache Commons Lang | 3.14.0 | 3.16.0 |
fastutil | 8.5.13 | 8.5.14 |
Guava | 33.2.0-jre | 33.3.0-jre |
Jackson | 2.17.1 | 2.17.2 |
Project Lombok | 1.18.22 | 1.18.34 |
TensorFlow for Java | 0.3.3 | 1.0.0-rc.1 |
Woodstox | 6.6.2 | 7.0.0 |
Release 7.55.13.c74.0
June 2024
New
Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-507)
Linking improvements: We've added a heuristic to help stop generic, unnamed entities, such as "mortgage law", from being linked. (RWIKI-475)
Bug Fixes
Aliases are no longer filtered by low normalized link probability; it is now possible to link entities where abbreviations like "MIT", "LA", "WHO", "UN" are the mention text. (RWIKI-389)
We fixed a NullPointerException that occurred while writing a log entry when processing empty tokens. (TEJ-2361)
We updated the REXCmd output for plain text and console output to display the entityId for all mentions, not just the head mention. We also fixed an ArrayIndexOutOfBoundsException that occurred while writing context output. (TEJ-2155)
We fixed a bug where news media, such as television programs, were typed as ORG. (RWIKI-483)
We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)
Third-party component updates
Package | Version | License |
---|---|---|
Angus Activation Registries | 2.0.1 | EDL 1.0 |
Package | Old Version | New Version |
---|---|---|
Apache Commons CLI | 1.6.0 | 1.7.0 |
Apache Commons Codec | 1.11 | 1.16.1 |
Apache Commons Compress | 1.26.0 | 1.26.1 |
Apache Commons IO | 2.15.1 | 2.16.1 |
Apache Log4j | 2.21.1 | 2.23.1 |
Apache CXF | 3.4.7 | 4.0.4 |
args4J | 2.33 | 2.37 |
Guava | 33.0.0-jre | 33.2.0-jre |
iStack Common Utility Code | 3.0.12 | 4.0.1 |
JAXB | 2.3.4 | 3.0.2 |
Jackson | 2.16.1 | 2.17.1 |
Jakarta Activation API | 1.2.2 | 2.1.2 |
Jakarta Annotations API | 1.3.5 | 2.1.1 |
Jakarta RESTful Web Services API | 2.1.6 | 3.0.0 |
Jakarta XML Binding API | 2.3.3 | 3.0.1 |
Protocol Buffers | 3.25.0 | 3.25.3 |
TXW2 | 2.3.4 | 3.0.2 |
Woodstox | 4.4.1, 6.2.6 | 6.6.2 |
XmlSchema | 2.2.5 | 2.3.1 |
Removed Packages
Jakarta SOAP with Attachments API
Jakarta Transaction API
Jakarta Web Services Metadata API
Jakarta XML Web Services API
Release 7.55.12.c73.0
May 2024
New
Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. You should see large improvements in entity linking. (RWIKI-454, RWIKI-507)
We've changed how some entity types are linked to the provided knowledge base:
PERSON: Now only real humans are linked as person entities; fictional, imaginary, and mythical humans are not.
PRODUCT: Product entities now exclude most creative works.
Linking improvements: We've changed the conflict resolution algorithm to one which tries to link using the longest possible mentions. You should see better linking, especially in cases where the mention of a popular entity is embedded within the mention of interest. (RWIKI-404)
Example: I studied at the University of Chicago
Previously linked: Chicago
Now linked: University of Chicago
Solr support: Solr 9.6.1 is now supported (TEJ-2357)
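The "longest possible mention" behavior in the University of Chicago example can be sketched as a greedy selection over overlapping candidate spans. This is an assumed simplification with a hypothetical resolve_longest function; the release note does not describe the actual conflict resolution algorithm in this detail.

```python
def resolve_longest(candidates):
    """Greedy sketch: prefer longer spans, drop any candidate overlapping a chosen one.

    Each candidate is (start, end, text) with half-open character offsets.
    """
    chosen = []
    by_length = sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0]))
    for start, end, text in by_length:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, text))
    return sorted(chosen)
```

Given the candidates "Chicago" and "University of Chicago" from the sentence above, only the longer span survives; candidates that do not overlap are all kept.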
Bug Fixes
We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)
We fixed a bug where Chinese characters were normalized when looking up knowledge base artifacts for linking in Japanese. You should see improved entity linking in Japanese. (RWIKI-406)
English terms for half(s) and quarter(s) were removed from the Russian (RUS) and German (DEU) regexes for time. (TEJ-1817)
Release 7.55.11.c73.0
March 2024
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Apache Commons CLI | 1.2.0 | 1.6.0 |
Apache Commons Compress | 1.24.0 | 1.26.0 |
Apache Commons IO | 2.15.0 | 2.15.1 |
Apache Commons Lang | 3.12.0 | 3.14.0 |
fastutil | 8.5.12 | 8.5.13 |
Guava | 32.1.3-jre | 33.0.0-jre |
Guava InternalFutureFailureAccess and InternalFutures | 1.0.1 | 1.0.2 |
ICU4J | 70.1 | 74.2 |
Jackson Annotations | 2.15.3 | 2.16.1 |
Jackson Core | 2.15.3 | 2.16.1 |
Jackson Databind | 2.15.3 | 2.16.1 |
Jackson Dataformat CSV | 2.15.3 | 2.16.1 |
Jackson Dataformat XML | 2.15.3 | 2.16.1 |
Jackson Dataformat YAML | 2.15.3 | 2.16.1 |
Jackson Datatype: Guava | 2.15.3 | 2.16.1 |
Jackson JAXRS: Base | 2.15.3 | 2.16.1 |
Jackson JAXRS: JSON | 2.15.3 | 2.16.1 |
Package |
---|
JavaPoet |
OSGi Core |
Release 7.55.10.c73.0
March 2024
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Apache Commons CLI | 1.2.0 | 1.6.0 |
Apache Commons Compress | 1.24.0 | 1.26.0 |
Apache Commons IO | 2.15.0 | 2.15.1 |
Apache Commons Lang | 3.12.0 | 3.14.0 |
fastutil | 8.5.12 | 8.5.13 |
Guava | 32.1.3-jre | 33.0.0-jre |
Guava InternalFutureFailureAccess and InternalFutures | 1.0.1 | 1.0.2 |
ICU4J | 70.1 | 74.2 |
Jackson Annotations | 2.15.3 | 2.16.1 |
Jackson Core | 2.15.3 | 2.16.1 |
Jackson Databind | 2.15.3 | 2.16.1 |
Jackson Dataformat CSV | 2.15.3 | 2.16.1 |
Jackson Dataformat XML | 2.15.3 | 2.16.1 |
Jackson Dataformat YAML | 2.15.3 | 2.16.1 |
Jackson Datatype: Guava | 2.15.3 | 2.16.1 |
Jackson JAXRS: Base | 2.15.3 | 2.16.1 |
Jackson JAXRS: JSON | 2.15.3 | 2.16.1 |
Package |
---|
JavaPoet |
OSGi Core |
Release 7.55.9.c72.0
December 2023
New
Solr support: Solr 9.4 is now supported. (TEJ-2112)
Build scripts: The scripts that improve gazetteer performance by compiling gazetteers into binary files have been moved to the ./scripts directory. You no longer need the Field Training Kit (FTK) to compile these files. (TEJ-2096)
Bug Fixes
We've added a reject gazetteer to ensure that the string USA, Canada is extracted correctly as 2 location entities: USA and Canada. This is active by default. (TEJ-2114)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Jackson-annotations | 2.15.2 | 2.15.3 |
Jackson Core | 2.15.2 | 2.15.3 |
Jackson Databind | 2.15.2 | 2.15.3 |
Jackson Dataformat XML | 2.15.2 | 2.15.3 |
Jackson Dataformat YAML | 2.15.2 | 2.15.3 |
Jackson datatypes: Guava | 2.15.2 | 2.15.3 |
Jackson-JAXRS: base | 2.15.2 | 2.15.3 |
Jackson-JAXRS: JSON | 2.15.2 | 2.15.3 |
Jackson module: Old JAXB Annotations (javax.xml.bind) | 2.15.2 | 2.15.3 |
Guava: Google Core Libraries for Java | 32.1.2-jre | 32.1.3-jre |
Protocol Buffers [Core] | 3.23.4 | 3.25.0 |
Apache Commons IO | 2.11.0 | 2.15.0 |
Apache Commons Compress | 1.23.0 | 1.24.0 |
liblinear | 2.42 | 2.44 |
Apache Log4j API | 2.20.0 | 2.21.1 |
Apache Log4j Core | 2.20.0 | 2.21.1 |
Apache Log4j SLF4J Binding | 2.20.0 | 2.21.1 |
Stax2 API | 4.2.1 | 4.2.2 |
SnakeYAML | 2.0 | 2.2 |
Release 7.55.8.c71.0
September 2023
New
Solr support: Solr 9.3 is now supported.
Bug Fixes
We fixed a bug in licensing for Chinese language codes. (WS-2861)
Third-party component updates
Package | Old Version | New Version |
---|---|---|
Jackson Annotations | 2.15.0 | 2.15.2 |
Jackson Core | 2.15.0 | 2.15.2 |
Jackson Databind | 2.15.0 | 2.15.2 |
Jackson Dataformat XML | 2.15.0 | 2.15.2 |
Jackson Dataformat T | 2.15.0 | 2.15.2 |
Jackson Datatype: Guava | 2.15.0 | 2.15.2 |
Jackson Module: Old JAXB Annotations | 2.15.0 | 2.15.2 |
Guava: Google Core Libraries for Java | 31.1-jre | 32.1.2-jre |
Protocol Buffers [Core] | 3.21.7 | 3.23.4 |
Release 7.55.7.c70.0
June 2023
Bug Fixes
The parameter regexCurrencySplit has been fixed. When set to true, currency values are now extracted as two entity types, IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE, instead of IDENTIFIER:MONEY. (TEJ-1960)
Known Issues
The Solr plugin is not supported on Solr 9.2.x.
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Apache Commons Compress | 1.22 | 1.23 |
Apache Log4J API | 2.19.0 | 2.20.0 |
Apache Log4J Core | 2.19.0 | 2.20.0 |
Apache Log4J SLF4J Binding | 2.19.0 | 2.20.0 |
fastutil | 8.5.9 | 8.5.12 |
Jackson Annotations | 2.14.0 | 2.15.0 |
Jackson Core | 2.14.0 | 2.15.0 |
Jackson Databind | 2.14.0 | 2.15.0 |
Jackson Dataformat CSV | 2.14.0 | 2.15.0 |
Jackson Dataformat YAML | 2.14.0 | 2.15.0 |
Jackson Dataformat XML | 2.14.0 | 2.15.0 |
Jackson datatype: Guava | 2.14.0 | 2.15.0 |
Jackson JAXRS:base | 2.14.0 | 2.15.0 |
Jackson JAXRS:JSON | 2.14.0 | 2.15.0 |
Jackson module:OLD JAXB Annotations | 2.14.0 | 2.15.0 |
SnakeYAML | 1.33 | 2.0 |
Release 7.55.6.c69.0
March 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre |
Protocol Buffers [Core] | 3.12.2 | 3.21.7 |
Release 7.55.4.c69.0
March 2023
This release is for compatibility with other Rosette SDKs. There are no new features or bug fixes.
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Guava: Google Core Libraries for Java | 26.0-jre | 31.1-jre |
Protocol Buffers [Core] | 3.12.2 | 3.21.7 |
Release 7.55.3.c68.0
January 2023
Bug Fixes
We fixed a linking problem introduced in 7.55.0.c68.0, in which the NIL_BIAS parameter value was incorrect. The parameter is now correct for all languages. (TEJ-1899)
Release 7.55.0.c68.0
December 2022
New
Wikidata refreshed: We've updated the knowledge base data for the provided linking knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIKI-119, ELK-274, ELK-276)
New currency regex: We've introduced a new option, regexCurrencySplit, that, when set to true, will attempt to split entities of type IDENTIFIER:MONEY extracted with the regex engine into two new entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These two new types represent the amount of the currency (50,000) and the currency type ($), respectively. By default, regexCurrencySplit is set to false. (TEJ-1792)
Tagalog support: We've added case-insensitive NER support for Tagalog. Previously we released a case-sensitive model and we've now added the case-insensitive model as well. (TEJ-1858)
Parameter removed: We've removed the deprecated genre extraction option. This option was used to turn on the linker, which has been, and remains, available through the linkEntities option. The genre option is no longer available in the REX SDK, the Rosette Server REX configuration, or the Rosette API bindings. (TEJ-1855)
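The effect of the regexCurrencySplit option described in this release can be illustrated with a small sketch. The regular expression and the split_money function below are assumptions made for illustration; they do not reproduce the product's regex engine or its supported currency formats.

```python
import re

# Illustrative only: a currency symbol or three-letter code before or after the amount.
MONEY = re.compile(
    r"^(?P<pre>[$€£¥]|[A-Z]{3})?\s*"
    r"(?P<amt>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?P<post>[$€£¥]|[A-Z]{3})?$"
)


def split_money(mention):
    """Split a money mention into amount and type, or leave it as IDENTIFIER:MONEY."""
    m = MONEY.match(mention.strip())
    if not m or not (m.group("pre") or m.group("post")):
        return [("IDENTIFIER:MONEY", mention)]
    currency_type = m.group("pre") or m.group("post")
    return [
        ("IDENTIFIER:CURRENCY_AMT", m.group("amt")),
        ("IDENTIFIER:CURRENCY_TYPE", currency_type),
    ]
```

With this sketch, split_money("$50,000") yields the amount 50,000 and the type $, matching the example values in the note; a mention the pattern cannot parse is returned unchanged as IDENTIFIER:MONEY.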
Release 7.54.1.c67.0
October 2022
New
This beta release contains a new English statistical extraction model which was trained on financial data. The model has improved accuracy over the default English models for financial domain documents. (TEJ-1837)
Release 7.54.0.c67.0
September 2022
New
Tagalog (tgl) support: We've added Tagalog to our list of languages. The following processors are supported: gazetteer, regex, statistical NER, linking. (TEJ-1812, TEJ-1822, TEJ-1785, TEJ-1786)
New linking option: We've added a new option for entity linking. When linkMentionMode is set to entities, the linker will attempt to link the entities extracted by other processors (regex, gazetteers, and the statistical processor) instead of using its own processor to extract entity candidates. Depending on your data, this may provide higher accuracy and speed. (TEJ-1806)
REXCmd parameter change: The linkEntities parameter can now act as a toggle instead of taking a true/false value, matching how other REXCmd boolean parameters are handled. (TEJ-1806)
Previously:
REXCmd ... -linkEntities true
Now:
REXCmd ... -linkEntities
Parameter deprecated: The parameter genre is deprecated and will be removed in the next release.
Bug Fixes
REX no longer produces an exception when token normalization produces an empty token string. (TEJ-1803)
When looking for candidate mentions in text, if there is an overlap between these mentions the linker now resolves the longest spanning mention before disambiguation. (ELK-277)
Release 7.53.3.c67.0
June 2022
New
Configure knowledge base linking priority: With multiple knowledge bases, it is possible to set the order in which to try linking against each knowledge base. Set the priority in the redactor configuration file (ne_types.xml). (TEJ-1726, TEJ-1754)
Example: The following XML element will set the custom-kb priority higher than the default knowledge base (kb-linker) when linking a PRODUCT entity type:
<ne_type>
  <name>PRODUCT</name>
  <weight name="kb-linker" value="100" />
  <weight name="kb-linker:custom-kb" value="1" />
</ne_type>
relatedEntities renamed to contextWords: When creating a custom knowledge base, the feature contextWords, which was previously called relatedEntities, is required. Context words are language-specific words that are strongly related to the entity. The term relatedEntities has been deprecated. (TEJ-1756)
Java 17 support added: Java 17 is now supported. Java 8 and 9 support has been removed. (TEJ-1728, TEJ-1763)
Solr 9 support added: REX now supports Lucene and Solr 9. (TEJ-1731)
Solr 6 support deprecated: REX no longer supports Solr 6 or earlier. (TEJ-1731)
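When editing knowledge base weights in ne_types.xml, it can help to read the configured values back before deploying. The sketch below parses only the single ne_type element shown in the linking priority example; the read_weights function is hypothetical, and the structure of the rest of the file is not assumed.

```python
import xml.etree.ElementTree as ET

# The element from the linking priority example in this release.
SNIPPET = """
<ne_type>
  <name>PRODUCT</name>
  <weight name="kb-linker" value="100" />
  <weight name="kb-linker:custom-kb" value="1" />
</ne_type>
"""


def read_weights(xml_text):
    """Return (entity type, {weight name: integer value}) for one ne_type element."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("name")
    weights = {w.get("name"): int(w.get("value")) for w in root.findall("weight")}
    return name, weights
```

The sketch only reports the values; how the redactor interprets them is described in the product documentation, not here.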
Bug Fixes
Bug fix: An error is no longer generated when there are null prefixes in Arabic morphological analyses. (TEJ-1765)
Bug fix: We fixed a bug so that the noisy_context_vector feature can be used for disambiguation. (ELK-265, ELK-268, ELS-272, TEJ-1776)
Release 7.53.0.c66.0
March 2022
Notice
Solr 6 and earlier support is deprecated as of this release.
Java 8 and Java 9 support is deprecated as of this release.
Bug Fixes
Updated Log4j to version 2.17.1. (TEJ-1724)
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
Apache Commons Compress | 1.9 | 1.21 |
Apache Commons IO | 2.7 | 2.11.0 |
Apache Commons Lang3 | 3.3.2 | 3.12.0 |
Apache Log4j | 1.2.17 | 2.17.1 |
Auto Common Libraries | 0.3 | 0.8 |
AutoService | 1.0-rc3 | 1.0-rc4 |
ICU4J | 58.1 | 70.1 |
fastutil | 8.4.0 | 8.5.6 |
LibLinear | 2.30 | 2.42 |
SLF4J | 1.7.28 | 1.7.33 |
SnakeYAML | 1.26 | 1.30 |
TensorFlow for Java | 0.2.0 | 0.3.3 |
Package | Version | License |
---|---|---|
AOP alliance | 1.0 | Public Domain |
Apache Commons Logging | 1.2 | Apache License 2.0 |
Apache Commons Math | 2.0 | Apache License 2.0 |
Apache POI | 3.9 | Apache License 2.0 |
DOM4J | 1.6.1 | DOM4J License |
JCommon | 1.0.17 | GNU Lesser General Public Licence |
JFreeChart | 1.0.14 | GNU Lesser General Public Licence |
JUnit | 4.13.2 | Eclipse Public License 1.0 |
JVM Integration for Metrics | 3.0.4 | Apache License 2.0 |
Java Architecture for XML Binding | 2.3.2 | Eclipse Distribution License - v 1.0 |
Java Common Annotations API | 1.3.2 | CDDL + GPLv2 with classpath exception |
Java Message Service | 1.1 | Common Development and Distribution License (CDDL) v1.0 |
JavaBeans Activation Framework (JAF) | 1.1 | Common Development and Distribution License (CDDL) v1.0 |
JavaBeans Activation Framework API jar | 1.2.1 | EDL 1.0 |
JavaMail API | 1.4 | Common Development and Distribution License (CDDL) v1.0 |
Javax WS-RS API | 2.1.5 | EPL 2.0 |
JetBrains Java Annotations | 23.0.0 | Apache License 2.0 |
Jimfs | 1.1 | Apache License 2.0 |
Legion of the Bouncy Castle Java Cryptography APIs | 138 | Bouncy Castle License |
Lib TensorFlow | 1.5.0 | Apache License 2.0 |
Mockito | 1.9.5 | The MIT License |
ODFDOM | 0.8.6 | Apache License 2.0 |
Project Lombok | 1.18.22 | The MIT License |
Spring | 4.2.4.RELEASE | Apache License 2.0 |
StAX API | 1.0.1 | Apache License 2.0 |
Sun Multi-Schema XML Validator | 20050913 | The BSD License |
TensorFlow | 1.5.0 | Apache License 2.0 |
XML Commons External Components XML APIs | 1.3.04 | Apache License 2.0 |
Xerces2 Java Parser | 2.9.4 | Apache License 2.0 |
XMLBeans | 2.3.0 | Apache License 2.0 |
ZIP4J | 1.3.2 | Apache License 2.0 |
iText | 2.1.5 | Mozilla Public License |
Package |
---|
Apache Geronimo |
JAX-WS |
JBoss RMI |
JSR203 Hadoop |
Jacorb Omg |
Jakarta Activation |
Jakarta WS-RS API |
Jakarta XML Bind API |
Javax Activation |
Javax Annotation |
Javax XML Soap |
MIME Pull |
SAAJ Impl |
STAX-EX |
Release 7.52.0.c65.0
December 2021
New
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.42.2.c65.0. (TEJ-1681, TEJ-1693)
Bug Fixes
Hungarian dates are now extracted correctly. Previously, dates with embedded periods followed by a space were not being extracted. (TEJ-1681)
rexcmd info no longer lists TEMPORAL types by default for SWEDISH. (TEJ-1687)
Release 7.51.1.c65.0
September 2021
Bug Fixes
The supported entity types info of the DNN processor now spells PERSON correctly. (TEJ-1670)
Release 7.51.0.c65.0
August 2021
New
Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions. (TEJ-1657, TEJ-1658)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.41.1.c65.0. (TEJ-1667)
Bug Fixes
A single line followed by an empty line is no longer always considered a fragment. (ETROG-3431)
The Field Training Kit (FTK) no longer returns erroneous error messages from generating wordclasses. (TEJ-1636)
The RBL models directory is now correctly specified in the FTK. (TEJ-1655)
The REX Training Server (RTS) no longer fails when the request contains the language code msa. msa is now mapped to zsm, the language code supported by REX for Malay. (TEJ-1669)
Release 7.50.0
May 2021
Bug Fixes
We fixed a bug where invalid whitespace handling by the DNN processor would cause a runtime exception. (TEJ-1614)
Open Source Changes
Package | Old Version | New Version |
---|---|---|
jackson | 2.10.0 | 2.11.1 |
commons-io | 2.6 | 2.7 |
fastutil | 8.3.0 | 8.4.0 |
liblinear | 1.95 | 2.42 |
snakeyaml | 1.25 | 1.26 |
stax2-api | 4.2 | 4.2.1 |
Package | Version | License |
---|---|---|
JavaCPP | 1.5.4 | Apache 2.0 |
TensorFlow Core API | 0.2.0 | Apache 2.0 |
TensorFlow NDArray | 0.2.0 | Apache 2.0 |
Package |
---|
libtensorflow |
libtensorflow jni |
protobuf |
Release 7.49.1
April 2021
New
Language-specific joiner rules: Custom joiner rules can now be language-specific or apply to all languages. (TEJ-178)
New default processing for structured text regions (lists, tables): Structured text is often just words or phrases and so lacks the syntactic context that REX was trained on. Because REX performed poorly on these regions, some users pre-processed input text to remove them. This is no longer necessary: the statistical/DNN model is now turned off by default for structured regions. This mode increases precision but may reduce recall in these regions. The other REX processors (pattern match, exact match, entity linking), which do not rely on context, continue to analyze the structured regions. To turn on the statistical/DNN model for structured regions, set the parameter structuredRegionProcessingType to nerModel. (TEJ-1502)
New name classifier model for structured regions (LABS): We've added a new model for processing structured regions. The name classifier classifies a text fragment as PERSON, LOCATION, ORGANIZATION, or NONE. The entire structured region is classified with a single label: an entity type or NONE. It is disabled by default. (TEJ-1613, TEJ-1621)
Japanese organization gazetteers: The gazetteers for Japanese organizations have been updated to improve extraction of Japanese organizations. (TEJ-1612)
New RBL version: Entity extraction now consumes the latest version of Rosette Base Linguistics (RBL) 7.39.0. (TEJ-1618)
Rosette Training Server (RTS) results: When using REX with Adaptation Studio (RAS), the results returned by RTS are now preferred by default. (TEJ-1605)
Bug Fixes
Entities are no longer extracted when they cross a sentence boundary. To enable entity linking across sentence boundaries, set disableApplySentenceBoundaries to true. (ELK-259)
Entities are now checked to ensure they are normalized. (TEJ-1615)
Third-party component updates
This release includes the following third-party component changes:
Package | Old Version | New Version |
---|---|---|
liblinear | 1.94 | 2.42 |
Release 7.48.0
January 2021
Bug Fixes
We fixed an offset alignment issue for carriage return normalization. (TEJ-1600)
Release 7.47.0
December 2020
New
Updated the internal database for Wikidata linking. QIDs for some entities may change from previous versions, as Wikidata has been refreshed and re-indexed. (TEJ-1579, ELK-249, ELK-251, RWIKI-77)
Updated RBL version (TEJ-1579)
Bug Fixes
The sqlite-kb-connector sample now works correctly. Runtime issues with sqlite dependencies have been corrected. (ELK-245, ELK-257)
Extraction no longer fails when a custom processor returns a NULL annotator; instead a warning is generated. (TEJ-1580)
Mentions normalized by the custom processor are no longer ignored. (TEJ-1573)
Windows-formatted carriage returns (\r, \r\n) are now handled correctly.
Release 7.46.2
September 2020
New Features
Joiner runs before redactor: The joiner now runs before the redactor by default, providing more flexibility and control over the joiner results. Set runJoinerPostRedactor to true to run the joiner after the redactor. (TEJ-1534)
Improved phone number recognition: Regular expressions for phone number extraction have been improved and now extract more phone number patterns. (TEJ-1556)
REXCmd input from stdin: REXCmd can now accept input from stdin by specifying the command line option -stdin.
Example:
$ echo "Basis Technology is a company in Massachusetts" | REXCmd extract -stdin -langCode eng
Bug Fixes
We fixed a bug where sometimes a null pointer exception was returned when the custom processor and the linker had overlapping results. (TEJ-1561)
Custom processors can now only modify the entity and metadata sections of the ADM. Previously, any modification could be made which could override annotation data. (TEJ-1537)
We've partially fixed a problem in Japanese ORG extraction where sometimes the model extracts multiple ORG entities or includes non-related adjacent tokens. (TEJ-1534)
The Field Training Kit no longer generates invalid models when creating custom knowledge bases. This occurred for all languages except eng, jpn, and zho. (ELK-252)
Release 7.46.0
June 2020
New Features
Improved sample: The sample files to build the SQLite connector described in the Custom Knowledge Base Connectors section now include all files required to build with Maven. The configuration to run the connector with Rosette Enterprise is now provided as well. (TEJ-1508)
Language-specific alias: Custom knowledge bases compiled with the Field Training Kit (FTK) will now maintain the language of the alias. Aliases will only be extracted in documents of the language the alias is defined for. Aliases can be defined for all languages or for a specific language. (ELK-241)
Custom knowledge bases can be compiled without disambiguation. While adding a knowledge base without a disambiguation model will not provide the best results, it will function as an enhanced gazetteer that attaches an assigned ID to each gazetteer entry and supports multiple aliases per entry. To compile a custom knowledge base without compiling a disambiguation model, pass -d as an argument to train-linker-model. (ELK-233)
New method: A method, getBaseLinguisticsParameters, has been added to retrieve the base linguistics parameters that were used in training the model. Use the retrieved parameters to configure an external instance of RBL to produce tokens consistent with the training tokenization. A new sample application, RBLParametersSample.java, is available in the samples directory. (TEJ-1501)
Base linguistics added: The FTK can now use input ADM files containing base linguistics annotations, such as tokens, sentence boundaries, and morphological analysis for languages such as Korean and Arabic. For REX to produce optimal results, tokenize with the options provided by the getBaseLinguisticsParameters method when creating the ADM file from RBL. (APE-1793)
Hebrew improvements: REX has improved Hebrew normalization and added the ability of the disambiguator to identify prefixes removed from the entity's normalized form. Improvements are a result of enhancements in Hebrew base linguistics. (ETROG-3189)
Bug Fixes
A new line character in a regex (\n) will now also match carriage returns (\r) and a combination of both (\r\n). (TEJ-1525)
Confidence scores for entity linking now use the same scale, whether linking to Wikidata or a custom knowledge base. Previously, the confidence scores given for links to custom knowledge bases were much lower than those calculated for the Wikidata knowledge base. (ELK-240)
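The newline behavior described in TEJ-1525 can be approximated in a standalone regex by rewriting each \n escape so that it accepts any of the three line-ending forms. This is a sketch of the idea only, with a hypothetical compile_newline_tolerant helper; it does not show how REX's regex engine implements the change.

```python
import re

# Matches \r\n first so a Windows line ending is consumed as one unit.
ANY_NEWLINE = r"(?:\r\n|\r|\n)"


def compile_newline_tolerant(pattern, flags=0):
    r"""Compile pattern so each \n escape also matches \r and \r\n.

    Limitation of this sketch: an escaped backslash followed by n (\\n)
    is rewritten too, so avoid that sequence in the input pattern.
    """
    return re.compile(pattern.replace(r"\n", ANY_NEWLINE), flags)
```

A pattern written as r"Name:\nAlice" then matches the same text whether the lines are separated by \n, \r, or \r\n.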
Release 7.45.0
March 2020
New Features
Connector framework for custom Knowledge Bases added. See section 5.6 in the Application Developer's Guide. (TEJ-1476, TEJ-1477, TEJ-1485)
Added Deep Neural Network model for Hebrew for improved accuracy. Replace the statistical model with it by using the flag -useDeepNeuralNetworkProcessor. (TEJ-1503)
Hebrew normalization improved: instead of using the lemma form, just the prefixes are being removed, except the definite article. (TEJ-1505)
New statistical model for Hebrew trained on news and finance data. (TEJ-1497)
Solr plugin now available as a Docker container. (TEJ-1492)
Supplemental regex support for ISO-6709 geo-coordinates. (TEJ-1431, DATA-761)
Support for setting prioritization for multiple custom Knowledge Bases. See section 5.2 in the Application Developer's Guide. (ELK-236)
Redactor weights can now be configured for specific subsources. See section 3.2.1 in the Application Developer's Guide. (TEJ-1480)
Separate license key required for linker custom Knowledge Bases. Note: extractions against existing custom Knowledge Bases will fail unless licenses are updated. (TEJ-1483)
Custom Knowledge Bases can be set in Rosette Enterprise profiles. Note: To support this feature, the flinx directory was moved into {rex-installation}/data. Any custom data inside must also be moved to the new location. (TEJ-1494)
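For readers unfamiliar with the ISO 6709 string form mentioned above, here is a minimal sketch of recognizing one variant of it. This is not the supplemental regex shipped with REX; it handles only signed decimal degrees (for example +48.8577+002.295/) and ignores the degrees-minutes-seconds forms, altitude, and CRS suffix that the standard also allows. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Iso6709Sketch {
    // Simplified form: sign + 2-digit latitude degrees, sign + 3-digit longitude
    // degrees, each with an optional decimal fraction, terminated by a solidus.
    private static final Pattern DECIMAL_DEGREES =
        Pattern.compile("([+-]\\d{2}(?:\\.\\d+)?)([+-]\\d{3}(?:\\.\\d+)?)/");

    // Returns {latitude, longitude}, or null if the text is not in the simplified form.
    static double[] parse(String text) {
        Matcher m = DECIMAL_DEGREES.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        };
    }

    public static void main(String[] args) {
        double[] coords = parse("+48.8577+002.295/");
        System.out.println(coords[0] + ", " + coords[1]);
    }
}
```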
Bug Fixes
TEJ-1499 REXAnnotatorFactory failed to assign linking confidence thresholds.
TEJ-1479 Fixed dynamic gazetteers for Malay.
TEJ-1506 Deep Neural Network extractions failed in REXCmd.
Release 7.44.1
February 2020
Bug Fixes
TEJ-1470 Fixed an error when extracting entities in Arabic text.
Release 7.44.0
December 2019
New Features
New language: Entity extraction now supports Swedish. (TEJ-1395)
Manual cache memory eviction functionality added to the SDK. See the Application Developer Guide for details. (TEJ-1450)
Bug Fixes
ELK-169 Fixed an error when reading aliases binaries produced with the FTK.
ELK-179 FTK no longer requires English or Chinese models for other languages.
Release 7.43.2
September 2019
Bug Fixes
Re-enabled and improved decompounding for Japanese and Chinese. It was disabled in 7.43.1.
Release 7.43.1
August 2019
Bug Fixes
Disabled decompounding for Japanese and Chinese
Release 7.43.0
August 2019
New Features
Tested and confirmed compatibility with Java 11.
Updated internal database for Wikidata linking. The DBPedia Type field now supports multiple subtypes. QIDs for some of the entities may change from previous versions, as Wikidata has been refreshed and re-indexed.
Entity linking returns PermIDs (IDs from Thomson Reuters knowledge base) in addition to QIDs (Wikidata IDs) for some of the entities.
Bug Fixes
Fixed a potential Null Pointer Exception which could have occurred while using DNN.
Release 7.42.2
July 2019
Bug Fixes
Fixed multi-threading bug with custom gazetteer
Release 7.42.1
June 2019
New Features
Flinx disambiguation models are packaged with optional parameter files which control some parameters during runtime. These files were missing from several previous distribution packages, which may have affected accuracy. They have now been re-added to the distribution packages.
Fixed additional cases where Japanese characters were wrongly normalized into their simplified Chinese equivalents in entity linking, an issue addressed also in the previous release.
Chinese language code is now composed of three characters uniformly throughout the file system.
Entity extraction and entity linking now consume the latest version of RBL (Rosette Base Linguistics), which includes several improvements and bug fixes.
Improved installation by providing a script that unzips and installs the documentation and language packages.
Bug Fixes
Japanese date extraction now includes the new era 令和.
Release 7.41.0 and earlier
New Features
Release 7.41.0
In Japanese, a middle dot comes in the middle of Western names and acts as a sort of whitespace separating words in the name. Previously, some entities containing a middle dot were split into two entities, so only part of the name was extracted. This is now handled correctly, and entities with a middle dot are not split. (TEJ-1341)
In Japanese entity linking, the last character of a Japanese word was in some cases changed into a Chinese character. This is now fixed. (ELK-118)
Previously, when the includeDbPediaTypes option was off, entity linking occasionally extracted an inaccurate entity type. Now, fine types of linked entities are also identified when the includeDbPediaTypes option is off. (ELK-115)
Provided a distribution package per language. (TEJ-1361)
Release 7.39.0
Reduced linker data package size (TEJ-1306, TEJ-1321)
Updated the linking confidence calculation and thresholds to improve accuracy (TEJ-1343)
Release 7.38.1
Improved the accuracy of Korean extraction, largely through better handling of Josa (postpositions) and compound words.
Added support for entity linking to Wikipedia for both the top-level types (PERSON, LOCATION, ORGANIZATION, etc.) and the over 700 DBpedia types in the remaining 16 languages supported by Entity Extraction. This is in addition to the languages currently supported by entity linking: Chinese, English, Japanese, and Spanish.
Release 7.36.0
The linker process now has the option of returning over 700 new entity types drawn from the DBpedia ontology. To access these entity types, turn on the kbLinker processor and add the includeDBpediaType flag to the factory configuration. You’ll notice more than 10 additional primary types in the type field as well as the all-new DBpedia type field. Note that this is a LABS (experimental) API and subject to change. Send us your feedback!
New language: Entity extraction now supports Hungarian.
Release 7.35.0
Replaced Japanese tokenizer to improve accuracy (TEJ-1176, TEJ-1180)
Release 7.34.0
Enabled string normalization for Hebrew based on the DNN disambiguation model to improve indoc-coref results (chaining mentions into a single Entity) and to present a more proper form of the name (TEJ-1139, TEJ-1173)
Social-media characters such as '@' and '#' are removed from the Mention's normalized string; offsets to the original string data field remain the same. This feature can be disabled by EntityExtractor.setRetainSocialMediaSymbols() (TEJ-418)
Improved statistical model confidence score to emit maximal confidence less frequently (TEJ-1146)
Release 7.33.0
Added static and dynamic capabilities to adding entries to the custom knowledge base for entity linking (ELK-30, ELK-41, ELK-44, TEJ-1150)
Added a new deep neural network processor (BETA) as an alternative entity extraction processor, which can be used in place of the standard statistical extractor for English, Arabic and Korean (TEJ-1132, TEJ-1142, TEJ-1150)
Release 7.32.0
Added support for entity type "Title" in Hebrew (APE-1641)
Added new option MaxResolvedEntities to REXCmd (TEJ-1137)
Release 7.30.0
Accuracy of Korean statistical model is improved (APE-1737)
Default linking confidence thresholds are set (TEJ-1080, TEJ-1068)
The method setUseDeepNeuralNetworkProcessor() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to replace the statistical model with a new deep learning model. Another way to use it is to provide ProcessorType.deepNeuralNetwork to the method setProcessors. Currently available only for English and Arabic. Some operating systems do not support the deep neural network model, and some do not provide good latency.
Release 7.29.0
FTK supports training a disambiguation model for custom knowledge base (ELK-13, ELK-14, ELK-16, ELK-22, ELK-34)
Added default linking confidence threshold for linked entities (TEJ-1068)
Updated RBL version (TEJ-1048)
Application developer’s guide and customization guide are merged into a single manual (ELK-22)
Release 7.28.0
Added confidence score for linking results, named linkingConfidence, in addition to statistical model confidence score. Different thresholds apply for the different confidence scores. (TEJ-974)
Release 7.27.0
The new salience classifier has been incorporated. The salience calculation is enabled via EntityExtractor.setCalculateSalience() or by setting calculateSalience in either REXFactoryConfiguration or REXAnnotatorConfiguration. (TEJ-936)
Added manual custom processor registration API. (TEJ-972, TEJ-982)
Deduplicated partially duplicated regexes. (TEJ-785)
Improved multiple gazetteer (exact-match) processor support. (TEJ-960, TEJ-1005)
Release 7.26.0
Confidence score calculation is improved to correlate well with precision, and may be used for thresholding and removal of false positives (TEJ-910, TEJ-919)
Statistical models are trained with new emoticon-sensitive tokenizer (TEJ-924)
New script allows repacking REX with minimal configuration per language (TEJ-893)
Automatic case sensitivity mode prefers case-sensitive for short text by default (TEJ-931)
Release 7.25.0
Added Custom Processor for rejection. (TEJ-840, TEJ-841, TEJ-843, TEJ-880)
Redactor improvements: dynamic rules prioritization and subtypes handling (TEJ-863, TEJ-858)
Pronominal resolver is fully supported for English. Added as a processor type, as well as indoc-coref (TEJ-867)
Indoc-coref allows partial match for ORGANIZATION type (INDOC-26)
Release 7.24.1
Added full support for Vietnamese. (APE-1691)
Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/MapReduceExample/README.md. (TEJ-807)
The method setResolvePronouns() in com.basistech.rosette.rex.EntityExtractor is part of a new experimental API to resolve pronouns like 'he' and 'she' to entities of type Person. It may be changed or removed in future releases. Available only in English. (TEJ-831)
Reject regex and gazetteers allow wildcard entity type. (TEJ-853, TEJ-817)
The kb-linker experimental processor now supports Chinese and Japanese in addition to English. This functionality may be changed or removed in future minor releases. (TEJ-857)
Automatic case sensitivity improved (English only). (TEJ-861)
Release 7.23.1
Added an example of how to use REX over Hadoop DFS. The README illustrating how to use this example can be found at ./samples/SparkEntityCount/README.md. (TEJ-155)
Information about what languages are licensed can now be accessed. The methods getLanguageInformation() and getSupportedEntityTypes() in com.basistech.rosette.rex.EntityExtractor now take a flag for whether to return information on all languages REX supports, or just those that are licensed. Additionally, REXCmd info now has an -onlyLicensed option. (TEJ-767)
Release 7.22.0
Added partial support for Vietnamese to extract phone numbers and dates using regexes. (TEJ-740)
REX annotators are now created faster and can feasibly be created on a document level when using the new com.basistech.rosette.rex.REXAnnotatorFactory API. The current com.basistech.rosette.rex.EntityExtractor API now also has a faster startup time. See the javadocs, the API Overview section in the Application Developer’s guide, and the sample program at ./samples/EntityAnnotatorFactorySample.java for additional details. (TEJ-773)
Release 7.21.0
REX now reports its results using two new classes, Entity and Mention, such that each Entity in a document has one or more Mentions that refer to the same real-world identity. Moving forward, this API will replace EntityMention and its coreferenceChainId. This version of REX is backwards-compatible and still supports the deprecated EntityMention. (TEJ-702)
Improved Malaysian statistical model and added a new Malaysian gazetteer. (TEJ-711, TEJ-715)
kbLinker (flinx) is part of a new experimental API to link entities from social media text to knowledge bases and may be changed or removed in future minor releases. (TEJ-722, TEJ-725, TEJ-757)
REX has been upgraded to Rosette Platform compatibility level 58.2. If you intend to use more than one Rosette JVM SDK in a single application, then you should choose versions that have the same compatibility number. (TEJ-702)
Improved case-insensitivity detection in European languages. (TEJ-687)
Entity mentions extracted with the statistical model now also specify the model’s path as a subsource. (TEJ-724)
Added a new method, EntityExtractor setOverlayDataDirectory(Path overlayDataDirectory), that allows you to specify an additional data directory for REX to use. (TEJ-731)
The REXCmd command line utility now allows you to specify any additional regex files you want to use, besides just the default. (TEJ-628)
Release 7.20.0
Added full support for Standard Malay to extract standard entities using regexes, gazetteers, and statistical processors. (TEJ-704)
Release 7.19.3
Added support for extraction using two statistical models operating in tandem. (TEJ-674)
In order to reduce disk footprint, Big Endian binaries are no longer shipped. REX will correctly memory map Little Endian models and dictionaries even on Big Endian systems. (TEJ-664)
Optional new packaging: RBL and REX classes are available in one combined jar. (TEJ-692)
Release 7.18.0
This is a maintenance release to address SUPPO-569.
Release 7.17.0
Added new customization for statistical processors with unsupervised field training. See Section 4.5 of the Application Developer Guide.
Added partial support for Malaysian. (TEJ-626)
Release 7.16.0
Added a new setting for REX to automatically choose the most accurate CaseSensitivity model (case-insensitive or case-sensitive) for the input text. This is not activated by default; see the sample programs or javadocs for how to enable this feature. (TEJ-568)
Added case-insensitive models for German, Italian, Dutch, and Spanish. (TEJ-566)
REX is now built with JDK 1.7, so users can no longer run REX on Java Virtual Machines versioned 1.6 and earlier. (TEJ-551)
Improved accuracy of the English statistical model by using multiple Brown clusters. (TEJ-396, TEJ-559)
New disableStatisticalCleaner option added to REXCmd and EntityExtractor. (TEJ-379)
You can now reactivate regular expression-based entities that are disabled by default by instructing REX to load the regex files in each language’s supplemental directory. See the Javadoc for EntityExtractor.addRegularExpressions(). (TEJ-587)
Refined redaction rules for PERSON entities. (TEJ-115)
EntityMentions and the returned fields are now documented in the Application Developer Guide. (TEJ-623)
Release 7.15.0
Added a boolean caseSensitive parameter to the EntityExtractor’s addGazetteer and addGazetteerEntity methods, to allow case-insensitive string matching of user-provided textual gazetteer entries. (TEJ-56)
A RosetteUnsupportedLanguageException is now thrown when REX cannot find data for the requested language, instead of a generic runtime exception. (TEJ-536)
Adding a duplicate gazetteer entry will overwrite the existing one. (TEJ-74)
Release 7.14.0
Support for script-insensitive Chinese added: Entities are now extracted from Chinese input documents for which the 'Simplified' or 'Traditional' writing system is not specified. Applications may now submit text using the zho language code instead of specifying zhs or zht. (TEJ-525)
Version 7.14.0.c56.6 introduced the use of the "compatibility" version number extension (c56.6 in this case). If you intend to use more than one Basis Technology JVM SDK in a single application, then choose versions that have the same compatibility number. (TEJ-524)
Release 7.13.0
Indonesian (Bahasa Indonesia) support added. (TEJ-441)
To enhance performance, and in response to customer feedback, we deactivated the regular expressions for extracting the following entity types: IDENTIFIER:DISTANCE, IDENTIFIER:LATITUDE_LONGITUDE, IDENTIFIER:UTM, TEMPORAL:DATE, and TEMPORAL:TIME. You can restore support for any of these entity types by removing the @ignore=rex-je attribute value that appears in front of the relevant regular expressions in the regexes.xml files. (TEJ-510)
Improved speed and reduced memory consumption of regex and gazetteer matching. (TEJ-489)
Improved the accuracy of the statistical models for case-insensitive Portuguese and French. (TEJ-475)
Added the ability to modify the indoc shut-off threshold: setMaxResolvedEntities(). (TEJ-456)
Introduced an experimental EntityExtractor API for excluding entity types in which you are not interested. See the Javadoc for {get,set}ExcludedEntityTypes for details. Note that this is an experimental API which can change or be removed in future minor versions of REX. (TEJ-480)
Release 7.12.1
REX emits an exception when asked to extract in a language it has no data for. (TEJ-447)
Missing values for confidence and coreferenceChainId are now represented as nulls instead of -1. (TEJ-466, TEJ-467)
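Code that previously compared these fields against -1 needs a null check instead. A minimal sketch of that caller-side change follows; the helper below is hypothetical and is not part of the REX API, and the nullable Double parameter simply stands in for a confidence value read from a mention.

```java
class ConfidenceNullSketch {
    // Hypothetical helper: formats a nullable confidence value.
    // A null means "no confidence available", which older releases signaled with -1.
    static String describe(Double confidence) {
        if (confidence == null) {
            return "n/a";
        }
        return Double.toString(confidence);
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe(0.75));
    }
}
```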
Release 7.12.0
Improved Korean Entity Extraction and In-Document Co-reference Resolution
REX uses a new statistical model that achieves higher accuracy (about 25% overall error reduction). (APE-1111)
In-document co-reference resolution now recognizes Korean prefixes and suffixes, and will attempt to chain morphological variations of Korean entity mentions. (TEJ-366)
Added features to the REXCmd utility
Plaintext output now pretty-prints the chain ID for each entity mention it returns. Mentions with the same chain ID refer to the same entity. (TEJ-410)
The -context option marks the entities in their original text context, with embedded entity type and chain ID. (TEJ-410)
REXCmd now supports pre-annotated input in JSON-serialized Annotated Text format. (TEJ-455)
Modified the reporting of text offsets for partial regular-expression and gazetteer matches to align with the token boundaries of the tokens that contain the matched text. (TEJ-393)
The REX Tcl implementation of regular expressions now supports characters in the Supplementary Multilingual Plane (SMP) (TEJ-454). Previous releases represented SMP codepoints as two characters each. (TEJ-330)
Provided an EntityExtractor option to instruct the REX statistical processor to ignore the lowercase/uppercase distinction. This feature is currently supported for English, French, and Portuguese. (TEJ-458)
Added the EntityExtractor.createDispatchAnnotator method to allow annotating documents from a predefined set of languages. (TEJ-432)
Added Portuguese regular expressions for temporal and monetary expressions. (TEJ-444)
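The Supplementary Multilingual Plane note above is easier to see with a concrete case. In Java strings, a code point above U+FFFF occupies two UTF-16 code units (a surrogate pair). Whether the "two characters" in earlier REX releases were exactly these surrogates is an assumption here; the sketch only shows the difference between counting code units and counting code points, with hypothetical class and method names.

```java
class SmpCodePoints {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies in the SMP and is written here as its surrogate pair.
        String smp = "\uD83D\uDE00";
        System.out.println(codeUnits(smp) + " code units, " + codePoints(smp) + " code point");
    }
}
```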
Release 7.11.0
Added the Fragment Boundary Detector to enable the extractor to separate entities in text fragments that do not form sentences. (TEJ-117)
Refined the usage pattern for the com.basistech.rosette.rex.REXCmd command-line utility. The JSON output this utility generates has been trimmed to represent the serialization of an AnnotatedText object. (TEJ-374)
Release 2.2.0
Deprecated com.basistech.rosette.rex.EntityCursor. Use com.basistech.rosette.rex.EntityExtractor and com.basistech.rosette.dm.Annotator to extract entities from an input document.
Re-established pattern matcher support for the regular expressions disabled in 2.1.0.
Release 2.1.0
Added a document-level API for extracting entities. Over time, we expect to deprecate EntityCursor (streaming) in favor of EntityExtractor and Annotator (document-level extraction).
Added an API (the EntityExtractor setPostConfidence method) for extracting a confidence floating point value for each entity that REX Java Edition finds. A potential entity is ignored if its confidence score is below the threshold set with the setConfidenceThreshold method. (TEJ-60)
REX Java Edition returns normalized entities from all sources: statistical, pattern matching, and exact matches (gazetteers). (TEJ-288)
We have disabled the pattern matcher regular expressions that do not terminate properly and consume large amounts of CPU time. This change could cause REX to miss some temporal, distance, and long/lat expressions. If your use case requires high recall on these numeric types, please contact analyticssupport@babelstreet.com for assistance on enabling these regular expressions. (TEJ-283)
The REXCmd command-line utility now includes the REX Java Edition version number in the output. (TEJ-261)
Added a shell script (Unix) and .bat file (Windows) to simplify the running of the REXCmd command-line utility. (TEJ-247)
The JSON document generated by the command-line utility is much more verbose than in previous releases. Accordingly, if you are not writing the output to a file, you may want to pipe the output through a JSON parser (e.g., | python -mjson.tool) and concentrate on the EntityMention elements. (TEJ-260)
Added support for Arabic, Simplified Chinese, Traditional Chinese, Korean, and Japanese.
Added support for using the REX Field Training Kit to enhance accuracy when handling a particular category of documents and to return new entity types. (TEJ-220)
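The confidence threshold rule described in this release (an entity is ignored when its score is below the threshold) can be sketched as a simple filter. The Candidate class and applyThreshold method below are hypothetical illustrations, not REX classes, and the sketch assumes a score equal to the threshold is kept.

```java
import java.util.ArrayList;
import java.util.List;

class ConfidenceFilterSketch {
    // Hypothetical stand-in for an extracted entity; not a REX class.
    static final class Candidate {
        final String text;
        final double confidence;

        Candidate(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keeps candidates whose confidence is at or above the threshold;
    // anything below the threshold is ignored.
    static List<Candidate> applyThreshold(List<Candidate> candidates, double threshold) {
        List<Candidate> kept = new ArrayList<>();
        for (Candidate c : candidates) {
            if (c.confidence >= threshold) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Candidate> found = new ArrayList<>();
        found.add(new Candidate("Boston", 0.91));
        found.add(new Candidate("Apple", 0.42));
        for (Candidate c : applyThreshold(found, 0.5)) {
            System.out.println(c.text + " " + c.confidence);
        }
    }
}
```

Raising the threshold trades recall for precision: fewer entities come back, but the ones that do are those the model scored more highly.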
Release 2.0.0
Added support for Dutch, Hebrew, Persian (Western Farsi and Dari), Portuguese, Pashto, and Urdu.
Incorporated improvements to the statistical language models, gazetteers, and regular expressions introduced in the Rosette C++ REX implementation since the release of REX Java Edition 1.1.
Enhanced the command-line utility with new options. (TEJ-212)
Added support for resetting the maximum number of tokens that an entity may include, which defaults to 8. Use the EntityExtractor setMaxEntityTokens(int) method. (TEJ-188)
Release 1.1
Added support for Uppercase English, French, German, Italian, Russian, and Spanish.
Added support for resolving coreferences to the same entity. Use EntityExtractor setResolveNamedEntities(true) to put coreferences to the same entity in an entity chain: see EntityCursor getChainId(). (TEJ-52)
In response to customer feedback, removed IDENTIFIER:NUMBER from the default set of entity types returned by regular expressions. We commented out the IDENTIFIER:NUMBER entries in the regexes.xml files in data/regex/lang/accept, so you can re-activate any of these entries if you wish. (TEJ-171)
Added a public EntityCursor hasNext() method that can be used to determine whether there are any more entities in the result set, without advancing to the next entity. (TEJ-150)
Added the following EntityExtractor methods:
public void setStatisticalModel(LanguageCode, InputStream);
public void addGazetteer(LanguageCode, InputStream, boolean);
public void addGazetteer(LanguageCode, InputStream);
public void addRegularExpressions(LanguageCode, InputStream, boolean);
public void setRedactorWeights(InputStream);
public void addJoinerRules(InputStream);
public void setLicense(InputStream);
These methods enable access to data files placed in a JAR file (perhaps for use in a Hadoop environment). (TEJ-59)
Bugs Fixed
Bug number is followed by a brief bug description.
Fixed in 7.40.0
TEJ-1349 Fixed linker head mention bug
TEJ-1353 Fixed missing entity types in linker results
Fixed in 7.39.1
TEJ-1281 Set log level to Debug instead of Warn when linkEntities and genre don’t agree
TEJ-1319, TEJ-1346, ELK-114 Picked up new TVEC to improve the efficiency of the initial load time
TEJ-1324 Fixed a bug where the salience score was not always returned for entities with pronominal mentions, when requested.
TEJ-1327 Consumed new RBL to fix null pointer exception with pronoun resolver
TEJ-1331 Removed xxx from reported supported languages
APE-1766 Fixed a bug where all entity mentions were not always returned by statistical model
Fixed in 7.38.1
TEJ-1282 Relocated LIBLINEAR and TVEC, tested with rli
TEJ-1283 Fixed cases in which QID was not returned although DBPedia result was available, when using REX’s Kblinker
TEJ-1292 Custom processors made configurable via REXFactoryConfiguration
ELK-82 Fixed an acronym feature issue.
Fixed in 7.34.0
TEJ-1160 Moved ORGANIZATION eng regexes to supplemental directory to improve performance
TEJ-1167 Improved failure message for missing flinx data directory
TEJ-1168 Fixed null pointer exception with long chains
Fixed in 7.31.0
TEJ-1099 Added additional currency symbols to regex
TEJ-1108 Fixed hexadecimal number string incorrectly extracted as product
Fixed in 7.29.0
TEJ-1067 Fixed calculateConfidence configuration bug
TEJ-1020 Added regex for Israeli ID number
TEJ-1054 Cleaned MD5 codes extracted as PRODUCT entity type
TEJ-1049 Fixed case-insensitive text file gazetteer treated as case-sensitive
Fixed in 7.28.1
TEJ-1050 Fixed annotator configuration object being modified while creating.
Fixed in 7.28.0
TEJ-1041 Improved redactor rules for extractor and linking overlaps and indoc-coref to prefer chaining based on linking ID.
TEJ-1039 Enabled emoticon mode for RBL for RosAPI
TEJ-1042 Fixed error handling for requesting salience score while indoc-coref is disabled
Fixed in 7.27.0
TEJ-950 Improved custom processor examples.
TEJ-951 Apply American SSN regex for English only.
TEJ-962 Fixed personal ID entity type in Vietnamese regex.
TEJ-1005 Fixed source and subsource for static user gazetteer.
Fixed in 7.26.4
TEJ-1006 Improved handling of the case where the Arabic prefix is null.
TEJ-1014 Fixed shading of dependencies.
Fixed in 7.26.3
TEJ-971 Fixed a concurrency issue in kb-linker.
TEJ-977 Fixed to return null for confidence from non-statistical entities.
Fixed in 7.26.2
TEJ-952, TEJ-963 Fixed lookup issue in wordclasses
Fixed in 7.25.0
TEJ-693 setStatisticalModel without caseSensitivity assumes case-sensitive
Fixed in 7.23.1
TEJ-813 Relocated more RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.
Fixed in 7.22.0
TEJ-477 REXCmd now emits a usage error when 'info' is used with no info command.
Fixed in 7.21.0
TEJ-755 Relocated RBL classes in REX to prevent version conflicts when using RBL in conjunction with REX.
TEJ-735, TEJ-738 Fixed a bug in which Entity Type Survey didn’t support non-default field training models or customer-defined types.
TEJ-747 Fixed a bug in which REXCmd produced unexpected offsets for files with DOS-style line endings.
TEJ-718 Added Malaysian sample text.
Fixed in 7.20.2
TEJ-717 Minor update to zsm model in 'single' distribution form.
Fixed in 7.20.0
TEJ-694 Added a missing file resource release in addGazetteer.
Fixed in 7.19.3
TEJ-676, TEJ-691 Shaded and relocated all 3rd party dependencies.
TEJ-683 Removed extraneous pom.xml’s from META-INF in distro jar.
Fixed in 7.18.0
SUPPO-569 Fixed incompatibility issues with RBL.
Fixed in 7.17.1
TEJ-659 Fixed support for custom tags in statistical model parser.
APE-1608 Fixed REXCmd’s behavior with case-insensitive models and the -model command line option.
Fixed in 7.17.0
SUPPO-536 Fixed partial regex extraction for Korean.
Fixed in 7.16.0
TEJ-573 REXCmd now ignores a BOM at the beginning of its input file.
TEJ-574 Lookbehind assertions are not supported, and this is now included in the Application Developer Guide.
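For applications that read input files themselves before handing text to an extractor, the BOM handling noted under TEJ-573 can be approximated as below. This is a sketch of the general technique, not the REXCmd implementation; the class and method names are hypothetical, and it assumes the file has already been decoded into a Java String.

```java
class BomSketch {
    // Drops a single leading byte order mark (U+FEFF) if one is present.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String decoded = "\uFEFFJohn Smith lives in Boston.";
        System.out.println(stripBom(decoded));
    }
}
```

Note that removing the BOM shifts every character offset by one relative to the decoded file content, which matters if offsets are later mapped back to the original file.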
Fixed in 7.14.0
TEJ-512 Regex xml parser misses some CDATA sections (URL extraction issues)
Fixed in 7.13.0
TEJ-459, TEJ-507 indoc chaining applied only to PER/LOC/ORG
TEJ-498 Application Developer Guide claims that REX extracts full postal addresses
Fixed in 7.12.1
TEJ-486 EntityAnnotator reuse problems
TEJ-479 RBL dictionary markers in distro
Fixed in 7.12.0
TEJ-460 Partial match regex produces wrong offsets
TEJ-438 ICU dependency not relocated/shaded
TEJ-415 REX requires an RBL license