Babel Street Hosted Services
Release Notes
Release 1.34.0
June 2025
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: We fixed a bug where long words inside long spans of text without punctuation would sometimes be tokenized incorrectly. (ETROG-3716)
Name Similarity /name-similarity
Updated real world ID dictionaries: The real world ID dictionaries have been updated. You may see some differences in matching when using real world IDs. (RLPNC-8170)
1 letter misspellings: We added new parameters to control the token score when there is 1 letter difference between 2 versions of a name, or when the edit distance equals 1. To use, set alternateEditDistanceTokenScorerMechanism to true. The token score will then take the value of the parameter alternateEditDistanceTokenScorerMechanismScore, the default value of which is 0.95. (RLPNC-8227)
Bug fixes:
We fixed a problem that caused the HMM token cache to be invalidated unnecessarily. You may see an improvement in performance. (RLPNC-8183)
We fixed a bug where Simplified Chinese names could be improperly translated during name matching. (RLPNC-8199)
We fixed a problem in Chinese name matching where the match value was not equal to 1. (RLPNC-8199)
We fixed a bug where real world ID values were not properly associated with some Chinese names. You may see an improvement in matching Chinese organization names. (RLPNC-8170)
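As a rough illustration of the 1 letter misspellings parameters described in this section, the Python sketch below computes the edit distance between two name tokens and substitutes a fixed score when that distance is exactly 1. It is not the product's scorer. The function names, the fallback score, and the assumption that the mechanism is off unless set to true are hypothetical; only the two parameter names and the 0.95 default come from the release note.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def token_score(left: str, right: str, params: dict, fallback_score: float) -> float:
    """Hypothetical scorer: fixed score for tokens at edit distance 1.

    fallback_score stands in for whatever score would otherwise be used;
    the release note does not describe how that score is computed.
    """
    if left == right:
        return 1.0
    # Assumption: the mechanism is treated as disabled unless set to true.
    enabled = params.get("alternateEditDistanceTokenScorerMechanism", False)
    if enabled and edit_distance(left, right) == 1:
        return params.get("alternateEditDistanceTokenScorerMechanismScore", 0.95)
    return fallback_score
```

With the mechanism enabled and no explicit score, token_score("smith", "smyth", ...) takes the 0.95 default, while a pair such as "smith" and "smythe" (edit distance 2) keeps the fallback score.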
Name Translation /name-translation
Bug fixes:
We fixed an issue that caused a null pointer exception when translating Chinese names with Latin characters. (RLPNC-8088)
We fixed an issue where English to Chinese translations always had a confidence of 1.0 and the response didn't contain the source and target domains. (RLPNC-8101)
We fixed a Chinese translation error that could lead to an array index out of bounds exception. (RLPNC-8200)
We fixed a bug where Simplified Chinese names could be improperly translated during name matching. (RLPNC-8199)
Record Similarity /record-similarity
Updated real world ID dictionaries: The real world ID dictionaries have been updated. You may see some differences in matching when using real world IDs. (RLPNC-8170)
1 letter misspellings: We added new parameters to control the token score when there is 1 letter difference between 2 versions of a name, or when the edit distance equals 1. To use, set alternateEditDistanceTokenScorerMechanism to true. The token score will then take the value of the parameter alternateEditDistanceTokenScorerMechanismScore, the default value of which is 0.95. (RLPNC-8227)
New date parameter: We added a new parameter, boostSwappedDigits, to control whether dates that are identical apart from two adjacent digits being swapped (e.g. '1958-05-02' and '1958-05-20') are scored higher than other dates with an edit distance of 2. This is set to true by default to match previous behavior. (RLPNC-8114)
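To make the boostSwappedDigits condition concrete, the following Python sketch checks whether two date strings are identical apart from two adjacent digits being swapped. It only illustrates the condition named in the release note; the function name is hypothetical, and how the service actually scores such pairs is not shown here.

```python
def is_adjacent_digit_swap(left: str, right: str) -> bool:
    """True when the strings differ only by two adjacent digits trading places."""
    if len(left) != len(right):
        return False
    # Positions at which the two strings disagree.
    diffs = [i for i, (a, b) in enumerate(zip(left, right)) if a != b]
    if len(diffs) != 2 or diffs[1] - diffs[0] != 1:
        return False
    i, j = diffs
    return (
        left[i].isdigit()
        and left[j].isdigit()
        and left[i] == right[j]
        and left[j] == right[i]
    )
```

For the pair from the release note, '1958-05-02' and '1958-05-20', the check succeeds. A pair such as '1958-05-02' and '1958-05-13' also differs in two adjacent positions but is not a swap, so the check fails.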
Tokenization /tokens
Bug fix: We fixed a bug where long words inside long spans of text without punctuation would sometimes be tokenized incorrectly. (ETROG-3716)
Release 1.33.0
March 2025
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Bug fix: We fixed a bug with the in-document coreference server. In-document coreference chains of entity mentions will now be correct. (TEJ-2655)
Event Extraction /events
Dutch support: Dutch is now supported for event extraction.
Info /info
New name: The /info endpoint now returns "Babel Street Analytics Server" instead of "Rosette".
Name Similarity /name-similarity
Improved Hebrew-English ORG matching: We've improved name matching for organization names containing affixes. (RLPNC-7944)
Example: AL-QAID IN IRAQ vs אלקאעדה עירא
Previously: 0.51
Now: 0.89
Improved Korean name matching: We've improved Korean matching for PERSONS and ORGANIZATIONS by updating the stop word list. (RLPNC-7951)
New parameter: We've added a parameter, maxExpansions, to control the number of phonetically similar terms considered during the first-pass fuzzy matching. Increasing this parameter can improve first-pass results, ensuring that the correct name will be sent to the second pass, but may impact performance. (RLPNC-7967)
Improved stop words: We now support stop word prefixes and stop patterns that contain the forward slash (/) character. This is especially useful for Indian and Malaysian names that include titles which are acronyms, such as A/P, A/L, S/O, and D/O. (RLPNC-7919)
English to Chinese name translation: The English to Chinese name translation is now implemented in Java instead of C++. You may see some differences in translations. (RLPNC-8003)
New character support: Match now supports CJK Unified Ideographs Extension B, which includes rare and historical Chinese characters. This update ensures that characters from U+20000 to U+2A6DF are correctly recognized and processed, improving compatibility with Chinese, Cantonese, Korean, and Japanese data. (RLPNC-7956)
Bug fix: We fixed a bug in Cantonese where the confidence score was always 0.0 if a token had a special character at the end. (RLPNC-7893)
Bug fix: We fixed an issue where left and right names in the explain info could appear in the wrong order. (RLPNC-7922)
Bug fix: We fixed an issue where removing stop words produced improper segmentation results for Chinese, leading to poor match scores. (RLPNC-8036)
Bug fix: We fixed a bug in Chinese name translation where a StringIndexOutOfBoundsException occurred due to incorrect handling of token dictionary matches. (RLPNC-8009)
Bug fix: We fixed a bug with the handling of transliteration schemes in Cantonese to English translations. You can now set the language of origin and language of use to yue and receive the correct translation without error. (RLPNC-7865)
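For the new character support item in this section, the Python sketch below shows one way to check on the client side whether a name contains characters from CJK Unified Ideographs Extension B, using the U+20000 to U+2A6DF range quoted in the release note. The function names are hypothetical and the check is independent of how Match itself processes these characters.

```python
# Range quoted in the release note for CJK Unified Ideographs Extension B.
CJK_EXT_B_FIRST = 0x20000
CJK_EXT_B_LAST = 0x2A6DF


def in_cjk_extension_b(char: str) -> bool:
    """True when a single character falls inside the Extension B range."""
    return CJK_EXT_B_FIRST <= ord(char) <= CJK_EXT_B_LAST


def extension_b_characters(name: str) -> list:
    """Collect the Extension B characters in a name, in order of appearance."""
    # Iterating over a Python str yields whole code points, so characters
    # outside the Basic Multilingual Plane are handled one at a time.
    return [char for char in name if in_cjk_extension_b(char)]
```

A check like this can help identify which records in a data set were affected by the change, since names containing these characters may now match differently.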
Name Translation /name-translation
New output value: The value processedByBabel was added to the response. This is a Boolean value which is set to true when the input value is processed in any way for translation. (RLPNC-7961)
Example: If an input of ABC returns abc, the value will be true.
English to Chinese name translation: The English to Chinese name translation is now implemented in Java instead of C++. You may see some differences in translations. (RLPNC-8003)
Bug fix: We fixed a bug with the handling of transliteration schemes in Cantonese to English translations. You can now set the language of origin and language of use to yue and receive the correct translation without error. (RLPNC-7865)
Bug fix: We fixed a bug in native Chinese name translation where too many results could be returned if a name had multiple segmentations. (RLPNC-8058)
Known issue: When translating from English to Chinese, the source and target domain information (sourceScript, sourceLanguageOfUse, targetLanguage, targetScript, targetScheme) is not returned in the results. The translation is returned. (RLPNC-8101)
Ping /ping
New name: The /ping endpoint now returns "Babel Street Analytics Server at your service" instead of "Rosette at your service".
Release 1.32.0
December 2024
Note
Rosette Cloud has been renamed to Babel Street Hosted Services.
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Morphological Analysis /morphology/{morphoFeature}
Unicode update: Unicode 16.0 is now supported. (ETROG-3694)
Name Similarity /name-similarity
New name: Rosette Name Indexer has been renamed to Babel Street Match.
Improved L337/OCR scorer: We've improved the GeneralizedEditDistanceTokenScorer for L337 and OCR type errors. (RLPNC-7776)
New parameter added: We've added a parameter charactersToAlwaysNormalizeToSpace to define a set of characters to replace with a space when normalizing names and addresses. (RLPNC-7808)
Improved performance: We've improved performance of English and Spanish name matching. (RLPNC-7782)
Gender ignored by L337/OCR scorer: We no longer apply a genderConflictPenalty in the case of L337 matching.
Updated real world ID dictionaries: We've updated the real world ID dictionary. You may see some differences in matches when using real world IDs, especially for Arabic organization names containing numerals. (RLPNC-7642)
Bug fix: We now use the correct value for sourceLanguageOfUse when translating Cantonese names. The yue/hani/jyutping domain is now used instead of zho/hani/hypy. (RLPNC-7837)
Bug fix: An error is no longer thrown when one of the search terms is normalized away. (RLPNC-7870)
Name Translation /name-translation
Bug fix: We now use the correct value for sourceLanguageOfUse when translating Cantonese names. The yue/hani/jyutping domain is now used instead of zho/hani/hypy. (RLPNC-7837)
Tokenization /tokens
Unicode update: Unicode 16.0 is now supported. (ETROG-3694)
Release 1.31.0
October 2024
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data. The QID assigned to some extracted entities may differ from previous versions.
Bug fix: Fixed a bug where REX would accept indoc-coref-server entity mentions which violated sentence boundaries. REX will now reject mentions from indoc-coref-server that are not contained within document sentences. (TEJ-2451)
Morphological Analysis /morphology/{morphoFeature}
Bug fix: Trailing decimal points in Chinese are no longer treated as part of a decimal fraction. When “点” or “點” ends a number, it is no longer segmented as part of the number. (ETROG-3680)
Name Similarity /name-similarity
l337/OCR error token scorer added: We've added a new feature to handle leetspeak (l337) and OCR errors. The token scorer contains a series of rules files defining symbol substitutions, along with the penalty for the substitution.
Improved explainability: Explain info now contains alternative Latin readings (alternativeLatnData) for translated names. This field displays the alternative Latin readings of translated names from the input-info. (RLPNC-7609)
Example:
{
  "leftInput": {
    "data": "温家宝",
    "normalizedData": "温家宝",
    "latnData": "wen jiabao",
    "script": "Hani",
    "languageOfUse": "CHINESE",
    "languageOfOrigin": "UNKNOWN",
    "entityType": "PERSON",
    "alternativeLatnData": [ "on kahou", "on ka-po" ]
  },
  "rightInput": {
    "data": "عبد الله بن عبد الرحمن بن جبرين",
    "normalizedData": "عبد الله بن عبد الرحمن بن جبرين",
    "latnData": "abd allah bin abd alrahman bin jabrin",
    "script": "Arab",
    "languageOfUse": "ARABIC",
    "languageOfOrigin": "UNKNOWN",
    "entityType": "PERSON",
    "alternativeLatnData": [
      "abid allah bin abd alrahman bin jabrin",
      "abd allah bin abid alrahman bin jabrin",
      "abd allah bunn abd alrahman bin jabrin",
      "abd allah bin abd alrahman bunn jabrin"
    ]
  },
  "finalScore": 0.0
}
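As a small sketch of how a client might read the new field, the Python below parses a trimmed copy of the explain info example and collects the primary Latin reading together with any alternatives. The helper name is hypothetical, and treating alternativeLatnData as optional is an assumption made for safety rather than documented behavior.

```python
import json

# Trimmed copy of the explain info example, keeping only the fields used here.
explain_info = json.loads("""
{
  "leftInput": {
    "data": "温家宝",
    "latnData": "wen jiabao",
    "alternativeLatnData": ["on kahou", "on ka-po"]
  },
  "finalScore": 0.0
}
""")


def latin_readings(input_info: dict) -> list:
    """Return the primary Latin reading followed by any alternative readings."""
    readings = [input_info["latnData"]]
    # Assumption: the field may be absent when there are no alternatives.
    readings.extend(input_info.get("alternativeLatnData", []))
    return readings


left_readings = latin_readings(explain_info["leftInput"])
```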
Improved unsupported language scoring: We added a new parameter editDistanceFalloff to control scoring of names in unsupported languages. The parameter is set to 0 (disabled) by default; a value between 0 and 1 will reduce the dropoff in edit distance. When set to a value between 0 and 1, you may now find matches in scripts and languages that failed to match in previous releases. (RLPNC-7481)
Improved performance: Significant improvement of the HMM used in the high-recall phase of name matching. (RLPNC-7686, RLPNC-7679)
Improved performance: We've added a parameter, disableHMMMatching, which disables the second pass HMM matching. Enabling this parameter greatly increases performance, but at the cost of reduced accuracy. By default, this parameter is off. (RLPNC-7605, RLPNC-7604, RLPNC-7603)
Faster Arabic translation: We've added a new parameter, tokenByTokenArabLatnFolkTranslation, to enable faster translation of Arabic-script, Arabic-language names to Latin-script English names with the Folk transliteration scheme. This speeds up translation by processing the name token-by-token. (RLPNC-7719)
New parameters added: You can now boost a specified number of tokens from the left, right, or both ends. This can be useful when scoring multi-token given names or surnames. The provided value sets the number of tokens from the left and/or right to give a boost. This feature is supported by adding the parameters leftBoostTokens, rightBoostTokens, and bothEndsBoostTokens to the parameter_defs.yaml file. (RLPNC-7606)
Improved Hebrew transliteration: We've improved the FOLK transliteration scheme for Hebrew-English. (RLPNC-7623)
Improved Persian/English matching: We've improved accuracy of Persian-English matching through tuning of parameters. (RLPNC-7631)
Persian organization improvements: We've added stop words for organizations for Persian. (RLPNC-6890)
Chinese migration: Chinese name translation and matching is now implemented completely in Java, instead of C++. You may notice some small fluctuations in Chinese name matching. (RLPNC-6768)
Improved token span definitions: Token spans now attempt to include boundary characters removed from the original input during normalization. (RLPNC-7730)
Gender added to pairwise matching: We've added the gender property as a request parameter for /name-similarity. (RLPNC-7451)
Bug fix: We fixed a bug where the seq2seq neural model could not be loaded properly. The model now loads. (RLPNC-7754)
Bug fix: We fixed a bug where if you set haniFourCornerCodeMismatchPenalty globally, it didn't work for any profiles. Users can now either set haniFourCornerCodeMismatchPenalty globally (enabling it for all Hani matching) or use the language profiles (i.e. specifically for zho_zho). (RLPNC-7717)
Bug fix: We removed some erroneous overrides. (RLPNC-7758)
Bug fix: We fixed a memory leak in RNT that was causing slowdowns in Rosette Cloud and Rosette Server. (RLPNC-7703)
Name Translation /name-translation
Faster Arabic translation: We've added a new parameter, tokenByTokenArabLatnFolkTranslation, to enable faster translation of Arabic-script, Arabic-language names to Latin-script English names with the Folk transliteration scheme. This speeds up translation by processing the name token-by-token. (RLPNC-7719)
Improved Hebrew transliteration: We've improved the FOLK transliteration scheme for Hebrew-English. (RLPNC-7623)
Bug fix: We fixed a memory leak in RNT that was causing slowdowns in Rosette Cloud and Rosette Server. (RLPNC-7703)
Record Similarity /record-similarity
Improved explainability in record matching: Request-level information is now included in the information block of the response. Messages concerning the mapping or properties of a request are now part of the info field of the response. (RLPNC-7653)
Improved error handling in record matching: Record matching will now throw an exception if a mapping does not contain any valid fields. (RLPNC-7653)
Tokenization /tokens
Bug fix: Trailing decimal points in Chinese are no longer treated as part of a decimal fraction. When “点” or “點” ends a number, it is no longer segmented as part of the number. (ETROG-3680)
Release 1.30.0
July 2024
Names endpoints (/name-similarity, /name-deduplication, /name-translation, /record-similarity) will now return a 400 error instead of a 500 error when called with the output=rosette parameter. This parameter is only supported for document (non-names) endpoints.
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data for the provided knowledge base. The QID assigned to some extracted entities may differ from previous versions. You should see large improvements in entity linking. (RWIKI-454, RWIKI-507)
We've made some changes as to how some entity types are linked to the provided knowledge base:
PERSON: Now only real humans are linked as person entities; fictional, imaginary, and mythical humans are not.
PRODUCT: Product entities now exclude most creative works.
Linking improvements: We've changed the conflict resolution algorithm to one which tries to link using the longest possible mentions. You should see better linking, especially in cases where the mention of a popular entity is embedded within the mention of interest. (RWIKI-404)
Example: I studied at the University of Chicago
Previously linked: Chicago
Now linked: University of Chicago
Linking improvements: We've added a heuristic to help stop generic, unnamed entities, such as "mortgage law", from being linked. (RWIKI-475)
New endpoint for supported languages: We've added a new endpoint, entities/indoc-coref-server/supported-languages, which returns the supported languages for in-document coreference. (WS-3176)
Bug fix: We fixed a bug when running REX with multiple threads where stop word files were being loaded and not closed, causing the system to run out of file handles and memory. (TEJ-2347)
Bug fix: We fixed a bug where Chinese characters were normalized when looking up knowledge base artifacts for linking in Japanese. You should see improved entity linking in Japanese. (RWIKI-406)
Bug fix: English terms for half(s) and quarter(s) were removed from the Russian (RUS) and German (DEU) regexes for time. (TEJ-1817)
Bug fix: Aliases are no longer filtered by low normalized link probability; it is now possible to link entities where abbreviations like "MIT", "LA", "WHO", "UN" are the mention text. (RWIKI-389)
Bug fix: We fixed a NullPointerException that occurred while writing a log entry when processing empty tokens. (TEJ-2361)
Bug fix: We fixed a bug where news media, such as television programs, were typed as ORG. (RWIKI-483).
Known issue: If the indoc-coref-server is enabled, errors may be returned from the /sentiment, /topics, and /relationships endpoints if multiple entity mentions are returned for the same entity. (RQA-1345)
Morphological Analysis /morphology/{morphoFeature}
CLA lexicon: Two terms have been added to the CLA lexicon: 喷码机 inkjet printer and 管理器 manager (in the context of software) (ETROG-3678)
Improved Readings: Readings are now returned for numeric words in Chinese and Japanese when tokenizerType is set to SPACELESS_LEXICAL. (ETROG-3684)
Bug fix: When tokenizerType is set to SPACELESS_LEXICAL, Japanese tokens for verbs in lemma form have had their readings fixed to cover the entire token. (ETROG-3640)
Example: Input: 食べる
Previous Reading: た
Current Reading: たべる
Bug fix: When tokenizerType is set to SPACELESS_LEXICAL, Japanese lemmatization has been corrected for numeric tokens containing both decimal points and multiplier characters.
Example: Input: 2.5亿
Previous Lemma: 2500000000
Current Lemma: 250000000
Bug fix: We fixed a bug where a token whose surface form was the empty string could be returned when fragmentBoundaryDetection was set to true (the default). (ETROG-3686)
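The corrected lemma in the numeric example from this section (2.5亿) follows from plain arithmetic: 亿 stands for 10^8, so 2.5 × 10^8 = 250000000, and the earlier value of 2500000000 was ten times too large. The Python sketch below reproduces that calculation with exact decimal arithmetic. It is only a worked example; the function name and the small multiplier table are illustrative and say nothing about how the tokenizer is implemented.

```python
from decimal import Decimal

# Illustrative multiplier characters: 万 is 10^4 and 亿 is 10^8.
MULTIPLIERS = {
    "万": Decimal(10) ** 4,
    "亿": Decimal(10) ** 8,
}


def numeric_lemma(token: str) -> str:
    """Expand a token such as '2.5亿' into its plain integer form."""
    multiplier = MULTIPLIERS[token[-1]]
    # Decimal keeps the multiplication exact, avoiding float rounding.
    value = Decimal(token[:-1]) * multiplier
    return str(int(value))


lemma = numeric_lemma("2.5亿")
```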
Name Similarity /name-similarity
Japanese translation improvements: We've improved Japanese translation by updating the custom reading dictionary (RLPNC-7539).
Hebrew translation overrides: Added additional overrides for Hebrew translation. (RLPNC-7471)
Non-Latin numeric characters: Numeric characters in certain languages are now normalized to their Latin-script counterparts. Supported languages currently include Thai (RLPNC-7562), Arabic, Burmese, Pashto (RLPNC-7564), Persian (including Iranian and Afghan Persian), Urdu, and Khmer (RLPNC-7565).
Pashto organization improvements: We've added stop words for organizations for Pashto. (RLPNC-6889)
Spanish organization improvements: We've added stop words for organizations for Spanish (RLPNC-6893)
Bug fix: We fixed a bug with cross-entity-type matching: cross-entity-type match scoring is now commutative. As a result of this, cross-entity-type matching will now ignore entity-type-specific parameters and overrides. (RLPNC-7485).
Bug fix: We fixed a bug where khm-khm was returned as a supported language pair for name translation. (RLPNC-7556)
Bug fix: We fixed a bug that caused the left and right input fields in the explain info to swap places when a Japanese organization name was matched against a single character that is normalized away (like '*') and the language of use was not defined on that side. (RLPNC-7554)
Name Translation /name-translation
Japanese translation improvements: We've improved Japanese translation by updating the custom reading dictionary (RLPNC-7539).
Hebrew translation overrides: Added additional overrides for Hebrew translation. (RLPNC-7471)
Bug fix: We fixed a bug where khm-khm was returned as a supported language pair for name translation. (RLPNC-7556)
Record Similarity /record-similarity
Note
This version of the /record-similarity endpoint is not backward compatible with earlier versions of the Java and C# bindings.
Fielded dates supported: We've added support for fielded dates. (RLPNC-7520)
Fielded addresses supported: We've added support for fielded addresses. (RLPNC-7519)
Blank fields supported: We've added support for specifying a scoreIfNull for fields in record similarity. You can now specify a value for when a field is missing in a record. (RLPNC-7516, RLPNC-7517)
Parameter support improved: We've added support for specifying either a parameter universe or a mapping of parameter names to parameter values. (RLPNC-7497)
Validation improved: We've improved validation of field mappings and field weights. (RLPNC-7545)
Info messages added: Record matching responses now return info messages when default property values are used. (RLPNC-7509)
Error reporting improved: Fields not included in the mapping or fields with unknown types no longer cause an error in record matching. Instead, this information is returned in the "info" block. A record with only non-included fields or fields with unknown types will lead to an error in the response. (RLPNC-7512, RLPNC-7514)
Partial records supported: Record similarity now supports partial request success if some record pairs contain unmapped or unknown fields, or encounter other scoring errors. (RLPNC-7502)
More records and fields supported: We've removed hard limits on the number of records or mapping fields in record-similarity requests (RLPNC-7601)
Relationship Extraction /relationships
Known issue: If the indoc-coref-server is enabled, errors may be returned from the /sentiment, /topics, and /relationships endpoints if multiple entity mentions are returned for the same entity. (RQA-1345)
Sentiment Analysis /sentiment
Known issue: If the indoc-coref-server is enabled, errors may be returned from the /sentiment, /topics, and /relationships endpoints if multiple entity mentions are returned for the same entity. (RQA-1345)
Tokenization /tokens
Bug fix: We fixed a bug where the Chinese word “星期四” would be tokenized incorrectly in certain contexts. (ETROG-3582)
Topic Extraction /topics
Known issue: If the indoc-coref-server is enabled, errors may be returned from the /sentiment, /topics, and /relationships endpoints if multiple entity mentions are returned for the same entity. (RQA-1345)
Release 1.29.0
April 2024
Address Similarity /address-similarity
Bug fix: Fixed the error being returned when matching addresses. Only English and Chinese are supported for address matching; RNI will now throw an unsupported language exception when matching non-English, non-Chinese addresses if allLanguageSupport is disabled. (RLPNC-7416)
Entity Extraction and Linking /entities
Indoc Coreference: We've added a server to provide in-document coreference (indoc coref). With indoc coref enabled, all entity mentions, including pronouns, titles, and other references to an entity, are returned in the ADM output. To enable indoc coref, set the option useIndocServer to true. By default, indoc coref is disabled. (TEJ-2244)
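A minimal sketch of a request body that turns the option on might look like the Python below. It assumes that endpoint options travel in an options object next to the content field of the JSON body; that layout and the helper name are assumptions for illustration, so check the API reference for the exact request shape before relying on it.

```python
import json


def build_entities_request(content: str, use_indoc_server: bool = False) -> str:
    """Serialize a hypothetical /entities request body with the indoc coref option."""
    body = {
        "content": content,
        # Assumed layout: per-request options nested under "options".
        "options": {"useIndocServer": use_indoc_server},
    }
    return json.dumps(body, ensure_ascii=False)


request_body = build_entities_request("Ada Lovelace wrote notes. She published them.", True)
```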
Event Extractor /events
New event extractor: The event extractor analyzes unstructured text and extracts event mentions and the roles (event mentions) which add detail to the event. We've included two simple event models, travel and meet, to demonstrate how events work.
Morphological Analysis /morphology/{morphoFeature}
Unicode update: Unicode 15.1 is now supported. (ETROG-3595)
Bug fix: Upper case input text is now supported. Previously, the endpoint would send an error message: Language uen not supported. (WS-3163)
Name Similarity /name-similarity
Malay support expanded: We have improved Malay name matching by expanding the stop word list. (RLPNC-7175, RLPNC-7176)
Hebrew improved: We have improved name matching and translation for Hebrew by expanding translation override lists. (RLPNC-7234)
Explain info improved:
Sub-elements are now ordered consistently and provide additional detail for any given pairwise match. (RLPNC-7293)
All date matches now return explain info about the parsed date fields, and report the “time distance” for time distance and time proximity matches. (RLPNC-7309)
All address matches now return explain info about tokenization and the final score for each address field to address field match. (RLPNC-7292)
Bug fix: Fixed a bug in which stop words were not being applied properly for Greek. (RLPNC-7144)
Bug fix: Names in Han script with unknown language and unknown language of origin now give appropriate Japanese, Chinese, and Korean readings. (RLPNC-7367)
Bug fix: Fixed token tagging in names that end with a suffix. (RLPNC-7417)
Bug fix: Names that get normalized to empty now return an empty list of Real World IDs. (RLPNC-7202)
Bug fix: Overrides are no longer considered for the gender penalty. (RLPNC-7346)
Bug fix: Confidence scores for fullname overrides are now correctly calculated when at least one token has a confidence score specified. (RLPNC-7456)
Name Translation /name-translation
Multiple translations: We've added a parameter, maximumResults, to return multiple translations along with their confidence scores. The default is to return a single translation. (RLPNC-7350)
Hebrew improved: We have improved name matching and translation for Hebrew by expanding translation override lists. (RLPNC-7234)
Record Similarity /record-similarity
Record Similarity: We've added a new endpoint to compare two lists of records and return a similarity score for each pair. Each record can contain one to five fields of mixed data types. (RLPNC-7372)
Semantic Similarity /semantics/{semanticsFeature}
New embeddings for French and Italian: The GEN_2 embeddings (originally released in June 2023) are now available for French and Italian. These embeddings provide more accurate results and are debiased compared to the previous embeddings. You may see differences in returned values for these languages. To use the previous embeddings, set embeddingsMode to GEN_1. (RD-2632)
Korean embeddings: North Korean and South Korean embeddings are no longer distinguished by default. The default embedding mode (GEN_2) treats them the same. If you need to distinguish between North Korean and South Korean embeddings, set embeddingsMode to GEN_1.
Sentence Tagging /sentences
Bug fix: Upper case input text is now supported. Previously, the endpoint would send an error message: Language uen not supported. (WS-3163)
Tokenization /tokens
Unicode update: Unicode 15.1 is now supported. (ETROG-3595)
Bug fix: Upper case input text is now supported. Previously, the endpoint would send an error message: Language uen not supported. (WS-3163)
Release 1.28.0
January 2024
Note
Header requirement: All POST calls must include Content-Type: application/json in the request header.
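The header requirement can be met with any HTTP client. As one hedged sketch, the Python below builds (but does not send) a POST request with the standard library and sets the required Content-Type header. The URL is a placeholder rather than a real service address, the helper name is hypothetical, and authentication headers are left out because they are not covered by this note.

```python
import json
import urllib.request


def build_post_request(url: str, payload: dict) -> urllib.request.Request:
    """Create a POST request whose body is JSON and whose header says so."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL for illustration only; nothing is sent over the network here.
request = build_post_request(
    "https://api.example.invalid/rest/v1/tokens",
    {"content": "Hello world"},
)
```

Sending the request would be a separate step, for example with urllib.request.urlopen(request), once the real endpoint URL and credentials are in place.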
Entity Extraction and Linking /entities
Bug fix: We've added a reject gazetteer to ensure that the string USA, Canada is extracted correctly as 2 location entities: USA and Canada. This is active by default. (TEJ-2114)
Morphological Analysis /morphology/{morphoFeature}
Japanese improvements: We have improved and augmented the Japanese lexicon. (ETROG-3532, 3581, 3535, 3668)
Name Deduplication /name-deduplication
Cyrillic support expanded: Extended Cyrillic characters are now supported. This will improve performance for non-Russian, Cyrillic languages. (RLPNC-7236)
Name Similarity /name-similarity
Added parameter hmmNormalizationAlternative: This parameter adjusts normalization for more accurate HMM match scores in certain languages. It is available for Russian, Hebrew, Korean, Japanese, Arabic, and Greek. When this parameter is enabled, HMM scores may be lowered. It is disabled for all languages by default. This can lower the probability of names being translated, resulting in them being transliterated instead, which may be more accurate for names in some languages. (RLPNC-7192)
Cyrillic support expanded: Extended Cyrillic characters are now supported. This will improve performance for non-Russian, Cyrillic languages. (RLPNC-7236)
Name Translation /name-translation
Source language returned: The response now always includes sourceLanguageOfUse and sourceScript. If these values are not included in the request, they are the values determined by the endpoint. (RLPNC-7170)
Cyrillic support expanded: Extended Cyrillic characters are now supported. This will improve performance for non-Russian, Cyrillic languages. (RLPNC-7236)
Tokenization /tokens
Japanese improvements: We have improved and augmented the Japanese lexicon. (ETROG-3532, 3581, 3535, 3668)
Release 1.27.0
October 2023
Morphological Analysis /morphology/{morphoFeature}
Note
The default tokenizerType for Chinese and Japanese is spaceless_lexical.
Expanded Chinese lexicon: We've expanded the lexicon of multi-character Chinese surnames when tokenizerType is set to spaceless_lexical. (ETROG-3616)
Expanded Japanese lexicon: We have expanded the Japanese lexicon that is used when tokenizerType is set to spaceless_lexical. (ETROG-3632)
Improved support for Chinese readings when tokenizerType is set to spaceless_lexical:
Readings are merged into a single reading if the readings become the same string after tone mark removal. (ETROG-3625)
Chinese readings are returned in a list. Previously, a token with multiple possible readings was returned as a single string with brackets and semicolons. (ETROG-3626)
Example: "蔭權"
Previous readings returned: "【yīn;yìn】quán"
Readings now returned: "yīnquán" and "yìnquán"
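If you have stored readings in the previous bracketed form, the relationship between the two formats in the example above can be sketched in Python as follows. The function expands each bracketed group of semicolon-separated alternatives into separate readings. The function name is hypothetical, and the handling of strings with more than one bracketed group is an assumption extrapolated from the single example in this note.

```python
import itertools
import re

# One bracketed group of alternatives, for example 【yīn;yìn】.
BRACKET_GROUP = re.compile(r"【([^】]*)】")


def expand_readings(legacy: str) -> list:
    """Expand a legacy bracketed reading string into a list of readings."""
    # re.split with one capture group alternates literal text (even indexes)
    # with the captured alternatives (odd indexes).
    parts = BRACKET_GROUP.split(legacy)
    choices = [
        part.split(";") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # Assumption: multiple groups combine as a Cartesian product.
    return ["".join(combo) for combo in itertools.product(*choices)]


readings = expand_readings("【yīn;yìn】quán")
```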
Bug fix: We fixed a bug where an ArrayIndexOutOfBoundsException occurred when the Chinese dictionaries produced more than 6 matches and tokenizerType was set to spaceless_lexical. (ETROG-3635)
Bug fix: When Chinese readings are constructed by character and tokenizerType is set to spaceless_lexical, an apostrophe is now inserted before pinyin syllables that start with "a", "e", or "o" and are not the first syllable. (ETROG-3637)
Bug fix: We fixed a bug in the UPT-16 conversion where the part of speech of some Japanese particles was not tagged correctly. The particles are now tagged correctly as ADP. (ETROG-3526)
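The apostrophe rule described in this section can be illustrated with a short Python sketch that joins pinyin syllables and inserts an apostrophe before any non-initial syllable beginning with "a", "e", or "o", including tone-marked forms. This is an independent illustration of the rule, not the tokenizer's code; the function name and the assumption that syllables arrive as a list of non-empty strings are hypothetical.

```python
import unicodedata


def join_pinyin(syllables: list) -> str:
    """Join pinyin syllables, adding apostrophes where the rule requires them."""
    joined = ""
    for index, syllable in enumerate(syllables):
        # Decompose so that a tone-marked vowel such as "ā" starts with plain "a".
        first_letter = unicodedata.normalize("NFD", syllable)[0].lower()
        if index > 0 and first_letter in ("a", "e", "o"):
            joined += "'"
        joined += syllable
    return joined
```

For instance, the syllables "xī" and "ān" join as "xī'ān", while "ān" followed by "quán" needs no apostrophe because the vowel-initial syllable comes first.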
Name Similarity /name-similarity
Malay support added: Added support for Malay-English and Malay-Malay name matching. (RLPNC-7079)
Release 1.26.0
July 2023
Address Similarity /address-similarity
Known issue: When performing address matching, the setting allLanguageSupport must be set to true (the default value). If it is set to false, an unsupported language exception will be thrown, regardless of the language of the address.
Entity Extraction and Linking /entities
Bug fix: The parameter regexCurrencySplit has been fixed. When set to true, currency values will now extract into two entity types: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE instead of IDENTIFIER:MONEY. (TEJ-1960)
Name Similarity /name-similarity
Improved organization name matching: The stop word list for organization entities has been expanded in the following languages (RLPNC-7025):
Turkish
Thai
Portuguese
Russian
Korean
Japanese
Italian
Hungarian
German
French
English
Greek
Arabic
Semantic Similarity /semantics/{semanticsFeature}
New embeddings: We've replaced the embeddings used by the /semantics/similar and /semantics/vector endpoints. The new embeddings (GEN_2) provide more accurate results and are debiased compared to the previous embeddings. You may see some differences in returned values. The new embeddings are the default. French and Italian continue to use the original GEN_1 embeddings. To use the previous embeddings, set embeddingsMode to GEN_1. (RD-2575)
Release 1.25.1
March 2023
Entity Extraction and Linking /entities
Bug fix: The deprecated genre parameter is now allowed in document requests.
Release 1.25.0
March 2023
Morphological Analysis /morphology/{morphoFeature}
We've improved the time it takes to tokenize extremely long (> 10K characters) Japanese sentences. (ETROG-3602)
Name Similarity /name-similarity
Turkish support added: Added support for Turkish-English and Turkish-Turkish name matching. We have also added person and organization overrides, stopwords, and language detection to improve matching in Turkish. (RLPNC-6499)
Improved person name matching: RNI-RNT now has the ability to detect given names and surnames in Latin script when the name is of English origin. When the enableAdditionalOnomastics parameter is true, the gender mismatch penalty is only applied to the detected given name, as opposed to the first name token in a query. (RLPNC-6719, RLPNC-6720)
Improved Arabic person name matching: The new TRAILING_PATRONYMIC_DELETION match phenomenon provides improved scores for matches which contain a deletion that is caused by truncation of a patronymic. The score of this deletion is controlled by the trailingPatronymicDeletionScore parameter. This only applies to Latin script names of Arabic origin when enableAdditionalOnomastics is true. (RLPNC-6756)
Name Translation /name-translation
Russian support migrated from C++ to Java: Translations from Russian to English may differ slightly. (RLPNC-6764)
Tokenization /tokens
We've improved the time it takes to tokenize extremely long (> 10K characters) Japanese sentences. (ETROG-3602)
Release 1.24.1
January 2023
Address Similarity /address-similarity
Improved Chinese address matching: We've expanded the list of Chinese stop words for addresses. (RLPNC-6587)
Entity Extraction and Linking /entities
Wikidata refreshed: We've updated the knowledge base data for the provided linking knowledge base. The QID assigned to some extracted entities may differ from previous versions. (RWIki-119, ELK-274, ELK-276)
New currency regex: We've introduced a new option, regexCurrencySplit, that, when set to true, will attempt to split entities of type IDENTIFIER:MONEY extracted with the regex engine into two new entities: IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. These two new types represent the amount of the currency (50,000) and the currency type ($), respectively. By default, regexCurrencySplit is set to false. (TEJ-1792)
Tagalog support: We've added case-insensitive NER support for Tagalog. Previously we released a case-sensitive model and we've now added the case-insensitive model as well. (TEJ-1858)
Parameter removed: We've removed the deprecated genre extraction option. This option was used to turn the linker on, which has been, and will still be, available through the linkEntities option. The genre option is no longer available in the rex-factory-config.yaml file, as an option in the call, or in the Rosette API bindings. (TEJ-1855)
Morphological Analysis /morphology/{morphoFeature}
Ukrainian support added: Tokenization, sentence boundary detection, segmentation user dictionaries, and many-to-one normalization dictionaries are supported for Ukrainian. (ETROG-3594)
Improved part of speech tags: Language-neutral tokens (numbers, symbols, and punctuation) now get part of speech tags in Indonesian, Standard Malay, and Tagalog. (ETROG-3574)
Emoji support: Emoji 15.0 is now supported. (ETROG-3577)
Bug fix: The Japanese POS tag VN (verbal noun) is now mapped to the UPT-16 POS tag NOUN. It was previously mapped to VERB. (ETROG-3583)
Name Similarity /name-similarity
Improved Japanese organization matching: カンパニー (company) added to the Japanese organization stopwords list. (RLPNC-6545)
Improved Chinese organization matching: We've expanded the list of Chinese stop words for organizations. (RLPNC-6615)
Improved name matching results: When no entityType is specified, the type PERSON will be applied. Previously, the type NONE was applied. (RLPNC-6576)
Bug fix: Horizontal tabs are now removed as part of normalization in English. (RLPNC-6541)
Bug fix: Control characters are now removed from Arabic names before matching. (RLPNC-6543)
Bug fix: Fixed a case where unexpected name inputs could lead to a null pointer exception. (RLPNC-6634)
Bug fix: Fixed an issue where name-similarity could return match scores greater than 1. (RLPNC-6595)
Sentence Tagging /sentences
Ukrainian support added: Sentence boundary detection is now supported for Ukrainian. (ETROG-3594)
Tokenization /tokens
Ukrainian support added: Tokenization is now supported for Ukrainian. (ETROG-3594)
Release 1.23.0
September 2022
Entity Extraction and Linking /entities
Tagalog (tgl) support: We've added Tagalog to our list of languages. The following processors are supported: gazetteer, regex, statistical NER, linking. (TEJ-1812, TEJ-1822, TEJ-1785, TEJ-1786)
New linking option: We've added a new option for entity linking. When linkMentionMode is set to entities, the linker will attempt to link the entities extracted by other processors (regex, gazetteers, and the statistical processor) instead of using its own processor to extract entity candidates. Depending on your data, this may provide higher accuracy and speed. (TEJ-1806)
Bug fix: An exception is no longer emitted when token normalization produces an empty string. (TEJ-1803)
Bug fix: When looking for candidate mentions in text, if there is an overlap between these mentions, the linker now resolves the longest spanning mention before disambiguation. (ELK-277)
Morphological Analysis /morphology/{morphoFeature}
Tagalog (tgl) support:
Part of speech (POS) tagging in Tagalog is now supported. (ETROG-3559)
Lemmatization for Tagalog is now supported. (ETROG-3570)
The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Indonesian (ind) support: RBL now supports lemmatization for Indonesian, which is the standardized form of Malay spoken in Indonesia. (ETROG-3563)
Standard Malay (zsm) support: RBL now supports lemmatization for Standard Malay, the standardized form of Malay spoken in Malaysia. (ETROG-3563)
Bug fix: We fixed a bug in Russian where certain uncommon consonant–vowel sequences in words in the lexicon were incorrectly replaced with more common sequences with different vowel letters. (ETROG-3541)
Example: брошюра
Previously: брошура
Now: брошюра
Name Similarity /name-similarity
Complete CJK Ext A support: We now have full support of CJK Unified Ideographs Extension A. (RLPNC-6324)
Improved Spanish name matching: We have improved Spanish surname detection. (RLPNC-6294)
Improved Japanese location matching: All prefectures of Japan are now included in the override list. (RLPNC-6326)
Example: 北海道 vs. Hokkaido Prefecture
Previously: 0.7246
Now: 0.99
Bug fix: The string "luiz arlos da silva bueno" is no longer in the Greek, English, and Vietnamese stop word lists. (RLPNC-6358)
Name Translation /name-translation
Complete CJK Ext A support: We now have full support of CJK Unified Ideographs Extension A. (RLPNC-6324)
Sentence Tagging /sentences
Tagalog (tgl) support: The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Tokenization /tokens
Tagalog (tgl) support: The Tagalog sentence-breaker now recognizes certain abbreviations that end with periods and doesn’t break sentences after them. The tokenizer keeps the period in the token with the rest of the abbreviation. (ETROG-3573)
Release 1.22.0
June 2022
Address Similarity /address-similarity
Improved Chinese - English address matching: We expanded overrides for ethnic minority regions, particularly from Xinjiang, Tibet, and Inner Mongolia. (RLPNC-6077)
Entity Extraction and Linking /entities
Bug fix: An error is no longer generated when there are null prefixes in Arabic morphological analyses. (TEJ-1765)
Morphological Analysis /morphology/{morphoFeature}
Spaceless Korean tokenizer: We've added an option to select the spaceless Korean tokenizer in the call. The default tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. To use this tokenizer, set the option modelType to DNN. Previously, this option was not available on a per-call basis. (ETROG-3513)
Indonesian support added: Rosette now supports part of speech (POS) tagging in Indonesian. (ETROG-3543)
Malay (standard) support added: Rosette now supports part of speech (POS) tagging in Malay (standard). (ETROG-3545)
Russian lexicon improved: We've added many words related to computer technology to the Russian lexicon. (ETROG-3523, ETROG-3538)
Bug fix: In Japanese, negative forms of Ichidan verbs written all in Hiragana are no longer lemmatized to end with “なう”. (ETROG-3534)
Example: Input: くれない
Previously: lemmatized to くれなう
Now: lemmatized to くれる
Name Similarity /name-similarity
Khmer - English added: Khmer - Khmer and Khmer - English are now supported name matching pairs. (RLPNC-5712)
Improved language detection: We've improved language detection for languages that use Han characters (Chinese, Japanese, Korean). (RLPNC-6059)
Improved ORG matching: We've expanded the list of known organization names in our real world ID tables to improve ORG matching in Arabic (ara), Burmese (mya), Chinese (zho), French (fra), German (deu), Greek (ell), Hebrew (heb), Hungarian (hun), Italian (ita), Japanese (jpn), Korean (kor), Portuguese (por), Russian (rus), Spanish (spa), Thai (tha), and Vietnamese (vie). (RLPNC-6090)
Improved Chinese - English ORG matching: We added override mappings for Chinese numerals in Hanzi to Arabic numbers from zero through twenty-one. (RLPNC-6028)
Improved matching of Spanish names and names of Spanish origin: Name similarity now has a deeper understanding of Spanish surnames. For example: "JOSE JORGE RIOS TORRES" now gets a higher score when matched against "JOSE RIOS" than it does when matched against "JOSE TORRES", since "RIOS" is recognized as the primary surname. (RLPNC-6037)
The following parameters impact how Spanish names are matched:
The new Boolean parameter enableAdditionalOnomastics controls whether to assign a TokenType to allow for multiple Spanish surnames. When set to true, each token is assigned a TokenType, where the TokenType is one of: UNKNOWN, SURNAME, or SURNAME2. It is currently set to true for the spa_eng_PERSON, spa_spa_PERSON and eng:spa_eng_PERSON profiles.
The preexisting parameter surnameTokenTypeWeight now applies only to the TokenType.SURNAME tokens. Its default value was changed from 1 to 1.2.
The new parameter secondarySurnameTokenTypeWeight applies to TokenType.SURNAME2 tokens. Its default value is 0.6.
The new parameter crossSurnameMatchPenalty is applied (by simple multiplication) when a TokenType.SURNAME token is scored against a TokenType.SURNAME2 token. Its default value is 0.75.
Example: Pablo Emilio Escobar Gaviria vs. Pablo Escobar
Previously: 0.7945
Now: 0.8309
Example: Pablo Emilio Escobar Gaviria vs. Emilio Gaviria
Previously: 0.7999
Now: 0.7365
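The Spanish surname parameters can be pictured with a short Python sketch. Only the default values and the "simple multiplication" of crossSurnameMatchPenalty come from the notes above. The function names, the weight of 1.0 for UNKNOWN tokens, and the way a token score is passed in are assumptions for illustration, and the sketch does not reproduce how the final match score is computed.

```python
# Default values quoted in the release notes.
SURNAME_WEIGHT = 1.2            # surnameTokenTypeWeight
SECONDARY_SURNAME_WEIGHT = 0.6  # secondarySurnameTokenTypeWeight
CROSS_SURNAME_PENALTY = 0.75    # crossSurnameMatchPenalty


def token_weight(token_type: str) -> float:
    """Weight for a token type; 1.0 for UNKNOWN is an assumption."""
    weights = {"SURNAME": SURNAME_WEIGHT, "SURNAME2": SECONDARY_SURNAME_WEIGHT}
    return weights.get(token_type, 1.0)


def penalized_token_score(raw_score: float, type_a: str, type_b: str) -> float:
    """Multiply by the penalty when a SURNAME is scored against a SURNAME2."""
    if {type_a, type_b} == {"SURNAME", "SURNAME2"}:
        return raw_score * CROSS_SURNAME_PENALTY
    return raw_score
```

Under these assumptions, a perfect token match between a primary and a secondary surname is reduced from 1.0 to 0.75, which is consistent with "Emilio Gaviria" scoring lower than before against "Pablo Emilio Escobar Gaviria".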
Improved matching of English organization names: We've added ordinal numbers to the override list for English organizations. (RLPNC-6225)
Example: 1st National Bank vs. First National Bank
Previously: 0.6470
Now: 0.9257
Improved Vietnamese name matching: We've expanded the Vietnamese stop word lists for PERSON and ORGANIZATION entity types. (RLPNC-5694)
Example: Chủ tịch Hồ Chí Minh vs. Hồ Chí Minh (translation: President Ho Chi Minh)
Previously: 0.83
Now: 0.99
New parameter for organization names: We've added the parameter tokenizeOrganizationsWithNumbers that prevents tokenization of names with numbers within the name. When set to true (default), the number is left within the token and the name will get a higher value from the edit distance scorer. This is desirable if your data contains organization names which intersperse alphabetic and numeric characters or if your data often contains typographical errors with numerals inserted into otherwise valid tokens. (RLPNC-6200)
Improved Japanese-English location name matching: We've expanded the Japanese-English overrides list for location names. (RLPNC-6268)
Example: 大阪府 vs. Osaka Prefecture
Previously: 0.5853
Now: 0.99
Bug fix: Pairwise match now works with all languages that have limited language support. Previously, an error was returned for unidentified languages. (RLPNC-6100)
Bug fix: Name similarity will no longer return match scores above 1.0. (RLPNC-6254)
Name Translation /name-translation
Khmer - English added: Khmer to English is now a supported name translation language pair. (RLPNC-5708)
Cantonese support added: We can now transliterate Han into Latin characters using the Jyutping transliteration scheme for Cantonese. (RLPNC-6232)
Semantic Similarity /semantics/{semanticsFeature}
New language support: The Semantic Similarity endpoints now support Tagalog (tgl). (RD-2546)
Tokenization /tokens
Spaceless Korean tokenizer: We've added an option to select the spaceless Korean tokenizer in the call. The default tokenizer was not trained on spaceless Korean and did not perform well without spaces between tokens. To use this tokenizer, set the option modelType to DNN. Previously, this option was not available on a per-call basis. (ETROG-3513)
Release 1.20.0
December 2021
Address Similarity /address-similarity
Chinese address matching: We now support Chinese-Chinese and Chinese-English address matching. (RLPNC-5822)
Improved address matching: We've modified the field weight values to provide more accurate address match scores. Weightings were determined by evaluating US and UK address data. (RLPNC-5893)
Example: "85 Court Road Newton Ferrers, Plymouth PL8 1DE1B Devon, England UK" vs. "85 Court Road Newton Ferrers PL8 1DE UK"
Previously: Score: 0.73
Now: Score: 0.81
Entity Extraction and Linking /entities
Bug fix: The supported entity types info of the DNN processor now spells PERSON correctly. (TEJ-1670)
Bug fix: Hungarian dates are now extracted correctly. Previously, dates with embedded periods followed by a space were not being extracted. (TEJ-1681)
Bug fix: The entities/info endpoint now returns the complete list of valid entity types. (WS-2360)
Bug fix: The entities/info endpoint no longer lists TEMPORAL types by default for SWEDISH. (TEJ-1687)
Morphological Analysis /morphology
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts in Japanese. (ETROG-3474)
Example: Input: 三・一四
Previously: tokenization: 三 / ・ / 一四
Now: tokenization: 三・一四
Emojis: U+3030 and U+303D are now tagged as emojis even when not followed by U+FE0F. (ETROG-3478)
Emoji support: We now support the emoji in Unicode 14.0. (ETROG-3476)
Japanese tokenization: In Japanese, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. (ETROG-3475)
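As an illustration of the two Japanese numeric notes above (middle dots treated as decimal points, and numeric tokens lemmatized to ASCII values), here is a minimal Python sketch. It handles only positional kanji digits and the two middle dots. The function name is hypothetical, and whether the product emits exactly this lemma for a decimal token is not stated in these notes.

```python
# Positional kanji digits only; units such as 十 or 百 are out of scope here.
KANJI_DIGITS = dict(zip("〇一二三四五六七八九", "0123456789"))

# Fullwidth (U+30FB) and halfwidth (U+FF65) Katakana middle dots.
MIDDLE_DOTS = {"\u30fb", "\uff65"}


def to_ascii_number(token: str):
    """Convert a token of kanji digits and middle dots to ASCII, else None."""
    converted = []
    for ch in token:
        if ch in KANJI_DIGITS:
            converted.append(KANJI_DIGITS[ch])
        elif ch in MIDDLE_DOTS:
            converted.append(".")
        else:
            return None
    return "".join(converted)
```

In this sketch "七" maps to "7", and the single token 三・一四 maps to "3.14".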
Improved POS tags: Many number, punctuation, and symbol characters are now POS-tagged appropriately as numbers, punctuation, and symbols instead of being marked as unknown or given some other tag. This applies to all languages with POS tags. (ETROG-3481)
Hungarian improvements: We've added some Hungarian abbreviations and improved sentence boundary detection around Hungarian abbreviations. (ETROG-3479, ETROG-3484)
Bug fix: In Japanese, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
Bug fix: We've reverted two of the POS changes made in version 1.19.0 as they introduced regressions in Chinese and Japanese. (ETROG-3466)
The values are now:
"|以" (Chinese): "|" tagged as PUNCT
"2对" (Chinese): "对" tagged as NM
Bug fix: Morphology no longer detects characters as emoji when followed by the text presentation selector (U+FE0E). (ETROG-3480)
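The distinction between the text presentation selector (U+FE0E) and the emoji presentation selector (U+FE0F) can be checked with a few lines of Python. This is a sketch of the Unicode convention only; the function name is hypothetical and it does not reproduce how Morphology tags tokens.

```python
TEXT_SELECTOR = "\ufe0e"   # VARIATION SELECTOR-15, requests text presentation
EMOJI_SELECTOR = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation


def requested_presentation(text: str, index: int) -> str:
    """Report which presentation is requested for the character at index."""
    following = text[index + 1 : index + 2]
    if following == TEXT_SELECTOR:
        return "text"
    if following == EMOJI_SELECTOR:
        return "emoji"
    return "default"
```

A character followed by U+FE0E reports "text", which corresponds to the case that is no longer detected as an emoji.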
Bug fix: In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
Name Similarity /name-similarity
Burmese transliteration: We added a Basis Technology-created Folk transliteration scheme for Burmese name matching that is similar to how Burmese names are commonly transliterated to English. (RLPNC-5892)
Basic support for all languages: Name similarity can now match names in any language. Languages which previously would have returned an "unsupported language" error now return a match score. The score is either 1 for a perfect match, or a value based on edit distance. (RLPNC-5979)
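For languages with only basic support, the score is described as 1 for a perfect match or a value based on edit distance. The exact formula is not given in these notes; the Python sketch below shows one common way to turn Levenshtein distance into a score between 0 and 1, purely as an assumption for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fallback_score(name1: str, name2: str) -> float:
    """Hypothetical score: 1.0 for identical names, else 1 - distance / length."""
    if name1 == name2:
        return 1.0
    longest = max(len(name1), len(name2))
    return 1.0 - levenshtein(name1, name2) / longest
```

With this normalization, two four-letter names that differ in one letter score 0.75.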
Improved ORG matching: We added real world ID tables of organizational names to improve ORG matching in the following languages: Thai (tha), Greek (ell), Hebrew (heb), Burmese (mya), German (deu), French (fra), Hungarian (hun), Italian (ita), Portuguese (por), Spanish (spa), and Vietnamese (vie). (RLPNC-5986)
Example: "International Astronomical Union" vs. "האיגוד האסטרונומי הבינלאומי"
Previously: score 0.70
Now: score 0.98
Improved Hebrew name matching: We've added a rule-based vocalization checker for the statistical-model vocalizer to improve Hebrew-Hebrew and Hebrew-English name matching. (RLPNC-5990)
Name Translation /name-translation
Improved Hebrew transliteration: The Hebrew character ח used to be transliterated as “h” in some cases and “kh” in others (if it was followed by a geresh). It is now transliterated as “ch” when not followed by a geresh. The Hebrew character כ used to be transliterated as “h” in some cases and “k” in others (if it has a dagesh). Now, it is transliterated to “ch” in the cases when it used to be transliterated to “h”. (RLPNC-5928)
Example: נחמן
Previously: Nahman
Now: Nachman
Example: מיכל
Previously: Mihal
Now: Michal
Bug fix: Burmese-English transliteration has been improved by revising the Folk and MLCTS transliteration schemes. (RLPNC-5950)
Bug fix: Hebrew Folk transliteration has been improved, especially for the letters vav and yod. (RLPNC-5916)
Tokenization /tokens
Katakana tokenization: The fullwidth and halfwidth Katakana middle dots (U+30FB and U+FF65) are now treated as decimal points in numeric contexts in Japanese. (ETROG-3474)
Example: Input: 三・一四
Previously: tokenization: 三 / ・ / 一四
Now: tokenization: 三・一四
Japanese tokenization: In Japanese, numeric tokens tagged NN are lemmatized to their ASCII values. For example, “七” is lemmatized to “7”. (ETROG-3475)
Bug fix: In Japanese, the combining marks U+3099 and U+309A are now tokenized with the preceding character as a single token. Previously, they were tokenized as 2 separate tokens. (ETROG-3472)
Bug fix: In English, the lowercase abbreviations of the titles “dr.”, “drs.”, “mr.”, and “mrs.” are now tokenized the same as the uppercase “Dr.”, “Drs.”, “Mr.”, and “Mrs.”. (ETROG-3485)
Release 1.19.3
August 2021
Address Similarity /address-similarity
Improved address matching: We've expanded the override tables for UK, U.S., and Canadian addresses. (RLPNC-5886)
Example: houseNumber<47>road<Albert Street>city<Aberdeen>stateDistrict<Aberdeenshire>postCode<AB25 1XT> vs. houseNumber<47>road<Albert Street>city<Aberdeen>stateDistrict<ABD>postCode<AB25 1XT>
Previously: Score: 0.86
Now: Score: 0.96
Bug fix: Overrides for alphanumeric address fields (houseNumber, unit, poBox, postCode) are now being applied. (RLPNC-5863)
Example: “3710 W Martin Luther King Blvd STE #121” vs. “3710 W Martin Luther King Blvd Suite #121”
Previously: Score: 0.833
Now: Score: 0.95
Entity Extraction and Linking /entities
Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions. (TEJ-1657, TEJ-1658)
New /info endpoint: The entities/info endpoint returns a list of supported entity types available by processor type and language.
New option for structured regions: The option structuredRegionProcessingType has been added to specify the type of processing for structured regions. By default, the statistical/DNN model is turned off for structured regions, which increases precision, but may result in reduced recall in structured regions.
To turn on the statistical/DNN model for structured regions, set the option structuredRegionProcessingType to nerModel.
To use the name classifier (LABS) to identify structured regions as PERSON, LOCATION, or NONE entity types, set the option structuredRegionProcessingType to nameClassifier.
The existing option enableStructuredRegion does not determine how structured regions get processed. When enableStructuredRegion is set to true and the input is contentUri (i.e. set to retrieve content from a URL), then HTML lists and tables will be extracted by Tika as structured regions. The value of structuredRegionProcessingType will then determine how those structured regions are processed.
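A request body that combines the two structured-region options might look like the following Python sketch. The option names and values come from the notes above. The helper name and the example URL are hypothetical, and placing the options under an "options" key alongside contentUri is an assumption about the request format.

```python
import json


def structured_region_request(content_uri: str, processing_type: str) -> dict:
    """Build a hypothetical /entities request body for structured regions."""
    return {
        "contentUri": content_uri,
        "options": {
            # Ask for HTML lists and tables to be extracted as structured regions.
            "enableStructuredRegion": True,
            # "nerModel" or "nameClassifier", as described above.
            "structuredRegionProcessingType": processing_type,
        },
    }


body = structured_region_request("https://example.com/page.html", "nameClassifier")
payload = json.dumps(body)
```

The resulting JSON string would then be sent as the body of the request to the /entities endpoint.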
Language Identification /language
Bug fix: Confidence scores for language regions are now always within the range [0,1]. (RLIJE-552)
Name Similarity /name-similarity
Improved Hebrew-English name matching:
We've improved the statistical model. (RLPNC-5842)
We changed the default transliteration scheme to FOLK from ISO259-2-1994, which improves matching scores as FOLK more closely matches how people transliterate Hebrew names. (RLPNC-5844)
Example: בִּנְיָמִין גַּנְץ vs. Benjamin Gantz
Previously: Score: 0.8738
Now: Score: 0.9709
We expanded the token overrides for person entity types. (RLPNC-5845)
Example: אלכס vs. Alexander
Previously: Score: 0.8675
Now: Score: 0.9361
We added word embeddings for Hebrew organizations. (RLPNC-5837)
Example: ארגון המזון והחקלאות vs. Food and Agriculture Organization
Previously: Score: 0.6002
Now: Score: 0.7309
Improved Hebrew-Hebrew name matching: We expanded the token overrides for person entity types. (RLPNC-5891)
Example: סולומונתאס vs. סולונאס
Previously: Score: 0.5309
Now: Score: 0.8894
Improved English-English name matching: We added the token override pair Alex/Aleksandar. (RLPNC-5871)
Example: Alex vs. Aleksandar
Previously: Score: 0.4106
Now: Score: 0.8894
Improved matching for identifiers: We improved matching and added support for three new subtypes: IDENTIFIER_DRIVERS_LICENSE, IDENTIFIER_LICENSE_PLATE, IDENTIFIER_NATIONAL_ID_NUM, along with IDENTIFIER_GENERIC. (RLPNC-5852)
Example: NH123456789DL vs. NH123456789DN (as IDENTIFIER_DRIVERS_LICENSE entity type)
Previously: Score: 0.6940
Now: Score: 0.9689
Improved Japanese Segmentation: We've expanded the segmentation dictionary to improve Japanese name segmentation. (RLPNC-5835)
Example: ミロシェヴィッチスロボダン
Previously: [ミロシェヴィッチスロボ ダン]
Now: [ミロシェヴィッチ][ スロボ ダン]
Bug fix: Hebrew tokens containing diacritics are now identified in the override table. (RLPNC-5882)
Example: אֲבִי vs. Abigail
Previously: Score: 0.5299
Now: Score: 0.9361
Sentence Tagging /sentences
New language support: Sentence tagging is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Tokenization /tokens
New language support: Tokenization is now supported for Indonesian, Standard Malay, and Tagalog. (ETROG-3443)
Release 1.19.1
June 2021
Address Similarity /address-similarity
Improved handling of postal codes: Better match scores result from this enhancement. (RLPNC-5639)
Example: houseNumber<123>road<Clifton St>city<Cambridge>state<MA>postCode<02140 1234> vs. houseNumber<123>road<Clifton St>city<Cambridge>state<MA>postCode<02140-1234>
Previously: Score: 0.89
Now: Score: 1.0
Expanded UK and CA address override tables: We expanded override tables for UK and Canadian addresses. (RLPNC-5607)
Example: houseNumber<100>road<Main Ave>city<Shellbrook>state<Saskatchewan>postCode<S0J 2E0> vs houseNumber<100>road<Main Ave>city<Shellbrook>state<Sask>postCode<S0J 2E0>
Previously: Score: 0.88
Now: Score: 0.96
Entity Extraction and Linking /entities
New default processing for structured text regions (lists, tables): Because structured text is often just words or phrases, and thus missing the syntactic context that REX was trained on, some REX users would pre-process input text to remove structured regions, on which REX performed poorly. Users no longer have to pre-process the input, as the statistical/DNN model is now turned off by default for structured regions. This mode increases precision but may result in reduced recall in these regions. Note that the other REX processors (pattern match, exact match, entity linking), which do not rely on context, will continue to analyze the structured regions. (TEJ-1502)
New name classifier model for structured regions (LABS): We've added a new model for processing structured regions. Each sentence or fragment (the structured region) is classified as a single entity. The name classifier classifies the entity as PERSON, LOCATION, or NONE. It is disabled by default. It can be enabled by using the enableStructuredRegion option in the call. (TEJ-1613, TEJ-1621)
New option for structured regions: The option enableStructuredRegion has been added to configure how structured regions are processed. The default value is false. When set to true:
If the input is HTML, Tika is configured to extract tables and lists, in addition to content.
The name classifier model is used to process structured regions.
Japanese organization gazetteers: The gazetteers for Japanese organizations have been updated to improve extraction of Japanese organizations. (TEJ-1612)
Bug fix: Invalid whitespace handling by the DNN processor no longer causes a runtime exception. (TEJ-1614)
Bug fix: A previous entity offset alignment issue involving very short regex-captured entities followed by \r is fixed.
Bug fix: We reverted some changes to Korean tokenization from the 1.18.0 release. You may see minor differences in extracted entities in Korean when comparing results from the previous release.
Bug fix: Entities are no longer extracted when they cross a sentence boundary. To enable entity linking across sentence boundaries, set disableApplySentenceBoundaries to true. (ELK-259)
Bug fix: Entities are now checked to ensure they are normalized. (TEJ-1615)
Morphological Analysis /morphology/{morphoFeature}
Greek lexicon: The Greek lexicon has additional nouns. (ETROG-3288)
New Greek disambiguator: There is a new, more accurate Greek disambiguator. (ETROG-3304)
Greek disambiguation improved: Certain Greek forms are now disambiguated to prefer a modern analysis over an archaic analysis. (ETROG-3289)
Example: δείξε
Previously: Selected lemma: δεικνύω (archaic)
Now: Selected lemma: δείχνω (modern)
Emoji normalization: Morphology no longer normalizes certain emoji zero-width joiner (ZWJ) sequences to U+1F48F KISS, U+1F491 COUPLE WITH HEART, and U+1F46A FAMILY, to be consistent with Unicode’s efforts to make emoji more gender-neutral by default. (ETROG-3350)
Improved Hebrew tokenizer and new analyzer: The Hebrew tokenizer is now more consistent with the tokenizers of other languages. (ETROG-3290)
We've improved tokenization of certain sequences involving digits, periods, and number-related symbols like ⟨%⟩.
We've added additional acronyms and abbreviations to the Hebrew tokenizer. (ETROG-3249)
Double apostrophes are now treated like gershayim. (ETROG-3249)
Normalized characters: Normalized half-width and full-width characters are processed the same as their counterparts. (ETROG-3351)
Bug fix: Consecutive punctuation characters are no longer returned as a single token in Chinese. Now each character is its own token. (ETROG-3402)
Example: Input: 天津??
Previously:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=??}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=U, lemma=??, tagSet=BT_CHINESE}
Now:
Token{text=天津}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=NP, lemma=天津, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Token{text=?}
HanMorphoAnalysis{extendedProperties={}, partOfSpeech=EOS, lemma=?, tagSet=BT_CHINESE}
Bug fix: We reverted some changes to Korean disambiguation from the 1.18.0 release as the changes introduced new disambiguation errors. You may see minor differences in Korean disambiguation when comparing results from earlier releases.
Bug fix: Hebrew tokens containing a geresh are now tokenized properly. Previously, only the part up to the geresh would be returned as the token, and the part after the geresh would sometimes be considered a suffix. Now all the characters of the token are returned as the token. (ETROG-3290)
Example: מע'רב
Previously:
Token{text=מע'} MorphoAnalysis{extendedProperties={hebrewPrefixes=[], hebrewSuffixes=[]}, partOfSpeech=noun, lemma=מע', tagSet=MILA_HEBREW} MorphoAnalysis{extendedProperties={hebrewPrefixes=[מ, ב], hebrewSuffixes=[ר, ב]}, partOfSpeech=numeral, lemma=70, tagSet=MILA_HEBREW}
Now:
Token{text=מע'רב} MorphoAnalysis{extendedProperties={com.basistech.rosette.bl.hebrewPrefixes=[], com.basistech.rosette.bl.hebrewSuffixes=[]}, partOfSpeech=unknown, lemma=מע'רב, tagSet=MILA_HEBREW}
Name Deduplication /name-deduplication
Improved name deduplication: We added support for name overrides so that, for example, the nickname “Mike” is included in the same cluster with “Michael”. (RLPNC-5600)
Name Similarity /name-similarity
Burmese-English added: Burmese-Burmese and Burmese-English are now supported name matching pairs. (RLPNC-5660)
Hebrew-English added: Hebrew-Hebrew and Hebrew-English are now supported name matching pairs. (RLPNC-5339)
Vietnamese-English added: Vietnamese-Vietnamese and Vietnamese-English are now supported name matching pairs. (RLPNC-5687)
Improved Chinese-English: Name similarity scores for Chinese-English now leverage English translations from a list of Chinese name translations. (RLPNC-5643)
Example: 汤姆 vs. Tom
Previously: Score: 0.37
Now: Score: 0.99
Improved English-English organizations: We expanded the overrides list with numbers and their written form from 1 to 21. (RLPNC-5644)
Example: Channel One Russia vs. Channel 1 Russia
Previously: Score: 0.54
Now: Score: 0.96
Improved name matching for organizations: Name similarity scores for organizations have been improved by adding new frequency models in English, Chinese, Arabic, Japanese, and Russian. (RLPNC-5416)
Updated frequency model: We updated the English frequency model for PERSON entity type by adding birth names from 1920-2019 to the existing model. (RLPNC-5592)
Character normalization improved: We've improved character normalization for all supported languages. All languages except for Japanese, Korean, and Arabic-script languages now have NFKC normalizations performed. Japanese, Korean, and Arabic-script languages have NFKD normalizations performed.
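The difference between the two normalization forms can be seen with Python's standard unicodedata module. The sketch below picks the form by language code as described above. The set of codes is a partial, assumed list (only jpn, kor, and ara are shown), and the function name is hypothetical.

```python
import unicodedata

# Languages described as receiving NFKD. Other Arabic-script languages
# would belong here too; their codes are not listed in these notes.
NFKD_LANGUAGES = {"jpn", "kor", "ara"}


def normalize_name(text: str, language: str) -> str:
    """Apply NFKD for Japanese, Korean, and Arabic; NFKC otherwise."""
    form = "NFKD" if language in NFKD_LANGUAGES else "NFKC"
    return unicodedata.normalize(form, text)


# NFKC folds compatibility characters and recomposes: fullwidth "Ａ" becomes "A".
fullwidth_a = normalize_name("\uff21", "eng")
# NFKD leaves characters decomposed: "が" becomes "か" plus a combining mark.
decomposed_ga = normalize_name("\u304c", "jpn")
```

Both forms fold compatibility variants such as fullwidth letters. They differ in whether composed characters are left composed (NFKC) or split into a base character and combining marks (NFKD).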
Japanese ORGANIZATION type matching improved: We've improved the accuracy of ORGANIZATION matching for Japanese-English and Japanese-Japanese names by expanding the stop word list for the ORGANIZATION entity type.
Example: コダック合同会社 vs. Kodak Limited
Previously: Score: 0.7258
Now: Score: 0.98
Better entity resolution for Russian ORGANIZATION type matching: We've improved Russian-English and Russian-Russian name matching by adding support for Russian organizations in the entity resolution engine.
Example: Ура́льские авиали́нии vs. Ural Airlines
Previously: Score: 0.6648
Now: Score: 0.98
Expanded stop word list for Russian ORGANIZATION type matching: We've improved the accuracy of ORGANIZATION matching for Russian-English and Russian-Russian names by expanding the stop word list for the ORGANIZATION entity type.
Example: Балтийский федеральный университет имени Иммануила Канта vs. Immanuel Kant Baltic Federal University
Previously: Score: 0.7140
Now: Score: 0.7740
Korean ORGANIZATION type matching improved: We've improved Korean-English and Korean-Korean name matching by adding support for Korean organizations in the entity resolution engine.
Example: 현대자동차 vs. Hyundai Motor Company
Previously: Score: 0.6921
Now: Score: 0.98
Bug fix: We fixed a bug where Arabic-Arabic name matching was returning a low score in some cases for names that were seemingly very similar.
Example: عبدلخالق الحوثي vs. عبدالخالق الحوثي
Previously: Score: 0.6912
Now: Score: 0.8995
Name Translation /name-translation
Burmese-English added: Burmese to English is now a supported name translation language pair. (RLPNC-5662)
Example: မင်း အောင် လှိုင် ⟹ Maang Aaung Lhuing
Release 1.18.0
December 2020
Entity Extraction and Linking /entities
Wikidata refreshed: The internal database for Wikidata linking has been refreshed and re-indexed. QIDs for some entities may change from previous versions.
Bug fix: The sample for the SQLite-kb-connector now works correctly. Runtime issues with SQLite dependencies have been corrected.
Bug fix: Extraction no longer fails when a custom processor returns a NULL annotator; a warning is issued instead.
Bug fix: Mentions normalized by a custom processor are no longer ignored.
Bug fix: Newlines \r and \r\n are now handled correctly.
Morphological Analysis /morphology/{morphoFeature}
Deprecated option: The alternative tokenization option deliverExtendedAttributes is now deprecated. Previously it delivered an unsupported extended property.
Bug fix: Combining characters in Hebrew which were being erroneously split into tokens separated from their bases are no longer being split.
Bug fix: Certain patterns of white-space within Chinese and Japanese tokens no longer cause an internal server error.
Bug fix: The mappings of default Basis POS tags to universal POS tags ("options": {"partOfSpeechTagSet": "upt16"}) have been corrected for Greek.
Previously: COSUBJ mapped to CONJ, ORD mapped to ADJ, and POSS mapped to DET
Now: COSUBJ maps to ADP, ORD maps to NUM, and POSS maps to PRON
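If universal POS tags from earlier Greek output were stored alongside the original Basis tags, a small check such as the following sketch can flag values produced by the old mapping. The two tables come from the note above; the helper name and the idea of auditing stored tags are assumptions for illustration.

```python
# Mappings taken from the release note (Greek, upt16 tag set).
PREVIOUS_GREEK_MAPPING = {"COSUBJ": "CONJ", "ORD": "ADJ", "POSS": "DET"}
CORRECTED_GREEK_MAPPING = {"COSUBJ": "ADP", "ORD": "NUM", "POSS": "PRON"}

# The option shown in the note for requesting universal POS tags.
UNIVERSAL_TAG_OPTIONS = {"partOfSpeechTagSet": "upt16"}


def is_stale(basis_tag: str, stored_universal_tag: str) -> bool:
    """Return True when a stored universal tag reflects the old Greek mapping."""
    if basis_tag not in CORRECTED_GREEK_MAPPING:
        return False
    return (
        stored_universal_tag == PREVIOUS_GREEK_MAPPING[basis_tag]
        and stored_universal_tag != CORRECTED_GREEK_MAPPING[basis_tag]
    )


if __name__ == "__main__":
    print(is_stale("ORD", "ADJ"))   # produced by the old mapping
    print(is_stale("ORD", "NUM"))   # already matches the corrected mapping
```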
Bug fix: When fstTokenize is enabled, inputs in Spanish are no longer truncated.
Name Similarity /name-similarity
Improved ORGANIZATION type matching: We improved the accuracy of ORGANIZATION name matching by integrating name completion with an internal entity resolution engine.
Example: ソニー株式会社 vs. Sony Corporation
Previously: Score: 0.7793
Now: Score: 0.98
Improved Japanese: We've improved the segmentation of Japanese PERSON names.
Example: スズキタロウ
Previously: スズ キタロウ (suzu kitarou)
Now: スズキ タロウ (suzuki tarou)
Spanish token overrides: The Spanish-Spanish token overrides for the PERSON entity type have been expanded.
Example: Francisco vs. Paco
Previously: Score: 0.7786
Now: Score: 0.8723
Semantic Similarity /semantics/{semanticsFeature}
Languages added: The semantic similarity endpoints now also support Hebrew, Hungarian, Italian, Persian, Portuguese, and Urdu.
API information /info
License expiration: The license expiration date has been added to the retrieved API information.
Known Issues
The log files 500-exception.log and rosapi.log are not created automatically. Contact Support if you need these log files.
Open Source Changes
Package | Old Version | New Version |
---|---|---|
Apache Bval | 1.1.1 | 2.0.5 |
Apache XMLSchema | 2.2.4 | 2.2.5 |
Bean Validation API | 1.1.0.Final | 2.0.0.FINAL |
Deep Learning for Java | 1.0.0-beta4 | 1.0.0-beta7 |
Package | Version |
---|---|
Jakarta Activation | 1.2.1 |
Java Expression Language | 2.2.4 |
Objenesis | 2.1.0 |
uk.com.robust-it.cloning library | 1.9.3 |
Package | Version |
---|---|
Apache CXF | 3.3.6 |
Apache Felix Declarative Services | 2.1.25 |
Package | Version |
---|---|
Codehaus Plexus Interpolation | 1.22 and 1.23 |
Package | Old Version | New Version |
---|---|---|
Google GSON | 2.8.5 | 2.7.0 |
Package | Version |
---|---|
Google GSON | 2.8.5 |
Package | Version |
---|---|
Jakarta XML Binding | 2.3.2 |
Patch Release - 1.17.3
October 2020
Rosette - All Platforms
Language Identification /language
Bug fix: Rosette now correctly identifies the primary language of short documents which contain small fragments of a language in another script. Previously, the language of the fragments might be erroneously detected as the primary language. The lengths of the document's script regions are now taken into account when identifying the primary language.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: Sentence breaks are now correct when there are two line breaks and fragmentBoundaryDetection is enabled.
Example: "a very very very very long line\nshort\n\n"
Previously: 2 sentences
"a very very very very long line\n" "short\n\n"
Now: 1 sentence
"a very very very very long line\nshort\n\n"
Bug fix: In Hebrew, lemmas starting or ending with spaces now have the spaces removed.
Example: "אאורקה "
Previously: "אאורקה "
Now: "אאורקה"
Rosette Server On-Premise Only
Morphological Analysis /morphology/{morphoFeature}
Decompose noun compounds: Support for decomposeCompounds has been expanded to Danish, Dutch, German, Hungarian, Korean, Norwegian (Bokmål, Nynorsk), and Swedish. The default value is true.
Patch Release - 1.17.2
September 2020
Rosette - All Platforms
Address Similarity /address-similarity
Name Similarity /name-similarity
Bug fix: We've fixed a problem where the endpoints might not see the correct license.
Open Source Changes
Package | Version | License |
---|---|---|
Jakarta XML Binding | 2.3.2 | EDL 1.0 & BSD 3-Clause & EPL 2.0 & GPL 2 w/ CE |
Package | Version | License |
---|---|---|
JavaPoet | 1.9.0 | Apache 2.0 |
Patch Release - 1.17.1
September 2020
Rosette - All Platforms
Address Similarity /address-similarity
New per-call option - parameters: Requests can now specify a parameters object to update parameter values for an individual request. Any non-static parameter can be changed. To specify, add "parameters": {"parameterName":"value"} to the call.
Example:
{ "address1": { "houseNumber": "122", "street": "main st", "postCode": "02222" }, "address2" : { "houseNumber": "123", "street": "main st", "postCode": "02222" }, "parameters": { "postCodeAddressFieldWeight": "2.0", "stateAddressFieldWeight": "0.5" } }
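A request body like the one above can be assembled programmatically. The sketch below is a minimal helper, with a hypothetical function name, that attaches a per-call parameters object to an existing body and renders the values as strings to mirror the example.

```python
import json


def with_parameters(body: dict, **parameters) -> dict:
    """Return a copy of the request body with a per-call "parameters" object."""
    request = dict(body)
    # Values are rendered as strings to mirror the example request above.
    request["parameters"] = {name: str(value) for name, value in parameters.items()}
    return request


if __name__ == "__main__":
    base = {
        "address1": {"houseNumber": "122", "street": "main st", "postCode": "02222"},
        "address2": {"houseNumber": "123", "street": "main st", "postCode": "02222"},
    }
    request = with_parameters(
        base, postCodeAddressFieldWeight=2.0, stateAddressFieldWeight=0.5
    )
    print(json.dumps(request, indent=2))
```

Because the helper copies the body, the same base addresses can be reused with different parameter values across calls.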
Improved address matching: Address matching has been improved through better normalization of the postal code address field.
Example: 71-75 Shelton street London WC2H9JQ vs. 71 SHELTON STREET LONDON WC2H 9JQ
Previously: 0.6239
Now: 0.8542
Entity Extraction and Linking /entities
Improved phone number recognition: Regular expressions for phone number extraction have been improved and now extract more phone number patterns.
Bug fix: We've partially fixed a problem in Japanese ORG extraction where sometimes the model extracts multiple ORG entities or includes non-related adjacent tokens.
Name Similarity /name-similarity
New per-call option - parameters: Requests can now specify a parameters object to update parameter values for an individual request. Any non-static parameter can be changed. To specify, add "parameters": {"parameterName":"value"} to the call.
Example:
{ "name1": { "text": "Kraft Services", "entityType": "Organization" }, "name2": { "text": "Kraft Srvs", "entityType": "Organization" }, "parameters": { "deletionScore": "0.2" } }
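For callers who build requests by hand, the following sketch wraps the name similarity body in a POST request using only the Python standard library. The base URL and headers are supplied by the caller because they depend on your deployment; the placeholder URL in the usage section is not a real endpoint, and the function name is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_name_similarity_request(
    base_url: str,
    headers: dict,
    name1: dict,
    name2: dict,
    parameters: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /name-similarity endpoint."""
    body = {"name1": name1, "name2": name2}
    if parameters:
        body["parameters"] = parameters
    return urllib.request.Request(
        base_url.rstrip("/") + "/name-similarity",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    request = build_name_similarity_request(
        "https://example.invalid/rest/v1",  # placeholder; use your own base URL
        {"Content-Type": "application/json"},  # add your authentication header
        {"text": "Kraft Services", "entityType": "Organization"},
        {"text": "Kraft Srvs", "entityType": "Organization"},
        parameters={"deletionScore": "0.2"},
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.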
Expanded English stop words: Organization name matching has been improved by the addition of English stop words.
Example: SUNY Canton vs. State University of New York at Canton
Previously: 0.8340
Now: 0.8713
Improved Spanish, Arabic, and Korean organization matching: By adding the use of word embeddings, we've improved organization name matching when one or both of the names are in the specified language.
Spanish-English example: Astilleros y Talleres del Noroeste vs. Shipyards and Workshops of the Northwest
Previously: 0.4558
Now: 0.8838
Korean-English example: 아시아나 항공 vs. Asiana Airlines
Previously: 0.7208
Now: 0.8672
Arabic-English example: الاتحاد العالمي للحفاظ على الطبيعة والمصادر الطبيعية vs. International Union for Conservation of Nature and Natural Resources
Previously: 0.3263
Now: 0.7106
Rosette Server On-Premise Only
Entity Extraction and Linking /entities
Joiner runs before redactor: The joiner now runs before the redactor by default, providing more flexibility and control over the joiner results. Set runJoinerPostRedactor to true to run the joiner after the redactor.
Bug fix: We fixed a bug where sometimes a null pointer exception was returned when the custom processor and the linker had overlapping results.
Bug fix: Custom processors can now only modify the entity and metadata sections of the ADM. Previously, any modification could be made which could override annotation data.
Open Source Changes
Package | Version | License |
---|---|---|
Spotify Annoy Java | 0.2.5 | Apache 2.0 |
Package | Version | License |
---|---|---|
Apache CXF | 3.3.4 | Apache 2.0 |
Apache Service Mix Specs JAXWS API 2.3 | 2.3.1 | CDDL 1.1 & GPL 2 w/CE |
Apache Service Mix Specs JSR 339 API 2.0 | 2.4.0 | CDDL 1.1 & GPL 2 w/CE |
Apache XMLSchema | 2.2.4 | Apache 2.0 |
Jackson JAXRS | 2.10.0 | Apache 2.0 |
Jakarta Activation | 1.2.1 | BSD 3-clause |
Jakarta RESTful Web Services | 2.1.5 | EPL 2.0 & GPL 2 w/CE |
Jakarta SOAP with Attachments (SAAJ) | 1.4.0 | EDL 1.0 & BSD 2-Clause & EPL 2.0 & GPL 2 w/CE |
Java Architecture for XML Binding | 2.3.0 | CDDL 1.1 & GPL 2 w/CE |
Java Common Annotations | 1.3.2 | CDDL 1.1 & GPL 2 w/CE |
Java Servlet API | 3.1.0 | CDDL 1.1 & GPL 2 w/CE |
Package | Version |
---|---|
Spring Cloud Netflix | 2.2.4 |
Package | Version |
---|---|
Jakarta XML Binding | 2.3.2 |
Package | Old Version | New Version |
---|---|---|
Apache Tomcat Embedded | 8.5.55 | 8.5.56 |
NEW Release - 1.17.0
August 2020
Rosette - All Platforms
Morphological Analysis /morphology/{morphoFeature}
Greek coverage expanded: POS tags and lemmas are now recognized for some Greek words previously not identified. (ETROG-3225)
Bug fix: Fragment detection now counts tokens correctly to determine short lines. This mostly impacts languages without spaces: Chinese, Japanese, and Thai. (ETROG-3177)
Bug fix: Tokens with digits are now analyzed by the Greek guesser. (ETROG-3231)
Previously: "HDMI1" defaulted to possible PROP, ADJ, NOUN POS tags
Now: "HDMI1" gets FM POS tag
Bug fix: Russian perfective verbs are now lemmatized correctly. Previously some were lemmatized to their imperfective counterparts' lemmas or other incorrect lemmas. (ETROG-3112)
Example: "разложу" where "разложу" is perfective and its lemma is "разложить". Its imperfective counterpart’s lemma is "раскладывать"
Previously: Two analyses: one lemmatized to "раскладывать", the other to "разлагать"
Now: One analysis, lemmatized to "разложить"
Bug fix: German lemmas that consist of a separable prefix and a noun are now correctly capitalized. (ETROG-3235)
Example: Input "Mitbehandlung"; "mit" is a separable prefix
Previously: Lemmatized to "mitBehandlung"
Now: Lemmatized to "Mitbehandlung"
Bug fix: In Hebrew, terminal combining characters are no longer getting split into their own tokens. (ETROG-3224)
Example: "1" (keycap)
Previously: Tokenized to two tokens, <U+0031 DIGIT ONE> <U+20E3 COMBINING ENCLOSING KEYCAP>.
Now: Tokenized to one token, "1"
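The keycap case can be reproduced with Python's `unicodedata` module: U+20E3 is an enclosing mark, so it belongs with the preceding base character. The sketch below keeps marks attached to their bases; it illustrates the Unicode property involved and is not the product's tokenizer.

```python
import unicodedata
from typing import List


def attach_combining_marks(text: str) -> List[str]:
    """Group each mark (general category M*) with the character before it."""
    units: List[str] = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    keycap_one = "1\u20e3"
    print(len(keycap_one))                    # number of code points
    print(unicodedata.name("\u20e3"))         # name of the combining character
    print(len(attach_combining_marks(keycap_one)))  # number of grouped units
```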
Name Similarity /name-similarity
Arabic improvements: New PERSON and ORGANIZATION stopwords have been added, improving Arabic-English and Arabic-Arabic name matching.
Example for PERSON entity type: محمد vs. نبي محمد
Previously: score = 0.4186
Now: score = 0.99
Example for ORGANIZATION entity type: بنك الأهلي التجاري vs. ال البنك الأهلي التجاري
Previously: score = 0.6512
Now: score = 0.99
Name Translation /name-translation
Hebrew - English added: Hebrew to English is now a supported translation language pair. Rosette supports the Hebrew transliteration standards ISO259_2_1994 and ICU (which refers to the default Hebrew transliterator implemented by ICU and is based on the UNGEGN standard for geographic names), but defaults to FOLK. This Basis Technology-created transliteration scheme is more useful than the other more academic standards (ISO259_2_1994 and ICU) as it closely resembles how people in the real world write Hebrew names with Latin characters.
Semantic Similarity /semantics/{semanticsFeature}
French added: The semantic similarity endpoints now support French.
Rosette Server On-Premise Only
Entity Extraction and Linking /entities
LABS Custom Knowledge Base Connector Sample: A working example of a custom knowledge base connector to a knowledge base backed by a SQLite database is now available on our public GitHub page at https://github.com/rosette-api/sqlite-kb-connector. Note that this feature is still in LABS and subject to change.
Morphological Analysis /morphology/{morphoFeature}
New short line parameter: The option maxTokensForShortLine has been added to configure how many tokens can be in a line for it to be considered short for fragment boundary detection. The default value is 6. (ETROG-3179)
Greek time abbreviations: The time abbreviations "π.μ." and "μ.μ." are now identified and annotated in Greek. The option fstTokenize must be set to true. (ETROG-3226)
Bug fix: Whitespace-delimited fragment boundaries are no longer skipped when they fall within tokens. This only occurred when fstTokenize was enabled and in some languages. (ETROG-3159)
Example: "1\n234" (embedded newline within the number string)
Previously: "1 234" (1 token)
Now: "1" "234" (2 tokens)
This example assumes fstTokenize is enabled and the language is French.
Bug fix: In Hebrew, tokens with an unknown part of speech are no longer assigned the part of speech of one of their prefixes. This only applies when the option guessHebrewPrefixes is set to true. (ETROG-3221)
Example: "ומפיפרנו"
Previously: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag preposition.
Now: Lemmatized to “פיפרנו” with two prefixes (“ו” and “מ”) and the POS tag unknown.
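To make the maxTokensForShortLine option from this section more concrete, the sketch below flags lines whose token count is at or below a threshold. Whitespace splitting stands in for the real tokenizer, and treating the threshold as inclusive is an assumption; only the option name and its default of 6 come from the notes.

```python
from typing import List


def short_lines(text: str, max_tokens_for_short_line: int = 6) -> List[str]:
    """Return the non-empty lines whose token count is within the threshold.

    Whitespace splitting is a stand-in for real tokenization.
    """
    result = []
    for line in text.splitlines():
        token_count = len(line.split())
        if 0 < token_count <= max_tokens_for_short_line:
            result.append(line)
    return result


if __name__ == "__main__":
    sample = "a very very very very long line\nshort\n\n"
    print(short_lines(sample))
    print(short_lines(sample, max_tokens_for_short_line=7))
```

With the default threshold the seven-token first line is not treated as short, while raising the threshold to 7 includes it.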
Open Source Changes
Package | Old Version | New Version |
---|---|---|
Apache Log4j | 2.7 | 2.13.3 |
Package | Old Version | New Version |
---|---|---|
Apache Tika | 1.22 | 1.24.1 |
Package | Platform |
---|---|
Spotify Annoy Java 0.2.5 | RESTful |
JavaPoet 1.9.0 | Embedded |
Package | Version | License |
---|---|---|
JavaPoet | 1.9.0 | Apache 2.0 |
Package | Old Version | New Version | New License |
---|---|---|---|
Apache Tomcat | 8.5.46 | 8.5.57 | |
Apache Commons Daemon | 1.2.1 | 1.2.2 | |
Reactive Streams | 1.0.2 | 1.0.3 | |
Basis Technology Rosette API Client Library for Java | 1.15.0 | 1.16.1 | |
RESTEasy | 3.8.1 | 3.13.0 | |
Jackson Annotations | 2.9.9 | 2.10.4 | |
Jackson Core | 2.9.9 | 2.10.4 | |
Jackson Databind | 2.9.9 | 2.10.4 | |
Jackson JAX-RS | 2.9.9 | 2.10.4 | |
Jackson JAXB Annotations | 2.9.9 | 2.10.4 | |
JBoss Jakarta Annotations API 1.3 Spec | 1.0.1 | 2.0.1 | EPL 2.0 & GPL 2 w/ CE |
JBoss Jakarta JAXB API 2.3 Spec | 1.0.1 | 2.0.0 | EDL 1.0/BSD 3-Clause |
JBoss Jakarta JAXRS API 2.1 Spec | 1.0.3 | 2.0.1 | EPL 2.0 & GPL 2 w/ CE |
Jakarta Activation | 1.1.1 | 1.2.1 | |
Checker Qual | 2.5.2 | 2.8.1 | |
Google Guava | 16.0.1 | 28.1-jre | |
Package | Version | License |
---|---|---|
Jakarta Bean Validation | 2.0.2 | Apache 2.0 |
Google Guava Failure Access | 1.0.1 | Apache 2.0 |
Google Guava Listenable Future | 9999.0 | Apache 2.0 |
Patch Release - 1.16.3
July 2020
Rosette Server On-Premise Only
The activeLearning parameter in the rex-factory-config.yaml file is now being read.
Patch Release - 1.16.2
July 2020
Rosette Server On-Premise Only
Dynamic custom profiles: Changes to a custom profile can now be loaded without restarting Rosette Enterprise. To dynamically change a custom profile, delete the profile directory from the disk and create a new one with the changes. The new directory can have the same name as the deleted profile directory.
Patch Release - 1.16.1
June 2020
Rosette - All Platforms
Address Similarity /address-similarity
Support for unfielded addresses: Addresses no longer have to be provided as separate field components.
{ "address1": "The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom", "address2": "The Book Club 100-108 Leonard St Shoreditch London EC2A 4RH, UK" }
Added field overrides: More overrides (matches) have been added for address fields, improving the match scores. Overrides explicitly map nicknames, cognates, and variants to improve matching accuracy.
Example: England and UK have been added as overrides
{ "address1": { "house": "Ffrwdgrech Industrial Estate", "road": "Ffrwdgrech Rd", "city": "Brecon", "country": "UK", "postcode": "LD3 8LA" }, "address2": { "house": "Ffrwdgrech Industrial Estate", "road": "Ffrwdgrech Rd", "city": "Brecon", "country": "England", "postcode": "LD3 8LA" } }
Previously: 0.86
Now: 0.95
Improved support for misfielded address components: We've improved scores by checking for matches in different fields to catch where address components were input into the wrong field.
Example: Washington and D.C. are matched even though they were put into different fields
{ "address1": { "houseNumber": "1600", "road": "Pennsylvania Ave N.W.", "city": "Washington", "state": "D.C.", "postcode": "20500" }, "address2": { "houseNumber": "1600", "road": "Pennsylvania Ave N.W.", "city": "D.C.", "state": "Washington", "postcode": "20500" } }
Previously: 0.72
Now: cross-field matching between city and state generates a higher match score of 0.94
Entity Extraction and Linking /entities
Hebrew improvements: Entity extraction has improved Hebrew normalization. The disambiguator can now identify prefixes removed from the entity's normalized form. Improvements are a result of Hebrew morphology enhancements.
Bug fix: A new line character in a regex (\n) will now also match carriage returns (\r) and a combination of both (\r\n).
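The same behavior can be approximated in your own regular expressions by widening every `\n` in the pattern source to an alternation over the three line-ending styles. The helper below is a sketch of that idea in Python's `re` syntax; it is not the product's regex engine, and it does not handle a pattern in which an escaped backslash happens to be followed by the letter n.

```python
import re


def widen_newlines(pattern: str) -> str:
    """Rewrite each \\n escape in a regex source to match \\r\\n, \\r, or \\n."""
    return pattern.replace(r"\n", r"(?:\r\n|\r|\n)")


if __name__ == "__main__":
    pattern = re.compile(widen_newlines(r"Name:\n(\w+)"))
    for text in ("Name:\nAlice", "Name:\rAlice", "Name:\r\nAlice"):
        match = pattern.search(text)
        print(repr(text), "->", match.group(1) if match else None)
```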
Morphological Analysis /morphology/{morphoFeature}
Hebrew improvements: Hebrew tokens that have prefixes but not stems are now tagged with the correct part of speech. Previously, they got the POS tag "unknown".
Example: “ה” from the string “ה70”
Previously: POS tag "unknown"
Now: POS tag "quantifier"
Bug fix: Minimally-qualified emoji are no longer split apart.
Example: The emoji for "man tipping hand" (<U+1F481, U+200D, U+2642>)
Previously: U+1F481 and <U+200D, U+2642> (2 tokens)
Now: <U+1F481, U+200D, U+2642> (1 token)
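The sequence in this example is three code points joined by a zero-width joiner (U+200D). The sketch below keeps any character that follows a ZWJ in the same unit as the character before it; it is a simplified illustration of why the sequence belongs in one token and is not the product's tokenizer.

```python
from typing import List

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def join_zwj_sequences(text: str) -> List[str]:
    """Group characters linked by a zero-width joiner into a single unit."""
    clusters: List[str] = []
    join_next = False
    for ch in text:
        if ch == ZWJ and clusters:
            clusters[-1] += ch
            join_next = True
        elif join_next:
            clusters[-1] += ch
            join_next = False
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    man_tipping_hand = "\U0001F481\u200d\u2642"
    print([f"U+{ord(ch):04X}" for ch in man_tipping_hand])
    print(len(join_zwj_sequences(man_tipping_hand)))
```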
Bug fix: Capitalized common nouns are no longer detected as verbs.
Example: The noun "Service" from the phrase "Price and Quality of Service"
Previously: POS tag VI (infinitive or imperative verb)
Now: POS tag PROP (proper noun)
Name Similarity /name-similarity
Improved Arabic-English and Arabic-Arabic matching: Many enhancements have improved the name similarity scores when matching Arabic names with Arabic or English names.
Token alignment: Improved token alignment between names in Arabic and English.
Example: Aung San Suu Kyi vs. أون سان سو تشي
Previously: 0.4544
Now: 0.7261
New statistical model: A new Arabic-English statistical model has been added.
Example: Debby Ryan vs. ديبي رايان
Previously: 0.6114
Now: 0.9202
Gender identification: We've added gender identification for Arabic names.
Example: Mario Savinova vs. ماريا سافينوفا
Previously: 0.9394
Now: مَارِيَا (Maria) gets detected as a female name so a gender correction penalty gets applied and the match score comes down to 0.6413
Language model: The weighting of tokens in Arabic names has been improved.
Example: Margaret Thatcher vs. مارجريت تاتشر
Previously: 0.6299
Now: 0.8182
Initials: We've added support for initials and initialisms in Arabic.
Example: J vs. جمال
Previously: 0.4106
Now: 0.8256
Japanese-English: Name similarity scores for organizations have been improved by expanding the list of organization name equivalents mapped between Japanese and English (also known as token overrides).
Example: 中国建設銀行 vs. China Construction Bank
Previously: 0.8701
Now: 1.0
Rosette Server On-Premise Only
Address Similarity /address-similarity
Windows support: To parse unfielded addresses, address similarity depends on the jpostal binding for the open source libpostal library. Though jpostal is not officially supported on Windows, our tests have shown it to function as expected. Please contact support if you discover any issues.
Entity Extraction and Linking /entities
More complete sample files: The sample files to build the SQLite connector described in the Custom Knowledge Base Connectors section now include all files required to build with Maven. The configuration to run the connector with Rosette Enterprise is now provided as well.
Bug fix: Confidence scores for entity linking now use the same scale, whether linking to Wikidata or a custom knowledge base. Previously, the confidence scores given for links to custom knowledge bases were much lower than those calculated for the Wikidata knowledge base.
Morphological Analysis /morphology/{morphoFeature}
Hebrew improvements: When guessHebrewPrefixes is true, unrecognized Hebrew tokens will now get analyses with and without potential prefixes. Previously, they would only get analyses with potential prefixes.
Example: Token: "ומפיפרנו"
Previously: 2 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
Now: 3 analyses:
hebrewPrefixes=[ו] lemma=מפיפרנו
hebrewPrefixes=[ו, מ] lemma=פיפרנו
hebrewPrefixes=[] lemma=ומפיפרנו
Open Source Changes
Package and Version | License |
---|---|
jpostal 1.0 | MIT |
Libpostal v1.1-alpha | MIT |
Apache Tomcat Embedded 8.5.2 | Apache 2.0 |
Package and Version | License |
---|---|
Prometheus Java Simpleclient 0.8.1 | Apache 2.0 |
NEW Release - 1.16.0
April 2020
Rosette - All Platforms
/supported-languages (all endpoints)
Licensed field: You can now see which languages you are licensed for by endpoint. All /supported-languages endpoints now return a licensed field for each language. The value is true if you have a license for the language, false if not.
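A client might use the licensed field to filter a supported-languages response, as in the sketch below. Only the licensed field is taken from the note above; the surrounding response shape (a "supportedLanguages" list of objects with a "language" key) is an assumption made for illustration and should be checked against an actual response.

```python
from typing import List


def licensed_languages(response: dict) -> List[str]:
    """Return the language codes whose entry has licensed set to true.

    The "supportedLanguages" and "language" keys are assumed, not documented here.
    """
    return [
        entry["language"]
        for entry in response.get("supportedLanguages", [])
        if entry.get("licensed") is True
    ]


if __name__ == "__main__":
    # Hypothetical response used only to exercise the helper.
    sample_response = {
        "supportedLanguages": [
            {"language": "eng", "licensed": True},
            {"language": "heb", "licensed": False},
            {"language": "swe", "licensed": True},
        ]
    }
    print(licensed_languages(sample_response))
```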
Entity Extraction and Linking /entities
Hebrew accuracy improvement: We re-trained the statistical model, adding an annotated finance dataset. The accuracy is improved for the finance genre and remains the same for the news genre.
New DNN model for Hebrew: We added a new deep neural network model for Hebrew. It performs better than the statistical model on both news and finance genres.
Hebrew normalization: For more accurate downstream processing, we have removed prefixes from normalized Hebrew output, except for the definite article.
Morphological Analysis /morphology/{morphoFeature}
Hebrew proper nouns: We added more proper nouns to the Hebrew lexicon.
German professions: We added more German professions to the German lexicon.
Additional emoji support: Emoji hair components are now lemmatized.
Spanish performance improvements: Spanish disambiguation is now faster.
Tokenization /tokens
Unicode 13.0 emojis: Unicode 13.0 emoji sequences are now tokenized.
Bug fix: We fixed a bug where low surrogates were stripped from the ends of tokens in Hebrew.
Known issue: Minimally-qualified emoji zero-width joiner (ZWJ) sequences are no longer recognized as emoji. We will fix this in the next release.
Rosette Server On-Premise Only
Offline Docker installation: Server shipments now include a helper script, download.sh, to download the components used in an offline Docker installation.
Entity Extraction and Linking /entities
Geo-coordinates regex: We added supplemental regex support for ISO 6709 geo-coordinates for all languages.
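For readers unfamiliar with the notation, the sketch below recognizes one common ISO 6709 string form: signed decimal degrees for latitude and longitude, terminated by a solidus. It is a simplified pattern written for illustration, it ignores the degrees-minutes-seconds and altitude variants, and it is not the regular expression shipped with the product.

```python
import re
from typing import List, Tuple

# Decimal-degree form only: ±DD(.D+)±DDD(.D+)/
ISO6709_DECIMAL = re.compile(
    r"(?P<lat>[+-]\d{2}(?:\.\d+)?)(?P<lon>[+-]\d{3}(?:\.\d+)?)/"
)


def find_coordinates(text: str) -> List[Tuple[float, float]]:
    """Return (latitude, longitude) pairs found in the text, range-checked."""
    results = []
    for match in ISO6709_DECIMAL.finditer(text):
        lat = float(match.group("lat"))
        lon = float(match.group("lon"))
        if abs(lat) <= 90 and abs(lon) <= 180:
            results.append((lat, lon))
    return results


if __name__ == "__main__":
    print(find_coordinates("Statue of Liberty: +40.6894-074.0447/"))
```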
Linking knowledge bases: You can now load multiple custom knowledge bases and prioritize which knowledge base to try linking with first. Edit the kbs parameter in the rex-factory-config.yaml file.
Redactor improvement: Redactor weights can now be configured for specific subsources used by processors. For example, the entity linker can link to multiple knowledge bases, each of which is considered a subsource. These subsources can have their own redactor weights to adjudicate linking conflicts among the subsources.
Custom profile with custom knowledge base: You can now define a custom knowledge base in a custom profile.
New license file: Linking to custom knowledge bases is now licensed separately. If you link to one or more custom knowledge bases, you will need to install a new license file. Linking to Wikidata does not require a new license file.
Name Similarity /name-similarity
Japanese stopwords: We’ve added the ability to handle Japanese stopwords that contain parentheses.
Tokenization /tokens
Bug fix: We fixed a bug where a multi-digit number containing a space that was adjacent to a symbol was tokenized to multiple tokens in some languages when fstTokenize was enabled.
Open Source Changes
Package and Version | License |
---|---|
Apache Commons Compress v1.9 | Apache 2.0 |
Apache Commons Net v3.1.0 | Apache 2.0 |
Apache FreeMarker v2.3.23 | Apache 2.0 |
Deep Learning for Java v1.0.0-beta4 | Apache 2.0 |
Google Flatbuffers v1.10.0 | Apache 2.0 |
Google GSON v2.8.5 | Apache 2.0 |
Joda Time v2.9.1 | Apache 2.0 |
Objenesis v2.1.0 | Apache 2.0 |
Oswego Concurrent v1.3.4 | Public Domain |
ThreeTen v1.3.3 | GPL 2 w/CE |
uk.com.robust-it.cloning library v1.9.3 | Apache 2.0 |
Package and Version |
---|
Basis Technology Rosette API Model 1.15.0 |
Patch Release - 1.15.1
February 2020
Rosette Platform Changes
Entity Extraction and Linking /entities
Bug fix: We fixed a bug in the Arabic statistical model and retrained the model.
Morphological Analysis /morphology/{morphoFeature}
Latvian support added: Latvian lemmatization is supported.
Latin-script tokens in Russian: Latin-script regions within Russian inputs are now tokenized and analyzed as English. Example input: “мой новый iPhone”:
Previously: “iPhone” has part of speech FRGN.
Now: “iPhone” has part of speech PROP.
Bug fix: When Rosette guessed German compounds they were sometimes lemmatized as verbs but tagged as nouns.
Name Similarity /name-similarity
Bug fix: There was a pairwise matching bug where Arabic name A vs Arabic name B produced a different result than Arabic name B vs Arabic name A.
Sentence Tagging /sentences
The fragment boundary detector now marks a boundary after any spaces following a fragment boundary delimiter.
Bug fix: There was a sentence break after every Windows newline (i.e. carriage return + line feed).
Bug fix: Multi-script Russian text would have a sentence break each time the script changed.
Bug fix: There were unexpected sentence breaks after some short lines which did not end in whitespace.
Bug fix: Sentence breaks were missing when the sentence break did not align with a token boundary.
Tokenization /tokens
Latvian support added: Latvian tokenization is supported.
Rosette Enterprise On-Premise Users Only
Sentence Tagging /sentences
Configurable fragment boundaries: The delimiters for the fragment boundary detector are now configurable. A delimiter is restricted to a single character. To configure, edit the value of fragmentBoundaryDelimiters in the rbl-factory-config.yaml file. The string should contain all values that should be recognized as a fragment boundary, including any of the default values you want to keep.
Bug fix: An underscore (U+005F) is no longer treated as a token separator in German when fstTokenize is enabled.
Bug fix: Tokens from multi-script Russian text sometimes had incorrect offsets if fstTokenize was enabled.
LABS Usage Tracking: Rosette Enterprise can now track Rosette calls by app-id, profileID, endpoint, and language. See the Rosette Enterprise User Guide for more information. Note that this feature is still in LABS and subject to change.
LABS Custom Endpoints: Rosette Enterprise now supports creation of custom endpoints that combine business logic, custom workflows, and Rosette endpoints into a single call. See the Rosette Enterprise User Guide for more information. Note that this feature is still in LABS and subject to change.
Name Similarity /name-similarity
The deep learning model for Katakana-Latin name matching has been disabled in this release. The setting enableSeq2SeqTokenScorer must be set to false, which is the default. It will be available again in a future release.
Open Source Changes
Package | Old Version | New Version |
---|---|---|
Basis Technology Annotated Data Model | 2.5.3 | 2.7 |
Package | Version | License |
---|---|---|
Findbugs JSR 305 | 3.0.2 | BSD-3-Clause |
Package |
---|
Apache Commons Compress v1.9 |
Apache Commons Net v3.1.0 |
Apache FreeMarker v2.3.23 |
Deep Learning for Java v1.0.0-beta4 |
Google Flatbuffers v1.10.0 |
Google GSON v2.8.5 |
Joda Time v2.9.1 |
Objenesis v2.1.0 |
Oswego Concurrent v1.3.4 |
ThreeTen v1.3.3 |
uk.com.robust-it.cloning library v1.9.3 |
Package | Version | License |
---|---|---|
Animal Sniffer | 1.14 | Apache 2.0 |
Apache Commons Codec | 1.13 and 1.9 | Apache 2.0 |
Apache Commons Collections | 3.2.2 | Apache 2.0 |
Apache Commons Configuration | 1.8 | Apache 2.0 |
Apache Commons Daemon | 1.2.1 | Apache 2.0 |
Apache Commons IO | 2.4 and 2.5 | Apache 2.0 |
Apache Commons Lang | 2.6 and 3.9 | Apache 2.0 |
Apache HTTPComponents | 4.4.4, 4.4.13, 4.5.2, and 4.5.10 | Apache 2.0 |
Apache Log4j | 2.12.1 | Apache 2.0 |
Apache Taglibs | 1.2.5 | Apache 2.0 |
Apache Tomcat | 8.5.46 | Apache 2.0 |
Apache Tomcat Embedded | 9.0.30 | Apache 2.0 |
ASM | 5.0.4 | BSD 3-clause |
AspectJ Weaver | 1.9.5 | EPL 1.0 |
Basis Technology Annotated Data Model | 2.5.2 | Apache 2.0 |
Basis Technology Rosette API Client Library for Java | 1.15.0 | Apache 2.0 |
Basis Technology Rosette Common Java API | 37.0.0 | Apache 2.0 |
Bouncy Castle | 1.60 | MIT |
Checker Qual | 2.5.2 | MIT |
Eclipse Compiler for Java | 4.6.3 | EPL 1.0 |
Eclipse Jersey | 1.19.1 | EPL 2.0 and GPL 2 w/CE |
Error Prone | 2.1.3 | Apache 2.0 |
FasterXML ClassMate | 1.5.1 | Apache 2.0 |
Findbugs JSR 305 | 3.0.1 and 3.0.2 | BSD 3-clause |
Francis Galiegue BTF | 1.2 | Apache 2.0 and LGPL 3.0 |
Francis Galiegue Msg Simple | 1.1 | Apache 2.0 and LGPL 3.0 |
Google Guava | 15.0 and 16.0.1 | Apache 2.0 |
HdrHistogram | 2.1.11 | Public Domain |
Hibernate Validator | 6.0.18 | Apache 2.0 |
J2ObjC Annotations | 1.1 | Apache 2.0 |
Jackson Annotations | 2.9.9 and 2.10.2 | Apache 2.0 |
Jackson Base Modules | 2.10.2 | Apache 2.0 |
Jackson Core | 2.9.9 and 2.10.2 | Apache 2.0 |
Jackson Core Utils | 1.6 | Apache 2.0 and LGPL 3.0 |
Jackson Databind | 2.9.9 and 2.10.2 | Apache 2.0 |
Jackson JAX-RS | 2.9.9 | Apache 2.0 |
Jackson JAXB Annotations | 2.9.9 | Apache 2.0 |
Jackson Java 8 Modules | 2.10.2 | Apache 2.0 |
Jakarta Activation | 1.1.1 | BSD 3-clause |
Jakarta Annotations | 1.3.5 | EPL 2.0 and GPL 2 w/CE |
Jakarta Bean Validation | 1.1.0 and 2.0.2 | Apache 2.0 |
Java API for RESTful Web Services | 1.1.1 | CDDL 1.1 and GPL 2 w/ CE |
Java Hamcrest | 1.1 | BSD 3-clause |
JBoss Annotations API | 1.3 Spec 1.0.1 | CDDL 1.1 and GPL 2 w/ CE |
JBoss JAXB API | 2.3 Spec 1.0.1 | CDDL 1.1 and GPL 2 w/ CE |
JBoss JAXRS API | 2.1 Spec 1.0.1 | CDDL 1.1 and GPL 2 w/ CE |
Jboss Logging | 3.3.2 and 3.4.1 | Apache 2.0 |
JSR-330: Dependency Injection for Java | 1 | Apache 2.0 |
JSON Patch | 1.9 | Apache 2.0 and LGPL 3.0 |
JSON Simple | 1.1.1 | Apache 2.0 |
JUnit | 4.10 | EPL 1.0 |
LatencyUtils | 2.0.3 | Public Domain or BSD 2-clause |
Logback | 1.2.3 | EPL 1.0 & LGPL 2.1 |
Micrometer Application Metrics | 1.3.2 | Apache 2.0 |
Netflix Archaius | 0.7.6 | Apache 2.0 |
Netflix Commons Util | 0.3.0 | Apache 2.0 |
Netflix Hystrix | 1.5.18 | Apache 2.0 |
Netflix Ribbon | 2.3.0 | Apache 2.0 |
Netflix Servo | 0.12.21 | Apache 2.0 |
Netflix Statistics | 0.1.1 | Apache 2.0 |
Netflix Zuul | 1.3.1 | Apache 2.0 |
Reactive Streams | 1.0.2 and 1.0.3 | Public Domain |
ReactiveX Java | 1.3.8 | Apache 2.0 |
ReactiveX Java Streams | 1.2.1 | Apache 2.0 |
ReactiveX Netty | 0.4.9 | Apache 2.0 |
RESTEasy | 3.8.1 | Apache 2.0 |
SLF4J | 1.7.5 and 1.7.30 | MIT |
SnakeYAML | 1.25 | Apache 2.0 |
Spring Boot | 2.2.4 | Apache 2.0 |
Spring Cloud Commons | 2.2.0 | Apache 2.0 |
Spring Cloud Netflix | 2.2.0 | Apache 2.0 |
Spring Framework | 5.2.3 | Apache 2.0 |
Spring Security | 5.2.1 | Apache 2.0 |
Stephen C. JCIP Annotations | 1.0.1 | Apache 2.0 |
NEW Release - 1.15.0
December 2019
Rosette Platform Changes
Address Similarity /address-similarity
Improved address matching: We've boosted the score of tokens that differed only by an inserted or missing space.
Previously: The similarity score for "Old Colony Avenue" vs. "OldColony Avenue" was 0.44.
Now: The similarity score for "Old Colony Avenue" vs. "OldColony Avenue" is 1.0.
Improved address matching: We've added support for additional Spanish and English address abbreviations, such as:
calle : cl
camino : cno, cmno
Entity Extraction and Linking /entities
Swedish: We have added entity extraction and linking support for Swedish (swe).
Morphological Analysis /morphology/{morphoFeature}
Fragment boundary detection: The fragment boundary detector, which is used to better process data in lists and tables, is now on. Previously it was off.
New imperative Arabic verbs: We've added the imperative forms of 2000 Arabic verbs to the Arabic lexicon.
Bug fix: In Russian, embedded spaces have been removed from the lemmas of numbers containing spaces. For example, the token "1 234" is now lemmatized to "1234" instead of "1 234".
Bug fix: In Japanese, if a middle dot appeared in the input text immediately before a newline character, it was missing from the tokenized output. Now the middle dot gets a token in the output.
Name Similarity /name-similarity
Improved Japanese-English matching for organizations: We've expanded the Japanese-English token overrides for organizations. Overrides explicitly map nicknames, cognates, and variants to improve matching accuracy.
Rosette Enterprise On-Premise Users Only
System Requirements: JDK 11 is now supported.
Custom Profiles
Rosette Enterprise can now support multiple profiles for users that have different processing needs. Each profile consists of a specific set of parameter and configuration settings (e.g., case-insensitive on/off, entity linking on/off) and different data domains (e.g., enabling specific regex and gazetteer files). Custom profiles do not apply to the /address-similarity, /name-similarity, /name-deduplication, and /name-translation endpoints.
Refer to the section Custom Profiles in the Enterprise User Guide for more details.
Entity Extraction and Linking /entities
Dynamic gazetteers: We’ve added a new endpoint, /entities/configuration/gazetteer/add, to add gazetteer entries to the /entities endpoint without having to stop and restart Rosette Enterprise.
Refer to the section Adding Dynamic Gazetteers in the Enterprise User Guide for more details.
Morphological Analysis /morphology/{morphoFeature}
Fragment boundary detection: The fragment boundary detector, which is used to better process data in lists and tables, is now on by default. Previously it was off by default.
To turn off fragment boundary detection:
Edit the file:
/config/rosapi/rbl-factory-config.yaml
Set
fragmentBoundaryDetection: false
Name Similarity /name-similarity
New model for Katakana-Latin name matching: We've added a new deep learning model that improves Katakana-Latin name matching. It is enabled by setting enableSeq2SeqTokenScorer to true. Note that when true, all Japanese names, not just those in Katakana, will be scored with this new model, which may result in lower accuracy for Japanese names in Hiragana or Kanji. Improvements are planned.
New parameter: We've added a new static parameter, katakanaTransliterationOnly, which defaults to false. Setting it to true will cause Japanese names written in Katakana to only be transliterated, not translated. This goes into the jpn_eng: section in the parameter_defs.yaml file.
ジョージ・ブッシュ translated to English as: George Bush
ジョージ・ブッシュ transliterated to English as: Jyo-ji Busshu
Refer to the section Name Similarity Configuration Files in the Enterprise User Guide for more details.
Open Source Changes
Apache Aries Blueprint API v1.0.1
Apache Aries Blueprint CM v1.0.8
Apache Aries Blueprint Core Compatibility v1.0.0
Apache Aries Blueprint Core v1.6.2
Apache Aries Proxy API v1.0.1
Apache Aries Proxy Impl v1.0.5
Apache Aries Util v1.0.0
Apache Aries Util v1.1.1
Apache Commons CLI 1.2
Apache Commons Lang 3.3.2
Apache Commons Weaver v1.1
Apache HTTPComponents v4.4.1
Apache Geronimo OSGI Factory Registry v1.1
Apache Geronimo Xbean v4.1
Apache Groovy v2.4.7
Apache Jakarta Regexp v1.4
Apache Service Mix Specs Activation API 1.1 v2.4.0
Apache Service Mix Specs JAXB API 2.2 v2.4.0
Apache Service Mix Specs SAAJ API 1.3 v2.4.0
args4j 2.32
ASM v5.0.2
ASM v5.0.3
ASM v5.0.4
Jackson Datatype Joda 2.4.5
Java API for RESTful Web Services v1.1.1
Java Mail API 1.4.4
Jettison v1.3.7
JRuby v9.1.6.0
Metro XML Information Set v1.2.13_1
Reflections v0.9.10_3
Swagger Annotations 1.5.7
Woodstox v3.0.1
Woodstox v3.1.0
Woodstox v3.1.4
Woodstox v4.0.0
Woodstox 4.0.5
Woodstox 5.0.1
Wordnik Swagger Annotations 1.5.3-M1
Apache XML Commons Resolver
Java Common Annotations
Package | Old Version | New Version | New License |
---|---|---|---|
Apache Aries Spifly Dynamic Bundle | 1.0.14 | 1.2.3 | No |
Apache Commons IO | 2.4 | 2.6 | No |
Apache Commons Logging | 1.2 | 1.1.3 | No |
Apache Commons Math | 3.5 | 3.6.1 | No |
Apache CXF | 3.1.4 | 3.3.4 | No |
Apache Felix Framework | 5.4.0 | 6.0.3 | No |
Apache Service Mix Specs JAXWS | API 2.2 v2.4.0 | API 2.3 v2.3.1 | |
Apache Tika | 1.1 | 1.22 | No |
Apache XMLSchema | 2.2.1 | 2.2.4 | No |
Codehaus Plexus Interpolation | 1.23 | 1.25 | No |
Eclipse Jetty | v9.2.11.v20150529 | 9.4.21.v20190926 | No |
fastutil | 6.6.1 | 8.3.0 | No |
Jackson Annotations | 2.9.8 | 2.10.0 | No |
Jackson Core | 2.9.8 | 2.10.0 | No |
Jackson Databind | 2.9.8 | 2.10.0 | No |
Jackson Dataformat CBOR | 2.9.8 | 2.10.0 | No |
Jackson Dataformat Smile | 2.9.8 | 2.10.0 | No |
Jackson Dataformat XML | 2.9.8 | 2.10.0 | No |
Jackson Dataformat YAML | 2.9.8 | 2.10.0 | No |
Jackson Datatype JSR310 | 2.9.8 | 2.10.0 | No |
Jackson JAX RS | 2.9.8 | 2.10.0 | No |
Jackson JAXB Annotations | 2.9.8 | 2.10.0 | No |
Java Architecture for XML Binding | 2.2.11_1 | 2.3.0 | No |
Java Common Annotations | 1.2 | 1.3.2 | No |
Joda Time | 2.2 | 2.9.1 | No |
SLF4J | 1.7.5 | 1.7.28 | No |
SnakeYAML | 1.23 | 1.25 | No |
Tensorflow | Added An Additional Version | 1.14.0 | No |
Tukaani XZ Java | 1.5 | 1.8 | No |
Woodstox | Multiple Versions Removed/Upgraded | 4.2 | No |
Package and Version | License |
---|---|
Animal Sniffer 1.9 | MIT |
Apache Commons Net 3.1.0 | Apache 2.0 |
Apache FreeMarker 2.3.23 | Apache 2.0 |
Deep Learning for Java 1.0.0-beta4 | Apache 2.0 |
Google Flatbuffers 1.10.0 | Apache 2.0 |
Google GSON 2.8.5 | Apache 2.0 |
Objenesis 2.1.0 | Apache 2.0 |
Oswego Concurrent 1.3.4 | Public Domain |
Swagger UI 3.24.3 | Apache 2.0 |
ThreeTen 1.3.3 | GPL 2 w/ CE |
uk.com.robust-it.cloning library 1.9.3 | Apache 2.0 |
Package and Version | License |
---|---|
Jakarta XML Binding 2.3.2 | EDL 1.0 & BSD 3-Clause & EPL 2.0 & GPL 2 w/ CE |
Package and Version | License |
---|---|
AssertJ 3.13.2 | Apache 2.0 |
Jakarta Activation 1.2.1 | BSD 3-Clause |
Jakarta RESTful Web Services 2.1.5 | EPL 2.0 & GPL 2 w/ CE |
Jakarta SOAP with Attachments (SAAJ) 1.4.0 | EDL 1.0 & BSD 3-Clause & EPL 2.0 & GPL 2 w/ CE |
Swagger UI 3.24.3 | Apache 2.0 |
Patch Release 1.14.4
November 2019
Rosette Enterprise On-Premise Users Only
Name Translation /name-translation
Fixed a bug where translating names from Arabic to English could cause a crash.
Patch Release - 1.14.3
October 2019
Rosette Platform Changes
Address Similarity /address-similarity (LABS)
New endpoint Address Similarity: We’ve added a new endpoint, /address-similarity, which performs a field by field comparison of two addresses and returns a similarity score between 0 and 1. Note that this endpoint is still in LABS and subject to change. Send us your feedback!
Entity Extraction and Linking /entities
Bug fix: Incorporates Chinese tokenization fix from /morphology.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: The Chinese tokenizer could create tokens at the end of the input string for temporal nouns without checking that the context was valid for temporal nouns.
Previously: 3分 (meaning “3 minutes” or “3 parts”) was detected as one token at the end of input but two tokens in most other contexts.
Now: 3分 is detected as two tokens at the end of input.
Name Similarity /name-similarity
Additional token overrides: We’ve added token overrides to the English-English override file and a new fullname override to the Japanese-English fullname override file.
Rosette Enterprise On-Premise Users Only
Bug fix: The Rosette install directory name can now contain spaces.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: When fstTokenize is enabled, Russian words hyphenated with a number are now tagged with the part of speech of the word without the number.
Previously: Аполлона-11 was tagged PROP, MISC, and NOUN.
Now: Аполлона-11 is tagged NOUN.
Package | Old Version | New Version | License |
---|---|---|---|
org.apache.commons:commons-collections4 | 4.0 | 4.4 | Apache 2.0 |
Patch Release - 1.14.1
September 2019
Rosette Enterprise On-Premise Users Only
Name Translation /name-translation
Bug fix: We fixed a bug where transliterations from Arabic to English did not have all expected vowels.
Previously: تهران was translated to Thran
Now: تهران is translated to Tehran
NEW Release - 1.14.0
August 2019
Note
Breaking Changes
Be sure to check below if you use the following features:
Entity linking with DBpedia
Language identification with Malaysian
Entity extraction and linking with Rosette Enterprise On-Premise
Rosette Platform Changes
Entity Extraction and Linking /entities
New Feature - PermID Linking (LABS): We have augmented QID (Wikidata) linked entities by also linking them to Thomson Reuters Permanent Identifiers (PermIDs) when PermIDs are included in the Wikidata entry. To access these identifiers, add {"options": {"includePermID": true}} to your call. For more information, see the Features and Functions. Note that this feature is still in LABS and subject to change. Please try it out and send us your feedback!
Wikidata update: We’ve refreshed and re-indexed the internal database for Wikidata linking. As a result, QIDs for some entities may have changed from previous versions.
DBpedia update: A QID can now be associated with more than one DBpedia subtype. A new option parameter, includeDBpediaTypes, has been added. A new response field, dbpediaTypes, is a list that can contain one or more string values. The original response field, dbpediaType, will continue to return a single string: the first value on the list. The original option and response field are now deprecated and will be removed in a future release. If you are using DBpedia values, please migrate to the new fields.
Bug fix: We fixed a bug in entity extraction and linking where it would crash when the modelType option is set to DNN, the linkEntities option is set to true, and the Entity Linker and DNN mentions overlap.
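As a rough illustration of the options described above, the following Python sketch builds a request body that turns on includePermID and includeDBpediaTypes, and reads DBpedia types from an entity in a way that tolerates both the new dbpediaTypes list and the deprecated dbpediaType string. The `content` field name and both helper names are assumptions made for this example and do not come from these notes; the Features and Functions documentation remains the reference for the actual request and response shapes.

```python
import json


def build_entities_request(text: str) -> str:
    """Return a JSON request body asking for PermIDs and DBpedia types.

    The "content" key is an assumption for this sketch; the two option
    names are the ones given in the release notes.
    """
    body = {
        "content": text,
        "options": {"includePermID": True, "includeDBpediaTypes": True},
    }
    return json.dumps(body, ensure_ascii=False)


def dbpedia_types(entity: dict) -> list:
    """Prefer the new list field and fall back to the deprecated string field."""
    types = entity.get("dbpediaTypes")
    if types:
        return list(types)
    legacy = entity.get("dbpediaType")
    return [legacy] if legacy else []
```

Reading the list first and falling back to the single string lets the same client code work before and after the deprecated field is removed.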
Language Identification /language
Malaysian update: Language identification now returns zsm for Standard Malay (Malaysian) when using both long string and short string algorithms. Previously, zsm was returned from the short string algorithm and msa (Malay macrolanguage) from the long string algorithm. This may require updates if you have code looking for the language code msa.
Increased short string language coverage: Language identification's standard and short-string algorithms now have identical language coverage (except for transliterated Arabic-script languages). Added support for Albanian, Bulgarian, Catalan, Croatian, Estonian, Icelandic, Kurdish (Arabic script), Kurdish (Latin script), Latvian, Lithuanian, Macedonian, Polish, Serbian (Cyrillic script), Serbian (Latin script), Slovak, Slovenian, Somali, Tagalog, Ukrainian, Urdu (Arabic script), Uzbek (Cyrillic script), Uzbek (Latin script), and Vietnamese to the short string language identification algorithm.
Multilingual short string support: For sufficiently short strings, language identification will now use its short-string algorithm in multilingual mode.
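For client code that previously checked for msa, one possible way to handle the Malaysian update is to accept both codes during the transition. The sketch below is only an illustration; the helper name is hypothetical and the set of codes is limited to the two mentioned in these notes.

```python
# Codes named in the release notes: "msa" (Malay macrolanguage, returned by the
# long string algorithm in earlier releases) and "zsm" (Standard Malay, now
# returned by both algorithms).
MALAY_CODES = frozenset({"msa", "zsm"})


def is_malay(language_code: str) -> bool:
    """Return True for either the legacy or the current Malay language code."""
    return language_code.strip().lower() in MALAY_CODES
```

Treating the two codes as equivalent keeps stored results from earlier releases comparable with new responses.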
Morphological Analysis /morphology/{morphoFeature}
Polish update: Some Polish words ending in -cku, -ska, or -sku are now (following a different convention) lemmatized to forms ending in -cki or -ski. They had previously been lemmatized to their surface forms.
Bug fix: The Japanese POS tag NE remained NE and was not converted correctly to UPT-16 (Universal POS tagset). NE is now correctly converted to the UPT-16 PART.
Bug fix: The French POS tag CONJQUE was converted to UPT-16 CONJ. It is now converted to the more appropriate SCONJ.
Bug fix: Chinese punctuation was tagged as GUESS when alternativeTokenization was disabled. Chinese punctuation is now tagged as PUNCT or EOS.
Name Similarity /name-similarity
Improved Chinese-Chinese name matching: We now give greater weight to names whose romanization is identical.
Previously: The similarity score for 李尚福 (li shangfu) and 李商复 (li shangfu) was 0.726.
Now: The similarity score for 李尚福 and 李商复 is 0.981.
Improved name matching and translation of Japanese by normalizing small Katakana characters into their full-sized counterparts.
Previously: Small Ka in 田中ヵ wasn’t being transliterated and the following readings were generated instead (tanaka ヵ||denchuu ヵ||tana ヵ||tianzhong ヵ||jeonjung ヵ||tanaka ヵ).
Now: Small Ka in 田中ヵ is normalized to 田中カ and the following readings are generated (tanaka ka||denchuu ka||tana ka||tianzhong ka||jeonjung ka||tanaka ka).
Improved name matching and translation in Chinese by normalizing Extension A characters.
Previously: 㨗 in 㨗報 was ignored and the following readings were generated (bao||mitsugi||cubbon).
Now: 㨗 in 㨗報 maps properly to its variant and the following readings are generated (jie bao||shou mitsugi||cubbon).
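The small Katakana normalization described above can be pictured as a simple character mapping applied before readings are generated. The Python sketch below is a minimal illustration, not the product's implementation: only the small Ka mapping is taken from these notes, the small Ke entry is an assumption added by analogy, and the function name is hypothetical.

```python
# Code point mapping from small Katakana to full-sized Katakana.
SMALL_TO_FULL_KATAKANA = {
    0x30F5: 0x30AB,  # small ka -> ka (the case described in the notes)
    0x30F6: 0x30B1,  # small ke -> ke (assumed by analogy, not stated in the notes)
}


def normalize_small_katakana(name: str) -> str:
    """Replace the mapped small Katakana characters with full-sized ones."""
    return name.translate(SMALL_TO_FULL_KATAKANA)
```

Applied to the example above, the name ending in small Ka becomes the same name ending in full-sized Ka, which can then be transliterated as "ka".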
Semantic Vectors /semantics/vector
Token frequency used for document vectors in all languages: For all languages, document vectors are now calculated using token frequency information. This information gives more weight to less common words, which are more significant to the overall meaning of the document. Previously, only English vectors were weighted this way.
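These notes do not give the weighting formula, so the following Python sketch shows only one simple way that frequency information can give rarer tokens more influence: a weighted average of token vectors with inverse-frequency weights. The function name, the input dictionaries, and the 1/frequency weight are all assumptions chosen for illustration.

```python
def document_vector(tokens, embeddings, frequency):
    """Average token vectors, weighting each token by 1 / frequency.

    tokens     -- list of token strings in the document
    embeddings -- dict mapping token -> list of floats (hypothetical input)
    frequency  -- dict mapping token -> relative corpus frequency (hypothetical)
    """
    if not embeddings:
        return []
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # skip out-of-vocabulary tokens
        weight = 1.0 / frequency.get(token, 1.0)  # rarer tokens weigh more
        for i, value in enumerate(vector):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]
```

With this weighting, a very common word such as "the" contributes far less to the document vector than a rare content word.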
Rosette Enterprise On-Premise Users Only
The default configuration of the Entity Extraction and Linking endpoint has fewer options enabled to optimize performance. The endpoint is now aligned with the configuration of the REX-JE SDK. Rosette Cloud has a different configuration, providing a fully-functional demonstration environment. For more details on how to set the options back to the previous settings, see the section “Entity Extraction and Linking Default Configuration” in the Rosette Enterprise User Guide.
New default settings:
Entity linking is disabled (linkEntities: false)
No supplemental regular expressions are loaded (supplementalRegularExpressionPaths null)
Pronominal resolution is not enabled (resolvePronouns: false)
caseSensitivity: caseSensitive
Package | Old version | New version | License |
---|---|---|---|
2.5.2 | 2.5.3 | Apache 2.0 | |
37.0.0 | 37.0.1 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
2.9.6 | 2.9.8 | Apache 2.0 | |
1.18 | 1.23 | Apache 2.0 |
Package | Version |
---|---|
Checker Qual | 2.5.2 |
Error Prone | 2.1.3 |
J2ObjC | 1.1 |
JSR 305: Annotations for Software Defect Detection | 3.0.2 |
Mojohaus Animal Sniffer Annotations | 1.14 |
Patch Release - 1.13.2
June 2019
Rosette Platform Changes
Entity Extraction and Linking /entities
Japanese Update: Japanese data regex now includes the new era 令和 (Reiwa), which started May 1, 2019.
Bug fix: We've fixed additional cases where Japanese characters were wrongly normalized into their simplified Chinese equivalents in entity linking. This issue was also addressed in the 1.13.0 release.
Bug fix: We've added missing parameter files for linking models, which may improve accuracy.
Morphological Analysis /morphology/{morphoFeature}
Updated lexicons: We've made various small updates to the lexicons of German, English, and Swedish.
Improved Arabic analyzer: The Arabic analyzer will attempt to replace leading hamzated alefs with plain alefs for unrecognized tokens, to see if the version with the plain alef is recognized.
Improved Hebrew normalization: U+2019 RIGHT SINGLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK are now normalized to U+0027 APOSTROPHE and U+0022 QUOTATION MARK in Hebrew, to support their use as geresh and gershayim.
Bug fix: We've fixed a bug which occurred when Rosette encountered a Hebrew token consisting of multiple prefixes without a base. In this case, Rosette returned an incomplete surface form, i.e., only containing the first prefix. For example, the surface form of the token “מה” should be “מה” but was being returned as “מ”. With the bug fix the surface form returned now is “מה”.
Bug fix: We’ve fixed a problem where closing parentheses, brackets, and braces that follow URLs were merged into the URLs.
Bug fix: The disambiguator is now more likely to select analyses for @mentions, email addresses, hashtags, and URLs over other analyses.
Bug fix: When the Hebrew tokenizer encounters a character not used in Hebrew immediately following a character used in Hebrew, it now starts a new token. Formerly, it would delete that character and any following characters up to the next token separator (e.g. whitespace).
Bug fix: Hebrew tokens consisting of multiple prefixes without a base are now tagged with the part of speech “unknown”, to match single-prefix tokens. Previously, multi-prefix tokens were not tagged with any part of speech.
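The Hebrew quotation mark normalization listed above amounts to a two-character substitution. The Python sketch below illustrates that mapping outside the product; the function name is hypothetical, and only the two code point mappings come from these notes.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK -> U+0027 APOSTROPHE (used as geresh)
# U+201D RIGHT DOUBLE QUOTATION MARK -> U+0022 QUOTATION MARK (used as gershayim)
HEBREW_QUOTE_MAP = str.maketrans({"\u2019": "\u0027", "\u201d": "\u0022"})


def normalize_hebrew_quotes(text: str) -> str:
    """Map the two typographic quotation marks to their ASCII counterparts."""
    return text.translate(HEBREW_QUOTE_MAP)
```

Normalizing in this way means an acronym written with a typographic double quote and one written with an ASCII quote end up as the same string.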
Rosette Enterprise On-Premise Users Only
Morphological Analysis /morphology/{morphoFeature}
Bug fix: We’ve fixed a problem where, when fstTokenize is enabled, Russian hyphenated words that end in numbers, like “Аполлона-11”, were not tagged as DIG. They are now tagged with the same parts of speech they had before Rosette API 1.12.2.
Patch Release - 1.13.1
May 2019
Rosette Enterprise On-Premise Users Only
To minimize the size of your Rosette Enterprise installation, the entity extraction (rex-root) and semantic similarity (tvec-root) components are now shipped by language. The names of the language-specific files contain the three-letter ISO 639 language code, indicating which language each file supports.
Entity extraction is shipped with one base file and one or more language-specific files.
Example:
rex-root-<version>.tar.gz
rex-root-<version>-eng.tar.gz for English language files
rex-root-<version>-deu.tar.gz for German language files
Semantic Similarity is shipped with one file per language.
Example:
tvec-root-<version>-eng.tar.gz for English language files
tvec-root-<version>-deu.tar.gz for German language files
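The naming pattern in the examples above can be expressed as a small helper, for instance to check that a shipment contains the files expected for a set of licensed languages. This Python sketch assumes the pattern holds for every language code; the function names and the version string used in any call are hypothetical.

```python
def rex_root_archives(version: str, languages: list) -> list:
    """Base entity extraction archive plus one archive per language code."""
    names = [f"rex-root-{version}.tar.gz"]
    names += [f"rex-root-{version}-{lang}.tar.gz" for lang in languages]
    return names


def tvec_root_archives(version: str, languages: list) -> list:
    """One semantic similarity archive per language code (no base archive)."""
    return [f"tvec-root-{version}-{lang}.tar.gz" for lang in languages]
```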
The Rosette Enterprise installer has been updated and will automatically install all components as required, based on your license.
NEW Release - 1.13.0
April 2019
Rosette Platform Changes
Categorization /categories
Expanded English support: We can now process English with a case-insensitive model by providing (uen) as the input language.
Entity Extraction and Linking /entities
Bug fix: We’ve fixed a problem in Japanese where entity names that include the middle dot were not being handled correctly. Entity names that include a middle dot are no longer split into two entities.
Bug fix: We’ve fixed a problem in Japanese entity linking where, in some cases, Japanese characters were being replaced with Chinese characters.
Bug fix: We’ve fixed a problem where, in some cases, entities were mislabeled when the includeDbPediaTypes option was not flagged.
Morphological Analysis /morphology/{morphoFeature}
New Hebrew disambiguator: We’ve added a new default disambiguator (perceptron) for Hebrew. Use the disambiguatorType option to enable another disambiguator (DNN or dictionary). For example, to return to the previous default, add {"options": {"disambiguatorType": "DNN"}} to your call.
Bug fix: We’ve fixed an error where some Chinese tokens included spurious white space characters.
Bug fix: We’ve fixed a problem where extremely long tokens (thousands of characters) would slow down the tokenizer.
Bug fix: We’ve fixed a problem where Polish tokens that can appear in multiword expressions were lemmatized to full expressions, even when the full expression wasn’t in the input.
Previously: dzień lemmatized to dzień_dobry.
Now: dzień is lemmatized to dzień.
Bug fix: We've fixed a problem where the non-final components of Russian compound words with more than one hyphen were not lemmatized correctly.
Name Similarity /name-similarity
Hungarian: We’ve added a Hungarian frequency language model for LOCATION entity type and retrained the language model for PERSON entity type, improving Hungarian name matching.
PERSON:
Previously: The similarity score for Domonkos Gyula Tiborné and Domonkos Gy. Lászlóné was 0.697.
Now: The similarity score for Domonkos Gyula Tiborné and Domonkos Gy. Lászlóné is 0.775.
LOCATION:
Previously: The similarity score for Szentmihály puszta and Szentmihály pihenő was 0.576.
Now: The similarity score for Szentmihály puszta and Szentmihály pihenő is 0.696.
Sentiment Analysis /sentiment
Expanded English support: We can now process English with a case-insensitive model by providing (uen) as the input language. Be aware that when using DNN for the modelType, the accuracy of the results may be lower than when analyzing standard, sentence-cased English input.
Rosette Enterprise On-Premise Users Only
Sentiment Analysis /sentiment
Document-only analysis: Entity-level sentiment analysis can now be turned off, allowing document-level sentiment analysis only. See the Rosette Enterprise User Guide for more information.
Expanded Language support: Users of the Rosette Text Classification Field Training Kit can now train custom sentiment analysis models in any language supported by the tokenization endpoint (for document-level analysis) or the entity extraction and linking endpoint (for entity-level analysis). For more information on the training and configuration procedure, see the Rosette Field Training Kit documentation.
Open Source Changes
Version Changes
Apache Lucene Core updated from v6.6.0_1 to v7.6.0_1 (Apache 2.0 license)
Patch Release 1.12.2
February 25, 2019
Entity Extraction and Linking /entities
Bug fix: We’ve fixed a bug where, under some circumstances, the head mention for an extracted entity was not correctly identified.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: We've fixed a bug where the Persian lemmatizer did not add lemmas to the first analyses of many tokens, especially verbs.
Bug fix: We've fixed a bug where after some sequences of 4096 characters, containing mostly white space and at most one token, any following tokens had incorrect original offsets.
Semantic Similarity /semantics/{semanticsFeature}
Bug fix: We've fixed a bug where capitalized tokens could return an out-of-vocabulary token-level embedding instead of the embedding consistent with their lowercase form.
Bug fix: Previously, the Semantic Vectors endpoint did not always return vectors of a consistent length. Now, returned vectors will always be normalized to have a length of one.
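One practical consequence of unit-length vectors is that cosine similarity between two of them reduces to a plain dot product. The Python sketch below illustrates this, assuming "length" refers to the Euclidean (L2) norm; the notes do not say which norm is used, and both function names are hypothetical.

```python
import math


def l2_normalize(vector: list) -> list:
    """Scale a vector so that its Euclidean length is 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity_unit(a: list, b: list) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Client code that normalized returned vectors itself can keep doing so safely, since normalizing a unit-length vector leaves it unchanged up to rounding.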
Rosette Enterprise On-Premise Users Only
Morphological Analysis /morphology/{morphoFeature}
Bug fix: We've fixed a bug where, when the fstTokenize option was enabled, the lemmas of hyphenated Russian compound words only had the final piece lemmatized. Now both pieces are lemmatized.
Previously: человека-волка was lemmatized to человека-волк
Now: человека-волка is lemmatized to человек-волк
Categorization /categories
New factory configuration option: We’ve added a new factory configuration option, maxResults. This option can be used to cap the number of results returned. By default, all results exceeding the score and confidence thresholds (if set) will be returned.
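The interaction between the thresholds and the cap can be pictured with the Python sketch below. It is a conceptual illustration only, not the product's code: the result field names `score` and `confidence`, the ordering by score, and the function name are assumptions made for this example.

```python
def select_categories(results, score_threshold=None,
                      confidence_threshold=None, max_results=None):
    """Keep results exceeding any thresholds that are set, then apply the cap.

    results -- list of dicts with "score" and "confidence" keys (hypothetical shape)
    """
    kept = [
        r for r in results
        if (score_threshold is None or r["score"] > score_threshold)
        and (confidence_threshold is None or r["confidence"] > confidence_threshold)
    ]
    # Assumption: the highest-scoring results are the ones retained under the cap.
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept if max_results is None else kept[:max_results]
```

With no cap set, every result that passes the thresholds is returned, which matches the default behavior described above.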
Open Source Changes
Deleted from Rosette Enterprise Embedded
Apache Ant v1.5
Apache Log4j v2.7
Checker Qual v2.5.2
Error Prone v2.1.3
J2ObjC v1.1
JSON in Java v20141113
JSR 305: Annotations for Software Defect Detection v3.0.2
Mojohaus Animal Sniffer Annotations v1.14
Deleted from Rosette Enterprise Restful
JSON in Java v20141113
NEW Release - 1.12.1
January 31, 2019
Semantic Similarity /semantics/{semanticsFeature} (LABS)
Note that the Semantics Similarity features are still in LABS and subject to change. Send us your feedback!
New endpoint Similar Terms: We've added a new endpoint, /semantics/similar, which uses text vectors to generate multilingual related terms with numerical similarity scores for any input word(s) in Arabic, English, Chinese, German, Japanese, North or South Korean, Russian, or Spanish. For more information, see the Features and Functions.
Input Term: spy returns
Spanish: {"term":"espía","similarity":0.61295485}, {"term":"cia","similarity":0.46201307}, {"term":"desertor","similarity":0.42849663}, {"term":"cómplice","similarity":0.36646274}, {"term":"subrepticiamente","similarity":0.36629659}
German: {"term":"Deckname","similarity":0.51391315}, {"term":"GRU","similarity":0.50809389}, {"term":"Spion","similarity":0.50051737}, {"term":"KGB","similarity":0.49981388}, {"term":"Informant","similarity":0.48774603}
Japanese: {"term":"スパイ","similarity":0.5544399}, {"term":"諜報","similarity":0.46903181}, {"term":"MI6","similarity":0.46344957}, {"term":"殺し屋","similarity":0.41098994}, {"term":"正体","similarity":0.40109193}
Semantic Vectors: The /text-embedding endpoint has been renamed to /semantics/vector. While the /text-embedding endpoint will remain accessible through April, we encourage you to migrate as soon as possible to avoid missing any important updates.
Entity Extraction and Linking /entities
Improved linking confidence: We've updated the linking confidence calculation and thresholds to improve accuracy.
Supported languages: We've removed xxx from the list of languages returned when using /entities/supported-languages.
Bug fix: We’ve fixed a bug where the salience score was not always returned for entities with pronominal mentions, when requested.
Bug fix: We've fixed a bug where some later entity mentions that were chained to the first mention of a given entity were not always properly returned.
Bug fix: We've fixed a bug where sometimes a null pointer exception was returned when resolving pronouns.
Morphological Analysis /morphology/{morphoFeature}
Bug fix: We’ve fixed a bug where some English words were automatically getting tagged as proper nouns when capitalized.
Bug fix: We’ve fixed a bug in English where “people” was not being properly lemmatized to “person”. It now has the lemma candidate “person” when appropriate.
Bug fix: We’ve fixed a bug where ordinal numbers and comparative adjectives in English like “second” and “lower” were analyzed as verbs.
Name Similarity /name-similarity
Hungarian: We’ve added support for multi-letter initials in Hungarian, improving Hungarian name matching.
Previously: The similarity score for Kovács Cs. István and Kovács Csaba István was 0.68.
Now: The similarity score for Kovács Cs. István and Kovács Csaba István is 0.91.
Japanese: We’ve improved the accuracy of matching between katakana and kanji versions of Japanese organization names.
Previously: The similarity score for ドクリツギョウセイホウジンニホンガクジュツシンコウカイ and 独立行政法人日本学術振興会 was 0.41.
Now: The similarity score for ドクリツギョウセイホウジンニホンガクジュツシンコウカイ and 独立行政法人日本学術振興会 is 0.72.
Chinese: We’ve improved the accuracy of matching Chinese organization names.
Previously: The similarity score for 松下能源(上海)有限公司 and Panasonic Energy (Shanghai) Co., Ltd was 0.58.
Now: The similarity score for 松下能源(上海)有限公司 and Panasonic Energy (Shanghai) Co., Ltd is 0.82.
Rosette Enterprise On-Premise Users Only
We've decreased warm-up time by only loading licensed languages when Rosette is set to pre-warm.
We’ve added the ability for on-premises users to have more control over configuration options when using custom-trained models.
We've reduced the minimum memory requirements to 16GB of RAM for the entity extraction and linking, sentiment analysis, and topic extraction endpoints.
We’ve improved the efficiency of the initial load time for the entity extraction and linking endpoint.
All client bindings have been updated to support the /semantics/vector and /semantics/similar endpoints.
The examples have been modified in the Python bindings to demonstrate how to set options.
Open Source Changes
New Addition
Version Changes
Release 1.12.0 and older
NEW Release - 1.12.0
December 10, 2018
Entity Extraction and Linking /entities
Korean: We've improved the accuracy of Korean extraction, largely through better handling of Josa (postpositions) and compound words.
Entity Linking: We've added support for entity linking to Wikipedia for both the top-level types (PERSON, LOCATION, ORGANIZATION, etc.) and the more than 700 DBpedia types in the remaining 16 languages supported by entity extraction. This is in addition to the languages currently supported by entity linking: Chinese, English, Japanese, and Spanish.
Morphological Analysis /morphology/{morphoFeature}
Hebrew disambiguation: We’ve improved analysis in Hebrew by adding disambiguation, a mechanism for more accurately choosing which of several candidate analyses is provided in the response. For Hebrew only, we've added an option, disambiguatorType, to select which disambiguator is used. The values are DNN for the TensorFlow-based deep neural network model and dictionary for the dictionary-based model. The default is dictionary. To enable the DNN disambiguator, add {"options": {"disambiguatorType": "DNN"}} to your call.
Persian lemmatization: We’ve added lemmatization support to Persian.
Name Similarity /name-similarity
Chinese organization names: We've improved the accuracy of matching between Chinese and English organization names.
Previously: The similarity score for 索尔维-恒昌(张家港)精细化工有限公司 and Solvay-Hengchang (Zhangjiagang) Fine Chemicals Co., Ltd was 0.6929.
Now: The similarity score for 索尔维-恒昌(张家港)精细化工有限公司 and Solvay-Hengchang (Zhangjiagang) Fine Chemicals Co., Ltd is 0.8225.
Rosette Enterprise On-Premise Users Only
The minimum system requirements for running Rosette Enterprise have changed for some use cases. This is a result of providing entity linking for 16 additional languages in this release. We now support entity linking for all 20 languages supported by entity extraction.
The entity extraction and linking, sentiment analysis, and topic extraction endpoints require significant memory allocation. If using these endpoints, the new minimum memory requirements are:
32GB RAM
64GB of disk space (more may be needed for growing logs)
For all other endpoints, the minimum memory requirements are:
16GB RAM
35GB of disk space (more may be needed for growing logs)
Rosette Enterprise is now available as a Docker container. The images are available on Docker Hub. A Basis shipment, containing a license file and a docker-compose file customized to your licensed endpoints, is still required.
Patch Release 1.11.3
October 29, 2018
Morphological Analysis /morphology/{morphoFeature}
Chinese: We’ve added the word “百度” to the Chinese lexicon. This only has an effect when modelType is set to default, which is its default value.
Previous: Two tokens: “百”, “度”
Now: One token: “百度”
German: The part of speech of the acronyms “MAN” and “MIT” is now NOUN in German, instead of falling back to the parts of speech of the unrelated words “man” and “mit”.
Previous: MAN/INDPRO, MIT/PREP, MIT/VPREF, MIT/ADV
Now: MAN/NOUN, MIT/NOUN
Spanish: We’ve improved Spanish part-of-speech tag and lemma disambiguation.
Previous: 91.646% POS accuracy and 90.243% lemma accuracy using the IULA Spanish LSP Treebank
Now: 91.742% POS accuracy and 90.344% lemma accuracy
Name Similarity /name-similarity
New Model: We’ve added a new model for increased accuracy when matching Hungarian names to other Hungarian names.
Previous: Pavlovitch Bryulov vs. Pavlovics Brjullov - score: 0.84
Now: Pavlovitch Bryulov vs. Pavlovics Brjullov - score: 0.95
Bug Fix: Fixed a bug involving duplicate readings produced when transliterating Chinese names.
Previous: Wang Xing vs. 王行 - score: 0.69
Now: Wang Xing vs. 王行 - score: 0.99
Patch Release 1.11.2
September 24, 2018
Entity Extraction and Linking /entities
Improved Chinese accuracy: We've replaced the underlying tokenizer to improve accuracy in Chinese.
Improved Hungarian accuracy: We've updated our pattern match extractors in Hungarian. This improves accuracy for the MONEY and DATE types.
Bug fix: We've fixed a bug where entity mentions were being miscounted if the calculateSalience option was set to true.
Morphological Analysis /morphology/{morphoFeature}
Improved Dutch disambiguation: We've improved Dutch part-of-speech disambiguation.
Improved Japanese and Chinese lemma support: In the last release, most Japanese and Chinese tokens did not have lemmas when modelType was set to default. Now, such tokens have lemmas equivalent to their surface forms.
NEW Release - 1.11.0
August 27, 2018
Rosette Enterprise On-Premise Users Only
New per-endpoint licensing: Endpoints are now activated directly from your installed license. The endpoints.yaml file has been removed from the installation.
New Enterprise User guide: We’ve added a user guide (rosette-enterprise-user-guide-1.11.0.pdf) that provides new content and replaces the files rosette-api-on-premises-install-guide-1.11.0.txt, overview.md, and Rosette_API_Embedded_User_Guide.pdf.
Simplified installation for macOS and Linux: The installation for both RESTful and embedded Java has been simplified for macOS and Linux. Installation for Windows has not changed, but detailed installation notes are now included as part of the new Enterprise User Guide.
Name Changes: We are continuing to consolidate and simplify our branding. Rosette API On-Premise is now Rosette Enterprise. We’ve made changes to the documentation and license names to reflect this.
Rosette Platform Changes
New supported languages sub-endpoints: For all endpoints (excluding Name Similarity, Name Translation, and Name Deduplication), Rosette now provides a GET /rest/v1/<endpoint>/supported-languages method that returns that endpoint's supported languages and scripts. See the Features and Functions or the Interactive Docs for more information.
Updated bindings: We've updated our CSharp and Java bindings. Be sure to get the latest version (1.11.0) to take advantage of all the new features and improvements!
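The supported-languages path above can be assembled as in the following Python sketch. The helper name, the base URL https://api.example.com, and the path segment name-deduplication are placeholders assumed for illustration; only the /rest/v1/<endpoint>/supported-languages pattern and the list of excluded endpoints come from the note.

```python
from urllib.parse import quote

# Endpoints the note excludes from the sub-endpoint.
# "name-deduplication" is an assumed path segment for Name Deduplication.
EXCLUDED = {"name-similarity", "name-translation", "name-deduplication"}


def supported_languages_url(base_url, endpoint):
    """Build the GET URL for an endpoint's supported-languages method."""
    endpoint = endpoint.strip("/")
    if endpoint in EXCLUDED:
        raise ValueError(f"{endpoint} has no supported-languages sub-endpoint")
    # quote() leaves "/" intact, so nested paths such as
    # "syntax/dependencies" are preserved.
    return f"{base_url.rstrip('/')}/rest/v1/{quote(endpoint)}/supported-languages"


print(supported_languages_url("https://api.example.com", "entities"))
```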
Name Deduplication
New language: Rosette now supports name deduplication of Hungarian names.
Categorization
Multilabel categorization: Rosette can now return multiple category labels per document. To return only a single category label per document, add {"options": {"singleLabel": true}} to your call. For more information, see the Features and Functions.
Text Embedding
New languages: The text embedding endpoint now supports Russian, North Korean, South Korean, and Arabic.
Individual token embeddings: We can now return embeddings for individual input tokens. To enable per-token embeddings, add {"options": {"perToken": true}} to your call.
Response modifications: We've made changes to the text embeddings endpoint's response structure. Document-level embeddings now have their own dedicated slot, embeddings, and will no longer appear in documentMetadata. Please note that this is a breaking change. Contact Rosette support for more information.
Entity Extraction and Linking
New feature - DBpedia Types (LABS): We've added over 700 new entity types to the Entity Extraction and Linking endpoint, drawn from the DBpedia ontology. To access these entity types, add {"options": {"includeDBpediaType": true}} to your call. You'll notice more than 10 additional macro types in the type field as well as the all-new DBpediaType field. For more information, see the Features and Functions. Note that this feature is still in LABS and subject to change. Send us your thoughts!
Better accuracy: We've improved the recall of Rosette's entity linking across all supported languages.
New language: Entity extraction now supports Hungarian.
Known issue: MONEY, PHONE NUMBER, and URL types are not extracting properly in Hungarian. This will be fixed in the September 2018 patch release.
Language Identification
Support for North and South Korean added: Rosette can now identify North Korean (qkp) and South Korean (qkr) dialects. To enable the dialects, add {"options": {"koreanDialects": true}} to your call.
Morphological Analysis
New algorithm for Chinese and Japanese: We've added a new algorithm for Chinese and Japanese morphological analysis. Prior to version 1.11.0, the default algorithm was perceptron. To return to the old model, add {"options": {"modelType": "perceptron"}} to the body of your call.
Norwegian lemmatization: We’ve expanded the lemma dictionaries for Norwegian, both Bokmål and Nynorsk.
Improved English and Spanish disambiguation: We've improved the accuracy of lemmatization and part-of-speech tagging in both English and Spanish.
Bug fix: We've improved handling of formatting characters in German.
Bug fix: We’ve fixed a bug where the Hebrew POS tag wPrefix was not converted to UPT-16.
Tokenization
New algorithm for Chinese and Japanese: We've added a new algorithm for Chinese and Japanese tokenization. Prior to version 1.11.0, the default algorithm was perceptron. To return to the old model, add {"options": {"modelType": "perceptron"}} to the body of your call.
Bug fix: We’ve fixed a bug in Catalan, in which Rosette did not tokenize after an apostrophe in cases where the apostrophe marks a token boundary.
Bug fix: We've fixed a bug in Japanese, in which Rosette did not recognize 々 as a Japanese character, so it was considered its own token.
Name Similarity
New language: Name similarity now supports matching between Hungarian and English names.
Improved language-of-origin detection: We've improved the detection of language-of-origin of Japanese names written in Katakana.
Patch Release 1.10.2
June 25, 2018
Name Similarity
Improved Arabic script segmentation: We've improved segmentation of Persian (Dari and Farsi), Pushto, and Urdu names written in Arabic script.
Entity Extraction and Linking
New Japanese tokenizer: We've replaced the underlying tokenizer to improve accuracy for Japanese.
Morphological Analysis
Bug fix: We've fixed a bug where some components of compound German words were incorrect when the surface form of the component could be either a noun or a verb.
Bug fix: We've fixed a bug where the Hebrew parts of speech tags did not use UPT-16, the POS tag set used by the other languages.
Bug fix: We've fixed a bug where returned email addresses and URLs could contain control or whitespace characters.
Bug fix: We've fixed a bug where returned Hebrew tokens could contain control characters or nothing but default ignorable characters.
Bug fix: We've fixed a bug where Chinese, Japanese, and Thai tokens could contain control characters.
Tokenization
Bug fix: We've fixed a bug where the default Japanese tokenizer truncated some katakana tokens when they appeared after non-katakana tokens.
Name Deduplication
Name Deduplication input limit: Each call now has a limit of 1000 names per list.
Patch Release 1.10.1
May 30, 2018
Sentiment Analysis
Improved Entity Level Sentiment Analysis: We've improved our calculation of entity level sentiment to more accurately consider the context around each mention. Please note, this update may cause results to change.
Syntactic Dependencies
Bug fix: We've fixed a bug where the initial token dependency for every sentence (other than the first) was omitted from the list of results.
Entity Extraction and Linking
Improved Hebrew Entity Extraction: We've improved Hebrew entity extraction by removing superfluous prefixes from extracted entities.
Improved Confidence Scores: We've improved statistical model confidence scores to provide a more effective tradeoff between precision and recall. Please note, this update may cause results to change. If you have set a threshold based on entity confidence scores, please evaluate to ensure optimal performance.
Improved Entity Normalization: Social media characters such as "@" and "#" are removed from a Mentions normalized string. Offsets to the original string data field remain the same.
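If you apply a threshold to entity confidence scores, which the Improved Confidence Scores note above recommends re-evaluating, a small filter such as this Python sketch keeps the cutoff in one place. It assumes each entity is a dict with an optional "confidence" key; the function name, the "mention" key in the sample data, and the choice to keep entities that carry no score are assumptions of this sketch, not documented behavior.

```python
def filter_by_confidence(entities, threshold):
    """Keep entities whose confidence meets the threshold.

    Entities without a "confidence" value are kept, on the assumption
    that a missing score means "not scored" rather than "low score".
    """
    kept = []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None or confidence >= threshold:
            kept.append(entity)
    return kept


sample = [
    {"mention": "A", "confidence": 0.9},
    {"mention": "B", "confidence": 0.2},
    {"mention": "C"},  # no score attached
]
print(filter_by_confidence(sample, 0.5))
```

Rerunning the filter over a labeled sample at several thresholds is one way to re-evaluate a cutoff after the scores change.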
Morphological Analysis
Bug fix: Previously, the Dutch disambiguator would always choose analyses whose lemmas matched their surface forms, even for very rare lemmas; now the more common lemma will be returned. For example, schepen can be a singular noun with the lemma schepen, but it is more likely to be a plural noun with the lemma schip.
NEW Release 1.10.0
April 23, 2018
Entity Extraction and Linking
New deep neural network processor (in BETA): We've added an alternative entity extraction processor, which can be used in place of the standard statistical extractor. The new processor employs a deep neural network that improves accuracy by up to 7% and reduces the error rate by up to 32%. It is available for English, Arabic, and Korean. To enable this processor, provide DNN for the modelType. Example: {"content": "your_text_here", "options": {"modelType": "DNN"}}
Morphological Analysis
New language support: We've added support for lemmatizing Catalan, Estonian, Serbian, and Slovak text.
Bug fix: Previously, tokens could be empty or contain only invisible characters. Such tokens will no longer be returned.
Language Identification
New language support: Short string language identification now also supports Malay and Indonesian. Both languages were already supported for longer texts.
Sentiment Analysis
New language support: We've added support for sentiment analysis in Persian (Farsi and Dari) at both the document and the entity level.
Name Translation
New language support: Rosette now supports transliteration of person, organization, and location names from Greek to Latin script.
Name Similarity
New language support: Rosette now supports matching of Greek names written in Greek script to English names written in Latin script and other Greek names written in Greek script.
Accuracy improvements: We have improved match scores and segmentation rules for Arabic, Western Farsi, and Japanese names.
Point Release - 1.9.4
March 27, 2018
Morphological Analysis
Bug fix: We’ve improved our handling of tokens consisting of numbers and Latin characters, such as serial numbers, in Korean. Previously these tokens were decompounded into multiple morphemes.
Bug fix: We’ve added the lower case Russian word “интернет” (“internet”) to the dictionary, which was previously only present in title case.
Bug fix: We’ve improved our handling of tokens containing an apostrophe immediately followed by a digit in languages like French and Italian, like “all'M5S”. Previously, the apostrophe would be parsed as its own token.
Bug fix: We’ve improved our German analysis by taking better advantage of context clues, and now return more accurate results, especially for uncommon words.
Bug fix: We’ve improved our handling of English and German words in all-caps. Previously, these words were assumed to be proper nouns, even though all-caps may simply denote emphasis.
Transliteration
Performance improvement: We’ve added caching of model objects to prevent OutOfMemoryErrors.
Entity Extraction and Linking
Hebrew entity extraction: We’ve added support for entity type "Title" in Hebrew.
Point Release 1.9.3
February 17, 2018
Morphological Analysis
German disambiguation: We’ve improved our German disambiguator for lemmas and part-of-speech tags to be more sensitive to capitalization, particularly for single word inputs.
Bug fix: Previously, German definite articles (der, die, das, den, dem, and des) meaning “the” were lemmatized inconsistently. They are now all lemmatized to the masculine singular nominative form, der.
Bug fix: In some languages, an apostrophe may mark a token boundary, like in the Italian phrase all'M5S. Previously the token boundary was incorrectly omitted when the following token contained a digit. This issue has been rectified and M5S will be properly tokenized.
Name Similarity
Farsi Name Matching: We’ve improved the behavior of Western Farsi-English matching by tokenizing input names earlier in the analysis process.
Point Release - 1.9.2
February 6, 2018
Entity Extraction and Linking
Currency support: We’ve added several additional currency symbols to the regex, including the Turkish Lira (₺), the Pound Sterling (₤), and the Euro (€).
Bug fix: We’ve fixed a bug that caused hexadecimal number strings to be incorrectly extracted as products.
Name Translation
Thai improvement: We’ve modified the gemination rules (consonant elongation) of the ISO11940-2 Thai transliteration standard to improve Thai translation accuracy.
Name Similarity
Bug fix: We’ve fixed a bug around matching names of organizations and locations that contain numbers, such as "Century 21 Real Estate LLC." These non-person names containing digits will now match more accurately.
NEW Release - 1.9.0
January 16, 2018
Topics
Salience scores: We've added salience scores for keyphrases and concepts to indicate how relevant an extracted concept or keyphrase is to the overall content of a text. You now have the option to filter out results below a desired threshold value: {"content": "your_text_here", "options": {"keyphraseSalienceThreshold": value, "conceptSalienceThreshold": other_value}}.
Short string support: We've improved our concept extraction logic, and now support concept extraction for short input strings, i.e. texts fewer than 280 characters long.
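The salience threshold options above take numeric values in place of the value and other_value placeholders. The Python sketch below assembles such a request body; the helper name is hypothetical and the thresholds 0.5 and 0.3 are arbitrary illustration values, not recommended settings.

```python
import json


def build_topics_request(content, keyphrase_threshold=None, concept_threshold=None):
    """Return a JSON body for a topics call, adding only the thresholds given."""
    options = {}
    if keyphrase_threshold is not None:
        options["keyphraseSalienceThreshold"] = keyphrase_threshold
    if concept_threshold is not None:
        options["conceptSalienceThreshold"] = concept_threshold
    body = {"content": content}
    if options:
        # Omit "options" entirely when no threshold is requested.
        body["options"] = options
    return json.dumps(body)


print(build_topics_request("your_text_here", keyphrase_threshold=0.5, concept_threshold=0.3))
```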
Name Deduplication
Thai support: Rosette now supports deduplication of Thai names.
Sentiment Analysis
New feature: We've added the option to use an experimental alternative deep neural network (DNN) sentiment model for English: {"content": "your_text_here", "options": {"modelType": "DNN"}}. The new model will produce different results, which may be more accurate than the current support vector machine (SVM) model, depending on your data. As it is experimental, we are particularly interested in getting user feedback. On-premise users of Rosette API should review the new system requirements in install-guide.txt before using this option.
Entity Extraction and Linking
Entity offsets returned: Entity mention offsets are now returned by default. Offsets can be used to locate the exact surface forms of an extracted entity in the document text.
Korean improvements: We’ve significantly improved the accuracy of entity extraction results across all entity types in Korean.
Confidence scores: Confidence scores for entities extracted using Rosette’s statistical processor, as well as all linked entities, will now be returned by default. Confidence scores allow Rosette to return the most accurate results, particularly for entity linking. To change this behavior, set {"content": "your_text_here", "options": {"calculateConfidence": false}}.
Language Identification
Detect language regions in multilingual documents: The language identification endpoint can now detect different language regions in a multilingual document.
Score changes: We’ve rescaled the confidence scores returned by the language identification endpoint based on customer feedback. The ranking of language candidates will not change, but the scores themselves will be higher. If you currently filter language identification results based on a confidence threshold, you will need to reset that threshold to maintain parity with previous versions.
Name Translation
Thai support: Rosette now supports transliteration of names from Thai to Latin script.
Name Similarity
Thai support: Rosette now supports matching of Thai names to English names and other Thai names.
Accuracy improvements: We have improved match scores for Arabic names (persons, locations and organizations) as well as for Chinese and Japanese organizations.
Point Release - 1.8.1
November 13, 2017
Entity Extraction and Linking
Bug Fix: This release addresses a bug whereby entity linking confidence scores were not being returned when requested. Confidence scores for entities resolved to Wikipedia entries will now be returned when using the following option:
{"content": "your_text_here", "options": {"calculateConfidence": true}}
NEW Release - 1.8.0
October 23, 2017
Topic Extraction
New Endpoint: Topic extraction. We've added a topic extraction endpoint that identifies the key ideas of an input text. For a given input, the endpoint will return two lists: Keyphrases, a list of phrases extracted directly from the text, and Concepts, a list of phrases which do not have to be explicitly mentioned in the input.
LABS
LABS graduates: The /transliteration, /relationships, and /syntax/dependencies endpoints have graduated from “Labs” status and are now fully supported.
Sentiment Analysis
New language: Rosette now supports document and entity-level sentiment analysis in French.
Entities
Salience Scoring: Rosette can now return salience scores, which indicate whether an entity is important to the overall scope of the document. Turn on the scores by adding an option to the request:
{"content": "your_text_here", "options": {"calculateSalience": true}}
Linking Confidence Scoring: Rosette can also now return Linking Confidence scores, which represent the degree of certainty of the link between an in-document entity mention and its linked QID. It may be used for thresholding and removal of false positives. Linking Confidence scores for entities identified by our linker and assigned with a QID are now available by adding an option to the request:
{"content": "your_text_here", "options": {"calculateConfidence": true}}
Point Release - 1.7.3
July 26, 2017
Name Deduplication
New Endpoint: Name Deduplication. We've added a name deduplication endpoint that identifies similar names within a list. The endpoint accepts a list of names, organizes the list into clusters of unique names, and assigns each cluster an ID number. It then returns those IDs to the user.
Point Release - 1.7.2
June 22, 2017
Entity Extraction
Bug fix: This release addresses a backward compatibility issue between the latest Rosette API and older versions of our Java binding that affected Rosette's ability to return entity confidence scores. Confidence scores for entities identified by our statistical extractor are now available by adding an option to the request:
{"content": "your_text_here", "options": {"calculateConfidence": true}}
NEW Release - 1.7.1
June 14, 2017
Transliteration for Arabizi
New Endpoint: Transliteration. We've added a transliteration endpoint that converts between Arabic written in ASCII, also called Romanized Arabic chat or Arabizi, and native Arabic script.
Arabic Sentiment Analysis
Beta Arabic Support for /sentiment: We now return document-level and entity-level sentiment analysis results for Arabic language input.
Relationship Extraction
Personal pronoun resolution for /relationships: Building on the pronoun resolution capabilities of our /entities endpoint, pronouns which are resolved to named entities can now be arguments in relationships.
Entities
Improved Confidence Scoring: Confidence score calculation is improved to correlate well with precision and may be used for thresholding and removal of false positives.
Tokenization
New support for emoticons, emoji, @mentions, hashtags, URLs, and email addresses: These special characters and character combinations are now kept together as a single token in all languages, greatly improving the accuracy of analysis further downstream.
Morphological Analysis
Improved accuracy for English and Spanish: For this release, we updated our English and Spanish dictionaries. We also introduced new, more advanced disambiguation models for these languages, which help Rosette to correctly determine a given word’s part of speech. For example, words like “object” can be either a noun (“this is an object”) or a verb (“I object!”).
Lemmatization and normalization of emoticons, emoji, @mentions, hashtags, URLs, and email addresses: Rosette now normalizes and lemmatizes these special characters and character combinations to streamline analysis.
Improved decompounding for Dutch: Dutch text is now decompounded more accurately, producing better tokens for search enhancement and other applications.
NEW Release - 1.6.0
March 23, 2017
Relationship Extraction
Improved Accuracy of Corporate Relationships: We've improved the identification of relationships between corporations. The relationships involved are: ORG-SUBSIDIARY-OF, ORG-COLLABORATORS, ORG-ACQUIRED-BY and ORG-PROVIDER-TO.
Removed the ORG-PARTNERSHIPS Relationship: The ORG-PARTNERSHIPS relationship is now subsumed under ORG-COLLABORATORS and is no longer extracted as an independent relationship.
Entity Extraction and Linking
Improved Linking Accuracy via Inclusion of New Context Features: The statistical model for entity linking includes features that measure the vector space similarity between an entity context and the Wikipedia contexts of its potential linking targets. The new features result in higher F-Scores across all supported languages.
Entity Linking in Japanese, Chinese and Spanish: Entity linking to Wikidata with QIDs for Japanese, Chinese and Spanish text is supported.
Removed Long Text Linking: Entity linking to Wikidata (with QIDs) for long texts is removed, which, as a result, removed entity linking capabilities in Arabic.
Text Embedding
Vector dimension reduced from 512 to 300: We are able to produce smaller vectors that are more efficient and memory friendly without sacrificing overall speed or accuracy.
Improved Speed and Accuracy: A number of speed enhancements have been made along with much larger vocabularies to increase accuracy.
Language Identification
Improved Accuracy on Texts with Mixed Scripts: A script specific model is now selected based on the weighted frequency of the different scripts in the input.
Name Matching
Japanese Improvements: Rosette API now has better support for Japanese name matching. This includes the new use of word embeddings, which are used to match words with similar semantic meaning, as well as improved Japanese name segmentation.
NEW Release - 1.5.1
January 10, 2017
Targeted Relationship Extraction
New Endpoint Functionality: The /relationships endpoint now returns targeted relationships, as opposed to the former open relationships, as its default extracted relationships. Targeted relationships are specifically between two entities, and are labeled by a certain relationship type. You can see the former open relationships by setting the option of "discoveryMode" to "true".
/entities/linked REMOVED
Removed Deprecated Endpoint: The /entities/linked endpoint, previously deprecated, is now completely removed. All functionality is available through the /entities endpoint. You will receive a 404 when calling /entities/linked.
Entity Extraction
Social Media Linking in Japanese and Chinese: Our fast short text entity linker to Wikidata is now available for Japanese and Chinese.
Removal of long text entity linking: Our long text entity linker has been replaced by our fast short string entity linker. You will now see entity linking results from our short string linker by default. This removes linking support for Arabic.
Additional Language Support: The entity extractor now supports Vietnamese.
CJK Support for Names
Name Translation and Similarity CJK Improvements: The /name-similarity and /name-translation endpoints now support matching and translating between Japanese-Chinese, Japanese-Korean, and Korean-Chinese. Japanese accuracy was improved significantly.
Text Embeddings Improvement
Improved Accuracy for Document-level Embeddings: We made some improvements to our algorithm for calculating text embeddings across multi-word input, so you should see more accurate results for document-level vectors.
Japanese Sentiment Analysis
Beta Japanese Support for /sentiment: We now return document-level and entity-level sentiment analysis results for Japanese language input.
NEW Release - 1.4.0
October 27, 2016
Syntactic Dependencies (NEW)
New Endpoint: We've added a syntactic dependencies endpoint that returns the parse tree of the input text as a list of labeled directed links between tokens, as well as the list of tokens in the input sentence.
Relationship Extraction
Entities Linked to Wikidata: Where available, Rosette will now link entities extracted within relationships to Wikidata. You'll see this information returned as a QID in the argument ID.
Modality Returned: We've also added a "modality" field to Rosette's Relationship Extraction. Modality is the semantic context of the possibility or necessity of the relationship; the values can be “assertion”, “negation”, “uncertainty”, “opinion”, or “question”.
Starter Plan (NEW)
New $99 API Plan: For a limited time, we’re offering a special Starter plan. $99/month gets you 40,000 Rosette API calls. Want to dive deep into Rosette but don’t need a whole 100,000 calls? This plan is for you.
NEW Release - 1.3.0
September 15, 2016
Text Embedding (NEW)
New Endpoint: We added a text embedding endpoint that returns a single vector of floating point numbers that represents the document or word in the semantic space.
Sentiment Analysis
Additional entities: We changed the /sentiment endpoint to return the sentiment of all entities discovered by Rosette, including Person, Location, Organization, Date, Time, and more entity types.
Entity Extraction
Turn off entity linking: We added an option to disable entity linking in order to improve the call speed. Add
"options": {"linkEntities": "false"}
to your /entities call. Rosette returns a list of the entities with a temporary ID (TID).
Global changes
Concurrency header: We added the X-RosetteAPI-Concurrency header to return the number of concurrent calls allowed on your plan. If you are receiving 429 (Too Many Requests) errors, contact us for greater concurrency.
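While staying within the concurrency limit, a client can also retry after a 429 response instead of failing outright. The Python sketch below shows one way to do that with exponential backoff. The backoff policy, the function names, and the (status, headers, body) shape returned by the send callable are assumptions of this sketch rather than documented guidance, and it assumes the X-RosetteAPI-Concurrency value parses as an integer.

```python
import time


def call_with_backoff(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is any zero-argument callable returning (status, headers, body).
    The wait doubles after each 429: base_delay, 2 * base_delay, and so on.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, headers, body


def allowed_concurrency(headers):
    """Read the plan's concurrency limit from response headers, if present."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-rosetteapi-concurrency":
            return int(value)
    return None
```

Passing the transport in as send keeps the sketch independent of any HTTP library; wrap whichever client you use so that it returns the three values.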
NEW Release - 1.2.3
July 21, 2016
Global changes
Input genre: The genre field is available for /entities and /entities/linked to indicate the input is from social media. Specifying
genre=social-media
does not affect the output of the other endpoints. Applies to: /entities, /entities/linked, /relationships, /categories, /sentiment, /language, /morphology, /tokens, /sentences.
Entity Linking
Temporary entity ID: With the unification of the /entities/linked and /entities endpoints, the /entities/linked endpoint now returns a “T” ID for entities without a Wikidata QID.
Entity Extraction
Entity endpoints unified: We combined the /entities and /entities/linked endpoints into one endpoint, /entities. Rosette now returns the entity mentions and the entityId, if available. The entityId replaced the indocChainId. The output of /sentiment has not changed.
Entities Linked deprecated: We deprecated the /entities/linked endpoint. It is still available, but we recommend that you adapt your applications to the new /entities endpoint.
Additional entities: Rosette now extracts more entity types: Date, Time, Longitude and Latitude, and Distance.
Japanese entityId: We added support for linking entities in Japanese (jpn) text to their entityId.
Spanish social media: We added support for extracting entities from social-media in Spanish language documents, using the genre field.
Malay entities: We added support for extracting entities in Malay (msa).
Error code
409 Error: We added the 409 error code for when the binding version is out of date. If you receive this error, update your binding to the most recent version.
Sentiment Analysis
Spanish support: We added support for analyzing the sentiment of Spanish language documents.
NEW Binding Release - Ruby and R bindings
June 20, 2016
Bindings
Ruby: We added the Ruby binding to the gray column to the right and on Github. There is a Ruby gem available as well.
R: We added the R binding to the gray column to the right and on Github.
cURL examples: We changed the shell examples in the gray column on Features and Functions to be cURL code examples.
NEW Release - 1.1.2
May 10, 2016
Entity Linking
Social input: We added a request field, "genre": "social-media", to speed up and improve the accuracy of linking Person, Location, Organization and Product entities in social media posts. English input only.
NEW Release - 0.10.3
March 29, 2016
Global changes
Language used: We added a response header, X-RosetteAPI-ProcessedLanguage, to return the language used by Rosette for processing the call. Applies to: /entities, /entities/linked, /relationships, /categories, /sentiment, /language, /morphology, /tokens, /sentences
requestId moved: We moved the requestId object from the JSON response body to the response header as X-RosetteAPI-Request-Id. Applies to: /entities, /entities/linked, /relationships, /categories, /sentiment, /language, /morphology, /tokens, /sentences, /name-translation, /name-similarity
Rosette API Key: We changed the user_key header’s name to X-BabelStreetAPI-Key. The user_key header is deprecated. Applies to: /entities, /entities/linked, /relationships, /categories, /sentiment, /language, /morphology, /tokens, /sentences, /name-translation, /name-similarity
unit parameter removed: We removed the optional unit request parameter. All input will be handled as a doc. Applies to: /entities, /entities/linked, /relationships, /sentiment, /morphology
Base64: We removed support for Base64 encoding. You can submit binary files as a multipart/form-data call type. Applies to: /entities, /entities/linked, /relationships, /categories, /sentiment, /language, /morphology, /tokens, /sentences
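The header changes above affect both sides of a call: the key travels in a request header, and the request ID and processed language come back in response headers. The Python sketch below shows one way to handle both. The function names and the Content-Type value are assumptions for illustration; the header names are those listed in the notes.

```python
def request_headers(api_key):
    """Headers for a JSON call, using the key header named in the notes."""
    return {
        "X-BabelStreetAPI-Key": api_key,
        # Assumed here because the request bodies in these notes are JSON.
        "Content-Type": "application/json",
    }


def response_info(headers):
    """Pull the request ID and processed language out of response headers."""
    # Header names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "request_id": lowered.get("x-rosetteapi-request-id"),
        "processed_language": lowered.get("x-rosetteapi-processedlanguage"),
    }
```

Code that previously read requestId from the JSON body would read it from the response headers instead, for example via response_info.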
Entity Extraction
Confidence removed: The
confidence
value has been removed from the response object.
Relationship Extraction
Accuracy mode: We removed the optional accuracy mode. All input will be processed with the precision accuracy mode, so Rosette will return a precise list of accurate relationships.
Explanations removed: The explanations value has been removed from the response object.
Categorization
Explanations removed: The
explanations
value has been removed from the response object.
Sentiment Analysis
Entity sentiment: We added support for entity-level sentiment analysis. The JSON response for the /sentiment endpoint now includes two objects: document and entities. See the interactive documentation for examples of this new response.
Neutral result: We added a neutral label for documents and entities with a neutral sentiment.
Short strings: Rosette will automatically process short and long content with our proprietary algorithm for sentiment analysis.
Explanations removed: The
explanations
value has been removed from the response object.
Morphological Analysis
Added language support: We added language support for Dari, Persian, Urdu, and Western Farsi for Parts-of-Speech Tags.
Universal POS Tags: We return Universal Parts-of-Speech Tags for all supported languages.
Tokens list: Rosette returns parallel lists of tokens, lemmas, compound components, parts-of-speech tags, and Han-readings. If a token does not have a lemma, compound component, POS tag, or Han-reading, or if the language is not supported, then Rosette will return “null” in that list.
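Parallel lists such as those described in the Tokens list note are often easier to consume as one record per token. The Python sketch below zips them together. The response keys "tokens", "lemmas", and "posTags" are assumed names for illustration, and the sketch assumes missing values arrive as JSON null (None in Python after parsing).

```python
from itertools import zip_longest


def token_rows(analysis, fields=("tokens", "lemmas", "posTags")):
    """Turn parallel per-token lists into one dict per token.

    A field absent from the response, or shorter than the others,
    is padded with None.
    """
    columns = [analysis.get(name, []) for name in fields]
    return [dict(zip(fields, values)) for values in zip_longest(*columns)]


parsed = {
    "tokens": ["I", "object"],
    "lemmas": ["I", None],  # null lemma for the second token
    "posTags": ["PRON", "VERB"],
}
for row in token_rows(parsed):
    print(row)
```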
Name Translation
Renamed to /name-translation: To clarify the endpoint’s function, we renamed /translated-name to /name-translation. /translated-name is no longer available.
Removed result layer: Within the response to /name-translation and /name-similarity endpoints, we removed the result layer so the results are in the response object. Also applies to: /name-similarity
TargetScheme requires uppercase: For advanced users who would like to specify a targetScheme, the scheme must be submitted in uppercase.
Name Matching
Renamed to /name-similarity: To clarify the endpoint’s function, we renamed /matched-name to /name-similarity. /matched-name is no longer available.