
Babel Street Analytics API

Entity Extractor

https://analytics.babelstreet.com/rest/v1/entities

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/entities.curl

Entity Extractor uses statistical or deep neural network based models, patterns, and exact matching to identify entities in documents. An entity refers to an object of interest such as a person, organization, location, date, or email address. Identifying entities can help you classify documents and the kinds of data they contain.

The statistical models are based on computational linguistics and human-annotated training documents. The patterns are regular expressions that identify entities such as dates, times, and geographical coordinates. The exact matcher uses lists of entities to match words exactly in one or more languages.

Statistical model based extractions can return a confidence score for each entity. Confidence scores correlate well with precision and may be used for thresholding and removal of false positives. Confidence is calculated by default if linkEntities is on. Otherwise, to include the scores in the result, add the calculateConfidence option to the request.
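As a minimal sketch of thresholding, assume each returned entity is a dictionary with an optional confidence field, as in the response format documented for this endpoint. The helper name filter_by_confidence, the 0.5 threshold, and the sample entities are hypothetical illustrations, not recommended values or real API output. Entities without a score are kept by default, since not every processor reports one.

```python
def filter_by_confidence(entities, threshold=0.5, keep_unscored=True):
    """Return the entities whose confidence is at least `threshold`.

    Entities that carry no "confidence" field are kept or dropped
    according to `keep_unscored`.
    """
    kept = []
    for entity in entities:
        score = entity.get("confidence")
        if score is None:
            if keep_unscored:
                kept.append(entity)
        elif score >= threshold:
            kept.append(entity)
    return kept


# Hypothetical sample entities, not real API output.
sample = [
    {"mention": "Boston", "type": "LOCATION", "confidence": 0.91},
    {"mention": "Apple", "type": "ORGANIZATION", "confidence": 0.22},
    {"mention": "2021-05-01", "type": "TEMPORAL:DATE"},  # no score reported
]
print([e["mention"] for e in filter_by_confidence(sample)])
```

A suitable threshold depends on your data; it is usually chosen by checking precision on a labeled sample.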

The entities endpoint can also return a salience score for each extracted entity. Salience indicates whether the entity is important to the overall scope of the document, for example, if it would be included in a summary of the document. Returned salience scores are binary, either 0 (not salient) or 1 (salient). To include the salience scores in your results, add the calculateSalience option to the request.
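The following sketch shows one way to pick out the salient entities once calculateSalience has been requested. It assumes each returned entity is a dictionary with mention and salience fields; the helper name salient_mentions and the sample entities are hypothetical.

```python
def salient_mentions(entities):
    """Return the mentions of entities whose salience score is 1."""
    return [e["mention"] for e in entities if e.get("salience") == 1]


# Request options asking the endpoint to include salience scores.
options = {"calculateSalience": True}

# Hypothetical sample entities, not real API output.
sample = [
    {"mention": "World Health Organization", "salience": 1},
    {"mention": "Tuesday", "salience": 0},
    {"mention": "Geneva"},  # no score, for example if salience was not requested
]
print(salient_mentions(sample))
```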

The normalized field returns the entity name with normalized white space around each word, one white space per token, which allows entity mention occurrences with different white space usage to be clustered together.
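To illustrate why normalized white space helps with clustering, here is a rough client-side approximation. It only collapses runs of white space; the service normalizes per token, so its output may differ (for example around punctuation). The function names are hypothetical.

```python
from collections import defaultdict


def normalize_whitespace(mention):
    """Collapse runs of white space to one space and trim both ends."""
    return " ".join(mention.split())


def cluster_mentions(mentions):
    """Group raw mentions under their white-space-normalized form."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[normalize_whitespace(mention)].append(mention)
    return dict(clusters)


mentions = ["New  York", "New York", "New\nYork ", "Boston"]
print(cluster_mentions(mentions))
```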

The entities endpoint also has a deep neural network model that can be used in place of the statistical model for selected languages. By default, the endpoint uses the statistical model. To choose the model, set the modelType option, which defaults to statistical. To enable the deep neural network model, set modelType to DNN.

Deep Neural Network language Support

Do you know the language of your input?

If you know the language of your input, include the three-letter language code in your call. This will speed up the response time.

Otherwise, the endpoint will identify the language automatically.

Deep Neural Network Processor

Entity Extractor has a deep neural network (DNN) model that can be used in place of the statistical model for selected languages. By default, the statistical model is used rather than the DNN model. You can customize which model is used.

Currently, Entity Extractor has DNN models for the following languages:

  • Arabic (ara)

  • English (eng)

  • Hebrew (heb)

  • Korean (kor)

Important

The deep neural network model and the statistical model cannot be used together. When selected, the DNN replaces the statistical model.
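A small sketch of how a client might choose the model per request, assuming the four DNN languages listed above and the modelType option. The set of languages is a snapshot of that list and may change; the function names and the fallback policy are hypothetical.

```python
# Languages listed above as having DNN models (a snapshot, may change).
DNN_LANGUAGES = {"ara", "eng", "heb", "kor"}


def choose_model_type(language=None, prefer_dnn=True):
    """Return "DNN" when requested and available, else "statistical"."""
    if prefer_dnn and language in DNN_LANGUAGES:
        return "DNN"
    return "statistical"


def build_request_body(content, language=None, prefer_dnn=True):
    """Build a request body with a modelType suited to the language."""
    body = {
        "content": content,
        "options": {"modelType": choose_model_type(language, prefer_dnn)},
    }
    if language is not None:
        # Supplying a known language skips automatic identification.
        body["language"] = language
    return body


print(build_request_body("Sample text.", language="kor"))
print(build_request_body("Texte d'exemple.", language="fra"))
```

When the language is unknown, this sketch falls back to the statistical model, because the client cannot tell in advance whether a DNN model exists for the detected language.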

Entity Linking

Entity linking provides a mechanism for disambiguating the identity of similarly named entities mentioned in a document. For example, “Rebecca Cole” is the name of both the second African-American woman to become a doctor in the United States and an Australian professional basketball player. Linking helps establish the identity of the entity by disambiguating common names and matching a variety of names, such as nicknames and formal titles, with an entity ID.

Entities links extracted Person, Location, Organization, and Product entities to the Wikidata knowledge base. If the entity exists in Wikidata, then the Wikidata QID, such as Q1 for the Universe, is returned.

Each QID is also assigned a linking confidence score. Linking confidence scores represent the certainty level of the link between an in-document entity mention and its linked QID, and may be used for thresholding and removal of false positive links. Linking confidence is enabled by default.

If the linker cannot link an entity, it creates a placeholder temporary (“T”) entity ID, or TID, to link mentions of the same entity within the document. However, the TID for the same entity may differ across documents.
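The sketch below separates entities that were linked to Wikidata from those that only carry a temporary ID, with an optional threshold on the linking confidence. It assumes QIDs have the form Q followed by digits and treats every other entityId as unlinked; the exact TID format is not specified here, so the "T0" value in the sample is a guess. The function name and the sample scores are hypothetical.

```python
import re

# Wikidata QIDs are "Q" followed by digits, for example Q1.
QID_PATTERN = re.compile(r"^Q\d+$")


def split_by_link_status(entities, min_linking_confidence=0.0):
    """Return (linked, unlinked) lists of entities.

    An entity counts as linked when its entityId is a QID and its
    linkingConfidence meets the threshold. A missing linkingConfidence
    is treated as passing the threshold (a choice made for this sketch).
    """
    linked, unlinked = [], []
    for entity in entities:
        entity_id = entity.get("entityId", "")
        score = entity.get("linkingConfidence", 1.0)
        if QID_PATTERN.match(entity_id) and score >= min_linking_confidence:
            linked.append(entity)
        else:
            unlinked.append(entity)
    return linked, unlinked


# Hypothetical sample entities, not real API output.
sample = [
    {"mention": "Universe", "entityId": "Q1", "linkingConfidence": 0.8},
    {"mention": "Jane Doe", "entityId": "T0"},
    {"mention": "Earth", "entityId": "Q2", "linkingConfidence": 0.1},
]
linked, unlinked = split_by_link_status(sample, min_linking_confidence=0.5)
print([e["entityId"] for e in linked], [e["entityId"] for e in unlinked])
```

Because a TID may differ across documents, it should not be used as a key when merging results from several documents.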

The default for entity linking depends on whether you are using the Cloud instance or have installed Server on-premises:

  • Cloud: Entity linking is on by default; you can disable entity linking to improve the call speed. When entity linking is turned off, the entities are returned with a TID.

  • Server: Entity linking is off by default.

Entities supports linking to other knowledge bases: specifically the DBpedia ontology and the Refinitiv PermID.

DBpedia includes over 700 entity types spanning seven layers of granularity. When this feature is enabled, each extracted entity will be returned with an additional dbpediaTypes field. This field returns a list containing one or more types within the DBpedia hierarchy. Entities linked to DBpedia will also be returned with a QID and a confidence score. DBpedia types are supported for all languages supported by the entity extraction endpoint.

In-document coreference

Within a document, there may be multiple references to a single entity. In-document coreference (indoc coref) chains together all mentions of an entity.

  • The indoc coref server is an additional server which must be installed on your system for Server.

  • The indoc coref server is installed and available on Cloud.

  • By default, indoc coref is disabled.

  • To enable indoc coref for a call, set the option useIndocServer to true.

  • The response time will be slower when indoc coref is enabled. We recommend using a GPU with indoc coref enabled.

  • To see which languages support indoc coref, use the /entities/indoc-coref-server/supported-languages endpoint.
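The settings in this list can be sketched as a request builder. The option table for this endpoint notes that useIndocServer also requires the output=rosette query parameter, so the sketch sets both. The function name is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

ENTITIES_URL = "https://analytics.babelstreet.com/rest/v1/entities"


def indoc_coref_request(content):
    """Return (url, body) for an entities call with indoc coref enabled."""
    # useIndocServer is documented as needing output=rosette on the URL.
    url = ENTITIES_URL + "?" + urlencode({"output": "rosette"})
    body = {"content": content, "options": {"useIndocServer": True}}
    return url, body


url, body = indoc_coref_request("Ada Lovelace wrote notes. She published them.")
print(url)
print(body)
```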

Query Parameters

| Name | Value | Description |
|---|---|---|
| output | rosette | Returns the response in ADM format. |

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

| Option | Type | Description | Default |
|---|---|---|---|
| modelType | string | Model type to use. Valid values are statistical and DNN. | statistical |
| calculateConfidence | boolean | Return the confidence values. confidence: a value between 0 and 1 giving the confidence for the extracted entity. It is only returned for statistical models; no confidence score is returned from the DNN based models, or from results generated by regex rules and gazetteers. linkingConfidence: the confidence from the linker that the QID is the correct entity. | false (unless linkEntities is true) |
| calculateSalience | boolean | Return the salience score. Salience indicates whether a given entity is important to the overall scope of the document. Salience values are binary, either 0 (not salient) or 1 (salient). Salience is determined by a classifier trained to predict which entities would be included in a summary or abstract of an article. | false |
| linkEntities | boolean | Link mentions to knowledge base entities using the disambiguation model. Enabling this option also enables calculateConfidence. | true (Cloud), false (Server) |
| includeDBpediaTypes | boolean | Return the full ontological path of the type within the DBpedia hierarchy. | false |
| includePermID | boolean | Return the entity's ID in the PermID knowledge base. | false |
| linkMentionMode | string | When set to entities, the linker will attempt to link the entities extracted by other processes (regex, gazetteers, and the statistical processor) instead of using its own processor. | string |
| regexCurrencySplit | boolean | When set to true, money entities are extracted as IDENTIFIER:CURRENCY_AMT and IDENTIFIER:CURRENCY_TYPE. | false |
| structuredRegionProcessingType | string | Configures how structured regions are processed. Valid values are none, nerModel, and nameClassifier. | none |
| useIndocServer | boolean | Enables the indoc coref server to return extended entity references. Requires the query parameter output=rosette. | false |

Tip

Entity linking must be enabled to return DBpediaTypes and PermIDs.
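A client can check this dependency before sending a request. The sketch below is a hypothetical helper, not part of the API: it reports which link-dependent options are set while linking is off, taking the deployment default for linkEntities (true on Cloud, false on Server) as an argument.

```python
# Options that only return data when entity linking is enabled.
LINK_DEPENDENT_OPTIONS = ("includeDBpediaTypes", "includePermID")


def check_linking_options(options, link_default):
    """Return the link-dependent options that are set while linking is off.

    `link_default` is the deployment's default for linkEntities:
    True for Cloud, False for Server.
    """
    linking = options.get("linkEntities", link_default)
    return [
        name
        for name in LINK_DEPENDENT_OPTIONS
        if options.get(name) and not linking
    ]


# On Server, linking is off unless the request turns it on.
print(check_linking_options({"includePermID": True}, link_default=False))
print(check_linking_options(
    {"includePermID": True, "linkEntities": True}, link_default=False))
```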

{
  "content": "string",
  "language": "string",
  "options": {
    "modelType": "string",
    "calculateConfidence": false,
    "calculateSalience": false,
    "linkEntities": false,
    "includeDBpediaTypes": false,
    "includePermID": false,
    "linkMentionMode": "entities",
    "regexCurrencySplit": true,
    "structuredRegionProcessingType": "none",
    "useIndocServer": false
  }
}
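As a hedged sketch, a request of this shape could be assembled with Python's standard library as follows. The use of POST and the API key header name X-BabelStreetAPI-Key are assumptions not stated in this section, so check them against your account documentation. The function name is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

ENTITIES_URL = "https://analytics.babelstreet.com/rest/v1/entities"


def build_entities_request(content, api_key, language=None, options=None):
    """Build (but do not send) a JSON request for the entities endpoint."""
    body = {"content": content}
    if language:
        body["language"] = language
    if options:
        body["options"] = options
    return urllib.request.Request(
        ENTITIES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the API key.
            "X-BabelStreetAPI-Key": api_key,
        },
        method="POST",  # assumed HTTP method
    )


request = build_entities_request(
    "Rebecca Cole was born in Philadelphia.",
    api_key="YOUR_API_KEY",
    language="eng",
    options={"calculateConfidence": True},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request, for example with urllib.request.urlopen(request), requires a valid key and network access.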

Response

{
  "entitiesResponse": [
    {
      "type": "string",
      "mention": "string",
      "normalized": "string",
      "count": 0,
      "mentionOffsets": [
        {
          "startOffset": 0,
          "endOffset": 0
        }
      ],
      "entityId": "string",
      "confidence": 0,
      "linkingConfidence": 0,
      "dbpediaTypes": [],
      "permId": "string",
      "salience": 0
    }
  ]
}
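The mentionOffsets field can be used to recover each occurrence of an entity from the submitted text. The sketch below assumes the offsets are character positions in the original content with an exclusive end offset; this section does not state the unit, and if the service counts UTF-16 code units the slices would need adjusting for characters outside the Basic Multilingual Plane. The function name and sample entity are hypothetical.

```python
def mention_spans(content, entity):
    """Return the text covered by each offset pair of one entity."""
    return [
        content[offset["startOffset"]:offset["endOffset"]]
        for offset in entity["mentionOffsets"]
    ]


content = "Boston is in Massachusetts. Boston is old."

# Hypothetical sample entity, not real API output.
entity = {
    "type": "LOCATION",
    "mention": "Boston",
    "count": 2,
    "mentionOffsets": [
        {"startOffset": 0, "endOffset": 6},
        {"startOffset": 28, "endOffset": 34},
    ],
}
print(mention_spans(content, entity))
```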

Language support

The following tables describe the entity types returned by the different processors for each supported language.

Key to processor used to identify each entity type:

  • S = statistical processor

  • G = exact matching processor (gazetteer)

  • R = pattern matching processor (regex)

  • L = entity linking available

  • D = deep neural network processor
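When working with the tables programmatically, a cell such as S/G/D/L can be expanded with the key above. This is a small hypothetical helper; the mapping simply restates the key.

```python
# Processor key, restated from the list above.
PROCESSORS = {
    "S": "statistical processor",
    "G": "exact matching processor (gazetteer)",
    "R": "pattern matching processor (regex)",
    "L": "entity linking available",
    "D": "deep neural network processor",
}


def expand_cell(cell):
    """Expand a table cell such as "S/G/D/L" into processor names."""
    return [PROCESSORS[code] for code in cell.split("/") if code]


print(expand_cell("S/G/D/L"))
```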

Table 9. Statistical, Exact Match (Gazetteer) Extracted Entities, and Linked Entities

Language

(ISO code)

Entity Type

LOC

ORG

PER

PROD

TTL

NAT

REL

Arabic ara

S/G/D/L

S/G/D/L

S/D/L

L

S

G

G

Chinese, Script-insensitive zho

S/G/L

S/G/L

S/L

L

S

G

G

Chinese, Simplified zhs

S/G/L

S/G/L

S/L

L

S

G

G

Chinese, Traditional zht

S/G/L

S/G/L

S/L

L

S

G

G

Dutch nld

S/L

S/G/L

S/L

L

G

English eng

S/G/L/D

S/R/G/L/D

S/L/D

S/L

S

G

G

French fra

S/L

S/G/L

S/L

L

S

German deu

S/L

S/G/L

S/L

L

S

Hebrew heb

S/L/D

S/G/L/D

S/L/D

L

Hungarian hun

S/G/L

S/G/L

S/G/L

S/L

S

Indonesian ind

S/G/L

S/G/L

S/L

L

Italian ita

S/L

S/G/L

S/L

L

S

Japanese jpn

S/L

S/G/L

S/L

L

S

G

G

Korean kor

S/D/L

S/G/D/L

S/D/L

L

S

G

G

Malay, Standard zsm

S/G/L

S/G/L

S/L

L

Pashto pus

S/L

S/G/L

S/L

L

S

Persian fas

S/L

S/G/L

S/L

L

G

G

G

Portuguese por

S/L

S/G/L

S/L

L

S

Russian rus

S/L

S/G/L

S/L

L

S

G

G

Spanish spa

S/L

S/G/L

S/L

L

S

Swedish swe

S/L

S/G/L

S/L

L

S

S/G

S/G

Tagalog tgl

S/G/L

S/G/L

S/L

L

Urdu urd

S/L

S/G/L

S/L

L

G

Vietnamese vie

S/L

S/L

S/L

L

G

G

G



The following entity types are not returned by default:

Table 10. Rule-based Extracted Entities

Language

(ISO Code)

Entity Type

CC#

Dist

EM

LATLNG

MONEY/CURRENCY

PERS ID

TEL#

URL

UTM

DATE

TIME

Arabic ara

R

R

R

R

R

R

R

R

R

R

R

Chinese, Script-insensitive zho

R

R

R

R

R

R

R

R

R

R

R

Chinese, Simplified zhs

R

R

R

R

R

R

R

R

R

R

R

Chinese, Traditional zht

R

R

R

R

R

R

R

R

R

R

R

Dutch nld

R

R

R

R

R

R

R

R

R

R

R

English eng

R

R

R

R

R

R

R

R

R

R

R

French fra

R

R

R

R

R

R

R

R

R

R

R

German deu

R

R

R

R

R

R

R

R

R

R

R

Hebrew heb

R

R

R

R

R

R

R

R

R

R

R

Hungarian hun

R

R

R

R

R

R

R

R

R

R

R

Indonesian ind

R

R

R

R

R

R

R

R

R

R

Italian ita

R

R

R

R

R

R

R

R

R

R

R

Japanese jpn

R

R

R

R

R

R

R

R

R

R

R

Korean kor

R

R

R

R

R

R

R

R

R

R

R

Malay, Standard zsm

R

R

R

R

R

R

R

R

R

R

Pashto pus

R

R

R

R

R

R

R

R

R

R

R

Persian fas 

R

R

R

R

R

R

R

R

R

R

R

Portuguese por

R

R

R

R

R

R

R

R

R

R

R

Russian rus

R

R

R

R

R

R

R

R

R

R

R

Spanish spa

R

R

R

R

R

R

R

R

R

R

R

Swedish swe

R

R

R

R

R

R

R

R

R

R

R

Tagalog tgl

R

R

R

R

R

R

R

R

R

R

Urdu urd

R

R

R

R

R

R

R

Vietnamese vie

R

R

R

R

R

R

R

R

R



Supported languages

GET /entities/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

| Field | Type | Description |
|---|---|---|
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
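A response of this shape can be reduced to the languages you are licensed for. The sketch below works on an already parsed response; the function name and the sample data are hypothetical.

```python
def licensed_languages(response):
    """Return the sorted, de-duplicated codes of licensed languages."""
    return sorted(
        {
            item["language"]
            for item in response["supportedLanguages"]
            if item["licensed"]
        }
    )


# Hypothetical sample response, not real API output.
sample_response = {
    "supportedLanguages": [
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "ara", "script": "Arab", "licensed": False},
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
    ]
}
print(licensed_languages(sample_response))
```

A language can appear once per script, which is why the codes are collected into a set before sorting.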

Supported languages - Indoc coref

GET /entities/indoc-coref-server/supported-languages

Returns the list of supported languages and scripts for the indoc coref server.

Response

| Field | Type | Description |
|---|---|---|
| language | string | ISO 639 language code |
| script | string | Four-letter ISO 15924 script code |
| licensed | boolean | Indicates if you are licensed for this language |

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}