Entity Extractor

https://analytics.babelstreet.com/rest/v1/entities

https://raw.githubusercontent.com/rosette-api/curl-examples/develop/examples/entities.curl

Entity Extractor uses statistical or deep neural network based models, patterns, and exact matching to identify entities in documents. An entity refers to an object of interest such as a person, organization, location, date, or email address. Identifying entities can help you classify documents and the kinds of data they contain.

The statistical models are based on computational linguistics and human-annotated training documents. The patterns are regular expressions that identify entities such as dates, times, and geographical coordinates. The exact matcher uses lists of entities (gazetteers) to match words exactly in one or more languages.

Extractions from the statistical models can return a confidence score for each entity. Confidence scores correlate well with precision and may be used for thresholding and removal of false positives. Confidence is calculated by default if linkEntities is on. Otherwise, to include the scores in the result, add the calculateConfidence option to the request.
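Thresholding on confidence could look roughly like the sketch below. The helper name, the sample entities, and the 0.5 cut-off are invented for illustration; only the `confidence` field comes from the description above. Entities without a score are kept here, which is a design choice rather than documented behavior.

```python
from typing import Any


def filter_by_confidence(entities: list[dict[str, Any]], threshold: float) -> list[dict[str, Any]]:
    """Keep entities whose confidence is at least `threshold`.

    Entities with no confidence field are kept, because there is no
    score available to judge them by.
    """
    kept = []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None or confidence >= threshold:
            kept.append(entity)
    return kept


# Invented sample data, shaped like entries of an entities response.
sample = [
    {"mention": "Boston", "confidence": 0.91},
    {"mention": "Apple", "confidence": 0.32},
    {"mention": "a@example.com"},  # no confidence score present
]
high_confidence = filter_by_confidence(sample, threshold=0.5)
```

A suitable threshold depends on your data; tune it against a labeled sample rather than reusing the value shown.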

The entities endpoint can also return a salience score for each extracted entity. Salience indicates whether the entity is important to the overall scope of the document, for example, if it would be included in a summary of the document. Returned salience scores are binary, either 0 (not salient) or 1 (salient). To include the salience scores in your results, add the calculateSalience option to the request.
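Because salience is binary, selecting the salient entities from a parsed response is a simple filter. This is a minimal sketch; the function name and sample data are hypothetical, and it assumes the request was made with calculateSalience enabled.

```python
def salient_entities(entities: list[dict]) -> list[dict]:
    """Return only the entities marked salient (salience == 1)."""
    return [entity for entity in entities if entity.get("salience") == 1]


# Invented sample data for illustration.
sample = [
    {"mention": "Acme Corp", "salience": 1},
    {"mention": "Tuesday", "salience": 0},
    {"mention": "Paris"},  # salience missing, treated as not salient
]
salient = salient_entities(sample)
```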

The normalized field returns the entity name with normalized white space around each word, one white space per token, which allows entity mention occurrences with different white space usage to be clustered together.
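As an illustration of that clustering, the sketch below groups mentions by their `normalized` value. The function name and the sample entities are made up; falling back to `mention` when `normalized` is absent is an assumption of this example.

```python
from collections import defaultdict


def cluster_by_normalized(entities: list[dict]) -> dict[str, list[str]]:
    """Group mention strings under their normalized form."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for entity in entities:
        # Fall back to the raw mention if no normalized form is present.
        key = entity.get("normalized", entity["mention"])
        clusters[key].append(entity["mention"])
    return dict(clusters)


# Invented sample: the same name with different white space usage.
sample = [
    {"mention": "New  York", "normalized": "New York"},
    {"mention": "New York", "normalized": "New York"},
    {"mention": "Boston", "normalized": "Boston"},
]
clusters = cluster_by_normalized(sample)
```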

Entities also has a deep neural network model that can be used in place of the statistical model for selected languages. By default, entities uses the statistical models rather than the deep neural network based model. You can customize which model is used via the modelType option, which defaults to statistical. To enable the deep neural network model, set the modelType to DNN.

Deep Neural Network language Support

Do you know the language of your input?

If you know the language of your input, include the three-letter language code in your call. This will speed up the response time.

Otherwise, the endpoint will identify the language automatically.

Deep Neural Network Processor

Entity Extractor has a deep neural network (DNN) model that can be used in place of the statistical model for selected languages. By default, the statistical model is used rather than the DNN model. You can customize which model is used with the modelType option.

Currently, Entity Extractor has DNN models for the following languages:

Arabic (ara)
English (eng)
Hebrew (heb)
Korean (kor)

Important

The deep neural network model and the statistical model cannot be used together. When selected, the DNN replaces the statistical model.
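One way to choose the modelType option per request is sketched below. The language set is copied from the list above; the helper itself is hypothetical, and falling back to the statistical model when the language is unknown or unsupported is an assumption of this sketch, not documented behavior.

```python
from typing import Optional

# Languages listed above as having a DNN model.
DNN_LANGUAGES = {"ara", "eng", "heb", "kor"}


def model_options(language: Optional[str], prefer_dnn: bool) -> dict[str, str]:
    """Return an options fragment selecting exactly one model type."""
    if prefer_dnn and language in DNN_LANGUAGES:
        return {"modelType": "DNN"}
    # The two models cannot be combined, so fall back to the default.
    return {"modelType": "statistical"}


english_options = model_options("eng", prefer_dnn=True)
french_options = model_options("fra", prefer_dnn=True)
```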

Tip

Try it in the interactive documentation

Entity Linking

Entity linking provides a mechanism for disambiguating the identity of similarly named entities mentioned in a document. For example, “Rebecca Cole” is the name of both the second African-American woman to become a doctor in the United States and an Australian professional basketball player. Linking helps establish the identity of the entity by disambiguating common names and matching a variety of names, such as nicknames and formal titles, with an entity ID.

Entities links extracted Person, Location, Organization, and Product entities to the Wikidata knowledge base. If the entity exists in Wikidata, then the Wikidata QID, such as Q1 for the Universe, is returned.

Each QID is also assigned a linking confidence score. Linking confidence scores represent the certainty level of the link between an in-document entity mention and its linked QID, and may be used for thresholding and removal of false positive links. Linking confidence is enabled by default.

If the linker cannot link an entity, it creates a placeholder temporary (“T”) entity ID (TID) to link mentions of the same entity within the document. However, the TID for the same entity may differ across documents.
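The distinction between QIDs and TIDs, together with the linking confidence score, could be used as in the following sketch. It assumes that QIDs start with "Q" and TIDs with "T", as the text above suggests; the function name, the sample IDs other than Q1, and the threshold are invented for illustration.

```python
def confident_links(entities: list[dict], min_linking_confidence: float) -> dict[str, str]:
    """Map each mention to its QID when the link is confident enough.

    Entities with temporary IDs, or without a linking confidence score,
    are left out of the result.
    """
    links = {}
    for entity in entities:
        entity_id = entity.get("entityId", "")
        score = entity.get("linkingConfidence")
        if entity_id.startswith("Q") and score is not None and score >= min_linking_confidence:
            links[entity["mention"]] = entity_id
    return links


# Invented sample data; only Q1 (the Universe) is taken from the text above.
sample = [
    {"mention": "the Universe", "entityId": "Q1", "linkingConfidence": 0.8},
    {"mention": "Rebecca Cole", "entityId": "T0"},
    {"mention": "Cole", "entityId": "Q1", "linkingConfidence": 0.1},
]
links = confident_links(sample, min_linking_confidence=0.5)
```

Because a TID may differ across documents, it is safer not to use TIDs as keys when merging results from several documents.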

The default for entity linking depends on whether you are using the Cloud instance or have installed Server on-premises:

Cloud: Entity linking is on by default; you can disable entity linking to improve the call speed. When entity linking is turned off, the entities are returned with a TID.
Server: Entity linking is off by default.

Entities supports linking to other knowledge bases: specifically the DBpedia ontology and the Refinitiv PermID.

DBpedia includes over 700 entity types spanning seven layers of granularity. When this feature is enabled, each extracted entity will be returned with an additional dbpediaTypes field. This field returns a list containing one or more types within the DBpedia hierarchy. Entities linked to DBpedia will also be returned with a QID and a confidence score. DBpedia types are supported for all languages supported by the entity extraction endpoint.

In-document coreference

Within a document, there may be multiple references to a single entity. In-document coreference (indoc coref) chains together all mentions of an entity.

The indoc coref server is an additional server which must be installed on your system for Server.
The indoc coref server is installed and available on Cloud.
By default, indoc coref is disabled.
To enable indoc coref for a call, set the option useIndocServer to true.
The response time will be slower when indoc coref is enabled. We recommend using a GPU with indoc coref enabled.
To see which languages support indoc coref, use the /entities/indoc-coref-server/supported-languages endpoint.
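A request that enables indoc coref could be assembled as in this sketch. The base URL is the endpoint shown at the top of this page; the helper name is hypothetical, and pairing the option with the `output=rosette` query parameter follows the useIndocServer description in the options table.

```python
from urllib.parse import urlencode

BASE_URL = "https://analytics.babelstreet.com/rest/v1/entities"


def indoc_coref_request(content: str) -> tuple[str, dict]:
    """Return the URL and JSON body for a call with indoc coref enabled."""
    url = f"{BASE_URL}?{urlencode({'output': 'rosette'})}"
    body = {"content": content, "options": {"useIndocServer": True}}
    return url, body


url, body = indoc_coref_request("She met him in Boston. Maria liked the city.")
```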

Query Parameters

Name	Value	Description
output	rosette	Returns the response in ADM format.

Note

All input parameters, including the text being analyzed and any relevant options, are defined in the request body.

Request

Name	Type	Description
`content`	string	Text to process
`contentUri`	string	URI to accessible content
`language`	string	ISO 639 language code

Notice

content and contentUri are mutually exclusive; only one can be specified per call.
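A client could enforce this rule before sending anything. The sketch below is one hypothetical way to do it; the function name is invented, and treating "neither field given" as an error is an assumption of this example.

```python
from typing import Optional


def build_body(content: Optional[str] = None,
               content_uri: Optional[str] = None,
               language: Optional[str] = None) -> dict:
    """Build a request body with exactly one of content or contentUri."""
    if (content is None) == (content_uri is None):
        raise ValueError("Specify exactly one of content or contentUri")
    body = {"content": content} if content is not None else {"contentUri": content_uri}
    if language is not None:
        body["language"] = language  # for example "eng"
    return body


text_body = build_body(content="Bill Murray will appear in new Ghostbusters film.", language="eng")
uri_body = build_body(content_uri="https://example.com/article.html")
```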

Option	Type	Description	Default
`modelType`	string	model type to use; valid values `statistical` and `DNN`	`statistical`
`calculateConfidence`	boolean	Return the confidence values. confidence: A value between 0 and 1. Only returned for statistical models. No confidence score is returned from the DNN based models, or from results generated by regex rules and gazetteers. This is the confidence value for the extracted entity. linkingConfidence: The confidence from the linker that the QID is the correct entity.	false (unless `linkEntities` is true)
`calculateSalience`	boolean	Return salience score. Salience indicates whether a given entity is important to the overall scope of the document. Salience values are binary, either 0 (not salient) or 1 (salient). Salience is determined by a classifier trained to predict which entities would be included in a summary or abstract of an article.	false
`linkEntities`	boolean	Link mentions to knowledge base entities using a disambiguation model. Enabling this option also enables `calculateConfidence`.	true (Cloud), false (Server)
`includeDBpediaTypes`	boolean	Return the full ontological path of the type within the DBpedia hierarchy	false
`includePermID`	boolean	Return the id to PermID knowledge base	false
`linkMentionMode`	string	When set to `entities`, the linker will attempt to link the entities extracted by other processes (regex, gazetteers, and the statistical processor) instead of using its own processor.	string
`regexCurrencySplit`	boolean	When set to true, money entities are extracted as `IDENTIFIER:CURRENCY_AMT` and `IDENTIFIER:CURRENCY_TYPE`	false
`structuredRegionProcessingType`	string	Configures how structured regions will be processed. It has three values: `none`, `nerModel`, and `nameClassifier`.	`none`
`useIndocServer`	boolean	Enables the indoc coref server to return extended entity references. The query parameter `output=rosette` must also be set.	false

Tip

Entity linking must be enabled to return DBpediaTypes and PermIDs.
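Since the linkEntities default differs between Cloud and Server, a client-side check can catch option combinations that would silently return no DBpedia types or PermIDs. This is a sketch only: the function and its `link_default` parameter are hypothetical, and the API itself may handle such combinations differently.

```python
def check_link_dependencies(options: dict, link_default: bool) -> list[str]:
    """Return a list of problems with options that depend on entity linking.

    `link_default` is the deployment's default for linkEntities:
    True for Cloud, False for Server.
    """
    linking = options.get("linkEntities", link_default)
    problems = []
    for name in ("includeDBpediaTypes", "includePermID"):
        if options.get(name) and not linking:
            problems.append(f"{name} requires linkEntities to be true")
    return problems


# On Server, linking is off unless requested explicitly.
server_problems = check_link_dependencies({"includePermID": True}, link_default=False)
cloud_problems = check_link_dependencies({"includePermID": True}, link_default=True)
```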

{
  "content": "string",
  "language": "string",
  "options": {
    "modelType": "string",
    "calculateConfidence": false,
    "calculateSalience": false,
    "linkEntities": false,
    "includeDBpediaTypes": false,
    "includePermID": false,
    "linkMentionMode": "entities",
    "regexCurrencySplit": true,
    "structuredRegionProcessingType": "none",
    "useIndocServer": false
  }
}
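A body like the one above can be sent with Python's standard library. The sketch below only constructs the request object. The header name `X-BabelStreetAPI-Key` is an assumption that this page does not confirm, so check your account documentation before relying on it; the function name and sample values are likewise illustrative.

```python
import json
import urllib.request

ENTITIES_URL = "https://analytics.babelstreet.com/rest/v1/entities"


def build_entities_request(api_key: str, body: dict) -> urllib.request.Request:
    """Create a POST request carrying the JSON body; nothing is sent yet."""
    return urllib.request.Request(
        ENTITIES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-BabelStreetAPI-Key": api_key,  # assumed header name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_entities_request(
    "your-api-key",
    {"content": "Bill Murray will appear in new Ghostbusters film.",
     "options": {"calculateSalience": True}},
)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```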

Response

{
  "entitiesResponse": [
    {
      "type": "string",
      "mention": "string",
      "normalized": "string",
      "count": 0,
      "mentionOffsets": [
        {
          "startOffset": number,
          "endOffset": number
        }
      ],
      "entityId": "string",
      "confidence": 0,
      "linkingConfidence": 0,
      "dbpediaTypes": [],
      "permId": "string",
      "salience": 0
    }
  ]
}
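Once parsed, a response of this shape can be summarized by entity type. The sketch below is illustrative: the top-level key defaults to `entitiesResponse` only because that is what the sample above shows, so pass a different `key` if your responses differ. The sample data is invented.

```python
from collections import Counter


def mentions_per_type(response: dict, key: str = "entitiesResponse") -> Counter:
    """Total the `count` field of each entity, grouped by entity type."""
    totals: Counter = Counter()
    for entity in response.get(key, []):
        # Treat a missing count as a single mention.
        totals[entity["type"]] += entity.get("count", 1)
    return totals


# Invented sample response for illustration.
sample_response = {
    "entitiesResponse": [
        {"type": "PERSON", "mention": "Bill Murray", "count": 2},
        {"type": "PERSON", "mention": "Dan Aykroyd", "count": 1},
        {"type": "LOCATION", "mention": "Boston"},
    ]
}
totals = mentions_per_type(sample_response)
```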

Language support

The following tables describe the entity types returned by the different processors for each supported language.

Key to processor used to identify each entity type:

S = statistical processor
G = exact matching processor (gazetteer)
R = pattern matching processor (regex)
L = entity linking available
D = deep neural network processor

Table 9. Statistical, Exact Match (Gazetteer) Extracted Entities, and Linked Entities

Language (ISO code)	Entity Type
	LOC	ORG	PER	PROD	TTL	NAT	REL
Arabic `ara`	S/G/D/L	S/G/D/L	S/D/L	L	S	G	G
Chinese, Script-insensitive `zho`	S/G/L	S/G/L	S/L	L	S	G	G
Chinese, Simplified `zhs`	S/G/L	S/G/L	S/L	L	S	G	G
Chinese, Traditional `zht`	S/G/L	S/G/L	S/L	L	S	G	G
Dutch `nld`	S/L	S/G/L	S/L	L	G
English `eng`	S/G/L/D	S/R/G/L/D	S/L/D	S/L	S	G	G
French `fra`	S/L	S/G/L	S/L	L	S
German `deu`	S/L	S/G/L	S/L	L	S
Hebrew `heb`	S/L/D	S/G/L/D	S/L/D	L
Hungarian `hun`	S/G/L	S/G/L	S/G/L	S/L	S
Indonesian `ind`	S/G/L	S/G/L	S/L	L
Italian `ita`	S/L	S/G/L	S/L	L	S
Japanese `jpn`	S/L	S/G/L	S/L	L	S	G	G
Korean `kor`	S/D/L	S/G/D/L	S/D/L	L	S	G	G
Malay, Standard `zsm`	S/G/L	S/G/L	S/L	L
Pashto `pus`	S/L	S/G/L	S/L	L	S
Persian `fas`	S/L	S/G/L	S/L	L	G	G	G
Portuguese `por`	S/L	S/G/L	S/L	L	S
Russian `rus`	S/L	S/G/L	S/L	L	S	G	G
Spanish `spa`	S/L	S/G/L	S/L	L	S
Swedish `swe`	S/L	S/G/L	S/L	L	S	S/G	S/G
Tagalog `tgl`	S/G/L	S/G/L	S/L	L
Urdu `urd`	S/L	S/G/L	S/L	L	G
Vietnamese `vie`	S/L	S/L	S/L	L	G	G	G
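If you load this table into a program, each cell can be expanded with the processor key given above. This is a small sketch; the function name is hypothetical and the mapping simply restates the key.

```python
# Processor key, as listed above the table.
PROCESSORS = {
    "S": "statistical",
    "G": "gazetteer",
    "R": "regex",
    "L": "entity linking",
    "D": "deep neural network",
}


def decode_cell(cell: str) -> list[str]:
    """Expand a table cell such as "S/G/D/L" into processor names."""
    return [PROCESSORS[code] for code in cell.strip().split("/") if code]


arabic_loc = decode_cell("S/G/D/L")
```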

The following entity types are not returned by default:

Table 10. Rule-based Extracted Entities

Language (ISO Code)	Entity Type
	CC#	Dist	EM	LATLNG	MONEY/CURRENCY	PERS ID	TEL#	URL	UTM	DATE	TIME
Arabic `ara`	R	R	R	R	R	R	R	R	R	R	R
Chinese, Script-insensitive `zho`	R	R	R	R	R	R	R	R	R	R	R
Chinese, Simplified `zhs`	R	R	R	R	R	R	R	R	R	R	R
Chinese, Traditional `zht`	R	R	R	R	R	R	R	R	R	R	R
Dutch `nld`	R	R	R	R	R	R	R	R	R	R	R
English `eng`	R	R	R	R	R	R	R	R	R	R	R
French `fra`	R	R	R	R	R	R	R	R	R	R	R
German `deu`	R	R	R	R	R	R	R	R	R	R	R
Hebrew `heb`	R	R	R	R	R	R	R	R	R	R	R
Hungarian `hun`	R	R	R	R	R	R	R	R	R	R	R
Indonesian `ind`	R	R	R		R	R	R	R	R	R	R
Italian `ita`	R	R	R	R	R	R	R	R	R	R	R
Japanese `jpn`	R	R	R	R	R	R	R	R	R	R	R
Korean `kor`	R	R	R	R	R	R	R	R	R	R	R
Malay, Standard `zsm`	R	R	R		R	R	R	R	R	R	R
Pashto `pus`	R	R	R	R	R	R	R	R	R	R	R
Persian `fas`	R	R	R	R	R	R	R	R	R	R	R
Portuguese `por`	R	R	R	R	R	R	R	R	R	R	R
Russian `rus`	R	R	R	R	R	R	R	R	R	R	R
Spanish `spa`	R	R	R	R	R	R	R	R	R	R	R
Swedish `swe`	R	R	R	R	R	R	R	R	R	R	R
Tagalog `tgl`	R	R	R		R	R	R	R	R	R	R
Urdu `urd`	R		R		R	R	R	R	R
Vietnamese `vie`	R	R	R		R	R	R	R		R	R

Supported languages

GET /entities/supported-languages

Returns the list of supported languages and scripts for the endpoint, along with whether you have a license for the language.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO-15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}
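A parsed response of this shape can be reduced to the languages you are licensed for. The function name and sample values below are invented for illustration; the field names come from the table above.

```python
def licensed_languages(response: dict) -> list[str]:
    """Return the sorted, de-duplicated codes of licensed languages."""
    return sorted({
        item["language"]
        for item in response.get("supportedLanguages", [])
        if item.get("licensed")
    })


# Invented sample; a language may appear once per supported script.
sample_response = {
    "supportedLanguages": [
        {"language": "zho", "script": "Hans", "licensed": True},
        {"language": "zho", "script": "Hant", "licensed": True},
        {"language": "eng", "script": "Latn", "licensed": True},
        {"language": "urd", "script": "Arab", "licensed": False},
    ]
}
licensed = licensed_languages(sample_response)
```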

Supported languages - Indoc coref

GET /entities/indoc-coref-server/supported-languages

Returns the list of supported languages and scripts for the indoc coref server.

Response

Field	Type	Description
`language`	string	ISO 639 language code
`script`	string	Four-letter ISO-15924 script code
`licensed`	boolean	Indicates if you are licensed for this language

{
  "supportedLanguages": [
    {
      "language": "string",
      "script": "string",
      "licensed": boolean
    }
  ]
}

Babel Street Analytics API