
Entity Recognition is the task of detecting and classifying all entities of interest in a text.

Hence, this task is a combination of two steps: first identifying the spans of text that describe the entities, and then classifying them into the semantic type they refer to (Jurafsky and Martin, 2009).

Figure 2.1: Named Entity Recognition example, with the text annotated with the respective entities.

The most common semantic entity types in Entity Recognition are Named Entities, i.e. entities which can be referred to by name, such as person, location, organization, etc. However, Named Entity Recognition, the task of identifying these Named Entities, has been extended to also include temporal and numerical expressions, such as dates and currencies respectively.

It becomes apparent that semantic entity types can vary based on the domain of interest and the application. For Entity Recognition systems targeted at the Biomedical domain, the term Biomedical Entity Recognition is used. Commonly, Biomedical Named Entity Recognition is used to identify entities such as genes, diseases, drugs, proteins and organisms, and is considered a harder task than Named Entity Recognition due to the complexity of the biomedical literature.

2.2.1 Linguistic Characteristics

Entity Recognition systems depend on a plethora of linguistic, orthographic and morphological features that stem from the source text, along with the available context, to make a prediction (Nadeau and Sekine, 2007). The syntax and structure of documents, the capitalization of certain words and the use of domain-specific n-grams have all proven very effective in early approaches and are easily extractable from the source text. However, this information is usually not enough, and as such Gazetteers, semantically categorized lists of entities, have been very useful, especially in the Biomedical domain (Finkel et al., 2004; Magnolini et al., 2019). This is because biomedical documents contain significantly more entities than documents from other domains, and the field is developing at a very fast pace, not allowing for a general consensus on which morphological features should define entity naming (Smith et al., 2008).
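The use of a gazetteer for entity lookup can be sketched as a simple dictionary match over the text; the entries below are invented for illustration, and a real biomedical gazetteer would contain thousands of curated names:

```python
# Minimal sketch of gazetteer-based entity tagging (hypothetical entries).
# A gazetteer is a semantically categorized list of entity names; here it is
# modeled as a plain dict mapping surface forms to semantic types.
GAZETTEER = {
    "BRCA1": "GENE",
    "aspirin": "DRUG",
    "Escherichia coli": "ORGANISM",
}

def gazetteer_tag(text: str) -> list[tuple[str, str]]:
    """Return (span, type) pairs for gazetteer entries found in the text.

    Entries are tried longest-first, so multi-word names win over substrings.
    """
    matches = []
    for name in sorted(GAZETTEER, key=len, reverse=True):
        if text.find(name) != -1:
            matches.append((name, GAZETTEER[name]))
    return matches

print(gazetteer_tag("Mutations in BRCA1 were studied in Escherichia coli."))
```

Real systems combine such lookups with contextual features rather than relying on exact string matching alone, since gazetteers cannot resolve ambiguous mentions by themselves.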

Another factor that plays a significant role in Entity Recognition is the inherent ambiguity of texts. In both common literature and the biomedical domain, two types of ambiguity are identified. The first case arises from the use of the same span of text to describe two different entities of the same semantic type. This most commonly occurs with ordinary first names such as “George” and is treated by a different NLP task, Coreference Resolution, which is described in depth in Section 2.4. The second case, usually caused by metonymy, is when the same span of text is used to describe two different entities of different semantic types. A common example is when a business uses the last name of its founder as its name.

2.2.2 Approaches to Entity Recognition

Entity Recognition is usually treated as a sequence labeling task, in which each word of the text is assigned a label indicating its semantic type. The labels used in sequence labeling can vary depending on the application, and many labeling schemes have been proposed, as they have an impact on performance (Krishnan and Ganapathy, 2005). The Inside, Outside (IO) scheme is the most common one: it annotates only the words that are part of an entity as Inside and every other word as Outside. However, as this scheme fails to clearly set the borders of each entity span when more than one entity is considered, the Inside, Outside, Beginning (IOB) scheme has been introduced, with the Beginning tag assigned to the first word of an entity span. Progressively more labels have been introduced in order to annotate nested entities or handle other special cases. It becomes apparent that more complex annotation schemes require more labels, making each label sparser and harder to learn, and therefore impacting performance.
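The difference between the two schemes can be illustrated on a toy sentence (the tokens and entity spans below are invented); note how IO cannot distinguish two adjacent entities, while IOB marks where each span begins:

```python
# Sketch of the IO vs IOB labeling schemes for a toy sentence.
# Entity spans are (start, end, type) over token indices, end exclusive.
tokens = ["George", "Smith", "visited", "New", "York"]
spans = [(0, 2, "PER"), (3, 5, "LOC")]

def io_labels(tokens, spans):
    """IO scheme: every entity token is Inside, everything else Outside."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        for i in range(start, end):
            labels[i] = f"I-{etype}"
    return labels

def iob_labels(tokens, spans):
    """IOB scheme: the first token of a span gets a Beginning tag."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels

print(io_labels(tokens, spans))   # ['I-PER', 'I-PER', 'O', 'I-LOC', 'I-LOC']
print(iob_labels(tokens, spans))  # ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']
```

If the two spans were adjacent, the IO labels would merge them into one entity, whereas the B- tag keeps the boundary recoverable.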

Modern Entity Recognition models are dominated by Deep Learning approaches, as they require less feature engineering and enable end-to-end training, resulting in higher overall performance (Li et al., 2020a). These models are composed of three components: the input representations, the context encoder and the output decoder. The input representations are high-level representations of the source text, at word level, character level or both. The word-level representation is based on Language Models pre-trained on large collections of similar-domain documents, resulting in better performance. These representations are fed to the context encoder, a neural architecture designed to capture contextual information through deep architectures such as RNNs and Transformers, creating latent representations of the input text. These latent representations are finally used at the decoder level to produce the sequence of labels. Entity Recognition models usually rely on Conditional Random Fields (CRFs), which condition their output on previous predictions (Lafferty et al., 2001), to accurately decode the label sequence, as they perform better than a simple FFNN with a Softmax activation function.
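The advantage of a CRF-style decoder over independent per-token predictions can be sketched with Viterbi decoding: given per-token label scores (the encoder's output) and a matrix of transition scores between labels, it finds the highest-scoring label sequence as a whole. The scores below are toy values, not learned parameters:

```python
# Minimal sketch of CRF-style Viterbi decoding (illustrative scores only).
# emissions[t][y]: score of label y at token t; transitions[p][y]: score of
# moving from label p to label y between consecutive tokens.
def viterbi(emissions, transitions):
    n_tokens, n_labels = len(emissions), len(emissions[0])
    score = list(emissions[0])  # best score ending in each label so far
    back = []                   # backpointers for path recovery
    for t in range(1, n_tokens):
        new_score, pointers = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            pointers.append(best_prev)
        score, back = new_score, back + [pointers]
    # Trace the best path backwards from the final position.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Labels: 0 = O, 1 = B-ENT, 2 = I-ENT. Penalizing the O -> I-ENT transition
# lets Viterbi avoid an invalid sequence that per-token argmax would produce.
emissions = [[2.0, 1.8, 0.0],
             [0.1, 0.2, 1.9],
             [0.1, 0.2, 1.9]]
transitions = [[0.0, 0.0, -10.0],  # O -> I-ENT heavily penalized
               [0.0, 0.0, 0.5],
               [0.0, 0.0, 0.5]]
print(viterbi(emissions, transitions))  # → [1, 2, 2], i.e. B-ENT, I-ENT, I-ENT
```

Per-token argmax over the same emissions would yield O, I-ENT, I-ENT, an entity with no beginning; conditioning each label on the previous one repairs this, which is why CRF decoders outperform a plain Softmax output layer.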

With Deep Learning, Named Entity Recognition systems have been able to work end-to-end, without the need for manual feature engineering or outside latent information. While Biomedical Entity Recognition systems also profited from Neural Networks, their individual characteristics resulted in lackluster performance in comparison to their Named Entity counterparts. Consequently, even modern approaches use domain-specific Taxonomies, a higher form of Gazetteers that, apart from eliminating ambiguity, also provide hierarchy and resolve synonyms.

2.2.3 Entity Recognition Evaluation

Entity Recognition systems are not evaluated with any special metrics. As such, the metrics of Precision, Recall and F-score are used at the individual entity level, while micro- and macro-averaged F-scores are used for systems that identify more than one entity type. These metrics are based on the number of True Positives (TP), False Positives (FP) and False Negatives (FN) generated by the system. Formally, Precision and Recall are defined as:

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{2.2}
\]

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{2.3}
\]

and F-score is defined as:

\[
F_\beta\text{-Score} = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}} \tag{2.4}
\]

where β is a weighting factor. Unless explicitly defined, β is usually set to 1, giving the F1-score, which is the harmonic mean of Precision (P) and Recall (R), defined as:

\[
F_1\text{-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.5}
\]

and is the metric universally used in Named Entity Recognition. In some Biomedical applications, the Fβ-score can be used as the metric of choice, to emphasize the importance of either Recall with β = 2 or Precision with β = 0.5, depending on the desired outcome. In cases where more than one entity type is predicted by the model, the micro-averaged F1-score is biased by entity frequency (also referred to as support), while the macro-averaged F1-score considers all classes equally distributed; both are reported in addition to per-entity metrics.
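The difference between the two averages can be seen on toy per-type counts (invented for illustration): the micro average pools TP/FP/FN across types and is dominated by the frequent type, while the macro average weights each type equally.

```python
# Sketch of micro- vs macro-averaged F1 over two entity types (toy counts).
def f1(tp, fp, fn):
    """F1 from raw counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Per-type (TP, FP, FN): a frequent type (GENE) and a rare one (ORGANISM).
counts = {"GENE": (90, 10, 10), "ORGANISM": (1, 4, 4)}

# Macro: average the per-type F1 scores; every type counts equally.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)
# Micro: sum the raw counts first, then compute a single F1.
micro_f1 = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))

print(round(macro_f1, 3), round(micro_f1, 3))  # → 0.55 0.867
```

Here GENE scores F1 = 0.9 and ORGANISM only 0.2, so the micro average (≈0.87) hides the poor rare-type performance that the macro average (0.55) exposes, which is why both are reported alongside per-entity metrics.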