
5.3 Coreference-aware Language Modeling Training

5.3.5 Experimental results

As discussed earlier, we evaluated our models in three settings: Language Modeling, word similarity, and the downstream task of NER. Starting with Language Modeling, to evaluate our Entity-Transformers architecture and the effect of coreference annotations on the task, we measure the performance of the trained LMs using Perplexity (PPL).

Perplexity is defined as the exponential of the average negative log-likelihood of the LM, as explained in Section 2.1.3.
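Formally, for a sequence of $N$ tokens $w_1, \dots, w_N$, this corresponds to

\[
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\big(w_i \mid w_{<i}\big)\right),
\]

where $p_\theta(w_i \mid w_{<i})$ is the probability the LM assigns to token $w_i$ given its preceding context; lower values indicate a better fit to the data.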

Figure 5.2: Language model performance per step on the CoNLL-2012 corpus. (a) Training loss per step. (b) Validation loss per step.

Table 5.2 shows the training and validation losses of GPT2-CoNLL and GPT2E, as well as the perplexity of the models after 10e5 training steps and the time per step. The gradual changes in training and validation losses, measured every 10e3 steps, are illustrated in Figures 5.2a and 5.2b, with the GPT2-CoNLL model shown in orange and the GPT2E model in blue.

Table 5.2: Perplexity and validation loss on the CoNLL-2012 dataset

             GPT2E                            GPT2-CoNLL
Process      PPL     Loss    Time per step    PPL     Loss    Time per step
Training     5.52    1.71    0.298s           4.80    1.57    0.290s
Validation   1.20    0.187   0.298s           1.19    0.184   0.290s

Comparing the two models on the CoNLL corpus, there is little difference between them on both the training and validation sets in terms of loss and PPL. Notably, there is a slight increase in time per step for the model that uses coreference annotations, attributed to the extra computations required to utilize and update the entity representations. This results in a 2% increase in total training time in our setting. From these results alone, we cannot draw any conclusions as to which model performs better. Both models appear to perform equally well on this corpus, which is also apparent from the very similar losses at the various training and validation stages, indicating that the Entity-Transformer layer performs on par with the standard Transformer layer.

We do not compare GPT2E and GPT2-CoNLL against the original GPT2 model or other Transformer-based models, as our models are trained on significantly less data and for a shorter amount of time; hence, we expect their performance to suffer in comparison.

However, in terms of PPL, both GPT2 variations perform significantly better than other entity-aware LMs (Table 5.3). We attribute the majority of the gap in these scores to the Transformer architecture, as all previous approaches are RNN-based.

Table 5.3: Entity-aware language model comparison on the CoNLL-2012 test set

Model         PPL
EntityNLM     161.64
YangLM        114.0
SetLM         107.0
GPT2E         1.20
GPT2-CoNLL    1.19

Comparing the two models on the LAMBADA corpus highlights their performance difference on an out-of-domain set (Table 5.4): the entity-aware version performs significantly better. It is worth noting that, while the gap between them is large, neither model performs optimally. However, this is attributed to the size of the training data, and by extension the vocabulary size, as the original GPT2 achieves 26.49 PPL.

Table 5.4: Perplexity performance on the LAMBADA dataset

Model         Perplexity
GPT2E         196.81
GPT2-CoNLL    219.97
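For reference, corpus-level perplexities such as those reported above can be computed along the following lines. This is a minimal sketch assuming HuggingFace-style GPT-2 checkpoints and tokenizers; the evaluate_perplexity helper and the public "gpt2" checkpoint are illustrative stand-ins, not the exact evaluation code or models used in this work.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def evaluate_perplexity(model, tokenizer, texts, device="cpu", max_length=1024):
    """Corpus-level perplexity: exp of the average per-token negative log-likelihood."""
    model.eval().to(device)
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).to(device)
            input_ids = enc["input_ids"]
            # labels == input_ids: the model shifts them internally for next-token prediction
            out = model(input_ids, labels=input_ids)
            n_tokens = input_ids.size(1) - 1           # number of predicted positions
            total_nll += out.loss.item() * n_tokens    # out.loss is the mean NLL per token
            total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)

# Illustrative usage with a public GPT-2 checkpoint; GPT2E and GPT2-CoNLL would be
# loaded from their own checkpoints instead.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
print(evaluate_perplexity(model, tokenizer, ["The company said it expects profits to rise."]))
```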

To compare the changes in the entity mention representations when coreference information is used during training, we conducted a series of experiments taking into account the presence or absence of coreference annotations. Specifically, for both models and for each entity cluster, we calculate the average similarity of its mentions with the other mentions of that entity, with and without the use of entity representations for GPT2E. Furthermore, we measure the average similarity between the entity representation and the entity's mentions.
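A minimal sketch of these measurements is given below, assuming the mention vectors (paired with their POS tags) and the per-cluster entity representations have already been extracted from the models; the data layout and helper names are illustrative, and the mention similarity is interpreted here as the average cosine similarity over pairs of mentions within each cluster.

```python
from itertools import combinations
import numpy as np

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PRONOUN_TAGS = {"PRP", "PRP$"}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_mention_similarity(clusters, tags):
    """Mean pairwise cosine similarity between mention vectors (restricted to `tags`)
    within each entity cluster. `clusters` maps an entity id to a list of
    (pos_tag, mention_vector) pairs."""
    sims = []
    for mentions in clusters.values():
        vecs = [vec for tag, vec in mentions if tag in tags]
        sims.extend(cosine(a, b) for a, b in combinations(vecs, 2))
    return float(np.mean(sims)) if sims else float("nan")

def average_entity_similarity(clusters, entity_vectors, tags):
    """Mean cosine similarity between each entity representation and its mentions
    (restricted to `tags`). `entity_vectors` maps an entity id to its representation."""
    sims = []
    for eid, mentions in clusters.items():
        sims.extend(cosine(entity_vectors[eid], vec) for tag, vec in mentions if tag in tags)
    return float(np.mean(sims)) if sims else float("nan")

# e.g. average_mention_similarity(clusters, NOUN_TAGS) for the NN/NNS/NNP/NNPS rows
# and average_entity_similarity(clusters, entity_vectors, PRONOUN_TAGS) for PRP/PRP$.
```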

Intuitively, words that are nouns, proper nouns, and pronouns are the ones expected to show the biggest difference, due to the nature of coreference annotations. As such, we group our results into nouns and proper nouns, which can be very distinctive for entity clusters, and pronouns, which are usually shared between clusters. The detailed results, calculated using cosine similarity, are presented in Table 5.5.

Table 5.5: Cosine similarity of mention representations and their entities in different scenarios

Experiment                                        GPT2E without Entities   GPT2E with Entities   GPT2-CoNLL
Average mention similarity (NN, NNS, NNP, NNPS)   0.7117                   0.7117                0.6971
Average entity similarity (NN, NNS, NNP, NNPS)    0.0489                   0.0513                -0.0164
Average mention similarity (PRP, PRP$)            0.8250                   0.8250                0.7928
Average entity similarity (PRP, PRP$)             0.0619                   0.0566                -0.0173

From the results, we can infer that the mentions maintain their similarity regardless of whether coreference information is used during inference, while also having a higher average similarity than the corresponding mentions of the model trained without coreference annotations. However, taking into account the change in the similarity scores between the entity representations and the entity mentions when coreference information is used during inference, we can conclude that there is a consistent change to the mention representations.

In the case of nouns and proper nouns, that change brings the representations closer together, while for pronouns it has the opposite effect. This also becomes apparent when visualizing the embeddings of GPT2E and GPT2-CoNLL (Figure 5.3), using t-SNE for dimensionality reduction. Specifically, the GPT2E embeddings appear to form more clearly clustered words than those of GPT2-CoNLL, and the vocabulary as a whole seems more distinctly laid out.

Figure 5.3: Visualization of the word representations of (a) GPT2E and (b) GPT2-CoNLL, and (c) a comparison between the two, trained on the CoNLL-2012 dataset.
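Projections such as those in Figure 5.3 can be produced with a short script along these lines; this is a minimal sketch assuming the token embedding matrices of the two models are available as NumPy arrays, using scikit-learn's t-SNE as one possible implementation (the plot_embeddings helper, sample size, and the random placeholder matrices are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, title, ax, sample=1000, seed=0):
    """Project a (vocab_size, dim) embedding matrix to 2-D with t-SNE and scatter-plot it."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=min(sample, len(embeddings)), replace=False)
    points = TSNE(n_components=2, init="pca", random_state=seed).fit_transform(embeddings[idx])
    ax.scatter(points[:, 0], points[:, 1], s=2)
    ax.set_title(title)

# Illustrative usage with random matrices standing in for the token embeddings of the two
# models (e.g. model.transformer.wte.weight.detach().numpy() for a GPT-2 style checkpoint).
gpt2e_emb = np.random.randn(5000, 768)
gpt2_conll_emb = np.random.randn(5000, 768)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
plot_embeddings(gpt2e_emb, "GPT2E", ax1)
plot_embeddings(gpt2_conll_emb, "GPT2-CoNLL", ax2)
plt.show()
```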

In the downstream NER task, the model trained using word representations from GPT2E achieved a 3% higher mean F1 score than the one trained with GPT2-CoNLL word representations. In Table 5.6 we highlight the named entity types that showed the biggest differences between the two trained models. Specifically, we observe that the PERSON and PRODUCT entities, which would be directly affected by the anaphoric information during training, showed the greatest increase and contributed the most to the performance boost. In contrast, EVENT entities were more commonly mislabeled when using GPT2E representations. We credit this behavior to the use of LOCATION terms to describe events (e.g. “the Guangzhou Fair”) and to generic event terms that refer to different entities depending on their context (e.g. “new year” can refer to a different year), which the model was unable to handle correctly when the word representations were affected by entity information.

Table 5.6: NER performance using GPT2-CoNLL and GPT2E representations as input.

            GPT2-CoNLL                      GPT2E
Labels      F1      Precision  Recall      F1      Precision  Recall
PERSON      48%     95.5%      32.5%       51.5%   94%        35.5%
PRODUCT     8%      33%        4.5%        23.5%   90%        13.5%
EVENT       23%     83.5%      13.5%       15%     75%        8.5%
CARDINAL    28%     81.5%      17.5%       34%     75%        23%
NORP        44.5%   72.5%      36%         48%     79%        39.5%
Overall     54%     87%        39%         57%     88%        42%

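For completeness, per-label precision, recall, and F1 scores of the kind reported in Table 5.6 can be obtained from IOB-tagged gold and predicted sequences with a standard sequence-labeling scorer. The sketch below uses the seqeval library as one such option (whether seqeval was the scorer actually used here is an assumption), with tiny made-up sequences purely for illustration.

```python
# Per-label NER scoring from IOB-tagged sequences using seqeval (one possible scorer,
# not necessarily the one used for Table 5.6). The gold/pred sequences are toy examples.
from seqeval.metrics import classification_report

gold = [["B-PERSON", "I-PERSON", "O", "B-EVENT", "I-EVENT", "O", "B-CARDINAL"]]
pred = [["B-PERSON", "I-PERSON", "O", "B-LOC",   "I-LOC",   "O", "B-CARDINAL"]]

# Prints precision, recall, and F1 per entity type, plus micro/macro averages.
print(classification_report(gold, pred, digits=3))
```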