
The previously proposed architectures are robustly designed to handle all BioER tasks.

However, only EBM entities have the potential to improve clinical practice. As such, all the previously described architectures were trained and evaluated on the EBM-NLP corpus (Nye et al., 2018), using the predefined train and test splits.

While the corpus has hierarchical entity annotations, we focus only on the top-level annotations, which are aligned with the PICO framework. Although more detailed entities would provide a richer information scheme, they are not widely used and would be of little help in enhancing EBM practice. All experiments were run on a computer with a single Titan V 12GB graphics card and 32GB of memory.

3.4.1 Data and Preprocessing

The EBM-NLP corpus is a collection of 4982 abstracts from RCTs and CTs, with the final annotations being the result of inter-annotator agreement and annotator scoring. It contains a total of 53397 sentences, with a maximum of 241 words per sentence and 838 words per abstract.

We use the corpus directly, without any major preprocessing. Importantly, the Intervention and Comparator entities are considered to be the same entity and treated as one (Nye et al., 2018). As a result we identify Population, Intervention and Outcome using an IO (Inside, Outside) format for the target labels. Using the IO scheme instead of the more common IOB avoids the imbalance between Inside and Begin labels, which in turn has an impact on model performance. Furthermore, as PICO labels are semantic entities, no encapsulating annotations exist in the corpus, making IO a more suitable annotation scheme.
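For concreteness, the short sketch below (a hypothetical helper, not part of our pipeline) shows how IOB-style tags can be collapsed into the IO scheme used here, with Comparator spans folded into Intervention:

```python
# Hypothetical helper illustrating the IO labeling scheme described above;
# tag names follow the top-level EBM-NLP PICO entities.
def iob_to_io(tags):
    """Convert IOB tags (e.g. 'B-Intervention', 'I-Outcome', 'O')
    to the IO scheme by collapsing 'B-' into 'I-'."""
    io_tags = []
    for tag in tags:
        if tag == "O":
            io_tags.append("O")
        else:
            _, entity = tag.split("-", 1)
            # Comparator spans are treated as Interventions (Nye et al., 2018).
            if entity == "Comparator":
                entity = "Intervention"
            io_tags.append(f"I-{entity}")
    return io_tags

print(iob_to_io(["O", "B-Population", "I-Population", "B-Comparator", "O"]))
# ['O', 'I-Population', 'I-Population', 'I-Intervention', 'O']
```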

3.4.2 Experimental setup

Each model is designed with a unique set of hyperparameters, based on its architecture and characteristics. After hyperparameter tuning, the resulting configurations for each model are the following:

BioER-RNN: For the CNNs, we used kernels of sizes 1, 2, 3 and 4 with 40, 80, 120 and 160 filters respectively. We implemented two BiLSTMs with a Highway Residual connection between them; the hidden size d_rnn of each LSTM layer is 712 for the first and 365 for the second.

The character embedding size d_char was set to 300. The ELMo embedding dimension d_ELMo is 1024, and the weights are pre-trained on PubMed (Jin et al., 2019), commonly referred to as BioELMo. We use sentence padding to the maximum sentence length in the training set (n_max) and character padding to the maximum word length in the training set (m_max).

Adam (Kingma and Ba, 2014) was used for weight optimization through back-propagation, with a learning rate of 1e-3 and a decay of 0.90 per epoch. We employed a dropout of 0.5 (Srivastava et al., 2014) between layers to avoid overfitting. The model was trained for 20 epochs with early stopping and a batch size of 16.
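As an illustration of the recurrent core described above, the following PyTorch sketch stacks the two BiLSTMs (hidden sizes 712 and 365) with a highway-style residual connection between them; the gating formulation and the linear projection on the skip path are assumptions made to keep the dimensions compatible, not necessarily the exact implementation used:

```python
import torch
import torch.nn as nn

class HighwayResidualBiLSTM(nn.Module):
    """Two stacked BiLSTMs with a highway-style residual connection
    between them (a sketch; gating and projection are assumptions)."""
    def __init__(self, input_dim, hidden1=712, hidden2=365, dropout=0.5):
        super().__init__()
        self.bilstm1 = nn.LSTM(input_dim, hidden1, batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(2 * hidden1, hidden2, batch_first=True, bidirectional=True)
        # Project the first BiLSTM's output onto the second's dimensionality
        # so the residual (highway) path can be combined with it.
        self.skip_proj = nn.Linear(2 * hidden1, 2 * hidden2)
        self.gate = nn.Linear(2 * hidden1, 2 * hidden2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, seq_len, input_dim)
        h1, _ = self.bilstm1(x)               # (batch, seq_len, 2*hidden1)
        h1 = self.dropout(h1)
        h2, _ = self.bilstm2(h1)              # (batch, seq_len, 2*hidden2)
        t = torch.sigmoid(self.gate(h1))      # highway transform gate
        return t * h2 + (1 - t) * self.skip_proj(h1)
```

The output of this block feeds the self-attention layer and the final classifier; the optimizer schedule described above (Adam with a learning rate of 1e-3 and a 0.90 decay per epoch) can be reproduced, for example, with torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.90) stepped once per epoch.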

BioTransformER: We use SciBERT (Beltagy et al., 2019) for all our BERT embeddings, with d_BERT = 768. The maximum sequence length of the model was set to n = 256 tokens, which is enough to capture all the information in the largest sentence of the training set and of the whole corpus. Each Encoder has N_ebm = h_ebm = 4 and a 0.1 dropout following both the self-attention layer and the position-wise feed-forward network inside.

The final model was trained for 3 epochs with the Adam optimizer, a learning rate of 2e-5 and a batch size of 16.
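The sketch below illustrates one way to realise the encoder stack just described: SciBERT token representations (d_BERT = 768) followed by N_ebm = 4 Transformer encoder layers with h_ebm = 4 attention heads and 0.1 dropout, plus a backward encoder approximated here by a second stack run over the reversed token sequence. The reversal trick, the concatenation and the final linear classifier are assumptions of this sketch rather than the exact BioTransformER implementation:

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # SciBERT weights: allenai/scibert_scivocab_uncased

class BioTransformEREncoder(nn.Module):
    """SciBERT followed by forward and backward Transformer encoder stacks
    (sketch; the backward pass over a reversed sequence is an assumption)."""
    def __init__(self, num_labels, d_bert=768, n_layers=4, n_heads=4, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        fwd_layer = nn.TransformerEncoderLayer(d_model=d_bert, nhead=n_heads,
                                               dropout=dropout, batch_first=True)
        bwd_layer = nn.TransformerEncoderLayer(d_model=d_bert, nhead=n_heads,
                                               dropout=dropout, batch_first=True)
        self.fwd_encoder = nn.TransformerEncoder(fwd_layer, num_layers=n_layers)
        self.bwd_encoder = nn.TransformerEncoder(bwd_layer, num_layers=n_layers)
        self.classifier = nn.Linear(2 * d_bert, num_labels)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        pad = ~attention_mask.bool()           # True marks padded positions
        fwd = self.fwd_encoder(h, src_key_padding_mask=pad)
        bwd = self.bwd_encoder(torch.flip(h, dims=[1]),
                               src_key_padding_mask=torch.flip(pad, dims=[1]))
        bwd = torch.flip(bwd, dims=[1])        # restore the original token order
        return self.classifier(torch.cat([fwd, bwd], dim=-1))  # per-token logits
```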

LongSeq: Similarly to BioTransformER, SciBERT was used for word embeddings, with d_BERT = 768. Each model variation, based on a different efficient Transformer encoder, was fine-tuned separately, leading to a different encoder configuration for each. As a result, the Vanilla Transformer and Reformer both have N = 6 and h = 6. Longformer has N = 4, h = 4 and uses the sliding-chunks attention approach with a window size w = 256. Finally, Linformer is defined with N = 4, h = 4 and k = 512, where k is a special parameter that defines the projection length inside the sparse attention mechanism.

In all models we define a maximum sequence length of 512 tokens, d_Enc = 768 and d_Proj equal to the number of entity classes. All models are trained for 6 epochs with a learning rate of 3e-5, using the Adam optimizer. We experimented with two variations of the model, trained with an input scope of sentences and of full abstracts, with batch sizes of 32 and 4 respectively.
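A minimal sketch of the LongSeq pipeline under the settings above (maximum length of 512 tokens, d_Enc = 768, d_Proj equal to the number of entity classes); the efficient_encoder argument is a placeholder for whichever of the four encoders is being evaluated, and the exact wiring is an assumption of this sketch:

```python
import torch.nn as nn
from transformers import AutoModel

class LongSeqTagger(nn.Module):
    """SciBERT embeddings, a pluggable efficient Transformer encoder,
    and a per-token projection to entity classes (sketch)."""
    def __init__(self, efficient_encoder: nn.Module, num_labels: int, d_enc: int = 768):
        super().__init__()
        self.bert = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        # Placeholder: Vanilla, Longformer, Linformer or Reformer encoder stack.
        self.encoder = efficient_encoder
        self.proj = nn.Linear(d_enc, num_labels)   # d_Proj = number of entity classes

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h = self.encoder(h)                         # attention over up to 512 tokens
        return self.proj(h)                         # (batch, seq_len, num_labels)
```

In the abstract-level variant, whole abstracts of up to 512 tokens are fed in a single pass with a batch size of 4, while the sentence-level variant uses a batch size of 32.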

3.4.3 Experimental results

We evaluated the trained models on the EBM-NLP test data and compared their performance to other works in the literature. The detailed results presented in Table 3.1 highlight the performance of each model overall and for each label (Population, Intervention and Outcome) in terms of Precision (P), Recall (R) and F1-score. Bold scores represent the best overall values for models predicting all PICO elements simultaneously.

Table 3.1: EBM Models performance comparison

Models            Population        Intervention      Outcome           Overall
                  P    R    F1      P    R    F1      P    R    F1      P    R    F1
EBM-NLP           73   82   77      50   60   54      77   56   65      68   61   64
BioBERT           77   59   67      52   62   57      72   84   77      69   66   67
BioER-RNN         81   80   80      54   71   62      80   58   67      70   70   70
QA-PICO           87   88   87      74   78   74      68   69   67      -    -    75
LongSeq           80   89   84      63   88   74      86   58   69      79   81   80
BioTransformER    80   91   85      69   79   74      84   60   70      79   80   80

It is important to note that the reported scores from QA-PICO (Schmidt et al., 2020) for Population, Intervention and Outcome come from models predicting each PICO label individually (only the P, I or O label), each with a different configuration and parameters. In contrast, the overall F1 score comes from a single model handling all element predictions, for which Precision, Recall and individual element scores were not reported.

Analyzing the results in more detail, we notice that BioER-RNN compares favorably to both EBM-NLP and to a BERT model fine-tuned on the training data. While BERT does perform better on Outcome entities due to a high Recall, its overall performance suffers.

The QA-PICO model bests BioER-RNN overall and, in some metrics, is better than both LongSeq and TransforMED when predicting only a single entity type at a time. However, this introduces additional complexity: conflicting PICO annotations from the different per-entity models need to be resolved before they can be used in a real-world application.

Furthermore, this complexity is not resolved even when the model predicts all labels at once, as it can return the same span of text as an answer to more than one question. The model variation that can predict all PICO entities requires a significant amount of pre-training on the SQuAD dataset to reach its performance, which pales in comparison to both LongSeq and TransforMED. All the QA-PICO variations also require a significant amount of pre-processing to convert the data into a QA format, which is not required by the other models.

The LongSeq model presented uses the Longformer encoder architecture and models the whole abstract in one pass. Compared to TransforMED, which uses forward and backward encoders at the sentence level, the performance of the two models is indistinguishable, with small trade-offs between Precision and Recall. However, LongSeq is significantly faster in both training and inference when a full abstract is provided.

Table 3.2: Example predictions of the BioTransformER model in comparison to gold labels. We use Red for Populations, Green for Interventions and Light Blue for Outcomes.

Example 1

Gold labels: In patients taking blood pressure or lipid-lowering treatment for the prevention of cardiovascular disease, text messaging improved medication adherence compared with no text messaging.

Predicted: In patients taking blood pressure or lipid-lowering treatment for the prevention of cardiovascular disease, text messaging improved medication adherence compared with no text messaging.

Example 2

Gold labels: Conclusions: The laparoscopic surgery in combination of QYJDR could effectively improve clinical symptoms of EMs patients of blood stasis and toxin accumulation syndrome, promote negative conversion of EMAb, lower serum CA125 levels, and elevate the clinical pregnancy rate.

Predicted: Conclusions: The laparoscopic surgery in combination of QYJDR could effectively improve clinical symptoms of EMs patients of blood stasis and toxin accumulation syndrome, promote negative conversion of EMAb, lower serum CA125 levels, and elevate the clinical pregnancy rate.

Example 3

Gold labels: Objective: The study objective was to identify factors associated with death and cardiac transplantation in infants undergoing the Norwood procedure and to determine differences in associations that might favor the modified Blalock-Taussig shunt or a right ventricle-to-pulmonary artery shunt.

Predicted: Objective: The study objective was to identify factors associated with death and cardiac transplantation in infants undergoing the Norwood procedure and to determine differences in associations that might favor the modified Blalock-Taussig shunt or a right ventricle-to-pulmonary artery shunt.


Further analysis of the gold annotations in EBM-NLP, in comparison to the BioTransformER model's predictions, provides insights into the performance of all the models and the overall task. In Table 3.2 we present three characteristic examples, in all of which we notice that the predicted labels either miss some non-medical words or include more words in the PICO entities. This effect stems from disagreements in the EBM-NLP corpus, where in some instances terms like "patients" are considered part of a PICO entity and in others not. While the medical information is always annotated, this results in the extracted information not being coherent enough to stand on its own. Moreover, Example 3 from Table 3.2 shows that the model identifies entities that are missed by the annotators.

These annotation disagreements impact the reported performance of all models, as all models are trained and evaluated against the agreed-upon annotations.

3.4.4 Ablation study

The resulting models demonstrated a noticeable performance increase compared to other methodologies in the literature. To measure the impact of the different components on each model's performance, we conducted a comprehensive ablation study. In the following, we present our findings on a per-architecture basis.

• BioER-RNN Architecture Ablation

Table 3.3: BioER-RNN Entity Recognizer ablation results

Models               Population        Intervention      Outcome           Overall
                     P    R    F1      P    R    F1      P    R    F1      P    R    F1
1D CNN + BiLSTM      73   82   77      50   60   54      77   56   65      68   61   64
2D CNN               52   77   72      52   54   53      68   77   72      66   61   63
BioER-RNN            82   89   80      55   71   62      80   59   68      71   70   70
  -HR                72   86   78      50   71   58      84   57   68      69   68   69
  -Att               73   84   78      50   67   57      79   59   67      69   68   68
  -2D + 1D-BiLSTM    72   89   78      50   66   57      77   55   65      68   67   67

For BioER-RNN, we are interested in the impact of the character embeddings as well as the impact of the residual connections and attention mechanism on our model. Table 3.3 exhibits the slightly reduced performance when transitioning from the 1D CNN with a BiLSTM network approach (Ma and Hovy, 2016; Lample et al., 2016) to a 2D CNN approach on the baseline model. We notice that while this transition did not contribute to overall performance, it allows for faster computations. Specifically, we averaged a 6 and 10 minute reduction in training time per epoch on the baseline and full model respectively, making our approach 20% faster. Intriguingly, using the 1D approach in the final model resulted in worse performance than our 2D approach. We attribute this effect to 2D CNNs capturing character features from neighboring words that can be better utilized by our model architecture.
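To make the 1D/2D distinction concrete, the sketch below applies a 2D convolution over a sentence's (words x characters) grid of character embeddings, so that character features can also be drawn from neighbouring words; the specific kernel shape, padding and pooling here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CharCNN2D(nn.Module):
    """2D convolution over the character-embedding grid of a sentence
    (words x characters), max-pooled back to one vector per word."""
    def __init__(self, char_vocab_size, d_char=300, n_filters=160, kernel=(3, 3)):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, d_char, padding_idx=0)
        # The kernel spans neighbouring words (first dim) and characters (second dim).
        self.conv = nn.Conv2d(d_char, n_filters, kernel_size=kernel,
                              padding=(kernel[0] // 2, kernel[1] // 2))

    def forward(self, char_ids):                 # (batch, n_words, n_chars)
        e = self.char_emb(char_ids)              # (batch, n_words, n_chars, d_char)
        e = e.permute(0, 3, 1, 2)                # (batch, d_char, n_words, n_chars)
        f = torch.relu(self.conv(e))             # (batch, n_filters, n_words, n_chars)
        return f.max(dim=-1).values.transpose(1, 2)   # (batch, n_words, n_filters)
```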

In addition, we observe an increase in Precision and Recall with the use of the highway residual connection between stacked BiLSTMs (HR) and the self-attention mechanism (Att). Furthermore, the addition of a highway residual connection every two RNN layers allows for effective information passing, which in turn boosts the performance by 2%.

• BioTransformER Architecture Ablation

Table 3.4: Ablation study on the BioTransformER model.

Model                 Population        Intervention      Outcome           Overall
                      P    R    F1      P    R    F1      P    R    F1      P    R    F1
BioTransformER        80   91   85      69   79   74      84   60   70      79   80   80
  -Backward Encoder   75   91   82      69   78   73      84   60   70      79   79   78
  -Encoders +BiLSTM   79   76   77      72   68   70      74   69   71      78   75   77
SciBERT               78   74   77      70   69   69      72   69   71      77   75   76
SciBERT -CRF          78   72   75      70   65   67      83   58   68      73   73   73

Table 3.4 exhibits the changes in performance of the model under different configurations.

Specifically, we highlight the importance of using Transformer encoders, in comparison to using a simple projection layer after BERT or a BiLSTM. From the results, it is evident that the Transformer encoders perform significantly better than the baseline models, with all architectural changes contributing to the overall improvement. The addition of a Backward Encoder also contributes to better identification of the entities in all classes, boosting the overall performance.

• LongSeq Architecture Ablation

For LongSeq, we are not interested in investigating the performance gains of stacked Transformer encoders after BERT, as this was already demonstrated in BioTransformER. Instead, we focus on the evaluation of different Transformer encoder architectures and their effects on both performance and training times.

Table 3.5: Architecture and scope results on EBM-NLP corpus using efficient Transformer encoders.

Transformer Architecture    Sentences             Abstracts
                            P     R     F1        P     R     F1
Vanilla                     76%   77%   75%       79%   80%   78%
Linformer                   75%   79%   76%       81%   78%   78%
Reformer                    78%   79%   78%       79%   80%   78%
Longformer                  75%   77%   75%       79%   81%   80%

Table 3.5 compares the overall performance of the model, under different input scopes, across all four Transformer encoder architectures. While the model with Reformer's encoder achieved better performance on sentences, it was surpassed by Longformer when the whole abstract was available. Going further, analyzing the absolute training times in Table 3.6, all abstract-level models completed training significantly faster than the sentence-level models.

The differences in training speed are consistent with the expected computational complexity of each individual architecture at the provided scope, considering the BERT bottleneck at the first layer of each model.
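For reference, and ignoring constant factors and the feed-forward sublayers, the per-layer attention costs reported for these architectures are, for sequence length n, hidden size d, attention window w and projection length k:

Vanilla Transformer: O(n^2 · d)
Longformer (sliding window): O(n · w · d)
Linformer: O(n · k · d)
Reformer (LSH attention): O(n log n · d)

At the sentence level the short sequences blunt these differences, whereas at the abstract level the sub-quadratic variants benefit, consistent with the times in Table 3.6.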

Table 3.6: Absolute training times comparison between sentence and abstract level inputs.

Model         Sentences    Abstracts
Vanilla       185m 3s      102m 17s
Linformer     184m 17s     83m 21s
Reformer      256m 10s     93m 53s
Longformer    381m 12s     93m 19s

3.5 Conclusion extraction with Evidence-Based Medicine