Identificação de Bioprocessos em textos


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Biological Processes Identification in Texts

Vânia Leite

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Rui Camacho



Approved in oral examination by the committee:

Chair: Carlos Soares

External Examiner: Alípio Jorge

Supervisor: Rui Camacho


Abstract

Due to the large diversity, heterogeneity and ever-growing rate of publications made electronically available in databases such as PubMed, biomedical researchers spend a lot of time and effort searching for the available information in their area of research. Several issues cause this difficulty, among them the various name expressions used for the same object or activity in the biomedical field, orthographic variants and abbreviations, which most standard publication search engines cannot handle. Biomedical Text Mining (BTM), the field that deals with the automatic retrieval and processing of biomedical literature, is therefore a very promising research field, namely in the retrieval of biological elements or concepts, working towards automated curation tools that help researchers cope with this information overload. There are two common approaches to text mining: rule-based or knowledge-based approaches, and statistical or machine-learning-based approaches. Common BTM tasks include Named Entity Recognition (NER), Relation Extraction (RE), document summarization, document classification and document clustering. The aim of this project is to explore approaches for the extraction of biological processes from texts, after a review of the state of the art, and then to combine them in order to automate the recognition of biological processes, helping researchers find relevant information for their studies and analyze past work in their research area. The result was a combination of linguistic tools and biomedical natural language processors, used to determine the best way to combine them and to see which approach delivered better results.


Acknowledgements

First, I want to thank my family for their unconditional support and understanding during this critical and stressful time, especially when the deadlines approached and I had little to no availability to be with them.

I want to thank my supervisor Prof. Rui Camacho for all his support, availability and guidance offered throughout this dissertation development, and also Prof. Sérgio Matos for the availability, shared knowledge and work developed in the area of biological event extraction.

Finally, I want to thank all my friends for their support, and especially my boyfriend Pedro Castro for all the pep talks and support, and Anaís Dias, for all the intense Starbucks sessions where we were both highly stressed, but where our empathy for each other allowed me to calm down and see things more clearly.


“Anyone whose goal is ’something higher’ must expect someday to suffer vertigo. What is vertigo? Fear of falling? No, Vertigo is something other than fear of falling. It is the voice of the emptiness below us which tempts and lures us, it is the desire to fall, against which, terrified, we defend ourselves.”


List of Figures

1.1 Event annotation example from the GE task in the BioNLP website

2.1 Showcase of Stanford CoreNLP features

2.2 Stanford CoreNLP’s dependency parsing of a sentence from our dataset

2.3 DisplaCy’s parsing of a sentence from our dataset


List of Tables

2.1 Event types and their arguments, selected from the GENIA ontology [1]. The type of the filler entity is specified in parentheses. The filler entities of the secondary arguments are all of the Entity type, which represents any entity other than proteins: T=Theme, C=Cause, P=Protein, Ev=Event

3.1 Result metrics of a Random Forest classification before feature selection with variance threshold=0.8

3.2 Result metrics of a Random Forest classification after feature selection with variance threshold=0.8

4.1 Metrics for Multinomial Naive Bayes with the Genia Tagger subset of features

4.2 Metrics for Linear SVC with the Genia Tagger subset of features

4.3 Metrics for Random Forest with the Genia Tagger subset of features

4.4 Metrics for Multinomial Naive Bayes with the Genia Tagger + SpaCy subset of features

4.5 Metrics for Linear SVC with the Genia Tagger + SpaCy subset of features

4.6 Metrics for Random Forest with the Genia Tagger + SpaCy subset of features

4.7 Metrics for Random Forest with the base subset of features

4.8 Metrics for Random Forest with the base + SpaCy subset of features

4.9 Metrics for Random Forest with the base + Genia + SpaCy + MetaMap subset of features

4.10 Metrics for Random Forest with Variance Threshold=0.9999999

4.11 Metrics for Random Forest with SelectPercentile with f_classif and Percentile=50

4.12 Metrics for Random Forest with SelectPercentile with mutual_info_classif and Percentile=70


Abbreviations

AL          Active Learning
BioNLP-ST   BioNLP Shared Task
BTM         Biomedical Text Mining
CRFs        Conditional Random Fields
GO          Gene Ontology
HMMs        Hidden Markov Models
ID          Information Density
IE          Information Extraction
IR          Information Retrieval
LC          Least Confidence
ML          Machine Learning
NER         Named Entity Recognition
NLP         Natural Language Processing
POST        Part-Of-Speech Tagging
QBC         Query By Committee
RE          Relation Extraction
SVM         Support Vector Machines
TF-IDF      Term Frequency–Inverse Document Frequency
TM          Text Mining


Chapter 1

Introduction

Life Sciences is an area where a vast amount of knowledge is represented in free text, such as articles, research papers, publications and journals. However, the publication rate of this type of literature has been rising rapidly over the years, making the relevant knowledge harder for researchers to reach during their search efforts. This happens because typical Information Retrieval (IR) tools are not prepared to deal with the particularities of the biomedical research literature, where there are no strict norms on the representation of proteins, genes and event triggers, and where a gene can share its name with a common everyday word. These special cases affect both the precision and the recall of an IR system for biomedical literature: precision drops because irrelevant results appear when a gene has a name that is widely used in contexts other than biology, and recall drops because an entity can be written in various forms, so some relevant documents may not show up in a given keyword search.

These particularities result in large amounts of time spent by researchers trying to find the relevant information for the objects and events they are studying, creating an urgent need for automated literature curation processes [2]. Because of this, several efforts have been made in the Text Mining area to develop more sophisticated Information Extraction (IE) tools to help researchers analyze the increasing body of scientific literature.

This dissertation aims to contribute to the exploration of new precise and robust ways to retrieve knowledge from free text. The goal is to explore typical approaches for the automatic extraction of biological events in scientific publications, and study the impact made by certain features and classification algorithms.

1.1 Context and Motivation

"The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE)"¹. This is where the context of this dissertation is situated. The aim is to explore new approaches to Information Extraction in biomedical literature, taking as a case study the automatic identification of biological processes in texts, in parallel with one of the BioNLP tasks, the GE task².

Figure 1.1: Event annotation example from the GE task in the BioNLP website

"Using keyword queries, scientists can retrieve a large set of documents. To find knowledge from the retrieved papers, they must identify the relevant information. Such manual processing is time consuming and repetitive, because of the bibliography size, and the database continuous updating." [3]. This quote reflects some of the ambitions behind developing this type of system: to help researchers in the biomedical area avoid the tiresome manual processing of this literature.

1.2 Document structure

This document contains further chapters essential to understanding the work performed in the context of this dissertation. In Chapter 2 we present the knowledge gathered during the literature review, needed to understand the scope and particularities of biological events and their automatic extraction, namely details about biological events and machine learning.

In Chapter 3, we explain how the task was approached and the underlying methodology. We provide an overview of the processing pipeline for the text dataset, the natural language processing tools used, and the implementation of the algorithms.

To close the document, the Conclusion in Chapter 5 discusses the experiments and reflects on the work done. It also contains some reflections on the current state of the field of biological event extraction.

² http://2016.bionlp-st.org/tasks/ge4


Chapter 2

Concepts on Biological Events and Text Mining

In this chapter we present basic notions of genomes and molecular biology and the particularities that arise in the textual representation of biological processes and events, which are needed to understand the obstacles to, and the resources available for, the identification of biological events. The advantages of using Machine Learning in the context of this dissertation are discussed, with an explanation of a diverse set of supervised learning techniques. The chapter closes with a collection of tools that have proven relevant in this area of research and may aid the development of this dissertation, along with related work: an overview of some approaches to the automatic extraction of biological events from texts.

2.1 Biological Events

Our planet is populated by living things - organized chemical factories that make use of the matter surrounding them to generate copies of themselves. The raw material that fuels evolution is preexisting DNA: all natural mechanisms make use of it to innovate, through a series of different processes, be it intragenic mutation, gene duplication, segment shuffling or horizontal (intercellular) transfer [4].

Generally, a biological process is a recognized series of events or molecular functions: a conjunction of molecular events with beginning and end, occurring in living organisms.

Below is a list of some relevant event types and their corresponding biological interpretation, which can be looked up in the GO (Gene Ontology) [5]. Most of them turn out to be relevant in the context of this dissertation due to their importance and their number of annotated instances in the GENIA corpus [6], explained further in this document.

• Gene Expression - When information from a gene is used in the synthesis of a functional gene product; These products are usually proteins.


Table 2.1: Event types and their arguments, selected from the GENIA ontology [1]. The type of the filler entity is specified in parentheses. The filler entities of the secondary arguments are all of the Entity type, which represents any entity other than proteins. T=Theme, C=Cause, P=Protein, Ev=Event.

Type                  Primary arg.         Secondary arg.
Gene_expression       T(P)
Transcription         T(P)
Protein_catabolism    T(P)
Phosphorylation       T(P)                 Site
Localization          T(P)                 AtLoc, ToLoc
Binding               T(P)+                Site+
Regulation            T(P/Ev), C(P/Ev)     Site, CSite
Positive_regulation   T(P/Ev), C(P/Ev)     Site, CSite
Negative_regulation   T(P/Ev), C(P/Ev)     Site, CSite

• Transcription - The first step of gene expression, in which RNA is synthesized from a DNA template. The DNA’s role is to pass on to the RNA the correct order of the amino acids.

• Protein catabolism - Process of transformation of proteins into amino acids.

• Phosphorylation - Critical to cellular processes, it is the event in which a phosphoryl group is added to a molecule.

• Localization - Processes in which a cell, a substance, or a cellular entity, such as a protein complex or organelle, is transported, tethered to or otherwise maintained in a specific location. In the case of substances, localization may also be achieved via selective degradation.

• Binding - The selective interaction of a molecule with one or more specific sites on another molecule.

• Regulation - Any process that modulates the frequency, rate or extent of any biological process, quality or function.

We will be working with a set of labeled documents from the BioNLP Shared Task, which follows a specific style of annotation. The BioNLP Shared Task works on the principle that shared tasks with curated resources contribute greatly to the progress of their respective research fields. "The task targets semantically rich event extraction, involving the extraction of several different classes of information." [1] The event types addressed in the annotated corpora were selected from the GENIA ontology, and they all concern protein biology: typically they take only proteins as their themes.

Of the events listed in Table 2.1, the first three concern protein metabolism (i.e. protein production and breakdown), while the three Regulation types represent regulatory events and causal relations.

The theme or themes, the primary arguments of every event, are essential to identify an event correctly. In Regulation events, the cause (an entity or an event) is also a primary argument. An event may take a single argument (the first five event types) or a variable number of arguments (Binding); Regulation events may take both a theme and a cause argument.

Each event type represents a different level of complexity. When only one event argument is considered, Gene Expression, Transcription, Protein Catabolism, Phosphorylation and Localization can be seen as relation extraction between a predicate (the event trigger) and an argument (a protein). The Binding type is more complex and at times requires the detection of more than one argument. Regulation events take a Theme (an event or an entity) as an argument and, when expressed, also a Cause argument [1].

The BioNLP Genia event dataset is composed of about 775 files. In each file, we have a biomedical text, and annotations for the biological events present in it. There are 3 types of annotations present:

• Denotations - This type of annotation is concerned with event triggers and themes (entities). It can be used to study the impact of different features and learning settings on the initial phase of the identification of biological events.

• Relations - The Relations annotations link denotations to one another. There are three classes: themeOf, causeOf and equivalentTo.

• Modifications - These annotations are related to Negation/Speculation detection, which this dissertation did not address due to its complexity and the time available.
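As a rough illustration of how denotations and relations fit together, the sketch below models them as plain Python records. The class names, field names, the direction chosen for themeOf (argument as subject, trigger as object) and the example sentence are all assumptions made for illustration; this is not the BioNLP file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Denotation:
    """A text span labeled as an event trigger or an entity (hypothetical schema)."""
    id: str
    label: str  # e.g. "Protein", or an event type such as "Phosphorylation"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


@dataclass(frozen=True)
class Relation:
    """A typed link between two denotations (hypothetical schema)."""
    predicate: str  # "themeOf", "causeOf" or "equivalentTo"
    subject: str    # id of the argument denotation (assumed direction)
    obj: str        # id of the trigger denotation (assumed direction)


def themes_of(trigger_id, relations):
    """Return the ids of all denotations linked to a trigger by themeOf."""
    return [r.subject for r in relations
            if r.predicate == "themeOf" and r.obj == trigger_id]


# Invented example sentence, used only to exercise the structures.
text = "STAT3 phosphorylation was observed."
denotations = [
    Denotation("T1", "Protein", 0, 5),
    Denotation("E1", "Phosphorylation", 6, 21),
]
relations = [Relation("themeOf", "T1", "E1")]

# Recover the surface text of every annotated span from its offsets.
spans = {d.id: text[d.start:d.end] for d in denotations}
```

With this layout, a simple event such as Phosphorylation reduces to one trigger denotation plus one themeOf relation, which matches the view of the simpler event types as relation extraction between a trigger and a protein.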

2.1.1 Gene Ontology

As mentioned above, this section presents the Gene Ontology [5], its importance and its role in unifying biomedical language across the field. Biomedical terminology has many variations, which is a major constraint on effective searches and on the construction of tools that automatically extract relevant knowledge for researchers.

"In the biomedical domain, not only do special terms appear, but in addition common words are used differently with different meanings. Because of this, we need to retrain or adapt NLP programs for the domain." [7]

To tackle this issue, the Gene Ontology (GO) was created, representing a collaborative effort to address the need for consistent descriptions of gene products in different databases¹.

The GO Consortium nowadays includes many databases, including several of the world’s major repositories for plant, animal and microbial genomes². It provides descriptions of gene products in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner. GO demonstrates an effort in the biomedical community to keep all the information about the domain consistently organized and accessible. It covers three main domains: cellular component, molecular function and biological process. A biological process consists of operations or sets of molecular events relevant to the functioning of living units: cells, tissues, organs and organisms.

¹ ftp://ftp.geneontology.org/pub/go/www/GO.doc.shtml

Examples of broad biological process terms are cellular physiological process or signal transduction. Examples of more specific terms are pyrimidine metabolic process or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

"Formally, the term biological event refers to a task accomplished via one or more molecular entities." [8]

Although GO is not in itself a way to unify biological databases, a shared vocabulary is a good step in that direction. It is not enough on its own for three reasons: the vocabulary changes and the ontology may lag behind it, the ontology has a very specific scope that does not cover every issue in biology, and individual curators evaluate data differently.

2.2 Machine Learning

Data is being collected and accumulated across a variety of fields at a dramatic pace. This has created the need for more sophisticated computational resources, capable of automatically extracting useful knowledge from this growing volume of data. Machine Learning (ML), born from this necessity, is a subfield of Computer Science that focuses on enabling machines to extract knowledge from data or make predictions. It enables computers to complete certain tasks without being explicitly programmed to do so. This is possible because computers can learn models by analyzing data, to produce reliable and repeatable decisions and results³. ML has many important uses, among them recommendation systems, self-driving cars and fraud detection.

Data Mining refers to techniques for finding and describing structural patterns in data, as a tool for explaining that data and making predictions from it [9]. The data can consist of instances (such as clients or transactions), and the output consists of predictions about new instances, for example whether a loan to a client will be successful. The output can also include a description of a structure that can be used to classify new objects, e.g. decision trees, which give a structured, human-understandable representation of what was learned and of the basis for each prediction.

³ http://www.sas.com/en_us/insights/analytics/machine-learning.html

Several issues and particularities need to be addressed when developing these models, namely:

• Preparing the input - We may be dealing with data from various heterogeneous sources. We may need to treat and clean that data in a preprocessing stage before we can feed it to our prediction algorithm;

• Feature Selection - The predictive attributes in a data set have a high impact on the performance of a predictive model. Usually, the more attributes we have, the better our description of the data set is. We should, however, be attentive to irrelevant or inconsistent attributes and deal with them accordingly, so as not to degrade the performance of our classification algorithm. Furthermore, the number of instances needed to achieve a good predictive model increases with the number of predictive features. When faced with this trade-off, one should resort either to aggregation techniques (creating new attributes by aggregating current attributes) or to feature selection, choosing an optimal subset of the original features. Feature selection techniques include filters, wrappers and embedded approaches.

• Overfitting - This is a phenomenon that occurs when the model is not capable of performing well on examples outside of the training set [9]. It occurs when a model starts "memorizing" the input space rather than learning to generalize the patterns encountered. This may have different causes, such as a poor approach to data preprocessing (e.g. filling missing attributes only with the predominant class, or a bad feature selection approach). Several algorithms employ techniques to avoid this issue, favoring a more general model.

• Outliers - This refers to instances that are inconsistent with the rest of the training space, deviating significantly from the other objects. Graphical representations of the input space make it relatively easy to identify outliers, which could represent errors in the input data, and algorithms can take steps to minimize the effect of their presence. There are also methods for outlier detection, using statistical approaches, proximity-based approaches (the proximity of an outlier to its neighbors deviates significantly from that of most other objects in the set), clustering-based approaches (ignoring objects that do not fall into any of the clusters) or classification-based approaches (i.e. finding a classification model able to distinguish normal data from outliers).
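To make the filter approach to feature selection more concrete, the following minimal sketch drops features whose variance does not exceed a threshold. It is written in plain Python for illustration; the function names and the toy data are assumptions, and it is not the implementation used in the experiments of this dissertation.

```python
from statistics import pvariance


def variance_filter(rows, threshold):
    """Return indices of feature columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


def select_columns(rows, keep):
    """Project every instance onto the kept feature indices."""
    return [[row[j] for j in keep] for row in rows]


# Toy data (invented): feature 1 is constant, so it cannot help a classifier.
X = [
    [0, 1, 3.0],
    [1, 1, 1.0],
    [0, 1, 4.0],
    [1, 1, 2.0],
]
kept = variance_filter(X, threshold=0.0)
X_reduced = select_columns(X, kept)
```

A filter like this looks only at the features themselves, never at the classifier, which is what distinguishes it from wrapper and embedded approaches.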

Supervised Machine Learning (ML) techniques currently serve as a methodological backbone for many Natural Language Processing (NLP) activities. Among the most widely used methods in these NLP systems are Support Vector Machines, Hidden Markov Models and Conditional Random Fields.

2.2.1 Classification

Predictive models estimate how likely each outcome is for a given event or instance. These models are capable of mapping new unlabeled objects to the correct labels (classifying them), given the values of their predictive attributes.


2.2.1.1 Support Vector Machines

Support Vector Machines (SVMs) aim to reduce the chances of overfitting by producing a model with high predictive performance and low complexity.

SVMs select a small set of critical boundary instances from each class, called support vectors, and build a linear function that separates the classes as widely as possible [9]. During training, the algorithm tries to increase the margin between the decision boundary and the closest objects, which improves the generalization of the model (and thus reduces the chances of overfitting). The larger the separation, the greater the generalization capability of the obtained model.
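The notion of margin can be illustrated with a small sketch that measures the distance from the closest object to a given linear decision boundary. The data and the two hyperplanes are invented and hand-picked; no SVM solver is involved, the code only compares the margins that such a solver would try to maximize.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))


# Invented 2-D data; labels are +1 / -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two hand-picked separating hyperplanes (not learned by an SVM solver).
wide = geometric_margin((1.0, 1.0), -2.0, points, labels)
narrow = geometric_margin((1.0, 0.0), -1.5, points, labels)
```

Both hyperplanes classify the four points correctly, but the first leaves a larger gap to the closest points, so it is the one a maximum-margin method would prefer.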

In [10] we can see an example of an event extraction framework that uses this technique. In their approach, several features were extracted automatically from the texts. This was done by applying state-of-the-art systems trained on biological corpora for sentence splitting, tokenization and part-of-speech tagging, followed by syntactic and semantic processing. The features used for the feature vectors were token features (stem, lemma, part-of-speech, nearest protein and presence of a symbol or capital letter), frequency features (named entities per sentence, stop words per sentence, bag-of-words counts of sentences and TF-IDF score of the token in the training data set), dependency features (dependency chains, label path to the nearest protein) and shortest path features (n-gram dependencies, n-grams of words, presence of some token of the shortest path in a trigger gazetteer, prepared from the trigger stems in the training set and enriched with WordNet synsets and a pattern induction method).
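Among the frequency features listed above, the TF-IDF score is easy to sketch. The version below assumes one common variant (raw term count multiplied by log(N / document frequency)); the exact weighting used in [10] is not stated here, and the toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: score} dict per document, with tf = raw count and idf = log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return scores


docs = [
    "the protein binds the receptor".split(),
    "the gene encodes the protein".split(),
    "the receptor is phosphorylated".split(),
]
weights = tf_idf(docs)
```

A token that occurs in every document, such as "the", receives a score of zero, while a token confined to a single document, such as "binds", receives the highest weight in that document.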

These extracted features were then used to perform event trigger detection with a machine learning classifier; a kernel-based SVM was chosen because it scales well with high-dimensional data. In the end, the developed system obtained an F-score of 56.83%, with a recall of 50.57% and a precision of 71.81%, a higher score than the first-ranked project in the 2009 edition of BioNLP.

2.2.1.2 Naïve Bayes

In some classification algorithms it is important to estimate the probability of an object to belong to each class.

This method is called Naïve Bayes because it makes use of Bayes’s rule of conditional probability and naïvely assumes independence between the attributes, an assumption that rarely holds in practice. This is why we should check that our attributes are not strongly correlated before applying the algorithm: probabilities may only be multiplied when the events are independent. Naïve Bayes therefore works well when combined with feature selection methods that eliminate redundant (non-independent) attributes [9] [11].

For classification tasks, the Bayes theorem calculates the probability of an object X belonging to a class y, given:

• The probability of the class y occurring;

• The probability of the object X occurring in class y;

• The probability of the object X occurring;


Bayes’ rule says that if you have a hypothesis H and evidence E that bears on that hypothesis, then:

$$P[H \mid E] = \frac{P[E \mid H]\,P[H]}{P[E]}$$

where P[A] is the probability of an event A, and P[A|B] is the probability of A conditional on B.

To predict the class of a new object, if we have m classes, we use the Bayes theorem m times, once for each class. The class with the highest probability is the one assigned to the object.

It has various uses, among them document classification:

Each instance represents a document and the target class is its topic. The document is represented in a bag-of-words style, characterized by the words it contains, with the presence or absence of each word treated as a boolean attribute. In this setting Naïve Bayes is very popular because it is fast and accurate [9]. It is also useful as a member of a classifier committee.
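The document classification setting described above can be sketched in a few lines of plain Python. This is a minimal illustration with boolean word-presence attributes and Laplace smoothing; the function names, the smoothing choice and the toy corpus are assumptions, not the classifier evaluated later in this dissertation.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Estimate class priors and per-class word-presence probabilities (Laplace-smoothed)."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = defaultdict(list)
    for doc, label in zip(docs, labels):
        class_docs[label].append(set(doc))
    priors, likelihoods = {}, {}
    for label, members in class_docs.items():
        priors[label] = len(members) / len(docs)
        likelihoods[label] = {
            w: (sum(w in d for d in members) + 1) / (len(members) + 2)
            for w in vocab
        }
    return vocab, priors, likelihoods


def predict(doc, model):
    """Apply Bayes' rule once per class and return the most probable class."""
    vocab, priors, likelihoods = model
    present = set(doc)
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        # Log-probabilities avoid underflow; P[E] is identical for every class, so it is dropped.
        score = math.log(prior)
        for w in vocab:
            p = likelihoods[label][w]
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy corpus.
docs = [
    ["gene", "expression", "protein"],
    ["protein", "binding", "site"],
    ["stock", "market", "price"],
    ["market", "trade", "price"],
]
labels = ["bio", "bio", "finance", "finance"]
model = train_bernoulli_nb(docs, labels)
```

Calling `predict(["protein", "gene"], model)` scores each class once, as described above, and returns the class with the highest posterior.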

2.2.1.3 Hidden Markov Models

Markovian logic has been used in the natural language processing domain extensively in fields such as speech recognition. It can successfully be used for both part-of-speech (POS) tagging and named-entity extraction [11].

Hidden Markov Models (HMMs) are a special kind of Bayesian networks.

They have been widely used in the extraction of events from texts due to the ability of the Markov model to represent repetitive events and the time dependence of both probabilities and utilities, which allows for a more accurate representation [12].

Most applications of hidden Markov models can be reduced to three basic problems:

1. Find P(T|M) - The probability of a given observation sequence T in a given model M;

2. Find the most likely state trajectory given M and T;

3. Find the model that best accounts for a given sequence;
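The second problem, finding the most likely state trajectory, is commonly solved with the Viterbi algorithm. The sketch below applies it to a tiny part-of-speech tagging model; the states, the probabilities and the observation sequence are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability into s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda pair: pair[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])


# Invented toy model M with two hidden states.
states = ("NOUN", "VERB")
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emit_p = {
    "NOUN": {"proteins": 0.9, "bind": 0.1},
    "VERB": {"proteins": 0.1, "bind": 0.9},
}

probability, path = viterbi(["proteins", "bind"], states, start_p, trans_p, emit_p)
```

The function keeps only the best path into each state at every step, which avoids enumerating all possible trajectories.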

A proposal using this method entered BioNLP in 2013; its authors concluded that their Markov logic approach was comparable to the state-of-the-art approaches of the other participants, with a precision score of 81% [12].

2.2.1.4 Maximum Entropy Modeling

Consider a random process that assigns a value y, a member of the set Y of possible values, influenced by contextual information x, a member of the set X of possible contexts. The task is to build a statistical model that accurately represents the behavior of that random process. The model is a method of estimating the conditional probability of generating y given the context information x [11].


P(x,y) denotes the unknown true joint probability distribution of the random process, and P(y|x) the model we are trying to build, taken from the class M of all possible models. To build this model we are given a set of training samples generated by the random process, i.e. a set of pairs (x_i, y_i).

The degree of uniformity of a model is expressed by its conditional entropy. One should prefer the model p that maximizes the conditional entropy while satisfying any given constraints. Maximum Entropy Modeling serves as a basis for several NLP tasks, such as part-of-speech tagging, and can be used to learn conditional distributions from labeled training data.
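To show what "degree of uniformity" means in practice, the sketch below computes the conditional entropy H(Y|X) = -Σ_x p(x) Σ_y p(y|x) log p(y|x) of two candidate models. The context names, the context probabilities and both models are invented; the code only evaluates the quantity being maximized and does not fit a maximum entropy model under constraints.

```python
import math


def conditional_entropy(context_probs, model):
    """H(Y|X) = -sum_x p(x) sum_y p(y|x) log p(y|x), measured in nats."""
    total = 0.0
    for x, p_x in context_probs.items():
        for p_y in model[x].values():
            if p_y > 0.0:  # the term 0 * log 0 is taken to be 0
                total -= p_x * p_y * math.log(p_y)
    return total


# Invented contexts and two candidate conditional models p(y|x).
context_probs = {"after_det": 0.5, "after_noun": 0.5}
uniform = {
    "after_det": {"NOUN": 0.5, "VERB": 0.5},
    "after_noun": {"NOUN": 0.5, "VERB": 0.5},
}
peaked = {
    "after_det": {"NOUN": 0.9, "VERB": 0.1},
    "after_noun": {"NOUN": 0.2, "VERB": 0.8},
}

h_uniform = conditional_entropy(context_probs, uniform)
h_peaked = conditional_entropy(context_probs, peaked)
```

With no constraints, the uniform model has the highest entropy; constraints derived from training data are what push the preferred model away from uniformity.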

2.2.1.5 Conditional Random Fields

Conditional Random Fields (CRFs) are a model successfully used in information extraction for tasks such as named entity recognition [13], offering advantages over HMMs. A CRF is a conditional model based on maximum entropy. CRFs can handle correlated attributes in the training set, and define a single Maximum Entropy-based distribution over the whole label sequence [11].

"Let X be a random variable over the observation sequences and Y a random variable over the label sequences. All components Yi of Y are assumed to range over a finite set L of labels. The labels roughly correspond to states in finite-state models. The variables X and Y are jointly distributed, but CRF constructs a conditional model p(Y|X) without explicitly modeling the margin p(X)" [11].

A CRF on (X, Y) is specified by a vector f = (f_1, f_2, ..., f_m) of local features and a corresponding weight vector w. Each local feature f(x, y, i) is a real-valued function of the observation sequence x, the label sequence y and the sequence position i. The value of this function at each position i depends only on y_i (state feature) or on y_i and y_{i+1} (transition feature). The global feature vector F(x, y) is the sum of the local features over the n positions of the sequence:

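$$F(x, y) = \sum_{i=1}^{n} f(x, y, i)$$

The conditional probability distribution defined by the CRF model is:

$$p_w(y \mid x) = Z_w(x)^{-1} \exp\big(w \cdot F(x, y)\big)$$

where the normalization factor sums over every label sequence y':

$$Z_w(x) = \sum_{y'} \exp\big(w \cdot F(x, y')\big)$$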
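The formulas above can be checked on a tiny example by enumerating every label sequence to compute Z_w(x). The label set, the two local features and the weights below are hypothetical and hand-picked; brute-force enumeration grows exponentially with the sequence length, so this is only a sketch of the definition and not how CRFs are computed in practice.

```python
import math
from itertools import product

LABELS = ("O", "TRIGGER")  # invented label set


def local_features(x, y, i):
    """Hypothetical local feature vector f(x, y, i): one state and one transition feature."""
    state = 1.0 if y[i] == "TRIGGER" and x[i].endswith("ion") else 0.0
    transition = (
        1.0 if i + 1 < len(y) and y[i] == "TRIGGER" and y[i + 1] == "TRIGGER" else 0.0
    )
    return (state, transition)


def global_features(x, y):
    """F(x, y): sum of the local feature vectors over all positions."""
    vectors = [local_features(x, y, i) for i in range(len(x))]
    return tuple(sum(component) for component in zip(*vectors))


def score(w, x, y):
    """Unnormalized score exp(w . F(x, y))."""
    return math.exp(sum(wk * fk for wk, fk in zip(w, global_features(x, y))))


def conditional_probability(w, x, y):
    """p_w(y | x), with Z_w(x) computed by enumerating every label sequence."""
    z = sum(score(w, x, candidate) for candidate in product(LABELS, repeat=len(x)))
    return score(w, x, tuple(y)) / z


x = ("STAT3", "phosphorylation", "increased")
w = (2.0, -1.0)  # hand-picked weights, not trained
p = conditional_probability(w, x, ("O", "TRIGGER", "O"))
```

Because the weight of the transition feature is negative, sequences with adjacent TRIGGER labels are penalized, and the sequence that labels only "phosphorylation" as a trigger receives the highest probability.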

As with HMMs, there are three main problems related to CRFs:

1. Find P(T|M) - The probability of a given observation sequence T in a given model M;

2. Find the most likely state trajectory given M and T;

3. Find the model that best accounts for a given sequence;


2.2.1.6 Random Forests

Random Forest, or Random Decision Forest, is an ensemble machine learning technique used for both classification and regression. For classification, it constructs multiple decision trees on various sub-samples of the dataset and outputs the mode of the classes predicted by the individual trees (essentially, the trees vote for the most popular class), which improves predictive accuracy and controls overfitting. The process is a combination of tree predictors, where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [14].
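The two ingredients described above, training each tree on a random sub-sample and letting the trees vote, can be sketched as follows. To keep the example short, each "tree" is replaced by a one-feature threshold rule, which is a stand-in and not a real decision tree, and the random selection of features used by Random Forests is left out. All names and data are invented for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) instances with replacement, one sample per tree."""
    return [rng.choice(rows) for _ in rows]


def train_stump(sample):
    """Stand-in for a decision tree: a one-feature threshold rule fit on the sample."""
    threshold = sum(x for x, _ in sample) / len(sample)
    above = Counter(label for x, label in sample if x > threshold)
    below = Counter(label for x, label in sample if x <= threshold)
    fallback = Counter(label for _, label in sample).most_common(1)[0][0]
    hi = above.most_common(1)[0][0] if above else fallback
    lo = below.most_common(1)[0][0] if below else fallback
    return lambda x: hi if x > threshold else lo


def forest_predict(trees, x):
    """Majority vote: the mode of the classes predicted by the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Invented one-dimensional data: (feature value, class label).
data = [
    (0.1, "neg"), (0.2, "neg"), (0.3, "neg"), (0.4, "neg"),
    (0.6, "pos"), (0.7, "pos"), (0.8, "pos"), (0.9, "pos"),
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Individual stumps trained on unlucky samples may be wrong, but the vote in `forest_predict` smooths out such errors, which is the effect the ensemble relies on.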

2.3 Text Mining in Biotechnology

Text mining is a research area where the information overload problem is dealt with by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. It encompasses text (data) preprocessing (text categorization, information extraction, term extraction), the storage of text representations, the techniques to analyze these representations (distribution analysis, clustering, trend analysis and association rules) and the visualization of the results [11].

Its procedure can then be paralleled with that of Data Mining: data preprocessing, then application of the desired algorithms, followed by treatment and visualization of the results. Text Mining then "seeks to extract useful information from data sources through the identification and exploration of interesting patterns" [11].

Text Mining is important in biotechnology to provide tools to extract knowledge from the ever growing scientific literature. It enables for the use of automated methods for exploiting the enor-mous amount of knowledge available in the biomedical literature [15], building more sophisticated systems to deal with the variability issues in this area.

Among common TM tasks in this domain, Entity Recognition tools have gained strong pop-ularity over the years. They use specialized lexical resources, such as ontologies or dictionaries, and training resources like annotated corpora. Relation extraction can range from simple statis-tical heuristics (i.e. estimating term frequency distributions) to combining syntax and semantic sentence processing with NLP techniques. [2]

2.3.1 Common tasks

Three basic types of Text Mining approaches have been frequent in this domain.

Co-occurrence based methods - Look for concepts in the same unit of text and deduce a relationship between them. They are highly error prone and not commonly used today [15].

Rule-based or knowledge-based approaches - Take advantage of prior knowledge. This might take the form of general knowledge about how language is structured, specific knowledge about how biologically relevant facts are stated in the biomedical literature, knowledge about the sets of things that scientists talk about and the relationships between them, etc. They might make use of sophisticated NLP, such as semantic analysis, to help deduce assertions about the classes present in the texts.

Statistical or machine-learning-based approaches - Rely on classifiers that help in various tasks, such as part-of-speech tagging and Named Entity Recognition, with methods that can classify words, sentences or entire documents.

2.3.2 Problems and challenges

Rule-based systems are often assumed to take a long time to develop, while statistical systems often require large amounts of labeled data, which is usually expensive to obtain [15].

Both have to deal with linguistic ambiguity. For example, in biomedical texts, the string fat can be either an adjective or a noun. Either part of speech is entirely plausible in biomedical texts, and PubMed returns almost 112 K hits for that single-word query (and more than 13 K even if we try to restrict the query to genomics by including the disjunction gene OR genetic OR genetics) [15]. In addition, genes in several different species are represented by the string fat.

This also happens because research data in the field has been published across a wide range of sources, each with its own norms. Because of this inconsistency, it has become difficult to develop generic algorithms to handle the data and extract the desired knowledge [16].

2.3.2.1 Named Entity Recognition

The Named Entity Recognition (NER) phase is a basic task-oriented phase of any IE system. In this step, the system tries to identify all mentions of proper names and quantities, such as people’s names, geographic locations, organizations, dates, etc. [11].

In the context of this dissertation, we are interested in biomedical NER and its contribution to knowledge discovery in this domain. In our tasks, it refers to the recognition and classification of named entities such as proteins and genes. For example, the following could be identified as proteins: E-rosette receptor, Tp67, and enkephalinase.

It is essential in any biomedical text mining tool and can be accomplished with pattern matching and machine learning [17]. Some common ML approaches include those already discussed in this document (Hidden Markov Models, Naive Bayes, Maximum Entropy, Conditional Random Fields, etc.).

Good approaches to NER could include the combination of various relevant sources - annotated corpora, dictionaries and ontologies (ontologies can, as mentioned before, provide a "common framework for structured knowledge representation of domain knowledge" [16]).

2.3.3 Co-reference Resolution

An important task in relation extraction, Anaphora Resolution or Co-reference Resolution is the process of matching pairs of text representations that refer to the same entity in the real world [9]. It is focused on identifying expressions that refer to the same object or activity; for example, alpha-2-macroglobulin and Alpha-2-M could be resolved to the same entity [18]. While it is important in text mining applications for common vocabularies, it becomes even more important in the context of biomedical texts, because of the issues discussed in previous sections regarding the multiple existing representations for the various entities and event triggers used in research papers.

2.3.3.1 Part of speech tagging

This task assigns part-of-speech information to each token, and is very important in text processing systems. Such labels can be "adjective, article, noun, proper noun, preposition, verb and other lexical class markers" [18]. It is one of the most important and common first phases of natural language processing and annotates words based on the context in which they appear [9], providing grammatical knowledge about each word.

2.3.4 Tools

Given all the tasks needed for our feature extraction and model exploration purposes, it is essential to look at the current state-of-the-art tools recognized for their role in aiding the development of knowledge extraction systems. They may help in steps as small as tokenization, but also in complex phases like entity recognition. They are essential in this investigation because they make it possible to extract a wide variety of features from our biomedical texts without the cost of "reinventing the wheel" and with the security of, in some cases, using tools that have already earned the trust of the research field.

Some available Entity Recognizers have even achieved an F-score of 88% [1].

2.3.4.1 UMLS

UMLS - Unified Medical Language System - is a resource provided by the National Library of Medicine and is important for the unification of knowledge representation in this domain. "UMLS integrates over 2 million names for some 900 000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts" [19].

The effort towards the UMLS was born from the need to make the wide variety of resources available to biomedical researchers interoperable with each other. The issue is in the terminology used - not all resources have adopted a common vocabulary, even though they may refer to the same concept.

UMLS is divided into three groups: Metathesaurus, Semantic Network and SPECIALIST Lexicon [16].

• Metathesaurus - Holds all biological terms in the database. It is essentially a "repository of inter-related biomedical concepts" [19]: biomedical knowledge consisting of concepts classified by semantic type, and both hierarchical and non-hierarchical relationships among concepts.

• Semantic Network - Connects the terms from the Metathesaurus semantically. Provides "high-level categories used to categorize every Metathesaurus concept" [19].

The use of this valuable tool is free of charge; users are only required to register and sign a license agreement before obtaining access to the data. Under the agreement, all the data can be used for research purposes within an institution.

Another valuable resource provided with UMLS is MetaMap [20]. This tool maps concepts in texts to the UMLS Metathesaurus, providing efficient access to the knowledge of the Metathesaurus, the largest thesaurus in the biomedical domain.

2.3.4.2 Genia Tagger

Genia Tagger was used in various submissions to the BioNLP Genia Task and provides useful functions for biomedical text preprocessing, such as part-of-speech tagging and named entity recognition. It was trained not only on biomedical texts but also on general-domain corpora such as the Wall Street Journal, which has been shown to improve accuracy scores⁴.

2.3.4.3 Stanford Parser

The Stanford Parser is a state-of-the-art text parsing tool, already widely used in the task of biological event extraction, as in [21].

It is a "program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences."⁵

2.3.4.4 Stanford CoreNLP

Stanford CoreNLP [22] is essentially an integration of the Stanford NLP tools, which "provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs" [23], and offers them as a vast framework to automatically explore human-written text.

It has become a well-known reference tool for extracting knowledge from human text, and is capable of identifying many useful text features, such as the base forms of words and their parts of speech, as well as performing Named Entity Recognition and Sentiment Analysis, among others. Some of the tool’s abilities can be observed in Figure 2.1.

It is designed to be flexible and extensible, allowing tools to be swapped quickly and configured with different parameter options.

⁴ http://www.nactem.ac.uk/GENIA/tagger/
⁵ http://nlp.stanford.edu/software/lex-parser.shtml

Figure 2.1: Showcase of Stanford CoreNLP features

2.3.4.5 SpaCy

SpaCy is a tool for Python that offers "Industrial-strength natural language processing" [24]. Among the companies that trust SpaCy for automatic text processing are the well-known web platform Quora, Chartbeat, DueDil and Wayblazer, a recommendation engine. It is a library for applying NLP models in production.

To help people get familiar with the tool’s Dependency Parser, ExplosionAI⁶ made a dependency visualizer available online through DisplaCy⁷. This way, a new user can see how SpaCy parses sentences without having to install it first, and get a grasp of its functionality. Arrows point from children to heads, and are labeled by their relation type.

Figure 2.3: DisplaCy’s parsing of a sentence from our dataset

2.3.5 Google’s SyntaxNet

A very recent tool, Google’s open source natural language parser was launched in May 2016 and has been presented as the most accurate published natural language parser. Compared to SpaCy, its models are more accurate, load faster and consume less memory, but SpaCy runs faster and includes built-in sentence segmentation. SyntaxNet is a neural network framework implemented in TensorFlow⁸. It provides everything needed for users to train their own models using their own data, and also includes Google’s own Parsey McParseface, an English parser that scored 94.5% on the Wall Street Journal corpus according to Google. Google also states that, based on its own in-house studies, human syntactic annotators agree in at least 96% of the cases, meaning SyntaxNet is approaching human performance on well-structured text. To obtain results for badly formatted text, the model was tested against sentences from across the web⁹. Google’s syntactic parser achieved 90% on this type of sentence.

⁶ https://explosion.ai/
⁷ https://demos.explosion.ai/displacy/
⁸ https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
⁹ https://research.googleblog.com/2011/03/building-resources-to-syntactically.html

2.4 Related Work

In the past years, several solutions that tackle the issue of biomedical event extraction have been developed, most of them in the context of the BioNLP Shared Task [1]. We give an overview of some of these solutions in this section.

2.4.1 TrigNER

TrigNER is a tool for biomedical event trigger recognition, "using an optimization algorithm that allows the tool to adapt itself to corpora with different events and domains" [25]. Its implementation and study provided new insight into the linguistic and contextual complexity of each event trigger and the associated requirements. It also showed that CRFs are a viable approach to the recognition of event trigger words.

The selected feature set used in this project consisted of:

Concept-based features:

• Tags - if a token is part of a concept name, a tag such as "Concept=Protein" is added to the token’s associated features;
• Names - name of the concept the token is a part of;
• Head - head token of the concept name;
• Counting - number of annotations per concept type, e.g. "Num_Protein=2".

External resources:

• Dictionaries of specific domain terms and trigger words are used as features, e.g. if "coexpressed" is matched, the feature "Trigger=Gene_expression" is added to the token.

Dependency parsing: To complement the findings of local analysis, TrigNER extracts features derived from dependency parsing. These include:

• Verb lemmas for which the token acts as a subject;
• Nouns for which the token acts as a modifier;
• Modifiers of each token;
• Input and output dependencies, considering "inherent dependency", lemma, POS and chunk tags.

By analyzing the dependency parse graph, further features are extracted, relating for example a token with the closest biomedical entity. From this analysis, the following are taken, for both dependency and shortest paths:

• Edge path - path of edge labels between tokens;
• Vertex path - path of the actual tokens between them;
• Edge n-grams - n-grams of edge labels between two tokens;
• Vertex n-grams - n-grams of features between the two tokens.

As for some of the main findings in TrigNER’s article [25], it was concluded that higher-order CRFs obtained better results. It is also worth noting that the amount of annotation provided for each event type has an impact on the feature set complexity: the fewer the samples, the less feature complexity was required to model the low heterogeneity present in the corpus for that event.

Regarding feature impact, it was observed that local context features, which aggregate information from surrounding tokens, unexpectedly had a low impact on performance, and that shortest path features were more relevant than dependency path features, showing that establishing a relation with concept names in the sentence is fundamental in the recognition of event trigger words.
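The path-based features described above (edge path, vertex path and their n-grams) can be sketched with a breadth-first search over the dependency graph. This is an illustrative reconstruction and not TrigNER’s code: the function names, the example parse and the use of token strings as node identifiers are assumptions (a real system would use token indices, since words can repeat).

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over an undirected view of the dependency graph.

    edges: iterable of (head, label, child) triples.
    Returns (vertex_path, edge_path), or (None, None) if no path exists.
    """
    graph = {}
    for head, label, child in edges:
        graph.setdefault(head, []).append((child, label))
        graph.setdefault(child, []).append((head, label))
    queue = deque([(source, [source], [])])
    seen = {source}
    while queue:
        node, vertices, labels = queue.popleft()
        if node == target:
            return vertices, labels
        for neighbour, label in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, vertices + [neighbour], labels + [label]))
    return None, None


def ngrams(items, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical parse of "IL-10 suppresses phosphorylation of CD28".
edges = [
    ("suppresses", "nsubj", "IL-10"),
    ("suppresses", "dobj", "phosphorylation"),
    ("phosphorylation", "prep", "of"),
    ("of", "pobj", "CD28"),
]
vertex_path, edge_path = shortest_path(edges, "phosphorylation", "CD28")
edge_bigrams = ngrams(edge_path, 2)
```

Here the candidate trigger "phosphorylation" is linked to the entity "CD28" through the vertex path phosphorylation, of, CD28 and the edge path prep, pobj.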

2.4.2 Turku Event Extraction System

Feature study for Event Extraction

The basis for the approach of this project was the analysis of how dependency parses correlate with event annotations. "The shortest path is the primary source of features for interactions, whether they are binary relations or event arguments. For triggers and other nodes, the immediate dependency context is used to define features." [26] It uses the Porter Stemmer to derive the stem of each word, and both the stemmed and unstemmed words are used as features, so that the same word can be detected in different inflected forms.

2.5 Conclusions

The study of the related work and the research required for the development of a tool for the identification of biological processes helped to identify the tasks and modules needed for the project, presented in the next chapter. There are some NLP tools with good performance that can be incorporated in the project, which requires deciding on a trade-off between minimizing work and adding external dependencies. Many classification approaches have also been used in this type of task, but none was found that stood out in all cases. Vector Space Modeling looks promising, but before a decision is made, the algorithms should be compared, together with the Active Learning (AL) approach, on the Genia annotated corpora and the scores obtained. There are also decisions to be made on the AL approach, namely the stop criterion and the query method. Here too, experiments should be run and scores compared; [27] provides a good methodology to start with.

Chapter 3

Extracting Biological Events From Texts

The methodologies used to reach a solution for the extraction of biological events are described in this chapter. We start by describing the text processing techniques that were tried, followed by an overview of the feature set and feature tuning.

3.1 Dataset

The annotated dataset provided by the BioNLP organization is vast and includes many different types of events. In 775 files, it contains 8548 sentences and 18604 denotations (the annotations that help identify entities and event triggers).

Below is a simple example demonstrating the structure of each file in the dataset¹.

{
  "target": "http://pubannotation.org/docs/sourcedb/PMC/sourceid/1064873/divs/10",
  "sourcedb": "PMC",
  "sourceid": "1064873",
  "divid": 10,
  "text": "Resistance to IL-10 inhibition of IFN-\u03B3 production in RA CD4+ T cells\nThe CD28 costimulatory pathway is crucial for effective antigen-specific T-cell cytokine production, and IL-10 can directly suppress this response by inhibiting CD28 tyrosine phosphorylation and binding of phosphatidylinositol 3-kinase [24]. To evaluate...",
  "project": "bionlp-st-ge-2016-test-proteins",
  "denotations": [
    {
      "id": "T1",
      "span": {
        "begin": 14,
        "end": 19
      },
      "obj": "Protein"
    },
    {
      "id": "T2",
      "span": {
        "begin": 34,
        "end": 39
      },
      "obj": "Protein"
    },
    ...
  ],
  "relations": [
    {
      "id": "R9",
      "pred": "themeOf",
      "subj": "T10",
      "obj": "E9"
    },
    ...
  ]
}

¹ The example given for the dataset has the text and object lengths reduced so as not to occupy too much space. To

Each Denotation object has an id, a span (relative to the whole text), and the class it belongs to as obj. The possible classes for the denotation objects are as follows:

• Acetylation
• Binding
• DNA
• Deacetylation
• Entity
• Gene_expression
• Localization
• Negative_regulation
• Phosphorylation
• Positive_regulation
• Protein
• Protein_catabolism
• Protein_domain
• Protein_modification
• Regulation
• Transcription
• Ubiquitination

However, some of these object types have too few samples to train a model with. For example, there are only 2 instances of the Acetylation object, 1 of the DNA type, 1 of Deacetylation, etc. When showing the metrics from a run of a model, the support of each class is also shown so that this fact can be taken into account when judging the different approaches.

As for the Relation objects, they also have their own id, and relate two denotations, subject (subj) and object (obj) through the predicate (pred). Each relation may belong to one of the following classes:

• themeOf

• causeOf

• equivalentTo
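The structure described in this section can be read with a few lines of Python. The sketch below uses a shortened copy of the example document and assumes that span offsets are character offsets into the text with an exclusive end, which is consistent with the two denotations shown in the example; the helper name `denotation_text` is hypothetical.

```python
import json

# Shortened document in the layout of the dataset example (text truncated).
raw = r'''
{
  "text": "Resistance to IL-10 inhibition of IFN-\u03B3 production in RA CD4+ T cells",
  "denotations": [
    {"id": "T1", "span": {"begin": 14, "end": 19}, "obj": "Protein"},
    {"id": "T2", "span": {"begin": 34, "end": 39}, "obj": "Protein"}
  ],
  "relations": [
    {"id": "R9", "pred": "themeOf", "subj": "T10", "obj": "E9"}
  ]
}
'''

doc = json.loads(raw)


def denotation_text(doc, denotation):
    """Return the surface string covered by a denotation's span."""
    span = denotation["span"]
    return doc["text"][span["begin"]:span["end"]]


# (id, class, surface string) for every denotation.
mentions = [(d["id"], d["obj"], denotation_text(doc, d)) for d in doc["denotations"]]

# (subject, predicate, object) for every relation.
triples = [(r["subj"], r["pred"], r["obj"]) for r in doc["relations"]]
```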

3.2 Text Processing

To visualize in a simple way the whole process each text goes through when analyzed by the approach developed, Figure 3.1 shows the pipeline. It should also help the reader understand where each of the tools and methods described throughout this chapter steps in to help build the classification models.


Figure 3.1: Text processing scheme


3.2.0.1 Genia Tagger

Genia Tagger, mentioned in the previous chapter, provides valuable information for biomedical text processing. It parses each sentence by dividing it into tokens and returning, for each token, its base form (the form of a word to which suffixes and prefixes can be added to create new words), part-of-speech tag, chunk tag (in the IOB2 format), and Named Entity tag (Protein, DNA, RNA, Cell Line, Cell Type) [28]. To make use of Genia Tagger’s knowledge in the context of this dissertation, some tweaks had to be made, which are explained in this section.

For each sentence passed, Genia Tagger returns a list of tokens and their characteristics:

(word1, base1, POStag1, chunktag1, NEtag1),
(word2, base2, POStag2, chunktag2, NEtag2),
(  :  ,   :  ,    :   ,     :    ,   :   ),
(  :  ,   :  ,    :   ,     :    ,   :   ),

To contextualize, below is the output of the parsing for the sentence "We also found that the CD40-mediated activation of NF-κB and the stress-activated protein kinase c-Jun kinase (JNK) was defective in HOIP-deficient cells.", found in our dataset.

[('We', 'We', 'PRP', 'B-NP', 'O'),
 ('also', 'also', 'RB', 'B-ADVP', 'O'),
 ('found', 'find', 'VBD', 'B-VP', 'O'),
 ('that', 'that', 'IN', 'B-SBAR', 'O'),
 ('the', 'the', 'DT', 'B-NP', 'O'),
 ('CD40-mediated', 'CD40-mediated', 'JJ', 'I-NP', 'O'),
 ('activation', 'activation', 'NN', 'I-NP', 'O'),
 ('of', 'of', 'IN', 'B-PP', 'O'),
 ('NF-\xce\xbaB', 'NF-\xce\xbaB', 'NN', 'B-NP', 'B-protein'),
 ('and', 'and', 'CC', 'O', 'O'),
 ('the', 'the', 'DT', 'B-NP', 'O'),
 ('stress-activated', 'stress-activated', 'JJ', 'I-NP', 'B-protein'),
 ('protein', 'protein', 'NN', 'I-NP', 'I-protein'),
 ('kinase', 'kinase', 'NN', 'I-NP', 'I-protein'),
 ('c-Jun', 'c-Jun', 'NN', 'I-NP', 'I-protein'),
 ('kinase', 'kinase', 'NN', 'I-NP', 'I-protein'),
 ('(', '(', '(', 'O', 'O'),
 ('JNK', 'JNK', 'NN', 'B-NP', 'B-protein'),
 (')', ')', ')', 'O', 'O'),
 ('was', 'be', 'VBD', 'B-VP', 'O'),
 ('defective', 'defective', 'JJ', 'B-ADJP', 'O'),
 ('in', 'in', 'IN', 'B-PP', 'O'),
 ('HOIP-deficient', 'HOIP-deficient', 'JJ', 'B-NP', 'B-cell_type'),
 ('cells', 'cell', 'NNS', 'I-NP', 'I-cell_type'),
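To make the IOB2 named-entity column of this output easier to read, consecutive B- and I- tags can be grouped into whole entity mentions. The following is a minimal sketch of that grouping; the function name `group_entities` is hypothetical, and the input is a slice of the output above.

```python
def group_entities(tagged_tokens):
    """Collect (entity_type, mention) pairs from IOB2 named-entity tags."""
    entities = []
    current = None  # (entity_type, [words]) of the entity being built, if any
    for word, _base, _pos, _chunk, ne_tag in tagged_tokens:
        if ne_tag.startswith("B-"):
            current = (ne_tag[2:], [word])
            entities.append(current)
        elif ne_tag.startswith("I-") and current is not None and current[0] == ne_tag[2:]:
            current[1].append(word)
        else:
            current = None  # an O tag (or an inconsistent I- tag) closes the entity
    return [(etype, " ".join(words)) for etype, words in entities]


tokens = [
    ("the", "the", "DT", "B-NP", "O"),
    ("stress-activated", "stress-activated", "JJ", "I-NP", "B-protein"),
    ("protein", "protein", "NN", "I-NP", "I-protein"),
    ("kinase", "kinase", "NN", "I-NP", "I-protein"),
    ("c-Jun", "c-Jun", "NN", "I-NP", "I-protein"),
    ("kinase", "kinase", "NN", "I-NP", "I-protein"),
    ("(", "(", "(", "O", "O"),
    ("JNK", "JNK", "NN", "B-NP", "B-protein"),
    (")", ")", ")", "O", "O"),
]
entities = group_entities(tokens)
```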


After observing the output, it is evident that it does not match the structure of our dataset, where the only information we have about each denotation is its class and its span relative to the whole text. To connect this information to our denotations, a preprocessing step was required. The procedure consisted of first creating the functions getDenotationSentence and get_denotation_delimeters_insentence, which receive the denotation’s delimiters relative to the whole text and return the index of the sentence to which it belongs and the denotation’s delimiters relative to that sentence, which is what we actually wanted. This allowed us to pass that information to a function addGeniaInfoToDenotation, which matches the token information from the Genia Tagger parse to the denotation. The delimiters helped handle the cases where a word is repeated across the sentence, for which a simple pattern-matching solution would not be enough and would lead to mistakes. This approach was able to handle almost all the denotations and append Genia Tagger’s information to them.
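The offset conversion described above can be sketched as follows. This is not the code of getDenotationSentence, get_denotation_delimeters_insentence or addGeniaInfoToDenotation, which is not shown in this document; the function names, the example text and the assumption that sentences and tokens appear in order in the text are all hypothetical.

```python
from bisect import bisect_right


def sentence_offsets(text, sentences):
    """Start offset of each sentence string within the full text."""
    starts, cursor = [], 0
    for sent in sentences:
        start = text.index(sent, cursor)
        starts.append(start)
        cursor = start + len(sent)
    return starts


def locate_denotation(starts, begin, end):
    """Map document-level offsets to (sentence index, begin, end) within that sentence."""
    idx = bisect_right(starts, begin) - 1
    return idx, begin - starts[idx], end - starts[idx]


def tokens_in_span(sentence, words, begin, end):
    """Indices of the tokens whose character range overlaps [begin, end)."""
    hits, cursor = [], 0
    for i, word in enumerate(words):
        start = sentence.index(word, cursor)  # search only after the previous token
        cursor = start + len(word)
        if start < end and cursor > begin:
            hits.append(i)
    return hits


text = "CD28 binds PI3K. IL-10 blocks CD28 and CD28 signalling."
sentences = ["CD28 binds PI3K.", "IL-10 blocks CD28 and CD28 signalling."]
starts = sentence_offsets(text, sentences)

# A denotation on the *second* "CD28" of the second sentence (document offsets 39-43).
sent_idx, begin, end = locate_denotation(starts, 39, 43)
words = ["IL-10", "blocks", "CD28", "and", "CD28", "signalling", "."]
matched = tokens_in_span(sentences[sent_idx], words, begin, end)
```

Because the match is made on offsets, the repeated word is resolved to token 4 and not to the earlier occurrence at token 2, which a plain string search would return.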

3.2.1 SpaCy

To add more valuable grammatical information to the dataset, it was essential to extract information related to dependency parsing. The first idea was to use a Python implementation of the renowned Stanford CoreNLP (mentioned in the previous chapter), but with the processing power available, extracting dependencies for the whole dataset took nearly 7 hours; that was the price to pay for all the valuable information offered by the CoreNLP parser. However, since the main interest was to study the impact of the dependency features only, SpaCy could do that task with less processing power, reducing the extraction time to 4 hours while also providing a Python API. In 2015, two peer-reviewed papers reported it to be the fastest syntactic parser in the world, with accuracy within 1% of the best available.

SpaCy’s dependency parser provides easy ways to navigate each token’s ancestors and successors and to access the sentence’s parse tree. It uses the terms Head and Child to refer to the words connected by a single arc, and Dep to label the connection between them, describing the syntactic relation between the Child and the Head. This allowed for a range of experiments that overall increased the models’ prediction scores, among them appending a denotation’s left-most and right-most successors. The different results obtained using this functionality are discussed in greater detail in the next chapter.
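The Head/Child navigation described above can be sketched without the parser itself, using a hand-written parse in which every token stores the index of its head. The `Token` type, the example parse and the reading of the in_* features as "the token's head" are assumptions for this illustration; with SpaCy, the same information would come from the parsed tokens.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    lemma: str
    pos: str
    head: int  # index of the head token; the root points to itself
    dep: str   # label of the arc between this token (Child) and its Head


def children(tokens, i):
    """Indices of the tokens whose head is token i."""
    return [j for j, t in enumerate(tokens) if t.head == i and j != i]


def dependency_features(tokens, i):
    """Left-most and right-most child, plus head, of token i."""
    kids = children(tokens, i)
    left = [j for j in kids if j < i]
    right = [j for j in kids if j > i]
    feats = {}
    if left:
        feats["left_lemma"], feats["left_pos"] = tokens[left[0]].lemma, tokens[left[0]].pos
    if right:
        feats["right_lemma"], feats["right_pos"] = tokens[right[-1]].lemma, tokens[right[-1]].pos
    if tokens[i].head != i:
        head = tokens[tokens[i].head]
        feats["in_lemma"], feats["in_pos"] = head.lemma, head.pos
    return feats


# Hypothetical parse of "IL-10 inhibits CD28 phosphorylation".
tokens = [
    Token("IL-10", "IL-10", "PROPN", 1, "nsubj"),
    Token("inhibits", "inhibit", "VERB", 1, "ROOT"),
    Token("CD28", "CD28", "PROPN", 3, "compound"),
    Token("phosphorylation", "phosphorylation", "NOUN", 1, "dobj"),
]
verb_features = dependency_features(tokens, 1)
entity_features = dependency_features(tokens, 2)
```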

3.2.2 UMLS Metamap

MetaMap is a tool developed to map biomedical text to the UMLS Metathesaurus. Since this tool gives users the possibility to find interesting biomedical concepts in free text, it was decided that its impact on the feature set should be studied. MetaMap was even present in the BioNLP workshop of 2014, with the title "Using MetaMap: A Tutorial", further demonstrating its relevance to the extraction of biological events.

To see all the concepts MetaMap is capable of identifying, refer to Appendix A.

To avoid discovering concepts that are unnecessary in the context of this dissertation, the ones thought to be relevant to biological events were selected and their impact analyzed. Among them were: Chemical, Gene, Entity, Phenomenon.

3.3 Feature Set

After the processing described in the past few sections, we have a vast feature set:

• Chunk - Extracted via Genia Tagger, which detects chunks such as noun chunks or verb chunks. As mentioned previously, it follows the IOB2 format, which means it has a prefix of either B (begin), I (inside) or O (outside).
• NER - Named Entity information extracted via Genia Tagger, which was built upon biomedical corpora with a reported performance of 98.26%, meaning it should detect most of the entities. It also follows the IOB2 format (there may be cases where an entity is represented by more than one token).
• POS - Part-of-speech tag extracted via Genia Tagger.
• hasGreekSymbol - Boolean value indicating the presence or absence of Greek symbols, which may help in the detection of proteins or genes.
• hasNumber - Boolean value indicating the presence or absence of a number within the token.
• hasSymb - Boolean value indicating the presence or absence of special symbols (_, - or ;) within the token.
• initUp - Boolean value activated when the first character of the token is an uppercase letter.
• upperCount - Number of uppercase letters within the token.
• Length - Length of the token.
• Outward dependencies - left_lemma, left_pos, right_lemma, right_pos.
• Inward dependencies - in_lemma, in_pos.
• MetaMap concepts - Concepts found by MetaMap. Initially with the boolean values isChemical, isEntity, isGene and isPhenomenon.
• nConceptsFound - Number of MetaMap concepts found in the sentence to which the token belongs.
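The orthographic features in this list can be computed directly from the token string. The sketch below is one possible implementation, not the code used in this work; in particular, detecting Greek symbols through the Unicode range U+0370 to U+03FF is an assumption.

```python
import re

# Assumption: "Greek symbol" means any character in the Greek and Coptic Unicode block.
GREEK = re.compile(r"[\u0370-\u03FF]")


def orthographic_features(token):
    """Surface-form features of a single token."""
    return {
        "hasGreekSymbol": bool(GREEK.search(token)),
        "hasNumber": any(ch.isdigit() for ch in token),
        "hasSymb": any(ch in "_-;" for ch in token),
        "initUp": token[:1].isupper(),
        "upperCount": sum(ch.isupper() for ch in token),
        "length": len(token),
    }


nfkb = orthographic_features("NF-\u03baB")  # the token NF-κB from the Genia Tagger example
il10 = orthographic_features("IL-10")
```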


3.4 Identification of Entities and Biological Event Triggers

3.5 Modeling Process

A problem we are usually faced with in Text Mining is the high dimensionality a dataset can reach: even in small documents such as newspaper articles or abstracts, the total number of different words used can be huge, and even in a sparse dataset, it can reach thousands of non-zero components. Many of these words may be useless to the classification task and could even add noise to the prediction process. Removing irrelevant words is a crucial step towards improving model prediction (by eliminating noise features) and decreasing computing costs (by working with a smaller, more useful subset of features).

In this section we dive into two different techniques used to explore this issue and observe their impact in the context of our dataset.

3.5.1 Dimensionality Reduction By Feature Extraction

One way to reduce the number of dimensions is to create a smaller set of synthetic features derived from the original dataset [11]. Term clustering, for example, is an approach where the problem of synonymy is addressed by grouping together words with a high degree of semantic relatedness. Then, those cluster groups are used as features instead of the original dataset words. However, in our dataset, we are not dealing with a collection of documents, and our text features do not refer only to a single word: entities and triggers are often represented by more than one word. The technique experimented with was to use WordNet [29], a large lexical database for the English language, where nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a different concept. We could thus essentially parse all our text data with it, reduce dimensionality and compare the results before and after.

3.5.2 Feature Selection

To estimate the relevance of each feature to the prediction scores of the models, some feature selection studies were made. Since Scikit-Learn offers some good feature selection extensions, the work in this area was based on the platform’s sklearn.feature_selection module, with the aim of improving the estimators’ accuracy and maximizing their performance.

There are some different ways to compute the relevancy of a feature to the classification task.

3.5.2.1 Variance Threshold

Particularly useful for sparse datasets like the one we have in hand (it became quite sparse after vectorization, which is normal in text mining datasets), this technique is focused on removing features whose variance does not meet a given threshold. It acts on the principle that the higher the variance of a feature, the more that feature can explain in the dataset, which makes selecting features with a high variance essential for improving the model’s accuracy.

Table 3.1: Result metrics of a Random Forest classification before feature selection with variance threshold=0.8

Class                  Precision  Recall  F-Score  Support
Binding                0.60       0.29    0.39     291
Entity                 0.85       0.65    0.74     214
Gene_expression        0.59       0.66    0.62     662
Localization           0.57       0.33    0.42     125
Negative_regulation    0.36       0.42    0.39     503
Phosphorylation        0.82       0.78    0.80     143
Positive_regulation    0.47       0.57    0.52     835
Protein                0.97       0.98    0.98     6107
Protein_catabolism     0.33       0.19    0.24     26
Protein_modification   1.00       0.67    0.80     6
Regulation             0.38       0.23    0.28     284
Transcription          0.52       0.42    0.47     99
Ubiquitination         1.00       0.33    0.50     3
Avg/Total              0.82       0.82    0.82     9298

An experiment was run on the dataset to test the impact of the variance threshold technique: a Random Forest classifier with 200 estimators was run first on the full feature set, and then again after applying a variance threshold selection of 0.8. Since most of the features are boolean (after vectorization), this makes it possible to eliminate those that have the same value in more than 80% of the cases. Tables 3.1 and 3.2 show the metrics before and after the selection.
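The selection rule described above can be sketched in plain Python. For a boolean feature that is 1 with probability p, the variance is p(1 - p), so "the same value in more than 80% of the cases" corresponds to a variance below 0.8 * (1 - 0.8) = 0.16. Whether the experiment passed 0.8 or 0.16 to the library is not stated here, so the cut-off used below is an assumption, and the toy matrix is made up for the example.

```python
def variance(column):
    """Population variance of a list of numbers."""
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def variance_threshold(rows, threshold):
    """Indices of the columns whose variance is strictly above the threshold."""
    columns = list(zip(*rows))
    return [j for j, col in enumerate(columns) if variance(col) > threshold]


p = 0.8
threshold = p * (1 - p)  # assumed cut-off: drop booleans constant in more than 80% of rows

# Toy boolean feature matrix: one row per token, one column per feature.
rows = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
kept = variance_threshold(rows, threshold)
```

The first two columns take the same value in five of the six rows (variance of about 0.14) and are dropped; the third column is balanced (variance 0.25) and is kept.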

3.5.2.2 Univariate Feature Selection

The purpose of univariate feature selection is to measure, for each feature individually, the strength of its relationship with the class labels. There are multiple statistical tests one can use to do this, among them the "chi2" (chi-squared) statistical test for non-negative features (available through Scikit Learn and recommended by the platform documentation for sparse datasets) and the "Mutual information and maximal information coefficient (MIC)".

With these types of univariate statistical tests, the features are selected based on the test results. The two methods of univariate feature selection presented above (chi2 and Mutual Information) are explained below, followed by a small example demonstrating their impact with an experiment on our biological events data.

3.5.2.2.1 Chi2

Available as a module in Scikit Learn, it computes chi-squared statistics between each non-negative feature and the class. To apply this technique, the features must contain only non-negative values, as with boolean features and frequency features (i.e. term frequencies), which is the case for our data. After the selection process, the feature set should be trimmed of the features that prove to be independent of the class and are therefore irrelevant to the predictive accuracy of the desired model.

Table 3.2: Result metrics of a Random Forest classification after feature selection with variance threshold=0.8

Class                  Precision  Recall  F-Score  Support
Binding                0.59       0.39    0.47     291
Entity                 0.84       0.65    0.73     214
Gene_expression        0.66       0.69    0.67     662
Localization           0.63       0.44    0.52     125
Negative_regulation    0.36       0.43    0.40     503
Phosphorylation        0.84       0.80    0.82     143
Positive_regulation    0.48       0.58    0.53     835
Protein                0.98       0.98    0.98     6107
Protein_catabolism     0.33       0.31    0.32     26
Protein_modification   1.00       0.67    0.80     6
Regulation             0.37       0.23    0.28     284
Transcription          0.55       0.47    0.51     99
Ubiquitination         0.50       0.33    0.40     3
Avg/Total              0.83       0.83    0.83     9298
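The chi-squared scoring of Section 3.5.2.2.1 can be sketched in plain Python. For each feature, the observed sum of its values within each class is compared with the sum expected if the feature were independent of the class; a score near zero suggests independence. The data, the class names used as labels and the helper names are hypothetical, and this is an illustrative sketch rather than the Scikit Learn implementation used in the pipeline.

```python
from collections import defaultdict


def chi2_scores(rows, labels):
    """Chi-squared score of each non-negative feature against the class labels."""
    if any(value < 0 for row in rows for value in row):
        raise ValueError("chi-squared scoring requires non-negative features")

    n_samples = len(rows)
    n_features = len(rows[0])
    feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]

    class_counts = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    scores = [0.0] * n_features
    for label, count in class_counts.items():
        class_prob = count / n_samples
        for j in range(n_features):
            # Expected sum of feature j in this class under independence.
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed[label][j] - expected) ** 2 / expected
    return scores


def select_k_best(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: feature 0 follows the class, feature 1 does not.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["Binding", "Binding", "Regulation", "Regulation"]

scores = chi2_scores(X, y)
best = select_k_best(scores, 1)
print(scores, best)
```

Here the feature that follows the class receives the higher score and is the one retained, while the feature that is evenly spread over both classes scores zero.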

3.5.2.2.2 Mutual Information

Mutual Information (MI) is a more robust option for estimating correlation; here it is estimated for a discrete target variable. MI is a non-negative value representing the dependency between two variables. The function, also implemented in the Scikit Learn distribution [30], "relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances" as theorized in [31].
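To make the quantity concrete, the sketch below computes mutual information for a discrete feature and a discrete class directly from co-occurrence counts (the plug-in estimate, in nats). This is deliberately not the k-nearest-neighbour entropy estimator quoted above; it is a simpler stand-in that applies when both variables are discrete, as with boolean features. The data and the function name are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Plug-in estimate of I(feature; labels) in nats for discrete variables."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        ratio = count * n / (feature_counts[x] * label_counts[y])
        mi += p_xy * math.log(ratio)
    return mi


# Hypothetical toy data with two balanced classes.
y = ["Binding", "Binding", "Regulation", "Regulation"]
dependent = [1, 1, 0, 0]    # fully determined by the class
independent = [1, 0, 1, 0]  # evenly spread over both classes

mi_dependent = mutual_information(dependent, y)
mi_independent = mutual_information(independent, y)
print(mi_dependent, mi_independent)
```

A feature that fully determines a balanced binary class reaches log(2) nats, and a feature that is independent of the class scores zero, which matches the reading of MI as a non-negative measure of dependency.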

3.6 Conclusions

To finish this chapter, in which we explained the methodology used throughout our process, some conclusions can be drawn. Firstly, the choice of Python for the implementation of the pipeline proved to be a good one. Although I was not familiar with it, it was easy to learn, and a vast selection of lexical programs and natural language processing tools is available on the platform. Even the tools that are not natively available, like Stanford CoreNLP and UMLS Metamap, have community-made Python wrappers, so their integration into the processing was still possible.

One thing that was not anticipated was the processing power required by some of the tools, along with some of the difficulties found when matching their output to the dataset (e.g. the Genia Tagger case, where some tweaks had to be made when it was time to intersect the Genia Tagger data with the data set). This issue turned out to be an obstacle in the development of the data preprocessing step, especially before the discovery of SpaCy, the alternative used to replace Stanford CoreNLP. Since this discovery was made quite late, when the pipeline was nearly finished, some time had already been lost, even though SpaCy served as precious help in finalizing the whole process. Nevertheless, we have a wide variety of extracted features, and even if some of them do not prove valuable to the performance of the models, they provide a good starting point for a first exploration and experimentation in the field.


Chapter 4

Experiments On Identification of Biological Events

This chapter describes the results achieved, comparing different settings and the impact they have on the score of the predictive model, and discusses the possible reasons behind them.

4.1 Classification Algorithms and Parameterization

To start this analysis, we compare the overall performance of different classification algorithms on the same subsets of features. We begin with this approach because, once we have determined which algorithm makes the best predictions with our features overall, we can use that algorithm for the feature impact study and feature selection in the following sections, keeping the same model and parameters so that results are easier to compare.

The chosen algorithms were Multinomial Naive Bayes, Linear SVC (SVM classification) and Random Forest, based on good results reported for these algorithms in the literature and on good results seen in experimental runs.

The best parametrization for each algorithm was determined with Scikit Learn's grid search (GridSearchCV).
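The idea behind a grid search is to try every combination of candidate parameter values and keep the best-scoring one. The sketch below shows only that exhaustive loop in plain Python; the parameter grid and the scoring function are hypothetical stand-ins, and in an actual run the score would be a cross-validated metric of the classifier trained with those parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a Random Forest style classifier.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10]}


def toy_score(params):
    """Stand-in for a cross-validated score; peaks at 200 estimators, depth 10."""
    return -abs(params["n_estimators"] - 200) - abs(params["max_depth"] - 10)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

Because the cost grows with the product of the number of candidate values per parameter, the grid is usually kept small for each algorithm.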

4.1.1 With Genia Tagger Features

In this section the results of the three algorithms are compared on the feature subset that includes the Genia Tagger features. Overall, the features present at this classification stage are:

• Text

• hasGreekSymb
