Knowledge Graphs

A Knowledge Graph (KG) is a structure that represents knowledge in the form of a semantic net (Sowa, 1987). As the name suggests, it is a graph whose nodes represent concepts or entities and whose edges represent the semantic connections, or links, between those nodes (Yan et al., 2018). The formal definition of a KG remains unsettled, as there are a number of conflicting definitions (Ehrlinger and Wöß, 2016; Bonatti et al., 2019).

KGs are very similar to Knowledge Bases (KBs), which are intelligent databases that store semantic information (Frost, 1986). The only major distinction is that a KG has a graph-like structure, whereas a KB can be unstructured; as a result, the two terms are often used interchangeably in research.

KGs relate closely to NLP, as they offer a way to structure and store the semantic information extracted from unstructured text. This process involves many NLP tasks, making KG creation a challenging subject. The tasks can be grouped into three types: Entity Extraction, Relation Extraction and Reasoning (Yan et al., 2018). They work as a pipeline in the order presented, with the goals of extracting the entities of interest, identifying their relationships and validating their semantic connections, respectively.

Figure 2.5: The stages of KG creation and the related Information Extraction task types.

The Entity Extraction tasks used in both the KG creation and enrichment phases can be limited to Entity Recognition and Entity Linking. Entity Recognition has been extensively discussed in Section 2.2. Entity Linking is the task of correctly identifying all entity mentions and linking them together via a unique identifier provided through a KB (Shen et al., 2014). When a KB is not used, the task is indistinguishable from Coreference Resolution, which was reviewed in detail in Section 2.4.
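As an illustration of linking mentions through KB identifiers, the following is a minimal sketch that assumes a dictionary-based lookup over known aliases. The toy KB, its identifiers (`KB:0001`, `KB:0002`) and the function names are hypothetical; practical Entity Linking systems also need candidate ranking and context-based disambiguation, which are omitted here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy KB: unique identifier -> known surface forms (aliases).
TOY_KB: Dict[str, List[str]] = {
    "KB:0001": ["diabetes mellitus", "diabetes"],
    "KB:0002": ["metformin"],
}


def build_alias_index(kb: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the KB so that each lower-cased alias points to its identifier."""
    index: Dict[str, str] = {}
    for entity_id, aliases in kb.items():
        for alias in aliases:
            index[alias.lower()] = entity_id
    return index


def link_mentions(
    mentions: List[str], alias_index: Dict[str, str]
) -> List[Tuple[str, Optional[str]]]:
    """Pair each mention with a KB identifier, or None when no alias matches."""
    return [(m, alias_index.get(m.strip().lower())) for m in mentions]


linked = link_mentions(
    ["Diabetes", "metformin", "aspirin"], build_alias_index(TOY_KB)
)
for mention, entity_id in linked:
    print(mention, "->", entity_id)
```

Mentions that resolve to the same identifier are thereby linked to each other, while unmatched mentions (here "aspirin") are left unlinked.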

Relation Extraction is the NLP task of identifying the relationship between two entities of interest. For KGs, this task focuses on creating triples of the form <subject, predicate, object>, in which the predicate identifies the semantic relationship between the subject and the object; this relationship is always binary. The type of relationship can either come from predefined categories or be automatically extracted from text, depending on the approach and the available resources (Smirnova and Cudré-Mauroux, 2018).

Lastly, Reasoning tasks can be a combination of Logical Inference, Graph Learning and Semantic Role Labeling. Logical Inference refers to the task of using rules and constraints to enforce logical properties and resolve inconsistencies in a KG. Similarly, Graph Learning is the task of learning a representation of a graph and performing inference from the graph itself. This makes it possible to evaluate the validity of new relations with respect to a given graph, so that the KG can be further enriched. Semantic Role Labeling is the NLP task that identifies entity mentions in text, links them together and also identifies the syntactic dependencies of each sentence. Essentially, it is the NLP counterpart of Logical Inference, without the need for complex manually created rules. However, both Graph Learning and Semantic Role Labeling require annotated training data, which is hard to produce.
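To make the idea of Logical Inference more concrete, the sketch below applies one rule (transitivity of a predicate) and one constraint (a predicate that may have only one object per subject) to a small set of triples. The predicate names `is_a` and `has_birth_year`, the example triples and both function names are assumptions chosen for illustration; they are not taken from any particular reasoner.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def infer_transitive(triples: Set[Triple], predicate: str) -> Set[Triple]:
    """Add every triple implied by treating `predicate` as transitive."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        derived = {
            (s1, predicate, o2)
            for (s1, p1, o1) in closed if p1 == predicate
            for (s2, p2, o2) in closed if p2 == predicate and o1 == s2
        }
        if not derived <= closed:
            closed |= derived
            changed = True
    return closed


def functional_violations(
    triples: Set[Triple], predicate: str
) -> Dict[str, Set[str]]:
    """Report subjects with more than one object for a single-valued predicate."""
    objects: Dict[str, Set[str]] = {}
    for s, p, o in triples:
        if p == predicate:
            objects.setdefault(s, set()).add(o)
    return {s: objs for s, objs in objects.items() if len(objs) > 1}


triples = {
    ("metformin", "is_a", "biguanide"),
    ("biguanide", "is_a", "drug"),
    ("patient_1", "has_birth_year", "1970"),
    ("patient_1", "has_birth_year", "1971"),
}
closed = infer_transitive(triples, "is_a")
violations = functional_violations(triples, "has_birth_year")
```

Here the rule derives the new triple (metformin, is_a, drug), while the constraint flags `patient_1` as inconsistent; how such an inconsistency is resolved is left open in this sketch.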

An important feature of KGs is their persistent nature, as they can be stored, maintained and re-used for many applications. KGs can be stored in many forms, the most common being the Resource Description Framework (RDF), in which the <subject, predicate, object> triples that comprise a KG are stored in non-SQL databases called triplestores. These triples can be reconstructed into a KG by using the subject and object as nodes and the predicate as the label of their connecting edge.
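The reconstruction of a graph from stored triples can be sketched as follows. This is a minimal in-memory illustration using plain Python containers rather than an actual triplestore or RDF library; the example triples and the function name `triples_to_graph` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def triples_to_graph(
    triples: List[Triple],
) -> Tuple[Set[str], Dict[str, List[Tuple[str, str]]]]:
    """Build a directed, edge-labelled graph from triples.

    Subjects and objects become nodes; each predicate becomes the label
    of the edge going from its subject to its object.
    """
    nodes: Set[str] = set()
    edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))
        edges[subject].append((predicate, obj))
    return nodes, dict(edges)


stored = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "is_a", "drug"),
    ("type 2 diabetes", "is_a", "disease"),
]
nodes, edges = triples_to_graph(stored)
for subject, outgoing in edges.items():
    for predicate, obj in outgoing:
        print(f"{subject} --{predicate}--> {obj}")
```

An adjacency mapping keyed by subject is only one possible design; it keeps outgoing edges cheap to enumerate, at the cost of needing a second index for incoming edges.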

3 Biomedical Entity Recognition

“Healing is a matter of time, but it is sometimes also a matter of opportunity.”

— Hippocrates

3.1 Introduction
3.2 Related Work
3.3 Biomedical Entity Recognition Architectures
3.4 Experiments
3.5 Conclusion extraction with Evidence-Based Medicine Entities
3.6 Entity-specific Biomedical Entity Recognition
3.7 Conclusions and Future Work

Biomedical Entity Recognition is usually aimed at specific medical entity types, such as diseases, genes and drugs. Such entities, while very important in their own right, are of limited use as standalone information in clinical decision making, as they cannot offer any semantic information about the study. To enable faster searches and improve the clinical decision making process, we focus on enhancing the clinical practice of Evidence-Based Medicine through Biomedical Entity Recognition.

Evidence-Based Medicine (EBM) is a methodology used by medical practitioners to identify all literature relevant to a patient’s case in the shortest time frame possible. This information can then be used by the medical practitioners to create informed treatment plans based on the extracted knowledge. Accordingly, the same approaches that are used to enable Biomedical Entity Recognition are also used for modern Evidence-Based Medicine.

This chapter describes three deep learning architectures, developed incrementally, to identify semantically hard biomedical entities in the domain of EBM and accelerate the treatment formulation process. The architectures are validated through detailed ablation studies and comparisons of their results with related work. We further use the identified EBM entities to extract study overviews in the form of single statements from the studies.

3.1 Introduction

Entity Recognition is an important task in Natural Language Processing, used not only to identify important entities in texts but also to support many downstream tasks. These include Question Answering, Summarization, Information Retrieval and Knowledge Graph construction, among others (Li et al., 2020a). Depending on the targeted entities, different types of Entity Recognition systems exist. Named Entity Recognition (NER) is the task of identifying Named Entities in a document and is the most commonly used type of Entity Recognition system. Similarly, in the clinical domain, where the source information is vastly different, we are interested in identifying biomedical entities through Biomedical Entity Recognition (BioER).

The entities recognized by BioER systems are Diseases, Chemicals, Genes, Molecules of Cells and Drugs, or a combination of the above. As a result, BioER’s domain of application extends that of classic Entity Recognition systems to domain-specific tasks such as adverse drug event extraction (Gurulingappa et al., 2012), drug-drug interactions (Zhang et al., 2017) and protein-protein interactions (Szklarczyk et al., 2015).

Every day, a staggering amount of new research is published across the many biomedical fields, with the number of articles of interest to a medical practitioner increasing exponentially. This rapid growth makes it extremely difficult for healthcare staff and medical practitioners to stay up to date with the latest research and guidelines (Bastian et al., 2010). During the COVID-19 pandemic, in 2020 alone, more than fifty thousand research articles about the novel coronavirus were published on PubMed¹.

Evidence-Based Medicine (EBM) is the practice by which medical practitioners identify all the relevant previous research in order to create treatment plans. These research studies, which constitute the complete available evidence at the time, usually come in the form of Randomized Controlled Trials (RCTs) or Clinical Trials (CTs) that investigate the effects of a treatment on a specific group of patients and present their findings. The dominant method for practising EBM is the PICO Framework, named after its elements Population, Intervention, Comparator and Outcome (Huang et al., 2006; Methley et al., 2014).

A combination of advancements in both Deep Learning (DL) and Natural Language Processing (NLP) techniques has contributed significantly to the increased performance of modern BioER systems (Hong and Lee, 2020; Cho et al., 2020; Weber et al., 2021). In comparison to general purpose NER, the performance of BioER systems suffers due to the high variance and complexity of the terms found in medical literature (Campos et al., 2012). This complexity is worsened in the case of EBM, in which a term, e.g. "high blood sugar", can be assigned to multiple classes, i.e. Population or Outcome in this case, depending on the context of the study.

¹ https://www.ncbi.nlm.nih.gov/pubmed/

In addition, research on advancing EBM with ML has been slow, due to the lack of high-quality benchmark datasets of substantial size, as well as the instability of the frameworks that have been used. Furthermore, while significant strides have been made, researchers approach the task in a variety of ways, ranging from Sentence Classification and Entity Recognition (Chung, 2009; Jin and Szolovits, 2018) to Question Answering and Information Retrieval (Abacha and Zweigenbaum, 2015; Gulden et al., 2019). The publication of a high-quality corpus with PICO entities by Nye et al. (2018) makes it possible to use PICO entities at the word level and sets a benchmark for the entity recognition task.
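To illustrate what word-level PICO annotation can look like in practice, the sketch below groups per-token tags into entity spans. It assumes a BIO tagging scheme with hypothetical label names (`POP`, `INT`); this is a common convention for Entity Recognition, not necessarily the format of the Nye et al. (2018) corpus, and the example sentence is invented.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label continues the open span.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        # B- (or a stray I-) opens a new span; O leaves no span open.
        if prefix in ("B", "I") and label:
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["Adults", "with", "high", "blood", "sugar", "received", "metformin"]
tags = ["B-POP", "I-POP", "I-POP", "I-POP", "I-POP", "O", "B-INT"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In this invented sentence "high blood sugar" falls inside a Population span; in a study that measures blood sugar as a result, the same words could instead carry an Outcome label, which is the ambiguity discussed above.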

In the remainder of the chapter we first discuss the related work in the field of BioER and EBM with ML (Section 3.2), describe our Biomedical Entity Recognition architectures in detail (Section 3.3) and present our findings by evaluating their performance in EBM-targeted Entity Recognition (Section 3.4). A novel approach to exploiting PICO entities to extract study overviews follows (Section 3.5), along with an entity-specific evaluation of the developed architectures (Section 3.6). Finally, we offer our conclusions and discuss future work on the subject (Section 3.7).

