• Nenhum resultado encontrado

Identifying Semantic Relations for DiseaseTreatment in Medline

N/A
N/A
Protected

Academic year: 2017

Share "Identifying Semantic Relations for DiseaseTreatment in Medline "

Copied!
6
0
0

Texto

(1)

Available Online at www.ijecse.org ISSN: 2277-1956

Identifying Semantic Relations for

Disease-Treatment in Medline

P. Menaka 1,Prof.D.Thilagavathy 2

1-PG Scholar, Department Of Computer science and Engineering,

Adhiyamman College Of Engineering, Hosur,TamilNadu, India.

menu.mnk@gmail.com

2-HOD, Department Of Computer science and Engineering,

Adhiyamman College Of Engineering, Hosur,TamilNadu, India.

thilagakarthick@yahoo.co.in

Abstract-

The Machine Learning (ML) is almost used in any domain of research and now it has become a reliable tool in the medical domain.ML is a tool by which medical field is integrated with the computer based systems to provide more efficient medical care. The main objective of this work is to show what Natural Language Processing (NLP) and Machine Learning (ML) techniques used for representation of information and what classification algorithms are suitable for identifying and classifying relevant medical information in short texts. It acknowledges the fact that tools capable of identifying reliable information in the medical domain stand as building blocks for a healthcare system that is up-to-date with the latest discoveries. In this research, it focuses on diseases and treatment information, and the relation that exists between these two entities.

Keywords: Machine Learning, Natural Language Processing, Classifiers.

I. INTRODUCTION

Medline is the richest and most used source of information. It is a database that has enormous articles regarding life sciences. This paper focuses on two tasks: Automatically identifying sentences from Medline and also provides the relation between the diseases and treatments. The next task is based in three relations: Cure, Prevent and Side effects. These tasks helped to develop the Information Technology Framework that provides all the healthcare information. Healthcare related information (e.g., published articles, clinical trials, news, etc.) is a source for both healthcare providers and people. These Studies shows that people are searching the web and gain the knowledge about the healthcare related information.

Natural Language Processing (NLP) and Machine Learning (ML) are the techniques that are used here. The ultimate aim is to show what representation of information and algorithms used to identify and provide relevant healthcare information in short texts. The tools that are used to identify all medical related information are the building blocks in medical domain that has all latest discoveries up-to-date. The main entities involved in this paper are disease and treatment. The main aim is to focus on all information related to disease. It is also possible with tendency of having personalized medicine so that each patient has healthcare to provide treatment to their particular disease. However it is not enough to focus on certain study that provides treatment to particular disease. Also the Healthcare providers should focus on all recent discoveries in order to identify any side effects for patients.

(2)

everything is done in one step by classifying sentences based on interest and also including the sentences that do not provide relevant information.

II. .RELATED WORK

This work presents various Machine Learning (ML) and information for classification of short texts and finds the relation between diseases and treatments. According to ML technique the information are shown in short texts when identifying relations between two entities such as diseases and treatment. Thus there is improvement in solutions when using a pipeline of two tasks (Hierarchical way of approaching). It is better to identify and remove the sentence that does not contain information relevant to disease or treatments. The remaining sentences can be classified according to the interest. It will be very complex to identify the exact solution if everything is done in one step by classifying sentences based on interest and also including the sentences that do not provide relevant information.

The related work is done by Hearst and Rosario. The data set used in this work are created and distributed by the authors of this paper. The data set contains information from medline with all relevant information including diseases, treatments and eight relations between diseases and treatment. Entity recognition is focused mainly for diseases and treatments. The authors use two models for entity recognition and relation details. The models are Hidden markov model and maximum entropy model. The representation of information depends on medical lexical ontology, phrases and words in context. In this paper different techniques and models are used which provides better solution. One of the key tasks in natural language processing is that of Information Extraction (IE), which is traditionally divided into three subproblems: coreference resolution, named entity recognition, and relation extraction.

The main task in this work is to extract healthcare information and the relation details. Extraction of relation is done using three approaches. The approaches are: co-occurrences analysis, rule based approaches, and statistical methods. Rule based approach is used in solving relation between entities. Co-occurrences analysis is based on words in context and lexical knowledge. Rule based approach use syntactic: part-of-speech and syntactic structure or information in the form of fixed patterns that has word that describes the relation between the entities. The disadvantage of rule based approach is that it is mostly driven with human efforts rather than using data driven. One of the best rule based system is that rules constructed by humans, auto-extraction and refined by human. The main advantage of having this system is it provides precision results while recall tends to be low.

Syntactic rule based systems are very complex since it depends on some more tools used to assign POS tags. But it is clearly mentioned that such tools are not yet reached the state of art level since they are for designed only for common English texts and therefore it does not provide better solutions. Thomas et al., Yakushiji et al are the two who presented the works on relation extraction in medline abstracts. Although it does not provide accurate results these systems are successful in this biomedical field.

(3)

III. SYSTEM ARCHITECHTURE

Figure 1.System Architecture

In this work user can give their symptoms, the server will extract the information from various articles related to those symptoms. Then it classifies that information based on the symptoms and then provides the cure, preventive measures and side-effects for those symptoms. The main task in this work is to extract healthcare information and the relation details. In this research work, it focuses on diseases and treatment information, and the relation that exists between these two entities. Our interests are inline with the tendency of having a personalized medicine, one in which each patient has its medical care tailored to its needs. It is not enough to read and know only about one study that states that a treatment is beneficial for a certain disease.

IV. THE PROPOSED APPROACH

A. Tasks and Data Sets:

The two tasks used in this paper are the basis for the development of information technology framework. This framework helps to identify the medical related information from abstracts. The first task deals with extraction all information regarding diseases and treatments while the task deals with extraction of related information existing between disease and treatments. The framework developed with these tasks are used by healthcare providers, people who needs to take care of their health related problems and companies that build systematic views. The future product can be provided with browser plug-in and desktop application so that it helps the user to get all information related to diseases and treatments and also the relation between those entities. It is also be useful to know more about latest discoveries related to medicine. The product can be developed and sold by companies that do research in medical care domain, Natural Language Processing (NLP), and Machine Learning (ML), and companies that develop tools like Microsoft Health Vault and Google Health.

This product is valuable in e-commerce fields by showing the statistics that the information provided here Client

Login

User Symptoms Entry

SERVER

Data Represe

ntation

Data Base

Medline Data Base Articles Selected Relevant Disease

(4)

should be trust worthy so that people can buy it. It is the key factor foe any company to make product successful. When coming to health care products it should be more trust worthy since it is dealing with health related problems. Companies that wish to sell health care framework need to develop tools that automatically extract the wealth of research.

For example the information provided for diseases or treatments needs to be based on recent discoveries on health care field so that people can trust. The product quality also should be taken care so that it provides dynamic content for users. The first task deals with the identification of sentences from the Medline abstracts that provide the information about the diseases and treatments. In other words it also seems like scanning the sentences from medline abstracts that contain relevant information which the user wants.

Natural Language Processing (NLP), and Machine Learning (ML) are used to extract accurate information or it can also say that it perfectly removes the unwanted information which are not related to disease or treatment. Natural Language Processing (NLP) and Machine Learning (ML) itself involve in extracting informative sentences. It is difficult task to identify the informative sentences in fields such as summarization and information extraction. The work and contribution value with this task is helpful in results and in settings for this task in healthcare field.

In the first task data set are provided with information such as label containing that the extracted sentence is informative or label containing that the extracted sentence is not accurate.

For the second task it is provided with information that shows if the relation between disease and treatment is cure, prevent or side effects. Further it is good to focus on relations based on interest to provide accurate results. The three relations (cure, prevent and side effects) are mentioned in original data set and they are needed for future reference.Two ways of identifying semantic relations are:

Task 1: In this three models have been built. Each model is focused on relation and provides the difference between the sentences that contain information that do not contain.

Task 2: Unlike in setting 1 it has only one model. This model focus on three relation (cure, prevent and side effects). It distinguishes relations in three-class tasks and contains label in each sentence that has any one semantic relation.

The most important thing is that they can be combined in pipeline so that it provides solution to framework which identifies all related information. This proposed pipeline approach first deals task1 and then carries out with task2 so that it can give only informative sentences based on three relations. The logic behind to go with pipeline approach is to identify the best model to identify and extract the reliable healthcare information. With this pipeline task some problems can be removed due to the fact that unwanted information can be the potential factor to classify the sentences into three semantic relations.

B. Classification Algorithms and Data Representations

In Machine Learning the expertise and previous research provides the guidance to solve new tasks. The models described should be able to identify and provide informative sentences and relation between entities. The research should be made in a way to achieve high performance.

While working with ML technique two challenges should be considered. One is to find the exact model. ML provides a better model that can be used. To find the model one should rely on empirical studies and should gain knowledge in healthcare domain. The next one is to provide better data representation and to do feature engineering, because feature increase the performance of the model. Finding the exact feature for the predictive model is a crucial task that should be considered. These two challenges can be achieved by using various algorithms and textual representation that suits for the tasks.

(5)

to obtain database table that contains as many columns as number of feature selected to represent data and as many rows as the data points.

C. Bag-of-Words Representation

The bag-of-words is the name commonly used to classification of tasks. It is a representation in which features are selected among the words that are present in the training data. And then selection techniques are used in order to identify the most suitable words as features. Once the feature is identified, each training and test instance is mapped to this feature representation by providing values to each feature for a specific instance. Two most common feature value representations for Bag-of-Words (BOW) representation are: binary feature values or frequency feature values. The binary feature value is the value of a feature can be either 0 or 1, where 1 represents the fact that the feature is present in the instance and 0 otherwise. The frequency feature value is the value of the feature is the number of times it appears in an instance or 0 if it did not appear. Here we use frequency feature value. There is no that much difference between binary feature values and frequency feature values because there is only twenty words in each sentence of short texts. The advantage of using frequency feature values is that the feature’s value will be greater than other features since it captures the number of feature appeared once in a sentence. It keeps words that appear at least three times in training data and have at least one alphanumeric.

D. NLP and Biomedical Concepts Representation

Unlike above representation this representation is based on syntactic information such as nouns, phrases, healthcare related points in the sentence. A tool called Genia tagger is used to extract information. The tagger is specifically designed for biomedical text such as Medline abstracts. It analyzes the sentences to generate base forms, chunk tags, part of speech tags. The phrases and nouns obtained by tagger is used in second representation method. While running the Genia tagger it extracts only nouns, phrases and healthcare related concepts from each sentence of data set. The steps to be followed to obtain the final feature for classification are: removing features that contains punctuations and considering lemma-based forms. Lemma is used because there is lot of words that has plural forms. Lemmatized form gives the base form of the word. Another reason to use lemma is to reduce the sparseness problem. Lemma forms remove the problem of representing only a few features in short text forms.

E. Medical Concepts (UMLS) Representation

For more general features we use Unified Medical Language system (UMLS). UMLS is a developed at the US National Library of Medicine (NLM) .It contains a metathesaurus, a semantic network, and the specialist lexicon for biomedical domain. The metathesaurus deals with concepts and meanings. It provides relation between various different concepts.In addition to the UMLS knowledge base, NLM created a set of tools that allow easier access to the useful information. UMLS contains over 1 million medical concepts, and over 5 million concept names. Each concept is assigned with one semantic type from the semantic between concepts. NLM created new tool called MetaMap that maps text to healthcare concepts in UMLS. With this text is processed through entire data set and finally it provides ranked list of all possible concepts for particular noun-phrase. For each noun phrase in text various noun-phrases are generated. For each variant noun-phrases UMLS met thesaurus are obtained and evaluated. The obtained phrase is compared with actual phrase using fit function. Fit function measures the text overlap between obtained concepts and actual phrase. The best of the candidates are then organized according to the decreasing value of the fit function.

V. CONCLUSION

(6)

between them. The outcome of this proposed work could be integrated with an application and used in health care field.

REFERENCES

[1] R. Bunescu and R. Mooney, “A Shortest Path Dependency Kernel for Relation Extraction,” Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT/ EMNLP), pp. 724-731, 2005.

[2] R. Bunescu, R. Mooney, Y. Weiss, B. Scho¨ lkopf, and J. Platt, “Subsequence Kernels for Relation Extraction,” Advances in Neural Information Processing Systems, vol. 18, pp. 171-178, 2006.

[3] A.M. Cohen and W.R. Hersh, and R.T. Bhupatiraju, “Feature Generation, Feature Selection, Classifiers, and Conceptual Drift for Biomedical Document Triage,” Proc. 13th Text Retrieval Conf. (TREC), 2004.

[4] M. Craven, “Learning to Extract Relations from Medline,” Proc.Assoc. for the Advancement of Artificial Intelligence, 1999.

[5] I. Donaldson et al., “PreBIND and Textomy: Mining the Biomedical Literature for Protein-Protein Interactions Using a Support Vector Machine,” BMC Bioinformatics, vol. 4, 2003.

[6] C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky, “GENIES: A Natural Language Processing System for the Extraction of Molecular Pathways from Journal Articles,” Bioinformatics, vol. 17, pp. S74-S82, 2001.

[7] O. Frunza and D. Inkpen, “Textual Information in Predicting Functional Properties of the Genes,” Proc. Workshop Current Trends in Biomedical Natural Language Processing (BioNLP) in conjunction with Assoc. for Computational Linguistics (ACL ’08), 2008.

[8] R. Gaizauskas, G. Demetriou, P.J. Artymiuk, and P. Willett, “Protein Structures and Information Extraction from Biological Texts: The PASTA System,” Bioinformatics, vol. 19, no. 1, pp. 135- 143, 2003.

[9] C. Giuliano, L. Alberto, and R. Lorenza, “Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature,” Proc. 11th Conf. European Chapter of the Assoc. for Computational Linguistics, 2006.

[10] J. Ginsberg, H. Mohebbi Matthew, S.P. Rajan, B. Lynnette, S.S. Mark, and L. Brilliant, “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature, vol. 457, pp. 1012-1014, Feb. 2009.

[11] M. Goadrich, L. Oliphant, and J. Shavlik, “Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction,” Proc. 14th Int’l Conf. Inductive Logic Programming, 2004.

[12] L. Hunter and K.B. Cohen, “Biomedical Language Processing: What’s beyond PubMed?” Molecular Cell, vol. 21-5, pp. 589-594, 2006. [13] L. Hunter, Z. Lu, J. Firby, W.A. Baumgartner Jr., H.L. Johnson, P.V. Ogren, and K.B. Cohen, “OpenDMAP: An Open Source, Ontology-Driven Concept Analysis Engine, with Applications to Capturing Knowledge Regarding Protein Transport, Protein Interactions and Cell-Type-Specific Gene Expression,” BMC Bioinformatics, vol. 9, article no. 78, Jan. 2008.

[14] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nature Genetics, vol. 28, no. 1, pp. 21-28, 2001.

[15] R. Kohavi and F. Provost, “Glossary of Terms,” Mac ine Learning, Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, vol. 30, pp. 271-274, 1998.

Referências

Documentos relacionados

Quando o personagem aparece-lhe pela primeira vez, diz o autor que ―naquela época, ele ainda não se chamava Pereira, ainda não tinha traços definidos, era algo vago, fugidio e

The first step to evaluate the AudioNavigator system is to test if the user has a correct perception of the spatial audio it generates i.e., if the sound played by the

Furthermore, in our case the use of the penalty method replaces a constrained optimization problem (the delayed optimal control problem) by a sequence of unconstrained problems of

No presente estudo optou-se por não considerar artigos internacionais publicados em periódicos científicos sem fator de impacto, conforme indicador SJR. Dessa forma, foram

Houve necessidade de definir a classe Operador_urbano, que como o nome indica é uma classe necessária para definir os transportes urbanos, porque nela está

with chronic low back pain REVISTA Clinical Rehabilitation Physiotherapy Theory and Practice Disability and Rehabilitation Journal of Aging and Physical Activity

Neste sentido, o enfermeiro contribui na elaboração de uma assistência planejada, fundamentada na educação em saúde, nas orientações sobre o auto cuidado,

FEDORA is a network that gathers European philanthropists of opera and ballet, while federating opera houses and festivals, foundations, their friends associations and