Cross-language ontology matching : Mapeamento multilíngue de ontologias

INSTITUTO DE COMPUTAÇÃO

Juliana Medeiros Destro


CAMPINAS

2019


Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. Julio Cesar dos Reis

Co-supervisors: Prof. Dr. Ricardo da Silva Torres and Prof. Dr. Ivan Luiz Marques Ricarte

This copy corresponds to the final version of the thesis defended by Juliana Medeiros Destro and supervised by Prof. Dr. Julio Cesar dos Reis.


Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Destro, Juliana Medeiros,
D476c Cross-language ontology matching / Juliana Medeiros Destro. – Campinas, SP : [s.n.], 2019.

Supervisor: Julio Cesar dos Reis.
Co-supervisors: Ricardo da Silva Torres and Ivan Luiz Marques Ricarte.
Thesis (doctorate) – Universidade Estadual de Campinas, Instituto de Computação.

1. Multilingual ontologies. 2. Ontology mapping. 3. Semantic Web. 4. Ontology alignment. I. Reis, Julio Cesar dos, 1979-. II. Torres, Ricardo da Silva, 1977-. III. Ricarte, Ivan Luiz Marques, 1962-. IV. Universidade Estadual de Campinas. Instituto de Computação. V. Title.

Information for the Digital Library

Title in another language: Mapeamento multilíngue de ontologias
Keywords in English: Cross-lingual ontologies; Ontology mapping; Semantic Web; Ontology alignment
Area of concentration: Computer Science
Degree: Doctor in Computer Science
Examining committee: Julio Cesar dos Reis [Supervisor]; Renato Fileto; Carla Geovana do Nascimento Macário; André Santanchè; Leonardo Montecchi
Defense date: 10-07-2019
Graduate program: Computer Science

Author identification and academic information

- Author's ORCID: https://orcid.org/0000-0002-1180-6435
- Author's Currículo Lattes: http://lattes.cnpq.br/3025767139491665


Examining Committee:

• Prof. Dr. Julio Cesar dos Reis
Instituto de Computação / Universidade Estadual de Campinas

• Prof. Dr. Renato Fileto
Departamento de Informática e Estatística, Centro Tecnológico / Universidade Federal de Santa Catarina

• Dr. Carla Geovana do Nascimento Macário
Research and Development / Empresa Brasileira de Pesquisa Agropecuária (Embrapa)

• Prof. Dr. André Santanchè
Instituto de Computação / Universidade Estadual de Campinas

• Prof. Dr. Leonardo Montecchi
Instituto de Computação / Universidade Estadual de Campinas

The defense minutes, signed by the members of the Examining Committee, are on file in the SIGA/Sistema de Fluxo de Dissertação/Tese and at the Program Office of the Unit.


Acknowledgements

First of all, I would like to thank God for my life and all the opportunities given to me. I would like to thank my supervisor Julio Cesar dos Reis and my co-supervisors Ricardo Torres and Ivan Ricarte for their guidance and encouragement. I will take their advice with me for the rest of my life.

I am grateful to all my friends, especially those in our group Tramps. Near or far, you were always there for me, cheering me on every day. The moments, tears and laughter we shared helped me more than I can say.

I thank my family and relatives for the support and understanding during all these years.

I also would like to thank my first supervisor, Ariadne Maria Rizzoni Brito de Carvalho, who introduced me to Artificial Intelligence during my undergraduate studies and helped me during my early years in my doctorate. She was the first to believe in me and was much more to me than a supervisor. I’m deeply grateful for her support.

Finally, special thanks to my spouse André, who was by my side, with love and encouragement, every hour of every day, and to whom this thesis is dedicated. Thank you for your words, your caring, and for everything you gave up to give me this opportunity. You are my everything.

Resumo

There is an ever-growing publication of voluminous, dynamic, heterogeneous and complex datasets. This requires computers to support methods for automatically locating, accessing and meaningfully integrating data, regardless of the natural language in which they are written. Ontologies formally represent domain concepts and their interrelationships, with the goal of making data semantics explicit. Mappings between different ontologies enable the semantic integration of heterogeneous data sources and facilitate data exchange between systems. The automatic creation of mappings requires a procedure to identify corresponding concepts. Identifying mappings between ontology entities in different languages remains a research challenge. Distinct syllabic alphabets hinder the direct use of similarity techniques based on word composition. Further investigation is still needed to advance multilingual ontology matching methods so that they obtain results as good as the existing monolingual approaches. This research focused on addressing two challenging problems: (1) how to generate ontology mappings across different languages; and (2) how to semantically enrich the relationships of a mapping.

The objective of this thesis is to provide new methods suited to automatically generating ontology mappings and enriching the type of semantic relation in the alignments created between ontologies described in different natural languages. Our work involved exploring machine translation, semantic similarity measures based on background knowledge, the neighborhood of ontology entities, and information retrieval techniques. We propose a combination of similarity measures as a solution to problem (1) and the use of ontology evolution information for problem (2). The main contributions of this thesis are: (i) experiments with alignments of biomedical ontologies to study the behavior of semantic similarity measures; (ii) a novel method for the automatic generation of multilingual ontology mappings using a weighted combination of similarity measures and concept neighborhood; (iii) a second method for generating multilingual mappings that adopts rank aggregation techniques; (iv) a mapping refinement technique that uses evolution information from the ontology itself to semantically enrich the mappings. All of our techniques were evaluated using real ontologies and mappings, notably biomedical domain ontologies such as SNOMED CT and LOINC, and the Multifarm ontology set, used in the track of the same name of the international Ontology Alignment Evaluation Initiative (OAEI) competition. The experimental evaluations demonstrated the effectiveness of our approaches to multilingual ontology mapping.

Abstract

There is an ever-increasing publication of voluminous, dynamic, heterogeneous and complex datasets. This demands computer-supported methods for automatically locating, accessing and meaningfully integrating data regardless of the natural language encoding it. Ontologies formally represent domain concepts and their interrelationships to make data semantics explicit. Mappings between different ontologies are relevant to semantically integrate heterogeneous data sources and facilitate data exchange among systems. The automatic creation of mappings requires a matching procedure to identify corresponding concepts. Cross-language matching aims to obtain mappings between ontology entities written in different natural languages. The distinct syllabic alphabets hamper the straightforward use of string-based similarity techniques. Further investigations are still required to advance cross-lingual ontology matching methods to obtain results as good as the existing monolingual approaches. This research focused on addressing two challenging problems: (1) how to generate cross-language ontology mappings; and (2) how to semantically enrich mapping relationships.

The objective of this Ph.D. thesis is to provide new methods suited to automatically generate ontology mappings and enrich the type of semantic relation in the created alignments between ontologies described in distinct natural languages. Our work involved exploring automatic translation, semantic similarity measures based on background knowledge, neighborhood of ontology entities, and information retrieval techniques. We proposed a combination of similarity measures as a solution to problem (1) and the use of changes from ontology evolution for problem (2). The main contributions of this thesis are: (i) extensive experiments with alignments of biomedical ontologies to study the behavior of semantic similarity measures; (ii) a method for automatic generation of cross-language ontology mappings using a weighted combination of similarity measures and neighborhood concepts; (iii) a novel method of creating cross-language ontology mappings leveraging rank aggregation techniques; (iv) a mapping refinement technique using evolution changes to semantically enrich the mappings. All of our techniques were evaluated using real-world ontologies and mappings, highlighting the use of biomedical domain ontologies such as SNOMED CT and LOINC, and the Multifarm ontology set appearing in the Ontology Alignment Evaluation Initiative (OAEI) competition. The experimental evaluations demonstrated the effectiveness of our approaches to cross-lingual ontology matching.

List of Figures

1.1 Overview of the research steps and the chapter number where they are addressed.
2.1 Semantic similarity in cross-language mapping. This shows the concepts involved in the mapping described in a Language β and their translation to Language α or γ.
2.2 Experiment 1. This figure presents the behavior of similarity values computed with NASARI [Babelnet] and translation to Latin Language.
2.3 Experiment 2. This figure presents the behavior of similarity values computed with NASARI [Babelnet] and translation to Spanish.
2.4 Experiment 3. This figure presents the behavior of similarity values computed with PATH-BASED [UMLS] and translation to Spanish.
2.5 Experiment 4. This figure presents the behavior of similarity values computed with VECTOR-BASED [UMLS] and translation to Spanish.
3.1 Example of neighbourhood.
3.2 Schematic overview of the proposed cross-language ontology alignment based on neighbourhood analysis.
3.3 Ontologies $O_X$ and $O_Y$ under analysis.
3.4 Pair of entities considered for mapping.
3.5 “pessoa” and “person” are the pair of neighbouring concepts with maximum similarity.
4.1 Example of cross-language ontology mapping.
4.2 Rank aggregation in information retrieval mapped to the ontology matching problem.
4.3 Technique workflow. The mapping processing stage is where the top-1 entity of the final ranking is mapped to the input concept $e_1$.
4.4 Concept $c_1 \in O_X$ is compared against all concepts $c_n \in O_Y$.
4.5 Ranked lists generated by each similarity measure used.
4.6 Rank aggregation of the ranked lists. Each rank aggregation algorithm generates a distinct final rank.
4.7 Mapping generated between source entity $c_1 \in O_X$ and top-1 entity of the final rank generated by the rank aggregation algorithm, $c_2 \in O_Y$.
4.8 Entity “author of contribution” $\in O_X$ under analysis.
4.9 Ranks generated by similarity measures.
4.10 Mapping established between the element in $O_X$ and the top-1 of the final aggregated rank.
4.11 Results where our F-measure was higher than other matching methods.
4.12 Results where our F-measure was lower than other matching methods.
[...] mapping $m_{12} \in L^{j}_{XY}$ candidate for refinement (C) Computing similarity values between $c_1$ and the neighborhood($c_2$) (D) Resulting $L'^{j}_{XY}$ after our refinement procedure (application of the derivation action).
5.3 (A) Addition of the concept “Zika virus Ab.IgG” from $LOINC^{2.64}_{en}$ release version to $LOINC^{2.65}_{en}$ (B) Resulting mapping set after refinement action

List of Tables

2.1 Biomedical ontologies considered in the study.
3.1 Example of NASARI vector representation, where synset_weight represents dimensions $1 : n \wedge n \le 300$.
3.2 MultiFarm ontologies and statistics.
3.3 Conference-conference alignments used in the experiments and number of mappings in each alignment.
3.4 Experiments configurations. Different weights for syntactic and semantic similarity are applied to each threshold. Each configuration was applied to Conference-Conference mappings for each of the 45 pairs of languages, obtaining a total of 1260 experiments.
3.5 Results with highest f-measure for each pair of languages.
3.6 Results obtained with existing ontology alignment systems in OAEI (Multifarm Track) in 2018, considering the alignments between conference ontologies in different languages.
4.1 Unsupervised and Supervised Rank Aggregation (RA) and Learning to Rank (L2R) techniques considered in the experiments.
4.2 Example of word-based vector representation.
4.3 MultiFarm ontologies and their number of entities.
4.4 Best f-measure results by language pair and rank aggregation algorithm.
4.5 Top-1 of each one of the ranked lists generated by similarity measures for the entity ‘Fachgebiet des Gutachters’ for the conference-conference-de-en alignment. The correct match is highlighted by (*).
4.6 Top-1 of each one of the ranked lists generated by similarity measures for entity ‘membro do comitê’ for the conference-conference-pt-ru alignment. The correct match is highlighted in the table by (*).
5.1 Ontology change operations (OCOs) [44].
5.2 Evaluation results.

Contents

1 Introduction
1.1 Context and Motivation
1.2 Research Problem and Challenges
1.3 Research Goals and Approach
1.4 Contributions
1.5 Thesis Organization

2 Influence of Semantic Similarity Measures on Ontology Cross-language Mappings
2.1 Introduction
2.2 Related Work
2.3 Fundamental Concepts
2.4 Experiments
2.5 Results
2.6 Discussion
2.7 Conclusion

3 Neighbourhood-based Cross-Language Ontology Matching
3.1 Introduction
3.2 Background
3.3 Formalizations
3.3.1 Ontologies
3.3.2 Cross-Language Ontology Alignment
3.3.3 Similarity Measures
3.4 Cross-Language Ontology Alignment Relying on Neighbourhood Analysis
3.5 Experimental Evaluation
3.5.1 Datasets and Procedure
3.5.2 Experimental Results
3.6 Discussion
3.7 Conclusion

4 Exploring Rank Aggregation for Cross-Lingual Ontology Alignments
4.1 Introduction
4.2 Background and related work
4.2.1 Basic concepts
4.2.2 Related work
4.3 Cross-lingual Alignment Using Rank Aggregation
4.3.1 Rank aggregation algorithms
4.4.2 Results
4.5 Discussion
4.6 Conclusion

5 Ontology changes-driven semantic refinement of cross-language biomedical ontology alignments
5.1 Introduction
5.2 Background
5.3 Preliminary Formalization
5.4 Refinement of Biomedical Ontology Mappings across Languages
5.4.1 Refinement Actions
5.4.2 Refinement Procedure
5.5 Evaluation
5.6 Discussion
5.7 Conclusion

6 Conclusion
6.1 Summary of the Achieved Contributions
6.2 Research Outcomes
6.3 Future Work and Perspectives
6.4 Final Remarks

Bibliography


Chapter 1 Introduction

1.1 Context and Motivation

Interconnectivity between computational systems plays a central role in all scientific domains. For instance, modern medicine requires that health information be exchanged between different locations to prevent pandemics and accelerate the development of new drugs and treatments. In this context, more and more systems need to interoperate by smoothly sharing data in different languages and providing novel data analyses in a distributed way. This task can be improved if the meaning of data in one system can be further interpreted by machines and adequately reused in other systems. To this end, several software systems demand organizing the data and making its semantics explicit. The use of ontologies has been considered a key element for this purpose.

Ontology refers to “the philosophical study of what-is or of what applies neutrally to everything that is real”. By the end of the 20th century, the term was adopted by the Artificial Intelligence community, in the context of knowledge representation [41]. In Computer Science, ontologies define a common vocabulary in a knowledge domain and are used to represent semantics in computational systems by describing concepts and interrelationships among concepts. They are used as a knowledge representation formalism in applications in the context of the Semantic Web, Software Engineering, and Library Science, among others. In this Ph.D. thesis, we use ontology in a general sense, including taxonomies and thesauri. Gilchrist [39] provides a detailed definition of these terms and their differences.

Ontology entities (or components) vary among authors, but they mostly include instances (the objects), concepts (representing classes or types of objects), attributes (features, properties and characteristics of concepts) and relations (representing relationships between entities). Ontology concepts use strings written in natural language as labels [72]. An example of an ontology entity is a concept labeled “Diabetes mellitus type 1” with the attribute “Synonym: Type I Diabetes” and an is-a (⊑) relation with another concept named “Diabetes Mellitus”.

Ontologies are created to represent the knowledge consensus of a domain. They help organize information by limiting its complexity and enable sharing a common understanding of the information structure, separating the domain knowledge from the operational knowledge. Ontologies can be leveraged by a myriad of systems and applications, enabling reuse of domain knowledge.

Ontology matching stands for the generation of correspondences between concepts from two different ontologies [29]. Matching is the process, whereas the result of this process is known as a mapping set or alignment (we use these terms interchangeably throughout this manuscript). An example of a mapping would be a correspondence found between the concept “Cardiopathy” in one ontology and the concept “Arrhythmias” in another. In a matching process, the correspondence between these two concepts is found and added to a mapping set.

Usually, in a matching process, only equivalent (≡) concepts are identified. To semantically enrich the result of the identification procedure, a mapping refinement technique can be used [63]. This aims to increase the relevance and significance of the mappings between ontologies. In our previous example, the relation between “Cardiopathy” and “Arrhythmias” can be refined to an is-a (⊑) relation, denoting that “Arrhythmias” is a specialization of “Cardiopathy”.
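The distinction between a mapping set and its refinement can be made concrete with a minimal Python sketch. The names `Relation`, `Mapping` and `refine` are hypothetical and introduced only for illustration; they are not the thesis's implementation, and the sketch assumes a mapping is read as "source relation target".

```python
from dataclasses import dataclass, replace
from enum import Enum


class Relation(Enum):
    EQUIVALENT = "≡"  # both concepts denote the same thing
    IS_A = "⊑"        # the source concept specializes the target concept


@dataclass(frozen=True)
class Mapping:
    source: str         # concept label in the first ontology
    target: str         # concept label in the second ontology
    relation: Relation  # read as "source <relation> target"


def refine(mapping: Mapping, new_relation: Relation) -> Mapping:
    """Return a copy of the mapping carrying a different relation type."""
    return replace(mapping, relation=new_relation)


# A matching process typically yields equivalence candidates only.
alignment = {Mapping("Arrhythmias", "Cardiopathy", Relation.EQUIVALENT)}

# Refinement swaps the candidate for a semantically richer mapping.
refined = {refine(m, Relation.IS_A) for m in alignment}
```

How the new relation type is chosen is the actual research question; the sketch only shows the data that a refinement step consumes and produces.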

Ontologies are usually created by different authors, using different vocabularies and, possibly, in different natural languages. The number of ontologies created in different languages has grown as their use increases [91]. When ontologies are labeled in different natural languages, the process of creating correspondences is called cross-lingual ontology matching, which is the focus of this Ph.D. thesis.

Methods for ontology matching across languages are crucial for enabling end users to access data in multiple languages. As a motivating scenario, consider the case of a user who wants to retrieve international codes for medical procedures in a hospital in Brazil. This user is familiar with the ontology “Vocabulario Medico”, created in Portuguese to support the patients' electronic health records used in the hospital. Although the user is able to query the “Vocabulario Medico” ontology for “procedimentos em nefrologia”, for instance, it is necessary to query another ontology to obtain international codes for medical procedures. These codes can be obtained from SNOMED CT², but acquiring the codes requires knowledge about the concepts in this ontology. A cross-lingual ontology matching procedure can overcome the language barrier between “Vocabulario Medico” and SNOMED CT, because it creates correspondences between their concepts. In this scenario, mapping generates, for instance, a relation between “procedimentos em nefrologia”, in “Vocabulario Medico”, and “[C0199981] Medical procedure on kidney”, in SNOMED CT. The user can then explore the “Vocabulario Medico” ontology, and the generated mappings allow the use of the corresponding concepts in SNOMED CT.

The adoption of SNOMED CT as a medical vocabulary in Brazil is a real-world scenario where cross-language ontology matching can be used. There are different medical vocabularies used by the medical community in Brazil today, all of them in the Portuguese language [59]. SNOMED CT adoption was first proposed in 2011 by the Ministry of Health and the adoption is expected by 2021. Since there is no version of SNOMED CT in Portuguese, a cross-language mapping between SNOMED CT and the currently used medical vocabularies may expedite the adoption and increase interoperability of existing medical systems.

² SNOMED CT: Systematized Nomenclature of Medicine - Clinical Terms is a representation of clinical information in electronic health records.

1.2 Research Problem and Challenges

When ontologies are created in different languages, the process of matching poses an issue still not fully solved by the current state-of-the-art research. To this day, cross-lingual ontology matching remains an open research problem [47] and semantically enriched matching approaches lack thorough investigation in the literature [90]. The challenge of generating correspondences between different ontologies, created for diversified purposes, is aggravated when concepts are labeled in different natural languages, even in the same domain.

Monolingual ontology matching methods have advanced over the last years [47]. In these methods, the input ontologies are described in the same natural language. Existing literature has found that monolingual ontology matching methods can hardly be reused as they are, due to the complexity of dealing with distinct languages [62]. Trojahn et al. [91] highlighted that the current approaches for cross-lingual matching do not provide results as good as the existing monolingual approaches.

In this context, this Ph.D. thesis addresses two key research problems.

1. Problem 1: Cross-language ontology matching

Ontologies expressed in different languages hamper the straightforward use of syntactic-based distance calculation techniques. The use of different symbolic systems to denote the concepts makes a simple comparison between strings a difficult process. For example, it is complex for edit-distance algorithms [66] to automatically detect that the concept ‘肝’ in Japanese is equivalent to the concept ‘liver’ in English, which is written in a different script.

As the adoption of ontologies grows, the need to expand the ontologies to reflect changes in the domain also grows, leading to an increase in the number of ontology entities. In several domains, the current ontologies are very voluminous. In this sense, the generation of manually curated mappings, usually performed by experts, becomes an extremely labour-intensive task on large ontologies. For instance, the biomedical domain has many examples of dense ontologies, e.g., SNOMED CT has over 300,000 entities and the RxNORM vocabulary has over 100,000 entities.

A common approach for the cross-lingual matching problem is the use of automatic translations. This solution is thwarted by linguistic characteristics such as homonyms. An example of a homonym is the word table, which can have more than one meaning depending on the context: table can mean a piece of furniture or a diagram with columns of information. Finding the correct context to circumvent linguistic ambiguities is crucial for successful cross-lingual matching.

In this problem, our investigation aims to answer the following general research question: how to generate fully automated cross-language ontology mappings?

2. Problem 2: Cross-language alignment refinement

The literature has established that ontology mappings with enriched semantic correspondences improve ontology merging [80], where two divergent ontologies or their instances are combined into a single one. Existing work has highlighted the importance of semantically enriched mappings to improve data exchange between systems [3]. However, the current matching approaches are mostly focused on finding only exact matches between pairs of concepts or relations. The lack of semantic information in mappings is a problem affecting both monolingual and cross-lingual ontology matching. Although the literature already contains preliminary investigations of the semantic refinement of monolingual matching [3], this Ph.D. thesis is, to the best of our knowledge, the first to investigate the refinement of cross-lingual mappings.

The refinement is a challenging problem because of the intrinsic difficulties of lexical and semantic analysis of labels. Identifying linguistic relations such as hyponymy, hypernymy, and meronymy usually requires background knowledge and the availability of external resources. Another challenge is identifying which information can be used to enable mapping refinement and the best resource to leverage in obtaining this information.

In this problem, our investigation aims to answer the following research question: how to semantically enrich mapping relations in cross-lingual ontology alignments?
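The limitation of string-based measures described under Problem 1 can be illustrated with a short Python sketch. It implements the standard Levenshtein edit distance and a length-normalized similarity; the function names `levenshtein` and `edit_similarity` are illustrative choices, and the normalization by the longer string is an assumption rather than the exact measure used in the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 when nothing is shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Labels in different scripts share no characters, so the score carries no signal.
cross_script = edit_similarity("肝", "liver")
# Labels in the same script can still be compared character by character.
same_script = edit_similarity("liver", "livers")
```

Here `cross_script` is exactly 0.0 even though the two labels denote the same organ, which is why translation or background knowledge is needed before string comparison becomes useful.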

1.3 Research Goals and Approach

In this Ph.D. thesis research, we define and build a framework suited to automatically generate mappings between ontologies described in different natural languages. We thoroughly investigate and address cross-language ontology matching and refinement by devising new methods suited to automatically generate ontology mappings and enrich the type of semantic relation in the created alignments. Our approach is based on a two-step method that first implements a matching operation and then uses ontology evolution operations to refine mapping relations.

In order to achieve this goal, we have the following specific objectives:

1. Propose a cross-lingual matching technique using automated translation and the combination of similarity methods to generate a mapping candidate set;

2. Define a refinement technique using ontology change operations (evolution information) to semantically enrich generated cross-lingual mappings.

Figure 1.1 presents an overview of the research steps. Each number associated with a subject is the chapter number in this thesis. In our research methodology, we started with a detailed literature review to thoroughly understand the existing techniques. This enabled us to detect the shortcomings of the existing approaches and to highlight the originality of this investigation. Afterwards, experiments were conducted to identify key aspects involved in the mapping process (cf. 2 in Figure 1.1). Based on these results, we propose two novel cross-lingual matching techniques (3 and 4 in Figure 1.1) and a refinement technique (5 in Figure 1.1).

Figure 1.1: Overview of the research steps and the chapter number where they are addressed.

Cross-language matching remains an open research issue due to the difficulties in taking advantage of similarity computation. In our approach, similarity measures are used to calculate the degree of relatedness between ontology entities. We combine a syntactic similarity method [66], which explores the lexical information of strings, with semantic similarity measures, which compute the semantic distance between two strings denoting ontology entities [12]. In ontology matching, semantic similarity measures are used to estimate how concepts are related, calculating a normalized similarity value 0 ≤ x ≤ 1. The higher the value, the greater the similarity between the concepts. This helps to identify related concepts during the matching process. In particular, ontology matching techniques have considered the use of background knowledge for supporting the alignment via similarity calculation.

Our study proposes a series of experiments based on real-world mappings to understand the alignment between ontologies in different languages. We investigate the role of a pivot-language related to the domain. In particular, we analyze the influence of syntactic and semantic similarity methods and the structure of terms denoting concepts in ontologies [18]. We investigate the effects of different semantic similarity measures in the identification of cross-language mappings. We carried out experiments exploring real-world biomedical ontology mappings available from open repositories to comprehend the behaviour of computed similarity values [19]. Our results indicate the relevance of the domain-related background knowledge in the effectiveness of semantic measures for ontology cross-language alignment. These studies drove design decisions in this research. These initial steps paved the way for the original conceptualization of techniques and algorithms for automatic cross-lingual ontology matching.

In the first technique, we explore machine translation [45] for the automatic translation of labels denoting ontology entities. Our hypothesis is that combining syntactic and semantic similarity measures with the structure of the ontology can improve cross-language ontology matching. In our approach, we define a novel technique for automatic cross-language ontology matching based on the combination of a composed similarity approach with the analysis of neighbour concepts to improve the effectiveness of the alignment results. Our composed similarity considers lexical, semantic and structural aspects based on background knowledge to calculate the degree of similarity between contents of ontology entities in different languages.
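A rough Python sketch may help picture how a composed similarity and a neighbourhood check fit together. Everything in it is an assumption made for illustration: labels are taken to be already translated into a common language, `difflib.SequenceMatcher` stands in for the syntactic measure, the `TOY_SEMANTIC` table stands in for a semantic measure backed by background knowledge, and the equal weights are arbitrary. It is not the method defined in Chapter 3.

```python
from difflib import SequenceMatcher

# Toy stand-in for a semantic measure backed by background knowledge.
TOY_SEMANTIC = {frozenset(("author", "writer")): 0.9}


def syntactic(a: str, b: str) -> float:
    """String-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic(a: str, b: str) -> float:
    """Meaning-based similarity in [0, 1], looked up in the toy table."""
    if a.lower() == b.lower():
        return 1.0
    return TOY_SEMANTIC.get(frozenset((a.lower(), b.lower())), 0.0)


def composed(a: str, b: str, w_syn: float = 0.5, w_sem: float = 0.5) -> float:
    """Weighted combination of both measures (weights assumed to sum to 1)."""
    return w_syn * syntactic(a, b) + w_sem * semantic(a, b)


def best_neighbour_pair(neigh_a, neigh_b):
    """Pair of neighbouring labels with maximum composed similarity.

    Both neighbour lists are assumed to be non-empty.
    """
    pairs = ((x, y) for x in neigh_a for y in neigh_b)
    return max(pairs, key=lambda pair: composed(*pair))


# Neighbours of two candidate entities, after translating "pessoa" to "person".
pair = best_neighbour_pair(["person", "paper"], ["person", "review"])
```

With these toy inputs, `pair` is `("person", "person")`: matching neighbours provide structural evidence that the two candidate entities occupy similar positions in their ontologies.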

In our second cross-language ontology matching technique, we explore information retrieval methods to generate a mapping set. A common approach to define proper alignments relies on identifying the relationships among concepts from different ontologies by performing multiple entity-based searches. Our hypothesis is that rank aggregation techniques can benefit cross-language ontology matching. In this strategy, the most suitable match is defined by the top-ranked concept found. Often, multiple similarity rankers, defined in terms of different similarity criteria, are considered to define candidate entities. In this case, their complementary views can be exploited in the definition of the best possible match. We explore the use of rank aggregation functions, under both unsupervised and supervised settings, in the task of defining suitable matches among entities belonging to ontologies encoded in distinct languages.
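The idea of fusing several ranked lists and mapping the source entity to the top-1 candidate can be sketched in Python. The sketch uses the Borda count, a well-known unsupervised rank aggregation function chosen here only for illustration; the passage above does not state which aggregation functions the thesis adopts, and the candidate labels and rankings below are made-up toy data.

```python
from collections import defaultdict


def borda(ranked_lists):
    """Fuse ranked lists: each list gives n points to its first item, n-1 to the next, ..."""
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# One ranked list of target-ontology candidates per similarity measure (toy data).
rankings = [
    ["author", "reviewer", "paper"],
    ["reviewer", "author", "paper"],
    ["author", "paper", "reviewer"],
]

final_rank = borda(rankings)
top1 = final_rank[0]  # candidate that would be mapped to the source entity
```

Even though the second ranker disagrees with the other two, the aggregated list places "author" first, which is the kind of complementarity between rankers that aggregation is meant to exploit.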

We thoroughly evaluate our matching techniques. We carried out a series of experiments to investigate the quality of mappings generated by our techniques. Our experiments explored standard datasets from the Ontology Alignment Evaluation Initiative (OAEI)⁴ competition, the conference-domain ontologies in 45 language pairs from the MultiFarm⁵ dataset [61]. MultiFarm provides curated mappings between multilanguage ontologies. This dataset has been extensively used to assess cross-language ontology matching methods. Experimental results achieved in this thesis indicate good effectiveness of our approaches, leading to better f-measure results when compared with state-of-the-art techniques in cross-language ontology matching.

Ontology mapping refinement helps to expand the types of semantic relations identified during the matching process. Refinement can modify or enrich semantic relations; for instance, during the refinement process, an equivalence (≡) relation (i.e., a relationship defining that two mapped concepts are equivalent) can be modified to an is-a (⊑) relationship (i.e., a relationship where one concept is a specialization of the other) [3]. Refinements can also insert or remove mappings, based on defined pattern rules [42]. These rules can be created manually or automatically, and are used to validate the relations.

We propose a method where the mapping set is refined to semantically enrich the candidate mappings found during the matching process. When a knowledge domain expands, the ontologies representing the domain need to be updated to reflect domain changes. Consequently, ontologies are constantly evolving, by adding and removing concepts and relations over time. Our hypothesis is that ontology evolution changes can be useful to the cross-language alignment refinement. We leverage ontology evolution information, such as change operations (e.g., changing the value of an attribute), to help in the process of refinement. The use of this information provides an understanding of how the concepts were updated over time, supporting the application of actions needed to modify the type of semantic relation in mappings.

⁴ http://oaei.ontologymatching.org/ (As of June 2019).

1.4 Contributions

We summarize the prime scientific contributions of this thesis as follows:

1. Experimental studies that uncover key aspects of cross-language ontology matching. We analyze the similarity values computed by distinct semantic similarity measures applied to cross-language mappings and compare the effectiveness of the measures under different scenarios and configurations. Experimental results, focused on the life science domain, indicate useful factors to take into account in the design of matching algorithms for domain-specific cross-language alignment (cf. Chapter 2).

2. A technique using weighted syntactic and semantic similarity measures to achieve automated cross-language matching. The proposed technique relies on the similarity computed among those concepts immediately related to a given entity (the neighbours), both on source and target ontologies (cf. Chapter 3).

3. Another novel technique for cross-language matching. This method casts the ontology matching problem as an information retrieval problem, enabling the use of rank aggregation techniques to find correspondences between cross-language ontologies (cf. Chapter 4).

4. An original ontology mapping refinement method based on ontology evolution changes in cross-lingual settings. The key contribution is the use of ontology evolution information for cross-language ontology mapping refinement (cf. Chapter 5).

5. A thorough evaluation of the methods in our approach with datasets from the biomedical and conference domains. The evaluations experimentally assess all the proposed methods. The results obtained in these experimental evaluations attest to the effectiveness of our approach and methods.

6. Algorithms that implement the proposed techniques and a software tool with which we participated in the 2018 OAEI competition.

Although several methods for mapping creation and refinement exist, to the best of our knowledge none has used the methods proposed in this Ph.D. thesis, nor has any used information from evolution operations to improve cross-language matching.


1.5 Thesis Organization

This Ph.D. Thesis is organized in chapters representing a collection of articles already published or submitted for publication (under review).

Chapter 2 corresponds to the article “Influence of Semantic Similarity Measures on Ontology Cross-language Mappings”, published in the Proceedings of the Symposium on Applied Computing [19]. This presents our study on the impact of semantic similarity measures for cross-lingual ontology matching. These measures were subsequently used in our proposed techniques.

Chapter 3 corresponds to the article “Neighbourhood-based Cross-Language Ontology Matching”, submitted to an international journal [15]. This paper, currently under review, reports on our technique using structural information combined with composed similarity as a novel technique for cross-language ontology matching.

Chapter 4 corresponds to the article “Exploring Rank Aggregation for Cross-Lingual Ontology Alignments”, submitted to an international journal [17]. This paper, currently under review, describes the results of our experimental studies on ontology mapping leveraging rank aggregation.

Chapter 5 corresponds to the article “Ontology changes-driven semantic refinement of cross-language biomedical ontology alignments”, published in the Proceedings of the 3rd Workshop on Semantic Web solutions for large-scale biomedical data analytics (SeWeBMeDA 2019), co-located with the 18th International Semantic Web Conference (ISWC 2019) [21]. This paper presents the mapping refinement proposal and an evaluation with ontologies in the biomedical domain.

Chapter 6 closes this thesis by providing a summary of our achievements and highlighting the contributions made throughout the chapters. We present several directions for future work.


Chapter 2 Influence of Semantic Similarity Measures on Ontology Cross-language Mappings

Abstract.¹ Cross-language mappings establish relations between ontology concepts defined in different languages. Similarity measures calculate the degree of relatedness between concepts to support matching between two distinct ontologies. Cross-language matching remains an open research issue due to the difficulties in taking advantage of similarity computation. This article investigates the effects of different semantic similarity measures on the identification of cross-language mappings. We carry out experiments exploring real-world biomedical ontology mappings to comprehend the behaviour of computed similarity values. The obtained results indicate the relevance of the domain-related background knowledge in the effectiveness of semantic measures for ontology cross-language alignment.

2.1 Introduction

Mapping establishes explicit links between ontology entities. Usually distinct ontologies are necessary to structure data and make their meaning explicit from scattered sources. In this context, mapping plays a key role as a reference for semantic interoperability between information systems performing data exchange. Given two ontologies (described in different natural languages), ontology matching is the task of finding a set of mappings interrelating the concepts.

Ontology matching has been studied in depth in the context of monolingual ontologies [83]. Nevertheless, cross-language ontology matching remains scarcely investigated [91]. Due to the large volume of current ontologies, automatic cross-language matching methods deserve a thorough investigation. This is particularly relevant in the life science domain, where ontologies need mechanisms for data integration.

¹ DESTRO, J. M.; DOS REIS, J. C.; CARVALHO, A. M. B. R.; RICARTE, I. L. M. 2017. Influence of Semantic Similarity Measures on Ontology Cross-language Mappings. In Proceedings of the ACM Symposium on Applied Computing (SAC’17). The Semantic Web and Applications. Marrakech, Morocco, pp. 323-329.


Although proposals for cross-language ontology alignment do exist, the developed approaches do not consider the linguistic characteristics of the domain. Ontologies expressed in different natural languages hamper the straightforward use of existing ontology matching methods. Similarity measures are a central technique for ontology matching. They provide means for computing the degree of relatedness between concepts. In particular, semantic similarity approaches explore the use of external sources such as vocabularies and domain knowledge structures to leverage the similarity calculation. These measures might contribute to advancing solutions for the cross-language matching problem.

Even though these measures have been studied in the context of traditional matching and in the context of biomedical ontologies [38], the literature lacks systematic experimental investigations to examine the relevance and adequate use of semantic similarity measures for cross-language matching. The benefits of applying semantic similarity measures and the most appropriate techniques to address this issue are still unknown. We aim to provide an original study of the accuracy and effectiveness of these measures in cross-language mappings.

In this article, we present experiments conducted with real-world ontology mappings to analyze the similarity values computed by distinct semantic similarity measures applied to cross-language mappings. The objective is to compare the effectiveness of the measures under different scenarios and configurations. We assume that the approach to calculate the similarity and the explored background knowledge might influence cross-language matching. This study explores large biomedical ontologies with mapping sets established among the ontologies.

The central contribution of this investigation is the set of distinct analyses of the implemented experiments. We investigate the influence of semantic similarity measures in the determination of similarity scores between biomedical concepts in different languages. The study reveals the usefulness of this class of measures for cross-lingual matching techniques. In our research method, we first retrieve biomedical ontologies and mappings from available repositories. The interrelated concepts for a given mapping are automatically translated to a pivot-language. For each mapping, distinct experiments analyze the degree of similarity for several semantic similarity measures. We calculate similarity between concept labels and synonyms, and analyze the impact of their length on the obtained similarity values. The overall results indicate that the choice of the semantic similarity method and the explored background knowledge have an important impact on cross-language matching.

This article is organized as follows: Section 2.2 presents the related work. Whereas Section 2.3 formalizes the definitions and the explored semantic similarity measures, Section 2.4 describes the experiments and datasets. Section 2.5 reports on the achieved results and Section 2.6 discusses them. Section 2.7 draws final remarks and future perspectives.

2.2 Related Work

Methods for matching between ontologies defined in different natural languages have mostly investigated the use of a third language and the effects of translation procedures.


Jung et al. [51] proposed an approach for exploring indirect alignments between multilingual ontologies. Their method uses already existing mappings with a third language available on both ontologies. Differently, Spohr et al. [85] studied the translation of concept labels to a third language for ontology cross-language matching. Their approach explored machine learning techniques. The approaches carrying out translation were motivated by studies that analyse the impact of automatic translations on multilingual ontology alignment [36]. The findings highlighted the translation’s relevance for achieving adequate matching quality.

Studies have investigated the potential reuse of existing traditional ontology matching systems for cross-language matching. Meilicke et al. [62] studied the performance of matching systems. The obtained results indicate the difficulties that traditional ontology matching algorithms face in multilingual ontology alignment. More recently, Trojahn et al. [91] presented an extensive survey on matching systems and techniques for accomplishing multilingual and cross-language ontology matching. The outcome of their study revealed the need for novel techniques to advance the task.

Ontology matching techniques have considered the use of semantic similarity methods relying on background knowledge. Similarity measures aim to calculate the degree of relatedness between concepts exploiting different knowledge sources (e.g., ontologies, thesauri, domain corpora, etc.). Stoutenburg [86] argues that the use of ontologies combined with linguistic resources might enhance ontology alignment processes, as an alternative to syntactic-based similarity measures that rely only on string comparisons to determine a similarity score.

Aleksovski et al. [1] proposed to align ontologies by matching them with external knowledge sources. They explore paths between the anchored matched concepts to find mappings between concepts. Differently, Sabou et al. [82] proposed to align ontology concepts by selecting the most appropriate ontology over multiple and heterogeneous knowledge sources. The TaxoMap approach uses the WordNet lexical database as background knowledge [43].

Handling more than one language has hardly been investigated in the literature. Mohammad et al. [65] proposed the use of external lexicon corpora in the design of a cross-language measure of semantic distance. However, the lack of extensive word-aligned bilingual corpora in specific domains limits the usefulness of such an approach.

Domain-specific resources have only been superficially studied in the biomedical domain for the purpose of ontology alignment. Zhang and Bodenreider [101] proposed exploring the Unified Medical Language System (UMLS)² structure to accomplish matching between anatomical ontologies. The obtained results indicate that domain knowledge is a key factor in the identification of additional mappings compared with the generic matching approach. Pesquita et al. [76] presented a comparative assessment of several semantic similarity measures applied to biomedical ontologies. They classified the semantic similarity techniques into edge-based and node-based, according to how they assess the similarity score between concepts. While the edge-based techniques explore the structure of the ontology, relying on the defined relations between concepts, the node-based techniques consider the elements defining concepts (e.g., labels and synonyms) as information content to quantify the similarity score. Their work was not applied to the ontology matching task.

More recently, Garla and Brandt [38] evaluated the effectiveness of several approaches to measure semantic similarity applied to the biomedical domain. They analysed the influence of different types of background knowledge on the accuracy of the similarity values. Results indicate the benefits of knowledge-based semantic similarity measures and of the use of UMLS for improving the measures' accuracy.

To the best of our knowledge, semantic similarity measures have not been duly investigated in the context we are addressing. We want to examine to what extent domain-specific resources, like UMLS applied to semantic similarities, benefit cross-language matching. Our research aims at empirically understanding the most adequate solution aspects involving similarity in the identification of mappings between ontologies in different languages. We believe our findings might pave the way to the design of suitable matching mechanisms based on semantic similarity measures.

2.3 Fundamental Concepts

An ontology $O$ refers to a set of concepts interrelated by relationships, e.g., “is-a”, “part-of”, “related-to”. We denote $Concepts(O_x) = \{C_1, C_2, ..., C_n\}$ as the concepts defined in an ontology $O_x$. Each concept is characterized by a unique identifier number, a label and a set of synonyms.

A mapping $m_{xy}$ establishes a relation between two concepts $C_x$ and $C_y$, from two different ontologies, as $m_{xy} = (C_x, C_y, sim, \equiv)$. The symbol $\equiv$ denotes the equivalence semantic relation connecting $C_x$ and $C_y$, where $C_x \in Concepts(O_x)$ and $C_y \in Concepts(O_y)$. The value $sim \in [0, 1]$ stands for the similarity between $C_x$ and $C_y$. The set $L_{XY} = \{(m_{xy})_i \mid i \in \mathbb{N}\}$ refers to the mapping set of $O_x$ and $O_y$.

Given a concept $C_k \in Concepts(O_x)$, the function $LB(C_k)$ returns the label of $C_k$, expressing its local name as a natural language string. For example, “vascular diseases” is the label of a concept. Properties like rdfs:label and skos:prefLabel define the labels in ontology description languages. We explore synonyms as the relevant and equivalent terms (strings denoting concepts) for further characterization of a concept. The function $SY(C_k) = \{s_1, s_2, ..., s_w\}$ returns the list of synonyms of the given concept $C_k$. For instance, the term “atopy” is a synonym of “allergy”.
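The definitions of concepts, labels, synonyms and mappings given so far can be sketched as simple data structures. This is a minimal illustration only: the class names, field names and identifiers below are hypothetical and do not come from any implementation described in this thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Concept:
    identifier: str                  # unique identifier of the concept
    label: str                       # value of rdfs:label / skos:prefLabel
    synonyms: Tuple[str, ...] = ()   # equivalent terms denoting the concept

def LB(concept: Concept) -> str:
    """Label function LB(C_k)."""
    return concept.label

def SY(concept: Concept) -> List[str]:
    """Synonym function SY(C_k)."""
    return list(concept.synonyms)

@dataclass(frozen=True)
class Mapping:
    source: Concept       # C_x, a concept of O_x
    target: Concept       # C_y, a concept of O_y
    sim: float            # similarity value in [0, 1]
    relation: str = "≡"   # equivalence semantic relation

    def __post_init__(self) -> None:
        if not 0.0 <= self.sim <= 1.0:
            raise ValueError("sim must lie in [0, 1]")

# Hypothetical concepts and mapping, for illustration only.
allergy = Concept("ox:001", "allergy", ("atopy",))
hypersensitivity = Concept("oy:042", "allergic hypersensitivity")
m_xy = Mapping(allergy, hypersensitivity, sim=0.8)
```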

We need the translation of the concept label and synonyms. We define the translation of a concept $C_k$ as $C_k^T$. Given that the result of $LB(C_k)$ and $SY(C_k)$ is expressed in a language $\beta$, the label $LB(C_k^T)$ and synonyms $SY(C_k^T)$ of $C_k^T$ are expressed in a language $\alpha$ such that $\beta \neq \alpha$.

The function $LE(el, n)$ returns the first $n$ tokens of $el$. The parameter $n$ refers to the number of tokens, and $el$ is either $LB(C_k)$ or an element of $SY(C_k)$. For instance, $LE(LB(C_k), 2)$ returns the first two tokens of the label of concept $C_k$. The similarity value between two strings denoting a concept is calculated by $SIM(el_1, el_2) \in [0, 1]$.
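As a minimal sketch of the token-prefix function, assuming that tokens are separated by whitespace (the tokenization rule is not specified in the text):

```python
def LE(el: str, n: int) -> str:
    """Return the first n whitespace-separated tokens of the string el."""
    tokens = el.split()
    return " ".join(tokens[:n])

# Example with a three-token label.
prefix = LE("chronic vascular diseases", 2)
```

When n exceeds the number of tokens, the sketch simply returns the whole string, which matches the intuition that short labels stop changing as the prefix length grows.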

We examine the behaviour of similarity values calculated in the context of cross-language mappings when computing the relatedness by different semantic-based similarity techniques. Although several measures can implement the similarity function, this study considers three well-known approaches to semantic measures.

We investigate the use of the NASARI method. It makes cross-language similarity measurement possible by using vectors in a unified, language-independent space of concepts from semantic representations in BabelNet [67]. BabelNet is a domain-neutral semantic network used as general background knowledge for the similarity computation. This technique explores the Weighted Overlap method applied to the semantic vector representation [50]. The similarity is computed by comparing the two interpretable vector representations. Our motivation for examining NASARI is its potential benefits for cross-language matching issues.
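For intuition, the sketch below implements one common rank-based formulation of Weighted Overlap over sparse vectors, in which shared dimensions contribute more when they are ranked highly in both vectors. The exact variant applied by NASARI is not detailed here, so the formula, the function name and the toy vectors should all be read as assumptions.

```python
def weighted_overlap(v1, v2):
    """Rank-based Weighted Overlap between two sparse vectors.

    v1 and v2 map dimension names to weights. Dimensions are ranked by
    decreasing weight inside each vector (rank 1 = heaviest dimension).
    """
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0

    def ranks(vector):
        ordered = sorted(vector, key=lambda d: (-vector[d], d))
        return {dim: rank for rank, dim in enumerate(ordered, start=1)}

    r1, r2 = ranks(v1), ranks(v2)
    numerator = sum(1.0 / (r1[d] + r2[d]) for d in overlap)
    # Best case: the shared dimensions occupy the top ranks of both vectors.
    denominator = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return numerator / denominator

# Hypothetical concept vectors, for illustration only.
vec_a = {"heart": 0.9, "blood": 0.5, "vessel": 0.2}
vec_b = {"blood": 0.8, "heart": 0.6, "lung": 0.1}
score = weighted_overlap(vec_a, vec_b)
```

Because only ranks of shared dimensions matter, the score is insensitive to the absolute scale of the weights, which is convenient when vectors come from different languages.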

In order to investigate semantic distances based on a domain-specific resource, we explore two measures available in the UMLS::Similarity project [60]. They compute the semantic relatedness of concepts relying on the semantic network defined in the Unified Medical Language System (UMLS). In particular, we examine the path-based principle of measurement and the vector-based calculation.

The path-based method uses an edge-counting technique between concepts denoted in UMLS. Inspired by graph theory, a path here refers to a finite sequence of edges connecting a sequence of concepts. The input strings are first related to existing concepts in UMLS as a first anchoring step. The resulting relatedness value is the multiplicative inverse of the path length between the detected concepts. If the compared concepts are the same, then the resulting similarity score is the highest. This is a traditional approach for background-based semantic similarity, which motivates the analysis of its effectiveness in our study context.
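The path-based principle can be illustrated on a toy concept graph. In this sketch the path length is assumed to be counted in concepts (nodes) along the shortest path, so that comparing a concept with itself yields the highest score of 1; the graph, the concept names and the function names are hypothetical and are not taken from UMLS or from the UMLS::Similarity package.

```python
from collections import deque

def shortest_path_nodes(graph, start, goal):
    """Number of concepts on the shortest path from start to goal (inclusive).

    Returns None when no path exists. graph maps a concept to its neighbours.
    """
    if start == goal:
        return 1
    visited = {start}
    queue = deque([(start, 1)])
    while queue:
        node, count = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return count + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, count + 1))
    return None

def path_similarity(graph, a, b):
    """Multiplicative inverse of the path length; 0.0 when the concepts are unconnected."""
    length = shortest_path_nodes(graph, a, b)
    return 0.0 if length is None else 1.0 / length

# Hypothetical undirected fragment of a biomedical hierarchy.
toy_graph = {
    "disease": ["vascular disease", "allergy"],
    "vascular disease": ["disease", "aneurysm"],
    "allergy": ["disease"],
    "aneurysm": ["vascular disease"],
}
same = path_similarity(toy_graph, "allergy", "allergy")
distant = path_similarity(toy_graph, "aneurysm", "allergy")
```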

The vector-based method uses the Gloss Vectors technique [74] to calculate relatedness based on the UMLS. Given the definition of each concept (string term denoting the concepts), the technique creates a context vector for each word from the definition by calculating its co-occurrence with data from the corpus. All the context vectors together build a matrix representing the term definition vector (gloss vector). The semantic relatedness of two concepts is calculated via the cosine of the angle between the concepts’ gloss vectors. Differently from the path-based approach, the Gloss Vectors technique does not require an underlying structure, and the knowledge from the corpus is explored to compute the similarity.
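A minimal sketch of the vector-based principle follows. It assumes that the gloss vector of a concept is obtained by summing the context vectors of the words in its definition, and it uses made-up co-occurrence counts; both the aggregation by summation and the toy data are assumptions for illustration rather than a description of the UMLS::Similarity implementation.

```python
import math

def gloss_vector(definition_words, context_vectors):
    """Sum the context vectors of the definition words (assumed aggregation)."""
    dims = len(next(iter(context_vectors.values())))
    total = [0.0] * dims
    for word in definition_words:
        vector = context_vectors.get(word)
        if vector is None:
            continue  # words unseen in the corpus contribute nothing
        for k, value in enumerate(vector):
            total[k] += value
    return total

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts over three corpus dimensions.
toy_context = {
    "blood": [2.0, 0.0, 1.0],
    "vessel": [1.0, 1.0, 0.0],
    "immune": [0.0, 3.0, 1.0],
}
gloss_a = gloss_vector(["blood", "vessel"], toy_context)   # [3, 1, 1]
gloss_b = gloss_vector(["immune", "vessel"], toy_context)  # [1, 4, 1]
relatedness = cosine(gloss_a, gloss_b)
```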

2.4 Experiments

This study investigates the similarity between the original and translated version of the concepts involved in a mapping. Figure 2.1 presents an example of mapping with the involved concepts and their translation. Given a mapping $m_{xy} \in L_{XY}$, in the first step the concepts are translated. From the concepts $C_x$ and $C_y$ interrelated by the mapping, the translation results in $C_x^T$ and $C_y^T$.

From the original language, $\beta$ (English), we translate concept labels and synonyms to a language $\alpha$ (Latin) or $\gamma$ (Spanish). Latin ($\alpha$) is used as a pivot-language because it is the most prevalent etymological source in the chosen domain [9]. We also explore Spanish ($\gamma$) as an alternative because it is the first Romance language on the lists of languages by number of native speakers³. This work relies on the Google API through the Python module TextBlob to obtain the translation automatically. This choice was motivated by the fact that it can be used at run-time, without the need to pre-process the strings.

Figure 2.1: Semantic similarity in cross-language mapping. This shows the concepts involved in the mapping described in a Language β and their translation to Language α or γ.

For each mapping, our procedure examines the behaviour of similarity values between the original elements (label and synonyms) of a concept $C_x$, via $LB(C_x)$ and $SY(C_x)$, and the translated version of the elements of the interrelated concept $C_y^T$, as denoted by $LB(C_y^T)$ and $SY(C_y^T)$, and vice-versa (cf. a in Figure 2.1).

We investigate the effects of the label length. The similarity is thus computed considering $LE(LB(C_x), i)$ and $LE(LB(C_y^T), i)$, where $i = 1, ..., n$, such that $n$ refers to the highest term length between $LB(C_x)$ and $LB(C_y^T)$.

For the similarity calculation, we do not distinguish between labels and synonyms. To this end, we define $IC(C_x) = (LB(C_x), SY(C_x))$, which considers all content in terms of label and synonyms of a given concept. For each iteration over the length $i$, the similarity computation takes the maximum value, as follows:

$$S(i) = \max \left\{ SIM(LE(IC(C_x), i), LE(IC(C_y^T), i)),\; SIM(LE(IC(C_x^T), i), LE(IC(C_y), i)) \right\}$$

In this sense, each analysed mapping results in an array of similarity, where for each index i a similarity value is associated. We apply this process considering i = 1, ..., n, such that i refers to the label and synonym length.
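The construction of the similarity array can be sketched as follows. The set-level similarity function is passed in as a parameter; the token-overlap function used here is only a hypothetical stand-in for the semantic measures studied in this article, and the sketch takes n as the longest element among all labels and synonyms, which is an assumption.

```python
def LE(el, n):
    """Return the first n whitespace-separated tokens of el."""
    return " ".join(el.split()[:n])

def S(i, ic_x, ic_y_t, ic_x_t, ic_y, sim):
    """Maximum of the two directions: original vs. translated and vice-versa."""
    def cut(elements):
        return [LE(e, i) for e in elements]
    return max(sim(cut(ic_x), cut(ic_y_t)), sim(cut(ic_x_t), cut(ic_y)))

def similarity_array(ic_x, ic_y_t, ic_x_t, ic_y, sim):
    """One similarity value per prefix length i = 1, ..., n."""
    n = max(len(e.split()) for e in ic_x + ic_y_t + ic_x_t + ic_y)
    return [S(i, ic_x, ic_y_t, ic_x_t, ic_y, sim) for i in range(1, n + 1)]

def toy_sim(elements_a, elements_b):
    """Hypothetical stand-in: best token-level Jaccard overlap over all pairs."""
    best = 0.0
    for e1 in elements_a:
        for e2 in elements_b:
            t1, t2 = set(e1.lower().split()), set(e2.lower().split())
            if t1 | t2:
                best = max(best, len(t1 & t2) / len(t1 | t2))
    return best

# Illustrative labels only; translations are written by hand, not machine-generated.
scores = similarity_array(
    ic_x=["angina pectoris"],
    ic_y_t=["angina de pecho"],
    ic_x_t=["angina de pecho"],
    ic_y=["angina pectoris syndrome"],
    sim=toy_sim,
)
```

With this toy measure the score is highest for the one-token prefix and drops as more tokens are compared, which is the kind of length effect the experiments are designed to observe.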

Given that $IC(C_x) = \{e_{x1}, e_{x2}, ..., e_{xw}\}$ and $IC(C_y^T) = \{e_{y1}, e_{y2}, ..., e_{yn}\}$, the similarity calculation function considers the Cartesian product over the elements of $C_x$ and $C_y^T$. Taking the first iteration between $e_{x1}$ and $e_{y1}$, these strings are split into separate words and stopwords are removed. The procedure calculates the similarity value between each word of $e_{x1}$ and all words in $e_{y1}$ and retains the maximum similarity value computed among all words. This similarity value is stored in a similarity array. The output similarity result between $e_{x1}$ and $e_{y1}$ is the average of the values stored in the similarity array. The final similarity outcome between $IC(C_x)$ and $IC(C_y^T)$ is the maximum value from the set of stored average values of similarity (i.e., the maximum of the averages).

This work applies this procedure and implements four experiments examining different semantic similarity measures for cross-language labels and synonyms. The objective is to analyze the results of similarities for each measure technique taking into account the length of labels and synonyms. We describe the experiment configurations as follows:

Experiment 1: NASARI [Babelnet] / T:Latin. This experiment applies the NASARI similarity method with the translation of concepts made to the Latin language. The NASARI method calculates semantic relatedness within BabelNet concepts. Latin was selected as pivot-language because it is etymologically related to the biomedical domain and it is available in BabelNet.

Experiment 2: NASARI [Babelnet] / T:Spanish. The second experiment explores NASARI with the translation of the concepts to the Spanish language. Spanish is a Romance language available in BabelNet and UMLS.

Experiment 3: PATH-BASED [UMLS] / T:Spanish. In order to perform a comparative analysis, the third experiment explores the path-based similarity function relying on the UMLS background knowledge. Since the path-based method is a traditional semantic similarity measure, we aim to identify whether it yields a satisfactory outcome compared with more complex vector methods.

Experiment 4: VECTOR-BASED [UMLS] / T:Spanish. The fourth experiment considers the vector-based similarity function based on UMLS. This method uses multidimensional vectors, similar to NASARI, to calculate the semantic relatedness value. In both the third and fourth experiments, concepts are translated to Spanish only, because Latin is not a language available in the UMLS.
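Independently of the measure plugged into each experiment, the word-level aggregation procedure described before the experiment configurations (maximum per word, average per pair of elements, maximum over the Cartesian product) can be sketched as follows. The stopword list, the word-level scores and all function names are hypothetical; in the experiments, the word-level similarity would come from NASARI or from the UMLS-based measures.

```python
from itertools import product

STOPWORDS = {"of", "the", "and", "de", "la", "el"}  # illustrative list only

def content_words(text):
    """Split a string into lower-cased words and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def element_similarity(e1, e2, word_sim):
    """Average, over the words of e1, of the best score against any word of e2."""
    words1, words2 = content_words(e1), content_words(e2)
    if not words1 or not words2:
        return 0.0
    best_per_word = [max(word_sim(a, b) for b in words2) for a in words1]
    return sum(best_per_word) / len(best_per_word)

def concept_similarity(ic_x, ic_y_t, word_sim):
    """Maximum of the averages over the Cartesian product of the elements."""
    return max(
        (element_similarity(e1, e2, word_sim) for e1, e2 in product(ic_x, ic_y_t)),
        default=0.0,
    )

# Hypothetical cross-language word scores standing in for a semantic measure.
TOY_WORD_SCORES = {("heart", "corazón"): 0.9, ("attack", "ataque"): 0.8}

def toy_word_sim(a, b):
    return 1.0 if a == b else TOY_WORD_SCORES.get((a, b), 0.0)

result = concept_similarity(
    ["heart attack", "myocardial infarction"],  # label and synonym of C_x
    ["ataque de corazón"],                      # translated label of C_y
    toy_word_sim,
)
```

Here the label pair scores the average of 0.9 and 0.8, while the synonym pair scores 0, so the procedure retains the label pair as the final outcome.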

We carry out the experiments with biomedical ontology mappings. BioPortal⁴ is the source from which ontologies and mappings were collected. The experiments explored mappings between the Systematized Nomenclature of Medicine-Clinical Terms (SNOMEDCT) and several other ontologies. SNOMEDCT is a very large health terminological resource managed by the International Health Terminology Standards Development Organisation (IHTSDO)⁵.

Table 2.1 presents characteristics of the explored ontologies including, in addition to SNOMEDCT, the Logical Observation Identifier Names and Codes (LOINC) and the National Cancer Institute Thesaurus (NCIT). Our experimental dataset includes 29,676 mappings between SNOMEDCT and LOINC, and 16,746 mappings between SNOMEDCT and NCIT. These mappings were created and curated by the National Center for Biomedical Ontology (NCBO).

⁴ bioportal.bioontology.org
⁵ www.ihtsdo.org


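Table 2.1: Biomedical ontologies considered in the study.

| Ontology O | Release | Language | #Concepts | Avg. size of the label |
|------------|---------|----------|-----------|------------------------|
| SNOMEDCT   | 2015    | English  | 353,639   | 24                     |
| LOINC      | 2015    | English  | 174,512   | 13                     |
| NCIT       | 2015    | English  | 116,093   | 15                     |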

2.5 Results

We present the results for the four experiments. We analyze the findings for a maximum length of ten tokens. Although our test runs considered larger numbers of tokens, we observed that the analysis with ten tokens was sufficient for our purposes.

Figure 2.2 shows the results for the first experiment applying NASARI with the translation to Latin. We denote by $T_n$ the configuration in which $n$ is the maximum number of tokens used. We analyze the behavior of similarity values when the length of labels and synonyms increases. For each length ($T_1, T_2, T_3, ..., T_{10}$), the figure plots the distribution of the computed similarity values organized in three groups of similarity ranges. The x-axis considers the results for the string length from 1 to 10 and three ranges of similarity analysis. The y-axis shows the number of mappings. The number of mappings in the highest range is presented on the top of each bar. Results for the remaining experiments follow the same presentation approach.
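The grouping behind these plots can be sketched as a simple count of mappings per similarity range and token length. The boundaries of the three ranges are not stated in the text, so this sketch assumes three equal-width ranges over [0, 1]; the similarity values used are invented for illustration.

```python
def bin_similarities(values, n_bins=3):
    """Count similarity values in [0, 1] per equal-width range (assumed boundaries)."""
    counts = [0] * n_bins
    for value in values:
        index = min(int(value * n_bins), n_bins - 1)  # a value of 1.0 falls in the last range
        counts[index] += 1
    return counts

def distribution_by_length(values_by_length, n_bins=3):
    """Map each token length i to the label T_i and its per-range counts."""
    return {
        f"T{length}": bin_similarities(values, n_bins)
        for length, values in sorted(values_by_length.items())
    }

# Invented similarity values for two token lengths.
toy_values = {1: [0.9, 1.0, 0.5], 2: [0.1, 0.34]}
distribution = distribution_by_length(toy_values)
```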

Figure 2.2: Experiment 1. This figure presents the behavior of similarity values computed with NASARI [Babelnet] and translation to Latin Language.

Figure 2.3 shows the results applying NASARI but with concepts translated to Spanish. Comparing both configurations, the results in Figure 2.2 present a higher number of mappings with similarity values in the last range (i.e., the range with the highest score in the semantic relatedness value), whereas the number of mappings in the lowest range is much smaller than in Figure 2.3. Indeed, except for $T_1$ and $T_2$, the similarity values found with experiment 2 (NASARI / T:Spanish) concentrate in the lower range. We obtain greater values of similarity when the original concepts are translated to Latin. Note that results remain stable after $T_5$.

Figure 2.3: Experiment 2. This figure presents the behavior of similarity values computed with NASARI [Babelnet] and translation to Spanish.

Figure 2.4 presents the results applying the path-based similarity method based on UMLS background knowledge, whereas Figure 2.5 shows the results of the vector-based similarity function with UMLS. A similar analysis to experiments 1 and 2 reveals that the vector-based function outperforms the path-based method. Whereas the distribution of similarity values obtained with the vector-based function is slightly better in the last range, the path-based method obtains many more mappings with similarity values in the lowest range. This reveals that the vector-based function is more accurate for obtaining greater similarity scores. The use of the principle of path calculation makes the calculation of semantic relatedness more difficult, which may explain our findings.

Figure 2.4: Experiment 3. This figure presents the behavior of similarity values computed with PATH-BASED [UMLS] and translation to Spanish.

In an overall analysis, the semantic similarity methods exploring the UMLS background knowledge obtain more accurate similarity values than NASARI exploring BabelNet, a domain-neutral semantic network. The explanation for this lies in the fact that the UMLS is a domain-specific resource that leverages the possibility of obtaining better similarity scores.

Figure 2.5: Experiment 4. This figure presents the behavior of similarity values computed with VECTOR-BASED [UMLS] and translation to Spanish.


2.6 Discussion

The literature lacks investigations of the cross-language ontology alignment problem in the life science domain. The computation of similarity, yielding a relatedness score between concepts, plays a central role in ontology alignment. Although semantic similarity approaches might be key to the success of cross-language matching, existing work has hardly studied the effects of distinct approaches to semantic similarity calculation in our context. This investigation proposed a set of experiments to thoroughly study the influence of the type of similarity function for multilingual matching algorithms.

The influence of background knowledge in the semantic network used is clear when comparing the results of the experiments. A comprehensive and multilingual domain-specific corpus boosts the effectiveness of semantic similarity measures. This was made evident with the use of UMLS as a domain resource to support the similarity functions. Results made clear that vector-based methods, using multidimensional vectors, yield more accurate similarity scores than path-based methods. This is particularly important because the Gloss Vector method is applicable to lexical resources beyond the biomedical domain. The findings also demonstrated that using a domain-neutral corpus with a pivot-language strongly related to the domain (Latin in our experiments) produces better results than using a pivot-language unrelated to the domain. These aspects might be useful to consider in the definition of specific matching algorithms to achieve high-quality and accurate mappings in cross-lingual ontology alignment.

Despite the relevance of the achieved findings, the dependency on a comprehensive multilingual domain-related corpus may limit the application of semantic similarity to domains where such corpora are available. At this stage, we can only generalize the results to Romance languages. Our experiments also only took into account original mappings described in the English language, but this does not invalidate the findings.

Further investigations involve running more extensive tests with additional datasets. We aim to examine the results for the different datasets separately to observe how the ontology characteristics and languages impact the outcomes. Analysing datasets from other domains to compare the results can also strengthen the relevance of this research.

2.7 Conclusion

Cross-language alignment remains an open problem. It relies on several different approaches to obtain mappings that interrelate ontologies described in different languages. Similarity and relatedness measures might help in the development of matching algorithms to determine adequate mappings, but we need to understand the benefits and limitations of different semantic similarity approaches. This article proposed an original study to empirically understand the behaviour of similarity values obtained by applying several approaches to semantic similarity computation. We performed analyses of different semantic relatedness measurements with real-world cross-language mappings. Our experimental procedure examined several aspects, including: (i) the role of a different pivot-language to translate the concepts in mappings; (ii) the effects of similarity calculation relying on domain-related background knowledge; and (iii) the impact of the length of the strings denoting the interrelated concepts in mappings. The findings indicated that a domain-related semantic network boosts the computed similarity scores and can be more effective for cross-language alignment than semantic similarity measures relying on a domain-independent corpus. The results also demonstrated the benefits of a pivot-language closely related to the domain when only a domain-neutral semantic network is available. Future work involves investigating the impact of the neighbour concepts of mappings on the accuracy of semantic relatedness measures. We also plan the design, formalization and evaluation of an original matching algorithm to align cross-language biomedical ontologies.


Chapter 3 Neighbourhood-based Cross-Language Ontology Matching

Abstract.¹ Cross-language ontology alignments play a key role in the semantic integration of data described in different languages. The task of automatically identifying ontology mappings in this context requires exploring similarity measures as well as ontology structural information. Such measures compute the degree of relatedness between two given terms from ontology entities. The structural information in the ontologies may provide valuable insights into the alignment of concepts. Although the literature has extensively studied these measures for monolingual ontology alignments, the use of similarity measures and structural information for the creation of cross-language ontology mappings still requires further research. In this article, we define a novel technique for automatic cross-language ontology matching based on the combination of a composed similarity approach with the analysis of neighbour concepts to improve the effectiveness of the alignment results. Our composed similarity considers lexical, semantic and structural aspects based on background knowledge to calculate the degree of similarity between the contents of ontology entities in different languages. Experimental results with MultiFarm indicate that our approach, including neighbour concepts, is effective for mapping identification.

3.1 Introduction

Ontologies are used in a multitude of applications in computer science as a specification mechanism or as the definition of a common vocabulary². Mappings establish correspondences between different ontology entities and are relevant for the integration of heterogeneous data sources. There is a growing number of ontologies described in different natural languages. The challenge of generating correspondences between different ontologies, created for diverse purposes, is aggravated when concepts are labeled in different natural languages, even within the same domain. Although automatic monolingual ontology matching has been extensively investigated [83], cross-language ontology matching still demands further investigation aiming to automatically identify correspondences between ontologies described in different languages [46].

¹ DESTRO, J. M.; SANTOS, G. O.; DOS REIS, J. C.; TORRES, R. S.; CARVALHO, A. M. B. R.; RICARTE, I. L. M. Neighbourhood-based cross-language ontology matching. Submitted to international journal. 2019.

² What is an ontology? http://www-ksl.stanford.edu/kst/what-is-an-ontology.html (As of April 2019).

In this context, accurate automatic methods are essential for ensuring the quality of the generated mappings. Current ontologies have grown considerably in size. As differences between the alphabets used hamper the use of simple string comparison techniques, similarity measures play a key role in obtaining well-defined ontology mappings because they allow calculating the level of lexical and semantic similarity between concepts [76]. Cross-language ontology matching approaches in the literature have not yet thoroughly investigated the influence of similarity calculation, nor have they analyzed the influence of neighbour concepts in the matching process.

In this article, we propose an original cross-language ontology alignment technique based on the analysis of neighbour concepts and relying on a composed similarity measure that combines syntactic and semantic similarity techniques. The syntactic similarity is a score computed from string analysis (over strings extracted from the labels of entities), whereas the semantic similarity is computed by taking into account background knowledge, such as synonyms and the context in which terms appear (e.g., the use of external dictionaries and vocabularies). Our investigation explores a Weighted Overlap measure [77] relying on the domain-neutral semantic network BabelNet [68] and computes a weighted mean of the semantic and syntactic similarities. The proposed technique also takes into account the similarity of the concepts immediately related to a given entity (the neighbours), in both the source and target ontologies. The method finds the highest similarity value among these concepts. In this investigation, we call this value the neighbourhood similarity. The neighbourhood similarity is used to improve the correctness of mappings: it is combined with the composed similarity whenever the initial value of the composed similarity falls in a doubtful range, that is, between a minimum threshold and a default threshold (both set as parameters before the processing begins).
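To make the decision flow described above concrete, the following is a minimal sketch and not the implementation evaluated in this chapter. Several parts are assumptions made only for illustration: `difflib.SequenceMatcher` stands in for the syntactic measure, the semantic measure is a pluggable function (a real run would use Weighted Overlap over BabelNet), the function names, weights and threshold values are hypothetical, and the averaging step used in the doubtful range is one possible way of combining the two scores, since the combination rule is not detailed at this point.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence

# Any function returning a semantic relatedness score in [0, 1] for two labels.
# ASSUMPTION: stands in for Weighted Overlap over BabelNet.
SemanticFn = Callable[[str, str], float]


def syntactic_similarity(a: str, b: str) -> float:
    """String-based score in [0, 1]; difflib is an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def composed_similarity(a: str, b: str, semantic: SemanticFn,
                        w_sem: float = 0.5, w_syn: float = 0.5) -> float:
    """Weighted mean of the semantic and syntactic similarities."""
    weighted = w_sem * semantic(a, b) + w_syn * syntactic_similarity(a, b)
    return weighted / (w_sem + w_syn)


def neighbourhood_similarity(src_neighbours: Sequence[str],
                             tgt_neighbours: Sequence[str],
                             semantic: SemanticFn,
                             w_sem: float = 0.5, w_syn: float = 0.5) -> float:
    """Highest composed similarity over all source/target neighbour pairs."""
    return max(
        (composed_similarity(s, t, semantic, w_sem, w_syn)
         for s in src_neighbours for t in tgt_neighbours),
        default=0.0,  # no neighbour pairs: nothing supports the candidate
    )


def accept_mapping(a: str, b: str,
                   src_neighbours: Sequence[str],
                   tgt_neighbours: Sequence[str],
                   semantic: SemanticFn,
                   default_threshold: float = 0.8,
                   min_threshold: float = 0.5,
                   w_sem: float = 0.5, w_syn: float = 0.5) -> bool:
    """Decide whether labels a and b should be mapped to each other."""
    score = composed_similarity(a, b, semantic, w_sem, w_syn)
    if score >= default_threshold:
        return True   # confident match, neighbours are not consulted
    if score < min_threshold:
        return False  # confident non-match
    # Doubtful range: consult the neighbours of both entities.
    neighbour_score = neighbourhood_similarity(
        src_neighbours, tgt_neighbours, semantic, w_sem, w_syn)
    # ASSUMPTION: a plain average; the actual combination rule may differ.
    return (score + neighbour_score) / 2 >= default_threshold


def toy_semantic(a: str, b: str) -> float:
    """Hypothetical lookup table replacing a real semantic network."""
    table = {("artigo", "paper"): 0.5, ("autor", "author"): 1.0}
    return table.get((a, b), 0.0)
```

With these definitions, a call such as `accept_mapping("artigo", "paper", ["autor"], ["author"], toy_semantic)` evaluates a Portuguese/English candidate pair together with one neighbour on each side. Keeping the semantic measure as a parameter lets the same flow be exercised with different background knowledge sources.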

We carried out a series of experiments to investigate the quality of the mappings generated by our technique. Our experiments explored conference-domain ontologies in 45 language pairs from the MultiFarm³ dataset [61]. MultiFarm provides curated mappings between multilingual ontologies. This dataset has been extensively used to assess cross-language ontology matching methods. The obtained results indicate that syntactic and semantic similarities may require different weights to obtain good accuracy. Our experiments suggest that the threshold, the language in which the ontologies are described and the translation tool play an important role in the quality of the generated alignments.

The remainder of this paper is organized as follows: Section 3.2 describes the related work; Section 3.3 formalizes the fundamental concepts of our proposal; Section 3.4 reports on our proposed technique; Section 3.5 describes the experimental results, whereas Section 3.6 discusses our findings; Section 3.7 provides the concluding remarks.