Ontological representation of tumor-node-metastasis classification and an ontology-driven classifier: a study on colorectal cancer


Fábio Humberto Pinto França

October 2015

Work carried out under the supervision of:

Martin Boeker

and

Paulo Jorge Freitas de Oliveira Novais

Master's Dissertation

Integrated Master's in Biomedical Engineering

Medical Informatics Branch


Name: Fábio Humberto Pinto França

Email address: fabiofranca92@gmail.com

Citizen Card: 14194724

Dissertation title: Ontological Representation of Tumor-Node-Metastasis Classification and an Ontology-Driven Classifier: A Study on Colorectal Cancer

Supervisors: Martin Boeker and Paulo Jorge Freitas de Oliveira Novais

Year of completion: 2015

Master's degree: Integrated Master's in Biomedical Engineering

Branch: Medical Informatics

FULL REPRODUCTION OF THIS DISSERTATION IS AUTHORIZED FOR RESEARCH PURPOSES ONLY, UPON WRITTEN DECLARATION BY THE INTERESTED PARTY, WHO COMMITS TO THIS CONDITION.

Universidade do Minho, / / Signature:


Dr. Peter Bronsert for opening the doors of the Pathology Center of the Freiburg University Hospital, where he provided material essential for my thesis. A special thank you to all the people who work at the IMBI institute, who opened their office doors to me any time I needed help and provided me with a pleasant and professional work environment. A very big thank you to all my brothers in arms in my Erasmus adventure. We shared many adventures and stories, we supported each other and finally said goodbye. However, I’ll never forget anyone, and I’ll cherish each moment we spent together in my memory and my heart. I’d like to leave a personal note to Adrian Fernandez, who was there any time I needed him, in both good and bad moments, and to Maria João Sousa, my Portuguese Erasmus mate, for all the help and support given during this year.

Portugal

First of all, I would like to thank my supervisor, Professor Paulo Novais, for following my project very closely even from thousands of kilometres away. I am grateful for all the help and time he made available to me, always with great sincerity, professionalism and good humour. I would also like to thank my family, especially my parents, whom I love very much and who work hard every day so that I could reach this point. They were always by my side and never held me back from anything. They believed in me and made me always give my best. I want you to see this work and my future as the greatest proof of my gratitude. To my grandparents, who always supported me and always wanted to keep up with my academic life: one for the extra help that is so often needed, and the other for wanting to see me dressed in black on the outside but also on the inside. And to my sisters: one for being an example that I will always follow and that makes me fight harder each time, and the other for always being my number one companion when I am at home. I would also like to thank all my friends from the degree who shared the same struggles as I did on the way to this day. Among them, I would like to give very special thanks to Manel Zamith, João Macedo and Filipe Fernandes, whom I consider my brothers, who have always been by my side and always will be! I would like to leave a word for those who shared a house with me, João Prates and Cláudia Rodrigues (it is as if you lived there), who together with Macedo were my family at the Universidade do Minho. I cannot forget my crowd from Espinho - João Martins, Diogo Remoaldo, João Vitorino, Ricardo Pacheco, Guilherme Mendes, Afonso Couto - who have been with me for many years and were my safe harbour whenever I went home. They were always magnificent and never forgot me. A big thank you to Jorge França, who besides being family is also a special friend. You opened my eyes to life and made me a better person! Finally, the most special thank you goes to Rita Roxo, for everything she went through during this last year and for always believing in me. None of this would have been possible without your unconditional support and affection.


based on a modular architecture. Each module consists of an ontology that represents the classification rules for a given tumor type. These ontologies can be imported into the central ontology, and all of them use the Foundational Model of Anatomy (FMA) to represent anatomical concepts and BioTopLite 2 as the domain ontology. The application developed for ontology-driven classification has the TNM ontology as its knowledge base. It was programmed in Java using the OWL API as the bridge between the application and the knowledge base.

In this study, two datasets with real data were evaluated. The first contained 382 records that were classified by regional lymph nodes. Comparing the automatic classification with the manual one yielded an accuracy of 55%. However, the application pointed out inconsistencies and errors made in the tumor documentation that caused this result. The second dataset consisted of 292 records produced and manually classified by a pathologist from text documents. The automatic classification showed optimal results for all types of classification.

This study showed that the developed application improves the consistency and efficiency of data in tumor documentation and provides accurate automatic classification during the tumor diagnosis process.


The most important staging system for cancer is the TNM Classification of Malignant Tumors (TNM). The staging procedure compiles several clinical and pathological parameters based on the Extent of Disease (EOD).

The objectives of this work are to present the Tumor-Nodes-Metastasis Ontology (TNM-O), a framework for the representation of the TNM classification of malignant tumors (TNM) system; to implement the TNM Colon and Rectum ontology, a modular ontology that represents the TNM classification for colorectal tumors based on this framework; to develop an ontologically driven classifier application with the TNM-O as its knowledge base; and to show the feasibility of this approach on real data.

TNM Ontology (TNM-O) and TNM Colon and Rectum Ontology (TNMCR-O) use the Foundational Model of Anatomy (FMA) for representing anatomical entities and BioTopLite2 (BTL2) as a domain top-level ontology. The classification rules of the TNM classification for colorectal tumors were represented as described in the literature. The automatic classifier for pathological data uses these ontologies as its knowledge base. It was developed in Java using the Web Ontology Language (OWL) application programming interface (API) as the bridge between the application level and the knowledge base.

In this study, two datasets with real data were evaluated. The first dataset contained 382 entries that were classified by regional lymph nodes. Comparing the automatic classification with the expert classification gave an accuracy of 55%. However, the classifier flagged inconsistencies and errors made during the manual tumor documentation that caused the misclassifications. The second dataset contained 292 records carefully classified by a pathologist. In this dataset, automatic classification was optimal for all types of assessment.

Therefore, this study showed that an ontology-driven automatic classifier enhances the consistency of tumor documentation and provides accurate instance classification during the pathological assessment of tumors.


2.2 Description Logics
2.3 Medical Scope
2.3.1 The TNM Classification
2.3.2 TNM Classification for Colon and Rectum Tumors
3 Methods
3.1 Ontology Development
3.1.1 Specification
3.1.2 Terminology
3.1.3 Classification
3.1.4 Implementation
3.2 Software Development
3.2.1 Requirements
3.2.2 Application Development
4 Results
4.1 TNM Ontology
4.1.1 TNM Structure
4.1.2 Representational Units
4.1.3 Tumor Aggregate
4.1.4 Quality and ValueRegion
4.1.5 Representation of Anatomical Structures
4.1.6 Representation of the Primary Tumor
4.1.7 Representation of Regional Lymph Nodes
4.1.8 Representation of Distant Metastasis
4.1.9 Staging
4.2 TNMO-Classifier
4.2.1 Automatic Classification
4.3 Evaluation
4.3.1 Classification of Metastatic Regional Lymph Nodes
4.3.2 Classification of all Assessments
5 Discussion
5.1 Limitations and Future Work
6 Conclusion
Bibliography
Appendix
A Comparison between ontology development methodologies and the IEEE 1074-1995 standard
B Published Papers
B.1 TNM-O an Ontology for the Tumor-Node-Metastasis Classification of Malignant Tumors: a Study on Colorectal Cancer
B.2 Feasibility of an Ontology Driven Tumor-Node-Metastasis Classifier Application: a Study on Colorectal Cancer


2.8 Fragment of the main BioTopLite relation hierarchy
2.9 Flowchart of the methodology for building ontologies from the Uschold and King’s methodology
2.10 Gruninger and Fox procedure for ontology design and evaluation
2.11 Methontology ontology development life cycle
2.12 Anatomical sites and subsites of colon and rectum
2.13 Identification and location of the regional lymph nodes
2.14 Representations of the primary tumor classification for colon and rectum tumor
3.1 Screenshot of the FMA Explorer when searching for the concept Submucosa
4.1 Main structure of the TNM-O
4.2 Hierarchies of classes for including the RepresentationalUnits of each modular ontology
4.3 Example of hierarchy and classes of all RepresentationalUnits imported by the TNM Colon and Rectum ontology
4.4 Quality and ValueRegions Classes of the TNMCR-O
4.5 TNM-O current hierarchy of anatomic related classes
4.6 Hierarchy of anatomical classes of TNM-O when TNM Colon and Rectum Ontology imported
4.7 Graph of the patho-anatomical structures represented by a T3/pT3 representational unit of the TNMCR-O
4.8 Graph of the patho-anatomical structures represented by a N2a/pN2a representational unit of the TNMCR-O
4.9 Graph of the patho-anatomical structures represented by a M1 representational unit of the TNMCR-O
4.10 Technical architecture of the classifier application
4.11 Graphical User Interface of the TNM-O Classifier
4.12 Classification process
4.13 Screenshot of the ontology editor Protege with the TNM-O loaded during the classification process
4.14 Example of classification from manual input data with respective diagram of the involved classes from TNM-O


4.5 Results obtained in percentage by automatic classification for the TNM version 6 of the colon and rectum tumors
4.6 Results obtained in percentage by automatic classification for the TNM version 7 of the colon and rectum tumors
A.1 Comparison between ontology development methodologies and the IEEE 1074-1995 standard


AJCC American Joint Committee on Cancer
API application programming interface
BFO Basic Formal Ontology
BTL2 BioTopLite2
CS Collaborative Stage Data Collection System
DL Description Logic
DOLCE Descriptive Ontology for Linguistic and Cognitive Engineering
EOD Extent of Disease
FMA Foundational Model of Anatomy
FME Foundational Model Explorer
GO Gene Ontology
GoodOD Good Ontology Design
GSS Government Statistical Service
GUI Graphical User Interface
IARC International Agency for Research on Cancer
ICD-10 International Classification of Diseases - 10th Revision
ICD-O International Classification of Diseases for Oncology
IMBI Institute of Medical Biometry and Medical Informatics
ISI Information Sciences Institute
MeSH Medical Subject Headings
ODE Ontology Design Environment
OKBC Open Knowledge Base Connectivity
OWL Web Ontology Language
RDF Resource Description Framework
W3C World Wide Web Consortium
WHO World Health Organization


Introduction

The latest estimates by the International Agency for Research on Cancer (IARC), the specialized cancer agency of the World Health Organization (WHO), indicate that in 2012 there were 14.1 million new cancer cases, 8.2 million cancer deaths and 32.6 million people living with cancer. In these estimates, colorectal cancer was the third most common type of tumor in men and the second in women [1]. Thus, research into new methods for the diagnosis and treatment of cancer is the main goal of the WHO cancer programs [2].

One of the most widely accepted staging systems for cancer is the TNM Classification of Malignant Tumors (TNM) [3], published by the Union for International Cancer Control (UICC). This system compiles various pathological and clinical parameters for three types of assessment: primary tumor (T), regional lymph nodes (N) and distant metastasis (M). It also provides a distinct and specialized classification for each tumor site. The primary tumor classification generally evaluates the infiltration and size of the carcinoma; the regional lymph nodes assessment concerns the number of metastatic lymph nodes in the regional area of the primary tumor; and the distant metastasis assessment concerns the presence or absence of distant metastasis.
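As an illustration of how the N assessment maps a node count to a category, the following Python sketch assumes the thresholds of the TNM 7th edition for colon and rectum as commonly published (N0, N1a, N1b, N2a, N2b). The thresholds, the function name and the omission of N1c and NX are assumptions of this example and are not taken from the text above; it is a plain rule table, not the ontology-driven classifier presented in this work.

```python
def classify_regional_nodes(metastatic_nodes: int) -> str:
    """Map a count of metastatic regional lymph nodes to an N category.

    Assumed thresholds (TNM 7th edition, colon and rectum):
    0 -> N0, 1 -> N1a, 2-3 -> N1b, 4-6 -> N2a, 7 or more -> N2b.
    N1c (tumour deposits) and NX (not assessable) are deliberately omitted.
    """
    if metastatic_nodes < 0:
        raise ValueError("node count cannot be negative")
    if metastatic_nodes == 0:
        return "N0"
    if metastatic_nodes == 1:
        return "N1a"
    if metastatic_nodes <= 3:
        return "N1b"
    if metastatic_nodes <= 6:
        return "N2a"
    return "N2b"
```

Hard-coding the thresholds like this is exactly what makes updates between TNM editions costly, which motivates keeping the rules in a separate knowledge base.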

The TNM classification has been in use for more than fifty years and undergoes a continuous process of updating and revising its documentation. This process has made it one of the most complete and precise tumor classifications available today, and applying it requires a high level of knowledge and expertise in the domain. However, it is very difficult for the system to keep up with the overwhelming changes and updates in this field [4, 5]. Despite the importance of the TNM classification, no formal logic-based representation has been developed.

Ontologies are information artefacts that formally represent knowledge from a certain domain in order to be machine processable. In the biomedical domain, they are used to describe the structure of their complex domains and to relate their data to shared representations of biomedical knowledge. They provide reference encyclopaedic knowledge and enable computer reasoning of biomedical data [6].


already developed by Rita Faria [10, 11]. Despite representing a different type of tumor, this work provided some useful definitions and concepts to the development of the ontology for colorectal classification.

Today, tumor registries collect data on the diagnosis and staging of cancer to generate reports for physicians and hospital cancer registries. Maintaining a consistent and up-to-date cancer registry positively influences the quality of prognosis and treatment protocols. A project conducted by the American Joint Committee on Cancer (AJCC), the Collaborative Stage Data Collection System (CS), consists of software equipped with algorithms capable of translating TNM staging information so that it can be used across cancer statistical databases such as the Surveillance, Epidemiology, and End Results (SEER) and the Government Statistical Service (GSS) [12]. Other examples of software related to the TNM system are an ontology-driven classifier that processes physicians' annotations on images to infer the TNM classification [13] and a semi-automatic tool that classifies tumor documentation in the ESTHER system, also based on the TNM classification system [14]. However, none of these studies presents a classification tool based on a formal representation of the TNM classification system. Additionally, no study was found that provided a feasibility test of these systems.

One advantage of pursuing an ontology-based approach is that ontology maintenance and updating can be done more quickly and consistently. All modifications made to TNM can be carried out centrally in the ontology, with only minor changes needed at the application level. Using Description Logic (DL) semantics in the ontology adds the advantage of detecting logical inconsistencies and coding problems that can arise from the system’s complexity.

Having a formal representation of the TNM system provides uniformity of knowledge that enhances interoperability and robustness between distinct systems. Maintaining independence between knowledge base and application allows the developer to maintain and update the knowledge base without making substantial changes at the application level.

With this work we propose to close the gap of a missing formal representation by presenting the TNM Ontology (TNM-O) and the TNM Colon and Rectum Ontology (TNMCR-O). The first aims to represent the main structure of the TNM classification in order to provide support to the TNMCR-O, which represents the concepts and classification rules for colorectal tumors. Additionally, we present an ontology-driven automatic classifier that uses these ontologies as its knowledge base to provide instance classification and consistency evaluation on the pathological data registry.

Therefore, the objectives of this project are to present the TNM-O, an ontological framework for the TNM classification system; to implement the TNMCR-O, a modular ontology that represents the TNM classification for colon and rectum tumors based on this framework; to develop an ontology-driven classifier application with the TNM-O as its knowledge base; and to show the feasibility of this approach on real data.


Many definitions of ontologies exist in the literature. In 1991 a definition by R. Neches et al. stated that "An ontology defines the basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary." [16]. A few years later, the definition was adapted by Gruber et al. to "An ontology is a formal, explicit specification of a shared conceptualization" [17]. In this definition, an ontology is a conceptualization, meaning an abstract model that identifies the relevant concepts and their relations within a certain domain. It is also explicit, since all the concepts and their constraints are explicitly defined in order to be machine-processable. Finally, it is shared: as long as an ontology captures consensual knowledge, it can be shared within the community [18, 19].

Many definitions of ontologies complement each other. The one presented above is probably the one that comes closest to a consensus in the ontological community.

Ontologies are used to represent shared knowledge of a certain domain so that it can be handled by a machine. The interest in building ontologies has grown as researchers have struggled to keep track of all the scientific publications in the medical domain. To organize this knowledge, documentation specialists have developed large terminologies, such as the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), the International Classification of Diseases - 10th Revision (ICD-10) and Medical Subject Headings (MeSH). However, these terminologies cannot cope with problems that are inherent to human language, where, for example, one term can have multiple meanings. A solution is the use of ontologies, in which the terms of a domain are used to build logical relations between them.

Another motivation came from the inevitable incompatibilities of current relational databases, in which, for example, labels can have the same name but different meanings, and vice versa. Despite their similar functions, some key characteristics distinguish an ontology from a relational database. First, ontologies are syntactically and semantically far richer than common databases. Second, knowledge is described in a formal language instead of tabular information tuples. Finally, an ontology provides a consensual theory for a domain and not only the structure of a data container [15].

An ontology can be seen as a form of semantic network. One example is displayed in Figure 2.1, which presents knowledge as classes of individuals connected by is-a relations. These relations create a hierarchy based on super/subclass relationships. For example, Student is-a Person means that an entity that belongs to the class Student is also an entity of Person; therefore, Student is a subclass of Person. Additionally, the same figure shows relations between different types of concepts, such as Student studiesAt University. Student and University are two distinct types of entities; however, they can be used together to represent the class that contains all the students that study at a university.
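The is-a hierarchy and the studiesAt relation described above can be sketched as a tiny in-memory network. This is only an illustration: the individuals (alice, bob, uminho) and all function names are hypothetical, and Figure 2.1 may contain further classes that are not modelled here.

```python
# Class hierarchy: child class -> parent class (is-a links).
IS_A = {"Student": "Person"}

# Hypothetical individuals and the class each one is asserted to belong to.
TYPES = {"alice": "Student", "bob": "Person", "uminho": "University"}

# Hypothetical facts as (subject, relation, object) triples.
FACTS = [("alice", "studiesAt", "uminho")]


def is_subclass_of(cls, ancestor):
    """Return True if cls equals ancestor or reaches it by following is-a links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = IS_A.get(cls)
    return False


def instances_of(cls):
    """All individuals whose asserted class is cls or one of its subclasses."""
    return sorted(i for i, t in TYPES.items() if is_subclass_of(t, cls))


def students_at_a_university():
    """Individuals that are Students and study at some University."""
    return sorted(
        s
        for s, relation, o in FACTS
        if relation == "studiesAt"
        and is_subclass_of(TYPES[s], "Student")
        and is_subclass_of(TYPES[o], "University")
    )
```

Because Student is-a Person, `instances_of("Person")` returns both individuals typed as Person and those typed as Student, which is the inheritance behaviour the hierarchy encodes.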

Even so, not every ontology can be represented as in Figure 2.1. For more detailed and complex axioms and restrictions there is no appropriate representation other than one of the available ontology languages, such as the Web Ontology Language (OWL) [20] (see Section 2.1.1). Besides providing the syntax for ontology representation, OWL is also used for the serialization and transport of ontologies [18].

Biomedical ontologies generally present taxonomic representations of concepts with broad and sometimes non-consensual theoretical support, so the need for ontology alignment and integration is widely accepted by the community. This can be done with mappings between terminology-based ontologies. However, most of these ontologies are of poor quality, and the mappings between them do not constitute a significant improvement. One of the most accepted methods nowadays is a vertical integration of ontologies with different scopes. Ontologies can thus be distinguished into the following types [15, 21]:

• Top-level ontology
• Upper-domain ontology
• Domain ontology


Figure 2.1: An example of a semantic network [18].

Ontologies have the potential to deeply change how intelligent systems are built. Building good knowledge bases and sharing them publicly will make libraries of ontologies available to every developer. This should decrease the time spent developing new software since the knowledge is already represented in a commonly accepted way [22]. Besides the biomedical domain, ontologies are used in domains such as agriculture, aviation, chemistry, civil engineering, business and others.

2.1.1 Web Ontology Language - OWL

OWL [20] is one of the most recent ontology languages, developed by the World Wide Web Consortium (W3C) in the Web Ontology Working Group. Its first goal was to represent information about categories of objects and their logical connections, which we can now call an ontology. Furthermore, OWL can also represent data about the objects themselves [20, 23].

However, OWL was not developed from scratch. At its core is DL (Section 2.2), which provides the formalization of the semantics, language constructors, data types and data values. Moreover, OWL can be viewed as an expressive DL, where an ontology in OWL is equivalent to a DL knowledge base [15, 23].

Framework (RDF) [24]. This is a general-purpose language for representing information in the semantic web [25]. It provides metadata for describing resources on the web and works as a framework capable of representing data. Compatibility between web languages is the main reason for the strong influence of RDF on OWL. A basic way to achieve this was to give OWL the same syntax as RDF. However, the W3C found that in some cases ontologies would require much more expressiveness than RDF provides [26, 27].

So, extending RDF brings a trade-off between expressiveness and reasoning efficiency [23]. This led to the development of three variants of OWL that can meet the needs of each developer [26, 27]:

• OWL Full - The entire OWL language is called OWL Full. Its main advantage is full compatibility with RDF, both semantically and syntactically, so any valid RDF document is a valid OWL Full document and vice versa. However, the language is so expressive that reasoning support is drastically limited;

• OWL Description Logic (OWL DL) - This sub-language of OWL Full is used when better computational efficiency is wanted. This is achieved by reducing the number of OWL constructors to ensure that the language corresponds to a well-defined description logic. However, this decreases the compatibility with RDF, since an RDF document will need some extension to be a valid OWL DL document. On the other hand, an OWL DL document is still a valid RDF document;

• OWL Lite - This is a subset of OWL DL with further restrictions applied. The main goal of this language is to simplify usage and implementation in new frameworks.

Syntactically, as said before, OWL is an extension of the RDF syntax. Figure 2.2 gives a graphical example of how the concepts are arranged. Without going into much detail, and keeping in mind that an OWL document is generally an RDF document, the constructors of the OWL syntax are [20, 27]:

• Header - despite being the root of an OWL ontology, this is an RDF element, rdf:RDF. This is the element where all the namespaces used are specified. Any OWL ontology should then start with a set of assertions, grouped under the element owl:Ontology, for purposes such as version control, comments and the inclusion of other ontologies;

• Class elements - classes in ontologies are defined with the owl:Class element. In order to represent the relation of subsumption between class its


• Property restrictions - In OWL it is also possible to specify that all the instances of a certain class satisfy one or more conditions. This is done by creating a subclass, which can be anonymous, to which all these instances belong. This new subclass contains the restriction that specifies which instances belong to it, using the element owl:Restriction. For this restriction it is necessary to identify two things: the property, with owl:onProperty, and the type of restriction. The types of restrictions are:

– owl:allValuesFrom that is used to restrict the range of the property to the instances of a specific class;

– owl:hasValue that identifies the exact value that the property must have to satisfy the restriction;

– owl:someValuesFrom, which requires that at least one value of the property is an instance of a specific class;

– owl:maxCardinality and owl:minCardinality, which specify, respectively, the maximum and minimum number of values that the property may have to satisfy the condition;
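The difference between these restriction types can be made concrete with a small closed-world check over explicit data. This is a sketch under stated assumptions: the individuals, the hasLocation property and the function names are hypothetical, and an OWL reasoner works under open-world semantics, so it would not conclude that a restriction fails merely because a value is absent from the data.

```python
from typing import Dict, Set

# Hypothetical property assertions: individual -> property -> set of values.
VALUES: Dict[str, Dict[str, Set[str]]] = {
    "tumor1": {"hasLocation": {"colon1"}},
    "tumor2": {"hasLocation": {"colon1", "liver1"}},
    "tumor3": {},
}

# Hypothetical class membership: class -> set of individuals.
MEMBERS: Dict[str, Set[str]] = {"Colon": {"colon1"}, "Liver": {"liver1"}}


def values(individual: str, prop: str) -> Set[str]:
    """Values asserted for a property of an individual (empty set if none)."""
    return VALUES.get(individual, {}).get(prop, set())


def some_values_from(individual: str, prop: str, cls: str) -> bool:
    """At least one value of the property is an instance of cls."""
    return any(v in MEMBERS[cls] for v in values(individual, prop))


def all_values_from(individual: str, prop: str, cls: str) -> bool:
    """Every value of the property is an instance of cls (true when no values exist)."""
    return all(v in MEMBERS[cls] for v in values(individual, prop))


def has_value(individual: str, prop: str, value: str) -> bool:
    """The property has exactly this individual among its values."""
    return value in values(individual, prop)


def cardinality_between(individual: str, prop: str, min_card: int = 0, max_card=None) -> bool:
    """The number of property values lies within [min_card, max_card]."""
    count = len(values(individual, prop))
    return count >= min_card and (max_card is None or count <= max_card)
```

Note that `all_values_from` holds for tumor3, which has no location at all, while `some_values_from` does not; this is the usual reason the two restrictions are combined when a class should require a value and constrain its type.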

Besides these, the OWL syntax provides many more elements that reinforce its expressiveness and justify its wide use in ontology development. Extensions to this syntax are in the making, with the goal of providing further logical features. This syntax was the one used for the development of this project. For that reason, further discussion of existing ontologies will use the OWL syntax, with italic font for classes and bold for relations.
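To make the constructors concrete, the sketch below embeds a minimal RDF/XML document and reads it with Python's standard XML parser. The toy ontology, its example.org IRIs and the helper names are hypothetical; a real application would rely on a dedicated OWL library rather than raw XML parsing, and only the someValuesFrom restriction type is handled here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical toy ontology: header, three classes, one property, one restriction.
DOC = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="http://example.org/toy">
    <rdfs:comment>Hypothetical toy ontology</rdfs:comment>
  </owl:Ontology>
  <owl:Class rdf:about="http://example.org/toy#Person"/>
  <owl:Class rdf:about="http://example.org/toy#University"/>
  <owl:ObjectProperty rdf:about="http://example.org/toy#studiesAt"/>
  <owl:Class rdf:about="http://example.org/toy#Student">
    <rdfs:subClassOf rdf:resource="http://example.org/toy#Person"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="http://example.org/toy#studiesAt"/>
        <owl:someValuesFrom rdf:resource="http://example.org/toy#University"/>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
"""


def local(iri):
    """Return the fragment after '#' in an IRI."""
    return iri.split("#")[-1]


root = ET.fromstring(DOC)  # the root element is the rdf:RDF header


def named_classes(root):
    """Local names of all top-level owl:Class elements."""
    return sorted(local(c.get(f"{{{RDF}}}about")) for c in root.findall(f"{{{OWL}}}Class"))


def some_values_restrictions(root):
    """(class, property, filler) triples for owl:someValuesFrom restrictions."""
    found = []
    for cls in root.findall(f"{{{OWL}}}Class"):
        for restriction in cls.iter(f"{{{OWL}}}Restriction"):
            prop = restriction.find(f"{{{OWL}}}onProperty").get(f"{{{RDF}}}resource")
            filler = restriction.find(f"{{{OWL}}}someValuesFrom").get(f"{{{RDF}}}resource")
            found.append((local(cls.get(f"{{{RDF}}}about")), local(prop), local(filler)))
    return found
```

Here the anonymous owl:Restriction is nested inside rdfs:subClassOf, so Student is declared a subclass both of Person and of the class of things that study at some University.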


Figure 2.2: Subclass relationships between OWL and RDF [27]

2.1.2 Upper Level Ontologies

The biomedical domain is highly complex, and some overlap between ontologies occurs. Therefore, it is fundamental to incorporate multiple ontologies consistently. One approach is to build well-designed and well-documented ontologies as high-level structures with general concepts and relations on which domain ontologies can be built.

These ontologies are the Upper Level Ontologies (ULO). They aim to provide reusable and reliable definitions of concepts and their relations for independent domains, facilitating the integration and development of new domain ontologies. They are differentiated by the entities they include, their theory of space and time, and the relation of individuals to these theories. They contain rich definitions and axioms that are applicable across multiple domains.

Currently there is no agreement on what makes a good top-level ontology. However, some candidates have already been published, such as the Basic Formal Ontology (BFO) and the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) [28–31].

Basic Formal Ontology - BFO

BFO is a ULO still under development by the Ontology Research Group (ORG), led by Barry Smith at the Department of Philosophy of the University of Buffalo. Formal representations in the biomedical domain focus on either static or dynamic entities of biological reality, while BFO tries to combine these two perspectives in order to represent both in a consistent way. Thus, this ontology starts by providing formal distinctions between:


Figure 2.3: Example of the instanceOf relation between a Universal and a Particular

• Continuant and Occurrent
• Dependent and Independent
• Formal and Material

Universal and Particular

Universals are the real invariants in the domain. A Particular, on the other hand, is an entity or individual that is the instantiation of a certain Universal (Figure 2.3). Another important relation is the subsumption between two Universals. For example, Rose subClassOf Plant indicates that any instance of Rose is also an instance of Plant.

Continuant and Occurrent

Continuants are entities that exist through time. They persist with the same identity even while undergoing changes. A Continuant is bound to the space it occupies but not to time: it can be segmented in terms of space but endures through time. An example of a Continuant is the human body, which can be divided in terms of space while keeping its identity through time.

On the other hand, there are Occurrents (Events or Processes), which, instead of existing in full at a single moment in time, unfold in phases. Thus, in contrast to Continuants, they are bound with respect to time, that is, an Occurrent is segmented in terms of time. An example is the process of embryological development, which proceeds in a succession of stages.

Dependent and Independent

This is the distinction between entities that can exist without the support of others and those that cannot. For example, a Quality is dependent on the Thing that bears it.

Both the Continuant/Occurrent and the Dependent/Independent distinctions apply to Universals as well as Particulars. An example is the functioning of my kidney at this moment, which is a Particular Occurrent that depends on my kidney and its function (both Particular Continuants). The same dependence holds between the corresponding Universals.

Formal and Material

Biomedical terms mainly refer to material objects such as organs, cells and organisms. However, ontologies also have to deal with the many formal relations through which these entities are related. Material entities are confined to their domain, while Formal relations are used across multiple domains. Examples of formal relations are dependence, instantiation and subsumption.

SNAP and SPAN Ontologies

BFO is divided into two types of ontologies: SNAP ontologies for the representation of Continuants and SPAN ontologies for Occurrents.

The main classes in SNAP ontologies (Figure 2.4) are:

• Independent Continuant with the subclasses Object, Object Aggregate, Site, Boundary and Part of Object;

• Dependent Continuant with the subclasses Quality and Realizable. The latter has the subclasses Function, Role and Disposition;

• Spatial Region with the subclasses Volume, Surface, Line and Point.

Figure 2.4: Main classes in BFO SNAP ontologies [15]

The main classes in the SPAN ontologies (Figure 2.5) are:

• Processual Entity with the subclasses Process, Process Aggregate, Process Part, Processual Context and Boundary of Process;

• Spatiotemporal Region with the subclasses Scattered Spatiotemporal Region and Connected Spatiotemporal Region. The latter has the subclasses Spatiotemporal Interval and Spatiotemporal Instant;

• Temporal Region with the subclasses Scattered Temporal Region and Connected Temporal Region. The latter has the subclasses Temporal Interval and Temporal Instant.

BFO was implemented in OWL and is freely available to the community. It contains the top class Entity, seventeen SPAN classes and eighteen SNAP classes. Its application is mainly biomedical and is applied e.g. for ontology development in the domain of trials on cancer [15, 32–34].

Descriptive Ontology for Linguistic and Cognitive Engineering - DOLCE

DOLCE is the first module of the Library of Foundational Ontologies being developed as part of the WonderWeb project headed by Nicola Guarino at the Laboratory for Applied Ontology in Trento. Unlike other ontologies that follow a minimal taxonomic structure satisfying the needs of a specific domain, DOLCE aims to establish consensus in the multi-area community where artificial intelligence meets humans.

In contrast to BFO, which represents the world as it is, DOLCE leans towards the cognitive point of view, introducing ontological categories as cognitive artefacts that depend on human perception, cultural background and social conventions.

Even with different perspectives on the world, BFO and DOLCE share many similarities (compare Figures 2.4 and 2.5 with Figure 2.6). For example, an Endurant in DOLCE corresponds to a Continuant in a SNAP ontology, and an Occurrence in DOLCE to an Occurrent in BFO. In addition, there are categories that share almost the same name, such as Temporal Region and Spatial Region.

Figure 2.5: Main classes in BFO SPAN ontologies [15]

Despite these similarities, there are some essential differences between the two ontologies worth noting. BFO has no classes for abstract entities such as cognitive and social objects. It also has no quality regions or values; instead, these are subclasses of the corresponding quality. Other differences are:

• DOLCE Processes can have qualities, while BFO Processes cannot;

• in DOLCE both SpatialRegion and TemporalRegion are Abstract entities;

• BFO does not contain subclasses for processes.

DOLCE was implemented in First Order Logic with OWL. It contains around 100 terms and the same number of axioms. Many projects use the DOLCE ontology, including the LOIS Project, an international research project on information retrieval from legal databases; SmartWeb, another research project on artificial intelligence technologies and their application to web based systems; and AsIsKnown, a semantic-based knowledge system for the home textile industry. DOLCE Lite is the OWL DL representation of DOLCE [15, 32, 35, 36].

2.1.3 Upper Domain Ontologies

Developing ontologies using the same ULO, that is, reusing the same classes, relations and even high-level restrictions, does not guarantee interoperability. Therefore, an additional common terminological framework is necessary to obtain a smooth, seamless transition between the most generic classes, such as Continuant and Occurrent, and more granular ones.

This intermediary is an Upper Domain Ontology (UDO), which defines the types and relations essential to represent a specific domain. Examples of UDOs are BioTop [29], which is used as the UDO for the TNM-O, and GENIA [37], which originally motivated the development of BioTop [21, 29, 38].

GENIA

The GENIA ontology was designed to provide a semantic annotation of the GENIA corpus. The latter is a collection of articles extracted from the MEDLINE database. Its purpose is to provide high quality material for Natural Language Processing and to serve as a high performance standard for the evaluation of text mining systems.

The ontology is now in its second version and is divided into two ontologies:

• Term Ontology - this ontology is designed to support the GENIA corpus term annotation. It represents and classifies the most significant biological terms found in the literature. It defines biological, anatomic and organism entities, most of which are mapped to the MeSH repository. This ontology is subdivided into three sub-categories:

– GENIA Chemicals - intended to define any chemical substance;

– GENIA Anatomy - corresponds to the MeSH Anatomy category;

– GENIA Organisms - corresponds to the MeSH Organism category;

• Event Ontology - this ontology is designed to provide a semantic platform for the GENIA corpus event annotation. Its main purpose is to match Natural Language expressions with biological processes and molecular functions. It was designed to be interconnected with the Gene Ontology (GO) to improve its utility.

GENIA is freely available to the public and was implemented with XML and the DAML+OIL ontology language [39, 40].

BioTop

BioTop is an UDO developed with the aim to aid engineers with an ontological framework for the life sciences. It provides a layer for connecting and integrating


its subrelations (in particular hasLocus, locusOf and physicallyConnectedTo) [15] ;

• Immaterial Object - is a subclass of Continuant with n-spatial dimensions like points, lines or planes. Instances of Immaterial Object are related to other physical entities regarding their location, connected by the same relations as the instances of MaterialObject. Subclasses of ImmaterialObject are Wave and Cavities [15];

• Information Object - represents information. An instance of InformationObject is dependent on a physical carrier that is bearerOf or inheresIn, but independent of a carrier with regard to its encoded content. For example, a treatment plan exists independently of the planned procedure but the planned procedure is dependent on the plan for its realization [15];

• Disposition - a disposition is a realizable entity that inheres in something and can bring itself to existence in a process. It depends on the physical make-up of the agent that participates. However, the existence of a disposition does not imply that its manifestation exists. Humans have the disposition for reproduction even if they never reproduce [15];

• Role - In opposition to Disposition, a Role is brought into existence by its participation in a certain process. In this case, a human can have the role of a customer and salesman depending on his participation in a trading procedure;

• Process - Is an Occurrent that has temporal parts which are not always simultaneously present. It can have Material Objects and Immaterial Objects as participants;


• Quality - represents a feature of some other entity and cannot exist independently of it;

• ValueRegion - is a temporal, abstract or spatial region in which qualities are located, it corresponds to the values qualities can have;

• Condition - is the result of the union between Material Object, Process and Disposition with the aim to represent the ambiguous nature of a condition in the medical domain. Some terms can have different meanings such as tumor that can be a pathological process and also an abnormal growth of malignant tissue. This class provides a common class where these terms can be added without having to resolve this ambiguity [15].

BioTop was aligned with the BFO upper level ontology and implemented in OWL-DL. Today it is composed of 175 classes interconnected with 171 axioms. It has been developed at the IMBI at the University Medical Center Freiburg, Germany, the Department of Computer Linguistics at the University of Jena, Germany, and the Institute of Medical Informatics, Statistics and Documentation at the Medical University of Graz. It is still under development and maintenance at the IMBI [15, 29, 41].

BioTopLite

BioTopLite is a smaller, simpler and computationally more efficient version of the BioTop ontology. Both share the same objective: to provide an upper domain ontological framework for ontology developers in the biomedical sciences. It provides a core of 53 classes with 240 logical axioms using a set of 37 ontological relations (Figure 2.8).

BioTopLite2 (BTL2) is the current version and like its predecessor, it was implemented in OWL-DL. The main changes comparing to its previous version are:

• Additional Classes - The use of biomedical terminologies motivated the creation of a class Life which represents the process of an organism during its lifetime. In medical diagnoses, time references are made in segments of the Life of the living organism;

• Simplified Relation Hierarchy - relations were previously distinguished according to whether they applied to processes or to objects, which turned out to complicate the use of the ontology. After this distinction was abolished, the number of relations was reduced to 37.


in time. This can lead to ontological inconsistencies due to the impossibility of expressing time as an occurrent. To overcome this problem, the class Entity at some time was introduced. Using it with the relations is referred to a time/at some time turned out to be a possible solution.

BioTopLite2 was used as an upper level ontology in the project Good Ontology Design (GoodOD), which provided an extensive guideline for good practices in ontology design in the biomedical domain. It was also used as upper level ontology in the SemanticHealthNet project, which ontologically integrates diverse semantic resources in order to increase interoperability between electronic health records and data [42].

2.1.4 Methodologies for Building Ontologies

Each development team follows its own criteria for the development of an ontology. However, the absence of shared methods and guidelines decreases the shareability of ontologies.

The common practice of switching directly from knowledge acquisition to implementation poses some problems: commitments and design criteria are implicit; domain experts and end users have more difficulty understanding the formal ontology; direct coding of the acquired knowledge is too abrupt a step; and ontology developers may have more difficulty extending or reusing such ontologies.

The ontology development process identifies the tasks and activities that the developer should carry out when building an ontology. These activities are presented in the IEEE 1074-1995 standard [43], which describes how the software development process should be structured. Since ontologies are software artefacts, they should also be developed according to the same standard, slightly adapted to the ontology environment. Applied to the ontology development process, this standard comprises the following activities:


• Software Life Cycle - the life cycle of a piece of software should specify in which order the activities and tasks defined below should be performed. A methodology should specify at least one life cycle;

• Project Management - all the processes within this activity, as recommended in the standard, should also be applied to ontology development. They are activities related to project initiation, monitoring and ontology quality management;

• Development - concerns the production, installation, operation, maintenance and retirement of the ontology. These processes are divided into three stages:

– Pre-Development - involves the study of the environment in which the ontology will be used, the possibilities of integration in other systems and a feasibility study;

– Development - this includes the requirements, design and implementation processes;

– Post-Development - is related to the installation, operation, support and maintenance of an ontology.

• Integral Processes - these can include the training of the personnel responsible for the usage and maintenance of the ontology.

Depending on the size or purpose of the ontology, some steps can be skipped. On the other hand, if the correctness and completeness of an ontology must be assured, these activities should be performed during the whole process of ontology development. The next sections present some methods that are already applied in the ontology development process. Appendix A shows the similarities and small deviations between these methodologies and the IEEE standard [44, 45].

Cyc KB Project method

Since the beginning, the main goal of the Cyc project [46] was to build a large knowledge base containing a vast formal knowledge background suitable for a variety of domains. Over the last twenty years it has been building a knowledge base capable of representing a vast selection of common-sense knowledge in order to support unforeseen future knowledge representation and reasoning tasks.


and others for accessing, utilizing and extending the knowledge base [47, 48].

Uschold and King’s method

Uschold and King’s method [44, 48–50] consists of four activities:

• Identifying the purpose and the level of formality - this activity is focused on clarifying why the ontology is wanted and what it will be used for. This stage is important to determine whether the ontology should be built at all: if the developer cannot identify its purpose, the development should not proceed. After the purpose is clarified, the level of formality is decided. This level increases with the degree of automation of the tasks that the ontology will support. For example, if it is intended to support the reuse and sharing of knowledge bases, then a more formal representation is needed.

• Building the ontology - this step concerns the development of the ontology. For this, Uschold and King’s method offers four different approaches (Figure 2.9):

1. The first approach skips all the stages above and starts the development by defining terms and axioms in an ontology editor. This is the best approach when only a prototype is intended.

2. The second approach is better suited to simple and small ontologies. It requires a proper identification of purpose and scope.

3. The third approach starts by producing a prototypical ontology, mainly structured in natural language, with the terms and definitions of the domain. This process is mainly driven by hypothetical scenarios and competency questions. If this approach is taken, the informal document should be revised and evaluated before the formal ontology is developed;

4. The last approach starts by identifying the formal terms within the informal set of terms and using these to convert the informal competency questions into formal ones. The axioms and definitions that comprise the ontology are then specified.

• Evaluation and Revision - the evaluation and revision of an ontology can follow more general or more specific criteria:

– the general criteria for evaluation are the clarity, consistency and reusability of an ontology. However, this evaluation is limited since there is no established way to carry it out, although automated support for evaluating ontologies against these criteria is available.

– the specific criteria involve techniques such as manually checking the ontology against the identified purpose. These criteria are more appropriate for evaluating informal ontologies.

Gruninger and Fox

The Gruninger and Fox method [51] is mainly targeted at the development of ontologies in the enterprise domain. It takes the problems found in particular enterprises and uses them to define the motivation for an ontology. This motivation often takes the form of problems that could not be addressed by existing ontologies. Once the problem is known, possible solutions come to mind intuitively. These solutions provide a first glimpse of an informal semantics for the terminology included in the ontology.

Once the motivation and possible solutions are defined, requirements come next. These requirements are transformed into competency questions that the ontology must answer. They are expressed in natural language and are used to determine the scope of the ontology, that is, its competency. This also provides an initial evaluation that determines whether to develop a new ontology or reuse existing ones.

The next step is to define the terminology. It consists of concepts and definitions, represented as axioms, that provide the necessary depth to restate the informal competency questions. When a new ontology is designed, for every competency question there must be terms, relations and definitions in the ontology that make it possible to answer the question intuitively.

After the terminology is defined, the informal competency questions should be formally represented using the axioms of the ontology. These formal questions work as constraints on which axioms will be included. All the terms stated in the formal competency questions should also be added to the terminology.


Figure 2.9: Flowchart of the methodology for building ontologies from the Uschold and King’s methodology [50]


Figure 2.10: Gruninger and Fox procedure for ontology design and evaluation [51]

All the components are formally expressed in first order logic, inheriting its intrinsic robustness. This model is also used as a guide to convert informal scenarios into computable models [44, 48, 51].

KACTUS

The main objective of the KACTUS [52] project is to investigate the feasibility of reusing knowledge bases in complex technical problems and the role of ontologies in supporting it. This approach is conditioned by the applications being developed (bottom-up strategy): the more applications are built, the more general the ontology becomes. It starts with building a knowledge base for a specific domain; further knowledge bases are then developed and integrated with the existing ones. Therefore, when a new application is developed, the following steps are needed:

• Specification of the application - the first insight into what the ontology must represent;

• Preliminary design based on relevant top-level ontological categories - this process involves looking at previously developed ontologies that are possible candidates to be extended to the new application;

• Ontology refinement and structuring - this is done to ensure that the modules are not strongly dependent on each other and are as coherent as possible.

In summary, this new ontology can be built by reusing others and possibly integrated into ontologies of future applications. Applying this method along the


1. Identification of a set of seed terms that are relevant to the domain;

2. Then, this seed is linked by hand to a broader ontology;

3. All the concepts in the path between the seed terms and the upper ontology are included;

4. The terms that are relevant for the domain that are not yet included in the ontology are then added manually (this step is repeated until all terms necessary are represented);

5. Finally, for the nodes that have a large number of paths between them, the entire sub-tree under the node is added. This step is mostly done by hand since it requires a deep knowledge of the domain.
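The steps above can be sketched in Python as a pruning of a broad ontology around a set of seed terms. This is a minimal illustration under stated assumptions: the toy hierarchy, the function names and the threshold are hypothetical, and the choice of which sub-trees to expand is passed in explicitly because the text describes that step as mostly manual.

```python
from collections import Counter

# Hypothetical broad ontology, stored as child -> parent links (a tree).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "aircraft": "vehicle",
    "fighter": "aircraft",
    "bomber": "aircraft",
    "tanker": "aircraft",
    "ship": "vehicle",
    "weather": "entity",
}


def path_to_root(term):
    """Steps 2-3: every concept between a seed term and the top of the ontology."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def subtree(node):
    """The node together with all of its descendants."""
    nodes = [node]
    for child, parent in PARENT.items():
        if parent == node:
            nodes.extend(subtree(child))
    return nodes


def expansion_candidates(seeds, threshold):
    """Step 5 helper: nodes crossed by at least `threshold` seed paths.
    These are only suggestions for a domain expert to review."""
    passes = Counter(node for seed in seeds for node in path_to_root(seed))
    return {node for node, count in passes.items() if count >= threshold}


def build_domain_ontology(seeds, expand=()):
    """Keep the seed paths, then add the full sub-tree of each hand-picked node."""
    kept = set()
    for seed in seeds:
        kept.update(path_to_root(seed))
    for node in expand:
        kept.update(subtree(node))
    return kept


seeds = ["fighter", "bomber"]
print(sorted(expansion_candidates(seeds, threshold=2)))
print(sorted(build_domain_ontology(seeds, expand=["aircraft"])))
```

Step 4, the manual addition of missing terms, corresponds here to calling the function again with a longer seed list. High-level nodes are always crossed by many paths, which is one reason the expansion in step 5 cannot simply be automated with a threshold.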

Using this method, knowledge-based applications for air campaign have been developed jointly with the ISI, ARPA Rome Planning and DARPA Joint Forces Air Component Commander. These include the Strategy Development Assistant, a tool that supports intelligent guided plan development. Using the same base ontology to develop ontologies in particular domains provides a high level of shareability [45, 48].

METHONTOLOGY

METHONTOLOGY [54] is a methodology developed in the Artificial Intelligence Lab of the Technical University of Madrid (UPM) for building ontologies either from scratch or by reusing other ontologies. This method enables the construction of ontologies at the knowledge level and includes the identification of the ontology development process, a life cycle for evolving ontologies (Figure 2.11) and the techniques to carry out the whole process.


Figure 2.11: Methontology ontology development life cycle [54]

The development process includes a set of tasks that should be carried out while building the ontology. These tasks are scheduled in the life cycle of the ontology and are:

• Specification - the goal of this task is to develop a prototypical document with the ontology’s primary goal, purpose, granularity level and scope;

• Conceptualization - after most of the knowledge acquisition is done, the ontology developer must organize all this unstructured data;

• Knowledge Acquisition - the level of knowledge acquisition decreases as familiarity with the domain increases and as the ontology development progresses. The acquisition follows three stages:

1. Meetings with experts to obtain an overview of the domain;

2. Studying the documentation about the domain;

3. After a good insight into the domain is gained, knowledge is acquired by moving from general knowledge to more particular knowledge.

• Integration - during the development, some terms may already be included in other ontologies. The target ontologies must be checked to confirm that they have been validated and verified. Since there is no automatic tool for this, the guidelines given by Asunción Gómez-Pérez [55] were followed.

• Implementation - Tools like Ontology Design Environment (ODE) [56] and the WebODE [57] provide support to the METHONTOLOGY.


Informatics group in the last two decades. It is an environment for knowledge-based systems development. Nowadays Protégé is the leading ontology editor, used by a world-wide community of about 50 000 users, who themselves are contributing to its evolution.

It was originally developed for representing frame-based ontologies within the Open Knowledge Base Connectivity (OKBC) protocol, and its original goal was to minimize the role of the engineer in the ontology design process and consequently reduce the knowledge acquisition bottleneck. Currently, in collaboration with the University of Manchester, it has evolved to represent ontologies based on DL in a variety of ontology languages.

Besides these new features, recent updates let Protégé export ontologies in a wide variety of formats such as RDF, OWL, XML and others. It has an open architecture which can be extended through plug-in components created by other developers [58, 59].

2.2 Description Logics

Description Logics are a family of languages for representing knowledge in a formal and structured way. They allow the representation of a model of a certain domain using a syntax consisting of classes, individuals, relations and the logical connections between them [?].

In the 1970s, knowledge representation started to gain popularity and the approaches were divided into two types [60]:

• Logic Based - where new facts can be deduced by predicate calculus, an axiomatized form of predicate logic;

• Non-Logic Based - this representation builds on more cognitive notions, mainly derived from human memory and the execution of tasks. Network structures and rule-based representations are examples of this type of representation.

Of these two types, Non-Logic Based representations became more appealing from a practical viewpoint. However, they are designed for a specific problem or task, where knowledge is represented as structured data and reasoning is performed by manipulating these structures, which brings some application limitations. In a logic based approach, on the other hand, the representational language uses a set of relational descriptions and variables to build predicates from which consistency and knowledge can be inferred by means of reasoning.

DL is the most recent name for a family of knowledge representation formalisms equipped with a formal, logic based semantics. It is called description logics because the important notions within the domain are described by concept descriptions. These are built from atomic concepts (unary predicates) and atomic roles (binary predicates) using the concept and role constructors provided by the particular DL [61].

Knowledge representation with description logics starts by identifying and defining the most relevant concepts of the domain, that is, its terminology, and then uses these concepts to build descriptions that specify the properties of the individuals or objects in the domain. Unlike other languages, DLs are equipped with logic-based semantics and reasoning. The latter makes it possible to infer new, implicit knowledge from the explicitly represented knowledge. This new knowledge can be used by humans to structure and better understand the domain being represented, through two types of classification [62]:

• Concept Classification - this type of classification is based on the principle of subsumption. The classification is done by determining sub/superconcept relationships between concepts, which yields a terminology structured as a hierarchy. This provides useful information about the connections between concepts and also increases the performance of inference services;

• Individual Classification - this classification determines whether a certain individual is an instance of a class. Once this is known, the properties of the individual are easily extracted. It may also flag inconsistencies in the knowledge base that force the engineer to extend or modify it, thus contributing to the improvement of its efficiency and robustness.

Inside of a knowledge base, it is possible to see a distinction between what is the general knowledge about the domain and the knowledge specific to the problem. Thus, a DL knowledge base is also divided in two components: a TBox and a


On the other side there is the ABox, which contains the assertions made about individuals. These are also called membership assertions since they refer to an individual being an instance, or member, of a certain concept. For example:

1: Man ⊓ Person(MICHAEL)

2: hasFather(MICHAEL, CHARLES)

The first assertion states that Michael is a Man. Taking into account the assertion made before, it is also possible to state that Michael is an instance of Male. Assertions of this type are called concept assertions. The second assertion states that Michael has a father called Charles. Assertions of this kind are called role assertions. The reasoning task in the ABox is to check whether a given individual is an instance of a specific concept [15, 60, 62].
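The instance check described above can be illustrated with a small Python sketch. It is not a DL reasoner: it only follows atomic subsumption axioms, and the TBox axiom stating that every Man is a Male is an assumption introduced here for illustration. The data layout and function name are likewise hypothetical.

```python
# ABox: concept assertions and role assertions taken from the example above.
CONCEPT_ASSERTIONS = {("MICHAEL", "Man"), ("MICHAEL", "Person")}
ROLE_ASSERTIONS = {("hasFather", "MICHAEL", "CHARLES")}

# TBox fragment (ASSUMPTION for illustration): every Man is a Male.
TBOX_SUBSUMPTIONS = {"Man": {"Male"}}


def instance_of(individual, concept):
    """Check membership by closing the asserted concepts under the TBox."""
    frontier = [c for (i, c) in CONCEPT_ASSERTIONS if i == individual]
    inferred = set()
    while frontier:
        current = frontier.pop()
        if current not in inferred:
            inferred.add(current)
            frontier.extend(TBOX_SUBSUMPTIONS.get(current, ()))
    return concept in inferred


print(instance_of("MICHAEL", "Male"))
print(("hasFather", "MICHAEL", "CHARLES") in ROLE_ASSERTIONS)
```

An actual reasoner also handles complex concept constructors and does not treat the absence of an assertion as evidence that it is false, which this sketch ignores.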

DLs have demonstrated their practical value by being implemented in many systems in various domains. Software engineering was one of the first target domains. One example is AT&T, which developed the Classic system to help software developers find information about a large software system. Another domain is configuration, where DLs support the design of complex systems by combining multiple components. In the biomedical domain, DLs have proved very useful in the development of decision support systems despite the complexity of the medical domain [60]. In the ontological domain, DL is the core of the OWL 2 ontology language.

2.3 Medical Scope

2.3.1 The TNM Classification

The staging of malignant tumors is essential to the diagnosis, prognosis and management of cancer. The TNM classification was developed between 1943 and 1952 by Pierre Denoix and was first published by the UICC in 1968. This system has been used for more than 50 years and, over time and through different editions, it has evolved to keep up with the explosive growth in medical research, knowledge and information.

Table 2.1: Objectives of the TNM Classification [3, 5]

To aid the clinician in planning treatment

To give some indication of prognosis

To assist in evaluating the results of treatment

To contribute to continuing investigations of human malignancies

To facilitate the exchange of information between treatment centres

Today the TNM classification is in its 7th edition and is considered a worldwide tool for reporting the anatomic Extent of Disease (EOD) and the prognosis of patients with cancer. This system is the basis of decision-making systems and clinical practice guidelines, which makes it extremely useful.

The UICC established a set of objectives, presented in Table 2.1, which it believes will sustain its prime motivation: a broad and unified system with a common language that is used and understood by clinicians in all specialities. This system evaluates the attributes of the tumor, including local growth and extension (T), spread to regional lymph nodes (N) and distant metastasis (M). T and N usually provide several levels of increasing severity, whereas for distant metastasis there is generally only a binary distinction: 0 (no evidence) and 1 (evidence). In addition, a series of symbols exists to complement the classification, which substantially increases its complexity. For example, each of these levels can have a suffix as a sub-classification (e.g. T1a, N2b) that adds specific information. This can become problematic because the suffixes vary with tumor location. There is also "X", used when the clinical or pathological information is incomplete or inaccurate, and "is", used for classifying a carcinoma in situ. The stage of the tumor corresponds to the combination of the three types of assessment.
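The structure of a single TNM component can be made explicit with a small parsing sketch in Python. The regular expression and the function name are assumptions for illustration and deliberately simplified: they accept levels 0 to 4, "X" and "is", one optional lowercase suffix, and an optional "c" or "p" prefix for clinical or pathological classification. Site-specific rules about which levels and suffixes are valid are not modelled.

```python
import re

# Simplified, hypothetical grammar for one TNM component such as "cT1" or "N2b".
TNM_PATTERN = re.compile(
    r"^(?P<prefix>[cp])?"        # optional c (clinical) or p (pathological)
    r"(?P<category>[TNM])"       # tumor, nodes or metastasis
    r"(?P<level>X|is|[0-4])"     # severity level, X (not assessable) or is (in situ)
    r"(?P<suffix>[a-d])?$"       # optional sub-classification, e.g. the "a" in T1a
)


def parse_tnm_component(code):
    """Split one TNM component into prefix, category, level and suffix."""
    match = TNM_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a recognised TNM component: {code!r}")
    parts = match.groupdict()
    if parts["level"] == "is" and parts["category"] != "T":
        raise ValueError("'is' (carcinoma in situ) only applies to T")
    if parts["level"] in ("X", "is") and parts["suffix"] is not None:
        raise ValueError("X and is do not take a suffix")
    return parts


print(parse_tnm_component("cT1"))
print(parse_tnm_component("N2b"))
```

Keeping the four parts separate makes it straightforward to compare levels independently of the prefix, which is useful when clinical and pathological findings are recorded for the same patient.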

There are two types of classification differing in the way the evidence was obtained:

• Clinical Classification - this is the pre-treatment clinical classification, which means that it is based on evidence gathered before treatment, for example by physical examination, and is designated as c. It is essential in the process of choosing and evaluating the proper therapy. This classification requires the use of the prefix "c", e.g. cT1, cN2;

• Pathological Classification - designated as pTNM , this classification is used to guide through further therapy and provides new data to the


progno-accurate information systems based on this system has been increasing [3–5,63–65].

2.3.2 TNM Classification for Colon and Rectum Tumors

The TNM classification for colon and rectum tumors (International Classification of Diseases for Oncology (ICD-O) C18-20) provides more detail than any other staging system. Colon and rectum staging is based on the depth of tumor invasion into the wall of the intestine (T), the number of regional lymph nodes involved (N) and the presence or absence of distant metastasis (M). Both types of classification are applied; however, for this particular cancer site, the pathological and clinical classifications are based on the same rules [66].

The colon and rectum classification is subdivided into anatomical sites and subsites, each with its respective ICD-O code (see Figure 2.12). The principal anatomic components are the Colon (C18), the Rectosigmoid Junction (C19) and the Rectum (C20). The subdivisions of the Colon are: Caecum (C18.0), Ascending Colon (C18.2), Hepatic Flexure (C18.3), Transverse Colon (C18.4), Splenic Flexure (C18.5), Descending Colon (C18.6) and Sigmoid Colon (C18.7) [3].

The regional lymph nodes are located near the major vessels that supply the colon and rectum, along the vascular arcades of the marginal artery and adjacent to the colon. They can be seen in Figure 2.13. For the pN classification the only information needed is the number of metastatic regional lymph nodes. Any non-regional metastatic lymph node is recorded as a distant metastasis.

Definitions for Colon and Rectum TNM

The same definitions are applied to both the pathological and the clinical classification [3, 66].

(47)

Figure 2.12: Anatomical sites and subsites of colon [a] and rectum [b] [66]

(48)

• T1 - Tumor invades submucosa

• T2 - Tumor invades muscularis propria

• T3 - Tumor invades subserosa or non-peritonealized pericolic or perirectal tissues

• T4 - Tumor directly invades other organs or structures and/or perforates visceral peritoneum

– T4a - Tumor perforates visceral peritoneum

– T4b - Tumor directly invades other organs or structures

N - Regional Lymph Nodes

The N classification concerns the number of metastatic regional lymph nodes and the presence or absence of tumor deposits. To determine this, the clinician proceeds with physical examination, imaging and/or surgical exploration.

• NX - Regional lymph nodes cannot be assessed

• N0 - No regional lymph node metastasis

• N1 - Metastasis in 1-3 regional lymph nodes

– N1a - Metastasis in 1 regional lymph node

– N1b - Metastasis in 2-3 regional lymph nodes

– N1c - Tumor deposit(s), i.e. satellites, in the subserosa or in non-peritonealized pericolic and perirectal soft tissue without regional lymph node metastasis
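The N rules listed above can be read as a small decision procedure. The following is a minimal Python sketch under stated assumptions, not the ontology-based classifier developed in this work: the function name `classify_n` and its parameters are hypothetical, and counts above three are returned as a generic N2 because the N2 subdivisions are not defined in the list above.

```python
from typing import Optional


def classify_n(positive_nodes: Optional[int], tumor_deposits: bool = False) -> str:
    """Return the N category for the rules listed above (illustrative only).

    positive_nodes: number of metastatic regional lymph nodes,
                    or None when the nodes could not be assessed.
    tumor_deposits: True when satellite tumor deposits were found.
    """
    if positive_nodes is None:
        return "NX"
    if positive_nodes < 0:
        raise ValueError("the number of lymph nodes cannot be negative")
    if positive_nodes == 0:
        # N1c applies only to deposits WITHOUT regional lymph node metastasis.
        return "N1c" if tumor_deposits else "N0"
    if positive_nodes == 1:
        return "N1a"
    if positive_nodes <= 3:
        return "N1b"
    # Assumption: more than three nodes is reported as N2 without
    # subdivision, since N2a/N2b are not defined in the list above.
    return "N2"
```

For example, `classify_n(2)` yields N1b, while `classify_n(0, tumor_deposits=True)` yields N1c.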

(49)

Figure 2.14: Representations of the primary tumor classification for colon and rectum tumor [67]

(50)

lung and liver are the most common sites. Metastatic non-regional lymph nodes are considered as distant metastasis.

• M0 - No distant metastasis

• M1 - Distant metastasis

– M1a - Metastasis confined to one organ (liver, lung, ovary, non-regional lymph node(s))

– M1b - Metastasis in more than one organ or the peritoneum

Staging

The tumor is staged once the full TNM code is known. Each stage is associated with a different combination of classifications (Table 2.2). Higher stages correspond to increasingly worse scenarios.

(51)

Table 2.2: Correspondence between the TNM classification and Staging [66]

Stage        T Classification   N Classification   M Classification
Stage 0      Tis                N0                 M0
Stage I      T1, T2             N0                 M0
Stage IIA    T3                 N0                 M0
Stage IIB    T4a                N0                 M0
Stage IIC    T4b                N0                 M0
Stage IIIA   T1, T2             N1                 M0
             T1                 N2a                M0
Stage IIIB   T3, T4a            N1                 M0
             T2, T3             N2a                M0
             T1, T2             N2b                M0
Stage IIIC   T4a                N2a                M0
             T3, T4a            N2b                M0
             T4b                N1, N2             M0
Stage IVA    Any T              Any N              M1a
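The rows of Table 2.2 can also be expressed as a lookup. The sketch below is an illustrative Python encoding of the rows shown above only; the names `stage` and `_M0_ROWS` are hypothetical, and it assumes that an N entry such as N1 or N2 in the table also covers its subcategories (N1a, N2b, and so on). Combinations that do not appear in the table as reproduced here return `None`.

```python
from typing import Optional

# Rows of Table 2.2 with M0, as (stage, T values, N patterns).
# Assumption: an N pattern matches by prefix, so "N1" covers N1a, N1b, N1c
# and "N2" covers N2a, N2b.
_M0_ROWS = [
    ("0", {"Tis"}, {"N0"}),
    ("I", {"T1", "T2"}, {"N0"}),
    ("IIA", {"T3"}, {"N0"}),
    ("IIB", {"T4a"}, {"N0"}),
    ("IIC", {"T4b"}, {"N0"}),
    ("IIIA", {"T1", "T2"}, {"N1"}),
    ("IIIA", {"T1"}, {"N2a"}),
    ("IIIB", {"T3", "T4a"}, {"N1"}),
    ("IIIB", {"T2", "T3"}, {"N2a"}),
    ("IIIB", {"T1", "T2"}, {"N2b"}),
    ("IIIC", {"T4a"}, {"N2a"}),
    ("IIIC", {"T3", "T4a"}, {"N2b"}),
    ("IIIC", {"T4b"}, {"N1", "N2"}),
]


def stage(t: str, n: str, m: str) -> Optional[str]:
    """Return the stage for a TNM code, or None if the table has no row for it."""
    if m == "M1a":
        return "IVA"  # Any T, any N
    if m != "M0":
        return None  # not covered by the rows shown above
    for name, t_values, n_patterns in _M0_ROWS:
        if t in t_values and any(n.startswith(p) for p in n_patterns):
            return name
    return None
```

For instance, `stage("T3", "N2a", "M0")` gives IIIB, and any code with M1a gives IVA regardless of T and N.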

(52)

tology. Then follows the extraction of all the concepts in order to provide a terminology. Once the necessary terminology has been extracted, the knowledge base is developed.

3.1.1 Specification

The domain of the ontology is the TNM Classification published by the UICC. The scope represented was restricted to the classification of colon and rectum tumors.

TNM-O comprises a set of ontologies, one for each cancer site, that can be imported into it. TNM-O thus works as a connecting hub containing a set of classes that are common to all the other ontologies. This modular architecture forces the main TNM-O to represent the most general definitions, to which the other modules can connect. Therefore, this ontology should contain the concepts that are common to all cancer sites.

For this project, TNMCR-O was also developed as a modular ontology that represents the classification rules for colon and rectum tumors. This ontology should contain all the concepts and definitions for the classification of colorectal tumors as described in Section 2.3.2. For that, it was necessary to provide a representation of:

• all the anatomic concepts related to the classification of colorectal tumors;

(53)

• qualities and possible values for the tumor;

• and the EOD.

All these representations were made with regard to the upper-level classes of the TNM-O. BTL2 was used as the domain top-level ontology [69]. This ontology provides some predictability to the development of the TNM-O and its modules, which is intended, since it facilitates the development of new modular ontologies and their integration into the main TNM-O. Besides providing a formal representation of the TNM classification of malignant tumors, this ontology should be capable of correctly classifying instance data.

3.1.2 Terminology

As reference, the seventh edition of the "TNM Classification of Malignant Tumours", published by the UICC and edited by L. Sobin et al., was used. It describes the classification rules for all cancer sites extensively in natural language.

Extraction of Anatomical Structures

The representation of anatomical structures was based on the FMA ontology [7]. Besides the ontology, the FMA also provides a web framework, the Foundational Model Explorer (FME), which allows the user to search for any concept of human anatomy and the relations between concepts.

Each modular ontology should import its own anatomical concepts and hierarchy. Some ontologies can share anatomical categories or even entities in their classification. Because of this, it is necessary to design a base structure in the TNM-O that provides better categorization when anatomical entities are imported for each modular ontology.

Additionally, to provide a correct implementation, this process of building a general hierarchy must be iterative. This means that each time a new module is developed, the anatomical tree in the TNM-O should be updated, which is not a problem since ontologies are easily updated and maintained. So far, the anatomical representations in the TNM-O mainly concern colorectal tumors, since this was the module developed in this project.

In order to choose the best categories to add to the main ontology, it was necessary to take anatomical concepts from the colon and rectum classification and search for them in the FME. Figure 3.1 shows a screenshot of the result of a search for the submucosa, which is a component of the wall of the colon and essential to the classification. However, representing the whole FMA ontology would increase the computational resources needed to use the TNM-O.

(54)

• Confinement - this quality concerns the confinement of the primary tumor within the wall of the colon and rectum. The respective values are Confined and Invasive;

• Cardinality - this quality represents a quantity. In this ontology it is used, for example, to represent the number of metastatic regional lymph nodes found;

• AssessmentQuality - this quality represents cases where the assessment was not done (NoAssessment) or no evidence was found (NoEvidence).
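To make the role of these qualities more concrete, the three qualities listed above can be pictured as typed values attached to a finding. The following Python sketch is only an analogy to the OWL classes, not the ontology itself: the enum members mirror the value names given above, while the `LymphNodeFinding` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confinement(Enum):
    """Confinement of the primary tumor within the wall of the colon and rectum."""
    CONFINED = "Confined"
    INVASIVE = "Invasive"


class AssessmentQuality(Enum):
    """Cases where no assessment was done or no evidence was found."""
    NO_ASSESSMENT = "NoAssessment"
    NO_EVIDENCE = "NoEvidence"


@dataclass(frozen=True)
class LymphNodeFinding:
    """Hypothetical record combining a Cardinality with an AssessmentQuality."""
    metastatic_nodes: Optional[int] = None  # Cardinality: a quantity
    assessment: Optional[AssessmentQuality] = None

    def __post_init__(self) -> None:
        # A cardinality is a count, so negative values are rejected.
        if self.metastatic_nodes is not None and self.metastatic_nodes < 0:
            raise ValueError("cardinality cannot be negative")


# Usage: two metastatic regional lymph nodes were found.
finding = LymphNodeFinding(metastatic_nodes=2)
```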

3.1.3 Classification

The goal of the TNM Classification is to properly classify malignant tumors. Analogously, the TNM-O together with the TNMCR-O should also be able to perform such classification. For this, it is necessary to attach each classification rule to the respective TNM code. Ontologically, each code corresponds to a RepresentationalUnit. RepresentationalUnits were defined as a subclass of InformationObject, which is provided by the BTL2.
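The idea of attaching a rule to each code can be illustrated outside OWL. The sketch below is a plain-Python analogy and not the reasoner-based mechanism used in this work: the class names follow the text, but the flag names in the finding dictionary, the `UNITS` list and the `classify` function are hypothetical, and the rules paraphrase the T definitions of Section 2.3.2. Rules are checked from the most to the least advanced category, so that the deepest invasion determines the code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Finding = Dict[str, bool]


class InformationObject:
    """Stand-in for the BTL2 class of the same name."""


@dataclass
class RepresentationalUnit(InformationObject):
    """A TNM code together with the rule that must hold for it to apply."""
    code: str
    rule: Callable[[Finding], bool]


# Hypothetical flags; ordered from most to least advanced T category.
UNITS: List[RepresentationalUnit] = [
    RepresentationalUnit("T4b", lambda f: f.get("invades_other_organs", False)),
    RepresentationalUnit("T4a", lambda f: f.get("perforates_visceral_peritoneum", False)),
    RepresentationalUnit(
        "T3",
        lambda f: f.get("invades_subserosa", False)
        or f.get("invades_pericolic_or_perirectal_tissue", False),
    ),
    RepresentationalUnit("T2", lambda f: f.get("invades_muscularis_propria", False)),
    RepresentationalUnit("T1", lambda f: f.get("invades_submucosa", False)),
]


def classify(finding: Finding) -> Optional[str]:
    """Return the code of the first unit whose rule holds, or None."""
    for unit in UNITS:
        if unit.rule(finding):
            return unit.code
    return None
```

A finding with both `invades_submucosa` and `invades_muscularis_propria` set is classified as T2, because the T2 rule is checked before the T1 rule.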

3.1.4 Implementation

Both the TNM-O and the TNMCR-O were implemented in the Semantic Web standard OWL-DL. This standard is a sub-language of OWL strictly based on Description Logics and is currently supported by the ontology editor Protégé. The HermiT reasoner was used for reasoning.