Promoting interoperability of biodiversity spreadsheets via purpose recognition = Promovendo interoperabilidade de planilhas de biodiversidade através do reconhecimento de propósito


COMPUTAÇÃO

Ivelize Rocha Bernardo

Promoting Interoperability of Biodiversity Spreadsheets via Purpose Recognition

Promovendo Interoperabilidade de Planilhas de Biodiversidade através do Reconhecimento de Propósito

CAMPINAS

2017



Tese apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Doutora em Ciência da Computação.

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor/Orientador: Prof. Dr. André Santanchè

This copy corresponds to the final version of the thesis defended by Ivelize Rocha Bernardo and supervised by Prof. Dr. André Santanchè.


Catalog record

Universidade Estadual de Campinas

Biblioteca do Instituto de Matemática, Estatística e Computação Científica Ana Regina Machado - CRB 8/5467

Bernardo, Ivelize Rocha,

B456p Promoting interoperability of biodiversity spreadsheets via purpose recognition / Ivelize Rocha Bernardo. – Campinas, SP : [s.n.], 2017.

Advisor: André Santanchè.

Thesis (doctorate) – Universidade Estadual de Campinas, Instituto de Computação.

1. Biodiversidade. 2. Biodiversidade - Banco de dados. 3. Aprendizado de máquina. 4. Planilhas eletrônicas. 5. Integração semântica (Sistemas de computação). I. Santanchè, André, 1968-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Information for the Digital Library

Title in another language: Promovendo interoperabilidade de planilhas de biodiversidade através do reconhecimento de propósito

Keywords in English:

Biodiversity

Biodiversity - Databases

Machine learning

Electronic spreadsheets

Semantic integration (Computer systems)

Area of concentration: Computer Science

Degree: Doctor in Computer Science

Examining committee:

André Santanchè [Advisor]

Antonio Mauro Saraiva

José Laurindo Campos dos Santos

Flavio Antonio Maës Santos

Julio Cesar dos Reis

Defense date: 24-10-2017

Graduate program: Computer Science


Examining Committee:

• Prof. Dr. André Santanchè

Instituto de Computação - Unicamp

• Prof. Dr. Antonio Mauro Saraiva

Escola Politécnica - USP

• Prof. Dr. José Laurindo Campos dos Santos

Coordenação de Ação Estratégica - INPA

• Prof. Dr. Flavio Antonio Maës Santos

Instituto de Biologia - Unicamp

• Prof. Dr. Julio Cesar dos Reis

Instituto de Computação - Unicamp

The defense minutes, with the respective signatures of the committee members, are filed in the student's academic record.




If I arrived here, it is because every person who crossed my life brought me a new experience of self-improvement, and for them, I do not have words to say thank you!

How much should I say thank you to my advisor, Prof. Dr. André Santanchè, for the guidance, all dedication, and encouragement throughout the project? Moreover, Prof. Dr. Claudia Bauzer Medeiros, Prof. Dr. Helio Pedrini, Profa. Dra. Maria Cecília Calani Baranauskas, Dra. Debora Pignatari Drucker, and Dra. Talita Soares Reis have collaborated with this research work, making it even better.

How to say thank you to Prof. Dr. Alvaro A. Fernandes for allowing me to have one of the most amazing experiences of my life? I will never forget the opportunity he offered me to develop my research project for a year at the University of Manchester with him. Prof. Dr. Norman Paton, who together with Alvaro inspired me with wise questions and helped me improve greatly as a researcher. Prof. Dr. Carole Goble and everyone who works with her, I cannot say how pleased I am to have had the opportunity to receive your advice and help. I have learned a lot from you all. Thank you very much for everything!

My friends from Manchester who made my life more colorful during winter. My friends from LIS-Unicamp who loved me even when the weather was low. Bianca for having this huge heart, I will never forget what you did for me, and Gi, Artemis, and Shella, who welcomed me so well. Helo, because "life is not a cartesian plan!".

My father who has inspired me with his strength and optimistic way of facing the obstacles. My mother, for her endless dedication, her love, and her patience, for encouraging me to face the challenges. Mom, you showed me that life could be unexpected and that, sometimes, we just need to take the next step to open a world of opportunities. My sister, for being my best friend, without her, life would not be so full of love, so pure complicity. My grandma, teaching me the love for knowledge.

My friends Eddy and Bia for encouraging me so much, mainly in these last months, I can’t say what this doctorate would be without you both. Lucas and Mau for always supporting me and saving me by offering their love, their time, their house and the love of their dog, Yoshi. Lilian and Ale for sharing their smiles, their thoughts, their couch, their love. Vania, André, and Lettys who have been in my life, always bringing me so many good things. Chris for encouraging me to follow my way, and for always making me smile. Special thanks to my friend Guilherme, who is no longer with us, but whom I am going to remember for the rest of my life. Guys, definitely you all make my life worthwhile.

All professors, staff, and colleagues at UNICAMP and the University of Manchester. This work was developed at UNICAMP and partially at the School of Computer Science at the University of Manchester, and was financed by FAPESP (2014/21963-4) and FAPESP (2012/16159-6). I am pleased to acknowledge them. The opinions expressed in this work do not necessarily reflect those of the funding agencies.


Existem muitas iniciativas para promover "intelligent openness" ou "FAIR principles" de dados, ou seja, formas de tornar os dados disponíveis, acessíveis, interoperáveis e reutilizáveis. No entanto, no domínio da biodiversidade, ainda é habitual que os biólogos produzam seus dados em formatos ad-hoc e heterogêneos. A conformidade com um padrão impõe-lhes um custo inicial de reestruturação e anotação de seus dados. Esta pesquisa aborda este cenário com foco em planilhas. Contribui com uma técnica para produzir automaticamente anotações semânticas em dados extraídos de planilhas, explorando a maneira como os atributos são organizados em seus esquemas para inferir seu propósito. Os dados semânticos resultantes podem ser integrados, articulados e manipulados de acordo com sua finalidade, em uma abordagem incremental e exploratória, permitindo que os biólogos naveguem e interajam com uma rede interconectada de dados de biodiversidade.


There are many initiatives to promote "intelligent openness" or the "FAIR principles" of data, i.e., ways to make data Findable, Accessible, Interoperable, and Reusable. They rely on compliance with reference schemas, common standards, or ontologies. However, in the biodiversity domain, biologists still usually produce their data in ad hoc and heterogeneous formats. Compliance with a standard imposes on them an upfront cost of restructuring and annotating their data. This research addresses this scenario, focusing on spreadsheets. It presents our technique to automatically produce semantic annotations for data extracted from spreadsheets, exploring the way that attributes are arranged in their schemas to infer their purpose. Elements of the resulting semantic dataset can be integrated, articulated, and handled according to their purpose, in an incremental and exploratory approach, allowing biologists to navigate and interact with an interconnected network of biodiversity data.


2.1 Fields Characterization

2.2 Terms by schema of initial lines

2.3 SciSpread - Proportions among fields of catalog spreadsheets

2.4 Survey - Proportions among fields of catalog spreadsheets

2.5 SciSpread - Proportions among fields of event spreadsheets

2.6 Survey - Proportions among fields of event spreadsheets

2.7 Comparative terms quantities between spreadsheets category

2.8 Comparative terms location between spreadsheets nature

2.9 Spreadsheet 1 - used in the Survey

2.10 Spreadsheet 2 - used in the Survey

2.11 Comparative results about spreadsheets classification

2.12 Spreadsheet 3 - used in the survey

2.13 Conceptual model for catalog spreadsheets annotated with qualifiers

3.1 Biodiversity data grouped by purpose [35]

3.2 Set of spreadsheets used by scientists to record biodiversity data

3.3 Spreadsheets of our survey [9] analysing how the organization of attributes influences the interpretation of a spreadsheet

3.4 Operations for biodiversity data sets according to their purpose [35]

3.5 System Architecture

3.6 Schema Extraction process

3.7 Spreadsheet Classification by a Random Forest Model

3.8 ROC curves of experiments with arrangement weighting

3.9 ROC curves of experiments without arrangement weighting

3.10 ROC curves of experiments with arrangement weighting for spreadsheets of the survey

3.11 ROC curves of experiments without arrangement weighting for spreadsheets of the survey

4.1 Seven example spreadsheets (A-F) for integration

4.2 Integrating spreadsheets using a DSMS

4.3 Questions (1-9) to spreadsheets (A-F)

4.4 Example Transformations on Spreadsheets

4.5 Integration Process for Question 2

4.6 Integration Steps for Question 2

4.7 Process graphically illustrating an integration program over the spreadsheets in Fig. 4.1

4.8 Similarity network created from the comparison of 339 spreadsheets captured on the Web


1 Introduction

2 Interpretation of Construction Patterns for Biodiversity Spreadsheets

2.1 Introduction

2.2 Research Scenario

2.3 Methodology

2.3.1 Initial Data Collection and Analysis

2.3.2 Hypothesis

2.4 Evidences of Construction Patterns

2.4.1 Evidences based on Automatic Recognition of Web Spreadsheets

2.4.2 Exploratory Analysis based on a Survey conducted with Biologists

2.4.3 Comparative Results: SciSpread x Survey

2.5 Representation of Construction Patterns

2.5.1 Formalizing the Model to Represent Patterns

2.6 Related Work

2.7 Conclusions and Future Work

3 Annotating Biodiversity Spreadsheets through 5W1H based on Machine Learning

3.1 Background

3.1.1 Problem Definition

3.3 Spreadsheet Annotation Process

3.3.1 Schema Extraction

3.3.2 Purpose Recognition

3.4 Experimental Results

3.4.1 Knowledge Transfer

3.4.2 Model Evaluation - Scoring Position Test

3.4.3 Biodiversity Spreadsheets Benchmark

3.5 Discussion and Conclusions

4 A Model-Management Approach to Biodiversity Spreadsheet Integration

4.1 Introduction

4.1.1 Background

4.1.2 A Motivating Scenario

4.1.3 Contributions

4.1.4 Organization of the Paper

4.2.4 Model Management in DSToolkit

4.3 Case Study

4.3.1 Integrating and Managing Spreadsheets through Operators

4.3.2 Characteristic Attributes Shared in Communities

4.5 Conclusion and Future Work

5 Conclusion and Future Work

5.0.1 Future Work

Bibliography


Chapter 1 Introduction

Biodiversity Informatics is the application of computer science to biodiversity information for better management, discovery, exploration, and analysis, producing new ways of visualizing and analyzing existing information, as well as providing new predictive models [39].

Several researchers [29, 55, 35, 34] have emphasized the importance of having an infrastructure that allows scientists to manipulate and navigate through biodiversity data in a holistic approach, allowing access to different facets of biodiversity.

However, in many areas of biodiversity, most of the data is in spreadsheets. They are appealing to scientists because they are perceived as intuitive and flexible enough to meet different individual demands. On the downside, spreadsheets present a difficult challenge for semantic integration attempts, which undermines interoperability. This is a side effect of the representational flexibility and the ability to accommodate individual conceptualizations that scientists value so much.

There are many biodiversity data standards that aim at unifying the vocabulary used by scientists to reuse and share data – DWC [69], ENVO [17], MIASE [67] – and committees that disseminate this idea and show the benefits of data sharing – TDWG [1], FAIRDOM [42], ELIXIR [20], RD-Alliance [65]. However, a lot of data is still created in spreadsheets without considering these standards, generating considerable heterogeneity.

The main goal of this research is to design and implement a biodiversity spreadsheet dataspace, able to integrate and articulate data from several distinct sources. The specific goals include: (i) to create an automatic method for identifying the purpose of each spreadsheet in the biodiversity context, extracting and mapping its schema to an interoperable format inside a dataspace; (ii) to define a set of operators that allow the manipulation of this data and the articulation of several spreadsheets, according to individual purposes.

The first main research challenge addressed in this thesis is the automatic recognition of spreadsheets’ content and their semantic annotation. Spreadsheets are typically formed by a tabular organization that includes an implicit schema – as a set of attributes – plus values given to these attributes in instances. Semantic interoperability is achieved insofar as such implicit schema is explicitly described, making it readable and interpretable by machines.

Related work attempts to achieve semantic interoperability of spreadsheets with strategies ranging from manual mapping to automatic attribute recognition, associating spreadsheet elements with concepts available in Web knowledge bases such as DBpedia. However, we observed limitations in these automatic interpretation processes: their scope is restricted to isolated attributes of spreadsheets, or pairs of attributes, and they do not consider the arrangement of these attributes. This prevents them from identifying the purpose of the spreadsheet and, consequently, the user’s intention in creating it. As we show in this thesis, recognizing the purpose of each spreadsheet is fundamental to produce a richer semantic representation of its content, enabling machines to better interpret it and to offer operations aligned to each type of spreadsheet.

The limitations observed in related work motivated us to investigate in depth how biologists work with spreadsheets, i.e., how they arrange them, and how they integrate and articulate their content according to their purpose. This gave rise to the hypothesis analyzed in this research: there are, in the tabular data of spreadsheets, structural reasons for being interpreted by humans for one purpose instead of another. In Chapter 2, we emphasize that, despite the importance of attributes in their interpretation, spreadsheets with the same attributes present different purposes, which vary according to the organization of these attributes.

The second main research challenge addressed in this thesis is how to provide an infrastructure able to integrate and articulate the content of spreadsheets once their purposes have been recognized. While most of the related work depends on compliance with reference schemas or ontologies, imposing on individual scientists the upfront cost of restructuring and annotating their spreadsheets before integration and interoperation are possible, our proposal reflects the incremental and exploratory approach followed by biologists. It enables the combination and recombination of spreadsheets in a minimally costly manner [3], through an incremental, stepwise approach, depending on the focus of the study. Intermediate results can be reused and lead to alternative exploratory paths. This approach is consistent with the momentum in scientific inquiry for unforeseen opportunities and unpredictable connections.

Departing from these two main research challenges and the study about how biologists organize their data in spreadsheets, we have implemented: (i) a machine learning based process to automatically recognize the purpose of spreadsheets in order to extract and semantically annotate their schemas, mapping their content to an interoperable format (detailed in Chapter 3); (ii) a prototype that experimentally illustrates the integration of biodiversity data from hundreds of spreadsheets collected on the Web (detailed in Chapter 4).

The two main research contributions generated by this thesis to answer the research challenges mentioned before are:

• A strategy to automatically recognize and annotate biodiversity spreadsheets according to their purpose;

• A dataspace approach for integrating biodiversity spreadsheets data based on data management operators.

In addition, an important practical contribution of this research is our Biodiversity Spreadsheets Benchmark, which is the result of several interviews, two surveys, and the analysis of spreadsheets collected on the Web, as we further detail.


In order to validate our assumptions and implementations, we developed an extensive field study, collecting and analyzing spreadsheets and the practices of their creators. These spreadsheets have progressively been added to our Biodiversity Spreadsheets Benchmark.

The initial nine spreadsheets came from the Institute of Biology at Unicamp, which collaborated with us in this research. Our main collaborator was a biologist who is the curator of the Zoology museum.

We further manually selected forty spreadsheets from the Web, as well as eleven thousand spreadsheets automatically selected from the Web by a crawler script. These spreadsheets were the basis to evaluate our recognition engine. We also conducted a survey over these spreadsheets to validate the hypotheses about how biologists organize their data. The survey form is available at https://goo.gl/Lixii4, and the biologists’ answers are available at https://goo.gl/w8dn4L. All these first steps are described in Chapter 2.

The following step was to identify how researchers manipulate data of spreadsheets. Several interviews were conducted with researchers addressing the following aspects:

• What is the motivation to use spreadsheets?

• How is data manipulation in spreadsheets performed?

• Is it necessary to link spreadsheet data with external complementary information?

• Do the data coming from different spreadsheets show potential correlation?

• How and where is data analysis performed?

The participants were: two biologists who work in the polychaete museum; one biologist who works on bird sounds in an audio library; two biologists whose doctoral projects are (i) temporal analysis of land cover and (ii) identification of genome regions related to the risk of stroke in sickle cell anemia patients; and two ecologists whose doctoral projects are related to environment analysis and how it can influence the development or continuation of a particular species. The report on the results of these interviews is attached to this thesis in Appendix A. It is also detailed in Chapter 4.

Finally, we conducted field research with biologists to confirm some of our hypotheses about spreadsheet classification and also to annotate spreadsheets collected from the Web according to their purpose. The resulting annotated set of three hundred spreadsheets was used to validate our classification algorithm. Chapter 3 presents more details of this survey. The survey is available at https://goo.gl/tXSuDV and the respective biologists’ answers are available at https://goo.gl/MA4yPX.

An important part of this benchmark is the categorization of spreadsheets according to their purpose, which involved building a reference dictionary to recognize biodiversity spreadsheets for the following purposes: species occurrence, species monitoring, species catalog, experimental observation, genetic data.

The dictionary has also been refined throughout this research, evolving through four versions – empirical version, network analysis, standards comparison, and specialist analysis – which are tested in different contexts. The process is detailed in Chapter 3.


Those interviews, surveys, and analyses are detailed throughout the following three chapters.

This thesis is composed of a collection of three papers published and/or submitted for publication. Each chapter corresponds to a different paper. They are arranged in the sequence of steps followed by the research. We further detail them.

Chapter 2 presents an analysis of how biologists organize and manipulate their data. It details pieces of evidence of construction patterns followed by users in the biodiversity domain and how they can lead to characterize the purpose of a spreadsheet, as well as its fields in a domain. It combines an automatic analysis of thousands of spreadsheets, collected on the Web, with results from a survey conducted with biologists.

Chapter 3 departs from the pieces of evidence observed in the previous chapter to exploit the implicit organization of spreadsheets to automatically extract their schema and recognize their purpose. A process using a machine learning approach makes spreadsheet data broadly reusable and makes it possible to establish consistent operations among spreadsheets, according to their purpose.

Chapter 4 addresses the challenge of how to offer an infrastructure where biologists can manipulate their data coming from spreadsheets in an incremental and exploratory approach. It demonstrates how to interlink biodiversity data assets according to their purpose and how to navigate through them.


Chapter 2 Interpretation of Construction Patterns for Biodiversity Spreadsheets

2.1 Introduction

When producing spreadsheets, end-users have autonomy and freedom to create their systematization structures, with few formal requirements. However, the product is geared toward personal reading, causing a side effect: programs provide little assistance in performing tasks, since they are unable to recognize the spreadsheet structure and to discern its implicit schema – hidden in the tabular organization – from the instances, and consequently the semantics of this schema. Therefore, it is difficult to combine and coordinate data among spreadsheets using conventional methods, because each new schema cannot be interpreted.

But how different are they in fact? We present in this paper evidence that similarities in spreadsheets can indicate patterns followed by groups. We consider that it is possible to map these patterns to a respective semantic description, through the recognition of structural reasons which lead a user to interpret a spreadsheet in one way and not another.

Thus, our strategy focuses on the detection of patterns to recognize similar spreadsheets. We argue that the particular way authors build their spreadsheets – i.e. the criterion to define elements, the approach to spatially organize them and the relationship between these elements – is directly related to their daily experience in the community to which they belong.

The challenge of this research is to consider a computer system as a consumer of spreadsheets besides the user. Our approach involves achieving a richer semantic interoperability for data from spreadsheets through pattern recognition.

Most of the related work disregards these patterns when implementing strategies for seeking interoperability of tabular data. This paper argues that the structure, i.e. the organization of the spreadsheet’s elements, must be considered since it leads to the identification of construction patterns, which are related to the user intention/action. This technique allows us to go towards the pragmatic interoperability layer [68].


construction patterns evidence adopted by biologists based on an application implemented by us, able to automatically recognize these patterns in the implicit spreadsheet schemas. To support our thesis we collected and analyzed approximately 11,000 spreadsheets belonging to the biodiversity domain.

In this paper, we confront previous results with an exploratory observation of these construction patterns based on a survey answered by 44 biologists.

The next sections are organized as follows: Section 2.2 gives an overview of some basic concepts and our research; Section 2.3 details the process of collecting and analyzing spreadsheets employed by biologists, as well as research hypotheses and their evaluation; Section 2.4 highlights evidence of construction patterns followed by biologists; Section 2.5 introduces our model to represent construction patterns; Section 2.6 presents Related Work; and Section 2.7 presents our concluding remarks and the next steps of this research.

2.2 Research Scenario

According to Syed et al. [64], a significant amount of the information available in the world is in spreadsheets. Despite their flexibility, spreadsheets were designed for independent and isolated use, and are not easily articulated with data from other spreadsheets/files.

For this reason, there is a growing concern to make spreadsheet data more apt to be shared and integrated. The main strategies convert them into open standards to allow software to interpret, combine and link spreadsheet data [54, 28, 2].

Related work addresses this problem mainly by manual mapping to Semantic Web open standards or by automatic recognition, relating spreadsheet elements to concepts available on Web knowledge bases such as DBpedia (dbpedia.org).

Systematic approaches for data storage, such as databases, predefine explicit schemas to record data. These schemas can be considered as semantic metadata for the stored data. Spreadsheets, on the other hand, have implicit schemas, i.e. metadata and data merged in the same tabular space.
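The contrast between explicit and implicit schemas can be illustrated with a minimal Python sketch. It assumes the simplest possible layout, in which the first row of the grid holds the field labels and every later row is one record; real spreadsheets often violate this assumption. The function name `split_implicit_schema` and the sample rows are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple

Grid = List[List[str]]


def split_implicit_schema(grid: Grid) -> Tuple[List[str], List[Dict[str, str]]]:
    """Separate the schema (labels) from the instances (records).

    Assumption: row 0 holds the field labels and each later row is one record.
    """
    header = [label.strip() for label in grid[0]]
    records = [dict(zip(header, row)) for row in grid[1:]]
    return header, records


# Made-up rows: metadata (row 0) and data share the same tabular space.
grid = [
    ["Species", "Date", "Locality"],
    ["species A", "2012-03-14", "site 1"],
    ["species B", "2012-03-15", "site 2"],
]

schema, records = split_implicit_schema(grid)
print(schema)
print(records[0])
```

Once the labels are separated from the rows, the schema exists as an object of its own that a program can inspect or annotate, which is what a database schema provides by construction.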

The central thesis behind our approach is that we can detect and interpret the spreadsheet’s schema by looking for construction patterns shared by research groups. We propose in this paper a representation model able to capture such patterns, as well as to be processed by machines. Results of our analysis of thousands of spreadsheets and of the survey indicate the existence of such recurrent patterns and that they can be exploited to recognize implicit schemas in spreadsheets.

There are several aspects that hinder the recognition of a spreadsheet and its implicit schema, such as differences in column order, the labels used to identify fields, and their respective semantics. Although related work explores a subset of the common practices in tabular data – sometimes taking into account their context [38, 66, 53] – it does not define a mechanism or model to represent these patterns independently. Since the knowledge about how to recognize patterns is mixed with the programs, it cannot be decoupled from their code. We claim here that a representation that materializes the knowledge about these patterns as artefacts, independently of specific programs and platforms, makes it possible to share, reuse, refine, and expand such patterns among users and applications.
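As a rough illustration of this decoupling, the Python sketch below keeps pattern descriptions as plain data (JSON) and uses a generic matcher that knows nothing about any specific pattern. It is not the representation model proposed in this chapter: the pattern names, the `first_field_any_of` key, and the label lists are all hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical pattern descriptions, stored as data rather than as code.
PATTERNS_JSON = """
[
  {"name": "taxon-first", "first_field_any_of": ["species", "genus", "family"]},
  {"name": "date-first", "first_field_any_of": ["date", "year"]}
]
"""


def load_patterns(text: str) -> List[Dict]:
    """Parse pattern artefacts; they could equally come from a shared file."""
    return json.loads(text)


def match_pattern(labels: List[str], patterns: List[Dict]) -> str:
    """Return the name of the first pattern matching the first field label."""
    if not labels:
        return "unknown"
    first = labels[0].strip().lower()
    for pattern in patterns:
        if first in pattern["first_field_any_of"]:
            return pattern["name"]
    return "unknown"


patterns = load_patterns(PATTERNS_JSON)
print(match_pattern(["Species", "Locality", "Date"], patterns))
print(match_pattern(["Date", "Locality", "Species"], patterns))
```

Because the patterns live outside the matcher, a new pattern can be added or refined by editing the data alone, and the same pattern file could be consumed by other programs.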

This research is driven by a larger project that involves cooperation with biologists to build biodiversity bases. We observed that biologists maintain a significant portion of their data in spreadsheets and, for this reason, this research adopted the context of biology as its specific focus.

We propose a model to represent construction patterns, departing from observations conducted through incremental steps, including collecting and cataloging spreadsheets, formulating hypotheses and models, exploratory observation through a survey with biologists, and evaluation. These steps are detailed in the next sections.

2.3 Methodology

As previously mentioned, our approach to represent construction patterns was based on a study of related work and field research in the biology domain.

Based on an initial analysis of how biologists of the Institute of Biology (IB) at Unicamp created their spreadsheets, we defined a set of hypotheses and designed experiments and a survey to validate them. These data were also the basis to produce our first model adopted in a process to automate the recognition of construction patterns, whose design involved:

(i) preliminary collection and analysis of spreadsheet data: initially, we discussed with biologists of the Institute of Biology (IB) at Unicamp the practices applied when they work with spreadsheets, from their creation to their reuse and manipulation;

(ii) formulation of hypotheses about spreadsheet construction patterns: departing from the spreadsheets, we performed a visual analysis looking for common elements able to be identified by a system;

(iii) design and implementation of an automatic recognizer for those spreadsheets: we implemented a system – SciSpread [12] – which was validated in three progressive groups of spreadsheets: 9 spreadsheets belonging to IB at Unicamp; 40 spreadsheets from the Web selected manually; 11,000 spreadsheets from the Web automatically selected by a crawler script;

(iv) exploratory observation of biology survey: we deployed an online form (available at goo.gl/b1iEvl) to biology students, professors, researchers and professionals, in order to verify, according to our hypotheses formulated in (ii), how biologists classify and arrange their data.

2.3.1 Initial Data Collection and Analysis

Our analysis started with 9 spreadsheets belonging to the IB, in which we identified two main construction patterns, related to the nature of the spreadsheet: catalogs of objects – e.g., specimens in a museum – and event related spreadsheets, e.g., a log of samples collected in the field. We will further refer to these spreadsheet natures as catalog and event.


In order to address the significant differences among spreadsheet types, we classified each field into one of six exploratory questions (who, what, where, when, why, how) [37]. This enabled us to represent and recognize patterns at a higher level of abstraction. For example, a catalog spreadsheet has the taxonomic identification as its initial fields, classified as a what question; an event spreadsheet, on the other hand, has date and locality as its initial fields, classified as when and where questions, as illustrated in Figure 2.1.

Figure 2.1: Fields Characterization.
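The field characterization above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SciSpread implementation: the label-to-question dictionary `QUESTION_TERMS` is hypothetical, and the rule in `infer_nature` simply takes the first field whose question is recognized, returning catalog for what and event for when or where.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dictionary mapping field labels to exploratory questions.
QUESTION_TERMS: Dict[str, Set[str]] = {
    "what": {"kingdom", "phylum", "order", "family", "genus", "species"},
    "when": {"date", "year", "month", "day", "time"},
    "where": {"locality", "country", "state", "latitude", "longitude"},
    "who": {"collector", "author", "determiner"},
    "how": {"method", "technique"},
    "why": {"project", "objective"},
}


def classify_field(label: str) -> Optional[str]:
    """Return the exploratory question for a label, or None if unknown."""
    key = label.strip().lower()
    for question, terms in QUESTION_TERMS.items():
        if key in terms:
            return question
    return None


def infer_nature(labels: List[str]) -> str:
    """Guess the spreadsheet nature from its first recognized field.

    Labels that match no question (e.g. an identifier column) are skipped.
    """
    for label in labels:
        question = classify_field(label)
        if question == "what":
            return "catalog"
        if question in ("when", "where"):
            return "event"
    return "unknown"


print(infer_nature(["Phylum", "Order", "Species", "Date", "Locality"]))
print(infer_nature(["Date", "Locality", "Species", "Collector"]))
```

Both example schemas contain the labels Species, Date, and Locality, yet the sketch assigns them different natures because the arrangement of the fields differs.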

The next step involved collecting 33 more spreadsheets on the Web to compose our sample. To search for spreadsheets belonging to the biology domain, we applied domain related keywords as the criterion.

2.3.2 Hypothesis

According to the observation of these spreadsheets, we proposed the following pattern-related hypotheses:

H1: most spreadsheets are organized following the pattern of columns as fields and rows as records;

H2: in order to characterize the context [37], fields in the spreadsheets can be classified into one of the six exploratory questions;

H3: the first fields of a spreadsheet often define its purpose, e.g., catalog or event, as well as its construction pattern.

We developed a system – SciSpread – to automatically recognize schemas based on these hypotheses (see details at [12]). We found evidence, based on our hypothesis, that patterns can drive the recognition of the spreadsheet purpose in a context, to make its schema explicit and to support its semantic annotation.
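As a minimal illustration of how H2 and H3 could be operationalized, the sketch below labels each field with an exploratory question and guesses the spreadsheet nature from its first fields. It is not the SciSpread implementation (see [12]); the vocabulary `FIELD_QUESTION`, the function names and the `first_n` cut-off are assumptions introduced only for this example.

```python
# Hypothetical vocabulary mapping a field label to an exploratory question.
FIELD_QUESTION = {
    "kingdom": "what", "phylum": "what", "order": "what",
    "family": "what", "genus": "what", "species": "what",
    "date": "when", "time": "when",
    "locality": "where", "latitude": "where", "longitude": "where",
    "collector": "who", "author": "who",
}


def classify_fields(header):
    """H2: label each field with an exploratory question (None if unknown)."""
    return [FIELD_QUESTION.get(label.strip().lower()) for label in header]


def guess_nature(header, first_n=3):
    """H3: guess catalog/event from the questions answered by the first fields."""
    questions = [q for q in classify_fields(header)[:first_n] if q is not None]
    what = questions.count("what")
    when_where = questions.count("when") + questions.count("where")
    if what > when_where:
        return "catalog"
    if when_where > what:
        return "event"
    return "unknown"


catalog_guess = guess_nature(["Genus", "Species", "Collector", "Date"])
event_guess = guess_nature(["Date", "Locality", "Species"])
```

In this sketch a header starting with taxonomic fields is guessed as catalog, while one starting with date and locality is guessed as event; ties and unknown labels are deliberately left as "unknown".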

2.4 Evidences of Construction Patterns

The evidence of construction patterns was gathered in two steps. The first is an automatic interpretation performed by the SciSpread system [12], and the second is an exploratory analysis of a survey conducted with biologists.


Thus, the next subsections explain (i) the scenario that we implemented in the system and the evidence observed in it, (ii) the survey conducted with biologists in order to verify and refine the hypothesis, and (iii) a comparison of the results of (i) and (ii).

2.4.1 Evidences based on Automatic Recognition of Web Spreadsheets

Based on our preliminary assumptions presented in Section 2.3, we looked for patterns concerning: schema layout (e.g., column labels), order and grouping of spreadsheet fields, etc. A set of hypotheses – presented in the previous section – was defined and we developed an initial version of the automatic recognition system to validate these hypotheses [12].

The system was tuned to recognize all spreadsheets of this initial sample whose purpose fit our context. We then randomly collected 1,914 more spreadsheets from the Web, finding them through the Google search engine with keywords extracted from the previous spreadsheets: kingdom, phylum, order, biodiversity, species, identification key, etc. The system recognized 137 (7%) of the 1,914 spreadsheets collected. A manual analysis showed that 116 of these were correctly recognized and 21 were false positives. Even though the latter have the expected construction pattern, they fall outside the focus of our study, which is spreadsheets used for data management.

Increasing our sample size to 5,633 spreadsheets, the system recognized 7%; subsequently, increasing it to about 11,000 spreadsheets, the system recognized 10.4%, which corresponds to 1,151 spreadsheets, of which 806 were classified as catalog and 345 as event.

We selected a random subset of 1,203 spreadsheets to evaluate the precision / recall of our system. The percentage of automatic recognition of the spreadsheets in the subset was approximately the same as the larger group. Our system achieved a precision of 0.84, i.e. 84% of retrieved spreadsheets were relevant; a recall of 0.76, i.e. the system recognized 76% of all relevant spreadsheets; and an F-measure of 0.8. The accuracy was 93% and the specificity 95%, i.e. among all spreadsheets that the system classified as not relevant, 95% were in fact not relevant.
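The reported F-measure can be checked from the reported precision and recall. The short sketch below assumes the usual balanced F-measure (the harmonic mean of precision and recall); the function name is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values reported above: precision 0.84 and recall 0.76.
f1 = f_measure(0.84, 0.76)  # about 0.798, reported as 0.8
```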

The recognition rate of approximately 10.4% of the spreadsheets must consider that they were collected through a Web search tool. According to Venetis et al. [66], these search tools treat tabular structures like any piece of text, without considering the implicit semantics of their organization and thus causing imprecision in the search results. We further show an analysis of the data extracted from spreadsheets.

2.4.2 Exploratory Analysis based on a Survey conducted with Biologists

The survey was answered by 44 biologists: 36 from Brazil, 4 from France, 3 from the USA and 1 from each of Colombia, Denmark and England.


Their distribution by main activity is: 11% are professors, 23% are researchers, 5% work in companies, 20% are post-doctoral students, 25% are PhD students and 17% are master's students. Of these, 79.5% work with spreadsheets always or often, 14% work with them sometimes and 2.5% never use spreadsheets.

We developed an online form with ten questions (available at goo.gl/b1iEvl), mainly to analyze:

• How do biologists organize their specimens’ catalogs or event records?

• How does the arrangement of fields influence the spreadsheet purpose?

• Which fields are identifiers according to the spreadsheet purpose?

The questions were divided into two blocks. The first block aimed to identify the vocabulary that describes catalog and event spreadsheets. To that end, the biologists chose 8 elements from a list of 17 (author, altitude, collector, common name, date, field number, group, latitude, life stage, locality, longitude, museum, note, species, source, taxonomy identification, time, other), ranking them by relevance.

In the second block, we presented 3 spreadsheet images and asked the biologists to classify each of them as: catalog, event, both types, neither or other. If the option was “other”, an extra text field had to be filled in with a description of the type that best fit the image.

Compared to the spreadsheets retrieved from the Web, the characteristics analyzed in the survey indicated a more specialized set of spreadsheets, i.e. some spreadsheet types retrieved from the Web and classified by the system as catalog or event cannot be related to a specific context such as a museum catalog or field sampling for research. The results of the survey therefore refined our hypothesis about construction patterns; in future work we will enlarge the sample with more biologists.

In the following subsection, we compare the results obtained from the automatic recognition of Web spreadsheets by our SciSpread system, introduced in Section 2.4.1, with those obtained from the survey introduced in this section.

2.4.3 Comparative Results: SciSpread x Survey

Pattern for Schema Location: Our SciSpread system identifies the schema of a sheet guided by the adopted vocabulary, which varies according to the type of the spreadsheet: catalog or event. The scatter chart in Figure 2.2 plots, for each spreadsheet, the percentage of its terms recognized in a vocabulary against the row in which they were recognized. We observe that the spreadsheets' schemas are concentrated in the initial lines: most of the terms are located there, and the percentage of matching terms per line decreases exponentially as we move away from them. Therefore, there is a tendency to position schemas at the top, followed by their respective instances.
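The idea of locating the schema through vocabulary matches per row can be sketched as follows. This is only an illustration under our own assumptions: the function names, the `max_rows` limit and the rule of picking the topmost best-matching row are hypothetical and are not taken from SciSpread.

```python
def match_ratio(row, vocabulary):
    """Fraction of the non-empty cells of a row whose text is in the vocabulary."""
    cells = [c.strip().lower() for c in row if c and c.strip()]
    if not cells:
        return 0.0
    return sum(c in vocabulary for c in cells) / len(cells)


def locate_schema(rows, vocabulary, max_rows=20):
    """Return (row index, ratio) of the topmost row that best matches the vocabulary."""
    ratios = [match_ratio(row, vocabulary) for row in rows[:max_rows]]
    if not ratios:
        return None, 0.0
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return best, ratios[best]


sheet = [
    ["Survey 2010", "", ""],
    ["Date", "Locality", "Species"],
    ["2010-01-02", "Campinas", "Hyla faber"],
]
schema_row, ratio = locate_schema(sheet, {"date", "locality", "species"})
```

Here the title row and the instance row contain no vocabulary terms, so the second row (index 1) is selected as the schema.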

Figure 2.2: Terms by schema of initial lines

Even though we did not design a specific experiment to validate this hypothesis in the Survey, we assumed this pattern in its examples, when we asked biologists to recognize spreadsheet images. No biologist reported difficulty in recognizing the schema, and the answers indicate that they successfully recognized it.

Predominance of Terms and Spatial Distribution: In this stage of the analysis, we verified how much the predominant terms and their disposition in the schema can indicate the spreadsheet nature: catalog or event (see the explanation of these natures in Section 2.3.1). To compare the proportions and positions of the fields in spreadsheets of the catalog type, we present radar charts in Figure 2.3 (SciSpread) and Figure 2.4 (Survey).

Figure 2.3: SciSpread - Proportions among fields of catalog spreadsheets.

Figure 2.4: Survey - Proportions among fields of catalog spreadsheets.

While fields and their positions were automatically recognized in SciSpread, in the Survey we asked the respondents to select and sort fields for a catalog spreadsheet schema. The schema fields were grouped into the six exploratory questions and weighted according to their position in the schema: the weight is one for the leftmost field and is halved for each field position towards the right.
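The positional weighting described above can be written as a weight of (1/2)^i for the field at position i, counting from zero at the leftmost field. The sketch below accumulates these weights per exploratory question; normalizing the totals into proportions is our assumption about how the radar charts were scaled, and the function name is hypothetical.

```python
from collections import defaultdict

QUESTIONS = ("who", "what", "where", "when", "why", "how")


def weighted_proportions(field_questions):
    """Weight each field by 0.5 ** position and return the share of each question.

    field_questions lists, from left to right, the question answered by each
    field (None for unclassified fields, which still occupy a position).
    """
    totals = defaultdict(float)
    for position, question in enumerate(field_questions):
        if question is not None:
            totals[question] += 0.5 ** position
    grand_total = sum(totals.values())
    return {q: (totals[q] / grand_total if grand_total else 0.0) for q in QUESTIONS}


# Weights 1, 0.5, 0.25 and 0.125: "what" receives 1.5 out of 1.875.
shares = weighted_proportions(["what", "what", "where", "when"])
```

For this example schema the "what" question receives a share of 0.8, which reflects both the number of "what" fields and their leftmost positions.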

In catalog spreadsheets, we observed that the two charts differ little and both show a strong tendency towards “what” questions, confirming that spreadsheets recognized as catalog tend to have many fields that answer the “what” question in the initial positions of the schema. The quantities of the other five questions were not significant in the SciSpread spreadsheets and were somewhat more significant in the Survey spreadsheets, with emphasis on “who” and “where” fields. This delineates a pattern for catalog spreadsheets, which tend to have more fields to identify and detail objects (specimens in this case) in the beginning.

We attribute the increase in the provenance questions (who, where and when) to the specialization of the spreadsheets in the survey: while we have no control over the context of the SciSpread spreadsheets, and it is not possible to attest the purpose of its catalog spreadsheets, the survey spreadsheets were clearly directed to museums, which require stricter control of provenance.

Following the same approach, in Figure 2.5 (SciSpread) and Figure 2.6 (Survey) we show the proportions of fields in event spreadsheets. We can note that there are differences among the proportions of the fields. As in the previous case, our interpretation is that the differences are due to the specialization in the survey spreadsheets.

Figure 2.5: SciSpread - Proportions among fields of event spreadsheets.

Figure 2.6: Survey - Proportions among fields of event spreadsheets.

While in the SciSpread spreadsheets the provenance fields (mostly the “when” fields but also the “who” fields) are highly predominant and appear in the beginning, in the survey spreadsheets the difference compared to “what” fields is less remarkable, but still exists. The predominance of provenance fields, comparing event to catalog spreadsheets, is enough in both cases (SciSpread and survey) to distinguish their nature. In the survey, we had the opportunity to discuss the relationship among fields related to the “when” and “where” questions. For biologists, these fields usually appear together and there is an extra field, usually named field number, which works as an index for a “when” and “where” specification. The relationship among these fields/questions can be considered as a sub-pattern in an automatic analysis.

The previous charts show that it is possible to distinguish between catalog and event spreadsheets, but we still need to identify which words are most used by biologists. In a previous paper [12] we presented detailed observations about the distribution of fields identified by SciSpread; here we present values based on the first block of survey questions.

Thus, the bar chart in Figure 2.7 shows which fields are most used by biologists in each spreadsheet category. The vertical axis gives the number of biologists who answered the question and the horizontal axis gives the fields they chose for catalog or event spreadsheets.

Figure 2.7: Comparative terms quantities between spreadsheets category

Using the result of the previous bar chart, Figure 2.8 shows the distribution of these fields across the columns, i.e. if biologists had to create a new catalog or event spreadsheet, how would they arrange these fields in columns? This distribution indicates that, following the Western left-to-right convention, the tendency is to place the most important items on the left. We infer that species, taxonomy identification and museum can be identifiers of catalog spreadsheets, and that the fields date, species, locality and field number can be the identifiers of event spreadsheets (sampling spreadsheets), because in both types the respective fields appear most often in the first columns.

Figure 2.8: Comparative terms location between spreadsheets nature

After considering the vocabulary and the structural organization of the fields (first block of questions), it is important to analyze how biologists classify the spreadsheets (second block of questions). For the spreadsheet shown in Figure 2.9, 48% of the biologists considered it a catalog. The explanation given for this decision is the detailed information and the complete taxonomy identification.

Figure 2.9: Spreadsheet 1 - used in the Survey

In the second image, we copied the same spreadsheet and we changed the order of the fields, aiming at identifying if the arrangement of the fields is important to characterize the nature of a spreadsheet. The modified spreadsheet is shown in Figure 2.10.

Figure 2.10: Spreadsheet 2 - used in the Survey

The charts in Figure 2.11 show, on the vertical axis, the percentage of biologists who gave each answer and, on the horizontal axis, the possible choices for classifying the spreadsheet images. According to these charts, 40% of the biologists still classified the second spreadsheet as catalog. However, even though most biologists did not consider it an event spreadsheet, a significant number changed their opinion to event simply because we changed the arrangement of the fields.

Therefore, we consider that the order of the fields is important to decide the nature of the spreadsheet, and we add the observation that there is a common understanding among biologists that a spreadsheet with many fields and complete information is more related to a catalog than to events. On the other hand, if the information is synthetic and objective, the spreadsheet is about events. This differentiation is due to the context: biologists do not have time to detail information and classify specimens during collection in the field, so the conditions limit the answers.

Figure 2.11: Comparative results about spreadsheets classification

Biologists classified the third spreadsheet, shown in Figure 2.12, as event. According to our hypothesis, considering only the fields, this spreadsheet should be classified as catalog. We concluded that the biologists classified it as event despite the lack of date and location fields because the number of fields is limited, which characterizes a sample collection in the field.

This information is important since, in the automatic interpretation of this type of spreadsheet, we were not considering the relevance of the level of detail of the fields. Another point to consider in an automatic interpretation of catalog spreadsheets is that they never start with a “common name” field.

Figure 2.12: Spreadsheet 3 - used in the survey

Through these comparative results, we verified that the schema usually appears at the top of spreadsheets; that the arrangement of fields influences the characterization of a spreadsheet; and that the identifiers of catalog spreadsheets can be a taxonomy identifier or a museum number, while those of event spreadsheets can be date plus locality information or a field number.

2.5 Representation of Construction Patterns

This section details the model proposed in this paper to capture and represent construction patterns in spreadsheets in a form that can be interpreted and used by machines. The characteristics of this model were based on the field observations reported in the previous section.


As detailed before, the schema recognition step involves analyzing patterns used by users to organize their data, which we argue to be strongly influenced by the spreadsheet nature inside a domain. Departing from our spreadsheet analysis, we produced a systematic categorization of construction patterns observed in biology spreadsheets, which supported the design of a process to recognize these patterns. Our process to recognize construction patterns, and consequently the spreadsheet nature, is focused on schema recognition.

Our representation approach considers that there is a latent conceptual model hidden in each spreadsheet, which authors express through patterns. How authors conceive models and transform them into spreadsheets is highly influenced by shared practices of the context in which the author is inserted. Our analysis shows that a catalog spreadsheet contains taxonomic information of a specimen (“what” question) concentrated in the initial positions, defining their role as identifiers. On the other hand, an event spreadsheet contains temporal and location fields in the initial positions.

A visual analysis of the spreadsheet structure gives us directions of how the pattern is organized, e.g., schema up / instances down; identifier on the left, as a series of progressively specialized taxonomic references. To express these characteristics of the pattern in a computer interpretable representation, we represent them as qualifiers [12] identified by the prefix "q", and they are categorized as follows:

Positional qualifier – characterizes an element in a pattern according to its absolute position within a higher level element. There are four positional qualifiers: left (q←), right (q→), top (q↑) and bottom (q↓).

Order qualifier (q#) – characterizes an element in a pattern according to its relative order regarding its neighboring elements.

Label qualifier (q@) – indicates that the label characterizes the element. In the example, the label species identifies that a column refers to species.

Data type qualifier (q$) – characterizes the predominance of one data type in the instances of a given property.

Range qualifier – specifies whether neighbor elements have generalization / specialization relations. The qualifier (q>) indicates that the left one is more general than the right one and (q<) the opposite.

Classified qualifier – characterizes instances of a given property that are arranged in ascending order (q+) or descending order (q-).

Redundancy qualifier (q=) – characterizes redundancy of information in instances of a property. Such redundancy is typical, for example, in non-normalized relations among properties and composite properties, in which the values of a sub-property are broader or more generic than those of a related sub-property; usually the value of one sub-property embraces the value of the other.
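To make the role of qualifiers more concrete, the sketch below checks a few of them against a small table. It is a simplified illustration under our own assumptions, not the representation used in [12]: the class `PropertyPattern`, the function `satisfied_qualifiers` and the string codes are hypothetical, "predominance" of a data type is read as a simple majority, and redundancy is approximated by repeated values in the column.

```python
from dataclasses import dataclass

# Illustrative codes for a subset of the qualifiers listed above.
LEFT, LABEL, DATATYPE, REDUNDANCY = "q-left", "q@", "q$", "q="


@dataclass
class PropertyPattern:
    name: str
    labels: frozenset        # column labels accepted for this property
    qualifiers: frozenset    # qualifiers expected by the pattern
    datatype: type = str     # predominant data type expected in the instances


def satisfied_qualifiers(pattern, header, rows):
    """Return the expected qualifiers of `pattern` that hold in this table."""
    normalized = [h.strip().lower() for h in header]
    col = next((i for i, h in enumerate(normalized) if h in pattern.labels), None)
    if col is None:
        return set()
    values = [row[col] for row in rows if col < len(row)]
    held = {LABEL}
    if col == 0:
        held.add(LEFT)  # the property is the leftmost field
    typed = sum(isinstance(v, pattern.datatype) for v in values)
    if values and typed > len(values) / 2:
        held.add(DATATYPE)  # one data type predominates
    if len(set(values)) < len(values):
        held.add(REDUNDANCY)  # repeated values, a proxy for redundancy
    return held & set(pattern.qualifiers)


kingdom = PropertyPattern(
    "kingdom", frozenset({"kingdom"}),
    frozenset({LEFT, LABEL, DATATYPE, REDUNDANCY}),
)
found = satisfied_qualifiers(
    kingdom,
    ["Kingdom", "Genus"],
    [["Animalia", "Hyla"], ["Animalia", "Bufo"]],
)
```

A recognizer could compare the set of satisfied qualifiers with the set expected by a pattern to decide how well a spreadsheet fits it.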

Besides the qualifiers, we associate the relation of elements with one of the six exploratory questions (who, what, where, when, why, how). This association supports the characterization of construction patterns at a more abstract level. For example, looking at other kinds of catalog spreadsheets, outside the biology domain, we observed that they define “what” fields as identifiers and that these appear in the leftmost position (q←).


2.5.1 Formalizing the Model to Represent Patterns

In this subsection, we present a more formal representation, to be stored in digital format and to be read and interpreted by machines. This representation takes as a starting point the conceptual model implicitly expressed through the pattern. Figure 2.13 shows the representation of the construction pattern. The model is based on the OWL Semantic Web standard. The ovals represent classes (owl:Class) and the rectangles represent properties (owl:ObjectProperty or owl:DatatypeProperty).

The root class (Specimen in the figure) is related to the spreadsheet nature; in this case, instances of the Specimen class represent the instances of the spreadsheet, which catalogs specimens. A class will have a set of applicable properties, represented by a domain edge (rdfs:domain). Properties of this model are related to fields extracted from the spreadsheet. Range edges (rdfs:range) indicate that values of a given property are instances of the indicated class. For simplicity, the diagram omits details of the OWL representation.

Figure 2.13: Conceptual model for catalog spreadsheets annotated with qualifiers

Properties in our model are annotated by the qualifiers presented in the previous section. Properties can be annotated in OWL through the owl:AnnotationProperty. In this case, annotations are objects that specify the qualifier and the pattern to which they are related. Qualifiers as annotations are depicted above and/or below the properties they qualify. A qualifier above a property indicates that it is applied to the relationship between the property and the class to which it is applied by the domain relation. For example, the qualifier q← (left positional qualifier) is represented above the identifier property, indicating that when this property is used as a field in a spreadsheet describing a Specimen, we expect it to appear in the leftmost position.

A qualifier below the property means that it applies to property values – instances in the spreadsheet. For example, the qualifier q$= below the kingdom property indicates that a specific type (string) and redundancy are observed in the values of this property in the instances.


Properties are also annotated as answering one of the six exploratory questions. These annotations are depicted in Fig. 2.13 inside brackets. There are additional concerns in the OWL model that are necessary to bridge it to the implicit spreadsheet schema, which are also represented as annotations: the order of properties and their relation with labels. This OWL representation allows us to digitally materialize building patterns of spreadsheets, to be shared by users and applications.
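As a rough sketch of how such an annotated model might be serialized, the code below emits a small Turtle fragment for one property of the Specimen class. The namespace `http://example.org/pattern#` and the annotation properties `ex:qualifier` and `ex:answers` are hypothetical names chosen for this illustration; the actual vocabulary of the model is the one depicted in Figure 2.13.

```python
PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ex": "http://example.org/pattern#",  # hypothetical namespace
}


def property_to_turtle(name, domain, qualifiers, question):
    """Serialize one property with its qualifier and 5W1H annotations."""
    lines = [
        f"ex:{name} a owl:DatatypeProperty ;",
        f"    rdfs:domain ex:{domain} ;",
    ]
    for qualifier in qualifiers:
        lines.append(f'    ex:qualifier "{qualifier}" ;')
    lines.append(f'    ex:answers "{question}" .')
    return "\n".join(lines)


def pattern_to_turtle(root, properties):
    """Serialize the root class, the annotation properties and each property."""
    header = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    body = [
        f"ex:{root} a owl:Class .",
        "ex:qualifier a owl:AnnotationProperty .",
        "ex:answers a owl:AnnotationProperty .",
    ]
    body += [property_to_turtle(domain=root, **prop) for prop in properties]
    return "\n".join(header + [""] + body)


turtle = pattern_to_turtle(
    "Specimen",
    [{"name": "kingdom", "qualifiers": ["q$", "q="], "question": "what"}],
)
```

Plain string literals are used for the qualifier annotations only to keep the example short; the text describes annotations as objects that also identify the pattern they belong to.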

2.6 Related Work

As discussed in Section 2.5, a fundamental characteristic of spreadsheets used for data management is the separation between schema and instances. The schema is presented above (q↑) or on the left (q←), and instances are below (q↓) or on the right (q→).

This observation appears in all the related work whose purpose is to recognize the implicit schema of spreadsheets. Syed et al. [64] point out that this challenge leads to a more general problem of extracting implicit schemas of data sources, including databases, spreadsheets, etc. One approach to make the semantics of spreadsheets interoperable, promoting the integration of data, is the manual association of spreadsheet fields to concepts in ontologies represented by open standards of the Semantic Web.

Han et al. [28] adopt the simplest approach to devise a schema and its respective instances, called entity-per-row [54]. In this approach, besides the schema, each row of the table describes a different entity and each column an attribute of that entity; for example, each column corresponds to an attribute (e.g., Date, Genus, Species) and each row to an event, such as the collection of a specimen. Han et al. [28] and several related works assume the entity-per-row organization to support the process of manually mapping attributes, to make them semantically interoperable. Initially, the user must indicate a cell whose column contains a field which plays the role of identifier, equivalent to the primary key of a database.
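The entity-per-row organization can be illustrated with a few lines of code. The sketch below is our own minimal example, not the tool of Han et al. [28]: given a header, the rows and the field indicated by the user as identifier, it builds one entity per row, keyed by that identifier. The function name and the sample values are hypothetical.

```python
def rows_to_entities(header, rows, id_field):
    """Build one entity per row, keyed by the value of the identifier field."""
    id_col = header.index(id_field)  # raises ValueError if the field is missing
    entities = {}
    for row in rows:
        attributes = {h: v for h, v in zip(header, row) if h != id_field}
        entities[row[id_col]] = attributes
    return entities


header = ["Field Number", "Date", "Genus", "Species"]
rows = [
    ["F01", "2010-01-02", "Hyla", "faber"],
    ["F02", "2010-01-03", "Bufo", "ictericus"],
]
entities = rows_to_entities(header, rows, "Field Number")
```

Each resulting entity could then be mapped, attribute by attribute, to concepts of an ontology.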

Langegger and Wöß [43] propose a similar, but more flexible, solution to map spreadsheets in an entity-per-row organization. They are able to treat hierarchies among fields, when a field is divided into sub-fields; for example, the fields Date, Latitude and Longitude refer to when and where the species was collected. Authors usually create a label spanning the entire range above these columns, e.g., labelled "Field Number", to indicate that all these fields are subdivisions of the category. This hierarchical perspective can be expressed in our model, since a property can be typed (rdfs:range) by a class, which in turn has properties related to it.

RDF has been widely adopted by related work as an output format to integrate data from multiple spreadsheets, since it is an open standard that supports syntactic and semantic interoperability. Langegger and Wöß [43] propose to access these data through SPARQL, a query language for RDF. O'Connor et al. [54] propose a similar solution using OWL.

Abraham and Erwig [2] observed that spreadsheets are widely reused but that, due to their flexibility and level of abstraction, the reuse of a spreadsheet by people outside its domain increases errors of interpretation and therefore inconsistency. Thus they propose a spreadsheet life cycle defined in two phases, development and use, in order to separate the schema from its respective instances. The schema is developed in the first phase, to be used in the second. Instances are inserted and manipulated in the second phase guided by the schema, which cannot be changed in this phase.

Another approach to address this problem is automating the semantic mapping using Linked Data. Syed et al. [64] argue that a manual process to map spreadsheets is not feasible, so they propose to automate the semantic mapping by linking existing data in the spreadsheets to concepts available in knowledge bases, such as DBpedia (dbpedia.org) and Yago (www.mpi-inf.mpg.de/yago-naga/yago/). Yago is a large knowledge base, whose data are extracted, among others, from Wikipedia and WordNet (wordnet.princeton.edu). The latter is a digital lexicon of the English language, which semantically relates words.

Among the advantages of the last approach is the fact that such bases are constantly maintained and updated by people from various parts of the world. On the other hand, the search for labels without considering their contexts can generate ambiguous connections, producing inconsistencies. Thus, there are studies that stress the importance of delimiting a scope before attempting to find links.

Venetis et al. [66] exploit the existing semantics in the tables to drive the consistent manipulation operations applicable to them. The proposal describes a system that analyzes pairs of terms heading columns and their relationship, in order to improve their semantic interpretation. The authors state that a main problem in the interpretation of tabular data is the analysis of terms independently. This paper tries to identify the scope by recognizing a construction pattern, which is related to a spreadsheet nature inside a context.

Jannach et al. [38] state that the compact and precise way in which data are presented is primarily directed at human reading and not at machine interpretation and manipulation. They propose a system to extract information from web tables, associating them to ontologies. They organize the ontologies in three groups: 1. core: concepts related to the model disassociated from a specific domain; 2. core + domain: domain concepts of a schema related to the information to be retrieved; 3. instance of ontology: domain concepts of instances. These ontologies aim at gradually linking the information to a semantic representation, directed by the user’s goal.

Among these solutions, we note that some of them address individual pieces of information inside spreadsheets, devoid of context, and others consider the importance of identifying and characterizing the context. Even though all approaches rely on construction patterns of spreadsheets, none of them proposes a model to represent, exchange, reuse and refine those patterns, which is one of the main contributions of this work.

2.7 Conclusions and Future Work

This paper presented our thesis that it is possible, from a spreadsheet structure, to recognize, map and represent how users establish construction patterns, which are reflected in the schema and data organization. One of our main contributions here is an exploratory observation of these patterns through a survey with biologists, as well as the refinement of the hypotheses for construction patterns.

Our process also involves the comparative analysis between the SciSpread system and survey results. None of the related work departs from the characterization of the underlying conceptual models and their association with construction patterns, to categorize spreadsheets according to the nature of information they represent, and to recognize them. The studies presented here have focused on the area of biodiversity. We intend to investigate their generalization to other domains of knowledge, extending this strategy to a semiotic representation.


Chapter 3 Annotating Biodiversity Spreadsheets through 5W1H based on Machine Learning

3.1 Background

Scientists use a variety of methods to collect, record, and store biological data. The applications for these data are also diverse, and several efforts are under way to aggregate, manage and disseminate data coming from different sources according to the FAIR principles [42]. As a result of these efforts, there are committees [42, 20, 65] that promote the gains obtained from sharing and reusing data through open data standards [69, 17, 67]. Brazma et al. [15] analyze various aspects of data standards aimed at making life science data interoperable. They discuss the reasons for standardizing life science data and its contributions to the advancement of knowledge in biology. However, they highlight that, despite the large number of common standards, these are not often used, which may be because the standards do not reflect scientists’ needs. For Brazma et al. [15], the secrets of successful standards include simplicity, low cost and lack of ambiguity.

Parr et al. [59] describe the challenges faced by biology informatics and consider building a semantic web from data that do not follow standards to be a hurdle. According to them, data in ad-hoc formats are created for independent analysis and in an idiosyncratic way.

In our previous work [12, 9], we observed that, although ad-hoc data formats were built for independent analysis, they follow an organization, bearing implicit patterns which can be automatically recognized and mapped to a common data standard. The goal of this research is to annotate biodiversity spreadsheet data so that they comply, to some degree, with the FAIR principles. Rather than requiring the adoption of a standard, which would force scientists to adapt to a new reality, our strategy automatically recognizes the implicit patterns behind their data and annotates them with their purpose.


3.1.1 Problem Definition

In a TDWG 2016 presentation, Donald Hobern and Andrea Hahn from GBIF talked about "A Standards Architecture for Integrating Information in Biodiversity Science" [35]. They highlighted the importance of enabling scientists to navigate along several articulated datasets, as shown in Figure 3.1, so they can start from a point of their research topic and finish at another point. For instance, from a specimen register, the scientist should be able to access its genetic data; starting from a gene sequence, associating it with the respective taxon concepts and occurrence data.

Figure 3.1: Biodiversity data grouped by purpose [35]

In this scenario, since spreadsheets store a slice of scientific data available in the world, which are mostly organized in ad-hoc formats, it is possible to identify two main challenges: (i) how to automatically recognize the implicit purpose of biodiversity data stored in spreadsheets? (ii) how to interlink biodiversity data assets according to their purpose and how to navigate through them?

This paper presents our strategy to answer the first research question: a technique to automatically recognize the biodiversity purpose of each spreadsheet. It produces an abstract representation of each spreadsheet's implicit schema based on the exploratory questions (5W1H), which is the input to a machine learning classification algorithm.

To exemplify the challenge of purpose recognition in the biodiversity domain, Figure 3.2 presents a set of five spreadsheets collected from the web, focusing on their implicit schemas, i.e., which attributes they use to record data and their respective organization. Spreadsheet A seems to reflect the same purpose as spreadsheet B, which is to record the taxonomic concepts of species, but their schemas are not identical.

Spreadsheet C contains attributes such as "Phylum, Class, Order and Family", which appear in the previous spreadsheets too; but even with a similar schema, spreadsheet C does not have the same purpose as spreadsheets A and B. The purpose of spreadsheet C is to record experimental observations.

Spreadsheet D has similar attributes to spreadsheet E, but their purposes are different. Spreadsheet D records data of species occurrences and spreadsheet E records data of species monitoring, i.e., it involves tagging and tracking animals along a period of time, which is not the case of spreadsheet D.

The automatic recognition of the biodiversity purpose behind each spreadsheet is a challenge because, as observed, there are many variations of schemas among spreadsheets that have the same purpose and similarities among the ones serving different purposes.

Figure 3.2: Set of spreadsheets used by scientists to record biodiversity data

Related work addresses the challenge of automatic spreadsheet data recognition focusing on individual attributes [28]. For example, the interpretation of spreadsheet A, following the strategy of recognizing isolated attributes, results in a mapping without a context. Such a mapping can face ambiguities and generate inconsistencies: the attribute "Class" could be associated with concepts in several contexts, not only in biology. Moreover, as mentioned before, a proper articulation of data from several spreadsheets, supporting navigation through them, requires the recognition of the purpose of each spreadsheet.

Other related work interprets pairs of attributes, instead of analyzing them in isolation [43, 44]. Thus, as shown in Figure 3.2-A, the attribute "Class" would not be analyzed alone, but related to "Phylum" and "Order". This strategy of interpreting attributes in pairs reduces ambiguity and inconsistency, reaching a better level of semantic interoperability, but the interpretation still does not recognize the purpose of each spreadsheet.
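The effect of context on ambiguity can be illustrated with a minimal sketch. It is not the method of [43, 44]; the candidate concepts, the `CANDIDATES` table, and both function names are hypothetical and exist only to show why relating "Class" to "Phylum" and "Order" narrows its interpretation.

```python
# Hypothetical candidate concepts per attribute label (illustrative only;
# not taken from any real knowledge base or from the cited work).
CANDIDATES = {
    "class": {"biology:TaxonRankClass", "education:SchoolClass", "software:OOPClass"},
    "phylum": {"biology:TaxonRankPhylum"},
    "order": {"biology:TaxonRankOrder", "commerce:PurchaseOrder"},
}


def domain(concept):
    """Return the domain prefix of a 'domain:Concept' identifier."""
    return concept.split(":")[0]


def interpret_isolated(label):
    """Interpret an attribute alone: every candidate concept remains possible."""
    return sorted(CANDIDATES.get(label.lower(), set()))


def interpret_with_neighbors(label, neighbors):
    """Keep only the candidates whose domain is shared by all neighbors."""
    candidates = CANDIDATES.get(label.lower(), set())
    shared = None
    for neighbor in neighbors:
        domains = {domain(c) for c in CANDIDATES.get(neighbor.lower(), set())}
        shared = domains if shared is None else shared & domains
    if not shared:
        # No usable context: fall back to the isolated interpretation.
        return sorted(candidates)
    return sorted(c for c in candidates if domain(c) in shared)


print(interpret_isolated("Class"))
print(interpret_with_neighbors("Class", ["Phylum", "Order"]))
```

In this toy setting the isolated interpretation keeps three competing concepts for "Class", while the neighboring attributes leave only the biological one. Neither result says anything about the purpose of the spreadsheet as a whole.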

Another strategy is to select an ontology that will conduct the interpretation process of spreadsheets. Data is then interpreted within the context defined by the ontology. This automatic mapping first requires the recognition of the implicit data semantics, which is accomplished by selecting the attributes or sets of attributes to be recognized semantically. That choice directly influences the level of semantics that can be acquired.

Figure 3.3 shows two spreadsheet schemas, which are part of a survey that we conducted [9]. Its goal was to understand how biologists organize their data. In the activity that involved these two spreadsheets, biologists were asked to choose the most representative spreadsheet header related to each kind of data. In order to analyze the influence of the order of the attributes on the interpretation of a given schema, both spreadsheets were presented with the same attributes in a different order. The spreadsheets were classified as having different purposes: the first one was categorized as species occurrence data and the second one as species collections. Strategies that consider only individual attributes cannot distinguish between the two spreadsheets.

Figure 3.3: Spreadsheets of our survey [9] analysing how the organization of attributes influences the interpretation of a spreadsheet.

The first spreadsheet started with temporal information (When) and the second one started with taxonomic information (What). This research has been working to show the importance of the arrangement of a spreadsheet schema in its interpretation, abstracting the role of each attribute (or set of attributes) according to the 5W1H. For example, we have observed that spreadsheets starting with "What" attributes are more likely to catalog things, while those starting with "When" attributes are more likely to record temporal events, like observations.
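The idea of abstracting a schema into 5W1H roles can be sketched as follows. This is only an illustration under stated assumptions: the `ROLE_BY_ATTRIBUTE` lookup, the function `abstract_schema`, and the two example headers are hypothetical and are not the attributes of Figure 3.3 nor the recognition technique itself.

```python
# Hypothetical lookup from attribute labels to 5W1H roles (illustrative only).
ROLE_BY_ATTRIBUTE = {
    "date": "When",
    "year": "When",
    "species": "What",
    "genus": "What",
    "family": "What",
    "locality": "Where",
    "latitude": "Where",
    "longitude": "Where",
    "collector": "Who",
}


def abstract_schema(header):
    """Map a header row to its sequence of 5W1H roles, merging adjacent repeats."""
    roles = []
    for label in header:
        role = ROLE_BY_ATTRIBUTE.get(label.strip().lower(), "Unknown")
        if not roles or roles[-1] != role:
            roles.append(role)
    return roles


# Two invented headers with the same attributes in a different order.
first = ["Date", "Species", "Locality", "Collector"]
second = ["Species", "Date", "Locality", "Collector"]

print(abstract_schema(first))   # role sequence starting with "When"
print(abstract_schema(second))  # role sequence starting with "What"
```

The two headers contain the same set of attributes, so an attribute-by-attribute interpretation treats them as equivalent. Their role sequences differ in the leading element, which is the kind of arrangement signal a classifier could use as a feature.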

This paper presents a strategy for automatically recognizing and annotating biodiversity spreadsheets with their purpose, based on the previous premises. It builds on our previous investigation [12, 9] and proposes a machine learning approach that exploits the schema arrangement, the categorization of spreadsheets, and the abstraction based on the 5W1H.

3.2 Related Work

Biodiversity informatics is the application of information technologies to the management, algorithmic exploration, analysis and interpretation of primary data regarding life [29]. Related work focusing on interoperability of biodiversity data is mainly divided into two strategies: (i) adoption of data standards to register data [70, 22, 47]; (ii) repositories, which allow scientists to store, manage, and share their data according to the repository's constraints [60, 40, 7, 45, 57].

There are several standards aimed at unifying the vocabulary used by biologists. For example, the Darwin Core standard (http://rs.tdwg.org/dwc/) provides a framework of terms to record biodiversity data, e.g., observation and monitoring of species.

However, Sansone et al. [60] point out that many scientists would like to share their data but struggle to fit them to a standard that does not completely represent them. Thus, they propose the ISA (Investigation, Study, Assay) framework, which is a backbone to underpin the discovery, exchange, and integration of data sets.

GBIF [57] is a web infrastructure that allows scientists and organizations to share their biodiversity data. It defines a successful distributed architecture and has become widely adopted by organizations around the globe. One of the limitations raised by [3] is that any data set in GBIF must be in the Darwin Core (DwC) standard. Like GBIF, there are other initiatives such as GenBank [8], BarCode of Life [56], and Encyclopedia of Life [58].

There are strategies to integrate biological databases, such as Spice [40], which connects autonomous taxonomic databases to achieve interoperability among them. Its architecture evolved to create a middleware between the users and the sources. Other solutions, like Bio2RDF [7] and KaBOB [45], integrate biological databases using ontologies.

Related work focused on biodiversity data usually provides strategies to underpin the interoperation of data to be created, but it does not address existing data created in ad-hoc formats, such as spreadsheets. Among existing related work in this second direction, which is not specialized in the biodiversity domain, there are initiatives to recognize existing tabular data without an explicit schema in order to make them interoperable. In this context, we highlight two main groups: recognition of spreadsheet data and recognition of data from web tables.

The level of semantics achieved in the recognition process is highly correlated with the granularity of the input elements, selected to drive the interpretation, which we call "unit of interpretation". To emphasize the importance of this unit, we further analyse related work according to it.

Han et al. [28] and Connor et al. [54] provide a manual mapping from spreadsheet attributes to RDF properties. In this approach, each isolated attribute is mapped to an RDF concept. The unit of interpretation is therefore each isolated attribute, and the output consists only of semantically richer attributes. It would be possible to achieve a higher semantic level if the user could map sets of attributes or the entire spreadsheet to richer structures like RDF classes.

Langegger and Wöß [43] go further and propose a manual method of mapping hierarchies found in spreadsheets. This approach can recognize groups of cells, transforming them into a hierarchy of semantic attributes correlated to each other, in which an upper level attribute aggregates lower level ones. For example, the attributes "latitude" and "longitude" can be hierarchically related to a higher level attribute "localization". The semantic description of this strategy is richer than that of Han et al., because it elects pairs of attributes as the unit of interpretation.
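The notion of an upper level attribute aggregating lower level ones can be pictured with a small sketch. It does not reproduce the mapping language or tooling of [43]; the `HIERARCHY` table, the function `group_row`, and the sample row values are invented for illustration.

```python
# Hypothetical hierarchy: an upper level attribute and the lower level
# attributes it aggregates (illustrative only).
HIERARCHY = {"localization": ["latitude", "longitude"]}


def group_row(row, hierarchy):
    """Nest lower level attributes of a flat row under their upper level attribute."""
    parent_of = {
        child: parent
        for parent, children in hierarchy.items()
        for child in children
    }
    grouped = {}
    for attribute, value in row.items():
        parent = parent_of.get(attribute)
        if parent is None:
            grouped[attribute] = value  # attribute without a declared parent
        else:
            grouped.setdefault(parent, {})[attribute] = value
    return grouped


# Made-up row, only to exercise the function.
row = {"species": "Puma concolor", "latitude": -22.82, "longitude": -47.07}
print(group_row(row, HIERARCHY))
```

Here "latitude" and "longitude" are no longer interpreted as unrelated columns but as parts of "localization", which is the gain in semantic description discussed above.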

Wolstencroft et al. [70] and Maguire et al. [47] achieve a semantic interoperability closer to the one that we propose here, i.e., related to the purpose of the spreadsheet. Their unit of interpretation is a set of attributes associated with ontologies. Although the mapping is manual, in both initiatives it is possible to associate cells or sets of cells with one or more concepts in ontologies. The resulting annotated spreadsheet can be reused as a template for further spreadsheets, carrying all its semantic mapping.

Syed et al. [64] and Mulwad et al. [53] consider that semantically mapping data manually is infeasible and propose to automate this process. They propose a recognition strategy to be applied to any context. They associate spreadsheet labels with concepts available in knowledge bases, mapping attributes and values found in the spreadsheet to RDF properties and values. Among the advantages of relying on public knowledge bases instead of static vocabularies is the fact that such bases are maintained and updated by people all over the world. A limitation of choosing attributes as the unit of interpretation is that it can generate ambiguous and inconsistent links.