Labtablet: multi domain laboratory notebook

Faculdade de Engenharia da Universidade do Porto

LabTablet: A multi-domain laboratory book

Ricardo Filipe Carvalho Amorim

Dissertation Report

Master in Informatics and Computing Engineering

Supervisor: Maria Cristina de Carvalho Alves Ribeiro

Co-supervisor: João Miguel Rocha da Silva

LabTablet: A multi-domain laboratory book

Ricardo Filipe Carvalho Amorim

Master in Informatics and Computing Engineering

Approved in oral examination by the committee:

Chair: Doctor Gabriel de Sousa Torcato David

External Examiner: Doctor José Carlos Ramalho

Supervisor: Doctor Maria Cristina de Carvalho Alves Ribeiro

July 28, 2014

Abstract

Research environments are commonly associated with the production of large amounts of data, in a wide diversity of formats and structures, which makes them difficult to manage. If we expect these datasets to live beyond their immediate use, researchers have to be involved in the annotation and documentation of their data, to provide the required context to those who may want to validate their findings or reuse the data. Part of the context of the data includes the conditions in which they were collected, which in many cases are registered in traditional media such as laboratory books. Although these tools are very popular with scientists due to their simplicity and flexibility, they lack the systematic and standardized approach that a digital tool can provide. Moreover, the metadata registered in laboratory books is very hard to transfer to a research data management workflow.

The growing concerns with methodologies to preserve research data led to the development of repository software supporting the persistent identification and the association of metadata to datasets. The deposit workflows implemented by repository platforms rely on a collaboration between the research team and a data manager, or curator, to validate and upload the metadata. However, this workflow for research data tends to require more resources than the research team can afford, at a stage where the research results have already been delivered. Changes in the research team, due to the project life cycles, combined with the lack of curation resources can contribute to the loss of both data and metadata.

The proposed LabTablet application fits into a workflow where researchers are involved in the creation of metadata from the start of the process. The overall goal is to make data ready for deposit prior to publication of results, so that researchers may even cite the data in the corresponding publications. As metadata is hard to generate, the goal of LabTablet is to automatically collect data that is available in the research environment, making it available for the researchers to validate and eventually associate to the datasets.

This application takes advantage of the device’s built-in capabilities to automatically fill metadata records, and can gather additional records while metadata is being produced. Furthermore, as one of its objectives is to gather metadata regardless of the research domain, LabTablet allows its users to fully customize the available descriptors, as well as to create associations that suggest a descriptor based on the record’s context. The application was developed in collaboration with two researchers who provided feedback and evaluated the implementation, leading to its refinement. Following the definition of an experimental data management workflow, a survey of the existing research data repositories was carried out, comparing key aspects of these solutions such as their architecture, their interoperability and their compliance with metadata and harvesting standards.

Summary

Nowadays, the information generated in laboratory environments is often associated with large datasets, with different formats and varied structures, which hinders their later analysis. For these datasets, it is crucial that researchers provide additional data, since it supports their contextualization for access by external agents. This additional information (the conditions in which the data was collected, for example) is recorded on conventional media such as laboratory notebooks. Although popular among scientists, laboratory notebooks do not encourage the use of structures for annotating data, and they reflect a personal perspective on which information to include in the contextualization.

On the other hand, the growing concern with preserving this kind of data has been driving the emergence of scientific repositories that integrate support for metadata as a form of contextualization. Following these tools, the current data preservation process was established; its final step consists of a close collaboration between the team of scientists and a data validation expert who assesses the information to be inserted in the repository. However, this workflow for preserving scientific data demands considerable availability from the team that authored the data, at a stage when the team may already be dedicated to other projects and the publication has already been presented, so the connection with the data is lost. Since the process is traditionally slow, and laboratory notebooks are particularly vulnerable to deterioration and the consequent loss of data, the difference between improving the process and keeping its current state can mean the difference between having high-quality metadata or not.

Validating the context as soon as it is provided is the proposal made to make datasets available for consultation and citation while the research is still under way. To this end, the well-known, flexible and affordable tablets can serve as a facilitator to include and validate important metadata throughout the research process and keep it synchronized with a scientific data repository.

Acknowledgements

First and foremost, I would like to thank my supervisors, Professor Cristina Ribeiro and PhD student João Rocha da Silva, who fully supported the development of this thesis and allowed me to maintain my autonomy throughout its stages. I also thank João Castro for his patience in giving me feedback on my papers before their review, and both Luís Fonseca and Sara Morais for being lifetime companions, always available to have a good laugh. Testing and refining my implementations would not have been possible without the contribution of Ângela Lomba, João Honrado and José Luís Moreira, who were readily available to provide insights on their work environment and challenges.

Last but not least, I would like to express my deepest gratitude to both my mother and my sister, who supported me through these years and helped me to successfully achieve one of my lifetime objectives.

“I am nothing. I shall never be anything. I cannot wish to be anything. Apart from that, I have in me all the dreams of the world.”

Contents

1 Introduction
  1.1 Creation of laboratory data
  1.2 Preservation as a solution
  1.3 LabTablet for integration and automation
  1.4 Document structure
2 Data management in research environments
  2.1 Metadata: a key for long term preservation
    2.1.1 Established Metadata schemas
    2.1.2 Generic metadata schemas
    2.1.3 Specialized metadata standards
    2.1.4 Application profiles
  2.2 OAIS reference model
    2.2.1 The OAIS environment
    2.2.2 Packaging
    2.2.3 OAIS Activities
  2.3 OAI-PMH for harvesting metadata from repositories
    2.3.1 OAI-PMH concepts
  2.4 Preservation solutions
    2.4.1 Research data repositories
    2.4.2 Open-source data repository frameworks
    2.4.3 Repository directories
    2.4.4 Ongoing development of platforms
  2.5 Conclusions
3 Approaches on digital notes
  3.1 Multi-purpose note taking applications
    3.1.1 The OneNote and SharePoint® environments
    3.1.2 Evernote: an extensive platform
    3.1.3 The Google® environment
    3.1.4 Platform comparison
  3.2 Laboratory domain approach
    3.2.1 Electronic Laboratory Notebooks
    3.2.2 LabArchives
    3.2.3 Collaboratory
    3.2.4 eCat
    3.2.5 Accelrys
4 The LabTablet solution
  4.1 Objectives
  4.2 Defining the environment
    4.2.1 Dendro, a staging area for data
    4.2.2 General-purpose data repositories
  4.3 The application’s architecture
    4.3.1 Recording metadata
    4.3.2 Exporting metadata
    4.3.3 A walkthrough of a typical LabTablet usage scenario
  4.4 The resulting workflow
5 Evaluating LabTablet
  5.1 The research groups
    5.1.1 Research areas
    5.1.2 Data management practices
  5.2 Evaluation results
6 Conclusions
  6.1 The importance of data preservation
  6.2 LabTablet environment
  6.3 Future work
A Minute meeting, Prof. José Luís Moreira

List of Figures

1.1 The high-level model of the research data management workflow
1.2 The high-level model of the proposed solution
1.3 Solution integration within the data management workflow
2.1 OAIS functional model representation
2.2 Information packages within the OAIS protocol
2.3 Workflow within OAIS model, including the ingestion stage. Image adapted from [PAN98]
2.4 High level representation of the OAI-PMH structure
2.5 Distribution of the main data repositories, provided by OpenDOAR platform
2.6 Search results for ecology domain repositories
2.7 UPBox and DataNotes architecture [RBG+13]
2.8 Dendro interface, showing the description of a dataset
3.1 OnePoint: Integration of the two platforms and sharing environment
3.2 EMSL collaboratory web-based interface with support for 3D protein structure representations
4.1 Overview of the research workflow
4.2 The main interface screens of LabTablet
4.3 View to associate file extensions with a selected descriptor
4.4 First two stages of the upload
4.6 Result of uploading metadata records
4.7 Overview of the planned data management workflow
5.1 Overview of the planned evaluation tasks

List of Tables

2.1 MARC21 available ranges of fields
2.2 Example of some translations between the two schemas
2.3 Most common metadata schemas classification
2.4 Followed standards for OAIS SIP package
2.5 Number of datasets per repository. Information retrieved from OpenDOAR
2.6 Statistical data about research data repositories search tools
2.7 Repositories assessment
3.1 Classification of the popular note taking and file sharing applications
3.2 Results of the assessment on existing applications
4.1 Overview of the evaluated features
5.1 Feedback given by Researcher A, from the chemical engineering domain
5.2 Results of the assessment with the CIBIO researcher

Abbreviations

AIP Archival Information Package
API Application Programming Interface
CKAN Comprehensive Knowledge Archive Network
DC Dublin Core
DCMI Dublin Core Metadata Initiative
DIP Dissemination Information Package
DOI Digital Object Identifier
ELN Electronic Laboratory Notebook
ISSN International Standard Serial Number
LOM Learning Object Metadata
MARC Machine-Readable Cataloging
METS Metadata Encoding and Transmission Standard
MODS Metadata Object Description Schema
PDI Preservation Description Information
OAI Open Archives Initiative
OAI-PMH Open Archives Initiative - Protocol for Metadata Harvesting
OAIS Open Archival Information System

Chapter 1

Introduction

The growing concerns about long-term data preservation have promoted the emergence of new approaches to address the constraints related to this matter. The main challenge is to ensure that the digitally stored data of today’s research institutions remains accessible and survives the long-term changes that may happen during its preservation, regardless of their nature.

With the increasing complexity and amount of research data in mind, improvements in the preservation process have resulted in new methodologies, standards and tools. These are intended to facilitate the creation of an environment of collaborative research, while also ensuring continuous investment in richer and better documented data.

1.1 Creation of laboratory data

Research data is characterized by datasets with very specialized information that requires a high-quality set of contextual information to enable an external observer to correctly interpret its contents. These datasets often comprise several files from different sources, reflecting the diversity of devices available within the laboratory. Pharmaceutical organizations are an example of how datasets belonging to the same domain can require different information to correctly contextualize them, and show the effort needed to provide a tool that manages such data while guaranteeing compatibility with all the domains [Slo10,BGW08b]. Within this domain, different research projects may draw on different sources of data such as molecular representations, spreadsheets, voice or video recordings as well as fMRI and PET scans [BGW08b].

The normal workflow of a laboratory researcher starts with a project and an associated team capable of using the available tools to produce data that will later support the conclusions of the studies performed. Although the environment may be stable, the collected data may come from different media and be stored in different formats, and the meaning of the generated data may only be fully available to its creators. The lack of documentation makes it hard to understand the data should its creator leave the team. The dynamic state of research teams increases the risk of losing the meaning of this data as soon as the team moves to another project.

Laboratory books, usually common paper notebooks, serve as the means to collect metadata such as the conditions in which the data was collected (temperature, humidity and other relevant indicators), which then allow a better understanding of the dataset’s contents. This vital information can then be used to document the dataset and be associated with it for future reference [Kum13]. The amount of metadata that can be registered with the dataset depends on the context and environment of the research, but can go from a few lines about the author and the creation date up to thousands of descriptors about every related indicator [GWB+12]. This shows how diverse research data can be, which directly affects the kind of context needed for an improved interpretation. Descriptors capturing the context can be cumbersome to manage, and in some cases only the ones considered relevant by the researcher are provided, whereas the rest, some with great potential to contextualize the dataset, are left behind. Furthermore, this notion of required or relevant metadata, together with ad hoc description practices, may produce unstructured contextual data that in many cases can only be understood by its creator, since no standards or guidelines to specify such information are provided.

1.2 Preservation as a solution

To respond to the constant demands of accessing data, research data management has come to represent a concern within organizations and research institutions, not only to ensure their long-term availability, but also to build a strong knowledge base always available for reference.

An important aspect to ensure the reproducibility of research results is the interpretation of data [CDS+07]. Although understanding data from a graphical representation (images, charts, graphs and other representations) can be a trivial task when accompanied by an explanation of its meaning, the underlying raw data often lacks the contextualization that would help in understanding it. This means that data context can be systematically captured in order to obtain a higher-quality dataset that third parties can use either to assess the conclusions drawn or to reproduce the testing environment and reuse the data. Other advantages of this availability relate to being able to cite and examine datasets along with publications.

Aiming to provide a solution to such needs, research data management is assuming a higher importance in the research workflow. Figure 1.1 presents the current workflow of research data along with the corresponding publication. This workflow is a high-level model of how the products of research activity are prepared and deposited in a repository to make them accessible to external entities for exploration and referencing. During the first stage (Area 1), the research team is focused on developing their work, and the produced data is kept on temporary storage solutions such as the laboratory computers, and can also be represented in many formats and follow different structures [SRL11]. During this stage, researchers make use of their laboratory notebooks to annotate relevant information about the conditions in which the data was produced. The second stage begins when the research results have been obtained, and the team focuses on publishing them. After this publication is accepted (2), institutional repositories store it and, if no restrictions apply, make it available to the research community. After this, the research team ends the study and is likely to get involved in other projects. Depending on the institution, the preservation of the data (3) may also be a concern. In these cases, the research team is still needed to provide a basic description of their data (4). Besides providing the basic dataset documentation, the team can be actively involved in providing domain-specific metadata, together with the generic elements added in the first place. The whole process may require the collaboration of an individual who ensures the correct specification of the metadata: the curator. This collaboration can improve the overall quality of the dataset descriptions, by associating them with domain-specific metadata and thus providing a better context for their interpretation.

[Figure: workflow diagram with the labels Data Production, Publication, Institutional Repository, Curation & Deposit, Research data repository, Community and +metadata; stages numbered 1 to 4]

Figure 1.1: The high-level model of the research data management workflow

This workflow ensures that, for those datasets where it was possible to collaborate with the researchers, the additional information included will be of greater value in the future.

1.3 LabTablet for integration and automation

To validate both the metadata and its structure, the curation stage often involves the researchers and the curator. Nevertheless, the dynamic environment associated with the research domain poses a serious risk of producing only basic documentation for datasets, or even of losing the metadata associated with them [RBG+13].

The key limitation of the current workflow of research data is that dealing with the data management constraints is an a posteriori stage [RBG+13]. After the publication is completed, the researcher does not necessarily keep their association with that dataset and often moves on to other projects. Since the curation process can take some time to start, the information stored in the laboratory books of the concluded project can be lost. In such cases, simply speeding up the process can be the difference between a high-quality dataset and a poorly documented one, which is very likely to be lost [PG12]. Involving researchers in the production of high-quality metadata from the early stages of their research can lead them to acknowledge the importance of such contextual information when preserving their datasets, and thus to play an effective role in this workflow.

Several alternatives to the traditional laboratory books have emerged, making use of new technologies available at the laboratories such as computers with Internet connectivity. There are also examples of using tablets to completely replace the laboratory books and make use of touch interfaces to ease the collection of research-related personal notes.

[Figure: diagram with the labels Research Team, Data Creation & Description, Storage & Exportation, Publication, Meta, Search engines, External Projects, External repositories and Research community; stages numbered 1 and 2]

Figure 1.2: The high-level model of the proposed solution

LabTablet aims to bring the process of collecting relevant documentation to the early steps of the research activity. By serving as a laboratory book itself, and rethinking the whole workflow, this approach intends to produce well documented datasets within a shorter time span. Figure 1.2 shows the proposed paradigm for the preservation and documentation of research datasets. Along with the production of research data, researchers are able to describe it (1), gathering diverse domain-level metadata. The application is also able to validate both the structure and the representation of the description, making use of established standards to do so. After this production and description, researchers are able to deposit the whole contents, including the associated metadata records, in a repository. This repository (2) is then responsible for making these contents available to other platforms and users. These external entities can be other repositories, research data indexers or other projects looking for data to reuse. Bringing the validation of records to an earlier point in the workflow improves the chances of this data being contextualized with domain-level metadata, as this stage is seamlessly merged with the research process.

In the end, as shown in Figure 1.3, this solution is intended to be fully integrated within the data management workflow and to capture valuable metadata records during the production stage, and even before data is created. In a constantly changing environment, the collected metadata can be synced with an intermediate platform and, once the research ends, the created dataset can be exported to a public repository along with the associated metadata.

[Figure: diagram with the labels Production, Intermediate platform, Publication, Meta Data, Preserve, Public Repository, Share and Cite]

Figure 1.3: Solution integration within the data management workflow

1.4 Document structure

This document is organized as follows: Chapter 2 provides a survey on data preservation, from the importance of metadata and its standardization (to obtain the best interoperability and integration between systems) to the different solutions to store and preserve research and organizational data. Sections 2.4.1 and 2.4.2 provide an assessment of the currently available solutions for data repositories, regarding their business logic, affordability, ownership and other important aspects when choosing a platform to manage research data.

Chapter 3 presents the currently available applications for note-taking activities. These applications offer a variety of functionalities and often integrate seamlessly with digital devices such as smartphones and tablets. In this chapter the applications are divided according to their purpose, from generic tools for notes and reminders, to dedicated applications used to collect and store research information and personal notes from the researchers.

Chapter 4 presents LabTablet, an approach to integrate the laboratory book into a data preservation workflow. The overall architecture of the proposed solution as well as the technological choices are presented, and its integration in a research data management workflow is detailed.

Chapter 5 describes the evaluation of the platform and its refinement based on the feedback of two researchers from two different research domains. Also within this chapter, the set of tasks that were used to assess the implemented interfaces is summarized along with the observations made by the researchers in each case.

Chapter 6 presents an evaluation of the completed work, a reflection on the satisfaction of the original requirements, as well as some final remarks on this subject and an outline of the future work.

Chapter 2

Data management in research environments

Research institutions are constantly producing data and many are concerned about supporting the long-term preservation of such assets. Although this data is very commonly already stored on digital media, it is often not considered to be sufficiently described and structured to be cited and preserved for future reference. In this context, long term means a period long enough for there to be concern about the impact of changing technologies, such as new data formats, or even of changing user communities [PAN98]. These aspects can change the way both data and metadata are retrieved and read.

Libraries, research laboratories and universities are some examples where the archived objects (books, reports, scientific papers and corporate documents) have high value for future referencing, thus driving projects that aim at the large-scale, long-term data management of such materials.

While in some cases no additional information is needed to properly understand the contents of the files, the majority of the stored objects have particular characteristics. Characteristics such as the objects’ provenance and research topic may benefit from additional contextual information, provided that the author or a person with the necessary domain knowledge is available to supply it.

Adequate structuring of the metadata for the different datasets opens a wide range of possibilities for later integration with other platforms and interpretation by external sources. In this context, where both the data and metadata need to follow standards for interchangeability between platforms and long-term environments, two main concerns emerge. On one hand, standards for metadata records consist of an established specification of which elements should be included to successfully document a specific dataset. On the other hand, integration and automation protocols rely on a set of procedures and concepts to represent processes such as ingestion, that is, the upload of data to the repository and the subsequent indexing of its contents in a search tool for faster reference and retrieval [PAN98].

2.1 Metadata: a key for long term preservation

Metadata is often defined as data about data, and it is typically represented as a set of elements that describe, explain or locate the information they refer to and may facilitate its retrieval [SM91]. An example is a library catalog, which holds for each record information about the author and other relevant data.

The inclusion of metadata to document and annotate the dataset is a key point in the contextualization process; it can be obtained using several tools and produced in a variety of contexts. In the particular case of a laboratory environment, the most common tool for metadata capture is the laboratory notebook, where researchers gather information they consider relevant to their research, such as the people who participated in data production, the conditions in which the data was collected (humidity, temperature, geographical location and others), as well as personal notes and observations [Slo10].

Metadata is classified according to its purpose into four main categories (descriptive, administrative, preservation and structural), which serve to discover, manage, control, identify and structure the data they refer to.¹

• Descriptive metadata describes the resource for purposes such as discovery and identification. It often presents information about the title, author, subject or topic, and represents relationships with other resources, such as hierarchical structure or versions. This type of metadata is commonly the most standardized and well understood across domains, and its popularity within institutions is considered to have motivated the expansion of the same standardization to other types of metadata;

• Administrative metadata is intended to help manage the object, providing technical characteristics such as provenance (changes and updates made to the object or its ownership), rights and permissions of access to use the object. This type is often found at a lower level of detail when compared with descriptive metadata;

• Preservation metadata ensures that the information contained within the resource remains accessible over a long period of time. This category is dedicated to finished or closed datasets but is also sometimes used with works in progress. Information about migration instructions is also included in this category;

• Structural metadata describes the digital resources regarding their physical and logical structure, expressing the relationships between the different component parts of an object and facilitating the navigation and presentation of such complex items. Information about pagination, chapters, editions and related metadata is an example of this kind of metadata.
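The four categories above can be pictured as a simple grouping of descriptors. The following is a minimal Python sketch, not part of LabTablet: the descriptor names, the `DESCRIPTOR_CATEGORY` mapping and the `group_by_category` helper are all hypothetical and chosen only to mirror the examples given in the list.

```python
from collections import defaultdict
from enum import Enum


class Category(Enum):
    DESCRIPTIVE = "descriptive"
    ADMINISTRATIVE = "administrative"
    PRESERVATION = "preservation"
    STRUCTURAL = "structural"


# Hypothetical mapping from descriptor name to category (illustrative only).
DESCRIPTOR_CATEGORY = {
    "title": Category.DESCRIPTIVE,
    "author": Category.DESCRIPTIVE,
    "subject": Category.DESCRIPTIVE,
    "rights": Category.ADMINISTRATIVE,
    "provenance": Category.ADMINISTRATIVE,
    "migration_instructions": Category.PRESERVATION,
    "pagination": Category.STRUCTURAL,
    "chapter": Category.STRUCTURAL,
}


def group_by_category(record):
    """Group the descriptors of a flat record by metadata category.

    Descriptors that are missing from the mapping are collected under None,
    so that free-form laboratory notes are not silently dropped.
    """
    grouped = defaultdict(dict)
    for name, value in record.items():
        grouped[DESCRIPTOR_CATEGORY.get(name)][name] = value
    return dict(grouped)


# Hypothetical record, as a researcher might note it down.
record = {
    "title": "Soil humidity measurements",
    "author": "Researcher A",
    "rights": "Restricted to the research team",
    "pagination": "12 pages",
    "temperature": "21 C",
}
grouped = group_by_category(record)
```

In this sketch the `temperature` entry ends up under `None`, which illustrates how a domain-specific descriptor falls outside a generic classification unless the schema is extended to cover it.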

When using a conventional laboratory notebook, each researcher is responsible for choosing which elements are more appropriate to document their dataset, leading to an environment where each dataset is accompanied by a personal view of what additional information is needed, mainly due to the absence of a standardized structure to represent and analyze such data.

¹ Examples of each of these categories can be found at https://www.library.cornell.edu/preservation/tutorial/metadata/table5-1.html

2.1.1 Established Metadata schemas

Metadata schemas consist of a set of metadata descriptors designed for a specific purpose or domain. There are several examples of domain-specific metadata schemas, containing relevant elements for that particular domain, and also more general approaches to provide a base level for documenting data from a broader set of domains.

Metadata granularity is the level of detail at which the managed data is described. Within a library environment, the most common unit of granularity is the book, journal or issue. Although this is considered a good unit for such requirements, this level of granularity may not be adequate for other environments. When the specialization of the managed objects can offer a greater level of information, new units of granularity emerge to fulfill those needs. In the particular case of libraries, with the acquisition of other resources, users gain access to finer levels such as chapter or article rather than the previous unit, the volume on the shelf [Han12]. To offer new granularity levels of access to their users, libraries also need to study new standards for metadata description.

2.1.2 Generic metadata schemas

Generic metadata schemas provide sets of descriptors that can be used to describe many types of resources.

Dublin Core is an example of a generic metadata schema intended to describe any resource on the web. Developed since the mid-1990s as the result of an international collaboration, Dublin Core is maintained by the Dublin Core Metadata Initiative (DCMI) and has achieved wide acceptance and standardization [fS09]. Its simplest form contains 15 elements, such as creator, contributor, description and format, which cover the most commonly used properties of an object while remaining simple enough to be used across many domains. Listing 2.1 shows the description of a publication with its title, author, publisher and date.

    <dc:title>Dave Beckett’s Home Page</dc:title>
    <dc:creator>Dave Beckett</dc:creator>
    <dc:publisher>ILRT, University of Bristol</dc:publisher>
    <dc:date>2000-06-06</dc:date>

Listing 2.1: Subset of elements from Dublin Core
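As a minimal sketch of how such a record could be produced programmatically, the following Python fragment builds the four elements of Listing 2.1 with the standard library. The enclosing `record` element and the helper name `build_dc_record` are assumptions made for illustration; Dublin Core itself does not prescribe a container element.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Return an XML string with one dc:* element per (name, value) pair.

    The enclosing <record> element is a hypothetical container, used only
    so that the fragment is well-formed XML.
    """
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


record_xml = build_dc_record([
    ("title", "Dave Beckett's Home Page"),
    ("creator", "Dave Beckett"),
    ("publisher", "ILRT, University of Bristol"),
    ("date", "2000-06-06"),
])
print(record_xml)
```

Keeping the element names in a list of pairs, rather than in a dictionary, allows repeated elements (for instance several creators), which Dublin Core permits.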

During the 1960s the idea of digitally managing information about an entire library promoted the development of the Machine-readable Cataloging (MARC) guidelines as a set of digital formats to describe cataloged items such as books, theses and other related work [GS09]. Developed at the Library of Congress, this set became widely used and was chosen as a standard among digital libraries. While being adopted as a standard and included in several projects from many institutions, MARC remained open to future integration with other projects within an interoperable environment.

During these years, several representations of the MARC principles emerged. MARC21 was the result of the harmonization between the largest communities of users of this standard. The most recent version of MARC21 includes formats for newly emerged domains such as authority records and community information. After MARC21 became widespread, and taking into account the evolution of communication and integration protocols developed for the World Wide Web, an XML schema, MARCXML, was adopted to give the MARC standards an XML representation. This adaptation allowed a much wider integration with other software packages such as aggregation platforms and search clients.

Table 2.1² shows the currently available fields within the MARC21 format for bibliographic data. The physical description fields include schedule-related events such as playing time, publication frequency and geospatial reference data.

Range      Purpose
00X        Control Fields
01X-09X    Number and Code Fields
1XX        Main Entry Fields
20X-24X    Title and Title-related Fields
25X-28X    Edition, Imprint, Etc. Fields
3XX        Physical Description, Etc. Fields
4XX        Series Statement Fields
5XX        Note Fields
6XX        Subject Access Fields
70X-75X    Added Entry Fields
76X-78X    Linking Entry Fields
80X-83X    Series Added Entry Fields
841-88X    Holdings, Location, Alternate Graphics, Etc. Fields

Table 2.1: MARC21 available ranges of fields.

The Metadata Object Description Schema (MODS)³ is an alternative schema for general documentation. MODS was announced for trial in June 2002 and is currently used by institutions such as the Columbia University Libraries and the University of Chicago Library⁴.

Table 2.2 is an excerpt of the existing mappings from MODS elements to unqualified Dublin Core, showing a possible translation between these two formats⁵.

² Further details can be obtained at http://www.loc.gov/marc/bibliographic/
³ MODS specification and outline available at http://www.loc.gov/standards/mods/
⁴ More related institutions can be consulted at http://www.loc.gov/standards/mods/registry.php
⁵ Other translations available at http://www.loc.gov/standards/mods/mods-dcsimple.html


MODS element                         Corresponding Dublin Core element
Title, TitleInfo                     Title
Abstract, Note, Table of Contents    Description
OriginInfo, Publisher                Publisher
TypeOfResource, Genre                Type
AccessCondition                      Rights

Table 2.2: Example of some translations between the two schemas
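The mapping in Table 2.2 can be read as a simple crosswalk. The sketch below is a hedged illustration only: it assumes a MODS record already flattened into a Python dictionary (a simplification, since real MODS is nested XML), the function name `mods_to_dc` is hypothetical, and the sample values are made up.

```python
# Crosswalk taken from Table 2.2: MODS element -> unqualified Dublin Core.
MODS_TO_DC = {
    "titleInfo": "title",
    "abstract": "description",
    "note": "description",
    "tableOfContents": "description",
    "originInfo": "publisher",
    "typeOfResource": "type",
    "genre": "type",
    "accessCondition": "rights",
}


def mods_to_dc(mods_record):
    """Translate a flat {MODS element: value} dict into {DC element: [values]}."""
    dc_record = {}
    for element, value in mods_record.items():
        dc_name = MODS_TO_DC.get(element)
        if dc_name is None:
            continue  # element not covered by the excerpt in Table 2.2
        dc_record.setdefault(dc_name, []).append(value)
    return dc_record


# Made-up sample values, for illustration only.
example = mods_to_dc({
    "titleInfo": "Letters",
    "abstract": "Correspondence, 1890-1900",
    "note": "Digitized from microfilm",
    "typeOfResource": "text",
})
print(example)
```

Because several MODS elements collapse into one Dublin Core element, the translation loses information and cannot be reversed, which is the usual trade-off when mapping a richer schema to a simpler one.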

Following the growing interest in metadata encoding for digital library archives, several approaches emerged to select the best elements to describe the resources. The lack of an overall framework to integrate the resulting schemas led to the creation of the Metadata Encoding and Transmission Standard (METS)⁶, which aims to promote the exchange of digital objects between repositories or between repositories and their users.

METS is designed to convey the metadata necessary for both the management of digital objects within a repository and the exchange of those objects between repositories, supporting several formats, such as text, image, video or sound files. It also addresses the lack of standardization within digital libraries and promotes the growth of digital archives with cross-referencing capabilities. As an XML schema application, METS supports referencing both internal and external objects (whether embedded within its own structure or held in external files), and is considered to be the first widely accepted standard for digital library metadata [BGW08a]. A single METS file can also represent a hierarchy that references all types of metadata: descriptive, administrative, preservation and structural. METS is intended to balance flexibility and standardization of metadata, allowing a wide range of elements to be included, but specifying which ones are available and providing guidelines on how to document an object. Listing 2.2 shows a volume divided in two chapters, the first one having two sections with pointers to the files containing the images of the pages. Each page image is referenced through the FILEID attribute of a file pointer (fptr) element.

    <structMap TYPE="PHYSICAL">
      <div ID="title.div.1" LABEL="Chapter 1">
        <div ID="title.1.div.1.1" LABEL="Section 1">
          <fptr FILEID="title1.image.1"/>
        </div>
        <div ID="title.1.div.1.2" LABEL="Section 2">
        </div>
      </div>
      <div ID="title.div.1" LABEL="Chapter 2">
        <...>
      </div>
    </structMap>

Listing 2.2: Example of a METS structural map adapted from [BGW08a]

Although it can encode proprietary descriptive and administrative metadata, the METS standard also allows the nesting of metadata elements from other schemas such as DC and MODS. Another example of how METS deals with all kinds of metadata is shown in Listing 2.3. In this particular case, the descriptive metadata is embedded in the document as a MODS⁷ record. The descriptive section could instead point to an external file containing the same definition, by using the mdRef element instead of mdWrap.

    <mets:mets>
      <mets:dmdSec ID="DMD1">
        <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">
          <mets:xmlData>
            <mods:mods version="3.1">
              <mods:titleInfo>
                <mods:title>Letters</mods:title>
              </mods:titleInfo>
              <mods:name type="personal">
                <namePart>Roger S.</namePart>
              </mods:name>
              <mods:typeOfResource>text</mods:typeOfResource>
            </mods:mods>
            <...>
    </mets:mets>

Listing 2.3: Descriptive metadata with METS
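To illustrate how a client might read descriptive metadata wrapped in this way, the sketch below parses a completed variant of Listing 2.3 with Python's standard library. The sample document adds the namespace declarations and closing tags that the excerpt omits, and the function name `embedded_titles` is hypothetical; this is one possible reading strategy, not a prescribed METS processing model.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the METS and MODS (version 3) schemas.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Completed variant of Listing 2.3 (declarations and closing tags added).
METS_DOC = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="DMD1">
    <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods version="3.1">
          <mods:titleInfo>
            <mods:title>Letters</mods:title>
          </mods:titleInfo>
          <mods:typeOfResource>text</mods:typeOfResource>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>
"""


def embedded_titles(mets_xml):
    """Return {dmdSec ID: title} for every MODS record wrapped in mdWrap."""
    root = ET.fromstring(mets_xml)
    titles = {}
    for section in root.findall("mets:dmdSec", NS):
        wrap = section.find("mets:mdWrap[@MDTYPE='MODS']", NS)
        if wrap is None:
            continue  # e.g. a section using mdRef to point to an external file
        title = wrap.find("mets:xmlData/mods:mods/mods:titleInfo/mods:title", NS)
        if title is not None:
            titles[section.get("ID")] = title.text
    return titles


print(embedded_titles(METS_DOC))
```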

The METS support community has been growing steadily since 2001, consisting mainly of university libraries, archives and museums [Fed10].

Although the creation of new schemas may seem to be an easy and sensible task, the use of standards provides better means of resource sustainability and discovery while saving costs and creating stronger communities. Table 2.3 presents a summary of the main metadata schemas and their domain of applicability.

After analyzing the available and widely used standards, it is possible to state that it can be very hard to choose a single schema to cover the needs of a given research group. This is due to the diversity of research domains to be described, but also to the different levels of detail that can be included in the description. There may not be a schema capable of dealing with several domains, but the reuse of elements from other standards, combining subsets in a larger set of descriptors, can be a better approach to describe such domains.


Metadata schema   Focus                       Description
Dublin Core       General                     Document networked resources, such as articles, publications and journals. Due to its wide range of possible applications, Dublin Core is also used to generally describe other resources such as datasets.
Darwin Core       Biology                     Geographic occurrence of species and collections
METS              Library archives            Encoding descriptive, administrative and structural metadata of digital library objects
MODS              Library archives            Bibliographic element set
VRA Core          Arts and visual culture     A categorical description of visual culture and images documenting it
NIEM              Security, law enforcement   Exchange of justice, national security, biometrics and other fields related to the United States activities
IEEE LOM          Education                   Syntax and semantics of learning object data

Table 2.3: Most common metadata schemas classification

2.1.3 Specialized metadata standards

For specific domains, the use of general purpose metadata can be very limiting and insufficient to provide comprehensive information about the collected data. When the domain allows the inclusion of more precise and detailed information, the need for a standard to describe such information may lead to the development of a dedicated metadata schema or to the reuse of existing resources from other schemas.

The Darwin Core initiative is an example of a metadata schema designed to facilitate the exchange of information about the geographic occurrence of species and biological collections. This body of standards, often abbreviated to DwC, extends Dublin Core to the biodiversity applications domain and is currently part of a larger set of vocabularies and specifications developed and maintained by Biodiversity Information Standards⁸.

    dcterms:type=PhysicalObject
    dwc:basisOfRecord=PreservedSpecimen
    dcterms:modified=2009-02-12T12:43:31
    dwc:institutionCode=MVZ
    dwc:collectionCode=Mammals
    dwc:catalogNumber=14523
    dwc:collectionID=urn:lsid:biocol.org:col:34904
    dwc:occurrenceID=urn:catalog:MVZ:Mammals:14523
    dwc:locality=Valle Limay, Estancia Rincon Grande, 48 ha area with centroid at this point
    dwc:decimalLatitude=-40.97467
    dwc:decimalLongitude=-71.0734

Listing 2.4: Excerpt of a Darwin Core document

⁸ http://www.tdwg.org/
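A record in this simple term=value form is easy to process. The following sketch, which is only an illustration, parses such lines into a dictionary and checks that the coordinates fall within valid ranges; the function names are hypothetical, and the range check is an assumption about what a consuming application might want to verify, not a rule quoted from the Darwin Core specification.

```python
def parse_simple_dwc(text):
    """Parse 'term=value' lines into a dict keyed by the prefixed term."""
    record = {}
    for line in text.strip().splitlines():
        term, _, value = line.partition("=")
        record[term.strip()] = value.strip()
    return record


def has_valid_coordinates(record):
    """Check that decimal latitude and longitude are present and in range."""
    try:
        lat = float(record["dwc:decimalLatitude"])
        lon = float(record["dwc:decimalLongitude"])
    except (KeyError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Subset of the terms in Listing 2.4.
SAMPLE = """
dwc:institutionCode=MVZ
dwc:collectionCode=Mammals
dwc:catalogNumber=14523
dwc:decimalLatitude=-40.97467
dwc:decimalLongitude=-71.0734
"""

record = parse_simple_dwc(SAMPLE)
print(record["dwc:catalogNumber"], has_valid_coordinates(record))
```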

Concerns over global loss of biodiversity demanded a quick and structured solution to access well documented data in a short term period, with information about relationships between species and their environment [GWB+12]. This documentation on the environmental variations across a long time span proved to be a key factor in the study of events such as climate change.

Another field where documentation of data is a concern is the arts and culture domain. The VRA Core⁹ was developed for visual culture and the images that document it, keeping track of information such as display location, cultural context, date and material. The excerpt shown in Listing 2.5 is part of a description of a Gothic church within the architecture field of arts and culture.

    <culturalContextSet>
      <culturalContext>French</culturalContext>
    </culturalContextSet>
    <dateSet>
      <display>begun 1194 (creation); consecrated 1260 (other)</display>
      <notes>Louis IX (Saint Louis) present for consecration in 1260.</notes>
      <date type="creation">
        <earliestDate>1194</earliestDate>
        <latestDate>1194</latestDate>

Listing 2.5: VRA Core documenting a Gothic Church

Darwin Core is an example of how demanding and specific a domain can be in terms of documenting all the produced data. Creating a new metadata schema is not a trivial task, since it needs to balance the amount of added information and clearly justify the need for each included field, while keeping in mind possible integration with other domain-level standards [CRR13].

The need for an open standard for the education domain, and for a way to find educational content, led to the creation of the IEEE Learning Object Metadata¹⁰ (IEEE LOM). A project developed within the University of Washington assessed the usability of Dublin Core as metadata for educational resources [SM01], and concluded that the initial Dublin Core set of 15 elements would be too limited to provide cross-domain network information discovery and retrieval.

A Learning Object is defined as an entity, digital or not, that can be used to support technology-based learning¹¹. Taking into account the limitations presented above for the educational domain, the IEEE Learning Object Metadata standard (IEEE LOM) enables its users to search, assess and use Learning Objects independently of the supporting learning systems, while increasing their integration with other digital platforms to provide faster means of disseminating such data. The standard is mainly aimed at educational, training and learning organizations that want to express educational content in a standardized and independent format.

⁹ Documentation and specification available at http://www.loc.gov/standards/vracore/
¹⁰ Specification available at http://ltsc.ieee.org/xsd/lomv1.0/lom.xsd

When considering other domains, several established schemas are already broadly used:

• Ecological Metadata Language: EML is mainly targeted at describing data from the ecological domain, allowing the description of both digital and non-digital objects such as maps and geographical shapes;

• Data Documentation Initiative: this initiative aims to describe resources from the statistical and social sciences. This schema also addresses constraints related to the distribution of social science metadata, mainly due to the lack of guidelines and established formats for the representation of such information in this domain;

• ThermoML: in a similar way to the previously presented schemas, the ThermoML initiative enables researchers to describe data from the thermophysical and thermochemical domains. Resources from this domain include compounds, mixtures and chemical and biochemical reactions.

The existing standards show that, despite the well-established schemas that describe resources from all domains with high-level metadata, other schemas exist to provide domain-level descriptions that would not otherwise be included [WHEm+12].

2.1.4 Application profiles

In spite of the particular requirements of each research area, creating a completely new metadata schema for each and every research domain may have a negative impact on interoperability. Standardized data descriptors are often a more reliable and convenient way to promote interoperability. Several initiatives such as the Dublin Core Metadata Initiative¹² provide a framework to select a set of elements designed to meet specific application needs. Such a set of elements, drawn from one or more namespaces and combined and optimized for a particular environment, is defined as an Application Profile [HP00].
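The idea of drawing elements from several namespaces can be sketched as a small data structure. The example below is illustrative only: the selected terms, the `local` namespace with its `specimenMaterial` descriptor, the `required` flag and the function name `check_description` are all assumptions introduced here, and do not reproduce any profile defined in the literature.

```python
# Each descriptor is identified by (namespace prefix, term). The selection
# below is illustrative only and is not a profile produced by any study.
APPLICATION_PROFILE = {
    ("dcterms", "title"): {"required": True},
    ("dcterms", "creator"): {"required": True},
    ("dcterms", "created"): {"required": False},
    ("eml", "methods"): {"required": False},
    ("local", "specimenMaterial"): {"required": False},  # hypothetical domain term
}


def check_description(description, profile=APPLICATION_PROFILE):
    """Return (missing required descriptors, descriptors outside the profile)."""
    missing = sorted(term for term, rule in profile.items()
                     if rule["required"] and term not in description)
    unknown = sorted(term for term in description if term not in profile)
    return missing, unknown


# Made-up dataset description, for illustration only.
missing, unknown = check_description({
    ("dcterms", "title"): "Fracture tests, batch 3",
    ("eml", "methods"): "Three-point bending",
    ("dc", "subject"): "mechanics",
})
print(missing, unknown)
```

Keeping the namespace prefix alongside each term makes it explicit where a descriptor comes from, which is what allows a profile to stay interoperable with the standards it borrows from.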

The methodology to develop a specific profile consists of several analysis and evaluation stages and commonly involves collaboration with a community to assess its data requirements. Profiles are usually developed together with documentation plans to promote the role of the required information within the research teams [CRR13]. The key advantages of using application profiles lie in avoiding strict metadata schemas, which often carry unnecessary descriptors, and in giving access to a more precise and optimized profile for the selected domain. Reusing descriptors from existing standards also ensures greater interoperability, improving both integration and interpretation.


The evaluation stage is intended to gather information about the daily activities of the selected team, regarding how they produce data and contextualize their datasets, as well as the workflow they use for data management.

2.1.4.1 UPorto: a case study identifying an application profile

In 2011, a study was carried out with the goal of gathering information about the research data management practices of University of Porto (UP) researchers [SRL11]. In 2013, another study was performed to identify the requirements to describe datasets from a group of mechanical engineers [CRR13]. The presented tools helped them in the process of describing the datasets and selecting a set of qualified Dublin Core elements as well as an additional set of complementary descriptors.

After the first impressions on the requirements analysis, the interviewees stated that they required a small set of metadata descriptors to understand their data, revealing that the team was not entirely aware of the importance of including a wider set of descriptors to facilitate the dataset analysis by external users [CRR13]. Instead, they were mainly concerned with the interpretation of the datasets within the group, and not by external researchers or institutions. By the end of the project, the developed application profile consisted of a combination of qualified Dublin Core terms with others from the Ecological Metadata Language schema, as well as several specific descriptors that were not covered by any existing schema¹³.

This study revealed that there is room for improvement when it comes to data description. Without data management tools designed to manage data from its early stages, the inclusion of additional data description depends mostly on the motivation and awareness of the research team.

2.2 OAIS reference model

The need to preserve information for future access by a community of space data systems [PAN98] led to the creation of a set of instructions and processes to solve some of the related constraints, while providing such access to the data. Following these requirements, in 1990 the Consultative Committee for Space Data Systems (CCSDS) started the development of the Open Archival Information System as a reference model defined by basic functional components, dedicated to the long-term preservation of digital information¹⁴. After the success and advantages shown by these initiatives, in 2003 the recommendations were extended to a large set of domains, and they are currently seen as an international standard [ISO12].

The Open¹⁵ Archival Information System comprises an organization of people and systems with the goals of both preserving the information and making it available to the community. This reference model has been widely accepted by the digital preservation community and is stated to be the key standard for data management in digital repositories. The specification of these interactions ranges from data ingestion to the subsequent management and dissemination, through the platform, to the user. It consists of a high-level model of the intended workflow, making it flexible enough to be integrated within any domain of research. Apart from the technical details the model presents, it is also designed to provide the basic concepts needed by non-archival organizations to be integrated and actively participate in the process of preserving data.

¹³ http://knb.ecoinformatics.org/software/eml/eml-2.1.1/eml-methods.html
¹⁴ CCSDS 650.0-B-1 of the Consultative Committee for Space Data Systems

This approach ensures that the data is in a long-term preservation state even if the OAIS itself does not follow such a life cycle (e.g. if its organizational representation changes with future improvements).

Figure 2.1: OAIS functional model representation

2.2.1 The OAIS environment

The OAIS environment consists of three agents and their interactions with the archive:

• Producer: responsible for providing the information and the metadata that contextualizes it. A producer can be either a person or a client system (including other OAIS or internal systems);

• Manager: controls the overall policy execution, and is not involved in daily archive operations [PAN98];

• Consumer: harvests and acquires the preserved information of interest. The consumer is considered part of the Designated Community, a set of potential users often associated with libraries and other organizations [FSR12], and should be able to understand the preserved information.


2.2.2 Packaging

In the context of a preservation archive, an information package is composed of both the content information and the preservation description information. As shown in Figure 2.2, the model identifies three main information packages to guarantee the correct flow of data through all the stages. These packages consist of a high-level representation of the digital object(s) to be preserved, the contextual data pointing to those objects, and a third package that relates the two previous ones and offers an externalization of their contents.

[Figure: Information Package, Archival Information Package, Dissemination Information Package]

Figure 2.2: Information packages within the OAIS protocol

The submission information packages (or SIPs) [Gel01] represent both the data and the metadata that accompanies it, normally supplied by the authors of such data, ideally the team of researchers who originally created the contents. At this level, the metadata can also be provided by other repositories making use of interoperable metadata standards, thus minimizing the effort required to ingest the material into the repository. A SIP provides a normative data format for each section; nevertheless, its structure is often defined in loco, taking into consideration both the ingest infrastructure and the data provider.

Component Function         Normative Format
Structural Metadata        METS 1.0 schema
Descriptive Metadata       EJAR 1.0 descriptive and administrative metadata schema
Administrative Metadata    EJAR 1.0 descriptive and administrative metadata schema
Issue-level Text           EJAR 1.0 issue schema
Item Reference Links       EJAR 1.0 issue schema
Raster Still Image         TIFF 6.0
Page Description           PDF 1.4


Table 2.4 summarizes some of the components of a SIP as well as the standards associated with them¹⁶.

The SIP also provides a representation for directory structures, enabling a three-level directory hierarchy: the first level is the base directory corresponding to the journal title, and the second level holds a single sub-directory for the issue-level components, which is commonly the parent of the sub-directories for each item.

Following the first step in the ingestion process and the submission package, the Archival Information Package (AIP) contains a complete set of SIPs for the associated Content Information¹⁷. In this step, each SIP is prepared for preservation by retrieving the stored Preservation Description Information (PDI), which contains additional information about the data, as shown below:

• Reference Information—provides one or more unique identifiers with which the content information may be uniquely identified. Examples of such identifications include the ISBN number or a set of attributes distinguishing two instances of content information;

• Provenance Information—the history of the archived object, with a historical record of who has had its custody since its origin, including a processing history;

• Context Information—relationship to other objects, e.g. the hierarchical structure of a digital archive. It can relate to objects outside the information package;

• Fixity Information—a demonstration of authenticity, such as a hash value aiming to protect the information from unwanted or undocumented changes.
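As an illustration of Fixity Information, the sketch below computes a hash value when content is archived and later checks the content against it. The choice of SHA-256 and the function names are assumptions made for this example; the OAIS model does not prescribe a particular algorithm.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Return the SHA-256 hex digest used here as Fixity Information."""
    return hashlib.sha256(content).hexdigest()


def verify_fixity(content: bytes, recorded_digest: str) -> bool:
    """True when the content still matches the digest recorded at ingest."""
    return fixity_digest(content) == recorded_digest


# Made-up content standing in for an archived data file.
original = b"temperature,pressure\n21.5,101.3\n"
stored = fixity_digest(original)

print(verify_fixity(original, stored))          # unchanged content
print(verify_fixity(original + b"x", stored))   # content modified after ingest
```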

In the last stages of the process, the dissemination information package (DIP) is the response to a request made by a consumer. The form of this package depends on the consumer's requirements and on the dissemination media. A DIP may differ from the corresponding AIPs, since it can be a subset of the archived information [FSR12].

2.2.3 OAIS Activities

Figure 2.1 shows the set of activities that preserve the information received from the producers. These activities process the incoming information and transform it into an adequate representation for storage, and also include the routine procedures that keep the archive operational.

The Ingest stage (Figure 2.3) provides the functionality to accept and process Submission Information Packages (SIPs), either from the Producers or from other internal elements, and to prepare their contents for the storage and preservation stages. The Ingest stage is also charged with ensuring the quality of the package and, after this process is completed, generating the Archival Information Package (AIP). It also extracts the Descriptive Information for later inclusion in the database, coordinating the whole process with both Archival Storage and Data Management.

¹⁶ Adapted from [Gel01]. Note that for each schema there is a set of authorized elements that can be used to fulfill the needs of additional data.


Figure 2.3: Workflow within the OAIS model, including the ingestion stage. Image adapted from [PAN98]

Archival Storage is responsible for controlling, maintaining and retrieving the previously generated AIPs. It adds the received packages to the permanent storage, notifies the archive of this operation, and starts an error checking routine to provide recovery capabilities.

Data Management is the key agent in maintaining and accessing the Descriptive Information and the administrative data used to manage the archive. Besides the management tasks, the Data Management activity is also responsible for populating such structures with the received information and performing database updates with the newly acquired descriptive information.

The connection with the model is made through the Access layer, which provides the services that help Consumers determine whether the desired information exists in the archive and, if so, allow them to obtain it.
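The interplay between Ingest, Archival Storage, Data Management and Access can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: the class and field names are hypothetical, storage and catalogue are in-memory dictionaries, the quality check only rejects empty content, and the dissemination package is reduced to a plain dictionary. The OAIS model itself is far more general.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SIP:
    identifier: str
    content: bytes
    descriptive_info: dict


@dataclass
class AIP:
    identifier: str
    content: bytes
    fixity: str  # part of the Preservation Description Information


@dataclass
class Archive:
    storage: dict = field(default_factory=dict)    # stands in for Archival Storage
    catalogue: dict = field(default_factory=dict)  # stands in for Data Management

    def ingest(self, sip: SIP) -> AIP:
        """Check the SIP, build the AIP and register its descriptive information."""
        if not sip.content:
            raise ValueError("empty submission rejected by the quality check")
        aip = AIP(sip.identifier, sip.content,
                  hashlib.sha256(sip.content).hexdigest())
        self.storage[aip.identifier] = aip
        self.catalogue[aip.identifier] = dict(sip.descriptive_info)
        return aip

    def access(self, identifier: str):
        """Return a DIP-like dict for a Consumer, or None if nothing is archived."""
        aip = self.storage.get(identifier)
        if aip is None:
            return None
        return {"metadata": self.catalogue[identifier], "content": aip.content}


archive = Archive()
archive.ingest(SIP("dataset-1", b"raw measurements", {"title": "Run 1"}))
dip = archive.access("dataset-1")
print(dip["metadata"])
```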

2.3 OAI-PMH for harvesting metadata from repositories

When considering the interoperability between publication repositories, many organizations intend to make their data available to the external community through several channels. To exchange and share metadata associated with those repositories, the Open Archives Initiative Protocol for Metadata Harvesting¹⁸ (OAI-PMH) was developed.

The OAI-PMH was originally designed to enhance the communication and access to eprints¹⁹ archives with the purpose of increasing the availability of scholarly information. It has been continuously improved and is currently considered a standard for information dissemination [PG12, HS08, Lav04]. To facilitate such dissemination, OAI-PMH relies on an application-independent framework based on metadata harvesting and identifies two major agents within its activity:

¹⁸ http://www.openarchives.org/pmh/
¹⁹ http://www.eprints.org/


• Data Providers, which are responsible for managing the supporting systems that expose the metadata;

• Service Providers, which harvest metadata records and build services on top of them to provide additional value.

2.3.1 OAI-PMH concepts

The OAI-PMH protocol defines several concepts to model the flow of information:

• Harvester: the client that issues requests, operated by a service provider with the purpose of collecting the exposed metadata;

• Repository: an accessible source of metadata that has an available interface to expose it;

• Item: an item has metadata about a single resource, in multiple formats, that can be harvested through the OAI-PMH. It also has a unique identifier that distinguishes it from the other items within the repository;

• Record: the metadata of an item, encoded in XML and returned in response to an OAI-PMH request. It is uniquely identified through the combination of the item's unique identifier and its own datestamp. The record is structured in several parts, including a header and the metadata, which together provide information about the origins of the extracted metadata.
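To make the structure of a record more concrete, the following sketch parses a hand-written sample record and extracts the header fields and one Dublin Core element. The identifier, datestamp and metadata values are made up, the host `repository.example.org` is a placeholder, and the function name `parse_record` is hypothetical; only the namespace URIs are those published by the Open Archives Initiative and DCMI.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample record; all values are illustrative.
RECORD_XML = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:repository.example.org:1234</identifier>
    <datestamp>2014-02-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Sample dataset</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_record(xml_text):
    """Extract the header fields and the title from one OAI-PMH record."""
    root = ET.fromstring(xml_text)
    header = root.find("oai:header", NS)
    dc = root.find("oai:metadata/oai_dc:dc", NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "title": dc.findtext("dc:title", namespaces=NS),
    }


parsed_record = parse_record(RECORD_XML)
print(parsed_record)
```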

In addition to the described functionalities, OAI-PMH supports selective harvesting, meaning that a harvester can request specific portions of the metadata held by a repository. With this interoperability, clients can exchange metadata using open standards such as XML. To ensure a basic level of interoperability, the unqualified version of Dublin Core must be supported by all data providers [HS08].

[Figure: Harvester, Data providers, Service provider, Researcher]


Regarding the implementation details, the OAI-PMH specification features six requests, as shown below:

• GetRecord: responsible for retrieving an individual metadata record. As arguments it requires the record identifier, along with the format of the metadata that should be included in the record, such as oai_dc;

• Identify: retrieves information about a specific repository. This request can also be implemented in the host repository to provide additional information;

• ListIdentifiers: obtains only the headers rather than the full records (as opposed to the ListRecords call);

• ListMetadataFormats: obtains the metadata formats available to represent a specific record from a repository;

• ListRecords: harvests records from a repository. To support selective harvesting, this call can also include parameters such as a datestamp;

• ListSets: outputs the set structure of a repository. This is mainly useful for selective harvesting.
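Since OAI-PMH requests are plain HTTP requests whose arguments are passed as key=value pairs, they can be assembled with a few lines of code. The sketch below only builds the request URLs and does not contact any server; the base URL is a placeholder, and the helper name `build_request` and the `from_` spelling (needed because `from` is a reserved word in Python) are choices made for this example.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"  # hypothetical endpoint

# The six request types defined by the protocol.
VERBS = {"GetRecord", "Identify", "ListIdentifiers",
         "ListMetadataFormats", "ListRecords", "ListSets"}


def build_request(verb, **arguments):
    """Return the URL of an OAI-PMH request for the given verb and arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        # 'from' is a Python keyword, so callers pass it as 'from_'.
        params[name.rstrip("_")] = value
    return f"{BASE_URL}?{urlencode(params)}"


# Retrieve one record in unqualified Dublin Core.
get_record = build_request("GetRecord",
                           identifier="oai:repository.example.org:1234",
                           metadataPrefix="oai_dc")

# Selective harvesting: only records changed since a given date.
selective = build_request("ListRecords", metadataPrefix="oai_dc",
                          from_="2014-01-01")

print(get_record)
print(selective)
```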

These requests establish guidelines that other systems can use to harvest information from different sources, resulting in greater interoperability and content dissemination. As of February 2014, this protocol was in use by 2192 organizations²⁰.

2.4 Preservation solutions

Following the research data deluge [Bor11], some concerns over the preservation of research data have emerged. In recent years, in an attempt to improve the management of digital information from various domains, several projects have emerged aiming to solve the challenges of handling large datasets in a collaborative environment and of their long-term preservation. While some of the platforms are targeted at very specific needs, others strive to be more generic, making it possible to merge and manage data from several domains.

These tools also serve as a way to motivate not only researchers but entire communities to adopt a new paradigm of sharing their data with other communities, thus creating an environment with better documented data. In this context, a preservation solution is defined as a platform that can store and manage research datasets, even though it may not comply with any data preservation guidelines, despite the importance of such guidelines and their contribution to the overall preservation of research data.


2.4.1 Research data repositories

Research data repositories serve the specific purpose of managing data related to research publications, articles and datasets. These repositories are often managed by communities of specialists, and tailored to the specific needs of their datasets.

• Dryad — intended to support the preservation, use and reuse of research data within the evolutionary biology field, the Dryad repository features a set of data management tools designed to support a better documentation of the related datasets. Dryad allows the upload of datasets while specifying if they can be readily accessible to the public or if they are still under review. To facilitate citing the managed datasets, a Digital Object Identifier (DOI) is added to the data as soon as it is published. Its integration with the publishing platforms of prominent journal editors from the biology domain (Elsevier, for instance) facilitates the citation process. Dryad is based on a DSpace software platform thus offering its functiona-lities such as an OAI-PMH endpoint to streamline contents dissemination;

• FighShare — FigShare focuses on giving credit for each submission, by allowing the iden-tification of each researcher and associating his or her datasets. Additional features include (limited) private storage and unlimited public domain space, with a particular focus on sha-ring the managed data. While it supports uploading any kind of file formats to the repository, the available metadata to describe such files is limited to generic descriptors such as author and date of creation. To ease the quick access to some parts of the dataset, Figshare also supports previewing the file without downloading it. Another useful feature, recently made available by other repositories, is the inclusion of a citation-formated quote that the user can use on the publication;

• Zenodo: previously known as OpenAIRE Orphan Record Repository, Zenodo is based on the Invenio21 framework and does not restrict the domain of the uploaded datasets. Data ownership remains unchanged after the upload of a dataset is completed, and a history of changes is kept so that unwanted changes to it can be managed. As for metadata schema compliance, unlike the other repositories, which try to reuse standards, Zenodo specifies a fixed set of elements to hold the metadata. Although some of these elements may be compatible (having the same meaning) with descriptors from existing metadata schemas, the overall set of descriptors does not comply with any established metadata schema. To support integration with external applications, Zenodo exposes the gathered metadata in compliance with the OAI-PMH protocol, so that external harvesters can index its contents easily. One of the biggest contributors to the Zenodo platform is the European Organization for Nuclear Research, known as CERN, which stores over 100 PB of physics data from the Large Hadron Collider.
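Both Dryad and Zenodo are described above as exposing metadata through OAI-PMH. The following is a minimal sketch of what a harvester does with such an endpoint: it builds a ListRecords request and reads the Dublin Core titles from the response. The base URL, the record identifier and the sample response are hypothetical and used for illustration only; no real repository is contacted. The verb, the oai_dc metadata prefix and the XML namespaces are those defined by the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


# Hypothetical response, shaped like a real ListRecords answer.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    # The endpoint below is a placeholder, not a real repository address.
    print(list_records_url("https://repository.example.org/oai"))
    for identifier, title in parse_records(SAMPLE_RESPONSE):
        print(identifier, title)
```

A real harvester would additionally fetch the URL over HTTP and follow the resumption tokens that the protocol uses for paging; both are left out here to keep the sketch self-contained.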

Repository popularity is an important indicator when research institutions decide where to deposit their data. Table 2.5 presents the number of datasets managed by each repository, retrieved from the repository directory OpenDOAR.


Repository    Number of datasets
Zenodo        1229 (30/01/2014)
Figshare      28 (02/03/2011)
Dryad         31245 (28/11/2013)

Table 2.5: Number of datasets per repository. Information retrieved from OpenDOAR.

2.4.2 Open-source data repository frameworks

Research data repositories are often developed on top of generic platforms that are tailored to meet their requirements and extended with new functionalities. These platforms, known as research data repository frameworks, provide support for data storage. A framework is a standalone solution in which the basic functionality is provided by an installation package and additional functionalities are delivered through add-ons that the system administrator can install or develop, according to the domain requirements. This approach allows greater customization to the requirements of each particular repository instance. Custom developments may change the way files are stored (for instance, storing the files locally rather than a pointer to their location in another repository), provide additional services (API integration and access control) and add metadata elements to contextualize the datasets.

2.4.2.1 DSpace

In March 2000, Hewlett-Packard Company invested 1.8 million dollars in a collaboration with the MIT libraries to build DSpace22. DSpace presents itself as a cross-domain software solution, flexible enough to tailor its functionalities to the organizational specifications. Its main objectives relate to the preservation and open access to the stored information—text, images, video, audio and datasets.

DSpace is currently in use within thousands of organizations to manage their publications and other collections of documents. The University of Porto is one example of a successful application of this platform, which it uses to support its Open Repository23 and Thematic Repository.

With one of the largest communities of supporters, DSpace24 is currently a popular framework for managing the publications of thousands of authors. As shown in Figure 2.5, DSpace has a share of almost 40% among all publication repositories25, and is used by educational, government, private and commercial institutions.

Nevertheless, this popularity is based on the number of institutional repositories and not on the number of data repositories. DSpace is not particularly tailored to support research data management, focusing instead on publications. Still, there are projects that aim to adapt it to support both publications and their corresponding datasets [dSRL+12].

22 http://www.DSpace.org/
23 http://sciencedata.up.pt/DSpace/
24 http://DSpace.org


Figure 2.5: Distribution of the main data repositories, provided by the OpenDOAR platform.

Although these two purposes are very distinct in terms of their metadata requirements, they are related, as many publications need to cite the datasets they are based on.

2.4.2.2 CKAN

Over the last few years, the initiatives to provide general access to governmental data led to the development of a platform to support both the access and storage of such records for public consultation.

The Comprehensive Knowledge Archive Network (CKAN) is an open-source data management platform that presents itself as a complete out-of-the-box solution to make data deposit and sharing easier. It provides tools for publishing, sharing, finding and retrieving data and is used by several institutions worldwide. CKAN is considered a good tool for managing large quantities of data. Among the currently active projects, CKAN supports:

• Data.gov.uk26: the official Open Data infrastructure of the United Kingdom Government. This project follows the intention of releasing public data to provide information about government activities, gathering data about politics, laws and statistics. The platform also integrates with additional services so that external clients can search and obtain data for added value. The announced statistics point to more than 9000 available datasets;

• PublicData.eu27: a research prototype of a European data catalog. The main objective of this initiative is to provide easier access to Europe's public data, allowing users to

26 http://data.gov.uk
27 http://publicdata.eu