
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

ArchGraph: Design of a vertical prototype infrastructure for semantic archives

Nuno Miguel Cardoso Lopes de Freitas

DISSERTATION

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: João Miguel Rocha da Silva, PhD


ArchGraph: Design of a vertical prototype infrastructure for semantic archives

Nuno Miguel Cardoso Lopes de Freitas

Mestrado Integrado em Engenharia Informática e Computação

Approved in oral examination by the committee:

Chair: Prof. Maria Cristina de Carvalho Alves Ribeiro
External Examiner: Prof. José Carlos Leite Ramalho
Supervisor: Prof. João Miguel Rocha da Silva


Abstract

Historical archives curate invaluable cultural heritage artifacts, and follow specific standards of information organization in their descriptive records. In Portugal the Arquivo Nacional da Torre do Tombo (ANTT), operated by DGLAB, uses the ISAD(G) standard to support description and the same applies to the Digitarq platform, which provides a vast digital catalogue of the assets in the archives. ISAD(G) is based on a hierarchical model and uniform description, which allows subcollections of documents to inherit description values from parent collections. However, some of the standard components, such as descriptions and biographies, are coarse-grained, and limit the use of the corresponding information in search and interlinking with other resources. Currently, the ANTT is preparing the adoption of a more fine-grained and interconnected data model, migrating in this process their hierarchical structure to a knowledge graph. This is expected to make the ANTT more accessible and interoperable, allowing easier access for external systems. The knowledge graph is also intended as a foundation to support machine learning techniques used to uncover knowledge in the existing descriptions. The CIDOC-CRM model, a formal ontology created for integrating heterogeneous cultural heritage information, was selected by the ANTT as the candidate to model the information available in their Digitarq-powered archive. Mappings between CIDOC-CRM and widely used metadata standards such as the Dublin Core Metadata Element Set (DCMES) are already available. Since metadata records in Digitarq are DCMES-compliant, it becomes possible to establish a mapping between the two data models, outlining a possible migration path for certain entities or properties that can be used in future work. The aim of this work is to analyze the requirements of ANTT and the existing platforms to select appropriate technologies for the design and implementation of a knowledge graph for ANTT, and to develop a vertical prototype capable of demonstrating the applicability of CIDOC-CRM to the archival context. The first stage has gathered the functional and non-functional requirements through a preliminary analysis of Digitarq, complemented by an analysis of ANTT records and several meetings with the DGLAB team in this area. The task of defining a CIDOC-CRM compliant data model to represent the information in the archive has been carried out in a separate, though coordinated, work. Based on the model, a prototype implementation of the graph was created, after comparing available technologies for possible database implementations, including relational databases, labeled property graphs and triplestores. Various methods were applied to ensure consistency of the model at all times. Use cases defined by the ANTT staff were implemented to test the usability of the prototype. As a result, the lower and middle layers of the platform can now support an upper layer that will take into account the use cases for professional and lay users interested in exploring and manipulating this knowledge graph. The prototype was evaluated with respect to its performance while maintaining the consistency of the data model. This also allowed us to conclude that it is not trivial to balance the expressiveness of an ontology-based model with the practical constraints required of an operational system. Further work includes prototypes for the applications that manage the archives, as well as a more detailed evaluation of the scalability of the proposed solution.

(6)
(7)

Resumo

Os arquivos históricos velam por artefatos inestimáveis do património cultural e seguem padrões específicos de organização da informação nos seus registos descritivos. Em Portugal, o Arquivo Nacional da Torre do Tombo (ANTT), operado pela DGLAB, usa a norma ISAD(G) para dar suporte à descrição e o mesmo se aplica à plataforma Digitarq, que fornece um vasto catálogo digital dos ativos nos arquivos. O ISAD(G) é baseado num modelo hierárquico e em descrição uniforme, tornando fácil a uma subcoleção de documentos herdar valores de descrição de coleções acima. No entanto, alguns dos componentes da norma, como descrições e biografias, são de grão grosso e limitam o uso da informação correspondente na pesquisa e na interligação com outros recursos. Atualmente, a ANTT está a preparar a adoção de um modelo de dados mais refinado e interconectado, migrando neste processo a sua estrutura hierárquica para um grafo de conhecimento. Espera-se que isso torne a ANTT mais acessível e interoperável, permitindo um acesso mais fácil por sistemas externos. O grafo de conhecimento também serve como base para suportar técnicas de aprendizagem automática usadas para descobrir conhecimento nas descrições existentes. O modelo CIDOC-CRM, uma ontologia formal criada para integrar informação heterogénea sobre património cultural, foi selecionado pela ANTT como o candidato para modelar a informação disponível no seu arquivo baseado no Digitarq. Os mapeamentos entre o CIDOC-CRM e as normas de metadados amplamente usadas, como o Dublin Core Metadata Element Set (DCMES), já estão disponíveis. Como os registos de metadados no Digitarq são compatíveis com o DCMES, torna-se possível estabelecer um mapeamento entre os dois modelos de dados, delineando um possível caminho de migração. O objetivo deste trabalho é analisar os requisitos da ANTT e das plataformas existentes para selecionar tecnologias apropriadas para o desenho e implementação de um grafo de conhecimento para a ANTT, após comparação destas mesmas tecnologias, e desenvolver um protótipo vertical capaz de demonstrar a aplicabilidade do CIDOC-CRM no contexto arquivístico. A primeira etapa reuniu os requisitos funcionais e não funcionais por meio de uma análise preliminar do Digitarq, complementada por uma análise dos registos da ANTT e diversos encontros com a equipa da DGLAB nessa área. A tarefa de definir um modelo de dados compatível com CIDOC-CRM para representar a informação no arquivo foi executada num trabalho separado, embora coordenado. Com base no modelo, foi criado um protótipo de implementação do grafo, após comparação das tecnologias disponíveis. Vários métodos foram aplicados para garantir a consistência do modelo em todos os momentos. Os casos de uso definidos pela equipa da ANTT foram implementados para testar a usabilidade do protótipo. Como resultado, as camadas inferiores e intermédias da plataforma podem agora suportar uma camada superior que levará em conta os casos de uso para utilizadores profissionais e leigos interessados em explorar e manipular o grafo de conhecimento. O protótipo foi avaliado em relação ao seu desempenho, mantendo a consistência do modelo de dados. Isso também nos permitiu concluir que não é trivial equilibrar a expressividade de um modelo baseado em ontologias com as restrições práticas exigidas por um sistema operacional. Outros trabalhos incluem protótipos para as aplicações que gerem os arquivos, bem como uma avaliação mais detalhada da escalabilidade da solução proposta.

(8)
(9)

Acknowledgements

The development of this thesis would not have been possible without the aid of several people, whose support and advice helped me through both the thesis and my entire educational path.

Firstly, I would like to give my profound thanks to my thesis advisor, Professor João Miguel Rocha da Silva, who was always available to help me throughout the whole project, who steered me whenever I was unsure which path to take, and who was understanding and supportive when my own personal problems got in the way. Working with him was an extremely positive experience, and I hope to still be able to ask for his advice and help on further work.

I would also like to thank professors Maria Cristina de Carvalho Alves Ribeiro, Carla Alexandra Teixeira Lopes and Gabriel Torcato David for their continued help and support as we worked on the EPISA project. In the same context, I would like to thank Inês Dias Koch, who worked in parallel with me on her own thesis within the EPISA project. Our collaboration greatly helped this thesis and resulted in an article published at the TPDL 2019 conference.

I would also like to thank the Torre do Tombo staff for their efforts throughout the year, as well as for their availability despite the distance between Porto and Lisbon.

My thanks as well to my friends João Chaves, Inês Ferreira, Inês Teixeira, André Reis, Sara Santos, Nuno Castro and José Carlos Coutinho. You were the best company anyone could have had throughout all the years spent here at FEUP, and you always pushed me to do and be better, be it in this thesis or anything else.

Finally, my most profound thanks to my parents and sister for providing everything they could at all times throughout all my years of study and life. Nothing that I do comes without your absolute support and love, and there are no words to express how deeply grateful I am.


“Perseverance and spirit have done wonders in all ages.”


Contents

1 Introduction
   1.1 Context
   1.2 Motivation
   1.3 Goals
   1.4 Related Work
   1.5 Dissertation Structure

2 Archival standards and data models
   2.1 Archives, libraries and repositories
   2.2 ISAD-G
      2.2.1 Main Concepts
      2.2.2 Implementations
   2.3 EAD
      2.3.1 Implementations
   2.4 CIDOC-CRM
      2.4.1 Model outline - Moving from a hierarchy to a graph
      2.4.2 Entities and Properties
      2.4.3 Applications and Past Implementations
      2.4.4 Case study: The Construction of FEUP in CIDOC-CRM terms
   2.5 The case of Portuguese national archives
      2.5.1 Current archive structure
      2.5.2 The DigitArq platform
      2.5.3 Moving towards a graph model built on CIDOC-CRM
   2.6 The EPISA Project
   2.7 Database discussion
      2.7.1 Relational Database
      2.7.2 Triple Store
      2.7.3 Labeled Property Graph and Neo4j
      2.7.4 Comparison and discussion

3 System Design and Implementation
   3.1 Requirements
      3.1.1 Use Cases
      3.1.2 Functional and Non-Functional Requirements
   3.2 Architecture
      3.2.1 Data Model
      3.2.2 Technology Stack
   3.3 Database modelling
   3.4 Prototype structuring
   3.5 User Interface
   3.6 Parsing and OGM operations
   3.7 The use of class introspection for CIDOC-CRM only solutions

4 Performance Testing
   4.1 Graph Modification Performance
   4.2 REST Calls Performance

5 Final Words
   5.1 Future Work
   5.2 Conclusion

References


List of Figures

2.1 Representing the Construction of FEUP as a CIDOC-CRM graph
2.2 Representation of the EPISA Project's goals
3.1 Vertical Prototype Use Case Diagram
3.2 Representation of the diamond problem: class D cannot know whether it overrides the method from B or C
3.3 Diagram showing an example of the implementation of a CIDOC-CRM entity
3.4 Class Diagram of the E18 Physical Thing class implementation
3.5 Component Diagram
3.6 Example of a single CIDOC-CRM entity modelled in JSON-LD
3.7 The structure of client-server communication
3.8 Flowchart of the parsing of the JSON-LD representation of a CIDOC-CRM graph
3.9 Flowchart of obtaining the property instances an entity can have
3.10 Flowchart of the parsing of the JSON-LD representation of a CIDOC-CRM graph


List of Tables

2.1 Comparison Table Between Possible Database Choices
4.1 Average duration of graph modification methods


Abbreviations

API       Application Programming Interface
CIDOC-CRM CIDOC Conceptual Reference Model
ISAD(G)   General International Standard Archival Description
ANTT      Arquivo Nacional da Torre do Tombo
DGLAB     Direção-Geral do Livro, dos Arquivos e das Bibliotecas
GPL       General Public License
OGM       Object Graph Mapping
REST      Representational State Transfer
HTTP      HyperText Transfer Protocol
JSON      JavaScript Object Notation
JSON-LD   JavaScript Object Notation for Linked Data
RDF       Resource Description Framework
OWL       Web Ontology Language
SQL       Structured Query Language
NoSQL     Not Only SQL
ACID      Atomicity, Consistency, Isolation, Durability
CRUD      Create, Read, Update, Delete
UI        User Interface
EPISA     Entity and Property Inference for Semantic Archives
APOC      Awesome Procedures on Cypher
UC        Use Case
EAD       Encoded Archival Description
XML       eXtensible Markup Language


Chapter 1

Introduction

This dissertation aims to determine the most efficient way to create a vertical prototype that migrates a digital archive's data model from a hierarchical, archival-description-based model to a semantically rich graph model. To do so, a prototype of the aforementioned database with the new data model will be built and then evaluated. This chapter explains the context of the dissertation.

1.1 Context

The Torre do Tombo Portuguese National Archive (ANTT) is a central archive of the nation, under direct administration of the State and integrated into the Secretary of State for Culture. The archive maintains a diverse collection of cultural heritage artifacts, encompassing different kinds of documents ranging from the ninth century to the present, both in physical form and, more recently, in digital form. More than just preserving these archival assets, the Torre do Tombo aims to make them available for access by those interested in them.

Given the need to make these documents more available to the public, the coordinating entity of the national system of archives, DGLAB (Direção-Geral do Livro, dos Arquivos e das Bibliotecas)1, created a digital archive accessible to the public, named Digitarq. This archive is supported by a relational database designed to comply with the ISAD(G) standard. ISAD(G) is both a content standard and a schema: it gives guidance on how to provide data within the element set and also defines that set of elements; in other words, it represents the collection in archival descriptions and also represents the metadata of each record. ISAD(G) is the core element set agreed upon in 1999 by the International Council on Archives, and was adopted by DGLAB in 2002 for the creation of Digitarq.

1 Website: http://dglab.gov.pt/


While this solution served its purpose for the archive's initial goals, the relational database model that it uses has issues with performance and scalability due to the large amount of data the database needs to hold. Additionally, as the data grows, the number of relations between records grows as well, which makes traversing the information in the database more complex and, because of the nature of relational databases, more taxing on its performance. Moreover, the ISAD(G) standard represents a primarily hierarchical structure, making it hard to discover other relationships between records through related entities; this reduces information interlinking and hinders interoperability.

Additionally, the ANTT is interested in reducing costs related to software licensing by moving to a fully open-source technology stack, as the current Digitarq software is built on proprietary, commercial software whose source code is not publicly available. Moving to a fully open-source stack also furthers the commitment of the ANTT to the "Resolução 12/2002 da Presidência do Conselho de Ministros" national directive, which calls for the adoption of open-source software across the public administration of Portugal whenever the maturity of the solutions and their cost are deemed appropriate [Pr12a].

This is all part of the EPISA project3, an effort by INESC TEC to represent the information of the Torre do Tombo in ways that allow it to be interconnected and interoperable, by moving the archived data from representations of archival descriptions to semantically rich representations of the information contained in the records.

The proposed solution to this problem is the creation of a graph-based data model and its representation in a graph database, which by design is better equipped to handle large amounts of interconnected data. To better guide the creation of this graph database, it will be supported by CIDOC-CRM, an ontology created to define a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. This ontology was also developed with several object-oriented programming values in mind, which facilitates its use in graph databases.

This solution is to be developed at the InfoLab of FEUP (Laboratory of Information Systems of the Faculty of Engineering of the University of Porto, operating at the Department of Informatics Engineering), whose main objective is research in the areas of Information Management, Information Systems and Information Retrieval. The work will be carried out in close collaboration with a team from DGLAB.

1.2 Motivation

The main motivation of this dissertation is the need to migrate the relational database on which the Digitarq digital archive is currently based to a graph counterpart, compliant with the CIDOC-CRM cultural heritage ontology.

3 Website on INESC TEC: https://www.inesctec.pt/pt/noticias/inesc-tec-quer-digitalizar-o-maior-acervo-documental-da-historia-de-portugal


Through this change, several attributes of the database are expected to improve. These include a boost in performance, obtained from the handling of large amounts of interconnected data by a graph database, as well as easier accommodation of new needs the database may have to meet, thanks to the scalability that graph databases provide.

Furthermore, this work aims to make the Digitarq digital archive, and by consequence the Torre do Tombo archives, more accessible and interoperable, as the information can be more easily linked and read, and becomes readily available for integration into other information systems. To achieve these improvements, ISAD(G), a hierarchical model, will be replaced with a cultural heritage ontology designed for the interlinking of information, CIDOC-CRM.

The migration from ISAD(G) to CIDOC-CRM and from a relational database to a graph database will also cover another improvement that the Torre do Tombo's archives need: expanding the interconnection of the information contained within the archive databases, which will allow for faster and simpler digital searches while enabling more sophisticated queries over the database.

Finally, by applying the modifications to the archive database already mentioned, it is also intended that the interoperability, interconnectivity and performance improvements allow it to support new machine learning techniques over this new knowledge graph (called ArchGraph), which, when applied to the information available in the national archives, will allow for the uncovering of new knowledge.

1.3 Goals

The first objective of this dissertation is to study the CIDOC-CRM cultural heritage ontology. While the ontology itself is not new, its implementation in an operational system supported by a graph database is an innovative approach: although CIDOC-CRM has been used substantially for mappings and modelling, its implementation in graph databases is relatively unexplored, and the model itself, while modular and flexible, is large and complex enough to require research and analysis. Additionally, one of the advantages of CIDOC-CRM is that its concepts can be as general or as specific as necessary; this richness and flexibility, although interesting from a modelling point of view, requires in-depth knowledge of the model to make sure the right decisions are being made.

The second main goal of this dissertation is the development of a working vertical prototype that employs an efficient database with a CIDOC-CRM-compliant graph model, which culminated in the ontology named ArchOnto. For this, it will be necessary to research and compare the different kinds of technologies available to implement the graph. It is of extreme importance that this prototype is developed with the Torre do Tombo's needs for improved interoperability, interconnectivity and performance in mind, and that it is scalable to the point of supporting future machine learning work, which is expected to be applied to the whole of the ANTT.


1.4 Related Work

This prototype's model was adopted from Inês Koch's own thesis, "Estudo de mapeamento entre o ISAD/ISAAR e o modelo CIDOC-CRM para a descrição de objetos culturais da Torre do Tombo", in which she defines the data model used in this database4. Both Inês Koch's thesis and this one fall under the EPISA project.

As a result of the work done for this thesis, a paper named "Knowledge Graph Implementation of Archival Descriptions through CIDOC-CRM" was written for and accepted at the TPDL 2019 conference.

4 OWL version available on GitHub: https://github.com/feup-infolab/archontology

1.5 Dissertation Structure

This dissertation is divided into chapters covering the research and explanation of the context of archives, the analysis and comparison of technologies to create a CIDOC-CRM-compliant graph database, the record of the development of the prototype, the explanation of the tests made to evaluate the prototype and, finally, the conclusions and results of the dissertation.

The first section of the dissertation will focus on archives, be they physical or digital. Firstly, the definitions of archive, library and repository will be made clear. Following that, we will explain what the ISAD(G) standard is, who created it and for what purpose, describe its main concepts and explain its implementations. Afterwards, the CIDOC-CRM ontology will be explained, from the model outline to its entities and properties. Additionally, its applications will be presented in the categories of cultural heritage, industry and audiovisual media. There will also be a subsection on past implementations of CIDOC-CRM, to serve as a benchmark for comparison: implementations that are similar in varying degrees to this one but that do not have a graph database implementation.

Finally, the current situation of the archive will be explained, starting with the current archive structure and the Digitarq structure, followed by the reasons why we believe a graph database might be of use and some tentative mappings of the current data model to CIDOC-CRM.

The next section will cover the requirements of the project, firstly discussing the use cases and presenting a use case diagram, in order to make clear which operations the user will be able to perform on the developed prototype. A list of user stories, with explanations, will be used to reinforce the points made in the previous discussion. Finally, there will be a list of both functional and non-functional requirements that were drafted prior to the start of the development of the project.

The chapter that follows focuses on discussing the chosen architecture of the platform, starting with the data model in all its intricacies. After that, the chosen technology stack will be presented at all its levels, from the database (along with an explanation for the choice of database type) to the choice of programming language, API server and framework used to develop the user interface. Finally, the overall structure of the platform will be summarized through a component diagram.

The subsequent chapter demonstrates the implementation details of the platform, starting by explaining how the database was modelled, with details such as naming conventions and the structure of properties and entities. Then the overall structure of the prototype will be explained in detail, followed by the user interface and the choices behind it. An important subsection explains how JSON-LD is parsed and used to aid Object Graph Mapper operations. The closing subsection of this chapter explains certain design choices regarding the use of Groovy class introspection to help validate the CIDOC-CRM model through the Object Graph Mapper functions.

The next section describes the validation of the prototype through several metrics, including the performance of several aspects. Following that, the dissertation closes with the conclusions of the work and possible future work in this project.


Chapter 2

Archival standards and data models

To fully understand the context of this dissertation, it is necessary to understand what archives are, as well as the models in question and the implementations of each. We will also clarify the difference between a hierarchical model for archival description and a semantically rich cultural heritage data ontology, making the case for migrating an archive from the former to the latter.

2.1 Archives, libraries and repositories

Archives, libraries and repositories share their purpose as locations where knowledge is collected and organized. This knowledge can be amassed in different types and quantities, depending on the kind of information storage location, but they all have the same overall goal. They are responsible not just for collecting information records but also for preserving that information and making it available to specific audiences, to different degrees depending on the institution that holds the archive, library or repository.

Despite their similarities, it is important to highlight the differences between them. A library is a collection of knowledge materials whose primary objective is the service to a community by maintaining access to these materials. Its services and selection of materials take into consideration the needs of the library's users. Additionally, the preservation of its works is routine but not the library's central concern, as libraries mainly use copies of documents and not the originals. Archives, on the other hand, have a motivation more related to the keeping of information records and the preservation of memory and information through time, making the preservation of materials a central concern. Archives focus heavily on the description of individual items and of coherent sets grouped by provenance. This description and organization has long been the central task of archives, while ensuring access to these materials is a more recent requirement, and each archive has specific guidelines for user access to its materials. This contrasts with the library, for which access to the materials is a major focus, while the description of the items is only a tool to ease access [Sch16].


Archives come in different forms, such as academic archives (also known as institutional repositories), national archives, business archives, church archives and more. They differ in the owners and people responsible for the collection and preservation of data, as well as in the kinds of archival material and the availability of their content.

It is important to mention digital libraries, which are collections of information that are both digitized and organized [Les04], as they can be used to preserve information in a digital form, complementing physical libraries and archives.

They have several advantages over physical methods, including increased search speed, fewer complications with storage space, easier preservation due to the lack of physical space requirements, and greater availability. However, because of the differences in archival methods, digital libraries pose new problems that archivists have to deal with. These include, for example, the need to organize larger quantities of material, which requires knowledge representation schemes that are in accordance with the data in the digital library. As the quantity of information in digital libraries increases, so does the divergence in digital archival methods. Given the need to search across different digital libraries, such methods need to be standardized and thus interoperable [Les04]. To help deal with these problems, archival description standards such as ISAD-G were drafted.

2.2 ISAD-G

The ISAD-G is a standard for the creation and preparation of archival descriptions. The purpose of an archival description is "to identify and explain the context and content of archival material in order to promote its accessibility" [oA99]. This is done by creating predetermined models of information based on appropriate representations of records. The ISAD-G was created by a committee of the International Council on Archives and approved by that same council. Two versions were published, one in 1994 and another in 1998, to be used in conjunction with national standards in the development of a more generalized way to organize archival information, and thus to ensure that searches in archives are made in the same manner. This promotes access to data, as the archival descriptions made by different institutions in separate systems, each with its own intricacies, can become interoperable.

Additionally, the ISAD-G aims to ensure the creation of consistent and appropriate descriptions, as well as to facilitate the retrieval and exchange of information on archival material. The creation of a unified system also makes the sharing of information between different archives easier.

This standard was developed through the analysis of preexisting archival descriptions and the identification of the elements needed for a robust system of descriptions, in other words, the elements that allow efficient searches across all these archives. The final choice of elements allowed the creation of a standard that can be applied to most archives and to any medium in which the information might need to be archived. The ISAD-G has 26 elements in total; their content and structure are meant to be adapted to national standards, and no output formats are prescribed. These elements are applied to descriptions of parts of the ISAD-G model, which organizes all the information existing in an archive into a larger structure containing all of the archive's data [oA99].

2.2.1 Main Concepts

The ISAD-G is a hierarchical model whose top member is the fonds, which is the entirety of the records, regardless of medium or form. A fonds can be represented by a description, and descriptions can be applied to any part of the model. Although these descriptions can use any of the twenty-six elements in ISAD-G, only six are considered essential to all: the reference code, the title, the creator, the date of creation, the extent of the unit of description and the level of description. As mentioned before, all parts of the model have descriptions, and the sum of all of these descriptions defines both the fonds and each of the separate entities. This concept is called multilevel description. Multilevel description allows one to reach any information as long as the search starts at the top of the model. This is guaranteed by applying four specific rules:

• All items of the model must have information on the items above them; as such, a document has information on all items directly above it;

• Despite having information on the levels above it, an element should only display information relevant to its own level. For example, when reading the description of a fonds, the description of a higher unit should not be displayed;

• Identify the level of a description and link it to the next higher unit of description;

• Information should not be repeated: information already given at higher levels should not be displayed at lower levels [oA99].

Next, we briefly explain the hierarchical model implemented in ISAD-G. The top level is the fonds, which is the whole of the records, regardless of form or medium, organically created or accumulated by a particular entity. The next level is the sub-fonds, subdivisions of a fonds containing bodies of related records corresponding to subdivisions of the overall data; sub-fonds can contain further subordinate sub-fonds when the hierarchical body is more complex. The level below is the series, documents kept as a unified unit because they were filed together as part of the same record-keeping activity; these can have sub-series at a level below, which, just like fonds, can be divided into further subordinate units. Following that there are files, which are documents grouped together because they relate to the same subject, activity or transaction. Finally, the last level is the item, an individual document that can no longer be divided [oA99].
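To make the multilevel structure concrete, the following sketch (in Python, with invented reference codes and titles; it illustrates the rules above and is not part of any existing implementation) shows a fonds, a series and an item in which each of the six essential elements is stated only at the level to which it applies, and the creator is resolved by walking up the hierarchy:

    # Minimal sketch of an ISAD(G)-style multilevel description.
    # The six essential elements are modelled as plain fields; every unit other
    # than the fonds points to its parent, and information is stated only once,
    # at the level to which it applies (rule of non-repetition).
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DescriptionUnit:
        reference_code: str
        title: str
        creator: Optional[str]          # may be inherited from a higher level
        date: Optional[str]
        extent: Optional[str]
        level: str                      # "fonds", "series", "file", "item", ...
        parent: Optional["DescriptionUnit"] = None
        children: List["DescriptionUnit"] = field(default_factory=list)

        def add(self, child: "DescriptionUnit") -> "DescriptionUnit":
            child.parent = self
            self.children.append(child)
            return child

        def inherited_creator(self) -> Optional[str]:
            # Walk up the hierarchy until a level that states the creator is found.
            unit = self
            while unit is not None:
                if unit.creator:
                    return unit.creator
                unit = unit.parent
            return None

    # Illustrative (invented) units
    fonds = DescriptionUnit("PT/EX/F1", "Example fonds", "Example creator",
                            "1900-1950", "120 boxes", "fonds")
    series = fonds.add(DescriptionUnit("PT/EX/F1/S1", "Correspondence",
                                       None, "1900-1920", "10 boxes", "series"))
    item = series.add(DescriptionUnit("PT/EX/F1/S1/I1", "Letter of 1905",
                                      None, "1905", "1 document", "item"))

    print(item.inherited_creator())     # "Example creator", stated only at fonds level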

2.2.2 Implementations

The ISAD(G) is not meant to be implemented alone; instead, it was designed to be used together with national archival description standards to develop new national standards more in tune with an international agreement on the creation of records and their descriptions [oA99]. As such, it has been used to create several slightly different standards. One of the most important for this project was its use by DGLAB (then named DGARQ) to establish a national recommendation for all archives, through a document that applies ISAD(G) [dA07]. Additionally, DGLAB has adopted ISAD(G) for all the archives it is responsible for and has applied the model in the creation of the Digitarq software.

Other implementations of the model can be seen in the national standards of other countries, such as DACS (Describing Archives: A Content Standard), employed in the United States of America and adopted by its Society of American Archivists. In the United Kingdom, the Encoded Archival Description was modeled after the ISAD(G) model as well.

2.3 EAD

The Encoded Archival Description (EAD) is an international standard for encoding digital data pertaining to archives and libraries [Pit99], mainly archival finding aids, created in 1998 by the Society of American Archivists and the United States Library of Congress. The reason for the creation of the EAD standard was the need for an encoding standard independent of hardware or software, in order to make the information contained in archives and libraries enduring; when those concerns are not addressed, a good amount of information and records is lost to the passing of time and to the obsolescence of storage methods. Additionally, the EAD standard aims to provide machine-readable encoding, to ease the identification and comprehension of archival description components, and to provide universal and standardized access to primary resources.

EAD is a description communication standard, that is, a standard meant to create a format in which an archival description standard, such as ISAD-G, can be exchanged between people and computers [Pit99]. To do so, it was built on SGML and XML, and the archival description standard it formats is the ISAD-G. Like the ISAD-G, EAD follows a hierarchical structure: it contains elements that define a collection, then elements that define components of the collection, then components of those components, and so on.

An EAD document starts with an ead root element and then contains three upper-level elements: eadheader, the header element that contains the document's metadata; frontmatter, the element that contains optional information for the finding aid; and finally archdesc, the archival description. The archival description carries the archive's content information and creation context. The archival description can nest core identification information (did), additional data that eases the use of the archival record (bioghist), and the description of the components, which are gathered in a wrapper element, dsc. In this wrapper element the archival component level is documented by level, and it contains the ISAD-G level designations such as fonds, series, documents and so on. Components inside the dsc are then defined through c or c01 to c12 [BG11].
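For illustration, the nesting described above can be sketched with the Python standard library as follows; this is a simplified, non-validating skeleton that only shows the element names mentioned in the text (attributes and required content of a real, schema-valid EAD instance are omitted):

    # Simplified sketch of the EAD element nesting described above, built with
    # the standard library; illustrative only, not a schema-valid EAD instance.
    import xml.etree.ElementTree as ET

    ead = ET.Element("ead")
    ET.SubElement(ead, "eadheader")          # metadata about the finding aid itself
    ET.SubElement(ead, "frontmatter")        # optional front matter

    archdesc = ET.SubElement(ead, "archdesc", level="fonds")
    ET.SubElement(archdesc, "did")           # core identification information
    ET.SubElement(archdesc, "bioghist")      # contextual data easing use of the record

    dsc = ET.SubElement(archdesc, "dsc")     # wrapper for subordinate components
    c01 = ET.SubElement(dsc, "c01", level="series")
    ET.SubElement(c01, "did")
    c02 = ET.SubElement(c01, "c02", level="file")
    ET.SubElement(c02, "did")

    print(ET.tostring(ead, encoding="unicode"))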


Recently, EAD was updated to EAD3. EAD3 is a version that attempts to achieve greater conceptual and semantic consistency, explores mechanisms whereby EAD-encoded information might more seamlessly and effectively connect with, exchange, or incorporate data maintained according to other protocols, and improves the functionality of EAD for representing descriptive information [Ste16].

2.3.1 Implementations

In the years after it first surfaced, from the late 1990s to the early 2000s, the EAD standard was met with a mixed response: it solved several issues with the global standardization of digital archives, but many archivists were skeptical of it because SGML was hard to learn and implement [YK05]. However, many of the statements that at the time blamed SGML for EAD's lack of adoption also pointed to XML as a possible solution, due to its simpler nature [Rot01]. Since then, XML has been adopted far more widely than SGML, and EAD has not faded at all, having even been updated to EAD3 [Ste16]. In fact, EAD is used in a large number of institutions.

Examples of those implementations include the European Portal of Archives1, the U.S. Library of Congress2 and DGLAB's Digitarq [FRF08].

1 http://apex-project.eu/index.php/en/outcomes/standards/apeead
2 http://www.loc.gov/rr/ead/lcp/

2.4 CIDOC-CRM

2.4.1 Model outline - Moving from a hierarchy to a graph

The CIDOC-CRM model was created by the CIDOC organization, the documentation arm of the International Council of Museums, and was developed to replace the E-R model, a modelling system previously used in the design of relational database systems for cultural heritage domains. The E-R model suffered from several issues, mostly related to a lack of flexibility. This meant that as the needs of cultural heritage databases increased, so did the model, eventually leading to an ever-growing, complex model that was increasingly difficult to maintain, until its support could no longer be justified [OL14].

The CIDOC-CRM model was developed to answer these problems through a semantically richer form of representation, based on an object-oriented approach, which allowed it to surpass the redundant representations of the previous model that had accumulated over time.

The CIDOC-CRM model is formalized as an ontology, a form of knowledge representation that represents categorical knowledge within a domain and provides a framework under which different organizations can collaborate and interpret their information in an interoperable manner. An ontology can also be defined as an "explicit specification of a conceptualization" [Gru95]; what this means is that "definitions associate the names of entities in the universe of discourse (e.g., classes, relations, functions, or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms" [Gru95]. CIDOC-CRM was developed with several object-oriented programming values in mind, which increased its modularity and its capacity for generalization or specialization in each implementation [OL14].

Another goal of CIDOC-CRM is data harmonization: data from different sources, when joined together, can still be integrated into consistent technological data frameworks that can continue to be worked on while maintaining their consistency.

One of the biggest strengths of the CIDOC-CRM model lies in its flexibility, which can be seen in several respects. Technologically speaking, the CIDOC-CRM model is entirely independent of any implementation framework, which gives any organization or developer flexibility in the choice of framework. This can, however, also create issues: given the lack of implementation standardization, there are fewer references on how to implement the model at a technical level, only at a conceptual level.

Regarding implementations, the model keeps its flexibility by not mandating any kind of fields or properties: an implementation needs to use only whatever is necessary, without carrying the load of all the other entities and properties of the model, making every use of the model as light as possible. Finally, the model is poly-hierarchical, which allows it to be as generalized or as specific as necessary for whatever dataset it is applied to. Additionally, the CIDOC-CRM model was designed with rich computer-based reasoning in mind, and as such the object-oriented ontology still conforms to the needs of logic-based systems [OL14].

Due to its structure, CIDOC-CRM can be considered a graph model; in contrast, ISAD(G) is a hierarchical model. The graph model improves on several aspects of the hierarchical model, as it allows for faster searches of data and makes data more harmonized and, more importantly, interlinked, which also promotes the interoperability of the information contained in these models.

2.4.2 Entities and Properties

The model itself is built on two major principles from which the entire model spans: entities and properties. The model consists of sets of entities, which are real-world things, related using properties, which establish inter-relationships between entities. The entity and property types added to the model were drafted from the analysis of numerous cultural heritage data models and from direct interviews with cultural heritage experts through meetings and workshops. In the model, all entity types use the prefix "E" and capitalize the first letter of each word, while properties use the prefix "P" and are written in lowercase. As an example, the most general element is "E1 CRM Entity"; it can have a relationship "P2 has type" that can only target the entity "E55 Type" and entities that inherit from it, because the property "P2 has type" has "E55 Type" as its range and "E1 CRM Entity" as its domain; thus, an instance of "P2 has type" can have any entity that is E1, or that inherits from it, as its subject. Since CIDOC-CRM is formalized as an ontology, entity types and properties are formalized as classes and object properties.


The CIDOC-CRM ontology is poly-hierarchical, which means that both entities and relationships have a specific hierarchy of meanings that allows for different degrees of specialization or generalization depending on the needs of each specific use of the model. Entity types can have subtypes that are more specific than themselves. Conversely, sub-properties have specific kinds of properties that can only be applied to themselves and not to the higher levels, but they can use any of the properties applicable to their super-entity types. As an example, one type of entity that exists in the CIDOC-CRM model is the Man-Made Thing, which covers "identifiable man-made items that are documented as single units". Consider two entities belonging to this class: a Beethoven piece and the Statue of Liberty. If higher specificity were intended, it would be possible to make the Beethoven piece a Conceptual Object instance and the Statue of Liberty a Physical Man-Made Thing instance, making them both more specific than just Man-Made Thing. This means that they can still use all the properties a Man-Made Thing could use, but they can now also use the properties associated with their more specific classes.
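As a rough illustration of how the poly-hierarchy and the domain/range of properties constrain what can be stated, the sketch below checks a "P2 has type" assertion against a tiny, simplified subset of the class hierarchy; the subclass chain shown is abbreviated (intermediate classes are omitted) and the whole example is an assumption for illustration, not an extract of the prototype:

    # Minimal sketch of CIDOC-CRM-style domain/range checking over a tiny class
    # hierarchy. Identifiers follow the naming convention described above; the
    # subset and the shortened subclass chain are illustrative assumptions.
    SUBCLASS_OF = {
        "E1 CRM Entity": None,
        "E71 Man-Made Thing": "E1 CRM Entity",           # intermediate classes omitted
        "E24 Physical Man-Made Thing": "E71 Man-Made Thing",
        "E28 Conceptual Object": "E71 Man-Made Thing",
        "E55 Type": "E28 Conceptual Object",
    }

    PROPERTIES = {
        "P2 has type": {"domain": "E1 CRM Entity", "range": "E55 Type"},
    }

    def is_a(cls: str, ancestor: str) -> bool:
        # True if `cls` equals `ancestor` or inherits from it, directly or not.
        while cls is not None:
            if cls == ancestor:
                return True
            cls = SUBCLASS_OF.get(cls)
        return False

    def valid_statement(subject_class: str, prop: str, object_class: str) -> bool:
        spec = PROPERTIES[prop]
        return is_a(subject_class, spec["domain"]) and is_a(object_class, spec["range"])

    # A Physical Man-Made Thing may be typed, since it inherits from E1 CRM Entity.
    print(valid_statement("E24 Physical Man-Made Thing", "P2 has type", "E55 Type"))  # True
    # The reversed statement violates the range of P2.
    print(valid_statement("E55 Type", "P2 has type", "E24 Physical Man-Made Thing"))  # False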

2.4.3 Applications and Past Implementations

CIDOC-CRM boasts a high degree of flexibility in several ways. Because it does not prescribe any technological framework, developers that use the ontology are free to choose the most appropriate alternative. Another interesting aspect is that the model does not mandate any kind of entities or properties, and allows any level of detail or specificity, making the CIDOC-CRM ontology applicable to many different problems.

The most common use of CIDOC-CRM, and its intended target since its creation, is the modelling of cultural heritage information, that is, information about a group or society pertaining to either physical things or abstract characteristics [OL14]. This kind of information is normally organized in either archival repositories or museums.

The CIDOC-CRM model can be used to model the information contained in archival records in a way that captures the content of each record in a more atomic manner, improving the record, since previously part of the content of the record was stored in lengthy textual descriptions.

One modelling effort connecting CIDOC-CRM to a model with similarities to ISAD-G was the mapping of the Dublin Core Metadata Element Set, a vocabulary used to represent both physical and digital information resources [Doe00]. The mapping of Dublin Core allowed a better understanding of how to model descriptions and their contents in CIDOC-CRM. The Dublin Core Metadata Element Set has been a starting point in the effort of mapping ISAD-G through CIDOC-CRM, as its limited scope of 15 generic descriptors makes it too simple to fully encompass the needs of the ANTT, despite this simplicity making it one of the most well-known and widely used metadata schemas.

An example of an implementation of the CIDOC-CRM model for representing cultural heritage information is the AMA (Archive Mapper for Archaeology) project, whose aim is to develop tools for the semi-automatic mapping of cultural heritage data to a CIDOC-CRM compliant model. This is done by creating electronic text from the original documents, supplying a TEI header with bibliographical information about the text, and marking all archaeological information through a predefined XML grammar that is already mapped to CIDOC-CRM subsets [EFO+08].

2.4.4 Case study: The Construction of FEUP in CIDOC-CRM terms

In order to better understand the CIDOC-CRM model, a draft case study was developed. Its objective is to represent the construction of FEUP using entities and properties defined in the CIDOC-CRM model.

This case study focuses on recording the event of the beginning of the construction of FEUP in space and time, and as such it relies more heavily on entities that represent space and time.

To explain the model: first there are the "E5 Event" nodes "FEUP Construction" and "Laying of the First Stone in FEUP", which are connected via an instance of the property "P9 consists of". In this context, events are changes of state in physical, social or cultural systems. These nodes connect to others of the type "E52 Time-Span" through the property "P4 has time-span". The Time-Span nodes represent intervals of time of any kind; in this case they are the nodes "From 27-09-1996 to 22-03-2001", "22-03-2001" and "27-09-1996". These time spans can be contained within one another, a relationship expressed through the "P86 falls within" property. Coming back to the Event nodes, these can be linked to "E53 Place" nodes through the "P7 took place at" property. The Place nodes represent non-physical denominations of space, and in this model there is only one, "The entirety of the Campus of FEUP". To actually locate the premises of FEUP, "E41 Appellation" entities are used. These serve to identify elements in a specific context, in this case to make the location of FEUP understandable given the address, coordinates and name of the location. Place connects to Appellations through "P87 is identified by", and Appellations connect to each other through "P139 has alternative form". It is important to note that the published version of CIDOC-CRM contains entities for specific appellations, for example coordinate appellations; however, we are not using them for two reasons: first, to show that one can use entities at more general levels and still be model compliant; and second, because the current (not yet published) version of the model has deprecated those specific kinds of appellations, so when the new version is published this case study would otherwise no longer be CIDOC-CRM compliant.
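The statements above can be written down as plain subject-property-object triples. The following sketch uses the node names from the case study; the exact attachment of each time-span to each event follows one plausible reading of the description, and the appellation node names ("FEUP address", "FEUP coordinates") are invented placeholders:

    # The case-study graph as (subject, property, object) triples, using the
    # node names from the text. Property labels follow the wording above; the
    # wiring of time-spans and the appellation names are illustrative assumptions.
    triples = [
        ("FEUP Construction",                 "P9 consists of",           "Laying of the First Stone in FEUP"),
        ("FEUP Construction",                 "P4 has time-span",         "From 27-09-1996 to 22-03-2001"),
        ("Laying of the First Stone in FEUP", "P4 has time-span",         "27-09-1996"),
        ("27-09-1996",                        "P86 falls within",         "From 27-09-1996 to 22-03-2001"),
        ("22-03-2001",                        "P86 falls within",         "From 27-09-1996 to 22-03-2001"),
        ("FEUP Construction",                 "P7 took place at",         "The entirety of the Campus of FEUP"),
        ("The entirety of the Campus of FEUP","P87 is identified by",     "FEUP address"),
        ("FEUP address",                      "P139 has alternative form","FEUP coordinates"),
    ]

    # Entity classes for each node, as described in the case study.
    types = {
        "FEUP Construction": "E5 Event",
        "Laying of the First Stone in FEUP": "E5 Event",
        "From 27-09-1996 to 22-03-2001": "E52 Time-Span",
        "27-09-1996": "E52 Time-Span",
        "22-03-2001": "E52 Time-Span",
        "The entirety of the Campus of FEUP": "E53 Place",
        "FEUP address": "E41 Appellation",
        "FEUP coordinates": "E41 Appellation",
    }

    for subject, prop, obj in triples:
        print(f"({types[subject]}) {subject} -- {prop} --> ({types[obj]}) {obj}")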

2.5 The case of Portuguese national archives

2.5.1 Current archive structure

Currently, most national archives in Portugal are controlled and coordinated by DGLAB, whose objectives are structuring and promoting the interventions of the archival policies made by the Portuguese State, administering those same policies after their implementation, and safeguarding the Portuguese archival heritage while maintaining its accessibility and dissemination.


Among these is the Torre do Tombo Portuguese National Archive, which preserves archival records from the ninth century to the present. Its responsibilities include the preservation and promotion of accessibility of archival and photographic records, functioning as an archive for both administrative and historical records, and the application of the state laws that integrate policies for the safeguarding of cultural heritage, as well as other legislation targeting archival and photographic records [Pr12b].

2.5.2 The DigitArq platform

The DigitArq platform is a publicly searchable digital archive that makes documents and records existing in different Portuguese national archives, such as the ANTT, available through an online platform.

The data model of DigitArq was designed to represent the structure of the archive and its metadata records in compliance with the ISAD and ISAAR standards; additionally, the software can also exchange information via standards such as EAD and EAC [FFR10]. This allows it to keep to an international standard of archival description organization. Thus, it can be applied to the data in any of the national archives, while establishing interoperability between their data records and allowing searches across all archives to be performed in a consistent manner.

DigitArq succeeded in several of the goals set at its creation. Firstly, before DigitArq, a large number of different finding aids were being adopted by archivists, creating a heterogeneous environment that made archives harder to manage over time; DigitArq was meant to homogenize the collection of finding aids and create a digital archive [FR04]. Additionally, DigitArq was held as a possible solution for the preservation of archival records, replacing the digitization methods used before it, which included CD-ROMs and digital copies of books created on an ad-hoc basis [FR04]. Today we can observe that these goals were met with success, as DigitArq is used by several archives around the country managed by DGLAB, creating consistency between finding aids not only in the first archive where DigitArq was implemented but, over time, at the national level. The preservation effort was also successful, as the ANTT's records in Digitarq alone contain close to a million entries, many of them with images of the contents of the record.

DigitArq's solution to synchronize a hierarchical model with a relational database was a middleware layer based on a set of methods meant to manipulate the hierarchical structure, loaded through an abstract class named LazyNode [FR04]. Through these methods one can control the child nodes of a given entity, create and clone nodes, and obtain the parents of a specific node. That abstract class is implemented by two classes, EADLazyNode and SQLLazyNode. The first class's objective is the control of EAD/XML files, while SQLLazyNode is responsible for managing the communication between the DigitArq application and the relational database [FR04].
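A minimal sketch of this middleware pattern is given below; only the class names (LazyNode, EADLazyNode, SQLLazyNode) and their responsibilities come from the cited description, while the method names and stub bodies are assumptions for illustration:

    # Sketch of the DigitArq middleware pattern described above. Only the class
    # names and their responsibilities come from the text; method names and
    # stub bodies are illustrative assumptions.
    from abc import ABC, abstractmethod
    from typing import List, Optional

    class LazyNode(ABC):
        """A node of the archival hierarchy, loaded lazily from some backend."""

        @abstractmethod
        def children(self) -> List["LazyNode"]: ...
        @abstractmethod
        def parent(self) -> Optional["LazyNode"]: ...
        @abstractmethod
        def create_child(self, metadata: dict) -> "LazyNode": ...
        @abstractmethod
        def clone(self) -> "LazyNode": ...

    class EADLazyNode(LazyNode):
        """Controls EAD/XML finding-aid files."""
        def children(self): return []            # would parse nested <c> components
        def parent(self): return None
        def create_child(self, metadata): raise NotImplementedError
        def clone(self): raise NotImplementedError

    class SQLLazyNode(LazyNode):
        """Manages communication with the relational database."""
        def children(self): return []            # would issue a parent-id query
        def parent(self): return None
        def create_child(self, metadata): raise NotImplementedError
        def clone(self): raise NotImplementedError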

However, its reliance on a standard with a hierarchical model brings several problems associated with the difficulty of establishing relationships between records that fall outside the traditional hierarchical representations. At the same time, it complicates the discovery of relationships between records, for example due to the existence of a common entity that relates them. Additionally, the description of the contents of the records relies on a textual field that can be too long, and the ANTT wants to increase the atomicity of these contents in order to make the creation of records simpler, link related contents, and create the possibility of using modern machine learning techniques to research this data. Finally, the use of ISAD(G), ISAAR, EAD and EAC means that users without knowledge of these standards have a harder time interpreting the contents of the archives, and the ANTT hopes to expand its base of third-party users. These reasons are part of the motivation for DGLAB to consider migrating Digitarq's implemented model, which is relational and hierarchical, to a graph counterpart built on CIDOC-CRM. The Digitarq platform currently exposes its record data to external services through OAI-PMH, and it is expected that in the future, after the model is implemented through CIDOC-CRM, this information will be made available through a SPARQL endpoint. However, due to the need to keep the system interoperable with already integrated external services, such as Europeana, a translation mechanism between RDF and OAI-PMH records, also known as a crosswalk, will also have to be developed, despite some expected loss of information. The reason for wanting to shift from OAI-PMH to Linked Open Data comes from the fact that, while Linked Open Data allows for direct interrogation of the server, OAI-PMH requires harvesting all content from a provider before being able to query it [HS08].
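As a rough sketch of the kind of crosswalk mentioned above, the function below flattens a graph-style description into a small Dublin-Core-like record suitable for OAI-PMH exposure; the field names and mapping rules are assumptions for illustration only, and a real crosswalk would be defined against the actual data model, accepting some loss of information as noted:

    # Illustrative crosswalk: flattens a graph-style description into a small
    # Dublin-Core-like record for OAI-PMH exposure. Field names and mapping
    # rules are assumptions, not the mapping used by the project.
    def crosswalk_to_dc(graph_record: dict) -> dict:
        return {
            "dc:title": graph_record.get("appellation"),
            "dc:creator": graph_record.get("carried_out_by"),
            "dc:date": graph_record.get("time_span"),
            "dc:description": " ".join(graph_record.get("notes", [])),
        }

    example = {
        "appellation": "Letter of 1905",
        "carried_out_by": "Example creator",
        "time_span": "1905",
        "notes": ["Illustrative record only."],
    }
    print(crosswalk_to_dc(example))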

2.5.3 Moving towards a graph model built on CIDOC-CRM

As mentioned before, DigitArq relies on a hierarchical model based on ISAD(G) on top of a relational database. Applying both of these to archive databases can lead to complications due to two major aspects: the complexity of resource interlinking and performance problems due to fast data growth. Additionally, the nature of archives means that new fields and modifications to the database structure might become necessary, and the model applied in Digitarq is inflexible in those terms.

A possible solution to these issues is the adoption of a graph data model, implemented in a database such as Neo4j or Virtuoso, and guided by the CIDOC-CRM ontology. Firstly, graph models like CIDOC-CRM are effective at interlinking different resources thanks to their diverse set of entities and properties that create relationships between them. Secondly, the nature of graph databases allows them to deal with relationships between multiple resources of multiple types better than relational databases, and they are easier to modify to cope with model changes, should the need arise.

In conclusion, a migration to a graph database can resolve both performance and post-development issues, which increases the functional scalability of the database. At the same time, a change to a graph database allows simpler interlinking of the entities in the database, which not only makes searches more intuitive for users, but also makes the entities directly queryable by external systems without the need for periodic harvesting, improving the interoperability of ArchGraph compared to the current DigitArq solution.


2.6 The EPISA Project

EPISA (Entity and Property Inference for Semantic Archives) is a national project, funded by FCT under the "Data Science and Artificial Intelligence in the Public Administration" 2018 call, running from January 2019 to December 2021. The project partners are INESC TEC, the University of Évora and DGLAB, the public administration partner. DGLAB manages the National Archive of Torre do Tombo (TT), which holds the largest and most relevant national cultural heritage collection, a substantial part of which is digitized. The TT assets are accessed by history researchers, history enthusiasts and the general public from the Portuguese-speaking countries and beyond.

The vast amounts of archival description metadata help users find and contextualize the documents they seek. In a pioneering initiative in the archival world, TT designed its online description system 20 years ago, according to the standards of the International Council on Archives (ICA). Metadata in TT is mainly composed of textual descriptions of the context and contents of the documents. Meanwhile, the archival assets evolved to encompass growing amounts of born-digital information, and the interoperability requirements of cultural heritage repositories grew. A new generation of description tools is needed, one that encompasses libraries, archives and museums and is more fine-grained, more flexible and especially more machine-actionable. These are the characteristics of linked open data in semantic networks.

Preliminary work in TT led to the choice of the CIDOC Conceptual Reference Model (CRM), a standard developed in the museum community. The conceptual model of CIDOC CRM is a graph where nodes are entities and edges are relations. The huge step represented by such a paradigm shift raises many issues, some of which the EPISA project is addressing.

The project progresses along three lines. The first concerns the data model for archival description. TT proposed CIDOC CRM as a base model, but archival description goes beyond its expressive power. The project has to consider the content of actual archival records, test the CIDOC CRM ontology on them and propose the adoption of complementary ontologies as necessary. Moreover, existing records have to be migrated from the ISAD/ISAAR model to the target one, and new algorithms have to be devised to handle the transformation to this more fine-grained representation. The second line concerns the database that embodies the data model. The starting point is a graph model that natively supports the semantic web representations resulting from the migration. Technologies in this area are still evolving, and plenty of hypotheses and tests are required to assess the robustness and scalability of the candidate solutions. The third line focuses on the prototype applications that add to and explore the information in the archival records. It is supported by the knowledge graph and provides interfaces for archival professionals who edit records and create new ones, archive managers who keep track of the assets and their use, and various search modes to serve professional and casual users alike.

As of July 2019, the project has just concluded its first semester of activity, concentrating on two tasks: the experiments leading to a CIDOC-CRM-compliant archival model and the first prototype of the knowledge graph. This work focuses on the latter, and will frequently mention its coordination with the data model task, delivered in the context of an MSc dissertation in Information Science.

Figure 2.2: Representation of the EPISA Project's goals

Figure 2.2 presents the four main goals that the EPISA project aims to tackle: the creation of a CIDOC-CRM-compliant archival model and the development of a knowledge graph containing the representation of the archival contents of the ANTT, structured through the built data model. Through that graph and model, the project also aims to apply machine learning techniques to extract entities and properties from the textual descriptions already existing in DigitArq and, finally, to provide visualization of the data in different layers, presenting the information adapted to each type of user, whether archival experts or lay people in terms of archival descriptions and standards.


2.7 Database discussion

In order to create a vertical prototype that uses a CIDOC-CRM-compliant model to store archival descriptions, a database had to be created. However, the fact that CIDOC-CRM is a knowledge graph model made the choice of database technology unclear, given the significant structural difference from the most common relational databases. As such, an examination of database technologies had to be performed, comparing relational and graph databases and, in the latter case, the differences between triple stores and labeled property graphs.

2.7.1 Relational Database

Part of the requirements set by the Torre do Tombo includes the use of the CIDOC-CRM ontology to model the new database, replacing the ISAD(G) norms, and CIDOC-CRM was created with a philosophy that takes direct inspiration from object-oriented programming practices [EFO+08]. To achieve this, it relies on entities and properties, where the properties are essentially relationships between the entities. This complicates the implementation on relational databases, as each CIDOC-CRM entity contains a small amount of information and relies on its relations to produce substantial knowledge. The complication arises from the need to query data linked to such an extent, which requires costly and complicated JOIN operations through inferences created via foreign keys or associative entity tables [VWA+15]. This muddles the comprehension of the connections between data and affects performance when running queries that pertain to the connections between tables [VMZ+10]. Another issue arises from the need to represent complex data types, as multiple tables and foreign keys become necessary [RWE15], as well as multiple properties between entities whose retrieval requires several join tables and heavy join operations [VMZ+10].
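A small sqlite3 sketch of the kind of schema this entails, with an entity table plus an associative table for the properties between entities, shows how even a two-hop question already requires chained JOINs. The table and column names are illustrative assumptions, not the actual DigitArq schema.

import sqlite3

# Illustrative schema (not DigitArq's): one table for entities and an
# associative table holding the CIDOC-CRM-style properties between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (
        id    INTEGER PRIMARY KEY,
        class TEXT NOT NULL,          -- e.g. 'E31_Document', 'E21_Person'
        label TEXT
    );
    CREATE TABLE property (
        subject_id INTEGER REFERENCES entity(id),
        predicate  TEXT NOT NULL,     -- e.g. 'P14_carried_out_by'
        object_id  INTEGER REFERENCES entity(id)
    );
""")

# A two-hop question ("which persons are linked to the documents of a given
# fonds?") already needs the associative table joined twice.
rows = conn.execute("""
    SELECT person.label
    FROM entity AS fonds
    JOIN property AS p1     ON p1.subject_id = fonds.id
    JOIN entity   AS doc    ON doc.id = p1.object_id
    JOIN property AS p2     ON p2.subject_id = doc.id
    JOIN entity   AS person ON person.id = p2.object_id
    WHERE fonds.label = 'Example fonds'
      AND person.class = 'E21_Person'
""").fetchall()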

CIDOC-CRM creates two more issues that make the use of relational databases quite complex. The first is that the model tends to change frequently, since the organization that maintains CIDOC-CRM is continuously improving it, which forces any compliant database schema to go through regular changes as well. In relational databases the schema needs to be defined upfront, and the rigidity of the relational model increases the difficulty of performing such revisions [Mil13]. Secondly, CIDOC-CRM implements class inheritance between its entities: every entity is supposed to be able to be the subject of properties that have its super-classes as their domain (and likewise for their range). This would further complicate the schema and increase even further the cost of the JOIN queries.
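The effect of this inheritance can be sketched in a few lines of Python: a property whose range is a general class (here E39 Actor) must also accept instances of any of its subclasses (such as E21 Person). The two-entry excerpt of the class hierarchy follows CIDOC-CRM; the checking function itself is only an illustration.

# Partial excerpt of the CIDOC-CRM class hierarchy: E21 Person is a
# subclass of E39 Actor, so any property whose range is E39 Actor must
# also accept an E21 Person as its object.
SUPERCLASSES = {
    "E21_Person": {"E39_Actor", "E20_Biological_Object", "E1_CRM_Entity"},
    "E39_Actor": {"E1_CRM_Entity"},
}

# P14 carried out by has range E39 Actor.
PROPERTY_RANGE = {"P14_carried_out_by": "E39_Actor"}


def object_allowed(prop: str, obj_class: str) -> bool:
    """True if obj_class is the declared range of prop or a subclass of it."""
    declared = PROPERTY_RANGE[prop]
    return obj_class == declared or declared in SUPERCLASSES.get(obj_class, set())


print(object_allowed("P14_carried_out_by", "E21_Person"))  # True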

Relational databases have the major advantage of being able to maintain the referential integrity of the model, as the database itself performs validation thanks to the normalized schema. Additionally, relational databases rely on a query language, SQL, which has become an internationally known and widely used language with major support.

Relational databases can be transactional, meaning that they perform write transactions in an all-or-nothing fashion, ensuring that either all updates made during a transaction are successful or the database is rolled back to the state it was in before [ÖV11]. This is an important point for the ANTT, which expects heavy concurrent updates to the database.
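The all-or-nothing behaviour can be illustrated with sqlite3, whose connection object doubles as a transaction context manager: if any statement in the block fails, every update made inside it is rolled back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.commit()

try:
    # Using the connection as a context manager wraps the block in a single
    # transaction: commit on success, rollback if an exception is raised.
    with conn:
        conn.execute("INSERT INTO record (title) VALUES (?)", ("First update",))
        conn.execute("INSERT INTO record (title) VALUES (?)", (None,))  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the failing statement aborts the whole transaction

# The first insert was rolled back together with the failing one.
print(conn.execute("SELECT COUNT(*) FROM record").fetchone()[0])  # prints 0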

2.7.2 Triple Store

A triple store is a graph database designed for the storage and retrieval of triples, data entities that represent a subject-predicate-object relationship corresponding to the definition put forth by the RDF graph standard [Rus04]. An RDF graph is composed of nodes and of directed, labeled arcs connecting pairs of nodes; each arc together with its two nodes forms an RDF triple, connecting subject, predicate and object. All nodes are RDF URI references, RDF literals or blank nodes, while predicates are RDF URI references that can be interpreted either as a relationship between two nodes or as a defining attribute value.
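A minimal rdflib sketch illustrates this subject-predicate-object structure. The class and property identifiers are taken from the published CIDOC-CRM RDF encoding, while the instance URIs and the https://example.org/archgraph/ namespace are placeholders.

from rdflib import Graph, Literal, Namespace, RDF

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")  # CIDOC-CRM vocabulary
EX = Namespace("https://example.org/archgraph/")        # placeholder namespace

g = Graph()
# Each call adds one triple: (subject, predicate, object).
g.add((EX.document_001, RDF.type, CRM.E31_Document))                  # typing triple
g.add((EX.document_001, CRM.P1_is_identified_by, EX.identifier_001))  # object is a resource
g.add((EX.identifier_001, RDF.type, CRM.E42_Identifier))
g.add((EX.identifier_001, CRM.P190_has_symbolic_content,
       Literal("example-reference-code")))                            # object is a literal

print(g.serialize(format="turtle"))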

RDF triple stores have a major advantage when considered for the database implementation of a CIDOC-CRM-compliant model: CIDOC-CRM has been published as an OWL (Web Ontology Language) document. An ontology explicitly represents the meaning of terms in vocabularies and the relationships between those terms [aOTW+19], and OWL is built on XML standards for RDF [04712]. Because of this publication, using a triple store database would bring all the natural advantages of adopting the environment for which the model was intended.

Triple store databases also have the advantage of using the SPARQL query language, which is currently the standard query language for any RDF-based database. SPARQL is a powerful language that can be used to explore the database in depth; it is well known and used across different systems, and as such maintains flexibility for future changes and does not tie the platform down to a single technological system.

Additionally, triple stores have the capacity for graph inference, which is the ability to predict the existence or non-existence of edges between a specified set of nodes based on the knowledge already present in the graph. This can be useful to determine relationships and information based on the preexisting logic of the graph [VY05]. Other tools that triple stores offer to find information based on its content are graph traversal algorithms, which on triple stores tend to have logarithmic costs [Pok15, Ang12]. Certain triple store technologies also support plugins such as Gremlin, which could potentially improve the costs of graph traversal. It should be noted that graph traversal efficiency is an important aspect to take into consideration: while it is not currently a requirement for the prototype, the need to traverse the graph for knowledge gathering and for machine learning techniques that rely on traversal is expected in the future.

Triple store databases are normally not transactional, with certain exceptions that support ACID-compliant implementations. They are better suited to scenarios where the number of reads far outweighs the number of writes (which are usually bulk loads), making them fit for a decision-support role, similar to Data Warehouses, which are commonly the destination of periodic ETL processes.

2.7.3 Labeled Property Graph and Neo4j

Neo4j is a graph database technology known for being one of the most popular choices among graph database management systems. This popularity stems from several factors, many of which were also reasons why this technology was chosen to host the database containing the CIDOC-CRM model applied to the Torre do Tombo archival information.

To make these factors clearer, we compare it to two similar graph databases, ArangoDB and OrientDB. Both of these, like Neo4j, are open-source graph databases, which is important to note since one of the non-functional requirements set by the ANTT was the exclusive use of open-source technology. All of them feature flexible schema options, with the possibility of being schema-less or not. OrientDB and ArangoDB use SQL or SQL-like languages to query data, which is viewed as a positive aspect due to SQL's wide use and therefore familiar syntax, with OrientDB also being able to use Gremlin and Java alongside it. Neo4j has better tools for performing backups than both ArangoDB and OrientDB, with OrientDB's backup system being difficult to understand. Neo4j and ArangoDB are on par in terms of scalability, with OrientDB requiring a distributed architecture to reach the same level. Neo4j does not support sharding, unlike the others [FB18].

As such, the three databases are not so different in overall quality; however, the choice of Neo4j as the graph database to compare against other kinds is justified by the fact that Neo4j has the most support behind it, with a strong community, frequent updates and a good number of both first- and third-party plugins that extend the system's functionality. Additionally, Neo4j provides an interface named Neo4j Browser, which allows easy monitoring and manipulation of the database, permitting a higher degree of control during development without much strain, as well as faster ways to demonstrate results.

Neo4j has a substantial amount of learning materials, from formal tutorials to informal community questions, which eases the learning of its structure and of its query language, Cypher. In addition, Cypher shares similarities with SQL, which makes it simpler to learn for anyone with previous knowledge of the more widely used SQL.
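The following sketch shows Cypher's SQL-like pattern syntax, sent through the official Neo4j Python driver; the connection details, node labels and property names are illustrative assumptions rather than the prototype's actual schema.

from neo4j import GraphDatabase

# Connection details are placeholders for a local development instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MATCH/WHERE/RETURN reads much like SELECT/FROM/WHERE in SQL, but
    # relationships are written directly as graph patterns instead of JOINs.
    result = session.run(
        """
        MATCH (c:E65_Creation)-[:P94_has_created]->(doc:E31_Document),
              (c)-[:P14_carried_out_by]->(person:E21_Person)
        WHERE person.name = $name
        RETURN doc.identifier AS identifier
        """,
        name="Example Person",
    )
    for record in result:
        print(record["identifier"])

driver.close()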

Neo4j has an additional advantage when compared to technologies in the same technical field. Most NoSQL databases traded the transactional attributes found in most databases, such as ACID support, for higher performance and scalability. This can cause clients to read stale data under eventual consistency, or sacrifice the durability of data for faster performance. Neo4j is able to combine native graph storage, with the scalability and optimized performance expected of graph databases, with ACID compliance. It ensures Atomicity by wrapping database operations within a single transaction and making sure that, if one operation fails, the entire transaction is rolled back; it ensures Consistency by making sure that every client accessing the graph database always sees the latest updated data; and it ensures Isolation by having all operations in a single transaction.
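A sketch of how this atomic behaviour is exposed through the Neo4j Python driver follows: all writes go through a single transaction function, and if any statement raises an exception the whole unit is rolled back (recent drivers call this execute_write; older versions name it write_transaction). Labels and properties are again illustrative.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))


def create_document_with_identifier(tx, code):
    # Both statements below run inside one transaction: either the document,
    # the identifier and the relationship are all created, or nothing is.
    tx.run("CREATE (d:E31_Document {ref: $code})", code=code)
    tx.run(
        """
        MATCH (d:E31_Document {ref: $code})
        CREATE (d)-[:P1_is_identified_by]->(:E42_Identifier {content: $code})
        """,
        code=code,
    )


with driver.session() as session:
    # execute_write commits only if the whole function completes without
    # raising; otherwise the transaction is rolled back (and retried on
    # transient errors).
    session.execute_write(create_document_with_identifier, "example-reference-code")

driver.close()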
