
ESTATÍSTICA - INE

Ray Willy Neiheiser

BYZANTINE FAULT TOLERANT ARCHITECTURE

FOR GEOGRAPHICALLY DISTRIBUTED GRAPH

DATABASES

Florianópolis (SC)

2017


BYZANTINE FAULT TOLERANT ARCHITECTURE

FOR GEOGRAPHICALLY DISTRIBUTED GRAPH

DATABASES

Dissertation submitted to the Programa de Pós-Graduação em Ciência da Computação in partial fulfillment of the requirements for the degree of Master in Computer Science.

Advisor: Luciana de Oliveira Rech, Dr.

Florianópolis (SC)

2017


Neiheiser, Ray

Byzantine Fault Tolerant Architecture for Geographically Distributed Graph Databases / Ray Neiheiser ; advisor, Luciana de Oliveira Rech, 2017.

99 p.

Dissertation (Master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2017.

Includes references.

1. Computer Science. 2. Distributed systems. 3. Byzantine fault tolerance. 4. Graph databases. 5. Geographic replication. I. de Oliveira Rech, Luciana. II. Universidade Federal de Santa Catarina. Programa de Pós-Graduação em Ciência da Computação. III. Title.


Byzantine Fault Tolerant Architecture in Geographically

Distributed Graph Databases

This dissertation was judged adequate for obtaining the Master's degree and approved in its final form by the Programa de Pós-Graduação em Ciência da Computação.

Florianópolis, ____ of __________________, _________.

__________________________

Prof. José Luís Almada Güntzel, Dr.

Program Coordinator

Examining Board:

_________________________

Profa. Luciana de Oliveira Rech, Dra.

Universidade Federal de Santa Catarina

Advisor

_________________________

Prof. Carlos Alberto Maziero, Dr.

Universidade Federal do Paraná (videoconference)

_________________________

Prof. Frank Augusto Siqueira, Dr.

Universidade Federal de Santa Catarina

_________________________

Prof. Ronaldo dos Santos Mello, Dr.

Universidade Federal de Santa Catarina


I thank my girlfriend, who always supported me; my parents, who enabled me to study; Professor Lau Cheuk Lung for his dedication and for making this experience possible; and CAPES for financing my research.


Graphs are extremely flexible and powerful data structures for representing relations between objects. For many applications that present highly connected information, modeling the data as graphs is natural. Beyond mathematics and computing themselves, graphs are used in several areas such as Biology and Economics. In recent years, with the emergence of the Web and of social networks, interest in manipulating very large graphs has increased. The need to store and retrieve information is very common in applications whose data are modeled as graphs. However, the inherent characteristics of graphs, such as the high connectivity between the data, pose challenges to general-purpose databases, such as relational databases and a good part of the so-called NoSQL databases. Therefore, to handle graphs efficiently, several graph-specific databases have been released in recent years, such as Neo4j, Titan, and OrientDB. When compared with relational models or other NoSQL databases, these graph databases present much better performance in graph queries and manipulation (VICKNAIR; AL., 2010).

It is common for graph databases to implement some fault tolerance mechanism. In general, however, they support only the simplest crash fault models and do not support more complex fault models such as byzantine faults. The result of this kind of fault may range from prolonged system unavailability to corruption of the database. This is particularly important in critical applications, where even small failures in the stored data may cause serious harm to the organization that uses the database.

This work presents BAG (Byzantine fault-tolerant Architecture for Graph databases), a distributed architecture for graph databases capable of tolerating a broad fault model, including byzantine faults. Besides the architecture, the algorithms that guarantee the replication of transactions, using the deferred update technique, are presented. To evaluate the architecture and the algorithms, a prototype was built and submitted to a detailed experimental evaluation.

The experiments were carried out on a cluster of the Instituto de Engenharia de Sistemas e Computadores - Investigação e Desenvolvimento (INESC-ID) / Universidade de Lisboa, Portugal. The results of this evaluation show that the model presents a very small overhead when compared with existing graph databases, which makes the cost of byzantine fault tolerance acceptable, and that it presents superior scalability, in terms of efficiency, to the crash fault-tolerant replication of the market leader Neo4j.

The proposed algorithm and architecture perform very well in the typical use cases of graph databases, presenting a relatively low cost because the architecture does not consider simultaneous byzantine faults of the leaders, an extremely improbable event.

Along with the growth in the number of Internet users, data sizes and their connections have grown substantially in the last decades, which made Big Data one of the main topics in computer science. As a consequence, research began to seek highly available services with short response times in the area of highly connected data, which led to the development of today's graph databases. Unfortunately, for the most common NoSQL graph databases only a few fault tolerance solutions have been proposed, and byzantine faults, which may affect the consistency of the whole system, have received no attention at all in the field of distributed graph databases. This work, therefore, proposes a byzantine fault tolerance algorithm in the area of graph databases. Due to performance concerns in the field of fault tolerance, we also propose a flexible hierarchical architecture capable of tolerating byzantine faults with a significantly smaller performance overhead in the typical use case compared with the flat architecture.

Keywords: Fault tolerance, distributed systems, state machine replication, deferred update, graph databases, geographic replication.

Along with the increase in Internet user numbers, data sizes and their connections grew massively in the last few decades, and Big Data emerged as one of the main topics in computer science. Due to this development, concerns arose about how to offer highly available services with short response times in the field of highly connected data. This caused the development of today's graph databases. Unfortunately, the most common NoSQL graph databases have not received much attention from researchers yet, and only a few solutions for fault tolerance have been proposed in this area. Byzantine failures, which may affect the consistency of the whole system, have received no attention at all in the field of distributed graph databases. This work, therefore, proposes a solution which implements an algorithm for byzantine fault tolerance in the area of graph databases. Due to the performance concerns in the field of fault tolerance, we also propose a flexible hierarchical architecture which is able to endure byzantine failures with a significantly lower performance overhead in the typical use case compared to the flat counterpart.

Keywords: Fault-tolerance, distributed systems, state machine replication, deferred update, graph databases, geographic replication.

Figure 1: Termination protocol
Figure 2: Decision with two normal and one byzantine server
Figure 3: Decision with three normal and one byzantine server
Figure 4: Example Graph
Figure 5: Directed Graph
Figure 6: Undirected Graph
Figure 7: Weighted Graph
Figure 8: Unweighted Graph
Figure 9: Property Graph
Figure 10: Hyper Graph
Figure 11: BFT-DU sequence (PEDONE; SCHIPER, 2012a)
Figure 12: Byzantinum architecture (GARCIA; RODRIGUES; PREGUICA, 2011)
Figure 13: CloudBFT Architecture (NOGUEIRA; ARAUJO; BARBOSA, 2014)
Figure 14: ReBFT workflow (DISTLER; CACHIN; KAPITZA, 2016)
Figure 15: Augustus Architecture (PADILHA; PEDONE, 2013)
Figure 16: The proposed architecture
Figure 17: Architecture Components
Figure 18: Execution Sequence Example
Figure 19: Execution Sequence Example - Visualized
Figure 20: BAG with crash in local and global
Figure 21: BAG with byzantine in global and crash in local
Figure 22: BAG with byzantine in global and crash in local
Figure 23: BAG with crash in global and byzantine in local
Figure 24: BAG with crash in global and byzantine in local
Figure 25: Crash failure of local cluster primary
Figure 26: Unrecoverable failure
Figure 27: Byzantine failure in primary
Figure 28: Common Interface
Figure 29: Percentages of writes
Figure 33: Read-only Transactions
Figure 34: Crash Fault Tolerant
Figure 35: Byzantine fault-tolerance
Figure 36: Byzantine fault-tolerance
Figure 37: Scaling Byzantine fault-tolerance
Figure 38: Mean throughput

Table 1: Graph Database Comparison
Table 2: Comparing BFT Database Protocols
Table 3: Comparing BFT Protocols/Architectures

RDBMS - Relational Database Management Systems
RDF - Resource Description Framework (Triples)
LOD - Linked Open Data
OLAP - Online Analytical Processing
BFT-DU - Byzantine fault-tolerant deferred update replication
MITRA - Middleware for Interoperability on TRAnsactional replicated databases
BAG - Byzantine fault tolerant Architecture for distributed Graph databases

1 INTRODUCTION
1.1 OBJECTIVES
1.2 ORGANIZATION
2 FAULT TOLERANCE
2.1 FAILURES
2.2 FAULT TYPES
2.2.1 Failure types
2.3 FAULT-TOLERANCE
2.4 CRASH FAULT TOLERANCE
2.5 FAILURE DETECTION
2.6 REPLICATION TECHNIQUES
2.6.1 Deferred update propagation
2.6.1.1 Termination protocol
2.6.2 State-machine replication
2.7 ATOMIC BROADCAST
2.8 FINAL CONSIDERATIONS
3 BYZANTINE FAULT TOLERANCE
3.1 SPECULATIVE PROTOCOLS
3.2 N-VERSION PROGRAMMING
3.3 BFT-SMART
3.3.1 Design Principles
3.3.2 Modules
3.3.3 Evaluation
3.4 FINAL CONSIDERATIONS
4 NOSQL - GRAPH DATABASES
4.1 GRAPH DATA MODELS
4.2 GRAPH STORAGE
4.3 GRAPH PROCESSING
4.4 GRAPH COMPUTE ENGINES AND GRAPH DATABASES
4.5 GRAPH THEORY
4.6 GRAPH DATABASE SOLUTIONS
4.6.1 Neo4j
4.6.2 OrientDB
4.6.3 Titan
4.6.4 ArangoDB
4.6.5 Sparksee
4.6.6 Final Considerations
5 RELATED WORK
5.2.1 BFT-DU
5.2.2 MITRA
5.2.3 Byzantinum
5.2.4 CloudBFT
5.2.5 ReBFT
5.2.6 Augustus
5.2.7 Byzantine failures in WAN
5.3 COMPARISON
5.4 FINAL CONSIDERATIONS
6 PROPOSAL
6.1 BAG
6.2 DIFFERENT CASE MODELS
6.2.1 Case 1: Crash/Crash
6.2.2 Case 2: Byzantine/Crash
6.2.3 Case 3: Byzantine/Byzantine
6.2.4 Case 4: Crash/Byzantine
6.2.5 Case 5: Byzantine/Byzantine with 3 datacenters
6.3 READ-ONLY TRANSACTIONS
6.4 BEHAVIOR UNDER FAILURES
6.4.1 Crash failures
6.4.2 Byzantine failures
6.5 ALGORITHM
6.6 FINAL CONSIDERATIONS
7 PERFORMANCE EVALUATION
7.1 EVALUATION
7.1.1 Justifying write levels
7.1.2 Hierarchical vs Flat Atomic Broadcast
7.1.3 Read-only Transactions
7.1.4 Single Datacenter
7.1.4.1 Crash faults
7.1.4.2 Byzantine Faults
7.1.5 Byzantine Faults / Multiple Datacenters
7.1.6 Scaling Byzantine Faults
7.1.7 Alternative Approaches
7.2 FINAL CONSIDERATIONS
8 CONCLUSIONS
8.1 SCIENTIFIC CONTRIBUTIONS

1 INTRODUCTION

Today's data is usually stored in relational databases: Relational Database Management Systems (RDBMS) like Oracle, MySQL, and Microsoft SQL Server are the most common way of storing data (DBRANKING, 2016). But with the recent growth of Internet user numbers, which more than tripled in the last decade, concerns increased about how to handle these immense data flows (INTERNETLIVESTATS, 2016).

Unfortunately, with large amounts of highly connected data, relational databases and also most NoSQL databases, like document or value stores, often do not offer a solution for storing this intensively connected data with plausible query times (VICKNAIR; AL., 2010). However, graph databases offer an excellent way to store and retrieve highly connected data. Vicknair et al. (VICKNAIR; AL., 2010) compare Neo4j, a graph database, with MySQL, a relational database. In their experiments, they discovered that graph databases show great performance advantages for queries traversing the graph or searching strings. For instance, while MySQL needs 69.8 seconds to traverse the graph to a depth of 128, Neo4j needs only 18 seconds. Today, most of the biggest Internet companies have developed their own graph databases (FACEBOOK, 2016; TWITTER, 2016; BLOGSPOT, 2016; SHAO HAIXUN WANG, 2012). Graph databases are on the rise and find new use cases every day.

Taking a look at the client lists of the biggest graph databases, e.g. Neo4j or OrientDB, one can notice that these graph databases are often used to solve problems requiring high consistency, security, and scalability at the same time (NEO4J, 2017d). Neo4j customers, such as eBay or Global 500 Logistics, often use their graph database to offer real-time routing and delivery services to their clients, and have chosen Neo4j due to its great scalability. In addition, there are security firms, investigation units, and huge companies, e.g. Sky, Comcast, Warner, or even the United Nations, among the customers of Neo4j and OrientDB (NEO4J, 2017d; ORIENTDB, 2016a, 2016b). A lot of these services require 24/7 availability and solid consistency in order to protect their clients and offer an excellent service.

The most common approach to increase availability is replication. However, processes tend to present errors, which may lead to failures that affect the whole system. Therefore, in the last two decades, several studies have developed fault-tolerant replication in relational databases (EKWALL; SCHIPER, 2011; GUERRAOUI; SCHIPER, 1996; SUTRA; SHAPIRO, 2008).

Additionally, others proposed fault-tolerance support for different systems, e.g. graph processing, like the work of Wang et al. (WANG et al., 2014). Another trend is fault-tolerant protocols which implement deferred update with atomic broadcast, as in (HADZILACOS; TOUEG, 1994; PEDONE; GUERRAOUI; SCHIPER, 1998; SCIASCIA; PEDONE; JUNQUEIRA, 2012a).

However, distributed systems are not only victims of crash failures: byzantine failures, caused by a bug or even an intruder, create incorrect data and may impair distributed systems significantly. A study in 2007 discovered that database bugs often show byzantine behavior (VANDIVER et al., 2007). Therefore, many recent studies on fault tolerance have tried to implement algorithms tolerating byzantine failures, and researchers all around the globe have developed various approaches to this problem (VANDIVER et al., 2007; AUBLIN; MOKHTAR; QUEMA, 2013; CASTRO; LISKOV, 2002).

Some approaches, due to the NoSQL trend, offer speculative algorithms (KOTLA et al., 2007; VERONESE et al., 2013) which improve performance but leave consistency to the client. Others, like the work of Vandiver et al. (VANDIVER et al., 2007), developed an algorithm with a central coordinator, which shows good performance due to limited messaging overhead and requires only 2f + 1 replicas to tolerate f failures. However, this approach exposes a single point of failure and a possible bottleneck. Still others, as in the case of crash fault tolerance, use atomic broadcast and deferred update, as in (LUIZ; LUNG; CORREIA, 2014; PEDONE; SCHIPER, 2012a), requiring 3f + 1 replicas and n^2 messages.

NoSQL databases, a relatively new technology compared to relational databases, have also received growing attention in recent years, with solutions that bring byzantine fault tolerance for NoSQL to the cloud (NOGUEIRA; ARAUJO; BARBOSA, 2014).

More recent publications, like (DUAN; PEISERT; LEVITT, 2015), achieve lower complexity and require fewer cryptographic operations, moving critical jobs to the clients at no additional cost, but relying on a speculative protocol. (DISTLER; CACHIN; KAPITZA, 2016), on the other side, relies on passive servers, which are only activated in the case of failures. This approach does require failure detection, and passive servers cannot be used to handle client requests. Another interesting approach is (PORTO et al., 2015), where only 2f + 1 replicas are required in order to endure byzantine failures. They achieve this by assuming that, in a data center environment, communication is close to synchronous. In the case of graph databases, which are often also geographically distributed, this assumption cannot hold. Overall, no solutions offering byzantine fault tolerance for distributed graph databases have been proposed yet. Byzantine fault tolerance has been widely criticized for its additional overhead: byzantine failures are relatively rare, and the additional cost often does not outweigh the risk (DUAN; PEISERT; LEVITT, 2015; DISTLER; CACHIN; KAPITZA, 2016). Due to the high messaging overhead, which causes a significant performance penalty in the case of geographic distribution, several methods have been proposed aiming to improve this area (AMIR et al., 2006, 2010). Unfortunately, such approaches require intra-datacenter coordination for each step the protocol takes, which results in substantial performance losses.

1.1 OBJECTIVES

The main goal of this work is the development of an efficient algorithm handling byzantine failures in the field of replicated graph databases, together with a hierarchical architecture that can handle additional replicas without an exponential increase in messaging overhead, offers a performance advantage over its flat counterpart, and relies neither on a central coordinator nor on trusted-component properties. The specific objectives include:

• Development of an algorithm using deferred update replication that tolerates crash and byzantine failures in distributed graph databases;

• Construction of an efficient architecture able to endure byzantine failures in the environment of geographically distributed graph databases with a viable communication and replica overhead;

• Creation of a middleware connecting 4 popular graph databases to implement heterogeneity;

• Evaluation of the developed prototype in order to test the performance of the above-mentioned properties.


1.2 ORGANIZATION

This Master's thesis is organized as follows: the next chapter contains basic information about general fault tolerance, followed by Chapter 3 with a more in-depth discussion of byzantine fault tolerance. After that, we discuss graph databases, their use cases, and their differences compared to more common databases. Next, the proposed architecture is introduced, explaining its design principles and preconditions, followed by a chapter containing the experimental results. The thesis finishes with an overview of related work and the conclusions of the executed experiments, offering a future outlook.


2 FAULT TOLERANCE

This chapter explains the basic principles of fault tolerance. It starts with a short elaboration on failures and failure types in order to explain the motives, and then explains fault tolerance and several different approaches.

2.1 FAILURES

Even if privacy, integrity, and authenticity are guaranteed, errors caused by internal or external faults may impair the correct operation of the whole system. A fault is a weakness or flaw in the code of a service, project, configuration, or operation, and it might also be introduced by external agents. This fault may cause an error or allow external faults to happen, which again may cause another error. These errors may never reach the surface due to isolation. But, occasionally, a fault in an internal component may cause faults in outer components, which may lead to a failure of the whole system. The moment the whole service fails is called a service outage. This outage could prevent a client from using a certain system (AVIZIENIS et al., 2004). Systems, therefore, are often evaluated by their dependability. According to (AVIZIENIS et al., 2004), the dependability of a system may be measured using the following measures:

• availability: accessibility at all times;

• reliability: number of failures during operation;

• safety: damage caused by failures;

• integrity: correctness of the data;

• maintainability: how easy it is to change the state or repair the system.

Availability and reliability are the two measures that may be impaired by faults. In order to guarantee correct operation, a lot of research has been conducted over the last decades and several key disciplines emerged:

• fault prevention: to prevent the fault before it happens;

• fault tolerance: to prevent the failure during a fault;

• fault removal: to remove possible faults;

• fault forecasting: to detect possible faults.

While the first two disciplines focus on preventing a fault from causing damage, the latter two work on improving the quality of the system to make it more reliable (AVIZIENIS et al., 2004).

2.2 FAULT TYPES

(AVIZIENIS et al., 2004) divides the phases in which a fault may enter the application into the Development Phase and the Use Phase. In the Development Phase, the programmers, either maliciously or accidentally, may introduce bugs into the system. However, besides the programmers there are many other variables, such as the physical world, possibly faulty development tools like language exploits or faulty hardware, and the test facilities used to evaluate the correctness of the system.

During the Use Phase of the system, faults may root in a wider selection of possible causes. Since there are users, administrators, providers, and possibly intruders interacting with the system in the physical world, and additionally external infrastructure offering services to the system, all these variables can be significant origins of faults. (AVIZIENIS et al., 2004) offers a detailed taxonomy of faults which divides them into well-defined categories. These may be broken down into three basic groups:

• Development faults;

• Physical faults;

• Interaction faults.

Development faults are introduced during the development of the project, physical faults are introduced by hardware failure, and interaction faults are introduced externally to the application or system.

2.2.1 Failure types

The previously analyzed classes of faults may cause different classes of failures. According to (TANENBAUM; STEEN, 2006), a wide range of failure types exists:


• Crash failures: sudden unavailability of a service;

• Omission failures: inability to respond to or receive messages;

• Timing failures: response outside of a reasonable time frame;

• Response failures: incorrect response to a request;

• Byzantine failures: incorrectly produced values, arbitrary or malicious.

While the first three failure types can be handled by crash fault tolerance, since servers not responding to a message may be marked as offline, response failures and especially byzantine failures are more serious. Response failures can be masked by recognizing the invalid response and ignoring it. In the byzantine case, which includes arbitrary and malicious failures, a server produces or responds with incorrect values or instructions which cannot be recognized as such. This makes it very difficult to automatically detect the output as inaccurate. These failures may be caused by a bug or by an intruder trying to manipulate and control the system.

2.3 FAULT-TOLERANCE

A wide selection of techniques exists to tolerate faults in a system in operation. According to (AVIZIENIS et al., 2004), fault tolerance can be divided into Error Detection and Recovery. In the field of error detection, errors may be detected at runtime (Concurrent Detection) or while the system is offline (Preemptive Detection). Recovery, on the other side, treats errors or faults as they occur. A faulty component might be isolated, reconfigured, or reinitialized in order to reestablish a valid state, and additionally a diagnosis might be run in order to determine the exact location and type of the error. Error handling might handle the error at runtime, rolling back or forward to a valid state, or compensate errors through sufficient redundancy. One of the most common approaches to tolerate failures is redundancy. Following (TANENBAUM; STEEN, 2006, Chapter 8.3.1), three types of redundancy exist:

• Information redundancy;

• Physical redundancy;

• Time redundancy.


An example of information redundancy is duplicated information inside a message, used in the case of partial corruption of the message. Time redundancy can be defined as replaying actions or resending messages to make sure the message arrives at the receiver (TANENBAUM; STEEN, 2006, Chapter 8.3.1). The more important kind of redundancy in the context of this work is physical redundancy. In this case, extra physical structures are provided to handle possible faults.

2.4 CRASH FAULT TOLERANCE

Before heading into crash fault tolerance, several terms have to be explained in order to understand the following concepts:

• Replica: a full copy of the system or database;

• Leader: a replica with additional functionality, coordinating the other replicas;

• Consensus: agreement between the replicas about a certain state of the system.

In order to guarantee a functional system in the presence of f possible crash failures, at least f + 1 replicas are required to work correctly. This number only applies if a hierarchical design has been chosen. In the absence of a central element, which may represent a bottleneck and a single point of failure, at least 2f + 1 replicas are required in order to reach a consensus. If more than f servers fail, the liveness of the system cannot be guaranteed (TANENBAUM; STEEN, 2006, Chapter 8.2.2). Following (TANENBAUM; STEEN, 2006, Chapter 8.2.2), a consensus can only be reached if a certain combination of preconditions is provided:
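The replica bounds above can be condensed into a small helper. This is a minimal sketch with names of our own choosing (the thesis defines no such function); the byzantine bound of 3f + 1 anticipates the model discussed in Chapter 3.

```python
def min_replicas(f, model="crash", central_element=True):
    """Minimum number of replicas needed to survive f faults.

    Crash faults: f + 1 replicas suffice when a central (hierarchical)
    element coordinates; without one, reaching consensus needs 2f + 1.
    Byzantine faults: the usual bound without trusted components
    is 3f + 1.
    """
    if model == "crash":
        return f + 1 if central_element else 2 * f + 1
    if model == "byzantine":
        return 3 * f + 1
    raise ValueError("unknown fault model: " + model)

print(min_replicas(1))                         # -> 2
print(min_replicas(1, central_element=False))  # -> 3
print(min_replicas(1, model="byzantine"))      # -> 4
```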

• Synchronous or asynchronous system: all servers complete the same number of steps in a given time, or not;

• Bounded or unbounded message times: messages are delivered globally within a maximum time, or not;

• Ordered or unordered messaging: messages sent always reach their destination in the same order;

• Uni- or multicasting: messages are sent directly to a specific receiver, or distributed to all servers at once.


Since most systems in practice use unicast and unbounded messaging, messaging has to be ordered to be able to reach a consensus (TANENBAUM; STEEN, 2006, Chapter 8.2.2). Other working combinations are:

• Asynchronous system with bounded messaging;

• Asynchronous system with ordered messaging.

2.5 FAILURE DETECTION

Another important issue in the field of fault tolerance is failure detection. A possible approach for this are Alive messages: after a certain amount of time, each server sends these messages to the servers close to it to check whether they are still alive and responding. If a server does not respond, it will be marked as failed. It is also possible to send Ping messages to another server and check whether these messages reach the receiver. Other well-known approaches are gossiping protocols, in which a server periodically notifies all surrounding servers that it is up and alive.

The problem with these protocols is that it is difficult to differentiate between omission and crash failures: a non-responding server might have crashed or might just have a bad connection. A false positive can thus be produced, meaning an active server is marked as crashed because of networking issues. Therefore, it is important to divide response failures into crashes and network failures.

In order to solve this problem, servers would have to decide in a consensus about the state of a certain node. Several servers might try to reach one server to check its state; if one of the servers receives a positive message, it notifies the others. The disadvantage of these protocols is the significant additional messaging overhead (TANENBAUM; STEEN, 2006, Chapter 8.2.4).
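A minimal alive-message bookkeeping sketch shows why false positives arise: the detector only observes missing heartbeats, never their cause. The class name, API, and timeout value are hypothetical, chosen for illustration.

```python
import time

class HeartbeatDetector:
    """Marks a server as suspected after `timeout` seconds of silence.

    A suspected server may have crashed or may merely be unreachable;
    the detector cannot tell the difference, which is exactly the
    false-positive problem described above.
    """
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # server -> time of last heartbeat

    def heartbeat(self, server, now=None):
        self.last_seen[server] = time.monotonic() if now is None else now

    def suspected(self, server, now=None):
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(server)
        return seen is None or now - seen > self.timeout

d = HeartbeatDetector(timeout=3.0)
d.heartbeat("s1", now=100.0)
print(d.suspected("s1", now=102.0))  # -> False (recent heartbeat)
print(d.suspected("s1", now=104.5))  # -> True (silent: crashed OR bad link)
```

The `now` parameter makes the sketch testable; a real detector would use the clock directly and, as the text notes, would need agreement among several observers before declaring a crash.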


2.6 REPLICATION TECHNIQUES

Several approaches exist to implement replication. The two main solutions are called deferred update and state machine replication.

2.6.1 Deferred update propagation

In the deferred update propagation technique, transactions are processed and synchronized only at the local server. Only when the client requests the commit of the transaction is it forwarded to the other processes in order to detect conflicts with concurrent transactions. The forwarded message contains the transaction as well as the created logs. The transaction can only be globally committed if all agree. The difference between immediate and deferred update is that, with deferred update, the updates can be executed at a nearby local server, which improves performance, whereas immediate update would require the updates to be sent to a central master server. In the case of deferred update replication, the updates are propagated only at commit time. Also, there are no global deadlocks. Usually, deferred update propagation works with atomic commit, which has the disadvantage of high response times and high abort rates caused by conflicting transactions.
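The execute-locally, propagate-at-commit idea can be sketched as follows. This is a hypothetical client-side view (the class and function names are ours): reads and writes touch only local state and are recorded in read/write sets, and nothing leaves the local server until the commit request.

```python
class Transaction:
    """Client-side deferred update: execute locally, buffer writes,
    and record read/write sets for later certification."""
    def __init__(self, tid, start_ts):
        self.tid = tid
        self.start_ts = start_ts   # logical time the transaction started
        self.read_set = set()
        self.write_set = {}        # key -> new value, buffered locally

    def read(self, db, key):
        self.read_set.add(key)
        # Read-your-own-writes from the local buffer first.
        return self.write_set.get(key, db.get(key))

    def write(self, key, value):
        self.write_set[key] = value  # not propagated yet

def request_commit(tx, broadcast):
    # Only now does the transaction leave the local server: the
    # read/write sets are atomically broadcast for certification.
    return broadcast({"tid": tx.tid, "start_ts": tx.start_ts,
                      "reads": tx.read_set, "writes": tx.write_set})

db = {"a": 1, "b": 2}
tx = Transaction(tid=1, start_ts=0)
tx.write("a", tx.read(db, "a") + 10)
print(tx.read(db, "a"))  # -> 11 (local buffer; db itself still holds 1)
```

Because the writes stay buffered until commit, no remote locks are held during execution, which is why the technique avoids global deadlocks.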

2.6.1.1 Termination protocol

A central piece of deferred update is the termination protocol, displayed in Figure 1. The termination protocol is started with the request to commit a transaction. Transactions played in the distributed environment must have the same effects as transactions executed on a single database. This principle is called one-copy serializability.

If a transaction T2 arrives at Master 2 before T1, but at Master 3 after T1, both transactions may be aborted if they are conflicting.

Figure 1: Termination protocol

The goal of the termination protocol is to propagate the commit request to all masters so that the transaction can be analyzed for conflicts and aborted if any are detected. The termination protocol relies on atomic broadcast to propagate the commit. Therefore, a transaction T aborts if there is a transaction T′ which committed after T started its execution and which has updated items read by T. When T is certified, three possible outcomes exist:

• A locally executing transaction conflicts with T: the local transaction is aborted;

• A locally committing transaction conflicts with T: T cannot acquire the write locks and aborts;

• No conflicting operations: T commits.

Following the explanation above, the execution algorithm is the following:

1. The transaction is initiated and executed at any master. Requested locks are locally granted: the transaction switches to the executing stage;

2. On commit request the read locks are released. The local master initiates the termination protocol and sends the read/write sets to the other masters: the transaction switches to the committing stage;

3. The certification of the transaction is processed by every master. The outcome is a global commit or a global abort, depending on concurrently running transactions;

4. Once past certification, all write locks are released and the updates are applied on all databases. Transactions in the executing stage that conflict with the committing transaction are aborted: the transaction switches to the committed stage.
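The certification step above is a deterministic conflict check between the read set of the committing transaction and the write sets of transactions that committed since it started. The following is a minimal sketch of that check in Java; the class and method names are ours for illustration, not part of any of the systems discussed.

```java
import java.util.*;

// Sketch of the certification test used at commit time in deferred
// update replication: a transaction aborts if any transaction that
// committed after it started wrote an item that it read.
public class Certifier {
    // Write sets of transactions committed since the transaction began.
    private final List<Set<String>> recentlyCommittedWriteSets = new ArrayList<>();

    public void recordCommit(Set<String> writeSet) {
        recentlyCommittedWriteSets.add(writeSet);
    }

    // Returns true if the transaction (identified by its read set)
    // passes certification, i.e. no committed write intersects its reads.
    public boolean certify(Set<String> readSet) {
        for (Set<String> ws : recentlyCommittedWriteSets) {
            for (String item : ws) {
                if (readSet.contains(item)) {
                    return false; // conflict: global abort
                }
            }
        }
        return true; // no conflicts: global commit
    }
}
```

Because every master receives the same read/write sets in the same order via atomic broadcast and the check is deterministic, all masters reach the same commit or abort outcome.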

2.6.2 State-machine replication

State-machine replication follows the approach of implementing state machines. State machines include state variables which describe their state, and commands which allow them to pass to the next state. Commands are atomic operations and may modify state variables or produce output. Client requests trigger commands to produce output or change state variables (SCHNEIDER, 1990a).

Nevertheless, state-machine replication implements immediate update and, therefore, opposite to deferred update replication, checks for conflicts during transaction execution. Hence, within the scope of a transaction, all write requests already obtain the write locks for the wanted resources at execution time and access the database immediately. If another transaction tries to access these locked resources, it has to abort. The locking transaction holds the lock during its whole execution time and only releases it on commit. This procedure has the advantage that the transaction will not be aborted at commit time. Nevertheless, in the scope of long transactions, a single transaction might lock significant parts of the database and block other transactions from executing (SCHNEIDER, 1990b). Although this approach has already shown excellent performance, as in (BEZERRA; PEDONE; RENESSE, 2014), graph databases are often not able to offer such strong transaction scopes and locks during execution. Therefore, deferred update, which only obtains write locks at commit time, offers the better alternative in the scope of this work.
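The locking behavior of immediate update described above can be illustrated with a minimal lock table: write locks are taken at execution time and only released at commit (or abort) time, so a conflicting transaction fails immediately rather than at commit. This is a hedged sketch with invented names, not the API of any replication library.

```java
import java.util.*;

// Minimal sketch of immediate-update locking.
public class LockTable {
    private final Map<String, Long> writeLocks = new HashMap<>(); // item -> txId

    // Returns true if txId obtained (or already held) the write lock.
    public synchronized boolean tryWriteLock(String item, long txId) {
        Long holder = writeLocks.get(item);
        if (holder == null) {
            writeLocks.put(item, txId);
            return true;
        }
        return holder == txId; // re-entrant for the same transaction
    }

    // On commit or abort, all locks of the transaction are released.
    public synchronized void releaseAll(long txId) {
        writeLocks.values().removeIf(h -> h == txId);
    }
}
```

A long transaction holding locks through `tryWriteLock` blocks every conflicting writer until `releaseAll` runs, which is exactly the drawback discussed above for long transactions.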

2.7 ATOMIC BROADCAST

Both replication techniques, deferred update and state-machine replication, may rely on atomic broadcast in order to guarantee that all processes receive the incoming data in the exact same order. A broadcast is the distribution of a message to all participants of a certain group; delivering describes the process of handing the message over to each participant. According to (V. Hadzilacos S. Toueg, 1994), atomic broadcast has to fulfill the following properties:

• Validity: If a correct process broadcasts a message m, then it eventually delivers m;

• Agreement: If a correct process delivers a message m, then all correct processes eventually deliver m;

• Integrity: For any message m, every correct process delivers m at most once, and only if m was previously broadcast by sender(m);

• Total order: If correct processes p and q both deliver messages m and m′, then p delivers m before m′ if and only if q delivers m before m′.

Fulfilling the requirements above, atomic broadcast can be used to implement the deferred update technique or immediate update through state-machine replication. In deferred update the transaction is sent at commit time, while in state-machine replication every read and write request is immediately propagated to the servers/databases.

Atomic broadcast can be established in a number of different ways, depending on the system requirements:

• With the help of a central coordinator: establishes the total order of the messages, but presents a single point of failure;

• Each round another master becomes coordinator: results in more messages due to the coordinator election, but does not present a single point of failure;

• The above case with write-ahead logging: results in additional complexity; however, if a server fails after a message has been delivered, its state can be easily recovered from the log.
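As a minimal illustration of the first variant, a central coordinator can act as a sequencer: it stamps each message with a monotonically increasing sequence number, and every replica delivers strictly in that order, buffering out-of-order arrivals. All names below are illustrative, under the assumption of reliable point-to-point links.

```java
import java.util.*;

// The coordinator assigns a total order by handing out sequence numbers.
public class Sequencer {
    private long next = 0;
    public synchronized long assign() { return next++; }
}

// Each replica delivers messages strictly in sequence-number order,
// buffering any message that arrives ahead of its turn.
class Replica {
    private long expected = 0;
    private final Map<Long, String> buffer = new TreeMap<>();
    private final List<String> delivered = new ArrayList<>();

    // Called when a (seq, message) pair arrives, possibly out of order.
    public void receive(long seq, String msg) {
        buffer.put(seq, msg);
        while (buffer.containsKey(expected)) {
            delivered.add(buffer.remove(expected));
            expected++;
        }
    }

    public List<String> delivered() { return delivered; }
}
```

Since every replica applies the same rule against the same sequence numbers, all correct replicas deliver the same messages in the same order, which is the total-order property above; the sequencer itself remains the single point of failure mentioned.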

Additionally, the developer of the distributed system might choose between two different architectures:

• Hierarchical;

• Flat group.

In hierarchical designs, there are leaders coordinating a number of slaves, i.e. replicas which may receive client requests but do not take part in the consensus. This design has the advantage of a very low message overhead, but leader failures can be severe. In the flat group design, on the other hand, every replica behaves like a leader and, therefore, no single point of failure exists, but flat groups incur a higher message overhead (TANENBAUM; STEEN, 2006, Chapter 8.2.1).

2.8 FINAL CONSIDERATIONS

This chapter discussed failures and fault tolerance as well as several different approaches to implement it. Deferred update, as stated, has shown great performance in previous works in the field of graph databases and is, therefore, the protocol used in our developed algorithm (NEIHEISER R. LAU, 2016).


3 BYZANTINE FAULT TOLERANCE

Byzantine failures have been widely studied since Lamport published the byzantine generals problem in 1982 (LAMPORT; SHOSTAK; PEASE, 1982). This chapter aims at describing what byzantine failures are and how they can be tolerated, and gives an overview of relevant approaches in the field of byzantine fault tolerance.

Byzantine failures, as previously described, can be separated into arbitrary and malicious faults. While arbitrary faults are randomly produced values caused by a bug, bitflip or similar, malicious faults are more severe, since they have been intentionally produced by an outside source. Famous examples of byzantine failures are Stuxnet in 2010, where the Iranian nuclear program was invaded by a worm which manipulated their research, or WannaCry in 2017, which invaded several thousand servers all over the world, encrypting their data in order to blackmail the owners into paying money in exchange for decrypting it again. Both mentioned examples could have been mitigated if the systems had implemented byzantine fault tolerance.

The name "byzantine failures" refers to the Byzantine Empire, whose government was permeated with intrigue and corruption and still had to guarantee a continuously working system in the case of war or rebellion. Imagine three generals which have to decide about a war situation, or three processes which have to decide about a certain value. Considering Figure 2, both normal processes receive one falsely produced value 1 and one correct value 0. In both cases it is impossible for the processes to decide which value they should accept, and therefore the whole system stops working correctly; in the case of the three generals, none of them will be able to make a correct decision about the strategy.


Assuming three correct processes and a single faulty one, as in Figure 3, the processes may come to a reliable decision and the system may continue running correctly. In the case of a completely reliable central coordinator, this coordinator could correctly decide in the presence of 2f + 1 processes. Otherwise, if there is no such coordinator, 3f + 1 replicas are required in order to guarantee a correctly working system.

Figure 3: Decision with three normal and one byzantine server

As previously mentioned, in the presence of byzantine failures at least 2f + 1 replicas are required. However, this reduced bound can only be achieved if certain preconditions are met, e.g. a central leader (VANDIVER et al., 2007), trusted elements (VERONESE et al., 2013) or synchronous behavior (PORTO et al., 2015). In the absence of these strong preconditions, in order to come to an agreement in the presence of byzantine failures, at least 3f + 1 replicas have to be present to endure f byzantine processes, which was proven by Lamport et al. in (LAMPORT; SHOSTAK; PEASE, 1982).
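The replica bounds above can be captured in a few lines. The helper below (names are ours, for illustration only) computes n = 3f + 1 and the quorum size 2f + 1, and shows why any two quorums intersect in at least f + 1 replicas, i.e. in at least one correct replica even when f replicas are byzantine.

```java
// Replica and quorum arithmetic for byzantine fault tolerance
// without trusted components, as discussed above.
public class Quorums {
    // Total replicas needed to tolerate f byzantine faults.
    public static int replicasNeeded(int f) { return 3 * f + 1; }

    // Size of a quorum of replicas that must agree.
    public static int quorumSize(int f)     { return 2 * f + 1; }

    // Any two quorums out of n replicas overlap in at least 2q - n
    // replicas, which evaluates to f + 1: at least one correct replica
    // is in both quorums, linking any two decisions.
    public static int minOverlap(int f) {
        return 2 * quorumSize(f) - replicasNeeded(f);
    }
}
```

For f = 1 this yields 4 replicas with quorums of 3, matching the four-process example of Figure 3.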

Even with 3f + 1 replicas it must be guaranteed that all 2f + 1 non-faulty processes execute the operations in the same order. This may be achieved through atomic commit or atomic multicast, where atomic commit establishes a global agreement about a value and atomic multicast establishes the relative order of such values. Such an agreement may be achieved through a consensus algorithm like Paxos or through a central coordinator, noting that a central coordinator may cause a bottleneck and presents a single point of failure in a distributed system. Due to this, several speculative protocols have been developed in order to improve the complexity and performance of byzantine fault tolerant protocols.


3.1 SPECULATIVE PROTOCOLS

Speculative algorithms for byzantine fault tolerance respond to a client request before agreeing on the order of requests. In case of inconsistencies, the clients detect these and correct the servers. Three notable publications propose a speculative protocol.

Most important is Zyzzyva, which requires 3f + 1 replicas in order to provide a service tolerant to byzantine failures. It offers immediate update through state-machine replication and executes requests in total order (KOTLA et al., 2007).

Like Zyzzyva, hBFT (DUAN; PEISERT; LEVITT, 2015) is also a general byzantine fault tolerance protocol and is eventually consistent. It optimizes Zyzzyva in order to minimize complexity and decreases the amount of cryptographic operations. The authors adapt PBFT, an existing fault tolerance implementation, in order to achieve this, and use a three-phase checkpoint system. The protocol starts with the election of a primary replica, which is responsible for the creation of the first checkpoint. After this first checkpoint, all replicas agree on a certain execution and will then, after a certain number of requests, create a new checkpoint.

MinZyzzyva is a remodeled version of Zyzzyva which uses only 2f + 1 replicas by relying on a trusted component on each replica, but, like all three approaches, has not found any database application yet (VERONESE et al., 2013).

3.2 N-VERSION PROGRAMMING

As discussed by Miguel Castro and Barbara Liskov in (CASTRO; LISKOV, 2002), if the same byzantine failure or security weakness can be exploited on all 3f + 1 servers, the byzantine fault tolerant protocol is of no use, since more than f servers will be faulty. Therefore, it is important to additionally employ N-version programming.

Research on N-version programming, or software diversity, started in the 1970s and has raised considerable interest since then (AVIZIENIS, 1985; CHEN; AVIZIENIS, 1995; KNIGHT; LEVESON, 1986). As discussed in (AVIZIENIS, 1985), faults are often caused by physical phenomena as well as by design flaws. Faults caused by physical phenomena can be masked by computing the same algorithm on different machines. Design flaws, on the other hand, may affect in a similar manner all nodes running the same software. N-version programming aims at tolerating these design faults, using a range of independently designed software elements. Furthermore, it has been shown that the diversity gained from N-version programming also has the positive effect of decreasing the likelihood of successful malicious intrusions (GARCIA et al., 2011). Unfortunately, due to the large deployment effort that needs to be performed to put into operation a solution that combines enough distinct implementations, N-version programming is, with a few exceptions (GASHI; POPOV; STRIGINI, 2007), a seldom used practice. MITRA (LUIZ; LUNG; CORREIA, 2014) is an example of the use of software diversity to shield the system from byzantine faults. Although MITRA is a middleware designed for relational databases, a similar approach may be an asset for byzantine fault tolerance when replicating graph databases. Our system, in fact, exploits software diversity by supporting 4 distinct graph databases, namely Neo4j, OrientDB, Titan, and Sparksee.

3.3 BFT-SMART

BFT-SMaRt is an open-source Java library which implements byzantine fault tolerant state-machine replication. In this Master's thesis, BFT-SMaRt is used as a library in the implementation of the algorithm and architecture; with its help, byzantine fault tolerant communication can be established between the replicas. It was chosen since it shows excellent performance compared to similar work and has proven to be very robust in test environments due to its ongoing development. This section explains the fundamentals of the BFT-SMaRt library (BESSANI; SOUSA; ALCHIERI, 2014).

3.3.1 Design Principles

BFT-SMaRt follows a certain set of design principles to guarantee functionality and robustness:

• Tunable fault model;

• Simplicity;

• Modularity;

• Multi-core awareness.

It follows a pessimistic system model, allowing messages to be delayed, dropped or corrupted and processes to behave arbitrarily. Additionally, it provides cryptographic signatures in order to protect the system against malicious clients or servers, and a protocol similar to Paxos to tolerate crash faults or corrupted messages. Besides that, it has been implemented as simply as possible to avoid bugs introduced by complex code or mechanics, which is also the reason why it was developed in Java. Nevertheless, as shown later, it achieves great performance compared to other implementations.

BFT-SMaRt also follows a modular approach, separating the state-machine replication from the core. It additionally offers a simple API which allows developers to embed it into their implementations seamlessly. Furthermore, BFT-SMaRt offers multi-core support in order to improve the performance of some costly tasks.

3.3.2 Modules

According to (BESSANI; SOUSA; ALCHIERI, 2014), BFT-SMaRt can be configured to require 3f + 1 replicas in order to tolerate byzantine failures, or 2f + 1 to endure only crash failures. Independently of the configuration, it requires reliable point-to-point connections between servers and clients. Communication is secured using RSA and keys are shared using Diffie-Hellman. It relies on a number of protocols to implement fault tolerance.

Total order multicast is provided by BFT-SMaRt, which implements state-machine replication. In this protocol, clients send their requests to all replicas, and total order is established through successive instances of consensus, each deciding on a batch of client requests.

In order to allow replicas to recover after a failure, certain measures are taken:

• Logging operations in batches on a server;

• Snapshotting after certain amounts of operations;

• Restoring the state with the help of various active replicas.

Additionally, BFT-SMaRt supports reconfiguration at run time, which allows the removal or addition of replicas during execution without impact on the execution itself.
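The recovery measures above combine snapshots and operation logs: a recovering replica installs the latest snapshot and replays the operations logged after it, which deterministically reproduces the pre-failure state. The sketch below illustrates the idea only, over a trivial counter service; the names are ours and this is not BFT-SMaRt's actual API.

```java
import java.util.*;

// Snapshot-plus-log recovery sketch over a trivial replicated counter.
public class RecoverableCounter {
    private int state = 0;
    private final List<Integer> log = new ArrayList<>(); // ops since snapshot
    private int snapshot = 0;

    // Apply an operation and log it for recovery.
    public void apply(int delta) {
        state += delta;
        log.add(delta);
    }

    // Take a snapshot; the log only needs to keep operations after it.
    public void takeSnapshot() {
        snapshot = state;
        log.clear();
    }

    // A recovering replica rebuilds its state from snapshot plus log.
    public int recover() {
        int s = snapshot;
        for (int delta : log) s += delta;
        return s;
    }

    public int state() { return state; }
}
```

Snapshotting bounds the length of the log that must be replayed, which is why it is taken "after certain amounts of operations" rather than never or on every request.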


3.3.3 Evaluation

In (BESSANI; SOUSA; ALCHIERI, 2014), the authors compare BFT-SMaRt with similar implementations and show large performance advantages. PBFT (CASTRO; LISKOV, 2002) has its performance peak at 100 clients with a throughput of 78765 operations per second and is unable to handle more than 200 clients without crashing, while BFT-SMaRt handles up to 1000 clients with a throughput of 83801 operations per second. Besides that, in its crash fault tolerance mode, BFT-SMaRt is able to outperform JPaxos (KOŃCZAK et al., 2011), a popular Paxos implementation, peaking at 600 clients with a throughput of 90909 operations per second, while JPaxos only reaches 62847 operations per second with 800 clients. Comparing all implementations at 200 clients, BFT-SMaRt shows a throughput of 66665 operations per second, which is slightly higher than PBFT and significantly higher than UpRight (CLEMENT, 2011) or JPaxos.

3.4 FINAL CONSIDERATIONS

This chapter aimed at delivering a rough understanding of what byzantine fault tolerance is, how it emerged and how it can be implemented. Additionally, it explained why we chose BFT-SMaRt to implement byzantine fault tolerant atomic broadcast in the context of this work.


4 NOSQL - GRAPH DATABASES

Due to the massive growth in highly connected data, especially through social networks where billions of users are connected with each other, SQL databases and the most common NoSQL databases, e.g. key-value stores or document stores, do not offer a viable solution for this kind of data. For instance, in (VICKNAIR; AL., 2010) Vicknair compares Neo4j, a graph database, with MySQL, a relational database. In their experiments, they discover that graph databases show great performance advantages for queries traversing the graph or searching strings. For instance, while MySQL requires 3911.3ms to traverse a graph of 100,000 vertices to a depth of 4, Neo4j only requires 41.4ms. Most importantly, additional vertices have an immense performance impact on the relational database, while the graph database scales very well in this kind of environment. Applying this to most of the use cases of graph databases, where huge graphs with millions of nodes are required, shows that SQL databases are not the correct choice in this kind of environment. For this reason, graph databases have hugely grown in importance in recent years, and many graph databases were developed, like Neo4j, OrientDB, ArangoDB, Titan and Sparksee. Graph databases store and represent their data in the form of a graph: instead of tables, documents or key-value pairs, the data is stored in the form of vertices and edges. Figure 4 displays an example graph where the vertices map real-world objects like a person, town, state or university, and the edges show the connections between them. This chapter goes into the details of graph databases, focusing on the different solutions and approaches.

4.1 GRAPH DATA MODELS

First of all, different graph databases might offer different kinds of models. A graph might be directed or undirected.

In a directed graph, as in Figure 5, every edge points from one vertex to another. In an undirected graph, as in Figure 6, on the other hand, every edge establishes a bilateral relationship between the two vertices without pointing in a particular direction.

Besides that, a graph might be weighted or unweighted, i.e. edges connecting vertices might carry attributes which can be taken into account when deciding whether to follow a certain edge while traversing the graph. An example of an unweighted graph is Figure 8 and, contrary to that, a weighted graph is shown in Figure 7.

Figure 4: Example Graph

Figure 5: Directed Graph

Figure 6: Undirected Graph

Figure 7: Weighted Graph

Figure 8: Unweighted Graph

Besides these two attributes, graph data models may have more explicit features:

• Property Graph A property graph relies on vertices and edges to store its data. Vertices can contain key-value pairs as properties. Edges are named, directed and always have exactly one start and endpoint. They may also contain properties (ROBINSON; WEBBER; EIFREM, 2013, p. 4, Appendix A). A good example of this is Figure 9;

• Hyper-graphs In hyper-graphs an edge connects as many vertices as required. This is useful when the graph contains many-to-many relationships. Hyper-graphs succeed in connecting a set of vertices with fewer edges than the property graph model. This is shown in Figure 10, where every vertex would otherwise require its own edge to "Brazil", but all of them can be connected through a single edge in the hyper-graph (ROBINSON; WEBBER; EIFREM, 2013, p. 197,198);

Figure 10: Hyper Graph

• Resource Description Framework (RDF) Triples An RDF store is a subject - predicate - object store; examples are "Bob buys a bicycle" or "Alice arranges art". These triples are frequently used in the field of LOD (Linked Open Data), since they are able to represent huge quantities of data in a very simple data structure.

In this type of database, data is logically linked and, therefore, it can also be described as a graph database. Triples are independent and do not offer fast traversal over vertices, which is why they are more widely used for analytics than for database purposes (ROBINSON; WEBBER; EIFREM, 2013, p. 199,200).

4.2 GRAPH STORAGE

Most of the popular graph databases offer different approaches to store their data. A general differentiation can be drawn between native and non-native graph storage. Native graph storage stores the graph as a graph on disk; Neo4j, for example, stores nodes and relationships in separate sections with concrete pointers to each other to allow fast traversal between nodes. Non-native graph storage uses well-known storage technologies, like SQL databases or NoSQL databases such as key-value stores or document stores, to store the graph. The disadvantage of this approach is the additional processing time needed to connect the graph on the application level. Nevertheless, it facilitates adoption, since these database types are more common and well understood.

4.3 GRAPH PROCESSING

Graph processing, like graph storage, can be separated into native and non-native. In the case of native processing, nodes physically point to each other, which allows index-free adjacency; therefore, the traversal from one node to another, in whatever direction, works at a cost of O(1). Non-native processing requires O(N) to traverse an incoming edge. However, in order to still achieve plausible query times, the graph is loaded into main memory, which still allows fast traversals as long as the required parts of the graph are present in main memory (ROBINSON; WEBBER; EIFREM, 2013, p. 142,143).
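Index-free adjacency can be illustrated with a vertex that holds direct references to its neighbours, so one hop costs a single list access regardless of graph size, as opposed to scanning a global edge table on every hop. The names below are ours, not those of any particular database.

```java
import java.util.*;

// Minimal illustration of index-free adjacency: each vertex keeps
// direct pointers to its neighbours, so a hop needs no global lookup.
public class Vertex {
    final String id;
    final List<Vertex> out = new ArrayList<>(); // direct neighbour pointers

    Vertex(String id) { this.id = id; }

    void addEdgeTo(Vertex target) { out.add(target); }

    // One hop is a single list access, independent of total graph size.
    List<Vertex> neighbours() { return out; }
}
```

A non-native layout would instead store (source, target) rows in a separate edge table and need an index lookup or scan per hop, which is the O(N) cost mentioned above.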

4.4 GRAPH COMPUTE ENGINES AND GRAPH DATABASE

It is important to note that graph databases differ from graph compute engines. Graph compute engines are generally used for analytic purposes, i.e. Online Analytical Processing (OLAP), for identifying clusters, computing averages or data mining. These analyses usually run for longer periods of time and might run online or offline. A good example of an offline processing engine is Cassovary, a graph computing library for the Java virtual machine. Offline processing is often used if no changes are expected in the data, while online processing might work on variable data. Examples of online processing for distributed graphs are Pregel and Giraph. Giraph describes itself as the open-source counterpart to Pregel, which was developed by Google; Facebook is currently using this system to analyze social graphs (APACHE, 2017; CASSOVARY, 2017). However, the named processing engines are not used and not viable for storing and reading data with high access numbers and immediate responses like graph databases (ROBINSON; WEBBER; EIFREM, 2013, p. 6,7).


4.5 GRAPH THEORY

The field of graph theory is a long-studied field which started with Euler's – The Seven Bridges of Königsberg – in 1735 (MAA, 2017). Since then, a vast amount of graph algorithms have been developed which find application in industry and science. The following section gives a short overview of a few of the most common graph algorithms.

• Breadth-first search The breadth-first search is a graph search algorithm which is often used for path-finding. It starts at a certain vertex n1 and then spreads to all adjacent vertices, and from these vertices again to all their adjacent vertices which have not been visited yet, until all the vertices of the graph have been visited (ROBINSON; WEBBER; EIFREM, 2013, p. 164);

• Dijkstra The Dijkstra algorithm searches the shortest path between two vertices. It visits the closest connected nodes until the searched node has been found. Dijkstra is a breadth-first algorithm and, therefore, expensive in big graphs due to the exhaustive search, but, on the other hand, it guarantees finding the shortest way (ROBINSON; WEBBER; EIFREM, 2013, p. 163) (STANDFORD, 2017);

• Depth-first search Depth-first search follows any path from the start vertex to the target vertex, then backtracks and searches better iterations of the initial path (ROBINSON; WEBBER; EIFREM, 2013, p. 163). Best-first search, in contrast, uses an estimate of the distance to the end node from each node it visits. Neither guarantees finding the shortest way (STANDFORD, 2017);

• A* A* is an informed search algorithm which combines the advantages of depth-first and breadth-first algorithms. A* finds the shortest way faster than the Dijkstra algorithm by favoring low-weight edges and vertices which are closer to the target vertex (STANDFORD, 2017);

• Triadic Closures In graph theory, when two vertices are connected through a third, there is a high probability that these two may be connected at a future moment. An example would be becoming friends with the friends of your friends, or the colleague of my colleague also being my colleague (ROBINSON; WEBBER; EIFREM, 2013, p. 174). This helps to make future predictions. However, there is a big difference between weak and strong relationships: for example, while there is a high chance that I will become friends with the friends of my friends, the probability of me getting to know someone my friends merely know is significantly lower (ROBINSON; WEBBER; EIFREM, 2013, p. 174).
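The breadth-first search described above can be sketched in a few lines over an adjacency-list graph; it returns the number of hops on a shortest path, matching the path-finding use mentioned. All names are illustrative.

```java
import java.util.*;

// Breadth-first search over an adjacency-list graph, returning the
// number of hops from start to target, or -1 if target is unreachable.
public class Bfs {
    public static int hops(Map<String, List<String>> adj, String start, String target) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String v = queue.poll();
            if (v.equals(target)) return dist.get(v);
            for (String w : adj.getOrDefault(v, List.of())) {
                if (!dist.containsKey(w)) {       // not visited yet
                    dist.put(w, dist.get(v) + 1);
                    queue.add(w);
                }
            }
        }
        return -1;
    }
}
```

Because vertices are expanded in increasing distance from the start, the first time the target is dequeued its distance is minimal, which is the shortest-path guarantee Dijkstra generalizes to weighted graphs.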

4.6 GRAPH DATABASE SOLUTIONS

There are various graph databases on the market, both native and non-native solutions. In this section a number of graph databases are compared regarding their support for transactions, fault tolerance and security mechanisms.

4.6.1 Neo4j

Neo4j is the most common graph database on the market and claims to be the fastest and most scalable native graph database. It is a native graph database and, therefore, uses native graph storage and native graph processing. Companies like eBay, Walmart and Telenor are known to use Neo4j (NEO4J, 2017d). Neo4j uses its own programming language called Cypher to process graph data; Cypher is a declarative language whose clauses are similar to SQL (ROBINSON; WEBBER; EIFREM, 2013, p. 27). Neo4j is only free to use if your program is also open source (AGPL).

• Transactions Neo4j offers regular transaction support which is semantically identical to transactions in regular databases. It works through write locks and discards changes completely on failure. Neo4j has built-in deadlock detection (ROBINSON; WEBBER; EIFREM, 2013, p. 155);

• Fault tolerance For fault tolerance Neo4j uses master-slave replication. In the case of a suspected failure, the last transaction will be replayed to recover from a possibly unclean transaction termination. Replaying the same transaction multiple times is no problem thanks to idempotency (it can be repeatedly applied without a different outcome). Neo4j offers two modes for writing: writing only through the master, or writing through slaves which redirect the writes to their master. Read access is available through all slaves (the master can act as a slave as well) (ROBINSON; WEBBER; EIFREM, 2013, p. 79,80,156,157). Failure or unavailability of a slave is automatically detected by the others, resulting in the affected node being marked as "temporarily failed". The failed slave automatically catches up with the other slaves on restart. If the master fails, a new master is automatically elected. This protocol works on any number of machines down to 1 (NEO4J, 2017a);

• Security Neo4j does not offer any security at the data level, only access restrictions at the database level. It offers authentication and authorization with SSL over HTTPS. Additionally, it offers proxy support and access to third-party solutions through its Java API (NEO4J, 2017b, 2017c).

4.6.2 OrientDB

OrientDB is built on SQL and calls its query language SQL+. However, it is not a fully native graph database: it offers index-free adjacency but stores its data with the help of a document store (ORIENTDB, 2017c).

• Transactions OrientDB uses optimistic transactions without locks: after the completion of a transaction, the database checks whether any other changes have been made and only locks the affected records on commit. This way OrientDB implements ACID. In the default mode transactions are turned off (ORIENTDB, 2017a, 2017b, 2017d);

• Fault tolerance To implement fault tolerance, OrientDB uses write-ahead logging for easy recovery from crashes. It uses a multi-master sharded architecture in which all servers act as masters;

• Security OrientDB offers authentication from the record level up to the database level. It offers HTTPS and SSL authentication (ORIENTDB, 2017b);
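The optimistic scheme described for OrientDB can be illustrated with a simple version check at commit time: no locks are held during execution, and the commit fails if another transaction changed the record in the meantime. This is a generic optimistic-concurrency sketch with invented names, not OrientDB's implementation.

```java
import java.util.*;

// Generic optimistic concurrency control: a commit succeeds only if
// the record's version is still the one the transaction read.
public class VersionedStore {
    private final Map<String, Integer> versions = new HashMap<>();
    private final Map<String, String> values = new HashMap<>();

    public int readVersion(String key) {
        return versions.getOrDefault(key, 0);
    }

    // Returns true iff no concurrent update happened since versionRead.
    public synchronized boolean commit(String key, int versionRead, String newValue) {
        if (readVersion(key) != versionRead) {
            return false; // conflict: transaction must retry or abort
        }
        values.put(key, newValue);
        versions.put(key, versionRead + 1);
        return true;
    }
}
```

Compared with the lock-based scheme of Neo4j above, no writer blocks another during execution; the cost is that a losing transaction only learns about the conflict at commit time and has to retry.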

4.6.3 Titan

Titan, like OrientDB, uses non-native graph storage. Native graph processing is implemented using an in-memory adjacency list, which enables Titan to access this list with a sequential scan (TITAN, 2017a) (ROBINSON; WEBBER; EIFREM, 2013). For storage, Titan offers different well-known third-party backends like Cassandra, HBase or BerkeleyDB. Titan uses the Gremlin query language (TITAN, 2017b).


• Transactions Transactions are supported in all three storage solutions but are only ACID on BerkeleyDB. Transactions can also be nested and concurrent (TITAN, 2017f). Additionally, Titan has a transaction log to broadcast changes to other instances. The behavior of transaction failures depends on the storage used. To recover from failures a write-ahead log can be used, which increases latency (TITAN, 2017e, 2017c);

• Fault tolerance Failures of instances are no problem for Titan; in the case of user-installed indexes the user must remove failed nodes manually from the list. Titan also supports monitoring transactions, storage backends and response times (TITAN, 2017c). Its concrete behavior regarding fault tolerance and replication depends on the storage backend used (TITAN, 2017d);

• Security Titan does not offer its own security mechanisms and depends on the security mechanisms offered by the storage backend used.

4.6.4 ArangoDB

ArangoDB is an open-source graph database which uses index-free adjacency. It also uses its own query language (ARANGODB, 2017b, 2017f).

• Transactions ArangoDB's transactions are ACID through locking. To avoid deadlocks, locks are released after a defined waiting time (ARANGODB, 2017e);

• Fault tolerance ArangoDB offers optional asynchronous master-slave replication. Writing only works through the master; it does not support multiple masters. Failures of instances need to be handled by the client. For recovering slaves it offers a write-ahead log (ARANGODB, 2017c, 2017d, 2017a);

• Security Security is only implemented through simple authentication and authorization. Nevertheless, ArangoDB supports SSL. Additional security can be achieved through third-party solutions, for example the Foxx framework (ARANGODB, 2017a).

4.6.5 Sparksee

Sparksee, previously known as DEX, is not completely open source, but offers a completely native graph database with implicit indices in nodes (index-free adjacency) and native graph storage. Sparksee works with Java, .NET and C++ on Windows, Linux and Mac (ROBINSON; WEBBER; EIFREM, 2013) (SPARKSEE, 2017c).

• Transactions Supports concurrent reads but only exclusive writes (SPARKSEE, 2017a);

• Fault tolerance Sparksee uses master-slave replication in which the master also acts as a slave. The master is elected automatically. Write operations are centrally synchronized by the master with the help of history logs. The master can also work alone. If a slave goes down in auto-commit mode, the system remains operational, but if the slave was within a transaction, the master is blocked until the slave recovers. New slaves are easily accepted as long as the master is available. If the master fails, the system stops working. Complete fault tolerance is planned for the future (SPARKSEE, 2017b);

• Security Security is only available through third-party software.
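The master-centered write synchronization described above can be sketched as follows: all writes go through the master, which records them in an ordered history log, and slaves catch up by replaying the log entries they have not yet applied. This is an illustrative sketch of the pattern, not Sparksee's implementation:

```python
# Master-slave replication via a history log. The master is the single
# writer; slaves replay the log from their last applied position, so a
# recovering or newly added slave converges without a full state copy.

class Master:
    def __init__(self):
        self.state = {}
        self.history = []  # ordered log of committed writes

    def write(self, key, value):
        self.state[key] = value
        self.history.append((key, value))

class Slave:
    def __init__(self):
        self.state = {}
        self.applied = 0  # index of the next history entry to apply

    def sync(self, master):
        # Replay only the entries this slave has not seen yet.
        for key, value in master.history[self.applied:]:
            self.state[key] = value
        self.applied = len(master.history)

m = Master()
s = Slave()
m.write("x", 1)
m.write("y", 2)
s.sync(m)
print(s.state == m.state)  # True
```

The sketch also makes the weakness noted in the text visible: every write and every catch-up depends on the master, so a master failure halts the whole system.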

4.6.6 Final Considerations

Table 1 compares five of the most common graph databases. We chose these databases because they are the current market leaders in the field. As shown in the table, most graph databases offer some type of transactions but differ significantly in features. We discarded an implementation of our algorithm for ArangoDB since its Java integration did not work out well for our needs.


Table 1: Graph Database Comparison

Name     | Transactions                                                 | Fault tolerance                        | Security
Neo4j    | Supported (ACID) using locks and offering deadlock detection | Master-slave                           | SSL, HTTPS, third-party solutions
OrientDB | Supported (ACID); optimistic transactions without locks      | Multi-master                           | HTTPS, SSL and security down to data level
TITAN    | Supported                                                    | Depends on storage backend             | Third-party software
Sparksee | Supported                                                    | Master-slave, not tolerant to failures | Third-party software
ArangoDB | Supported (ACID) using locks and offering deadlock avoidance | Master-slave, not tolerant to failures | SSL and third-party solutions


5 RELATED WORK

This chapter discusses the work most relevant to our proposal in order to place this work in a general context and highlight its distinct features. The first section discusses general work on crash fault tolerance, and the next section then turns to Byzantine fault tolerance. The chapter finishes with a summary of the discussed works.

5.1 CRASH FAULT-TOLERANCE

Fault tolerance, a topic that emerged almost 40 years ago with Randell (RANDELL, 1975), brought up a wide range of well-known approaches, e.g., Huang (HUANG; ABRAHAM, 1984), Rabin (RABIN, 1989), Thompson (THOMPSON; AL., 1993), Cai (CAI; HEMACHANDRA; VYSKOVC, 1993) and Kopetz (KOPETZ et al., 1989), who were among the first to implement fault tolerance for relational databases. Since then, the field of fault tolerance has evolved further. Various solutions have been developed over the years and, due to the NoSQL trend, fault-tolerance approaches emerged for different systems, e.g., graph processing as in (WANG et al., 2014).

Protocols for fault tolerance differ in various features. Existing protocols implementing crash-fault tolerance rely on 2f + 1 replicas, where f is the number of tolerated failures. Protocols may be divided into centralized and decentralized approaches. While centralized approaches offer write access only through one central replica, decentralized protocols achieve higher throughput through multiple write masters. In order to guarantee consistency in an environment with multiple write masters, most existing protocols use state-machine replication, propagating updates directly to the other replicas. State-machine replication generally comes with the disadvantage of sequential single-threaded execution and, therefore, suffers from low throughput.
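The 2f + 1 replica count mentioned above can be made concrete with a small quorum calculation; the 3f + 1 figure for Byzantine faults is included for comparison with the next section. This is a generic sketch, not tied to any particular protocol:

```python
# Quorum arithmetic for replicated protocols. With n = 2f + 1 replicas,
# a majority quorum of f + 1 can still be formed after f crashes, and any
# two majority quorums intersect in at least one replica, which is what
# preserves consistency across decisions.

def crash_replicas(f):
    return 2 * f + 1      # replicas needed to tolerate f crash faults

def byzantine_replicas(f):
    return 3 * f + 1      # replicas needed to tolerate f Byzantine faults

def majority_quorum(n):
    return n // 2 + 1     # smallest set guaranteed to overlap any other

f = 2
n = crash_replicas(f)
q = majority_quorum(n)
print(n, q)               # 5 replicas, quorums of 3
# Two quorums of size q out of n overlap in at least 2*q - n replicas.
print(2 * q - n >= 1)     # True
```

The overlap guarantee is exactly what fails under Byzantine faults: the one replica in the intersection may lie, which is why Byzantine protocols need the larger 3f + 1 replica count.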

The work of Marandi (MARANDI; BEZERRA; PEDONE, 2014) proposes a solution offering parallel execution while still using state-machine replication. Unfortunately, due to the immediate propagation of every write and read, the distributed database easily suffers from global locks, which may slow down or block other transactions.

Sciascia works with deferred update, which propagates updates only at commit time and, therefore, represents a more scalable and
