Outline: Models of Fault-Tolerant Distributed Computing

Models of Dependable Computing and Communication

An introduction

• How can fault tolerance be achieved?

• Frameworks and strategies

• Replication management models:

– Remote operations, or closed-loop

– Diffusion model, or open-loop

Frameworks and Strategies for Fault Tolerance

Interaction Fault classification

(especially important in distributed systems)

• Omissive

– Crash

» host that goes down

– Omission

» message that gets lost

– Timing

» computation gets delayed

• Assertive

– Syntactic

» sensor says air temperature is 100º

– Semantic

[Figure: fault classes crash, omission and timing (OMISSIVE), syntactic and semantic (ASSERTIVE), all within ARBITRARY]

Basic fault tolerance frameworks

• Hardware Fault Tolerance

• Software-Based Hardware Fault Tolerance

• Software Fault Tolerance

• Fault-Tolerant Communication

Classical fault tolerance strategies

• Fault Tolerance versus Fault Avoidance in HW-FT

– tradeoff between:

» reliable but expensive components, and

» cheaper, but less performant and more complex mechanisms

• Tolerating Design Faults

– beyond HW-FT, which is helpless against common-mode faults (e.g. software)

• Perfect Non-stop Operation?

– when no perceived glitch is acceptable

– tightly synchronised replicas

• Reconfigurable Operation

– less expensive, when a glitch is allowed

• Recoverable Operation

– cheap, when a noticeable but acceptable service outage is allowed

• Fail-Safe versus Fail-Operational

– safety track: when a fault cannot be tolerated, there are two options:

» shutdown, which should be carried out in an orderly way

» or a contingency plan for a degraded operational mode (what is the fail-safe position of the engines of a plane?)

Models for Replication management

Reliability of remote operations

• more complex than meets the eye

…

• a remote operation expects a reply to the request

– what if the request is lost?

– what if the reply is lost?

– what if the server dies in the middle?

» (a) before processing the request

» (b) after processing, but before sending the reply

• these situations are indistinguishable if the communication level has volatile state and no error detection

[Figure: client/server timelines labelled req, resp, DOWN and DISCONNECTED]
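The indistinguishability claim above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: `Scenario`, `Server` and `client_view` are hypothetical names, there is no real network, and a missing reply is modelled as the client timing out.

```python
from enum import Enum


class Scenario(Enum):
    OK = "no failure"
    REQUEST_LOST = "request lost"
    CRASH_BEFORE = "server dies before processing"
    CRASH_AFTER = "server dies after processing, before replying"
    REPLY_LOST = "reply lost"


class Server:
    def __init__(self):
        self.executed = 0  # how many times the operation actually ran

    def handle(self, scenario):
        """Return the reply, or None if nothing reaches the client."""
        if scenario in (Scenario.REQUEST_LOST, Scenario.CRASH_BEFORE):
            return None          # the operation never ran
        self.executed += 1       # the operation ran
        if scenario in (Scenario.CRASH_AFTER, Scenario.REPLY_LOST):
            return None          # but the client never hears about it
        return "resp"


def client_view(scenario):
    """Pair what the client observes with what really happened."""
    server = Server()
    reply = server.handle(scenario)
    observed = "resp" if reply is not None else "TIMEOUT"
    return observed, server.executed


observations = {s: client_view(s) for s in Scenario}
```

In all four failure scenarios the client observes the same `TIMEOUT`, although the operation ran zero times in two of them and once in the other two.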

Reliability of remote operations

(network failures and remedies)

• communication error detection (ack):

• surveillance of communication with server (aya, "are you alive" probes):

– detect communication error in reply

Reliability of remote operations

(server failures and remedies)

• surveillance of communication with server (aya):

– tell failure from slowness

• server fails before or after executing operation:

– we cannot know whether or not it was executed
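A minimal sketch of timeout-based surveillance follows, assuming the server answers each aya probe and the client records the time of the last answer. `AyaMonitor` and its methods are hypothetical names, and time is passed in explicitly so the example stays deterministic. Note that a timeout only lets the client suspect a failure: a slow server looks the same until its late reply arrives.

```python
class AyaMonitor:
    """Suspects the server when no aya reply has arrived within `timeout`."""

    def __init__(self, timeout, start=0.0):
        self.timeout = timeout
        self.last_heard = start   # time of the most recent aya reply

    def on_reply(self, now):
        """Record that the server answered a probe at time `now`."""
        self.last_heard = now

    def suspects(self, now):
        """True if the server has been silent for longer than the timeout."""
        return now - self.last_heard > self.timeout


monitor = AyaMonitor(timeout=3.0)
monitor.on_reply(1.0)
early = monitor.suspects(3.5)     # silent for 2.5 time units: not suspected
late = monitor.suspects(5.0)      # silent for 4.0 time units: suspected
monitor.on_reply(6.0)             # the server was only slow
after_late_reply = monitor.suspects(6.5)
```

Choosing the timeout is the design trade-off: a short one detects crashes quickly but wrongly suspects slow servers more often.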

• solution: in doubt, request just once (at-most-once)

• stable memory records improve at-most-once (AMO):

– request IDs are stored on disk or in NVRAM

– each is marked executed or pending

– when the server recovers, it knows which requests it executed to completion

[Figure: stable record of request IDs marked pending or done]
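The stable-record idea can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: a plain dictionary stands in for disk or NVRAM, the application state is assumed to survive crashes as well, and `AmoServer`, `handle` and the `crash_before_done` flag are hypothetical names.

```python
PENDING, DONE = "pending", "done"


class AmoServer:
    def __init__(self, stable_log, state):
        self.log = stable_log   # request ID -> status; stand-in for disk/NVRAM
        self.state = state      # application state, assumed stable in this sketch

    def handle(self, request_id, amount, crash_before_done=False):
        status = self.log.get(request_id)
        if status == DONE:
            return "already executed"   # duplicate request: do not run again
        if status == PENDING:
            return "outcome unknown"    # in doubt: at-most-once never re-executes
        self.log[request_id] = PENDING  # record the request before executing
        self.state["balance"] += amount
        if crash_before_done:
            raise RuntimeError("server crashed")  # simulated crash
        self.log[request_id] = DONE     # record completion
        return "executed"


log, state = {}, {"balance": 0}
server = AmoServer(log, state)
first = server.handle("r1", 10)
retry = server.handle("r1", 10)          # client retransmission
try:
    server.handle("r2", 5, crash_before_done=True)
except RuntimeError:
    pass
recovered = AmoServer(log, state)        # restart from the stable record
after_crash = recovered.handle("r2", 5)  # retransmission after recovery
```

After recovery the server knows `r1` ran to completion, while `r2` is still marked pending, so it is refused rather than risked a second time.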

Reliability of event services

(Volatile Channels)

• event diffusion is open loop

– sender publishes to a group of destinations and goes on

• reliability is given by degree of replication vs. failures

– request goes to n replicas, first reply is enough

– request is executed if no more than n-1 failures occur

• request is executed once and only once (exactly-once)
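As a rough illustration of open-loop diffusion, the sketch below publishes a request to n replicas and takes the first reply. `Replica` and `diffuse` are hypothetical names, only crash failures are modelled, and message passing is replaced by direct calls.

```python
class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.executed = []   # requests this replica has executed

    def deliver(self, request):
        """Execute the request and return a reply, or None if crashed."""
        if not self.alive:
            return None
        self.executed.append(request)
        return f"{self.name}: done {request}"


def diffuse(request, replicas):
    """Publish the request to every replica and return the first reply, if any."""
    replies = [r.deliver(request) for r in replicas]
    return next((reply for reply in replies if reply is not None), None)


# n = 3 replicas with n-1 = 2 crashes: the request is still executed.
group = [Replica("r1", alive=False), Replica("r2", alive=False), Replica("r3")]
first = diffuse("m1", group)

# Every replica crashed: the request is not executed.
all_down = [Replica("r1", alive=False), Replica("r2", alive=False)]
nothing = diffuse("m1", all_down)
```

The sender never retries; reliability comes only from the number of replicas relative to the number of failures.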

Models for Replication management

(open-loop systems)

State Machine programming

• Characteristics

– confinement: atomic commands

– fault tolerance: easy replication

• Execution model

– servers start in same state

– execute the same sequence of input commands, in the same order

– commands modify state variables and produce outputs (I/O or return results)

– then all servers follow the same sequence of states and outputs

• Programming

– message-based, diffusion (multicast)

– requires deterministic execution

[Figure: Client 1 and Client 2 send messages m2, m3, m4 to the INPUT QUEUE]
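The execution model above can be sketched with a tiny deterministic state machine. `StateMachine`, `run` and the command names `add` and `mul` are hypothetical and chosen only for illustration.

```python
class StateMachine:
    """Deterministic: the next state depends only on the state and the command."""

    def __init__(self):
        self.value = 0   # every server starts in the same state

    def apply(self, command):
        """Execute one atomic command and return its output."""
        op, arg = command
        if op == "add":
            self.value += arg
        elif op == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown command: {op}")
        return self.value


def run(commands):
    """Feed a command sequence to a fresh machine; return final state and outputs."""
    machine = StateMachine()
    outputs = [machine.apply(c) for c in commands]
    return machine.value, outputs


commands = [("add", 2), ("mul", 5), ("add", 1)]
server_a = run(commands)
server_b = run(commands)                              # same commands, same order
reordered = run([("mul", 5), ("add", 2), ("add", 1)])  # same commands, other order
```

Two servers fed the same sequence agree on every state and output, while the reordered run diverges, which is why the input order matters.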

Replicated State Machine

(active replication)

• replicated state machine:

– all replicas execute at the same time

– achieves error masking

– determinism mandatory

• replica quorums:

– benign communication

– omissive process failures: f+1 replicas

– assertive process failures: 2f+1 replicas

• message ordering:

– total order of commands to replicas

– same commands in same order => same results

[Figure: REPLICATED STATE MACHINE. INPUT (disseminated) from Client 1 and Client 2 reaches every replica as m1, m2, m3 in the same order; OUTPUT (consolidated)]
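The replica counts above can be made concrete with a small output-consolidation sketch. It assumes each replica has already executed the same commands in the same order; the function names are hypothetical, and `None` stands for a replica that produced no output.

```python
from collections import Counter


def replicas_needed(f, assertive):
    """Replicas required to mask f faulty processes, per the counts above."""
    return 2 * f + 1 if assertive else f + 1


def mask_omissive(outputs):
    """Omissive faults only: any output that arrives is correct."""
    return next((o for o in outputs if o is not None), None)


def mask_assertive(outputs, f):
    """Assertive faults: accept a value only if at least f+1 replicas report it."""
    counts = Counter(o for o in outputs if o is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= f + 1 else None


# f = 1 crashed replica out of f+1 = 2: the survivor's output is enough.
crash_case = mask_omissive([None, 11])

# f = 1 replica reporting a wrong value out of 2f+1 = 3: the majority masks it.
vote_case = mask_assertive([11, 99, 11], f=1)

# With only 2 replicas a wrong value cannot be told from the right one.
too_few = mask_assertive([11, 99], f=1)
```

With 2f+1 replicas and at most f wrong outputs, the correct value always collects at least f+1 votes and a wrong value collects at most f.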

Replicated State Machine

(passive replication)

• passive replication

– only the Primary executes

– in the order it decides

– supports preemption and non-determinism (active replication doesn't)

• state transferred to Backup(s)

– inter-replica deferred state-level synchronization (checkpoints)

– Backup(s) log commands until a checkpoint is received

– if the Primary fails, a Backup takes over

– potentially long takeover glitch

• message ordering:

– non-ordered message diffusion

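[Figure: passive replication. P1 (PRIMARY) executes m1, produces OUTPUT and sends Checkpoint S(m1) to P2 (BACKUP); P2 holds received messages (m1 to m4) in its message LOG and empties the LOG when the checkpoint arrives]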
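The checkpoint-and-log mechanism can be sketched as below. This is a simplified illustration with hypothetical names (`Primary`, `Backup`, `install`, `take_over`): the state is a single integer, commands are additions identified by a message ID, and a checkpoint carries the state together with the IDs it already covers.

```python
class Primary:
    def __init__(self):
        self.state = 0
        self.done = set()            # IDs of the messages executed so far

    def execute(self, msg_id, amount):
        self.state += amount
        self.done.add(msg_id)

    def checkpoint(self):
        """State-level snapshot plus the message IDs it already reflects."""
        return self.state, frozenset(self.done)


class Backup:
    def __init__(self):
        self.state = 0
        self.log = {}                # msg_id -> amount, logged but not executed
        self.covered = set()         # IDs already reflected in a checkpoint

    def receive(self, msg_id, amount):
        if msg_id not in self.covered:
            self.log[msg_id] = amount

    def install(self, checkpoint):
        """Adopt the Primary's state and drop the log entries it covers."""
        self.state, done = checkpoint
        self.covered |= done
        for msg_id in done:
            self.log.pop(msg_id, None)

    def take_over(self):
        """Primary failed: execute the logged commands and become Primary."""
        for msg_id in sorted(self.log):
            self.state += self.log[msg_id]
        self.log.clear()
        return self.state


primary, backup = Primary(), Backup()
backup.receive("m1", 10)
primary.execute("m1", 10)
backup.install(primary.checkpoint())       # the Backup's log is emptied
log_after_checkpoint = dict(backup.log)
for msg_id, amount in [("m2", 20), ("m3", 30)]:
    backup.receive(msg_id, amount)
primary.execute("m2", 20)                  # Primary fails before the next checkpoint
final_state = backup.take_over()           # replays m2 and m3 on top of S(m1)
```

The longer the interval between checkpoints, the longer the log the Backup must replay, which is one source of the takeover glitch.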