
Failure detection is performed simultaneously by protectors and observers. Each performs specific activities in this task, according to its role in the fault tolerance scheme.

How protectors detect failures

The failure detection procedure comprises two tasks: a passive monitoring task and an active monitoring task. Accordingly, each protector plays two roles: it is simultaneously the antecessor of one protector and the successor of another.

There is a heartbeat/watchdog mechanism between two neighbors. The antecessor is the watchdog element and the successor is the heartbeat element. Figure 3-8 represents the operational flow of each protector element.

A successor regularly sends heartbeats to an antecessor. The heartbeat/watchdog cycle determines how fast a protector will detect a failure in its neighbor, i.e., the response time of the failure detection scheme. Shorter cycles reduce the response time but also increase the interference on the communication channel. Figure 3-7 depicts three protectors and the heartbeat/watchdog mechanism between them. In this figure, each antecessor runs the watchdog routine, waiting for a heartbeat sent by its neighbor.

Figure 3-7: Three protectors (TX, TY and TZ) and their relationship to detect failures. Successors send heartbeats to antecessors.
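The trade-off between cycle length and response time can be illustrated with a minimal sketch. It is not RADIC's implementation: the `Watchdog` class, the `simulate` helper, the millisecond units and the choice of a timeout equal to 2.5 heartbeat cycles are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Watchdog:
    """Antecessor side: declares a failure if no heartbeat arrives in time."""
    timeout_ms: int
    last_heartbeat_ms: int = 0

    def on_heartbeat(self, now_ms: int) -> None:
        # Each heartbeat from the successor restarts the timer.
        self.last_heartbeat_ms = now_ms

    def deadline_ms(self) -> int:
        # Instant at which the watchdog gives up waiting.
        return self.last_heartbeat_ms + self.timeout_ms


def simulate(cycle_ms: int, timeout_ms: int, crash_at_ms: int) -> Tuple[int, int]:
    """Successor sends a heartbeat every cycle_ms until it crashes.

    Returns (detection latency after the crash, heartbeats sent).
    """
    watchdog = Watchdog(timeout_ms)
    sent = 0
    for now in range(0, crash_at_ms, cycle_ms):
        watchdog.on_heartbeat(now)
        sent += 1
    return watchdog.deadline_ms() - crash_at_ms, sent


# Hypothetical settings: timeout = 2.5 cycles, crash at t = 1000 ms.
short_cycle = simulate(cycle_ms=100, timeout_ms=250, crash_at_ms=1000)
long_cycle = simulate(cycle_ms=500, timeout_ms=1250, crash_at_ms=1000)
```

Under these assumed settings, the short cycle detects the crash sooner but sends five times as many heartbeats over the same interval, which is the interference cost mentioned above.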

A node failure generates events in both the node’s antecessor and the node’s successor. If a successor detects that its antecessor has failed, it immediately starts a search for a new antecessor. The search algorithm is very simple. Each protector knows the address of its antecessor and the address of the current antecessor of its antecessor. Therefore, when an antecessor fails, the protector knows exactly which protector will be its new antecessor.
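As a rough sketch of this search, a protector can simply promote the second address it already holds. The `ProtectorLinks` class and the node names are hypothetical. The `next_hop` argument models how the protector might learn the antecessor of its new antecessor, a step the text does not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProtectorLinks:
    """The two addresses each protector is said to know."""
    antecessor: str
    antecessor_of_antecessor: Optional[str]

    def on_antecessor_failure(self, next_hop: Optional[str] = None) -> str:
        """Promote the known second-hop address to be the new antecessor."""
        if self.antecessor_of_antecessor is None:
            raise RuntimeError("no known replacement antecessor")
        self.antecessor = self.antecessor_of_antecessor
        # Assumption: the refreshed second hop is supplied by the caller.
        self.antecessor_of_antecessor = next_hop
        return self.antecessor


# Hypothetical chain: node2's antecessor is node1, whose antecessor is node0.
links = ProtectorLinks(antecessor="node1", antecessor_of_antecessor="node0")
new_antecessor = links.on_antecessor_failure(next_hop="node8")
```

No message exchange is needed to pick the replacement, because the candidate address is already stored locally.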

An antecessor, in turn, begins to wait for a new successor when it detects a failure in its current successor. The antecessor also starts the recovery procedure in order to recover the faulty processes that were running on the successor node.

Figure 3-8: Protector algorithms for antecessor and successor tasks

How the observers detect failures

Each observer interacts with two classes of remote elements: its protector and the other application processes. An observer detects failures either when the communication with other application processes fails or when the communication with its protector fails. However, because an observer communicates with its protector only when it has to take a checkpoint or log a message, an additional mechanism must exist to ensure that an observer quickly perceives that its protector has failed.

RADIC provides such a mechanism using a warning message between the observer and the local protector (the protector running on the same node as the observer).

Whenever a protector detects a failure in its antecessor, it sends a warning message to all observers on its node, because it knows that the failed antecessor is the protector that the local observers use to save checkpoints and message logs.

When an observer receives such a message, it immediately establishes a new protector and takes a checkpoint.
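The warning path can be sketched as follows. The `Observer` and `LocalProtector` classes are hypothetical. The sketch also assumes that the warning carries the address of the new protector, which the text does not state, and it models a checkpoint as a recorded destination.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observer:
    name: str
    protector: str
    checkpoints: List[str] = field(default_factory=list)

    def on_warning(self, new_protector: str) -> None:
        # Establish the new protector, then immediately checkpoint to it.
        self.protector = new_protector
        self.checkpoints.append(new_protector)


@dataclass
class LocalProtector:
    antecessor: str
    observers: List[Observer]

    def on_antecessor_failure(self, new_antecessor: str) -> None:
        # The failed antecessor was the protector of every local observer,
        # so each of them is warned.
        self.antecessor = new_antecessor
        for observer in self.observers:
            observer.on_warning(new_antecessor)


# Hypothetical node with two observers protected by node1.
obs_a = Observer("A", protector="node1")
obs_b = Observer("B", protector="node1")
local = LocalProtector(antecessor="node1", observers=[obs_a, obs_b])
local.on_antecessor_failure(new_antecessor="node0")
```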

How the observers confirm a failure

Two situations cause a communication failure between application processes but must not be interpreted as a node failure. The first occurs when an observer is taking a checkpoint of its application process. The second occurs when a process fails and restarts on a different node.

This section explains how the observers handle the first situation. How they handle the second is explained in the description of the Fault Masking Phase.

A process is unavailable for communication during the checkpoint procedure. As a result, a sender process could interpret the communication failure caused by the checkpoint procedure as a failure of the destination.

Table 3-3: The radictable of each observer in the cluster in Figure 3-3.

Process identification | Address | Protector (antecessor address)

To avoid this false failure detection, before a sender observer assumes a communication failure with a destination process, it contacts the destination’s protector and asks about the destination’s status. So that each observer knows the location of the protector of every other process, the radictable now includes the address of the destination’s protector, as shown in Table 3-3.

Table 3-3 shows that the protector in node eight protects the processes in node zero, the protector in node zero protects the processes in node one, and so forth.

Using its radictable, any sender observer may locate the destination’s protector.

Since the destination’s protector is aware of the destination process’s checkpoint procedure, it reports the destination’s status to the sender observer.

Therefore, the sender observer can determine whether the communication failure is the consequence of an ongoing checkpoint procedure.
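A minimal sketch of this confirmation step follows. The `Status` values, the `RadicEntry` record, the `confirm_failure` function and the `ask_protector` callback are hypothetical names. The node addresses follow the pattern described for Table 3-3, under the assumption that process i runs on node i.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    CHECKPOINTING = "checkpointing"
    FAILED = "failed"


@dataclass(frozen=True)
class RadicEntry:
    address: str    # where the process runs
    protector: str  # address of its protector (the antecessor)


def confirm_failure(dest_id: int,
                    radictable: Dict[int, RadicEntry],
                    ask_protector: Callable[[str, int], Status]) -> bool:
    """Return True only if the destination's protector reports a failure."""
    entry = radictable[dest_id]
    return ask_protector(entry.protector, dest_id) is Status.FAILED


# Node eight protects node zero and node zero protects node one.
radictable = {
    0: RadicEntry(address="node0", protector="node8"),
    1: RadicEntry(address="node1", protector="node0"),
}

# Stand-in for the remote query: process 0 is checkpointing, process 1 failed.
known_status = {0: Status.CHECKPOINTING, 1: Status.FAILED}


def ask_protector(protector_address: str, process_id: int) -> Status:
    return known_status[process_id]


failure_0 = confirm_failure(0, radictable, ask_protector)
failure_1 = confirm_failure(1, radictable, ask_protector)
```

Here the unreachable process 0 is not treated as failed, because its protector reports an ongoing checkpoint.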

The radictable and the search algorithm

Whenever an observer needs to contact another observer (in order to send a message) or an observer’s protector (in order to confirm the status of a destination), it looks up the address of that element in its radictable. However, after a failure occurs, the radictable of an observer becomes outdated, because the addresses of the recovered processes and their respective protectors have changed.

To address this problem, each observer uses a search algorithm to calculate the address of failed elements. This algorithm relies on the determinism of the protection chain. Each observer knows that the protector of a failed element (observer or protector) is the antecessor of this element. Since an antecessor is always the previous element in the radictable, whenever the observer needs to find an element it simply looks at the previous line in its radictable and finds the address of the element. The observer repeats this procedure until it finds the element it is looking for.
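The backward walk over the radictable can be sketched as below. The `locate` function and the `hosts` probe are hypothetical. Wrapping from the first line to the last is an assumption, suggested by node eight protecting node zero. Recovering a failed process on its antecessor's node is also assumed.

```python
from typing import Callable, List, Optional


def locate(radictable: List[str], index: int,
           hosts: Callable[[str, int], bool]) -> Optional[str]:
    """Walk backwards from line `index` until a node hosts the element."""
    n = len(radictable)
    for step in range(n):
        # Step 0 tries the recorded address; each further step moves to the
        # previous line, i.e. the antecessor of the last address tried.
        candidate = radictable[(index - step) % n]
        if hosts(candidate, index):
            return candidate
    return None  # the element was not found anywhere in the chain


# Hypothetical nine-node cluster where process i initially runs on node i.
radictable = [f"node{i}" for i in range(9)]
placement = {i: f"node{i}" for i in range(9)}
placement[3] = "node2"  # process 3 was recovered on its antecessor
placement[0] = "node8"  # process 0 was recovered on node eight


def hosts(address: str, process_id: int) -> bool:
    # Stand-in for a remote probe asking whether `address` runs the process.
    return placement[process_id] == address


found_3 = locate(radictable, 3, hosts)
found_0 = locate(radictable, 0, hosts)
found_5 = locate(radictable, 5, hosts)
```

In this sketch an element that never failed is found at step zero, so the search costs extra probes only for elements that have moved.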
