
Based on the RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers) fault tolerance architecture (DUARTE, A., 2007), this work studies factors that influence a computer cluster’s performability, such as message log latency, performance degradation caused by node losses, and availability under concurrent correlated faults. It presents solutions that improve performability in fault-free and post-recovery situations (after the occurrence of one or more faults).

Figure 1-3: Throughput of an application under maintenance stop

In fault-free situations, the root causes of the performance overhead are identified and studied. Checkpointing activity is a common cause of performance overhead in rollback-recovery solutions (OLINER, A.J. et al., 2005) and has been widely studied by the scientific community, resulting in some improvements (ELNOZAHY, E. N. and Plank, J. S., 2004), (GAO, W. et al., 2005), (DALY, J. T., 2006), (AGARWAL, S. et al., 2004). Another known cause raised in this study is the increase in message delivery latency caused by the pessimistic logging approach, which demands storing a copy of each message in a repository before continuing program execution. When a higher degree of availability is required, the replication of logs and checkpoints also strongly influences the performance overhead of fault tolerance. To address these factors, this work presents a solution that reduces the overhead caused by log-based fault tolerance solutions such as RADIC, and a method that imposes a low overhead when increasing the availability provided by RADIC, which directly improves the system’s performability.
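The cost of pessimistic logging can be illustrated with a minimal sketch. It is not RADIC's implementation: the `LogRepository` and `LoggedReceiver` classes and the `delivery_latency` helper are hypothetical names introduced here, and the latency model simply assumes that storing the copy sits on the critical path of every delivery.

```python
from dataclasses import dataclass, field


@dataclass
class LogRepository:
    """Hypothetical stand-in for the repository that keeps message copies."""
    stored: list = field(default_factory=list)

    def store(self, message):
        self.stored.append(message)


@dataclass
class LoggedReceiver:
    """Hypothetical receiver applying the pessimistic rule to every message."""
    repository: LogRepository
    delivered: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def receive(self, message):
        # Step 1: the copy must be stored before anything else happens.
        self.repository.store(message)
        self.events.append(("logged", message))
        # Step 2: only now may the application continue with the message.
        self.delivered.append(message)
        self.events.append(("delivered", message))


def delivery_latency(transmit_time, store_time, pessimistic=True):
    """Assumed model: with pessimistic logging, storing adds to every delivery."""
    return transmit_time + (store_time if pessimistic else 0.0)


repo = LogRepository()
receiver = LoggedReceiver(repo)
for token in ("t1", "t2"):
    receiver.receive(token)
```

Because every message is stored before delivery, the repository always holds everything the application has seen, which is what makes replay after a fault possible. The price is the extra `store_time` on each message.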

In post-recovery situations, the performance degradation effects of one or more faults on the system configuration after the RADIC recovery process are analyzed. The presented solution avoids the configuration changes caused by the recovery process after a fault occurrence, which avoids performance degradation. It is also able to restore a changed configuration, re-establishing the process per node distribution, a factor that may influence the cluster’s performability. Moreover, the mechanism allows ‘stopless’ preventive maintenance to be performed and is completely integrated into the RADIC distributed fault tolerance controller. It works transparently and is configurable in order to adapt to the application and system requirements.

The solutions for failure-free situations improve a RADIC-enabled system’s performability in two ways: a) by reducing the message delivery latency (in many systems, the message delivery latency is crucial to achieving the desired performance); and b) by decreasing system unavailability through low-overhead storing of n replicas of the redundant data over several repositories (SANTOS, G. et al., 2009).
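To make the effect of n replicas on availability concrete, the sketch below places the redundant data of one node on n other nodes and checks whether a copy survives a set of simultaneous failures. The ring-order placement policy and the names `replica_holders` and `survives` are assumptions made for illustration only; the text does not state how RADIC chooses its repositories.

```python
def replica_holders(owner, nodes, n):
    """Pick n distinct nodes other than `owner`, walking `nodes` as a ring.

    The ring-order policy is an assumption made for this sketch.
    """
    if n >= len(nodes):
        raise ValueError("need at least n + 1 nodes to hold n replicas")
    start = nodes.index(owner)
    return [nodes[(start + k) % len(nodes)] for k in range(1, n + 1)]


def survives(owner, nodes, n, failed):
    """True if at least one replica holder of `owner` is outside `failed`."""
    return any(h not in failed for h in replica_holders(owner, nodes, n))


nodes = ["n0", "n1", "n2", "n3", "n4"]
holders = replica_holders("n3", nodes, 2)
```

With a single replica, losing the owner and its only holder at the same time loses the redundant data. With two replicas, the same pair of concurrent faults still leaves one copy available.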

In contrast, after the recovery task, the distribution of application processes per node may change, affecting system performance and consequently its performability. To address this, dynamic redundancy (KOREN, I. and Krishna, C. M., 2007) was incorporated, enabling RADIC, via spare nodes, to protect the system configuration from the changes that a recovery task may generate (SANTOS, G. et al., 2006), (SANTOS, G. et al., 2008).

When the original RADIC recovery process changes the system configuration, a proposed mechanism allows the original process distribution to be re-established. This mechanism permits the insertion of a replacement node during program execution. The inserted node takes over the recovered process and restores the original process distribution.
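The difference between recovering with and without a spare node, and the later insertion of a replacement node, can be sketched as operations on a node-to-processes mapping. This is an illustrative model only: the function names, the `fallback` parameter (the surviving node that hosts the recovered processes when no spare exists) and the node labels are hypothetical.

```python
def recover(distribution, failed, spares, fallback):
    """Return (new distribution, node now hosting the recovered processes).

    If a spare is available it takes the processes and the per-node load is
    preserved; otherwise they pile onto `fallback`. Consumes one entry of
    the caller's `spares` list when a spare is used.
    """
    new = {n: list(p) for n, p in distribution.items() if n != failed}
    orphans = list(distribution[failed])
    if spares:
        target = spares.pop(0)
        new[target] = orphans
    else:
        target = fallback
        new[target].extend(orphans)
    return new, target


def insert_replacement(distribution, overloaded, recovered, new_node):
    """Move the recovered processes from `overloaded` to the inserted node."""
    new = {n: list(p) for n, p in distribution.items()}
    new[overloaded] = [p for p in new[overloaded] if p not in recovered]
    new[new_node] = list(recovered)
    return new


def procs_per_node(distribution):
    """Sorted per-node process counts, used to compare configurations."""
    return sorted(len(p) for p in distribution.values())


original = {"n0": ["p0"], "n1": ["p1"], "n2": ["p2"]}
degraded, host = recover(original, "n1", [], fallback="n2")
protected, spare_host = recover(original, "n1", ["s0"], fallback="n2")
restored = insert_replacement(degraded, host, ["p1"], "n3")
```

In this model the recovery without a spare leaves one node running two processes, while both the spare-based recovery and the later replacement insertion end with one process per node again.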

Furthermore, a solution allowing maintenance tasks to be performed without stopping the entire application is also presented (SANTOS, G. et al., 2008). This solution inserts new or updated nodes during program execution and uses a fault injector to schedule a fault in the node to be replaced just after the next checkpoint. This reduces recovery time by avoiding log processing during the recovery, and forces the application processes executing on this node to migrate to the new node.
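Why scheduling the fault just after a checkpoint avoids log processing can be shown with a small model. It assumes checkpoints at fixed step intervals and a constant number of logged messages per step; both assumptions, as well as the function names, are introduced here for illustration and are not taken from the RADIC fault injector.

```python
def messages_to_replay(fault_step, checkpoint_interval, msgs_per_step=1):
    """Logged messages accumulated since the last checkpoint at `fault_step`."""
    return (fault_step % checkpoint_interval) * msgs_per_step


def schedule_maintenance_fault(request_step, checkpoint_interval):
    """Step of the first checkpoint at or after the maintenance request.

    A request that arrives exactly on a checkpoint step is served there.
    """
    # Ceiling division, rounded up to the next multiple of the interval.
    return -(-request_step // checkpoint_interval) * checkpoint_interval


request_step = 17
interval = 10
immediate_cost = messages_to_replay(request_step, interval)
fault_step = schedule_maintenance_fault(request_step, interval)
scheduled_cost = messages_to_replay(fault_step, interval)
```

Injecting the fault as soon as maintenance is requested would require replaying the messages logged since the last checkpoint, whereas waiting for the next checkpoint leaves nothing to replay in this model.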

The major premise of these solutions is to preserve RADIC features such as transparency, decentralization, flexibility and scalability as far as possible. Moreover, the solutions must also: a) impose a negligible overhead in relation to RADIC during failure-free executions; and b) provide a quick recovery process when avoiding system configuration changes.

Several experiments were performed with the techniques presented in this work in order to validate their functionality and evaluate their use in different scenarios.

In this work, the performability of computer clusters using a fault tolerance solution is quantitatively evaluated using the model referred to in the previous section. This model relates the measured performance of the system S in a given situation to the unavailability of the system in the same situation, as represented by Equation (2).

The pipelined log and the process state n-replication were evaluated by a set of experiments involving a simple token-passing application, which measured the message delivery latency, and a matrix product program, which measured the execution time. In each experiment, the overhead generated by these solutions was compared with a regular RADIC configuration and with an execution without fault tolerance.

The dynamic redundancy solution was evaluated by comparing the effects of recovery with and without available spare nodes. These experiments observed two measures: the overall execution time and the throughput of an application. Two approaches to the matrix product algorithm were used: a statically distributed Master/Worker approach and an SPMD approach implementing Cannon’s algorithm. An N-Body particle simulation using a pipeline paradigm was also executed.
