
Since their creation, computers have played an important and increasing role in solving complex problems. As computers evolve, new and more complex problems can be solved each day. Indeed, it seems that, despite the growing power of computers, applications will always need more resources and long periods of execution time.

This demand for computational power has led to advances in the High Performance Computing (HPC) area, generally represented by the use of parallel systems running specifically designed applications. For this reason, the design of parallel systems has commonly been oriented toward achieving the highest possible performance. As shown in TABLE 1-1, extracted from the Top500 site (TOP500.ORG, 2008), the most common architectural design of current parallel systems is the computer cluster, which has been adopted by more than 80% of the 500 fastest supercomputers in many areas of knowledge. For the computing power of these machines to be effective, it is also important that they suffer a minimum of interruptions, i.e., they must be available to perform useful work as much of the time as possible.

TABLE 1-1: Architecture share of the 500 fastest supercomputers. Source: www.top500.org

Architecture                          Count   Share (%)
Constellations                            2         0.4
MPP (Massively Parallel Processing)      88        17.6
Cluster                                 410        82.0

To achieve more computing power, it is usual to aggregate a large number of computing elements. The problem with this approach is that the more elements a system has, the higher the probability of faults. As the number of computing elements in computer clusters steadily increases, faults are already one of the major concerns when designing parallel systems. Consider that the system mean time between failures (SMTBF) of a computer cluster is given by the average mean time between failures of all nodes divided by the number of nodes in the cluster, and suppose that a failure in any node causes a system stop (fail-stop semantics) whose repair time is defined by the mean time to repair (MTTR). The overall availability ($A_{System}$) is then given by Equation (1). This equation shows that the more elements a system has, the lower its availability. This issue is the reason why availability and fault tolerance have been widely studied in the past.
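The effect of cluster size on availability can be illustrated with a minimal numerical sketch. It assumes the usual steady-state form of availability, SMTBF / (SMTBF + MTTR), with SMTBF computed as described above. The function names and the numeric values (a 10,000-hour average node MTBF and a 1-hour MTTR) are hypothetical and chosen only for illustration.

```python
def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """SMTBF: average node MTBF divided by the number of nodes."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


def system_availability(node_mtbf_hours: float, num_nodes: int,
                        mttr_hours: float) -> float:
    """Assumed steady-state availability under fail-stop semantics:
    SMTBF / (SMTBF + MTTR)."""
    smtbf = system_mtbf(node_mtbf_hours, num_nodes)
    return smtbf / (smtbf + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values: 10,000 h average node MTBF, 1 h MTTR.
    for n in (1, 10, 100, 1000):
        a = system_availability(10_000.0, n, 1.0)
        print(f"{n:>5} nodes: availability = {a:.4f}")
```

Under these assumed values, availability falls from about 0.9999 with a single node to about 0.9091 with 1000 nodes, which is the trend the paragraph describes.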

Computer clusters may be considered a class of computing systems with degradable performance (NAGARAJA, K. et al., 2005), i.e., under some circumstances during a given utilization period, the system may present different performance levels. Such performance degradation is generally caused by the occurrence of faults, which may also affect system availability if they cause an interruption.

Until now, efforts have focused on providing high availability to computer clusters (GEIST, A. and Engelmann, C., 2002), (CHAKRAVORTY, S. et al., 2006), (NAGARAJAN, A. B. et al., 2007). The solutions resulting from these efforts are commonly based on rollback-recovery techniques (AGBARIA, A. and Friedman, R., 1999), (DUARTE, A. et al., 2006), (BOUTEILLER, A. et al., 2006), and they have proven effective in improving computer cluster availability. However, they impose some performance overhead because of their related activities, such as process state saving, message logging, or system health monitoring. In these solutions, performance is commonly analyzed separately from availability, and in many cases it is not a concern at all.

$$A_{System} = \frac{SMTBF}{SMTBF + MTTR} = \frac{MTBF_{Node} / N}{MTBF_{Node} / N + MTTR} \qquad (1)$$

where $MTBF_{Node}$ is the average mean time between failures of the nodes and $N$ is the number of nodes in the cluster.

It is not trivial to evaluate performance completely dissociated from availability when analyzing an entire computing system, because the perceived system performance can be affected by the system availability. Following from this observation, Meyer, in “On Evaluating the Performability of Degradable Computing Systems” (MEYER, J. F., 1980), considers performability a more realistic, complete, and accurate measure for evaluating degradable systems such as computer clusters.

In “Performability evaluation: where it is and what lies ahead” (MEYER, J. F., 1995), Meyer initially defines performability as a “term referred to a class of (probability) measures that quantify a system’s ‘ability to perform’ in the presence of faults.” This definition takes into consideration systems that gracefully degrade in the presence of faults, such as the computer clusters mentioned before. Moreover, Meyer says that “...such degradation may result directly from fault-caused errors, may be due to additional computational demands associated with error processing, or may be the consequence of subsequent fault-related action such as reconfiguration and repair.” This work addresses the last two cases: evaluating the performance overhead demanded by the RADIC fault tolerance architecture (DUARTE, A., 2007) and the degradation caused by the repair and reconfiguration process. On how to evaluate performability, Meyer also says that it “can be either model-based or conducted experimentally via measurements of an actual system.” All evaluation in this work was conducted experimentally, by measuring performance under different availability conditions.
