REVIEW OF CHECKPOINTING ALGORITHMS IN DISTRIBUTED SYSTEMS


REVIEW OF SOME CHECKPOINTING ALGORITHMS IN DISTRIBUTED AND MOBILE SYSTEMS

Poonam Gahlan¹, Parveen Kumar²

¹ Department of Computer Sc & Engg, Singhania University, Pacheri Bari (Rajasthan) India

² Dept. of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut (UP) India

ABSTRACT

Checkpointing is the process of saving status information. A checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channels, high mobility, and limited battery life. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. This paper reviews checkpointing algorithms that have been reported in the literature. It also covers backward error recovery techniques for distributed systems, especially distributed mobile systems.

Keywords: Checkpointing algorithms; parallel & distributed computing; shared memory systems; rollback recovery; fault-tolerant systems.

1. INTRODUCTION

A distributed system is a collection of computers that are spatially separated and do not share a common memory. The processes executing on these computers communicate with one another by exchanging messages over communication channels. The messages are delivered after an arbitrary delay. Parallel computing with clusters of workstations (cluster computing) is being used extensively as they are cost-effective and scalable, and are able to meet the demands of high performance computing. Increase in the number of components in such systems increases the failure probability. There are mainly two kinds of faults: permanent and transient. Permanent faults are caused by permanent damage to one or more components and transient faults are caused by changes in environmental conditions. Permanent faults can be rectified by repair or replacement of components. Transient faults remain for a short duration of time and are difficult to detect and deal with. Hence it is necessary to provide fault tolerance particularly for transient failures in parallel computers. Fault-tolerant techniques enable a system to perform tasks in the presence of faults. Fault tolerance involves fault detection, fault location, fault containment and fault recovery.

1.1 Checkpointing:


Need of Checkpointing:

• To recover from failures.

• Checkpointing is also used in debugging distributed programs and migrating processes in multiprocessor systems. In debugging distributed programs, state changes of a process during execution are monitored at various time instances. Checkpoints assist in such monitoring.

• To balance the load of processors in the distributed system, processes are moved from heavily loaded processors to lightly loaded ones. Checkpointing a process periodically provides the information necessary to move it from one processor to another.

• With checkpointing, an arbitrary temporal section of a program’s runtime can be extracted for exhaustive analysis without the need to restart the program from the beginning.

1.2 Basic terms:

• In-Transit Message/Lost Message: Messages whose transmission has been recorded but the record of their reception has been lost. This happens if the receiver rolls back to a state prior to reception of the message while the sender does not roll back to a state prior to sending of the message.

• Orphan Messages: Messages whose reception has been recorded, but the record of their transmission has been lost. This situation arises when the sender node rolls back to a state prior to sending the message while the receiver node still has the record of its reception.

• Duplicate Messages: This happens when more than one copy of the same message arrives at a node, perhaps one corresponding to the original computation and one generated during the recovery phase. If the first copy has been processed, all subsequent copies should be discarded.

• Consistent global checkpoint: In checkpointing protocols each process periodically saves its state on stable storage. The saved state contains sufficient information to restart process execution. A consistent global checkpoint is a set of N local checkpoints, one from each process, forming a consistent system state.

• Rollback recovery: It is a process of resuming/recovering a computation from a consistent global checkpoint.

• Recovery Line: It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.

• Domino effect: Unless processes coordinate their checkpoints to form consistent states, the rollback of one process may force other processes to roll back as well. This cascaded rollback may continue and eventually lead to the domino effect, which causes the system to roll back to the beginning of the computation in spite of all saved checkpoints.

• Output Commit Problem: Before sending output to the outside world, the system must ensure that the state from which the output is sent will be recovered despite any future failure. This is called the output commit problem.

• Stable Storage: Rollback recovery uses stable storage to save checkpoints and recovery-related information. Magnetic disks have been the medium of choice for implementing stable storage. Stable storage must ensure that the recovery data persist through the tolerated failures and their corresponding recoveries. This requirement can lead to different implementation styles of stable storage.

• Garbage Collection: Checkpoints and other recovery information consume storage resources. As the application progresses and more recovery information is collected, a subset of the stored information may become useless for recovery. Garbage collection is the deletion of such useless recovery information. A common approach to garbage collection is to identify the recovery line and discard all information relating to events that occurred before that line. For example, processes that coordinate their checkpoints to form consistent states will always restart from the most recent checkpoint of each process, and so all previous checkpoints can be discarded.
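The message categories defined above can be illustrated with a short sketch. It assumes a simplified model that is not taken from any of the surveyed papers: every process numbers its local events from 0, a local checkpoint is described by the number of events it covers, and the names `Message`, `classify` and `is_consistent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: int      # id of the sending process
    receiver: int    # id of the receiving process
    send_event: int  # index of the send event in the sender's local history
    recv_event: int  # index of the receive event in the receiver's local history


def classify(msg, cut):
    """Classify a message against a global checkpoint.

    cut[p] is the number of local events of process p covered by its
    local checkpoint (an assumed representation of a checkpoint).
    """
    sent = msg.send_event < cut[msg.sender]
    received = msg.recv_event < cut[msg.receiver]
    if received and not sent:
        return "orphan"       # reception recorded, transmission not recorded
    if sent and not received:
        return "in-transit"   # transmission recorded, reception not recorded
    return "ok"


def is_consistent(messages, cut):
    """A global checkpoint is consistent if it contains no orphan message."""
    return all(classify(m, cut) != "orphan" for m in messages)
```

In this model an in-transit message does not make the checkpoint inconsistent, but it has to be logged or resent during recovery, whereas a single orphan message is enough to reject the checkpoint.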

1.3 Overheads of a Checkpointing Algorithm

Coordination Overhead


processes. Coordination overhead is due to special control messages and piggybacked information. The book-keeping operations necessary to maintain coordination also contribute to coordination overhead.

Context-Saving overhead

The time taken to save the global context of a computation is defined as the context-saving overhead. This overhead is proportional to the size of the context. If stable storage is not available at every node in a multiprocessor system, the context is transferred over the network, and the network transmission delay is also included in the overhead.

2. LITERATURE SURVEY

2.1 In [1] Chandy and Lamport proposed a global snapshot algorithm for distributed systems. It is observed that most of the checkpointing algorithms proposed for message-passing (MP) systems use Chandy and Lamport’s algorithm as a base. The algorithms proposed in the literature for MP systems may be derived by relaxing various assumptions made by them and modifying the way each step is carried out.

Chandy and Lamport’s Algorithm

Chandy and Lamport’s algorithm is based on the following assumptions:

• The distributed system consists of a finite set of processors and a finite set of channels.

• The processors communicate with each other by exchanging messages through communication channels.

• The channels are fault free.

• Communication delay is arbitrary but finite.

• The global state of the system includes the local states of processors and the state of communication channels.

• The state of a channel refers to the set of messages sent along that channel and not yet received by the destination node from that channel.

• Buffers are of infinite capacity.

• Termination of the algorithm is ensured by fault-free communication.

Algorithm: The global state is constructed by coordinating all the processors and logging the channel state at the time of checkpointing. Special messages called markers are used for coordination and for identifying the messages originating in different checkpointing intervals. The algorithm is initiated by a centralized node. The steps followed after a checkpoint initiation are the same in all the nodes, except that the centralized node initiates a checkpoint on its own and the other nodes initiate checkpoints as soon as they receive a marker.

The steps are as follows:

(1) Save the local context in stable storage.

(2) For i = 1 to all outgoing channels do: send a marker along channel i.

(3) Continue regular computation.

(4) For i = 1 to all incoming channels do: save incoming messages in channel i until a marker is received along that channel.
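The four steps can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: channels are FIFO in-memory queues, the application state is a single number, stable storage is replaced by an attribute, and the names `Process`, `send`, `deliver` and `MARKER` are hypothetical.

```python
from collections import deque

MARKER = object()  # special control message used for coordination


class Process:
    """One node following steps (1)-(4); channels are named by peer id."""

    def __init__(self, pid, incoming, outgoing, send):
        self.pid = pid
        self.incoming = list(incoming)  # peers with a channel into this node
        self.outgoing = list(outgoing)  # peers this node can send to
        self.send = send                # assumed transport: send(src, dst, payload)
        self.state = 0                  # toy application state
        self.saved_state = None         # local checkpoint, None until taken
        self.channel_log = {}           # peer id -> messages logged for that channel
        self.recording = set()          # incoming channels still awaiting a marker

    def take_checkpoint(self):
        self.saved_state = self.state                    # step (1)
        self.channel_log = {p: [] for p in self.incoming}
        self.recording = set(self.incoming)
        for peer in self.outgoing:                       # step (2)
            self.send(self.pid, peer, MARKER)
        # step (3): the caller simply continues regular computation

    def on_message(self, src, payload):
        if payload is MARKER:
            if self.saved_state is None:                 # first marker seen
                self.take_checkpoint()
            self.recording.discard(src)                  # step (4) ends for src
        else:
            if src in self.recording:                    # step (4): log channel state
                self.channel_log[src].append(payload)
            self.state += payload

    def done(self):
        return self.saved_state is not None and not self.recording


# Two-node run over FIFO channels (hypothetical in-memory transport).
channels = {}


def send(src, dst, payload):
    channels.setdefault((src, dst), deque()).append(payload)


def deliver(procs, src, dst):
    procs[dst].on_message(src, channels[(src, dst)].popleft())


procs = {0: Process(0, [1], [1], send), 1: Process(1, [0], [0], send)}
procs[0].state, procs[1].state = 10, 20
send(1, 0, 5)               # application message still in transit
procs[0].take_checkpoint()  # node 0 acts as the initiating node
deliver(procs, 0, 1)        # node 1 receives the marker and checkpoints
deliver(procs, 1, 0)        # node 0 logs the in-transit message
deliver(procs, 1, 0)        # node 0 receives the marker; channel 1->0 is closed
```

After the run, the recorded global state consists of the two saved local states together with the one message logged for the channel from node 1 to node 0.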

2.2 In [6] Ravi Prakash and Mukesh Singhal described a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots are taken only by those nodes that have directly or transitively affected the initiator. The paper shows that the global snapshot collection terminates within a finite time of its request and that the collected global snapshot is consistent.
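The set of nodes that have "directly or transitively affected the initiator" can be illustrated with a centralized sketch. The algorithm in [6] determines this set in a distributed way by propagating requests; the version below only shows the set being computed, and the `depends_on` map (node to the set of nodes it has received messages from) and the function name are assumptions made for illustration.

```python
from collections import deque


def snapshot_participants(initiator, depends_on):
    """Return the initiator plus every node that directly or transitively affected it.

    depends_on[x] is the set of nodes from which x has received messages
    since its last snapshot (an assumed, centralized representation).
    """
    participants = {initiator}
    frontier = deque([initiator])
    while frontier:
        node = frontier.popleft()
        for sender in depends_on.get(node, ()):
            if sender not in participants:
                participants.add(sender)
                frontier.append(sender)
    return participants
```

For example, with `depends_on = {"A": {"B"}, "B": {"C"}, "D": {"A"}}` and initiator `"A"`, nodes `B` and `C` participate, while `D`, which only received from `A`, does not need to take a snapshot.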


requires minimum number of processes to take tentative checkpoints and thus minimizes the workload on the stable storage server. Their algorithm has three kinds of checkpoints: tentative, permanent and forced. Tentative and permanent checkpoints are saved on stable storage. Forced checkpoints do not need to be saved on stable storage; they can be saved anywhere, even in main memory. When a process takes a tentative checkpoint, it forces all dependent processes to take checkpoints. However, a process taking a forced checkpoint does not require its dependent processes to take checkpoints. Thus taking a forced checkpoint avoids the cost of transferring a large amount of data to stable storage and of accessing the stable storage device, and it has much less overhead than taking a tentative checkpoint on stable storage. Also, by taking forced checkpoints their algorithm avoids the avalanche effect (in which the processes in the system recursively ask others to take checkpoints) and significantly reduces the number of checkpoints. A process takes a forced checkpoint only when it receives a computation message which has a checkpoint sequence number larger than the process expects. Their algorithm is efficient in the sense that it is non-blocking, requires minimum stable storage, minimizes the number of tentative checkpoints and avoids the avalanche effect.
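The trigger for a forced checkpoint can be sketched as follows. This is a simplified illustration of the sequence-number test only, not the complete algorithm: the class `Node`, its fields, and the choice to save the state before the message is processed are assumptions made for this sketch.

```python
class Node:
    """Toy process that keeps forced checkpoints in main memory."""

    def __init__(self, pid, n):
        self.pid = pid
        self.csn = [0] * n  # csn[j]: checkpoint sequence number expected from process j
        self.forced = []    # forced checkpoints held locally, not on stable storage
        self.state = []     # toy application state: payloads processed so far

    def receive(self, sender, msg_csn, payload):
        if msg_csn > self.csn[sender]:
            # The sender has checkpointed since the last message this node saw.
            # Saving the state first (an assumption of this sketch) keeps the
            # message from being recorded as received but not sent.
            self.forced.append(list(self.state))
            self.csn[sender] = msg_csn
        self.state.append(payload)
```

No request is sent to other processes when the forced checkpoint is taken, which is why a forced checkpoint cannot start an avalanche of further checkpoints.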

2.4 In [2] Guohong Cao and Mukesh Singhal introduced the concept of the “mutable checkpoint”, which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. A mutable checkpoint can be saved anywhere, e.g. in the main memory or on the local disk of a mobile host (MH). Taking a mutable checkpoint avoids the overhead of transferring a large amount of data over the wireless network to stable storage at the mobile support stations (MSSs). The paper presents techniques to minimize the number of mutable checkpoints. The simulation results show that the overhead of taking mutable checkpoints is negligible. Their proposed algorithm, based on mutable checkpoints, is non-blocking, avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on stable storage. Any process can initiate a checkpointing process. When a process Pi initiates a checkpointing process, it takes a local checkpoint, increments csn_i[i] (csn_i[j] represents the checkpoint sequence number of Pj that Pi knows, i.e. Pi expects to receive a message from Pj with checkpoint sequence number csn_i[j]), sets weight to 1 (weight is a non-negative real variable with a maximum value of 1, which is used to detect the end of the checkpointing algorithm), sets cp_state_i to 1 (cp_state_i is a Boolean which is set to 1 if Pi is in a checkpointing process) and stores its own identifier and the new csn_i[i] in its trigger. Then it sends a checkpoint request to each process Pj such that R_i[j] = 1 (i.e. Pi has received a computation message from Pj in the current checkpoint interval). When Pi receives a computation message m from Pj, it takes a mutable checkpoint if all of the following conditions are satisfied:

1) Pj was in the checkpointing process before sending m.

2) Pi has sent a message since its last checkpoint.

3) Pi has not taken a checkpoint associated with the initiator.
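The three conditions can be written as a single predicate. The sketch below is illustrative only: the record types and field names are hypothetical and are not the data structures used in [2].

```python
from dataclasses import dataclass, field


@dataclass
class ReceiverState:
    sent_since_last_checkpoint: bool = False
    # Initiators for which this process has already taken a checkpoint.
    checkpointed_for: set = field(default_factory=set)


@dataclass(frozen=True)
class ComputationMessage:
    sender_in_checkpointing: bool  # sender was in a checkpointing process before sending
    initiator: int                 # initiator id carried with the message (assumed field)


def needs_mutable_checkpoint(receiver, m):
    """True only when conditions 1), 2) and 3) all hold."""
    return (
        m.sender_in_checkpointing                         # condition 1
        and receiver.sent_since_last_checkpoint           # condition 2
        and m.initiator not in receiver.checkpointed_for  # condition 3
    )
```

If any one condition fails, the message is processed without saving any state, which is how the number of mutable checkpoints is kept small.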

2.5 In [9] Weigang Ni, Susan V. Vrbsky and Sibabrata Ray presented a new checkpointing algorithm for mobile distributed systems. Their algorithm is non-blocking and minimizes the number of checkpoint participants. The paper presents the idea of a proxy coordinator, which is designed to reduce the communication overhead of mobile participants during checkpointing events. A proxy coordinator is a process running on a fixed host (FH) which acts as an agent of the process running on an MH and handles all necessary coordination work on behalf of that process. The main features of their protocol are:

• A checkpointing event does not require all the processes to participate; only a minimum number of processes need to take their checkpoints on stable storage.

• Their algorithm is non-blocking.

• An MH’s coordinating message overhead is minimized, which greatly saves battery power and wireless bandwidth.

In this paper it is assumed that a process will not receive a checkpoint request associated with another initiator before the currently executing one is completed. The simplest way to handle concurrent initiations is for a process to discard all other checkpoint requests until the current one is finished.


than the snapshot collected by any initiator. Global snapshots are used to establish checkpoints for recovery from node failure. A maximal snapshot implies that the amount of computation lost during rollback after node failures is minimized. The maximal snapshot algorithm consists of two phases: (a) independent snapshot collection by the initiators and (b) sharing of snapshot information among initiators. In the first phase each node in the system takes one or more tentative local snapshots (depending on the number of initiators). In the second phase the initiators exchange information with each other and make one tentative local snapshot permanent at each node. A Lamport clock is maintained by each node. During a snapshot collection it is not mandatory for all the nodes to take their local snapshots. A snapshot initiator can maintain a list of nodes on which it is causally dependent. Thus only the minimum set of nodes required to maintain consistency is made to take local snapshots. This saves effort in global snapshot collection.

2.7 In [11] Bidyut Gupta, Shahram Rahimi and Ziping Liu presented a non-blocking coordinated checkpointing algorithm suitable for mobile environments. The following advantages make the proposed algorithm suitable for mobile distributed computing systems: (a) The proposed algorithm does not take any temporary checkpoint, and hence the overhead of converting temporary checkpoints to permanent checkpoints is eliminated. (b) The proposed algorithm does not use mutable checkpoints; hence the overhead of converting them to permanent ones is eliminated. (c) Their algorithm does not allow any process to take useless checkpoints. It uses very few control messages, and participating processes are interrupted fewer times. Algorithm Non-Blocking produces a consistent global state of the system. In the first two steps of the algorithm, the initiator process Pi identifies, by looking at its dependency vector, all application messages received from different processes that might become orphans if it takes a checkpoint. The initiator then sends primary checkpoint requests to all those processes that have sent at least one message to it, asking them to take their respective checkpoints. Consider the pseudocode for any process Pj. Process Pj makes sure that all processes from which it has received messages also take checkpoints, so that none of the messages it has received is an orphan. In the second else-if block of the pseudocode, process Pj first takes its checkpoint if needed and then processes the piggybacked application message. Hence such a message cannot be an orphan, and the algorithm generates a consistent global state of the system.

2.8 In [12] Arup Acharya and B. R. Badrinath presented a checkpointing algorithm for MHs that satisfies the following constraints:

(a) Location of a MH within the static network varies with time and therefore, a MH will first need to be located (“searched”) in order to obtain its local checkpoint and thereby incur search overhead.

(b) MHs often (voluntarily) disconnect from the network; a disconnected MH is not reachable from the rest of network. This means that a (disconnected) MH may not be available to provide its local checkpoint. Disconnection of one or more MHs should not prevent recording the global state of an application executing on MHs.

(c) Lastly, an MH is not equipped with stable storage; disk space at an MH is not considered stable due to the vulnerability of MHs to loss, theft and physical damage. Therefore an alternative stable repository is required to save the local checkpoints of MHs.

This paper first identifies the problems in recording a consistent global state of mobile distributed applications executing on mobile hosts. The focus of the paper is restricted to recording checkpoints. Their checkpointing algorithm determines (1) which local checkpoint should be included in the global checkpoint for each MH to ensure consistency, and (2) where the selected local checkpoint for each MH can be found within the static network. The algorithm does not require explicit control messages to be sent to MHs, and local checkpoints of MHs are not recorded in a coordinated fashion. Disconnections are handled by requiring an MH to checkpoint its local state prior to disconnection. The lack of stable storage at the MHs is compensated for by transferring the checkpointed state of an MH to the (stable) disk storage of its local MSS.

Additionally, the algorithm also efficiently tackles the following two issues:


therefore one of the goals of checkpointing schemes for MHs should be to reduce the search necessary to locate the required checkpoint(s).

2. The checkpointing algorithm must specify when an MH should checkpoint its local state, i.e. what is the maximum interval (sequence of events) that it can wait between two successive local checkpoints. Each application message is piggybacked with control information in the form of CKPT and LOC arrays: given a local checkpoint of some MH as a starting point, the CKPT array enables the checkpointing algorithm to first select a set of local checkpoints to form a consistent global checkpoint, while the LOC array provides the location of these local checkpoints.

2.9 In [13] Ch. D. V. Subba Rao and M. M. Naidu proposed a new checkpointing protocol combined with selective sender-based message logging. The protocol is free from the problem of lost messages. The term ‘selective’ means that messages are logged only within a specified interval known as the active interval, thereby reducing message logging overhead. All the processes take checkpoints at the end of their respective active intervals, forming a consistent global state. Outside the active interval there is no checkpointing of process state. This protocol minimizes several overheads, i.e. checkpointing overhead, recovery overhead and blocking overhead. In this protocol there exists an initiator, Pinitiator, which coordinates with all the processes to take a consistent global checkpoint. Pinitiator is responsible for invoking the checkpoint operation periodically. It sends control messages, namely ‘prepare checkpoint’ and ‘take checkpoint’ messages, to all other processes. The time that elapses between the two events of Pinitiator sending the ‘prepare checkpoint’ and ‘take checkpoint’ messages to all the processes is referred to as the active interval of Pinitiator. Similarly, the time that elapses between the two events of a process receiving the ‘prepare checkpoint’ and ‘take checkpoint’ messages is referred to as the active interval of that process. The maximum transmission delay incurred by any message to reach its destination is assumed to be t. It is also assumed that T > 3t, where T is the checkpoint interval, since the checkpoint interval is obviously greater than the active interval and the length of the active interval is bound to be at least 3t to survive the transmission delay of control messages and to enable logging of computational messages. If a process wants to send a message inside the active interval, the message first has to be logged, and then process execution continues. This enables the proposed protocol to handle lost messages. Every process maintains two counters, namely the message received count (MRC) and the message send count (MSC). These counters are initialized to zero at the start of the active interval. MRC and MSC are incremented only within the active interval; outside the active interval their values do not change. At time K*T + 3*t, the initiator sends the ‘take checkpoint’ signal to the other processes. Afterwards it takes its checkpoint and exits from the active interval. In response to ‘take checkpoint’, the rest of the processes take their checkpoints and exit from their respective active intervals. These checkpoints form a consistent global state. After exiting from the active interval, all the processes follow their normal operation, which implies that there is no checkpointing and no logging of messages outside the active interval. In case of any failure, every process rolls back to its latest checkpoint and the necessary messages are replayed from stable storage to reconstruct the previous state of the whole system. If the failure occurs after all processes have exited from their respective active intervals, the application rolls back to the latest consistent global state, namely ‘g’; if the failure occurs before one of the processes exits from its active interval, the application rolls back to the previous global state, namely ‘g-1’.
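The timing rules of this protocol can be sketched from the initiator's point of view. The sketch assumes that the ‘prepare checkpoint’ message of the K-th round is sent at time K*T, which the text above does not state explicitly; the function names are hypothetical, and the per-process active intervals, which are shifted by message delays, are not modelled.

```python
def active_interval(k, T, t):
    """Active interval of the initiator in round k, as (start, end) times.

    T is the checkpoint interval and t the maximum message delay.
    Assumes 'prepare checkpoint' is sent at k*T; 'take checkpoint' is
    sent at k*T + 3*t as described in the text.
    """
    if T <= 3 * t:
        raise ValueError("the protocol assumes T > 3t")
    return (k * T, k * T + 3 * t)


def must_log(send_time, T, t):
    """Selective logging: a message is logged only if sent inside an active interval."""
    k = int(send_time // T)
    start, end = active_interval(k, T, t)
    return start <= send_time < end


def rollback_target(g, all_exited_active_interval):
    """Global state restored after a failure: g, or g-1 if round g was not finished."""
    return g if all_exited_active_interval else g - 1
```

With T = 10 and t = 1, a message sent at time 21 falls inside the active interval of round 2 and is logged, while a message sent at time 25 is not.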


referred to as hard checkpoints. Soft checkpoints are less reliable than hard checkpoints because they can be lost with hard failures. However, soft checkpoints cost much less than hard checkpoints because they are created locally without any message exchange, whereas hard checkpoints have to be sent over the wireless link. The protocol uses the distinct creation costs of the two checkpoint types to adapt its behavior to the quality of service of the current network. For each network configuration the protocol saves a distinct number of soft checkpoints per hard checkpoint. If the network is slow, the protocol creates many soft checkpoints to avoid network transmissions. By correctly balancing soft and hard checkpoints, the protocol can keep its overheads approximately equal across various types of networks. Hard failures are recovered with global states containing hard checkpoints. If the protocol creates hard checkpoints frequently, the amount of rollback due to hard failures is small on average, but the performance of the protocol can be poor. Soft checkpoints let the protocol continue to function correctly while the mobile host is disconnected. Conceptually, a disconnected mobile host can be viewed as a host connected to a network with no bandwidth. In this case the number of soft checkpoints per hard checkpoint is set to infinity, which means that all the process states are stored locally. Local checkpoints are used to recover the mobile host from soft failures.
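The ratio of soft to hard checkpoints can be sketched as a simple counter rule. How the protocol actually chooses the ratio for a given network is not described here, so `soft_per_hard` is treated as an input, and the function name is hypothetical.

```python
import math


def checkpoint_kind(count, soft_per_hard):
    """Kind of the next checkpoint, given how many have been taken so far.

    soft_per_hard is the number of soft checkpoints saved per hard
    checkpoint; math.inf models a disconnected mobile host.
    """
    if math.isinf(soft_per_hard):
        return "soft"  # nothing can be sent, so every state is stored locally
    cycle = soft_per_hard + 1
    return "hard" if count % cycle == soft_per_hard else "soft"
```

With `soft_per_hard = 2` the sequence is soft, soft, hard, soft, soft, hard, and so on; a slower network would use a larger value so that fewer checkpoints cross the wireless link.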

2.11 In [15] Y. Manabe proposed a distributed coordinated checkpointing algorithm. A consistent global checkpoint is a set of states in which no message is recorded as received in one process and as not yet sent in another process. This algorithm obtains a consistent global checkpoint for any checkpoint initiation by any process. Under Chandy and Lamport’s assumption that one consistent global checkpoint is obtained for a set of concurrent checkpoint initiations, the total number of checkpoints is minimized. The paper discusses minimizing the number of additional checkpoints by reusing the checkpoints in a consistent global checkpoint for initiation c as checkpoints for another initiation c’. It presents two distributed checkpointing algorithms. The first one takes the minimum number of additional checkpoints under Chandy and Lamport’s assumption that one consistent global checkpoint is obtained for a set of concurrent initiations; this algorithm is optimal. The second algorithm modifies Chandy and Lamport’s assumption in order to further reduce the number of additional checkpoints.

2.12 In [16] J. L. Kim and T. Park presented a new efficient synchronized checkpointing protocol which exploits the dependency relation between processes in distributed systems. In their protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends have taken their checkpoints, and hence the process need not always wait for the decision made by the checkpointing coordinator as in conventional synchronized protocols. As a result, the checkpointing coordination time is substantially reduced and the possibility of total abort of the checkpointing coordination is reduced. By doing so, the second phase of the checkpointing coordination may be removed. When multiple checkpointing coordinations overlap, time can also be saved under their protocol if it is possible to use the decision of one checkpointing coordination for another. The checkpointing commitment decision can be made locally so that the total abort of checkpointing is avoided, i.e. when a process involved in a checkpointing coordination fails, the processes not affected by the failed one can make their decision, while protocols following the straightforward two-phase mechanism abort the whole checkpointing activity. Even if the checkpointing and rollback coordinations overlap, the processes which are involved in the checkpointing coordination but not in the rollback coordination can successfully make their decisions.


interval. If no message has been sent in the current checkpoint interval, the process does not take a checkpoint for that interval. However, due to varying drift rates of local clocks, the timers at different sites are not perfectly synchronized. Hence, the checkpoints may not be consistent because of orphan messages. In order to avoid this situation, every message sent between processes is piggybacked with the sender’s information, which tells how many checkpoint intervals have passed at the sender process. Using this information, the creation of orphan messages is avoided. Every message originated from a process reaches its destination through an MSS, where the local MSS piggybacks the ‘time to next checkpoint’ on it. When the message is received, the receiver sets its local timer equal to the timer of the local MSS. In this way, timer synchronization is implemented. After taking a checkpoint, a process sends its checkpoint information to its local MSS, where it is stored in stable storage. A global checkpoint consists of the Nth checkpoints of every process, where N ≥ 0. If a process has not taken its Nth checkpoint (as it did not send any message in the Nth checkpoint interval), its previous checkpoint is included in the Nth global checkpoint. The Nth global checkpoint is not complete unless every process sends either its Nth checkpoint or the information that N checkpoint intervals have passed.
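The rule for assembling the Nth global checkpoint can be sketched as follows. The sketch assumes that each process is represented by the list of interval indices for which it actually saved a checkpoint and that every process has an initial checkpoint with index 0; both the representation and the function name are assumptions.

```python
def global_checkpoint(n, checkpoints):
    """Pick, for each process, the checkpoint that belongs to the Nth global checkpoint.

    checkpoints[p] lists the interval indices for which process p saved a
    checkpoint. A process that skipped interval n (because it sent no
    message in it) contributes its most recent earlier checkpoint.
    """
    selected = {}
    for pid, taken in checkpoints.items():
        eligible = [i for i in taken if i <= n]
        if not eligible:
            raise ValueError(f"process {pid} has no checkpoint at or before {n}")
        selected[pid] = max(eligible)
    return selected
```

For instance, if P1 saved checkpoints 0, 1 and 2, P2 saved 0 and 2, and P3 saved only 0, the first global checkpoint consists of checkpoint 1 of P1 and checkpoint 0 of both P2 and P3.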

2.14 In [18] Kanmani-Anitha-Ganesan proposed a new approach to reduce the overheads of previous non-blocking algorithms. The new algorithm is based on the timeouts of the coordinator process. Instead of storing a single checkpoint like other non-blocking algorithms, it keeps three checkpoints: FS, MS and LS. Whenever a state change event happens in the system, FS is taken first, and for the second change MS is taken. When the coordinator, say P2, reaches its first timeout, the checkpoint coordination process starts. It uses a dependency record to store the dependent processes and their sequence numbers. When the coordinator’s local clock reaches its timeout, it sends the checkpoint request to its dependent processes and continues its normal computation. When the checkpoint request of the coordinator reaches a process, it forces the process to take the last checkpoint, LS. With the checkpoint request, the coordinator sends the sequence number of the last message between the coordinator and the dependent process. The dependent process, say P1, now checks the dependency arrays of its three checkpoints FS, MS and LS. If any one of the dependency vectors matches, it sets that checkpoint as its tentative checkpoint. It then finds its own dependencies and sends the checkpoint request to them. Depending upon the dependent processes of P1, any one of the above three checkpoints is set as the permanent and consistent checkpoint. If P1 depends upon another process P3, the request to P3 must be sent by P1. If, after P1 has tentatively set FS as its checkpoint, a request arises to change the checkpoint from FS to MS, P1 checks the new dependency array. After the checkpointing process is over, it sends a reply to the initiator. When the coordinator receives replies from its dependent processes, it sends a commit signal to them. When the consistent global state of the system has been set by means of this local coordination, all the processes make their tentative checkpoints permanent. After taking the permanent checkpoint, the other unwanted checkpoints are discarded to use memory optimally.

2.15 In [5] Kumar-Mishra-Joshi presented a non-blocking minimum-process coordinated checkpointing protocol that not only minimizes useless checkpoints but also minimizes the overall bandwidth required over wireless channels. In their protocol they propose to reduce the height of the checkpointing tree, which reduces the uncertainty period and the number of induced checkpoints. This is achieved by asking all processes to send their direct dependencies to the initiator, similar to the Cao and Singhal approach. The initiator finds the minimum set and then sends it to all processes. Only those processes which are in the minimum set take a checkpoint. The CSN is used to avoid blocking. Any process Pi can initiate the algorithm. They used their


3. CONCLUSION

We conclude that a checkpointing algorithm should have the following desirable features:

• The time taken by the checkpointing algorithm should be minimum during a failure-free run.

• Domino effect or rollback propagation should be minimum.

• Selective rollback should be possible.

• Resource requirements for checkpointing should be minimum.

• Recovery should be fast in the event of failure. Availability of a consistent global state in stable storage expedites recovery.

4. REFERENCES

[1] Chandy K. M. and Lamport L., “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, February 1985.

[2] G. Cao and M. Singhal, “Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems”, IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 2, February 2001, pp. 157-172.

[3] Pradhan D.K., Krishana P.P. and Vaidya N.H., “Recovery in Mobile Wireless Environment: Design and Trade-off Analysis,” Proceedings 26th International Symposium on Fault-Tolerant Computing, pp. 16-25, 1996.

[4] Proceedings of International Conference on Parallel Processing, pp. 37-44, August 1998.

[5] Lalit Kumar Awasthi, Kumar P., “A Synchronous Checkpointing Protocol for Mobile Distributed Systems: Probabilistic Approach,” Int. J. Information and Computer Security, Vol. 1, No. 3, pp. 298-314, 2007.

[6] R. Prakash and M. Singhal. “Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems”. IEEE Trans. on Parallel and Distributed System, pages 1035-1048,Oct. 1996.

[7] Elnozahy E.N., Alvisi L., Wang Y.M. and Johnson D.B., “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.

[8] G. Cao and M. Singhal. “On impossibility of Min-Process and Non-Blocking Checkpointing and An Efficient Checkpointing algorithm for mobile computing Systems”. OSU Technical Report #OSU-CISRC-9/97-TR44, 1997.

[9] Weigang Ni, Susan V. Vrbsky and Sibabrata Ray, “Pitfalls in Distributed Non blocking Checkpointing”, University of Alabama.

[10] Prakash R. and Singhal M., “Maximal Global Snapshot with Concurrent Initiators,” Proc. Sixth IEEE Symp. Parallel and Distributed Processing, pp. 344-351, Oct. 1994.

[11] Bidyut Gupta, S. Rahimi and Z. Liu, “A New High Performance Checkpointing Approach for Mobile Computing Systems”, IJCSNS International Journal of Computer Science and Network Security, Vol. 6, No. 5B, May 2006.

[12] Acharya A. and Badrinath B. R., “Checkpointing Distributed Applications on Mobile Computers,” Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, pp. 73-80, September 1994.

[13] Ch.D.V. Subba Rao and M.M. Naidu. “A New, Efficient Coordinated Checkpointing Protocol Combined with Selective Sender-Based Message Logging”

[14] Nuno Neves and W. Kent Fuchs. “Adaptive Recovery for Mobile Environments”, in Proc.IEEE High-Assurance Systems Engineering Workshop,October 21-22,1996,pp.134-141.

[15] Y. Manabe, “A Distributed Consistent Global Checkpoint Algorithm with Minimum Number of Checkpoints”, Technical Report of IEICE, COMP97-6 (April 1997).

[16] J. L. Kim and T. Park, “An Efficient Protocol for Checkpointing Recovery in Distributed Systems”, IEEE Transactions on Parallel and Distributed Systems, 4(8), pp. 955-960, Aug 1993.

[17] Yanping Gao, Changhui Deng, Yandong Che. “An Adaptive Index-Based Algorithm using Time-Coordination in Mobile Computing”. International Symposiums on Information Processing, 2008.