Fault tolerance in the checkpoint approach

Today, there is a demand for a highly secure virtual network in which any resource from any cluster can be shared even if there is a failure in the system. Grid computing is a distributed computing paradigm that differs from traditional distributed computing in that it is aimed at large-scale systems that may even cross organizational boundaries. Reliability challenges arise from the unreliable nature of the network infrastructure, in addition to the challenges of managing and scheduling these applications. A failure can occur due to a link failure, a resource failure, or any other reason, and it must be tolerated so that the system keeps operating smoothly and accurately without interrupting the current work. Many techniques are used for the detection of and recovery from such faults: an appropriate fault detector can prevent losses caused by a system crash, and a reliable fault tolerance technique can save the system from failure. Fault tolerance is an important property for achieving reliability, availability, and QoS. The fault tolerance mechanism used here sets job checkpoints based on the resource failure rate. If a resource failure occurs, the job is restarted from its last successful state using a checkpoint file held on another grid resource. Selecting the optimal checkpointing interval for an application is important to minimize the application's execution time in the presence of system errors. In case of resource failure, the failure index-based rescheduling algorithm moves the job from the failed resource to another available resource with the lowest failure index value and runs it from the most recently saved checkpoint. This ensures that the work gets done within the stipulated deadline with higher productivity and helps make the grid environment reliable.

Grid computing refers to the aggregation of computing resources from multiple administrative domains to achieve a common goal. The grid can be thought of as a distributed system with workloads that are non-interactive and involve a large number of files. It is more common for a single grid to be used for a variety of different purposes, although a grid may also be dedicated to a specialized application. Grids are often built with the help of generic grid software libraries known as middleware. The grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations, and managing these resources is an important part of the grid computing infrastructure. Fault tolerance is of paramount importance for achieving the promising potential of computational grids, because resources are geographically distributed. Furthermore, the probability of resource failure is much higher than in traditional parallel computing, and a resource failure fatally affects job execution. Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults, and it makes the system more reliable. A fault tolerance service is essential for meeting QoS requirements in grid computing and deals with various types of failures, including process failures, processor failures, and network failures.
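As an illustration of the failure index-based rescheduling step described above, the short Python sketch below picks the available resource with the lowest failure index and resumes the job from its last saved checkpoint. It is only a minimal sketch of the idea, not the actual algorithm from the literature: the Resource and Checkpoint types, the reschedule_job function, and the numeric failure index values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Resource:
    name: str
    failure_index: float       # lower value = historically more reliable
    available: bool = True

@dataclass
class Checkpoint:
    job_id: str
    progress: float            # fraction of the job already completed (0.0 to 1.0)

def reschedule_job(job_id: str, resources: list, last_checkpoint: Optional[Checkpoint]):
    """On resource failure, pick the available resource with the lowest
    failure index and resume the job from its last saved checkpoint."""
    candidates = [r for r in resources if r.available]
    if not candidates:
        raise RuntimeError("no available grid resource to reschedule onto")
    target = min(candidates, key=lambda r: r.failure_index)
    resume_point = last_checkpoint.progress if last_checkpoint else 0.0
    return target, resume_point

# Example: job "j42" failed on resource r1 after reaching 60% completion.
resources = [
    Resource("r1", failure_index=0.30, available=False),   # the failed resource
    Resource("r2", failure_index=0.10),
    Resource("r3", failure_index=0.25),
]
target, resume_point = reschedule_job("j42", resources, Checkpoint("j42", 0.6))
print(f"resume j42 on {target.name} from {resume_point:.0%}")   # -> resume j42 on r2 from 60%
```

Resuming from the checkpointed progress rather than from scratch is what gives the rescheduled job a realistic chance of still finishing within the user's deadline.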
Checkpoint interval, or application state checkpoint period, is one of the important parameters in a checkpointing system that provides fault tolerance. Smaller checkpoint intervals increase the application's execution cost due to checkpointing, while larger checkpoint intervals lead to longer recovery times after failures. So it is necessary to determine the optimal checkpoint interval that minimizes the application's execution time in the presence of faults.

PROBLEMS:
1. If a failure occurs on one grid resource, the job is rescheduled on another resource, which ultimately results in failure to meet the user's QoS requirement, i.e. the deadline. The reason is simple: rerunning the job from the beginning takes longer.
2. There are resources that meet the deadline constraint but have a tendency to fail in computational grid environments. In such a scenario, the grid scheduler still selects such a resource, for the simple reason that it promises to satisfy the user's requirements for the grid job. This ultimately compromises the user's QoS parameters for completing the job.
3. Even if there is a fault in the system, a running task should be completed within its deadline; a task that misses its deadline is of no use. Therefore, meeting real-time deadlines is the main issue.
4. Availability: real-time distributed systems must deliver end-to-end services and withstand systematic failures or attacks without impacting customers or operations.
5. Scalability: the system must handle increasing amounts of work and increase total throughput under increased load as resources are added.

REMEDIES:
An adaptive fault tolerance approach with checkpointing is used to overcome the above-mentioned drawbacks. In this approach, each resource maintains fault tolerance information; when an error occurs, the resource updates its record of that error, and this information is used when deciding which resource to assign a job to. Checkpointing is one of the most widespread techniques for providing fault tolerance on unreliable systems. It records a snapshot of the entire application state so that the application can be restarted from that point after an error occurs. Checkpoints can be stored in temporary or stable storage. However, the efficiency of the mechanism strongly depends on the length of the checkpoint interval: frequent checkpointing increases overhead, while infrequent checkpointing can lead to the loss of significant computation. Therefore, deciding on the checkpoint interval size and checkpointing technique is a complicated task and should be based on knowledge of the system and the application. Checkpoint recovery depends on the MTTR of the system. Typically, the state of an application is periodically saved to stable storage such as a hard disk; after a crash, the application restarts from the last checkpoint rather than from scratch. There are three checkpointing strategies: coordinated, uncoordinated, and communication-induced checkpointing.
1. In coordinated checkpointing, processes synchronize their checkpoints to ensure that the saved states are consistent with each other, so that the overall combined saved state is also consistent.
2. In uncoordinated checkpointing, processes schedule checkpoints independently at different times and do not take messages into account.
3. In communication-induced checkpointing, only selected critical checkpoints are coordinated.
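The text above does not prescribe how the checkpoint interval should be derived from the failure rate. A common first-order estimate for this trade-off is Young's approximation, T_opt ≈ sqrt(2 × C × MTBF), where C is the cost of writing one checkpoint and MTBF is the resource's mean time between failures. The sketch below is only one plausible way to compute such an interval; the function name and the example numbers are illustrative, not taken from the source.

```python
import math

def checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    T_opt = sqrt(2 * C * MTBF). Too small an interval wastes time writing
    checkpoints; too large an interval wastes time recomputing lost work."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: writing one checkpoint takes 30 s and the resource fails on
# average once every 4 hours -> checkpoint roughly every 15-16 minutes.
print(round(checkpoint_interval(30.0, 4 * 3600)))   # ~930 seconds
```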
CHECKPOINTING MECHANISM:
A grid resource is a member of a grid and offers computing services to grid users. Grid users register with the grid's Grid Information Server (GIS) by specifying QoS requirements such as the deadline for completing execution, the number of processors, the operating system type, and so on. The components used in the architecture are described below.

Scheduler: The scheduler is an important entity of a grid. It receives jobs from grid users, selects feasible resources for those jobs based on the information received from the GIS, and then generates mappings between jobs and resources. When the schedule manager receives a grid job from a user, it gets the details of the available grid resources from the GIS and passes the list of available resources to the entities of the MTTR scheduling strategy. The Matchmaker entity matches resources against job requirements. The Response Time Estimator estimates the response time for a job on each matched resource based on the job's transfer time, queue wait time, and service time. The Resource Selector selects the resource with the minimum response time. A job dispatcher sends the jobs one by one to the checkpoint manager.

GIS: The GIS contains information about all available grid resources and maintains resource details such as processor speed, available memory, and load. All grid resources joining and leaving the grid are tracked by the GIS. The scheduler consults the GIS for information about available grid resources whenever it has jobs to execute.

Checkpoint Manager: Receives scheduled jobs from the scheduler, sets the checkpoint interval based on the failure rate of the resource each job is scheduled on, and then sends the job to that resource. The checkpoint manager receives a job-completion or job-failure message from the grid resource and responds accordingly. If an error occurs during execution, the job is rescheduled from the last checkpoint instead of running from scratch.

Checkpoint Server: The status of the job is reported to the checkpoint server at each checkpoint set by the checkpoint manager. The checkpoint server saves the job state and returns it on request, i.e. during a job or resource failure. For a particular job, the checkpoint server discards the previous checkpoint result when a new checkpoint result value is received.

Fault Index Manager: Maintains the fault index value of each resource, which indicates that resource's failure rate. The fault index of a resource is incremented whenever the resource does not complete its assigned work by the deadline, and also in case of resource failure; it is decremented when the resource completes its assigned work by the deadline. The fault index manager updates the fault index of a grid resource using the fault index update algorithm.

Checkpoint Replication Server: When a new checkpoint is created, the Checkpoint Replication Server (CRS) replicates the checkpoints created on remote resources by applying RRSA, and the details are stored in the Checkpoint Server after replication. To obtain information about all checkpoint files, the CRS queries the Checkpoint Server, and it monitors the Checkpoint Server for the latest checkpoint versions throughout the application runtime. Information about available resources, hardware, memory, and bandwidth is obtained from the GIS, and the required details are propagated periodically.
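To make the fault index update and the response-time-based resource selection described above concrete, here is a minimal Python sketch that combines the two. It is an interpretation of the description, not the referenced fault index update algorithm itself: the class and function names are hypothetical, and the fixed ±1 adjustment of the fault index is an assumption (the text only states that the index is incremented on failure or a missed deadline and decremented on success).

```python
class FaultIndexManager:
    """Tracks a per-resource fault index (higher = more failure-prone)."""
    def __init__(self):
        self.index = {}                          # resource name -> fault index

    def report(self, resource: str, met_deadline: bool) -> None:
        # Increment on a failure or missed deadline, decrement on success,
        # never dropping below zero.
        delta = -1 if met_deadline else 1
        self.index[resource] = max(0, self.index.get(resource, 0) + delta)

def estimate_response_time(transfer_s: float, queue_wait_s: float, service_s: float) -> float:
    """Response Time Estimator: transfer time + queue wait time + service time."""
    return transfer_s + queue_wait_s + service_s

def select_resource(matched: dict) -> str:
    """Resource Selector: choose the matched resource with the minimum
    estimated response time."""
    return min(matched, key=lambda name: estimate_response_time(*matched[name]))

# Example: three matched resources with (transfer, queue wait, service) times in seconds.
matched = {"r1": (5.0, 20.0, 300.0), "r2": (2.0, 60.0, 250.0), "r3": (8.0, 10.0, 280.0)}
fim = FaultIndexManager()
chosen = select_resource(matched)                # -> "r3" (298 s total)
fim.report(chosen, met_deadline=True)            # finished in time: index stays at 0
print(chosen, fim.index)                         # r3 {'r3': 0}
```

In a fuller implementation the scheduler would consult both signals together, e.g. filtering out resources whose fault index exceeds a threshold before picking the minimum-response-time candidate, which is how the failure index feeds back into the rescheduling decision described earlier.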