Need for Fault Tolerance in Distributed Systems
Fault tolerance is needed so that a distributed system can provide the following four properties.
- Availability: The system is ready to be used at any given moment.
- Reliability: The system can run continuously for a period of time without failure.
- Safety: When the system fails, nothing catastrophic happens; for example, a failure must not leave the system open to unauthorized access.
- Maintainability: How easily and quickly a failed node or system can be repaired.
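Availability, reliability, and maintainability are often quantified with mean time to failure (MTTF) and mean time to repair (MTTR). The sketch below is illustrative only: the function names are hypothetical, the availability formula is the common steady-state ratio MTTF / (MTTF + MTTR), and the reliability function assumes a constant failure rate (an exponential failure model), which real systems may not follow.

```python
import math


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is usable: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of running t hours with no failure.

    Assumption: constant failure rate 1/MTTF (exponential model).
    """
    if mttf_hours <= 0 or t_hours < 0:
        raise ValueError("MTTF must be positive and t must be non-negative")
    return math.exp(-t_hours / mttf_hours)


if __name__ == "__main__":
    # Hypothetical node: fails on average every 999 hours, takes 1 hour to repair.
    print(f"availability: {steady_state_availability(999, 1):.4f}")
    print(f"reliability over 24 h: {reliability(24, 999):.4f}")
```

Under these assumptions a shorter repair time (better maintainability) raises availability but leaves reliability unchanged, which is one way to see that the two properties are distinct.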
Fault Tolerance in Distributed Systems
A distributed system is a collection of independent computers connected so that they work together as a single system. Each node has its own memory and resources, and the nodes may also share some common resources and peripheral devices. Designing a distributed system is complex because all the nodes must stay connected even when they are located far apart. The main challenges include fault tolerance, transparency, and communication primitives, and fault tolerance is one of the most significant.
Three related types of problems occur in distributed systems.
- Fault: A weakness or shortcoming in the system or in any of its hardware or software components. The presence of a fault can lead to an error and then to a failure.
- Error: An incorrect result produced because a fault is present.
- Failure: The final outcome, in which the assigned goal is not achieved.
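The chain from fault to error to failure can be illustrated with a small sketch. Everything here is hypothetical: the replica functions, the off-by-one defect, and the use of majority voting across three replicas as one possible way to mask an error before it becomes a failure.

```python
from collections import Counter
from typing import Callable, Sequence


def healthy_replica(x: int) -> int:
    """Correct implementation of the service: returns x squared."""
    return x * x


def faulty_replica(x: int) -> int:
    """Replica containing a fault (a hypothetical off-by-one defect)."""
    return x * x + 1  # the fault produces an error: an incorrect result


def call_single(replica: Callable[[int], int], x: int) -> int:
    """No fault tolerance: an error in the replica reaches the caller."""
    return replica(x)


def call_with_voting(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Return the result a strict majority of replicas agree on."""
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority: the error cannot be masked")
    return value


if __name__ == "__main__":
    # Failure: the caller asked for 4 squared and did not get 16.
    print(call_single(faulty_replica, 4))
    # Tolerated: two healthy replicas outvote the faulty one.
    print(call_with_voting([healthy_replica, faulty_replica, healthy_replica], 4))
```

In this sketch the fault is still present in one replica and still produces an error, but the voting step keeps that error from turning into a failure as long as a majority of replicas are healthy.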