Failure Detection

Definitions

Fault: a malfunction in a system component that causes an error
Error: a problem in a component's internal state that can lead to a failure
Failure: an externally visible problem of the system
Fault tolerance: the ability of a system to keep operating properly when some of its components are faulty

Algorithm
- Periodically ping every process
- Suspect a process if it does not respond within time \(T_{suspect}\) after the ping
- If a suspected process later responds to a ping, it is no longer suspected

Complexity
- Number of messages: \(O(N)\)
- Network traffic: \(O(N)\)
- Load: \(O(1/T)\)
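
The ping-based steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name `PingFailureDetector`, the `send` callback, and the injectable `clock` are assumptions made for the example, and `t_suspect` plays the role of \(T_{suspect}\).

```python
import time
from typing import Callable, Dict, Iterable, Set


class PingFailureDetector:
    """Suspects a process when a ping stays unanswered longer than t_suspect."""

    def __init__(self, processes: Iterable[str], t_suspect: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.processes = list(processes)
        self.t_suspect = t_suspect
        self.clock = clock  # hypothetical hook so the clock can be faked in tests
        self.pending: Dict[str, float] = {}  # process -> time of oldest unanswered ping
        self.suspected: Set[str] = set()

    def ping_all(self, send: Callable[[str], None]) -> None:
        """One ping round: one message per monitored process."""
        now = self.clock()
        for p in self.processes:
            self.pending.setdefault(p, now)  # keep the oldest unanswered ping time
            send(p)

    def on_pong(self, process: str) -> None:
        """A reply clears the pending ping and any earlier suspicion."""
        self.pending.pop(process, None)
        self.suspected.discard(process)

    def check_timeouts(self) -> Set[str]:
        """Mark every process whose ping has been pending too long."""
        now = self.clock()
        for p, sent_at in self.pending.items():
            if now - sent_at > self.t_suspect:
                self.suspected.add(p)
        return set(self.suspected)
```

A caller would invoke `ping_all` once per period, feed replies into `on_pong`, and read `check_timeouts()` to get the current suspect set.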

Algorithm
- Every process periodically sends its state to all other processes

Complexity
- Number of messages: \(O(N^2)\)
- Network traffic: \(O(N^2)\)
- Load: \(O(N/T)\)
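
As a rough illustration of the all-to-all scheme, the sketch below simulates heartbeat rounds in memory and counts messages. The names `HeartbeatNode` and `run_round`, and the `timeout` used to interpret silence, are assumptions for the example; the notes do not say how received state is interpreted.

```python
from typing import Dict, List, Set


class HeartbeatNode:
    def __init__(self, name: str, peers: List[str], timeout: float) -> None:
        self.name = name
        self.peers = peers
        self.timeout = timeout  # assumed threshold for treating silence as suspicious
        self.last_seen: Dict[str, float] = {}

    def heartbeat_targets(self) -> List[str]:
        """Everyone except ourselves receives our state."""
        return [p for p in self.peers if p != self.name]

    def on_heartbeat(self, sender: str, now: float) -> None:
        self.last_seen[sender] = now

    def suspected(self, now: float) -> List[str]:
        # A peer never heard from is treated as last seen at time 0.
        return sorted(p for p in self.heartbeat_targets()
                      if now - self.last_seen.get(p, 0.0) > self.timeout)


def run_round(nodes: Dict[str, HeartbeatNode], alive: Set[str], now: float) -> int:
    """Deliver one heartbeat round; return the number of messages sent."""
    sent = 0
    for name in sorted(alive):
        for target in nodes[name].heartbeat_targets():
            sent += 1
            if target in alive:  # messages to crashed processes are lost
                nodes[target].on_heartbeat(name, now)
    return sent
```

With \(N\) live processes, one round sends \(N(N-1)\) messages, which matches the quadratic message count of this scheme.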

Algorithm
- Every process stores a list of neighbours and, for each one, a counter of the states received from it
- Every process periodically sends its state to its neighbours
- On receiving another process's state, a process updates the corresponding counter and forwards the state to its other neighbours
- The failure detector reports the counters without interpreting them

Complexity
- Message size: \(O(N)\)
- Number of messages: \(O(N\log N)\)
- Network traffic: \(O(N^2\log N)\)
- Load: \(O(N\log N/T)\)
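
One way to picture the gossip-based scheme is the sketch below. It assumes each process keeps a table of heartbeat counters (one entry per known process, hence messages of \(O(N)\) size) and merges incoming tables by taking the larger counter. The names `GossipNode` and `gossip_round` and the ring topology in the usage note are illustrative assumptions, and forwarding is modelled by the merged table being sent in the next round.

```python
from typing import Dict, List, Set


class GossipNode:
    def __init__(self, name: str, neighbours: List[str]) -> None:
        self.name = name
        self.neighbours = neighbours
        self.counters: Dict[str, int] = {name: 0}  # process -> highest counter seen

    def tick(self) -> Dict[str, int]:
        """Increment our own counter and return a copy of the table to send."""
        self.counters[self.name] += 1
        return dict(self.counters)

    def merge(self, table: Dict[str, int]) -> bool:
        """Keep the highest counter per process; report whether anything changed."""
        changed = False
        for proc, count in table.items():
            if count > self.counters.get(proc, -1):
                self.counters[proc] = count
                changed = True
        return changed

    def report(self) -> Dict[str, int]:
        """Expose raw counters; interpreting them is left to the caller."""
        return dict(self.counters)


def gossip_round(nodes: Dict[str, GossipNode], alive: Set[str]) -> None:
    """Every live node sends its table to its live neighbours."""
    for name in sorted(alive):
        table = nodes[name].tick()
        for nb in nodes[name].neighbours:
            if nb in alive:
                nodes[nb].merge(table)
```

In a four-node ring, for example, the counter that `a` holds for a crashed node stops increasing, while counters for live nodes keep growing through intermediate neighbours. The detector itself only reports the numbers.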

Algorithm
- Every process periodically pings another randomly chosen process
- If no response is received from process \(P\) within time \(T_1\), the process asks \(K\) other random processes to ping \(P\)
- If there is still no response from \(P\), it is considered failed and removed from the membership list

Complexity
- Number of messages (avg): \(O(N)\)
- Network traffic (avg): \(O(N)\)
- Load (avg): \(O(N/T)\)

Properties
- Completeness
- High accuracy (it grows exponentially with \(K\))
- Scalability
- Failure detection time does not depend on \(N\)
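
The random-probe algorithm with \(K\) indirect pings can be sketched as a single function. This is a simplified illustration under stated assumptions: `probe_round` is a hypothetical name, and `ping(src, dst)` is a hypothetical transport hook that returns `True` when `dst` answers `src` within the timeout (\(T_1\) in the text).

```python
import random
from typing import Callable, List, Optional


def probe_round(me: str, members: List[str], k: int,
                ping: Callable[[str, str], bool],
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Run one probe round; return the member declared failed, or None."""
    rng = rng or random.Random()
    others = [m for m in members if m != me]
    if not others:
        return None
    target = rng.choice(others)
    if ping(me, target):  # direct ping succeeded
        return None
    helpers = [m for m in others if m != target]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # Indirect ping: the helper must be reachable and must reach the target.
        if ping(me, helper) and ping(helper, target):
            return None
    members.remove(target)  # still silent: drop from the membership list
    return target
```

If each of the \(K+1\) probe attempts were assumed to fail independently with probability \(q\), a live process would be wrongly removed with probability \(q^{K+1}\), which gives some intuition for why accuracy improves exponentially with \(K\).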

Fast And Lethal Component Observation Network

The algorithm uses all communication messages exchanged between processes to determine the state of a process.
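
The idea of inferring liveness from every message, rather than from dedicated pings, might look roughly like the sketch below. It illustrates only that one sentence and is not a description of how the named system works internally; `PassiveObserver` and its `timeout` are assumptions introduced for the example.

```python
from typing import Dict, List


class PassiveObserver:
    """Tracks when each process was last heard from, using any received message."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout  # assumed silence threshold
        self.last_heard: Dict[str, float] = {}

    def on_message(self, sender: str, now: float) -> None:
        """Call for every received message, whatever its purpose."""
        self.last_heard[sender] = now

    def silent(self, now: float) -> List[str]:
        """Processes that have sent nothing for longer than the timeout."""
        return sorted(p for p, t in self.last_heard.items()
                      if now - t > self.timeout)
```

Because ordinary traffic doubles as evidence of liveness, a busy process needs no extra probe messages in this sketch.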