Failure handling
Introduction
Dealing with failures is a key challenge in distributed systems like Dynamo. In an infrastructure with thousands of servers, failures are the norm rather than the exception. To provide high availability in such an environment, Dynamo needs efficient failure detection and flexible membership protocols.
Failure detection
In Dynamo, failure detection is completely decentralized. Unlike centralized systems, which rely on a cluster manager to monitor liveness, Dynamo has each node locally detect failures of its peers. How does it work under the hood?
When a node receives a client's read request, it needs to coordinate with the preferred replica nodes to fetch the latest version of the data. It sends requests to each node and waits for their responses. If a replica does not respond within a timeout period, the coordinator treats it as failed and simply moves on to the next replica in the preference list.
The same goes for writes: the coordinator waits for the write quorum before considering the write successful. Unresponsive nodes are marked as failed and excluded from the quorum count. Because failure detection is localized, different coordinators can hold divergent views of which nodes have failed, which actually helps availability in partition scenarios.
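To make this concrete, here is a minimal Python sketch of the coordinator's read loop. The send_read helper, the timeout value, and the quorum size R are illustrative assumptions, and the real coordinator contacts replicas in parallel; this sequential loop just shows the timeout-and-skip behavior.

```python
REQUEST_TIMEOUT = 0.5  # seconds per replica; an assumed value
R = 2                  # read quorum size

def coordinate_read(key, preference_list, send_read):
    """Walk the preference list, locally skipping replicas that
    time out, until the read quorum R is reached."""
    responses = []
    for node in preference_list:
        try:
            # send_read stands in for the RPC to a replica
            responses.append(send_read(node, key, timeout=REQUEST_TIMEOUT))
        except TimeoutError:
            # Treat the node as failed *locally* and move on; no
            # global failure state is consulted or updated here.
            continue
        if len(responses) >= R:
            break
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    return responses
```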
Gossip-based propagation
Timeouts alone are not sufficient to propagate knowledge of crashes. This is where membership gossip comes in. Every node periodically picks another random node to exchange membership lists with.
- If node A talks to node B, they reconcile their list of live members.
- If B has marked node C as dead, A will also update its list to show C as failed after verification.
This gossip allows failures to propagate across the system. Combined with localized timeouts, the coordinator will stop attempting to contact a failed node after receiving gossip about the failure. The gossip also spreads news when a node recovers, allowing recovered nodes to rejoin the active membership list.
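A toy sketch of one gossip round might look like the following. The Membership class, the per-entry version counters, and the exchange RPC are illustrative assumptions, not Dynamo's actual wire protocol:

```python
import random

class Membership:
    """node_id -> (status, version); higher version wins on merge."""
    def __init__(self):
        self.members = {}

    def merge(self, remote_members):
        # Reconcile with a peer's view: the entry carrying the higher
        # version number wins, so newer alive/dead verdicts spread.
        for node_id, (status, version) in remote_members.items():
            local = self.members.get(node_id)
            if local is None or version > local[1]:
                self.members[node_id] = (status, version)

def gossip_round(local_view, peers, exchange):
    """Pick one random peer and swap membership lists with it."""
    peer = random.choice(peers)
    remote_members = exchange(peer, local_view.members)  # RPC stand-in
    local_view.merge(remote_members)
```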
Summary
Dynamo eschews centralized failure monitoring and uses decentralized failure detection driven by timeouts and membership gossip. This architecture results in higher availability, as each coordinator can independently assess the liveness of peers based on partial knowledge. The gossip protocol acts as a feedback mechanism to reconcile and disseminate the latest membership state. Together, they provide a scalable and available way to handle failures.
Handling Transient Failures
When failures are transient, Dynamo needs to keep data available with minimal disruption.
Sloppy Quorums
To prevent temporary node failures from blocking operations, Dynamo uses sloppy quorums. This means reads and writes can proceed even if some replicas are down, as long as a minimum threshold of nodes has responded.
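In the paper's notation, N is the number of replicas per key, W the write quorum, and R the read quorum. Here is a small sketch of the sloppy-quorum rule, with the health check left as an assumed helper:

```python
N = 3  # replicas per key
W = 2  # acks needed for a successful write
R = 2  # replies needed for a successful read

def first_n_healthy(preference_list, is_healthy, n=N):
    """Sloppy quorum: take the first N *healthy* nodes from the
    preference list, which may reach past the canonical first N."""
    return [node for node in preference_list if is_healthy(node)][:n]

def write_succeeded(acks):
    return len(acks) >= W

def read_succeeded(replies):
    return len(replies) >= R
```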
Hinted Handoff
All reads and writes are performed on the first N healthy nodes from the preference list, which might not always be the first N nodes encountered while walking the consistent hashing ring:
- If node A is temporarily down or unreachable during a write, the replica that is supposed to live on A will be sent to node D instead (this flow is sketched after the list).
- The replica stored on node D carries a hint in its metadata indicating which node was the intended recipient.
- Node D keeps such hinted replicas in a separate local database that is scanned periodically.
- Upon detecting that A has recovered, D will attempt to deliver the replica to A.
- Once the transfer succeeds, D deletes the object from its local store.
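Below is a compact sketch of the handoff flow described above. The HintStore interface and the is_alive and deliver callbacks are hypothetical, since the paper does not spell out this API:

```python
class HintStore:
    """Separate local database of replicas held on behalf of others."""
    def __init__(self):
        self.hints = []  # (intended_node, key, value) triples

    def add(self, intended_node, key, value):
        # The hint records which node was the intended recipient.
        self.hints.append((intended_node, key, value))

    def scan_and_deliver(self, is_alive, deliver):
        """Run periodically: hand objects back to recovered nodes and
        drop the local copy once a transfer succeeds."""
        remaining = []
        for intended_node, key, value in self.hints:
            if is_alive(intended_node) and deliver(intended_node, key, value):
                continue  # delivered: forget the hinted replica
            remaining.append((intended_node, key, value))
        self.hints = remaining
```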
Recovering from Permanent Failures
For permanent failures, Dynamo needs to reconstruct the lost replicas to maintain the replication factor.
Replica Synchronization
Replica synchronization is an important mechanism in Dynamo for reconciling divergent replicas caused by failures and ensuring eventual consistency. Here are some key points:
- It is implemented using an anti-entropy protocol that runs periodically in the background.
- Each node maintains a Merkle tree for the key ranges that it is responsible for. The tree roots are exchanged between replicas.
- By comparing the Merkle tree roots, nodes can efficiently detect differences in the key range without transferring the entire data.
- If the roots mismatch, they traverse down the hash tree by exchanging child hashes to identify the specific keys that differ (see the sketch after this list).
- For the actual key(s) that diverge, the replicas exchange the full content and the latest version (based on vector clocks) wins.
- Any deleted keys are also reconciled this way by replicating the deletions.
- Over multiple runs of anti-entropy, all replicas will eventually converge.
- If a replica falls behind significantly, it can request a full bulk transfer from another up-to-date replica to catch up.
- Replica sync is done in the background without affecting foreground traffic. An admission control mechanism regulates its resource usage.
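To illustrate the Merkle-tree comparison, here is a simplified Python sketch. It assumes both replicas build a tree over the same sorted key set; real implementations keep one tree per key range hosted by a virtual node:

```python
import hashlib

def leaf_hash(key, value):
    return hashlib.sha256(f"{key}:{value}".encode()).hexdigest()

def build_tree(items):
    """Build hash levels bottom-up from a sorted list of (key, value)."""
    level = [leaf_hash(k, v) for k, v in items]
    levels = [level]
    while len(level) > 1:
        level = [
            hashlib.sha256("".join(level[i:i + 2]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

def diff_leaves(a, b, level=None, index=0):
    """Walk both trees top-down, descending only into subtrees whose
    hashes differ; returns indices of diverging leaves."""
    if level is None:
        level = len(a) - 1
    if a[level][index] == b[level][index]:
        return []        # identical subtree: nothing to transfer
    if level == 0:
        return [index]   # reached a leaf whose key diverges
    diffs = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[level - 1]):
            diffs += diff_leaves(a, b, level - 1, child)
    return diffs
```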
Removal of Failed Nodes
Once a node is declared permanently dead, a node-removal request can be issued to remove it from the consistent hashing ring. Its partitions are reassigned to other nodes.
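As a rough illustration, removing a node from a toy consistent-hash ring looks like the following. With plain consistent hashing the departed node's range is absorbed by the next node clockwise; Dynamo's actual partitioning strategy evolved beyond this simple scheme:

```python
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def remove(self, node):
        """Drop a permanently failed node; its key range is implicitly
        taken over by the next node clockwise on the ring."""
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def owner(self, key):
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]
```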
So in summary, through decentralized failure detection, sloppy quorums, hinted handoff, replica reconciliation and gossip-based membership, Dynamo provides high availability even with unreliable infrastructure. These mechanisms enable seamless handling of failures to maintain 24x7 data access.