Failure handling
Introduction
Dealing with failures is a key challenge in distributed systems like Dynamo. In an infrastructure with thousands of servers, failures are the norm rather than the exception. To provide high availability in such an environment, Dynamo needs efficient failure detection and flexible membership protocols.
Failure detection
In Dynamo, failure detection is completely decentralized. Unlike centralized systems, which rely on a cluster manager to monitor liveness, Dynamo has each node locally detect failures of its peers. How does it work under the hood?
When a node receives a client's read request, it needs to coordinate with the preferred replica nodes to fetch the latest version of the data. It sends requests to each node and waits for their responses. If a replica does not respond within a timeout period, the coordinator treats it as failed and simply moves on to the next replica in the preference list.
The same goes for writes: the coordinator waits for the write quorum before considering the write successful. Unresponsive nodes are marked as failed and excluded from the quorum count. Because failure detection is localized, different coordinators can hold divergent views of which nodes have failed, which actually helps availability in partition scenarios.
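To make this concrete, here is a minimal Python sketch of the coordinator's read loop. The send_read helper, the timeout value, and the quorum size R are illustrative assumptions, and the real coordinator contacts replicas in parallel; this sequential loop just shows the timeout-and-skip behavior.

```python
REQUEST_TIMEOUT = 0.5  # seconds per replica; an assumed value
R = 2                  # read quorum size

def coordinate_read(key, preference_list, send_read):
    """Walk the preference list, locally skipping replicas that
    time out, until the read quorum R is reached."""
    responses = []
    for node in preference_list:
        try:
            # send_read stands in for the RPC to a replica
            responses.append(send_read(node, key, timeout=REQUEST_TIMEOUT))
        except TimeoutError:
            # Treat the node as failed *locally* and move on; no
            # global failure state is consulted or updated here.
            continue
        if len(responses) >= R:
            break
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    return responses
```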
Gossip-based propagation
Timeouts alone are not sufficient to propagate knowledge of crashes. This is where membership gossip comes in. Every node periodically picks another random node to exchange membership lists with.
- If node A talks to node B, they reconcile their list of live members.
- If B has marked node C as dead, A will also update its list to show C as failed after verification.
This gossip allows failures to propagate across the system. Combined with localized timeouts, the coordinator will stop attempting to contact a failed node after receiving gossip about the failure. The gossip also spreads news when a node recovers, allowing recovered nodes to rejoin the active membership list.
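A toy sketch of one gossip round might look like the following. The Membership class, the per-entry version counters, and the exchange RPC are illustrative assumptions, not Dynamo's actual wire protocol:

```python
import random

class Membership:
    """node_id -> (status, version); higher version wins on merge."""
    def __init__(self):
        self.members = {}

    def merge(self, remote_members):
        # Reconcile with a peer's view: the entry carrying the higher
        # version number wins, so newer alive/dead verdicts spread.
        for node_id, (status, version) in remote_members.items():
            local = self.members.get(node_id)
            if local is None or version > local[1]:
                self.members[node_id] = (status, version)

def gossip_round(local_view, peers, exchange):
    """Pick one random peer and swap membership lists with it."""
    peer = random.choice(peers)
    remote_members = exchange(peer, local_view.members)  # RPC stand-in
    local_view.merge(remote_members)
```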
Summary
Dynamo eschews centralized failure monitoring and uses decentralized failure detection driven by timeouts and membership gossip. This architecture results in higher availability, as each coordinator can independently assess the liveness of peers based on partial knowledge. The gossip protocol acts as a feedback mechanism to reconcile and disseminate the latest membership state. Together, they provide a scalable and available way to handle failures.
Handling Transient Failures
When failures are transient, Dynamo needs to keep data available with minimal disruption.
Sloppy Quorums
To prevent temporary node failures from blocking operations, Dynamo uses sloppy quorums. This means reads and writes can proceed even if some replicas are down, as long as a minimum threshold of nodes has responded.
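In the paper's notation, N is the number of replicas per key, W the write quorum, and R the read quorum. Here is a small sketch of the sloppy-quorum rule, with the health check left as an assumed helper:

```python
N = 3  # replicas per key
W = 2  # acks needed for a successful write
R = 2  # replies needed for a successful read

def first_n_healthy(preference_list, is_healthy, n=N):
    """Sloppy quorum: take the first N *healthy* nodes from the
    preference list, which may reach past the canonical first N."""
    return [node for node in preference_list if is_healthy(node)][:n]

def write_succeeded(acks):
    return len(acks) >= W

def read_succeeded(replies):
    return len(replies) >= R
```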
Hinted Handoff
All reads and writes are performed on the first N healthy nodes from the preference list, which might not always be the first N nodes encountered while walking the consistent hashing ring:
- If node A is temporarily down or unreachable during a write, the replica that is supposed to live on A will be sent to node D instead (this flow is sketched after the list).
- The replica stored on node D carries a hint in its metadata indicating which node was the intended recipient.
- Node D keeps such hinted replicas in a separate local database that is scanned periodically.
- Upon detecting that A has recovered, D will attempt to deliver the replica to A.
- Once the transfer succeeds, D deletes the object from its local store.
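Below is a compact sketch of the handoff flow described above. The HintStore interface and the is_alive and deliver callbacks are hypothetical, since the paper does not spell out this API:

```python
class HintStore:
    """Separate local database of replicas held on behalf of others."""
    def __init__(self):
        self.hints = []  # (intended_node, key, value) triples

    def add(self, intended_node, key, value):
        # The hint records which node was the intended recipient.
        self.hints.append((intended_node, key, value))

    def scan_and_deliver(self, is_alive, deliver):
        """Run periodically: hand objects back to recovered nodes and
        drop the local copy once a transfer succeeds."""
        remaining = []
        for intended_node, key, value in self.hints:
            if is_alive(intended_node) and deliver(intended_node, key, value):
                continue  # delivered: forget the hinted replica
            remaining.append((intended_node, key, value))
        self.hints = remaining
```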
Recovering from Permanent Failures
For permanent failures, Dynamo needs to reconstruct the lost replicas to maintain the replication factor.
Replica Synchronization
Replica synchronization is an important mechanism in Dynamo for reconciling divergent replicas caused by failures and ensuring eventual consistency. Here are some key points:
- It is implemented using an anti-entropy protocol that runs periodically in the background.
- Each node maintains a Merkle tree for the key ranges that it is responsible for. The tree roots are exchanged between replicas.
- By comparing the Merkle tree roots, nodes can efficiently detect differences in the key range without transferring the entire data.
- If the roots mismatch, they traverse down the hash tree by exchanging child hashes to identify the specific keys that differ (see the sketch after this list).
- For the actual key(s) that diverge, the replicas exchange the full content and the latest version (based on vector clocks) wins.
- Any deleted keys are also reconciled this way by replicating the deletions.
- Over multiple runs of anti-entropy, all replicas will eventually converge.
- If a replica falls behind significantly, it can request a full bulk transfer from another up-to-date replica to catch up.
- Replica sync is done in the background without affecting foreground traffic. An admission control mechanism regulates its resource usage.
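To illustrate the Merkle-tree comparison, here is a simplified Python sketch. It assumes both replicas build a tree over the same sorted key set; real implementations keep one tree per key range hosted by a virtual node:

```python
import hashlib

def leaf_hash(key, value):
    return hashlib.sha256(f"{key}:{value}".encode()).hexdigest()

def build_tree(items):
    """Build hash levels bottom-up from a sorted list of (key, value)."""
    level = [leaf_hash(k, v) for k, v in items]
    levels = [level]
    while len(level) > 1:
        level = [
            hashlib.sha256("".join(level[i:i + 2]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

def diff_leaves(a, b, level=None, index=0):
    """Walk both trees top-down, descending only into subtrees whose
    hashes differ; returns indices of diverging leaves."""
    if level is None:
        level = len(a) - 1
    if a[level][index] == b[level][index]:
        return []        # identical subtree: nothing to transfer
    if level == 0:
        return [index]   # reached a leaf whose key diverges
    diffs = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[level - 1]):
            diffs += diff_leaves(a, b, level - 1, child)
    return diffs
```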
Removal of Failed Nodes
Once a node is declared permanently dead, a node-removal request can be issued to remove it from the consistent hashing ring. Its partitions are reassigned to other nodes.
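As a rough illustration, removing a node from a toy consistent-hash ring looks like the following. With plain consistent hashing the departed node's range is absorbed by the next node clockwise; Dynamo's actual partitioning strategy evolved beyond this simple scheme:

```python
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def remove(self, node):
        """Drop a permanently failed node; its key range is implicitly
        taken over by the next node clockwise on the ring."""
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def owner(self, key):
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]
```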
So in summary, through decentralized failure detection, sloppy quorums, hinted handoff, replica reconciliation and gossip-based membership, Dynamo provides high availability even with unreliable infrastructure. These mechanisms enable seamless handling of failures to maintain 24x7 data access.