Failure handling

Introduction

Dealing with failures is a key challenge in distributed systems like Dynamo. In an infrastructure with thousands of servers, failures are the norm rather than the exception. To provide high availability in such an environment, Dynamo needs efficient failure detection and flexible membership protocols.

Failure detection

In Dynamo, failure detection is completely decentralized. Unlike centralized systems, which rely on a cluster manager to monitor liveness, Dynamo has each node locally detect failures of its peers. How does it work under the hood?

When a node receives a client's read request, it needs to coordinate with the preferred replica nodes to fetch the latest version of the data. It sends requests to each node and waits for their response. If a replica does not respond within a timeout period, the coordinator treats it as failed and simply moves on to the next replica in the preference list.
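The read path above can be sketched as follows. This is a minimal illustration, not Dynamo's actual code: the `Replica` class and its `get` method raising `TimeoutError` are assumptions standing in for a real RPC layer.

```python
class Replica:
    """Stand-in for a storage node reachable over RPC (hypothetical API)."""

    def __init__(self, name, value=None, alive=True):
        self.name, self.value, self.alive = name, value, alive

    def get(self, key, timeout):
        if not self.alive:
            # A real client would block for `timeout` seconds first.
            raise TimeoutError(f"{self.name} did not reply within {timeout}s")
        return (self.name, self.value)


def coordinate_read(key, preference_list, read_quorum, timeout=0.3):
    """Walk the preference list; a replica that times out is treated as
    failed locally and the coordinator simply moves on to the next one."""
    versions = []
    for replica in preference_list:
        try:
            versions.append(replica.get(key, timeout))
        except TimeoutError:
            continue  # no global failure state is updated here
        if len(versions) >= read_quorum:
            return versions  # enough replies to reconcile and return
    raise RuntimeError("read quorum not met")
```

Note that the timeout decision is purely local: no other coordinator learns about the failure from this code path alone.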

The same goes for writes - the coordinator waits for the write quorum before considering the write successful. Unresponsive nodes are marked as failed and excluded from the quorum count. By keeping failure detection localized, different coordinators can have a divergent view of failures, which actually helps availability in partition scenarios.
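The write side can be sketched the same way, again with an assumed `Replica.put` API: the write succeeds once W acknowledgements arrive, and unresponsive nodes are simply excluded from the count.

```python
class Replica:
    """Stand-in for a storage node (hypothetical API, not Dynamo's)."""

    def __init__(self, alive=True):
        self.alive, self.store = alive, {}

    def put(self, key, value, timeout):
        if not self.alive:
            raise TimeoutError("no acknowledgement within timeout")
        self.store[key] = value


def coordinate_write(key, value, preference_list, write_quorum, timeout=0.3):
    """Count acknowledgements toward W; timed-out nodes don't count."""
    acks = 0
    for replica in preference_list:
        try:
            replica.put(key, value, timeout)
            acks += 1
        except TimeoutError:
            continue  # marked failed locally, excluded from the quorum
        if acks >= write_quorum:
            return True  # quorum reached; remaining replicas are best-effort
    return False
```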

Gossip-based propagation

Timeouts alone are not sufficient to propagate knowledge of crashes. This is where membership gossip comes in. Every node periodically picks a random peer and exchanges membership lists with it.

This gossip allows failures to propagate across the system. Combined with localized timeouts, the coordinator will stop attempting to contact a failed node after receiving gossip about the failure. The gossip also spreads news when a node recovers, allowing recovered nodes to rejoin the active membership list.
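A single gossip exchange can be sketched as a merge of two membership views. The version numbers below are an illustrative stand-in for whatever logical clock or timestamp a real implementation attaches to each membership entry.

```python
import random


def merge(view, other):
    """Adopt any entry from `other` that carries a higher version number."""
    for member, (version, status) in other.items():
        if member not in view or view[member][0] < version:
            view[member] = (version, status)


def gossip_round(views):
    """Each node exchanges its membership view with one random peer,
    so news of failures and recoveries spreads epidemically."""
    names = list(views)
    for name in names:
        peer = random.choice([n for n in names if n != name])
        merge(views[name], views[peer])
        merge(views[peer], views[name])
```

After enough rounds, every view converges on the newest status known for each member, which is how a coordinator eventually stops contacting a node it has never personally timed out on.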

Summary

Dynamo eschews centralized failure monitoring and uses decentralized failure detection driven by timeouts and membership gossip. This architecture results in higher availability as each coordinator can independently assess liveness of peers based on partial knowledge. The gossip protocol acts as a feedback mechanism to reconcile and disseminate the latest membership state. Together, they provide a scalable and available way to handle failures.

Handling Failures

When failures are transient, Dynamo aims to keep data available with minimal disruption.

Sloppy Quorums

To prevent temporary node failures from blocking operations, Dynamo uses sloppy quorums. This means reads and writes can proceed even if some replicas are down, as long as a minimum number of nodes (the read quorum R or write quorum W) respond.
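The key move in a sloppy quorum is replica selection: rather than insisting on the first N ring positions, the coordinator takes the first N healthy nodes, walking past any that are marked down. A minimal sketch, where `healthy` is an assumed local predicate (e.g. derived from recent timeouts):

```python
def select_replicas(ring, n, healthy):
    """Return the first n healthy nodes encountered walking the ring.
    `ring` is the node sequence in clockwise order; nodes that fail the
    `healthy` check are skipped rather than blocking the operation."""
    chosen = []
    for node in ring:
        if healthy(node):
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen
```

With node A down, a write whose home range starts at A lands on the next healthy nodes instead, which is exactly the situation hinted handoff deals with below.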

Hinted Handoff

All reads and writes are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.

Example

[Figure: nodes A through F on the consistent hashing ring. Key k cannot be inserted at A because A is down, so k is written to D along with hint metadata; once A is detected back online, k is transferred from D to A.]
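The scenario in the figure can be sketched as a toy hinted handoff: A is down, so key k is written to D with a hint naming A; when A recovers, D hands the data back and drops its local copy. The store layout and function names are illustrative only.

```python
def write_with_hint(stores, key, value, home, substitute):
    """Write to `substitute`, recording which node the data belongs to.
    Hinted data is kept in a separate local store, not the primary one."""
    stores[substitute]["hinted"][key] = (value, home)


def handoff(stores, recovered):
    """Deliver hinted data back to a node that has come back online,
    then delete the temporary copy from the substitute node."""
    for node, store in stores.items():
        for key, (value, home) in list(store["hinted"].items()):
            if home == recovered:
                stores[recovered]["primary"][key] = value
                del store["hinted"][key]
```

Usage mirrors the figure: `write_with_hint(stores, "k", "v", home="A", substitute="D")` while A is down, then `handoff(stores, "A")` once A's recovery is detected.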

Question

Does this handoff carry over the key's vector clock along with the data?

Recovering from Permanent Failures

For permanent failures, Dynamo needs to reconstruct the lost replicas to maintain the replication factor.

Replica Synchronization

Replica synchronization is Dynamo's mechanism for reconciling divergent replicas caused by failures and ensuring eventual consistency. Dynamo runs an anti-entropy protocol based on Merkle trees: each node maintains a Merkle tree per key range, and two replicas compare trees top-down so that only keys that actually differ need to be transferred.

Removal of Failed Nodes

Once a node is declared permanently dead, a node remove request can be issued to remove it from the consistent hashing ring. Its partitions are reassigned to other nodes.
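What "reassigned to other nodes" means falls out of consistent hashing itself: once the dead node's tokens are deleted from the ring, lookups for its key range fall through to the next node clockwise. A toy ring with one token per node (Dynamo actually places many virtual nodes per physical node):

```python
import bisect
import hashlib


def token(name):
    """Map a name to a deterministic ring position (toy 0..999 space)."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % 1000


class Ring:
    def __init__(self, nodes):
        self.entries = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        """First node at or after the key's position, wrapping around."""
        tokens = [t for t, _ in self.entries]
        i = bisect.bisect_right(tokens, token(key)) % len(self.entries)
        return self.entries[i][1]

    def remove(self, name):
        """Node-remove: delete the node's tokens; its range implicitly
        transfers to the next node clockwise on the ring."""
        self.entries = [(t, n) for t, n in self.entries if n != name]
```

After `ring.remove(dead_node)`, a key previously owned by the dead node resolves to its clockwise successor, which is the replica that must be rebuilt to restore the replication factor.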

So in summary, through decentralized failure detection, sloppy quorums, hinted handoff, replica reconciliation and gossip-based membership, Dynamo provides high availability even with unreliable infrastructure. These mechanisms enable seamless handling of failures to maintain 24x7 data access.