Distributed consensus

Why it matters

"Several machines agreeing on a single answer despite some of them failing" is the hard problem at the heart of every fault-tolerant system. The algorithm of choice - Raft - appears in three different language paths' capstones. Reading them side by side teaches the algorithm; reading them across languages teaches the engineering trade-offs.


The problem

A bank account replicated across 3 servers. A transaction arrives ("debit $50"). All 3 must agree on whether it succeeded before the user sees the confirmation. If one server crashes, the others must continue. If a network partition splits them, only the majority can make progress (otherwise two halves both commit conflicting transactions).

The constraints are subtle. The naive approach - broadcast the write and wait for ACKs - works until a server dies mid-write; recovery becomes "did the write apply on the dead one?" with no clear answer.

Consensus is the formal name for "get a quorum of replicas to agree on a value." The classical algorithms: Paxos (Lamport, 1989), Raft (Ongaro, 2014), Viewstamped Replication, Zab (ZooKeeper).

This page focuses on Raft because it is the algorithm behind nearly every modern system you'll encounter, and the one the three capstones implement.


The lens, per path

Go - the canonical implementations live here

Capstone: Raft-replicated KV store. The most widely used Raft implementations are written in Go: HashiCorp's raft, used in Consul and Nomad, and etcd's raft library, used in Kubernetes' etcd and as a building block elsewhere. (TiKV's raft-rs, covered under Rust below, began as a port of etcd's Go implementation.)

What's unique: Go's standard library + goroutines map cleanly to Raft's structure - each replica is a state machine driven by a single goroutine reading from a channel. Most distributed-systems tutorials use Go because of this fit.
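
A minimal sketch of that shape, with hypothetical types rather than any real library's API - every input lands on one channel, and a single goroutine applies it to the replica's state:

```go
// Package raftsketch: an illustrative sketch of the "one goroutine per
// replica" structure. All names here are hypothetical.
package raftsketch

import (
	"math/rand"
	"time"
)

type Message struct {
	From int
	Term uint64
	Data []byte
}

type Replica struct {
	id    int
	term  uint64
	inbox chan Message // every input (votes, log entries, heartbeats) arrives here
}

// run is the single goroutine that owns all of the replica's mutable state.
// Because nothing else touches that state, no locks are needed.
func (r *Replica) run() {
	for {
		select {
		case msg := <-r.inbox:
			r.step(msg) // apply one message to the state machine
		case <-time.After(randomElectionTimeout()):
			r.startElection() // no traffic within the timeout: stand for election
		}
	}
}

// randomElectionTimeout spreads timeouts out so replicas rarely tie in elections.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func (r *Replica) step(msg Message) {} // handle votes, append-entries, ...
func (r *Replica) startElection()   {} // increment term, request votes, ...
```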

Java - Apache Ratis, Atomix

Capstone: Raft KV store on virtual threads. Apache Ratis is the production-grade implementation; Atomix is older. Both predate Loom - designed around the reactive / CompletableFuture async model.

What's unique: Loom's virtual threads (Month 4) make Raft's "many concurrent state machines" cheap again. Pre-Loom Java Raft implementations had to be reactive or thread-pooled; post-Loom you can write thread-per-replica code without scaling concerns.

Rust - tikv/raft-rs, openraft

Capstone: Raft KV store. TiKV's raft-rs is the production reference (used in TiDB). openraft is a newer, more ergonomic implementation built around async/await (typically run on Tokio); raft-rs leaves I/O, storage, and the driving loop to the caller.

What's unique: the type system enforces some Raft invariants. State-machine transitions become exhaustive enum matches, and ownership rules out data races in safe code at compile time.

Python - pysyncobj, etcd client patterns

Python doesn't get a Raft capstone in any of the paths. It's mentioned here because production Python systems typically consume a Raft cluster (etcd, Consul) through a client rather than implementing their own; pysyncobj exists if you want an in-process Python implementation. Worth knowing: etcd is the canonical Raft cluster you'd use from Python.


What Raft actually does

Three states per server: leader, follower, candidate.

  • Followers receive log entries from the leader and ACK.
  • A leader is elected via a randomized timeout: when no leader heartbeat is seen, a follower becomes a candidate, requests votes, becomes leader if it gets a majority.
  • Once leader, it accepts client writes, replicates each to followers, and considers a write "committed" once a majority (including itself) has it (the commit rule is sketched right after this list).
  • Reads can be served from the leader (linearizable) or from followers (with caveats about staleness).
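
A hedged sketch of that commit rule, with illustrative names rather than any real library's API. Here matchIndex[i] is the highest log index known to be stored on replica i, and the leader includes its own last index in the slice:

```go
package raftsketch

import "sort"

// commitIndex returns the highest log index present on a majority of
// replicas. (Real Raft adds a caveat: a leader only advances the commit
// index for entries from its own current term.)
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// Sorted descending, the value at the majority position is an index that
	// at least len/2+1 replicas have reached.
	return sorted[len(sorted)/2]
}
```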

The state-machine transitions are clean:

Follower → Candidate (election timeout)
Candidate → Leader   (majority votes)
Candidate → Follower (saw a current leader, or any message with a newer term)
Leader → Follower    (saw any message with a newer term)

That's the algorithm in eight lines. The complexity is in the edge cases: log compaction, snapshotting, membership changes, network partitions.
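
As a sketch, those four arrows fit in one step function. The types and names below are illustrative; a real implementation also persists the term and checks log up-to-dateness before granting votes:

```go
package raftsketch

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Event int

const (
	ElectionTimeout Event = iota // no leader heartbeat within the timeout
	WonMajority                  // collected votes from a majority
	SawNewerTerm                 // any message with a newer term (a candidate also steps down on seeing a current leader)
)

// transition implements exactly the four arrows listed above.
func transition(s State, e Event) State {
	switch {
	case s == Follower && e == ElectionTimeout:
		return Candidate
	case s == Candidate && e == WonMajority:
		return Leader
	case e == SawNewerTerm:
		return Follower // candidates and leaders both step down
	}
	return s // everything else leaves the state unchanged
}
```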

Why understanding Raft matters beyond Raft

Even if you never implement Raft, most of the distributed systems you operate run it (or something shaped like it):

  • etcd (Kubernetes' configuration store) - Raft.
  • Consul, Nomad (HashiCorp) - Raft.
  • CockroachDB, TiKV/TiDB, YugabyteDB - Raft.
  • Apache ZooKeeper - Zab (Raft's predecessor, similar shape).

When these break, the failure modes are Raft failure modes: split votes, log divergence, quorum loss, slow follower. Reading Raft's paper once makes the alerts from any of these systems readable.

Read the paper

Diego Ongaro's Raft paper (2014) is short (~18 pages) and was explicitly written for understandability. Read it. Then read one of the implementations. Then read the paper again - the second reading is where the algorithm sticks.


The contrasts that teach across implementations

| Aspect | HashiCorp raft (Go) | Apache Ratis (Java) | tikv/raft-rs (Rust) |
| --- | --- | --- | --- |
| Concurrency model | one goroutine per replica state machine | thread pool + futures (pre-Loom); migration in progress | state machine driven by the caller; TiKV runs it on its own thread pools |
| Storage | pluggable; BoltDB common | pluggable; RocksDB common | RocksDB |
| Snapshot strategy | application-driven via callback | pluggable | pluggable |
| Membership changes | one server at a time (AddVoter / RemoveServer) | joint consensus | joint consensus |
| Used by | Consul, Nomad, Vault | Hadoop Ozone, IoTDB | TiDB / TiKV |

The Go version is the smallest and easiest to read for learning. The Rust version is the most production-tested at scale.


What to read first

  • You want to learn distributed-systems consensus from first principles → Ongaro's Raft paper. Then watch the "Designing for Understandability" Raft talk on YouTube.
  • You want to read a working implementation → HashiCorp's raft (Go). ~10k LOC, well-structured. The single most-readable Raft.
  • You want to ship a Raft-backed service → use etcd (as a cluster) and a client in your preferred language; don't roll your own. A minimal Go client sketch follows this list.
  • You want to implement Raft as a learning exercise → any of the three capstone tracks in this site's language paths. Pick the one whose language you're most comfortable in; the algorithm is the same.
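
For the "use etcd, don't roll your own" route, a minimal sketch with etcd's Go client (go.etcd.io/etcd/client/v3). It assumes an etcd endpoint at localhost:2379 and keeps error handling to the bare minimum:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The write is acknowledged only after etcd's Raft leader has replicated
	// it to a quorum - the commit rule described earlier, done for you.
	if _, err := cli.Put(ctx, "accounts/alice", "950"); err != nil {
		log.Fatal(err)
	}

	resp, err := cli.Get(ctx, "accounts/alice")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```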