# Week 1 - etcd and the Raft Consensus Foundation
## 1.1 Conceptual Core
- etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads via the API server, never etcd directly.
- Raft (the consensus protocol) gives etcd linearizable reads and writes, tolerance of a minority of failed members, and bounded recovery time after a node failure.
- A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes (never even numbers); back them up; monitor them.
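The 3-or-5 rule falls straight out of Raft's quorum arithmetic, which can be sketched in a few lines (illustrative Python, not etcd code):

```python
def quorum(n: int) -> int:
    """Raft needs a majority of the n members to commit a write."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster keeps accepting writes."""
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

A 4-member cluster tolerates the same single failure as a 3-member cluster while adding one more machine that can fail, which is why even sizes are discouraged.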
## 1.2 Mechanical Detail
- Read the Raft paper. Then read `etcd-io/raft/raft.go` end to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition.
- etcd's data model: a flat keyspace, `mvcc` revisions, watch streams. Kubernetes uses keys like `/registry/pods/<namespace>/<name>` and stores protobuf-encoded objects.
- Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream.
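The flat keyspace with cluster-wide revisions can be mimicked with a toy store (a hypothetical sketch; real etcd keeps a b-tree key index over a bbolt backend):

```python
class ToyMvccStore:
    """Toy model of etcd's flat keyspace: every write bumps a single
    cluster-wide revision, and history is kept per key."""

    def __init__(self):
        self.revision = 0
        self.history = {}  # key -> list of (revision, value)

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, at_revision=None):
        """Latest value, or the value as of a historical revision."""
        versions = self.history.get(key, [])
        if at_revision is None:
            return versions[-1][1] if versions else None
        for rev, value in reversed(versions):
            if rev <= at_revision:
                return value
        return None

store = ToyMvccStore()
store.put("/registry/pods/default/web", b"pod-spec-v1")
store.put("/registry/pods/default/web", b"pod-spec-v2")
# A read pinned at an old revision sees the old value:
print(store.get("/registry/pods/default/web", at_revision=1))  # b'pod-spec-v1'
```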
- Performance characteristics:
  - Write latency = network RTT × 2 (client → leader, leader → quorum and back) + fsync.
  - Read latency = a local read on the leader (or on any member with serializable reads enabled).
  - Throughput is bound by the leader's fsync rate. SSDs help dramatically.
- Compaction and defragmentation are required operations; without them etcd's revision history and backend file grow unbounded.
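Those characteristics can be turned into a rough back-of-the-envelope model (illustrative numbers and a simplified formula, not measurements of a real cluster):

```python
def write_latency_ms(rtt_ms, fsync_ms):
    """Rough model from the notes above: two network round trips
    (client -> leader, leader -> quorum) plus one fsync."""
    return rtt_ms * 2 + fsync_ms

def max_write_throughput(fsync_ms, batch_size=1):
    """The leader's fsync rate caps write throughput; etcd batches
    multiple proposals per fsync, so batching raises the ceiling."""
    fsyncs_per_sec = 1000.0 / fsync_ms
    return fsyncs_per_sec * batch_size

# SSD (~1 ms fsync) vs spinning disk (~10 ms fsync), unbatched:
print(write_latency_ms(0.5, 1.0))   # 2.0 ms
print(max_write_throughput(1.0))    # 1000 writes/s
print(max_write_throughput(10.0))   # 100 writes/s
```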
## 1.3 Lab: "etcd, Up Close"
- Bring up a 3-node etcd cluster locally (`etcd` binaries, no Kubernetes yet). Configure peer/client URLs.
- Use `etcdctl` to put/get keys; observe consistent reads.
- Kill the leader. Use `etcdctl endpoint status --cluster` to identify the new leader within seconds.
- Use `etcdctl watch /foo` from one terminal; put values from another. Internalize the watch model.
- Use `etcdctl compaction <rev>` followed by `etcdctl --command-timeout=60s defrag` to compact and defragment. Observe the disk-usage drop.
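The watch model from the lab can be simulated in miniature (a hypothetical sketch: an append-only event log that watchers replay from a start revision, which is the mechanism the API server builds its caches on):

```python
class ToyWatchLog:
    """Toy watch model: an append-only event log keyed by revision;
    a watcher asks for everything at or after a start revision."""

    def __init__(self):
        self.events = []   # list of (revision, key, value)
        self.revision = 0

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, key, value))

    def watch(self, prefix, start_revision=1):
        """Yield events under a key prefix from start_revision onward."""
        for rev, key, value in self.events:
            if rev >= start_revision and key.startswith(prefix):
                yield rev, key, value

log = ToyWatchLog()
log.put("/foo/a", "1")
log.put("/bar/x", "9")
log.put("/foo/b", "2")
# A watcher that connects late still replays history from revision 2:
print(list(log.watch("/foo", start_revision=2)))  # [(3, '/foo/b', '2')]
```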
## 1.4 Hardening Drill
- Configure mTLS between etcd peers and between clients and etcd. Enable auth (`etcdctl role` / `etcdctl user` commands).
- Take a snapshot via `etcdctl snapshot save`. Restore it to a new cluster. Verify integrity.
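One generic way to verify snapshot integrity across the save/copy/restore pipeline is to compare content hashes of the file at each hop (a sketch with hypothetical paths; `etcdctl snapshot status` also reports the snapshot's own hash):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large snapshots don't need RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hash the snapshot at save time, ship the hash alongside the file,
# and compare on the restore host before restoring (paths hypothetical):
# saved = sha256_of_file("backup/etcd-snap.db")
# assert sha256_of_file("/restore/etcd-snap.db") == saved
```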
## 1.5 Operations Slice
- Wire etcd metrics to Prometheus: `etcd_server_has_leader`, `etcd_disk_wal_fsync_duration_seconds`, `etcd_mvcc_db_total_size_in_bytes`.
- Alert on: absent leader, fsync p99 > 100ms, and db size approaching `quota-backend-bytes`.
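Those alert conditions can be stated precisely as a pure function (thresholds mirror the notes above; the 80% quota ratio is an assumed example, not an etcd default, and a real deployment would encode these as Prometheus alerting rules):

```python
def etcd_alerts(has_leader, fsync_p99_seconds, db_size_bytes,
                quota_backend_bytes, db_usage_alert_ratio=0.8):
    """Return the alert names that should fire for one scrape.

    Thresholds: no leader, WAL fsync p99 over 100 ms, and db size
    within 80% of the backend quota (80% is an assumed example)."""
    alerts = []
    if not has_leader:
        alerts.append("EtcdNoLeader")
    if fsync_p99_seconds > 0.100:
        alerts.append("EtcdSlowFsync")
    if db_size_bytes >= db_usage_alert_ratio * quota_backend_bytes:
        alerts.append("EtcdDbNearQuota")
    return alerts

# Healthy member: fast disk, small db.
print(etcd_alerts(True, 0.004, 1 << 30, 8 << 30))   # []
# Slow disk and a db at 7 GiB of an 8 GiB quota:
print(etcd_alerts(True, 0.250, 7 << 30, 8 << 30))
```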