# Week 1 - etcd and the Raft Consensus Foundation
## 1.1 Conceptual Core
- etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads via the API server, never etcd directly.
- Raft (the consensus protocol) gives etcd linearizable reads and writes, tolerance of a minority of failed members, and bounded recovery time after a node failure.
- A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes (never even numbers); back them up; monitor them.
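The 3-or-5 rule falls straight out of Raft's quorum arithmetic, which can be sketched in a few lines (illustrative Python, not etcd code):

```python
def quorum(n: int) -> int:
    """Raft needs a majority of the n members to commit a write."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster keeps accepting writes."""
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

A 4-member cluster tolerates the same single failure as a 3-member cluster while adding one more machine that can fail, which is why even sizes are discouraged.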
## 1.2 Mechanical Detail
- Read the Raft paper. Then read `etcd-io/raft/raft.go` end to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition.
- etcd's data model: a flat keyspace, `mvcc` revisions, watch streams. Kubernetes uses keys like `/registry/pods/<namespace>/<name>` and stores protobuf-encoded objects.
- Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream.
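The flat keyspace with cluster-wide revisions can be mimicked with a toy store (a hypothetical sketch; real etcd keeps a b-tree key index over a bbolt backend):

```python
class ToyMvccStore:
    """Toy model of etcd's flat keyspace: every write bumps a single
    cluster-wide revision, and history is kept per key."""

    def __init__(self):
        self.revision = 0
        self.history = {}  # key -> list of (revision, value)

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key, at_revision=None):
        """Latest value, or the value as of a historical revision."""
        versions = self.history.get(key, [])
        if at_revision is None:
            return versions[-1][1] if versions else None
        for rev, value in reversed(versions):
            if rev <= at_revision:
                return value
        return None

store = ToyMvccStore()
store.put("/registry/pods/default/web", b"pod-spec-v1")
store.put("/registry/pods/default/web", b"pod-spec-v2")
# A read pinned at an old revision sees the old value:
print(store.get("/registry/pods/default/web", at_revision=1))  # b'pod-spec-v1'
```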
- Performance characteristics:
  - Write latency = network RTT × 2 (client → leader, leader → quorum and back) + fsync.
  - Read latency = a local read on the leader (or on any member with serializable reads enabled).
  - Throughput is bound by the leader's fsync rate. SSDs help dramatically.
- Compaction and defragmentation are required operations; without them etcd's revision history and backend file grow unbounded.
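Those characteristics can be turned into a rough back-of-the-envelope model (illustrative numbers and a simplified formula, not measurements of a real cluster):

```python
def write_latency_ms(rtt_ms, fsync_ms):
    """Rough model from the notes above: two network round trips
    (client -> leader, leader -> quorum) plus one fsync."""
    return rtt_ms * 2 + fsync_ms

def max_write_throughput(fsync_ms, batch_size=1):
    """The leader's fsync rate caps write throughput; etcd batches
    multiple proposals per fsync, so batching raises the ceiling."""
    fsyncs_per_sec = 1000.0 / fsync_ms
    return fsyncs_per_sec * batch_size

# SSD (~1 ms fsync) vs spinning disk (~10 ms fsync), unbatched:
print(write_latency_ms(0.5, 1.0))   # 2.0 ms
print(max_write_throughput(1.0))    # 1000 writes/s
print(max_write_throughput(10.0))   # 100 writes/s
```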
## 1.3 Lab: "etcd, Up Close"
- Bring up a 3-node etcd cluster locally (`etcd` binaries, no Kubernetes yet). Configure peer/client URLs.
- Use `etcdctl` to put/get keys; observe consistent reads.
- Kill the leader. Use `etcdctl endpoint status --cluster` to identify the new leader within seconds.
- Use `etcdctl watch /foo` from one terminal; put values from another. Internalize the watch model.
- Use `etcdctl compaction <rev>` followed by `etcdctl --command-timeout=60s defrag` to compact and defragment. Observe the disk-usage drop.
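The watch model from the lab can be simulated in miniature (a hypothetical sketch: an append-only event log that watchers replay from a start revision, which is the mechanism the API server builds its caches on):

```python
class ToyWatchLog:
    """Toy watch model: an append-only event log keyed by revision;
    a watcher asks for everything at or after a start revision."""

    def __init__(self):
        self.events = []   # list of (revision, key, value)
        self.revision = 0

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, key, value))

    def watch(self, prefix, start_revision=1):
        """Yield events under a key prefix from start_revision onward."""
        for rev, key, value in self.events:
            if rev >= start_revision and key.startswith(prefix):
                yield rev, key, value

log = ToyWatchLog()
log.put("/foo/a", "1")
log.put("/bar/x", "9")
log.put("/foo/b", "2")
# A watcher that connects late still replays history from revision 2:
print(list(log.watch("/foo", start_revision=2)))  # [(3, '/foo/b', '2')]
```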
## 1.4 Hardening Drill
- Configure mTLS between etcd peers and between clients and etcd. Enable auth (`etcdctl role` / `etcdctl user` commands).
- Take a snapshot via `etcdctl snapshot save`. Restore it to a new cluster. Verify integrity.
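One generic way to verify snapshot integrity across the save/copy/restore pipeline is to compare content hashes of the file at each hop (a sketch with hypothetical paths; `etcdctl snapshot status` also reports the snapshot's own hash):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large snapshots don't need RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hash the snapshot at save time, ship the hash alongside the file,
# and compare on the restore host before restoring (paths hypothetical):
# saved = sha256_of_file("backup/etcd-snap.db")
# assert sha256_of_file("/restore/etcd-snap.db") == saved
```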
## 1.5 Operations Slice
- Wire etcd metrics to Prometheus: `etcd_server_has_leader`, `etcd_disk_wal_fsync_duration_seconds`, `etcd_mvcc_db_total_size_in_bytes`.
- Alert on: absent leader, fsync p99 > 100ms, and db size approaching `quota-backend-bytes`.
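Those alert conditions can be stated precisely as a pure function (thresholds mirror the notes above; the 80% quota ratio is an assumed example, not an etcd default, and a real deployment would encode these as Prometheus alerting rules):

```python
def etcd_alerts(has_leader, fsync_p99_seconds, db_size_bytes,
                quota_backend_bytes, db_usage_alert_ratio=0.8):
    """Return the alert names that should fire for one scrape.

    Thresholds: no leader, WAL fsync p99 over 100 ms, and db size
    within 80% of the backend quota (80% is an assumed example)."""
    alerts = []
    if not has_leader:
        alerts.append("EtcdNoLeader")
    if fsync_p99_seconds > 0.100:
        alerts.append("EtcdSlowFsync")
    if db_size_bytes >= db_usage_alert_ratio * quota_backend_bytes:
        alerts.append("EtcdDbNearQuota")
    return alerts

# Healthy member: fast disk, small db.
print(etcd_alerts(True, 0.004, 1 << 30, 8 << 30))   # []
# Slow disk and a db at 7 GiB of an 8 GiB quota:
print(etcd_alerts(True, 0.250, 7 << 30, 8 << 30))
```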