
Week 1 - etcd and the Raft Consensus Foundation

1.1 Conceptual Core

  • etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads and writes via the API server, never etcd directly.
  • Raft (the consensus protocol) gives etcd: linearizable writes, quorum-backed reads that tolerate minority failures, and bounded recovery after a node failure (a new leader is elected within the election timeout).
  • A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes, never an even number: quorum for 4 members is 3, so 4 nodes tolerate only one failure, exactly like 3, while adding one more machine that can fail. Back them up; monitor them. (The sketch below shows what losing quorum looks like.)
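
A quick way to internalize the quorum rule, assuming the 3-node lab cluster from 1.3 below is running on localhost (the endpoint and key name are illustrative):

    # All three members up: writes commit.
    etcdctl --endpoints=http://127.0.0.1:2379 put /demo ok

    # Stop any two members (Ctrl-C their terminals). The lone survivor is a
    # minority, so linearizable operations block until they time out:
    etcdctl --endpoints=http://127.0.0.1:2379 --command-timeout=5s put /demo ko
    # -> Error: context deadline exceeded

    # Restart either stopped member: 2 of 3 is a majority again; writes resume.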

1.2 Mechanical Detail

  • Read the Raft paper. Then read etcd-io/raft/raft.go end to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition.
  • etcd's data model: a flat keyspace, MVCC revisions, watch streams. Kubernetes uses keys like /registry/pods/<namespace>/<name> and stores protobuf-encoded objects.
  • Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream. (A revision-and-watch sketch follows this list.)
  • Performance characteristics:
      • Write latency ≈ RTT(client → leader) + RTT(leader → quorum) + fsync: followers persist the entry before acknowledging.
      • Read latency: linearizable reads are served by the leader after a quorum heartbeat check (ReadIndex); serializable reads are local on any member, at the cost of possible staleness.
      • Throughput is bounded by the leader's fsync rate. SSDs help dramatically.
  • Compaction and defragmentation are required operations: compaction prunes old MVCC revisions, and defragmentation returns the freed pages to the OS. Skip them and the database grows without bound.
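
A minimal sketch of the revision and watch model against a local etcd; the key /foo is arbitrary and the jq dependency is an assumption (revisions also appear in plain -w json output):

    REV=$(etcdctl put /foo a -w json | jq -r .header.revision)  # store revision after this write
    etcdctl put /foo b                # bumps the store revision again
    etcdctl get /foo -w json          # note create_revision, mod_revision, version
    etcdctl get /foo --rev=$REV       # point-in-time read: returns "a"
    etcdctl watch /foo --rev=$REV     # replays events since $REV, then streams live ones

    # Pointed at a real cluster's etcd (client certs required), the Kubernetes
    # keyspace is visible directly:
    etcdctl get /registry/pods --prefix --keys-only | head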

1.3 Lab-"etcd, Up Close"

  1. Bring up a 3-node etcd cluster locally (etcd binaries, no Kubernetes yet). Configure peer/client URLs. (A condensed transcript of all five steps follows this list.)
  2. Use etcdctl to put/get keys; observe consistent reads.
  3. Kill the leader. Use etcdctl endpoint status --cluster to identify the new leader within seconds.
  4. Use etcdctl watch /foo from one terminal; put values from another. Internalize the watch model.
  5. Compact to the current revision with etcdctl compact <rev>, then run etcdctl --command-timeout=60s defrag. Only the defrag step shrinks the file on disk; observe the disk-usage drop.
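
A condensed transcript of the five steps, assuming three local members; names, ports, data dirs, and the jq dependency are arbitrary choices:

    # Step 1: one terminal per member (n1 shown; n2/n3 differ only in
    # --name, ports, and --data-dir).
    CLUSTER="n1=http://127.0.0.1:2380,n2=http://127.0.0.1:12380,n3=http://127.0.0.1:22380"
    etcd --name n1 --data-dir /tmp/etcd-n1 \
      --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 \
      --listen-peer-urls http://127.0.0.1:2380 --initial-advertise-peer-urls http://127.0.0.1:2380 \
      --initial-cluster "$CLUSTER" --initial-cluster-token lab --initial-cluster-state new

    # Steps 2-3: consistent put/get, then find the leader before and after
    # killing its process.
    export ETCDCTL_ENDPOINTS=http://127.0.0.1:2379,http://127.0.0.1:12379,http://127.0.0.1:22379
    etcdctl put /foo bar && etcdctl get /foo
    etcdctl endpoint status --cluster -w table    # IS LEADER column

    # Step 4: watch here, put from another terminal.
    etcdctl watch /foo

    # Step 5: compact to the current revision, then defragment every member.
    REV=$(etcdctl endpoint status -w json | jq -r '.[0].Status.header.revision')
    etcdctl compact "$REV"
    etcdctl --command-timeout=60s defrag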

1.4 Hardening Drill

  • Configure mTLS between etcd peers and between clients and etcd. Enable authentication: create a root user and role, then run etcdctl auth enable.
  • Take a snapshot via etcdctl snapshot save, restore it into a new cluster, and verify integrity (a command sketch follows).
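
A sketch of the backup/restore loop; file paths and cert names are illustrative, and on etcd >= 3.5 the restore subcommand lives in the separate etcdutl binary:

    # Snapshot from ONE endpoint (snapshots are per-member, not --cluster):
    etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert ca.crt --cert client.crt --key client.key \
      snapshot save /tmp/backup.db
    etcdutl snapshot status /tmp/backup.db -w table   # hash, revision, total keys, size

    # Restore materializes a fresh data dir (new cluster ID, same data):
    etcdutl snapshot restore /tmp/backup.db \
      --name n1 --data-dir /tmp/etcd-restored \
      --initial-cluster n1=http://127.0.0.1:2380 \
      --initial-advertise-peer-urls http://127.0.0.1:2380

    # Start etcd with --data-dir /tmp/etcd-restored, then verify a known
    # key range with etcdctl get --prefix.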

1.5 Operations Slice

  • Wire etcd metrics to Prometheus: etcd_server_has_leader, etcd_disk_wal_fsync_duration_seconds, etcd_mvcc_db_total_size_in_bytes. Alert on an absent leader, fsync p99 > 100 ms, and database size approaching --quota-backend-bytes. (A spot check of the raw endpoint follows.)
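
Before wiring Prometheus, a spot check of the raw endpoint confirms the metrics are being exported (port and path are etcd defaults; the commented PromQL sketches the fsync alert):

    curl -s http://127.0.0.1:2379/metrics | \
      grep -E '^etcd_server_has_leader|^etcd_mvcc_db_total_size_in_bytes'

    # Prometheus alert expression for the fsync p99:
    #   histogram_quantile(0.99,
    #     rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.1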
