Kubernetes¶
Control plane, kubelet/CRI, controllers, networking, day-2.
Printing this page
Use your browser's Print → Save as PDF. The print stylesheet hides navigation, comments, and other site chrome; pages break cleanly at section boundaries; advanced content stays included regardless of beginner-mode state.
Kubernetes Platform Engineering-A 24-Week Mastery Roadmap¶
Authoring lens: Principal Platform Engineer / Kubernetes Maintainer.
Target outcome: A graduate of this curriculum is capable of (a) building and operating a hardened Kubernetes cluster from scratch on bare metal or any cloud, (b) extending the control plane via custom controllers/operators built with controller-runtime or client-go, and (c) contributing patches to kubernetes/kubernetes or one of its core ecosystems (Cilium, Istio, ArgoCD, Crossplane).
This is not "kubectl in 24 weeks." It assumes the reader has used Kubernetes (deployed an app, run kubectl), understands containers (see the CONTAINER_INTERNALS_PLAN curriculum if not), and is ready to read kubernetes/kubernetes source as primary literature.
Repository Layout¶
| File | Purpose |
|---|---|
00_PRELUDE_AND_PHILOSOPHY.md |
What Kubernetes is, what it isn't, the design ethics, reading list. |
01_MONTH_CONTROL_PLANE.md |
Weeks 1–4. etcd & Raft, kube-apiserver, scheduler, controllers. |
02_MONTH_KUBELET_AND_CRI.md |
Weeks 5–8. kubelet, CRI, kube-proxy, CSI, device plugins. |
03_MONTH_CONTROLLERS_AND_OPERATORS.md |
Weeks 9–12. client-go, controller-runtime, CRDs, the operator pattern. |
04_MONTH_NETWORKING_AND_STORAGE.md |
Weeks 13–16. CNI, Cilium/eBPF, service meshes, CSI, dynamic provisioning. |
05_MONTH_PLATFORM_AND_DAY2.md |
Weeks 17–20. GitOps (Argo/Flux), IaC (Crossplane), HPA/VPA, admission, OPA. |
06_MONTH_HARD_WAY_CAPSTONE.md |
Weeks 21–24. K8s the Hard Way; multi-tenancy; mTLS; capstone. |
APPENDIX_A_HARDENING.md |
CIS, Pod Security, network policy, RBAC, audit. |
APPENDIX_B_TROUBLESHOOTING.md |
Reference flows: pod-pending, node-notready, etcd-degraded, etc. |
APPENDIX_C_CONTRIBUTING.md |
Contributing to k8s.io: SIGs, KEPs, first-PR playbook. |
CAPSTONE_PROJECTS.md |
Three tracks: hard-way bare-metal cluster, GitOps platform, operator from scratch. |
How Each Week Is Structured¶
- Conceptual Core-the why, with a mental model.
- Mechanical Detail-the how, with kubernetes/kubernetes source pointers.
- Lab-a hands-on exercise using a real cluster (kind, k3s, or hard-way).
- Hardening Drill-a security/compliance micro-task.
- Operations Slice-a Day-2-ops micro-task: monitoring, scaling, recovery.
Each week is sized for ~12–16 focused hours. Almost every lab requires a working cluster-invest early in a smooth local cluster setup (kind or k3d for dev; a 3-node kubeadm cluster on cloud VMs for realistic ops).
Progression Strategy¶
Control Plane ──► Kubelet & CRI ──► Controllers & Operators
│ │ │
└────────┬───────┴────────────────────┘
▼
Networking & Storage
│
▼
Platform & Day-2 Ops
│
▼
Hard Way & Capstone
Prerequisites¶
- Container fluency (the
CONTAINER_INTERNALS_PLANweeks 1–3 minimum). - Linux fluency (the
LINUXcurriculum weeks 9–10 minimum: namespaces & cgroups). - Comfortable with at least one of Go, Python, or Rust at a "I can build a small CLI" level.
- A budget for cloud VMs OR hardware to run a multi-node cluster (3 small VMs is sufficient).
Capstone Tracks (pick one in Month 6)¶
- Hard Way Track-provision a multi-node Kubernetes cluster from scratch on bare metal or cloud, with mTLS, fine-grained RBAC, multi-tenancy, and a documented runbook.
- Platform Track-build a GitOps-driven platform-as-a-service: ArgoCD/Flux + Crossplane + OPA Gatekeeper + multi-tenancy + self-service. Demonstrate onboarding a new team in <30 minutes.
- Operator Track-build a non-trivial operator from scratch (e.g., a stateful database operator with backup/restore, or an operator that manages an external SaaS resource via Crossplane composition). Production quality.
Details in CAPSTONE_PROJECTS.md.
Prelude-What Kubernetes Actually Is¶
Sit with this document for an evening before week 1.
1. Kubernetes Is a Distributed Reconciliation Loop¶
The most clarifying way to understand Kubernetes:
Kubernetes is a distributed key-value store (etcd) wrapped in an HTTP API server, surrounded by a swarm of independent controllers, each of which watches some types of objects in the store and writes other types of objects in response-until the cluster's actual state matches the desired state.
That's it. No central brain. No orchestration engine in the traditional sense. Every component is a client of the API server. The "control plane" is an emergent property of independent controllers cooperating through a shared, transactional store.
If you internalize this, the rest of the curriculum is bookkeeping.
2. The Control Loop Is the Atom¶
Every interesting behavior in Kubernetes is some controller running this loop:
for {
desired := apiServer.List(myWatchedKind)
actual := observe(realWorld)
diff := compute(desired, actual)
for _, action := range diff {
actOn(action) // create / update / delete real resources
apiServer.UpdateStatus(...)
}
watchOrSleep()
}
The Deployment controller watches Deployment objects, creates/updates ReplicaSet objects. The ReplicaSet controller watches ReplicaSet objects, creates/updates Pod objects. The kubelet watches Pod objects bound to its node, talks to the CRI to actually start containers. Each controller's only knowledge of the others is the objects they share.
This is also why Kubernetes is eventually consistent-there is no central scheduler enforcing global state, just many controllers converging.
3. The Five-Axis Cost Model¶
A working platform engineer reasons along five axes:
| Axis | Question to ask |
|---|---|
| Control plane | Does this load etcd? How many writes? How many list/watch consumers? |
| Scheduling | What's the resource request? Affinity? Taint/toleration? Topology spread? |
| Networking | Cluster-internal service / NodePort / LoadBalancer / Ingress / Gateway? CNI overhead? |
| Identity & isolation | What ServiceAccount? What namespace? What RBAC? What NetworkPolicy? What PodSecurity profile? |
| Day-2 ops | What does upgrade look like? Backup? Disaster recovery? Cost? |
Beginner courses teach axis 2 only.
4. The Reading List¶
Primary - Kubernetes in Action (Lukša, 2e). The single best book. - Programming Kubernetes (Hausenblas & Schimanski). Required for Months 3–4. - Production Kubernetes (Vyas, et al.). The Day-2 bible. - Cloud Native Patterns (Davis). Architectural patterns.
Source
- kubernetes/kubernetes - the monorepo. Particularly:
-cmd/kube-apiserver,pkg/apiserver/,staging/src/k8s.io/apiserver/ - API server.
- cmd/kube-scheduler, pkg/scheduler/ - scheduler.
-cmd/kube-controller-manager,pkg/controller/ - built-in controllers.
- pkg/kubelet/ - node agent.
-pkg/proxy/ - service implementations.
- kubernetes/community/sig-* - design docs.
- KEPs (Kubernetes Enhancement Proposals) atkubernetes/enhancements`. The canonical record of why features exist.
Adjacent canon - Designing Data-Intensive Applications (Kleppmann). Especially chapters on consensus and replication. - The Raft paper. Read in week 1. - Site Reliability Engineering (Google). The "what does Day-2 mean?" book.
5. Curriculum Philosophy¶
- Source first, blog second. When the curriculum says "study informer mechanics," open
staging/src/k8s.io/client-go/tools/cache/. Blogs go stale; commits are dated. - Run a real cluster. Many labs assume a multi-node setup.
kindis fine for development; weeks 17+ assume something closer to production. - Defaults are wrong. Kubernetes ships with permissive defaults to ease onboarding (no NetworkPolicy, no PodSecurity, broad RBAC). Production requires inverting them.
6. What Kubernetes Is Not For¶
A graduate of this curriculum should be able to argue these points:
- Single-server simple deploys. A Postgres + an app on one VM with systemd is operationally simpler than a one-node Kubernetes cluster. Don't add a control plane to host one app.
- Hard real-time / latency-critical hot paths. kube-proxy adds latency. CNI plugins add latency. The scheduler is not designed for sub-millisecond placement decisions. Use bare-metal or VM-based deployments for ultra-low-latency workloads.
- Stateful databases at scale, naively. Kubernetes can run stateful workloads with operators (Postgres operator, MongoDB operator, etc.), but doing it correctly requires a mature operator ecosystem and skilled operators. "Just put your DB in K8s" is not free.
- Teams without ops capacity. Kubernetes is not a Heroku replacement. The complexity is real. If you don't have a platform team, use Cloud Run, Fly, or a managed container service before reaching for K8s.
7. AI-Assisted Workflows¶
- Always read generated YAML. Models hallucinate field names; Kubernetes silently ignores unknown fields by default-your "successful apply" may be doing nothing.
- Verify CRD generation. Tools like
controller-genare deterministic; let them generate, never hand-edit. - Treat generated RBAC with extreme suspicion. Models tend toward over-broad permissions ("just give it cluster-admin"). Tighten by hand.
You are now ready for Week 1. Open 01_MONTH_CONTROL_PLANE.md.
Month 1-The Control Plane: etcd, kube-apiserver, scheduler, controllers¶
Goal: by the end of week 4 you can (a) describe etcd's Raft model and replay a leader election, (b) trace an apply through the API-server's request pipeline (auth → admission → validation → storage), (c) explain how the scheduler picks a node, and (d) read the source of one built-in controller (Deployment) end-to-end.
Weeks¶
- Week 1 - etcd and the Raft Consensus Foundation
- Week 2 - The kube-apiserver
- Week 3 - The Scheduler
- Week 4 - Built-in Controllers and
client-goFoundations
Week 1 - etcd and the Raft Consensus Foundation¶
1.1 Conceptual Core¶
- etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads via the API server, never etcd directly.
- Raft (the consensus protocol) gives etcd: linearizable writes, fault-tolerant majority reads, and bounded recovery time after node failure.
- A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes (never even numbers); back them up; monitor them.
1.2 Mechanical Detail¶
- Read the Raft paper. Then read
etcd-io/raft/raft.goend to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition. - etcd's data model: a flat keyspace,
mvccrevisions, watch streams. Kubernetes uses keys like/registry/pods/<namespace>/<name>and stores protobuf-encoded objects. - Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream.
- Performance characteristics:
- Write latency = network RTT × 2 (leader → quorum) + fsync.
- Read latency = local read on leader (or any member with
serializable=true). - Throughput is bound by the leader's fsync rate. SSDs help dramatically.
- Compaction and defragmentation are required ops; without them etcd grows unbounded.
1.3 Lab-"etcd, Up Close"¶
- Bring up a 3-node etcd cluster locally (
etcdbinaries, no Kubernetes yet). Configure peer/client URLs. - Use
etcdctlto put/get keys; observe consistent reads. - Kill the leader. Use
etcdctl endpoint status --clusterto identify the new leader within seconds. - Use
etcdctl watch /foofrom one terminal; put values from another. Internalize the watch model. - Use
etcdctl --command-timeout=60s defragto compact + defragment. Observe disk-usage drop.
1.4 Hardening Drill¶
- Configure mTLS between etcd peers and between client and etcd. Configure auth (
role,user). - Take a snapshot via
etcdctl snapshot save. Restore to a new cluster. Verify integrity.
1.5 Operations Slice¶
- Wire etcd metrics to Prometheus:
etcd_server_has_leader,etcd_disk_wal_fsync_duration_seconds,etcd_mvcc_db_total_size_in_bytes. Alert on absent leader, fsync p99 > 100ms, db-size approachingquota-backend-bytes.
Week 2 - The kube-apiserver¶
2.1 Conceptual Core¶
- The API server is the only stateful component (well-the only stateless component that talks to etcd). It exposes the REST/JSON+YAML+protobuf API, performs authn/authz/admission, and writes to etcd.
- Every Kubernetes operation-
kubectl apply, controller reconciliation, kubelet status update-is an HTTP request to this server. - Three middleware stages every request traverses: Authentication (who are you?), Authorization (RBAC: what can you do?), Admission (mutating + validating webhooks).
2.2 Mechanical Detail¶
- Authentication mechanisms: x509 client certs, bearer tokens (ServiceAccount tokens, OIDC), webhook tokens. Each is a request handler chain entry.
- Authorization: RBAC is the dominant mode. ABAC and webhook authz exist but are rare. RBAC binds subjects (User, Group, ServiceAccount) to roles (verb + resource + namespace combinations) via
RoleBinding/ClusterRoleBinding. - Admission:
- Mutating webhooks: can modify the object before validation (e.g., inject sidecars).
- Validating webhooks: can only accept or reject.
- Built-in admission controllers:
LimitRanger,ResourceQuota,ServiceAccount,PodSecurity,NodeRestriction, etc. Readplugin/pkg/admission/in the k8s tree. - Aggregated API server: third parties can register their own API surface (e.g., metrics-server, Knative). The main apiserver proxies requests to them.
- Storage: every object has a "storage version" in etcd. The server converts between API versions on read/write. This is what allows v1beta1 → v1 migrations.
2.3 Lab-"Read the Pipeline"¶
- Use
kubectl --v=8to dump the wire-level request/response of akubectl apply. Read it carefully. - Use
kubectl get --rawto hit/apis/,/api/v1,/apis/apps/v1and see the discovery surface. - Configure the apiserver to log all requests with - -audit-policy-file=audit.yaml`. Apply a few changes; read the audit log.
- Write a tiny mutating webhook (in Go, using
controller-runtime's webhook facilities) that adds a label to every Pod. Deploy and verify.
2.4 Hardening Drill¶
- Audit policy template: log
Metadatafor every request,Requestforsecrets/configmaps,RequestResponseforroles/rolebindings/clusterroles/clusterrolebindings. Ship logs off-cluster.
2.5 Operations Slice¶
- Wire apiserver metrics:
apiserver_request_total,apiserver_request_duration_seconds,apiserver_storage_objects. Alert on per-resource latency p99 spikes.
Week 3 - The Scheduler¶
3.1 Conceptual Core¶
- The default scheduler is a single-replica controller (with leader election) that watches unscheduled Pods and binds them to Nodes. The "binding" is just a write to the Pod's
spec.nodeNamefield. - The scheduler's algorithm is filter then score: filter Nodes that can't host the Pod (resources, affinities, taints), score the remaining ones, pick the highest-scoring.
- The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking.
3.2 Mechanical Detail¶
- Scheduler framework extension points (read
pkg/scheduler/framework/types.go): - PreFilter-short-circuit conditions.
- Filter-must return Success for the Node to be eligible.
- PostFilter-invoked when no Node passes filter (e.g., to trigger preemption).
- PreScore, Score, NormalizeScore.
- Reserve, Permit, PreBind, Bind, PostBind.
- Built-in plugins:
NodeResourcesFit,NodeAffinity,PodTopologySpread,InterPodAffinity,TaintToleration,NodePorts,VolumeBinding,ImageLocality,NodeResourcesBalancedAllocation. - Scheduling profiles: multiple "scheduler personalities" can run in one binary, each with a different plugin config. Used for batch workloads with different priorities.
- Preemption: when a high-priority Pod can't fit, the scheduler may preempt (delete) lower-priority Pods.
priorityClassis the knob.
3.3 Lab-"Scheduler in Action"¶
- Use
kubectl describeon a pending Pod to see filter/score reasons. - Set a Node taint (
kubectl taint nodes node1 key=value:NoSchedule); observe new Pods avoid it. - Define
PriorityClasses (high,default,batch); deploy mixed-priority Pods; trigger preemption by oversaturating. - Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Configure your scheduler binary; run it. Verify selection difference vs default.
3.4 Hardening Drill¶
- Set
priorityClassNameon system-critical Pods (CSI driver, ingress controller). Usesystem-cluster-criticalandsystem-node-criticalfor cluster-internal pods.
3.5 Operations Slice¶
- Wire scheduler metrics:
scheduler_pending_pods,scheduler_pod_scheduling_duration_seconds,scheduler_pod_scheduling_attempts. Alert on persistent pending Pods.
Week 4 - Built-in Controllers and client-go Foundations¶
4.1 Conceptual Core¶
- The kube-controller-manager is a single binary running ~30 built-in controllers. Each is a goroutine running the reconciliation loop pattern.
- The Deployment controller is the best worked example: watches
Deploymentobjects, creates/updatesReplicaSetobjects, drives rolling-update progression. - The patterns the built-in controllers establish-informers, work queues, structured logging, leader election-are the templates you'll use when building custom controllers.
4.2 Mechanical Detail¶
- Informers (
staging/src/k8s.io/client-go/tools/cache/): - A shared in-memory cache populated by a single watch stream per resource type.
- Event handlers:
OnAdd,OnUpdate,OnDelete. - Cache provides O(1) lookup by namespace/name.
- Multiple controllers in one process share informers via
SharedInformerFactory. - Work queues (
client-go/util/workqueue): rate-limited, deduplicated, item-keyed queues. Reconcile functions pull a key, list-from-cache, act, requeue on error. - The Deployment controller flow (
pkg/controller/deployment/): - Informer detects a Deployment change.
- Reconciler computes the desired ReplicaSet count and per-RS replica counts based on strategy (rolling vs recreate).
- Creates new RS / scales old RS / scales new RS.
- Updates Deployment status with progress.
4.3 Lab-"Read the Deployment Controller"¶
- Read
pkg/controller/deployment/deployment_controller.goend-to-end (~1500 lines). - Trace a
kubectl rolloutthrough the source: which conditions are checked, which fields updated, what triggers the next loop iteration. - Reproduce a stuck-rollout scenario (deploy a bad image); observe
Progressing=Falseafter the deadline; inspect status conditions. - Manually scale a Deployment to 0 with
kubectl scale; trace what the controller does in response.
4.4 Hardening Drill¶
- Set sensible Deployment defaults in your platform:
progressDeadlineSeconds,revisionHistoryLimit, rolling-updatemaxSurge/maxUnavailablefor production workloads.
4.5 Operations Slice¶
- Wire workqueue metrics:
workqueue_adds_total,workqueue_depth,workqueue_queue_duration_seconds. Alert on persistent depth or processing latency.
Month 1 Capstone Deliverable¶
A control-plane/ workspace:
1. etcd-cluster/ - week 1's 3-node cluster + backup/restore script.
2.audit-pipeline/ - week 2's audit-log shipping + sample queries.
3. custom-scheduler-plugin/ - week 3's scheduler plugin + deployment.
4.controller-walkthrough.md - week 4's annotated tour of the Deployment controller.
Month 2-The Node: kubelet, CRI, kube-proxy, CSI, Devices¶
Goal: by the end of week 8 you can (a) describe how the kubelet maintains the desired Pod state on a node, (b) trace a service request from client to backing Pod through kube-proxy or eBPF dataplane, (c) explain CSI volume lifecycle (provision → attach → mount), and (d) write a basic device plugin or CSI driver shim.
Weeks¶
- Week 5 - Kubelet Internals
- Week 6 - CRI: kubelet ↔ Runtime
- Week 7 - kube-proxy, Services, and the Networking Dataplane
- Week 8 - CSI, Storage, and Device Plugins
Week 5 - Kubelet Internals¶
5.1 Conceptual Core¶
- The kubelet is the per-node agent. Its job: watch Pods bound to this node, drive the CRI to make the actual containers match. Plus: report node status, manage volumes, run health checks, evict on resource pressure.
- The kubelet is also a PLEG (Pod Lifecycle Event Generator) that polls the runtime to detect actual container state changes-necessary because container exits aren't always pushed events.
- Kubelet is the component most often blamed for "weird" Kubernetes behavior; understanding it is non-optional.
5.2 Mechanical Detail¶
- Read
pkg/kubelet/kubelet.go. Major loops: - `syncLoop - the main reconciliation loop.
- `PLEG - pod-lifecycle event generation.
- `volumeManager - volume mount/unmount.
- `statusManager - Pod status updates back to apiserver.
- `evictionManager - resource-pressure eviction.
- Static pods-Pods defined as YAML files on disk (
/etc/kubernetes/manifests/). Kubelet runs them directly without an apiserver. How control-plane pods bootstrap themselves. - Pod lifecycle phases:
Pending→Running→Succeeded/Failed. With container statesWaiting/Running/Terminated. - Pod resource enforcement: kubelet sets cgroups based on
requests/limits. Withcpu-manager-policy=static, the kubelet pins exclusive CPUs to Guaranteed-class Pods. Same idea formemory-manager-policyandtopology-manager-policy. - Eviction: when a node runs low on memory, disk, PID space, the kubelet evicts Pods in priority order. Soft vs hard thresholds.
5.3 Lab-"Kubelet Forensics"¶
- SSH to a node.
journalctl -u kubelet -fand trigger a Pod creation. Watch the log. crictl ps,crictl pods, `crictl inspect - operate at the CRI layer directly.- Place a static pod manifest; observe kubelet picking it up.
- Trigger a memory eviction by setting low
evictionHardand oversubscribing. Read the eviction event and the kubelet's decision.
5.4 Hardening Drill¶
- Set kubelet args: - -read-only-port=0
, - -anonymous-auth=false, - -authorization-mode=Webhook, - -protect-kernel-defaults=true, - -make-iptables-util-chains=true, - -tls-min-version=VersionTLS12.
5.5 Operations Slice¶
- Wire kubelet metrics:
kubelet_pod_start_duration_seconds,kubelet_running_pods,kubelet_volume_stats_used_bytes. Alert on slow Pod starts.
Week 6 - CRI: kubelet ↔ Runtime¶
6.1 Conceptual Core¶
- The kubelet does not run containers itself; it talks gRPC to a CRI implementation (containerd, CRI-O). The CRI provides RuntimeService (containers + sandboxes) and ImageService (pull/list/remove).
- Every Pod is a sandbox (a network namespace + the "pause" container) plus N containers sharing it.
6.2 Mechanical Detail¶
- The CRI proto:
cri-api/pkg/apis/runtime/v1/api.proto. The most relevant calls:RunPodSandbox,CreateContainer,StartContainer,StopContainer,RemovePodSandbox,Exec,Attach,PortForward,PullImage. - The pause container is a tiny binary (it just calls
pause(2)); it holds open the network namespace so the application containers can come and go. crictlis the CLI for talking to a CRI directly.crictl ps,crictl inspect,crictl exec. Different from `kubectl - talks straight to kubelet's runtime, bypassing the API server.- Runtime classes:
RuntimeClassobjects bind a name (gvisor,kata) to a runtime handler. Pods reference viaspec.runtimeClassName.
6.3 Lab-"CRI Direct"¶
- From a node,
crictl pull alpine; crictl runp pod-config.json; crictl create <pod-id> ctr-config.json img-config.json; crictl start <ctr-id>. You've launched a pod-equivalent without the apiserver. - Compare with kubectl deploying the same: trace each CRI call in the kubelet log.
- Configure containerd with multiple runtimes (runc + runsc); register both as
RuntimeClasses; deploy Pods against each.
6.4 Hardening Drill¶
- Configure containerd to default to a non-root user, drop default capabilities, apply default seccomp. The same hardening from the Container curriculum, applied at the daemon level.
6.5 Operations Slice¶
- Monitor
container_runtime_*metrics from cAdvisor (built into kubelet). Alert on container-restart rate spikes.
Week 7 - kube-proxy, Services, and the Networking Dataplane¶
7.1 Conceptual Core¶
- A Service is a stable virtual IP and port that load-balances across a set of Pods. It is implemented at L4 by kube-proxy-or, in modern eBPF-based clusters, by the CNI directly (Cilium replaces kube-proxy entirely).
- Modes:
- iptables (default): kube-proxy programs iptables DNAT rules. O(N) match per packet; degrades with many Services.
- IPVS: kube-proxy programs the kernel IPVS load balancer. O(1) lookup; better for >1000 services.
- eBPF (Cilium): bypasses iptables entirely; programs are attached at the socket layer (
bpf_sock_ops) and at the egress point. Lowest overhead.
7.2 Mechanical Detail¶
- EndpointSlices replaced Endpoints in 1.21+: split per-Service endpoint lists into multiple objects to scale beyond ~1000 endpoints per Service.
- Service types:
ClusterIP(default, internal),NodePort(open a port on every node),LoadBalancer(cloud LB integration),ExternalName(DNS CNAME). - Headless Services (
clusterIP: None): no virtual IP; DNS returns Pod IPs directly. Used by StatefulSets. - Topology-aware routing: prefer endpoints in the same zone (since 1.27 stable). Saves cross-zone egress costs.
- Service IPs are virtual: no NIC has them; they live only in iptables/IPVS/eBPF rules.
7.3 Lab-"Service Path"¶
- Create a Service + Deployment. From a Pod,
curl <service>.<ns>.svc.cluster.local. Trace the DNS lookup (CoreDNS) and the iptables/IPVS rules that DNAT. - Switch kube-proxy to IPVS mode (
mode: ipvsin kube-proxy config). Verify withipvsadm -L -n. - Install Cilium with
kubeProxyReplacement=true. Observe kube-proxy not running. Verify Service connectivity still works. - Compare per-packet latency under each mode with a small benchmark.
7.4 Hardening Drill¶
- Enable
topology-aware routingto keep traffic in zone. Apply NetworkPolicies (next month) that allow only intended traffic.
7.5 Operations Slice¶
- Wire
kubeproxy_sync_proxy_rules_duration_seconds. With many Services and iptables mode, this can take seconds-a known scale ceiling.
Week 8 - CSI, Storage, and Device Plugins¶
8.1 Conceptual Core¶
- CSI (Container Storage Interface) is the standard plugin interface for storage. Every cloud and many on-prem systems ship a CSI driver. Kubernetes calls the driver via gRPC.
- A CSI driver runs in two modes (or both):
- Controller plugin-provision, delete, attach, detach, snapshot. Cluster-wide.
- Node plugin-stage and publish (mount) the volume on the kubelet node.
- PVC → PV → CSI flow: user creates a PVC; the external-provisioner sidecar sees it, calls the CSI controller's
CreateVolume, which creates a PV bound to the PVC. Kubelet then asks the CSI node plugin to mount.
8.2 Mechanical Detail¶
- StorageClass parameters:
provisioner(CSI driver name),parameters(driver-specific),reclaimPolicy(Delete vs Retain),volumeBindingMode(Immediate vs WaitForFirstConsumer),allowVolumeExpansion. - WaitForFirstConsumer is critical for zone-aware provisioning-wait until the Pod is scheduled to know which zone to provision in.
- Snapshots:
VolumeSnapshotAPI; the external-snapshotter sidecar drives the CSI driver's snapshot calls. - Device plugins: a separate gRPC API (
pluginapi.proto) for exposing custom resources (GPUs, FPGAs, RDMA NICs) to Pods. NVIDIA'sk8s-device-pluginis the canonical example.
8.3 Lab-"Storage Hands-On"¶
- Install a local-path CSI driver (
rancher/local-path-provisionerworks for kind). Create a PVC; observe binding. - Take a snapshot; restore to a new PVC.
- Author a mock device plugin that exposes 4 instances of a fake resource. Deploy a Pod requesting it; verify scheduling and resource accounting.
- Read the CSI proto. Diagram the provision + attach + mount flow on paper.
8.4 Hardening Drill¶
- Use
volumeBindingMode: WaitForFirstConsumerfor all multi-zone clusters. Without it, you'll provision a volume in zone A and try to schedule its Pod in zone B.
8.5 Operations Slice¶
- Monitor
csi_*metrics emitted by sidecars. Alert on provision/attach errors and slowMountoperations.
Month 2 Capstone Deliverable¶
A node-and-cri/ workspace:
1. kubelet-tour/ - week 5's annotated journal-log walkthrough.
2.cri-direct/ - week 6's crictl - based pod-launch demo.
3.dataplane-bench/ - week 7's iptables vs IPVS vs Cilium-eBPF comparison.
4. `mock-device-plugin/ - week 8's working device plugin.
Month 3-Controllers and Operators: client-go, controller-runtime, CRDs¶
Goal: by the end of week 12 you can (a) build a controller from scratch with client-go and informers, (b) build a more sophisticated controller with controller-runtime (including webhooks, finalizers, status conditions), (c) define and version CRDs idiomatically, and (d) ship a non-trivial operator that manages an external system.
Weeks¶
- Week 9 -
client-goInternals and a Bare Controller - Week 10 -
controller-runtimeand Kubebuilder - Week 11 - CRDs: Schema, Versioning, Validation
- Week 12 - Operator Patterns: Finalizers, External Resources, Multi-Cluster
Week 9 - client-go Internals and a Bare Controller¶
9.1 Conceptual Core¶
client-gois the Kubernetes Go client library-typed clients, informers, work queues, leader election, the lot.- Building a controller "from scratch" in
client-gois verbose but instructive-every other framework hides these primitives. - The pattern (the informer + workqueue pattern):
- Create a
SharedInformerFactoryfor the resources you watch. - For each kind, register
OnAdd/OnUpdate/OnDeletehandlers that compute a key (namespace/name) andAddit to aRateLimitingQueue. - Start N workers that pull keys from the queue and run
reconcile(key). reconcile: list-from-cache (never call apiserver in the hot path), compute diff, apply changes, requeue on error with backoff.
9.2 Mechanical Detail¶
- The informer's resync period: re-deliver every cached object every N (default 10 minutes). Belt-and-suspenders against missed events.
- Indexers (
cache.Indexer): O(1) lookup by namespace, by label, by custom key. Free with the informer. - Listers (
<group>/<version>/<resource>/lister.goin generated client code): typed accessors over the indexer. - Leader election (
tools/leaderelection): only one replica of the controller acts; others stand by. Uses aLeaseresource as the lock. - Generated clients: for built-in types,
client-goships them. For your own CRDs, generate withcontroller-genorkubebuilder(week 10).
9.3 Lab-"Controller From Scratch"¶
Build a controller that watches ConfigMaps with the label mirror=true and copies them into every namespace whose name matches a configurable prefix.
- Use client-go informers + workqueue directly.
- Add leader election.
- Idempotent: same input twice produces same result.
- Handle deletions: when the source is deleted, delete all mirrors.
- Run as a Deployment in the cluster.
9.4 Hardening Drill¶
- Define a minimum RBAC: only
get/list/watchonconfigmapsandnamespaces, pluscreate/update/deleteonconfigmaps(constrained by namespace prefix? Use admission webhooks or namespace selectors).
9.5 Operations Slice¶
- Expose
controller_runtime_* - style metrics: queue depth, work duration, error rate. Add a/healthzand/readyz. Run with/livez` probe.
Week 10 - controller-runtime and Kubebuilder¶
10.1 Conceptual Core¶
controller-runtimeis the modern, opinionated framework for controllers. Built atopclient-go, it provides:Manager(informer factory + leader election + metrics + healthz wired together).Reconciler(typed reconcile method).Client(cached read, direct write).- Webhook scaffolding (mutating + validating + conversion).
- Finalizers helpers.
- Kubebuilder is a CLI on top of
controller-runtimethat scaffolds projects from CRD definitions. The de facto starting point for new operators.
10.2 Mechanical Detail¶
- Project structure (
kubebuilder init && kubebuilder create api): - The
Reconcilemethod is the hot path; it should be idempotent and make no assumption about why it was called. Re-derive everything each call. controllerutil.CreateOrUpdate-the reliable upsert helper.- Owner references-when a controller creates a child object, it sets the parent as the owner. Garbage collection handles cascading deletion.
- Finalizers-string keys on
metadata.finalizers. Block deletion until the controller removes the finalizer (after performing cleanup). The pattern for cleaning up external resources before the K8s object disappears. - Status subresource-separates spec writes from status writes; allows least-privilege RBAC.
10.3 Lab-"Rebuild Week 9 in controller-runtime"¶
Take week 9's mirror controller; rebuild with kubebuilder + controller-runtime. Compare LOC and verbosity. The framework should save substantial code.
10.4 Hardening Drill¶
- Use
controller-runtime's metric and health endpoints. Configure leader election with a non-default lease duration appropriate to your environment.
10.5 Operations Slice¶
- Wire
controller_runtime_reconcile_*metrics. Establish dashboards: reconcile rate, error rate, average reconcile duration per controller.
Week 11 - CRDs: Schema, Versioning, Validation¶
11.1 Conceptual Core¶
- A CRD (CustomResourceDefinition) registers a new resource kind with the apiserver. Once registered, you can
kubectl get/applyit like any built-in. - The CRD includes an OpenAPI v3 schema that the apiserver uses for validation. Get this right or you'll ship buggy custom resources.
- Multiple versions can coexist; conversion webhooks translate between them. The pattern that allows v1alpha1 → v1beta1 → v1 evolution.
11.2 Mechanical Detail¶
- Marker comments (
+kubebuilder:...) on Go types generate the CRD YAML viacontroller-gen. Examples: +kubebuilder:validation:Required+kubebuilder:validation:MinLength=3+kubebuilder:validation:Enum=foo;bar;baz+kubebuilder:subresource:status+kubebuilder:printcolumn:name="Phase",type=string,JSONPath=.status.phase``- Status conditions: array of
{type, status, reason, message, lastTransitionTime}. The standard pattern for surfacing controller state. Use the Kubernetes types directly (metav1.Condition). - Server-side apply: with SSA, multiple controllers can own different fields of the same object via
fieldManager. Replaces hand-rolled patches for many use cases. - Conversion webhooks: invoked when apiserver needs to translate between stored and requested versions. Implement carefully-round-trip stability is essential.
11.3 Lab-"A Well-Versioned CRD"¶
- Define a CRD with v1alpha1.
- Add validation, defaults, status conditions, printer columns.
- Add a v1beta1 with renamed fields and a conversion webhook between them.
- Verify round-trip:
kubectl get -o v1alpha1then - o v1beta1` returns identical content.
11.4 Hardening Drill¶
- Validation only at the schema level is not enough. Add admission webhooks for cross-field validation (e.g., "if mode=X then field Y is required").
11.5 Operations Slice¶
- Track
apiserver_storage_objectsper CRD. CRDs that grow unbounded are a frequent platform failure mode.
Week 12 - Operator Patterns: Finalizers, External Resources, Multi-Cluster¶
12.1 Conceptual Core¶
- The "operator" pattern: a controller that encapsulates operational knowledge for a specific application. Examples: Postgres operator (provisions DBs, handles backups, failover), Cert-Manager (ACME-driven cert lifecycle), Prometheus operator (manages Prometheus + Alertmanager + ServiceMonitor stack).
- An operator is a controller plus one or more CRDs representing the application's domain concepts.
- Production operators handle: leader election, finalizers, status conditions, observability, RBAC, upgrades, multi-tenant isolation, external-system reconciliation, retries with backoff.
12.2 Mechanical Detail¶
- External resources (cloud APIs, SaaS): the controller's reconcile loop calls outward. Idempotency is essential-assume your reconcile may run multiple times before the external API confirms.
- Crossplane (week 19) generalizes this: every external resource is itself a Kubernetes object backed by a controller that talks to the cloud. You compose them.
- Cluster-scoped vs namespace-scoped operators: namespace-scoped is safer (lower blast radius) but limits multi-tenant operator deployment.
- Operator SDK vs Kubebuilder: largely converged today; pick whichever your team prefers. The patterns are identical.
12.3 Lab-"An Operator That Manages an External Resource"¶
Build an operator with a GitHubRepo CRD: spec includes a repo name and visibility; the controller calls the GitHub API to create/update/delete the repo to match. Includes:
- Authentication via a Secret referenced by the CR.
- Finalizers for cleanup.
- Status conditions: Ready, Synced, Error with reasons.
- Rate-limited reconciles with exponential backoff.
- E2E test using a fake GitHub API server.
12.4 Hardening Drill¶
- Define an OPA/Kyverno policy: every
GitHubRepomust reference a Secret in the same namespace; cross-namespace references denied. Tests for the policy.
12.5 Operations Slice¶
- Add
tracingto the reconcile path; export traces via OTel. The operator's hop into GitHub appears as an external span-useful for diagnosing outages.
Month 3 Capstone Deliverable¶
A controllers-and-operators/ workspace:
1. mirror-controller-clientgo/ (week 9).
2. mirror-controller-cr/ (week 10).
3. versioned-crd/ (week 11).
4. github-repo-operator/ (week 12).
Month 4-Networking and Storage at Scale¶
Goal: by the end of week 16 you can (a) explain the CNI spec and trace a Pod-to-Pod packet through a working CNI, (b) install and operate Cilium with eBPF-based dataplane, kube-proxy replacement, and L7 visibility, (c) reason about service-mesh tradeoffs (Istio vs Linkerd vs Cilium Service Mesh), and (d) operate dynamic CSI provisioning at scale with backups and snapshots.
Weeks¶
- Week 13 - The CNI Spec and Pod Networking
- Week 14 - Cilium and eBPF Networking
- Week 15 - Service Meshes: Istio, Linkerd, Cilium Service Mesh
- Week 16 - CSI at Scale: Snapshots, Backup, Cloning
Week 13 - The CNI Spec and Pod Networking¶
13.1 Conceptual Core¶
- The CNI (Container Network Interface) spec is small (~30 pages). A CNI plugin is a binary that the kubelet (via the runtime) invokes when a Pod sandbox is created. Inputs: container ID, network namespace path, JSON config. Outputs: assigned IP, routes.
- Kubelet does not know about networking beyond "ask the CNI." This is what makes the dataplane pluggable.
- Modern CNIs ship as DaemonSets that program kernel rules (iptables, OVS, eBPF) and run a small "agent" plus a thin "delegator" CNI binary.
13.2 Mechanical Detail¶
- Read
containernetworking/cni/SPEC.md. Operations:ADD,DEL,CHECK,VERSION. - The CNI binary must be at
/opt/cni/bin/<name>; config at/etc/cni/net.d/*.conf. - The kubelet → CRI → CNI flow:
- Kubelet asks runtime to create a sandbox.
- Runtime creates a netns; calls CNI
ADD. - CNI assigns IP, sets up the netns.
- Sandbox containers join via
CLONE_NEWNS=falseplus the existing netns. - CNI chains: multiple plugins composed, each running in order (e.g., a primary CNI + a metering plugin + a port-mapping plugin).
13.3 Lab-"Read a CNI's Source"¶
- Pick a simple CNI (
flannelor the referencebridgeplugin fromcontainernetworking/plugins). Read itscmdAddend to end. - Deploy a small kind cluster; trace a Pod creation in the kubelet log; correlate with the CNI binary invocation.
- Use
nsenter -t <pause-pid> -n ip ato inspect the container's network namespace from the host.
13.4 Hardening Drill¶
- Default-deny
NetworkPolicyper namespace. Allow only intended Pod-to-Pod and Pod-to-Service traffic.
13.5 Operations Slice¶
- Monitor CNI errors in kubelet logs. A node with consistent CNI ADD failures will have stuck pending Pods-alert on this.
Week 14 - Cilium and eBPF Networking¶
14.1 Conceptual Core¶
Cilium is the dominant eBPF-based CNI. It replaces iptables-based packet processing with eBPF programs attached at three layers:
- Socket layer (
bpf_sock_ops) - connection-level decisions before packets exist. - Cgroup egress - per-pod outbound policy enforcement.
- NIC-level XDP - ingress filtering at line rate, before the kernel network stack.
The shift from iptables matters at scale: an iptables-based kube-proxy walks a linear chain of rules per packet - O(services). eBPF programs do hash-table lookups: O(1) per packet, regardless of service count.
Beyond replacing the CNI, Cilium provides: - Kube-proxy replacement (eBPF-based service load balancing - no iptables churn on every endpoint change). - L7 NetworkPolicy (HTTP, gRPC, Kafka filtering at the dataplane, not in a sidecar). - ClusterMesh (multi-cluster service discovery and cross-cluster policy). - Hubble (eBPF-based flow observability - every pod-to-pod connection visible without sampling). - Service Mesh (sidecar-less mTLS via eBPF + SPIFFE).
This is the bridge to Linux Month 3 - eBPF in production. See also: eBPF in the observability cross-topic page.
14.2 Mechanical Detail¶
- Dataplane as eBPF graph: Cilium's eBPF programs live under
bpf/incilium/cilium. The agent compiles them at startup with the cluster's specific configuration baked in (BTF-driven CO-RE for portability across kernels). - Identity-based policy: pods are assigned a numeric identity derived from their labels (
app=foo,env=prod→ identity 1234). eBPF programs match on these identities, not on IPs. This is what allows policy to scale to thousands of pods without per-pod iptables rules - identities are stable across pod restarts and IP changes. - Service load balancing: instead of iptables DNAT chains, Cilium uses an eBPF map indexed by
(service IP, port)returning a backend. Connection state lives in a separate eBPF map; updates are atomic, no kernel reload, no race during endpoint churn. - Encryption: WireGuard (recommended; in-kernel since 5.6) or IPsec tunnels between nodes. Per-NetworkPolicy opt-in or cluster-wide.
- Hubble captures every packet's metadata via eBPF - source/dest identity, verdict (allowed/denied), L7 protocol info - and exposes it via gRPC + a CLI + a UI. Per-packet overhead is single-digit-percent CPU.
The trap
Switching kubeProxyReplacement from false → true on a live cluster without draining nodes. The iptables rules from the old kube-proxy don't get cleaned up automatically, and they interact badly with Cilium's eBPF NAT. Always: drain node → reconfigure → uncordon. The Cilium installer's kubeProxyReplacement: strict mode aborts if it finds residual rules.
14.3 Lab - "Install and Drive Cilium"¶
- Install via Helm with:
kubeProxyReplacement=true,hubble.enabled=true,hubble.relay.enabled=true,hubble.ui.enabled=true,encryption.enabled=true,encryption.type=wireguard. - Use the Hubble UI (
cilium hubble ui) to visualize pod-to-pod traffic in real time. - Author L4
NetworkPolicy(standard k8s API); test enforcement with a denied + allowed flow. - Author an L7
CiliumNetworkPolicy(e.g., allow onlyHTTP GET /api/*from frontend → backend); test enforcement. - Enable Cilium Service Mesh; observe sidecar-free mTLS between two test services.
14.4 Hardening Drill¶
Enable transparent encryption (WireGuard) between nodes. Combined with default-deny NetworkPolicy (start: deny everything, allow explicitly), this gives defense-in-depth: even if a node is compromised, the attacker sees only encrypted traffic for flows they haven't been explicitly authorized to observe.
14.5 Operations Slice¶
Monitor cilium_* Prometheus metrics. Alert on:
- policy-drop rate spikes - legitimate workloads being denied (usually a NetworkPolicy author mistake, or a new service that didn't get its allow rule).
- identity-table pressure - Cilium has a max identity count per cluster; approaching it means too many distinct label combinations, often from a bad operator emitting unique labels per request.
- endpoint regeneration time - if it climbs past 5-10s, your label churn is overwhelming the agent.
Week 15 - Service Meshes: Istio, Linkerd, Cilium Service Mesh¶
15.1 Conceptual Core¶
- A service mesh adds: mTLS between Services, retries/timeouts/circuit-breaking, traffic shifting (canary, blue/green), observability (RED metrics + traces), policy enforcement.
- Two architectural patterns:
- Sidecar (Istio classic, Linkerd)-Envoy/
linkerd-proxyruns in every Pod. ~50 MB memory per Pod, ~1 ms latency overhead. - Sidecar-less (Istio ambient, Cilium SM)-eBPF + per-node proxy. Much lower per-Pod overhead.
- Decision matrix:
- Mature, full-featured, complex → Istio.
- Minimalist, Rust-based, fast to install → Linkerd.
- Already running Cilium, want sidecar-less → Cilium Service Mesh.
15.2 Mechanical Detail¶
- Envoy (under Istio + others) is the dataplane proxy. xDS APIs (LDS, RDS, CDS, EDS) push config from the control plane.
- mTLS rotation: the mesh control plane issues short-lived certs (typically 24h) signed by an internal CA (or SPIFFE-compatible).
- Traffic management: Istio
VirtualService+DestinationRulefor routing rules. K8s Gateway API is the standard-track replacement, supported by all major meshes. - Observability: every mesh emits RED metrics (Rate, Errors, Duration) per-service. With OTel, traces propagate through the mesh.
15.3 Lab-"Three Meshes"¶
- Install Istio in ambient mode on a test cluster. Apply a
VirtualServicethat does 90/10 canary routing. Verify with Hubble or Kiali. - Repeat with Linkerd. Compare install footprint, configuration ergonomics, and observability quality.
- (If running Cilium) enable Cilium Service Mesh. Compare again.
- Document tradeoffs: install effort, per-Pod overhead, feature gaps.
15.4 Hardening Drill¶
- Enable mTLS in
STRICTmode. DefineAuthorizationPolicys denying cross-namespace traffic by default; allow only intended pairs.
15.5 Operations Slice¶
- Wire the mesh's RED metrics into your service-level dashboards. Define SLOs per service: latency p99, error rate, mTLS handshake success rate.
Week 16 - CSI at Scale: Snapshots, Backup, Cloning¶
16.1 Conceptual Core¶
- Production storage in K8s requires:
- Dynamic provisioning (week 8).
- Volume Snapshots (point-in-time captures).
- Backups (off-cluster, often app-consistent via operator hooks).
- Cloning (PVC from snapshot, or PVC-from-PVC).
- Resizing (online expansion).
- Velero is the de-facto cluster backup tool: backs up resource manifests + PV snapshots to object storage; restores selectively.
16.2 Mechanical Detail¶
VolumeSnapshotClass↔VolumeSnapshot↔VolumeSnapshotContent. Mirrors the SC/PVC/PV trio.- The external-snapshotter sidecar runs alongside the CSI controller, watching
VolumeSnapshotobjects. - Volume populators (since 1.24+)-populate a new PVC from arbitrary sources (snapshots, other PVCs, S3, etc.). Modular framework.
- Velero: install, configure storage location (S3-compatible bucket), schedule backups via
Scheduleresource. Plugins for cloud providers and for "BackupStorageLocation" abstraction.
16.3 Lab-"Backup and Restore"¶
- Install Velero against a MinIO bucket.
- Schedule a daily backup of one namespace.
- Delete the namespace; restore from backup; verify Pods come back, PVs reattach, data intact.
- Create a stateful workload (Postgres via an operator); test snapshot + clone flow for fast dev/test environment provisioning.
16.4 Hardening Drill¶
- Test restore into a different cluster. This is the actual disaster-recovery scenario, and the most commonly broken backup story.
16.5 Operations Slice¶
- Wire Velero metrics: backup success rate, backup duration, restore-test outcomes (run a synthetic restore weekly to validate).
Month 4 Capstone Deliverable¶
A networking-and-storage/ workspace:
1. cni-source-walkthrough.md (week 13).
2. cilium-policies/ - L4 + L7 + identity-based examples (week 14).
3.mesh-comparison/ - three meshes, RED dashboards (week 15).
4. `velero-DR/ - backup, restore, and cross-cluster-restore demos (week 16).
Month 5-Platform Engineering and Day-2 Operations¶
Goal: by the end of week 20 you can (a) operate a GitOps workflow with ArgoCD or Flux at scale, (b) provision cloud infrastructure declaratively via Crossplane (or Terraform from K8s), (c) configure HPA/VPA against custom Prometheus metrics, and (d) author and enforce policies with OPA Gatekeeper or Kyverno.
Weeks¶
- Week 17 - GitOps: ArgoCD and Flux
- Week 18 - IaC From Within K8s: Crossplane and Terraform
- Week 19 - HPA, VPA, KEDA: Autoscaling
- Week 20 - Admission Control: Webhooks, OPA Gatekeeper, Kyverno
Week 17 - GitOps: ArgoCD and Flux¶
17.1 Conceptual Core¶
- GitOps = the cluster's desired state is the contents of a git repo. A controller in the cluster watches the repo and reconciles drift.
- The two dominant tools:
- ArgoCD-UI-rich, opinionated about app structure (
ApplicationCRD), wide adoption. - Flux-CLI/CRD-first, more composable (
Kustomization,HelmRelease,GitRepository,OCIRepository), favored by CNCF-style purists. - Both implement the same control loop: pull manifests from git → render (Kustomize/Helm) → apply → reconcile drift.
17.2 Mechanical Detail¶
- ArgoCD
Application:spec.source(git path or Helm chart),spec.destination(cluster + namespace),spec.syncPolicy(manual vs automatic, prune, self-heal). ApplicationSet(Argo)-generate many Apps from templates; the foundation for multi-tenant fleet management.- Flux
Kustomization+HelmRelease-separate CRs for source-of-truth, transform, and apply. - Sync waves / dependencies: both tools support ordering. Critical for "install CRDs before the resources that use them."
- Drift detection: tools auto-revert manual changes by default. Sometimes that is not what you want during incident response-know how to disable temporarily.
17.3 Lab-"Two GitOps Stacks"¶
- Install ArgoCD. Set up an
Applicationfor a small app from a git repo. Verify auto-sync and auto-prune. - Install Flux. Set up the equivalent. Compare ergonomics.
- Use
ApplicationSet(Argo) to deploy the same app to three environment overlays (dev,staging,prod). Verify per-environment configuration via Kustomize overlays.
17.4 Hardening Drill¶
- ArgoCD/Flux talk to a git repo with read access. Use SSH deploy keys or fine-scoped GitHub apps; never broad PATs. Encrypt secrets at rest with
sealed-secretsorsops.
17.5 Operations Slice¶
- Wire ArgoCD/Flux metrics: per-Application sync rate, drift rate, reconciliation duration. Alert on persistent OutOfSync or Failed states.
Week 18 - IaC From Within K8s: Crossplane and Terraform¶
18.1 Conceptual Core¶
- Crossplane flips IaC inside out: cloud resources are Kubernetes resources (CRDs), reconciled by Crossplane's providers (
provider-aws,provider-gcp,provider-azure,provider-helm,provider-kubernetes). You manage cloud infra withkubectl apply. - Compositions let you bundle low-level primitives into domain-specific abstractions: define a
XPostgresInstancethat, when applied, creates a VPC subnet, an RDS instance, IAM bindings, and aServiceMonitor. Platform teams ship Compositions; app teams consume them. - Terraform alternative: run Terraform Cloud / Atlantis externally; treat the cluster as a deploy target only. Simpler in some shops; doesn't unify the control plane.
18.2 Mechanical Detail¶
- Provider = a controller image that knows how to talk to one external system. Install via
ProviderCRD. ProviderConfig= credentials + connection details for the provider.- Managed Resource (MR) = the K8s representation of a cloud resource (
Bucket,Database,IAMRole). - Composition = a YAML transform: "given this
XPostgresInstanceclaim, produce these MRs with these field mappings." - Composite Resource Definition (XRD) = the schema for the abstract type; the platform-team-facing equivalent of CRD.
18.3 Lab-"Self-Service Database"¶
- Install Crossplane. Install
provider-aws(orprovider-gcp). - Configure provider credentials.
- Define an XRD
XDatabasewith parameters:size,engine,version,region. - Define a Composition that materializes an RDS instance + a Secret with credentials.
- As an "app team" persona, create a
Databaseclaim. Watch it become a real RDS instance. Delete; watch it be torn down.
18.4 Hardening Drill¶
- Restrict Composition
selectorsandcompositionRefso app teams cannot select unintended Compositions. Use OPA/Gatekeeper to enforce naming, region, size limits.
18.5 Operations Slice¶
- Compositions are platform contracts. Version them. Provide migration paths. Treat as you would a public API: SLAs, deprecation windows, changelogs.
Week 19 - HPA, VPA, KEDA: Autoscaling¶
19.1 Conceptual Core¶
- HPA (Horizontal Pod Autoscaler): scales replica count based on metrics. CPU/memory by default; with
metrics.k8s.io+custom.metrics.k8s.ioadapters (e.g.,prometheus-adapter), any metric is fair game. - VPA (Vertical Pod Autoscaler): adjusts a Pod's CPU/memory
requestsbased on observed usage. Two modes:Auto(recreate pod with new resources),Off/Initial(only on creation). - KEDA (Kubernetes Event-Driven Autoscaling): scale to zero, scale on event-source backlog (Kafka lag, SQS depth, custom). Sits in front of HPA.
19.2 Mechanical Detail¶
- HPA reconcile interval: 15s by default. Picking metrics that are too jittery causes flapping; smooth at the source.
- HPA scaling policies:
scaleUp.policiesandscaleDown.policieswith stabilization windows. Tune to workload's elasticity profile. - Custom metrics adapter (
prometheus-adapter): translates Prometheus queries into thecustom.metrics.k8s.ioAPI the HPA reads. Define rules in adapter config. - VPA's recommender computes percentile-based recommendations from historical usage. Often used in
Offmode just to suggest resource changes; production safety prefers manual approval.
19.3 Lab-"Autoscale on Custom Metrics"¶
- Deploy a load-test target with a Prometheus-exposed
requests_per_secondmetric. - Install
prometheus-adaptermapping that metric tocustom.metrics.k8s.io. - Author HPA targeting
AverageValue=200of that metric. Drive load; watch scaling. - Add KEDA in front for scale-to-zero behavior. Verify cold-start latency.
19.4 Hardening Drill¶
- Set
minReplicasto a non-zero value for any tier-1 service (avoid cold-start during incident traffic). CapmaxReplicasto avoid runaway autoscaling on metric anomalies.
19.5 Operations Slice¶
- Wire HPA event metrics. Alert on persistent
desiredReplicas == maxReplicas(you've hit the cap) and on flapping (scaleUpandscaleDownevents alternating rapidly).
Week 20 - Admission Control: Webhooks, OPA Gatekeeper, Kyverno¶
20.1 Conceptual Core¶
- Admission control is the apiserver's last gate: every create/update is run through configured admission webhooks before persistence.
- Two policy-engine choices in the modern ecosystem:
- OPA Gatekeeper-Rego-language policies; the standard for "policy as code."
- Kyverno-YAML-native policies; lower learning curve, strong template/mutation/generate support.
- Pod Security Admission (replacement for the deprecated PodSecurityPolicy)-built into the apiserver. Three profiles:
privileged,baseline,restricted. Apply per-namespace.
20.2 Mechanical Detail¶
- Validating webhooks: receive AdmissionReview, return Allowed=true/false with reasons. Cannot mutate.
- Mutating webhooks: also return JSON Patch / strategic merge for changes. Applied before validating.
- Failure policy (
FailvsIgnore): if the webhook is unreachable, fail closed (safer) or open (operationally simpler). Trade off carefully. - Gatekeeper's
ConstraintTemplate(Rego) +Constraint(instance) model. Audit mode reports without enforcing-start there in any new policy rollout. - Kyverno's
ClusterPolicy/PolicyCRDs cover validate, mutate, generate, verifyImages.
20.3 Lab-"Three Policy Layers"¶
- Apply Pod Security Admission per-namespace:
restrictedeverywhere except aprivnamespace. - Author 5 Gatekeeper Constraints: require resource limits, forbid
latesttags, enforce non-root, label-required, namespace-must-have-team-label. - Author equivalents in Kyverno. Compare expressiveness.
- Run in audit-mode for a week against a pre-existing cluster; triage findings before enforcing.
20.4 Hardening Drill¶
- Mandate signed images via Kyverno's
verifyImageswith cosign keys. Combined with Sigstore policy from the Container curriculum, this closes the supply-chain gate at the cluster.
20.5 Operations Slice¶
- Track admission-webhook latency. Slow webhooks slow every apply. Pod-creation latency p99 is your warning signal.
Month 5 Capstone Deliverable¶
A platform-and-day2/ workspace:
1. gitops-stack/ (week 17)-ArgoCD + ApplicationSet + multi-env overlays.
2. crossplane-platform/ (week 18)-XDatabase composition + claim demo.
3. hpa-custom-metrics/ (week 19)-Prom-adapter + HPA + KEDA scale-to-zero demo.
4. policy-suite/ (week 20)-Gatekeeper + Kyverno + PSA examples.
Month 6-Kubernetes The Hard Way + Capstone¶
Goal: by the end of week 24 you have built (or substantially built) a multi-node Kubernetes cluster from raw VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy isolation, and a documented operational runbook.
Weeks¶
- Week 21 - Bootstrap: VMs, Certificates, etcd
- Week 22 - Control Plane and Worker Nodes
- Week 23 - RBAC, Multi-Tenancy, mTLS Everywhere
- Week 24 - Defense, Documentation, and the Capstone Demo
Week 21 - Bootstrap: VMs, Certificates, etcd¶
21.1 Conceptual Core¶
- "Kubernetes the Hard Way" is Kelsey Hightower's exercise: bring up a Kubernetes cluster step by step, from raw VMs, generating certs by hand, configuring every flag explicitly. The point is not operational efficiency; it is deep understanding of every moving part.
- This curriculum's hard-way variant: bring up 3 control-plane nodes + 3 worker nodes on cloud VMs (or bare metal). Use modern toolchain (containerd, Cilium, latest stable Kubernetes).
21.2 Mechanical Detail¶
- VM provisioning: 6 VMs, ~2 vCPU 4 GB each. Cloud (AWS/GCP/Hetzner) or bare metal.
- PKI: a CA + intermediate CAs for `etcd`, `kube-apiserver`, `kubelet`, `front-proxy`. Use `cfssl` or `easy-rsa`. Every component identifies itself with x509.
- etcd cluster: 3 nodes, mTLS between peers and clients, snapshots scheduled.
- Bootstrap ordering considerations: the kubelet needs a kubeconfig before the apiserver is up. Either run control-plane components as static-pod manifests (the `kubeadm` approach) or run the control plane outside the cluster, as systemd services on the VMs themselves.
21.3 Lab-"Bring Up etcd"¶
- Provision 3 VMs labeled `etcd-{1,2,3}`.
- Generate CA + per-node certs.
- Install etcd binaries; configure systemd units with mTLS.
- Bring up; verify `etcdctl member list` shows a healthy quorum.
- Take a snapshot. Restore on a separate test machine.
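A quick quorum check once the units are up might look like this (endpoints and cert paths are placeholders for whatever your PKI layout uses):

```
$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
    --cacert=/etc/etcd/pki/ca.crt \
    --cert=/etc/etcd/pki/client.crt \
    --key=/etc/etcd/pki/client.key \
    member list -w table
# 'endpoint health' and 'endpoint status -w table' with the same flags show leader and DB size
```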
21.4 Hardening Drill¶
- Kubernetes secret encryption-at-rest (the apiserver's encryption-provider config, next week) is separate from protecting etcd's own data on disk. Encrypt the etcd data volume at the disk level (LUKS or cloud disk encryption) from day one.
21.5 Operations Slice¶
- etcd backup automation: `etcdctl snapshot save` cron'd to S3 every 6 hours. Verify the restore weekly.
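One way to script that schedule, assuming the AWS CLI for the S3 upload (bucket name, paths, and retention are placeholders):

```
#!/usr/bin/env bash
# /usr/local/bin/etcd-backup.sh  -  cron: 0 */6 * * * /usr/local/bin/etcd-backup.sh
set -euo pipefail
SNAP="/var/backups/etcd/snapshot-$(date +%Y%m%d-%H%M).db"
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt --cert=/etc/etcd/pki/client.crt --key=/etc/etcd/pki/client.key \
  snapshot save "$SNAP"
ETCDCTL_API=3 etcdctl snapshot status "$SNAP" -w table   # sanity-check the file before shipping it
aws s3 cp "$SNAP" "s3://my-etcd-backups/$(hostname)/"
find /var/backups/etcd -type f -mtime +7 -delete          # keep one week locally
```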
Week 22 - Control Plane and Worker Nodes¶
22.1 Conceptual Core¶
- The control plane: kube-apiserver, kube-scheduler, kube-controller-manager. Run all three as systemd-managed binaries on each control-plane node, behind a load balancer (HAProxy or cloud LB) for HA.
- The worker plane: containerd + kubelet + kube-proxy (or Cilium replacement). Joins the cluster via a kubelet kubeconfig signed by the cluster CA.
22.2 Mechanical Detail¶
- kube-apiserver flags:
  - `--etcd-servers=https://etcd-{1,2,3}:2379` with mTLS.
  - `--encryption-provider-config=...` for secret encryption-at-rest.
  - `--audit-policy-file=...` and `--audit-log-path=...`.
  - `--authorization-mode=Node,RBAC`.
  - `--enable-admission-plugins=NodeRestriction,PodSecurity,ResourceQuota,...`.
  - `--service-account-issuer`, `--service-account-signing-key-file` for ServiceAccount tokens (projected, OIDC-compatible).
- kubelet bootstrap: TLS bootstrap using a bootstrap token; the kubelet auto-rotates its cert via `kubelet-csr-approver`.
- CNI: install Cilium first (DaemonSet); only after Cilium is healthy do worker-node Pods become ready.
- DNS: install CoreDNS as a Deployment; the kubelet's cluster-DNS flag points at its Service IP.
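Put together, the apiserver invocation ends up looking roughly like this (an illustrative excerpt only; certificate paths, the issuer URL, and the service CIDR are placeholders for your own layout):

```
kube-apiserver \
  --etcd-servers=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379 \
  --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt \
  --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt \
  --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key \
  --client-ca-file=/etc/kubernetes/pki/ca.crt \
  --tls-cert-file=/etc/kubernetes/pki/apiserver.crt \
  --tls-private-key-file=/etc/kubernetes/pki/apiserver.key \
  --encryption-provider-config=/etc/kubernetes/encryption-config.yaml \
  --audit-policy-file=/etc/kubernetes/audit-policy.yaml \
  --audit-log-path=/var/log/kubernetes/audit.log \
  --authorization-mode=Node,RBAC \
  --enable-admission-plugins=NodeRestriction,PodSecurity,ResourceQuota \
  --service-account-issuer=https://kubernetes.default.svc \
  --service-account-signing-key-file=/etc/kubernetes/pki/sa.key \
  --service-account-key-file=/etc/kubernetes/pki/sa.pub \
  --service-cluster-ip-range=10.96.0.0/12
```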
22.3 Lab-"Cluster Live"¶
- Bring up 3 control-plane nodes; HAProxy in front.
- Bring up 3 workers; join via bootstrap tokens.
- Install Cilium; verify Pod-to-Pod connectivity.
- Install CoreDNS; verify Service DNS works.
- Smoke test: deploy a sample app + Service + Ingress; verify end-to-end.
22.4 Hardening Drill¶
- Apply the CIS Kubernetes Benchmark v1.8 (or current). Use `kube-bench` to score. Address all `FAIL`s; document `WARN`s.
22.5 Operations Slice¶
- Wire control-plane components to Prometheus. Define SLOs: apiserver request p99 < 1s, etcd-leader-changes per hour < 1, scheduler queue depth < 100.
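Expressed as alerts, assuming the Prometheus Operator's `PrometheusRule` CRD and the stock control-plane metric names (thresholds mirror the SLOs above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: control-plane-slos
  namespace: monitoring
spec:
  groups:
    - name: control-plane
      rules:
        - alert: ApiserverP99LatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le)
            ) > 1
          for: 10m
          labels: {severity: warning}
          annotations: {summary: "apiserver request p99 above 1s"}
        - alert: EtcdLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[1h]) > 1
          for: 5m
          labels: {severity: warning}
          annotations: {summary: "more than one etcd leader change in the last hour"}
        - alert: SchedulerQueueDeep
          expr: sum(scheduler_pending_pods) > 100
          for: 10m
          labels: {severity: warning}
          annotations: {summary: "scheduler queue depth above 100"}
```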
Week 23 - RBAC, Multi-Tenancy, mTLS Everywhere¶
23.1 Conceptual Core¶
- Multi-tenancy is the hardest sustained problem in Kubernetes. The kernel and Kubernetes give you soft isolation by default; converting that to hard isolation requires layered controls.
- The required layers: namespace-per-tenant + RBAC + NetworkPolicy + ResourceQuota + LimitRange + PodSecurity + node-pool isolation + (optionally) sandboxed runtime.
- mTLS everywhere: control-plane (already from week 22), service mesh between Services (week 15), workload identity for Pods talking to cloud APIs (e.g., AWS IRSA, GCP Workload Identity).
23.2 Mechanical Detail¶
- Tenant onboarding as code (Crossplane Composition or Helm chart):
- Namespace.
- ResourceQuota + LimitRange.
- Default-deny NetworkPolicy + an allow-namespace-internal exception.
- PodSecurity admission label (`restricted`).
- RoleBindings for the tenant's group.
- ServiceAccount with workload identity binding for cloud access.
- GitOps Application(Set) entries to deploy the tenant's app catalog.
- Hard isolation tiers:
- Tier 1: namespace + RBAC. Default. Suitable for trusted internal teams.
- Tier 2: + sandboxed runtime (gVisor) for tenant-owned untrusted workloads.
- Tier 3: + dedicated node pool with taints. Suitable for compliance-bound workloads.
- Tier 4: separate cluster (vCluster, Cluster API). Strongest isolation; highest cost.
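Stripped of the Crossplane/GitOps wrapping, the Tier-1 slice of that onboarding bundle reduces to a handful of manifests; a sketch for a hypothetical `acme` tenant (quota numbers and the IdP group name are illustrative; LimitRange, monitoring, and workload-identity pieces are omitted):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: acme
  labels:
    tenant: acme
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: acme-quota
  namespace: acme
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    persistentvolumeclaims: "20"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-allow-intra-namespace
  namespace: acme
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}        # allow traffic originating inside the namespace only
  egress:
    - to:
        - podSelector: {}        # NB: DNS egress to kube-system needs an extra rule in practice
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: acme-developers
  namespace: acme
subjects:
  - kind: Group
    name: acme-devs              # group name from your IdP; illustrative
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in aggregated role; scope down further if needed
  apiGroup: rbac.authorization.k8s.io
```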
23.3 Lab-"Onboard a Tenant"¶
- Author a tenant Composition (Crossplane) or Helm chart that, given `{tenant: "acme"}`, materializes everything in §23.2.
- Onboard `acme`. Have a "tenant developer" persona deploy an app via GitOps.
- Verify isolation: from `acme`'s namespace, can you read another tenant's secrets? Pods? Logs? Each should fail.
23.4 Hardening Drill¶
- Run `kubescape` or `polaris` against the cluster. Address findings until the score is >90%.
23.5 Operations Slice¶
- Per-tenant cost attribution: label every resource with a `tenant=` label; export `kube-state-metrics` with that label to Prometheus; cost-allocate via OpenCost.
Week 24 - Defense, Documentation, and the Capstone Demo¶
24.1 Conceptual Core¶
The final week is integration and defense. Bring the capstone (whichever track) to production-defensible quality.
24.2 Final Hardening Checklist¶
- CIS benchmark green (`kube-bench`).
- All control-plane components mTLS, with cert auto-rotation tested.
- Encryption-at-rest enabled for secrets in etcd.
- Audit logging enabled; logs shipped off-cluster.
- Default-deny NetworkPolicy in every namespace.
- PodSecurity `restricted` everywhere except documented exceptions.
- Image admission requires signed images (Sigstore policy).
- Velero backups + tested cross-cluster restore.
- Chaos: drain a node, kill a control-plane node, partition the network-the cluster recovers.
- Observability: Prometheus + Grafana + Loki + Tempo (or equivalent) integrated.
- Cost attribution per tenant.
- Runbooks: node-not-ready, etcd-degraded, apiserver-OOM, namespace-stuck-terminating, pod-pending-forever.
24.3 Lab-"Defend the Cluster"¶
Schedule a 60-minute mock review. Demo:
1. The architecture diagram.
2. Provisioning (Ansible/Terraform/Crossplane).
3. Tenant onboarding from request to running app.
4. Failure injection: kill a control-plane node; show cluster recovery.
5. Observability: trace a request from ingress through the service mesh to the backend, with metrics, logs, and trace-ID correlation.
6. Backup + restore.
24.4 Operations Slice¶
- Tag the cluster manifest repo `v1.0.0`. Sign the tag with cosign. Publish a `RUNBOOK.md` that, in principle, lets a successor team rebuild the cluster from scratch.
Month 6 Deliverable¶
The capstone artifact (per CAPSTONE_PROJECTS.md), plus the aggregated kubernetes-mastery/ repo containing every prior month's deliverable.
Appendix A-Kubernetes Hardening Reference¶
Cumulative hardening checklist. By week 24 the reader's cluster-baseline/ template should encode every section.
A.1 Control Plane¶
- etcd: 3 or 5 nodes, mTLS, encryption-at-rest, snapshot+restore tested.
- kube-apiserver: encryption providers, audit logging, NodeRestriction admission, PodSecurity admission, OIDC (or trusted ServiceAccount tokens) for users.
- kube-scheduler: leader election; default + custom plugins reviewed.
- kube-controller-manager: leader election; minimum SA permissions.
- kubelet: read-only port disabled, TLS bootstrap with CSR approval, anonymous-auth false, authorization webhook.
A.2 RBAC¶
- No bindings to the `cluster-admin` ClusterRole except for break-glass.
- Per-tenant Roles, not ClusterRoles.
- Audit `system:authenticated` and `system:unauthenticated` group bindings-both should be empty.
- Use `kubectl auth can-i --as=...` to verify least privilege per persona.
A.3 Pod Security¶
- PodSecurity admission `restricted` everywhere by default.
- Exceptions documented in code (namespace labels) with justification.
- Pod-level: `runAsNonRoot`, `readOnlyRootFilesystem`, drop all capabilities, seccomp `RuntimeDefault`.
- Mutating webhook to inject defaults if the Pod spec omits them.
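Spelled out at the Pod level, those defaults look roughly like this (the image reference and user ID are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app@sha256:...   # pin by digest (placeholder)
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      resources:
        requests: {cpu: 100m, memory: 128Mi}
        limits: {memory: 256Mi}
```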
A.4 Network¶
- CNI with NetworkPolicy support (Cilium, Calico).
- Default-deny ingress + egress in every namespace.
- Allowed flows declared per workload as labeled NetworkPolicy.
- L7 policy on ingress (Cilium L7 NetworkPolicy or service mesh).
- mTLS between Services (mesh).
- Egress controls: explicit allowed CIDRs / FQDNs.
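For the FQDN flavor of egress control, one option is Cilium's own policy CRD; a minimal sketch assuming Cilium is the CNI (the selector and hostname are placeholders, and the DNS rule is required so Cilium can resolve and track the names):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-egress-allowlist
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: orders
  egress:
    # allow (and inspect) DNS so toFQDNs can map names to IPs
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: "payments.example.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
```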
A.5 Image Supply Chain¶
- Image admission (Kyverno / Cosign policy-controller) requires signature.
- Allowlisted registries.
- No `latest` tags; pin by digest in production.
A.6 Secrets¶
- etcd encryption-at-rest with rotated keys.
- External Secrets Operator (ESO) for cloud-KMS-sourced secrets.
- No secrets in env vars where possible (use volume mounts, watch for restart).
- No plaintext secrets committed to git; if secrets live in git at all, only via a `sealed-secrets`/`sops` workflow.
A.7 Multi-Tenancy¶
- One namespace per tenant; ResourceQuota + LimitRange.
- Hierarchical Namespaces or Capsule for nested tenants.
- PriorityClasses by tier; preemption tuned.
- Per-tenant cost attribution via labels + OpenCost.
A.8 Observability¶
- Audit logs shipped off-cluster (read-only on cluster).
- Container logs (Loki / cloud equivalent).
- Metrics (Prometheus + kube-state-metrics + node-exporter).
- Traces (OTel Collector + Tempo / Jaeger / cloud).
- Continuous profiling (Parca / Pyroscope) optional but recommended.
- SLO tracking per service (Pyrra / Sloth).
A.9 Backup + DR¶
- Velero scheduled backups to off-cluster storage.
- Cross-region or cross-cluster restore tested at least quarterly.
- etcd snapshot tested for catastrophic-recovery scenario.
- DR runbook with RTO + RPO documented.
A.10 The cluster-baseline/ Template¶
cluster-baseline/
bootstrap/
pki/ # CA + per-component certs (cfssl)
etcd/ # systemd unit + config
kube-apiserver/
kube-scheduler/
kube-controller-manager/
kubelet/
cni/cilium-values.yaml
service-mesh/ # istio or linkerd values
observability/
prometheus/
grafana/
loki/
tempo/
parca/
policy/
pod-security/
networkpolicy-default-deny.yaml
gatekeeper-constraints/
kyverno-policies/
sigstore-policy.yaml
tenancy/
namespace-template/ # Crossplane composition
rbac-template/
quotas-template/
velero/
schedule.yaml
locations.yaml
runbooks/
node-not-ready.md
etcd-degraded.md
apiserver-oom.md
pod-pending-forever.md
cluster-rebuild.md
RUNBOOK.md
THREAT_MODEL.md
This is the artifact every cluster you bring up after week 24 should be provisioned from.
Appendix B-Troubleshooting Reference Flows¶
Reference flows for the failure modes you will see in production.
B.1 Pod Pending Forever¶
Common causes (in observed-frequency order):
1. **No node satisfies scheduling constraints.** `Events:` shows `FailedScheduling`. Read the reason: insufficient CPU/memory, no matching nodeSelector, taints unmatched, no PV available, topology spread blocked.
2. **PVC stuck pending.** `kubectl get pvc <pvc>` - if Pending, check the StorageClass, provisioner pods, cloud-side quota.
3. **Image pull failure.** `Events:` shows `ErrImagePull`/`ImagePullBackOff`. Check registry auth, that the image tag exists, and network egress to the registry.
4. **Admission webhook rejected.** Often hidden in apiserver logs; `kubectl get events -A` may surface it.
5. **Quota exceeded.** `ResourceQuota` denied creation.
Drilldown: kubectl get events -A --sort-by=.lastTimestamp | tail -30.
B.2 Pod CrashLoopBackOff¶
Common causes:
1. App-level crash. Read the previous container's logs.
2. Liveness probe failing. The probe is killing the container. Check probe path/port; loosen initialDelaySeconds.
3. OOMKilled. kubectl describe shows Reason: OOMKilled. Increase memory limit or fix leak.
4. ConfigMap / Secret missing. Pod is mounting it; if missing, kubelet fails the start. Watch for events.
5. Init container failure. Pod won't progress; check init container logs first.
B.3 Node NotReady¶
Common causes:
1. kubelet down. systemd unit failure; check journal.
2. CNI agent down. The node has no functional networking; Cilium/Calico DaemonSet pod has crashed.
3. Disk pressure. Events: shows EvictionThresholdMet. Free space (delete old container images, journal logs).
4. PID pressure. Too many processes.
5. Out-of-resources kernel-side. Check dmesg on the node.
B.4 etcd Degraded¶
Common causes:
1. Disk full or slow. fsync latency spikes; everything else feels slow. Check etcd_disk_wal_fsync_duration_seconds.
2. Leader election thrashing. Network instability between etcd nodes; check inter-node latency.
3. Database size growth. Forgot to compact. `etcdctl compaction <rev>`; `etcdctl defrag`.
4. Quorum lost. Majority of nodes down. Restore from snapshot to a new cluster; recover.
B.5 Apiserver 5xx / Timeouts¶
Common causes:
1. etcd issues (above).
2. Webhook timeouts. Slow admission webhooks block every apply. Check webhook latency; consider failurePolicy: Ignore with caution.
3. Aggregated API down (e.g., metrics-server). kubectl top fails; downstream features (HPA) degrade.
4. Apiserver overload. Too many list/watch consumers; CPU pegged. Add replicas; review priority-and-fairness flow control.
B.6 Service Has No Endpoints¶
Common causes:
1. Selector mismatch. Service spec.selector doesn't match Pod labels. Most common.
2. Pods not ready. ReadinessProbe failing; only ready Pods join Endpoints.
3. Port mismatch. Service port name vs container port name out of sync.
4. Topology-aware routing dropping endpoints. Check service.kubernetes.io/topology-aware-hints.
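The first three causes fall out of a 30-second check (service name and namespace are placeholders):

```
$ kubectl -n shop get endpoints api                           # empty ENDPOINTS column => no ready pods matched
$ kubectl -n shop get svc api -o jsonpath='{.spec.selector}'  # what the Service selects
$ kubectl -n shop get pods --show-labels                      # what the pods actually carry
$ kubectl -n shop describe svc api                            # port names/numbers vs containerPort
```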
B.7 Namespace Stuck Terminating¶
Cause: A finalizer can't be removed because its owning controller is gone (or stuck).
Fix path (carefully-you are bypassing a safety):
kubectl get namespace <ns> -o json \
| jq '.spec.finalizers = []' \
| kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -
But also: investigate why the finalizer wouldn't clear. Often a dangling external resource the operator was waiting on.
B.8 ImagePullBackOff in a Private Registry¶
- `kubectl get secret <secret> -o yaml` - exists and well-formed?
- Pod's `spec.imagePullSecrets` references it?
- Secret type is `kubernetes.io/dockerconfigjson`?
- Decoded `.dockerconfigjson` has the right registry URL and credentials?
- From the node, can you `crictl pull` the image manually with the same creds?
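If the secret itself turns out to be malformed, recreating it is usually faster than hand-editing the JSON (registry, credentials, and names are placeholders):

```
$ kubectl -n shop create secret docker-registry regcred \
    --docker-server=registry.example.com \
    --docker-username=ci-bot \
    --docker-password="$REGISTRY_TOKEN"
$ kubectl -n shop patch serviceaccount default \
    -p '{"imagePullSecrets": [{"name": "regcred"}]}'   # or reference it via the Pod's spec.imagePullSecrets
```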
B.9 HPA Not Scaling¶
- `kubectl describe hpa <name>` - events show why.
- Metrics available? `kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"` for resource metrics; `kubectl get --raw "/apis/custom.metrics.k8s.io/..."` for custom.
- Pod requests set? HPA uses `requests` as the denominator; without them, percentage-based metrics are meaningless.
- `behavior` policies preventing fast scaling? Check `scaleUp.policies` and `stabilizationWindowSeconds`.
B.10 Mesh: 503 from Sidecar¶
(Istio specifics, but general patterns apply)
1. Service has Endpoints?
2. mTLS mode strict, but caller without sidecar? Check PeerAuthentication.
3. AuthorizationPolicy denying the call?
4. DestinationRule with circuit-breaker tripped? kubectl describe destinationrule.
5. Envoy access log: istioctl proxy-config log <pod> --level debug and re-issue.
Appendix C-Contributing to Kubernetes¶
Kubernetes is one of the largest open-source projects in the world by contributor count. The flip side: the bureaucracy is real. This appendix is the on-ramp.
C.1 Mental Model¶
The Kubernetes project is governed by SIGs (Special Interest Groups)-domain-scoped groups (sig-node, sig-network, sig-storage, sig-api-machinery, sig-scheduling, sig-cli, sig-auth, etc.). Each SIG has chairs, technical leads, regular meetings, and a Slack channel. Almost every PR maps to a SIG.
Major changes go through KEPs (Kubernetes Enhancement Proposals)-design docs in the kubernetes/enhancements repo. KEPs progress through provisional → implementable → implemented → deprecated over multiple releases.
Implications for newcomers:
1. Find the right SIG before opening a PR.
2. For non-trivial changes, write or piggyback on a KEP first.
3. The cycle time is slow. Two-week review is normal; six-week is common.
C.2 Setting Up¶
The build is heavy (it's all of Kubernetes); plan for ~10 minutes the first time.
For tests:
make test # unit
make test-integration KUBE_TEST_ARGS="-run <name>"
make test-e2e KUBE_TEST_ARGS=...
For local dev cluster: kind (uses your local Docker / containerd, spins up a multi-node cluster in containers in seconds).
C.3 Where the Easy Wins Are¶
C.3.1 Documentation¶
- `kubernetes/website` (the website repo). Docs improvements are welcome and reviewed quickly.
C.3.2 e2e flakes¶
- Flaky-test issues (the `kind/flake` label) on `kubernetes/kubernetes`. Fixing flakes is unglamorous but high-impact.
C.3.3 kubectl¶
- `staging/src/k8s.io/kubectl` - small, contained. Bug fixes and small features are tractable.
C.3.4 client-go improvements¶
- `staging/src/k8s.io/client-go` - used by every controller in the world. Improvements compound.
C.3.5 SIG-specific work¶
- Pick a SIG matching your interest. Their backlogs have `good first issue` labels.
C.3.6 Don't start here (yet)¶
- Scheduler core (high stakes; small SIG; deep changes need KEPs).
- API machinery (apiserver internals, conversion, validation).
- kubelet (touches every node; PR latency is high for safety).
- Anything touching scaling / performance critical paths.
C.4 The First-PR Workflow¶
- Find an issue. Read `CONTRIBUTING.md` and SIG-specific contribution guides. Comment `/assign` to claim.
- Branch from master.
- Implement. Run `make verify`, `make test`, and the relevant `make test-integration` subset.
- Commit with `Signed-off-by` (DCO).
- Open the PR. SIG bots will auto-assign reviewers. Use the PR template; fill in every field.
- CI cycle. Tests run on Prow. Re-run with `/test all`. Address comments.
- Approval flow: a reviewer adds `/lgtm`; an approver adds `/approve`. Both are required. The bot then merges.
- Backport (if applicable): for bugfixes, the PR may need cherry-picks to release branches. Use the cherry-pick robot or do it manually.
C.5 The KEP Process¶
For changes that:
- Add or modify the API.
- Have user-visible behavior changes.
- Affect multiple SIGs.
Process:
1. Open an issue in the relevant SIG.
2. Get at least informal agreement that the problem is real.
3. Write a KEP using the template in kubernetes/enhancements/keps/NNNN-template/.
4. Submit as a PR. The KEP itself goes through review.
5. Once implementable, you can submit code PRs referencing the KEP number.
6. KEPs target a Kubernetes release (alpha → beta → GA over multiple releases).
Time scale: months to a year for a substantial KEP.
C.6 Adjacent Targets if k/k Is Too Heavy¶
The CNCF ecosystem has many high-impact projects with smaller surface area:
| Project | Bar |
|---|---|
| kubectl plugins (krew) | Low. Author your own; submit to krew index. |
| kind | Low–Medium. Friendly maintainers. |
| kustomize | Medium. |
| Helm | Medium. Larger team. |
| ArgoCD / Flux | Medium. Active. |
| Cilium | Medium–High. |
| Crossplane | Medium. Welcoming to providers. |
| Operator SDK / Kubebuilder | Medium. |
| OpenTelemetry (collector + Operator) | Medium. |
A merged contribution to any of these is a credible Kubernetes-ecosystem signal in interviews.
C.7 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a PR open against `kubernetes/website`, a kubectl plugin, or a small fix to an ecosystem project.
- By end of capstone: that PR merged.
- 6 months post-curriculum: a substantive contribution-a kubectl feature, a new operator, a Cilium policy plugin.
Patient contributors become trusted contributors. Trusted contributors become reviewers. Reviewers become approvers. Approvers become SIG chairs. The path exists; it just takes time.
Capstone Projects-Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1-Hard Way: A Production-Grade Cluster From Scratch¶
Outcome: a multi-node Kubernetes cluster brought up on bare metal or cloud VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy, GitOps-managed workloads, and a documented operational runbook.
Functional spec¶
- 3 control-plane nodes + 3 workers (minimum). HAProxy or cloud LB in front of the apiservers.
- etcd with mTLS, encryption-at-rest, scheduled snapshots to off-cluster storage.
- CNI: Cilium with kube-proxy replacement, Hubble enabled.
- Service mesh: Istio (ambient) or Linkerd, mTLS strict between services.
- Storage: a real CSI driver (local-path for dev; OpenEBS / Longhorn / cloud CSI for "real").
- Observability: Prometheus + Grafana + Loki + Tempo + OTel Collector.
- GitOps: ArgoCD or Flux managing platform addons.
- Policy: Pod Security `restricted`, NetworkPolicy default-deny, Kyverno or Gatekeeper enforcing org rules.
- Backup: Velero scheduled, restore tested.
Non-functional spec¶
- CIS benchmark green (`kube-bench` ≥90% pass).
- Cluster rebuild from scratch in <2 hours via Ansible/Terraform.
- Zero-downtime kubelet upgrades (drain + replace pattern).
- A demo: kill a control-plane node; cluster recovers without intervention.
Acceptance¶
- Public repo with provisioning playbooks and runbooks.
- A 30-minute screencast walking the assessor through bring-up, an incident drill, and a tenant onboarding.
- A `RUNBOOK.md` covering: cluster provisioning, node addition/removal, etcd backup/restore, certificate rotation, upgrade procedure, and the top 5 incident types with remediation.
Skills exercised¶
- All months-but Months 1, 2, 6 most heavily.
Track 2-Platform: GitOps Multi-Tenant PaaS¶
Outcome: a self-service developer platform built on Kubernetes that demonstrates onboarding a new team in <30 minutes, with policy guardrails, infra-from-code, and full observability.
Functional spec¶
- Tenant model: each tenant gets a Namespace, ResourceQuota, LimitRange, RBAC bindings, default NetworkPolicy, monitoring scrape config, GitOps Application (Argo) entry-all materialized from a single tenant claim (Crossplane Composition or Helm chart).
- Self-service: developers commit a `manifest.yaml` to their repo; ArgoCD/Flux picks it up; their app deploys.
- Policy: Kyverno or OPA Gatekeeper enforces image signatures, no `latest` tags, mandatory labels, resource limits, no privileged Pods.
- Cost attribution: OpenCost emits per-tenant cost; surface in a dashboard.
- Crossplane: a `Database` claim that materializes a real cloud database (or, for demo, a chart-deployed Postgres).
Non-functional spec¶
- Tenant onboarding: from "claim PR opened" to "deployed app reachable" in <30 minutes (target: <5).
- Failure isolation: a tenant exceeding quota does not affect other tenants.
- Compliance: every running Pod can be traced back to a git commit + signature verification.
Acceptance¶
- Public repo with platform manifests + tenant-onboarding template.
- Live demo: onboard a fresh tenant; deploy a sample app; demonstrate observability + policy denial; demonstrate quota enforcement.
- A `PLATFORM.md` describing the contract between the platform team and tenants: versioning, deprecation, support, escalation.
Skills exercised¶
- Months 3 (operators / Crossplane), 5 (GitOps + IaC + autoscaling + admission), 6 (multi-tenancy).
Track 3-Operator: Production-Quality Operator From Scratch¶
Outcome: a non-trivial operator that manages a stateful application or external system, complete with backup/restore, upgrades, observability, and a thoughtful API.
Suggested scopes¶
- `elasticsearch-mini-operator`: manage Elasticsearch clusters with auto-scaling, snapshot lifecycle, index lifecycle policies.
- `postgres-mini-operator`: automatic failover (using the Postgres replication primitives), backup/restore via WAL-G to S3, point-in-time recovery.
- `saas-resource-operator`: manage external SaaS resources via Crossplane composition (e.g., a `GitHubRepo` operator complete with branch protection, secret scanning, codeowners).
Acceptance¶
- Public repo, written with `controller-runtime` + Kubebuilder.
- CRDs versioned (v1alpha1 + v1beta1 + a conversion webhook).
- Status conditions, finalizers, owner references-all idiomatic.
- Comprehensive RBAC (least-privilege, generated from kubebuilder markers).
- Mutating + validating admission webhooks.
- E2E tests (Ginkgo + envtest, plus a kind-based suite).
- Helm chart or kustomize manifests for installation.
- Observability: Prometheus metrics, structured logs (logr), OTel traces.
- Helm-test-style upgrade tests across three operator versions.
- Documentation: design rationale, API reference, examples.
Skills exercised¶
- Months 3 (operators), 4 (storage if stateful), 5 (admission), 6 (defense).
Cross-Track Requirements¶
- `cluster-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- `RUNBOOK.md`.
- Defense readiness: 60-minute walkthrough.
The track choice signals career direction: Track 1 for SRE/cluster-operator roles, Track 2 for platform-engineering roles, Track 3 for software-engineering-on-Kubernetes roles. Pick based on where you want the next interview loop.
Worked example - Week 14: a NetworkPolicy → what eBPF actually does¶
Companion to Kubernetes → Month 04 → Week 14: Cilium and eBPF. The week explains the Cilium model: CNI plugin, identities, the L3/L4/L7 policy layers, and the eBPF datapath. This page takes one Kubernetes NetworkPolicy and traces it through Cilium all the way to the eBPF program enforcing it on a packet.
You need a kind/k3s/minikube cluster with Cilium installed (cilium install from the Cilium CLI; or Helm with --set kubeProxyReplacement=true).
The policy¶
# api-deny-from-frontend.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-deny-from-frontend
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: orders
    ports:
    - port: 80
      protocol: TCP
What this says, in English: "Pods in namespace shop with label app=api will accept TCP/80 traffic only from pods labeled app=orders in the same namespace. Everything else gets dropped."
Without a NetworkPolicy controller, Kubernetes ignores this object entirely. With Cilium installed, the policy becomes a real packet-level rule. Walk through how.
Step 1 - Pod IPs and identities¶
Apply the policy and deploy three sample pods:
$ kubectl apply -f api-deny-from-frontend.yaml
$ kubectl run -n shop api --image=nginx --labels=app=api --port 80
$ kubectl run -n shop orders --image=alpine --labels=app=orders -- sh -c "while true; do sleep 60; done"
$ kubectl run -n shop frontend --image=alpine --labels=app=frontend -- sh -c "while true; do sleep 60; done"
Now look at what Cilium did:
$ kubectl exec -n kube-system ds/cilium -- cilium endpoint list -o json | jq '.[] | {id, name: .status.identity.labels, ip: .status.networking.addressing[0].ipv4}'
{ "id": 412, "name": ["k8s:app=api","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.42" }
{ "id": 1207, "name": ["k8s:app=orders","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.43" }
{ "id": 1208, "name": ["k8s:app=frontend","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.44" }
Cilium assigned each pod an endpoint ID and a security identity derived from the pod's labels. The identity is a number, not the label set itself. All pods with the same label set share an identity, which is the unit Cilium reasons about.
The key trick: traditional iptables-based CNIs do rule matching by IP, which means rules scale O(n²) with pod count. Cilium does it by identity, which scales O(unique_label_sets²) - vastly smaller in practice.
Step 2 - The policy in Cilium's view¶
$ kubectl exec -n kube-system ds/cilium -- cilium policy get
[
  {
    "endpointSelector": {"matchLabels": {"k8s:app": "api", "k8s:io.kubernetes.pod.namespace": "shop"}},
    "ingress": [
      {
        "fromEndpoints": [
          {"matchLabels": {"k8s:app": "orders", "k8s:io.kubernetes.pod.namespace": "shop"}}
        ],
        "toPorts": [{"ports": [{"port": "80", "protocol": "TCP"}]}]
      }
    ]
  }
]
Same content, Cilium's internal representation. The selectors will resolve to specific identity numbers when the policy is materialized into eBPF maps.
Step 3 - Test the policy works¶
$ kubectl exec -n shop orders -- wget -qO- --timeout=2 http://10.244.0.42:80
<!DOCTYPE html>
<html>
<head><title>Welcome to nginx!</title>
...
$ kubectl exec -n shop frontend -- wget -qO- --timeout=2 http://10.244.0.42:80
wget: download timed out
orders succeeds (allowed). frontend times out (silently dropped). Good.
But where is the drop happening?
Step 4 - Find the eBPF program¶
Cilium attaches eBPF programs at several kernel hook points: tc (traffic control) ingress/egress on every pod's veth, and on the host's external interface. List them:
$ kubectl exec -n kube-system ds/cilium -- bpftool prog show | grep cil_
1342: sched_cls name cil_from_container tag 4f...
1343: sched_cls name cil_to_container tag 8a...
1344: sched_cls name cil_from_host tag c2...
1345: sched_cls name cil_to_host tag d7...
1346: sched_cls name cil_from_netdev tag e3...
These are the BPF programs implementing the datapath. cil_from_container runs on every packet leaving a pod's veth; cil_to_container on every packet entering. The policy enforcement happens in cil_to_container.
Step 5 - The maps Cilium uses¶
eBPF programs are stateless; they read from kernel-managed maps (kv stores). Cilium maintains several:
$ kubectl exec -n kube-system ds/cilium -- bpftool map show | grep -E "cilium_"
221: hash name cilium_policy key 16B value 48B max_entries 16384
222: lru_hash name cilium_ct4 key 40B value 64B max_entries 524288
223: hash name cilium_lxc key 4B value 64B max_entries 65536
224: hash name cilium_metrics key 8B value 16B max_entries 65536
...
- `cilium_lxc` - endpoint ID → pod info (IP, MAC, security identity).
- `cilium_policy` - (endpoint_id, src_identity, port, protocol) → allow/deny. This is the lookup table the BPF program consults to decide whether a packet is allowed.
- `cilium_ct4` - connection tracking. Stores active flows for established-connection allowance.
Step 6 - The actual lookup¶
When a packet from frontend (identity 1208) reaches the host with destination api (10.244.0.42:80, endpoint 412):
1. The `cil_to_container` BPF program triggers on the veth's ingress hook.
2. The program reads the packet headers - src IP `10.244.0.44`, dst IP `10.244.0.42`, dst port `80`.
3. It looks up the dst endpoint via `cilium_lxc[10.244.0.42]` → endpoint 412.
4. It looks up the src identity via `cilium_ipcache[10.244.0.44]` → identity 1208.
5. It builds the policy key `(endpoint=412, identity=1208, port=80, proto=TCP)` and queries `cilium_policy`.
6. No matching entry → returns `DROP`.
7. It updates `cilium_metrics` (drop counter ++).
8. The `tc` framework drops the packet.
When orders (identity 1207) sends the same kind of packet, step 5 builds key (412, 1207, 80, TCP), the policy map has this entry (from the NetworkPolicy → identity match), and the program returns PASS. The packet proceeds; the connection is tracked in cilium_ct4 so the return packet is also allowed via the fast path.
Step 7 - See the drop in real time¶
$ kubectl exec -n kube-system ds/cilium -- cilium monitor -t drop
xx drop (Policy denied) flow 0xab12 to endpoint 412, identity 1208->10044, file bpf_lxc.c line 1142, 86 bytes
This is the BPF program emitting a perf event when it drops a packet. The format includes the source line of the bpf_lxc.c program that made the decision, the source/destination identities, and the byte count. Cilium's hubble (a separate component) consumes these events to provide a real-time UI.
Why this matters¶
The traditional kube-proxy + iptables path for this same policy would:
- Maintain ~O(pods^2) iptables rules per port.
- Linearly walk those rules on every packet.
- Rewrite rules on every pod create/delete, which under churn can take seconds and lose packets.
Cilium's eBPF path:
- Maintains a hash map keyed by (endpoint, identity, port, proto).
- O(1) lookup on every packet.
- Identity-based: adding a new orders pod doesn't change the policy map at all (same identity).
In a cluster with 10,000 pods, the difference is "stable 50µs latency vs unbounded tail." That's the whole pitch for Cilium.
The trap¶
A NetworkPolicy without a controller that supports it does nothing. Many K8s users apply policies on clusters where the CNI doesn't enforce them, and the cluster silently allows everything. Verify with kubectl exec between pods that shouldn't be able to reach each other, or use cilium connectivity test if you're on Cilium.
The other trap: Cilium identity granularity. Two pods with identical label sets share an identity. If you split traffic by namespace alone, every pod in the namespace has the same identity for policy purposes. Add labels (role, tier, app-version) to get finer-grained control.
Exercise¶
- Run the demo above. Confirm the drop is visible via `cilium monitor`.
- Add a second allowed source: pods labeled `app=admin`. Reapply the policy (a sketch of the amended spec follows this list). Watch `cilium policy get` change and confirm admin pods now succeed.
- (Advanced) Use `bpftool prog dump xlated id <prog-id>` on one of the `cil_*` programs. Read the BPF assembly. Find the map-lookup instructions.
- (Advanced) Read `Documentation/bpf/` in the kernel tree for the BPF instruction-set reference. Find `BPF_LDX`, `BPF_JEQ`. You'll see them in the disassembly.
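For the second exercise, the change is one more entry in the same `from` clause (shown here as the amended `spec` of `api-deny-from-frontend.yaml`; entries within one `from` list are OR'd):

```yaml
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: orders
    - podSelector:
        matchLabels:
          app: admin          # second allowed source
    ports:
    - port: 80
      protocol: TCP
```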
Related reading¶
- The main Week 14 chapter covers Cilium's architecture beyond just NetworkPolicy.
- The Linux Kernel path → Month 3 → eBPF chapters cover the verifier and the JIT.
- Glossary: eBPF, Identity, CNI, tc (traffic control) in the main glossary.