Kubernetes¶
Control plane, kubelet/CRI, controllers, networking, day-2.
Kubernetes Platform Engineering — A 24-Week Mastery Roadmap¶
Authoring lens: Principal Platform Engineer / Kubernetes Maintainer. Target outcome: A graduate of this curriculum is capable of (a) building and operating a hardened Kubernetes cluster from scratch on bare metal or any cloud, (b) extending the control plane via custom controllers/operators built with controller-runtime or client-go, and (c) contributing patches to kubernetes/kubernetes or one of its core ecosystems (Cilium, Istio, ArgoCD, Crossplane).
This is not "kubectl in 24 weeks." It assumes the reader has used Kubernetes (deployed an app, run kubectl), understands containers (see the CONTAINER_INTERNALS_PLAN curriculum if not), and is ready to read kubernetes/kubernetes source as primary literature.
What you'll learn¶
A 24-week curriculum organized into six monthly modules, plus three reference appendices and a capstone catalog at the end.
| Module | What it covers |
|---|---|
| Prelude - Philosophy and reading list | What Kubernetes is, what it isn't, the design ethics, reading list. |
| Month 1 - Control Plane (Weeks 1–4) | etcd & Raft, kube-apiserver, scheduler, controllers. |
| Month 2 - Kubelet and CRI (Weeks 5–8) | kubelet, CRI, kube-proxy, CSI, device plugins. |
| Month 3 - Controllers and Operators (Weeks 9–12) | client-go, controller-runtime, CRDs, the operator pattern. |
| Month 4 - Networking and Storage (Weeks 13–16) | CNI, Cilium/eBPF, service meshes, CSI, dynamic provisioning. |
| Month 5 - Platform and Day-2 (Weeks 17–20) | GitOps (Argo/Flux), IaC (Crossplane), HPA/VPA, admission, OPA. |
| Month 6 - Hard-Way Capstone (Weeks 21–24) | Kubernetes the Hard Way; multi-tenancy; mTLS; capstone. |
| Appendix - Hardening | CIS, Pod Security, network policy, RBAC, audit. |
| Appendix - Troubleshooting | Reference flows: pod-pending, node-notready, etcd-degraded, etc. |
| Appendix - Contributing | Contributing to k8s.io: SIGs, KEPs, first-PR playbook. |
| Capstone tracks | Three options to pick from in Month 6: hard-way bare-metal cluster, GitOps platform, operator from scratch. |
How Each Week Is Structured¶
- Conceptual Core: the why, with a mental model.
- Mechanical Detail: the how, with kubernetes/kubernetes source pointers.
- Lab: a hands-on exercise using a real cluster (kind, k3s, or hard-way).
- Hardening Drill: a security/compliance micro-task.
- Operations Slice: a Day-2 ops micro-task (monitoring, scaling, recovery).
Each week is sized for ~12–16 focused hours. Almost every lab requires a working cluster, so invest early in a smooth local cluster setup (kind or k3d for dev; a 3-node kubeadm cluster on cloud VMs for realistic ops).
Progression Strategy¶
Control Plane ──► Kubelet & CRI ──► Controllers & Operators
│ │ │
└────────┬───────┴────────────────────┘
▼
Networking & Storage
│
▼
Platform & Day-2 Ops
│
▼
Hard Way & Capstone
Prerequisites¶
- Container fluency (the `CONTAINER_INTERNALS_PLAN` weeks 1–3 minimum).
- Linux fluency (the `LINUX` curriculum weeks 9–10 minimum: namespaces & cgroups).
- Comfortable with at least one of Go, Python, or Rust at an "I can build a small CLI" level.
- A budget for cloud VMs OR hardware to run a multi-node cluster (3 small VMs is sufficient).
Capstone Tracks (pick one in Month 6)¶
- Hard Way Track: provision a multi-node Kubernetes cluster from scratch on bare metal or cloud, with mTLS, fine-grained RBAC, multi-tenancy, and a documented runbook.
- Platform Track: build a GitOps-driven platform-as-a-service: ArgoCD/Flux + Crossplane + OPA Gatekeeper + multi-tenancy + self-service. Demonstrate onboarding a new team in <30 minutes.
- Operator Track: build a non-trivial operator from scratch (e.g., a stateful database operator with backup/restore, or an operator that manages an external SaaS resource via Crossplane composition). Production quality.
Detailed track briefs are in the capstone catalog.
Get started¶
Ready to begin? Start with the Prelude — philosophy, the mental model, and the reading list — then work Month 1 forward. The labs are the unit of mastery: do them.
Prelude - What Kubernetes Actually Is¶
Sit with this document for an evening before week 1.
1. Kubernetes Is a Distributed Reconciliation Loop¶
The most clarifying way to understand Kubernetes:
Kubernetes is a distributed key-value store (etcd) wrapped in an HTTP API server, surrounded by a swarm of independent controllers, each of which watches some types of objects in the store and writes other types of objects in response, until the cluster's actual state matches the desired state.
That's it. No central brain. No orchestration engine in the traditional sense. Every component is a client of the API server. The "control plane" is an emergent property of independent controllers cooperating through a shared, transactional store.
If you internalize this, the rest of the curriculum is bookkeeping.
2. The Control Loop Is the Atom¶
Every interesting behavior in Kubernetes is some controller running this loop:
    for {
        desired := apiServer.List(myWatchedKind)
        actual := observe(realWorld)
        diff := compute(desired, actual)
        for _, action := range diff {
            actOn(action) // create / update / delete real resources
            apiServer.UpdateStatus(...)
        }
        watchOrSleep()
    }
The Deployment controller watches Deployment objects, creates/updates ReplicaSet objects. The ReplicaSet controller watches ReplicaSet objects, creates/updates Pod objects. The kubelet watches Pod objects bound to its node, talks to the CRI to actually start containers. Each controller's only knowledge of the others is the objects they share.
This is also why Kubernetes is eventually consistent: there is no central coordinator enforcing global state, just many controllers converging.
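The loop above can be made concrete with a small, self-contained sketch. It is not how any real controller is written: the "API server" and the "real world" are just in-memory maps, and the names `Action`, `diff`, and `reconcile` are hypothetical. What it does show is the property that matters, which is that a second pass over converged state produces no actions.

```go
package main

import "fmt"

// Action is one step the controller takes to close the gap.
type Action struct {
	Verb string // "create" or "delete"
	Name string
}

// diff compares desired and actual object names and returns the
// actions needed to converge. Both inputs are sets keyed by name.
func diff(desired, actual map[string]bool) []Action {
	var actions []Action
	for name := range desired {
		if !actual[name] {
			actions = append(actions, Action{"create", name})
		}
	}
	for name := range actual {
		if !desired[name] {
			actions = append(actions, Action{"delete", name})
		}
	}
	return actions
}

// reconcile applies every action to the in-memory "real world"
// and reports how many actions were needed.
func reconcile(desired, actual map[string]bool) int {
	actions := diff(desired, actual)
	for _, a := range actions {
		switch a.Verb {
		case "create":
			actual[a.Name] = true
		case "delete":
			delete(actual, a.Name)
		}
	}
	return len(actions)
}

func main() {
	desired := map[string]bool{"web-0": true, "web-1": true}
	actual := map[string]bool{"web-1": true, "old-7": true}
	fmt.Println("first pass actions:", reconcile(desired, actual))  // 2
	fmt.Println("second pass actions:", reconcile(desired, actual)) // 0
}
```

Because each pass re-derives the full diff instead of remembering what it did last time, a crashed or restarted controller simply picks up where the state is.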
3. The Five-Axis Cost Model¶
A working platform engineer reasons along five axes:
| Axis | Question to ask |
|---|---|
| Control plane | Does this load etcd? How many writes? How many list/watch consumers? |
| Scheduling | What's the resource request? Affinity? Taint/toleration? Topology spread? |
| Networking | Cluster-internal service / NodePort / LoadBalancer / Ingress / Gateway? CNI overhead? |
| Identity & isolation | What ServiceAccount? What namespace? What RBAC? What NetworkPolicy? What PodSecurity profile? |
| Day-2 ops | What does upgrade look like? Backup? Disaster recovery? Cost? |
Beginner courses teach axis 2 only.
4. The Reading List¶
Primary
- Kubernetes in Action (Lukša, 2e). The single best book.
- Programming Kubernetes (Hausenblas & Schimanski). Required for Months 3–4.
- Production Kubernetes (Rosso, et al.). The Day-2 bible.
- Cloud Native Patterns (Davis). Architectural patterns.

Source
- `kubernetes/kubernetes` - the monorepo. Particularly:
  - `cmd/kube-apiserver`, `pkg/apiserver/`, `staging/src/k8s.io/apiserver/` - API server.
  - `cmd/kube-scheduler`, `pkg/scheduler/` - scheduler.
  - `cmd/kube-controller-manager`, `pkg/controller/` - built-in controllers.
  - `pkg/kubelet/` - node agent.
  - `pkg/proxy/` - service implementations.
- `kubernetes/community/sig-*` - design docs.
- KEPs (Kubernetes Enhancement Proposals) at `kubernetes/enhancements`. The canonical record of why features exist.

Adjacent canon
- Designing Data-Intensive Applications (Kleppmann). Especially chapters on consensus and replication.
- The Raft paper. Read in week 1.
- Site Reliability Engineering (Google). The "what does Day-2 mean?" book.
5. Curriculum Philosophy¶
- Source first, blog second. When the curriculum says "study informer mechanics," open `staging/src/k8s.io/client-go/tools/cache/`. Blogs go stale; commits are dated.
- Run a real cluster. Many labs assume a multi-node setup. `kind` is fine for development; weeks 17+ assume something closer to production.
- Defaults are wrong. Kubernetes ships with permissive defaults to ease onboarding (no NetworkPolicy, no PodSecurity, broad RBAC). Production requires inverting them.
6. What Kubernetes Is Not For¶
A graduate of this curriculum should be able to argue these points:
- Single-server simple deploys. A Postgres + an app on one VM with systemd is operationally simpler than a one-node Kubernetes cluster. Don't add a control plane to host one app.
- Hard real-time / latency-critical hot paths. kube-proxy adds latency. CNI plugins add latency. The scheduler is not designed for sub-millisecond placement decisions. Use bare-metal or VM-based deployments for ultra-low-latency workloads.
- Stateful databases at scale, naively. Kubernetes can run stateful workloads with operators (Postgres operator, MongoDB operator, etc.), but doing it correctly requires a mature operator ecosystem and skilled operators. "Just put your DB in K8s" is not free.
- Teams without ops capacity. Kubernetes is not a Heroku replacement. The complexity is real. If you don't have a platform team, use Cloud Run, Fly, or a managed container service before reaching for K8s.
7. AI-Assisted Workflows¶
- Always read generated YAML. Models hallucinate field names, and depending on validation settings an unknown field can be dropped (pruned from a custom resource, or ignored when strict validation is off), so your "successful apply" may be doing nothing.
- Verify CRD generation. Tools like `controller-gen` are deterministic; let them generate, never hand-edit.
- Treat generated RBAC with extreme suspicion. Models tend toward over-broad permissions ("just give it cluster-admin"). Tighten by hand.
You are now ready for Week 1. Open the Control Plane month.
Month 1 - The Control Plane: etcd, kube-apiserver, scheduler, controllers¶
Goal: by the end of week 4 you can (a) describe etcd's Raft model and replay a leader election, (b) trace an apply through the API-server's request pipeline (auth → admission → validation → storage), (c) explain how the scheduler picks a node, and (d) read the source of one built-in controller (Deployment) end-to-end.
Weeks¶
- Week 1 - etcd and the Raft Consensus Foundation
- Week 2 - The kube-apiserver
- Week 3 - The Scheduler
- Week 4 - Built-in Controllers and `client-go` Foundations
Week 1 - etcd and the Raft Consensus Foundation¶
1.1 Conceptual Core¶
- etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads via the API server, never etcd directly.
- Raft (the consensus protocol) gives etcd: linearizable writes, fault-tolerant majority reads, and bounded recovery time after node failure.
- A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes (never even numbers); back them up; monitor them.
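The "never even numbers" advice follows from majority-quorum arithmetic, which a few lines of Go make visible. This is a minimal sketch of the arithmetic only (the function names are made up for illustration), not of Raft itself.

```go
package main

import "fmt"

// quorum returns how many members must agree before a write commits.
func quorum(members int) int { return members/2 + 1 }

// tolerated returns how many members can fail while a quorum survives.
func tolerated(members int) int { return members - quorum(members) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the extra node adds replication cost without adding fault tolerance.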
1.2 Mechanical Detail¶
- Read the Raft paper. Then read `etcd-io/raft/raft.go` end to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition.
- etcd's data model: a flat keyspace, `mvcc` revisions, watch streams. Kubernetes uses keys like `/registry/pods/<namespace>/<name>` and stores protobuf-encoded objects.
- Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream.
- Performance characteristics:
  - Write latency ≈ client → leader RTT + leader → quorum RTT + fsync.
  - Read latency = a local read on the leader after it confirms leadership with a quorum (linearizable, the default), or a purely local and possibly stale read on any member with `serializable=true`.
  - Throughput is bound by the leader's fsync rate. SSDs help dramatically.
- Compaction and defragmentation are required ops; without them etcd grows unbounded.
1.3 Lab - "etcd, Up Close"¶
- Bring up a 3-node etcd cluster locally (`etcd` binaries, no Kubernetes yet). Configure peer/client URLs.
- Use `etcdctl` to put/get keys; observe consistent reads.
- Kill the leader. Use `etcdctl endpoint status --cluster` to identify the new leader within seconds.
- Use `etcdctl watch /foo` from one terminal; put values from another. Internalize the watch model.
- Compact old revisions with `etcdctl compaction <revision>`, then run `etcdctl --command-timeout=60s defrag` to reclaim the freed space (defrag alone does not compact). Observe disk-usage drop.
1.4 Hardening Drill¶
- Configure mTLS between etcd peers and between client and etcd. Configure auth (`role`, `user`).
- Take a snapshot via `etcdctl snapshot save`. Restore to a new cluster. Verify integrity.
1.5 Operations Slice¶
- Wire etcd metrics to Prometheus: `etcd_server_has_leader`, `etcd_disk_wal_fsync_duration_seconds`, `etcd_mvcc_db_total_size_in_bytes`. Alert on absent leader, fsync p99 > 100ms, db-size approaching `quota-backend-bytes`.
Week 2 - The kube-apiserver¶
2.1 Conceptual Core¶
- The API server is the only component that talks to etcd. It is itself stateless; all persistent state lives in etcd. It exposes the REST API (JSON, YAML, and protobuf encodings), performs authn/authz/admission, and writes to etcd.
- Every Kubernetes operation (`kubectl apply`, controller reconciliation, kubelet status update) is an HTTP request to this server.
- Three middleware stages every request traverses: Authentication (who are you?), Authorization (RBAC: what can you do?), Admission (mutating + validating webhooks).
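The staged pipeline can be sketched as an ordered chain of functions where the first rejection wins. This is a toy model under stated assumptions: the `Request` type, the "only admin may delete" rule, and the `injected` label are all hypothetical, and the real apiserver's handler chain is far richer.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a toy stand-in for an API request; the fields are hypothetical.
type Request struct {
	User   string
	Verb   string
	Labels map[string]string
}

// Stage is one step of the pipeline; a non-nil error rejects the request.
type Stage func(*Request) error

func authenticate(r *Request) error {
	if r.User == "" {
		return errors.New("401 unauthenticated")
	}
	return nil
}

// authorize applies a made-up rule: only "admin" may delete.
func authorize(r *Request) error {
	if r.Verb == "delete" && r.User != "admin" {
		return errors.New("403 forbidden")
	}
	return nil
}

// mutate plays the role of mutating admission: it may change the object.
func mutate(r *Request) error {
	if r.Labels == nil {
		r.Labels = map[string]string{}
	}
	r.Labels["injected"] = "true"
	return nil
}

// validate plays the role of validating admission: accept or reject only.
func validate(r *Request) error {
	if r.Labels["injected"] != "true" {
		return errors.New("422 missing injected label")
	}
	return nil
}

// handle runs the stages in order and stops at the first rejection.
func handle(r *Request) error {
	for _, stage := range []Stage{authenticate, authorize, mutate, validate} {
		if err := stage(r); err != nil {
			return err
		}
	}
	return nil // a real server would now persist the object to etcd
}

func main() {
	fmt.Println(handle(&Request{User: "alice", Verb: "create"}))
	fmt.Println(handle(&Request{User: "alice", Verb: "delete"}))
	fmt.Println(handle(&Request{Verb: "create"}))
}
```

Ordering is the design point: mutation runs before validation so that validators see the object as it will actually be stored.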
2.2 Mechanical Detail¶
- Authentication mechanisms: x509 client certs, bearer tokens (ServiceAccount tokens, OIDC), webhook tokens. Each is an entry in the request handler chain.
- Authorization: RBAC is the dominant mode. ABAC and webhook authz exist but are rare. RBAC binds subjects (User, Group, ServiceAccount) to roles (verb + resource + namespace combinations) via `RoleBinding`/`ClusterRoleBinding`.
- Admission:
  - Mutating webhooks: can modify the object before validation (e.g., inject sidecars).
  - Validating webhooks: can only accept or reject.
  - Built-in admission controllers: `LimitRanger`, `ResourceQuota`, `ServiceAccount`, `PodSecurity`, `NodeRestriction`, etc. Read `plugin/pkg/admission/` in the k8s tree.
- Aggregated API server: third parties can register their own API surface (e.g., metrics-server, Knative). The main apiserver proxies requests to them.
- Storage: every object has a "storage version" in etcd. The server converts between API versions on read/write. This is what allows v1beta1 → v1 migrations.
2.3 Lab - "Read the Pipeline"¶
- Use `kubectl --v=8` to dump the wire-level request/response of a `kubectl apply`. Read it carefully.
- Use `kubectl get --raw` to hit `/apis/`, `/api/v1`, `/apis/apps/v1` and see the discovery surface.
- Configure the apiserver to log all requests with `--audit-policy-file=audit.yaml` (set `--audit-log-path` as well, or no log file is written). Apply a few changes; read the audit log.
- Write a tiny mutating webhook (in Go, using `controller-runtime`'s webhook facilities) that adds a label to every Pod. Deploy and verify.
2.4 Hardening Drill¶
- Audit policy template: log `Metadata` for every request, including `secrets`/`configmaps` (the `Request` level would copy their payloads into the audit log), and `RequestResponse` for `roles`/`rolebindings`/`clusterroles`/`clusterrolebindings`. Ship logs off-cluster.
2.5 Operations Slice¶
- Wire apiserver metrics: `apiserver_request_total`, `apiserver_request_duration_seconds`, `apiserver_storage_objects`. Alert on per-resource latency p99 spikes.
Week 3 - The Scheduler¶
3.1 Conceptual Core¶
- The default scheduler is a single-active controller (replicas coordinate through leader election) that watches unscheduled Pods and binds them to Nodes. The "binding" is just a write to the Pod's `spec.nodeName` field, made through the Pod's binding subresource.
- The scheduler's algorithm is filter then score: filter Nodes that can't host the Pod (resources, affinities, taints), score the remaining ones, pick the highest-scoring.
- The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking.
3.2 Mechanical Detail¶
- Scheduler framework extension points (read `pkg/scheduler/framework/types.go`):
  - PreFilter: short-circuit conditions.
  - Filter: must return Success for the Node to be eligible.
  - PostFilter: invoked when no Node passes filter (e.g., to trigger preemption).
  - PreScore, Score, NormalizeScore.
  - Reserve, Permit, PreBind, Bind, PostBind.
- Built-in plugins: `NodeResourcesFit`, `NodeAffinity`, `PodTopologySpread`, `InterPodAffinity`, `TaintToleration`, `NodePorts`, `VolumeBinding`, `ImageLocality`, `NodeResourcesBalancedAllocation`.
- Scheduling profiles: multiple "scheduler personalities" can run in one binary, each with a different plugin config. Used for batch workloads with different priorities.
- Preemption: when a high-priority Pod can't fit, the scheduler may preempt (evict) lower-priority Pods. `PriorityClass`, referenced from the Pod via `priorityClassName`, is the knob.
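Filter-then-score is easy to see in miniature. The sketch below is a toy under stated assumptions: the `Node` and `Pod` fields, the single CPU resource, and the "most free CPU after placement" score are hypothetical stand-ins and do not reproduce any built-in plugin.

```go
package main

import "fmt"

// Node and Pod carry only the fields this toy needs.
type Node struct {
	Name    string
	FreeCPU int // millicores still unallocated
	Tainted bool
}

type Pod struct {
	CPURequest int // millicores
	Tolerates  bool
}

// filter keeps only the nodes that can host the pod at all.
func filter(p Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU < p.CPURequest {
			continue // does not fit
		}
		if n.Tainted && !p.Tolerates {
			continue // taint not tolerated
		}
		feasible = append(feasible, n)
	}
	return feasible
}

// score prefers the node with the most CPU left after placement.
func score(p Pod, n Node) int { return n.FreeCPU - p.CPURequest }

// schedule returns the chosen node name, or false if nothing passes the filters.
func schedule(p Pod, nodes []Node) (string, bool) {
	feasible := filter(p, nodes)
	if len(feasible) == 0 {
		return "", false // a real scheduler would run PostFilter here
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(p, n) > score(p, best) {
			best = n
		}
	}
	return best.Name, true
}

func main() {
	nodes := []Node{
		{Name: "a", FreeCPU: 500},
		{Name: "b", FreeCPU: 4000, Tainted: true},
		{Name: "c", FreeCPU: 2000},
	}
	fmt.Println(schedule(Pod{CPURequest: 1000}, nodes))
	fmt.Println(schedule(Pod{CPURequest: 1000, Tolerates: true}, nodes))
	fmt.Println(schedule(Pod{CPURequest: 8000}, nodes))
}
```

Filters are hard constraints and scores are preferences, which is why a toleration changes the answer from `c` to `b` here: it widens the feasible set before scoring ever runs.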
3.3 Lab - "Scheduler in Action"¶
- Use `kubectl describe` on a pending Pod to see filter/score reasons.
- Set a Node taint (`kubectl taint nodes node1 key=value:NoSchedule`); observe new Pods avoid it.
- Define `PriorityClass`es (`high`, `default`, `batch`); deploy mixed-priority Pods; trigger preemption by oversaturating.
- Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Configure your scheduler binary; run it. Verify selection difference vs default.
3.4 Hardening Drill¶
- Set `priorityClassName` on system-critical Pods (CSI driver, ingress controller). Use `system-cluster-critical` and `system-node-critical` for cluster-internal pods.
3.5 Operations Slice¶
- Wire scheduler metrics: `scheduler_pending_pods`, `scheduler_pod_scheduling_duration_seconds`, `scheduler_pod_scheduling_attempts`. Alert on persistent pending Pods.
Week 4 - Built-in Controllers and client-go Foundations¶
4.1 Conceptual Core¶
- The kube-controller-manager is a single binary running ~30 built-in controllers. Each is a goroutine running the reconciliation loop pattern.
- The Deployment controller is the best worked example: watches `Deployment` objects, creates/updates `ReplicaSet` objects, drives rolling-update progression.
- The patterns the built-in controllers establish (informers, work queues, structured logging, leader election) are the templates you'll use when building custom controllers.
4.2 Mechanical Detail¶
- Informers (`staging/src/k8s.io/client-go/tools/cache/`):
  - A shared in-memory cache populated by a single watch stream per resource type.
  - Event handlers: `OnAdd`, `OnUpdate`, `OnDelete`.
  - Cache provides O(1) lookup by namespace/name.
  - Multiple controllers in one process share informers via `SharedInformerFactory`.
- Work queues (`client-go/util/workqueue`): rate-limited, deduplicated, item-keyed queues. Reconcile functions pull a key, list-from-cache, act, requeue on error.
- The Deployment controller flow (`pkg/controller/deployment/`):
  - Informer detects a Deployment change.
  - Reconciler computes the desired ReplicaSet count and per-RS replica counts based on strategy (rolling vs recreate).
  - Creates new RS / scales old RS / scales new RS.
  - Updates Deployment status with progress.
4.3 Lab - "Read the Deployment Controller"¶
- Read `pkg/controller/deployment/deployment_controller.go` end-to-end (~1500 lines).
- Trace a `kubectl rollout` through the source: which conditions are checked, which fields updated, what triggers the next loop iteration.
- Reproduce a stuck-rollout scenario (deploy a bad image); observe `Progressing=False` after the deadline; inspect status conditions.
- Manually scale a Deployment to 0 with `kubectl scale`; trace what the controller does in response.
4.4 Hardening Drill¶
- Set sensible Deployment defaults in your platform: `progressDeadlineSeconds`, `revisionHistoryLimit`, rolling-update `maxSurge`/`maxUnavailable` for production workloads.
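To choose `maxSurge`/`maxUnavailable` deliberately, it helps to compute what they allow for a given replica count. The sketch below assumes percentage values and relies on the documented rounding behaviour (surge rounds up, unavailable rounds down); the function names are invented for illustration.

```go
package main

import "fmt"

// surge resolves a maxSurge percentage to a pod count, rounding up.
func surge(replicas, percent int) int { return (replicas*percent + 99) / 100 }

// unavailable resolves a maxUnavailable percentage to a pod count, rounding down.
func unavailable(replicas, percent int) int { return replicas * percent / 100 }

// bounds returns the most pods that may exist at once and the fewest
// that must stay available while a rolling update is in progress.
func bounds(replicas, surgePct, unavailPct int) (maxTotal, minAvailable int) {
	return replicas + surge(replicas, surgePct),
		replicas - unavailable(replicas, unavailPct)
}

func main() {
	for _, r := range []int{4, 10} {
		maxTotal, minAvail := bounds(r, 25, 25)
		fmt.Printf("replicas=%d maxTotal=%d minAvailable=%d\n", r, maxTotal, minAvail)
	}
}
```

With 10 replicas and 25%/25%, a rollout may run up to 13 pods and must keep at least 8 available, which is the capacity headroom the cluster needs during the update.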
4.5 Operations Slice¶
- Wire workqueue metrics: `workqueue_adds_total`, `workqueue_depth`, `workqueue_queue_duration_seconds`. Alert on persistent depth or processing latency.
Month 1 Capstone Deliverable¶
A `control-plane/` workspace:
1. `etcd-cluster/` - week 1's 3-node cluster + backup/restore script.
2. `audit-pipeline/` - week 2's audit-log shipping + sample queries.
3. `custom-scheduler-plugin/` - week 3's scheduler plugin + deployment.
4. `controller-walkthrough.md` - week 4's annotated tour of the Deployment controller.
Month 2 - The Node: kubelet, CRI, kube-proxy, CSI, Devices¶
Goal: by the end of week 8 you can (a) describe how the kubelet maintains the desired Pod state on a node, (b) trace a service request from client to backing Pod through kube-proxy or eBPF dataplane, (c) explain CSI volume lifecycle (provision → attach → mount), and (d) write a basic device plugin or CSI driver shim.
Weeks¶
- Week 5 - Kubelet Internals
- Week 6 - CRI: kubelet ↔ Runtime
- Week 7 - kube-proxy, Services, and the Networking Dataplane
- Week 8 - CSI, Storage, and Device Plugins
Week 5 - Kubelet Internals¶
5.1 Conceptual Core¶
- The kubelet is the per-node agent. Its job: watch Pods bound to this node, drive the CRI to make the actual containers match. Plus: report node status, manage volumes, run health checks, evict on resource pressure.
- The kubelet also runs a PLEG (Pod Lifecycle Event Generator) that polls the runtime to detect actual container state changes, which is necessary because container exits aren't always pushed as events.
- Kubelet is the component most often blamed for "weird" Kubernetes behavior; understanding it is non-optional.
5.2 Mechanical Detail¶
- Read `pkg/kubelet/kubelet.go`. Major loops:
  - `syncLoop` - the main reconciliation loop.
  - `PLEG` - pod-lifecycle event generation.
  - `volumeManager` - volume mount/unmount.
  - `statusManager` - Pod status updates back to apiserver.
  - `evictionManager` - resource-pressure eviction.
- Static pods: Pods defined as YAML files on disk (`/etc/kubernetes/manifests/`). Kubelet runs them directly without an apiserver. This is how control-plane pods bootstrap themselves.
- Pod lifecycle phases: `Pending` → `Running` → `Succeeded`/`Failed`, with container states `Waiting`/`Running`/`Terminated`.
- Pod resource enforcement: kubelet sets cgroups based on `requests`/`limits`. With `cpu-manager-policy=static`, the kubelet pins exclusive CPUs to Guaranteed-class Pods that request whole CPUs. Same idea for `memory-manager-policy` and `topology-manager-policy`.
- Eviction: when a node runs low on memory, disk, or PID space, the kubelet evicts Pods in priority order. Soft vs hard thresholds.
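"Guaranteed-class" refers to the Pod's QoS class, which is derived from requests and limits. The sketch below is a simplified model of that derivation: the `Resources` struct is hypothetical, a zero value means "unset", and it assumes API defaulting (requests copied from limits when only limits are given) has already happened.

```go
package main

import "fmt"

// Resources holds one container's CPU (millicores) and memory (bytes).
// Zero means "not set" in this toy model.
type Resources struct {
	CPURequest, CPULimit int64
	MemRequest, MemLimit int64
}

// qosClass derives a pod's QoS class from its containers:
//   - BestEffort: no container sets any request or limit
//   - Guaranteed: every container sets CPU and memory limits equal to requests
//   - Burstable:  everything in between
func qosClass(containers []Resources) string {
	allUnset, guaranteed := true, true
	for _, c := range containers {
		if c.CPURequest != 0 || c.CPULimit != 0 || c.MemRequest != 0 || c.MemLimit != 0 {
			allUnset = false
		}
		if c.CPULimit == 0 || c.MemLimit == 0 ||
			c.CPURequest != c.CPULimit || c.MemRequest != c.MemLimit {
			guaranteed = false
		}
	}
	switch {
	case allUnset:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass([]Resources{{}}))
	fmt.Println(qosClass([]Resources{{CPURequest: 500, CPULimit: 1000, MemRequest: 1 << 28, MemLimit: 1 << 28}}))
	fmt.Println(qosClass([]Resources{{CPURequest: 2000, CPULimit: 2000, MemRequest: 1 << 30, MemLimit: 1 << 30}}))
}
```

The class matters twice in this week's material: it gates exclusive CPU pinning under the static CPU manager policy, and it influences which Pods are evicted first under node pressure.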
5.3 Lab - "Kubelet Forensics"¶
- SSH to a node. `journalctl -u kubelet -f` and trigger a Pod creation. Watch the log.
- `crictl ps`, `crictl pods`, `crictl inspect` - operate at the CRI layer directly.
- Place a static pod manifest; observe kubelet picking it up.
- Trigger a memory eviction by setting low `evictionHard` and oversubscribing. Read the eviction event and the kubelet's decision.
5.4 Hardening Drill¶
- Set kubelet args: `--read-only-port=0`, `--anonymous-auth=false`, `--authorization-mode=Webhook`, `--protect-kernel-defaults=true`, `--make-iptables-util-chains=true`, `--tls-min-version=VersionTLS12`.
5.5 Operations Slice¶
- Wire kubelet metrics: `kubelet_pod_start_duration_seconds`, `kubelet_running_pods`, `kubelet_volume_stats_used_bytes`. Alert on slow Pod starts.
Week 6 - CRI: kubelet ↔ Runtime¶
6.1 Conceptual Core¶
- The kubelet does not run containers itself; it talks gRPC to a CRI implementation (containerd, CRI-O). The CRI provides RuntimeService (containers + sandboxes) and ImageService (pull/list/remove).
- Every Pod is a sandbox (a network namespace + the "pause" container) plus N containers sharing it.
6.2 Mechanical Detail¶
- The CRI proto: `cri-api/pkg/apis/runtime/v1/api.proto`. The most relevant calls: `RunPodSandbox`, `CreateContainer`, `StartContainer`, `StopContainer`, `RemovePodSandbox`, `Exec`, `Attach`, `PortForward`, `PullImage`.
- The pause container is a tiny binary (it just calls `pause(2)`); it holds open the network namespace so the application containers can come and go.
- `crictl` is the CLI for talking to a CRI directly: `crictl ps`, `crictl inspect`, `crictl exec`. Unlike `kubectl`, it talks straight to the node's container runtime, bypassing the API server.
- Runtime classes: `RuntimeClass` objects bind a name (`gvisor`, `kata`) to a runtime handler. Pods reference one via `spec.runtimeClassName`.
6.3 Lab - "CRI Direct"¶
- From a node, `crictl pull alpine; crictl runp pod-config.json; crictl create <pod-id> ctr-config.json pod-config.json; crictl start <ctr-id>`. You've launched a pod-equivalent without the apiserver.
- Compare with kubectl deploying the same: trace each CRI call in the kubelet log.
- Configure containerd with multiple runtimes (runc + runsc); register both as `RuntimeClass`es; deploy Pods against each.
6.4 Hardening Drill¶
- Configure containerd to default to a non-root user, drop default capabilities, apply default seccomp. The same hardening from the Container curriculum, applied at the daemon level.
6.5 Operations Slice¶
- Monitor `container_runtime_*` metrics from cAdvisor (built into kubelet). Alert on container-restart rate spikes.
Week 7 - kube-proxy, Services, and the Networking Dataplane¶
7.1 Conceptual Core¶
- A Service is a stable virtual IP and port that load-balances across a set of Pods. It is implemented at L4 by kube-proxy or, in modern eBPF-based clusters, by the CNI directly (Cilium can replace kube-proxy entirely).
- Modes:
  - iptables (default): kube-proxy programs iptables DNAT rules. O(N) match per packet; degrades with many Services.
  - IPVS: kube-proxy programs the kernel IPVS load balancer. O(1) lookup; better for >1000 services.
  - eBPF (Cilium): bypasses iptables entirely; programs are attached at the socket layer (`bpf_sock_ops`) and at the egress point. Lowest overhead.
7.2 Mechanical Detail¶
- EndpointSlices replaced Endpoints in 1.21+: split per-Service endpoint lists into multiple objects to scale beyond ~1000 endpoints per Service.
- Service types: `ClusterIP` (default, internal), `NodePort` (open a port on every node), `LoadBalancer` (cloud LB integration), `ExternalName` (DNS CNAME).
- Headless Services (`clusterIP: None`): no virtual IP; DNS returns Pod IPs directly. Used by StatefulSets.
- Topology-aware routing: prefer endpoints in the same zone (the feature has carried this name since 1.27; check its maturity level for your cluster version). Saves cross-zone egress costs.
- Service IPs are virtual: no NIC has them; they live only in iptables/IPVS/eBPF rules.
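Since the virtual IP exists only as rules, backend selection has to be expressed as rules too. In iptables mode this is commonly done with a chain of random-probability matches, one per endpoint, where rule i fires with probability 1/(n-i). The sketch below only checks the arithmetic of that scheme (function names are hypothetical); it does not generate or read real iptables rules.

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability for a chain
// of n endpoint rules evaluated top to bottom: 1/n, 1/(n-1), ..., 1.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := range probs {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

// effectiveShare returns the fraction of traffic each rule actually
// receives, given that a rule is only reached when earlier ones missed.
func effectiveShare(probs []float64) []float64 {
	shares := make([]float64, len(probs))
	remaining := 1.0
	for i, p := range probs {
		shares[i] = remaining * p
		remaining *= 1 - p
	}
	return shares
}

func main() {
	probs := ruleProbabilities(3)
	fmt.Printf("rule probabilities: %.3f\n", probs)
	fmt.Printf("effective shares:   %.3f\n", effectiveShare(probs))
}
```

The increasing per-rule probabilities cancel out the shrinking chance of reaching each rule, so every endpoint ends up with an equal 1/n share. It also shows why the match is O(N): a packet may walk the whole chain.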
7.3 Lab - "Service Path"¶
- Create a Service + Deployment. From a Pod, `curl <service>.<ns>.svc.cluster.local`. Trace the DNS lookup (CoreDNS) and the iptables/IPVS rules that DNAT.
- Switch kube-proxy to IPVS mode (`mode: ipvs` in kube-proxy config). Verify with `ipvsadm -L -n`.
- Install Cilium with `kubeProxyReplacement=true`. Observe kube-proxy not running. Verify Service connectivity still works.
- Compare per-packet latency under each mode with a small benchmark.
7.4 Hardening Drill¶
- Enable topology-aware routing to keep traffic in zone. Apply NetworkPolicies (next month) that allow only intended traffic.
7.5 Operations Slice¶
- Wire `kubeproxy_sync_proxy_rules_duration_seconds`. With many Services and iptables mode, this can take seconds, which is a known scale ceiling.
Week 8 - CSI, Storage, and Device Plugins¶
8.1 Conceptual Core¶
- CSI (Container Storage Interface) is the standard plugin interface for storage. Every cloud and many on-prem systems ship a CSI driver. Kubernetes calls the driver via gRPC.
- A CSI driver runs in two modes (or both):
  - Controller plugin: provision, delete, attach, detach, snapshot. Cluster-wide.
  - Node plugin: stage and publish (mount) the volume on the kubelet node.
- PVC → PV → CSI flow: user creates a PVC; the external-provisioner sidecar sees it and calls the CSI controller's `CreateVolume`, then creates a PV bound to the PVC. Kubelet then asks the CSI node plugin to mount.
8.2 Mechanical Detail¶
- StorageClass parameters: `provisioner` (CSI driver name), `parameters` (driver-specific), `reclaimPolicy` (Delete vs Retain), `volumeBindingMode` (Immediate vs WaitForFirstConsumer), `allowVolumeExpansion`.
- WaitForFirstConsumer is critical for zone-aware provisioning: wait until the Pod is scheduled to know which zone to provision in.
- Snapshots: `VolumeSnapshot` API; the external-snapshotter sidecar drives the CSI driver's snapshot calls.
- Device plugins: a separate gRPC API (`pluginapi.proto`) for exposing custom resources (GPUs, FPGAs, RDMA NICs) to Pods. NVIDIA's `k8s-device-plugin` is the canonical example.
8.3 Lab - "Storage Hands-On"¶
- Install a local-path CSI driver (`rancher/local-path-provisioner` works for kind). Create a PVC; observe binding.
- Take a snapshot; restore to a new PVC.
- Author a mock device plugin that exposes 4 instances of a fake resource. Deploy a Pod requesting it; verify scheduling and resource accounting.
- Read the CSI proto. Diagram the provision + attach + mount flow on paper.
8.4 Hardening Drill¶
- Use `volumeBindingMode: WaitForFirstConsumer` for all multi-zone clusters. Without it, you'll provision a volume in zone A and try to schedule its Pod in zone B.
8.5 Operations Slice¶
- Monitor `csi_*` metrics emitted by sidecars. Alert on provision/attach errors and slow `Mount` operations.
Month 2 Capstone Deliverable¶
A `node-and-cri/` workspace:
1. `kubelet-tour/` - week 5's annotated journal-log walkthrough.
2. `cri-direct/` - week 6's `crictl`-based pod-launch demo.
3. `dataplane-bench/` - week 7's iptables vs IPVS vs Cilium-eBPF comparison.
4. `mock-device-plugin/` - week 8's working device plugin.
Month 3 - Controllers and Operators: client-go, controller-runtime, CRDs¶
Goal: by the end of week 12 you can (a) build a controller from scratch with client-go and informers, (b) build a more sophisticated controller with controller-runtime (including webhooks, finalizers, status conditions), (c) define and version CRDs idiomatically, and (d) ship a non-trivial operator that manages an external system.
Weeks¶
- Week 9 - `client-go` Internals and a Bare Controller
- Week 10 - `controller-runtime` and Kubebuilder
- Week 11 - CRDs: Schema, Versioning, Validation
- Week 12 - Operator Patterns: Finalizers, External Resources, Multi-Cluster
Week 9 - client-go Internals and a Bare Controller¶
9.1 Conceptual Core¶
- `client-go` is the Kubernetes Go client library: typed clients, informers, work queues, leader election, the lot.
- Building a controller "from scratch" in `client-go` is verbose but instructive, because every other framework hides these primitives.
- The pattern (the informer + workqueue pattern):
  - Create a `SharedInformerFactory` for the resources you watch.
  - For each kind, register `OnAdd`/`OnUpdate`/`OnDelete` handlers that compute a key (`namespace/name`) and `Add` it to a `RateLimitingQueue`.
  - Start N workers that pull keys from the queue and run `reconcile(key)`.
  - `reconcile`: list-from-cache (never call apiserver in the hot path), compute diff, apply changes, requeue on error with backoff.
9.2 Mechanical Detail¶
- The informer's resync period: re-deliver every cached object every N, a period you set when creating the informer factory. Belt-and-suspenders against missed events.
- Indexers (`cache.Indexer`): O(1) lookup by namespace, by label, by custom key. Free with the informer.
- Listers (`<group>/<version>/<resource>/lister.go` in generated client code): typed accessors over the indexer.
- Leader election (`tools/leaderelection`): only one replica of the controller acts; others stand by. Uses a `Lease` resource as the lock.
- Generated clients: for built-in types, `client-go` ships them. For your own CRDs, generate with `controller-gen` or `kubebuilder` (week 10).
9.3 Lab - "Controller From Scratch"¶
Build a controller that watches ConfigMaps with the label `mirror=true` and copies them into every namespace whose name matches a configurable prefix.
- Use `client-go` informers + workqueue directly.
- Add leader election.
- Idempotent: same input twice produces same result.
- Handle deletions: when the source is deleted, delete all mirrors.
- Run as a Deployment in the cluster.
9.4 Hardening Drill¶
- Define a minimum RBAC: only `get`/`list`/`watch` on `configmaps` and `namespaces`, plus `create`/`update`/`delete` on `configmaps`. RBAC cannot constrain by namespace prefix, so enforce that with admission webhooks or namespace selectors.
9.5 Operations Slice¶
- Expose `controller_runtime_*`-style metrics: queue depth, work duration, error rate. Add `/healthz` and `/readyz` endpoints. Run with a `/livez` probe.
Week 10 - controller-runtime and Kubebuilder¶
10.1 Conceptual Core¶
- `controller-runtime` is the modern, opinionated framework for controllers. Built atop `client-go`, it provides:
  - `Manager` (informer factory + leader election + metrics + healthz wired together).
  - `Reconciler` (typed reconcile method).
  - `Client` (cached read, direct write).
  - Webhook scaffolding (mutating + validating + conversion).
  - Finalizer helpers.
- Kubebuilder is a CLI on top of `controller-runtime` that scaffolds projects from CRD definitions. The de facto starting point for new operators.
10.2 Mechanical Detail¶
- Project structure: scaffolded by `kubebuilder init && kubebuilder create api`.
- The `Reconcile` method is the hot path; it should be idempotent and make no assumption about why it was called. Re-derive everything each call.
- `controllerutil.CreateOrUpdate`: the reliable upsert helper.
- Owner references: when a controller creates a child object, it sets the parent as the owner. Garbage collection handles cascading deletion.
- Finalizers: string keys on `metadata.finalizers`. Block deletion until the controller removes the finalizer (after performing cleanup). The pattern for cleaning up external resources before the K8s object disappears.
- Status subresource: separates spec writes from status writes; allows least-privilege RBAC.
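The finalizer flow described above can be sketched as a single level-triggered pass. This is a Python model of the pattern, not controller-runtime code; the finalizer key `example.com/cleanup` and the `external` set (standing in for resources outside the cluster) are assumptions made for illustration.

```python
FINALIZER = "example.com/cleanup"  # hypothetical finalizer key


def reconcile(obj, external):
    """One level-triggered pass over `obj` (a dict shaped like a Kubernetes object).

    `external` is a set standing in for resources that live outside the cluster.
    Returns the action taken, for illustration only.
    """
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    key = meta["name"]

    if meta.get("deletionTimestamp"):
        if FINALIZER in finalizers:
            external.discard(key)         # cleanup must tolerate "already gone"
            finalizers.remove(FINALIZER)  # the apiserver can now finish the delete
            return "cleaned-up"
        return "nothing-to-do"

    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)      # register before creating anything external
        return "finalizer-added"

    if key not in external:
        external.add(key)
        return "created"
    return "in-sync"
```

Each call re-derives its decision from the object and the observed external state, so calling it again after any step is safe.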
10.3 Lab-"Rebuild Week 9 in controller-runtime"¶
Take week 9's mirror controller; rebuild with kubebuilder + controller-runtime. Compare LOC and verbosity. The framework should save substantial code.
10.4 Hardening Drill¶
- Use `controller-runtime`'s metric and health endpoints. Configure leader election with a non-default lease duration appropriate to your environment.
10.5 Operations Slice¶
- Wire `controller_runtime_reconcile_*` metrics. Establish dashboards: reconcile rate, error rate, average reconcile duration per controller.
Week 11 - CRDs: Schema, Versioning, Validation¶
11.1 Conceptual Core¶
- A CRD (CustomResourceDefinition) registers a new resource kind with the apiserver. Once registered, you can `kubectl get/apply` it like any built-in.
- The CRD includes an OpenAPI v3 schema that the apiserver uses for validation. Get this right or you'll ship buggy custom resources.
- Multiple versions can coexist; conversion webhooks translate between them. The pattern that allows v1alpha1 → v1beta1 → v1 evolution.
11.2 Mechanical Detail¶
- Marker comments (`+kubebuilder:...`) on Go types generate the CRD YAML via `controller-gen`. Examples:
  - `+kubebuilder:validation:Required`
  - `+kubebuilder:validation:MinLength=3`
  - `+kubebuilder:validation:Enum=foo;bar;baz`
  - `+kubebuilder:subresource:status`
  - `` +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase` ``
- Status conditions: array of `{type, status, reason, message, lastTransitionTime}`. The standard pattern for surfacing controller state. Use the Kubernetes types directly (`metav1.Condition`).
- Server-side apply: with SSA, multiple controllers can own different fields of the same object via `fieldManager`. Replaces hand-rolled patches for many use cases.
- Conversion webhooks: invoked when apiserver needs to translate between stored and requested versions. Implement carefully: round-trip stability is essential.
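Round-trip stability is easiest to see with a pure field-rename conversion. The sketch below is a Python model, not a webhook implementation; the group `example.com`, the kind, and the renamed fields are all hypothetical.

```python
import copy

GROUP = "example.com"  # hypothetical API group

# v1alpha1 field name -> v1beta1 field name (hypothetical rename)
RENAMES = {"replicas": "size", "image": "containerImage"}


def _convert(obj, to_version, mapping):
    """Copy `obj`, set the target apiVersion, and rename spec fields per `mapping`."""
    out = copy.deepcopy(obj)
    out["apiVersion"] = f"{GROUP}/{to_version}"
    out["spec"] = {mapping.get(k, k): v for k, v in obj["spec"].items()}
    return out


def to_v1beta1(obj):
    return _convert(obj, "v1beta1", RENAMES)


def to_v1alpha1(obj):
    return _convert(obj, "v1alpha1", {new: old for old, new in RENAMES.items()})
```

A conversion is round-trip stable when `to_v1alpha1(to_v1beta1(x)) == x` and the reverse also holds. Renames are the easy case; fields that exist in only one version usually need to be preserved some other way (for example in an annotation) to keep this property.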
11.3 Lab-"A Well-Versioned CRD"¶
- Define a CRD with v1alpha1.
- Add validation, defaults, status conditions, printer columns.
- Add a v1beta1 with renamed fields and a conversion webhook between them.
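- Verify round-trip: read the same object at both versions (for example `kubectl get <resource>.v1alpha1.<group> <name> -o yaml`, then the `v1beta1` form) and confirm that no field value is lost or changed in either direction.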
11.4 Hardening Drill¶
- Validation only at the schema level is not enough. Add admission webhooks for cross-field validation (e.g., "if mode=X then field Y is required").
11.5 Operations Slice¶
- Track `apiserver_storage_objects` per CRD. CRDs that grow unbounded are a frequent platform failure mode.
Week 12 - Operator Patterns: Finalizers, External Resources, Multi-Cluster¶
12.1 Conceptual Core¶
- The "operator" pattern: a controller that encapsulates operational knowledge for a specific application. Examples: Postgres operator (provisions DBs, handles backups, failover), Cert-Manager (ACME-driven cert lifecycle), Prometheus operator (manages Prometheus + Alertmanager + ServiceMonitor stack).
- An operator is a controller plus one or more CRDs representing the application's domain concepts.
- Production operators handle: leader election, finalizers, status conditions, observability, RBAC, upgrades, multi-tenant isolation, external-system reconciliation, retries with backoff.
12.2 Mechanical Detail¶
- External resources (cloud APIs, SaaS): the controller's reconcile loop calls outward. Idempotency is essential: assume your reconcile may run multiple times before the external API confirms.
- Crossplane (week 18) generalizes this: every external resource is itself a Kubernetes object backed by a controller that talks to the cloud. You compose them.
- Cluster-scoped vs namespace-scoped operators: namespace-scoped is safer (lower blast radius) but limits multi-tenant operator deployment.
- Operator SDK vs Kubebuilder: largely converged today; pick whichever your team prefers. The patterns are identical.
12.3 Lab-"An Operator That Manages an External Resource"¶
Build an operator with a GitHubRepo CRD: spec includes a repo name and visibility; the controller calls the GitHub API to create/update/delete the repo to match. Includes:
- Authentication via a Secret referenced by the CR.
- Finalizers for cleanup.
- Status conditions: Ready, Synced, Error with reasons.
- Rate-limited reconciles with exponential backoff.
- E2E test using a fake GitHub API server.
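The "rate-limited reconciles with exponential backoff" item can be sketched as a per-key failure counter. This is a Python illustration of the idea; the base delay and cap are arbitrary values chosen for the example, and `FailureTracker` is a hypothetical name.

```python
def backoff_delay(failures, base=1.0, cap=300.0):
    """Seconds to wait before retrying a key that has failed `failures` times in a row.

    Doubles from `base` per consecutive failure and saturates at `cap`.
    The defaults are illustrative only.
    """
    if failures < 0:
        raise ValueError("failures must be >= 0")
    return min(cap, base * (2 ** failures))


class FailureTracker:
    """Per-key failure counts; forget a key once its reconcile succeeds."""

    def __init__(self):
        self._failures = {}

    def when(self, key):
        """Return the delay for the next retry of `key` and record the failure."""
        count = self._failures.get(key, 0)
        self._failures[key] = count + 1
        return backoff_delay(count)

    def forget(self, key):
        self._failures.pop(key, None)
```

Keying the counter by object means one repo stuck behind a GitHub rate limit backs off on its own without delaying reconciles for the others.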
12.4 Hardening Drill¶
- Define an OPA/Kyverno policy: every `GitHubRepo` must reference a Secret in the same namespace; cross-namespace references are denied. Write tests for the policy.
12.5 Operations Slice¶
- Add tracing to the reconcile path; export traces via OTel. The operator's hop into GitHub appears as an external span, which is useful for diagnosing outages.
Month 3 Capstone Deliverable¶
A `controllers-and-operators/` workspace:
1. `mirror-controller-clientgo/` (week 9).
2. `mirror-controller-cr/` (week 10).
3. `versioned-crd/` (week 11).
4. `github-repo-operator/` (week 12).
Month 4-Networking and Storage at Scale¶
Goal: by the end of week 16 you can (a) explain the CNI spec and trace a Pod-to-Pod packet through a working CNI, (b) install and operate Cilium with eBPF-based dataplane, kube-proxy replacement, and L7 visibility, (c) reason about service-mesh tradeoffs (Istio vs Linkerd vs Cilium Service Mesh), and (d) operate dynamic CSI provisioning at scale with backups and snapshots.
Weeks¶
- Week 13 - The CNI Spec and Pod Networking
- Week 14 - Cilium and eBPF Networking
- Week 15 - Service Meshes: Istio, Linkerd, Cilium Service Mesh
- Week 16 - CSI at Scale: Snapshots, Backup, Cloning
Week 13 - The CNI Spec and Pod Networking¶
13.1 Conceptual Core¶
- The CNI (Container Network Interface) spec is small (~30 pages). A CNI plugin is a binary that the kubelet (via the runtime) invokes when a Pod sandbox is created. Inputs: container ID, network namespace path, JSON config. Outputs: assigned IP, routes.
- Kubelet does not know about networking beyond "ask the CNI." This is what makes the dataplane pluggable.
- Modern CNIs ship as DaemonSets that program kernel rules (iptables, OVS, eBPF) and run a small "agent" plus a thin "delegator" CNI binary.
13.2 Mechanical Detail¶
- Read `containernetworking/cni/SPEC.md`. Operations: `ADD`, `DEL`, `CHECK`, `VERSION`.
- The CNI binary must be at `/opt/cni/bin/<name>`; config at `/etc/cni/net.d/*.conf`.
- The kubelet → CRI → CNI flow:
  - Kubelet asks runtime to create a sandbox.
  - Runtime creates a netns; calls CNI `ADD`.
  - CNI assigns IP, sets up the netns.
  - Sandbox containers join the existing netns rather than creating a new one (no `CLONE_NEWNET`).
- CNI chains: multiple plugins composed, each running in order (e.g., a primary CNI + a metering plugin + a port-mapping plugin).
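The plugin contract above (environment variables in, JSON on stdin, JSON on stdout) can be modeled without touching the kernel. This is a Python sketch under stated assumptions: `handle_cni` is a hypothetical function, the assigned address is hard-coded, and only a subset of the result fields is shown.

```python
import json


def handle_cni(env, stdin_text):
    """Dispatch one CNI invocation and return the JSON the plugin would print.

    `env` carries the CNI_* variables the runtime sets; `stdin_text` is the
    network config JSON. No kernel state is touched; the address is hard-coded.
    """
    command = env["CNI_COMMAND"]
    if command == "VERSION":
        return json.dumps({"cniVersion": "1.0.0", "supportedVersions": ["1.0.0"]})

    conf = json.loads(stdin_text)
    if command == "ADD":
        result = {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1", "interface": 0}],
            "routes": [{"dst": "0.0.0.0/0"}],
        }
        return json.dumps(result)
    if command in ("DEL", "CHECK"):
        return ""  # success is signalled by exit code 0 and no output
    raise ValueError(f"unsupported CNI_COMMAND: {command}")
```

A real plugin does the same dispatch, then creates the veth pair and programs addresses and routes inside the namespace named by `CNI_NETNS` before printing the result.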
13.3 Lab-"Read a CNI's Source"¶
- Pick a simple CNI (`flannel` or the reference `bridge` plugin from `containernetworking/plugins`). Read its `cmdAdd` end to end.
- Deploy a small kind cluster; trace a Pod creation in the kubelet log; correlate with the CNI binary invocation.
- Use `nsenter -t <pause-pid> -n ip a` to inspect the container's network namespace from the host.
13.4 Hardening Drill¶
- Default-deny `NetworkPolicy` per namespace. Allow only intended Pod-to-Pod and Pod-to-Service traffic.
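A default-deny policy and one explicit exception might look like the following. The manifests are built as Python dicts so they can be checked and then serialized to JSON, which `kubectl apply -f` accepts; the object names and the `team-a` namespace in the usage note are illustrative.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every Pod in `namespace` and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector = all Pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed = deny both
        },
    }


def allow_same_namespace(namespace):
    """Exception: Pods may talk to other Pods in their own namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
            "egress": [{"to": [{"podSelector": {}}]}],
        },
    }
```

For example, `print(json.dumps(default_deny("team-a"), indent=2))` emits a manifest ready to apply. With egress denied, DNS lookups fail too, so a real rollout usually needs a further allow rule toward the cluster DNS Pods.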
13.5 Operations Slice¶
- Monitor CNI errors in kubelet logs. A node with consistent CNI ADD failures will accumulate Pods stuck in ContainerCreating; alert on this.
Week 14 - Cilium and eBPF Networking¶
14.1 Conceptual Core¶
Cilium is the dominant eBPF-based CNI. It replaces iptables-based packet processing with eBPF programs attached at three layers:
- Socket layer (`bpf_sock_ops`) - connection-level decisions before packets exist.
- Cgroup egress - per-pod outbound policy enforcement.
- NIC-level XDP - ingress filtering at line rate, before the kernel network stack.
The shift from iptables matters at scale: an iptables-based kube-proxy walks a linear chain of rules per packet - O(services). eBPF programs do hash-table lookups: O(1) per packet, regardless of service count.
Beyond replacing the CNI, Cilium provides:
- Kube-proxy replacement (eBPF-based service load balancing - no iptables churn on every endpoint change).
- L7 NetworkPolicy (HTTP, gRPC, Kafka filtering at the dataplane, not in a sidecar).
- ClusterMesh (multi-cluster service discovery and cross-cluster policy).
- Hubble (eBPF-based flow observability - every pod-to-pod connection visible without sampling).
- Service Mesh (sidecar-less mTLS via eBPF + SPIFFE).
This is the bridge to Linux Month 3 - eBPF in production. See also: eBPF in the observability cross-topic page.
14.2 Mechanical Detail¶
- Dataplane as eBPF graph: Cilium's eBPF programs live under `bpf/` in `cilium/cilium`. The agent compiles them at startup with the cluster's specific configuration baked in (BTF-driven CO-RE for portability across kernels).
- Identity-based policy: pods are assigned a numeric identity derived from their labels (`app=foo,env=prod` → identity 1234). eBPF programs match on these identities, not on IPs. This is what allows policy to scale to thousands of pods without per-pod iptables rules - identities are stable across pod restarts and IP changes.
- Service load balancing: instead of iptables DNAT chains, Cilium uses an eBPF map indexed by `(service IP, port)` returning a backend. Connection state lives in a separate eBPF map; updates are atomic, no kernel reload, no race during endpoint churn.
- Encryption: WireGuard (recommended; in-kernel since 5.6) or IPsec tunnels between nodes. Per-NetworkPolicy opt-in or cluster-wide.
- Hubble captures every packet's metadata via eBPF - source/dest identity, verdict (allowed/denied), L7 protocol info - and exposes it via gRPC + a CLI + a UI. Per-packet overhead is single-digit-percent CPU.
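The identity idea can be illustrated with a toy allocator: one number per distinct label set, and policy expressed over those numbers rather than IPs. This Python sketch is a conceptual model only; `IdentityAllocator`, the starting number, and the pair-set policy representation are assumptions, not Cilium's implementation.

```python
class IdentityAllocator:
    """Hand out one numeric identity per distinct label set (toy model).

    Real allocation is cluster-wide and coordinated; the starting number
    here is arbitrary.
    """

    def __init__(self, first_id=1000):
        self._next = first_id
        self._by_labels = {}

    def identity_for(self, labels):
        key = frozenset(labels.items())  # order-independent, hashable
        if key not in self._by_labels:
            self._by_labels[key] = self._next
            self._next += 1
        return self._by_labels[key]


def allowed(policy, src_identity, dst_identity):
    """Policy is a set of (source identity, destination identity) pairs."""
    return (src_identity, dst_identity) in policy
```

Every Pod with the same labels shares an identity, so replacing a Pod or changing its IP leaves the policy lookup untouched. The flip side is that each new label combination consumes an identity, which is why unbounded label values are a scaling hazard.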
The trap
Switching kubeProxyReplacement from false → true on a live cluster without draining nodes. The iptables rules from the old kube-proxy don't get cleaned up automatically, and they interact badly with Cilium's eBPF NAT. Always: drain node → reconfigure → uncordon. The Cilium installer's kubeProxyReplacement: strict mode aborts if it finds residual rules.
14.3 Lab - "Install and Drive Cilium"¶
- Install via Helm with: `kubeProxyReplacement=true`, `hubble.enabled=true`, `hubble.relay.enabled=true`, `hubble.ui.enabled=true`, `encryption.enabled=true`, `encryption.type=wireguard`.
- Use the Hubble UI (`cilium hubble ui`) to visualize pod-to-pod traffic in real time.
- Author L4 `NetworkPolicy` (standard k8s API); test enforcement with a denied + allowed flow.
- Author an L7 `CiliumNetworkPolicy` (e.g., allow only `HTTP GET /api/*` from frontend → backend); test enforcement.
- Enable Cilium Service Mesh; observe sidecar-free mTLS between two test services.
14.4 Hardening Drill¶
Enable transparent encryption (WireGuard) between nodes. Combined with default-deny NetworkPolicy (start: deny everything, allow explicitly), this gives defense-in-depth: an attacker on the network between nodes sees only encrypted traffic, and a compromised workload can reach only the flows it has been explicitly allowed.
14.5 Operations Slice¶
Monitor `cilium_*` Prometheus metrics. Alert on:
- policy-drop rate spikes - legitimate workloads being denied (usually a NetworkPolicy author mistake, or a new service that didn't get its allow rule).
- identity-table pressure - Cilium has a max identity count per cluster; approaching it means too many distinct label combinations, often from a bad operator emitting unique labels per request.
- endpoint regeneration time - if it climbs past 5-10s, your label churn is overwhelming the agent.
Week 15 - Service Meshes: Istio, Linkerd, Cilium Service Mesh¶
15.1 Conceptual Core¶
- A service mesh adds: mTLS between Services, retries/timeouts/circuit-breaking, traffic shifting (canary, blue/green), observability (RED metrics + traces), policy enforcement.
- Two architectural patterns:
- Sidecar (Istio classic, Linkerd): Envoy/`linkerd-proxy` runs in every Pod. ~50 MB memory per Pod, ~1 ms latency overhead.
- Sidecar-less (Istio ambient, Cilium SM): eBPF + per-node proxy. Much lower per-Pod overhead.
- Decision matrix:
- Mature, full-featured, complex → Istio.
- Minimalist, Rust-based, fast to install → Linkerd.
- Already running Cilium, want sidecar-less → Cilium Service Mesh.
15.2 Mechanical Detail¶
- Envoy (under Istio + others) is the dataplane proxy. xDS APIs (LDS, RDS, CDS, EDS) push config from the control plane.
- mTLS rotation: the mesh control plane issues short-lived certs (typically 24h) signed by an internal CA (or SPIFFE-compatible).
- Traffic management: Istio `VirtualService` + `DestinationRule` for routing rules. K8s Gateway API is the standard-track replacement, supported by all major meshes.
- Observability: every mesh emits RED metrics (Rate, Errors, Duration) per-service. With OTel, traces propagate through the mesh.
15.3 Lab-"Three Meshes"¶
- Install Istio in ambient mode on a test cluster. Apply a `VirtualService` that does 90/10 canary routing. Verify with Hubble or Kiali.
- Repeat with Linkerd. Compare install footprint, configuration ergonomics, and observability quality.
- (If running Cilium) enable Cilium Service Mesh. Compare again.
- Document tradeoffs: install effort, per-Pod overhead, feature gaps.
15.4 Hardening Drill¶
- Enable mTLS in `STRICT` mode. Define `AuthorizationPolicy` objects denying cross-namespace traffic by default; allow only intended pairs.
15.5 Operations Slice¶
- Wire the mesh's RED metrics into your service-level dashboards. Define SLOs per service: latency p99, error rate, mTLS handshake success rate.
Week 16 - CSI at Scale: Snapshots, Backup, Cloning¶
16.1 Conceptual Core¶
- Production storage in K8s requires:
- Dynamic provisioning (week 8).
- Volume Snapshots (point-in-time captures).
- Backups (off-cluster, often app-consistent via operator hooks).
- Cloning (PVC from snapshot, or PVC-from-PVC).
- Resizing (online expansion).
- Velero is the de-facto cluster backup tool: backs up resource manifests + PV snapshots to object storage; restores selectively.
16.2 Mechanical Detail¶
- `VolumeSnapshotClass` ↔ `VolumeSnapshot` ↔ `VolumeSnapshotContent`. Mirrors the SC/PVC/PV trio.
- The external-snapshotter sidecar runs alongside the CSI controller, watching `VolumeSnapshot` objects.
- Volume populators (since 1.24): populate a new PVC from arbitrary sources (snapshots, other PVCs, S3, etc.). Modular framework.
- Velero: install, configure storage location (S3-compatible bucket), schedule backups via `Schedule` resource. Plugins for cloud providers and for "BackupStorageLocation" abstraction.
16.3 Lab-"Backup and Restore"¶
- Install Velero against a MinIO bucket.
- Schedule a daily backup of one namespace.
- Delete the namespace; restore from backup; verify Pods come back, PVs reattach, data intact.
- Create a stateful workload (Postgres via an operator); test snapshot + clone flow for fast dev/test environment provisioning.
16.4 Hardening Drill¶
- Test restore into a different cluster. This is the actual disaster-recovery scenario, and the most commonly broken backup story.
16.5 Operations Slice¶
- Wire Velero metrics: backup success rate, backup duration, restore-test outcomes (run a synthetic restore weekly to validate).
Month 4 Capstone Deliverable¶
A `networking-and-storage/` workspace:
1. `cni-source-walkthrough.md` (week 13).
2. `cilium-policies/` - L4 + L7 + identity-based examples (week 14).
3. `mesh-comparison/` - three meshes, RED dashboards (week 15).
4. `velero-DR/` - backup, restore, and cross-cluster-restore demos (week 16).
Month 5-Platform Engineering and Day-2 Operations¶
Goal: by the end of week 20 you can (a) operate a GitOps workflow with ArgoCD or Flux at scale, (b) provision cloud infrastructure declaratively via Crossplane (or Terraform from K8s), (c) configure HPA/VPA against custom Prometheus metrics, and (d) author and enforce policies with OPA Gatekeeper or Kyverno.
Weeks¶
- Week 17 - GitOps: ArgoCD and Flux
- Week 18 - IaC From Within K8s: Crossplane and Terraform
- Week 19 - HPA, VPA, KEDA: Autoscaling
- Week 20 - Admission Control: Webhooks, OPA Gatekeeper, Kyverno
Week 17 - GitOps: ArgoCD and Flux¶
17.1 Conceptual Core¶
- GitOps = the cluster's desired state is the contents of a git repo. A controller in the cluster watches the repo and reconciles drift.
- The two dominant tools:
- ArgoCD: UI-rich, opinionated about app structure (`Application` CRD), wide adoption.
- Flux: CLI/CRD-first, more composable (`Kustomization`, `HelmRelease`, `GitRepository`, `OCIRepository`), favored by CNCF-style purists.
- Both implement the same control loop: pull manifests from git → render (Kustomize/Helm) → apply → reconcile drift.
17.2 Mechanical Detail¶
- ArgoCD `Application`: `spec.source` (git path or Helm chart), `spec.destination` (cluster + namespace), `spec.syncPolicy` (manual vs automatic, prune, self-heal).
- `ApplicationSet` (Argo): generate many Apps from templates; the foundation for multi-tenant fleet management.
- Flux `Kustomization` + `HelmRelease`: separate CRs for source-of-truth, transform, and apply.
- Sync waves / dependencies: both tools support ordering. Critical for "install CRDs before the resources that use them."
- Drift detection: both tools can auto-revert manual changes (Argo when self-heal is enabled, Flux on each reconcile). Sometimes that is not what you want during incident response; know how to disable it temporarily.
17.3 Lab-"Two GitOps Stacks"¶
- Install ArgoCD. Set up an `Application` for a small app from a git repo. Verify auto-sync and auto-prune.
- Install Flux. Set up the equivalent. Compare ergonomics.
- Use `ApplicationSet` (Argo) to deploy the same app to three environment overlays (`dev`, `staging`, `prod`). Verify per-environment configuration via Kustomize overlays.
17.4 Hardening Drill¶
- ArgoCD/Flux talk to a git repo with read access. Use SSH deploy keys or fine-scoped GitHub apps; never broad PATs. Encrypt secrets at rest with `sealed-secrets` or `sops`.
17.5 Operations Slice¶
- Wire ArgoCD/Flux metrics: per-Application sync rate, drift rate, reconciliation duration. Alert on persistent OutOfSync or Failed states.
Week 18 - IaC From Within K8s: Crossplane and Terraform¶
18.1 Conceptual Core¶
- Crossplane flips IaC inside out: cloud resources are Kubernetes resources (CRDs), reconciled by Crossplane's providers (`provider-aws`, `provider-gcp`, `provider-azure`, `provider-helm`, `provider-kubernetes`). You manage cloud infra with `kubectl apply`.
- Compositions let you bundle low-level primitives into domain-specific abstractions: define an `XPostgresInstance` that, when applied, creates a VPC subnet, an RDS instance, IAM bindings, and a `ServiceMonitor`. Platform teams ship Compositions; app teams consume them.
- Terraform alternative: run Terraform Cloud / Atlantis externally; treat the cluster as a deploy target only. Simpler in some shops; doesn't unify the control plane.
18.2 Mechanical Detail¶
- Provider = a controller image that knows how to talk to one external system. Install via `Provider` CRD.
- `ProviderConfig` = credentials + connection details for the provider.
- Managed Resource (MR) = the K8s representation of a cloud resource (`Bucket`, `Database`, `IAMRole`).
- Composition = a YAML transform: "given this `XPostgresInstance` claim, produce these MRs with these field mappings."
- Composite Resource Definition (XRD) = the schema for the abstract type; the platform-team-facing equivalent of CRD.
18.3 Lab-"Self-Service Database"¶
- Install Crossplane. Install `provider-aws` (or `provider-gcp`).
- Configure provider credentials.
- Define an XRD `XDatabase` with parameters: `size`, `engine`, `version`, `region`.
- Define a Composition that materializes an RDS instance + a Secret with credentials.
- As an "app team" persona, create a `Database` claim. Watch it become a real RDS instance. Delete; watch it be torn down.
18.4 Hardening Drill¶
- Restrict Composition `selectors` and `compositionRef` so app teams cannot select unintended Compositions. Use OPA/Gatekeeper to enforce naming, region, size limits.
18.5 Operations Slice¶
- Compositions are platform contracts. Version them. Provide migration paths. Treat as you would a public API: SLAs, deprecation windows, changelogs.
Week 19 - HPA, VPA, KEDA: Autoscaling¶
19.1 Conceptual Core¶
- HPA (Horizontal Pod Autoscaler): scales replica count based on metrics. CPU/memory by default; with `metrics.k8s.io` + `custom.metrics.k8s.io` adapters (e.g., `prometheus-adapter`), any metric is fair game.
- VPA (Vertical Pod Autoscaler): adjusts a Pod's CPU/memory `requests` based on observed usage. Update modes: `Auto` (recreate pod with new resources), `Initial` (apply only on creation), `Off` (recommend only, change nothing).
- KEDA (Kubernetes Event-Driven Autoscaling): scale to zero, scale on event-source backlog (Kafka lag, SQS depth, custom). Sits in front of HPA.
19.2 Mechanical Detail¶
- HPA reconcile interval: 15s by default. Picking metrics that are too jittery causes flapping; smooth at the source.
- HPA scaling policies: `scaleUp.policies` and `scaleDown.policies` with stabilization windows. Tune to workload's elasticity profile.
- Custom metrics adapter (`prometheus-adapter`): translates Prometheus queries into the `custom.metrics.k8s.io` API the HPA reads. Define rules in adapter config.
- VPA's recommender computes percentile-based recommendations from historical usage. Often used in `Off` mode just to suggest resource changes; production safety prefers manual approval.
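The HPA's core calculation is small enough to check by hand. The sketch below follows the formula documented for the HPA, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with the documented default tolerance of 0.1. It leaves out stabilization windows, scaling policies, and the handling of missing metrics and not-yet-ready Pods; the function name and default bounds are illustrative.

```python
import math


def desired_replicas(current, metric_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """ceil(current * metric/target), skipped inside the tolerance band, then clamped."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

With 4 replicas averaging 300 requests per second each against a target of 200, the ratio is 1.5 and the result is 6. At 210 the ratio sits inside the tolerance band and nothing changes, which is the first line of defense against flapping.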
19.3 Lab-"Autoscale on Custom Metrics"¶
- Deploy a load-test target with a Prometheus-exposed `requests_per_second` metric.
- Install `prometheus-adapter` mapping that metric to `custom.metrics.k8s.io`.
- Author HPA targeting `AverageValue=200` of that metric. Drive load; watch scaling.
- Add KEDA in front for scale-to-zero behavior. Verify cold-start latency.
19.4 Hardening Drill¶
- Set `minReplicas` to a non-zero value for any tier-1 service (avoid cold-start during incident traffic). Cap `maxReplicas` to avoid runaway autoscaling on metric anomalies.
19.5 Operations Slice¶
- Wire HPA event metrics. Alert on persistent `desiredReplicas == maxReplicas` (you've hit the cap) and on flapping (`scaleUp` and `scaleDown` events alternating rapidly).
Week 20 - Admission Control: Webhooks, OPA Gatekeeper, Kyverno¶
20.1 Conceptual Core¶
- Admission control is the apiserver's last gate: every create/update is run through configured admission webhooks before persistence.
- Two policy-engine choices in the modern ecosystem:
- OPA Gatekeeper: Rego-language policies; the standard for "policy as code."
- Kyverno: YAML-native policies; lower learning curve, strong template/mutation/generate support.
- Pod Security Admission (replacement for PodSecurityPolicy, which was deprecated and then removed in Kubernetes 1.25): built into the apiserver. Three profiles: `privileged`, `baseline`, `restricted`. Apply per-namespace.
20.2 Mechanical Detail¶
- Validating webhooks: receive AdmissionReview, return Allowed=true/false with reasons. Cannot mutate.
- Mutating webhooks: also return a JSON Patch describing their changes. Applied before validating.
- Failure policy (`Fail` vs `Ignore`): if the webhook is unreachable, fail closed (safer) or open (operationally simpler). Trade off carefully.
- Gatekeeper's `ConstraintTemplate` (Rego) + `Constraint` (instance) model. Audit mode reports without enforcing; start there in any new policy rollout.
- Kyverno's `ClusterPolicy`/`Policy` CRDs cover validate, mutate, generate, verifyImages.
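The validating-webhook contract can be sketched as a pure function from an AdmissionReview request to an AdmissionReview response. This Python sketch shows only the decision logic, not the HTTPS server around it; the "no `latest` tag" rule and the message text are illustrative choices.

```python
def review(admission_review):
    """Validate a Pod AdmissionReview: deny images with no tag or the `latest` tag."""
    request = admission_review["request"]
    containers = request["object"]["spec"].get("containers", [])

    bad = []
    for c in containers:
        image = c["image"]
        if "@" in image:
            continue  # pinned by digest
        last = image.rsplit("/", 1)[-1]  # drop the registry host (and its port)
        tag = last.split(":", 1)[1] if ":" in last else ""
        if tag in ("", "latest"):
            bad.append(c["name"])

    # The response must echo the request uid so the apiserver can match it.
    response = {"uid": request["uid"], "allowed": not bad}
    if bad:
        response["status"] = {"message": "mutable image tag in: " + ", ".join(bad)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}
```

A production version would also inspect `initContainers` and `ephemeralContainers`, and would handle Pod templates inside Deployments and Jobs if it is registered for those resources.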
20.3 Lab-"Three Policy Layers"¶
- Apply Pod Security Admission per-namespace: `restricted` everywhere except a `priv` namespace.
- Author 5 Gatekeeper Constraints: require resource limits, forbid `latest` tags, enforce non-root, label-required, namespace-must-have-team-label.
- Author equivalents in Kyverno. Compare expressiveness.
- Run in audit-mode for a week against a pre-existing cluster; triage findings before enforcing.
20.4 Hardening Drill¶
- Mandate signed images via Kyverno's `verifyImages` with cosign keys. Combined with Sigstore policy from the Container curriculum, this closes the supply-chain gate at the cluster.
20.5 Operations Slice¶
- Track admission-webhook latency. Slow webhooks slow every apply. Pod-creation latency p99 is your warning signal.
Month 5 Capstone Deliverable¶
A `platform-and-day2/` workspace:
1. `gitops-stack/` (week 17) - ArgoCD + ApplicationSet + multi-env overlays.
2. `crossplane-platform/` (week 18) - XDatabase composition + claim demo.
3. `hpa-custom-metrics/` (week 19) - Prom-adapter + HPA + KEDA scale-to-zero demo.
4. `policy-suite/` (week 20) - Gatekeeper + Kyverno + PSA examples.
Month 6-Kubernetes The Hard Way + Capstone¶
Goal: by the end of week 24 you have built (or substantially built) a multi-node Kubernetes cluster from raw VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy isolation, and a documented operational runbook.
Weeks¶
- Week 21 - Bootstrap: VMs, Certificates, etcd
- Week 22 - Control Plane and Worker Nodes
- Week 23 - RBAC, Multi-Tenancy, mTLS Everywhere
- Week 24 - Defense, Documentation, and the Capstone Demo
Week 21 - Bootstrap: VMs, Certificates, etcd¶
21.1 Conceptual Core¶
- "Kubernetes the Hard Way" is Kelsey Hightower's exercise: bring up a Kubernetes cluster step by step, from raw VMs, generating certs by hand, configuring every flag explicitly. The point is not operational efficiency; it is deep understanding of every moving part.
- This curriculum's hard-way variant: bring up 3 control-plane nodes + 3 worker nodes on cloud VMs (or bare metal). Use modern toolchain (containerd, Cilium, latest stable Kubernetes).
21.2 Mechanical Detail¶
- VM provisioning: 6 VMs, ~2 vCPU 4 GB each. Cloud (AWS/GCP/Hetzner) or bare metal.
- PKI: a CA + intermediate CAs for `etcd`, `kube-apiserver`, `kubelet`, `front-proxy`. Use `cfssl` or `easy-rsa`. Every component identifies itself with x509.
- etcd cluster: 3 nodes, mTLS between peers and clients, snapshots scheduled.
- Loopback bootstrap considerations: kubelet needs a kubeconfig before the apiserver is up. Either use static-pod manifests for control-plane components (the `kubeadm` approach) or run the control plane outside the cluster on the VMs themselves.
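The choice of 3 etcd nodes follows from Raft's majority rule, which is simple enough to state as arithmetic. A minimal Python sketch, with hypothetical function names:

```python
def quorum(members):
    """Votes needed to commit a write in a Raft cluster of `members` nodes."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while the cluster still has quorum."""
    return members - quorum(members)
```

Three members tolerate one failure and five tolerate two. Four members still tolerate only one, which is why even-sized clusters add cost without adding resilience.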
21.3 Lab-"Bring Up etcd"¶
- Provision 3 VMs labeled `etcd-{1,2,3}`.
- Generate CA + per-node certs.
- Install etcd binaries; configure systemd units with mTLS.
- Bring up; verify `etcdctl member list` shows healthy quorum.
- Take a snapshot. Restore on a separate test machine.
21.4 Hardening Drill¶
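- Protecting etcd's data on disk is separate from Kubernetes secret encryption, which is configured on kube-apiserver with an encryption-provider config (next week). etcd does not encrypt its own data at rest, so place the etcd data directory on an encrypted volume from day one.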
21.5 Operations Slice¶
- etcd backup automation: `etcdctl snapshot save` cron'd to S3 every 6 hours. Verify restore weekly.
Week 22 - Control Plane and Worker Nodes¶
22.1 Conceptual Core¶
- The control plane: kube-apiserver, kube-scheduler, kube-controller-manager. Run all three as systemd-managed binaries on each control-plane node, behind a load balancer (HAProxy or cloud LB) for HA.
- The worker plane: containerd + kubelet + kube-proxy (or Cilium replacement). Joins the cluster via a kubelet kubeconfig signed by the cluster CA.
22.2 Mechanical Detail¶
- kube-apiserver flags:
  - `--etcd-servers=https://etcd-{1,2,3}:2379` with mTLS.
  - `--encryption-provider-config=...` for secret encryption-at-rest.
  - `--audit-policy-file=...` and `--audit-log-path=...`.
  - `--authorization-mode=Node,RBAC`.
  - `--enable-admission-plugins=NodeRestriction,PodSecurity,ResourceQuota,...`.
  - `--service-account-issuer`, `--service-account-signing-key-file` for ServiceAccount tokens (projected, OIDC-compatible).
- kubelet bootstrap: TLS bootstrap using a bootstrap token; kubelet auto-rotates its cert via `kubelet-csr-approver`.
- CNI: install Cilium first (DaemonSet); only after Cilium is healthy do worker-node Pods become ready.
- DNS: install CoreDNS as a Deployment; the kubelet's cluster-DNS arg points at its Service IP.
22.3 Lab-"Cluster Live"¶
- Bring up 3 control-plane nodes; HAProxy in front.
- Bring up 3 workers; join via bootstrap tokens.
- Install Cilium; verify Pod-to-Pod connectivity.
- Install CoreDNS; verify Service DNS works.
- Smoke test: deploy a sample app + Service + Ingress; verify end-to-end.
22.4 Hardening Drill¶
- Apply CIS Kubernetes Benchmark v1.8 (or current). Use `kube-bench` to score. Address all `FAIL` results; document each `WARN`.
22.5 Operations Slice¶
- Wire control-plane components to Prometheus. Define SLOs: apiserver request p99 < 1s, etcd-leader-changes per hour < 1, scheduler queue depth < 100.
Week 23 - RBAC, Multi-Tenancy, mTLS Everywhere¶
23.1 Conceptual Core¶
- Multi-tenancy is the hardest sustained problem in Kubernetes. The kernel and Kubernetes give you soft isolation by default; converting that to hard isolation requires layered controls.
- The required layers: namespace-per-tenant + RBAC + NetworkPolicy + ResourceQuota + LimitRange + PodSecurity + node-pool isolation + (optionally) sandboxed runtime.
- mTLS everywhere: control-plane (already from week 22), service mesh between Services (week 15), workload identity for Pods talking to cloud APIs (e.g., AWS IRSA, GCP Workload Identity).
23.2 Mechanical Detail¶
- Tenant onboarding as code (Crossplane Composition or Helm chart):
- Namespace.
- ResourceQuota + LimitRange.
- Default-deny NetworkPolicy + an allow-namespace-internal exception.
- PodSecurity admission label (`restricted`).
- RoleBindings for the tenant's group.
- ServiceAccount with workload identity binding for cloud access.
- GitOps Application(Set) entries to deploy the tenant's app catalog.
- Hard isolation tiers:
- Tier 1: namespace + RBAC. Default. Suitable for trusted internal teams.
- Tier 2: + sandboxed runtime (gVisor) for tenant-owned untrusted workloads.
- Tier 3: + dedicated node pool with taints. Suitable for compliance-bound workloads.
- Tier 4: separate cluster (vCluster, Cluster API). Strongest isolation; highest cost.
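Part of the onboarding bundle can be sketched as a function from a tenant name to a list of manifests. This is a Python illustration, not a Crossplane Composition or Helm chart: the `tenant-` namespace naming, the quota sizes, and the choice of the built-in `edit` ClusterRole are assumptions, and the NetworkPolicy, ServiceAccount, and GitOps entries are left out for brevity.

```python
def tenant_bundle(tenant, group, cpu="8", memory="16Gi"):
    """Objects for one tenant; names, quota sizes, and the `edit` role are illustrative."""
    ns = f"tenant-{tenant}"
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {
                "name": ns,
                "labels": {
                    "tenant": tenant,
                    # Pod Security Admission reads this label on the namespace.
                    "pod-security.kubernetes.io/enforce": "restricted",
                },
            },
        },
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "tenant-quota", "namespace": ns},
            "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
        },
        {
            # A RoleBinding to a ClusterRole grants it inside this namespace only.
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "RoleBinding",
            "metadata": {"name": "tenant-edit", "namespace": ns},
            "subjects": [{"kind": "Group", "name": group,
                          "apiGroup": "rbac.authorization.k8s.io"}],
            "roleRef": {"kind": "ClusterRole", "name": "edit",
                        "apiGroup": "rbac.authorization.k8s.io"},
        },
    ]
```

Generating every tenant from one function (or one Composition) is what keeps tenants uniform: an isolation fix lands in a single place and rolls out to all of them.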
23.3 Lab-"Onboard a Tenant"¶
- Author a tenant Composition (Crossplane) or Helm chart that, given `{tenant: "acme"}`, materializes everything in §23.2.
- Onboard `acme`. Have a "tenant developer" persona deploy an app via GitOps.
- Verify isolation: from `acme`'s namespace, can you read another tenant's secrets? Pods? Logs? Each should fail.
23.4 Hardening Drill¶
- Run `kubescape` or `polaris` against the cluster. Address findings until score is >90%.
23.5 Operations Slice¶
- Per-tenant cost attribution: label every resource with `tenant=`; export `kube-state-metrics` with that label to Prometheus; cost-allocate via `OpenCost`.
Week 24 - Defense, Documentation, and the Capstone Demo¶
24.1 Conceptual Core¶
The final week is integration and defense. Bring the capstone (whichever track) to production-defensible quality.
24.2 Final Hardening Checklist¶
- CIS benchmark green (`kube-bench`).
- All control-plane components mTLS, with cert auto-rotation tested.
- Encryption-at-rest enabled for secrets in etcd.
- Audit logging enabled; logs shipped off-cluster.
- Default-deny NetworkPolicy in every namespace.
- PodSecurity `restricted` everywhere except documented exceptions.
- Image admission requires signed images (Sigstore policy).
- Velero backups + tested cross-cluster restore.
- Chaos: drain a node, kill a control-plane node, partition the network; the cluster recovers each time.
- Observability: Prometheus + Grafana + Loki + Tempo (or equivalent) integrated.
- Cost attribution per tenant.
- Runbooks: node-not-ready, etcd-degraded, apiserver-OOM, namespace-stuck-terminating, pod-pending-forever.
24.3 Lab-"Defend the Cluster"¶
Schedule a 60-minute mock review. Demo:
1. The architecture diagram.
2. Provisioning (Ansible/Terraform/Crossplane).
3. Tenant onboarding from request to running app.
4. Failure injection: kill a control-plane node; show cluster recovery.
5. Observability: trace a request from ingress through service mesh to backend, with metrics, logs, and trace ID correlation.
6. Backup + restore.
24.4 Operations Slice¶
- Tag the cluster manifest repo `v1.0.0`. Sign with cosign. Publish a `RUNBOOK.md` that, in principle, lets a successor team rebuild the cluster from scratch.
Month 6 Deliverable¶
The capstone artifact (per the capstone catalog), plus the aggregated kubernetes-mastery/ repo containing every prior month's deliverable.
Appendix A - Kubernetes Hardening Reference¶
Cumulative hardening checklist. By week 24 the reader's cluster-baseline/ template should encode every section.
A.1 Control Plane¶
- etcd: 3 or 5 nodes, mTLS, encryption-at-rest, snapshot+restore tested.
- kube-apiserver: encryption providers, audit logging, NodeRestriction admission, PodSecurity admission, OIDC (or trustedSA) for users.
- kube-scheduler: leader election; default + custom plugins reviewed.
- kube-controller-manager: leader election; minimum SA permissions.
- kubelet: read-only port disabled, TLS bootstrap with CSR approval, anonymous-auth false, authorization webhook.
A.2 RBAC¶
- No bindings to the `cluster-admin` ClusterRole except for break-glass.
- Per-tenant Roles, not ClusterRoles.
- Audit `system:authenticated` and `system:unauthenticated` group bindings; beyond the built-in discovery defaults, both should be empty.
- Use `kubectl auth can-i --as=...` to verify least-privilege per persona.
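The per-persona check lends itself to a small matrix of commands. The sketch below only builds the `kubectl auth can-i` invocations; the persona `dev@acme` and namespace `globex` are hypothetical, and running the commands is left to you.

```python
import shlex


def can_i_commands(persona, namespace, checks):
    """Build one `kubectl auth can-i` argv per (verb, resource) pair."""
    return [
        ["kubectl", "auth", "can-i", verb, resource,
         "--as", persona, "--namespace", namespace]
        for verb, resource in checks
    ]


# Hypothetical expectation: an acme developer must get "no" for each of these
# when asking about another tenant's namespace.
cmds = can_i_commands("dev@acme", "globex", [("get", "secrets"), ("list", "pods")])
for argv in cmds:
    print(shlex.join(argv))
```

Keeping the expected answers next to the matrix turns the checklist item into a regression test you can rerun after every RBAC change.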
A.3 Pod Security¶
- PodSecurity admission `restricted` everywhere by default.
- Exceptions documented in code (namespace labels) with justification.
- Pod-level: `runAsNonRoot`, `readOnlyRootFilesystem`, drop all caps, seccomp `RuntimeDefault`.
- Mutating webhook to inject defaults if Pod spec omits them.
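As a rough illustration of the Pod-level line above, here is a checker for exactly those four settings on a Pod spec given as a plain dict. It is a sketch of this checklist, not a reimplementation of PodSecurity admission, and the function name is made up for the example.

```python
def checklist_violations(pod_spec):
    """List violations of the four Pod-level settings named above."""
    pod_sc = pod_spec.get("securityContext", {})
    problems = []
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        # Container-level values override Pod-level ones where both exist.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot not set to true")
        if sc.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: readOnlyRootFilesystem not set to true")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop does not include ALL")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") != "RuntimeDefault":
            problems.append(f"{name}: seccompProfile.type is not RuntimeDefault")
    return problems


hardened = {
    "securityContext": {"runAsNonRoot": True,
                        "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [{
        "name": "app",
        "securityContext": {"readOnlyRootFilesystem": True,
                            "capabilities": {"drop": ["ALL"]}},
    }],
}
print(checklist_violations(hardened))
print(checklist_violations({"containers": [{"name": "app"}]}))
```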
A.4 Network¶
- CNI with NetworkPolicy support (Cilium, Calico).
- Default-deny ingress + egress in every namespace.
- Allowed flows declared per workload as labeled NetworkPolicy.
- L7 policy on ingress (Cilium L7 NetworkPolicy or service mesh).
- mTLS between Services (mesh).
- Egress controls: explicit allowed CIDRs / FQDNs.
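The default-deny item is small enough to generate per namespace. A minimal sketch, assuming you apply the JSON output with `kubectl apply -f -`; the policy name `default-deny-all` is an arbitrary choice.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every Pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                      # empty selector = all Pods
            "policyTypes": ["Ingress", "Egress"],   # no rules listed = deny both
        },
    }


print(json.dumps(default_deny("acme"), indent=2))
```

Denying egress also blocks DNS, so the first "allowed flow" most workloads need is egress to the cluster DNS service.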
A.5 Image Supply Chain¶
- Image admission (Kyverno / Cosign policy-controller) requires signature.
- Allowlisted registries.
- No `latest` tags; pin by digest in production.
- SBOM and SLSA provenance attestations attached to every image.
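The registry and digest rules can be expressed as a simple lint over image references. This sketch uses deliberately simplified parsing (the first path component is treated as the registry), so treat it as an illustration of the policy rather than a full image-reference parser; `registry.example.com` is a placeholder.

```python
def image_problems(image, allowed_registries):
    """Check one image reference against the allowlist and pinning rules."""
    problems = []
    registry = image.split("/", 1)[0] if "/" in image else ""
    if registry not in allowed_registries:
        problems.append("registry not allowlisted")
    if "@sha256:" not in image:
        problems.append("not pinned by digest")
        # An absent tag means the runtime resolves `latest`.
        last = image.rsplit("/", 1)[-1]
        tag = last.split(":", 1)[1] if ":" in last else "latest"
        if tag == "latest":
            problems.append("uses latest tag")
    return problems


allowed = {"registry.example.com"}
pinned = "registry.example.com/app@sha256:" + "a" * 64
print(image_problems(pinned, allowed))
print(image_problems("registry.example.com/app:1.2", allowed))
print(image_problems("nginx", allowed))
```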
A.6 Secrets¶
- etcd encryption-at-rest with rotated keys.
- External Secrets Operator (ESO) for cloud-KMS-sourced secrets.
- No secrets in env vars where possible (use volume mounts, watch for restart).
- No secrets committed to git, even in sealed form, without a `sealed-secrets`/`sops` ratchet.
A.7 Multi-Tenancy¶
- One namespace per tenant; ResourceQuota + LimitRange.
- Hierarchical Namespaces or Capsule for nested tenants.
- PriorityClasses by tier; preemption tuned.
- Per-tenant cost attribution via labels + OpenCost.
A.8 Observability¶
- Audit logs shipped off-cluster (read-only on cluster).
- Container logs (Loki / cloud equivalent).
- Metrics (Prometheus + kube-state-metrics + node-exporter).
- Traces (OTel Collector + Tempo / Jaeger / cloud).
- Continuous profiling (Parca / Pyroscope) optional but recommended.
- SLO tracking per service (Pyrra / Sloth).
A.9 Backup + DR¶
- Velero scheduled backups to off-cluster storage.
- Cross-region or cross-cluster restore tested at least quarterly.
- etcd snapshot tested for catastrophic-recovery scenario.
- DR runbook with RTO + RPO documented.
A.10 The cluster-baseline/ Template¶
cluster-baseline/
  bootstrap/
    pki/                      # CA + per-component certs (cfssl)
    etcd/                     # systemd unit + config
    kube-apiserver/
    kube-scheduler/
    kube-controller-manager/
    kubelet/
  cni/cilium-values.yaml
  service-mesh/               # istio or linkerd values
  observability/
    prometheus/
    grafana/
    loki/
    tempo/
    parca/
  policy/
    pod-security/
    networkpolicy-default-deny.yaml
    gatekeeper-constraints/
    kyverno-policies/
    sigstore-policy.yaml
  tenancy/
    namespace-template/       # Crossplane composition
    rbac-template/
    quotas-template/
  velero/
    schedule.yaml
    locations.yaml
  runbooks/
    node-not-ready.md
    etcd-degraded.md
    apiserver-oom.md
    pod-pending-forever.md
    cluster-rebuild.md
  RUNBOOK.md
  THREAT_MODEL.md
This is the artifact every cluster you bring up after week 24 should be provisioned from.
Appendix B - Troubleshooting Reference Flows¶
Reference flows for the failure modes you will see in production.
B.1 Pod Pending Forever¶
Common causes (in observed-frequency order):
1. No node satisfies scheduling constraints. `Events:` shows `FailedScheduling`. Read the reason: insufficient CPU/memory, no matching nodeSelector, taints unmatched, no PV available, topology spread blocked.
2. PVC stuck pending. `kubectl get pvc <pvc>`: if `Pending`, check StorageClass, provisioner pods, cloud-side quota.
3. Image pull failure. `Events:` shows `ErrImagePull` / `ImagePullBackOff`. Check registry auth, image tag exists, network egress to registry.
4. Admission webhook rejected. Often hidden in apiserver logs; `kubectl get events -A` may surface it.
5. Quota exceeded. `ResourceQuota` denied creation.
Drilldown: kubectl get events -A --sort-by=.lastTimestamp | tail -30.
B.2 Pod CrashLoopBackOff¶
Common causes:
1. App-level crash. Read the previous container's logs.
2. Liveness probe failing. The probe is killing the container. Check probe path/port; loosen `initialDelaySeconds`.
3. OOMKilled. `kubectl describe` shows `Reason: OOMKilled`. Increase memory limit or fix leak.
4. ConfigMap / Secret missing. Pod is mounting it; if missing, kubelet fails the start. Watch for events.
5. Init container failure. Pod won't progress; check init container logs first.
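It helps to know what "BackOff" means in wall-clock terms while you debug. The kubelet restarts a crashing container with an exponential delay that starts at 10 seconds, doubles each time, and is capped at five minutes; the sketch below just reproduces that arithmetic (the function name is made up).

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

So a Pod that has been crash-looping for a while is retried only every five minutes, which is why a fix can appear to "not work" for several minutes after you apply it.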
B.3 Node NotReady¶
Common causes:
1. kubelet down. systemd unit failure; check journal.
2. CNI agent down. The node has no functional networking; Cilium/Calico DaemonSet pod has crashed.
3. Disk pressure. `Events:` shows `EvictionThresholdMet`. Free space (delete old container images, journal logs).
4. PID pressure. Too many processes.
5. Out-of-resources kernel-side. Check `dmesg` on the node.
B.4 etcd Degraded¶
Common causes:
1. Disk full or slow. fsync latency spikes; everything else feels slow. Check `etcd_disk_wal_fsync_duration_seconds`.
2. Leader election thrashing. Network instability between etcd nodes; check inter-node latency.
3. Database size growth. Forgot to compact. `etcdctl compaction <rev>`, then `etcdctl defrag`.
4. Quorum lost. Majority of nodes down. Restore from snapshot to a new cluster; recover.
B.5 Apiserver 5xx / Timeouts¶
Common causes:
1. etcd issues (above).
2. Webhook timeouts. Slow admission webhooks block every apply. Check webhook latency; consider `failurePolicy: Ignore` with caution.
3. Aggregated API down (e.g., metrics-server). `kubectl top` fails; downstream features (HPA) degrade.
4. Apiserver overload. Too many list/watch consumers; CPU pegged. Add replicas; review priority-and-fairness flow control.
B.6 Service Has No Endpoints¶
Common causes:
1. Selector mismatch. Service `spec.selector` doesn't match Pod labels. Most common.
2. Pods not ready. ReadinessProbe failing; only ready Pods join Endpoints.
3. Port mismatch. Service port name vs container port name out of sync.
4. Topology-aware routing dropping endpoints. Check `service.kubernetes.io/topology-aware-hints`.
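Since selector mismatch is the most common cause, it is worth being precise about the matching rule: every key/value in the Service selector must be present on the Pod, and extra Pod labels are ignored. A minimal sketch with made-up labels:

```python
def selects(selector, pod_labels):
    """True if every selector key/value pair is present in the Pod's labels."""
    # A Service without a selector does not pick Pods at all.
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


selector = {"app": "web", "tier": "frontend"}
print(selects(selector, {"app": "web", "tier": "frontend", "version": "v2"}))  # True
print(selects(selector, {"app": "web"}))                                       # False
```

Comparing the output of `kubectl get svc -o yaml` with `kubectl get pods --show-labels` is the manual version of the same check.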
B.7 Namespace Stuck Terminating¶
Cause: A finalizer can't be removed because its owning controller is gone (or stuck).
Fix path (carefully; you are bypassing a safety):
kubectl get namespace <ns> -o json \
| jq '.spec.finalizers = []' \
| kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -
But also: investigate why the finalizer wouldn't clear. Often a dangling external resource the operator was waiting on.
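If `jq` is not available on the box you are working from, the middle step of that pipeline is easy to reproduce. This sketch only performs the JSON edit; fetching the namespace and sending the result to the `finalize` subresource still happen through `kubectl` as shown above. The sample document is hypothetical.

```python
import json


def strip_namespace_finalizers(namespace_json):
    """Return the namespace JSON with spec.finalizers emptied."""
    ns = json.loads(namespace_json)
    ns.setdefault("spec", {})["finalizers"] = []
    return json.dumps(ns)


# Hypothetical input, shaped like `kubectl get namespace <ns> -o json` output.
stuck = json.dumps({
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "demo"},
    "spec": {"finalizers": ["kubernetes"]},
})
print(strip_namespace_finalizers(stuck))
```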
B.8 ImagePullBackOff in a Private Registry¶
- `kubectl get secret -o yaml`: exists and well-formed?
- Pod's `spec.imagePullSecrets` references it?
- Secret type is `kubernetes.io/dockerconfigjson`?
- Decoded `.dockerconfigjson` has the right registry URL and credentials?
- From the node, can you `crictl pull` the image manually with the same creds?
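The "decoded" check involves two layers of base64: the Secret's `data` field, and the `auth` entry inside the Docker config. A small sketch that walks both and reports which user each registry entry carries; the registry, user, and password are placeholders, and `example_secret` exists only to build test input.

```python
import base64
import json


def registry_users(secret):
    """Map each registry in a dockerconfigjson Secret to its username."""
    if secret.get("type") != "kubernetes.io/dockerconfigjson":
        raise ValueError(f"unexpected Secret type: {secret.get('type')}")
    config = json.loads(base64.b64decode(secret["data"][".dockerconfigjson"]))
    users = {}
    for registry, entry in config.get("auths", {}).items():
        if "auth" in entry:
            # `auth` is base64("username:password").
            decoded = base64.b64decode(entry["auth"]).decode()
            users[registry] = decoded.split(":", 1)[0]
        else:
            users[registry] = entry.get("username", "")
    return users


def example_secret(registry, username, password):
    """Build a Secret-shaped dict for the demo (hypothetical helper)."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = json.dumps({"auths": {registry: {"auth": auth}}})
    return {
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": base64.b64encode(config.encode()).decode()},
    }


secret = example_secret("registry.example.com", "ci-bot", "not-a-real-password")
print(registry_users(secret))
```

A registry key that differs from the host in the image reference (for example a missing port) is a frequent reason the kubelet never offers the credentials.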
B.9 HPA Not Scaling¶
- `kubectl describe hpa`: events show why.
- Metrics available? `kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods"` for resource metrics; `kubectl get --raw "/apis/custom.metrics.k8s.io/..."` for custom.
- Pod requests set? HPA uses `requests` as the denominator; without them, percentage-based metrics are meaningless.
- `behavior` policies preventing fast scaling? Check `scaleUp.policies` and `stabilizationWindowSeconds`.
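The "requests as the denominator" point is easiest to see in the scaling arithmetic itself. The sketch below follows the documented HPA rule, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance; it ignores `behavior` policies, missing metrics, and not-ready Pods, and the parameter names are illustrative.

```python
import math


def desired_replicas(current_replicas, avg_usage, request,
                     target_utilization, tolerance=0.1):
    """Replica count the utilization-based HPA rule would ask for."""
    if request <= 0:
        # Without a request there is nothing to divide by.
        raise ValueError("utilization is undefined without a positive request")
    utilization = 100 * avg_usage / request       # percent of request in use
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas                   # close enough: no change
    return math.ceil(current_replicas * ratio)


# 4 Pods averaging 400m CPU against a 500m request, target 50% utilization.
print(desired_replicas(4, avg_usage=400, request=500, target_utilization=50))  # 7
```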
B.10 Mesh: 503 from Sidecar¶
(Istio specifics, but general patterns apply)
1. Service has Endpoints?
2. mTLS mode strict, but caller without sidecar? Check `PeerAuthentication`.
3. `AuthorizationPolicy` denying the call?
4. `DestinationRule` with circuit-breaker tripped? `kubectl describe destinationrule`.
5. Envoy access log: `istioctl proxy-config log <pod> --level debug` and re-issue.
Appendix C - Contributing to Kubernetes¶
Kubernetes is one of the largest open-source projects in the world by contributor count. The flip side: the bureaucracy is real. This appendix is the on-ramp.
C.1 Mental Model¶
The Kubernetes project is governed by SIGs (Special Interest Groups): domain-scoped groups (sig-node, sig-network, sig-storage, sig-api-machinery, sig-scheduling, sig-cli, sig-auth, etc.). Each SIG has chairs, technical leads, regular meetings, and a Slack channel. Almost every PR maps to a SIG.
Major changes go through KEPs (Kubernetes Enhancement Proposals): design docs in the kubernetes/enhancements repo. KEPs progress through provisional → implementable → implemented → deprecated over multiple releases.
Implications for newcomers:
1. Find the right SIG before opening a PR.
2. For non-trivial changes, write or piggyback on a KEP first.
3. The cycle time is slow. Two-week review is normal; six-week is common.
C.2 Setting Up¶
The build is heavy (it's all of Kubernetes); plan for ~10 minutes the first time.
For tests:
make test # unit
make test-integration KUBE_TEST_ARGS="-run <name>"
make test-e2e KUBE_TEST_ARGS=...
For local dev cluster: kind (uses your local Docker / containerd, spins up a multi-node cluster in containers in seconds).
C.3 Where the Easy Wins Are¶
C.3.1 Documentation¶
- `kubernetes/website` (the website repo). Docs improvements are welcome and reviewed quickly.
C.3.2 e2e flakes¶
- The `kind/flake` label on `kubernetes/kubernetes` issues. Fixing flakes is unglamorous but high-impact.
C.3.3 kubectl¶
- `staging/src/k8s.io/kubectl`: small, contained. Bug fixes and small features are tractable.
C.3.4 client-go improvements¶
- `staging/src/k8s.io/client-go`: used by every controller in the world. Improvements compound.
C.3.5 SIG-specific work¶
- Pick a SIG matching your interest. Their backlogs have `good first issue` labels.
C.3.6 Don't start here (yet)¶
- Scheduler core (high stakes; small SIG; deep changes need KEPs).
- API machinery (apiserver internals, conversion, validation).
- kubelet (touches every node; PR latency is high for safety).
- Anything touching scaling / performance critical paths.
C.4 The First-PR Workflow¶
- Find an issue. Read `CONTRIBUTING.md` and SIG-specific contribution guides. Comment `/assign` to claim.
- Branch from master.
- Implement. Run `make verify`, `make test`, and the relevant `make test-integration` subset.
- Commit. kubernetes/kubernetes gates PRs on a signed CNCF CLA rather than a DCO `Signed-off-by` trailer, so sign the CLA before opening the PR.
- Open the PR. SIG bots will auto-assign reviewers. Use the PR template; fill in every field.
- CI cycle. Tests run on Prow. Re-run with `/test all`. Address comments.
- Approval flow: a reviewer adds `/lgtm`; an approver adds `/approve`. Both required. The bot then merges.
- Backport (if applicable): for bugfixes, the PR may need cherry-picks to release branches. Use the cherry-pick robot or do it manually.
C.5 The KEP Process¶
For changes that:
- Add or modify the API.
- Have user-visible behavior changes.
- Affect multiple SIGs.
Process:
1. Open an issue in the relevant SIG.
2. Get at least informal agreement that the problem is real.
3. Write a KEP using the template in `kubernetes/enhancements/keps/NNNN-kep-template/`.
4. Submit as a PR. The KEP itself goes through review.
5. Once implementable, you can submit code PRs referencing the KEP number.
6. KEPs target a Kubernetes release (alpha → beta → GA over multiple releases).
Time scale: months to a year for a substantial KEP.
C.6 Adjacent Targets if k/k Is Too Heavy¶
The CNCF ecosystem has many high-impact projects with smaller surface area:
| Project | Bar |
|---|---|
| kubectl plugins (krew) | Low. Author your own; submit to krew index. |
| kind | Low–Medium. Friendly maintainers. |
| kustomize | Medium. |
| Helm | Medium. Larger team. |
| ArgoCD / Flux | Medium. Active. |
| Cilium | Medium–High. |
| Crossplane | Medium. Welcoming to providers. |
| Operator SDK / Kubebuilder | Medium. |
| OpenTelemetry (collector + Operator) | Medium. |
A merged contribution to any of these is a credible Kubernetes-ecosystem signal in interviews.
C.7 Calibration¶
A reasonable goal for a curriculum graduate:
- By end of week 23: a PR open against `kubernetes/website`, a kubectl plugin, or a small fix to an ecosystem project.
- By end of capstone: that PR merged.
- 6 months post-curriculum: a substantive contribution, such as a kubectl feature, a new operator, or a Cilium policy plugin.
Patient contributors become trusted contributors. Trusted contributors become reviewers. Reviewers become approvers. Approvers become SIG chairs. The path exists; it just takes time.
Capstone Projects - Three Tracks, One Choice¶
Pick one. The work performed here is what you describe in interviews.
Track 1 - Hard Way: A Production-Grade Cluster From Scratch¶
Outcome: a multi-node Kubernetes cluster brought up on bare metal or cloud VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy, GitOps-managed workloads, and a documented operational runbook.
Functional spec¶
- 3 control-plane nodes + 3 workers (minimum). HAProxy or cloud LB in front of the apiservers.
- etcd with mTLS, encryption-at-rest, scheduled snapshots to off-cluster storage.
- CNI: Cilium with kube-proxy replacement, Hubble enabled.
- Service mesh: Istio (ambient) or Linkerd, mTLS strict between services.
- Storage: a real CSI driver (local-path for dev; OpenEBS / Longhorn / cloud CSI for "real").
- Observability: Prometheus + Grafana + Loki + Tempo + OTel Collector.
- GitOps: ArgoCD or Flux managing platform addons.
- Policy: Pod Security `restricted`, NetworkPolicy default-deny, Kyverno or Gatekeeper enforcing org rules.
- Backup: Velero scheduled, restore tested.
Non-functional spec¶
- CIS benchmark green (`kube-bench` ≥90% pass).
- Cluster rebuild from scratch in <2 hours via Ansible/Terraform.
- Zero-downtime kubelet upgrades (drain + replace pattern).
- A demo: kill a control-plane node; cluster recovers without intervention.
Acceptance¶
- Public repo with provisioning playbooks and runbooks.
- A 30-minute screencast walking the assessor through bring-up, an incident drill, and a tenant onboarding.
- A `RUNBOOK.md` covering: cluster provisioning, node addition/removal, etcd backup/restore, certificate rotation, upgrade procedure, top 5 incident types and remediation.
Skills exercised¶
- All months, but Months 1, 2, and 6 most heavily.
Track 2 - Platform: GitOps Multi-Tenant PaaS¶
Outcome: a self-service developer platform built on Kubernetes that demonstrates onboarding a new team in <30 minutes, with policy guardrails, infra-from-code, and full observability.
Functional spec¶
- Tenant model: each tenant gets a Namespace, ResourceQuota, LimitRange, RBAC bindings, default NetworkPolicy, monitoring scrape config, and GitOps Application (Argo) entry, all materialized from a single tenant claim (Crossplane Composition or Helm chart).
- Self-service: developers commit a `manifest.yaml` to their repo; ArgoCD/Flux picks it up; their app deploys.
- Policy: Kyverno or OPA Gatekeeper enforces: image signatures, no `latest` tags, mandatory labels, resource limits, no privileged Pods.
- Observability: each tenant's metrics/logs/traces are isolated (via labels and Loki/Prom multi-tenancy); a per-tenant Grafana folder with default dashboards.
- Cost attribution: OpenCost emits per-tenant cost; surface in a dashboard.
- Crossplane: a `Database` claim that materializes a real cloud database (or, for demo, a chart-deployed Postgres).
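To make "materialized from a single tenant claim" concrete, here is a sketch that expands a tenant name into four of the objects listed in the tenant model. It is a plain-Python stand-in for what a Crossplane Composition or Helm chart would do; the object names, default quota values, and the `render_tenant` function are all hypothetical.

```python
import json


def render_tenant(name, cpu="4", memory="8Gi"):
    """Expand one tenant claim into the baseline objects for its namespace."""
    labels = {"tenant": name}

    def meta(object_name):
        return {"name": object_name, "namespace": name, "labels": labels}

    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": name, "labels": labels}},
        {"apiVersion": "v1", "kind": "ResourceQuota", "metadata": meta("tenant-quota"),
         "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}}},
        {"apiVersion": "v1", "kind": "LimitRange", "metadata": meta("tenant-defaults"),
         "spec": {"limits": [{"type": "Container",
                              "defaultRequest": {"cpu": "100m", "memory": "128Mi"}}]}},
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": meta("default-deny-all"),
         "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}},
    ]


for obj in render_tenant("acme"):
    print(json.dumps(obj))
```

The design point is that onboarding becomes one input (the tenant name plus sizing) and everything else is derived, which is what makes the <30-minute target realistic.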
Non-functional spec¶
- Tenant onboarding: from "claim PR opened" to "deployed app reachable" in <30 minutes (target: <5).
- Failure isolation: a tenant exceeding quota does not affect other tenants.
- Compliance: every running Pod can be traced back to a git commit + signature verification.
Acceptance¶
- Public repo with platform manifests + tenant-onboarding template.
- Live demo: onboard a fresh tenant; deploy a sample app; demonstrate observability + policy denial; demonstrate quota enforcement.
- A `PLATFORM.md` describing the contract between platform team and tenants: versioning, deprecation, support, escalation.
Skills exercised¶
- Months 3 (operators / Crossplane), 5 (GitOps + IaC + autoscaling + admission), 6 (multi-tenancy).
Track 3 - Operator: Production-Quality Operator From Scratch¶
Outcome: a non-trivial operator that manages a stateful application or external system, complete with backup/restore, upgrades, observability, and a thoughtful API.
Suggested scopes¶
- `elasticsearch-mini-operator`: manage Elasticsearch clusters with auto-scaling, snapshot lifecycle, index lifecycle policies.
- `postgres-mini-operator`: with automatic failover (using the Postgres replication primitives), backup/restore via WAL-G to S3, point-in-time recovery.
- `saas-resource-operator`: manage external SaaS resources via Crossplane composition (e.g., a `GitHubRepo` operator complete with branch protection, secret scanning, codeowners).
Acceptance¶
- Public repo, written with `controller-runtime` + Kubebuilder.
- CRDs versioned (v1alpha1 + v1beta1 + conversion webhook).
- Status conditions, finalizers, owner references, all idiomatic.
- Comprehensive RBAC (least-privilege, generated from kubebuilder markers).
- Mutating + validating admission webhooks.
- E2E tests (Ginkgo + envtest, plus a kind-based suite).
- Helm chart or kustomize manifests for installation.
- Observability: Prometheus metrics, structured logs (logr), OTel traces.
- Helm-test-style upgrade tests across three operator versions.
- Documentation: design rationale, API reference, examples.
Skills exercised¶
- Months 3 (operators), 4 (storage if stateful), 5 (admission), 6 (defense).
Cross-Track Requirements¶
- `cluster-baseline/` template (Appendix A) integrated.
- ADRs (≥3).
- `THREAT_MODEL.md`.
- `RUNBOOK.md`.
- Defense readiness: 60-minute walkthrough.
The track choice signals career direction: Track 1 for SRE/cluster-operator roles, Track 2 for platform-engineering roles, Track 3 for software-engineering-on-Kubernetes roles. Pick based on where you want the next interview loop.
Workshop - Bootstrap a Kubernetes control plane by hand¶
Companion to Kubernetes -> Month 06 -> Weeks 21-22: Bootstrap, Control Plane, and Worker Nodes ("Kubernetes the Hard Way"). Every other workshop built things on top of Kubernetes. This one goes underneath: you start etcd and the kube-apiserver as plain processes by hand, talk to your hand-built control plane with kubectl, then add the scheduler and a kubelet - and watch a Pod actually run. By the end, the single most important truth about Kubernetes is something you've proven with your own hands: the control plane is just a few Go binaries and a database.
~120 minutes - the capstone of the workshop series. Needs: a Linux host, root, and the Kubernetes binaries (etcd, kube-apiserver, kube-scheduler, kubelet, kubectl - download from the official releases). This deliberately does not use kind/kubeadm - the whole point is no automation.
What you'll build, and the idea it makes concrete¶
You'll assemble a working (single-node, skipping TLS where that is safe) Kubernetes control plane from individual binaries: etcd for storage, the apiserver as the front door, then the scheduler and kubelet so Pods actually run. No kubeadm, no kind, no installer - you start each process and wire them together yourself.
The idea this makes concrete:
Kubernetes is not a monolithic system - it's a handful of independent programs coordinating through one shared database (etcd) and one API (the apiserver). "The control plane" is:
`etcd` (the only stateful component - the entire cluster state lives here as key/values), `kube-apiserver` (the only thing that talks to etcd; everything else talks to it), `kube-scheduler` and `kube-controller-manager` (control loops that watch the apiserver and act - the loops you built in every prior workshop), and on each node a `kubelet` (runs containers) and `kube-proxy` (networking). Take away the installers and a cluster is these processes plus etcd. There is no magic - there is process supervision and a database.
Every prior workshop built a participant in this system (a controller, a scheduler, an operator). This workshop builds the system itself, revealing that your custom scheduler and the real one are peers - both just clients of the apiserver.
Step 0: the architecture you're about to assemble by hand¶
                         +------------------+
kubectl ---------------> |                  |
your custom controller-> |  kube-apiserver  | <----- the ONLY component that
kube-scheduler --------> | (the front door) |        reads/writes etcd
kube-controller-mgr ---> |                  |
kubelet (each node) ---> +--------+---------+
                                  | (the only etcd client)
                                  v
                            +-----------+
                            |   etcd    | <-- ALL cluster state lives here
                            | (the DB)  |     (every object, as key/value)
                            +-----------+
The non-obvious truths this layout encodes, which you'll verify:
- etcd is the only stateful thing. Every Pod, Deployment, Secret, your Website CR from the operator workshop - all of it is key/value pairs in etcd. Lose etcd, lose the cluster. Back up etcd, back up everything.
- Only the apiserver touches etcd. The scheduler, controllers, kubelet - none of them know etcd exists. They all go through the apiserver's REST API. This is why the apiserver is the central component and why everything is "just an API client."
- Everything else is a stateless client running a watch-and-act loop. Restart the scheduler and nothing is lost - it re-reads state from the apiserver. This is why control-plane components are easy to make HA (run several, lease-elect a leader - the controller workshop's leader election).
Step 1: start etcd - the cluster's brain¶
etcd is a distributed key/value store. For a single-node learning cluster, one instance:
$ etcd \
--data-dir=/tmp/etcd-data \
--listen-client-urls=http://127.0.0.1:2379 \
--advertise-client-urls=http://127.0.0.1:2379 &
It's just a process listening on :2379. Prove it's a plain key/value store - put and get a key directly:
$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 put /hello world
OK
$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 get /hello
/hello
world
That's the database the entire cluster will live in. Right now it holds your test key; in a minute it'll hold Kubernetes objects. (In production etcd runs as a 3 or 5 node Raft cluster for HA - Week 1's consensus material - but it's the same program.)
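Why 3 or 5 members and not 2 or 4 falls out of the Raft majority rule: a write needs acknowledgement from more than half the members. The arithmetic, as a small sketch:

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

Going from 3 to 4 members raises the quorum without tolerating any extra failure, which is why even sizes buy nothing.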
Step 2: start the apiserver - the front door¶
The apiserver is the REST API in front of etcd. Point it at your etcd:
# Workshop shortcuts: AlwaysAllow authorization and a trivial static token for kubectl.
# Real clusters use RBAC + TLS auth. (Comments live up here: text after a trailing "\" breaks the continuation.)
$ kube-apiserver \
  --etcd-servers=http://127.0.0.1:2379 \
  --service-cluster-ip-range=10.0.0.0/24 \
  --bind-address=127.0.0.1 \
  --secure-port=6443 \
  --authorization-mode=AlwaysAllow \
  --token-auth-file=/tmp/tokens.csv \
  ... (cert flags) &
(Real bootstrap requires a PKI - CA, certs for each component, TLS everywhere. "Kubernetes the Hard Way" spends a whole section on certificates; for this workshop you can use --authorization-mode=AlwaysAllow and minimal certs to focus on the architecture. The security hardening is Week 23.)
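The static token file passed to `--token-auth-file` is a CSV with at least three columns: token, user name, user UID, and optionally a quoted, comma-separated group list. A minimal generator sketch; the user `admin`, UID `1`, and the group names are placeholders for this workshop.

```python
import csv
import io
import secrets


def token_line(user, uid, groups=(), token=None):
    """Render one line of a static token file: token,user,uid,"g1,g2"."""
    token = token or secrets.token_hex(16)   # random token unless one is given
    row = [token, user, uid]
    if groups:
        # The csv module quotes this field because it contains commas.
        row.append(",".join(groups))
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()


print(token_line("admin", "1", ["workshop-admins", "workshop-users"]), end="")
```

Write the output to `/tmp/tokens.csv` before starting the apiserver; the file is read at startup, so changing tokens means restarting the process.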
Now point kubectl at your apiserver and talk to your hand-built control plane:
$ kubectl --server=https://127.0.0.1:6443 --token=<your-token> --insecure-skip-tls-verify get nodes
No resources found. # the apiserver answers! (no nodes yet - we haven't added a kubelet)
$ kubectl ... get namespaces
NAME STATUS AGE
default Active 10s
kube-system Active 10s <- the apiserver created the built-in namespaces in YOUR etcd
You're talking to a Kubernetes API you assembled from two processes. And here's the etcd connection made literal - look at what the apiserver wrote into etcd:
$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 get /registry --prefix --keys-only | head
/registry/namespaces/default
/registry/namespaces/kube-system
/registry/apiregistration.k8s.io/apiservices/v1.
...
Every Kubernetes object lives under /registry/... in etcd. When you kubectl get namespaces, the apiserver reads these keys. When you kubectl create anything, it writes here. You can see the entire cluster as key/value pairs - the abstraction is gone, it's a database.
Step 3: create an object, watch it land in etcd¶
Make the etcd-is-the-state truth undeniable. Create a ConfigMap via the API, then read it straight from etcd:
$ kubectl ... create configmap demo --from-literal=k=v
configmap/demo created
$ ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 get /registry/configmaps/default/demo
/registry/configmaps/default/demo
{"kind":"ConfigMap","apiVersion":"v1","metadata":{"name":"demo",...},"data":{"k":"v"}}
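The object you created through the API is sitting in etcd as a value at a predictable key. (With default apiserver settings the stored value is protobuf-encoded, so expect some binary noise around the readable fields; starting the apiserver with --storage-media-type=application/json stores plain JSON like the output shown.) kubectl -> apiserver -> etcd. That's the entire write path of Kubernetes, and you just watched a single object travel it. This is also why etcd backup = cluster backup: etcdctl snapshot save captures every object; restore it and the whole cluster comes back.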
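The "predictable key" can be written down as a function. This sketch covers only the layout visible in the examples here (`/registry/<resource>/<name>` for cluster-scoped objects, `/registry/<resource>/<namespace>/<name>` for namespaced ones); some built-in types use other prefixes, so treat it as a reading aid rather than a specification.

```python
def registry_key(resource, name, namespace=None):
    """etcd key for an object, following the layout seen in this workshop."""
    parts = ["/registry", resource]
    if namespace:
        parts.append(namespace)
    parts.append(name)
    return "/".join(parts)


print(registry_key("namespaces", "default"))
print(registry_key("configmaps", "demo", namespace="default"))
print(registry_key("pods", "hello", namespace="default"))
```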
Step 4: add the scheduler - now Pods can be placed¶
Right now if you created a Pod it would sit Pending forever (exactly the scheduler workshop's lesson - no scheduler, no placement). Start the real scheduler as another client of your apiserver:
# /tmp/scheduler.kubeconfig points at https://127.0.0.1:6443
$ kube-scheduler \
  --kubeconfig=/tmp/scheduler.kubeconfig \
  --bind-address=127.0.0.1 &
It's just another process pointed at the apiserver - the same way your custom scheduler connected in that workshop. The real scheduler and your mini one are peers: both watch the apiserver for unscheduled Pods and write bindings back. There's nothing privileged about the "official" one; it's a client like any other.
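The kubeconfig a component needs is small: a cluster (where the apiserver is), a user (how to authenticate), and a context tying them together. The sketch below prints one as JSON, which kubeconfig loaders accept because JSON is a subset of YAML. It assumes the workshop's token auth and skipped TLS verification; the names `workshop`, `scheduler`, and the token value are placeholders.

```python
import json


def kubeconfig(server, user, token, cluster="workshop"):
    """Minimal token-based kubeconfig for a workshop component."""
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster,
            # Workshop shortcut only: a real cluster sets certificate-authority.
            "cluster": {"server": server, "insecure-skip-tls-verify": True},
        }],
        "users": [{"name": user, "user": {"token": token}}],
        "contexts": [{"name": "default",
                      "context": {"cluster": cluster, "user": user}}],
        "current-context": "default",
    }


print(json.dumps(kubeconfig("https://127.0.0.1:6443", "scheduler", "<your-token>"), indent=2))
```

The same function with a different user name produces the kubelet's kubeconfig, which underlines the point: every component connects the way kubectl does.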
Step 5: add a kubelet - now Pods actually run¶
The control plane decides; the kubelet executes (the scheduler workshop's decoupling). Start a kubelet to turn this host into a worker node:
# --kubeconfig: how the kubelet registers with your apiserver
# --config: cgroup driver, container runtime endpoint, etc.
$ kubelet \
  --kubeconfig=/tmp/kubelet.kubeconfig \
  --config=/tmp/kubelet-config.yaml \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock &
The kubelet registers itself as a Node with the apiserver and starts watching for Pods assigned to it. Confirm your cluster now has a node:
$ kubectl ... get nodes
NAME STATUS ROLES AGE VERSION
myhost Ready <none> 20s v1.29.x <- your kubelet registered itself
Step 6: the payoff - run a Pod on the cluster you built¶
Everything's assembled: etcd (state) + apiserver (API) + scheduler (placement) + kubelet (execution). Create a Pod and watch it travel the entire pipeline you built by hand:
$ kubectl ... run hello --image=nginx
$ kubectl ... get pod hello -o wide -w
NAME READY STATUS NODE
hello 0/1 Pending <none> <- created, stored in etcd, not yet scheduled
hello 0/1 ContainerCreating myhost <- scheduler bound it; kubelet is starting it
hello 1/1 Running myhost <- kubelet started the container
Trace what just happened across the components you started:
1. kubectl run -> apiserver validated and wrote the Pod to etcd (spec.nodeName="").
2. The scheduler (watching the apiserver) saw an unscheduled Pod, picked the node, wrote the binding back through the apiserver to etcd.
3. The kubelet (watching the apiserver) saw a Pod assigned to it, called the container runtime, started nginx, and reported status back.
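The three steps can be mimicked with a toy model: one shared store standing in for the apiserver plus etcd, and two functions that each look at the store and act. This is a deliberately simplified simulation (no watches, no real binding subresource); all names are invented for the illustration.

```python
def scheduler_step(store, nodes):
    """Bind every Pod without a nodeName to the first available node."""
    for pod in store["pods"].values():
        if not pod["spec"].get("nodeName") and nodes:
            pod["spec"]["nodeName"] = nodes[0]


def kubelet_step(store, node):
    """Start every Pending Pod that has been assigned to this node."""
    for pod in store["pods"].values():
        if pod["spec"].get("nodeName") == node and pod["status"]["phase"] == "Pending":
            pod["status"]["phase"] = "Running"


# 1. "kubectl run": the Pod is stored with an empty nodeName.
store = {"pods": {"hello": {"spec": {"nodeName": ""},
                            "status": {"phase": "Pending"}}}}
kubelet_step(store, "myhost")        # nothing happens: not scheduled yet
scheduler_step(store, ["myhost"])    # 2. scheduler binds the Pod
kubelet_step(store, "myhost")        # 3. kubelet starts it
print(store["pods"]["hello"])
```

Neither function knows about the other; they coordinate only through the shared state, which is the property the real components have.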
A container is running on a Kubernetes cluster that you assembled from individual processes - no kubeadm, no kind, no cloud. Every step went through the apiserver; all state lives in etcd; the scheduler and kubelet are just clients running watch-and-act loops. You've seen the whole machine with the cover off.
Step 7: prove the resilience model - kill and restart components¶
The "stateless clients + one stateful store" design has a dramatic consequence you can demonstrate. Kill the scheduler and the apiserver - leave etcd alone:
$ kill %kube-scheduler %kube-apiserver # control plane "down"
$ kubectl ... get pods # fails - the API is gone
The connection to the server was refused
The API is down, but the running Pod keeps running (the kubelet is independent and the container doesn't care the control plane is down). Now restart the apiserver and scheduler:
$ kube-apiserver ... & # restart, pointed at the SAME etcd
$ kube-scheduler ... & # restart the scheduler as well
$ kubectl ... get pods
NAME READY STATUS AGE
hello 1/1 Running 5m <- still there! state was never lost
Nothing was lost, because all state was in etcd the whole time - the apiserver and scheduler are stateless, so restarting them just re-reads from etcd. This is why control-plane components are trivially HA (run several, they share etcd) and why the only thing you must back up is etcd. Lose a control-plane process: restart it. Lose etcd without a backup: lose the cluster. You just proved both halves.
Now extend it (toward the real "Hard Way")¶
- Add the PKI. Do it properly: a CA, TLS certs for every component, `--authorization-mode=Node,RBAC`. This is half of "Kubernetes the Hard Way" and where Week 23's security lives. Painful and illuminating.
- Add kube-controller-manager. Start it and watch the built-in controllers (Deployment, ReplicaSet, the ones from the controller workshop) come alive - now `kubectl create deployment` actually creates Pods.
- Add kube-proxy + a CNI. Wire up Service networking (kube-proxy) and pod networking (your CNI from that workshop) so Pods get IPs and Services route - a fully functional node.
- Multi-node etcd. Run a 3-member etcd cluster and kill one member; watch Raft keep the cluster available (Week 1 consensus, live).
- Then appreciate kubeadm/kind. Run `kubeadm init` or `kind create cluster` and recognize every step it automates - it's doing exactly what you just did by hand, plus the PKI and CNI.
What you might wonder¶
"Is this really how production clusters are built?" The components are identical - production runs exactly these binaries (etcd, apiserver, scheduler, controller-manager, kubelet, kube-proxy). What differs is automation and hardening: kubeadm/kops/managed-services (EKS/GKE/AKS) bootstrap them with proper PKI, HA topologies, and lifecycle management. Managed control planes run these same processes for you (you just don't see them). Doing it by hand once means you know what those tools and services are actually managing.
"Why is etcd so central / why all the fuss about backing it up?" Because etcd holds 100% of cluster state - every object, as you saw under /registry/. Everything else is stateless and reconstructible. So etcd is your single point of truth and your single point of catastrophic failure: a corrupted etcd with no backup is an unrecoverable cluster. This is why etcd runs as an HA Raft cluster (survive node loss) and why etcdctl snapshot save on a schedule is non-negotiable. The most important cluster backup is the etcd snapshot.
"The apiserver is the only thing talking to etcd - why that design?" Centralizing etcd access in the apiserver gives one place for validation, authorization, admission (the webhook workshop!), versioning, and the watch/cache machinery every controller depends on. If components talked to etcd directly, you'd have no consistent policy enforcement and no shared watch infrastructure. The apiserver as the sole etcd client is why admission control and RBAC can exist - they're enforced at that single chokepoint.
"Everything's a client of the apiserver - even the scheduler?" Yes, and this is the unifying insight of the whole workshop series. The scheduler, controller-manager, kubelet, your custom controller, your custom scheduler, kubectl, Argo CD - all of them are just API clients running watch-and-act loops against the apiserver. There's no privileged inner ring. This is why you could build a scheduler and a controller as ordinary programs: they're the same kind of thing as the "real" components. The control plane is a set of peers coordinating through one API and one database.
"Should I ever bootstrap by hand in real life?" Almost never - use kubeadm, a managed service, or a tool. But doing it once is the single best way to understand Kubernetes: it dissolves the "magic cluster" mental model into "processes + etcd + an API," which is what makes you able to debug a broken control plane (apiserver won't start? check its etcd connection and certs; Pods stuck Pending? is the scheduler running? Pods won't start? is the kubelet healthy?). The hand-build is for understanding, not production.
What this gave you¶
- You assembled a working Kubernetes control plane from individual processes: etcd + apiserver + scheduler + kubelet, no installer.
- You saw that cluster state is etcd - you read Kubernetes objects directly as key/values under /registry/.
- You watched a Pod travel the full pipeline you built: kubectl -> apiserver -> etcd -> scheduler -> kubelet -> running container.
- You proved the resilience model: killed and restarted stateless components with zero data loss, because all state is in etcd.
- You understand why etcd is the one thing to back up, why the apiserver is the sole etcd client (enabling RBAC/admission), and why everything is "just an API client."
- You can now debug a broken control plane, because you know it's processes + a database, not magic.
The workshop series, complete¶
You've now built, by hand, the core of every layer of Kubernetes:
- a controller (the reconcile loop - the atom of the platform),
- an operator with a CRD (extending the API),
- admission webhooks (intercepting the API),
- a scheduler (deciding placement),
- pod networking / a CNI (connectivity),
- a GitOps sync loop (git as truth),
- an autoscaler (feedback control),
- and now the control plane itself (processes + etcd).
The thread through all of them: Kubernetes is a set of control loops coordinating through one API backed by one database. Every component - built-in or custom, control-plane or yours - is a client watching the apiserver and acting to reconcile desired state with actual. You don't just use Kubernetes now; you understand how it's built, because you've built each piece.
Back to the Hard-Way Capstone month, or revisit the controller workshop where the series began.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with the scripts you used to start each component
- Output of `kubectl get nodes && kubectl get pods` against your hand-built control plane
- etcdctl output showing a Pod object you created sitting at `/registry/pods/...`
- Note on what you killed and restarted to prove the resilience model
Workshop - Build a Kubernetes controller from scratch¶
Companion to Kubernetes -> Month 03 -> Week 9: client-go Internals and a Bare Controller. The chapter explains the informer + workqueue pattern. This workshop has you build a real, running controller in Go and then watch it reconcile - including the moment that makes Kubernetes finally click: you delete something it manages, and it heals it back. By the end you'll have a working prototype and the one mental model the entire platform is built on.
~90 minutes. Needs: a local cluster (kind or k3d - free, Kubernetes-in-Docker), Go 1.21+, and kubectl. No cloud, no cost.
What you'll build, and the idea it makes concrete¶
You'll build a controller that watches ConfigMaps labeled workshop.io/managed=true and ensures each one has a companion ConfigMap named <name>-synced holding a copy of its data. Small on purpose - because the point isn't the feature, it's watching the reconcile loop work.
Here's the one idea, and why building beats reading it:
Kubernetes is declarative and level-triggered. You declare desired state; controllers continuously drive actual state toward it. A controller doesn't react to "an event happened" - it asks "what should exist? what does exist? make them match," over and over, forever. That's why you can delete a Pod and the Deployment brings it back, why kubectl apply is idempotent, why the system self-heals.
You can read that paragraph ten times and not feel it. You'll feel it in Step 7 when you delete a ConfigMap and watch your own code recreate it within a second. Building a controller is the "build a container by hand" of Kubernetes - it dispels the magic by making you the magician.
Step 0: spin up a cluster¶
$ kind create cluster --name workshop
$ kubectl cluster-info --context kind-workshop
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
workshop-control-plane Ready control-plane 30s v1.29.x
A full Kubernetes cluster, running in Docker on your laptop, in ~20 seconds. (k3d is equally fine: k3d cluster create workshop.) Everything in this workshop runs against it.
Step 1: the mental model - level vs edge triggered¶
Before code, fix the distinction that everything hinges on:
- Edge-triggered (the wrong model for K8s): "when X changes, do Y." Fragile - if you miss the event (controller was down, a message dropped), you never act, and state drifts forever.
- Level-triggered (how K8s works): "whenever you run, make actual match desired - regardless of what changed or whether you saw the change." Robust - a missed event is harmless because the next reconcile fixes it anyway.
A Kubernetes controller is a level-triggered reconcile loop:
loop forever:
desired = what the spec says should exist
actual = what's really in the cluster
if actual != desired:
make actual match desired
Events (a ConfigMap was created/changed/deleted) are just hints to reconcile sooner - not instructions about what to do. The reconcile function always recomputes from scratch. Hold this; the code is a direct expression of it.
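The loop above can be sketched in a few lines of plain Go. This is a toy under stated assumptions, not client-go: "the cluster" is just a map, and the reconcile function and its change counter are hypothetical names chosen for illustration. It shows the property that matters: reconcile only compares the two states, so a change it never heard about is repaired on the next pass.

```go
package main

import "fmt"

// reconcile drives actual toward desired and returns how many changes it made.
// It never looks at events: it only compares the two states.
func reconcile(desired, actual map[string]string) int {
	changes := 0
	// Create or fix anything that is missing or different.
	for name, want := range desired {
		if got, ok := actual[name]; !ok || got != want {
			actual[name] = want
			changes++
		}
	}
	// Remove anything that should not exist.
	for name := range actual {
		if _, ok := desired[name]; !ok {
			delete(actual, name)
			changes++
		}
	}
	return changes
}

func main() {
	desired := map[string]string{"colors-synced": "sky=blue"}
	actual := map[string]string{}

	fmt.Println("first pass changes:", reconcile(desired, actual))  // creates the object
	fmt.Println("second pass changes:", reconcile(desired, actual)) // nothing left to do

	// Simulate a change nobody told us about (a "missed event").
	delete(actual, "colors-synced")
	fmt.Println("after missed delete:", reconcile(desired, actual)) // heals it anyway
}
```

The second pass is the common case: nothing differs, so nothing happens. That no-op is what makes it safe to run the loop as often as you like.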
Step 2: scaffold the project¶
$ mkdir mirror-controller && cd mirror-controller
$ go mod init workshop/mirror-controller
$ go get k8s.io/client-go@v0.29.3 k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/klog/v2
Three libraries do the work: client-go (the Kubernetes Go client - informers, workqueues, clients), api (the typed objects like ConfigMap), apimachinery (the machinery: object metadata, errors, label selectors). The fourth module, klog, is only the logger used in the examples.
Step 3: connect to the cluster¶
First, just prove you can talk to the API. Create main.go:
package main
import (
"context"
"fmt"
"os"
"path/filepath"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
// Out-of-cluster: read your kubeconfig (the same file kubectl uses).
kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
panic(err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
panic(err)
}
// Sanity check: list ConfigMaps in the default namespace.
cms, err := clientset.CoreV1().ConfigMaps("default").List(context.TODO(), metav1.ListOptions{})
if err != nil {
panic(err)
}
fmt.Printf("connected. %d configmaps in default\n", len(cms.Items))
}
That clientset is your typed door to every Kubernetes API. clientset.CoreV1().ConfigMaps(ns).Get/List/Create/Update/Delete(...) - the verbs you'd guess. You're now a Kubernetes client. But polling with List in a loop would hammer the API. Controllers use informers instead.
Step 4: the informer - watch, don't poll¶
An informer maintains a local, always-current cache of objects by watching the API, and calls your handlers when things change. It's the efficient "watch" that makes controllers cheap. Replace main with the informer wiring:
package main
import (
"os"
"path/filepath"
"time"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/klog/v2"
)
func main() {
kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
clientset, _ := kubernetes.NewForConfig(config)
// A factory builds informers that share one watch connection + cache.
// resync period 30s: re-deliver every cached object every 30s (belt-and-suspenders).
factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
cmInformer := factory.Core().V1().ConfigMaps().Informer()
// For now, just log what the informer sees, to watch it work.
cmInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { klog.Infof("ADD %s", key(obj)) },
UpdateFunc: func(old, new interface{}) { klog.Infof("UPDATE %s", key(new)) },
DeleteFunc: func(obj interface{}) { klog.Infof("DELETE %s", key(obj)) },
})
stop := make(chan struct{})
defer close(stop)
factory.Start(stop) // start watching
cache.WaitForCacheSync(stop, cmInformer.HasSynced) // wait for the initial list
klog.Info("cache synced, watching...")
<-stop // block forever
}
func key(obj interface{}) string {
k, _ := cache.MetaNamespaceKeyFunc(obj) // "namespace/name"
return k
}
Run it, then in another terminal create a ConfigMap and watch the informer notice:
$ go run .
cache synced, watching...
ADD kube-system/kube-root-ca.crt
ADD default/some-existing-cm
...
# other terminal:
$ kubectl create configmap demo --from-literal=hello=world
# back in the controller log, instantly:
ADD default/demo
UPDATE default/demo   # not instant: the 30s resync re-delivers each cached object as an update
$ kubectl delete configmap demo
DELETE default/demo
You're watching the informer's event stream in real time. But notice the problem: these handlers run inline on the informer's notification path, so a slow reconcile here would hold up every later notification, and a failed one would have nowhere to go to be retried. That's what the workqueue fixes - and it's where the level-triggered design lives.
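The cache half of an informer can be pictured with a small sketch. This is not client-go's implementation: the event and store types below are hypothetical, and a real informer also handles the initial list, reconnects, and resyncs. It only shows the idea that watch events are folded into a local store, and reads are then answered from memory.

```go
package main

import "fmt"

// event is a simplified watch notification (hypothetical type).
type event struct {
	Type string // "ADD", "UPDATE", or "DELETE"
	Key  string // "namespace/name"
	Data string
}

// store is a toy local cache that watch events keep current.
type store struct {
	items map[string]string
}

func newStore() *store { return &store{items: map[string]string{}} }

// apply folds one watch event into the cache.
func (s *store) apply(e event) {
	switch e.Type {
	case "ADD", "UPDATE":
		s.items[e.Key] = e.Data
	case "DELETE":
		delete(s.items, e.Key)
	}
}

// get plays the role of a lister: a local lookup, no API call.
func (s *store) get(key string) (string, bool) {
	v, ok := s.items[key]
	return v, ok
}

func main() {
	s := newStore()
	for _, e := range []event{
		{"ADD", "default/demo", "hello=world"},
		{"UPDATE", "default/demo", "hello=there"},
		{"ADD", "default/other", "a=b"},
		{"DELETE", "default/other", ""},
	} {
		s.apply(e)
	}
	v, ok := s.get("default/demo")
	fmt.Println(v, ok) // current value, served from memory
	_, ok = s.get("default/other")
	fmt.Println(ok) // deleted objects drop out of the cache
}
```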
Step 5: the workqueue - the heart of the pattern¶
The handlers shouldn't do the work. They should just drop a key (namespace/name) onto a queue. Separate workers pull keys and reconcile. This decouples "something changed" from "do the work," gives you retries with backoff, and dedupes (the same key queued twice = processed once). This is the controller pattern.
The key insight that makes it level-triggered: the queue holds keys, not events. A worker pulling key default/foo doesn't know or care whether foo was added, updated, or its companion was deleted - it just reconciles foo from scratch. Here's the full controller:
package main
import (
"context"
"crypto/sha256"
"encoding/hex"
"fmt"
"os"
"path/filepath"
"strings"
"time"
corev1 "k8s.io/api/core/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
listers "k8s.io/client-go/listers/core/v1"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/util/workqueue"
"k8s.io/klog/v2"
)
const (
managedLabel = "workshop.io/managed"
syncedSuffix = "-synced"
checksumAnn = "workshop.io/source-checksum"
)
type Controller struct {
client kubernetes.Interface
lister listers.ConfigMapLister // reads from the cache, never the API (fast)
synced cache.InformerSynced
queue workqueue.RateLimitingQueue
}
func main() {
kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
client, _ := kubernetes.NewForConfig(config)
factory := informers.NewSharedInformerFactory(client, 30*time.Second)
cmInformer := factory.Core().V1().ConfigMaps()
c := &Controller{
client: client,
lister: cmInformer.Lister(),
synced: cmInformer.Informer().HasSynced,
queue: workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
}
// Handlers enqueue a SOURCE key. The trick: a change to a companion
// (named foo-synced) enqueues its source (foo), so deleting a companion
// triggers reconcile of the source -> self-healing.
cmInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: c.enqueue,
UpdateFunc: func(old, new interface{}) { c.enqueue(new) },
DeleteFunc: c.enqueue,
})
stop := make(chan struct{})
defer close(stop)
factory.Start(stop)
c.Run(2, stop) // 2 workers
}
// enqueue maps any observed ConfigMap to the SOURCE key to reconcile.
func (c *Controller) enqueue(obj interface{}) {
cm, ok := obj.(*corev1.ConfigMap)
if !ok { // tombstone on delete
if t, isT := obj.(cache.DeletedFinalStateUnknown); isT {
cm, _ = t.Obj.(*corev1.ConfigMap)
}
if cm == nil { return }
}
name := cm.Name
if strings.HasSuffix(name, syncedSuffix) {
name = strings.TrimSuffix(name, syncedSuffix) // companion -> source
} else if cm.Labels[managedLabel] != "true" {
return // not ours, and not a companion: ignore
}
c.queue.Add(cm.Namespace + "/" + name)
}
func (c *Controller) Run(workers int, stop <-chan struct{}) {
defer c.queue.ShutDown()
if !cache.WaitForCacheSync(stop, c.synced) {
return
}
klog.Info("cache synced, controller running")
for i := 0; i < workers; i++ {
go func() {
for c.processNext() {
}
}()
}
<-stop
}
func (c *Controller) processNext() bool {
key, quit := c.queue.Get()
if quit {
return false
}
defer c.queue.Done(key)
if err := c.reconcile(key.(string)); err != nil {
klog.Errorf("reconcile %s failed, requeueing: %v", key, err)
c.queue.AddRateLimited(key) // retry with exponential backoff
return true
}
c.queue.Forget(key) // success: reset the backoff
return true
}
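To see why "keys, not events" gives you deduplication, here is a toy queue in plain Go. It is a sketch, not the client-go workqueue: keyQueue and its methods are hypothetical names, and the real queue additionally tracks keys that are currently being processed and adds rate limiting. The point it makes is narrow: three different events about one object collapse into a single pending key.

```go
package main

import "fmt"

// keyQueue is a toy FIFO that holds each key at most once while it waits.
type keyQueue struct {
	order   []string
	pending map[string]bool
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *keyQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest key; ok is false when the queue is empty.
func (q *keyQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many keys are waiting.
func (q *keyQueue) Len() int { return len(q.order) }

func main() {
	q := newKeyQueue()
	// Three different events about the same object collapse into one key.
	q.Add("default/colors") // source added
	q.Add("default/colors") // source updated
	q.Add("default/colors") // companion deleted, mapped back to the source
	q.Add("default/shapes")

	fmt.Println("queued keys:", q.Len())
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key) // the worker only ever sees the key
	}
}
```

Because the worker receives only a key, it has no choice but to look up current state and reconcile from scratch, which is exactly the level-triggered behavior described above.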
Step 6: the reconcile function - desired vs actual, made literal¶
Here's where the mental model becomes code. reconcile is handed a source key. It computes desired state and makes the cluster match - every time, from scratch:
func (c *Controller) reconcile(key string) error {
ns, name, _ := cache.SplitMetaNamespaceKey(key)
// 1. What's the SOURCE? (read from cache via the lister)
source, err := c.lister.ConfigMaps(ns).Get(name)
if apierrors.IsNotFound(err) {
// Source is gone. The companion has an owner reference to it,
// so Kubernetes garbage-collects the companion automatically.
klog.Infof("source %s gone; companion will be GC'd", key)
return nil
}
if err != nil {
return err
}
if source.Labels[managedLabel] != "true" {
return nil // not managed
}
// 2. What SHOULD the companion look like? (desired state)
sum := checksum(source.Data)
companionName := name + syncedSuffix
desired := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: companionName,
Namespace: ns,
Annotations: map[string]string{checksumAnn: sum},
// Owner reference: ties the companion's lifecycle to the source.
// Delete the source -> Kubernetes deletes this automatically.
OwnerReferences: []metav1.OwnerReference{{
APIVersion: "v1", Kind: "ConfigMap",
Name: source.Name, UID: source.UID,
Controller: boolPtr(true),
}},
},
Data: source.Data,
}
	// 3. What EXISTS? Make actual match desired.
	existing, err := c.lister.ConfigMaps(ns).Get(companionName)
	if apierrors.IsNotFound(err) {
		if _, err = c.client.CoreV1().ConfigMaps(ns).Create(context.TODO(), desired, metav1.CreateOptions{}); err != nil {
			return err // do not log success for a create that failed
		}
		klog.Infof("created companion %s/%s", ns, companionName)
		return nil
	}
	if err != nil {
		return err
	}
	// It exists - has the source drifted since we last synced?
	if existing.Annotations[checksumAnn] != sum {
		updated := existing.DeepCopy() // never mutate objects that belong to the cache
		updated.Data = source.Data
		if updated.Annotations == nil { // e.g. a hand-made companion with no annotations
			updated.Annotations = map[string]string{}
		}
		updated.Annotations[checksumAnn] = sum
		if _, err = c.client.CoreV1().ConfigMaps(ns).Update(context.TODO(), updated, metav1.UpdateOptions{}); err != nil {
			return err
		}
		klog.Infof("updated companion %s/%s (source changed)", ns, companionName)
		return nil
	}
// Already correct. Reconcile is a no-op. (This is the common case - and
// it's why running reconcile a thousand times is harmless: idempotent.)
return nil
}
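func checksum(data map[string]string) string {
	// Go randomizes map iteration order, so hashing entries in range order
	// could give a different checksum for the same data on each call.
	// XOR-ing one digest per entry is order-independent; this is for drift
	// detection only, not a security property.
	var acc [sha256.Size]byte
	for k, v := range data {
		d := sha256.Sum256([]byte(fmt.Sprintf("%d:%s=%s", len(k), k, v)))
		for i := range acc {
			acc[i] ^= d[i]
		}
	}
	return hex.EncodeToString(acc[:])[:12]
}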
func boolPtr(b bool) *bool { return &b }
Read reconcile against the mental model: get desired (the companion that should exist), get actual (what does exist), make them match - create if missing, update if drifted, do nothing if correct. It never asks "what event brought me here?" It just drives toward desired state. That is a Kubernetes controller. Everything else - Deployments, the scheduler, cert-manager, Argo - is this loop at larger scale.
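The "do nothing if correct" branch depends on the checksum being the same for the same data on every call, and Go deliberately randomizes map iteration order. One common way to get a stable result, sketched here as a standalone program, is to hash the entries in sorted key order. The name stableChecksum is hypothetical, and the sort import is an addition that is not in the workshop's import list.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// stableChecksum hashes the entries in sorted key order, so the same data
// always yields the same checksum regardless of map iteration order.
func stableChecksum(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// %q quotes each part, so {"a": "b=c"} and {"a=b": "c"} hash differently.
		fmt.Fprintf(h, "%q=%q;", k, data[k])
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	data := map[string]string{"sky": "blue", "grass": "green", "sun": "yellow"}
	first := stableChecksum(data)
	for i := 0; i < 100; i++ {
		if stableChecksum(data) != first {
			fmt.Println("checksum changed between calls")
			return
		}
	}
	fmt.Println("stable across 100 calls:", first)
}
```

With an unstable checksum, a multi-key ConfigMap could look "drifted" on a resync even though nothing changed, and the controller would issue needless updates.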
Step 7: run it and watch reconciliation - the payoff¶
Start the controller with go run . and leave it running. Then, in another terminal, open a watch so you see reconciliation live while you poke the cluster (for example, kubectl get configmaps --watch).
Create a managed ConfigMap:
$ kubectl create configmap colors --from-literal=sky=blue
$ kubectl label configmap colors workshop.io/managed=true
In the watch terminal, a colors-synced appears within a second - your controller saw the label, reconciled, and created the companion:
$ kubectl get configmap colors-synced -o jsonpath='{.data}{"\n"}'
{"sky":"blue"} # it copied the data
Now change the source and watch the copy follow:
$ kubectl patch configmap colors --type merge -p '{"data":{"sky":"orange"}}'
# controller log: updated companion default/colors-synced (source changed)
$ kubectl get configmap colors-synced -o jsonpath='{.data}{"\n"}'
{"sky":"orange"} # the copy synced
Step 8: the moment Kubernetes clicks - watch it self-heal¶
This is the whole workshop in one action. Delete the companion and watch your controller bring it back:
$ kubectl delete configmap colors-synced
configmap "colors-synced" deleted
# controller log, instantly:
# created companion default/colors-synced
$ kubectl get configmap colors-synced
NAME DATA AGE
colors-synced 1 1s <- it's BACK, 1 second old
You deleted it. It came back. You didn't tell the controller "recreate it" - you deleted the companion, the informer noticed, enqueued the source, reconcile ran, saw the companion was missing, and recreated it. This is exactly why deleting a Deployment's Pod doesn't make it go away (its controller recreates it), why kubectl apply is safe to re-run, why Kubernetes is self-healing. The desired state is the source of truth; the controller relentlessly enforces it. You just built that.
And the owner-reference payoff - delete the source, and the companion is garbage-collected automatically:
$ kubectl delete configmap colors
configmap "colors" deleted
$ kubectl get configmap colors-synced
Error from server (NotFound): configmaps "colors-synced" not found # auto-cleaned
You didn't write deletion logic for that path - the owner reference told Kubernetes "this companion belongs to that source," and the garbage collector did the rest. Owner references are how most controllers hand cleanup to Kubernetes.
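What the garbage collector does with an owner reference can be sketched in plain Go. This is a simplified model under stated assumptions: the object type and collect function are hypothetical, and the real collector works asynchronously against the API rather than over a slice. One detail it does mirror is that ownership is matched by UID, not by name.

```go
package main

import "fmt"

// object is a stripped-down stand-in for a Kubernetes object's metadata.
type object struct {
	Name     string
	UID      string
	OwnerUID string // empty means no owner
}

// collect returns the objects that survive one garbage-collection pass:
// anything whose owner UID no longer exists is dropped.
func collect(objs []object) []object {
	live := map[string]bool{}
	for _, o := range objs {
		live[o.UID] = true
	}
	var kept []object
	for _, o := range objs {
		if o.OwnerUID == "" || live[o.OwnerUID] {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	cluster := []object{
		{Name: "colors", UID: "uid-1"},
		{Name: "colors-synced", UID: "uid-2", OwnerUID: "uid-1"},
	}
	fmt.Println(len(collect(cluster)), "objects while the source exists")

	// Delete the source: only the companion is left, pointing at a dead UID.
	cluster = cluster[1:]
	fmt.Println(len(collect(cluster)), "objects after the source is deleted")
}
```

Matching on UID is why the reconcile function copies source.UID into the owner reference: a new ConfigMap that happens to reuse the name colors is a different object and does not adopt the old companion.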
Step 9: break it - see the failure mode and retry¶
Watch the workqueue's retry/backoff handle errors. Temporarily make reconcile fail: add return fmt.Errorf("boom") at the top of reconcile, re-run, create a managed ConfigMap. The log shows the backoff:
reconcile default/colors failed, requeueing: boom
reconcile default/colors failed, requeueing: boom # ~5ms later
reconcile default/colors failed, requeueing: boom # ~10ms later (exponential)
reconcile default/colors failed, requeueing: boom # ~20ms later, doubling each time...
AddRateLimited requeues with exponential backoff, so a transiently-failing reconcile retries patiently instead of hot-looping. Remove the boom line and the next retry succeeds and Forgets the key (resetting backoff). This resilience - keep trying, back off, never give up - is why controllers survive flaky APIs and transient errors. It's free from the workqueue.
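The per-key backoff idea is small enough to sketch directly. This is a toy, not the workqueue's rate limiter: the backoff type and its methods are hypothetical, and the 5ms base and 1000s cap are assumptions modeled on the per-item limiter inside client-go's DefaultControllerRateLimiter (which also combines it with an overall token bucket that this sketch leaves out).

```go
package main

import (
	"fmt"
	"time"
)

// backoff is a toy per-key exponential rate limiter.
type backoff struct {
	base, max time.Duration
	failures  map[string]int
}

func newBackoff(base, max time.Duration) *backoff {
	return &backoff{base: base, max: max, failures: map[string]int{}}
}

// When returns how long to wait before retrying key, and counts the failure.
func (b *backoff) When(key string) time.Duration {
	n := b.failures[key]
	b.failures[key] = n + 1
	d := b.base
	for i := 0; i < n; i++ {
		d *= 2
		if d >= b.max {
			return b.max // never wait longer than the cap
		}
	}
	return d
}

// Forget resets the failure count after a successful reconcile.
func (b *backoff) Forget(key string) { delete(b.failures, key) }

func main() {
	// Assumed values: 5ms base delay, 1000s cap.
	b := newBackoff(5*time.Millisecond, 1000*time.Second)
	for i := 0; i < 5; i++ {
		fmt.Println("retry default/colors after", b.When("default/colors"))
	}
	b.Forget("default/colors")
	fmt.Println("after Forget:", b.When("default/colors"))
}
```

Failures are counted per key, so one object that keeps failing backs off on its own without slowing down reconciles for every other key.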
Step 10: the hardening slice - minimum RBAC¶
Running locally you used your admin kubeconfig. To run in the cluster, the controller needs a ServiceAccount with the least privilege that works - exactly what it does and no more:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mirror-controller
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update"] # exactly our verbs
The discipline: grant the verbs the controller actually calls (we get/list/watch to observe, create/update to reconcile) and nothing else. Note there is no delete: the controller never deletes anything itself, because the garbage collector removes companions through their owner references. Over-broad RBAC on a controller is a real security hole - a compromised controller can do whatever its ServiceAccount allows. Minimum privilege is the rule.
Now extend it (the real Week 9 lab)¶
You've built the core. Extend it into the full lab to cement the pattern:
- Cross-namespace mirror. Instead of a companion in the same namespace, mirror managed ConfigMaps into every namespace matching a prefix. You'll watch Namespaces too (a second informer) and enqueue affected sources when a new namespace appears. (Owner references don't cross namespaces - you'll handle cleanup yourself, which teaches why.)
- Leader election. Run two replicas; add tools/leaderelection so only one acts (a Lease is the lock). Kill the leader, watch the standby take over - HA controllers.
- Metrics + health. Expose queue depth and reconcile duration on /metrics, add /healthz and /readyz. This is what makes a controller operable.
- Graduate to controller-runtime. Rebuild this in controller-runtime/kubebuilder (Week 10) and see how much of the boilerplate it hides - now that you know what it's hiding, because you built it by hand.
What you might wonder¶
"Why client-go directly instead of controller-runtime/kubebuilder?" Because building it raw shows you the machinery - informer, lister, workqueue, reconcile - that every framework hides. controller-runtime is what you'll use in production (it's less code), but if you start there, the reconcile loop is magic. Build it once by hand (this workshop), then let the framework do it (the operator workshop). You can only appreciate what kubebuilder saves you after you've written the boilerplate yourself.
"Why a workqueue? Why not just do the work in the event handler?" Three reasons, all of which you saw: decoupling (handlers stay fast, work happens on workers), retries with backoff (AddRateLimited - Step 9), and deduplication (the same key enqueued repeatedly is processed once). Doing work inline blocks the watch stream and gives you none of this. The workqueue is the controller pattern.
"Is reconcile really called over and over even when nothing changed?" Yes - the 30s resync re-delivers every object, so reconcile runs periodically even with no changes (plus on every event). That's the point: it's level-triggered, so a no-op reconcile (Step 6's "already correct" path) must be cheap and harmless. Idempotency is non-negotiable - reconcile must produce the same result no matter how many times it runs. A controller that isn't idempotent corrupts state on the second run.
"Why read from the lister (cache) instead of the API?" Calling the API in the hot path would hammer the apiserver - a controller watching thousands of objects reconciling constantly would melt it. The informer keeps a local cache; the lister reads from it (microseconds, no network). You only call the API to change things (create/update/delete), never to read in reconcile. This read-from-cache / write-to-API split is a core scaling principle.
"How does this scale to real controllers like Deployments?" Identically. The Deployment controller watches Deployments and ReplicaSets, and reconciles: desired replicas vs actual Pods, create/delete to match. The ReplicaSet controller watches ReplicaSets and Pods. The scheduler watches unscheduled Pods and binds them. Every one is the loop you just built - watch, diff desired vs actual, act, requeue. Once you've built one controller, you understand the entire control plane's shape.
What this gave you¶
- You built a real, running Kubernetes controller in Go from raw client-go - informer, lister, workqueue, reconcile.
- You watched it reconcile: create a source, the companion appears; change the source, the copy follows.
- You watched it self-heal - deleted the companion, your code brought it back - the moment that makes "declarative, level-triggered, self-healing" concrete instead of a slogan.
- You used owner references for automatic garbage collection.
- You saw the workqueue's retry/backoff handle failures, and the read-from-cache / write-to-API scaling split.
- You know minimum-privilege RBAC for a controller.
- You understand that every controller in Kubernetes - Deployments, the scheduler, every operator - is this exact loop.
This is the foundation for the rest of the Kubernetes workshops. Next, you'll extend the API itself: build an operator with a Custom Resource Definition, where you define a new kind of Kubernetes object and the controller that gives it meaning.
Back to the Controllers & Operators month for the conceptual frame.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your controller code (link it in your submission)
- Terminal log of the self-heal test - delete the companion ConfigMap, controller recreates it
- Short note (3 to 5 sentences) on the one thing that clicked for you
Workshop - Build a custom scheduler¶
Companion to Kubernetes -> Month 01 -> Week 3: The Scheduler. The chapter explains how the scheduler assigns Pods to nodes through filtering and scoring. This workshop has you build a working scheduler - a real program that watches for unscheduled Pods and binds them to nodes by your own logic - and watch your placement decisions take effect. By the end you'll understand that scheduling is "just" another control loop, and you'll have replaced one of the most mysterious-seeming control-plane components with ~120 lines of Go.
~90 minutes. Needs: kind/k3d (ideally a multi-node cluster), Go 1.21+, kubectl. Prerequisite: the controller-from-scratch workshop - a scheduler is a controller with a specific job.
What you'll build, and the idea it makes concrete¶
You'll build a minimal scheduler that watches for Pods requesting schedulerName: workshop-scheduler, picks a node by a simple policy (least-loaded by Pod count), and binds the Pod to it. Then you'll watch your scheduler place Pods - and watch a Pod sit Pending forever when your scheduler isn't running, proving you are the thing making placement happen.
The idea this makes concrete:
Scheduling is not magic, and not part of the kubelet - it's a separate control loop. The default kube-scheduler watches for Pods with no node assigned (spec.nodeName == ""), runs filter (which nodes can run this?) then score (which is best?), and writes the chosen node back via a Bind call. The kubelet on that node then notices "a Pod is assigned to me" and runs it. Placement and execution are decoupled. You can run multiple schedulers side by side, and a Pod picks one by schedulerName.
The controller workshop showed reconcile on ConfigMaps. This shows the same loop doing the job that feels most like core-Kubernetes-magic - and revealing it's the same pattern: watch, decide, act.
Step 0: how scheduling actually works¶
The mental model, because the magic dissolves once you see it:
1. You create a Pod. The apiserver stores it with spec.nodeName = "" (unscheduled).
2. A SCHEDULER (watching for unscheduled Pods) notices it.
3. Scheduler FILTERS: which nodes have enough CPU/mem, match nodeSelector, tolerate taints? -> feasible nodes
4. Scheduler SCORES: rank the feasible nodes by some policy -> best node
5. Scheduler BINDS: writes spec.nodeName = <best node> (a POST to pods/<name>/binding).
6. The KUBELET on that node sees a Pod assigned to it, pulls images, starts containers.
Two facts that surprise people:
- The scheduler never starts a container. It only decides and records the decision (the bind). The kubelet does the running. Decision and execution are different components.
- A Pod with no scheduler to handle it stays Pending forever. There's no fallback - if you set a schedulerName that no scheduler is watching for, the Pod just waits. You'll see this directly.
Your scheduler implements steps 2-5.
Step 1: a multi-node cluster (so placement is visible)¶
Placement is only interesting with more than one node. kind can make a multi-node cluster:
$ cat <<EOF | kind create cluster --name sched-workshop --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
$ kubectl get nodes
NAME STATUS ROLES AGE
sched-workshop-control-plane Ready control-plane 40s
sched-workshop-worker Ready <none> 25s
sched-workshop-worker2 Ready <none> 25s
sched-workshop-worker3 Ready <none> 25s
Three worker nodes - your scheduler will choose among them.
Step 2: the scheduler loop - watch unscheduled Pods¶
A scheduler is a controller whose "desired state" is "every Pod assigned to a node." Set up the project and the watch:
$ mkdir mini-scheduler && cd mini-scheduler
$ go mod init workshop/mini-scheduler
$ go get k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/client-go@v0.29.3
package main
import (
	// context, fmt, metav1, labels, and listers are used by schedule() in Step 3.
	"context"
	"fmt"
	"os"
	"path/filepath"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listers "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)
const schedulerName = "workshop-scheduler"
func main() {
kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
config, _ := clientcmd.BuildConfigFromFlags("", kubeconfig)
client, _ := kubernetes.NewForConfig(config)
factory := informers.NewSharedInformerFactory(client, 0)
podInformer := factory.Core().V1().Pods()
nodeInformer := factory.Core().V1().Nodes()
podLister := podInformer.Lister()
nodeLister := nodeInformer.Lister()
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
pod := obj.(*corev1.Pod)
// Only handle pods that ask for US and aren't scheduled yet.
if pod.Spec.SchedulerName != schedulerName || pod.Spec.NodeName != "" {
return
}
if err := schedule(client, podLister, nodeLister, pod); err != nil {
klog.Errorf("failed to schedule %s/%s: %v", pod.Namespace, pod.Name, err)
}
},
})
stop := make(chan struct{})
factory.Start(stop)
cache.WaitForCacheSync(stop, podInformer.Informer().HasSynced, nodeInformer.Informer().HasSynced)
klog.Infof("%s running", schedulerName)
<-stop
}
The trigger is the same as any controller: an informer fires when an unscheduled Pod (asking for our scheduler) appears. The work is schedule().
Step 3: filter and score - the decision¶
Here's the heart - the same filter-then-score the real scheduler does, simplified to one policy (fewest Pods wins):
func schedule(client kubernetes.Interface, podLister listers.PodLister,
nodeLister listers.NodeLister, pod *corev1.Pod) error {
nodes, err := nodeLister.List(labels.Everything())
if err != nil {
return err
}
// FILTER: which nodes can run this pod?
var feasible []*corev1.Node
for _, n := range nodes {
if isReady(n) && !hasBlockingTaint(n) { // (real schedulers also check cpu/mem/affinity)
feasible = append(feasible, n)
}
}
if len(feasible) == 0 {
return fmt.Errorf("no feasible node for %s", pod.Name)
}
// SCORE: pick the node running the fewest pods (least-loaded policy).
allPods, _ := podLister.List(labels.Everything())
countByNode := map[string]int{}
for _, p := range allPods {
if p.Spec.NodeName != "" {
countByNode[p.Spec.NodeName]++
}
}
best := feasible[0]
for _, n := range feasible[1:] {
if countByNode[n.Name] < countByNode[best.Name] {
best = n
}
}
// BIND: record the decision. THIS is what "scheduling" actually is.
binding := &corev1.Binding{
ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
Target: corev1.ObjectReference{Kind: "Node", Name: best.Name},
}
err = client.CoreV1().Pods(pod.Namespace).Bind(context.TODO(), binding, metav1.CreateOptions{})
if err != nil {
return err
}
klog.Infof("bound %s/%s -> %s (had %d pods)", pod.Namespace, pod.Name, best.Name, countByNode[best.Name])
	// The real scheduler also emits a Scheduled event here, so `kubectl describe pod`
	// shows the decision. This minimal version skips that step.
return nil
}
func isReady(n *corev1.Node) bool {
for _, c := range n.Status.Conditions {
if c.Type == corev1.NodeReady {
return c.Status == corev1.ConditionTrue
}
}
return false
}
func hasBlockingTaint(n *corev1.Node) bool {
for _, t := range n.Spec.Taints {
if t.Effect == corev1.TaintEffectNoSchedule {
return true
}
}
return false
}
The Bind call is the whole job: a POST to pods/<name>/binding that writes spec.nodeName. That's it - "scheduling a Pod" is setting one field. Everything else (filter, score) is just deciding which value to write. The real scheduler has dozens of filter and score plugins; yours has one of each, but the shape is identical.
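The "same shape, more plugins" claim can be made concrete with a toy pipeline in plain Go. This is a sketch under stated assumptions, not the scheduler framework's API: the node struct, the filter and scorer function types, and pick are hypothetical names, and the "bind" is just incrementing a counter in memory. It shows how adding a policy means appending a function, while the filter-then-score loop stays unchanged.

```go
package main

import "fmt"

// node is a simplified view of a cluster node (hypothetical type).
type node struct {
	Name    string
	Ready   bool
	Tainted bool
	Pods    int
}

// A filter says whether a node may run the pod; a scorer ranks it (higher wins).
type filter func(n node) bool
type scorer func(n node) int

// pick runs every filter, sums every scorer, and returns the best feasible node.
// Ties go to the node that appears first.
func pick(nodes []node, filters []filter, scorers []scorer) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		feasible := true
		for _, f := range filters {
			if !f(n) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		score := 0
		for _, s := range scorers {
			score += s(n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n.Name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{
		{Name: "control-plane", Ready: true, Tainted: true},
		{Name: "worker", Ready: true},
		{Name: "worker2", Ready: true},
		{Name: "worker3", Ready: true},
	}
	filters := []filter{
		func(n node) bool { return n.Ready },
		func(n node) bool { return !n.Tainted },
	}
	scorers := []scorer{
		func(n node) int { return -n.Pods }, // least-loaded: fewer pods scores higher
	}
	for i := 1; i <= 6; i++ {
		name, ok := pick(nodes, filters, scorers)
		if !ok {
			fmt.Println("no feasible node")
			return
		}
		for j := range nodes {
			if nodes[j].Name == name {
				nodes[j].Pods++ // "bind": record the placement
			}
		}
		fmt.Printf("pod%d -> %s\n", i, name)
	}
}
```

Run as written, the six pods land two per worker and never on the tainted control-plane node, because each placement changes the counts the next decision sees.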
Step 4: run it and watch your scheduler place Pods¶
Create a few Pods that ask for your scheduler:
$ for i in 1 2 3 4 5 6; do
kubectl run pod$i --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
done
Watch your scheduler's log - it's making decisions in real time:
bound default/pod1 -> sched-workshop-worker (had 0 pods)
bound default/pod2 -> sched-workshop-worker2 (had 0 pods)
bound default/pod3 -> sched-workshop-worker3 (had 0 pods)
bound default/pod4 -> sched-workshop-worker (had 1 pods) <- least-loaded: spreads them
bound default/pod5 -> sched-workshop-worker2 (had 1 pods)
bound default/pod6 -> sched-workshop-worker3 (had 1 pods)
Your least-loaded policy spread 6 Pods evenly across 3 nodes. Confirm from the cluster's side:
$ kubectl get pods -o wide --sort-by='{.spec.nodeName}'
NAME READY STATUS NODE
pod1 1/1 Running sched-workshop-worker
pod4 1/1 Running sched-workshop-worker
pod2 1/1 Running sched-workshop-worker2
pod5 1/1 Running sched-workshop-worker2
pod3 1/1 Running sched-workshop-worker3
pod6 1/1 Running sched-workshop-worker3
Two Pods per node, exactly as your scoring decided. And critically - kubectl describe pod pod1 shows the Node your scheduler assigned, followed by the kubelet's own events as it pulled the image and started the container. Your scheduler decided; the kubelet executed. You watched the decoupling.
Step 5: the proof that YOU are the scheduler - watch a Pod hang Pending¶
This is the "it clicks" moment. Stop your scheduler (Ctrl-C). Now create another Pod that asks for it:
$ kubectl run orphan --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
$ kubectl get pod orphan
NAME READY STATUS AGE
orphan 0/1 Pending 30s <- stuck. Forever.
$ kubectl describe pod orphan | grep -A2 Events
Events: <none> # NO scheduler touched it. No node assigned. It just waits.
The Pod is Pending with no events - because nothing is watching for workshop-scheduler Pods. There's no default fallback; a Pod whose named scheduler is absent waits indefinitely. Now restart your scheduler:
$ go run .
# instantly in the log:
bound default/orphan -> sched-workshop-worker (had 2 pods)
$ kubectl get pod orphan
NAME READY STATUS AGE
orphan 1/1 Running 3s <- your scheduler woke up and placed it
You watched a Pod sit unschedulable until your program ran, then get placed the instant it did. That's the proof that scheduling is a control loop you can own - the Pending Pod was waiting for you. This is also why a broken/overloaded scheduler manifests as Pods stuck Pending cluster-wide - now you know exactly why.
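The reason the Pod waits can be reduced to one predicate: a scheduler only picks up Pods that name it and have no node yet. A minimal sketch of that selection, assuming a simplified `Pod` struct and a hypothetical `pendingFor` helper (the real scheduler does this with an informer filter):

```go
package main

import "fmt"

// Pod holds the two fields that decide who schedules it
// (a hypothetical stand-in for corev1.Pod).
type Pod struct {
	Name          string
	SchedulerName string // spec.schedulerName
	NodeName      string // spec.nodeName; empty until some scheduler binds the Pod
}

// pendingFor returns the Pods the named scheduler is responsible for
// and has not placed yet.
func pendingFor(scheduler string, pods []Pod) []string {
	var out []string
	for _, p := range pods {
		if p.SchedulerName == scheduler && p.NodeName == "" {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "pod1", SchedulerName: "workshop-scheduler", NodeName: "sched-workshop-worker"},
		{Name: "orphan", SchedulerName: "workshop-scheduler"},
		{Name: "web", SchedulerName: "default-scheduler"},
	}
	// Only "orphan" is waiting for workshop-scheduler; the default scheduler never sees it.
	fmt.Println(pendingFor("workshop-scheduler", pods))
	fmt.Println(pendingFor("default-scheduler", pods))
}
```

If no process runs this loop for a given schedulerName, the matching Pods simply stay in the list.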
Step 6: break it - filter everything out¶
See the "no feasible node" path. Taint all workers so your filter rejects them:
$ kubectl taint nodes -l '!node-role.kubernetes.io/control-plane' workshop=blocked:NoSchedule
$ kubectl run blocked --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
# scheduler log: failed to schedule default/blocked: no feasible node for blocked
$ kubectl get pod blocked # Pending - your filter rejected every node
The Pod is unschedulable because no node passed the filter - exactly what happens with the real scheduler when resource requests exceed every node's capacity, or affinity/taints exclude everything. Remove the taint (kubectl taint nodes ... workshop-) and your scheduler (which retries via the informer's resync) places it. This is the difference between "no scheduler" (Step 5, no events) and "scheduler ran but found nowhere to put it" (Step 6, a FailedScheduling event in the real scheduler).
Now extend it¶
- Real resource filtering. Filter on actual CPU/memory: sum the requests of pods already on each node, compare to allocatable. This is the core of real scheduling - bin-packing by resources.
- Respect nodeSelector / affinity / taints+tolerations. Add these filters and watch pods land only where they're allowed. You'll appreciate how much the default scheduler does.
- Spread vs bin-pack. Add a second scoring policy (most-loaded = bin-pack, to empty nodes for scale-down) and make it configurable. This is the real spreading-vs-consolidation tradeoff.
- The Scheduler Framework plugin path. Instead of a standalone scheduler, write a plugin for the real kube-scheduler (a Filter or Score plugin). This is how production custom scheduling is done - you extend the battle-tested scheduler rather than replace it.
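For the resource-filtering extension, the check is a sum and a comparison. The sketch below assumes plain integers (millicores and bytes) instead of `resource.Quantity`; `Resources` and `fits` are hypothetical names for illustration.

```go
package main

import "fmt"

// Resources holds CPU in millicores and memory in bytes
// (simplified; real code would use resource.Quantity).
type Resources struct {
	MilliCPU int64
	MemBytes int64
}

// fits reports whether a Pod's requests fit into what is left on a node:
// allocatable minus the summed requests of Pods already assigned there.
func fits(allocatable Resources, running []Resources, pod Resources) bool {
	var used Resources
	for _, r := range running {
		used.MilliCPU += r.MilliCPU
		used.MemBytes += r.MemBytes
	}
	return used.MilliCPU+pod.MilliCPU <= allocatable.MilliCPU &&
		used.MemBytes+pod.MemBytes <= allocatable.MemBytes
}

func main() {
	const Mi = 1 << 20
	node := Resources{MilliCPU: 2000, MemBytes: 4096 * Mi}
	running := []Resources{{500, 1024 * Mi}, {1000, 2048 * Mi}}

	// 1500m + 250m CPU and 3072Mi + 512Mi memory are within allocatable.
	fmt.Println(fits(node, running, Resources{250, 512 * Mi}))
	// 1500m + 1000m CPU exceeds the node's 2000m.
	fmt.Println(fits(node, running, Resources{1000, 512 * Mi}))
}
```

Note that the comparison uses requests, not observed usage; that is what makes the decision reproducible from API objects alone.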
What you might wonder¶
"Should I ever run a custom scheduler in production?" Rarely as a replacement - the default scheduler is sophisticated (resource fit, affinity, topology spread, preemption) and hard to beat. But custom scheduling is real for specialized needs: gang scheduling (all-or-nothing for ML training jobs), hardware-topology-aware placement (GPUs/NUMA), or batch/HPC. The right path is almost always a Scheduler Framework plugin (extend the default), not a from-scratch scheduler. Building one here is to understand the default - so you can reason about why a Pod landed where it did, and why one is stuck Pending.
"Why is binding a separate API call, not just updating the Pod?" The binding subresource (pods/<name>/binding) exists specifically so scheduling is an atomic, auditable, permission-scoped operation - a component can be granted "bind pods" without "edit pods." It also lets the apiserver enforce that a Pod is bound exactly once. It's the same subresource-design reason status is separate from spec.
"How does the real scheduler avoid placing two pods on a node that only fits one?" It tracks assumed Pods (Pods it has bound but that the cache hasn't caught up on yet) so back-to-back decisions account for in-flight bindings - avoiding the race where it over-commits a node in the gap before the bind is observed. Your mini-scheduler has this race (it reads pod counts from the cache); the real one handles it. A good "now extend it" thought experiment.
"What's preemption?" When a high-priority Pod can't fit, the scheduler can evict lower-priority Pods to make room (then schedule the high-priority one). It's why PriorityClass matters. Your mini-scheduler doesn't preempt - it just fails when nothing fits. Preemption is one of the most complex parts of the real scheduler.
"Multiple schedulers really run at once?" Yes - that's why schedulerName exists. The default scheduler handles Pods with schedulerName: default-scheduler (the default); yours handles workshop-scheduler. They coexist, each watching for its own Pods. This is how you can run a specialized scheduler for batch jobs alongside the default for services, in the same cluster.
What this gave you¶
- You built a working scheduler: watch unscheduled Pods, filter feasible nodes, score, bind.
- You learned scheduling is recording a decision (the bind sets spec.nodeName) - the scheduler never runs containers; the kubelet does.
- You watched your least-loaded policy spread Pods across nodes, confirmed from the cluster.
- You watched a Pod hang Pending with no events until your scheduler ran - proving scheduling is a loop you own, and explaining cluster-wide Pending.
- You saw the "no feasible node" path and how it differs from "no scheduler."
- You know when custom scheduling is warranted (gang/topology/batch) and that the Scheduler Framework plugin path beats a from-scratch replacement.
Next: go below the control plane to the network - build pod networking and see how packets actually flow between Pods.
Back to the Control Plane month.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your scheduler code
- Multi-node kind cluster output showing your scheduler placing Pods (schedulerName set)
- Proof of the "hangs Pending until your scheduler runs" demo
- Short note on the scoring function you chose and why
Workshop - Build a GitOps sync loop¶
Companion to Kubernetes -> Month 05 -> Week 17: GitOps (ArgoCD and Flux). The chapter explains GitOps - git as the source of truth, continuously reconciled into the cluster. This workshop has you build a tiny GitOps engine - a loop that pulls manifests from a git repo and applies them, detects drift, and heals it - then watch it revert a manual change you make to the cluster. By the end you'll understand precisely what Argo CD and Flux do, because you'll have built the core in ~80 lines.
~75 minutes. Needs: kind/k3d, Go 1.21+, git, kubectl. Prerequisite: the controller-from-scratch workshop - GitOps is the reconcile loop with git as the desired state.
What you'll build, and the idea it makes concrete¶
You'll build a controller whose "desired state" isn't a Kubernetes object - it's a git repository. The loop: clone/pull the repo, apply the manifests it contains, and on every tick re-apply so any drift (someone kubectl edit-ed a Deployment, someone deleted a Service) is corrected back to what git says. Then you'll kubectl scale a Deployment by hand and watch your engine revert it within seconds.
The idea this makes concrete:
GitOps is the reconcile loop you already know, with git as the source of truth. Instead of "desired state lives in a Website CR," desired state lives in a git repo, and the controller continuously drives the cluster toward what's committed. The consequences are the whole value proposition: git is the audit log (every change is a commit), rollback is git revert, the cluster self-heals toward git (manual kubectl changes get reverted), and no human ever runs kubectl apply against production - they open a PR. Argo CD and Flux are this loop, industrialized with a UI, multi-repo/multi-cluster support, health checks, and sync waves.
The controller workshop reconciled a CR into objects. This reconciles a git repo into a whole cluster - same loop, different desired-state source, and it's the pattern modern platform teams run everything on.
Step 0: the GitOps model¶
Fix the shift in thinking. Traditional ("push") deployment: a human or CI runs kubectl apply / helm install to the cluster. GitOps ("pull") deployment: an agent in the cluster continuously pulls from git and reconciles:
PUSH (traditional):                           PULL (GitOps):
human/CI --kubectl apply--> cluster           git repo <--pull-- agent-in-cluster --apply--> cluster
- imperative, point-in-time                   - declarative, continuous
- drift accumulates silently                  - drift is detected and reverted
- "what's actually deployed?" = unknown       - "what's deployed?" = what's in git, always
- rollback = remember the old command         - rollback = git revert
- audit = who ran what when? = murky          - audit = git history
The agent runs the same reconcile loop as any controller: desired = manifests in git, actual = objects in the cluster, act = apply to make them match, forever. You're building that agent.
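The desired/actual/act comparison can be sketched as a pure function. This is an illustration under simplifying assumptions: object specs are reduced to strings, and `plan` is a hypothetical helper, not something kubectl or Argo CD exposes.

```go
package main

import (
	"fmt"
	"sort"
)

// plan compares desired state (from git) with actual state (from the cluster)
// and returns the actions needed to converge. Keys are object names; values
// stand in for the object's spec (a simplification for this sketch).
func plan(desired, actual map[string]string) []string {
	var actions []string
	for name, want := range desired {
		got, exists := actual[name]
		switch {
		case !exists:
			actions = append(actions, "create "+name)
		case got != want:
			actions = append(actions, "update "+name) // drift: revert to git
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name) // prune: not in git
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	desired := map[string]string{"deployment/web": "replicas=2", "service/web": "port=80"}
	actual := map[string]string{"deployment/web": "replicas=5", "configmap/old": "x"}
	for _, a := range plan(desired, actual) {
		fmt.Println(a)
	}
}
```

When desired and actual already match, the plan is empty, which is why running the loop forever is safe.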
Step 1: a git repo of manifests (the source of truth)¶
Make a local git repo holding the manifests your engine will deploy:
$ mkdir gitops-repo && cd gitops-repo && git init
$ mkdir manifests
$ cat > manifests/app.yaml <<EOF
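apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  labels:
    gitops.workshop.io/managed: "true"   # the engine's -l selector applies and prunes only labeled objects
spec:
  replicas: 2
  selector: {matchLabels: {app: web}}
  template:
    metadata: {labels: {app: web}}
    spec:
      containers:
      - {name: web, image: nginx:1.27-alpine}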
EOF
$ git add . && git commit -m "deploy web at 2 replicas"
$ cd ..
This repo is now the declared truth: "the cluster should run web at 2 replicas." Anyone wanting to change production changes this - via a commit/PR - not the cluster directly.
Step 2: the sync loop¶
The engine: every N seconds, pull the repo and apply its manifests. Set up the project:
$ mkdir mini-gitops && cd mini-gitops && go mod init workshop/mini-gitops
$ go get k8s.io/client-go@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/cli-runtime@v0.29.3
The core loop. Real GitOps engines use server-side apply through the dynamic client; for the workshop we shell out to kubectl apply for clarity and return to the API path under "Now extend it":
package main

import (
	"os"
	"os/exec"
	"path/filepath"
	"time"

	"k8s.io/klog/v2"
)

const (
	repoURL  = "/path/to/gitops-repo" // local for the workshop; a real URL in production
	repoDir  = "/tmp/gitops-checkout"
	manifest = "manifests"
	interval = 15 * time.Second
)

func main() {
	klog.Info("mini-gitops starting")
	for {
		if err := syncOnce(); err != nil {
			klog.Errorf("sync failed: %v", err)
		}
		time.Sleep(interval)
	}
}

func syncOnce() error {
	// 1. PULL: get the latest desired state from git.
	if err := pull(); err != nil {
		return err
	}
	// 2. APPLY: drive the cluster toward what git says. apply is idempotent and
	// declarative - re-applying the same manifests reverts any drift back to git.
	cmd := exec.Command("kubectl", "apply", "-f", filepath.Join(repoDir, manifest), "--prune",
		"-l", "gitops.workshop.io/managed=true")
	out, err := cmd.CombinedOutput()
	klog.Infof("apply:\n%s", out)
	return err
}

func pull() error {
	// Clone if the checkout is missing, else fetch+reset to origin (git is the truth, discard local).
	if _, err := os.Stat(filepath.Join(repoDir, ".git")); os.IsNotExist(err) {
		return exec.Command("git", "clone", repoURL, repoDir).Run()
	}
	if err := exec.Command("git", "-C", repoDir, "fetch", "origin").Run(); err != nil {
		return err
	}
	return exec.Command("git", "-C", repoDir, "reset", "--hard", "origin/HEAD").Run()
}
Two design choices that are GitOps:
- git reset --hard origin/HEAD - the local checkout is disposable; git is the truth. Never trust local state.
- kubectl apply --prune - apply also deletes cluster objects that are no longer in git (matched by the label). Without prune, removing a manifest from git wouldn't remove it from the cluster - git wouldn't be the complete truth. Prune makes the cluster exactly mirror git: present in git -> exists; absent from git -> deleted.
Every manifest must carry the label gitops.workshop.io/managed=true: the -l selector limits both what apply creates and what prune may delete, so an unlabeled manifest is skipped entirely. (In production, Argo/Flux's ownership tracking does this job.)
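The prune scoping rule can be stated as a small function: an object is a prune candidate only if it carries the managed label and is absent from git. The sketch below is an assumption-laden illustration; `Object` and `pruneCandidates` are hypothetical, and kubectl's real implementation differs in detail.

```go
package main

import (
	"fmt"
	"sort"
)

const managedLabel = "gitops.workshop.io/managed"

// Object is a minimal stand-in for a cluster object (hypothetical type).
type Object struct {
	Name   string
	Labels map[string]string
}

// pruneCandidates returns the names of cluster objects that carry the managed
// label but are absent from git. Unlabeled objects are never candidates.
func pruneCandidates(inGit map[string]bool, inCluster []Object) []string {
	var out []string
	for _, o := range inCluster {
		if o.Labels[managedLabel] != "true" {
			continue // not ours: leave it alone
		}
		if !inGit[o.Name] {
			out = append(out, o.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	managed := map[string]string{managedLabel: "true"}
	cluster := []Object{
		{Name: "deployment/web", Labels: managed},
		{Name: "deployment/old-api", Labels: managed},
		{Name: "deployment/coredns"}, // unlabeled: owned by someone else
	}
	inGit := map[string]bool{"deployment/web": true}
	fmt.Println(pruneCandidates(inGit, cluster))
}
```

The label check comes first on purpose: a mistake in the git listing should never be able to delete something the engine does not own.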
Step 3: run it and watch git become the cluster¶
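Start the engine with go run . from the mini-gitops directory (after pointing repoURL at your gitops-repo). On its first sync it pulls the repo and creates the Deployment. The cluster now matches git: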
$ kubectl get deployment web
NAME READY UP-TO-DATE AVAILABLE AGE
web 2/2 2 2 10s <- 2 replicas, as git declares
You didn't run kubectl apply - your engine did, from git. This is the GitOps inversion: the deploy happened because the repo said so, pulled by an in-cluster agent.
Step 4: the payoff - watch it heal drift¶
This is the moment GitOps clicks. Manually change the cluster, as a panicked engineer might at 3 AM:
$ kubectl scale deployment web --replicas=5
deployment.apps/web scaled
$ kubectl get deployment web
NAME READY UP-TO-DATE AVAILABLE
web 5/5 5 5 <- you scaled it to 5
Now wait one sync interval (~15s) and watch your engine's log and the cluster:
# engine log:
apply:
deployment.apps/web configured <- it noticed drift and re-applied git's spec
$ kubectl get deployment web
NAME READY UP-TO-DATE AVAILABLE
web 2/2 2 2 <- back to 2. Your manual change was REVERTED.
Your manual kubectl scale was undone, because git says 2 and the engine relentlessly enforces git. This is GitOps's superpower and its discipline: the cluster cannot drift from git, because anything not in git gets reverted. The only way to make web run 5 replicas is to commit that change:
$ cd gitops-repo
$ sed -i 's/replicas: 2/replicas: 5/' manifests/app.yaml
$ git commit -am "scale web to 5"
$ cd -
# wait one interval; engine log: deployment.apps/web configured
$ kubectl get deployment web
NAME READY UP-TO-DATE AVAILABLE
web 5/5 5 5 <- now it stays at 5, because GIT says 5
The change stuck because it's in git. This is the entire GitOps contract: the cluster is a function of the repo. Want to change production? Commit. Want to roll back? Revert the commit and watch the engine restore the previous state. Want to know what's deployed? Read the repo. Want an audit trail? It's the git log.
Step 5: rollback is git revert¶
Watch the rollback story - the operational win that sells GitOps to teams:
$ cd gitops-repo
$ git revert --no-edit HEAD # undo the "scale to 5" commit
$ cd -
# wait one interval
$ kubectl get deployment web
NAME READY AVAILABLE
web 2/2 2 <- reverted to 2, because git reverted
Rolling back a production change was git revert. No remembering the old kubectl command, no "what was the previous image tag?" - the previous state is in git history, and reverting the commit makes the engine restore it. This is why GitOps teams sleep better: every change and every rollback is a reviewable, auditable git operation.
Step 6: break it - delete a managed object¶
The other half of self-healing - deletion drift:
$ kubectl delete deployment web
deployment.apps "web" deleted
# wait one interval; engine log: deployment.apps/web created
$ kubectl get deployment web
NAME READY AVAILABLE
web 2/2 2 <- recreated, because git says it should exist
Deleting a git-managed object just makes the next sync recreate it - same self-healing as the controller workshop, now at the level of "the whole cluster mirrors the repo." To actually remove web, delete its manifest from git and let prune remove it:
$ cd gitops-repo && git rm manifests/app.yaml && git commit -m "remove web" && cd -
# next sync, with --prune: deployment.apps/web pruned (deleted)
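Present in git -> exists. Absent from git -> deleted. The cluster is exactly the repo. One caveat of shelling out: app.yaml is the only manifest here, so removing it leaves kubectl apply with nothing to apply (git also drops the now-empty manifests directory), and the command returns an error instead of pruning. Keep at least one other labeled manifest in manifests/ and the prune runs as shown.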
Now extend it¶
- Use the API, not kubectl. Replace the shell-out with server-side apply via the dynamic client (dynamicClient.Resource(gvr).Apply(...)). This is how real GitOps engines work - no kubectl subprocess.
- Status + health. Report sync status (last synced commit, in-sync vs drifted) as a CR or metrics, so you can see "is the cluster in sync with git?" at a glance - Argo's core UI feature.
- Webhook-triggered sync. Instead of polling every 15s, sync on a git webhook (push -> immediate reconcile), with polling as a fallback. Faster, less API churn.
- Then deploy Argo CD or Flux and recognize every concept: the Application/Kustomization CR is your repo+path config, "OutOfSync" is your drift detection, "Sync" is your apply, "auto-prune" is your --prune, "self-heal" is your revert-drift loop. You built their core; now you understand their knobs.
What you might wonder¶
"How is this different from running kubectl apply in CI?" CI apply is push and point-in-time: it applies once, then the cluster can drift freely until the next pipeline run, and CI needs cluster credentials. GitOps is pull and continuous: an in-cluster agent reconciles constantly (drift is reverted in seconds, not at the next deploy), the cluster pulls (no external system holds cluster creds), and the desired state is always exactly git. The continuous-reconcile + drift-revert is the difference, and it's why GitOps beats "apply in CI."
"What does --prune actually do, and why is it scary?" It deletes cluster objects that the engine manages (by label) but that are no longer in git - making the cluster exactly mirror git. It's essential (otherwise removing a manifest doesn't remove the object) but dangerous (a bad label selector or an accidental git rm can delete production). This is why Argo/Flux make auto-prune opt-in and track ownership carefully. Powerful, handle deliberately.
"Argo CD vs Flux - and do I need them if I built this?" Yes, use them in production - your 80 lines lack multi-repo/multi-cluster, a UI, health assessment, sync waves/hooks, RBAC, drift visualization, and battle-testing. Argo CD is UI-centric and app-centric; Flux is CRD/Kustomize-centric and composable. Building this workshop is to understand them - so "OutOfSync," "auto-sync," "self-heal," and "prune" are concepts you've implemented, not buzzwords.
"What about secrets? You can't commit those to git." The real wrinkle. You commit encrypted secrets (Sealed Secrets, SOPS) or reference an external secret store (External Secrets Operator pulling from Vault/cloud secret managers). GitOps + plaintext secrets would put credentials in git history - never do that. Secret management is the part of GitOps that needs a deliberate companion tool.
"Does the cluster reverting my manual changes ever get in the way?" During an incident you might need to make a fast manual change - and GitOps will revert it. The answer is either "make the fix in git (it's fast to commit)" or temporarily disable auto-sync for that app, fix manually, then reconcile git to match. The discipline (all changes via git) is the point, but mature tools give you an escape hatch for emergencies.
What this gave you¶
- You built a GitOps engine: pull a repo, apply it, continuously - the reconcile loop with git as desired state.
- You watched git become the cluster (deploy happened because the repo said so, via an in-cluster agent).
- You watched it heal drift - a manual kubectl scale reverted within seconds because git said otherwise.
- You changed production the GitOps way (commit), rolled back the GitOps way (git revert), and removed via prune.
- You understand the model: the cluster is a function of the repo - git is the audit log, rollback, and source of truth.
- You can map every Argo CD / Flux concept (OutOfSync, sync, self-heal, prune) onto what you built, and you know the secrets caveat.
Next: the autoscaling layer - build an HPA-like controller that watches a metric and scales a Deployment, and watch it react to load.
Back to the Platform & Day-2 month.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your GitOps engine code
- Demo of drift heal - manually scale a Deployment, watch the engine revert it
- Terminal log of a prune pass - a resource removed from git is deleted from the cluster
- Note on how you handle resources you do NOT own (prune scoping)
Workshop - Build an admission webhook¶
Companion to Kubernetes -> Month 05 -> Week 20: Admission Control (Webhooks, OPA Gatekeeper, Kyverno). The chapter explains that admission controllers can reject or rewrite objects before they're stored. This workshop has you build both kinds - a validating webhook that blocks bad objects and a mutating webhook that rewrites them - and watch the cluster enforce your rules at kubectl apply time. By the end you'll understand the API request path that every policy tool (Gatekeeper, Kyverno, the Pod Security admission) plugs into.
~90 minutes. Needs: kind/k3d, Go 1.21+, kubectl. Prerequisite: the controller-from-scratch workshop for the basic client/cluster mechanics.
What you'll build, and the idea it makes concrete¶
You'll build a webhook server that does two things to every Pod as it's created:
- Validating: reject any Pod that has no resource limits (a real production guardrail).
- Mutating: auto-inject a default label and a default securityContext onto Pods that lack them.
The idea this makes concrete:
Every write to the Kubernetes API passes through an admission chain before it's persisted to etcd. Admission webhooks are your hook into that chain: the apiserver calls your HTTPS endpoint with the object, and you reply "allow," "deny (with a reason)," or "allow, but here's a patch." This is synchronous and in-band - your webhook runs between kubectl apply and the object existing. It's how Istio sidecar injection, Gatekeeper policies, and Kyverno all work; Pod Security admission is a built-in plugin in the same chain.
The controller workshop showed you the reconcile path - reacting after objects exist. This shows you the admission path - intervening before they exist. Two different points in an object's life, two different powers.
Step 0: where admission sits in the request path¶
Fix the mental model first. When you kubectl apply a Pod, the apiserver runs it through a pipeline before storing it:
kubectl apply
     |
     v
apiserver: authentication -> authorization (RBAC) -> MUTATING admission -> schema validation -> VALIDATING admission -> etcd
                                                     ^^^^^^^^^^^^^^^^^^                         ^^^^^^^^^^^^^^^^^^^^
                                                     your mutating webhook                      your validating webhook
                                                     (can PATCH the object)                     (can only ALLOW/DENY)
Two key facts that explain everything:
- Mutating runs before validating. So a mutating webhook can add the resource limits, and then the validating webhook (or schema) sees the mutated object. Order matters.
- It's before etcd. A rejected object is never stored - it doesn't exist, there's nothing to clean up. This is fundamentally different from a controller, which acts on objects that already exist. Admission is prevention; controllers are correction.
You're going to build endpoints that the apiserver calls at those two arrows.
Step 1: cluster + project¶
$ kind create cluster --name webhook-workshop
$ mkdir admission-webhook && cd admission-webhook
$ go mod init workshop/admission-webhook
$ go get k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3
A webhook is just an HTTPS server that speaks the AdmissionReview protocol - you don't even need client-go for the core. The apiserver POSTs an AdmissionReview (containing the object), and you return an AdmissionReview (containing the verdict).
Step 2: the validating webhook - reject Pods without limits¶
The heart of a validating webhook: read the incoming object, decide allow/deny, respond. Create main.go:
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validate rejects Pods whose containers lack resource limits.
func validate(w http.ResponseWriter, r *http.Request) {
	review := readReview(r)
	req := review.Request
	if req == nil { // malformed or empty AdmissionReview
		http.Error(w, "AdmissionReview has no request", http.StatusBadRequest)
		return
	}
	var pod corev1.Pod // the object being created
	if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
		http.Error(w, "cannot decode Pod: "+err.Error(), http.StatusBadRequest)
		return
	}
	// The decision logic: every container must declare CPU + memory limits.
	allowed, reason := true, ""
	for _, c := range pod.Spec.Containers {
		if c.Resources.Limits.Cpu().IsZero() || c.Resources.Limits.Memory().IsZero() {
			allowed = false
			reason = fmt.Sprintf("container %q must set cpu and memory limits", c.Name)
			break
		}
	}
	// Build the response: allow, or deny with a human-readable reason.
	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: allowed}
	if !allowed {
		resp.Result = &metav1.Status{Message: reason} // shown to the user at apply time
	}
	writeReview(w, review, resp)
}
The contract is simple: Allowed: true lets it through; Allowed: false with a Result.Message blocks it and shows the user why. That message is what appears in the kubectl apply error - so make it actionable.
The plumbing (readReview/writeReview) handles the AdmissionReview envelope:
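// readReview decodes the AdmissionReview the apiserver POSTed. On a read or
// decode failure it returns an empty review (Request is nil).
func readReview(r *http.Request) *admissionv1.AdmissionReview {
	var review admissionv1.AdmissionReview
	body, _ := io.ReadAll(r.Body)
	json.Unmarshal(body, &review)
	return &review
}

// writeReview attaches the verdict to the review and sends it back as JSON.
func writeReview(w http.ResponseWriter, review *admissionv1.AdmissionReview, resp *admissionv1.AdmissionResponse) {
	review.Response = resp
	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}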
Step 3: the mutating webhook - inject defaults via a JSON patch¶
A mutating webhook returns the same allow/deny, plus an optional JSON Patch (RFC 6902) that the apiserver applies to the object. Add to main.go:
// mutate injects a default label and securityContext onto Pods that lack them.
func mutate(w http.ResponseWriter, r *http.Request) {
	review := readReview(r)
	req := review.Request
	if req == nil { // malformed or empty AdmissionReview
		http.Error(w, "AdmissionReview has no request", http.StatusBadRequest)
		return
	}
	var pod corev1.Pod
	if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
		http.Error(w, "cannot decode Pod: "+err.Error(), http.StatusBadRequest)
		return
	}
	// Build a JSON Patch: a list of operations the apiserver will apply.
	var patches []map[string]interface{}
	// Add a label if missing.
	if pod.Labels == nil {
		patches = append(patches, map[string]interface{}{
			"op": "add", "path": "/metadata/labels",
			"value": map[string]string{"workshop.io/injected": "true"},
		})
	} else if _, ok := pod.Labels["workshop.io/injected"]; !ok {
		patches = append(patches, map[string]interface{}{
			"op": "add", "path": "/metadata/labels/workshop.io~1injected", // ~1 escapes "/"
			"value": "true",
		})
	}
	// Force runAsNonRoot on the pod securityContext if unset. When a
	// securityContext already exists, add only the one field so the Pod's
	// other settings are not overwritten.
	if pod.Spec.SecurityContext == nil {
		patches = append(patches, map[string]interface{}{
			"op": "add", "path": "/spec/securityContext",
			"value": map[string]interface{}{"runAsNonRoot": true},
		})
	} else if pod.Spec.SecurityContext.RunAsNonRoot == nil {
		patches = append(patches, map[string]interface{}{
			"op": "add", "path": "/spec/securityContext/runAsNonRoot",
			"value": true,
		})
	}
	resp := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}
	if len(patches) > 0 {
		patchBytes, _ := json.Marshal(patches)
		pt := admissionv1.PatchTypeJSONPatch
		resp.Patch = patchBytes // the apiserver applies this to the object
		resp.PatchType = &pt
	}
	writeReview(w, review, resp)
}

func main() {
	http.HandleFunc("/validate", validate)
	http.HandleFunc("/mutate", mutate)
	// The apiserver requires HTTPS - serve with a cert (Step 4 generates it).
	if err := http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil); err != nil {
		panic(err) // e.g. missing cert files or port in use
	}
}
The mutation is expressed as a patch, not a modified object - you tell the apiserver "add this field," and it applies it. This is exactly how Istio injects its sidecar container into your Pods: a mutating webhook returning a patch. (Pod Security admission, by contrast, only validates; it never mutates.)
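To make the patch semantics concrete, here is a small sketch of what applying an RFC 6902 "add" operation does to a decoded object, including the `~1` escape. It is a teaching aid under stated limits: `applyAdd` is a hypothetical helper that handles object members only (no array indices), not the apiserver's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// applyAdd performs one RFC 6902 "add" on a decoded JSON object. It supports
// object members only (no array indices), which covers this workshop's patches.
func applyAdd(doc map[string]any, path string, value any) error {
	if !strings.HasPrefix(path, "/") {
		return fmt.Errorf("path %q must start with /", path)
	}
	tokens := strings.Split(path[1:], "/")
	cur := doc
	for i, tok := range tokens {
		// RFC 6901 unescaping: "~1" is "/", then "~0" is "~".
		tok = strings.ReplaceAll(strings.ReplaceAll(tok, "~1", "/"), "~0", "~")
		if i == len(tokens)-1 {
			cur[tok] = value
			return nil
		}
		next, ok := cur[tok].(map[string]any)
		if !ok {
			return fmt.Errorf("parent %q does not exist", tok)
		}
		cur = next
	}
	return nil
}

func main() {
	var pod map[string]any
	raw := `{"metadata":{"labels":{"app":"web"}},"spec":{}}`
	if err := json.Unmarshal([]byte(raw), &pod); err != nil {
		panic(err)
	}
	patches := []struct {
		path  string
		value any
	}{
		{"/metadata/labels/workshop.io~1injected", "true"},
		{"/spec/securityContext", map[string]any{"runAsNonRoot": true}},
	}
	for _, p := range patches {
		if err := applyAdd(pod, p.path, p.value); err != nil {
			fmt.Println("patch failed:", err)
			return
		}
	}
	out, _ := json.Marshal(pod)
	fmt.Println(string(out))
}
```

The "parent does not exist" error is the reason the mutating handler has two label cases: adding a single key under /metadata/labels only works when the labels map is already there.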
Step 4: the TLS requirement (and why webhooks are fiddly)¶
The apiserver only calls webhooks over HTTPS, and it must trust the webhook's certificate. This is the part that makes admission webhooks notoriously finicky - you need a cert whose CA the apiserver is told to trust. For the workshop, generate a self-signed CA + serving cert:
$ # generate CA + server cert for the service DNS name the apiserver will call
$ ./gen-certs.sh admission-webhook.default.svc # (a short openssl script; see note)
The serving cert's SAN must match the in-cluster Service DNS name (<service>.<namespace>.svc) the apiserver dials. The CA bundle gets embedded in the webhook configuration (Step 5) so the apiserver trusts it. In production you'd use cert-manager to issue and rotate these automatically (the operator workshop's ecosystem) - manual certs are the #1 source of "webhook not working" pain, which is why everyone uses cert-manager for it.
Deploy the webhook server as a Deployment + Service in the cluster (mounting the cert as a Secret):
$ kubectl create secret tls webhook-certs --cert=tls.crt --key=tls.key
$ kubectl apply -f webhook-deployment.yaml # Deployment + Service exposing :8443
Step 5: register the webhooks - tell the apiserver to call you¶
The apiserver doesn't know about your webhook until you register it with a ValidatingWebhookConfiguration / MutatingWebhookConfiguration. This is where you declare what to intercept:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-limits
webhooks:
- name: require-limits.workshop.io
  rules:                          # WHAT to intercept
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]           # only Pod creations
  clientConfig:                   # WHERE to send the request
    service:
      name: admission-webhook
      namespace: default
      path: /validate
    caBundle: <base64 CA cert>    # so the apiserver trusts the webhook's TLS
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail             # if the webhook is down, DENY (fail closed)
Two fields with big consequences:
- rules scope what you intercept (here: Pod CREATE). Too broad and you intercept (and can break) everything; scope tightly.
- failurePolicy decides what happens when your webhook is unavailable: Fail (deny - fail closed, safe but can wedge the cluster if your webhook crashes) or Ignore (allow - fail open, never blocks but skips your policy). This choice has taken down clusters: a Fail-policy webhook whose backend died blocks all Pod creation, including the webhook's own replacement Pods - a deadlock. Choose deliberately, and exclude the kube-system namespace.
Register both (the mutating one points at /mutate).
Step 6: watch the cluster enforce your rules¶
The payoff - your policy is now live in the API path. Try to create a Pod without limits:
$ kubectl run bad --image=nginx
Error from server: admission webhook "require-limits.workshop.io" denied the request:
container "bad" must set cpu and memory limits
Rejected, with your exact message, before the Pod ever existed. kubectl get pod bad shows nothing - it was never stored. That's admission: prevention, not correction. Now create one with limits:
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata: {name: good}
spec:
  containers:
  - name: good
    image: nginx
    resources:
      limits: {cpu: "100m", memory: "128Mi"}
EOF
pod/good created # passed validation
And watch the mutating webhook's injection - you didn't set a label or securityContext, but they're there:
$ kubectl get pod good -o jsonpath='{.metadata.labels}{"\n"}'
{"workshop.io/injected":"true"} # your mutating webhook added this
$ kubectl get pod good -o jsonpath='{.spec.securityContext}{"\n"}'
{"runAsNonRoot":true} # injected too
You wrote a bare Pod; the apiserver routed it through your mutating webhook (which patched in the defaults), then your validating webhook (which checked limits), then stored it. This is exactly the path Istio's sidecar injection, Pod Security defaults, and Gatekeeper policies travel. You just built two stops on it.
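The ordering described above (mutate first, validate second) can be sketched in a few lines of Go. This is a toy model, not webhook code: the Limits type, the mutate and validate functions, and the default values "500m" and "256Mi" are all assumptions made for illustration.

```go
package main

import "fmt"

// Limits is a hypothetical stand-in for a container's resource limits.
type Limits struct {
	CPU    string
	Memory string
}

// mutate fills in defaults for anything the user left empty,
// which is what a mutating webhook's patch achieves.
func mutate(l Limits) Limits {
	if l.CPU == "" {
		l.CPU = "500m" // illustrative default
	}
	if l.Memory == "" {
		l.Memory = "256Mi" // illustrative default
	}
	return l
}

// validate rejects objects that still lack limits.
func validate(l Limits) error {
	if l.CPU == "" || l.Memory == "" {
		return fmt.Errorf("must set cpu and memory limits")
	}
	return nil
}

func main() {
	var bare Limits
	// Admission order: mutation runs first, so validation sees the defaults.
	fmt.Println("mutate then validate:", validate(mutate(bare)))
	// Reversed order: validation would reject before mutation could help.
	fmt.Println("validate first:", validate(bare))
}
```

The first call passes and the second fails, which is the reason the apiserver runs mutating admission before validating admission.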
Step 7: break it - the failurePolicy lesson, live¶
See why failurePolicy matters. Scale your webhook Deployment to zero (simulate it crashing):
$ kubectl scale deployment admission-webhook --replicas=0
$ kubectl run another --image=nginx # try to create any pod (recent kubectl releases no longer accept --requests on run)
Error from server: ... failed calling webhook "require-limits.workshop.io":
connect: connection refused
With failurePolicy: Fail, the webhook being down blocks all Pod creation - the apiserver can't reach your endpoint, so it denies. This is the deadlock that takes down clusters: a fail-closed webhook whose pods died can't be replaced (the replacement pods are themselves blocked). Switch to failurePolicy: Ignore and the same kubectl run succeeds (your policy is skipped, but the cluster keeps working). Restore replicas to fix it. This single experiment teaches the most important operational fact about webhooks - and why production webhooks scope tightly, exclude kube-system, and think hard about fail-open vs fail-closed.
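The fail-open versus fail-closed choice can be written down as a tiny decision function. The following Go sketch is a simplified model of that choice, not the apiserver's implementation; admitDecision, FailurePolicy, and errUnreachable are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// FailurePolicy mirrors the two values of the failurePolicy field.
type FailurePolicy string

const (
	Fail   FailurePolicy = "Fail"
	Ignore FailurePolicy = "Ignore"
)

// errUnreachable stands in for any error reaching the webhook endpoint.
var errUnreachable = errors.New("connect: connection refused")

// admitDecision models what happens after the webhook call returns.
// callErr is non-nil when the webhook could not be reached at all;
// allowed is the webhook's verdict when it did answer.
func admitDecision(policy FailurePolicy, callErr error, allowed bool) bool {
	if callErr != nil {
		// Webhook unavailable: the policy alone decides.
		return policy == Ignore
	}
	return allowed
}

func main() {
	fmt.Println(admitDecision(Fail, errUnreachable, false))   // false: fail closed
	fmt.Println(admitDecision(Ignore, errUnreachable, false)) // true: fail open
	fmt.Println(admitDecision(Fail, nil, true))               // true: webhook answered "allowed"
}
```

Note that under Fail the request is denied without the webhook's logic ever running, which is why an unreachable backend blocks every matching request.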
Now extend it¶
- namespaceSelector/objectSelector. Scope the webhook to only namespaces labeled policy=enforced, so you can roll it out gradually and never touch kube-system. The real-world safe-rollout pattern.
- Validate more. Reject latest image tags, require a team label, forbid hostNetwork. Each is a one-liner in your decision logic - and a real production guardrail.
- Do it with controller-runtime. Rebuild using controller-runtime's webhook framework (the operator workshop's toolkit), which handles the AdmissionReview plumbing and integrates with cert-manager for certs.
- Compare to policy engines. Express the same "require limits" rule in Kyverno (YAML, no code) and OPA Gatekeeper (Rego). Now you understand what they generate under the hood - they're admission webhooks with a policy language on top.
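As a starting point for the "Validate more" item above, here is a hedged Go sketch of the three extra rules. It uses minimal stand-in types (Pod, Container) rather than corev1.Pod so it runs on its own; the function names and error messages are assumptions, and a real webhook would decode the Pod from the AdmissionReview before applying the same checks.

```go
package main

import (
	"fmt"
	"strings"
)

// Minimal stand-ins for the Pod fields these checks read (hypothetical types).
type Container struct {
	Name  string
	Image string
}

type Pod struct {
	Labels      map[string]string
	HostNetwork bool
	Containers  []Container
}

// usesLatestTag reports whether an image has no tag or the "latest" tag.
// Images pinned by digest (name@sha256:...) count as pinned.
func usesLatestTag(image string) bool {
	if strings.Contains(image, "@") {
		return false
	}
	lastSlash := strings.LastIndex(image, "/")
	lastColon := strings.LastIndex(image, ":")
	if lastColon <= lastSlash {
		return true // no tag; a colon before the last slash is a registry port
	}
	return image[lastColon+1:] == "latest"
}

// validate returns one message per violated rule; an empty result means "allow".
func validate(p Pod) []string {
	var problems []string
	if p.Labels["team"] == "" {
		problems = append(problems, `pod must set a "team" label`)
	}
	if p.HostNetwork {
		problems = append(problems, "hostNetwork is not allowed")
	}
	for _, c := range p.Containers {
		if usesLatestTag(c.Image) {
			problems = append(problems, fmt.Sprintf("container %q must pin an image tag other than latest", c.Name))
		}
	}
	return problems
}

func main() {
	bad := Pod{HostNetwork: true, Containers: []Container{{Name: "web", Image: "nginx"}}}
	for _, msg := range validate(bad) {
		fmt.Println(msg)
	}
}
```

Returning every violation at once, rather than stopping at the first, gives the user one actionable error instead of a fix-and-retry loop.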
What you might wonder¶
"Webhook vs controller - when do I use which?" Admission webhook = intervene before an object is stored (reject bad input, inject defaults) - synchronous, in the request path, prevention. Controller = act after objects exist (reconcile to desired state) - asynchronous, correction. Use a webhook to enforce policy at write time ("no Pods without limits"); use a controller to maintain state over time ("keep 3 replicas running"). Many systems use both.
"Should I write webhooks, or use Kyverno/Gatekeeper?" For most policy needs, use Kyverno (YAML policies) or Gatekeeper (Rego) - they're battle-tested webhooks with a policy language, cert management, and audit built in. Write a custom webhook only when you need logic a policy engine can't express, or tight integration with your own types. Building one here (this workshop) is to understand what those tools are - so you can debug them when they misbehave and know when to reach past them.
"Why is mutating before validating?" So mutation can fix things up before validation judges them. A mutating webhook injects default limits; the validating webhook (or schema) then sees a Pod with limits and passes it. If validation ran first, it would reject the Pod before mutation could fix it. The order is deliberate and you must design with it in mind.
"What's the cert pain really about?" The apiserver must trust the webhook's TLS cert, and the cert must match the service DNS name. Manual certs expire and break webhooks silently (a webhook with an expired cert + failurePolicy: Fail = wedged cluster). This is why cert-manager (auto-issue + auto-rotate) is universal for webhooks in production. The fiddliness is real; the fix is automation.
"Can a webhook see the user who made the request?" Yes - the AdmissionRequest includes UserInfo (username, groups). You can write policy like "only the platform team may create privileged Pods." This is how webhooks enforce rules RBAC can't express (RBAC is verb-on-resource; webhooks can inspect the object's content and the user together).
What this gave you¶
- You know the API request path and where admission sits: before etcd, mutating then validating.
- You built a validating webhook that rejects Pods without limits, with an actionable error shown at apply time.
- You built a mutating webhook that injects defaults via a JSON patch - the same mechanism as Istio sidecar injection.
- You registered both with the apiserver and watched them enforce policy at kubectl apply time.
- You hit the TLS/cert requirement and know why cert-manager is universal for webhooks.
- You broke it with failurePolicy: Fail and learned the fail-open/fail-closed lesson that takes down real clusters.
- You understand that Kyverno, Gatekeeper, and sidecar injection are all this - admission webhooks - that Pod Security is the same admission step built into the apiserver, and when to build your own vs use them.
Next: decide where Pods run - build a custom scheduler and watch your placement logic bind Pods to nodes.
Back to the Platform & Day-2 month.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your validating and mutating webhook code and certs setup
- Terminal log of a Pod rejected by your validator (missing resource limits)
- Terminal log of a Pod whose spec was rewritten by your mutator (injected default)
- Note on what happened when you set failurePolicy=Fail and the webhook was down
Workshop - Build an autoscaler from scratch¶
Companion to Kubernetes -> Month 05 -> Week 19: HPA, VPA, KEDA: Autoscaling. The chapter explains how the Horizontal Pod Autoscaler scales workloads on metrics. This workshop has you build an HPA-like controller - a loop that reads a metric, computes the desired replica count, and scales a Deployment - then watch it scale up under load and back down when load drops. By the end you'll understand the control loop the real HPA runs, including why it sometimes oscillates and how stabilization prevents it.
~75 minutes. Needs: kind/k3d, Go 1.21+, kubectl. Prerequisite: the controller-from-scratch workshop - an autoscaler is a controller whose desired state is computed from a metric.
What you'll build, and the idea it makes concrete¶
You'll build an autoscaler that watches a Deployment, reads a load metric (we'll use CPU; you'll see the formula), computes how many replicas should exist to hit a target utilization, and scales the Deployment to match. Then you'll drive load up and watch it scale out, drop the load and watch it scale back in.
The idea this makes concrete:
Autoscaling is a control loop with a feedback formula. The HPA isn't magic - every ~15s it asks "given the current metric and my target, how many replicas do I need?", computes desired = ceil(current * currentMetric / targetMetric), and resizes the Deployment. It's a proportional controller: the further the metric is from target, the bigger the scaling step. The hard parts - not over-reacting to spikes, not flapping up and down - are solved with a stabilization window and tolerance, which you'll build and then break to see why they exist.
The controller pilot computed desired state from a spec. This computes desired state from a live measurement and feeds it back - closing a control loop, which is a genuinely different and powerful pattern.
Step 0: the HPA formula and the control-loop model¶
The whole autoscaler is one formula, applied in a loop:
desired = ceil(current * currentMetric / targetMetric)
Read it as proportional control: if current CPU is 2x your target, you need ~2x the replicas; if it's at target, replicas stay. The loop:
every 15s:
    metric  = average metric across the Deployment's pods (e.g. CPU utilization %)
    desired = ceil(current * metric / target)
    desired = clamp(desired, minReplicas, maxReplicas)   # never below min / above max
    if desired != current AND outside tolerance:
        scale the Deployment to desired
Two facts that shape everything:
- It's a feedback loop, so it's prone to oscillation (scale up -> metric drops -> scale down -> metric rises -> scale up...). Tolerance and a stabilization window damp this. You'll build them after seeing the raw version flap.
- The HPA scales the Deployment, not Pods directly - it edits spec.replicas via the scale subresource, and the Deployment/ReplicaSet controllers (the controller pilot's pattern) create the actual Pods. Layered controllers again.
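Before touching the cluster, the formula, tolerance band, and clamp can be tried as a pure function. This is a minimal sketch with no Kubernetes dependency; the name desiredReplicas and its parameter list are assumptions for illustration, and the inputs in main are made-up sample readings.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the proportional formula, a tolerance band, and
// the min/max clamp. metricPct and targetPct are percentages of the CPU request.
func desiredReplicas(current int32, metricPct, targetPct, tolerance float64, minR, maxR int32) int32 {
	ratio := metricPct / targetPct
	if math.Abs(ratio-1.0) <= tolerance {
		return current // close enough to target: do nothing
	}
	desired := int32(math.Ceil(float64(current) * ratio))
	if desired < minR {
		desired = minR
	}
	if desired > maxR {
		desired = maxR
	}
	return desired
}

func main() {
	fmt.Println(desiredReplicas(1, 220, 50, 0.10, 1, 10)) // 5: ceil(1 * 4.4)
	fmt.Println(desiredReplicas(5, 95, 50, 0.10, 1, 10))  // 10: ceil(5 * 1.9)
	fmt.Println(desiredReplicas(10, 48, 50, 0.10, 1, 10)) // 10: within tolerance
	fmt.Println(desiredReplicas(10, 3, 50, 0.10, 1, 10))  // 1: ceil(10 * 0.06)
}
```

Keeping the computation free of API calls makes it easy to unit-test the scaling decision separately from the code that reads metrics and writes the scale subresource.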
Step 1: cluster + a scalable target + metrics¶
$ kind create cluster --name hpa-workshop
# metrics-server provides pod CPU/mem (the HPA's default metric source)
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl patch -n kube-system deployment metrics-server --type=json \
-p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]' # kind needs this
# a deployment to scale: a CPU-burnable web app
$ kubectl create deployment load --image=registry.k8s.io/hpa-example
$ kubectl set resources deployment load --requests=cpu=100m # HPA needs requests to compute %
$ kubectl expose deployment load --port=80
metrics-server is what feeds CPU/memory to autoscalers; the hpa-example image busy-loops on each request so you can drive its CPU up. Note the CPU request - utilization % is usage / request, so without a request there's no percentage to target.
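The "usage / request" relationship is worth seeing as code, because the missing-request case is a branch, not an afterthought. A small sketch follows; utilizationPct is a hypothetical helper name and the millicore values are sample data.

```go
package main

import "fmt"

// utilizationPct returns average CPU usage as a percentage of the request.
// ok is false when there is no request or no pods: with no denominator,
// there is no percentage to compare against a target.
func utilizationPct(usageMilli []int64, requestMilli int64) (pct float64, ok bool) {
	if requestMilli <= 0 || len(usageMilli) == 0 {
		return 0, false
	}
	var total int64
	for _, u := range usageMilli {
		total += u
	}
	avg := float64(total) / float64(len(usageMilli))
	return avg * 100 / float64(requestMilli), true
}

func main() {
	// Two pods using 200m and 240m against a 100m request.
	pct, ok := utilizationPct([]int64{200, 240}, 100)
	fmt.Println(pct, ok) // 220 true

	// Same usage, but no CPU request set on the Deployment.
	_, ok = utilizationPct([]int64{200, 240}, 0)
	fmt.Println(ok) // false
}
```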
Step 2: the autoscaler loop¶
$ mkdir mini-hpa && cd mini-hpa && go mod init workshop/mini-hpa
$ go get k8s.io/client-go@v0.29.3 k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/metrics@v0.29.3
package main
import (
	"context"
	"math"
	"os"
	"path/filepath"
	"time"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)
const (
	namespace      = "default"
	deploymentName = "load"
	targetCPUPct   = 50 // scale to keep avg CPU at ~50% of request
	minReplicas    = 1
	maxReplicas    = 10
	tolerance      = 0.10 // ignore changes within +/-10% of target (anti-flap)
	syncInterval   = 15 * time.Second
)
func main() {
	// Load the local kubeconfig; fail fast instead of continuing with nil clients.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		klog.Fatalf("load kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Fatalf("build clientset: %v", err)
	}
	metrics, err := metricsclient.NewForConfig(config)
	if err != nil {
		klog.Fatalf("build metrics client: %v", err)
	}
	// The control loop: reconcile, sleep, repeat.
	for {
		if err := reconcile(client, metrics); err != nil {
			klog.Errorf("reconcile: %v", err)
		}
		time.Sleep(syncInterval)
	}
}
Step 3: the reconcile - read metric, compute desired, scale¶
func reconcile(client kubernetes.Interface, metrics metricsclient.Interface) error {
	ctx := context.TODO()
	// 1. Current replicas (via the scale subresource - same one HPA uses).
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, deploymentName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	current := scale.Spec.Replicas
	// 2. Current metric: average CPU across the deployment's pods, as % of request.
	podMetrics, err := metrics.MetricsV1beta1().PodMetricses(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=" + deploymentName,
	})
	if err != nil {
		return err
	}
	if len(podMetrics.Items) == 0 {
		klog.Info("no pod metrics yet; skipping this cycle")
		return nil
	}
	var totalMilli int64
	for _, pm := range podMetrics.Items {
		for _, c := range pm.Containers {
			totalMilli += c.Usage.Cpu().MilliValue()
		}
	}
	const requestMilli = 100.0 // must match the 100m CPU request set in Step 1
	avgMilli := totalMilli / int64(len(podMetrics.Items))
	currentPct := float64(avgMilli) / requestMilli * 100.0 // usage(milli) / request(milli) * 100
	// 3. THE HPA FORMULA: desired = ceil(current * currentMetric / targetMetric)
	ratio := currentPct / float64(targetCPUPct)
	if math.Abs(ratio-1.0) <= tolerance {
		klog.Infof("CPU %.0f%% (target %d%%) within tolerance; staying at %d replicas",
			currentPct, targetCPUPct, current)
		return nil // anti-flap: don't scale for small deviations
	}
	desired := int32(math.Ceil(float64(current) * ratio))
	// 4. Clamp to [min, max].
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	if desired == current {
		return nil
	}
	// 5. Scale (write spec.replicas via the scale subresource).
	scale.Spec.Replicas = desired
	if _, err := client.AppsV1().Deployments(namespace).UpdateScale(ctx, deploymentName, scale, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Log only after the write succeeded, so the log never claims a scale that failed.
	klog.Infof("CPU %.0f%% (target %d%%) -> scaling %d -> %d replicas",
		currentPct, targetCPUPct, current, desired)
	return nil
}
That's a working horizontal autoscaler. The GetScale/UpdateScale calls use the scale subresource - the same generic /scale endpoint the real HPA uses, which is why HPA works on Deployments, StatefulSets, and even custom resources that expose /scale. You read the metric, plug it into the formula, clamp, and scale. The tolerance check is the first anti-flap guard.
Step 4: run it and watch it scale up under load¶
Start the autoscaler with go run . from the mini-hpa directory. The target is idle, so it holds at 1. Now generate load - hammer the service from a busybox pod:
$ kubectl run loadgen --image=busybox --restart=Never -- \
/bin/sh -c "while true; do wget -q -O- http://load; done"
Watch your autoscaler react over the next minute:
# autoscaler log:
# (idle: CPU ~0%, desired clamps to minReplicas = 1 = current, so nothing is logged)
CPU 220% (target 50%) -> scaling 1 -> 5 replicas <- load spiked, scale out
CPU 95% (target 50%) -> scaling 5 -> 10 replicas <- still hot, more
CPU 48% (target 50%) within tolerance; staying at 10 replicas <- settled near target
$ kubectl get deployment load
NAME READY UP-TO-DATE AVAILABLE
load 10/10 10 10 <- scaled out to handle the load
Your autoscaler watched CPU climb, computed it needed more replicas to bring per-pod CPU down to ~50%, and scaled out - then settled once the metric reached target. That's closed-loop control: it drove the system to the setpoint. The formula did it - ceil(1 * 220/50) = 5, then ceil(5 * 95/50) = 10, which is also maxReplicas, so the clamp would have stopped any further growth.
Step 5: watch it scale back down¶
Stop the load and watch the loop run in reverse:
$ kubectl delete pod loadgen
# autoscaler log over the next minute:
CPU 3% (target 50%) -> scaling 10 -> 1 replicas <- load gone, scale in
It scaled in because the metric dropped far below target. The full loop: load up -> scale out -> metric settles; load down -> scale in -> back to min. You built a system that automatically right-sizes a workload to its load - the entire value of autoscaling, in one feedback formula.
Step 6: break it - watch it flap, then fix it with stabilization¶
Now the lesson that separates a toy from the real HPA. Set tolerance = 0.0 (remove the anti-flap guard) and drive a bursty load (a loadgen that sleeps between bursts). Watch the autoscaler oscillate:
CPU 80% -> scaling 2 -> 4 replicas # burst: scale up
CPU 20% -> scaling 4 -> 2 replicas # quiet: scale down
CPU 85% -> scaling 2 -> 4 replicas # burst: scale up again
CPU 18% -> scaling 4 -> 2 replicas # flap, flap, flap...
This flapping (thrashing replicas up and down) is a real production problem - it churns Pods, disrupts traffic, and wastes resources. Tolerance alone would not stop it: it only ignores deviations within 10% of target, and these swings are far larger. The real HPA solves it with a stabilization window: it remembers recent desired-replica computations and, for scale-down, uses the highest recommendation over the last N minutes (default 5min for down, 0 for up - scale up fast, scale down slow). Add it:
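// Keep a window of recent recommendations; for scale-down, act on the MAX.
// Each entry carries a timestamp so it can expire after stabilizationPeriod.
type recommendation struct { replicas int32; at time.Time }
var downWindow []recommendation
// ... in reconcile, when desired < current (scale-down). appendWithExpiry and maxOf
// are small helpers you write; stabilizationPeriod is a new constant (e.g. 5 * time.Minute).
downWindow = appendWithExpiry(downWindow, desired, stabilizationPeriod)
desired = maxOf(downWindow) // don't scale down below the highest recent recommendation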
With stabilization, a brief dip in load doesn't immediately scale down - the autoscaler waits to be sure load has dropped, preventing flap. Re-run the bursty load and watch it scale up promptly but scale down patiently - stable. You just built the most important real-world refinement of autoscaling, and you understand why it exists because you watched the raw loop flap without it.
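One way to flesh out the stabilization window is sketched below as a self-contained Go type. It is an assumption-laden illustration, not the real HPA's code: the stabilizer type, its recommend method, and the choice to pass the clock in as a parameter (so the behavior can be exercised without waiting five minutes) are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// recommendation is one past desired-replica computation with its timestamp.
type recommendation struct {
	replicas int32
	at       time.Time
}

// stabilizer keeps recommendations from the last `window` and damps scale-down.
type stabilizer struct {
	window time.Duration
	recs   []recommendation
}

// recommend records desired at time now, drops expired entries, and returns
// the replica count to act on: scale-ups pass through immediately, while
// scale-downs are held at the highest recommendation still inside the window.
func (s *stabilizer) recommend(current, desired int32, now time.Time) int32 {
	kept := s.recs[:0]
	for _, r := range s.recs {
		if now.Sub(r.at) < s.window {
			kept = append(kept, r)
		}
	}
	s.recs = append(kept, recommendation{replicas: desired, at: now})
	if desired >= current {
		return desired
	}
	highest := desired
	for _, r := range s.recs {
		if r.replicas > highest {
			highest = r.replicas
		}
	}
	if highest > current {
		highest = current // stabilization only delays scale-down; it never scales up
	}
	return highest
}

func main() {
	s := &stabilizer{window: 5 * time.Minute}
	t0 := time.Now()
	fmt.Println(s.recommend(2, 4, t0))                     // 4: scale up at once
	fmt.Println(s.recommend(4, 2, t0.Add(15*time.Second))) // 4: dip ignored, a 4 is still in the window
	fmt.Println(s.recommend(4, 2, t0.Add(6*time.Minute)))  // 2: window expired, scale down
}
```

In the workshop loop, the call would sit after the clamp, with time.Now() passed as the clock.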
Now extend it¶
- Custom metrics. Scale on requests-per-second or queue depth instead of CPU (the custom/external metrics API). This is what KEDA specializes in - scaling on Kafka lag, queue length, etc., even to zero.
- Scale to zero. Allow minReplicas: 0 and scale up from zero on the first request (needs a request-buffering proxy). The serverless-on-Kubernetes pattern (Knative, KEDA).
- Multiple metrics. Compute desired from CPU and memory and a custom metric, taking the max - exactly what the real HPA does with multiple metric sources.
- Then read the real HPA config. A HorizontalPodAutoscaler object's behavior.scaleDown.stabilizationWindowSeconds, policies, metrics - every field maps to something you built. You'll configure it with understanding instead of cargo-culting.
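For the "Multiple metrics" item in the list above, a minimal Go sketch of the take-the-max rule follows. The metric type, the desiredFromMetrics name, and the sample values are assumptions for illustration; the result would still go through the same tolerance, clamp, and stabilization steps as the single-metric version.

```go
package main

import (
	"fmt"
	"math"
)

// metric is one signal with its current value and its target, in the same units.
type metric struct {
	name    string
	current float64
	target  float64
}

// desiredFromMetrics computes one proposal per metric and returns the largest,
// along with the name of the metric that produced it, so the most demanding
// signal wins. It returns 0 when no metric has a usable target.
func desiredFromMetrics(currentReplicas int32, metrics []metric) (int32, string) {
	best, by := int32(0), ""
	for _, m := range metrics {
		if m.target <= 0 {
			continue // no usable target for this metric
		}
		d := int32(math.Ceil(float64(currentReplicas) * m.current / m.target))
		if d > best {
			best, by = d, m.name
		}
	}
	return best, by
}

func main() {
	n, by := desiredFromMetrics(4, []metric{
		{name: "cpu", current: 40, target: 50},          // ceil(4 * 0.8) = 4
		{name: "memory", current: 90, target: 60},       // ceil(4 * 1.5) = 6
		{name: "requests/s", current: 300, target: 100}, // ceil(4 * 3.0) = 12
	})
	fmt.Println(n, by) // 12 requests/s
}
```

Taking the maximum is the conservative choice: a workload is only as healthy as its most saturated resource.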
What you might wonder¶
"Why scale up fast but down slow?" Asymmetry by design. Under-provisioning hurts immediately (dropped requests, latency), so scale up fast to protect availability. Over-provisioning just costs a little money, and load is often bursty, so scale down slowly to avoid flapping and to absorb brief dips. The HPA defaults encode this: 0s stabilization for up, 300s for down. Your stabilization window implemented exactly this.
"HPA vs VPA vs Cluster Autoscaler - how do they relate?" HPA (this workshop) scales replica count horizontally on load. VPA (Vertical Pod Autoscaler) adjusts each Pod's CPU/memory requests (right-sizing one Pod, not adding Pods) - and conflicts with HPA on the same metric, so they're used on different axes. Cluster Autoscaler scales the nodes - when Pods can't be scheduled (the scheduler workshop's Pending), it adds nodes; when nodes are underused, it removes them. Three autoscalers, three layers: Pods-count (HPA), Pod-size (VPA), node-count (CA). They compose.
"Why does the HPA need resource requests?" Utilization is usage / request. With no request, there's no denominator - no percentage to target. This is the #1 "my HPA isn't scaling" cause: the target Deployment has no CPU request, so the HPA can't compute utilization and does nothing. Always set requests on autoscaled workloads.
"Is the formula really that simple?" The core is ceil(current * currentMetric / targetMetric), yes. The real HPA adds: tolerance (don't act on tiny deviations), stabilization windows (anti-flap), configurable scaling policies (max % or pods per interval), multiple metrics (take the max), and special handling for unready/missing-metric pods. But the proportional-control heart is exactly what you built - all the rest is refinement around it.
"When KEDA instead of HPA?" HPA scales on CPU/memory (and custom metrics with setup). KEDA extends autoscaling to event sources - Kafka topic lag, RabbitMQ queue depth, cloud queue length, cron schedules - and crucially supports scale-to-zero. Use KEDA when you scale on "how much work is queued" rather than "how busy are the current pods," or when you want zero replicas at idle. KEDA actually drives an HPA under the hood for the scaling mechanics.
What this gave you¶
- You built a horizontal autoscaler: read a metric, apply desired = ceil(current * metric/target), clamp, scale via the scale subresource.
- You watched it scale out under load and back in when load dropped - closed-loop control to a setpoint.
- You watched it flap without anti-flap guards, then fixed it with tolerance + a scale-down stabilization window - and understand why the real HPA scales up fast and down slow.
- You know why autoscaled workloads need resource requests (utilization = usage/request).
- You can place HPA vs VPA vs Cluster Autoscaler (Pod-count / Pod-size / node-count) and know when KEDA fits.
- Every field of a real HorizontalPodAutoscaler now maps to something you implemented.
Next, the capstone of the workshop series: bootstrap a Kubernetes control-plane component by hand - the ultimate proof that the control plane is "just" processes and etcd.
Back to the Platform & Day-2 month.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your autoscaler code
- Load-test screenshot showing scale-up under traffic
- Demo of the flapping problem and your stabilization-window fix
- Note on why HPA scales up fast and down slow
Workshop - Build an operator with a Custom Resource Definition¶
Companion to Kubernetes -> Month 03 -> Weeks 10-12: controller-runtime, CRDs, Operator Patterns. The chapters explain that you can extend the Kubernetes API with your own resource types. This workshop has you do it - define a brand-new kind of Kubernetes object (kubectl get websites!) and build the operator that gives it meaning. By the end you'll have extended Kubernetes itself, and you'll understand why "Kubernetes is a platform for building platforms" is literally true.
~120 minutes. Needs: kind/k3d, Go 1.21+, kubectl, and kubebuilder (brew install kubebuilder or the official install script). Prerequisite: the controller-from-scratch workshop - this builds directly on the reconcile loop you learned there.
What you'll build, and the idea it makes concrete¶
You'll define a custom resource called Website - a high-level object where a user declares "I want a website serving this HTML at this replica count" - and an operator that reconciles each Website into a real Deployment + Service + ConfigMap. So a user writes 6 lines of YAML:
apiVersion: web.workshop.io/v1
kind: Website
metadata:
  name: hello
spec:
  replicas: 2
  html: "<h1>Hello from my operator</h1>"
...and your operator creates and manages all the Kubernetes plumbing behind it. The idea this makes concrete:
Kubernetes is extensible at the API level. A Custom Resource Definition (CRD) teaches the apiserver a new noun; a controller gives that noun behavior. Together they're an operator - and from the user's side, your custom type is indistinguishable from a built-in like Deployment. kubectl get, kubectl describe, RBAC, kubectl apply, watches - all work on your type for free. This is how cert-manager, Prometheus Operator, Crossplane, and every database-on-Kubernetes works: they add nouns and the controllers that fulfill them.
The pilot workshop showed you the reconcile loop on a built-in type (ConfigMap). This one shows you defining your own type and reconciling it - the leap from "I can write a controller" to "I can extend Kubernetes."
Step 0: cluster + scaffold¶
$ kind create cluster --name operator-workshop
$ mkdir website-operator && cd website-operator
$ kubebuilder init --domain workshop.io --repo workshop.io/website-operator
kubebuilder init scaffolds a full operator project: a main.go that wires up the manager, a Makefile with all the build/deploy targets, the controller-runtime dependencies, and the manifest generators. controller-runtime is the framework the pilot's raw client-go boilerplate becomes - now that you know what it hides, you'll appreciate it.
Step 1: define the API - create your new resource type¶
$ kubebuilder create api --group web --version v1 --kind Website
# answer 'y' to "Create Resource" and 'y' to "Create Controller"
This generates api/v1/website_types.go - the Go struct that is your new resource. Edit it to define the spec (what the user declares) and status (what the operator reports back):
// WebsiteSpec is the DESIRED state - what the user asks for.
type WebsiteSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=10
	Replicas int32 `json:"replicas"` // how many copies to run
	// +kubebuilder:validation:Required
	HTML string `json:"html"` // the page content to serve
}
// WebsiteStatus is the OBSERVED state - what the operator reports.
type WebsiteStatus struct {
	ReadyReplicas int32  `json:"readyReplicas"` // how many are actually ready
	URL           string `json:"url"`           // where to reach it
}
Those +kubebuilder:validation markers are not ordinary comments - the code generator reads them and emits OpenAPI schema that the apiserver enforces. Declare that replicas must be 1-10, and the apiserver itself rejects replicas: 50 before your controller ever sees it. You get validation for free, at the API layer, just by annotating a struct field.
The Spec/Status split is the universal Kubernetes object shape: spec = desired (user writes it), status = observed (controller writes it). Every built-in works this way; now yours does too.
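To see what the validation markers buy you, it can help to write out the equivalent check by hand. The sketch below is only an imperative restatement of the Minimum=1 / Maximum=10 rule, under the assumption that you would otherwise have to do this inside your controller; validateSpec and its error text are hypothetical, and the real rejection comes from the apiserver's schema validation with its own message.

```go
package main

import "fmt"

// WebsiteSpec repeats the two fields from website_types.go.
type WebsiteSpec struct {
	Replicas int32
	HTML     string
}

// validateSpec spells out, as plain Go, the range check that the generated
// schema performs declaratively for spec.replicas.
func validateSpec(s WebsiteSpec) error {
	if s.Replicas < 1 || s.Replicas > 10 {
		return fmt.Errorf("spec.replicas: %d is outside the allowed range 1-10", s.Replicas)
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(WebsiteSpec{Replicas: 2, HTML: "<h1>ok</h1>"}))
	fmt.Println(validateSpec(WebsiteSpec{Replicas: 50, HTML: "<h1>too many</h1>"}))
}
```

With the markers in place, none of this code needs to exist in the operator: invalid objects never reach Reconcile.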
Step 2: generate and install the CRD - teach the apiserver your noun¶
$ make manifests # generates the CRD YAML from your Go markers
$ make install # installs the CRD into the cluster
That second command is the moment your cluster learns a new word. Verify:
$ kubectl get crds | grep website
websites.web.workshop.io 2026-05-23T...
$ kubectl api-resources | grep website
websites web.workshop.io/v1 true Website
Kubernetes now knows what a Website is. Before you've written a line of controller logic, kubectl already works on it:
$ kubectl get websites
No resources found in default namespace. # the apiserver answers - it knows the type
You taught the apiserver a new noun. kubectl get/describe/apply/delete, RBAC, watches - all already work on Website, because the CRD made it a first-class API citizen. That's the "for free" the intro promised, made literal.
Step 3: the reconcile function - give the noun meaning¶
Open internal/controller/website_controller.go. controller-runtime hands you a Reconcile method - the same loop from the pilot, minus the boilerplate (no manual informer/workqueue; the framework runs them). Your job is purely: given a Website, make the cluster match it.
func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := log.FromContext(ctx)
	// 1. Fetch the Website (desired state). Gone? Owned children are GC'd.
	var site webv1.Website
	if err := r.Get(ctx, req.NamespacedName, &site); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// 2. Ensure the ConfigMap holding the HTML.
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: site.Name + "-html", Namespace: site.Namespace},
	}
	_, err := ctrl.CreateOrUpdate(ctx, r.Client, cm, func() error {
		cm.Labels = map[string]string{"site": site.Name} // lets `-l site=<name>` select this child
		cm.Data = map[string]string{"index.html": site.Spec.HTML}
		return ctrl.SetControllerReference(&site, cm, r.Scheme) // owner ref -> auto GC
	})
	if err != nil {
		return ctrl.Result{}, err
	}
	// 3. Ensure the Deployment (nginx serving the HTML, at the requested replicas).
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace},
	}
	_, err = ctrl.CreateOrUpdate(ctx, r.Client, deploy, func() error {
		deploy.Labels = map[string]string{"site": site.Name} // label the Deployment itself, not just its Pods
		deploy.Spec.Replicas = &site.Spec.Replicas           // desired replicas from the spec
		deploy.Spec.Selector = &metav1.LabelSelector{MatchLabels: map[string]string{"site": site.Name}}
		deploy.Spec.Template = corev1.PodTemplateSpec{
			ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"site": site.Name}},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "nginx",
					Image: "nginx:1.27-alpine",
					VolumeMounts: []corev1.VolumeMount{{
						Name: "html", MountPath: "/usr/share/nginx/html",
					}},
				}},
				Volumes: []corev1.Volume{{
					Name: "html",
					VolumeSource: corev1.VolumeSource{ConfigMap: &corev1.ConfigMapVolumeSource{
						LocalObjectReference: corev1.LocalObjectReference{Name: cm.Name},
					}},
				}},
			},
		}
		return ctrl.SetControllerReference(&site, deploy, r.Scheme)
	})
	if err != nil {
		return ctrl.Result{}, err
	}
	// 4. Ensure the Service.
	svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace}}
	_, err = ctrl.CreateOrUpdate(ctx, r.Client, svc, func() error {
		svc.Labels = map[string]string{"site": site.Name}
		svc.Spec.Selector = map[string]string{"site": site.Name}
		svc.Spec.Ports = []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(80)}}
		return ctrl.SetControllerReference(&site, svc, r.Scheme)
	})
	if err != nil {
		return ctrl.Result{}, err
	}
	// 5. Update STATUS - report observed state back to the user.
	site.Status.ReadyReplicas = deploy.Status.ReadyReplicas
	site.Status.URL = fmt.Sprintf("http://%s.%s.svc.cluster.local", svc.Name, svc.Namespace)
	if err := r.Status().Update(ctx, &site); err != nil {
		return ctrl.Result{}, err
	}
	log.Info("reconciled", "website", site.Name, "replicas", site.Spec.Replicas)
	return ctrl.Result{}, nil
}
It's the same shape as the pilot - desired vs actual, make them match - but CreateOrUpdate collapses the "get, create-if-missing, update-if-drifted" into one call, and SetControllerReference wires the owner reference (so deleting the Website garbage-collects all three children). The framework runs the informer and workqueue you built by hand last time.
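The "get, create-if-missing, update-if-drifted" collapse can be modeled without a cluster. The sketch below uses a plain map as a hypothetical stand-in for the apiserver; createOrUpdate here is a toy with a simplified signature, written only to show the three outcomes, and is not controller-runtime's implementation.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the cluster: name -> data.
type store map[string]string

// createOrUpdate fetches the object, lets mutate compute the desired content,
// and then creates it if missing, updates it if it drifted, or leaves it alone.
func createOrUpdate(s store, name string, mutate func(current string) string) string {
	current, exists := s[name]
	desired := mutate(current)
	switch {
	case !exists:
		s[name] = desired
		return "created"
	case current != desired:
		s[name] = desired
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	cluster := store{}
	want := func(string) string { return "<h1>Hello</h1>" }

	fmt.Println(createOrUpdate(cluster, "hello-html", want)) // created
	fmt.Println(createOrUpdate(cluster, "hello-html", want)) // unchanged
	cluster["hello-html"] = "edited by hand"                 // simulate drift
	fmt.Println(createOrUpdate(cluster, "hello-html", want)) // updated
}
```

The key property is idempotence: calling it repeatedly with the same desired state converges and then does nothing, which is what makes it safe to run on every reconcile.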
One required wiring: tell the manager to also watch the children, so changes to them re-trigger reconcile (self-healing). In SetupWithManager:
func (r *WebsiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&webv1.Website{}).      // primary: watch Websites
		Owns(&appsv1.Deployment{}). // also watch Deployments we own -> reconcile owner
		Owns(&corev1.Service{}).    // same for Services
		Owns(&corev1.ConfigMap{}).  // and ConfigMaps
		Complete(r)
}
Owns(...) is the framework doing what you did manually in the pilot (mapping a companion's change back to its owner's key). Three lines instead of custom event-handler logic.
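Conceptually, the mapping that Owns(...) sets up looks like the sketch below: given a changed child, find its controlling owner and enqueue that owner's key. The ownerRef and object types and the requestFor function are simplified, hypothetical stand-ins, not controller-runtime types.

```go
package main

import "fmt"

// ownerRef and object are simplified stand-ins (hypothetical) for an owner
// reference and a child object's metadata.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

type object struct {
	Namespace string
	Name      string
	Owners    []ownerRef
}

// requestFor returns the "namespace/name" key of the controlling owner of the
// given kind, or false if the object is not controlled by one.
func requestFor(child object, ownerKind string) (string, bool) {
	for _, ref := range child.Owners {
		if ref.Controller && ref.Kind == ownerKind {
			// Owners live in the same namespace as the child.
			return child.Namespace + "/" + ref.Name, true
		}
	}
	return "", false
}

func main() {
	deploy := object{
		Namespace: "default",
		Name:      "hello",
		Owners:    []ownerRef{{Kind: "Website", Name: "hello", Controller: true}},
	}
	key, ok := requestFor(deploy, "Website")
	fmt.Println(key, ok) // default/hello true

	// A Deployment created by someone else is ignored.
	_, ok = requestFor(object{Namespace: "default", Name: "other"}, "Website")
	fmt.Println(ok) // false
}
```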
Step 4: run the operator and create your first Website¶
Start the operator on your laptop with make run and leave it running. In another terminal, apply the 6-line YAML from the intro:
$ cat <<EOF | kubectl apply -f -
apiVersion: web.workshop.io/v1
kind: Website
metadata:
  name: hello
spec:
  replicas: 2
  html: "<h1>Hello from my operator</h1>"
EOF
website.web.workshop.io/hello created
Now watch your operator turn those 6 lines into running infrastructure:
$ kubectl get website,deployment,service,configmap -l site=hello
$ kubectl get all -l site=hello
NAME READY STATUS RESTARTS AGE
pod/hello-7d9f8c-xk2p4 1/1 Running 0 8s
pod/hello-7d9f8c-m4nl9 1/1 Running 0 8s <- 2 replicas, as requested
NAME TYPE CLUSTER-IP PORT(S) AGE
service/hello ClusterIP 10.96.142.7 80/TCP 8s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/hello 2/2 2 2 8s
Six lines of YAML became a Deployment, a Service, a ConfigMap, and two running Pods - because your operator reconciled the Website into them. Confirm it actually serves your HTML:
$ kubectl port-forward service/hello 8080:80 &
$ curl localhost:8080
<h1>Hello from my operator</h1> # your declared HTML, served
And the status you wrote flows back to the user:
$ kubectl get website hello -o jsonpath='{.status}{"\n"}'
{"readyReplicas":2,"url":"http://hello.default.svc.cluster.local"}
Step 5: watch it reconcile - the operator earns its name¶
This is the payoff - your custom resource behaves exactly like a built-in. Change the spec, watch the operator drive the change:
$ kubectl patch website hello --type merge -p '{"spec":{"replicas":4}}'
$ kubectl get pods -l site=hello -w
# watch 2 more pods appear - the operator scaled the Deployment because the Website spec changed
Change the HTML, watch it propagate:
$ kubectl patch website hello --type merge -p '{"spec":{"html":"<h1>Updated live</h1>"}}'
# operator updates the ConfigMap; (in a fuller version you'd roll the pods to pick it up)
And the self-heal moment, now on your own resource type:
$ kubectl delete deployment hello
deployment.apps "hello" deleted
$ kubectl get deployment hello
NAME READY AGE
hello 2/2 2s <- your operator recreated it
You deleted the Deployment; your operator recreated it, because the Website says it should exist. The owner-reference cleanup works too - delete the Website and everything it created vanishes:
$ kubectl delete website hello
website.web.workshop.io "hello" deleted
$ kubectl get all -l site=hello
No resources found. # Deployment, Service, ConfigMap, Pods - all GC'd
One kubectl delete website tore down four objects, because they're all owned by the Website. You built a self-healing, self-cleaning, high-level abstraction that the user drives with 6 lines of YAML - which is exactly what cert-manager, the Prometheus Operator, and every database operator do.
Step 6: finalizers - cleanup that owner references can't do¶
Owner references handle in-cluster cleanup. But operators often manage external resources (a cloud DNS record, an S3 bucket, a SaaS account) that Kubernetes garbage collection knows nothing about. Finalizers solve this: they block deletion until your operator does external cleanup first.
The pattern (Week 12's core): add a finalizer string to the object; when the object is deleted, Kubernetes sets a deletionTimestamp but keeps the object until your finalizer is removed - giving your reconcile a chance to clean up externally, then remove the finalizer to let deletion complete.
const finalizer = "web.workshop.io/cleanup"

// In Reconcile, near the top:
if site.DeletionTimestamp.IsZero() {
    // Not being deleted: ensure our finalizer is present.
    if !controllerutil.ContainsFinalizer(&site, finalizer) {
        controllerutil.AddFinalizer(&site, finalizer)
        return ctrl.Result{}, r.Update(ctx, &site)
    }
} else {
    // Being deleted: do external cleanup, THEN remove the finalizer.
    if controllerutil.ContainsFinalizer(&site, finalizer) {
        // ... e.g. delete the external DNS record for site.Status.URL ...
        log.Info("external cleanup done", "website", site.Name)
        controllerutil.RemoveFinalizer(&site, finalizer)
        return ctrl.Result{}, r.Update(ctx, &site)
    }
    return ctrl.Result{}, nil // finalizer gone: Kubernetes completes deletion
}
This is why some objects "hang" in Terminating - a finalizer is blocking deletion until its controller finishes (or is stuck). Now you know what Terminating means and how to build deliberate teardown. Finalizers are the operator's hook for "delete this safely, including the things Kubernetes can't see."
Step 7: ship it (deploy the operator into the cluster)¶
make run ran the operator on your laptop. To run it in the cluster like a real operator:
$ make docker-build docker-push IMG=<your-registry>/website-operator:v0.1
$ make deploy IMG=<your-registry>/website-operator:v0.1
make deploy installs the CRD, the operator Deployment, a ServiceAccount, and the RBAC (kubebuilder generated minimum-privilege rules from the +kubebuilder:rbac markers in your controller). Now the operator runs as a Pod, reconciling Websites cluster-wide, with no laptop involved. That's a shippable operator.
Now extend it¶
- Printer columns. Add +kubebuilder:printcolumn markers so kubectl get websites shows REPLICAS, READY, URL in the table - the polish real operators have.
- Conditions. Replace the flat status with standard metav1.Condition entries (Ready, Progressing) - the convention every mature operator follows, and what kubectl wait --for=condition=Ready keys on.
- Webhooks. Add a defaulting/validation webhook (the next workshop) for logic the OpenAPI schema can't express ("html must contain an <h1>").
- Multiple versions. Add a v2 with a conversion webhook - how operators evolve their API without breaking existing resources (Week 11).
- Owned-resource health in status. Watch the Deployment's real readiness and reflect it, so Website status is trustworthy.
What you might wonder¶
"CRD + controller = operator - is that the whole definition?" Essentially yes. An operator is a custom resource (the noun, via a CRD) plus a controller that reconciles it (the behavior). The term often implies operational domain knowledge encoded in the controller (how to back up this database, how to fail over this cluster) - but structurally it's CRD + controller. You just built one.
"controller-runtime hid the informer/workqueue I built in the pilot. Is that bad?" No - it's the point. You built them by hand once to understand them; now the framework runs them correctly so you focus on reconcile logic. For() sets up the primary informer+workqueue, Owns() sets up the child watches and owner-mapping, the manager runs the loop. Knowing what's underneath (from the pilot) means controller-runtime is convenience, not magic.
"Why update Status separately from Spec?" Spec is the user's; Status is yours. With the status subresource enabled they are written through separate endpoints: a write to /status changes only status, and a write to the main resource ignores status. That keeps your controller from clobbering a user's spec edit (and vice versa), and lets RBAC grant status-write without spec-write. r.Status().Update() hits the /status subresource specifically. The object still has a single resourceVersion, so a write based on a stale copy gets a 409 Conflict either way; re-read and retry rather than treating it as fatal.
"How do real operators handle upgrades to the CRD schema?" Versioned APIs (v1, v1beta1) with conversion webhooks that translate between versions on the fly, so old resources keep working when you ship a new schema. It's Week 11 and the "multiple versions" extension above. This is how, e.g., cert-manager moved from v1alpha2 to v1 without breaking anyone's Certificates.
"Is writing an operator usually the right call?" Often not - if a Helm chart or plain manifests suffice, use those. Operators earn their complexity when you need ongoing operational logic: reconciling drift, automating backups/failover/upgrades, managing external resources, encoding domain expertise. "Does this need a control loop, or just a one-time install?" is the deciding question. You now can build one when the answer is "control loop."
What this gave you¶
- You extended the Kubernetes API: defined a Website CRD and watched kubectl treat it as a first-class type.
- You got apiserver-enforced validation for free from struct markers.
- You built the operator that reconciles Website into a Deployment + Service + ConfigMap, with controller-runtime running the loop you built by hand in the pilot.
- You watched it reconcile spec changes, self-heal a deleted child, and GC everything on delete via owner references.
- You implemented finalizers for external cleanup - and now know what Terminating means.
- You can ship the operator into the cluster with generated RBAC.
- You understand that cert-manager, Prometheus Operator, Crossplane, and every operator are this: a CRD plus a reconciling controller.
Next: intercept the API itself - build an admission webhook that validates and mutates objects as they're created, before they're ever stored.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your operator code and the Website CRD manifest
- Output of `kubectl get website,deploy,svc,cm` showing the resources your operator reconciled into existence
- Demonstration that deleting the Website resource cleans up the children (finalizer + owner refs)
Workshop - Build pod networking from scratch (a minimal CNI)¶
Companion to Kubernetes -> Month 04 -> Week 13: The CNI Spec and Pod Networking. The chapter explains the CNI contract and how Pods get IPs and connectivity. This workshop has you build the networking by hand - wire up a network namespace with a veth pair and routing exactly as a CNI plugin does, then write a tiny CNI plugin script and watch the kubelet call it to network a real Pod. By the end "every Pod gets its own IP and can reach every other Pod" stops being a slogan and becomes plumbing you've installed yourself.
~90 minutes. Needs: a Linux host (CNI is Linux networking), kind/k3d, root, and the ip command. Builds on the Linux build-a-container-by-hand investigation (namespaces) - pod networking is namespaces + virtual ethernet + routing.
What you'll build, and the idea it makes concrete¶
First you'll wire two network namespaces together by hand (the exact moves a CNI plugin makes), prove they can ping each other, then package those moves into a minimal CNI plugin and watch the kubelet invoke it to give a real Pod connectivity. The idea:
Pod networking is not magic and not built into Kubernetes - it's delegated. Kubernetes defines a contract (CNI - Container Network Interface): "when I create a Pod, I'll call a plugin executable with the Pod's network namespace; the plugin's job is to give it an interface, an IP, and routes." Kubernetes itself ships no networking - it relies entirely on a CNI plugin (Calico, Cilium, Flannel, ...) to fulfill that contract. The "every Pod gets an IP, every Pod can reach every other Pod with no NAT" model is implemented by the plugin, using plain Linux primitives: network namespaces, veth pairs, bridges, and routes.
The container-by-hand investigation showed a Pod is a process in namespaces. This shows how that isolated process gets on the network - the piece that investigation deliberately left as "an empty network world."
Step 0: the model - veth pairs bridge namespaces¶
The core Linux primitive is the veth pair: a virtual ethernet cable with two ends. Put one end inside a Pod's network namespace and the other on the host, and you've connected the Pod to the host's networking. Add a bridge and routes, and Pods can reach each other:
  Pod A netns            host              Pod B netns
 +-----------+                            +-----------+
 |   eth0    |                            |   eth0    |
 |10.244.0.2 |                            |10.244.0.3 |
 +-----+-----+                            +-----+-----+
       | veth pair                              | veth pair
       |                                        |
    veth-a ----+       cni0 bridge       +---- veth-b
               +------[10.244.0.1]-------+
                            |
           (routing to other nodes / internet)
A CNI plugin's whole job, per Pod: create a veth pair, move one end into the Pod's netns and name it eth0, assign it an IP, attach the host end to a bridge, and set up routes. That's it. You're about to do each step by hand.
Step 1: wire two namespaces by hand¶
Create two network namespaces (standing in for two Pods) and connect each to a bridge. All as root on a Linux host:
# Create a bridge (the "node" network all pods attach to)
$ sudo ip link add cni0 type bridge
$ sudo ip addr add 10.244.0.1/24 dev cni0
$ sudo ip link set cni0 up
# "Pod A": a network namespace
$ sudo ip netns add podA
# veth pair: host end veth-a, pod end will become eth0
$ sudo ip link add veth-a type veth peer name eth0-a
# move the pod end into podA's namespace
$ sudo ip link set eth0-a netns podA
# attach the host end to the bridge
$ sudo ip link set veth-a master cni0
$ sudo ip link set veth-a up
# inside podA: name it eth0, give it an IP, bring it up, add a default route
$ sudo ip netns exec podA ip link set eth0-a name eth0
$ sudo ip netns exec podA ip addr add 10.244.0.2/24 dev eth0
$ sudo ip netns exec podA ip link set eth0 up
$ sudo ip netns exec podA ip link set lo up
$ sudo ip netns exec podA ip route add default via 10.244.0.1
Repeat for "Pod B" (podB, veth-b/eth0-b, IP 10.244.0.3). Now the payoff - watch the two "Pods" reach each other:
$ sudo ip netns exec podA ping -c2 10.244.0.3
PING 10.244.0.3: 56 data bytes
64 bytes from 10.244.0.3: icmp_seq=1 ttl=64 time=0.07 ms
64 bytes from 10.244.0.3: icmp_seq=2 ttl=64 time=0.05 ms # podA reaches podB!
You just built pod-to-pod networking by hand. Two isolated network namespaces, each with its own IP, talking over a bridge - exactly the "flat pod network" Kubernetes promises. No Kubernetes involved yet; this is pure Linux. A CNI plugin does precisely these ip commands, programmatically, every time a Pod is created. Inspect what you built:
$ sudo ip netns exec podA ip addr show eth0 # the pod's own eth0 at 10.244.0.2
$ bridge link show # veth-a and veth-b attached to cni0
(Clean up: sudo ip netns del podA; sudo ip netns del podB; sudo ip link del cni0. ip netns del takes one namespace name per call.)
Step 2: the CNI contract - how Kubernetes calls a plugin¶
Now connect this to Kubernetes. When a Pod starts, the kubelet asks the container runtime (containerd or CRI-O) to set up the Pod sandbox, and the runtime invokes a CNI plugin executable on the kubelet's behalf with:
- An operation (ADD when creating, DEL when destroying) via the CNI_COMMAND env var.
- The Pod's network namespace path via CNI_NETNS.
- The container ID (CNI_CONTAINERID) and interface name (CNI_IFNAME, usually eth0), also via env vars.
- A JSON config on stdin (from /etc/cni/net.d/).
The plugin does the Step 1 moves for that Pod's netns, then prints a JSON result (the assigned IP) to stdout. That's the entire contract: an executable in /opt/cni/bin/, config in /etc/cni/net.d/, env vars + stdin in, JSON out. Kubernetes ships no networking - it just calls this executable.
Step 3: write a minimal CNI plugin¶
A CNI plugin can be any executable - even a shell script. Here's a minimal one that does Step 1's moves for whatever netns the kubelet hands it (/opt/cni/bin/mini-cni):
#!/bin/bash
# Minimal CNI plugin. Reads CNI_COMMAND, CNI_NETNS, CNI_IFNAME from env; config on stdin.
set -e
config=$(cat) # JSON config from /etc/cni/net.d (drained from stdin; unused in this toy)
# Run ip inside the pod's netns. CNI_NETNS is a path, so enter it with nsenter
# rather than "ip -n", which takes a namespace name.
in_pod() { nsenter --net="$CNI_NETNS" ip "$@"; }
case "$CNI_COMMAND" in
  ADD)
    # Pick an IP (toy IPAM: random host octet; real plugins track allocations).
    ip_octet=$(( (RANDOM % 200) + 10 ))
    pod_ip="10.244.0.${ip_octet}"
    host_veth="veth$$" # unique host-side name
    # Create veth pair; the peer is created directly in the pod's netns as eth0.
    # No "|| true" here: if this fails, set -e aborts and the runtime sees the error.
    ip link add "$host_veth" type veth peer name "$CNI_IFNAME" netns "$CNI_NETNS"
    ip link set "$host_veth" master cni0
    ip link set "$host_veth" up
    in_pod addr add "${pod_ip}/24" dev "$CNI_IFNAME"
    in_pod link set "$CNI_IFNAME" up
    in_pod link set lo up
    in_pod route add default via 10.244.0.1
    # CNI requires a JSON result on stdout describing what we assigned.
    cat <<EOF
{
  "cniVersion": "1.0.0",
  "interfaces": [{"name": "$CNI_IFNAME", "sandbox": "$CNI_NETNS"}],
  "ips": [{"address": "${pod_ip}/24", "interface": 0, "gateway": "10.244.0.1"}]
}
EOF
    ;;
  DEL)
    # Tear down: the veth in the netns is removed when the netns goes away;
    # a real plugin also frees the IP and removes the host veth.
    echo '{"cniVersion":"1.0.0"}'
    ;;
  *)
    echo "unknown CNI_COMMAND $CNI_COMMAND" >&2; exit 1 ;;
esac
This is a real, if naive, CNI plugin - same contract as Calico or Cilium, just without IP-address-management, cross-node routing, or network policy. It does the veth+IP+route dance from Step 1, driven by the env vars the kubelet sets. (Note ip link add ... peer ... netns creates the peer directly in the target namespace - the modern one-step form of Step 1's two commands.)
Step 4: install it and watch the kubelet network a Pod¶
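On a kind node (or a single-node cluster you control), install the plugin and its config, then create a Pod and watch it get networked by your plugin. kind normally installs its own CNI (kindnet), and the runtime uses the first config file in /etc/cni/net.d in lexical order, so create the cluster with networking.disableDefaultCNI: true in the kind config; otherwise your 10-mini.conf may never be consulted. (On kind, exec into the node: docker exec -it <node> bash.)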
# install the plugin binary and config on the node
$ cp mini-cni /opt/cni/bin/mini-cni && chmod +x /opt/cni/bin/mini-cni
$ cat > /etc/cni/net.d/10-mini.conf <<EOF
{ "cniVersion": "1.0.0", "name": "mini", "type": "mini-cni" }
EOF
# create the bridge the plugin attaches to
$ ip link add cni0 type bridge 2>/dev/null; ip addr add 10.244.0.1/24 dev cni0; ip link set cni0 up
Now create a Pod and watch it come up with an IP your plugin assigned:
$ kubectl run netpod --image=busybox -- sleep 3600 # busybox ships ip and ping; the stock nginx image does not
$ kubectl get pod netpod -o wide
NAME READY STATUS IP NODE
netpod 1/1 Running 10.244.0.137 <node> <- an IP YOUR plugin assigned
The kubelet had the container runtime start the Pod's sandbox (a network namespace); the runtime ran /opt/cni/bin/mini-cni with CNI_COMMAND=ADD and that netns; your plugin wired up the veth + IP + route and reported the IP back, which was then recorded as the Pod's IP. Kubernetes did the lifecycle; your plugin did the networking. Confirm connectivity:
$ kubectl exec netpod -- ip addr show eth0 # the eth0 your plugin created
$ kubectl exec netpod -- ping -c2 10.244.0.1 # reaches the bridge gateway
You watched Kubernetes delegate networking to an executable you wrote. That's CNI.
Step 5: see what the real plugins add (and why)¶
Your mini-CNI works on one node. Real plugins solve the hard parts your toy skips - and seeing the gaps teaches what they're for:
- IPAM (IP Address Management). Your plugin picks a random octet and could collide. Real plugins maintain an allocation database per node so every Pod gets a unique, leak-free IP. (CNI even has dedicated IPAM plugins like host-local.)
- Cross-node routing. Your Pods can reach each other on the same node (same bridge). For Pod-on-node-A to reach Pod-on-node-B, you need routes between node subnets - Flannel does this with a VXLAN overlay, Calico with BGP, Cilium with an overlay or native routing on its eBPF datapath. This is the "flat network across nodes" part, and it's the bulk of what a CNI does.
- Network policy. Your plugin allows all traffic. Enforcing NetworkPolicy (the Cilium/eBPF investigation) is a CNI feature - and why not all CNIs support it.
- Performance. iptables-based vs eBPF-based dataplanes (the netfilter and Cilium investigations) - real plugins differ enormously here at scale.
So "which CNI should I use?" is really "which set of these do I need, and how should they be implemented?" You now understand the question because you built the floor they all stand on.
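To make the IPAM gap concrete, here is a minimal in-memory allocator sketch. It is an illustration under assumptions, not how host-local works: the Allocator type is hypothetical, state lives in a map instead of on disk, and the 10.244.0.x prefix is hard-coded to match this workshop.

```go
package main

import (
	"errors"
	"fmt"
)

// Allocator hands out host octets in 10.244.0.0/24 and remembers which
// container holds each one. Real IPAM plugins persist this state on disk.
type Allocator struct {
	first, last int
	byOctet     map[int]string // octet -> container ID
}

func NewAllocator(first, last int) *Allocator {
	return &Allocator{first: first, last: last, byOctet: map[int]string{}}
}

// Allocate returns the lowest free address for containerID. Calling it again
// for the same container returns the address it already holds.
func (a *Allocator) Allocate(containerID string) (string, error) {
	free := -1
	for o := a.first; o <= a.last; o++ {
		owner, used := a.byOctet[o]
		if used && owner == containerID {
			return fmt.Sprintf("10.244.0.%d", o), nil
		}
		if !used && free < 0 {
			free = o
		}
	}
	if free < 0 {
		return "", errors.New("address pool exhausted")
	}
	a.byOctet[free] = containerID
	return fmt.Sprintf("10.244.0.%d", free), nil
}

// Release frees whatever address containerID holds (the DEL path).
func (a *Allocator) Release(containerID string) {
	for o, owner := range a.byOctet {
		if owner == containerID {
			delete(a.byOctet, o)
		}
	}
}

func main() {
	ipam := NewAllocator(2, 4) // tiny pool: .2, .3, .4
	for _, id := range []string{"podA", "podB", "podC", "podD"} {
		ip, err := ipam.Allocate(id)
		fmt.Println(id, ip, err) // podD fails: the pool is full
	}
	ipam.Release("podB")
	ip, _ := ipam.Allocate("podD")
	fmt.Println("podD after podB's release:", ip)
}
```

Even this toy has to answer the questions the random octet dodges: what happens when the pool is full, what happens when ADD is retried for the same container, and who gives the address back.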
Now extend it¶
- Real IPAM. Replace the random octet with a file-based allocator (track assigned IPs in a file, never reuse a live one, free on DEL). You'll feel why IPAM is a whole subsystem.
- Cross-node connectivity. On a 2-node kind cluster, add static routes between the two node Pod-subnets so a Pod on node1 can ping a Pod on node2. This is Flannel-host-gw in miniature.
- Use the standard plugins. Chain the official bridge + host-local IPAM plugins (in /opt/cni/bin already) via a CNI config, instead of your script. See how the reference plugins do it properly.
- Trace the dataplane. Use the netfilter investigation to watch a Pod's packets traverse the host's hooks - connecting CNI (which builds the path) to netfilter (which the packets traverse).
What you might wonder¶
"Kubernetes really ships no networking?" Correct - and it's deliberate. The kubelet creates the Pod's network namespace and then calls whatever CNI plugin is configured; if none is installed, Pods stay ContainerCreating with a "no CNI config" error. This delegation is why you choose a CNI (Calico/Cilium/Flannel/...) when building a cluster - it's a required component Kubernetes doesn't provide. kind/minikube install one for you, which is why Pods "just work" there.
"veth pair vs bridge vs overlay - how do they relate?" veth pair = the cable connecting one Pod to the host. Bridge (cni0) = the switch all Pods on a node plug into (intra-node connectivity). Overlay (VXLAN) or routing (BGP) = how nodes connect their Pod subnets (inter-node connectivity). Your workshop built the first two; real CNIs add the third. They're layers, not alternatives.
"Why is the Pod IP on the Pod, not the node?" That's the Kubernetes networking model: every Pod gets a real, routable IP in a flat space, and Pods communicate without NAT (unlike Docker's default bridge, which NATs containers behind the host IP). This is why services can target Pod IPs directly and why it feels like every Pod is a little VM on the network. The CNI plugin implements this flat model.
"How does this connect to Services and kube-proxy?" CNI gives Pods IPs and connectivity (this workshop). Services give a stable virtual IP that load-balances across a set of Pod IPs - implemented by kube-proxy (iptables - the netfilter investigation) or Cilium (eBPF). CNI is the foundation (Pods can talk); Services are built on top (stable addressing + load balancing). Different layers, often different components.
"Should I ever write a CNI plugin?" Almost never from scratch - Calico/Cilium/Flannel are mature and cover ~all needs. You'd write CNI logic for exotic environments (a custom cloud, special hardware) or as a meta-plugin (chaining others). Building one here is to understand pod networking - so you can debug "why can't this Pod reach that one?" by knowing exactly what the plugin set up (veth? IP? route? policy?).
What this gave you¶
- You wired pod-to-pod networking by hand: veth pairs, a bridge, IPs, routes - and watched two namespaces ping each other.
- You know the CNI contract: Kubernetes calls a plugin executable (ADD/DEL, netns, JSON in/out) and ships no networking itself.
- You wrote a minimal CNI plugin and watched the kubelet invoke it to give a real Pod an IP and connectivity.
- You know what real plugins add - IPAM, cross-node routing, network policy, dataplane performance - and why those are the hard parts.
- You can reason about "why can't this Pod reach that one?" in terms of the actual plumbing.
- You connected CNI (builds the path) to netfilter and kube-proxy/Services (what runs on it).
Next: the platform layer - build a GitOps sync loop that continuously reconciles your cluster from a git repo, the pattern Argo CD and Flux industrialize.
Submit your build¶
When you finish this workshop, share what you built so others can see and learn from your work. Include:
- Public repo with your CNI plugin script
- Ping output between two pods on your bridge proving they can reach each other
- Output of `ip netns` and `ip link` showing the namespaces and veth pairs you wired
- Note on what kubelet passes to a CNI plugin on stdin and what you must return on stdout
Worked example - Week 14: a NetworkPolicy → what eBPF actually does¶
Companion to Kubernetes → Month 04 → Week 14: Cilium and eBPF. The week explains the Cilium model: CNI plugin, identities, the L3/L4/L7 policy layers, and the eBPF datapath. This page takes one Kubernetes NetworkPolicy and traces it through Cilium all the way to the eBPF program enforcing it on a packet.
You need a kind/k3s/minikube cluster with Cilium installed (cilium install from the Cilium CLI; or Helm with --set kubeProxyReplacement=true).
The policy¶
# api-deny-from-frontend.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-deny-from-frontend
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: orders
      ports:
        - port: 8080
          protocol: TCP
What this says, in English: "Pods in namespace shop with label app=api will accept TCP/8080 traffic only from pods labeled app=orders in the same namespace. Everything else gets dropped."
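That English reading can be written down as a small decision function. The sketch below is a deliberately reduced model, not the Kubernetes API types: it handles one policy, same-namespace podSelector peers, and a single TCP port per rule, and the Policy, Rule, and Pod structs are hypothetical.

```go
package main

import "fmt"

type Labels map[string]string

// matches reports whether every key/value in sel is present in l.
func matches(sel, l Labels) bool {
	for k, v := range sel {
		if got, ok := l[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// Rule is one ingress rule: allowed peers (same namespace) and one TCP port.
type Rule struct {
	From Labels
	Port int
}

// Policy is a reduced NetworkPolicy: a pod selector plus ingress rules.
type Policy struct {
	Namespace   string
	PodSelector Labels
	Ingress     []Rule
}

// Pod carries just the fields the evaluation needs.
type Pod struct {
	Namespace string
	Labels    Labels
}

// allowed applies single-policy semantics: pods the policy does not select are
// unaffected; selected pods accept only traffic some rule lists.
func allowed(p Policy, src, dst Pod, port int) bool {
	if dst.Namespace != p.Namespace || !matches(p.PodSelector, dst.Labels) {
		return true // dst is not selected, so this policy does not isolate it
	}
	for _, r := range p.Ingress {
		if src.Namespace == p.Namespace && matches(r.From, src.Labels) && r.Port == port {
			return true
		}
	}
	return false
}

func main() {
	policy := Policy{
		Namespace:   "shop",
		PodSelector: Labels{"app": "api"},
		Ingress:     []Rule{{From: Labels{"app": "orders"}, Port: 8080}},
	}
	api := Pod{"shop", Labels{"app": "api"}}
	orders := Pod{"shop", Labels{"app": "orders"}}
	frontend := Pod{"shop", Labels{"app": "frontend"}}

	fmt.Println("orders   -> api:8080   ", allowed(policy, orders, api, 8080))
	fmt.Println("frontend -> api:8080   ", allowed(policy, frontend, api, 8080))
	fmt.Println("orders   -> api:9090   ", allowed(policy, orders, api, 9090))
	fmt.Println("frontend -> orders:8080", allowed(policy, frontend, orders, 8080))
}
```

The last case is the one people forget: orders is not selected by the policy, so traffic to it stays wide open. A NetworkPolicy isolates only the pods its podSelector matches.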
Without a NetworkPolicy controller, Kubernetes ignores this object entirely. With Cilium installed, the policy becomes a real packet-level rule. Walk through how.
Step 1 - Pod IPs and identities¶
Apply the policy and deploy three sample pods:
$ kubectl create namespace shop # the policy and pods live here; apply fails if it doesn't exist
$ kubectl apply -f api-deny-from-frontend.yaml
$ kubectl run -n shop api --image=nginxinc/nginx-unprivileged --labels=app=api --port 8080 # this image listens on 8080; stock nginx listens on 80
$ kubectl run -n shop orders --image=alpine --labels=app=orders -- sh -c "while true; do sleep 60; done"
$ kubectl run -n shop frontend --image=alpine --labels=app=frontend -- sh -c "while true; do sleep 60; done"
Now look at what Cilium did:
$ kubectl exec -n kube-system ds/cilium -- cilium endpoint list -o json | jq '.[] | {id, name: .status.identity.labels, ip: .status.networking.addressing[0].ipv4}'
{ "id": 412, "name": ["k8s:app=api","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.42" }
{ "id": 1207, "name": ["k8s:app=orders","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.43" }
{ "id": 1208, "name": ["k8s:app=frontend","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.44" }
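Cilium assigned each pod an endpoint ID and a security identity derived from the pod's labels. The identity is a number, not the label set itself. All pods with the same label set share an identity, which is the unit Cilium reasons about. Note that the id field printed above is the node-local endpoint ID; the numeric identity is a separate field (.status.identity.id in the same JSON). The walkthrough below treats 412 as the api endpoint ID and, for simplicity, uses 1207 and 1208 as the identities of orders and frontend.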
The key trick: traditional iptables-based CNIs do rule matching by IP, which means rules scale O(n²) with pod count. Cilium does it by identity, which scales O(unique_label_sets²) - vastly smaller in practice.
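The "same label set, same identity" rule is small enough to sketch. This is an illustration only: the IdentityAllocator type and the starting number 1000 are made up, and real Cilium allocates identities cluster-wide through its own store.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// IdentityAllocator maps each distinct label set to one numeric identity.
type IdentityAllocator struct {
	next int
	ids  map[string]int
}

func NewIdentityAllocator(start int) *IdentityAllocator {
	return &IdentityAllocator{next: start, ids: map[string]int{}}
}

// canonical turns a label map into an order-independent string key.
func canonical(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ";")
}

// Lookup returns the identity for a label set, allocating one on first use.
func (a *IdentityAllocator) Lookup(labels map[string]string) int {
	key := canonical(labels)
	if id, ok := a.ids[key]; ok {
		return id
	}
	id := a.next
	a.next++
	a.ids[key] = id
	return id
}

func main() {
	alloc := NewIdentityAllocator(1000) // arbitrary starting number for the demo
	pods := []struct {
		name   string
		labels map[string]string
	}{
		{"orders-1", map[string]string{"app": "orders", "namespace": "shop"}},
		{"orders-2", map[string]string{"namespace": "shop", "app": "orders"}},
		{"frontend-1", map[string]string{"app": "frontend", "namespace": "shop"}},
	}
	for _, p := range pods {
		fmt.Println(p.name, "-> identity", alloc.Lookup(p.labels))
	}
	fmt.Println("pods:", len(pods), "distinct identities:", len(alloc.ids))
}
```

Three pods collapse to two identities, and scaling orders to a hundred replicas would still leave two. Policy state grows with the number of distinct label sets, not with the number of pods.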
Step 2 - The policy in Cilium's view¶
$ kubectl exec -n kube-system ds/cilium -- cilium policy get
[
{
"endpointSelector": {"matchLabels": {"k8s:app": "api", "k8s:io.kubernetes.pod.namespace": "shop"}},
"ingress": [
{
"fromEndpoints": [
{"matchLabels": {"k8s:app": "orders", "k8s:io.kubernetes.pod.namespace": "shop"}}
],
"toPorts": [{"ports": [{"port": "8080", "protocol": "TCP"}]}]
}
]
}
]
Same content, Cilium's internal representation. The selectors will resolve to specific identity numbers when the policy is materialized into eBPF maps.
Step 3 - Test the policy works¶
$ kubectl exec -n shop orders -- wget -qO- --timeout=2 http://10.244.0.42:8080
<!DOCTYPE html>
<html>
<head><title>Welcome to nginx!</title>
...
$ kubectl exec -n shop frontend -- wget -qO- --timeout=2 http://10.244.0.42:8080
wget: download timed out
orders succeeds (allowed). frontend times out (silently dropped). Good.
But where is the drop happening?
Step 4 - Find the eBPF program¶
Cilium attaches eBPF programs at several kernel hook points: tc (traffic control) ingress/egress on every pod's veth, and on the host's external interface. List them:
$ kubectl exec -n kube-system ds/cilium -- bpftool prog show | grep cil_
1342: sched_cls name cil_from_container tag 4f...
1343: sched_cls name cil_to_container tag 8a...
1344: sched_cls name cil_from_host tag c2...
1345: sched_cls name cil_to_host tag d7...
1346: sched_cls name cil_from_netdev tag e3...
These are the BPF programs implementing the datapath. cil_from_container runs on every packet leaving a pod's veth; cil_to_container on every packet entering. The policy enforcement happens in cil_to_container.
Step 5 - The maps Cilium uses¶
eBPF programs are stateless; they read from kernel-managed maps (kv stores). Cilium maintains several:
$ kubectl exec -n kube-system ds/cilium -- bpftool map show | grep -E "cilium_"
221: hash name cilium_policy key 16B value 48B max_entries 16384
222: lru_hash name cilium_ct4 key 40B value 64B max_entries 524288
223: hash name cilium_lxc key 4B value 64B max_entries 65536
224: hash name cilium_metrics key 8B value 16B max_entries 65536
...
- cilium_lxc - local pod IP → endpoint info (endpoint ID, MAC, security identity).
- cilium_ipcache - IP → security identity, for peers anywhere in the cluster (one of the maps elided above).
- cilium_policy - (endpoint_id, src_identity, port, protocol) → allow/deny. This is the lookup table the BPF program consults to decide whether a packet is allowed.
- cilium_ct4 - connection tracking. Stores active flows for established-connection allowance.
Step 6 - The actual lookup¶
When a packet from frontend (identity 1208) reaches the host with destination api (10.244.0.42:8080, endpoint 412):
1. cil_to_container BPF program triggers on the veth's ingress hook.
2. Program reads packet headers - src IP 10.244.0.44, dst IP 10.244.0.42, dst port 8080.
3. Program looks up dst endpoint via cilium_lxc[10.244.0.42] → endpoint 412.
4. Program looks up src identity via cilium_ipcache[10.244.0.44] → identity 1208.
5. Program builds policy key (endpoint=412, identity=1208, port=8080, proto=TCP) and queries cilium_policy.
6. No matching entry → returns DROP.
7. Program updates cilium_metrics (drop counter ++).
8. tc framework drops the packet.
When orders (identity 1207) sends the same kind of packet, step 5 builds key (412, 1207, 8080, TCP), the policy map has this entry (from the NetworkPolicy → identity match), and the program returns PASS. The packet proceeds; the connection is tracked in cilium_ct4 so the return packet is also allowed via fast-path.
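The whole decision is three map reads and one exact-match lookup, which a user-space sketch can mirror. Treat this as a model of the walkthrough's simplified tuple, not of Cilium's actual map layouts: the PolicyKey struct and Datapath type are hypothetical, and connection tracking is left out.

```go
package main

import "fmt"

// PolicyKey mirrors the tuple used in the walkthrough. The real map layout
// differs; this only illustrates the exact-match lookup.
type PolicyKey struct {
	Endpoint uint16
	Identity uint32
	Port     uint16
	Proto    uint8
}

const protoTCP = 6 // IANA protocol number for TCP

type Verdict string

const (
	Pass Verdict = "PASS"
	Drop Verdict = "DROP"
)

// Datapath bundles the lookups from the numbered steps.
type Datapath struct {
	endpoints map[string]uint16  // dst IP -> endpoint ID (cilium_lxc in the text)
	ipcache   map[string]uint32  // src IP -> identity (cilium_ipcache)
	policy    map[PolicyKey]bool // allow entries (cilium_policy)
	drops     int                // drop counter (cilium_metrics)
}

// Decide builds the key and consults the policy map. Unknown IPs yield zero
// values, which match no entry, so the default is deny.
func (d *Datapath) Decide(srcIP, dstIP string, port uint16, proto uint8) Verdict {
	key := PolicyKey{
		Endpoint: d.endpoints[dstIP],
		Identity: d.ipcache[srcIP],
		Port:     port,
		Proto:    proto,
	}
	if d.policy[key] {
		return Pass
	}
	d.drops++
	return Drop
}

func main() {
	dp := &Datapath{
		endpoints: map[string]uint16{"10.244.0.42": 412},
		ipcache:   map[string]uint32{"10.244.0.43": 1207, "10.244.0.44": 1208},
		policy: map[PolicyKey]bool{
			{Endpoint: 412, Identity: 1207, Port: 8080, Proto: protoTCP}: true,
		},
	}
	fmt.Println("orders   -> api:", dp.Decide("10.244.0.43", "10.244.0.42", 8080, protoTCP))
	fmt.Println("frontend -> api:", dp.Decide("10.244.0.44", "10.244.0.42", 8080, protoTCP))
	fmt.Println("drops counted:", dp.drops)
}
```

Note what is absent: there is no loop over rules. The cost of a decision does not depend on how many policies or pods exist, only on a fixed number of hash lookups.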
Step 7 - See the drop in real time¶
$ kubectl exec -n kube-system ds/cilium -- cilium monitor -t drop
xx drop (Policy denied) flow 0xab12 to endpoint 412, identity 1208->10044, file bpf_lxc.c line 1142, 86 bytes
This is the BPF program emitting a perf event when it drops a packet. The line shows the source file and line in bpf_lxc.c that made the decision, the source and destination identities, and the byte count. Hubble, Cilium's observability component, consumes these events to provide a real-time flow view.
Why this matters¶
A traditional iptables-based policy implementation (kube-proxy itself only implements Services, not NetworkPolicy) would handle this same policy by:
- Maintaining ~O(pods^2) iptables rules per port.
- Walking those rules linearly on every packet.
- Rewriting rules on every pod create/delete, which under churn can take seconds and lose packets.
Cilium's eBPF path:
- Maintains a hash map keyed by (endpoint, identity, port, proto).
- Does an O(1) lookup on every packet.
- Is identity-based: adding a new orders pod doesn't change the policy map at all (same identity).
In a cluster with 10,000 pods, the difference is "stable 50µs latency vs unbounded tail." That's the whole pitch for Cilium.
The trap¶
A NetworkPolicy without a controller that supports it does nothing. Many K8s users apply policies on clusters where the CNI doesn't enforce them, and the cluster silently allows everything. Verify with kubectl exec between pods that shouldn't be able to reach each other, or use cilium connectivity test if you're on Cilium.
The other trap: Cilium identity granularity. Two pods with identical label sets share an identity. If you split traffic by namespace alone, every pod in the namespace has the same identity for policy purposes. Add labels (role, tier, app-version) to get finer-grained control.
Exercise¶
- Run the demo above. Confirm the drop is visible via cilium monitor.
- Add a second allowed source: pods labeled app=admin. Reapply the policy. Watch cilium policy get change and confirm admin pods now succeed.
- (Advanced) Use bpftool prog dump xlated id <prog-id> on one of the cil_* programs. Read the BPF assembly. Find the map lookup instructions.
- (Advanced) Read Documentation/bpf/ in the kernel tree for the BPF instruction set reference. Find BPF_LDX, BPF_JEQ. You'll see them in the disassembly.
Related reading¶
- The main Week 14 chapter covers Cilium's architecture beyond just NetworkPolicy.
- The Linux Kernel path → Month 3 → eBPF chapters cover the verifier and the JIT.
- Glossary: eBPF, Identity, CNI, tc (traffic control) in the main glossary.