
Kubernetes

Control plane, kubelet/CRI, controllers, networking, day-2.



Kubernetes Platform Engineering: A 24-Week Mastery Roadmap

Authoring lens: Principal Platform Engineer / Kubernetes Maintainer. Target outcome: A graduate of this curriculum is capable of (a) building and operating a hardened Kubernetes cluster from scratch on bare metal or any cloud, (b) extending the control plane via custom controllers/operators built with controller-runtime or client-go, and (c) contributing patches to kubernetes/kubernetes or one of its core ecosystems (Cilium, Istio, ArgoCD, Crossplane).

This is not "kubectl in 24 weeks." It assumes the reader has used Kubernetes (deployed an app, run kubectl), understands containers (see the CONTAINER_INTERNALS_PLAN curriculum if not), and is ready to read kubernetes/kubernetes source as primary literature.


Repository Layout

File - Purpose
00_PRELUDE_AND_PHILOSOPHY.md - What Kubernetes is, what it isn't, the design ethics, reading list.
01_MONTH_CONTROL_PLANE.md - Weeks 1–4. etcd & Raft, kube-apiserver, scheduler, controllers.
02_MONTH_KUBELET_AND_CRI.md - Weeks 5–8. kubelet, CRI, kube-proxy, CSI, device plugins.
03_MONTH_CONTROLLERS_AND_OPERATORS.md - Weeks 9–12. client-go, controller-runtime, CRDs, the operator pattern.
04_MONTH_NETWORKING_AND_STORAGE.md - Weeks 13–16. CNI, Cilium/eBPF, service meshes, CSI, dynamic provisioning.
05_MONTH_PLATFORM_AND_DAY2.md - Weeks 17–20. GitOps (Argo/Flux), IaC (Crossplane), HPA/VPA, admission, OPA.
06_MONTH_HARD_WAY_CAPSTONE.md - Weeks 21–24. K8s the Hard Way; multi-tenancy; mTLS; capstone.
APPENDIX_A_HARDENING.md - CIS, Pod Security, network policy, RBAC, audit.
APPENDIX_B_TROUBLESHOOTING.md - Reference flows: pod-pending, node-notready, etcd-degraded, etc.
APPENDIX_C_CONTRIBUTING.md - Contributing to k8s.io: SIGs, KEPs, first-PR playbook.
CAPSTONE_PROJECTS.md - Three tracks: hard-way bare-metal cluster, GitOps platform, operator from scratch.

How Each Week Is Structured

  1. Conceptual Core: the why, with a mental model.
  2. Mechanical Detail: the how, with kubernetes/kubernetes source pointers.
  3. Lab: a hands-on exercise using a real cluster (kind, k3s, or hard-way).
  4. Hardening Drill: a security/compliance micro-task.
  5. Operations Slice: a Day-2-ops micro-task (monitoring, scaling, recovery).

Each week is sized for ~12–16 focused hours. Almost every lab requires a working cluster; invest early in a smooth local cluster setup (kind or k3d for dev; a 3-node kubeadm cluster on cloud VMs for realistic ops).


Progression Strategy

Control Plane ──► Kubelet & CRI ──► Controllers & Operators
      │                │                    │
      └────────┬───────┴────────────────────┘
   Networking & Storage
   Platform & Day-2 Ops
   Hard Way & Capstone

Prerequisites

  • Container fluency (the CONTAINER_INTERNALS_PLAN weeks 1–3 minimum).
  • Linux fluency (the LINUX curriculum weeks 9–10 minimum: namespaces & cgroups).
  • Comfortable with at least one of Go, Python, or Rust at a "I can build a small CLI" level.
  • A budget for cloud VMs OR hardware to run a multi-node cluster (3 small VMs is sufficient).

Capstone Tracks (pick one in Month 6)

  1. Hard Way Track: provision a multi-node Kubernetes cluster from scratch on bare metal or cloud, with mTLS, fine-grained RBAC, multi-tenancy, and a documented runbook.
  2. Platform Track: build a GitOps-driven platform-as-a-service: ArgoCD/Flux + Crossplane + OPA Gatekeeper + multi-tenancy + self-service. Demonstrate onboarding a new team in <30 minutes.
  3. Operator Track: build a non-trivial operator from scratch (e.g., a stateful database operator with backup/restore, or an operator that manages an external SaaS resource via Crossplane composition). Production quality.

Details in CAPSTONE_PROJECTS.md.

Prelude-What Kubernetes Actually Is

Sit with this document for an evening before week 1.


1. Kubernetes Is a Distributed Reconciliation Loop

The most clarifying way to understand Kubernetes:

Kubernetes is a distributed key-value store (etcd) wrapped in an HTTP API server, surrounded by a swarm of independent controllers, each of which watches some types of objects in the store and writes other types of objects in response-until the cluster's actual state matches the desired state.

That's it. No central brain. No orchestration engine in the traditional sense. Every component is a client of the API server. The "control plane" is an emergent property of independent controllers cooperating through a shared, transactional store.

If you internalize this, the rest of the curriculum is bookkeeping.


2. The Control Loop Is the Atom

Every interesting behavior in Kubernetes is some controller running this loop:

for {
  desired := apiServer.List(myWatchedKind)
  actual := observe(realWorld)
  diff := compute(desired, actual)
  for _, action := range diff {
    actOn(action)              // create / update / delete real resources
    apiServer.UpdateStatus(...)
  }
  watchOrSleep()
}

The Deployment controller watches Deployment objects, creates/updates ReplicaSet objects. The ReplicaSet controller watches ReplicaSet objects, creates/updates Pod objects. The kubelet watches Pod objects bound to its node, talks to the CRI to actually start containers. Each controller's only knowledge of the others is the objects they share.
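The controller chain above can be collapsed into a runnable sketch. This is plain Go with no Kubernetes libraries; the `cluster` type and its maps are invented stand-ins for the API server's desired state and the real world:

```go
package main

import "fmt"

// A toy model of the reconciliation loop: desired and actual state are
// plain maps from name to replica count. Each pass applies the diff and
// re-observes, converging without any central coordinator.
type cluster struct {
	desired map[string]int
	actual  map[string]int
}

// reconcileOnce computes the diff and applies every action, returning
// the number of actions taken. Zero actions means converged.
func (c *cluster) reconcileOnce() int {
	actions := 0
	for name, want := range c.desired {
		if c.actual[name] != want {
			c.actual[name] = want // "actOn": create/scale real resources
			actions++
		}
	}
	for name := range c.actual {
		if _, ok := c.desired[name]; !ok {
			delete(c.actual, name) // garbage-collect orphans
			actions++
		}
	}
	return actions
}

func main() {
	c := &cluster{
		desired: map[string]int{"web": 3, "api": 2},
		actual:  map[string]int{"web": 1, "old": 5},
	}
	for c.reconcileOnce() > 0 { // loop until actual == desired
	}
	fmt.Println(c.actual)
}
```

The point of the sketch is the shape, not the mechanics: each pass is idempotent, and running it again after convergence does nothing.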

This is also why Kubernetes is eventually consistent-there is no central scheduler enforcing global state, just many controllers converging.


3. The Five-Axis Cost Model

A working platform engineer reasons along five axes:

Axis Question to ask
Control plane Does this load etcd? How many writes? How many list/watch consumers?
Scheduling What's the resource request? Affinity? Taint/toleration? Topology spread?
Networking Cluster-internal service / NodePort / LoadBalancer / Ingress / Gateway? CNI overhead?
Identity & isolation What ServiceAccount? What namespace? What RBAC? What NetworkPolicy? What PodSecurity profile?
Day-2 ops What does upgrade look like? Backup? Disaster recovery? Cost?

Beginner courses teach axis 2 only.


4. The Reading List

Primary
  • Kubernetes in Action (Lukša, 2e). The single best book.
  • Programming Kubernetes (Hausenblas & Schimanski). Required for Months 3–4.
  • Production Kubernetes (Vyas, et al.). The Day-2 bible.
  • Cloud Native Patterns (Davis). Architectural patterns.

Source
  • kubernetes/kubernetes - the monorepo. Particularly:
  • cmd/kube-apiserver, pkg/apiserver/, staging/src/k8s.io/apiserver/ - API server.
  • cmd/kube-scheduler, pkg/scheduler/ - scheduler.
  • cmd/kube-controller-manager, pkg/controller/ - built-in controllers.
  • pkg/kubelet/ - node agent.
  • pkg/proxy/ - service implementations.
  • kubernetes/community/sig-* - design docs.
  • KEPs (Kubernetes Enhancement Proposals) at kubernetes/enhancements. The canonical record of why features exist.

Adjacent canon
  • Designing Data-Intensive Applications (Kleppmann). Especially the chapters on consensus and replication.
  • The Raft paper. Read in week 1.
  • Site Reliability Engineering (Google). The "what does Day-2 mean?" book.


5. Curriculum Philosophy

  1. Source first, blog second. When the curriculum says "study informer mechanics," open staging/src/k8s.io/client-go/tools/cache/. Blogs go stale; commits are dated.
  2. Run a real cluster. Many labs assume a multi-node setup. kind is fine for development; weeks 17+ assume something closer to production.
  3. Defaults are wrong. Kubernetes ships with permissive defaults to ease onboarding (no NetworkPolicy, no PodSecurity, broad RBAC). Production requires inverting them.

6. What Kubernetes Is Not For

A graduate of this curriculum should be able to argue these points:

  • Single-server simple deploys. A Postgres + an app on one VM with systemd is operationally simpler than a one-node Kubernetes cluster. Don't add a control plane to host one app.
  • Hard real-time / latency-critical hot paths. kube-proxy adds latency. CNI plugins add latency. The scheduler is not designed for sub-millisecond placement decisions. Use bare-metal or VM-based deployments for ultra-low-latency workloads.
  • Stateful databases at scale, naively. Kubernetes can run stateful workloads with operators (Postgres operator, MongoDB operator, etc.), but doing it correctly requires a mature operator ecosystem and skilled operators. "Just put your DB in K8s" is not free.
  • Teams without ops capacity. Kubernetes is not a Heroku replacement. The complexity is real. If you don't have a platform team, use Cloud Run, Fly, or a managed container service before reaching for K8s.

7. AI-Assisted Workflows

  • Always read generated YAML. Models hallucinate field names; Kubernetes silently ignores unknown fields by default-your "successful apply" may be doing nothing.
  • Verify CRD generation. Tools like controller-gen are deterministic; let them generate, never hand-edit.
  • Treat generated RBAC with extreme suspicion. Models tend toward over-broad permissions ("just give it cluster-admin"). Tighten by hand.

You are now ready for Week 1. Open 01_MONTH_CONTROL_PLANE.md.

Month 1-The Control Plane: etcd, kube-apiserver, scheduler, controllers

Goal: by the end of week 4 you can (a) describe etcd's Raft model and replay a leader election, (b) trace an apply through the API-server's request pipeline (auth → admission → validation → storage), (c) explain how the scheduler picks a node, and (d) read the source of one built-in controller (Deployment) end-to-end.


Weeks

Week 1 - etcd and the Raft Consensus Foundation

1.1 Conceptual Core

  • etcd is the persistent, consistent state store for everything in Kubernetes. The API server is its only client; every other component reads via the API server, never etcd directly.
  • Raft (the consensus protocol) gives etcd: linearizable reads and writes, tolerance of minority member failures, and bounded recovery time after node failure.
  • A Kubernetes cluster's reliability is bounded by its etcd cluster's reliability. Run 3 or 5 etcd nodes (never even numbers); back them up; monitor them.

1.2 Mechanical Detail

  • Read the Raft paper. Then read etcd-io/raft/raft.go end to end (~3000 lines). The paper takes 90 minutes; the source another 4 hours; together they're worth a year of intuition.
  • etcd's data model: a flat keyspace, mvcc revisions, watch streams. Kubernetes uses keys like /registry/pods/<namespace>/<name> and stores protobuf-encoded objects.
  • Watch streams are the foundation of every Kubernetes controller. The API server multiplexes per-resource watches over a single etcd watch stream.
  • Performance characteristics:
  • Write latency ≈ one network round trip from leader to follower quorum + fsync.
  • Read latency: linearizable reads add a quorum check (ReadIndex); serializable reads are served locally by any member.
  • Throughput is bound by the leader's fsync rate. SSDs help dramatically.
  • Compaction and defragmentation are required ops; without them etcd grows unbounded.
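The data model above (flat keys, one global revision, watch-from-revision) fits in a toy in-memory sketch; every type and method name here is invented for illustration, and real etcd streams events over gRPC rather than replaying a slice:

```go
package main

import "fmt"

// A toy model of etcd's MVCC keyspace: every write bumps one global
// revision, and a watch from revision R replays every event with
// rev >= R.
type event struct {
	rev   int64
	key   string
	value string
}

type store struct {
	rev int64
	log []event // append-only event history
}

func (s *store) put(key, value string) int64 {
	s.rev++
	s.log = append(s.log, event{s.rev, key, value})
	return s.rev
}

// watch returns all events at or after fromRev, like a watch that
// starts from a still-retained (not yet compacted) revision.
func (s *store) watch(fromRev int64) []event {
	var out []event
	for _, e := range s.log {
		if e.rev >= fromRev {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	s := &store{}
	s.put("/registry/pods/default/a", "v1")
	r := s.put("/registry/pods/default/a", "v2")
	for _, e := range s.watch(r) {
		fmt.Println(e.rev, e.key, e.value)
	}
}
```

This also shows why compaction matters: the append-only history grows without bound until old revisions are dropped.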

1.3 Lab-"etcd, Up Close"

  1. Bring up a 3-node etcd cluster locally (etcd binaries, no Kubernetes yet). Configure peer/client URLs.
  2. Use etcdctl to put/get keys; observe consistent reads.
  3. Kill the leader. Use etcdctl endpoint status --cluster to identify the new leader within seconds.
  4. Use etcdctl watch /foo from one terminal; put values from another. Internalize the watch model.
  5. Run etcdctl compact <revision> followed by etcdctl --command-timeout=60s defrag. Observe the disk-usage drop.

1.4 Hardening Drill

  • Configure mTLS between etcd peers and between client and etcd. Configure auth (role, user).
  • Take a snapshot via etcdctl snapshot save. Restore to a new cluster. Verify integrity.

1.5 Operations Slice

  • Wire etcd metrics to Prometheus: etcd_server_has_leader, etcd_disk_wal_fsync_duration_seconds, etcd_mvcc_db_total_size_in_bytes. Alert on absent leader, fsync p99 > 100ms, db-size approaching quota-backend-bytes.

Week 2 - The kube-apiserver

2.1 Conceptual Core

  • The API server is itself stateless, but it is the only component that talks to etcd. It exposes the REST API (JSON, YAML, and protobuf encodings), performs authn/authz/admission, and persists objects to etcd.
  • Every Kubernetes operation-kubectl apply, controller reconciliation, kubelet status update-is an HTTP request to this server.
  • Three stages every request traverses: Authentication (who are you?), Authorization (what may you do? RBAC in practice), Admission (built-in admission controllers plus mutating and validating webhooks).

2.2 Mechanical Detail

  • Authentication mechanisms: x509 client certs, bearer tokens (ServiceAccount tokens, OIDC), webhook tokens. Each is a request handler chain entry.
  • Authorization: RBAC is the dominant mode. ABAC and webhook authz exist but are rare. RBAC binds subjects (User, Group, ServiceAccount) to roles (verb + resource + namespace combinations) via RoleBinding/ClusterRoleBinding.
  • Admission:
  • Mutating webhooks: can modify the object before validation (e.g., inject sidecars).
  • Validating webhooks: can only accept or reject.
  • Built-in admission controllers: LimitRanger, ResourceQuota, ServiceAccount, PodSecurity, NodeRestriction, etc. Read plugin/pkg/admission/ in the k8s tree.
  • Aggregated API server: third parties can register their own API surface (e.g., metrics-server, Knative). The main apiserver proxies requests to them.
  • Storage: every object has a "storage version" in etcd. The server converts between API versions on read/write. This is what allows v1beta1 → v1 migrations.

2.3 Lab-"Read the Pipeline"

  1. Use kubectl --v=8 to dump the wire-level request/response of a kubectl apply. Read it carefully.
  2. Use kubectl get --raw to hit /apis/, /api/v1, /apis/apps/v1 and see the discovery surface.
  3. Configure the apiserver to log all requests with --audit-policy-file=audit.yaml. Apply a few changes; read the audit log.
  4. Write a tiny mutating webhook (in Go, using controller-runtime's webhook facilities) that adds a label to every Pod. Deploy and verify.

2.4 Hardening Drill

  • Audit policy template: log Metadata for every request, Request for secrets/configmaps, RequestResponse for roles/rolebindings/clusterroles/clusterrolebindings. Ship logs off-cluster.

2.5 Operations Slice

  • Wire apiserver metrics: apiserver_request_total, apiserver_request_duration_seconds, apiserver_storage_objects. Alert on per-resource latency p99 spikes.

Week 3 - The Scheduler

3.1 Conceptual Core

  • The default scheduler is a single-replica controller (with leader election) that watches unscheduled Pods and binds them to Nodes. The "binding" is just a write to the Pod's spec.nodeName field.
  • The scheduler's algorithm is filter then score: filter Nodes that can't host the Pod (resources, affinities, taints), score the remaining ones, pick the highest-scoring.
  • The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking.

3.2 Mechanical Detail

  • Scheduler framework extension points (read pkg/scheduler/framework/types.go):
  • PreFilter-short-circuit conditions.
  • Filter-must return Success for the Node to be eligible.
  • PostFilter-invoked when no Node passes filter (e.g., to trigger preemption).
  • PreScore, Score, NormalizeScore.
  • Reserve, Permit, PreBind, Bind, PostBind.
  • Built-in plugins: NodeResourcesFit, NodeAffinity, PodTopologySpread, InterPodAffinity, TaintToleration, NodePorts, VolumeBinding, ImageLocality, NodeResourcesBalancedAllocation.
  • Scheduling profiles: multiple "scheduler personalities" can run in one binary, each with a different plugin config. Used for batch workloads with different priorities.
  • Preemption: when a high-priority Pod can't fit, the scheduler may preempt (delete) lower-priority Pods. priorityClass is the knob.
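Filter-then-score in miniature, as a runnable sketch. The plugin logic is invented (filter on free CPU, score by free memory) and is not the real NodeResourcesFit scoring, but the two-phase shape is the framework's:

```go
package main

import "fmt"

// node and pod are toy stand-ins for the scheduler's view of the world.
type node struct {
	name             string
	freeCPU, freeMem int
}

type pod struct{ cpu int }

// schedule runs a Filter pass (drop nodes that cannot host the pod)
// then a Score pass (pick the best survivor).
func schedule(p pod, nodes []node) (string, bool) {
	best, bestScore := "", -1
	for _, n := range nodes {
		if n.freeCPU < p.cpu { // Filter: node is ineligible
			continue
		}
		if n.freeMem > bestScore { // Score: prefer most free memory
			best, bestScore = n.name, n.freeMem
		}
	}
	return best, best != ""
}

func main() {
	nodes := []node{
		{"node-a", 2, 8},
		{"node-b", 4, 16},
		{"node-c", 4, 4},
	}
	name, ok := schedule(pod{cpu: 3}, nodes)
	fmt.Println(name, ok) // node-a is filtered out; node-b wins on memory
}
```

When `schedule` returns `false`, the real framework invokes PostFilter, which is where preemption lives.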

3.3 Lab-"Scheduler in Action"

  1. Use kubectl describe on a pending Pod to see filter/score reasons.
  2. Set a Node taint (kubectl taint nodes node1 key=value:NoSchedule); observe new Pods avoid it.
  3. Define PriorityClasses (high, default, batch); deploy mixed-priority Pods; trigger preemption by oversaturating.
  4. Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Configure your scheduler binary; run it. Verify selection difference vs default.

3.4 Hardening Drill

  • Set priorityClassName on system-critical Pods (CSI driver, ingress controller). Use system-cluster-critical and system-node-critical for cluster-internal pods.

3.5 Operations Slice

  • Wire scheduler metrics: scheduler_pending_pods, scheduler_pod_scheduling_duration_seconds, scheduler_pod_scheduling_attempts. Alert on persistent pending Pods.

Week 4 - Built-in Controllers and client-go Foundations

4.1 Conceptual Core

  • The kube-controller-manager is a single binary running ~30 built-in controllers. Each is a goroutine running the reconciliation loop pattern.
  • The Deployment controller is the best worked example: watches Deployment objects, creates/updates ReplicaSet objects, drives rolling-update progression.
  • The patterns the built-in controllers establish-informers, work queues, structured logging, leader election-are the templates you'll use when building custom controllers.

4.2 Mechanical Detail

  • Informers (staging/src/k8s.io/client-go/tools/cache/):
  • A shared in-memory cache populated by a single watch stream per resource type.
  • Event handlers: OnAdd, OnUpdate, OnDelete.
  • Cache provides O(1) lookup by namespace/name.
  • Multiple controllers in one process share informers via SharedInformerFactory.
  • Work queues (client-go/util/workqueue): rate-limited, deduplicated, item-keyed queues. Reconcile functions pull a key, list-from-cache, act, requeue on error.
  • The Deployment controller flow (pkg/controller/deployment/):
  • Informer detects a Deployment change.
  • Reconciler computes the desired ReplicaSet count and per-RS replica counts based on strategy (rolling vs recreate).
  • Creates new RS / scales old RS / scales new RS.
  • Updates Deployment status with progress.
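The rolling-update step sizes come from simple rounding arithmetic: a percentage maxSurge rounds up and a percentage maxUnavailable rounds down (so both cannot accidentally resolve to zero). A sketch with an invented helper name:

```go
package main

import (
	"fmt"
	"math"
)

// resolvePercent converts a percentage of the replica count to an
// absolute number; maxSurge rounds up, maxUnavailable rounds down.
func resolvePercent(replicas, percent int, roundUp bool) int {
	v := float64(replicas) * float64(percent) / 100
	if roundUp {
		return int(math.Ceil(v))
	}
	return int(math.Floor(v))
}

func main() {
	replicas := 10
	surge := resolvePercent(replicas, 25, true)        // ceil(2.5) = 3
	unavailable := resolvePercent(replicas, 25, false) // floor(2.5) = 2
	// During the rollout the controller keeps total pods <= replicas+surge
	// and ready pods >= replicas-unavailable.
	fmt.Println(replicas+surge, replicas-unavailable)
}
```

The rounding directions explain a common surprise: with 3 replicas and the default 25%/25%, surge resolves to 1 and unavailable to 0, so the rollout proceeds strictly one surged pod at a time.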

4.3 Lab-"Read the Deployment Controller"

  1. Read pkg/controller/deployment/deployment_controller.go end-to-end (~1500 lines).
  2. Trace a kubectl rollout through the source: which conditions are checked, which fields updated, what triggers the next loop iteration.
  3. Reproduce a stuck-rollout scenario (deploy a bad image); observe Progressing=False after the deadline; inspect status conditions.
  4. Manually scale a Deployment to 0 with kubectl scale; trace what the controller does in response.

4.4 Hardening Drill

  • Set sensible Deployment defaults in your platform: progressDeadlineSeconds, revisionHistoryLimit, rolling-update maxSurge/maxUnavailable for production workloads.

4.5 Operations Slice

  • Wire workqueue metrics: workqueue_adds_total, workqueue_depth, workqueue_queue_duration_seconds. Alert on persistent depth or processing latency.

Month 1 Capstone Deliverable

A control-plane/ workspace:
  1. etcd-cluster/ - week 1's 3-node cluster + backup/restore script.
  2. audit-pipeline/ - week 2's audit-log shipping + sample queries.
  3. custom-scheduler-plugin/ - week 3's scheduler plugin + deployment.
  4. controller-walkthrough.md - week 4's annotated tour of the Deployment controller.

Month 2-The Node: kubelet, CRI, kube-proxy, CSI, Devices

Goal: by the end of week 8 you can (a) describe how the kubelet maintains the desired Pod state on a node, (b) trace a service request from client to backing Pod through kube-proxy or eBPF dataplane, (c) explain CSI volume lifecycle (provision → attach → mount), and (d) write a basic device plugin or CSI driver shim.


Weeks

Week 5 - Kubelet Internals

5.1 Conceptual Core

  • The kubelet is the per-node agent. Its job: watch Pods bound to this node, drive the CRI to make the actual containers match. Plus: report node status, manage volumes, run health checks, evict on resource pressure.
  • The kubelet is also a PLEG (Pod Lifecycle Event Generator) that polls the runtime to detect actual container state changes-necessary because container exits aren't always pushed events.
  • Kubelet is the component most often blamed for "weird" Kubernetes behavior; understanding it is non-optional.

5.2 Mechanical Detail

  • Read pkg/kubelet/kubelet.go. Major loops:
  • syncLoop - the main reconciliation loop.
  • PLEG - pod-lifecycle event generation.
  • volumeManager - volume mount/unmount.
  • statusManager - Pod status updates back to the apiserver.
  • evictionManager - resource-pressure eviction.
  • Static pods-Pods defined as YAML files on disk (/etc/kubernetes/manifests/). Kubelet runs them directly without an apiserver. How control-plane pods bootstrap themselves.
  • Pod lifecycle phases: Pending → Running → Succeeded/Failed, with container states Waiting / Running / Terminated.
  • Pod resource enforcement: kubelet sets cgroups based on requests/limits. With cpu-manager-policy=static, the kubelet pins exclusive CPUs to Guaranteed-class Pods. Same idea for memory-manager-policy and topology-manager-policy.
  • Eviction: when a node runs low on memory, disk, PID space, the kubelet evicts Pods in priority order. Soft vs hard thresholds.
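A deliberately simplified sketch of eviction ranking: pods whose usage exceeds their request are evicted first, then lower priority evicts earlier. (The real ranking in pkg/kubelet/eviction also weighs how far usage exceeds the request; all names here are invented.)

```go
package main

import (
	"fmt"
	"sort"
)

// podUse is a toy view of a pod's priority and resource accounting.
type podUse struct {
	name           string
	priority       int
	usage, request int
}

// evictionOrder returns pod names in the order the kubelet would
// consider them for eviction under this simplified model.
func evictionOrder(pods []podUse) []string {
	sort.SliceStable(pods, func(i, j int) bool {
		over := func(p podUse) bool { return p.usage > p.request }
		if over(pods[i]) != over(pods[j]) {
			return over(pods[i]) // over-request pods evict first
		}
		return pods[i].priority < pods[j].priority // then lowest priority
	})
	names := make([]string, len(pods))
	for i, p := range pods {
		names[i] = p.name
	}
	return names
}

func main() {
	order := evictionOrder([]podUse{
		{"critical", 1000, 100, 200}, // within request, high priority
		{"batch", 0, 300, 100},       // over request, low priority
		{"web", 100, 150, 100},       // over request, mid priority
	})
	fmt.Println(order) // batch, web, critical
}
```

This is why setting accurate requests is itself a reliability control: a pod that stays within its request survives memory pressure longer than one that bursts past it, regardless of priority.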

5.3 Lab-"Kubelet Forensics"

  1. SSH to a node. journalctl -u kubelet -f and trigger a Pod creation. Watch the log.
  2. crictl ps, crictl pods, crictl inspect - operate at the CRI layer directly.
  3. Place a static pod manifest; observe kubelet picking it up.
  4. Trigger a memory eviction by setting low evictionHard and oversubscribing. Read the eviction event and the kubelet's decision.

5.4 Hardening Drill

  • Set kubelet flags: --read-only-port=0, --anonymous-auth=false, --authorization-mode=Webhook, --protect-kernel-defaults=true, --make-iptables-util-chains=true, --tls-min-version=VersionTLS12.

5.5 Operations Slice

  • Wire kubelet metrics: kubelet_pod_start_duration_seconds, kubelet_running_pods, kubelet_volume_stats_used_bytes. Alert on slow Pod starts.

Week 6 - CRI: kubelet ↔ Runtime

6.1 Conceptual Core

  • The kubelet does not run containers itself; it talks gRPC to a CRI implementation (containerd, CRI-O). The CRI provides RuntimeService (containers + sandboxes) and ImageService (pull/list/remove).
  • Every Pod is a sandbox (a network namespace + the "pause" container) plus N containers sharing it.

6.2 Mechanical Detail

  • The CRI proto: cri-api/pkg/apis/runtime/v1/api.proto. The most relevant calls: RunPodSandbox, CreateContainer, StartContainer, StopContainer, RemovePodSandbox, Exec, Attach, PortForward, PullImage.
  • The pause container is a tiny binary (it just calls pause(2)); it holds open the network namespace so the application containers can come and go.
  • crictl is the CLI for talking to a CRI runtime directly: crictl ps, crictl inspect, crictl exec. Unlike kubectl, it talks to the node's container runtime over the CRI socket, bypassing the API server entirely.
  • Runtime classes: RuntimeClass objects bind a name (gvisor, kata) to a runtime handler. Pods reference via spec.runtimeClassName.

6.3 Lab-"CRI Direct"

  1. From a node, crictl pull alpine; crictl runp pod-config.json; crictl create <pod-id> ctr-config.json img-config.json; crictl start <ctr-id>. You've launched a pod-equivalent without the apiserver.
  2. Compare with kubectl deploying the same: trace each CRI call in the kubelet log.
  3. Configure containerd with multiple runtimes (runc + runsc); register both as RuntimeClasses; deploy Pods against each.

6.4 Hardening Drill

  • Configure containerd to default to a non-root user, drop default capabilities, apply default seccomp. The same hardening from the Container curriculum, applied at the daemon level.

6.5 Operations Slice

  • Monitor container_runtime_* metrics from cAdvisor (built into kubelet). Alert on container-restart rate spikes.

Week 7 - kube-proxy, Services, and the Networking Dataplane

7.1 Conceptual Core

  • A Service is a stable virtual IP and port that load-balances across a set of Pods. It is implemented at L4 by kube-proxy-or, in modern eBPF-based clusters, by the CNI directly (Cilium replaces kube-proxy entirely).
  • Modes:
  • iptables (default): kube-proxy programs iptables DNAT rules. O(N) match per packet; degrades with many Services.
  • IPVS: kube-proxy programs the kernel IPVS load balancer. O(1) lookup; better for >1000 services.
  • eBPF (Cilium): bypasses iptables entirely; programs are attached at the socket layer (bpf_sock_ops) and at the egress point. Lowest overhead.

7.2 Mechanical Detail

  • EndpointSlices (GA in 1.21) supersede Endpoints: per-Service endpoint lists are split across multiple objects to scale beyond ~1000 endpoints per Service.
  • Service types: ClusterIP (default, internal), NodePort (open a port on every node), LoadBalancer (cloud LB integration), ExternalName (DNS CNAME).
  • Headless Services (clusterIP: None): no virtual IP; DNS returns Pod IPs directly. Used by StatefulSets.
  • Topology-aware routing: prefer endpoints in the same zone (since 1.27 stable). Saves cross-zone egress costs.
  • Service IPs are virtual: no NIC has them; they live only in iptables/IPVS/eBPF rules.

7.3 Lab-"Service Path"

  1. Create a Service + Deployment. From a Pod, curl <service>.<ns>.svc.cluster.local. Trace the DNS lookup (CoreDNS) and the iptables/IPVS rules that DNAT.
  2. Switch kube-proxy to IPVS mode (mode: ipvs in kube-proxy config). Verify with ipvsadm -L -n.
  3. Install Cilium with kubeProxyReplacement=true. Observe kube-proxy not running. Verify Service connectivity still works.
  4. Compare per-packet latency under each mode with a small benchmark.

7.4 Hardening Drill

  • Enable topology-aware routing to keep traffic in zone. Apply NetworkPolicies (next month) that allow only intended traffic.

7.5 Operations Slice

  • Wire kubeproxy_sync_proxy_rules_duration_seconds. With many Services and iptables mode, this can take seconds-a known scale ceiling.

Week 8 - CSI, Storage, and Device Plugins

8.1 Conceptual Core

  • CSI (Container Storage Interface) is the standard plugin interface for storage. Every cloud and many on-prem systems ship a CSI driver. Kubernetes calls the driver via gRPC.
  • A CSI driver runs in two modes (or both):
  • Controller plugin-provision, delete, attach, detach, snapshot. Cluster-wide.
  • Node plugin-stage and publish (mount) the volume on the kubelet node.
  • PVC → PV → CSI flow: user creates a PVC; the external-provisioner sidecar sees it, calls the CSI controller's CreateVolume, which creates a PV bound to the PVC. Kubelet then asks the CSI node plugin to mount.

8.2 Mechanical Detail

  • StorageClass parameters: provisioner (CSI driver name), parameters (driver-specific), reclaimPolicy (Delete vs Retain), volumeBindingMode (Immediate vs WaitForFirstConsumer), allowVolumeExpansion.
  • WaitForFirstConsumer is critical for zone-aware provisioning-wait until the Pod is scheduled to know which zone to provision in.
  • Snapshots: VolumeSnapshot API; the external-snapshotter sidecar drives the CSI driver's snapshot calls.
  • Device plugins: a separate gRPC API (pluginapi.proto) for exposing custom resources (GPUs, FPGAs, RDMA NICs) to Pods. NVIDIA's k8s-device-plugin is the canonical example.

8.3 Lab-"Storage Hands-On"

  1. Install a local-path CSI driver (rancher/local-path-provisioner works for kind). Create a PVC; observe binding.
  2. Take a snapshot; restore to a new PVC.
  3. Author a mock device plugin that exposes 4 instances of a fake resource. Deploy a Pod requesting it; verify scheduling and resource accounting.
  4. Read the CSI proto. Diagram the provision + attach + mount flow on paper.

8.4 Hardening Drill

  • Use volumeBindingMode: WaitForFirstConsumer for all multi-zone clusters. Without it, you'll provision a volume in zone A and try to schedule its Pod in zone B.

8.5 Operations Slice

  • Monitor csi_* metrics emitted by sidecars. Alert on provision/attach errors and slow Mount operations.

Month 2 Capstone Deliverable

A node-and-cri/ workspace:
  1. kubelet-tour/ - week 5's annotated journal-log walkthrough.
  2. cri-direct/ - week 6's crictl-based pod-launch demo.
  3. dataplane-bench/ - week 7's iptables vs IPVS vs Cilium-eBPF comparison.
  4. mock-device-plugin/ - week 8's working device plugin.

Month 3-Controllers and Operators: client-go, controller-runtime, CRDs

Goal: by the end of week 12 you can (a) build a controller from scratch with client-go and informers, (b) build a more sophisticated controller with controller-runtime (including webhooks, finalizers, status conditions), (c) define and version CRDs idiomatically, and (d) ship a non-trivial operator that manages an external system.


Weeks

Week 9 - client-go Internals and a Bare Controller

9.1 Conceptual Core

  • client-go is the Kubernetes Go client library-typed clients, informers, work queues, leader election, the lot.
  • Building a controller "from scratch" in client-go is verbose but instructive-every other framework hides these primitives.
  • The pattern (the informer + workqueue pattern):
  • Create a SharedInformerFactory for the resources you watch.
  • For each kind, register OnAdd/OnUpdate/OnDelete handlers that compute a key (namespace/name) and Add it to a RateLimitingQueue.
  • Start N workers that pull keys from the queue and run reconcile(key).
  • reconcile: list-from-cache (never call apiserver in the hot path), compute diff, apply changes, requeue on error with backoff.
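The deduplication that makes this pattern cheap can be sketched without client-go; the real workqueue adds rate limiting and also tracks keys that are currently being processed, but the core idea fits in a map plus a slice:

```go
package main

import "fmt"

// A toy deduplicating work queue: adding a key that is already queued
// is a no-op, which is why a burst of events for one object costs a
// single reconcile.
type queue struct {
	items []string
	set   map[string]bool
}

func newQueue() *queue { return &queue{set: map[string]bool{}} }

func (q *queue) add(key string) {
	if q.set[key] {
		return // already pending: deduplicate
	}
	q.set[key] = true
	q.items = append(q.items, key)
}

func (q *queue) get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	delete(q.set, key)
	return key, true
}

func main() {
	q := newQueue()
	// three events for one object, one event for another
	q.add("default/cm-a")
	q.add("default/cm-a")
	q.add("default/cm-b")
	q.add("default/cm-a")
	for key, ok := q.get(); ok; key, ok = q.get() {
		fmt.Println("reconcile", key) // cm-a reconciled once despite three adds
	}
}
```

This is also why handlers enqueue only a namespace/name key rather than the object itself: the reconcile re-reads the latest state from the cache, so collapsed events lose nothing.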

9.2 Mechanical Detail

  • The informer's resync period: re-deliver every cached object on a configurable interval (set when the informer factory is constructed). Belt-and-suspenders against missed events.
  • Indexers (cache.Indexer): O(1) lookup by namespace, by label, by custom key. Free with the informer.
  • Listers (<group>/<version>/<resource>/lister.go in generated client code): typed accessors over the indexer.
  • Leader election (tools/leaderelection): only one replica of the controller acts; others stand by. Uses a Lease resource as the lock.
  • Generated clients: for built-in types, client-go ships them. For your own CRDs, generate with controller-gen or kubebuilder (week 10).

9.3 Lab-"Controller From Scratch"

Build a controller that watches ConfigMaps with the label mirror=true and copies them into every namespace whose name matches a configurable prefix.
  • Use client-go informers + workqueue directly.
  • Add leader election.
  • Idempotent: the same input twice produces the same result.
  • Handle deletions: when the source is deleted, delete all mirrors.
  • Run as a Deployment in the cluster.

9.4 Hardening Drill

  • Define a minimum RBAC: only get/list/watch on configmaps and namespaces, plus create/update/delete on configmaps (constrained by namespace prefix? Use admission webhooks or namespace selectors).

9.5 Operations Slice

  • Expose controller_runtime_*-style metrics: queue depth, work duration, error rate. Add /healthz and /readyz endpoints and wire them to liveness/readiness probes.

Week 10 - controller-runtime and Kubebuilder

10.1 Conceptual Core

  • controller-runtime is the modern, opinionated framework for controllers. Built atop client-go, it provides:
  • Manager (informer factory + leader election + metrics + healthz wired together).
  • Reconciler (typed reconcile method).
  • Client (cached read, direct write).
  • Webhook scaffolding (mutating + validating + conversion).
  • Finalizers helpers.
  • Kubebuilder is a CLI on top of controller-runtime that scaffolds projects from CRD definitions. The de facto starting point for new operators.

10.2 Mechanical Detail

  • Project structure (kubebuilder init && kubebuilder create api):
    api/v1/         # CRD types (Go structs annotated for codegen)
    config/         # YAML scaffolds (CRDs, RBAC, kustomize bases)
    internal/controller/
                    # Reconciler implementations
    cmd/main.go     # Manager bootstrap
    
  • The Reconcile method is the hot path; it should be idempotent and make no assumption about why it was called. Re-derive everything each call.
  • controllerutil.CreateOrUpdate-the reliable upsert helper.
  • Owner references-when a controller creates a child object, it sets the parent as the owner. Garbage collection handles cascading deletion.
  • Finalizers-string keys on metadata.finalizers. Block deletion until the controller removes the finalizer (after performing cleanup). The pattern for cleaning up external resources before the K8s object disappears.
  • Status subresource-separates spec writes from status writes; allows least-privilege RBAC.

10.3 Lab-"Rebuild Week 9 in controller-runtime"

Take week 9's mirror controller; rebuild with kubebuilder + controller-runtime. Compare LOC and verbosity. The framework should save substantial code.

10.4 Hardening Drill

  • Use controller-runtime's metric and health endpoints. Configure leader election with a non-default lease duration appropriate to your environment.

10.5 Operations Slice

  • Wire controller_runtime_reconcile_* metrics. Establish dashboards: reconcile rate, error rate, average reconcile duration per controller.

Week 11 - CRDs: Schema, Versioning, Validation

11.1 Conceptual Core

  • A CRD (CustomResourceDefinition) registers a new resource kind with the apiserver. Once registered, you can kubectl get/apply it like any built-in.
  • The CRD includes an OpenAPI v3 schema that the apiserver uses for validation. Get this right or you'll ship buggy custom resources.
  • Multiple versions can coexist; conversion webhooks translate between them. The pattern that allows v1alpha1 → v1beta1 → v1 evolution.

11.2 Mechanical Detail

  • Marker comments (+kubebuilder:...) on Go types generate the CRD YAML via controller-gen. Examples:
  • +kubebuilder:validation:Required
  • +kubebuilder:validation:MinLength=3
  • +kubebuilder:validation:Enum=foo;bar;baz
  • +kubebuilder:subresource:status
  • +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=.status.phase
  • Status conditions: array of {type, status, reason, message, lastTransitionTime}. The standard pattern for surfacing controller state. Use the Kubernetes types directly (metav1.Condition).
  • Server-side apply: with SSA, multiple controllers can own different fields of the same object via fieldManager. Replaces hand-rolled patches for many use cases.
  • Conversion webhooks: invoked when apiserver needs to translate between stored and requested versions. Implement carefully-round-trip stability is essential.
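For orientation, this is roughly what controller-gen emits for the markers above (group, kind, and field names are hypothetical; the schema is truncated):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # hypothetical group/plural
spec:
  group: example.com
  names: {kind: Widget, plural: widgets, singular: widget}
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true
    storage: true
    subresources:
      status: {}                     # from +kubebuilder:subresource:status
    additionalPrinterColumns:        # from +kubebuilder:printcolumn
    - name: Phase
      type: string
      jsonPath: .status.phase
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: ["name"]       # from +kubebuilder:validation:Required
            properties:
              name:
                type: string
                minLength: 3         # from +kubebuilder:validation:MinLength=3
              mode:
                type: string
                enum: [foo, bar, baz]  # from +kubebuilder:validation:Enum
```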

11.3 Lab-"A Well-Versioned CRD"

  1. Define a CRD with v1alpha1.
  2. Add validation, defaults, status conditions, printer columns.
  3. Add a v1beta1 with renamed fields and a conversion webhook between them.
  4. Verify round-trip: fetch the object at both versions (`kubectl get <plural>.v1alpha1.<group>`, then `<plural>.v1beta1.<group>`) and confirm the converted content is semantically identical.

11.4 Hardening Drill

  • Validation only at the schema level is not enough. Add admission webhooks for cross-field validation (e.g., "if mode=X then field Y is required").

11.5 Operations Slice

  • Track apiserver_storage_objects per CRD. CRDs that grow unbounded are a frequent platform failure mode.

Week 12 - Operator Patterns: Finalizers, External Resources, Multi-Cluster

12.1 Conceptual Core

  • The "operator" pattern: a controller that encapsulates operational knowledge for a specific application. Examples: Postgres operator (provisions DBs, handles backups, failover), Cert-Manager (ACME-driven cert lifecycle), Prometheus operator (manages Prometheus + Alertmanager + ServiceMonitor stack).
  • An operator is a controller plus one or more CRDs representing the application's domain concepts.
  • Production operators handle: leader election, finalizers, status conditions, observability, RBAC, upgrades, multi-tenant isolation, external-system reconciliation, retries with backoff.

12.2 Mechanical Detail

  • External resources (cloud APIs, SaaS): the controller's reconcile loop calls outward. Idempotency is essential-assume your reconcile may run multiple times before the external API confirms.
  • Crossplane (week 19) generalizes this: every external resource is itself a Kubernetes object backed by a controller that talks to the cloud. You compose them.
  • Cluster-scoped vs namespace-scoped operators: namespace-scoped is safer (lower blast radius) but limits multi-tenant operator deployment.
  • Operator SDK vs Kubebuilder: largely converged today; pick whichever your team prefers. The patterns are identical.

12.3 Lab-"An Operator That Manages an External Resource"

Build an operator with a GitHubRepo CRD: spec includes a repo name and visibility; the controller calls the GitHub API to create/update/delete the repo to match. Includes: - Authentication via a Secret referenced by the CR. - Finalizers for cleanup. - Status conditions: Ready, Synced, Error with reasons. - Rate-limited reconciles with exponential backoff. - E2E test using a fake GitHub API server.

12.4 Hardening Drill

  • Define an OPA/Kyverno policy: every GitHubRepo must reference a Secret in the same namespace; cross-namespace references denied. Tests for the policy.
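A sketch of the Kyverno form of this policy, assuming the lab's GitHubRepo CRD carries a `spec.secretRef` with an optional `namespace` field (all names hypothetical):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: githubrepo-secret-same-namespace
spec:
  validationFailureAction: Enforce
  rules:
  - name: deny-cross-namespace-secret
    match:
      any:
      - resources:
          kinds: ["GitHubRepo"]      # the lab's custom kind
    validate:
      message: "secretRef must stay in the GitHubRepo's own namespace"
      deny:
        conditions:
          all:
          # deny only when a namespace is set AND it differs from the CR's own
          - key: "{{ request.object.spec.secretRef.namespace || '' }}"
            operator: NotEquals
            value: ""
          - key: "{{ request.object.spec.secretRef.namespace || '' }}"
            operator: NotEquals
            value: "{{ request.object.metadata.namespace }}"
```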

12.5 Operations Slice

  • Add tracing to the reconcile path; export traces via OTel. The operator's hop into GitHub appears as an external span-useful for diagnosing outages.

Month 3 Capstone Deliverable

A controllers-and-operators/ workspace: 1. mirror-controller-clientgo/ (week 9). 2. mirror-controller-cr/ (week 10). 3. versioned-crd/ (week 11). 4. github-repo-operator/ (week 12).

Month 4-Networking and Storage at Scale

Goal: by the end of week 16 you can (a) explain the CNI spec and trace a Pod-to-Pod packet through a working CNI, (b) install and operate Cilium with eBPF-based dataplane, kube-proxy replacement, and L7 visibility, (c) reason about service-mesh tradeoffs (Istio vs Linkerd vs Cilium Service Mesh), and (d) operate dynamic CSI provisioning at scale with backups and snapshots.


Weeks

Week 13 - The CNI Spec and Pod Networking

13.1 Conceptual Core

  • The CNI (Container Network Interface) spec is small (~30 pages). A CNI plugin is a binary that the kubelet (via the runtime) invokes when a Pod sandbox is created. Inputs: container ID, network namespace path, JSON config. Outputs: assigned IP, routes.
  • Kubelet does not know about networking beyond "ask the CNI." This is what makes the dataplane pluggable.
  • Modern CNIs ship as DaemonSets that program kernel rules (iptables, OVS, eBPF) and run a small "agent" plus a thin "delegator" CNI binary.

13.2 Mechanical Detail

  • Read containernetworking/cni/SPEC.md. Operations: ADD, DEL, CHECK, VERSION.
  • The CNI binary must be at /opt/cni/bin/<name>; config at /etc/cni/net.d/*.conf (or *.conflist for plugin chains).
  • The kubelet → CRI → CNI flow:
  • Kubelet asks runtime to create a sandbox.
  • Runtime creates a netns; calls CNI ADD.
  • CNI assigns IP, sets up the netns.
  • Subsequent containers in the Pod do not get their own netns; the runtime joins them to the sandbox's existing netns (via setns) rather than unsharing a new one.
  • CNI chains: multiple plugins composed, each running in order (e.g., a primary CNI + a metering plugin + a port-mapping plugin).

13.3 Lab-"Read a CNI's Source"

  1. Pick a simple CNI (flannel or the reference bridge plugin from containernetworking/plugins). Read its cmdAdd end to end.
  2. Deploy a small kind cluster; trace a Pod creation in the kubelet log; correlate with the CNI binary invocation.
  3. Use nsenter -t <pause-pid> -n ip a to inspect the container's network namespace from the host.

13.4 Hardening Drill

  • Default-deny NetworkPolicy per namespace. Allow only intended Pod-to-Pod and Pod-to-Service traffic.
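A default-deny baseline might look like the following (namespace name hypothetical); note that without the second policy, Pods can no longer resolve DNS, which is the most common surprise when rolling this out:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a            # apply one per namespace
spec:
  podSelector: {}              # empty selector = every Pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-a
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - {protocol: UDP, port: 53}
    - {protocol: TCP, port: 53}
```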

13.5 Operations Slice

  • Monitor CNI errors in kubelet logs. A node with consistent CNI ADD failures will have stuck pending Pods-alert on this.

Week 14 - Cilium and eBPF Networking

14.1 Conceptual Core

Cilium is the dominant eBPF-based CNI. It replaces iptables-based packet processing with eBPF programs attached at three layers:

  • Socket layer (bpf_sock_ops) - connection-level decisions before packets exist.
  • Cgroup egress - per-pod outbound policy enforcement.
  • NIC-level XDP - ingress filtering at line rate, before the kernel network stack.

The shift from iptables matters at scale: an iptables-based kube-proxy walks a linear chain of rules per packet - O(services). eBPF programs do hash-table lookups: O(1) per packet, regardless of service count.

Beyond replacing the CNI, Cilium provides: - Kube-proxy replacement (eBPF-based service load balancing - no iptables churn on every endpoint change). - L7 NetworkPolicy (HTTP, gRPC, Kafka filtering at the dataplane, not in a sidecar). - ClusterMesh (multi-cluster service discovery and cross-cluster policy). - Hubble (eBPF-based flow observability - every pod-to-pod connection visible without sampling). - Service Mesh (sidecar-less mTLS via eBPF + SPIFFE).

This is the bridge to Linux Month 3 - eBPF in production. See also: eBPF in the observability cross-topic page.

14.2 Mechanical Detail

  • Dataplane as eBPF graph: Cilium's eBPF programs live under bpf/ in cilium/cilium. The agent compiles them at startup with the cluster's specific configuration baked in (BTF-driven CO-RE for portability across kernels).
  • Identity-based policy: pods are assigned a numeric identity derived from their labels (app=foo,env=prod → identity 1234). eBPF programs match on these identities, not on IPs. This is what allows policy to scale to thousands of pods without per-pod iptables rules - identities are stable across pod restarts and IP changes.
  • Service load balancing: instead of iptables DNAT chains, Cilium uses an eBPF map indexed by (service IP, port) returning a backend. Connection state lives in a separate eBPF map; updates are atomic, no kernel reload, no race during endpoint churn.
  • Encryption: WireGuard (recommended; in-kernel since 5.6) or IPsec tunnels between nodes. Per-NetworkPolicy opt-in or cluster-wide.
  • Hubble captures every packet's metadata via eBPF - source/dest identity, verdict (allowed/denied), L7 protocol info - and exposes it via gRPC + a CLI + a UI. Per-packet overhead is single-digit-percent CPU.

The trap

Switching kubeProxyReplacement from `false` to `true` on a live cluster without draining nodes. The iptables rules from the old kube-proxy don't get cleaned up automatically, and they interact badly with Cilium's eBPF NAT. Always: drain node → reconfigure → uncordon. The Cilium installer's kubeProxyReplacement: strict mode aborts if it finds residual rules.

14.3 Lab - "Install and Drive Cilium"

  1. Install via Helm with: kubeProxyReplacement=true, hubble.enabled=true, hubble.relay.enabled=true, hubble.ui.enabled=true, encryption.enabled=true, encryption.type=wireguard.
  2. Use the Hubble UI (cilium hubble ui) to visualize pod-to-pod traffic in real time.
  3. Author L4 NetworkPolicy (standard k8s API); test enforcement with a denied + allowed flow.
  4. Author an L7 CiliumNetworkPolicy (e.g., allow only HTTP GET /api/* from frontend → backend); test enforcement.
  5. Enable Cilium Service Mesh; observe sidecar-free mTLS between two test services.
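Step 4's L7 policy could be sketched like this (labels and port are the lab's assumptions; Cilium treats `path` as a regex):

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend-api
spec:
  endpointSelector:
    matchLabels: {app: backend}     # policy applies to backend Pods
  ingress:
  - fromEndpoints:
    - matchLabels: {app: frontend}  # identity-based match, not IP-based
    toPorts:
    - ports:
      - {port: "8080", protocol: TCP}
      rules:
        http:
        - method: "GET"
          path: "/api/.*"           # regex; everything else gets an L7 deny
```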

14.4 Hardening Drill

Enable transparent encryption (WireGuard) between nodes. Combined with default-deny NetworkPolicy (start: deny everything, allow explicitly), this gives defense-in-depth: even if a node is compromised, the attacker sees only encrypted traffic for flows they haven't been explicitly authorized to observe.

14.5 Operations Slice

Monitor cilium_* Prometheus metrics. Alert on: - policy-drop rate spikes - legitimate workloads being denied (usually a NetworkPolicy author mistake, or a new service that didn't get its allow rule). - identity-table pressure - Cilium has a max identity count per cluster; approaching it means too many distinct label combinations, often from a bad operator emitting unique labels per request. - endpoint regeneration time - if it climbs past 5-10s, your label churn is overwhelming the agent.

Week 15 - Service Meshes: Istio, Linkerd, Cilium Service Mesh

15.1 Conceptual Core

  • A service mesh adds: mTLS between Services, retries/timeouts/circuit-breaking, traffic shifting (canary, blue/green), observability (RED metrics + traces), policy enforcement.
  • Two architectural patterns:
  • Sidecar (Istio classic, Linkerd)-Envoy/linkerd-proxy runs in every Pod. ~50 MB memory per Pod, ~1 ms latency overhead.
  • Sidecar-less (Istio ambient, Cilium SM)-eBPF + per-node proxy. Much lower per-Pod overhead.
  • Decision matrix:
  • Mature, full-featured, complex → Istio.
  • Minimalist, Rust-based, fast to install → Linkerd.
  • Already running Cilium, want sidecar-less → Cilium Service Mesh.

15.2 Mechanical Detail

  • Envoy (under Istio + others) is the dataplane proxy. xDS APIs (LDS, RDS, CDS, EDS) push config from the control plane.
  • mTLS rotation: the mesh control plane issues short-lived certs (typically 24h) signed by an internal CA (or SPIFFE-compatible).
  • Traffic management: Istio VirtualService + DestinationRule for routing rules. K8s Gateway API is the standard-track replacement, supported by all major meshes.
  • Observability: every mesh emits RED metrics (Rate, Errors, Duration) per-service. With OTel, traces propagate through the mesh.

15.3 Lab-"Three Meshes"

  1. Install Istio in ambient mode on a test cluster. Apply a VirtualService that does 90/10 canary routing. Verify with Hubble or Kiali.
  2. Repeat with Linkerd. Compare install footprint, configuration ergonomics, and observability quality.
  3. (If running Cilium) enable Cilium Service Mesh. Compare again.
  4. Document tradeoffs: install effort, per-Pod overhead, feature gaps.

15.4 Hardening Drill

  • Enable mTLS in STRICT mode. Define AuthorizationPolicy resources denying cross-namespace traffic by default; allow only intended pairs.
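In Istio terms this could look like the sketch below (namespaces and service-account names hypothetical); note that once any ALLOW-action AuthorizationPolicy selects a workload, all traffic it doesn't match is denied:

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system      # placed in the root namespace = mesh-wide
spec:
  mtls:
    mode: STRICT               # reject any non-mTLS traffic
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-frontend-only
  namespace: backend           # hypothetical namespace
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity of the only permitted caller
        principals: ["cluster.local/ns/frontend/sa/frontend"]
```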

15.5 Operations Slice

  • Wire the mesh's RED metrics into your service-level dashboards. Define SLOs per service: latency p99, error rate, mTLS handshake success rate.

Week 16 - CSI at Scale: Snapshots, Backup, Cloning

16.1 Conceptual Core

  • Production storage in K8s requires:
  • Dynamic provisioning (week 8).
  • Volume Snapshots (point-in-time captures).
  • Backups (off-cluster, often app-consistent via operator hooks).
  • Cloning (PVC from snapshot, or PVC-from-PVC).
  • Resizing (online expansion).
  • Velero is the de-facto cluster backup tool: backs up resource manifests + PV snapshots to object storage; restores selectively.

16.2 Mechanical Detail

  • VolumeSnapshotClass → VolumeSnapshot → VolumeSnapshotContent. Mirrors the StorageClass/PVC/PV trio.
  • The external-snapshotter sidecar runs alongside the CSI controller, watching VolumeSnapshot objects.
  • Volume populators (since 1.24+)-populate a new PVC from arbitrary sources (snapshots, other PVCs, S3, etc.). Modular framework.
  • Velero: install, configure storage location (S3-compatible bucket), schedule backups via Schedule resource. Plugins for cloud providers and for "BackupStorageLocation" abstraction.
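The snapshot-then-clone flow reduces to two objects (class and StorageClass names hypothetical; your CSI driver must support snapshots):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: pg-data     # the existing PVC to capture
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data-clone
spec:
  storageClassName: csi-sc                 # hypothetical StorageClass
  dataSource:                              # provision from the snapshot
    name: pg-data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```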

16.3 Lab-"Backup and Restore"

  1. Install Velero against a MinIO bucket.
  2. Schedule a daily backup of one namespace.
  3. Delete the namespace; restore from backup; verify Pods come back, PVs reattach, data intact.
  4. Create a stateful workload (Postgres via an operator); test snapshot + clone flow for fast dev/test environment provisioning.

16.4 Hardening Drill

  • Test restore into a different cluster. This is the actual disaster-recovery scenario, and the most commonly broken backup story.

16.5 Operations Slice

  • Wire Velero metrics: backup success rate, backup duration, restore-test outcomes (run a synthetic restore weekly to validate).

Month 4 Capstone Deliverable

A networking-and-storage/ workspace: 1. cni-source-walkthrough.md (week 13). 2. cilium-policies/ - L4 + L7 + identity-based examples (week 14). 3. mesh-comparison/ - three meshes, RED dashboards (week 15). 4. velero-DR/ - backup, restore, and cross-cluster-restore demos (week 16).

Month 5-Platform Engineering and Day-2 Operations

Goal: by the end of week 20 you can (a) operate a GitOps workflow with ArgoCD or Flux at scale, (b) provision cloud infrastructure declaratively via Crossplane (or Terraform from K8s), (c) configure HPA/VPA against custom Prometheus metrics, and (d) author and enforce policies with OPA Gatekeeper or Kyverno.


Weeks

Week 17 - GitOps: ArgoCD and Flux

17.1 Conceptual Core

  • GitOps = the cluster's desired state is the contents of a git repo. A controller in the cluster watches the repo and reconciles drift.
  • The two dominant tools:
  • ArgoCD-UI-rich, opinionated about app structure (Application CRD), wide adoption.
  • Flux-CLI/CRD-first, more composable (Kustomization, HelmRelease, GitRepository, OCIRepository), favored by CNCF-style purists.
  • Both implement the same control loop: pull manifests from git → render (Kustomize/Helm) → apply → reconcile drift.

17.2 Mechanical Detail

  • ArgoCD Application: spec.source (git path or Helm chart), spec.destination (cluster + namespace), spec.syncPolicy (manual vs automatic, prune, self-heal).
  • ApplicationSet (Argo)-generate many Apps from templates; the foundation for multi-tenant fleet management.
  • Flux Kustomization + HelmRelease-separate CRs for source-of-truth, transform, and apply.
  • Sync waves / dependencies: both tools support ordering. Critical for "install CRDs before the resources that use them."
  • Drift detection: tools auto-revert manual changes by default. Sometimes that is not what you want during incident response-know how to disable temporarily.
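A representative Application tying these fields together (app name and repo URL hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook              # hypothetical app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy.git   # hypothetical repo
    targetRevision: main
    path: overlays/prod        # Kustomize overlay rendered by ArgoCD
  destination:
    server: https://kubernetes.default.svc
    namespace: guestbook
  syncPolicy:
    automated:
      prune: true              # delete resources removed from git
      selfHeal: true           # auto-revert manual drift
    syncOptions:
    - CreateNamespace=true
```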

17.3 Lab-"Two GitOps Stacks"

  1. Install ArgoCD. Set up an Application for a small app from a git repo. Verify auto-sync and auto-prune.
  2. Install Flux. Set up the equivalent. Compare ergonomics.
  3. Use ApplicationSet (Argo) to deploy the same app to three environment overlays (dev, staging, prod). Verify per-environment configuration via Kustomize overlays.

17.4 Hardening Drill

  • ArgoCD/Flux talk to a git repo with read access. Use SSH deploy keys or fine-scoped GitHub apps; never broad PATs. Encrypt secrets at rest with sealed-secrets or sops.

17.5 Operations Slice

  • Wire ArgoCD/Flux metrics: per-Application sync rate, drift rate, reconciliation duration. Alert on persistent OutOfSync or Failed states.

Week 18 - IaC From Within K8s: Crossplane and Terraform

18.1 Conceptual Core

  • Crossplane flips IaC inside out: cloud resources are Kubernetes resources (CRDs), reconciled by Crossplane's providers (provider-aws, provider-gcp, provider-azure, provider-helm, provider-kubernetes). You manage cloud infra with kubectl apply.
  • Compositions let you bundle low-level primitives into domain-specific abstractions: define a XPostgresInstance that, when applied, creates a VPC subnet, an RDS instance, IAM bindings, and a ServiceMonitor. Platform teams ship Compositions; app teams consume them.
  • Terraform alternative: run Terraform Cloud / Atlantis externally; treat the cluster as a deploy target only. Simpler in some shops; doesn't unify the control plane.

18.2 Mechanical Detail

  • Provider = a controller image that knows how to talk to one external system. Install via Provider CRD.
  • ProviderConfig = credentials + connection details for the provider.
  • Managed Resource (MR) = the K8s representation of a cloud resource (Bucket, Database, IAMRole).
  • Composition = a YAML transform: "given this XPostgresInstance claim, produce these MRs with these field mappings."
  • Composite Resource Definition (XRD) = the schema for the abstract type; the platform-team-facing equivalent of CRD.

18.3 Lab-"Self-Service Database"

  1. Install Crossplane. Install provider-aws (or provider-gcp).
  2. Configure provider credentials.
  3. Define an XRD XDatabase with parameters: size, engine, version, region.
  4. Define a Composition that materializes an RDS instance + a Secret with credentials.
  5. As an "app team" persona, create a Database claim. Watch it become a real RDS instance. Delete; watch it be torn down.

18.4 Hardening Drill

  • Restrict Composition selectors and compositionRef so app teams cannot select unintended Compositions. Use OPA/Gatekeeper to enforce naming, region, size limits.

18.5 Operations Slice

  • Compositions are platform contracts. Version them. Provide migration paths. Treat as you would a public API: SLAs, deprecation windows, changelogs.

Week 19 - HPA, VPA, KEDA: Autoscaling

19.1 Conceptual Core

  • HPA (Horizontal Pod Autoscaler): scales replica count based on metrics. CPU/memory by default; with metrics.k8s.io + custom.metrics.k8s.io adapters (e.g., prometheus-adapter), any metric is fair game.
  • VPA (Vertical Pod Autoscaler): adjusts a Pod's CPU/memory requests based on observed usage. Update modes: Auto/Recreate (evict and recreate the Pod with new requests), Initial (apply only at Pod creation), Off (recommend only).
  • KEDA (Kubernetes Event-Driven Autoscaling): scale to zero, scale on event-source backlog (Kafka lag, SQS depth, custom). Sits in front of HPA.

19.2 Mechanical Detail

  • HPA reconcile interval: 15s by default. Picking metrics that are too jittery causes flapping; smooth at the source.
  • HPA scaling policies: scaleUp.policies and scaleDown.policies with stabilization windows. Tune to workload's elasticity profile.
  • Custom metrics adapter (prometheus-adapter): translates Prometheus queries into the custom.metrics.k8s.io API the HPA reads. Define rules in adapter config.
  • VPA's recommender computes percentile-based recommendations from historical usage. Often used in Off mode just to suggest resource changes; production safety prefers manual approval.
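The week-19 lab's HPA, targeting a prometheus-adapter-served metric, could be sketched as follows (Deployment and metric names hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                       # hypothetical Deployment
  minReplicas: 2                    # non-zero floor for tier-1 services
  maxReplicas: 20                   # cap to bound metric-anomaly runaway
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # served via custom.metrics.k8s.io
      target:
        type: AverageValue
        averageValue: "200"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping on jittery metrics
```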

19.3 Lab-"Autoscale on Custom Metrics"

  1. Deploy a load-test target with a Prometheus-exposed requests_per_second metric.
  2. Install prometheus-adapter mapping that metric to custom.metrics.k8s.io.
  3. Author HPA targeting AverageValue=200 of that metric. Drive load; watch scaling.
  4. Add KEDA in front for scale-to-zero behavior. Verify cold-start latency.

19.4 Hardening Drill

  • Set minReplicas to a non-zero value for any tier-1 service (avoid cold-start during incident traffic). Cap maxReplicas to avoid runaway autoscaling on metric anomalies.

19.5 Operations Slice

  • Wire HPA event metrics. Alert on persistent desiredReplicas == maxReplicas (you've hit the cap) and on flapping (scaleUp and scaleDown events alternating rapidly).

Week 20 - Admission Control: Webhooks, OPA Gatekeeper, Kyverno

20.1 Conceptual Core

  • Admission control is the apiserver's last gate: every create/update is run through configured admission webhooks before persistence.
  • Two policy-engine choices in the modern ecosystem:
  • OPA Gatekeeper-Rego-language policies; the standard for "policy as code."
  • Kyverno-YAML-native policies; lower learning curve, strong template/mutation/generate support.
  • Pod Security Admission (replacement for the deprecated PodSecurityPolicy)-built into the apiserver. Three profiles: privileged, baseline, restricted. Apply per-namespace.
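Pod Security Admission is configured purely through namespace labels, e.g. (namespace name hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted  # reject violating Pods
    pod-security.kubernetes.io/warn: restricted     # warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted    # annotate audit events
```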

20.2 Mechanical Detail

  • Validating webhooks: receive AdmissionReview, return Allowed=true/false with reasons. Cannot mutate.
  • Mutating webhooks: also return JSON Patch / strategic merge for changes. Applied before validating.
  • Failure policy (Fail vs Ignore): if the webhook is unreachable, fail closed (safer) or open (operationally simpler). Trade off carefully.
  • Gatekeeper's ConstraintTemplate (Rego) + Constraint (instance) model. Audit mode reports without enforcing-start there in any new policy rollout.
  • Kyverno's ClusterPolicy / Policy CRDs cover validate, mutate, generate, verifyImages.

20.3 Lab-"Three Policy Layers"

  1. Apply Pod Security Admission per-namespace: restricted everywhere except a priv namespace.
  2. Author 5 Gatekeeper Constraints: require resource limits, forbid latest tags, enforce non-root, label-required, namespace-must-have-team-label.
  3. Author equivalents in Kyverno. Compare expressiveness.
  4. Run in audit-mode for a week against a pre-existing cluster; triage findings before enforcing.

20.4 Hardening Drill

  • Mandate signed images via Kyverno's verifyImages with cosign keys. Combined with Sigstore policy from the Container curriculum, this closes the supply-chain gate at the cluster.

20.5 Operations Slice

  • Track admission-webhook latency. Slow webhooks slow every apply. Pod-creation latency p99 is your warning signal.

Month 5 Capstone Deliverable

A platform-and-day2/ workspace: 1. gitops-stack/ (week 17)-ArgoCD + ApplicationSet + multi-env overlays. 2. crossplane-platform/ (week 18)-XDatabase composition + claim demo. 3. hpa-custom-metrics/ (week 19)-Prom-adapter + HPA + KEDA scale-to-zero demo. 4. policy-suite/ (week 20)-Gatekeeper + Kyverno + PSA examples.

Month 6-Kubernetes The Hard Way + Capstone

Goal: by the end of week 24 you have built (or substantially built) a multi-node Kubernetes cluster from raw VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy isolation, and a documented operational runbook.


Weeks

Week 21 - Bootstrap: VMs, Certificates, etcd

21.1 Conceptual Core

  • "Kubernetes the Hard Way" is Kelsey Hightower's exercise: bring up a Kubernetes cluster step by step, from raw VMs, generating certs by hand, configuring every flag explicitly. The point is not operational efficiency; it is deep understanding of every moving part.
  • This curriculum's hard-way variant: bring up 3 control-plane nodes + 3 worker nodes on cloud VMs (or bare metal). Use modern toolchain (containerd, Cilium, latest stable Kubernetes).

21.2 Mechanical Detail

  • VM provisioning: 6 VMs, ~2 vCPU 4 GB each. Cloud (AWS/GCP/Hetzner) or bare metal.
  • PKI: a CA + intermediate CAs for etcd, kube-apiserver, kubelet, front-proxy. Use cfssl or easy-rsa. Every component identifies itself with x509.
  • etcd cluster: 3 nodes, mTLS between peers and clients, snapshots scheduled.
  • Loopback bootstrap considerations: kubelet needs a kubeconfig before the apiserver is up. Either use static-pod manifests for control-plane components (the kubeadm approach) or run the control plane outside the cluster on the VMs themselves.

21.3 Lab-"Bring Up etcd"

  1. Provision 3 VMs labeled etcd-{1,2,3}.
  2. Generate CA + per-node certs.
  3. Install etcd binaries; configure systemd units with mTLS.
  4. Bring up; verify etcdctl member list shows healthy quorum.
  5. Take a snapshot. Restore on a separate test machine.

21.4 Hardening Drill

  • Kubernetes secret encryption-at-rest (the apiserver's --encryption-provider-config, next week) is separate from encrypting etcd's own data on disk. Encrypt the etcd data directory (LUKS or cloud disk encryption) from day one.

21.5 Operations Slice

  • etcd backup automation: etcdctl snapshot save cron'd to S3 every 6 hours. Verify restore weekly.

Week 22 - Control Plane and Worker Nodes

22.1 Conceptual Core

  • The control plane: kube-apiserver, kube-scheduler, kube-controller-manager. Run all three as systemd-managed binaries on each control-plane node, behind a load balancer (HAProxy or cloud LB) for HA.
  • The worker plane: containerd + kubelet + kube-proxy (or Cilium replacement). Joins the cluster via a kubelet kubeconfig signed by the cluster CA.

22.2 Mechanical Detail

  • kube-apiserver flags:
    • `--etcd-servers=https://etcd-{1,2,3}:2379` with mTLS.
    • `--encryption-provider-config=...` for secret encryption-at-rest.
    • `--audit-policy-file=...` and `--audit-log-path=...`.
    • `--authorization-mode=Node,RBAC`.
    • `--enable-admission-plugins=NodeRestriction,PodSecurity,ResourceQuota,...`.
    • `--service-account-issuer` and `--service-account-signing-key-file` for ServiceAccount tokens (projected, OIDC-compatible).
  • kubelet bootstrap: TLS bootstrap using a bootstrap token; the kubelet auto-rotates its client certificate (--rotate-certificates). Serving-certificate CSRs are not auto-approved upstream-use an approver such as kubelet-csr-approver.
  • CNI: install Cilium first (DaemonSet); only after Cilium is healthy do worker-node Pods become ready.
  • DNS: install CoreDNS as a Deployment; the kubelet's cluster-DNS arg points at its Service IP.
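The file behind `--encryption-provider-config` is small; a sketch (aescbc shown for simplicity - aesgcm or a KMS provider are the stronger production choices):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  # first provider is used for writes; later ones only for reads
  - aescbc:
      keys:
      - name: key1
        # generate with: head -c 32 /dev/urandom | base64
        secret: <base64-encoded 32-byte key>
  - identity: {}   # fallback so pre-existing plaintext data stays readable
```

After enabling, rewrite existing Secrets so they get encrypted: `kubectl get secrets -A -o json | kubectl replace -f -`.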

22.3 Lab-"Cluster Live"

  1. Bring up 3 control-plane nodes; HAProxy in front.
  2. Bring up 3 workers; join via bootstrap tokens.
  3. Install Cilium; verify Pod-to-Pod connectivity.
  4. Install CoreDNS; verify Service DNS works.
  5. Smoke test: deploy a sample app + Service + Ingress; verify end-to-end.

22.4 Hardening Drill

  • Apply CIS Kubernetes Benchmark v1.8 (or current). Use kube-bench to score. Address all FAILs; document WARNs.

22.5 Operations Slice

  • Wire control-plane components to Prometheus. Define SLOs: apiserver request p99 < 1s, etcd-leader-changes per hour < 1, scheduler queue depth < 100.

Week 23 - RBAC, Multi-Tenancy, mTLS Everywhere

23.1 Conceptual Core

  • Multi-tenancy is the hardest sustained problem in Kubernetes. The kernel and Kubernetes give you soft isolation by default; converting that to hard isolation requires layered controls.
  • The required layers: namespace-per-tenant + RBAC + NetworkPolicy + ResourceQuota + LimitRange + PodSecurity + node-pool isolation + (optionally) sandboxed runtime.
  • mTLS everywhere: control-plane (already from week 22), service mesh between Services (week 15), workload identity for Pods talking to cloud APIs (e.g., AWS IRSA, GCP Workload Identity).

23.2 Mechanical Detail

  • Tenant onboarding as code (Crossplane Composition or Helm chart):
  • Namespace.
  • ResourceQuota + LimitRange.
  • Default-deny NetworkPolicy + an allow-namespace-internal exception.
  • PodSecurity admission label (restricted).
  • RoleBindings for the tenant's group.
  • ServiceAccount with workload identity binding for cloud access.
  • GitOps Application(Set) entries to deploy the tenant's app catalog.
  • Hard isolation tiers:
  • Tier 1: namespace + RBAC. Default. Suitable for trusted internal teams.
  • Tier 2: + sandboxed runtime (gVisor) for tenant-owned untrusted workloads.
  • Tier 3: + dedicated node pool with taints. Suitable for compliance-bound workloads.
  • Tier 4: separate cluster (vCluster, Cluster API). Strongest isolation; highest cost.
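The quota and limit pieces of the onboarding bundle above reduce to two small manifests (tenant namespace and numbers hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: acme              # tenant namespace from the lab
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: acme
spec:
  limits:
  - type: Container
    # injected when a Pod spec omits requests/limits, so the quota still applies
    defaultRequest: {cpu: 100m, memory: 128Mi}
    default:        {cpu: 500m, memory: 512Mi}
```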

23.3 Lab-"Onboard a Tenant"

  1. Author a tenant Composition (Crossplane) or Helm chart that, given {tenant: "acme"}, materializes everything in §23.2.
  2. Onboard acme. Have a "tenant developer" persona deploy an app via GitOps.
  3. Verify isolation: from acme's namespace, can you read another tenant's secrets? Pods? Logs? Each should fail.

23.4 Hardening Drill

  • Run kubescape or polaris against the cluster. Address findings until score is >90%.

23.5 Operations Slice

  • Per-tenant cost attribution: label every resource with tenant=; export kube-state-metrics with that label to Prometheus; cost-allocate via OpenCost.

Week 24 - Defense, Documentation, and the Capstone Demo

24.1 Conceptual Core

The final week is integration and defense. Bring the capstone (whichever track) to production-defensible quality.

24.2 Final Hardening Checklist

  • CIS benchmark green (kube-bench).
  • All control-plane components mTLS, with cert auto-rotation tested.
  • Encryption-at-rest enabled for secrets in etcd.
  • Audit logging enabled; logs shipped off-cluster.
  • Default-deny NetworkPolicy in every namespace.
  • PodSecurity restricted everywhere except documented exceptions.
  • Image admission requires signed images (Sigstore policy).
  • Velero backups + tested cross-cluster restore.
  • Chaos: drain a node, kill a master, partition the network-cluster recovers.
  • Observability: Prometheus + Grafana + Loki + Tempo (or equivalent) integrated.
  • Cost attribution per tenant.
  • Runbooks: node-not-ready, etcd-degraded, apiserver-OOM, namespace-stuck-terminating, pod-pending-forever.

24.3 Lab-"Defend the Cluster"

Schedule a 60-minute mock review. Demo: 1. The architecture diagram. 2. Provisioning (Ansible/Terraform/Crossplane). 3. Tenant onboarding from request to running app. 4. Failure injection: kill a control-plane node; show cluster recovery. 5. Observability: trace a request from ingress through service mesh to backend, with metrics, logs, and trace ID correlation. 6. Backup + restore.

24.4 Operations Slice

  • Tag the cluster manifest repo v1.0.0. Sign with cosign. Publish a RUNBOOK.md that, in principle, lets a successor team rebuild the cluster from scratch.

Month 6 Deliverable

The capstone artifact (per CAPSTONE_PROJECTS.md), plus the aggregated kubernetes-mastery/ repo containing every prior month's deliverable.

Appendix A-Kubernetes Hardening Reference

Cumulative hardening checklist. By week 24 the reader's cluster-baseline/ template should encode every section.


A.1 Control Plane

  • etcd: 3 or 5 nodes, mTLS, encryption-at-rest, snapshot+restore tested.
  • kube-apiserver: encryption providers, audit logging, NodeRestriction admission, PodSecurity admission, OIDC (or trustedSA) for users.
  • kube-scheduler: leader election; default + custom plugins reviewed.
  • kube-controller-manager: leader election; minimum SA permissions.
  • kubelet: read-only port disabled, TLS bootstrap with CSR approval, anonymous-auth false, authorization webhook.

A.2 RBAC

  • No bindings to the cluster-admin ClusterRole except for break-glass.
  • Per-tenant Roles, not ClusterRoles.
  • Audit system:authenticated and system:unauthenticated group bindings - beyond the API-server defaults, both should be empty.
  • Use kubectl auth can-i --as=... to verify least-privilege per persona.

A.3 Pod Security

  • PodSecurity admission restricted everywhere by default.
  • Exceptions documented in code (namespace labels) with justification.
  • Pod-level: runAsNonRoot, readOnlyRootFilesystem, drop all caps, seccomp RuntimeDefault.
  • Mutating webhook to inject defaults if Pod spec omits them.
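The pod-level defaults above can be expressed as a manifest sketch (names and image are illustrative placeholders, not a prescribed template):

```yaml
# Illustrative Pod encoding the A.3 defaults
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example          # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/app@sha256:...   # pinned by digest (A.5)
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```

A mutating webhook (or Kyverno mutate rule) that injects exactly this block when it is omitted keeps the baseline uniform without relying on developer discipline.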

A.4 Network

  • CNI with NetworkPolicy support (Cilium, Calico).
  • Default-deny ingress + egress in every namespace.
  • Allowed flows declared per workload as labeled NetworkPolicy.
  • L7 policy on ingress (Cilium L7 NetworkPolicy or service mesh).
  • mTLS between Services (mesh).
  • Egress controls: explicit allowed CIDRs / FQDNs.
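The default-deny bullet corresponds to one small object applied in every namespace (a minimal sketch; the name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
  # no ingress/egress rules listed => nothing is allowed
```

Note that with egress default-deny in place, pods lose DNS; a companion policy allowing egress to kube-dns on 53/UDP+TCP is almost always the first "allowed flow" you declare.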

A.5 Image Supply Chain

  • Image admission (Kyverno / Cosign policy-controller) requires signature.
  • Allowlisted registries.
  • No latest tags; pin by digest in production.
  • SBOM and SLSA provenance attestations attached to every image.

A.6 Secrets

  • etcd encryption-at-rest with rotated keys.
  • External Secret Operator (ESO) for cloud-KMS-sourced secrets.
  • No secrets in env vars where possible (volume-mounted secrets update in place; env vars require a pod restart to pick up changes).
  • No secrets committed to git, even in sealed form, without sealed-secrets/sops ratchet.

A.7 Multi-Tenancy

  • One namespace per tenant; ResourceQuota + LimitRange.
  • Hierarchical Namespaces or Capsule for nested tenants.
  • PriorityClasses by tier; preemption tuned.
  • Per-tenant cost attribution via labels + OpenCost.
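The quota and limit bullets materialize as two objects per tenant namespace (a sketch; the acme namespace and all values are placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: acme
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: acme
spec:
  limits:
  - type: Container
    default:              # applied when a container omits limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:       # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
```

The LimitRange matters beyond politeness: without per-container defaults, a pod with no requests makes the ResourceQuota reject it outright, which surfaces as a confusing admission error for tenants.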

A.8 Observability

  • Audit logs shipped off-cluster (read-only on cluster).
  • Container logs (Loki / cloud equivalent).
  • Metrics (Prometheus + kube-state-metrics + node-exporter).
  • Traces (OTel Collector + Tempo / Jaeger / cloud).
  • Continuous profiling (Parca / Pyroscope) optional but recommended.
  • SLO tracking per service (Pyrra / Sloth).

A.9 Backup + DR

  • Velero scheduled backups to off-cluster storage.
  • Cross-region or cross-cluster restore tested at least quarterly.
  • etcd snapshot tested for catastrophic-recovery scenario.
  • DR runbook with RTO + RPO documented.
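The scheduled-backup bullet is typically one Velero Schedule object (a sketch against the velero.io/v1 API; the cron expression, TTL, and name are placeholders):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 02:00 daily, standard cron syntax
  template:
    includedNamespaces: ["*"]
    snapshotVolumes: true
    ttl: 720h                    # retain ~30 days
```

The schedule alone satisfies nothing on this checklist until the restore is exercised; the quarterly cross-cluster restore test is the part that actually de-risks DR.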

A.10 The cluster-baseline/ Template

cluster-baseline/
  bootstrap/
    pki/                    # CA + per-component certs (cfssl)
    etcd/                   # systemd unit + config
    kube-apiserver/
    kube-scheduler/
    kube-controller-manager/
    kubelet/
  cni/cilium-values.yaml
  service-mesh/             # istio or linkerd values
  observability/
    prometheus/
    grafana/
    loki/
    tempo/
    parca/
  policy/
    pod-security/
    networkpolicy-default-deny.yaml
    gatekeeper-constraints/
    kyverno-policies/
    sigstore-policy.yaml
  tenancy/
    namespace-template/      # Crossplane composition
    rbac-template/
    quotas-template/
  velero/
    schedule.yaml
    locations.yaml
  runbooks/
    node-not-ready.md
    etcd-degraded.md
    apiserver-oom.md
    pod-pending-forever.md
    cluster-rebuild.md
  RUNBOOK.md
  THREAT_MODEL.md

This is the artifact every cluster you bring up after week 24 should be provisioned from.

Appendix B-Troubleshooting Reference Flows

Reference flows for the failure modes you will see in production.


B.1 Pod Pending Forever

kubectl describe pod <pod>

Common causes (in observed-frequency order): 1. No node satisfies scheduling constraints. Events: shows FailedScheduling. Read the reason: insufficient CPU/memory, no matching nodeSelector, taints unmatched, no PV available, topology spread blocked. 2. PVC stuck pending. kubectl get pvc <pvc> - if Pending, check StorageClass, provisioner pods, cloud-side quota. 3. Image pull failure. Events: shows ErrImagePull/ImagePullBackOff. Check registry auth, that the image tag exists, and network egress to the registry. 4. Admission webhook rejected. Often hidden in apiserver logs; kubectl get events -A may surface it. 5. Quota exceeded. ResourceQuota denied creation.

Drilldown: kubectl get events -A --sort-by=.lastTimestamp | tail -30.


B.2 Pod CrashLoopBackOff

kubectl logs <pod> --previous
kubectl describe pod <pod>

Common causes: 1. App-level crash. Read the previous container's logs. 2. Liveness probe failing. The probe is killing the container. Check probe path/port; loosen initialDelaySeconds. 3. OOMKilled. kubectl describe shows Reason: OOMKilled. Increase memory limit or fix leak. 4. ConfigMap / Secret missing. Pod is mounting it; if missing, kubelet fails the start. Watch for events. 5. Init container failure. Pod won't progress; check init container logs first.


B.3 Node NotReady

kubectl describe node <node>
ssh <node> sudo journalctl -u kubelet -f

Common causes: 1. kubelet down. systemd unit failure; check journal. 2. CNI agent down. The node has no functional networking; Cilium/Calico DaemonSet pod has crashed. 3. Disk pressure. Events: shows EvictionThresholdMet. Free space (delete old container images, journal logs). 4. PID pressure. Too many processes. 5. Out-of-resources kernel-side. Check dmesg on the node.


B.4 etcd Degraded

etcdctl endpoint status --cluster
etcdctl endpoint health --cluster

Common causes: 1. Disk full or slow. fsync latency spikes; everything else feels slow. Check etcd_disk_wal_fsync_duration_seconds. 2. Leader election thrashing. Network instability between etcd nodes; check inter-node latency. 3. Database size growth. Forgot to compact. etcdctl compact <rev>; etcdctl defrag. 4. Quorum lost. Majority of nodes down. Restore from snapshot to a new cluster; recover.


B.5 Apiserver 5xx / Timeouts

kubectl get --raw=/livez
kubectl get --raw=/readyz?verbose

Common causes: 1. etcd issues (above). 2. Webhook timeouts. Slow admission webhooks block every apply. Check webhook latency; consider failurePolicy: Ignore with caution. 3. Aggregated API down (e.g., metrics-server). kubectl top fails; downstream features (HPA) degrade. 4. Apiserver overload. Too many list/watch consumers; CPU pegged. Add replicas; review priority-and-fairness flow control.


B.6 Service Has No Endpoints

kubectl get endpoints <service>
kubectl get endpointslices -l kubernetes.io/service-name=<service>

Common causes: 1. Selector mismatch. Service spec.selector doesn't match Pod labels. Most common. 2. Pods not ready. ReadinessProbe failing; only ready Pods join Endpoints. 3. Port mismatch. Service targetPort doesn't match the container port, or a named port is out of sync. 4. Topology-aware routing dropping endpoints. Check service.kubernetes.io/topology-aware-hints.


B.7 Namespace Stuck Terminating

kubectl get namespace <ns> -o json | jq .spec.finalizers

Cause: A finalizer can't be removed because its owning controller is gone (or stuck).

Fix path (carefully - you are bypassing a safety mechanism):

kubectl get namespace <ns> -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/<ns>/finalize" -f -

But also: investigate why the finalizer wouldn't clear. Often a dangling external resource the operator was waiting on.


B.8 ImagePullBackOff in a Private Registry

  1. kubectl get secret -o yaml - does the pull secret exist, and is it well-formed?
  2. Pod's spec.imagePullSecrets references it?
  3. Secret type is kubernetes.io/dockerconfigjson?
  4. Decoded .dockerconfigjson has the right registry URL and credentials?
  5. From the node, can you crictl pull the image manually with the same creds?

B.9 HPA Not Scaling

  1. kubectl describe hpa <hpa> - the events show why.
  2. Metrics available? kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" for resource metrics; kubectl get --raw "/apis/custom.metrics.k8s.io/..." for custom.
  3. Pod requests set? HPA uses requests as the denominator; without them, percentage-based metrics are meaningless.
  4. behavior policies preventing fast scaling? Check scaleUp.policies and stabilizationWindowSeconds.
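If step 4 is the culprit, the relevant stanza lives on the autoscaling/v2 object and looks like this (a sketch; names and values are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300   # smooths scale-up over a 5-min window
      policies:
      - type: Pods
        value: 2                        # at most 2 new pods...
        periodSeconds: 60               # ...per 60-second period
```

A behavior block like this is a common reason an HPA that "should" scale appears stuck: the policies are doing exactly what they were told.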

B.10 Mesh: 503 from Sidecar

(Istio specifics, but general patterns apply) 1. Service has Endpoints? 2. mTLS mode strict, but caller without sidecar? Check PeerAuthentication. 3. AuthorizationPolicy denying the call? 4. DestinationRule with circuit-breaker tripped? kubectl describe destinationrule. 5. Envoy access log: istioctl proxy-config log <pod> --level debug and re-issue.

Appendix C-Contributing to Kubernetes

Kubernetes is among the largest open-source projects in the world by contributor count. The flip side: the bureaucracy is real. This appendix is the on-ramp.


C.1 Mental Model

The Kubernetes project is governed by SIGs (Special Interest Groups)-domain-scoped groups (sig-node, sig-network, sig-storage, sig-api-machinery, sig-scheduling, sig-cli, sig-auth, etc.). Each SIG has chairs, technical leads, regular meetings, and a Slack channel. Almost every PR maps to a SIG.

Major changes go through KEPs (Kubernetes Enhancement Proposals)-design docs in the kubernetes/enhancements repo. KEPs progress through provisional → implementable → implemented → deprecated over multiple releases.

Implications for newcomers: 1. Find the right SIG before opening a PR. 2. For non-trivial changes, write or piggyback on a KEP first. 3. The cycle time is slow. Two-week review is normal; six-week is common.


C.2 Setting Up

git clone https://github.com/kubernetes/kubernetes
cd kubernetes
make all -j$(nproc)

The build is heavy (it's all of Kubernetes); plan for ~10 minutes the first time.

For tests:

make test                       # unit
make test-integration KUBE_TEST_ARGS="-run <name>"
make test-e2e KUBE_TEST_ARGS=...

For local dev cluster: kind (uses your local Docker / containerd, spins up a multi-node cluster in containers in seconds).


C.3 Where the Easy Wins Are

C.3.1 Documentation

  • kubernetes/website (the website repo). Docs improvements are welcome and reviewed quickly.

C.3.2 e2e flakes

  • Issues labeled kind/flake on kubernetes/kubernetes. Fixing flakes is unglamorous but high-impact.

C.3.3 kubectl

  • staging/src/k8s.io/kubectl - small, contained. Bug fixes and small features are tractable.

C.3.4 client-go improvements

  • staging/src/k8s.io/client-go - used by every controller in the world. Improvements compound.

C.3.5 SIG-specific work

  • Pick a SIG matching your interest. Their backlogs have good first issue labels.

C.3.6 Don't start here (yet)

  • Scheduler core (high stakes; small SIG; deep changes need KEPs).
  • API machinery (apiserver internals, conversion, validation).
  • kubelet (touches every node; PR latency is high for safety).
  • Anything touching scaling / performance critical paths.

C.4 The First-PR Workflow

  1. Find an issue. Read CONTRIBUTING.md and SIG-specific contribution guides. Comment /assign to claim.
  2. Branch from master.
  3. Implement. Run make verify, make test, the relevant make test-integration subset.
  4. Commit with Signed-off-by (DCO).
  5. Open the PR. SIG bots will auto-assign reviewers. Use the PR template; fill in every field.
  6. CI cycle. Tests run on Prow. Re-run with /test all. Address comments.
  7. Approval flow: a reviewer adds /lgtm; an approver adds /approve. Both required. The bot then merges.
  8. Backport (if applicable): for bugfixes, the PR may need cherry-picks to release branches. Use the cherry-pick robot or do manually.

C.5 The KEP Process

For changes that: - Add or modify the API. - Have user-visible behavior changes. - Affect multiple SIGs.

Process: 1. Open an issue in the relevant SIG. 2. Get at least informal agreement that the problem is real. 3. Write a KEP using the template in kubernetes/enhancements/keps/NNNN-template/. 4. Submit as a PR. The KEP itself goes through review. 5. Once implementable, you can submit code PRs referencing the KEP number. 6. KEPs target a Kubernetes release (alpha → beta → GA over multiple releases).

Time scale: months to a year for a substantial KEP.


C.6 Adjacent Targets if k/k Is Too Heavy

The CNCF ecosystem has many high-impact projects with smaller surface area:

Project Bar
kubectl plugins (krew) Low. Author your own; submit to krew index.
kind Low–Medium. Friendly maintainers.
kustomize Medium.
Helm Medium. Larger team.
ArgoCD / Flux Medium. Active.
Cilium Medium–High.
Crossplane Medium. Welcoming to providers.
Operator SDK / Kubebuilder Medium.
OpenTelemetry (collector + Operator) Medium.

A merged contribution to any of these is a credible Kubernetes-ecosystem signal in interviews.


C.7 Calibration

A reasonable goal for a curriculum graduate:

  • By end of week 23: a PR open against kubernetes/website, a kubectl plugin, or a small fix to an ecosystem project.
  • By end of capstone: that PR merged.
  • 6 months post-curriculum: a substantive contribution-a kubectl feature, a new operator, a Cilium policy plugin.

Patient contributors become trusted contributors. Trusted contributors become reviewers. Reviewers become approvers. Approvers become SIG chairs. The path exists; it just takes time.

Capstone Projects-Three Tracks, One Choice

Pick one. The work performed here is what you describe in interviews.


Track 1-Hard Way: A Production-Grade Cluster From Scratch

Outcome: a multi-node Kubernetes cluster brought up on bare metal or cloud VMs, with mTLS-everywhere, fine-grained RBAC, multi-tenancy, GitOps-managed workloads, and a documented operational runbook.

Functional spec

  • 3 control-plane nodes + 3 workers (minimum). HAProxy or cloud LB in front of the apiservers.
  • etcd with mTLS, encryption-at-rest, scheduled snapshots to off-cluster storage.
  • CNI: Cilium with kube-proxy replacement, Hubble enabled.
  • Service mesh: Istio (ambient) or Linkerd, mTLS strict between services.
  • Storage: a real CSI driver (local-path for dev; OpenEBS / Longhorn / cloud CSI for "real").
  • Observability: Prometheus + Grafana + Loki + Tempo + OTel Collector.
  • GitOps: ArgoCD or Flux managing platform addons.
  • Policy: Pod Security restricted, NetworkPolicy default-deny, Kyverno or Gatekeeper enforcing org rules.
  • Backup: Velero scheduled, restore tested.

Non-functional spec

  • CIS benchmark green (kube-bench ≥90% pass).
  • Cluster rebuild from scratch in <2 hours via Ansible/Terraform.
  • Zero-downtime kubelet upgrades (drain + replace pattern).
  • A demo: kill a control-plane node; cluster recovers without intervention.

Acceptance

  • Public repo with provisioning playbooks and runbooks.
  • A 30-minute screencast walking the assessor through bring-up, an incident drill, and a tenant onboarding.
  • A RUNBOOK.md covering: cluster provisioning, node addition/removal, etcd backup/restore, certificate rotation, upgrade procedure, top 5 incident types and remediation.

Skills exercised

  • All months-but Months 1, 2, 6 most heavily.

Track 2-Platform: GitOps Multi-Tenant PaaS

Outcome: a self-service developer platform built on Kubernetes that demonstrates onboarding a new team in <30 minutes, with policy guardrails, infra-from-code, and full observability.

Functional spec

  • Tenant model: each tenant gets a Namespace, ResourceQuota, LimitRange, RBAC bindings, default NetworkPolicy, monitoring scrape config, GitOps Application (Argo) entry-all materialized from a single tenant claim (Crossplane Composition or Helm chart).
  • Self-service: developers commit a manifest.yaml to their repo; ArgoCD/Flux picks it up; their app deploys.
  • Policy: Kyverno or OPA Gatekeeper enforces: image signatures, no latest tags, mandatory labels, resource limits, no privileged Pods.
  • Observability: each tenant's metrics/logs/traces are isolated (via labels and Loki/Prom multi-tenancy); a per-tenant Grafana folder with default dashboards.
  • Cost attribution: OpenCost emits per-tenant cost; surface in a dashboard.
  • Crossplane: a Database claim that materializes a real cloud database (or, for demo, a chart-deployed Postgres).
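The policy bullet above can be sketched as a Kyverno rule (kyverno.io/v1; policy and rule names are illustrative, and the exact field set varies by Kyverno version):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-pinned-tag
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "Images must not use the ':latest' tag."
      pattern:
        spec:
          containers:
          - image: "!*:latest"
```

The other guardrails (signatures, mandatory labels, resource limits, no privileged Pods) follow the same shape: one small, testable rule per invariant rather than one monolithic policy.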

Non-functional spec

  • Tenant onboarding: from "claim PR opened" to "deployed app reachable" in <30 minutes (target: <5).
  • Failure isolation: a tenant exceeding quota does not affect other tenants.
  • Compliance: every running Pod can be traced back to a git commit + signature verification.

Acceptance

  • Public repo with platform manifests + tenant-onboarding template.
  • Live demo: onboard a fresh tenant; deploy a sample app; demonstrate observability + policy denial; demonstrate quota enforcement.
  • A PLATFORM.md describing the contract between platform team and tenants: versioning, deprecation, support, escalation.

Skills exercised

  • Months 3 (operators / Crossplane), 5 (GitOps + IaC + autoscaling + admission), 6 (multi-tenancy).

Track 3-Operator: Production-Quality Operator From Scratch

Outcome: a non-trivial operator that manages a stateful application or external system, complete with backup/restore, upgrades, observability, and a thoughtful API.

Suggested scopes

  1. elasticsearch-mini-operator: manage Elasticsearch clusters with auto-scaling, snapshot lifecycle, index lifecycle policies.
  2. postgres-mini-operator: with automatic failover (using the Postgres replication primitives), backup/restore via WAL-G to S3, point-in-time recovery.
  3. saas-resource-operator: manage external SaaS resources via Crossplane composition (e.g., a GitHubRepo operator complete with branch protection, secret scanning, codeowners).

Acceptance

  • Public repo, written with controller-runtime + Kubebuilder.
  • CRDs versioned (v1alpha1 + v1beta1 + conversion webhook).
  • Status conditions, finalizers, owner references-all idiomatic.
  • Comprehensive RBAC (least-privilege, generated from kubebuilder markers).
  • Mutating + validating admission webhooks.
  • E2E tests (Ginkgo + envtest, plus a kind-based suite).
  • Helm chart or kustomize manifests for installation.
  • Observability: Prometheus metrics, structured logs (logr), OTel traces.
  • Helm-test-style upgrade tests across three operator versions.
  • Documentation: design rationale, API reference, examples.

Skills exercised

  • Months 3 (operators), 4 (storage if stateful), 5 (admission), 6 (defense).

Cross-Track Requirements

  • cluster-baseline/ template (Appendix A) integrated.
  • ADRs (≥3).
  • THREAT_MODEL.md.
  • RUNBOOK.md.
  • Defense readiness: 60-minute walkthrough.

The track choice signals career direction: Track 1 for SRE/cluster-operator roles, Track 2 for platform-engineering roles, Track 3 for software-engineering-on-Kubernetes roles. Pick based on where you want the next interview loop.

Worked example - Week 14: a NetworkPolicy → what eBPF actually does

Companion to Kubernetes → Month 04 → Week 14: Cilium and eBPF. The week explains the Cilium model: CNI plugin, identities, the L3/L4/L7 policy layers, and the eBPF datapath. This page takes one Kubernetes NetworkPolicy and traces it through Cilium all the way to the eBPF program enforcing it on a packet.

You need a kind/k3s/minikube cluster with Cilium installed (cilium install from the Cilium CLI; or Helm with --set kubeProxyReplacement=true).

The policy

# api-deny-from-frontend.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-deny-from-frontend
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: orders
    ports:
    - port: 8080
      protocol: TCP

What this says, in English: "Pods in namespace shop with label app=api will accept TCP/8080 traffic only from pods labeled app=orders in the same namespace. Everything else gets dropped."

Without a NetworkPolicy controller, Kubernetes ignores this object entirely. With Cilium installed, the policy becomes a real packet-level rule. Walk through how.

Step 1 - Pod IPs and identities

Apply the policy and deploy three sample pods:

$ kubectl apply -f api-deny-from-frontend.yaml
$ kubectl run -n shop api      --image=nginx --labels=app=api    --port 8080
$ kubectl run -n shop orders   --image=alpine --labels=app=orders -- sh -c "while true; do sleep 60; done"
$ kubectl run -n shop frontend --image=alpine --labels=app=frontend -- sh -c "while true; do sleep 60; done"

Now look at what Cilium did:

$ kubectl exec -n kube-system ds/cilium -- cilium endpoint list -o json | jq '.[] | {id, name: .status.identity.labels, ip: .status.networking.addressing[0].ipv4}'
{ "id": 412, "name": ["k8s:app=api","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.42" }
{ "id": 1207, "name": ["k8s:app=orders","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.43" }
{ "id": 1208, "name": ["k8s:app=frontend","k8s:io.kubernetes.pod.namespace=shop"], "ip": "10.244.0.44" }

Cilium assigned each pod an endpoint ID and a security identity derived from the pod's labels. The identity is a number, not the label set itself. All pods with the same label set share an identity, which is the unit Cilium reasons about.

The key trick: traditional iptables-based CNIs do rule matching by IP, which means rules scale O(n²) with pod count. Cilium does it by identity, which scales O(unique_label_sets²) - vastly smaller in practice.
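A toy calculation makes the difference concrete (the pod and label-set counts are hypothetical, chosen only to show the scaling):

```python
# Toy model: per-pod-IP rules vs. per-identity rules.
# Assumes 10,000 pods spread over 40 distinct label sets (hypothetical numbers).
pods = 10_000
label_sets = 40   # e.g., 40 distinct (app, tier, namespace) combinations

# iptables-style worst case: one rule per (src pod, dst pod) pair
ip_rules = pods * pods

# identity-style: one policy entry per (src identity, dst identity) pair
identity_rules = label_sets * label_sets

print(ip_rules)                      # 100000000
print(identity_rules)                # 1600
print(ip_rules // identity_rules)    # 62500
```

Same cluster, same policy intent, a factor of tens of thousands fewer entries - and, crucially, adding a pod with an existing label set adds zero new entries.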

Step 2 - The policy in Cilium's view

$ kubectl exec -n kube-system ds/cilium -- cilium policy get
[
  {
    "endpointSelector": {"matchLabels": {"k8s:app": "api", "k8s:io.kubernetes.pod.namespace": "shop"}},
    "ingress": [
      {
        "fromEndpoints": [
          {"matchLabels": {"k8s:app": "orders", "k8s:io.kubernetes.pod.namespace": "shop"}}
        ],
        "toPorts": [{"ports": [{"port": "8080", "protocol": "TCP"}]}]
      }
    ]
  }
]

Same content, Cilium's internal representation. The selectors will resolve to specific identity numbers when the policy is materialized into eBPF maps.

Step 3 - Test the policy works

$ kubectl exec -n shop orders -- wget -qO- --timeout=2 http://10.244.0.42:8080
<!DOCTYPE html>
<html>
<head><title>Welcome to nginx!</title>
...
$ kubectl exec -n shop frontend -- wget -qO- --timeout=2 http://10.244.0.42:8080
wget: download timed out

orders succeeds (allowed). frontend times out (silently dropped). Good.

But where is the drop happening?

Step 4 - Find the eBPF program

Cilium attaches eBPF programs at several kernel hook points: tc (traffic control) ingress/egress on every pod's veth, and on the host's external interface. List them:

$ kubectl exec -n kube-system ds/cilium -- bpftool prog show | grep cil_
1342: sched_cls  name cil_from_container  tag 4f...
1343: sched_cls  name cil_to_container    tag 8a...
1344: sched_cls  name cil_from_host       tag c2...
1345: sched_cls  name cil_to_host         tag d7...
1346: sched_cls  name cil_from_netdev     tag e3...

These are the BPF programs implementing the datapath. cil_from_container runs on every packet leaving a pod's veth; cil_to_container on every packet entering. The policy enforcement happens in cil_to_container.

Step 5 - The maps Cilium uses

eBPF programs are stateless; they read from kernel-managed maps (kv stores). Cilium maintains several:

$ kubectl exec -n kube-system ds/cilium -- bpftool map show | grep -E "cilium_"
221: hash  name cilium_policy   key 16B  value 48B  max_entries 16384
222: lru_hash name cilium_ct4   key 40B  value 64B  max_entries 524288
223: hash  name cilium_lxc      key 4B   value 64B  max_entries 65536
224: hash  name cilium_metrics  key 8B   value 16B  max_entries 65536
...
  • cilium_lxc - endpoint ID → pod info (IP, MAC, security identity).
  • cilium_policy - (endpoint_id, src_identity, port, protocol) → allow/deny. This is the lookup table the BPF program consults to decide whether a packet is allowed.
  • cilium_ct4 - connection tracking. Stores active flows for established-connection allowance.

Step 6 - The actual lookup

When a packet from frontend (identity 1208) reaches the host with destination api (10.244.0.42:8080, endpoint 412):

  1. cil_to_container BPF program triggers on the veth's ingress hook.
  2. Program reads packet headers - src IP 10.244.0.44, dst IP 10.244.0.42, dst port 8080.
  3. Program looks up dst endpoint via cilium_lxc[10.244.0.42] → endpoint 412.
  4. Program looks up src identity via cilium_ipcache[10.244.0.44] → identity 1208.
  5. Program builds policy key (endpoint=412, identity=1208, port=8080, proto=TCP) and queries cilium_policy.
  6. No matching entry → returns DROP.
  7. Program updates cilium_metrics (drop counter ++).
  8. tc framework drops the packet.

When orders (identity 1207) sends the same kind of packet, step 5 builds key (412, 1207, 8080, TCP), the policy map has this entry (from the NetworkPolicy → identity match), and the program returns PASS. The packet proceeds; the connection is tracked in cilium_ct4 so the return packet is also allowed via fast-path.

Step 7 - See the drop in real time

$ kubectl exec -n kube-system ds/cilium -- cilium monitor -t drop
xx drop (Policy denied) flow 0xab12 to endpoint 412, identity 1208->10044, file bpf_lxc.c line 1142, 86 bytes

This is the BPF program emitting a perf event when it drops a packet. The format includes the line in bpf_lxc.c where the decision was made, the source/destination identities, and the byte count. Cilium's Hubble (a separate component) consumes these events to provide a real-time UI.

Why this matters

The traditional kube-proxy + iptables path for this same policy would: - Maintain ~O(pods^2) iptables rules per port. - Linearly walk those rules on every packet. - Rewrite rules on every pod create/delete, which under churn can take seconds and lose packets.

Cilium's eBPF path: - Maintains a hash map keyed by (endpoint, identity, port, proto). - O(1) lookup on every packet. - Identity-based: adding a new orders pod doesn't change the policy map at all (same identity).

In a cluster with 10,000 pods, the difference is "stable 50µs latency vs unbounded tail." That's the whole pitch for Cilium.

The trap

A NetworkPolicy without a controller that supports it does nothing. Many K8s users apply policies on clusters where the CNI doesn't enforce them, and the cluster silently allows everything. Verify with kubectl exec between pods that shouldn't be able to reach each other, or use cilium connectivity test if you're on Cilium.

The other trap: Cilium identity granularity. Two pods with identical label sets share an identity. If you split traffic by namespace alone, every pod in the namespace has the same identity for policy purposes. Add labels (role, tier, app-version) to get finer-grained control.

Exercise

  1. Run the demo above. Confirm the drop is visible via cilium monitor.
  2. Add a third allowed source: pods labeled app=admin. Reapply the policy. Watch cilium policy get change and confirm admin pods now succeed.
  3. (Advanced) Use bpftool prog dump xlated id <prog-id> on one of the cil_* programs. Read the BPF assembly. Find the map lookup instructions.
  4. (Advanced) Read Documentation/bpf/ in the kernel tree for the BPF instruction set reference. Find BPF_LDX, BPF_JEQ. You'll see them in the disassembly.