Week 14 - Cilium and eBPF Networking

14.1 Conceptual Core

Cilium is the dominant eBPF-based CNI. It replaces iptables-based packet processing with eBPF programs attached at three layers:

  • Socket layer (bpf_sock_ops) - connection-level decisions before packets exist.
  • Cgroup egress - per-pod outbound policy enforcement.
  • NIC-level XDP - ingress filtering at line rate, before the kernel network stack.

The shift from iptables matters at scale: an iptables-based kube-proxy walks a linear chain of rules per packet - O(services). eBPF programs do hash-table lookups: O(1) per packet, regardless of service count.
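The complexity difference can be illustrated with a toy Python sketch (not Cilium code - just a model of the two lookup strategies): a linear rule walk, as iptables performs per packet, versus a single hash-table lookup, as an eBPF service map performs. All names here are invented for illustration.

```python
# Toy model: 10,000 "services", each identified by (virtual IP, port).
# iptables-style: walk a linear chain of rules until one matches - O(n).
# eBPF-style: one hash lookup in a map keyed by (IP, port) - O(1).

rules = [("10.0.0.%d" % (i % 250), 80 + i) for i in range(10_000)]
service_map = {(ip, port): "backend-%d" % i
               for i, (ip, port) in enumerate(rules)}

target = rules[-1]  # worst case for the linear walk: last rule in the chain

def linear_walk():
    # iptables-style: check every rule in order until a match.
    for i, rule in enumerate(rules):
        if rule == target:
            return "backend-%d" % i

def hash_lookup():
    # eBPF-map-style: one lookup, independent of service count.
    return service_map[target]

assert linear_walk() == hash_lookup()
```

Both strategies return the same backend; only the per-packet cost differs, and the gap widens linearly as services are added.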

Beyond replacing the CNI, Cilium provides:

  • Kube-proxy replacement - eBPF-based service load balancing, with no iptables churn on every endpoint change.
  • L7 NetworkPolicy - HTTP, gRPC, and Kafka filtering in the dataplane, not in a sidecar.
  • ClusterMesh - multi-cluster service discovery and cross-cluster policy.
  • Hubble - eBPF-based flow observability; every pod-to-pod connection visible without sampling.
  • Service Mesh - sidecar-less mTLS via eBPF + SPIFFE.

This is the bridge to Linux Month 3 - eBPF in production. See also: eBPF in the observability cross-topic page.

14.2 Mechanical Detail

  • Dataplane as eBPF graph: Cilium's eBPF programs live under bpf/ in cilium/cilium. The agent compiles them at startup with the cluster's specific configuration baked in (BTF-driven CO-RE for portability across kernels).
  • Identity-based policy: pods are assigned a numeric identity derived from their labels (app=foo,env=prod → identity 1234). eBPF programs match on these identities, not on IPs. This is what allows policy to scale to thousands of pods without per-pod iptables rules - identities are stable across pod restarts and IP changes.
  • Service load balancing: instead of iptables DNAT chains, Cilium uses an eBPF map indexed by (service IP, port) returning a backend. Connection state lives in a separate eBPF map; updates are atomic, no kernel reload, no race during endpoint churn.
  • Encryption: WireGuard (recommended; in-kernel since 5.6) or IPsec tunnels between nodes. Per-NetworkPolicy opt-in or cluster-wide.
  • Hubble captures every packet's metadata via eBPF - source/dest identity, verdict (allowed/denied), L7 protocol info - and exposes it via gRPC + a CLI + a UI. Per-packet overhead is single-digit-percent CPU.
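Identity-based L7 policy looks like this in practice. A sketch of a CiliumNetworkPolicy that matches on labels (which Cilium resolves to identities), not IPs - the label values `app=frontend` / `app=backend` and port 8080 are assumptions for illustration:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-get-api   # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      app: backend              # policy applies to pods with this label
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend       # matched by identity, not by pod IP
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:               # L7 enforcement in the dataplane
              - method: GET
                path: "/api/.*"
```

Because the selectors resolve to identities, the policy survives pod restarts and IP churn without any rule updates.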

The trap

Switching kubeProxyReplacement from false to true on a live cluster without draining nodes. The iptables rules from the old kube-proxy don't get cleaned up automatically, and they interact badly with Cilium's eBPF NAT. Always: drain node → reconfigure → uncordon. The Cilium installer's kubeProxyReplacement: strict mode aborts if it finds residual rules.

14.3 Lab - "Install and Drive Cilium"

  1. Install via Helm with: kubeProxyReplacement=true, hubble.enabled=true, hubble.relay.enabled=true, hubble.ui.enabled=true, encryption.enabled=true, encryption.type=wireguard.
  2. Use the Hubble UI (cilium hubble ui) to visualize pod-to-pod traffic in real time.
  3. Author L4 NetworkPolicy (standard k8s API); test enforcement with a denied + allowed flow.
  4. Author an L7 CiliumNetworkPolicy (e.g., allow only HTTP GET /api/* from frontend → backend); test enforcement.
  5. Enable Cilium Service Mesh; observe sidecar-free mTLS between two test services.
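Step 1 of the lab can be driven from a Helm values file. A sketch, assuming the standard cilium/cilium chart (verify key names against the chart version you install):

```yaml
# values.yaml - e.g. helm install cilium cilium/cilium -n kube-system -f values.yaml
kubeProxyReplacement: true   # eBPF service load balancing, no kube-proxy
hubble:
  enabled: true
  relay:
    enabled: true            # aggregates flows cluster-wide for the UI/CLI
  ui:
    enabled: true
encryption:
  enabled: true
  type: wireguard            # in-kernel node-to-node encryption
```

After install, `cilium status` should report kube-proxy replacement and encryption as active before you proceed to the Hubble and policy steps.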

14.4 Hardening Drill

Enable transparent encryption (WireGuard) between nodes. Combined with default-deny NetworkPolicy (start: deny everything, allow explicitly), this gives defense-in-depth: even if a node is compromised, the attacker sees only encrypted traffic for flows they haven't been explicitly authorized to observe.
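The default-deny starting point is a plain Kubernetes NetworkPolicy with an empty pod selector. A sketch - the namespace name is an assumption:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod        # hypothetical namespace; apply per namespace
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress             # deny both directions until explicitly allowed
```

From here, each allowed flow gets its own narrowly scoped allow policy, and Hubble's verdict stream shows exactly which flows the default-deny is dropping.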

14.5 Operations Slice

Monitor cilium_* Prometheus metrics. Alert on:

  • Policy-drop rate spikes - legitimate workloads being denied (usually a NetworkPolicy author mistake, or a new service that didn't get its allow rule).
  • Identity-table pressure - Cilium has a max identity count per cluster; approaching it means too many distinct label combinations, often from a bad operator emitting unique labels per request.
  • Endpoint regeneration time - if it climbs past 5-10s, your label churn is overwhelming the agent.
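The first of these can be expressed as a Prometheus alerting rule. A sketch - the threshold, duration, and the `reason` label value are assumptions to tune against your Cilium version's metric reference:

```yaml
groups:
  - name: cilium-dataplane        # hypothetical rule group
    rules:
      - alert: CiliumPolicyDropSpike
        # Rate of packets dropped by policy verdicts across the cluster.
        expr: sum(rate(cilium_drop_count_total{reason="Policy denied"}[5m])) > 10
        for: 10m                  # sustained, not a transient blip
        labels:
          severity: warning
        annotations:
          summary: "Policy-drop rate spike - check recent NetworkPolicy changes"
```

Pair the alert with Hubble (`hubble observe --verdict DROPPED`) to identify which source/destination identities are being denied.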