Week 3 - The Scheduler
3.1 Conceptual Core
- The default scheduler is a controller that runs with leader election, so only one replica is active at a time. It watches unscheduled Pods and binds them to Nodes. The "binding" is just a write to the Pod's `spec.nodeName` field, made through the Pod's `binding` subresource.
- The scheduler's algorithm is filter then score: filter out Nodes that can't host the Pod (resources, affinities, taints), score the remaining ones, and pick the highest-scoring.
- The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking.
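The filter-then-score idea can be sketched in a few lines of Python. This is a toy model, not the scheduler's code: the `Node` and `Pod` classes, the string form of taints, and the "most CPU left over" scoring heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int      # unallocated CPU in millicores (hypothetical field)
    free_mem_mi: int     # unallocated memory in MiB (hypothetical field)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int           # requested CPU in millicores
    mem_mi: int          # requested memory in MiB
    tolerations: set = field(default_factory=set)


def passes_filters(pod: Pod, node: Node) -> bool:
    """Filter phase: every check must pass for the node to stay eligible."""
    fits = pod.cpu_m <= node.free_cpu_m and pod.mem_mi <= node.free_mem_mi
    tolerated = node.taints <= pod.tolerations  # each taint must be tolerated
    return fits and tolerated


def score(pod: Pod, node: Node) -> int:
    """Score phase: toy heuristic, prefer the node with the most CPU left over."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod: Pod, nodes: list):
    """Return the chosen node name, or None if no node survives filtering."""
    feasible = [n for n in nodes if passes_filters(pod, n)]
    if not feasible:
        return None  # in a real cluster the Pod would stay Pending
    return max(feasible, key=lambda n: score(pod, n)).name


nodes = [
    Node("node1", 4000, 8192, {"dedicated=gpu:NoSchedule"}),
    Node("node2", 1000, 2048),
    Node("node3", 2000, 4096),
]
print(schedule(Pod("web", 500, 512), nodes))
```

In this example `node1` has the most free capacity but is filtered out because the Pod does not tolerate its taint, so the choice falls to the best-scoring of the remaining two nodes.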
3.2 Mechanical Detail
- Scheduler framework extension points (read `pkg/scheduler/framework/types.go`):
  - PreFilter: pre-computes state and checks conditions that can short-circuit the cycle before any Node is filtered.
  - Filter: must return Success for the Node to be eligible.
  - PostFilter: invoked when no Node passes the filter phase (e.g., to trigger preemption).
  - PreScore, Score, NormalizeScore.
  - Reserve, Permit, PreBind, Bind, PostBind.
- Built-in plugins: `NodeResourcesFit`, `NodeAffinity`, `PodTopologySpread`, `InterPodAffinity`, `TaintToleration`, `NodePorts`, `VolumeBinding`, `ImageLocality`, `NodeResourcesBalancedAllocation`.
- Scheduling profiles: multiple "scheduler personalities" can run in one binary, each with a different plugin config. A Pod selects a profile through `spec.schedulerName`. Useful, for example, for batch workloads with different priorities.
- Preemption: when a high-priority Pod can't fit, the scheduler may preempt (delete) lower-priority Pods. `PriorityClass`, referenced from the Pod through `priorityClassName`, is the knob.
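To make the preemption bullet concrete, here is a minimal sketch of victim selection on a single Node. It assumes CPU is the only constrained resource and evicts the lowest-priority Pods first. The `RunningPod` class and `pick_victims` function are hypothetical names, and the real scheduler weighs more factors (for example PodDisruptionBudgets and which Node needs the fewest evictions).

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    priority: int   # resolved integer priority of the Pod
    cpu_m: int      # CPU request in millicores


def pick_victims(incoming_priority, incoming_cpu_m, free_cpu_m, running):
    """Return names of Pods to evict from one node, or None if preemption cannot help.

    Simplifying assumption: CPU is the only resource that matters.
    """
    # Only Pods with strictly lower priority may be preempted.
    candidates = sorted(
        (p for p in running if p.priority < incoming_priority),
        key=lambda p: p.priority,
    )
    victims = []
    for pod in candidates:
        if free_cpu_m >= incoming_cpu_m:
            break  # enough room already, stop evicting
        victims.append(pod.name)
        free_cpu_m += pod.cpu_m
    return victims if free_cpu_m >= incoming_cpu_m else None


running = [
    RunningPod("batch-a", 100, 1000),
    RunningPod("web", 1000, 1000),
    RunningPod("batch-b", 100, 500),
]
print(pick_victims(1000, 1200, 200, running))
```

Note that `web` is never a candidate here because its priority equals that of the incoming Pod; only strictly lower-priority Pods are considered.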
3.3 Lab: "Scheduler in Action"
- Use `kubectl describe` on a pending Pod and read its `FailedScheduling` events to see why Nodes were filtered out.
- Set a Node taint (`kubectl taint nodes node1 key=value:NoSchedule`); observe that new Pods avoid it.
- Define `PriorityClass`es (high, default, batch); deploy mixed-priority Pods; trigger preemption by oversaturating the cluster.
- Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Configure your scheduler binary; run it. Verify that Node selection differs from the default scheduler's.
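For the PriorityClass step, one option is to generate the three manifests as JSON, which `kubectl apply` accepts alongside YAML. The sketch below uses the class names from the lab; the numeric values and descriptions are assumptions, chosen only so that high > default > batch. At most one PriorityClass in a cluster may set `globalDefault: true`.

```python
import json


def priority_class(name, value, global_default=False, description=""):
    """Build a scheduling.k8s.io/v1 PriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": global_default,
        "description": description,
    }


def build_manifest_list():
    """Wrap the three lab classes in a v1 List so they apply in one call."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            # Values are illustrative assumptions, not recommendations.
            priority_class("high", 100000, description="Latency-sensitive services"),
            priority_class("default", 1000, global_default=True,
                           description="Used when a Pod names no class"),
            priority_class("batch", 100, description="Preemptible batch work"),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest_list(), indent=2))
```

If the script is saved under a name of your choosing (say `priority_classes.py`, a hypothetical filename), its output can be piped into `kubectl apply -f -`.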
3.4 Hardening Drill
- Set `priorityClassName` on system-critical Pods (CSI driver, ingress controller). Use `system-cluster-critical` and `system-node-critical` for cluster-internal Pods.
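One way to check the drill is to audit which Pods in critical namespaces still lack a `priorityClassName`. The sketch below assumes its input is the parsed JSON from `kubectl get pods -A -o json`. The function name and the default namespace tuple are hypothetical and should be adapted to your cluster.

```python
def pods_missing_priority_class(pod_list, namespaces=("kube-system",)):
    """Return 'namespace/name' for Pods in the given namespaces without a priorityClassName."""
    missing = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("namespace") not in namespaces:
            continue  # only audit the namespaces we consider critical
        if not pod.get("spec", {}).get("priorityClassName"):
            missing.append(f'{meta.get("namespace")}/{meta.get("name")}')
    return missing


# Hand-written sample shaped like `kubectl get pods -A -o json` output.
sample = {
    "items": [
        {"metadata": {"namespace": "kube-system", "name": "csi-driver-0"},
         "spec": {"priorityClassName": "system-cluster-critical"}},
        {"metadata": {"namespace": "kube-system", "name": "ingress-controller-0"},
         "spec": {}},
        {"metadata": {"namespace": "default", "name": "web-0"}, "spec": {}},
    ]
}
print(pods_missing_priority_class(sample))
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with the `kubectl` output piped in.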
3.5 Operations Slice
- Wire scheduler metrics: `scheduler_pending_pods`, `scheduler_pod_scheduling_duration_seconds`, `scheduler_pod_scheduling_attempts`. Alert on persistent pending Pods.
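The "persistent pending" condition can be prototyped before writing a real alerting rule. The sketch below parses `scheduler_pending_pods` samples from Prometheus text-format output and flags an alert only when the count stays at or above a threshold for several consecutive scrapes. The threshold, the window length, and both function names are assumptions; in production this logic would normally live in the monitoring system's own rule language.

```python
import re

# Matches lines such as: scheduler_pending_pods{queue="unschedulable"} 4
_PENDING = re.compile(r'^scheduler_pending_pods\{([^}]*)\}\s+([0-9.eE+-]+)\s*$')
_QUEUE = re.compile(r'queue="([^"]*)"')


def pending_by_queue(metrics_text):
    """Sum scheduler_pending_pods samples per value of the `queue` label."""
    counts = {}
    for line in metrics_text.splitlines():
        match = _PENDING.match(line.strip())
        if not match:
            continue  # comments and other metrics are ignored
        queue = _QUEUE.search(match.group(1))
        if queue:
            name = queue.group(1)
            counts[name] = counts.get(name, 0.0) + float(match.group(2))
    return counts


def should_alert(samples, threshold=1, min_consecutive=3):
    """samples: pending-Pod counts per scrape, oldest first.

    Alert only if the last `min_consecutive` scrapes are all at or above threshold.
    """
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(s >= threshold for s in recent)


scrape = """\
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
scheduler_pending_pods{queue="backoff"} 1
scheduler_pending_pods{queue="unschedulable"} 4
"""
print(pending_by_queue(scrape))
print(should_alert([0, 4, 4, 4]))
```

Requiring several consecutive scrapes avoids paging on the brief pending spikes that are normal during rollouts.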