Week 3 - The Scheduler

3.1 Conceptual Core

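The default scheduler (kube-scheduler) is a controller that watches for Pods with no assigned Node and binds each one to a Node. Several replicas can run for availability, but leader election keeps only one of them active at a time. The "binding" amounts to setting the Pod's spec.nodeName field, which the scheduler does by creating a Binding object through the Pod's binding subresource.
The scheduler's algorithm is filter, then score: filter out Nodes that can't host the Pod (resources, affinities, taints), score the remaining ones, and pick the highest-scoring Node, breaking ties at random.
The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking the scheduler by compiling a scheduler binary that registers them alongside the built-in ones.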
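The filter-then-score loop can be sketched in a few lines of Go. This is a toy model under stated assumptions, not the scheduler's actual code: Node, Pod, FilterFunc, ScoreFunc, and schedule are hypothetical names invented for illustration, scores are summed without per-plugin weights, and ties go to the first Node seen rather than a random one.

```go
package main

import "fmt"

// Node and Pod are simplified stand-ins, not the Kubernetes API types.
type Node struct {
	Name         string
	FreeMilliCPU int64
	Tainted      bool
}

type Pod struct {
	Name            string
	RequestMilliCPU int64
	ToleratesTaint  bool
}

// FilterFunc reports whether a node can host the pod at all.
type FilterFunc func(p Pod, n Node) bool

// ScoreFunc ranks a feasible node; higher is better.
type ScoreFunc func(p Pod, n Node) int64

func fitsCPU(p Pod, n Node) bool { return n.FreeMilliCPU >= p.RequestMilliCPU }

func toleratesTaint(p Pod, n Node) bool { return !n.Tainted || p.ToleratesTaint }

// leastAllocated prefers the node with the most CPU left after placement.
func leastAllocated(p Pod, n Node) int64 { return n.FreeMilliCPU - p.RequestMilliCPU }

// schedule runs every filter, sums every score, and returns the best node.
// The second result is false when no node passes filtering.
func schedule(p Pod, nodes []Node, filters []FilterFunc, scorers []ScoreFunc) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		feasible := true
		for _, f := range filters {
			if !f(p, n) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		var total int64
		for _, s := range scorers {
			total += s(p, n)
		}
		if !found || total > bestScore {
			best, bestScore, found = n.Name, total, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeMilliCPU: 500},                // too small
		{Name: "node-b", FreeMilliCPU: 4000, Tainted: true}, // tainted, not tolerated
		{Name: "node-c", FreeMilliCPU: 2000},
	}
	pod := Pod{Name: "web", RequestMilliCPU: 1000}
	name, ok := schedule(pod, nodes,
		[]FilterFunc{fitsCPU, toleratesTaint},
		[]ScoreFunc{leastAllocated})
	fmt.Println(name, ok) // prints "node-c true"
}
```

Keeping filters and scorers as plain function slices mirrors the plugin idea: adding a new rule means appending a function, not editing the loop.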

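Scheduler framework extension points (read pkg/scheduler/framework/types.go and the plugin interfaces in interface.go alongside it; exact file locations vary by Kubernetes release):
PreFilter: pre-processes Pod information and checks conditions up front; a failure here short-circuits the cycle before any Node is filtered.
Filter: must return Success for the Node to be eligible.
PostFilter: invoked when no Node passes filtering (e.g., to trigger preemption).
PreScore, Score, NormalizeScore: rank the Nodes that passed filtering; NormalizeScore rescales a plugin's raw scores to a common range before they are weighted and summed.
Reserve, Permit, PreBind, Bind, PostBind: reserve resources for the Pod, optionally delay or deny the binding, then perform it and run follow-up hooks.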
Built-in plugins: NodeResourcesFit, NodeAffinity, PodTopologySpread, InterPodAffinity, TaintToleration, NodePorts, VolumeBinding, ImageLocality, NodeResourcesBalancedAllocation.
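Scheduling profiles: one scheduler binary can run several profiles ("scheduler personalities"), each with its own plugin configuration. A Pod selects a profile through spec.schedulerName. A typical use is giving batch workloads different placement behavior from latency-sensitive services.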
Preemption: when a high-priority Pod can't fit on any Node, the scheduler may preempt (evict) lower-priority Pods to make room. A Pod's priority comes from the PriorityClass named in its spec.priorityClassName.
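To make the preemption idea concrete, here is a minimal Go sketch of victim selection on a single Node. It is a deliberate simplification: RunningPod and pickVictims are hypothetical names, only CPU is considered, and the real scheduler also weighs PodDisruptionBudgets, compares candidate Nodes, and tries to keep as many Pods running as it can.

```go
package main

import (
	"fmt"
	"sort"
)

// RunningPod is a simplified stand-in for a Pod already bound to a node.
type RunningPod struct {
	Name     string
	Priority int32
	MilliCPU int64
}

// pickVictims returns the pods to evict from one node so that a pending pod
// with the given priority and CPU request fits. Only strictly lower-priority
// pods are candidates, and the lowest priorities are evicted first.
// It returns false when evicting every candidate would still not free enough.
func pickVictims(freeMilliCPU int64, running []RunningPod, priority int32, requestMilliCPU int64) ([]string, bool) {
	if freeMilliCPU >= requestMilliCPU {
		return nil, true // fits already; nothing to preempt
	}
	candidates := make([]RunningPod, 0, len(running))
	for _, rp := range running {
		if rp.Priority < priority {
			candidates = append(candidates, rp)
		}
	}
	// Stable sort keeps the original order among equal priorities.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	var victims []string
	for _, c := range candidates {
		victims = append(victims, c.Name)
		freeMilliCPU += c.MilliCPU
		if freeMilliCPU >= requestMilliCPU {
			return victims, true
		}
	}
	return nil, false
}

func main() {
	running := []RunningPod{
		{Name: "api", Priority: 1000, MilliCPU: 1000},
		{Name: "batch-1", Priority: 10, MilliCPU: 500},
		{Name: "batch-2", Priority: 10, MilliCPU: 500},
		{Name: "cache", Priority: 100, MilliCPU: 1000},
	}
	// A pending pod with priority 500 needs 1000m CPU; the node has 200m free.
	victims, ok := pickVictims(200, running, 500, 1000)
	fmt.Println(victims, ok) // prints "[batch-1 batch-2] true"
}
```

Note that "api" is never a candidate because its priority is higher than the pending Pod's, and "cache" survives because evicting the two batch Pods already frees enough CPU.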

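Use kubectl describe pod on a pending Pod; its FailedScheduling events list the reasons Nodes were filtered out. Per-Node scores are not reported in events, so raise the scheduler's log verbosity if you need to inspect them.
Set a Node taint (kubectl taint nodes node1 key=value:NoSchedule); observe that new Pods without a matching toleration avoid it.
Define PriorityClasses (high, default, batch); deploy mixed-priority Pods; trigger preemption by oversaturating the cluster.
Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Build and configure a scheduler binary that includes it, run it, and verify that its Node selection differs from the default scheduler's.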
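For the taint exercise, it helps to know how a toleration is matched against a taint. The Go sketch below models those matching rules as I understand them from the core API; Taint, Toleration, tolerates, and schedulable are simplified local stand-ins, not the real k8s.io/api types, and it only covers the filtering side (NoSchedule and NoExecute).

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the core/v1 types.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal": // Equal is the default operator
		return t.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// schedulable reports whether every NoSchedule or NoExecute taint on a node
// is tolerated. PreferNoSchedule taints are skipped here because they only
// influence scoring, not filtering.
func schedulable(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, t := range tolerations {
			if tolerates(t, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	// Same taint as `kubectl taint nodes node1 key=value:NoSchedule`.
	taints := []Taint{{Key: "key", Value: "value", Effect: "NoSchedule"}}

	fmt.Println(schedulable(taints, nil)) // prints "false"
	fmt.Println(schedulable(taints, []Toleration{
		{Key: "key", Operator: "Equal", Value: "value", Effect: "NoSchedule"},
	})) // prints "true"
	fmt.Println(schedulable(taints, []Toleration{{Operator: "Exists"}})) // prints "true"
}
```

The last case shows why a bare Exists toleration with no key is powerful: it tolerates every taint, which is how some system DaemonSets manage to run on all Nodes.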

Set priorityClassName on system-critical Pods (CSI driver, ingress controller). Use the built-in system-cluster-critical and system-node-critical PriorityClasses for cluster-internal Pods.

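Wire scheduler metrics into your monitoring: scheduler_pending_pods, scheduler_pod_scheduling_duration_seconds, scheduler_pod_scheduling_attempts. Metric names and stability levels change between releases, so check them against your Kubernetes version. Alert when Pods stay pending for longer than your workloads can tolerate.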