Week 3 - The Scheduler

3.1 Conceptual Core

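  • The default scheduler (kube-scheduler) is a controller with a single active instance: you can run several replicas, but leader election ensures only one of them makes decisions at a time. It watches Pods that have no Node assigned and binds them to Nodes. A "binding" amounts to setting the Pod's spec.nodeName field, which the scheduler does through the Pod's binding subresource.
  • The scheduler's algorithm is filter, then score: filter out Nodes that can't host the Pod (insufficient resources, unsatisfied affinities, untolerated taints), score the remaining ones, and pick the highest-scoring.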
  • The framework is plugin-based: filter plugins, score plugins, reserve plugins, pre-bind plugins, bind plugins. You can add custom plugins without forking.
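
The filter-then-score loop described above can be sketched in a few lines of Go. This is a toy model, not the scheduler's code: the Node and Pod types, the single "leftover CPU" scoring rule, and the schedule function are all assumptions made for illustration, and the real scheduler combines many plugins.

```go
package main

import "fmt"

// Node is a simplified stand-in for a cluster node (not the Kubernetes API type).
type Node struct {
	Name         string
	FreeMilliCPU int64
	FreeMemoryMB int64
	Tainted      bool // true if the node has a NoSchedule taint the Pod does not tolerate
}

// Pod is a simplified stand-in that carries only resource requests.
type Pod struct {
	Name     string
	MilliCPU int64
	MemoryMB int64
}

// filter reports whether the node can host the pod at all.
func filter(p Pod, n Node) bool {
	return !n.Tainted && n.FreeMilliCPU >= p.MilliCPU && n.FreeMemoryMB >= p.MemoryMB
}

// score ranks a feasible node: more CPU left after placement scores higher.
// This policy is illustrative only, not the default scheduler's scoring.
func score(p Pod, n Node) int64 {
	return n.FreeMilliCPU - p.MilliCPU
}

// schedule runs filter, then score, and returns the best node's name.
// The second return value is false when no node passes the filter.
func schedule(p Pod, nodes []Node) (string, bool) {
	best, bestScore, found := "", int64(0), false
	for _, n := range nodes {
		if !filter(p, n) {
			continue
		}
		if s := score(p, n); !found || s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeMilliCPU: 500, FreeMemoryMB: 4096},
		{Name: "node-b", FreeMilliCPU: 4000, FreeMemoryMB: 8192},
		{Name: "node-c", FreeMilliCPU: 8000, FreeMemoryMB: 16384, Tainted: true},
	}
	pod := Pod{Name: "web", MilliCPU: 1000, MemoryMB: 512}
	if name, ok := schedule(pod, nodes); ok {
		fmt.Println("bind", pod.Name, "to", name) // prints "bind web to node-b"
	} else {
		fmt.Println("no feasible node for", pod.Name)
	}
}
```

Here node-a is filtered out for lack of CPU and node-c for its taint, so node-b wins even though node-c has more free capacity: filtering always runs before scoring.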

3.2 Mechanical Detail

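  • Scheduler framework extension points (read pkg/scheduler/framework/types.go):
      • PreFilter: pre-processes the Pod and can short-circuit the cycle by rejecting it before any Node is filtered.
      • Filter: every Filter plugin must return Success for a Node to remain eligible.
      • PostFilter: invoked only when no Node passes Filter (e.g., to trigger preemption).
      • PreScore, Score, NormalizeScore: rank the eligible Nodes; NormalizeScore rescales a plugin's raw scores into the framework's common range.
      • Reserve, Permit, PreBind, Bind, PostBind: reserve resources on the chosen Node, optionally delay or deny the binding, then bind.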
  • Built-in plugins: NodeResourcesFit, NodeAffinity, PodTopologySpread, InterPodAffinity, TaintToleration, NodePorts, VolumeBinding, ImageLocality, NodeResourcesBalancedAllocation.
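  • Scheduling profiles: one scheduler binary can run multiple profiles ("scheduler personalities"), each with its own plugin configuration. A Pod selects a profile through spec.schedulerName. A typical use is scheduling batch workloads under different rules from the rest of the cluster.
  • Preemption: when a high-priority Pod fits on no Node, the scheduler may preempt (evict) lower-priority Pods to make room. A Pod's priority comes from the PriorityClass named in its spec.priorityClassName.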
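
To make the preemption idea concrete, here is a minimal sketch of victim selection on a single Node. It is deliberately simplified and is not the scheduler's actual algorithm: it looks only at CPU, evicts the lowest-priority Pods first, and ignores things the real scheduler takes into account, such as PodDisruptionBudgets and comparison across Nodes. The RunningPod type and victims function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// RunningPod is a simplified stand-in for a Pod already placed on a node.
type RunningPod struct {
	Name     string
	Priority int32
	MilliCPU int64
}

// victims returns the pods to evict from one node so that a pending pod with
// the given priority and CPU request fits. It returns ok=false (and no
// victims) if evicting every lower-priority pod still frees too little CPU.
func victims(running []RunningPod, freeMilliCPU int64, pendingPriority int32, pendingMilliCPU int64) ([]string, bool) {
	// Only strictly lower-priority pods may be preempted.
	var candidates []RunningPod
	for _, p := range running {
		if p.Priority < pendingPriority {
			candidates = append(candidates, p)
		}
	}
	// Evict the lowest-priority pods first; the stable sort keeps input order on ties.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})

	var out []string
	for _, c := range candidates {
		if freeMilliCPU >= pendingMilliCPU {
			break // enough room already, stop evicting
		}
		freeMilliCPU += c.MilliCPU
		out = append(out, c.Name)
	}
	if freeMilliCPU < pendingMilliCPU {
		return nil, false
	}
	return out, true
}

func main() {
	running := []RunningPod{
		{Name: "batch-1", Priority: 0, MilliCPU: 1000},
		{Name: "web-1", Priority: 1000, MilliCPU: 1000},
		{Name: "batch-2", Priority: 0, MilliCPU: 500},
	}
	// A pending pod with priority 2000 needs 1500m CPU; the node has 200m free.
	got, ok := victims(running, 200, 2000, 1500)
	fmt.Println(got, ok) // prints "[batch-1 batch-2] true"
}
```

Note that web-1 survives: once the two batch Pods are evicted there is enough room, so higher-priority candidates are left alone.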

3.3 Lab: "Scheduler in Action"

  1. Run kubectl describe pod on a pending Pod and read its FailedScheduling events to see why Nodes were filtered out.
  2. Set a Node taint (kubectl taint nodes node1 key=value:NoSchedule); observe new Pods avoid it.
  3. Define PriorityClasses (high, default, batch); deploy mixed-priority Pods; trigger preemption by requesting more resources than the cluster has free.
  4. Write a custom scheduler plugin (a tiny score plugin) using the scheduler framework. Build and configure your scheduler binary, then run it. Verify that its Node selection differs from the default scheduler's.
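
Before writing a real plugin for step 4, it can help to see the score-then-normalize shape in isolation. The sketch below does not use the scheduler framework: Node, ScorePlugin, FreeCPU, and scoreAll are hypothetical local stand-ins, and the real plugin interfaces take more arguments and return status values. Only the 0 to 100 score range is taken from the framework.

```go
package main

import "fmt"

// maxNodeScore mirrors the framework's score ceiling of 100.
const maxNodeScore int64 = 100

// Node is a hypothetical stand-in for the framework's per-node information.
type Node struct {
	Name         string
	FreeMilliCPU int64
}

// ScorePlugin is a hypothetical, trimmed-down interface for this sketch only.
type ScorePlugin interface {
	Name() string
	Score(n Node) int64
	Normalize(scores map[string]int64)
}

// FreeCPU is a toy plugin that prefers nodes with more unallocated CPU.
type FreeCPU struct{}

func (FreeCPU) Name() string { return "FreeCPU" }

// Score returns a raw value on the plugin's own scale (millicores).
func (FreeCPU) Score(n Node) int64 { return n.FreeMilliCPU }

// Normalize rescales raw scores in place so the best node gets maxNodeScore.
func (FreeCPU) Normalize(scores map[string]int64) {
	var highest int64
	for _, s := range scores {
		if s > highest {
			highest = s
		}
	}
	if highest == 0 {
		return // nothing to rescale
	}
	for name, s := range scores {
		scores[name] = s * maxNodeScore / highest
	}
}

// scoreAll runs the score and normalize steps for one plugin.
func scoreAll(p ScorePlugin, nodes []Node) map[string]int64 {
	scores := make(map[string]int64, len(nodes))
	for _, n := range nodes {
		scores[n.Name] = p.Score(n)
	}
	p.Normalize(scores)
	return scores
}

func main() {
	nodes := []Node{{"node-a", 500}, {"node-b", 4000}, {"node-c", 2000}}
	var plugin ScorePlugin = FreeCPU{}
	scores := scoreAll(plugin, nodes)
	for _, n := range nodes {
		fmt.Printf("%s %s=%d\n", n.Name, plugin.Name(), scores[n.Name])
	}
}
```

Normalizing matters because the scheduler combines scores from several plugins; a plugin that reported raw millicores would otherwise swamp one that reports a small count.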

3.4 Hardening Drill

  • Set priorityClassName on system-critical Pods (CSI driver, ingress controller) so they are not preempted by ordinary workloads. Use the built-in system-cluster-critical and system-node-critical PriorityClasses for cluster-internal Pods.

3.5 Operations Slice

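  • Scrape scheduler metrics into your monitoring stack: scheduler_pending_pods, scheduler_pod_scheduling_duration_seconds (recent releases expose scheduler_pod_scheduling_sli_duration_seconds in its place; check what your version exports), scheduler_pod_scheduling_attempts. Alert when Pods stay pending beyond a threshold you choose, not on brief spikes.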
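
The difference between a brief spike and persistently pending Pods can be expressed as a simple windowed check. The sketch below assumes you already collect periodic samples of scheduler_pending_pods as integers; the persistentlyPending function and the window size are illustrative assumptions, and in practice this rule would normally live in your alerting system rather than in Go.

```go
package main

import "fmt"

// persistentlyPending reports whether each of the last `window` samples
// showed at least one pending pod. Fewer than `window` samples never alerts.
func persistentlyPending(samples []int, window int) bool {
	if window <= 0 || len(samples) < window {
		return false
	}
	for _, s := range samples[len(samples)-window:] {
		if s == 0 {
			return false // the queue drained at some point inside the window
		}
	}
	return true
}

func main() {
	// Hypothetical samples of scheduler_pending_pods, oldest first.
	samples := []int{0, 2, 3, 3, 1}
	if persistentlyPending(samples, 3) {
		fmt.Println("alert: Pods pending for 3 consecutive samples")
	}
}
```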
