Workshop - Build a custom scheduler¶

Difficulty: Deep | Time: 75 min

Needs: Linux or macOS, Go 1.21+, Docker, kind or k3d

Before you start:

Completed the controller-from-scratch workshop
Understand what a Pod binding is (Pod-to-Node assignment)

Launch in Killercoda: free browser-based environment - no install required to follow along.

Companion to Kubernetes -> Month 01 -> Week 3: The Scheduler. The chapter explains how the scheduler assigns Pods to nodes through filtering and scoring. This workshop has you build a working scheduler - a real program that watches for unscheduled Pods and binds them to nodes by your own logic - and watch your placement decisions take effect. By the end you'll understand that scheduling is "just" another control loop, and you'll have replaced one of the most mysterious-seeming control-plane components with ~120 lines of Go.

~90 minutes. Needs: kind/k3d (ideally a multi-node cluster), Go 1.21+, kubectl. Prerequisite: the controller-from-scratch workshop - a scheduler is a controller with a specific job.

What you'll build, and the idea it makes concrete¶

You'll build a minimal scheduler that watches for Pods requesting schedulerName: workshop-scheduler, picks a node by a simple policy (least-loaded by Pod count), and binds the Pod to it. Then you'll watch your scheduler place Pods - and watch a Pod sit Pending forever when your scheduler isn't running, proving you are the thing making placement happen.

The idea this makes concrete:

Scheduling is not magic, and not part of the kubelet - it's a separate control loop. The default kube-scheduler watches for Pods with no node assigned (spec.nodeName == ""), runs filter (which nodes can run this?) then score (which is best?), and writes the chosen node back via a Bind call. The kubelet on that node then notices "a Pod is assigned to me" and runs it. Placement and execution are decoupled. You can run multiple schedulers side by side, and a Pod picks one by schedulerName.

The controller workshop showed reconcile on ConfigMaps. This one shows the same loop doing the job that feels most like core-Kubernetes magic, and reveals that it's the same pattern: watch, decide, act.

Step 0: how scheduling actually works¶

The mental model, because the magic dissolves once you see it:

1. You create a Pod. The apiserver stores it with spec.nodeName = "" (unscheduled).
2. A SCHEDULER (watching for unscheduled Pods) notices it.
3. Scheduler FILTERS: which nodes have enough CPU/mem, match nodeSelector, tolerate taints?  -> feasible nodes
4. Scheduler SCORES: rank the feasible nodes by some policy -> best node
5. Scheduler BINDS: writes spec.nodeName = <best node> (a POST to pods/<name>/binding).
6. The KUBELET on that node sees a Pod assigned to it, pulls images, starts containers.

Two facts that surprise people:

- The scheduler never starts a container. It only decides and records the decision (the bind). The kubelet does the running. Decision and execution are different components.
- A Pod with no scheduler to handle it stays Pending forever. There's no fallback - if you set a schedulerName that no scheduler is watching for, the Pod just waits. You'll see this directly.

Your scheduler implements steps 2-5.
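Steps 3-5 can be sketched without any Kubernetes dependency. The block below is a toy model, not the workshop code: Node, filter, score and scheduleAll are hypothetical names, and "binding" is modelled as nothing more than recording the chosen node and bumping its Pod count.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a cluster node.
type Node struct {
	Name       string
	Ready      bool
	NoSchedule bool // true if the node carries a NoSchedule taint
}

// filter keeps only the nodes that may run a Pod (step 3).
func filter(nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.Ready && !n.NoSchedule {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score returns the feasible node with the fewest Pods (step 4).
// Ties go to the node that appears first in the slice.
func score(feasible []Node, podCount map[string]int) Node {
	best := feasible[0]
	for _, n := range feasible[1:] {
		if podCount[n.Name] < podCount[best.Name] {
			best = n
		}
	}
	return best
}

// scheduleAll places each Pod in order and returns pod name -> node name.
// Recording the choice stands in for the bind (step 5).
func scheduleAll(pods []string, nodes []Node) map[string]string {
	placement := map[string]string{}
	podCount := map[string]int{}
	feasible := filter(nodes)
	if len(feasible) == 0 {
		return placement // nothing feasible: every Pod stays unplaced
	}
	for _, p := range pods {
		best := score(feasible, podCount)
		placement[p] = best.Name
		podCount[best.Name]++
	}
	return placement
}

func main() {
	nodes := []Node{
		{Name: "control-plane", Ready: true, NoSchedule: true},
		{Name: "worker", Ready: true},
		{Name: "worker2", Ready: true},
		{Name: "worker3", Ready: false},
	}
	pods := []string{"pod1", "pod2", "pod3", "pod4"}
	placement := scheduleAll(pods, nodes)
	for _, p := range pods {
		fmt.Printf("%s -> %s\n", p, placement[p])
	}
}
```

With one tainted node and one not-ready node filtered out, the four Pods alternate between worker and worker2. The real program you build below has the same shape, with the apiserver in place of the maps.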

Step 1: a multi-node cluster (so placement is visible)¶

Placement is only interesting with more than one node. kind can make a multi-node cluster:

$ cat <<EOF | kind create cluster --name sched-workshop --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
$ kubectl get nodes
NAME                          STATUS   ROLES           AGE
sched-workshop-control-plane  Ready    control-plane   40s
sched-workshop-worker         Ready    <none>          25s
sched-workshop-worker2        Ready    <none>          25s
sched-workshop-worker3        Ready    <none>          25s

Three worker nodes - your scheduler will choose among them.

Step 2: the scheduler loop - watch unscheduled Pods¶

A scheduler is a controller whose "desired state" is "every Pod assigned to a node." Set up the project and the watch:

$ mkdir mini-scheduler && cd mini-scheduler
$ go mod init workshop/mini-scheduler
$ go get k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/client-go@v0.29.3

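package main

// context, fmt, metav1, labels and listers are used by schedule() in Step 3.
import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    listers "k8s.io/client-go/listers/core/v1"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/klog/v2"
)

const schedulerName = "workshop-scheduler"

func main() {
    // Load credentials from the default kubeconfig (the one kind writes to).
    kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        klog.Fatalf("loading kubeconfig: %v", err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        klog.Fatalf("building client: %v", err)
    }

    factory := informers.NewSharedInformerFactory(client, 0)
    podInformer := factory.Core().V1().Pods()
    nodeInformer := factory.Core().V1().Nodes()
    // Calling Lister() registers each informer with the factory, so Start() runs both.
    podLister := podInformer.Lister()
    nodeLister := nodeInformer.Lister()

    // Fill both caches before handling any Pod, so schedule() never sees a half-loaded node list.
    stop := make(chan struct{})
    factory.Start(stop)
    if !cache.WaitForCacheSync(stop, podInformer.Informer().HasSynced, nodeInformer.Informer().HasSynced) {
        klog.Fatal("informer caches did not sync")
    }

    // A handler added after Start first receives an Add for every Pod already in the cache,
    // then live events. That replay is how Pods created while we were down get picked up.
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod := obj.(*corev1.Pod)
            // Only handle pods that ask for US and aren't scheduled yet.
            if pod.Spec.SchedulerName != schedulerName || pod.Spec.NodeName != "" {
                return
            }
            if err := schedule(client, podLister, nodeLister, pod); err != nil {
                klog.Errorf("failed to schedule %s/%s: %v", pod.Namespace, pod.Name, err)
            }
        },
    })

    klog.Infof("%s running", schedulerName)
    <-stop
}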

The trigger is the same as any controller: an informer fires when an unscheduled Pod (asking for our scheduler) appears. The work is schedule().
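The guard at the top of the handler is the whole "is this Pod mine?" decision. A standalone sketch of that predicate, using a hypothetical trimmed-down Pod type and a hypothetical helper name (wantsUs) rather than the client-go types:

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down view of the two fields the handler reads.
type Pod struct {
	Name          string
	SchedulerName string // spec.schedulerName
	NodeName      string // spec.nodeName; empty means "not scheduled yet"
}

// wantsUs reports whether this scheduler should act on the Pod:
// the Pod must name this scheduler and must not be bound to a node yet.
func wantsUs(p Pod, scheduler string) bool {
	return p.SchedulerName == scheduler && p.NodeName == ""
}

func main() {
	pods := []Pod{
		{Name: "pod1", SchedulerName: "workshop-scheduler"},
		{Name: "web", SchedulerName: "default-scheduler"},
		{Name: "pod0", SchedulerName: "workshop-scheduler", NodeName: "sched-workshop-worker"},
	}
	for _, p := range pods {
		fmt.Printf("%-5s handle=%v\n", p.Name, wantsUs(p, "workshop-scheduler"))
	}
}
```

Only pod1 qualifies: web belongs to another scheduler, and pod0 is already bound. Skipping bound Pods is also what makes it safe for the handler to see every existing Pod again after a restart.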

Step 3: filter and score - the decision¶

Here's the heart - the same filter-then-score the real scheduler does, simplified to one policy (fewest Pods wins):

// schedule runs filter -> score -> bind for a single Pod.
func schedule(client kubernetes.Interface, podLister listers.PodLister,
    nodeLister listers.NodeLister, pod *corev1.Pod) error {

    nodes, err := nodeLister.List(labels.Everything())
    if err != nil {
        return err
    }

    // FILTER: which nodes can run this pod?
    var feasible []*corev1.Node
    for _, n := range nodes {
        // Real schedulers also check cpu/mem/affinity.
        if isReady(n) && !hasBlockingTaint(n) {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return fmt.Errorf("no feasible node for %s", pod.Name)
    }

    // SCORE: pick the node running the fewest pods (least-loaded policy).
    // The count covers every Pod in the cache, in all namespaces.
    allPods, err := podLister.List(labels.Everything())
    if err != nil {
        return err
    }
    countByNode := map[string]int{}
    for _, p := range allPods {
        if p.Spec.NodeName != "" {
            countByNode[p.Spec.NodeName]++
        }
    }
    best := feasible[0]
    for _, n := range feasible[1:] {
        if countByNode[n.Name] < countByNode[best.Name] {
            best = n
        }
    }

    // BIND: record the decision. THIS is what "scheduling" actually is.
    binding := &corev1.Binding{
        ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
        Target:     corev1.ObjectReference{Kind: "Node", Name: best.Name},
    }
    err = client.CoreV1().Pods(pod.Namespace).Bind(context.TODO(), binding, metav1.CreateOptions{})
    if err != nil {
        return err
    }
    klog.Infof("bound %s/%s -> %s (had %d pods)", pod.Namespace, pod.Name, best.Name, countByNode[best.Name])

    // Not implemented here: the real scheduler also emits a Scheduled event,
    // which is what `kubectl describe pod` lists under Events.
    return nil
}

// isReady reports whether the node's Ready condition is True.
func isReady(n *corev1.Node) bool {
    for _, c := range n.Status.Conditions {
        if c.Type == corev1.NodeReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}

// hasBlockingTaint reports whether the node has any NoSchedule taint.
// It ignores the Pod's tolerations, so a tainted node is rejected for every Pod.
func hasBlockingTaint(n *corev1.Node) bool {
    for _, t := range n.Spec.Taints {
        if t.Effect == corev1.TaintEffectNoSchedule {
            return true
        }
    }
    return false
}

The Bind call is the whole job: a POST to pods/<name>/binding that writes spec.nodeName. That's it - "scheduling a Pod" is setting one field. Everything else (filter, score) is just deciding which value to write. The real scheduler has dozens of filter and score plugins; yours has one of each, but the shape is identical.
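To see how small the bind really is, here is a sketch of the request the Bind call boils down to: a path and a JSON body. The structs are hypothetical local mirrors of the few Binding fields used in the code above (not the k8s.io/api types), and bindingRequest is an illustrative helper that only builds the request, it does not send it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical local mirrors of the Binding fields the workshop code sets.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type objectReference struct {
	Kind string `json:"kind"`
	Name string `json:"name"`
}

type binding struct {
	APIVersion string          `json:"apiVersion"`
	Kind       string          `json:"kind"`
	Metadata   objectMeta      `json:"metadata"`
	Target     objectReference `json:"target"`
}

// bindingRequest returns the apiserver path and JSON body for binding a Pod to a node.
func bindingRequest(namespace, pod, node string) (string, []byte, error) {
	path := fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/binding", namespace, pod)
	body, err := json.Marshal(binding{
		APIVersion: "v1",
		Kind:       "Binding",
		Metadata:   objectMeta{Name: pod, Namespace: namespace},
		Target:     objectReference{Kind: "Node", Name: node},
	})
	return path, body, err
}

func main() {
	path, body, err := bindingRequest("default", "pod1", "sched-workshop-worker")
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", path)
	fmt.Println(string(body))
}
```

Everything the filter and score stages computed ends up as the single "name" under "target".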

Step 4: run it and watch your scheduler place Pods¶

$ go run .
workshop-scheduler running

Create a few Pods that ask for your scheduler:

$ for i in 1 2 3 4 5 6; do
  kubectl run pod$i --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
done

Watch your scheduler's log - it's making decisions in real time:

bound default/pod1 -> sched-workshop-worker  (had 0 pods)
bound default/pod2 -> sched-workshop-worker2 (had 0 pods)
bound default/pod3 -> sched-workshop-worker3 (had 0 pods)
bound default/pod4 -> sched-workshop-worker  (had 1 pods)    <- least-loaded: spreads them
bound default/pod5 -> sched-workshop-worker2 (had 1 pods)
bound default/pod6 -> sched-workshop-worker3 (had 1 pods)

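Your least-loaded policy spread 6 Pods evenly across 3 nodes. (The counts in your own log will be higher than in the sample above: the policy counts every Pod in the cache, including kube-system Pods such as kube-proxy that already run on each worker.) Confirm from the cluster's side: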

$ kubectl get pods -o wide --sort-by='{.spec.nodeName}'
NAME   READY   STATUS    NODE
pod1   1/1     Running   sched-workshop-worker
pod4   1/1     Running   sched-workshop-worker
pod2   1/1     Running   sched-workshop-worker2
pod5   1/1     Running   sched-workshop-worker2
pod3   1/1     Running   sched-workshop-worker3
pod6   1/1     Running   sched-workshop-worker3

Two Pods per node, exactly as your scoring decided. And critically - kubectl describe pod pod1 shows the node your scheduler assigned, followed by the kubelet's events as it starts the container. Your scheduler decided; the kubelet executed. You watched the decoupling.
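An even split like this depends on each decision seeing the effect of the previous bind. Because the scoring reads Pod counts from a cache that lags slightly behind the apiserver, back-to-back decisions can miss a bind that was just issued. One way to close that gap is to keep an in-memory count of in-flight binds next to the observed counts. The sketch below is a toy model of that idea; loadView and its methods are hypothetical names, and this is not how kube-scheduler is implemented internally.

```go
package main

import "fmt"

// loadView is a hypothetical in-memory view of how many Pods each node runs.
type loadView struct {
	observed map[string]int // counts as seen through the (possibly stale) cache
	assumed  map[string]int // binds we issued that the cache has not shown yet
}

func newLoadView(observed map[string]int) *loadView {
	return &loadView{observed: observed, assumed: map[string]int{}}
}

// load is what scoring should read: observed Pods plus in-flight binds.
func (v *loadView) load(node string) int { return v.observed[node] + v.assumed[node] }

// assume records a bind immediately after it is issued.
func (v *loadView) assume(node string) { v.assumed[node]++ }

// confirm is called once the cache shows the Pod on the node;
// it moves one Pod from "assumed" to "observed".
func (v *loadView) confirm(node string) {
	if v.assumed[node] > 0 {
		v.assumed[node]--
		v.observed[node]++
	}
}

// leastLoaded returns the node with the lowest load; ties go to the earliest node.
func leastLoaded(nodes []string, load func(string) int) string {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if load(n) < load(best) {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []string{"worker", "worker2"}
	v := newLoadView(map[string]int{"worker": 0, "worker2": 0})

	first := leastLoaded(nodes, v.load)
	v.assume(first) // without this call, the second decision would pick the same node
	second := leastLoaded(nodes, v.load)
	fmt.Println(first, second)
}
```

The two decisions land on different nodes even though the observed counts never changed between them.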

Step 5: the proof that YOU are the scheduler - watch a Pod hang Pending¶

This is the "it clicks" moment. Stop your scheduler (Ctrl-C). Now create another Pod that asks for it:

$ kubectl run orphan --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
$ kubectl get pod orphan
NAME     READY   STATUS    AGE
orphan   0/1     Pending   30s          <- stuck. Forever.
$ kubectl describe pod orphan | grep -A2 Events
Events:  <none>                         # NO scheduler touched it. No node assigned. It just waits.

The Pod is Pending with no events - because nothing is watching for workshop-scheduler Pods. There's no default fallback; a Pod whose named scheduler is absent waits indefinitely. Now restart your scheduler:

$ go run .
# instantly in the log:
bound default/orphan -> sched-workshop-worker (had 2 pods)
$ kubectl get pod orphan
NAME     READY   STATUS    AGE
orphan   1/1     Running   3s           <- your scheduler woke up and placed it

You watched a Pod sit unschedulable until your program ran, then get placed the instant it did. That's the proof that scheduling is a control loop you can own - the Pending Pod was waiting for you. This is also why a broken/overloaded scheduler manifests as Pods stuck Pending cluster-wide - now you know exactly why.

Step 6: break it - filter everything out¶

See the "no feasible node" path. Taint all workers so your filter rejects them:

$ kubectl taint nodes -l '!node-role.kubernetes.io/control-plane' workshop=blocked:NoSchedule
$ kubectl run blocked --image=registry.k8s.io/pause:3.9 --overrides='{"spec":{"schedulerName":"workshop-scheduler"}}'
# scheduler log: failed to schedule default/blocked: no feasible node for blocked
$ kubectl get pod blocked     # Pending - your filter rejected every node

The Pod is unschedulable because no node passed the filter - exactly what happens with the real scheduler when resource requests exceed every node's capacity, or affinity/taints exclude everything. Remove the taint (kubectl taint nodes ... workshop-) and then restart your scheduler: as written it handles each Pod once, on its Add event, and the informer factory was created with a resync period of 0, so a Pod that failed scheduling is not retried until a restart replays it. This is the difference between "no scheduler" (Step 5, no events) and "scheduler ran but found nowhere to put it" (Step 6, a FailedScheduling event in the real scheduler).
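If you want a failed Pod to be tried again without outside help, one option is an explicit retry with backoff around the scheduling attempt. The sketch below is a minimal version under stated assumptions: retryWithBackoff and errNoFeasibleNode are hypothetical names, the closure stands in for schedule(), and a real scheduler would use a queue rather than sleeping inside an event handler.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoFeasibleNode stands in for the "no feasible node" error from schedule().
var errNoFeasibleNode = errors.New("no feasible node")

// retryWithBackoff calls try up to attempts times, doubling the wait after each failure.
// It returns nil on the first success, or the last error if every attempt fails.
func retryWithBackoff(attempts int, initial time.Duration, try func() error) error {
	var err error
	wait := initial
	for i := 0; i < attempts; i++ {
		if err = try(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// Stand-in for schedule(): fails twice, then succeeds,
	// as if the taint were removed before the third attempt.
	try := func() error {
		calls++
		if calls < 3 {
			return errNoFeasibleNode
		}
		return nil
	}
	err := retryWithBackoff(5, 10*time.Millisecond, try)
	fmt.Printf("attempts=%d err=%v\n", calls, err)
}
```

The attempt count is bounded on purpose: a Pod that can never fit should surface as an error rather than spin forever.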

Now extend it¶

Real resource filtering. Filter on actual CPU/memory: sum the requests of pods already on each node, compare to allocatable. This is the core of real scheduling - bin-packing by resources.
Respect nodeSelector / affinity / taints+tolerations. Add these filters and watch pods land only where they're allowed. You'll appreciate how much the default scheduler does.
Spread vs bin-pack. Add a second scoring policy (most-loaded = bin-pack, to empty nodes for scale-down) and make it configurable. This is the real spreading-vs-consolidation tradeoff.
The Scheduler Framework plugin path. Instead of a standalone scheduler, write a plugin for the real kube-scheduler (a Filter or Score plugin). This is how production custom scheduling is done - you extend the battle-tested scheduler rather than replace it.
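As a starting point for the first and third extensions, here is a sketch that filters on requested versus allocatable resources and then scores with a switchable policy. It is a toy model: Resources, Node, Policy, fits and pick are hypothetical names, CPU is in millicores and memory in bytes, and the per-node totals are assumed to have been summed already from the Pods on each node.

```go
package main

import "fmt"

// Resources is a hypothetical pair of quantities: CPU in millicores, memory in bytes.
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

// Node is a hypothetical summary of one node's capacity and current usage.
type Node struct {
	Name        string
	Allocatable Resources
	Requested   Resources // sum of requests of Pods already on the node
	PodCount    int
}

// Policy selects how feasible nodes are ranked.
type Policy int

const (
	Spread  Policy = iota // prefer the node with the fewest Pods
	BinPack               // prefer the node with the most Pods
)

// fits reports whether the node still has room for the request.
func fits(n Node, req Resources) bool {
	return n.Requested.MilliCPU+req.MilliCPU <= n.Allocatable.MilliCPU &&
		n.Requested.MemoryBytes+req.MemoryBytes <= n.Allocatable.MemoryBytes
}

// pick filters by resource fit, then scores by policy.
// It returns false if no node fits. Ties go to the earliest node.
func pick(nodes []Node, req Resources, policy Policy) (string, bool) {
	var best Node
	found := false
	for _, n := range nodes {
		if !fits(n, req) {
			continue
		}
		if !found {
			best, found = n, true
			continue
		}
		if policy == Spread && n.PodCount < best.PodCount {
			best = n
		}
		if policy == BinPack && n.PodCount > best.PodCount {
			best = n
		}
	}
	return best.Name, found
}

func main() {
	const gib = int64(1) << 30
	capacity := Resources{MilliCPU: 4000, MemoryBytes: 8 * gib}
	nodes := []Node{
		{Name: "worker", Allocatable: capacity, Requested: Resources{3500, 1 * gib}, PodCount: 5},
		{Name: "worker2", Allocatable: capacity, Requested: Resources{1000, 2 * gib}, PodCount: 2},
		{Name: "worker3", Allocatable: capacity, Requested: Resources{2000, 2 * gib}, PodCount: 3},
	}
	req := Resources{MilliCPU: 1000, MemoryBytes: 1 * gib}

	spread, _ := pick(nodes, req, Spread)
	packed, _ := pick(nodes, req, BinPack)
	fmt.Println("spread:", spread, "bin-pack:", packed)
}
```

Here worker is filtered out for lack of CPU even though it has the most Pods, so bin-packing settles on worker3 while spreading picks worker2. That shows filter and score working as separate stages.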

What you might wonder¶

"Should I ever run a custom scheduler in production?" Rarely as a replacement - the default scheduler is sophisticated (resource fit, affinity, topology spread, preemption) and hard to beat. But custom scheduling is real for specialized needs: gang scheduling (all-or-nothing for ML training jobs), hardware-topology-aware placement (GPUs/NUMA), or batch/HPC. The right path is almost always a Scheduler Framework plugin (extend the default), not a from-scratch scheduler. Building one here is to understand the default - so you can reason about why a Pod landed where it did, and why one is stuck Pending.

"Why is binding a separate API call, not just updating the Pod?" The binding subresource (pods/<name>/binding) exists specifically so scheduling is an atomic, auditable, permission-scoped operation - a component can be granted "bind pods" without "edit pods." It also lets the apiserver enforce that a Pod is bound exactly once. It's the same subresource-design reason status is separate from spec.

"How does the real scheduler avoid placing two pods on a node that only fits one?" It tracks assumed Pods (Pods it has bound but that the cache hasn't caught up on yet) so back-to-back decisions account for in-flight bindings - avoiding the race where it over-commits a node in the gap before the bind is observed. Your mini-scheduler has this race (it reads pod counts from the cache); the real one handles it. A good "now extend it" thought experiment.

"What's preemption?" When a high-priority Pod can't fit, the scheduler can evict lower-priority Pods to make room (then schedule the high-priority one). It's why PriorityClass matters. Your mini-scheduler doesn't preempt - it just fails when nothing fits. Preemption is one of the most complex parts of the real scheduler.

"Multiple schedulers really run at once?" Yes - that's why schedulerName exists. The default scheduler handles Pods with schedulerName: default-scheduler (the default); yours handles workshop-scheduler. They coexist, each watching for its own Pods. This is how you can run a specialized scheduler for batch jobs alongside the default for services, in the same cluster.

What this gave you¶

You built a working scheduler: watch unscheduled Pods, filter feasible nodes, score, bind.
You learned scheduling is recording a decision (the bind sets spec.nodeName) - the scheduler never runs containers; the kubelet does.
You watched your least-loaded policy spread Pods across nodes, confirmed from the cluster.
You watched a Pod hang Pending with no events until your scheduler ran - proving scheduling is a loop you own, and explaining cluster-wide Pending.
You saw the "no feasible node" path and how it differs from "no scheduler."
You know when custom scheduling is warranted (gang/topology/batch) and that the Scheduler Framework plugin path beats a from-scratch replacement.

Next: go below the control plane to the network - build pod networking and see how packets actually flow between Pods.


Submit your build¶

When you finish this workshop, share what you built so others can see and learn from your work. Include:

Public repo with your scheduler code
Multi-node kind cluster output showing your scheduler placing Pods (schedulerName set)
Proof of the "hangs Pending until your scheduler runs" demo
Short note on the scoring function you chose and why
