
Workshop - Build an autoscaler from scratch

Difficulty: Deep  |  Time: 75 min
Needs: Linux or macOS, Go 1.21+, Docker, kind or k3d, metrics-server

Before you start: a free browser-based Killercoda environment is available, so you can follow along with no install required.

Companion to Kubernetes -> Month 05 -> Week 19: HPA, VPA, KEDA: Autoscaling. The chapter explains how the Horizontal Pod Autoscaler scales workloads on metrics. This workshop has you build an HPA-like controller - a loop that reads a metric, computes the desired replica count, and scales a Deployment - then watch it scale up under load and back down when load drops. By the end you'll understand the control loop the real HPA runs, including why it sometimes oscillates and how stabilization prevents it.

~75 minutes. Needs: kind/k3d, Go 1.21+, kubectl. Prerequisite: the controller-from-scratch workshop - an autoscaler is a controller whose desired state is computed from a metric.

What you'll build, and the idea it makes concrete

You'll build an autoscaler that watches a Deployment, reads a load metric (we'll use CPU; you'll see the formula), computes how many replicas should exist to hit a target utilization, and scales the Deployment to match. Then you'll drive load up and watch it scale out, drop the load and watch it scale back in.

The idea this makes concrete:

Autoscaling is a control loop with a feedback formula. The HPA isn't magic - every ~15s it asks "given the current metric and my target, how many replicas do I need?", computes desired = ceil(current * currentMetric / targetMetric), and resizes the Deployment. It's a proportional controller: the further the metric is from target, the bigger the scaling step. The hard parts - not over-reacting to spikes, not flapping up and down - are solved with a stabilization window and tolerance, which you'll build and then break to see why they exist.

The controller pilot computed desired state from a spec. This computes desired state from a live measurement and feeds it back - closing a control loop, which is a genuinely different and powerful pattern.

Step 0: the HPA formula and the control-loop model

The whole autoscaler is one formula, applied in a loop:

desiredReplicas = ceil( currentReplicas * (currentMetricValue / targetMetricValue) )

Read it as proportional control: if current CPU is 2x your target, you need ~2x the replicas; if it's at target, replicas stay. The loop:

every 15s:
    metric   = average metric across the Deployment's pods   (e.g. CPU utilization %)
    desired  = ceil(current * metric / target)
    desired  = clamp(desired, minReplicas, maxReplicas)      # never below min / above max
    if desired != current AND outside tolerance:
        scale the Deployment to desired
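
The decision inside that loop can be sketched as one pure function, with no cluster involved. This is a minimal sketch, not the HPA's source: the function name desiredReplicas and its parameter order are illustrative assumptions, and the sample values in main are made up.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the proportional formula from the loop above:
// ceil(current * metric / target), skipped inside the tolerance band,
// then clamped to [minReplicas, maxReplicas].
// The name and signature are illustrative, not taken from the real HPA.
func desiredReplicas(current int32, metric, target, tolerance float64, minReplicas, maxReplicas int32) int32 {
	ratio := metric / target
	if math.Abs(ratio-1.0) <= tolerance {
		return current // close enough to target: hold
	}
	desired := int32(math.Ceil(float64(current) * ratio))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// CPU at 2x target: replicas double.
	fmt.Println(desiredReplicas(3, 100, 50, 0.10, 1, 10)) // 6
	// CPU within 10% of target: no change.
	fmt.Println(desiredReplicas(3, 52, 50, 0.10, 1, 10)) // 3
	// CPU far below target: shrink, but never below minReplicas.
	fmt.Println(desiredReplicas(3, 5, 50, 0.10, 1, 10)) // 1
}
```

Keeping the decision free of API calls makes it easy to check by hand before wiring it to a real Deployment.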

Two facts that shape everything:

  • It's a feedback loop, so it's prone to oscillation (scale up -> metric drops -> scale down -> metric rises -> scale up...). Tolerance and a stabilization window damp this. You'll build them after seeing the raw version flap.
  • The HPA scales the Deployment, not Pods directly - it edits spec.replicas via the scale subresource, and the Deployment/ReplicaSet controllers (the controller pilot's pattern) create the actual Pods. Layered controllers again.

Step 1: cluster + a scalable target + metrics

$ kind create cluster --name hpa-workshop
# metrics-server provides pod CPU/mem (the HPA's default metric source)
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
$ kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'  # kind needs this
# a deployment to scale: a CPU-burnable web app
$ kubectl create deployment load --image=registry.k8s.io/hpa-example
$ kubectl set resources deployment load --requests=cpu=100m   # HPA needs requests to compute %
$ kubectl expose deployment load --port=80

metrics-server is what feeds CPU/memory to autoscalers; the hpa-example image busy-loops on each request so you can drive its CPU up. Note the CPU request - utilization % is usage / request, so without a request there's no percentage to target.
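
The "usage / request" relationship can be written down directly. The sketch below is illustrative only: utilizationPct is a hypothetical helper, not part of any Kubernetes library, and it works on plain millicore integers rather than real API objects.

```go
package main

import (
	"errors"
	"fmt"
)

// utilizationPct returns CPU usage as a percentage of the CPU request.
// Both arguments are in millicores (a 100m request is 100).
// Hypothetical helper for illustration.
func utilizationPct(usageMilli, requestMilli int64) (float64, error) {
	if requestMilli <= 0 {
		// No request means no denominator, so there is nothing to target.
		return 0, errors.New("no CPU request set: utilization is undefined")
	}
	return float64(usageMilli) * 100 / float64(requestMilli), nil
}

func main() {
	pct, err := utilizationPct(220, 100)
	fmt.Printf("%.0f%% err=%v\n", pct, err)

	_, err = utilizationPct(220, 0)
	fmt.Println("without a request:", err)
}
```

A pod using 220m against a 100m request is at 220% utilization; the same usage with no request has no percentage at all.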

Step 2: the autoscaler loop

$ mkdir mini-hpa && cd mini-hpa && go mod init workshop/mini-hpa
$ go get k8s.io/client-go@v0.29.3 k8s.io/api@v0.29.3 k8s.io/apimachinery@v0.29.3 k8s.io/metrics@v0.29.3

// The autoscaler source starts here; any .go file name works with `go run .`
package main

import (
    "context"
    "math"
    "os"
    "path/filepath"
    "time"

    // The autoscaling/v1 package is not imported: the Scale type is only used
    // through GetScale/UpdateScale, and Go rejects unused imports.
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/klog/v2"
    metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

const (
    namespace      = "default"
    deploymentName = "load"
    targetCPUPct   = 50              // scale to keep avg CPU at ~50% of request
    minReplicas    = 1
    maxReplicas    = 10
    tolerance      = 0.10            // ignore changes within +/-10% of target (anti-flap)
    syncInterval   = 15 * time.Second
)

func main() {
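    kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        klog.Fatalf("load kubeconfig: %v", err)
    }
    // The OrDie constructors panic on a bad config instead of returning a nil client.
    client := kubernetes.NewForConfigOrDie(config)
    metrics := metricsclient.NewForConfigOrDie(config)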

    for {
        if err := reconcile(client, metrics); err != nil {
            klog.Errorf("reconcile: %v", err)
        }
        time.Sleep(syncInterval)
    }
}

Step 3: the reconcile - read metric, compute desired, scale

func reconcile(client kubernetes.Interface, metrics metricsclient.Interface) error {
    ctx := context.TODO()

    // 1. Current replicas (via the scale subresource - same one HPA uses).
    scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, deploymentName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    current := scale.Spec.Replicas

    // 2. Current metric: average CPU across the deployment's pods, as % of request.
    podMetrics, err := metrics.MetricsV1beta1().PodMetricses(namespace).List(ctx, metav1.ListOptions{
        LabelSelector: "app=" + deploymentName,
    })
    if err != nil {
        return err
    }
    if len(podMetrics.Items) == 0 {
        return nil // no metrics yet (pods starting or metrics-server warming up): skip this sync
    }
    var totalMilli int64
    for _, pm := range podMetrics.Items {
        for _, c := range pm.Containers {
            totalMilli += c.Usage.Cpu().MilliValue()
        }
    }
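    // Float division avoids truncating the per-pod average.
    avgMilli := float64(totalMilli) / float64(len(podMetrics.Items))
    const requestMilli = 100.0 // must match the 100m CPU request set in Step 1
    currentPct := avgMilli / requestMilli * 100.0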

    // 3. THE HPA FORMULA: desired = ceil(current * currentMetric / targetMetric)
    ratio := currentPct / float64(targetCPUPct)
    if math.Abs(ratio-1.0) <= tolerance {
        klog.Infof("CPU %.0f%% (target %d%%) within tolerance; staying at %d replicas",
            currentPct, targetCPUPct, current)
        return nil // anti-flap: don't scale for small deviations
    }
    desired := int32(math.Ceil(float64(current) * ratio))

    // 4. Clamp to [min, max].
    if desired < minReplicas { desired = minReplicas }
    if desired > maxReplicas { desired = maxReplicas }

    if desired == current {
        return nil
    }

    // 5. Scale (write spec.replicas via the scale subresource).
    scale.Spec.Replicas = desired
    if _, err := client.AppsV1().Deployments(namespace).UpdateScale(ctx, deploymentName, scale, metav1.UpdateOptions{}); err != nil {
        return err // don't log a scale event that did not happen
    }
    klog.Infof("CPU %.0f%% (target %d%%) -> scaling %d -> %d replicas",
        currentPct, targetCPUPct, current, desired)
    return nil
}

That's a working horizontal autoscaler. The GetScale/UpdateScale calls use the scale subresource - the same generic /scale endpoint the real HPA uses, which is why HPA works on Deployments, StatefulSets, and even custom resources that expose /scale. You read the metric, plug it into the formula, clamp, and scale. The tolerance check is the first anti-flap guard.
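
One way to see why a single /scale endpoint covers several workload kinds is to write the loop against an interface that knows only "get replicas" and "set replicas". The sketch below is an in-memory analogy, not client-go code: scaleTarget, fakeWorkload, and reconcileOnce are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// scaleTarget is a hypothetical stand-in for the scale subresource:
// anything that can report and accept a replica count.
type scaleTarget interface {
	Replicas() int32
	SetReplicas(n int32)
}

// fakeWorkload is an in-memory target. The kind field is only a label;
// the loop below never inspects it.
type fakeWorkload struct {
	kind     string
	replicas int32
}

func (w *fakeWorkload) Replicas() int32     { return w.replicas }
func (w *fakeWorkload) SetReplicas(n int32) { w.replicas = n }

// reconcileOnce runs one pass of the formula against any scaleTarget.
func reconcileOnce(t scaleTarget, cpuPct, targetPct float64, minR, maxR int32) {
	desired := int32(math.Ceil(float64(t.Replicas()) * cpuPct / targetPct))
	if desired < minR {
		desired = minR
	}
	if desired > maxR {
		desired = maxR
	}
	if desired != t.Replicas() {
		t.SetReplicas(desired)
	}
}

func main() {
	targets := []*fakeWorkload{
		{kind: "Deployment", replicas: 1},
		{kind: "StatefulSet", replicas: 4},
	}
	cpu := []float64{220, 20}
	for i, t := range targets {
		before := t.Replicas()
		reconcileOnce(t, cpu[i], 50, 1, 10)
		fmt.Printf("%s: CPU %.0f%% -> %d -> %d replicas\n", t.kind, cpu[i], before, t.Replicas())
	}
}
```

The same shape is also handy for unit-testing the scaling math without a cluster.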

Step 4: run it and watch it scale up under load

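$ go run .

Idle, so it holds at 1, and it does so silently: CPU near 0% is outside the tolerance band, the formula's result clamps to minReplicas (1), and since that already equals the current count, reconcile returns without logging. Now generate load - hammer the service from a busybox pod: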

$ kubectl run loadgen --image=busybox --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://load; done"

Watch your autoscaler react over the next minute:

# autoscaler log (nothing is printed while idle):
CPU 220% (target 50%) -> scaling 1 -> 5 replicas        <- load spiked, scale out
CPU 95%  (target 50%) -> scaling 5 -> 10 replicas       <- still hot, more
CPU 48%  (target 50%) within tolerance; staying at 10 replicas   <- settled near target
$ kubectl get deployment load
NAME   READY   UP-TO-DATE   AVAILABLE
load   10/10   10           10          <- scaled out to handle the load

Your autoscaler watched CPU climb, computed it needed more replicas to bring per-pod CPU down to ~50%, and scaled out - then settled once the metric reached target. That's closed-loop control: it drove the system to the setpoint. The formula did it - ceil(1 * 220/50) = 5, then ceil(5 * 95/50) = 10, capped at max.
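
To see the setpoint behaviour without a cluster, the loop can be run against a toy load model. The sketch below assumes an idealized workload where a fixed total demand divides evenly across replicas; that model, and the names step and simulate, are assumptions for illustration. The real log above took two steps to settle, which suggests the real workload does not follow this model exactly.

```go
package main

import (
	"fmt"
	"math"
)

const (
	targetPct   = 50.0
	tolerance   = 0.10
	minReplicas = 1
	maxReplicas = 10
)

// step returns the replica count the loop picks for one CPU observation.
func step(current int32, cpuPct float64) int32 {
	ratio := cpuPct / targetPct
	if math.Abs(ratio-1.0) <= tolerance {
		return current
	}
	desired := int32(math.Ceil(float64(current) * ratio))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

// simulate feeds the loop its own output: per-pod CPU is the total demand
// (in percent of one pod's request) divided by the current replica count.
func simulate(demandPct float64, start int32, ticks int) []int32 {
	replicas := start
	history := []int32{replicas}
	for i := 0; i < ticks; i++ {
		perPod := demandPct / float64(replicas)
		replicas = step(replicas, perPod)
		history = append(history, replicas)
	}
	return history
}

func main() {
	fmt.Println(simulate(230, 1, 3)) // [1 5 5 5]
	fmt.Println(simulate(30, 10, 2)) // [10 1 1]
}
```

Under this linear model the proportional formula lands inside the tolerance band in a single step and then holds, in both directions.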

Step 5: watch it scale back down

Stop the load and watch the loop run in reverse:

$ kubectl delete pod loadgen
# autoscaler log over the next minute:
CPU 3% (target 50%) -> scaling 10 -> 1 replicas         <- load gone, scale in
$ kubectl get deployment load
NAME   READY
load   1/1                                              <- back to minimum

It scaled in because the metric dropped far below target. The full loop: load up -> scale out -> metric settles; load down -> scale in -> back to min. You built a system that automatically right-sizes a workload to its load - the entire value of autoscaling, in one feedback formula.

Step 6: break it - watch it flap, then fix it with stabilization

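Now the lesson that separates a toy from the real HPA. Set tolerance = 0.0 (remove the anti-flap guard) and drive a bursty load (a loadgen that sleeps between bursts). Tolerance only damps small deviations, so swings as large as the ones below would get past a +/-10% band anyway; zeroing it simply removes the last guard. Watch the autoscaler oscillate: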

CPU 80% -> scaling 2 -> 4 replicas       # burst: scale up
CPU 20% -> scaling 4 -> 2 replicas       # quiet: scale down
CPU 85% -> scaling 2 -> 4 replicas       # burst: scale up again
CPU 18% -> scaling 4 -> 2 replicas       # flap, flap, flap...

This flapping (thrashing replicas up and down) is a real production problem - it churns Pods, disrupts traffic, and wastes resources. The real HPA solves it with a stabilization window: it remembers recent desired-replica computations and, for scale-down, uses the highest recommendation over the last N minutes (default 5min for down, 0 for up - scale up fast, scale down slow). Add it:

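// Record EVERY recommendation (scale-ups too) with its timestamp; use the MAX for scale-down.
// appendWithExpiry (drops entries older than the period) and maxOf are helpers you write.
type rec struct{ replicas int32; at time.Time }
var window []rec // recommendations over the last stabilizationPeriod (e.g. 5 * time.Minute)
// ... in reconcile, after clamping and before the desired == current check:
window = appendWithExpiry(window, rec{desired, time.Now()}, stabilizationPeriod)
if desired < current {
    desired = min(current, maxOf(window)) // don't scale down below the highest recent recommendation
}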

With stabilization, a brief dip in load doesn't immediately scale down - the autoscaler waits to be sure load has dropped, preventing flap. Re-run the bursty load and watch it scale up promptly but scale down patiently - stable. You just built the most important real-world refinement of autoscaling, and you understand why it exists because you watched the raw loop flap without it.

Now extend it

  1. Custom metrics. Scale on requests-per-second or queue depth instead of CPU (the custom/external metrics API). This is what KEDA specializes in - scaling on Kafka lag, queue length, etc., even to zero.
  2. Scale to zero. Allow minReplicas: 0 and scale up from zero on the first request (needs a request-buffering proxy). The serverless-on-Kubernetes pattern (Knative, KEDA).
  3. Multiple metrics. Compute desired from CPU and memory and a custom metric, taking the max - exactly what the real HPA does with multiple metric sources.
  4. Then read the real HPA config. A HorizontalPodAutoscaler object's behavior.scaleDown.stabilizationWindowSeconds, policies, metrics - every field maps to something you built. You'll configure it with understanding instead of cargo-culting.
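
For extension 3 (multiple metrics), the "take the max" rule can be sketched as follows. The metricReading type, the function names, and the sample numbers are all illustrative assumptions; the real HPA reads its metrics from the metrics APIs rather than from a slice.

```go
package main

import (
	"fmt"
	"math"
)

// metricReading pairs a current value with its target, in any unit,
// as long as both use the same unit. Hypothetical type for illustration.
type metricReading struct {
	name    string
	current float64
	target  float64
}

// desiredFor applies the proportional formula to one metric.
func desiredFor(currentReplicas int32, m metricReading) int32 {
	return int32(math.Ceil(float64(currentReplicas) * m.current / m.target))
}

// desiredAcross computes a recommendation per metric and keeps the largest,
// so the most demanding metric wins. With no readings it holds steady.
func desiredAcross(currentReplicas int32, readings []metricReading) int32 {
	if len(readings) == 0 {
		return currentReplicas
	}
	best := desiredFor(currentReplicas, readings[0])
	for _, m := range readings[1:] {
		if d := desiredFor(currentReplicas, m); d > best {
			best = d
		}
	}
	return best
}

func main() {
	readings := []metricReading{
		{name: "cpu %", current: 60, target: 50},
		{name: "memory %", current: 40, target: 80},
		{name: "requests/s per pod", current: 300, target: 100},
	}
	for _, m := range readings {
		fmt.Printf("%s alone -> %d replicas\n", m.name, desiredFor(4, m))
	}
	fmt.Println("combined ->", desiredAcross(4, readings))
}
```

Taking the max means no single metric can be starved: memory would be content with 2 replicas here, but the request rate asks for 12, so 12 wins before clamping.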

What you might wonder

"Why scale up fast but down slow?" Asymmetry by design. Under-provisioning hurts immediately (dropped requests, latency), so scale up fast to protect availability. Over-provisioning just costs a little money, and load is often bursty, so scale down slowly to avoid flapping and to absorb brief dips. The HPA defaults encode this: 0s stabilization for up, 300s for down. Your stabilization window implemented exactly this.

"HPA vs VPA vs Cluster Autoscaler - how do they relate?" HPA (this workshop) scales replica count horizontally on load. VPA (Vertical Pod Autoscaler) adjusts each Pod's CPU/memory requests (right-sizing one Pod, not adding Pods) - and conflicts with HPA on the same metric, so they're used on different axes. Cluster Autoscaler scales the nodes - when Pods can't be scheduled (the scheduler workshop's Pending), it adds nodes; when nodes are underused, it removes them. Three autoscalers, three layers: Pods-count (HPA), Pod-size (VPA), node-count (CA). They compose.

"Why does the HPA need resource requests?" Utilization is usage / request. With no request, there's no denominator - no percentage to target. This is the #1 "my HPA isn't scaling" cause: the target Deployment has no CPU request, so the HPA can't compute utilization and does nothing. Always set requests on autoscaled workloads.

"Is the formula really that simple?" The core is ceil(current * currentMetric / targetMetric), yes. The real HPA adds: tolerance (don't act on tiny deviations), stabilization windows (anti-flap), configurable scaling policies (max % or pods per interval), multiple metrics (take the max), and special handling for unready/missing-metric pods. But the proportional-control heart is exactly what you built - all the rest is refinement around it.

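Of the refinements listed above, scaling policies are the one this workshop does not build. A rough sketch of the idea, capping how far one sync may scale up, is below. limitScaleUp is a hypothetical function; the 4-pod and 100% figures are example values chosen here, so check the HPA documentation for the actual defaults, and note that real policies are evaluated over a time period, which this sketch ignores.

```go
package main

import (
	"fmt"
	"math"
)

// limitScaleUp caps a scale-up so that a single sync adds at most maxPods
// replicas or maxPercent percent of current, whichever allows more.
// Scale-downs and no-ops pass through untouched.
func limitScaleUp(current, desired, maxPods int32, maxPercent float64) int32 {
	if desired <= current {
		return desired
	}
	byPods := current + maxPods
	byPercent := current + int32(math.Ceil(float64(current)*maxPercent/100))
	limit := byPods
	if byPercent > limit {
		limit = byPercent
	}
	if desired > limit {
		return limit
	}
	return desired
}

func main() {
	fmt.Println(limitScaleUp(1, 10, 4, 100)) // 5: the pod limit (1+4) is the larger allowance
	fmt.Println(limitScaleUp(8, 30, 4, 100)) // 16: the percent limit (8+8) is the larger allowance
	fmt.Println(limitScaleUp(3, 4, 4, 100))  // 4: already within the limit
}
```

A cap like this trades reaction speed for protection against a single bad metric reading multiplying the replica count.
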
"When KEDA instead of HPA?" HPA scales on CPU/memory (and custom metrics with setup). KEDA extends autoscaling to event sources - Kafka topic lag, RabbitMQ queue depth, cloud queue length, cron schedules - and crucially supports scale-to-zero. Use KEDA when you scale on "how much work is queued" rather than "how busy are the current pods," or when you want zero replicas at idle. KEDA actually drives an HPA under the hood for the scaling mechanics.

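Scaling on "how much work is queued" changes the input but not the spirit of the formula. The sketch below is not KEDA's code; replicasForQueue and its parameters are hypothetical, and it only illustrates sizing a worker pool from backlog, including dropping to zero when the queue is empty.

```go
package main

import "fmt"

// replicasForQueue sizes a worker pool from backlog instead of CPU:
// ceil(queueLength / targetPerReplica), capped at maxReplicas, and zero
// when there is nothing queued. Hypothetical helper for illustration.
func replicasForQueue(queueLength, targetPerReplica, maxReplicas int32) int32 {
	if queueLength <= 0 || targetPerReplica <= 0 {
		return 0 // empty queue (or no valid target): scale to zero
	}
	desired := (queueLength + targetPerReplica - 1) / targetPerReplica // integer ceil
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	for _, backlog := range []int32{0, 1, 95, 1000} {
		fmt.Printf("backlog %d -> %d replicas\n", backlog, replicasForQueue(backlog, 10, 20))
	}
}
```

Unlike the CPU formula, this one does not multiply by the current replica count, so it can compute a sensible answer even when zero pods are running.
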
What this gave you

  • You built a horizontal autoscaler: read a metric, apply desired = ceil(current * metric/target), clamp, scale via the scale subresource.
  • You watched it scale out under load and back in when load dropped - closed-loop control to a setpoint.
  • You watched it flap without anti-flap guards, then fixed it with tolerance + a scale-down stabilization window - and understand why the real HPA scales up fast and down slow.
  • You know why autoscaled workloads need resource requests (utilization = usage/request).
  • You can place HPA vs VPA vs Cluster Autoscaler (Pod-count / Pod-size / node-count) and know when KEDA fits.
  • Every field of a real HorizontalPodAutoscaler now maps to something you implemented.

Next, the capstone of the workshop series: bootstrap a Kubernetes control-plane component by hand - the ultimate proof that the control plane is "just" processes and etcd.

Back to the Platform & Day-2 month.

Submit your build

When you finish this workshop, share what you built so others can see and learn from your work. Include:

  • Public repo with your autoscaler code
  • Load-test screenshot showing scale-up under traffic
  • Demo of the flapping problem and your stabilization-window fix
  • Note on why HPA scales up fast and down slow
