Workshop - Build an operator with a Custom Resource Definition

Difficulty: Deep | Time: 90 min

Needs: Linux or macOS, Go 1.21+, Docker, kind or k3d, kubebuilder

Before you start:

Built the controller from scratch
Understand the difference between core resources and Custom Resources
Familiar with Deployments, Services, and ConfigMaps as concepts

Launch in Killercoda: a free browser-based environment - no install required to follow along.

Companion to Kubernetes -> Month 03 -> Weeks 10-12: controller-runtime, CRDs, Operator Patterns. The chapters explain that you can extend the Kubernetes API with your own resource types. This workshop has you do it - define a brand-new kind of Kubernetes object (kubectl get websites!) and build the operator that gives it meaning. By the end you'll have extended Kubernetes itself, and you'll understand why "Kubernetes is a platform for building platforms" is literally true.

~120 minutes. Needs: kind/k3d, Go 1.21+, kubectl, and kubebuilder (brew install kubebuilder or the official install script). Prerequisite: the controller-from-scratch workshop - this builds directly on the reconcile loop you learned there.

What you'll build, and the idea it makes concrete

You'll define a custom resource called Website - a high-level object where a user declares "I want a website serving this HTML at this replica count" - and an operator that reconciles each Website into a real Deployment + Service + ConfigMap. So a user writes 7 lines of YAML:

apiVersion: web.workshop.io/v1
kind: Website
metadata:
  name: hello
spec:
  replicas: 2
  html: "<h1>Hello from my operator</h1>"

...and your operator creates and manages all the Kubernetes plumbing behind it. The idea this makes concrete:

Kubernetes is extensible at the API level. A Custom Resource Definition (CRD) teaches the apiserver a new noun; a controller gives that noun behavior. Together they're an operator - and from the user's side, your custom type is indistinguishable from a built-in like Deployment. kubectl get, kubectl describe, RBAC, kubectl apply, watches - all work on your type for free. This is how cert-manager, Prometheus Operator, Crossplane, and every database-on-Kubernetes operator work: they add nouns and the controllers that fulfill them.

The pilot workshop showed you the reconcile loop on a built-in type (ConfigMap). This one shows you defining your own type and reconciling it - the leap from "I can write a controller" to "I can extend Kubernetes."

Step 0: cluster + scaffold

$ kind create cluster --name operator-workshop
$ mkdir website-operator && cd website-operator
$ kubebuilder init --domain workshop.io --repo workshop.io/website-operator

kubebuilder init scaffolds a full operator project: a main.go that wires up the manager, a Makefile with all the build/deploy targets, the controller-runtime dependencies, and the manifest generators. controller-runtime is the framework that replaces the raw client-go boilerplate you wrote in the pilot - now that you know what it hides, you'll appreciate it.

Step 1: define the API - create your new resource type

$ kubebuilder create api --group web --version v1 --kind Website
# answer 'y' to "Create Resource" and 'y' to "Create Controller"

This generates api/v1/website_types.go - the Go struct that is your new resource. Edit it to define the spec (what the user declares) and status (what the operator reports back):

// WebsiteSpec is the DESIRED state - what the user asks for.
type WebsiteSpec struct {
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    Replicas int32 `json:"replicas"`           // how many copies to run

    // +kubebuilder:validation:Required
    HTML string `json:"html"`                  // the page content to serve
}

// WebsiteStatus is the OBSERVED state - what the operator reports.
type WebsiteStatus struct {
    ReadyReplicas int32  `json:"readyReplicas"` // how many are actually ready
    URL           string `json:"url"`           // where to reach it
}

Those +kubebuilder:validation markers are more than comments: controller-gen reads them and generates the OpenAPI schema that the apiserver enforces. Declare that replicas must be 1-10, and the apiserver itself rejects replicas: 50 before your controller ever sees it. You get validation for free, at the API layer, just by annotating a struct field.
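
To make the marker semantics concrete, here is a minimal stdlib-only sketch of the checks that the generated schema asks the apiserver to run. It is not how the apiserver is implemented (that is OpenAPI schema validation); validateSpec and the pointer-typed HTML field are assumptions made for illustration, so the sketch can tell "absent" from "empty".

```go
package main

import (
	"errors"
	"fmt"
)

// WebsiteSpec mirrors the workshop's spec. HTML is a pointer here (an
// assumption for this sketch) so a missing field differs from an empty one.
type WebsiteSpec struct {
	Replicas int32
	HTML     *string
}

// validateSpec is a hypothetical stand-in for the schema checks generated
// from the Minimum=1, Maximum=10, and Required markers.
func validateSpec(s WebsiteSpec) error {
	var errs []error
	if s.Replicas < 1 {
		errs = append(errs, fmt.Errorf("spec.replicas: %d is less than minimum 1", s.Replicas))
	}
	if s.Replicas > 10 {
		errs = append(errs, fmt.Errorf("spec.replicas: %d is greater than maximum 10", s.Replicas))
	}
	if s.HTML == nil {
		errs = append(errs, errors.New("spec.html: required value is missing"))
	}
	return errors.Join(errs...) // nil when every check passed
}

func main() {
	page := "<h1>Hello from my operator</h1>"
	fmt.Println(validateSpec(WebsiteSpec{Replicas: 2, HTML: &page}))  // accepted
	fmt.Println(validateSpec(WebsiteSpec{Replicas: 50, HTML: &page})) // rejected: above maximum
	fmt.Println(validateSpec(WebsiteSpec{Replicas: 2}))               // rejected: html missing
}
```

Note that Required only demands that the key is present: html: "" still passes unless you also add a MinLength marker.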

The Spec/Status split is the universal Kubernetes object shape: spec = desired (user writes it), status = observed (controller writes it). Every built-in works this way; now yours does too.
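
The split is enforced by the API as well as by convention: when a type has the status subresource enabled (the kubebuilder scaffold turns it on), spec and status are written through different endpoints. The sketch below models that rule with plain structs; updateMain and updateStatus are hypothetical stand-ins for the two endpoints, not real client calls.

```go
package main

import "fmt"

type spec struct {
	Replicas int32
	HTML     string
}

type status struct {
	ReadyReplicas int32
	URL           string
}

type website struct {
	Spec   spec
	Status status
}

// updateMain mimics a write to the main resource endpoint when the status
// subresource is enabled: spec comes from the request, status is kept.
func updateMain(stored, req website) website {
	stored.Spec = req.Spec
	return stored
}

// updateStatus mimics a write to /status: only status comes from the request.
func updateStatus(stored, req website) website {
	stored.Status = req.Status
	return stored
}

func main() {
	stored := website{Spec: spec{Replicas: 2, HTML: "<h1>Hello</h1>"}}

	// The controller's copy carries an observed status.
	fromController := stored
	fromController.Status = status{ReadyReplicas: 2, URL: "http://hello.default.svc.cluster.local"}

	// Sent through the main endpoint, the status is silently dropped.
	fmt.Println(updateMain(stored, fromController).Status.ReadyReplicas) // 0
	// Sent through /status, it is stored and the spec stays untouched.
	fmt.Println(updateStatus(stored, fromController).Status.ReadyReplicas) // 2
}
```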

Step 2: generate and install the CRD - teach the apiserver your noun

$ make manifests        # generates the CRD YAML from your Go markers
$ make install          # installs the CRD into the cluster

That second command is the moment your cluster learns a new word. Verify:

$ kubectl get crds | grep website
websites.web.workshop.io   2026-05-23T...

$ kubectl api-resources | grep website
websites   web.workshop.io/v1   true   Website

Kubernetes now knows what a Website is. Before you've written a line of controller logic, kubectl already works on it:

$ kubectl get websites
No resources found in default namespace.    # the apiserver answers - it knows the type

You taught the apiserver a new noun. kubectl get/describe/apply/delete, RBAC, watches - all already work on Website, because the CRD made it a first-class API citizen. That's the "for free" the intro promised, made literal.

Step 3: the reconcile function - give the noun meaning

Open internal/controller/website_controller.go. controller-runtime hands you a Reconcile method - the same loop from the pilot, minus the boilerplate (no manual informer/workqueue; the framework runs them). Your job is purely: given a Website, make the cluster match it.

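// Imports used by Reconcile below. The scaffold already includes several of
// these; the webv1 path follows from the --repo value passed to kubebuilder init.
import (
    "context"
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    webv1 "workshop.io/website-operator/api/v1"
)

func (r *WebsiteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {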
    log := log.FromContext(ctx)

    // 1. Fetch the Website (desired state). Gone? Owned children are GC'd.
    var site webv1.Website
    if err := r.Get(ctx, req.NamespacedName, &site); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Ensure the ConfigMap holding the HTML. Every child also gets a
    //    site=<name> label so `kubectl get ... -l site=<name>` can find it.
    cm := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{Name: site.Name + "-html", Namespace: site.Namespace},
    }
    _, err := ctrl.CreateOrUpdate(ctx, r.Client, cm, func() error {
        cm.Labels = map[string]string{"site": site.Name}
        cm.Data = map[string]string{"index.html": site.Spec.HTML}
        return ctrl.SetControllerReference(&site, cm, r.Scheme)  // owner ref -> auto GC
    })
    if err != nil {
        return ctrl.Result{}, err
    }

    // 3. Ensure the Deployment (nginx serving the HTML, at the requested replicas).
    deploy := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace},
    }
    _, err = ctrl.CreateOrUpdate(ctx, r.Client, deploy, func() error {
        deploy.Labels = map[string]string{"site": site.Name}
        deploy.Spec.Replicas = &site.Spec.Replicas       // desired replicas from the spec
        deploy.Spec.Selector = &metav1.LabelSelector{MatchLabels: map[string]string{"site": site.Name}}
        deploy.Spec.Template = corev1.PodTemplateSpec{
            ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"site": site.Name}},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "nginx",
                    Image: "nginx:1.27-alpine",
                    VolumeMounts: []corev1.VolumeMount{{
                        Name: "html", MountPath: "/usr/share/nginx/html",
                    }},
                }},
                Volumes: []corev1.Volume{{
                    Name: "html",
                    VolumeSource: corev1.VolumeSource{ConfigMap: &corev1.ConfigMapVolumeSource{
                        LocalObjectReference: corev1.LocalObjectReference{Name: cm.Name},
                    }},
                }},
            },
        }
        return ctrl.SetControllerReference(&site, deploy, r.Scheme)
    })
    if err != nil {
        return ctrl.Result{}, err
    }

    // 4. Ensure the Service.
    svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{Name: site.Name, Namespace: site.Namespace}}
    _, err = ctrl.CreateOrUpdate(ctx, r.Client, svc, func() error {
        svc.Labels = map[string]string{"site": site.Name}
        svc.Spec.Selector = map[string]string{"site": site.Name}
        svc.Spec.Ports = []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(80)}}
        return ctrl.SetControllerReference(&site, svc, r.Scheme)
    })
    if err != nil {
        return ctrl.Result{}, err
    }

    // 5. Update STATUS - report observed state back to the user.
    site.Status.ReadyReplicas = deploy.Status.ReadyReplicas
    site.Status.URL = fmt.Sprintf("http://%s.%s.svc.cluster.local", svc.Name, svc.Namespace)
    if err := r.Status().Update(ctx, &site); err != nil {
        return ctrl.Result{}, err
    }

    log.Info("reconciled", "website", site.Name, "replicas", site.Spec.Replicas)
    return ctrl.Result{}, nil
}

It's the same shape as the pilot - desired vs actual, make them match - but CreateOrUpdate collapses the "get, create-if-missing, update-if-drifted" into one call, and SetControllerReference wires the owner reference (so deleting the Website garbage-collects all three children). The framework runs the informer and workqueue you built by hand last time.
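
If the "one call" feels opaque, the following stdlib-only sketch spells out the get / create-if-missing / update-if-drifted logic against an in-memory map. It is a simplified model under stated assumptions, not controller-runtime's implementation: store, object, and createOrUpdate are hypothetical names, and the returned strings are only labels for the three outcomes.

```go
package main

import (
	"fmt"
	"reflect"
)

// object is a toy stand-in for a Kubernetes object.
type object struct {
	Name string
	Data map[string]string
}

// store is a toy stand-in for the apiserver, keyed by object name.
type store map[string]object

// clone copies an object so the store never shares maps with callers.
func clone(o object) object {
	c := object{Name: o.Name, Data: map[string]string{}}
	for k, v := range o.Data {
		c.Data[k] = v
	}
	return c
}

// createOrUpdate fetches the live object, applies mutate on top of it, then
// creates it if it was missing or updates it only if mutate changed something.
func createOrUpdate(s store, obj *object, mutate func()) string {
	live, found := s[obj.Name]
	if !found {
		mutate()
		s[obj.Name] = clone(*obj)
		return "created"
	}
	*obj = clone(live) // start from what the cluster currently holds
	mutate()           // stamp the desired fields on top
	if reflect.DeepEqual(live, *obj) {
		return "unchanged" // no drift, so no write
	}
	s[obj.Name] = clone(*obj)
	return "updated"
}

func main() {
	s := store{}
	reconcile := func(html string) string {
		cm := &object{Name: "hello-html"}
		return createOrUpdate(s, cm, func() {
			cm.Data = map[string]string{"index.html": html}
		})
	}
	fmt.Println(reconcile("<h1>Hello</h1>"))   // created
	fmt.Println(reconcile("<h1>Hello</h1>"))   // unchanged
	fmt.Println(reconcile("<h1>Updated</h1>")) // updated
}
```

The middle call is the important one: running reconcile again with nothing changed performs no write, which is what makes it safe for the framework to call Reconcile as often as it likes.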

One required wiring: tell the manager to also watch the children, so changes to them re-trigger reconcile (self-healing). In SetupWithManager:

func (r *WebsiteReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webv1.Website{}).        // primary: watch Websites
        Owns(&appsv1.Deployment{}).   // also watch Deployments we own -> reconcile owner
        Owns(&corev1.Service{}).      // same for Services
        Owns(&corev1.ConfigMap{}).    // and ConfigMaps
        Complete(r)
}

Owns(...) is the framework doing what you did manually in the pilot (mapping a companion's change back to its owner's key). Three lines instead of custom event-handler logic.
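
As a reminder of what that mapping involves, here is a minimal stdlib-only sketch: find the controlling owner reference on a changed child, turn it into the owner's namespace/name key, and push that key onto a deduplicating queue. The types and the ownerKey helper are hypothetical simplifications, not controller-runtime's own event handler.

```go
package main

import "fmt"

// ownerRef is a trimmed-down owner reference: just what the mapping needs.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// child is a toy stand-in for an owned Deployment, Service, or ConfigMap.
type child struct {
	Namespace string
	Name      string
	Owners    []ownerRef
}

// ownerKey returns the "namespace/name" key of the controlling owner of the
// given kind, or false when the child is not controlled by that kind.
func ownerKey(c child, ownerKind string) (string, bool) {
	for _, ref := range c.Owners {
		if ref.Controller && ref.Kind == ownerKind {
			return c.Namespace + "/" + ref.Name, true
		}
	}
	return "", false
}

// queue deduplicates keys, so a burst of child events yields one reconcile.
type queue struct {
	pending map[string]bool
	order   []string
}

func (q *queue) add(key string) {
	if q.pending == nil {
		q.pending = map[string]bool{}
	}
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

func main() {
	owned := []ownerRef{{Kind: "Website", Name: "hello", Controller: true}}
	events := []child{
		{Namespace: "default", Name: "hello", Owners: owned},      // Deployment changed
		{Namespace: "default", Name: "hello-html", Owners: owned}, // ConfigMap changed
		{Namespace: "default", Name: "unrelated"},                 // no owner: ignored
	}
	q := &queue{}
	for _, ev := range events {
		if key, ok := ownerKey(ev, "Website"); ok {
			q.add(key)
		}
	}
	fmt.Println(q.order) // [default/hello]
}
```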

Step 4: run the operator and create your first Website

$ make run        # runs the operator locally against your cluster

In another terminal, apply the 7-line YAML from the intro:

$ cat <<EOF | kubectl apply -f -
apiVersion: web.workshop.io/v1
kind: Website
metadata:
  name: hello
spec:
  replicas: 2
  html: "<h1>Hello from my operator</h1>"
EOF
website.web.workshop.io/hello created

Now watch your operator turn those 7 lines into running infrastructure:

$ kubectl get website,deployment,service,configmap -l site=hello
$ kubectl get all -l site=hello
NAME                         READY   STATUS    RESTARTS   AGE
pod/hello-7d9f8c-xk2p4       1/1     Running   0          8s
pod/hello-7d9f8c-m4nl9       1/1     Running   0          8s        <- 2 replicas, as requested
NAME            TYPE        CLUSTER-IP      PORT(S)   AGE
service/hello   ClusterIP   10.96.142.7     80/TCP    8s
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/hello   2/2     2            2           8s

Seven lines of YAML became a Deployment, a Service, a ConfigMap, and two running Pods - because your operator reconciled the Website into them. Confirm it actually serves your HTML:

$ kubectl port-forward service/hello 8080:80 &
$ curl localhost:8080
<h1>Hello from my operator</h1>          # your declared HTML, served

And the status you wrote flows back to the user:

$ kubectl get website hello -o jsonpath='{.status}{"\n"}'
{"readyReplicas":2,"url":"http://hello.default.svc.cluster.local"}

Step 5: watch it reconcile - the operator earns its name

This is the payoff - your custom resource behaves exactly like a built-in. Change the spec, watch the operator drive the change:

$ kubectl patch website hello --type merge -p '{"spec":{"replicas":4}}'
$ kubectl get pods -l site=hello -w
# watch 2 more pods appear - the operator scaled the Deployment because the Website spec changed

Change the HTML, watch it propagate:

$ kubectl patch website hello --type merge -p '{"spec":{"html":"<h1>Updated live</h1>"}}'
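# operator updates the ConfigMap; the kubelet refreshes the mounted file after its sync delay,
# so curl shows the new page a little later (a fuller version would roll the pods for an immediate update)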
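
If you want the pods rolled as soon as the HTML changes, one common approach is to stamp a digest of the content onto the pod template, so a content change becomes a template change and the Deployment performs a rolling update. A minimal sketch, assuming a hypothetical annotation key (web.workshop.io/html-hash) and helper name (podAnnotations):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// htmlHashAnnotation is a hypothetical annotation key chosen for this sketch.
const htmlHashAnnotation = "web.workshop.io/html-hash"

// podAnnotations returns pod-template annotations carrying a SHA-256 digest
// of the HTML. Different HTML yields a different digest, and therefore a
// different pod template.
func podAnnotations(html string) map[string]string {
	sum := sha256.Sum256([]byte(html))
	return map[string]string{htmlHashAnnotation: hex.EncodeToString(sum[:])}
}

func main() {
	before := podAnnotations("<h1>Hello from my operator</h1>")
	after := podAnnotations("<h1>Updated live</h1>")
	fmt.Println(before[htmlHashAnnotation] != after[htmlHashAnnotation]) // true
}
```

In the workshop's Reconcile you would assign the result to the pod template's annotations inside the Deployment's mutate function, next to where the template labels are set.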

And the self-heal moment, now on your own resource type:

$ kubectl delete deployment hello
deployment.apps "hello" deleted
$ kubectl get deployment hello
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
hello   4/4     4            4           2s              <- your operator recreated it

You deleted the Deployment; your operator recreated it, because the Website says it should exist. The owner-reference cleanup works too - delete the Website and everything it created vanishes:

$ kubectl delete website hello
website.web.workshop.io "hello" deleted
$ kubectl get all -l site=hello
No resources found in default namespace.    # Deployment, Service, ConfigMap, Pods - all GC'd

One kubectl delete website tore down the Deployment, the Service, the ConfigMap, and the Pods behind them, because they're all owned (directly, or through the Deployment) by the Website. You built a self-healing, self-cleaning, high-level abstraction that the user drives with 7 lines of YAML - which is exactly what cert-manager, the Prometheus Operator, and every database operator do.
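
The cascade is easier to picture as a walk over owner links. The sketch below is a toy model, not the real garbage collector (which runs asynchronously inside the control plane): cluster maps each object to its owner, and deleteCascade is a hypothetical helper that removes an object and everything that transitively depends on it.

```go
package main

import (
	"fmt"
	"sort"
)

// cluster maps each object to the owner it depends on ("" means no owner).
type cluster map[string]string

// deleteCascade removes the named object and then, transitively, every object
// whose owner has been removed. It returns the removed names, sorted.
func deleteCascade(c cluster, name string) []string {
	removed := []string{}
	pending := []string{name}
	for len(pending) > 0 {
		cur := pending[0]
		pending = pending[1:]
		if _, ok := c[cur]; !ok {
			continue // already gone
		}
		delete(c, cur)
		removed = append(removed, cur)
		for obj, owner := range c {
			if owner == cur {
				pending = append(pending, obj) // dependents go next
			}
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	c := cluster{
		"website/hello":           "",
		"configmap/hello-html":    "website/hello",
		"service/hello":           "website/hello",
		"deployment/hello":        "website/hello",
		"replicaset/hello-7d9f8c": "deployment/hello",
		"pod/hello-7d9f8c-xk2p4":  "replicaset/hello-7d9f8c",
		"service/kubernetes":      "", // unrelated: must survive
	}
	fmt.Println(deleteCascade(c, "website/hello"))
	fmt.Println(len(c)) // 1: only service/kubernetes is left
}
```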

Step 6: finalizers - cleanup that owner references can't do

Owner references handle in-cluster cleanup. But operators often manage external resources (a cloud DNS record, an S3 bucket, a SaaS account) that Kubernetes garbage collection knows nothing about. Finalizers solve this: they block deletion until your operator does external cleanup first.

The pattern (Week 12's core): add a finalizer string to the object; when the object is deleted, Kubernetes sets a deletionTimestamp but keeps the object until your finalizer is removed - giving your reconcile a chance to clean up externally, then remove the finalizer to let deletion complete.

const finalizer = "web.workshop.io/cleanup"

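// In Reconcile, near the top (right after fetching the Website).
// Needs: import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"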
if site.DeletionTimestamp.IsZero() {
    // Not being deleted: ensure our finalizer is present.
    if !controllerutil.ContainsFinalizer(&site, finalizer) {
        controllerutil.AddFinalizer(&site, finalizer)
        return ctrl.Result{}, r.Update(ctx, &site)
    }
} else {
    // Being deleted: do external cleanup, THEN remove the finalizer.
    if controllerutil.ContainsFinalizer(&site, finalizer) {
        // ... e.g. delete the external DNS record for site.Status.URL ...
        log.Info("external cleanup done", "website", site.Name)
        controllerutil.RemoveFinalizer(&site, finalizer)
        return ctrl.Result{}, r.Update(ctx, &site)
    }
    return ctrl.Result{}, nil // finalizer gone: Kubernetes completes deletion
}

This is why some objects "hang" in Terminating - a finalizer is blocking deletion until its controller finishes (or is stuck). Now you know what Terminating means and how to build deliberate teardown. Finalizers are the operator's hook for "delete this safely, including the things Kubernetes can't see."
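
To see the two-phase delete in isolation, here is a small stdlib-only model of the rule the apiserver applies. It is a sketch under simplifying assumptions (an in-memory map, hypothetical requestDelete and removeFinalizer methods), not Kubernetes code.

```go
package main

import (
	"fmt"
	"time"
)

// object is a toy resource carrying only the fields finalizers care about.
type object struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time
}

// apiserver is a toy object store keyed by name.
type apiserver map[string]*object

// requestDelete mimics a delete call: with finalizers present the object is
// only marked for deletion; with none it is removed immediately.
func (a apiserver) requestDelete(name string) {
	obj, ok := a[name]
	if !ok {
		return
	}
	if len(obj.Finalizers) == 0 {
		delete(a, name)
		return
	}
	if obj.DeletionTimestamp == nil {
		now := time.Now()
		obj.DeletionTimestamp = &now
	}
}

// removeFinalizer drops one finalizer; once none remain on an object that is
// marked for deletion, the object is actually removed.
func (a apiserver) removeFinalizer(name, finalizer string) {
	obj, ok := a[name]
	if !ok {
		return
	}
	kept := []string{}
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	if obj.DeletionTimestamp != nil && len(obj.Finalizers) == 0 {
		delete(a, name)
	}
}

func main() {
	api := apiserver{"hello": {Name: "hello", Finalizers: []string{"web.workshop.io/cleanup"}}}

	api.requestDelete("hello")
	_, stillThere := api["hello"]
	fmt.Println(stillThere) // true: the object sits in "Terminating"

	api.removeFinalizer("hello", "web.workshop.io/cleanup")
	_, stillThere = api["hello"]
	fmt.Println(stillThere) // false: deletion completed
}
```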

Step 7: ship it (deploy the operator into the cluster)

make run ran the operator on your laptop. To run it in the cluster like a real operator:

$ make docker-build docker-push IMG=<your-registry>/website-operator:v0.1
$ make deploy IMG=<your-registry>/website-operator:v0.1

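make deploy installs the CRD, the operator Deployment, a ServiceAccount, and the RBAC that kubebuilder generates from the +kubebuilder:rbac markers in your controller. The scaffold only writes markers for Websites themselves, so before deploying, add markers above Reconcile that grant get, list, watch, create, update, patch, and delete on Deployments (group apps) and on Services and ConfigMaps (core group), then re-run make manifests. make run never needed this because it used your own kubeconfig credentials; without the markers, the in-cluster operator is denied when it tries to create the children. With that in place the operator runs as a Pod, reconciling Websites cluster-wide, with no laptop involved. That's a shippable operator.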

Now extend it

Printer columns. Add +kubebuilder:printcolumn markers so kubectl get websites shows REPLICAS, READY, URL in the table - the polish real operators have.
Conditions. Replace the flat status with standard metav1.Conditions (Ready, Progressing) - the convention every mature operator follows, and what kubectl wait --for=condition=Ready keys on.
Webhooks. Add a defaulting/validation webhook (the next workshop) for logic the OpenAPI schema can't express ("html must contain an <h1>").
Multiple versions. Add a v2 with a conversion webhook - how operators evolve their API without breaking existing resources (Week 11).
Owned-resource health in status. Watch the Deployment's real readiness and reflect it, so Website status is trustworthy.

What you might wonder

"CRD + controller = operator - is that the whole definition?" Essentially yes. An operator is a custom resource (the noun, via a CRD) plus a controller that reconciles it (the behavior). The term often implies operational domain knowledge encoded in the controller (how to back up this database, how to fail over this cluster) - but structurally it's CRD + controller. You just built one.

"controller-runtime hid the informer/workqueue I built in the pilot. Is that bad?" No - it's the point. You built them by hand once to understand them; now the framework runs them correctly so you focus on reconcile logic. For() sets up the primary informer+workqueue, Owns() sets up the child watches and owner-mapping, the manager runs the loop. Knowing what's underneath (from the pilot) means controller-runtime is convenience, not magic.

"Why update Status separately from Spec?" Spec is the user's; Status is yours. They're often separate API subresources so that updating status doesn't conflict with a user editing spec (separate resourceVersion tracking), and so RBAC can grant status-write without spec-write. r.Status().Update() hits the /status subresource specifically. Mixing them causes update conflicts - a real operator bug.

"How do real operators handle upgrades to the CRD schema?" Versioned APIs (v1, v1beta1) with conversion webhooks that translate between versions on the fly, so old resources keep working when you ship a new schema. It's Week 11 and the "multiple versions" extension above. This is how, e.g., cert-manager moved from v1alpha2 to v1 without breaking anyone's Certificates.

"Is writing an operator usually the right call?" Often not - if a Helm chart or plain manifests suffice, use those. Operators earn their complexity when you need ongoing operational logic: reconciling drift, automating backups/failover/upgrades, managing external resources, encoding domain expertise. "Does this need a control loop, or just a one-time install?" is the deciding question. You now can build one when the answer is "control loop."

What this gave you

You extended the Kubernetes API: defined a Website CRD and watched kubectl treat it as a first-class type.
You got apiserver-enforced validation for free from struct markers.
You built the operator that reconciles Website into a Deployment + Service + ConfigMap, with controller-runtime running the loop you built by hand in the pilot.
You watched it reconcile spec changes, self-heal a deleted child, and GC everything on delete via owner references.
You implemented finalizers for external cleanup - and now know what Terminating means.
You can ship the operator into the cluster with generated RBAC.
You understand that cert-manager, Prometheus Operator, Crossplane, and every operator are this: a CRD plus a reconciling controller.

Next: intercept the API itself - build an admission webhook that validates and mutates objects as they're created, before they're ever stored.


Submit your build

When you finish this workshop, share what you built so others can see and learn from your work. Include:

Public repo with your operator code and the Website CRD manifest
Output of `kubectl get website,deploy,svc,cm` showing the resources your operator reconciled into existence
Demonstration that deleting the Website resource cleans up the children (finalizer + owner refs)
