Week 21 - ML on Kubernetes: KServe, KubeRay, Volcano, GPU Operators
21.1 Conceptual Core
- Kubernetes is the dominant control plane for ML workloads in production. Three dimensions:
  - Training orchestration: schedule large multi-GPU jobs, handle preemption, gang-schedule (all-or-nothing). Tools: Volcano, KubeRay, Kueue, JobSet.
  - Inference serving: model deployment, autoscaling, traffic routing, A/B testing. Tools: KServe, Seldon Core, vLLM-on-K8s, Ray Serve, Triton Inference Server.
  - GPU resource management: driver installation, device plugins, MIG partitioning, time-slicing. Tools: NVIDIA GPU Operator, AMD GPU Operator.
21.2 Mechanical Detail
- NVIDIA GPU Operator: deploys the driver, container toolkit, device plugin, DCGM metrics exporter, and MIG manager, all as DaemonSets. A pod requests `nvidia.com/gpu: 1` and the device plugin allocates a GPU (see the pod sketch after this list).
- MIG (Multi-Instance GPU): A100/H100 hardware partitioning. One A100 → up to 7 isolated GPU slices. Useful for packing many small workloads onto big GPUs; not for training. Configured via the GPU Operator's MIG manager.
- Volcano / Kueue: gang scheduling, i.e. a 64-GPU job won't start until 64 GPUs are simultaneously available. The K8s default scheduler places pods one at a time, so it will partial-schedule two large jobs and deadlock them (a Volcano job sketch appears in the lab below).
- KubeRay: operator for Ray clusters. Ray is the de facto standard for distributed Python compute (used heavily by AI labs for data preprocessing, RLHF rollouts, hyperparameter sweeps); a RayCluster sketch follows this list.
- KServe + vLLM: the canonical inference stack. The KServe `InferenceService` CRD wraps vLLM (or Triton, or TGI) with autoscaling, canary rollouts, and transformer pre/post-processing (see the sketch after this list).
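A minimal sketch of a pod that exercises the device plugin; the pod name and image tag are illustrative, and note that for extended resources like `nvidia.com/gpu` the limit is what counts (requests must equal limits):

```yaml
# Minimal GPU smoke test: the device plugin injects the GPU device
# and driver mounts into the container at admission time.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-enabled image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1     # on a MIG-enabled node you would instead request
                              # a slice, e.g. nvidia.com/mig-1g.5gb: 1
```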
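A `RayCluster` sketch, assuming the KubeRay operator is installed; the cluster name, group name, replica bounds, and image tags are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-demo              # illustrative name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers    # illustrative group name
    replicas: 2
    minReplicas: 0
    maxReplicas: 4            # operator scales workers within these bounds
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
```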
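A hedged `InferenceService` sketch; the exact schema depends on your KServe version and which serving runtime backs vLLM, so treat the model format, storage URI, and name as assumptions:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-7b                        # illustrative name
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4                    # KServe autoscales replicas in this range
    canaryTrafficPercent: 10          # send 10% of traffic to the newest revision
    model:
      modelFormat:
        name: huggingface             # assumption: vLLM-backed HuggingFace runtime
      storageUri: hf://org/model-7b   # illustrative model reference
      resources:
        limits:
          nvidia.com/gpu: 1
```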
21.3 Lab - "Train and Serve on K8s"
- Bring up a small GPU-enabled cluster (kind+nvidia, or a 2-node cloud cluster with 1-2 GPUs each).
- Install the GPU Operator. Verify `kubectl describe node` shows `nvidia.com/gpu: N` under allocatable resources.
- Install Volcano. Submit a 4-GPU gang-scheduled training job (a small FSDP run from week 14); a job manifest sketch follows this list.
- Install KServe + vLLM runtime. Deploy a 7B model. Hit it with a load test. Demonstrate autoscaling.
- Document the YAML for each in a deployable repo.
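A sketch of the gang-scheduled training job, assuming Volcano's CRDs are installed; the job name, queue, image, and entrypoint are placeholders for your FSDP run:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fsdp-train             # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 4              # gang scheduling: all 4 workers start or none do
  queue: default
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: registry.example.com/fsdp-train:latest        # placeholder image
          command: ["torchrun", "--nproc_per_node=1", "train.py"]  # placeholder entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1
```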
21.4 Idiomatic & Diagnostic Drill
- DCGM metrics (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED`, `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`) are exported to Prometheus by the GPU Operator's DCGM exporter. Read them; a recording-rule sketch follows.
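The metric names above come straight from dcgm-exporter; here is a sketch of recording rules over them, assuming the Prometheus Operator's `PrometheusRule` CRD is available (rule names and the grouping label are illustrative, since the exporter's label set depends on its config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-rules        # illustrative name
spec:
  groups:
  - name: gpu.rules
    rules:
    # Mean SM utilization per node, 0-100.
    - record: node:gpu_util:avg
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)   # label depends on exporter config
    # Framebuffer memory in use per GPU, in MiB.
    - record: gpu:fb_used_mib
      expr: DCGM_FI_DEV_FB_USED
    # Fraction of cycles the tensor pipes are active; a low value on a
    # "busy" GPU usually means the job is memory- or input-bound.
    - record: gpu:tensor_active:ratio
      expr: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
```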
21.5 Production Slice
- A real production GPU fleet has cost, capacity, utilization, and reliability dashboards. Build a Grafana dashboard covering all four and bookmark it; a hedged query sketch follows.
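One way to seed the four dashboards: a recording-rule sketch where the kube-state-metrics series, the sanitized resource label, and the flat $/GPU-hour constant are all assumptions you would replace with your own label scheme and pricing:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-fleet-rules       # illustrative name
spec:
  groups:
  - name: gpu-fleet.rules
    rules:
    # Utilization: fleet-wide mean SM utilization.
    - record: fleet:gpu_util:avg
      expr: avg(DCGM_FI_DEV_GPU_UTIL)
    # Capacity: allocatable GPUs across nodes (assumed kube-state-metrics series).
    - record: fleet:gpu_capacity:total
      expr: sum(kube_node_status_allocatable{resource="nvidia_com_gpu"})
    # Cost: assumed flat $2.50/GPU-hour, burned whether or not GPUs are busy.
    - record: fleet:gpu_cost_per_hour:total
      expr: sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) * 2.50
    # Reliability: GPUs currently reporting an XID error (driver/hardware fault;
    # the DCGM metric holds the last XID code seen).
    - record: fleet:gpu_xid_errors:count
      expr: count(DCGM_FI_DEV_XID_ERRORS > 0) or vector(0)
```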