Week 21 - ML on Kubernetes: KServe, KubeRay, Volcano, GPU Operators
21.1 Conceptual Core
- Kubernetes is the dominant control plane for ML workloads in production. Three dimensions:
  - Training orchestration: schedule large multi-GPU jobs, handle preemption, gang-schedule (all-or-nothing). Tools: Volcano, KubeRay, Kueue, JobSet.
  - Inference serving: model deployment, autoscaling, traffic routing, A/B testing. Tools: KServe, Seldon Core, vLLM-on-K8s, Ray Serve, Triton Inference Server.
  - GPU resource management: driver installation, device plugins, MIG partitioning, time-slicing. Tools: NVIDIA GPU Operator, AMD GPU Operator.
21.2 Mechanical Detail
- NVIDIA GPU Operator: deploys the driver, container toolkit, device plugin, DCGM metrics exporter, and MIG manager, all as DaemonSets. A pod requests `nvidia.com/gpu: 1` and the device plugin allocates a GPU (see the pod sketch after this list).
- MIG (Multi-Instance GPU): A100/H100 hardware partitioning. One A100 → up to 7 isolated GPU slices. Useful for packing many small workloads onto big GPUs; not for training. Configured via the GPU Operator's MIG manager.
- Volcano / Kueue: gang scheduling, i.e. a 64-GPU job won't start until 64 GPUs are simultaneously available. The K8s default scheduler places pods one at a time, so it will partial-schedule two large jobs and deadlock them (a Volcano job sketch appears in the lab below).
- KubeRay: operator for Ray clusters. Ray is the de facto standard for distributed Python compute (used heavily by AI labs for data preprocessing, RLHF rollouts, hyperparameter sweeps); a RayCluster sketch follows this list.
- KServe + vLLM: the canonical inference stack. The KServe `InferenceService` CRD wraps vLLM (or Triton, or TGI) with autoscaling, canary rollouts, and transformer pre/post-processing (see the sketch after this list).
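A minimal sketch of a pod that exercises the device plugin; the pod name and image tag are illustrative, and note that for extended resources like `nvidia.com/gpu` the limit is what counts (requests must equal limits):

```yaml
# Minimal GPU smoke test: the device plugin injects the GPU device
# and driver mounts into the container at admission time.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-enabled image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1     # on a MIG-enabled node you would instead request
                              # a slice, e.g. nvidia.com/mig-1g.5gb: 1
```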
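A `RayCluster` sketch, assuming the KubeRay operator is installed; the cluster name, group name, replica bounds, and image tags are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-demo              # illustrative name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers    # illustrative group name
    replicas: 2
    minReplicas: 0
    maxReplicas: 4            # operator scales workers within these bounds
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
```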
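A hedged `InferenceService` sketch; the exact schema depends on your KServe version and which serving runtime backs vLLM, so treat the model format, storage URI, and name as assumptions:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-7b                        # illustrative name
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4                    # KServe autoscales replicas in this range
    canaryTrafficPercent: 10          # send 10% of traffic to the newest revision
    model:
      modelFormat:
        name: huggingface             # assumption: vLLM-backed HuggingFace runtime
      storageUri: hf://org/model-7b   # illustrative model reference
      resources:
        limits:
          nvidia.com/gpu: 1
```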
21.3 Lab - "Train and Serve on K8s"
- Bring up a small GPU-enabled cluster (kind+nvidia, or a 2-node cloud cluster with 1-2 GPUs each).
- Install the GPU Operator. Verify `kubectl describe node` shows `nvidia.com/gpu: N` under allocatable resources.
- Install Volcano. Submit a 4-GPU gang-scheduled training job (a small FSDP run from week 14); a job manifest sketch follows this list.
- Install KServe + vLLM runtime. Deploy a 7B model. Hit it with a load test. Demonstrate autoscaling.
- Document the YAML for each in a deployable repo.
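A sketch of the gang-scheduled training job, assuming Volcano's CRDs are installed; the job name, queue, image, and entrypoint are placeholders for your FSDP run:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fsdp-train             # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 4              # gang scheduling: all 4 workers start or none do
  queue: default
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: registry.example.com/fsdp-train:latest        # placeholder image
          command: ["torchrun", "--nproc_per_node=1", "train.py"]  # placeholder entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1
```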
21.4 Idiomatic & Diagnostic Drill
- DCGM metrics (`DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED`, `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`) are exported to Prometheus by the GPU Operator's DCGM exporter. Read them; a recording-rule sketch follows.
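The metric names above come straight from dcgm-exporter; here is a sketch of recording rules over them, assuming the Prometheus Operator's `PrometheusRule` CRD is available (rule names and the grouping label are illustrative, since the exporter's label set depends on its config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-rules        # illustrative name
spec:
  groups:
  - name: gpu.rules
    rules:
    # Mean SM utilization per node, 0-100.
    - record: node:gpu_util:avg
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)   # label depends on exporter config
    # Framebuffer memory in use per GPU, in MiB.
    - record: gpu:fb_used_mib
      expr: DCGM_FI_DEV_FB_USED
    # Fraction of cycles the tensor pipes are active; a low value on a
    # "busy" GPU usually means the job is memory- or input-bound.
    - record: gpu:tensor_active:ratio
      expr: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
```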
21.5 Production Slice
- A real production GPU fleet has cost, capacity, utilization, and reliability dashboards. Build a Grafana dashboard covering all four and bookmark it; a hedged query sketch follows.
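One way to seed the four dashboards: a recording-rule sketch where the kube-state-metrics series, the sanitized resource label, and the flat $/GPU-hour constant are all assumptions you would replace with your own label scheme and pricing:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-fleet-rules       # illustrative name
spec:
  groups:
  - name: gpu-fleet.rules
    rules:
    # Utilization: fleet-wide mean SM utilization.
    - record: fleet:gpu_util:avg
      expr: avg(DCGM_FI_DEV_GPU_UTIL)
    # Capacity: allocatable GPUs across nodes (assumed kube-state-metrics series).
    - record: fleet:gpu_capacity:total
      expr: sum(kube_node_status_allocatable{resource="nvidia_com_gpu"})
    # Cost: assumed flat $2.50/GPU-hour, burned whether or not GPUs are busy.
    - record: fleet:gpu_cost_per_hour:total
      expr: sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}) * 2.50
    # Reliability: GPUs currently reporting an XID error (driver/hardware fault;
    # the DCGM metric holds the last XID code seen).
    - record: fleet:gpu_xid_errors:count
      expr: count(DCGM_FI_DEV_XID_ERRORS > 0) or vector(0)
```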