
Week 21 - ML on Kubernetes: KServe, KubeRay, Volcano, GPU Operators

21.1 Conceptual Core

  • Kubernetes is the dominant control plane for production ML workloads. Three dimensions:
  • Training orchestration: scheduling large multi-GPU jobs, handling preemption, gang scheduling (all-or-nothing placement). Tools: Volcano, KubeRay, Kueue, JobSet.
  • Inference serving: model deployment, autoscaling, traffic routing, A/B testing. Tools: KServe, Seldon Core, vLLM-on-K8s, Ray Serve, Triton Inference Server.
  • GPU resource management: driver installation, device plugins, MIG partitioning, time-slicing. Tools: NVIDIA GPU Operator, AMD GPU Operator.
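
As a concrete anchor for the GPU-management dimension, here is a minimal sketch of a pod claiming one GPU through the device plugin. The pod name, container name, and CUDA image tag are illustrative placeholders:

```yaml
# Minimal pod that claims one GPU via the NVIDIA device plugin.
# Names and image tag are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # allocated whole by the device plugin; no fractional GPUs
```

Note that nvidia.com/gpu is an extended resource: it can only be requested in whole units, unlike CPU millicores.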

21.2 Mechanical Detail

  • NVIDIA GPU Operator: deploys the driver, container toolkit, device plugin, DCGM metrics exporter, and MIG manager, all as DaemonSets. A pod requests nvidia.com/gpu: 1 in its resource limits, and the device plugin allocates a GPU to it.
  • MIG (Multi-Instance GPU): A100/H100 hardware partitioning. One A100 can be split into up to 7 isolated GPU slices. Useful for packing many small workloads onto big GPUs; not suited to training. Configured via the GPU Operator's MIG manager.
  • Volcano / Kueue: gang scheduling: a 64-GPU job won't start until 64 GPUs are simultaneously available. The default K8s scheduler places pods piecemeal, so two large jobs can each grab half the GPUs and deadlock.
  • KubeRay: operator for Ray clusters. Ray is the de facto standard for distributed Python compute (used heavily by AI labs for data preprocessing, RLHF rollouts, hyperparameter sweeps).
  • KServe + vLLM: the canonical inference stack. The KServe InferenceService CRD wraps vLLM (or Triton, or TGI) with autoscaling, canary rollouts, and transformer pre/post-processing.
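
The KServe bullet can be sketched as an InferenceService manifest. This assumes KServe's Hugging Face serving runtime (vLLM-backed in recent releases); the model id, service name, and replica bounds are illustrative, and exact fields vary by KServe version:

```yaml
# Sketch of a KServe InferenceService; fields vary by KServe version.
# Model id and names are illustrative.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-7b
spec:
  predictor:
    minReplicas: 1        # autoscaling bounds handled by KServe
    maxReplicas: 3
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm-7b
        - --model_id=meta-llama/Llama-2-7b-chat-hf
      resources:
        limits:
          nvidia.com/gpu: 1
```

Canary rollouts then become a one-field change (traffic split on the InferenceService) rather than a bespoke Deployment dance.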

21.3 Lab: "Train and Serve on K8s"

  1. Bring up a small GPU-enabled cluster (kind+nvidia, or a 2-node cloud cluster with 1-2 GPUs each).
  2. Install GPU Operator. Verify kubectl describe node shows nvidia.com/gpu: N.
  3. Install Volcano. Submit a 4-GPU gang-scheduled training job (a small FSDP run from week 14).
  4. Install KServe + vLLM runtime. Deploy a 7B model. Hit it with a load test. Demonstrate autoscaling.
  5. Document the YAML for each in a deployable repo.
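
Step 3's gang-scheduled job can be sketched as a Volcano Job, where minAvailable is what gives all-or-nothing placement. The image and names are placeholders:

```yaml
# Volcano Job sketch for a 4-GPU gang-scheduled training run.
# Image and names are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fsdp-train
spec:
  schedulerName: volcano
  minAvailable: 4           # gang scheduling: all 4 workers start together or not at all
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/fsdp-train:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With minAvailable below replicas you get partial gangs; keeping them equal is the safe default for synchronous training.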

21.4 Idiomatic & Diagnostic Drill

  • DCGM metrics (DCGM_FI_DEV_GPU_UTIL for GPU utilization, DCGM_FI_DEV_FB_USED for framebuffer memory in use, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE for tensor-core pipe activity) are exported to Prometheus by the GPU Operator's DCGM exporter. Learn to read them.
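
One way to drill these is to roll them up with Prometheus recording rules so dashboard queries stay cheap. A sketch over the exporter's metric names (the rule names are illustrative):

```yaml
# Prometheus recording rules over the DCGM exporter's metrics.
# Rule names are illustrative; the DCGM_FI_* names come from the exporter.
groups:
  - name: gpu-fleet
    rules:
      - record: fleet:gpu_util:avg
        expr: avg(DCGM_FI_DEV_GPU_UTIL)
      - record: fleet:gpu_fb_used:sum
        expr: sum(DCGM_FI_DEV_FB_USED)
      - record: fleet:tensor_pipe_active:avg
        expr: avg(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
```

High GPU_UTIL with low TENSOR_ACTIVE is the classic tell: the GPU looks busy but the tensor cores are idle, usually a data-loading or kernel-mix problem.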

21.5 Production Slice

  • A real production GPU fleet has cost, capacity, utilization, and reliability dashboards. Build a Grafana dashboard covering all four, and bookmark it.
