# Appendix A: Hardening, Observability, and Fleet Operations
Cumulative reference for the production-readiness work distributed through the curriculum.
## A.1 GPU Profiling Toolkit
| Tool | Use case |
|---|---|
| `nvidia-smi` | Quick health check: util, memory, temperature, ECC errors. |
| `nvidia-smi dmon -s pucvmet` | Streaming per-second telemetry. |
| `dcgmi` | NVIDIA Data Center GPU Manager: production fleet metrics. |
| `nsys profile` | System-level timeline: CUDA kernels, NCCL calls, CPU, OS, NVTX ranges. |
| `ncu --set full` | Per-kernel deep dive: SM occupancy, memory throughput, tensor-core util. |
| `nvprof` | Legacy. Avoid; use `nsys` + `ncu` instead. |
| `torch.profiler` | Framework-aware: per-op timing, memory allocations, stack traces, CUDA + CPU. |
| `torch.cuda.memory._record_memory_history` | Memory-allocation timeline. Indispensable for OOM debugging. |
| `py-spy` / `austin` | Sampling profiler for the Python side. |
A complete perf-debugging session uses all three layers: `torch.profiler` for the framework view, `nsys` for the system timeline, and `ncu` for the per-kernel deep dive.
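As a starting point, a minimal `torch.profiler` session might look like the sketch below. The model and input are placeholders, and it assumes a CUDA-capable machine:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; swap in the real training or inference step.
model = torch.nn.Linear(4096, 4096).cuda()
inputs = torch.randn(64, 4096, device="cuda")

# Record CPU + CUDA activity, memory allocations, and Python stack traces.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    with_stack=True,
) as prof:
    for _ in range(5):
        model(inputs)
    torch.cuda.synchronize()

# Per-op summary sorted by GPU time, plus a trace file for the timeline view.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")
```

The printed table gives the per-op view; the exported trace opens in Perfetto or `chrome://tracing` for the timeline.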
## A.2 GPU Fleet Observability
Required Prometheus metrics for any production GPU fleet:
DCGM_FI_DEV_GPU_UTIL # 0-100, time GPU was busy
DCGM_FI_DEV_FB_USED # framebuffer (HBM) bytes used
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_GPU_TEMP # alert > 85°C sustained
DCGM_FI_DEV_POWER_USAGE # watts
DCGM_FI_DEV_PCIE_LINK_GEN_CURRENT # surprise downgrades happen
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE # tensor-core active fraction
DCGM_FI_PROF_DRAM_ACTIVE # HBM active fraction
DCGM_FI_PROF_NVLINK_RX_BYTES # interconnect traffic
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_DEV_XID_ERRORS # any non-zero is bad
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL # double-bit ECC = retire the GPU
Plus framework-level:
- pytorch_distributed_* for NCCL collectives (use NCCL's NVTX ranges).
- vLLM / serving metrics: TTFT, TPOT, requests-running, requests-waiting, kv-cache-usage-perc, num-preemptions.
The four golden signals for an inference fleet:

1. Throughput (tokens/sec/GPU).
2. Tail latency (p99 TTFT, p99 TPOT).
3. Cost (inferred from utilization × $/hour).
4. Quality (eval scores from canary traffic).
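As a sketch, the first two signals can be computed directly from per-request serving logs; the field names here (`ttft_s`, `tpot_s`, `output_tokens`) are hypothetical:

```python
import numpy as np

# Hypothetical per-request records scraped from the serving layer for one window.
records = [
    {"ttft_s": 0.21, "tpot_s": 0.034, "output_tokens": 512},
    {"ttft_s": 0.35, "tpot_s": 0.041, "output_tokens": 128},
    {"ttft_s": 0.19, "tpot_s": 0.029, "output_tokens": 256},
]
window_s = 60.0   # measurement window
num_gpus = 8      # GPUs serving this window

tokens = sum(r["output_tokens"] for r in records)
throughput = tokens / window_s / num_gpus                       # tokens/sec/GPU
p99_ttft = np.percentile([r["ttft_s"] for r in records], 99)
p99_tpot = np.percentile([r["tpot_s"] for r in records], 99)
print(f"{throughput:.1f} tok/s/GPU, p99 TTFT {p99_ttft*1000:.0f} ms, p99 TPOT {p99_tpot*1000:.0f} ms")
```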
## A.3 Training Job Hardening
For multi-day training runs:
- Checkpointing: every N minutes (not just every N steps); resumable from any checkpoint with bit-exact loss continuation.
- Failure handling: NCCL timeouts retry with exponential backoff before failing the job. NaN detection skips (or fails) the step. Persistent NaNs trigger checkpoint reload.
- Preemption: cloud spot instances may preempt with seconds of notice. A SIGTERM handler triggers checkpoint + clean exit (see the sketch after this list).
- Health probes: fail fast on ECC errors, NVLink degradation, throttling. Tool: NVIDIA's GPU diagnostic (`dcgmi diag`).
- Run state: persist the run's metadata (commit, config, hardware fingerprint) alongside checkpoints.
- Cost control: hard timeouts, budget caps, on-call alerting on overruns.
- Reproducibility: pinned versions, seeded RNGs, documented determinism guarantees.
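A minimal sketch of how the preemption, NaN-handling, and wall-clock checkpointing pieces fit into a loop. The tiny model, optimizer, and `save_checkpoint` helper are stand-ins, not the curriculum's actual code:

```python
import signal
import time
import torch

# Stand-ins so the sketch runs end to end; replace with the real job.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def batches():
    while True:
        yield torch.randn(32, 128), torch.randn(32, 1)

def save_checkpoint(step):
    torch.save({"step": step, "model": model.state_dict(),
                "optim": optimizer.state_dict()}, f"ckpt_{step:08d}.pt")

# Preemption: spot instances send SIGTERM seconds before reclaiming the node.
stop_requested = False
def _on_sigterm(signum, frame):
    global stop_requested
    stop_requested = True
signal.signal(signal.SIGTERM, _on_sigterm)

CHECKPOINT_INTERVAL_S = 15 * 60     # checkpoint on wall-clock time, not just step count
MAX_STEPS = 1_000                   # placeholder horizon for the sketch
last_ckpt = time.time()
consecutive_nans = 0

for step, (x, y) in enumerate(batches()):
    loss = torch.nn.functional.mse_loss(model(x), y)

    if not torch.isfinite(loss):    # NaN/Inf detection
        consecutive_nans += 1
        if consecutive_nans > 10:   # persistent NaNs: stop and reload a known-good checkpoint
            raise RuntimeError("persistent NaN loss; reload the last checkpoint")
        continue                    # skip the bad step
    consecutive_nans = 0

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if stop_requested or time.time() - last_ckpt > CHECKPOINT_INTERVAL_S:
        save_checkpoint(step)
        last_ckpt = time.time()
        if stop_requested:
            break                   # clean exit ahead of preemption
    if step >= MAX_STEPS:
        break
```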
## A.4 Inference Fleet Hardening
- Request validation: prompt length caps, content-type checks, rate limiting per user/tenant. Reject obvious abuse before invoking the model.
- Timeouts at every layer: client→gateway, gateway→model, model→GPU. Cascading timeout discipline: each layer's timeout < parent's timeout − queue time budget.
- Backpressure: when GPU memory or queue depth saturates, return 429 (rate-limited) immediately rather than queueing forever. The model cannot catch up; let the client retry with backoff (a small admission-control sketch follows this list).
- Graceful degradation: when the primary model fails, fall back to a smaller, cheaper model (with a quality marker in the response). If the fallback is also down, return 503.
- Multi-region: GPU fleets often consolidate in one region. Plan for regional outages; document the RTO/RPO for inference.
- Model-version pinning: production traffic targets a specific model digest, not a tag. Promote via canary (small % traffic) → roll forward or rollback.
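A rough sketch of request validation, backpressure, and the model-side timeout. The limits are hypothetical and `run_model` is a placeholder; this is not tied to any particular serving framework:

```python
import asyncio

MAX_PROMPT_CHARS = 32_768   # request validation: hard prompt-length cap
MAX_QUEUE_DEPTH = 64        # beyond this, queueing only adds latency
GATEWAY_TIMEOUT_S = 30.0    # budget the caller gives this layer
MODEL_TIMEOUT_S = 25.0      # must be < parent timeout minus a queue-time budget

queue_depth = 0

async def run_model(prompt: str) -> str:
    """Placeholder for the real inference-engine call."""
    await asyncio.sleep(0.05)
    return "ok"

async def handle_request(prompt: str) -> tuple[int, str]:
    """Admission control for one request; returns (HTTP status, body)."""
    global queue_depth
    if len(prompt) > MAX_PROMPT_CHARS:
        return 400, "prompt too long"
    if queue_depth >= MAX_QUEUE_DEPTH:              # backpressure: shed load immediately
        return 429, "overloaded, retry with backoff"

    queue_depth += 1
    try:
        reply = await asyncio.wait_for(run_model(prompt), timeout=MODEL_TIMEOUT_S)
        return 200, reply
    except asyncio.TimeoutError:
        return 504, "model timed out"
    finally:
        queue_depth -= 1

print(asyncio.run(handle_request("hello")))         # (200, 'ok')
```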
## A.5 Cost Control
Inference cost ≈ tokens × time-per-token × $/GPU-hour / batch-size.
Levers:

- Batch size: bigger batch → lower $/token, higher latency. The sweet spot is workload-dependent (see the worked example after this list).
- Model size: 7B → 13B → 70B is roughly 5× compute per inference. The quality gain may not justify it; eval rigorously.
- Quantization: INT4 (W4A16) cuts decode HBM traffic ~4×, with a roughly proportional throughput gain.
- GPU class: H100 vs. A100: the H100 is ~2× faster for ~1.5× the cost, so often cheaper per token.
- Region/spot: spot inference is risky; spot training is normal. Spot decisions are cluster-wide.
- Caching: prefix caching (system prompts) and full-response caching (FAQ-style) save dramatically. Build hit-rate dashboards.
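A quick sanity check of the cost formula and the batch-size lever, using illustrative prices and latencies (not benchmarks):

```python
# Rough per-token serving cost: time-per-token is amortized across the batch.
def cost_per_million_tokens(gpu_hour_usd: float, tpot_s: float, batch_size: int) -> float:
    gpu_seconds_per_token = tpot_s / batch_size
    return gpu_seconds_per_token * (gpu_hour_usd / 3600.0) * 1_000_000

# Illustrative numbers only: $4/GPU-hour, 30 ms per output token.
print(cost_per_million_tokens(gpu_hour_usd=4.0, tpot_s=0.03, batch_size=32))   # ~$1.04 per 1M tokens
print(cost_per_million_tokens(gpu_hour_usd=4.0, tpot_s=0.03, batch_size=64))   # ~$0.52 per 1M tokens
```

Doubling the batch halves the dollars per token here, at the price of higher per-request latency.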
Budget: $100/month per developer for hands-on learning; $1000-5000 for the capstone month, depending on track.
## A.6 Model Observability (Beyond System Metrics)
- Output distribution drift: monitor response length, refusal rate, fraction of failed JSON-mode outputs, sentiment, toxicity scores (a minimal drift check is sketched after this list).
- Input distribution drift: prompt length, language detection, topic distribution.
- Eval-from-production: each week, resample N production requests, send them to an eval harness (LLM-as-judge or human eval), and surface quality trends.
- Per-feature observability: tag every request with the calling product/feature; surface metrics per-feature so a single product's regression is detectable.
- A/B telemetry: when running a canary (5% traffic to model B, 95% to model A), every metric needs to support per-variant breakdowns.
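A minimal sketch of an output-drift check, assuming hypothetical per-request summaries with `response_len` and `refused` fields:

```python
import numpy as np

def drift_report(baseline, current):
    """Compare simple output-distribution stats between two windows of requests."""
    def stats(batch):
        lens = np.array([r["response_len"] for r in batch], dtype=float)
        return {
            "mean_len": float(lens.mean()),
            "p95_len": float(np.percentile(lens, 95)),
            "refusal_rate": sum(r["refused"] for r in batch) / len(batch),
        }
    base, cur = stats(baseline), stats(current)
    return {k: {"baseline": base[k],
                "current": cur[k],
                "rel_change": (cur[k] - base[k]) / (base[k] or 1.0)}
            for k in base}

# Illustrative call with synthetic data: refusal rate triples week over week.
baseline = [{"response_len": 200, "refused": False}] * 95 + [{"response_len": 30, "refused": True}] * 5
current  = [{"response_len": 180, "refused": False}] * 85 + [{"response_len": 25, "refused": True}] * 15
print(drift_report(baseline, current)["refusal_rate"])
```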
## A.7 The `ai-systems-baseline/` Template
ai-systems-baseline/
Dockerfile.cuda # pinned CUDA + cuDNN + Python + PyTorch
pyproject.toml / requirements.txt # pinned deps, including torch
scripts/
bench-throughput.sh
bench-latency.sh
profile-nsys.sh
profile-ncu.sh
eval-regression.py
configs/
nccl.env # NCCL_* tuning
pytorch.env # TORCH_* tuning
vllm.yaml # for serving track
observability/
dcgm-exporter-config.yaml
grafana/
gpu-fleet.json
training-health.json
inference-fleet.json
ci/
test.yml
bench.yml # regression-tracked perf
eval.yml # quality regression
profile-on-pr.yml # generate ncu reports on PR
runbooks/
nan-during-training.md
ecc-error.md
nccl-timeout.md
inference-latency-spike.md
oom-on-load.md
RELEASE_CHECKLIST.md
THREAT_MODEL.md
COST_MODEL.md
Every AI workload you ship after week 24 should be built from this template.