Appendix A: Hardening, Observability, and Fleet Operations

Cumulative reference for the production-readiness work distributed through the curriculum.


A.1 GPU Profiling Toolkit

| Tool | Use case |
|------|----------|
| nvidia-smi | Quick health check: utilization, memory, temperature, ECC errors. |
| nvidia-smi dmon -s pucvmet | Streaming per-second telemetry. |
| dcgmi | NVIDIA Data Center GPU Manager CLI; production fleet metrics. |
| nsys profile | System-level timeline: CUDA kernels, NCCL calls, CPU, OS, NVTX ranges. |
| ncu --set full | Per-kernel deep dive: SM occupancy, memory throughput, tensor-core utilization. |
| nvprof | Legacy. Avoid; use nsys + ncu instead. |
| torch.profiler | Framework-aware: per-op timing, memory allocations, stack traces, CUDA + CPU. |
| torch.cuda.memory._record_memory_history | Memory allocation timeline. Indispensable for OOM debugging. |
| py-spy / austin | Sampling profilers for the Python side. |

A complete performance-debugging session uses all three layers: torch.profiler for the framework view, nsys for the system timeline, and ncu for the per-kernel deep dive.
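A minimal sketch of the framework-level view, assuming a PyTorch 2.x CUDA environment. The toy model, profiler schedule, and output paths are illustrative choices, and the memory-history calls are a private PyTorch API that may change between releases.

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

# Hypothetical toy workload so the sketch runs end to end.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batches = [torch.randn(64, 4096, device="cuda") for _ in range(6)]

# Optional: record allocator history for OOM debugging
# (private API, available in recent PyTorch 2.x builds).
torch.cuda.memory._record_memory_history(max_entries=100_000)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    profile_memory=True,
    with_stack=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./prof"),
) as prof:
    for batch in batches:
        loss = model(batch).pow(2).mean()   # stand-in for a real loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        prof.step()                         # advance the profiler schedule

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Dump the allocator timeline; inspect the pickle with PyTorch's memory_viz tool.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```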


A.2 GPU Fleet Observability

Required Prometheus metrics for any production GPU fleet:

DCGM_FI_DEV_GPU_UTIL                  # 0-100, time GPU was busy
DCGM_FI_DEV_FB_USED                   # framebuffer (HBM) bytes used
DCGM_FI_DEV_FB_FREE
DCGM_FI_DEV_GPU_TEMP                  # alert > 85°C sustained
DCGM_FI_DEV_POWER_USAGE               # watts
DCGM_FI_DEV_PCIE_LINK_GEN_CURRENT     # surprise downgrades happen
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE       # tensor-core active fraction
DCGM_FI_PROF_DRAM_ACTIVE              # HBM active fraction
DCGM_FI_PROF_NVLINK_RX_BYTES          # interconnect traffic
DCGM_FI_PROF_PCIE_RX_BYTES
DCGM_FI_DEV_XID_ERRORS                # any non-zero is bad
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL         # double-bit ECC = retire the GPU

Plus framework-level metrics:

  • pytorch_distributed_* for NCCL collectives (use NCCL's NVTX ranges).
  • vLLM / serving metrics: TTFT, TPOT, requests-running, requests-waiting, kv-cache-usage-perc, num-preemptions.

The four golden signals for an inference fleet:

  1. Throughput (tokens/sec/GPU).
  2. Tail latency (p99 TTFT, p99 TPOT).
  3. Cost (inferred from utilization × $/hour).
  4. Quality (eval scores from canary traffic).
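A minimal sketch of exporting the latency signals with the prometheus_client library; the metric names, bucket boundaries, and port are illustrative assumptions, not a fixed convention, and most serving stacks already export equivalents.

```python
import time
from prometheus_client import Histogram, Gauge, start_http_server

# Hypothetical metric names; align with whatever your serving stack already exports.
TTFT = Histogram(
    "llm_time_to_first_token_seconds", "Time to first token",
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0],
)
TPOT = Histogram(
    "llm_time_per_output_token_seconds", "Time per output token",
    buckets=[0.005, 0.01, 0.02, 0.05, 0.1, 0.25],
)
RUNNING = Gauge("llm_requests_running", "Requests currently decoding")

def record_request(first_token_latency_s: float, decode_latency_s: float, output_tokens: int) -> None:
    """Call once per completed request with measured timings."""
    TTFT.observe(first_token_latency_s)
    if output_tokens > 1:
        TPOT.observe(decode_latency_s / (output_tokens - 1))

if __name__ == "__main__":
    start_http_server(9400)         # /metrics endpoint for Prometheus to scrape
    RUNNING.set(0)
    record_request(0.18, 3.2, 128)  # example observation
    time.sleep(60)                  # keep the endpoint alive for a scrape
```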


A.3 Training Job Hardening

For multi-day training runs:

  • Checkpointing: every N minutes (not just every N steps); resumable from any checkpoint with bit-exact loss continuation.
  • Failure handling: NCCL timeouts retry with exponential backoff before failing the job. NaN detection skips (or fails) the step. Persistent NaNs trigger checkpoint reload.
  • Preemption: cloud spot instances may preempt with only seconds of notice; a SIGTERM handler should trigger a final checkpoint and a clean exit (see the sketch after this list).
  • Health probes: fail-fast on ECC errors, NVLink degradation, throttling. Tools: NVIDIA's GPU diagnostic (dcgmi diag).
  • Run state: persist the run's metadata (commit, config, hardware fingerprint) alongside checkpoints.
  • Cost control: hard timeouts, budget caps, on-call alerting on overruns.
  • Reproducibility: pinned versions, seeded RNGs, documented determinism guarantees.
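A minimal sketch of the preemption and NaN-handling pattern in a PyTorch training loop; the checkpoint path, 10-minute interval, and three-strike NaN policy are illustrative assumptions, not prescriptions.

```python
import math
import signal
import time
import torch

STOP_REQUESTED = False

def _handle_sigterm(signum, frame):
    # Spot preemption arrives as SIGTERM: set a flag and let the loop
    # exit at a safe point instead of dying mid-step.
    global STOP_REQUESTED
    STOP_REQUESTED = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

def train(model, optimizer, batches, checkpoint_every_s=600):
    last_ckpt = time.monotonic()
    nan_streak = 0
    for step, batch in enumerate(batches):
        loss = model(batch).pow(2).mean()          # stand-in for a real loss
        if not math.isfinite(loss.item()):
            nan_streak += 1
            optimizer.zero_grad(set_to_none=True)
            if nan_streak >= 3:                    # persistent NaNs: stop and reload a checkpoint
                raise RuntimeError("persistent NaN loss; reload the last checkpoint")
            continue                               # otherwise skip the bad step
        nan_streak = 0
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

        if STOP_REQUESTED or time.monotonic() - last_ckpt > checkpoint_every_s:
            save_checkpoint(model, optimizer, step)
            last_ckpt = time.monotonic()
        if STOP_REQUESTED:
            break                                  # clean exit before preemption kills the node
```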

A.4 Inference Fleet Hardening

  • Request validation: prompt length caps, content-type checks, rate limiting per user/tenant. Reject obvious abuse before invoking the model.
  • Timeouts at every layer: client→gateway, gateway→model, model→GPU. Cascading timeout discipline: each layer's timeout < parent's timeout − queue time budget.
  • Backpressure: when GPU memory or queue depth saturates, return 429 (rate-limited) immediately rather than queueing forever; the model cannot catch up, so let the client retry with backoff (see the sketch after this list).
  • Graceful degradation: when the primary model fails, fall back to a smaller, cheaper model (with a quality marker in the response). If the fallback is also unavailable, return 503.
  • Multi-region: GPU fleets often consolidate in one region. Plan for regional outages; document the RTO/RPO for inference.
  • Model-version pinning: production traffic targets a specific model digest, not a tag. Promote via canary (small % traffic) → roll forward or rollback.
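A minimal sketch of the backpressure and timeout-budget pattern using asyncio; the in-flight limit, timeout values, and status codes are illustrative assumptions, and the model call is a placeholder for a real serving-engine request.

```python
import asyncio

MAX_IN_FLIGHT = 64                 # beyond this, shed load instead of queueing
GATEWAY_TIMEOUT_S = 30.0           # what our caller gave us
MODEL_TIMEOUT_S = 25.0             # must be < caller's timeout minus queue/overhead budget

_in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

async def run_model(prompt: str) -> str:
    # Placeholder for the real model call (e.g. an HTTP request to the serving engine).
    await asyncio.sleep(0.05)
    return "generated text"

async def handle_request(prompt: str) -> tuple[int, str]:
    """Return an (http_status, body) pair for a single inference request."""
    if _in_flight.locked():
        # Queue is saturated: fail fast with 429 so the client retries with backoff.
        return 429, "rate limited: fleet saturated"
    async with _in_flight:
        try:
            text = await asyncio.wait_for(run_model(prompt), timeout=MODEL_TIMEOUT_S)
            return 200, text
        except asyncio.TimeoutError:
            # Our budget expires before the caller's does, so the error still reaches them.
            return 504, "model timed out"

if __name__ == "__main__":
    status, body = asyncio.run(handle_request("hello"))
    print(status, body)
```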

A.5 Cost Control

Inference cost per request ≈ tokens × time-per-token × ($/GPU-hour) ÷ batch size.
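A small worked example of that formula in Python; the numbers (500 output tokens, 20 ms per token, $4/GPU-hour, batch of 32) are illustrative assumptions only.

```python
def cost_per_request(tokens: int, time_per_token_s: float,
                     gpu_dollars_per_hour: float, batch_size: int) -> float:
    """Approximate $ per request: GPU-seconds consumed, amortized across the batch."""
    gpu_seconds = tokens * time_per_token_s
    return gpu_seconds * (gpu_dollars_per_hour / 3600.0) / batch_size

# Illustrative numbers: 500 output tokens, 20 ms/token decode,
# $4/GPU-hour, 32 requests batched together.
c = cost_per_request(tokens=500, time_per_token_s=0.020,
                     gpu_dollars_per_hour=4.0, batch_size=32)
print(f"${c:.5f} per request")     # ≈ $0.00035
```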

Levers:

  • Batch size: bigger batch → lower $/token, higher latency. The sweet spot is workload-dependent.
  • Model size: compute per token scales roughly linearly with parameter count, so 13B → 70B is roughly 5× compute per inference. The quality gain may not justify it; eval rigorously.
  • Quantization: INT4 weights (W4A16) cut decode HBM traffic ~4×, with a roughly proportional throughput gain.
  • GPU class: H100 vs A100: roughly 2× faster for roughly 1.5× the cost, so often cheaper per token.
  • Region/spot: spot inference is risky; spot training is normal. Spot decisions are cluster-wide.
  • Caching: prefix caching (system prompts) and full-response caching (FAQ-style) save dramatically. Build hit-rate dashboards.

Budget: $100/month per developer for hands-on learning; $1000-5000 for the capstone month, depending on track.


A.6 Model Observability (Beyond System Metrics)

  • Output distribution drift: monitor response length, refusal rate, fraction of failed JSON-mode outputs, sentiment, toxicity scores (see the sketch after this list).
  • Input distribution drift: prompt length, language detection, topic distribution.
  • Eval-from-production: resample N production requests weekly, send them to an eval harness (LLM-as-judge or human eval), and surface quality trends.
  • Per-feature observability: tag every request with the calling product/feature; surface metrics per-feature so a single product's regression is detectable.
  • A/B telemetry: when running a canary (5% traffic to model B, 95% to model A), every metric needs to support per-variant breakdowns.
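A minimal sketch of computing a few of the output-drift signals over a window of responses; the refusal markers and example window are illustrative assumptions, and sentiment or toxicity scoring would need a real classifier on top.

```python
import json
from statistics import mean

# Hypothetical refusal markers; tune these to your model's actual refusal style.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm sorry, but")

def output_drift_signals(responses: list[str], json_mode: bool = False) -> dict:
    """Summarize one monitoring window of model responses."""
    lengths = [len(r) for r in responses]
    refusals = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    bad_json = 0
    if json_mode:
        for r in responses:
            try:
                json.loads(r)
            except json.JSONDecodeError:
                bad_json += 1
    n = max(len(responses), 1)
    return {
        "mean_response_chars": mean(lengths) if lengths else 0.0,
        "refusal_rate": refusals / n,
        "json_failure_rate": bad_json / n if json_mode else None,
    }

# Example window: two normal answers and one refusal.
window = ["Here is the summary...", '{"ok": true}', "I'm sorry, but I can't help with that."]
print(output_drift_signals(window, json_mode=False))
```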

A.7 The ai-systems-baseline/ Template

ai-systems-baseline/
  Dockerfile.cuda                     # pinned CUDA + cuDNN + Python + PyTorch
  pyproject.toml / requirements.txt   # pinned deps, including torch
  scripts/
    bench-throughput.sh
    bench-latency.sh
    profile-nsys.sh
    profile-ncu.sh
    eval-regression.py
  configs/
    nccl.env                          # NCCL_*  tuning
    pytorch.env                       # TORCH_* tuning
    vllm.yaml                         # for serving track
  observability/
    dcgm-exporter-config.yaml
    grafana/
      gpu-fleet.json
      training-health.json
      inference-fleet.json
  ci/
    test.yml
    bench.yml                         # regression-tracked perf
    eval.yml                          # quality regression
    profile-on-pr.yml                 # generate ncu reports on PR
  runbooks/
    nan-during-training.md
    ecc-error.md
    nccl-timeout.md
    inference-latency-spike.md
    oom-on-load.md
  RELEASE_CHECKLIST.md
  THREAT_MODEL.md
  COST_MODEL.md

Every AI workload you ship after week 24 should be built from this template.
