
Deep Dives: Self-Contained Reference Chapters

Fourteen chapters that take the tutorial curriculum from "guided tour with external links" to self-contained mastery resources. Each chapter was authored to let a working engineer master the topic from the document alone, without depending on YouTube videos, blog posts, or paper PDFs as primary sources.

Total: ~131,000 words / 14 files / 575 KB. Each chapter runs 7,000–11,000 words, is layered (intuition → mechanism → math → numbers → diagrams → code), and ends with worked exercises.


Why This Layer Exists

The sequences/ files are guides: rungs with links to 3Blue1Brown, Karpathy, Strang, arXiv. The weeks/ files are schedules: three-session-per-week build plans. Together they tell you what to learn and when.

The deep dives tell you how the material itself works. They are the answer to "what if those YouTube videos disappear in 5 years?" and "what if I want to learn this without a 4-hour Karpathy lecture?".

A reader can reach mastery in any of these 14 topics from the deep dive alone, with the sequences/weeks providing the cadence and the weekly artifact gates.


Reading Orders

As curriculum companion (paired with sequences and weeks)

| When | Read this deep dive | Pairs with sequence | Pairs with weeks |
|---|---|---|---|
| Q1 (months 1-3) | 01 Math for ML | sequences 01, 02, 03 | M01 (all weeks), M02-W01 |
| Q1 | 02 PyTorch Fluency | sequences 04, 05 | M01-W04, M02 (DL build weeks) |
| Q1 | 03 Classical ML Rigor | sequence 06 | M02 (all weeks) |
| Q1 | 04 Deep Learning Fundamentals | sequence 07 | M02-M03 transition |
| Q2 (months 4-6) | 05 LLM Application Patterns | sequence 09 | M04 (all weeks) |
| Q2 | 06 Retrieval and RAG | sequence 10 | M05 (all weeks) |
| Q2 | 07 Agent Reliability Engineering | sequence 11 | M06 (all weeks) |
| Q2/Q3 | 08 Evaluation Systems | sequence 12 | M06-W04, M07 specialty weeks |
| Q2/Q3 | 09 LLM Observability | sequence 13 | M06, M07-Track A |
| Q3 (months 7-9) | 10 Fine-Tuning SFT to RLHF | sequence 15 | M08 (all weeks) |
| Q3 | 11 Multimodal Foundations | (new: gap-fill) | M07-M08 (insert per cadence) |
| Q3/Q4 | 12 AI Safety and Red-Teaming | (new: gap-fill) | M07 production hardening |
| Throughout | 13 AI-for-SRE Bridge | (new: moat amplifier) | M04 (anchor reframe), M11 (positioning) |
| Year 1 end / Year 2 start | 14 Future-Proofing and Audit | (meta) | M12-W04 retrospective |

As a standalone reference text

Topical groupings:

  • Foundations (durable spine): 01 → 02 → 03 → 04
  • Applications: 05 → 06 → 07
  • Quality and operations (the user's specialty): 08 → 09 → 13
  • Modeling: 10 → 11
  • Production: 12
  • Meta: 14

As interview prep

Every chapter's "Practical Exercises" section reaches the depth of senior-level technical interview questions. If you can solve them cold, you can answer the corresponding interview questions.


Chapter Index

`01_MATH_FOR_ML.md` - The Math an Applied AI Engineer Actually Uses

~9,900 words. Linear algebra (vectors as both views; cosine identity derived; matmul as composition with three views; transpose with (AB)ᵀ = BᵀAᵀ proof; rank, basis, SVD with rotate-scale-rotate; norms; tensors with (B,S,H) justification). Calculus (derivative/chain rule; gradient descent from linear approximation; full ∂/∂w MSE = -2x(y - ŷ) derivation; full softmax+CE ∂L/∂z = softmax(z) - y derivation; Jacobians). Probability (RVs/expectation/variance; Bayes derivation; MLE→MSE for Gaussian and MLE→cross-entropy for categorical both derived; KL divergence; perplexity = exp(CE)). Six worked exercises (cosine of (3,4) and (4,3); MLP shape walk-through; sigmoid-MSE gradient; spam-Bayes ≈ 0.7742; Gaussian MLE = MSE proof; KL Bernoulli(0.3, 0.5)).
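Two of the chapter's worked exercises can be checked in a few lines of Python:

```python
import math

# Cosine similarity of (3, 4) and (4, 3): dot / (|a|·|b|) = 24 / 25.
a, b = (3, 4), (4, 3)
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.hypot(*a) * math.hypot(*b))
print(round(cos, 4))  # → 0.96

# KL divergence from Bernoulli(0.3) to Bernoulli(0.5), in nats.
p, q = 0.3, 0.5
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(round(kl, 4))  # → 0.0823
```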

`02_PYTORCH_FLUENCY.md` - User-Level PyTorch

~8,200 words. Companion to AI_SYSTEMS_PLAN/DEEP_DIVES/04 (which is internals). Tensor creation/dtype/device; shape ops with view-vs-reshape contiguity; broadcasting drill; einsum vs matmul; autograd as user; nn.Module pattern; layers; loss functions with log-sum-exp stability; AdamW with parameter groups; LR schedulers (warmup → cosine recipe); Dataset/DataLoader fast-path; complete annotated training loop; AMP (BF16 autocast vs FP16 GradScaler); checkpointing (model+opt+sched+scaler+RNG); torch.compile user-level; minimum-viable DDP via torchrun; gradient checkpointing; reproducibility limits; HF transformers integration; 20-bug pitfall bestiary; six worked exercises with answer code.

`03_CLASSICAL_ML_RIGOR.md` - The Foundations Skipped at Your Peril

~9,400 words. Why this matters for LLM eval. Train/val/test discipline; loss functions derived (MSE = Gaussian MLE; MAE = Laplace MLE; cross-entropy = categorical MLE; BCE; hinge); regularization as priors (L2 = Gaussian, L1 = Laplace, AdamW vs L2-in-SGD); bias/variance with double-descent caveat; cross-validation; calibration (ECE formula, reliability diagrams, why deep nets are over-confident, temperature/Platt/isotonic-directly relevant to LLM-as-judge); confusion matrix → P/R/F1/Fβ → ROC/AUC → PR/AUPRC → log-loss/Brier with scoring-rule properties; class imbalance; classifier zoo (LR, RF, GBM, when trees win); the classical → LLM bridge; bootstrap CIs, McNemar's test, A/B sample size N ≈ 16·p(1-p)/Δ²; Bayesian alternative; honest baseline anti-pattern; six exercises with full numerics (F1/F2/F0.5; ECE on 5-bin × 100; logistic-MLE on separable data; bootstrap paired CI; temperature scaling math; A/B for 2% lift on 10% baseline → 3,600/arm at 80% power).
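The A/B sample-size rule of thumb quoted above reproduces the chapter's worked exercise (a 2-point absolute lift on a 10% baseline) directly:

```python
# Rule-of-thumb A/B sample size: N ≈ 16·p(1-p)/Δ² per arm
# (two-sided α = 0.05, 80% power, proportion metric).
def samples_per_arm(p, delta):
    return 16 * p * (1 - p) / delta**2

# Detect a 2-point absolute lift on a 10% baseline.
print(round(samples_per_arm(0.10, 0.02)))  # → 3600
```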

`04_DEEP_LEARNING_FUNDAMENTALS.md` - Backprop to AdamW

~7,400 words. Network setup; MLP forward with shapes; full backprop derivation with ∂L/∂W = δxᵀ, ∂L/∂x = Wᵀδ triplet and a 2-layer worked numerical example; vanishing/exploding gradients and ResNet skip-connection insight; activations (ReLU/GELU/SiLU/SwiGLU/softmax with diag(S) - SSᵀ Jacobian); init (Xavier 2/(n_in+n_out), He 2/n_in, GPT-2 residual scaling 1/√(2L)); optimizers fully derived (SGD → momentum → RMSprop → Adam with bias correction → AdamW with the Loshchilov-Hutter decoupled-decay derivation); LR schedules (cosine annealing formula, warmup-then-cosine recipe); normalization (BatchNorm vs LayerNorm vs RMSNorm; pre-norm vs post-norm); regularization (dropout, DropPath, weight decay, label smoothing); loss landscapes; clip-by-norm; mixed precision overview; six exercises (CE gradient derivation; 20-line Adam in PyTorch; He init for (512, 2048) ReLU = σ=1/16; NaN diagnoses; LayerNorm backward; 24-layer-transformer LR recipe).
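The He-init exercise and the cosine-annealing formula are easy to sanity-check (the learning-rate values below are illustrative, not from the chapter):

```python
import math

# He initialization for a ReLU layer: std = sqrt(2 / n_in).
# Worked exercise: a (512, 2048) linear layer has n_in = 512.
std = math.sqrt(2 / 512)
print(std)  # → 0.0625, i.e. 1/16

# Cosine annealing (no warmup): lr at step t out of T total steps.
def cosine_lr(t, T, lr_max, lr_min=0.0):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

print(cosine_lr(0, 100, 3e-4))    # → 0.0003 (starts at lr_max)
print(cosine_lr(100, 100, 3e-4))  # → 0.0 (decays to lr_min)
```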

`05_LLM_APPLICATION_PATTERNS.md` - Daily-Work Engineering

~8,300 words. LLM application lifecycle; messages-list abstraction; sampling parameters derived from softmax (temperature, top-p, top-k, frequency/presence penalties); structured outputs (4 reliability levels: prompt → JSON mode → function calling → grammar-constrained with Pydantic+instructor+outlines); tool use protocol (Anthropic vs OpenAI; multi-tool dispatch; common failure modes); streaming with SSE; Anthropic prompt caching with worked savings calculator; LiteLLM vs native SDKs; cost calculation with provider-mapping JSON; retry with exponential backoff + jitter (formula); circuit breakers; orchestration patterns (sequential, map-reduce, branch-merge, self-consistency, self-critique) with async; few-shot; CoT and reasoning models (o1/Claude thinking); DSPy paradigm; production patterns (idempotency, multi-tenancy, PII); ~80-line MVP service skeleton; six worked exercises (prompt-cache savings; Pydantic tool-loop with retry; sequential→async-parallel conversion; circuit breaker; tokenization+pricing; self-consistency).
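As a flavor of the retry material, here is a minimal sketch of exponential backoff with full jitter (one common variant; the chapter's exact formula may differ):

```python
import random

# Full-jitter backoff: sleep a uniform random amount between 0 and the
# capped exponential delay, which decorrelates retry storms across clients.
def backoff_delay(attempt, base=0.5, cap=30.0):
    return random.uniform(0, min(cap, base * 2**attempt))

for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.5 * 2**attempt):.1f}s")
```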

`06_RETRIEVAL_AND_RAG.md` - Production-Quality Retrieval

~10,500 words. Why retrieval; BM25 fully derived (TF-IDF base, full score(q,d) = Σ IDF(t)·tf(t,d)·(k1+1)/(...) formula, k1/b intuition); dense retrieval with InfoNCE contrastive loss derivation; hard-negative mining; embedding-model landscape table (illustrative); Matryoshka embeddings; cross-encoder reranking; vector indexing (HNSW algorithm walk-through, IVF, PQ, IVF-PQ, DiskANN, recall-latency curve); vector DB decision matrix; hybrid retrieval with RRF formula score(d) = Σ 1/(k + rank_i(d)); chunking (fixed, semantic, hierarchical, late chunking Jina 2024, contextual retrieval Anthropic 2024); production pipeline; query rewriting (HyDE, multi-query, step-back); eval metrics derived (Recall@k, Precision@k, MRR, NDCG with full formula); RAGAS; failure modes (lost-in-the-middle with worked example, retrieval-generation gap); multi-hop, agentic, GraphRAG; metadata filtering; production concerns (freshness, versioning, multi-tenant, citations); self-host vs API; six exercises including a customer-support RAG eval-set design (50 questions across 4 slices).
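The RRF formula quoted above is small enough to implement whole (the document IDs below are illustrative):

```python
# Reciprocal Rank Fusion: score(d) = Σ_i 1 / (k + rank_i(d)),
# summed over each ranker i that returned document d (ranks are 1-based).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
print(rrf([bm25_hits, dense_hits]))  # → ['d1', 'd3', 'd2', 'd4']
```

Documents ranked highly by both retrievers (d1, d3) fuse to the top even though neither ranker agreed on the ordering.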

`07_AGENT_RELIABILITY_ENGINEERING.md` - Distributed-Systems Lens

~10,700 words. The user's bridge chapter. Agent as state-machine fixed-point; six patterns (tool-use loop, ReAct, Plan-and-Execute, ReWOO, Reflexion, ToT) with cost-quality table; tool design craft (names, descriptions for the model, schemas, idempotency, structured errors, tool-zoo problem); distributed-systems failure taxonomy applied to agents: timeouts (cascading deadlines), retries, backpressure, partial failure, sagas with worked flight-booking example, circuit breakers (with code), bulkheads, idempotency keys; loop termination (six guards); prompt injection through tool outputs (six layered defenses; the "this is unsolved" reality); hallucinated tool calls; state management (FSM framing); multi-agent (when justified, supervisor-router skeleton); HIL checkpoints; trajectory vs outcome eval; OTel per-step observability; benchmarks (SWE-bench, GAIA, τ-bench, WebArena); cost discipline with worked 50-step ReAct ≈ $3.32/task; 23-item production checklist; six exercises (production agent in <300 LOC; circuit breaker for tool; loop detector; saga for booking; injection test case; expected cost cap).
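A minimal circuit-breaker sketch in the spirit of the chapter (illustrative, not the chapter's code): open after a run of consecutive failures, fail fast while open, and allow one probe call after a cooldown.

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; reject calls while
    open; after `cooldown` seconds, let one probe call through."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```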

`08_EVALUATION_SYSTEMS.md` - The User's Specialty

~10,500 words. Six structural reasons LLM eval is hard; full taxonomy (reference-based, reference-free, outcome, trajectory) with when-to-use table; golden dataset design (50 / 500 / 2000+ tiers; stratification; provenance; versioning; rotating holdout); LLM-as-judge (single-grade, pairwise, reference-augmented; four documented biases-position, length, verbosity, self-preference-with mitigations; rubric-decomposed prompt; calibration with κ ≥ 0.6 rule); statistical power (N ≈ 4·p(1-p)/Δ², worked p=0.7 Δ=0.01 → N≈8400, McNemar paired, bootstrap, multi-comparison); Cohen's κ derived from scratch with implementation; Fleiss; Krippendorff; Landis-Koch interpretation; eval-driven workflow; regression detection (per-example flips, slice analysis, avg-tide trap); offline/online/counterfactual/A/B; task-specific stacks (classification with calibration, summarization with rubric, RAG with RAGAS, agents with outcome+trajectory, code with full pass@k formula and worked n=20, c=3, k=5 → 0.6008, MT-Bench); eval-of-eval; tool landscape (Inspect AI, Braintrust, Langfuse, LangSmith, OpenAI Evals, Promptfoo, RAGAS); eval-set lifecycle v0/v1/v2+; A/B testing depth with (z_{α/2}+z_β)²·2p(1-p)/Δ² derivation; hidden costs; twelve named anti-patterns; six exercises including the Q4 capstone eval set design.
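The pass@k formula and its worked example can be reproduced directly:

```python
from math import comb

# Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k),
# where n samples were drawn and c of them passed the tests.
def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Worked example from the chapter: n=20 samples, c=3 correct, k=5.
print(f"{pass_at_k(20, 3, 5):.4f}")  # → 0.6009
```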

`09_LLM_OBSERVABILITY.md` - The User's Unique Moat

~10,000 words. Five derived properties of LLM observability (non-determinism, graded quality, cost variance, prompt-as-artifact, fan-out); four golden signals, LLM edition (TTFT/TPOT/total; tokens/sec; tri-class errors; provider saturation); OTel GenAI semantic conventions with full gen_ai.* attribute table and code skeleton; span design (tree-not-line, streaming rule with derivation, agentic example); cost attribution (pricing JSON, prompt-version regression pattern, cardinality trap with three safer alternatives); latency breakdown (RTT/queue/prefill/decode); token usage (cache + reasoning tokens, hit-rate, per-conversation); sampling (tail-based, skeleton-vs-content tiers, OTel Collector config); privacy/PII (three redaction layers, GDPR right-to-deletion); drift detection (input/output/quality with KL formula); production debugging (replay() primitive, prompt diffs, A/B traceability); the SRE bridge (SLIs, SLOs as YAML, error budgets, multi-burn-rate alerting); tool landscape; five production runbooks; custom dashboard layout as the portfolio artifact; from-scratch ~50-line @trace_llm_call decorator; six exercises; appendix with attribute tables, metric catalog, and SRE-to-AI translation card.
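A toy version of the idea behind the chapter's `@trace_llm_call` decorator, reduced to its skeleton (the `emit` sink is a stand-in; the chapter's real version targets OTel spans and gen_ai.* attributes):

```python
import functools
import time
import uuid

def trace_llm_call(emit=print):
    """Wrap a call, time it, and emit one span-like record per invocation."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"span_id": uuid.uuid4().hex[:16], "name": fn.__name__}
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                span["status"] = "ok"
                return result
            except Exception as e:
                span["status"] = f"error: {type(e).__name__}"
                raise
            finally:  # emit on success and failure alike
                span["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
                emit(span)
        return wrapper
    return decorator

@trace_llm_call()
def fake_completion(prompt):
    return prompt.upper()

print(fake_completion("hello"))  # emits one span record, then prints HELLO
```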

`10_FINE_TUNING_SFT_TO_RLHF.md` - The Math, Derived End-to-End

~10,500 words. Decision matrix (prompt vs RAG vs FT); SFT mechanics with prompt-masking; catastrophic forgetting and five mitigations; LoRA full derivation (low-rank decomposition, init asymmetry, α/r scaling, target modules, memory math: Llama-7B Q+V at r=16 = 0.12% trainable, inference merge vs multi-LoRA serving); QLoRA full derivation (NF4 quantile derivation, double quantization, paged optimizers, 70B-on-48GB worked budget); RLHF concepts; Lagrangian derivation of the KL-constrained optimal policy π*(y|x) ∝ π_SFT(y|x) · exp(r(x,y)/β); PPO clip mechanics; DPO full derivation as the chapter centerpiece-every step from invert (★) → substitute into Bradley-Terry → cancel β·log Z(x) → take NLL → final loss → gradient interpretation; GRPO; reward models with Bradley-Terry; reward-hacking failure modes; preference data curation with κ; Constitutional AI / RLAIF; frontier-scale FT (cross-ref to AI_SYSTEMS); full-FT vs LoRA decision; eval discipline; end-to-end workflow; six exercises all numerically worked (Llama-7B params, full DPO derivation, 70B/48GB budget, preference-data spec, FT eval matrix, GRPO advantages for [0.8, 0.5, 0.3, 0.7] → [+1.17, -0.39, -1.43, +0.65]).
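The GRPO worked exercise normalizes each reward against the group mean and (population) standard deviation, which reproduces the quoted advantages:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: (r - mean) / population std."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / sigma for r in rewards]

# Worked exercise from the chapter:
print([round(a, 2) for a in grpo_advantages([0.8, 0.5, 0.3, 0.7])])
# → [1.17, -0.39, -1.43, 0.65]
```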

`11_MULTIMODAL_FOUNDATIONS.md` - Patching the Text-Only Gap

~11,000 words. Why multimodal matters (2026+ frontier models are natively multimodal); vision encoders (CNN context → ViT end-to-end with patch arithmetic); CLIP with full contrastive-loss derivation; four fusion families (late/cross-attention/early/native) with decision matrix; LLaVA architecture in operational detail; Whisper architecture (mel-spectrogram, encoder/decoder, multitask); diffusion fundamentals (DDPM forward/reverse/loss derived; latent diffusion; classifier-free guidance derivation); video models brief; multimodal eval challenges (POPE for hallucination, FID/CLIP-Score for generation); production patterns (document understanding, VQA, ASR, speech-to-speech, image generation); cost economics (hedged); engineering integration (preprocessing, tokens-per-image quirk); open-weights landscape (Llama 3.2 Vision, Pixtral, Qwen2-VL, InternVL2, FLUX, Whisper, Parakeet, Moshi); multimodal RAG (CLIP, ColPali); six exercises (patch counting, ~25-line CLIP loss, T=3 diffusion walk, 200-page PDF RAG design, image-cost comparison, hallucination root causes); 12-week study path appendix.
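The patch-counting exercise follows standard ViT arithmetic (a 224×224 RGB image with 16×16 patches is the classic ViT-Base configuration):

```python
# ViT patch arithmetic: an (H, W) image with patch size P yields
# (H // P) * (W // P) patches, each flattened to P*P*C values.
def vit_patches(h, w, p, c=3):
    n_patches = (h // p) * (w // p)
    patch_dim = p * p * c
    return n_patches, patch_dim

print(vit_patches(224, 224, 16))  # → (196, 768)
```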

`12_AI_SAFETY_AND_RED_TEAMING.md` - Production Defense Engineering

~8,200 words. Threat model (trusted code, untrusted data, untrusted user); four threat categories + DoS; direct vs indirect prompt injection (the dominant 2024+ threat with real incidents-Bing Sidney 2023, Slack RAG 2024, Greshake et al. 2023); jailbreak categories (persona, encoding, multi-turn, many-shot Anthropic 2024, payload smuggling, adversarial suffixes Zou et al. 2023, visual jailbreaks); mathematical limits on perfect defense; defenses-input filtering (Llama Guard, ShieldGemma, NVIDIA NeMo); output filtering; structural (separate trust planes, tool-output delimiters, capability gating, Spotlighting Microsoft 2024); constrained decoding (eliminates entire injection classes); guardrails frameworks; red-teaming (PyRIT, Garak; manual + automated + continuous); harms taxonomy (CBRN, privacy, bias); audit logging with GDPR tension; incident response; dual-use problem (helpfulness vs safety; refusal-rate <2%, harmful-compliance <1%); governance (NIST AI RMF, EU AI Act, ISO/IEC 42001, model cards, system cards); production safety checklist (12 items); six exercises (Llama-Guard input filter, indirect-injection test cases, audit-log schema, constrained-decoding for JSON, Garak red-team run, model card draft).

`13_AI_FOR_SRE_BRIDGE.md` - The Unique Moat the Curriculum Underweights

~7,900 words. The thesis (AI-applied-to-SRE is underserved); the user's existing assets named explicitly (production-incident intuition, telemetry literacy, distributed-systems instincts, CI/CD discipline, customer-of-AI experience); 2026 job market for the bridge (AIOps vendors, LLM observability vendors, frontier-lab platform/SRE roles, internal AI platforms, AI-for-DevOps startups); eight problem patterns where AI helps SRE; six fully-developed patterns (incident triage, RCA, postmortem agent, NL-to-query observability, code-change risk, anomaly-detection augmentation) with architecture sketches and walk-throughs; reusing the Bamboo+Datadog plugin as case-study substrate (eval data > the plugin code); novel observability questions LLM systems raise (SLI for LLM service; error budget for graded outputs; canary deploys for prompts; rollback unit; change management for prompts); Datadog → LLM-observability migration playbook; interview-grade positioning narrative; non-obvious advice; 90-day side project that demonstrates the bridge; six exercises (SLOs for LLM-triage; canary playbook; NL-to-query eval set; RCA architecture; conference talk abstract; year-2 roadmap).

`14_FUTURE_PROOFING_AND_AUDIT.md` - The Operating Manual

~8,100 words. Three-tier durability framework (Spine 10+ year / Stable 4-7 year / Ephemeral 1-3 year) with 60/25/15 study-time split; per-sequence durability audit covering all 17 existing sequences with refresh cadences; daily/weekly/monthly/quarterly/semi-annual/yearly playbooks; tripwires (tooling, field, career, personal); field-velocity sources (curated by pattern, not just current names); 6/12/24-month milestones; pivot signals table with thresholds and responses; spine investments that survive pivots; ephemeral decay table; cross-curriculum stack (this curriculum + RUST + GO + LINUX + CONTAINER + KUBERNETES + AI_SYSTEMS) with the bet and the hedge; six future scenarios A-F (spec extension, foundation-model commoditization, new paradigm, hardware shift, regulatory shift, demand contraction) explicitly framed as scenarios not predictions; 11-item annual audit checklist; the honest meta-question; eight curriculum-staleness anti-patterns; six yearly exercises; appendices (durability tag legend, quarterly audit template, yearly decision record).


Anti-Fabrication Compliance

Every chapter authored under explicit anti-fabrication rules:

  • Citable items stated unhedged (Cohen's κ formula, BM25 formula, RRF, NDCG, pass@k, AdamW decoupled-decay insight, AWQ/GPTQ/DPO are real papers, OTel GenAI semantic conventions exist, etc.).
  • Approximate numbers explicitly hedged with "~" or "as of ~2025; verify".
  • Specific tool features prefer general descriptions; version-dependent specifics flagged.
  • Real incidents (Bing Sidney, Slack RAG, Greshake) cited by year with rough description.
  • Future scenarios framed as scenarios, never predictions.
  • No invented benchmark scores.

How This Layer Connects to the Broader Repository

This DEEP_DIVES set complements two others in the same repository:

  • AI_SYSTEMS_PLAN/DEEP_DIVES/ (11 chapters): the systems track (GPU programming, CUDA/Triton, framework internals, distributed training math, inference serving, quantization, numerics). Where this curriculum is applications-first, that one is systems-first. The two are siblings, not competitors.
  • LINUX/, CONTAINER_INTERNALS_PLAN/, KUBERNETES_PLAN/: the substrate (host, image, orchestration). When a tutorial weekly project ships a service, those curricula tell you how to deploy it.

Cross-references in the chapters point at specific deep dives in the systems track (e.g., chapter 10 references AI_SYSTEMS/06 for distributed-training math; chapter 04 references AI_SYSTEMS/11 for numerical stability).


How to Use This Resource

Curriculum companion: read the sequence/week first, then the matching deep dive, then return to the lab with both as references.

Standalone reference: tab open during work; jump by topic.

Interview prep: each chapter's exercises are at senior-level interview depth.

Teaching resource: each chapter is a self-contained lecture's worth of material; use to onboard a teammate to a sub-topic in one afternoon.

Year 2+ refresh anchor: chapter 14 tells you when and how to refresh; chapters 1-13 are what you refresh.
