Appendix C - Deep-Dive Session: CPython Internals and the AI Runtime Stack
This is the single-sit deep dive the curriculum promises. Schedule a full day (8 hours, with breaks) at the end of Month 3 (after the runtime chapter) and re-read it at the end of Month 6. The goal: see clearly through every layer between print("hello") and the silicon.
The format is six "stations." Each station has: what to read, what to run, what you should be able to explain afterward.
Station 1 - From python foo.py to a Frame on the Stack (90 min)
Read:
- Python/pythonrun.c::_PyRun_SimpleFileObject - the entry path.
- Python/compile.c (skim) - AST → bytecode.
- Include/internal/pycore_frame.h - frame layout in 3.11+.
- Python/ceval.c::_PyEval_EvalFrameDefault - the interpreter loop.
Run:
python -c "
import dis, sys
def f(x): return x*x + 1
dis.dis(f, adaptive=True)
print(f.__code__.co_consts, f.__code__.co_names, f.__code__.co_varnames)
"
Explain afterwards:
- Why LOAD_FAST is faster than LOAD_GLOBAL (see the sketch after this list).
- What "specialization" means in PEP 659 and how to observe it.
- The lifecycle of a frame object - when it's allocated, when it's freed, why exception tracebacks pin frames.
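To see the first bullet directly, here is a minimal sketch (the two function names are just illustrative):

import dis

G = 1

def uses_global():
    return G + 1        # compiles to LOAD_GLOBAL - a name lookup (inline-cached in 3.11+)

def uses_local():
    g = 1
    return g + 1        # compiles to LOAD_FAST - a plain index into the frame's locals array

dis.dis(uses_global)
dis.dis(uses_local)

Compare the two listings: the global access goes through a dictionary lookup path, the local access is a direct array read.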
Station 2 - Memory: Refcount, Cyclic GC, pymalloc, Arenas (75 min)
Read:
- Include/object.h - PyObject header.
- Objects/obmalloc.c - the small-object allocator.
- Modules/gcmodule.c / Python/gc.c - the cyclic GC.
Run:
import sys, gc, tracemalloc
tracemalloc.start()
xs = [object() for _ in range(10_000)]
print(sys.getsizeof(xs), sum(sys.getsizeof(x) for x in xs))
print(tracemalloc.get_traced_memory())  # (current, peak) bytes traced since start()
print(gc.get_count(), gc.get_threshold())
Explain:
- Why del xs deterministically frees memory but gc.collect() is needed for cycles.
- Why __slots__ cuts per-instance memory substantially - it removes the per-instance __dict__ (measure it with the sketch after this list).
- The interaction between refcounts and the free-threaded build's "biased reference counting."
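To put a number on the __slots__ saving yourself, a minimal sketch (the Plain/Slotted classes and bytes_per_instance helper are illustrative, not from the curriculum):

import tracemalloc

class Plain:
    def __init__(self):
        self.a, self.b = 1, 2

class Slotted:
    __slots__ = ("a", "b")
    def __init__(self):
        self.a, self.b = 1, 2

def bytes_per_instance(cls, n=100_000):
    tracemalloc.start()
    objs = [cls() for _ in range(n)]           # hold references so nothing is freed early
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current // n                         # rough bytes per instance (includes the list itself)

print("plain:", bytes_per_instance(Plain), "slotted:", bytes_per_instance(Slotted))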
Station 3 - The GIL and Its Successors (60 min)
Read:
- Python/ceval_gil.c - the GIL implementation.
- PEP 703 (no-GIL) and PEP 684 (per-interpreter GIL).
- The "biased reference counting" paper (Choi et al.).
Run:
Reproduce the prime-counting benchmark from Month 4, Week 11.
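If the Month 4 code is not at hand, a minimal CPU-bound stand-in works just as well (the exact benchmark there may differ; count_primes here is illustrative):

import time
from concurrent.futures import ThreadPoolExecutor

def count_primes(n):
    # deliberately naive trial division - pure-Python, CPU-bound
    return sum(1 for k in range(2, n)
               if all(k % d for d in range(2, int(k ** 0.5) + 1)))

N, WORKERS = 30_000, 4

t0 = time.perf_counter()
for _ in range(WORKERS):
    count_primes(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(WORKERS) as ex:
    list(ex.map(count_primes, [N] * WORKERS))
threaded = time.perf_counter() - t0

print(f"serial {serial:.2f}s  threaded {threaded:.2f}s")

On the stock GIL build the two timings land close together; on python3.13t the threaded run should scale with cores.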
Explain:
- Why a NumPy-heavy ThreadPoolExecutor scales on stock CPython.
- What changes for pure-Python code under python3.13t.
- When subinterpreters beat both threads and processes.
Station 4 - asyncio Internals (75 min)
Read:
- Lib/asyncio/base_events.py - the loop's run_forever.
- Lib/asyncio/tasks.py - Task machinery, cancellation.
- Modules/_asynciomodule.c - the C accelerator (Tasks, Futures).
- The selectors module - epoll/kqueue glue.
Run:
import asyncio, time

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05   # warn when a callback runs longer than 50 ms
    time.sleep(0.2)                      # deliberately stall the loop with a blocking call

asyncio.run(main(), debug=True)          # debug mode is what enables the slow-callback warning
Watch the warning fire.
Explain:
- The exact path from await coro to a callback scheduled on the loop.
- How Task cancellation delivers CancelledError precisely at the next await (sketch below).
- Why uvloop is faster (libuv, C event loop, fewer Python frames per I/O).
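A minimal sketch of that delivery point (worker/main are illustrative names):

import asyncio

async def worker():
    try:
        await asyncio.sleep(10)        # CancelledError is raised here, at the await
    except asyncio.CancelledError:
        print("cancelled at the await point")
        raise                           # re-raise so the Task ends up in the cancelled state

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0.1)            # let the worker reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        print("task reports cancelled")

asyncio.run(main())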
Station 5 - NumPy and the Buffer Protocol (60 min)
Read:
- PEP 3118 - buffer protocol.
- NumPy's numpy/core/src/multiarray/arrayobject.c (skim).
- The strides/shape/dtype model: NumPy User Guide → Internals.
Run:
import numpy as np
a = np.arange(12).reshape(3, 4)
print(a.strides, a.flags['C_CONTIGUOUS'])
b = a.T
print(b.strides, b.flags['C_CONTIGUOUS'])
mv = memoryview(a)
print(mv.format, mv.itemsize, mv.shape, mv.strides)
Explain:
- Why a transpose is O(1) - it changes strides, not data.
- Why a.T.copy() is sometimes necessary before passing to a C library.
- How the buffer protocol lets bytes, array.array, numpy.ndarray, and torch.Tensor share memory without copies (sketch below).
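A quick way to see that sharing in practice (a small sketch; the PyTorch part is optional and only runs if torch is installed):

import array
import numpy as np

buf = array.array("d", [0.0, 1.0, 2.0, 3.0])
a = np.frombuffer(buf, dtype=np.float64)   # a view over array.array's buffer - no copy
a[0] = 42.0
print(buf[0])                               # 42.0 - same memory

try:
    import torch
    t = torch.from_numpy(a)                 # shares that same buffer again
    t[1] = 99.0
    print(buf[1])                           # 99.0
except ImportError:
    pass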
Station 6 - PyTorch, Autograd, CUDA Streams (90 min)
Read:
- PyTorch internals (Edward Yang's blog).
- torch.autograd overview docs; the Function / Variable machinery.
- vLLM PagedAttention paper (sets up serving questions in Month 6).
Run:
import torch
x = torch.randn(4, 4, requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)
print(torch.cuda.is_available(), torch.cuda.current_stream() if torch.cuda.is_available() else None)
Explain:
- The autograd tape: forward builds the graph, backward walks it.
- Why .detach() and with torch.no_grad(): matter for inference latency (sketch below).
- CPU↔GPU synchronization: when .item() blocks, why torch.cuda.synchronize() exists.
- How vLLM's PagedAttention reduces KV-cache fragmentation and why that translates directly to throughput.
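A tiny sketch of the no_grad point - the same math with and without the tape (no GPU needed):

import torch

x = torch.randn(512, 512, requires_grad=True)

y = x @ x                                 # forward under autograd: a graph node is recorded
print(y.grad_fn)                          # e.g. <MmBackward0 ...> - the tape is there

with torch.no_grad():
    z = x @ x                             # same computation, no graph, no extra memory held
print(z.grad_fn, z.requires_grad)         # None False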
Synthesis: The Mental Model
After this deep dive, hold this picture:
your code ──► AST ──► bytecode ──► eval loop (specializing) ──► C function ──► syscall / GPU kernel
  │                     │            │                            │                      │
  └─ ruff/pyright       └─ dis       ├─ py-spy                    └─ py-spy --native     └─ nsys / nvprof
                                     ├─ GIL / free-threaded
                                     └─ refcount / GC
Every performance question maps to one of those columns. Every correctness question maps to a boundary between two of them. That is what "senior" looks like.