
Appendix C - Deep-Dive Session: CPython Internals and the AI Runtime Stack

This is the single-sit deep dive the curriculum promises. Schedule a full day (8 hours, with breaks) at the end of Month 3 (after the runtime chapter) and re-read it at the end of Month 6. The goal: see clearly through every layer between print("hello") and the silicon.

The format is six "stations." Each station has: what to read, what to run, what you should be able to explain afterward.


Station 1 - From python foo.py to a Frame on the Stack (90 min)

Read:
- Python/pythonrun.c::_PyRun_SimpleFileObject - the entry path.
- Python/compile.c (skim) - AST → bytecode.
- Include/internal/pycore_frame.h - frame layout in 3.11+.
- Python/ceval.c::_PyEval_EvalFrameDefault - the interpreter loop.

Run:

python -c "
import dis, sys
def f(x): return x*x + 1
dis.dis(f, adaptive=True)
print(f.__code__.co_consts, f.__code__.co_names, f.__code__.co_varnames)
"

Explain afterwards:
- Why LOAD_FAST is faster than LOAD_GLOBAL.
- What "specialization" means in PEP 659 and how to observe it.
- The lifecycle of a frame object - when it's allocated, when it's freed, why exception tracebacks pin frames.
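The first bullet can be made concrete without reading any C. A minimal sketch (the function names are mine): `LOAD_FAST` is an array index into the frame's fast-locals, while `LOAD_GLOBAL` must probe the globals and builtins dicts.

```python
import dis
import timeit

def use_local():
    n = len  # bind the builtin to a local name once
    for _ in range(1000):
        n("abc")  # LOAD_FAST: indexed read from the frame's fast-locals array

def use_global():
    for _ in range(1000):
        len("abc")  # LOAD_GLOBAL: dict probe in globals, then builtins

# Confirm which opcodes the compiler actually emitted.
for f in (use_local, use_global):
    ops = {i.opname for i in dis.get_instructions(f)}
    print(f.__name__, sorted(op for op in ops if op.startswith("LOAD")))

# Time both; the local-bound version is typically somewhat faster,
# though the adaptive interpreter (PEP 659) has narrowed the gap.
t_local = timeit.timeit(use_local, number=500)
t_global = timeit.timeit(use_global, number=500)
print(f"local: {t_local:.3f}s  global: {t_global:.3f}s")
```

Running `dis.dis` on either function with `adaptive=True` after a few thousand calls also shows the specialized variants PEP 659 installs.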


Station 2 - Memory: Refcount, Cyclic GC, pymalloc, Arenas (75 min)

Read:
- Include/object.h - the PyObject header.
- Objects/obmalloc.c - the small-object allocator.
- Modules/gcmodule.c / Python/gc.c - the cyclic GC.

Run:

import sys, gc, tracemalloc
tracemalloc.start()
xs = [object() for _ in range(10_000)]
print(sys.getsizeof(xs), sum(sys.getsizeof(x) for x in xs))
print(gc.get_count(), gc.get_threshold())

Explain:
- Why del xs deterministically frees memory but gc.collect() is needed for cycles.
- Why __slots__ saves ~40% per instance.
- The interaction between refcounts and the free-threaded build's "biased reference counting."
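The refcount-vs-cycles distinction is directly observable with weakrefs. A small sketch (class and variable names are mine; gc is disabled briefly so an automatic collection can't confuse the experiment):

```python
import gc
import weakref

class Node:
    pass

# Acyclic case: when the last reference goes away, the refcount hits zero
# and the object is freed immediately - no collector involved.
n = Node()
ref = weakref.ref(n)
del n
print(ref() is None)  # True

# Cyclic case: the refcounts can never reach zero, so only the cyclic GC
# can reclaim the pair.
gc.disable()  # rule out an automatic collection mid-experiment
a, b = Node(), Node()
a.other, b.other = b, a
ref = weakref.ref(a)
del a, b
print(ref() is None)  # False - the cycle keeps both objects alive
gc.collect()
print(ref() is None)  # True - the collector broke the cycle
gc.enable()
```

The same pattern explains why long-lived cycles through `__del__`-bearing objects or caches show up as "memory that `del` doesn't release."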


Station 3 - The GIL and Its Successors (60 min)

Read:
- Python/ceval_gil.c - the GIL implementation.
- PEP 703 (no-GIL) and PEP 684 (per-interpreter GIL).
- The "biased reference counting" paper (Choi et al.).

Run:

# stock
python -c "import threading, time; ..."
# free-threaded (3.13+)
python3.13t -c "..."

Reproduce the prime-counting benchmark from Month 4, Week 11.

Explain:
- Why a NumPy-heavy ThreadPoolExecutor scales on stock CPython.
- What changes for pure-Python code under python3.13t.
- When subinterpreters beat both threads and processes.
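A minimal version of such a CPU-bound benchmark (the limit and worker count are my own choices, not the Month 4 parameters) makes the contrast observable on whichever build runs it: pure-Python work holds the GIL, so on a stock build the threaded run takes about as long as the serial one.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def count_primes(limit):
    # Pure-Python, CPU-bound: never releases the GIL on a stock build.
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

LIMIT, WORKERS = 10_000, 4

t0 = time.perf_counter()
serial = sum(count_primes(LIMIT) for _ in range(WORKERS))
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(WORKERS) as ex:
    threaded = sum(ex.map(count_primes, [LIMIT] * WORKERS))
t_threaded = time.perf_counter() - t0

# sys._is_gil_enabled() exists on 3.13+; fall back to True on older builds.
gil = getattr(sys, "_is_gil_enabled", lambda: True)()
print(f"GIL enabled: {gil}")
print(f"serial: {t_serial:.2f}s  threaded: {t_threaded:.2f}s")
# Stock build: threaded ≈ serial. python3.13t: threaded approaches
# a WORKERS-fold speedup (modulo biased-refcounting overhead).
```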


Station 4 - asyncio Internals (75 min)

Read:
- Lib/asyncio/base_events.py - the loop's run_forever.
- Lib/asyncio/tasks.py - Task machinery, cancellation.
- Modules/_asynciomodule.c - the C accelerator (Tasks, Futures).
- The selectors module - epoll/kqueue glue.

Run:

import asyncio, sys
async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.05
    # deliberately stall
    import time; time.sleep(0.2)
asyncio.run(main(), debug=True)  # slow-callback warnings only fire in debug mode

Watch the warning fire.

Explain:
- The exact path from await coro to a callback scheduled on the loop.
- How Task cancellation delivers CancelledError precisely at the next await.
- Why uvloop is faster (libuv, C event loop, fewer Python frames per I/O).
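The cancellation bullet is worth seeing live. A small sketch (names are mine): `cancel()` does not interrupt running code, it arranges for `CancelledError` to be raised inside the coroutine at its next suspension point.

```python
import asyncio

async def worker(log):
    log.append("started")
    try:
        await asyncio.sleep(10)  # cancellation is delivered here, at the await
    except asyncio.CancelledError:
        log.append("cancelled at the await point")
        raise  # re-raise so the Task ends up cancelled, not "finished"

async def main():
    log = []
    task = asyncio.create_task(worker(log))
    await asyncio.sleep(0)  # yield once so the worker runs to its first await
    task.cancel()           # schedules CancelledError for the next suspension
    try:
        await task
    except asyncio.CancelledError:
        pass
    print(log, task.cancelled())  # ['started', 'cancelled at the await point'] True

asyncio.run(main())
```

Swallowing the exception instead of re-raising is the classic bug: the Task then reports a normal result and structured-concurrency helpers like `asyncio.timeout` misbehave.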


Station 5 - NumPy and the Buffer Protocol (60 min)

Read:
- PEP 3118 - the buffer protocol.
- NumPy's numpy/core/src/multiarray/arrayobject.c (skim).
- The strides/shape/dtype model: NumPy User Guide → Internals.

Run:

import numpy as np
a = np.arange(12).reshape(3, 4)
print(a.strides, a.flags['C_CONTIGUOUS'])
b = a.T
print(b.strides, b.flags['C_CONTIGUOUS'])
mv = memoryview(a)
print(mv.format, mv.itemsize, mv.shape, mv.strides)

Explain:
- Why a transpose is O(1) - it changes strides, not data.
- Why a.T.copy() is sometimes necessary before passing to a C library.
- How the buffer protocol lets bytes, array.array, numpy.ndarray, and torch.Tensor share memory without copies.
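The zero-copy sharing in the last bullet needs nothing beyond the standard library to demonstrate: array.array exports its buffer, and memoryview borrows it. A small sketch:

```python
import array

# array.array implements the buffer protocol; memoryview borrows its
# memory without copying.
a = array.array("i", range(8))
mv = memoryview(a)
print(mv.format, mv.itemsize, mv.nbytes)  # e.g. i 4 32 (itemsize is platform-dependent)

# Writes through the view hit the array's memory directly - proof there
# is exactly one buffer.
mv[0] = 99
print(a[0])  # 99

# cast() reinterprets the same buffer at byte granularity, still zero-copy;
# bytes(mv), by contrast, would materialize a copy.
raw = mv.cast("B")
print(len(raw) == a.itemsize * len(a))  # True
```

`numpy.ndarray` and `torch.Tensor` plug into the same protocol, which is why `np.frombuffer(...)` and `torch.frombuffer(...)` can wrap existing memory instead of copying it.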


Station 6 - PyTorch, Autograd, CUDA Streams (90 min)

Read:
- PyTorch internals (Edward Yang's blog).
- torch.autograd overview docs; the Function / Variable machinery.
- The vLLM PagedAttention paper (sets up serving questions in Month 6).

Run:

import torch
x = torch.randn(4, 4, requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad)
print(torch.cuda.is_available(), torch.cuda.current_stream() if torch.cuda.is_available() else None)

Explain:
- The autograd tape: forward builds the graph, backward walks it.
- Why .detach() and with torch.no_grad(): matter for inference latency.
- CPU↔GPU synchronization: when .item() blocks, why torch.cuda.synchronize() exists.
- How vLLM's PagedAttention reduces KV-cache fragmentation and why that translates directly to throughput.
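The tape idea fits in a few lines of pure Python. This is a toy model, not torch's actual machinery (torch records C++ `Function` nodes and walks them topologically), but the shape is the same: every forward op records its inputs and local derivatives; backward walks those records, applying the chain rule.

```python
class Scalar:
    """Toy reverse-mode autodiff value: forward ops record a tape."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # (parent, local_gradient) pairs

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Scalar(self.value * other.value,
                      parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Scalar(self.value + other.value,
                      parents=((self, 1.0), (other, 1.0)))

    def backward(self, seed=1.0):
        # Walk the recorded graph, accumulating chain-rule contributions.
        self.grad += seed
        for parent, local in self._parents:
            parent.backward(seed * local)

x = Scalar(3.0)
y = x * x + x  # forward pass builds the graph
y.backward()   # backward pass walks it
print(y.value, x.grad)  # 12.0 7.0, since d/dx(x² + x) = 2x + 1
```

Torch's `.detach()` corresponds to dropping the `parents` record, and `torch.no_grad()` to never recording it - which is exactly why both cut inference overhead.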


Synthesis: The Mental Model

After this deep dive, hold this picture:

   your code ──► AST ──► bytecode ──► eval loop (specializing) ──► C function ──► syscall / GPU kernel
        │              │                    │                          │                    │
        └─ ruff/pyright └─ dis              └─ py-spy                   └─ py-spy --native   └─ nsys / nvprof
                                            └─ GIL / free-threaded                          
                                            └─ refcount / GC

Every performance question maps to one of those columns. Every correctness question maps to a boundary between two of them. That is what "senior" looks like.
