Week 9 - PyTorch Internals: Tensor, Dispatcher, ATen
9.1 Conceptual Core
- PyTorch is a layered system:
  - Python frontend: the `torch.*` namespace, what users write.
  - Dispatcher: routes ops to backend implementations based on device, dtype, layout, autograd state, and other "keys."
  - ATen: the C++ tensor library. Each op (`add`, `matmul`, `softmax`) has device-specific implementations (CPU, CUDA, MPS, XPU).
  - Backends: cuBLAS, cuDNN, oneDNN, custom kernels.
- Every Python tensor op is, fundamentally, a dispatcher call: `a + b` → `torch.add(a, b)` → `aten::add` → CPU/CUDA add kernel (see the sketch after this list). Understanding this is the foundation for the rest of the month.
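A quick way to watch this from Python is `__torch_dispatch__`, which sits below autograd and observes the `aten::` ops the dispatcher actually routes. A minimal sketch (the `LogOps` name is made up for illustration):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LogOps(TorchDispatchMode):
    """Print every aten op that reaches the dispatcher's Python key."""
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print(func)                       # an OpOverload, e.g. aten.add.Tensor
        return func(*args, **(kwargs or {}))

a = torch.randn(4, 4)
b = torch.randn(4, 4)
with LogOps():
    a + b               # prints aten.add.Tensor
    torch.matmul(a, b)  # prints the underlying GEMM op (aten.mm for 2-D inputs)
```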
9.2 Mechanical Detail
- Read `aten/src/ATen/core/dispatch/Dispatcher.h` and `DispatchKey.h`. The `DispatchKey` enum names every backend and every layer (autograd, autocast, named tensors, vmap, ...).
- Dispatch keys stack: a tensor's "key set" determines which dispatcher entries fire and in what order - AutocastCUDA → AutogradCUDA → CUDA, for example.
- The `TORCH_LIBRARY` macro (backed by `torch::Library`) registers ops; a Python-side sketch follows this list.
- The Python tensor object is a thin wrapper around `at::Tensor`, which is a thin wrapper around `c10::TensorImpl`, which holds a `c10::Storage` plus view metadata (sizes, strides, offset, dtype, device).
- Strides are critical. A "tensor view" (transpose, slice, narrow) shares storage but rewrites strides. The dispatcher and most ops handle strided tensors transparently; some kernels require contiguous input (`tensor.contiguous()`). The second sketch below demonstrates this.
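In C++, registration goes through `TORCH_LIBRARY` / `TORCH_LIBRARY_IMPL`; the same machinery is reachable from Python via `torch.library`. A minimal sketch, with a made-up namespace `mylib` and op `scaled_add`:

```python
import torch

# Define a new op schema in a fresh namespace, then register per-device kernels.
lib = torch.library.Library("mylib", "DEF")   # hypothetical namespace
lib.define("scaled_add(Tensor a, Tensor b, float alpha) -> Tensor")

def scaled_add(a, b, alpha):
    return a + alpha * b

lib.impl("scaled_add", scaled_add, "CPU")     # kernel for the CPU dispatch key
lib.impl("scaled_add", scaled_add, "CUDA")    # same Python fn serves CUDA tensors

x = torch.ones(3)
y = torch.full((3,), 2.0)
print(torch.ops.mylib.scaled_add(x, y, 0.5))  # tensor([2., 2., 2.])
```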
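And a view/strides demonstration: a transpose shares storage and only rewrites strides, while `.contiguous()` materializes a compact copy in new storage.

```python
import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
v = x.t()                              # transpose: a view, no copy

print(x.data_ptr() == v.data_ptr())    # True  -> same storage
print(x.stride(), v.stride())          # (4, 1) (1, 4) -> only strides changed
print(v.is_contiguous())               # False

c = v.contiguous()                     # materializes a compact copy
print(c.data_ptr() == x.data_ptr())    # False -> new storage
print(c.stride())                      # (3, 1)
```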
9.3 Lab - "Trace an Op"
- From Python, run `a + b` for two CUDA tensors. Use `TORCH_SHOW_DISPATCH_TRACE=1` (or `torch._C._dispatch_print_registrations()`) to see the dispatcher's path. A driver script for this step follows the list.
- Read `aten/src/ATen/native/cuda/BinaryOps.cu` - find the actual CUDA kernel for add.
- Trace `torch.matmul(a, b)` similarly. Note that for BF16 it routes to cuBLAS.
- Document the call chain in `TRACE.md`.
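A minimal driver for the lab, assuming a CUDA device is available; the file name `trace_add.py` is arbitrary, and the env-var trace may only be compiled into debug builds:

```python
# trace_add.py - run as:  TORCH_SHOW_DISPATCH_TRACE=1 python trace_add.py
# If nothing extra prints (the trace may be debug-build only), the LogOps mode
# from 9.1 shows the same aten-level path from Python.
import torch

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")

c = a + b                                     # aten::add.Tensor -> CUDA elementwise kernel
d = torch.matmul(a.bfloat16(), b.bfloat16())  # aten::matmul -> aten::mm -> cuBLAS GEMM
torch.cuda.synchronize()                      # make sure the kernels actually ran
```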
9.4 Idiomatic & Diagnostic Drill
- Run `torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA])` with `record_shapes=True` and `with_stack=True`. Read the table; identify any op spending more than 5% of total time. An example run is sketched below.
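A sketch of the drill, using a throwaway two-layer MLP as the workload (the model and sizes are placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).to(device)
x = torch.randn(64, 1024, device=device)

activities = [ProfilerActivity.CPU] + (
    [ProfilerActivity.CUDA] if device == "cuda" else []
)
with profile(activities=activities, record_shapes=True, with_stack=True) as prof:
    for _ in range(10):
        model(x)

# Sort by self time on the device and scan for any op above ~5% of the total.
sort_key = "self_cuda_time_total" if device == "cuda" else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))
```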
9.5 Production Slice
- Add `torch.cuda.synchronize()` discipline: every benchmark must synchronize before reading the clock, at both the start and the end of the timed region. CUDA is asynchronous; without the sync you measure kernel-launch (queue-insertion) time, not execution time. A minimal harness is sketched below.
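A minimal timing harness that follows this discipline (names and sizes are illustrative):

```python
import time
import torch

def bench(fn, warmup=10, iters=100):
    """Average wall time per call, with correct CUDA synchronization."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()              # drain queued work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()              # wait for the queued kernels to finish
    return (time.perf_counter() - t0) / iters

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
print(f"matmul: {bench(lambda: a @ b) * 1e3:.3f} ms/iter")
```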