Week 16 - Native Extensions, Releasing the GIL, FFI
16.1 Conceptual Core
- The fastest Python is Python that calls into C. The fastest correct Python is Python that calls into C and releases the GIL while it's there. NumPy, PyTorch, and `hashlib` do this; many third-party C extensions don't.
- Three FFI options today:
  - Cython for inner-loop kernels (write Python with type hints, compile to C).
  - Rust + PyO3 + maturin for everything else: thread-safe, memory-safe, modern build.
  - `ctypes`/`cffi` for calling existing `.so`/`.dll` libraries without writing an extension.
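To make the third option concrete, here is a minimal sketch of calling an existing C function through `ctypes`. It assumes a POSIX system, where `ctypes.CDLL(None)` exposes the symbols already loaded into the process (including libc); the wrapper name `c_strlen` is made up for illustration. Functions loaded through `ctypes.CDLL` run with the GIL released for the duration of the call.

```python
import ctypes

# Assumption: a POSIX system. CDLL(None) asks the dynamic loader for the
# symbols already loaded into the current process, which includes libc.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
# Without argtypes/restype, ctypes falls back to int and may truncate values.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string via libc (illustrative wrapper)."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"native extensions"))
```

Every call pays for argument conversion at the boundary, which is why this route fits coarse-grained calls better than tight inner loops.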
16.2 Mechanical Detail
- The CPython C API: `PyObject *`, refcount discipline (`Py_INCREF`/`Py_DECREF`), the GIL macros (`Py_BEGIN_ALLOW_THREADS`/`Py_END_ALLOW_THREADS`), exception handling (`PyErr_SetString`).
- The Stable ABI (PEP 384) and the Limited API. Building wheels that work across CPython versions.
- `numpy`'s C API: `PyArrayObject`, `PyArray_DATA`, contiguity flags. The buffer protocol (PEP 3118): `memoryview`, and `__buffer__` on Python types (PEP 688, Python 3.12+).
- PyO3 idioms: `#[pyfunction]`, `#[pyclass]`, `Bound<'py, PyAny>`, `py.allow_threads(|| {...})` to release the GIL.
- HPy: a successor C API, portable across PyPy/CPython/GraalPy. Worth knowing about; not yet load-bearing.
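The buffer protocol can be explored from pure Python before touching any C. The sketch below uses the standard-library `array` type as the buffer owner (an illustrative choice, not a NumPy array) and shows the metadata a native extension would read: format, item size, strides, and contiguity.

```python
from array import array

# A contiguous block of C doubles owned by the array object.
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same memory through the buffer protocol; no copy.
view = memoryview(samples)

# Metadata a C extension would read from the exported buffer.
layout = {
    "format": view.format,              # struct-style type code
    "itemsize": view.itemsize,          # bytes per element
    "shape": view.shape,
    "strides": view.strides,
    "c_contiguous": view.c_contiguous,
}

# Writing through the view mutates the original array (shared memory).
view[0] = 10.0

# Slicing with a step yields a non-contiguous view over the same buffer.
every_other = view[::2]

if __name__ == "__main__":
    print(layout)
    print(every_other.tolist(), every_other.c_contiguous)
```

A kernel that assumes contiguous memory should check the contiguity flag first and either reject or copy strided input such as `every_other`.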
16.3 Lab - "Write the Hot Kernel in Rust"
- Take the cosine-similarity workload from week 12. Implement it in Rust with PyO3.
- Use `py.allow_threads(|| ...)` around the SIMD loop. Verify with a Python `ThreadPoolExecutor(8)` that you get close to an 8x speedup over the single-threaded run (on a machine with at least 8 cores).
- Compare to NumPy and to your Cython version. Write up the cost in code complexity.
- Bonus: expose a `Vector` `#[pyclass]` and benchmark crossing the FFI per-call vs. per-batch. Internalize the per-call FFI cost.
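The verification step can be sketched as a small harness. Since the Rust extension is yours to write, this sketch uses `hashlib.sha256` as a stand-in kernel (an assumption: `hashlib` releases the GIL while hashing large inputs, so it scales like a GIL-releasing extension function). The helper names `kernel`, `run_serial`, `run_threaded`, and `measure` are made up for illustration.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def kernel(chunk: bytes) -> str:
    # Stand-in kernel (assumption). Swap in your compiled function here.
    return hashlib.sha256(chunk).hexdigest()


def run_serial(chunks):
    return [kernel(c) for c in chunks]


def run_threaded(chunks, workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, chunks))


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def measure(workers=8, n_chunks=16, chunk_size=4 * 1024 * 1024):
    """Return the serial/threaded time ratio for the stand-in kernel."""
    chunks = [bytes([i % 256]) * chunk_size for i in range(n_chunks)]
    serial, t_serial = timed(run_serial, chunks)
    threaded, t_threaded = timed(run_threaded, chunks, workers)
    # Parallel execution must not change the answers.
    assert serial == threaded
    return t_serial / t_threaded


if __name__ == "__main__":
    print(f"speedup with 8 threads: {measure():.2f}x")
```

A ratio near 1.0 usually means the kernel is holding the GIL; a ratio well below the thread count points at too few cores, memory bandwidth limits, or chunks too small to amortize thread handoff.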
16.4 Idiomatic & Linter Drill
- Add `cargo clippy` to your Rust crate. Add `maturin develop --release` to your dev workflow.
16.5 Production Hardening Slice
- Build manylinux wheels with `cibuildwheel`. Add a CI matrix: cp312, cp313, cp313t, cp314 across linux/x86_64, linux/aarch64, macos/arm64. This is the modern wheel-distribution baseline.
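When a matrix job misbehaves, it helps to confirm which interpreter flavor the job is actually running. The sketch below is one way to do that from inside the job; `describe_interpreter` is a hypothetical helper, and the `tag` string assumes CPython and only approximates the wheel tag names used above.

```python
import sys
import sysconfig


def describe_interpreter() -> dict:
    """Summarize the properties that determine which wheel this build loads."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (the "t" in cp313t);
    # it is 0 or None on regular builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always
    # run with the GIL, so default to True there.
    gil_check = getattr(sys, "_is_gil_enabled", lambda: True)

    # Assumption: CPython. Other implementations use different tag prefixes.
    tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
    if free_threaded_build:
        tag += "t"

    return {
        "tag": tag,
        "platform": sysconfig.get_platform(),
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),
        "free_threaded_build": free_threaded_build,
        "gil_enabled_at_runtime": gil_check(),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

On a free-threaded build, the runtime flag is worth checking after importing your extension: CPython re-enables the GIL if an imported extension module does not declare free-threading support.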
Month-4 Exit Criteria
Before starting Month 5:
- Build an asyncio service that survives `kill -INT` mid-flight without dropping requests or leaking tasks.
- Choose, and justify the choice, among threads, processes, asyncio, free-threaded Python, and subinterpreters for any given workload.
- Write a Rust extension that releases the GIL and verify parallel scaling.
- Diagnose an event-loop stall from a `py-spy dump` alone.
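The first criterion can be sketched in miniature. This is a toy model under stated assumptions, not a production server: it assumes a POSIX system (where `loop.add_signal_handler` is available) and the main thread; `handle` and `serve` are hypothetical names, the sleeps stand in for real I/O, and the 5-second drain budget is arbitrary.

```python
import asyncio
import signal


async def handle(request_id: int, done: list) -> None:
    # Hypothetical request handler; the sleep stands in for real I/O.
    await asyncio.sleep(0.05)
    done.append(request_id)


async def serve(n_requests: int = 20) -> tuple[list, int]:
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    # Turn SIGINT into a "stop accepting" flag instead of KeyboardInterrupt.
    loop.add_signal_handler(signal.SIGINT, stop.set)

    done: list = []
    in_flight: set = set()
    accepted = 0
    try:
        while accepted < n_requests and not stop.is_set():
            task = asyncio.create_task(handle(accepted, done))
            in_flight.add(task)  # keep a strong reference to the task
            task.add_done_callback(in_flight.discard)
            accepted += 1
            await asyncio.sleep(0.01)  # stand-in for waiting on the next request

        # Stop accepting, then let every accepted request finish.
        if in_flight:
            await asyncio.wait(in_flight, timeout=5.0)  # arbitrary drain budget

        # Anything still running after the budget is cancelled, not leaked.
        leftovers = list(in_flight)
        for task in leftovers:
            task.cancel()
        await asyncio.gather(*leftovers, return_exceptions=True)
    finally:
        loop.remove_signal_handler(signal.SIGINT)
    return done, accepted


if __name__ == "__main__":
    finished, accepted = asyncio.run(serve())
    print(f"accepted={accepted} finished={len(finished)}")
```

Press Ctrl-C (or send `kill -INT`) while it runs: the accept loop exits, and the finished count should still match the accepted count.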