
Week 16 - Native Extensions, Releasing the GIL, FFI

16.1 Conceptual Core

  • The fastest Python is Python that calls into C. The fastest correct Python is Python that calls into C and releases the GIL while it's there. NumPy, PyTorch, and hashlib do this; many third-party C extensions don't.
  • Three FFI options today:
      • Cython for inner-loop kernels (write Python with type hints, compile to C).
      • Rust + PyO3 + maturin for everything else: thread-safe, memory-safe, modern build.
      • ctypes / cffi for calling existing .so/.dll libraries without writing an extension.
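As a minimal sketch of the third option, the snippet below calls strlen from the system C library through ctypes without compiling anything. It assumes a POSIX system; the wrapper name c_strlen is hypothetical and exists only for illustration.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the current process.
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: size_t strlen(const char *s).
# Without argtypes/restype, ctypes assumes int arguments and an int return.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Return the length of a NUL-terminated byte string, computed by libc."""
    return libc.strlen(data)


print(c_strlen(b"native extensions"))
```

Functions loaded through ctypes.CDLL release the GIL for the duration of the foreign call, so even this zero-build option can run in parallel across threads. The trade-off is per-call overhead: argument conversion happens in Python on every call.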

16.2 Mechanical Detail

  • The CPython C API: PyObject *, refcount discipline (Py_INCREF/Py_DECREF), the GIL macros (Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS), exception handling (PyErr_SetString).
  • The Stable ABI (PEP 384) and the Limited API. Building wheels that work across CPython versions.
  • NumPy's C API: PyArrayObject, PyArray_DATA, contiguity flags. The buffer protocol (PEP 3118): memoryview on the Python side, and __buffer__ on Python types (PEP 688, Python 3.12+).
  • PyO3 idioms: #[pyfunction], #[pyclass], Bound<'py, PyAny>, py.allow_threads(|| {...}) to release the GIL.
  • HPy: a successor C API, portable across PyPy/CPython/GraalPy. Worth knowing about; not yet load-bearing.
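The buffer protocol can be explored from pure Python before touching any C. The sketch below wraps an array.array in a memoryview and inspects the same layout information (format, item size, shape, strides, contiguity) that a C or Rust consumer would read. The variable names are illustrative only.

```python
from array import array

# array.array exposes the buffer protocol, so memoryview wraps it without copying.
values = array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(values)

# The layout a native consumer would see through the buffer protocol.
layout = (view.format, view.itemsize, view.shape, view.strides, view.c_contiguous)

# Writing through the view mutates the original storage: no copy was made.
view[0] = 10.0

# Reinterpret the same memory as raw bytes, then as a 2x2 grid of doubles.
raw = view.cast("B")
grid = raw.cast("d", shape=[2, 2])

print(layout)
print(values[0], raw.nbytes, grid.tolist())
```

The contiguity flag is the one to watch in native code: a kernel that assumes a dense C-ordered block must either check it or request a contiguous buffer up front.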

16.3 Lab - "Write the Hot Kernel in Rust"

  1. Take the cosine-similarity workload from week 12. Implement it in Rust with PyO3.
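  2. Use py.allow_threads(|| ...) around the SIMD loop. Verify with a Python ThreadPoolExecutor(8) that throughput scales close to 8x on a machine with at least 8 physical cores; expect less if the kernel is memory-bandwidth bound.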
  3. Compare to NumPy and to your Cython version. Write up the cost in code complexity.
  4. Bonus: expose a Vector #[pyclass] and benchmark crossing the FFI per-call vs. per-batch. Internalize the per-call FFI cost.
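One way to check step 2 is a small harness that runs the same kernel serially and under a thread pool, then compares wall-clock time. The sketch below is an assumption-laden stand-in: it uses hashlib.sha256 as the kernel (hashlib releases the GIL for large inputs) so it runs without the Rust extension, and the names kernel and measure_scaling are hypothetical. Swap kernel for the PyO3 function once it exists.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def kernel(payload: bytes) -> str:
    # Stand-in for the hot kernel; replace with the extension function.
    return hashlib.sha256(payload).hexdigest()


def measure_scaling(payloads, workers):
    """Return (serial_seconds, threaded_seconds, results_match)."""
    start = time.perf_counter()
    serial = [kernel(p) for p in payloads]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        threaded = list(pool.map(kernel, payloads))
    threaded_s = time.perf_counter() - start

    # Comparing results guards against a kernel that is fast but not thread-safe.
    return serial_s, threaded_s, serial == threaded


if __name__ == "__main__":
    # Eight 4 MiB payloads, one per worker.
    payloads = [bytes([i]) * (4 * 1024 * 1024) for i in range(8)]
    serial_s, threaded_s, ok = measure_scaling(payloads, workers=8)
    print(f"serial={serial_s:.3f}s threaded={threaded_s:.3f}s match={ok}")
```

If the threaded time is no better than the serial time, the kernel is most likely still holding the GIL. Timings from a single run are noisy, so repeat the measurement and report the best or median of several runs.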

16.4 Idiomatic & Linter Drill

  • Add cargo clippy to your Rust crate. Add maturin develop --release to your dev workflow.

16.5 Production Hardening Slice

  • Build manylinux wheels with cibuildwheel. Add a CI matrix: cp312, cp313, cp313t, cp314 across linux/x86_64, linux/aarch64, macos/arm64. This is the modern wheel-distribution baseline.
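A matrix that includes cp313t is only useful if each job confirms which interpreter it actually ran on. The sketch below is one possible smoke test to run after installing the wheel in each CI job; the function name interpreter_report is hypothetical. It relies on the Py_GIL_DISABLED build variable and on sys._is_gil_enabled(), which is available on CPython 3.13 and later.

```python
import sys
import sysconfig


def interpreter_report() -> dict:
    """Summarize the interpreter a wheel was installed into."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and unset or 0 otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; assume the GIL is on elsewhere.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    return {
        "implementation": sys.implementation.name,
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "free_threaded_build": free_threaded_build,
        "gil_enabled_at_runtime": gil_enabled,
        "platform": sysconfig.get_platform(),
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Calling this after importing the extension is the more telling check: on a free-threaded build, importing an extension module that does not declare free-threading support re-enables the GIL, and gil_enabled_at_runtime will then report True.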

Month-4 Exit Criteria

Before starting Month 5:

  1. Build an asyncio service that survives kill -INT mid-flight without dropping requests or leaking tasks.
  2. Choose among threads, processes, asyncio, free-threaded, and subinterpreters for any workload, and justify the choice.
  3. Write a Rust extension that releases the GIL and verify parallel scaling.
  4. Diagnose an event-loop stall from a py-spy dump alone.
