Week 12 - The Optimization Ladder: Algorithm → Vectorize → Native → JIT¶
12.1 Conceptual Core¶
The ladder, in order of expected ROI per hour:
- Algorithm and data structure. Big-O still wins by orders of magnitude. `set` membership over `list` membership. `heapq` over re-sorting. `bisect` over linear scan.
- Stay in the C layer. Replace Python loops with NumPy / Polars / itertools / built-in `sum`/`min`/`max`/`map`. The eval loop is your enemy; comprehensions and built-ins minimize trips through it.
- Native extension. Cython, mypyc, Rust + PyO3, C++ + pybind11. Write Python, profile, then rewrite the hot 5%.
- JIT. PyPy (drop-in for many pure-Python workloads, though NumPy-heavy code pays a `cpyext` penalty), Numba (NumPy-aware JIT), or the upcoming CPython JIT (PEP 744's copy-and-patch JIT on top of the tier-2 micro-op interpreter, experimental in 3.13).
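The first rung is cheap to demonstrate. A minimal sketch (sizes and the worst-case needle are illustrative) comparing `list` scan, `set` lookup, and `bisect` on the same data:

```python
import bisect
import timeit

haystack = list(range(100_000))   # sorted, so bisect is valid
haystack_set = set(haystack)
needle = 99_999                   # worst case for a linear scan

t_list = timeit.timeit(lambda: needle in haystack, number=200)        # O(n) scan
t_set = timeit.timeit(lambda: needle in haystack_set, number=200)     # O(1) hash lookup
t_bisect = timeit.timeit(
    lambda: bisect.bisect_left(haystack, needle), number=200          # O(log n)
)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s  bisect: {t_bisect:.6f}s")
```

Same answer each time; the only thing that changed is the data structure.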
12.2 Mechanical Detail¶
- Profilers, in order of use:
  - `cProfile` + `snakeviz`: function-level, deterministic, ~10% overhead.
  - `py-spy`: sampling, attaches to a running process, no code changes, ~1% overhead. Use this in production.
  - `scalene`: CPU + memory + GPU, line-level, low overhead.
  - `memray`: memory-focused, flamegraphs.
  - `pyinstrument`: low-overhead sampling, beautiful HTML output.
- Vectorization patterns: replace `for x in xs: ys.append(x*2 + 1)` with `np.asarray(xs) * 2 + 1`. Replace `for row in df.iterrows():` with column expressions in Polars / Pandas.
- Cython mental model: write Python, add `cdef int i`, recompile, get 10–100x. Worth it for hot inner kernels.
- `mypyc`: compiles type-annotated Python to C extensions; powers `mypy` itself and `black`.
- PyO3 + maturin: Rust extensions built with `maturin develop`. The right tool when you also need true threads and predictable memory.
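The vectorization pattern above, written out as a runnable sketch:

```python
import numpy as np

xs = list(range(1_000))

# Python-loop version: one eval-loop iteration (plus a method call) per element
ys_loop = []
for x in xs:
    ys_loop.append(x * 2 + 1)

# Vectorized version: a single C-level pass over a contiguous array
ys_vec = np.asarray(xs) * 2 + 1

print(ys_vec[:5])  # [1 3 5 7 9]
```

Identical results; the work moved from the bytecode interpreter into NumPy's C loops.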
12.3 Lab - "Climb the Ladder"¶
Take a deliberately slow workload - e.g., compute pairwise cosine similarity between 10k 768-dim vectors with a pure-Python triple loop. Time it. Then climb:
1. Algorithmic: skip pairs already computed.
2. Vectorize: NumPy batched matmul with norm.
3. Cython rewrite of the inner kernel.
4. Numba @njit on the same.
5. (Stretch) Rust + PyO3 implementation.
6. Compare to faiss / hnswlib.
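Step 2 can be sketched in a few lines (array sizes here are a small stand-in for the 10k × 768 workload): row-normalize once, then one batched matmul gives every pairwise cosine at once.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))   # stand-in for 10k x 768

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
S = Xn @ Xn.T                                      # S[i, j] = cosine(X[i], X[j])

# spot-check one pair against the scalar definition
i, j = 3, 7
expected = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
print(np.isclose(S[i, j], expected))  # True
```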
Tabulate speedups in NOTES.md. The lesson is that step 2 usually wins by 100x and step 3+ by ~2x more - but step 6 (use the right library) wins by 1000x. Algorithm > implementation > tuning.
12.4 Idiomatic & Linter Drill¶
- Enable `ruff`'s `NPY` rules (NumPy-specific). Refactor numerical code to use modern NumPy idioms.
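One idiom the `NPY` rules push you toward, as I understand NPY002: replace the legacy global-state random API with an explicit `Generator`.

```python
import numpy as np

# Legacy global-state idiom, flagged by ruff's NPY002:
#   np.random.seed(0)
#   sample = np.random.rand(3)

# Modern idiom: an explicit Generator you seed once and pass around
rng = np.random.default_rng(0)
sample = rng.random(3)
print(sample.shape)  # (3,)
```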
12.5 Production Hardening Slice¶
- Add `pytest-benchmark` regression gates. Add a `perf/` directory of benchmarks tracked in CI with historical data.
Month-3 Exit Criteria¶
Before starting Month 4:
- Read `dis.dis` output and predict the relative cost of two Python implementations.
- Identify a real memory leak from a `memray` flamegraph.
- Choose between threads / processes / asyncio / free-threaded for a given workload, with one paragraph of justification.
- Apply the optimization ladder in order - and refuse to skip step 1.
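For the first criterion, a toy pair to practice on (function names are illustrative):

```python
import dis


def loop_double(xs):
    ys = []
    for x in xs:
        ys.append(x * 2)  # attribute lookup + method call on every iteration
    return ys


def comp_double(xs):
    return [x * 2 for x in xs]  # comprehension: no per-element attribute lookup


# Compare the two bodies: the explicit loop pays for the append lookup
# and call on each pass; the comprehension appends via a dedicated opcode.
dis.dis(loop_double)
dis.dis(comp_double)
```

Predict which is cheaper from the bytecode first, then confirm with `timeit`.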