Week 12 - The Optimization Ladder: Algorithm → Vectorize → Native → JIT¶
12.1 Conceptual Core¶
The ladder, in order of expected ROI per hour:
- Algorithm and data structure. Big-O still wins by orders of magnitude. `set` membership over `list` membership. `heapq` over re-sorting. `bisect` over linear scan.
- Stay in the C layer. Replace Python loops with NumPy / Polars / itertools / built-in `sum`/`min`/`max`/`map`. The eval loop is your enemy; comprehensions and built-ins minimize trips through it.
- Native extension. Cython, mypyc, Rust + PyO3, C++ + pybind11. Write Python, profile, then rewrite the hot 5%.
- JIT. PyPy (drop-in for many pure-Python workloads, though NumPy-heavy code pays a `cpyext` penalty), Numba (NumPy-aware JIT), or the upcoming CPython JIT (PEP 744's copy-and-patch JIT on top of the tier-2 micro-op interpreter, experimental in 3.13).
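The first rung is cheap to demonstrate. A minimal sketch (sizes and the worst-case needle are illustrative) comparing `list` scan, `set` lookup, and `bisect` on the same data:

```python
import bisect
import timeit

haystack = list(range(100_000))   # sorted, so bisect is valid
haystack_set = set(haystack)
needle = 99_999                   # worst case for a linear scan

t_list = timeit.timeit(lambda: needle in haystack, number=200)        # O(n) scan
t_set = timeit.timeit(lambda: needle in haystack_set, number=200)     # O(1) hash lookup
t_bisect = timeit.timeit(
    lambda: bisect.bisect_left(haystack, needle), number=200          # O(log n)
)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s  bisect: {t_bisect:.6f}s")
```

Same answer each time; the only thing that changed is the data structure.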
12.2 Mechanical Detail¶
- Profilers, in order of use:
  - `cProfile` + `snakeviz`: function-level, deterministic, ~10% overhead.
  - `py-spy`: sampling, attaches to a running process, no code changes, ~1% overhead. Use this in production.
  - `scalene`: CPU + memory + GPU, line-level, low overhead.
  - `memray`: memory-focused, flamegraphs.
  - `pyinstrument`: low-overhead sampling, beautiful HTML output.
- Vectorization patterns: replace `for x in xs: ys.append(x*2 + 1)` with `np.asarray(xs) * 2 + 1`. Replace `for row in df.iterrows():` with column expressions in Polars / Pandas.
- Cython mental model: write Python, add `cdef int i`, recompile, get 10–100x. Worth it for hot inner kernels.
- `mypyc`: compiles type-annotated Python to C extensions; powers `mypy` itself and `black`.
- PyO3 + maturin: Rust extensions built with `maturin develop`. The right tool when you also need true threads and predictable memory.
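The vectorization pattern above, written out as a runnable sketch:

```python
import numpy as np

xs = list(range(1_000))

# Python-loop version: one eval-loop iteration (plus a method call) per element
ys_loop = []
for x in xs:
    ys_loop.append(x * 2 + 1)

# Vectorized version: a single C-level pass over a contiguous array
ys_vec = np.asarray(xs) * 2 + 1

print(ys_vec[:5])  # [1 3 5 7 9]
```

Identical results; the work moved from the bytecode interpreter into NumPy's C loops.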
12.3 Lab - "Climb the Ladder"¶
Take a deliberately slow workload - e.g., compute pairwise cosine similarity between 10k 768-dim vectors with a pure-Python triple loop. Time it. Then climb:
1. Algorithmic: skip pairs already computed.
2. Vectorize: NumPy batched matmul with norm.
3. Cython rewrite of the inner kernel.
4. Numba @njit on the same.
5. (Stretch) Rust + PyO3 implementation.
6. Compare to faiss / hnswlib.
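Step 2 can be sketched in a few lines (array sizes here are a small stand-in for the 10k × 768 workload): row-normalize once, then one batched matmul gives every pairwise cosine at once.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))   # stand-in for 10k x 768

Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
S = Xn @ Xn.T                                      # S[i, j] = cosine(X[i], X[j])

# spot-check one pair against the scalar definition
i, j = 3, 7
expected = X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
print(np.isclose(S[i, j], expected))  # True
```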
Tabulate speedups in NOTES.md. The lesson is that step 2 usually wins by 100x and step 3+ by ~2x more - but step 6 (use the right library) wins by 1000x. Algorithm > implementation > tuning.
12.4 Idiomatic & Linter Drill¶
- Enable `ruff`'s `NPY` rules (NumPy-specific). Refactor numerical code to use modern NumPy idioms.
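One idiom the `NPY` rules push you toward, as I understand NPY002: replace the legacy global-state random API with an explicit `Generator`.

```python
import numpy as np

# Legacy global-state idiom, flagged by ruff's NPY002:
#   np.random.seed(0)
#   sample = np.random.rand(3)

# Modern idiom: an explicit Generator you seed once and pass around
rng = np.random.default_rng(0)
sample = rng.random(3)
print(sample.shape)  # (3,)
```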
12.5 Production Hardening Slice¶
- Add `pytest-benchmark` regression gates. Add a `perf/` directory of benchmarks tracked in CI with historical data.
Month-3 Exit Criteria¶
Before starting Month 4:
- Read `dis.dis` output and predict the relative cost of two Python implementations.
- Identify a real memory leak from a `memray` flamegraph.
- Choose between threads / processes / asyncio / free-threaded for a given workload, with one paragraph of justification.
- Apply the optimization ladder in order - and refuse to skip step 1.
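For the first criterion, a toy pair to practice on (function names are illustrative):

```python
import dis


def loop_double(xs):
    ys = []
    for x in xs:
        ys.append(x * 2)  # attribute lookup + method call on every iteration
    return ys


def comp_double(xs):
    return [x * 2 for x in xs]  # comprehension: no per-element attribute lookup


# Compare the two bodies: the explicit loop pays for the append lookup
# and call on each pass; the comprehension appends via a dedicated opcode.
dis.dis(loop_double)
dis.dis(comp_double)
```

Predict which is cheaper from the bytecode first, then confirm with `timeit`.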