Week 12 - The Optimization Ladder: Algorithm → Vectorize → Native → JIT

12.1 Conceptual Core

The ladder, in order of expected ROI per hour:

  1. Algorithm and data structure. Big-O still wins by orders of magnitude. set membership over list membership. heapq over re-sorting. bisect over linear scan.
  2. Stay in the C layer. Replace Python loops with NumPy / Polars / itertools / built-in sum/min/max/map. The eval loop is your enemy; comprehensions and built-ins minimize trips through it.
  3. Native extension. Cython, mypyc, Rust+PyO3, C+Pybind11. Write Python, profile, then rewrite the hot 5%.
  4. JIT. PyPy (drop-in for many pure-Python workloads, but C extensions go through cpyext, so NumPy-heavy code sees little benefit), Numba (NumPy-aware JIT), or the upcoming CPython JIT (PEP 744 copy-and-patch JIT on the tier-2 IR, an experimental build option in 3.13).
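A minimal sketch of rung 1, showing the three swaps named above (set over list membership, heapq over re-sorting, bisect over linear scan); sizes are illustrative:

```python
import bisect
import heapq

# Membership: O(n) list scan vs effectively O(1) set lookup.
haystack_set = set(range(100_000))
assert 99_999 in haystack_set

# Top-k: heapq.nsmallest is O(n log k); re-sorting each time is O(n log n).
assert heapq.nsmallest(3, [9, 1, 8, 2, 7]) == [1, 2, 7]

# Search in sorted data: O(log n) bisect vs O(n) linear scan.
sorted_xs = sorted([7, 2, 9, 4, 1])
i = bisect.bisect_left(sorted_xs, 4)
assert sorted_xs[i] == 4
```

None of this touches NumPy or native code, which is the point: the same asymptotic wins apply before you spend any time on rungs 2-4.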

12.2 Mechanical Detail

  • Profilers, in order of use:
      ◦ cProfile + snakeviz: function-level, deterministic, ~10% overhead.
      ◦ py-spy: sampling, attaches to a running process, no code changes, ~1% overhead. Use this in production.
      ◦ scalene: CPU + memory + GPU, line-level, low overhead.
      ◦ memray: memory-focused, flamegraphs.
      ◦ pyinstrument: low-overhead sampling, beautiful HTML output.
  • Vectorization patterns: replace for x in xs: ys.append(x*2 + 1) with np.asarray(xs) * 2 + 1. Replace for row in df.iterrows(): with column expressions in Polars / Pandas.
  • Cython mental model: write Python, add cdef int i, recompile, get 10–100x. Worth it for hot inner kernels.
  • mypyc: compiles type-annotated Python to C extensions; it powers mypy itself and black.
  • PyO3 + maturin: Rust extensions with maturin develop. The right tool when you also need true threads and predictable memory.
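A quick in-process use of the first profiler on the list, cProfile (the `hot` function here is a made-up stand-in for your real hot path; snakeviz would render a dumped stats file the same way):

```python
import cProfile
import io
import pstats

def hot() -> int:
    # Stand-in workload: anything CPU-bound works.
    return sum(i * i for i in range(200_000))

pr = cProfile.Profile()
pr.enable()
hot()
pr.disable()

# Sort by cumulative time and show the top 5 entries.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
assert "hot" in report  # the hot function shows up in the report
```

For production, the equivalent py-spy workflow needs no code changes at all (`py-spy top --pid <PID>`), which is why it ranks above cProfile for live services.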
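The vectorization pattern from the bullet above, sketched with NumPy (`xs` is any list of numbers):

```python
import numpy as np

xs = list(range(1_000))

# Eval-loop version: one bytecode round-trip per element.
ys_loop = []
for x in xs:
    ys_loop.append(x * 2 + 1)

# C-layer version: one array expression, no per-element Python dispatch.
ys_vec = np.asarray(xs) * 2 + 1

assert np.array_equal(ys_vec, np.asarray(ys_loop))
```

The same shape applies to dataframes: `df.iterrows()` is the eval-loop version, a Polars/Pandas column expression is the C-layer version.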

12.3 Lab - "Climb the Ladder"

Take a deliberately slow workload - e.g., compute pairwise cosine similarity between 10k 768-dim vectors with a pure-Python triple loop. Time it. Then climb:

  1. Algorithmic: skip pairs already computed.
  2. Vectorize: NumPy batched matmul with norms.
  3. Cython rewrite of the inner kernel.
  4. Numba @njit on the same.
  5. (Stretch) Rust + PyO3 implementation.
  6. Compare to faiss / hnswlib.
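A scaled-down sketch of rungs 1 and 2 (smaller sizes so it runs in seconds; the full lab uses 10k x 768):

```python
import math

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32))  # scaled down from 10k x 768

def cosine_python(X: np.ndarray) -> np.ndarray:
    """Rung 0 baseline, with rung 1's fix: only the upper triangle."""
    n, d = X.shape
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # rung 1: skip pairs already computed
            dot = sum(X[i][k] * X[j][k] for k in range(d))
            ni = math.sqrt(sum(v * v for v in X[i]))
            nj = math.sqrt(sum(v * v for v in X[j]))
            out[i][j] = out[j][i] = dot / (ni * nj)
    return np.asarray(out)

def cosine_numpy(X: np.ndarray) -> np.ndarray:
    """Rung 2: normalize rows once, then one batched matmul."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

assert np.allclose(cosine_python(X), cosine_numpy(X), atol=1e-8)
```

Wrap each version in `time.perf_counter()` calls (or pytest-benchmark) to fill in the speedup table; rungs 3-5 replace the body of `cosine_python`'s inner loop while keeping the same interface.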

Tabulate speedups in NOTES.md. The lesson is that step 2 usually wins by 100x and step 3+ by ~2x more - but step 6 (use the right library) wins by 1000x. Algorithm > implementation > tuning.

12.4 Idiomatic & Linter Drill

  • Enable ruff NPY (NumPy-specific rules). Refactor numerical code to use modern NumPy idioms.
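One concrete example of the kind of refactor the NPY rules push you toward: ruff's NPY002 flags the legacy global `np.random.seed` / `np.random.rand` API in favor of an explicit Generator:

```python
import numpy as np

# Legacy idiom, flagged by ruff NPY002:
#   np.random.seed(42)
#   xs = np.random.rand(3)

# Modern idiom: an explicit, locally-scoped Generator.
rng = np.random.default_rng(42)
xs = rng.random(3)
assert xs.shape == (3,)
```

The Generator API is also the one that composes with reproducible tests: you pass `rng` around instead of mutating global state.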

12.5 Production Hardening Slice

  • Add pytest-benchmark regression gates. Add a perf/ directory of benchmarks tracked in CI with historical data.

Month-3 Exit Criteria

Before starting Month 4:

  1. Read dis.dis output and predict relative cost of two Python implementations.
  2. Identify a real memory leak from a memray flamegraph.
  3. Choose between threads / processes / asyncio / free-threaded for a given workload, with one paragraph of justification.
  4. Apply the optimization ladder in order - and refuse to skip step 1.
