Week 2 - Linear Algebra Refresh, BLAS, NumPy
2.1 Conceptual Core
- The only linear algebra you need for systems work, for now: matrix-vector multiply, matrix-matrix multiply (GEMM), elementwise ops, reductions. Almost every neural-network primitive decomposes into these.
- GEMM (`C = αAB + βC`) is the most-studied operation in scientific computing. BLAS Level 3 routines (`sgemm`, `dgemm`, `hgemm`) are heavily optimized; in practice, a chip's usable peak FLOPS is whatever its best `sgemm` achieves.
- The naive triple-loop matmul achieves <5% of peak. Tiled, blocked, vectorized matmuls achieve >90%. Understanding that gap is the whole content of week 2.
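To make the GEMM contract concrete, here is a minimal sketch (names and sizes are illustrative, not from the text) expressing `C = αAB + βC` in NumPy and checking it against the triple-loop definition:

```python
import numpy as np

# Illustrative shapes; float64 arrays route to the dgemm BLAS path.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))
alpha, beta = 2.0, 0.5

# The GEMM contract, expressed with NumPy's matmul operator:
C_new = alpha * (A @ B) + beta * C

# Reference: the same definition as an explicit triple loop.
C_ref = np.empty_like(C)
for i in range(4):
    for j in range(5):
        acc = 0.0
        for k in range(3):
            acc += A[i, k] * B[k, j]
        C_ref[i, j] = alpha * acc + beta * C[i, j]

assert np.allclose(C_new, C_ref)
```

Both paths compute the same matrix; the entire performance story of this week is how differently they do it.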
2.2 Mechanical Detail
- NumPy uses BLAS underneath. `numpy.dot(A, B)` with an appropriately built NumPy hits the BLAS path; a pure-Python triple loop is ~1000× slower.
- OpenBLAS, Intel MKL, and Apple Accelerate are the dominant CPU BLAS implementations. Intel oneMKL is generally fastest on Intel CPUs; OpenBLAS is the portable default.
- GPU BLAS: NVIDIA cuBLAS, AMD rocBLAS. Wrapped by every framework.
- Einsum notation (`numpy.einsum("ij,jk->ik", A, B)`) is the lingua franca of multi-dimensional tensor ops. It generalizes matmul, batch matmul, transpose, sum, and contraction. Learn it.
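A short sketch of the generalizations the bullet above names, each checked against the equivalent NumPy call (array names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 2, 3))   # a batch of 5 matrices
Y = rng.standard_normal((5, 3, 4))

# Matmul: contract over the shared index j.
assert np.allclose(np.einsum("ij,jk->ik", A, B), A @ B)

# Batch matmul: b is carried through unchanged, j is contracted.
assert np.allclose(np.einsum("bij,bjk->bik", X, Y), X @ Y)

# Transpose: same indices, reordered in the output.
assert np.allclose(np.einsum("ij->ji", A), A.T)

# Reduction: any index missing from the output is summed over.
assert np.allclose(np.einsum("ij->i", A), A.sum(axis=1))
```

The rule is uniform: repeated indices multiply, indices absent from the output sum. That one rule covers all four operations above.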
2.3 Lab - "Three Matmuls"
Implement 1024×1024 matmul three ways:
1. Naive triple-loop in Python (will take ~minutes; that's the point).
2. Naive triple loop indexing NumPy arrays from explicit Python loops - only a marginal speedup.
3. `numpy.dot` - measure the speedup over (1).
You should see a ~10,000× speedup between (1) and (3). Internalize why. Read Goto and van de Geijn's "Anatomy of High-Performance Matrix Multiplication" if you want the deep version (recommended).
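A sketch of the lab harness, assuming a smaller `N = 128` so it finishes in seconds (scale `N` to 1024 for the real measurement, and expect the pure-Python version to take minutes there):

```python
import time
import numpy as np

def matmul_naive(A, B):
    """Variant (1): pure-Python triple loop; interpreter overhead dominates."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C

N = 128  # use 1024 for the real lab; 128 keeps this sketch quick
rng = np.random.default_rng(2)
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))

t0 = time.perf_counter()
C_slow = matmul_naive(A.tolist(), B.tolist())
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
C_fast = np.dot(A, B)   # variant (3): BLAS-backed dgemm
t_blas = time.perf_counter() - t0

assert np.allclose(C_slow, C_fast)
print(f"naive: {t_naive:.3f}s  BLAS: {t_blas:.5f}s  speedup: {t_naive / t_blas:.0f}x")
```

Even at this small size the gap is dramatic; at 1024×1024 the work grows by 8³ = 512× and the pure-Python loop stops being usable at all.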
2.4 Idiomatic & Diagnostic Drill
`python -c 'import numpy; numpy.show_config()'` - see which BLAS your NumPy is linked against. Reinstall with `conda install numpy` (which pulls MKL on Linux/Windows) and re-benchmark; observe the difference.
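For the "re-benchmark" step, a minimal harness (a sketch; sizes and the warm-up call are choices, not prescribed by the drill) that you can run once per BLAS build and compare:

```python
import time
import numpy as np

N = 512
rng = np.random.default_rng(3)
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))

A @ B  # warm-up: the first call may pay one-time thread-pool and paging costs

t0 = time.perf_counter()
C = A @ B
dt = time.perf_counter() - t0
gflops = 2 * N**3 / dt / 1e9  # an N x N matmul does ~2*N^3 floating-point ops
print(f"{N}x{N} matmul: {dt * 1e3:.2f} ms, {gflops:.1f} GFLOP/s")
```

Record the GFLOP/s figure before and after switching BLAS backends; the number, not the wall time alone, is what lets you compare against the machine's nominal peak.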
2.5 Production Slice
- Add a `requirements.txt` to your project with versions pinned. NumPy/BLAS bugs from version drift have cost real money - pin everything.
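The shape of a pinned file, for reference (the version numbers below are illustrative, not a recommendation):

```text
# requirements.txt - exact pins, no ranges
numpy==1.26.4
scipy==1.11.4
```

`pip freeze > requirements.txt` captures your current environment's exact versions; `==` pins are what make a later `pip install -r requirements.txt` reproduce it.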