
Week 2 - Linear Algebra Refresh, BLAS, NumPy

2.1 Conceptual Core

  • The only linear algebra you need for systems work, for now: matrix-vector multiply, matrix-matrix multiply (GEMM), elementwise ops, reductions. Almost every neural-network primitive decomposes into these.
  • GEMM (C = αAB + βC) is the most-studied operation in scientific computing. BLAS Level 3 routines (sgemm, dgemm, hgemm) are heavily optimized; in practice, a machine's achievable peak FLOPS is whatever a well-tuned sgemm delivers.
  • The naive triple-loop matmul achieves <5% of peak. Tiled, blocked, vectorized matmuls achieve >90%. Understanding the gap is week 2's whole content.
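For intuition about where that >90% comes from, here is a minimal sketch (not from the source material) of the blocking idea; the name blocked_matmul and the tile size of 64 are illustrative choices, and the inner "@" still calls BLAS, so this shows only the loop structure, not raw-kernel performance.

```python
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Sketch of the blocking/tiling loop structure used by fast matmuls.

    Each (tile x tile) panel of A, B, and C is small enough to stay in cache
    while it is reused. Real BLAS kernels add packing, register blocking, and
    SIMD on top of this loop order.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):            # rows of C
        for j in range(0, m, tile):        # cols of C
            for p in range(0, k, tile):    # shared (reduction) dimension
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```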

2.2 Mechanical Detail

  • NumPy uses BLAS underneath. numpy.dot(A, B) (or A @ B) with an appropriately built NumPy hits the BLAS path; a pure-Python triple loop is ~1000× slower.
  • OpenBLAS, Intel MKL (now oneMKL), and Apple Accelerate are the dominant CPU BLAS implementations. MKL is typically fastest on Intel CPUs; OpenBLAS is the portable default.
  • GPU BLAS: NVIDIA cuBLAS, AMD rocBLAS. Wrapped by every framework.
  • Einsum notation (numpy.einsum("ij,jk->ik", A, B)) is the lingua franca of multi-dimensional tensor ops. It generalizes matmul, batch matmul, transpose, sum, contraction. Learn it.
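A few einsum identities worth memorizing (a sketch; the array names and shapes are arbitrary):

```python
import numpy as np

A = np.random.rand(4, 5)
B = np.random.rand(5, 6)
X = np.random.rand(8, 4, 5)   # a batch of 8 matrices
Y = np.random.rand(8, 5, 6)

C  = np.einsum("ij,jk->ik", A, B)     # matmul: same result as A @ B
Z  = np.einsum("bij,bjk->bik", X, Y)  # batch matmul: same as X @ Y
At = np.einsum("ij->ji", A)           # transpose: same as A.T
r  = np.einsum("ij->i", A)            # omit an index to sum over it: A.sum(axis=1)
s  = np.einsum("ij,ij->", A, A)       # full contraction to a scalar: (A * A).sum()
```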

2.3 Lab: "Three Matmuls"

Implement a 1024×1024 matmul three ways:

  1. Naive triple loop in pure Python (will take minutes; that's the point).
  2. The same loops, but indexing NumPy arrays (only a marginal speedup).
  3. numpy.dot; measure the speedup over (1).
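A minimal timing harness for steps (1) and (3), assuming you fill in step (2) yourself; N is kept small here so the pure-Python loop finishes quickly, and should be raised to 1024 for the actual lab.

```python
import time
import numpy as np

N = 256   # bump to 1024 for the real lab; the pure-Python loop will then take minutes

A = np.random.rand(N, N)
B = np.random.rand(N, N)

def matmul_python(A, B):
    """(1) Naive triple loop over plain Python lists."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

def timed(fn, *args):
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0

t_loop = timed(matmul_python, A.tolist(), B.tolist())
t_blas = timed(np.dot, A, B)
print(f"pure Python: {t_loop:.3f} s   numpy.dot: {t_blas*1e3:.3f} ms   "
      f"speedup: {t_loop / t_blas:,.0f}x")
```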

You should see ~10,000× speedup between (1) and (3). Internalize why. Read Goto and van de Geijn's "Anatomy of High-Performance Matrix Multiplication" if you want the deep version (recommended).

2.4 Idiomatic & Diagnostic Drill

  • python -c 'import numpy; numpy.show_config()' - see which BLAS your NumPy is linked against. Reinstall with `conda install numpy` (which pulls MKL on Linux/Windows) and re-benchmark; observe the difference.
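A small re-benchmark script to run before and after switching BLAS backends (a sketch; the 2048 size and float32 dtype are arbitrary choices):

```python
import time
import numpy as np

np.show_config()            # which BLAS this NumPy build is linked against

N = 2048
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

np.dot(A, B)                # warm-up: thread-pool spin-up, first-touch page faults
t0 = time.perf_counter()
np.dot(A, B)
dt = time.perf_counter() - t0

gflops = 2 * N**3 / dt / 1e9   # sgemm does roughly 2*N^3 floating-point ops
print(f"{dt*1e3:.1f} ms -> {gflops:.1f} GFLOP/s")
```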

2.5 Production Slice

  • Add a requirements.txt to your project with versions pinned. Version drift in NumPy/BLAS has caused real, expensive bugs; pin everything.
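A minimal example of what the pinned file might look like (the version number is illustrative; pin whatever you actually tested against):

```
# requirements.txt -- exact pins, not >=; BLAS behavior can change between releases
numpy==1.26.4
```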
