
Testing

Why it matters

Five families of tests recur across every path: unit, integration, property-based, fuzz, benchmark. Each language has its own tooling for each, but the strategies transfer. The strongest software-engineering muscles you can build are:

  1. Knowing which family fits which kind of bug.
  2. Knowing each ecosystem's canonical answer well enough to read their tests.

This page is the cross-language reading list for both.


The five families

| Family | Catches | Run frequency | Canonical tools by path |
| --- | --- | --- | --- |
| Unit | logic bugs in pure functions, narrow modules | every commit | go test, JUnit 5, cargo test, pytest, kunit |
| Integration | wiring bugs, real-dependency bugs | every PR | Testcontainers, docker-compose, pytest fixtures |
| Property-based | edge cases you didn't think to write | every PR | jqwik (Java), proptest/quickcheck (Rust), hypothesis (Python), gopter (Go) |
| Fuzz | crashes, security bugs, parser bugs | continuous (oss-fuzz, CI nightly) | go test -fuzz, cargo-fuzz / libFuzzer, atheris (Python), syzkaller (kernel) |
| Benchmark | perf regressions | per release / nightly | JMH (Java), criterion (Rust), go test -bench, pytest-benchmark, perf+ftrace |

Plus a sixth, language-specific family: concurrency stress tests - jcstress (Java), loom (Rust), KCSAN (kernel), -race (Go). See memory models.


The lens, per path

Go - go test, -race, fuzz, examples

Month 1 - Runtime Foundations for the basics, Appendix A for the discipline.

Built into the toolchain. go test ./... runs everything. -race is the data-race detector. -fuzz (1.18+) is the built-in coverage-guided fuzzer. Example* functions double as documentation and tests - go doc shows them and go test runs them.

What's unique here: no test framework. The testing package gives you t.Errorf and parallelism (t.Parallel); subtests via t.Run("subname", ...). That's the whole API. Third-party libraries (testify, gocheck) exist, but the standard-library-only style remains the idiomatic default.

The trap

Forgetting t.Parallel() makes tests serialize unnecessarily on CI. Before Go 1.22 (which made loop variables per-iteration), forgetting to re-declare the loop variable (tc := tc) in table-driven subtests made every parallel subtest run with the final loop value instead of its own input.

Java - JUnit 5, AssertJ, Testcontainers, Mockito, JMH, jcstress, jqwik

Month 1 - Language & Toolchain, week 4. The deepest test ecosystem of any path.

  • JUnit 5 (Jupiter) - @Test, @ParameterizedTest, @Nested, @ExtendWith. Modern; supersedes JUnit 4 (still common in legacy code).
  • AssertJ - fluent assertions. assertThat(x).isEqualTo(y).hasSize(3).containsExactly(...). Way better failure messages than assertEquals.
  • Mockito - collaborator mocking. Mockito 5+, prefer constructor injection, never @InjectMocks on final fields.
  • Testcontainers - real dependencies in tests via Docker. Postgres, Kafka, Redis, anything with an image. Rebuilds your mental model: most "integration tests" should be Testcontainer tests.
  • jqwik - property-based testing. @Property-annotated methods receive generated inputs; jqwik shrinks failing cases automatically.
  • JMH - the Java microbenchmark harness. The only correct way to measure JVM perf. See observability.
  • jcstress - concurrency stress harness (Shipilëv's). For testing memory-model edge cases in your lock-free code.

What's unique here: every test family has a mature, opinionated, JDK-author-blessed tool. The cost is the learning curve - you genuinely need a week to get fluent.

The trap

Mocking what you don't own (the "don't mock types you don't own" rule from Mockito's own wiki). Don't mock Connection/ResultSet/HttpClient/etc - those are external interfaces. Either use a real implementation in a Testcontainer or wrap them in your own interface and mock that.

Rust - cargo test, criterion, proptest, loom, Miri, cargo-fuzz

Month 1 - Foundations, week on testing.

#[test] inside any module. cargo test runs them. Doctests in /// comments are real tests - they compile and run.

  • criterion - benchmark crate; statistical analysis of variance, regression detection, HTML reports.
  • proptest / quickcheck - property-based testing.
  • loom - model-checks concurrent code by exhaustively exploring thread interleavings. The Rust equivalent of jcstress.
  • Miri - an interpreter for Rust's mid-level IR (MIR); detects undefined behavior in unsafe code. cargo +nightly miri test.
  • cargo-fuzz - libFuzzer integration.

What's unique here: the correctness tooling - Miri + loom + the borrow checker - gives Rust a stronger "your concurrent code is actually correct" story than any other path.

The trap

cargo test runs tests in parallel by default. Shared global state (env vars, working directory, file paths) creates flaky tests. Use serial_test or properly isolate state.

Python - pytest, hypothesis, tox/nox, atheris

Month 1 - Foundations, week on testing.

  • pytest - the de facto test runner. Fixtures, parametrize, plugins (pytest-asyncio, pytest-cov, pytest-mock, pytest-benchmark, pytest-xdist).
  • hypothesis - property-based testing; arguably the best in any language. @given(integers(), text()) generates inputs and shrinks failures.
  • tox / nox - matrix runners (test across Python versions, dependency versions).
  • atheris - Google's coverage-guided Python fuzzer.
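What hypothesis automates is a generate-and-check loop over a property that must hold for all inputs, not hand-picked ones. A minimal hand-rolled sketch of that loop, using an illustrative run-length codec (rle_encode/rle_decode are made up for this example, not from any library):

```python
import random

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in runs)

def test_roundtrip_property(trials: int = 500) -> None:
    # The property: decode(encode(s)) == s for ANY s.
    # Generation is random but seeded, so a failure is reproducible.
    rng = random.Random(0)
    for _ in range(trials):
        s = "".join(rng.choice("ab ") for _ in range(rng.randrange(0, 20)))
        assert rle_decode(rle_encode(s)) == s, f"round-trip failed for {s!r}"

test_roundtrip_property()
```

With hypothesis the whole harness collapses to @given(text()) over the single assert, and a failing input is automatically shrunk to a minimal counterexample - the part a hand-rolled loop doesn't give you.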

What's unique here: hypothesis's stateful testing - model your system as a state machine, let hypothesis explore the state space. Comparable to model-checking but more accessible.
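The idea behind stateful testing can be sketched without hypothesis: drive the system under test and a trivially-correct model through the same random operations and assert they agree after every step. BoundedStack here is a toy system invented for the example; hypothesis's RuleBasedStateMachine does the same thing with declared rules plus shrinking:

```python
import random

class BoundedStack:
    """Toy system under test: a stack that refuses to grow past a capacity."""
    def __init__(self, cap: int):
        self.cap = cap
        self._items: list[int] = []

    def push(self, x: int) -> bool:
        if len(self._items) >= self.cap:
            return False
        self._items.append(x)
        return True

    def pop(self):
        return self._items.pop() if self._items else None

def test_against_model(steps: int = 1000) -> None:
    rng = random.Random(1)
    real = BoundedStack(cap=4)
    model: list[int] = []          # the model: a plain list with the same rules
    for _ in range(steps):
        if rng.random() < 0.6:     # random walk over the operation space
            x = rng.randrange(100)
            ok = real.push(x)
            assert ok == (len(model) < 4)
            if ok:
                model.append(x)
        else:
            assert real.pop() == (model.pop() if model else None)

test_against_model()
```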

The trap

Wrong fixture scope, in either direction: function-scoped fixtures that should be session-scoped make suites slow; session-scoped fixtures that should be function-scoped make them flaky via shared state. pytest --setup-show reveals the actual setup graph.

Linux kernel - kunit, kselftest, syzkaller, ktap, KCSAN

Month 1 - Kernel Foundations, test week.

  • kunit - in-kernel unit tests, run during a kernel build or in a tiny QEMU instance via tools/testing/kunit/kunit.py (kunit_tool).
  • kselftest - userspace-driven kernel feature tests; lives in tools/testing/selftests/.
  • syzkaller - Google's kernel fuzzer; the source of a huge fraction of recent kernel CVEs.
  • KCSAN - Kernel Concurrency Sanitizer; data-race detector.
  • KASAN / KMSAN - Address / Memory Sanitizer for the kernel.
  • KTAP - the Kernel Test Anything Protocol; the common result format kunit and kselftest emit.

What's unique here: testing a kernel means testing from inside a kernel. kunit runs inside the kernel itself; kselftest boots a userspace and pokes the kernel from outside; syzkaller runs hundreds of VMs and aggregates crashes.

The trap

Reproducing a syzkaller crash needs the exact same kernel config - random fuzzed reproducers are not portable. Always capture .config alongside the bug report.

AI Systems - pytest + GPU-aware fixtures, eval harnesses

Month 3 - Framework Internals and Deep Dive 08 - Evaluation Systems (in tutoriaal/DEEP_DIVES/).

Standard pytest stack, with GPU detection (pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a GPU")) and numerical-tolerance assertions (torch.testing.assert_close(actual, expected, rtol=1e-3, atol=1e-4)). For LLM applications: an evaluation harness (separate from unit tests) - see Deep Dive 08.
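The rtol/atol pair encodes the acceptance rule |actual − expected| ≤ atol + rtol·|expected|, the same per-element check numpy.isclose and torch.testing.assert_close apply. A plain-Python sketch of why both knobs exist (close is an illustrative helper; no torch required):

```python
def close(actual: float, expected: float,
          rtol: float = 1e-3, atol: float = 1e-4) -> bool:
    # Tolerated error grows with the magnitude of `expected` (rtol term);
    # atol is the floor that keeps the check meaningful near zero.
    return abs(actual - expected) <= atol + rtol * abs(expected)

# A 0.05% error on a large activation passes...
assert close(1000.5, 1000.0)   # |err| = 0.5 <= 1e-4 + 1e-3 * 1000
# ...but the same absolute error at zero is caught: only atol applies there.
assert not close(0.5, 0.0)     # |err| = 0.5 >  1e-4
```

This is why tolerances belong in the test, documented: they are a claim about how much numerical drift your pipeline legitimately produces.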

The trap

Asserting on exact floating-point equality across GPU runs. Different reduction orders, different tensor cores, different precisions → never bit-exact. Always use assert_close with documented tolerances.
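Why bit-exactness is unachievable: floating-point addition is not associative, so any change in reduction order changes the low bits. A two-line CPU demonstration of the same effect:

```python
# Same three numbers, two reduction orders - different bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
assert a != b                  # exact equality fails
assert abs(a - b) < 1e-9       # ...yet they agree within any sane tolerance
```

A GPU kernel reorders sums far more aggressively than this (thread-block tiling, warp shuffles, tensor-core accumulation), which is exactly what makes run-to-run bit-equality a false expectation.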


The contrasts that teach

The strongest cross-language reading list for testing:

| Want to learn… | Read… | Then transfer to… |
| --- | --- | --- |
| Property-based testing | hypothesis (Python) | jqwik (Java), proptest (Rust), gopter (Go) |
| Concurrency stress | jcstress (Java) + Shipilëv's writeups | loom (Rust), KCSAN (kernel), -race (Go) |
| Real-dependency integration | Testcontainers (Java) | testcontainers-python, testcontainers-go |
| Fuzzing | syzkaller (kernel) for the gold standard | cargo-fuzz, go test -fuzz, atheris |
| Benchmarking discipline | JMH (Java) + criterion (Rust) | pytest-benchmark, go test -bench |
| Data-race detection | -race (Go) + Miri (Rust) | KCSAN (kernel), TSAN (C) |

What to read first

  • You write Go services → Go testing in Appendix A. Then run -race on your existing suite - most non-trivial Go codebases have at least one race nobody noticed.
  • You write JVM services → Java Month 1 week 4 + Testcontainers' docs. Replace every "mocked database" test with a Testcontainer one and watch a whole class of wiring bugs disappear.
  • You write Rust → cargo test + criterion + proptest. Add loom or Miri if you have any unsafe or any lock-free code.
  • You write Python at scale → pytest first, then hypothesis. The hypothesis tutorial is the single highest-leverage 90 minutes of testing reading you can do.
  • You hack the kernel → kunit + kselftest, then read 100 syzkaller crash reports until the failure patterns are intuitive.