Saltar a contenido

Appendix B - Build-From-Scratch Data Structures and Patterns

A working Python engineer should have implemented each of the following at least once, with pyright-clean types, pytest + hypothesis tests, and a pytest-benchmark micro-bench. This appendix sketches the minimal-viable design.


B.1 LRU Cache (with TTL)

When: function memoization, decoded-payload caches, embedding caches.

Design: - OrderedDict from collections. move_to_end on hit; popitem(last=False) on evict. - Optional TTL via (value, expiry_monotonic) tuples; lazy expiration on access. - Concurrent variant: a threading.Lock (or asyncio.Lock for async) around mutations.

Lab: compare to functools.lru_cache and cachetools.TTLCache. Bench miss/hit costs.


B.2 Trie (Prefix Tree)

When: autocomplete, IP routing tables, tokenizer prefix lookup, dictionary spell-check.

Design: - Node = dict[str, Node] + is_end: bool (+ optional payload). - Insert/lookup O(len(key)); prefix iteration O(prefix + matches).

Lab: implement add, contains, iter_prefix. Bench against set for membership and against linear scan for prefix queries.


B.3 Bloom Filter

When: dedup at scale, "definitely-not-seen" checks before expensive lookups.

Design: - bitarray of size m; k hash functions derived from one (mmh3 or hashlib) via double hashing. - Sized for target false-positive rate p over expected n: m = -n*ln(p)/(ln(2)^2), k = m/n * ln(2).

Lab: empirically verify FP rate matches predicted. Compare memory to set for 10M items.


B.4 SPSC Ring Buffer (asyncio)

When: backpressure between a producer task and a consumer task, fixed-memory pipelines.

Design: - list[T | None] of capacity N (power of two). - head/tail integers; full when tail - head == N; empty when equal. - asyncio.Event for "not full" and "not empty"; set() on transition.

Lab: compare to asyncio.Queue(maxsize=N). The stdlib version is fine; build this once to internalize the contract.


B.5 Bounded Concurrent Map

When: caches with tight memory budgets, multi-writer state.

Design: - N shards of (threading.RLock, dict). Hash key, mod N, lock the shard. - Eviction: per-shard LRU list.

Lab: compare to a single-lock dict and to a CAS-y "lock-free" attempt. The simple sharded design wins almost always.


B.6 Vector Index (Brute-Force, Then HNSW Wrapper)

When: nearest-neighbor over embeddings.

Design - brute force: - np.ndarray of shape (N, D), L2-normalized. - Query: (N, D) @ (D,) dot product, np.argpartition for top-K.

Design - HNSW wrapper: - Wrap hnswlib.Index with a typed Pythonic API. - Persist with index.save_index(path).

Lab: Build both. Verify recall@10 vs. brute-force ground truth. Plot recall/QPS trade-off.


B.7 Token Bucket Rate Limiter (asyncio)

When: client-side rate limiting against an LLM API.

Design: - tokens: float, last_refill: float (monotonic). - On acquire(n): refill (now - last_refill) * rate, cap at capacity. If tokens >= n, deduct and return; else await asyncio.sleep for the deficit.

Lab: ensure bursts don't exceed capacity; ensure long-run average matches rate. Compare to aiolimiter.


B.8 Circuit Breaker

When: protecting downstream services from cascading failure.

Design: - States: CLOSED (normal), OPEN (fail fast), HALF_OPEN (probe). - Counters: consecutive failures threshold, reset timeout. - On call: if OPEN, raise immediately; if HALF_OPEN, allow one probe.

Lab: integrate with the LLMClient from week 21. Verify behavior under simulated 503 storms.


B.9 Async Worker Pool with Backpressure

When: ingestion pipelines, batch-embedding workloads.

Design: - Bounded asyncio.Queue, N worker tasks consuming, producer awaits put. - TaskGroup owns the workers; cancellation cleanly drains. - Each worker: try/except around the unit of work; metrics per outcome.

Lab: process 1M items at a controlled QPS without OOM. Tune N and queue size.


B.10 Domain Patterns Worth Building

  • Repository + UnitOfWork over SQLite, with an in-memory fake for tests.
  • Result type (Ok[T] | Err[E]) for code where exceptions obscure flow. Don't over-apply - Python has exceptions for a reason.
  • Saga / state machine for multi-step durable workflows. State table in Postgres + idempotency keys.
  • Outbox pattern for reliable event publishing alongside DB writes.

Comments