# Capstone Projects - Three Tracks, One Choice
The Month 6 capstone is the deliverable that converts this curriculum from study into evidence. Pick one track. The work performed here is what you describe in interviews and link from a portfolio.
## Track 1 - Distributed Storage: A Raft-Replicated KV Store

Outcome: a 3+ node Raft-replicated key-value store with linearizable reads, snapshots, online membership changes, and a Jepsen-style fault-injection harness verifying linearizability.
### Functional spec

- gRPC API: `Get(key)`, `Put(key, value)`, `Delete(key)`, and a `Watch(prefix)` stream (a Go interface sketch follows this list).
- Cluster API: `AddNode`, `RemoveNode`, `Leadership`.
- Linearizable reads via read-index.
- Snapshots every N entries (default 10K) with `InstallSnapshot` to recovering followers.
- Persistent WAL via Pebble or BoltDB.
- TLS between nodes; mutual auth via x509.
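A minimal Go-interface view of this surface can anchor the design before the protobuf definitions exist; every name and field below (including `Event` and `Revision`) is an illustrative assumption, not the generated gRPC API.

```go
// Illustrative Go view of the KV and Cluster surfaces; the real API would be
// generated from protobuf. All names here are assumptions.
package kvapi

import "context"

// Event is one change observed by a Watch stream.
type Event struct {
	Key      string
	Value    []byte
	Deleted  bool
	Revision uint64 // lets watchers resume and supports versioned reads
}

// KV is the client-facing surface: point operations plus a prefix watch.
type KV interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
	// Watch streams events for keys under prefix until ctx is cancelled.
	Watch(ctx context.Context, prefix string) (<-chan Event, error)
}

// Cluster is the membership and administration surface.
type Cluster interface {
	AddNode(ctx context.Context, id uint64, addr string) error
	RemoveNode(ctx context.Context, id uint64) error
	// Leadership reports the current leader's node ID.
	Leadership(ctx context.Context) (leaderID uint64, err error)
}
```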
### Non-functional spec
- Sustained 50K writes/sec on commodity hardware (3-node, NVMe).
- Sub-10 ms write latency p99 under 50% utilization.
- Recovery time (leader change → fully available) under 1 s for a 3-node cluster.
- Survives a single-node crash without data loss; survives a network partition with a clear majority.
### Architecture sketch

- One goroutine per node consumes from the `etcd-io/raft` `Ready` channel (driver loop sketched after this list).
- Apply loop: stream committed entries → state machine → respond to clients.
- Network: gRPC with a long-lived bidi stream per peer pair.
- State machine: a sharded `map[string][]byte` with versioning for `Watch`.
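A minimal sketch of the driver loop behind the first two bullets, assuming the etcd-io/raft API (`go.etcd.io/raft/v3`); `persistReady`, `sendMessages`, and `applyEntry` are hypothetical helpers that would sit on the WAL, the gRPC transport, and the sharded map (snapshot install is omitted for brevity).

```go
package kvnode

import (
	"time"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

// runNode drives one raft.Node: tick timers, persist, send, apply, advance.
func runNode(n raft.Node, done <-chan struct{}) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			n.Tick() // drives election and heartbeat timeouts
		case rd := <-n.Ready():
			// 1. Persist HardState, new entries, and any snapshot before
			//    sending messages, so acknowledged state survives a crash.
			persistReady(rd)
			// 2. Ship outbound messages to peers over the gRPC transport.
			sendMessages(rd.Messages)
			// 3. Apply committed entries to the state machine, in order.
			for _, ent := range rd.CommittedEntries {
				if ent.Type == raftpb.EntryConfChange {
					var cc raftpb.ConfChange
					_ = cc.Unmarshal(ent.Data)
					n.ApplyConfChange(cc)
					continue
				}
				applyEntry(ent)
			}
			// 4. Tell raft this batch is fully processed.
			n.Advance()
		case <-done:
			return
		}
	}
}

// Hypothetical stubs; real versions write to the WAL, the peer transport,
// and the sharded in-memory state machine respectively.
func persistReady(rd raft.Ready)       {}
func sendMessages(ms []raftpb.Message) {}
func applyEntry(ent raftpb.Entry)      {}
```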
### Test rigor
- Unit: state-machine transitions, log-truncation invariants.
- Integration: a 3-node local cluster via `t.Run`; exercise membership changes.
- Fault injection: a "nemesis" goroutine that randomly partitions, pauses, and crashes nodes (sketched after this list); client operation history is fed to a linearizability checker (Knossos in Clojure-via-process, or a lightweight Go port).
- Race-clean under sustained load.
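One possible shape for the nemesis, assuming a hypothetical `Cluster` control surface exposed by the test harness; the fault mix and timing are arbitrary.

```go
package chaos

import (
	"math/rand"
	"time"
)

// Cluster is the control surface the nemesis needs; the test harness
// implements it over the local 3-node cluster.
type Cluster interface {
	Nodes() []int
	Partition(a, b []int) // drop traffic between the two groups
	Heal()                // remove all partitions
	Pause(node int)       // SIGSTOP-style pause
	Resume(node int)
	Crash(node int)       // kill without warning
	Restart(node int)
}

// Run injects one random fault per interval until stop is closed.
func Run(c Cluster, interval time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(interval):
		}
		nodes := c.Nodes()
		if len(nodes) == 0 {
			continue
		}
		victim := nodes[rand.Intn(len(nodes))]
		switch rand.Intn(3) {
		case 0: // isolate one node, then heal
			rest := make([]int, 0, len(nodes)-1)
			for _, n := range nodes {
				if n != victim {
					rest = append(rest, n)
				}
			}
			c.Partition([]int{victim}, rest)
			time.Sleep(interval)
			c.Heal()
		case 1: // pause, then resume
			c.Pause(victim)
			time.Sleep(interval)
			c.Resume(victim)
		case 2: // crash, then restart
			c.Crash(victim)
			time.Sleep(interval)
			c.Restart(victim)
		}
	}
}
```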
### Hardening pass

- `goreleaser` releases, `cosign` signing, SBOM via `cyclonedx-gomod`.
- `GOMEMLIMIT` from the cgroup limit; `automaxprocs` (see the sketch after this list).
- PGO with a representative workload.
- `pprof` + `runtime/trace` capture endpoints.
- OTel traces across the Raft RPC layer (custom interceptor).
- A `RUNBOOK.md` covering: leader-stuck triage, log-corruption recovery, snapshot-restore procedure.
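A sketch of the runtime-limits slice of this pass, using the standard library plus `go.uber.org/automaxprocs`; the cgroup v2 path and the 10% headroom are assumptions to adjust per environment.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on DefaultServeMux
	"os"
	"runtime/debug"
	"strconv"
	"strings"

	_ "go.uber.org/automaxprocs" // clamps GOMAXPROCS to the container CPU quota
)

// setMemLimitFromCgroup derives GOMEMLIMIT from the cgroup v2 memory limit,
// leaving ~10% headroom for non-Go memory (stacks, cgo, page-cache pressure).
func setMemLimitFromCgroup() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not in a cgroup v2 container; keep the default
	}
	s := strings.TrimSpace(string(raw))
	if s == "max" {
		return // no limit configured
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(limit * 9 / 10)
}

func main() {
	setMemLimitFromCgroup()
	// pprof and runtime profiles on a loopback-only port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```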
### Acceptance criteria
- Public repo with all of the above.
- A README that includes: a topology diagram, a load-test latency CDF, a Jepsen-style report.
- Defensible answer to: "What happens during a network partition where a majority can elect a new leader but the old leader is still up?"
### Skills exercised
- Months 3 (concurrency), 5 (gRPC, observability), 6.21–6.22 (Raft, distributed storage).
## Track 2 - Service Mesh: A gRPC Microservices Mesh
Outcome: a multi-service mesh demonstrating a custom service registry, health checking, deadline propagation, retries, outlier ejection, and end-to-end OTel tracing across at least four interconnected services.
### Functional spec

- A Registry service: gRPC interface for `Register`, `Deregister`, `Watch`, `LookupHealthy`. Backed by an in-memory store with optional Raft replication (composes with Track 1).
- A Sidecar library that:
    - Resolves service names via the registry (custom gRPC `resolver.Builder`; sketched after this list).
    - Implements client-side load balancing with round-robin + outlier ejection.
    - Propagates OTel context, deadlines, and a `request_id`.
    - Adds retry policy via service config.
- Four demo services (e.g., `user`, `order`, `inventory`, `payment`) with a fan-out call graph that exercises retries, timeouts, and partial failures.
- A `mesh-cli` for service inspection and chaos injection.
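A minimal sketch of the custom resolver, assuming a hypothetical `RegistryClient` that streams healthy endpoints; the `resolver.Builder`/`Resolver` interfaces are the real `google.golang.org/grpc/resolver` API, and the `"mesh"` scheme is a choice.

```go
package meshresolver

import (
	"context"

	"google.golang.org/grpc/resolver"
)

// RegistryClient is the assumed interface to the Registry service: it pushes
// the current set of healthy addresses for a service whenever it changes.
type RegistryClient interface {
	WatchHealthy(ctx context.Context, service string) (<-chan []string, error)
}

type builder struct{ reg RegistryClient }

// NewBuilder returns a resolver.Builder for targets like "mesh:///order".
func NewBuilder(reg RegistryClient) resolver.Builder { return &builder{reg: reg} }

func (b *builder) Scheme() string { return "mesh" }

func (b *builder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	ctx, cancel := context.WithCancel(context.Background())
	updates, err := b.reg.WatchHealthy(ctx, target.Endpoint())
	if err != nil {
		cancel()
		return nil, err
	}
	go func() {
		for addrs := range updates {
			state := resolver.State{}
			for _, a := range addrs {
				state.Addresses = append(state.Addresses, resolver.Address{Addr: a})
			}
			_ = cc.UpdateState(state) // the balancer re-picks from the new set
		}
	}()
	return &meshResolver{cancel: cancel}, nil
}

type meshResolver struct{ cancel context.CancelFunc }

func (r *meshResolver) ResolveNow(resolver.ResolveNowOptions) {} // registry pushes; nothing to do
func (r *meshResolver) Close()                                { r.cancel() }
```

Registering the builder once at startup (`resolver.Register(NewBuilder(reg))`) lets every client dial `mesh:///<service>`, with round-robin and outlier ejection layered on via the balancer and service config.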
### Non-functional spec
- Sub-millisecond p99 sidecar overhead per RPC.
- Outlier ejection within 10 s of an endpoint going bad.
- Deadline propagation: an inbound 1 s deadline must result in downstream calls seeing strictly less than 1 s remaining.
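The deadline requirement falls out of ordinary context plumbing, as in the sketch below; `OrderClient` and the 50 ms reserve are illustrative assumptions.

```go
package handler

import (
	"context"
	"time"
)

// OrderClient is the assumed downstream client.
type OrderClient interface {
	ListOrders(ctx context.Context, userID string) ([]string, error)
}

type Server struct{ Orders OrderClient }

// GetUserOrders reuses the inbound context for the downstream call, so the
// remaining deadline budget can only shrink as the request moves down the chain.
func (s *Server) GetUserOrders(ctx context.Context, userID string) ([]string, error) {
	downCtx := ctx
	if dl, ok := ctx.Deadline(); ok {
		// Reserve ~50ms of the inbound budget for our own work after the fan-out.
		var cancel context.CancelFunc
		downCtx, cancel = context.WithDeadline(ctx, dl.Add(-50*time.Millisecond))
		defer cancel()
	}
	return s.Orders.ListOrders(downCtx, userID)
}
```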
### Architecture sketch
- Each service runs the sidecar library in-process (no separate sidecar binary; keep it simple and defensible).
- Registry uses `etcd-io/raft` if Track 1 is also chosen; otherwise a single instance with TLS.
- Service discovery uses a long-poll `Watch` via gRPC server-streaming.
### Test rigor
- Unit: resolver, balancer, interceptor stacks.
- Integration: spin up all four services in-process and exercise the call graph, using `testcontainers` for the registry's Postgres if one is used.
- Chaos: a `chaos-injector` middleware that drops, delays, or errors a random percentage of requests (sketched after this list).
- Latency tests with `ghz` at multiple QPS levels.
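A sketch of the chaos-injector as a gRPC unary server interceptor; the rates and injected delay are parameters, and a streaming variant would follow the same pattern.

```go
package chaos

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// UnaryInterceptor fails a fraction of requests outright and delays another
// fraction, leaving the rest untouched.
func UnaryInterceptor(errorRate, delayRate float64, delay time.Duration) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		switch r := rand.Float64(); {
		case r < errorRate:
			return nil, status.Error(codes.Unavailable, "chaos: injected failure")
		case r < errorRate+delayRate:
			select {
			case <-time.After(delay):
			case <-ctx.Done(): // respect the propagated deadline even while injecting latency
				return nil, status.FromContextError(ctx.Err()).Err()
			}
		}
		return handler(ctx, req)
	}
}
```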
Hardening pass¶
pprofeverywhere; OTel everywhere.goleakper-service.- A reproducible Docker Compose stack and a one-command
make demothat brings it up with Jaeger and Prometheus. - Alarms wired: Prometheus rules on per-service error rate, p99 latency, registry watch lag.
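The per-service `goleak` gate is a few lines in each package's `TestMain`, assuming `go.uber.org/goleak`.

```go
package user_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails the package's test run if any goroutine outlives the tests,
// which catches leaked watch streams and forgotten cancels early.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```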
### Acceptance criteria

- All four services deployable with `make demo`.
- A flame graph demonstrating where sidecar overhead lives.
- A trace screenshot showing deadline-propagated failure across the call chain.
- Defensible answer to: "What happens if the registry leader is down for 30 seconds?"
### Skills exercised
- Months 3 (concurrency), 5 (gRPC mastery, observability), 6 (capstone defense, performance).
## Track 3 - Streaming Pipeline: A Kafka-Compatible Ingestion + Stream Processor
Outcome: a Kafka-protocol-compatible (subset) broker plus a stream-processing framework, with at-least-once delivery, exactly-once-effective consumer offsets, and replay.
### Functional spec
- Broker: implements a subset of the Kafka wire protocol (Produce, Fetch, Metadata, ListOffsets, OffsetCommit, OffsetFetch). Disk-backed log per partition; segment + index files.
- Stream processor: a small framework letting users write `func(input Stream[T]) Stream[U]` with operators (`Map`, `Filter`, `Window`, `Aggregate`, `Join`); see the sketch after this list.
- Consumer: offset management, rebalance protocol (subset).
- Producer: idempotent producer (within a session).
- Compatibility: works with `franz-go` (the leading Kafka Go client) for at least Produce/Fetch.
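A channel-backed sketch of the stream-processing surface; the `Stream[T]` representation is an assumption, and only `Map` and `Filter` are shown (operators are free functions because Go methods cannot introduce new type parameters).

```go
package stream

// Stream is a typed sequence of records, consumed once.
type Stream[T any] struct{ ch <-chan T }

// FromChannel wraps an existing channel as a Stream.
func FromChannel[T any](ch <-chan T) Stream[T] { return Stream[T]{ch: ch} }

// Map applies f to every record, preserving order.
func Map[T, U any](in Stream[T], f func(T) U) Stream[U] {
	out := make(chan U)
	go func() {
		defer close(out)
		for v := range in.ch {
			out <- f(v)
		}
	}()
	return Stream[U]{ch: out}
}

// Filter keeps only records for which keep returns true.
func Filter[T any](in Stream[T], keep func(T) bool) Stream[T] {
	out := make(chan T)
	go func() {
		defer close(out)
		for v := range in.ch {
			if keep(v) {
				out <- v
			}
		}
	}()
	return Stream[T]{ch: out}
}
```

A user pipeline then composes as `Filter(Map(input, parse), isValid)`, with `Window`, `Aggregate`, and `Join` following the same shape.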
### Non-functional spec
- 200K msgs/sec sustained on a single partition (commodity NVMe).
- Sub-50 ms producer ack p99 with `acks=all`.
- Replay from an arbitrary offset.
- Crash-recoverable: WAL fsync semantics documented.
### Architecture sketch
- One goroutine per partition for the disk-write path.
- mmap'd index files; sequential append to log files.
- Replication: Raft per partition (composes with Track 1) or a simpler primary-backup with a documented data-loss window.
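A sketch of the per-partition append path implied by these bullets, assuming a single writer goroutine per partition (so no locking on the hot path); the length-prefixed framing is a placeholder, not the Kafka record format.

```go
package storage

import (
	"encoding/binary"
	"os"
)

// Segment is the active log segment for one partition. A single goroutine
// owns it, so Append needs no synchronization.
type Segment struct {
	log     *os.File
	nextOff int64 // next logical offset to assign
	size    int64 // bytes written to the log file so far
}

// Append writes one length-prefixed record and returns its logical offset.
// Durability comes from the caller invoking Sync at the acks=all boundary.
func (s *Segment) Append(record []byte) (offset int64, err error) {
	var hdr [8]byte
	binary.BigEndian.PutUint64(hdr[:], uint64(len(record)))
	if _, err = s.log.Write(hdr[:]); err != nil {
		return 0, err
	}
	if _, err = s.log.Write(record); err != nil {
		return 0, err
	}
	offset = s.nextOff
	s.nextOff++
	s.size += int64(len(hdr) + len(record))
	return offset, nil
}

// Sync makes everything appended so far durable. Calling it once per produce
// batch, not per record, keeps fsync cost off the per-record path.
func (s *Segment) Sync() error { return s.log.Sync() }
```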
### Test rigor
- Unit: log segment boundary handling, offset arithmetic, index lookup.
- Integration: produce-and-consume tests against `franz-go`.
- Fuzz: protocol parser fuzzed against malformed records.
- Crash test: kill -9 during write; restart; verify WAL recovery.
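The fuzz bullet maps directly onto Go's native fuzzing (`go test -fuzz=FuzzParseRequest`); `parseRequest` below is a stand-in decoder included only so the sketch is self-contained.

```go
package protocol

import (
	"encoding/binary"
	"errors"
	"testing"
)

// parseRequest stands in for the real wire-protocol decoder: a length-prefixed
// frame followed by a 2-byte API key.
func parseRequest(data []byte) (apiKey uint16, err error) {
	if len(data) < 4 {
		return 0, errors.New("short frame")
	}
	size := binary.BigEndian.Uint32(data[:4])
	if uint64(size) > uint64(len(data)-4) {
		return 0, errors.New("claimed length exceeds buffer")
	}
	if size < 2 {
		return 0, errors.New("frame too small for header")
	}
	return binary.BigEndian.Uint16(data[4:6]), nil
}

func FuzzParseRequest(f *testing.F) {
	f.Add([]byte{})                                               // empty input
	f.Add([]byte{0x00, 0x00, 0x00, 0x02, 0x00, 0x12})             // minimal valid frame
	f.Add([]byte{0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00}) // absurd claimed length
	f.Fuzz(func(t *testing.T, data []byte) {
		// The decoder may reject input, but it must never panic or over-read.
		_, _ = parseRequest(data)
	})
}
```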
### Hardening pass

- `pprof` for the hot path (the produce-write loop must be 0 allocs/op per record; guard test sketched after this list).
- PGO with a sustained-throughput profile.
- A `runtime/trace` artifact showing zero scheduler stalls under load.
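A guard test for the 0 allocs/op requirement, using `testing.AllocsPerRun` against the `Segment.Append` path from the architecture sketch above; `newTestSegment` is a test-only helper defined here.

```go
package storage

import (
	"os"
	"path/filepath"
	"testing"
)

// newTestSegment opens a throwaway segment in a temp dir.
func newTestSegment(t *testing.T) *Segment {
	t.Helper()
	f, err := os.Create(filepath.Join(t.TempDir(), "00000000.log"))
	if err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() { f.Close() })
	return &Segment{log: f}
}

// TestAppendZeroAllocs fails CI if the per-record write path ever allocates.
func TestAppendZeroAllocs(t *testing.T) {
	seg := newTestSegment(t)
	record := make([]byte, 512)

	allocs := testing.AllocsPerRun(1000, func() {
		if _, err := seg.Append(record); err != nil {
			t.Fatal(err)
		}
	})
	if allocs != 0 {
		t.Fatalf("produce-write path allocates: %.1f allocs/op, want 0", allocs)
	}
}
```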
### Acceptance criteria
- Public repo, a reference-grade README.
- A throughput/latency benchmark vs. real Kafka on the same hardware.
- A replay demo showing a consumer offset rewound to a specific timestamp.
### Skills exercised
- Months 2 (memory + GC tuning, allocation discipline), 3 (concurrency at 200K msgs/sec), 5 (observability), 6.22 (storage patterns).
## Cross-Track Requirements
Regardless of track:
- Hardening template integrated. The `hardening/` template from Appendix A applies.
- Architectural Decision Records (ADRs). At least three for the capstone, each ~1 page.
- Threat model. One page minimum, no matter the track.
- Defense readiness. You should be able to walk a reviewer through the code in 45 minutes and answer "what fails first under load / fuzzing / a malicious input / a network partition?"
The track choice signals career direction: Track 1 for distributed-systems infrastructure roles, Track 2 for platform/SRE/networking roles, Track 3 for data-infra/streaming roles. Pick based on where you want the next interview loop, not on what looks easiest.