Capstone Projects: Three Tracks, One Choice

The Month 6 capstone is the deliverable that converts this curriculum from study into evidence. Pick one track. The work performed here is what you describe in interviews and link from a portfolio.


Track 1 (Distributed Storage): A Raft-Replicated KV Store

Outcome: a 3+ node Raft-replicated key-value store with linearizable reads, snapshots, online membership changes, and a Jepsen-style fault-injection harness verifying linearizability.

Functional spec

  • gRPC API: Get(key), Put(key, value), Delete(key), Watch(prefix) stream.
  • Cluster API: AddNode, RemoveNode, Leadership.
  • Linearizable reads via read-index (see the sketch after this list).
  • Snapshots every N entries (default 10K) with InstallSnapshot to recovering followers.
  • Persistent WAL via Pebble or BoltDB.
  • TLS between nodes; mutual authentication via X.509 certificates.
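
A minimal sketch of the read-index path referenced above. The raftNode and store interfaces are hypothetical stand-ins for whatever the Raft library (e.g. etcd-io/raft) and state machine actually expose; the point is the ordering of the four steps, not the API.

```go
// Package kv: read-index serving sketch (hypothetical interfaces).
package kv

import (
	"context"
	"errors"
)

// raftNode abstracts the pieces of the Raft layer this sketch needs.
type raftNode interface {
	IsLeader() bool
	CommitIndex() uint64                                  // leader's current commit index
	ConfirmLeadership(ctx context.Context) error          // heartbeat round-trip to a quorum
	WaitApplied(ctx context.Context, index uint64) error  // block until applyIndex >= index
}

type store interface {
	Get(key string) ([]byte, bool)
}

var errNotLeader = errors.New("not leader: retry against the current leader")

// LinearizableGet serves a read without appending a log entry:
//  1. capture the commit index as the "read index",
//  2. confirm we are still leader with a quorum heartbeat,
//  3. wait until the state machine has applied up to the read index,
//  4. answer from local state.
func LinearizableGet(ctx context.Context, rn raftNode, st store, key string) ([]byte, error) {
	if !rn.IsLeader() {
		return nil, errNotLeader
	}
	readIndex := rn.CommitIndex()
	if err := rn.ConfirmLeadership(ctx); err != nil {
		return nil, err
	}
	if err := rn.WaitApplied(ctx, readIndex); err != nil {
		return nil, err
	}
	val, ok := st.Get(key)
	if !ok {
		return nil, errors.New("key not found")
	}
	return val, nil
}
```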

Non-functional spec

  • Sustained 50K writes/sec on commodity hardware (3-node, NVMe).
  • p99 write latency below 10 ms at 50% utilization.
  • Recovery time (leader change → fully available) under 1 s for a 3-node cluster.
  • Survives a single-node crash without data loss; survives a network partition with a clear majority.

Architecture sketch

  • One goroutine per node consumes from the etcd-io/raft Ready channel.
  • Apply loop: stream committed entries → state machine → respond to clients.
  • Network: gRPC with a long-lived bidi stream per peer pair.
  • State machine: a sharded map[string][]byte with versioning for Watch.
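
A compressed sketch of the Ready/apply loop in the architecture above. The field names mirror the shape of etcd-io/raft's Ready struct, but the types here are simplified stand-ins, and the wal, transport, and stateMachine interfaces are hypothetical.

```go
// Package node: per-node Ready/apply loop sketch.
package node

import "context"

type Entry struct {
	Index uint64
	Data  []byte
}

// Ready carries what the Raft library wants persisted, sent, and applied.
type Ready struct {
	Entries          []Entry  // must hit the WAL before messages go out
	CommittedEntries []Entry  // safe to apply to the state machine
	Messages         [][]byte // outbound peer messages (already encoded)
}

type raftNode interface {
	Ready() <-chan Ready
	Advance() // tell the library we finished processing the last Ready
}

type wal interface{ Save(entries []Entry) error }
type transport interface{ Send(msgs [][]byte) }
type stateMachine interface{ Apply(e Entry) }

// run is the single goroutine that drives one node.
func run(ctx context.Context, rn raftNode, w wal, tr transport, sm stateMachine) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case rd := <-rn.Ready():
			if err := w.Save(rd.Entries); err != nil { // 1. persist
				return err
			}
			tr.Send(rd.Messages) // 2. replicate
			for _, e := range rd.CommittedEntries {
				sm.Apply(e) // 3. apply, then respond to waiting clients
			}
			rn.Advance() // 4. acknowledge this Ready batch
		}
	}
}
```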

Test rigor

  • Unit: state-machine transitions, log-truncation invariants.
  • Integration: a 3-node local cluster spun up via t.Run subtests; exercise membership changes.
  • Fault injection: a "nemesis" goroutine that randomly partitions, pauses, and crashes nodes (see the sketch after this list); the client operation history is fed to a linearizability checker (Knossos run as an external Clojure process, or a lightweight Go port).
  • Race-clean under sustained load.
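
A minimal nemesis sketch for the fault-injection item above. The cluster interface is a hypothetical test-harness handle; a real implementation would drive iptables rules, SIGSTOP/SIGCONT, and process kills behind it.

```go
// Package chaos: nemesis goroutine sketch for the fault-injection harness.
package chaos

import (
	"context"
	"math/rand"
	"time"
)

type cluster interface {
	Partition(a, b []int) // drop traffic between the two groups
	Heal()                // remove all partitions
	Pause(node int)       // SIGSTOP
	Resume(node int)      // SIGCONT
	Crash(node int)       // kill -9
	Restart(node int)
	Nodes() []int
}

// Nemesis injects one random fault per interval, waits, then repairs it.
func Nemesis(ctx context.Context, c cluster, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			nodes := c.Nodes()
			victim := nodes[rand.Intn(len(nodes))]
			switch rand.Intn(3) {
			case 0: // isolate one node from the rest
				rest := make([]int, 0, len(nodes)-1)
				for _, n := range nodes {
					if n != victim {
						rest = append(rest, n)
					}
				}
				c.Partition([]int{victim}, rest)
				time.Sleep(interval / 2)
				c.Heal()
			case 1: // pause, then resume
				c.Pause(victim)
				time.Sleep(interval / 2)
				c.Resume(victim)
			case 2: // crash, then restart
				c.Crash(victim)
				time.Sleep(interval / 2)
				c.Restart(victim)
			}
		}
	}
}
```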

Hardening pass

  • goreleaser, cosign signing, SBOM via cyclonedx-gomod.
  • GOMEMLIMIT derived from the cgroup memory limit; GOMAXPROCS via automaxprocs.
  • PGO with a representative workload.
  • pprof + runtime/trace capture endpoints.
  • OTel traces across the Raft RPC layer (custom interceptor).
  • A RUNBOOK.md covering: leader-stuck triage, log-corruption recovery, snapshot-restore procedure.
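
A sketch of the process-level items above: GOMAXPROCS via automaxprocs, GOMEMLIMIT derived from the cgroup v2 limit, and the pprof/trace capture endpoints that come with the net/http/pprof import. The 90% headroom factor, the cgroup file path, and the debug port are choices made for this sketch, not requirements.

```go
// Process hardening sketch: memory limit, GOMAXPROCS, debug endpoints.
package main

import (
	"bytes"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* (incl. trace) on http.DefaultServeMux
	"os"
	"runtime/debug"
	"strconv"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota
)

// setMemLimitFromCgroup reads the cgroup v2 memory limit and hands ~90% of it
// to the runtime, leaving headroom for non-Go memory.
func setMemLimitFromCgroup() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not a cgroup v2 environment; fall back to the GOMEMLIMIT env var
	}
	s := string(bytes.TrimSpace(raw))
	if s == "max" {
		return
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(limit * 9 / 10)
}

func main() {
	setMemLimitFromCgroup()
	go func() {
		// pprof + runtime/trace capture on a loopback-only debug port.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... start the KV server here ...
	select {}
}
```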

Acceptance criteria

  • Public repo with all of the above.
  • A README that includes: a topology diagram, a load-test latency CDF, a Jepsen-style report.
  • Defensible answer to: "What happens during a network partition where a majority can elect a new leader but the old leader is still up?"

Skills exercised

  • Months 3 (concurrency), 5 (gRPC, observability), 6.21–6.22 (Raft, distributed storage).

Track 2 (Service Mesh): A gRPC Microservices Mesh

Outcome: a multi-service mesh demonstrating a custom service registry, health checking, deadline propagation, retries, outlier ejection, and end-to-end OTel tracing across at least four interconnected services.

Functional spec

  • A Registry service: gRPC interface for Register, Deregister, Watch, LookupHealthy. Backed by an in-memory store with optional Raft replication (composes with Track 1).
  • A Sidecar library (see the resolver sketch after this list) that:
      • Resolves service names via the registry (custom gRPC resolver.Builder).
      • Implements client-side load balancing with round-robin plus outlier ejection.
      • Propagates OTel context, deadlines, and a request_id.
      • Adds a retry policy via gRPC service config.
  • Four demo services (e.g., user, order, inventory, payment) with a fan-out call graph that exercises retries, timeouts, and partial failures.
  • A mesh-cli for service inspection and chaos injection.
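
A sketch of the registry-backed resolver referenced in the sidecar item above. The registryClient interface and its WatchHealthy stream are hypothetical; resolver.Builder and resolver.Resolver are the real grpc-go extension points (this assumes a recent grpc-go where resolver.Target has an Endpoint() method).

```go
// Package sidecar: registry-backed gRPC name resolver sketch.
package sidecar

import (
	"context"

	"google.golang.org/grpc/resolver"
)

type registryClient interface {
	// WatchHealthy streams the current healthy endpoints for a service,
	// re-sending the full set whenever membership changes.
	WatchHealthy(ctx context.Context, service string) (<-chan []string, error)
}

type registryBuilder struct{ reg registryClient }

func NewBuilder(reg registryClient) resolver.Builder { return &registryBuilder{reg: reg} }

func (b *registryBuilder) Scheme() string { return "mesh" } // dial targets look like mesh:///order

func (b *registryBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	ctx, cancel := context.WithCancel(context.Background())
	updates, err := b.reg.WatchHealthy(ctx, target.Endpoint())
	if err != nil {
		cancel()
		return nil, err
	}
	r := &registryResolver{cancel: cancel}
	go func() {
		for addrs := range updates {
			state := resolver.State{}
			for _, a := range addrs {
				state.Addresses = append(state.Addresses, resolver.Address{Addr: a})
			}
			_ = cc.UpdateState(state) // hand the new endpoint set to the balancer
		}
	}()
	return r, nil
}

type registryResolver struct{ cancel context.CancelFunc }

func (r *registryResolver) ResolveNow(resolver.ResolveNowOptions) {} // registry pushes; nothing to do
func (r *registryResolver) Close()                                 { r.cancel() }
```

The builder would be registered once at startup with resolver.Register(NewBuilder(reg)), after which clients dial mesh:///order and the balancer handles the rest.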

Non-functional spec

  • Sub-millisecond p99 sidecar overhead per RPC.
  • Outlier ejection within 10 s of an endpoint going bad.
  • Deadline propagation: an inbound 1 s deadline must result in downstream calls seeing strictly less than 1 s remaining.
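
The deadline rule in the last item needs no machinery beyond reusing the inbound context for outbound calls, since gRPC encodes the remaining time per hop. The sketch below pairs that with a client interceptor that copies request_id from inbound to outbound metadata; the "request_id" key is a project convention assumed here, not a gRPC standard.

```go
// Package sidecar: context/metadata propagation interceptor sketch.
package sidecar

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// PropagateRequestID is a unary client interceptor installed by the sidecar
// library (grpc.WithUnaryInterceptor(PropagateRequestID) when dialing).
func PropagateRequestID(ctx context.Context, method string, req, reply any,
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {

	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if ids := md.Get("request_id"); len(ids) > 0 {
			ctx = metadata.AppendToOutgoingContext(ctx, "request_id", ids[0])
		}
	}
	// ctx still carries the inbound deadline, so downstream calls see a
	// strictly smaller remaining budget.
	return invoker(ctx, method, req, reply, cc, opts...)
}
```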

Architecture sketch

  • Each service runs the sidecar library in-process (no separate sidecar binary; keep it simple and defensible).
  • Registry uses etcd-io/raft if Track 1 is also chosen; otherwise a single instance with TLS.
  • Service discovery uses long-poll Watch via gRPC server-streaming.
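
A sketch of the server side of the long-poll Watch from the last item. WatchServerStream is a hypothetical stand-in for the proto-generated server-streaming interface, and the store's Subscribe channel is also an assumption of this sketch.

```go
// Package registry: server-streaming Watch handler sketch.
package registry

import "context"

type Endpoints struct {
	Service string
	Addrs   []string
}

// WatchServerStream mimics the Send/Context surface of a generated
// server-streaming interface.
type WatchServerStream interface {
	Send(*Endpoints) error
	Context() context.Context
}

type store interface {
	// Subscribe returns the current endpoint set immediately, then a new
	// snapshot on every change, until ctx is cancelled.
	Subscribe(ctx context.Context, service string) <-chan Endpoints
}

// Watch pushes endpoint snapshots until the client goes away or Send fails.
func Watch(st store, service string, stream WatchServerStream) error {
	ctx := stream.Context()
	updates := st.Subscribe(ctx, service)
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case eps, ok := <-updates:
			if !ok {
				return nil
			}
			if err := stream.Send(&eps); err != nil {
				return err
			}
		}
	}
}
```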

Test rigor

  • Unit: resolver, balancer, interceptor stacks.
  • Integration: spin up all four services in-process and exercise the call graph; use testcontainers for the registry's Postgres, if one is used.
  • Chaos: a chaos-injector middleware that drops, delays, or fails a configurable percentage of requests (see the sketch after this list).
  • Latency tests with ghz at multiple QPS levels.
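
A sketch of the chaos-injector middleware as a unary server interceptor. The rates are plain struct fields here; a real version would expose them to mesh-cli at runtime.

```go
// Package chaos: fault-injecting gRPC server interceptor sketch.
package chaos

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type Injector struct {
	DropRate  float64 // probability of returning UNAVAILABLE immediately
	ErrorRate float64 // probability of returning INTERNAL
	DelayRate float64 // probability of sleeping Delay before handling
	Delay     time.Duration
}

// Unary satisfies grpc.UnaryServerInterceptor; register it on the demo
// services with grpc.ChainUnaryInterceptor(inj.Unary).
func (i *Injector) Unary(ctx context.Context, req any, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (any, error) {

	switch r := rand.Float64(); {
	case r < i.DropRate:
		return nil, status.Error(codes.Unavailable, "chaos: dropped")
	case r < i.DropRate+i.ErrorRate:
		return nil, status.Error(codes.Internal, "chaos: injected error")
	case r < i.DropRate+i.ErrorRate+i.DelayRate:
		select {
		case <-time.After(i.Delay):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return handler(ctx, req)
}
```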

Hardening pass

  • pprof everywhere; OTel everywhere.
  • goleak per-service.
  • A reproducible Docker Compose stack and a one-command make demo that brings it up with Jaeger and Prometheus.
  • Alarms wired: Prometheus rules on per-service error rate, p99 latency, registry watch lag.

Acceptance criteria

  • All four services deployable with make demo.
  • A flame graph demonstrating where sidecar overhead lives.
  • A trace screenshot showing deadline-propagated failure across the call chain.
  • Defensible answer to: "What happens if the registry leader is down for 30 seconds?"

Skills exercised

  • Months 3 (concurrency), 5 (gRPC mastery, observability), 6 (capstone defense, performance).

Track 3 (Streaming Pipeline): A Kafka-Compatible Ingestion Layer + Stream Processor

Outcome: a Kafka-protocol-compatible (subset) broker plus a stream-processing framework, with at-least-once delivery, exactly-once-effective consumer offsets, and replay.

Functional spec

  • Broker: implements a subset of the Kafka wire protocol (Produce, Fetch, Metadata, ListOffsets, OffsetCommit, OffsetFetch). Disk-backed log per partition; segment + index files.
  • Stream processor: a small framework letting users write func(input Stream[T]) Stream[U] with operators (Map, Filter, Window, Aggregate, Join); see the sketch after this list.
  • Consumer: offset management, rebalance protocol (subset).
  • Producer: idempotent within a single session.
  • Compatibility: works with franz-go (the leading Kafka Go client) for at least Produce/Fetch.
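
A minimal sketch of the Stream[T] operator surface referenced above, modeled as channel plumbing. Because Go methods cannot introduce new type parameters, type-changing operators like Map are free functions here; windowing, joins, and offset handling are omitted.

```go
// Package stream: generic operator API sketch.
package stream

// Stream is a sequence of records of type T, consumed once.
type Stream[T any] <-chan T

// FromSlice is a convenience source for tests and examples.
func FromSlice[T any](items []T) Stream[T] {
	out := make(chan T)
	go func() {
		defer close(out)
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

// Map applies f to every record, producing a stream of a possibly new type.
func Map[T, U any](in Stream[T], f func(T) U) Stream[U] {
	out := make(chan U)
	go func() {
		defer close(out)
		for v := range in {
			out <- f(v)
		}
	}()
	return out
}

// Filter keeps only records for which keep returns true.
func Filter[T any](in Stream[T], keep func(T) bool) Stream[T] {
	out := make(chan T)
	go func() {
		defer close(out)
		for v := range in {
			if keep(v) {
				out <- v
			}
		}
	}()
	return out
}
```

A pipeline then composes as, for example, Filter(Map(FromSlice(records), parse), isValid), with the Kafka consumer standing in for FromSlice in the real framework.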

Non-functional spec

  • 200K msgs/sec sustained on a single partition (commodity NVMe).
  • Sub-50 ms producer ack p99 with acks=all.
  • Replay from arbitrary offset.
  • Crash-recoverable: WAL fsync semantics documented.

Architecture sketch

  • One goroutine per partition for the disk-write path.
  • mmap'd index files; sequential append to log files.
  • Replication: Raft per partition (composes with Track 1) or a simpler primary-backup with a documented data-loss window.
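
A sketch of the per-partition append path from the architecture above: one writer goroutine owns the active segment, appends length-prefixed records sequentially, and maintains a sparse index of (relative offset, file position) pairs. The on-disk layout and indexInterval are choices made for this sketch, and it uses plain file writes rather than an mmap'd index for brevity.

```go
// Package commitlog: segment append path sketch.
package commitlog

import (
	"encoding/binary"
	"os"
)

const indexInterval = 4096 // one index entry roughly every 4 KiB of log

type segment struct {
	log         *os.File
	index       *os.File // (relativeOffset uint32, position uint32) pairs
	baseOffset  int64    // first offset stored in this segment
	nextOffset  int64
	pos         int64 // current write position in the log file
	lastIndexed int64 // log position of the most recent index entry
}

// append writes one length-prefixed record, maybe an index entry, and returns
// the offset assigned to the record. Only the partition's writer goroutine
// calls it, so no locking is needed.
func (s *segment) append(record []byte) (int64, error) {
	start := s.pos
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
	if _, err := s.log.Write(hdr[:]); err != nil {
		return 0, err
	}
	if _, err := s.log.Write(record); err != nil {
		return 0, err
	}
	off := s.nextOffset
	if start == 0 || start-s.lastIndexed >= indexInterval {
		var entry [8]byte
		binary.BigEndian.PutUint32(entry[0:4], uint32(off-s.baseOffset))
		binary.BigEndian.PutUint32(entry[4:8], uint32(start))
		if _, err := s.index.Write(entry[:]); err != nil {
			return 0, err
		}
		s.lastIndexed = start
	}
	s.pos = start + int64(4+len(record))
	s.nextOffset++
	return off, nil
}

// sync is the durability point for acks=all: log data before index metadata.
func (s *segment) sync() error {
	if err := s.log.Sync(); err != nil {
		return err
	}
	return s.index.Sync()
}
```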

Test rigor

  • Unit: log segment boundary handling, offset arithmetic, index lookup.
  • Integration: produce-and-consume tests against franz-go.
  • Fuzz: protocol parser fuzzed against malformed records (see the target sketch after this list).
  • Crash test: kill -9 during write; restart; verify WAL recovery.
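
A sketch of the protocol-parser fuzz target referenced above, using Go's native fuzzing. parseRequest here is a toy stub standing in for the real wire-protocol decoder so the block is self-contained; the property checked is simply "malformed input is rejected, never a panic".

```go
// Package protocol: fuzz target sketch for the wire-protocol parser.
package protocol

import (
	"encoding/binary"
	"errors"
	"testing"
)

var errMalformed = errors.New("malformed request frame")

type request struct {
	apiKey        int16
	correlationID int32
}

// parseRequest is a stub standing in for the project's real decoder.
func parseRequest(data []byte) (*request, error) {
	if len(data) < 6 {
		return nil, errMalformed
	}
	return &request{
		apiKey:        int16(binary.BigEndian.Uint16(data[0:2])),
		correlationID: int32(binary.BigEndian.Uint32(data[2:6])),
	}, nil
}

func FuzzParseRequest(f *testing.F) {
	// Seed with one minimal well-formed frame and one empty frame.
	f.Add([]byte{0, 0, 0, 0, 0, 1})
	f.Add([]byte{})
	f.Fuzz(func(t *testing.T, data []byte) {
		req, err := parseRequest(data)
		if err != nil {
			return // rejection is fine; crashing is not
		}
		if req == nil {
			t.Fatal("nil request returned with nil error")
		}
	})
}
```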

Hardening pass

  • pprof for the hot path; the produce-write loop must stay at zero allocations per record (see the benchmark sketch after this list).
  • PGO with a sustained-throughput profile.
  • runtime/trace artifact showing zero scheduler stalls under load.
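
A sketch of how the zero-allocation requirement above can be enforced in CI rather than eyeballed from benchmark output. appendRecord is a trivial stand-in for the real produce-write hot path; the useful part is the shape of the guard (ReportAllocs plus testing.AllocsPerRun).

```go
// Package commitlog: allocation gate sketch for the produce-write path.
package commitlog

import "testing"

// appendRecord stands in for the real hot path; it copies into a preallocated
// buffer so the stub itself is allocation-free.
var (
	buf  = make([]byte, 1<<20)
	pos  int
	sink int64
)

func appendRecord(record []byte) int64 {
	if pos+len(record) > len(buf) {
		pos = 0
	}
	copy(buf[pos:], record)
	pos += len(record)
	return int64(pos)
}

func BenchmarkAppendRecord(b *testing.B) {
	record := make([]byte, 256)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = appendRecord(record)
	}
}

// TestAppendRecordAllocations turns "0 allocs/op per record" into a CI gate.
func TestAppendRecordAllocations(t *testing.T) {
	record := make([]byte, 256)
	allocs := testing.AllocsPerRun(1000, func() { sink = appendRecord(record) })
	if allocs != 0 {
		t.Fatalf("produce-write path allocates %.0f times per record; want 0", allocs)
	}
}
```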

Acceptance criteria

  • Public repo with a reference-grade README.
  • A throughput/latency benchmark vs. real Kafka on the same hardware.
  • A replay demo that rewinds a consumer offset to a specific timestamp.

Skills exercised

  • Months 2 (memory + GC tuning, allocation discipline), 3 (concurrency at 200K msgs/sec), 5 (observability), 6.22 (storage patterns).

Cross-Track Requirements

Regardless of track:

  • Hardening template integrated. The hardening/ template from Appendix A applies.
  • Architectural Decision Records (ADRs). At least three for the capstone, each ~1 page.
  • Threat model. One page minimum, no matter the track.
  • Defense readiness. You should be able to walk a reviewer through the code in 45 minutes and answer "what fails first under load / fuzzing / a malicious input / a network partition?"

The track choice signals career direction: Track 1 for distributed-systems infrastructure roles, Track 2 for platform/SRE/networking roles, Track 3 for data-infra/streaming roles. Pick based on where you want the next interview loop, not on what looks easiest.
