Capstone Projects: Three Tracks, One Choice

The Month 6 capstone is the deliverable that converts this curriculum from study into evidence. Pick one track. The work performed here is what you describe in interviews and link from a portfolio.


Track 1 (Distributed Storage): A Raft-Replicated KV Store

Outcome: a 3+ node Raft-replicated key-value store with linearizable reads, snapshots, online membership changes, and a Jepsen-style fault-injection harness verifying linearizability.

Functional spec

  • gRPC API: Get(key), Put(key, value), Delete(key), Watch(prefix) stream.
  • Cluster API: AddNode, RemoveNode, Leadership.
  • Linearizable reads via read-index (see the sketch after this list).
  • Snapshots every N entries (default 10K) with InstallSnapshot to recovering followers.
  • Persistent WAL via Pebble or BoltDB.
  • TLS between nodes; mutual authentication via X.509 certificates.
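
A minimal sketch of the read-index path referenced above. The raftNode and store interfaces are hypothetical stand-ins for whatever the Raft library (e.g. etcd-io/raft) and state machine actually expose; the point is the ordering of the four steps, not the API.

```go
// Package kv: read-index serving sketch (hypothetical interfaces).
package kv

import (
	"context"
	"errors"
)

// raftNode abstracts the pieces of the Raft layer this sketch needs.
type raftNode interface {
	IsLeader() bool
	CommitIndex() uint64                                  // leader's current commit index
	ConfirmLeadership(ctx context.Context) error          // heartbeat round-trip to a quorum
	WaitApplied(ctx context.Context, index uint64) error  // block until applyIndex >= index
}

type store interface {
	Get(key string) ([]byte, bool)
}

var errNotLeader = errors.New("not leader: retry against the current leader")

// LinearizableGet serves a read without appending a log entry:
//  1. capture the commit index as the "read index",
//  2. confirm we are still leader with a quorum heartbeat,
//  3. wait until the state machine has applied up to the read index,
//  4. answer from local state.
func LinearizableGet(ctx context.Context, rn raftNode, st store, key string) ([]byte, error) {
	if !rn.IsLeader() {
		return nil, errNotLeader
	}
	readIndex := rn.CommitIndex()
	if err := rn.ConfirmLeadership(ctx); err != nil {
		return nil, err
	}
	if err := rn.WaitApplied(ctx, readIndex); err != nil {
		return nil, err
	}
	val, ok := st.Get(key)
	if !ok {
		return nil, errors.New("key not found")
	}
	return val, nil
}
```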

Non-functional spec

  • Sustained 50K writes/sec on commodity hardware (3-node, NVMe).
  • p99 write latency below 10 ms at 50% utilization.
  • Recovery time (leader change → fully available) under 1 s for a 3-node cluster.
  • Survives a single-node crash without data loss; survives a network partition with a clear majority.

Architecture sketch

  • One goroutine per node consumes from the etcd-io/raft Ready channel.
  • Apply loop: stream committed entries → state machine → respond to clients.
  • Network: gRPC with a long-lived bidi stream per peer pair.
  • State machine: a sharded map[string][]byte with versioning for Watch.
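
A compressed sketch of the Ready/apply loop in the architecture above. The field names mirror the shape of etcd-io/raft's Ready struct, but the types here are simplified stand-ins, and the wal, transport, and stateMachine interfaces are hypothetical.

```go
// Package node: per-node Ready/apply loop sketch.
package node

import "context"

type Entry struct {
	Index uint64
	Data  []byte
}

// Ready carries what the Raft library wants persisted, sent, and applied.
type Ready struct {
	Entries          []Entry  // must hit the WAL before messages go out
	CommittedEntries []Entry  // safe to apply to the state machine
	Messages         [][]byte // outbound peer messages (already encoded)
}

type raftNode interface {
	Ready() <-chan Ready
	Advance() // tell the library we finished processing the last Ready
}

type wal interface{ Save(entries []Entry) error }
type transport interface{ Send(msgs [][]byte) }
type stateMachine interface{ Apply(e Entry) }

// run is the single goroutine that drives one node.
func run(ctx context.Context, rn raftNode, w wal, tr transport, sm stateMachine) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case rd := <-rn.Ready():
			if err := w.Save(rd.Entries); err != nil { // 1. persist
				return err
			}
			tr.Send(rd.Messages) // 2. replicate
			for _, e := range rd.CommittedEntries {
				sm.Apply(e) // 3. apply, then respond to waiting clients
			}
			rn.Advance() // 4. acknowledge this Ready batch
		}
	}
}
```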

Test rigor

  • Unit: state-machine transitions, log-truncation invariants.
  • Integration: a 3-node local cluster spun up via t.Run subtests; exercise membership changes.
  • Fault injection: a "nemesis" goroutine that randomly partitions, pauses, and crashes nodes (see the sketch after this list); the client operation history is fed to a linearizability checker (Knossos run as an external Clojure process, or a lightweight Go port).
  • Race-clean under sustained load.
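
A minimal nemesis sketch for the fault-injection item above. The cluster interface is a hypothetical test-harness handle; a real implementation would drive iptables rules, SIGSTOP/SIGCONT, and process kills behind it.

```go
// Package chaos: nemesis goroutine sketch for the fault-injection harness.
package chaos

import (
	"context"
	"math/rand"
	"time"
)

type cluster interface {
	Partition(a, b []int) // drop traffic between the two groups
	Heal()                // remove all partitions
	Pause(node int)       // SIGSTOP
	Resume(node int)      // SIGCONT
	Crash(node int)       // kill -9
	Restart(node int)
	Nodes() []int
}

// Nemesis injects one random fault per interval, waits, then repairs it.
func Nemesis(ctx context.Context, c cluster, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			nodes := c.Nodes()
			victim := nodes[rand.Intn(len(nodes))]
			switch rand.Intn(3) {
			case 0: // isolate one node from the rest
				rest := make([]int, 0, len(nodes)-1)
				for _, n := range nodes {
					if n != victim {
						rest = append(rest, n)
					}
				}
				c.Partition([]int{victim}, rest)
				time.Sleep(interval / 2)
				c.Heal()
			case 1: // pause, then resume
				c.Pause(victim)
				time.Sleep(interval / 2)
				c.Resume(victim)
			case 2: // crash, then restart
				c.Crash(victim)
				time.Sleep(interval / 2)
				c.Restart(victim)
			}
		}
	}
}
```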

Hardening pass

  • goreleaser, cosign signing, SBOM via cyclonedx-gomod.
  • GOMEMLIMIT derived from the cgroup memory limit; GOMAXPROCS via automaxprocs.
  • PGO with a representative workload.
  • pprof + runtime/trace capture endpoints.
  • OTel traces across the Raft RPC layer (custom interceptor).
  • A RUNBOOK.md covering: leader-stuck triage, log-corruption recovery, snapshot-restore procedure.
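
A sketch of the process-level items above: GOMAXPROCS via automaxprocs, GOMEMLIMIT derived from the cgroup v2 limit, and the pprof/trace capture endpoints that come with the net/http/pprof import. The 90% headroom factor, the cgroup file path, and the debug port are choices made for this sketch, not requirements.

```go
// Process hardening sketch: memory limit, GOMAXPROCS, debug endpoints.
package main

import (
	"bytes"
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* (incl. trace) on http.DefaultServeMux
	"os"
	"runtime/debug"
	"strconv"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota
)

// setMemLimitFromCgroup reads the cgroup v2 memory limit and hands ~90% of it
// to the runtime, leaving headroom for non-Go memory.
func setMemLimitFromCgroup() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not a cgroup v2 environment; fall back to the GOMEMLIMIT env var
	}
	s := string(bytes.TrimSpace(raw))
	if s == "max" {
		return
	}
	limit, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return
	}
	debug.SetMemoryLimit(limit * 9 / 10)
}

func main() {
	setMemLimitFromCgroup()
	go func() {
		// pprof + runtime/trace capture on a loopback-only debug port.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... start the KV server here ...
	select {}
}
```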

Acceptance criteria

  • Public repo with all of the above.
  • A README that includes: a topology diagram, a load-test latency CDF, a Jepsen-style report.
  • Defensible answer to: "What happens during a network partition where a majority can elect a new leader but the old leader is still up?"

Skills exercised

  • Months 3 (concurrency), 5 (gRPC, observability), 6.21–6.22 (Raft, distributed storage).

Track 2 (Service Mesh): A gRPC Microservices Mesh

Outcome: a multi-service mesh demonstrating a custom service registry, health checking, deadline propagation, retries, outlier ejection, and end-to-end OTel tracing across at least four interconnected services.

Functional spec

  • A Registry service: gRPC interface for Register, Deregister, Watch, LookupHealthy. Backed by an in-memory store with optional Raft replication (composes with Track 1).
  • A Sidecar library (see the resolver sketch after this list) that:
      • Resolves service names via the registry (custom gRPC resolver.Builder).
      • Implements client-side load balancing with round-robin plus outlier ejection.
      • Propagates OTel context, deadlines, and a request_id.
      • Adds a retry policy via gRPC service config.
  • Four demo services (e.g., user, order, inventory, payment) with a fan-out call graph that exercises retries, timeouts, and partial failures.
  • A mesh-cli for service inspection and chaos injection.
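
A sketch of the registry-backed resolver referenced in the sidecar item above. The registryClient interface and its WatchHealthy stream are hypothetical; resolver.Builder and resolver.Resolver are the real grpc-go extension points (this assumes a recent grpc-go where resolver.Target has an Endpoint() method).

```go
// Package sidecar: registry-backed gRPC name resolver sketch.
package sidecar

import (
	"context"

	"google.golang.org/grpc/resolver"
)

type registryClient interface {
	// WatchHealthy streams the current healthy endpoints for a service,
	// re-sending the full set whenever membership changes.
	WatchHealthy(ctx context.Context, service string) (<-chan []string, error)
}

type registryBuilder struct{ reg registryClient }

func NewBuilder(reg registryClient) resolver.Builder { return &registryBuilder{reg: reg} }

func (b *registryBuilder) Scheme() string { return "mesh" } // dial targets look like mesh:///order

func (b *registryBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	ctx, cancel := context.WithCancel(context.Background())
	updates, err := b.reg.WatchHealthy(ctx, target.Endpoint())
	if err != nil {
		cancel()
		return nil, err
	}
	r := &registryResolver{cancel: cancel}
	go func() {
		for addrs := range updates {
			state := resolver.State{}
			for _, a := range addrs {
				state.Addresses = append(state.Addresses, resolver.Address{Addr: a})
			}
			_ = cc.UpdateState(state) // hand the new endpoint set to the balancer
		}
	}()
	return r, nil
}

type registryResolver struct{ cancel context.CancelFunc }

func (r *registryResolver) ResolveNow(resolver.ResolveNowOptions) {} // registry pushes; nothing to do
func (r *registryResolver) Close()                                 { r.cancel() }
```

The builder would be registered once at startup with resolver.Register(NewBuilder(reg)), after which clients dial mesh:///order and the balancer handles the rest.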

Non-functional spec

  • Sub-millisecond p99 sidecar overhead per RPC.
  • Outlier ejection within 10 s of an endpoint going bad.
  • Deadline propagation: an inbound 1 s deadline must result in downstream calls seeing strictly less than 1 s remaining.
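
The deadline rule in the last item needs no machinery beyond reusing the inbound context for outbound calls, since gRPC encodes the remaining time per hop. The sketch below pairs that with a client interceptor that copies request_id from inbound to outbound metadata; the "request_id" key is a project convention assumed here, not a gRPC standard.

```go
// Package sidecar: context/metadata propagation interceptor sketch.
package sidecar

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// PropagateRequestID is a unary client interceptor installed by the sidecar
// library (grpc.WithUnaryInterceptor(PropagateRequestID) when dialing).
func PropagateRequestID(ctx context.Context, method string, req, reply any,
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {

	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if ids := md.Get("request_id"); len(ids) > 0 {
			ctx = metadata.AppendToOutgoingContext(ctx, "request_id", ids[0])
		}
	}
	// ctx still carries the inbound deadline, so downstream calls see a
	// strictly smaller remaining budget.
	return invoker(ctx, method, req, reply, cc, opts...)
}
```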

Architecture sketch

  • Each service runs the sidecar library in-process (no separate sidecar binary; keep it simple and defensible).
  • Registry uses etcd-io/raft if Track 1 is also chosen; otherwise a single instance with TLS.
  • Service discovery uses long-poll Watch via gRPC server-streaming.
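
A sketch of the server side of the long-poll Watch from the last item. WatchServerStream is a hypothetical stand-in for the proto-generated server-streaming interface, and the store's Subscribe channel is also an assumption of this sketch.

```go
// Package registry: server-streaming Watch handler sketch.
package registry

import "context"

type Endpoints struct {
	Service string
	Addrs   []string
}

// WatchServerStream mimics the Send/Context surface of a generated
// server-streaming interface.
type WatchServerStream interface {
	Send(*Endpoints) error
	Context() context.Context
}

type store interface {
	// Subscribe returns the current endpoint set immediately, then a new
	// snapshot on every change, until ctx is cancelled.
	Subscribe(ctx context.Context, service string) <-chan Endpoints
}

// Watch pushes endpoint snapshots until the client goes away or Send fails.
func Watch(st store, service string, stream WatchServerStream) error {
	ctx := stream.Context()
	updates := st.Subscribe(ctx, service)
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case eps, ok := <-updates:
			if !ok {
				return nil
			}
			if err := stream.Send(&eps); err != nil {
				return err
			}
		}
	}
}
```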

Test rigor

  • Unit: resolver, balancer, interceptor stacks.
  • Integration: spin up all four services in-process and exercise the call graph; use testcontainers for the registry's Postgres, if one is used.
  • Chaos: a chaos-injector middleware that drops, delays, or fails a configurable percentage of requests (see the sketch after this list).
  • Latency tests with ghz at multiple QPS levels.
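
A sketch of the chaos-injector middleware as a unary server interceptor. The rates are plain struct fields here; a real version would expose them to mesh-cli at runtime.

```go
// Package chaos: fault-injecting gRPC server interceptor sketch.
package chaos

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type Injector struct {
	DropRate  float64 // probability of returning UNAVAILABLE immediately
	ErrorRate float64 // probability of returning INTERNAL
	DelayRate float64 // probability of sleeping Delay before handling
	Delay     time.Duration
}

// Unary satisfies grpc.UnaryServerInterceptor; register it on the demo
// services with grpc.ChainUnaryInterceptor(inj.Unary).
func (i *Injector) Unary(ctx context.Context, req any, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (any, error) {

	switch r := rand.Float64(); {
	case r < i.DropRate:
		return nil, status.Error(codes.Unavailable, "chaos: dropped")
	case r < i.DropRate+i.ErrorRate:
		return nil, status.Error(codes.Internal, "chaos: injected error")
	case r < i.DropRate+i.ErrorRate+i.DelayRate:
		select {
		case <-time.After(i.Delay):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return handler(ctx, req)
}
```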

Hardening pass

  • pprof everywhere; OTel everywhere.
  • goleak per-service.
  • A reproducible Docker Compose stack and a one-command make demo that brings it up with Jaeger and Prometheus.
  • Alarms wired: Prometheus rules on per-service error rate, p99 latency, registry watch lag.

Acceptance criteria

  • All four services deployable with make demo.
  • A flame graph demonstrating where sidecar overhead lives.
  • A trace screenshot showing deadline-propagated failure across the call chain.
  • Defensible answer to: "What happens if the registry leader is down for 30 seconds?"

Skills exercised

  • Months 3 (concurrency), 5 (gRPC mastery, observability), 6 (capstone defense, performance).

Track 3 (Streaming Pipeline): A Kafka-Compatible Ingestion Layer + Stream Processor

Outcome: a Kafka-protocol-compatible (subset) broker plus a stream-processing framework, with at-least-once delivery, exactly-once-effective consumer offsets, and replay.

Functional spec

  • Broker: implements a subset of the Kafka wire protocol (Produce, Fetch, Metadata, ListOffsets, OffsetCommit, OffsetFetch). Disk-backed log per partition; segment + index files.
  • Stream processor: a small framework letting users write func(input Stream[T]) Stream[U] with operators (Map, Filter, Window, Aggregate, Join); see the sketch after this list.
  • Consumer: offset management, rebalance protocol (subset).
  • Producer: idempotent within a single session.
  • Compatibility: works with franz-go (the leading Kafka Go client) for at least Produce/Fetch.
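
A minimal sketch of the Stream[T] operator surface referenced above, modeled as channel plumbing. Because Go methods cannot introduce new type parameters, type-changing operators like Map are free functions here; windowing, joins, and offset handling are omitted.

```go
// Package stream: generic operator API sketch.
package stream

// Stream is a sequence of records of type T, consumed once.
type Stream[T any] <-chan T

// FromSlice is a convenience source for tests and examples.
func FromSlice[T any](items []T) Stream[T] {
	out := make(chan T)
	go func() {
		defer close(out)
		for _, it := range items {
			out <- it
		}
	}()
	return out
}

// Map applies f to every record, producing a stream of a possibly new type.
func Map[T, U any](in Stream[T], f func(T) U) Stream[U] {
	out := make(chan U)
	go func() {
		defer close(out)
		for v := range in {
			out <- f(v)
		}
	}()
	return out
}

// Filter keeps only records for which keep returns true.
func Filter[T any](in Stream[T], keep func(T) bool) Stream[T] {
	out := make(chan T)
	go func() {
		defer close(out)
		for v := range in {
			if keep(v) {
				out <- v
			}
		}
	}()
	return out
}
```

A pipeline then composes as, for example, Filter(Map(FromSlice(records), parse), isValid), with the Kafka consumer standing in for FromSlice in the real framework.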

Non-functional spec

  • 200K msgs/sec sustained on a single partition (commodity NVMe).
  • Sub-50 ms producer ack p99 with acks=all.
  • Replay from arbitrary offset.
  • Crash-recoverable: WAL fsync semantics documented.

Architecture sketch

  • One goroutine per partition for the disk-write path.
  • mmap'd index files; sequential append to log files.
  • Replication: Raft per partition (composes with Track 1) or a simpler primary-backup with a documented data-loss window.
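
A sketch of the per-partition append path from the architecture above: one writer goroutine owns the active segment, appends length-prefixed records sequentially, and maintains a sparse index of (relative offset, file position) pairs. The on-disk layout and indexInterval are choices made for this sketch, and it uses plain file writes rather than an mmap'd index for brevity.

```go
// Package commitlog: segment append path sketch.
package commitlog

import (
	"encoding/binary"
	"os"
)

const indexInterval = 4096 // one index entry roughly every 4 KiB of log

type segment struct {
	log         *os.File
	index       *os.File // (relativeOffset uint32, position uint32) pairs
	baseOffset  int64    // first offset stored in this segment
	nextOffset  int64
	pos         int64 // current write position in the log file
	lastIndexed int64 // log position of the most recent index entry
}

// append writes one length-prefixed record, maybe an index entry, and returns
// the offset assigned to the record. Only the partition's writer goroutine
// calls it, so no locking is needed.
func (s *segment) append(record []byte) (int64, error) {
	start := s.pos
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
	if _, err := s.log.Write(hdr[:]); err != nil {
		return 0, err
	}
	if _, err := s.log.Write(record); err != nil {
		return 0, err
	}
	off := s.nextOffset
	if start == 0 || start-s.lastIndexed >= indexInterval {
		var entry [8]byte
		binary.BigEndian.PutUint32(entry[0:4], uint32(off-s.baseOffset))
		binary.BigEndian.PutUint32(entry[4:8], uint32(start))
		if _, err := s.index.Write(entry[:]); err != nil {
			return 0, err
		}
		s.lastIndexed = start
	}
	s.pos = start + int64(4+len(record))
	s.nextOffset++
	return off, nil
}

// sync is the durability point for acks=all: log data before index metadata.
func (s *segment) sync() error {
	if err := s.log.Sync(); err != nil {
		return err
	}
	return s.index.Sync()
}
```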

Test rigor

  • Unit: log segment boundary handling, offset arithmetic, index lookup.
  • Integration: produce-and-consume tests against franz-go.
  • Fuzz: protocol parser fuzzed against malformed records (see the target sketch after this list).
  • Crash test: kill -9 during write; restart; verify WAL recovery.
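
A sketch of the protocol-parser fuzz target referenced above, using Go's native fuzzing. parseRequest here is a toy stub standing in for the real wire-protocol decoder so the block is self-contained; the property checked is simply "malformed input is rejected, never a panic".

```go
// Package protocol: fuzz target sketch for the wire-protocol parser.
package protocol

import (
	"encoding/binary"
	"errors"
	"testing"
)

var errMalformed = errors.New("malformed request frame")

type request struct {
	apiKey        int16
	correlationID int32
}

// parseRequest is a stub standing in for the project's real decoder.
func parseRequest(data []byte) (*request, error) {
	if len(data) < 6 {
		return nil, errMalformed
	}
	return &request{
		apiKey:        int16(binary.BigEndian.Uint16(data[0:2])),
		correlationID: int32(binary.BigEndian.Uint32(data[2:6])),
	}, nil
}

func FuzzParseRequest(f *testing.F) {
	// Seed with one minimal well-formed frame and one empty frame.
	f.Add([]byte{0, 0, 0, 0, 0, 1})
	f.Add([]byte{})
	f.Fuzz(func(t *testing.T, data []byte) {
		req, err := parseRequest(data)
		if err != nil {
			return // rejection is fine; crashing is not
		}
		if req == nil {
			t.Fatal("nil request returned with nil error")
		}
	})
}
```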

Hardening pass

  • pprof for the hot path; the produce-write loop must stay at zero allocations per record (see the benchmark sketch after this list).
  • PGO with a sustained-throughput profile.
  • runtime/trace artifact showing zero scheduler stalls under load.
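
A sketch of how the zero-allocation requirement above can be enforced in CI rather than eyeballed from benchmark output. appendRecord is a trivial stand-in for the real produce-write hot path; the useful part is the shape of the guard (ReportAllocs plus testing.AllocsPerRun).

```go
// Package commitlog: allocation gate sketch for the produce-write path.
package commitlog

import "testing"

// appendRecord stands in for the real hot path; it copies into a preallocated
// buffer so the stub itself is allocation-free.
var (
	buf  = make([]byte, 1<<20)
	pos  int
	sink int64
)

func appendRecord(record []byte) int64 {
	if pos+len(record) > len(buf) {
		pos = 0
	}
	copy(buf[pos:], record)
	pos += len(record)
	return int64(pos)
}

func BenchmarkAppendRecord(b *testing.B) {
	record := make([]byte, 256)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = appendRecord(record)
	}
}

// TestAppendRecordAllocations turns "0 allocs/op per record" into a CI gate.
func TestAppendRecordAllocations(t *testing.T) {
	record := make([]byte, 256)
	allocs := testing.AllocsPerRun(1000, func() { sink = appendRecord(record) })
	if allocs != 0 {
		t.Fatalf("produce-write path allocates %.0f times per record; want 0", allocs)
	}
}
```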

Acceptance criteria

  • Public repo with a reference-grade README.
  • A throughput/latency benchmark vs. real Kafka on the same hardware.
  • A replay demo that rewinds a consumer offset to a specific timestamp.

Skills exercised

  • Months 2 (memory + GC tuning, allocation discipline), 3 (concurrency at 200K msgs/sec), 5 (observability), 6.22 (storage patterns).

Cross-Track Requirements

Regardless of track:

  • Hardening template integrated. The hardening/ template from Appendix A applies.
  • Architectural Decision Records (ADRs). At least three for the capstone, each ~1 page.
  • Threat model. One page minimum, no matter the track.
  • Defense readiness. You should be able to walk a reviewer through the code in 45 minutes and answer "what fails first under load / fuzzing / a malicious input / a network partition?"

The track choice signals career direction: Track 1 for distributed-systems infrastructure roles, Track 2 for platform/SRE/networking roles, Track 3 for data-infra/streaming roles. Pick based on where you want the next interview loop, not on what looks easiest.
