Capstone Projects¶
Three tracks. Pick one. Each is sized for the four-week Month 6 schedule in 06_MONTH_CAPSTONE.md. Each forces the full curriculum into one artifact.
Common requirements across tracks:
- Built on Java 25 LTS, virtual threads + structured concurrency where appropriate (fan-out sketch after this list).
- gRPC, or HTTP/2 via java.net.http, for cross-component RPC.
- Full observability: Micrometer Prometheus metrics, OpenTelemetry traces, structured JSON logs with trace-ID correlation.
- Continuous JFR recording, GC logs to disk, heap-dump-on-OOM.
- Testcontainers-based integration tests, jqwik property tests for core invariants, jcstress for custom concurrency.
- One JMH benchmark suite covering at least one hot path.
- Public GitHub repo, README, design doc, runbook, demo script.
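As one shape for the fan-out pattern these requirements imply, here is a minimal sketch using the stable virtual-thread executor (the StructuredTaskScope preview API is the eventual target, but its shape has shifted across JDK previews, so the stable API is shown). Follower and append are hypothetical stand-ins for any cross-component RPC:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical RPC target; stands in for any backend the capstone fans out to.
interface Follower {
    long append(byte[] entry) throws Exception;
}

final class FanOut {
    // One virtual thread per RPC; the try-with-resources close() waits for all
    // tasks, so no request outlives this method (an approximation of structured
    // concurrency; swap in StructuredTaskScope once you build on the preview API).
    static List<Long> replicateToAll(List<Follower> followers, byte[] entry)
            throws InterruptedException, ExecutionException {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Long>> acks = followers.stream()
                    .map(f -> executor.submit(() -> f.append(entry)))
                    .toList();
            List<Long> results = new ArrayList<>();
            for (Future<Long> ack : acks) results.add(ack.get());
            return results;
        }
    }
}
```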
Track 1 - Distributed Storage: Raft-Backed Key-Value Store¶
Goal¶
A 3-to-5-node replicated KV store. Linearizable single-key reads/writes. Snapshot/restore. Membership changes (add/remove node) optional but encouraged.
Required reading¶
- Diego Ongaro, In Search of an Understandable Consensus Algorithm (the Raft paper).
- Ongaro's PhD thesis chapters on log compaction and membership.
- One mature Raft implementation for reference: Atomix, SOFAJRaft (SOFAStack), or HashiCorp's raft (Go). Read it, don't copy it.
Core scope (must-have)¶
- Leader election with randomized timeouts (timer sketch after this list).
- Log replication with majority commit.
- Linearizable reads via read-index or leader leases.
- Persistent log on disk (append-only segments + index).
- gRPC for inter-node communication and client API.
- Snapshot + restore.
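As a starting point for the election-timeout item, a minimal follower-side timer: one virtual thread per node runs this loop, and the RPC handler resets it on every valid AppendEntries. All names here (ElectionTimer, becomeCandidate) are illustrative, not from any library:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Minimal sketch: runs on a virtual thread while the node is a follower.
final class ElectionTimer implements Runnable {
    private final BlockingQueue<Long> heartbeats = new ArrayBlockingQueue<>(64);
    private final Runnable becomeCandidate;

    ElectionTimer(Runnable becomeCandidate) { this.becomeCandidate = becomeCandidate; }

    /** Called by the RPC handler whenever a valid AppendEntries arrives. */
    void onHeartbeat(long term) { heartbeats.offer(term); }

    @Override public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Randomized timeout in [150, 300) ms staggers candidates so
                // split votes are rare (Raft paper, section 5.2).
                long timeoutMs = ThreadLocalRandom.current().nextLong(150, 300);
                if (heartbeats.poll(timeoutMs, TimeUnit.MILLISECONDS) == null) {
                    becomeCandidate.run(); // no heartbeat in time: start an election
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```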
Stretch scope¶
- Joint-consensus membership changes (the two-phase cluster-configuration change from the Raft paper and Ongaro's thesis).
- Leader leases for fast reads without read-index.
- Multi-Raft (sharding by key range).
- Linearizability checker (Jepsen-style; e.g. Knossos invoked from CI).
Failure scenarios to test¶
- Kill leader mid-write. New leader within timeout. No committed entries lost (test sketch after this list).
- Network partition isolating the leader. Minority side cannot make progress; majority elects new leader.
- Slow disk on a follower. Throughput degrades to the pace of the slowest follower in the acknowledging majority; the cluster does not stall.
- Restart an entire node. Recovers from disk log + snapshot.
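A sketch of how the kill-leader scenario can be asserted from a test. TestCluster and KvClient are hypothetical harness types you would build around Testcontainers (e.g. cluster.stopLeader() calling container.stop()); the assertion stack is JUnit 5 plus Awaitility:

```java
import static org.awaitility.Awaitility.await;
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.Duration;
import org.junit.jupiter.api.Test;

// Sketch only: TestCluster and KvClient are hypothetical harness types.
class LeaderFailoverTest {
    final TestCluster cluster = TestCluster.threeNodes();

    @Test
    void killingLeaderLosesNoCommittedEntries() {
        KvClient client = cluster.clientForAnyNode();
        client.put("k", "v1");                  // acked => committed on a majority
        cluster.stopLeader();                   // hard-kill the leader's container
        await().atMost(Duration.ofSeconds(5))
               .until(cluster::hasLeader);      // new leader within election timeout
        assertEquals("v1", client.get("k"));    // the committed entry survived
    }
}
```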
Why this track¶
Forces you into: virtual threads under load, structured concurrency for fan-out RPCs, careful JMM reasoning for the consensus state machine, gRPC, persistence, JFR-driven tuning, failure injection.
Track 2 - Service Mesh: gRPC Microservice Mesh with Custom Control Plane¶
Goal¶
A working microservices mesh with multiple backend services, a service registry, a load-balancing client-side proxy, deadline propagation, circuit-breaker-driven outlier ejection, and end-to-end OpenTelemetry tracing.
Required reading¶
- The gRPC documentation on name resolution, load balancing, and retries.
- The Envoy/Istio docs on the data-plane / control-plane split (for terminology; you don't need to use Envoy).
- Resilience4j docs.
- Sam Newman, Building Microservices, 2nd ed., chapters on resilience and observability.
Core scope (must-have)¶
- A registry service (in-memory, persisted, or backed by etcd via jetcd) where backends register/deregister with TTL.
- At least three backend services with non-trivial dependencies among them (e.g., inventory → pricing → tax).
- A client-side proxy library (gRPC interceptors) that:
  - Resolves a service name to instances from the registry.
  - Load-balances across them (round-robin at minimum; weighted or shortest-queue as stretch).
  - Propagates deadlines and trace context (interceptor sketch after this list).
  - Circuit-breaks per instance based on error rate and latency.
  - Ejects outliers from the LB pool.
- Full OpenTelemetry trace per top-level request, end-to-end through all services.
- Prometheus metrics: per-service request count, latency histogram, error rate, circuit-breaker state.
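A sketch of the deadline-propagation piece as a gRPC ClientInterceptor. The two-second per-hop cap is an arbitrary policy choice, not a gRPC default:

```java
import java.util.concurrent.TimeUnit;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.Context;
import io.grpc.Deadline;
import io.grpc.MethodDescriptor;

// Propagates the caller's deadline to every outbound call, capped by a per-hop
// default so no request in the mesh is ever unbounded.
final class DeadlinePropagationInterceptor implements ClientInterceptor {
    private static final long PER_HOP_CAP_SECONDS = 2; // arbitrary policy choice

    @Override
    public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
            MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        Deadline cap = Deadline.after(PER_HOP_CAP_SECONDS, TimeUnit.SECONDS);
        // The inbound deadline rides on the gRPC Context; honor whichever is sooner.
        Deadline inherited = Context.current().getDeadline();
        Deadline effective = (inherited == null) ? cap : inherited.minimum(cap);
        return next.newCall(method, callOptions.withDeadline(effective));
    }
}
```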
Stretch scope¶
- mTLS between services (use the java.security / javax.net.ssl APIs and a local CA).
- A "canary" feature: route N% of traffic to a different version of a backend.
- Outlier detection based on per-instance success-rate (Envoy's algorithm is a fine reference).
- Throttling / rate limiting at the proxy.
Failure scenarios to test¶
- Kill one backend instance mid-request. Client retries to another instance; the failed instance is ejected from the LB pool.
- Slow backend (inject 5s sleeps). Deadline propagation cancels the request through the chain.
- Cascading failure: backend B fails; circuit breaker opens; backend A degrades gracefully (returns cached / partial data, does not stall; fallback sketch after this list).
- Registry restart. Services re-register; client refreshes its view.
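For the cascading-failure scenario, one Resilience4j shape: decorate the remote call with a circuit breaker, and answer from a cache when the breaker is open. The thresholds below are illustrative, and PricingGateway is a hypothetical name:

```java
import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

// Guards calls from service A to service B; when the breaker opens, A answers
// from a (hypothetical) cache instead of stalling on a dead dependency.
final class PricingGateway {
    private final CircuitBreaker breaker = CircuitBreaker.of("pricing",
            CircuitBreakerConfig.custom()
                    .failureRateThreshold(50)                          // % failures to open
                    .slowCallDurationThreshold(Duration.ofMillis(500)) // what counts as slow
                    .slowCallRateThreshold(50)                         // % slow calls to open
                    .waitDurationInOpenState(Duration.ofSeconds(10))
                    .build());

    long guardedCall(Supplier<Long> remoteCall, Supplier<Long> cachedFallback) {
        Supplier<Long> guarded = CircuitBreaker.decorateSupplier(breaker, remoteCall);
        try {
            return guarded.get();
        } catch (CallNotPermittedException e) {
            return cachedFallback.get(); // breaker open: degrade, don't stall
        }
    }
}
```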
Why this track¶
Forces you into: gRPC depth, virtual-thread-friendly RPC patterns, Resilience4j composition, deadline propagation (subtle), OpenTelemetry instrumentation depth, multi-service operations.
Track 3 - Streaming Pipeline: Kafka-Style Ingest with Replay¶
Goal¶
A single-broker (stretch: multi-broker) message-streaming system with a producer API, a consumer API, durable segmented log storage, consumer groups, and at-least-once delivery with replay from offset.
Required reading¶
- Kafka documentation: the log abstraction, segments, indexes, consumer groups, offset commits.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction.
- LMAX Disruptor paper (for the in-broker hot path inspiration).
Core scope (must-have)¶
- A broker process with TCP wire protocol (custom, or simplified Kafka-compatible).
- Topics with partitions; each partition is a segmented append-only log on disk (segment sketch after this list).
- Producer API: send batched, acknowledged messages.
- Consumer API with offset commit (manual + auto).
- Consumer groups with partition rebalancing on join/leave.
- At-least-once delivery: bounded duplication on consumer restart; no loss.
- Replay from arbitrary offset.
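A minimal sketch of a segment's write path using FileChannel. The frame layout ([length][offset][payload]) and the recovery strategy are design choices for this capstone, not Kafka's actual format:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.*;

// One on-disk segment: a frame per record, appended in order.
final class LogSegment implements AutoCloseable {
    private final FileChannel channel;
    private long nextOffset;

    LogSegment(Path file, long baseOffset) throws IOException {
        this.channel = FileChannel.open(file, CREATE, WRITE, READ);
        this.channel.position(channel.size()); // resume appending after restart
        this.nextOffset = baseOffset;          // real code recovers this by scanning the tail
    }

    /** Appends one record and returns its offset. */
    synchronized long append(byte[] payload) throws IOException {
        ByteBuffer frame = ByteBuffer.allocate(4 + 8 + payload.length);
        frame.putInt(payload.length).putLong(nextOffset).put(payload).flip();
        while (frame.hasRemaining()) channel.write(frame);
        return nextOffset++;
    }

    /** Forces data to disk; call before acknowledging the producer batch. */
    synchronized void flush() throws IOException { channel.force(false); }

    @Override public void close() throws IOException { channel.close(); }
}
```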
Stretch scope¶
- Multi-broker with leader-per-partition (effectively, Raft per partition - combines with Track 1 ideas).
- Stream processing API (map / filter / window / aggregate) on top of the broker.
- Compaction (latest-value-per-key) topics.
- Schema registry integration (Avro / protobuf).
Failure scenarios to test¶
- Kill a consumer mid-batch. Restart; consumer resumes from last committed offset; no loss; bounded duplication.
- Broker crash. Replay log; producer's unacknowledged sends retried; consumer's view of committed offsets preserved.
- Slow consumer. Lag grows; the producer is not blocked (broker-side buffering is bounded; the broker acknowledges at append time, not at consumer receipt).
- Disk full. Backpressure to producers; no corruption.
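One bounded way to implement the disk-full behavior: check free space before each append and reject produce requests instead of corrupting the log tail. The headroom threshold and the exception type here are design choices, not a fixed API:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Broker-side backpressure when the log volume nears capacity.
final class DiskGuard {
    private static final long MIN_FREE_BYTES = 256L * 1024 * 1024; // keep 256 MiB headroom
    private final FileStore store;

    DiskGuard(Path logDir) throws IOException { this.store = Files.getFileStore(logDir); }

    /** Called before each append; reject (backpressure) rather than corrupt the tail. */
    void ensureCapacity() throws IOException {
        if (store.getUsableSpace() < MIN_FREE_BYTES) {
            throw new IOException("log volume nearly full; rejecting produce request");
        }
    }
}
```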
Why this track¶
Forces you into: performance-sensitive file I/O (segment management, mmap or FileChannel), virtual threads for many concurrent connections, careful JMM reasoning for the hot path, JMH-driven optimization, backpressure design, JFR-driven tuning under sustained load.
Track-Independent Defense Checklist¶
By the end of Month 6, regardless of track, you have:
- A public repo with CI green.
- A README that lets a stranger build and run the system in five commands.
- A design doc (3–8 pages) explaining choices and rejected alternatives.
- A runbook (RUNBOOK.md) with the top 5 alerts and mitigation steps.
- A postmortem-style writeup (5–15 pages): what you built, what surprised you, what you would do differently.
- A demo script (10–20 min) walking through happy path + at least one chaos scenario, with live dashboards.
- JMH results for at least one critical operation, with notes on what limits throughput (benchmark skeleton after this checklist).
- A JFR recording from a steady-state load run, committed in docs/perf/, with annotations on what to look at.
- A hardening checklist from APPENDIX_A, ticked through.
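For the JMH item above, a skeleton that covers the required knobs. HotPath is a hypothetical stand-in for whichever operation your profiles show is critical (append, route, replicate, ...):

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical stand-in so the skeleton compiles; replace with your real component.
interface HotPath {
    static HotPath create() { return payload -> payload.length; }
    long process(byte[] payload);
}

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 2, jvmArgsAppend = "-Xmx1g") // pin JVM flags so runs are comparable
public class HotPathBenchmark {
    private HotPath hotPath;
    private byte[] payload;

    @Setup
    public void setUp() {
        hotPath = HotPath.create();
        payload = new byte[256];
    }

    @Benchmark
    public long oneOperation() {
        return hotPath.process(payload); // return the result so the JIT can't dead-code it
    }
}
```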
If you finish early¶
Pick one and do it:
- Submit a tiny upstream patch to OpenJDK (see APPENDIX_C) using something you learned during the capstone.
- Write a blog post on one specific surprise from your work (e.g., "what I learned about virtual-thread pinning building a Raft implementation").
- Port the project to GraalVM native-image and document the trade-offs.
- Migrate one component from blocking to reactive (or vice versa) and JMH the difference.
The capstone is the end of the curriculum, not the end of the work. Treat it as the launch pad.