Week 23 - Failure, Observability, and Operations
Conceptual Core
Make it survive failure. Make it diagnosable. This is the week that separates a project from a system.
Deliverables
- Failure injection: kill a node mid-operation. Network partitions (use `iptables` rules in Testcontainers, or Toxiproxy). Slow disk. Verify that the documented invariants hold or, where they cannot, document the user-visible behavior.
- Observability dashboards: Grafana dashboards (committed as JSON) for the three pillars (metrics, logs, traces). One "is it healthy" dashboard. One "what's happening" dashboard.
- Runbook: a `RUNBOOK.md` with: top 5 alerts, what each means, what to check, how to mitigate. Treat it as the on-call handoff document.
- Capacity test: one published JMH or k6 run showing throughput and latency under steady load and under fault.
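The failure-injection deliverable boils down to a loop: inject a fault, run the operation, then check the invariant. The sketch below shows that loop in plain Java with no external dependencies. Everything in it is hypothetical and chosen for illustration: the `Ledger` class, the `Fault` enum, and the "total balance is conserved" invariant stand in for your own system and its documented invariants, and the in-process exception stands in for a real kill delivered through Testcontainers or Toxiproxy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FaultInjectionSketch {

    /** Points at which this hypothetical harness can "kill" the operation. */
    enum Fault { NONE, CRASH_AFTER_DEBIT }

    /** Toy two-account ledger. Invariant under test: the total balance never changes. */
    static final class Ledger {
        private long from;
        private long to;

        Ledger(long from, long to) {
            this.from = from;
            this.to = to;
        }

        long total() {
            return from + to;
        }

        /**
         * Computes both new balances first and publishes them together,
         * so a crash between the two steps leaves the stored state untouched.
         */
        void transfer(long amount, Fault fault) {
            long newFrom = from - amount;
            if (fault == Fault.CRASH_AFTER_DEBIT) {
                throw new IllegalStateException("injected crash after debit");
            }
            long newTo = to + amount;
            from = newFrom;
            to = newTo;
        }
    }

    /** Runs the same transfer under every fault and records whether the invariant held. */
    static Map<Fault, Boolean> run() {
        Map<Fault, Boolean> results = new LinkedHashMap<>();
        for (Fault fault : Fault.values()) {
            Ledger ledger = new Ledger(100, 0);
            long before = ledger.total();
            try {
                ledger.transfer(30, fault);
            } catch (IllegalStateException crash) {
                // The caller sees an error: this is the user-visible behavior to document.
            }
            results.put(fault, ledger.total() == before);
        }
        return results;
    }

    public static void main(String[] args) {
        run().forEach((fault, held) ->
            System.out.println(fault + " -> invariant held: " + held));
    }
}
```

In a real harness the fault list would grow to cover partitions and slow disks, and any fault for which the check returns `false` is one whose user-visible behavior needs to be written down.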
Track-specific notes
- Distributed storage: partition the leader. Verify a new leader is elected within the documented timeout and reads return the latest committed write.
- Service mesh: kill a backend mid-RPC. Verify that retries, the circuit breaker, and outlier ejection all engage.
- Streaming pipeline: kill a consumer mid-batch. Verify replay from last committed offset; no data loss; bounded duplication.
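For the streaming track, "no data loss; bounded duplication" can be made concrete with a small simulation. The sketch below is a toy model, not a real consumer client: it assumes records are integer offsets, that the consumer commits only after a full batch succeeds, and that the kill happens once, right after the record at a hypothetical `crashOffset` has been delivered. Under those assumptions a restart replays from the last committed offset, so duplicates can never exceed one batch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

class ReplaySketch {

    /** What the downstream sink saw during one simulated run. */
    static final class Outcome {
        final List<Integer> delivered; // every record handed to the sink, in order
        final int duplicates;          // deliveries beyond the first for any record
        final boolean noLoss;          // true if every record was delivered at least once

        Outcome(List<Integer> delivered, int total) {
            this.delivered = delivered;
            int distinct = new HashSet<>(delivered).size();
            this.duplicates = delivered.size() - distinct;
            this.noLoss = distinct == total;
        }
    }

    /**
     * Consumes offsets 0..total-1 in batches and commits after each full batch.
     * The consumer is killed once, right after delivering crashOffset
     * (pass -1 for a run with no crash).
     */
    static Outcome run(int total, int batchSize, int crashOffset) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<Integer> delivered = new ArrayList<>();
        int committed = 0;       // last committed offset: next record to read after a restart
        boolean crashed = false; // the kill is injected at most once

        while (committed < total) {
            int batchEnd = Math.min(committed + batchSize, total);
            boolean killed = false;
            for (int offset = committed; offset < batchEnd; offset++) {
                delivered.add(offset); // side effect lands before the commit
                if (!crashed && offset == crashOffset) {
                    crashed = true;
                    killed = true;
                    break; // consumer dies mid-batch, nothing is committed
                }
            }
            if (!killed) {
                committed = batchEnd; // commit only after the whole batch succeeded
            }
            // If killed, the loop restarts from the unchanged committed offset (replay).
        }
        return new Outcome(delivered, total);
    }

    public static void main(String[] args) {
        Outcome o = run(10, 4, 5);
        System.out.println("delivered=" + o.delivered);
        System.out.println("duplicates=" + o.duplicates + " noLoss=" + o.noLoss);
    }
}
```

With 10 records, a batch size of 4, and a kill after offset 5, the sink sees offsets 4 and 5 twice and nothing is missing. The worst case is a kill on the last record of a batch, which replays the whole batch, so the duplication bound to assert in a real test is the batch size.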
Hardening slice
Promote your `hardening/` template to a public template repo, with a README and a `make new-service` scaffolding script.