Week 23 - Failure, Observability, and Operations

Conceptual Core

Make it survive failure. Make it diagnosable. This is the week that separates a project from a system.

Failure injection: kill a node mid-operation, induce network partitions (iptables rules inside Testcontainers containers, or Toxiproxy), and simulate a slow disk. For each fault, verify that the documented invariants hold or, where they cannot, document the user-visible behavior.
Observability dashboards: Grafana dashboards (committed as JSON) covering the three pillars (metrics, logs, traces). One "is it healthy" dashboard. One "what's happening" dashboard.
Runbook: a RUNBOOK.md covering the top 5 alerts: what each means, what to check, and how to mitigate. Treat it as the on-call handoff document.
Capacity test: one published JMH or k6 run showing throughput and latency under steady load and under fault.
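
For the capacity test, the published numbers usually reduce to throughput plus a few latency percentiles, reported once for steady load and once under fault. The sketch below shows that arithmetic over raw latency samples using the nearest-rank percentile method. LoadSummary is a hypothetical helper, not part of JMH or k6, and the sample values in main are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper (not a JMH or k6 API): summarizes one load-test run.
final class LoadSummary {
    final double throughputPerSec;
    final long p50Micros;
    final long p99Micros;

    LoadSummary(long[] latenciesMicros, double durationSeconds) {
        if (latenciesMicros.length == 0 || durationSeconds <= 0) {
            throw new IllegalArgumentException("need at least one sample and a positive duration");
        }
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        this.throughputPerSec = sorted.length / durationSeconds;
        this.p50Micros = percentile(sorted, 50);
        this.p99Micros = percentile(sorted, 99);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    @Override
    public String toString() {
        return String.format("%.1f ops/s, p50=%dus, p99=%dus",
                throughputPerSec, p50Micros, p99Micros);
    }

    public static void main(String[] args) {
        // Made-up latency samples in microseconds, one second of traffic each.
        long[] steady = {900, 950, 1000, 1000, 1050, 1100, 1150, 1200, 1250, 1300};
        long[] fault = {950, 1000, 1100, 1200, 1500, 2500, 4000, 9000, 15000, 30000};
        System.out.println("steady: " + new LoadSummary(steady, 1.0));
        System.out.println("fault:  " + new LoadSummary(fault, 1.0));
    }
}
```

Reporting the steady and fault runs side by side in the same format makes tail-latency degradation visible, which a mean alone tends to hide.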

Distributed storage: partition the leader. Verify a new leader is elected within the documented timeout and reads return the latest committed write.
Service mesh: kill a backend mid-RPC. Verify that retries, the circuit breaker, and outlier ejection all engage.
Streaming pipeline: kill a consumer mid-batch. Verify replay from last committed offset; no data loss; bounded duplication.
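
As one illustration of turning such a check into an assertion, the streaming-pipeline case can be modeled in memory before it is run against a real broker. The sketch below assumes a consumer that commits its offset only after a whole batch succeeds and is killed once, just before processing a chosen offset. ReplayCheck and its methods are hypothetical names, and the log is simply the integers 0 to logSize - 1.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a log and a single consumer; all names are hypothetical.
final class ReplayCheck {

    // Consumes offsets 0..logSize-1 in batches and returns offset -> times processed.
    // If crashAtOffset >= 0, the consumer is "killed" once, just before processing
    // that offset; the restarted consumer resumes from the last committed offset.
    static Map<Integer, Integer> run(int logSize, int batchSize, int crashAtOffset) {
        Map<Integer, Integer> deliveries = new HashMap<>();
        int committed = 0;
        boolean alreadyCrashed = false;
        while (committed < logSize) {
            int end = Math.min(committed + batchSize, logSize);
            boolean died = false;
            for (int offset = committed; offset < end; offset++) {
                if (!alreadyCrashed && offset == crashAtOffset) {
                    alreadyCrashed = true;
                    died = true;
                    break; // simulated kill: the in-flight batch is never committed
                }
                deliveries.merge(offset, 1, Integer::sum);
            }
            if (!died) {
                committed = end; // commit only after the whole batch succeeded
            }
        }
        return deliveries;
    }

    // No data loss: every offset was processed at least once.
    static boolean noDataLoss(Map<Integer, Integer> deliveries, int logSize) {
        for (int offset = 0; offset < logSize; offset++) {
            if (deliveries.getOrDefault(offset, 0) < 1) {
                return false;
            }
        }
        return true;
    }

    // Total number of redundant deliveries across all offsets.
    static int duplicates(Map<Integer, Integer> deliveries) {
        int extra = 0;
        for (int count : deliveries.values()) {
            extra += count - 1;
        }
        return extra;
    }

    public static void main(String[] args) {
        int logSize = 10;
        int batchSize = 4;
        Map<Integer, Integer> deliveries = run(logSize, batchSize, 6);
        System.out.println("no data loss: " + noDataLoss(deliveries, logSize));
        // With a single crash, duplicates stay below the batch size.
        System.out.println("bounded duplication: " + (duplicates(deliveries) < batchSize));
    }
}
```

In this model the duplication bound follows from the commit policy: only records from the uncommitted batch can be replayed, so a single crash yields fewer duplicates than the batch size. A real test would assert the same two properties against the actual consumer and whatever bound the project documents.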

Promote your hardening/ template to a public template repo. Ship it with a README and a scaffolding script invoked as make new-service.