
Week 23 - Failure, Observability, and Operations

Conceptual Core

Make it survive failure. Make it diagnosable. This is the week that separates a project from a system.

Deliverables

  • Failure injection: kill a node mid-operation, induce network partitions (iptables rules inside Testcontainers containers, or Toxiproxy), and simulate a slow disk. Verify that the documented invariants hold or, where they cannot, document the user-visible behavior.
  • Observability dashboards: Grafana dashboards (committed as JSON) covering the three pillars of observability (metrics, logs, traces). One "is it healthy" dashboard. One "what's happening" dashboard.
  • Runbook: a RUNBOOK.md covering the top 5 alerts: what each means, what to check, and how to mitigate. Treat it as the on-call handoff document.
  • Capacity test: one published JMH or k6 run showing throughput and latency under steady load and under fault.
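
The failure-injection deliverable can be rehearsed on a toy model before wiring up real containers. The sketch below is only an illustration under stated assumptions: `Node`, `NodeKilled`, and `run_failure_injection` are hypothetical names, the "disk" is a Python list, and the kill happens at one fixed point (after the durable log append, before the acknowledgement). It checks one invariant (no acknowledged write is lost) and records one user-visible behavior that has to be documented rather than prevented (a write the client saw fail can still become visible after recovery).

```python
class NodeKilled(Exception):
    """Raised when the simulated node is killed mid-operation."""


class Node:
    """Toy key-value node (hypothetical): append to a durable log, then apply in memory."""

    def __init__(self):
        self.log = []               # survives a kill (stands in for disk)
        self.state = {}             # lost on a kill (stands in for memory)
        self.kill_next_put = False  # fault-injection switch

    def put(self, key, value):
        self.log.append((key, value))  # durable step happens first
        if self.kill_next_put:
            self.kill_next_put = False
            self.state = {}            # process dies: in-memory state is gone
            raise NodeKilled("killed after log append, before ack")
        self.state[key] = value        # returning normally is the ack

    def recover(self):
        # Rebuild in-memory state by replaying the durable log in order.
        self.state = dict(self.log)


def run_failure_injection(num_writes=100, kill_every=7):
    """Write unique keys, killing the node on every `kill_every`-th write."""
    node = Node()
    acked = {}
    kills = 0
    for i in range(num_writes):
        key = f"k{i}"
        node.kill_next_put = (i % kill_every == kill_every - 1)
        try:
            node.put(key, i)
            acked[key] = i             # the client only trusts acknowledged writes
        except NodeKilled:
            kills += 1
            node.recover()
    # Invariant: every acknowledged write is still readable with its value.
    lost_acked = [k for k, v in acked.items() if node.state.get(k) != v]
    # Behavior to document: unacknowledged writes that became visible anyway.
    visible_unacked = [k for k in node.state if k not in acked]
    return {"kills": kills, "lost_acked": lost_acked, "visible_unacked": visible_unacked}


if __name__ == "__main__":
    result = run_failure_injection()
    assert result["lost_acked"] == [], "invariant violated: acknowledged write lost"
    print(result["kills"], "kills,", len(result["visible_unacked"]), "unacked writes visible")
```

In a real run the same two checks would be made against the actual system after a container kill or a Toxiproxy fault; the toy model only fixes the shape of the assertion.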

Track-specific notes

  • Distributed storage: partition the leader. Verify a new leader is elected within the documented timeout and reads return the latest committed write.
  • Service mesh: kill a backend mid-RPC. Verify that retries, the circuit breaker, and outlier ejection each trigger as configured.
  • Streaming pipeline: kill a consumer mid-batch. Verify replay from last committed offset; no data loss; bounded duplication.
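
For the streaming track, the three properties to verify (replay from the last committed offset, no data loss, bounded duplication) can be written down as a small simulation. This is a minimal sketch, assuming at-least-once delivery with the offset committed only after a whole batch is processed. `OffsetStore`, `run_consumer`, and `crash_and_replay` are hypothetical names, not part of any real client library, and records are assumed unique so that duplicates can be counted.

```python
class ConsumerKilled(Exception):
    """Raised when the simulated consumer is killed mid-batch."""


class OffsetStore:
    """Stands in for the broker-side committed offset; survives consumer restarts."""

    def __init__(self):
        self.committed = 0


def run_consumer(records, offsets, batch_size, sink, kill_after=None):
    """Process records from the committed offset, committing after each full batch.

    `kill_after` is the number of records this run processes before being
    killed (None means the run is never killed).
    """
    processed = 0
    while offsets.committed < len(records):
        start = offsets.committed
        batch = records[start:start + batch_size]
        for record in batch:
            if kill_after is not None and processed == kill_after:
                raise ConsumerKilled("killed mid-batch, offset not committed")
            sink.append(record)        # side effect happens before the commit
            processed += 1
        offsets.committed = start + len(batch)  # commit only after the whole batch


def crash_and_replay(records, batch_size, kill_after):
    """Kill the first consumer run, restart it, and report loss and duplication."""
    offsets, sink = OffsetStore(), []
    try:
        run_consumer(records, offsets, batch_size, sink, kill_after=kill_after)
    except ConsumerKilled:
        pass                           # the restart below replays from offsets.committed
    run_consumer(records, offsets, batch_size, sink)
    missing = set(records) - set(sink)       # must be empty: no data loss
    duplicates = len(sink) - len(set(sink))  # must stay below batch_size
    return missing, duplicates


if __name__ == "__main__":
    missing, duplicates = crash_and_replay(list(range(25)), batch_size=10, kill_after=13)
    assert not missing
    assert duplicates < 10
    print("missing:", len(missing), "duplicates:", duplicates)
```

Because the commit trails the side effect, a single kill can replay at most the records already processed in the uncommitted batch, which is where the duplication bound comes from.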

Hardening slice

Promote your hardening/ template to a public template repo. Include a README and a make new-service scaffolding script.
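
One possible shape for the scaffolding step, as a sketch only: a make new-service target could call a small script that copies the template directory and substitutes the service name. The `__SERVICE_NAME__` placeholder and the `new_service` function are assumptions made for illustration; the actual layout of the hardening/ template is not specified here.

```python
import shutil
from pathlib import Path

PLACEHOLDER = "__SERVICE_NAME__"  # hypothetical token used inside template files


def new_service(template_dir, target_root, name):
    """Copy the template into target_root/name and fill in the service name."""
    target = Path(target_root) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copytree(Path(template_dir), target)
    for path in target.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue                   # leave binary files untouched
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, name), encoding="utf-8")
    return target


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python new_service.py hardening services my-service
    print(new_service(sys.argv[1], sys.argv[2], sys.argv[3]))
```

Refusing to overwrite an existing directory keeps the script safe to rerun by mistake.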
