Week 18 - Observability: Logs, Metrics, Traces
Conceptual Core
Three pillars, one mental model:

- Logs are events. High cardinality, structured, expensive at scale, sampled in production.
- Metrics are aggregates. Low cardinality, cheap, always-on, dashboards-and-alerts.
- Traces are causal chains. Per-request, sampled, the only pillar that answers "what did this specific call do?"
In 2026 the cross-language standard is OpenTelemetry for traces (and increasingly for metrics and logs). The JVM-idiomatic metrics library is Micrometer. Logs stay on SLF4J + Logback/Log4j2 - OTel's logs signal is shipping, but adoption is still incremental.
Mechanical Detail
- Logs: SLF4J as the facade, Logback or Log4j2 as the backend. JSON layout via `logstash-logback-encoder` or Log4j2's `JsonLayout`. MDC (Mapped Diagnostic Context) for request-scoped fields: set in a request filter, read implicitly by every log line (see the filter sketch after this list). Sample high-volume events (every 100th `GET /healthz`); never sample errors.
- Metrics (Micrometer): `MeterRegistry` is the entry point. Core meter types: `Counter` (monotonic), `Gauge` (point-in-time), `Timer` (latency histograms), `DistributionSummary` (non-time histograms), `LongTaskTimer` (in-flight long ops); the meter-types sketch below exercises each. Pluggable backends: Prometheus, Datadog, CloudWatch, Dynatrace, etc. Bind JVM metrics: `JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ProcessorMetrics`, `JvmHeapPressureMetrics`, `ClassLoaderMetrics`.
- Traces (OpenTelemetry Java agent): `-javaagent:opentelemetry-javaagent.jar` auto-instruments JDBC, HTTP clients/servers, Kafka, gRPC, Redis, and 100+ other libraries without code changes. Manual spans via `GlobalOpenTelemetry.getTracer(...).spanBuilder("name").startSpan()` (see the span sketch below). Attribute names follow OTel semantic conventions (`http.request.method`, `db.system.name`, etc.) - adhere to them so dashboards built for one service work for another.
- Correlation: the OTel agent injects `trace_id` and `span_id` into MDC automatically. Configure your log layout to include them (`%X{trace_id}` in Logback). Now traces and logs are joinable in the backend (Grafana, Datadog, Honeycomb) by trace ID - the single highest-leverage observability investment you can make.
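A minimal sketch of the request-filter pattern for MDC, assuming a Jakarta Servlet stack; the field names (`request_id`, `http_method`) are illustrative, not a convention. Note you do not set `trace_id` here - the OTel agent injects it for you:

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

/** Puts request-scoped fields into the MDC so every log line in this request carries them. */
public class MdcFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        MDC.put("request_id", UUID.randomUUID().toString());
        MDC.put("http_method", http.getMethod());
        try {
            chain.doFilter(req, res); // every log statement downstream sees the MDC fields
        } finally {
            MDC.clear(); // MDC is thread-local: always clean up, or fields leak across requests
        }
    }
}
```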
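A quick tour of the five core meter types, using `SimpleMeterRegistry` for brevity (swap in a real backend registry in production); the meter names are made up for illustration:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.LongTaskTimer;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MeterTypesTour {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Counter: monotonic, only goes up.
        Counter orders = Counter.builder("orders.created").register(registry);
        orders.increment();

        // Gauge: samples a point-in-time value; keep a strong reference to the state object.
        AtomicInteger inFlight = new AtomicInteger();
        Gauge.builder("requests.in_flight", inFlight, AtomicInteger::get).register(registry);

        // Timer: latency distribution, with optional percentile histogram buckets.
        Timer latency = Timer.builder("http.server.duration")
                .publishPercentileHistogram()
                .register(registry);
        latency.record(42, TimeUnit.MILLISECONDS);

        // DistributionSummary: histogram over non-time quantities (bytes, batch sizes).
        DistributionSummary payload = DistributionSummary.builder("request.payload.bytes")
                .register(registry);
        payload.record(1024);

        // LongTaskTimer: tracks duration of operations *while* they are still running.
        LongTaskTimer.Sample migration =
                LongTaskTimer.builder("batch.migration.active").register(registry).start();
        migration.stop();
    }
}
```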
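A sketch of a manual span, assuming the OTel agent (or SDK) is on the classpath; the tracer scope, span name, and `app.order.id` attribute are placeholders, not semantic conventions:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ManualSpanExample {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("com.example.checkout"); // instrumentation scope name

    void processOrder(String orderId) {
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) { // child spans and MDC-correlated logs attach here
            span.setAttribute("app.order.id", orderId); // prefix custom attrs; don't shadow semconv names
            // ... business logic ...
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end(); // always end the span, or it never gets exported
        }
    }
}
```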
The Trap
High-cardinality tags on metrics. Adding user_id or request_id as a Micrometer tag explodes Prometheus storage (each unique value is a separate time-series). Tags should be bounded sets: HTTP method, status class, endpoint pattern (not URL). Per-user data belongs in traces or logs, not metrics.
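To make the trap concrete, a sketch contrasting bounded tags with the `user_id` anti-pattern; the meter and tag names are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class TagCardinality {
    void record(MeterRegistry registry, String method, int status, String routePattern, String userId) {
        // Good: every tag value comes from a small, bounded set.
        Counter.builder("http.server.requests.total")
                .tag("method", method)                // GET/POST/... - a handful of values
                .tag("status", (status / 100) + "xx") // 2xx/3xx/4xx/5xx - the status class
                .tag("route", routePattern)           // "/orders/{id}", never "/orders/12345"
                .register(registry)
                .increment();

        // Bad: one time-series per user - this melts Prometheus storage.
        // Counter.builder("http.server.requests.total")
        //         .tag("user_id", userId)
        //         .register(registry)
        //         .increment();
    }
}
```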
Lab
Wire all three pillars into your Week 17 service:
- OTel agent: download opentelemetry-javaagent.jar, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317.
- Micrometer Prometheus: add micrometer-registry-prometheus, expose /actuator/prometheus.
- Logback JSON: replace the default text layout with LogstashEncoder + %X{trace_id} in the pattern.
Bring up a local stack via docker-compose: Prometheus + Tempo (traces) + Loki (logs) + Grafana. Generate load with k6 run script.js or wrk -t4 -c100 -d60s. In Grafana, find a slow request: open its trace, click through to the matching log lines via trace ID.
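If your Week 17 service isn't on Spring Boot (no Actuator), a plain-Java equivalent of the metrics endpoint is a few lines; the port (9091) and `/metrics` path here are arbitrary choices:

```java
import com.sun.net.httpserver.HttpServer;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        new JvmMemoryMetrics().bindTo(registry); // the JVM binders from Mechanical Detail
        new JvmGcMetrics().bindTo(registry);

        HttpServer server = HttpServer.create(new InetSocketAddress(9091), 0);
        server.createContext("/metrics", exchange -> {
            // scrape() renders the registry in Prometheus exposition format
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```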
Idiomatic Drill
Add three custom Micrometer metrics that would help an SRE during a real incident - examples:
- request_queue_depth (gauge) - work waiting for a thread.
- db_connection_wait_seconds (timer) - time spent waiting for a Hikari connection.
- outbound_retry_count (counter, tagged by target service) - Resilience4j retry hits.
For each, write the one-line justification: "this metric tells me X when Y is failing."
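One possible shape for these three meters - a sketch only: the wiring (the work queue, data source, and retry hook) will differ per service, and HikariCP and Resilience4j can also publish equivalents through their own Micrometer integrations:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import javax.sql.DataSource;
import java.sql.Connection;
import java.util.concurrent.BlockingQueue;

public class IncidentMetrics {
    private final Timer connectionWait;
    private final Counter paymentRetries;

    IncidentMetrics(MeterRegistry registry, BlockingQueue<Runnable> workQueue) {
        // Gauge: queue depth rises before latency does - an early saturation signal.
        Gauge.builder("request_queue_depth", workQueue, BlockingQueue::size)
                .register(registry);

        // Timer: time spent waiting for a pooled connection, not using it.
        connectionWait = Timer.builder("db_connection_wait_seconds").register(registry);

        // Counter: one bounded tag value per downstream service, never per request.
        paymentRetries = Counter.builder("outbound_retry_count")
                .tag("target", "payments")
                .register(registry);
    }

    Connection acquire(DataSource ds) throws Exception {
        return connectionWait.recordCallable(ds::getConnection);
    }

    void onRetry() { // call from your retry listener
        paymentRetries.increment();
    }
}
```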
Production Hardening Slice
Document an "observability contract" for every service in your template:
- Health endpoint at `/actuator/health` distinguishing liveness vs readiness.
- Metrics endpoint at `/actuator/prometheus` in Prometheus exposition format, JVM + HTTP + custom metrics included.
- Traces exported to a configurable OTLP endpoint via the OTel agent.
- Logs JSON-structured to stdout (let the container runtime ship them), with `trace_id` injected.
One README section per item. Any service that doesn't meet all four is not production-ready.