Week 18 - Observability: Logs, Metrics, Traces
Conceptual Core
Three pillars, one mental model:

- Logs are events. High cardinality, structured, expensive at scale, sampled in production.
- Metrics are aggregates. Low cardinality, cheap, always-on, dashboards-and-alerts.
- Traces are causal chains. Per-request, sampled, the only pillar that answers "what did this specific call do?"
In 2026 the cross-language standard is OpenTelemetry for traces (and increasingly for metrics and logs). The JVM-idiomatic metrics library is Micrometer. Logs stay on SLF4J + Logback/Log4j2 - OTel's logs signal is shipping, but adoption is still incremental.
Mechanical Detail
- Logs: SLF4J as the facade, Logback or Log4j2 as the backend. JSON layout via `logstash-logback-encoder` or Log4j2's `JsonLayout`. MDC (Mapped Diagnostic Context) for request-scoped fields: set in a request filter, read implicitly by every log line (see the filter sketch after this list). Sample high-volume events (every 100th `GET /healthz`); never sample errors.
- Metrics (Micrometer): `MeterRegistry` is the entry point. Core meter types: `Counter` (monotonic), `Gauge` (point-in-time), `Timer` (latency histograms), `DistributionSummary` (non-time histograms), `LongTaskTimer` (in-flight long ops); the meter-types sketch below exercises each. Pluggable backends: Prometheus, Datadog, CloudWatch, Dynatrace, etc. Bind JVM metrics: `JvmMemoryMetrics`, `JvmGcMetrics`, `JvmThreadMetrics`, `ProcessorMetrics`, `JvmHeapPressureMetrics`, `ClassLoaderMetrics`.
- Traces (OpenTelemetry Java agent): `-javaagent:opentelemetry-javaagent.jar` auto-instruments JDBC, HTTP clients/servers, Kafka, gRPC, Redis, and 100+ other libraries without code changes. Manual spans via `GlobalOpenTelemetry.getTracer(...).spanBuilder("name").startSpan()` (see the span sketch below). Attribute names follow OTel semantic conventions (`http.request.method`, `db.system.name`, etc.) - adhere to them so dashboards built for one service work for another.
- Correlation: the OTel agent injects `trace_id` and `span_id` into MDC automatically. Configure your log layout to include them (`%X{trace_id}` in Logback). Now traces and logs are joinable in the backend (Grafana, Datadog, Honeycomb) by trace ID - the single highest-leverage observability investment you can make.
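A minimal sketch of the request-filter pattern for MDC, assuming a Jakarta Servlet stack; the field names (`request_id`, `http_method`) are illustrative, not a convention. Note you do not set `trace_id` here - the OTel agent injects it for you:

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

import java.io.IOException;
import java.util.UUID;

/** Puts request-scoped fields into the MDC so every log line in this request carries them. */
public class MdcFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        MDC.put("request_id", UUID.randomUUID().toString());
        MDC.put("http_method", http.getMethod());
        try {
            chain.doFilter(req, res); // every log statement downstream sees the MDC fields
        } finally {
            MDC.clear(); // MDC is thread-local: always clean up, or fields leak across requests
        }
    }
}
```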
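A quick tour of the five core meter types, using `SimpleMeterRegistry` for brevity (swap in a real backend registry in production); the meter names are made up for illustration:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.LongTaskTimer;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MeterTypesTour {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Counter: monotonic, only goes up.
        Counter orders = Counter.builder("orders.created").register(registry);
        orders.increment();

        // Gauge: samples a point-in-time value; keep a strong reference to the state object.
        AtomicInteger inFlight = new AtomicInteger();
        Gauge.builder("requests.in_flight", inFlight, AtomicInteger::get).register(registry);

        // Timer: latency distribution, with optional percentile histogram buckets.
        Timer latency = Timer.builder("http.server.duration")
                .publishPercentileHistogram()
                .register(registry);
        latency.record(42, TimeUnit.MILLISECONDS);

        // DistributionSummary: histogram over non-time quantities (bytes, batch sizes).
        DistributionSummary payload = DistributionSummary.builder("request.payload.bytes")
                .register(registry);
        payload.record(1024);

        // LongTaskTimer: tracks duration of operations *while* they are still running.
        LongTaskTimer.Sample migration =
                LongTaskTimer.builder("batch.migration.active").register(registry).start();
        migration.stop();
    }
}
```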
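A sketch of a manual span, assuming the OTel agent (or SDK) is on the classpath; the tracer scope, span name, and `app.order.id` attribute are placeholders, not semantic conventions:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ManualSpanExample {
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("com.example.checkout"); // instrumentation scope name

    void processOrder(String orderId) {
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) { // child spans and MDC-correlated logs attach here
            span.setAttribute("app.order.id", orderId); // prefix custom attrs; don't shadow semconv names
            // ... business logic ...
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end(); // always end the span, or it never gets exported
        }
    }
}
```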
The Trap
High-cardinality tags on metrics. Adding user_id or request_id as a Micrometer tag explodes Prometheus storage (each unique value is a separate time-series). Tags should be bounded sets: HTTP method, status class, endpoint pattern (not URL). Per-user data belongs in traces or logs, not metrics.
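To make the trap concrete, a sketch contrasting bounded tags with the `user_id` anti-pattern; the meter and tag names are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class TagCardinality {
    void record(MeterRegistry registry, String method, int status, String routePattern, String userId) {
        // Good: every tag value comes from a small, bounded set.
        Counter.builder("http.server.requests.total")
                .tag("method", method)                // GET/POST/... - a handful of values
                .tag("status", (status / 100) + "xx") // 2xx/3xx/4xx/5xx - the status class
                .tag("route", routePattern)           // "/orders/{id}", never "/orders/12345"
                .register(registry)
                .increment();

        // Bad: one time-series per user - this melts Prometheus storage.
        // Counter.builder("http.server.requests.total")
        //         .tag("user_id", userId)
        //         .register(registry)
        //         .increment();
    }
}
```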
Lab
Wire all three pillars into your Week 17 service:
- OTel agent: download opentelemetry-javaagent.jar, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317.
- Micrometer Prometheus: add micrometer-registry-prometheus, expose /actuator/prometheus.
- Logback JSON: replace the default text layout with LogstashEncoder + %X{trace_id} in the pattern.
Bring up a local stack via docker-compose: Prometheus + Tempo (traces) + Loki (logs) + Grafana. Generate load with k6 run script.js or wrk -t4 -c100 -d60s. In Grafana, find a slow request: open its trace, click through to the matching log lines via trace ID.
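If your Week 17 service isn't on Spring Boot (no Actuator), a plain-Java equivalent of the metrics endpoint is a few lines; the port (9091) and `/metrics` path here are arbitrary choices:

```java
import com.sun.net.httpserver.HttpServer;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        new JvmMemoryMetrics().bindTo(registry); // the JVM binders from Mechanical Detail
        new JvmGcMetrics().bindTo(registry);

        HttpServer server = HttpServer.create(new InetSocketAddress(9091), 0);
        server.createContext("/metrics", exchange -> {
            // scrape() renders the registry in Prometheus exposition format
            byte[] body = registry.scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```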
Idiomatic Drill
Add three custom Micrometer metrics that would help an SRE during a real incident - examples:
- request_queue_depth (gauge) - work waiting for a thread.
- db_connection_wait_seconds (timer) - time spent waiting for a Hikari connection.
- outbound_retry_count (counter, tagged by target service) - Resilience4j retry hits.
For each, write the one-line justification: "this metric tells me X when Y is failing."
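One possible shape for these three meters - a sketch only: the wiring (the work queue, data source, and retry hook) will differ per service, and HikariCP and Resilience4j can also publish equivalents through their own Micrometer integrations:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import javax.sql.DataSource;
import java.sql.Connection;
import java.util.concurrent.BlockingQueue;

public class IncidentMetrics {
    private final Timer connectionWait;
    private final Counter paymentRetries;

    IncidentMetrics(MeterRegistry registry, BlockingQueue<Runnable> workQueue) {
        // Gauge: queue depth rises before latency does - an early saturation signal.
        Gauge.builder("request_queue_depth", workQueue, BlockingQueue::size)
                .register(registry);

        // Timer: time spent waiting for a pooled connection, not using it.
        connectionWait = Timer.builder("db_connection_wait_seconds").register(registry);

        // Counter: one bounded tag value per downstream service, never per request.
        paymentRetries = Counter.builder("outbound_retry_count")
                .tag("target", "payments")
                .register(registry);
    }

    Connection acquire(DataSource ds) throws Exception {
        return connectionWait.recordCallable(ds::getConnection);
    }

    void onRetry() { // call from your retry listener
        paymentRetries.increment();
    }
}
```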
Production Hardening Slice
Document an "observability contract" for every service in your template:
- Health endpoint at `/actuator/health` distinguishing liveness vs readiness.
- Metrics endpoint at `/actuator/prometheus` in Prometheus exposition format, JVM + HTTP + custom metrics included.
- Traces exported to a configurable OTLP endpoint via the OTel agent.
- Logs JSON-structured to stdout (let the container runtime ship them), with `trace_id` injected.
One README section per item. Any service that doesn't meet all four is not production-ready.