
Month 6-Week 4: Q2 capstone-observability bridge post + Q3 track decision

Week summary

  • Goal: Commit to a Q3 track (Evals / Agents / Inference), adopt the OpenTelemetry GenAI semantic conventions, and publish "LLM observability for engineers who already know observability"-the most career-leveraged post of your year.
  • Time: ~9 h over 3 sessions.
  • Output: Q3 track committed; OTel GenAI in project; sixth public blog post; Q2 retrospective.
  • Sequences relied on: 13-llm-observability rungs 01, 04, 05, 09, 11.

Why this week matters

Q2 closes here. The bridge observability post is the artifact that announces your unique positioning to the world: an SRE / observability engineer who became an AI engineer. Few people sit at this intersection in 2026; the post claims the territory.

The Q3 track decision is also load-bearing. Without commitment, Q3 dilutes; with commitment, Q3 produces a real specialty.

Prerequisites

  • M04, M05, M06-W01–W03 complete.
  • Anchor project mature.

Sessions

  • Session A-Tue/Wed evening (~3 h): track decision + OTel GenAI
  • Session B-Sat morning (~4 h): bridge blog post draft + edit
  • Session C-Sun afternoon (~2 h): publish + Q2 retro

Session A-Track decision + OpenTelemetry GenAI

Goal: Commit to a Q3 specialty track. Adopt OpenTelemetry GenAI semantic conventions in the project.

Part 1-Q3 track decision (60 min)

Three options (from the roadmap):

  • Track A-Evals. Recommended for you given your observability background.
  • Track B-Agents. Strong if M06's agent work was your favorite part.
  • Track C-Inference Infra. Strong if distributed systems and infrastructure energize you most.

Decide by writing. In Q3_TRACK_DECISION.md:

1. Which track and why (3 sentences).
2. The specific Q3 anchor project hypothesis.
3. The specific 3 OSS projects you'd potentially contribute to.
4. The blog post you'd write at the end of Q3.

Examples:

  • Track A: "Open-source eval framework for agent trajectories-go beyond Inspect AI / Braintrust by focusing on agent-specific failure modes."
  • Track B: "Reproducible SWE-bench Lite submission with novel reflection design."
  • Track C: "Self-hosting benchmark suite for OSS LLMs with quantization comparisons."

Lock the decision. Don't second-guess in Q3.

Part 2-Read OTel GenAI semantic conventions (60 min)

OpenTelemetry standardizes how to instrument any system. The GenAI conventions extend this for AI workloads:

  • gen_ai.system - anthropic, openai, vllm, etc.
  • gen_ai.request.model - e.g. claude-opus-4-7
  • gen_ai.usage.input_tokens
  • gen_ai.usage.output_tokens
  • gen_ai.response.finish_reasons - a list

Read: opentelemetry.io-search "GenAI semantic conventions"-plus the active GitHub spec discussions (the spec is still evolving in 2025–2026).

Part 3-Adopt in your project (60 min)

Either via OpenTelemetry SDK directly:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# client is your already-constructed LLM client (e.g. an Anthropic client)
with tracer.start_as_current_span("llm.request") as span:
    # Identify the provider and model per the GenAI conventions
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
    resp = client.messages.create(...)
    # Record token usage as first-class span attributes
    span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)

Or via Langfuse's OTel-aware adapter (search "langfuse opentelemetry"). Either way, your traces now use vendor-neutral conventions-Datadog, Grafana, Honeycomb can all consume them.

Push v0.8.0.

Output of Session A

  • Q3_TRACK_DECISION.md written.
  • OTel GenAI conventions adopted.

Session B-Bridge blog post

Goal: Draft and edit "LLM observability for engineers who already know observability" (~2500 words). The most career-leveraged post of your year.

Part 1-Outline (30 min)

1. Hook (200 w)
   "Observability engineers will be the SREs of AI products. Here's the manual I wish existed."
2. Familiar primitives map to LLM systems (400 w)
   - Span → LLM call
   - SLI → quality SLI (Faithfulness, Pass Rate)
   - Trace → multi-step agent run
   - Drift → distribution shift in inputs / outputs / quality
3. What's actually new (400 w)
   - Quality is non-deterministic. Same input → different outputs.
   - Cost as a first-class signal.
   - Judge-based eval introduces a *measurement system*, not just metrics.
4. The mapping in practice (600 w)
   - OTel GenAI conventions.
   - Langfuse / LangSmith vs traditional APM.
   - Grafana dashboards with the new signals.
5. SLO design for LLM systems (400 w)
   - Faithfulness SLI > 0.85 (judge-based).
   - p95 latency budget including streaming TTFT.
   - Cost-per-1000-requests trend.
6. Pitfalls (300 w)
   - PII leakage in traces (redaction).
   - Judge drift over time.
   - Cardinality explosion from per-prompt tags.
7. What I built (200 w)
   - Link to your project. Real numbers from it.
8. Bridge (200 w)
   - This is where SREs and AI meet. Few engineers sit here. Take the seat.
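The judge-based SLI in section 5 can be sketched in a few lines. This is a hypothetical sketch, not your project's actual definition: it assumes judge scores in [0, 1], a per-response pass mark of 0.85, and an SLO target of 85% of responses passing.

```python
# Hypothetical faithfulness SLI: fraction of judged responses scoring
# at or above a pass threshold, checked against an SLO target.
FAITHFULNESS_PASS = 0.85   # per-response judge score needed to "pass"
SLO_TARGET = 0.85          # fraction of responses that must pass

def faithfulness_sli(judge_scores: list[float]) -> float:
    """Fraction of responses the judge scored at or above the pass mark."""
    if not judge_scores:
        return 0.0
    passed = sum(1 for s in judge_scores if s >= FAITHFULNESS_PASS)
    return passed / len(judge_scores)

def slo_met(judge_scores: list[float]) -> bool:
    return faithfulness_sli(judge_scores) >= SLO_TARGET

scores = [0.92, 0.88, 0.79, 0.95, 0.91]
print(faithfulness_sli(scores))  # 0.8
print(slo_met(scores))           # False (0.8 < 0.85)
```

Note the double use of 0.85: the per-response pass mark and the SLO target are independent knobs, and conflating them is a common mistake worth calling out in the post.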

Part 2-Draft (180 min)

Write the full ~2500 words. Use your project's data throughout. This is your post; the specifics make it real.

Include:

  • Code snippets (OTel instrumentation; SLI definition).
  • Charts from your project (latency, cost-per-day, judge scores over time).
  • A diagram showing the SRE ↔ AI bridge.
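For the PII-leakage pitfall, a snippet like this could anchor the section: a minimal, regex-based scrub of trace attributes before export. The patterns and the gen_ai.prompt key are illustrative assumptions; real redaction needs a proper PII library, not two regexes.

```python
import re

# Hypothetical redaction pass: scrub obvious PII from trace attributes
# before they leave the process. Patterns are illustrative only.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(value: str) -> str:
    """Replace each matched PII pattern with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        value = pattern.sub(placeholder, value)
    return value

attrs = {"gen_ai.prompt": "Email jane@example.com about SSN 123-45-6789"}
clean = {k: redact(v) for k, v in attrs.items()}
print(clean["gen_ai.prompt"])  # Email <EMAIL> about SSN <SSN>
```

In a real pipeline this would run as a span processor before the exporter, so unredacted attributes never reach the backend.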

Part 3-Edit (30 min)

Read aloud. Cut anything weak. Tighten the hook and conclusion.

Output of Session B

  • Drafted and edited blog post (~2500 words).

Session C-Publish + Q2 retrospective

Goal: Publish the bridge post broadly. Run Q2 retrospective.

Part 1-Publish (60 min)

  • Personal blog.
  • Cross-post: HN (Show HN), r/MachineLearning (Discussion), r/devops (this is your audience), r/sre, X, LinkedIn.
  • Email it directly to: 3 observability practitioners you respect; 2 LLM observability product folks (Langfuse, LangSmith, Datadog LLM, Arize). Polite, brief.
  • Post in relevant Slacks/Discords.

Part 2-Engage (30 min)

Respond to comments. The bridge post will likely get more engagement than your previous posts because it speaks to both audiences (SRE and AI). Be ready.

Part 3-Q2 retrospective (60 min)

Q2_RETRO.md:

# Q2 Retrospective: Applied AI Engineering

## Artifacts shipped (12 weeks)
- Anchor project at v0.8.0
- 50-example golden dataset, 50 human labels, judge κ documented per dimension
- Inspect AI eval suite + CI regression detection
- Hybrid + rerank + contextual RAG pipeline
- ReAct + reflection agent
- Langfuse tracing + OTel GenAI conventions
- 6 public blog posts (4 in Q2)
- Year-cumulative: 9 public blog posts

## KPIs vs Q2 targets
| Metric | Target | Actual |
|---|---|---|
| Public repos | 2 | 1 (anchor; deep) |
| Blog posts | 2 | 4 |
| Eval runs | 5+ | 12+ |
| OSS PRs | 0 | 1 (M04) |

## Three biggest insights
1. Eval rigor is the differentiator. Most teams ship folklore.
2. Reranking is the underrated lever in RAG.
3. The SRE → AI bridge is real and underpopulated.

## Q3 track committed: <Track A / B / C>
- Anchor: <project name>
- Capstone: <description>

## Q3 plan
- M07: dive deep into specialty + first frontier paper.
- M08: universal inference + fine-tuning fundamentals.
- M09: track final push, OSS PR, distributed-training literacy, specialty post.

## Confidence calibration before Q3
- [ ] I can build a non-trivial LLM application end-to-end with evals.
- [ ] I can wire LLM observability with OTel conventions.
- [ ] I have public artifacts proving the work.
- [ ] My Q3 specialty hypothesis is specific and committed.

Output of Session C

  • Sixth public blog post live, ≥4 channels.
  • Q2 retrospective committed.

End-of-week artifact

  • Q3 track decision committed in writing
  • OTel GenAI conventions adopted
  • Sixth blog post published, ≥4 channels
  • Q2 retrospective written

End-of-week self-assessment

  • I can articulate my Q3 specialty in 30 seconds.
  • My bridge post is something I'd link in interviews.
  • I have shipped artifacts that prove the year so far.

Common failure modes for this week

  • Indecision on track. Pick one. Course-correct in Q3 retro if needed; don't oscillate weekly.
  • Generic observability post. This must be your project's specifics, not theory.
  • Skipping the engage phase. Replies to substantive comments are how relationships start.

What's next (preview of M07-W01-Q3 begins)

Specialty deep dive begins. Foundational paper for your track. New repo for the specialty work. DESIGN.md.
