
Month 6-Week 4: Q2 capstone-observability bridge post + Q3 track decision

Week summary

  • Goal: Commit to a Q3 track (Evals / Agents / Inference), adopt the OpenTelemetry GenAI semantic conventions, and publish "LLM observability for engineers who already know observability"-the most career-leveraged post of your year.
  • Time: ~9 h over 3 sessions.
  • Output: Q3 track committed; OTel GenAI in project; sixth public blog post; Q2 retrospective.
  • Sequences relied on: 13-llm-observability rungs 01, 04, 05, 09, 11.

Why this week matters

Q2 closes here. The bridge observability post is the artifact that announces your unique positioning to the world: an SRE / observability engineer who became an AI engineer. Few people sit at this intersection in 2026; the post claims the territory.

The Q3 track decision is also load-bearing. Without commitment, Q3 dilutes; with commitment, Q3 produces a real specialty.

Prerequisites

  • M04, M05, M06-W01–W03 complete.
  • Anchor project mature.

Sessions

  • Session A-Tue/Wed evening (~3 h): track decision + OTel GenAI
  • Session B-Sat morning (~4 h): bridge blog post draft + edit
  • Session C-Sun afternoon (~2 h): publish + Q2 retro

Session A-Track decision + OpenTelemetry GenAI

Goal: Commit to a Q3 specialty track. Adopt OpenTelemetry GenAI semantic conventions in the project.

Part 1-Q3 track decision (60 min)

Three options (from the roadmap):

  • Track A-Evals. Recommended for you given your observability background.
  • Track B-Agents. Strong if M06's agent work was your favorite part.
  • Track C-Inference Infra. Strong if distributed systems and infrastructure energize you most.

Decide by writing. In Q3_TRACK_DECISION.md:

1. Which track and why (3 sentences).
2. The specific Q3 anchor project hypothesis.
3. The specific 3 OSS projects you'd potentially contribute to.
4. The blog post you'd write at the end of Q3.

Examples:

  • Track A: "Open-source eval framework for agent trajectories-go beyond Inspect AI / Braintrust by focusing on agent-specific failure modes."
  • Track B: "Reproducible SWE-bench Lite submission with novel reflection design."
  • Track C: "Self-hosting benchmark suite for OSS LLMs with quantization comparisons."

Lock the decision. Don't second-guess in Q3.

Part 2-Read OTel GenAI semantic conventions (60 min)

OpenTelemetry standardizes how to instrument any system. The GenAI conventions extend this for AI workloads:

  • gen_ai.system - anthropic, openai, vllm, etc.
  • gen_ai.request.model - e.g. claude-opus-4-7
  • gen_ai.usage.input_tokens
  • gen_ai.usage.output_tokens
  • gen_ai.response.finish_reasons - a list

Read: opentelemetry.io-search "GenAI semantic conventions"-plus the active GitHub spec discussions (the spec is still evolving in 2025–2026).

Part 3-Adopt in your project (60 min)

Either via OpenTelemetry SDK directly:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# client is your already-constructed LLM client (e.g. an Anthropic client)
with tracer.start_as_current_span("llm.request") as span:
    # Identify the provider and model per the GenAI conventions
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
    resp = client.messages.create(...)
    # Record token usage as first-class span attributes
    span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)

Or via Langfuse's OTel-aware adapter (search "langfuse opentelemetry"). Either way, your traces now use vendor-neutral conventions-Datadog, Grafana, Honeycomb can all consume them.

Push v0.8.0.

Output of Session A

  • Q3_TRACK_DECISION.md written.
  • OTel GenAI conventions adopted.

Session B-Bridge blog post

Goal: Draft and edit "LLM observability for engineers who already know observability" (~2500 words). The most career-leveraged post of your year.

Part 1-Outline (30 min)

1. Hook (200 w)
   "Observability engineers will be the SREs of AI products. Here's the manual I wish existed."
2. Familiar primitives map to LLM systems (400 w)
   - Span → LLM call
   - SLI → quality SLI (Faithfulness, Pass Rate)
   - Trace → multi-step agent run
   - Drift → distribution shift in inputs / outputs / quality
3. What's actually new (400 w)
   - Quality is non-deterministic. Same input → different outputs.
   - Cost as a first-class signal.
   - Judge-based eval introduces a *measurement system*, not just metrics.
4. The mapping in practice (600 w)
   - OTel GenAI conventions.
   - Langfuse / LangSmith vs traditional APM.
   - Grafana dashboards with the new signals.
5. SLO design for LLM systems (400 w)
   - Faithfulness SLI > 0.85 (judge-based).
   - p95 latency budget including streaming TTFT.
   - Cost-per-1000-requests trend.
6. Pitfalls (300 w)
   - PII leakage in traces (redaction).
   - Judge drift over time.
   - Cardinality explosion from per-prompt tags.
7. What I built (200 w)
   - Link to your project. Real numbers from it.
8. Bridge (200 w)
   - This is where SREs and AI meet. Few engineers sit here. Take the seat.
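The judge-based SLI in section 5 can be sketched in a few lines. This is a hypothetical sketch, not your project's actual definition: it assumes judge scores in [0, 1], a per-response pass mark of 0.85, and an SLO target of 85% of responses passing.

```python
# Hypothetical faithfulness SLI: fraction of judged responses scoring
# at or above a pass threshold, checked against an SLO target.
FAITHFULNESS_PASS = 0.85   # per-response judge score needed to "pass"
SLO_TARGET = 0.85          # fraction of responses that must pass

def faithfulness_sli(judge_scores: list[float]) -> float:
    """Fraction of responses the judge scored at or above the pass mark."""
    if not judge_scores:
        return 0.0
    passed = sum(1 for s in judge_scores if s >= FAITHFULNESS_PASS)
    return passed / len(judge_scores)

def slo_met(judge_scores: list[float]) -> bool:
    return faithfulness_sli(judge_scores) >= SLO_TARGET

scores = [0.92, 0.88, 0.79, 0.95, 0.91]
print(faithfulness_sli(scores))  # 0.8
print(slo_met(scores))           # False (0.8 < 0.85)
```

Note the double use of 0.85: the per-response pass mark and the SLO target are independent knobs, and conflating them is a common mistake worth calling out in the post.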

Part 2-Draft (180 min)

Write the full ~2500 words. Use your project's data throughout. This is your post; the specifics make it real.

Include:

  • Code snippets (OTel instrumentation; SLI definition).
  • Charts from your project (latency, cost-per-day, judge scores over time).
  • A diagram showing the SRE ↔ AI bridge.
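For the PII-leakage pitfall, a snippet like this could anchor the section: a minimal, regex-based scrub of trace attributes before export. The patterns and the gen_ai.prompt key are illustrative assumptions; real redaction needs a proper PII library, not two regexes.

```python
import re

# Hypothetical redaction pass: scrub obvious PII from trace attributes
# before they leave the process. Patterns are illustrative only.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(value: str) -> str:
    """Replace each matched PII pattern with its placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        value = pattern.sub(placeholder, value)
    return value

attrs = {"gen_ai.prompt": "Email jane@example.com about SSN 123-45-6789"}
clean = {k: redact(v) for k, v in attrs.items()}
print(clean["gen_ai.prompt"])  # Email <EMAIL> about SSN <SSN>
```

In a real pipeline this would run as a span processor before the exporter, so unredacted attributes never reach the backend.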

Part 3-Edit (30 min)

Read aloud. Cut anything weak. Tighten the hook and conclusion.

Output of Session B

  • Drafted and edited blog post (~2500 words).

Session C-Publish + Q2 retrospective

Goal: Publish the bridge post broadly. Run Q2 retrospective.

Part 1-Publish (60 min)

  • Personal blog.
  • Cross-post: HN (Show HN), r/MachineLearning (Discussion), r/devops (this is your audience), r/sre, X, LinkedIn.
  • Email it directly to: 3 observability practitioners you respect; 2 LLM observability product folks (Langfuse, LangSmith, Datadog LLM, Arize). Polite, brief.
  • Post in relevant Slacks/Discords.

Part 2-Engage (30 min)

Respond to comments. The bridge post will likely get more engagement than your previous posts because it speaks to both audiences (SRE and AI). Be ready.

Part 3-Q2 retrospective (60 min)

Q2_RETRO.md:

# Q2 Retrospective: Applied AI Engineering

## Artifacts shipped (12 weeks)
- Anchor project at v0.8.0
- 50-example golden dataset, 50 human labels, judge κ documented per dimension
- Inspect AI eval suite + CI regression detection
- Hybrid + rerank + contextual RAG pipeline
- ReAct + reflection agent
- Langfuse tracing + OTel GenAI conventions
- 6 public blog posts (4 in Q2)
- Year-cumulative: 9 public blog posts

## KPIs vs Q2 targets
| Metric | Target | Actual |
|---|---|---|
| Public repos | 2 | 1 (anchor; deep) |
| Blog posts | 2 | 4 |
| Eval runs | 5+ | 12+ |
| OSS PRs | 0 | 1 (M04) |

## Three biggest insights
1. Eval rigor is the differentiator. Most teams ship folklore.
2. Reranking is the underrated lever in RAG.
3. The SRE → AI bridge is real and underpopulated.

## Q3 track committed: <Track A / B / C>
- Anchor: <project name>
- Capstone: <description>

## Q3 plan
- M07: dive deep into specialty + first frontier paper.
- M08: universal inference + fine-tuning fundamentals.
- M09: track final push, OSS PR, distributed-training literacy, specialty post.

## Confidence calibration before Q3
- [ ] I can build a non-trivial LLM application end-to-end with evals.
- [ ] I can wire LLM observability with OTel conventions.
- [ ] I have public artifacts proving the work.
- [ ] My Q3 specialty hypothesis is specific and committed.

Output of Session C

  • Sixth public blog post live, ≥4 channels.
  • Q2 retrospective committed.

End-of-week artifact

  • Q3 track decision committed in writing
  • OTel GenAI conventions adopted
  • Sixth blog post published, ≥4 channels
  • Q2 retrospective written

End-of-week self-assessment

  • I can articulate my Q3 specialty in 30 seconds.
  • My bridge post is something I'd link in interviews.
  • I have shipped artifacts that prove the year so far.

Common failure modes for this week

  • Indecision on track. Pick one. Course-correct in Q3 retro if needed; don't oscillate weekly.
  • Generic observability post. This must be your project's specifics, not theory.
  • Skipping the engage phase. Replies to substantive comments are how relationships start.

What's next (preview of M07-W01-Q3 begins)

Specialty deep dive begins. Foundational paper for your track. New repo for the specialty work. DESIGN.md.
