Month 6-Week 4: Q2 capstone-observability bridge post + Q3 track decision¶
Week summary¶
- Goal: Q3 track decision (Evals / Agents / Inference). OpenTelemetry GenAI conventions adopted. Publish "LLM observability for engineers who already know observability"-the most career-leveraged post of your year.
- Time: ~9 h over 3 sessions.
- Output: Q3 track committed; OTel GenAI in project; sixth public blog post; Q2 retrospective.
- Sequences relied on: 13-llm-observability rungs 01, 04, 05, 09, 11.
Why this week matters¶
Q2 closes here. The bridge observability post is the artifact that announces your unique positioning to the world: an SRE / observability engineer who became an AI engineer. Few people sit at this intersection in 2026; the post claims the territory.
The Q3 track decision is also load-bearing. Without commitment, Q3 dilutes; with commitment, Q3 produces a real specialty.
Prerequisites¶
- M04, M05, M06-W01–W03 complete.
- Anchor project mature.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): track decision + OTel GenAI
- Session B-Sat morning (~4 h): bridge blog post draft + edit
- Session C-Sun afternoon (~2 h): publish + Q2 retro
Session A-Track decision + OpenTelemetry GenAI¶
Goal: Commit to a Q3 specialty track. Adopt OpenTelemetry GenAI semantic conventions in the project.
Part 1-Q3 track decision (60 min)¶
Three options (from the roadmap):
- Track A-Evals. Recommended for you given your observability background.
- Track B-Agents. Strong if M06's agent work was your favorite part.
- Track C-Inference Infra. Strong if distributed systems and infrastructure energize you most.
Decide by writing. In Q3_TRACK_DECISION.md:
1. Which track and why (3 sentences).
2. The specific Q3 anchor project hypothesis.
3. The specific 3 OSS projects you'd potentially contribute to.
4. The blog post you'd write at end of Q3.
Examples:
- Track A: "Open-source eval framework for agent trajectories-go beyond Inspect AI / Braintrust by focusing on agent-specific failure modes."
- Track B: "Reproducible SWE-bench Lite submission with novel reflection design."
- Track C: "Self-hosting benchmark suite for OSS LLMs with quantization comparisons."
Lock the decision. Don't second-guess in Q3.
Part 2-Read OTel GenAI semantic conventions (60 min)¶
OpenTelemetry standardizes how to instrument any system. The GenAI conventions extend this for AI workloads:
- gen_ai.system - anthropic, openai, vllm, etc.
- gen_ai.request.model - e.g. claude-opus-4-7
- gen_ai.usage.input_tokens
- gen_ai.usage.output_tokens
- gen_ai.response.finish_reasons - list
Read: opentelemetry.io (search "GenAI semantic conventions"), plus the active GitHub spec discussions (the spec is still evolving in 2025–2026).
Part 3-Adopt in your project (60 min)¶
Either via the OpenTelemetry SDK directly:
```python
import anthropic
from opentelemetry import trace

tracer = trace.get_tracer(__name__)
client = anthropic.Anthropic()

with tracer.start_as_current_span("llm.request") as span:
    # Vendor-neutral GenAI attributes per the OTel semantic conventions
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
    resp = client.messages.create(...)  # fill in model, messages, etc.
    # Token usage comes back on the provider response; record it on the span
    span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
```
Or via Langfuse's OTel-aware adapter (search "langfuse opentelemetry"). Either way, your traces now use vendor-neutral conventions; Datadog, Grafana, and Honeycomb can all consume them.
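If you wire the exporter yourself, a minimal sketch of pointing the SDK at any OTLP-compatible backend looks like this; the endpoint and auth header are placeholders, so check your backend's docs for the real values:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Placeholder endpoint and credentials - consult your backend's OTLP docs.
exporter = OTLPSpanExporter(
    endpoint="https://<your-backend-host>/v1/traces",
    headers={"Authorization": "Basic <base64(public_key:secret_key)>"},
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```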
Push v0.8.0.
Output of Session A¶
- Q3_TRACK_DECISION.md written.
- OTel GenAI conventions adopted.
Session B-Bridge blog post¶
Goal: Draft and edit "LLM observability for engineers who already know observability" (~2500 words). The most career-leveraged post of your year.
Part 1-Outline (30 min)¶
1. Hook (200 w)
"Observability engineers will be the SREs of AI products. Here's the manual I wish existed."
2. Familiar primitives map to LLM systems (400 w)
- Span → LLM call
- SLI → quality SLI (Faithfulness, Pass Rate)
- Trace → multi-step agent run
- Drift → distribution shift in inputs / outputs / quality
3. What's actually new (400 w)
- Quality is non-deterministic. Same input → different outputs.
- Cost as a first-class signal.
- Judge-based eval introduces a *measurement system*, not just metrics.
4. The mapping in practice (600 w)
- OTel GenAI conventions.
- Langfuse / LangSmith vs traditional APM.
- Grafana dashboards with the new signals.
5. SLO design for LLM systems (400 w)
- Faithfulness SLI > 0.85 (judge-based).
- p95 latency budget including streaming TTFT (time to first token; a measurement sketch follows this outline).
- Cost-per-1000-requests trend.
6. Pitfalls (300 w)
- PII leakage in traces (redaction; a sketch follows this outline).
- Judge drift over time.
- Cardinality explosion from per-prompt tags.
7. What I built (200 w)
- Link to your project. Real numbers from it.
8. Bridge (200 w)
- This is where SREs and AI meet. Few engineers sit here. Take the seat.
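For the streaming TTFT point in section 5, a minimal measurement sketch, assuming the Anthropic streaming client; the model name and prompt are placeholders:
```python
import time

import anthropic

client = anthropic.Anthropic()

start = time.monotonic()
first_token_at = None

# Stream the response and timestamp the first text delta.
with client.messages.stream(
    model="<your-model>",  # placeholder
    max_tokens=256,
    messages=[{"role": "user", "content": "<your prompt>"}],  # placeholder
) as stream:
    for _ in stream.text_stream:
        if first_token_at is None:
            first_token_at = time.monotonic()

if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.3f}s")
```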
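For the PII pitfall in section 6, a hedged sketch of redacting at the point where span attributes are set; the regex and the gen_ai.prompt attribute name are illustrative, not part of the official conventions:
```python
import re

# Illustrative pattern - extend with phone numbers, account IDs, etc.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    """Mask email-like strings before they reach the trace backend."""
    return EMAIL.sub("<redacted-email>", text)

print(redact("Summarize the ticket from jane.doe@example.com"))
# -> Summarize the ticket from <redacted-email>
# In the instrumented code path, apply it where the attribute is set, e.g.
# span.set_attribute("gen_ai.prompt", redact(user_prompt))
```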
Part 2-Draft (180 min)¶
Write the full ~2500 words. Use your project's data throughout. This is your post; the specifics make it real.
Include:
- Code snippets (OTel instrumentation; SLI definition, sketched below).
- Charts from your project (latency, cost-per-day, judge scores over time).
- A diagram showing the SRE ↔ AI bridge.
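A minimal sketch of a judge-based SLI definition you could adapt; the record type, threshold, and SLO target are illustrative, not from a specific library:
```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    faithfulness: float  # judge score in [0, 1] for one request

def faithfulness_sli(records: list[TraceRecord], threshold: float = 0.85) -> float:
    """Fraction of requests whose judge score meets the threshold."""
    if not records:
        return 1.0  # no traffic, no violation - pick your own convention
    good = sum(1 for r in records if r.faithfulness >= threshold)
    return good / len(records)

# Example window, compared against an illustrative SLO target of 0.95.
window = [TraceRecord(0.91), TraceRecord(0.78), TraceRecord(0.96)]
print(f"faithfulness SLI: {faithfulness_sli(window):.2f} (SLO target: 0.95)")
```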
Part 3-Edit (30 min)¶
Read aloud. Cut anything weak. Tighten the hook and conclusion.
Output of Session B¶
- Drafted and edited blog post (~2500 words).
Session C-Publish + Q2 retrospective¶
Goal: Publish the bridge post broadly. Run Q2 retrospective.
Part 1-Publish (60 min)¶
- Personal blog.
- Cross-post: HN (Show HN), r/MachineLearning (Discussion), r/devops (this is your audience), r/sre, X, LinkedIn.
- Email it directly to: 3 observability practitioners you respect; 2 LLM observability product folks (Langfuse, LangSmith, Datadog LLM, Arize). Polite, brief.
- Post in relevant Slacks/Discords.
Part 2-Engage (30 min)¶
Respond to comments. The bridge post will likely get more engagement than your previous posts because it speaks to both audiences (SRE and AI). Be ready.
Part 3-Q2 retrospective (60 min)¶
Q2_RETRO.md:
# Q2 Retrospective: Applied AI Engineering
## Artifacts shipped (12 weeks)
- Anchor project at v0.8.0
- 50-example golden dataset, 50 human labels, judge κ documented per dimension
- Inspect AI eval suite + CI regression detection
- Hybrid + rerank + contextual RAG pipeline
- ReAct + reflection agent
- Langfuse tracing + OTel GenAI conventions
- 6 public blog posts (4 in Q2)
- Year-cumulative: 9 public blog posts
## KPIs vs Q2 targets
| Metric | Target | Actual |
|---|---|---|
| Public repos | 2 | 1 (anchor; deep) |
| Blog posts | 2 | 4 |
| Eval runs | 5+ | 12+ |
| OSS PRs | 0 | 1 (M04) |
## Three biggest insights
1. Eval rigor is the differentiator. Most teams ship folklore.
2. Reranking is the underrated lever in RAG.
3. The SRE → AI bridge is real and underpopulated.
## Q3 track committed: <Track A / B / C>
- Anchor: <project name>
- Capstone: <description>
## Q3 plan
- M07: dive deep into specialty + first frontier paper.
- M08: universal inference + fine-tuning fundamentals.
- M09: track final push, OSS PR, distributed-training literacy, specialty post.
## Confidence calibration before Q3
- [ ] I can build a non-trivial LLM application end-to-end with evals.
- [ ] I can wire LLM observability with OTel conventions.
- [ ] I have public artifacts proving the work.
- [ ] My Q3 specialty hypothesis is specific and committed.
Output of Session C¶
- Sixth public blog post live, ≥4 channels.
- Q2 retrospective committed.
End-of-week artifact¶
- Q3 track decision committed in writing
- OTel GenAI conventions adopted
- Sixth blog post published, ≥4 channels
- Q2 retrospective written
End-of-week self-assessment¶
- I can articulate my Q3 specialty in 30 seconds.
- My bridge post is something I'd link in interviews.
- I have shipped artifacts that prove the year so far.
Common failure modes for this week¶
- Indecision on track. Pick one. Course-correct in Q3 retro if needed; don't oscillate weekly.
- Generic observability post. This must be your project's specifics, not theory.
- Skipping the engage phase. Replies to substantive comments are how relationships start.
What's next (preview of M07-W01-Q3 begins)¶
Specialty deep dive begins. Foundational paper for your track. New repo for the specialty work. DESIGN.md.