Month 4-Week 4: Polish, DSPy experiment, fourth blog post, OSS PR¶
Week summary¶
- Goal: Polish the anchor project for sharing. Try DSPy as a different paradigm. Publish your fourth public blog post (the AI-engineer identity announcement). Submit your first OSS PR.
- Time: ~9 h over 3 sessions.
- Output: Polished project + DSPy experiment + fourth public blog post + first merged-or-open OSS PR.
- Sequences relied on: 09-llm-application-engineering rung 10; 12-evaluation-systems.
Why this week matters¶
The blog post this week is the announcement: "I'm an AI engineer who builds evaluable LLM systems." It's the first piece of writing that fully reflects the new identity. Hiring managers screen for posts like this. Done well, it pays career dividends for years.
DSPy is a different paradigm: programs whose prompts are compiled rather than written. Going through one tutorial doesn't mean adopting it forever, but it changes how you think about prompts. That changed perspective is the point.
The OSS PR is small but symbolic. It's the start of the year-long habit of contributing externally.
Prerequisites¶
- M04-W01–W03 complete.
- Project with eval pipeline running.
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): close gaps + DSPy
- Session B-Sat morning (~3.5 h): blog post draft
- Session C-Sun afternoon (~2.5 h): publish + OSS PR + retro
Session A-Audit the project + DSPy experiment¶
Goal: Address project gaps. Try DSPy and reflect.
Part 1-Audit the project as a stranger (60 min)¶
Pretend you're seeing the repo for the first time. Read the README. Try to install. Try to run the example.
Note every point of friction:
- Setup instructions unclear?
- Missing environment variable docs?
- Example output not shown?
- Eval methodology buried?
Fix the top 3.
Part 2-DSPy in 90 minutes (90 min)¶
Read DSPy's quickstart: dspy.ai. Install: uv add dspy.
DSPy treats prompts as programs with signatures. You declare what's expected; DSPy compiles the prompt:
import dspy

# Model string is illustrative; any litellm-style model ID that DSPy supports works here.
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20241022"))

class TriageSignature(dspy.Signature):
    """Triage an incident report and produce structured output."""

    incident: str = dspy.InputField()
    severity: str = dspy.OutputField(desc="critical | high | medium | low")
    probable_cause: str = dspy.OutputField(desc="1-2 sentences")
    recommended_actions: list[str] = dspy.OutputField()

triage = dspy.Predict(TriageSignature)
result = triage(incident="...")
print(result)
Compare DSPy's output to your hand-written version on 5 cases (a minimal comparison harness is sketched below). Note:
- Did DSPy's compiled prompt produce comparable quality?
- What does the paradigm shift feel like (declarative vs. imperative)?
- Would you adopt DSPy for production? Why or why not?
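A minimal sketch of that side-by-side run: it assumes the triage module from the snippet above, and call_handwritten_triage plus the sample incidents are placeholders for your existing pipeline and golden-set cases.

# Sketch: compare the DSPy module against the hand-written pipeline on a few cases.
# `triage` comes from the snippet above; `call_handwritten_triage` is a placeholder
# name for your existing hand-written-prompt function.
SAMPLE_INCIDENTS = [
    "Checkout API returning 500s for ~10% of requests since 14:05 UTC",
    "Nightly ETL job exceeded its 2h window; downstream dashboards are stale",
    # ...pull the remaining 3 cases from your golden set
]

for incident in SAMPLE_INCIDENTS:
    dspy_result = triage(incident=incident)          # compiled-prompt path
    hand_result = call_handwritten_triage(incident)  # your existing path
    print("INCIDENT:", incident)
    print("  dspy ->", dspy_result.severity, "|", dspy_result.probable_cause)
    print("  hand ->", hand_result["severity"], "|", hand_result["probable_cause"])
    print()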
Part 3-Reflect (30 min)¶
Write 200 words: "What DSPy did to my mental model of prompts."
Common reflections:
- Prompts as programs makes evaluation natural.
- Optimization (MIPROv2, etc.) is intriguing but adds complexity.
- For straightforward tasks, hand-written prompts are clearer.
Output of Session A¶
- Top-3 README gaps fixed.
- DSPy experiment committed (small notebook).
- Reflection written.
Session B-Blog post draft¶
Goal: Draft "Building an LLM-powered incident triage system-and the data on whether it works" (~2000 words).
Part 1-Outline (30 min)¶
1. Hook (200 words)
- "Most AI demos hide the eval. This post shows the numbers."
- Tease the conclusion.
2. The problem (250 words)
- Why incident triage; what's hard about it; what teams currently do.
3. The naive approach (300 words)
- Single LLM call with a structured-output schema. Show the code.
- First eval run: 73% pass rate.
4. The eval setup (400 words)
- Golden set composition. Heuristic + judge. Judge validation with Cohen's kappa (a minimal kappa sketch follows this outline).
- Why this matters more than model choice.
5. Iterations (400 words)
- Few-shot examples → +X points.
- Tool use for live data → +Y points.
- Show the table with bootstrap CIs (a small bootstrap sketch follows this outline).
6. Cost & performance (200 words)
- $/incident, p95 latency, cache hit rate.
7. Limitations (150 words)
- What this doesn't do well; where it fails.
8. What's next (100 words)
- Bridge to month 5 (RAG) and 6 (agents).
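Two quick sketches to keep handy while drafting sections 4 and 5. First, judge validation: a minimal Cohen's kappa, assuming you have parallel human and judge labels for the same examples; the labels below are invented for illustration.

from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    # Agreement between human labels and LLM-judge labels, corrected for chance agreement.
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Toy example (invented labels): prints 0.67
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(human, judge), 2))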
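Second, the bootstrap CIs for the iterations table: a percentile-bootstrap sketch over per-example pass/fail results; the result lists below are made up for illustration, not real eval numbers.

import random

def bootstrap_ci(passes: list[bool], n_resamples: int = 10_000, alpha: float = 0.05):
    # Percentile-bootstrap confidence interval for a pass rate on a small golden set.
    n = len(passes)
    rates = sorted(
        sum(random.choice(passes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(passes) / n, (lo, hi)

# Illustrative only: 50-example golden set, baseline vs. few-shot variant.
baseline = [True] * 37 + [False] * 13
few_shot = [True] * 41 + [False] * 9
print(bootstrap_ci(baseline))
print(bootstrap_ci(few_shot))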
Part 2-Draft (120 min)¶
Write the full ~2000 words. Aim for complete, not perfect.
Include:
- Real numbers throughout.
- Code snippets formatted cleanly.
- Charts: eval pass rate over time, latency histogram.
- An honest limitations section.
Part 3-Edit pass 1 (60 min)¶
- Read aloud.
- Cut filler.
- Tighten the hook.
- Check the conclusion lands.
Output of Session B¶
- Drafted and once-edited blog post.
Session C-Publish, OSS PR, retro¶
Goal: Publish the post. Submit one OSS PR. Run month-4 retrospective.
Part 1-Publish + share (60 min)¶
- Publish to your blog.
- Cross-post: dev.to, HN (Show HN if applicable), r/MachineLearning (Project flair), r/LocalLLaMA.
- LinkedIn post with a 100-word teaser.
- Politely tag 3 specific people whose work you cited.
Part 2-First OSS PR (60 min)¶
Pick a project you used this month: LiteLLM, Anthropic SDK, Pydantic, Langfuse, DSPy, openai-cookbook.
Find a low-hanging issue:
- Doc typo or unclear example.
- Missing test for a small function.
- A # TODO that's small enough.
Fork → branch → fix → push → open PR.
Don't wait for merge. Submit and continue. The PR being open is the artifact; review takes time.
Document in LEARNING_LOG.md.
Part 3-Month-4 retro (45 min)¶
MONTH_4_RETRO.md:
# Month 4 retro
## Artifacts shipped
- Project at v0.3.0 (provider abstraction, tools, streaming, caching, evals)
- 50-example golden dataset
- Eval pipeline with CI + judge validation
- DSPy experiment
- Blog post: <link>
- OSS PR: <link>
## KPIs vs Q2 targets
| Metric | Target (Q2) | Actual at end of M04 |
|---|---|---|
| Public repos | 2 | 1 (anchor project); on track |
| Blog posts | 2 | 1; on track |
| Eval runs | 5+ | 3 already |
## Lessons
1. Judge validation moved my confidence from 'maybe' to 'defensible'.
2. Cost numbers are surprisingly informative: e.g., few-shot doubled cost for a marginal gain.
3. DSPy is pleasant for prototyping; not ready to bet on it for production this round.
## Pace check
- Sustainable / accelerated / behind?
## M05 plan (RAG)
- Pick a retrievable corpus (real or synthetic).
- BM25 baseline → dense retrieval → hybrid + rerank → RAG faithfulness eval.
- M05-W01 starts with corpus + BM25.
Output of Session C¶
- Blog post live, ≥3 channels.
- OSS PR open.
- Month-4 retro committed.
End-of-week artifact¶
- Fourth public blog post live, shared in ≥3 channels
- DSPy experiment in repo
- First OSS PR submitted
- Month-4 retrospective written
- Project polished to "presentable to a stranger"
End-of-week self-assessment¶
- My blog post is something I'd link in a job application.
- My project's README is good enough for someone to clone and run in 10 min.
- I have signaled "AI engineer" externally-not just internally.
Common failure modes for this week¶
- Polishing the post for a week. Ship at 80%. Quality compounds across posts; perfection is the enemy.
- OSS PR scope creep. Pick something small. The point is the habit.
- Hiding limitations. Honest limitations sections are more trusted, not less.
What's next (preview of M05-W01)¶
RAG begins. Pick a corpus. BM25 baseline first (always). NDCG@10 and MRR computed by hand once.
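If you want a head start on that hand computation, here is a small sketch of both metrics in plain Python; the relevance lists are toy inputs, and the M05 corpus will supply the real ones.

import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the first relevant result, averaged over queries.
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg_at_10(rels: list[int]) -> float:
    # NDCG@10 for one query; `rels` is relevance (binary or graded) in ranked order.
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:10], start=1))
    ideal = sorted(rels, reverse=True)[:10]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Toy inputs: two queries for MRR, one ranked list for NDCG@10.
print(mrr([[0, 1, 0], [1, 0, 0]]))                              # 0.75
print(round(ndcg_at_10([0, 1, 0, 1, 0, 0, 0, 0, 0, 0]), 3))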