Month 5-Week 2: Dense retrieval, embeddings, vector databases¶
Week summary¶
- Goal: Add dense (embedding-based) retrieval. Stand up a real vector DB (pgvector or Qdrant). Compare two embedding models. Quantify dense-vs-BM25 quality and latency.
- Time: ~9 h over 3 sessions.
- Output: Vector DB running locally; dense retrieval evaluated with NDCG and MRR; comparison table in README.
- Sequences relied on: 10-retrieval-and-rag rungs 03, 04; 01-linear-algebra rung 09.
Why this week matters¶
Dense retrieval handles paraphrase and synonymy that BM25 misses. But dense isn't always better-sometimes BM25 wins on rare terms or exact-match queries. Knowing when each wins on your specific corpus is the kind of empirical literacy senior RAG engineers have. This week measures that explicitly.
Standing up a vector DB also moves you from "toy NumPy retrieval" to "production-grade infra." pgvector vs Qdrant vs Weaviate are choices teams make daily; trying one means you can speak to all.
Prerequisites¶
- M05-W01 complete (BM25 baseline, eval queries).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): embeddings + naive dense retrieval
- Session B-Sat morning (~3.5 h): vector DB integration
- Session C-Sun afternoon (~2.5 h): two embedding models compared + write up
Session A-Embeddings + naive dense retrieval¶
Goal: Embed corpus and queries with sentence-transformers. Naive cosine retrieval in NumPy. Compare to BM25.
Part 1-Embedding intuition (45 min)¶
Read: Sentence-BERT paper (arxiv.org/abs/1908.10084), abstract + sections 1, 2, 3.
Key ideas:
- BERT alone gives token-level embeddings; SBERT pools to sentence-level via siamese fine-tuning.
- Mean-pooling over the last hidden state with attention masking → a fixed-size vector per text.
- Cosine similarity between embeddings reflects semantic similarity.
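To make the pooling step concrete, here is a minimal sketch of mean pooling with attention masking done by hand (an illustration, not part of the assigned reading). It uses all-MiniLM-L6-v2 because that model mean-pools by default; the example sentences are made up.
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
mdl = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mdl(**batch).last_hidden_state            # (batch, tokens, 384)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, dim=-1)   # unit length, so dot = cosine

a, b = embed(["How do I reset my password?", "Steps for account password recovery"])
print(float(a @ b))  # paraphrases land close in embedding space despite little term overlap
sentence-transformers wraps this same pooling for you when you call model.encode; the point is just to see that "a vector per text" is nothing more exotic than a masked average.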
Models to consider:
- all-MiniLM-L6-v2 - small, fast, decent (384-dim).
- BAAI/bge-small-en-v1.5 - better quality at the same size.
- BAAI/bge-large-en-v1.5 - best of the open free options at 1024-dim.
- text-embedding-3-large (OpenAI) - strong commercial choice.
For Session A, start with bge-small-en-v1.5.
Part 2-Embed corpus + queries (60 min)¶
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Embed all chunks (batched for speed)
chunk_texts = [c["text"] for c in chunks]
chunk_embeds = model.encode(chunk_texts, batch_size=32, show_progress_bar=True,
normalize_embeddings=True) # crucial for cosine via dot
# Embed queries
query_texts = [q["query"] for q in queries]
query_embeds = model.encode(query_texts, normalize_embeddings=True)
print(chunk_embeds.shape, query_embeds.shape)
Why normalize_embeddings=True? When vectors have unit length, dot product equals cosine similarity. Saves work and avoids subtle bugs.
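A quick numeric sanity check of that claim, if you want to see it (throwaway example vectors):
import numpy as np
a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(a_hat @ b_hat, cos))  # True: dot product of unit vectors == cosine similarity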
Part 3-NumPy nearest-neighbors (75 min)¶
def search_dense(query_idx, k=10):
q = query_embeds[query_idx]
scores = chunk_embeds @ q # cosine because pre-normalized
top = scores.argsort()[::-1][:k]
return [(chunks[i], float(scores[i])) for i in top]
Run for all queries. Compute NDCG@10 and MRR using your week-1 implementations.
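If digging the week-1 metric code back out is a hassle, here is a self-contained sketch with binary relevance. It assumes each query dict carries a hypothetical relevant_doc_ids set of gold document IDs; adapt the field name (and the doc-vs-chunk granularity) to however you actually stored your judgments.
# Sketch: NDCG@10 and MRR over all queries with binary relevance.
# "relevant_doc_ids" is an assumed field name -- use whatever your week-1 qrels look like.
import numpy as np

def dcg(gains):
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    gains = [1.0 if d in relevant_ids else 0.0 for d in retrieved_ids[:k]]
    ideal = [1.0] * min(len(relevant_ids), k)
    return dcg(gains) / dcg(ideal) if ideal else 0.0

def reciprocal_rank(retrieved_ids, relevant_ids):
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

ndcgs, rrs = [], []
for qi, q in enumerate(queries):
    ids = [c["doc_id"] for c, _ in search_dense(qi, k=10)]
    ndcgs.append(ndcg_at_k(ids, q["relevant_doc_ids"]))
    rrs.append(reciprocal_rank(ids, q["relevant_doc_ids"]))
print(f"dense  NDCG@10={np.mean(ndcgs):.3f}  MRR={np.mean(rrs):.3f}")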
Inspect failures. Find 5 queries where BM25 beat dense and 5 where dense beat BM25. Look at why. This is the empirical insight you want.
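One way to surface those queries is to score BM25 and dense side by side and sort by the difference. This sketch reuses ndcg_at_k from above and assumes a week-1 BM25 search function; the name search_bm25 and its signature are placeholders for whatever you actually called it.
# Sketch: rank queries by (dense NDCG@10 - BM25 NDCG@10) to find where each method wins.
deltas = []
for qi, q in enumerate(queries):
    dense_ids = [c["doc_id"] for c, _ in search_dense(qi, k=10)]
    bm25_ids = [c["doc_id"] for c, _ in search_bm25(q["query"], k=10)]  # placeholder name
    rel = q["relevant_doc_ids"]
    deltas.append((ndcg_at_k(dense_ids, rel) - ndcg_at_k(bm25_ids, rel), q["query"]))
deltas.sort()
print("BM25 wins:", deltas[:5])    # most negative deltas
print("Dense wins:", deltas[-5:])  # most positive deltas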
Output of Session A¶
- Embedding pipeline working.
- NumPy dense retrieval evaluated.
- Failure-mode comparison BM25 vs dense.
Session B-Vector database integration¶
Goal: Move from NumPy to a real vector DB. Verify retrieval results match. Benchmark latency.
Part 1-Pick a vector DB + setup (45 min)¶
pgvector (Postgres extension): - Pros: leverages Postgres infra you may already have; SQL queries. - Cons: less fancy for hybrid search out of the box.
Qdrant (purpose-built): - Pros: built for vector search; great hybrid, filters, scaling. - Cons: another service.
Recommended for you (SRE background): pgvector. Postgres familiarity means less novelty.
# Run pgvector via Docker
docker run -d --name pgv -p 5432:5432 \
-e POSTGRES_PASSWORD=pw \
ankane/pgvector
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
id SERIAL PRIMARY KEY,
doc_id TEXT,
chunk_idx INT,
text TEXT,
embedding vector(384) -- match your model's dim
);
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
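If you would rather script the setup than type SQL into psql, the same DDL can be run from Python. This is a sketch: the connection string matches the Docker command above, and Part 2 below simply opens the same connection again.
import psycopg2

# Same credentials as the Docker container above.
conn = psycopg2.connect("dbname=postgres user=postgres password=pw host=localhost")
conn.autocommit = True  # let DDL take effect immediately
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id SERIAL PRIMARY KEY,
        doc_id TEXT,
        chunk_idx INT,
        text TEXT,
        embedding vector(384)  -- match your model's dim
    );
""")
cur.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx ON chunks "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);"
)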
Part 2-Ingest + query (90 min)¶
import psycopg2
conn = psycopg2.connect("dbname=postgres user=postgres password=pw host=localhost")
cur = conn.cursor()
for c, e in zip(chunks, chunk_embeds):
cur.execute(
"INSERT INTO chunks (doc_id, chunk_idx, text, embedding) VALUES (%s, %s, %s, %s)",
(c["doc_id"], c["chunk_idx"], c["text"], e.tolist()),
)
conn.commit()
def search_pgvector(query_embed, k=10):
cur.execute(
"SELECT doc_id, chunk_idx, text, 1 - (embedding <=> %s::vector) AS score "
"FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
(query_embed.tolist(), query_embed.tolist(), k),
)
return cur.fetchall()
Verify parity. For 5 queries, compare pgvector results to NumPy results. The top-10 should match; if they don't, remember ivfflat is an approximate index: raise ivfflat.probes (or drop the index) while you check.
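A minimal parity-check sketch (the probes value of 20 is arbitrary; anything well above the default of 1 is plenty for a check this small):
# Compare pgvector's top-10 against the NumPy search for a handful of queries.
cur.execute("SET ivfflat.probes = 20;")  # ivfflat is approximate; more probes -> closer to exact
for qi in range(5):
    numpy_top = [(c["doc_id"], c["chunk_idx"]) for c, _ in search_dense(qi, k=10)]
    pg_top = [(row[0], row[1]) for row in search_pgvector(query_embeds[qi], k=10)]
    print(f"query {qi}: identical = {numpy_top == pg_top}")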
Part 3-Latency benchmark (60 min)¶
import time
times = []
for q in queries:
qe = model.encode(q["query"], normalize_embeddings=True)
start = time.perf_counter()
_ = search_pgvector(qe, k=10)
times.append((time.perf_counter() - start) * 1000)
print(f"p50: {np.percentile(times, 50):.1f} ms")
print(f"p95: {np.percentile(times, 95):.1f} ms")
Expected: <20 ms p50 on a 1,000-chunk corpus. At this size even an unindexed sequential scan is only a few milliseconds; the ivfflat index is what keeps latency flat as the corpus grows into the tens of thousands of chunks.
Output of Session B¶
- pgvector running with corpus indexed.
- Parity check vs NumPy.
- Latency benchmark.
Session C-Two embedding models compared + write up¶
Goal: Re-embed with a stronger model. Compare. Document the cost-quality-latency tradeoffs.
Part 1-Embed with bge-large-en-v1.5 (75 min)¶
model_large = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunk_embeds_large = model_large.encode(chunk_texts, batch_size=8,
show_progress_bar=True,
normalize_embeddings=True)
# 1024-dim, slower to embed
Re-evaluate on the same queries. Likely lift: +0.05 NDCG@10.
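A sketch of that re-evaluation (NumPy is enough at this stage; note the pgvector column was declared vector(384), so the large model would need its own table or column if you want it in the database too):
# Embed queries with the large model and re-run the same NumPy search + metrics.
query_embeds_large = model_large.encode(query_texts, normalize_embeddings=True)

def search_dense_large(query_idx, k=10):
    scores = chunk_embeds_large @ query_embeds_large[query_idx]  # cosine via dot (pre-normalized)
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

# Reuse the ndcg_at_k / reciprocal_rank helpers from Session A to fill in the comparison table.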
Part 2-Cost-quality-latency analysis (60 min)¶
Build a comparison table:
| Method | NDCG@10 | MRR | Embed time/chunk | Index size | Search p50 |
|---|---|---|---|---|---|
| BM25 | 0.612 | 0.534 | 0 | small | 5 ms |
| Dense MiniLM | 0.687 | 0.604 | 4 ms | 384 × N | 8 ms |
| Dense BGE-large | 0.732 | 0.661 | 18 ms | 1024 × N | 14 ms |
Decision matrix:
- For a small (<10K chunks) corpus with high quality requirements: BGE-large.
- For a large corpus where re-embedding is expensive: MiniLM, then upgrade later.
- Always keep BM25 around for hybrid (next week).
Part 3-README + push (45 min)¶
Update README with the comparison table. Push v0.4.0.
Update LEARNING_LOG.md: "Embeddings are not magic-picking a strong model gives a real but bounded lift; the bigger lift is in reranking, which is next week."
Output of Session C¶
- Two-embedding comparison table.
- README updated.
- v0.4.0 tagged.
End-of-week artifact¶
- pgvector (or Qdrant) running with corpus indexed
- Dense retrieval with two embedding models, both evaluated
- Comparison table in README (BM25 vs dense small vs dense large)
- Latency benchmarks per method
End-of-week self-assessment¶
- I can stand up a vector DB from a clean machine in <1 hour.
- I can articulate when to pick BGE-large vs MiniLM.
- I have measured baselines on my own corpus, not folklore numbers.
Common failure modes for this week¶
- Forgetting normalize_embeddings=True for cosine via dot.
- Trying every embedding model before measuring. Pick two, compare carefully.
- No latency benchmark. Production tradeoffs are inseparable from latency.
What's next (preview of M05-W03)¶
Hybrid retrieval (BM25 + dense via Reciprocal Rank Fusion) and reranking. Plus Anthropic Contextual Retrieval. The full modern RAG stack.