Month 5-Week 2: Dense retrieval, embeddings, vector databases¶
Week summary¶
- Goal: Add dense (embedding-based) retrieval. Stand up a real vector DB (pgvector or Qdrant). Compare two embedding models. Quantify dense-vs-BM25 quality and latency.
- Time: ~9 h over 3 sessions.
- Output: Vector DB running locally; dense retrieval evaluated with NDCG and MRR; comparison table in README.
- Sequences relied on: 10-retrieval-and-rag rungs 03, 04; 01-linear-algebra rung 09.
Why this week matters¶
Dense retrieval handles paraphrase and synonymy that BM25 misses. But dense isn't always better-sometimes BM25 wins on rare terms or exact-match queries. Knowing when each wins on your specific corpus is the kind of empirical literacy senior RAG engineers have. This week measures that explicitly.
Standing up a vector DB also moves you from "toy NumPy retrieval" to "production-grade infra." pgvector vs Qdrant vs Weaviate are choices teams make daily; trying one means you can speak to all.
Prerequisites¶
- M05-W01 complete (BM25 baseline, eval queries).
Recommended cadence¶
- Session A-Tue/Wed evening (~3 h): embeddings + naive dense retrieval
- Session B-Sat morning (~3.5 h): vector DB integration
- Session C-Sun afternoon (~2.5 h): two embedding models compared + write up
Session A-Embeddings + naive dense retrieval¶
Goal: Embed corpus and queries with sentence-transformers. Naive cosine retrieval in NumPy. Compare to BM25.
Part 1-Embedding intuition (45 min)¶
Read: Sentence-BERT paper (arxiv.org/abs/1908.10084), abstract + sections 1, 2, 3.
Key ideas:
- BERT alone gives token-level embeddings; SBERT pools to sentence-level via siamese fine-tuning.
- Mean-pooling over the last hidden state with attention masking → a fixed-size vector per text.
- Cosine similarity between embeddings reflects semantic similarity.
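To make the pooling step concrete, here is a minimal sketch of mean pooling with attention masking done by hand (an illustration, not part of the assigned reading). It uses all-MiniLM-L6-v2 because that model mean-pools by default; the example sentences are made up.
import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
mdl = AutoModel.from_pretrained(name)

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = mdl(**batch).last_hidden_state            # (batch, tokens, 384)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, dim=-1)   # unit length, so dot = cosine

a, b = embed(["How do I reset my password?", "Steps for account password recovery"])
print(float(a @ b))  # paraphrases land close in embedding space despite little term overlap
sentence-transformers wraps this same pooling for you when you call model.encode; the point is just to see that "a vector per text" is nothing more exotic than a masked average.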
Models to consider:
- all-MiniLM-L6-v2 - small, fast, decent (384-dim).
- BAAI/bge-small-en-v1.5 - better quality at the same size.
- BAAI/bge-large-en-v1.5 - best of the open free options at 1024-dim.
- text-embedding-3-large (OpenAI) - strong commercial choice.
For Session A, start with bge-small-en-v1.5.
Part 2-Embed corpus + queries (60 min)¶
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Embed all chunks (batched for speed)
chunk_texts = [c["text"] for c in chunks]
chunk_embeds = model.encode(chunk_texts, batch_size=32, show_progress_bar=True,
normalize_embeddings=True) # crucial for cosine via dot
# Embed queries
query_texts = [q["query"] for q in queries]
query_embeds = model.encode(query_texts, normalize_embeddings=True)
print(chunk_embeds.shape, query_embeds.shape)
Why normalize_embeddings=True? When vectors have unit length, dot product equals cosine similarity. Saves work and avoids subtle bugs.
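A quick numeric sanity check of that claim, if you want to see it (throwaway example vectors):
import numpy as np
a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(a_hat @ b_hat, cos))  # True: dot product of unit vectors == cosine similarity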
Part 3-NumPy nearest-neighbors (75 min)¶
def search_dense(query_idx, k=10):
q = query_embeds[query_idx]
scores = chunk_embeds @ q # cosine because pre-normalized
top = scores.argsort()[::-1][:k]
return [(chunks[i], float(scores[i])) for i in top]
Run for all queries. Compute NDCG@10 and MRR using your week-1 implementations.
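If digging the week-1 metric code back out is a hassle, here is a self-contained sketch with binary relevance. It assumes each query dict carries a hypothetical relevant_doc_ids set of gold document IDs; adapt the field name (and the doc-vs-chunk granularity) to however you actually stored your judgments.
# Sketch: NDCG@10 and MRR over all queries with binary relevance.
# "relevant_doc_ids" is an assumed field name -- use whatever your week-1 qrels look like.
import numpy as np

def dcg(gains):
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    gains = [1.0 if d in relevant_ids else 0.0 for d in retrieved_ids[:k]]
    ideal = [1.0] * min(len(relevant_ids), k)
    return dcg(gains) / dcg(ideal) if ideal else 0.0

def reciprocal_rank(retrieved_ids, relevant_ids):
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

ndcgs, rrs = [], []
for qi, q in enumerate(queries):
    ids = [c["doc_id"] for c, _ in search_dense(qi, k=10)]
    ndcgs.append(ndcg_at_k(ids, q["relevant_doc_ids"]))
    rrs.append(reciprocal_rank(ids, q["relevant_doc_ids"]))
print(f"dense  NDCG@10={np.mean(ndcgs):.3f}  MRR={np.mean(rrs):.3f}")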
Inspect failures. Find 5 queries where BM25 beat dense and 5 where dense beat BM25. Look at why. This is the empirical insight you want.
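One way to surface those queries is to score BM25 and dense side by side and sort by the difference. This sketch reuses ndcg_at_k from above and assumes a week-1 BM25 search function; the name search_bm25 and its signature are placeholders for whatever you actually called it.
# Sketch: rank queries by (dense NDCG@10 - BM25 NDCG@10) to find where each method wins.
deltas = []
for qi, q in enumerate(queries):
    dense_ids = [c["doc_id"] for c, _ in search_dense(qi, k=10)]
    bm25_ids = [c["doc_id"] for c, _ in search_bm25(q["query"], k=10)]  # placeholder name
    rel = q["relevant_doc_ids"]
    deltas.append((ndcg_at_k(dense_ids, rel) - ndcg_at_k(bm25_ids, rel), q["query"]))
deltas.sort()
print("BM25 wins:", deltas[:5])    # most negative deltas
print("Dense wins:", deltas[-5:])  # most positive deltas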
Output of Session A¶
- Embedding pipeline working.
- NumPy dense retrieval evaluated.
- Failure-mode comparison BM25 vs dense.
Session B-Vector database integration¶
Goal: Move from NumPy to a real vector DB. Verify retrieval results match. Benchmark latency.
Part 1-Pick a vector DB + setup (45 min)¶
pgvector (Postgres extension): - Pros: leverages Postgres infra you may already have; SQL queries. - Cons: less fancy for hybrid search out of the box.
Qdrant (purpose-built): - Pros: built for vector search; great hybrid, filters, scaling. - Cons: another service.
Recommended for you (SRE background): pgvector. Postgres familiarity means less novelty.
# Run pgvector via Docker
docker run -d --name pgv -p 5432:5432 \
-e POSTGRES_PASSWORD=pw \
ankane/pgvector
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
id SERIAL PRIMARY KEY,
doc_id TEXT,
chunk_idx INT,
text TEXT,
embedding vector(384) -- match your model's dim
);
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
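If you would rather script the setup than type SQL into psql, the same DDL can be run from Python. This is a sketch: the connection string matches the Docker command above, and Part 2 below simply opens the same connection again.
import psycopg2

# Same credentials as the Docker container above.
conn = psycopg2.connect("dbname=postgres user=postgres password=pw host=localhost")
conn.autocommit = True  # let DDL take effect immediately
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id SERIAL PRIMARY KEY,
        doc_id TEXT,
        chunk_idx INT,
        text TEXT,
        embedding vector(384)  -- match your model's dim
    );
""")
cur.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx ON chunks "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);"
)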
Part 2-Ingest + query (90 min)¶
import psycopg2
conn = psycopg2.connect("dbname=postgres user=postgres password=pw host=localhost")
cur = conn.cursor()
for c, e in zip(chunks, chunk_embeds):
cur.execute(
"INSERT INTO chunks (doc_id, chunk_idx, text, embedding) VALUES (%s, %s, %s, %s)",
(c["doc_id"], c["chunk_idx"], c["text"], e.tolist()),
)
conn.commit()
def search_pgvector(query_embed, k=10):
cur.execute(
"SELECT doc_id, chunk_idx, text, 1 - (embedding <=> %s::vector) AS score "
"FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
(query_embed.tolist(), query_embed.tolist(), k),
)
return cur.fetchall()
Verify parity. For 5 queries, compare pgvector results to NumPy results. The top-10 should match; if they don't, remember ivfflat is an approximate index: raise ivfflat.probes (or drop the index) while you check.
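A minimal parity-check sketch (the probes value of 20 is arbitrary; anything well above the default of 1 is plenty for a check this small):
# Compare pgvector's top-10 against the NumPy search for a handful of queries.
cur.execute("SET ivfflat.probes = 20;")  # ivfflat is approximate; more probes -> closer to exact
for qi in range(5):
    numpy_top = [(c["doc_id"], c["chunk_idx"]) for c, _ in search_dense(qi, k=10)]
    pg_top = [(row[0], row[1]) for row in search_pgvector(query_embeds[qi], k=10)]
    print(f"query {qi}: identical = {numpy_top == pg_top}")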
Part 3-Latency benchmark (60 min)¶
import time
times = []
for q in queries:
qe = model.encode(q["query"], normalize_embeddings=True)
start = time.perf_counter()
_ = search_pgvector(qe, k=10)
times.append((time.perf_counter() - start) * 1000)
print(f"p50: {np.percentile(times, 50):.1f} ms")
print(f"p95: {np.percentile(times, 95):.1f} ms")
Expected: <20 ms p50 on a 1,000-chunk corpus. At this size even an unindexed sequential scan is only a few milliseconds; the ivfflat index is what keeps latency flat as the corpus grows into the tens of thousands of chunks.
Output of Session B¶
- pgvector running with corpus indexed.
- Parity check vs NumPy.
- Latency benchmark.
Session C-Two embedding models compared + write up¶
Goal: Re-embed with a stronger model. Compare. Document the cost-quality-latency tradeoffs.
Part 1-Embed with bge-large-en-v1.5 (75 min)¶
model_large = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunk_embeds_large = model_large.encode(chunk_texts, batch_size=8,
show_progress_bar=True,
normalize_embeddings=True)
# 1024-dim, slower to embed
Re-evaluate on the same queries. Likely lift: +0.05 NDCG@10.
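A sketch of that re-evaluation (NumPy is enough at this stage; note the pgvector column was declared vector(384), so the large model would need its own table or column if you want it in the database too):
# Embed queries with the large model and re-run the same NumPy search + metrics.
query_embeds_large = model_large.encode(query_texts, normalize_embeddings=True)

def search_dense_large(query_idx, k=10):
    scores = chunk_embeds_large @ query_embeds_large[query_idx]  # cosine via dot (pre-normalized)
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

# Reuse the ndcg_at_k / reciprocal_rank helpers from Session A to fill in the comparison table.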
Part 2-Cost-quality-latency analysis (60 min)¶
Build a comparison table:
| Method | NDCG@10 | MRR | Embed time/chunk | Index size | Search p50 |
|---|---|---|---|---|---|
| BM25 | 0.612 | 0.534 | 0 | small | 5 ms |
| Dense MiniLM | 0.687 | 0.604 | 4 ms | 384 × N | 8 ms |
| Dense BGE-large | 0.732 | 0.661 | 18 ms | 1024 × N | 14 ms |
Decision matrix:
- For a small (<10K chunks) corpus with high quality requirements: BGE-large.
- For a large corpus where re-embedding is expensive: MiniLM, then upgrade later.
- Always keep BM25 around for hybrid (next week).
Part 3-README + push (45 min)¶
Update README with the comparison table. Push v0.4.0.
Update LEARNING_LOG.md: "Embeddings are not magic-picking a strong model gives a real but bounded lift; the bigger lift is in reranking, which is next week."
Output of Session C¶
- Two-embedding comparison table.
- README updated.
- v0.4.0 tagged.
End-of-week artifact¶
- pgvector (or Qdrant) running with corpus indexed
- Dense retrieval with two embedding models, both evaluated
- Comparison table in README (BM25 vs dense small vs dense large)
- Latency benchmarks per method
End-of-week self-assessment¶
- I can stand up a vector DB from a clean machine in <1 hour.
- I can articulate when to pick BGE-large vs MiniLM.
- I have measured baselines on my own corpus, not folklore numbers.
Common failure modes for this week¶
- Forgetting normalize_embeddings=True for cosine via dot.
- Trying every embedding model before measuring. Pick two, compare carefully.
- No latency benchmark. Production tradeoffs are inseparable from latency.
What's next (preview of M05-W03)¶
Hybrid retrieval (BM25 + dense via Reciprocal Rank Fusion) and reranking. Plus Anthropic Contextual Retrieval. The full modern RAG stack.