Month 5-Week 2: Dense retrieval, embeddings, vector databases

Week summary

  • Goal: Add dense (embedding-based) retrieval. Stand up a real vector DB (pgvector or Qdrant). Compare two embedding models. Quantify dense-vs-BM25 quality and latency.
  • Time: ~9 h over 3 sessions.
  • Output: Vector DB running locally; dense retrieval evaluated with NDCG and MRR; comparison table in README.
  • Sequences relied on: 10-retrieval-and-rag rungs 03, 04; 01-linear-algebra rung 09.

Why this week matters

Dense retrieval handles paraphrase and synonymy that BM25 misses. But dense isn't always better: BM25 still wins on rare terms and exact-match queries. Knowing when each wins on your specific corpus is the kind of empirical literacy senior RAG engineers have. This week measures that explicitly.

Standing up a vector DB also moves you from "toy NumPy retrieval" to "production-grade infra." Choosing among pgvector, Qdrant, and Weaviate is a decision teams make daily; standing one up yourself means you can speak to the tradeoffs of all three.

Prerequisites

  • M05-W01 complete (BM25 baseline, eval queries).

Session plan

  • Session A-Tue/Wed evening (~3 h): embeddings + naive dense retrieval
  • Session B-Sat morning (~3.5 h): vector DB integration
  • Session C-Sun afternoon (~2.5 h): two embedding models compared + write-up

Session A-Embeddings + naive dense retrieval

Goal: Embed corpus and queries with sentence-transformers. Naive cosine retrieval in NumPy. Compare to BM25.

Part 1-Embedding intuition (45 min)

Read: Sentence-BERT paper (arxiv.org/abs/1908.10084), abstract + sections 1, 2, 3.

Key ideas:

  • BERT alone gives token-level embeddings; SBERT pools to sentence level via siamese fine-tuning.
  • Mean-pooling over the last hidden state with attention masking → a fixed-size vector per text.
  • Cosine similarity between embeddings reflects semantic similarity.
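A minimal NumPy sketch of that masked mean-pooling step (sentence-transformers does this internally; this standalone function is only for intuition):

import numpy as np

def mean_pool(token_embeds, attention_mask):
    # token_embeds: (seq_len, hidden_dim) from the last hidden state
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding
    mask = attention_mask[:, None].astype(token_embeds.dtype)
    summed = (token_embeds * mask).sum(axis=0)
    count = np.maximum(mask.sum(), 1e-9)  # avoid division by zero
    return summed / count                 # one fixed-size vector per text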

Models to consider:

  • all-MiniLM-L6-v2 - small, fast, decent (384-dim).
  • BAAI/bge-small-en-v1.5 - better quality at the same size (384-dim).
  • BAAI/bge-large-en-v1.5 - best of the open free options at 1024-dim.
  • text-embedding-3-large (OpenAI) - strong commercial choice.

For Session A, start with bge-small-en-v1.5.

Part 2-Embed corpus + queries (60 min)

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Embed all chunks (batched for speed)
chunk_texts = [c["text"] for c in chunks]
chunk_embeds = model.encode(chunk_texts, batch_size=32, show_progress_bar=True,
                            normalize_embeddings=True)  # crucial for cosine via dot

# Embed queries
query_texts = [q["query"] for q in queries]
query_embeds = model.encode(query_texts, normalize_embeddings=True)

print(chunk_embeds.shape, query_embeds.shape)

Why normalize_embeddings=True? When vectors have unit length, dot product equals cosine similarity. Saves work and avoids subtle bugs.
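A quick sanity check of that claim, using the embeddings from above:

q, d = query_embeds[0], chunk_embeds[0]
cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
print(np.isclose(q @ d, cos))  # True: dot product equals cosine once both vectors are unit-length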

Part 3-NumPy nearest-neighbors (75 min)

def search_dense(query_idx, k=10):
    q = query_embeds[query_idx]
    scores = chunk_embeds @ q  # cosine because pre-normalized
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

Run for all queries. Compute NDCG@10 and MRR using your week-1 implementations.
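A minimal sketch of that loop, assuming your week-1 helpers are called ndcg_at_k and reciprocal_rank and that each query carries a set of relevant (doc_id, chunk_idx) pairs under "relevant" (these names are placeholders for whatever you built in week 1):

ndcgs, rrs = [], []
for i, q in enumerate(queries):
    ranked = [(c["doc_id"], c["chunk_idx"]) for c, _ in search_dense(i, k=10)]
    ndcgs.append(ndcg_at_k(ranked, q["relevant"], k=10))
    rrs.append(reciprocal_rank(ranked, q["relevant"]))
print(f"NDCG@10 = {np.mean(ndcgs):.3f}   MRR = {np.mean(rrs):.3f}")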

Dense (bge-small-en-v1.5):
  NDCG@10 = 0.687 (BM25 was 0.612)
  MRR     = 0.604 (BM25 was 0.534)

Inspect failures. Find 5 queries where BM25 beat dense and 5 where dense beat BM25. Look at why. This is the empirical insight you want.
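One way to surface those cases is to diff per-query scores, assuming you kept per-query NDCG@10 lists for both methods (ndcg_bm25 and ndcg_dense below are hypothetical names, aligned with queries):

deltas = sorted(
    ((ndcg_dense[i] - ndcg_bm25[i], q["query"]) for i, q in enumerate(queries)),
    key=lambda x: x[0],
)
print("BM25 wins:", deltas[:5])     # most negative deltas: dense underperforms
print("dense wins:", deltas[-5:])   # most positive deltas: dense outperforms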

Output of Session A

  • Embedding pipeline working.
  • NumPy dense retrieval evaluated.
  • Failure-mode comparison BM25 vs dense.

Session B-Vector database integration

Goal: Move from NumPy to a real vector DB. Verify retrieval results match. Benchmark latency.

Part 1-Pick a vector DB + setup (45 min)

pgvector (Postgres extension):

  • Pros: leverages Postgres infra you may already have; SQL queries.
  • Cons: less fancy for hybrid search out of the box.

Qdrant (purpose-built):

  • Pros: built for vector search; great hybrid search, filters, scaling.
  • Cons: another service to run.

Recommended for you (SRE background): pgvector. Postgres familiarity means less novelty.

# Run pgvector via Docker
docker run -d --name pgv -p 5432:5432 \
  -e POSTGRES_PASSWORD=pw \
  ankane/pgvector

Then connect (e.g. psql -h localhost -U postgres) and create the schema:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    doc_id TEXT,
    chunk_idx INT,
    text TEXT,
    embedding vector(384)  -- match your model's dim
);
-- ivfflat builds its cluster lists from existing rows, so create (or rebuild)
-- this index after loading the data.
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Part 2-Ingest + query (90 min)

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres password=pw host=localhost")
cur = conn.cursor()

# psycopg2 adapts the Python list to a Postgres array, which pgvector can cast to
# vector. If your setup rejects that, cast explicitly (%s::vector) or use
# register_vector() from the pgvector Python package to pass NumPy arrays directly.
for c, e in zip(chunks, chunk_embeds):
    cur.execute(
        "INSERT INTO chunks (doc_id, chunk_idx, text, embedding) VALUES (%s, %s, %s, %s)",
        (c["doc_id"], c["chunk_idx"], c["text"], e.tolist()),
    )
conn.commit()

def search_pgvector(query_embed, k=10):
    # <=> is cosine distance, so 1 - distance is cosine similarity.
    cur.execute(
        "SELECT doc_id, chunk_idx, text, 1 - (embedding <=> %s::vector) AS score "
        "FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (query_embed.tolist(), query_embed.tolist(), k),
    )
    return cur.fetchall()

Verify parity. For 5 queries, compare pgvector results to NumPy results. The top-10 should match; note that ivfflat is an approximate index, so if you see small differences, raise ivfflat.probes (or drop the index) to confirm exact parity.
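A minimal parity check reusing search_dense and search_pgvector from above, comparing chunks by their (doc_id, chunk_idx) pair (adjust to however your chunks are keyed):

cur.execute("SET ivfflat.probes = 100")  # probe all 100 lists so the comparison is effectively exact
for i in range(5):
    numpy_top = {(c["doc_id"], c["chunk_idx"]) for c, _ in search_dense(i, k=10)}
    pg_top = {(row[0], row[1]) for row in search_pgvector(query_embeds[i], k=10)}
    print(i, "overlap:", len(numpy_top & pg_top), "/ 10")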

Part 3-Latency benchmark (60 min)

import time
times = []
for q in queries:
    qe = model.encode(q["query"], normalize_embeddings=True)
    start = time.perf_counter()
    _ = search_pgvector(qe, k=10)
    times.append((time.perf_counter() - start) * 1000)

print(f"p50: {np.percentile(times, 50):.1f} ms")
print(f"p95: {np.percentile(times, 95):.1f} ms")

Expected: <20 ms p50 on a 1,000-chunk corpus with an ivfflat index; 50–200 ms without one.

Output of Session B

  • pgvector running with corpus indexed.
  • Parity check vs NumPy.
  • Latency benchmark.

Session C-Two embedding models compared + write up

Goal: Re-embed with a stronger model. Compare. Document the cost-quality-latency tradeoffs.

Part 1-Embed with bge-large-en-v1.5 (75 min)

model_large = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunk_embeds_large = model_large.encode(chunk_texts, batch_size=8,
                                        show_progress_bar=True,
                                        normalize_embeddings=True)
# 1024-dim, slower to embed

Re-evaluate on the same queries. Likely lift: +0.05 NDCG@10.
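Note that the pgvector table was created with vector(384), so the 1024-dim embeddings won't fit that column; for this quality comparison you can score them directly in NumPy (or create a second table with vector(1024)). A minimal sketch of the NumPy route:

query_embeds_large = model_large.encode(query_texts, normalize_embeddings=True)
scores_large = query_embeds_large @ chunk_embeds_large.T  # (num_queries, num_chunks) cosine matrix
top10_large = np.argsort(-scores_large, axis=1)[:, :10]   # top-10 chunk indices per query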

Dense small: NDCG@10 = 0.687
Dense large: NDCG@10 = 0.732   (+0.045)

Part 2-Cost-quality-latency analysis (60 min)

Build a comparison table:

Method             NDCG@10   MRR     Embed time/chunk   Index size   Search p50
BM25               0.612     0.534   0 ms               small        5 ms
Dense BGE-small    0.687     0.604   4 ms               384 × N      8 ms
Dense BGE-large    0.732     0.661   18 ms              1024 × N     14 ms

Decision matrix:

  • For a small corpus (<10K chunks) with high quality requirements: BGE-large.
  • For a large corpus where re-embedding is expensive: a small model (BGE-small or MiniLM), then upgrade later.
  • Always keep BM25 around for hybrid retrieval (next week).

Part 3-README + push (45 min)

Update README with the comparison table. Push v0.4.0.

Update LEARNING_LOG.md: "Embeddings are not magic: picking a strong model gives a real but bounded lift; the bigger lift comes from reranking, which is next week."

Output of Session C

  • Two-embedding comparison table.
  • README updated.
  • v0.4.0 tagged.

End-of-week artifact

  • pgvector (or Qdrant) running with corpus indexed
  • Dense retrieval with two embedding models, both evaluated
  • Comparison table in README (BM25 vs dense small vs dense large)
  • Latency benchmarks per method

End-of-week self-assessment

  • I can stand up a vector DB on a clean machine in under an hour.
  • I can articulate when to pick BGE-large vs MiniLM.
  • I have measured baselines on my own corpus, not folklore numbers.

Common failure modes for this week

  • Forgetting normalize_embeddings=True for cosine via dot.
  • Trying every embedding model before measuring. Pick two, compare carefully.
  • No latency benchmark. Production tradeoffs are inseparable from latency.

What's next (preview of M05-W03)

Hybrid retrieval (BM25 + dense via Reciprocal Rank Fusion) and reranking. Plus Anthropic Contextual Retrieval. The full modern RAG stack.
