Building a RAG System
Overview
This document walks through building a complete RAG system end-to-end: document ingestion, embedding generation, vector storage, retrieval, prompt construction, and generation. It also covers evaluation, common failure modes, and how to debug them.
A working RAG system is not complicated. The first version can be built in a day. Making it reliable, accurate, and fast enough for production takes iteration guided by evaluation.
Architecture
Offline Pipeline (runs on document changes):
    Documents (PDF, Markdown, HTML, etc.)
      ↓
    Parser → clean text
      ↓
    Chunker → text segments with metadata
      ↓
    Embedding model → vectors
      ↓
    Vector store (Pinecone, Weaviate, pgvector, etc.)
Online Pipeline (runs on every query):
    User question
      ↓
    Embedding model → query vector
      ↓
    Vector store → top K relevant chunks
      ↓
    Prompt builder → question + context
      ↓
    LLM → answer
      ↓
    Response to user
Step 1: Document Ingestion
from pathlib import Path

def ingest_documents(source_dir: str) -> list[dict]:
    """Load and parse documents from a directory."""
    documents = []
    supported_extensions = {".md", ".txt", ".pdf", ".html", ".docx"}
    for file_path in Path(source_dir).rglob("*"):
        # Skip directories and unsupported file types
        if not file_path.is_file() or file_path.suffix.lower() not in supported_extensions:
            continue
        # parse_document is a format-specific parser that you supply; it is not defined here
        text = parse_document(str(file_path))
        if len(text.strip()) < 50:  # Skip near-empty documents
            continue
        stat = file_path.stat()
        documents.append({
            "id": str(file_path),
            "text": text,
            "metadata": {
                "filename": file_path.name,
                "path": str(file_path),
                "extension": file_path.suffix,
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
            }
        })
    return documents
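The ingestion function calls parse_document, which this document does not define. The following is a minimal sketch of what it could look like, assuming Markdown and plain-text files are read as UTF-8 and HTML is reduced to visible text with the standard library. The `_TextExtractor` class is a hypothetical helper, and PDF and DOCX are deliberately left unimplemented because they need a third-party parser of your choice.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def parse_document(path: str) -> str:
    """Return plain text for .md, .txt, and .html files (assumed UTF-8)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".md", ".txt"):
        return Path(path).read_text(encoding="utf-8", errors="replace")
    if suffix == ".html":
        extractor = _TextExtractor()
        extractor.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
        extractor.close()
        return extractor.text()
    # .pdf and .docx need a dedicated library; fail loudly rather than guess
    raise NotImplementedError(f"No parser configured for {suffix!r}")
```

Because this sketch raises on PDF and DOCX, either plug in a real parser for those formats or remove them from supported_extensions until you do.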
Step 2: Chunking
Try semantic chunking first (split on headers/paragraphs), fall back to recursive fixed-size chunking (500 tokens, 100 overlap) for unstructured text. Preserve metadata lineage: each chunk carries its source document ID, chunk index, and original document metadata.
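The chunking strategy described here can be sketched roughly as follows. This is an illustration under stated assumptions, not a production chunker: paragraphs (split on blank lines) stand in for semantic units, whitespace-separated words stand in for tokens, and the names `split_fixed`, `chunk_document`, `source_id`, and `chunk_index` are hypothetical. A real pipeline would count tokens with the embedding model's tokenizer and would probably merge very short paragraphs.

```python
def split_fixed(words: list[str], size: int = 500, overlap: int = 100) -> list[str]:
    """Sliding window over words; consecutive windows share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last window already reaches the end of the text
    return windows


def chunk_document(doc: dict, size: int = 500, overlap: int = 100) -> list[dict]:
    """Split on blank lines first; window any paragraph longer than `size` words."""
    pieces: list[str] = []
    for paragraph in doc["text"].split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= size:
            pieces.append(paragraph.strip())
        else:
            pieces.extend(split_fixed(words, size, overlap))
    # Each chunk keeps the document metadata plus its own lineage fields
    return [
        {
            "id": f"{doc['id']}#chunk-{i}",
            "text": piece,
            "metadata": {**doc["metadata"], "source_id": doc["id"], "chunk_index": i},
        }
        for i, piece in enumerate(pieces)
    ]
```

The output has the same "text" and "metadata" keys that the embedding step reads, so the chunks can be passed on directly.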
Step 3: Embedding Generation
from openai import OpenAI

client = OpenAI()

def generate_embeddings(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """Generate embeddings for all chunks in batches."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk["text"] for chunk in batch]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        # Attach each embedding to its chunk in place
        for j, embedding_data in enumerate(response.data):
            batch[j]["embedding"] = embedding_data.embedding
    return chunks
Use text-embedding-3-small (1536 dimensions, $0.02/1M tokens) as the default. Upgrade to text-embedding-3-large only if evaluation shows a quality gap. Self-host BGE or E5 when data cannot leave your network.
Step 4: Vector Storage
Choosing a Vector Store
Vector Store  Type                  Best for                     Notes
─────────────────────────────────────────────────────────────────────────────────────
Pinecone      Managed SaaS          Production, minimal ops      Free tier, scales well
Weaviate      Self-hosted/Cloud     Hybrid search built-in       Flexible, feature-rich
pgvector      PostgreSQL extension  Already using Postgres       No new infrastructure
Chroma        Embedded              Prototyping, small datasets  SQLite-based, simple
Qdrant        Self-hosted/Cloud     High performance, filtering  Rust-based, fast
Milvus        Self-hosted           Very large scale (billions)  Complex to operate
For most teams, the decision is:
- Prototyping: Chroma (in-memory, zero setup)
- Already using Postgres: pgvector (no new infrastructure)
- Production, want managed: Pinecone or Weaviate Cloud
- Production, self-hosted: Qdrant or Weaviate
Example: pgvector
import psycopg2

def setup_pgvector(conn):
    """Set up pgvector table for RAG chunks."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id TEXT PRIMARY KEY,
                text TEXT NOT NULL,
                embedding vector(1536),  -- must match the embedding model's dimensions
                metadata JSONB,
                created_at TIMESTAMP DEFAULT NOW()
            )
        """)
        # ivfflat builds its clusters from existing rows, so for best recall
        # create (or rebuild) this index after the initial bulk load.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS chunks_embedding_idx
            ON chunks USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)
    conn.commit()

def search_chunks(conn, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Search for similar chunks using cosine distance."""
    with conn.cursor() as cur:
        # <=> is cosine distance, so 1 - distance is cosine similarity
        cur.execute("""
            SELECT id, text, metadata, 1 - (embedding <=> %s::vector) as similarity
            FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [
            {"id": row[0], "text": row[1], "metadata": row[2], "similarity": row[3]}
            for row in cur.fetchall()
        ]
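The pgvector example covers table setup and search but not writing chunks into the table. A minimal sketch of that missing step might look like the following, assuming the chunks table defined in setup_pgvector and chunk dicts with "id", "text", "embedding", and "metadata" keys. The function names `to_vector_literal` and `upsert_chunks` are hypothetical; the upsert uses PostgreSQL's ON CONFLICT clause so that re-indexing a document overwrites its old rows.

```python
import json


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def upsert_chunks(conn, chunks: list[dict]) -> int:
    """Insert chunks, overwriting rows that share an id. Returns rows sent."""
    rows = [
        (c["id"], c["text"], to_vector_literal(c["embedding"]), json.dumps(c["metadata"]))
        for c in chunks
    ]
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO chunks (id, text, embedding, metadata)
            VALUES (%s, %s, %s::vector, %s::jsonb)
            ON CONFLICT (id) DO UPDATE
            SET text = EXCLUDED.text,
                embedding = EXCLUDED.embedding,
                metadata = EXCLUDED.metadata
            """,
            rows,
        )
    conn.commit()
    return len(rows)
```

For large loads, executemany sends one statement per row, so a bulk-loading approach is likely to be faster; this version favors readability.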
Step 5: Retrieval, Prompt Construction & Generation
def rag_query(question: str, conn, top_k: int = 5) -> dict:
    """End-to-end RAG pipeline: retrieve and generate."""
    # 1. Embed the question
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Retrieve relevant chunks
    chunks = search_chunks(conn, query_embedding, top_k=top_k)

    # 3. Check retrieval quality (threshold applies to the best match only)
    if not chunks or chunks[0]["similarity"] < 0.5:
        return {"answer": "I could not find relevant information.", "sources": [], "confidence": "low"}

    # 4. Build prompt with source attribution
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("filename", "Unknown")
        context_parts.append(f"[Source {i}: {source}]\n{chunk['text']}")
    context = "\n\n---\n\n".join(context_parts)

    # 5. Generate answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context. Cite sources using [Source N]. If the context lacks the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\n---\nQuestion: {question}"}
        ],
        temperature=0, max_tokens=500
    )
    return {
        "answer": response.choices[0].message.content,
        # Use .get here too, so a chunk without a filename does not raise KeyError
        "sources": [{"file": c["metadata"].get("filename", "Unknown"), "similarity": c["similarity"]} for c in chunks],
        "confidence": "high" if chunks[0]["similarity"] > 0.75 else "medium"
    }
Evaluation
You cannot improve what you do not measure. RAG evaluation has two dimensions: retrieval quality and generation quality.
Key Metrics
Retrieval metrics:
- Recall@K: the fraction of relevant documents that appear in the top K results.
- MRR (mean reciprocal rank): how high the first relevant result is ranked, averaged over all test queries.
Build a test set of 50+ question/relevant-document pairs. Measure recall@5 and recall@10. Target recall@5 > 0.8.
Generation metrics:
- LLM-as-judge: have GPT-4o score answers 1-5 against expected answers. Track the average score and the distribution across categories.
- Human evaluation: alternatively, review a sample of 50-100 queries by hand.
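The retrieval metrics above are simple to compute once a test set exists. The sketch below assumes each test case is a dict with a "question" and a list of "relevant_ids", and that `retrieve` is any callable returning ranked document or chunk IDs for a question. Those names and that test-set shape are illustrative assumptions, not something this guide prescribes.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(test_set: list[dict], retrieve, k: int = 5) -> dict:
    """Average recall@k and MRR over a test set."""
    if not test_set:
        return {f"recall@{k}": 0.0, "mrr": 0.0}
    recalls, reciprocal_ranks = [], []
    for case in test_set:
        retrieved = retrieve(case["question"])
        relevant = set(case["relevant_ids"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        reciprocal_ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(test_set)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```

Running this on every change to chunking, embedding model, or index settings shows whether retrieval improved before any generation is involved.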
Common Failure Modes & Debugging
The Model Says "I Don't Know" When the Answer Exists
Diagnosis: Retrieval failure. The right chunks are not being found.
Debug steps:
1. Run the query embedding against your vector store manually.
2. Look at the top 20 results. Is the right chunk in there at all?
   If NO: The chunk doesn't exist or its embedding is too far from the query.
     → Check chunking: is the relevant text in its own chunk?
     → Try the query with keyword search. Does it find the right doc?
     → Re-embed with a different model.
   If YES but low rank: The embedding model ranks it below irrelevant chunks.
     → Add metadata filtering to narrow the search space.
     → Use hybrid search (keyword + semantic).
     → Rewrite the query before embedding it, for example by expanding abbreviations or rephrasing it as a statement.
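Hybrid search, suggested in the steps above, needs some way to merge a keyword ranking with a semantic ranking. One common option is reciprocal rank fusion (RRF), which this guide does not prescribe; it is shown here as one possible merging strategy. The sketch assumes each retriever returns a list of chunk IDs ordered best-first, and uses the conventional smoothing constant k = 60.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items found by several lists add up
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["a", "b", "c"]   # hypothetical vector-search result IDs
keyword_hits = ["c", "a", "d"]    # hypothetical keyword-search result IDs
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF only uses rank positions, so it avoids having to normalize keyword scores and cosine similarities onto a common scale.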
The Model Gives a Wrong Answer Confidently
Diagnosis: Either wrong chunks retrieved or model hallucination.
Debug steps:
1. Look at which chunks were retrieved.
   If wrong chunks: Retrieval issue.
     → The question is ambiguous. Multiple topics match.
     → Add metadata filtering to scope the search.
     → Increase top_k and let the model sort through more context.
   If right chunks but wrong answer: Generation issue.
     → The system prompt is not strict enough about staying grounded.
     → The model is synthesizing across chunks incorrectly.
     → Add "Quote the exact text that supports your answer" to the prompt.
The Answer Is Correct But Misses Information
Diagnosis: not enough chunks retrieved, or relevant info split across chunk boundaries. Increase top_k, reduce chunk size so more chunks fit in context, or add overlap to prevent information loss at boundaries.
Latency Is Too High
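Typical breakdown per query: embedding the query (100-200ms), vector search (10-50ms), LLM generation (500-3000ms). Generation dominates, so the biggest gains usually come from using a faster model (GPT-4o-mini), reducing top_k to shorten the prompt, or caching frequent queries. Streaming the response does not shorten total time, but it reduces the latency the user perceives.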
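Caching frequent queries can be as simple as an in-memory map in front of the pipeline. The following is a sketch under assumptions: `QueryCache` is a hypothetical class, questions are matched after lowercasing and whitespace normalization only (not semantically), entries expire after a fixed TTL, and the least recently used entry is evicted when the cache is full. A multi-process deployment would need a shared cache instead.

```python
import time
from collections import OrderedDict
from typing import Optional


class QueryCache:
    """Small in-memory cache with TTL expiry and least-recently-used eviction."""

    def __init__(self, ttl_seconds: float = 300.0, max_entries: int = 1000) -> None:
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> (stored_at, answer)

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivial variants share an entry
        return " ".join(question.lower().split())

    def get(self, question: str) -> Optional[dict]:
        key = self._key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            del self._entries[key]  # Expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # Mark as recently used
        return answer

    def put(self, question: str, answer: dict) -> None:
        key = self._key(question)
        self._entries[key] = (time.monotonic(), answer)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # Evict the least recently used
```

A caller would check `cache.get(question)` before running the pipeline and call `cache.put(question, result)` afterwards. Keep the TTL short if the knowledge base changes often, since cached answers do not see re-indexed documents.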
Common Pitfalls
- Shipping without evaluation: "It seems to work" is not enough. Build a test set of at least 50 question-answer pairs. Run automated evals on every change to chunking, prompts, or models.
- Not inspecting retrieved chunks: When debugging, always look at what was actually retrieved. Most RAG failures are retrieval failures, not generation failures.
- Overly large chunks: Using 2000-token chunks means fewer chunks fit in context. If the relevant information is 50 tokens within a 2000-token chunk, 97.5% of that context is noise.
- No similarity threshold: Returning the top 5 results even when none are relevant leads to hallucination. Set a minimum similarity threshold and return "I don't know" when nothing qualifies.
- Ignoring document freshness: If your knowledge base has versioned documents, old versions can poison results. Delete or demote outdated chunks when new versions are indexed.
- Building before measuring: Set up evaluation before building the full pipeline. Otherwise you have no way to know if your changes are improvements.
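The similarity-threshold pitfall in the list above can be handled per chunk rather than only for the best match. A minimal sketch, assuming chunks shaped like the search results in this guide (dicts with a "similarity" key) and reusing 0.5 as a placeholder threshold; the function name is hypothetical, and the right cutoff depends on the embedding model, so tune it against your test set.

```python
def filter_by_similarity(chunks: list[dict], min_similarity: float = 0.5) -> list[dict]:
    """Keep only chunks at or above the threshold, preserving their order."""
    return [c for c in chunks if c["similarity"] >= min_similarity]


retrieved = [
    {"id": "a", "similarity": 0.82},
    {"id": "b", "similarity": 0.51},
    {"id": "c", "similarity": 0.31},  # Below threshold: would only add noise
]
usable = filter_by_similarity(retrieved)
if not usable:
    answer = "I don't know."  # Nothing qualifies, so skip generation entirely
```

Filtering every chunk keeps weak matches out of the prompt even when the top result is strong.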
Key Takeaways
- A complete RAG system has two pipelines: offline (ingest, chunk, embed, store) and online (embed query, retrieve, prompt, generate). Both need attention.
- Evaluate retrieval and generation separately. Most quality problems are retrieval problems. If the right chunks are not found, the generation cannot be correct.
- Start simple: Chroma or pgvector, fixed-size chunks, text-embedding-3-small, and GPT-4o. Optimize only when evaluation shows where the bottleneck is.
- Set a similarity threshold and implement "I don't know" responses. A system that admits uncertainty is more trustworthy than one that confidently hallucinates.
- Debug systematically: inspect retrieved chunks first, then check if the generation prompt is clear, then consider model quality. Most issues live in retrieval or chunking.