Building a RAG System

Overview

This document walks through building a complete retrieval-augmented generation (RAG) system end-to-end: document ingestion, embedding generation, vector storage, retrieval, prompt construction, and generation. It also covers evaluation, common failure modes, and how to debug them.

A working RAG system is not complicated. The first version can be built in a day. Making it reliable, accurate, and fast enough for production takes iteration guided by evaluation.

Architecture

Offline Pipeline (runs on document changes):

  Documents (PDF, Markdown, HTML, etc.)
    ↓
  Parser → clean text
    ↓
  Chunker → text segments with metadata
    ↓
  Embedding model → vectors
    ↓
  Vector store (Pinecone, Weaviate, pgvector, etc.)

Online Pipeline (runs on every query):

  User question
    ↓
  Embedding model → query vector
    ↓
  Vector store → top K relevant chunks
    ↓
  Prompt builder → question + context
    ↓
  LLM → answer
    ↓
  Response to user

Step 1: Document Ingestion

from pathlib import Path

def ingest_documents(source_dir: str) -> list[dict]:
    """Load and parse documents from a directory."""
    documents = []
    supported_extensions = {".md", ".txt", ".pdf", ".html", ".docx"}
    
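    for file_path in Path(source_dir).rglob("*"):
        # Skip directories and unsupported file types
        if not file_path.is_file() or file_path.suffix.lower() not in supported_extensions:
            continue
        
        # parse_document(path) -> str is assumed to be defined elsewhere:
        # a format-aware parser that returns the file's plain text.
        text = parse_document(str(file_path))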
        
        if len(text.strip()) < 50:  # Skip near-empty documents
            continue
        
        documents.append({
            "id": str(file_path),
            "text": text,
            "metadata": {
                "filename": file_path.name,
                "path": str(file_path),
                "extension": file_path.suffix,
                "size_bytes": file_path.stat().st_size,
                "modified": file_path.stat().st_mtime,
            }
        })
    
    return documents
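
The listing above calls parse_document without defining it. As a minimal sketch of what such a helper could look like, the following handles only Markdown, plain text, and HTML using the standard library. The class name _TextExtractor and the choice to raise NotImplementedError for PDF and DOCX are assumptions made for illustration; those formats need a third-party parser that is out of scope here.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def parse_document(path: str) -> str:
    """Return plain text for .md, .txt, and .html files (illustrative helper)."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix in {".md", ".txt"}:
        return file_path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".html":
        extractor = _TextExtractor()
        extractor.feed(file_path.read_text(encoding="utf-8", errors="replace"))
        extractor.close()
        return "\n".join(extractor.parts)
    # PDF and DOCX need a third-party parser; deliberately left out of this sketch.
    raise NotImplementedError(f"No parser configured for {suffix!r}")
```

With this version, ingest_documents would raise on the first .pdf or .docx file it meets, so either narrow supported_extensions or plug in real parsers for those formats before running it on a mixed corpus.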

Step 2: Chunking

Try semantic chunking first (split on headers and paragraphs), and fall back to recursive fixed-size chunking (500 tokens with a 100-token overlap) for unstructured text. Preserve metadata lineage: each chunk carries its source document ID, its chunk index, and the original document metadata.
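
The strategy above can be sketched as follows. This is a simplified, non-recursive version under stated assumptions: whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), paragraphs are detected by blank lines, and the function names split_fixed and chunk_document are made up for this example. Short paragraphs become their own chunks; only paragraphs longer than the limit fall back to overlapping fixed-size windows.

```python
def split_fixed(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` words, stepping by `size - overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return windows


def chunk_document(doc: dict, max_tokens: int = 500, overlap: int = 100) -> list[dict]:
    """Paragraph-first chunking with a fixed-size fallback for long paragraphs."""
    pieces: list[str] = []
    for paragraph in doc["text"].split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= max_tokens:
            pieces.append(" ".join(words))
        else:
            pieces.extend(" ".join(w) for w in split_fixed(words, max_tokens, overlap))

    # Metadata lineage: source document ID, chunk index, original metadata.
    return [
        {
            "id": f"{doc['id']}#chunk-{index}",
            "text": piece,
            "metadata": {
                **doc.get("metadata", {}),
                "source_id": doc["id"],
                "chunk_index": index,
            },
        }
        for index, piece in enumerate(pieces)
    ]
```

The input is one of the dictionaries produced by ingest_documents, and the output has the "text" field that generate_embeddings expects.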

Step 3: Embedding Generation

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_embeddings(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """Generate embeddings for all chunks in batches."""
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk["text"] for chunk in batch]
        
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        
        for j, embedding_data in enumerate(response.data):
            batch[j]["embedding"] = embedding_data.embedding
    
    return chunks

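Use text-embedding-3-small (1536 dimensions, $0.02/1M tokens at the time of writing) as the default. Upgrade to text-embedding-3-large only if evaluation shows a quality gap. Self-host an open model such as BGE or E5 when data cannot leave your network. Whichever model you pick, its output dimension must match the vector column in your store.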
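
To make the pricing concrete, here is a small back-of-the-envelope sketch. It uses the $0.02 per million tokens figure quoted above and the 500-token, 100-overlap chunking from Step 2; the function names and the example corpus size are assumptions, and the overlap adjustment is an approximation that ignores short final chunks.

```python
import math


def embedded_tokens(corpus_tokens: int, chunk_size: int = 500, overlap: int = 100) -> int:
    """Approximate tokens sent to the embedding API once overlap is included.

    Each chunk contributes `chunk_size - overlap` new tokens but is billed for
    `chunk_size`, so the corpus is inflated by chunk_size / (chunk_size - overlap).
    """
    return math.ceil(corpus_tokens * chunk_size / (chunk_size - overlap))


def estimate_embedding_cost(total_tokens: int, usd_per_million_tokens: float = 0.02) -> float:
    """Cost in USD for embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus: 10,000 documents averaging 2,000 tokens each.
corpus = 10_000 * 2_000
billed = embedded_tokens(corpus)  # 1.25x the corpus with 500/100 chunking
print(f"{billed:,} tokens, about ${estimate_embedding_cost(billed):.2f}")
```

At this scale a full re-embed costs well under a dollar, which is why re-embedding after a chunking change is usually cheap compared with the engineering time spent evaluating it.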

Step 4: Vector Storage

Choosing a Vector Store

Vector Store     Type            Best for                          Notes
───────────────────────────────────────────────────────────────────────────
Pinecone         Managed SaaS    Production, minimal ops           Free tier, scales well
Weaviate         Self-hosted/    Hybrid search built-in            Flexible, feature-rich
                 Cloud
pgvector         PostgreSQL      Already using Postgres            No new infrastructure
                 extension
Chroma           Embedded        Prototyping, small datasets       SQLite-based, simple
Qdrant           Self-hosted/    High performance, filtering       Rust-based, fast
                 Cloud
Milvus           Self-hosted     Very large scale (billions)       Complex to operate

For most teams, the decision is:

Prototyping: Chroma (in-memory, zero setup)
Already using Postgres: pgvector (no new infrastructure)
Production, want managed: Pinecone or Weaviate Cloud
Production, self-hosted: Qdrant or Weaviate

Example: pgvector

import psycopg2

def setup_pgvector(conn):
    """Set up pgvector table for RAG chunks."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id TEXT PRIMARY KEY,
                text TEXT NOT NULL,
                embedding vector(1536),
                metadata JSONB,
                created_at TIMESTAMP DEFAULT NOW()
            )
        """)
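        # Note: pgvector's documentation recommends building an IVFFlat index
        # after the table has data, because the lists are derived from existing
        # rows. If it is created on an empty table, consider rebuilding it
        # after the first bulk load.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS chunks_embedding_idx 
            ON chunks USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)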
        conn.commit()


def search_chunks(conn, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Search for similar chunks using cosine distance."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, text, metadata, 1 - (embedding <=> %s::vector) as similarity
            FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [
            {"id": row[0], "text": row[1], "metadata": row[2], "similarity": row[3]}
            for row in cur.fetchall()
        ]
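
The example above creates the table and searches it but does not show the write path. A minimal sketch of it follows, assuming the chunks table defined in setup_pgvector and chunk dictionaries carrying "id", "text", "embedding", and "metadata" as produced in the earlier steps. The names to_vector_literal and store_chunks are made up for this example. The embedding is sent as pgvector's text form ('[0.1,0.2,...]') and cast with ::vector, and the upsert keeps re-indexing idempotent.

```python
import json


def to_vector_literal(embedding: list[float]) -> str:
    """Format a list of numbers as pgvector's text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def store_chunks(conn, chunks: list[dict]) -> int:
    """Upsert chunks into the chunks table; returns the number of rows sent."""
    rows = [
        (
            chunk["id"],
            chunk["text"],
            to_vector_literal(chunk["embedding"]),
            json.dumps(chunk.get("metadata", {})),
        )
        for chunk in chunks
    ]
    with conn.cursor() as cur:
        cur.executemany(
            """
            INSERT INTO chunks (id, text, embedding, metadata)
            VALUES (%s, %s, %s::vector, %s::jsonb)
            ON CONFLICT (id) DO UPDATE
            SET text = EXCLUDED.text,
                embedding = EXCLUDED.embedding,
                metadata = EXCLUDED.metadata
            """,
            rows,
        )
    conn.commit()
    return len(rows)
```

Here conn is expected to be a psycopg2 connection, as in the functions above. For large corpora, a batched insert helper or COPY would likely be faster than executemany; this version favors clarity.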

Step 5: Retrieval, Prompt Construction & Generation

def rag_query(question: str, conn, top_k: int = 5) -> dict:
    """End-to-end RAG pipeline: retrieve and generate."""
    
    # 1. Embed the question
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    
    # 2. Retrieve relevant chunks
    chunks = search_chunks(conn, query_embedding, top_k=top_k)
    
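    # 3. Check retrieval quality
    # The 0.5 cutoff is a starting point; similarity ranges vary by embedding
    # model, so calibrate it against your evaluation set.
    if not chunks or chunks[0]["similarity"] < 0.5:
        return {"answer": "I could not find relevant information.", "sources": [], "confidence": "low"}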
    
    # 4. Build prompt with source attribution
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk["metadata"].get("filename", "Unknown")
        context_parts.append(f"[Source {i}: {source}]\n{chunk['text']}")
    context = "\n\n---\n\n".join(context_parts)
    
    # 5. Generate answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context. Cite sources using [Source N]. If the context lacks the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\n---\nQuestion: {question}"}
        ],
        temperature=0, max_tokens=500
    )
    
    return {
        "answer": response.choices[0].message.content,
        # Use .get() here as in step 4, so a chunk without a filename does not raise KeyError
        "sources": [{"file": c["metadata"].get("filename", "Unknown"), "similarity": c["similarity"]} for c in chunks],
        "confidence": "high" if chunks[0]["similarity"] > 0.75 else "medium"
    }

Evaluation

You cannot improve what you do not measure. RAG evaluation has two dimensions: retrieval quality and generation quality.

Key Metrics

Retrieval metrics:
  Recall@K:  What fraction of relevant docs appear in top K results?
  MRR:       Mean reciprocal rank: the average over queries of 1/rank of the first relevant result.
  
  Build a test set of 50+ question/relevant-document pairs.
  Measure recall@5 and recall@10. Target recall@5 > 0.8.

Generation metrics:
  Use LLM-as-judge: have GPT-4o score answers 1-5 against expected
  answers. Track average score and distribution across categories.
  
  Alternatively, use human evaluation on a sample of 50-100 queries.
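
The retrieval metrics above are simple enough to compute directly. The sketch below assumes a test set shaped as a list of dictionaries with "question" and "relevant_ids" keys, and a retrieve callable that returns ranked document or chunk IDs for a question; both shapes and all function names are assumptions made for this example.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    test_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict:
    """Average recall@k and MRR over a test set of question/relevant-ID pairs."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    recalls, reciprocal_ranks = [], []
    for case in test_set:
        retrieved = retrieve(case["question"])
        relevant = set(case["relevant_ids"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        reciprocal_ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(test_set)
    return {f"recall@{k}": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```

In practice retrieve would wrap the embedding call and search_chunks and return the "id" of each result, with top_k set to at least the largest k you measure.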

Common Failure Modes & Debugging

The Model Says "I Don't Know" When the Answer Exists

Diagnosis: Retrieval failure. The right chunks are not being found.

Debug steps:
  1. Run the query embedding against your vector store manually
  2. Look at the top 20 results. Is the right chunk in there at all?
  
  If NO: The chunk doesn't exist or its embedding is too far from the query.
    → Check chunking: is the relevant text in its own chunk?
    → Try the query with keyword search. Does it find the right doc?
    → Re-embed with a different model.
  
  If YES but low rank: The embedding model ranks it below irrelevant chunks.
    → Add metadata filtering to narrow the search space.
    → Use hybrid search (keyword + semantic).
    → Rewrite the query before embedding it, for example by expanding abbreviations or rephrasing it as a statement.
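
One common way to implement the hybrid search suggested above is reciprocal rank fusion (RRF), which merges a keyword ranking and a semantic ranking using only their rank positions. The document does not prescribe a fusion method, so treat this as one option among several; the function name is made up, and k = 60 is the constant conventionally used with RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one.

    Each list contributes 1 / (k + rank) to an ID's score, so items that rank
    well in more than one list rise to the top. Ties are broken by ID to keep
    the output deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for the same query.
semantic_hits = ["a", "b", "c"]   # e.g. IDs from a vector search
keyword_hits = ["c", "a", "d"]    # e.g. IDs from a full-text search
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Because RRF ignores raw scores, it avoids having to normalize cosine similarities against keyword relevance scores, which live on different scales.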

The Model Gives a Wrong Answer Confidently

Diagnosis: Either wrong chunks retrieved or model hallucination.

Debug steps:
  1. Look at which chunks were retrieved.
  
  If wrong chunks: Retrieval issue.
    → Check whether the question is ambiguous and matches multiple topics.
    → Add metadata filtering to scope the search.
    → Increase top_k and let the model sort through more context.
  
  If right chunks but wrong answer: Generation issue.
    → The system prompt is not strict enough about staying grounded.
    → The model is synthesizing across chunks incorrectly.
    → Add "Quote the exact text that supports your answer" to the prompt.
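
If the prompt asks the model to quote its supporting text, those quotes can be checked mechanically against the retrieved chunks. The following is a rough sketch under stated assumptions: quotes are delimited by straight double quotes, only quotes of 10 or more characters are checked, and matching is exact after normalizing whitespace and case. The function names are made up for this example.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_quotes(answer: str) -> list[str]:
    """Return passages of at least 10 characters enclosed in straight double quotes."""
    return re.findall(r'"([^"]{10,})"', answer)


def unsupported_quotes(answer: str, chunk_texts: list[str]) -> list[str]:
    """Quotes in the answer that do not appear verbatim in any retrieved chunk."""
    haystacks = [normalize(text) for text in chunk_texts]
    return [
        quote
        for quote in extract_quotes(answer)
        if not any(normalize(quote) in haystack for haystack in haystacks)
    ]
```

A non-empty result is a signal worth logging or flagging as a likely grounding failure. An empty result does not prove the answer is correct, only that its quotes exist in the context.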

Common Pitfalls

Shipping without evaluation: "It seems to work" is not enough. Build a test set of at least 50 question-answer pairs. Run automated evals on every change to chunking, prompts, or models.
Not inspecting retrieved chunks: When debugging, always look at what was actually retrieved. Most RAG failures are retrieval failures, not generation failures.
Overly large chunks: Using 2000-token chunks means fewer chunks fit in context. If the relevant information is 50 tokens within a 2000-token chunk, 97.5% of that context is noise.
No similarity threshold: Returning the top 5 results even when none are relevant leads to hallucination. Set a minimum similarity threshold and return "I don't know" when nothing qualifies.
Ignoring document freshness: If your knowledge base has versioned documents, old versions can poison results. Delete or demote outdated chunks when new versions are indexed.
Building before measuring: Set up evaluation before building the full pipeline. Otherwise you have no way to know if your changes are improvements.
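
The document-freshness point above can be sketched as a filter that keeps only chunks from the newest version of each document. It assumes every chunk's metadata carries a "modified" timestamp (as recorded during ingestion) plus a "doc_key" field that stays the same across versions of a document; "doc_key" and the function name are hypothetical and would need to be added to your own metadata.

```python
def latest_versions_only(chunks: list[dict]) -> list[dict]:
    """Keep chunks whose document version is the newest seen for its doc_key."""
    newest: dict[str, float] = {}
    for chunk in chunks:
        meta = chunk["metadata"]
        key = meta["doc_key"]
        newest[key] = max(newest.get(key, float("-inf")), meta["modified"])

    return [
        chunk
        for chunk in chunks
        if chunk["metadata"]["modified"] == newest[chunk["metadata"]["doc_key"]]
    ]
```

Running this before indexing, or deleting rows with an older "modified" value for the same key from the vector store, keeps superseded text from competing with current text at query time.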

Key Takeaways

A complete RAG system has two pipelines: offline (ingest, chunk, embed, store) and online (embed query, retrieve, prompt, generate). Both need attention.
Evaluate retrieval and generation separately. Most quality problems are retrieval problems. If the right chunks are not found, the generation cannot be correct.
Start simple: Chroma or pgvector, fixed-size chunks, text-embedding-3-small, and GPT-4o. Optimize only when evaluation shows where the bottleneck is.
Set a similarity threshold and implement "I don't know" responses. A system that admits uncertainty is more trustworthy than one that confidently hallucinates.
Debug systematically: inspect retrieved chunks first, then check if the generation prompt is clear, then consider model quality. Most issues live in retrieval or chunking.