Building a RAG System

Overview

This document walks through building a complete RAG system end-to-end: document ingestion, embedding generation, vector storage, retrieval, prompt construction, and generation. It also covers evaluation, common failure modes, and how to debug them.

A working RAG system is not complicated. The first version can be built in a day. Making it reliable, accurate, and fast enough for production takes iteration guided by evaluation.

Architecture

Offline Pipeline (runs on document changes):

  Documents (PDF, Markdown, HTML, etc.)
    ↓
  Parser → clean text
    ↓
  Chunker → text segments with metadata
    ↓
  Embedding model → vectors
    ↓
  Vector store (Pinecone, Weaviate, pgvector, etc.)

Online Pipeline (runs on every query):

  User question
    ↓
  Embedding model → query vector
    ↓
  Vector store → top K relevant chunks
    ↓
  Prompt builder → question + context
    ↓
  LLM → answer
    ↓
  Response to user

Step 1: Document Ingestion

from pathlib import Path

def ingest_documents(source_dir: str) -> list[dict]:
    """Load and parse documents from a directory."""
    documents = []
    supported_extensions = {".md", ".txt", ".pdf", ".html", ".docx"}
    
    for file_path in Path(source_dir).rglob("*"):
        # rglob("*") also yields directories, so keep regular files only
        if not file_path.is_file():
            continue
        if file_path.suffix.lower() not in supported_extensions:
            continue
        
        # parse_document() is assumed to be defined elsewhere and to return
        # the file's plain text for each supported extension
        text = parse_document(str(file_path))
        
        if len(text.strip()) < 50:  # Skip near-empty documents
            continue
        
        stat = file_path.stat()  # One stat call for both size and mtime
        documents.append({
            "id": str(file_path),
            "text": text,
            "metadata": {
                "filename": file_path.name,
                "path": str(file_path),
                "extension": file_path.suffix,
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
            }
        })
    
    return documents
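
The ingestion code calls parse_document() without defining it. A minimal sketch of what it might look like, using only the standard library, is below. The function body and the _TextExtractor helper are assumptions for illustration, not the original implementation: it handles .md, .txt, and .html only, and returns an empty string for .pdf and .docx, which need a third-party parser.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def parse_document(path: str) -> str:
    """Return plain text for .md, .txt, and .html files (sketch only)."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()

    if suffix in {".md", ".txt"}:
        return file_path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".html":
        extractor = _TextExtractor()
        extractor.feed(file_path.read_text(encoding="utf-8", errors="replace"))
        extractor.close()
        return "\n".join(extractor.parts)

    # PDF and DOCX are not handled in this sketch. Returning "" lets the
    # caller's minimum-length check skip the file instead of crashing.
    return ""
```

Returning an empty string for unsupported formats is a design choice that keeps ingestion running; a production parser would more likely log the skipped file so missing content is visible.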

Step 2: Chunking

Try semantic chunking first (split on headers/paragraphs), fall back to recursive fixed-size chunking (500 tokens, 100 overlap) for unstructured text. Preserve metadata lineage: each chunk carries its source document ID, chunk index, and original document metadata.
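
The strategy above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference chunker: "tokens" are approximated by whitespace-separated words, paragraphs are detected by blank lines only (headers are not treated specially), and the function name chunk_document and the metadata keys source_id and chunk_index are hypothetical.

```python
import re


def chunk_document(doc: dict, max_tokens: int = 500, overlap: int = 100) -> list[dict]:
    """Split on blank lines first; window any oversized paragraph with overlap.

    "Tokens" are whitespace-separated words here, a rough stand-in for
    model tokens. Whitespace inside a paragraph is normalized to single spaces.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")

    segments = []
    for paragraph in re.split(r"\n\s*\n", doc["text"]):
        words = paragraph.split()
        if not words:
            continue
        if len(words) <= max_tokens:
            # Paragraph fits: keep it as one semantic chunk
            segments.append(" ".join(words))
            continue
        # Fallback: fixed-size windows where neighbors share `overlap` words
        step = max_tokens - overlap
        for start in range(0, len(words), step):
            segments.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break  # The last window already reached the end

    # Metadata lineage: source document ID, chunk index, original metadata
    return [
        {
            "id": f"{doc['id']}#chunk-{index}",
            "text": text,
            "metadata": {
                **doc.get("metadata", {}),
                "source_id": doc["id"],
                "chunk_index": index,
            },
        }
        for index, text in enumerate(segments)
    ]
```

For accurate sizing, the word count would be replaced with the tokenizer of the embedding model in use.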

Step 3: Embedding Generation

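from openai import OpenAI

client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def generate_embeddings(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """Generate embeddings for all chunks in batches.

    Adds an "embedding" key to each chunk in place and returns the same list.
    """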
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk["text"] for chunk in batch]
        
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts
        )
        
        for j, embedding_data in enumerate(response.data):
            batch[j]["embedding"] = embedding_data.embedding
    
    return chunks

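Use text-embedding-3-small (1536 dimensions, $0.02/1M tokens at the time of writing) as the default. Upgrade to text-embedding-3-large only if evaluation shows a quality gap; it defaults to 3072 dimensions, so switching means resizing the vector column used in Step 4 and re-embedding every chunk. Self-host BGE or E5 when data cannot leave your network.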

Step 4: Vector Storage

Choosing a Vector Store

Vector Store   Type                   Best for                      Notes
──────────────────────────────────────────────────────────────────────────────────────
Pinecone       Managed SaaS           Production, minimal ops       Free tier, scales well
Weaviate       Self-hosted or cloud   Hybrid search built in        Flexible, feature-rich
pgvector       PostgreSQL extension   Already using Postgres        No new infrastructure
Chroma         Embedded               Prototyping, small datasets   SQLite-based, simple
Qdrant         Self-hosted or cloud   High performance, filtering   Rust-based, fast
Milvus         Self-hosted            Very large scale (billions)   Complex to operate

For most teams, the decision is:

  • Prototyping: Chroma (in-memory, zero setup)
  • Already using Postgres: pgvector (no new infrastructure)
  • Production, want managed: Pinecone or Weaviate Cloud
  • Production, self-hosted: Qdrant or Weaviate

Example: pgvector

import psycopg2

def setup_pgvector(conn):
    """Set up pgvector table for RAG chunks."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id TEXT PRIMARY KEY,
                text TEXT NOT NULL,
                embedding vector(1536),
                metadata JSONB,
                created_at TIMESTAMP DEFAULT NOW()
            )
        """)
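        # Note: pgvector recommends building an IVFFlat index after the table
        # has data, because the list centroids are computed from existing rows.
        # If the index is created here on an empty table, rebuild it after the
        # first bulk load.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS chunks_embedding_idx 
            ON chunks USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)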
        conn.commit()


def search_chunks(conn, query_embedding: list[float], top_k: int = 5) -> list[dict]:
    """Search for similar chunks using cosine distance."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, text, metadata, 1 - (embedding <=> %s::vector) as similarity
            FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [
            {"id": row[0], "text": row[1], "metadata": row[2], "similarity": row[3]}
            for row in cur.fetchall()
        ]
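
The query relies on pgvector's <=> operator, which returns cosine distance, so `1 - distance` recovers cosine similarity and ordering by ascending distance puts the most similar chunks first. A small pure-Python sketch of the same arithmetic, for intuition only (the function names are illustrative and this is not how pgvector computes it internally):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity the <=> operator returns: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)
```

Identical directions give a distance of 0 (similarity 1), orthogonal vectors give 1 (similarity 0), and opposite directions give 2 (similarity -1).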

Step 5: Retrieval, Prompt Construction & Generation

def rag_query(question: str, conn, top_k: int = 5) -> dict:
    """End-to-end RAG pipeline: retrieve and generate."""
    
    # 1. Embed the question
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    
    # 2. Retrieve relevant chunks
    chunks = search_chunks(conn, query_embedding, top_k=top_k)
    
    # 3. Check retrieval quality
    if not chunks or chunks[0]["similarity"] < 0.5:
        return {"answer": "I could not find relevant information.", "sources": [], "confidence": "low"}
    
    # 4. Build prompt with source attribution
    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        source = (chunk["metadata"] or {}).get("filename", "Unknown")  # metadata column is nullable
        context_parts.append(f"[Source {i}: {source}]\n{chunk['text']}")
    context = "\n\n---\n\n".join(context_parts)
    
    # 5. Generate answer
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context. Cite sources using [Source N]. If the context lacks the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\n---\nQuestion: {question}"}
        ],
        temperature=0, max_tokens=500
    )
    
    return {
        "answer": response.choices[0].message.content,
        # Use .get() so a chunk without a filename does not raise KeyError
        "sources": [{"file": (c["metadata"] or {}).get("filename", "Unknown"), "similarity": c["similarity"]} for c in chunks],
        "confidence": "high" if chunks[0]["similarity"] > 0.75 else "medium"
    }
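
Step 3 of rag_query gates on the top chunk only, so lower-ranked chunks that fall below the threshold still reach the prompt. One possible variation, which is not part of the pipeline above, is to filter every chunk before building the context. The helper name filter_by_similarity is hypothetical; the 0.5 default mirrors the threshold already used in rag_query.

```python
def filter_by_similarity(chunks: list[dict], min_similarity: float = 0.5) -> list[dict]:
    """Keep only chunks at or above the threshold, preserving rank order."""
    return [c for c in chunks if c["similarity"] >= min_similarity]
```

Inside rag_query this could run right after search_chunks, with the "could not find relevant information" response returned when the filtered list is empty. The right threshold depends on the embedding model and corpus, so it is worth tuning against the evaluation set rather than taking 0.5 as given.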

Evaluation

You cannot improve what you do not measure. RAG evaluation has two dimensions: retrieval quality and generation quality.

Key Metrics

Retrieval metrics:
  Recall@K:  What fraction of relevant docs appear in top K results?
  MRR:       How high is the first relevant result ranked?
  
  Build a test set of 50+ question/relevant-document pairs.
  Measure recall@5 and recall@10. Target recall@5 > 0.8.
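
  Both retrieval metrics are a few lines of code. The sketch below assumes
  each test case is a pair of (retrieved IDs in rank order, set of relevant
  IDs); the function names are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(results: list[tuple[list[str], set[str]]], k: int = 5) -> dict:
    """Average recall@k and MRR over all test cases."""
    if not results:
        raise ValueError("results must contain at least one test case")
    n = len(results)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }
```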

Generation metrics:
  Use LLM-as-judge: have GPT-4o score answers 1-5 against expected
  answers. Track average score and distribution across categories.
  
  Alternatively, use human evaluation on a sample of 50-100 queries.
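
  A sketch of the bookkeeping around LLM-as-judge scoring follows. The judge
  itself is passed in as a callable so the aggregation can be exercised
  without an API call; in practice it would wrap a GPT-4o request. The
  function name and the example keys (question, expected, answer, category)
  are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable


def summarize_judge_scores(
    examples: list[dict],
    judge: Callable[[str, str, str], int],
) -> dict:
    """Score each example with `judge` (1-5) and aggregate the results."""
    if not examples:
        raise ValueError("examples must not be empty")

    by_category = defaultdict(list)
    for ex in examples:
        score = judge(ex["question"], ex["expected"], ex["answer"])
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score: {score}")
        by_category[ex["category"]].append(score)

    all_scores = [s for scores in by_category.values() for s in scores]
    return {
        "average": mean(all_scores),
        "by_category": {cat: mean(scores) for cat, scores in by_category.items()},
        # How many answers received each score from 1 to 5
        "distribution": {s: all_scores.count(s) for s in range(1, 6)},
    }
```

  Tracking the distribution alongside the average helps separate "mostly
  fine with a few disasters" from "uniformly mediocre", which call for
  different fixes.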

Common Failure Modes & Debugging

The Model Says "I Don't Know" When the Answer Exists

Diagnosis: Retrieval failure. The right chunks are not being found.

Debug steps:
  1. Run the query embedding against your vector store manually
  2. Look at the top 20 results. Is the right chunk in there at all?
  
  If NO: The chunk doesn't exist or its embedding is too far from the query.
    → Check chunking: is the relevant text in its own chunk?
    → Try the query with keyword search. Does it find the right doc?
    → Re-embed with a different model.
  
  If YES but low rank: The embedding model ranks it below irrelevant chunks.
    → Add metadata filtering to narrow the search space.
    → Use hybrid search (keyword + semantic).
    → Rewrite the query before embedding it, for example by expanding abbreviations or asking an LLM to rephrase the question.
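
One common way to combine keyword and semantic results for hybrid search is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one option among several. The sketch assumes you already have two ranked lists of chunk IDs, one from each retriever; k = 60 is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses ranks rather than raw scores, it avoids having to put keyword scores and cosine similarities on a common scale.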

The Model Gives a Wrong Answer Confidently

Diagnosis: Either wrong chunks retrieved or model hallucination.

Debug steps:
  1. Look at which chunks were retrieved.
  
  If wrong chunks: Retrieval issue.
    → The question is ambiguous. Multiple topics match.
    → Add metadata filtering to scope the search.
    → Increase top_k and let the model sort through more context.
  
  If right chunks but wrong answer: Generation issue.
    → The system prompt is not strict enough about staying grounded.
    → The model is synthesizing across chunks incorrectly.
    → Add "Quote the exact text that supports your answer" to the prompt.

The Answer Is Correct But Misses Information

Diagnosis: not enough chunks retrieved, or relevant info split across chunk boundaries. Increase top_k, reduce chunk size so more chunks fit in context, or add overlap to prevent information loss at boundaries.

Latency Is Too High

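Typical breakdown: embedding query (100-200ms), vector search (10-50ms), LLM generation (500-3000ms). Generation usually dominates, so optimize by caching frequent queries, using a faster model (GPT-4o-mini), or reducing top_k to shorten the prompt. Streaming the response does not shorten total time, but it cuts the time to the first token.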
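
Caching frequent queries can be as simple as a small in-process LRU cache keyed on a normalized question. The class below is a hypothetical sketch: QueryCache and its parameters are not part of the pipeline above, and it only catches exact matches after lowercasing and whitespace cleanup, not paraphrases.

```python
from collections import OrderedDict
from typing import Callable


class QueryCache:
    """Small LRU cache keyed on a normalized form of the question."""

    def __init__(self, answer_fn: Callable[[str], dict], max_size: int = 1000):
        self._answer_fn = answer_fn
        self._max_size = max_size
        self._entries = OrderedDict()

    @staticmethod
    def _normalize(question: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key
        return " ".join(question.lower().split())

    def ask(self, question: str) -> dict:
        key = self._normalize(question)
        if key in self._entries:
            self._entries.move_to_end(key)  # Mark as most recently used
            return self._entries[key]
        result = self._answer_fn(question)
        self._entries[key] = result
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # Evict least recently used
        return result

    def clear(self) -> None:
        """Drop all cached answers, for example after re-indexing documents."""
        self._entries.clear()
```

It could wrap the pipeline as `QueryCache(lambda q: rag_query(q, conn))`. Cached answers go stale when the underlying documents change, so clearing the cache after each offline pipeline run is a reasonable default.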

Common Pitfalls

  • Shipping without evaluation: "It seems to work" is not enough. Build a test set of at least 50 question-answer pairs. Run automated evals on every change to chunking, prompts, or models.
  • Not inspecting retrieved chunks: When debugging, always look at what was actually retrieved. Most RAG failures are retrieval failures, not generation failures.
  • Overly large chunks: Using 2000-token chunks means fewer chunks fit in context. If the relevant information is 50 tokens within a 2000-token chunk, 97.5% of that context is noise.
  • No similarity threshold: Returning the top 5 results even when none are relevant leads to hallucination. Set a minimum similarity threshold and return "I don't know" when nothing qualifies.
  • Ignoring document freshness: If your knowledge base has versioned documents, old versions can poison results. Delete or demote outdated chunks when new versions are indexed.
  • Building before measuring: Set up evaluation before building the full pipeline. Otherwise you have no way to know if your changes are improvements.
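
For the document freshness pitfall, one possible approach is to compare what is indexed against what was just ingested and collect the chunk IDs to delete. The sketch below assumes each chunk's metadata carries a hypothetical source_id field plus the document's modified timestamp from ingestion; neither the function name nor source_id comes from the code above.

```python
def find_stale_chunk_ids(indexed_chunks: list[dict], current_docs: list[dict]) -> list[str]:
    """Return IDs of indexed chunks whose source document changed or vanished."""
    current_mtime = {doc["id"]: doc["metadata"]["modified"] for doc in current_docs}

    stale = []
    for chunk in indexed_chunks:
        source_id = chunk["metadata"]["source_id"]
        latest = current_mtime.get(source_id)
        # Stale if the source was deleted or modified after this chunk was indexed
        if latest is None or latest > chunk["metadata"]["modified"]:
            stale.append(chunk["id"])
    return stale
```

With the pgvector table from Step 4, the returned IDs could then be removed with a statement along the lines of `DELETE FROM chunks WHERE id = ANY(%s)` before the new version's chunks are inserted.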

Key Takeaways

  • A complete RAG system has two pipelines: offline (ingest, chunk, embed, store) and online (embed query, retrieve, prompt, generate). Both need attention.
  • Evaluate retrieval and generation separately. Most quality problems are retrieval problems. If the right chunks are not found, the generation cannot be correct.
  • Start simple: Chroma or pgvector, fixed-size chunks, text-embedding-3-small, and GPT-4o. Optimize only when evaluation shows where the bottleneck is.
  • Set a similarity threshold and implement "I don't know" responses. A system that admits uncertainty is more trustworthy than one that confidently hallucinates.
  • Debug systematically: inspect retrieved chunks first, then check if the generation prompt is clear, then consider model quality. Most issues live in retrieval or chunking.