What RAG Is

Overview

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. Instead of relying solely on what the model learned during training, you retrieve relevant documents from your own data and include them in the prompt. The model then generates an answer grounded in those documents.

RAG addresses two fundamental problems with LLMs: they have a knowledge cutoff (they know nothing about events or documents that postdate their training data), and they hallucinate (they confidently state things that are not true). By giving the model the right documents at the right time, you get answers that are more current and grounded in real data.

The Problem RAG Solves

Knowledge Cutoff

Every LLM has a training data cutoff date. Ask about something that happened after that date, or about private data the model never saw, and the model either refuses to answer or makes something up.

Without RAG:
  User: "What was our Q4 2025 revenue?"
  Model: "I don't have information about your company's specific 
          financial data."  (Best case — honest refusal)
  
  Model: "Based on industry trends, your Q4 2025 revenue was likely 
          around $12M."  (Worst case — confident hallucination)

With RAG:
  System retrieves Q4 2025 earnings report from company documents.
  
  User: "What was our Q4 2025 revenue?"
  Model: "According to the Q4 2025 earnings report, revenue was 
          $14.2M, up 18% from Q3."  (Grounded in real data)

Hallucination

LLMs generate plausible-sounding text. When they lack information, they fill in gaps with statistically likely content that may be factually wrong. This is not a bug that will simply be patched away; it is inherent to how language models work.

Without RAG:
  User: "What is the return policy for Pro plan customers?"
  Model: "Pro plan customers typically enjoy a 60-day return window 
          with full refund."  (Sounds authoritative. Completely made up.)

With RAG:
  System retrieves the actual return policy document.
  
  User: "What is the return policy for Pro plan customers?"
  Model: "Per the current return policy, Pro plan customers have a 
          30-day return window. Refunds are processed within 5-7 
          business days and exclude shipping costs."  (From the document.)

RAG = Search + Generation

At its core, RAG is two steps:

Step 1: RETRIEVE
  User asks a question
  → Convert question to a search query
  → Search your document store
  → Get the top N most relevant chunks
  
Step 2: GENERATE
  Take the user's question + retrieved chunks
  → Build a prompt with the context
  → Send to the LLM
  → Model generates answer grounded in the context

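from openai import OpenAI  # llm_client below is assumed to be an OpenAI client


def get_embedding(text: str, client: OpenAI, model: str = "text-embedding-3-small") -> list[float]:
    """Embed text. The model name is an assumption; use the one that indexed your chunks."""
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding


def answer_with_rag(question: str, vector_store, llm_client) -> str:
    """Simple RAG pipeline: retrieve then generate."""
    
    # Step 1: Retrieve relevant documents
    # vector_store is any store that exposes search(query_vector, top_k)
    query_embedding = get_embedding(question, llm_client)
    relevant_chunks = vector_store.search(
        query_vector=query_embedding,
        top_k=5
    )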
    
    # Step 2: Build prompt with context
    context = "\n\n---\n\n".join([chunk.text for chunk in relevant_chunks])
    
    prompt = f"""Answer the user's question based ONLY on the provided context.
If the context does not contain enough information to answer, say "I don't have 
enough information to answer that question."

Context:
{context}

Question: {question}"""
    
    # Step 3: Generate answer
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You answer questions based on provided context. Cite the source when possible. Do not make up information."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    
    return response.choices[0].message.content
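
The system message above asks the model to cite sources, but the context string carries only chunk text, so the model has nothing to cite. A minimal sketch of one way to close that gap, assuming each chunk also carries a `source` field (hypothetical; the pipeline above only relies on `chunk.text`, and the file names below are made up for illustration):

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk shape: the pipeline above only relies on .text."""
    text: str
    source: str  # assumed metadata, e.g. a file name or URL


def build_context(chunks: list[Chunk]) -> str:
    """Label each chunk so the model can refer to it as [1], [2], ..."""
    blocks = [
        f"[{i}] Source: {chunk.source}\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n\n---\n\n".join(blocks)


# Illustrative data only
chunks = [
    Chunk(text="Pro plan customers have a 30-day return window.", source="return-policy.md"),
    Chunk(text="Refunds are processed within 5-7 business days.", source="refunds-faq.md"),
]
print(build_context(chunks))
```

The returned string could replace the plain `context` join in `answer_with_rag`, so that an answer can point back to a numbered source.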

Why RAG Beats Fine-Tuning for Knowledge

This is one of the most important architectural decisions in applied AI: when you need the model to know about your data, should you use RAG or fine-tuning?

RAG Advantages

Advantage               Why it matters
───────────────────────────────────────────────────────────────────
Up-to-date knowledge    Add new documents instantly. No retraining.
Source attribution      You know which document the answer came from.
Access control          Show different documents to different users.
Lower cost              No GPU time for training. Just embeddings + search.
No catastrophic         Adding new knowledge doesn't degrade existing
forgetting              model capabilities.
Easy to debug           If the answer is wrong, check which documents
                        were retrieved. The problem is either retrieval
                        or generation — easy to isolate.
Data separation         Your data stays in your database, not baked
                        into model weights.
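
The access control row can be sketched as a filter on retrieved chunks. This is an illustration under assumptions, not a prescribed design: the `allowed_groups` field and the group names are hypothetical, and the sample texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    allowed_groups: set[str] = field(default_factory=set)  # assumed metadata


def filter_by_access(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Invented sample data
retrieved = [
    Chunk("Salary bands for 2025", {"hr"}),
    Chunk("How to request vacation", {"hr", "all-staff"}),
]
visible = filter_by_access(retrieved, user_groups={"all-staff"})
print([c.text for c in visible])  # ['How to request vacation']
```

Filtering after retrieval can leave fewer than `top_k` usable chunks, so in practice the permission filter is often applied inside the search query itself when the document store supports metadata filters.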

Fine-Tuning Advantages

Advantage               Why it matters
───────────────────────────────────────────────────────────────────
No retrieval latency    Knowledge is in the model weights. No search step.
Consistent behavior     The model "knows" the domain. Doesn't need context.
Style and format        Fine-tuning can teach tone, format, and style
                        that are hard to specify in prompts.
Lower inference cost    Smaller fine-tuned model can replace a larger
                        model + RAG pipeline.

When to Use Which

Use RAG when:
  - Knowledge changes frequently (policies, docs, product catalog)
  - You need source attribution ("this answer came from document X")
  - Different users should see different information
  - You have a large knowledge base (1000+ documents)
  - You need to add knowledge without retraining
  - Accuracy and groundedness are critical

Use fine-tuning when:
  - You need a specific output style or format
  - The model needs domain-specific vocabulary or reasoning patterns
  - Latency is critical and you can't afford the retrieval step
  - The knowledge is stable and won't change frequently
  - You want a smaller, cheaper model that performs like a larger one

Use BOTH when:
  - You need domain-specific style AND up-to-date knowledge
  - Fine-tune for format/style, RAG for factual grounding

Real-World RAG Applications

Customer Support

User question → Search knowledge base → Retrieve relevant articles → Generate answer

Knowledge base: 500 support articles, FAQ pages, product docs
Update frequency: Weekly (new articles added, old ones updated)
Why RAG: Policies change, new products launch, answers must cite sources.
Without RAG: Model hallucinates refund policies and product features.

Legal Research

Lawyer query → Search case law database → Retrieve relevant cases → Summarize findings

Knowledge base: 10,000+ court decisions, statutes, regulations
Update frequency: Daily (new decisions filed)
Why RAG: Hallucinated case citations are a professional liability.
A lawyer citing a non-existent case can face sanctions.

Internal Company Search

Employee question → Search internal docs → Retrieve relevant content → Generate answer

Knowledge base: Confluence pages, Notion docs, Slack history, Google Drive
Update frequency: Continuous (documents edited all day)
Why RAG: Company processes, contacts, and policies change constantly.
A fine-tuned model would be stale within days.

Code Documentation

Developer question → Search codebase + docs → Retrieve relevant files → Explain

Knowledge base: Source code, README files, API docs, architecture decision records
Update frequency: Every commit
Why RAG: Codebase changes with every PR. A fine-tuned model would need
constant retraining.

How RAG Works Under the Hood

Offline (indexing phase):
  
  Documents
    ↓
  Split into chunks (500-1000 tokens each)
    ↓
  Generate embedding vector for each chunk
    ↓
  Store chunks + vectors in vector database
    ↓
  Index is ready for searching
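
The chunking step of the indexing phase might look roughly like the sketch below. It makes two assumptions that are not in the text above: words stand in for tokens (a real pipeline would count tokens with its embedding model's tokenizer), and consecutive chunks overlap slightly so that a sentence cut at a boundary still appears whole in one of them.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows.

    Words stand in for tokens here; a real pipeline would count tokens
    with the tokenizer of its embedding model.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


document = " ".join(f"word{i}" for i in range(1200))
chunks = split_into_chunks(document, chunk_size=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [500, 500, 300]
```

Each chunk would then be embedded and stored alongside its vector.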

Online (query phase):
  
  User question
    ↓
  Generate embedding vector for the question
    ↓
  Find chunks with most similar vectors (cosine similarity)
    ↓
  Return top K chunks (typically 3-10)
    ↓
  Include chunks in LLM prompt as context
    ↓
  LLM generates answer grounded in the chunks
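
The similarity search in the query phase can be sketched as a brute-force scan. The three-dimensional vectors and the sample texts are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Score every (text, vector) pair and return the k best, highest first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors, invented for this example
index = [
    ("Our return policy allows refunds within 30 days.", [0.9, 0.1, 0.0]),
    ("Office hours are 9am to 5pm.", [0.0, 0.2, 0.9]),
    ("Shipping takes 3-5 business days.", [0.4, 0.8, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "What is the return policy?"
for score, text in top_k(query, index, k=2):
    print(f"{score:.3f}  {text}")
```

Scanning every vector is fine for a few thousand chunks. Vector databases generally replace this loop with an approximate nearest-neighbor index to stay fast at larger scale.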

The key insight: embedding models are typically trained so that a question and a passage that answers it land close together in embedding space. "What is the return policy?" and a chunk containing "Our return policy allows refunds within 30 days" will have similar vectors, even though they share few exact words.

RAG vs Other Approaches

Approach            Cost        Freshness      Accuracy    Complexity
─────────────────────────────────────────────────────────────────────
Prompt stuffing     Low         Manual update  High        Low
(put docs in prompt)

RAG                 Medium      Automatic      High        Medium
(retrieve + generate)

Fine-tuning         High        Requires       Medium      High
                                retraining

Knowledge graph     High        Manual update  Very high   Very high
+ RAG

Prompt stuffing (just putting documents in the prompt) works for small knowledge bases that fit within the context window. Once your data exceeds what fits in a single prompt, you need RAG to select the relevant subset.
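
A quick way to check which side of that line a knowledge base falls on is to estimate its token count. The sketch below rests on loud assumptions: the 4-characters-per-token ratio is only a rough rule of thumb for English text, and the 128,000-token window and 2,000-token reserve are illustrative numbers, not properties of any particular model.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def fits_in_context(documents: list[str], context_window: int, reserved: int = 2000) -> bool:
    """True if all documents fit after reserving room for instructions and the answer."""
    budget = context_window - reserved
    return sum(estimate_tokens(doc) for doc in documents) <= budget


small_kb = ["x" * 4000] * 5      # about 5 * 1,000 = 5,000 tokens
large_kb = ["x" * 4000] * 500    # about 500,000 tokens
print(fits_in_context(small_kb, context_window=128_000))  # True
print(fits_in_context(large_kb, context_window=128_000))  # False
```

For an exact count, use the tokenizer that matches your model instead of the character heuristic.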

Common Pitfalls

Skipping evaluation: "It seems to work" is not evaluation. Build a test set of question-answer pairs and measure retrieval quality (are the right documents found?) and answer quality (is the generated answer correct?).
Wrong chunk size: Too small and chunks lack context. Too big and you waste context window space on irrelevant text. Start with 500-1000 tokens per chunk and adjust based on your data.
No "I don't know" path: If the retrieved documents do not contain the answer, the model should say so. Without explicit instructions, models will hallucinate an answer using the irrelevant context.
Ignoring retrieval quality: The generation can only be as good as the retrieval. If the wrong documents are retrieved, the model will produce a confident wrong answer based on those documents.
Not considering context window limits: If you retrieve 10 chunks of 1000 tokens each, that is 10K tokens of context. Add the system prompt and the model's output, and you may be hitting limits. Budget your context window.
Treating RAG as a one-time setup: RAG systems need ongoing maintenance. Documents change, user questions evolve, and retrieval quality can degrade. Monitor and iterate.
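
To make the evaluation pitfall concrete, here is a minimal sketch of one retrieval metric: the fraction of test questions whose expected document shows up in the top k results. The `retrieve` callable, the document ids, and the canned results are all hypothetical stand-ins for a real search call and a real test set.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected document appears in the top k results."""
    if not test_set:
        return 0.0
    hits = sum(
        1 for question, expected_doc_id in test_set
        if expected_doc_id in retrieve(question, k)
    )
    return hits / len(test_set)


# Stand-in retriever returning canned document ids; replace with a real search call.
canned = {
    "What is the return policy?": ["return-policy", "shipping-faq"],
    "How do I reset my password?": ["billing-faq", "pricing"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned.get(question, [])[:k]


test_set = [
    ("What is the return policy?", "return-policy"),
    ("How do I reset my password?", "account-help"),
]
print(hit_rate_at_k(test_set, fake_retrieve, k=2))  # 0.5
```

This measures retrieval only. Answer quality needs its own check, for example comparing generated answers against reference answers for the same questions.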

Key Takeaways

RAG combines retrieval (search) with generation (LLM) to ground model outputs in real data. This reduces hallucination and solves the knowledge cutoff problem.
RAG beats fine-tuning for most knowledge-grounding use cases because it supports real-time updates, source attribution, and access control without retraining.
The quality of a RAG system depends on both retrieval quality (finding the right documents) and generation quality (synthesizing a correct answer). Evaluate both separately.
RAG is a standard architecture for applications where the model needs to answer questions about data it was not trained on: support bots, internal search, document Q&A, legal research.
Always include an "I don't know" instruction. A model that admits uncertainty is more trustworthy than one that confidently generates wrong answers from irrelevant context.