What RAG Is
Overview
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation. Instead of relying solely on what the model learned during training, you retrieve relevant documents from your own data and include them in the prompt. The model then generates an answer grounded in those documents.
RAG addresses two fundamental problems with LLMs: they have a knowledge cutoff (they know nothing that happened after their training data was collected), and they hallucinate (they confidently state things that are not true). By giving the model the right documents at the right time, you get answers that are current and grounded in real data.
The Problem RAG Solves
Knowledge Cutoff
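Every LLM has a training data cutoff date. Ask about something that happened after that date and the model either refuses to answer or makes something up. The same holds for private data the model never saw during training, such as your company's internal documents.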
Without RAG:
User: "What was our Q4 2025 revenue?"
Model: "I don't have information about your company's specific
financial data." (Best case — honest refusal)
Model: "Based on industry trends, your Q4 2025 revenue was likely
around $12M." (Worst case — confident hallucination)
With RAG:
System retrieves Q4 2025 earnings report from company documents.
User: "What was our Q4 2025 revenue?"
Model: "According to the Q4 2025 earnings report, revenue was
$14.2M, up 18% from Q3." (Grounded in real data)
Hallucination
LLMs generate plausible-sounding text. When they lack information, they fill in gaps with statistically likely (but factually wrong) content. This is not a bug that will be fixed — it is inherent to how language models work.
Without RAG:
User: "What is the return policy for Pro plan customers?"
Model: "Pro plan customers typically enjoy a 60-day return window
with full refund." (Sounds authoritative. Completely made up.)
With RAG:
System retrieves the actual return policy document.
User: "What is the return policy for Pro plan customers?"
Model: "Per the current return policy, Pro plan customers have a
30-day return window. Refunds are processed within 5-7
business days and exclude shipping costs." (From the document.)
RAG = Search + Generation
At its core, RAG is two steps:
Step 1: RETRIEVE
User asks a question
→ Convert question to a search query
→ Search your document store
→ Get the top N most relevant chunks
Step 2: GENERATE
Take the user's question + retrieved chunks
→ Build a prompt with the context
→ Send to the LLM
→ Model generates answer grounded in the context
def answer_with_rag(question: str, vector_store, llm_client) -> str:
    """Simple RAG pipeline: retrieve then generate."""
    # Step 1: Retrieve relevant documents.
    # get_embedding is assumed to be defined elsewhere and to use the same
    # embedding model that was used to index the documents.
    query_embedding = get_embedding(question)
    relevant_chunks = vector_store.search(
        query_vector=query_embedding,
        top_k=5
    )
    # Step 2: Build prompt with context
    context = "\n\n---\n\n".join(chunk.text for chunk in relevant_chunks)
    prompt = f"""Answer the user's question based ONLY on the provided context.
If the context does not contain enough information to answer, say "I don't have
enough information to answer that question."

Context:
{context}

Question: {question}"""
    # Step 3: Generate answer
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You answer questions based on provided context. Cite the source when possible. Do not make up information."},
            {"role": "user", "content": prompt}
        ],
        temperature=0  # favor deterministic, repeatable answers
    )
    return response.choices[0].message.content
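The system message above asks the model to cite sources, but the context string carries only chunk text, so the model has nothing to cite. One way to close that gap is to label each chunk with its origin before joining. The sketch below assumes every chunk has a `source` attribute; that field, the `Chunk` class, and `build_context` are illustrative names, not part of any particular vector store's API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical metadata field, e.g. a file name or URL


def build_context(chunks: list[Chunk]) -> str:
    """Number each chunk and prefix it with its source so answers can cite it."""
    blocks = [
        f"[{i}] Source: {chunk.source}\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return "\n\n---\n\n".join(blocks)


# Sample chunks reuse the return-policy example from earlier in this article.
chunks = [
    Chunk("Pro plan customers have a 30-day return window.", "return-policy.md"),
    Chunk("Refunds are processed within 5-7 business days.", "refunds.md"),
]
context = build_context(chunks)
print(context)
```

With labels like `[1]` in the context, the prompt can ask the model to reference chunk numbers, and the application can map those numbers back to documents.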
Why RAG Beats Fine-Tuning for Knowledge
This is one of the most important architectural decisions in applied AI: when you need the model to know about your data, should you use RAG or fine-tuning?
RAG Advantages
Advantage               Why it matters
───────────────────────────────────────────────────────────────────
Up-to-date knowledge    Add new documents instantly. No retraining.
Source attribution      You know which document the answer came from.
Access control          Show different documents to different users.
Lower cost              No GPU time for training. Just embeddings + search.
No catastrophic         Adding new knowledge doesn't degrade existing
forgetting              model capabilities.
Easy to debug           If the answer is wrong, check which documents
                        were retrieved. The problem is either retrieval
                        or generation, so it is easy to isolate.
Data separation         Your data stays in your database, not baked
                        into model weights.
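The access control row can be made concrete with a small filter that runs between retrieval and prompt construction. This is a minimal sketch, assuming each chunk carries a set of groups allowed to read it; `allowed_groups`, `filter_by_access`, and the sample data are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str
    # Hypothetical metadata: groups that are allowed to read this chunk.
    allowed_groups: set[str] = field(default_factory=set)


def filter_by_access(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


# Made-up retrieval results for two different audiences.
retrieved = [
    Chunk("Compensation review guidelines.", "hr/compensation.md", {"hr"}),
    Chunk(
        "Pro plan customers have a 30-day return window.",
        "support/returns.md",
        {"support", "hr"},
    ),
]
visible = filter_by_access(retrieved, user_groups={"support"})
print([c.source for c in visible])
```

Filtering after retrieval is the simplest form. It can leave fewer than `top_k` chunks, so where the document store supports filtering on metadata at query time, applying the permission check there is usually the better fit.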
Fine-Tuning Advantages
Advantage               Why it matters
───────────────────────────────────────────────────────────────────
No retrieval latency    Knowledge is in the model weights. No search step.
Consistent behavior     The model "knows" the domain. Doesn't need context.
Style and format        Fine-tuning can teach tone, format, and style
                        that are hard to specify in prompts.
Lower inference cost    Smaller fine-tuned model can replace a larger
                        model + RAG pipeline.
When to Use Which
Use RAG when:
- Knowledge changes frequently (policies, docs, product catalog)
- You need source attribution ("this answer came from document X")
- Different users should see different information
- You have a large knowledge base (1000+ documents)
- You need to add knowledge without retraining
- Accuracy and groundedness are critical
Use fine-tuning when:
- You need a specific output style or format
- The model needs domain-specific vocabulary or reasoning patterns
- Latency is critical and you can't afford the retrieval step
- The knowledge is stable and won't change frequently
- You want a smaller, cheaper model that performs like a larger one
Use BOTH when:
- You need domain-specific style AND up-to-date knowledge
- Fine-tune for format/style, RAG for factual grounding
Real-World RAG Applications
Customer Support
User question → Search knowledge base → Retrieve relevant articles → Generate answer
Knowledge base: 500 support articles, FAQ pages, product docs
Update frequency: Weekly (new articles added, old ones updated)
Why RAG: Policies change, new products launch, answers must cite sources.
Without RAG: Model hallucinates refund policies and product features.
Legal Research
Lawyer query → Search case law database → Retrieve relevant cases → Summarize findings
Knowledge base: 10,000+ court decisions, statutes, regulations
Update frequency: Daily (new decisions filed)
Why RAG: Hallucinated case citations are a professional liability.
A lawyer citing a non-existent case can face sanctions.
Internal Company Search
Employee question → Search internal docs → Retrieve relevant content → Generate answer
Knowledge base: Confluence pages, Notion docs, Slack history, Google Drive
Update frequency: Continuous (documents edited all day)
Why RAG: Company processes, contacts, and policies change constantly.
Fine-tuning would be stale within days.
Code Documentation
Developer question → Search codebase + docs → Retrieve relevant files → Explain
Knowledge base: Source code, README files, API docs, architecture decision records
Update frequency: Every commit
Why RAG: Codebase changes with every PR. A fine-tuned model would need
constant retraining.
How RAG Works Under the Hood
Offline (indexing phase):
Documents
↓
Split into chunks (500-1000 tokens each)
↓
Generate embedding vector for each chunk
↓
Store chunks + vectors in vector database
↓
Index is ready for searching
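The chunking step of the indexing phase might look like the sketch below. It makes two assumptions that the text does not prescribe: chunk size is measured in whitespace-separated words as a rough stand-in for tokens (a real pipeline would count with the embedding model's tokenizer), and consecutive chunks overlap slightly so a sentence cut at a boundary still appears whole in one of them. The function name and defaults are illustrative.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A synthetic 1200-word document: w0 w1 ... w1199
document = " ".join(f"w{i}" for i in range(1200))
chunks = split_into_chunks(document, chunk_size=500, overlap=50)
print([len(c.split()) for c in chunks])
```

Each resulting chunk would then be embedded and stored alongside its vector, as in the remaining indexing steps.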
Online (query phase):
User question
↓
Generate embedding vector for the question
↓
Find chunks with most similar vectors (cosine similarity)
↓
Return top K chunks (typically 3-10)
↓
Include chunks in LLM prompt as context
↓
LLM generates answer grounded in the chunks
The key insight: an embedding model maps text with similar meaning to nearby vectors, so a question lands close to the chunks that answer it. "What is the return policy?" and a chunk containing "Our return policy allows refunds within 30 days" will have vectors that are close together in embedding space, even though they share few exact words.
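To make "close together in embedding space" concrete, here is a small sketch of cosine similarity and top-K ranking in plain Python. The three-dimensional vectors are made up for illustration and do not come from a real embedding model, which would produce hundreds or thousands of dimensions. The names `cosine_similarity` and `rank_chunks` are likewise illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_chunks(
    query: list[float], chunk_vectors: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k chunk ids with the highest cosine similarity to the query."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in chunk_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made toy vectors, NOT output from a real embedding model.
chunk_vectors = {
    "return_policy": [0.8, 0.2, 0.1],
    "q4_revenue": [0.1, 0.9, 0.2],
    "holiday_schedule": [0.0, 0.2, 0.9],
}
query_vector = [0.9, 0.1, 0.0]  # stands in for "What is the return policy?"
ranked = rank_chunks(query_vector, chunk_vectors, k=2)
print(ranked)
```

Vector databases perform the same comparison, typically with approximate nearest-neighbor indexes so that the search stays fast across millions of chunks.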
RAG vs Other Approaches
Approach                Cost     Freshness        Accuracy    Complexity
─────────────────────────────────────────────────────────────────────
Prompt stuffing         Low      Manual update    High        Low
(put docs in prompt)
RAG                     Medium   Automatic        High        Medium
(retrieve + generate)
Fine-tuning             High     Requires         Medium      High
                                 retraining
Knowledge graph         High     Manual update    Very high   Very high
+ RAG
Prompt stuffing (just putting documents in the prompt) works for small knowledge bases that fit within the context window. Once your data exceeds what fits in a single prompt, you need RAG to select the relevant subset.
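Whether a set of documents "fits" can be checked with a simple token budget. The sketch below keeps the best-ranked texts that fit after reserving room for instructions and the model's output. The 8,000-token window and the reserved amounts are hypothetical numbers, and `approx_tokens` counts words as a rough stand-in for a real tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_to_budget(
    ranked_texts: list[str],
    context_window: int,
    reserved_for_instructions: int,
    reserved_for_output: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Keep texts in ranked order until the remaining token budget is used up."""
    budget = context_window - reserved_for_instructions - reserved_for_output
    selected, used = [], 0
    for text in ranked_texts:  # assumed to be sorted best-first
        cost = count_tokens(text)
        if used + cost > budget:
            break
        selected.append(text)
        used += cost
    return selected


def make_docs(n: int) -> list[str]:
    """Build n synthetic documents of 1,000 words each."""
    return [" ".join(["word"] * 1000) for _ in range(n)]


limits = dict(context_window=8000, reserved_for_instructions=500, reserved_for_output=1000)
small_kb = fit_to_budget(make_docs(3), **limits)   # everything fits: stuffing works
large_kb = fit_to_budget(make_docs(10), **limits)  # only part fits: retrieval needed
print(len(small_kb), len(large_kb))
```

When the function returns every document, prompt stuffing is enough. When it drops some, the order of `ranked_texts` decides what the model sees, which is exactly the job of the retrieval step.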
Common Pitfalls
- Skipping evaluation: "It seems to work" is not evaluation. Build a test set of question-answer pairs and measure retrieval quality (are the right documents found?) and answer quality (is the generated answer correct?).
- Wrong chunk size: Too small and chunks lack context. Too big and you waste context window space on irrelevant text. Start with 500-1000 tokens per chunk and adjust based on your data.
- No "I don't know" path: If the retrieved documents do not contain the answer, the model should say so. Without explicit instructions, models will hallucinate an answer using the irrelevant context.
- Ignoring retrieval quality: The generation can only be as good as the retrieval. If the wrong documents are retrieved, the model will produce a confident wrong answer based on those documents.
- Not considering context window limits: If you retrieve 10 chunks of 1000 tokens each, that is 10K tokens of context. Add the system prompt and the model's output, and you may be hitting limits. Budget your context window.
- Treating RAG as a one-time setup: RAG systems need ongoing maintenance. Documents change, user questions evolve, and retrieval quality can degrade. Monitor and iterate.
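For the evaluation pitfall, a retrieval check can start very small. The sketch below computes a hit rate: the fraction of test questions for which at least one expected source shows up in the top K results. The test-set layout, the `retrieve` callable, and the canned results are assumptions made for illustration; in a real system `retrieve` would call your vector store.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one expected source in the top k."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        if set(retrieved) & set(case["expected_sources"]):
            hits += 1
    return hits / len(test_set)


# Canned retrieval results standing in for a real retriever (hypothetical data).
canned = {
    "What is the return window?": ["returns.md", "faq.md"],
    "What was Q4 2025 revenue?": ["earnings-q4-2025.md"],
    "How do I reset my password?": ["billing.md", "faq.md"],
    "Who approves refunds?": ["refunds.md"],
}
test_set = [
    {"question": "What is the return window?", "expected_sources": ["returns.md"]},
    {"question": "What was Q4 2025 revenue?", "expected_sources": ["earnings-q4-2025.md"]},
    {"question": "How do I reset my password?", "expected_sources": ["account.md"]},
    {"question": "Who approves refunds?", "expected_sources": ["refunds.md"]},
]
score = hit_rate_at_k(test_set, retrieve=lambda q: canned[q], k=2)
print(score)
```

This covers only the retrieval half. Answer quality needs its own check, for example comparing generated answers against the reference answers in the same test set.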
Key Takeaways
- RAG combines retrieval (search) with generation (LLM) to ground model outputs in real data. This reduces hallucination and solves the knowledge cutoff problem.
- RAG beats fine-tuning for most knowledge-grounding use cases because it supports real-time updates, source attribution, and access control without retraining.
- The quality of a RAG system depends on both retrieval quality (finding the right documents) and generation quality (synthesizing a correct answer). Evaluate both separately.
- RAG is a standard architecture for applications where the model needs to answer questions about data it was not trained on, such as support bots, internal search, document Q&A, and legal research.
- Always include an "I don't know" instruction. A model that admits uncertainty is more trustworthy than one that confidently generates wrong answers from irrelevant context.