Chunking & Indexing

Overview

Before a RAG system can retrieve relevant documents, those documents must be split into chunks, converted to embeddings, and stored in a searchable index. This process — the indexing pipeline — determines the ceiling of your RAG system's quality. Bad chunking means bad retrieval, and no amount of prompt engineering fixes that.

The core tradeoff: chunks must be small enough to be specific (so you retrieve relevant content, not entire chapters) but large enough to be meaningful (so the retrieved content has enough context to be useful).

The Indexing Pipeline

Raw Documents
  ↓
Document Parsing (PDF, HTML, Markdown → plain text)
  ↓
Chunking (split text into pieces)
  ↓
Embedding (convert chunks to vectors)
  ↓
Storage (save chunks + vectors + metadata in vector store)
  ↓
Index Ready for Search

Each step has decisions that significantly affect quality.

Document Parsing

Before chunking, you need clean text. PDFs are the hardest (varying layouts, tables, OCR quality). HTML needs boilerplate stripped. Markdown is cleanest. Use libraries like PyMuPDF or pdfplumber for PDFs with Tesseract as an OCR fallback. Always clean artifacts: collapse excessive whitespace, strip headers/footers, normalize encoding.
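The cleanup steps named above can be sketched roughly as follows. This is a minimal illustration, not a complete parser: it assumes the text has already been extracted one string per page, and the function name `clean_page_text` and the `min_repeat_ratio` threshold are hypothetical choices. Treating any line that repeats on most pages as a header or footer is a heuristic that can misfire on short documents.

```python
import re
import unicodedata
from collections import Counter


def clean_page_text(pages: list[str], min_repeat_ratio: float = 0.6) -> str:
    """Join per-page text into one cleaned string.

    Lines that appear on at least `min_repeat_ratio` of the pages (and on
    at least two pages) are treated as running headers/footers and dropped.
    """
    # Normalize encoding: NFKC folds ligatures, full-width forms, and
    # non-breaking spaces into their plain equivalents
    pages = [unicodedata.normalize("NFKC", p) for p in pages]

    # Count on how many pages each distinct non-empty line appears
    line_pages = Counter()
    for page in pages:
        line_pages.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = max(2, int(len(pages) * min_repeat_ratio))
    repeated = {ln for ln, n in line_pages.items() if n >= threshold}

    cleaned_pages = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned_pages.append("\n".join(kept))

    text = "\n\n".join(cleaned_pages)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()
```

Running this before chunking keeps repeated page furniture out of every chunk, which otherwise adds noise to the embeddings.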

Chunking Strategies

Fixed-Size Chunking

The simplest approach: split text into chunks of N tokens (or characters) with overlap.

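def chunk_fixed_size(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap.

    Sizes are counted in whitespace-separated words, used here as a rough
    proxy for tokens (a common rule of thumb: 1 token is about 0.75 words).

    Args:
        text: The full document text
        chunk_size: Target words per chunk
        overlap: Number of words shared between consecutive chunks
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # Last chunk reached; avoid emitting a redundant tail chunk
        start = end - overlap  # Step back by overlap amount

    return chunks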

Chunk size guidelines:

  Too small (< 100 tokens):
    - Chunks lack context
    - "The return policy is 30 days" without knowing WHICH return policy
    - Retrieval finds many small fragments, none sufficient
  
  Sweet spot (300-800 tokens):
    - Enough context to be meaningful
    - Specific enough to be relevant
    - Several chunks fit in the context window at once
  
  Too large (> 1500 tokens):
    - Chunks contain multiple topics
    - Retrieval returns partially relevant content
    - Wastes context window on irrelevant text within the chunk
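One way to check a chunking run against these guidelines is to bucket the resulting chunks by estimated size. The sketch below is illustrative only: `estimate_tokens` and `size_report` are hypothetical helpers, and the `tokens_per_word=1.3` ratio is an assumed rule of thumb for English text. For real measurements, count tokens with your embedding model's tokenizer instead.

```python
from collections import Counter


def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from the whitespace word count (assumed ratio)."""
    return round(len(text.split()) * tokens_per_word)


def size_report(chunks: list[str]) -> Counter:
    """Bucket chunks using the thresholds from the guidelines above."""
    report = Counter()
    for chunk in chunks:
        n = estimate_tokens(chunk)
        if n < 100:
            report["too_small"] += 1
        elif n > 1500:
            report["too_large"] += 1
        elif 300 <= n <= 800:
            report["sweet_spot"] += 1
        else:
            # Between the named ranges: not flagged, but worth a look
            report["borderline"] += 1
    return report
```

A report dominated by `too_small` or `too_large` suggests revisiting the chunk size before tuning anything downstream.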

Semantic Chunking

Split at natural boundaries: paragraphs, sections, or topic changes.

import re

def chunk_by_headers(markdown_text: str, max_chunk_size: int = 800) -> list[dict]:
    """Split markdown by headers, respecting section boundaries.

    Each chunk includes its header hierarchy for context.
    Note: max_chunk_size is not enforced here; sections longer than that
    still need a further split (for example, recursive chunking).
    """
    lines = markdown_text.split("\n")
    chunks = []
    current_chunk = []
    current_headers = []  # (level, title) pairs, outermost first

    def flush():
        """Store the accumulated lines as a chunk, then reset the buffer."""
        chunk_text = "\n".join(current_chunk)
        if len(chunk_text.split()) > 20:  # Skip trivially small chunks
            chunks.append({
                "text": chunk_text,
                "headers": list(current_headers),
                # Title of the innermost header this chunk belongs to
                "section": current_headers[-1][1] if current_headers else ""
            })
        current_chunk.clear()

    for line in lines:
        # Detect header lines (levels 1-4)
        header_match = re.match(r"^(#{1,4})\s+(.+)", line)

        if header_match:
            level = len(header_match.group(1))
            title = header_match.group(2)

            # Save the previous section before starting a new one
            if current_chunk:
                flush()

            # Update header hierarchy: drop headers at the same or deeper level
            current_headers = [h for h in current_headers if h[0] < level]
            current_headers.append((level, title))

        current_chunk.append(line)

    # Don't forget the last chunk
    if current_chunk:
        flush()

    return chunks

Recursive Chunking

Try to split at the most natural boundary first, then fall back to smaller boundaries:

Split priority (try each in order):
  1. Double newline (paragraph boundary)
  2. Single newline (line boundary)
  3. Sentence boundary (period + space)
  4. Word boundary (space)
  
If a paragraph is under the chunk size, keep it whole.
If a paragraph exceeds the chunk size, split at sentence boundaries.
If a sentence exceeds the chunk size, split at word boundaries.

Splitters in popular RAG frameworks such as LangChain and LlamaIndex follow this idea (LangChain's RecursiveCharacterTextSplitter is one example), and it works well as a default.
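The split priority above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used by any framework: `recursive_split` and `SEPARATORS` are hypothetical names, size is measured in whitespace-separated words as a stand-in for tokens, and adjacent small pieces are greedily merged back together up to the limit.

```python
import re

# Separators in priority order: paragraph, line, sentence, word
SEPARATORS = [r"\n\n+", r"\n", r"(?<=[.!?])\s+", r"\s+"]
# String used to rejoin merged pieces at each level
JOINERS = ["\n\n", "\n", " ", " "]


def _length(text: str) -> int:
    """Length in whitespace-separated words (a rough token proxy)."""
    return len(text.split())


def recursive_split(text: str, max_words: int = 500, level: int = 0) -> list[str]:
    """Split at the coarsest boundary that yields pieces within max_words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    text = text.strip()
    if not text:
        return []
    if _length(text) <= max_words or level >= len(SEPARATORS):
        return [text]

    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        if _length(piece) > max_words:
            # Still too large: flush the buffer, then retry with a finer separator
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.extend(recursive_split(piece, max_words, level + 1))
        elif buffer and _length(buffer) + _length(piece) > max_words:
            # Adding this piece would overflow: start a new chunk
            chunks.append(buffer)
            buffer = piece
        else:
            buffer = f"{buffer}{JOINERS[level]}{piece}" if buffer else piece
    if buffer:
        chunks.append(buffer)
    return chunks
```

This sketch omits overlap to keep the recursion easy to follow. A paragraph that fits is kept whole, and only oversized pieces descend to sentence or word boundaries.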

Overlap Between Chunks

Overlap ensures that information at chunk boundaries is not lost.

Without overlap:
  Chunk 1: "...the refund is processed within 5-7 business"
  Chunk 2: "days. Exceptions apply for digital products..."
  
  A search for "refund processing time" might find Chunk 1 but miss
  that the sentence continues in Chunk 2.

With overlap (100 tokens):
  Chunk 1: "...the refund is processed within 5-7 business days. 
            Exceptions apply for digital products..."
  Chunk 2: "...within 5-7 business days. Exceptions apply for 
            digital products which are non-refundable..."
  
  Both chunks contain the complete sentence about processing time.

Overlap guidelines:
  - 10-20% of chunk size is typical (e.g., 100 tokens for 500-token chunks)
  - Too much overlap wastes storage and can cause duplicate retrieval
  - Too little overlap risks splitting important passages
  - Overlap at sentence boundaries when possible
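As a rough sketch of the last guideline, the function below packs whole sentences into chunks and carries trailing sentences forward as overlap, so no chunk starts or ends mid-sentence. The name `chunk_sentences` and its parameters are hypothetical, sizes are counted in words rather than tokens, and the regex sentence splitter is deliberately naive (it will mis-split abbreviations such as "e.g."). Both limits are soft: a single long sentence can push a chunk past `max_words`.

```python
import re


def chunk_sentences(text: str, max_words: int = 500, overlap_words: int = 100) -> list[str]:
    """Pack whole sentences into chunks, overlapping at sentence boundaries."""
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be >= 0 and smaller than max_words")

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []

    for sentence in sentences:
        current_len = sum(len(s.split()) for s in current)
        if current and current_len + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in overlap_words
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                n = len(prev.split())
                if carried_len + n > overlap_words:
                    break
                carried.insert(0, prev)
                carried_len += n
            current = carried
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks
```

With the refund example above, the sentence about exceptions would appear intact at the end of one chunk and at the start of the next.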

Metadata

Metadata turns "here's a chunk of text" into "here's a chunk from the HR policy, updated last month, applicable to US employees."

def create_chunk_with_metadata(text: str, source_doc: dict) -> dict:
    """Attach metadata to a chunk for filtering and context.

    Assumes get_embedding(text) is defined elsewhere and returns the
    chunk's embedding vector.
    """
    return {
        "text": text,
        "embedding": get_embedding(text),
        "metadata": {
            "source": source_doc["filename"],
            "title": source_doc["title"],
            "section": source_doc.get("section", ""),
            "author": source_doc.get("author", ""),
            "last_updated": source_doc["modified_date"],
            "document_type": source_doc["type"],  # policy, faq, guide
            "department": source_doc.get("department", ""),
            "language": source_doc.get("language", "en"),
        }
    }

Metadata enables powerful filtering at query time:

# vector_store stands in for your vector database client; the exact
# search signature and filter syntax vary between products.

# "What is the PTO policy?" → search only HR policy documents
results = vector_store.search(
    query_vector=query_embedding,
    top_k=5,
    filter={"department": "hr", "document_type": "policy"}
)

# "What changed in the API docs last week?" → filter by recency
# (last_week is a timestamp in the same format as last_updated)
results = vector_store.search(
    query_vector=query_embedding,
    top_k=5,
    filter={"document_type": "api_docs", "last_updated": {"$gte": last_week}}
)
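To make the effect of such a filter concrete, here is a minimal in-memory sketch of filtered vector search. It is not how any particular vector database works internally: `cosine`, `matches`, and `search_chunks` are hypothetical helpers, only equality and a `$gte` condition are supported, and dates are assumed to be ISO-8601 strings so that string comparison orders them correctly.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def matches(metadata: dict, conditions: dict) -> bool:
    """True if metadata satisfies every condition (equality or {"$gte": v})."""
    for key, cond in conditions.items():
        value = metadata.get(key)
        if isinstance(cond, dict):
            if "$gte" in cond and (value is None or value < cond["$gte"]):
                return False
        elif value != cond:
            return False
    return True


def search_chunks(
    chunks: list[dict],
    query_vector: list[float],
    top_k: int = 5,
    metadata_filter: Optional[dict] = None,
) -> list[dict]:
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [c for c in chunks if matches(c["metadata"], metadata_filter or {})]
    candidates.sort(key=lambda c: cosine(c["embedding"], query_vector), reverse=True)
    return candidates[:top_k]
```

Filtering before ranking means a scoped question never competes with near-duplicates from other departments or outdated versions.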

Hybrid Search: Keyword + Semantic

Semantic search (embeddings) captures meaning but can miss exact terms. Keyword search captures exact matches but misses synonyms. Combining both usually retrieves better results than either alone, because each covers the other's blind spot.

Query: "Error code E-4021"

Semantic search alone:
  Might find documents about "error handling" or "troubleshooting"
  but miss the specific error code because the embedding doesn't
  encode exact strings well.

Keyword search alone:
  Finds documents containing "E-4021" exactly, but misses documents
  that describe the same error by its description without the code.

Hybrid search:
  Combines both result sets. Documents matching the exact code AND
  documents about the described error are both returned and ranked.

The standard combination technique is Reciprocal Rank Fusion (RRF): run both searches, give each result a score of 1 / (k + rank) for every list it appears in (k is a smoothing constant, commonly 60), sum the scores, and sort by the total. Many vector databases, including Weaviate and Qdrant, support hybrid search natively.
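A minimal sketch of RRF, assuming each search returns an ordered list of document IDs (best first). The function name and the example IDs are illustrative; `k=60` is the constant commonly used for RRF, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


semantic_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical embedding search results
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical keyword search results
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Because RRF uses only ranks, it sidesteps the problem that keyword scores and cosine similarities live on different scales. Documents that appear in both lists (here `doc_a` and `doc_c`) rise above documents found by only one method.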

Keeping the Index Fresh

Strategy            How it works                          When to use
──────────────────────────────────────────────────────────────────────────────────
Full rebuild        Re-index everything periodically      Rarely (weekly/monthly)
Incremental update  Re-index only changed documents       Most common approach
Change detection    Watch for file changes and            When real-time freshness
                    re-index automatically                is needed

Incremental updates are the standard: track document modification times, delete old chunks when a document changes, and re-index only the changed document. Match index freshness to how often your data actually changes.
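The incremental approach can be sketched as below. Note one substitution: this sketch detects changes with a content hash rather than modification times, which avoids re-indexing files that were touched but not edited. The index is a plain dict from document ID to chunks, standing in for a real vector store, and `sync_index`, `manifest`, and `chunker` are hypothetical names.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_index(
    documents: dict[str, str],
    index: dict[str, list[str]],
    manifest: dict[str, str],
    chunker: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Update index and manifest in place; return which documents changed."""
    changes: dict[str, list[str]] = {"added": [], "updated": [], "deleted": []}

    # Drop chunks for documents that no longer exist
    for doc_id in list(manifest):
        if doc_id not in documents:
            index.pop(doc_id, None)
            del manifest[doc_id]
            changes["deleted"].append(doc_id)

    for doc_id, text in documents.items():
        digest = content_hash(text)
        if manifest.get(doc_id) == digest:
            continue  # Unchanged: skip re-chunking and re-embedding
        changes["updated" if doc_id in manifest else "added"].append(doc_id)
        index[doc_id] = chunker(text)  # Replaces all old chunks for this document
        manifest[doc_id] = digest

    return changes
```

Replacing every chunk of a changed document, rather than diffing individual chunks, keeps the logic simple and ensures no stale chunk survives an edit.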

Common Pitfalls

One chunk size for all content: A 500-token chunk works for prose but destroys code (splits functions mid-body) and tables (separates headers from data). Adapt chunking strategy to content type.
No overlap: Chunks that split mid-sentence or mid-paragraph lose critical context at boundaries. Always use overlap, typically 10-20% of chunk size.
Ignoring metadata: Without metadata, you cannot filter by source, date, or department. This means every query searches everything, returning irrelevant results for scoped questions.
Not testing chunk quality: After chunking, manually inspect 20-30 chunks. Are they coherent? Do they contain complete thoughts? Would a human find them useful as context for answering questions?
Embedding stale data: If your documents change weekly but your index updates monthly, users get outdated answers. Match index freshness to data change frequency.
Skipping keyword search: Pure semantic search fails on exact terms (error codes, product SKUs, names). Hybrid search catches these cases with minimal additional complexity.

Key Takeaways

Chunk size is one of the most impactful parameters in a RAG system. Start around 500 tokens (within the 300-800 sweet spot), use overlap of 10-20%, and adjust based on your data and evaluation results.
Semantic chunking (splitting at natural boundaries like headers and paragraphs) produces better chunks than fixed-size splitting for structured documents.
Metadata on chunks enables filtering, which dramatically improves retrieval relevance for scoped queries (e.g., "search only in HR policies").
Hybrid search (keyword + semantic) usually outperforms either method alone. Keyword search catches exact terms; semantic search catches meaning.
The indexing pipeline needs maintenance. Documents change, and your index must stay fresh. Incremental updates are the standard approach.