6 min read
On this page

Information Retrieval

Overview

Information retrieval (IR) finds relevant documents from a large collection in response to a query. IR is the backbone of search engines, question answering systems, and retrieval-augmented generation. The field has evolved from Boolean and vector space models through learning-to-rank to neural dense retrieval.


Classical Retrieval Models

Boolean Retrieval

Queries are Boolean expressions over terms. Documents either match or do not.

query: "machine" AND "learning" AND NOT "deep"
  • Binary relevance: no ranking
  • Exact match: no fuzzy matching or synonyms
  • Still used in legal and patent search where precision matters
  • Foundation for inverted index data structures

Vector Space Model

Represent both queries and documents as vectors in term space; rank by similarity.

score(q, d) = cos(q, d) = (q . d) / (|q| * |d|)
  • TF-IDF weighting for both query and document vectors
  • Allows partial matching and ranking
  • Efficient with sparse vector operations

BM25

The standard probabilistic ranking function, extending TF-IDF.

BM25(q, d) = sum_{t in q} IDF(t) * (f(t,d) * (k1 + 1)) / (f(t,d) + k1 * (1 - b + b * |d|/avgdl))
  • f(t,d): term frequency of t in document d
  • k1: term frequency saturation parameter (typically 1.2-2.0)
  • b: length normalization parameter (typically 0.75)
  • Sublinear TF scaling (diminishing returns for repeated terms)
  • Length normalization prevents bias toward long documents
  • Remains a strong baseline; often competitive with or complementary to neural methods

Inverted Index

The core data structure for efficient text retrieval.

Structure:

term -> [(doc_id_1, freq_1, [positions]), (doc_id_2, freq_2, [positions]), ...]

Construction:

  1. Tokenize and normalize all documents
  2. For each term, record which documents contain it and where
  3. Sort posting lists by document ID for efficient intersection

Query processing:

  • AND: intersect posting lists
  • OR: union posting lists
  • Phrase queries: check position information for adjacency
  • Skip pointers and compression (variable-byte, PForDelta) for efficiency

Scale: Modern search engines index billions of documents with inverted indices distributed across thousands of machines.


Learning to Rank (LTR)

Machine learning models that learn to rank documents given handcrafted features.

Feature Categories

Category Examples
Query-document BM25 score, TF-IDF cosine, query term coverage
Document PageRank, document length, freshness, domain authority
Query Query length, query type classification
Interaction Click-through rate, dwell time

Approaches

Approach Loss Function Description
Pointwise Regression/classification Predict relevance score per document independently
Pairwise Hinge/logistic on pairs Learn to correctly order document pairs (RankNet, LambdaRank)
Listwise Likelihood of permutation Optimize ranking metric directly (ListNet, LambdaMART)

LambdaMART (gradient-boosted trees with lambda gradients) was the dominant LTR method for a decade and remains widely deployed.


Neural Retrieval

Dense Passage Retrieval (DPR)

Encode queries and passages into dense vectors; retrieve by approximate nearest neighbor search.

query -> BERT encoder -> q vector (768d)
passage -> BERT encoder -> p vector (768d)
score = q . p  (dot product)

Training:

  • Contrastive loss: positive passage should have higher score than negatives
  • Hard negatives (BM25 top results that are not relevant) are critical for training quality
  • In-batch negatives provide efficient training signal

Inference:

  • Encode all passages offline and build an ANN index (FAISS, ScaNN)
  • Encode query online, search the index
  • Retrieval in milliseconds over millions of passages

ColBERT (Contextualized Late Interaction)

A middle ground between bi-encoder (DPR) and cross-encoder approaches.

query tokens -> BERT -> token embeddings [q_1, ..., q_m]
document tokens -> BERT -> token embeddings [d_1, ..., d_n]
score = sum_i max_j (q_i . d_j)  [MaxSim]
  • Each query token attends to its best-matching document token
  • Richer interaction than single-vector dot product
  • Document token embeddings can be precomputed and indexed
  • ColBERTv2: residual compression reduces storage by 6-10x

SPLADE (Sparse Lexical and Expansion)

Learns sparse representations using BERT, combining neural power with inverted index efficiency.

Input -> BERT -> MLM head -> log(1 + ReLU(logits)) -> sparse vector
  • Produces sparse vectors in vocabulary space (one weight per vocab term)
  • Enables term expansion: model activates semantically related terms not in the input
  • Can use standard inverted indices for retrieval
  • Competitive with dense retrieval while being more interpretable and efficient

Comparison of Neural Retrieval Methods

Method Representation Index Interaction Latency
BM25 Sparse (exact terms) Inverted index Lexical match Very fast
DPR Dense (single vector) ANN index Dot product Fast
ColBERT Dense (per-token) Token-level ANN MaxSim Moderate
SPLADE Learned sparse Inverted index Weighted match Fast
Cross-encoder N/A None (pairwise) Full attention Slow

Cross-Encoder Reranking

A cross-encoder processes the query and document together through a transformer, enabling full token-level interaction.

[CLS] query [SEP] document [SEP] -> BERT -> classification score

Properties:

  • Most accurate scoring method (full attention between query and document tokens)
  • Too slow for first-stage retrieval (must score every query-document pair)
  • Used as a reranker: first retrieve top-100 with BM25/DPR, then rerank with cross-encoder

Multi-stage pipeline:

Corpus (millions) -> BM25/DPR -> top-100 -> Cross-encoder reranker -> top-10

This retrieve-then-rerank pattern is standard in production systems.


Retrieval-Augmented Generation (RAG)

Combines retrieval with language model generation for knowledge-intensive tasks.

Architecture

Query -> Retriever -> Top-k documents -> LLM (query + documents as context) -> Answer

Design Decisions

Decision Options
Retriever BM25, DPR, ColBERT, hybrid
Chunk size 100-500 tokens per chunk
Top-k 3-10 passages
Context integration Concatenate, summarize, or iteratively retrieve
Generator BART, T5, GPT, LLaMA

Advanced RAG Patterns

  • Hybrid retrieval: Combine sparse (BM25) and dense (DPR) scores with reciprocal rank fusion
  • Query rewriting: LLM reformulates the query for better retrieval
  • Hypothetical Document Embeddings (HyDE): Generate a hypothetical answer, embed it, retrieve similar real documents
  • Iterative retrieval: Retrieve, generate partial answer, retrieve again with refined query
  • Self-RAG: Model decides when retrieval is needed and critiques its own output

Chunking Strategies

Strategy Description
Fixed-size Split by token count (e.g., 256 tokens) with overlap
Sentence-based Split at sentence boundaries
Paragraph-based Use document structure
Semantic Cluster sentences by topic/similarity
Recursive Split hierarchically by headings, paragraphs, sentences

Evaluation

Retrieval Metrics

Metric Definition
Precision@k Fraction of top-k results that are relevant
Recall@k Fraction of all relevant documents in top-k
MAP (Mean Average Precision) Average precision across recall levels, averaged over queries
MRR (Mean Reciprocal Rank) Average of 1/rank of the first relevant result
NDCG@k Normalized discounted cumulative gain; accounts for graded relevance

Benchmarks

Benchmark Task Scale
MS MARCO Passage/document ranking 8.8M passages
BEIR Zero-shot retrieval (18 datasets) Diverse domains
TREC DL Deep learning track MS MARCO subsets
Natural Questions Open-domain QA retrieval Wikipedia
MTEB Massive text embedding benchmark 56 datasets

Key Takeaways

  • BM25 with inverted indices remains a strong, efficient baseline for text retrieval
  • Dense retrieval (DPR) enables semantic matching but requires ANN indices and careful training
  • ColBERT and SPLADE bridge dense and sparse approaches with different trade-offs
  • Cross-encoder reranking provides the highest accuracy as a second stage after efficient first-stage retrieval
  • RAG connects retrieval to generation, enabling LLMs to access external knowledge
  • Hybrid retrieval (sparse + dense) consistently outperforms either approach alone