
Information Retrieval

Overview

Information retrieval (IR) finds relevant documents from a large collection in response to a query. IR is the backbone of search engines, question answering systems, and retrieval-augmented generation. The field has evolved from Boolean and vector space models through learning-to-rank to neural dense retrieval.


Classical Retrieval Models

Boolean Retrieval

Queries are Boolean expressions over terms. Documents either match or do not.

query: "machine" AND "learning" AND NOT "deep"
  • Binary relevance: no ranking
  • Exact match: no fuzzy matching or synonyms
  • Still used in legal and patent search where precision matters
  • Foundation for inverted index data structures

Vector Space Model

Represent both queries and documents as vectors in term space; rank by similarity.

score(q, d) = cos(q, d) = (q . d) / (|q| * |d|)
  • TF-IDF weighting for both query and document vectors
  • Allows partial matching and ranking
  • Efficient with sparse vector operations
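The cosine formula above can be sketched directly over sparse vectors. A minimal illustration, assuming queries and documents are represented as `term -> weight` dicts (the TF-IDF weighting itself is left out):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse weighted vectors (dicts: term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())  # only shared terms contribute
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Iterating over the shorter query vector and probing the document dict is what makes this efficient for sparse representations: cost scales with the number of query terms, not the vocabulary size.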

BM25

The standard probabilistic ranking function, extending TF-IDF.

BM25(q, d) = sum_{t in q} IDF(t) * (f(t,d) * (k1 + 1)) / (f(t,d) + k1 * (1 - b + b * |d|/avgdl))
  • f(t,d): term frequency of t in document d
  • k1: term frequency saturation parameter (typically 1.2-2.0)
  • b: length normalization parameter (typically 0.75)
  • Sublinear TF scaling (diminishing returns for repeated terms)
  • Length normalization prevents bias toward long documents
  • Remains a strong baseline; often competitive with or complementary to neural methods
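A toy implementation of the scoring function above, using one common smoothed IDF variant (exact IDF formulas differ across BM25 implementations, so treat this as a sketch):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document for a query.

    query_terms: list of query tokens
    doc_terms:   list of tokens in the document
    doc_freqs:   dict term -> number of documents containing the term
    """
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0)
        if f == 0:
            continue
        df = doc_freqs.get(t, 0)
        # One common smoothed IDF; variants exist across implementations.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturating TF with length normalization, exactly as in the formula above.
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

Note how the `b * dl / avgdl` term penalizes documents longer than average, and how the `k1` denominator caps the contribution of repeated terms.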

Inverted Index

The core data structure for efficient text retrieval.

Structure:

term -> [(doc_id_1, freq_1, [positions]), (doc_id_2, freq_2, [positions]), ...]

Construction:

  1. Tokenize and normalize all documents
  2. For each term, record which documents contain it and where
  3. Sort posting lists by document ID for efficient intersection

Query processing:

  • AND: intersect posting lists
  • OR: union posting lists
  • Phrase queries: check position information for adjacency
  • Skip pointers and compression (variable-byte, PForDelta) for efficiency

Scale: Modern search engines index billions of documents with inverted indices distributed across thousands of machines.
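The construction and AND-query steps above can be sketched in a few lines. A minimal in-memory version without positions, skip pointers, or compression:

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of tokens.
    Returns term -> posting list of (doc_id, freq), sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    # Sorted posting lists enable efficient merge-based intersection at scale.
    return {t: sorted(postings.items()) for t, postings in index.items()}

def and_query(index, terms):
    """AND query: doc ids containing every term (set intersection of postings)."""
    postings = [set(d for d, _ in index.get(t, [])) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)
```

A production system would intersect the sorted lists directly (walking them in doc-id order, possibly with skip pointers) rather than materializing sets, but the logic is the same.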


Learning to Rank (LTR)

Machine learning models that learn to rank documents given handcrafted features.

Feature Categories

| Category | Examples |
|---|---|
| Query-document | BM25 score, TF-IDF cosine, query term coverage |
| Document | PageRank, document length, freshness, domain authority |
| Query | Query length, query type classification |
| Interaction | Click-through rate, dwell time |

Approaches

| Approach | Loss Function | Description |
|---|---|---|
| Pointwise | Regression/classification | Predict relevance score per document independently |
| Pairwise | Hinge/logistic on pairs | Learn to correctly order document pairs (RankNet, LambdaRank) |
| Listwise | Likelihood of permutation | Optimize ranking metric directly (ListNet, LambdaMART) |

LambdaMART (gradient-boosted trees with lambda gradients) was the dominant LTR method for a decade and remains widely deployed.


Neural Retrieval

Dense Passage Retrieval (DPR)

Encode queries and passages into dense vectors; retrieve by approximate nearest neighbor search.

query -> BERT encoder -> q vector (768d)
passage -> BERT encoder -> p vector (768d)
score = q . p  (dot product)

Training:

  • Contrastive loss: positive passage should have higher score than negatives
  • Hard negatives (BM25 top results that are not relevant) are critical for training quality
  • In-batch negatives provide efficient training signal
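The in-batch negatives objective can be written down concretely. A pure-Python sketch (no autograd, so it only illustrates the loss being minimized, not the training loop): for query i, passage i is the positive and every other passage in the batch serves as a negative.

```python
import math

def in_batch_negative_loss(q_vecs, p_vecs):
    """Mean negative log-softmax of the positive passage score.
    q_vecs[i] and p_vecs[i] are aligned (query i's positive is passage i)."""
    n = len(q_vecs)
    loss = 0.0
    for i in range(n):
        # Dot-product scores of query i against every passage in the batch.
        scores = [sum(a * b for a, b in zip(q_vecs[i], p)) for p in p_vecs]
        m = max(scores)  # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss += log_z - scores[i]  # -log softmax(score of the positive)
    return loss / n
```

With a batch of size B, each query gets B-1 negatives for free, which is why in-batch negatives are such an efficient training signal.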

Inference:

  • Encode all passages offline and build an ANN index (FAISS, ScaNN)
  • Encode query online, search the index
  • Retrieval in milliseconds over millions of passages

ColBERT (Contextualized Late Interaction)

A middle ground between bi-encoder (DPR) and cross-encoder approaches.

query tokens -> BERT -> token embeddings [q_1, ..., q_m]
document tokens -> BERT -> token embeddings [d_1, ..., d_n]
score = sum_i max_j (q_i . d_j)  [MaxSim]
  • Each query token attends to its best-matching document token
  • Richer interaction than single-vector dot product
  • Document token embeddings can be precomputed and indexed
  • ColBERTv2: residual compression reduces storage by 6-10x
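The MaxSim operator above is simple to state in code. A toy version over plain Python lists of token embeddings:

```python
def maxsim(query_embs, doc_embs):
    """ColBERT late interaction: each query token takes the dot-product score
    of its best-matching document token; scores are summed over query tokens."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_embs)
        for q in query_embs
    )
```

Because the max is taken per query token, a document only needs one good match per query token to score well, which is what gives ColBERT its finer-grained matching than a single-vector dot product.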

SPLADE (Sparse Lexical and Expansion Model)

Learns sparse representations using BERT, combining neural power with inverted index efficiency.

Input -> BERT -> MLM head -> log(1 + ReLU(logits)) -> sparse vector
  • Produces sparse vectors in vocabulary space (one weight per vocab term)
  • Enables term expansion: model activates semantically related terms not in the input
  • Can use standard inverted indices for retrieval
  • Competitive with dense retrieval while being more interpretable and efficient
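The `log(1 + ReLU(logits))` activation above is what produces the sparse vocabulary-space vector. A minimal sketch of that final step, assuming the MLM-head logits are already pooled into one score per vocabulary entry:

```python
import math

def splade_activation(logits):
    """Map per-vocab logits to a sparse weight vector.
    ReLU zeroes out non-positive logits; log(1 + x) dampens large activations.
    Returns a sparse dict: vocab_index -> weight (zeros omitted)."""
    return {i: math.log1p(x) for i, x in enumerate(logits) if x > 0}
```

Terms with positive logits (including expansion terms never seen in the input) get nonzero weights; everything else is dropped, so the result plugs straight into an inverted index.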

Comparison of Neural Retrieval Methods

| Method | Representation | Index | Interaction | Latency |
|---|---|---|---|---|
| BM25 | Sparse (exact terms) | Inverted index | Lexical match | Very fast |
| DPR | Dense (single vector) | ANN index | Dot product | Fast |
| ColBERT | Dense (per-token) | Token-level ANN | MaxSim | Moderate |
| SPLADE | Learned sparse | Inverted index | Weighted match | Fast |
| Cross-encoder | N/A | None (pairwise) | Full attention | Slow |


Cross-Encoder Reranking

A cross-encoder processes the query and document together through a transformer, enabling full token-level interaction.

[CLS] query [SEP] document [SEP] -> BERT -> classification score

Properties:

  • Most accurate scoring method (full attention between query and document tokens)
  • Too slow for first-stage retrieval (must score every query-document pair)
  • Used as a reranker: first retrieve top-100 with BM25/DPR, then rerank with cross-encoder

Multi-stage pipeline:

Corpus (millions) -> BM25/DPR -> top-100 -> Cross-encoder reranker -> top-10

This retrieve-then-rerank pattern is standard in production systems.
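The pipeline shape is easy to express generically. A sketch with placeholder scoring callables (`first_stage_score` and `rerank_score` here stand in for BM25/DPR and a cross-encoder; the names are illustrative, not a real API):

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score, k1=100, k2=10):
    """Two-stage retrieval: cheap scorer over the whole corpus,
    expensive scorer only over the k1 survivors."""
    # Stage 1: score every document with the cheap function, keep top-k1.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)[:k1]
    # Stage 2: rerank only the candidates with the expensive function, keep top-k2.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k2]
```

The cross-encoder is invoked at most k1 times per query regardless of corpus size, which is what makes full-attention scoring affordable.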


Retrieval-Augmented Generation (RAG)

Combines retrieval with language model generation for knowledge-intensive tasks.

Architecture

Query -> Retriever -> Top-k documents -> LLM (query + documents as context) -> Answer

Design Decisions

| Decision | Options |
|---|---|
| Retriever | BM25, DPR, ColBERT, hybrid |
| Chunk size | 100-500 tokens per chunk |
| Top-k | 3-10 passages |
| Context integration | Concatenate, summarize, or iteratively retrieve |
| Generator | BART, T5, GPT, LLaMA |

Advanced RAG Patterns

  • Hybrid retrieval: Combine sparse (BM25) and dense (DPR) scores with reciprocal rank fusion
  • Query rewriting: LLM reformulates the query for better retrieval
  • Hypothetical Document Embeddings (HyDE): Generate a hypothetical answer, embed it, retrieve similar real documents
  • Iterative retrieval: Retrieve, generate partial answer, retrieve again with refined query
  • Self-RAG: Model decides when retrieval is needed and critiques its own output
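Reciprocal rank fusion, mentioned for hybrid retrieval above, is a one-liner worth seeing concretely. Each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in (k = 60 is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids.
    rankings: list of ranked lists (best first). Returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not raw scores) are used, RRF needs no score calibration between the sparse and dense retrievers, which is a big part of its appeal.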

Chunking Strategies

| Strategy | Description |
|---|---|
| Fixed-size | Split by token count (e.g., 256 tokens) with overlap |
| Sentence-based | Split at sentence boundaries |
| Paragraph-based | Use document structure |
| Semantic | Cluster sentences by topic/similarity |
| Recursive | Split hierarchically by headings, paragraphs, sentences |


Evaluation

Retrieval Metrics

| Metric | Definition |
|---|---|
| Precision@k | Fraction of top-k results that are relevant |
| Recall@k | Fraction of all relevant documents in top-k |
| MAP (Mean Average Precision) | Mean over queries of average precision (precision evaluated at each relevant result's rank) |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of the first relevant result |
| NDCG@k | Normalized discounted cumulative gain; accounts for graded relevance |
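Two of the metrics above are compact enough to implement directly, as a sketch:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked doc ids that appear in the relevant set."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def mrr(rankings, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first relevant result,
    counting 0 for queries with no relevant result retrieved."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant_sets):
        for rank, d in enumerate(ranked, start=1):
            if d in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

MRR is well suited to tasks with a single correct answer (e.g., MS MARCO passage ranking), while NDCG@k is preferred when relevance judgments are graded.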

Benchmarks

| Benchmark | Task | Scale |
|---|---|---|
| MS MARCO | Passage/document ranking | 8.8M passages |
| BEIR | Zero-shot retrieval (18 datasets) | Diverse domains |
| TREC DL | Deep learning track | MS MARCO subsets |
| Natural Questions | Open-domain QA retrieval | Wikipedia |
| MTEB | Massive text embedding benchmark | 56 datasets |


Key Takeaways

  • BM25 with inverted indices remains a strong, efficient baseline for text retrieval
  • Dense retrieval (DPR) enables semantic matching but requires ANN indices and careful training
  • ColBERT and SPLADE bridge dense and sparse approaches with different trade-offs
  • Cross-encoder reranking provides the highest accuracy as a second stage after efficient first-stage retrieval
  • RAG connects retrieval to generation, enabling LLMs to access external knowledge
  • Hybrid retrieval (sparse + dense) consistently outperforms either approach alone