
Information Retrieval

Overview

Information retrieval (IR) finds relevant documents from a large collection in response to a query. IR is the backbone of search engines, question answering systems, and retrieval-augmented generation. The field has evolved from Boolean and vector space models through learning-to-rank to neural dense retrieval.


Classical Retrieval Models

Boolean Retrieval

Queries are Boolean expressions over terms. Documents either match or do not.

query: "machine" AND "learning" AND NOT "deep"
  • Binary relevance: no ranking
  • Exact match: no fuzzy matching or synonyms
  • Still used in legal and patent search where precision matters
  • Foundation for inverted index data structures

Vector Space Model

Represent both queries and documents as vectors in term space; rank by similarity.

score(q, d) = cos(q, d) = (q . d) / (|q| * |d|)
  • TF-IDF weighting for both query and document vectors
  • Allows partial matching and ranking
  • Efficient with sparse vector operations
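The cosine formula above can be sketched directly over sparse vectors. A minimal illustration, assuming queries and documents are represented as `term -> weight` dicts (the TF-IDF weighting itself is left out):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse weighted vectors (dicts: term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())  # only shared terms contribute
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Iterating over the shorter query vector and probing the document dict is what makes this efficient for sparse representations: cost scales with the number of query terms, not the vocabulary size.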

BM25

The standard probabilistic ranking function, extending TF-IDF.

BM25(q, d) = sum_{t in q} IDF(t) * (f(t,d) * (k1 + 1)) / (f(t,d) + k1 * (1 - b + b * |d|/avgdl))
  • f(t,d): term frequency of t in document d
  • k1: term frequency saturation parameter (typically 1.2-2.0)
  • b: length normalization parameter (typically 0.75)
  • Sublinear TF scaling (diminishing returns for repeated terms)
  • Length normalization prevents bias toward long documents
  • Remains a strong baseline; often competitive with or complementary to neural methods
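A toy implementation of the scoring function above, using one common smoothed IDF variant (exact IDF formulas differ across BM25 implementations, so treat this as a sketch):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k1=1.5, b=0.75):
    """BM25 score of one document for a query.

    query_terms: list of query tokens
    doc_terms:   list of tokens in the document
    doc_freqs:   dict term -> number of documents containing the term
    """
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0)
        if f == 0:
            continue
        df = doc_freqs.get(t, 0)
        # One common smoothed IDF; variants exist across implementations.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Saturating TF with length normalization, exactly as in the formula above.
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

Note how the `b * dl / avgdl` term penalizes documents longer than average, and how the `k1` denominator caps the contribution of repeated terms.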

Inverted Index

The core data structure for efficient text retrieval.

Structure:

term -> [(doc_id_1, freq_1, [positions]), (doc_id_2, freq_2, [positions]), ...]

Construction:

  1. Tokenize and normalize all documents
  2. For each term, record which documents contain it and where
  3. Sort posting lists by document ID for efficient intersection

Query processing:

  • AND: intersect posting lists
  • OR: union posting lists
  • Phrase queries: check position information for adjacency
  • Skip pointers and compression (variable-byte, PForDelta) for efficiency

Scale: Modern search engines index billions of documents with inverted indices distributed across thousands of machines.
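The construction and AND-query steps above can be sketched in a few lines. A minimal in-memory version without positions, skip pointers, or compression:

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of tokens.
    Returns term -> posting list of (doc_id, freq), sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    # Sorted posting lists enable efficient merge-based intersection at scale.
    return {t: sorted(postings.items()) for t, postings in index.items()}

def and_query(index, terms):
    """AND query: doc ids containing every term (set intersection of postings)."""
    postings = [set(d for d, _ in index.get(t, [])) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)
```

A production system would intersect the sorted lists directly (walking them in doc-id order, possibly with skip pointers) rather than materializing sets, but the logic is the same.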


Learning to Rank (LTR)

Machine learning models that learn to rank documents given handcrafted features.

Feature Categories

| Category | Examples |
|---|---|
| Query-document | BM25 score, TF-IDF cosine, query term coverage |
| Document | PageRank, document length, freshness, domain authority |
| Query | Query length, query type classification |
| Interaction | Click-through rate, dwell time |

Approaches

| Approach | Loss Function | Description |
|---|---|---|
| Pointwise | Regression/classification | Predict relevance score per document independently |
| Pairwise | Hinge/logistic on pairs | Learn to correctly order document pairs (RankNet, LambdaRank) |
| Listwise | Likelihood of permutation | Optimize ranking metric directly (ListNet, LambdaMART) |

LambdaMART (gradient-boosted trees with lambda gradients) was the dominant LTR method for a decade and remains widely deployed.


Neural Retrieval

Dense Passage Retrieval (DPR)

Encode queries and passages into dense vectors; retrieve by approximate nearest neighbor search.

query -> BERT encoder -> q vector (768d)
passage -> BERT encoder -> p vector (768d)
score = q . p  (dot product)

Training:

  • Contrastive loss: positive passage should have higher score than negatives
  • Hard negatives (BM25 top results that are not relevant) are critical for training quality
  • In-batch negatives provide efficient training signal
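The in-batch negatives objective can be written down concretely. A pure-Python sketch (no autograd, so it only illustrates the loss being minimized, not the training loop): for query i, passage i is the positive and every other passage in the batch serves as a negative.

```python
import math

def in_batch_negative_loss(q_vecs, p_vecs):
    """Mean negative log-softmax of the positive passage score.
    q_vecs[i] and p_vecs[i] are aligned (query i's positive is passage i)."""
    n = len(q_vecs)
    loss = 0.0
    for i in range(n):
        # Dot-product scores of query i against every passage in the batch.
        scores = [sum(a * b for a, b in zip(q_vecs[i], p)) for p in p_vecs]
        m = max(scores)  # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss += log_z - scores[i]  # -log softmax(score of the positive)
    return loss / n
```

With a batch of size B, each query gets B-1 negatives for free, which is why in-batch negatives are such an efficient training signal.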

Inference:

  • Encode all passages offline and build an ANN index (FAISS, ScaNN)
  • Encode query online, search the index
  • Retrieval in milliseconds over millions of passages

ColBERT (Contextualized Late Interaction)

A middle ground between bi-encoder (DPR) and cross-encoder approaches.

query tokens -> BERT -> token embeddings [q_1, ..., q_m]
document tokens -> BERT -> token embeddings [d_1, ..., d_n]
score = sum_i max_j (q_i . d_j)  [MaxSim]
  • Each query token attends to its best-matching document token
  • Richer interaction than single-vector dot product
  • Document token embeddings can be precomputed and indexed
  • ColBERTv2: residual compression reduces storage by 6-10x
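The MaxSim operator above is simple to state in code. A toy version over plain Python lists of token embeddings:

```python
def maxsim(query_embs, doc_embs):
    """ColBERT late interaction: each query token takes the dot-product score
    of its best-matching document token; scores are summed over query tokens."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_embs)
        for q in query_embs
    )
```

Because the max is taken per query token, a document only needs one good match per query token to score well, which is what gives ColBERT its finer-grained matching than a single-vector dot product.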

SPLADE (Sparse Lexical and Expansion Model)

Learns sparse representations using BERT, combining neural power with inverted index efficiency.

Input -> BERT -> MLM head -> log(1 + ReLU(logits)) -> sparse vector
  • Produces sparse vectors in vocabulary space (one weight per vocab term)
  • Enables term expansion: model activates semantically related terms not in the input
  • Can use standard inverted indices for retrieval
  • Competitive with dense retrieval while being more interpretable and efficient
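The `log(1 + ReLU(logits))` activation above is what produces the sparse vocabulary-space vector. A minimal sketch of that final step, assuming the MLM-head logits are already pooled into one score per vocabulary entry:

```python
import math

def splade_activation(logits):
    """Map per-vocab logits to a sparse weight vector.
    ReLU zeroes out non-positive logits; log(1 + x) dampens large activations.
    Returns a sparse dict: vocab_index -> weight (zeros omitted)."""
    return {i: math.log1p(x) for i, x in enumerate(logits) if x > 0}
```

Terms with positive logits (including expansion terms never seen in the input) get nonzero weights; everything else is dropped, so the result plugs straight into an inverted index.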

Comparison of Neural Retrieval Methods

| Method | Representation | Index | Interaction | Latency |
|---|---|---|---|---|
| BM25 | Sparse (exact terms) | Inverted index | Lexical match | Very fast |
| DPR | Dense (single vector) | ANN index | Dot product | Fast |
| ColBERT | Dense (per-token) | Token-level ANN | MaxSim | Moderate |
| SPLADE | Learned sparse | Inverted index | Weighted match | Fast |
| Cross-encoder | N/A | None (pairwise) | Full attention | Slow |


Cross-Encoder Reranking

A cross-encoder processes the query and document together through a transformer, enabling full token-level interaction.

[CLS] query [SEP] document [SEP] -> BERT -> classification score

Properties:

  • Most accurate scoring method (full attention between query and document tokens)
  • Too slow for first-stage retrieval (must score every query-document pair)
  • Used as a reranker: first retrieve top-100 with BM25/DPR, then rerank with cross-encoder

Multi-stage pipeline:

Corpus (millions) -> BM25/DPR -> top-100 -> Cross-encoder reranker -> top-10

This retrieve-then-rerank pattern is standard in production systems.
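The pipeline shape is easy to express generically. A sketch with placeholder scoring callables (`first_stage_score` and `rerank_score` here stand in for BM25/DPR and a cross-encoder; the names are illustrative, not a real API):

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score, k1=100, k2=10):
    """Two-stage retrieval: cheap scorer over the whole corpus,
    expensive scorer only over the k1 survivors."""
    # Stage 1: score every document with the cheap function, keep top-k1.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)[:k1]
    # Stage 2: rerank only the candidates with the expensive function, keep top-k2.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k2]
```

The cross-encoder is invoked at most k1 times per query regardless of corpus size, which is what makes full-attention scoring affordable.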


Retrieval-Augmented Generation (RAG)

Combines retrieval with language model generation for knowledge-intensive tasks.

Architecture

Query -> Retriever -> Top-k documents -> LLM (query + documents as context) -> Answer

Design Decisions

| Decision | Options |
|---|---|
| Retriever | BM25, DPR, ColBERT, hybrid |
| Chunk size | 100-500 tokens per chunk |
| Top-k | 3-10 passages |
| Context integration | Concatenate, summarize, or iteratively retrieve |
| Generator | BART, T5, GPT, LLaMA |

Advanced RAG Patterns

  • Hybrid retrieval: Combine sparse (BM25) and dense (DPR) scores with reciprocal rank fusion
  • Query rewriting: LLM reformulates the query for better retrieval
  • Hypothetical Document Embeddings (HyDE): Generate a hypothetical answer, embed it, retrieve similar real documents
  • Iterative retrieval: Retrieve, generate partial answer, retrieve again with refined query
  • Self-RAG: Model decides when retrieval is needed and critiques its own output
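Reciprocal rank fusion, mentioned for hybrid retrieval above, is a one-liner worth seeing concretely. Each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in (k = 60 is the commonly used constant):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids.
    rankings: list of ranked lists (best first). Returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not raw scores) are used, RRF needs no score calibration between the sparse and dense retrievers, which is a big part of its appeal.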

Chunking Strategies

| Strategy | Description |
|---|---|
| Fixed-size | Split by token count (e.g., 256 tokens) with overlap |
| Sentence-based | Split at sentence boundaries |
| Paragraph-based | Use document structure |
| Semantic | Cluster sentences by topic/similarity |
| Recursive | Split hierarchically by headings, paragraphs, sentences |


Evaluation

Retrieval Metrics

| Metric | Definition |
|---|---|
| Precision@k | Fraction of top-k results that are relevant |
| Recall@k | Fraction of all relevant documents in top-k |
| MAP (Mean Average Precision) | Mean over queries of average precision (precision evaluated at each relevant result's rank) |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of the first relevant result |
| NDCG@k | Normalized discounted cumulative gain; accounts for graded relevance |
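Two of the metrics above are compact enough to implement directly, as a sketch:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked doc ids that appear in the relevant set."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def mrr(rankings, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first relevant result,
    counting 0 for queries with no relevant result retrieved."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant_sets):
        for rank, d in enumerate(ranked, start=1):
            if d in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```

MRR is well suited to tasks with a single correct answer (e.g., MS MARCO passage ranking), while NDCG@k is preferred when relevance judgments are graded.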

Benchmarks

| Benchmark | Task | Scale |
|---|---|---|
| MS MARCO | Passage/document ranking | 8.8M passages |
| BEIR | Zero-shot retrieval (18 datasets) | Diverse domains |
| TREC DL | Deep learning track | MS MARCO subsets |
| Natural Questions | Open-domain QA retrieval | Wikipedia |
| MTEB | Massive text embedding benchmark | 56 datasets |


Key Takeaways

  • BM25 with inverted indices remains a strong, efficient baseline for text retrieval
  • Dense retrieval (DPR) enables semantic matching but requires ANN indices and careful training
  • ColBERT and SPLADE bridge dense and sparse approaches with different trade-offs
  • Cross-encoder reranking provides the highest accuracy as a second stage after efficient first-stage retrieval
  • RAG connects retrieval to generation, enabling LLMs to access external knowledge
  • Hybrid retrieval (sparse + dense) consistently outperforms either approach alone