
Search Engines

Overview

Search engines enable efficient full-text retrieval, relevance ranking, and increasingly, semantic vector search over large document collections. The core data structure is the inverted index, which maps terms to the documents containing them. Modern search systems combine lexical (keyword) and semantic (vector) approaches for hybrid search.


Inverted Index

The inverted index is the fundamental data structure for full-text search. It maps each unique term to a posting list of documents (and positions) where the term appears.

Documents:
  doc1: "the quick brown fox"
  doc2: "the lazy brown dog"
  doc3: "quick fox jumps"

Inverted Index:
  Term     -> Posting List (docID: [positions])
  "brown"  -> [(doc1: [2]), (doc2: [2])]
  "dog"    -> [(doc2: [3])]
  "fox"    -> [(doc1: [3]), (doc3: [1])]
  "jumps"  -> [(doc3: [2])]
  "lazy"   -> [(doc2: [1])]
  "quick"  -> [(doc1: [1]), (doc3: [0])]
  "the"    -> [(doc1: [0]), (doc2: [0])]

Storage optimization:
  - Posting lists sorted by docID for efficient merging
  - Delta encoding: [1, 5, 8, 15] -> [1, 4, 3, 7]
  - Compressed with variable-byte or PForDelta encoding
  - Skip lists for fast intersection of large posting lists
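Delta and variable-byte encoding can be sketched as follows (using one common convention where the high bit marks the final byte of each value; some implementations invert this):

```python
def delta_encode(doc_ids):
    # sorted doc IDs -> first ID followed by gaps; small gaps compress well
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(n):
    # 7 payload bits per byte; high bit set on the final byte
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if byte & 0x80:
            nums.append(n)
            n, shift = 0, 0
    return nums
```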

Boolean Retrieval

Query: "quick AND brown"
  Intersect posting lists:
    quick: [doc1, doc3]
    brown: [doc1, doc2]
    Result: [doc1]

Query: "quick OR fox"
  Union posting lists:
    quick: [doc1, doc3]
    fox:   [doc1, doc3]
    Result: [doc1, doc3]

Phrase query: "quick fox"
  Intersect posting lists, then check positions:
    quick: [(doc1: [1]), (doc3: [0])]
    fox:   [(doc1: [3]), (doc3: [1])]
    Check: fox.position = quick.position + 1
    doc3: quick@0, fox@1 -> match
    doc1: quick@1, fox@3 -> no match (gap of 2)
    Result: [doc3]
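The AND merge and the positional phrase check above can be sketched in Python:

```python
def intersect(p1, p2):
    # AND: merge-walk two sorted doc-ID lists
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_match(first_positions, second_positions):
    # phrase "A B": some occurrence of B directly follows an occurrence of A
    later = set(second_positions)
    return any(p + 1 in later for p in first_positions)
```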

Tokenization and Analysis

The analysis pipeline transforms raw text into searchable terms.

Analysis Pipeline:
  Raw Text: "The Quick-Brown FOX's"
      |
  Character Filter: strip HTML, normalize Unicode
      |  "The Quick-Brown FOX's"
      v
  Tokenizer: split into tokens
      |  ["The", "Quick", "Brown", "FOX's"]
      v
  Token Filters (chain):
    Lowercase:     ["the", "quick", "brown", "fox's"]
    Possessive:    ["the", "quick", "brown", "fox"]
    Stop words:    ["quick", "brown", "fox"]
    Stemming:      ["quick", "brown", "fox"]
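A toy version of this pipeline (stemming omitted, with a small assumed stop-word list):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}

def analyze(text):
    tokens = re.findall(r"[A-Za-z0-9']+", text)           # tokenizer: split on punctuation
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t.removesuffix("'s") for t in tokens]       # possessive filter
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word filter
    return tokens
```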

Common Analyzers

// Elasticsearch custom analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"],
          "char_filter": ["html_strip"]
        }
      }
    }
  }
}

| Tokenizer | Input | Output |
|-----------|-------|--------|
| Standard | "Quick-Brown Fox" | ["Quick", "Brown", "Fox"] |
| Whitespace | "Quick-Brown Fox" | ["Quick-Brown", "Fox"] |
| N-gram (2,3) | "fox" | ["fo", "ox", "fox"] |
| Edge N-gram | "fox" | ["f", "fo", "fox"] |


Relevance Scoring

TF-IDF

Term Frequency - Inverse Document Frequency measures how important a term is to a document within a corpus.

TF(t, d)  = frequency of term t in document d
IDF(t)    = log(N / df(t))
             N = total documents, df(t) = documents containing t

Score(t, d) = TF(t, d) * IDF(t)

Example:
  Corpus: 10,000 documents
  Query term "database":
    appears 5 times in doc1 (100 words)
    appears in 500 documents total

  TF  = 5 / 100 = 0.05
  IDF = log(10000 / 500) = log(20) = 3.0
  Score = 0.05 * 3.0 = 0.15
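The worked example as code:

```python
import math

def tf_idf(term_freq, doc_len, n_docs, doc_freq):
    tf = term_freq / doc_len            # relative frequency in the document
    idf = math.log(n_docs / doc_freq)   # rarity across the corpus
    return tf * idf

score = tf_idf(term_freq=5, doc_len=100, n_docs=10_000, doc_freq=500)
```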

BM25 (Best Matching 25)

The standard relevance ranking function in modern search engines. Improves on TF-IDF with term frequency saturation and document length normalization.

BM25(d, Q) = SUM over q in Q:
  IDF(q) * (f(q,d) * (k1 + 1)) / (f(q,d) + k1 * (1 - b + b * |d|/avgdl))

Parameters:
  k1 = 1.2  (term frequency saturation; higher = more weight to TF)
  b  = 0.75 (length normalization; 0 = none, 1 = full)
  f(q,d) = raw term frequency of q in d
  |d|    = document length
  avgdl  = average document length

Key improvements over TF-IDF:
  - TF saturation: going from 1->2 occurrences matters more than 10->11
  - Length normalization: longer documents don't get unfair advantage
  - IDF: log((N - df + 0.5) / (df + 0.5) + 1) avoids negative values
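A minimal BM25 scorer over in-memory token lists, using the smoothed IDF above (a sketch, not a production implementation):

```python
import math

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    # doc and corpus entries are lists of analyzed terms
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for q in set(query_terms):
        df = sum(1 for d in corpus if q in d)
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # always non-negative
        f = doc.count(q)
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * (f * (k1 + 1)) / (f + norm)
    return score

corpus = [["quick", "brown", "fox"], ["lazy", "brown", "dog"], ["quick", "fox", "jumps"]]
```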

Vector Search

Vector search finds documents by semantic similarity using dense vector embeddings, enabling retrieval based on meaning rather than keyword overlap.

HNSW (Hierarchical Navigable Small World)

The most widely used approximate nearest neighbor (ANN) algorithm. Builds a multi-layer graph where upper layers have long-range connections for fast navigation and lower layers have short-range connections for precision.

Layer 2:  [A] --------- [D]           (few nodes, long edges)
           |              |
Layer 1:  [A] --- [C] -- [D] -- [F]   (more nodes)
           |    /  |       |  \   |
Layer 0:  [A]-[B]-[C]-[D]-[E]-[F]-[G] (all nodes, short edges)

Search for nearest neighbor of query Q:
  1. Enter at top layer, greedy walk to nearest node
  2. Drop to next layer, continue greedy walk
  3. At layer 0, explore local neighborhood
  4. Return top-k nearest nodes

Parameters:
  M:       max connections per node per layer (default 16)
  efConstruction: beam width during build (default 200)
  efSearch: beam width during query (trade-off: speed vs recall)

Complexity: O(log N) search, O(N * M * log N) build
Memory:     O(N * M * d) where d = vector dimension
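The greedy walk at the heart of HNSW, reduced to a single layer (real HNSW descends through layers and keeps a beam of efSearch candidates at layer 0):

```python
def greedy_search(graph, vectors, query, entry):
    # walk to the neighbor closest to the query until no neighbor improves
    def d(node):
        return sum((a - b) ** 2 for a, b in zip(vectors[node], query))
    current = entry
    while True:
        best = min(graph[current], key=d)
        if d(best) >= d(current):
            return current
        current = best

# toy layer: nodes on a line, each linked to its immediate neighbors
vectors = [(0.0,), (1.0,), (2.0,), (3.0,)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```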

IVF (Inverted File Index)

Partitions the vector space into clusters (Voronoi cells) using k-means. At query time, only the nearest clusters are searched.

Build:
  1. Cluster N vectors into K centroids (k-means)
  2. Assign each vector to its nearest centroid
  3. Build inverted list per centroid

Search:
  1. Find nprobe nearest centroids to query vector
  2. Search only vectors in those clusters
  3. Return top-k results

Trade-offs:
  K=1024, nprobe=16 -> search 1.6% of data, ~95% recall
  K=4096, nprobe=64 -> search 1.6% of data, ~98% recall

IVF-PQ: combine with Product Quantization for memory efficiency
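An IVF sketch with pre-supplied centroids (the k-means training step is omitted):

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    # assign every vector to its nearest centroid's inverted list
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: dist2(v, centroids[i]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    # scan only the nprobe clusters whose centroids are nearest the query
    probe = sorted(range(len(centroids)), key=lambda i: dist2(query, centroids[i]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: dist2(query, vectors[vid]))[:k]

centroids = [(0.0, 0.0), (10.0, 10.0)]   # assumed pre-trained
vectors = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
lists = build_ivf(vectors, centroids)
```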

Product Quantization (PQ)

Compresses vectors by splitting them into subspaces and quantizing each subspace independently.

Original vector (128-dim float32): 512 bytes

PQ with 8 subspaces, 256 centroids each:
  Split: [16-dim] [16-dim] [16-dim] ... [16-dim]  (8 subvectors)
  Each subvector -> nearest centroid ID (1 byte)
  Compressed: 8 bytes (64x compression)

Distance computation:
  Precompute distance table: query subvector to all centroids
  Table size: 8 subspaces * 256 centroids = 2048 entries
  Approximate distance = sum of 8 table lookups
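PQ encoding and table-based asymmetric distance on a toy 4-dimensional example (codebooks are assumed pre-trained; real systems learn them with k-means per subspace):

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vec, codebooks):
    # quantize each subvector to the ID of its nearest centroid
    sub = len(vec) // len(codebooks)
    return [min(range(len(book)), key=lambda c: dist2(vec[i*sub:(i+1)*sub], book[c]))
            for i, book in enumerate(codebooks)]

def pq_distance(query, code, codebooks):
    # asymmetric distance: precompute query-to-centroid tables, then sum lookups
    sub = len(query) // len(codebooks)
    tables = [[dist2(query[i*sub:(i+1)*sub], centroid) for centroid in book]
              for i, book in enumerate(codebooks)]
    return sum(tables[i][c] for i, c in enumerate(code))

# 4-dim vectors, 2 subspaces, 2 centroids per codebook
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 2.0)]]
code = pq_encode((1.0, 1.0, 2.0, 2.0), codebooks)
```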

Hybrid Search

Hybrid search combines lexical (BM25) and semantic (vector) retrieval, yielding better quality than either approach alone.

Query: "how to handle database connection pool exhaustion"

BM25 retrieves:  documents with exact terms "connection pool exhaustion"
Vector retrieves: documents about "running out of DB connections",
                  "too many open connections", "connection leak"

Fusion strategies:
  1. Reciprocal Rank Fusion (RRF):
     score(d) = SUM over rankings R:
       1 / (k + rank_R(d))     where k=60 typically

  2. Weighted linear combination:
     score(d) = alpha * normalize(bm25_score) +
                (1-alpha) * normalize(vector_score)

  3. Re-ranking: retrieve top-N from each, re-rank with cross-encoder
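RRF is simple to implement; a sketch with hypothetical result lists:

```python
def rrf(rankings, k=60):
    # reciprocal rank fusion over several ranked lists of doc IDs
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]
knn_hits = ["d1", "d5", "d3"]
fused = rrf([bm25_hits, knn_hits])
```

Documents that appear in both rankings ("d1", "d3") accumulate score from each and rise to the top.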

Elasticsearch / OpenSearch Architecture

Cluster Architecture

Cluster:
  Master Node:    cluster state, shard allocation, index management
  Data Nodes:     store shards, execute queries
  Coordinating:   route requests, merge results
  Ingest Nodes:   pre-process documents (pipelines)

Index Structure:
  Index "products" (5 primary shards, 1 replica each):
    Shard 0: [Node A (primary)] [Node B (replica)]
    Shard 1: [Node B (primary)] [Node C (replica)]
    Shard 2: [Node C (primary)] [Node A (replica)]
    Shard 3: [Node A (primary)] [Node C (replica)]
    Shard 4: [Node B (primary)] [Node A (replica)]

  Document routing: shard = hash(doc._id) % num_primary_shards

Query Execution

Search request flow:
  1. Client -> Coordinating node
  2. Coordinating broadcasts query to all relevant shards
  3. Each shard executes query locally, returns top-K doc IDs + scores
  4. Coordinating merges results (priority queue), selects global top-K
  5. Coordinating fetches full documents from relevant shards (fetch phase)
  6. Return results to client

Two-phase: Query Phase (scatter-gather scores) + Fetch Phase (get docs)
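The coordinator's merge in step 4 amounts to a top-k selection over all shard hits (sketch; scores here are hypothetical):

```python
import heapq

def merge_shard_results(shard_results, k):
    # query phase: each shard returns (score, doc_id); coordinator keeps global top-k
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shards = [
    [(2.0, "a"), (1.0, "b")],   # shard 0's local top-k
    [(3.0, "c"), (0.5, "d")],   # shard 1's local top-k
]
```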

Indexing and Segments

// Index a document
PUT /products/_doc/1
{
  "name": "Wireless Headphones",
  "description": "Noise-canceling Bluetooth headphones with 30h battery",
  "price": 149.99,
  "category": "electronics",
  "embedding": [0.12, -0.34, 0.56, ...]  // dense vector for kNN
}

// Search with BM25 + kNN hybrid
GET /products/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "description": "wireless noise canceling" } },
        { "knn": {
            "field": "embedding",
            "query_vector": [0.11, -0.33, 0.55, ...],
            "num_candidates": 100,
            "boost": 0.5
        }}
      ]
    }
  }
}

Lucene Internals

Apache Lucene is the library underlying Elasticsearch, OpenSearch, and Solr.

Segment Architecture

Lucene Index = collection of immutable segments

Write path:
  1. Documents buffered in memory (IndexWriter)
  2. Flush: write as new segment to disk
  3. Segments are immutable once written
  4. Deletes: recorded in .del bitset (not physical deletion)
  5. Merge: background process combines segments, purges deletes

Segment contents:
  .tim / .tip  - term dictionary + term index (FST: finite state transducer)
  .doc         - posting lists (doc IDs + frequencies)
  .pos         - term positions
  .dvd / .dvm  - doc values (columnar, for sorting/aggregation)
  .kdd / .kdm  - BKD trees (numeric/geo range queries)
  .vec / .vex  - HNSW vector index

Finite State Transducer (FST)

Lucene uses FSTs as a compact in-memory dictionary mapping terms to metadata (posting list offsets). FSTs share common prefixes and suffixes, achieving significant compression over hash maps or tries.

Terms: ["cat", "car", "card", "care"]

FST (shared prefix "ca"):
       c -> a -> t  (cat)
              -> r  (car)
                 -> d (card)
                 -> e (care)

Memory: O(total unique characters) vs O(total string bytes) for hash map
Lookup: O(term length)
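A prefix-sharing trie conveys the idea (a real FST also shares suffixes and distributes output values along arcs):

```python
def trie_insert(root, term, value):
    node = root
    for ch in term:
        node = node.setdefault(ch, {})   # shared prefixes reuse existing nodes
    node["$"] = value                    # terminal marker, e.g. posting-list offset

def trie_lookup(root, term):
    node = root
    for ch in term:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

root = {}
for offset, term in enumerate(["cat", "car", "card", "care"]):
    trie_insert(root, term, offset)
```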

Applications and Selection Guide

| Use Case | Approach | System |
|----------|----------|--------|
| Full-text search | BM25 + inverted index | Elasticsearch, OpenSearch |
| E-commerce product search | BM25 + facets + filters | Elasticsearch, Solr |
| Semantic/similarity search | HNSW vectors | Pinecone, Weaviate, Milvus |
| Hybrid (keyword + semantic) | BM25 + kNN + RRF | Elasticsearch 8+, OpenSearch |
| Log analytics | Full-text + aggregations | Elasticsearch, OpenSearch |
| Autocomplete | Edge n-grams + prefix queries | Elasticsearch, Typesense |