Distributed Search

A single-node search engine works until it does not. When indexes grow beyond one machine's memory, queries exceed one machine's throughput, or availability requirements demand redundancy, search must be distributed. This subtopic covers how production search systems shard, replicate, route, and manage indexes across clusters of machines.

Index Sharding

Sharding splits a large index into smaller pieces distributed across multiple nodes. Each shard holds a subset of documents and can answer queries independently.

Document-Based Sharding

The most common approach assigns each document to a shard based on a hash of its document ID.

Sharding by document ID:
  shard = hash(document_id) % num_shards

  Shard 0: documents {1, 5, 9, 13, ...}
  Shard 1: documents {2, 6, 10, 14, ...}
  Shard 2: documents {3, 7, 11, 15, ...}
  Shard 3: documents {4, 8, 12, 16, ...}

  Query "wireless headphones":
    Must query ALL shards (any shard could have matching docs)
    Merge results from all shards by score

This is how Elasticsearch works. Every query fans out to every shard, and a coordinating node merges the results. The tradeoff is query fan-out: every search hits every shard.

Term-Based Sharding

An alternative assigns terms (words) to shards rather than documents. Each shard holds the complete posting list for its assigned terms.

Sharding by term:
  Shard 0: terms starting with a-f
  Shard 1: terms starting with g-m
  Shard 2: terms starting with n-s
  Shard 3: terms starting with t-z

  Query "wireless headphones":
    Shard 3 handles "wireless", Shard 1 handles "headphones"
    Only 2 shards queried instead of all 4

Term-based sharding reduces query fan-out but complicates indexing (a single document touches multiple shards) and makes scoring harder because TF-IDF and BM25 need corpus-wide statistics that are now split across shards. Google used term-based sharding in early web search but most modern systems prefer document-based sharding for its simplicity.

Choosing Shard Count

Factors for shard count:
  - Index size per shard: target 10-50 GB per shard for Elasticsearch
  - Query latency: more shards = more fan-out overhead
  - Indexing throughput: more shards = more parallel write capacity
  - Future growth: resharding is expensive, over-provision slightly

  Example:
    500 GB total index, target 25 GB per shard = 20 primary shards
    Expected 3x growth over 2 years = consider 40 shards upfront

Elasticsearch sets the shard count at index creation and cannot change it without reindexing. Over-sharding wastes resources (each shard has fixed overhead). Under-sharding limits future scaling. Getting this right matters.

Replication

Replication copies each shard to multiple nodes for fault tolerance and read throughput.

Primary-Replica Model

Elasticsearch replication:
  Primary shard: handles writes, replicates to replicas
  Replica shard: serves reads, promoted to primary if primary fails

  Index with 3 primary shards, 1 replica each:
    Node A: [Primary 0] [Replica 2]
    Node B: [Primary 1] [Replica 0]
    Node C: [Primary 2] [Replica 1]

  If Node B fails:
    Replica 1 on Node C promoted to Primary 1
    Cluster still serves all queries
    New replica created on remaining nodes to restore redundancy

Read Scaling with Replicas

Every replica can serve search queries independently. Adding replicas increases read throughput linearly.

Throughput scaling:
  1 replica per shard: 2x read capacity (primary + replica)
  2 replicas per shard: 3x read capacity

  Cost: each replica doubles storage for that shard

  Strategy: use more replicas for read-heavy workloads
  Elasticsearch default: 1 replica per shard

LinkedIn runs Elasticsearch clusters with varying replica counts based on query load. High-traffic indexes like people search use more replicas than low-traffic indexes like admin logs.

Consistency Tradeoffs

Writes go to the primary shard and replicate to replicas. The question is whether to wait for replication before acknowledging the write.

Write consistency options:
  wait_for_active_shards=1: acknowledge after primary write (fast, risk of data loss)
  wait_for_active_shards=all: acknowledge after all replicas confirm (slow, durable)
  wait_for_active_shards=quorum: acknowledge after majority confirms (balanced)

  For search indexing, eventual consistency is usually acceptable.
  A document indexed 1 second ago not yet searchable is fine.
  A document permanently lost is not.

Query Routing & Scatter-Gather

Distributed search follows a scatter-gather pattern: a coordinator scatters the query to all relevant shards, each shard searches locally, and the coordinator gathers and merges results.

Two-Phase Query Execution

Phase 1 - Query (scatter):
  Coordinator receives search request
  Forwards query to all shards (or relevant subset)
  Each shard returns top-K document IDs + scores (lightweight)

Phase 2 - Fetch (gather):
  Coordinator merges and sorts all shard results
  Selects global top-K
  Fetches full documents only for the final results
  Returns complete results to client

This two-phase approach avoids transferring full documents from every shard. Only the final top-K results require a fetch.

Coordinator Overhead

Merge cost at coordinator:
  20 shards, top 100 per shard = merge 2,000 scored entries
  Sort 2,000 entries to find global top 10

  For pagination (page 50 of results):
  Each shard must return top 500 (50 pages * 10 results)
  Coordinator merges 10,000 entries
  Deep pagination is expensive in distributed search

Elasticsearch addresses deep pagination with search_after cursors and point-in-time snapshots instead of traditional offset-based pagination.

Adaptive Replica Selection

Not all replicas respond equally fast. Adaptive routing sends queries to the replica with the lowest response time or queue depth.

Replica selection strategies:
  Round-robin: simple, ignores node load differences
  Least-busy: route to replica with fewest in-flight queries
  Adaptive: track response time per replica, prefer faster nodes

  Elasticsearch uses adaptive replica selection by default since v7.
  This handles heterogeneous hardware and temporary slowdowns.

Near-Real-Time Indexing

Production search systems must make new content searchable within seconds, not minutes.

Refresh & Flush

Elasticsearch indexing pipeline:
  1. Document arrives at primary shard
  2. Written to in-memory buffer + transaction log (translog)
  3. Refresh (default: every 1 second):
     - In-memory buffer flushed to a new Lucene segment
     - Segment is searchable but not yet on disk
  4. Flush (periodic or when translog is large):
     - Segments written to disk
     - Translog cleared

  The 1-second refresh interval is why Elasticsearch is "near-real-time"
  rather than real-time. Documents are searchable ~1 second after indexing.

Bulk Indexing

For large ingestion jobs, optimize for throughput over latency.

Bulk indexing optimizations:
  - Increase refresh_interval to 30s or disable during bulk load
  - Set number_of_replicas to 0, restore after load completes
  - Use bulk API (batch 1,000-5,000 documents per request)
  - Disable merge throttling during initial load

  Result: 10-50x faster indexing for batch loads
  Tradeoff: documents not searchable until refresh

GitHub rebuilt their entire code search index (billions of documents) using bulk indexing with these optimizations. During normal operation, they use standard near-real-time indexing so new code is searchable within seconds of being pushed.

Index Lifecycle Management

Indexes have lifecycles, especially for time-series data like logs and events.

Time-Based Indexes

Log data pattern:
  logs-2026-04-14  (today's index, actively written)
  logs-2026-04-13  (yesterday, read-only, fully merged)
  logs-2026-04-12  (read-only, maybe fewer replicas)
  ...
  logs-2026-03-15  (cold, moved to cheaper storage)
  logs-2026-01-*   (archived or deleted per retention policy)

Hot-warm-cold architecture:
  Hot nodes:  SSDs, high CPU, current + recent indexes
  Warm nodes: HDDs, moderate CPU, older indexes (read-only)
  Cold nodes: cheapest storage, rarely queried archives

Aliases & Rollover

Index alias pattern:
  Alias "logs-write" --> logs-2026-04-14 (current write target)
  Alias "logs-read"  --> logs-2026-04-* (all April indexes)

  Rollover: when logs-2026-04-14 hits 50GB or 1 day old:
    1. Create logs-2026-04-15
    2. Point "logs-write" alias to new index
    3. Old index becomes read-only and moves to warm tier

  Applications always write to "logs-write" and never need to know
  the actual index name.

Datadog manages petabytes of observability data using time-based indexes with automated rollover and tiered storage. Their hot tier uses NVMe SSDs for recent data, while older data moves to object storage with a caching layer.

Common Pitfalls

Over-sharding small indexes. Each shard has fixed memory and file handle overhead. An index with 100,000 documents does not need 20 shards. Start small and scale up.
Deep pagination with from/size. Requesting page 1,000 requires every shard to return 10,000 results. Use search_after or scroll APIs for deep result sets.
Ignoring shard balance. Uneven document distribution causes hot spots. One shard with 10x the documents of others dominates query latency.
Not monitoring segment count. Too many small segments degrade search performance. Ensure merge policies keep segment count manageable.
Treating search clusters like databases. Search engines trade consistency for speed. Do not use Elasticsearch as a primary data store. Keep the source of truth in a database and index into search asynchronously.
Scaling by adding shards instead of replicas. If the bottleneck is query throughput (not index size), adding replicas is cheaper and simpler than resharding.

Key Takeaways

Document-based sharding is the standard approach. Every query fans out to all shards, and a coordinator merges results using a two-phase scatter-gather pattern.
Replicas provide both fault tolerance and read scaling. Add replicas for throughput; add shards for capacity.
Near-real-time indexing makes documents searchable within about 1 second. Bulk loads should disable this for throughput.
Deep pagination is inherently expensive in distributed search. Use cursor-based pagination for large result sets.
Time-based indexes with hot-warm-cold tiering are the standard pattern for log and event data, balancing query performance with storage cost.