Distributed Search
A single-node search engine works until it does not. When indexes grow beyond one machine's memory, queries exceed one machine's throughput, or availability requirements demand redundancy, search must be distributed. This subtopic covers how production search systems shard, replicate, route, and manage indexes across clusters of machines.
Index Sharding
Sharding splits a large index into smaller pieces distributed across multiple nodes. Each shard holds a subset of documents and can answer queries independently.
Document-Based Sharding
The most common approach assigns each document to a shard based on a hash of its document ID.
Sharding by document ID:
shard = hash(document_id) % num_shards
Shard 0: documents {1, 5, 9, 13, ...}
Shard 1: documents {2, 6, 10, 14, ...}
Shard 2: documents {3, 7, 11, 15, ...}
Shard 3: documents {4, 8, 12, 16, ...}
Query "wireless headphones":
Must query ALL shards (any shard could have matching docs)
Merge results from all shards by score
This is how Elasticsearch works. Every query fans out to every shard, and a coordinating node merges the results. The tradeoff is query fan-out: every search hits every shard.
Term-Based Sharding
An alternative assigns terms (words) to shards rather than documents. Each shard holds the complete posting list for its assigned terms.
Sharding by term:
Shard 0: terms starting with a-f
Shard 1: terms starting with g-m
Shard 2: terms starting with n-s
Shard 3: terms starting with t-z
Query "wireless headphones":
Shard 3 handles "wireless", Shard 1 handles "headphones"
Only 2 shards queried instead of all 4
Term-based sharding reduces query fan-out but complicates indexing (a single document touches multiple shards) and makes scoring harder because TF-IDF and BM25 need corpus-wide statistics that are now split across shards. Google used term-based sharding in early web search but most modern systems prefer document-based sharding for its simplicity.
Choosing Shard Count
Factors for shard count:
- Index size per shard: target 10-50 GB per shard for Elasticsearch
- Query latency: more shards = more fan-out overhead
- Indexing throughput: more shards = more parallel write capacity
- Future growth: resharding is expensive, over-provision slightly
Example:
500 GB total index, target 25 GB per shard = 20 primary shards
Expected 3x growth over 2 years = consider 40 shards upfront
Elasticsearch sets the shard count at index creation and cannot change it without reindexing. Over-sharding wastes resources (each shard has fixed overhead). Under-sharding limits future scaling. Getting this right matters.
Replication
Replication copies each shard to multiple nodes for fault tolerance and read throughput.
Primary-Replica Model
Elasticsearch replication:
Primary shard: handles writes, replicates to replicas
Replica shard: serves reads, promoted to primary if primary fails
Index with 3 primary shards, 1 replica each:
Node A: [Primary 0] [Replica 2]
Node B: [Primary 1] [Replica 0]
Node C: [Primary 2] [Replica 1]
If Node B fails:
Replica 1 on Node C promoted to Primary 1
Cluster still serves all queries
New replica created on remaining nodes to restore redundancy
Read Scaling with Replicas
Every replica can serve search queries independently. Adding replicas increases read throughput linearly.
Throughput scaling:
1 replica per shard: 2x read capacity (primary + replica)
2 replicas per shard: 3x read capacity
Cost: each replica doubles storage for that shard
Strategy: use more replicas for read-heavy workloads
Elasticsearch default: 1 replica per shard
LinkedIn runs Elasticsearch clusters with varying replica counts based on query load. High-traffic indexes like people search use more replicas than low-traffic indexes like admin logs.
Consistency Tradeoffs
Writes go to the primary shard and replicate to replicas. The question is whether to wait for replication before acknowledging the write.
Write consistency options:
wait_for_active_shards=1: acknowledge after primary write (fast, risk of data loss)
wait_for_active_shards=all: acknowledge after all replicas confirm (slow, durable)
wait_for_active_shards=quorum: acknowledge after majority confirms (balanced)
For search indexing, eventual consistency is usually acceptable.
A document indexed 1 second ago not yet searchable is fine.
A document permanently lost is not.
Query Routing & Scatter-Gather
Distributed search follows a scatter-gather pattern: a coordinator scatters the query to all relevant shards, each shard searches locally, and the coordinator gathers and merges results.
Two-Phase Query Execution
Phase 1 - Query (scatter):
Coordinator receives search request
Forwards query to all shards (or relevant subset)
Each shard returns top-K document IDs + scores (lightweight)
Phase 2 - Fetch (gather):
Coordinator merges and sorts all shard results
Selects global top-K
Fetches full documents only for the final results
Returns complete results to client
This two-phase approach avoids transferring full documents from every shard. Only the final top-K results require a fetch.
Coordinator Overhead
Merge cost at coordinator:
20 shards, top 100 per shard = merge 2,000 scored entries
Sort 2,000 entries to find global top 10
For pagination (page 50 of results):
Each shard must return top 500 (50 pages * 10 results)
Coordinator merges 10,000 entries
Deep pagination is expensive in distributed search
Elasticsearch addresses deep pagination with search_after cursors and point-in-time snapshots instead of traditional offset-based pagination.
Adaptive Replica Selection
Not all replicas respond equally fast. Adaptive routing sends queries to the replica with the lowest response time or queue depth.
Replica selection strategies:
Round-robin: simple, ignores node load differences
Least-busy: route to replica with fewest in-flight queries
Adaptive: track response time per replica, prefer faster nodes
Elasticsearch uses adaptive replica selection by default since v7.
This handles heterogeneous hardware and temporary slowdowns.
Near-Real-Time Indexing
Production search systems must make new content searchable within seconds, not minutes.
Refresh & Flush
Elasticsearch indexing pipeline:
1. Document arrives at primary shard
2. Written to in-memory buffer + transaction log (translog)
3. Refresh (default: every 1 second):
- In-memory buffer flushed to a new Lucene segment
- Segment is searchable but not yet on disk
4. Flush (periodic or when translog is large):
- Segments written to disk
- Translog cleared
The 1-second refresh interval is why Elasticsearch is "near-real-time"
rather than real-time. Documents are searchable ~1 second after indexing.
Bulk Indexing
For large ingestion jobs, optimize for throughput over latency.
Bulk indexing optimizations:
- Increase refresh_interval to 30s or disable during bulk load
- Set number_of_replicas to 0, restore after load completes
- Use bulk API (batch 1,000-5,000 documents per request)
- Disable merge throttling during initial load
Result: 10-50x faster indexing for batch loads
Tradeoff: documents not searchable until refresh
GitHub rebuilt their entire code search index (billions of documents) using bulk indexing with these optimizations. During normal operation, they use standard near-real-time indexing so new code is searchable within seconds of being pushed.
Index Lifecycle Management
Indexes have lifecycles, especially for time-series data like logs and events.
Time-Based Indexes
Log data pattern:
logs-2026-04-14 (today's index, actively written)
logs-2026-04-13 (yesterday, read-only, fully merged)
logs-2026-04-12 (read-only, maybe fewer replicas)
...
logs-2026-03-15 (cold, moved to cheaper storage)
logs-2026-01-* (archived or deleted per retention policy)
Hot-warm-cold architecture:
Hot nodes: SSDs, high CPU, current + recent indexes
Warm nodes: HDDs, moderate CPU, older indexes (read-only)
Cold nodes: cheapest storage, rarely queried archives
Aliases & Rollover
Index alias pattern:
Alias "logs-write" --> logs-2026-04-14 (current write target)
Alias "logs-read" --> logs-2026-04-* (all April indexes)
Rollover: when logs-2026-04-14 hits 50GB or 1 day old:
1. Create logs-2026-04-15
2. Point "logs-write" alias to new index
3. Old index becomes read-only and moves to warm tier
Applications always write to "logs-write" and never need to know
the actual index name.
Datadog manages petabytes of observability data using time-based indexes with automated rollover and tiered storage. Their hot tier uses NVMe SSDs for recent data, while older data moves to object storage with a caching layer.
Common Pitfalls
- Over-sharding small indexes. Each shard has fixed memory and file handle overhead. An index with 100,000 documents does not need 20 shards. Start small and scale up.
- Deep pagination with from/size. Requesting page 1,000 requires every shard to return 10,000 results. Use
search_afteror scroll APIs for deep result sets. - Ignoring shard balance. Uneven document distribution causes hot spots. One shard with 10x the documents of others dominates query latency.
- Not monitoring segment count. Too many small segments degrade search performance. Ensure merge policies keep segment count manageable.
- Treating search clusters like databases. Search engines trade consistency for speed. Do not use Elasticsearch as a primary data store. Keep the source of truth in a database and index into search asynchronously.
- Scaling by adding shards instead of replicas. If the bottleneck is query throughput (not index size), adding replicas is cheaper and simpler than resharding.
Key Takeaways
- Document-based sharding is the standard approach. Every query fans out to all shards, and a coordinator merges results using a two-phase scatter-gather pattern.
- Replicas provide both fault tolerance and read scaling. Add replicas for throughput; add shards for capacity.
- Near-real-time indexing makes documents searchable within about 1 second. Bulk loads should disable this for throughput.
- Deep pagination is inherently expensive in distributed search. Use cursor-based pagination for large result sets.
- Time-based indexes with hot-warm-cold tiering are the standard pattern for log and event data, balancing query performance with storage cost.