Query Engines
Overview
The query engine transforms a declarative SQL statement into an efficient physical execution plan. Modern query engines employ sophisticated execution models, vectorized processing, and compilation techniques to maximize throughput on contemporary hardware. The choice of execution model fundamentally determines how well a database exploits CPU caches, SIMD instructions, and multi-core parallelism.
Volcano / Iterator Model
The Volcano model (also called the iterator or pull model) is the classical execution framework. Each operator implements an open()/next()/close() interface, and tuples flow one at a time from the leaves to the root.
class SeqScan:
    def open(self):
        self.cursor = self.table.first_page()

    def next(self):
        row = self.cursor.read()
        if row is None:
            return None
        self.cursor.advance()
        return row

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None:
                return None
            if self.predicate(row):
                return row

class HashJoin:
    def open(self):
        self.build_child.open()
        self.probe_child.open()
        self.ht = {}
        self.pending = iter(())   # unreturned matches for the current probe row
        self.probe_row = None
        # Build phase: materialize inner relation
        while (row := self.build_child.next()) is not None:
            key = row[self.build_key]
            self.ht.setdefault(key, []).append(row)

    def next(self):
        # Emit remaining matches for the current probe row first, so a
        # probe key with several build matches produces every join pair
        match = next(self.pending, None)
        if match is not None:
            return merge(self.probe_row, match)
        # Probe phase: stream outer relation
        while (row := self.probe_child.next()) is not None:
            matches = self.ht.get(row[self.probe_key])
            if matches:
                self.probe_row = row
                self.pending = iter(matches)
                return merge(row, next(self.pending))
        return None
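A pull pipeline is driven from the root: the caller opens the plan, then repeatedly calls next() until the stream is exhausted. A minimal runnable sketch, with a hypothetical list-backed scan standing in for the page-based SeqScan above:

```python
class ListScan:
    """Leaf operator over an in-memory list (stand-in for SeqScan)."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        self.rows = None

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate
    def open(self):
        self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self):
        self.child.close()

# Root pulls: control flows top-down, data flows bottom-up
plan = Filter(ListScan([1, 5, 3, 8, 2]), lambda r: r > 2)
plan.open()
out = []
while (row := plan.next()) is not None:
    out.append(row)
plan.close()
print(out)  # [5, 3, 8]
```

Note that every result row costs one next() call per operator in the plan, which is exactly the overhead the following section describes.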
Limitations
- Virtual function overhead: Each next() call crosses a function boundary, once per operator per row
- Poor cache utilization: Processing one row at a time does not amortize instruction cache misses or branch mispredictions
- No SIMD: Single-tuple processing cannot leverage vector instructions
Vectorized / Batch Execution
Vectorized execution processes data in batches (vectors) of typically 1024-4096 values. This amortizes function call overhead and enables SIMD processing.
VECTOR_SIZE = 1024

class VectorizedFilter:
    def __init__(self, child, column_idx, threshold):
        self.child = child
        self.column_idx = column_idx
        self.threshold = threshold

    def next_batch(self):
        batch = self.child.next_batch()
        if batch is None:
            return None
        # Columnar selection vector - process entire column at once
        col = batch.columns[self.column_idx]
        mask = col > self.threshold  # SIMD-friendly comparison
        return batch.filter(mask)
class VectorizedHashAgg:
    def __init__(self, child, group_col, agg_col):
        self.child = child
        self.group_col = group_col
        self.agg_col = agg_col
        self.ht = {}

    def consume_all(self):
        while (batch := self.child.next_batch()) is not None:
            groups = batch.columns[self.group_col]
            values = batch.columns[self.agg_col]
            # Process batch: hash all keys, aggregate values
            for i in range(batch.size):
                key = groups[i]
                self.ht[key] = self.ht.get(key, 0) + values[i]
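The batch-at-a-time operations above can be made concrete with NumPy arrays standing in for column vectors. This is a sketch, not a real engine's API: `filter_batch` applies one predicate to a whole column, and `hash_agg` computes a grouped SUM over a batch using sort-based grouping instead of a hash table.

```python
import numpy as np

def filter_batch(columns, column_idx, threshold):
    """Apply a predicate to one column; keep matching rows in all columns."""
    mask = columns[column_idx] > threshold   # one vectorized comparison
    return [col[mask] for col in columns]

def hash_agg(group_keys, values):
    """Grouped SUM over a batch (sort-based grouping for simplicity)."""
    order = np.argsort(group_keys, kind="stable")
    keys, vals = group_keys[order], values[order]
    uniq, starts = np.unique(keys, return_index=True)
    sums = np.add.reduceat(vals, starts)     # one reduction per group
    return dict(zip(uniq.tolist(), sums.tolist()))

cols = [np.array([1, 2, 1, 2, 1]),       # group column
        np.array([10, 20, 30, 40, 5])]   # value column
groups, vals = filter_batch(cols, 1, 9)  # keep rows with value > 9
print(hash_agg(groups, vals))  # {1: 40, 2: 60}
```

The key property is that each operator touches each column once per batch, so the interpreter overhead is paid per batch rather than per row.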
Used by: DuckDB, Velox (Meta), ClickHouse, DataFusion, Databricks Photon
Push vs Pull Execution
Pull Model (Volcano)
The root operator pulls data by calling next() on children. Control flow goes top-down; data flows bottom-up.
Project (pull)
|
Filter (pull)
|
SeqScan (produces)
Control: Root -> Filter -> SeqScan
Data: SeqScan -> Filter -> Root
Push Model
Leaf operators push data to parent operators. Each operator has a consume() method called by its child. This maps naturally to compiled code with tight loops.
class PushSeqScan:
    def produce(self):
        for page in self.table.pages():
            for row in page.rows():
                self.parent.consume(row)  # push to parent

class PushFilter:
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)  # push matching rows up

class PushHashJoin:
    def consume_build(self, row):
        # Keep a list per key so duplicate build keys are not lost
        self.ht.setdefault(row[self.key], []).append(row)

    def consume_probe(self, row):
        for match in self.ht.get(row[self.key], ()):
            self.parent.consume(merge(row, match))
Key advantage: The push model eliminates per-tuple virtual function dispatch, enabling the compiler to optimize across operator boundaries.
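Wiring up a push pipeline makes the inverted control flow visible: the scan drives execution and each operator forwards rows upward to a sink at the root. A minimal runnable sketch (the constructors and the `Collect` sink are assumptions, since the snippets above omit them):

```python
class Scan:
    def __init__(self, rows, parent):
        self.rows, self.parent = rows, parent
    def produce(self):
        for row in self.rows:
            self.parent.consume(row)  # push each row upward

class Filter:
    def __init__(self, predicate, parent):
        self.predicate, self.parent = predicate, parent
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)

class Collect:
    """Root sink: materializes the final result."""
    def __init__(self):
        self.out = []
    def consume(self, row):
        self.out.append(row)

sink = Collect()
pipeline = Scan([4, 1, 7, 3], Filter(lambda r: r > 3, sink))
pipeline.produce()   # the leaf drives the whole pipeline
print(sink.out)      # [4, 7]
```

Because control starts at the leaf, a compiler can inline the consume() chain into the scan loop, which is exactly the fusion the produce/consume code generation below exploits.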
Morsel-Driven Parallelism
Morsel-driven parallelism (introduced by HyPer/Umbra) divides input data into small chunks called morsels (on the order of tens of thousands of tuples) and distributes them to worker threads dynamically.
Table Data:
[Morsel 0] [Morsel 1] [Morsel 2] [Morsel 3] [Morsel 4] ...
Worker threads grab morsels from a shared dispatcher:
Thread 0: takes Morsel 0, processes pipeline, writes to local HT partition
Thread 1: takes Morsel 1, processes pipeline, writes to local HT partition
Thread 2: takes Morsel 2, ...
Thread 0: finishes Morsel 0, takes Morsel 3, ...
Pipeline Breakers (materialization points):
Scan -> Filter -> Probe HashJoin [pipeline 1]
---- BREAK (build hash table) ----
Scan -> Filter -> Build HashJoin [pipeline 0]
Properties
- Work stealing: Threads that finish early steal morsels from the global queue
- NUMA-aware: Morsels preferentially assigned to threads local to the data's NUMA node
- Elasticity: Number of active threads can change mid-query
- No global synchronization: Each thread operates on local partitions, merging at pipeline breakers
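The dispatch loop above can be sketched with a shared queue of morsels and thread-local partial aggregates that are merged at the pipeline breaker. This is a simplification under stated assumptions: the morsel size is tiny for illustration, and real schedulers add NUMA-aware placement and work stealing on top of this.

```python
import queue
import threading

MORSEL_SIZE = 4  # tiny for illustration; real engines use tens of thousands

def run_pipeline(data, num_threads=3):
    """Grouped SUM over (key, value) pairs, morsel-driven."""
    morsels = queue.Queue()
    for i in range(0, len(data), MORSEL_SIZE):
        morsels.put(data[i:i + MORSEL_SIZE])

    local_hts = [dict() for _ in range(num_threads)]  # thread-local hash tables

    def worker(tid):
        ht = local_hts[tid]
        while True:
            try:
                morsel = morsels.get_nowait()   # grab the next morsel dynamically
            except queue.Empty:
                return                          # no work left
            for key, val in morsel:             # pipeline: scan -> aggregate
                ht[key] = ht.get(key, 0) + val

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = {}                                 # merge at the pipeline breaker
    for ht in local_hts:
        for k, v in ht.items():
            merged[k] = merged.get(k, 0) + v
    return merged

data = [("a", 1), ("b", 2)] * 10
print(run_pipeline(data))  # {'a': 10, 'b': 20}
```

Threads that finish their morsel early simply pull the next one from the queue, which is how stragglers are balanced without static partitioning.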
JIT Compilation (HyPer / Umbra)
Query compilation generates specialized machine code for each query, eliminating interpretation overhead entirely.
Compilation Pipeline
SQL Query
|
v
Logical Plan (relational algebra)
|
v
Physical Plan (with chosen algorithms)
|
v
LLVM IR / C code generation
|
v
Machine Code (JIT compiled)
|
v
Execute directly on CPU
Example: SELECT SUM(price) FROM orders WHERE qty > 10
Generated pseudocode (tight loop, no operator boundaries):
sum = 0
for each tuple in orders:
if tuple.qty > 10:
sum += tuple.price
return sum
Produce/Consume Model
HyPer's code generation uses the produce/consume paradigm, where each operator generates code in two phases:
// Code generated for: SELECT sum(b) FROM R WHERE a > 5
// Scan operator produces:
for (auto& tuple : R) { // scan produce
// Filter consumes:
if (tuple.a > 5) { // filter consume/produce
// Aggregation consumes:
sum += tuple.b; // aggregation consume
}
}
// Result: one tight loop, no function calls between operators
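The paradigm can be illustrated with a toy code generator: each operator's produce() asks for the enclosing loop to be emitted, and consume() emits the code for one tuple at the current nesting depth. The operator classes here are hypothetical, and the generator emits Python source rather than LLVM IR, but the fusion is the same: one loop, no calls between operators.

```python
class ScanGen:
    def __init__(self, table, parent):
        self.table, self.parent = table, parent
    def produce(self):
        # Scan produces the enclosing loop; the parent fills in the body
        return [f"for tuple in {self.table}:"] + self.parent.consume(depth=1)

class FilterGen:
    def __init__(self, cond, parent):
        self.cond, self.parent = cond, parent
    def consume(self, depth):
        indent = "    " * depth
        return [f"{indent}if {self.cond}:"] + self.parent.consume(depth + 1)

class SumGen:
    def __init__(self, expr):
        self.expr = expr
    def consume(self, depth):
        return [f"{'    ' * depth}total += {self.expr}"]

# SELECT sum(b) FROM R WHERE a > 5  ->  one fused loop
plan = ScanGen("R", FilterGen("tuple['a'] > 5", SumGen("tuple['b']")))
src = "\n".join(plan.produce())
print(src)

# "Compile" and run the generated code on hypothetical sample data
R = [{'a': 6, 'b': 10}, {'a': 3, 'b': 99}, {'a': 9, 'b': 5}]
env = {'R': R, 'total': 0}
exec(src, env)
print(env['total'])  # 15
```

The generated source is exactly the tight loop shown above; a real engine would lower it to LLVM IR or machine code instead of Python text.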
Adaptive Compilation (Umbra)
Stages:
1. Interpret query immediately (low latency for short queries)
2. If query runs > threshold, trigger background compilation
3. Switch to compiled code mid-execution (hot-swap)
Benefit: avoids paying compilation cost for simple queries
while still achieving peak throughput for complex ones
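The staged strategy can be sketched as an executor that starts by interpreting the predicate and hot-swaps to a compiled version once the row count crosses a threshold. This is a simulation under stated assumptions: "interpretation" is a per-row eval(), "compilation" is a one-time exec() into a Python function, and the background compilation happens synchronously here.

```python
def adaptive_count(rows, predicate_src, compile_threshold=1000):
    """Count rows matching predicate_src (e.g. "row > 10").

    Interprets the predicate per row; once the row count crosses the
    threshold, compiles it once and switches mid-execution.
    """
    compiled = None
    count = 0
    for i, row in enumerate(rows):
        if compiled is None and i >= compile_threshold:
            ns = {}
            exec(f"def pred(row):\n    return {predicate_src}", ns)
            compiled = ns["pred"]   # hot-swap: compiled code from here on
        if compiled is not None:
            match = compiled(row)                          # compiled path
        else:
            match = eval(predicate_src, {"row": row})      # interpreted path
        if match:
            count += 1
    return count

# Short input: never reaches the threshold, so compilation cost is never paid
print(adaptive_count(range(10), "row % 2 == 0", compile_threshold=5))  # 5
```

A query over fewer than `compile_threshold` rows never pays the exec() cost, mirroring how Umbra avoids compiling short-running queries.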
SIMD in Query Processing
Single Instruction Multiple Data (SIMD) processes multiple values in a single CPU instruction using wide registers (128-bit SSE, 256-bit AVX2, 512-bit AVX-512).
Selection Scans
// Scalar: process one value at a time
for (int i = 0; i < n; i++) {
    if (col[i] > threshold) result[count++] = i;
}

// AVX-512: process 16 int32 values per instruction (assumes n % 16 == 0)
__m512i thresh = _mm512_set1_epi32(threshold);
__m512i iota   = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
for (int i = 0; i < n; i += 16) {
    __m512i data    = _mm512_loadu_si512(&col[i]);
    __m512i indices = _mm512_add_epi32(_mm512_set1_epi32(i), iota);
    __mmask16 mask  = _mm512_cmpgt_epi32_mask(data, thresh);
    // Compact the indices of matching lanes into the result array
    _mm512_mask_compressstoreu_epi32(&result[count], mask, indices);
    count += __builtin_popcount(mask);
}
// 8-16x throughput improvement over the scalar loop
Hash Table Probing
SIMD hash probing (horizontal):
1. Compute 8 hashes in parallel (AVX2)
2. Gather 8 hash table entries simultaneously
3. Compare 8 keys in parallel
4. Handle collisions for mismatches
Throughput: ~2-4x improvement over scalar probing
Where SIMD Helps Most
| Operation | SIMD Speedup | Technique |
|-----------|--------------|-----------|
| Predicate evaluation | 4-16x | Parallel comparison |
| Hash computation | 4-8x | Parallel CRC32/Murmur |
| Decompression | 2-4x | Bit unpacking |
| String operations | 2-4x | SSE4.2 string intrinsics |
| Aggregation (SUM) | 4-8x | Vertical accumulation |
Compiled Queries: Full Pipeline
SQL:
SELECT c.name, SUM(o.total)
FROM customers c JOIN orders o ON c.id = o.cust_id
WHERE o.date >= '2025-01-01'
GROUP BY c.name
Execution (morsel-driven + compiled):
Pipeline 0 (build): Scan customers -> Build HT(id -> name)
[compiled tight loop, SIMD hashing]
Pipeline 1 (probe + aggregate):
Scan orders -> Filter(date) -> Probe HT -> Aggregate(name, SUM)
[compiled: fused scan+filter+probe+agg in single loop]
[morsel-driven: each thread processes ~10K order rows]
[thread-local hash aggregation, merge at end]
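The two pipelines can be traced in plain Python over in-memory rows (hypothetical sample data; each loop below corresponds to one fused, compiled pipeline in the plan):

```python
customers = [(1, "alice"), (2, "bob")]
orders = [(1, "2025-03-01", 100.0), (2, "2024-12-30", 50.0),
          (1, "2025-06-15", 25.0), (2, "2025-02-02", 75.0)]

# Pipeline 0 (build): scan customers -> build HT(id -> name)
ht = {cid: name for cid, name in customers}

# Pipeline 1 (probe + aggregate): fused scan+filter+probe+agg, one pass
result = {}
for cust_id, date, total in orders:
    if date >= "2025-01-01":                # Filter(date)
        name = ht.get(cust_id)              # Probe HT
        if name is not None:
            result[name] = result.get(name, 0.0) + total  # Aggregate SUM

print(result)  # {'alice': 125.0, 'bob': 75.0}
```

The break between the two loops is the pipeline breaker: the hash table must be fully built before the first order row can be probed.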
Comparison of Execution Models
| Aspect | Volcano | Vectorized | Compiled |
|--------|---------|------------|----------|
| Granularity | 1 tuple | 1000+ tuples | Pipeline of tuples |
| Function calls | Per tuple per op | Per batch per op | None (fused loop) |
| SIMD friendly | No | Yes | Yes |
| Compilation time | None | None | 10-100ms |
| Implementation | Simple | Moderate | Complex |
| Systems | PostgreSQL, MySQL | DuckDB, ClickHouse | HyPer, Umbra |
Applications
- PostgreSQL uses the Volcano iterator model with a JIT option (LLVM) for expression evaluation
- DuckDB uses vectorized push-based execution with morsel-driven parallelism
- ClickHouse uses vectorized pull-based execution with SIMD-optimized primitives
- Databricks Photon is a vectorized C++ engine replacing the JVM-based Spark engine for improved performance
- Velox (Meta) is a reusable vectorized execution library used by Presto and other engines