Query Engines
Overview
The query engine transforms a declarative SQL statement into an efficient physical execution plan. Modern query engines employ sophisticated execution models, vectorized processing, and compilation techniques to maximize throughput on contemporary hardware. The choice of execution model fundamentally determines how well a database exploits CPU caches, SIMD instructions, and multi-core parallelism.
Volcano / Iterator Model
The Volcano model (also called the iterator or pull model) is the classical execution framework. Each operator implements an open()/next()/close() interface, and tuples flow one at a time from the leaves to the root.
class SeqScan:
    def open(self):
        self.cursor = self.table.first_page()

    def next(self):
        row = self.cursor.read()
        if row is None:
            return None
        self.cursor.advance()
        return row

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next(self):
        while True:
            row = self.child.next()
            if row is None:
                return None
            if self.predicate(row):
                return row

class HashJoin:
    def open(self):
        self.build_child.open()
        self.probe_child.open()
        self.ht = {}
        self.pending = iter(())   # unreturned matches for the current probe row
        self.probe_row = None
        # Build phase: materialize inner relation
        while (row := self.build_child.next()) is not None:
            key = row[self.build_key]
            self.ht.setdefault(key, []).append(row)

    def next(self):
        # Emit remaining matches for the current probe row first, so a
        # probe key with several build matches produces every join pair
        match = next(self.pending, None)
        if match is not None:
            return merge(self.probe_row, match)
        # Probe phase: stream outer relation
        while (row := self.probe_child.next()) is not None:
            matches = self.ht.get(row[self.probe_key])
            if matches:
                self.probe_row = row
                self.pending = iter(matches)
                return merge(row, next(self.pending))
        return None
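A pull pipeline is driven from the root: the caller opens the plan, then repeatedly calls next() until the stream is exhausted. A minimal runnable sketch, with a hypothetical list-backed scan standing in for the page-based SeqScan above:

```python
class ListScan:
    """Leaf operator over an in-memory list (stand-in for SeqScan)."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        self.rows = None

class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate
    def open(self):
        self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self):
        self.child.close()

# Root pulls: control flows top-down, data flows bottom-up
plan = Filter(ListScan([1, 5, 3, 8, 2]), lambda r: r > 2)
plan.open()
out = []
while (row := plan.next()) is not None:
    out.append(row)
plan.close()
print(out)  # [5, 3, 8]
```

Note that every result row costs one next() call per operator in the plan, which is exactly the overhead the following section describes.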
Limitations
- Virtual function overhead: Each next() call crosses a function boundary, once per operator per row
- Poor cache utilization: Processing one row at a time does not amortize instruction cache misses or branch mispredictions
- No SIMD: Single-tuple processing cannot leverage vector instructions
Vectorized / Batch Execution
Vectorized execution processes data in batches (vectors) of typically 1024-4096 values. This amortizes function call overhead and enables SIMD processing.
VECTOR_SIZE = 1024

class VectorizedFilter:
    def __init__(self, child, column_idx, threshold):
        self.child = child
        self.column_idx = column_idx
        self.threshold = threshold

    def next_batch(self):
        batch = self.child.next_batch()
        if batch is None:
            return None
        # Columnar selection vector - process entire column at once
        col = batch.columns[self.column_idx]
        mask = col > self.threshold  # SIMD-friendly comparison
        return batch.filter(mask)
class VectorizedHashAgg:
    def __init__(self, child, group_col, agg_col):
        self.child = child
        self.group_col = group_col
        self.agg_col = agg_col
        self.ht = {}

    def consume_all(self):
        while (batch := self.child.next_batch()) is not None:
            groups = batch.columns[self.group_col]
            values = batch.columns[self.agg_col]
            # Process batch: hash all keys, aggregate values
            for i in range(batch.size):
                key = groups[i]
                self.ht[key] = self.ht.get(key, 0) + values[i]
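The batch-at-a-time operations above can be made concrete with NumPy arrays standing in for column vectors. This is a sketch, not a real engine's API: `filter_batch` applies one predicate to a whole column, and `hash_agg` computes a grouped SUM over a batch using sort-based grouping instead of a hash table.

```python
import numpy as np

def filter_batch(columns, column_idx, threshold):
    """Apply a predicate to one column; keep matching rows in all columns."""
    mask = columns[column_idx] > threshold   # one vectorized comparison
    return [col[mask] for col in columns]

def hash_agg(group_keys, values):
    """Grouped SUM over a batch (sort-based grouping for simplicity)."""
    order = np.argsort(group_keys, kind="stable")
    keys, vals = group_keys[order], values[order]
    uniq, starts = np.unique(keys, return_index=True)
    sums = np.add.reduceat(vals, starts)     # one reduction per group
    return dict(zip(uniq.tolist(), sums.tolist()))

cols = [np.array([1, 2, 1, 2, 1]),       # group column
        np.array([10, 20, 30, 40, 5])]   # value column
groups, vals = filter_batch(cols, 1, 9)  # keep rows with value > 9
print(hash_agg(groups, vals))  # {1: 40, 2: 60}
```

The key property is that each operator touches each column once per batch, so the interpreter overhead is paid per batch rather than per row.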
Used by: DuckDB, Velox (Meta), ClickHouse, DataFusion, Databricks Photon
Push vs Pull Execution
Pull Model (Volcano)
The root operator pulls data by calling next() on children. Control flow goes top-down; data flows bottom-up.
Project (pull)
|
Filter (pull)
|
SeqScan (produces)
Control: Root -> Filter -> SeqScan
Data: SeqScan -> Filter -> Root
Push Model
Leaf operators push data to parent operators. Each operator has a consume() method called by its child. This maps naturally to compiled code with tight loops.
class PushSeqScan:
    def produce(self):
        for page in self.table.pages():
            for row in page.rows():
                self.parent.consume(row)  # push to parent

class PushFilter:
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)  # push matching rows up

class PushHashJoin:
    def consume_build(self, row):
        # Keep a list per key so duplicate build keys are not lost
        self.ht.setdefault(row[self.key], []).append(row)

    def consume_probe(self, row):
        for match in self.ht.get(row[self.key], ()):
            self.parent.consume(merge(row, match))
Key advantage: The push model eliminates per-tuple virtual function dispatch, enabling the compiler to optimize across operator boundaries.
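Wiring up a push pipeline makes the inverted control flow visible: the scan drives execution and each operator forwards rows upward to a sink at the root. A minimal runnable sketch (the constructors and the `Collect` sink are assumptions, since the snippets above omit them):

```python
class Scan:
    def __init__(self, rows, parent):
        self.rows, self.parent = rows, parent
    def produce(self):
        for row in self.rows:
            self.parent.consume(row)  # push each row upward

class Filter:
    def __init__(self, predicate, parent):
        self.predicate, self.parent = predicate, parent
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)

class Collect:
    """Root sink: materializes the final result."""
    def __init__(self):
        self.out = []
    def consume(self, row):
        self.out.append(row)

sink = Collect()
pipeline = Scan([4, 1, 7, 3], Filter(lambda r: r > 3, sink))
pipeline.produce()   # the leaf drives the whole pipeline
print(sink.out)      # [4, 7]
```

Because control starts at the leaf, a compiler can inline the consume() chain into the scan loop, which is exactly the fusion the produce/consume code generation below exploits.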
Morsel-Driven Parallelism
Morsel-driven parallelism (introduced by HyPer/Umbra) divides input data into small chunks called morsels (on the order of tens of thousands of tuples) and distributes them to worker threads dynamically.
Table Data:
[Morsel 0] [Morsel 1] [Morsel 2] [Morsel 3] [Morsel 4] ...
Worker threads grab morsels from a shared dispatcher:
Thread 0: takes Morsel 0, processes pipeline, writes to local HT partition
Thread 1: takes Morsel 1, processes pipeline, writes to local HT partition
Thread 2: takes Morsel 2, ...
Thread 0: finishes Morsel 0, takes Morsel 3, ...
Pipeline Breakers (materialization points):
Scan -> Filter -> Probe HashJoin [pipeline 1]
---- BREAK (build hash table) ----
Scan -> Filter -> Build HashJoin [pipeline 0]
Properties
- Work stealing: Threads that finish early steal morsels from the global queue
- NUMA-aware: Morsels preferentially assigned to threads local to the data's NUMA node
- Elasticity: Number of active threads can change mid-query
- No global synchronization: Each thread operates on local partitions, merging at pipeline breakers
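The dispatch loop above can be sketched with a shared queue of morsels and thread-local partial aggregates that are merged at the pipeline breaker. This is a simplification under stated assumptions: the morsel size is tiny for illustration, and real schedulers add NUMA-aware placement and work stealing on top of this.

```python
import queue
import threading

MORSEL_SIZE = 4  # tiny for illustration; real engines use tens of thousands

def run_pipeline(data, num_threads=3):
    """Grouped SUM over (key, value) pairs, morsel-driven."""
    morsels = queue.Queue()
    for i in range(0, len(data), MORSEL_SIZE):
        morsels.put(data[i:i + MORSEL_SIZE])

    local_hts = [dict() for _ in range(num_threads)]  # thread-local hash tables

    def worker(tid):
        ht = local_hts[tid]
        while True:
            try:
                morsel = morsels.get_nowait()   # grab the next morsel dynamically
            except queue.Empty:
                return                          # no work left
            for key, val in morsel:             # pipeline: scan -> aggregate
                ht[key] = ht.get(key, 0) + val

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = {}                                 # merge at the pipeline breaker
    for ht in local_hts:
        for k, v in ht.items():
            merged[k] = merged.get(k, 0) + v
    return merged

data = [("a", 1), ("b", 2)] * 10
print(run_pipeline(data))  # {'a': 10, 'b': 20}
```

Threads that finish their morsel early simply pull the next one from the queue, which is how stragglers are balanced without static partitioning.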
JIT Compilation (HyPer / Umbra)
Query compilation generates specialized machine code for each query, eliminating interpretation overhead entirely.
Compilation Pipeline
SQL Query
|
v
Logical Plan (relational algebra)
|
v
Physical Plan (with chosen algorithms)
|
v
LLVM IR / C code generation
|
v
Machine Code (JIT compiled)
|
v
Execute directly on CPU
Example: SELECT SUM(price) FROM orders WHERE qty > 10
Generated pseudocode (tight loop, no operator boundaries):
sum = 0
for each tuple in orders:
if tuple.qty > 10:
sum += tuple.price
return sum
Produce/Consume Model
HyPer's code generation uses the produce/consume paradigm, where each operator generates code in two phases:
// Code generated for: SELECT sum(b) FROM R WHERE a > 5
// Scan operator produces:
for (auto& tuple : R) { // scan produce
// Filter consumes:
if (tuple.a > 5) { // filter consume/produce
// Aggregation consumes:
sum += tuple.b; // aggregation consume
}
}
// Result: one tight loop, no function calls between operators
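The paradigm can be illustrated with a toy code generator: each operator's produce() asks for the enclosing loop to be emitted, and consume() emits the code for one tuple at the current nesting depth. The operator classes here are hypothetical, and the generator emits Python source rather than LLVM IR, but the fusion is the same: one loop, no calls between operators.

```python
class ScanGen:
    def __init__(self, table, parent):
        self.table, self.parent = table, parent
    def produce(self):
        # Scan produces the enclosing loop; the parent fills in the body
        return [f"for tuple in {self.table}:"] + self.parent.consume(depth=1)

class FilterGen:
    def __init__(self, cond, parent):
        self.cond, self.parent = cond, parent
    def consume(self, depth):
        indent = "    " * depth
        return [f"{indent}if {self.cond}:"] + self.parent.consume(depth + 1)

class SumGen:
    def __init__(self, expr):
        self.expr = expr
    def consume(self, depth):
        return [f"{'    ' * depth}total += {self.expr}"]

# SELECT sum(b) FROM R WHERE a > 5  ->  one fused loop
plan = ScanGen("R", FilterGen("tuple['a'] > 5", SumGen("tuple['b']")))
src = "\n".join(plan.produce())
print(src)

# "Compile" and run the generated code on hypothetical sample data
R = [{'a': 6, 'b': 10}, {'a': 3, 'b': 99}, {'a': 9, 'b': 5}]
env = {'R': R, 'total': 0}
exec(src, env)
print(env['total'])  # 15
```

The generated source is exactly the tight loop shown above; a real engine would lower it to LLVM IR or machine code instead of Python text.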
Adaptive Compilation (Umbra)
Stages:
1. Interpret query immediately (low latency for short queries)
2. If query runs > threshold, trigger background compilation
3. Switch to compiled code mid-execution (hot-swap)
Benefit: avoids paying compilation cost for simple queries
while still achieving peak throughput for complex ones
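The staged strategy can be sketched as an executor that starts by interpreting the predicate and hot-swaps to a compiled version once the row count crosses a threshold. This is a simulation under stated assumptions: "interpretation" is a per-row eval(), "compilation" is a one-time exec() into a Python function, and the background compilation happens synchronously here.

```python
def adaptive_count(rows, predicate_src, compile_threshold=1000):
    """Count rows matching predicate_src (e.g. "row > 10").

    Interprets the predicate per row; once the row count crosses the
    threshold, compiles it once and switches mid-execution.
    """
    compiled = None
    count = 0
    for i, row in enumerate(rows):
        if compiled is None and i >= compile_threshold:
            ns = {}
            exec(f"def pred(row):\n    return {predicate_src}", ns)
            compiled = ns["pred"]   # hot-swap: compiled code from here on
        if compiled is not None:
            match = compiled(row)                          # compiled path
        else:
            match = eval(predicate_src, {"row": row})      # interpreted path
        if match:
            count += 1
    return count

# Short input: never reaches the threshold, so compilation cost is never paid
print(adaptive_count(range(10), "row % 2 == 0", compile_threshold=5))  # 5
```

A query over fewer than `compile_threshold` rows never pays the exec() cost, mirroring how Umbra avoids compiling short-running queries.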
SIMD in Query Processing
Single Instruction Multiple Data (SIMD) processes multiple values in a single CPU instruction using wide registers (128-bit SSE, 256-bit AVX2, 512-bit AVX-512).
Selection Scans
// Scalar: process one value at a time
for (int i = 0; i < n; i++) {
    if (col[i] > threshold) result[count++] = i;
}

// AVX-512: process 16 int32 values per instruction (assumes n % 16 == 0)
__m512i thresh = _mm512_set1_epi32(threshold);
__m512i iota   = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
for (int i = 0; i < n; i += 16) {
    __m512i data    = _mm512_loadu_si512(&col[i]);
    __m512i indices = _mm512_add_epi32(_mm512_set1_epi32(i), iota);
    __mmask16 mask  = _mm512_cmpgt_epi32_mask(data, thresh);
    // Compact the indices of matching lanes into the result array
    _mm512_mask_compressstoreu_epi32(&result[count], mask, indices);
    count += __builtin_popcount(mask);
}
// 8-16x throughput improvement over the scalar loop
Hash Table Probing
SIMD hash probing (horizontal):
1. Compute 8 hashes in parallel (AVX2)
2. Gather 8 hash table entries simultaneously
3. Compare 8 keys in parallel
4. Handle collisions for mismatches
Throughput: ~2-4x improvement over scalar probing
Where SIMD Helps Most
| Operation | SIMD Speedup | Technique |
|-----------|--------------|-----------|
| Predicate evaluation | 4-16x | Parallel comparison |
| Hash computation | 4-8x | Parallel CRC32/Murmur |
| Decompression | 2-4x | Bit unpacking |
| String operations | 2-4x | SSE4.2 string intrinsics |
| Aggregation (SUM) | 4-8x | Vertical accumulation |
Compiled Queries: Full Pipeline
SQL:
SELECT c.name, SUM(o.total)
FROM customers c JOIN orders o ON c.id = o.cust_id
WHERE o.date >= '2025-01-01'
GROUP BY c.name
Execution (morsel-driven + compiled):
Pipeline 0 (build): Scan customers -> Build HT(id -> name)
[compiled tight loop, SIMD hashing]
Pipeline 1 (probe + aggregate):
Scan orders -> Filter(date) -> Probe HT -> Aggregate(name, SUM)
[compiled: fused scan+filter+probe+agg in single loop]
[morsel-driven: each thread processes ~10K order rows]
[thread-local hash aggregation, merge at end]
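The two pipelines can be traced in plain Python over in-memory rows (hypothetical sample data; each loop below corresponds to one fused, compiled pipeline in the plan):

```python
customers = [(1, "alice"), (2, "bob")]
orders = [(1, "2025-03-01", 100.0), (2, "2024-12-30", 50.0),
          (1, "2025-06-15", 25.0), (2, "2025-02-02", 75.0)]

# Pipeline 0 (build): scan customers -> build HT(id -> name)
ht = {cid: name for cid, name in customers}

# Pipeline 1 (probe + aggregate): fused scan+filter+probe+agg, one pass
result = {}
for cust_id, date, total in orders:
    if date >= "2025-01-01":                # Filter(date)
        name = ht.get(cust_id)              # Probe HT
        if name is not None:
            result[name] = result.get(name, 0.0) + total  # Aggregate SUM

print(result)  # {'alice': 125.0, 'bob': 75.0}
```

The break between the two loops is the pipeline breaker: the hash table must be fully built before the first order row can be probed.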
Comparison of Execution Models
| Aspect | Volcano | Vectorized | Compiled |
|--------|---------|------------|----------|
| Granularity | 1 tuple | 1000+ tuples | Pipeline of tuples |
| Function calls | Per tuple per op | Per batch per op | None (fused loop) |
| SIMD friendly | No | Yes | Yes |
| Compilation time | None | None | 10-100ms |
| Implementation | Simple | Moderate | Complex |
| Systems | PostgreSQL, MySQL | DuckDB, ClickHouse | HyPer, Umbra |
Applications
- PostgreSQL uses the Volcano iterator model with a JIT option (LLVM) for expression evaluation
- DuckDB uses vectorized push-based execution with morsel-driven parallelism
- ClickHouse uses vectorized pull-based execution with SIMD-optimized primitives
- Databricks Photon is a vectorized C++ engine replacing the JVM-based Spark engine for improved performance
- Velox (Meta) is a reusable vectorized execution library used by Presto and other engines