15-445/645 Database Systems: Query Execution
Lec13 Query Execution I — Processing Models, Access Methods, Expression Evaluation
This lecture addresses the fundamental question: once the DBMS has a query plan, how does it actually execute it? It covers three processing models for driving tuple flow through operators, the main access methods for reading data from tables, and how the DBMS evaluates expressions efficiently.
Query Execution Overview
Before diving into processing models, the lecture establishes key terminology. A query plan is a DAG of operators. Within a plan:
- A pipeline is a sequence of operators where tuples continuously flow between them without intermediate storage.
- A pipeline breaker is an operator that cannot emit its output until it has consumed all of its input — for example, the build side of a join, subqueries, and ORDER BY.
The DBMS splits a query plan into pipeline segments separated by pipeline breakers, then executes each segment.
Processing Models
A DBMS’s processing model defines how the system executes a query plan and moves data from one operator to the next. Different models make different trade-offs for OLTP vs. OLAP workloads. Each model is composed of two types of execution paths:
- Control Flow: how the DBMS invokes an operator
- Data Flow: how an operator sends its results
The output of an operator can be either whole tuples (NSM) or subsets of columns (DSM). The three main models differ in the granularity of data each operator processes at a time.
Iterator Model (Volcano / Pipeline)
The most widely used processing model. Every operator implements three functions:
- Open(): initialize the operator’s state and call Open() on its children
- Next(): return the next single tuple, or a null/EOF marker when finished
- Close(): clean up resources and call Close() on its children
A filter operator’s Next(), in pseudocode:

```
function Next():
    while (t = child.Next()) != null:
        if predicate(t): return t
    return null
```
Key properties:
- Each call to Next() returns a single tuple (or a null marker for end-of-stream)
- The top-level operator repeatedly calls Next() on its child, which recursively calls its child, and so on down to the leaf (scan) operators
- Tuples flow upward through the tree one at a time — this is also called the pipeline model because tuples are processed as soon as they are produced
- Used by almost every DBMS: PostgreSQL, MySQL, SQLite, MongoDB, etc.
The beauty of this model is its simplicity — each operator is self-contained and only needs to implement one interface. However, the per-tuple function call overhead can become significant for analytical workloads processing billions of tuples.
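To make the control and data flow concrete, here is a minimal Python sketch of the iterator model with illustrative Scan and Filter operators — the class and method names are my own, not from any particular DBMS:

```python
# Minimal sketch of the iterator (Volcano) model; names are illustrative.
class Scan:
    def __init__(self, tuples):
        self.tuples = tuples
    def open(self):
        self.i = 0
    def next(self):
        if self.i >= len(self.tuples):
            return None              # EOF marker
        t = self.tuples[self.i]
        self.i += 1
        return t
    def close(self):
        pass

class Filter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()            # Open() recurses into children
    def next(self):
        # Pull tuples from the child until one satisfies the predicate.
        while (t := self.child.next()) is not None:
            if self.pred(t):
                return t
        return None
    def close(self):
        self.child.close()

plan = Filter(Scan([1, 5, 10, 2]), lambda v: v > 3)
plan.open()
out = []
while (t := plan.next()) is not None:    # root pulls one tuple at a time
    out.append(t)
plan.close()
# out == [5, 10]
```

Note how the root drives everything: each tuple costs one chain of Next() calls down the tree, which is exactly the per-tuple overhead the vectorized model amortizes.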
Materialization Model
The opposite extreme (Rare): each operator processes all its input and produces all its output at once.
A filter operator’s Output(), in pseudocode:

```
function Output():
    out = []
    for t in child.Output():
        if predicate(t): out.append(t)
    return out
```
Key properties:
- Each operator returns its entire result set as a materialized collection
- Better for OLTP workloads where queries touch small amounts of data — fewer function calls, simpler control flow
- The DBMS can send down hints like LIMIT to avoid unnecessary materialization
- Worse for OLAP workloads with large intermediate results — materializing millions of tuples wastes memory
- Enables operator fusion: since each operator sees all its input at once, the DBMS can fuse adjacent operators (e.g., scan + filter + projection) into a single pass over the data
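A minimal Python sketch of the materialization model, with illustrative operator functions; the fused variant shows the operator-fusion idea in a single pass:

```python
# Sketch of the materialization model: each operator consumes its whole
# input and returns its whole output. All names are illustrative.
def scan(table):
    return list(table)                    # materialize all tuples at once

def filter_op(rows, pred):
    return [r for r in rows if pred(r)]   # one pass over the full input

def project(rows, f):
    return [f(r) for r in rows]

# Operator fusion: because each operator sees all of its input at once,
# the three calls above can be collapsed into one pass over the data.
def fused(table, pred, f):
    return [f(r) for r in table if pred(r)]

rows = [(1, 'a'), (7, 'b'), (9, 'c')]
assert project(filter_op(scan(rows), lambda r: r[0] > 5), lambda r: r[1]) \
       == fused(rows, lambda r: r[0] > 5, lambda r: r[1])  # ['b', 'c']
```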
Vectorized / Batch Model
A middle ground (Common in modern systems) that captures the best of both worlds. Each operator emits a batch of tuples (typically 1024 or so) per call. Originally proposed in the MonetDB/X100 system.
A filter operator’s Next(), in pseudocode:

```
function Next():
    batch = []
    while (t = child.Next()) != null:
        if predicate(t): batch.append(t)
        if batch.size == B: return batch
    return batch
```
Key properties:
- Amortizes function call overhead across many tuples (like materialization) while maintaining the streaming property (like the iterator model)
- Ideal for OLAP workloads — the batch size is chosen to fit in CPU cache, enabling efficient vectorized operations via SIMD
- Each batch internally uses a column-oriented layout, and operators include null bitmaps to track which entries in the batch are valid
- Used by modern analytical systems: DuckDB, Snowflake, Databricks, ClickHouse, etc.
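A hedged Python sketch of the vectorized model — the same pull-based interface as the iterator model, but each Next() returns a batch (4 tuples here for readability; real systems use around 1024). All names are illustrative:

```python
# Sketch of the vectorized model: one function call per batch, not per
# tuple. Class names and batch size are illustrative.
BATCH = 4

class VecScan:
    def __init__(self, values):
        self.values, self.i = values, 0
    def next(self):
        if self.i >= len(self.values):
            return None                       # end of stream
        batch = self.values[self.i:self.i + BATCH]
        self.i += BATCH
        return batch

class VecFilter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        # The per-call overhead is amortized over a whole batch.
        while (b := self.child.next()) is not None:
            kept = [v for v in b if self.pred(v)]
            if kept:
                return kept
        return None

plan = VecFilter(VecScan(list(range(10))), lambda v: v % 2 == 0)
out = []
while (b := plan.next()) is not None:
    out.extend(b)
# out == [0, 2, 4, 6, 8]
```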
Comparison
| Model | Granularity | Best For | Function Call Overhead |
|---|---|---|---|
| Iterator | Single tuple | General purpose | High (per tuple) |
| Materialization | Entire result | OLTP (small results) | Low (per operator) |
| Vectorized | Batch of tuples | OLAP (large scans) | Medium (per batch) |
Plan Processing Direction
Regardless of the processing model, the DBMS must decide the direction in which data flows through the plan tree.
Top-to-Bottom (Pull)
- The root operator “pulls” tuples from its children by calling Next() or Output()
- Control flow starts at the top and recurses downward
- Data flows upward from leaves to root
- This is the classic approach used by the iterator model
Bottom-to-Top (Push)
- Leaf operators “push” tuples to their parent operators
- Control flow starts at the leaves
- Allows tighter loops — the operator at the leaf can push tuples through a compiled pipeline without returning control between each tuple
- Enables operator fusion: adjacent operators in a pipeline can be fused into a single tight loop, eliminating virtual function call overhead between operators
- Used by HyPer and systems that employ push-based query compilation
The push model can achieve better performance because it avoids the overhead of recursive function calls. Instead of each operator pulling from its child, a leaf operator (e.g., a table scan) runs a tight loop that pushes tuples through a chain of operators compiled into a single function.
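A toy Python sketch of a push-based, fused pipeline: the scan loop drives execution and pushes each tuple through the filter and projection in one tight loop, mimicking what a query compiler would generate. Function names are illustrative:

```python
# Sketch of push-based execution: control flow starts at the leaf, and
# filter + projection are fused into the scan loop. Illustrative only.
def run_pipeline(table, pred, proj, sink):
    for t in table:             # the leaf (scan) drives the pipeline
        if not pred(t):         # fused filter: no call back to a parent
            continue
        sink.append(proj(t))    # fused projection pushes into the sink

out = []
run_pipeline([(1, 'x'), (4, 'y'), (9, 'z')],
             lambda t: t[0] > 2,   # predicate
             lambda t: t[1],       # projection
             out)
# out == ['y', 'z']
```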
Push vs Pull Tradeoffs
| Aspect | Pull (Top-to-Bottom) | Push (Bottom-to-Top) |
|---|---|---|
| Virtual function overhead | One Next() call per tuple/batch | Compiled away via operator fusion |
| LIMIT handling | Easy — just stop calling Next() | Harder — must propagate a “stop” signal back to the leaf |
| Sort-Merge Join | Natural — can alternate pulling from two sorted inputs | Difficult — two push sources must be coordinated |
| Scheduling | Simple recursive control flow | Needs a scheduler to manage which pipeline runs when |
| Output control | Consumer controls the pace | Producer controls the pace; needs output buffers |
Access Methods
An access method is the way the DBMS reads data from a table. The query optimizer chooses the access method based on cost estimates.
Sequential Scan
The simplest method: read every page in the table, one after another.
```
for page in table.pages:
    for tuple in page:
        if evalPred(tuple): emit(tuple)
```
This is often the worst-case scenario, but the DBMS can apply several optimizations:
- Prefetching: fetch the next few pages into the buffer pool before they are needed
- Buffer Pool Bypass: for large sequential scans, bypass the buffer pool to avoid polluting the cache with pages that won’t be re-read
- Parallelization: divide the scan across multiple threads or workers
- Late Materialization: in columnar storage, delay stitching columns together until absolutely necessary — only read the columns the query actually needs
- Heap Clustering: if the table is physically sorted on a key, range predicates on that key can skip large portions of the heap
- Data Encoding / Compression: operate directly on compressed data without decompressing (e.g., run-length encoding can answer some predicates without expanding)
- Scan Sharing (Synchronized Scans): if multiple queries scan the same table concurrently, share the scan cursor so they piggyback on each other’s I/O
- Materialized Views / Cached Results: precompute and cache common query results so scans can be avoided entirely
- Code Specialization / Compilation: generate specialized scan code for a specific schema and predicate at query compile time, avoiding generic interpretation overhead
Data Skipping
The idea behind data skipping is to avoid examining pages or tuples that are guaranteed not to satisfy a query’s predicates. There are two categories:
Approximate Queries (Lossy): execute queries on a sampled subset of the data to produce approximate answers. This sacrifices precision for speed — useful for exploratory analytics where exact answers are not required.
Zone Maps (Lossless): pre-computed metadata about the values in each page (or range of pages) that allows the DBMS to skip pages with certainty:
- Stores the MIN, MAX, AVG, SUM, COUNT for each attribute in a contiguous range of pages
- Before scanning a page, check the zone map — if the predicate cannot match any value in the range, skip the entire page
- Extremely effective for range predicates on sorted or semi-sorted data
- Zone maps are maintained automatically as data is inserted/updated
- Used by: Netezza, Oracle, Vertica, SingleStore, Snowflake, and supported natively in columnar file formats like Parquet, ORC, Vortex, and LanceDB
For example, if a query asks for WHERE val > 600 and a page’s zone map shows MAX(val) = 400, that entire page can be skipped.
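The zone-map check can be sketched in a few lines of Python (the page layout and zone-map structure are illustrative):

```python
# Sketch of zone-map data skipping for "val > 600": the zone map stores
# MIN/MAX per page, and a page whose MAX cannot satisfy the predicate
# is skipped without reading its tuples.
pages = [[100, 250, 400], [580, 610, 700], [900, 950, 999]]
zone_map = [(min(p), max(p)) for p in pages]   # normally precomputed

def scan_gt(pages, zone_map, threshold):
    out, skipped = [], 0
    for page, (lo, hi) in zip(pages, zone_map):
        if hi <= threshold:        # predicate cannot match: skip page
            skipped += 1
            continue
        out.extend(v for v in page if v > threshold)
    return out, skipped

result, skipped = scan_gt(pages, zone_map, 600)
# result == [610, 700, 900, 950, 999]; skipped == 1 (the MAX=400 page)
```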
Index Scan
If the table has an index on the attribute(s) used in the predicate, the DBMS can use the index to find matching tuples directly, avoiding a full table scan.
The query optimizer must decide: is it cheaper to use the index or just do a sequential scan? This depends on:
- Selectivity of the predicate (how many tuples match)
- Whether the index is clustered (data physically sorted by the index key)
- Cost of random I/O vs. sequential I/O
For low-selectivity predicates (few matching tuples), an index scan is usually faster. For high-selectivity predicates, the random I/O from an unclustered index can make a sequential scan cheaper.
Multi-Index Scan
When a query has predicates on multiple attributes and indexes exist on each, the DBMS can:
- Scan each index independently to get a set of matching Record IDs (RIDs)
- Compute the intersection (for AND) or union (for OR) of these RID sets using set operations
- Retrieve the actual tuples using the resulting RID set
This approach can be more efficient than using a single index when multiple selective predicates are combined.
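The RID-set arithmetic in the steps above can be sketched directly, with made-up index contents:

```python
# Sketch of a multi-index scan for "WHERE age = 30 AND dept = 'eng'":
# each index lookup yields a set of record IDs; AND is set intersection,
# OR is set union. The index contents below are made up.
age_index = {30: {1, 4, 7, 9}}           # age  -> matching RIDs
dept_index = {'eng': {2, 4, 9, 11}}      # dept -> matching RIDs

rids_and = age_index[30] & dept_index['eng']   # intersection for AND
rids_or = age_index[30] | dept_index['eng']    # union for OR

assert rids_and == {4, 9}
assert rids_or == {1, 2, 4, 7, 9, 11}
# The DBMS then fetches only the tuples with these RIDs from the heap.
```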
Modification Queries
Operators that modify the database (INSERT, UPDATE, DELETE) need special handling because they both read and write data.
The Halloween Problem
Named after the day it was discovered at IBM — Halloween 1976. The problem: an UPDATE query can see its own modifications if the scan and the update operate on the same index.
Example:

```sql
UPDATE employees SET salary = salary + 100 WHERE salary < 1100;
```
If the table is scanned via an index on salary, a tuple with salary = 999 gets updated to 1099 and moves to a new position in the index. But the scan hasn’t finished — it may encounter that same tuple again at key 1099, which still satisfies salary < 1100 if the cursor hasn’t passed it. The tuple gets updated again to 1199, and so on — the tuple keeps “running ahead” of the scan cursor and is updated multiple times.
Solutions:
- Track modified tuples: record the RIDs of updated tuples and skip them if encountered again
- Materialize first: scan all qualifying tuples first, then apply the updates — this breaks the pipeline but guarantees correctness
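A toy Python simulation of the track-modified-tuples solution; the rows, bump, and threshold are made up, and the salary-ordered loop stands in for an index scan:

```python
# Toy simulation of the Halloween Problem fix that tracks modified RIDs.
# We "scan" in salary order like an index scan; without the seen-set, a
# tuple that still qualifies after its update would be bumped repeatedly.
def update_salaries(rows, bump, limit):
    modified = set()
    progress = True
    while progress:                       # re-scan until nothing changes
        progress = False
        for rid, sal in sorted(rows.items(), key=lambda kv: kv[1]):
            if sal < limit and rid not in modified:
                rows[rid] = sal + bump
                modified.add(rid)         # skip this tuple if seen again
                progress = True
    return rows

rows = update_salaries({1: 1000, 2: 1250, 3: 1400}, 100, 1300)
# Each qualifying tuple is updated exactly once, even though tuple 1
# (now 1100) would still satisfy salary < 1300 on a later pass:
# rows == {1: 1100, 2: 1350, 3: 1400}
```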
Expression Evaluation
The DBMS must evaluate expressions in WHERE clauses, SELECT lists, and other contexts.
Execution Context
For each expression evaluation, the DBMS maintains an execution context that tracks:
- Current Tuple: the tuple being processed
- Query Parameters: bound values for parameterized queries
- Table Schema: metadata needed to interpret column references
The naive approach for evaluating expressions is an expression tree where each node is an operator or value.
Expression Tree Evaluation
For example, the predicate val > 100 becomes a tree with the comparison at the root:

```
      >
     / \
  val   100
```
To evaluate, walk the tree recursively. The problem: for every single tuple, the DBMS traverses the tree, checks node types at runtime, and follows pointers. This is slow due to:
- Poor instruction cache behavior (pointer chasing)
- Branch mispredictions (type-checking at each node)
- Function call overhead
Optimizations
- Constant Folding: pre-compute expressions that don’t depend on tuple data at planning time
  - WHERE 1 = 0 → skip the entire query
  - WHERE x > 3 + 2 → WHERE x > 5
  - SELECT UPPER('wutang') → SELECT 'WUTANG' (function calls on constants are folded too)
- Common Sub-Expression Elimination (CSE): if the same sub-expression appears multiple times, compute it once and reuse the result
  - Example: WHERE STRPOS(name, 'Andy') > 0 AND STRPOS(name, 'Andy') < 5 — the STRPOS(name, 'Andy') call is computed once and the result is shared
- JIT Compilation: compile the expression (or the entire query pipeline) into native machine code at query time, eliminating the overhead of tree interpretation
  - PostgreSQL uses LLVM for JIT compilation of expressions
  - HyPer/Umbra compile entire query pipelines to native code
  - Eliminates type-checking, pointer chasing, and function call overhead
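To illustrate the gap between interpretation and compilation, here is a Python sketch that compiles an expression tree into a closure — a stand-in for real JIT-compiled machine code. The node encoding is illustrative:

```python
# Interpreted vs. "compiled" expression evaluation. Nodes are tuples:
# ('col', index), ('const', value), ('>', left, right).
def interpret(node, tuple_):
    kind = node[0]                       # type check on EVERY call
    if kind == 'col':
        return tuple_[node[1]]
    if kind == 'const':
        return node[1]
    if kind == '>':
        return interpret(node[1], tuple_) > interpret(node[2], tuple_)
    raise ValueError(kind)

def compile_expr(node):
    # Walk the tree once at plan time; the returned closure has no
    # per-tuple type checks or pointer chasing.
    kind = node[0]
    if kind == 'col':
        i = node[1]
        return lambda t: t[i]
    if kind == 'const':
        v = node[1]
        return lambda t: v
    if kind == '>':
        l, r = compile_expr(node[1]), compile_expr(node[2])
        return lambda t: l(t) > r(t)
    raise ValueError(kind)

tree = ('>', ('col', 0), ('const', 100))   # val > 100
pred = compile_expr(tree)
assert interpret(tree, (150,)) is True
assert pred((150,)) is True and pred((50,)) is False
```

A real JIT goes further by emitting native code, but the structural win is the same: the per-tuple work no longer revisits the tree.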
Lec13 Takeaways
- The vectorized model is the sweet spot for analytical workloads: it amortizes function-call overhead while keeping data in cache and enabling SIMD
- Zone maps are a simple but powerful optimization for sequential scans — they let the DBMS skip irrelevant pages without maintaining a full index
- The Halloween Problem is a subtle correctness issue that arises when modifications and scans share the same index structure
- Push-based execution can outperform pull-based by compiling operators into tight loops that avoid recursive function calls
- JIT compilation of expressions and pipelines removes the interpretation overhead that dominates CPU time in analytical queries
Lec14 Query Execution II — Parallelism
This lecture asks: how does the DBMS exploit parallel hardware to speed up query execution? It covers process models for managing workers, three types of query parallelism, data-level parallelism via SIMD, and I/O parallelism techniques.
Parallel vs Distributed
Before diving in, an important distinction:
| Aspect | Parallel DBMS | Distributed DBMS |
|---|---|---|
| Resources | Physically close, communicate via high-speed interconnect | Nodes can be far apart, communicate over a slower network |
| Communication | Cheap and assumed reliable | Cost cannot be ignored; messages may be lost |
| Examples | Single multi-core machine, shared-memory cluster | CockroachDB, Spanner |
This lecture focuses on parallel execution within a single machine or tightly coupled cluster. Distributed query execution is covered in later lectures.
Process Models
A DBMS process model defines how the system maps its internal worker abstractions to OS-level processes or threads. This is separate from the query processing model discussed in Lec13.
Process per DBMS Worker
Each worker is a separate OS process:
- Relies on the OS scheduler for CPU time
- Workers communicate via shared memory for global data structures (buffer pool, lock table, etc.)
- A crash in one worker does not necessarily bring down the entire system
- Higher context-switch overhead than threads
- Used by: PostgreSQL, Oracle (historically), IBM DB2
Thread per DBMS Worker (Most Common)
Each worker is an OS thread within a single DBMS process:
- Lower overhead — threads share the same address space, so no shared-memory segments are needed
- The DBMS manages its own scheduling (or relies on the OS thread scheduler)
- A crash in one thread can bring down the entire process
- Nearly every major DBMS built in the last ~25 years uses native OS threads (not green threads or user-level coroutines)
- Used by: MySQL, SQL Server, Oracle (modern versions), most modern systems
Important: using the thread-per-worker model does not automatically mean the DBMS supports intra-query parallelism. It only means the DBMS can run multiple queries concurrently on separate threads. Exploiting parallelism within a single query requires additional infrastructure (exchange operators, parallel-aware operators, etc.).
Embedded DBMS
The DBMS runs inside the application’s process (no separate server):
- No inter-process communication needed
- The application is responsible for managing concurrency
- Used by: SQLite, DuckDB, LevelDB, RocksDB, BerkeleyDB, WiredTiger (MongoDB’s storage engine)
Comparison
| Model | Isolation | Overhead | Used By |
|---|---|---|---|
| Process per Worker | High (process boundary) | High (context switch) | PostgreSQL |
| Thread per Worker (Most Common) | Low (shared address space) | Low (lightweight) | MySQL, SQL Server |
| Embedded | None (in-process) | Minimal | SQLite, DuckDB |
Scheduling
For thread-per-worker systems, the DBMS must decide how to assign tasks to worker threads. Two approaches:
- OS-level scheduling: let the OS decide which thread runs on which core. Simple but the DBMS has no control over placement decisions.
- DBMS-level scheduling: the DBMS maintains its own work queues and assigns tasks to threads. This allows NUMA-aware placement (schedule work on the core closest to the data) and avoids excessive context switching.
Modern high-performance systems (e.g., SQL Server, Oracle) prefer DBMS-level scheduling to maintain control over resource allocation.
Query Parallelism
There are two top-level categories: parallelism between queries and parallelism within a single query.
Inter-Query Parallelism
Execute multiple queries simultaneously on different workers. This is straightforward — each query runs independently. The main challenge is concurrency control (ensuring transactions don’t interfere with each other), which is the subject of later lectures.
Improves throughput (queries per second) rather than latency of individual queries.
Intra-Query Parallelism
Execute a single query using multiple workers. This improves the latency of individual queries. There are three approaches:
Intra-Operator Parallelism (Horizontal) — Most Common
Decompose a single operator into independent fragments that operate on different subsets of the data:
```
          Exchange (Gather)
          /       |       \
      Scan      Scan      Scan
   (range 1)  (range 2)  (range 3)
```
- A table scan is split across 3 workers, each reading a different range of pages
- Results are merged at an Exchange operator above
- Also called horizontal partitioning or data parallelism at the operator level
- The most common form of intra-query parallelism
- Example: Parallel Grace Hash Join — partition both tables using hash on the join key, then run independent hash join instances on each partition pair across multiple workers
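A minimal Python sketch of intra-operator parallelism: a scan with a filter is split across three workers over disjoint page ranges, and the results are gathered at the end (the Exchange operator’s job). Everything here is illustrative:

```python
# Sketch of a horizontally parallelized scan: each worker handles a
# disjoint range of "pages"; results are gathered at the end.
from concurrent.futures import ThreadPoolExecutor

pages = [list(range(i, i + 10)) for i in range(0, 60, 10)]  # 6 "pages"

def scan_fragment(page_range, pred):
    # Each worker independently scans its assigned pages.
    return [v for page in page_range for v in page if pred(v)]

def parallel_scan(pages, pred, workers=3):
    chunk = len(pages) // workers
    ranges = [pages[i * chunk:(i + 1) * chunk] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(scan_fragment, ranges, [pred] * workers)
    return [v for part in parts for v in part]   # Gather step

out = parallel_scan(pages, lambda v: v % 10 == 0)
# out == [0, 10, 20, 30, 40, 50]
```

(Python threads don’t gain CPU parallelism under the GIL; the sketch shows the decomposition pattern, not a real speedup.)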
Inter-Operator Parallelism (Vertical / Pipelining) — Less Common
Different operators in the query plan run simultaneously on different workers, passing tuples between them in a producer/consumer paradigm:
```
Worker 1: Scan → Filter ──tuples──▶ Worker 2: Join → Project
```
- Also called pipeline parallelism — operators are stages in a pipeline
- Limited by the slowest stage (pipeline stalls if one operator is much slower)
- Less common in practice because it requires careful balancing
- More common in streaming systems (e.g., Apache Kafka, Flink, Spark Streaming, Storm) than in traditional DBMS
Bushy Parallelism — Higher-End Systems
A combination of both: different branches of the plan tree execute in parallel, and within each branch, operators can also be parallelized:
```
                 Join
               /      \
           Join        Join
          /    \      /    \
         A      B    C      D
```
- Most flexible but most complex to schedule
- Used by modern MPP systems for complex multi-way joins
Exchange Operator
The Exchange operator is the key mechanism for implementing intra-query parallelism. It sits between parallel fragments of a query plan and manages data distribution.
Three types:
| Type | Behavior | Use Case |
|---|---|---|
| Gather | Merge results from multiple workers into one stream | Combining parallel scan results |
| Distribute | Split one stream across multiple workers | Partitioning data for parallel join |
| Repartition | Redistribute data from N workers to M workers | Changing partition scheme mid-query |
The exchange operator abstracts away parallelism — the operators above and below it don’t need to know they are running in parallel. This is a clean architectural separation.
Google BigQuery uses the Repartition (shuffle) exchange extensively — its execution engine performs massive redistributions of intermediate data across thousands of workers to execute distributed joins and aggregations.
Data Parallelism with SIMD
SIMD (Single Instruction, Multiple Data) applies the same operation to multiple data elements simultaneously using wide CPU registers.
Why SIMD Matters for Databases
A typical database scan applies a simple predicate (e.g., val > 100) to millions of values. Without SIMD, this requires one comparison per value. With SIMD, a single instruction can compare 4, 8, or even 16 values at once (depending on the register width: SSE = 128-bit, AVX2 = 256-bit, AVX-512 = 512-bit).
SIMD Selection Scans
Consider a selection scan:
SELECT * FROM T WHERE val > 100
Without SIMD:

```
for each value v in column:
    if v > 100: output v
```

With SIMD (AVX2, 256-bit registers, 32-bit integers = 8 values per register):

```
threshold = broadcast(100)   // [100, 100, 100, 100, 100, 100, 100, 100]
for each group of 8 values:
    data = load(values[i : i+8])
    mask = compare_gt(data, threshold)   // produces a bitmask
    output matching values using mask
```
This is 8x more values processed per instruction cycle (in theory).
Potential speedup compounds with multicore parallelism. For example, a machine with 32 cores each executing 4-wide SIMD instructions achieves up to 128x throughput compared to scalar single-core execution.
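A pure-Python simulation of the 8-lane selection scan above — real code would use AVX2 intrinsics or a vectorized library, but the group-of-8 compare-and-mask structure is the same:

```python
# Simulated 8-lane SIMD selection scan for "val > 100": values are
# processed in groups of 8 and a mask marks the qualifying lanes.
LANES = 8

def simd_gt(values, threshold):
    out = []
    for i in range(0, len(values), LANES):
        lanes = values[i:i + LANES]               # "load" 8 values
        mask = [v > threshold for v in lanes]     # one compare per group
        out.extend(v for v, m in zip(lanes, mask) if m)
    return out

vals = [50, 150, 99, 101, 700, 3, 200, 100, 600, 42]
assert simd_gt(vals, 100) == [150, 101, 700, 200, 600]
```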
Filter Representations
After a SIMD comparison produces a bitmask of qualifying tuples, how should the DBMS represent and use this filter?
Selection Vectors: store the positions (indices) of qualifying tuples in an array.
- Compact when selectivity is low (few matches)
- Downstream operators iterate over the positions
- Used by: DuckDB, Vectorwise

Bitmaps: a bit vector where bit i = 1 if tuple i qualifies.
- Fixed size regardless of selectivity
- Efficient for bitwise AND/OR of multiple predicates
- Used by: Oracle, many columnar stores
| Representation | Space | Best When | Set Operations |
|---|---|---|---|
| Selection Vector | O(matches) | Low selectivity | Requires merge |
| Bitmap | O(n) fixed | High selectivity, multiple predicates | Bitwise AND/OR |
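The two representations and their combination can be sketched directly (illustrative values):

```python
# Selection vector vs. bitmap for the same predicate pass, plus the
# cheap bitwise AND that bitmaps enable for compound predicates.
vals = [5, 120, 80, 300, 150]

# Selection vector: O(matches) space — positions of qualifying tuples.
sel_vec = [i for i, v in enumerate(vals) if v > 100]
assert sel_vec == [1, 3, 4]

# Bitmaps: O(n) space, one bit per tuple, cheap to combine.
bm_gt100 = [int(v > 100) for v in vals]     # [0, 1, 0, 1, 1]
bm_lt200 = [int(v < 200) for v in vals]     # [1, 1, 1, 0, 1]
bm_and = [a & b for a, b in zip(bm_gt100, bm_lt200)]
assert bm_and == [0, 1, 0, 0, 1]
assert [v for v, b in zip(vals, bm_and) if b] == [120, 150]
```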
I/O Parallelism
When the bottleneck is disk I/O rather than CPU, the DBMS can use multiple storage devices in parallel. A single disk’s bandwidth is fundamentally limited, so scaling I/O requires spreading data across multiple devices.
Multi-Disk Parallelism
Spread the database across multiple disks to increase aggregate I/O bandwidth.
RAID 0 (Striping): split each file across multiple disks in round-robin fashion.
- Read/write bandwidth scales linearly with the number of disks
- No redundancy — any single disk failure loses data
- Transparent to the DBMS

RAID 1 (Mirroring): duplicate each file on multiple disks.
- Read bandwidth doubles (can read from either copy)
- Write bandwidth unchanged (must write both copies)
- Fault tolerant — survives a single disk failure
- Transparent to the DBMS
In practice, production systems use RAID 10 (mirrored stripes) or rely on distributed storage systems that handle redundancy at a higher level. RAID is typically configured at the hardware or OS level — transparent to the DBMS. Modern cloud/distributed systems often replace hardware RAID with software-defined erasure codes.
The fundamental tradeoff triangle for multi-disk configurations is Performance vs Durability vs Capacity — you can optimize for at most two at the expense of the third.
Database Partitioning
Split the logical database into disjoint segments stored on separate disks (or separate machines in a distributed system).
Vertical Partitioning: split a table by columns — store different columns on different disks.
- Similar to a column-store layout
- Useful when queries only access a subset of columns

Horizontal Partitioning: split a table by rows — different ranges or hash buckets on different disks.
- Common partition strategies: range, hash, round-robin
- Enables parallel scans where each worker reads from a different partition
- Partition pruning must be handled carefully for point queries
- PostgreSQL supports this via Tablespaces, allowing different tables/indexes to be stored on different disk volumes
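A small Python sketch of horizontal hash partitioning, with a made-up row format and partition count:

```python
# Sketch of hash partitioning: rows are assigned to partitions (disks)
# by hashing the key, so a later scan can run one worker per partition
# and a point query can be pruned to a single partition.
N_PARTS = 4

def hash_partition(rows, key):
    parts = [[] for _ in range(N_PARTS)]
    for row in rows:
        parts[hash(row[key]) % N_PARTS].append(row)
    return parts

rows = [{'id': i, 'val': i * 10} for i in range(12)]
parts = hash_partition(rows, 'id')

# Every row lands in exactly one partition:
assert sum(len(p) for p in parts) == len(rows)
# A point query on id = 7 touches only one partition (pruning):
target = hash(7) % N_PARTS
assert any(r['id'] == 7 for r in parts[target])
```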
Challenges of Parallelism
Parallelism introduces several practical challenges that the DBMS must handle:
- Coordination overhead: synchronizing workers adds latency (e.g., exchange operators, barriers)
- Scheduling: deciding how many workers to allocate to each query, and how to divide work among them
- Concurrency issues: parallel access to shared data structures (hash tables, buffer pool) requires careful locking or lock-free designs
- Resource contention: multiple parallel queries compete for CPU, memory, and I/O bandwidth
The goal is to maximize throughput and minimize latency while keeping these overheads manageable.
Lec14 Takeaways
- Thread-per-worker is the dominant process model in modern DBMS — but using threads does not automatically mean the DBMS supports intra-query parallelism
- Intra-operator (horizontal) parallelism is the most common form of intra-query parallelism, cleanly abstracted via the Exchange operator
- SIMD provides significant speedups for scan-heavy analytical workloads — combined with multicore parallelism, it can achieve 100x+ throughput improvements
- I/O parallelism through RAID and partitioning is essential when disk bandwidth is the bottleneck, with a fundamental tradeoff between performance, durability, and capacity
- Parallelism introduces coordination overhead, scheduling complexity, and resource contention — the DBMS must balance these costs against throughput gains
Final Summary
| Topic | Key Idea |
|---|---|
| Processing Models | Iterator (per-tuple), Materialization (all-at-once), Vectorized (per-batch) — vectorized wins for OLAP |
| Plan Direction | Pull (top-down) is simple; Push (bottom-up) enables compiled pipelines |
| Access Methods | Sequential scan with zone maps, index scan for selective queries, multi-index for compound predicates |
| Halloween Problem | UPDATE can see its own changes through a shared index — must track or materialize first |
| Expression Evaluation | Expression trees are slow; JIT compilation eliminates interpretation overhead |
| Parallel vs Distributed | Parallel = close resources, cheap communication; Distributed = far nodes, unreliable network |
| Process Models | Process-per-worker (isolation) vs thread-per-worker (performance, most common) vs embedded (simplicity) |
| Intra-Query Parallelism | Horizontal (most common), Vertical/pipelining (less common), Bushy (higher-end systems) |
| Exchange Operator | Gather / Distribute / Repartition — abstracts parallelism from operators |
| SIMD | Apply one instruction to many data values — combined with multicore yields 100x+ speedup |
| Filter Representations | Selection vectors (sparse, good for low selectivity) vs bitmaps (dense, good for combining predicates) |
| I/O Parallelism | RAID striping/mirroring for bandwidth, partitioning for parallel scans; tradeoff: performance vs durability vs capacity |
Key takeaway: Modern query execution is a co-design of processing models, parallelism strategies, and hardware exploitation — the vectorized batch model, exchange operators, and SIMD collectively enable analytical systems to process billions of tuples per second.
References
- A. Pavlo, “Query Execution I,” CMU 15-445/645 Lecture #13, Spring 2024.
- A. Pavlo & J. Patel, “Query Execution II,” CMU 15-445/645 Lecture #14, Spring 2024.
- G. Graefe, “Volcano — An Extensible and Parallel Query Evaluation System,” IEEE TKDE, 1994.
- T. Neumann, “Efficiently Compiling Efficient Query Plans for Modern Hardware,” VLDB, 2011.
- P. Boncz, M. Zukowski, N. Nes, “MonetDB/X100: Hyper-Pipelining Query Execution,” CIDR, 2005.
This post is based on lecture materials from CMU 15-445/645 Database Systems by Andy Pavlo & Jignesh Patel (Lecture #13: Query Execution I, Lecture #14: Query Execution II).