15-445/645 Database Systems: Query Execution
Lec13 Query Execution I — Processing Models, Access Methods, Expression Evaluation
This lecture addresses the fundamental question: once the DBMS has a query plan, how does it actually execute it? It covers three processing models for driving tuple flow through operators, the main access methods for reading data from tables, and how the DBMS evaluates expressions efficiently.
Query Execution Overview
Before diving into processing models, the lecture establishes key terminology. A query plan is a DAG of operators. Within a plan:
- A pipeline is a sequence of operators where tuples continuously flow between them without intermediate storage.
- A pipeline breaker is an operator that cannot emit its output until it has consumed all of its input — for example, the build side of a join, subqueries, and ORDER BY.
The DBMS splits a query plan into pipeline segments separated by pipeline breakers, then executes each segment.
Processing Models
A DBMS’s processing model defines how the system executes a query plan and moves data from one operator to the next. Different models make different trade-offs for OLTP vs. OLAP workloads. Each model is composed of two types of execution paths:
- Control Flow: how the DBMS invokes an operator
- Data Flow: how an operator sends its results
The output of an operator can be either whole tuples (NSM) or subsets of columns (DSM). The three main models differ in the granularity of data each operator processes at a time.
Iterator Model (Volcano / Pipeline)
The most widely used processing model. Every operator implements three functions:
- Open(): initialize the operator’s state and call Open() on its children
- Next(): return the next single tuple, or a null/EOF marker when finished
- Close(): clean up resources and call Close() on its children
A filter operator’s Next(), in pseudocode:

```
function Next():
    while (t = child.Next()) != null:
        if predicate(t): return t
    return null
```
Key properties:
- Each call to Next() returns a single tuple (or a null marker for end-of-stream)
- The top-level operator repeatedly calls Next() on its child, which recursively calls its child, and so on down to the leaf (scan) operators
- Tuples flow upward through the tree one at a time — this is also called the pipeline model because tuples are processed as soon as they are produced
- Used by almost every DBMS: PostgreSQL, MySQL, SQLite, MongoDB, etc.
The beauty of this model is its simplicity — each operator is self-contained and only needs to implement one interface. However, the per-tuple function call overhead can become significant for analytical workloads processing billions of tuples.
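To make the control and data flow concrete, here is a minimal Python sketch of the iterator model with illustrative Scan and Filter operators — the class and method names are my own, not from any particular DBMS:

```python
# Minimal sketch of the iterator (Volcano) model; names are illustrative.
class Scan:
    def __init__(self, tuples):
        self.tuples = tuples
    def open(self):
        self.i = 0
    def next(self):
        if self.i >= len(self.tuples):
            return None              # EOF marker
        t = self.tuples[self.i]
        self.i += 1
        return t
    def close(self):
        pass

class Filter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()            # Open() recurses into children
    def next(self):
        # Pull tuples from the child until one satisfies the predicate.
        while (t := self.child.next()) is not None:
            if self.pred(t):
                return t
        return None
    def close(self):
        self.child.close()

plan = Filter(Scan([1, 5, 10, 2]), lambda v: v > 3)
plan.open()
out = []
while (t := plan.next()) is not None:    # root pulls one tuple at a time
    out.append(t)
plan.close()
# out == [5, 10]
```

Note how the root drives everything: each tuple costs one chain of Next() calls down the tree, which is exactly the per-tuple overhead the vectorized model amortizes.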
Materialization Model
The opposite extreme (Rare): each operator processes all its input and produces all its output at once.
A filter operator’s Output(), in pseudocode:

```
function Output():
    out = []
    for t in child.Output():
        if predicate(t): out.append(t)
    return out
```
Key properties:
- Each operator returns its entire result set as a materialized collection
- Better for OLTP workloads where queries touch small amounts of data — fewer function calls, simpler control flow
- The DBMS can send down hints like LIMIT to avoid unnecessary materialization
- Worse for OLAP workloads with large intermediate results — materializing millions of tuples wastes memory
- Enables operator fusion: since each operator sees all its input at once, the DBMS can fuse adjacent operators (e.g., scan + filter + projection) into a single pass over the data
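A minimal Python sketch of the materialization model, with illustrative operator functions; the fused variant shows the operator-fusion idea in a single pass:

```python
# Sketch of the materialization model: each operator consumes its whole
# input and returns its whole output. All names are illustrative.
def scan(table):
    return list(table)                    # materialize all tuples at once

def filter_op(rows, pred):
    return [r for r in rows if pred(r)]   # one pass over the full input

def project(rows, f):
    return [f(r) for r in rows]

# Operator fusion: because each operator sees all of its input at once,
# the three calls above can be collapsed into one pass over the data.
def fused(table, pred, f):
    return [f(r) for r in table if pred(r)]

rows = [(1, 'a'), (7, 'b'), (9, 'c')]
assert project(filter_op(scan(rows), lambda r: r[0] > 5), lambda r: r[1]) \
       == fused(rows, lambda r: r[0] > 5, lambda r: r[1])  # ['b', 'c']
```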
Vectorized / Batch Model
A middle ground (Common in modern systems) that captures the best of both worlds. Each operator emits a batch of tuples (typically 1024 or so) per call. Originally proposed in the MonetDB/X100 system.
A filter operator’s Next(), in pseudocode:

```
function Next():
    batch = []
    while (t = child.Next()) != null:
        if predicate(t): batch.append(t)
        if batch.size == B: return batch
    return batch
```
Key properties:
- Amortizes function call overhead across many tuples (like materialization) while maintaining the streaming property (like the iterator model)
- Ideal for OLAP workloads — the batch size is chosen to fit in CPU cache, enabling efficient vectorized operations via SIMD
- Each batch internally uses a column-oriented layout, and operators include null bitmaps to track which entries in the batch are valid
- Used by modern analytical systems: DuckDB, Snowflake, Databricks, ClickHouse, etc.
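A hedged Python sketch of the vectorized model — the same pull-based interface as the iterator model, but each Next() returns a batch (4 tuples here for readability; real systems use around 1024). All names are illustrative:

```python
# Sketch of the vectorized model: one function call per batch, not per
# tuple. Class names and batch size are illustrative.
BATCH = 4

class VecScan:
    def __init__(self, values):
        self.values, self.i = values, 0
    def next(self):
        if self.i >= len(self.values):
            return None                       # end of stream
        batch = self.values[self.i:self.i + BATCH]
        self.i += BATCH
        return batch

class VecFilter:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        # The per-call overhead is amortized over a whole batch.
        while (b := self.child.next()) is not None:
            kept = [v for v in b if self.pred(v)]
            if kept:
                return kept
        return None

plan = VecFilter(VecScan(list(range(10))), lambda v: v % 2 == 0)
out = []
while (b := plan.next()) is not None:
    out.extend(b)
# out == [0, 2, 4, 6, 8]
```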
Comparison
| Model | Granularity | Best For | Function Call Overhead |
|---|---|---|---|
| Iterator | Single tuple | General purpose | High (per tuple) |
| Materialization | Entire result | OLTP (small results) | Low (per operator) |
| Vectorized | Batch of tuples | OLAP (large scans) | Medium (per batch) |
Plan Processing Direction
Regardless of the processing model, the DBMS must decide the direction in which data flows through the plan tree.
Top-to-Bottom (Pull)
- The root operator “pulls” tuples from its children by calling Next() or Output()
- Control flow starts at the top and recurses downward
- Data flows upward from leaves to root
- This is the classic approach used by the iterator model
Bottom-to-Top (Push)
- Leaf operators “push” tuples to their parent operators
- Control flow starts at the leaves
- Allows tighter loops — the operator at the leaf can push tuples through a compiled pipeline without returning control between each tuple
- Enables operator fusion: adjacent operators in a pipeline can be fused into a single tight loop, eliminating virtual function call overhead between operators
- Used by HyPer and systems that employ push-based query compilation
The push model can achieve better performance because it avoids the overhead of recursive function calls. Instead of each operator pulling from its child, a leaf operator (e.g., a table scan) runs a tight loop that pushes tuples through a chain of operators compiled into a single function.
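A toy Python sketch of a push-based, fused pipeline: the scan loop drives execution and pushes each tuple through the filter and projection in one tight loop, mimicking what a query compiler would generate. Function names are illustrative:

```python
# Sketch of push-based execution: control flow starts at the leaf, and
# filter + projection are fused into the scan loop. Illustrative only.
def run_pipeline(table, pred, proj, sink):
    for t in table:             # the leaf (scan) drives the pipeline
        if not pred(t):         # fused filter: no call back to a parent
            continue
        sink.append(proj(t))    # fused projection pushes into the sink

out = []
run_pipeline([(1, 'x'), (4, 'y'), (9, 'z')],
             lambda t: t[0] > 2,   # predicate
             lambda t: t[1],       # projection
             out)
# out == ['y', 'z']
```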
Push vs Pull Tradeoffs
| Aspect | Pull (Top-to-Bottom) | Push (Bottom-to-Top) |
|---|---|---|
| Virtual function overhead | One Next() call per tuple/batch | Compiled away via operator fusion |
| LIMIT handling | Easy — just stop calling Next() | Harder — must propagate a “stop” signal back to the leaf |
| Sort-Merge Join | Natural — can alternate pulling from two sorted inputs | Difficult — two push sources must be coordinated |
| Scheduling | Simple recursive control flow | Needs a scheduler to manage which pipeline runs when |
| Output control | Consumer controls the pace | Producer controls the pace; needs output buffers |
Access Methods
An access method is the way the DBMS reads data from a table. The query optimizer chooses the access method based on cost estimates.
Sequential Scan
The simplest method: read every page in the table, one after another.
```
for page in table.pages:
    for tuple in page:
        if evalPred(tuple): emit(tuple)
```
This is often the worst-case scenario, but the DBMS can apply several optimizations:
- Prefetching: fetch the next few pages into the buffer pool before they are needed
- Buffer Pool Bypass: for large sequential scans, bypass the buffer pool to avoid polluting the cache with pages that won’t be re-read
- Parallelization: divide the scan across multiple threads or workers
- Late Materialization: in columnar storage, delay stitching columns together until absolutely necessary — only read the columns the query actually needs
- Heap Clustering: if the table is physically sorted on a key, range predicates on that key can skip large portions of the heap
- Data Encoding / Compression: operate directly on compressed data without decompressing (e.g., run-length encoding can answer some predicates without expanding)
- Scan Sharing (Synchronized Scans): if multiple queries scan the same table concurrently, share the scan cursor so they piggyback on each other’s I/O
- Materialized Views / Cached Results: precompute and cache common query results so scans can be avoided entirely
- Code Specialization / Compilation: generate specialized scan code for a specific schema and predicate at query compile time, avoiding generic interpretation overhead
Data Skipping
The idea behind data skipping is to avoid examining pages or tuples that are guaranteed not to satisfy a query’s predicates. There are two categories:
Approximate Queries (Lossy): execute queries on a sampled subset of the data to produce approximate answers. This sacrifices precision for speed — useful for exploratory analytics where exact answers are not required.
Zone Maps (Lossless): pre-computed metadata about the values in each page (or range of pages) that allows the DBMS to skip pages with certainty:
- Stores the MIN, MAX, AVG, SUM, COUNT for each attribute in a contiguous range of pages
- Before scanning a page, check the zone map — if the predicate cannot match any value in the range, skip the entire page
- Extremely effective for range predicates on sorted or semi-sorted data
- Zone maps are maintained automatically as data is inserted/updated
- Used by: Netezza, Oracle, Vertica, SingleStore, Snowflake, and supported natively in columnar file formats like Parquet, ORC, Vortex, and LanceDB
For example, if a query asks for WHERE val > 600 and a page’s zone map shows MAX(val) = 400, that entire page can be skipped.
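The zone-map check can be sketched in a few lines of Python (the page layout and zone-map structure are illustrative):

```python
# Sketch of zone-map data skipping for "val > 600": the zone map stores
# MIN/MAX per page, and a page whose MAX cannot satisfy the predicate
# is skipped without reading its tuples.
pages = [[100, 250, 400], [580, 610, 700], [900, 950, 999]]
zone_map = [(min(p), max(p)) for p in pages]   # normally precomputed

def scan_gt(pages, zone_map, threshold):
    out, skipped = [], 0
    for page, (lo, hi) in zip(pages, zone_map):
        if hi <= threshold:        # predicate cannot match: skip page
            skipped += 1
            continue
        out.extend(v for v in page if v > threshold)
    return out, skipped

result, skipped = scan_gt(pages, zone_map, 600)
# result == [610, 700, 900, 950, 999]; skipped == 1 (the MAX=400 page)
```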
Index Scan
If the table has an index on the attribute(s) used in the predicate, the DBMS can use the index to find matching tuples directly, avoiding a full table scan.
The query optimizer must decide: is it cheaper to use the index or just do a sequential scan? This depends on:
- Selectivity of the predicate (how many tuples match)
- Whether the index is clustered (data physically sorted by the index key)
- Cost of random I/O vs. sequential I/O
For low-selectivity predicates (few matching tuples), an index scan is usually faster. For high-selectivity predicates, the random I/O from an unclustered index can make a sequential scan cheaper.
Multi-Index Scan
When a query has predicates on multiple attributes and indexes exist on each, the DBMS can:
- Scan each index independently to get a set of matching Record IDs (RIDs)
- Compute the intersection (for AND) or union (for OR) of these RID sets using set operations
- Retrieve the actual tuples using the resulting RID set
This approach can be more efficient than using a single index when multiple selective predicates are combined.
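The RID-set arithmetic in the steps above can be sketched directly, with made-up index contents:

```python
# Sketch of a multi-index scan for "WHERE age = 30 AND dept = 'eng'":
# each index lookup yields a set of record IDs; AND is set intersection,
# OR is set union. The index contents below are made up.
age_index = {30: {1, 4, 7, 9}}           # age  -> matching RIDs
dept_index = {'eng': {2, 4, 9, 11}}      # dept -> matching RIDs

rids_and = age_index[30] & dept_index['eng']   # intersection for AND
rids_or = age_index[30] | dept_index['eng']    # union for OR

assert rids_and == {4, 9}
assert rids_or == {1, 2, 4, 7, 9, 11}
# The DBMS then fetches only the tuples with these RIDs from the heap.
```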
Modification Queries
Operators that modify the database (INSERT, UPDATE, DELETE) need special handling because they both read and write data.
The Halloween Problem
Named after the day it was discovered at IBM — Halloween 1976. The problem: an UPDATE query can see its own modifications if the scan and the update operate on the same index.
Example:

```sql
UPDATE employees SET salary = salary + 100 WHERE salary < 1100;
```
If the table is scanned via an index on salary, a tuple with salary = 999 gets updated to 1099 and moves to a new position in the index. But the scan hasn’t finished — it may encounter that same tuple again at key 1099, which still satisfies salary < 1100 if the cursor hasn’t passed it. The tuple gets updated again to 1199, and so on — the tuple keeps “running ahead” of the scan cursor and is updated multiple times.
Solutions:
- Track modified tuples: record the RIDs of updated tuples and skip them if encountered again
- Materialize first: scan all qualifying tuples first, then apply the updates — this breaks the pipeline but guarantees correctness
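A toy Python simulation of the track-modified-tuples solution; the rows, bump, and threshold are made up, and the salary-ordered loop stands in for an index scan:

```python
# Toy simulation of the Halloween Problem fix that tracks modified RIDs.
# We "scan" in salary order like an index scan; without the seen-set, a
# tuple that still qualifies after its update would be bumped repeatedly.
def update_salaries(rows, bump, limit):
    modified = set()
    progress = True
    while progress:                       # re-scan until nothing changes
        progress = False
        for rid, sal in sorted(rows.items(), key=lambda kv: kv[1]):
            if sal < limit and rid not in modified:
                rows[rid] = sal + bump
                modified.add(rid)         # skip this tuple if seen again
                progress = True
    return rows

rows = update_salaries({1: 1000, 2: 1250, 3: 1400}, 100, 1300)
# Each qualifying tuple is updated exactly once, even though tuple 1
# (now 1100) would still satisfy salary < 1300 on a later pass:
# rows == {1: 1100, 2: 1350, 3: 1400}
```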
Expression Evaluation
The DBMS must evaluate expressions in WHERE clauses, SELECT lists, and other contexts.
Execution Context
For each expression evaluation, the DBMS maintains an execution context that tracks:
- Current Tuple: the tuple being processed
- Query Parameters: bound values for parameterized queries
- Table Schema: metadata needed to interpret column references
The naive approach for evaluating expressions is an expression tree where each node is an operator or value.
Expression Tree Evaluation
For example, the predicate val > 100 becomes a tree with the comparison at the root:

```
      >
     / \
  val   100
```
To evaluate, walk the tree recursively. The problem: for every single tuple, the DBMS traverses the tree, checks node types at runtime, and follows pointers. This is slow due to:
- Poor instruction cache behavior (pointer chasing)
- Branch mispredictions (type-checking at each node)
- Function call overhead
Optimizations
- Constant Folding: pre-compute expressions that don’t depend on tuple data at planning time
  - WHERE 1 = 0 → skip the entire query
  - WHERE x > 3 + 2 → WHERE x > 5
  - SELECT UPPER('wutang') → SELECT 'WUTANG' (function calls on constants are folded too)
- Common Sub-Expression Elimination (CSE): if the same sub-expression appears multiple times, compute it once and reuse the result
  - Example: WHERE STRPOS(name, 'Andy') > 0 AND STRPOS(name, 'Andy') < 5 — the STRPOS(name, 'Andy') call is computed once and the result is shared
- JIT Compilation: compile the expression (or the entire query pipeline) into native machine code at query time, eliminating the overhead of tree interpretation
  - PostgreSQL uses LLVM for JIT compilation of expressions
  - HyPer/Umbra compile entire query pipelines to native code
  - Eliminates type-checking, pointer chasing, and function call overhead
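To illustrate the gap between interpretation and compilation, here is a Python sketch that compiles an expression tree into a closure — a stand-in for real JIT-compiled machine code. The node encoding is illustrative:

```python
# Interpreted vs. "compiled" expression evaluation. Nodes are tuples:
# ('col', index), ('const', value), ('>', left, right).
def interpret(node, tuple_):
    kind = node[0]                       # type check on EVERY call
    if kind == 'col':
        return tuple_[node[1]]
    if kind == 'const':
        return node[1]
    if kind == '>':
        return interpret(node[1], tuple_) > interpret(node[2], tuple_)
    raise ValueError(kind)

def compile_expr(node):
    # Walk the tree once at plan time; the returned closure has no
    # per-tuple type checks or pointer chasing.
    kind = node[0]
    if kind == 'col':
        i = node[1]
        return lambda t: t[i]
    if kind == 'const':
        v = node[1]
        return lambda t: v
    if kind == '>':
        l, r = compile_expr(node[1]), compile_expr(node[2])
        return lambda t: l(t) > r(t)
    raise ValueError(kind)

tree = ('>', ('col', 0), ('const', 100))   # val > 100
pred = compile_expr(tree)
assert interpret(tree, (150,)) is True
assert pred((150,)) is True and pred((50,)) is False
```

A real JIT goes further by emitting native code, but the structural win is the same: the per-tuple work no longer revisits the tree.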
Lec13 Takeaways
- The vectorized model is the sweet spot for analytical workloads: it amortizes function-call overhead while keeping data in cache and enabling SIMD
- Zone maps are a simple but powerful optimization for sequential scans — they let the DBMS skip irrelevant pages without maintaining a full index
- The Halloween Problem is a subtle correctness issue that arises when modifications and scans share the same index structure
- Push-based execution can outperform pull-based by compiling operators into tight loops that avoid recursive function calls
- JIT compilation of expressions and pipelines removes the interpretation overhead that dominates CPU time in analytical queries
Lec14 Query Execution II — Parallelism
This lecture asks: how does the DBMS exploit parallel hardware to speed up query execution? It covers process models for managing workers, three types of query parallelism, data-level parallelism via SIMD, and I/O parallelism techniques.
Parallel vs Distributed
Before diving in, an important distinction:
| Aspect | Parallel DBMS | Distributed DBMS |
|---|---|---|
| Resources | Physically close, communicate via high-speed interconnect | Nodes can be far apart, communicate over a slower network |
| Communication | Cheap and assumed reliable | Cost cannot be ignored; messages may be lost |
| Examples | Single multi-core machine, shared-memory cluster | CockroachDB, Spanner |
This lecture focuses on parallel execution within a single machine or tightly coupled cluster. Distributed query execution is covered in later lectures.
Process Models
A DBMS process model defines how the system maps its internal worker abstractions to OS-level processes or threads. This is separate from the query processing model discussed in Lec13.
Process per DBMS Worker
Each worker is a separate OS process:
- Relies on the OS scheduler for CPU time
- Workers communicate via shared memory for global data structures (buffer pool, lock table, etc.)
- A crash in one worker does not necessarily bring down the entire system
- Higher context-switch overhead than threads
- Used by: PostgreSQL, Oracle (historically), IBM DB2
Thread per DBMS Worker (Most Common)
Each worker is an OS thread within a single DBMS process:
- Lower overhead — threads share the same address space, so no shared-memory segments are needed
- The DBMS manages its own scheduling (or relies on the OS thread scheduler)
- A crash in one thread can bring down the entire process
- Nearly every major DBMS built in the last ~25 years uses native OS threads (not green threads or user-level coroutines)
- Used by: MySQL, SQL Server, Oracle (modern versions), most modern systems
Important: using the thread-per-worker model does not automatically mean the DBMS supports intra-query parallelism. It only means the DBMS can run multiple queries concurrently on separate threads. Exploiting parallelism within a single query requires additional infrastructure (exchange operators, parallel-aware operators, etc.).
Embedded DBMS
The DBMS runs inside the application’s process (no separate server):
- No inter-process communication needed
- The application is responsible for managing concurrency
- Used by: SQLite, DuckDB, LevelDB, RocksDB, BerkeleyDB, WiredTiger (MongoDB’s storage engine)
Comparison
| Model | Isolation | Overhead | Used By |
|---|---|---|---|
| Process per Worker | High (process boundary) | High (context switch) | PostgreSQL |
| Thread per Worker (Most Common) | Low (shared address space) | Low (lightweight) | MySQL, SQL Server |
| Embedded | None (in-process) | Minimal | SQLite, DuckDB |
Scheduling
For thread-per-worker systems, the DBMS must decide how to assign tasks to worker threads. Two approaches:
- OS-level scheduling: let the OS decide which thread runs on which core. Simple but the DBMS has no control over placement decisions.
- DBMS-level scheduling: the DBMS maintains its own work queues and assigns tasks to threads. This allows NUMA-aware placement (schedule work on the core closest to the data) and avoids excessive context switching.
Modern high-performance systems (e.g., SQL Server, Oracle) prefer DBMS-level scheduling to maintain control over resource allocation.
Query Parallelism
There are two top-level categories: parallelism between queries and parallelism within a single query.
Inter-Query Parallelism
Execute multiple queries simultaneously on different workers. This is straightforward — each query runs independently. The main challenge is concurrency control (ensuring transactions don’t interfere with each other), which is the subject of later lectures.
Improves throughput (queries per second) rather than latency of individual queries.
Intra-Query Parallelism
Execute a single query using multiple workers. This improves the latency of individual queries. There are three approaches:
Intra-Operator Parallelism (Horizontal) — Most Common
Decompose a single operator into independent fragments that operate on different subsets of the data:
```
          Exchange (Gather)
          /       |       \
      Scan      Scan      Scan
   (range 1)  (range 2)  (range 3)
```
- A table scan is split across 3 workers, each reading a different range of pages
- Results are merged at an Exchange operator above
- Also called horizontal partitioning or data parallelism at the operator level
- The most common form of intra-query parallelism
- Example: Parallel Grace Hash Join — partition both tables using hash on the join key, then run independent hash join instances on each partition pair across multiple workers
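A minimal Python sketch of intra-operator parallelism: a scan with a filter is split across three workers over disjoint page ranges, and the results are gathered at the end (the Exchange operator’s job). Everything here is illustrative:

```python
# Sketch of a horizontally parallelized scan: each worker handles a
# disjoint range of "pages"; results are gathered at the end.
from concurrent.futures import ThreadPoolExecutor

pages = [list(range(i, i + 10)) for i in range(0, 60, 10)]  # 6 "pages"

def scan_fragment(page_range, pred):
    # Each worker independently scans its assigned pages.
    return [v for page in page_range for v in page if pred(v)]

def parallel_scan(pages, pred, workers=3):
    chunk = len(pages) // workers
    ranges = [pages[i * chunk:(i + 1) * chunk] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(scan_fragment, ranges, [pred] * workers)
    return [v for part in parts for v in part]   # Gather step

out = parallel_scan(pages, lambda v: v % 10 == 0)
# out == [0, 10, 20, 30, 40, 50]
```

(Python threads don’t gain CPU parallelism under the GIL; the sketch shows the decomposition pattern, not a real speedup.)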
Inter-Operator Parallelism (Vertical / Pipelining) — Less Common
Different operators in the query plan run simultaneously on different workers, passing tuples between them in a producer/consumer paradigm:
```
Worker 1: Scan → Filter ──tuples──▶ Worker 2: Join → Project
```
- Also called pipeline parallelism — operators are stages in a pipeline
- Limited by the slowest stage (pipeline stalls if one operator is much slower)
- Less common in practice because it requires careful balancing
- More common in streaming systems (e.g., Apache Kafka, Flink, Spark Streaming, Storm) than in traditional DBMS
Bushy Parallelism — Higher-End Systems
A combination of both: different branches of the plan tree execute in parallel, and within each branch, operators can also be parallelized:
```
                 Join
               /      \
           Join        Join
          /    \      /    \
         A      B    C      D
```
- Most flexible but most complex to schedule
- Used by modern MPP systems for complex multi-way joins
Exchange Operator
The Exchange operator is the key mechanism for implementing intra-query parallelism. It sits between parallel fragments of a query plan and manages data distribution.
Three types:
| Type | Behavior | Use Case |
|---|---|---|
| Gather | Merge results from multiple workers into one stream | Combining parallel scan results |
| Distribute | Split one stream across multiple workers | Partitioning data for parallel join |
| Repartition | Redistribute data from N workers to M workers | Changing partition scheme mid-query |
The exchange operator abstracts away parallelism — the operators above and below it don’t need to know they are running in parallel. This is a clean architectural separation.
Google BigQuery uses the Repartition (shuffle) exchange extensively — its execution engine performs massive redistributions of intermediate data across thousands of workers to execute distributed joins and aggregations.
Data Parallelism with SIMD
SIMD (Single Instruction, Multiple Data) applies the same operation to multiple data elements simultaneously using wide CPU registers.
Why SIMD Matters for Databases
A typical database scan applies a simple predicate (e.g., val > 100) to millions of values. Without SIMD, this requires one comparison per value. With SIMD, a single instruction can compare 4, 8, or even 16 values at once (depending on the register width: SSE = 128-bit, AVX2 = 256-bit, AVX-512 = 512-bit).
SIMD Selection Scans
Consider a selection scan:
SELECT * FROM T WHERE val > 100
Without SIMD:

```
for each value v in column:
    if v > 100: output v
```

With SIMD (AVX2, 256-bit registers, 32-bit integers = 8 values per register):

```
threshold = broadcast(100)   // [100, 100, 100, 100, 100, 100, 100, 100]
for each group of 8 values:
    data = load(values[i : i+8])
    mask = compare_gt(data, threshold)   // produces a bitmask
    output matching values using mask
```
This is 8x more values processed per instruction cycle (in theory).
Potential speedup compounds with multicore parallelism. For example, a machine with 32 cores each executing 4-wide SIMD instructions achieves up to 128x throughput compared to scalar single-core execution.
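A pure-Python simulation of the 8-lane selection scan above — real code would use AVX2 intrinsics or a vectorized library, but the group-of-8 compare-and-mask structure is the same:

```python
# Simulated 8-lane SIMD selection scan for "val > 100": values are
# processed in groups of 8 and a mask marks the qualifying lanes.
LANES = 8

def simd_gt(values, threshold):
    out = []
    for i in range(0, len(values), LANES):
        lanes = values[i:i + LANES]               # "load" 8 values
        mask = [v > threshold for v in lanes]     # one compare per group
        out.extend(v for v, m in zip(lanes, mask) if m)
    return out

vals = [50, 150, 99, 101, 700, 3, 200, 100, 600, 42]
assert simd_gt(vals, 100) == [150, 101, 700, 200, 600]
```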
Filter Representations
After a SIMD comparison produces a bitmask of qualifying tuples, how should the DBMS represent and use this filter?
Selection Vectors: store the positions (indices) of qualifying tuples in an array.
- Compact when selectivity is low (few matches)
- Downstream operators iterate over the positions
- Used by: DuckDB, Vectorwise

Bitmaps: a bit vector where bit i = 1 if tuple i qualifies.
- Fixed size regardless of selectivity
- Efficient for bitwise AND/OR of multiple predicates
- Used by: Oracle, many columnar stores
| Representation | Space | Best When | Set Operations |
|---|---|---|---|
| Selection Vector | O(matches) | Low selectivity | Requires merge |
| Bitmap | O(n) fixed | High selectivity, multiple predicates | Bitwise AND/OR |
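The two representations and their combination can be sketched directly (illustrative values):

```python
# Selection vector vs. bitmap for the same predicate pass, plus the
# cheap bitwise AND that bitmaps enable for compound predicates.
vals = [5, 120, 80, 300, 150]

# Selection vector: O(matches) space — positions of qualifying tuples.
sel_vec = [i for i, v in enumerate(vals) if v > 100]
assert sel_vec == [1, 3, 4]

# Bitmaps: O(n) space, one bit per tuple, cheap to combine.
bm_gt100 = [int(v > 100) for v in vals]     # [0, 1, 0, 1, 1]
bm_lt200 = [int(v < 200) for v in vals]     # [1, 1, 1, 0, 1]
bm_and = [a & b for a, b in zip(bm_gt100, bm_lt200)]
assert bm_and == [0, 1, 0, 0, 1]
assert [v for v, b in zip(vals, bm_and) if b] == [120, 150]
```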
I/O Parallelism
When the bottleneck is disk I/O rather than CPU, the DBMS can use multiple storage devices in parallel. A single disk’s bandwidth is fundamentally limited, so scaling I/O requires spreading data across multiple devices.
Multi-Disk Parallelism
Spread the database across multiple disks to increase aggregate I/O bandwidth.
RAID 0 (Striping): split each file across multiple disks in round-robin fashion.
- Read/write bandwidth scales linearly with the number of disks
- No redundancy — any single disk failure loses data
- Transparent to the DBMS

RAID 1 (Mirroring): duplicate each file on multiple disks.
- Read bandwidth doubles (can read from either copy)
- Write bandwidth unchanged (must write both copies)
- Fault tolerant — survives a single disk failure
- Transparent to the DBMS
In practice, production systems use RAID 10 (mirrored stripes) or rely on distributed storage systems that handle redundancy at a higher level. RAID is typically configured at the hardware or OS level — transparent to the DBMS. Modern cloud/distributed systems often replace hardware RAID with software-defined erasure codes.
The fundamental tradeoff triangle for multi-disk configurations is Performance vs Durability vs Capacity — you can optimize for at most two at the expense of the third.
Database Partitioning
Split the logical database into disjoint segments stored on separate disks (or separate machines in a distributed system).
Vertical Partitioning: split a table by columns — store different columns on different disks.
- Similar to a column-store layout
- Useful when queries only access a subset of columns

Horizontal Partitioning: split a table by rows — different ranges or hash buckets on different disks.
- Common partition strategies: range, hash, round-robin
- Enables parallel scans where each worker reads from a different partition
- Partition pruning must be handled carefully for point queries
- PostgreSQL supports this via Tablespaces, allowing different tables/indexes to be stored on different disk volumes
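A small Python sketch of horizontal hash partitioning, with a made-up row format and partition count:

```python
# Sketch of hash partitioning: rows are assigned to partitions (disks)
# by hashing the key, so a later scan can run one worker per partition
# and a point query can be pruned to a single partition.
N_PARTS = 4

def hash_partition(rows, key):
    parts = [[] for _ in range(N_PARTS)]
    for row in rows:
        parts[hash(row[key]) % N_PARTS].append(row)
    return parts

rows = [{'id': i, 'val': i * 10} for i in range(12)]
parts = hash_partition(rows, 'id')

# Every row lands in exactly one partition:
assert sum(len(p) for p in parts) == len(rows)
# A point query on id = 7 touches only one partition (pruning):
target = hash(7) % N_PARTS
assert any(r['id'] == 7 for r in parts[target])
```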
Challenges of Parallelism
Parallelism introduces several practical challenges that the DBMS must handle:
- Coordination overhead: synchronizing workers adds latency (e.g., exchange operators, barriers)
- Scheduling: deciding how many workers to allocate to each query, and how to divide work among them
- Concurrency issues: parallel access to shared data structures (hash tables, buffer pool) requires careful locking or lock-free designs
- Resource contention: multiple parallel queries compete for CPU, memory, and I/O bandwidth
The goal is to maximize throughput and minimize latency while keeping these overheads manageable.
Lec14 Takeaways
- Thread-per-worker is the dominant process model in modern DBMS — but using threads does not automatically mean the DBMS supports intra-query parallelism
- Intra-operator (horizontal) parallelism is the most common form of intra-query parallelism, cleanly abstracted via the Exchange operator
- SIMD provides significant speedups for scan-heavy analytical workloads — combined with multicore parallelism, it can achieve 100x+ throughput improvements
- I/O parallelism through RAID and partitioning is essential when disk bandwidth is the bottleneck, with a fundamental tradeoff between performance, durability, and capacity
- Parallelism introduces coordination overhead, scheduling complexity, and resource contention — the DBMS must balance these costs against throughput gains
Final Summary
| Topic | Key Idea |
|---|---|
| Processing Models | Iterator (per-tuple), Materialization (all-at-once), Vectorized (per-batch) — vectorized wins for OLAP |
| Plan Direction | Pull (top-down) is simple; Push (bottom-up) enables compiled pipelines |
| Access Methods | Sequential scan with zone maps, index scan for selective queries, multi-index for compound predicates |
| Halloween Problem | UPDATE can see its own changes through a shared index — must track or materialize first |
| Expression Evaluation | Expression trees are slow; JIT compilation eliminates interpretation overhead |
| Parallel vs Distributed | Parallel = close resources, cheap communication; Distributed = far nodes, unreliable network |
| Process Models | Process-per-worker (isolation) vs thread-per-worker (performance, most common) vs embedded (simplicity) |
| Intra-Query Parallelism | Horizontal (most common), Vertical/pipelining (less common), Bushy (higher-end systems) |
| Exchange Operator | Gather / Distribute / Repartition — abstracts parallelism from operators |
| SIMD | Apply one instruction to many data values — combined with multicore yields 100x+ speedup |
| Filter Representations | Selection vectors (sparse, good for low selectivity) vs bitmaps (dense, good for combining predicates) |
| I/O Parallelism | RAID striping/mirroring for bandwidth, partitioning for parallel scans; tradeoff: performance vs durability vs capacity |
Key takeaway: Modern query execution is a co-design of processing models, parallelism strategies, and hardware exploitation — the vectorized batch model, exchange operators, and SIMD collectively enable analytical systems to process billions of tuples per second.
References
- A. Pavlo, “Query Execution I,” CMU 15-445/645 Lecture #13, Spring 2024.
- A. Pavlo & J. Patel, “Query Execution II,” CMU 15-445/645 Lecture #14, Spring 2024.
- G. Graefe, “Volcano — An Extensible and Parallel Query Evaluation System,” IEEE TKDE, 1994.
- T. Neumann, “Efficiently Compiling Efficient Query Plans for Modern Hardware,” VLDB, 2011.
- P. Boncz, M. Zukowski, N. Nes, “MonetDB/X100: Hyper-Pipelining Query Execution,” CIDR, 2005.
This post is based on lecture materials from CMU 15-445/645 Database Systems by Andy Pavlo & Jignesh Patel (Lecture #13: Query Execution I, Lecture #14: Query Execution II).