Lec15 Query Planning & Optimization — Rules, Heuristics, and Search Strategies

The previous lectures covered how individual operators execute (scans, joins, sorts). This lecture steps back to ask a higher-level question: given a SQL query, how does the DBMS choose which plan to execute? The answer involves transforming logical representations using algebraic equivalences, applying heuristic rules, and — for cost-based optimizers — searching through the space of equivalent plans to find the cheapest one.

Query Planning Pipeline

A SQL statement goes through several stages before execution:

SQL String
→ Parser → Abstract Syntax Tree (AST)
→ Binder → Logical Plan (with resolved names & types)
→ Tree Rewriter → Rewritten Logical Plan (expression simplification)
→ Optimizer → Physical Plan (with specific algorithms & access paths)

The parser checks syntax and produces an AST. The binder resolves table and column names against the system catalog, mapping them to internal identifiers. This produces a logical plan — a tree of relational algebra operators that specifies what data to compute, but not how.

The optimizer’s job is to transform this logical plan into the best physical plan — one that specifies concrete algorithms (hash join vs. sort-merge join), access methods (sequential scan vs. index scan), and operator ordering.

Two fundamental approaches exist:

  1. Heuristic / rule-based: apply static transformation rules that generally improve plans without estimating costs
  2. Cost-based: enumerate equivalent plans, estimate their cost using statistics, and pick the cheapest

Most real systems use a combination of both — heuristic rewrites first, then cost-based search over the simplified space.

Relational Algebra Equivalences

The optimizer’s ability to rearrange a query plan rests on equivalences — algebraic identities that guarantee two expressions produce the same result. These equivalences define the legal moves in the optimizer’s search space.

Selection Equivalences

Decomposition (cascade):

\[\sigma_{p_1 \wedge p_2}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(R))\]

A conjunctive predicate can be split into a cascade of individual selections. This is the foundation for predicate pushdown — once split, individual predicates can be pushed closer to the data source.

Commutativity:

\[\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))\]

The order of adjacent selections does not affect the result. A cost-based optimizer can reorder them so the most selective (or cheapest) predicate runs first.

Projection Equivalences

Projections are idempotent — projecting the same columns twice is equivalent to projecting once:

\[\pi_{a_1}(R) \equiv \pi_{a_1}(\pi_{a_1, a_2, \ldots}(R))\]

This enables early projection: the optimizer can insert projections lower in the tree to reduce tuple width, cutting I/O and memory usage for subsequent operators.

Join Equivalences

Commutativity:

\[R \bowtie S \equiv S \bowtie R\]

The optimizer is free to swap the build and probe sides of a hash join, or choose which table to use as the outer relation in a nested-loop join.

Associativity:

\[(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)\]

This is the most impactful equivalence for multi-way joins. The number of possible join orderings grows super-exponentially with the number of tables: the number of binary tree shapes alone is a Catalan number, and counting the ways to assign tables to the leaves gives \((2(n-1))!/(n-1)!\) distinct join trees for \(n\) tables. For a 10-table join, that is over 17 billion possible orderings.
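
To get a feel for this growth, the short Python snippet below (illustrative only) evaluates the closed form \((2(n-1))!/(n-1)!\) for a few values of \(n\):

```python
from math import factorial

def num_join_trees(n: int) -> int:
    """Distinct join trees over n base tables: n! leaf orderings times the
    Catalan number C(n-1) of binary tree shapes = (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (2, 4, 6, 8, 10):
    print(n, f"{num_join_trees(n):,}")
# n = 10 prints 17,643,225,600 -- over 17 billion candidate orderings
```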

Heuristic / Rule-Based Optimization

A heuristic optimizer applies a fixed sequence of transformation rules without considering the actual data. These rules represent “usually good” strategies:

Core Heuristic Rules

  1. Split conjunctive predicates: decompose WHERE p1 AND p2 AND p3 into individual filters
  2. Predicate pushdown: push selection operators as far down the tree as possible, ideally to the table scan. This reduces the number of tuples flowing through the rest of the plan
  3. Replace Cartesian products with joins: if a Cartesian product is followed by a selection on a join predicate, replace both with a single join operator
  4. Projection pushdown: push projections down to eliminate unneeded columns early, reducing tuple width
  5. Reorder predicates by selectivity: within a filter chain, evaluate the most selective predicate first (the one that eliminates the most tuples)

Example Transformation

Consider a three-table query:

SELECT s.name, c.title
FROM students AS s, enrolled AS e, courses AS c
WHERE s.sid = e.sid
AND e.cid = c.cid
AND e.grade = 'A';

The naive plan starts with Cartesian products of all three tables, followed by all selections, and a final projection. Heuristic optimization transforms this step by step:

  1. Split the conjunctive WHERE into three individual selections
  2. Push e.grade = 'A' down to the scan of enrolled
  3. Replace the Cartesian product + join-predicate pairs with proper join operators
  4. Push projections down to discard unneeded columns early

The result is a dramatically cheaper plan — filtering early means joins process far fewer tuples.
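
To make the predicate-pushdown step concrete, here is a minimal Python sketch of the rule. The plan-node classes and the rewrite function are hypothetical, written for this post rather than taken from any real engine; they only capture the idea of moving a filter below a join when one side supplies all of the columns it references.

```python
from dataclasses import dataclass

# Hypothetical logical-plan nodes, just enough to express the rule.
@dataclass
class Scan:
    table: str
    columns: frozenset

@dataclass
class Filter:
    pred: str
    pred_cols: frozenset          # columns referenced by the predicate
    child: object

@dataclass
class Join:
    left: object
    right: object
    condition: str

def columns_of(node) -> frozenset:
    if isinstance(node, Scan):
        return node.columns
    if isinstance(node, Filter):
        return columns_of(node.child)
    return columns_of(node.left) | columns_of(node.right)

def push_filters_down(node):
    """Move a Filter below a Join when one join input supplies all of its columns."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        for side in ("left", "right"):
            if node.pred_cols <= columns_of(getattr(join, side)):
                inputs = {"left": join.left, "right": join.right}
                inputs[side] = Filter(node.pred, node.pred_cols, inputs[side])
                return push_filters_down(Join(inputs["left"], inputs["right"], join.condition))
    if isinstance(node, Filter):
        return Filter(node.pred, node.pred_cols, push_filters_down(node.child))
    if isinstance(node, Join):
        return Join(push_filters_down(node.left), push_filters_down(node.right), node.condition)
    return node

# sigma(e.grade = 'A') above (enrolled JOIN courses) ends up directly above the enrolled scan.
plan = Filter("e.grade = 'A'", frozenset({"e.grade"}),
              Join(Scan("enrolled", frozenset({"e.sid", "e.cid", "e.grade"})),
                   Scan("courses", frozenset({"c.cid", "c.title"})),
                   "e.cid = c.cid"))
print(push_filters_down(plan))
```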

Limitations of Heuristic Optimization

Heuristic rules do not consider:

  • The actual sizes of intermediate results
  • The availability or selectivity of indexes
  • The relative cost of different physical operators
  • Data distribution (skew, correlations)

A heuristic optimizer always applies the same transformations regardless of the data. For example, it always pushes predicates down — but occasionally a predicate is so non-selective that pushing it down just adds overhead without reducing cardinality.

Expression Rewriting

Before searching for the best plan, the optimizer simplifies predicates through expression rewriting — syntactic transformations that eliminate redundant or impossible conditions.

Impossible / Redundant Predicates

-- Impossible: no row can satisfy both conditions
WHERE val BETWEEN 1 AND 5 AND val BETWEEN 10 AND 20
→ empty result (skip execution entirely)

-- Redundant: the second predicate is subsumed
WHERE val >= 5 AND val >= 3
→ WHERE val >= 5

Predicate Merging

-- Merge overlapping BETWEEN predicates
WHERE val BETWEEN 1 AND 10 AND val BETWEEN 5 AND 20
→ WHERE val BETWEEN 5 AND 10
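
A rewriter can implement this by intersecting the ranges that a conjunction implies for a single column. The helper below is a minimal sketch of that idea (names are made up for this example):

```python
def intersect_ranges(ranges):
    """Intersect (lo, hi) intervals from conjunctive range predicates on one column.
    Returns the merged interval, or None if the conjunction is unsatisfiable."""
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo <= hi else None

print(intersect_ranges([(1, 10), (5, 20)]))   # (5, 10)  -> val BETWEEN 5 AND 10
print(intersect_ranges([(1, 5), (10, 20)]))   # None     -> impossible predicate
```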

Join Elimination

If a join is performed but none of the joined table’s columns appear in the output and the join condition is on a foreign key that guarantees a match, the join can be eliminated entirely. This sometimes occurs with views that join more tables than the query actually needs.

Cost-Based Optimization

While heuristics handle the “obviously good” transformations, choosing among genuinely different strategies (e.g., hash join vs. sort-merge join, different join orderings) requires a cost model. The optimizer enumerates equivalent plans, estimates each plan’s cost, and selects the cheapest.

The cost-based component needs three things:

  1. A way to enumerate equivalent plans (search strategy)
  2. A way to estimate the cost of each plan (cost model — covered in Lecture 16)
  3. A way to prune the search space to keep it tractable

Search Strategies

The optimizer must explore the space of logically equivalent plans. Two major paradigms exist:

Bottom-Up (System R / Forward Chaining)

The System R optimizer (1979) pioneered cost-based optimization. It builds optimal plans from the bottom up:

  1. Start with base relations and find the best access path for each (sequential scan vs. index scan)
  2. Enumerate all two-table join orderings. For each pair, consider all join algorithms (nested-loop, sort-merge, hash) and retain only the cheapest plan for each pair
  3. Build three-table joins by extending the best two-table plans, and so on
  4. At each level, prune plans that are strictly worse than another plan with the same result and same “interesting orders”

The key insight is dynamic programming with interesting orders: a plan that produces sorted output might be suboptimal in isolation but saves a sort later. The optimizer keeps the cheapest plan for each combination of (relation set, output order).

Level 1: Best single-table access for {R}, {S}, {T}
Level 2: Best join plan for {R,S}, {R,T}, {S,T}
Level 3: Best join plan for {R,S,T} — extend Level 2 winners

For \(n\) tables, this explores \(O(2^n)\) subsets — expensive but tractable for typical queries (fewer than ~15 tables). System R only considers left-deep join trees (where the right child of each join is always a base table), which limits the search space and fits naturally with pipelined execution, since each join's outer input is the stream produced by the join below it.
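
The sketch below shows the shape of that dynamic program for left-deep plans. The cost function and selectivities are made-up toy values, and interesting orders are omitted; it is meant only to show how the algorithm keeps the single cheapest plan per subset of relations.

```python
from itertools import combinations

# Toy inputs: base-table cardinalities and pairwise join selectivities (assumed values).
card = {"R": 1_000, "S": 5_000, "T": 200}
sel = {frozenset({"R", "S"}): 0.001,
       frozenset({"S", "T"}): 0.01,
       frozenset({"R", "T"}): 1.0}        # no predicate -> Cartesian product

def join_card(left_tables, right_table, left_card):
    s = min(sel.get(frozenset({t, right_table}), 1.0) for t in left_tables)
    return left_card * card[right_table] * s

def best_left_deep_plan(tables):
    # best[subset] = (cost, output cardinality, plan description)
    best = {frozenset({t}): (0.0, card[t], t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            candidates = []
            for right in subset:                       # right input must be a base table
                left = subset - {right}
                lcost, lcard, lplan = best[left]
                out = join_card(left, right, lcard)
                cost = lcost + lcard + card[right] + out   # toy cost: tuples touched
                candidates.append((cost, out, f"({lplan} JOIN {right})"))
            best[subset] = min(candidates)             # keep only the cheapest plan per subset
    return best[frozenset(tables)]

print(best_left_deep_plan(["R", "S", "T"]))
```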

Top-Down (Volcano / Cascades / Backward Chaining)

The Volcano (1993) and Cascades (1995) frameworks take the opposite approach:

  1. Start with the full logical expression at the root
  2. Apply transformation rules to generate equivalent logical expressions (stored in a shared structure called a memo table)
  3. For each logical expression, apply implementation rules to generate physical alternatives
  4. Recursively optimize sub-expressions using branch-and-bound pruning — if the cost of a partial plan already exceeds the best known complete plan, abandon that branch

Optimize(LogicalExpr):
    for each equivalent expr in ApplyRules(LogicalExpr):
        for each physical impl of expr:
            cost = EstimateCost(impl)
            if cost < upper_bound:
                recursively optimize children
                update best plan

The memo table is crucial — it ensures that equivalent sub-expressions are recognized and optimized only once. Each group in the memo represents a set of logically equivalent expressions, and the optimizer stores the best physical plan found for each group.

Enforcers

Both search strategies must handle physical properties that certain operators require. For example, a sort-merge join requires its inputs to be sorted on the join key. An enforcer is a physical operator (like Sort) that the optimizer inserts to guarantee a required property.

The optimizer treats enforcers as additional alternatives: it compares the cost of (sort + merge join) against (hash join without sort) and picks the cheaper option.
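
As a rough illustration, the toy numbers below compare the two alternatives with made-up cost formulas (they are not any system's real cost functions):

```python
import math

def sort_cost(pages):                    # toy model: external sort ~ pages * log2(pages)
    return pages * math.log2(pages)

def merge_join_cost(r_pages, s_pages):   # one pass over each (already sorted) input
    return r_pages + s_pages

def hash_join_cost(r_pages, s_pages):    # classic 3(M + N) I/O estimate for Grace hash join
    return 3 * (r_pages + s_pages)

r_pages, s_pages = 1_000, 5_000
with_enforcer = sort_cost(r_pages) + sort_cost(s_pages) + merge_join_cost(r_pages, s_pages)
print(f"sort enforcers + merge join: {with_enforcer:,.0f}")
print(f"hash join (no enforcer):     {hash_join_cost(r_pages, s_pages):,.0f}")
```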

Bottom-Up vs Top-Down Comparison

| Aspect | Bottom-Up (System R) | Top-Down (Volcano/Cascades) |
|---|---|---|
| Direction | Forward chaining (small → large) | Backward chaining (large → small) |
| Search pattern | BFS (enumerate all \(k\)-table plans before \(k+1\)) | DFS with memoization |
| Pruning | Discard dominated plans at each level | Branch-and-bound with upper bounds |
| Tree shapes | Typically left-deep only | Can explore bushy trees |
| Implementation | Easier to implement | More flexible, extensible rule system |
| Used by | PostgreSQL, IBM DB2, MySQL | SQL Server, CockroachDB, Apache Calcite |

In practice, both approaches find similar-quality plans. The top-down approach is more extensible (adding new rules doesn’t require restructuring the search) but can be harder to debug.

Star and Snowflake Schema Queries

Data warehouses commonly use star schemas (one fact table surrounded by dimension tables) and snowflake schemas (dimension tables further normalized). These queries join one large fact table with many small dimension tables.

Standard dynamic-programming enumeration (\(2^n\) subsets for \(n\) tables) becomes impractical when \(n\) is large (10+ dimension tables). Special strategies include:

  • Fact-table-centric: always start the join order with the fact table
  • Predicate pushdown to dimensions first: filter dimension tables before joining to the fact table, using bitmap filters or semi-joins to reduce the fact table scan
  • Star join optimization: recognize the star pattern and use specialized multi-index or bitmap intersection strategies

Lec15 Takeaways

  • The optimizer transforms logical plans into physical plans using algebraic equivalences — the correctness of query optimization rests entirely on these equivalences being sound
  • Heuristic rules (predicate pushdown, projection pushdown, Cartesian-to-join conversion) are cheap to apply and almost always beneficial — every real optimizer applies them first
  • Cost-based optimization enumerates equivalent plans and picks the cheapest — the two major paradigms (bottom-up System R and top-down Cascades) represent fundamentally different search strategies with similar outcomes
  • Join ordering is the single most impactful optimization decision — the number of valid orderings grows super-exponentially, making exhaustive search impossible for large queries
  • Enforcers bridge the gap between logical equivalence and physical requirements — the optimizer must account for properties like sort order when comparing plans

Lec16 Query Planning & Optimization — Cost Models, Statistics, and Cardinality Estimation

Lecture 15 established how the optimizer searches through equivalent plans. This lecture answers the complementary question: how does the optimizer estimate the cost of a plan? The answer requires statistics about the data, a cost model that translates statistics into estimated I/O and CPU costs, and cardinality estimation to predict the size of intermediate results.

Why Cost Estimation Matters

The optimizer’s plan choices are only as good as its cost estimates. A perfect search algorithm paired with terrible cost estimates will confidently pick a terrible plan. The cost model is the optimizer’s “oracle” — and its accuracy directly determines query performance.

Cost estimation has three components:

  1. Statistics: summaries of the data distribution (histograms, sketches, samples)
  2. Cardinality estimation: predicting how many tuples each operator produces
  3. Cost model: mapping cardinalities and operator types to estimated execution cost

Statistics

The DBMS maintains internal statistics about tables and columns to support cost estimation. These are stored in the system catalog and typically updated by an explicit ANALYZE command (PostgreSQL) or automatically in the background.

Basic Statistics

For each relation \(R\):

  • \(N_R\): number of tuples in \(R\)
  • \(V(A, R)\): number of distinct values for attribute \(A\) in \(R\) (sometimes called the cardinality of \(A\))
  • \(B_R\): number of disk blocks (pages) containing tuples of \(R\)

The selection cardinality \(SC(A, R)\) is the average number of tuples with a given value of \(A\):

\[SC(A, R) = \frac{N_R}{V(A, R)}\]

If \(A\) is a key (all values distinct), \(SC(A, R) = 1\). If \(A\) has only one distinct value, \(SC(A, R) = N_R\).

Histograms

Real data is rarely uniform. Histograms capture the distribution of values in a column, allowing the optimizer to estimate selectivity more accurately than the uniform assumption.

Equi-Width Histogram: Divide the value range into fixed-width buckets and count how many tuples fall in each bucket.

| Range | Count |
|---|---|
| [0, 10) | 120 |
| [10, 20) | 340 |
| [20, 30) | 90 |
| [30, 40) | 450 |

Simple to build but can be misleading when values cluster — a bucket spanning a dense region and a sparse region hides the real distribution.

Equi-Depth (Equi-Height) Histogram: Adjust bucket boundaries so that each bucket contains approximately the same number of tuples.

This better captures skewed distributions — narrow buckets appear where data is dense, wide buckets where data is sparse. Most production systems use equi-depth histograms.
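
A minimal sketch of building an equi-depth histogram and using it to estimate a range predicate, assuming numeric values and the usual uniform interpolation inside a bucket (the bucket-boundary logic is simplified relative to a real system):

```python
import random

def equi_depth_histogram(values, num_buckets):
    """Choose bucket boundaries so each bucket holds ~len(values)/num_buckets tuples."""
    vals = sorted(values)
    step = len(vals) / num_buckets
    return [vals[min(int(i * step), len(vals) - 1)] for i in range(num_buckets + 1)]

def estimate_sel_le(bounds, c):
    """Estimate sel(A <= c): whole buckets below c plus interpolation inside one bucket."""
    num_buckets = len(bounds) - 1
    sel = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        if c >= hi:
            sel += 1.0 / num_buckets
        elif c > lo and hi > lo:
            sel += (c - lo) / (hi - lo) / num_buckets
    return min(sel, 1.0)

random.seed(0)
data = [random.paretovariate(2.0) for _ in range(10_000)]      # heavily skewed values
bounds = equi_depth_histogram(data, 10)
print("estimated:", round(estimate_sel_le(bounds, 2.0), 3),
      "actual:", round(sum(v <= 2.0 for v in data) / len(data), 3))
```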

End-Biased Histogram: Store exact frequencies for the most common values (MCVs), and use a single histogram for the remaining values. PostgreSQL uses this approach — it maintains both an MCV list and a histogram for each analyzed column.

Sketches

For very large datasets or streaming scenarios, exact histograms can be expensive. Probabilistic sketches provide approximate statistics with bounded error using minimal space.

Count-Min Sketch: A 2D array of counters with \(d\) hash functions. To record a value, hash it with each function and increment the corresponding counter. To query the frequency, take the minimum across all hash functions (hence “count-min”). The minimum is used because hash collisions can only inflate counts, never deflate them.

  • Space: \(O(d \times w)\) where \(w\) is the width (number of counters per row)
  • Error: probabilistic guarantee based on \(d\) and \(w\)
  • Use case: estimating frequencies of individual values (point queries)
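
A minimal Count-Min Sketch in Python, using a keyed cryptographic hash to stand in for the \(d\) independent hash functions (the parameters here are arbitrary):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the tightest upper bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for value in ["a"] * 500 + ["b"] * 30 + ["c"] * 3:
    cms.add(value)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("z"))   # ~500, ~30, ~0
```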

HyperLogLog (HLL): Estimates the number of distinct values (cardinality) using the observation that the maximum number of leading zeros in hashed values relates to the log of the cardinality. Uses very little space (a few KB) to estimate cardinalities in the billions.

  • Space: \(O(m)\) where \(m\) is the number of registers (typically 1024–16384)
  • Error: standard error is roughly \(1.04/\sqrt{m}\) (about 2% for \(m = 2048\), under 1% for \(m = 16384\))
  • Use case: estimating \(V(A, R)\) without scanning the entire column

Sampling

An alternative to pre-computed statistics: maintain a random sample of tuples from each table and run the query’s predicates against the sample to estimate selectivities. This adapts naturally to complex predicates (including correlated ones) but has higher runtime cost.

Systems that use sampling-based estimation include Oracle (dynamic sampling) and some recent systems that combine sampling with learned models.

Cost Models

The cost model translates statistics and cardinality estimates into a single numeric cost for each plan. Two major flavors exist:

Physical Cost Model

Estimates actual resource consumption: disk I/Os, CPU cycles, memory usage, network transfers. This is the most accurate approach but ties the optimizer to specific hardware.

Factors include:

  • Sequential I/O vs. random I/O (sequential reads are 10–100x cheaper)
  • CPU cost per tuple for comparison, hashing, expression evaluation
  • Memory available for hash tables, sort buffers
  • Network cost for distributed queries

Logical Cost Model

Assigns abstract, relative costs to operators without modeling physical hardware. This makes the optimizer portable across hardware configurations but less accurate for any specific deployment.

PostgreSQL Cost Model

PostgreSQL uses a physical cost model with configurable magic constants that approximate the relative cost of different operations:

| Parameter | Default | Meaning |
|---|---|---|
| seq_page_cost | 1.0 | Cost of reading one page sequentially |
| random_page_cost | 4.0 | Cost of reading one page randomly |
| cpu_tuple_cost | 0.01 | CPU cost per tuple processed |
| cpu_index_tuple_cost | 0.005 | CPU cost per index entry |
| cpu_operator_cost | 0.0025 | CPU cost per operator evaluation |

The total cost of a plan is the sum of I/O cost and CPU cost:

\[\text{Total Cost} = (\text{pages\_read} \times \text{page\_cost}) + (\text{tuples\_processed} \times \text{cpu\_tuple\_cost})\]

These constants encode the assumption that random I/O is 4x more expensive than sequential I/O — reasonable for spinning disks but too pessimistic for SSDs, where the ratio is closer to 1.5–2x. Administrators can tune these constants for their hardware.
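
The snippet below shows the kind of arithmetic these constants drive when comparing a sequential scan against an index scan. The formulas are deliberately simplified stand-ins, not PostgreSQL's actual planner code:

```python
# Simplified, illustrative cost arithmetic using PostgreSQL-style constants.
seq_page_cost, random_page_cost = 1.0, 4.0
cpu_tuple_cost, cpu_index_tuple_cost = 0.01, 0.005

def seq_scan_cost(pages, tuples):
    return pages * seq_page_cost + tuples * cpu_tuple_cost

def index_scan_cost(matches, index_pages, heap_pages):
    # Index scans pay random I/O for index and heap pages plus per-entry CPU cost.
    return ((index_pages + heap_pages) * random_page_cost
            + matches * (cpu_tuple_cost + cpu_index_tuple_cost))

pages, tuples = 10_000, 1_000_000
for selectivity in (0.001, 0.05, 0.5):
    matches = int(tuples * selectivity)
    # Pessimistic assumption for the sketch: one heap page fetched per matching tuple.
    idx = index_scan_cost(matches, index_pages=matches // 200 + 3, heap_pages=matches)
    print(f"sel={selectivity:<6} seq scan={seq_scan_cost(pages, tuples):>10,.0f}"
          f"   index scan={idx:>12,.0f}")
```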

Cardinality Estimation

The most critical job of the cost model: predicting how many tuples each operator produces. Every subsequent cost estimate in the plan depends on these predictions.

Selectivity

The selectivity \(\text{sel}(P)\) of a predicate \(P\) is the fraction of tuples that satisfy \(P\):

\[\text{sel}(P) = \frac{|\sigma_P(R)|}{N_R}\]

The estimated output cardinality of a selection is:

\[|\sigma_P(R)| = N_R \times \text{sel}(P)\]

Selectivity for Common Predicates

Equality: \(A = \text{constant}\)

\[\text{sel}(A = c) = \frac{1}{V(A, R)}\]

Assumes uniform distribution — each distinct value is equally likely.

Range: \(A \geq c\) (given domain \([\text{min}, \text{max}]\))

\[\text{sel}(A \geq c) = \frac{\text{max} - c}{\text{max} - \text{min}}\]

Assumes values are uniformly distributed across the range.

Negation: \(\text{NOT } P\)

\[\text{sel}(\text{NOT } P) = 1 - \text{sel}(P)\]

Conjunction: \(P_1 \text{ AND } P_2\)

\[\text{sel}(P_1 \wedge P_2) = \text{sel}(P_1) \times \text{sel}(P_2)\]

This assumes independence between predicates — a strong assumption that often does not hold (e.g., city = 'Pittsburgh' and state = 'PA' are highly correlated).

Disjunction: \(P_1 \text{ OR } P_2\)

\[\text{sel}(P_1 \vee P_2) = \text{sel}(P_1) + \text{sel}(P_2) - \text{sel}(P_1) \times \text{sel}(P_2)\]
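
These rules compose mechanically. A tiny helper, valid only under the uniform-distribution and independence assumptions described here, makes the arithmetic explicit:

```python
def sel_equals(distinct_values):            # A = c, uniform assumption
    return 1.0 / distinct_values

def sel_ge(c, lo, hi):                      # A >= c over domain [lo, hi]
    return (hi - c) / (hi - lo)

def sel_not(p):
    return 1.0 - p

def sel_and(p1, p2):                        # independence assumption
    return p1 * p2

def sel_or(p1, p2):
    return p1 + p2 - p1 * p2

# Example: sel(grade = 'A' AND year >= 2020), with 5 distinct grades and years spanning 2000-2024
print(round(sel_and(sel_equals(5), sel_ge(2020, 2000, 2024)), 4))   # 0.2 * (4/24) ≈ 0.0333
```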

The Three Assumptions

Classical cardinality estimation rests on three simplifying assumptions:

  1. Uniform Distribution: All values in a column are equally likely. This breaks down with skewed data (e.g., Zipfian distributions common in real workloads).

  2. Independence: Predicates on different columns are statistically independent — knowing the value of column \(A\) tells you nothing about column \(B\). This breaks down with correlated attributes (city/state, age/graduation year).

  3. Containment (Inclusion Principle): For a join \(R \bowtie_{A} S\), the domain of values in the smaller relation is contained in the domain of the larger relation. That is, every value that appears in the smaller set also appears in the larger set.

Join Size Estimation

For an equi-join \(R \bowtie_{R.A = S.A} S\), the estimated result size is:

\[|\text{est}| \approx \frac{N_R \times N_S}{\max(V(A, R), V(A, S))}\]

The intuition: divide by the larger number of distinct values because the containment assumption implies the smaller set of values is fully “contained” in the larger set. Each value in the smaller domain matches with \(SC(A, R)\) tuples from \(R\) and \(SC(A, S)\) tuples from \(S\).

For a Cartesian product (no join predicate), the result size is simply \(N_R \times N_S\).
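
A worked example of the join-size formula with toy numbers (a foreign-key join to a key column, so the estimate collapses to the size of the referencing table):

```python
def est_join_size(n_r, n_s, v_a_r, v_a_s):
    """|R JOIN S| ≈ N_R * N_S / max(V(A,R), V(A,S)) under the containment assumption."""
    return n_r * n_s / max(v_a_r, v_a_s)

# enrolled: 1,000,000 rows, 40,000 distinct sid values; students: 40,000 rows, sid is the key.
print(est_join_size(1_000_000, 40_000, 40_000, 40_000))   # 1,000,000.0 -- one student per enrollment
```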

Multi-Join Estimation and Error Propagation

For queries joining multiple tables, the optimizer must estimate intermediate result sizes, and each estimate feeds into the next. Errors compound multiplicatively:

Consider a three-table join \(R \bowtie S \bowtie T\):

  1. Estimate \(|R \bowtie S|\) — say the estimate is off by 2x
  2. Use that estimate to compute \(|(R \bowtie S) \bowtie T|\) — the 2x error propagates

After \(k\) joins, estimation errors can compound to \(O(2^k)\) or worse. This is why cardinality estimation is often called the Achilles’ heel of query optimization — even small per-step errors can lead the optimizer to choose catastrophically wrong plans for complex queries.

The Correlated Attributes Problem

The independence assumption is the most problematic in practice. Consider:

SELECT * FROM cars
WHERE make = 'Honda' AND model = 'Accord';

The optimizer estimates:

\[\text{sel} = \frac{1}{V(\text{make})} \times \frac{1}{V(\text{model})}\]

If there are 50 makes and 500 models, the estimated selectivity is \(\frac{1}{50} \times \frac{1}{500} = \frac{1}{25000}\), far too low — because make and model are highly correlated (only Honda makes the Accord).
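
A small simulation with made-up data shows how far the independence estimate drifts when one column determines the other:

```python
import random
random.seed(1)

# Synthetic cars table: the model fully determines the make, so the columns are correlated.
models_by_make = {"Honda": ["Accord", "Civic"], "Toyota": ["Camry", "Corolla"], "Ford": ["F-150"]}
rows = []
for _ in range(100_000):
    make = random.choice(list(models_by_make))
    rows.append((make, random.choice(models_by_make[make])))

v_make = len({make for make, _ in rows})
v_model = len({model for _, model in rows})
independence_estimate = (1 / v_make) * (1 / v_model)
actual = sum(row == ("Honda", "Accord") for row in rows) / len(rows)
print(f"independence estimate = {independence_estimate:.4f}   actual = {actual:.4f}")
# Independence predicts 1/15 ≈ 0.067; the true fraction is about 1/6 ≈ 0.167, 2.5x higher.
```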

Approaches to handling correlations:

| Approach | Idea | Limitation |
|---|---|---|
| Multi-column statistics | Maintain joint histograms or statistics on column combinations | Exponential number of possible combinations |
| Bayesian networks | Model dependencies between columns as a graphical model | Complex to build and maintain |
| Feedback-driven | Use actual runtime cardinalities to correct future estimates | Requires executing the query first |
| Learned cardinality estimators | Train ML models on query logs to predict cardinalities | Data drift, training cost, reliability |

No single approach has fully solved this problem. Modern systems typically use a combination: multi-column statistics for known correlations, and the independence assumption as a fallback.


Lec16 Takeaways

  • Statistics (histograms, sketches, samples) are the foundation of cost-based optimization — without accurate data summaries, the optimizer is flying blind
  • The cost model translates cardinality estimates into plan costs — PostgreSQL’s magic constants encode hardware-specific assumptions that administrators should tune
  • Cardinality estimation under the uniform/independence/containment assumptions works well for simple queries but breaks down for correlated attributes and multi-join queries
  • Estimation errors propagate multiplicatively through join chains — a 2x error per join becomes a 1000x error over 10 joins, leading to catastrophically wrong plan choices
  • The correlated attributes problem remains one of the hardest challenges in query optimization — multi-column stats, learned estimators, and feedback loops are active areas of research

Final Summary

| Topic | Key Idea |
|---|---|
| Planning Pipeline | SQL → AST → Logical Plan → Physical Plan; optimizer bridges logical to physical |
| Relational Algebra Equivalences | Selection decomposition, join commutativity/associativity — define the legal search space |
| Heuristic Optimization | Static rules (predicate pushdown, projection pushdown) — cheap and almost always beneficial |
| Expression Rewriting | Simplify/eliminate impossible or redundant predicates before search |
| Bottom-Up Search (System R) | Dynamic programming over relation subsets, \(O(2^n)\), left-deep trees, interesting orders |
| Top-Down Search (Cascades) | DFS with memo table, branch-and-bound pruning, supports bushy trees |
| Enforcers | Physical operators (e.g., Sort) inserted to guarantee required properties |
| Statistics | Histograms (equi-width, equi-depth, end-biased), sketches (Count-Min, HLL), sampling |
| Cost Model | Map cardinalities + operator types → numeric cost; physical (PostgreSQL) vs. logical |
| Cardinality Estimation | \(\text{sel}(P) \times N_R\); relies on uniform, independence, containment assumptions |
| Join Size Estimation | \(N_R \times N_S / \max(V(A,R), V(A,S))\) under containment |
| Error Propagation | Estimation errors compound multiplicatively through join chains |
| Correlated Attributes | Independence assumption fails; solutions include multi-column stats, learned models |

Key takeaway: Query optimization is a search problem over equivalent plans, guided by a cost model that depends on cardinality estimation — and cardinality estimation, despite decades of research, remains the weakest link in the chain. Getting statistics right matters more than having a sophisticated search algorithm.

References

  • P. Selinger et al., “Access Path Selection in a Relational Database Management System,” SIGMOD, 1979.
  • G. Graefe, “The Cascades Framework for Query Optimization,” IEEE Data Engineering Bulletin, 1995.
  • G. Graefe and W. McKenna, “The Volcano Optimizer Generator: Extensibility and Efficient Search,” ICDE, 1993.
  • V. Leis et al., “How Good Are Query Optimizers, Really?,” PVLDB, 2015.
  • PostgreSQL Documentation, “Planner Cost Constants,” postgresql.org.

This post is based on lecture materials from CMU 15-445/645 Database Systems by Andy Pavlo & Jignesh Patel (Lecture #15: Query Planning & Optimization I, Lecture #16: Query Planning & Optimization II).