15-645 Database Systems: Join Algorithms
Lecture 12, Join Algorithms: Nested Loop, Sort-Merge, Hash
0. Cost Model and Mental Model
The lecture studies binary inner equi-joins between
tables R and S.
We use:
- M: pages in R
- N: pages in S
- m: tuples in R
- n: tuples in S
- B: buffer pages available
- C: average cost of one index probe
The primary cost metric is I/O count.
We ignore:
- result output cost
- CPU cost
- network cost
For CPU-style reasoning, the useful baseline is:
- naive comparison work is usually measured in tuple pairs
- hashing is expected linear in tuples when the hash table behaves well
- sort-merge is dominated by sorting plus one merge scan, except in duplicate-heavy worst cases
1. Join Operator Output
For matching tuples r ∈ R and s ∈ S, the
operator must decide what to emit upward in the plan tree.
Early Materialization
- Copy needed attributes into a new output tuple immediately.
- Simpler for later operators.
- More copying.
Late Materialization
- Emit join keys plus record ids / tuple references.
- Better for column stores and wide tuples.
- Delays tuple reconstruction until later.
2. Why Join Choice Matters
R ⨝ S is far cheaper than building R × S
and filtering afterward.
There is no universally best join algorithm. The correct choice depends on:
- memory size
- existing indexes
- whether inputs are already sorted
- data distribution / skew
- which relation is smaller
In general, the smaller relation should be the outer or build side when the algorithm allows that choice.
3. Nested Loop Join Family
3.1 Naive Nested Loop Join
Algorithm:
    for each tuple r in R:
        for each tuple s in S:
            if r and s match on the join key:
                emit(r, s)
Cost:
\[ M + (m \cdot N) \]
Tuple-comparison complexity:
\[ \Theta(mn) \]
Why it is bad:
- for every tuple in R, scan all pages of S
- good only as the conceptual baseline
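The naive loop can be sketched in Python. This is a toy model, assuming tables are lists of dicts and `key` names the shared join column; all names here are illustrative, not from the lecture:

```python
def naive_nested_loop_join(R, S, key):
    """Naive nested loop join: rescan all of S for every tuple of R."""
    out = []
    for r in R:                       # full scan of the outer relation
        for s in S:                   # full rescan of the inner for every r
            if r[key] == s[key]:      # equi-join predicate
                out.append({**r, **s})
    return out
```

The double loop makes the Θ(mn) comparison count directly visible.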
3.2 Block Nested Loop Join
Improve the idea by loading a block of the outer relation into memory.
With B buffers:
- use B-2 buffers for the outer block
- use 1 buffer for the inner relation scan
- use 1 output buffer
Cost:
\[ M + \left\lceil \frac{M}{B-2} \right\rceil \cdot N \]
Tuple-comparison work is still:
\[ \Theta(mn) \]
The gain is not fewer comparisons; the gain is far fewer rescans of the inner relation on disk.
Key rules:
- choose the smaller relation by pages, not tuples, as outer
- if the outer fits in memory, cost approaches M + N
This is the first nested-loop variant that is often reasonable.
Feasibility shortcut:
\[ M \leq B-2 \implies \text{outer fits in memory} \]
and the I/O cost then becomes about:
\[ M + N \]
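Plugging illustrative numbers into the two cost formulas shows the gap. The values below are assumptions for the sake of the arithmetic, not figures from the lecture:

```python
import math

# Assumed sizes: R has M = 1000 pages with 100 tuples per page,
# S has N = 500 pages, and B = 102 buffer pages are available.
M, N, B = 1000, 500, 102
m = M * 100                                # tuples in R

naive_io = M + m * N                       # M + (m * N)
block_io = M + math.ceil(M / (B - 2)) * N  # M + ceil(M / (B-2)) * N

print(naive_io)  # 50001000
print(block_io)  # 6000
```

Same Θ(mn) comparisons in both cases, but the block variant rescans S only once per outer block instead of once per outer tuple.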
3.3 Index Nested Loop Join
If the inner table has a useful index on the join key:
    for each tuple r in R:
        for each matching tuple s found via the index on S:
            emit(r, s)
Cost:
\[ M + (m \cdot C) \]
CPU-style view:
- one outer scan
- one index probe per outer tuple
- so the dominant work is roughly m index probes rather than m·n tuple comparisons
This can be excellent when:
- the index already exists
- probes are selective / cheap
It can be terrible if the index probes are random and expensive.
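A minimal sketch of the probe loop, using a Python dict as a stand-in for an existing index on S's join key (the index structure and all names are assumptions for illustration):

```python
def index_nested_loop_join(R, S_index, key):
    """Index nested loop join: one index probe per outer tuple.

    S_index maps a join-key value to the list of matching S tuples,
    standing in for a real B+-tree or hash index.
    """
    out = []
    for r in R:                               # single scan of the outer
        for s in S_index.get(r[key], []):     # one index probe per r
            out.append({**r, **s})
    return out
```

The cost hinges entirely on how cheap `S_index.get` is; with a disk-resident index, each probe can be one or more random I/Os.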
4. Sort-Merge Join
Phase 1: Sort
- Sort both inputs on the join key.
- Any suitable sorting algorithm can be used, including external merge sort.
Phase 2: Merge
- Scan the sorted inputs together.
- Advance the side whose current key is lower.
- When keys match, emit all matching combinations.
Cost
If both sides must be sorted first:
\[ \text{Sort}(R) = 2M \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{M}{B} \right\rceil \right\rceil \right) \]
\[ \text{Sort}(S) = 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil \right) \]
\[ \text{Merge Cost} = M + N \]
Total cost:
\[ \text{Sort}(R) + \text{Sort}(S) + (M+N) \]
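A worked evaluation of these formulas, with illustrative sizes chosen for the arithmetic (not values from the lecture):

```python
import math

def sort_cost(pages, B):
    """External merge sort I/O cost: 2 * pages * (1 + ceil(log_{B-1}(ceil(pages/B)))).

    One initial run-creation pass plus the merge passes; every pass
    reads and writes each page, hence the factor of 2.
    """
    runs = math.ceil(pages / B)
    passes = 1 + (math.ceil(math.log(runs, B - 1)) if runs > 1 else 0)
    return 2 * pages * passes

# Assumed sizes: M = 1000 pages, N = 500 pages, B = 100 buffers.
M, N, B = 1000, 500, 100
total = sort_cost(M, B) + sort_cost(N, B) + (M + N)
print(sort_cost(M, B), sort_cost(N, B), total)  # 4000 2000 7500
```

With B = 100, each input sorts in a single merge pass after run creation, so each side costs just 4 passes' worth of pages overall before the final M + N merge scan.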
Expected CPU complexity, ignoring duplicate blow-up, is roughly:
\[ O(m \log m + n \log n + m + n) \]
When Sort-Merge Is Attractive
- one or both inputs are already sorted on the join key
- the output must also be sorted on that key
- sorting can be shared with another operator
Important Caveat
Worst case happens when many tuples share the same join key on both sides.
Then the merge phase can degrade toward:
\[ (M \cdot N) + \text{sort cost} \]
So duplicate-heavy joins can make sort-merge much less attractive than its clean scan pattern suggests.
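The merge phase, including the duplicate handling that drives the worst case, can be sketched as follows. This is a toy in-memory model: inputs are lists of (key, payload) pairs already sorted on key:

```python
def merge_join(R, S):
    """Merge phase of sort-merge join over sorted (key, payload) lists."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1                    # advance the side with the lower key
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key = R[i][0]
            # Find the run of equal keys on each side.
            i2 = i
            while i2 < len(R) and R[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(S) and S[j2][0] == key:
                j2 += 1
            # Emit all matching combinations: this inner cross product
            # is exactly where duplicate-heavy inputs degrade toward O(mn).
            for _, rv in R[i:i2]:
                for _, sv in S[j:j2]:
                    out.append((key, rv, sv))
            i, j = i2, j2
    return out
```

When both runs of equal keys are long, the nested emit loop does quadratic work for that key, which is the degradation described above.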
5. Hash Join
Hash join relies on one observation:
if r and s join on the same key, then
hashing that key sends them to the same bucket / partition.
5.1 Simple Hash Join
Build Phase
- Scan the smaller relation.
- Insert tuples into an in-memory hash table using the join key.
- In practice, simple open-addressing schemes like linear probing work well.
Probe Phase
- Scan the other relation.
- Hash its join key.
- Probe the hash table for matches.
This is excellent when the build side fits in memory.
Memory rule of thumb:
\[ \min(M, N) \lesssim B-2 \]
ignoring hash-table overhead and output buffers.
Expected CPU complexity:
\[ O(m + n) \]
Worst case can still degrade badly with pathological collisions or extreme skew.
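The build/probe pattern in miniature, assuming in-memory inputs as lists of (key, payload) pairs (a sketch, not a buffer-managed implementation):

```python
def simple_hash_join(build, probe):
    """Simple hash join: build a table on the smaller input, probe with the other."""
    table = {}
    for k, v in build:                  # build phase: expected O(m)
        table.setdefault(k, []).append(v)
    out = []
    for k, v in probe:                  # probe phase: expected O(n)
        for bv in table.get(k, []):
            out.append((k, bv, v))
    return out
```

Passing the smaller relation as `build` is the in-code form of the "smaller side is the build side" rule.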
Probe Filter Optimization
While building the hash table, also build a Bloom filter:
- check the Bloom filter before probing the real hash table
- avoid many useless hash-table probes
- classic example of sideways information passing
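A toy version of the probe-filter idea; the bit-array size and hash count below are arbitrary assumptions, whereas real systems size them from expected key cardinality and a target false-positive rate:

```python
class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0                   # an int used as a bit array

    def _positions(self, key):
        # Derive nhashes bit positions by salting Python's hash.
        return [hash((i, key)) % self.nbits for i in range(self.nhashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits & (1 << p) for p in self._positions(key))
```

During the probe phase, a `might_contain` miss skips the hash-table lookup entirely; only "probably present" keys pay for the real probe.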
5.2 Partitioned Hash Join (GRACE Hash Join)
When the build side does not fit in memory, partition both relations first.
Partition Phase
- Hash R into k buckets on disk.
- Hash S into k buckets with the same function.
- Spill buckets as they fill.
Probe Phase
For each pair of corresponding partitions:
- build an in-memory hash table on the smaller partition
- scan the other partition and probe it
If no recursive repartitioning is needed, cost is:
\[ 3(M + N) \]
Why:
- partition phase reads and writes both relations: 2(M+N)
- probe phase reads both partitioned relations once: M+N
A useful fit condition for the build side is:
\[ \left\lceil \frac{\min(M, N)}{B-1} \right\rceil \leq B-2 \]
which says that after partitioning, each build partition should fit in memory.
Expected tuple work remains close to linear:
\[ O(m + n) \]
as long as partitioning is balanced and recursion is not excessive.
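The two phases can be sketched as follows, with in-memory lists standing in for the spilled disk partitions (inputs are (key, payload) pairs; the partition count `k` is an illustrative parameter):

```python
def grace_hash_join(R, S, k=4):
    """Partitioned (GRACE) hash join sketch.

    Phase 1: route both inputs into k partitions with the same hash function.
    Phase 2: hash-join each corresponding partition pair independently.
    """
    parts_R = [[] for _ in range(k)]
    parts_S = [[] for _ in range(k)]
    for key, v in R:                        # one pass over each input
        parts_R[hash(key) % k].append((key, v))
    for key, v in S:
        parts_S[hash(key) % k].append((key, v))

    out = []
    for pr, ps in zip(parts_R, parts_S):    # matching keys share a partition index
        table = {}
        for key, v in pr:                   # build on the R-side partition
            table.setdefault(key, []).append(v)
        for key, v in ps:                   # probe with the S-side partition
            for bv in table.get(key, []):
                out.append((key, bv, v))
    return out
```

Because both inputs use the same hash function, a matching pair can never land in different partitions, which is what makes the per-partition joins independent.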
5.3 Edge Cases and Optimizations
Recursive Partitioning
If a partition is still too large to fit in memory:
- repartition it with another hash function
- repeat until the build partition fits
Skew / Hot Keys
If one join key is so frequent that its partition still will not fit:
- use a fallback like block nested loop join just for that heavy key
- trade random I/O for sequential I/O
Hybrid Hash Join
If memory is larger than the minimum needed for partitioning:
- keep one partition in memory during the first phase
- join it immediately
- spill only the remaining partitions
This saves some I/O compared with plain partitioned hash join.
Practical Observation
The probe relation can be any size. What must fit is the build hash table, or at least one build partition at a time.
6. Selection Guide
| Situation | Good First Choice | Why |
|---|---|---|
| No index, little memory, baseline fallback | Block nested loop | simple and predictable |
| Inner side has useful index | Index nested loop | avoids full scans of inner relation |
| Inputs already sorted or output needs sorted | Sort-merge join | reuse ordering |
| Smaller side fits in memory | Simple hash join | usually very fast |
| Both sides large, no ordering needed | Partitioned hash join | strong I/O behavior |
7. Complexity Summary
| Algorithm | CPU / Tuple Complexity | I/O Cost |
|---|---|---|
| Naive nested loop | \Theta(mn) | M + (m \cdot N) |
| Block nested loop | \Theta(mn) | M + \left\lceil \frac{M}{B-2} \right\rceil \cdot N |
| Index nested loop | about m index probes | M + (m \cdot C) |
| Sort-merge join | O(m \log m + n \log n + m + n) expected | \text{Sort}(R) + \text{Sort}(S) + (M+N) |
| Sort-merge worst case | up to O(mn) duplicate handling | (M \cdot N) + \text{sort cost} |
| Simple hash join | expected O(m+n) | scan + build/probe; requires build side in memory |
| Partitioned hash join | expected O(m+n) | 3(M+N) if no recursive repartitioning |
8. Quick Recap
- Naive nested loop join is mostly a teaching baseline.
- Block nested loop join becomes practical once the outer relation can be buffered.
- Index nested loop join is only as good as the index probes.
- Sort-merge join is attractive when sorting is already useful elsewhere.
- Hash join is the default workhorse for large equi-joins, especially when ordering is unnecessary.