15-645 Database Systems: Join Algorithms
Lecture 12, Join Algorithms: Nested Loop, Sort-Merge, Hash
0. Cost Model and Mental Model
The lecture studies binary inner equi-joins between
tables R and S.
We use:
- M: pages in R
- N: pages in S
- m: tuples in R
- n: tuples in S
- B: buffer pages available
- C: average cost of one index probe
The primary cost metric is I/O count.
We ignore:
- result output cost
- CPU cost
- network cost
For CPU-style reasoning, the useful baseline is:
- naive comparison work is usually measured in tuple pairs
- hashing is expected linear in tuples when the hash table behaves well
- sort-merge is dominated by sorting plus one merge scan, except in duplicate-heavy worst cases
1. Join Operator Output
For matching tuples r ∈ R and s ∈ S, the
operator must decide what to emit upward in the plan tree.
Early Materialization
- Copy needed attributes into a new output tuple immediately.
- Simpler for later operators.
- More copying.
Late Materialization
- Emit join keys plus record ids / tuple references.
- Better for column stores and wide tuples.
- Delays tuple reconstruction until later.
2. Why Join Choice Matters
R ⨝ S is far cheaper than building R × S
and filtering afterward.
There is no universally best join algorithm. The correct choice depends on:
- memory size
- existing indexes
- whether inputs are already sorted
- data distribution / skew
- which relation is smaller
In general, the smaller relation should be the outer or build side when the algorithm allows that choice.
3. Nested Loop Join Family
3.1 Naive Nested Loop Join
Algorithm:
    for each tuple r in R:
        for each tuple s in S:
            if r and s match on the join key:
                emit(r, s)
Cost:
\[ M + (m \cdot N) \]
Tuple-comparison complexity:
\[ \Theta(mn) \]
Why it is bad:
- for every tuple in R, scan all pages of S
- good only as the conceptual baseline
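The naive loop can be sketched in Python. This is a toy model, assuming tables are lists of dicts and `key` names the shared join column; all names here are illustrative, not from the lecture:

```python
def naive_nested_loop_join(R, S, key):
    """Naive nested loop join: rescan all of S for every tuple of R."""
    out = []
    for r in R:                       # full scan of the outer relation
        for s in S:                   # full rescan of the inner for every r
            if r[key] == s[key]:      # equi-join predicate
                out.append({**r, **s})
    return out
```

The double loop makes the Θ(mn) comparison count directly visible.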
3.2 Block Nested Loop Join
Improve the idea by loading a block of the outer relation into memory.
With B buffers:
- use B-2 buffers for the outer block
- use 1 buffer for the inner relation scan
- use 1 output buffer
Cost:
\[ M + \left\lceil \frac{M}{B-2} \right\rceil \cdot N \]
Tuple-comparison work is still:
\[ \Theta(mn) \]
The gain is not fewer comparisons; the gain is far fewer rescans of the inner relation on disk.
Key rules:
- choose the smaller relation by pages, not tuples, as outer
- if the outer fits in memory, cost approaches M + N
This is the first nested-loop variant that is often reasonable.
Feasibility shortcut:
\[ M \leq B-2 \implies \text{outer fits in memory} \]
and the I/O cost then becomes about:
\[ M + N \]
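Plugging illustrative numbers into the two cost formulas shows the gap. The values below are assumptions for the sake of the arithmetic, not figures from the lecture:

```python
import math

# Assumed sizes: R has M = 1000 pages with 100 tuples per page,
# S has N = 500 pages, and B = 102 buffer pages are available.
M, N, B = 1000, 500, 102
m = M * 100                                # tuples in R

naive_io = M + m * N                       # M + (m * N)
block_io = M + math.ceil(M / (B - 2)) * N  # M + ceil(M / (B-2)) * N

print(naive_io)  # 50001000
print(block_io)  # 6000
```

Same Θ(mn) comparisons in both cases, but the block variant rescans S only once per outer block instead of once per outer tuple.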
3.3 Index Nested Loop Join
If the inner table has a useful index on the join key:
    for each tuple r in R:
        for each matching tuple s found via the index on S:
            emit(r, s)
Cost:
\[ M + (m \cdot C) \]
CPU-style view:
- one outer scan
- one index probe per outer tuple
- so the dominant work is roughly m index probes rather than m·n tuple comparisons
This can be excellent when:
- the index already exists
- probes are selective / cheap
It can be terrible if the index probes are random and expensive.
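A minimal sketch of the probe loop, using a Python dict as a stand-in for an existing index on S's join key (the index structure and all names are assumptions for illustration):

```python
def index_nested_loop_join(R, S_index, key):
    """Index nested loop join: one index probe per outer tuple.

    S_index maps a join-key value to the list of matching S tuples,
    standing in for a real B+-tree or hash index.
    """
    out = []
    for r in R:                               # single scan of the outer
        for s in S_index.get(r[key], []):     # one index probe per r
            out.append({**r, **s})
    return out
```

The cost hinges entirely on how cheap `S_index.get` is; with a disk-resident index, each probe can be one or more random I/Os.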
4. Sort-Merge Join
Phase 1: Sort
- Sort both inputs on the join key.
- Any suitable sorting algorithm can be used, including external merge sort.
Phase 2: Merge
- Scan the sorted inputs together.
- Advance the side whose current key is lower.
- When keys match, emit all matching combinations.
Cost
If both sides must be sorted first:
\[ \text{Sort}(R) = 2M \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{M}{B} \right\rceil \right\rceil \right) \]
\[ \text{Sort}(S) = 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil \right) \]
\[ \text{Merge Cost} = M + N \]
Total cost:
\[ \text{Sort}(R) + \text{Sort}(S) + (M+N) \]
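A worked evaluation of these formulas, with illustrative sizes chosen for the arithmetic (not values from the lecture):

```python
import math

def sort_cost(pages, B):
    """External merge sort I/O cost: 2 * pages * (1 + ceil(log_{B-1}(ceil(pages/B)))).

    One initial run-creation pass plus the merge passes; every pass
    reads and writes each page, hence the factor of 2.
    """
    runs = math.ceil(pages / B)
    passes = 1 + (math.ceil(math.log(runs, B - 1)) if runs > 1 else 0)
    return 2 * pages * passes

# Assumed sizes: M = 1000 pages, N = 500 pages, B = 100 buffers.
M, N, B = 1000, 500, 100
total = sort_cost(M, B) + sort_cost(N, B) + (M + N)
print(sort_cost(M, B), sort_cost(N, B), total)  # 4000 2000 7500
```

With B = 100, each input sorts in a single merge pass after run creation, so each side costs just 4 passes' worth of pages overall before the final M + N merge scan.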
Expected CPU complexity, ignoring duplicate blow-up, is roughly:
\[ O(m \log m + n \log n + m + n) \]
When Sort-Merge Is Attractive
- one or both inputs are already sorted on the join key
- the output must also be sorted on that key
- sorting can be shared with another operator
Important Caveat
Worst case happens when many tuples share the same join key on both sides.
Then the merge phase can degrade toward:
\[ (M \cdot N) + \text{sort cost} \]
So duplicate-heavy joins can make sort-merge much less attractive than its clean scan pattern suggests.
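The merge phase, including the duplicate handling that drives the worst case, can be sketched as follows. This is a toy in-memory model: inputs are lists of (key, payload) pairs already sorted on key:

```python
def merge_join(R, S):
    """Merge phase of sort-merge join over sorted (key, payload) lists."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1                    # advance the side with the lower key
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            key = R[i][0]
            # Find the run of equal keys on each side.
            i2 = i
            while i2 < len(R) and R[i2][0] == key:
                i2 += 1
            j2 = j
            while j2 < len(S) and S[j2][0] == key:
                j2 += 1
            # Emit all matching combinations: this inner cross product
            # is exactly where duplicate-heavy inputs degrade toward O(mn).
            for _, rv in R[i:i2]:
                for _, sv in S[j:j2]:
                    out.append((key, rv, sv))
            i, j = i2, j2
    return out
```

When both runs of equal keys are long, the nested emit loop does quadratic work for that key, which is the degradation described above.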
5. Hash Join
Hash join relies on one observation:
if r and s join on the same key, then
hashing that key sends them to the same bucket / partition.
5.1 Simple Hash Join
Build Phase
- Scan the smaller relation.
- Insert tuples into an in-memory hash table using the join key.
- In practice, simple open-addressing schemes like linear probing work well.
Probe Phase
- Scan the other relation.
- Hash its join key.
- Probe the hash table for matches.
This is excellent when the build side fits in memory.
Memory rule of thumb:
\[ \min(M, N) \lesssim B-2 \]
ignoring hash-table overhead and output buffers.
Expected CPU complexity:
\[ O(m + n) \]
Worst case can still degrade badly with pathological collisions or extreme skew.
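The build/probe pattern in miniature, assuming in-memory inputs as lists of (key, payload) pairs (a sketch, not a buffer-managed implementation):

```python
def simple_hash_join(build, probe):
    """Simple hash join: build a table on the smaller input, probe with the other."""
    table = {}
    for k, v in build:                  # build phase: expected O(m)
        table.setdefault(k, []).append(v)
    out = []
    for k, v in probe:                  # probe phase: expected O(n)
        for bv in table.get(k, []):
            out.append((k, bv, v))
    return out
```

Passing the smaller relation as `build` is the in-code form of the "smaller side is the build side" rule.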
Probe Filter Optimization
While building the hash table, also build a Bloom filter:
- check the Bloom filter before probing the real hash table
- avoid many useless hash-table probes
- classic example of sideways information passing
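A toy version of the probe-filter idea; the bit-array size and hash count below are arbitrary assumptions, whereas real systems size them from expected key cardinality and a target false-positive rate:

```python
class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = 0                   # an int used as a bit array

    def _positions(self, key):
        # Derive nhashes bit positions by salting Python's hash.
        return [hash((i, key)) % self.nbits for i in range(self.nhashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits & (1 << p) for p in self._positions(key))
```

During the probe phase, a `might_contain` miss skips the hash-table lookup entirely; only "probably present" keys pay for the real probe.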
5.2 Partitioned Hash Join (GRACE Hash Join)
When the build side does not fit in memory, partition both relations first.
Partition Phase
- Hash R into k buckets on disk.
- Hash S into k buckets with the same function.
- Spill buckets as they fill.
Probe Phase
For each pair of corresponding partitions:
- build an in-memory hash table on the smaller partition
- scan the other partition and probe it
If no recursive repartitioning is needed, cost is:
\[ 3(M + N) \]
Why:
- partition phase reads and writes both relations: 2(M+N)
- probe phase reads both partitioned relations once: M+N
A useful fit condition for the build side is:
\[ \left\lceil \frac{\min(M, N)}{B-1} \right\rceil \leq B-2 \]
which says that after partitioning, each build partition should fit in memory.
Expected tuple work remains close to linear:
\[ O(m + n) \]
as long as partitioning is balanced and recursion is not excessive.
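The two phases can be sketched as follows, with in-memory lists standing in for the spilled disk partitions (inputs are (key, payload) pairs; the partition count `k` is an illustrative parameter):

```python
def grace_hash_join(R, S, k=4):
    """Partitioned (GRACE) hash join sketch.

    Phase 1: route both inputs into k partitions with the same hash function.
    Phase 2: hash-join each corresponding partition pair independently.
    """
    parts_R = [[] for _ in range(k)]
    parts_S = [[] for _ in range(k)]
    for key, v in R:                        # one pass over each input
        parts_R[hash(key) % k].append((key, v))
    for key, v in S:
        parts_S[hash(key) % k].append((key, v))

    out = []
    for pr, ps in zip(parts_R, parts_S):    # matching keys share a partition index
        table = {}
        for key, v in pr:                   # build on the R-side partition
            table.setdefault(key, []).append(v)
        for key, v in ps:                   # probe with the S-side partition
            for bv in table.get(key, []):
                out.append((key, bv, v))
    return out
```

Because both inputs use the same hash function, a matching pair can never land in different partitions, which is what makes the per-partition joins independent.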
5.3 Edge Cases and Optimizations
Recursive Partitioning
If a partition is still too large to fit in memory:
- repartition it with another hash function
- repeat until the build partition fits
Skew / Hot Keys
If one join key is so frequent that its partition still will not fit:
- use a fallback like block nested loop join just for that heavy key
- trade random I/O for sequential I/O
Hybrid Hash Join
If memory is larger than the minimum needed for partitioning:
- keep one partition in memory during the first phase
- join it immediately
- spill only the remaining partitions
This saves some I/O compared with plain partitioned hash join.
Practical Observation
The probe relation can be any size. What must fit is the build hash table, or at least one build partition at a time.
6. Selection Guide
| Situation | Good First Choice | Why |
|---|---|---|
| No index, little memory, baseline fallback | Block nested loop | simple and predictable |
| Inner side has useful index | Index nested loop | avoids full scans of inner relation |
| Inputs already sorted or output needs sorted | Sort-merge join | reuse ordering |
| Smaller side fits in memory | Simple hash join | usually very fast |
| Both sides large, no ordering needed | Partitioned hash join | strong I/O behavior |
7. Complexity Summary
| Algorithm | CPU / Tuple Complexity | I/O Cost |
|---|---|---|
| Naive nested loop | \Theta(mn) | M + (m \cdot N) |
| Block nested loop | \Theta(mn) | M + \left\lceil \frac{M}{B-2} \right\rceil \cdot N |
| Index nested loop | about m index probes | M + (m \cdot C) |
| Sort-merge join | O(m \log m + n \log n + m + n) expected | \text{Sort}(R) + \text{Sort}(S) + (M+N) |
| Sort-merge worst case | up to O(mn) duplicate handling | (M \cdot N) + \text{sort cost} |
| Simple hash join | expected O(m+n) | scan + build/probe; requires build side in memory |
| Partitioned hash join | expected O(m+n) | 3(M+N) if no recursive repartitioning |
8. Quick Recap
- Naive nested loop join is mostly a teaching baseline.
- Block nested loop join becomes practical once the outer relation can be buffered.
- Index nested loop join is only as good as the index probes.
- Sort-merge join is attractive when sorting is already useful elsewhere.
- Hash join is the default workhorse for large equi-joins, especially when ordering is unnecessary.