Lecture 11: Sorting & Aggregation Algorithms

0. Why Sorting Shows Up in a DBMS

SQL relations are logically unordered, but the DBMS still sorts all the time:

  • ORDER BY
  • DISTINCT
  • GROUP BY
  • window functions
  • bulk loading a B+Tree from sorted tuples

In a disk-oriented DBMS, the key constraint is that the input or output may not fit in memory. That is why the lecture focuses on spill-aware algorithms with mostly sequential I/O.

Throughout this note:

  • T = number of tuples
  • N = number of pages
  • B = number of buffer pages
  • G = number of groups in an aggregation

1. Sorting Model

Sorting works on (key, value) pairs:

  • Key: attributes that define the order
  • Value: either the full tuple or a record id / offset

Early vs. Late Materialization

  • Early materialization: sort (sort key, tuple) pairs; best when the tuple is small or later operators want the whole row.
  • Late materialization: sort (sort key, rid) pairs; best for column stores or wide tuples where copying is expensive.

In-Memory Complexity

For comparison-based in-memory sorting, the usual CPU complexity is:

\[ O(T \log T) \]

The lecture is mainly about what changes once the data no longer fits in memory: the asymptotic CPU story stays similar, but the dominant cost often becomes disk I/O.

2. External Merge Sort

If the data fits in memory, use your favorite in-memory sort. If not, use external merge sort.

Core Idea

External merge sort is divide-and-conquer:

  1. Split data into runs that fit in memory.
  2. Sort each run.
  3. Write runs to disk.
  4. Merge runs into larger sorted runs until one remains.

Pass 0: Initial Runs

With B buffer pages and input size N pages:

  • read up to B pages
  • sort them in memory
  • write one sorted run back out

That yields about \lceil N / B \rceil initial runs.

Merge Passes

During each merge pass:

  • use B-1 input buffers
  • use 1 output buffer
  • merge up to B-1 runs at a time
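The two phases above can be sketched in Python, with lists of pages standing in for disk files. This is a minimal illustration, not a real buffer-pool implementation: `pages` is a list of page-sized lists of tuples, and the sketch assumes B >= 3 so that at least two runs can be merged per pass.

```python
import heapq

def external_merge_sort(pages, B):
    # Pass 0: read up to B pages at a time, sort them in memory,
    # and emit one sorted run per group (lists stand in for disk runs).
    runs = []
    for i in range(0, len(pages), B):
        chunk = [t for page in pages[i:i + B] for t in page]
        runs.append(sorted(chunk))
    # Merge passes: with B-1 input buffers and 1 output buffer,
    # merge up to B-1 runs at a time until a single run remains.
    fan_in = B - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

`heapq.merge` plays the role of the B-1-way merge: it consumes one element at a time from each sorted run, which is exactly what the input buffers do in the real algorithm.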

Cost Formula

Number of passes:

\[ 1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil \]

Total I/O cost:

\[ 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil\right) \]

Reason: every pass reads and writes the entire file.

Ignoring low-level implementation details, comparison work is still roughly:

\[ O(T \log T) \]

but the formula that usually matters in a DBMS is the I/O cost above.
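To make the formula concrete, a few lines of Python can evaluate it for sample values (the function name is mine, not from the lecture):

```python
import math

def emsort_io_cost(N, B):
    # passes = 1 + ceil(log_{B-1}(ceil(N / B)));  total I/O = 2N * passes
    runs = math.ceil(N / B)
    passes = 1 + math.ceil(math.log(runs, B - 1))
    return passes, 2 * N * passes

# With N = 1000 pages and B = 5 buffers: 200 initial runs,
# passes = 1 + ceil(log_4 200) = 5, so 2 * 1000 * 5 = 10,000 page I/Os.
```

Note how weakly the pass count depends on N: the log base B-1 means that even a modest buffer pool flattens the number of passes quickly.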

Double Buffering

Double buffering overlaps CPU and I/O:

  • process one buffer while prefetching the next run
  • improves throughput
  • but effectively reduces usable merge buffers because some memory is reserved for prefetch/output

3. Sorting via B+Tree

If the table already has a B+Tree on the sort key, the DBMS may use the index instead of running a full external sort.

Clustered B+Tree

  • Scan the leaf pages in order.
  • Data pages are already aligned with index order.
  • Access is sequential.
  • This is usually better than external sorting.

Unclustered B+Tree

  • Leaf scan gives keys in order, but fetching tuples means chasing record pointers.
  • That often degenerates into nearly one I/O per record.
  • Usually a bad idea for full sorting.
  • Exception: Top-N queries, where only a small prefix is needed.

4. Top-N Heap Sort

For queries like:

SELECT * FROM enrolled
ORDER BY sid ASC
FETCH FIRST 10 ROWS WITH TIES;

the DBMS does not need to sort everything.

Core Idea

  • Scan the input once.
  • Maintain an in-memory heap / priority queue of the best N candidates.
  • Discard tuples that cannot enter the top N.

This is ideal when:

  • LIMIT is small
  • the top N fits in memory
  • you do not want the cost of a full sort

If the query uses WITH TIES, the final output may be slightly larger than N.

Complexity:

  • time: O(T \log N)
  • memory: O(N)
  • I/O: one full scan of the input, plus output
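The bounded-heap idea can be sketched as follows; this keeps the N smallest keys (matching the ORDER BY ... ASC query above) by maintaining a max-heap of the current candidates, implemented with negated keys on top of Python's min-heap:

```python
import heapq

def top_n_smallest(stream, n):
    # Max-heap (negated keys) holding the n smallest values seen so far.
    heap = []
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:
            # x beats the current worst candidate, so it replaces it.
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)
```

Every tuple is examined once, but only the heap (size N) ever lives in memory, which is why the technique needs just one scan plus O(N) space. WITH TIES handling is omitted here for brevity.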

5. Collations

Sorting strings is not just “compare bytes”.

A collation defines:

  • character ordering
  • equality behavior
  • case sensitivity
  • accent sensitivity
  • locale-specific rules

Binary Collation

  • Fastest option
  • Compares raw bytes / code points
  • Good for performance
  • But ASCII / binary order is not the same as human linguistic order

Locale-Aware Comparison

The lecture references ICU / UCA-style comparison:

  • Primary: base letters
  • Secondary: accents
  • Tertiary: case
  • Quaternary: punctuation / tie-breaking

Since multi-level comparison is expensive, systems often precompute sort keys.

Comparison Optimizations

  • Code specialization: compile/type-specialize comparisons instead of calling a generic comparator.
  • Prefix comparison for long strings: compare a cheap binary prefix first, fall back only on ties.
  • Key normalization: encode variable-length values into a comparable fixed-form representation.
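The prefix-comparison trick can be sketched as below. `full_cmp` is a placeholder for an expensive locale-aware comparator (e.g. ICU-style); the prefix length of 8 bytes is an arbitrary choice for illustration:

```python
def compare_with_prefix(a: bytes, b: bytes, full_cmp, prefix_len=8):
    # Cheap binary comparison on a short prefix first.
    pa, pb = a[:prefix_len], b[:prefix_len]
    if pa != pb:
        return -1 if pa < pb else 1
    # Fall back to the expensive collation-aware compare only on prefix ties.
    return full_cmp(a, b)
```

This only pays off when binary prefix order agrees with the collation order on the prefix, which is why real systems apply it to normalized sort keys rather than raw strings.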

6. Aggregations

An aggregation collapses many tuples into fewer values:

  • DISTINCT
  • GROUP BY
  • SUM / COUNT / MIN / MAX / AVG

The core systems problem is the same as sorting: how do we quickly bring identical group keys together?

7. Sort-Based Aggregation

Idea

  1. Sort tuples by the grouping key.
  2. Scan the sorted stream once.
  3. Maintain running state for the current group.

Running States

  • COUNT: count
  • SUM: sum
  • MIN: current min
  • MAX: current max
  • AVG: (count, sum)

When It Is Attractive

Sort-based aggregation is a good fit when:

  • data is already sorted
  • the query also needs ordered output
  • the same sort can be reused for multiple downstream operators

If sorting is required anyway, aggregate computation after sorting is just a linear pass:

\[ O(T) \]

So the full cost is roughly:

\[ \text{sort cost} + O(T) \]
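The linear pass over the sorted stream can be sketched as a SUM aggregate; since input is sorted by the grouping key, a group is finished the moment the key changes:

```python
def sorted_group_sum(sorted_rows):
    # sorted_rows: (group_key, value) pairs, already sorted by group_key.
    out = []
    cur_key, cur_sum = None, 0
    for key, val in sorted_rows:
        if key != cur_key:
            if cur_key is not None:
                out.append((cur_key, cur_sum))   # emit the finished group
            cur_key, cur_sum = key, 0
        cur_sum += val
    if cur_key is not None:
        out.append((cur_key, cur_sum))           # emit the last group
    return out
```

Only one group's running state is live at a time, which is the key advantage over hashing when memory is tight and the input is already ordered.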

8. Hash Aggregation

If we do not need ordered output, hashing is often better.

In-Memory Hash Aggregate

As the DBMS scans tuples:

  • hash the group key
  • look up the entry in an ephemeral hash table
  • create a new entry if none exists
  • otherwise update the running aggregate

This is great when the hash table fits in memory.

Expected complexity:

  • time: O(T)
  • memory: O(G) aggregate states

Worst-case time can degrade under pathological collisions, but the expected case is linear.
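A Python dict can stand in for the ephemeral hash table; this sketch computes AVG using the (count, sum) running state from the table in Section 7:

```python
def hash_avg(rows):
    # state maps group key -> (count, sum), the AVG running state.
    state = {}
    for key, val in rows:
        cnt, s = state.get(key, (0, 0))
        state[key] = (cnt + 1, s + val)
    # Finalize: divide sum by count per group.
    return {k: s / cnt for k, (cnt, s) in state.items()}
```

The table holds one entry per group, so memory is O(G) regardless of T, which is exactly the condition under which this approach stays in-memory.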

9. External Hash Aggregation

If the aggregation state does not fit in memory, use a two-phase external algorithm.

Phase 1: Partition

  • Use hash function h1 on the group key.
  • Split tuples into B-1 disk partitions.
  • Keep 1 input buffer and B-1 output buffers.

Phase 2: ReHash / Aggregate

For each partition:

  • read it back
  • build an in-memory hash table
  • compute aggregates inside that smaller partition

This works because tuples with the same group key are sent to the same partition.

If no recursive repartitioning is needed and we ignore final result output, a useful I/O approximation is:

\[ 3N \]

because:

  • partition phase reads and writes the input once: 2N
  • rehash/aggregate phase reads it once more: N
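The two phases can be sketched with a COUNT aggregate; Python lists stand in for the disk partitions, and the built-in `hash` plays the role of h1 (a real system would use two distinct hash functions for the partition and rehash phases):

```python
def external_hash_count(rows, num_partitions):
    # Phase 1: partition tuples by h1 on the group key
    # (one list per output buffer / disk partition).
    partitions = [[] for _ in range(num_partitions)]
    for key, val in rows:
        partitions[hash(key) % num_partitions].append((key, val))
    # Phase 2: rehash each partition with an in-memory table.
    result = {}
    for part in partitions:
        table = {}
        for key, _ in part:
            table[key] = table.get(key, 0) + 1
        result.update(table)   # partitions have disjoint key sets
    return result
```

Correctness rests on the property stated above: all tuples with the same group key land in the same partition, so each group is fully aggregated by exactly one phase-2 table.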

10. Choosing Sort vs. Hash

  • Need the final order: start with a sort, since ordering is part of the result.
  • ORDER BY ... LIMIT N: use a Top-N heap to avoid a full sort.
  • GROUP BY / DISTINCT with no order requirement: use a hash aggregate, which is cheaper than sorting in many cases.
  • Existing clustered index on the sort key: scan the B+Tree, since ordered access is already materialized.

11. Complexity Summary

  • In-memory comparison sort: O(T \log T) CPU; memory must hold the input.
  • External merge sort: roughly O(T \log T) comparisons; 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil N/B \right\rceil \right\rceil\right) I/Os.
  • Clustered B+Tree sort: linear scan after index traversal; about one sequential pass over the relevant pages.
  • Unclustered B+Tree sort: pointer chasing dominates; can approach one random fetch per tuple.
  • Top-N heap sort: O(T \log N) CPU; O(N) memory and one input scan.
  • Sort-based aggregation: sort cost plus O(T); good when order is also needed.
  • In-memory hash aggregation: expected O(T) CPU; O(G) memory.
  • External hash aggregation: expected O(T) work per phase; about 3N I/Os without recursive partitioning.

Quick Recap

  1. External merge sort is the baseline spill-aware sorting algorithm.
  2. B+Tree order can replace sorting, but only clustered access is consistently good.
  3. Top-N queries should avoid full sorting when possible.
  4. String sorting is expensive because collation rules are richer than byte order.
  5. Aggregation is fundamentally about grouping equal keys; sorting and hashing are the two main ways to do it.