Lecture 11: Sorting & Aggregation Algorithms

0. Why Sorting Shows Up in a DBMS

SQL relations are logically unordered, but the DBMS still sorts all the time:

  • ORDER BY
  • DISTINCT
  • GROUP BY
  • window functions
  • bulk loading a B+Tree from sorted tuples

In a disk-oriented DBMS, the key constraint is that the input or output may not fit in memory. That is why the lecture focuses on spill-aware algorithms with mostly sequential I/O.

Throughout this note:

  • T = number of tuples
  • N = number of pages
  • B = number of buffer pages
  • G = number of groups in an aggregation

1. Sorting Model

Sorting works on (key, value) pairs:

  • Key: attributes that define the order
  • Value: either the full tuple or a record id / offset

Early vs. Late Materialization

  • Early materialization: sort (sort key, tuple) pairs; best when the tuple is small or later operators want the whole row.
  • Late materialization: sort (sort key, rid) pairs; best for column stores or wide tuples where copying is expensive.

In-Memory Complexity

For comparison-based in-memory sorting, the usual CPU complexity is:

\[ O(T \log T) \]

The lecture is mainly about what changes once the data no longer fits in memory: the asymptotic CPU story stays similar, but the dominant cost often becomes disk I/O.

2. External Merge Sort

If the data fits in memory, use your favorite in-memory sort. If not, use external merge sort.

Core Idea

External merge sort is divide-and-conquer:

  1. Split data into runs that fit in memory.
  2. Sort each run.
  3. Write runs to disk.
  4. Merge runs into larger sorted runs until one remains.

Pass 0: Initial Runs

With B buffer pages and input size N pages:

  • read up to B pages
  • sort them in memory
  • write one sorted run back out

That yields about \lceil N / B \rceil initial runs.

Merge Passes

During each merge pass:

  • use B-1 input buffers
  • use 1 output buffer
  • merge up to B-1 runs at a time
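The two phases above can be sketched in Python, with lists of pages standing in for disk files. This is a minimal illustration, not a real buffer-pool implementation: `pages` is a list of page-sized lists of tuples, and the sketch assumes B >= 3 so that at least two runs can be merged per pass.

```python
import heapq

def external_merge_sort(pages, B):
    # Pass 0: read up to B pages at a time, sort them in memory,
    # and emit one sorted run per group (lists stand in for disk runs).
    runs = []
    for i in range(0, len(pages), B):
        chunk = [t for page in pages[i:i + B] for t in page]
        runs.append(sorted(chunk))
    # Merge passes: with B-1 input buffers and 1 output buffer,
    # merge up to B-1 runs at a time until a single run remains.
    fan_in = B - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []
```

`heapq.merge` plays the role of the B-1-way merge: it consumes one element at a time from each sorted run, which is exactly what the input buffers do in the real algorithm.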

Cost Formula

Number of passes:

\[ 1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil \]

Total I/O cost:

\[ 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil \frac{N}{B} \right\rceil \right\rceil\right) \]

Reason: every pass reads and writes the entire file.

Ignoring low-level implementation details, comparison work is still roughly:

\[ O(T \log T) \]

but the formula that usually matters in a DBMS is the I/O cost above.
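To make the formula concrete, a few lines of Python can evaluate it for sample values (the function name is mine, not from the lecture):

```python
import math

def emsort_io_cost(N, B):
    # passes = 1 + ceil(log_{B-1}(ceil(N / B)));  total I/O = 2N * passes
    runs = math.ceil(N / B)
    passes = 1 + math.ceil(math.log(runs, B - 1))
    return passes, 2 * N * passes

# With N = 1000 pages and B = 5 buffers: 200 initial runs,
# passes = 1 + ceil(log_4 200) = 5, so 2 * 1000 * 5 = 10,000 page I/Os.
```

Note how weakly the pass count depends on N: the log base B-1 means that even a modest buffer pool flattens the number of passes quickly.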

Double Buffering

Double buffering overlaps CPU and I/O:

  • process one buffer while prefetching the next run
  • improves throughput
  • but effectively reduces usable merge buffers because some memory is reserved for prefetch/output

3. Sorting via B+Tree

If the table already has a B+Tree on the sort key, the DBMS may use the index instead of running a full external sort.

Clustered B+Tree

  • Scan the leaf pages in order.
  • Data pages are already aligned with index order.
  • Access is sequential.
  • This is usually better than external sorting.

Unclustered B+Tree

  • Leaf scan gives keys in order, but fetching tuples means chasing record pointers.
  • That often degenerates into nearly one I/O per record.
  • Usually a bad idea for full sorting.
  • Exception: Top-N queries, where only a small prefix is needed.

4. Top-N Heap Sort

For queries like:

SELECT * FROM enrolled
ORDER BY sid ASC
FETCH FIRST 10 ROWS WITH TIES;

the DBMS does not need to sort everything.

Core Idea

  • Scan the input once.
  • Maintain an in-memory heap / priority queue of the best N candidates.
  • Discard tuples that cannot enter the top N.

This is ideal when:

  • LIMIT is small
  • the top N fits in memory
  • you do not want the cost of a full sort

If the query uses WITH TIES, the final output may be slightly larger than N.

Complexity:

  • time: O(T \log N)
  • memory: O(N)
  • I/O: one full scan of the input, plus output
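The bounded-heap idea can be sketched as follows; this keeps the N smallest keys (matching the ORDER BY ... ASC query above) by maintaining a max-heap of the current candidates, implemented with negated keys on top of Python's min-heap:

```python
import heapq

def top_n_smallest(stream, n):
    # Max-heap (negated keys) holding the n smallest values seen so far.
    heap = []
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:
            # x beats the current worst candidate, so it replaces it.
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)
```

Every tuple is examined once, but only the heap (size N) ever lives in memory, which is why the technique needs just one scan plus O(N) space. WITH TIES handling is omitted here for brevity.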

5. Collations

Sorting strings is not just “compare bytes”.

A collation defines:

  • character ordering
  • equality behavior
  • case sensitivity
  • accent sensitivity
  • locale-specific rules

Binary Collation

  • Fastest option
  • Compares raw bytes / code points
  • Good for performance
  • But ASCII / binary order is not the same as human linguistic order

Locale-Aware Comparison

The lecture references ICU / UCA-style comparison:

  • Primary: base letters
  • Secondary: accents
  • Tertiary: case
  • Quaternary: punctuation / tie-breaking

Since multi-level comparison is expensive, systems often precompute sort keys.

Comparison Optimizations

  • Code specialization: compile/type-specialize comparisons instead of calling a generic comparator.
  • Prefix comparison for long strings: compare a cheap binary prefix first, fall back only on ties.
  • Key normalization: encode variable-length values into a comparable fixed-form representation.
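The prefix-comparison trick can be sketched as below. `full_cmp` is a placeholder for an expensive locale-aware comparator (e.g. ICU-style); the prefix length of 8 bytes is an arbitrary choice for illustration:

```python
def compare_with_prefix(a: bytes, b: bytes, full_cmp, prefix_len=8):
    # Cheap binary comparison on a short prefix first.
    pa, pb = a[:prefix_len], b[:prefix_len]
    if pa != pb:
        return -1 if pa < pb else 1
    # Fall back to the expensive collation-aware compare only on prefix ties.
    return full_cmp(a, b)
```

This only pays off when binary prefix order agrees with the collation order on the prefix, which is why real systems apply it to normalized sort keys rather than raw strings.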

6. Aggregations

An aggregation collapses many tuples into fewer values:

  • DISTINCT
  • GROUP BY
  • SUM / COUNT / MIN / MAX / AVG

The core systems problem is the same as sorting: how do we quickly bring identical group keys together?

7. Sort-Based Aggregation

Idea

  1. Sort tuples by the grouping key.
  2. Scan the sorted stream once.
  3. Maintain running state for the current group.

Running States

  • COUNT: count
  • SUM: sum
  • MIN: current min
  • MAX: current max
  • AVG: (count, sum)

When It Is Attractive

Sort-based aggregation is a good fit when:

  • data is already sorted
  • the query also needs ordered output
  • the same sort can be reused for multiple downstream operators

If sorting is required anyway, aggregate computation after sorting is just a linear pass:

\[ O(T) \]

So the full cost is roughly:

\[ \text{sort cost} + O(T) \]
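The linear pass over the sorted stream can be sketched as a SUM aggregate; since input is sorted by the grouping key, a group is finished the moment the key changes:

```python
def sorted_group_sum(sorted_rows):
    # sorted_rows: (group_key, value) pairs, already sorted by group_key.
    out = []
    cur_key, cur_sum = None, 0
    for key, val in sorted_rows:
        if key != cur_key:
            if cur_key is not None:
                out.append((cur_key, cur_sum))   # emit the finished group
            cur_key, cur_sum = key, 0
        cur_sum += val
    if cur_key is not None:
        out.append((cur_key, cur_sum))           # emit the last group
    return out
```

Only one group's running state is live at a time, which is the key advantage over hashing when memory is tight and the input is already ordered.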

8. Hash Aggregation

If we do not need ordered output, hashing is often better.

In-Memory Hash Aggregate

As the DBMS scans tuples:

  • hash the group key
  • look up the entry in an ephemeral hash table
  • create a new entry if none exists
  • otherwise update the running aggregate

This is great when the hash table fits in memory.

Expected complexity:

  • time: O(T)
  • memory: O(G) aggregate states

Worst-case time can degrade under pathological collisions, but the expected case is linear.
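A Python dict can stand in for the ephemeral hash table; this sketch computes AVG using the (count, sum) running state from the table in Section 7:

```python
def hash_avg(rows):
    # state maps group key -> (count, sum), the AVG running state.
    state = {}
    for key, val in rows:
        cnt, s = state.get(key, (0, 0))
        state[key] = (cnt + 1, s + val)
    # Finalize: divide sum by count per group.
    return {k: s / cnt for k, (cnt, s) in state.items()}
```

The table holds one entry per group, so memory is O(G) regardless of T, which is exactly the condition under which this approach stays in-memory.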

9. External Hash Aggregation

If the aggregation state does not fit in memory, use a two-phase external algorithm.

Phase 1: Partition

  • Use hash function h1 on the group key.
  • Split tuples into B-1 disk partitions.
  • Keep 1 input buffer and B-1 output buffers.

Phase 2: ReHash / Aggregate

For each partition:

  • read it back
  • build an in-memory hash table
  • compute aggregates inside that smaller partition

This works because tuples with the same group key are sent to the same partition.

If no recursive repartitioning is needed and we ignore final result output, a useful I/O approximation is:

\[ 3N \]

because:

  • partition phase reads and writes the input once: 2N
  • rehash/aggregate phase reads it once more: N
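The two phases can be sketched with a COUNT aggregate; Python lists stand in for the disk partitions, and the built-in `hash` plays the role of h1 (a real system would use two distinct hash functions for the partition and rehash phases):

```python
def external_hash_count(rows, num_partitions):
    # Phase 1: partition tuples by h1 on the group key
    # (one list per output buffer / disk partition).
    partitions = [[] for _ in range(num_partitions)]
    for key, val in rows:
        partitions[hash(key) % num_partitions].append((key, val))
    # Phase 2: rehash each partition with an in-memory table.
    result = {}
    for part in partitions:
        table = {}
        for key, _ in part:
            table[key] = table.get(key, 0) + 1
        result.update(table)   # partitions have disjoint key sets
    return result
```

Correctness rests on the property stated above: all tuples with the same group key land in the same partition, so each group is fully aggregated by exactly one phase-2 table.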

10. Choosing Sort vs. Hash

  • Need the final order: start with a sort, since ordering is part of the result.
  • ORDER BY ... LIMIT N: use a Top-N heap to avoid a full sort.
  • GROUP BY / DISTINCT with no order requirement: use a hash aggregate, which is cheaper than sorting in many cases.
  • Existing clustered index on the sort key: scan the B+Tree, since ordered access is already materialized.

11. Complexity Summary

  • In-memory comparison sort: O(T \log T) CPU; memory must hold the input.
  • External merge sort: roughly O(T \log T) comparisons; 2N \cdot \left(1 + \left\lceil \log_{B-1}\left\lceil N/B \right\rceil \right\rceil\right) I/Os.
  • Clustered B+Tree sort: linear scan after index traversal; about one sequential pass over the relevant pages.
  • Unclustered B+Tree sort: pointer chasing dominates; can approach one random fetch per tuple.
  • Top-N heap sort: O(T \log N) CPU; O(N) memory and one input scan.
  • Sort-based aggregation: sort cost plus O(T); good when order is also needed.
  • In-memory hash aggregation: expected O(T) CPU; O(G) memory.
  • External hash aggregation: expected O(T) work per phase; about 3N I/Os without recursive partitioning.

Quick Recap

  1. External merge sort is the baseline spill-aware sorting algorithm.
  2. B+Tree order can replace sorting, but only clustered access is consistently good.
  3. Top-N queries should avoid full sorting when possible.
  4. String sorting is expensive because collation rules are richer than byte order.
  5. Aggregation is fundamentally about grouping equal keys; sorting and hashing are the two main ways to do it.