Lec3 Database Storage: Files & Pages

Storage Hierarchy

image-20260201113518099

image-20260201113835345

Heap File

A heap file is an unordered collection of pages with tuples that are stored in random order.

Need additional meta-data to track location of files and free space availability.

image-20260201130006343

Page layout

Every page contains a header of metadata about the page’s contents.

Page Size; Checksum; DBMS Version; Transaction Visibility; Compression / Encoding Meta-data; Schema Information; Data Summary / Sketches

Approach:

  • Approach #1: Tuple-oriented Storage

  • Approach #2: Log-structured Storage

  • Approach #3: Index-organized Storage

Tuple-oriented Storage

The most common layout scheme is called slotted pages.

image-20260201150241807

RECORD IDS

The DBMS assigns each logical tuple a unique record identifier that represents its physical location in the database.

→ Example: File Id, Page Id, Slot #

Tuple layout

A tuple is essentially a sequence of bytes prefixed with a header that contains meta-data about it.

Tuple header

Each tuple is prefixed with a header that contains meta-data about it.

→ Visibility info (concurrency control)

→ Bit Map for NULL values.

We do not need to store meta-data about the schema

Tuple data

Attributes are typically stored in the order that you specify them when you create the table. However, it might be more efficient to lay them out differently.

Lec4 Buffer Pool Memory Management

This lecture is to solve how the DBMS manages its memory and move data back-and-forth from disk.

Disk-Oriented DBMS Model

Modern DBMSs are disk-oriented: disk is the source of truth, while memory is used as a performance optimization.

Two dimensions govern data movement:

  • Spatial control: where pages are placed on disk, aiming to keep frequently co-accessed pages physically close.
  • Temporal control: when pages are brought into memory and when modified pages are written back to disk.

The buffer pool is the core mechanism that implements temporal control.

Buffer Pool Metadata

To manage cached pages, the DBMS maintains in-memory metadata:

Page Table

  • Maps page_id → frame_id
  • Typically implemented as a hash table
  • Protected by latches for thread safety
  • Not persisted to disk

Per-Page Metadata

Each cached page tracks:

  • Dirty flag: whether the page has been modified
  • Pin (reference) count: number of active users; pinned pages cannot be evicted
  • Access history: used by replacement policies

Locks vs. Latches

The buffer pool uses latches, not locks.

  • Locks protect logical database contents and are held for transaction duration
  • Latches protect internal DBMS data structures and are held briefly

Latches do not require rollback support and are closer to mutexes.

Page Directory vs. Page Table

These two structures are often confused:

  • Page directory: maps pages to disk locations, persisted on disk
  • Page table: maps pages to memory frames, exists only in memory

The directory enables recovery after restart; the page table does not.

Buffer Replacement Policies

When the buffer pool is full and a new page must be loaded, the DBMS must evict an existing page. Replacement policies aim to balance correctness, accuracy, and low overhead.

LRU and CLOCK

  • Evict pages that were accessed least recently
  • CLOCK approximates LRU using a reference bit

Both are vulnerable to sequential flooding, where large scans evict useful pages that will soon be reused.

LRU-K

LRU-K considers the K-th most recent access rather than just the last one.

  • Pages accessed fewer than K times are treated as low-value
  • Helps distinguish one-time scans from truly hot pages
  • Requires more metadata and higher implementation complexity

ARC

Adaptive Replacement Cache balances recency and frequency dynamically.

  • Maintains separate lists for recent and frequent pages
  • Uses ghost lists to learn from recent evictions
  • Adjusts behavior automatically based on workload

ARC is more complex but adapts well to mixed workloads.

Lec5 Log-Structured Storage

Traditional page-oriented storage assumes that data can be overwritten in place. However, this assumption breaks down on modern storage systems where random writes are expensive or outright unsupported.

Log-structured storage takes a different approach: instead of modifying tuples inside pages, the DBMS records all changes as log records.

Log Records and In-Memory Updates

Each modification to a tuple is represented as a log record:

  • PUT(key, value) inserts or updates a tuple
  • DELETE(key) marks a tuple as deleted

The DBMS does not immediately write these records to disk. Instead, updates are first applied to an in-memory data structure called the MemTable.

The MemTable maintains keys in sorted order and always reflects the most recent state of the database.

Flushing the MemTable to Disk

When the MemTable becomes full, the DBMS flushes its contents to disk as a Sorted String Table (SSTable).

Key properties of an SSTable:

  • Stored as a new file on disk
  • Contains records sorted by key (low → high)
  • Immutable once written

This flushed SSTable is placed into Level 0.

At this stage, multiple SSTables may exist, ordered from newest to oldest, and their key ranges may overlap.

Multi-Level Organization

As more SSTables are generated, the DBMS organizes them into multiple levels:

  • Level 0
    • SSTables flushed directly from memory
    • Key ranges may overlap
    • Ordered by creation time (newest → oldest)
  • Lower levels (Level 1, Level 2, …)
    • SSTables have non-overlapping key ranges
    • Data is progressively older and more compact

This hierarchy allows the DBMS to absorb writes quickly at the top while gradually organizing data in the background.

Reading Data (GET)

To process a GET(key) request, the DBMS searches for the most recent version of the key:

  1. MemTable
    • Checked first, since it contains the newest updates
  2. SummaryTable
    • Stores metadata such as:
      • Min/max key per SSTable
      • Key filters per level
    • Used to eliminate SSTables that cannot contain the key
  3. Disk Levels
    • Level 0 is searched from newest to oldest SSTable
    • Lower levels are searched using key range constraints

The search stops as soon as the key is found.

image-20260201200533436

Log-Structured Compaction

Over time, SSTables accumulate outdated versions and deleted entries. To reclaim space and improve read performance, the DBMS periodically performs compaction.

Compaction:

  • Merges multiple SSTables using a sort-merge algorithm
  • Retains only the latest version of each key
  • Discards obsolete entries and tombstones

This process produces new SSTables in lower levels while removing old ones.

image-20260201200458754

Compaction Strategies

The PPT introduces two compaction strategies:

  • Leveled Compaction
    • Organizes data into multiple levels
    • SSTables within a level do not overlap (except Level 0)
    • Optimized for read-heavy workloads

image-20260201200418845

  • Universal Compaction
    • Maintains a single logical level
    • Allows overlapping key ranges
    • Optimized for write-heavy workloads

image-20260201200354967

Lec6 Database Storage + Column Stores + Data Compression

Database workloads

  • On-Line Transaction Processing (OLTP): Fast operations that only read/update a small amount of data
  • On-Line Analytical Processing (OLAP): Complex queries that read a lot of data to compute aggregates
  • Hybrid Transaction + Analytical Processing: OLTP + OLAP

image-20260202140700003

Storage model

A DBMS’s storage model specifies how it physically organizes tuples on disk and in memory.

Can have different performance characteristics based on target workload ann influence the design choices of the rest of the DBMS

  • N-ary Storage Model(NSM)
  • Decomposition Storage Model(DSM)
  • Hybrid Storage Model(PAX)

NSM

A disk-oriented NSM system stores a tuple’s fixed-length and variable length attributes contiguously in a single slotted page.

The tuple’s record id (page#, slot#) is how the DBMS uniquely identifies a physical tuple

image-20260202141836961

Advantages

  • Fast inserts, updates, and deletes.
  • Good for queries that need the entire tuple (OLTP).
  • Can use index-oriented physical storage for clustering.

Disadvantages

  • Not good for scanning large portions of the table and/or a subset of the attributes.

  • Terrible memory locality in access patterns.

  • Not ideal for compression because of multiple value domains within a single page.

DSM

Store a single attribute for all tuples contiguously in a block of data. Also known as a “column store

Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes.

image-20260202142652635

Tuple Identfication

  • Fixed-length Offsets
  • Embedded Tuple Ids

Advantages

  • Reduces the amount wasted I/O per query because the DBMS only reads the data that it needs.

  • Faster query processing because of increased locality and cached data reuse (Lecture #14).

  • Better data compression because data from the same domain are physically collocated.

Disadvantages

  • Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching/reorganization.

PAX

Partition Attributes Across (PAX) is a hybrid storage model that vertically partitions attributes within a database page.

Horizontally partition data into row groups. Then vertically partition their attributes into column chunks.

image-20260202143955864

Database Compression

Goals:

  • #1: Must produce fixed-length values. Only exception is var-length data stored in separate pool.

  • #2: Postpone decompression for as long as possible during query execution. Also known as late materialization.

  • #3: Must be a lossless scheme. People (typically) don’t like losing data. Any lossy compression must be performed by application

Compression Granularity

  • Block-level: Compress a block of tuples for the same table.
  • Tuple-level: Compress the contents of the entire tuple (NSM-only).
  • Attreibute-level: Compress a single attribute within one tuple (overflow). Can target multiple attributes for the same tuple.
  • Column-level: Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).

Naive Compression

Compress data using a general-purpose algorithm. Scope of compression is based on input provided.

image-20260202144852164

Ideally, the DBMS can operate on compressed data without decompressing it first.

image-20260202145041605

Columnar Compression

Run-length Encoding

Compress runs of the same value in a single column into triplets:

image-20260202145403355


Bit-Packing Encoding

If the values for an integer attribute is smaller than the range of its given data type size, then reduce the number of bits to represent each value.

image-20260202145546669

Patching / Mostly Encoding: A variation of bit packing for when an attribute’s values are “mostly” less than the largest size, store them with smaller data type.

image-20260202145840090


Bitmap Encoding

Store a separate bitmap for each unique value for an attribute where an offset in the vector corresponds to a tuple.

image-20260202150128385

But there are some flaws fo this method.

Assume we have 10 million tuples. 43,000 zip codes in the US.

10000000 × 32-bits = 40 MB (Orginal)

10000000 × 43000 = 53.75 GB (With Bitmap)

And every time the application inserts a new tuple, the DBMS must extend 43,000 different bitmaps.


Roaring Bitmaps

Bitmap index that switches which data structure to use for a range of values based local density of bits

  • Dense chunks are stored using uncompressed bitmaps.(Can be further compressed with RLE)
  • Sparse chunks use bitpacked arrays of 16-bit integers.

image-20260202151924937


Delta / Frame-of-Reference Encoding

Recording the difference between values that follow each other in the same column.

Combine with RLE to get even better compression ratios.

image-20260202150652776

Dictionary Encoding

Replace frequent values with smaller fixed-length codes and then maintain a mapping (dictionary) from the codes to the original values. (Typically, one code per attribute value.)

image-20260202151150956

image-20260202151303156

Bifurcated Environment

A bifurcated environment separates OLTP and OLAP systems, moving data between them via ETL or ELT pipelines.

image-20260202155219551

Hybrid Storage Model

Use seperate execution engines that are optimized for either NSM or DSM databases.

  • Store new data in NSM for fast OLTP.
  • Migrate data to DSM for more efficient OLAP.
image-20260202155745470

Fractured Mirrors

Examples: Oracle, IBM DB2 Blu, Microsoft SQL Server

DBMS automatically maintains a second copy of the database in a DSM layout.

  • All updates are first entered in NSM then eventually copied into DSM mirror.
  • If the DBMS supports updates, it must invalidate tuples in the DSM mirror.

Delta Store

Examples: SAP HANA, Vertica, SingleStore, Databricks, Google Napa

A background thread migrates updates from delta store and applies them to DSM data.

  • Batch large chunks and then write them out as a PAX file.
  • Delete records in the delta store once they are in column store.

A delta store is a row-oriented write buffer that absorbs updates and is periodically merged into the column store in batches.