Lec 8-9 Indexes + Filters

0. Study Map

This note follows one question: “For this query type, which structure should I use?”

| Query Intent | Primary Structure | Why |
| --- | --- | --- |
| Point / range on structured keys | B+Tree | Ordered, exact, update-friendly |
| Fast "not exists" pre-check | Bloom-like filter | Cheap negative filtering |
| In-memory ordered set | Skip list | Simple updates, expected O(log n) |
| Prefix/digital key lookup | Trie / Radix | Follows key digits/prefixes |
| Full-text keyword search | Inverted index | term -> posting list |
| Semantic similarity search | Vector index (IVF/HNSW) | ANN over embeddings |

1. B+Tree (Lec 8 Core)

1.1 Core Idea

B+Tree is the default ordered index in DBMSs: one structure supports both point lookup and range scan with stable disk behavior.

1.2 Physical Structure

  • Inner node: routing keys + child pointers.
  • Leaf node: key + payload (RID / primary key / tuple data).
  • Leaf sibling links: make ordered scans efficient.

B-Tree vs B+Tree

| Aspect | B-Tree | B+Tree |
| --- | --- | --- |
| Value location | All levels | Leaves only |
| Range scan | Less natural | Natural via leaf chain |
| Real DBMS usage | Rare | Mainstream |

1.3 Update Logic (What Happens on Write)

Insert

  1. Traverse to target leaf.
  2. Insert key in order.
  3. If overflow, split node.
  4. Push separator key to parent.
  5. Cascade upward if parent also overflows.
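
The leaf-level part of this path can be sketched in a few lines (a minimal Python sketch: `ORDER` and the list-based leaf are illustrative; real nodes are fixed-size disk pages):

```python
import bisect

ORDER = 4  # max keys per leaf (hypothetical small order for illustration)

def leaf_insert(leaf, key):
    """Insert key into a sorted leaf; return (new_right_leaf, separator) on split."""
    bisect.insort(leaf, key)          # step 2: insert in key order
    if len(leaf) <= ORDER:            # no overflow -> done
        return None
    mid = len(leaf) // 2              # step 3: split the overflowing leaf
    right = leaf[mid:]
    del leaf[mid:]
    return right, right[0]            # step 4: separator key goes to the parent

leaf = [10, 20, 30, 40]
result = leaf_insert(leaf, 25)        # overflows -> splits into [10,20] / [25,30,40]
```

Step 5 (cascading) is the same logic applied to the parent when the returned separator overflows it.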

Delete

  1. Remove key from leaf.
  2. If under threshold, try borrow from sibling.
  3. If borrow fails, merge siblings.
  4. Update parent separators.
  5. Cascade upward on parent underflow.

Duplicate Keys

  • Preferred: store (key, rid) to keep uniqueness.
  • Avoid overflow chains unless unavoidable (maintenance cost high).
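
The (key, rid) scheme keeps entries unique and still answers "all rids for key k" with one contiguous leaf scan. A sketch using sorted tuples (illustrative rids):

```python
import bisect

# Leaf entries stored as (key, rid) so duplicates of `key` stay unique.
entries = []
for key, rid in [(7, 101), (7, 205), (3, 42), (7, 150)]:
    bisect.insort(entries, (key, rid))

# All rids for key 7: the (7, *) entries form one contiguous run.
lo = bisect.bisect_left(entries, (7, -1))
hi = bisect.bisect_left(entries, (8, -1))
rids = [rid for _, rid in entries[lo:hi]]
```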

1.4 Design Knobs

| Knob | Typical Choice | Trade-off |
| --- | --- | --- |
| Node size | Larger on slower storage | Fewer I/Os vs more in-node scan |
| Merge threshold | Not always immediate merge | Lower write churn vs temporary sparsity |
| Variable-length key storage | Pointer / var-node / padding / indirection | Space efficiency vs complexity |
| In-node search | Linear (SIMD) / binary / interpolation | CPU efficiency vs distribution assumptions |
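
Of the in-node search options, interpolation search is the least familiar; it guesses the slot from the key's value, which is why it carries a distribution assumption. A sketch over a sorted key array (assumes roughly uniform keys):

```python
def interpolation_search(keys, target):
    """Guess the slot from the key's value; fast when keys are near-uniform."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Linear interpolation between the endpoint keys.
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

keys = list(range(0, 100, 5))   # uniform keys: first guess usually lands on target
```

On skewed key distributions the guess degrades and binary search is the safer default.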

1.5 Practical Optimizations

  • Pointer swizzling: pinned pages can be referenced by raw pointers.
  • Write-optimized B+Tree: buffer updates at internal nodes, flush in batches.
  • Partial index: index only rows matching predicate.
  • Include columns: support index-only scans.
  • Prefix compression / dedup / suffix truncation: reduce footprint.
  • Bulk build: sort keys first, build bottom-up.
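
The bulk-build idea in the last bullet, sketched bottom-up (simplified: routing keys are just the first key of each child, no payloads, hypothetical small fanout):

```python
def bulk_build(sorted_keys, fanout=3):
    """Pack pre-sorted keys into leaves, then build parent levels bottom-up."""
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        # Each parent holds the first key of every child as a routing key.
        level = [[child[0] for child in level[i:i + fanout]]
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    return levels  # levels[0] = leaf level, levels[-1] = root level

tree = bulk_build(list(range(1, 10)))
```

Because keys arrive sorted, every node is filled once and never split, which is why bulk build beats repeated inserts.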

1.6 When to Use B+Tree

Use B+Tree first when you need:

  • exact lookup + range query
  • frequent updates
  • predictable transactional behavior

2. Filters (Lec 9 Part A)

2.1 Bloom Filter

Core Idea

A bitmap + k hash functions for membership checks.

Semantics

  • False negative: never
  • False positive: possible

Operations

  • Insert(x): set k bits.
  • Lookup(x): any bit 0 => absent; all 1 => maybe present.

Best Use

As a pre-filter in front of expensive index/table probes.
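
A minimal sketch of the structure (sizes are illustrative; deriving the k positions from one SHA-256 digest is just one convenient trick):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.bits = [0] * m          # the bitmap
        self.m, self.k = m, k

    def _positions(self, item):
        # Carve k bit positions out of a single digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4*i:4*i+4], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def maybe_contains(self, item):
        # Any zero bit => definitely absent (no false negatives);
        # all ones => maybe present (false positives possible).
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.insert("alice")
```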

2.2 Other Filters

  • Counting Bloom: counters instead of bits, supports delete.
  • Cuckoo Filter: fingerprint-based, dynamic insert/delete.
  • SuRF: compressed static trie-like filter for approximate point/range filtering.

3. Ordered Alternatives (Lec 9 Part B)

3.1 Skip List

Core Idea

Layered linked lists with random “height” per key.

Why It Exists

  • easier update path than strict tree rebalancing
  • expected O(log n) operations

Trade-off

  • good in memory (e.g., MemTable)
  • usually worse locality than B+Tree on disk/cache-heavy paths
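
A compact sketch showing both points above: the update path is just pointer splicing per layer, and node heights come from coin flips (MAX_LEVEL and p=0.5 are conventional choices):

```python
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level     # one forward pointer per layer

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:  # coin flips pick height
            level += 1
        return level

    def insert(self, key):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):    # descend from the top layer
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node                    # last node before key at layer i
        new = Node(key, self._random_level())
        for i in range(len(new.next)):          # splice into each of its layers
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def contains(self, key):
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
        node = node.next[0]
        return node is not None and node.key == key

sl = SkipList()
for k in [3, 1, 2]:
    sl.insert(k)
```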

3.2 Trie / Radix Tree

Trie

  • Search by key digits/prefixes.
  • Complexity O(k) where k is key length.
  • Shape depends on key prefixes, not insertion order.

Radix (Patricia)

  • Compress single-child paths.
  • Better space than plain trie.
  • Some compressed-path checks may require verification against original key/tuple.
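
A plain (uncompressed) trie sketch, showing the O(k) digit-by-digit descent and prefix lookup; a radix tree would collapse the single-child chains this produces:

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # digit -> child
        self.terminal = False  # marks end of a stored key

def insert(root, key):
    node = root
    for ch in key:             # one step per key "digit"
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def has_prefix(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = TrieNode()
for word in ["car", "cart", "cat"]:
    insert(root, word)         # final shape is insertion-order independent
```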

4. Inverted Index for Text Search (Lec 9 Part C)

4.1 Core Structure

  • Dictionary of terms
  • Posting list per term

This is the right structure for "contains keyword" queries; a B+Tree probed with `LIKE '%...%'` cannot use its sort order for a leading wildcard and usually degrades to a full scan.
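
A minimal sketch of the dictionary + posting-list structure (toy documents; real systems add compression, skip pointers, and ranking):

```python
from collections import defaultdict

docs = {
    1: "btree index range scan",
    2: "bloom filter membership",
    3: "index scan with filter",
}

# Build: term -> sorted posting list of doc ids.
postings = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        postings[term].append(doc_id)

def search(term):
    return postings.get(term, [])

# AND queries intersect posting lists:
both = set(search("index")) & set(search("filter"))   # docs with both terms
```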

4.2 Systems in Slides

  • Lucene: FST dictionary + segments + background merge.
  • PostgreSQL GIN: dictionary + posting list/tree + pending list.

4.3 Retrieval Quality Layer

  • Ranking: TF-IDF / BM25
  • Token processing: n-gram/trigram
  • Normalization: punctuation/spacing variants
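
The trigram idea can be sketched as a substring pre-filter (in the spirit of PostgreSQL's pg_trgm; this simplified version skips its padding conventions):

```python
def trigrams(text):
    # Normalize first (lowercase, drop punctuation/spacing), then emit 3-grams.
    clean = "".join(ch for ch in text.lower() if ch.isalnum())
    return {clean[i:i + 3] for i in range(len(clean) - 2)}

def maybe_matches(doc, query):
    # A doc can contain the query substring only if it has all its trigrams.
    return trigrams(query) <= trigrams(doc)
```

Like a Bloom filter, this gives no false negatives for substring containment but can pass false positives, so hits still need verification.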

5. Vector Index for Semantic Search (Lec 9 Part D)

5.1 Why Vector Index

Keyword overlap is not enough for semantic matching. Convert text/items into embeddings, then retrieve nearest vectors.

5.2 Cluster-Based Vector Search (IVF)

  • Cluster vectors into centroids.
  • Map centroid -> posting list of vectors.
  • Query probes nearest clusters and reranks candidates.
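
These three steps in a toy sketch (hand-picked centroids and 2-d vectors; real IVF learns centroids via clustering and uses approximate distances before reranking):

```python
import math

# Hypothetical tiny setup: 2 centroids, vectors assigned to the nearest one.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
lists = {0: [], 1: []}
vectors = {"a": (1.0, 1.0), "b": (0.5, 2.0), "c": (9.0, 11.0)}
for name, v in vectors.items():
    cid = min(centroids, key=lambda c: math.dist(centroids[c], v))
    lists[cid].append(name)                 # centroid -> posting list of vectors

def query(q, nprobe=1):
    # Probe the nprobe nearest clusters, then rerank candidates exactly.
    probed = sorted(centroids, key=lambda c: math.dist(centroids[c], q))[:nprobe]
    cands = [name for c in probed for name in lists[c]]
    return sorted(cands, key=lambda n: math.dist(vectors[n], q))
```

Raising `nprobe` trades latency for recall: vectors near a cluster boundary are only found if their cluster is probed.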

5.3 Graph-Style Vector Search (HNSW)

  • Build neighbor graph on vectors.
  • Greedy traversal from entry points toward query.
  • Multi-layer hierarchy improves speed/quality.
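
The greedy traversal on a single layer, sketched over a hand-built toy graph (HNSW adds the multi-layer hierarchy and a beam of candidates on top of this basic step):

```python
import math

# Hand-built neighbor graph: one layer of an HNSW-style index.
points = {"a": (0, 0), "b": (2, 0), "c": (4, 1), "d": (6, 0), "e": (5, 5)}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"],
         "d": ["c"], "e": ["c"]}

def greedy_search(entry, q):
    """Hop to whichever neighbor is closer to q; stop at a local minimum."""
    cur = entry
    while True:
        best = min(graph[cur], key=lambda n: math.dist(points[n], q))
        if math.dist(points[best], q) >= math.dist(points[cur], q):
            return cur        # no neighbor improves -> local minimum
        cur = best
```

Greedy descent can get stuck in local minima; the hierarchy and multi-candidate search lists exist largely to reduce that risk.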

5.4 Practical Note

Vector search is usually combined with metadata filters before or after the ANN stage.

6. Final Selection Guide

| If Your Query Is… | Start With | Then Add |
| --- | --- | --- |
| `a = ?` or `a BETWEEN ? AND ?` | B+Tree | Partial/covering index |
| "Does key probably exist?" | Bloom filter | Base index probe on positives |
| "Find docs containing term" | Inverted index | BM25 + tokenizer tuning |
| "Find semantically similar items" | Vector index (IVF/HNSW) | Metadata filter + rerank |
| Prefix-heavy key search | Trie / Radix | Compression / adaptive node layout |

Quick Recap

  1. B+Tree is still the baseline ordered index.
  2. Filters reduce wasted probes but do not locate tuples.
  3. Inverted indexes solve keyword retrieval; vector indexes solve semantic retrieval.
  4. Real systems combine multiple structures in one pipeline.