Lec 8-9 Indexes + Filters

0. Study Map

This note follows one question: “For this query type, which structure should I use?”

| Query Intent | Primary Structure | Why |
| --- | --- | --- |
| Point / range on structured keys | B+Tree | Ordered, exact, update-friendly |
| Fast "not exists" pre-check | Bloom-like filter | Cheap negative filtering |
| In-memory ordered set | Skip list | Simple updates, expected O(log n) |
| Prefix/digital key lookup | Trie / Radix | Follows key digits/prefixes |
| Full-text keyword search | Inverted index | term -> posting list |
| Semantic similarity search | Vector index (IVF/HNSW) | ANN over embeddings |

1. B+Tree (Lec 8 Core)

1.1 Core Idea

B+Tree is the default ordered index in DBMSs: one structure supports both point lookup and range scan with stable disk behavior.

1.2 Physical Structure

  • Inner node: routing keys + child pointers.
  • Leaf node: key + payload (RID / primary key / tuple data).
  • Leaf sibling links: make ordered scans efficient.

B-Tree vs B+Tree

| Aspect | B-Tree | B+Tree |
| --- | --- | --- |
| Value location | All levels | Leaves only |
| Range scan | Less natural | Natural via leaf chain |
| Real DBMS usage | Rare | Mainstream |

1.3 Update Logic (What Happens on Write)

Insert

  1. Traverse to target leaf.
  2. Insert key in order.
  3. If overflow, split node.
  4. Push separator key to parent.
  5. Cascade upward if parent also overflows.
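
The leaf-level part of this path can be sketched in a few lines (a minimal Python sketch: `ORDER` and the list-based leaf are illustrative; real nodes are fixed-size disk pages):

```python
import bisect

ORDER = 4  # max keys per leaf (hypothetical small order for illustration)

def leaf_insert(leaf, key):
    """Insert key into a sorted leaf; return (new_right_leaf, separator) on split."""
    bisect.insort(leaf, key)          # step 2: insert in key order
    if len(leaf) <= ORDER:            # no overflow -> done
        return None
    mid = len(leaf) // 2              # step 3: split the overflowing leaf
    right = leaf[mid:]
    del leaf[mid:]
    return right, right[0]            # step 4: separator key goes to the parent

leaf = [10, 20, 30, 40]
result = leaf_insert(leaf, 25)        # overflows -> splits into [10,20] / [25,30,40]
```

Step 5 (cascading) is the same logic applied to the parent when the returned separator overflows it.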

Delete

  1. Remove key from leaf.
  2. If under threshold, try borrow from sibling.
  3. If borrow fails, merge siblings.
  4. Update parent separators.
  5. Cascade upward on parent underflow.

Duplicate Keys

  • Preferred: store (key, rid) to keep uniqueness.
  • Avoid overflow chains unless unavoidable (maintenance cost high).
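
The (key, rid) scheme keeps entries unique and still answers "all rids for key k" with one contiguous leaf scan. A sketch using sorted tuples (illustrative rids):

```python
import bisect

# Leaf entries stored as (key, rid) so duplicates of `key` stay unique.
entries = []
for key, rid in [(7, 101), (7, 205), (3, 42), (7, 150)]:
    bisect.insort(entries, (key, rid))

# All rids for key 7: the (7, *) entries form one contiguous run.
lo = bisect.bisect_left(entries, (7, -1))
hi = bisect.bisect_left(entries, (8, -1))
rids = [rid for _, rid in entries[lo:hi]]
```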

1.4 Design Knobs

| Knob | Typical Choice | Trade-off |
| --- | --- | --- |
| Node size | Larger on slower storage | Fewer I/Os vs more in-node scan |
| Merge threshold | Not always immediate merge | Lower write churn vs temporary sparsity |
| Variable-length key storage | Pointer / var-node / padding / indirection | Space efficiency vs complexity |
| In-node search | Linear (SIMD) / binary / interpolation | CPU efficiency vs distribution assumptions |
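
Of the in-node search options, interpolation search is the least familiar; it guesses the slot from the key's value, which is why it carries a distribution assumption. A sketch over a sorted key array (assumes roughly uniform keys):

```python
def interpolation_search(keys, target):
    """Guess the slot from the key's value; fast when keys are near-uniform."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Linear interpolation between the endpoint keys.
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

keys = list(range(0, 100, 5))   # uniform keys: first guess usually lands on target
```

On skewed key distributions the guess degrades and binary search is the safer default.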

1.5 Practical Optimizations

  • Pointer swizzling: pinned pages can be referenced by raw pointers.
  • Write-optimized B+Tree: buffer updates at internal nodes, flush in batches.
  • Partial index: index only rows matching predicate.
  • Include columns: support index-only scans.
  • Prefix compression / dedup / suffix truncation: reduce footprint.
  • Bulk build: sort keys first, build bottom-up.
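
The bulk-build idea in the last bullet, sketched bottom-up (simplified: routing keys are just the first key of each child, no payloads, hypothetical small fanout):

```python
def bulk_build(sorted_keys, fanout=3):
    """Pack pre-sorted keys into leaves, then build parent levels bottom-up."""
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        # Each parent holds the first key of every child as a routing key.
        level = [[child[0] for child in level[i:i + fanout]]
                 for i in range(0, len(level), fanout)]
        levels.append(level)
    return levels  # levels[0] = leaf level, levels[-1] = root level

tree = bulk_build(list(range(1, 10)))
```

Because keys arrive sorted, every node is filled once and never split, which is why bulk build beats repeated inserts.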

1.6 When to Use B+Tree

Use B+Tree first when you need:

  • exact lookup + range query
  • frequent updates
  • predictable transactional behavior

2. Filters (Lec 9 Part A)

2.1 Bloom Filter

Core Idea

A bitmap + k hash functions for membership checks.

Semantics

  • False negative: never
  • False positive: possible

Operations

  • Insert(x): set k bits.
  • Lookup(x): any bit 0 => absent; all 1 => maybe present.

Best Use

As a pre-filter in front of expensive index/table probes.
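
A minimal sketch of the structure (sizes are illustrative; deriving the k positions from one SHA-256 digest is just one convenient trick):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.bits = [0] * m          # the bitmap
        self.m, self.k = m, k

    def _positions(self, item):
        # Carve k bit positions out of a single digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4*i:4*i+4], "big") % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def maybe_contains(self, item):
        # Any zero bit => definitely absent (no false negatives);
        # all ones => maybe present (false positives possible).
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.insert("alice")
```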

2.2 Other Filters

  • Counting Bloom: counters instead of bits, supports delete.
  • Cuckoo Filter: fingerprint-based, dynamic insert/delete.
  • SuRF: compressed static trie-like filter for approximate point/range filtering.

3. Ordered Alternatives (Lec 9 Part B)

3.1 Skip List

Core Idea

Layered linked lists with random “height” per key.

Why It Exists

  • easier update path than strict tree rebalancing
  • expected O(log n) operations

Trade-off

  • good in memory (e.g., MemTable)
  • usually worse locality than B+Tree on disk/cache-heavy paths
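
A compact sketch showing both points above: the update path is just pointer splicing per layer, and node heights come from coin flips (MAX_LEVEL and p=0.5 are conventional choices):

```python
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level     # one forward pointer per layer

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:  # coin flips pick height
            level += 1
        return level

    def insert(self, key):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for i in reversed(range(MAX_LEVEL)):    # descend from the top layer
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node                    # last node before key at layer i
        new = Node(key, self._random_level())
        for i in range(len(new.next)):          # splice into each of its layers
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def contains(self, key):
        node = self.head
        for i in reversed(range(MAX_LEVEL)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
        node = node.next[0]
        return node is not None and node.key == key

sl = SkipList()
for k in [3, 1, 2]:
    sl.insert(k)
```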

3.2 Trie / Radix Tree

Trie

  • Search by key digits/prefixes.
  • Complexity O(k) where k is key length.
  • Shape depends on key prefixes, not insertion order.

Radix (Patricia)

  • Compress single-child paths.
  • Better space than plain trie.
  • Some compressed-path checks may require verification against original key/tuple.
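
A plain (uncompressed) trie sketch, showing the O(k) digit-by-digit descent and prefix lookup; a radix tree would collapse the single-child chains this produces:

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # digit -> child
        self.terminal = False  # marks end of a stored key

def insert(root, key):
    node = root
    for ch in key:             # one step per key "digit"
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def has_prefix(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = TrieNode()
for word in ["car", "cart", "cat"]:
    insert(root, word)         # final shape is insertion-order independent
```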

4. Inverted Index for Text Search (Lec 9 Part C)

4.1 Core Structure

  • Dictionary of terms
  • Posting list per term

This is the right structure for "contains keyword" queries; a B+Tree probed with `LIKE '%...%'` cannot use its sort order for a leading wildcard and usually degrades to a full scan.
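
A minimal sketch of the dictionary + posting-list structure (toy documents; real systems add compression, skip pointers, and ranking):

```python
from collections import defaultdict

docs = {
    1: "btree index range scan",
    2: "bloom filter membership",
    3: "index scan with filter",
}

# Build: term -> sorted posting list of doc ids.
postings = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        postings[term].append(doc_id)

def search(term):
    return postings.get(term, [])

# AND queries intersect posting lists:
both = set(search("index")) & set(search("filter"))   # docs with both terms
```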

4.2 Systems in Slides

  • Lucene: FST dictionary + segments + background merge.
  • PostgreSQL GIN: dictionary + posting list/tree + pending list.

4.3 Retrieval Quality Layer

  • Ranking: TF-IDF / BM25
  • Token processing: n-gram/trigram
  • Normalization: punctuation/spacing variants
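
The trigram idea can be sketched as a substring pre-filter (in the spirit of PostgreSQL's pg_trgm; this simplified version skips its padding conventions):

```python
def trigrams(text):
    # Normalize first (lowercase, drop punctuation/spacing), then emit 3-grams.
    clean = "".join(ch for ch in text.lower() if ch.isalnum())
    return {clean[i:i + 3] for i in range(len(clean) - 2)}

def maybe_matches(doc, query):
    # A doc can contain the query substring only if it has all its trigrams.
    return trigrams(query) <= trigrams(doc)
```

Like a Bloom filter, this gives no false negatives for substring containment but can pass false positives, so hits still need verification.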

5. Vector Index for Semantic Search (Lec 9 Part D)

5.1 Why Vector Index

Keyword overlap is not enough for semantic matching. Convert text/items into embeddings, then retrieve nearest vectors.

5.2 Cluster-Based Vector Search (IVF)

  • Cluster vectors into centroids.
  • Map centroid -> posting list of vectors.
  • Query probes nearest clusters and reranks candidates.
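
These three steps in a toy sketch (hand-picked centroids and 2-d vectors; real IVF learns centroids via clustering and uses approximate distances before reranking):

```python
import math

# Hypothetical tiny setup: 2 centroids, vectors assigned to the nearest one.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
lists = {0: [], 1: []}
vectors = {"a": (1.0, 1.0), "b": (0.5, 2.0), "c": (9.0, 11.0)}
for name, v in vectors.items():
    cid = min(centroids, key=lambda c: math.dist(centroids[c], v))
    lists[cid].append(name)                 # centroid -> posting list of vectors

def query(q, nprobe=1):
    # Probe the nprobe nearest clusters, then rerank candidates exactly.
    probed = sorted(centroids, key=lambda c: math.dist(centroids[c], q))[:nprobe]
    cands = [name for c in probed for name in lists[c]]
    return sorted(cands, key=lambda n: math.dist(vectors[n], q))
```

Raising `nprobe` trades latency for recall: vectors near a cluster boundary are only found if their cluster is probed.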

5.3 Graph-Style Vector Search (HNSW)

  • Build neighbor graph on vectors.
  • Greedy traversal from entry points toward query.
  • Multi-layer hierarchy improves speed/quality.
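
The greedy traversal on a single layer, sketched over a hand-built toy graph (HNSW adds the multi-layer hierarchy and a beam of candidates on top of this basic step):

```python
import math

# Hand-built neighbor graph: one layer of an HNSW-style index.
points = {"a": (0, 0), "b": (2, 0), "c": (4, 1), "d": (6, 0), "e": (5, 5)}
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "e"],
         "d": ["c"], "e": ["c"]}

def greedy_search(entry, q):
    """Hop to whichever neighbor is closer to q; stop at a local minimum."""
    cur = entry
    while True:
        best = min(graph[cur], key=lambda n: math.dist(points[n], q))
        if math.dist(points[best], q) >= math.dist(points[cur], q):
            return cur        # no neighbor improves -> local minimum
        cur = best
```

Greedy descent can get stuck in local minima; the hierarchy and multi-candidate search lists exist largely to reduce that risk.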

5.4 Practical Note

Vector search is usually combined with metadata filters before or after the ANN stage.

6. Final Selection Guide

| If Your Query Is… | Start With | Then Add |
| --- | --- | --- |
| `a = ?` or `a BETWEEN ? AND ?` | B+Tree | Partial/covering index |
| "Does key probably exist?" | Bloom filter | Base index probe on positives |
| "Find docs containing term" | Inverted index | BM25 + tokenizer tuning |
| "Find semantically similar items" | Vector index (IVF/HNSW) | Metadata filter + rerank |
| Prefix-heavy key search | Trie / Radix | Compression / adaptive node layout |

Quick Recap

  1. B+Tree is still the baseline ordered index.
  2. Filters reduce wasted probes but do not locate tuples.
  3. Inverted indexes solve keyword retrieval; vector indexes solve semantic retrieval.
  4. Real systems combine multiple structures in one pipeline.