15645 Database systems: Index & Filter
Lec 8-9 Indexes + Filters
0. Study Map
This note follows one question: “For this query type, which structure should I use?”
| Query Intent | Primary Structure | Why |
|---|---|---|
| Point / range on structured keys | B+Tree | Ordered, exact, update-friendly |
| Fast “not exists” pre-check | Bloom-like filter | Cheap negative filtering |
| In-memory ordered set | Skip list | Simple updates, expected O(log n) |
| Prefix/digital key lookup | Trie / Radix | Follows key digits/prefixes |
| Full-text keyword search | Inverted index | term -> posting list |
| Semantic similarity search | Vector index (IVF/HNSW) | ANN over embeddings |
1. B+Tree (Lec 8 Core)
1.1 Core Idea
B+Tree is the default ordered index in DBMSs: one structure supports both point lookup and range scan with stable disk behavior.
1.2 Physical Structure
- Inner node: routing keys + child pointers.
- Leaf node: key + payload (RID / primary key / tuple data).
- Leaf sibling links: make ordered scans efficient.
B-Tree vs B+Tree
| Aspect | B-Tree | B+Tree |
|---|---|---|
| Value location | All levels | Leaves only |
| Range scan | Less natural | Natural via leaf chain |
| Real DBMS usage | Rare | Mainstream |
1.3 Update Logic (What Happens on Write)
Insert
- Traverse to target leaf.
- Insert key in order.
- If overflow, split node.
- Push separator key to parent.
- Cascade upward if parent also overflows.
Delete
- Remove key from leaf.
- If under threshold, try borrow from sibling.
- If borrow fails, merge siblings.
- Update parent separators.
- Cascade upward on parent underflow.
Duplicate Keys
- Preferred: store
(key, rid)to keep uniqueness. - Avoid overflow chains unless unavoidable (maintenance cost high).
1.4 Design Knobs
| Knob | Typical Choice | Trade-off |
|---|---|---|
| Node size | Larger on slower storage | Fewer I/Os vs more in-node scan |
| Merge threshold | Not always immediate merge | Lower write churn vs temporary sparsity |
| Variable-length key storage | Pointer/var-node/padding/indirection | Space efficiency vs complexity |
| In-node search | Linear(SIMD)/binary/interpolation | CPU efficiency vs distribution assumptions |
1.5 Practical Optimizations
- Pointer swizzling: pinned pages can be referenced by raw pointers.
- Write-optimized B+Tree: buffer updates at internal nodes, flush in batches.
- Partial index: index only rows matching predicate.
- Include columns: support index-only scans.
- Prefix compression / dedup / suffix truncation: reduce footprint.
- Bulk build: sort keys first, build bottom-up.
1.6 When to Use B+Tree
Use B+Tree first when you need:
- exact lookup + range query
- frequent updates
- predictable transactional behavior
2. Filters (Lec 9 Part A)
2.1 Bloom Filter
Core Idea
A bitmap + k hash functions for membership checks.
Semantics
- False negative: never
- False positive: possible
Operations
Insert(x): setkbits.Lookup(x): any bit0=> absent; all1=> maybe present.
Best Use
As a pre-filter in front of expensive index/table probes.
2.2 Other Filters
- Counting Bloom: counters instead of bits, supports delete.
- Cuckoo Filter: fingerprint-based, dynamic insert/delete.
- SuRF: compressed static trie-like filter for approximate point/range filtering.
3. Ordered Alternatives (Lec 9 Part B)
3.1 Skip List
Core Idea
Layered linked lists with random “height” per key.
Why It Exists
- easier update path than strict tree rebalancing
- expected
O(log n)operations
Trade-off
- good in memory (e.g., MemTable)
- usually worse locality than B+Tree on disk/cache-heavy paths
3.2 Trie / Radix Tree
Trie
- Search by key digits/prefixes.
- Complexity
O(k)wherekis key length. - Shape depends on key prefixes, not insertion order.
Radix (Patricia)
- Compress single-child paths.
- Better space than plain trie.
- Some compressed-path checks may require verification against original key/tuple.
4. Inverted Index for Text Search (Lec 9 Part C)
4.1 Core Structure
- Dictionary of terms
- Posting list per term
This is the right structure for contains keyword
queries, where B+Tree LIKE '%...%' is usually
inefficient.
4.2 Systems in Slides
- Lucene: FST dictionary + segments + background merge.
- PostgreSQL GIN: dictionary + posting list/tree + pending list.
4.3 Retrieval Quality Layer
- Ranking: TF-IDF / BM25
- Token processing: n-gram/trigram
- Normalization: punctuation/spacing variants
5. Vector Index for Semantic Search (Lec 9 Part D)
5.1 Why Vector Index
Keyword overlap is not enough for semantic matching. Convert text/items into embeddings, then retrieve nearest vectors.
5.2 IVF-Style Vector Search
- Cluster vectors into centroids.
- Map centroid -> posting list of vectors.
- Query probes nearest clusters and reranks candidates.
5.3 Graph-Style Vector Search (HNSW)
- Build neighbor graph on vectors.
- Greedy traversal from entry points toward query.
- Multi-layer hierarchy improves speed/quality.
5.4 Practical Note
Vector search is usually combined with metadata filters before/after ANN stage.
6. Final Selection Guide
| If Your Query Is… | Start With | Then Add |
|---|---|---|
a = ? or a between ? and ? |
B+Tree | Partial/covering index |
| “Does key probably exist?” | Bloom filter | Base index probe on positives |
| “Find docs containing term” | Inverted index | BM25 + tokenizer tuning |
| “Find semantically similar items” | Vector index (IVF/HNSW) | metadata filter + rerank |
| Prefix-heavy key search | Trie/Radix | compression/adaptive node layout |
Quick Recap
- B+Tree is still the baseline ordered index.
- Filters reduce wasted probes but do not locate tuples.
- Inverted indexes solve keyword retrieval; vector indexes solve semantic retrieval.
- Real systems combine multiple structures in one pipeline.
All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.
Comments



