TinyKV: A Raft-Based KV Storage System
Interview Script (Memorize This First)
30-Second Version (Opening)
TinyKV uses Multi-Raft to achieve strong-consistency replication, with Badger as the persistence layer. On top of that it implements region split, membership change, snapshot transfer, scheduler-based load balancing, and distributed transactions via MVCC + Percolator. I focused on stability under split/recover and conf change scenarios, ensuring requests never time out in complex failure cases and nodes can always recover after a restart.
3-Minute Version (Expanded)
- Start with the layered architecture: Server → Raftstore → Raft → Storage/MVCC → Scheduler.
- Walk through the consistency pipeline: propose → replicate → commit → apply → callback.
- Cover complex scenarios: leader transfer, conf change, split, snapshot, crash recovery.
- End with a problem you solved end-to-end: symptom, root cause, fix, validation.
Deep-Dive Answer Order (Stay Focused)
- State the invariants first (linearizability, monotone epoch, log ordering).
- Explain the key data structures (proposal, Progress, RegionEpoch).
- Walk through failure paths (proposal dropped, stale/tombstone, snapshot/recover).
- Describe verification: high-concurrency + unreliable network + restart/partition regression tests.
Architecture Overview
TinyKV is an educational distributed KV storage system modeled after TiKV + PD from the TiDB ecosystem. It consists of four core modules:
```
Client (TinySQL)
     │ gRPC
     ▼
TinyKV Server ─ Raftstore ─ Raft ─ Storage/MVCC (Badger)
     │ region heartbeats / operators
     ▼
Scheduler (PD)
```
Part 1: Storage Engine Basics (Project 1)
1.1 Badger Engine
- LSM-Tree based KV store (similar to RocksDB)
- Column Families emulated via key prefix: `${cf}_${key}`
- Three CFs: CfDefault (user data), CfLock (transaction locks), CfWrite (commit records)
1.2 Storage Interface Design
```go
// Reconstructed from the TinyKV course skeleton; the original block was truncated here.
type Storage interface {
	Start() error
	Stop() error
	Write(ctx *kvrpcpb.Context, batch []Modify) error
	Reader(ctx *kvrpcpb.Context) (StorageReader, error)
}
```
1.3 Interview Points
- Why Column Families? Logical isolation of different data types (locks, values, write records) improves scan efficiency.
- Badger vs RocksDB? Badger is Go-native (no cgo), and separates large values into a value log (WiscKey-style), which reduces LSM-Tree write amplification.
Part 2: Raft Consensus Algorithm (Project 2A)
2.1 Core Data Structures
```go
// Core fields, abridged from the TinyKV skeleton (the original block was truncated):
type Raft struct {
	id      uint64
	Term    uint64
	Vote    uint64
	RaftLog *RaftLog
	Prs     map[uint64]*Progress // per-peer replication progress (leader only)
	State   StateType            // Follower / Candidate / Leader
	votes   map[uint64]bool      // votes collected as candidate
	msgs    []pb.Message         // outbox, drained via Ready
	Lead    uint64
	heartbeatTimeout, electionTimeout int
	heartbeatElapsed, electionElapsed int
	leadTransferee   uint64 // leader-transfer target, if any
	PendingConfIndex uint64 // guards one-at-a-time conf changes
}
```
2.2 Leader Election Algorithm
Flow:
1. Follower's election timeout fires → becomes Candidate
2. Term++, votes for itself
3. Sends RequestVote{Term, LastIndex, LogTerm} to all peers
4. Receives majority votes → becomes Leader
5. Leader immediately appends a no-op entry (so uncommitted prior-term entries can be committed indirectly)

Voting Rules:
- Only one vote per term
- Candidate's log must be at least as up-to-date as the voter's (compare LastLogTerm, then LastLogIndex on tie)
- Receiving a message with a higher term → revert to Follower

Randomized Timeout: randomizedTimeout = electionTimeout + rand(0, electionTimeout)
- Prevents livelock (multiple nodes starting elections simultaneously)
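The randomized-timeout formula is a one-liner; this minimal sketch assumes timeouts are counted in abstract ticks, as they are in TinyKV's tick-driven raft:

```go
package main

import (
	"fmt"
	"math/rand"
)

// resetRandomizedTimeout draws a fresh election timeout in
// [electionTimeout, 2*electionTimeout), per the formula above. It is
// re-drawn on every state change so peers rarely time out together.
func resetRandomizedTimeout(electionTimeout int) int {
	return electionTimeout + rand.Intn(electionTimeout)
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(resetRandomizedTimeout(10)) // each value lies in [10, 20)
	}
}
```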
2.3 Log Replication Algorithm
Leader tracks progress per follower:

```go
type Progress struct {
	Match uint64 // highest index confirmed replicated
	Next  uint64 // next index to send
}
```
AppendEntries Flow:
1. Leader sends {prevLogIndex, prevLogTerm, Entries[], Commit}
2. Follower checks: does the entry at prevLogIndex have term == prevLogTerm?
3. Match → append entries, update committed
4. Mismatch → reject; leader decrements Next and retries (log backtracking)
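The follower-side check in step 2 can be modeled on a plain array of terms; `handleAppend` and its arguments are illustrative, not TinyKV's actual API:

```go
package main

import "fmt"

// handleAppend models the consistency check: accept the entries only if
// the local entry at prevLogIndex carries term prevLogTerm.
// terms[i] holds the term of the entry at index i (index 0 is a sentinel).
func handleAppend(terms []uint64, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex >= uint64(len(terms)) {
		return false // log too short → reject; leader backs Next off and retries
	}
	return terms[prevLogIndex] == prevLogTerm
}

func main() {
	terms := []uint64{0, 1, 1, 2}
	fmt.Println(handleAppend(terms, 3, 2)) // true: terms match → append
	fmt.Println(handleAppend(terms, 3, 1)) // false: term conflict → reject
	fmt.Println(handleAppend(terms, 9, 2)) // false: follower is missing entries
}
```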
Commit Rule (Critical!):

```
func maybeCommit():
    1. Collect Match values from all peers
    2. Sort and find the median N (highest index replicated by a majority)
    3. Only commit if log[N].term == currentTerm
    // Never commit entries from a previous term directly!
    // This is a key difference from Paxos.
```
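The rule can be made concrete as a self-contained sketch: `matches` includes the leader's own last index, `logTerm` stands in for a real log lookup, and `sorted[(len-1)/2]` picks the highest index replicated on a majority (the names here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// maybeCommit returns the new commit index. It only advances commit when
// the majority-replicated index holds an entry from the CURRENT term —
// the Figure 8 rule from the Raft paper.
func maybeCommit(matches []uint64, logTerm map[uint64]uint64, currentTerm, committed uint64) uint64 {
	sorted := append([]uint64(nil), matches...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Highest index replicated on a majority: position len-majority == (len-1)/2.
	n := sorted[(len(sorted)-1)/2]
	if n > committed && logTerm[n] == currentTerm {
		return n
	}
	return committed
}

func main() {
	terms := map[uint64]uint64{4: 2, 5: 3}
	// Matches {5,5,4}: index 5 is majority-replicated and from term 3 → commit.
	fmt.Println(maybeCommit([]uint64{5, 5, 4}, terms, 3, 3)) // 5
	// Index 4 is from older term 2: must NOT be committed directly at term 3.
	fmt.Println(maybeCommit([]uint64{4, 4, 1}, terms, 3, 3)) // 3
}
```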
2.4 RaftLog Structure
```
snapshot/first.....applied....committed....stabled.....last
```
Key Functions:
- unstableEntries(): entries after stabled (need to be persisted)
- nextEnts(): entries from applied+1 to committed (need to be applied to the state machine)
- maybeCompact(): discard entries already covered by a snapshot
2.5 RawNode Interface — Application-Driven Loop
```go
// Application loop (completed from the truncated original):
for {
	select {
	case <-ticker.C:
		rawNode.Tick() // drive election/heartbeat timers
	case msg := <-msgCh:
		rawNode.Step(msg) // feed peer messages / proposals
	}
	if rawNode.HasReady() {
		rd := rawNode.Ready()
		// persist rd.Entries + HardState, send rd.Messages,
		// apply rd.CommittedEntries, install rd.Snapshot if present
		rawNode.Advance(rd)
	}
}
```
Ready Struct:

```go
type Ready struct {
	SoftState        *SoftState   // leader / state changes
	HardState        pb.HardState // term / vote / commit changes
	Entries          []pb.Entry   // entries to persist
	CommittedEntries []pb.Entry   // committed entries to apply
	Snapshot         pb.Snapshot  // snapshot to install
	Messages         []pb.Message // messages to send
}
```
2.6 Frequently Asked Interview Questions
Q: Why does a newly elected leader append a no-op entry? A: To prevent the “phantom log” problem. A new leader may hold uncommitted entries from the previous term. Raft only allows committing entries from the current term, so a no-op entry in the current term lets the leader indirectly commit those prior-term entries.
Q: How does Raft guarantee log consistency? A: Via the Log Matching Property: (1) if entries in two logs have the same index and term, they carry the same command; (2) if entries in two logs have the same index and term, all preceding entries in both logs are also identical. The prevLogIndex/prevLogTerm check in AppendEntries enforces this inductively.
Q: Why can’t prior-term entries be committed directly? A: See Figure 8 in the Raft paper. Committing an entry from a previous term directly can cause a committed entry to be overwritten later, violating State Machine Safety.
Part 3: Raft KV Layer (Project 2B)
3.1 Overall Architecture
```
Client → gRPC → RaftstoreRouter → router.send(regionID, MsgRaftCmd)
```
3.2 Critical Persistent State
Two Badger instances:
| Engine | Stores | Key Format |
|---|---|---|
| RaftDB | Raft log + RaftLocalState | RaftLogKey(regionID, index) |
| KVDB | User data + RegionLocalState + RaftApplyState | user key / ApplyStateKey(regionID) |
Three key metadata structures:

```
RaftLocalState:   {HardState{Term, Vote, Commit}, LastIndex, LastTerm}
RaftApplyState:   {AppliedIndex, TruncatedState{Index, Term}}
RegionLocalState: {Region{Id, StartKey, EndKey, Peers[], RegionEpoch}, State}
```
3.3 Proposal Matching Mechanism
```go
// Completed from the truncated original, per the TinyKV skeleton:
type proposal struct {
	index uint64
	term  uint64
	cb    *message.Callback
}
```

- On propose: record (nextIndex, currentTerm, callback)
- On apply: match the entry by (term, index); if matched, invoke the callback

Stale detection:
- entry.Term > proposal.term → proposal was overwritten by a new leader → ErrStaleCommand
- entry.Index > proposal.index → same conclusion
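The matching logic above can be run end-to-end on a toy queue; this sketch simplifies the callback to a plain `func(string)` instead of TinyKV's `message.Callback`:

```go
package main

import "fmt"

// proposal mirrors the (index, term, callback) triple recorded at propose time.
type proposal struct {
	index uint64
	term  uint64
	cb    func(result string)
}

// matchProposal applies the stale-detection rules to the FIFO proposal
// queue: an applied entry either matches the head proposal exactly or
// proves that it was overwritten by a newer leader.
func matchProposal(props []*proposal, entryIndex, entryTerm uint64) []*proposal {
	for len(props) > 0 {
		p := props[0]
		if entryTerm > p.term || entryIndex > p.index {
			p.cb("ErrStaleCommand") // overwritten by a new leader's entry
			props = props[1:]
			continue
		}
		if entryIndex == p.index && entryTerm == p.term {
			p.cb("OK") // exact match: reply to the waiting client
			return props[1:]
		}
		return props // entry predates the head proposal; keep waiting
	}
	return props
}

func main() {
	var results []string
	props := []*proposal{{index: 5, term: 2, cb: func(r string) { results = append(results, r) }}}
	matchProposal(props, 5, 2)
	fmt.Println(results) // [OK]
}
```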
3.4 Interview Points
Q: Do reads also go through Raft? A: In TinyKV, yes (ReadIndex optimization is not implemented). Get requests are submitted as Raft entries to guarantee linearizability. Real systems typically use ReadIndex or LeaseRead to avoid that overhead.
Q: Why is raftWorker single-threaded? A: All peer messages are processed in one goroutine, eliminating lock contention. Lock-free message passing is achieved via channels.
Part 4: Snapshot Mechanism (Project 2C)
4.1 Snapshot Trigger
```
Periodic tick → onRaftGCLogTick()
```
4.2 Snapshot Generation and Sending
```
Leader detects follower.Next < firstIndex (log already truncated)
```
4.3 Snapshot Application Flow
```
Follower receives MsgSnapshot
```
4.4 Interview Points
Q: Why is snapshot generation asynchronous? A: Generating a snapshot requires scanning the entire region's data, which can be slow. Asynchronous generation avoids blocking the Raft main loop; ErrSnapshotTemporarilyUnavailable signals Raft to retry on the next tick.
Part 5: Multi-Raft (Project 3A / 3B)
5.1 Leader Transfer
```
1. Leader receives MsgTransferLeader{Target}
2. Sets leadTransferee and stops accepting new proposals
3. If the target's log is behind, replicates to it first
4. Once caught up, sends MsgTimeoutNow → target starts an election immediately
```
Key distinction: Leader transfer bypasses the Raft
log; it calls RawNode.TransferLeader() directly.
5.2 Configuration Change (Membership Change)
```go
// Propose a conf change (one node at a time)
rawNode.ProposeConfChange(pb.ConfChange{ChangeType: pb.ConfChangeType_AddNode, NodeId: 4})
```
Safety: one node changed at a time (no Joint
Consensus); PendingConfIndex prevents concurrent
changes.
5.3 Region Split
```
SplitCheckTick → scan region size → find split key when threshold exceeded
```
5.4 RegionEpoch Mechanism
```go
type RegionEpoch struct {
	ConfVer uint64 // bumped on every membership change
	Version uint64 // bumped on every split/merge
}
```
Purpose: detect stale requests. The epoch in a client request must match the region’s current epoch.
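The check itself is two comparisons. This minimal model restates the epoch struct so it is self-contained; the error value is illustrative (TinyKV returns a structured ErrEpochNotMatch):

```go
package main

import "fmt"

// RegionEpoch mirrors the metapb struct: ConfVer bumps on membership
// change, Version bumps on split/merge.
type RegionEpoch struct {
	ConfVer uint64
	Version uint64
}

// checkEpoch rejects requests carrying an epoch older than the region's
// current one, forcing the client to refresh its region cache.
func checkEpoch(req, current RegionEpoch) error {
	if req.ConfVer < current.ConfVer || req.Version < current.Version {
		return fmt.Errorf("ErrEpochNotMatch: req=%+v current=%+v", req, current)
	}
	return nil
}

func main() {
	cur := RegionEpoch{ConfVer: 2, Version: 3}
	fmt.Println(checkEpoch(RegionEpoch{ConfVer: 2, Version: 3}, cur) == nil) // true
	fmt.Println(checkEpoch(RegionEpoch{ConfVer: 2, Version: 2}, cur) == nil) // false: stale after a split
}
```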
5.5 Interview Points
Q: Why not use Joint Consensus? A: TinyKV simplifies membership changes to one node at a time. Joint Consensus allows multiple simultaneous changes but is more complex to implement. Single-step safety is guaranteed by “at most one change at a time.”
Q: How are replicas created in the new region after a
split? A: Followers create the new peer locally when they apply
the split entry. If a follower is absent (e.g., partitioned), the
leader’s heartbeat causes the store to detect an unknown region and call
replicatePeer() to create an empty peer, which then
receives data via a snapshot.
Part 6: Scheduler Load Balancing (Project 3C)
6.1 processRegionHeartbeat
```
func processRegionHeartbeat(region) error:
    1. Look up the local record for region.Id (or scan for overlapping ranges)
    2. Heartbeat epoch older than the recorded one (smaller ConfVer/Version) → reject as stale
    3. Otherwise update the region tree and the store's statistics
```
6.2 Balance Region Scheduler
```
Schedule(cluster):
    1. Sort suitable stores by region size, descending
    2. From the biggest store, pick a region to move (pending → follower → leader)
    3. Choose the suitable store with the smallest region size as target
    4. Skip unless sourceSize - targetSize > 2 × regionSize (anti-oscillation)
    5. Create a MovePeer operator: AddPeer on target, then RemovePeer on source
```
6.3 Interview Points
Q: Why the 2×regionSize tolerance check? A: To prevent oscillation. If the gap is too small, the move might immediately need to be reversed, causing an infinite loop.
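The guard itself is a one-line predicate; sizes here are abstract int64 values and the function name is illustrative:

```go
package main

import "fmt"

// shouldMove encodes the anti-oscillation guard: move a region only if
// the store-size gap exceeds twice the region's approximate size, so the
// move cannot immediately flip which store is bigger.
func shouldMove(sourceSize, targetSize, regionSize int64) bool {
	return sourceSize-targetSize > 2*regionSize
}

func main() {
	fmt.Println(shouldMove(100, 10, 20)) // true: gap 90 > 40
	fmt.Println(shouldMove(100, 70, 20)) // false: gap 30 <= 40, would oscillate
}
```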
Q: How does the scheduler guarantee safe scheduling? A: Via RegionEpoch checks. An operator records the epoch at creation time; if the epoch has changed when the operator executes (region was split or conf-changed), the operator is cancelled.
Part 7: Distributed Transactions (Project 4)
7.1 MVCC — Multi-Version Concurrency Control
How the three CFs cooperate:
| CF | Key Format | Value | Purpose |
|---|---|---|---|
| CfDefault | EncodeKey(userKey, startTS) | actual data | stores all versions of a value |
| CfLock | userKey (raw) | Lock{Primary, Ts, Ttl, Kind} | transaction lock (at most one per key) |
| CfWrite | EncodeKey(userKey, commitTS) | Write{StartTS, Kind} | commit records |
Key Encoding (critical!):

```go
func EncodeKey(key []byte, ts uint64) []byte {
	encodedKey := codec.EncodeBytes(key) // order-preserving encoding
	newKey := append(encodedKey, make([]byte, 8)...)
	binary.BigEndian.PutUint64(newKey[len(encodedKey):], ^ts) // bitwise NOT!
	return newKey
}
```

- ^ts makes timestamps sort in descending order (newest version first)
- Seek(EncodeKey(key, startTS)) lands directly on the latest version with commitTS <= startTS
7.2 Percolator Two-Phase Commit
Phase 1: Prewrite (lock + write data)

```
For each mutation:
    1. Check write conflict: MostRecentWrite(key).commitTs > startTs → WriteConflict
    2. Check lock conflict: GetLock(key) exists → KeyLocked
    3. PutValue(key, value) → CfDefault[key+startTs] = value
    4. PutLock(key, Lock{Primary, startTs, TTL, Kind}) → CfLock[key] = lock
```
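The two prewrite checks can be exercised on a toy in-memory model; the `store` type and its maps are illustrative stand-ins for CfWrite/CfLock, not TinyKV's MvccTxn API:

```go
package main

import "fmt"

// store models just enough state for the prewrite checks.
type store struct {
	mostRecentCommitTs map[string]uint64 // latest commitTs per key (CfWrite)
	locks              map[string]uint64 // lock's startTs per key (CfLock)
}

// prewrite performs the two Phase 1 checks, then takes the lock.
func (s *store) prewrite(key string, startTs uint64) error {
	if commitTs, ok := s.mostRecentCommitTs[key]; ok && commitTs > startTs {
		return fmt.Errorf("WriteConflict: %q committed at %d after startTs %d", key, commitTs, startTs)
	}
	if lockTs, ok := s.locks[key]; ok {
		return fmt.Errorf("KeyLocked: %q locked by txn %d", key, lockTs)
	}
	s.locks[key] = startTs // plus PutValue into CfDefault in the real flow
	return nil
}

func main() {
	s := &store{mostRecentCommitTs: map[string]uint64{"a": 10}, locks: map[string]uint64{}}
	fmt.Println(s.prewrite("a", 5) != nil)  // true: WriteConflict (commit at 10 > startTs 5)
	fmt.Println(s.prewrite("a", 20) == nil) // true: checks pass, lock taken
	fmt.Println(s.prewrite("a", 25) != nil) // true: KeyLocked by txn 20
}
```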
Phase 2: Commit (commit + unlock)

```
For each key:
    1. GetLock(key) — check the lock
    2. lock == nil:
       - Check CurrentWrite: already committed → idempotent return
         already rolled back → error
       - No write record → prewrite was lost, skip (no-op)
    3. lock.Ts != startTs → lock belongs to another transaction → error
    4. PutWrite(key, commitTs, Write{startTs, kind}) → CfWrite[key+commitTs]
    5. DeleteLock(key) → remove CfLock[key]
```
7.3 Read (KvGet)
```
1. GetLock(key): lock exists with lock.Ts <= readTs → return Locked error (blocked)
2. Seek CfWrite for the most recent commit with commitTs <= readTs
3. No write record, or Kind == Delete → key does not exist
4. Read CfDefault[EncodeKey(key, write.StartTS)] → value
```
7.4 Scan (KvScan)
```go
type Scanner struct {
	// wraps an iterator over CfWrite positioned via EncodeKey(key, readTs);
	// remaining fields truncated in the original
}
```
7.5 Transaction Cleanup
CheckTxnStatus (lock TTL check):

```
1. Check write and lock for the primary key
2. Already committed → return commitTs
3. Already rolled back → return NoAction
4. Lock present → check whether TTL has expired
   - Expired: clear lock + value, write rollback record, return TTLExpireRollback
   - Not expired: return NoAction
5. No lock and no write record → write rollback (guard against a late-arriving prewrite)
```
BatchRollback:

```
1. Check whether each key is already committed → if so, refuse rollback
2. DeleteLock + DeleteValue + PutWrite(Rollback)
```
ResolveLock:

```
1. Scan CfLock for all locks with lock.Ts == startTs
2. commitVersion == 0 → rollback all
3. commitVersion > 0 → commit all
```
7.6 Latch Mechanism
```go
// Field names abridged; the original block was truncated here.
type Latches struct {
	latchMap map[string]*sync.WaitGroup // one wait group per latched key
	mu       sync.Mutex                 // guards latchMap
}
```
- In-memory mutex lock that prevents concurrent in-process operations on the same key
- Different from MVCC locks: a latch is local and short-lived; an MVCC lock is persistent and distributed
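A simplified latch table illustrates the idea end-to-end. Unlike a WaitGroup-based version that blocks the caller, this sketch returns false and lets the caller retry; types and names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Latches is a minimal in-memory per-key mutex table: acquire all keys
// of a command before running it, release them after. Unlike an MVCC
// lock, it lives only in this process and only for the command's duration.
type Latches struct {
	mu     sync.Mutex
	locked map[string]bool
}

func NewLatches() *Latches { return &Latches{locked: map[string]bool{}} }

// Acquire latches every key atomically, or latches nothing and returns
// false if any key is already held (caller retries).
func (l *Latches) Acquire(keys []string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, k := range keys {
		if l.locked[k] {
			return false
		}
	}
	for _, k := range keys {
		l.locked[k] = true
	}
	return true
}

// Release frees the keys after the command finishes.
func (l *Latches) Release(keys []string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, k := range keys {
		delete(l.locked, k)
	}
}

func main() {
	l := NewLatches()
	fmt.Println(l.Acquire([]string{"a", "b"})) // true
	fmt.Println(l.Acquire([]string{"b"}))      // false: "b" is latched
	l.Release([]string{"a", "b"})
	fmt.Println(l.Acquire([]string{"b"})) // true
}
```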
7.7 Frequently Asked Interview Questions
Q: What is the role of the Primary Key in Percolator? A: The primary key is the single source of truth for a transaction. Only the primary key’s write record needs to be committed synchronously; secondary keys can be committed asynchronously. Checking the primary key’s lock/write record is sufficient to determine the status of the entire transaction.
Q: Why does CfWrite use commitTS instead of startTS in the
key? A: Reads need to find the latest commit with
commitTS <= readTS. Encoding by startTS
would make that lookup inefficient. With commitTS, a single
Seek lands on the first write record at or before
readTS.
Q: Snapshot Isolation vs Serializable? A: SI allows Write Skew anomalies (two transactions read from the same snapshot and each writes a different key, but their combined effect violates application invariants). TinyKV implements SI, not full Serializability.
Q: What happens if the client crashes after Prewrite succeeds
but before Commit? A: Any transaction that encounters the lock
will call CheckTxnStatus on the primary key. If the lock’s
TTL has expired, the lock is rolled back. If the primary has already
been committed, the commit is carried through.
Q: Why doesn’t CfLock’s key include a timestamp? A:
Only one transaction can lock a key at a time. The startTS
is already stored inside the lock value, so there is no need to encode
it in the key.
Part 8: System Design Interview Questions
8.1 Complete Write Request Path
```
1. Client → gRPC → Server.KvPrewrite()
2. Acquire latches for all keys in the mutation
3. MvccTxn runs conflict checks and buffers writes (CfDefault + CfLock)
4. Storage.Write() proposes the batch as a Raft command
5. Raft replicates → commit → apply to Badger → callback returns to the client
```
8.2 Region Split Flow
```
Region [a, z) exceeds size threshold
→ split checker picks a split key, e.g. "m" → propose AdminCmdType_Split
→ on apply: original region shrinks to [a, m), new region [m, z) is created locally
→ both regions bump Version in RegionEpoch; the new region's peers elect a leader
```
8.3 Crash Recovery
```
Node restart:
1. Load RaftLocalState / RaftApplyState / RegionLocalState from RaftDB and KVDB
2. Recreate a peer for every region not marked Tombstone
3. Re-apply committed log entries from AppliedIndex + 1 to Commit
4. Rejoin the Raft groups; snapshots fill any gap already truncated from the log
```
8.4 Linearizability Guarantee
```
Writes: Raft ensures majority acknowledgement → writes are durable and ordered
Reads:  also proposed through Raft (no ReadIndex/LeaseRead in TinyKV) →
        a read observes every write committed before it → linearizability
```
Part 9: Key Source Files
| Module | File | Core Content |
|---|---|---|
| Raft | raft/raft.go | election, log replication, heartbeat, snapshot |
| Raft | raft/log.go | RaftLog management, unstable/committed/applied |
| Raft | raft/rawnode.go | Ready interface, Advance, ProposeConfChange |
| Raftstore | kv/raftstore/peer_msg_handler.go | message handling, entry application, split/confchange |
| Raftstore | kv/raftstore/peer_storage.go | persistence, snapshot generation and application |
| Raftstore | kv/raftstore/raft_worker.go | single-threaded message loop |
| MVCC | kv/transaction/mvcc/transaction.go | MvccTxn, EncodeKey, GetValue |
| MVCC | kv/transaction/mvcc/scanner.go | range scan |
| Server | kv/server/server.go | KvGet / KvPrewrite / KvCommit / KvScan |
| Scheduler | scheduler/server/cluster.go | processRegionHeartbeat |
| Scheduler | scheduler/server/schedulers/balance_region.go | Balance Region scheduling |
Part 10: Papers and References
| Topic | Paper / Resource |
|---|---|
| Raft | In Search of an Understandable Consensus Algorithm |
| Percolator | Large-scale Incremental Processing Using Distributed Transactions and Notifications |
| Spanner | Spanner: Google’s Globally-Distributed Database |
| TiKV Design | TiKV Deep Dive |
| LSM-Tree | The Log-Structured Merge-Tree |



