Interview Script (Memorize This First)

30-Second Version (Opening)

TinyKV uses Multi-Raft to achieve strong-consistency replication, with Badger as the persistence layer. On top of that it implements region split, membership change, snapshot transfer, scheduler-based load balancing, and distributed transactions via MVCC + Percolator. I focused on stability under split/recover and conf change scenarios, ensuring requests never time out in complex failure cases and nodes can always recover after a restart.

3-Minute Version (Expanded)

  1. Start with the layered architecture: Server → Raftstore → Raft → Storage/MVCC → Scheduler.
  2. Walk through the consistency pipeline: propose → replicate → commit → apply → callback.
  3. Cover complex scenarios: leader transfer, conf change, split, snapshot, crash recovery.
  4. End with a problem you solved end-to-end: symptom, root cause, fix, validation.

Deep-Dive Answer Order (Stay Focused)

  1. State the invariants first (linearizability, monotone epoch, log ordering).
  2. Explain the key data structures (proposal, Progress, RegionEpoch).
  3. Walk through failure paths (proposal dropped, stale/tombstone, snapshot/recover).
  4. Describe verification: high-concurrency + unreliable network + restart/partition regression tests.

Architecture Overview

TinyKV is an educational distributed KV storage system modeled after TiKV + PD from the TiDB ecosystem. It consists of four core modules:

Client (TinySQL)
  | gRPC (kvrpcpb)
  v
TinyKV Server (kv/server/)
  |
  +-- Storage Layer
  |     +-- StandaloneStorage (Project 1)
  |     +-- RaftStorage (Project 2-3)
  |           |
  |           +-- Raftstore (kv/raftstore/)
  |           |     drives Raft consensus
  |           v
  |         Raft Module (raft/)
  |
  +-- Transaction Layer (Project 4)
        +-- MVCC (kv/transaction/mvcc/)
        +-- Percolator 2PC

TinyScheduler (scheduler/)
  |  analogous to PD: cluster management and load balancing
  +-- Region Heartbeat handling
  +-- Balance Leader/Region scheduling

Part 1: Storage Engine Basics (Project 1)

1.1 Badger Engine

  • LSM-Tree based KV store (similar to RocksDB)
  • Column Families emulated via key prefix: ${cf}_${key}
  • Three CFs: CfDefault (user data), CfLock (transaction locks), CfWrite (commit records)
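
The prefix scheme can be sketched as follows (the helper name is illustrative; TinyKV's engine_util package provides equivalents):

```go
package main

import "fmt"

// KeyWithCF emulates a column family on a flat key space by
// prefixing the key with the CF name, i.e. ${cf}_${key}.
// (Illustrative helper; TinyKV's engine_util has equivalents.)
func KeyWithCF(cf string, key []byte) []byte {
	return append([]byte(cf+"_"), key...)
}

func main() {
	fmt.Printf("%s\n", KeyWithCF("lock", []byte("k1"))) // lock_k1
	// All keys of one CF share a prefix, so an iterator seeded at
	// "lock_" scans exactly the lock CF.
}
```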

1.2 Storage Interface Design

type Storage interface {
    Write(ctx *kvrpcpb.Context, batch []Modify) error
    Reader(ctx *kvrpcpb.Context) (StorageReader, error)
}

type StorageReader interface {
    GetCF(cf string, key []byte) ([]byte, error)
    IterCF(cf string) engine_util.DBIterator
    Close()
}

1.3 Interview Points

  • Why Column Families? Logical isolation of different data types (locks, values, write records) improves scan efficiency.
  • Badger vs RocksDB? Badger is Go-native and pairs its LSM-Tree with a separate value log (WiscKey-style key/value separation), which reduces write amplification for large values.

Part 2: Raft Consensus Algorithm (Project 2A)

2.1 Core Data Structures

type Raft struct {
    Term  uint64    // current term
    Vote  uint64    // who we voted for
    Lead  uint64    // current leader
    State StateType // Follower / Candidate / Leader

    RaftLog *RaftLog             // log management
    Prs     map[uint64]*Progress // replication progress per peer {Match, Next}
    votes   map[uint64]bool      // election vote tracking

    electionTimeout   int // base election timeout
    heartbeatTimeout  int // heartbeat interval
    randomizedTimeout int // randomized election timeout [election, 2*election)
    electionElapsed   int // election timer
    heartbeatElapsed  int // heartbeat timer

    leadTransferee   uint64 // target of an in-progress leader transfer
    PendingConfIndex uint64 // prevents concurrent conf changes
}

2.2 Leader Election Algorithm

Flow:
  1. Follower's election timeout fires → becomes Candidate
  2. Term++, votes for itself
  3. Sends RequestVote{Term, LastIndex, LogTerm} to all peers
  4. Receives majority votes → becomes Leader
  5. Leader immediately appends a no-op entry (a leader may only commit current-term entries directly, so the no-op lets it commit leftover prior-term entries indirectly)

Voting Rules:
  • Only one vote per term
  • Candidate's log must be at least as up-to-date as the voter's (compare LastLogTerm, then LastLogIndex on tie)
  • Receiving a message with a higher term → revert to Follower

Randomized Timeout: randomizedTimeout = electionTimeout + rand(0, electionTimeout)
  • Prevents livelock (multiple nodes starting elections simultaneously and repeatedly splitting the vote)

2.3 Log Replication Algorithm

Leader tracks progress per follower:

type Progress struct {
    Match uint64 // highest index confirmed replicated
    Next  uint64 // next index to send
}

AppendEntries Flow:
  1. Leader sends {prevLogIndex, prevLogTerm, Entries[], Commit}
  2. Follower checks: does the entry at prevLogIndex have term == prevLogTerm?
  3. Match → append entries, update committed
  4. Mismatch → reject; leader decrements Next and retries (log backtracking)

Commit Rule (Critical!):

func maybeCommit():
    1. Collect Match values from all peers
    2. Sort and find the median N (highest index replicated by a majority)
    3. Only commit if log[N].term == currentTerm
    // Never commit entries from a previous term directly!
    // This is a key difference from Paxos.
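
The majority-index computation above can be sketched like this (names are illustrative; the current-term check is left as a comment):

```go
package main

import (
	"fmt"
	"sort"
)

// majorityMatch returns the highest log index that a majority of
// peers have replicated, given every peer's Match value (including
// the leader's own last index). A real maybeCommit additionally
// requires log[N].Term == currentTerm before advancing commit.
func majorityMatch(match []uint64) uint64 {
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	// In descending order, position n/2 is covered by n/2+1 peers — a majority.
	return sorted[len(sorted)/2]
}

func main() {
	fmt.Println(majorityMatch([]uint64{5, 3, 4}))       // 4 (peers at 5 and 4 form a majority)
	fmt.Println(majorityMatch([]uint64{9, 9, 2, 2, 1})) // 2 (four of five peers reach index 2)
}
```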

2.4 RaftLog Structure

snapshot/first.....applied....committed....stabled.....last
--------|------------------------------------------------|
                      log entries

  • applied: highest index applied to the state machine
  • committed: highest index confirmed by majority
  • stabled: highest index persisted to storage
  • invariant: applied <= committed <= stabled <= last

Key Functions:
  • unstableEntries(): entries after stabled (need to be persisted)
  • nextEnts(): entries from applied+1 to committed (need to be applied to the state machine)
  • maybeCompact(): discard entries already covered by a snapshot

2.5 RawNode Interface — Application-Driven Loop

// Application loop:
for {
    rn.Tick()                      // advance logical clock
    if rn.HasReady() {
        rd := rn.Ready()           // get all pending changes
        persist(rd.Entries)        // persist log entries
        send(rd.Messages)          // send network messages
        apply(rd.CommittedEntries) // apply to state machine
        rn.Advance(rd)             // notify Raft that processing is done
    }
}

Ready Struct:

type Ready struct {
    SoftState        *SoftState   // leader / state changes
    HardState        pb.HardState // term / vote / commit changes
    Entries          []pb.Entry   // entries to persist
    CommittedEntries []pb.Entry   // committed entries to apply
    Snapshot         pb.Snapshot  // snapshot to install
    Messages         []pb.Message // messages to send
}

2.6 Frequently Asked Interview Questions

Q: Why does a newly elected leader append a no-op entry? A: To prevent the “phantom log” problem. A new leader may hold uncommitted entries from the previous term. Raft only allows committing entries from the current term, so a no-op entry in the current term lets the leader indirectly commit those prior-term entries.

Q: How does Raft guarantee log consistency? A: Via the Log Matching Property: (1) if two nodes have entries with the same index and term, those entries are identical; (2) if the same holds, all preceding entries are also identical. The prevLogTerm check in AppendEntries enforces this.

Q: Why can’t prior-term entries be committed directly? A: See Figure 8 in the Raft paper. Committing an entry from a previous term directly can cause a committed entry to be overwritten later, violating State Machine Safety.


Part 3: Raft KV Layer (Project 2B)

3.1 Overall Architecture

Client → gRPC → RaftstoreRouter → router.send(regionID, MsgRaftCmd)
    |
    peerSender channel (buf: 40960)
    |
    raftWorker.run() [single-threaded]
    |
    peerMsgHandler
    ├── proposeRaftCommand() → Raft.Propose()
    └── HandleRaftReady()
        ├── SaveReadyState() [persist]
        ├── Send(messages)   [network]
        └── apply entries    [state machine]

3.2 Critical Persistent State

Two Badger instances:

Engine | Stores                                        | Key Format
RaftDB | Raft log + RaftLocalState                     | RaftLogKey(regionID, index)
KVDB   | User data + RegionLocalState + RaftApplyState | user key / ApplyStateKey(regionID)

Three key metadata structures:

RaftLocalState:   {HardState{Term, Vote, Commit}, LastIndex, LastTerm}
RaftApplyState:   {AppliedIndex, TruncatedState{Index, Term}}
RegionLocalState: {Region{Id, StartKey, EndKey, Peers[], RegionEpoch}, State}

3.3 Proposal Matching Mechanism

type proposal struct {
    index uint64            // expected log index
    term  uint64            // expected term
    cb    *message.Callback // response callback
}

On propose: record (nextIndex, currentTerm, callback)
On apply: match the entry by (term, index); if matched, invoke the callback

Stale detection:
  • entry.Term > proposal.term → the proposal was overwritten by a new leader → ErrStaleCommand
  • entry.Index > proposal.index → same conclusion
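
A sketch of the matching logic at apply time (function and type names are illustrative, not TinyKV's exact code):

```go
package main

import "fmt"

type proposal struct {
	index, term uint64
}

// classify decides what to do with the oldest pending proposal when
// an entry at (entryIndex, entryTerm) is applied: invoke its callback,
// fail it as stale, or keep waiting for a later entry.
func classify(p proposal, entryIndex, entryTerm uint64) string {
	switch {
	case entryTerm == p.term && entryIndex == p.index:
		return "matched" // respond via the callback
	case entryTerm > p.term || entryIndex > p.index:
		return "stale" // overwritten/skipped by a new leader → ErrStaleCommand
	default:
		return "pending" // entry precedes the proposal; keep waiting
	}
}

func main() {
	p := proposal{index: 7, term: 3}
	fmt.Println(classify(p, 7, 3)) // matched
	fmt.Println(classify(p, 7, 4)) // stale (a new leader's entry took our slot)
	fmt.Println(classify(p, 6, 3)) // pending
}
```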

3.4 Interview Points

Q: Do reads also go through Raft? A: In TinyKV, yes (ReadIndex optimization is not implemented). Get requests are submitted as Raft entries to guarantee linearizability. Real systems typically use ReadIndex or LeaseRead to avoid that overhead.

Q: Why is raftWorker single-threaded? A: All peer messages are processed in one goroutine, eliminating lock contention. Lock-free message passing is achieved via channels.


Part 4: Snapshot Mechanism (Project 2C)

4.1 Snapshot Trigger

Periodic tick → onRaftGCLogTick()
  → if appliedIndex - truncatedIndex > RaftLogGcCountLimit
  → propose CompactLogRequest{CompactIndex, CompactTerm}
  → truncate the log + schedule the raftlog-gc worker to clean old entries

4.2 Snapshot Generation and Sending

Leader detects follower.Next < firstIndex (log already truncated)
  → calls PeerStorage.Snapshot()
  → asynchronously sends RegionTaskGen to the region worker
  → returns ErrSnapshotTemporarilyUnavailable (Raft will retry)
  → a later Ready() cycle picks up the snapshot and sends MsgSnapshot

4.3 Snapshot Application Flow

1
2
3
4
5
6
7
8
9
Follower receives MsgSnapshot
→ checkSnapshot() validates (peer in list, no overlapping region, file exists)
→ Raft layer: discard entries covered by snapshot, update applied/committed/stabled
→ Raftstore layer: ApplySnapshot()
→ clear old metadata
→ delete data outside the new region's key range
→ update RaftLocalState, RaftApplyState, RegionLocalState
→ send RegionTaskApply (block until complete)
→ update storeMeta (regionRanges B-Tree)

4.4 Interview Points

Q: Why is snapshot generation asynchronous? A: Generating a snapshot requires scanning the entire region’s data, which can be slow. Asynchronous generation avoids blocking the Raft main loop. ErrSnapshotTemporarilyUnavailable signals Raft to retry on the next tick.


Part 5: Multi-Raft (Project 3A / 3B)

5.1 Leader Transfer

1. Leader receives MsgTransferLeader{Target}
2. Check whether the target's log is up-to-date (Match == LastIndex)
3. If not → send MsgAppend to help the target catch up
4. If yes → send MsgTimeoutNow to the target
5. Target immediately starts an election (no timeout needed)
6. Safety: cancel the transfer after 2*electionTimeout if not complete

Key distinction: Leader transfer bypasses the Raft log; it calls RawNode.TransferLeader() directly.

5.2 Configuration Change (Membership Change)

// Propose a conf change
RawNode.ProposeConfChange(ConfChange{ChangeType, NodeId, Context})
  → creates an EntryConfChange entry
  → replicated and committed through Raft

// Apply the conf change
entry committed → applyConfChangeEntry()
  → RawNode.ApplyConfChange(cc) // update Raft's internal Prs
  → update Region.Peers + RegionEpoch.ConfVer++
  → persist RegionLocalState
  → if removing self → destroyPeer()

Safety: one node changed at a time (no Joint Consensus); PendingConfIndex prevents concurrent changes.

5.3 Region Split

SplitCheckTick → scan region size → find the split key when the threshold is exceeded
  → MsgSplitRegion → ask the Scheduler to allocate new Region/Peer IDs
  → propose AdminRequest{Split} through Raft

On apply:
  Original region: [startKey, endKey) → [startKey, splitKey)
  New region:      [splitKey, endKey)  (new Raft group)
  RegionEpoch.Version++

5.4 RegionEpoch Mechanism

type RegionEpoch struct {
    ConfVer uint64 // incremented on each AddPeer/RemovePeer
    Version uint64 // incremented on each Split/Merge
}

Purpose: detect stale requests. The epoch in a client request must match the region’s current epoch.
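
The staleness check can be sketched as follows (helper name is illustrative; TinyKV's util package has an equivalent check):

```go
package main

import "fmt"

// RegionEpoch mirrors the struct above.
type RegionEpoch struct {
	ConfVer, Version uint64
}

// isEpochStale reports whether a request's epoch lags the region's
// current epoch, i.e. the region was split/merged (Version) or had a
// membership change (ConfVer) after the client cached its route.
func isEpochStale(req, current RegionEpoch) bool {
	return req.Version < current.Version || req.ConfVer < current.ConfVer
}

func main() {
	cur := RegionEpoch{ConfVer: 3, Version: 5}
	fmt.Println(isEpochStale(RegionEpoch{ConfVer: 3, Version: 4}, cur)) // true: region split since
	fmt.Println(isEpochStale(RegionEpoch{ConfVer: 3, Version: 5}, cur)) // false: epochs match
}
```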

5.5 Interview Points

Q: Why not use Joint Consensus? A: TinyKV simplifies membership changes to one node at a time. Joint Consensus allows multiple simultaneous changes but is more complex to implement. Single-step safety is guaranteed by “at most one change at a time.”

Q: How are replicas created in the new region after a split? A: Followers create the new peer locally when they apply the split entry. If a follower is absent (e.g., partitioned), the leader’s heartbeat causes the store to detect an unknown region and call replicatePeer() to create an empty peer, which then receives data via a snapshot.


Part 6: Scheduler Load Balancing (Project 3C)

6.1 processRegionHeartbeat

func processRegionHeartbeat(region) error:
  1. Fetch the existing region (by ID)
  2. Version comparison (reject stale heartbeats):
     - newVersion < existVersion → reject
     - newVersion == existVersion && newConfVer < existConfVer → reject
  3. For new regions, check for key-range overlaps
  4. PutRegion to update the region info
  5. updateStoreStatus to refresh per-store statistics

6.2 Balance Region Scheduler

Schedule(cluster):
  1. Collect all healthy stores (Up && DownTime < MaxStoreDownTime)
  2. Sort by RegionSize descending
  3. Starting from the largest store, pick a region:
     - Prefer pending region → follower → leader
     - Skip regions with fewer than MaxReplicas peers
  4. Starting from the smallest store, find a target
     (must not already hold a peer for this region)
  5. Verify the size gap is large enough:
     source.RegionSize - target.RegionSize >= 2 * region.ApproximateSize
  6. CreateMovePeerOperator(source → target)

6.3 Interview Points

Q: Why the 2×regionSize tolerance check? A: To prevent oscillation. If the gap is too small, the move might immediately need to be reversed, causing an infinite loop.
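
The arithmetic behind the tolerance: moving a region of size r shrinks the gap by exactly 2r (the source loses r, the target gains r), so requiring a gap of at least 2r guarantees the source never ends up smaller than the target. A sketch (function name is illustrative):

```go
package main

import "fmt"

// shouldMove applies the anti-oscillation tolerance from 6.2: move a
// region of approximate size r only while sourceSize - targetSize >= 2r.
// The move itself shrinks the gap by exactly 2r, so the source cannot
// drop below the target and trigger a reverse move.
func shouldMove(sourceSize, targetSize, regionSize uint64) bool {
	return sourceSize >= targetSize && sourceSize-targetSize >= 2*regionSize
}

func main() {
	fmt.Println(shouldMove(100, 60, 10)) // true: gap 40 >= 20; after the move it's 90 vs 70
	fmt.Println(shouldMove(100, 90, 10)) // false: the move would flip the imbalance (90 vs 100)
}
```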

Q: How does the scheduler guarantee safe scheduling? A: Via RegionEpoch checks. An operator records the epoch at creation time; if the epoch has changed when the operator executes (region was split or conf-changed), the operator is cancelled.


Part 7: Distributed Transactions (Project 4)

7.1 MVCC — Multi-Version Concurrency Control

How the three CFs cooperate:

CF        | Key Format                   | Value                        | Purpose
CfDefault | EncodeKey(userKey, startTS)  | actual data                  | stores all versions of a value
CfLock    | userKey (raw)                | Lock{Primary, Ts, Ttl, Kind} | transaction lock (at most one per key)
CfWrite   | EncodeKey(userKey, commitTS) | Write{StartTS, Kind}         | commit records

Key Encoding (critical!):

func EncodeKey(key []byte, ts uint64) []byte {
    encodedKey := codec.EncodeBytes(key) // order-preserving encoding
    newKey := append(encodedKey, make([]byte, 8)...)
    binary.BigEndian.PutUint64(newKey[len(encodedKey):], ^ts) // bitwise NOT!
    return newKey
}

  • ^ts makes timestamps sort in descending order (newest version first)
  • Seek(EncodeKey(key, startTS)) lands directly on the latest version with commitTS <= startTS

7.2 Percolator Two-Phase Commit

Phase 1: Prewrite (lock + write data)

For each mutation:
  1. Check for a write conflict: MostRecentWrite(key).commitTs > startTs → WriteConflict
  2. Check for a lock conflict: GetLock(key) exists → KeyLocked
  3. PutValue(key, value) → CfDefault[key+startTs] = value
  4. PutLock(key, Lock{Primary, startTs, TTL, Kind}) → CfLock[key] = lock

Phase 2: Commit (commit + unlock)

For each key:
  1. GetLock(key) — check the lock
  2. lock == nil:
     - Check CurrentWrite: already committed → idempotent return
                           already rolled back → error
     - No write record → the prewrite was lost; skip (no-op)
  3. lock.Ts != startTs → the lock belongs to another transaction → error
  4. PutWrite(key, commitTs, Write{startTs, kind}) → CfWrite[key+commitTs]
  5. DeleteLock(key) → remove CfLock[key]

7.3 Read (KvGet)

1. GetLock(key): if a lock exists with lock.Ts <= readTs → blocked (return a Locked error)
2. Seek CfWrite at EncodeKey(key, readTs)
3. Find the first Write record with commitTs <= readTs
4. If Write.Kind == Delete → return NotFound
5. Otherwise use Write.StartTS to read the actual value from CfDefault

7.4 Scan (KvScan)

type Scanner struct {
    Txn  *MvccTxn
    Iter DBIterator // iterates over CfWrite
}

func Next() (key, value, error):
  1. Seek to EncodeKey(currentKey, startTS)
  2. Decode the userKey; skip older versions of the same key
  3. Read the value from CfDefault using Write.StartTS
  4. Advance to the next distinct userKey

7.5 Transaction Cleanup

CheckTxnStatus (lock TTL check):

1. Check the write and lock for the primary key
2. Already committed → return commitTs
3. Already rolled back → return NoAction
4. Lock present → check whether the TTL has expired
   - Expired: clear the lock + value, write a rollback record, return TTLExpireRollback
   - Not expired: return NoAction
5. No lock and no write record → write a rollback record (guard against a late-arriving prewrite)
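
The TTL comparison in step 4 is done in physical time extracted from the TSO timestamp. A sketch, assuming the TiKV convention of packing physical milliseconds into the high bits with 18 logical bits (TinyKV exposes a similar mvcc.PhysicalTime helper; the shift value here is an assumption):

```go
package main

import "fmt"

// The TSO packs physical milliseconds into the high bits of a
// timestamp; 18 logical bits is the TiKV convention (assumed here).
const tsoPhysicalShiftBits = 18

func physicalTime(ts uint64) uint64 { return ts >> tsoPhysicalShiftBits }

// lockExpired is the step-4 check: the lock is reclaimable once at
// least lockTTL milliseconds of physical time separate the lock's
// start timestamp from the current timestamp.
func lockExpired(lockStartTS, currentTS, lockTTL uint64) bool {
	return physicalTime(currentTS) >= physicalTime(lockStartTS)+lockTTL
}

func main() {
	start := uint64(1000) << tsoPhysicalShiftBits
	now := uint64(1300) << tsoPhysicalShiftBits
	fmt.Println(lockExpired(start, now, 200)) // true: 300ms elapsed, TTL 200ms
	fmt.Println(lockExpired(start, now, 500)) // false: still within TTL
}
```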

BatchRollback:

1
2
1. Check whether each key is already committed → if so, refuse rollback
2. DeleteLock + DeleteValue + PutWrite(Rollback)

ResolveLock:

1
2
3
1. Scan CfLock for all locks with lock.Ts == startTs
2. commitVersion == 0 → rollback all
3. commitVersion > 0 → commit all

7.6 Latch Mechanism

type Latches struct {
    Locks map[string]*sync.Mutex
}

  • An in-memory mutex table that serializes in-process operations on the same key
  • Different from MVCC locks: a latch is local and short-lived; an MVCC lock is persistent and distributed

7.7 Frequently Asked Interview Questions

Q: What is the role of the Primary Key in Percolator? A: The primary key is the single source of truth for a transaction. Only the primary key’s write record needs to be committed synchronously; secondary keys can be committed asynchronously. Checking the primary key’s lock/write record is sufficient to determine the status of the entire transaction.

Q: Why does CfWrite use commitTS instead of startTS in the key? A: Reads need to find the latest commit with commitTS <= readTS. Encoding by startTS would make that lookup inefficient. With commitTS, a single Seek lands on the first write record at or before readTS.

Q: Snapshot Isolation vs Serializable? A: SI allows Write Skew anomalies (two transactions read from the same snapshot and each writes a different key, but their combined effect violates application invariants). TinyKV implements SI, not full Serializability.

Q: What happens if the client crashes after Prewrite succeeds but before Commit? A: Any transaction that encounters the lock will call CheckTxnStatus on the primary key. If the lock’s TTL has expired, the lock is rolled back. If the primary has already been committed, the commit is carried through.

Q: Why doesn’t CfLock’s key include a timestamp? A: Only one transaction can lock a key at a time. The startTS is already stored inside the lock value, so there is no need to encode it in the key.


Part 8: System Design Interview Questions

8.1 Complete Write Request Path

1.  Client → gRPC → Server.KvPrewrite()
2.  Server acquires the Latch → gets a Reader → creates an MvccTxn
3.  Conflict detection (write conflict + lock conflict)
4.  PutValue + PutLock → write to Badger (via RaftStorage)
5.  RaftStorage → Raftstore → Raft Propose
6.  Raft replicates to a majority → Commit
7.  HandleRaftReady → apply committed entries → write to KVDB
8.  Callback to the Client → Prewrite succeeds
9.  Client → gRPC → Server.KvCommit()
10. PutWrite + DeleteLock → Commit succeeds

8.2 Region Split Flow

1
2
3
4
5
6
7
Region [a, z) exceeds size threshold
→ SplitChecker finds split key = "m"
→ propose Split{splitKey="m", newRegionID, newPeerIDs}
→ replicated and committed through Raft
→ Apply: original region [a, m), new region [m, z)
→ two independent Raft groups run in parallel
→ notify Scheduler to update routing table

8.3 Crash Recovery

Node restart:
  1. Load RegionLocalState from KVDB (all region metadata)
  2. Load RaftLocalState from RaftDB (HardState)
  3. Load RaftApplyState from KVDB (how far we've applied)
  4. Rebuild the RaftLog starting from stabled; re-apply entries after applied
  5. Rejoin the cluster via Raft heartbeats

8.4 Linearizability Guarantee

1
2
3
4
Writes: Raft ensures majority acknowledgement → writes are durable
Reads: also go through Raft → guaranteed to see the latest value
(Optimization: ReadIndex — leader confirms it is still leader, then reads locally)
(Optimization: LeaseRead — read locally within the lease window)

Part 9: Key Source Files

Module    | File                                          | Core Content
Raft      | raft/raft.go                                  | election, log replication, heartbeat, snapshot
Raft      | raft/log.go                                   | RaftLog management, unstable/committed/applied
Raft      | raft/rawnode.go                               | Ready interface, Advance, ProposeConfChange
Raftstore | kv/raftstore/peer_msg_handler.go              | message handling, entry application, split/confchange
Raftstore | kv/raftstore/peer_storage.go                  | persistence, snapshot generation and application
Raftstore | kv/raftstore/raft_worker.go                   | single-threaded message loop
MVCC      | kv/transaction/mvcc/transaction.go            | MvccTxn, EncodeKey, GetValue
MVCC      | kv/transaction/mvcc/scanner.go                | range scan
Server    | kv/server/server.go                           | KvGet / KvPrewrite / KvCommit / KvScan
Scheduler | scheduler/server/cluster.go                   | processRegionHeartbeat
Scheduler | scheduler/server/schedulers/balance_region.go | Balance Region scheduling

Part 10: Papers and References

Topic       | Paper / Resource
Raft        | In Search of an Understandable Consensus Algorithm
Percolator  | Large-scale Incremental Processing Using Distributed Transactions and Notifications
Spanner     | Spanner: Google's Globally-Distributed Database
TiKV Design | TiKV Deep Dive
LSM-Tree    | The Log-Structured Merge-Tree