TinyKV: A Raft-Based KV Storage System
Interview Script (Memorize This First)
30-Second Version (Opening)
TinyKV uses Multi-Raft to achieve strong-consistency replication, with Badger as the persistence layer. On top of that it implements region split, membership change, snapshot transfer, scheduler-based load balancing, and distributed transactions via MVCC + Percolator. I focused on stability under split/recover and conf change scenarios, ensuring requests never time out in complex failure cases and nodes can always recover after a restart.
3-Minute Version (Expanded)
- Start with the layered architecture: Server → Raftstore → Raft → Storage/MVCC → Scheduler.
- Walk through the consistency pipeline: propose → replicate → commit → apply → callback.
- Cover complex scenarios: leader transfer, conf change, split, snapshot, crash recovery.
- End with a problem you solved end-to-end: symptom, root cause, fix, validation.
Deep-Dive Answer Order (Stay Focused)
- State the invariants first (linearizability, monotone epoch, log ordering).
- Explain the key data structures (proposal, Progress, RegionEpoch).
- Walk through failure paths (proposal dropped, stale/tombstone, snapshot/recover).
- Describe verification: high-concurrency + unreliable network + restart/partition regression tests.
Architecture Overview
TinyKV is an educational distributed KV storage system modeled after TiKV + PD from the TiDB ecosystem. It consists of four core modules:
```
Client (TinySQL)
     │ gRPC
     ▼
TinyKV Server ─ Raftstore ─ Raft ─ Storage/MVCC (Badger)
     │ region heartbeats / operators
     ▼
Scheduler (PD)
```
Part 1: Storage Engine Basics (Project 1)
1.1 Badger Engine
- LSM-Tree based KV store (similar to RocksDB)
- Column Families emulated via key prefix: `${cf}_${key}`
- Three CFs: CfDefault (user data), CfLock (transaction locks), CfWrite (commit records)
1.2 Storage Interface Design
```go
// Reconstructed from the TinyKV course skeleton; the original block was truncated here.
type Storage interface {
	Start() error
	Stop() error
	Write(ctx *kvrpcpb.Context, batch []Modify) error
	Reader(ctx *kvrpcpb.Context) (StorageReader, error)
}
```
1.3 Interview Points
- Why Column Families? Logical isolation of different data types (locks, values, write records) improves scan efficiency.
- Badger vs RocksDB? Badger is Go-native (no cgo), and separates large values into a value log (WiscKey-style), which reduces LSM-Tree write amplification.
Part 2: Raft Consensus Algorithm (Project 2A)
2.1 Core Data Structures
```go
// Core fields, abridged from the TinyKV skeleton (the original block was truncated):
type Raft struct {
	id      uint64
	Term    uint64
	Vote    uint64
	RaftLog *RaftLog
	Prs     map[uint64]*Progress // per-peer replication progress (leader only)
	State   StateType            // Follower / Candidate / Leader
	votes   map[uint64]bool      // votes collected as candidate
	msgs    []pb.Message         // outbox, drained via Ready
	Lead    uint64
	heartbeatTimeout, electionTimeout int
	heartbeatElapsed, electionElapsed int
	leadTransferee   uint64 // leader-transfer target, if any
	PendingConfIndex uint64 // guards one-at-a-time conf changes
}
```
2.2 Leader Election Algorithm
Flow:
1. Follower's election timeout fires → becomes Candidate
2. Term++, votes for itself
3. Sends RequestVote{Term, LastIndex, LogTerm} to all peers
4. Receives majority votes → becomes Leader
5. Leader immediately appends a no-op entry (so uncommitted prior-term entries can be committed indirectly)

Voting Rules:
- Only one vote per term
- Candidate's log must be at least as up-to-date as the voter's (compare LastLogTerm, then LastLogIndex on tie)
- Receiving a message with a higher term → revert to Follower

Randomized Timeout: randomizedTimeout = electionTimeout + rand(0, electionTimeout)
- Prevents livelock (multiple nodes starting elections simultaneously)
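The randomized-timeout formula is a one-liner; this minimal sketch assumes timeouts are counted in abstract ticks, as they are in TinyKV's tick-driven raft:

```go
package main

import (
	"fmt"
	"math/rand"
)

// resetRandomizedTimeout draws a fresh election timeout in
// [electionTimeout, 2*electionTimeout), per the formula above. It is
// re-drawn on every state change so peers rarely time out together.
func resetRandomizedTimeout(electionTimeout int) int {
	return electionTimeout + rand.Intn(electionTimeout)
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(resetRandomizedTimeout(10)) // each value lies in [10, 20)
	}
}
```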
2.3 Log Replication Algorithm
Leader tracks progress per follower:

```go
type Progress struct {
	Match uint64 // highest index confirmed replicated
	Next  uint64 // next index to send
}
```
AppendEntries Flow:
1. Leader sends {prevLogIndex, prevLogTerm, Entries[], Commit}
2. Follower checks: does the entry at prevLogIndex have term == prevLogTerm?
3. Match → append entries, update committed
4. Mismatch → reject; leader decrements Next and retries (log backtracking)
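The follower-side check in step 2 can be modeled on a plain array of terms; `handleAppend` and its arguments are illustrative, not TinyKV's actual API:

```go
package main

import "fmt"

// handleAppend models the consistency check: accept the entries only if
// the local entry at prevLogIndex carries term prevLogTerm.
// terms[i] holds the term of the entry at index i (index 0 is a sentinel).
func handleAppend(terms []uint64, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex >= uint64(len(terms)) {
		return false // log too short → reject; leader backs Next off and retries
	}
	return terms[prevLogIndex] == prevLogTerm
}

func main() {
	terms := []uint64{0, 1, 1, 2}
	fmt.Println(handleAppend(terms, 3, 2)) // true: terms match → append
	fmt.Println(handleAppend(terms, 3, 1)) // false: term conflict → reject
	fmt.Println(handleAppend(terms, 9, 2)) // false: follower is missing entries
}
```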
Commit Rule (Critical!):

```
func maybeCommit():
    1. Collect Match values from all peers
    2. Sort and find the median N (highest index replicated by a majority)
    3. Only commit if log[N].term == currentTerm
    // Never commit entries from a previous term directly!
    // This is a key difference from Paxos.
```
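The rule can be made concrete as a self-contained sketch: `matches` includes the leader's own last index, `logTerm` stands in for a real log lookup, and `sorted[(len-1)/2]` picks the highest index replicated on a majority (the names here are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// maybeCommit returns the new commit index. It only advances commit when
// the majority-replicated index holds an entry from the CURRENT term —
// the Figure 8 rule from the Raft paper.
func maybeCommit(matches []uint64, logTerm map[uint64]uint64, currentTerm, committed uint64) uint64 {
	sorted := append([]uint64(nil), matches...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Highest index replicated on a majority: position len-majority == (len-1)/2.
	n := sorted[(len(sorted)-1)/2]
	if n > committed && logTerm[n] == currentTerm {
		return n
	}
	return committed
}

func main() {
	terms := map[uint64]uint64{4: 2, 5: 3}
	// Matches {5,5,4}: index 5 is majority-replicated and from term 3 → commit.
	fmt.Println(maybeCommit([]uint64{5, 5, 4}, terms, 3, 3)) // 5
	// Index 4 is from older term 2: must NOT be committed directly at term 3.
	fmt.Println(maybeCommit([]uint64{4, 4, 1}, terms, 3, 3)) // 3
}
```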
2.4 RaftLog Structure
```
snapshot/first.....applied....committed....stabled.....last
```
Key Functions:
- unstableEntries(): entries after stabled (need to be persisted)
- nextEnts(): entries from applied+1 to committed (need to be applied to the state machine)
- maybeCompact(): discard entries already covered by a snapshot
2.5 RawNode Interface — Application-Driven Loop
```go
// Application loop (completed from the truncated original):
for {
	select {
	case <-ticker.C:
		rawNode.Tick() // drive election/heartbeat timers
	case msg := <-msgCh:
		rawNode.Step(msg) // feed peer messages / proposals
	}
	if rawNode.HasReady() {
		rd := rawNode.Ready()
		// persist rd.Entries + HardState, send rd.Messages,
		// apply rd.CommittedEntries, install rd.Snapshot if present
		rawNode.Advance(rd)
	}
}
```
Ready Struct:

```go
type Ready struct {
	SoftState        *SoftState   // leader / state changes
	HardState        pb.HardState // term / vote / commit changes
	Entries          []pb.Entry   // entries to persist
	CommittedEntries []pb.Entry   // committed entries to apply
	Snapshot         pb.Snapshot  // snapshot to install
	Messages         []pb.Message // messages to send
}
```
2.6 Frequently Asked Interview Questions
Q: Why does a newly elected leader append a no-op entry? A: To prevent the “phantom log” problem. A new leader may hold uncommitted entries from the previous term. Raft only allows committing entries from the current term, so a no-op entry in the current term lets the leader indirectly commit those prior-term entries.
Q: How does Raft guarantee log consistency? A: Via the Log Matching Property: (1) if entries in two logs have the same index and term, they carry the same command; (2) if entries in two logs have the same index and term, all preceding entries in both logs are also identical. The prevLogIndex/prevLogTerm check in AppendEntries enforces this inductively.
Q: Why can’t prior-term entries be committed directly? A: See Figure 8 in the Raft paper. Committing an entry from a previous term directly can cause a committed entry to be overwritten later, violating State Machine Safety.
Part 3: Raft KV Layer (Project 2B)
3.1 Overall Architecture
```
Client → gRPC → RaftstoreRouter → router.send(regionID, MsgRaftCmd)
```
3.2 Critical Persistent State
Two Badger instances:
| Engine | Stores | Key Format |
|---|---|---|
| RaftDB | Raft log + RaftLocalState | RaftLogKey(regionID, index) |
| KVDB | User data + RegionLocalState + RaftApplyState | user key / ApplyStateKey(regionID) |
Three key metadata structures:

```
RaftLocalState:   {HardState{Term, Vote, Commit}, LastIndex, LastTerm}
RaftApplyState:   {AppliedIndex, TruncatedState{Index, Term}}
RegionLocalState: {Region{Id, StartKey, EndKey, Peers[], RegionEpoch}, State}
```
3.3 Proposal Matching Mechanism
```go
// Completed from the truncated original, per the TinyKV skeleton:
type proposal struct {
	index uint64
	term  uint64
	cb    *message.Callback
}
```

- On propose: record (nextIndex, currentTerm, callback)
- On apply: match the entry by (term, index); if matched, invoke the callback

Stale detection:
- entry.Term > proposal.term → proposal was overwritten by a new leader → ErrStaleCommand
- entry.Index > proposal.index → same conclusion
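The matching logic above can be run end-to-end on a toy queue; this sketch simplifies the callback to a plain `func(string)` instead of TinyKV's `message.Callback`:

```go
package main

import "fmt"

// proposal mirrors the (index, term, callback) triple recorded at propose time.
type proposal struct {
	index uint64
	term  uint64
	cb    func(result string)
}

// matchProposal applies the stale-detection rules to the FIFO proposal
// queue: an applied entry either matches the head proposal exactly or
// proves that it was overwritten by a newer leader.
func matchProposal(props []*proposal, entryIndex, entryTerm uint64) []*proposal {
	for len(props) > 0 {
		p := props[0]
		if entryTerm > p.term || entryIndex > p.index {
			p.cb("ErrStaleCommand") // overwritten by a new leader's entry
			props = props[1:]
			continue
		}
		if entryIndex == p.index && entryTerm == p.term {
			p.cb("OK") // exact match: reply to the waiting client
			return props[1:]
		}
		return props // entry predates the head proposal; keep waiting
	}
	return props
}

func main() {
	var results []string
	props := []*proposal{{index: 5, term: 2, cb: func(r string) { results = append(results, r) }}}
	matchProposal(props, 5, 2)
	fmt.Println(results) // [OK]
}
```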
3.4 Interview Points
Q: Do reads also go through Raft? A: In TinyKV, yes (ReadIndex optimization is not implemented). Get requests are submitted as Raft entries to guarantee linearizability. Real systems typically use ReadIndex or LeaseRead to avoid that overhead.
Q: Why is raftWorker single-threaded? A: All peer messages are processed in one goroutine, eliminating lock contention. Lock-free message passing is achieved via channels.
Part 4: Snapshot Mechanism (Project 2C)
4.1 Snapshot Trigger
```
Periodic tick → onRaftGCLogTick()
```
4.2 Snapshot Generation and Sending
```
Leader detects follower.Next < firstIndex (log already truncated)
```
4.3 Snapshot Application Flow
```
Follower receives MsgSnapshot
```
4.4 Interview Points
Q: Why is snapshot generation asynchronous? A: Generating a snapshot requires scanning the entire region's data, which can be slow. Asynchronous generation avoids blocking the Raft main loop; ErrSnapshotTemporarilyUnavailable signals Raft to retry on the next tick.
Part 5: Multi-Raft (Project 3A / 3B)
5.1 Leader Transfer
```
1. Leader receives MsgTransferLeader{Target}
2. Sets leadTransferee and stops accepting new proposals
3. If the target's log is behind, replicates to it first
4. Once caught up, sends MsgTimeoutNow → target starts an election immediately
```
Key distinction: Leader transfer bypasses the Raft
log; it calls RawNode.TransferLeader() directly.
5.2 Configuration Change (Membership Change)
```go
// Propose a conf change (one node at a time)
rawNode.ProposeConfChange(pb.ConfChange{ChangeType: pb.ConfChangeType_AddNode, NodeId: 4})
```
Safety: one node changed at a time (no Joint
Consensus); PendingConfIndex prevents concurrent
changes.
5.3 Region Split
```
SplitCheckTick → scan region size → find split key when threshold exceeded
```
5.4 RegionEpoch Mechanism
```go
type RegionEpoch struct {
	ConfVer uint64 // bumped on every membership change
	Version uint64 // bumped on every split/merge
}
```
Purpose: detect stale requests. The epoch in a client request must match the region’s current epoch.
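The check itself is two comparisons. This minimal model restates the epoch struct so it is self-contained; the error value is illustrative (TinyKV returns a structured ErrEpochNotMatch):

```go
package main

import "fmt"

// RegionEpoch mirrors the metapb struct: ConfVer bumps on membership
// change, Version bumps on split/merge.
type RegionEpoch struct {
	ConfVer uint64
	Version uint64
}

// checkEpoch rejects requests carrying an epoch older than the region's
// current one, forcing the client to refresh its region cache.
func checkEpoch(req, current RegionEpoch) error {
	if req.ConfVer < current.ConfVer || req.Version < current.Version {
		return fmt.Errorf("ErrEpochNotMatch: req=%+v current=%+v", req, current)
	}
	return nil
}

func main() {
	cur := RegionEpoch{ConfVer: 2, Version: 3}
	fmt.Println(checkEpoch(RegionEpoch{ConfVer: 2, Version: 3}, cur) == nil) // true
	fmt.Println(checkEpoch(RegionEpoch{ConfVer: 2, Version: 2}, cur) == nil) // false: stale after a split
}
```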
5.5 Interview Points
Q: Why not use Joint Consensus? A: TinyKV simplifies membership changes to one node at a time. Joint Consensus allows multiple simultaneous changes but is more complex to implement. Single-step safety is guaranteed by “at most one change at a time.”
Q: How are replicas created in the new region after a
split? A: Followers create the new peer locally when they apply
the split entry. If a follower is absent (e.g., partitioned), the
leader’s heartbeat causes the store to detect an unknown region and call
replicatePeer() to create an empty peer, which then
receives data via a snapshot.
Part 6: Scheduler Load Balancing (Project 3C)
6.1 processRegionHeartbeat
```
func processRegionHeartbeat(region) error:
    1. Look up the local record for region.Id (or scan for overlapping ranges)
    2. Heartbeat epoch older than the recorded one (smaller ConfVer/Version) → reject as stale
    3. Otherwise update the region tree and the store's statistics
```
6.2 Balance Region Scheduler
```
Schedule(cluster):
    1. Sort suitable stores by region size, descending
    2. From the biggest store, pick a region to move (pending → follower → leader)
    3. Choose the suitable store with the smallest region size as target
    4. Skip unless sourceSize - targetSize > 2 × regionSize (anti-oscillation)
    5. Create a MovePeer operator: AddPeer on target, then RemovePeer on source
```
6.3 Interview Points
Q: Why the 2×regionSize tolerance check? A: To prevent oscillation. If the gap is too small, the move might immediately need to be reversed, causing an infinite loop.
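The guard itself is a one-line predicate; sizes here are abstract int64 values and the function name is illustrative:

```go
package main

import "fmt"

// shouldMove encodes the anti-oscillation guard: move a region only if
// the store-size gap exceeds twice the region's approximate size, so the
// move cannot immediately flip which store is bigger.
func shouldMove(sourceSize, targetSize, regionSize int64) bool {
	return sourceSize-targetSize > 2*regionSize
}

func main() {
	fmt.Println(shouldMove(100, 10, 20)) // true: gap 90 > 40
	fmt.Println(shouldMove(100, 70, 20)) // false: gap 30 <= 40, would oscillate
}
```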
Q: How does the scheduler guarantee safe scheduling? A: Via RegionEpoch checks. An operator records the epoch at creation time; if the epoch has changed when the operator executes (region was split or conf-changed), the operator is cancelled.
Part 7: Distributed Transactions (Project 4)
7.1 MVCC — Multi-Version Concurrency Control
How the three CFs cooperate:
| CF | Key Format | Value | Purpose |
|---|---|---|---|
| CfDefault | EncodeKey(userKey, startTS) | actual data | stores all versions of a value |
| CfLock | userKey (raw) | Lock{Primary, Ts, Ttl, Kind} | transaction lock (at most one per key) |
| CfWrite | EncodeKey(userKey, commitTS) | Write{StartTS, Kind} | commit records |
Key Encoding (critical!):

```go
func EncodeKey(key []byte, ts uint64) []byte {
	encodedKey := codec.EncodeBytes(key) // order-preserving encoding
	newKey := append(encodedKey, make([]byte, 8)...)
	binary.BigEndian.PutUint64(newKey[len(encodedKey):], ^ts) // bitwise NOT!
	return newKey
}
```

- ^ts makes timestamps sort in descending order (newest version first)
- Seek(EncodeKey(key, startTS)) lands directly on the latest version with commitTS <= startTS
7.2 Percolator Two-Phase Commit
Phase 1: Prewrite (lock + write data)

```
For each mutation:
    1. Check write conflict: MostRecentWrite(key).commitTs > startTs → WriteConflict
    2. Check lock conflict: GetLock(key) exists → KeyLocked
    3. PutValue(key, value) → CfDefault[key+startTs] = value
    4. PutLock(key, Lock{Primary, startTs, TTL, Kind}) → CfLock[key] = lock
```
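The two prewrite checks can be exercised on a toy in-memory model; the `store` type and its maps are illustrative stand-ins for CfWrite/CfLock, not TinyKV's MvccTxn API:

```go
package main

import "fmt"

// store models just enough state for the prewrite checks.
type store struct {
	mostRecentCommitTs map[string]uint64 // latest commitTs per key (CfWrite)
	locks              map[string]uint64 // lock's startTs per key (CfLock)
}

// prewrite performs the two Phase 1 checks, then takes the lock.
func (s *store) prewrite(key string, startTs uint64) error {
	if commitTs, ok := s.mostRecentCommitTs[key]; ok && commitTs > startTs {
		return fmt.Errorf("WriteConflict: %q committed at %d after startTs %d", key, commitTs, startTs)
	}
	if lockTs, ok := s.locks[key]; ok {
		return fmt.Errorf("KeyLocked: %q locked by txn %d", key, lockTs)
	}
	s.locks[key] = startTs // plus PutValue into CfDefault in the real flow
	return nil
}

func main() {
	s := &store{mostRecentCommitTs: map[string]uint64{"a": 10}, locks: map[string]uint64{}}
	fmt.Println(s.prewrite("a", 5) != nil)  // true: WriteConflict (commit at 10 > startTs 5)
	fmt.Println(s.prewrite("a", 20) == nil) // true: checks pass, lock taken
	fmt.Println(s.prewrite("a", 25) != nil) // true: KeyLocked by txn 20
}
```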
Phase 2: Commit (commit + unlock)

```
For each key:
    1. GetLock(key) — check the lock
    2. lock == nil:
       - Check CurrentWrite: already committed → idempotent return
         already rolled back → error
       - No write record → prewrite was lost, skip (no-op)
    3. lock.Ts != startTs → lock belongs to another transaction → error
    4. PutWrite(key, commitTs, Write{startTs, kind}) → CfWrite[key+commitTs]
    5. DeleteLock(key) → remove CfLock[key]
```
7.3 Read (KvGet)
```
1. GetLock(key): lock exists with lock.Ts <= readTs → return Locked error (blocked)
2. Seek CfWrite for the most recent commit with commitTs <= readTs
3. No write record, or Kind == Delete → key does not exist
4. Read CfDefault[EncodeKey(key, write.StartTS)] → value
```
7.4 Scan (KvScan)
```go
type Scanner struct {
	// wraps an iterator over CfWrite positioned via EncodeKey(key, readTs);
	// remaining fields truncated in the original
}
```
7.5 Transaction Cleanup
CheckTxnStatus (lock TTL check):

```
1. Check write and lock for the primary key
2. Already committed → return commitTs
3. Already rolled back → return NoAction
4. Lock present → check whether TTL has expired
   - Expired: clear lock + value, write rollback record, return TTLExpireRollback
   - Not expired: return NoAction
5. No lock and no write record → write rollback (guard against a late-arriving prewrite)
```
BatchRollback:

```
1. Check whether each key is already committed → if so, refuse rollback
2. DeleteLock + DeleteValue + PutWrite(Rollback)
```
ResolveLock:

```
1. Scan CfLock for all locks with lock.Ts == startTs
2. commitVersion == 0 → rollback all
3. commitVersion > 0 → commit all
```
7.6 Latch Mechanism
```go
// Field names abridged; the original block was truncated here.
type Latches struct {
	latchMap map[string]*sync.WaitGroup // one wait group per latched key
	mu       sync.Mutex                 // guards latchMap
}
```
- In-memory mutex lock that prevents concurrent in-process operations on the same key
- Different from MVCC locks: a latch is local and short-lived; an MVCC lock is persistent and distributed
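A simplified latch table illustrates the idea end-to-end. Unlike a WaitGroup-based version that blocks the caller, this sketch returns false and lets the caller retry; types and names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Latches is a minimal in-memory per-key mutex table: acquire all keys
// of a command before running it, release them after. Unlike an MVCC
// lock, it lives only in this process and only for the command's duration.
type Latches struct {
	mu     sync.Mutex
	locked map[string]bool
}

func NewLatches() *Latches { return &Latches{locked: map[string]bool{}} }

// Acquire latches every key atomically, or latches nothing and returns
// false if any key is already held (caller retries).
func (l *Latches) Acquire(keys []string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, k := range keys {
		if l.locked[k] {
			return false
		}
	}
	for _, k := range keys {
		l.locked[k] = true
	}
	return true
}

// Release frees the keys after the command finishes.
func (l *Latches) Release(keys []string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, k := range keys {
		delete(l.locked, k)
	}
}

func main() {
	l := NewLatches()
	fmt.Println(l.Acquire([]string{"a", "b"})) // true
	fmt.Println(l.Acquire([]string{"b"}))      // false: "b" is latched
	l.Release([]string{"a", "b"})
	fmt.Println(l.Acquire([]string{"b"})) // true
}
```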
7.7 Frequently Asked Interview Questions
Q: What is the role of the Primary Key in Percolator? A: The primary key is the single source of truth for a transaction. Only the primary key’s write record needs to be committed synchronously; secondary keys can be committed asynchronously. Checking the primary key’s lock/write record is sufficient to determine the status of the entire transaction.
Q: Why does CfWrite use commitTS instead of startTS in the
key? A: Reads need to find the latest commit with
commitTS <= readTS. Encoding by startTS
would make that lookup inefficient. With commitTS, a single
Seek lands on the first write record at or before
readTS.
Q: Snapshot Isolation vs Serializable? A: SI allows Write Skew anomalies (two transactions read from the same snapshot and each writes a different key, but their combined effect violates application invariants). TinyKV implements SI, not full Serializability.
Q: What happens if the client crashes after Prewrite succeeds
but before Commit? A: Any transaction that encounters the lock
will call CheckTxnStatus on the primary key. If the lock’s
TTL has expired, the lock is rolled back. If the primary has already
been committed, the commit is carried through.
Q: Why doesn’t CfLock’s key include a timestamp? A:
Only one transaction can lock a key at a time. The startTS
is already stored inside the lock value, so there is no need to encode
it in the key.
Part 8: System Design Interview Questions
8.1 Complete Write Request Path
```
1. Client → gRPC → Server.KvPrewrite()
2. Acquire latches for all keys in the mutation
3. MvccTxn runs conflict checks and buffers writes (CfDefault + CfLock)
4. Storage.Write() proposes the batch as a Raft command
5. Raft replicates → commit → apply to Badger → callback returns to the client
```
8.2 Region Split Flow
```
Region [a, z) exceeds size threshold
→ split checker picks a split key, e.g. "m" → propose AdminCmdType_Split
→ on apply: original region shrinks to [a, m), new region [m, z) is created locally
→ both regions bump Version in RegionEpoch; the new region's peers elect a leader
```
8.3 Crash Recovery
```
Node restart:
1. Load RaftLocalState / RaftApplyState / RegionLocalState from RaftDB and KVDB
2. Recreate a peer for every region not marked Tombstone
3. Re-apply committed log entries from AppliedIndex + 1 to Commit
4. Rejoin the Raft groups; snapshots fill any gap already truncated from the log
```
8.4 Linearizability Guarantee
```
Writes: Raft ensures majority acknowledgement → writes are durable and ordered
Reads:  also proposed through Raft (no ReadIndex/LeaseRead in TinyKV) →
        a read observes every write committed before it → linearizability
```
Part 9: Key Source Files
| Module | File | Core Content |
|---|---|---|
| Raft | raft/raft.go | election, log replication, heartbeat, snapshot |
| Raft | raft/log.go | RaftLog management, unstable/committed/applied |
| Raft | raft/rawnode.go | Ready interface, Advance, ProposeConfChange |
| Raftstore | kv/raftstore/peer_msg_handler.go | message handling, entry application, split/confchange |
| Raftstore | kv/raftstore/peer_storage.go | persistence, snapshot generation and application |
| Raftstore | kv/raftstore/raft_worker.go | single-threaded message loop |
| MVCC | kv/transaction/mvcc/transaction.go | MvccTxn, EncodeKey, GetValue |
| MVCC | kv/transaction/mvcc/scanner.go | range scan |
| Server | kv/server/server.go | KvGet / KvPrewrite / KvCommit / KvScan |
| Scheduler | scheduler/server/cluster.go | processRegionHeartbeat |
| Scheduler | scheduler/server/schedulers/balance_region.go | Balance Region scheduling |
Part 10: Papers and References
| Topic | Paper / Resource |
|---|---|
| Raft | In Search of an Understandable Consensus Algorithm |
| Percolator | Large-scale Incremental Processing Using Distributed Transactions and Notifications |
| Spanner | Spanner: Google’s Globally-Distributed Database |
| TiKV Design | TiKV Deep Dive |
| LSM-Tree | The Log-Structured Merge-Tree |



