2022-03-23
Raft
- Leader Election
- select one
- detect crashes
- Normal Operation
- Safety and consistency after leader changes
- Neutralising old leaders
- Client Interactions
- Implementing linearisable semantics
- Configuration changes:
- Adding/removing servers
Server States:
At any given time, each server is either:
- Leader:
- At most one leader
- Follower
- Completely passive (issues no RPCs, only responds to them)
- Candidate
Time is divided into terms:
- Each term begins with an election, followed by normal operation under a single leader
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains its current term value
- Key role of terms: identify obsolete information
All communication within Raft is done via Remote Procedure Calls (RPCs)
Heartbeats and Timeouts:
- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates.
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority.
- If 'electionTimeout' elapses with no RPCs:
- Follower assumes leader has crashed
- Follower starts new election
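The heartbeat/timeout logic above can be sketched as follows (a minimal illustration; the class and method names are my own, not from the Raft paper's pseudocode):

```python
import time

class Follower:
    def __init__(self, election_timeout):
        self.election_timeout = election_timeout
        self.last_rpc = time.monotonic()

    def on_rpc(self):
        # Any RPC from the leader (including an empty AppendEntries
        # heartbeat) resets the election timer.
        self.last_rpc = time.monotonic()

    def leader_presumed_crashed(self, now=None):
        # If electionTimeout elapses with no RPCs, assume the leader
        # has crashed and start a new election.
        now = time.monotonic() if now is None else now
        return (now - self.last_rpc) >= self.election_timeout
```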
Election Basics:
When starting an election:
- Increment current term value
- Change to Candidate state
- Vote for self
- Send RequestVote RPCs to all other servers, retry until either:
  - Receive votes from majority of servers:
    - Become leader
    - Send AppendEntries heartbeats to all other servers
  - Receive RPC from valid leader:
    - Return to follower state
  - Nobody wins the election (election timeout elapses):
    - Increment term, start a new election
- Safety: allow at most one winner per term
  - Each server gives out only one vote per term (persisted on disk)
  - Two different candidates can't accumulate majorities in the same term
- Liveness: some candidate must eventually win
  - Choose election timeouts randomly in [T, 2T]
  - One server usually times out and wins the election before others wake up
  - Works well if T >> broadcast time
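The randomised timeout choice can be sketched in one line (T is the base timeout; the function name is illustrative):

```python
import random

def election_timeout(T):
    # Randomised timeout in [T, 2T]: with high probability one server
    # times out well before the others, so its election usually succeeds.
    return random.uniform(T, 2 * T)
```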
#----------------------------------------------------------------------
# Log Structure:
#----------------------------------------------------------------------
Leader
Term ┌──────┬──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │ 3 │
Cmd │ │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │ div │
└──────┴──────┴──────┴──────┴──────┴──────┘
Followers
Term ┌──────┬──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │ 3 │
Cmd │ │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │ div │
└──────┴──────┴──────┴──────┴──────┴──────┘
Term ┌──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │
Cmd │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │
└──────┴──────┴──────┴──────┴──────┘
Term ┌──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │
Cmd │ │ │ │
─────►│ add │ cmp │ ret │
└──────┴──────┴──────┘
│ │ │ │
└────────────────────┴──────────────┴─────┘
Committed entries per follower
- Log entry = index, term, command
- Log stored on stable storage (disk); survives crashes
- An entry is committed if it is known to be stored on a majority of servers
  - Committed entries are durable and will eventually be executed by the state machines
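The entry structure and the majority test can be sketched like this (assuming, for illustration, a leader that tracks each server's highest replicated log index in `match_index`; all names are mine):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int      # term in which the entry was created
    command: str   # opaque state-machine command, e.g. "add"

def is_committed(match_index, cluster_size, entry_index):
    # An entry is committed once a majority of servers store it
    # (i.e. their highest replicated index reaches entry_index).
    replicas = sum(1 for m in match_index if m >= entry_index)
    return replicas > cluster_size // 2
```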
#----------------------------------------------------------------------
# Normal Operation
#----------------------------------------------------------------------
- Client sends command to leader
- Leader appends command to its log
- Leader sends AppendEntries RPCs to followers
- Once the new entry is committed:
  - Leader passes command to its state machine, returns result to client
  - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  - Followers pass committed commands to their state machines
- Crashed or slow followers?
  - Leader retries RPCs until they succeed
- Performance is optimal in the common case:
  - One successful RPC round to any majority of servers
#----------------------------------------------------------------------
# Log Consistency
#----------------------------------------------------------------------
1 2 3 4 5
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
└──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 4 │
│ add │ cmp │ ret │ mov │ sub │
└──────┴──────┴──────┴──────┴──────┘
^^^^^ Different to leader!
Raft maintains a high level of coherency between logs:
- If log entries on different servers have the same index and term, it is guaranteed that:
  - They store the same command
  - The logs are identical in all preceding entries
- If a given entry is committed, all preceding entries are also committed
AppendEntries Consistency Check:
- Each AppendEntries RPC contains the index and term of the entry preceding the new ones
- Follower's log must contain a matching entry; otherwise it rejects the request
- Implements an induction step that ensures coherency
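The consistency check could look roughly like this (a sketch assuming the log is a Python list of `(term, command)` pairs addressed with 1-based Raft indices; names are illustrative):

```python
def append_entries_check(log, prev_index, prev_term):
    # prev_index/prev_term describe the entry immediately preceding
    # the new entries being appended.
    if prev_index == 0:
        return True                      # nothing precedes the new entries
    if prev_index > len(log):
        return False                     # follower is missing entries
    # The follower must hold a matching entry at prev_index.
    return log[prev_index - 1][0] == prev_term
```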
#----------------------------------------------------------------------
# Leader Changes
#----------------------------------------------------------------------
At the beginning of a new leader's term, it has some responsibilities to fulfil:
- Old leader may have left partially replicated entries
- No special steps are taken here by the leader; it just starts normal replication
- The leader's log is treated as the single source of truth
- Raft will eventually make followers' logs identical to the leader's
- Multiple crashes and network partitions can leave many extraneous log entries
#----------------------------------------------------------------------
# Safety Requirement
#----------------------------------------------------------------------
Once a log entry has been applied to a state machine, no other state machine must apply a different value for that log entry.
- Raft Safety property:
- if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders
This guarantees the safety requirement:
- leaders never overwrite entries in their logs
- Only entries in the leader's log can be committed
- Entries must be committed before applying to state machine
-> Exclude a server from becoming a leader if it does not have all the required log entries.
#----------------------------------------------------------------------
# Picking the best leader
#----------------------------------------------------------------------
- Can't tell which entries are committed in the following situation:
1 2 3 4 5
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
s1└──────┴──────┴──────┴──────┴──────┘
  ┌──────┬──────┬──────┬──────┐
  │  1   │  1   │  1   │  2   │
  │ add  │ cmp  │ ret  │ mov  │
s2└──────┴──────┴──────┴──────┘
--------------------------------------------------------------
s3 is not available at election time, so we cannot tell whether the
last entry in s1's log is committed
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
s3└──────┴──────┴──────┴──────┴──────┘
- During elections, choose the candidate with log most likely to contain
all of the committed log entries:
- Candidates include log info in RequestVote RPCs (index & term of last log entry)
- Voting server V denies vote if its log is "more complete":
(lastTerm_v > lastTerm_c) ||
(lastTerm_v == lastTerm_c) && (lastIndex_v > lastIndex_c)
- Leader will have the "most complete" log among the electing
majority
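The denial rule above translates directly into code (variable names mirror the notes; this is a sketch, not the paper's pseudocode):

```python
def voter_log_more_complete(last_term_v, last_index_v,
                            last_term_c, last_index_c):
    # Voter V denies the vote when its own log is "more complete"
    # than candidate C's: later last term, or same last term but
    # longer log.
    return (last_term_v > last_term_c) or (
        last_term_v == last_term_c and last_index_v > last_index_c)
```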
#----------------------------------------------------------------------
# New Commitment Rules
#----------------------------------------------------------------------
- For a leader to decide an entry is committed:
- Must be stored on a majority of servers
- At least one new entry from leader's term must also be stored on majority of servers
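Both conditions can be sketched together (assumes the leader tracks each server's highest replicated index in `match_index`, as in the paper's bookkeeping; the function name is mine):

```python
def leader_can_commit(match_index, cluster_size,
                      entry_index, entry_term, current_term):
    # Stored on a majority AND from the leader's current term:
    # a leader never directly commits entries from older terms;
    # they become committed indirectly once a current-term entry is.
    on_majority = sum(1 for m in match_index
                      if m >= entry_index) > cluster_size // 2
    return on_majority and entry_term == current_term
```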
#----------------------------------------------------------------------
# Repairing Follower Logs
#----------------------------------------------------------------------
- New Leader must make follower logs consistent with its own
- Delete extraneous entries
- Fill in missing entries
- Leader keeps nextIndex for each follower:
- Index of next log entry to send to that follower
- Initialised to (1 + leader's last index)
- When the AppendEntries consistency check fails, decrement nextIndex and try again
# -----------------------------------------------------------------------
# Raft -> append to main notes
# -----------------------------------------------------------------------
- Whenever a follower overwrites an inconsistent entry, it deletes all subsequent entries: entries that are present at the tail of the follower's log but missing from the leader are truncated, and replication continues from the leader's log.
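The nextIndex back-off and truncation can be sketched as a single local loop (no RPCs; logs are 0-based Python lists of `(term, command)` pairs; illustrative only):

```python
def repair_follower(leader_log, follower_log):
    # Back next_index up until the logs agree at the preceding entry,
    # mirroring repeated AppendEntries consistency-check failures.
    next_index = len(leader_log)         # next entry to send (0-based)
    while next_index > 0:
        prev = next_index - 1
        if prev < len(follower_log) and follower_log[prev] == leader_log[prev]:
            break
        next_index -= 1
    # Delete extraneous entries, then fill in missing ones
    # (mutates and returns follower_log).
    follower_log[next_index:] = leader_log[next_index:]
    return follower_log
```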
# -----------------------------------------
# Neutralising Old Leaders
# -----------------------------------------
- Deposed leader may not be dead:
  - Temporarily disconnected from network
  - Other servers elect a new leader
  - Old leader becomes reconnected, attempts to commit log entries
- Terms are used to detect stale leaders (and candidates):
  - Every RPC contains the term of the sender
  - If sender's term is older, the RPC is rejected; the sender reverts to follower and updates its term
  - If receiver's term is older, it reverts to follower, updates its term, then processes the RPC normally
- Election updates terms of a majority of servers
  - A deposed server therefore cannot commit new log entries
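The term comparison performed on every RPC might be sketched as follows (names are mine; the function returns the receiver's updated term and whether to process the RPC):

```python
def on_rpc_received(receiver_term, sender_term):
    if sender_term < receiver_term:
        # Stale sender: reject; the sender will learn the newer
        # term from the response and revert to follower.
        return receiver_term, False
    if sender_term > receiver_term:
        # Stale receiver: adopt the newer term (reverting to
        # follower), then process the RPC normally.
        return sender_term, True
    return receiver_term, True
```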
# -----------------------------------------
# Client Protocol
# -----------------------------------------
- Clients send commands to the leader:
  - If the leader is unknown, contact any server
  - If the contacted server is not the leader, it will redirect to the current leader
- Leader does not respond until the command has been logged, committed, and executed by the leader's state machine
- If the request times out (e.g. leader crashes):
  - Client reissues command to some other server
  - Eventually this results in being redirected to the new leader
  - Retry the original request with the new leader
- What if the leader crashes after executing a command, but before responding?
  - Must not execute the command twice
- Solution: client embeds a unique id in each command
  - Server includes id in log entry
  - Before accepting a command, leader checks its log for an entry with that id
  - If the id is found in the log, ignore the new command and return the response from the old command
- Result: exactly-once semantics as long as the client doesn't crash
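The id-based deduplication could be sketched as follows (a toy state machine whose only "command" increments a counter; class and field names are illustrative):

```python
class StateMachine:
    def __init__(self):
        self.results = {}    # client-supplied command id -> cached result
        self.counter = 0

    def apply(self, cmd_id, command):
        # If this id was already executed, return the old result
        # instead of re-executing: the basis of the exactly-once
        # guarantee described above.
        if cmd_id in self.results:
            return self.results[cmd_id]
        self.counter += 1                 # stand-in for real execution
        self.results[cmd_id] = self.counter
        return self.counter
```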
# -----------------------------------------
# Configuration Changes
# -----------------------------------------
What is in the scope of 'configuration'?
- System configuration:
  - ID and address for each server
  - Determines what constitutes a majority
- Consensus mechanism must support changes in the configuration:
  - Replace a failed machine
  - Change the degree of replication
- If the configuration were switched directly, servers on the old and new configurations could simultaneously form 2 separate majorities, so any change has to be made in 2 phases.
# -----------------------------------------
# Joint Consensus
# -----------------------------------------
Raft uses a 2-phase approach:
- Intermediate phase uses joint consensus (need majority of both old and new configurations for elections and commitment)
- Configuration change is just a log entry; applied immediately on receipt (committed or not)
- Once joint consensus is committed, begin replicating the log entry for the final configuration
Additionally:
- Any server from either configuration can serve as leader
- If current leader is not in Cnew, it must step down once Cnew is committed
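The joint-consensus decision rule can be sketched as follows (server ids and all names are illustrative):

```python
def joint_majority(votes, old_config, new_config):
    # During joint consensus, a decision (election or commitment)
    # needs a majority of BOTH the old and the new configuration.
    def majority(config):
        return sum(1 for s in config if s in votes) > len(config) // 2
    return majority(old_config) and majority(new_config)
```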
# -----------------------------------------
# Raft Summary
# -----------------------------------------
- Leader election
- Normal Operation
- Safety and consistency
- Neutralise old leaders
- Client Protocol
- Configuration Changes