2022-03-23
Raft
- Leader Election
- select one
- detect crashes
- Normal Operation
- Safety and consistency after leader changes
- Neutralising old leaders
- Client Interactions
- Implementing linearisable semantics
- Configuration changes:
- Adding/removing servers
Server States:
At any given time, each server is either:
- Leader:
- At most one leader
- Follower
- Completely passive (issues no RPCs, only responds to them)
- Candidate
Time is divided into terms:
- Each term begins with an election, followed by normal operation under a single leader
- At most 1 leader per term
- Some terms have no leader (failed election)
- Each server maintains its current term value
- Key role of terms: identify obsolete information
All communication within Raft is done via Remote Procedure Calls (RPCs)
Heartbeats and Timeouts:
- Servers start up as followers
- Followers expect to receive RPCs from leaders or candidates.
- Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority.
- If 'electionTimeout' elapses with no RPCs:
- Follower assumes leader has crashed
- Follower starts new election
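The heartbeat/timeout logic above can be sketched as follows (a minimal illustration; the class and method names are my own, not from the Raft paper's pseudocode):

```python
import time

class Follower:
    def __init__(self, election_timeout):
        self.election_timeout = election_timeout
        self.last_rpc = time.monotonic()

    def on_rpc(self):
        # Any RPC from the leader (including an empty AppendEntries
        # heartbeat) resets the election timer.
        self.last_rpc = time.monotonic()

    def leader_presumed_crashed(self, now=None):
        # If electionTimeout elapses with no RPCs, assume the leader
        # has crashed and start a new election.
        now = time.monotonic() if now is None else now
        return (now - self.last_rpc) >= self.election_timeout
```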
Election Basics:
When starting an election:
- Increment current term value
- Change to Candidate state
- Vote for self
- Send RequestVote RPCs to all other servers, retry until either:
  - Receive votes from majority of servers:
    - Become leader
    - Send AppendEntries heartbeats to all other servers
  - Receive RPC from valid leader:
    - Return to follower state
  - Nobody wins the election (election timeout elapses):
    - Increment term, start a new election
- Safety: allow at most one winner per term
  - Each server gives out only one vote per term (persisted on disk)
  - Two different candidates can't accumulate majorities in the same term
- Liveness: some candidate must eventually win
  - Choose election timeouts randomly in [T, 2T]
  - One server usually times out and wins the election before others wake up
  - Works well if T >> broadcast time
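The randomised timeout choice can be sketched in one line (T is the base timeout; the function name is illustrative):

```python
import random

def election_timeout(T):
    # Randomised timeout in [T, 2T]: with high probability one server
    # times out well before the others, so its election usually succeeds.
    return random.uniform(T, 2 * T)
```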
#----------------------------------------------------------------------
# Log Structure:
#----------------------------------------------------------------------
Leader
Term ┌──────┬──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │ 3 │
Cmd │ │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │ div │
└──────┴──────┴──────┴──────┴──────┴──────┘
Followers
Term ┌──────┬──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │ 3 │
Cmd │ │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │ div │
└──────┴──────┴──────┴──────┴──────┴──────┘
Term ┌──────┬──────┬──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │ 2 │ 3 │
Cmd │ │ │ │ │ │
─────►│ add │ cmp │ ret │ mov │ jmp │
└──────┴──────┴──────┴──────┴──────┘
Term ┌──────┬──────┬──────┐
─────►│ 1 │ 1 │ 1 │
Cmd │ │ │ │
─────►│ add │ cmp │ ret │
└──────┴──────┴──────┘
│ │ │ │
└────────────────────┴──────────────┴─────┘
Committed entries per follower
- Log entry = index, term, command
- Log stored on stable storage (disk); survives crashes
- An entry is committed if it is known to be stored on a majority of servers
  - Committed entries are durable and will eventually be executed by the state machines
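The entry structure and the majority test can be sketched like this (assuming, for illustration, a leader that tracks each server's highest replicated log index in `match_index`; all names are mine):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int      # term in which the entry was created
    command: str   # opaque state-machine command, e.g. "add"

def is_committed(match_index, cluster_size, entry_index):
    # An entry is committed once a majority of servers store it
    # (i.e. their highest replicated index reaches entry_index).
    replicas = sum(1 for m in match_index if m >= entry_index)
    return replicas > cluster_size // 2
```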
#----------------------------------------------------------------------
# Normal Operation
#----------------------------------------------------------------------
- Client sends command to leader
- Leader appends command to its log
- Leader sends AppendEntries RPCs to followers
- Once the new entry is committed:
  - Leader passes command to its state machine, returns result to client
  - Leader notifies followers of committed entries in subsequent AppendEntries RPCs
  - Followers pass committed commands to their state machines
- Crashed or slow followers?
  - Leader retries RPCs until they succeed
- Performance is optimal in the common case:
  - One successful RPC round to any majority of servers
#----------------------------------------------------------------------
# Log Consistency
#----------------------------------------------------------------------
1 2 3 4 5
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
└──────┴──────┴──────┴──────┴──────┘
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 4 │
│ add │ cmp │ ret │ mov │ sub │
└──────┴──────┴──────┴──────┴──────┘
^^^^^ Different to leader!
Raft maintains a high level of coherency between logs:
- If log entries on different servers have the same index and term, it is guaranteed that:
  - They store the same command
  - The logs are identical in all preceding entries
- If a given entry is committed, all preceding entries are also committed
AppendEntries Consistency Check:
- Each AppendEntries RPC contains the index and term of the entry preceding the new ones
- Follower's log must contain a matching entry; otherwise it rejects the request
- Implements an induction step that ensures coherency
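The consistency check could look roughly like this (a sketch assuming the log is a Python list of `(term, command)` pairs addressed with 1-based Raft indices; names are illustrative):

```python
def append_entries_check(log, prev_index, prev_term):
    # prev_index/prev_term describe the entry immediately preceding
    # the new entries being appended.
    if prev_index == 0:
        return True                      # nothing precedes the new entries
    if prev_index > len(log):
        return False                     # follower is missing entries
    # The follower must hold a matching entry at prev_index.
    return log[prev_index - 1][0] == prev_term
```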
#----------------------------------------------------------------------
# Leader Changes
#----------------------------------------------------------------------
At the beginning of a new leader's term, it has some responsibilities to fulfil:
- Old leader may have left partially replicated entries
- No special steps are taken here by the leader; it just starts normal replication
- The leader's log is treated as the single source of truth
- Raft will eventually make followers' logs identical to the leader's
- Multiple crashes and network partitions can leave many extraneous log entries
#----------------------------------------------------------------------
# Safety Requirement
#----------------------------------------------------------------------
Once a log entry has been applied to a state machine, no other state machine must apply a different value for that log entry.
- Raft Safety property:
- if a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders
This guarantees the safety requirement:
- leaders never overwrite entries in their logs
- Only entries in the leader's log can be committed
- Entries must be committed before applying to state machine
-> Exclude a server from becoming a leader if it does not have all the required log entries.
#----------------------------------------------------------------------
# Picking the best leader
#----------------------------------------------------------------------
- Can't tell which entries are committed in the following situation:
1 2 3 4 5
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
s1└──────┴──────┴──────┴──────┴──────┘
  ┌──────┬──────┬──────┬──────┐
  │  1   │  1   │  1   │  2   │
  │ add  │ cmp  │ ret  │ mov  │
s2└──────┴──────┴──────┴──────┘
--------------------------------------------------------------
s3 is not available at election time, so we cannot tell whether the
last entry in s1's log is committed
┌──────┬──────┬──────┬──────┬──────┐
│ 1 │ 1 │ 1 │ 2 │ 3 │
│ add │ cmp │ ret │ mov │ jmp │
s3└──────┴──────┴──────┴──────┴──────┘
- During elections, choose the candidate with log most likely to contain
all of the committed log entries:
- Candidates include log info in RequestVote RPCs (index & term of last log entry)
- Voting server V denies vote if its log is "more complete":
(lastTerm_v > lastTerm_c) ||
(lastTerm_v == lastTerm_c) && (lastIndex_v > lastIndex_c)
- Leader will have the "most complete" log among the electing
majority
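The denial rule above translates directly into code (variable names mirror the notes; this is a sketch, not the paper's pseudocode):

```python
def voter_log_more_complete(last_term_v, last_index_v,
                            last_term_c, last_index_c):
    # Voter V denies the vote when its own log is "more complete"
    # than candidate C's: later last term, or same last term but
    # longer log.
    return (last_term_v > last_term_c) or (
        last_term_v == last_term_c and last_index_v > last_index_c)
```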
#----------------------------------------------------------------------
# New Commitment Rules
#----------------------------------------------------------------------
- For a leader to decide an entry is committed:
- Must be stored on a majority of servers
- At least one new entry from leader's term must also be stored on majority of servers
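Both conditions can be sketched together (assumes the leader tracks each server's highest replicated index in `match_index`, as in the paper's bookkeeping; the function name is mine):

```python
def leader_can_commit(match_index, cluster_size,
                      entry_index, entry_term, current_term):
    # Stored on a majority AND from the leader's current term:
    # a leader never directly commits entries from older terms;
    # they become committed indirectly once a current-term entry is.
    on_majority = sum(1 for m in match_index
                      if m >= entry_index) > cluster_size // 2
    return on_majority and entry_term == current_term
```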
#----------------------------------------------------------------------
# Repairing Follower Logs
#----------------------------------------------------------------------
- New Leader must make follower logs consistent with its own
- Delete extraneous entries
- Fill in missing entries
- Leader keeps nextIndex for each follower:
- Index of next log entry to send to that follower
- Initialised to (1 + leader's last index)
- When the AppendEntries consistency check fails, decrement nextIndex and try again
# -----------------------------------------------------------------------
# Raft -> append to main notes
# -----------------------------------------------------------------------
- Whenever a follower overwrites an inconsistent entry, it deletes all subsequent entries: entries that are present at the tail of the follower's log but missing from the leader are truncated, and replication continues from the leader's log.
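The nextIndex back-off and truncation can be sketched as a single local loop (no RPCs; logs are 0-based Python lists of `(term, command)` pairs; illustrative only):

```python
def repair_follower(leader_log, follower_log):
    # Back next_index up until the logs agree at the preceding entry,
    # mirroring repeated AppendEntries consistency-check failures.
    next_index = len(leader_log)         # next entry to send (0-based)
    while next_index > 0:
        prev = next_index - 1
        if prev < len(follower_log) and follower_log[prev] == leader_log[prev]:
            break
        next_index -= 1
    # Delete extraneous entries, then fill in missing ones
    # (mutates and returns follower_log).
    follower_log[next_index:] = leader_log[next_index:]
    return follower_log
```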
# -----------------------------------------
# Neutralising Old Leaders
# -----------------------------------------
- Deposed leader may not be dead:
  - Temporarily disconnected from network
  - Other servers elect a new leader
  - Old leader becomes reconnected, attempts to commit log entries
- Terms are used to detect stale leaders (and candidates):
  - Every RPC contains the term of the sender
  - If sender's term is older, the RPC is rejected; the sender reverts to follower and updates its term
  - If receiver's term is older, it reverts to follower, updates its term, then processes the RPC normally
- Election updates terms of a majority of servers
  - A deposed server therefore cannot commit new log entries
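The term comparison performed on every RPC might be sketched as follows (names are mine; the function returns the receiver's updated term and whether to process the RPC):

```python
def on_rpc_received(receiver_term, sender_term):
    if sender_term < receiver_term:
        # Stale sender: reject; the sender will learn the newer
        # term from the response and revert to follower.
        return receiver_term, False
    if sender_term > receiver_term:
        # Stale receiver: adopt the newer term (reverting to
        # follower), then process the RPC normally.
        return sender_term, True
    return receiver_term, True
```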
# -----------------------------------------
# Client Protocol
# -----------------------------------------
- Clients send commands to the leader:
  - If the leader is unknown, contact any server
  - If the contacted server is not the leader, it will redirect to the current leader
- Leader does not respond until the command has been logged, committed, and executed by the leader's state machine
- If the request times out (e.g. leader crashes):
  - Client reissues command to some other server
  - Eventually this results in being redirected to the new leader
  - Retry the original request with the new leader
- What if the leader crashes after executing a command, but before responding?
  - Must not execute the command twice
- Solution: client embeds a unique id in each command
  - Server includes id in log entry
  - Before accepting a command, leader checks its log for an entry with that id
  - If the id is found in the log, ignore the new command and return the response from the old command
- Result: exactly-once semantics as long as the client doesn't crash
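The id-based deduplication could be sketched as follows (a toy state machine whose only "command" increments a counter; class and field names are illustrative):

```python
class StateMachine:
    def __init__(self):
        self.results = {}    # client-supplied command id -> cached result
        self.counter = 0

    def apply(self, cmd_id, command):
        # If this id was already executed, return the old result
        # instead of re-executing: the basis of the exactly-once
        # guarantee described above.
        if cmd_id in self.results:
            return self.results[cmd_id]
        self.counter += 1                 # stand-in for real execution
        self.results[cmd_id] = self.counter
        return self.counter
```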
# -----------------------------------------
# Configuration Changes
# -----------------------------------------
What is in the scope of 'configuration'?
- System configuration:
  - ID and address for each server
  - Determines what constitutes a majority
- Consensus mechanism must support changes in the configuration:
  - Replace a failed machine
  - Change the degree of replication
- If the configuration were switched directly, servers on the old and new configurations could simultaneously form 2 separate majorities, so any change has to be made in 2 phases.
# -----------------------------------------
# Joint Consensus
# -----------------------------------------
Raft uses a 2-phase approach:
- Intermediate phase uses joint consensus (need majority of both old and new configurations for elections and commitment)
- Configuration change is just a log entry; applied immediately on receipt (committed or not)
- Once joint consensus is committed, begin replicating the log entry for the final configuration
Additionally:
- Any server from either configuration can serve as leader
- If current leader is not in Cnew, it must step down once Cnew is committed
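The joint-consensus decision rule can be sketched as follows (server ids and all names are illustrative):

```python
def joint_majority(votes, old_config, new_config):
    # During joint consensus, a decision (election or commitment)
    # needs a majority of BOTH the old and the new configuration.
    def majority(config):
        return sum(1 for s in config if s in votes) > len(config) // 2
    return majority(old_config) and majority(new_config)
```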
# -----------------------------------------
# Raft Summary
# -----------------------------------------
- Leader election
- Normal Operation
- Safety and consistency
- Neutralise old leaders
- Client Protocol
- Configuration Changes