Raft

Typical Properties of Consensus Algorithms

  1. safety: produce correct results under all non-Byzantine conditions,
              including network delays/partitions, packet loss/duplication/reordering
  2. availability/liveness:
     the system remains available if a majority of peers are operational and can communicate with each other and with clients
     failures are assumed to be fail-stop
  3. safety should not depend on timing
    do not require timing to ensure consistency
    timing: faulty clocks or message delays
  4. performance: a minority of slow servers must not affect overall system performance


Strengths and Weaknesses of Paxos
Strengths
  1. It ensures safety and liveness
  2. Its correctness is formally proved
  3. It is efficient in the normal case (?)

Weaknesses
  1. Difficult to understand
    thus difficult to implement / reason about / debug / optimize ...
  2. Does not provide a good foundation for building practical implementations
    no widely agreed-upon algorithm for multi-paxos
    single-decree paxos -> multi-paxos is complex
    leader-based protocols are simpler and may be faster for making a series of decisions

Brief Introduction to Raft
  A consensus protocol with strong leader
  Why leader?
    simplifies the management of log replication
    makes raft easier to understand
  Raft log
    orders commands
    stores tentative commands until committed
    stores commands in case leader must re-send to followers
    acts as history of commands, can be replayed after server reboot
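
  A minimal Go sketch of the log structure these roles imply; the Entry/raftLog names and fields are illustrative choices of this sketch, not prescribed by the paper.

    package main

    import "fmt"

    // Entry is one slot of the Raft log: a command plus the term in which
    // the leader received it.
    type Entry struct {
        Term    int
        Command string
    }

    // raftLog keeps commands in order; entries beyond commitIndex are still
    // tentative, and the whole slice can be replayed after a reboot.
    type raftLog struct {
        entries     []Entry // index 0 is a sentinel entry
        commitIndex int     // highest index known to be committed
        lastApplied int     // highest index applied to the state machine
    }

    func main() {
        l := raftLog{entries: []Entry{{0, ""}, {1, "x=1"}, {1, "y=2"}}, commitIndex: 1}
        fmt.Println(len(l.entries)-1, l.commitIndex, l.lastApplied)
    }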


Leader Election
State of a server:
  follower, candidate, leader

State transitions
  follower -> candidate: election timeout elapses without hearing from a current leader
  candidate -> leader: receives votes from a majority of servers for the term
  candidate -> follower: discovers the current leader or a higher term
  candidate -> candidate: election timeout elapses again (split vote), start a new election
  leader -> follower: discovers a server with a higher term

at most one leader can be elected in a given term

Timing requirement for steady leaders
  broadcast time << election timeout << MTBF
  broadcast time: 10s of ms, may depend upon log persistence performance
  election timeout: 100s of ms (or 1s, 3s ...)
  MTBF: months

How to choose election timeout
  * at least a few (10s of) heartbeat intervals (in case network drops a heartbeat)
    to avoid needless elections, which waste time and introduce availability issues
  * random part long enough to let one candidate succeed before next starts
  * short enough to react quickly to failure, avoid long pauses (availability issues)
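
  A minimal Go sketch of picking a randomized election timeout under these constraints; the base value used in main (a few heartbeat intervals) is illustrative.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // resetElectionTimeout returns a timeout in [base, 2*base). The random
    // part spreads candidates out so one can win before the next starts;
    // the upper bound keeps the reaction to a failed leader reasonably fast.
    func resetElectionTimeout(base time.Duration) time.Duration {
        return base + time.Duration(rand.Int63n(int64(base)))
    }

    func main() {
        // e.g. a 300ms base with a 100ms heartbeat interval (illustrative numbers)
        fmt.Println(resetElectionTimeout(300 * time.Millisecond))
    }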


Log Replication

Leader Append-only: a leader never deletes or overwrites its log entries; it only appends new entries
Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index
Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms
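
A Go sketch of the follower-side check that enforces the Log Matching property, assuming AppendEntries carries prevLogIndex and prevLogTerm as in the Raft paper; the sentinel entry at index 0 is a choice of this sketch.

  package main

  import "fmt"

  // Entry is one Raft log entry: the term it was created in plus a command.
  type Entry struct {
      Term    int
      Command string
  }

  // consistent reports whether the follower's log contains an entry at
  // prevLogIndex with term prevLogTerm; AppendEntries is rejected otherwise,
  // and the leader backs up and retries with earlier entries.
  // log[0] is a sentinel with Term 0, so log index i maps to slice index i.
  func consistent(log []Entry, prevLogIndex, prevLogTerm int) bool {
      if prevLogIndex >= len(log) {
          return false // follower's log is too short
      }
      return log[prevLogIndex].Term == prevLogTerm
  }

  func main() {
      log := []Entry{{0, ""}, {1, "x=1"}, {2, "y=2"}}
      fmt.Println(consistent(log, 2, 2)) // true: terms match at index 2
      fmt.Println(consistent(log, 2, 1)) // false: leader must back up
  }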


Leader Election Restriction

1. the new leader's log must contain all committed log entries

example 1
  index: 1 2 3
  S1:    1 2 2 (leader)
  S2:    1 2 2
  S3:    1 1
  S3 was the leader of term 1 and is now partitioned away from S1 and S2; S1 is the current leader
  S1 broadcasts the log entry at index 3; consider the following 3 cases:
    #1: S2 receives the log entry but the reply to S1 is lost, then S1 crashes
    #2: S2 receives the log entry and replies, S1 commits the log entry and crashes
    #3: S2 receives the log entry and replies, S1 and S2 commit the log entry, S1 crashes
  After the partition heals, only S2 can become the new leader.
  If S3 became leader, committed log entries could be overwritten
    violates the Leader Completeness property

example 2
  index: 1 2 3
  S1:    1
  S2:    1
  S3:    1 1 (leader of term 1)

  then:

  index: 1 2 3
  S1:    1 2
  S2:    1 2 2 (leader of term 2)
  S3:    1 1

  then:

  index: 1 2 3
  S1:    1 2 3 (leader of term 3)
  S2:    1 2 2
  S3:    1 1

  S1 crashes and restarts; after that, either S1 or S2 can become the leader of term 4, but S3 cannot.


2. the new leader's log must be at least as up-to-date as the logs of a majority of servers (its voters)
  up-to-date:
    compare the last entries: the log whose last entry has the higher term is more up-to-date
    if the last entries have the same term, the longer log is more up-to-date
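
A Go sketch of this up-to-date comparison as a voter would apply it when handling RequestVote; the argument names are illustrative.

  package main

  import "fmt"

  // candidateUpToDate reports whether the candidate's log is at least as
  // up-to-date as the voter's: compare the terms of the last entries first,
  // then the log lengths (last indexes).
  func candidateUpToDate(myLastTerm, myLastIndex, candLastTerm, candLastIndex int) bool {
      if candLastTerm != myLastTerm {
          return candLastTerm > myLastTerm
      }
      return candLastIndex >= myLastIndex
  }

  func main() {
      // With logs S1: [5 6 7] and S2: [5 8] (terms by index, as in the
      // example below), S2's log is more up-to-date than S1's.
      fmt.Println(candidateUpToDate(7, 3, 8, 2)) // true: higher last term wins
  }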


Q: why not elect the server with the longest log as new leader?
A:
  example:
    S1: 5 6 7
    S2: 5 8
    S3: 5 8
  first, could this scenario happen? how?
    S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
      both times it crashed after only appending to its own log
    Q: after S1 crashes in term 7, why won't S2/S3 choose 6 as next term?
    A: at least one of them votes for S1, and becomes follower in term 7
    next term will be 8, since at least one of S2/S3 learned of 7 while voting
    S2 leader in term 8, only S2+S3 alive, then crash
  all peers reboot
  who should be next leader?
    S1 has the longest log, but the entry with term 8 could have been committed !!!
    so new leader can only be one of S2 or S3
    i.e. the rule cannot be simply "longest log"


Raft Persistence
  why log?
    if a server was in leader's majority for committing an entry,
    must remember entry despite reboot, so any future leader is
    guaranteed to see the committed log entry
  why votedFor?
    to prevent a server from voting for one candidate, then rebooting,
    then voting for a different candidate in the same (or an older!) term
    could lead to two leaders for the same term
  why currentTerm?
    to ensure terms only increase, so each term has at most one leader
    to detect RPCs from stale leaders and candidates
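
  A rough Go sketch of persisting these three fields before replying to any RPC; the gob encoding and file path here are illustrative stand-ins for real stable storage.

    package main

    import (
        "bytes"
        "encoding/gob"
        "os"
    )

    type Entry struct {
        Term    int
        Command string
    }

    // PersistentState is exactly what must survive a reboot:
    // currentTerm, votedFor and the log.
    type PersistentState struct {
        CurrentTerm int
        VotedFor    int // -1 if this server has not voted in CurrentTerm
        Log         []Entry
    }

    // persist must complete before the server answers the RPC that changed
    // any of these fields (e.g. before replying "yes" to a RequestVote).
    func persist(st PersistentState, path string) error {
        var buf bytes.Buffer
        if err := gob.NewEncoder(&buf).Encode(st); err != nil {
            return err
        }
        return os.WriteFile(path, buf.Bytes(), 0o600)
    }

    func main() {
        _ = persist(PersistentState{CurrentTerm: 11, VotedFor: 2, Log: nil}, "raft-state.bin")
    }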

Q: What could go wrong if we do not persist currentTerm and instead initialize a server's currentTerm from the term of its last log entry when it starts? Give an example.

A: Assume there is a raft group of three servers S1, S2 and S3. S1's last log entry has term 10. S1 receives a VoteRequest for term 11 from S2, and answers "yes". Then S1 crashes, and restarts. S1 initializes currentTerm from the term in its last log entry, which is 10. Now S1 receives a VoteRequest from S3 for term 11. S1 will vote for S3 for term 11 even though it previously voted for S2. There are two leaders S2 and S3 in term 11, which would lead to different peers committing different commands at the same index.


Q: What happens after all servers crash and restart at about the same time?
A: An example:
  S1, S2, S3
  1. S1, S2, S3 all restart as followers with their persisted currentTerm, say 3, 2, 1, and persisted votedFor S1, S1, S3 respectively
  2. an election timer elapses; one of them becomes a candidate, say S2 becomes a candidate for term 2 + 1 = 3
  3. S2 requests votes from S1 and S3; S1 rejects because it already voted for itself in term 3, S3 grants the vote and becomes a follower of term 3
  4. S2 becomes the leader of term 3 and starts to broadcast heartbeats to S1 and S3

  We can also infer plausible states of S1, S2, S3 before the crash:
    S1 was the leader of term 2 and then a candidate in term 3
    S2 was a follower in term 2 (it voted for S1)
    S3 was the leader of term 1 and was then partitioned away from S1 and S2


A special case that needs care (Figure 8 in the Raft paper)

index: 1 2 3
S1:    1 2
S2:    1
S3:    1 3 (leader of term 3)

then:

index: 1 2 3
S1:    1 2 4 (leader of term 4)
S2:    1 2
S3:    1 3

Although the log entry at index 2 is successfully replicated from S1 to S2, we cannot simply commit it just because it now sits on a majority. If we did so and S1 then crashed, S3 could become the leader of term 5 and would overwrite index 2 with an entry of term 3.

Committing different entries at the same index violates the safety guarantee of Raft !!!

A leader should not directly commit log entries from older terms; it only commits (by counting replicas) log entries of its own term. After S1 commits the log entry at index 3 with term 4, the log entry at index 2 is committed indirectly because of the Log Matching property.
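
A Go sketch of the resulting commit rule on the leader: advance commitIndex only for entries of the leader's own term by counting replicas; matchIndex here is assumed to include the leader's own last index.

  package main

  import "fmt"

  // advanceCommitIndex returns the new commit index. logTerms[i] is the term
  // of the entry at index i (index 0 is a sentinel), and matchIndex holds,
  // for every server including the leader itself, the highest log index
  // known to be replicated on that server.
  func advanceCommitIndex(logTerms []int, matchIndex []int, currentTerm, commitIndex int) int {
      for n := len(logTerms) - 1; n > commitIndex; n-- {
          if logTerms[n] != currentTerm {
              continue // never commit an older-term entry by counting replicas
          }
          count := 0
          for _, m := range matchIndex {
              if m >= n {
                  count++
              }
          }
          if count > len(matchIndex)/2 {
              return n // entries below n become committed indirectly (Log Matching)
          }
      }
      return commitIndex
  }

  func main() {
      // S1 in the example above, after replicating the term-4 entry at index 3
      // to S2: log terms [0, 1, 2, 4] (index 0 is the sentinel), matchIndex = [3, 3, 1].
      fmt.Println(advanceCommitIndex([]int{0, 1, 2, 4}, []int{3, 3, 1}, 4, 0)) // 3
  }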


Protocol for Cluster Membership Changes
[TBD]


TiKV Case Study

Solutions proposed in TiKV blog for fast linearizable reads over Raft (inspired by approaches mentioned in the Raft paper)

ReadIndex Read
  when a read request comes in, the leader performs the following steps to serve the read:
    1. ReadIndex = commitIndex
    2. Ensure the leader is still leader by broadcasting heartbeats to peers
    3. Wait until the state machine has applied at least up to ReadIndex
    4. Handle the read request and reply to the client
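
  A minimal Go sketch of these four steps; confirmLeadership, waitApplied and readStateMachine are hypothetical stand-ins for the real heartbeat broadcast, apply wait, and state machine.

    package main

    import "fmt"

    type node struct {
        commitIndex int
        lastApplied int
        kv          map[string]string
    }

    // confirmLeadership stands in for broadcasting heartbeats and waiting
    // for acknowledgements from a majority (step 2).
    func (n *node) confirmLeadership() bool { return true }

    // waitApplied stands in for blocking until lastApplied >= index (step 3).
    func (n *node) waitApplied(index int) { n.lastApplied = index }

    func (n *node) readStateMachine(key string) string { return n.kv[key] }

    // readIndexRead serves a linearizable read without appending to the log.
    func (n *node) readIndexRead(key string) (string, bool) {
        readIndex := n.commitIndex  // 1. take the current commit index
        if !n.confirmLeadership() { // 2. make sure we are still the leader
            return "", false
        }
        n.waitApplied(readIndex)             // 3. wait until readIndex is applied
        return n.readStateMachine(key), true // 4. answer from the state machine
    }

    func main() {
        n := &node{commitIndex: 7, kv: map[string]string{"k": "v"}}
        fmt.Println(n.readIndexRead("k"))
    }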

Lease Read
  Lease:
    Assume that start is the time when the leader starts to broadcast AppendEntries to followers.
    After the leader gets positive responses from a majority, it holds a valid
    lease which lasts until start + minimum_election_timeout / clock_drift_bound.

  When the leader holds a valid lease, it can serve read requests without writing anything into the log

  clock skew between servers has no effect on the correctness of this approach,
  but a clock that jumps back and forth (violating the assumed drift bound) does
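
  A Go sketch of the lease bookkeeping described above; the function names and the sample clock drift bound are illustrative.

    package main

    import (
        "fmt"
        "time"
    )

    // leaseExpiry computes when the lease ends: start is when the leader began
    // broadcasting this AppendEntries round, and the lease becomes valid once a
    // majority has responded. Dividing by the clock drift bound shortens the
    // lease so that slightly fast or slow clocks cannot break the guarantee.
    func leaseExpiry(start time.Time, electionTimeout time.Duration, clockDriftBound float64) time.Time {
        return start.Add(time.Duration(float64(electionTimeout) / clockDriftBound))
    }

    // Within the lease, the leader may answer reads without writing to the
    // log or contacting followers.
    func canServeLeaseRead(now, expiry time.Time) bool {
        return now.Before(expiry)
    }

    func main() {
        start := time.Now()
        expiry := leaseExpiry(start, time.Second, 1.1) // illustrative values
        fmt.Println(canServeLeaseRead(time.Now(), expiry))
    }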

  See https://pingcap.com/blog/lease-read/ for more details


More on Replicated State Machine Protocols
  Paxos / Viewstamped Replication / SMART / ZAB / Raft ...
  Single-decree paxos vs. multi-decree paxos
  [TBD]  

  Master-based Paxos has similar advantages as Raft
    1. The master is up-to-date and can serve reads of the current consensus state w/o network communication
    2. Writes can be reduced to a single round of communication by piggybacking the prepare for the next write on accept messages
    3. The master can batch writes to improve throughput

  Downsides of master-based/leader-based protocols
    1. Reading/writing must be done near the master to avoid accumulating latency from sequential requests.
    2. Any potential master must have adequate resources for the system's full workload; slave replicas waste resources until the moment they become master.
    3. Master failover can require a complicated state machine, and a series of timers must elapse before service is restored. It is difficult to avoid user-visible outages.

  What will go wrong if ZooKeeper uses Raft instead of ZAB as its consensus protocol?
    A ZooKeeper client submits requests in FIFO order, and these requests must be committed in the same order. It can submit the next request before the previous request is acknowledged.
    [TBD]


References
  The Raft Paper: https://raft.github.io/raft.pdf
  A guide for implementing Raft: https://thesquareplanet.com/blog/students-guide-to-raft/
  TiKV Blog: https://pingcap.com/blog/lease-read/
  The Megastore Paper: http://cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf
