Raft翻译
英文原文:https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14
In Search of an Understandable Consensus Algorithm(寻找一种可理解的一致性算法)
Abstract
Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.
摘要
Raft是一个用于管理复制日志的一致性算法。它产生的结果与(multi-)Paxos等价,效率也和Paxos相当,但是它的结构和Paxos不同;这使得Raft比Paxos更容易理解,也为构建实用系统提供了更好的基础。为了增强可理解性,Raft把一致性的关键要素分离开来,例如leader选举、日志复制和安全性,并通过加强一致性程度来减少必须考虑的状态数量。一项用户研究的结果表明,对学生来说Raft比Paxos更容易学习。Raft还包含一种用于变更集群成员的新机制,它利用相互重叠的多数派(overlapping majorities)来保证安全性。
1 Introduction
Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members. Because of this, they play a key role in building reliable large-scale software systems. Paxos [13, 14] has dominated the discussion of consensus algorithms over the last decade: most implementations of consensus are based on Paxos or influenced by it, and Paxos has become the primary vehicle used to teach students about consensus.
1 介绍
一致性算法允许一组机器作为一个整体协同工作,即便其中一些成员发生故障,这个集体仍然能够继续提供正常服务。正因为如此,一致性算法在构建可靠的大规模软件系统中扮演了关键角色。在过去十年中,Paxos [13, 14] 主导了关于一致性算法的讨论:大部分一致性算法的实现要么基于Paxos,要么受它影响,并且Paxos已经成为教授学生一致性算法的主要载体。
Unfortunately, Paxos is quite difficult to understand, in spite of numerous attempts to make it more approachable. Furthermore, its architecture requires complex changes to support practical systems. As a result, both system builders and students struggle with Paxos.
不幸的是,尽管有大量让Paxos更平易近人的尝试,它仍然相当难以理解。此外,它的架构需要复杂的改动才能支撑实用系统。结果是,系统构建者和学生都在Paxos面前痛苦挣扎。
After struggling with Paxos ourselves, we set out to find a new consensus algorithm that could provide a better foundation for system building and education. Our approach was unusual in that our primary goal was understandability: could we define a consensus algorithm for practical systems and describe it in a way that is significantly easier to learn than Paxos? Furthermore, we wanted the algorithm to facilitate the development of intuitions that are essential for system builders. It was important not just for the algorithm to work, but for it to be obvious why it works.
在我们自己也饱受Paxos折磨之后,我们着手寻找一种新的一致性算法,希望它能为系统构建和教学提供更好的基础。我们的方法很特别,因为我们的首要目标是可理解性:我们能否为实用系统定义一种一致性算法,并用一种比Paxos明显更容易学习的方式来描述它?此外,我们希望这个算法能帮助开发者建立直觉,这种直觉对系统构建者来说是必不可少的。重要的不仅是算法能够工作,还要让人一眼就能明白它为什么能工作。
The result of this work is a consensus algorithm called Raft. In designing Raft we applied specific techniques to improve understandability, including decomposition (Raft separates leader election, log replication, and safety) and state space reduction (relative to Paxos, Raft reduces the degree of nondeterminism and the ways servers can be inconsistent with each other). A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos.
这项工作的结果是一种称为Raft的一致性算法。在设计Raft的过程中,我们应用了一些特定的技术来提高可理解性,包括问题分解(Raft把leader选举、日志复制和安全性分离开来)和状态空间缩减(相对于Paxos,Raft减少了不确定性的程度以及服务器之间可能互相不一致的方式)。一项针对两所大学43名学生的用户研究表明,Raft比Paxos明显更容易理解:在学习了两种算法之后,其中33名学生回答Raft相关问题的表现好于回答Paxos相关问题。
Raft is similar in many ways to existing consensus algorithms (most notably, Oki and Liskov’s Viewstamped Replication [27, 20]), but it has several novel features:
• Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries only flow from the leader to other servers. This simplifies the management of the replicated log and makes Raft easier to understand.
• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly.
• Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions. This allows the cluster to continue operating normally during configuration changes.
Raft 在很多方面和现存的一致性算法很相似,但是它还有几个新奇的特征:
强leader:相比其它一致性算法,Raft使用了一种更强的领导形式。例如,日志条目(log entry)只会从leader流向其它服务器。这简化了复制日志的管理,也让Raft更容易理解。
Leader选举:Raft使用随机化的定时器来选举leader。这只是在任何一致性算法都已经需要的心跳机制上增加了少量机制,同时能够简单而迅速地解决冲突。
成员变更:Raft变更集群中服务器集合的机制使用了一种新的联合一致(joint consensus)方法,在变更过渡期间,两种不同配置的多数派会相互重叠。这使得集群在配置变更期间仍然可以正常运行。
We believe that Raft is superior to Paxos and other consensus algorithms, both for educational purposes and as a foundation for implementation. It is simpler and more understandable than other algorithms; it is described completely enough to meet the needs of a practical system; it has several open-source implementations and is used by several companies; its safety properties have been formally specified and proven; and its efficiency is comparable to other algorithms.
我们相信,无论是用于教学还是作为实现的基础,Raft都优于Paxos和其它一致性算法。它比其它算法更简单、更容易理解;它被描述得足够完整,可以满足实用系统的需求;它有多个开源实现并被多家公司使用;它的安全属性已经被形式化定义并证明;其效率也可以和其它算法媲美。
The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section 3), describes our general approach to understandability (Section 4), presents the Raft consensus algorithm (Sections 5–7), evaluates Raft (Section 8), and discusses related work (Section 9). A few elements of the Raft algorithm have been omitted here because of space limitations, but they are available in an extended technical report [29]. The additional material describes how clients interact with the system, and how space in the Raft log can be reclaimed.
这篇论文的剩余部分介绍了复制状态机问题(第2节),讨论了Paxos的优点和缺点(第3节),描述了我们实现可理解性的总体方法(第4节),介绍了Raft一致性算法(第5-7节),评估了Raft(第8节),并讨论了相关工作(第9节)。由于篇幅限制,Raft算法的少数内容在这里被省略了,但它们可以在扩展的技术报告中找到[29]。这份补充材料描述了客户端如何与系统交互,以及Raft日志中的空间如何被回收。
2 Replicated state machines
Consensus algorithms typically arise in the context of replicated state machines [33]. In this approach, state machines on a collection of servers compute identical copies of the same state and can continue operating even if some of the servers are down. Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems. For example, large-scale systems that have a single cluster leader, such as GFS [7], HDFS [34], and RAMCloud [30], typically use a separate replicated state machine to manage leader election and store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby [2] and ZooKeeper [9].
2 可复制的状态机
一致性算法通常出现在复制状态机[33]的上下文中。在这种方式里,一组服务器上的状态机计算相同状态的完全一致的副本,即便其中一些服务器宕机,系统也可以继续运行。复制状态机被用来解决分布式系统中的各种容错问题。例如,拥有单个集群leader的大规模系统,比如GFS[7]、HDFS[34]和RAMCloud[30],通常使用一个独立的复制状态机来管理leader选举,并存储那些必须在leader崩溃后依然保留的配置信息。复制状态机的例子还包括Chubby[2]和ZooKeeper[9]。
Figure 1: Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
图形1:复制状态机架构。一致性算法管理一条包含来自客户端的状态机命令的复制日志。各个状态机按日志中相同的命令序列进行处理,所以它们会产生相同的输出。
Replicated state machines are typically implemented using a replicated log, as shown in Figure 1. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state machine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
复制状态机通常使用复制日志来实现,如图形1所示。每个服务器存储一条包含一系列命令的日志,服务器上的状态机按顺序执行这些命令。每条日志以相同的顺序包含相同的命令,所以每个状态机都处理相同的命令序列。由于状态机是确定性的,每个状态机都会计算出相同的状态和相同的输出序列。
Keeping the replicated log consistent is the job of the consensus algorithm. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.
保持复制日志的一致是一致性算法的任务。服务器上的共识模块接收来自客户端的命令并把它们添加到自己的日志中。它和其它服务器上的共识模块通信,以确保每条日志最终都以相同的顺序包含相同的请求,即使有些服务器发生故障。一旦命令被正确复制,每个服务器的状态机就按日志顺序处理这些命令,并把输出返回给客户端。结果是,这些服务器看起来就像一个单一的、高度可靠的状态机。
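下面用Go给出一个极简的示意(示例中的 Command、StateMachine 等类型名是本文为了说明而假设的,并非论文内容),用来体现"确定性状态机按相同顺序应用相同的命令序列,就会得到相同的状态"这一点:

```go
package main

import "fmt"

// Command 表示客户端提交的一条状态机命令(这里用简单的键值写入来示意)。
type Command struct {
	Key   string
	Value string
}

// StateMachine 是一个确定性的状态机:只要按相同顺序应用相同的命令序列,
// 不同服务器上的状态机就会得到完全相同的状态。
type StateMachine struct {
	data map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]string)}
}

// Apply 应用一条命令并返回执行结果(真实系统中这个结果会返回给客户端)。
func (sm *StateMachine) Apply(cmd Command) string {
	sm.data[cmd.Key] = cmd.Value
	return "OK"
}

func main() {
	// 两台"服务器"各自持有一份内容相同的日志副本。
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	s1, s2 := NewStateMachine(), NewStateMachine()
	for _, cmd := range log {
		s1.Apply(cmd)
		s2.Apply(cmd)
	}
	// 由于命令序列相同且状态机是确定性的,两台服务器最终状态一致。
	fmt.Println(s1.data["x"] == s2.data["x"], s1.data["x"]) // true 3
}
```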
Consensus algorithms for practical systems typically have the following properties:
• They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.
• They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients. Thus, a typical cluster of five servers can tolerate the failure of any two servers. Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.
• They do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
• In the common case, a command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls; a minority of slow servers need not impact overall system performance.
实用系统的一致性算法通常有下面几个属性:
- 在所有非拜占庭条件下(包括网络延迟、分区、丢包、重复和乱序),它们保证安全性(永远不会返回错误的结果)。
- 只要大多数服务器可以正常运行,并且能够互相通信、也能和客户端通信,系统就是完全可用的。因此,一个典型的五服务器集群可以容忍任意两台服务器故障。假设服务器以停机的方式发生故障;它们之后可以从稳定存储中的状态恢复并重新加入集群。
- 它们不依赖时序来保证日志的一致性:错误的时钟和极端的消息延迟在最坏情况下只会引起可用性问题。
- 在通常情况下,只要集群中的大多数服务器响应了一轮远程过程调用,一条命令就可以完成;少数较慢的服务器不会影响系统整体性能。
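下面是一个简单的小例子(Go,假设性的示意代码),演示"大多数(多数派)"的计算:n 台服务器的集群需要 n/2+1 台构成多数,因此 5 台服务器的集群可以容忍 2 台故障:

```go
package main

import "fmt"

// quorum 返回由 n 台服务器组成的集群中,构成"大多数"所需的最少服务器数。
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	// 5 台服务器的集群需要 3 台构成多数,因此最多可以容忍 2 台故障。
	n := 5
	fmt.Printf("集群大小=%d 多数=%d 可容忍的故障数=%d\n", n, quorum(n), n-quorum(n))
}
```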
3 What’s wrong with Paxos?
Over the last ten years, Leslie Lamport’s Paxos protocol [13] has become almost synonymous with consensus: it is the protocol most commonly taught in courses, and most implementations of consensus use it as a starting point. Paxos first defines a protocol capable of reaching agreement on a single decision, such as a single replicated log entry. We refer to this subset as single-decree Paxos. Paxos then combines multiple instances of this protocol to facilitate a series of decisions such as a log (multi-Paxos). Paxos ensures both safety and liveness, and it supports changes in cluster membership. Its correctness has been proven, and it is efficient in the normal case.
3 Paxos有什么问题?
过去10年里,Leslie Lamport的Paxos协议[13]几乎已经成为一致性的代名词:它是课堂上最常教授的协议,大多数一致性算法的实现也以它为起点。Paxos首先定义了一个能够就单个决策(例如单条复制日志条目)达成一致的协议。我们把这个子集称为single-decree Paxos(单决策Paxos)。然后Paxos把这个协议的多个实例组合起来,以便就一系列决策(例如一条日志)达成一致(multi-Paxos)。Paxos既保证安全性(safety)也保证活性(liveness),并且支持集群成员的变更。它的正确性已经被证明,在通常情况下它也是高效的。
Unfortunately, Paxos has two significant drawbacks. The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [13] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [14, 18, 19]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year.
不幸的是,Paxos有两个重大的缺陷。第一个缺陷是Paxos特别难理解。它的完整解释[13]是出了名的晦涩;很少有人能够理解它,即使理解了也要付出巨大的努力。因此,出现了好几种用更简单的术语解释Paxos的尝试[14, 18, 19]。这些解释聚焦于单决策子集,但仍然很有挑战性。在NSDI 2012的一次针对与会者的非正式调查中,我们发现很少有人对Paxos感到得心应手,即使是经验丰富的研究人员也是如此。我们自己也在Paxos上挣扎过;直到阅读了几种简化版的解释并设计出我们自己的替代协议之后,我们才理解了完整的协议,而这个过程花了将近一年的时间。
We hypothesize that Paxos’ opaqueness derives from its choice of the single-decree subset as its foundation. Single-decree Paxos is dense and subtle: it is divided into two stages that do not have simple intuitive explanations and cannot be understood independently. Because of this, it is difficult to develop intuitions about why the singledecree protocol works. The composition rules for multiPaxos add significant additional complexity and subtlety. We believe that the overall problem of reaching consensus on multiple decisions (i.e., a log instead of a single entry) can be decomposed in other ways that are more direct and obvious.
我们推测,Paxos的晦涩源于它选择single-decree(单决策)子集作为其基础。single-decree Paxos既紧凑又微妙:它被分成两个阶段,这两个阶段没有简单直观的解释,也无法被独立理解。正因为如此,很难建立起关于单决策协议为什么能工作的直觉。multi-Paxos的组合规则又显著增加了额外的复杂性和微妙性。我们相信,就多个决策达成一致这个总体问题(即一条日志而不是单个条目)可以用其它更直接、更清晰的方式来分解。
The second problem with Paxos is that it does not provide a good foundation for building practical implementations. One reason is that there is no widely agreed-upon algorithm for multi-Paxos. Lamport’s descriptions are mostly about single-decree Paxos; he sketched possible approaches to multi-Paxos, but many details are missing. There have been several attempts to flesh out and optimize Paxos, such as [24], [35], and [11], but these differ from each other and from Lamport’s sketches. Systems such as Chubby [4] have implemented Paxos-like algorithms, but in most cases their details have not been published.
Paxos的第二个问题是它没有为构建实际的实现提供一个良好的基础。原因之一是multi-Paxos没有一个被广泛认可的算法。Lamport的描述主要是关于single-decree Paxos的;他勾画了实现multi-Paxos的几种可能方式,但是很多细节是缺失的。已经有一些充实和优化Paxos的尝试,例如[24]、[35]和[11],但是这些尝试互相之间不一样,和Lamport的草图也不一样。像Chubby[4]这样的系统实现了类Paxos(Paxos-like)的算法,但是在大多数情况下它们的细节并没有公开发表。
Furthermore, the Paxos architecture is a poor one for building practical systems; this is another consequence of the single-decree decomposition. For example, there is little benefit to choosing a collection of log entries independently and then melding them into a sequential log; this just adds complexity. It is simpler and more efficient to design a system around a log, where new entries are appended sequentially in a constrained order. Another problem is that Paxos uses a symmetric peer-to-peer approach at its core (though it eventually suggests a weak form of leadership as a performance optimization). This makes sense in a simplified world where only one decision will be made, but few practical systems use this approach. If a series of decisions must be made, it is simpler and faster to first elect a leader, then have the leader coordinate the decisions.
此外,Paxos的架构对于构建实用系统来说是糟糕的;这是single-decree分解带来的另一个后果。例如,先独立地选定一组日志条目,然后再把它们合并成一条顺序日志,这样做几乎没有任何好处,只会增加复杂性。围绕日志来设计系统更简单也更高效:新的条目按照受约束的顺序依次追加到日志中。另一个问题是,Paxos在其核心中使用一种对等的点对点方式(尽管它最终建议采用一种弱的领导形式作为性能优化)。在只需要做出一个决策的简化世界里,这是合理的,但是很少有实用系统使用这种方式。如果必须做出一系列决策,更简单、更快的做法是先选举一个leader,再由这个leader来协调这些决策。
As a result, practical systems bear little resemblance to Paxos. Each implementation begins with Paxos, discovers the difficulties in implementing it, and then develops a significantly different architecture. This is timeconsuming and error-prone, and the difficulties of understanding Paxos exacerbate the problem. Paxos’ formulation may be a good one for proving theorems about its correctness, but real implementations are so different from Paxos that the proofs have little value. The following comment from the Chubby implementers is typical:
There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol [4].
结果就是,实际系统与Paxos几乎没有相似之处。每个实现都从Paxos开始,发现实现它非常困难,然后发展出一个明显不同的架构。这既费时又容易出错,而Paxos难以理解又进一步加剧了这个问题。Paxos的表述形式也许很适合用来证明其正确性定理,但是真实的实现和Paxos差别如此之大,以至于这些证明没有多少价值。下面这段来自Chubby实现者的评论非常典型:
Paxos算法的描述和真实世界系统的需求之间存在巨大的鸿沟……最终的系统将会基于一个未经证明的协议[4]。
Because of these problems, we concluded that Paxos does not provide a good foundation either for system building or for education. Given the importance of consensus in large-scale software systems, we decided to see if we could design an alternative consensus algorithm with better properties than Paxos. Raft is the result of that experiment.
由于这些问题,我们得出结论:无论是对于系统构建还是对于教学,Paxos都没有提供一个好的基础。考虑到一致性在大规模软件系统中的重要性,我们决定尝试设计一种性质比Paxos更好的替代一致性算法。Raft就是这个实验的结果。
4 Designing for understandability
We had several goals in designing Raft: it must provide a complete and practical foundation for system building, so that it significantly reduces the amount of design work required of developers; it must be safe under all conditions and available under typical operating conditions; and it must be efficient for common operations. But our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations.
4 可理解的设计
设计Raft时我们有几个目标:它必须为系统构建提供一个完整且实用的基础,从而显著减少开发者所需的设计工作;它必须在所有条件下都是安全的,并且在典型的运行条件下是可用的;对于常见操作它必须是高效的。但是我们最重要的目标,也是最困难的挑战,是可理解性。它必须能让广大的读者比较轻松地理解这个算法。此外,它必须能让人建立起对这个算法的直觉,这样系统构建者才能做出那些在现实世界的实现中不可避免的扩展。
There were numerous points in the design of Raft where we had to choose among alternative approaches. In these situations we evaluated the alternatives based on understandability: how hard is it to explain each alternative (for example, how complex is its state space, and does it have subtle implications?), and how easy will it be for a reader to completely understand the approach and its implications?
在Raft的设计中有很多地方我们必须在多种备选方案中做出选择。在这些情况下,我们基于可理解性来评估这些备选方案:解释每种备选方案有多难(例如,它的状态空间有多复杂,它是否有微妙的隐含影响?),以及读者完全理解这种方案及其影响有多容易?
We recognize that there is a high degree of subjectivity in such analysis; nonetheless, we used two techniques that are generally applicable. The first technique is the well-known approach of problem decomposition: wherever possible, we divided problems into separate pieces that could be solved, explained, and understood relatively independently. For example, in Raft we separated leader election, log replication, safety, and membership changes.
我们认识到这种分析带有很强的主观性;尽管如此,我们使用了两种普遍适用的技术。第一种技术是众所周知的问题分解方法:只要有可能,我们就把问题分解成可以被相对独立地解决、解释和理解的几个部分。例如,我们把Raft分解成leader选举、日志复制、安全性和成员变更。
Our second approach was to simplify the state space by reducing the number of states to consider, making the system more coherent and eliminating nondeterminism where possible. Specifically, logs are not allowed to have holes, and Raft limits the ways in which logs can become inconsistent with each other. Although in most cases we tried to eliminate nondeterminism, there are some situations where nondeterminism actually improves understandability. In particular, randomized approaches introduce nondeterminism, but they tend to reduce the state space by handling all possible choices in a similar fashion (“choose any; it doesn’t matter”). We used randomization to simplify the Raft leader election algorithm.
第二种方法是通过减少需要考虑的状态数量来简化状态空间,使系统更加连贯,并尽可能消除不确定性。具体来说,日志中不允许存在空洞,并且Raft限制了日志之间可能出现不一致的方式。尽管在大多数情况下我们都试图消除不确定性,但在某些情况下不确定性实际上提高了可理解性。特别是,随机化方法会引入不确定性,但是它们倾向于用相似的方式处理所有可能的选择("任选一个,无所谓"),从而缩小状态空间。我们使用随机化来简化Raft的leader选举算法。
5 The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.
5 Raft一致性算法
Raft是一种用于管理第2节所描述形式的复制日志的算法。图形2以精简的形式总结了这个算法以供参考,图形3列出了这个算法的关键属性;这些图中的内容会在本节剩余部分逐一讨论。
Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.
Raft实现一致性的方式是首先选举一个唯一的leader,然后由这个leader全权负责管理复制日志。leader接收来自客户端的日志条目,把它们复制到其它服务器上,并告诉这些服务器什么时候可以安全地把日志条目应用到它们的状态机。拥有一个leader简化了复制日志的管理。例如,leader可以自行决定把新条目放在日志中的什么位置而不需要询问其它服务器,并且数据以一种简单的方式从leader流向其它服务器。leader可能发生故障或者和其它服务器断开连接,在这种情况下会选举出一个新的leader。
Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:
• Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).
• Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).
• Safety: the key safety property for Raft is the State Ma-chine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.
采用leader方式之后,Raft把一致性问题分解成三个相对独立的子问题,这些子问题将在后面的小节中讨论:
leader选举:当现有的leader发生故障时,必须选出一个新的leader(5.2节)。
日志复制:leader必须接收来自客户端的日志条目并把它们复制到集群中的其它服务器,强制其它服务器的日志与自己的保持一致(5.3节)。
安全性:Raft的关键安全属性是图形3中的状态机安全属性(State Machine Safety Property):如果某台服务器已经把一条特定的日志条目应用到了它的状态机,那么其它服务器不可能在同一个日志索引位置应用不同的命令。5.4节描述了Raft如何保证这个属性;解决方案涉及对5.2节所述选举机制增加一个额外的限制。
状态(State)

所有服务器上的持久化状态(在响应RPC之前,先在稳定存储上更新):

| 字段 | 说明 |
|---|---|
| currentTerm(当前轮次) | 服务器已知的最新轮次(首次启动时初始化为0,单调递增) |
| votedFor | 当前轮次中获得这台服务器选票的候选人Id(如果没有投票则为空) |
| log[](日志) | 日志条目;每条日志条目包含状态机要执行的命令,以及leader收到该条目时的轮次(日志的第一个索引为1) |

所有服务器上的易失状态:

| 字段 | 说明 |
|---|---|
| commitIndex(提交索引) | 已知已被提交的最高日志条目的索引(初始为0,单调递增) |
| lastApplied | 已被应用到状态机的最高日志条目的索引(初始为0,单调递增) |

leader上的易失状态(每次选举后重新初始化):

| 字段 | 说明 |
|---|---|
| nextIndex[] | 对每台服务器,下一条要发送给该服务器的日志条目的索引(初始化为leader最后一条日志的索引+1) |
| matchIndex[] | 对每台服务器,已知已经复制到该服务器上的最高日志条目的索引(初始为0,单调递增) |

AppendEntries RPC(追加日志条目)

由leader调用,用于复制日志条目;同时也用作心跳。

| 参数 | 说明 |
|---|---|
| term | leader的轮次 |
| leaderId | follower据此把客户端请求重定向到leader |
| prevLogIndex | 紧邻新条目之前那条日志条目的索引 |
| prevLogTerm | prevLogIndex对应条目的轮次 |
| entries[] | 要存储的日志条目(作为心跳时为空;为了效率可以一次发送多条) |
| leaderCommit | leader的commitIndex |

| 返回值 | 说明 |
|---|---|
| term | 当前轮次,供leader更新自己 |
| success | 如果follower包含与prevLogIndex和prevLogTerm匹配的条目,则为true |

接收者实现:

1. 如果term < currentTerm,返回false
2. 如果日志中不包含索引为prevLogIndex且轮次为prevLogTerm的条目,返回false
3. 如果已有条目与新条目冲突(索引相同但轮次不同),删除这条已有条目及其之后的所有条目
4. 追加日志中尚不存在的所有新条目
5. 如果leaderCommit > commitIndex,设置commitIndex = min(leaderCommit, 最后一条新条目的索引)

RequestVote RPC(请求投票)

由候选人调用,用于收集选票。

| 参数 | 说明 |
|---|---|
| term | 候选人的轮次 |
| candidateId | 请求选票的候选人 |
| lastLogIndex | 候选人最后一条日志条目的索引 |
| lastLogTerm | 候选人最后一条日志条目的轮次 |

| 返回值 | 说明 |
|---|---|
| term | 当前轮次,供候选人更新自己 |
| voteGranted | true表示候选人获得了这张选票 |

接收者实现:

1. 如果term < currentTerm,返回false
2. 如果votedFor为空或者等于candidateId,并且候选人的日志至少和接收者的日志一样新,则把选票投给该候选人

服务器节点规则

所有服务器:
- 如果commitIndex > lastApplied:递增lastApplied,把log[lastApplied]应用到状态机
- 如果RPC请求或响应中包含的轮次T > currentTerm:设置currentTerm = T,并转换为follower

Followers(跟随者):
- 响应来自候选人和leader的RPC请求
- 如果在选举超时时间内既没有收到来自当前leader的AppendEntries RPC,也没有把选票投给某个候选人:转换为候选人

Candidates(候选人):
- 转换为候选人后,开始选举:递增currentTerm,给自己投票,重置选举计时器,向其它所有服务器并行发送RequestVote RPC
- 如果收到集群中大多数服务器的选票:成为leader
- 如果收到来自新leader的AppendEntries RPC:转换为follower
- 如果选举超时:开始新一轮选举

Leaders(领导者):
- 当选后:向每台服务器发送初始的空AppendEntries RPC(心跳);在空闲期间重复发送,以防止触发选举超时
- 如果收到来自客户端的命令:把该条目追加到本地日志,在条目被应用到状态机之后再响应客户端
- 如果follower的最后日志索引 >= nextIndex:向该follower发送从nextIndex开始的日志条目的AppendEntries RPC;如果成功,更新该follower的nextIndex和matchIndex;如果因为日志不一致而失败,递减nextIndex并重试
- 如果存在一个N,使得N > commitIndex,大多数服务器的matchIndex[i] >= N,并且log[N].term == currentTerm:设置commitIndex = N
Figure 2: A condensed summary of the Raft consensus algorithm (excluding membership changes and log compaction). The server behavior in the upper-left box is described as a set of rules that trigger independently and repeatedly. Section numbers such as §5.2 indicate where particular features are discussed. A formal specification [28] describes the algorithm more precisely.
图形2:Raft一致性算法的精简总结(不包括成员变更和日志压缩)。左上角方框中的服务器行为被描述为一组独立且反复触发的规则。诸如5.2这样的章节编号指出了讨论特定功能的位置。正式的规范[28]更精确地描述了该算法。
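作为参考,下面用一个Go结构体粗略地对应图形2中单台服务器需要维护的状态(字段名沿用论文中的命名,这只是一个假设性的示意,不代表任何真实实现):

```go
package main

// LogEntry 对应图形2中的日志条目:包含一条命令以及 leader 收到该条目时的轮次。
type LogEntry struct {
	Term    int
	Command []byte
}

// ServerState 粗略对应图形2中单台服务器需要维护的状态。
type ServerState struct {
	// 持久化状态(响应 RPC 之前必须先写入稳定存储)
	CurrentTerm int        // 服务器已知的最新轮次
	VotedFor    int        // 当前轮次中投票给的候选人 Id,-1 表示尚未投票
	Log         []LogEntry // 日志;第一个有效索引为 1

	// 所有服务器上的易失状态
	CommitIndex int // 已知已提交的最高日志索引
	LastApplied int // 已应用到状态机的最高日志索引

	// leader 上的易失状态(每次当选后重新初始化)
	NextIndex  []int // 对每个 follower:下一条要发送的日志条目索引
	MatchIndex []int // 对每个 follower:已知已复制到该 follower 的最高日志索引
}

func main() {}
```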
Figure 3: Raft guarantees that each of these properties is true at all times. The section numbers indicate where each property is discussed.
图形三:Raft保证这些属性在任何时刻都成立。每个属性在标注的章节中讨论。
After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.
在介绍完一致性算法之后,本节还讨论可用性问题以及时序在系统中所扮演的角色。
5.1 Raft basics
A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.
5.1 Raft基础
一个Raft集群包含若干服务器;5是一个典型的数量,它允许系统容忍两台服务器故障。在任意给定时刻,每台服务器都处于以下三种状态之一:leader(领导者)、follower(跟随者)或candidate(候选人)。在正常运行时,集群中恰好有一个leader,其它服务器都是follower。follower是被动的:它们自己不发出请求,只是响应来自leader和candidate的请求。leader处理所有的客户端请求(如果客户端联系了follower,follower会把请求重定向给leader)。第三种状态candidate用于选举新的leader,将在5.2节中描述。图形四展示了这些状态以及它们之间的转换;这些转换将在下面讨论。
Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Raft把时间划分成任意长度的轮次(term),如图形5所示。轮次用连续的整数编号。每个轮次以一次选举开始,在选举中一个或多个候选人尝试成为leader,如5.2节所述。如果某个候选人赢得了选举,它就在该轮次余下的时间里担任leader。在某些情况下,选举会出现分票(split vote)。这种情况下该轮次以没有leader而结束;一个新的轮次(伴随一次新的选举)很快就会开始。Raft保证在一个给定的轮次中最多只有一个leader。
Figure 4: Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.
图形四:服务器状态。follower只响应来自其它服务器的请求。如果一个follower收不到任何通信,它就会变成candidate并发起一次选举。获得整个集群大多数选票的candidate会成为新的leader。leader通常会一直运行,直到自身发生故障。
Figure 5: Time is divided into terms, and each term begins with an election. After a successful election, a single leader manages the cluster until the end of the term. Some elections fail, in which case the term ends without choosing a leader. The transitions between terms may be observed at different times on different servers.
图形5:时间被划分成多个轮次,每个轮次以一次选举开始。选举成功之后,由单个leader管理集群,直到该轮次结束。有些选举会失败,这种情况下该轮次以没有选出leader而结束。不同的服务器可能在不同的时刻观察到轮次之间的转换。
Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms
不同的服务器可能在不同的时刻观察到轮次之间的转换,在某些情况下,一台服务器甚至可能观察不到某次选举或者整个轮次。
act as a logical clock [12] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
轮次在Raft中充当逻辑时钟[12]的角色,它们让服务器能够检测过时的信息,例如过时的leader。每台服务器都保存一个当前轮次编号(current term),它随时间单调递增。服务器之间通信时会交换各自的当前轮次;如果一台服务器的当前轮次小于另一台,它就把自己的当前轮次更新为较大的那个值。如果一个candidate或leader发现自己的轮次已经过时,它会立即转回follower状态。如果一台服务器收到的请求带有过时的轮次编号,它会拒绝这个请求。
Raft servers communicate using remote procedure calls (RPCs), and the consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and AppendEntries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance
Raft服务器之间使用远程过程调用(RPC)进行通信,并且这个一致性算法只需要两种类型的RPC。RequestVote RPC由候选人在选举期间发起(5.2节),AppendEntries RPC由leader发起,用于复制日志条目并提供一种心跳形式(5.3节)。如果没有及时收到响应,服务器会重试RPC,并且它们会并行地发出RPC以获得最佳性能。
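下面用Go类型粗略示意这两种RPC的参数和返回值(字段对应图形2中的定义;这是一个假设性的草图,真实实现中的编码和传输方式不在讨论范围内):

```go
package main

// LogEntry 与前文一致:一条命令加上创建它时的轮次。
type LogEntry struct {
	Term    int
	Command []byte
}

// RequestVote RPC:由候选人在选举期间发起,用于收集选票。
type RequestVoteArgs struct {
	Term         int // 候选人的轮次
	CandidateId  int // 请求选票的候选人
	LastLogIndex int // 候选人最后一条日志的索引
	LastLogTerm  int // 候选人最后一条日志的轮次
}

type RequestVoteReply struct {
	Term        int  // 接收者的当前轮次,供候选人更新自己
	VoteGranted bool // 是否把选票投给该候选人
}

// AppendEntries RPC:由 leader 发起,用于复制日志条目;Entries 为空时即为心跳。
type AppendEntriesArgs struct {
	Term         int        // leader 的轮次
	LeaderId     int        // follower 据此把客户端重定向到 leader
	PrevLogIndex int        // 紧邻新条目之前那条日志的索引
	PrevLogTerm  int        // PrevLogIndex 处条目的轮次
	Entries      []LogEntry // 要存储的日志条目
	LeaderCommit int        // leader 的 commitIndex
}

type AppendEntriesReply struct {
	Term    int  // 当前轮次,供 leader 更新自己
	Success bool // follower 是否包含与 PrevLogIndex/PrevLogTerm 匹配的条目
}

func main() {}
```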
5.2 Leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
5.2 leader选举
Raft使用一种心跳机制来触发leader选举。服务器启动时处于follower状态。只要一台服务器持续收到来自leader或candidate的有效RPC,它就一直保持follower状态。leader周期性地向所有follower发送心跳(不携带日志条目的AppendEntries RPC)来维持自己的权威。如果一个follower在一段被称为选举超时(election timeout)的时间内没有收到任何通信,它就认为当前没有可用的leader,于是发起一次选举来选出新的leader。
To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
为了开始一次选举,follower递增自己的当前轮次并转换为candidate状态。然后它给自己投票,并行地向集群中其它所有服务器发出RequestVote RPC。candidate会一直保持这个状态,直到以下三种情况之一发生:
a:它赢得了这次选举
b:另一台服务器确立自己成为leader
c:一段时间过去后没有任何一方胜出
这些结果将在下面段落中单独讨论。
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
如果一个candidate在同一轮次中获得了整个集群中大多数服务器的选票,它就赢得了这次选举。在给定的轮次中,每台服务器按照先到先得的原则最多只会投票给一名candidate(注:5.4节对投票增加了一个额外的限制)。多数派规则保证在一个特定的轮次中最多只有一名candidate能够赢得选举(图形三中的选举安全属性)。一旦某个candidate赢得选举,它就成为leader,然后向其它所有服务器发送心跳消息,以确立自己的权威并阻止新的选举。
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader’s term (included in its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and continues in candidate state.
在等待选票的过程中,candidate可能收到来自另一台声称自己是leader的服务器的AppendEntries RPC。如果这个leader的轮次(包含在它的RPC中)至少和candidate的当前轮次一样大,那么candidate就承认这个leader是合法的,并回到follower状态。如果RPC中的轮次比candidate的当前轮次小,candidate就拒绝这个RPC并继续保持candidate状态。
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.
第三种可能的结果是candidate既没有赢得选举也没有输掉选举:如果许多follower在同一时间变成candidate,选票可能被分散,导致没有任何candidate获得多数。这种情况发生时,每个candidate都会超时,然后通过递增自己的轮次并发起新一轮RequestVote RPC来开始新的选举。然而,如果没有额外的措施,分票可能会无限期地重复下去。
Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 8.3 shows that this approach elects a leader rapidly.
Raft使用随机化的选举超时来保证分票很少发生,并且即使发生也能很快解决。为了从一开始就防止分票,选举超时时间从一个固定的区间内随机选取(例如150-300ms)。这把各服务器的超时时间分散开,使得在大多数情况下只有一台服务器会先超时;它会在其它服务器超时之前赢得选举并发出心跳。同样的机制也用于处理分票。每个candidate在发起选举时都重新选取一个随机的选举超时,并等到这个超时时间过去之后才开始下一次选举;这降低了在新一轮选举中再次发生分票的可能性。8.3节表明这种方法能够快速地选出leader。
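下面是一个极简的Go示意(假设超时区间为150-300ms,randomElectionTimeout 这个函数名是本文虚构的),展示"在固定区间内随机选取选举超时"这一做法:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout 在固定区间(这里假设为 150-300ms)内随机选取一个选举超时时间。
// 随机化使得大多数情况下只有一台服务器先超时并发起选举,从而减少分票。
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	timer := time.NewTimer(randomElectionTimeout())
	defer timer.Stop()

	// 选举超时到期:转换为候选人,递增轮次并发起新一轮 RequestVote。
	<-timer.C
	fmt.Println("election timeout: start election")

	// 每次发起选举时都重新选取一个随机超时,降低再次分票的概率。
	timer.Reset(randomElectionTimeout())
}
```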
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.
选举是一个例子,说明可理解性如何指导我们在不同设计方案之间做出选择。最初我们计划使用一个排名系统:每个candidate被分配一个唯一的排名,用于在相互竞争的candidate之间做出选择。如果一个candidate发现另一个candidate的排名更高,它就会回到follower状态,这样排名更高的candidate就能更容易地赢得下一次选举。我们发现这种方法在可用性方面产生了一些微妙的问题(如果排名较高的服务器发生故障,排名较低的服务器可能需要先超时才能再次成为candidate,但如果这发生得太早,又会重置选举leader的进度)。我们对算法做了多次调整,但每次调整之后又会出现新的边界情况。最终我们得出结论:随机重试的方法更加直观,也更容易理解。
Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.
图形6:日志由条目组成,这些条目按顺序编号。每个条目包含创建它时的轮次(方框中的数字)和一条供状态机执行的命令。如果一个条目可以被安全地应用到状态机,它就被认为是已提交的(committed)。
5.3 Log replication
Once a leader has been elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
5.3 日志复制
一旦选出了leader,它就开始为客户端请求提供服务。每个客户端请求都包含一条要由复制状态机执行的命令。leader把这条命令作为一个新条目追加到自己的日志中,然后并行地向其它所有服务器发出AppendEntries RPC来复制这个条目。当这个条目被安全地复制之后(如下文所述),leader把这个条目应用到自己的状态机,并把执行结果返回给客户端。如果follower崩溃、运行缓慢,或者网络丢包,leader会无限期地重试AppendEntries RPC(即使它已经响应了客户端),直到所有follower最终都存储了全部日志条目。
Logs are organized as shown in Figure 6. Each log entry stores a state machine command along with the term number when the entry was received by the leader. The term numbers in log entries are used to detect inconsistencies between logs and to ensure some of the properties in Figure 3. Each log entry also has an integer index identifying its position in the log.
日志的组织方式如图形6所示。每条日志条目存储一条状态机命令,以及leader收到该条目时的轮次编号。日志条目中的轮次编号用于检测日志之间的不一致,并保证图形3中的某些属性。每条日志条目还带有一个整数索引,标识它在日志中的位置。
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers (e.g., entry 7 in Figure 6). This also commits all preceding entries in the leader’s log, including entries created by previous leaders. Section 5.4 discusses some subtleties when applying this rule after leader changes, and it also shows that this definition of commitment is safe. The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
leader决定什么时候可以安全地把一条日志条目应用到状态机;这样的条目被称为已提交的(committed)。Raft保证已提交的条目是持久的,并且最终会被所有可用的状态机执行。一旦创建该条目的leader把它复制到了大多数服务器上,这条日志条目就被提交了(例如图形6中的条目7)。这同时也提交了leader日志中之前的所有条目,包括由之前的leader创建的条目。5.4节讨论了leader变更之后应用这条规则时的一些细微之处,并说明这种提交的定义是安全的。leader会记录它所知道的已提交的最高索引,并把这个索引包含在之后的AppendEntries RPC(包括心跳)中,这样其它服务器最终也会知道。一旦follower得知某条日志条目已被提交,它就按日志顺序把该条目应用到本地状态机。
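下面用一小段Go代码示意follower"把已提交但尚未应用的条目按日志顺序应用到状态机"的过程(applyCommitted 是本文为说明而虚构的函数,仅为草图):

```go
package main

import "fmt"

// applyCommitted 按日志顺序把已提交但尚未应用的条目应用到状态机。
// leader 在后续的 AppendEntries(包括心跳)中携带自己的 commitIndex,
// follower 据此推进本地的 commitIndex,然后像这样逐条应用。
func applyCommitted(log []string, commitIndex int, lastApplied *int, apply func(string)) {
	for *lastApplied < commitIndex {
		*lastApplied++
		apply(log[*lastApplied-1]) // 日志索引从 1 开始,切片下标从 0 开始
	}
}

func main() {
	log := []string{"set x=1", "set y=2", "set x=3"}
	lastApplied := 0
	// 假设目前已知的 commitIndex 为 2:只应用前两条命令,第 3 条尚未提交。
	applyCommitted(log, 2, &lastApplied, func(cmd string) {
		fmt.Println("apply:", cmd)
	})
}
```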
We designed the Raft log mechanism to maintain a high level of coherency between the logs on different servers. Not only does this simplify the system’s behavior and make it more predictable, but it is an important component of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching Property in Figure 3:
• If two entries in different logs have the same index and term, then they store the same command.
• If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by AppendEntries. When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower’s log is identical to its own log up through the new entries.
我们设计Raft的日志机制是为了在不同服务器的日志之间保持高度的一致性。这不仅简化了系统的行为、让系统更可预测,而且是保证安全性的重要组成部分。Raft维护下面这些属性,它们共同构成了图形三中的日志匹配属性(Log Matching Property):
如果不同日志中的两个条目有相同的索引和轮次,那么它们存储相同的命令。
如果不同日志中的两个条目有相同的索引和轮次,那么这两个日志在该条目之前的所有条目都完全相同。
第一个属性来自这样一个事实:一个leader在给定的轮次和给定的日志索引上最多只会创建一个条目,并且日志条目永远不会改变它们在日志中的位置。第二个属性由AppendEntries执行的一个简单的一致性检查来保证。发送AppendEntries RPC时,leader会把紧邻新条目之前的那条日志条目的索引和轮次包含在其中。如果follower在自己的日志中找不到具有相同索引和轮次的条目,它就拒绝这些新条目。这个一致性检查相当于一个归纳步骤:日志初始的空状态满足日志匹配属性,而每当日志被扩展时,一致性检查又保持了日志匹配属性。因此,只要AppendEntries返回成功,leader就知道follower的日志直到这些新条目为止都和自己的日志完全相同。
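下面是这一一致性检查的一个极简Go示意(consistencyCheck 为本文虚构的函数名;假设日志索引从1开始):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// consistencyCheck 实现 AppendEntries 的一致性检查:
// follower 的日志中必须在 prevLogIndex 处存在一条轮次为 prevLogTerm 的条目,
// 否则拒绝这次追加(这里假设日志索引从 1 开始)。
func consistencyCheck(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // 从日志起点开始追加,总是允许
	}
	if prevLogIndex > len(log) {
		return false // follower 缺少这条日志
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	log := []LogEntry{{Term: 1, Command: "x=1"}, {Term: 1, Command: "y=2"}, {Term: 2, Command: "x=3"}}
	fmt.Println(consistencyCheck(log, 3, 2)) // true:索引 3 处的轮次确实是 2
	fmt.Println(consistencyCheck(log, 3, 1)) // false:同一索引上的轮次不同,拒绝
	fmt.Println(consistencyCheck(log, 5, 2)) // false:follower 缺少索引 5 处的条目
}
```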
During normal operation, the logs of the leader and followers stay consistent, so the AppendEntries consistency check never fails. However, leader crashes can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which followers’ logs may differ from that of a new leader. A follower may be missing entries that are present on the leader, it may have extra entries that are not present on the leader, or both. Missing and extraneous entries in a log may span multiple terms.
在正常运行期间,leader和follower的日志保持一致,所以AppendEntries的一致性检查从不失败。然而,leader崩溃可能使日志出现不一致(旧的leader可能还没有把自己日志中的所有条目复制完)。一连串的leader和follower崩溃会使这些不一致不断累积。图形7展示了follower的日志可能与新leader的日志不同的几种方式。follower可能缺少leader上已有的条目,也可能拥有leader上不存在的额外条目,或者两种情况同时存在。日志中缺失和多出的条目可能跨越多个轮次。
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. Section 5.4 will show that this is safe when coupled with one more restriction.
在Raft中,leader通过强制follower复制自己的日志来处理不一致。这意味着follower日志中与leader冲突的条目会被leader日志中的条目覆盖。5.4节将说明,再加上一个限制条件之后,这样做是安全的。
Figure 7: When the leader at the top comes to power, it is possible that any of scenarios (a–f) could occur in follower logs. Each box represents one log entry; the number in the box is its term. A follower may be missing entries (a–b), may have extra uncommitted entries (c–d), or both (e–f). For example, scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.
图形7:当最上方的leader上台时,follower的日志中可能出现(a-f)中的任何一种情况。每个方框代表一条日志条目;方框中的数字是它的轮次。follower可能缺少条目(a-b),可能有额外的未提交条目(c-d),或者两者都有(e-f)。例如,场景(f)可能这样发生:这台服务器是轮次2的leader,向自己的日志中添加了若干条目,然后在提交它们之前崩溃了;它很快重启,成为轮次3的leader,又向日志中添加了几条条目;在轮次2或轮次3的任何条目被提交之前,这台服务器再次崩溃,并在随后的若干轮次中一直处于宕机状态。
To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log (11 in Figure 7). If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any). Once AppendEntries succeeds, the follower’s log is consistent with the leader’s, and it will remain that way for the rest of the term.
为了让follower的日志和自己的保持一致,leader必须找到两份日志一致的最后一条日志条目,删除follower日志中该点之后的所有条目,并把leader日志中该点之后的所有条目发送给follower。所有这些动作都发生在响应AppendEntries RPC所执行的一致性检查中。leader为每个follower维护一个nextIndex,它是leader将要发送给该follower的下一条日志条目的索引。当一个leader刚上台时,它把所有的nextIndex都初始化为自己日志中最后一条日志的索引加1(图形7中为11)。如果某个follower的日志和leader的不一致,下一次AppendEntries RPC中的一致性检查就会失败。被拒绝之后,leader把nextIndex减1并重试AppendEntries RPC。最终nextIndex会到达一个leader和follower日志一致的位置。这时AppendEntries就会成功,它会删除follower日志中所有冲突的条目,并追加上leader日志中的条目(如果有的话)。一旦AppendEntries成功,follower的日志就和leader的一致了,并且在该轮次余下的时间里都会保持这种状态。
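下面用一小段Go代码示意leader针对单个follower"失败则回退nextIndex并重试"的过程(replicateTo 等名字为本文虚构,真实实现中 sendAppendEntries 对应一次RPC):

```go
package main

import "fmt"

// replicateTo 示意 leader 针对某个 follower 的复制循环:
// 一致性检查失败(日志不一致)时递减 nextIndex 并重试,
// 直到找到两份日志一致的位置;成功后 nextIndex 指向 leader 日志末尾之后。
// sendAppendEntries 在真实实现中是一次 RPC,这里用函数参数代替。
func replicateTo(leaderLogLen, nextIndex int, sendAppendEntries func(prevLogIndex int) bool) int {
	for {
		prevLogIndex := nextIndex - 1
		if sendAppendEntries(prevLogIndex) {
			// 成功:follower 的日志从该位置起与 leader 一致,
			// 此时可以更新该 follower 的 nextIndex 和 matchIndex。
			return leaderLogLen + 1
		}
		nextIndex-- // 一致性检查失败:回退一步再试
	}
}

func main() {
	// 假设 follower 只和 leader 的前 4 条日志一致,而 leader 共有 10 条日志。
	followerMatches := 4
	leaderLogLen := 10
	next := replicateTo(leaderLogLen, leaderLogLen+1, func(prevLogIndex int) bool {
		return prevLogIndex <= followerMatches
	})
	fmt.Println("replication done, nextIndex =", next) // 11
}
```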
The protocol can be optimized to reduce the number of rejected AppendEntries RPCs; see [29] for details.
可以对这个协议进行优化,以减少被拒绝的AppendEntries RPC的数量;详情见[29]。
With this mechanism, a leader does not need to take any special actions to restore log consistency when it comes to power. It just begins normal operation, and the logs automatically converge in response to failures of the AppendEntries consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only Property in Figure 3).
有了这种机制,一个leader上台时不需要采取任何特别的措施来恢复日志的一致性。它只需要开始正常运行,日志就会在AppendEntries一致性检查失败的响应过程中自动收敛。leader永远不会覆盖或删除自己日志中的条目(图形三中的Leader Append-Only属性)。
This log replication mechanism exhibits the desirable consensus properties described in Section 2: Raft can accept, replicate, and apply new log entries as long as a ma-jority of the servers are up; in the normal case a new entry can be replicated with a single round of RPCs to a majority of the cluster; and a single slow follower will not impact performance.
这种日志复制机制展示了第2节中描述的那些理想的一致性属性:只要大多数服务器处于运行状态,Raft就可以接收、复制并应用新的日志条目;在通常情况下,一条新条目只需要一轮RPC就可以复制到集群中的大多数服务器;而且单个缓慢的follower不会影响整体性能。
5.4 Safety
The previous sections described how Raft elects leaders and replicates log entries. However, the mechanisms described so far are not quite sufficient to ensure that each state machine executes exactly the same commands in the same order. For example, a follower might be unavailable while the leader commits several log entries, then it could be elected leader and overwrite these entries with new ones; as a result, different state machines might execute different command sequences.
5.4 安全
前面几节描述了Raft如何选举leader以及如何复制日志条目。然而,到目前为止所描述的机制还不足以保证每个状态机以相同的顺序执行完全相同的命令。例如,在leader提交若干日志条目的同时,某个follower可能不可用,之后这个follower可能被选举为leader,并用新的条目覆盖这些已提交的条目;结果就是,不同的状态机可能执行不同的命令序列。
This section completes the Raft algorithm by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property from Figure 3). Given the election restriction, we then make the rules for commitment more precise. Finally, we present a proof sketch for the Leader Completeness Property and show how it leads to correct behavior of the replicated state machine.
本节通过对哪些服务器可以被选举为leader增加一个限制来完善Raft算法。这个限制保证任意给定轮次的leader都包含之前轮次中已提交的所有条目(图形三中的Leader Completeness属性)。在给出选举限制之后,我们再把提交规则表述得更精确。最后,我们给出Leader Completeness属性的证明思路,并说明它如何保证复制状态机的正确行为。
5.4.1 Election restriction
5.4.1 选举限制
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. In some consensus algorithms, such as Viewstamped Replication [20], a leader can be elected even if it doesn’t initially contain all of the committed entries. These algorithms contain additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election process or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity. Raft uses a simpler approach where it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.
在任何基于leader的一致性算法中,leader最终都必须存储所有已提交的日志条目。在某些一致性算法中,例如Viewstamped Replication[20],即使一台服务器最初不包含所有已提交的条目,它也可以被选为leader。这些算法包含额外的机制来找出缺失的条目,并在选举过程中或选举之后不久把它们传输给新的leader。不幸的是,这会带来相当多的额外机制和复杂性。Raft使用了一种更简单的方式:它保证从当选的那一刻起,之前轮次中所有已提交的条目就已经存在于每个新的leader上,而不需要把这些条目再传输给leader。这意味着日志条目只沿一个方向流动,即从leader流向follower,并且leader永远不会覆盖自己日志中已有的条目。
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers. If the candidate’s log is at least as up-to-date as any other log in that majority (where “up-to-date” is defined precisely below), then it will hold all the committed entries. The RequestVote RPC implements this restriction: the RPC includes information about the candidate’s log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.
Raft利用投票过程来阻止一个候选人赢得选举,除非它的日志包含了所有已提交的条目。候选人必须联系集群中的大多数服务器才能当选,这意味着每一条已提交的条目都至少存在于这些服务器中的一台上。如果候选人的日志至少和这个多数派中任何其它日志一样新("一样新"的精确定义见下文),那么它就一定持有所有已提交的条目。RequestVote RPC实现了这个限制:这个RPC中包含候选人的日志信息,如果投票者自己的日志比候选人的日志更新,它就拒绝投票。
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
Raft通过比较两个日志中最后一条条目的索引和轮次来判断哪个日志更新。如果两个日志最后一条条目的轮次不同,那么轮次较大的日志更新。如果最后的轮次相同,那么更长的日志更新。
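这个"谁的日志更新"的比较可以直接写成一个小函数,下面是一个Go示意(atLeastAsUpToDate 为本文虚构的函数名):

```go
package main

import "fmt"

// atLeastAsUpToDate 判断日志 A(用最后一条日志的轮次和索引描述)是否至少和日志 B 一样新:
// 先比较最后一条日志的轮次,轮次相同时再比较日志长度(索引)。
func atLeastAsUpToDate(aLastTerm, aLastIndex, bLastTerm, bLastIndex int) bool {
	if aLastTerm != bLastTerm {
		return aLastTerm > bLastTerm
	}
	return aLastIndex >= bLastIndex
}

func main() {
	// 候选人最后一条日志是(轮次 3, 索引 5),投票者是(轮次 2, 索引 9):
	// 候选人的日志更新,投票者可以把票投给它。
	fmt.Println(atLeastAsUpToDate(3, 5, 2, 9)) // true
	// 轮次相同但候选人的日志更短:投票者拒绝投票。
	fmt.Println(atLeastAsUpToDate(2, 4, 2, 9)) // false
}
```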
Figure 8: A time sequence showing why a leader cannot determine commitment using log entries from older terms. In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.
图形8:一个时间序列,说明为什么leader不能通过旧轮次的日志条目来判断提交。
(a)中,S1是leader,索引2处的日志条目只复制到了部分服务器。
(b)中,S1崩溃;S5凭借S3、S4和它自己的选票当选为轮次3的leader,并在日志索引2处接受了一条不同的条目。
(c)中,S5崩溃;S1重启并再次当选为leader,继续复制。此时,来自轮次2的那条日志条目已经被复制到大多数服务器上,但它还没有被提交。
如果S1像(d)中那样崩溃,S5可能(凭借S2、S3、S4的选票)再次当选为leader,并用它自己轮次3的条目覆盖其它节点上的这条条目。
然而,如果S1像(e)中那样,在崩溃之前把自己当前轮次的一条条目复制到了大多数服务器上,那么这条条目就被提交了(S5不可能赢得选举)。此时日志中它之前的所有条目也都被提交了。
5.4.2 Committing entries from previous terms
5.4.2 提交以前轮次的entries
As described in Section 5.3, a leader knows that an entry from its current term is committed once that entry is stored on a majority of the servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating the entry. However, a leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.
如5.3节所述,一条来自leader当前轮次的日志条目一旦被存储到大多数服务器上,leader就知道它已被提交。如果leader在提交某条条目之前崩溃了,后续的leader会尝试完成该条目的复制。然而,对于来自之前轮次的条目,leader不能仅仅因为它已经存储到了大多数服务器上就立即断定它已被提交。图形8展示了一种情形:一条旧的日志条目已经存储到了大多数服务器上,却仍然可能被未来的leader覆盖。
To eliminate problems like the one in Figure 8, Raft never commits log entries from previous terms by counting replicas. Only log entries from the leader’s current term are committed by counting replicas; once an entry from the current term has been committed in this way, then all prior entries are committed indirectly because of the Log Matching Property. There are some situations where a leader could safely conclude that an older log entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity
为了消除图形8中这类问题,Raft从不通过统计副本数量的方式来提交之前轮次的日志条目。只有leader当前轮次的日志条目才通过统计副本数量来提交;一旦当前轮次的某条条目以这种方式被提交,那么由于日志匹配属性,它之前的所有条目也就被间接地提交了。在某些情形下,leader本可以安全地断定一条较老的日志条目已被提交(例如,这条条目已经存储在每一台服务器上),但是为了简单起见,Raft采取了更保守的做法。
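下面用一小段Go代码示意这条提交规则:只有当某个索引已经复制到大多数服务器、并且该索引处条目的轮次等于当前轮次时,leader才推进commitIndex(advanceCommitIndex 为本文虚构的函数,假设matchIndex中包含leader自己):

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommitIndex 示意图形2中的提交规则:寻找一个 N > commitIndex,
// 使得大多数服务器的 matchIndex >= N,并且 log[N].term == currentTerm;
// 若存在这样的 N,就把 commitIndex 推进到 N(这里假设 matchIndex 中包含 leader 自己)。
func advanceCommitIndex(matchIndex, logTerms []int, commitIndex, currentTerm int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// sorted[len/2] 是已经复制到多数服务器上的最高索引。
	n := sorted[len(sorted)/2]
	if n > commitIndex && n > 0 && logTerms[n-1] == currentTerm {
		return n
	}
	return commitIndex // 旧轮次的条目即使已复制到多数节点,也不通过计数副本的方式提交
}

func main() {
	// 5 台服务器(含 leader 自己),日志各条目的轮次如下,当前轮次为 2。
	matchIndex := []int{5, 5, 5, 3, 2}
	fmt.Println(advanceCommitIndex(matchIndex, []int{1, 1, 2, 2, 2}, 2, 2)) // 5
	// 如果索引 5 处的条目属于旧轮次,则不会通过计数副本来提交它。
	fmt.Println(advanceCommitIndex(matchIndex, []int{1, 1, 1, 1, 1}, 2, 2)) // 2
}
```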
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader rereplicates entries from prior “terms,” it must do so with its new “term number.” Raft’s approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed).
Raft在提交规则上接受这点额外的复杂性,是因为当leader复制之前轮次的条目时,这些日志条目保留它们原来的轮次编号。在其它一致性算法中,如果新的leader重新复制之前"轮次"的条目,它必须使用自己新的"轮次编号"。Raft的方法使得对日志条目的推理更容易,因为它们在不同时间、不同日志中始终保持相同的轮次编号。此外,和其它算法相比,Raft中新的leader需要发送的之前轮次的日志条目更少(其它算法必须在这些日志条目可以被提交之前,发送冗余的日志条目来为它们重新编号)。
Figure 9: If S1 (leader for term T) commits a new log entry from its term, and S5 is elected leader for a later term U, then there must be at least one server (S3) that accepted the log entry and also voted for S5.
图形9:如果S1(轮次T的leader)在它的轮次中提交了一条新的日志条目,而S5在之后的轮次U中被选举为leader,那么必然至少有一台服务器(S3)既接受了这条日志条目,又把选票投给了S5。
5.4.3 Safety argument
5.4.3 安全性论证
Given the complete Raft algorithm, we can now argue more precisely that the Leader Completeness Property holds (this argument is based on the safety proof; see Section 8.2). We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.
有了完整的Raft算法,我们现在可以更严谨地论证Leader Completeness属性成立(这个论证基于安全性证明;见8.2节)。我们先假设Leader Completeness属性不成立,然后推导出矛盾。假设轮次T的leader(leaderT)在它的轮次中提交了一条日志条目,但这条日志条目没有被某个未来轮次的leader存储。考虑不存储这条条目的leader(leaderU)所在的、满足U > T的最小轮次U。
1. The committed entry must have been absent from leaderU’s log at the time of its election (leaders never delete or overwrite entries)
在leaderU当选时,这条已提交的条目一定不在它的日志中(leader从不删除或覆盖条目)。
2. leaderT replicated the entry on a majority of the cluster, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted for leaderU, as shown in Figure 9. The voter is key to reaching a contradiction.
leaderT把这条条目复制到了集群中的大多数服务器上,而leaderU获得了集群中大多数服务器的选票。因此,至少有一台服务器("投票者")既接受了来自leaderT的这条条目,又把选票投给了leaderU,如图形9所示。这个投票者是推出矛盾的关键。
3. The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it would have rejected the AppendEntries request from leaderT (its current term would have been higher than T).
这个选民一定是在投票给leaderU之前就接受了来自leaderT的已提交entry;否则它会拒绝来自leaderT的AppendEntries请求(因为那时它的当前轮次已经比T大了)。
4. The voter still stored the entry when it voted for leaderU, since every intervening leader contained the entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.
当这个选民投票给leaderU时,它仍然保存着这条entry,因为(根据假设)每个介于其间的leader都包含这条entry,leader从不删除entry,而follower只有在与leader冲突时才会删除entry。
5. The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This leads to one of two contradictions.
这个选民把选票投给了leaderU,所以leaderU的log必须至少和选民的log一样新(up-to-date;本小节末尾给出了这一比较规则的代码草稿)。这就导致了以下两个矛盾之一。
6. First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the voter contained the committed entry and leaderU was assumed not to.
首先,如果选民和leaderU的最后一条log entry轮次相同,那么leaderU的log必须至少和选民的一样长,所以它的log包含选民log中的每一条entry。这就是一个矛盾,因为选民包含那条已提交的entry,而根据假设leaderU并不包含。
7. Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
否则,leaderU最后一条log entry的轮次一定比选民的更大。而且,这个轮次比T更大,因为选民最后一条log entry的轮次至少是T(它包含来自轮次T的已提交entry)。根据假设,创建leaderU最后一条log entry的那个更早的leader,其log中必然包含这条已提交的entry。那么,根据Log Matching属性,leaderU的log也必须包含这条已提交的entry,这就是一个矛盾。
8. This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries from term T that are committed in term T.
这样就推出了矛盾。因此,所有轮次大于T的leader都必须包含轮次T中提交的、来自轮次T的所有entry。
9. The Log Matching Property guarantees that future leaders will also contain entries that are committed indirectly, such as index 2 in Figure 8(d).
Log Matching属性保证未来的leader也会包含那些被间接提交的entry,例如图8(d)中index 2处的entry。
Given the Leader Completeness Property, it is easy to prove the State Machine Safety Property from Figure 3 and that all state machines apply the same log entries in the same order (see [29]).
有了Leader Completeness属性,就很容易证明图3中的State Machine Safety属性,以及所有状态机都以相同的顺序应用相同的log entry(见[29])。
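下面是一段示意性的Go代码草稿(并非论文给出的代码),用来说明上面第5、6、7步论证所依赖的"log至少一样新(up-to-date)"的比较规则:先比较最后一条entry的轮次,轮次相同时再比较log的长度。其中的LogEntry结构和isUpToDate函数都是为举例而假设的。

```go
package raft

// LogEntry 表示复制日志中的一条entry(示例用的假设结构)。
type LogEntry struct {
	Term int
}

// isUpToDate 判断候选者的log是否至少和本地log一样新:
// 先比较最后一条entry的轮次,轮次相同时再比较log的长度(即最后一条entry的index)。
// 这对应投票限制中使用的规则,也是上文安全性论证第5步所依赖的事实。
func isUpToDate(candidateLastTerm, candidateLastIndex int, local []LogEntry) bool {
	localLastTerm, localLastIndex := 0, 0
	if n := len(local); n > 0 {
		localLastTerm = local[n-1].Term
		localLastIndex = n
	}
	if candidateLastTerm != localLastTerm {
		return candidateLastTerm > localLastTerm
	}
	return candidateLastIndex >= localLastIndex
}
```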
5.5 Follower and candidate crashes
5.5 follower和candidate崩溃
Until this point we have focused on leader failures. Follower and candidate crashes are much simpler to handle than leader crashes, and they are both handled in the same way. If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm. For example, if a follower receives an AppendEntries request that includes log entries already present in its log, it ignores those entries in the new request.
到目前为止,我们只关注了leader故障。follower和candidate崩溃的处理比leader崩溃简单得多,并且两者的处理方式相同。如果一个follower或candidate崩溃了,那么之后发送给它的RequestVote和AppendEntries RPC都会失败。Raft通过无限重试来处理这些失败;只要崩溃的服务器重新启动,这些RPC就会成功完成。如果服务器在完成一个RPC之后、但在回复之前崩溃了,那么它重启后会再次收到同一个RPC。Raft的RPC是幂等的,所以重复收到不会造成任何问题。例如,如果follower收到一个AppendEntries请求,其中包含的log entry已经存在于它的log中,它就会忽略新请求中的这些entry。
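下面用一段示意性的Go代码草稿说明AppendEntries为什么是幂等的:对于已经存在且轮次一致的entry直接跳过,只在发现冲突时截断并追加新entry。函数名appendEntries以及"prevLogIndex处的一致性检查已经通过"这一前提都是为举例而假设的。

```go
package raft

// LogEntry 表示复制日志中的一条entry(示例用的假设结构)。
type LogEntry struct {
	Term int
}

// appendEntries 把entries追加到本地log中(假设prevLogIndex处的一致性检查已经通过):
// 已经存在且轮次一致的entry会被跳过,因此重复收到同一个RPC不会产生任何影响。
func appendEntries(log []LogEntry, prevLogIndex int, entries []LogEntry) []LogEntry {
	for i, e := range entries {
		idx := prevLogIndex + i // 该entry在切片中的下标(等于它的log index减1)
		if idx < len(log) {
			if log[idx].Term == e.Term {
				continue // entry已存在,直接忽略
			}
			log = log[:idx] // 与leader冲突,删除该位置及其之后的所有entry
		}
		log = append(log, e)
	}
	return log
}
```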
5.6 Timing and availability
5.6 时序和可用性
One of our requirements for Raft is that safety must not depend on timing: the system must not produce incorrect results just because some event happens more quickly or slowly than expected. However, availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.
我们对Raft的要求之一是安全性不能依赖于时序(timing):系统不能仅仅因为某些事件发生得比预期快或慢就产生错误的结果。然而,可用性(系统及时响应客户端的能力)不可避免地要依赖于时序。例如,如果消息交换所花的时间比服务器两次崩溃之间的典型间隔还要长,candidate就无法存活足够长的时间来赢得选举;没有一个稳定的leader,Raft就无法取得进展。
Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement:
broadcastTime ≪ electionTimeout ≪ MTBF
In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections; given the randomized approach used for election timeouts, this inequality also makes split votes unlikely. The election timeout should be a few orders of magnitude less than MTBF so that the system makes steady progress. When the leader crashes, the system will be unavailable for roughly the election timeout; we would like this to represent only a small fraction of overall time.
leader选举是Raft中对时序要求最苛刻的部分。只要系统满足下面的时序要求,Raft就能够选举并维持一个稳定的leader:
broadcastTime ≪ electionTimeout ≪ MTBF(广播时间 ≪ 选举超时时间 ≪ 平均故障间隔时间)
在这个不等式中,broadcastTime(广播时间)是一台服务器并行地向集群中每台服务器发送RPC并收到回复的平均时间;electionTimeout(选举超时时间)就是5.2节中描述的选举超时时间;MTBF是单台服务器两次故障之间的平均时间。
广播时间应该比选举超时时间小一个数量级,这样leader才能可靠地发送心跳消息,防止follower发起新的选举;再结合选举超时时间采用的随机化方法,这个不等式也使得选票被瓜分(split vote)的情况不太可能发生。选举超时时间应该比MTBF小几个数量级,这样系统才能稳定地运行。当leader崩溃时,系统大约会在一个选举超时时间内不可用;我们希望这只占整体运行时间的很小一部分。
The broadcast time and MTBF are properties of the underlying system, while the election timeout is something we must choose. Raft’s RPCs typically require the recipient to persist information to stable storage, so the broadcast time may range from 0.5ms to 20ms, depending on storage technology. As a result, the election timeout is likely to be somewhere between 10ms and 500ms. Typical server MTBFs are several months or more, which easily satisfies the timing requirement.
广播时间和MTBF是底层系统的属性,而选举超时时间则是我们必须自己选择的。Raft的RPC通常要求接收者把信息持久化到稳定存储中,所以广播时间大约在0.5ms到20ms之间,取决于存储技术。因此,选举超时时间大概要设在10ms到500ms之间。典型服务器的MTBF是几个月甚至更长,很容易满足这个时序要求。
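下面给出一个示意性的Go代码草稿,说明如何按照上述不等式选取随机化的选举超时时间;其中的具体数值(50ms、150ms~300ms)只是满足 broadcastTime ≪ electionTimeout ≪ MTBF 关系的示例取值,实际部署时需要根据网络和存储延迟来调整。

```go
package raft

import (
	"math/rand"
	"time"
)

// 以下常量只是示例取值,用来体现 broadcastTime ≪ electionTimeout ≪ MTBF 的关系。
const (
	heartbeatInterval  = 50 * time.Millisecond  // leader发送心跳的周期,量级上接近广播时间
	electionTimeoutMin = 150 * time.Millisecond // 选举超时时间的下界
	electionTimeoutMax = 300 * time.Millisecond // 选举超时时间的上界
)

// randomElectionTimeout 在[electionTimeoutMin, electionTimeoutMax)内随机选取
// 一个选举超时时间,用来降低选票被瓜分(split vote)的概率。
func randomElectionTimeout() time.Duration {
	return electionTimeoutMin +
		time.Duration(rand.Int63n(int64(electionTimeoutMax-electionTimeoutMin)))
}
```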
6 Cluster membership changes
6 集群成员变更
Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm
到目前为止,我们一直假设集群配置(参与一致性算法的服务器集合)是固定的。实际上,偶尔需要改变配置,例如在服务器故障时替换它们,或者改变副本的数量。虽然可以通过让整个集群下线、更新配置文件、再重启集群来完成这件事,但这会使集群在切换期间不可用。此外,只要存在人工操作的步骤,就会有操作失误的风险。为了避免这些问题,我们决定让配置变更自动化,并把它纳入Raft一致性算法中。
For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).
为了保证配置变更机制的安全性,在切换过程中不能存在任何一个时间点,使得同一个轮次内可能选出两个leader。不幸的是,任何让服务器直接从旧配置切换到新配置的做法都是不安全的。不可能原子地一次性切换所有服务器,所以在切换期间集群有可能分裂成两个相互独立的多数派(见图10)。
In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [20]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:
• Log entries are replicated to all servers in both configurations.
• Any server from either configuration may serve as leader.
• Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
为了确保安全性,配置变更必须采用两阶段的方法。实现这两个阶段的方式有很多种。例如,一些系统(如[20])在第一阶段先禁用旧配置,使它不能处理客户端请求;然后在第二阶段启用新配置。在Raft中,集群首先切换到一个过渡性的配置,我们称之为联合共识(joint consensus);一旦联合共识的配置被提交,系统再切换到新配置。联合共识同时包含旧配置和新配置:
• log entry会被复制到两种配置中的所有服务器上。
• 两种配置中的任何服务器都可以充当leader。
• 达成一致(无论是选举还是entry的提交)需要分别获得旧配置和新配置中各自的大多数同意。
联合共识允许各台服务器在不同的时间点在两种配置之间转换,而不会破坏安全性。此外,联合共识允许集群在配置变更期间继续为客户端请求提供服务。
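下面这段示意性的Go代码草稿说明联合共识(Cold,new)下"大多数"的判定方式:一组选票(或已复制某条entry的服务器集合)必须同时在旧配置和新配置中各自构成大多数。其中的Config结构和函数名都是为举例而假设的。

```go
package raft

// Config 表示一个集群配置,即参与一致性算法的服务器ID集合(示例用的假设结构)。
type Config struct {
	Servers map[int]bool
}

// jointQuorum 判断一组同意票是否满足联合共识的要求:
// 必须同时获得旧配置和新配置中各自的大多数。
func jointQuorum(votes map[int]bool, oldCfg, newCfg Config) bool {
	return hasMajority(votes, oldCfg) && hasMajority(votes, newCfg)
}

// hasMajority 判断votes中属于cfg、且投了同意票的服务器是否超过cfg的半数。
func hasMajority(votes map[int]bool, cfg Config) bool {
	count := 0
	for id, granted := range votes {
		if granted && cfg.Servers[id] {
			count++
		}
	}
	return count > len(cfg.Servers)/2
}
```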
Figure 10: Switching directly from one configuration to another is unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration (Cold) and another with a majority of the new configuration (Cnew).
图10: 直接从一个配置切换到另一个配置是不安全的,因为不同的服务器会在不同的时间完成切换。在这个例子中,集群从三台服务器扩展到五台。不幸的是,存在一个时间点,两个不同的leader可以在同一个轮次被选出:一个由旧配置(Cold)的大多数选出,另一个由新配置(Cnew)的大多数选出。
Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from Cold to Cnew, it stores the configuration for joint consensus (Cold,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of Cold,new to determine when the log entry for Cold,new is committed. If the leader crashes, a new leader may be chosen under either Cold or Cold,new, depending on whether the winning candidate has received Cold,new. In any case, Cnew cannot make unilateral decisions during this period.
集群配置的保存和传播是通过复制日志中的特殊entry来实现的;图11展示了配置变更的过程。当leader收到把配置从Cold改为Cnew的请求时,它把联合共识的配置(图中的Cold,new)作为一条log entry保存下来,并用前面描述的机制复制这条entry。一旦某台服务器把这条新的配置entry追加到自己的log中,它之后所有的决策都使用这个配置(服务器总是使用其log中最新的配置,而不管该entry是否已被提交)。这意味着leader将使用Cold,new的规则来判定Cold,new这条log entry何时被提交。如果leader崩溃了,新的leader可能在Cold下选出,也可能在Cold,new下选出,这取决于获胜的candidate是否已经收到了Cold,new。无论哪种情况,在此期间Cnew都不能单方面做出决定。
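下面这段示意性的Go代码草稿(结构与函数名为举例假设)对应上一段中的规则:服务器总是使用其log中最新的配置entry,而不要求该entry已经被提交。

```go
package raft

// Config 表示一个集群配置(示例用的假设结构)。
type Config struct {
	Servers map[int]bool
}

// LogEntry 的ConfigChange字段非空时,表示这是一条配置变更entry。
type LogEntry struct {
	Term         int
	ConfigChange *Config
}

// latestConfig 从后向前扫描log,返回其中最新的配置;
// 注意这里并不要求该配置entry已经提交。若log中没有配置entry,则返回初始配置。
func latestConfig(log []LogEntry, initial Config) Config {
	for i := len(log) - 1; i >= 0; i-- {
		if log[i].ConfigChange != nil {
			return *log[i].ConfigChange
		}
	}
	return initial
}
```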
Once Cold,new has been committed, neither Cold nor Cnew can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the Cold,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing Cnew and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of Cnew, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when Cold and Cnew can both make unilateral decisions; this guarantees safety.
一旦Cold,new被提交,Cold和Cnew都不能在没有对方同意的情况下做出决定,并且Leader Completeness属性确保只有拥有Cold,new这条log entry的服务器才能被选举为leader。这时leader就可以安全地创建一条描述Cnew的log entry并把它复制到集群中。同样,这个配置一旦被某台服务器看到就立即在该服务器上生效。当新配置按照Cnew的规则被提交后,旧配置就无关紧要了,不在新配置中的服务器可以被关闭。如图11所示,不存在Cold和Cnew能够同时单方面做决定的时间点;这就保证了安全性。
Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the Cold,new configuration entry in its log and commits it to Cold,new (a majority of Cold and a majority of Cnew). Then it creates the Cnew entry and commits it to a majority of Cnew. There is no point in time in which Cold and Cnew can both make decisions independently.
图11: 配置变更的时间线。虚线表示已创建但尚未提交的配置entry,实线表示最新的已提交配置entry。leader首先在自己的log中创建Cold,new配置entry,并把它提交到Cold,new(即Cold的大多数和Cnew的大多数)。然后它创建Cnew entry,并把它提交给Cnew的大多数。不存在Cold和Cnew能够各自独立做决定的时间点。
There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.
The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the Cnew log entry. This means that there will be a period of time (while it is committing Cnew) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when Cnew is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from Cnew). Before this point, it may be the case that only a server from Cold can be elected leader.
The third issue is that removed servers (those not in Cnew) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
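下面是一段示意性的Go代码草稿,对应上一段描述的规则:如果服务器在最小选举超时时间内刚从当前leader收到过消息,就忽略收到的RequestVote RPC,从而避免被移除的服务器用更大的轮次编号扰乱集群。其中的Server结构和字段名都是为举例而假设的。

```go
package raft

import "time"

// Server 只包含本示例需要的字段(为说明而假设)。
type Server struct {
	lastLeaderContact  time.Time     // 最近一次收到当前leader消息(心跳或AppendEntries)的时间
	minElectionTimeout time.Duration // 选举超时时间的下界
}

// shouldIgnoreRequestVote 判断是否应当忽略一个RequestVote RPC:
// 如果距离上次收到leader消息还不到最小选举超时时间,则既不更新轮次也不投票。
func (s *Server) shouldIgnoreRequestVote(now time.Time) bool {
	return now.Sub(s.lastLeaderContact) < s.minElectionTimeout
}
```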
7 Clients and log compaction
This section has been omitted due to space limitations, but the material is available in the extended version of this paper [29]. It describes how clients interact with Raft, including how clients find the cluster leader and how Raft supports linearizable semantics [8]. The extended version also describes how space in the replicated log can be reclaimed using a snapshotting approach. These issues apply to all consensus-based systems, and Raft’s solutions are similar to other systems.
8 Implementation and evaluation
We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud [30] and assists in failover of the RAMCloud coordinator. The Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available [21]. There are also about 25 independent third-party open source implementations [31] of Raft in various stages of development, based on drafts of this paper. Also, various companies are deploying Raft-based systems [31]. The remainder of this section evaluates Raft using three criteria: understandability, correctness, and performance.
8.1 Understandability
To measure Raft’s understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.