Paxos Made Simple
Leslie Lamport 01 Nov 2001
Abstract
The Paxos algorithm, when presented in plain English, is very simple.
1 Introduction
The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers [5]. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm, the “synod” algorithm of [5]. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system. That approach should be well-known, since it is the subject of what is probably the most often cited article on the theory of distributed systems [4].
2 The Consensus Algorithm
2.1 The Problem
Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:
• Only a value that has been proposed may be chosen,
• Only a single value is chosen, and
• A process never learns that a value has been chosen unless it actually has been.
We won’t try to specify precise liveness requirements. However, the goal is to ensure that some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value.
We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent, but the mapping from agents to processes does not concern us here.
Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:
• Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.
• Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.
2.2 Choosing a Value
The easiest way to choose a value is to have a single acceptor agent. A proposer sends a proposal to the acceptor, who chooses the first proposed value that it receives. Although simple, this solution is unsatisfactory because the failure of the acceptor makes any further progress impossible.
So, let’s try another way of choosing a value. Instead of a single acceptor, let’s use multiple acceptor agents. A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it. How large is large enough? To ensure that only a single value is chosen, we can let a large enough set consist of any majority of the agents. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value. (There is an obvious generalization of a majority that has been observed in numerous papers, apparently starting with [3].)
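The claim that any two majorities share an acceptor can be checked by brute force on a small set. The following is only an illustrative sketch; the helper names `majorities` and `all_majorities_intersect` are hypothetical and not part of the paper.

```python
from itertools import combinations


def majorities(acceptors):
    """Yield every subset containing more than half of the acceptors."""
    members = sorted(acceptors)
    need = len(members) // 2 + 1
    for size in range(need, len(members) + 1):
        for subset in combinations(members, size):
            yield frozenset(subset)


def all_majorities_intersect(acceptors):
    """Return True if every pair of majorities has a common acceptor."""
    quorums = list(majorities(acceptors))
    return all(a & b for a in quorums for b in quorums)
```

For example, `all_majorities_intersect({"a1", "a2", "a3", "a4", "a5"})` enumerates all subsets of size three or more and confirms that each pair overlaps.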
In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:
P1. An acceptor must accept the first proposal that it receives.
But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value is accepted by a majority of them. Even with just two proposed values, if each is accepted by about half the acceptors, failure of a single acceptor could make it impossible to learn which of the values was chosen.
P1 and the requirement that a value is chosen only when it is accepted by a majority of acceptors imply that an acceptor must be allowed to accept more than one proposal. We keep track of the different proposals that an acceptor may accept by assigning a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, we require that different proposals have different numbers. How this is achieved depends on the implementation, so for now we just assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen.
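As a rough illustration of these definitions, a proposal can be modeled as a (number, value) pair, with "chosen" meaning that one and the same proposal was accepted by a majority. The names `Proposal` and `is_chosen`, and the dictionary layout, are assumptions made for this sketch.

```python
from typing import NamedTuple


class Proposal(NamedTuple):
    number: int   # unique proposal number (uniqueness is assumed here)
    value: str    # the proposed value


def is_chosen(proposal, accepted, acceptors):
    """True when a majority of acceptors has accepted this exact proposal.

    `accepted` maps an acceptor id to the set of proposals it has accepted,
    since an acceptor may accept more than one proposal.
    """
    voters = [a for a in acceptors if proposal in accepted.get(a, set())]
    return len(voters) > len(acceptors) // 2
```

Note that two acceptors accepting the same value under different proposal numbers do not count toward the same majority; the definition is per proposal.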
We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:
P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
Since numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen.
To be chosen, a proposal must be accepted by at least one acceptor. So, we can satisfy P2 by satisfying:
P2a. If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.
We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen with some particular acceptor c never having received any proposal. Suppose a new proposer “wakes up” and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, violating P2a. Maintaining both P1 and P2a requires strengthening P2a to:
P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
Since a proposal must be issued by a proposer before it can be accepted by an acceptor, P2b implies P2a, which in turn implies P2.
To discover how to satisfy P2b, let’s consider how we would prove that it holds. We would assume that some proposal with number m and value v is chosen and show that any proposal issued with number n > m also has value v. We would make the proof easier by using induction on n, so we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m..(n − 1) has value v, where i..j denotes the set of numbers from i through j. For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:
Every acceptor in C has accepted a proposal with number in m..(n − 1), and every proposal with number in m..(n − 1) accepted by any acceptor has value v.
Since any set S consisting of a majority of acceptors contains at least one member of C, we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:
P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
We can therefore satisfy P2b by maintaining the invariance of P2c.
To maintain the invariance of P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals already accepted is easy enough; predicting future acceptances is hard. Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals.
- A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
(a) A promise never again to accept a proposal numbered less than n, and
(b) The proposal with the highest number less than n that it has accepted, if any.
I will call such a request a prepare request with number n.
- If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.
A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.
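The value-selection rule in the second step can be written as a small function. This is a sketch under the assumption that each prepare response carries either `None` or the (number, value) pair of the highest-numbered proposal that acceptor has accepted; the name `value_to_propose` is hypothetical.

```python
def value_to_propose(responses, own_value):
    """Pick the value for the accept request from a majority of responses.

    `responses` holds one entry per responding acceptor: None if it has
    accepted nothing, otherwise a (number, value) pair.
    """
    reported = [r for r in responses if r is not None]
    if not reported:
        return own_value  # the proposer is free to choose any value
    # Otherwise the value of the highest-numbered reported proposal is forced.
    return max(reported, key=lambda r: r[0])[1]
```

The function presumes the caller has already checked that the responses come from a majority of acceptors.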
This describes a proposer’s algorithm. What about an acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety. So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, iff it has not promised not to. In other words:
P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.
Observe that P1a subsumes P1.
We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.
Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request. We also have it ignore a prepare request for a proposal it has already accepted.
With this optimization, an acceptor needs to remember only the highest-numbered proposal that it has ever accepted and the number of the highest-numbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that the proposer can always abandon a proposal and forget all about it, as long as it never tries to issue another proposal with the same number.
Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.
Phase 1.
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
Phase 2.
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
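The two phases can be sketched as a small in-memory simulation. This is a minimal sketch, not a definitive implementation: it assumes synchronous, loss-free method calls in place of messages, proposal numbers starting at 1, and no stable storage. The names `Acceptor`, `on_prepare`, `on_accept`, and `propose` are hypothetical.

```python
class Acceptor:
    """In-memory acceptor; a real one would persist state before replying."""

    def __init__(self):
        self.promised = 0      # highest prepare number answered so far
        self.accepted = None   # highest-numbered accepted proposal: (n, value)

    def on_prepare(self, n):
        # Phase 1(b): ignore stale requests, otherwise promise and report.
        if n <= self.promised:
            return None
        self.promised = n
        return ("promise", self.accepted)

    def on_accept(self, n, value):
        # Phase 2(b): accept unless a higher-numbered prepare was answered.
        if n < self.promised:
            return False
        if self.accepted is None or n > self.accepted[0]:
            self.accepted = (n, value)
        return True


def propose(acceptors, n, own_value):
    """Run both phases once; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) and collect the promises.
    promises = [r for r in (a.on_prepare(n) for a in acceptors) if r is not None]
    if len(promises) < majority:
        return None
    # Phase 2(a): adopt the highest-numbered reported value, if any.
    reported = [acc for _, acc in promises if acc is not None]
    value = max(reported, key=lambda p: p[0])[1] if reported else own_value
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

With three fresh acceptors, `propose(acceptors, 1, "x")` chooses `"x"`; a later `propose(acceptors, 2, "y")` is forced to re-propose `"x"`, which is the behavior P2b requires.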
A proposer can make multiple proposals, so long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.
2.3 Learning a Chosen Value
To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.
The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.
More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
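A learner's bookkeeping can be sketched as a tally of acceptances per proposal. The class below is an illustrative assumption (the name `Learner` and its message shape are not from the paper); it uses sets so that duplicated messages are not counted twice.

```python
from collections import defaultdict


class Learner:
    """Counts acceptances per proposal until one reaches a majority."""

    def __init__(self, num_acceptors):
        self.majority = num_acceptors // 2 + 1
        self.votes = defaultdict(set)   # (number, value) -> acceptor ids
        self.chosen = None

    def on_accepted(self, acceptor_id, number, value):
        """Record one acceptance; return the chosen value once known."""
        self.votes[(number, value)].add(acceptor_id)
        if len(self.votes[(number, value)]) >= self.majority:
            self.chosen = value
        return self.chosen
```

The tally is keyed by the whole proposal rather than by value alone, matching the definition that a value is chosen when a single proposal is accepted by a majority.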
2.4 Progress
It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.
To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number.
If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Paterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time, for example by using timeouts. However, safety is ensured regardless of the success or failure of the election.
2.5 The Implementation
The Paxos algorithm [5] assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. The Paxos consensus algorithm is precisely the one described above, where requests and responses are sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, preserved during failures, is used to maintain the information that the acceptor must remember. An acceptor records its intended response in stable storage before actually sending the response.
All that remains is to describe the mechanism for guaranteeing that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.
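One common way to obtain disjoint sets of numbers is to give each proposer the numbers congruent to its own id modulo the number of proposers. The paper does not prescribe this scheme; the sketch below assumes it, with ids in 0..num_servers − 1 and a hypothetical function name.

```python
def next_proposal_number(last_used, server_id, num_servers):
    """Smallest number greater than `last_used` in this server's own set.

    Server k owns the numbers congruent to k modulo num_servers, so the
    sets of different servers are disjoint.
    """
    candidate = (last_used // num_servers) * num_servers + server_id
    if candidate <= last_used:
        candidate += num_servers
    return candidate
```

In practice `last_used` would be read from stable storage, and the result written back before the number is used in a prepare request.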
3 Implementing a State Machine
A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state-machine state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account’s balance if and only if the balance is greater than the amount withdrawn, producing as output the old and new balances.
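The banking example can be sketched as a pure function from (state, command) to (new state, output). The names `withdraw` and `run`, and the dictionary representation of balances, are assumptions for illustration; the balance check follows the wording above (strictly greater than the amount).

```python
def withdraw(state, account, amount):
    """One deterministic step: returns (new_state, (old_balance, new_balance))."""
    old = state[account]
    # Decrease the balance only if it is greater than the amount withdrawn.
    new = old - amount if old > amount else old
    return {**state, account: new}, (old, new)


def run(commands, state):
    """Apply a sequence of (account, amount) withdrawals in order."""
    outputs = []
    for account, amount in commands:
        state, out = withdraw(state, account, amount)
        outputs.append(out)
    return state, outputs
```

Because each step depends only on the current state and the command, two servers that run the same command sequence from the same initial state end in the same state with the same outputs.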
An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.
To guarantee that all servers execute the same sequence of state-machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the ith instance being the ith state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm.
It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.
Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.
I will now describe how the Paxos state machine implementation works during normal operation. Later, I will discuss what can go wrong. I consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)
The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1–134, 138, and 139, that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence could arise.) It then executes phase 1 of instances 135–137 and of all instances greater than 139. (I describe below how this is done.) Suppose that the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.
The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “noop” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these noop commands have been chosen, commands 138–140 can be executed.
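The gap-filling step in this scenario can be sketched with a dictionary from instance number to chosen command. The helper names and the `NOOP` marker are assumptions; running phase 2 for the missing instances is reduced here to inserting the noop directly.

```python
NOOP = "noop"  # hypothetical marker for the command that leaves state unchanged


def executable_prefix(chosen):
    """Largest i such that commands 1..i have all been chosen."""
    i = 0
    while i + 1 in chosen:
        i += 1
    return i


def missing_instances(chosen):
    """Instances below the highest chosen one that still lack a command."""
    if not chosen:
        return []
    return [i for i in range(1, max(chosen) + 1) if i not in chosen]
```

With commands 1–135 and 138–140 chosen, the executable prefix is 135 and the missing instances are 136 and 137; once noops are chosen there, the prefix extends to 140.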
Commands 1–140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140 of the consensus algorithm, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141 of the consensus algorithm. It proposes the next client command it receives as command 142, and so on.
The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It’s possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead, that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α − 1 commands could then arise.
A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm: in the scenario above, for instances 135–137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.
Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command—that is, of achieving consensus on the command/value—is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence, the Paxos algorithm is essentially optimal.
This discussion of the normal operation of the system assumes that there is always a single leader, except for a brief period between the failure of the current leader and the election of a new one. In abnormal circumstances, the leader election might fail. If no server is acting as leader, then no new commands will be proposed. If multiple servers think they are leaders, then they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. However, safety is preserved: two different servers will never disagree on the value chosen as the ith state machine command. Election of a single leader is needed only to ensure progress.
If the set of servers can change, then there must be some way of determining what servers implement what instances of the consensus algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and can be changed with ordinary state-machine commands. We can allow a leader to get α commands ahead by letting the set of servers that execute instance i + α of the consensus algorithm be specified by the state after execution of the ith state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.
References
[1] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[2] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults—a tutorial. Technical Report MIT-LCS-TR-821, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, May 2001. Also published in SIGACT News 32(2) (June 2001).
[3] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.
[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[5] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, May 1998.