Kafka Study Notes -- Replicated Logs

Preface

This note studies Kafka's replicated log, which touches on some distributed consensus algorithms. As in earlier notes, the original English documentation is quoted and interleaved with my own commentary.

Main Text

Replicated Logs: Quorums, ISRs, and State Machines (Oh my!)

At its heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the state-machine style.

In the state-machine style, a replicated log can be used by other systems as a primitive for implementing other distributed systems.

A replicated log models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). There are many ways to implement this, but the simplest and fastest is with a leader who chooses the ordering of values provided to it. As long as the leader remains alive, all followers need to only copy the values and ordering the leader chooses.

A replicated log models the process of reaching consensus on the order of a series of values (the log entries are typically numbered 0, 1, 2, ...). Here "models" means the log itself embodies the agreement on which values were chosen and in what order. The simplest and fastest implementation is a leader that chooses the ordering of the values provided to it. As long as the leader stays alive, all the followers need to do is copy the values and the ordering the leader chooses.
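
To make the leader-ordered log concrete, here is a minimal Python sketch of my own (an illustration, not Kafka's actual code): the leader alone assigns offsets, and followers simply copy the leader's entries in the order it chose.

```python
# Minimal sketch of a leader-ordered replicated log (illustrative only).
# The leader assigns offsets; followers copy its entries in leader order.

class Leader:
    def __init__(self):
        self.log = []               # list of (offset, value)

    def append(self, value):
        offset = len(self.log)      # the leader alone decides the ordering
        self.log.append((offset, value))
        return offset

class Follower:
    def __init__(self):
        self.log = []

    def fetch_from(self, leader):
        # copy any entries we don't have yet, in the leader's order
        self.log.extend(leader.log[len(self.log):])

leader = Leader()
followers = [Follower(), Follower()]
for v in ["a", "b", "c"]:
    leader.append(v)
for f in followers:
    f.fetch_from(leader)
assert all(f.log == leader.log for f in followers)  # identical order everywhere
```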

Of course if leaders didn't fail we wouldn't need followers! When the leader does die we need to choose a new leader from among the followers. But followers themselves may fall behind or crash so we must ensure we choose an up-to-date follower. The fundamental guarantee a log replication algorithm must provide is that if we tell the client a message is committed, and the leader fails, the new leader we elect must also have that message. This yields a tradeoff: if the leader waits for more followers to acknowledge a message before declaring it committed then there will be more potentially electable leaders.

Once the leader dies, we must choose a new leader from among the followers, and the one we choose must have kept up with the leader, that is, its data must be consistent with the leader's and up to date.

If you choose the number of acknowledgements required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap, then this is called a Quorum.

This is the definition of a Quorum.

A common approach to this tradeoff is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's explore it anyway to understand the tradeoffs. Let's say we have 2f+1 replicas. If f+1 replicas must receive a message prior to a commit being declared by the leader, and if we elect a new leader by electing the follower with the most complete log from at least f+1 replicas, then, with no more than f failures, the leader is guaranteed to have all committed messages. This is because among any f+1 replicas, there must be at least one replica that contains all committed messages. That replica's log will be the most complete and therefore will be selected as the new leader. There are many remaining details that each algorithm must handle (such as precisely defining what makes a log more complete, ensuring log consistency during leader failure or changing the set of servers in the replica set) but we will ignore these for now.

A common approach to this tradeoff is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's walk through it anyway. Suppose we have 2f+1 replicas. If f+1 replicas must receive a message before the leader declares it committed, and if we elect a new leader by choosing the replica with the most complete log from among at least f+1 replicas, then, with no more than f failures, the new leader is guaranteed to have all committed messages. This is because among any f+1 replicas there must be at least one that contains all committed messages; that replica's log will be the most complete, so it will be selected as the new leader.

This paragraph is dense, so here is the overlap argument spelled out for f = 1: with 2f+1 = 3 replicas, a commit requires that f+1 = 2 of them have the message, and the election consults f+1 = 2 of them; any two 2-replica subsets of a 3-replica set must share at least one replica, and that shared replica has every committed message.
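
The following small Python check (my own illustration, not Kafka code) enumerates every possible write quorum and election quorum of size f+1 out of 2f+1 replicas and verifies that each pair overlaps, which is exactly why the new leader cannot miss a committed message.

```python
# Illustrative check of the quorum-overlap argument.
# With 2f+1 replicas, any write set of f+1 and any election set of f+1
# must share at least one replica, so committed messages survive election.

from itertools import combinations

f = 2
replicas = set(range(2 * f + 1))        # 5 replicas: {0, 1, 2, 3, 4}

for write_set in combinations(replicas, f + 1):          # replicas that acked the commit
    for election_set in combinations(replicas, f + 1):   # replicas consulted in the election
        assert set(write_set) & set(election_set), "quorums must overlap"

print("every pair of (f+1)-sized quorums overlaps")
```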

This majority vote approach has a very nice property: the latency is dependent on only the fastest servers. That is, if the replication factor is three, the latency is determined by the faster slave not the slower one.

There are a rich variety of algorithms in this family including ZooKeeper's Zab, Raft, and Viewstamped Replication. The most similar academic publication we are aware of to Kafka's actual implementation is PacificA from Microsoft.

The academic publication most similar to Kafka's actual implementation that the authors are aware of is Microsoft's PacificA. At first this seems odd: the text above said majority voting is not what Kafka does, so I assumed Kafka must follow ZooKeeper, yet here the closest relative is PacificA rather than ZooKeeper's Zab. The resolution is that Zab, Raft, and Viewstamped Replication all belong to the majority-vote family, which Kafka does not use; Kafka's ISR approach (described below) is closest to PacificA.

Reading this, it suddenly strikes me as similar to consensus algorithms in the blockchain world. The Raft mentioned here is a consensus algorithm also used in blockchain systems; see the reposted article 《区块链:深入剖析区块链的共识算法 Raft & PBFT》.

The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. To tolerate one failure requires three copies of the data, and to tolerate two failures requires five copies of the data. In our experience having only enough redundancy to tolerate a single failure is not enough for a practical system, but doing every write five times, with 5x the disk space requirements and 1/5th the throughput, is not very practical for large volume data problems. This is likely why quorum algorithms more commonly appear for shared cluster configuration such as ZooKeeper but are less common for primary data storage. For example in HDFS the namenode's high-availability feature is built on a majority-vote-based journal, but this more expensive approach is not used for the data itself.

HDFS is an example: the namenode's high-availability feature is built on a majority-vote-based journal, but because this approach is expensive it is not used for the data itself.

Kafka takes a slightly different approach to choosing its quorum set. Instead of majority vote, Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write. This ISR set is persisted to ZooKeeper whenever it changes. Because of this, any replica in the ISR is eligible to be elected leader. This is an important factor for Kafka's usage model where there are many partitions and ensuring leadership balance is important. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.

Instead of majority vote, Kafka takes a slightly different approach: it maintains an ISR (in-sync replicas), the set of replicas that are kept caught up with the leader. Only members of the ISR are eligible to be elected leader, and a write to a Kafka partition is not considered committed until every member of the ISR has received it. The ISR is persisted to ZooKeeper whenever it changes, so any replica in the ISR can safely be elected leader.

So with the ISR model, tolerating f failures needs only f+1 replicas, whereas the majority-vote approach described above needs 2f+1 replicas for the same f failures.
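
Here is a minimal Python sketch of the ISR commit rule, using a toy Partition class of my own (an illustration, not Kafka's API): a write counts as committed only when every current ISR member has it, the ISR shrinks when a replica fails, and any remaining ISR member is safe to elect.

```python
# Minimal sketch of the ISR commit rule (illustrative only).
# A write is committed only after every replica currently in the ISR has it,
# so any ISR member holds all committed messages and can become leader.

class Partition:
    def __init__(self, replica_ids):
        self.isr = set(replica_ids)            # in-sync replicas, leader included
        self.logs = {r: [] for r in replica_ids}
        self.committed = []

    def write(self, value, acked_by):
        for r in acked_by:
            self.logs[r].append(value)
        if self.isr <= set(acked_by):          # committed only if all ISR members have it
            self.committed.append(value)

    def fail(self, replica_id):
        self.isr.discard(replica_id)           # shrink the ISR on failure

    def elect_leader(self):
        return next(iter(self.isr))            # any ISR member is eligible

p = Partition(["r0", "r1"])                    # f+1 = 2 replicas tolerates f = 1 failure
p.write("m1", acked_by=["r0", "r1"])           # committed: every ISR member has it
p.fail("r0")
leader = p.elect_leader()                      # surviving ISR member still has "m1"
assert p.committed == ["m1"] and "m1" in p.logs[leader]
```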

For most use cases we hope to handle, we think this tradeoff is a reasonable one. In practice, to tolerate f failures, both the majority vote and the ISR approach will wait for the same number of replicas to acknowledge before committing a message (e.g. to survive one failure a majority quorum needs three replicas and one acknowledgement and the ISR approach requires two replicas and one acknowledgement). The ability to commit without the slowest servers is an advantage of the majority vote approach. However, we think it is ameliorated by allowing the client to choose whether they block on the message commit or not, and the additional throughput and disk space due to the lower required replication factor is worth it.
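
The replica-count arithmetic in this paragraph, written out as a tiny illustrative helper of my own (not a Kafka API):

```python
# Rough comparison of replica counts for tolerating f failures,
# following the numbers in the text above (illustrative only).

def majority_vote_replicas(f):
    return 2 * f + 1            # a commit needs f+1 of these to have the write

def isr_replicas(f):
    return f + 1                # a commit needs all ISR members to have the write

for f in (1, 2):
    print(f"tolerate {f} failure(s): "
          f"majority vote needs {majority_vote_replicas(f)} replicas, "
          f"ISR needs {isr_replicas(f)}")
# tolerate 1 failure(s): majority vote needs 3 replicas, ISR needs 2
# tolerate 2 failure(s): majority vote needs 5 replicas, ISR needs 3
```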

Another important design distinction is that Kafka does not require that crashed nodes recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems and they often do not leave data intact. Secondly, even if this were not a problem, we do not want to require the use of fsync on every write for our consistency guarantees as this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining, it must fully re-sync again even if it lost unflushed data in its crash.
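
A rough sketch of the rejoin rule described here, under my own simplifying assumption that the recovering replica just copies the leader's log before rejoining (the real protocol is more involved):

```python
# Sketch of the rejoin rule (illustrative only, not Kafka's protocol code):
# a recovering replica may have lost unflushed data, so it fully re-syncs its
# log from the current leader before it is added back to the ISR.

def rejoin_isr(recovering_log, leader_log, isr, replica_id):
    recovering_log[:] = list(leader_log)       # overwrite any missing/diverged tail
    isr.add(replica_id)                        # only now is it eligible again
    return recovering_log

leader_log = ["m1", "m2", "m3"]
crashed_log = ["m1"]                           # lost unflushed entries on crash
isr = {"leader"}
rejoin_isr(crashed_log, leader_log, isr, "r2")
assert crashed_log == leader_log and "r2" in isr
```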

References

https://kafka.apache.org/documentation/#design

https://stackoverflow.com/questions/48825755/how-does-kafka-handle-network-partitions 

 
