[24] ZooKeeper: Algorithms & Source Code (Part 1)

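1. Fundamental Algorithms

1.1 The Paxos Algorithm

A message-passing consensus algorithm with a high degree of fault tolerance.
Excerpted from: https://www.zhihu.com/question/19787937

a. Background

The Paxos algorithm was proposed by Lamport, a master of distributed systems. Its main purpose is to let every participant in a distributed process gradually reach agreement. Put simply, it lets the different voters in an election end up making the same decision.

To explain the algorithm, Lamport imagined an election in a Greek city-state called Paxos, which is where the algorithm gets its name. In his scenario, the city-state picks a final resolution through democratic proposals and voting. Because no citizen is willing to devote all their time and energy to this, they can only show up at irregular times to make proposals, to check on the progress of proposals and votes, and to cast their own votes. The goal of Paxos is to let them reach agreement by majority rule.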

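b. Process Overview

Throughout the proposing and voting process there are two main roles: the "proposer" (who puts proposals to the "acceptors") and the "acceptor" (who receives a proposal from a "proposer" and replies with its opinion).

(1) Rough outline of the algorithm

Phase 1: There are multiple "proposers". If they all put forward proposals at once, whose should the "acceptors" accept? That would be chaos. So it must first be settled which "proposer" is the opinion leader with the right to propose; from then on the "acceptors" mainly deal with that proposer's proposal (this also helps opinions converge at proposal time, so that a majority forms as early as possible);
Phase 2: The opinion leader chosen in the previous phase puts forward a proposal, and the "acceptors" respond. If a majority of "acceptors" accept a proposal, the proposal is passed.

(2) Other background you need to know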

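1. How is the opinion leader determined? By number. In phase 1 every "proposer" first announces a number, and whoever has the largest number is the opinion leader. If that is hard to picture, think of it as vote buying: each proposer first goes around bribing the "acceptors", and in phase 2 the "acceptors" listen to whoever paid the most. (Note: the "opinion leader" mentioned here and below is not a new role. It simply denotes the "proposer" whose bribe succeeded in that round.)
2. One thing differs from ordinary elections: a "proposer" does not insist on getting its own proposal passed; it insists on reaching agreement as quickly as possible. To that end, if a "proposer" discovers while bribing that an "acceptor" has already accepted an earlier opinion leader's proposal, then even if its bribe succeeds it quietly replaces its own proposal with that earlier proposal. So once the bribe succeeds, the proposal the winning "proposer" puts forward is the earlier opinion leader's proposal (one more step toward forming a majority early).
3. The amount of money matters. If you pay too little, the "acceptors" will ignore you in both phase 1 and phase 2 and reject you outright.
4. Item 2 above says that if a "proposer" finds during bribing that an earlier opinion leader already has a proposal, it quietly changes its own proposal to that one. One case needs care: suppose you are a "proposer", and while bribing, "acceptor 1" tells you "the opinion leader's proposal I have seen is plan 1", while "acceptor 2" tells you "the opinion leader's proposal I have seen is plan 2". What do you do? The rule is again simple: the amount of money matters! You work out which opinion leader paid more, the one seen by "acceptor 1" or the one seen by "acceptor 2". How? When an "acceptor" is bribed by a "proposer", it records the bribe amount. So when you bribe an "acceptor" and win because you paid more, the "acceptor" tells you two things: (a) the content of the previous opinion leader's proposal (if any); (b) how much that previous opinion leader paid. Faced with the situation above, you just compare what "acceptor 1" and "acceptor 2" told you, see which opinion leader paid more, and quietly change your proposal to that opinion leader's proposal.
5. This last part is the most interesting, but it is a bit convoluted to describe; if it does not click right away, read the example later on first. During the election, who arrives first and when an "acceptor" receives a "proposer"'s message are completely uncontrollable. So it may well happen that an opinion leader has already emerged, but its phase 2 has only just started and most "acceptors" have not yet received its proposal. If a new, wealthy "proposer" suddenly bursts in at that moment, that wealthy "proposer" also has a chance to get its own proposal through! This creates a race: (a) the previous opinion leader must reach the "acceptors" and get them to accept its proposal before the wealthy "proposer" bribes them, otherwise it will be rejected because its earlier bribe is smaller than the wealthy proposer's; (b) the wealthy "proposer" must bribe the "acceptors" before the previous opinion leader delivers its proposal to them, otherwise, even if its bribe succeeds, it must quietly change its proposal to the previous opinion leader's. The outcome of the race depends on which of the two "proposers" moves faster. Either way, one opinion leader will eventually be the first to win the approval of a majority of "acceptors", and its proposal wins.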
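
The value-selection rule in item 4 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class names Promise and ValueChooser, and the convention that -1 / null mean "nothing accepted yet", are made up for this sketch and do not come from any Paxos library.

```java
import java.util.List;

/** What one acceptor reports back after a successful bribe (hypothetical type). */
final class Promise {
    final long acceptedNumber;   // bribe amount of the proposal it accepted, -1 if none
    final String acceptedValue;  // content of that proposal, null if none

    Promise(long acceptedNumber, String acceptedValue) {
        this.acceptedNumber = acceptedNumber;
        this.acceptedValue = acceptedValue;
    }

    /** Reply of an acceptor that has not accepted anything yet. */
    static Promise empty() {
        return new Promise(-1L, null);
    }
}

final class ValueChooser {
    /**
     * Picks the value to propose in phase 2: the previously accepted value
     * with the highest number, or the proposer's own value if none exists.
     */
    static String choose(List<Promise> promises, String ownValue) {
        Promise best = null;
        for (Promise p : promises) {
            if (p.acceptedValue == null) {
                continue; // this acceptor has nothing to report
            }
            if (best == null || p.acceptedNumber > best.acceptedNumber) {
                best = p;
            }
        }
        return best == null ? ownValue : best.acceptedValue;
    }
}
```

With replies (1, "plan 1") and (2, "plan 2") the proposer adopts "plan 2"; if every reply is empty, it keeps its own value.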

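(3) Summary

1. Paxos has two phases: phase 1 is mainly vote buying, and no proposal is put forward yet; phase 2 uses the result of phase 1 to settle whose proposal is accepted and what its content is (it may be the winning "proposer"'s own proposal or a previous opinion leader's; see item 3 below for which);
2. The number (bribe amount) matters: in either phase, a smaller number (bribe amount) is looked down on (rejected);
3. In phase 1, once an "acceptor" has already accepted a previous opinion leader's proposal, any "proposer" that comes to this "acceptor" afterwards, even if it wins the bribing, gets "brainwashed": it quietly changes its own proposal to the previous opinion leader's proposal and puts that forward in phase 2 (so that everyone's opinions converge). If the "acceptor" has not accepted any proposal before, the winning "proposer" can put forward its own proposal.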
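
The acceptor's side of these rules can be sketched as follows. This is a minimal single-decision sketch, not a complete Paxos implementation: the class name Acceptor and its method names are assumptions made for illustration, "promised" plays the role of the recorded bribe amount, and networking, persistence, and retries are left out.

```java
/** Minimal single-decision Paxos acceptor (illustrative names only). */
final class Acceptor {
    private long promised = -1;        // highest number seen so far (the recorded bribe)
    private long acceptedNumber = -1;  // number of the accepted proposal, -1 if none
    private String acceptedValue;      // value of the accepted proposal, null if none

    /** Phase 1: the bribe succeeds only if n is higher than anything recorded. */
    boolean prepare(long n) {
        if (n <= promised) {
            return false;
        }
        promised = n;
        return true;
    }

    /** Phase 2: accept unless a higher number has been recorded in the meantime. */
    boolean accept(long n, String value) {
        if (n < promised) {
            return false;
        }
        promised = n;
        acceptedNumber = n;
        acceptedValue = value;
        return true;
    }

    /** Reported to a proposer whose prepare succeeded; -1 means nothing accepted yet. */
    long acceptedNumber() {
        return acceptedNumber;
    }

    /** Reported to a proposer whose prepare succeeded; null means nothing accepted yet. */
    String acceptedValue() {
        return acceptedValue;
    }
}
```

An acceptor that has recorded a $2 bribe rejects a phase 2 request carrying $1, while one whose record is still $1 accepts it, which is exactly the asymmetry the rules above describe.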

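c. Worked Example

There are two "proposers" and three "acceptors".

(1) First, "proposer 1" bribes all 3 "acceptors" with $1.

(2) The 3 "acceptors" record the bribe amount. Since only one "proposer" has bid so far, $1 is the highest, so the "acceptors" reply that the bribe succeeded. In addition, because no earlier opinion leader has made a proposal, the "acceptors" tell "proposer 1" that they have not accepted any proposal before (and so there is no previous opinion leader's bribe amount either).

(3) "Proposer 1" puts its own proposal, proposal 1, to "acceptor 1", stating that it has already paid a $1 bribe.

(4) "Acceptor 1" checks: the bribe amount on record is exactly $1, so it accepts the proposal and records proposal 1.

(5) Before "proposer 1" can put the proposal to "acceptor 2" and "acceptor 3", the wealthy "proposer 2" shows up and starts bribing "acceptor 1" and "acceptor 2" with $2.

(6) "Acceptor 1" and "acceptor 2" are bought immediately and update the recorded bribe amount to $2. Their replies differ, though: "acceptor 1" tells "proposer 2" that it has already accepted proposal 1, whose "proposer" paid $1; "acceptor 2" tells "proposer 2" that it has not accepted any other opinion leader's proposal and so has no previous bribe amount to report.

(7) At this point "proposer 1" comes back to its senses and puts proposal 1 to "acceptor 2" and "acceptor 3", carrying the message "I have already paid a $1 bribe".

(8) "Acceptor 2" and "acceptor 3" reply. "Acceptor 2" checks its recorded bribe amount and says: someone has already bid $2 and you only bid $1, so I reject your proposal, goodbye. "Acceptor 3" checks its recorded bribe amount, finds it is exactly $1, and so accepts the proposal and records proposal 1.

(9) "Proposer 1" now has the approval of two acceptors, which is a majority of the "acceptors". "Proposer 1" therefore concludes that proposal 1 has passed.

(10) Now back to "proposer 2". As described, "proposer 2" bribed "acceptor 1" and "acceptor 2"; "acceptor 1" told it "I have already accepted proposal 1, whose proposer paid $1", and "acceptor 2" told it "I have not received any other opinion leader's proposal, nor any other bribe amount". With this information, "proposer 2" determines that the proposal backed by the highest earlier bribe (namely $1) is proposal 1, so it quietly changes its own proposal to match proposal 1 and then puts the proposal (whose content is still proposal 1) to "acceptor 1" and "acceptor 2", carrying the message that it has already paid a $2 bribe.

(11) On receiving the proposal from "proposer 2", "acceptor 1" and "acceptor 2" compare bribe amounts as usual: "proposer 2" has paid $2 and the amount they have on record is also $2, so they accept its proposal, which means both accept proposal 1.

(12) "Proposer 2" thus also obtains a majority, and what finally passes is again proposal 1.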

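Going back to step (5): what happens if "proposer 2" first bribes "acceptor 2" and "acceptor 3" instead?

Then proposal 1 may well not be the proposal finally chosen. Once "proposer 2" has bribed "acceptor 2" and "acceptor 3", "proposer 1" will be rejected when it brings its proposal to those two, because its earlier bribe was smaller ($1 < $2). This is the race mentioned earlier:

"Proposer 1" must get "acceptor 2" and "acceptor 3" to accept its proposal before "proposer 2" bribes them, otherwise "proposer 1" will be rejected for paying too little;
"Proposer 2" must bribe the "acceptors" before "proposer 1" reaches them, otherwise, even if its bribe succeeds, "proposer 2" must quietly change its proposal to that of "proposer 1".

But if you play it forward, you will find that in every case some "proposer" ends up with a proposal that wins a majority.

The above only sketches the general idea of Paxos, because the real scenarios are far more varied: there may be 4, 5, or more "proposers" and "acceptors", and the interactions between "proposers" and "acceptors" can happen in any order. It can nevertheless be rigorously shown that a single majority choice emerges in the end, though the derivation would be too long to include here. If you are interested, follow the reasoning above and simulate cases with more or fewer "proposers" and "acceptors" and with interactions in different orders; the outcome is always that exactly one proposal wins a majority.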

d. Algorithm Flow

https://www.cnblogs.com/linbingdong/p/6253479.html

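1.2 The Zab Protocol

Zab borrows ideas from Paxos. It is an atomic broadcast protocol with crash recovery support, designed specifically for ZooKeeper. Based on this protocol, ZooKeeper is designed so that only one server (the Leader) handles external write (transaction) requests; the Leader then synchronizes the data to the Follower nodes. In other words, only a single Leader in ZooKeeper can initiate proposals.

The Zab protocol has two basic modes: message broadcast and crash recovery.

a. Message Broadcast

A client issues a write request;
The Leader server converts the client's request into a transaction Proposal and assigns each Proposal a globally unique ID, the zxid;
The Leader server keeps a separate queue for each Follower server, puts the Proposals to be broadcast into the queues in order, and sends them in FIFO order;
On receiving a Proposal, a Follower first writes it to local disk as a transaction log entry and, once the write succeeds, replies to the Leader with an Ack;
Once the Leader has received Acks from more than half of the Followers, it considers the Proposal successfully delivered and may send the commit message;
The Leader broadcasts the commit message to all Followers and commits the transaction itself. On receiving the commit message, a Follower commits the transaction it previously logged;
The core of ZooKeeper's use of Zab is this: as long as one server has committed a Proposal, all servers must eventually commit that Proposal correctly.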
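
The "more than half" check in the broadcast steps above boils down to counting distinct acknowledgements. Below is a minimal sketch of that bookkeeping. ProposalAcks is a hypothetical helper, not a class from the ZooKeeper code base, and it assumes the majority is taken over all voting servers (with the Leader's own ack counted like any other), which is a simplification of the description above.

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks acks for one proposal (hypothetical helper, not ZooKeeper's own class). */
final class ProposalAcks {
    private final int ensembleSize;                        // number of voting servers (assumed)
    private final Set<Long> ackedServerIds = new HashSet<>();

    ProposalAcks(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    /** Records an ack from a server; returns true once a strict majority has acked. */
    boolean ack(long serverId) {
        ackedServerIds.add(serverId); // a duplicate ack from the same server is ignored
        return hasQuorum();
    }

    /** True when more than half of the ensemble has acknowledged the proposal. */
    boolean hasQuorum() {
        return ackedServerIds.size() > ensembleSize / 2;
    }
}
```

In a 5-server ensemble the proposal becomes committable on the third distinct ack; repeated acks from one server do not move it closer to a quorum.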

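Zab handles a transaction request in a way that resembles a two-phase commit:

Broadcast the transaction
Broadcast the commit

This two-phase model can lead to inconsistent data if the Leader goes down, for example:

The Leader crashes right after issuing transaction Proposal1, and no Follower has Proposal1;
The Leader crashes after receiving Acks from more than half of the Followers, before it could send Commit to them.

How is this resolved?

Zab introduces the "crash recovery mode".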

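b. Crash Recovery

Once the Leader server crashes, or loses contact with more than half of the Followers because of network problems, the cluster enters crash recovery mode.

Consider two failure scenarios:

A transaction is proposed by the Leader, and then the Leader goes down;
A transaction has been committed on the Leader and more than half of the Followers have responded with Ack, but the Leader goes down before the Commit message is sent out.

Crash recovery in Zab must satisfy the following two requirements:

A Proposal that has already been committed by the Leader must eventually be committed by all Follower servers (a proposal that has taken effect must be executed by the Followers);
A Proposal that was issued by the Leader but never committed must be discarded (stillborn proposals are dropped).

Crash recovery consists of two parts: Leader election and data recovery.

(1) Leader election

Given the requirements above, Zab must ensure that the elected Leader satisfies the following conditions:

The newly elected Leader must not contain uncommitted Proposals. That is, the new Leader must be a Follower node that has already committed its Proposals;
The newly elected Leader holds the largest zxid. The benefit is that the Leader server does not have to check which Proposals to commit and which to discard.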
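
The "largest zxid wins" condition can be pictured as a comparison between two votes. The sketch below is an assumption-laden illustration: the text above only states the zxid rule, while comparing the epoch first and breaking ties by server id are additions modelled loosely on how ZooKeeper's leader election is usually described. Vote here is a hypothetical class, not the one in the ZooKeeper source.

```java
/** A candidate vote (hypothetical type for illustration). */
final class Vote {
    final long epoch;     // election epoch (assumed extra criterion)
    final long zxid;      // last transaction id the candidate has seen
    final long serverId;  // the candidate's myid, used only as a tie-breaker (assumed)

    Vote(long epoch, long zxid, long serverId) {
        this.epoch = epoch;
        this.zxid = zxid;
        this.serverId = serverId;
    }

    /** True if this vote should replace {@code other}: epoch, then zxid, then server id. */
    boolean beats(Vote other) {
        if (epoch != other.epoch) {
            return epoch > other.epoch;
        }
        if (zxid != other.zxid) {
            return zxid > other.zxid;
        }
        return serverId > other.serverId;
    }
}
```

Within one epoch, a server with zxid 5 beats a server with zxid 3 regardless of their ids, so the node with the most complete history is preferred.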

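(2) Data recovery

After Leader election completes and before it formally starts working (accepting transaction requests and issuing new Proposals), the Leader server first confirms whether every Proposal in its transaction log has been committed by more than half of the servers in the cluster;
The Leader server must ensure that every Follower server receives the Proposal of every transaction and applies all committed transaction Proposals to its in-memory data. Only after a Follower has synchronized all of its outstanding transaction Proposals from the Leader and applied them to its in-memory data does the Leader add that Follower to the list of truly available Followers.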

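1.3 The CAP Theorem

Consistency (C)
Availability (A)
Partition tolerance (P)

At most two of these three basic requirements can be satisfied at the same time. Because P is mandatory in a distributed system, the choice in practice is usually between CP and AP.

(1) Consistency (C)

In a distributed environment, consistency refers to whether data stays identical across multiple replicas. Under a consistency requirement, after a system that is in a consistent state performs an update, its data should still be in a consistent state.

(2) Availability (A)

Availability means the service provided by the system must always be usable: every user request must return a result within a bounded amount of time.

(3) Partition tolerance (P)

When a distributed system encounters any network partition failure, it must still be able to provide a service that satisfies consistency and availability, unless the entire network has failed.

ZooKeeper guarantees CP.

ZooKeeper cannot guarantee availability for every service request (in extreme conditions ZooKeeper may drop some requests, and the consumer program has to retry to get a result). In that sense, ZooKeeper does not guarantee service availability.
While a Leader election is in progress, the whole cluster is unavailable.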

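2. Source Code of Basic Functionality

Whenever a server node in a ZooKeeper cluster receives a write request, it first forwards the request to the Leader. The Leader converts the write into a stateful transaction and broadcasts it for coordination. Once coordination succeeds (a majority of nodes allow the write), the Leader notifies all server nodes to apply the write to their in-memory database and to record it in the "transaction log".

After a certain number of records have been written to the "transaction log" (100,000 by default), the in-memory database is serialized once and persisted to disk; the serialized file is called a "snapshot file". Each snapshot also starts a new transaction log. Note that the ZXID suffix of a snapshot file name is the ID of the last transaction committed at the moment the snapshot was triggered (if ZXID5 triggers the snapshot, the snapshot file is named snapshot.ZXID5, the next transaction after the snapshot has ID ZXID6, and the new transaction log is named log.ZXID6).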
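
The naming scheme can be made concrete with a small helper. This is a sketch under assumptions: SnapNames is a made-up class, and rendering the zxid suffix in hexadecimal is an assumption about the on-disk format rather than something stated above.

```java
/** Builds and parses snapshot/log file names (hypothetical helper). */
final class SnapNames {
    /** Name of the snapshot triggered at the given zxid, e.g. "snapshot.5". */
    static String snapshotName(long zxid) {
        return "snapshot." + Long.toHexString(zxid); // hex suffix is an assumption
    }

    /** Name of the transaction log whose first transaction has the given zxid. */
    static String logName(long zxid) {
        return "log." + Long.toHexString(zxid);
    }

    /** Extracts the zxid from "prefix.hexZxid"; returns -1 if the name does not match. */
    static long zxidFromName(String name, String prefix) {
        String[] parts = name.split("\\.");
        if (parts.length == 2 && parts[0].equals(prefix)) {
            try {
                return Long.parseLong(parts[1], 16);
            } catch (NumberFormatException e) {
                return -1L;
            }
        }
        return -1L;
    }
}
```

For the example in the text, a snapshot triggered by zxid 5 yields snapshot.5, and the log that starts with the next transaction is log.6.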

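A snapshot writes the full in-memory data of a ZooKeeper server at some point in time to a designated file on disk. Think of it this way: the snapshot file is the existing data, the transaction log is the incremental data, and together they make up the most complete data set available.

With the transaction log (TxnLog) and the snapshot (SnapShot), any node can be restored to any point in time (as long as the transaction logs and snapshots have not been purged).

[Reference] https://it.cha138.com/shida/show-466243.html#_70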

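2.1 Persistence Source Code

The data in the Leader and the Followers is kept both in memory and on disk, so the in-memory data has to be persisted to disk.

The classes in the org.apache.zookeeper.server.persistence package are the serialization (persistence) related code.

a. Snapshot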

public interface SnapShot {
    /**
     * deserialize a data tree from the last valid snapshot
     * and return the last zxid that was deserialized.
     * @param dt the datatree to be deserialized into
     * @param sessions the sessions to be deserialized into
     * @return the last zxid that was deserialized from the snapshot
     * @throws IOException
     */
    long deserialize(DataTree dt, Map<Long, Integer> sessions) throws IOException;

    /**
     * persist the datatree and the sessions into a persistence storage
     * @param dt the datatree to be serialized
     * @param sessions
     * @throws IOException
     */
    void serialize(DataTree dt, Map<Long, Integer> sessions, File name) throws IOException;

    /**
     * find the most recent snapshot file
     */
    File findMostRecentSnapshot() throws IOException;

    /**
     * free resources from this snapshot immediately
     */
    void close() throws IOException;
}

b. Transaction Log

public interface TxnLog {

    /**
     * Setter for ServerStats to monitor fsync threshold exceed
     * @param serverStats used to update fsyncThresholdExceedCount
     */
    void setServerStats(ServerStats serverStats);

    /**
     * roll the current log being appended to
     * @throws IOException
     */
    void rollLog() throws IOException;
    /**
     * Append a request to the transaction log
     * @param hdr the transaction header
     * @param r the transaction itself
     * returns true iff something appended, otw false
     * @throws IOException
     */
    boolean append(TxnHeader hdr, Record r) throws IOException;

    /**
     * Start reading the transaction logs from a given zxid
     * @param zxid
     * @return returns an iterator to read the next transaction in the logs.
     * @throws IOException
     */
    TxnIterator read(long zxid) throws IOException;

    /**
     * the last zxid of the logged transactions.
     * @return the last zxid of the logged transactions.
     */
    long getLastLoggedZxid() throws IOException;

    /**
     * truncate the log to get in sync with the leader.
     * @param zxid the zxid to truncate at.
     */
    boolean truncate(long zxid) throws IOException;

    /**
     * the dbid for this transaction log.
     * @return the dbid for this transaction log.
     */
    long getDbId() throws IOException;

    /**
     * commit the transaction and make sure they are persisted
     */
    void commit() throws IOException;

    /**
     * @return transaction log's elapsed sync time in milliseconds
     */
    long getTxnLogSyncElapsedTime();

    /**
     * close the transactions logs
     */
    void close() throws IOException;
    /**
     * an iterating interface for reading transaction logs.
     */
    public interface TxnIterator {
        /**
         * return the transaction header.
         * @return return the transaction header.
         */
        TxnHeader getHeader();

        /**
         * return the transaction record.
         */
        Record getTxn();

        /**
         * go to the next transaction record.
         */
        boolean next() throws IOException;

        /**
         * close files and release the resources
         */
        void close() throws IOException;

        /**
         * Get an estimated storage space used to store transaction records
         * that will return by this iterator
         */
        long getStorageSize() throws IOException;
    }
}

c. Core Classes

2.2 Serialization Source Code

The code in zookeeper-jute is ZooKeeper's serialization source code.

a. Serialization/Deserialization Methods

@InterfaceAudience.Public
public interface Record {
    public void serialize(OutputArchive archive, String tag) throws IOException;
    public void deserialize(InputArchive archive, String tag) throws IOException;
}

b. Iterator

/**
 * Interface that acts as an iterator for deserializing maps.
 * The deserializer returns an instance that the record uses to
 * read vectors and maps. An example of usage is as follows:
 *
 * <code>
 * Index idx = startVector(...);
 * while (!idx.done()) {
 *   .... // read element of a vector
 *   idx.incr();
 * }
 * </code>
 *
 */
public interface Index {
    public boolean done();
    public void incr();
}

c. Data Types Supported for Serialization

public interface OutputArchive {
    public void writeByte(byte b, String tag) throws IOException;
    public void writeBool(boolean b, String tag) throws IOException;
    public void writeInt(int i, String tag) throws IOException;
    public void writeLong(long l, String tag) throws IOException;
    public void writeFloat(float f, String tag) throws IOException;
    public void writeDouble(double d, String tag) throws IOException;
    public void writeString(String s, String tag) throws IOException;
    public void writeBuffer(byte buf[], String tag) throws IOException;
    public void writeRecord(Record r, String tag) throws IOException;
    public void startRecord(Record r, String tag) throws IOException;
    public void endRecord(Record r, String tag) throws IOException;
    public void startVector(List<?> v, String tag) throws IOException;
    public void endVector(List<?> v, String tag) throws IOException;
    public void startMap(TreeMap<?,?> v, String tag) throws IOException;
    public void endMap(TreeMap<?,?> v, String tag) throws IOException;
}

d. Data Types Supported for Deserialization

public interface InputArchive {
    public byte readByte(String tag) throws IOException;
    public boolean readBool(String tag) throws IOException;
    public int readInt(String tag) throws IOException;
    public long readLong(String tag) throws IOException;
    public float readFloat(String tag) throws IOException;
    public double readDouble(String tag) throws IOException;
    public String readString(String tag) throws IOException;
    public byte[] readBuffer(String tag) throws IOException;
    public void readRecord(Record r, String tag) throws IOException;
    public void startRecord(String tag) throws IOException;
    public void endRecord(String tag) throws IOException;
    public Index startVector(String tag) throws IOException;
    public void endVector(String tag) throws IOException;
    public Index startMap(String tag) throws IOException;
    public void endMap(String tag) throws IOException;
}
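
To see how a record, an output archive, and an input archive fit together, here is a self-contained round trip. It deliberately does not use the jute classes: SessionEntry is a hypothetical record, and java.io's DataOutputStream/DataInputStream stand in for OutputArchive/InputArchive. The tags are dropped in this sketch, on the assumption that a plain binary encoding does not need them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record with two fields, written in a fixed order. */
final class SessionEntry {
    long id;
    int timeout;

    /** Plays the role of Record.serialize: write each field in order. */
    void serialize(DataOutputStream out) throws IOException {
        out.writeLong(id);      // stands in for writeLong(id, "id")
        out.writeInt(timeout);  // stands in for writeInt(timeout, "timeout")
    }

    /** Plays the role of Record.deserialize: read the fields back in the same order. */
    void deserialize(DataInputStream in) throws IOException {
        id = in.readLong();
        timeout = in.readInt();
    }

    static byte[] toBytes(SessionEntry entry) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            entry.serialize(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static SessionEntry fromBytes(byte[] data) {
        SessionEntry entry = new SessionEntry();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            entry.deserialize(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entry;
    }
}
```

The important property is symmetry: deserialize must read exactly the fields serialize wrote, in the same order and with the same types.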

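3. ZkServer Initialization Source Code

3.1 Startup Script

What zkServer.sh start actually executes under the hood:

nohup "$JAVA"
+ a set of launch arguments
+ $ZOOMAIN  (org.apache.zookeeper.server.quorum.QuorumPeerMain)
+ "$ZOOCFG" (ZOOCFG="zoo.cfg" in zkEnv.sh)

So the program's entry point is the QuorumPeerMain.java class.

3.2 Server Startup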

public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();

    // ...

    main.initializeAndRun(args);

    // ...
}

protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException {

    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        // [1] Parse arguments: zoo.cfg, myid
        config.parse(args[0]);
    }

    // [2] Start the scheduled task that deletes expired snapshots
    // Start and schedule the purge task
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
            .getDataDir(), config.getDataLogDir(), config
            .getSnapRetainCount(), config.getPurgeInterval());
    purgeMgr.start();

    if (args.length == 1 && config.isDistributed()) {
        // [3] Initialize communication
        runFromConfig(config);
        // [4] Cluster startup happens inside [3]
    } else {
        LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        ZooKeeperServerMain.main(args);
    }
}

a. Parsing Arguments

QuorumPeerConfig

/**
 * Parse a ZooKeeper configuration file
 * @param path the path of the configuration file
 * @throws ConfigException error processing configuration
 */
public void parse(String path) throws ConfigException {

    // Validate the file path and check that the file exists
    File configFile = (new VerifyingFileFactory.Builder(LOG)
        .warnForRelativePath()
        .failForNonExistingPath()
        .build()).create(path);

    Properties cfg = new Properties();
    FileInputStream in = new FileInputStream(configFile);
    try {
        // Load the configuration file
        cfg.load(in);
        configFileStr = path;
    } finally {
        in.close();
    }

    // Parse the configuration
    parseProperties(cfg);

    // ...
}

public void parseProperties(Properties zkProp) throws IOException, ConfigException {
    int clientPort = 0;
    int secureClientPort = 0;
    String clientPortAddress = null;
    String secureClientPortAddress = null;
    VerifyingFileFactory vff = new VerifyingFileFactory
                                        .Builder(LOG).warnForRelativePath().build();
    // Read the property values from zoo.cfg and assign them to the fields of QuorumPeerConfig
    for (Entry<Object, Object> entry : zkProp.entrySet()) {
        String key = entry.getKey().toString().trim();
        String value = entry.getValue().toString().trim();
        if (key.equals("dataDir")) {
            dataDir = vff.create(value);
        } else if (key.equals("dataLogDir")) {
            dataLogDir = vff.create(value);
        } else if (key.equals("clientPort")) {
            clientPort = Integer.parseInt(value);
        } else if (key.equals("localSessionsEnabled")) {
            localSessionsEnabled = Boolean.parseBoolean(value);
        } else if (key.equals("localSessionsUpgradingEnabled")) {
            localSessionsUpgradingEnabled = Boolean.parseBoolean(value);
        } else if (key.equals("clientPortAddress")) {
            clientPortAddress = value.trim();
        } else if (key.equals("secureClientPort")) {
            secureClientPort = Integer.parseInt(value);
        } else if (key.equals("secureClientPortAddress")){
            secureClientPortAddress = value.trim();
        } else if (key.equals("tickTime")) {
            tickTime = Integer.parseInt(value);
        } else if (key.equals("maxClientCnxns")) {
            maxClientCnxns = Integer.parseInt(value);
        } else if (key.equals("minSessionTimeout")) {
            minSessionTimeout = Integer.parseInt(value);
        } else if (key.equals("maxSessionTimeout")) {
            maxSessionTimeout = Integer.parseInt(value);
        } else if (key.equals("initLimit")) {
            initLimit = Integer.parseInt(value);
        } else if (key.equals("syncLimit")) {
            syncLimit = Integer.parseInt(value);
        }
        // else if ...
    }

    // ...

    // backward compatibility - dynamic configuration in the same file
    // as static configuration params see writeDynamicConfig()
    if (dynamicConfigFileStr == null) {
        // ======== ↓ Step Into ↓ ========
        setupQuorumPeerConfig(zkProp, true);
        if (isDistributed() && isReconfigEnabled()) {
            // we don't backup static config for standalone mode.
            // we also don't backup if reconfig feature is disabled.
            backupOldConfig();
        }
    }
}

void setupQuorumPeerConfig(Properties prop, boolean configBackwardCompatibilityMode)
        throws IOException, ConfigException {
    quorumVerifier = parseDynamicConfig(
                        prop, electionAlg, true, configBackwardCompatibilityMode);
    // ======== ↓ Step Into ↓ ========
    setupMyId();
    setupClientPort();
    setupPeerType();
    checkValidity();
}

private void setupMyId() throws IOException {
    File myIdFile = new File(dataDir, "myid");
    // standalone server doesn't need myid file.
    if (!myIdFile.isFile()) {
        return;
    }
    BufferedReader br = new BufferedReader(new FileReader(myIdFile));
    String myIdString;
    try {
        myIdString = br.readLine();
    } finally {
        br.close();
    }
    try {
        // Parse the id from the myid file and assign it to serverId
        serverId = Long.parseLong(myIdString);
        MDC.put("myid", myIdString);
    } catch (NumberFormatException e) {
        throw new IllegalArgumentException("serverid " + myIdString
                + " is not a number");
    }
}

b. Deleting Expired Snapshots

QuorumPeerMain

// Start and schedule the purge task
DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config.getDataDir(),
     config.getDataLogDir(), config.getSnapRetainCount(), config.getPurgeInterval()
);
purgeMgr.start();

QuorumPeerConfig

// Minimum number of snapshots to retain
protected int snapRetainCount = 3;
// Defaults to 0, which means the purge task is disabled
protected int purgeInterval = 0;

DatadirCleanupManager

public void start() {
    if (PurgeTaskStatus.STARTED == purgeTaskStatus) {
        LOG.warn("Purge task is already running.");
        return;
    }

    // By default purgeInterval=0, so the task is disabled and we return immediately
    // Don't schedule the purge task with 0 or negative purge interval.
    if (purgeInterval <= 0) {
        LOG.info("Purge task is not scheduled.");
        return;
    }

    // Create a timer
    timer = new Timer("PurgeTask", true);
    // Create the snapshot purge task
    TimerTask task = new PurgeTask(dataLogDir, snapDir, snapRetainCount);
    // If purgeInterval is 1, check for expired snapshots once per hour and delete any found
    timer.scheduleAtFixedRate(task, 0, TimeUnit.HOURS.toMillis(purgeInterval));

    purgeTaskStatus = PurgeTaskStatus.STARTED;
}

static class PurgeTask extends TimerTask {
    private File logsDir;
    private File snapsDir;
    private int snapRetainCount;

    public PurgeTask(File dataDir, File snapDir, int count) {
        logsDir = dataDir;
        snapsDir = snapDir;
        snapRetainCount = count;
    }

    @Override
    public void run() {
        LOG.info("Purge task started.");
        try {
            // Purge the expired data
            PurgeTxnLog.purge(logsDir, snapsDir, snapRetainCount);
        } catch (Exception e) {
            LOG.error("Error occurred while purging.", e);
        }
        LOG.info("Purge task completed.");
    }
}

PurgeTxnLog

public static void purge(File dataDir, File snapDir, int num) throws IOException {
    if (num < 3) {
        throw new IllegalArgumentException(COUNT_ERR_MSG);
    }

    FileTxnSnapLog txnLog = new FileTxnSnapLog(dataDir, snapDir);

    List<File> snaps = txnLog.findNRecentSnapshots(num);
    int numSnaps = snaps.size();
    if (numSnaps > 0) {
        purgeOlderSnapshots(txnLog, snaps.get(numSnaps - 1));
    }
}
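
The retention logic in purge can be illustrated without touching the file system. The sketch below is not the PurgeTxnLog implementation: SnapshotRetention is a hypothetical helper that works on file names only, assumes names of the form snapshot.<hex zxid>, and ignores the transaction logs that the real task also has to consider.

```java
import java.util.ArrayList;
import java.util.List;

/** Decides which snapshots fall outside the retention window (hypothetical helper). */
final class SnapshotRetention {
    /** Parses the zxid after the dot, assuming a hexadecimal suffix such as "snapshot.1a". */
    static long zxidOf(String fileName) {
        return Long.parseLong(fileName.substring(fileName.indexOf('.') + 1), 16);
    }

    /** Returns the names that are older than the newest {@code retain} snapshots. */
    static List<String> toDelete(List<String> snapshotNames, int retain) {
        if (retain < 3) {
            // mirrors the guard in purge(): fewer than 3 retained snapshots is rejected
            throw new IllegalArgumentException("retain count must be at least 3");
        }
        List<String> sorted = new ArrayList<>(snapshotNames);
        // newest (largest zxid) first
        sorted.sort((a, b) -> Long.compare(zxidOf(b), zxidOf(a)));
        if (sorted.size() <= retain) {
            return new ArrayList<>();
        }
        return new ArrayList<>(sorted.subList(retain, sorted.size()));
    }
}
```

With five snapshots and a retain count of 3, only the two with the smallest zxids are returned for deletion.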

c. Initializing the Communication Component

QuorumPeerMain

public void runFromConfig(QuorumPeerConfig config) throws ... {
 // ...

 LOG.info("Starting quorum peer");

 try {
  ServerCnxnFactory cnxnFactory = null;
  ServerCnxnFactory secureCnxnFactory = null;

  // Initialize the communication component; NIO is the default
  if (config.getClientPortAddress() != null) {
      // ======== ↓ Step Into ↓ ========
      cnxnFactory = ServerCnxnFactory.createFactory();
      // ======== ↓ Step Into ↓ ========
      cnxnFactory.configure(config.getClientPortAddress(),
      config.getMaxClientCnxns(), false);
  }

  if (config.getSecureClientPortAddress() != null) {
      secureCnxnFactory = ServerCnxnFactory.createFactory();
      secureCnxnFactory.configure(config.getSecureClientPortAddress(),
      config.getMaxClientCnxns(), true);
  }

  // Assign the parsed parameters to this ZK node (each ZooKeeper instance corresponds to one QuorumPeer)
  quorumPeer = getQuorumPeer();
  quorumPeer.setTxnFactory(new FileTxnSnapLog(config.getDataLogDir(),config.getDataDir()));
  quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
  quorumPeer.enableLocalSessionsUpgrading(config.isLocalSessionsUpgradingEnabled());
  //quorumPeer.setQuorumPeers(config.getAllMembers());
  quorumPeer.setElectionType(config.getElectionAlg());
  quorumPeer.setMyid(config.getServerId());
  quorumPeer.setTickTime(config.getTickTime());
  quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
  quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
  quorumPeer.setInitLimit(config.getInitLimit());
  quorumPeer.setSyncLimit(config.getSyncLimit());
  quorumPeer.setConfigFileName(config.getConfigFilename());
  quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
  quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
  if (config.getLastSeenQuorumVerifier()!=null) {
      quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
  }

  // Manages storage of ZK data
  quorumPeer.initConfigInZKDatabase();
  // Manages ZK communication
  quorumPeer.setCnxnFactory(cnxnFactory);

  quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
  quorumPeer.setSslQuorum(config.isSslQuorum());
  quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
  quorumPeer.setLearnerType(config.getPeerType());
  quorumPeer.setSyncEnabled(config.getSyncEnabled());
  quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
  if (config.sslQuorumReloadCertFiles) {
      quorumPeer.getX509Util().enableCertFileReloading();
  }

  // sets quorum sasl authentication configurations
  quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
  if(quorumPeer.isQuorumSaslAuthEnabled()){
      quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
      quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
      quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
      quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
      quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
  }
  quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
  quorumPeer.initialize();

  // ===== Start the ZK node and begin loading data (see section 4) =====
  quorumPeer.start();
  quorumPeer.join();
 } catch (InterruptedException e) {
      // warn, but generally this is ok
      LOG.warn("Quorum Peer interrupted", e);
  }
}

ServerCnxnFactory

static public ServerCnxnFactory createFactory() throws IOException {
  String serverCnxnFactoryName = System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
  if (serverCnxnFactoryName == null) {
    serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
  }
  try {
    ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(
            serverCnxnFactoryName).getDeclaredConstructor().newInstance();
    LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
    return serverCnxnFactory;
  } catch (Exception e) {
    IOException ioe = new IOException("Couldn't instantiate " + serverCnxnFactoryName);
    ioe.initCause(e);
    throw ioe;
  }
}

NIOServerCnxnFactory

@Override
public void configure(InetSocketAddress addr, int maxcc, boolean secure) throws IOException {
    if (secure) {
        throw new UnsupportedOperationException("SSL isn't supported in NIOServerCnxn");
    }
    configureSaslLogin();

    maxClientCnxns = maxcc;
    sessionlessCnxnTimeout = Integer.getInteger(
        ZOOKEEPER_NIO_SESSIONLESS_CNXN_TIMEOUT, 10000);
    // We also use the sessionlessCnxnTimeout as expiring interval for
    // cnxnExpiryQueue. These don't need to be the same, but the expiring
    // interval passed into the ExpiryQueue() constructor below should be
    // less than or equal to the timeout.
    cnxnExpiryQueue = new ExpiryQueue<NIOServerCnxn>(sessionlessCnxnTimeout);
    expirerThread = new ConnectionExpirerThread();

    int numCores = Runtime.getRuntime().availableProcessors();
    // 32 cores sweet spot seems to be 4 selector threads
    numSelectorThreads = Integer.getInteger(
        ZOOKEEPER_NIO_NUM_SELECTOR_THREADS,
        Math.max((int) Math.sqrt((float) numCores/2), 1));
    if (numSelectorThreads < 1) {
        throw new IOException("numSelectorThreads must be at least 1");
    }

    numWorkerThreads = Integer.getInteger(
        ZOOKEEPER_NIO_NUM_WORKER_THREADS, 2 * numCores);
    workerShutdownTimeoutMS = Long.getLong(
        ZOOKEEPER_NIO_SHUTDOWN_TIMEOUT, 5000);

    LOG.info("Configuring NIO connection handler with "
             + (sessionlessCnxnTimeout/1000) + "s sessionless connection"
             + " timeout, " + numSelectorThreads + " selector thread(s), "
             + (numWorkerThreads > 0 ? numWorkerThreads : "no")
             + " worker threads, and "
             + (directBufferBytes == 0 ? "gathered writes." :
                ("" + (directBufferBytes/1024) + " kB direct buffers.")));
    for(int i=0; i<numSelectorThreads; ++i) {
        selectorThreads.add(new SelectorThread(i));
    }

    // Initialize the NIO server socket and bind it to the client port (2181 by default) so it can accept client requests
    this.ss = ServerSocketChannel.open();
    ss.socket().setReuseAddress(true);
    LOG.info("binding to port " + addr);
    ss.socket().bind(addr);
    ss.configureBlocking(false);
    acceptThread = new AcceptThread(ss, addr, selectorThreads);
}
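
The thread-count defaults in configure are easier to judge with concrete numbers. The helper below simply reproduces the two formulas from the method above; SelectorSizing is a hypothetical class introduced for this sketch, and it ignores the system properties that can override the defaults.

```java
/** Reproduces the default thread-count formulas from configure() (hypothetical helper). */
final class SelectorSizing {
    /** max(sqrt(numCores / 2), 1), truncated to an int as in the original expression. */
    static int defaultSelectorThreads(int numCores) {
        return Math.max((int) Math.sqrt((float) numCores / 2), 1);
    }

    /** Two worker threads per core. */
    static int defaultWorkerThreads(int numCores) {
        return 2 * numCores;
    }
}
```

For 32 cores this gives 4 selector threads, matching the "sweet spot" comment in the source, and a single-core machine still gets 1 selector thread because of the max with 1.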

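4. ZkServer Data Loading Source Code

4.1 Schematic Overview

ZK's data model is a DataTree, and each node in it is called a DataNode;
The DataTrees across the ZK cluster are kept in sync at all times;
Every ZK node in the cluster holds a complete copy of the data both in memory and on disk:
- In-memory data: DataTree
- On-disk data: snapshot files + edit (transaction) logs

4.2 Source Code Flow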

QuorumPeer

@Override
public synchronized void start() {
    if (!getView().containsKey(myid)) {
       throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    // ========== ↓ Step Into ↓ ==========
    // Cold-start data recovery
    loadDataBase();
    // Start the connection factory instance (client communication)
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // Prepare the election environment
    startLeaderElection();
    // Run the election
    super.start();
}

private void loadDataBase() {
    try {
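        // ZK operations fall into two kinds: transactional and non-transactional
        // ---- Transactional: e.g. zk.create(); each is assigned a globally unique zxid (64 bits)
        // -------- High 32 bits: epoch, the number of the leader's term
        // -------- Low 32 bits: txid, the transaction id
        // ---- Non-transactional: e.g. zk.getData()

        // Data recovery process:
        // 1. Restore most of the data from the snapshot file, obtaining a lastProcessedZxid
        // 2. Replay the edit (transaction) log up to its last entry, updating lastProcessedZxid
        // 3. The result is the DataTree plus lastProcessedZxid, which marks recovery as complete

        // ========== ↓ Step Into ↓ ==========
        // Load on-disk data into memory and restore the DataTree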
        zkDb.loadDataBase();


        // load the epochs
        long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
        long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
        try {
            currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
        } catch(FileNotFoundException e) {
        	// pick a reasonable epoch number
        	// this should only happen once when moving to a new code version
        	currentEpoch = epochOfZxid;
        	LOG.info(...);
        	writeLongToFile(CURRENT_EPOCH_FILENAME, currentEpoch);
        }
        if (epochOfZxid > currentEpoch) {
            throw new IOException(...);
        }
        try {
            acceptedEpoch = readLongFromFile(ACCEPTED_EPOCH_FILENAME);
        } catch(FileNotFoundException e) {
        	// pick a reasonable epoch number
        	// this should only happen once when moving to a
        	// new code version
        	acceptedEpoch = epochOfZxid;
        	LOG.info(...);
        	writeLongToFile(ACCEPTED_EPOCH_FILENAME, acceptedEpoch);
        }
        if (acceptedEpoch < currentEpoch) {
            throw new IOException(...);
        }
    } catch(IOException ie) {
        LOG.error("Unable to load database on disk", ie);
        throw new RuntimeException("Unable to run quorum server ", ie);
    }
}
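
The epoch/counter layout of a zxid mentioned in the comments is just bit arithmetic on a 64-bit long. The sketch below shows that arithmetic in isolation; Zxids is a hypothetical helper written for this note and should not be read as the ZxidUtils source.

```java
/** Splits and builds 64-bit zxids: high 32 bits epoch, low 32 bits counter (hypothetical helper). */
final class Zxids {
    /** Extracts the leader epoch from the high 32 bits. */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the per-epoch transaction counter from the low 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Combines an epoch and a counter into one zxid. */
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }
}
```

Because the epoch occupies the high bits, any zxid from a newer epoch compares greater than every zxid from an older epoch, which is what makes "largest zxid" a meaningful ordering across leader changes.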

ZKDatabase

/**
 * load the database from the disk onto memory and also add
 * the transactions to the committedlog in memory.
 * @return the last valid zxid on disk
 * @throws IOException
 */
public long loadDataBase() throws IOException {
  long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
  initialized = true;
  return zxid;
}

FileTxnSnapLog

/**
 * this function restores the server database after reading from the
 * snapshots and transaction logs
 * @param dt the datatree to be restored
 * @param sessions the sessions to be restored
 * @param listener the playback listener to run on the database restoration
 * @return the highest zxid restored
 * @throws IOException
 */
public long restore(DataTree dt, Map<Long, Integer> sessions,
                    PlayBackListener listener) throws IOException {

  // =====> a. Restore snapshot data into the DataTree
  long deserializeResult = snapLog.deserialize(dt, sessions);
  FileTxnLog txnLog = new FileTxnLog(dataDir);

  RestoreFinalizer finalizer = () -> {
    // =====> b. Replay edit-log data into the DataTree
    long highestZxid = fastForwardFromEdits(dt, sessions, listener);
    return highestZxid;
  };

  // ...

  return finalizer.run();
}
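
The two-step restore (load the snapshot, then replay newer log entries) can be modelled with plain collections. This is a simplified sketch, not the FileTxnSnapLog logic: Txn and RestoreSketch are hypothetical, the "tree" is a flat map from path to data, every transaction is a set-data operation, and the log is assumed to be ordered by zxid.

```java
import java.util.List;
import java.util.Map;

/** One logged write: set {@code path} to {@code data} at {@code zxid} (hypothetical type). */
final class Txn {
    final long zxid;
    final String path;
    final String data;

    Txn(long zxid, String path, String data) {
        this.zxid = zxid;
        this.path = path;
        this.data = data;
    }
}

final class RestoreSketch {
    /**
     * Rebuilds {@code tree} from a snapshot plus the transaction log and
     * returns the highest zxid that was applied.
     */
    static long restore(Map<String, String> tree, Map<String, String> snapshot,
                        long snapshotZxid, List<Txn> log) {
        // Step a: start from the state captured in the snapshot.
        tree.clear();
        tree.putAll(snapshot);
        long lastProcessedZxid = snapshotZxid;
        // Step b: replay only the entries newer than what is already applied.
        for (Txn txn : log) {
            if (txn.zxid <= lastProcessedZxid) {
                continue; // already reflected in the snapshot
            }
            tree.put(txn.path, txn.data);
            lastProcessedZxid = txn.zxid;
        }
        return lastProcessedZxid;
    }
}
```

Skipping entries at or below the snapshot's zxid is what keeps the replay from undoing state that the snapshot already contains.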

a. Restoring Snapshot Data

FileSnap

/**
 * deserialize a data tree from the most recent snapshot
 * @return the zxid of the snapshot
 */
public long deserialize(DataTree dt, Map<Long, Integer> sessions)
        throws IOException {
    // we run through 100 snapshots (not all of them)
    // if we cannot get it running within 100 snapshots
    // we should give up
    List<File> snapList = findNValidSnapshots(100);
    if (snapList.size() == 0) {
        return -1L;
    }
    File snap = null;
    boolean foundValid = false;

    // =====> Iterate over the snapshots in turn
    for (int i = 0, snapListSize = snapList.size(); i < snapListSize; i++) {
        snap = snapList.get(i);
        LOG.info("Reading snapshot " + snap);
        // =====> Set up for deserialization
        try (InputStream snapIS = new BufferedInputStream(new FileInputStream(snap));
             CheckedInputStream crcIn = new CheckedInputStream(snapIS, new Adler32())) {
            InputArchive ia = BinaryInputArchive.getArchive(crcIn);
            // =====> Deserialize and restore the data into the DataTree
            deserialize(dt, sessions, ia);
            long checkSum = crcIn.getChecksum().getValue();
            long val = ia.readLong("val");
            if (val != checkSum) {
                throw new IOException("CRC corruption in snapshot :  " + snap);
            }
            foundValid = true;
            break;
        } catch (IOException e) {
            LOG.warn("problem reading snap file " + snap, e);
        }
    }
    if (!foundValid) {
        throw new IOException("Not able to find valid snapshots in " + snapDir);
    }
    dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
    return dt.lastProcessedZxid;
}

/**
 * deserialize the datatree from an inputarchive
 * @param dt the datatree to be deserialized into
 * @param sessions the sessions to be filled up
 * @param ia the input archive to restore from
 * @throws IOException
 */
public void deserialize(DataTree dt, Map<Long, Integer> sessions, InputArchive ia) throws IOException {
    FileHeader header = new FileHeader();
    header.deserialize(ia, "fileheader");
    if (header.getMagic() != SNAP_MAGIC) {
        throw new IOException("mismatching magic headers "
                + header.getMagic() + " !=  " + FileSnap.SNAP_MAGIC);
    }
    // =====> Restore the snapshot data into the DataTree
    SerializeUtils.deserializeSnapshot(dt,ia,sessions);
}
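
The integrity checks in FileSnap (a magic header first, then an Adler-32 checksum compared against the value stored at the end of the file) can be tried in isolation. The following is a minimal sketch, not ZooKeeper's on-disk format: ChecksumSketch, its MAGIC constant, and the single-string payload are hypothetical, and java.io data streams stand in for the jute archives. Only the java.util.zip classes are the same ones the excerpt uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical file layout: [magic:int][payload:UTF][adler32:long]
final class ChecksumSketch {
    static final int MAGIC = 0x534B4348; // made-up value, not ZooKeeper's SNAP_MAGIC

    static byte[] write(String payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        CheckedOutputStream checked = new CheckedOutputStream(bytes, new Adler32());
        DataOutputStream out = new DataOutputStream(checked);
        out.writeInt(MAGIC);
        out.writeUTF(payload);
        out.flush();
        // Capture the checksum of everything written so far, then append it.
        long checksum = checked.getChecksum().getValue();
        out.writeLong(checksum);
        out.flush();
        return bytes.toByteArray();
    }

    static String read(byte[] file) throws IOException {
        CheckedInputStream checked =
                new CheckedInputStream(new ByteArrayInputStream(file), new Adler32());
        DataInputStream in = new DataInputStream(checked);
        int magic = in.readInt();
        if (magic != MAGIC) {
            throw new IOException("mismatching magic header: " + magic);
        }
        String payload = in.readUTF();
        // Read the running checksum BEFORE consuming the stored one,
        // otherwise the trailing long would be folded into it.
        long computed = checked.getChecksum().getValue();
        long stored = in.readLong();
        if (stored != computed) {
            throw new IOException("checksum mismatch: " + stored + " != " + computed);
        }
        return payload;
    }

    public static void main(String[] args) throws IOException {
        byte[] file = write("hello");
        System.out.println(read(file));
    }
}
```

The order of the two reads at the end mirrors the excerpt, where crcIn.getChecksum().getValue() is taken before ia.readLong("val").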

SerializeUtils

public static void deserializeSnapshot(DataTree dt,InputArchive ia,
        Map<Long, Integer> sessions) throws IOException {
    int count = ia.readInt("count");
    while (count > 0) {
        long id = ia.readLong("id");
        int to = ia.readInt("timeout");
        sessions.put(id, to);
        if (LOG.isTraceEnabled()) {
            ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                    "loadData --- session in archive: " + id + " with timeout: " + to);
        }
        count--;
    }
    // =====> Sessions are loaded; now restore the node data into the DataTree
    dt.deserialize(ia, "tree");
}
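
The session loop in deserializeSnapshot reads a count followed by that many (id, timeout) pairs. A minimal sketch of the same count-prefixed layout, using plain java.io data streams instead of the jute InputArchive, might look as follows. SessionTableSketch and both of its methods are hypothetical names, and the byte layout here is only an assumption made for illustration, not a claim about the real snapshot encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical layout: [count:int] then count x ([id:long][timeout:int])
final class SessionTableSketch {

    static byte[] write(Map<Long, Integer> sessions) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(sessions.size());          // the count prefix
        for (Map.Entry<Long, Integer> e : sessions.entrySet()) {
            out.writeLong(e.getKey());          // session id
            out.writeInt(e.getValue());         // session timeout
        }
        out.flush();
        return bytes.toByteArray();
    }

    static Map<Long, Integer> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<Long, Integer> sessions = new HashMap<>();
        int count = in.readInt();
        while (count > 0) {                     // same loop shape as the excerpt
            long id = in.readLong();
            int timeout = in.readInt();
            sessions.put(id, timeout);
            count--;
        }
        return sessions;
    }

    public static void main(String[] args) throws IOException {
        Map<Long, Integer> sessions = new HashMap<>();
        sessions.put(0x100L, 30000);
        System.out.println(read(write(sessions)));
    }
}
```

Because the count comes first, the reader knows exactly where the session table ends and the tree data begins, without needing a terminator record.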

DataTree

public void deserialize(InputArchive ia, String tag) throws IOException {
    aclCache.deserialize(ia);
    nodes.clear();
    pTrie.clear();

    // =====> Restore every DataNode stored in the snapshot into the DataTree
    String path = ia.readString("path");
    while (!"/".equals(path)) {
        // =====> Create one node object per iteration
        DataNode node = new DataNode();
        ia.readRecord(node, "node");
        // =====> Put the DataNode into the DataTree
        nodes.put(path, node);
        synchronized (node) {
            aclCache.addUsage(node.acl);
        }
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash == -1) {
            root = node;
        } else {
            // =====> Look up the parent node
            String parentPath = path.substring(0, lastSlash);
            DataNode parent = nodes.get(parentPath);
            if (parent == null) {
                throw new IOException("Invalid Datatree, unable to find " +
                        "parent " + parentPath + " of path " + path);
            }
            // =====> Register this node as a child of its parent
            parent.addChild(path.substring(lastSlash + 1));
            // =====> Index container, TTL, and ephemeral nodes; persistent nodes need no extra bookkeeping
            long eowner = node.stat.getEphemeralOwner();
            EphemeralType ephemeralType = EphemeralType.get(eowner);
            if (ephemeralType == EphemeralType.CONTAINER) {
                containers.add(path);
            } else if (ephemeralType == EphemeralType.TTL) {
                ttls.add(path);
            } else if (eowner != 0) {
                HashSet<String> list = ephemerals.get(eowner);
                if (list == null) {
                    list = new HashSet<String>();
                    ephemerals.put(eowner, list);
                }
                list.add(path);
            }
        }
        path = ia.readString("path");
    }
    nodes.put("/", root);
    // we are done with deserializing the datatree
    // update the quotas - create path trie and also update the stat nodes
    setupQuota();

    aclCache.purgeUnused();
}
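
The parent and child bookkeeping in DataTree.deserialize depends on records arriving parent-first: the root is stored under the empty path, and every later path finds its parent by cutting at the last slash. A minimal sketch of that idea is shown below. TreeRebuildSketch and Node are hypothetical names, a list of paths stands in for the archive, and node data, ACLs, and ephemeral tracking are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of rebuilding a path-indexed tree from parent-first records.
final class TreeRebuildSketch {

    /** Hypothetical node: only remembers the names of its children. */
    static final class Node {
        final Set<String> children = new TreeSet<>();
    }

    /** Rebuilds the path -> node map; `paths` must list each parent before its children. */
    static Map<String, Node> rebuild(List<String> paths) {
        Map<String, Node> nodes = new HashMap<>();
        for (String path : paths) {
            Node node = new Node();
            nodes.put(path, node);
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash == -1) {
                continue;                        // the root is stored under ""
            }
            String parentPath = path.substring(0, lastSlash);
            Node parent = nodes.get(parentPath);
            if (parent == null) {
                throw new IllegalStateException(
                        "unable to find parent " + parentPath + " of path " + path);
            }
            parent.children.add(path.substring(lastSlash + 1));
        }
        // As in the excerpt, make the root reachable under "/" as well (if one was read).
        Node root = nodes.get("");
        if (root != null) {
            nodes.put("/", root);
        }
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Node> nodes = rebuild(Arrays.asList("", "/app", "/app/lock", "/config"));
        System.out.println(nodes.get("/").children);
        System.out.println(nodes.get("/app").children);
    }
}
```

A record whose parent has not been seen yet is treated as corruption, which matches the "Invalid Datatree" exception in the excerpt.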

b. Restoring the transaction log