ZooKeeper Study Notes: The Zab Protocol and Startup Leader Election

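This chapter records what I learned about ZooKeeper's Zab protocol and the automatic leader election that runs after a server starts. Parts of the content draw on the official ZooKeeper site and 咕泡学院 course materials. ZooKeeper source version analyzed: 3.6.0.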

Introduction

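What is Zab for? I did not know, so I checked the official documentation: it is ZooKeeper's atomic broadcast protocol. Descriptions elsewhere on the web, however, split Zab into several modes: broadcast is one of them, and crash recovery is the other one that comes up most often. On top of Zab, ZooKeeper implements data consistency and leader election for a cluster built on a leader/follower (primary/backup) architecture.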

Zab is the ZooKeeper Atomic Broadcast protocol. We use it to propagate state changes produced by the ZooKeeper leader.

Atomic broadcast

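Atomic broadcast solves the data consistency problem in cluster mode. According to the official documentation, ZooKeeper provides sequential consistency. The atomic broadcast mechanism in Zab resembles a two-phase commit (2PC): the leader sends a proposal, collects acknowledgements, and then commits. Unlike classic 2PC, it does not wait for every node; a proposal can be committed as soon as more than half of the nodes in the cluster have acknowledged it.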
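To make the majority rule concrete, here is a minimal sketch of the check. It is not ZooKeeper's own code: the class MajorityQuorumSketch and the method hasQuorum are names made up for illustration, and the sketch assumes every voting member carries equal weight.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not a ZooKeeper class.
final class MajorityQuorumSketch {

    /** Returns true when strictly more than half of the voting members have acknowledged. */
    static boolean hasQuorum(Set<Long> ackedServerIds, int votingMembers) {
        return ackedServerIds.size() > votingMembers / 2;
    }

    public static void main(String[] args) {
        // In a 5-node ensemble, 3 acks (the leader's own included) are enough.
        System.out.println(hasQuorum(Set.of(1L, 2L, 3L), 5));
        // 2 acks out of 5 are not.
        System.out.println(hasQuorum(Set.of(1L, 2L), 5));
    }
}
```

With an even ensemble size the rule still demands a strict majority, so 2 acks out of 4 are not a quorum. That is why ensembles are usually sized with an odd number of voters.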

Sequential consistency

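Because ZooKeeper commits a write once more than half of the nodes have acknowledged it, a node outside that majority can briefly serve stale data, so across the whole cluster reads are only eventually consistent (up to date within a bounded time). Sequential consistency is the guarantee ZooKeeper does make: updates from a client are applied in the order in which they were sent. It does not by itself mean that an update to an object is immediately visible to every later read; a client that needs the latest value can call sync() before reading. See: Consistency Guarantees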

Two-phase commit

Two-phased Commit

A two-phase commit protocol is an algorithm that lets all clients in a distributed system agree either to commit a transaction or abort.

In ZooKeeper, you can implement a two-phased commit by having a coordinator create a transaction node, say "/app/Tx", and one child node per participating site, say "/app/Tx/s_i". When coordinator creates the child node, it leaves the content undefined. Once each site involved in the transaction receives the transaction from the coordinator, the site reads each child node and sets a watch. Each site then processes the query and votes "commit" or "abort" by writing to its respective node. Once the write completes, the other sites are notified, and as soon as all sites have all votes, they can decide either "abort" or "commit". Note that a node can decide "abort" earlier if some site votes for "abort".

An interesting aspect of this implementation is that the only role of the coordinator is to decide upon the group of sites, to create the ZooKeeper nodes, and to propagate the transaction to the corresponding sites. In fact, even propagating the transaction can be done through ZooKeeper by writing it in the transaction node.

There are two important drawbacks of the approach described above. One is the message complexity, which is O(n²). The second is the impossibility of detecting failures of sites through ephemeral nodes. To detect the failure of a site using ephemeral nodes, it is necessary that the site create the node.

To solve the first problem, you can have only the coordinator notified of changes to the transaction nodes, and then notify the sites once coordinator reaches a decision. Note that this approach is scalable, but it is slower too, as it requires all communication to go through the coordinator.

To address the second problem, you can have the coordinator propagate the transaction to the sites, and have each site creating its own ephemeral node.
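The decision rule each site applies after reading the child nodes can be sketched in memory. This is only an illustration of the rule quoted above: TwoPhaseVoteSketch, its Vote enum, and decide are hypothetical names, the map stands in for the contents of the "/app/Tx/s_i" nodes, and no ZooKeeper client calls are made.

```java
import java.util.Map;

// Hypothetical in-memory sketch; a real implementation would read the
// child nodes through a ZooKeeper client and set watches on them.
final class TwoPhaseVoteSketch {

    // UNDECIDED models a child node whose content has not been written yet.
    enum Vote { COMMIT, ABORT, UNDECIDED }

    /** Applies the rule: any "abort" aborts early, all "commit" commits, otherwise keep waiting. */
    static Vote decide(Map<String, Vote> votesBySite) {
        boolean allCommitted = true;
        for (Vote vote : votesBySite.values()) {
            if (vote == Vote.ABORT) {
                return Vote.ABORT; // one abort is enough, no need to wait for the rest
            }
            if (vote != Vote.COMMIT) {
                allCommitted = false;
            }
        }
        return allCommitted ? Vote.COMMIT : Vote.UNDECIDED;
    }

    public static void main(String[] args) {
        System.out.println(decide(Map.of("s_1", Vote.COMMIT, "s_2", Vote.UNDECIDED)));
    }
}
```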

Crash recovery

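While a zkServer cluster is running, if the Leader becomes unreachable because of a network failure or a crashed process, Zab enters crash recovery mode. In this mode the followers go back to the LOOKING state and run a leader election. Once a Leader has been elected, the data synchronization phase begins; when synchronization completes, Zab leaves crash recovery mode.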

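Crash recovery guarantees that messages that have already been processed (committed) are not lost, and that messages that were never processed (proposed by the old Leader but not committed) do not show up again.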

Source code analysis

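Before reading the source, a few basic concepts are needed. Also, the ZooKeeper source uses the producer-consumer pattern heavily. If you are not familiar with it, read up on it first, otherwise the execution flow of the code below will be hard to follow.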
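For readers who have not met the pattern, here is a minimal sketch using a LinkedBlockingQueue, the same queue type the election code uses. Everything in it is hypothetical (the class ProducerConsumerSketch, the method sendAll, the string messages); it is not ZooKeeper's WorkerSender, only the same shape: the caller enqueues and moves on, and a worker thread drains the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the producer-consumer pattern; not ZooKeeper code.
final class ProducerConsumerSketch {

    /** Queues every message (producer) while a worker thread drains the queue (consumer). */
    static List<String> sendAll(List<String> messages) {
        BlockingQueue<String> sendQueue = new LinkedBlockingQueue<>();
        List<String> sent = new ArrayList<>(); // touched only by the worker until join() returns

        Thread worker = new Thread(() -> {
            try {
                while (sent.size() < messages.size()) {
                    // Block for up to 3 seconds waiting for the next message.
                    String next = sendQueue.poll(3, TimeUnit.SECONDS);
                    if (next == null) {
                        break; // nothing arrived in time, give up
                    }
                    sent.add(next); // stands in for the real network send
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "worker-sender-sketch");
        worker.start();

        try {
            for (String message : messages) {
                sendQueue.put(message); // the producer only enqueues
            }
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sending", e);
        }
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(sendAll(List.of("vote-for-1", "vote-for-3")));
    }
}
```

The point of the design is decoupling: the thread that decides what to send never blocks on the network, and the queue preserves the order in which messages were produced.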

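A thought: read source code with a question in mind. For example, we know that a ZooKeeper cluster automatically elects a leader after startup, so how is that leader decided? Anything outside the question can be ignored for now. Reading every method from top to bottom will not get you anywhere.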

proposal

Every transaction (write) request is sent to all servers in the form of a proposal.

Proposal : a unit of agreement. Proposals are agreed upon by exchanging packets with a quorum of ZooKeeper servers. Most proposals contain messages, however the NEW_LEADER proposal is an example of a proposal that does not correspond to a message.

zxid

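The zxid is the transaction id that ZooKeeper stores. It has two parts: an epoch and a counter. The zxid is a 64-bit number whose high 32 bits hold the epoch and whose low 32 bits hold the counter. The epoch can be thought of as the name of a dynasty: every time the Leader changes, the epoch is incremented.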

The zxid has two parts: the epoch and a counter. In our implementation the zxid is a 64-bit number. We use the high order 32-bits for the epoch and the low order 32-bits for the counter. Because it has two parts represent the zxid both as a number and as a pair of integers, (epoch, count). The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: the leader simply increments the zxid to obtain a unique zxid for each proposal. Leadership activation will ensure that only one leader uses a given epoch, so our simple algorithm guarantees that every proposal will have a unique id.
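The bit layout described above can be sketched with a few helper methods. The class ZxidSketch and its method names are hypothetical (ZooKeeper has its own ZxidUtils class, which shows up later in followLeader and lead); the sketch assumes nothing beyond the 32/32 split quoted above.

```java
// Hypothetical helpers illustrating the (epoch, counter) layout of a zxid.
final class ZxidSketch {

    /** Packs the epoch into the high 32 bits and the counter into the low 32 bits. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    /** Extracts the epoch (high 32 bits). */
    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    /** Extracts the counter (low 32 bits). */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(2, 5);
        System.out.println(Long.toHexString(zxid));
        System.out.println(epochOf(zxid) + ", " + counterOf(zxid));
    }
}
```

One consequence of this layout is that comparing two zxids as plain numbers compares the epoch first: any zxid from epoch 3 is larger than every zxid from epoch 2, whatever the counters are.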

Starting the analysis

Open the zkServer startup script and you can see that it actually runs org.apache.zookeeper.server.quorum.QuorumPeerMain.

/**
 * To start the replicated server specify the configuration file name on
 * the command line.
 * @param args path to the configfile
 */
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    try {
        main.initializeAndRun(args);
    } catch (IllegalArgumentException e) {
        // ... (part of the code omitted)
    }
    LOG.info("Exiting normally");
    System.exit(ExitCode.EXECUTION_FINISHED.getValue());
}
protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        // parse the entries of the zk configuration file
        config.parse(args[0]);
    }
    // ... (part of the code omitted)
    // choose the startup path depending on whether this is cluster (distributed) mode
    if (args.length == 1 && config.isDistributed()) {
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running " + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        ZooKeeperServerMain.main(args);
    }
}

As shown here, zkServer runs a different startup method depending on the mode. The focus below is cluster mode.

public void runFromConfig(QuorumPeerConfig config) throws IOException, AdminServerException {
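        // ... the long stretch of code omitted above (including the opening try) registers JMX, builds the cnxn connection factories, sets up security, and initializes other configuration.
        // the key call is start()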
        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    } finally {
        if (metricsProvider != null) {
            try {
                metricsProvider.stop();
            } catch (Throwable error) {
                LOG.warn("Error while stopping metrics", error);
            }
        }
    }
}

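runFromConfig is a long method, but the only point that matters here is quorumPeer.start(). QuorumPeer is a Thread subclass and overrides start(); loading data at startup, the election, and the later crash recovery all begin here.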

public synchronized void start() {
    // getView() returns the members of this ensemble; check that myid is one of them
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    // initialize the data store and load the zxid and related state
    loadDataBase();
    // start the server that accepts client connections (the client port, typically 2181)
    startServerCnxnFactory();
    try {
        // the AdminServer added in 3.5; this is what takes your port 8080
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // prepare the leader election
    startLeaderElection();
    startJvmPauseMonitor();
    // start the thread; run() holds the runtime state logic
    super.start();
}

Now let's look mainly at the election entry point, startLeaderElection.

public synchronized void startLeaderElection() {
    try {
        // right after startup the state is LOOKING, so the vote is built here
        if (getPeerState() == ServerState.LOOKING) {
            // the vote carries this server's myid, zxid, and currentEpoch
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }
    // create the election algorithm
    this.electionAlg = createElectionAlgorithm(electionType);
}

At this point the vote has been built and the election algorithm has been created.

protected Election createElectionAlgorithm(int electionAlgorithm) {
    QuorumCnxManager qcm = createCnxnManager();
    //...
    // ignore this listener for now; we will come back to it later
    QuorumCnxManager.Listener listener = qcm.listener;
            listener.start();
            // the FastLeaderElection constructor initializes some member fields
            FastLeaderElection fle = new FastLeaderElection(this, qcm);
            // start() then launches the WorkerSender and WorkerReceiver threads
            fle.start();
            le = fle;
    //...
}

private void starter(QuorumPeer self, QuorumCnxManager manager) {
    this.self = self;
    proposedLeader = -1;
    proposedZxid = -1;
    // build the two blocking queues
    sendqueue = new LinkedBlockingQueue<ToSend>();
    recvqueue = new LinkedBlockingQueue<Notification>();
    // the messenger matters: it creates two threads, WorkerSender and WorkerReceiver, which send and receive messages
    this.messenger = new Messenger(manager);
}

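At this stage the main thread's work is essentially done: it next calls quorumPeer.join() and blocks. After all this reading we still have not reached the election itself, so let's analyze that next.

Back to QuorumPeer#start(): its last step starts the thread, so the object will go on to execute its run() method. That method is long. Broken down logically, it mainly handles four states: LOOKING, LEADING, FOLLOWING, and OBSERVING. LOOKING is the election state, so the LOOKING branch is analyzed next.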

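LOOKING: the election state. The election logic runs only in the LOOKING state.

LEADING: the leader state. The current node is already the Leader and may process transaction (write) requests.

FOLLOWING: the follower state. The node syncs data from the Leader, takes part in voting on transactions, and serves non-transaction (read) requests.

OBSERVING: the observer state. The node syncs data from the Leader but does not vote; it can serve non-transaction (read) requests.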
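How a peer leaves LOOKING once the election names a leader can be sketched as a small function. This is a simplified illustration, not the QuorumPeer code: PeerStateSketch and stateAfterElection are hypothetical names, and the isVotingMember flag is an assumption standing in for however the real code tells participants from observers.

```java
// Hypothetical sketch of the state a peer takes after an election result.
final class PeerStateSketch {

    enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING }

    /** The elected server leads; every other peer follows or, if it cannot vote, observes. */
    static ServerState stateAfterElection(long myId, long electedLeader, boolean isVotingMember) {
        if (electedLeader == myId) {
            return ServerState.LEADING;
        }
        return isVotingMember ? ServerState.FOLLOWING : ServerState.OBSERVING;
    }

    public static void main(String[] args) {
        // Server 3 was elected: server 3 leads, server 1 follows.
        System.out.println(stateAfterElection(3, 3, true));
        System.out.println(stateAfterElection(1, 3, true));
    }
}
```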

case LOOKING:
    LOG.info("LOOKING");
    ServerMetrics.getMetrics().LOOKING_COUNT.add(1);

    //...
        try {
            reconfigFlagClear();
            if (shuttingDownLE) {
                shuttingDownLE = false;
                startLeaderElection();
            }
            // run the election logic
            setCurrentVote(makeLEStrategy().lookForLeader());
        } catch (Exception e) {
            LOG.warn("Unexpected exception", e);
            setPeerState(ServerState.LOOKING);
        }
    }
    break;

lookForLeader() has two implementations; as seen in the earlier analysis, the default election algorithm is FastLeaderElection.

public Vote lookForLeader() throws InterruptedException {
   // ---------------------------------- Part 1: initialize and send our vote -----------------------
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();

        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

        int notTimeout = minNotificationInterval;

        synchronized (this) {
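            // bump the logical clock, which identifies the election round
            logicalclock.incrementAndGet();
            // set our proposal: myid, lastLoggedZxid, epoch
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        // send the vote to the other nodes. Sending is asynchronous (producer-consumer again): a ToSend object is queued and the WorkerSender thread does the actual sending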
        sendNotifications();

    
    // ---------------------- Part 2: loop over received votes to elect the Leader -----------------------
        /*
         * Loop in which we exchange notifications until we find a leader
         */
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             * Votes are taken from the queue here, another instance of the producer-consumer pattern.
             */
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             * If nothing was received: resend our vote when earlier messages were delivered, otherwise reconnect to all peers.
             */
            if (n == null) {
                if (manager.haveDelivered()) {
                    sendNotifications();
                } else {
                    manager.connectAll();
                }
                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ? tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
                // otherwise validate the vote: both the sender and the proposed leader must be voting members in the configuration
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    //...
                    // If notification > current, replace and send messages out (the notification's election epoch is greater than our logical clock)
                    if (n.electionEpoch > logicalclock.get()) {
                        // move our logical clock forward to the notification's election epoch
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        // compare the received vote with our own initial vote; this is the core of the election
                        // comparison order: epoch first, then zxid, then myid
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            // adopt the received vote as our proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            // otherwise keep proposing ourselves
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        // then broadcast again
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) { // our logical clock is ahead of the notification's election epoch, so the received vote is stale
                        if (LOG.isDebugEnabled()) {
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch)
                                + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid
                                  + ", proposed leader=" + n.leader
                                  + ", proposed zxid=0x" + Long.toHexString(n.zxid)
                                  + ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }
                    // record the received vote in the set, keyed by the sender's sid, for the leader decision below
                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
                    // check whether the vote reaches a quorum; there are several QuorumVerifier implementations, the default being QuorumMaj
                    if (voteSet.hasAllQuorums()) {
                        // Verify if there is any change in the proposed leader
                        // wait up to finalizeWait ms for any votes we may have missed; if a better one arrives, put it back and re-evaluate
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        // no missed votes remain: decide the result, set this node's state, clear the vote queue, and return the final vote
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING: // observers do not take part in the election
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            setPeerState(n.leader, voteSet);
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                    if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            setPeerState(n.leader, voteSet);
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                   //...
}
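
The comparison order used by totalOrderPredicate in the loop above (epoch first, then zxid, then myid) can be sketched as a standalone method. This is a simplified illustration, not the ZooKeeper source: VoteOrderSketch and newVoteWins are hypothetical names, and any additional checks the real method performs are left out.

```java
// Hypothetical sketch of the vote ordering: epoch > zxid > myid.
final class VoteOrderSketch {

    /** Returns true when the received vote should replace the current proposal. */
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch; // a higher epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;   // same epoch: the more up-to-date log wins
        }
        return newId > curId;           // full tie on data: the larger server id wins
    }

    public static void main(String[] args) {
        // Same epoch and zxid, so server 3 beats server 1 on id alone.
        System.out.println(newVoteWins(3, 9, 1, 1, 9, 1));
    }
}
```

Because the rule is a strict total order, every node that sees the same set of votes converges on the same candidate, and an identical vote never replaces the current proposal.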

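The analysis above shows that the election works by looping over the votes that keep arriving by broadcast and evaluating them; once a decision point is reached, the state is updated and the loop exits. Back in QuorumPeer#run(), by this time getPeerState() inside the run loop already returns the state that the election decided on. Next, let's see how the logic proceeds when the node is a Leader or a Follower.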

case FOLLOWING:
// this peer is a follower
    try {
        LOG.info("FOLLOWING");
        // create and set the Follower object
        setFollower(makeFollower(logFactory));
        // follow the leader: connect, sync, then process its packets
        follower.followLeader();
    } catch (Exception e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        follower.shutdown();
        setFollower(null);
        updateServerState();
    }
    break;

Next, look at the followLeader method.

void followLeader() throws InterruptedException {
    //...
        self.setZabState(QuorumPeer.ZabState.DISCOVERY);
        // look up the leader's server info
        QuorumServer leaderServer = findLeader();
        try {
            // connect to the leader
            connectToLeader(leaderServer.addr, leaderServer.hostname);
            // send this follower's info (zxid, myid, and so on) to the Leader
            // and agree on the new epoch.
            // From here on the follower only syncs data under the new epoch
            long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
            if (self.isReconfigStateChange()) {
                throw new Exception("learned about role change");
            }
            //check to see if the leader zxid is lower than ours
            //this should never happen but is just a safety check
            long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
            // the leader's epoch is lower than the epoch we have already accepted
            if (newEpoch < self.getAcceptedEpoch()) {
                LOG.error("Proposed leader epoch "
                          + ZxidUtils.zxidToString(newEpochZxid)
                          + " is less than our accepted epoch "
                          + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
                throw new IOException("Error: Epoch of leader is lower");
            }
            long startTime = Time.currentElapsedTime();
            try {
                self.setLeaderAddressAndId(leaderServer.addr, leaderServer.getId());
                self.setZabState(QuorumPeer.ZabState.SYNCHRONIZATION);
                // sync data from the leader; when the sync completes, the FollowerZooKeeperServer is started
                syncWithLeader(newEpochZxid);
                self.setZabState(QuorumPeer.ZabState.BROADCAST);
            } finally {
                long syncTime = Time.currentElapsedTime() - startTime;
                ServerMetrics.getMetrics().FOLLOWER_SYNC_TIME.add(syncTime);
            }
            if (self.getObserverMasterPort() > 0) {
                LOG.info("Starting ObserverMaster");

                om = new ObserverMaster(self, fzk, self.getObserverMasterPort());
                om.start();
            } else {
                om = null;
            }
            // create a reusable packet to reduce gc impact
            QuorumPacket qp = new QuorumPacket();
            while (this.isRunning()) {
                // receive messages from the Leader, process them, and reply; the thread loops here
                readPacket(qp); // read a packet from the leader
                processPacket(qp); // process the packet
            }
        //...

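That covers most of the follower startup logic. How the data is synchronized, how clients are served, and so on are outside the scope of this post.

With the follower done, let's look at what the leader has to do.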

case LEADING:
    LOG.info("LEADING");
    try {
        // makeLeader is much like makeFollower: both initialize some state
        setLeader(makeLeader(logFactory));
        // the method to focus on is lead()
        leader.lead();
        setLeader(null);
    } catch (Exception e) {
        LOG.warn("Unexpected exception", e);
    } finally {
        if (leader != null) {
            leader.shutdown("Forcing shutdown");
            setLeader(null);
        }
        updateServerState();
    }
    break;
}

Well, this is another very long method.

/**
 * This method is main function that is called to lead
 *
 * @throws IOException
 * @throws InterruptedException
 */
void lead() throws IOException, InterruptedException {
        //...
        try {
            self.setZabState(QuorumPeer.ZabState.DISCOVERY);
            self.tick.set(0);
            // the leader loads its data from local files (not reloaded if already initialized at startup)
            zk.loadData();

            leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid());

            // Start thread that waits for connection requests from
            // new followers.
            cnxAcceptor = new LearnerCnxAcceptor();
            // accepts connections from followers and observers (learners) and handles their synchronization
            cnxAcceptor.start();

            long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());

            zk.setZxid(ZxidUtils.makeZxid(epoch, 0));

            synchronized (this) {
                lastProposed = zk.getZxid();
            }

            newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), null, null);

            if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) {
                LOG.info("NEWLEADER proposal has Zxid of " + Long.toHexString(newLeaderProposal.packet.getZxid()));
            }

            QuorumVerifier lastSeenQV = self.getLastSeenQuorumVerifier();
            QuorumVerifier curQV = self.getQuorumVerifier();
            if (curQV.getVersion() == 0 && curQV.getVersion() == lastSeenQV.getVersion()) {
                // This was added in ZOOKEEPER-1783. The initial config has version 0 (not explicitly
                // specified by the user; the lack of version in a config file is interpreted as version=0).
                // As soon as a config is established we would like to increase its version so that it
                // takes presedence over other initial configs that were not established (such as a config
                // of a server trying to join the ensemble, which may be a partial view of the system, not the full config).
                // We chose to set the new version to the one of the NEWLEADER message. However, before we can do that
                // there must be agreement on the new version, so we can only change the version when sending/receiving UPTODATE,
                // not when sending/receiving NEWLEADER. In other words, we can't change curQV here since its the committed quorum verifier,
                // and there's still no agreement on the new version that we'd like to use. Instead, we use
                // lastSeenQuorumVerifier which is being sent with NEWLEADER message
                // so its a good way to let followers know about the new version. (The original reason for sending
                // lastSeenQuorumVerifier with NEWLEADER is so that the leader completes any potentially uncommitted reconfigs
                // that it finds before starting to propose operations. Here we're reusing the same code path for
                // reaching consensus on the new version number.)

                // It is important that this is done before the leader executes waitForEpochAck,
                // so before LearnerHandlers return from their waitForEpochAck
                // hence before they construct the NEWLEADER message containing
                // the last-seen-quorumverifier of the leader, which we change below
                try {
                    QuorumVerifier newQV = self.configFromString(curQV.toString());
                    newQV.setVersion(zk.getZxid());
                    self.setLastSeenQuorumVerifier(newQV, true);
                } catch (Exception e) {
                    throw new IOException(e);
                }
            }

            newLeaderProposal.addQuorumVerifier(self.getQuorumVerifier());
            if (self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
                newLeaderProposal.addQuorumVerifier(self.getLastSeenQuorumVerifier());
            }

            // We have to get at least a majority of servers in sync with
            // us. We do this by waiting for the NEWLEADER packet to get
            // acknowledged
            // wait until more than half of the nodes have acknowledged the new epoch
            waitForEpochAck(self.getId(), leaderStateSummary);
            self.setCurrentEpoch(epoch);
            self.setLeaderAddressAndId(self.getQuorumAddress(), self.getId());
            self.setZabState(QuorumPeer.ZabState.SYNCHRONIZATION);

            try {
                // wait until a quorum has acknowledged the NEWLEADER proposal
                waitForNewLeaderAck(self.getId(), zk.getZxid());
            } catch (InterruptedException e) {
                shutdown("Waiting for a quorum of followers, only synced with sids: [ "
                         + newLeaderProposal.ackSetsToString()
                         + " ]");
                HashSet<Long> followerSet = new HashSet<Long>();

                for (LearnerHandler f : getLearners()) {
                    if (self.getQuorumVerifier().getVotingMembers().containsKey(f.getSid())) {
                        followerSet.add(f.getSid());
                    }
                }
                boolean initTicksShouldBeIncreased = true;
                for (Proposal.QuorumVerifierAcksetPair qvAckset : newLeaderProposal.qvAcksetPairs) {
                    if (!qvAckset.getQuorumVerifier().containsQuorum(followerSet)) {
                        initTicksShouldBeIncreased = false;
                        break;
                    }
                }
                if (initTicksShouldBeIncreased) {
                    LOG.warn("Enough followers present. " + "Perhaps the initTicks need to be increased.");
                }
                return;
            }
            // start the leader's ZooKeeperServer
            startZkServer();

            /**
             * WARNING: do not use this for anything other than QA testing
             * on a real cluster. Specifically to enable verification that quorum
             * can handle the lower 32bit roll-over issue identified in
             * ZOOKEEPER-1277. Without this option it would take a very long
             * time (on order of a month say) to see the 4 billion writes
             * necessary to cause the roll-over to occur.
             *
             * This field allows you to override the zxid of the server. Typically
             * you'll want to set it to something like 0xfffffff0 and then
             * start the quorum, run some operations and see the re-election.
             */
            String initialZxid = System.getProperty("zookeeper.testingonly.initialZxid");
            if (initialZxid != null) {
                long zxid = Long.parseLong(initialZxid);
                zk.setZxid((zk.getZxid() & 0xffffffff00000000L) | zxid);
            }

            if (!System.getProperty("zookeeper.leaderServes", "yes").equals("no")) {
                self.setZooKeeperServer(zk);
            }

            self.setZabState(QuorumPeer.ZabState.BROADCAST);
            self.adminServer.setZooKeeperServer(zk);

            // Everything is a go, simply start counting the ticks
            // WARNING: I couldn't find any wait statement on a synchronized
            // block that would be notified by this notifyAll() call, so
            // I commented it out
            //synchronized (this) {
            //    notifyAll();
            //}
            // We ping twice a tick, so we only update the tick every other
            // iteration
            boolean tickSkip = true;
            // If not null then shutdown this leader
            String shutdownMessage = null;
            // monitor the cluster state through heartbeats
            while (true) {
                synchronized (this) {
                    long start = Time.currentElapsedTime();
                    long cur = start;
                    long end = start + self.tickTime / 2;
                    while (cur < end) {
                        wait(end - cur);
                        cur = Time.currentElapsedTime();
                    }

                    if (!tickSkip) {
                        self.tick.incrementAndGet();
                    }

                    // We use an instance of SyncedLearnerTracker to
                    // track synced learners to make sure we still have a
                    // quorum of current (and potentially next pending) view.
                    SyncedLearnerTracker syncedAckSet = new SyncedLearnerTracker();
                    syncedAckSet.addQuorumVerifier(self.getQuorumVerifier());
                    if (self.getLastSeenQuorumVerifier() != null
                        && self.getLastSeenQuorumVerifier().getVersion() > self.getQuorumVerifier().getVersion()) {
                        syncedAckSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
                    }

                    syncedAckSet.addAck(self.getId());

                    for (LearnerHandler f : getLearners()) {
                        if (f.synced()) {
                            syncedAckSet.addAck(f.getSid());
                        }
                    }

                    // check leader running status
                    if (!this.isRunning()) {
                        // set shutdown flag
                        shutdownMessage = "Unexpected internal error";
                        break;
                    }

                    if (!tickSkip && !syncedAckSet.hasAllQuorums()) {
                        // Lost quorum of last committed and/or last proposed
                        // config, set shutdown flag
                        shutdownMessage = "Not sufficient followers synced, only synced with sids: [ "
                                          + syncedAckSet.ackSetsToString()
                                          + " ]";
                        break;
                    }
                    tickSkip = !tickSkip;
                }
                for (LearnerHandler f : getLearners()) {
                    // the monitoring works by sending a ping packet to each learner
                    f.ping();
                }
            }
            if (shutdownMessage != null) {
                shutdown(shutdownMessage);
                // leader goes in looking state
            }
        } finally {
            zk.unregisterJMX(this);
        }
    }

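That more or less completes the analysis of the startup source. Once the leader node has been elected, it completes synchronization with the followers and observers, starts the Leader's ZooKeeperServer, and then keeps watching the cluster by sending ping packets continuously.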
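One detail of the heartbeat loop in lead() is easy to miss: the leader pings twice per tick, but the tickSkip flag means the quorum check only takes effect on every other round. The sketch below replays that logic on plain numbers. It is an illustration under assumptions, not ZooKeeper code: LeaderTickSketch and firstRoundWithoutQuorum are hypothetical names, and each array entry stands for the number of synced servers (the leader included) seen in one half-tick round.

```java
// Hypothetical replay of the tickSkip logic from the leader's heartbeat loop.
final class LeaderTickSketch {

    /**
     * Returns the index of the first round in which the leader would give up,
     * or -1 if it keeps its quorum in every checked round.
     */
    static int firstRoundWithoutQuorum(int[] syncedPerRound, int votingMembers) {
        boolean tickSkip = true; // same starting value as in lead()
        for (int round = 0; round < syncedPerRound.length; round++) {
            boolean hasQuorum = syncedPerRound[round] > votingMembers / 2;
            if (!tickSkip && !hasQuorum) {
                return round; // quorum lost on a checked round: the leader shuts down
            }
            tickSkip = !tickSkip; // the check is active only every other round
        }
        return -1;
    }

    public static void main(String[] args) {
        // 3 voters; the quorum is lost from round 2 on, but round 2 is a skipped round.
        System.out.println(firstRoundWithoutQuorum(new int[] {3, 3, 1, 1}, 3));
    }
}
```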

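Summary:

A node starts in the LOOKING state by default. While LOOKING, it keeps voting through the configured election algorithm and broadcasting its vote until a Leader is decided; by default a majority (more than half) confirms the result. Votes are compared in the order epoch > zxid > myid.

Follower: initialize the follower, connect to the leader and sync data from it, start the follower's ZooKeeperServer, then loop reading packets from the leader.

Leader: initialize the leader, load the local database, start LearnerCnxAcceptor to accept followers and observers and handle their synchronization, wait until more than half of the nodes have synced, start the leader's ZooKeeperServer, then loop to monitor the cluster state by sending ping packets.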

posted @ 2022-03-03 15:16  生如梦境