ZooKeeper Leader Election: Source Code Analysis

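First, a note on versions: this analysis is based on the ZooKeeper 3.5.5 source code.

The entry point is QuorumPeerMain.main.

When the server is started in distributed (quorum) mode, the method it goes through is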

QuorumPeerMain#runFromConfig

quorumPeer = getQuorumPeer(); // create a QuorumPeer; you can think of a QuorumPeer as the ZooKeeper server itself
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          .... // various other properties and settings are configured in between
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();
          
          quorumPeer.start();
          quorumPeer.join();

public class QuorumPeer extends ZooKeeperThread

QuorumPeer extends ZooKeeperThread, and ZooKeeperThread extends Thread, so the thing to look at is its run method.

QuorumPeer.run

For the election, the core is really one statement, executed while the peer is in the LOOKING state:

setCurrentVote(makeLEStrategy().lookForLeader());

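The default Election implementation is FastLeaderElection. In practice hardly anyone sets the election algorithm in zoo.cfg (the electionAlg key, which is what feeds setElectionType above); its default value is 3, which selects FastLeaderElection.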

FastLeaderElection#lookForLeader()

public Vote lookForLeader() throws InterruptedException {
        ......    
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
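                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); // initialize this server's vote: leader id (myid), zxid, epoch
                // for a voting member getInitId() is its own id, so it starts by proposing itself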
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

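updateProposal is called many times: whenever this node receives a vote for a candidate better suited to be leader than its current choice, it updates its own vote.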

synchronized void updateProposal(long leader, long zxid, long epoch){
        if(LOG.isDebugEnabled()){
            LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                    + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                    + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
        }
        proposedLeader = leader;
        proposedZxid = zxid;
        proposedEpoch = epoch;
    }

proposedLeader, proposedZxid and proposedEpoch are all member variables of FastLeaderElection. Together they are the vote this node currently backs, that is, whom it wants as leader.

Next, the vote is sent to every ZooKeeper server:

sendNotifications()

private void sendNotifications() {
        for (long sid : self.getCurrentAndNextConfigVoters()) {
            QuorumVerifier qv = self.getQuorumVerifier();
            ToSend notmsg = new ToSend(ToSend.mType.notification,
                    proposedLeader,
                    proposedZxid,
                    logicalclock.get(),
                    QuorumPeer.ServerState.LOOKING,
                    sid,
                    proposedEpoch, qv.toString().getBytes());
            if(LOG.isDebugEnabled()){
                LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                      Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                      " (n.round), " + sid + " (recipient), " + self.getId() +
                      " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
            }
            sendqueue.offer(notmsg);
        }
    }

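Here is a brief description of the sending process:

1 FastLeaderElection has two queues, one for sending and one for receiving

    LinkedBlockingQueue<ToSend> sendqueue;
    LinkedBlockingQueue<Notification> recvqueue;

2 FastLeaderElection also has two threads, WorkerReceiver and WorkerSender. As the names suggest, one receives and the other sends

3 Both threads hold a QuorumCnxManager member, the utility class that does the actual network communication

4 To send, a message is put into sendqueue

5 The sender thread runs a loop that polls sendqueue, waiting up to 3 seconds per poll, and then calls the network utility class to send the message

6 When a node sends a message to itself, QuorumCnxManager skips the network and puts the message straight into its own receive queue, from which WorkerReceiver hands it to FastLeaderElection's recvqueue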
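
The steps above can be sketched in a few lines. This is a simplified stand-in, not ZooKeeper's code: the class SendPathSketch, its Msg type and the network queue (which stands in for real sockets) are all hypothetical names, and the loopback is collapsed into a single recvqueue for brevity.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the send path; none of these are ZooKeeper classes.
final class SendPathSketch {
    // A message addressed to a server id.
    static final class Msg {
        final long sid;
        final String payload;
        Msg(long sid, String payload) { this.sid = sid; this.payload = payload; }
    }

    final long mySid;
    final LinkedBlockingQueue<Msg> sendqueue = new LinkedBlockingQueue<>();
    final LinkedBlockingQueue<Msg> recvqueue = new LinkedBlockingQueue<>();
    // Assumption: this queue stands in for the sockets to other servers.
    final LinkedBlockingQueue<Msg> network = new LinkedBlockingQueue<>();

    SendPathSketch(long mySid) { this.mySid = mySid; }

    // One iteration of the sender loop: wait up to timeoutMs for a message,
    // then dispatch it. Returns false if nothing was queued in time.
    boolean sendOnce(long timeoutMs) {
        try {
            Msg m = sendqueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (m == null) {
                return false;
            }
            if (m.sid == mySid) {
                recvqueue.offer(m);   // message to ourselves: no network hop
            } else {
                network.offer(m);     // message to a peer: goes over the wire
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        SendPathSketch node = new SendPathSketch(1L);
        node.sendqueue.offer(new Msg(1L, "vote-for-1"));  // to self
        node.sendqueue.offer(new Msg(2L, "vote-for-1"));  // to server 2
        node.sendOnce(3000);
        node.sendOnce(3000);
        System.out.println("loopback=" + node.recvqueue.size()
                + " network=" + node.network.size());     // prints loopback=1 network=1
    }
}
```

The point of the design is that the election logic only ever touches the two queues; whether a message travelled over a socket or was looped back locally is invisible to it.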

Note that lookForLeader has a local variable:

HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

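This structure is crucial: it is what decides when the election ends. Briefly, the key is a long holding the sender's myid, and the value is that server's Vote. For comparison purposes a Vote comes down to three fields: epoch, zxid and myid. They are compared in that order, epoch first, then zxid, then myid, and in each case the larger value has the higher priority.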
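
The comparison order just described can be written down as a small predicate. This is a sketch under assumptions: VoteOrderSketch and isBetter are hypothetical names, and the sketch leaves out anything the real totalOrderPredicate does beyond the three-field comparison (such as checks tied to the quorum configuration).

```java
// Hypothetical sketch of the vote ordering: epoch, then zxid, then server id.
final class VoteOrderSketch {
    // Returns true if the new vote outranks the current one.
    static boolean isBetter(long newId, long newZxid, long newEpoch,
                            long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // higher epoch always wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // same epoch: more recent data wins
        }
        return newId > curId;             // full tie on data: larger myid wins
    }

    public static void main(String[] args) {
        // Higher epoch beats a larger zxid and a larger id.
        System.out.println(isBetter(1, 5, 2, 3, 9, 1));   // true
        // Same epoch: the larger zxid decides.
        System.out.println(isBetter(1, 5, 1, 3, 9, 1));   // false
        // Same epoch and zxid: the larger id decides.
        System.out.println(isBetter(3, 5, 1, 2, 5, 1));   // true
    }
}
```

Because an identical vote is not "better", two servers holding the same vote never keep pushing it back and forth.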

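With the preparation done, let's look at the election logic.

After sending its vote to every ZooKeeper node, the server enters a while loop.

I'll cover it in two parts. The first part is how a vote received from another server is handled.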

while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS); // take the next notification from the receive queue, in order; notTimeout starts at 200 (finalizeWait), i.e. 200 ms

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){ // nothing received yet: either resend or reconnect
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                } 
                //this is where the core logic starts
                else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    switch (n.state) {
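                    case LOOKING: // the sender is also LOOKING
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) { // the sender's election epoch (round) is ahead of ours
                            logicalclock.set(n.electionEpoch); // catch up with the rest; logicalclock is effectively the election-epoch generator
                            recvset.clear(); // clear recvset, because votes will be re-sent for the new round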
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
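                                //totalOrderPredicate checks whether the received vote names a better leader than our own initial vote; if so, adopt its three fields
                            } else { // the round changed, so fall back to our own initial vote for the new round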
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
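                            sendNotifications(); // so a server whose election epoch is behind catches up and re-sends; likewise, servers behind us will catch up and send us their votes again
                        } else if (n.electionEpoch < logicalclock.get()) { // the sender's election epoch is behind ours: do nothing, leave the switch, and go back to the while loop for the next message in the receive queue
                            // the sender will vote again, so there is no risk of never re-entering the switch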
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) { // same round, and the received vote outranks ours
                            updateProposal(n.leader, n.zxid, n.peerEpoch); // adopt it: update the three fields of our vote
                            sendNotifications(); // send the updated vote out again
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // don't care about the version if it's in LOOKING state
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

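The final recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch)); is the real crux. recvset records the latest vote from every server, including this server's own. Note also that recvset is updated each time a current-round message arrives: a new message means some server has found a candidate it considers better than its previous choice and has re-sent its vote, so its entry in recvset has to be replaced.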
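
A toy model may make the overwrite behaviour concrete. It is a sketch, not the real data structure: RecvSetSketch, onVote and supportersOf are hypothetical names, and a vote is reduced to just the proposed leader id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of recvset: sender sid -> leader that sender currently proposes.
final class RecvSetSketch {
    private final Map<Long, Long> recvset = new HashMap<>();

    // A later vote from the same server replaces its earlier one.
    void onVote(long fromSid, long proposedLeader) {
        recvset.put(fromSid, proposedLeader);
    }

    // How many servers currently back the given leader.
    long supportersOf(long leader) {
        return recvset.values().stream().filter(l -> l == leader).count();
    }

    int size() {
        return recvset.size();
    }

    public static void main(String[] args) {
        RecvSetSketch rs = new RecvSetSketch();
        // Round opens: every server votes for itself.
        rs.onVote(1, 1);
        rs.onVote(2, 2);
        rs.onVote(3, 3);
        // Servers 1 and 2 learn about 3 and re-send their votes.
        rs.onVote(2, 3);
        rs.onVote(1, 3);
        System.out.println("servers=" + rs.size()
                + " votesFor3=" + rs.supportersOf(3));   // prints servers=3 votesFor3=3
    }
}
```

Keying by sender id is what guarantees each server is counted at most once, no matter how many times it changes its mind.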

The second part checks whether the termination condition has been met:

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        voteSet.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
            voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }

        /*
         * First make the views consistent. Sometimes peers will have different
         * zxids for a server depending on timing.
         */
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            //note: votes is the recvset stressed above; each received vote that equals our own vote is added to voteSet
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }
        
        return voteSet.hasAllQuorums();
    }

public boolean hasAllQuorums() {
        for (QuorumVerifierAcksetPair qvAckset : qvAcksetPairs) {
            if (!qvAckset.getQuorumVerifier().containsQuorum(qvAckset.getAckset()))
                return false;
        }
        return true;
    }

QuorumMaj#containsQuorum

public boolean containsQuorum(Set<Long> ackSet) {
        return (ackSet.size() > half);
    }

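Here half is the number of voting servers divided by 2 (integer division). With three machines, for example, half is 1. Note that the comparison (ackSet.size() > half) is strictly greater-than, so at least 2 matching votes are needed to satisfy it.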
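
The arithmetic is easy to check for a few ensemble sizes. The sketch below assumes a plain majority quorum; MajoritySketch and minAcks are hypothetical names, and only the size() > half comparison is taken from the code above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a majority check built around "ackSet.size() > half".
final class MajoritySketch {
    static boolean containsQuorum(int votingMembers, Set<Long> ackSet) {
        int half = votingMembers / 2;     // integer division: 3 -> 1, 4 -> 2, 5 -> 2
        return ackSet.size() > half;      // strictly more than half
    }

    // Smallest number of matching votes that passes the check.
    static int minAcks(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.println(n + " servers need " + minAcks(n) + " matching votes");
        }
        Set<Long> acks = new HashSet<>(Arrays.asList(1L, 2L));
        System.out.println(containsQuorum(3, acks));   // true: 2 > 1
        System.out.println(containsQuorum(4, acks));   // false: 2 > 2 fails
    }
}
```

Going from 3 to 4 servers raises the required votes from 2 to 3 without letting the ensemble survive any more failures, which is the usual argument for odd ensemble sizes.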

Now back to the second part, to go through the rest of it:

// Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
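                                    TimeUnit.MILLISECONDS)) != null){ // even with the termination condition met, keep checking for new messages; if one carries a better vote, put it back, break, and go around the outer loop again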
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) { // reaching here means no new messages arrived and the leader condition holds
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState()); // set our own role: leader or follower
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }

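As this shows, once a leader has been chosen every ZooKeeper server sets its own role by itself; the elected leader does not send another message telling everyone "I am the leader".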

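Once a server has a quorum behind its vote and its receive queue stays empty for finalizeWait (200 ms), the messages that needed to be sent have been sent, and who the leader is has been settled.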
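
That final "wait and see" step can be isolated in a short sketch. It rests on simplifying assumptions: FinalizeSketch and canFinalize are hypothetical names, and a vote is reduced to a single long rank where a larger rank means a better vote.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the finalize check; a vote is modelled as one long "rank".
final class FinalizeSketch {
    // Returns true if no vote better than `current` shows up within waitMs.
    // A better vote is put back so the main loop can process it normally.
    static boolean canFinalize(LinkedBlockingQueue<Long> recvqueue,
                               long current, long waitMs) {
        try {
            Long n;
            while ((n = recvqueue.poll(waitMs, TimeUnit.MILLISECONDS)) != null) {
                if (n > current) {
                    recvqueue.put(n);   // unbounded queue, so put does not block
                    return false;       // not settled yet
                }
                // Votes that are not better are simply consumed.
            }
            return true;                // queue stayed quiet: safe to finalize
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Long> quiet = new LinkedBlockingQueue<>();
        quiet.add(1L);
        quiet.add(2L);
        System.out.println(canFinalize(quiet, 3L, 200));   // true

        LinkedBlockingQueue<Long> late = new LinkedBlockingQueue<>();
        late.add(5L);
        System.out.println(canFinalize(late, 3L, 200));    // false, 5 is put back
    }
}
```

The short wait trades a small delay for protection against declaring a leader just before a better vote, already in flight, arrives.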

posted on 2021-06-17 21:01 MaXianZhe
