The core leader election code lives in the org.apache.zookeeper.server.quorum.FastLeaderElection#lookForLeader method.

The core of the election logic is shown below:

public Vote lookForLeader() throws InterruptedException {
        // Unrelated code omitted
    
        self.start_fle = Time.currentElapsedTime();
        try {
            
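            // Valid votes received in the current round; used to check whether a quorum backs the same member as leader
            Map<Long, Vote> recvset = new HashMap<>();
            // Votes from peers that are already FOLLOWING or LEADING; used to converge on the leader faster
            // when this member joins an ensemble that has already elected one
            Map<Long, Vote> outofelection = new HashMap<>();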
            int notTimeout = minNotificationInterval;
            synchronized (this) {
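                // Increment logicalclock to start a new election round (done before the vote is broadcast)
                logicalclock.incrementAndGet();
                // Reset the proposal to this node's own vote
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());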
            }

            // Broadcast the current proposal to the other nodes
            sendNotifications();

            SyncedLearnerTracker voteSet = null;

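            // While this node is in the LOOKING state, keep reading notifications from the receive queue
            while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
                // Take the next notification from the queue
                Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
                if (n == null) {
                    // Handling for "no notification received" omitted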
                } else if (validVoter(n.sid) && validVoter(n.leader)) {
                    // Branch on the state of the node that sent the notification
                    switch (n.state) {
                    case LOOKING:
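                        // zxid validation code omitted
                        if (n.electionEpoch > logicalclock.get()) {
                            // The vote belongs to a later round than our logicalclock, so this node is behind:
                            // catch up to that round, clear the vote pool, and rebroadcast our vote
                            logicalclock.set(n.electionEpoch);
                            // Clear the vote pool
                            recvset.clear();
                            // Check whether the notification's candidate beats this node's own initial vote (compares epoch, then zxid, then sid)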
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            // Rebroadcast our vote
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            // The vote belongs to an older round than our logicalclock: ignore it
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                                Long.toHexString(n.electionEpoch),
                                Long.toHexString(logicalclock.get()));
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                            // The vote is from the same round as our logicalclock and beats our current proposal:
                            // adopt it as our own proposal
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            // Rebroadcast the updated vote
                            sendNotifications();
                        }

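                        // Record the vote
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        // Collect the votes that match our current proposal, to check whether it has a quorum and the round can end
                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
                        // If the proposal already has a quorum
                        if (voteSet.hasAllQuorums()) {
                            // Drain any notifications still pending. If one of them beats the current proposal, put it back
                            // on the queue and abandon this attempt (the quorum just observed does not count, so the node state is not changed).
                            // If none of them wins, the node state is updated below.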
                            // Verify if there is any change in the proposed leader
                            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }
                            // If no further notification arrived within the wait time, the node state can be updated
                            if (n == null) {
                                // Change the node state; if proposedLeader is this node, mark it as LEADING
                                setPeerState(proposedLeader, voteSet);
                                Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: {}", n.sid);
                        break;
                    case FOLLOWING:
                        /*
                        * To avoid duplicate codes
                        * */
                        Vote resultFN = receivedFollowingNotification(recvset, outofelection, voteSet, n);
                        if (resultFN == null) {
                            break;
                        } else {
                            return resultFN;
                        }
                    case LEADING:
                        Vote resultLN = receivedLeadingNotification(recvset, outofelection, voteSet, n);
                        if (resultLN == null) {
                            break;
                        } else {
                            return resultLN;
                        }
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)", n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            // Some code omitted
        }
    }
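
The quorum check in the loop above (getVoteTracker followed by hasAllQuorums) can be pictured with a much smaller model. The sketch below is not ZooKeeper code: MajoritySketch, hasMajority and the sid-to-leader map are made-up names, it compares only the leader sid rather than the full Vote, and it assumes a plain majority quorum in which every server has equal weight.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the quorum check; all names here are hypothetical.
class MajoritySketch {

    // received maps a voter's sid to the sid of the leader that voter proposes.
    // Returns true when strictly more than half of the ensemble backs proposedLeader.
    static boolean hasMajority(Map<Long, Long> received, long proposedLeader, int ensembleSize) {
        int supporters = 0;
        for (long leader : received.values()) {
            if (leader == proposedLeader) {
                supporters++;
            }
        }
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> received = new HashMap<>();
        received.put(1L, 3L); // server 1 votes for server 3
        received.put(2L, 2L); // server 2 still votes for itself
        System.out.println(hasMajority(received, 3L, 3)); // false: 1 of 3

        received.put(3L, 3L); // server 3 votes for itself
        System.out.println(hasMajority(received, 3L, 3)); // true: 2 of 3
    }
}
```

In a three-node ensemble two matching votes are enough, which is why the real loop re-runs the check every time a new vote is recorded in recvset.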

The source of the totalOrderPredicate method is as follows:

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        if (self.getQuorumVerifier().getWeight(newId) == 0) {
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch)
                || ((newEpoch == curEpoch)
                    && ((newZxid > curZxid)
                        || ((newZxid == curZxid)
                            && (newId > curId)))));
    }

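The core logic is an ordered comparison: epoch first, then zxid, and finally serverId. A candidate whose quorum weight is 0 never wins. The method decides whether the received vote beats the vote it is compared against; if the conditions in the code are not met, the received vote loses and the node keeps its current proposal.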
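
To make the ordering concrete, here is a standalone sketch of the same three-level comparison. VoteOrderSketch and supersedes are hypothetical names, the sample epoch, zxid and sid values are invented for illustration, and the quorum weight check from the real method is deliberately left out.

```java
// Standalone sketch of the vote ordering; not ZooKeeper's own class.
class VoteOrderSketch {

    // Returns true when the new candidate should replace the current one:
    // higher epoch wins, then higher zxid, then higher server id.
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Current proposal: server 2, zxid 0x100000005, epoch 1 (made-up values).
        long curId = 2, curZxid = 0x100000005L, curEpoch = 1;

        // Higher epoch wins even though the server id is lower.
        System.out.println(supersedes(1, 0x200000000L, 2, curId, curZxid, curEpoch)); // true
        // Same epoch but lower zxid loses even though the server id is higher.
        System.out.println(supersedes(3, 0x100000004L, 1, curId, curZxid, curEpoch)); // false
        // Same epoch and zxid: the higher server id breaks the tie.
        System.out.println(supersedes(3, 0x100000005L, 1, curId, curZxid, curEpoch)); // true
        // An identical vote does not supersede the current one.
        System.out.println(supersedes(2, 0x100000005L, 1, curId, curZxid, curEpoch)); // false
    }
}
```

Because the comparison is strict at every level, two identical votes never displace each other, so a node does not keep rebroadcasting when it receives a copy of its own proposal.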

posted on 2024-03-02 17:07  bibibao