A Detailed Look at ZooKeeper's FastLeaderElection Algorithm

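When a ZooKeeper service starts, the first thing it has to do is elect a leader. ZooKeeper ships three leader election algorithms: LeaderElection, AuthFastLeaderElection, and FastLeaderElection. FastLeaderElection is the default, and the choice can be changed through the electionAlg setting in the configuration file.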
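To make the electionAlg setting concrete, here is a minimal sketch of reading it from a properties-style configuration and mapping the number to an algorithm name. The class and method names (ElectionAlgSketch, readElectionAlg, describe) are hypothetical and not part of ZooKeeper; the number-to-algorithm mapping is taken from the createElectionAlgorithm switch statement quoted later in this article.

```java
import java.util.Properties;

// Illustrative only: this class is hypothetical, not ZooKeeper code.
class ElectionAlgSketch {
    // FastLeaderElection (case 3 in createElectionAlgorithm) is the default.
    static final int DEFAULT_ELECTION_ALG = 3;

    // Reads electionAlg from the config, falling back to the default when absent.
    static int readElectionAlg(Properties cfg) {
        String raw = cfg.getProperty("electionAlg");
        return raw == null ? DEFAULT_ELECTION_ALG : Integer.parseInt(raw.trim());
    }

    // Mirrors the cases of the createElectionAlgorithm switch statement.
    static String describe(int electionAlg) {
        switch (electionAlg) {
            case 0: return "LeaderElection";
            case 1: return "AuthFastLeaderElection";
            case 2: return "AuthFastLeaderElection (second constructor argument true)";
            case 3: return "FastLeaderElection";
            default: throw new IllegalArgumentException("Unknown electionAlg: " + electionAlg);
        }
    }

    public static void main(String[] args) {
        Properties cfg = new Properties();
        System.out.println(describe(readElectionAlg(cfg))); // no setting: default
        cfg.setProperty("electionAlg", "0");
        System.out.println(describe(readElectionAlg(cfg)));
    }
}
```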

1. When the ZooKeeper service starts, the main thread begins in the main entry function of the QuorumPeerMain class:

public class QuorumPeerMain {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumPeerMain.class);

    private static final String USAGE = "Usage: QuorumPeerMain configfile";

    protected QuorumPeer quorumPeer;

    /**
     * To start the replicated server specify the configuration file name on
     * the command line.
     * @param args path to the configfile
     */
    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();

2. QuorumPeer, which overrides Thread.start, is then started:

          quorumPeer.start();
          quorumPeer.join();

In the QuorumPeer class:

   @Override
    public synchronized void start() {
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
         }
        loadDataBase();
        cnxnFactory.start();
        try {
            adminServer.start();
        } catch (AdminServerException e) {
            LOG.warn("Problem starting AdminServer", e);
            System.out.println(e);
        }
        startLeaderElection();
        super.start();
    }

3. As the source above shows, once the quorumPeer thread is started, the first thing it does is recover data by reading what was saved on disk:

 private void loadDataBase() {
        try {
            // Restore the database from local files
            zkDb.loadDataBase();

            // load the epochs
            /*
            Restore the epoch from the latest zxid.
            A zxid (ZooKeeper Transaction Id) is a long: the high 32 bits hold the epoch
            and the low 32 bits hold a counter. The zxid grows with every update,
            so a larger zxid means more recent data.
            */
            long lastProcessedZxid = zkDb.getDataTree().lastProcessedZxid;
            long epochOfZxid = ZxidUtils.getEpochFromZxid(lastProcessedZxid);
            try {
                currentEpoch = readLongFromFile(CURRENT_EPOCH_FILENAME);
            } catch(FileNotFoundException e) {
            	// pick a reasonable epoch number
            	// this should only happen once when moving to a
            	// new code version
            	currentEpoch = epochOfZxid;
                //....
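The zxid layout described in the comment above (epoch in the high 32 bits, counter in the low 32 bits) can be sketched with plain bit operations. ZxidSketch and its method names are hypothetical stand-ins chosen for this example; they illustrate the layout and are not a copy of ZxidUtils.

```java
// Illustrative only: a hypothetical helper showing the zxid bit layout.
class ZxidSketch {
    // The epoch lives in the high 32 bits.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // The per-epoch counter lives in the low 32 bits.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Packs an epoch and a counter into one 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 3);
        System.out.println("zxid    = 0x" + Long.toHexString(zxid));
        System.out.println("epoch   = " + epochOf(zxid));
        System.out.println("counter = " + counterOf(zxid));
    }
}
```

Because the epoch occupies the high bits, comparing two zxids as plain longs orders them by epoch first and by counter second.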

4. Next the election is initialized. Each server starts by voting for itself, and the default algorithm is FastLeaderElection:

synchronized public void startLeaderElection() {
       try {
            /*
            Vote for ourselves first
            */
           if (getPeerState() == ServerState.LOOKING) {
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
           RuntimeException re = new RuntimeException(e.getMessage());
           re.setStackTrace(e.getStackTrace());
           throw re;
       }

       // if (!getView().containsKey(myid)) {
      //      throw new RuntimeException("My id " + myid + " not in the peer list");
        //}
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        this.electionAlg = createElectionAlgorithm(electionType);
    }

5. The election port is then bound and FastLeaderElection is initialized:

protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;

        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            qcm = new QuorumCnxManager(this);
            /*
            Bind the election port and wait for the other machines in the cluster to connect
            */
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                listener.start();
                // TCP-based election algorithm
                FastLeaderElection fle = new FastLeaderElection(this, qcm);
                fle.start();
                le = fle;
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

6. FastLeaderElection is initialized (its starter method is shown below), and the QuorumPeer thread starts:

private void starter(QuorumPeer self, QuorumCnxManager manager) {
        this.self = self;
        proposedLeader = -1;
        proposedZxid = -1;

        /*
        Send queue for the application layer, holding ToSend objects
        Receive queue for the application layer, holding Notification objects
        */
        sendqueue = new LinkedBlockingQueue<ToSend>();
        recvqueue = new LinkedBlockingQueue<Notification>();
        this.messenger = new Messenger(manager);

    }

In FastLeaderElection.java:

Messenger(QuorumCnxManager manager) {

            this.ws = new WorkerSender(manager);

            this.wsThread = new Thread(this.ws,
                    "WorkerSender[myid=" + self.getId() + "]");
            this.wsThread.setDaemon(true);

            this.wr = new WorkerReceiver(manager);

            this.wrThread = new Thread(this.wr,
                    "WorkerReceiver[myid=" + self.getId() + "]");
            this.wrThread.setDaemon(true);
        }
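The pattern used by Messenger, a daemon worker thread that drains a blocking queue, can be sketched on its own. Everything in this example is hypothetical (MessengerSketch, the wire queue that stands in for the network, the message format); it only illustrates how a sender thread decouples the election logic from I/O.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: a hypothetical sender worker, not ZooKeeper code.
class MessengerSketch {
    final LinkedBlockingQueue<String> sendQueue = new LinkedBlockingQueue<>();
    // Stands in for the network: whatever the sender "transmits" lands here.
    final LinkedBlockingQueue<String> wire = new LinkedBlockingQueue<>();
    private final Thread senderThread;

    // The constructor only creates the thread; start() launches it.
    MessengerSketch(long myId) {
        senderThread = new Thread(() -> {
            try {
                while (true) {
                    wire.put("from " + myId + ": " + sendQueue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly when interrupted
            }
        }, "WorkerSender[myid=" + myId + "]");
        senderThread.setDaemon(true); // do not keep the JVM alive
    }

    void start() {
        senderThread.start();
    }

    // Waits up to timeoutMillis for the next transmitted message; null on timeout.
    String awaitTransmitted(long timeoutMillis) {
        try {
            return wire.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        MessengerSketch messenger = new MessengerSketch(1);
        messenger.start();
        messenger.sendQueue.offer("vote for 3");
        System.out.println(messenger.awaitTransmitted(1000));
    }
}
```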

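7. During an election, each ZooKeeper server is in one of four states: LOOKING, FOLLOWING, LEADING, or OBSERVING. A server in the OBSERVING state does not take part in voting; only servers in the LOOKING state do. Once voting ends, a participating server's state becomes FOLLOWING or LEADING.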
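The four states and the transition at the end of voting can be summarized in a small sketch. PeerStateSketch and its methods are hypothetical names for this example; the rules they encode are only the ones stated in the paragraph above.

```java
// Illustrative only: a hypothetical summary of the states described above.
class PeerStateSketch {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    // Per the description above, only LOOKING servers take part in voting.
    static boolean takesPartInVoting(State state) {
        return state == State.LOOKING;
    }

    // State of a participating server once the election has picked a leader.
    static State afterElection(long myId, long electedLeader) {
        return myId == electedLeader ? State.LEADING : State.FOLLOWING;
    }

    public static void main(String[] args) {
        System.out.println(takesPartInVoting(State.OBSERVING));
        System.out.println(afterElection(3, 3));
        System.out.println(afterElection(1, 3));
    }
}
```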

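The leader election process works as follows.

Step 1: When a server in the LOOKING state receives a vote from another LOOKING server, it first compares logical clocks (logicalclock, carried in the notification as electionEpoch). If the received value is greater than its own, the notification belongs to a newer election round: the server updates its own logicalclock, clears the votes it had collected from other servers, and then decides whether to change its own vote. The comparison looks first at the epoch, then at the zxid, and finally at the server id. Note that the epoch compared here is the peer epoch, not the logical clock, so a newer election round does not by itself force the server to adopt the received vote; as the lookForLeader code further down shows, it may fall back to its own initial vote. Either way, it then broadcasts its vote to the other servers.

In FastLeaderElection.java: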

 protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
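The ordering implemented by totalOrderPredicate can be tried out in isolation. The sketch below is a simplified restatement that assumes every server has a non-zero weight, so the getWeight check is left out; VoteOrderSketch and supersedes are hypothetical names.

```java
// Illustrative only: a simplified restatement of the vote ordering,
// assuming all servers have non-zero weight.
class VoteOrderSketch {
    // Returns true when the new vote should replace the current one:
    // higher epoch wins, then higher zxid, then higher server id.
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    public static void main(String[] args) {
        // Higher epoch wins even with a smaller zxid and a smaller id.
        System.out.println(supersedes(1, 0x100, 2, 3, 0x200, 1));
        // Same epoch: the larger zxid wins.
        System.out.println(supersedes(1, 0x200, 1, 3, 0x100, 1));
        // Same epoch and zxid: the larger server id wins.
        System.out.println(supersedes(3, 0x100, 1, 1, 0x100, 1));
        // An identical vote does not supersede the current one.
        System.out.println(supersedes(1, 0x100, 1, 1, 0x100, 1));
    }
}
```

The last case matters in practice: because an identical vote never supersedes the current one, a server does not rebroadcast when it hears its own proposal echoed back.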

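Step 2: If the server's own logicalclock is greater than the received one, the notification is stale and is simply dropped (break). If the two are equal, the server uses epoch, zxid, and server id to decide whether to update its vote, and broadcasts the new vote if it changed. Finally, the received vote is added to the set of votes this server has collected (recvset).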

            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

In the lookForLeader function of FastLeaderElection.java:

case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            // Clear the votes collected so far
                            recvset.clear();
                            // Decide whether to update our own vote
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug(
                                    "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break; 
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            // Broadcast
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // Add to the set of received votes
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
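The three branches of the LOOKING case can be exercised with a small, self-contained model. This is a sketch under simplifying assumptions: LookingStepSketch, Proposal, and the broadcasts counter (which stands in for sendNotifications) are hypothetical, and networking, weights, and the Vote class are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical model of the LOOKING branch.
class LookingStepSketch {
    static final class Proposal {
        final long leader, zxid, peerEpoch;
        Proposal(long leader, long zxid, long peerEpoch) {
            this.leader = leader; this.zxid = zxid; this.peerEpoch = peerEpoch;
        }
        // Same ordering as totalOrderPredicate: epoch, then zxid, then id.
        boolean beats(Proposal other) {
            if (peerEpoch != other.peerEpoch) return peerEpoch > other.peerEpoch;
            if (zxid != other.zxid) return zxid > other.zxid;
            return leader > other.leader;
        }
    }

    final Proposal initial;   // this server's vote for itself
    Proposal current;         // the vote it currently proposes
    long logicalClock;
    int broadcasts;           // counts what would be sendNotifications() calls
    final Map<Long, Proposal> recvset = new HashMap<>();

    LookingStepSketch(Proposal initial, long logicalClock) {
        this.initial = initial;
        this.current = initial;
        this.logicalClock = logicalClock;
    }

    // Returns false when the notification is stale and dropped.
    boolean onNotification(long senderId, long electionEpoch, Proposal received) {
        if (electionEpoch > logicalClock) {
            // Newer round: adopt its clock, forget old votes, re-decide from our initial vote.
            logicalClock = electionEpoch;
            recvset.clear();
            current = received.beats(initial) ? received : initial;
            broadcasts++;
        } else if (electionEpoch < logicalClock) {
            return false; // stale round, ignore
        } else if (received.beats(current)) {
            current = received;
            broadcasts++;
        }
        recvset.put(senderId, received);
        return true;
    }

    public static void main(String[] args) {
        LookingStepSketch me = new LookingStepSketch(new Proposal(1, 0x10, 1), 1);
        me.onNotification(2, 1, new Proposal(2, 0x20, 1));
        System.out.println("now proposing server " + me.current.leader);
    }
}
```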

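Step 3: The server checks whether the election is over. The condition is that some candidate has the support of more than half of the servers. If so, the server waits a little longer (200 ms) to see whether a better vote arrives. If none does, it sets its own role (follower or leader) and leaves the election; otherwise it continues.

In FastLeaderElection.java:

// Check whether voting has finished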
    private boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        voteSet.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
            voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }

        /*
         * First make the views consistent. Sometimes peers will have different
         * zxids for a server depending on timing.
         */
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }

        return voteSet.hasAllQuorums();
    }
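For the common case of a simple majority quorum, the termination check reduces to counting matching votes. The sketch below is a simplification under stated assumptions: votes are reduced to the proposed leader id only, the quorum is a plain majority of a fixed ensemble size, and MajoritySketch is a hypothetical name. The real method compares whole Vote objects and delegates the quorum decision to the configured quorum verifier.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a majority-quorum simplification of the termination check.
class MajoritySketch {
    // votes maps server id -> the leader that server voted for.
    static boolean hasMajority(Map<Long, Long> votes, long proposedLeader, int ensembleSize) {
        int acks = 0;
        for (long votedFor : votes.values()) {
            if (votedFor == proposedLeader) {
                acks++;
            }
        }
        // Strictly more than half of the whole ensemble, not of the votes received.
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        System.out.println(hasMajority(votes, 3L, 5)); // 2 of 5 so far
        votes.put(3L, 3L);
        System.out.println(hasMajority(votes, 3L, 5)); // 3 of 5
    }
}
```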

In the lookForLeader function:

 // Check whether voting has finished
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            // Wait a little longer to see whether a new vote arrives
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            // If no new vote arrived, end the election
                            // and set our own state
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
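The "wait a little longer" step is a timed poll on the receive queue. The sketch below isolates that pattern under simplifying assumptions: each queued vote is reduced to a single long where a larger value means a better vote, and FinalizeWaitSketch and noBetterVoteArrives are hypothetical names.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative only: the timed-poll pattern with votes reduced to one number.
class FinalizeWaitSketch {
    // Drains the queue until it stays empty for waitMillis.
    // Returns true if nothing better than currentBest showed up.
    static boolean noBetterVoteArrives(LinkedBlockingQueue<Long> queue,
                                       long currentBest, long waitMillis) {
        try {
            Long n;
            while ((n = queue.poll(waitMillis, TimeUnit.MILLISECONDS)) != null) {
                if (n > currentBest) {
                    // Put it back so the main loop can process it, then keep electing.
                    queue.offer(n);
                    return false;
                }
            }
            return true; // the queue stayed quiet: safe to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>();
        queue.offer(1L);
        queue.offer(2L);
        System.out.println(noBetterVoteArrives(queue, 5L, 200L));
        queue.offer(7L);
        System.out.println(noBetterVoteArrives(queue, 5L, 200L));
    }
}
```

Worse or equal votes are discarded during the wait, while a better one is pushed back onto the queue so that the normal processing path handles it.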

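Step 4: So far we have assumed that the sender is in the LOOKING state. If the sender is in the FOLLOWING or LEADING state and its logicalclock equals ours, the vote is saved in recvset. The server then checks whether the leader named by the sender has the support of more than half of the servers and whether checkLeader accepts that leader. If so, it sets its own state and leaves the election: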

 case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        // Our logicalclock equals the sender's
                        if(n.electionEpoch == logicalclock.get()){
                            // Add to recvset
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

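Step 5: If the logicalclock in the received data differs from ours, another election round has already produced a result. The vote is added to the outofelection set, and the server checks whether the named leader has majority support within outofelection. If so, it adopts that round's logicalclock, sets its own state, and returns the final vote: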

 outofelection.put(n.sid, new Vote(n.leader, 
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                        if (termPredicate(outofelection, new Vote(n.leader,
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
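Joining an ensemble that has already elected a leader can be sketched as two conditions: a majority of the reporting servers name the same leader, and that leader itself has reported in as LEADING. The second condition is an assumption about what checkLeader verifies, since its body is not shown in this article. JoinEstablishedSketch, Report, and canJoin are hypothetical names, the quorum is assumed to be a plain majority, and the case where the named leader is this server itself is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical model of joining an established ensemble.
class JoinEstablishedSketch {
    enum State { FOLLOWING, LEADING }

    static final class Report {
        final long leader;  // the leader this server says it follows
        final State state;  // the state this server reports
        Report(long leader, State state) {
            this.leader = leader;
            this.state = state;
        }
    }

    // outOfElection maps sender id -> what that server reported.
    static boolean canJoin(Map<Long, Report> outOfElection, long leader, int ensembleSize) {
        int support = 0;
        for (Report r : outOfElection.values()) {
            if (r.leader == leader) {
                support++;
            }
        }
        // Assumed checkLeader behaviour: the leader must have reported LEADING itself.
        Report fromLeader = outOfElection.get(leader);
        boolean leaderConfirms = fromLeader != null && fromLeader.state == State.LEADING;
        return support > ensembleSize / 2 && leaderConfirms;
    }

    public static void main(String[] args) {
        Map<Long, Report> seen = new HashMap<>();
        seen.put(2L, new Report(3L, State.FOLLOWING));
        seen.put(4L, new Report(3L, State.FOLLOWING));
        seen.put(5L, new Report(3L, State.FOLLOWING));
        System.out.println(canJoin(seen, 3L, 5)); // leader has not confirmed yet
        seen.put(3L, new Report(3L, State.LEADING));
        System.out.println(canJoin(seen, 3L, 5));
    }
}
```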

Summary: this is roughly how ZooKeeper's FastLeaderElection election works.

Reference blogs:

http://blog.csdn.net/xhh198781/article/details/6619203

http://iwinit.iteye.com/blog/1773531

posted @ 2014-12-16 19:23 ihge2k
