ZooKeeper Leader Election: A Source Code Walkthrough

Before we start, let's review the roles in a ZooKeeper cluster: leader, follower, and observer. Leader and follower are similar to the master/slave structure of a MySQL database: the leader, like a master, accepts both reads and writes from clients, while followers, like slaves, mainly serve read requests. Leaders and followers take part in elections; observers do not, so they are left out of this discussion.

ZooKeeper's leader election is actually quite similar to the master election of a Redis cluster: both are majority (quorum) based.

Each ZooKeeper ballot carries three fields: (sid, zxid, epoch)

1. sid: the server id of the ZooKeeper node

2. zxid: the node's latest transaction id

3. epoch: the election round (the logical clock)

Suppose there are two nodes, (1,0,0) and (2,0,0). If the epochs and zxids are equal, the node with the larger sid becomes leader, i.e. (2,0,0).

Suppose there are two nodes, (1,2,1) and (2,1,1). If the epochs are equal, the node with the larger zxid becomes leader, i.e. (1,2,1).

Suppose nodes (1,0,1) and (2,0,1) have already elected (2,0,1) as leader. If a new node now joins, its epoch is necessarily smaller than the others', so it can only become a follower, and it then sets its epoch to match the rest of the cluster.
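
This precedence (epoch first, then zxid, then sid) is the ordering that FastLeaderElection applies in its totalOrderPredicate method. A minimal standalone sketch of the rule (the helper name ballotWins is mine, not ZooKeeper's):

static boolean ballotWins(long newEpoch, long newZxid, long newSid,
                          long curEpoch, long curZxid, long curSid) {
    // a new ballot wins if its epoch is higher; on an epoch tie, if its zxid
    // is higher; and on a zxid tie, if its sid is higher
    return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newSid > curSid)));
}

For the second example above, ballotWins(1, 2, 1, 1, 1, 2) returns true, so (1,2,1) beats (2,1,1).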

Before reading the code, take a look at ZooKeeper's leader election architecture diagram:

From this simple architecture diagram we can see that ZooKeeper makes full use of threads and queues to decouple its components and gain concurrency. When reading the source we will therefore also analyze it block by block, to avoid getting lost.

Next, one more picture: the configuration file of a ZooKeeper cluster. It helps to have a rough idea of a few of these settings first.
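
A typical three-node cluster config looks roughly like this (hosts, paths, and ports below are illustrative placeholders):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
# server.<sid>=<host>:<data port>:<election port>
# the second port carries leader/follower data traffic,
# the third carries election traffic
server.1=127.0.0.1:2888:32888
server.2=127.0.0.1:2889:32889
server.3=127.0.0.1:2890:32890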

ZooKeeper's startup class is QuorumPeerMain, so we enter from there.

In the runFromConfig method a management object is created and a number of properties are set on it, such as the inter-server transport object and the in-memory database object, and then the server node is started.

QuorumPeerMain#runFromConfig

public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
      try {
          // register the log4j JMX MBeans
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }

      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = null;
          ServerCnxnFactory secureCnxnFactory = null;

          if (config.getClientPortAddress() != null) {
              // 1.1: initialize the server connection factory; ZooKeeper defaults to NIO, but the startup parameter -Dzookeeper.serverCnxnFactory=xxx can switch it, e.g. to Netty
              cnxnFactory = ServerCnxnFactory.createFactory();
              cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns(),
                      false);
          }

          if (config.getSecureClientPortAddress() != null) {
              secureCnxnFactory = ServerCnxnFactory.createFactory();
              secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                      config.getMaxClientCnxns(),
                      true);
          }
		  // get the current server node object; this actually news up a QuorumPeer
          quorumPeer = getQuorumPeer();
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          // set the election algorithm type; the default is 3, which maps to FastLeaderElection, the algorithm the nodes use below to elect the leader
          quorumPeer.setElectionType(config.getElectionAlg());
          // set this node's server id, i.e. which of server.1/2/3 in the config this node is
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setConfigFileName(config.getConfigFilename());
          // initialize the in-memory database
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
          if (config.getLastSeenQuorumVerifier()!=null) {
              quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
          }
          quorumPeer.initConfigInZKDatabase();
          // set the server connection factory (NIO, or e.g. Netty)
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
          quorumPeer.setSslQuorum(config.isSslQuorum());
          quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          if (config.sslQuorumReloadCertFiles) {
              quorumPeer.getX509Util().enableCertFileReloading();
          }

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          // initialize the authentication service
          quorumPeer.initialize();
          // 2: start the server node
          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

1.1: Initialize the server connection factory. ZooKeeper defaults to NIO, but the startup parameter -Dzookeeper.serverCnxnFactory=xxx can switch it, for example to Netty; see ServerCnxnFactory#createFactory.

2: Start the server node; see QuorumPeer#start.

Let's now look at points 1 and 2 in detail.

ServerCnxnFactory#createFactory
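
The factory method, roughly (3.5 branch, reconstructed, not a verbatim quote): it reads the zookeeper.serverCnxnFactory system property and falls back to NIO when it is absent.

static public ServerCnxnFactory createFactory() throws IOException {
    String serverCnxnFactoryName =
            System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (serverCnxnFactoryName == null) {
        // no -Dzookeeper.serverCnxnFactory given: default to NIO
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        // reflectively instantiate the configured factory,
        // e.g. NettyServerCnxnFactory
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class
                .forName(serverCnxnFactoryName)
                .getDeclaredConstructor().newInstance();
        LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
        return serverCnxnFactory;
    } catch (Exception e) {
        IOException ioe = new IOException("Couldn't instantiate "
                + serverCnxnFactoryName);
        ioe.initCause(e);
        throw ioe;
    }
}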

QuorumPeer#start

@Override
public synchronized void start() {
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }
    // 1: load the on-disk data into memory and restore the DataTree
    loadDataBase();
    // 2: start the socket service configured on the quorumPeer (NIO, or a specified one such as Netty)
    startServerCnxnFactory();
    try {
        // 3: start the embedded Jetty admin server for inspecting server state
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // 4: initialize the leader election objects: the election data manager, the election listener socket, the election algorithm threads, and so on
    startLeaderElection();
    // 5: start the thread that runs the leader election
    super.start();
}

Since this article is mainly about the election flow, the remaining logic is skipped for now.

4: Initialize the leader election objects (the election data manager, the election listener socket, the election algorithm threads, and so on); see startLeaderElection.

5: Start the leader election thread; see super.start.

QuorumPeer#startLeaderElection

startLeaderElection
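
The method itself, roughly (3.5 branch, reconstructed; part elided):

synchronized public void startLeaderElection() {
    try {
        if (getPeerState() == ServerState.LOOKING) {
            // every node starts by voting for itself:
            // (my sid, my last logged zxid, my current epoch)
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }
    ....
    // electionType defaults to 3, which maps to FastLeaderElection
    this.electionAlg = createElectionAlgorithm(electionType);
}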

createElectionAlgorithm
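
Roughly (3.5 branch, reconstructed; the legacy cases and the old-manager swap are elided):

protected Election createElectionAlgorithm(int electionAlgorithm) {
    Election le = null;
    switch (electionAlgorithm) {
    ....
    case 3:
        // the election data manager: owns the per-sid send queues
        // and the receive queue used below
        QuorumCnxManager qcm = createCnxnManager();
        QuorumCnxManager.Listener listener = qcm.listener;
        if (listener != null) {
            // BIO listener on the election port
            listener.start();
            // the fast election algorithm plus its
            // WorkerSender/WorkerReceiver threads
            FastLeaderElection fle = new FastLeaderElection(this, qcm);
            fle.start();
            le = fle;
        } else {
            LOG.error("Null listener when initializing cnx manager");
        }
        break;
    default:
        assert false;
    }
    return le;
}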

Start the election listener; see listener.start.

Start the fast election algorithm threads; see fle.start.

listener.start

From the code we can see that Listener is a subclass of ZooKeeperThread, which in turn extends Thread, so calling listener.start actually executes Listener's run method. Let's look at run; part of the code is elided for readability.

@Override
public void run() {
    ....
        while ((!shutdown) && (portBindMaxRetry == 0 || numRetries < portBindMaxRetry)) {
            try {
                // listen on the election port using a plain blocking ServerSocket (BIO)
				.... 
                ss = new ServerSocket();
             	....
                ss.setReuseAddress(true);
                if (self.getQuorumListenOnAllIPs()) {
                    // this port comes from the config file: in server.1=127.0.0.1:2888:32888, 32888 is the election port
                    int port = self.getElectionAddress().getPort();  
                    addr = new InetSocketAddress(port);
                } else {
                    self.recreateSocketAddresses(self.getId());
                    addr = self.getElectionAddress();
                }
                setName(addr.toString());
                ss.bind(addr);  // bind the election port
                while (!shutdown) {
                    try {
                        client = ss.accept();  // block until a peer connects
                        setSockOpts(client);
                        if (quorumSaslAuthEnabled) {
                            // handle the connection asynchronously
                            receiveConnectionAsync(client);
                        } else {
                            // handle the connection synchronously
                            receiveConnection(client);
                        }
                        numRetries = 0;
                    } catch (SocketTimeoutException e) {
                        ......
                    }
                }
            } catch (IOException e) {
                ....
            }
        }
    ...
}

Here we follow the synchronous path; see receiveConnection.

receiveConnection

QuorumCnxManager.receiveConnection
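
The method is short; roughly (3.5 branch, reconstructed):

public void receiveConnection(final Socket sock) {
    DataInputStream din = null;
    try {
        // wrap the accepted socket's input stream and hand it off
        din = new DataInputStream(new BufferedInputStream(sock.getInputStream()));
        handleConnection(sock, din);
    } catch (IOException e) {
        LOG.error("Exception handling connection, addr: {}, closing server connection",
                sock.getRemoteSocketAddress());
        closeSocket(sock);
    }
}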

The message itself is handled in handleConnection.

handleConnection

QuorumCnxManager.handleConnection

Part of the code is elided here for readability.

    private void handleConnection(Socket sock, DataInputStream din)
            throws IOException {
        Long sid = null, protocolVersion = null;
        InetSocketAddress electionAddr = null;

        try {
            protocolVersion = din.readLong();  // read the first long from the input stream
            if (protocolVersion >= 0) {
                // non-negative means the old protocol: the value is the sender's sid,
                // i.e. the server.1/2/3 id set in the conf (this is the usual branch)
                sid = protocolVersion;
            } else {
                // new protocol: parse an InitialMessage to get the sid and election address
                .......
            }
        } catch (IOException e) {
            .......
        }

        if (sid < self.getId()) {  
            // If the sender's sid is smaller than ours, close this socket and have this
            // node initiate a connection to that sid instead. Why? A socket is
            // bidirectional, so to avoid duplicate connections ZooKeeper only lets the
            // node with the larger sid open the connection to the node with the smaller sid.
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }
            LOG.debug("Create new connection to server: {}", sid);
            closeSocket(sock);

            if (electionAddr != null) {
                connectOne(sid, electionAddr);
            } else {
                connectOne(sid);
            }
        } else if (sid == self.getId()) {
			// the node opened a socket connection to itself; this indicates a bug, most likely a configuration problem
        } else { 
            // two threads are created here: one to send socket messages and one to receive them; mapping this to the architecture diagram at the top of the article makes it easier to follow
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);

            SendWorker vsw = senderWorkerMap.get(sid);

            if (vsw != null) {
                vsw.finish();
            }
			// store the send thread in a map: key = the other node's server id, value = the send thread
            senderWorkerMap.put(sid, sw);

            // if there is no queue for this node yet, create one and put it in the map: key = the node's server id, value = a bounded blocking queue
            queueSendMap.putIfAbsent(sid,
                    new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));

            // start the send thread
            sw.start();
            // start the receive thread
            rw.start();
        }
    }

Starting the send thread: see sw.start.

Starting the receive thread: see rw.start.

sw.start

SendWorker.run

@Override
public void run() {
    threadCnt.incrementAndGet();
    try {
        ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
        if (bq == null || isSendQueueEmpty(bq)) {
            ByteBuffer b = lastMessageSent.get(sid);
            if (b != null) {
                LOG.debug("Attempting to send lastMessage to sid=" + sid);
                send(b);
            }
        }
    } catch (IOException e) {
        this.finish();
    }
    try {
        // loop, repeatedly taking data from sid's queue and sending it to the other node over the socket
        while (running && !shutdown && sock != null) {
            ByteBuffer b = null;
            try {
                ArrayBlockingQueue<ByteBuffer> bq = queueSendMap.get(sid);
                if (bq != null) {
                    // take data from the queue (blocking poll with timeout)
                    b = pollSendQueue(bq, 1000, TimeUnit.MILLISECONDS);
                } else {
					....
                    break;
                }

                if(b != null){
                    lastMessageSent.put(sid, b);
                    // send the data
                    send(b);
                }
            } catch (InterruptedException e) {
				......
            }
        }
    } catch (Exception e) {
		.....
    }
    this.finish();
    
}

Taking data from the queue: see pollSendQueue.

Sending the data: see send.

pollSendQueue

QuorumCnxManager.pollSendQueue
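
This one is trivial; roughly (reconstructed):

private ByteBuffer pollSendQueue(ArrayBlockingQueue<ByteBuffer> queue,
        long timeout, TimeUnit unit) throws InterruptedException {
    // a plain bounded blocking poll on the per-sid send queue
    return queue.poll(timeout, unit);
}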

send

QuorumCnxManager.send
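
SendWorker.send frames each message with a length prefix; roughly (reconstructed):

synchronized void send(ByteBuffer b) throws IOException {
    byte[] msgBytes = new byte[b.capacity()];
    try {
        b.position(0);
        b.get(msgBytes);
    } catch (BufferUnderflowException be) {
        LOG.error("BufferUnderflowException ", be);
        return;
    }
    // length-prefixed frame: the RecvWorker on the other side first reads
    // the length, then exactly that many bytes
    dout.writeInt(b.capacity());
    dout.write(b.array());
    dout.flush();
}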

rw.start

RecvWorker.run
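
RecvWorker is the mirror image of SendWorker: it loops reading length-prefixed frames off the socket and pushes them onto the manager's recvQueue. Roughly (3.5 branch, reconstructed):

@Override
public void run() {
    threadCnt.incrementAndGet();
    try {
        while (running && !shutdown && sock != null) {
            // read one length-prefixed frame
            int length = din.readInt();
            if (length <= 0 || length > PACKETMAXSIZE) {
                throw new IOException("Received packet with invalid packet: " + length);
            }
            byte[] msgArray = new byte[length];
            din.readFully(msgArray, 0, length);
            ByteBuffer message = ByteBuffer.wrap(msgArray);
            // hand the raw message to recvQueue, where WorkerReceiver picks it up
            addToRecvQueue(new Message(message.duplicate(), sid));
        }
    } catch (Exception e) {
        LOG.warn("Connection broken for id " + sid, e);
    } finally {
        sw.finish();
        closeSocket(sock);
    }
}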

fle.start

FastLeaderElection.start actually starts two daemon threads: one (WorkerSender) that moves ballots into the per-sid send queues, and one (WorkerReceiver) that processes the ballots arriving on the recvQueue. If this feels abstract, compare it with the architecture diagram at the top of the article.

For the sending thread, see wsThread.start.

For the ballot-processing thread, see wrThread.start.

wsThread.start

WorkerSender.run
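
WorkerSender simply drains sendqueue and forwards each ballot to the connection manager; roughly (reconstructed):

public void run() {
    while (!stop) {
        try {
            // block up to 3s waiting for a ballot queued by sendNotifications
            ToSend m = sendqueue.poll(3000, TimeUnit.MILLISECONDS);
            if (m == null) continue;
            // serialize the ballot and hand it to QuorumCnxManager#toSend
            process(m);
        } catch (InterruptedException e) {
            break;
        }
    }
    LOG.info("WorkerSender is down");
}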

Ballots are sent via toSend.

toSend

QuorumCnxManager.toSend

 public void toSend(Long sid, ByteBuffer b) {

     if (this.mySid == sid) {
         // if the ballot's recipient is this node itself, put it straight into our own recvQueue
         b.position(0);
         addToRecvQueue(new Message(b.duplicate(), sid));
     } else {		
         ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
         // the recipient is another node: build a bounded blocking queue and try to
         // register it in queueSendMap. Note that this queueSendMap is the very same
         // map used in handleConnection, which matches the architecture diagram.
         ArrayBlockingQueue<ByteBuffer> oldq = queueSendMap.putIfAbsent(sid, bq);
         // add the payload to the queue; if a SendWorker already exists, its blocked poll wakes up and it sends the data
         if (oldq != null) {
             addToSendQueue(oldq, b);
         } else {
             addToSendQueue(bq, b);
         }
         // establish the socket connection: if a SendWorker for sid already exists this returns immediately, otherwise one is created to send the data
         connectOne(sid);

     }
 }

Let's now look at connectOne.

connectOne
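
Roughly (3.5 branch, reconstructed; the address lookup is elided):

synchronized void connectOne(long sid) {
    if (senderWorkerMap.get(sid) != null) {
        // a SendWorker for this node already exists, nothing to do
        LOG.debug("There is a connection already for server " + sid);
        return;
    }
    synchronized (self.QV_LOCK) {
        // look up sid's election address in the current (and last-seen)
        // view and open a socket to it; this ends up in startConnection,
        // shown below
        ....
    }
}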

startConnection

    private boolean startConnection(Socket sock, Long sid)
            throws IOException {
		....
        // sending the handshake data over the socket is elided

       ......
        
       // this should look familiar: it mirrors the logic in handleConnection, creating the two threads SendWorker and RecvWorker to send and receive data
            
        if (sid > self.getId()) {
            LOG.info("Have smaller server identifier, so dropping the connection: (myId:{} --> sid:{})", self.getId(), sid);
            closeSocket(sock);
            // Otherwise proceed with the connection
        } else {
            LOG.debug("Have larger server identifier, so keeping the connection: (myId:{} --> sid:{})", self.getId(), sid);
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);

            SendWorker vsw = senderWorkerMap.get(sid);

            if(vsw != null)
                vsw.finish();

            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(
                    SEND_CAPACITY));

            sw.start();
            rw.start();

            return true;

        }
        return false;
    }

wrThread.start

WorkerReceiver.run

public void run() {

                Message response;
                while (!stop) {
                    // Sleeps on receive
                    try {
                        response = manager.pollRecvQueue(3000, TimeUnit.MILLISECONDS);
 
						... part of the code elided

                            if(self.getPeerState() == QuorumPeer.ServerState.LOOKING){
                                // if this node is still in the LOOKING state, put the ballot on the recvqueue so the lookForLeader loop can process it
                                recvqueue.offer(n);

                                if((ackstate == QuorumPeer.ServerState.LOOKING)
                                        && (n.electionEpoch < logicalclock.get())){
                                    Vote v = getVote();
                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(ToSend.mType.notification,
                                            v.getId(),
                                            v.getZxid(),
                                            logicalclock.get(),
                                            self.getPeerState(),
                                            response.sid,
                                            v.getPeerEpoch(),
                                            qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            } else {
								// if this node is no longer in the election state, a leader has already
                                // been chosen; wrap the leader info this node holds, put it on sendqueue,
                                // and let WorkerSender deliver it to the requesting node. This usually
                                // happens when a new node joins the cluster.
                                Vote current = self.getCurrentVote();
                                if(ackstate == QuorumPeer.ServerState.LOOKING){
                                    if(LOG.isDebugEnabled()){
                                        LOG.debug("Sending new notification. My id ={} recipient={} zxid=0x{} leader={} config version = {}",
                                                self.getId(),
                                                response.sid,
                                                Long.toHexString(current.getZxid()),
                                                current.getId(),
                                                Long.toHexString(self.getQuorumVerifier().getVersion()));
                                    }

                                    QuorumVerifier qv = self.getQuorumVerifier();
                                    ToSend notmsg = new ToSend(
                                            ToSend.mType.notification,
                                            current.getId(),
                                            current.getZxid(),
                                            current.getElectionEpoch(),
                                            self.getPeerState(),
                                            response.sid,
                                            current.getPeerEpoch(),
                                            qv.toString().getBytes());
                                    sendqueue.offer(notmsg);
                                }
                            }
                        }
                    } catch (InterruptedException e) {
                        LOG.warn("Interrupted Exception while waiting for new message" +
                                e.toString());
                    }
                }
                LOG.info("WorkerReceiver is down");
            }

super.start()->QuorumPeer#run

super.start

public void run() {
    ... code elided
    try {
        while (running) {
            // branch on the server's state; since we care about the election, we follow the LOOKING branch
            switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");
					// branch on the readonlymode.enabled property configured via zkServer.sh
                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");

                        final ReadOnlyZooKeeperServer roZk =
                            new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);

                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            // this is the core logic: run the election and record the final vote
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }                        
                    }
                    break;
            }
            start_fle = Time.currentElapsedTime();
        }
    } finally {
		....
    }
}

makeLEStrategy: initializes the election strategy; see makeLEStrategy.

lookForLeader: starts the election itself; see lookForLeader.

makeLEStrategy

QuorumPeer.makeLEStrategy

This method returns the leader election object; for the default election type 3 it simply hands back the FastLeaderElection instance created earlier.
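
Roughly (3.5 branch, reconstructed):

protected Election makeLEStrategy() {
    LOG.debug("Initializing leader election protocol...");
    if (getElectionType() == 0) {
        // legacy UDP-based LeaderElection, only used for electionAlg=0
        electionAlg = new LeaderElection(this);
    }
    // for the default electionAlg=3 this just returns the FastLeaderElection
    // instance created earlier in createElectionAlgorithm
    return electionAlg;
}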

lookForLeader

FastLeaderElection.lookForLeader

The election object here is a FastLeaderElection because the default election type is 3, and that is the object created for it; see createElectionAlgorithm.

public Vote lookForLeader() throws InterruptedException {
		... code elided
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
            int notTimeout = finalizeWait;
            synchronized(this){
                // bump the election epoch (the logical clock) by one
                logicalclock.incrementAndGet();
                // initialize this node's own ballot (vote for ourselves)
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            // queue our ballot for broadcast to the other voting nodes
            sendNotifications();


            while ((self.getPeerState() == ServerState.LOOKING) &&(!stop)){

                Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

                
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                     // no ballots have arrived yet: establish socket connections to all the voting nodes
                        manager.connectAll();
                    }

                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                } 
                else if (validVoter(n.sid) && validVoter(n.leader)) {

                    switch (n.state) {
                    case LOOKING:
                        // the incoming ballot's epoch is greater than ours: we are a newly added node, or we rejoined after a network partition, so adopt the newer election epoch
                        if (n.electionEpoch > logicalclock.get()) {
                            // update the election epoch
                            logicalclock.set(n.electionEpoch);
                            // clear the previously received ballots
                            recvset.clear();
                            // compare the incoming ballot with our own and keep the winner as our proposal
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            // queue our ballot so the sender threads broadcast it to the participating nodes
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            // the sender's epoch is behind ours (it is likely a newly joined node), so simply discard this ballot
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            // normal case: both nodes are voting in the same round; after the
                            // comparison, update our proposal and broadcast the winning ballot
                            // to the other participating nodes
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        // record the received ballot in recvset
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        // the majority (quorum) check: has more than half agreed on the proposed leader?
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // a quorum has formed; keep draining recvqueue for a short window in
                            // case a better ballot is still in flight. If one arrives that beats
                            // the proposed leader, push it back and resume the election loop.
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

							// if the elected leader's sid equals our own, this node becomes the
                            // leader; otherwise it becomes a follower (or observer)
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.version, n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, n.electionEpoch, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify that
                         * a majority are following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version, n.leader, 
                                n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (termPredicate(outofelection, new Vote(n.version, n.leader,
                                n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, 
                                    n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecoginized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
		...
        }
}
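
Two helpers used above are worth a quick look. sendNotifications queues one copy of the current proposal for every voting member; roughly (3.5 branch, reconstructed):

private void sendNotifications() {
    for (long sid : self.getCurrentAndNextConfigVoters()) {
        QuorumVerifier qv = self.getQuorumVerifier();
        ToSend notmsg = new ToSend(ToSend.mType.notification,
                proposedLeader,
                proposedZxid,
                logicalclock.get(),
                QuorumPeer.ServerState.LOOKING,
                sid,
                proposedEpoch,
                qv.toString().getBytes());
        // WorkerSender drains sendqueue and pushes each ballot into
        // the per-sid queue inside QuorumCnxManager
        sendqueue.offer(notmsg);
    }
}

And termPredicate is the majority check: collect the sids whose ballots match the proposal and ask the QuorumVerifier whether they form a quorum (more than half, with the default majority verifier). In the 3.4 line it looked roughly like this; 3.5 wraps the same idea in a SyncedLearnerTracker:

protected boolean termPredicate(HashMap<Long, Vote> votes, Vote vote) {
    HashSet<Long> set = new HashSet<Long>();
    // collect everyone whose ballot equals the proposed one
    for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())) {
            set.add(entry.getKey());
        }
    }
    // true once the agreeing voters form a quorum
    return self.getQuorumVerifier().containsQuorum(set);
}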