The core leader-election code lives in the org.apache.zookeeper.server.quorum.FastLeaderElection#lookForLeader method.
The core of the election logic is as follows:
public Vote lookForLeader() throws InterruptedException {
    // Unrelated code omitted
    self.start_fle = Time.currentElapsedTime();
    try {
        // Valid votes received in this round; used to check whether a majority backs the same member as leader
        Map<Long, Vote> recvset = new HashMap<>();
        // Votes reported by peers that are already FOLLOWING or LEADING; lets a member that joins late find the existing leader quickly
        Map<Long, Vote> outofelection = new HashMap<>();
        int notTimeout = minNotificationInterval;
        synchronized (this) {
            // Start a new election round by incrementing logicalclock before any vote is broadcast
            logicalclock.incrementAndGet();
            // Reset the proposal to this node's own vote
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }
        // Broadcast the proposal to the other nodes
        sendNotifications();
        SyncedLearnerTracker voteSet = null;
        // While this node is LOOKING, keep reading messages from the receive queue
        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            // Take the next message from the queue, waiting at most notTimeout milliseconds
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
            if (n == null) {
                // Handling of the no-message case omitted
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
                // Dispatch on the state of the node that sent the vote
                switch (n.state) {
                case LOOKING:
                    // zxid validation omitted
                    if (n.electionEpoch > logicalclock.get()) {
                        // The vote belongs to a newer round, so this node is behind: adopt the newer logicalclock, clear the vote pool, and rebroadcast its vote
                        // Catch up to the sender's election round
                        logicalclock.set(n.electionEpoch);
                        // Clear the vote pool
                        recvset.clear();
                        // Compare the notification's vote with this node's own initial vote (epoch, then zxid, then sid) and adopt the winner
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        // Rebroadcast the updated vote
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        // The vote belongs to an older round than logicalclock: ignore it
                        LOG.debug(
                            "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x{}, logicalclock=0x{}",
                            Long.toHexString(n.electionEpoch),
                            Long.toHexString(logicalclock.get()));
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        // Same round as logicalclock: if the received vote beats the current proposal, adopt it and broadcast
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        // Rebroadcast the vote
                        sendNotifications();
                    }
                    // Record the vote
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                    // Build the vote tracker to check whether the current proposal has a majority, which would end this round
                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
                    // The proposal has a majority
                    if (voteSet.hasAllQuorums()) {
                        // Drain the notifications that are still arriving. If one of them beats the current proposal, put it back on the queue and abandon this check (the majority does not count, so the node state is not updated).
                        // If none of them beats it, change the node state.
                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }
                        // No further notification arrived within finalizeWait, so the node state can be updated
                        if (n == null) {
                            // Change the node state: if proposedLeader is this node, mark it as LEADING
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                case FOLLOWING:
                    /*
                     * To avoid duplicate codes
                     * */
                    Vote resultFN = receivedFollowingNotification(recvset, outofelection, voteSet, n);
                    if (resultFN == null) {
                        break;
                    } else {
                        return resultFN;
                    }
                case LEADING:
                    Vote resultLN = receivedLeadingNotification(recvset, outofelection, voteSet, n);
                    if (resultLN == null) {
                        break;
                    } else {
                        return resultLN;
                    }
                default:
                    LOG.warn("Notification state unrecognized: {} (n.state), {}(n.sid)", n.state, n.sid);
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        // Some code omitted
    }
}
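To make the LOOKING branch easier to follow, here is a minimal standalone sketch of how one peer might process notifications. It is not ZooKeeper code: the class LookingRoundSketch, its fields, and the methods beats, propose and onNotification are hypothetical names made up for illustration. It also makes several simplifying assumptions: a plain majority quorum, a tally that compares only the leader id (the real tracker compares whole Vote objects), the peer's own vote recorded directly instead of arriving through the queue, and no finalizeWait drain.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of one peer handling LOOKING notifications; NOT ZooKeeper code.
final class LookingRoundSketch {
    final long myId, myZxid, myEpoch;  // this peer's own initial vote
    final int ensembleSize;            // number of voting members (assumes a plain majority quorum)
    long logicalClock = 1;             // election round, as if already incremented once
    long leader, zxid, epoch;          // current proposal
    final Map<Long, Long> recvset = new HashMap<>(); // sender sid -> leader id it voted for

    LookingRoundSketch(long myId, long myZxid, long myEpoch, int ensembleSize) {
        this.myId = myId;
        this.myZxid = myZxid;
        this.myEpoch = myEpoch;
        this.ensembleSize = ensembleSize;
        propose(myId, myZxid, myEpoch);
    }

    // Same ordering as the article's comparison: epoch first, then zxid, then server id.
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    private void propose(long l, long z, long e) {
        leader = l;
        zxid = z;
        epoch = e;
        recvset.put(myId, l); // assumption: stands in for the vote a peer sends to itself
    }

    /** Processes one notification; returns true once a majority backs the current proposal. */
    boolean onNotification(long sid, long nLeader, long nZxid, long nPeerEpoch, long nElectionEpoch) {
        if (nElectionEpoch > logicalClock) {
            // Sender is in a newer round: catch up, drop old votes, compare against our own initial vote.
            logicalClock = nElectionEpoch;
            recvset.clear();
            if (beats(nLeader, nZxid, nPeerEpoch, myId, myZxid, myEpoch)) {
                propose(nLeader, nZxid, nPeerEpoch);
            } else {
                propose(myId, myZxid, myEpoch);
            }
        } else if (nElectionEpoch < logicalClock) {
            return false; // stale round: ignore the vote entirely
        } else if (beats(nLeader, nZxid, nPeerEpoch, leader, zxid, epoch)) {
            propose(nLeader, nZxid, nPeerEpoch);
        }
        recvset.put(sid, nLeader);
        long supporters = recvset.values().stream().filter(v -> v == leader).count();
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        LookingRoundSketch peer = new LookingRoundSketch(1, 0x10, 1, 3);
        System.out.println(peer.onNotification(2, 2, 0x10, 1, 0)); // false: older round, ignored
        System.out.println(peer.onNotification(2, 2, 0x05, 1, 1)); // false: vote recorded, no majority
        System.out.println(peer.onNotification(3, 3, 0x20, 1, 1)); // true: peers 1 and 3 both back 3
        System.out.println(peer.leader);                           // 3
    }
}
```

In the demo, peer 1 ignores a vote from an older round, records a losing vote from peer 2, then switches to peer 3 because peer 3 has the larger zxid. At that point two of the three members back peer 3, so the sketch reports a majority.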
The source of the totalOrderPredicate method is as follows:
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }
    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */
    return ((newEpoch > curEpoch)
            || ((newEpoch == curEpoch)
                && ((newZxid > curZxid)
                    || ((newZxid == curZxid)
                        && (newId > curId)))));
}
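The core logic compares the epoch first, then the zxid, and finally the server id. The method decides whether the received vote beats the vote it is compared against: if none of the three conditions in the code holds, the received vote loses and the current proposal is kept. A candidate whose quorum weight is 0 is rejected before any comparison is made.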
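The ordering is easiest to see with concrete values. The following is a minimal standalone sketch, assuming the weight check is left out; VoteOrderDemo and beats are hypothetical names, not part of ZooKeeper. It rewrites the nested boolean expression as early returns, which is logically equivalent.

```java
// Standalone illustration of the (epoch, zxid, server id) ordering; NOT ZooKeeper code.
final class VoteOrderDemo {
    /** Returns true when the new vote ranks strictly higher than the current one. */
    static boolean beats(long newId, long newZxid, long newEpoch,
                         long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;   // 1. the higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;     // 2. same epoch: the higher zxid wins
        }
        return newId > curId;             // 3. same epoch and zxid: the higher server id wins
    }

    public static void main(String[] args) {
        // A higher epoch wins even though zxid and id are smaller.
        System.out.println(beats(1, 0x10, 3, 5, 0x99, 2)); // true
        // Same epoch: the larger zxid wins even though the id is smaller.
        System.out.println(beats(1, 0x20, 2, 5, 0x10, 2)); // true
        // Same epoch and zxid: the smaller id loses.
        System.out.println(beats(3, 0x10, 2, 5, 0x10, 2)); // false
        // An identical vote does not beat itself, so the current proposal is kept.
        System.out.println(beats(5, 0x10, 2, 5, 0x10, 2)); // false
    }
}
```

The comparison is strict, so a vote never replaces an identical one. This is why a peer that receives its own proposal back does not rebroadcast it.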