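Queue Selection and Fault-Tolerance Strategy for Sending Messages in RocketMQ
A topic has multiple queues spread across different brokers. When a producer sends a message, it has to pick one of those queues.
[Figure: overall sequence diagram of a producer sending a message]
Conclusions on queue selection and fault tolerance: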
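- Without fault tolerance enabled, queues are selected round-robin; if a send fails, the retry skips the queues of the broker that failed
- With fault tolerance enabled, RocketMQ's prediction mechanism estimates whether a broker is available
- If the broker that failed last time is considered available, one of its queues is still chosen
- If no such queue is found, a broker is picked from the recorded fault items (pickOneAtLeast) and a queue on it is used
- Each send records the elapsed time and whether an error occurred; the time at which the broker becomes available again is predicted from these values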
The send path picks a queue like this:

    String lastBrokerName = null == mq ? null : mq.getBrokerName();
    MessageQueue tmpmq = this.selectOneMessageQueue(lastBrokerName);
    if (tmpmq != null) {
        mq = tmpmq;
        //....

As shown above, when a send fails, lastBrokerName is non-null on the retry, and execution enters the selectOneMessageQueue method:

    public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
        if (this.sendLatencyFaultEnable) {
            try {
                int index = tpInfo.getSendWhichQueue().getAndIncrement();
                for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
                    int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
                    if (pos < 0)
                        pos = 0;
                    MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
                    if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
                        if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
                            return mq;
                    }
                }
                final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
                int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
                if (writeQueueNums > 0) {
                    final MessageQueue mq = tpInfo.selectOneMessageQueue();
                    if (notBestBroker != null) {
                        mq.setBrokerName(notBestBroker);
                        mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
                    }
                    return mq;
                } else {
                    latencyFaultTolerance.remove(notBestBroker);
                }
            } catch (Exception e) {
            }
            return tpInfo.selectOneMessageQueue();
        }
        return tpInfo.selectOneMessageQueue(lastBrokerName);
    }

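The method first checks whether sendLatencyFaultEnable is true and takes a different path accordingly; the default is false. When it is false, tpInfo.selectOneMessageQueue(lastBrokerName) is called: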

    public MessageQueue selectOneMessageQueue(final String lastBrokerName) {
        // null means this is the first send, not a retry after an error:
        // just pick the next queue in round-robin order
        if (lastBrokerName == null) {
            return selectOneMessageQueue();
        } else {
            // same as selectOneMessageQueue(), but skips queues on lastBrokerName
            int index = this.sendWhichQueue.getAndIncrement();
            for (int i = 0; i < this.messageQueueList.size(); i++) {
                int pos = Math.abs(index++) % this.messageQueueList.size();
                if (pos < 0)
                    pos = 0;
                MessageQueue mq = this.messageQueueList.get(pos);
                if (!mq.getBrokerName().equals(lastBrokerName)) {
                    return mq;
                }
            }
            return selectOneMessageQueue();
        }
    }
    public MessageQueue selectOneMessageQueue() {
        int index = this.sendWhichQueue.getAndIncrement();
        int pos = Math.abs(index) % this.messageQueueList.size();
        if (pos < 0)
            pos = 0;
        return this.messageQueueList.get(pos);
    }

Both variants are round-robin; the only difference is that one skips the queues of the failed lastBrokerName and the other does not.
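The two round-robin variants can be tried out in isolation with the minimal sketch below. It is not the RocketMQ code: `RoundRobinSketch` and `Mq` are hypothetical stand-ins for TopicPublishInfo and MessageQueue, and `Math.floorMod` replaces the `Math.abs(...)` plus `pos < 0` guard as a simplification.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for TopicPublishInfo, reduced to the selection logic.
class RoundRobinSketch {
    // Hypothetical stand-in for MessageQueue: a broker name plus a queue id.
    static final class Mq {
        final String broker;
        final int id;
        Mq(String broker, int id) { this.broker = broker; this.id = id; }
        @Override public String toString() { return broker + ":" + id; }
    }

    private final List<Mq> queues;
    private final AtomicInteger sendWhichQueue = new AtomicInteger(0);

    RoundRobinSketch(List<Mq> queues) { this.queues = queues; }

    // Plain round-robin: advance the shared counter and wrap around the list.
    Mq selectOne() {
        int pos = Math.floorMod(sendWhichQueue.getAndIncrement(), queues.size());
        return queues.get(pos);
    }

    // Round-robin that skips every queue on the broker that just failed.
    Mq selectOne(String lastBrokerName) {
        if (lastBrokerName == null) {
            return selectOne(); // first attempt, nothing to skip
        }
        int index = sendWhichQueue.getAndIncrement();
        for (int i = 0; i < queues.size(); i++) {
            Mq mq = queues.get(Math.floorMod(index++, queues.size()));
            if (!mq.broker.equals(lastBrokerName)) {
                return mq;
            }
        }
        return selectOne(); // every queue is on the failed broker, so reuse it
    }
}
```

With queues a:0, a:1, b:0, b:1 and a fresh counter, `selectOne()` returns a:0, and a following `selectOne("a")` skips a:1 and returns b:0.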
With sendLatencyFaultEnable enabled, the method has two parts.
Part 1:

    int index = tpInfo.getSendWhichQueue().getAndIncrement();
    for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
        int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
        if (pos < 0)
            pos = 0;
        MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
        // Check whether the broker is available; if no queue qualifies,
        // the loop ends and part 2 takes over
        if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
            // First attempt (not a retry): return the queue directly
            // Retry: return it only if it is on the broker used last time
            if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
                return mq;
        }
    }

Part 2:

    // Pick a broker from the recorded fault items
    final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
    int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
    if (writeQueueNums > 0) { // the broker has writable queues
        // Take the next queue in round-robin order
        final MessageQueue mq = tpInfo.selectOneMessageQueue();
        if (notBestBroker != null) {
            // Point the selected queue at the picked broker
            mq.setBrokerName(notBestBroker);
            // Reassign the queue id within that broker's range
            mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
        }
        return mq;
    } else {
        latencyFaultTolerance.remove(notBestBroker);
    }

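Part 1 looks for a queue whose broker is available and, on a retry, whose brokerName equals lastBrokerName. This raises a question: lastBrokerName is non-null only after a failure, so why would the code still pick an available queue on that same broker? My guess is that the queue used last time on this broker failed, but the next queue may succeed, so as long as the latency fault-tolerance mechanism still considers the broker available, another of its queues is chosen.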
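If part 1 finds no matching queue on a retry, the explanation is:
- the latency fault-tolerance mechanism considers the lastBrokerName broker unavailable

(On a first send, where lastBrokerName is null, part 1 only fails when every broker is considered unavailable.)
Execution then enters part 2: pickOneAtLeast is called to get a broker, then selectOneMessageQueue to get a queue. If the broker returned by pickOneAtLeast is not null, the queue's broker name and queue id are overwritten with values for that broker.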
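The condition in part 1 is easy to misread, so here is a minimal sketch of just that loop. It makes some assumptions: `FaultAwarePickSketch`, `Mq`, and `pickAvailable` are invented names, availability is passed in as a plain predicate rather than read from latencyFaultTolerance, and an empty result stands for "fall through to part 2".

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

class FaultAwarePickSketch {
    // Hypothetical stand-in for MessageQueue.
    static final class Mq {
        final String broker;
        final int id;
        Mq(String broker, int id) { this.broker = broker; this.id = id; }
        @Override public String toString() { return broker + ":" + id; }
    }

    // Scans the queues once, starting at `start`, and returns the first queue whose
    // broker is available and, on a retry, is the broker used in the last attempt.
    static Optional<Mq> pickAvailable(List<Mq> queues, int start,
                                      Predicate<String> isAvailable,
                                      String lastBrokerName) {
        for (int i = 0; i < queues.size(); i++) {
            Mq mq = queues.get(Math.floorMod(start + i, queues.size()));
            boolean brokerMatches = lastBrokerName == null || mq.broker.equals(lastBrokerName);
            if (isAvailable.test(mq.broker) && brokerMatches) {
                return Optional.of(mq);
            }
        }
        return Optional.empty(); // nothing qualified: the caller would move on to part 2
    }
}
```

On a retry with lastBrokerName set to "a" and broker "a" marked unavailable, the result is empty even when broker "b" is healthy, which is exactly the situation that sends execution into part 2.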
Fault-Tolerance Strategy
How broker availability is determined

    public boolean isAvailable(final String name) {
        final FaultItem faultItem = this.faultItemTable.get(name);
        if (faultItem != null) {
            return faultItem.isAvailable();
        }
        return true;
    }

This breaks down into two parts:
- when entries are put into faultItemTable
- how FaultItem.isAvailable is implemented

The isAvailable implementation

    public boolean isAvailable() {
        return (System.currentTimeMillis() - startTimestamp) >= 0;
    }

It checks whether the current time has reached startTimestamp. Why is comparing a single timestamp enough to tell whether a broker is available?
faultItemTable
Searching for the places where faultItemTable is used leads to the updateFaultItem method:

    public void updateFaultItem(final String name /* brokerName */, final long currentLatency,
            final long notAvailableDuration) {
        FaultItem old = this.faultItemTable.get(name);
        if (null == old) {
            final FaultItem faultItem = new FaultItem(name);
            faultItem.setCurrentLatency(currentLatency);
            faultItem.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
            old = this.faultItemTable.putIfAbsent(name, faultItem);
            if (old != null) {
                old.setCurrentLatency(currentLatency);
                old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
            }
        } else {
            old.setCurrentLatency(currentLatency);
            old.setStartTimestamp(System.currentTimeMillis() + notAvailableDuration);
        }
    }

The FaultItem is looked up by brokerName, and its startTimestamp is set to the current time plus notAvailableDuration. To see what notAvailableDuration is, follow the callers of updateFaultItem to the MQFaultStrategy.updateFaultItem(String, long, boolean) method:

    public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
        if (this.sendLatencyFaultEnable) { // latency fault tolerance is enabled
            long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
            this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
        }
    }
    private long computeNotAvailableDuration(final long currentLatency) {
        for (int i = latencyMax.length - 1; i >= 0; i--) {
            if (currentLatency >= latencyMax[i])
                return this.notAvailableDuration[i];
        }
        return 0;
    }

Selected fields of MQFaultStrategy.java:

    public class MQFaultStrategy {
        private final static Logger log = ClientLogger.getLog();
        /**
         * Latency fault tolerance: tracks the send latency of each broker.
         * key: brokerName
         */
        private final LatencyFaultTolerance<String> latencyFaultTolerance = new LatencyFaultToleranceImpl();
        /**
         * Switch for latency fault tolerance when sending messages
         */
        private boolean sendLatencyFaultEnable = false;
        /**
         * Latency thresholds
         */
        private long[] latencyMax = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
        /**
         * Unavailable durations
         */
        private long[] notAvailableDuration = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};
        .....
    }

The notAvailableDuration argument is one element of the notAvailableDuration array. The latencyMax and notAvailableDuration arrays hold the following values, in milliseconds:
| latencyMax (ms) | notAvailableDuration (ms) |
|---|---|
| 50 | 0 |
| 100 | 0 |
| 550 | 30000 |
| 1000 | 60000 |
| 2000 | 120000 |
| 3000 | 180000 |
| 15000 | 600000 |
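That is:
- if currentLatency is below 50, no threshold matches and notAvailableDuration is 0
- if currentLatency is at least 50 and below 100, notAvailableDuration is 0
- if currentLatency is at least 100 and below 550, notAvailableDuration is 0
- if currentLatency is at least 550 and below 1000, notAvailableDuration is 30000
- …and so on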
If isolation is true, computeNotAvailableDuration is called with 30000, which exceeds the top threshold of 15000, so notAvailableDuration is 600000.
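To make the threshold lookup concrete, the sketch below copies the two arrays and the reverse scan from the snippet above into a standalone class. The names `NotAvailableDurationSketch`, `compute`, and `forSend` are invented for illustration and are not part of RocketMQ.

```java
class NotAvailableDurationSketch {
    // Values copied from the MQFaultStrategy fields shown above (milliseconds).
    static final long[] LATENCY_MAX = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
    static final long[] NOT_AVAILABLE_DURATION = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    // Scans from the highest threshold down and returns the duration
    // paired with the first threshold that the latency reaches.
    static long compute(long currentLatency) {
        for (int i = LATENCY_MAX.length - 1; i >= 0; i--) {
            if (currentLatency >= LATENCY_MAX[i]) {
                return NOT_AVAILABLE_DURATION[i];
            }
        }
        return 0; // below the lowest threshold
    }

    // With isolation the measured latency is ignored and 30000 is used instead.
    static long forSend(long currentLatency, boolean isolation) {
        return compute(isolation ? 30000 : currentLatency);
    }
}
```

For example, `compute(120)` is 0, `compute(550)` is 30000, and `forSend(5, true)` is 600000 regardless of the measured 5 ms.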
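Combined with the isAvailable method, the flow is roughly as follows. RocketMQ predicts an available time for each broker (current time + notAvailableDuration), and the broker counts as available only once the current time reaches that point. notAvailableDuration has 7 levels that correspond one-to-one with the latencyMax thresholds, and the currentLatency passed in determines when the broker is predicted to become available again.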
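The "predicted available time" idea can be checked with a small sketch. It assumes a simplified table that stores only startTimestamp per broker, and it takes the current time as an explicit argument (instead of calling System.currentTimeMillis()) so the behavior is deterministic; `AvailabilityWindowSketch` and `markUnavailable` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;

class AvailabilityWindowSketch {
    // brokerName -> startTimestamp, the moment the broker counts as available again
    private final ConcurrentHashMap<String, Long> startTimestamps = new ConcurrentHashMap<>();

    // Same arithmetic as updateFaultItem: startTimestamp = now + notAvailableDuration.
    void markUnavailable(String brokerName, long nowMillis, long notAvailableDuration) {
        startTimestamps.put(brokerName, nowMillis + notAvailableDuration);
    }

    // Same check as isAvailable: unknown brokers are available,
    // known ones are available once the clock reaches startTimestamp.
    boolean isAvailable(String brokerName, long nowMillis) {
        Long start = startTimestamps.get(brokerName);
        return start == null || nowMillis - start >= 0;
    }
}
```

Marking broker "a" at time 1000 with a duration of 30000 makes it unavailable until time 31000, while a duration of 0 leaves the broker available immediately.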
Next, look at where updateFaultItem is called to see what is passed as currentLatency:

    // 1. normal completion: record the elapsed time, no isolation
    try {
        beginTimestampPrev = System.currentTimeMillis();
        sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, topicPublishInfo, timeout);
        endTimestamp = System.currentTimeMillis();
        this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);
    // 2. exception: record the elapsed time with isolation
    } catch (xxException e) {
        endTimestamp = System.currentTimeMillis();
        this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, true);
    }

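currentLatency is the time the send took, and the interval it falls into decides the outcome. Below 550 ms, notAvailableDuration is 0 and the broker stays available; from 550 ms upward, the unavailable period grows with the latency. When an exception is thrown, isolation is true, so the broker only becomes available again after 600000 ms (10 minutes).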
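The measure-then-report pattern around the send call can be sketched as follows. This is an illustration under stated assumptions, not the producer's code: `TimedSendSketch` is an invented class, the send itself is an arbitrary `Callable`, and instead of calling updateFaultItem the sketch just records what would be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

class TimedSendSketch {
    // What would be passed to updateFaultItem, recorded as "brokerName:isolation".
    final List<String> updates = new ArrayList<>();
    long lastLatencyMillis = -1;

    // Runs the send, measures how long it took, and reports the result:
    // isolation=false when the call returns, isolation=true when it throws.
    boolean send(String brokerName, Callable<Void> sendCall) {
        long begin = System.currentTimeMillis();
        try {
            sendCall.call();
            report(brokerName, System.currentTimeMillis() - begin, false);
            return true;
        } catch (Exception e) {
            report(brokerName, System.currentTimeMillis() - begin, true);
            return false;
        }
    }

    private void report(String brokerName, long latencyMillis, boolean isolation) {
        lastLatencyMillis = latencyMillis;
        updates.add(brokerName + ":" + isolation);
    }
}
```

A call that returns normally is reported with isolation false, and a call that throws is reported with isolation true, which is what triggers the long unavailable period described above.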
pickOneAtLeast
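When a broker really is marked as available only after 600000 ms, part 1 of selectOneMessageQueue cannot select it, and execution falls through to part 2, which first calls pickOneAtLeast to get a broker: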

    public String pickOneAtLeast() {
        final Enumeration<FaultItem> elements = this.faultItemTable.elements();
        List<FaultItem> tmpList = new LinkedList<FaultItem>();
        // Copy every element of faultItemTable into the list
        while (elements.hasMoreElements()) {
            final FaultItem faultItem = elements.nextElement();
            tmpList.add(faultItem);
        }
        if (!tmpList.isEmpty()) {
            // Shuffle first, then sort
            Collections.shuffle(tmpList);
            Collections.sort(tmpList);
            final int half = tmpList.size() / 2;
            if (half <= 0) { // only one element
                return tmpList.get(0).getName();
            } else { // take the counter modulo half, i.e. pick from the first half
                final int i = this.whichItemWorst.getAndIncrement() % half;
                return tmpList.get(i).getName();
            }
        }
        return null;
    }

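The effect of "sort, then take the counter modulo half" is that only the better-ranked half of the recorded brokers is ever returned. The sketch below illustrates this under a loud assumption: the ordering used by FaultItem is not shown in this article, so the sketch ranks brokers by recorded latency alone. `PickOneAtLeastSketch` and `pick` are invented names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class PickOneAtLeastSketch {
    private final AtomicInteger whichItemWorst = new AtomicInteger(0);

    // latencies: brokerName -> last recorded latency.
    // Assumption: lower latency ranks first; the real FaultItem ordering may differ.
    String pick(Map<String, Long> latencies) {
        List<Map.Entry<String, Long>> items = new ArrayList<>(latencies.entrySet());
        if (items.isEmpty()) {
            return null;
        }
        Collections.shuffle(items); // randomizes the order of equally ranked items
        items.sort((x, y) -> Long.compare(x.getValue(), y.getValue()));
        int half = items.size() / 2;
        if (half <= 0) {
            return items.get(0).getKey(); // only one broker recorded
        }
        // Rotate through the better-ranked half only.
        int i = Math.floorMod(whichItemWorst.getAndIncrement(), half);
        return items.get(i).getKey();
    }
}
```

With four brokers at latencies 10, 20, 300, and 400, successive calls alternate between the two lowest-latency brokers and never return the other two.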