kafka-mirror不稳定问题分析与解决方法
前段时间,线上环境的kafka多集群在采用mirror组件进行跨机房数据同步时,会偶尔出现hang住不稳定的情况:
1. 现象
a. 线上出现返回包序号不一致的现象:"Correlationid for response (13502150) does not match request"而程序hang住,cpu飙高,同步服务停止工作
b. 发生平均频率:线上分3组group×2共6个实例,平均每天2个实例发生
c. 类似线上问题请参考
kafka-mirrormaker问题(https://issues.apache.org/jira/browse/KAFKA-1257)和kafka-producer问题(https://issues.apache.org/jira/browse/KAFKA-4669)
2. 原因
a. kafka网络协议背景
kafka网络协议设计保证连接的请求和响应均是有序的,即对于每个单独的tcp连接下,可以保证发送请求和接收响应包序列均是有序的,同时每个发送请求包和响应包均有唯一递增id关联编号进行关联:“The server guarantees that on a single TCP connection, requests will be processed in the order they are sent and responses will return in that order as well.”出自kafka-network官网介绍;
b. mirrormaker同步判断成功与否逻辑
mirrormaker同步给目标kafka集群的每个数据request包均会搁在本地内存池里,直到收到相同CorrelationId的响应包,然后做两种判断: a. 发送成功,则销毁内存池中的数据请求包,b. 发送失败则数据包放回队列重新进行发送;
c. mirrormaker同步判断线上bug原因
而在判断函数handleCompletedReceives中: 由于条件a,默认认为每个发送请求包和响应包id号是一致的,而并未处理两者id号不一致的异常情况。所以一旦出现id编号不一致异常,则异常一直向上抛,而导致当前"发送请求包"并未得到任何响应处理,同时不会做内存释放最终导致泄露;
d. 目前确定0.8.×、0.9.×版本均会存在线上同样问题
3. 修复方案
修改mirror-maker中kafka-client-0.8.2的源码: 增加出现了错乱包的异常捕获逻辑:把错乱时的数据请求包扔回内存队列进行重发。处理修改源码如下:
/**
* Handle any completed receives and update the response list with the responses received.
* @param responses The list of responses to update
* @param now The current time
*/
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
for (NetworkReceive receive : this.selector.completedReceives()) {
int source = receive.source();
ResponseHeader header = ResponseHeader.parse(receive.payload());
int compared = 0;
ClientRequest req = null;
short apiKey = -1;
do{
req = inFlightRequests.fristSent(source);
if(req == null){
break;
}
apiKey = req.request().header().apiKey();
compared = compareCorrelationId(req.request().header(), header, source);
if (compared < 0 && apiKey != ApiKeys.METADATA.id) {
responses.add(new ClientResponse(req, now, true, null));
}
if (compared < 0 || compared == 0){
req = inFlightRequests.completeNext(source);
}
}while(compared < 0);
if(req == null || compared > 0){
log.error("never go to this line");
continue;
}
Struct body = (Struct) ProtoUtils.currentResponseSchema(apiKey).read(receive.payload());
if (apiKey == ApiKeys.METADATA.id) {
handleMetadataResponse(req.request().header(), body, now);
} else {
// need to add body/header to response here
responses.add(new ClientResponse(req, now, false, body));
}
}
}