Analysis and fix for kafka-mirror instability

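     Some time ago, our production multi-cluster Kafka deployment used the mirror component for cross-datacenter data synchronization, and the mirror instances would occasionally hang: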
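1. Symptoms
   a. In production, the log reports a mismatched response sequence number, "Correlationid for response (13502150) does not match request", after which the process hangs, CPU usage spikes, and the sync service stops working.
   b. Average frequency: production runs 3 groups × 2, 6 instances in total; on average 2 instances hit the problem per day.
   c. For similar reports, see
      the kafka-mirrormaker issue (https://issues.apache.org/jira/browse/KAFKA-1257) and the kafka-producer issue (https://issues.apache.org/jira/browse/KAFKA-4669)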

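2. Cause
   a. Background: the Kafka network protocol
     The Kafka network protocol guarantees that requests and responses on a connection are ordered: on each individual TCP connection, requests are processed in the order they were sent and responses come back in that same order. Each request and its response are also tied together by a unique, increasing correlation id: "The server guarantees that on a single TCP connection, requests will be processed in the order they are sent and responses will return in that order as well." (from the network section of the official Kafka protocol documentation);
   b. How mirrormaker decides whether a send succeeded
     Every data request that mirrormaker sends to the target Kafka cluster is held in a local in-memory pool until the response with the same CorrelationId arrives. One of two things then happens: (1) the send succeeded, so the request is released from the pool; (2) the send failed, so the data is put back on the queue and sent again;
   c. Root cause of the production bug
     The check in handleCompletedReceives relies on the guarantee in (a) and assumes that the response id always matches the request id; the case where the two differ is not handled. Once the ids differ, the exception keeps propagating upward, so the current request is never completed (neither as a success nor as a failure) and its memory is never released, which eventually shows up as a leak;
   d. So far we have confirmed that both the 0.8.x and 0.9.x versions have this problem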
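
The failure mode in (c) can be reproduced in a few lines. The following is only a simplified sketch, not the real kafka-clients code: StrictInFlightDemo, heldBuffers and the method names are made up for illustration, and correlation ids are modelled as plain ints. It mimics the unpatched order of operations: pop the oldest in-flight request first, then insist that the ids match.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Simplified model of one connection; hypothetical names, not the kafka-clients classes. */
class StrictInFlightDemo {
    // correlation ids of requests awaiting a response, oldest first
    private final Deque<Integer> inFlight = new ArrayDeque<>();
    // requests whose buffers are still held in the local memory pool
    private final Set<Integer> heldBuffers = new HashSet<>();

    void send(int correlationId) {
        inFlight.addLast(correlationId);
        heldBuffers.add(correlationId);
    }

    /** Unpatched behaviour: pop the oldest request, then require matching ids. */
    void receive(int responseCorrelationId) {
        int requestId = inFlight.removeFirst();
        if (requestId != responseCorrelationId) {
            // the popped request is neither completed nor retried after this point
            throw new IllegalStateException("Correlation id for response ("
                    + responseCorrelationId + ") does not match request (" + requestId + ")");
        }
        heldBuffers.remove(requestId); // normal path: release the buffer
    }

    int inFlightCount() { return inFlight.size(); }

    int heldBufferCount() { return heldBuffers.size(); }
}
```

If requests 1, 2, 3 are sent and the responses arrive as 1, 3, the second receive throws after request 2 has already left the in-flight queue. Its buffer is still held but nothing will ever complete or resend it, which is the leak described above.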

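3. Fix
     Modify the kafka-client-0.8.2 source used by mirror-maker: add handling for out-of-order responses, so that the data request affected by the mismatch is put back on the in-memory queue and resent. The modified source is as follows: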

/**
 * Handle any completed receives and update the response list with the responses received.
 * Note: fristSent and compareCorrelationId are not in the stock 0.8.2 client; they are
 * assumed to be helpers added alongside this change.
 * @param responses The list of responses to update
 * @param now The current time
 */
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
    for (NetworkReceive receive : this.selector.completedReceives()) {
        int source = receive.source();
        ResponseHeader header = ResponseHeader.parse(receive.payload());
        int compared = 0;
        ClientRequest req = null;
        short apiKey = -1;
        do {
            // look at the oldest in-flight request for this connection
            req = inFlightRequests.fristSent(source);
            if (req == null) {
                break;
            }
            apiKey = req.request().header().apiKey();
            compared = compareCorrelationId(req.request().header(), header, source);
            if (compared < 0 && apiKey != ApiKeys.METADATA.id) {
                // request is older than this response: report it as disconnected so it is resent
                responses.add(new ClientResponse(req, now, true, null));
            }
            if (compared <= 0) {
                // remove the skipped (< 0) or matched (== 0) request from the in-flight queue
                req = inFlightRequests.completeNext(source);
            }
        } while (compared < 0);
        if (req == null || compared > 0) {
            // no in-flight request corresponds to this response, so skip the response
            log.error("never go to this line");
            continue;
        }
        Struct body = (Struct) ProtoUtils.currentResponseSchema(apiKey).read(receive.payload());
        if (apiKey == ApiKeys.METADATA.id) {
            handleMetadataResponse(req.request().header(), body, now);
        } else {
            // need to add body/header to response here
            responses.add(new ClientResponse(req, now, false, body));
        }
    }
}
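
The listing above does not show compareCorrelationId or the peek helper. As a rough, self-contained sketch of the same idea (all names here are hypothetical, correlation ids are plain ints, and id wrap-around is ignored), the drain loop can be modelled like this: requests older than the incoming response are queued for retry, the matching request is completed, and a response older than everything in flight is ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Simplified model of the tolerant matching loop; not the patched kafka-clients code. */
class TolerantInFlightDemo {
    private final Deque<Integer> inFlight = new ArrayDeque<>(); // oldest first
    final List<Integer> retried = new ArrayList<>();   // requests handed back for resend
    final List<Integer> completed = new ArrayList<>(); // requests matched with a response

    void send(int correlationId) {
        inFlight.addLast(correlationId);
    }

    /** Assumed contract: < 0 request older than response, 0 match, > 0 request newer. */
    static int compare(int requestId, int responseId) {
        return Integer.compare(requestId, responseId);
    }

    void receive(int responseId) {
        while (!inFlight.isEmpty()) {
            int cmp = compare(inFlight.peekFirst(), responseId);
            if (cmp > 0) {
                return; // stale response: leave the queue untouched
            }
            int requestId = inFlight.removeFirst();
            if (cmp == 0) {
                completed.add(requestId);
                return;
            }
            retried.add(requestId); // its response was skipped, so resend it
        }
    }

    int inFlightCount() { return inFlight.size(); }
}
```

With requests 1 to 4 in flight and responses arriving as 1, 3, 2, request 2 ends up in retried, requests 1 and 3 in completed, and the late response 2 is dropped while request 4 stays in flight. The sketch leaves out the METADATA special case from the real listing, where metadata requests are removed from the queue but not reported for retry.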
posted @ 2018-02-25 23:41  gisorange