Kafka: calling poll immediately after consumer.seek may return no records




    records = consumer.poll(Duration.ofSeconds(1));
    // do something with records


public ConsumerRecord<String, String> seekAndPoll(String topic, int partition, long offset) {
    TopicPartition tp = new TopicPartition(topic, partition);
    System.out.println("assignment:" + consumer.assignment()); // the partition is already assigned at this point
    consumer.seek(tp, offset);
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) {
        // very likely nothing was fetched, so this branch is taken
        return null;
    } else {
        return records.iterator().next();
    }
}


I also found this question on Stack Overflow ("java - Kafka Cluster sometimes returns no records during seek and poll"), but it had no answer. After solving the problem, I added an answer there.


Guess 1: the difference between the old and new poll methods

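During testing I noticed that the old poll(long timeout) method sometimes worked where the new poll(Duration timeout) method did not. Could that be the cause? (Debugging showed it is unrelated; skip this section if you are not interested.)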


public ConsumerRecords<K, V> poll(final long timeoutMs) {
    return poll(time.timer(timeoutMs), false);
}

public ConsumerRecords<K, V> poll(final Duration timeout) {
    return poll(time.timer(timeout), true);
}


private ConsumerRecords<K, V> poll(final Timer timer, final boolean includeMetadataInTimeout) {
    // omitted
    if (includeMetadataInTimeout) {
        // try to update assignment metadata BUT do not need to block on the timer for join group
        updateAssignmentMetadataIfNeeded(timer, false);
    } else {
        while (!updateAssignmentMetadataIfNeeded(time.timer(Long.MAX_VALUE), true)) {
            log.warn("Still waiting for metadata");
        }
    }
    // omitted
}


boolean updateAssignmentMetadataIfNeeded(final Timer timer, final boolean waitForJoinGroup) {
    if (coordinator != null && !coordinator.poll(timer, waitForJoinGroup)) {
        return false;
    }

    return updateFetchPositions(timer);
}

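But debugging showed that when the partitions are specified manually with assign, coordinator is null in my setup. That is easy to understand: only subscribe mode involves rebalancing and similar situations that need a coordinator.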



KIP-266: Fix consumer indefinite blocking behavior



The pre-existing variant poll(long timeout) would block indefinitely for metadata updates if they were needed, then it would issue a fetch and poll for timeout ms for new records. The initial indefinite metadata block caused applications to become stuck when the brokers became unavailable. The existence of the timeout parameter made the indefinite block especially unintuitive.

We will add a new method poll(Duration timeout) with the semantics:

  1. iff a metadata update is needed:

    1. send (asynchronous) metadata requests

    2. poll for metadata responses (counts against timeout)

      • if no response within timeout, return an empty collection immediately
  2. if there is fetch data available, return it immediately

  3. if there is no fetch request in flight, send fetch requests

  4. poll for fetch responses (counts against timeout)

    • if no response within timeout, return an empty collection (leaving async fetch request for the next poll)
    • if we get a response, return the response

We will deprecate the original method, poll(long timeout), and we will not change its semantics, so it remains:

  1. iff a metadata update is needed:

    1. send (asynchronous) metadata requests
    2. poll for metadata responses indefinitely until we get it
  2. if there is fetch data available, return it immediately

  3. if there is no fetch request in flight, send fetch requests

  4. poll for fetch responses (counts against timeout)

    • if no response within timeout, return an empty collection (leaving async fetch request for the next poll)
    • if we get a response, return the response

One notable usage is prohibited by the new poll: previously, you could call poll(0) to block for metadata updates, for example to initialize the client, supposedly without fetching records. Note, though, that this behavior is not according to any contract, and there is no guarantee that poll(0) won't return records the first time it's called. Therefore, it has always been unsafe to ignore the response.

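In short, poll(long timeout) may block indefinitely: it waits until the metadata update completes (that wait does not count against timeout) and only then spends timeout fetching, which makes it more likely to return records. poll(Duration timeout) never blocks indefinitely: it returns after at most timeout, whether or not any records were fetched.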
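The difference in how the two variants charge time against timeout can be sketched as a toy model. This is only an illustration of the KIP-266 wording for the case where a metadata update is needed; the class name, method names, and delay values are hypothetical, and it does not reproduce the real client's internals.

```java
import java.time.Duration;

public class PollTimeoutModel {
    /** Old poll(long): the metadata wait is unbounded and is not charged to the timeout. */
    static boolean oldPollReturnsRecords(Duration metadataDelay, Duration fetchDelay, Duration timeout) {
        // Only the fetch phase has to fit within the timeout; metadataDelay is ignored on purpose.
        return fetchDelay.compareTo(timeout) <= 0;
    }

    /** New poll(Duration): the metadata wait and the fetch wait share one timeout budget. */
    static boolean newPollReturnsRecords(Duration metadataDelay, Duration fetchDelay, Duration timeout) {
        return metadataDelay.plus(fetchDelay).compareTo(timeout) <= 0;
    }

    public static void main(String[] args) {
        // Hypothetical delays chosen only to show the two variants diverging.
        Duration metadata = Duration.ofMillis(300);
        Duration fetch = Duration.ofMillis(50);
        Duration timeout = Duration.ofMillis(100);
        System.out.println("old poll returns records: " + oldPollReturnsRecords(metadata, fetch, timeout));
        System.out.println("new poll returns records: " + newPollReturnsRecords(metadata, fetch, timeout));
    }
}
```

With these assumed delays, the old variant would eventually return records (after blocking for the metadata), while the new variant would return an empty collection because the 100 ms budget is used up before the fetch completes.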

Guess 2: timeout decides everything


public ConsumerRecord<String, String> seekAndPoll(String topic, int partition, long offset) {
    TopicPartition tp = new TopicPartition(topic, partition);
    System.out.println("assignment:" + consumer.assignment()); // the partition is already assigned at this point
    // endOffset: the offset of the last successfully replicated message plus one
    // if there are 5 messages, valid offsets are [0,1,2,3,4] and endOffset is 4+1=5
    Long endOffset = consumer.endOffsets(Collections.singleton(tp)).get(tp);
    Long beginOffset = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
    if (offset < beginOffset || offset >= endOffset) {
        System.out.println("offset is illegal");
        return null;
    } else {
        consumer.seek(tp, offset);
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(2000));
        if (records.isEmpty()) {
            return null;
        } else {
            return records.iterator().next();
        }
    }
}
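Instead of one long poll, a larger time budget can also be spent as several short polls, since the KIP text says an unanswered fetch request is left in flight for the next poll. Below is a minimal sketch of that loop, written against a plain function so that it runs without a broker. The names PollUntil, pollUntilNonEmpty, and the fake poller are hypothetical; in real code pollOnce would wrap something like consumer.poll(d). This is not presented as the fix for the problem in this post.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Function;

public class PollUntil {
    /**
     * Calls pollOnce repeatedly until it returns a non-empty batch or the total budget is used up.
     * pollOnce is a stand-in (assumption) for a call such as consumer.poll(slice).
     */
    static <T> List<T> pollUntilNonEmpty(Function<Duration, List<T>> pollOnce,
                                         Duration totalBudget, Duration perPoll) {
        long deadline = System.nanoTime() + totalBudget.toNanos();
        long remaining = totalBudget.toNanos();
        do {
            // Never ask for more time than is left in the overall budget.
            long sliceNanos = Math.max(0, Math.min(remaining, perPoll.toNanos()));
            List<T> batch = pollOnce.apply(Duration.ofNanos(sliceNanos));
            if (!batch.isEmpty()) {
                return batch;
            }
            remaining = deadline - System.nanoTime();
        } while (remaining > 0);
        return List.of(); // budget exhausted without receiving anything
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fake poller (hypothetical): empty for the first two calls, then one record.
        Function<Duration, List<String>> fake =
                d -> ++calls[0] < 3 ? List.<String>of() : List.of("record@offset");
        List<String> batch = pollUntilNonEmpty(fake, Duration.ofSeconds(2), Duration.ofMillis(100));
        System.out.println(batch + " after " + calls[0] + " polls");
    }
}
```

The caller still has to handle the empty result, because the loop only bounds the waiting time and cannot guarantee that a record arrives.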

The truth? consumer.endOffsets()

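I ran a test: on 4 partitions across 2 topics, I executed the code from Guess 2 in a loop of 10,000 iterations while varying timeout, hoping to quantify the relationship between the timeout value and seekAndPoll failures. It turned out that even with a timeout of only 10 ms, poll succeeded at a very high rate, and at timeout=50ms the success rate reached 100%. Before that, it took a timeout of 1000 ms to 2000 ms to reach a success rate that high. After checking repeatedly, I finally found that these two lines made the difference: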

Long beginOffset = consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
Long endOffset = consumer.endOffsets(Collections.singleton(tp)).get(tp); 
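The measurement described above can be organized as a small harness that runs one attempt many times per timeout value and records the success rate. The sketch below uses a stand-in attempt so that it runs without a broker. TimeoutSweep, sweep, and fakeAttempt are hypothetical names; a real run would pass something like a timeout-taking variant of seekAndPoll and check for a non-null result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongPredicate;

public class TimeoutSweep {
    /** Runs attempt(timeoutMs) `iterations` times for each timeout and returns the success rate per timeout. */
    static Map<Long, Double> sweep(long[] timeoutsMs, int iterations, LongPredicate attempt) {
        Map<Long, Double> rates = new LinkedHashMap<>(); // keeps the timeouts in the order given
        for (long timeoutMs : timeoutsMs) {
            int ok = 0;
            for (int i = 0; i < iterations; i++) {
                if (attempt.test(timeoutMs)) {
                    ok++;
                }
            }
            rates.put(timeoutMs, (double) ok / iterations);
        }
        return rates;
    }

    public static void main(String[] args) {
        // Stand-in attempt (hypothetical): succeeds whenever the timeout is at least 50 ms.
        // It does not model Kafka; it only exercises the harness.
        LongPredicate fakeAttempt = t -> t >= 50;
        System.out.println(sweep(new long[] {10, 50, 100}, 1000, fakeAttempt));
    }
}
```

Running the same sweep once with and once without the two offset lookups would make their effect on the success rate directly comparable.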








kafka-test/SeekTest.java at main · whuwangyong/kafka-test (github.com)

posted @ 2022-02-17 20:48  duanguyuan