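Kafka Study Notes: KIP-98, EOS and Transactional Messaging
Preface
I have now reached transactional messaging. This article is long and the English is hard, so I have been translating and taking notes bit by bit, like an ant moving house. The most frustrating part: half a day of translation was never saved because I had not entered a title, and then the web page crashed and the work was lost. I had to start over, so this time I wrote the title first, just in case.
This post is again part translation and part notes on the original text.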
KIP-98 - Exactly Once Delivery and Transactional Messaging
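A KIP (Kafka Improvement Proposal) works on the same principle as a BIP: it is an improvement proposal put to the code maintainers.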
Status
Current state: Adopted
Discussion thread: http://search-hadoop.com/m/Kafka/uyzND1jwZrr7HRHf?subj=+DISCUSS+KIP+98+Exactly+Once+Delivery+and+Transactional+Messaging
JIRA: KAFKA-4815 - Idempotent/transactional Producer (KIP-98) RESOLVED
The status of this proposal is that it has been accepted.
Motivation
This document outlines a proposal for strengthening the message delivery semantics of Kafka. This builds on significant work which has been done previously, specifically, here and here.
Kafka currently provides at least once semantics, viz. When tuned for reliability, users are guaranteed that every message write will be persisted at least once, without data loss. Duplicates may occur in the stream due to producer retries. For instance, the broker may crash between committing a message and sending an acknowledgment to the producer, causing the producer to retry and thus resulting in a duplicate message in the stream.
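Note: Kafka currently provides at-least-once semantics. This applies to writes: every message is guaranteed to be written at least once and will not be lost. Duplicates can only arise when the producer retries. For example, the broker may crash between committing a message and acknowledging it to the producer: the message has been committed (that is, written), but the response never reached the producer, so the producer retries.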
Users of messaging systems greatly benefit from the more stringent idempotent producer semantics, viz. Every message write will be persisted exactly once, without duplicates and without data loss -- even in the event of client retries or broker failures. These stronger semantics not only make writing applications easier, they expand the space of applications which can use a given messaging system.
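Note: users of a messaging system benefit greatly from the stricter idempotent producer semantics. Every message write is persisted exactly once, with no duplicates and no loss, even when clients retry or brokers fail.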
However, idempotent producers don’t provide guarantees for writes across multiple TopicPartitions. For this, one needs stronger transactional guarantees, ie. the ability to write to several TopicPartitions atomically. By atomically, we mean the ability to commit a set of messages across TopicPartitions as a unit: either all messages are committed, or none of them are.
Note: however, an idempotent producer gives no guarantee for writes across multiple TopicPartitions. The idempotent producer itself is covered in a separate article. (Reading that article is my next step.)
Stream processing applications, which are pipelines of ‘consume-transform-produce’ tasks, absolutely require transactional guarantees when duplicate processing of the stream is unacceptable. As such, adding transactional guarantees to Kafka --a streaming platform-- makes it much more useful not just for stream processing, but a variety of other applications.
Note: transactions also matter for stream processing.
In this document we present a proposal for bringing transactions to Kafka. We will only focus on the user facing changes: the client API changes, and the new configurations we will introduce, and the summary of guarantees. We also outline the basic data flow, which summarizes all the new RPCs we will introduce with transactions. The design details are presented in a separate document.
A little bit about transactions and streams
In the previous section, we mentioned the main motivation for transactions is to enable exactly once processing in Kafka Streams. It is worth digging into this use case a little more, as it motivates many of the tradeoffs in our design.
Recall that data transformation using Kafka Streams typically happens through multiple stream processors, each of which is connected by Kafka topics. This setup is known as a stream topology and is basically a DAG where the stream processors are nodes and the connecting Kafka topics are edges. This pattern is typical of all streaming architectures. You can read more about the Kafka streams architecture here.
As such, a transaction for Kafka streams would essentially encompass the input messages, the updates to the local state store, and the output messages. Including input offsets in a transaction motivates adding the ‘sendOffsets’ API to the Producer interface, described below. Further details will be presented in a separate KIP.
Further, stream topologies can get pretty deep --10 stages is not uncommon. If output messages are only materialized on transaction commits, then a topology which is N stages deep will take N x T to process its input, where T is the average time of a single transaction. So Kafka Streams requires speculative execution, where output messages can be read by downstream processors even before they are committed. Otherwise transactions would not be an option for serious streaming applications. This motivates the ‘read uncommitted’ consumer mode described later.
These are two specific instances where we chose to optimize for the streams use case. As the reader works through this document we encourage her to keep this use case in mind as it motivated large elements of the proposal.
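Note: the section above covers how transactions relate to streams. I have not worked much with streams yet, so I am skipping this part for now.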
Public Interfaces
Producer API changes
The producer will get five new methods (initTransactions, beginTransaction, sendOffsets, commitTransaction, abortTransaction), with the send method updated to throw a new exception. This is detailed below:
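Note: the producer will get five new methods. ("Will"? Has this not been implemented yet? The JIRA above is marked RESOLVED, so the future tense simply dates from when the proposal was written.)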
KafkaProducer.java
[Code listing not preserved in this copy.]
The OutOfOrderSequence Exception
The Producer will raise an OutOfOrderSequenceException if the broker detects data loss, in other words, if it receives a sequence number which is greater than the sequence it expected. This exception will be returned in the Future and passed to the Callback, if any. This is a fatal exception, and future invocations of Producer methods like send, beginTransaction, commitTransaction, etc. will raise an IllegalStateException.
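Note: if the broker detects data loss, the producer raises an OutOfOrderSequenceException. But isn't the sequence number generated by the producer itself, so how can this error occur? The sequence has to be contiguous; a gap means some messages were lost.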
An Example Application
Here is a simple application which demonstrates the use of the APIs introduced above.
KafkaTransactionsExample.java
[Code listing not preserved in this copy.]
Note: the code above receives messages and then sends them out again (what is the point of doing that?), and then also sends the offsets (to whom?).
New Configurations
Broker configs
transactional.id.timeout.ms
The maximum amount of time in ms that the transaction coordinator will wait before proactively expiring a producer TransactionalId without receiving any transaction status updates from it. Default is 604800000 (7 days). This allows periodic weekly producer jobs to maintain their ids.
max.transaction.timeout.ms
The maximum allowed timeout for transactions. If a client’s requested transaction time exceeds this, then the broker will return an InvalidTransactionTimeout error in InitPidRequest. This prevents a client from setting too large a timeout, which can stall consumers reading from topics included in the transaction. Default is 900000 (15 min). This is a conservative upper bound on the period of time a transaction of messages will need to be sent.
transaction.state.log.replication.factor
The number of replicas for the transaction state topic. Default: 3
transaction.state.log.num.partitions
The number of partitions for the transaction state topic. Default: 50
transaction.state.log.min.isr
The minimum number of in-sync replicas that each partition of the transaction state topic needs to have to be considered online. Default: 2
transaction.state.log.segment.bytes
The segment size for the transaction state topic. Default: 104857600 bytes.
Producer configs
enable.idempotence
Whether or not idempotence is enabled (false by default). If disabled, the producer will not set the PID field in produce requests and the current producer delivery semantics will be in effect. Note that idempotence must be enabled in order to use transactions. When idempotence is enabled, we enforce that acks=all, retries > 1, and max.in.flight.requests.per.connection=1. Without these values for these configurations, we cannot guarantee idempotence. If these settings are not explicitly overridden by the application, the producer will set acks=all, retries=Integer.MAX_VALUE, and max.in.flight.requests.per.connection=1 when idempotence is enabled.
transaction.timeout.ms
The maximum amount of time in ms that the transaction coordinator will wait for a transaction status update from the producer before proactively aborting the ongoing transaction. This config value will be sent to the transaction coordinator along with the InitPidRequest. If this value is larger than the max.transaction.timeout.ms setting in the broker, the request will fail with an `InvalidTransactionTimeout` error. Default is 60000. This keeps a transaction from blocking downstream consumption for more than a minute, which is generally allowable in real-time apps.
transactional.id
The TransactionalId to use for transactional delivery. This enables reliability semantics which span multiple producer sessions since it allows the client to guarantee that transactions using the same TransactionalId have been completed prior to starting any new transactions. If no TransactionalId is provided, then the producer is limited to idempotent delivery. Note that the default is empty, which means transactions cannot be used.
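As a rough illustration of how the producer settings above fit together, here is a minimal sketch that uses only java.util.Properties, so it runs without the Kafka client on the classpath. The class TransactionalProducerConfigSketch and its two methods are hypothetical names made up for this note; in Kafka the timeout check is performed by the broker when it handles the InitPidRequest, not by client code like this.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the Kafka client.
final class TransactionalProducerConfigSketch {

    // Collects the producer settings described above for one transactional.id.
    static Properties build(String transactionalId, int transactionTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true"); // must be on to use transactions
        props.setProperty("acks", "all");                // enforced when idempotence is enabled
        props.setProperty("transactional.id", transactionalId);
        props.setProperty("transaction.timeout.ms", Integer.toString(transactionTimeoutMs));
        return props;
    }

    // Mirrors the broker-side rule: a requested timeout larger than
    // max.transaction.timeout.ms is rejected (InvalidTransactionTimeout).
    static boolean timeoutAccepted(Properties props, int brokerMaxTransactionTimeoutMs) {
        int requested = Integer.parseInt(props.getProperty("transaction.timeout.ms"));
        return requested <= brokerMaxTransactionTimeoutMs;
    }
}
```

With the defaults quoted above, a producer asking for 60000 ms passes the check against a broker limit of 900000 ms, while one asking for 900001 ms would be rejected.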
Consumer configs
isolation.level
Here are the possible values (default is read_uncommitted):
read_uncommitted: consume both committed and uncommitted messages in offset ordering.
read_committed: only consume non-transactional messages or committed transactional messages in offset order. In order to maintain offset ordering, this setting means that we will have to buffer messages in the consumer until we see all messages in a given transaction.
Proposed Changes
Summary of Guarantees
Idempotent Producer Guarantees
To implement idempotent producer semantics, we introduce the concepts of a producer id, henceforth called the PID, and sequence numbers for Kafka messages. Every new producer will be assigned a unique PID during initialization. The PID assignment is completely transparent to users and is never exposed by clients.
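Note: to implement idempotent producer semantics, the proposal introduces a producer id, called the PID. PID assignment is completely transparent to users and is never exposed by clients. How should these two sentences be read? That users never see the PID? That the producer-side client code takes care of it? That the PID differs between clients, or between sessions?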
For a given PID, sequence numbers will start from zero and be monotonically increasing, with one sequence number per topic partition produced to. The sequence number will be incremented by the producer on every message sent to the broker. The broker maintains in memory the sequence numbers it receives for each topic partition from every PID. The broker will reject a produce request if its sequence number is not exactly one greater than the last committed message from that PID/TopicPartition pair. Messages with a lower sequence number result in a duplicate error, which can be ignored by the producer. Messages with a higher number result in an out-of-sequence error, which indicates that some messages have been lost, and is fatal.
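Note: for a given PID, sequence numbers start at 0 and increase monotonically. I did not quite follow the next phrase, "with one sequence number per topic partition produced to". Does it mean each message carries a sequence number into each topic partition? Judging from the sentences that follow, which speak of sequence numbers "for each topic partition from every PID" and of the "PID/TopicPartition pair", it means a separate sequence counter is kept for every topic partition the producer writes to.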
This ensures that, even though a producer must retry requests upon failures, every message will be persisted in the log exactly once. Further, since each new instance of a producer is assigned a new, unique, PID, we can only guarantee idempotent production within a single producer session.
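Note: this guarantees that even though a producer has to retry after failures, every message is persisted in the log exactly once (based on the PID and the sequence number). Further, because each new producer instance is assigned a new, unique PID, idempotent production can only be guaranteed within a single producer session.
My reading: one PID corresponds to one producer session, and within that session the sequence numbers are what guarantee idempotence.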
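The broker-side check described above can be sketched as a toy model. This is only an illustration of the rule "exactly one greater than the last sequence for that PID/TopicPartition pair"; SequenceCheckSketch, Result and tryAppend are hypothetical names, not Kafka classes, and a real broker also handles batches, epochs and persistence.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the broker-side sequence check; all names are hypothetical.
final class SequenceCheckSketch {

    enum Result { APPENDED, DUPLICATE, OUT_OF_ORDER }

    // Last appended sequence number, keyed by "pid/topic-partition".
    private final Map<String, Integer> lastSequence = new HashMap<>();

    Result tryAppend(long pid, String topicPartition, int sequence) {
        String key = pid + "/" + topicPartition;
        int last = lastSequence.getOrDefault(key, -1); // sequences start from zero
        if (sequence <= last) {
            return Result.DUPLICATE;     // a retry of something already written; safe to ignore
        }
        if (sequence != last + 1) {
            return Result.OUT_OF_ORDER;  // a gap: some earlier message was lost; fatal
        }
        lastSequence.put(key, sequence);
        return Result.APPENDED;
    }
}
```

Because the map is keyed by both PID and topic partition, the same producer can be at sequence 5 on one partition and at sequence 0 on another, which is what "one sequence number per topic partition produced to" amounts to.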
These idempotent producer semantics are potentially useful for stateless applications like metrics tracking and auditing.
Transactional Guarantees
At the core, transactional guarantees enable applications to produce to multiple TopicPartitions atomically, ie. all writes to these TopicPartitions will succeed or fail as a unit.
Note: transactional guarantees let an application produce to multiple TopicPartitions atomically; that is, all writes to these TopicPartitions succeed or fail together as a unit.
Further, since consumer progress is recorded as a write to the offsets topic, the above capability is leveraged to enable applications to batch consumed and produced messages into a single atomic unit, ie. a set of messages may be considered consumed only if the entire ‘consume-transform-produce’ executed in its entirety.
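Note: this sentence is a bit convoluted. Consumer progress is recorded as a write to the offsets topic, so the capability above can be used to let an application batch consumed and produced messages into a single atomic unit. This is one use case.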
Additionally, stateful applications will also be able to ensure continuity across multiple sessions of the application. In other words, Kafka can guarantee idempotent production and transaction recovery across application bounces.
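Note: in addition, stateful applications (how should "stateful applications" be read here? If the state meant transaction state, "Additionally" would not fit) can keep continuity across multiple sessions of the application. In other words, Kafka can guarantee idempotent production and transaction recovery across application bounces. I first read "bounces" as application boundaries, but then "boundary" would have been the natural word; "bounce" here means a restart of the application, which matches "across multiple sessions" in the previous sentence.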
To achieve this, we require that the application provides a unique id which is stable across all sessions of the application. For the rest of this document, we refer to such an id as the TransactionalId. While there may be a 1-1 mapping between a TransactionalId and the internal PID, the main difference is that the TransactionalId is provided by users, and is what enables idempotent guarantees across producer sessions described below.
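Note: so the application has to provide a unique id that stays fixed across all of its sessions. (This should be the "state" mentioned above. On the web, "stateful" really just means that individual requests are distinguishable from one another; if they are not, the application is stateless.) For the rest of the document this id is called the TransactionalId. There may be a one-to-one mapping between the TransactionalId and the internal PID; the main difference is that the TransactionalId is supplied by the user, and the text below explains how it guarantees idempotence across producer sessions.
The TransactionalId also has to be unique. How is it generated, and what does it look like?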
When provided with such a TransactionalId, Kafka will guarantee:
- Exactly one active producer with a given TransactionalId. This is achieved by fencing off old generations when a new instance with the same TransactionalId comes online.
- Transaction recovery across application sessions. If an application instance dies, the next instance can be guaranteed that any unfinished transactions have been completed (whether aborted or committed), leaving the new instance in a clean state prior to resuming work.
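Note: when given such a TransactionalId, Kafka guarantees:
- Exactly one active producer for a given TransactionalId. This is achieved by fencing off the old generation when a new instance with the same TransactionalId comes online. (What exactly does fencing mean here?)
- Transaction recovery across application sessions. If an application instance dies, the next instance is guaranteed that any unfinished transactions have been completed (whether aborted or committed), so the new instance starts from a clean state before it resumes work.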
Note that the transactional guarantees mentioned here are from the point of view of the producer. On the consumer side, the guarantees are a bit weaker. In particular, we cannot guarantee that all the messages of a committed transaction will be consumed all together. This is for several reasons:
- For compacted topics, some messages of a transaction may be overwritten by newer versions.
- Transactions may straddle log segments. Hence when old segments are deleted, we may lose some messages in the first part of a transaction.
- Consumers may seek to arbitrary points within a transaction, hence missing some of the initial messages.
- Consumers may not consume from all the partitions which participated in a transaction. Hence they will never be able to read all the messages that comprised the transaction.
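Note: the transactional guarantees described here are from the producer's point of view. On the consumer side the guarantees are somewhat weaker. In particular, there is no guarantee that all the messages of a committed transaction will be consumed together. There are several reasons:
- For compacted topics, some messages of a transaction may be overwritten by newer versions.
- Transactions may span log segments (covered in "Kafka Study Notes: Storage Internals"). So when old segments are deleted, some messages from the first part of a transaction may be lost. Why the first part? (Presumably because the oldest segments, which hold the beginning of the transaction, are the ones deleted first.)
- A consumer may seek to an arbitrary point inside a transaction and so miss some of the initial messages.
- A consumer may not consume from all the partitions that took part in a transaction, so it will never read all the messages that made up the transaction.
I did not understand this part. Is the consumer here the one in the consume-transform-produce scenario mentioned earlier? It does not feel like it; it seems to mean a plain consumer. I will come back to this paragraph after finishing the whole document.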
Key Concepts
To implement transactions, ie. ensuring that a group of messages are produced and consumed atomically, we introduce several new concepts:
- We introduce a new entity called a Transaction Coordinator. Similar to the consumer group coordinator, each producer is assigned a transaction coordinator, and all the logic of assigning PIDs and managing transactions is done by the transaction coordinator.
- We introduce a new internal kafka topic called the Transaction Log. Similar to the Consumer Offsets topic, the transaction log is a persistent and replicated record of every transaction. The transaction log is the state store for the transaction coordinator, with the snapshot of the latest version of the log encapsulating the current state of each active transaction.
- We introduce the notion of Control Messages. These are special messages written to user topics, processed by clients, but never exposed to users. They are used, for instance, to let brokers indicate to consumers if the previously fetched messages have been committed atomically or not. Control messages have been previously proposed here.
- We introduce a notion of TransactionalId, to enable users to uniquely identify producers in a persistent way. Different instances of a producer with the same TransactionalId will be able to resume (or abort) any transactions instantiated by the previous instance.
- We introduce the notion of a producer epoch, which enables us to ensure that there is only one legitimate active instance of a producer with a given TransactionalId, and hence enables us to maintain transaction guarantees in the event of failures.
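Note: to implement transactions, that is, to ensure that a group of messages is produced and consumed atomically, the proposal introduces several new concepts.
- A new entity called the Transaction Coordinator. Similar to the consumer group coordinator (what is that?), each producer is assigned a transaction coordinator, and all the logic for assigning PIDs and managing transactions is handled by it.
- A new internal Kafka topic called the Transaction Log. Similar to the Consumer Offsets topic, the transaction log is a persistent and replicated record of every transaction. It is the state store for the transaction coordinator; a snapshot of the latest version of the log captures the current state of each active transaction.
- The notion of Control Messages. These are special messages written to user topics and processed by clients (what does "client" refer to here?), but never exposed to users. They are used, for example, to let brokers tell consumers whether previously fetched messages have been committed atomically.
- The notion of a TransactionalId, which lets users identify producers in a persistent way. Different instances of a producer with the same TransactionalId can resume (or abort) any transaction started by the previous instance. That is, the previous instance may have died, and a new instance using the same TransactionalId can carry on, because the transaction is retained on the broker side.
- The notion of a producer epoch, which ensures that there is only one legitimate active producer instance for a given TransactionalId, and so lets transactional guarantees hold in the event of failures.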
In addition to the new concepts above, we also introduce new request types, new versions of existing requests, and new versions of the core message format in order to support transactions. The details of all of these will be deferred to other documents.
Data Flow
In the diagram above, the sharp edged boxes represent distinct machines. The rounded boxes at the bottom represent Kafka TopicPartitions, and the diagonally rounded boxes represent logical entities which run inside brokers.
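Note: in the diagram, the sharp-edged boxes represent distinct machines. The rounded boxes at the bottom represent Kafka TopicPartitions, and the diagonally rounded boxes represent logical entities that run inside brokers.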
Each arrow represents either an RPC, or a write to a Kafka topic. These operations occur in the sequence indicated by the numbers next to each arrow. The sections below are numbered to match the operations in the diagram above, and describe the operation in question.
Note: each arrow represents either an RPC or a write to a Kafka topic.
1. Finding a transaction coordinator -- the FindCoordinatorRequest
Note: where does this step show up?
Since the transaction coordinator is at the center assigning PIDs and managing transactions, the first thing a producer has to do is issue a FindCoordinatorRequest (previously known as GroupCoordinatorRequest, but renamed for more general usage) to any broker to discover the location of its coordinator.
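Note: because the transaction coordinator is at the center of assigning PIDs and managing transactions, the first thing a producer has to do is send a FindCoordinatorRequest (previously called GroupCoordinatorRequest, but renamed for more general use) to any broker to discover where its coordinator is.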
2. Getting a producer Id -- the InitPidRequest
After discovering the location of its coordinator, the next step is to retrieve the producer’s PID. This is achieved by issuing an InitPidRequest to the transaction coordinator.
Note: once the coordinator has been located, the next step is to obtain the producer's PID, by sending an InitPidRequest to the transaction coordinator.
2.1 When a TransactionalId is specified
If the transactional.id configuration is set, this TransactionalId is passed along with the InitPidRequest, and the mapping to the corresponding PID is logged in the transaction log in step 2a. This enables us to return the same PID for the TransactionalId to future instances of the producer, and hence enables recovering or aborting previously incomplete transactions.
Note: if the configuration is set, the TransactionalId is passed along with the InitPidRequest, and its mapping to the corresponding PID is recorded in the transaction log in step 2a. This lets the coordinator return the same PID for this TransactionalId to future instances of the producer, and so recover or abort previously incomplete transactions.
In addition to returning the PID, the InitPidRequest performs the following tasks:
- Bumps up the epoch of the PID, so that any previous zombie instance of the producer is fenced off and cannot move forward with its transaction.
- Recovers (rolls forward or rolls back) any transaction left incomplete by the previous instance of the producer.
The handling of the InitPidRequest is synchronous. Once it returns, the producer can send data and start new transactions.
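The epoch bump and fencing described in this step can be sketched with a toy coordinator. This is a simplified model under my own assumptions: EpochFencingSketch, ProducerIdentity, initPid and isCurrent are hypothetical names, the state lives in memory instead of the transaction log, and recovery of incomplete transactions is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Toy transaction coordinator; all names here are hypothetical.
final class EpochFencingSketch {

    // What a producer instance gets back from the simulated InitPidRequest.
    static final class ProducerIdentity {
        final long pid;
        final int epoch;
        ProducerIdentity(long pid, int epoch) { this.pid = pid; this.epoch = epoch; }
    }

    private final Map<String, ProducerIdentity> byTransactionalId = new HashMap<>();
    private long nextPid = 0;

    // Same TransactionalId -> same PID, with the epoch bumped by one.
    ProducerIdentity initPid(String transactionalId) {
        ProducerIdentity current = byTransactionalId.get(transactionalId);
        ProducerIdentity next = (current == null)
                ? new ProducerIdentity(nextPid++, 0)
                : new ProducerIdentity(current.pid, current.epoch + 1);
        byTransactionalId.put(transactionalId, next);
        return next;
    }

    // A request is accepted only if it carries the latest epoch for its TransactionalId.
    boolean isCurrent(String transactionalId, ProducerIdentity caller) {
        ProducerIdentity latest = byTransactionalId.get(transactionalId);
        return latest != null && latest.pid == caller.pid && latest.epoch == caller.epoch;
    }
}
```

If an old instance and its replacement both call initPid with the same TransactionalId, they share a PID, but only the replacement holds the latest epoch, so the old (zombie) instance fails the isCurrent check. That is the sense in which it is "fenced off".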
2.2 When a TransactionalId is not specified
If no TransactionalId is specified in the configuration, a fresh PID is assigned, and the producer only enjoys idempotent semantics and transactional semantics within a single session.
3. Starting a Transaction – The beginTransaction() API
The new KafkaProducer will have a beginTransaction() method which has to be called to signal the start of a new transaction. The producer records local state indicating that the transaction has begun, but the transaction won’t begin from the coordinator’s perspective until the first record is sent.
4. The consume-transform-produce loop
In this stage, the producer begins to consume-transform-produce the messages that comprise the transaction. This is a long phase and is potentially comprised of multiple requests.
Note: in this stage the producer runs the consume-transform-produce loop over the messages that make up the transaction. It is a long phase and may consist of many requests.
4.1 AddPartitionsToTxnRequest
The producer sends this request to the transaction coordinator the first time a new TopicPartition is written to as part of a transaction. The addition of this TopicPartition to the transaction is logged by the coordinator in step 4.1a. We need this information so that we can write the commit or abort markers to each TopicPartition (see section 5.2 for details). If this is the first partition added to the transaction, the coordinator will also start the transaction timer.
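Note: the producer sends this request to the transaction coordinator the first time a new TopicPartition is written to as part of a transaction. In step 4.1a the coordinator records that this TopicPartition has been added to the transaction. The information is needed so that commit or abort markers can later be written to each TopicPartition (see 5.2). If this is the first partition added to the transaction, the coordinator also starts the transaction timer.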
4.2 ProduceRequest
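The producer writes a bunch of messages to the user’s TopicPartitions through one or more ProduceRequests (fired from the send method of the producer). These requests include the PID, epoch, and sequence number as denoted in 4.2a.
Note: the producer writes a batch of messages to the user's TopicPartitions through one or more ProduceRequests (triggered by the producer's send method). These requests carry the PID, the epoch (a generation counter for the producer, not a point in time) and the sequence number, as shown in 4.2a.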
4.3 AddOffsetCommitsToTxnRequest
The producer has a new KafkaProducer.sendOffsetsToTransaction API method, which enables the batching of consumed and produced messages. This method takes a Map<TopicPartition, OffsetAndMetadata> and a groupId argument.
Note: the producer has a new KafkaProducer.sendOffsetsToTransaction method, which makes it possible to batch consumed and produced messages together.
The sendOffsetsToTransaction method sends an AddOffsetCommitsToTxnRequest with the groupId to the transaction coordinator, from which it can deduce the TopicPartition for this consumer group in the internal __consumer_offsets topic. The transaction coordinator logs the addition of this topic partition to the transaction log in step 4.3a.
Note: this method sends an AddOffsetCommitsToTxnRequest carrying the groupId to the transaction coordinator, which can work out from it the TopicPartition for this group in the internal __consumer_offsets topic. In step 4.3a the transaction coordinator records the addition of this topic partition in the transaction log.
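How a coordinator might "deduce the TopicPartition" from a groupId can be sketched as a hash of the id modulo the partition count of the internal topic. To be clear, this rule is my assumption for illustration, since the text here does not spell it out; CoordinatorPartitionSketch and partitionFor are hypothetical names.

```java
// Hypothetical helper; the hash-modulo rule is an assumption, not taken from the text above.
final class CoordinatorPartitionSketch {

    // Maps an id (for example a consumer groupId) to a partition of an internal topic.
    static int partitionFor(String id, int numPartitions) {
        int hash = id.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so handle that case explicitly.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % numPartitions;
    }
}
```

For example, partitionFor("abc", 50) evaluates to 4, because "abc".hashCode() is 96354 and 96354 mod 50 is 4. The point is only that the mapping is deterministic, so every broker and client arrives at the same partition for the same id.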
4.4 TxnOffsetCommitRequest
Also as part of sendOffsets, the producer will send a TxnOffsetCommitRequest to the consumer coordinator to persist the offsets in the __consumer_offsets topic (step 4.4a). The consumer coordinator validates that the producer is allowed to make this request (and is not a zombie) by using the PID and producer epoch which are sent as part of this request.
Note: as part of sendOffsets, the producer also sends a TxnOffsetCommitRequest to the consumer coordinator to persist the offsets in the __consumer_offsets topic (step 4.4a). The consumer coordinator uses the PID and producer epoch sent with the request to check that the producer is allowed to make it, and is not a zombie, that is, an old instance that has already been fenced off.
The consumed offsets are not visible externally until the transaction is committed, the process for which we will discuss now.
5. Committing or Aborting a Transaction
Once the data has been written, the user must call the new commitTransaction or abortTransaction methods of the KafkaProducer. These methods will begin the process of committing or aborting the transaction respectively.
Note: once the data has been written, the user must call commitTransaction or abortTransaction on the KafkaProducer. These methods start the process of committing or aborting the transaction, respectively.
5.1 EndTxnRequest
When a producer is finished with a transaction, the newly introduced KafkaProducer.commitTransaction or KafkaProducer.abortTransaction must be called. The former makes the data produced in 4 available to downstream consumers. The latter effectively erases the produced data from the log: it will never be accessible to the user, ie. downstream consumers will read and discard the aborted messages.
Note: when a producer has finished with a transaction, the newly introduced KafkaProducer.commitTransaction or KafkaProducer.abortTransaction must be called. The former makes the data produced in step 4 available to downstream consumers. The latter effectively erases the produced data from the log: it will never be accessible to the user; that is, downstream consumers will read the aborted messages and then discard them.
Regardless of which producer method is called, the producer issues an EndTxnRequest to the transaction coordinator, with additional data indicating whether the transaction is to be committed or aborted. Upon receiving this request, the coordinator:
- Writes a PREPARE_COMMIT or PREPARE_ABORT message to the transaction log. (step 5.1a)
- Begins the process of writing the command messages known as COMMIT (or ABORT) markers to the user logs through the WriteTxnMarkerRequest. (see section 5.2 below).
- Finally writes the COMMITTED (or ABORTED) message to transaction log. (see 5.3 below).
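Note: whichever producer method is called, the producer sends an EndTxnRequest to the transaction coordinator, with extra data indicating whether the transaction is to be committed or aborted. On receiving this request, the coordinator:
- Writes a PREPARE_COMMIT or PREPARE_ABORT message to the transaction log. (step 5.1a)
- Starts writing the COMMIT (or ABORT) markers to the user logs through the WriteTxnMarkerRequest. (see section 5.2 below)
- Finally writes the COMMITTED (or ABORTED) message to the transaction log. (see 5.3 below)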
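The three steps the coordinator performs can be sketched in order as follows. This is a toy model: EndTxnSketch and its fields are hypothetical names, the "logs" are plain in-memory lists, and failure handling between the steps (the reason the PREPARE record exists) is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the coordinator's commit/abort sequence; names are hypothetical.
final class EndTxnSketch {

    final List<String> transactionLog = new ArrayList<>();
    final List<String> markersWritten = new ArrayList<>();

    // Runs steps 5.1a, 5.2 and 5.3 for the partitions registered in the transaction.
    void endTransaction(long pid, List<String> partitions, boolean commit) {
        transactionLog.add(commit ? "PREPARE_COMMIT" : "PREPARE_ABORT");    // step 5.1a
        for (String partition : partitions) {                               // step 5.2
            String marker = (commit ? "COMMIT" : "ABORT") + "(" + pid + ")";
            markersWritten.add(partition + ":" + marker);
        }
        transactionLog.add(commit ? "COMMITTED" : "ABORTED");               // step 5.3
    }
}
```

The list of partitions passed in corresponds to what the coordinator collected through AddPartitionsToTxnRequest in step 4.1, which is why that bookkeeping is needed before any marker can be written.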
5.2 WriteTxnMarkerRequest
This request is issued by the transaction coordinator to the leader of each TopicPartition which is part of the transaction. Upon receiving this request, each broker will write a COMMIT(PID) or ABORT(PID) control message to the log. (step 5.2a)
Note: the transaction coordinator sends this request to the leader of every TopicPartition that is part of the transaction. On receiving it, each broker writes a COMMIT(PID) or ABORT(PID) control message to the log. (step 5.2a)
This message indicates to consumers whether the messages with the given PID must be delivered to the user or dropped. As such, the consumer will buffer messages which have a PID until it reads a corresponding COMMIT or ABORT message, at which point it will deliver or drop the messages respectively.
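Note: this message tells consumers whether the messages with the given PID must be delivered to the user or dropped. So the consumer buffers messages that carry a PID until it reads the corresponding COMMIT or ABORT message, at which point it delivers or drops them accordingly.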
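The buffering behaviour described above can be sketched for a single partition. This is a simplified model under my own assumptions: ReadCommittedSketch and its methods are hypothetical names, non-transactional messages are not handled, and delivery order across interleaved transactions follows marker arrival rather than strict offset order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy read_committed consumer for one partition; names are hypothetical.
final class ReadCommittedSketch {

    // Messages held back per PID until that PID's marker arrives.
    private final Map<Long, List<String>> pending = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();

    void onMessage(long pid, String value) {
        pending.computeIfAbsent(pid, k -> new ArrayList<>()).add(value);
    }

    // Control message: COMMIT releases the buffered messages, ABORT drops them.
    void onMarker(long pid, boolean committed) {
        List<String> buffered = pending.remove(pid);
        if (committed && buffered != null) {
            delivered.addAll(buffered);
        }
    }

    List<String> delivered() {
        return delivered;
    }
}
```

Feeding it messages "a" and "b" from PID 1 and "x" from PID 2, then an ABORT marker for PID 2 and a COMMIT marker for PID 1, leaves only "a" and "b" delivered to the user.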
Note that, if the __consumer_offsets topic is one of the TopicPartitions in the transaction, the commit (or abort) marker is also written to the log, and the consumer coordinator is notified that it needs to materialize these offsets in the case of a commit or ignore them in the case of an abort (step 5.2a on the left).
Note: if the __consumer_offsets topic is one of the TopicPartitions in the transaction, the commit (or abort) marker is written to its log as well, and the consumer coordinator is told to materialize these offsets (make them take effect) on a commit, or to ignore them on an abort (step 5.2a on the left).
5.3 Writing the final Commit or Abort Message
After all the commit or abort markers are written to the data logs, the transaction coordinator writes the final COMMITTED or ABORTED message to the transaction log, indicating that the transaction is complete (step 5.3 in the diagram). At this point, most of the messages pertaining to the transaction in the transaction log can be removed.
Note: after all the commit or abort markers have been written to the data logs, the transaction coordinator writes the final COMMITTED or ABORTED message to the transaction log, which means the transaction is complete (step 5.3 in the diagram). At this point most of the messages in the transaction log that belong to this transaction can be removed.
We only need to retain the PID of the completed transaction along with a timestamp, so we can eventually remove the TransactionalId->PID mapping for the producer. See the Expiring PIDs section below.
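Note: only the PID of the completed transaction and a timestamp need to be kept, so that the TransactionalId->PID mapping for the producer can eventually be removed. See the Expiring PIDs section below.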
Authorization
It is desirable to control access to the transaction log to ensure that clients cannot intentionally or unintentionally interfere with each other’s transactions. In this work, we introduce a new resource type to represent the TransactionalId tied to transactional producers, and an associated error code for authorization failures.
The transaction coordinator handles each of the following requests: InitPid, AddPartitionsToTxn, AddOffsetsToTxn, and EndTxn. Each request to the transaction coordinator includes the producer’s TransactionalId and can be used for authorization. Each of these requests mutates the transaction state of the producer, so they all require Write access to the corresponding ProducerTransactionalId resource. Additionally, the AddPartitionsToTxn API requires Write access to the topics corresponding to the included partitions, and the AddOffsetsToTxn API requires Read access to the group included in the request.
We also require additional authorization to produce transactional data. This can be used to minimize the risk of an “endless transaction attack,” in which a malicious producer writes transactional data without corresponding COMMIT or ABORT markers in order to prevent the LSO from advancing and consumers from making progress. We can use the ProducerTransactionalId resource introduced above to ensure that the producer is authorized to write transactional data as the producer’s TransactionalId is included in the ProduceRequest schema. The WriteTxnMarker API is for inter-broker usage only, and therefore requires ClusterAction permission on the Cluster resource. Note that the writing of control messages is not permitted through the Produce API.
Clients will not be allowed to write directly to the transaction log using the Produce API, though it is useful to make it accessible to consumers with Read permission for the purpose of debugging.
Discussion on limitations of coordinator authorization
Although we can control access to the transaction log using the TransactionalId, we cannot prevent a malicious producer from hijacking the PID of another producer and writing data to the log. This would allow the attacker to either insert bad data into an active transaction or to fence the authorized producer by forcing an epoch bump. It is not possible for the malicious producer to finish a transaction, however, because the brokers do not allow clients to write control messages. Note also that the malicious producer would have to have Write permission to the same set of topics used by the legitimate producer, so it is still possible to use topic ACLs combined with TransactionalId ACLs to protect sensitive topics. Future work can explore protecting the binding between TransactionalId and PID (e.g. through the use of message authentication codes).
What follows is a summary of the RPC protocol; I will not translate it in detail for now.
RPC Protocol Summary
We summarize all the new request / response pairs as well as modified requests in this section.
FetchRequest/Response
Sent by the consumer to any partition leaders to fetch messages. We bump the API version to allow the consumer to specify the required isolation level. We also modify the response schema to include the list of aborted transactions included in the range of fetched messages.
FetchRequest
|
FetchResponse
|
When the consumer sends a request for an older version, the broker assumes the READ_UNCOMMITTED isolation level and converts the message set to the appropriate format before sending back the response. Hence zero-copy cannot be used. This conversion can be costly when compression is enabled, so it is important to update the client as soon as possible.
We have also added the LSO to the fetch response. In READ_COMMITTED, the consumer will use this to compute lag instead of the high watermark. Note also the addition of the field for aborted transactions. This is used by the consumer in READ_COMMITTED mode to know where aborted transactions begin. This allows the consumer to discard the aborted transaction data without buffering until the associated marker is read.
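To make the filtering idea concrete, here is a minimal sketch of how a READ_COMMITTED consumer could use the aborted-transaction list to drop aborted data as it streams through a fetch, without buffering. The `Record` tuple and the `read_committed` function are hypothetical names invented for this note; a real client works on message sets rather than flat records.

```python
from collections import namedtuple

# Hypothetical flattened view of a fetched record (not the real client type).
# `control` is None for data records, or COMMIT / ABORT for transaction markers.
Record = namedtuple("Record", "offset pid transactional control value")

COMMIT, ABORT = 0, 1  # control message types used in the proposal


def read_committed(records, aborted_txns):
    """Yield only records that are safe to hand to the application.

    records:      Records in offset order.
    aborted_txns: (pid, first_offset) pairs taken from the fetch response.
    """
    pending = sorted(aborted_txns, key=lambda t: t[1])
    aborted_pids = set()
    for rec in records:
        # Activate every aborted transaction starting at or before this offset.
        while pending and pending[0][1] <= rec.offset:
            aborted_pids.add(pending.pop(0)[0])
        if rec.control is not None:
            # Markers are never delivered; an ABORT marker ends the aborted range.
            if rec.control == ABORT:
                aborted_pids.discard(rec.pid)
            continue
        if rec.transactional and rec.pid in aborted_pids:
            continue  # data written by an aborted transaction
        yield rec
```

Because the consumer learns up front where each aborted transaction begins, it can decide record by record whether to deliver or skip, instead of holding data back until the marker arrives.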
ProduceRequest/Response
Sent by the producer to any brokers to produce messages. Instead of allowing the protocol to send multiple message sets for each partition, we modify the schema to allow only one message set for each partition. This allows us to remove the message set size since each message set already contains a field for the size. More importantly, since there is only one message set to be written to the log, partial produce failures are no longer possible. The full message set is either successfully written to the log (and replicated) or it is not.
We include the TransactionalId in order to ensure that producers using transactional messages (i.e. those with the transaction bit set in the attributes) are authorized to do so. If the client is not using transactions, this field should be null.
ProduceRequest
|
ProduceResponse
|
Error codes:
- DuplicateSequenceNumber [NEW]
- InvalidSequenceNumber [NEW]
- InvalidProducerEpoch [NEW]
- UNSUPPORTED_FOR_MESSAGE_FORMAT
Note that clients sending version 3 of the produce request MUST use the new message set format. The broker may still down-convert the message to an older format when writing to the log, depending on the internal message format specified.
ListOffsetRequest/Response
Sent by the client to search offsets by timestamp and to find the first and last offsets for a partition. In this proposal, we modify this request to also support retrieval of the last stable offset, which is needed by the consumer to implement seekToEnd() in READ_COMMITTED mode.
ListOffsetRequest
|
ListOffsetResponse
|
The schema is exactly the same as version 1, but we now support a new sentinel timestamp in the request (-3) to retrieve the LSO.
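A rough sketch of how a broker might resolve the timestamp field, including the new sentinel. The -3 value comes from the text above; -1 (latest) and -2 (earliest) are the sentinels the request already supported. The function name, its parameters, and the in-memory `time_index` list are assumptions made for illustration only.

```python
import bisect

LATEST, EARLIEST, LAST_STABLE = -1, -2, -3  # -3 is the sentinel added here


def resolve_offset(timestamp, log_start, lso, high_watermark, time_index):
    """Return the offset answering a ListOffset lookup.

    time_index: hypothetical list of (timestamp, offset) pairs sorted by timestamp.
    """
    if timestamp == LAST_STABLE:
        return lso
    if timestamp == LATEST:
        return high_watermark
    if timestamp == EARLIEST:
        return log_start
    # Otherwise: first offset whose timestamp is >= the requested timestamp.
    i = bisect.bisect_left(time_index, (timestamp,))
    return time_index[i][1] if i < len(time_index) else None
```

With this, a seekToEnd() in READ_COMMITTED mode would ask for -3 and land on the LSO rather than on the high watermark.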
FindCoordinatorRequest/Response
Sent by client to any broker to find the corresponding coordinator. This is the same API that was previously used to find the group coordinator, but we have changed the name to reflect the more general usage (there is no group for transactional producers). We bump up the version of the request and add a new field indicating the group type, which can be either Consumer or Txn. Request handling details can be found here.
FindCoordinatorRequest
|
FindCoordinatorResponse
|
Error codes:
- Ok
- CoordinatorNotAvailable
The node id is the identifier of the broker. We use the coordinator id to identify the connection to the corresponding broker.
InitPidRequest/Response
Sent by producer to its transaction coordinator to get the assigned PID, increment its epoch, and fence any previous producers sharing the same TransactionalId. Request handling details can be found here.
InitPidRequest
|
InitPidResponse
|
Error code:
- Ok
- NotCoordinatorForTransactionalId
- CoordinatorNotAvailable
- ConcurrentTransactions
- InvalidTransactionTimeout
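The fencing behaviour of InitPid can be pictured with a toy, in-memory coordinator. This is only a sketch under my own assumptions: the class and method names are hypothetical, PIDs start at an arbitrary number, and persistence to the transaction log, timeouts, and epoch wrap-around are all ignored. The error names are the ones listed in this section.

```python
import itertools


class ToyTransactionCoordinator:
    """In-memory stand-in for the transaction coordinator (illustration only)."""

    def __init__(self):
        self._next_pid = itertools.count(1000)  # arbitrary starting PID
        self._state = {}  # transactional_id -> (pid, epoch)

    def init_pid(self, transactional_id):
        """Assign a PID on first use; on later calls keep the PID and bump the epoch."""
        if transactional_id in self._state:
            pid, epoch = self._state[transactional_id]
            epoch += 1  # fences any producer still using the old epoch
        else:
            pid, epoch = next(self._next_pid), 0
        self._state[transactional_id] = (pid, epoch)
        return pid, epoch

    def check_producer(self, transactional_id, pid, epoch):
        """Validate the (pid, epoch) a producer presents with a later request."""
        current = self._state.get(transactional_id)
        if current is None or current[0] != pid:
            return "InvalidPidMapping"
        if epoch < current[1]:
            return "InvalidProducerEpoch"
        return "Ok"
```

A restarted producer calling `init_pid` with the same TransactionalId gets the same PID with a higher epoch, so any request from the older instance is rejected with InvalidProducerEpoch.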
AddPartitionsToTxnRequest/Response
Sent by producer to its transaction coordinator to add a partition to the current ongoing transaction. Request handling details can be found here.
AddPartitionsToTxnRequest
|
AddPartitionsToTxnResponse
|
Error code:
- Ok
- NotCoordinator
- CoordinatorNotAvailable
- CoordinatorLoadInProgress
- InvalidPidMapping
- InvalidTxnState
- ConcurrentTransactions
- GroupAuthorizationFailed
AddOffsetsToTxnRequest/Response
Sent by the producer to its transaction coordinator to indicate a consumer offset commit operation is called as part of the current ongoing transaction. Request handling details can be found here.
AddOffsetsToTxnRequest
|
AddOffsetsToTxnResponse
|
Error code:
- Ok
- InvalidProducerEpoch
- InvalidPidMapping
- NotCoordinatorForTransactionalId
- CoordinatorNotAvailable
- ConcurrentTransactions
- InvalidTxnRequest
EndTxnRequest/Response
Sent by producer to its transaction coordinator to prepare committing or aborting the current ongoing transaction. Request handling details can be found here.
EndTxnRequest
|
EndTxnResponse
|
Error code:
- Ok
- InvalidProducerEpoch
- InvalidPidMapping
- CoordinatorNotAvailable
- ConcurrentTransactions
- NotCoordinatorForTransactionalId
- InvalidTxnRequest
WriteTxnMarkersRequest/Response
Sent by transaction coordinator to broker to commit the transaction. Request handling details can be found here.
WriteTxnMarkersRequest
|
WriteTxnMarkersResponse
|
Error code:
- Ok
TxnOffsetCommitRequest/Response
Sent by transactional producers to consumer group coordinator to commit offsets within a single transaction. Request handling details can be found here.
Note that, just as with consumers, users cannot set the retention time explicitly; the default value (-1) is always used, which lets the broker determine the retention time.
TxnOffsetCommitRequest
|
TxnOffsetCommitResponse
|
Error code:
- InvalidProducerEpoch
Note: The following is tangential to the TxnOffsetCommitRequest/Response: When an OffsetCommitRequest from a consumer failed with a retriable error, we return RetriableOffsetCommitException to the application callback. Previously, this 'RetriableOffsetCommitException' would include the underlying exception. With the changes in KIP-98, we no longer include the underlying exception in the 'RetriableOffsetCommitException'.
Message Format
In order to add new fields such as PID and epoch into the produced messages for transactional messaging and de-duplication, we need to change Kafka’s message format and bump up its version (i.e. the “magic byte”). More specifically, we need to add the following fields into each message:
- PID => int64
- Epoch => int16
- Sequence number => int32
Adding these fields on the message-level format schema potentially adds a considerable amount of overhead; on the other hand, at least the PID and epoch will never change within a set of messages from a given producer. We therefore propose to enhance the current concept of a message set by giving it a separate schema from an individual message. In this way, we can locate these fields only at the message set level which allows the additional overhead to be amortized across batches of messages rather than paying the cost for each message separately.
Both the epoch and sequence number will wrap around once int16_max and int32_max are reached. Since there is a single point of allocation and validation for both the epoch and sequence number, wrapping these values will not break either the idempotent or transactional semantics.
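A small sketch of the wrap-around just described. The text only says that the values wrap once int16_max and int32_max are reached; that they restart at 0 is my assumption, and the function names are made up for this note.

```python
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1


def next_sequence(last_seq, count=1):
    """Sequence number `count` steps after last_seq, wrapping to 0 past INT32_MAX."""
    return (last_seq + count) % (INT32_MAX + 1)


def next_epoch(epoch):
    """Epoch after a bump, wrapping to 0 past INT16_MAX."""
    return (epoch + 1) % (INT16_MAX + 1)


def is_next_in_order(last_seq, incoming_seq):
    """True if incoming_seq is exactly the sequence number expected after last_seq."""
    return incoming_seq == next_sequence(last_seq)
```

Since the expected value is always derived from the last accepted one at a single place, a producer sending 0 right after INT32_MAX is simply seen as in order.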
For reference, the current message format (v1) is the following:
|
A message set is a sequence of messages. To support compression, we currently play a trick with this format and allow the compressed output of a message set to be embedded in the value field of another message (a.k.a., the “wrapper message”). In this design, we propose to extend this concept to non-compressed messages and to decouple the schema for the message wrapper (which contains the compressed message set). This allows us to maintain a separate set of fields at the message set level and avoid some costly redundancy:
|
The ability to store some fields only at the message set level allows us to conserve space considerably when batching messages into a message set. For example, there is no need to write the PID within each message since it will always be the same for all messages within each message set. In addition, by separating the message level format and message set format, now we can also use variable-length types for the inner (relative) offsets and save considerably over a fixed 8-byte field size.
Message Set Fields
The first four fields of a message set in this format must be the same as the existing format because any fields before the magic byte cannot be changed in order to provide a path for upgrades following a similar approach as was used in KIP-32. Clients which request an older version of the format will require conversion on the broker.
The offset provided in the message set header represents the offset of the first message in the set. Similarly, the sequence number field represents the sequence number of the first message. We also include an “offset delta” at the message set level to provide an easy way to compute the last offset / sequence number in the set: i.e. the starting offset of the next message set should be “offset + offset delta”. This also allows us to search for the message set corresponding to a particular offset without scanning the individual messages, which may or may not be compressed. Similarly, we can use this to easily compute the next expected sequence number.
The offset, sequence number, and offset delta values of the message set never change after the creation of the message set. The log cleaner may remove individual messages from the message set, and it may remove the message set itself once all messages have been removed, but we must preserve the range of sequence numbers that were ever used in a message set since we depend on this to determine the next sequence number expected for each PID.
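The header arithmetic above can be sketched as follows. I follow the convention stated in the text (the next message set starts at offset + offset delta); the `MessageSetHeader` class and `find_set` function are hypothetical, and sequence wrap-around is ignored to keep the example short.

```python
import bisect
from dataclasses import dataclass


@dataclass
class MessageSetHeader:
    offset: int          # offset of the first message in the set
    offset_delta: int    # per the text: next set starts at offset + offset_delta
    first_sequence: int  # sequence number of the first message

    def next_offset(self):
        return self.offset + self.offset_delta

    def next_sequence(self):
        return self.first_sequence + self.offset_delta


def find_set(headers, target_offset):
    """Locate the set covering target_offset using headers only (sorted by offset)."""
    starts = [h.offset for h in headers]
    i = bisect.bisect_right(starts, target_offset) - 1
    if i >= 0 and target_offset < headers[i].next_offset():
        return headers[i]
    return None
```

The lookup never touches the (possibly compressed) messages inside a set, which is the point of keeping these values in the header.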
Message Set Attributes: The message set attributes are essentially the same as in the existing format, though we have added an additional byte for future use. In addition to the existing 3 bits used to indicate the compression codec and 1 bit for timestamp type, we will use another bit to indicate that the message set is transactional (see Transaction Markers section). This lets consumers in READ_COMMITTED know whether a transaction marker is expected for a given message set.
The control flag indicates that the messages contained in the message set are not intended for application consumption (see below).
Compression (3) | Timestamp type (1) | Transactional (1) | Control (1) | Unused (10)
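As an illustration of the attribute layout, here is a sketch that packs and unpacks the flags. The bit widths come from the row above; placing compression in the lowest bits and the remaining flags directly above it in the listed order is my assumption, and the function names are hypothetical.

```python
COMPRESSION_MASK = 0x07       # lowest 3 bits: compression codec
TIMESTAMP_TYPE_BIT = 1 << 3   # 1 bit: timestamp type
TRANSACTIONAL_BIT = 1 << 4    # 1 bit: message set is part of a transaction
CONTROL_BIT = 1 << 5          # 1 bit: message set holds control messages


def encode_attributes(compression, log_append_time=False,
                      transactional=False, control=False):
    """Pack the flags into a 16-bit attributes value (the upper 10 bits stay unused)."""
    if not 0 <= compression <= COMPRESSION_MASK:
        raise ValueError("compression codec must fit in 3 bits")
    attrs = compression
    if log_append_time:
        attrs |= TIMESTAMP_TYPE_BIT
    if transactional:
        attrs |= TRANSACTIONAL_BIT
    if control:
        attrs |= CONTROL_BIT
    return attrs


def decode_attributes(attrs):
    """Unpack a 16-bit attributes value into its individual flags."""
    return {
        "compression": attrs & COMPRESSION_MASK,
        "log_append_time": bool(attrs & TIMESTAMP_TYPE_BIT),
        "transactional": bool(attrs & TRANSACTIONAL_BIT),
        "control": bool(attrs & CONTROL_BIT),
    }
```

A READ_COMMITTED consumer would check the transactional flag to know whether to expect a marker for the set, and the control flag to know the set must not reach the application.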
Discussion on Maximum Message Size. The broker’s configuration max.message.size previously controlled the maximum size of a single uncompressed message or a compressed set of messages. With this design, it now controls the maximum message set size, compressed or not. In practice, the difference is minor because a single message can be written as a singleton message set, with the small increase in overhead mentioned above.
Message Fields
The length field of the message format is encoded as a signed variable-length integer. Similarly, the offset delta and key length fields are encoded as uintVar. The message’s offset can then be calculated as the offset of the message set + offset delta.
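The text does not spell out the variable-length encoding. As a sketch, assuming the common base-128 scheme (7 payload bits per byte, high bit as a continuation flag) with zigzag mapping for signed values, it could look like this. All function names are mine.

```python
def encode_uvarint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("unsigned varint cannot encode negative values")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # set the continuation bit
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_uvarint(buf, pos=0):
    """Decode an unsigned varint at buf[pos]; return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint(n):
    """Signed variant: zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ..."""
    zigzag = (n << 1) if n >= 0 else ((-n << 1) - 1)
    return encode_uvarint(zigzag)


def decode_varint(buf, pos=0):
    """Decode a zigzag-encoded signed varint; return (value, next_position)."""
    zigzag, pos = decode_uvarint(buf, pos)
    return (zigzag >> 1) ^ -(zigzag & 1), pos
```

Under this scheme a small offset delta or key length fits in one byte, compared with the fixed 8-byte offset field of the old format.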
Message Attributes: In this format, we have also added a single byte for individual message attributes. Only message sets can be compressed, so there is no need to reserve some of these attributes for the compression type. All of the message-level attributes are available for future use.
Unused (8)
Control Messages
We use control messages to represent transaction markers. All messages contained in a batch with the control attribute set (see above) are considered control messages and follow a specific format. Each control message must have a non-null key, which is used to indicate the control message type with the following schema:
|
In this proposal, a control message type of 0 indicates a COMMIT marker, and a control message type of 1 indicates an ABORT marker. The schema for control values is generally specific to the control message type.
Discussion on Message-level Schema. A few additional notes about this schema:
- Having easy access to the offset of the first message allows us to stream messages to the user on demand. In the existing format, we only know the last offset in each message set, so we have to read the messages fully into memory in order to compute the offset of the first message to be returned to the user.
- As before, the message set header has a fixed size. This is important because it allows us to do in-place offset/timestamp assignment on the broker before writing to disk.
- We have removed the per-message CRC in this format. We hesitated initially to do so because of its use in some auditing applications for end-to-end validation. The problem is that it is not safe, even currently, to assume that the CRC seen by the producer will match that seen by the consumer. One case where it is not preserved is when the topic is configured to use the log append time. Another is when messages need to be up-converted prior to appending to the log. For these reasons, and to conserve space and save computation, we have removed the CRC and deprecated client usage of these fields.
- The message set CRC covers the header and message data. Alternatively, we could let it cover only the header, but if compressed data is corrupted, then decompression may fail with obscure errors. Additionally, that would require us to add the message-level CRC back to the message.
- The CRC32C polynomial is used for all CRC computations in the new format because optimised implementations are significantly faster (i.e. if they use the CRC32 instruction introduced in SSE4.2).
- Individual messages within a message set have their full size (including header, key, and value) as the first field. This is designed to make deserialization efficient. As we do for the message set itself, we can read the size from the input stream, allocate memory accordingly, and do a single read up to the end of the message. This also makes it easier to skip over the messages if we are looking for a particular one, which potentially saves us from copying the key and value.
- We have not included a field for the size of the value in the message schema since it can be computed directly using the message size and the length of the header and key.
- We have used a variable length integer to represent timestamps. Our approach is to let the first message
Space Comparison
As the batch size increases, the overhead of the new format grows smaller compared to the old format because of the eliminated redundancy. The overhead per message in the old format is fixed at 34 bytes. For the new format, the message set overhead is 53 bytes, while per-message overhead ranges from 6 to 25 bytes. This makes it more costly to send individual messages, but space is quickly recovered with even modest batching. For example, assuming a fixed message size of 1K with 100 byte keys and reasonably close timestamps, the overhead increases by only 7 bytes for each additional batched message (2 bytes for the message size, 1 byte for attributes, 2 bytes for timestamp delta, 1 byte for offset delta, and 1 byte for key size) :
Batch Size | Old Format Overhead | New Format Overhead
1 | 34*1 = 34 | 53 + 1*7 = 60
3 | 34*3 = 102 | 53 + 3*7 = 74
10 | 34*10 = 340 | 53 + 10*7 = 123
50 | 34*50 = 1700 | 53 + 50*7 = 403
100 | 34*100 = 3400 | 53 + 100*7 = 753
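The comparison is simple arithmetic, so a few lines are enough to recompute the overheads for any batch size and to find where the new format starts to win. The constants are the ones quoted in the text (34 bytes per message in the old format, 53 bytes per message set plus 6 to 25 bytes per message in the new one); the function names are my own.

```python
OLD_PER_MESSAGE = 34   # fixed per-message overhead, old format
NEW_SET_OVERHEAD = 53  # per-message-set overhead, new format
NEW_PER_MESSAGE = 7    # per-message overhead in the text's example (range: 6 to 25)


def old_overhead(batch_size):
    """Total overhead in bytes for a batch in the old format."""
    return OLD_PER_MESSAGE * batch_size


def new_overhead(batch_size, per_message=NEW_PER_MESSAGE):
    """Total overhead in bytes for a batch in the new format."""
    return NEW_SET_OVERHEAD + per_message * batch_size


def break_even(per_message=NEW_PER_MESSAGE):
    """Smallest batch size at which the new format has less overhead than the old."""
    n = 1
    while new_overhead(n, per_message) >= old_overhead(n):
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 3, 10, 50, 100):
        print(n, old_overhead(n), new_overhead(n))
```

With the example's 7 bytes per message, a batch of two messages is already cheaper in the new format; at the worst-case 25 bytes per message it takes a somewhat larger batch.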
Metrics
As part of this work, we would need to expose new metrics to make the system operable. These would include:
- Number of live PIDs (a proxy for the size of the PID->Sequence map)
- Current LSO per partition (useful to detect stuck consumers and lost commit/abort markers).
- Number of active transactionalIds (proxy for the memory consumed by the transaction coordinator).
Compatibility, Deprecation, and Migration Plan
We follow the same approach used in KIP-32. To upgrade from a previous message format version, users should:
1. Upgrade the brokers once with the inter-broker protocol set to the previous deployed version.
2. Upgrade the brokers again with an updated inter-broker protocol, but leaving the message format unchanged.
3. Notify clients that they can upgrade, BUT should not start using the idempotent / transactional message APIs yet.
4. [When observed that most of the clients have upgraded] Restart the brokers, with the message format version set to the latest.
5. Notify upgraded clients that they can now start using the idempotent / transactional message APIs.
The reason for step 3 is to avoid the performance cost for down-converting messages to an older format, which effectively loses the “zero-copy” optimization. Ideally, all consumers are upgraded before the producers even begin writing to the new message format.
Note: Since the old producer has long since been deprecated and the old consumer will be deprecated in 0.11.0, these clients will not support the new format. In order to avoid the conversion hit, users will have to upgrade to the new clients. It is possible to selectively enable the message format on topics which are already using the new clients.
Test Plan
Correctness
The new features will be tested through unit, integration, and system tests.
The integration tests will focus on ensuring that the basic guarantees (outlined in the Summary of Guarantees section) are satisfied across components.
The system tests will focus on ensuring that the guarantees are satisfied even with failing components, i.e. that the system works even when consumers, producers, and brokers are killed in various states.
We will also add to existing compatibility system tests to ensure that old clients can still talk to the new brokers with the new message format.
Performance
This KIP introduces significant changes to the message format along with the new features.
We plan on introducing changes in a staged fashion, with the first change being to the message format. We will run our performance test suite on these message format changes and ensure that, at worst, they have only a minimal performance impact. Note that the message format changes are the only ones which can affect users who don't enable the idempotent producer and don't use transactions.
Then, we will benchmark the performance of the idempotent producer and the transactional producer separately. Finally, we will benchmark the consumer and broker performance when transactions are in use and read_committed mode is enabled. We will publish the results of all these benchmarks so that users can make informed decisions about when and how to use these features.
Rejected Alternatives
As mentioned earlier, we have a separate design document which explores the design space --including rejected alternatives-- as well as all the implementation details. The latter also includes the specifics of message format changes, new RPCs, error handling, etc.
The design document is available here.
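Summary
I spent two days translating, lost my work once when a save failed, and still got through less than half of the document. The first half mainly covers how transactions work; the second half walks through the RPCs and the individual messages. I only understand the principles in broad terms and still need plenty of hands-on practice.
These notes will need continual refinement as my understanding deepens.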
posted on 2019-12-15 17:06 chaiyu2002