Kafka学习笔记--存储内部

博客

前言

研究一下Kafka的存储内部的原理，这里转发一篇文章，个人感觉这篇文章讲的非常好，所以推荐出来，同时做一下学习笔记。

当然，基于不鼓励懒人，启发式记录的原则，我只是对一些重点进行笔记记录，这样无论是将来自己看，还是现在别人看，都需要参考笔记去阅读一下原文----这也才是笔记的意义。

正文

A Practical Introduction to Kafka Storage Internals

Kafka is everywhere these days. With the advent of Microservices and distributed computing, Kafka has become a regular occurrence in every product architecture. In this article, I’ll try to explain how Kafka’s internal storage mechanism works.

Since this is going to be a deep dive into Kafka’s internals, I would expect you to have some understanding about it. Although I’ve tried to keep the entry level for this article pretty low, you might not be able to understand everything if you’re not familiar with the general workings of Kafka. Proceed further with that in mind.

Kafka is usually referred to as a Distributed, Replicated Messaging Queue, which is technically true but might lead to some confusion, depending on your definition of what a messaging queue is. Instead, I prefer the definition Distributed, Replicated Commit Log. This I think clearly represents what Kafka does as all of us understand how logs are written to disk. And in this case, it is the messages pushed into Kafka that are stored to disk.

这里是一些总体的概要介绍。

Kafka通常被引用为一个分布的，有备份的消息队列，这在技术上是正确的，但是这个会带来一些困扰，取决于你对于消息队列的定义。取而代之的，作者更倾向于这样的定义，Kafka是一个分布的，备份的提交日志（commit log，这个是一个有语义的术语）。

Kafka自己定位自己是一个分布式的流化平台，而不是上面所写的分布式，复制的消息队列，或者分布式的，复制的提交日志，但是这个流化平台的概念不是太容易理解。

With reference to storage in Kafka, you’ll always hear two terms, Partition and Topic. Partitions are the unit of storage in Kafka for data messages. And Topic can be thought of as being a container in which these partitions lie.

With the basic stuff out of our way, let’s understand these concepts better by working with Kafka.

I am going to start by creating a topic in Kafka with three partitions defined. If you want to follow along, the command looks like this for a local Kafka setup on windows.

而关于Kafka的存储的介绍，我们会经常听到两个术语，Partition和Topic。Partitions是Kafka用来存储数据消息的基本单元，一个Topic可以被认为是这些partition的容器。

下面，作者会带着大家来实践一把。（下面的这些命令是运行在windows平台上的，而想要运行在其他的平台，可以参考《快速开始》那一章节。）

kafka-topics.bat --create --topic freblogg --partitions 3 --replication-factor 1 --zookeeper localhost:2181

按照上面的命令，启动一下Kafka。

If I go to Kafka’s log directory, I see three directories created as follows.

$ tree freblogg*
freblogg-0
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint
freblogg-1
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint
freblogg-2
|-- 00000000000000000000.index
|-- 00000000000000000000.log
|-- 00000000000000000000.timeindex
`-- leader-epoch-checkpoint

上面的例子创建了3个partition，当你查看目录，会看到上文这样的目录和文件结构。

We have three directories created because we’ve given three partitions for our topic, which means that each partition gets a directory on the file system. You also see some files like index, log etc. We’ll get to them shortly.

一个partition对应着一个实际的目录，有几种文件，index，log等等。别着急，下面会有讲解。

One more thing that you should be able to see from here is that in Kafka, the topic is more of a logical grouping than anything else and that the Partition is the actual unit of storage in Kafka. Let’s understand partitions in some more detail.

Partitions

A partition, in theory, can be described as an immutable collection (or sequence) of messages. We can only append messages to a partition but cannot delete from it. And by “We”, I am talking about the Kafka consumer. A consumer can’t delete the messages in the topic.

Now we’ll send some messages into the topic. But before that, I want you to see the sizes of files in our partition folders.

这里你可以看到，一个topic其实是一个逻辑上的组，而partition则是Kafka中实际的存储单元。

一个partition，在理论上，被描述为一个不可修改的消息集合（或者序列）。我们仅仅可以给一个partition追加消息而不能从它中间删除。这里的“我们”，我是说作为一个Kafka消费者（consumer）。

在发送消息给这个主题之前，我们先看一下这些文件的大小。

$ ls -lh freblogg-0
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

You see the index files combined are about 20M in size while the log file is completely empty. This is the same case with freblogg-1 and freblogg-2folders.

Now let us send a couple of messages and see what happens. To send the messages I’m using the console producer as follows:

2个index文件大概20M，log文件是空的，3个目录是一样的。

现在，我们发送一些消息去看看会发生什么。

kafka-console-producer.bat --topic freblogg --broker-list localhost:9092

I have sent two messages, first a customary “hello world” and then I pressed the Enter key, which becomes the second message. Now if I print the sizes again:

笔者发送了两个消息，第一个是“hello world”，第二个是一个空消息。

$ ls -lh freblogg*
freblogg-0:
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

freblogg-1:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  68 Aug  5 10:15 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 10:15 leader-epoch-checkpoint

freblogg-2:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  79 Aug  5 09:59 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 09:59 leader-epoch-checkpoint

You will see that the log files have a non zero size now. This is because the messages in the partition are stored in the ‘xxxx.log’ file. To confirm that the messages are indeed stored in the log file, we can just see what’s inside that log file.

可以看到两个文件发生了变化，都是log文件，这是因为partition的消息是存储在log文件里面的，我们来再确认一下。

$ cat freblogg-2/*.log
@^@^BÂ°Â£Ã¦Ãƒ^@^K^XÃ¿Ã¿Ã¿Ã¿Ã¿Ã¿^@^@^@^A"^@^@^A^VHello World^@

The file format of the ‘log’ file is not one that is conducive for textual representation but nevertheless, you should see the ‘Hello World’ at the end indicating that this file got updated when we have sent the message into the topic.

Notice that the first message we sent, went into the third partition (freblogg-2) and the second message went into the second partition. This is because Kafka arbitrarily picks the partition for the first message and then distributes the messages to partitions in a round robin fashion. If a third message comes now, it would go into freblogg-0 and this order of partition continues for any new message that comes in. We can also make Kafka always choose the partition for our messages by adding a key to the message. Kafka stores all the messages with the same key into a single partition.

Each new message in the partition gets an ID which is one more than the previous Id number. This Id number is also called as the Offset. So, the first message is at ‘offset’ 0, the second message is at offset 1 and so on. These offset Id’s are always incremented from the previous value.

我们可以看到这个log文件的尾部有“Hello World”，这说明这个消息被发送到了这个topic中。

注意，第一个消息发到了第三个partition，第二个消息发送到了第二个partition。这是因为Kafka是随机地选取了一个partition来发送第一个消息，然后使用一个round robin的方式选择第二个partition来发送第二个消息，如果有第三个消息，它就会发送到第一个partition里面。我们也可以选择partition来发送，通过增加一个key给消息，Kafka的一个partition是使用同一个key的。

这里讲的在每个partition里面，有一个sequence Id，也就是offset，每个消息一个，按1来步进增加。

<Quick detour>

We can understand those random characters in the log file, using a Kafka tool. Those extra characters might not seem useful to us, but they are useful for Kafka as they are the metadata for each message in the queue. If I run,

kafka-run-class.bat kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files logs\freblogg-2\00000000000000000000.log

This gives the output

Dumping logs\freblogg-2\00000000000000000000.log
Starting offset: 0

offset: 0 position: 0 CreateTime: 1533443377944 isvalid: true keysize: -1 valuesize: 11 producerId: -1 headerKeys: [] payload: Hello World

offset: 1 position: 79 CreateTime: 1533462689974 isvalid: true keysize: -1 valuesize: 6 producerId: -1 headerKeys: [] payload: amazon

(I’ve removed a couple of things from this output which are not necessary for this discussion.)

我们用Kafka的工具来输出一下log里面的内容，这些是一些元数据。

You can see that it stores information of the offset, time of creation, key and value sizes etc along with the actual message payload in the log file.

</Quick detour>

It is also important to note that a partition is tied to a broker. In other words, If we have three brokers and if the folder freblogg-0 exists on broker-1, you can be sure that it will not appear in any of the other brokers. Partitions of a topic can be spread out to multiple brokers but a partition is always present on one single Kafka broker (When the replication factor has its default value, which is 1. Replication is mentioned further below).

partition是和broker绑定在一起的，一个partition只会在一个broker上出现一次。这里的broker似乎是指一台宿主服务器。

但是如何备份呢？按照之前几篇文章的介绍，每个partition会在每个broker上被备份一份。

Segments

We’ll finally talk about those index and log files we’ve seen in the partition directory. Partition might be the standard unit of storage in Kafka, but it is not the lowest level of abstraction provided. Each partition is sub-divided into segments.

这里提到了一个新的概念，segments，尽管partition被作为一个标准的存储单元，但并不是被提供的最低级别的抽象。每一个partition又被划分为segments。

A segment is simply a collection of messages of a partition. Instead of storing all the messages of a partition in a single file (think of the log file analogy again), Kafka splits them into chunks called segments. Doing this provides several advantages. Divide and Conquer FTW!

一个segment是在一个partition里面的一个简单的消息集合。取代把一个partition的所有消息都放在一个单一文件里面的方案是，Kafka把它们分成被称为segment的大块。这是一种分而治之的方法。

Most importantly, it makes purging data easy. As previously introduced partition is immutable from a consumer perspective. But Kafka can still remove the messages based on the “Retention policy” of the topic. Deleting segments is much simpler than deleting things from a single file, especially when a producer might be pushing data into it.

很重要的是，这使得清理数据非常容易。Kafka根据topic上的“Retention policy”（保留策略）来移除消息。这样，删除一个segment会比从一个文件中删除东西要容易的多，尤其在一个producer可能要往里面放数据的时候。（是说整个segment一起删除吗？）

$ ls -lh freblogg-0
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

The `00000000000000000000` in front of the log and the index files in each partition folder, is the name of our segment. Each segment file has segment.log, segment.index and segment.timeindex files.

在每个index文件前面的00000000000000000000，是segment的名字，每个segment由segment.log，segment.index和segment.timeindex构成。

Kafka always writes the messages into these segment files under a partition. There is always an active segment to which Kafka writes to. Once the segment’s size limit is reached, a new segment file is created and that becomes the newly active segment.

会有一个活跃的segment来让Kafka去写入。一旦segment的大小限制达到了，一个新的segment文件会被创建，并且成为新的活跃的segment。

Each segment file is created with the offset of the first message as its file name. So, In the above picture, segment 0 has messages from offset 0 to offset 2, segment 3 has messages from offset 3 to 5 and so on. Segment 6 which is the last segment is the active segment.

每个segment被使用第一个消息的offset来作为它的文件名来创建，（可以看上面的图，我第一次看到这个图，还对这个segment的名字感到奇怪）。

$ ls -lh freblogg*
freblogg-0:
total 20M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121   0 Aug  5 08:26 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121   0 Aug  5 08:26 leader-epoch-checkpoint

freblogg-1:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  68 Aug  5 10:15 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 10:15 leader-epoch-checkpoint

freblogg-2:
total 21M
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.index
- freblogg 197121  79 Aug  5 09:59 00000000000000000000.log
- freblogg 197121 10M Aug  5 08:26 00000000000000000000.timeindex
- freblogg 197121  11 Aug  5 09:59 leader-epoch-checkpoint

In our case, we only had one segment in each of our partitions which is 00000000000000000000. Since we don't see another segment file present, it means that 00000000000000000000 is the active segment in each of those partitions.

The default value for segment size is a high value (1 GB) but let’s say we’ve tweaked Kafka configuration so that each segment can hold only three messages. Let’s see how that would play out.

每个segment的默认大小是1GB，但是让我们调整一下Kafka的配置，让每个segment只能保有3条消息。

Say this is the current state of the freblogg-2 partition. We've three messages pushed into it.

Since ‘three messages’ is the limit we’ve set, If a new message comes into this partition, Kafka will automatically close the current segment, create a new segment, make that the active segment and store that new message in the new segment’s log file.

因为我们设置了3条消息的显示，所以如果有新的消息到来，Kafka将会自动关闭当前的segment，创建一个新的segment，让它成为活跃的segment，并且在新的segmetn里面存储新的消息。

(I'm not showing the preceding zeroes to make it easy on the eyes)

freblogg-2
|-- 00.index
|-- 00.log
|-- 00.timeindex
|-- 03.index
|-- 03.log
|-- 03.timeindex
`--

You should’ve noted that the name of the newer segment is not 01. Instead, you see 03.index, 03.log. So, what is going on?

看到了吧，新的segment出现了，名字是03。

This is because Kafka makes the lowest offset in the segment as its name. Since the new message that came into the partition has the offset 3, that is the name Kafka gives for the new segment. It also means that since we have 00 and 03 as our segments, we can be sure that the messages with offsets 0,1 and 2 are indeed present in the 00 segment. New messages coming into freblogg-2 partition with offsets 3,4 and 5 will be stored in the segment 03.

One of the common operations in Kafka is to read the message at a particular offset. For this, if it has to go to the log file to find the offset, it becomes an expensive task especially because the log file can grow to huge sizes (Default — 1G). This is where the .index file becomes useful. Index file stores the offsets and physical position of the message in the log file.

因为segment可以达到1G，那么查找起来，有一个索引会快很多，Index文件存储着offsets和消息在log文件上的物理位置。

An index file for the log file I’ve showed in the ‘Quick detour’ above would look something like this:

而index文件则会显示类似上面的内容，在1G的空间内，有一个索引来指定位置，这样查找起来会非常的块。

If you need to read the message at offset 1, you first search for it in the index file and figure out that the message is in position 79. Then you directly go to position 79 in the log file and start reading. This makes it quite effective as you can use binary search to quickly get to the correct offset in the already sorted index file.

Parallelism with Partitions

To guarantee the order of reading messages from a partition, Kafka restricts to having only consumer (from a consumer group) per partition. So, if a partition gets messages a,f and k, the consumer will also read them in the order a,f and k. This is an important thing to make a note of as order of message consumption is not guaranteed at a topic level when you have multiple partitions.

消息的顺序仅在partition层面被保证，在topic层面是不被保证的。

Just increasing the number of consumers won’t increase the parallelism. You need to scale your partitions accordingly. To read data from a topic in parallel with two consumers, you create two partitions so that each consumer can read from its own partition. Also since partitions of a topic can be on different brokers, two consumers of a topic can read the data from two different brokers.

仅仅增加consumer的数量并不能增加并行能力，增加partition才可以，不同的partition有可能位于不同的broker上，这样增加了数据的读的能力。

Topics

We’ve finally come to what a topic is. We’ve covered a lot of things about topics already. The most important thing to know is that a Topic is merely a logical grouping of several partitions.

一个Topic只不过是一些partition的一个逻辑上的组。

A topic can be distributed across multiple brokers. This is done using the partitions. But a partition still needs to be on a single broker. Each topic will have its unique name and the partitions will be named from that.

Replication

Let’s talk about replication. Whenever we’re creating a topic in Kafka, we need to specify what the replication factor we need for that topic. Let's say we've two brokers and so we've given the replication-factor as 2. What this means is that Kafka will try to always ensure that each partition of this topic has a backup/replica. The way Kafka distributes the partitions is quite similar to how HDFS distributes its data blocks across nodes.

Say for the freblogg topic that we've been using so far, we've given the replication factor as 2. The resulting distribution of its three partitions will look something like this.

Even when you have a replicated partition on a different broker, Kafka wouldn’t let you read from it because in each replicated set of partitions, there is a LEADER and the rest of them are just mere FOLLOWERS serving as backup. The followers keep on syncing the data from the leader partition periodically, waiting for their chance to shine. When the leader goes down, one of the in-sync follower partitions is chosen as the new leader and now you can consume data from this partition.

你永远是从作为LEADER的partition中去读数据，其余的FOLLOWERS仅仅作为备份。”The followers keep on syncing the data from the leader partition periodically, waiting for their chance to shine. “这句话挺有意思，这些follower周期性地保持着对leader的partition的数据同步，等待着它们能够大放光芒的一天。

A Leader and a Follower of a single partition are never in a single broker. It should be quite obvious why that is so.

Finally, this long article ends. Congratulations on making this far. You now know most of what there is to know about Kafka’s data storage. To ensure that you retain this information let’s do a quick recap.

Recap

Data in Kafka is stored in topics

Topics are partitioned

Each partition is further divided into segments

Each segment has a log file to store the actual message and an index file to store the position of the messages in the log file

Various partitions of a topic can be on different brokers but a partition is always tied to a single broker

Replicated partitions are passive. You can consume messages from them only when the leader is down

That ought to cover everything we’ve talked about. Thanks for reading. See you again in the next one.

Attribution:

Kafka image — https://kafka.apache.org/images/kafka_diagram.png