Kafka Study Notes

Study time: 2020-02-03 10:03:41

Official site: http://kafka.apache.org/intro.html


Kafka: an introduction to the message queue

Kafka has been developing very quickly over the past two years, and it has progressed rapidly since the 1.0.0 release.

Scala: Kafka's core code is written in Scala.

Next, the focus is on how to use Kafka from Spring Boot to send and receive messages.

  1. Learn Kafka and its basic usage
  2. Integrate Kafka with Spring Boot

The learning here follows the official documentation.


Introduction


Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

The message-queue (MQ) oriented development model is already used heavily across the internet industry.

The traditional model

The MQ-oriented model

Advantages of the MQ-oriented programming style:

  1. It achieves the software-engineering ideal of high cohesion and low coupling.
  2. It provides a sensible way to handle high-concurrency problems.

Apache Kafka® is a distributed streaming platform. What exactly does that mean?

A streaming platform has three key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
  • Store streams of records in a fault-tolerant durable way.
  • Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:

  • Building real-time streaming data pipelines that reliably get data between systems or applications
  • Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers that can span multiple datacenters.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.

Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
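To make the first two APIs concrete, here is a minimal Java sketch using the official kafka-clients library. The broker address localhost:9092, the topic name my-topic, and the group name demo-group are assumptions for illustration only, not values from these notes:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer API: publish a stream of records to a topic.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("my-topic", "key-1", "hello kafka"));
        }

        // Consumer API: subscribe to the topic and process the records produced to it.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");                 // assumed group name
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```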

The relationship between these four APIs is shown in the diagram on the official introduction page.

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

(Figure from the official docs: the anatomy of a topic, with each partition as an ordered, append-only log.)

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

Note: there is no ordering across partitions; ordering is strict only within a single partition.

The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

The figure in the official docs here illustrates how producers and consumers work against the partitioned log.

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
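Since the offset is under the consumer's control, replaying or skipping data is just a matter of seeking. A minimal sketch, assuming a local broker, a topic named my-topic with a partition 0, and a made-up group name:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetControlSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "replay-group");               // assumed group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // Jump back to the very beginning to reprocess historical records...
            consumer.seekToBeginning(Collections.singletonList(partition));
            // ...or skip ahead to the end and only consume from "now" on:
            // consumer.seekToEnd(Collections.singletonList(partition));
            // ...or rewind to an arbitrary offset of our choosing:
            // consumer.seek(partition, 42L);

            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.offset() + " -> " + r.value()));
        }
    }
}
```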

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

Note: this involves a leader-election algorithm, which is what balances load across the cluster.
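To see this leader/follower layout for yourself, the Java AdminClient can describe a topic. A small sketch; the broker address and the topic name my-topic are assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderInspectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("my-topic")).all().get();
            topics.get("my-topic").partitions().forEach(p ->
                    // Each partition reports one leader plus its replicas (the followers).
                    System.out.printf("partition=%d leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas()));
        }
    }
}
```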

Geo-Replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
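A hedged sketch of those options in Java: no key (let the default partitioner balance load), a semantic key (the same key always lands in the same partition), or an explicit partition. The topic name orders and the broker address are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: the default partitioner spreads records across partitions
            // purely to balance load.
            producer.send(new ProducerRecord<>("orders", "no-key-payload"));

            // With a key: records with the same key always hash to the same
            // partition, so per-key ordering is preserved.
            producer.send(new ProducerRecord<>("orders", "customer-42", "keyed-payload"));

            // Explicit partition: full manual control over placement.
            producer.send(new ProducerRecord<>("orders", 0, "customer-42", "pinned-payload"));
        }
    }
}
```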

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

Note: in other words, Kafka provides both unicast and broadcast semantics.

If five consumers are in the same group, only one of them will receive each message (according to the assignment strategy).

If the five consumers are in different groups, then every one of them receives each message.
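A minimal sketch of how the group id drives this behavior: instances started with the same group.id split the records among themselves (queue semantics), while an instance with a different group.id receives its own full copy of the stream (broadcast semantics). Broker address, topic, and group names are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupSketch {

    // Start several instances with the SAME group id ("group-A") and each record is
    // delivered to only one of them (load balancing / queue semantics).
    // Start another instance with a DIFFERENT group id ("group-B") and it receives
    // every record as well (broadcast / publish-subscribe semantics).
    static void runConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println(groupId + " got: " + r.value()));
            }
        }
    }

    public static void main(String[] args) {
        runConsumer(args.length > 0 ? args[0] : "group-A");
    }
}
```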


A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Multi-tenancy

You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.

Guarantees

At a high-level Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

Note: a replication factor of N exists precisely to tolerate server failures.

Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Note: this is the principle behind some of Kafka's best features.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
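That wait-on-acknowledgement behavior is driven by the producer's acks setting. A minimal sketch, assuming a local broker and a made-up topic name:

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the send is only acknowledged once the leader and all
        // in-sync replicas have persisted the record.
        props.put("acks", "all");
        props.put("retries", "3");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("my-topic", "durable message"))
                    .get(10, TimeUnit.SECONDS);   // block until the acknowledgement arrives
            System.out.println("committed at offset " + meta.offset());
        }
    }
}
```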

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

For details about the Kafka's commit log storage and replication design, please read this page.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data, the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
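A minimal sketch of the Streams API: read one topic, transform each record, and write the result to another topic. The application id, topic names, and broker address are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Consume an input stream, transform each record, produce an output stream.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```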

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise for streaming data pipelines the combination of subscription to real-time events make it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably make it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.

That concludes the official introduction; read through the documentation above on your own.


ZooKeeper official site: introduction (look it up yourself)

Kafka depends heavily on ZooKeeper (another Apache component).

ZooKeeper: a distributed coordination framework. It serves as a configuration center and a service registry, and it can provide distributed locks (a role also covered by Redis or ZooKeeper).

Curator: an Apache library for ZooKeeper; with Curator you can get things done much faster than with ZooKeeper's native API.

When picking up a new technology, you go and look for resources and tutorial videos.

What matters most is learning how to learn a new technology.


Hands-on practice

  1. Download Kafka
  2. Start Kafka using the ZooKeeper bundled with it
  3. Produce a message with a producer and have it consumed

Steps

  1. Download Kafka

  2. zookeeper.properties (adjust the data directory path)

  3. server.properties (adjust log.dirs first)

  4. Run zookeeper-server-start.sh config/zookeeper.properties with the corresponding config file.

    ZooKeeper starts successfully.

  5. Run bin/kafka-server-start.sh config/server.properties

    Kafka starts successfully.

Producer and consumer example: start everything up and test.
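Before wiring up the producer and consumer, it helps to check that the freshly started broker is reachable. A minimal sketch using the Java AdminClient; the broker address is an assumption:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;

public class BrokerSmokeTest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // If ZooKeeper and the broker started correctly, both calls succeed.
            System.out.println("cluster id: " + admin.describeCluster().clusterId().get());
            System.out.println("topics:     " + admin.listTopics().names().get());
        }
    }
}
```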

https://www.bilibili.com/video/av79288673/?p=51

Key terms: a closer look at the core concepts

Producer: publishes records to topics.

Consumer: subscribes to topics and processes the records published to them.

Broker: a single Kafka server in the cluster.

Topic: the named category (feed) that records are published to.

Record: a message, consisting of a key, a value, and a timestamp.

Cluster: one or more brokers working together.

Controller: the broker elected to manage partition leaders and cluster metadata.

Consumer Group: a set of consumers sharing a group name; each record is delivered to exactly one consumer in each subscribing group.

Cluster membership is maintained with a heartbeat mechanism.

Replicas exist to guarantee high availability.

An explanation of the four files related to offsets in the log directory

The partition directory

By default the internal __consumer_offsets topic has 50 partitions, so you will see directories such as __consumer_offsets-0 through __consumer_offsets-49.

000000.index (the offset index)

000000.log (the file where the message data is stored)

000000.timeindex (the timestamp index)

About Kafka partitions

  1. Each partition is an ordered, immutable sequence of messages. New messages are continually appended to the end of the partition, which amounts to a structured commit log (similar to Git's commit log).
  2. Every message in a partition is assigned a sequential id (the offset) that uniquely identifies it within that partition.

Why partitions matter

  1. The message data in a partition is stored in log files, and the messages within a single partition are strictly ordered by the order in which they were sent. Logically, a partition corresponds to a log: when a producer writes messages into a partition, it is really writing into the log that backs that partition. The log itself is a logical concept that maps to a directory on disk; one log is made up of multiple segments, and each segment corresponds to an index file plus a data file.
  2. Partitions are what make horizontal scaling of Kafka servers possible. Any single machine, physical or virtual, has an upper limit on what it can handle, and once that limit is reached it cannot be scaled further; vertical scaling is always bounded by hardware. With partitions we can spread one topic's messages across different Kafka servers (this requires a Kafka cluster), so when capacity runs out we simply add machines and create new partitions on them, which in theory gives unlimited horizontal scalability (a sketch of creating a multi-partition topic follows this list).
  3. Partitions also enable parallelism: messages sent to a topic are distributed across that topic's partitions, so sending and processing can happen in parallel, with multiple partitions receiving the incoming messages.
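As a companion to point 2 above, a hedged sketch of creating a multi-partition topic programmatically with the Java AdminClient. The topic name, partition count, replication factor, and broker address are all assumptions:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic's messages (and its load) across brokers;
            // replication factor 1 because this sketch assumes a single local broker.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
            System.out.println("created: " + topic.name());
        }
    }
}
```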

Segments

A partition is made up of an ordered, immutable sequence of messages. The number of messages in a partition can be very large, so clearly they cannot all be kept in a single file. Therefore, much like log4j's rolling logs, once the messages in a partition grow past a certain size the message file is rolled: new messages are written to a new file, and when that file grows past the threshold yet another file is started, and so on.

Physically, therefore, a partition consists of one or more segments, and each segment holds the actual message data.

The relationship between partitions and segments

  1. Each partition is effectively one large file split across multiple equally sized segment data files. The number of messages per segment is not necessarily equal (it depends heavily on message size, since different messages obviously occupy different amounts of disk space). This makes it easy to delete old segment files, which helps disk utilization.
  2. Each partition only needs to support sequential reads and writes. The lifetime of a segment file is determined by the Kafka server configuration; for example, log.retention.hours=168 in server.properties means the log expires after 7 days.

The meaning and purpose of the files in a partition directory

File naming and format details

  1. 0000000000000000000000.index: the index file of a segment; it always appears paired with the corresponding .log data file.
  2. 0000000000000000000000.log: the segment's data file, which stores the actual messages in binary format. Segment files are named as follows: a partition's first segment starts from 0, and every later segment is named after the offset of the last message in the previous segment, padded with zeros. Because this topic contains very few messages, there is only one data file here.
  3. 0000000000000000000000.timeindex: an index file based on message timestamps, used mainly when looking up messages by date or time, and also by time-based log retention policies. It was only added in newer Kafka versions (older versions do not have it) and complements the *.index file: *.index is an offset-based index, while *.timeindex is a timestamp-based index.
  4. leader-epoch-checkpoint: a cache file for the leader. It is an important file related to Kafka's HW (high watermark) and LEO (log end offset).

Think about what problems a message queue product solves, what its characteristics are, and what it can do for you.

Important Kafka scripts and commands

  1. Start ZooKeeper
  2. Start Kafka
  3. Create a topic
  4. Have a producer send messages to the specified topic

The --zookeeper parameter is deprecated; newer versions use the --bootstrap-server parameter instead.

Running the command with no arguments prints help for all of its parameters, including a detailed description of each option.

So where is the topic information kept? It is stored in ZooKeeper; but how is it stored there?

Looking at ZooKeeper, the topics stored there are listed right away.

This is only a quick first look at ZooKeeper; if you are interested or need it later, study it on your own. It is worth doing.

Multiple consumers consuming


Next: using consumer groups


Test conclusion: within the same consumer group, only one consumer receives each message.

Deleting a topic

For Kafka, deletion is actually a fairly hard problem.

It has to preserve data integrity and consistency.

The deletion flow for a single partition (the following steps run automatically):

  1. Rename the partition directory
  2. Delete the .log file
  3. Delete the .index file
  4. Delete the .timeindex file
  5. Delete the renamed directory
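For reference, deletion can also be triggered programmatically; the broker then performs the rename-and-delete steps above asynchronously. A minimal sketch, assuming a local broker, a made-up topic name, and delete.topic.enable=true on the broker:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;

public class DeleteTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // The broker marks the topic for deletion and then walks through the
            // rename / delete-segment-files steps described above on its own.
            admin.deleteTopics(Collections.singletonList("demo-topic")).all().get();
        }
    }
}
```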


If there are multiple directories, they are all renamed at the same time and then the same steps are applied to each of them.


When a Kafka topic is deleted, the corresponding entries in ZooKeeper are deleted as well.

Next, look at what happens on the ZooKeeper side while the topic is deleted.


You will find that it has already been removed there.


Everything explained above comes from the official documentation.


That wraps up the introduction to Kafka concepts.

It is only a basic overview, not the whole picture.

If you need the core concepts in depth, such as ISR and ACK, study them further on your own.

Integrating Kafka with Spring Boot

With the concepts sorted out, it is time to start using it. Don't lose sight of your goal.

In principle, a producer/consumer setup like this needs at least two separate projects.

Steps to integrate Kafka with Spring Boot

  1. Add the required dependencies: org.springframework.kafka:spring-kafka and com.google.code.gson:gson.

  2. Create an entity class (a message object that carries the data; any custom class will do).

  3. Create the producer class (see the sketch after this list).

  4. Create the consumer (see the sketch after this list).

  5. Add the Kafka settings to the configuration file (see the sketch after this list).

  6. Write a controller so the endpoints can be hit from a browser: a GET method and a POST method (see the sketch after this list). To test, use the curl command-line tool, or a tool such as Postman; curl is used here.

  7. Start ZooKeeper and start Kafka (from the command line).
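The screenshots for steps 2 to 6 did not survive, so here is a hedged, self-contained sketch of what those pieces typically look like with spring-kafka. All names here (the package, the topic "demo-topic", the group "demo-group", the endpoints, the Message fields) and the broker address are assumptions for illustration, not the exact code from the original notes:

```java
// Assumed package and names throughout; a sketch, not the original project's code.
package com.example.kafkademo;

// Dependencies (step 1): org.springframework.kafka:spring-kafka, com.google.code.gson:gson
//
// application.properties (step 5), with assumed values:
//   spring.kafka.bootstrap-servers=localhost:9092
//   spring.kafka.consumer.group-id=demo-group
//   spring.kafka.consumer.auto-offset-reset=earliest

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.*;

import com.google.gson.Gson;

@SpringBootApplication
public class KafkaDemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(KafkaDemoApplication.class, args);
    }
}

// Step 2: the entity class, a plain object that carries the payload.
class Message {
    private Long id;
    private String body;
    public Message() { }
    public Message(Long id, String body) { this.id = id; this.body = body; }
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getBody() { return body; }
    public void setBody(String body) { this.body = body; }
}

// Step 3: the producer, wrapping the auto-configured KafkaTemplate and sending the entity as JSON.
@Service
class MessageProducer {
    private static final String TOPIC = "demo-topic";   // assumed topic name
    private final Gson gson = new Gson();
    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    public void send(Message message) {
        kafkaTemplate.send(TOPIC, gson.toJson(message));
    }
}

// Step 4: the consumer, a listener bound to the same topic and a consumer group.
@Component
class MessageConsumer {
    @KafkaListener(topics = "demo-topic", groupId = "demo-group")
    public void onMessage(String payload) {
        System.out.println("received: " + payload);
    }
}

// Step 6: a controller exposing GET and POST endpoints that trigger a send.
@RestController
class MessageController {
    @Autowired
    private MessageProducer producer;

    @GetMapping("/send")
    public String sendViaGet(@RequestParam String body) {
        producer.send(new Message(System.currentTimeMillis(), body));
        return "sent: " + body;
    }

    @PostMapping("/send")
    public String sendViaPost(@RequestBody Message message) {
        producer.send(message);
        return "sent: " + message.getBody();
    }
}
```

To exercise it from the command line, something like `curl "http://localhost:8080/send?body=hello"` hits the GET endpoint, and `curl -H "Content-Type: application/json" -d '{"id":1,"body":"hello"}' http://localhost:8080/send` hits the POST one (port and paths are assumptions as well).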

Start the application and verify that it works.

JSON Viewer plugin (Chrome)

The mainstream message brokers today:

Kafka

ActiveMQ

RabbitMQ

RocketMQ

ZeroMQ


Done! 🎉
