201707160046 Review: Kafka

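I.

　　1. JMS: Java Message Service

　　2. Why JMS came about:

　　3. JMS defines two messaging models:

　　　　3.1. Queue model, i.e. p2p (point-to-point): each message from a producer is consumed by exactly one consumer

　　　　3.2. Publish-subscribe model (pub-sub): a producer publishes to a topic, and every consumer subscribed to that topic sees the message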

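　　4. Kafka merges the two models above by introducing consumer groups: within a group, each message is consumed by only one of the group's consumers

　　　　With a single group containing a single consumer, it behaves like a queue

　　　　With multiple groups, each group has a consumer that receives the message, which gives publish-subscribe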
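
The consumer-group rule above can be sketched with a small in-memory model. This is only an illustration, not Kafka client code: the `deliver` function and the round-robin choice of consumer inside a group are assumptions standing in for Kafka's partition-based assignment.

```python
from collections import defaultdict

def deliver(messages, groups):
    """Deliver each message once per consumer group.

    groups maps a group name to its list of consumer names. Within a group,
    a message goes to exactly one consumer (round-robin here, as a stand-in
    for Kafka's partition-based assignment).
    """
    received = defaultdict(list)
    for i, msg in enumerate(messages):
        for group, consumers in groups.items():
            consumer = consumers[i % len(consumers)]
            received[(group, consumer)].append(msg)
    return dict(received)

# One group with one consumer: behaves like a queue.
queue_like = deliver(["m0", "m1"], {"g1": ["c1"]})

# Two groups: every group sees every message (publish-subscribe),
# while the consumers inside g1 split the messages between them.
pubsub_like = deliver(["m0", "m1"], {"g1": ["c1", "c2"], "g2": ["c3"]})
```

In `pubsub_like`, group `g2` receives both messages through its only consumer, while `c1` and `c2` in `g1` each receive one.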

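　　5. Kafka features:

　　　　Persistent messages: supports TB-scale storage

　　　　High throughput: supports millions of messages per second

　　　　Distributed: supports message partitioning

　　　　Multi-client support: clients are available in many languages

　　6. Install Kafka, then start it: kafka-server-start.sh config/server.properties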

　　7. Contents of the kafka/config/server.properties configuration file:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.

broker.id=0

# Switch to enable topic deletion or not, default value is false.
# Setting it to true (as here) allows topics to be deleted.
delete.topic.enable=true

############################# Socket Server Settings #############################
# Port Kafka listens on; the default is 9092.
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://0.0.0.0:9092

# Hostname and port the broker will advertise to producers and consumers. If not set,
# it uses the value for "listeners" if configured. Otherwise, it will use the value
# returned from java.net.InetAddress.getCanonicalHostName().
advertised.listeners=PLAINTEXT://192.168.40.128:9092

# Maps listener names to security protocols, the default is for them to be the same. See the config documentation for more details
#listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL

# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600

############################# Log Basics #############################

# A comma separated list of directories under which to store log files

log.dirs=/home/ubuntu/kafka/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
# 1. Durability: Unflushed data may be lost if you are not using replication.
# 2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
# 3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
# (flush every 10000 messages, even if the time interval has not elapsed yet)
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
# (flush every 1000 ms, even if fewer than 10000 messages arrived in that time)
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion due to age (this default keeps logs for 7 days)
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824

# The maximum size of a log segment file (1 GiB here). When this size is reached the log rolls over and a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies (300000 ms = 300 s, i.e. a check every 5 minutes)
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=s129:2181,s130:2181,s131:2181

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
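
The retention settings in the file above (log.retention.hours and log.retention.bytes) say that a segment is removed when either criterion is met, starting from the oldest segment. The sketch below models that rule under simplifying assumptions: `Segment` and `segments_to_delete` are hypothetical names, ages are given in hours, and the newest (active) segment is never deleted. It is not the broker's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    size_bytes: int
    age_hours: float  # age of the newest record in the segment

def segments_to_delete(segments, retention_hours=168, retention_bytes=None):
    """Return names of segments eligible for deletion, oldest first.

    `segments` is ordered oldest -> newest; the last one is treated as the
    active segment and is never deleted in this sketch.
    """
    doomed = []
    total = sum(s.size_bytes for s in segments)
    for seg in segments[:-1]:
        too_old = seg.age_hours > retention_hours
        # Size rule: delete only while the remaining log stays >= the limit.
        too_big = (retention_bytes is not None
                   and total - seg.size_bytes >= retention_bytes)
        if not (too_old or too_big):
            break  # deletion proceeds strictly from the oldest end
        doomed.append(seg.name)
        total -= seg.size_bytes
    return doomed

segs = [
    Segment("0.log", 100, 200),    # older than 168 hours
    Segment("100.log", 100, 100),
    Segment("200.log", 50, 1),     # active segment
]
by_age = segments_to_delete(segs)
by_size = segments_to_delete(segs, retention_hours=1000, retention_bytes=50)
```

With the default 168 hours only `0.log` qualifies; with a 50-byte size limit and a long time limit, both closed segments qualify.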


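　　8. Creating, deleting, altering, and inspecting topics:

　　　　kafka-topics.sh --alter (modify a topic)

　　　　kafka-topics.sh --config (per-topic configuration that overrides the global settings: replica count, retention policy)

　　　　kafka-topics.sh --create (create a topic)

　　　　kafka-topics.sh --delete (delete a topic)

　　　　kafka-topics.sh --delete-config (remove a per-topic configuration override)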

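　　9. In Kafka, producers push messages to the broker, and consumers then pull messages from it

　　10. The producer maintains a connection pool to the brokers and learns of broker changes in real time through ZooKeeper watcher callbacks (this describes the old ZooKeeper-based producer; newer clients discover brokers from the bootstrap.servers list instead)

　　11. Message compression: gzip or snappy; the producer compresses messages and the consumer decompresses them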

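　　12. Producer write path:

　　　　The producer looks up, via ZooKeeper, the leader among the replicas of the target partition

　　　　It sends the message to the leader

　　　　The leader immediately appends it to its log

　　　　The other replica brokers pull the data from the leader

　　　　Once a replica broker has finished pulling, it sends an ack to the leader

　　　　Once the leader has received acks from all replicas, it tells the producer that the write to this partition is complete

　　13. Synchronous replication: every replica must finish writing; the producer is told the write succeeded only after all replicas have acked

　　　　Asynchronous replication: as soon as the leader's write succeeds, success is returned to the producer.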
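
The write path and the difference between synchronous and asynchronous replication can be modeled in a few lines. This is a toy model under stated assumptions: the `Partition` class and its method names are invented for illustration, and replication is collapsed into a single in-process call instead of network fetches.

```python
class Partition:
    """Toy model of one partition: a leader log plus follower logs."""

    def __init__(self, follower_count):
        self.leader_log = []
        self.follower_logs = [[] for _ in range(follower_count)]

    def replicate(self):
        """Followers pull whatever they are missing from the leader."""
        for log in self.follower_logs:
            log.extend(self.leader_log[len(log):])

    def write(self, message, wait_for_followers):
        """Append to the leader; return the offset once the write is acked."""
        self.leader_log.append(message)
        if wait_for_followers:
            self.replicate()  # synchronous: ack only after all replicas have it
        return len(self.leader_log) - 1

sync_p = Partition(follower_count=2)
sync_offset = sync_p.write("a", wait_for_followers=True)

async_p = Partition(follower_count=2)
async_offset = async_p.write("a", wait_for_followers=False)
# If the leader failed right now, a new leader would only have what the
# followers already pulled, so the acked message would be gone.
lost_on_failover = "a" not in async_p.follower_logs[0]
```

Both writes return offset 0, but only the synchronous one leaves the message on every replica at the moment the producer is told it succeeded.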

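　　14. API usage:

　　15. Why Kafka achieves high throughput:

　　　　One more reason: a topic has multiple partitions, and data can be written to several partitions at the same time, which also contributes to the efficiency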
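
Spreading writes over several partitions depends on how a record's key is mapped to a partition. The sketch below hashes the key and takes it modulo the partition count. CRC32 is used here only because it is in the Python standard library; it is an assumption and not the hash Kafka's own partitioner uses. `choose_partition` is a hypothetical helper.

```python
import zlib
from collections import Counter

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing (CRC32 is an assumption here)."""
    return zlib.crc32(key) % num_partitions

# Many distinct keys spread across the partitions and can be written in parallel.
keys = [f"user-{i}".encode() for i in range(1000)]
load = Counter(choose_partition(k, 4) for k in keys)

# A single hot key always lands on the same partition, which is one way
# a partition can end up with far more data than the others.
hot = Counter(choose_partition(b"hot-key", 4) for _ in range(100))
```

Inspecting `load` shows how evenly the keys were distributed, while `hot` contains exactly one partition holding all 100 records.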

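　　16. Questions about producing data:

　　　　Q1: How do we know whether a message was sent successfully? If the send fails and nothing reaches the broker, what should happen? If the broker received the message but the ack was lost to a network problem, so the producer does not know it arrived, what should happen?

　　　　Q2: When a message reaches the broker, is it held in memory first? Under what conditions is it flushed to disk?

　　　　Q3: Suppose a message belongs on partition 1, and that partition has 3 replicas on different brokers. Is the message sent to one broker's memory and then copied by the others, or sent directly to the memory of all 3 brokers?

　　　　Q4: A message has reached the broker and sits only in memory, with nothing logged yet. If the broker crashes at that moment, how is the data recovered? Assume the partition has one replica.

　　　　Q5: How are at-most-once, at-least-once, and exactly-once achieved when producing?

　　　　Q6: Since Kafka also has partitions, can data skew occur, with one partition holding far more data than the others? How is that handled?

　　　　Q7: Optimizing the send path

　　　　Q8: The KafkaProducer send process:

　　　　　　1. Build a ProducerRecord<Integer, String> object; the key is used for partitioning and the value is the message body

　　　　　　2. Call the send method

　　　　　　3. Locally there is an accumulator object (a buffer pool) that stores the data in MemoryRecords, i.e. in memory

　　　　　　4. The sender starts; sender is a thread class whose run method does the work

　　　　　　5. I could not follow the code beyond this point.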
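
The first four steps of the send process (build a record, call send, park it in an in-memory accumulator, let a sender thread do the I/O) can be sketched as follows. `ToyProducer` and its `transport` callable are hypothetical; the real producer also flushes batches on a timer and groups them per partition, which this sketch leaves out.

```python
import queue
import threading

class ToyProducer:
    def __init__(self, batch_size, transport):
        self.batch_size = batch_size
        self.transport = transport      # callable that "sends" a list of records
        self.buffer = queue.Queue()     # stands in for the accumulator
        self.sender = threading.Thread(target=self._run, daemon=True)
        self.sender.start()

    def send(self, key, value):
        """Steps 1-3: build the record and park it in memory; returns at once."""
        self.buffer.put((key, value))

    def close(self):
        self.buffer.put(None)           # sentinel: flush and stop
        self.sender.join()

    def _run(self):
        """Step 4: the sender thread drains the buffer in batches."""
        batch = []
        while True:
            record = self.buffer.get()
            if record is None:
                break
            batch.append(record)
            if len(batch) >= self.batch_size:
                self.transport(batch)
                batch = []
        if batch:
            self.transport(batch)       # flush the final partial batch

sent_batches = []
producer = ToyProducer(batch_size=2, transport=sent_batches.append)
for i in range(5):
    producer.send(i, f"msg-{i}")
producer.close()
```

Five records with a batch size of 2 leave the sender as three batches of 2, 2, and 1 records, and `send` never blocks on the transport.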

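　　17. Questions about consuming data:

　　　　Q1: How are at-most-once, at-least-once, and exactly-once achieved when consuming?

　　　　Q2: How are offsets handled?

　　　　Q3: How can data be consumed again?

　　　　Q4: With 10 partitions and 4 consumers in one group, how is it decided which partitions each consumer reads?

　　　　Q5: Optimizing consumption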
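
For Q4, one plausible scheme is a range-style split: give every consumer a contiguous block of partitions and hand the leftovers to the first consumers. The sketch below illustrates that idea only; `range_assign` is a hypothetical helper, and the strategy a real consumer group uses is configurable.

```python
def range_assign(num_partitions, consumers):
    """Split partitions 0..n-1 into contiguous ranges, one per consumer."""
    consumers = sorted(consumers)
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers each take one additional partition.
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

plan = range_assign(10, ["c1", "c2", "c3", "c4"])
```

With 10 partitions and 4 consumers, `c1` and `c2` get three partitions each (0-2 and 3-5) and `c3` and `c4` get two each (6-7 and 8-9).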

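　　18. Answers to the questions above

　　　　What Kafka does: store data, persist data, and provide fault tolerance for data.

　　　　The producer and consumer are just clients of Kafka: the producer sends data to the Kafka server, and the consumer consumes data from the Kafka server

　　　　1. If the producer cannot connect to the Kafka server when sending, the client-side send call fails with an error and, if the error is not handled, the client program terminates.
　　　　　　Alternatively, the client can catch the exception and alert the administrator by email or SMS that the Kafka cluster is unhealthy. At this point there has been no interaction with the Kafka server at all.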

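　　　　2. The Kafka server's job is only this: receive a message, store it, and send an ack back to the producer.
　　　　　　When data is sent to the cluster normally, it goes to a specific partition of a specific topic. That partition has replicas, split into a leader and followers.
　　　　　　After the leader receives the data:
　　　　　　　　If acks = 0, the producer does not wait for any acknowledgement and does not care whether the send succeeded.
　　　　　　　　If acks = 1, the leader does not wait for the followers to copy the data from it and acks the producer right away, telling it the message was received.
　　　　　　　　　　　But if the leader crashes just after sending the ack, before the followers have synced the message, that message is lost for good.
　　　　　　　　If acks = all, the leader waits for the followers to finish syncing before it acks the producer.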
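　　　　3. The consumer itself tracks how far it has consumed. The position can be kept in the client's memory, in a local file on the client, in ZooKeeper (the old API, where the application manages offsets itself), or in a dedicated Kafka topic that stores offsets (which is what the new API does).
　　　　4. Summary: Kafka's one job is to accept data and store it. It does not care whether data was sent successfully or consumed successfully.
　　　　　　Whether a send succeeded, what to do after a failure, and whether to resend are the producer's responsibility.
　　　　　　Whether data was consumed successfully, how far consumption has progressed, and whether to re-consume are the consumer's responsibility.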
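　　　　5. How the producer makes sure data is sent successfully:
　　　　　　1. Set acks = all.
　　　　　　2. With acks = 1, an offset of -1 coming back means the message failed to send (the callback's exception argument is also set in that case), and it can be resent from the callback.
　　　　　　3. To guarantee delivery, acks must never be set to 0.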
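
The resend-on-failure idea can be sketched independently of any client library. Here `send_once` is a hypothetical stand-in for a producer send that reports failure either by raising `ConnectionError` or by returning an offset of -1; neither the function nor the retry limit comes from the Kafka API.

```python
def send_with_retry(send_once, record, max_retries=3):
    """Call send_once(record) until it returns an offset >= 0.

    Returns (offset, attempts); raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_retries + 2):
        try:
            offset = send_once(record)
        except ConnectionError:
            offset = -1                 # treat a failed connection like a failed send
        if offset >= 0:
            return offset, attempt
    raise RuntimeError(f"gave up on {record!r} after {max_retries + 1} attempts")

# A flaky transport that fails twice, then succeeds with offset 42.
outcomes = iter([-1, -1, 42])
result = send_with_retry(lambda rec: next(outcomes), "hello")
```

Note the trade-off this implies for Q1: if the broker stored the message and only the ack was lost, a retry writes it a second time, so retrying gives at-least-once rather than exactly-once delivery.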
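　　　　6. How the consumer makes sure data is consumed correctly:
　　　　　　1. Consuming involves two steps: processing the data and updating the offset.
　　　　　　2. Process first, then update the offset: if the machine crashes after the data is processed but before the offset is updated, the data may be consumed twice
　　　　　　3. Update the offset first, then process: if the machine crashes right after the offset is updated, the data may never be processed
　　　　　　4. Treat processing and the offset update as one transaction: either both succeed or both fail.
　　　　　　5. The problem with 4: contacting ZooKeeper to update the offset after every single message is too cumbersome and wastes resources, so in practice the offset is recorded once per batch.
　　　　　　6. Common practice:
　　　　　　　　Duplicates acceptable, loss not acceptable: process the data first, and record the offset once the batch has been processed
　　　　　　　　Loss acceptable, duplicates not acceptable: record the offset of the last record in the batch first, then process the batch
　　　　　　　　Data consumed exactly once: make every batch a single message, and record the offset after processing it. (A crash between the two steps can still replay that one message, so true exactly-once also needs the transaction from 4 or idempotent processing.)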
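
The effect of the two orderings can be checked with a small crash simulation. This is a simplified model: `consume` is a hypothetical function, offsets are committed per message instead of per batch, and a "crash" is just an early return between the two steps.

```python
def consume(messages, committed, commit_first, crash_at=None):
    """Process messages starting from the committed offset.

    Returns (processed, new_committed). If crash_at equals the offset being
    handled, the consumer dies between its two steps for that message.
    """
    processed = []
    offset = committed
    while offset < len(messages):
        if commit_first:
            committed = offset + 1
            if offset == crash_at:
                return processed, committed   # crashed before processing
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])
            if offset == crash_at:
                return processed, committed   # crashed before committing
            committed = offset + 1
        offset += 1
    return processed, committed

msgs = ["a", "b", "c"]

# At least once: process, then commit. Crash on "b", then restart.
first, pos = consume(msgs, 0, commit_first=False, crash_at=1)
second, _ = consume(msgs, pos, commit_first=False)
at_least_once = first + second

# At most once: commit, then process. Crash on "b", then restart.
first, pos = consume(msgs, 0, commit_first=True, crash_at=1)
second, _ = consume(msgs, pos, commit_first=True)
at_most_once = first + second
```

Processing before committing yields `a, b, b, c` (a duplicate), while committing before processing yields `a, c` (a loss).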
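　　　　7. The consumer has two APIs:
　　　　　　　　Kafka provides two consumer APIs: the high-level API and the simple API.
　　　　　　　　Simple API (offsets managed manually): a low-level API that maintains a connection to a single broker and is completely stateless; every request must specify an offset, which also makes it the most flexible API.
　　　　　　　　High-level API (offsets managed automatically): wraps access to the set of brokers in the cluster and consumes a topic transparently. It keeps track of which messages have been consumed, so each read returns the next message.
　　　　　　　　　　　With the high-level API, a topic named __consumer_offsets appears under the directory configured by log.dirs. This topic records offsets, and its data is compacted. An offset commit succeeds only after both the leader and the followers of that topic have written it; if the offsets cannot be replicated within a configurable timeout, the offset commit fails and the high-level API retries after backing off. These offset topics are compacted periodically, because only the most recent offset commit per partition needs to be kept.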
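
Why the offsets topic can be compacted is easy to see in code: for each group, topic, and partition only the newest commit matters. The sketch below models that with a dictionary; `compact` and the tuple layout of a commit are assumptions for illustration and do not reflect the real record format of __consumer_offsets.

```python
def compact(offset_log):
    """Keep only the latest committed offset per (group, topic, partition)."""
    latest = {}
    for group, topic, partition, offset in offset_log:
        latest[(group, topic, partition)] = offset  # later entries overwrite earlier ones
    return latest

commits = [
    ("g1", "orders", 0, 5),
    ("g1", "orders", 1, 7),
    ("g1", "orders", 0, 9),   # supersedes the first entry
    ("g2", "orders", 0, 2),
]
compacted = compact(commits)
```

Four commit records shrink to three entries, and group `g1` resumes partition 0 from offset 9.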
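　　　　8. Optimization: Kafka cluster, producer, and consumer
　　　　　　Kafka (broker) tuning:
　　　　　　　　log.segment.bytes  Size of a segment file; once it is exceeded, a new segment is created automatically. Can be overridden by the topic-level setting.
　　　　　　　　log.retention.check.interval.ms  How often segments are checked against the retention policy.
　　　　　　Producer tuning: increase the batch size on the client and send messages in batches to reduce network requests
　　　　　　　　batch.num.messages  In async mode, the number of messages buffered per batch. The producer sends once this count is reached.
　　　　　　　　Compress messages to reduce the amount of data sent over the network
　　　　　　　　request.timeout.ms  Lower this to reduce the time spent waiting for an ack
　　　　　　　　producer.type  Sync or async mode: async means asynchronous, sync means synchronous. In async mode the producer can push data in batches, which greatly improves broker performance; async is recommended.
　　　　　　　　queue.buffering.max.ms  In async mode, how long the producer buffers messages. With a value of 1000, for example, it buffers 1 second of data and sends it in one go, which greatly increases broker throughput at the cost of timeliness.
　　　　　　　　queue.buffering.max.messages  In async mode, the maximum number of messages held in the producer's buffer queue. Beyond this number the producer either blocks or drops messages.
　　　　　　　　message.send.max.retries  Number of retries when a send fails. Network problems can lead to repeated retries.
　　　　　　Consumer tuning:
　　　　　　　　auto.commit.enable  If true, the consumer periodically saves its current offset to ZooKeeper. After a failure and restart, the consumer resumes from this value.
　　　　　　　　auto.commit.interval.ms  How often the consumer commits offsets to ZooKeeper.
　　　　　　　　consumer.timeout.ms  If no message is available for consumption within this time, the consumer throws an exception.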

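　　　　9. How does Kafka guarantee message order? Order is guaranteed only within a single partition, so strict ordering across a whole topic requires using just one partition. MetaQ can guarantee message order.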

posted @ 2017-09-04 17:14 IT豪哥
