Flume Configuration and Hands-on Examples
Flume Basics
Installing netcat (a lightweight network communication tool on Linux)
# Desktop installs may already include it; environment: CentOS 7
sudo yum install -y nc
# Start the server side (listening on port 9999)
nc -lk 9999
# In another window, connect a client to port 9999
nc localhost 9999
Testing the official Flume example
# Check whether port 44444 is already in use
sudo netstat -nlp | grep 44444
# Create the Flume agent configuration file flume-netcat-logger.conf
[flume]$ mkdir job
[flume]$ cd job/
# Create the agent configuration file flume-netcat-logger.conf in the job folder
[job]$ vim flume-netcat-logger.conf
Add the following to flume-netcat-logger.conf:
# Name the components on this agent
# Name the components on this agent; the agent name (must be unique) is a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
# Host/node to bind to
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Total capacity: 1000 events
a1.channels.c1.capacity = 1000
# Commit a transaction on channel c1 once 100 events have been collected
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
# A source (r1) can be bound to multiple channels
# Connect r1 to c1
a1.sources.r1.channels = c1
# Connect k1 to c1
a1.sinks.k1.channel = c1
Start Flume and launch an agent:
# -n: name of the agent
# -c: configuration directory
# -f: the agent configuration file to read for this run, job/flume-netcat-logger.conf
# -Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at runtime and sets the console log level to INFO. Log levels include debug, info, warn, and error.
$ bin/flume-ng agent -n a1 -c conf -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
In another window, use netcat to connect to port 44444 and send data:
nc localhost 44444
Monitoring a single appended file in real time
Requirement: monitor the Hive log in real time and upload it to HDFS.
1. Make sure the Hadoop and Java environments are set up correctly.
2. Create flume-file-hdfs.conf.
# Create the file
[job]$ vim flume-file-hdfs.conf
# Name the components on this agent
# Name the components on this agent; the agent name (must be unique) is a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Total capacity: 1000 events
a1.channels.c1.capacity = 1000
# Commit a transaction once 100 events have been collected
a1.channels.c1.transactionCapacity = 100
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:8020/flume/%Y%m%d/%H
# Prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a1.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll a new file
a1.sinks.k1.hdfs.rollInterval = 30
# Roll size for each file (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start HDFS first (it is already running here), then start the Flume agent:
bin/flume-ng agent -n a1 -c conf -f job/flume-file-hdfs.conf
By default hive.log is written to /tmp/<local user>/hive.log.
Start Hive.
Note: the exec source does not support resuming from a saved position (no checkpointing).
Monitoring multiple new files in a directory in real time
Requirement: use Flume to monitor an entire directory and upload new files to HDFS.
1. Create the configuration file flume-dir-hdfs.conf.
vim flume-dir-hdfs.conf
# Add the following
# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://master:8020/flume/upload/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before creating a new directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often (in seconds) to roll a new file
a3.sinks.k3.hdfs.rollInterval = 20
# Roll size for each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Command to start the Flume agent:
bin/flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf
Notes: once a file has been consumed, data written to it afterwards is not monitored.
If a file is renamed, it will be picked up and uploaded again.
Monitoring multiple appended files in a directory in real time (important)
Requirement: use Flume to monitor files that are being appended to in real time across an entire directory, and upload them to HDFS.
1. Create the configuration file flume-taildir-hdfs.conf.
vim flume-taildir-hdfs.conf
# Add the following
# Name the components on this agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
# The position file records read offsets so the source can resume where it left off after a restart
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
# File group names are arbitrary; f1 and f2 are just labels
a3.sources.r3.filegroups = f1 f2
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files2/.*log.*
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://master:8020/flume/upload2/%Y%m%d/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
# How many time units before creating a new directory
a3.sinks.k3.hdfs.roundValue = 1
# The time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How often (in seconds) to roll a new file
a3.sinks.k3.hdfs.rollInterval = 60
# Roll size for each file, roughly 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Start the Flume agent:
bin/flume-ng agent -c conf/ -n a3 -f job/flume-taildir-hdfs.conf
Modifying the source code so that a Flume job keeps monitoring a file even after it is renamed, instead of losing track of it.
This matters for Hive logs: at the start of a new day, the current hive.log is renamed to hive-<date>.log. If the agent goes down at 11:30 the previous day and is not repaired until the next day, the last half hour of that day's log cannot be collected.
Modify the source code of the Taildir source so that a single criterion, the inode, is used to decide whether a file is new.
flume-taildir-source-1.9.0.jar
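Why does keying only on the inode fix the rename problem? Below is a minimal, standalone Java sketch (an illustration of the idea, not the actual patch to the Taildir source) showing that a file keeps the same file key (device id + inode on Linux) after being renamed, so a position record keyed by inode alone still identifies the right file after the daily hive.log rollover. The file names here are made up for the demonstration.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Objects;

public class InodeDemo {
    public static void main(String[] args) throws IOException {
        // Create a throwaway file standing in for hive.log
        Path file = Files.createTempFile("hive", ".log");

        // fileKey() wraps the device id + inode on Linux (it may be null on other platforms)
        Object keyBefore = Files.readAttributes(file, BasicFileAttributes.class).fileKey();

        // Simulate the daily rollover: hive.log is renamed to hive-<date>.log
        Path renamed = Files.move(file, file.resolveSibling(file.getFileName() + ".rolled"));
        Object keyAfter = Files.readAttributes(renamed, BasicFileAttributes.class).fileKey();

        // The inode is unchanged by the rename, so a position record keyed by inode
        // alone still points at the same file and the saved offset remains valid.
        System.out.println("before rename: " + keyBefore);
        System.out.println("after  rename: " + keyAfter);
        System.out.println("same file?     " + Objects.equals(keyBefore, keyAfter));

        Files.deleteIfExists(renamed);
    }
}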
Flume Advanced Topics
1. Transactions
Put transaction flow
• doPut: write the incoming batch of events into the temporary buffer putList.
• doCommit: check whether the channel's in-memory queue has enough room, then merge putList into it.
• doRollback: if the channel's in-memory queue does not have enough room, roll the data back.
Take transaction flow
• doTake: move events into the temporary buffer takeList and send them on (for example, to HDFS).
• doCommit: if all events are sent successfully, clear the temporary buffer takeList.
• doRollback: if an exception occurs while sending, return the events in takeList to the channel's in-memory queue.
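To make the put/take flow above concrete, here is a minimal, self-contained Java sketch that drives a MemoryChannel directly through Flume's transaction API. It assumes the same flume-ng-core 1.9.0 dependency used by the interceptor example later in this post, and it only illustrates the commit/rollback semantics; it is not part of any agent configuration in this post.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.channel.MemoryChannel;
import org.apache.flume.conf.Configurables;
import org.apache.flume.event.EventBuilder;

public class TransactionDemo {
    public static void main(String[] args) {
        // An in-memory channel configured like a1.channels.c1 in the examples above
        Channel channel = new MemoryChannel();
        Context context = new Context();
        context.put("capacity", "1000");
        context.put("transactionCapacity", "100");
        Configurables.configure(channel, context);
        channel.setName("c1");
        channel.start();

        // Put transaction: doPut -> doCommit, with doRollback on failure
        Transaction putTx = channel.getTransaction();
        putTx.begin();
        try {
            Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
            channel.put(event);   // doPut: the event goes into the putList buffer
            putTx.commit();       // doCommit: putList is merged into the channel's queue
        } catch (Exception e) {
            putTx.rollback();     // doRollback: putList is discarded, nothing is stored
        } finally {
            putTx.close();
        }

        // Take transaction: doTake -> doCommit, with doRollback returning events to the channel
        Transaction takeTx = channel.getTransaction();
        takeTx.begin();
        try {
            Event out = channel.take();   // doTake: the event moves into takeList
            // ... here a sink would send `out` to its destination (e.g. HDFS) ...
            System.out.println(new String(out.getBody(), StandardCharsets.UTF_8));
            takeTx.commit();              // doCommit: takeList is cleared
        } catch (Exception e) {
            takeTx.rollback();            // doRollback: takeList goes back into the channel's queue
        } finally {
            takeTx.close();
        }

        channel.stop();
    }
}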
2. Internals
Key components:
1) ChannelSelector
The ChannelSelector decides which channel(s) an event is sent to. There are two types: Replicating and Multiplexing. The ReplicatingSelector sends the same event to every channel, while the Multiplexing selector routes different events to different channels according to the configured rules.
2) SinkProcessor
There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor. DefaultSinkProcessor works with a single sink; LoadBalancingSinkProcessor and FailoverSinkProcessor work with a sink group. LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover.
3. Topologies
- Simple chaining: multiple Flume agents are connected in sequence, from the very first source all the way to the storage system that the final sink delivers to. It is not recommended to chain too many agents: too many agents not only slow down transfer, but if any one of them goes down, the whole pipeline is affected.
- Replicating and multiplexing: Flume supports sending the event flow to one or more destinations. The same data can be copied to multiple channels, or different data can be routed to different channels, and each sink can deliver to a different destination.
- Load balancing and failover: Flume supports logically grouping multiple sinks into a sink group; combined with the appropriate SinkProcessor, a sink group provides load balancing or failover.
- Aggregation: this is the most common and most practical pattern. Everyday web applications are often spread across hundreds, or even thousands or tens of thousands, of servers, and the logs they produce are painful to process. This combination solves that problem nicely: each server runs a Flume agent that collects its logs and sends them to a central Flume agent that gathers all the logs, which then uploads them to HDFS, Hive, HBase, etc. for analysis.
Flume Development Examples
Replicating and multiplexing
Requirement: Flume-1 monitors file changes and passes the changed content to Flume-2, which stores it in HDFS. At the same time, Flume-1 passes the changed content to Flume-3, which writes it to the local file system.
Create a group1 folder under /opt/module/flume/job:
[job]$ mkdir group1
[job]$ cd group1/
# Create three Flume configuration files
# The first configures one source that reads the log file, two channels, and two sinks feeding flume-flume-hdfs and flume-flume-dir respectively
[group1]$ vim flume-file-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels (replicating is the default selector type)
a1.sources.r1.selector.type = replicating
# Describe/configure the source (if the Hive log location has not been changed, it is under /tmp/<user>/hive.log)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# An avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# Create flume-flume-hdfs.conf: configure a source that receives the upstream Flume's output and a sink that writes to HDFS
[group1]$ vim flume-flume-hdfs.conf
# Name the components on this agent (agent names must differ from one another)
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# An avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://master:8020/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll directories based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new directory
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll size for each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# Create flume-flume-dir.conf: configure a source that receives the upstream Flume's output and a sink that writes to a local directory
[group1]$ vim flume-flume-dir.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
# The local output directory must already exist; Flume will not create it if it is missing
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
# Create the directory that will hold the local data
[data]$ mkdir flume3
Start the agents one by one (start the server-side agents a2 and a3 first, then the client a1):
flume-flume-hdfs.conf
bin/flume-ng agent -c conf -n a2 -f job/group1/flume-flume-hdfs.conf
flume-flume-dir.conf
bin/flume-ng agent -c conf -n a3 -f job/group1/flume-flume-dir.conf
flume-file-flume.conf
bin/flume-ng agent -c conf -n a1 -f job/group1/flume-file-flume.conf
Start Hadoop and Hive.
Check the data.
Load balancing and failover
Requirement: Flume-1 monitors a port; the sinks in its sink group connect to Flume-2 and Flume-3 respectively. Use FailoverSinkProcessor to implement failover.
Create a group2 folder under /opt/module/flume/job:
[job]$ mkdir group2
[job]$ cd group2/
# Create flume-netcat-flume.conf: configure one netcat source, one channel, and one sink group (2 sinks) feeding flume-flume-console1 and flume-flume-console2 respectively
[group2]$ vim flume-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Failover processor with priorities; k2 has the higher priority
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
# Create flume-flume-console1.conf: configure a source that receives the upstream Flume's output; the sink writes to the local console
[group2]$ vim flume-flume-console1.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# Create flume-flume-console2.conf: configure a source that receives the upstream Flume's output; the sink writes to the local console
[group2]$ vim flume-flume-console2.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
Start the agents one by one (start the server-side agents a2 and a3 first).
Start the corresponding configuration files: flume-flume-console1, flume-flume-console2, then flume-netcat-flume.
# Start a2
bin/flume-ng agent -c conf -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
# Start a3
bin/flume-ng agent -c conf -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
# Start a1
bin/flume-ng agent -c conf -n a1 -f job/group2/flume-netcat-flume.conf
Use netcat to send data to port 44444 on the local machine:
nc localhost 44444
Watch the console output of Flume-2 and Flume-3.
Kill Flume-2 and observe what Flume-3 prints to its console.
Use jps -ml to find the Flume processes.
Changing failover to load balancing
Make a copy of flume-netcat-flume.conf and modify the sink group section:
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Load-balancing sink processor with round-robin selection
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# Backoff: a sink that fails to deliver events is temporarily blacklisted and not selected again for a while
a1.sinkgroups.g1.processor.backoff = true
# The backoff time grows exponentially, up to this maximum (in milliseconds)
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Aggregation
Requirement: Flume-1 on master monitors the data stream on port 44444;
Flume-2 on slave1 monitors the file /opt/module/group.log;
Flume-1 and Flume-2 send their data to Flume-3 on slave2, and Flume-3 prints the final data to the console.
Distribute Flume to all nodes:
[module]$ xsync flume
Create a group3 folder under /opt/module/flume/job on master, slave1, and slave2:
[master job]$ mkdir group3
[slave1 job]$ mkdir group3
[slave2 job]$ mkdir group3
Create flume1-netcat-flume.conf: configure a source that monitors the data stream on port 44444 and a sink that sends the data to the next-level Flume.
[master group3]$ vim flume1-netcat-flume.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave2
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# Create flume2-logger-flume.conf: configure a source that monitors the group.log file and a sink that sends the data to the next-level Flume
[slave1 group3]$ vim flume2-logger-flume.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /opt/module/group.log
a2.sources.r1.shell = /bin/bash -c
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = slave2
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# Create flume3-flume-logger.conf: configure a source that receives the data streams sent by flume1 and flume2 and a sink that prints the merged data to the console
[slave2 group3]$ vim flume3-flume-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = slave2
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Start the agents one by one (start the server-side agent a3 first).
Start the corresponding configuration files: flume3-flume-logger.conf, flume2-logger-flume.conf, flume1-netcat-flume.conf.
[slave2 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[slave1 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-logger-flume.conf
[master flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-netcat-flume.conf
On slave1, append content to group.log under /opt/module:
[slave1 module]$ echo 'hello' >> group.log
On master, send data to port 44444:
[master flume]$ telnet master 44444
[master flume]$ nc master 44444
Check the data on slave2.
Custom Interceptor
Requirement: use Flume to collect local server logs and, depending on the log type, send different kinds of logs to different analytics systems.
Create a Java Maven project to implement the custom Flume interceptor.
<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>1.9.0</version>
</dependency>
</dependencies>
package com.hadoop.interceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
/**
* @author H
* @create 2022/11/20 19:47
*/
public class TypeInterceptor implements Interceptor {

    // List used to hold the events after the interceptor has processed them
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        // Initialize the list that holds the processed events
        addHeaderEvents = new ArrayList<>();
    }

    // Single-event processing method
    @Override
    public Event intercept(Event event) {
        // 1. Get the header and body
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        // 2. Add a different header depending on whether the body contains the string "hadoop"
        if (body.contains("hadoop")) {
            headers.put("type", "hadoop");
        } else {
            headers.put("type", "other");
        }
        // 3. Return the event
        return event;
    }

    // Batch-event processing method
    @Override
    public List<Event> intercept(List<Event> list) {
        // 1. Clear the list
        addHeaderEvents.clear();
        // 2. Iterate over the events
        for (Event event : list) {
            addHeaderEvents.add(intercept(event));
        }
        // 3. Return the processed events
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TypeInterceptor(); // Builds the interceptor instance
        }

        @Override
        public void configure(Context context) {
        }
    }
}
Package the project and copy the jar into Flume's lib directory.
Edit the Flume configuration files.
# On master, configure Flume1 with one netcat source and two avro sinks, plus the corresponding multiplexing ChannelSelector and the custom interceptor
[master job]$ mkdir group4
[master group4]$ vim flume1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.hadoop.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.hadoop = c1
a1.sources.r1.selector.mapping.other = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slave1
a1.sinks.k1.port = 4141
a1.sinks.k2.type=avro
a1.sinks.k2.hostname = slave2
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# On slave1, configure Flume2 with one avro source and one logger sink
[slave1 group4]$ vim flume2.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = slave1
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
# On slave2, configure Flume3 with one avro source and one logger sink
[slave2 group4]$ vim flume3.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = slave2
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
Start the Flume processes on slave1, slave2, and master:
[slave1 flume]$ bin/flume-ng agent -c conf -n a1 -f job/group4/flume2.conf -Dflume.root.logger=INFO,console
[slave2 flume]$ bin/flume-ng agent -c conf -n a1 -f job/group4/flume3.conf -Dflume.root.logger=INFO,console
[master flume]$ bin/flume-ng agent -c conf -n a1 -f job/group4/flume1.conf
On master, use netcat to send letters and digits to localhost:44444.
Watch the logs printed by slave1 and slave2:
slave1 receives only strings that contain "hadoop";
everything else is received by slave2.