Flume Notes
Flume documentation: Flume 1.9.0 User Guide — Apache Flume
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume is based on a streaming architecture and is flexible and simple.
Flume has four main components
Agent -- a JVM process containing the three components below
Source -- reads data in
Channel -- buffer: Memory Channel (memory) and File Channel (disk)
Sink -- writes data out
Source:
The Source is the Flume Agent component responsible for receiving data; it can handle log data of various types and formats.
Channel:
Memory Channel (memory) and File Channel (disk).
The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to operate at different rates.
The Channel is thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.
Sink:
The Sink continuously polls the Channel for events, removes them in batches, and writes them to a storage or indexing system, or forwards them to another Flume Agent.
Event -- the basic unit of data transferred by Flume, made up of a Header and a Body
The Header stores attributes of the event as key-value pairs
The Body stores the record itself as a byte array
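A minimal sketch in Java of what an Event looks like in code; it uses SimpleEvent, the same implementation the custom Source later in these notes uses, and the header key/value here are made up for illustration:

import org.apache.flume.Event;
import org.apache.flume.event.SimpleEvent;
import java.nio.charset.StandardCharsets;

public class EventDemo {
    public static void main(String[] args) {
        Event event = new SimpleEvent();
        // Header: key-value attributes describing the event
        event.getHeaders().put("hostname", "zzz01");
        // Body: the payload itself, always a byte array
        event.setBody("hello flume".getBytes(StandardCharsets.UTF_8));
        System.out.println(event.getHeaders() + " " + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}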
#remove the guava jar bundled with Flume 1.9.0; it conflicts with the newer guava shipped by Hadoop 3.x when writing to HDFS
rm /opt/module/flume/lib/guava-11.0.2.jar
1. Listen on a port with Flume, collect the data arriving on that port, and print it to the console.
flume-netcat-logger.conf
Start it:
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
-Dflume.root.logger=INFO,console : -D overrides the flume.root.logger property at runtime,
setting the console log level to INFO. Log levels include debug, info, warn, and error.
#define the agent
#source names (there can be more than one)
a1.sources = r1
#sink names
a1.sinks = k1
#channel names (there can be more than one)
a1.channels = c1
#source configuration
#the input source of a1 is a netcat listening port
a1.sources.r1.type = netcat
#host to listen on
a1.sources.r1.bind = localhost
#port to listen on
a1.sources.r1.port = 44444
#sink configuration
#the output destination of a1 is the console (logger type)
a1.sinks.k1.type = logger
#channel configuration
#the channel type of a1 is memory
a1.channels.c1.type = memory
#maximum capacity is 1000 events
a1.channels.c1.capacity = 1000
#a1's channel commits a transaction once it has collected 100 events
a1.channels.c1.transactionCapacity = 100
#wiring
#connect the source to its channel(s)
a1.sources.r1.channels = c1
#connect the sink to its channel; a sink can only be attached to one channel
a1.sinks.k1.channel = c1
#--name works the same whether placed before or after the other options
flume-ng agent --conf conf/ --conf-file $FLUME_HOME/jobs/flume-netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
--shortened form
flume-ng agent -c conf/ -n a1 -f $FLUME_HOME/jobs/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
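To test (assuming the nc tool is installed), open another session and send data to the listening port; each line shows up as an event on the Flume console:
nc localhost 44444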
2. Monitor a single file that is appended to in real time
Requirement: monitor the Hive log in real time and upload it to HDFS.
exec stands for execute: the source runs a Linux command to read the file.
flume-file-hdfs.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/jobs/log.log
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks: write to HDFS
a1.sinks.k1.type = hdfs
#path on HDFS to upload to
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/output/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
#whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
#how many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used above
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of events to accumulate before flushing to HDFS; batchSize <= transactionCapacity <= channel capacity
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll size (bytes) for each file
a1.sinks.k1.hdfs.rollSize = 134217700
#rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
#wire the source to the channel
a1.sources.r1.channels = c1
#wire the sink to the channel
a1.sinks.k1.channel = c1
--flume-ng agent -c /$FLUME_HOME/conf -n a1 -f $FLUME_HOME/jobs/flume-file-hdfs.conf   (this form does not work here; the log output hangs)
flume-ng agent --name a1 --conf conf/ --conf-file $FLUME_HOME/jobs/flume-file-hdfs.conf
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/flume-file-hdfs.conf
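To feed the exec source, append lines to the monitored file (the path comes from the config above):
echo "hello flume" >> /opt/module/flume-1.9.0/jobs/log.log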
3. Monitor multiple new files in a directory in real time
Use Flume to watch an entire directory and upload its files to HDFS.
flume-dir-hdfs.conf
#monitor a directory with the Spooling Directory Source
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
#the source type is spooldir: suited to syncing new files, not to tailing files that are appended in real time
a1.sources.r1.type = spooldir
#directory to monitor
a1.sources.r1.spoolDir= /opt/module/flume-1.9.0/jobs/upload
#suffix appended to files once they have been consumed
a1.sources.r1.fileSuffix = .COMPLETED
#whether to add a header storing the absolute path of the file
a1.sources.r1.fileHeader = true
#ignore files matching this regular expression
#a1.sources.r1.ignorePattern = ([^ ]*\.tmp)
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type = hdfs
#path on HDFS to upload to
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/output/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
#whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
#how many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used above
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll size (bytes) for each file
a1.sinks.k1.hdfs.rollSize = 134217700
#rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
#wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/flume-dir-hdfs.conf
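To test, create a new file inside the monitored directory; Flume consumes it and renames it with the .COMPLETED suffix (the file name below is only an example):
echo "some data" > /opt/module/flume-1.9.0/jobs/upload/test1.txt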
4. Monitor multiple appended files in a directory in real time
Exec source: suited to tailing a single file that is appended in real time, but cannot resume from where it left off;
Spooldir Source: suited to syncing new files, but not to tailing files that are appended in real time;
Taildir Source: suited to tailing multiple files that are appended in real time, and can resume from a recorded position (see the position-file sketch below).
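For reference, the Taildir position file (the positionFile property in the config below) is a JSON array recording, for each tailed file, its inode, the byte offset read so far, and its path; the values here are illustrative only:
[{"inode":2496001,"pos":1234,"file":"/opt/module/flume-1.9.0/jobs/tail/file/file1.txt"},{"inode":2496002,"pos":0,"file":"/opt/module/flume-1.9.0/jobs/tail/log/log1.txt"}]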
flume-taildir-hdfs.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
a1.sources.r1.type = TAILDIR
#position file used to resume from the last read offset
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/position.json
#group the monitored files so they can be handled separately
a1.sources.r1.filegroups = f1 f2
#the file patterns each group monitors
a1.sources.r1.filegroups.f1= /opt/module/flume-1.9.0/jobs/tail/file/file.*
a1.sources.r1.filegroups.f2= /opt/module/flume-1.9.0/jobs/tail/log/log.*
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type = hdfs
#path on HDFS to upload to
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/tail/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
#whether to roll directories based on time
a1.sinks.k1.hdfs.round = true
#how many time units before a new directory is created
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used above
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll size (bytes) for each file
a1.sinks.k1.hdfs.rollSize = 134217700
#rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0
#wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/flume-taildir-hdfs.conf
Flume topologies: chaining multiple agents to process data
1. Replicating and multiplexing
Flume-1 monitors file changes and passes them to Flume-2, which stores them in HDFS.
At the same time, Flume-1 passes the changes to Flume-3, which writes them to the local file system.
Start the downstream (server-side) agents first, then the client: start from the end of the chain.
source ---> replicating selector ---> channel1 / channel2 ---> sink1 / sink2
---> downstream avro sources ---> their channels ---> hdfs sink / file_roll sink
#a replicating channel selector sends every event to both channels, which feed the two sinks connected to the downstream agents
--flume1.conf
#agent
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
#sources
a1.sources.r1.type = TAILDIR
#position file used to resume from the last read offset
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/position.json
#group the monitored files so they can be handled separately
a1.sources.r1.filegroups = f1 f2
#the file patterns each group monitors
a1.sources.r1.filegroups.f1= /opt/module/flume-1.9.0/jobs/tail/file/file.*
a1.sources.r1.filegroups.f2= /opt/module/flume-1.9.0/jobs/tail/log/log.*
#ChannelSelector
a1.sources.r1.selector.type = replicating
#channel c2
#channel c2 of a1 is a memory channel
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
#channels.c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sink k1: avro, to the downstream agent that writes to HDFS (port 6666)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 6666
#sink k2: avro, to the downstream agent that writes to the local file system (port 5555)
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 5555
#bind
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/replicating/flume1.conf -Dflume.root.logger=INFO,console
#flume2: writes the received data to the local file system
--flume2.conf
#define the agent
#source names (there can be more than one)
a2.sources = r2
#sink names
a2.sinks = k2
#channel names (there can be more than one)
a2.channels = c2
#source configuration
a2.sources.r2.type = avro
#host to listen on
a2.sources.r2.bind = localhost
#port to listen on
a2.sources.r2.port = 5555
#sink configuration
#the output of a2 is a file_roll sink; the local output directory must already exist, it will not be created automatically
a2.sinks.k2.type = file_roll
a2.sinks.k2.sink.directory = /opt/module/flume-1.9.0/jobs/replicating/repout
#channel configuration
#the channel type of a2 is memory
a2.channels.c2.type = memory
#maximum capacity is 1000 events
a2.channels.c2.capacity = 1000
#a2's channel commits a transaction once it has collected 100 events
a2.channels.c2.transactionCapacity = 100
#wiring
#connect the source to its channel(s)
a2.sources.r2.channels = c2
#connect the sink to its channel; a sink can only be attached to one channel
a2.sinks.k2.channel = c2
flume-ng agent --conf conf/ --name a2 --conf-file $FLUME_HOME/jobs/replicating/flume2.conf
#flume3: uploads the received data to HDFS
--flume3.conf
#agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
#sources
a3.sources.r3.type = avro
a3.sources.r3.bind = localhost
a3.sources.r3.port = 6666
#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
#sinks
a3.sinks.k3.type = hdfs
#path on HDFS to upload to
a3.sinks.k3.hdfs.path = hdfs://zzz01:9820/flume/output/repout/%Y%m%d/%H
#prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = repout-
#whether to roll directories based on time
a3.sinks.k3.hdfs.round = true
#how many time units before a new directory is created
a3.sinks.k3.hdfs.roundValue = 1
#the time unit used above
a3.sinks.k3.hdfs.roundUnit = hour
#whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#number of events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
#file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
#roll size (bytes) for each file
a3.sinks.k3.hdfs.rollSize = 134217700
#rolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0
#wiring
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
flume-ng agent --conf conf/ --name a3 --conf-file $FLUME_HOME/jobs/replicating/flume3.conf -Dflume.root.logger=INFO,console
Load balancing:
Flume has three different sink processors: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor works with a single Sink.
LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group: LoadBalancingSinkProcessor provides load balancing,
and FailoverSinkProcessor provides failover (error recovery).
Case: use LoadBalancingSinkProcessor to distribute data across a sink group and balance the load.
flume1.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#ChannelSelector
a1.sources.r1.selector.type = replicating
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sink processor: define the sink group
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = false
a1.sinkgroups.g1.processor.selector = random
#a1.sinkgroups.g1.processor.selector = round_robin  (round-robin selection)
#sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/balance/flume1.conf -Dflume.root.logger=INFO,console
flume2.conf
#agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2
#sources
a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 6666
#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
#sinks
a2.sinks.k2.type = logger
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
flume-ng agent --conf conf/ --name a2 --conf-file $FLUME_HOME/jobs/balance/flume2.conf -Dflume.root.logger=INFO,console
flume3.conf
#agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
#sources
a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 5555
#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
#sinks
a3.sinks.k3.type = logger
#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
flume-ng agent --conf conf/ --name a3 --conf-file $FLUME_HOME/jobs/balance/flume3.conf -Dflume.root.logger=INFO,console
Case: use a FailoverSinkProcessor on a sink group for failover (high availability).
--with the Failover Sink Processor only one sink is active at a time; the other stands by. The Failover Sink Processor provides error recovery.
flume1.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
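#the sink with the higher priority value is active; the lower-priority sink only takes over when the active one fails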
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666
#bind
a1.sources.r1.channels= c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Start:
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/FailoverSinkProcessor/flume1.conf -Dflume.root.logger=INFO,console
flume2.conf
#agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2
#sources
a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 5555
#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
#sinks
a2.sinks.k2.type = logger
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
Start:
flume-ng agent --conf conf/ --name a2 --conf-file $FLUME_HOME/jobs/FailoverSinkProcessor/flume2.conf -Dflume.root.logger=INFO,console
flume3.conf
#agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
#sources
a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 6666
#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
#sinks
a3.sinks.k3.type = logger
#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Start:
flume-ng agent --conf conf/ --name a3 --conf-file $FLUME_HOME/jobs/FailoverSinkProcessor/flume3.conf -Dflume.root.logger=INFO,console
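A quick way to test the failover (assuming nc is installed): send data to the netcat source with nc localhost 44444. Since k2 has the higher priority (10 > 5), events first show up on the agent listening on port 6666 (flume3); kill that agent and the events fail over to the agent on port 5555 (flume2).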
Aggregation: multiple data sources aggregated into one agent
flume1.sink1 + flume2.sink2 = flume3.sources
Config file on zzz03:
flume-exec-avro.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/jobs/log.log
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname = zzz01
a1.sinks.k1.port = 4141
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
--
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/groupby/flume-exec-avro.conf -Dflume.root.logger=INFO,console
--
zzz02 listens on a netcat port:
flume-netcat-avro.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = zzz02
a1.sources.r1.port = 44444
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname = zzz01
a1.sinks.k1.port = 4141
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/groupby/flume-netcat-avro.conf -Dflume.root.logger=INFO,console
zzz01 prints everything it receives to the console:
flume-avro-logger.conf
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
a1.sources.r1.type = avro
a1.sources.r1.bind = zzz01
a1.sources.r1.port = 4141
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type=logger
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/groupby/flume-avro-logger.conf -Dflume.root.logger=INFO,console
Custom Interceptor
Custom interceptor code:
package EventHeaderInterceptor;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
/* Put the built jar into Flume's lib directory; reference the interceptor in the config by the fully qualified class name of its Builder inner class (here ...$MyBuilder) */
public class EventHeaderInterceptor implements Interceptor {
@Override
public void initialize() {
/* used for initialization */
}
/**
 * Implements the interception logic
 *
 * @param event the incoming event
 * @return the event after processing; modify it here as needed
 */
@Override
public Event intercept(Event event) {
Map<String, String> headers = event.getHeaders();
String body = new String(event.getBody(), StandardCharsets.UTF_8);
//add something to the event header
if (body.contains("向晚")) {
headers.put("title", "Ava");
} else if (body.contains("贝拉")) {
headers.put("title", "kira");
} else {
headers.put("title", "ot");
}
//return the event after processing
return event;
}
@Override
public List<Event> intercept(List<Event> list) {
for (Event event : list) {
intercept(event);
}
return list;
}
@Override
public void close() {
/* cleanup */
}
//inner class that creates the interceptor; Flume instantiates it via reflection and calls build() to obtain the interceptor object
public static class MyBuilder implements Builder {
@Override
public Interceptor build() {
return new EventHeaderInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
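To use the interceptor, package it into a jar and copy the jar into Flume's lib directory (the commands below assume a Maven project; the jar name is only an example):
mvn clean package
cp target/flume-interceptor-1.0-SNAPSHOT.jar $FLUME_HOME/lib/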
Flume's Multiplexing topology routes each event to a different Channel according to the value of a key in the event Header,
so we need a custom Interceptor that assigns different values to that Header key depending on the type of event.
flume1.conf
#agent
a1.sources = r1
a1.channels = c1 c2 c3
a1.sinks = k1 k2 k3
#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = zzz01
a1.sources.r1.port = 44444
#ChannelSelector
a1.sources.r1.selector.type = multiplexing
#route on the event header key named title
a1.sources.r1.selector.header = title
#if the header value is Ava, send the event to c1
a1.sources.r1.selector.mapping.Ava = c1
a1.sources.r1.selector.mapping.kira = c2
#default channel for anything else (could also be mapped explicitly)
a1.sources.r1.selector.default = c3
#a1.sources.r1.selector.mapping.ot = c3
#Interceptor
a1.sources.r1.interceptors = i1
#fully qualified class name of the Builder inner class whose build() method creates the interceptor
a1.sources.r1.interceptors.i1.type =EventHeaderInterceptor.EventHeaderInterceptor$MyBuilder
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100
#sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666
a1.sinks.k3.type = avro
a1.sinks.k3.hostname = localhost
a1.sinks.k3.port = 7777
#bind
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
#start flume1.conf
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/Interceptor/flume1.conf -Dflume.root.logger=INFO,console
flume2.conf
#agent
a2.sources = r2
a2.channels = c2
a2.sinks = k2
#sources
a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 6666
#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
#sinks
a2.sinks.k2.type = logger
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
#start
flume-ng agent --conf conf/ --name a2 --conf-file $FLUME_HOME/jobs/Interceptor/flume2.conf -Dflume.root.logger=INFO,console
flume3.conf
#agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
#sources
a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 5555
#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
#sinks
a3.sinks.k3.type = logger
#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
flume-ng agent --conf conf/ --name a3 --conf-file $FLUME_HOME/jobs/Interceptor/flume3.conf -Dflume.root.logger=INFO,console
flume4.conf
#agent
a4.sources = r4
a4.channels = c4
a4.sinks = k4
#sources
a4.sources.r4.type= avro
a4.sources.r4.bind= localhost
a4.sources.r4.port= 7777
#channels
a4.channels.c4.type = memory
a4.channels.c4.capacity = 1000
a4.channels.c4.transactionCapacity = 100
#sinks
a4.sinks.k4.type = logger
#bind
a4.sources.r4.channels = c4
a4.sinks.k4.channel = c4
#start
flume-ng agent --conf conf/ --name a4 --conf-file $FLUME_HOME/jobs/Interceptor/flume4.conf -Dflume.root.logger=INFO,console
Custom Source
According to the official docs, a custom Source must extend the AbstractSource class and implement the Configurable and PollableSource interfaces.
Methods to implement:
getBackOffSleepIncrement()  //backoff step size
getMaxBackOffSleepInterval()  //maximum backoff time
configure(Context context)  //initialization from the context (reads the configuration file)
process()  //fetches data, wraps it into events and writes them to the channel; this method is called in a loop
Use case: reading data from MySQL or other systems.
Custom Source code:
package MySource;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
public class MySource extends AbstractSource implements Configurable, PollableSource {
String prefix;
/**
 * Main processing method
 *
 * @return the status of this attempt to fetch data
 * @throws EventDeliveryException this method is called repeatedly to pull in data
 */
@Override
public Status process() throws EventDeliveryException {
//sleep briefly on each call to throttle the rate
try {
TimeUnit.SECONDS.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
Status status = null;
try {
// receive new data and wrap it into an Event
Event event = getData();
// hand it to the ChannelProcessor for processing
getChannelProcessor().processEvent(event);
//report success
status = Status.READY;
} catch (Throwable t) {
// status to return if processing failed
status = Status.BACKOFF;
}
return status;
}
/**
 * How the source collects and wraps data; the header and body can be set here
 *
 * @return the assembled event
 */
private Event getData() {
//generate a random String as test data
String data = UUID.randomUUID().toString();
//Event is an interface and cannot be instantiated directly; use the SimpleEvent implementation
Event event = new SimpleEvent();
//add some data
//the prefix set in the configuration file can be prepended; the body must be a byte array
String line = prefix + data;
event.setBody(line.getBytes());
//put something in the header so a downstream multiplexing selector can route on this key
event.getHeaders().put("star", "kira");
return event;
}
@Override
public long getBackOffSleepIncrement() {
return 0;
}
@Override
public long getMaxBackOffSleepInterval() {
return 0;
}
/**
 * Reads the configuration file; a default value can be supplied
 *
 * @param context
 */
@Override
public void configure(Context context) {
//read the source's configuration; fall back to the default value if the property is missing
prefix = context.getString("prefix", "Ava-a");
}
}
Using the custom Source
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
#custom types must use the fully qualified class name
a1.sources.r1.type = MySource.MySource
a1.sources.r1.prefix = log->
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type = logger
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/mysource/mysource.conf -Dflume.root.logger=INFO,console
Custom Sink
Custom Sink code
The official docs also describe the interface for a custom sink:
https://flume.apache.org/FlumeDeveloperGuide.html#sink
According to the official docs, a custom Sink must extend the AbstractSink class and implement the Configurable interface.
Methods to implement:
configure(Context context)  //initialization from the context (reads the configuration file)
process()  //reads data (events) from the Channel; this method is called in a loop
Use case: reading Channel data and writing it to MySQL or other file systems
package MySink;
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.charset.StandardCharsets;
import java.util.Map;
/* A custom sink extends AbstractSink and implements Flume's Configurable (org.apache.flume.conf.Configurable);
   it reads events from the Channel and writes them to MySQL or another system. */
public class MySink extends AbstractSink implements Configurable {
Logger logger = LoggerFactory.getLogger(MySink.class);
/**
 * Main processing method of the sink
 *
 * @return the status of this poll
 * @throws EventDeliveryException
 */
@Override
public Status process() throws EventDeliveryException {
Status status = null;
// pull data from the channel attached to this sink
Channel ch = getChannel();
//get a transaction; the Sink side is fully transactional
Transaction txn = ch.getTransaction();
//start the transaction
txn.begin();
try {
//take an event from the channel (take() may return null when the channel is empty; a null check would be safer)
Event event = ch.take();
//process the event
processEvent(event);
//commit the transaction
txn.commit();
status = Status.READY;
} catch (Throwable t) {
txn.rollback();
// Log exception, handle individual exceptions as needed
status = Status.BACKOFF;
} finally {
//close the transaction
txn.close();
}
return status;
}
/* custom event-handling method */
private void processEvent(Event event) {
Map<String, String> headers = event.getHeaders();
String body = new String(event.getBody(), StandardCharsets.UTF_8);
String result = headers.toString() + "^_^" + body;
logger.info(result);
}
@Override
public void configure(Context context) {
/* read custom sink properties from the configuration here if needed */
}
}
Using the custom Sink
--custom sink config
#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#sources
#custom types must use the fully qualified class name
a1.sources.r1.type = MySource.MySource
a1.sources.r1.prefix = log->
#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#sinks
a1.sinks.k1.type = MySink.MySink
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/mysink/mysink.conf -Dflume.root.logger=INFO,console