Flume Notes

Flume docs: Flume 1.9.0 User Guide - Apache Flume

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. Flume is based on a streaming architecture and is flexible and simple.

Flume has four core components:
Agent  -- a JVM process that contains the following three
Source  -- reads the input data
Channel -- buffer between Source and Sink: Memory Channel (in memory) and File Channel (on disk)
Sink  -- writes the output


Source:
The Source is the Agent component responsible for receiving data. Source components can handle log data of many types and formats.

Channel:
Memory Channel (in memory) and File Channel (on disk).
The Channel is a buffer between the Source and the Sink, so it allows the Source and the Sink to run at different rates.
Channels are thread-safe and can handle writes from several Sources and reads from several Sinks at the same time.

Sink:
The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to a storage or indexing system, or forwards them to another Flume Agent.

 

Event -- the basic unit of data transferred by Flume, made up of a Header and a Body

The Header stores attributes of the event as key-value pairs
The Body stores the data itself as a byte array
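
A minimal sketch of an Event in code (this reuses the SimpleEvent class that also appears in the custom-source example later in these notes; the header key/value and the body text here are made up for illustration):

import org.apache.flume.Event;
import org.apache.flume.event.SimpleEvent;

public class EventDemo {
    public static void main(String[] args) {
        Event event = new SimpleEvent();
        // Header: k-v attributes about the event (e.g. used later by a multiplexing selector for routing)
        event.getHeaders().put("title", "Ava");
        // Body: the data itself, stored as a byte array
        event.setBody("hello flume".getBytes());
        System.out.println(event.getHeaders() + " / " + new String(event.getBody()));
    }
}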


# remove Flume's bundled guava-11.0.2.jar (it conflicts with the newer Guava shipped by Hadoop 3.x)
rm /opt/module/flume/lib/guava-11.0.2.jar

Use Flume to listen on a port, collect the data arriving on that port, and print it to the console.

flume-netcat-logger.conf

Start

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

-Dflume.root.logger=INFO,console : -D overrides the flume.root.logger property at runtime,
setting the console log print level to INFO. Log levels include debug, info, warn, and error.

#define the agent

#define the source name(s); there can be more than one
a1.sources = r1
#define the sink name(s)
a1.sinks = k1
#define the channel name(s); there can be more than one
a1.channels = c1

#source configuration
#a1's input source is of type netcat (listens on a TCP port)
a1.sources.r1.type = netcat
#host to listen on
a1.sources.r1.bind = localhost
#port to listen on
a1.sources.r1.port = 44444

#sink configuration
#a1's output destination is the console (logger type)
a1.sinks.k1.type = logger

#channel configuration
#a1's channel is of type memory (in-memory)
a1.channels.c1.type = memory
#maximum capacity is 1000 events
a1.channels.c1.capacity = 1000
#commit a transaction once 100 events have been collected
a1.channels.c1.transactionCapacity = 100

#wiring
#connect the source to the channel
a1.sources.r1.channels = c1
#connect the sink to the channel; one sink can only be connected to one channel
a1.sinks.k1.channel = c1


#--name can go before or after the other options; the result is the same
flume-ng agent --conf conf/   --conf-file $FLUME_HOME/jobs/flume-netcat-logger.conf  --name a1 -Dflume.root.logger=INFO,console
flume-ng agent --conf conf/ --name a1 --conf-file $FLUME_HOME/jobs/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

--short-option version
flume-ng agent -c conf/ -n a1 -f $FLUME_HOME/jobs/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
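
To test, send some data to the port from another terminal (assuming the netcat client nc is installed); whatever you type shows up as INFO log lines in the agent's console:

nc localhost 44444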

 

Monitor a single file that is being appended to, in real time

Requirement: monitor the Hive log in real time and upload it to HDFS.
exec means execute: this source runs a Linux command (tail) to read the file.

flume-file-hdfs.conf 

#agent 
a1.sources = r1 
a1.channels = c1
a1.sinks = k1

#sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/jobs/log.log


#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks: write to HDFS
a1.sinks.k1.type = hdfs
#destination path on HDFS
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/output/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
##whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
#how many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of Events to accumulate before one flush to HDFS; batchSize <= transactionCapacity <= channel capacity
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll to a new file once it reaches this size (bytes)
a1.sinks.k1.hdfs.rollSize = 134217700
#0 means rolling does not depend on the number of Events
a1.sinks.k1.hdfs.rollCount = 0




#source-to-channel wiring
a1.sources.r1.channels  = c1
#sink-to-channel wiring
a1.sinks.k1.channel = c1


--flume-ng agent -c /$FLUME_HOME/conf -n a1  -f $FLUME_HOME/jobs/flume-file-hdfs.conf   -- this one does not work; the log output gets stuck

flume-ng agent  --name a1  --conf conf/  --conf-file $FLUME_HOME/jobs/flume-file-hdfs.conf  
flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/flume-file-hdfs.conf
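
To generate data, append lines to the file that the exec source is tailing:

echo 'hello hive log' >> /opt/module/flume-1.9.0/jobs/log.log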

Monitor new files appearing in a directory, in real time

Use Flume to watch an entire directory and upload its files to HDFS

flume-dir-hdfs.conf
#watch a directory with the Spooling Directory Source
#agent
a1.sources = r1
a1.channels = c1 
a1.sinks = k1

#sources
#spooldir source: suitable for syncing new files, but not for tailing files that are still being appended to
a1.sources.r1.type = spooldir
#directory to watch
a1.sources.r1.spoolDir= /opt/module/flume-1.9.0/jobs/upload
#suffix appended to a file once it has been ingested
a1.sources.r1.fileSuffix = .COMPLETED
#whether to add a header containing the file path
a1.sources.r1.fileHeader = true
#ignore files matching this regular expression
#a1.sources.r1.ignorePattern = ([^]*\.tmp)


#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks
a1.sinks.k1.type = hdfs
#destination path on HDFS
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/output/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
##whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
#how many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of Events to accumulate before one flush to HDFS
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll to a new file once it reaches this size (bytes)
a1.sinks.k1.hdfs.rollSize = 134217700
#0 means rolling does not depend on the number of Events
a1.sinks.k1.hdfs.rollCount = 0

#wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/flume-dir-hdfs.conf
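
To test, copy a file into the watched directory; once it has been ingested it is renamed with the .COMPLETED suffix (the hosts file here is just an example):

cp /etc/hosts /opt/module/flume-1.9.0/jobs/upload/hosts.txt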

4. Monitor multiple appending files under a directory, in real time

Exec source: suitable for monitoring one file that is appended to in real time, but cannot resume from where it left off;
Spooldir Source: suitable for syncing new files, but not for watching and syncing files that are still being appended to;
Taildir Source: suitable for watching multiple files that are appended to in real time, and supports breakpoint resume.
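
The breakpoint resume works through the positionFile configured below: Flume keeps a small JSON file that records how far it has read in each monitored file, roughly like this (the inode/pos values here are just an illustration):

[{"inode":2496272,"pos":12,"file":"/opt/module/flume-1.9.0/jobs/tail/file/file1.txt"}]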

#agent
a1.sources = r1 
a1.channels  = c1 
a1.sinks = k1

#sources
a1.sources.r1.type = TAILDIR
#position file that records read offsets, enabling breakpoint resume
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/position.json
#group the monitored files so they can be handled separately
a1.sources.r1.filegroups = f1 f2 
#file pattern that each group monitors
a1.sources.r1.filegroups.f1= /opt/module/flume-1.9.0/jobs/tail/file/file.*
a1.sources.r1.filegroups.f2= /opt/module/flume-1.9.0/jobs/tail/log/log.*


#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks
a1.sinks.k1.type = hdfs
#destination path on HDFS
a1.sinks.k1.hdfs.path = hdfs://zzz01:9820/flume/tail/%Y%m%d/%H
#prefix for uploaded files
a1.sinks.k1.hdfs.filePrefix = upload-
##whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
#how many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
#the time unit used for rounding
a1.sinks.k1.hdfs.roundUnit = hour
#whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#number of Events to accumulate before one flush to HDFS
a1.sinks.k1.hdfs.batchSize = 100
#file type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a1.sinks.k1.hdfs.rollInterval = 60
#roll to a new file once it reaches this size (bytes)
a1.sinks.k1.hdfs.rollSize = 134217700
#0 means rolling does not depend on the number of Events
a1.sinks.k1.hdfs.rollCount = 0

#wiring
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1





flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/flume-taildir-hdfs.conf

Flume topologies: chaining multiple Flume agents together

1. Replicating and multiplexing
Flume-1 monitors file changes and passes them to Flume-2, which stores them in HDFS.
At the same time Flume-1 passes the changes to Flume-3, which writes them to the local filesystem.

Start the servers first, then the client; that is, start from the back (downstream) end of the chain.

source ----> (replicating selector) ----> channel1 channel2 ----> sink1 sink2 ----> source1 source2 ----> channel1 channel2
----> hdfs sink1 / file_roll sink2

#the replicating channel selector copies each event to both channels; each channel feeds a different sink connected to a downstream agent
--flume1.conf
#agent
a1.sources = r1 
a1.channels  = c1 c2
a1.sinks = k1 k2

#sources
a1.sources.r1.type = TAILDIR
#position file that records read offsets, enabling breakpoint resume
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/position.json
#group the monitored files so they can be handled separately
a1.sources.r1.filegroups = f1 f2 
#file pattern that each group monitors
a1.sources.r1.filegroups.f1= /opt/module/flume-1.9.0/jobs/tail/file/file.*
a1.sources.r1.filegroups.f2= /opt/module/flume-1.9.0/jobs/tail/log/log.*

#ChannelSelector
a1.sources.r1.selector.type = replicating

#channels.c2
#c2 is a memory channel
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100 

#channels.c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 


#sink k1: avro, forwards to the downstream agent that writes to HDFS
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 6666

#sink k2: avro, forwards to the downstream agent that writes to the local filesystem
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 5555


#bind
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/replicating/flume1.conf -Dflume.root.logger=INFO,console


#this agent writes the data to the local filesystem
--flume2.conf

#define the agent

#define the source name(s); there can be more than one
a2.sources = r2
#define the sink name(s)
a2.sinks = k2
#define the channel name(s); there can be more than one
a2.channels = c2 

#source configuration
a2.sources.r2.type = avro
#host to listen on
a2.sources.r2.bind = localhost
#port to listen on
a2.sources.r2.port = 5555


#sink configuration
#a2's sink is file_roll, which writes to a local directory; the directory must already exist, Flume will not create it
a2.sinks.k2.type = file_roll
a2.sinks.k2.sink.directory = /opt/module/flume-1.9.0/jobs/replicating/repout


#channel configuration
#a2's channel is of type memory
a2.channels.c2.type = memory
#maximum capacity is 1000 events
a2.channels.c2.capacity = 1000
#commit a transaction once 100 events have been collected
a2.channels.c2.transactionCapacity = 100


#wiring
#connect the source to the channel
a2.sources.r2.channels = c2 
#connect the sink to the channel; one sink can only be connected to one channel
a2.sinks.k2.channel = c2

flume-ng agent --conf conf/ --name a2  --conf-file $FLUME_HOME/jobs/replicating/flume2.conf

#this agent uploads the data to HDFS
--flume3.conf
#agent
a3.sources = r3 
a3.channels  = c3 
a3.sinks = k3

#sources
a3.sources.r3.type = avro
a3.sources.r3.bind = localhost
a3.sources.r3.port = 6666

#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100 

#sinks
a3.sinks.k3.type = hdfs
#destination path on HDFS
a3.sinks.k3.hdfs.path = hdfs://zzz01:9820/flume/output/repout/%Y%m%d/%H
#prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = repout-
##whether to roll folders based on time
a3.sinks.k3.hdfs.round = true
#how many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
#the time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
#whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#number of Events to accumulate before one flush to HDFS
a3.sinks.k3.hdfs.batchSize = 100
#file type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
#how often (seconds) to roll to a new file
a3.sinks.k3.hdfs.rollInterval = 60
#roll to a new file once it reaches this size (bytes)
a3.sinks.k3.hdfs.rollSize = 134217700
#0 means rolling does not depend on the number of Events
a3.sinks.k3.hdfs.rollCount = 0


#bind (wiring)
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

flume-ng agent --conf conf/ --name a3  --conf-file $FLUME_HOME/jobs/replicating/flume3.conf  -Dflume.root.logger=INFO,console

Load balancing:

Flume has three sink processors: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
DefaultSinkProcessor works with a single Sink;
LoadBalancingSinkProcessor and FailoverSinkProcessor work with a Sink Group: LoadBalancingSinkProcessor provides load balancing,
and FailoverSinkProcessor provides failover (error recovery).

Use a LoadBalancingSinkProcessor to distribute data across a sink group and achieve load balancing.

flume1.conf
#agent
a1.sources = r1 
a1.channels  = c1 
a1.sinks = k1 k2

#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#ChannelSelector
a1.sources.r1.selector.type = replicating

#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 


#sink processor: define the sink group

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = false
a1.sinkgroups.g1.processor.selector = random
#a1.sinkgroups.g1.processor.selector = round_robin  (round-robin selection)


#sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555


a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666




#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/balance/flume1.conf  -Dflume.root.logger=INFO,console

flume2.conf
#agent
a2.sources = r2 
a2.channels  = c2 
a2.sinks = k2

#sources

a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 6666

#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100 


#sinks
a2.sinks.k2.type = logger
 
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

flume-ng agent --conf conf/ --name a2  --conf-file $FLUME_HOME/jobs/balance/flume2.conf  -Dflume.root.logger=INFO,console

flume3.conf
#agent
a3.sources = r3 
a3.channels  = c3 
a3.sinks = k3

#sources

a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 5555

#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100 


#sinks
a3.sinks.k3.type = logger

#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

flume-ng agent --conf conf/ --name a3  --conf-file $FLUME_HOME/jobs/balance/flume3.conf  -Dflume.root.logger=INFO,console

Use a FailoverSinkProcessor on the sink group for failover / high availability

--with the Failover Sink Processor only one sink is active at a time while the other stands by; the Failover Sink Processor provides error recovery

flume1.conf

#agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
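#the sink with the higher priority value (k2 = 10) is active; if it fails, events fail over to k1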

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555


a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666

#bind
a1.sources.r1.channels= c1
a1.sinks.k1.channel =  c1
a1.sinks.k2.channel =  c1


Start:

flume-ng agent --conf conf/ --name a1 --conf-file  $FLUME_HOME/jobs/FailoverSinkProcessor/flume1.conf -Dflume.root.logger=INFO,console

flume2.conf

#agent
a2.sources = r2 
a2.channels  = c2 
a2.sinks = k2

#sources

a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 5555

#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100 


#sinks
a2.sinks.k2.type = logger
 
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Start
flume-ng agent --conf conf/ --name a2 --conf-file  $FLUME_HOME/jobs/FailoverSinkProcessor/flume2.conf -Dflume.root.logger=INFO,console




flume3.conf

#agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3

#sources
a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 6666

#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100 


#sinks
a3.sinks.k3.type = logger
 
#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Start
 flume-ng agent --conf conf/ --name a3 --conf-file  $FLUME_HOME/jobs/FailoverSinkProcessor/flume3.conf -Dflume.root.logger=INFO,console

Aggregation: combining multiple data sources

flume1.sink1 + flume2.sink2 = flume3.sources

Configuration on zzz03
flume-exec-avro.conf
#agent
a1.sources = r1
a1.channels = c1 
a1.sinks = k1

#sources
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/jobs/log.log


#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname = zzz01
a1.sinks.k1.port = 4141

#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 

--
flume-ng agent --conf conf/ --name a1 --conf-file  $FLUME_HOME/jobs/groupby/flume-exec-avro.conf  -Dflume.root.logger=INFO,console

--
Port monitor on zzz02
flume-netcat-avro.conf

#agent
a1.sources = r1
a1.channels = c1 
a1.sinks = k1

#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = zzz02
a1.sources.r1.port = 44444


#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks
a1.sinks.k1.type=avro
a1.sinks.k1.hostname = zzz01
a1.sinks.k1.port = 4141

#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 


flume-ng agent --conf conf/ --name a1 --conf-file  $FLUME_HOME/jobs/groupby/flume-netcat-avro.conf  -Dflume.root.logger=INFO,console

zzz01 prints whatever it receives to the console
flume-avro-logger.conf
#agent
a1.sources = r1
a1.channels = c1 
a1.sinks = k1

#sources
a1.sources.r1.type = avro
a1.sources.r1.bind = zzz01
a1.sources.r1.port = 4141


#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

#sinks
a1.sinks.k1.type=logger

#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 


flume-ng agent --conf conf/ --name a1 --conf-file  $FLUME_HOME/jobs/groupby/flume-avro-logger.conf  -Dflume.root.logger=INFO,console

Custom Interceptor

Custom interceptor code:

Custom interceptor
package EventHeaderInterceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;


/* Package this into a jar and place it in Flume's lib directory; in the config, reference it by fully-qualified class name plus $Builder */
public class EventHeaderInterceptor implements Interceptor {
        @Override
        public void initialize() {
            /* initialization hook */
        }

    /**
     * Implements the main interception logic
     *
     * @param event the incoming event (data packet)
     * @return the event after processing
     */
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        // add a routing attribute to the event's header
        if (body.contains("向晚")) {
            headers.put("title", "Ava");
        } else if (body.contains("贝拉")) {
            headers.put("title", "kira");
        } else {
            headers.put("title", "ot");
        }
        // return the event after processing
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }

        return list;
    }

    @Override
    public void close() {
        /* cleanup hook */
    }

    // inner Builder class; Flume instantiates it via reflection and calls build() to create the interceptor instance
    public static class MyBuilder implements Builder {

        @Override
        public Interceptor build() {
            return new EventHeaderInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

This is the Multiplexing topology: based on the value of a given key in the event Header,
events are sent to different Channels, so we need a custom Interceptor that assigns different values to that Header key for different kinds of events.

flume1.conf
#agent
a1.sources = r1 
a1.channels  = c1 c2 c3
a1.sinks = k1 k2 k3

#sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = zzz01
a1.sources.r1.port = 44444

#ChannelSelector
a1.sources.r1.selector.type = multiplexing
#route by the value of the header key named title
a1.sources.r1.selector.header = title
#events whose title value is Ava go to c1
a1.sources.r1.selector.mapping.Ava = c1
a1.sources.r1.selector.mapping.kira = c2
#everything else goes to the default channel (or map it explicitly)
a1.sources.r1.selector.default = c3
#a1.sources.r1.selector.mapping.ot = c3

#Interceptor
a1.sources.r1.interceptors = i1
#fully-qualified class name of the inner Builder class whose build() creates the interceptor
a1.sources.r1.interceptors.i1.type =EventHeaderInterceptor.EventHeaderInterceptor$MyBuilder

#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100 

a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100 


#sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = localhost
a1.sinks.k1.port = 5555


a1.sinks.k2.type = avro
a1.sinks.k2.hostname = localhost
a1.sinks.k2.port = 6666


a1.sinks.k3.type = avro
a1.sinks.k3.hostname = localhost
a1.sinks.k3.port = 7777

#bind
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3

#start flume1.conf

flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/Interceptor/flume1.conf  -Dflume.root.logger=INFO,console
flume2.conf
#agent
a2.sources = r2 
a2.channels  = c2 
a2.sinks = k2

#sources

a2.sources.r2.type= avro
a2.sources.r2.bind= localhost
a2.sources.r2.port= 6666

#channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100 


#sinks
a2.sinks.k2.type = logger
 
#bind
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
#start
flume-ng agent --conf conf/ --name a2  --conf-file $FLUME_HOME/jobs/Interceptor/flume2.conf  -Dflume.root.logger=INFO,console

flume3.conf
#agent
a3.sources = r3 
a3.channels  = c3 
a3.sinks = k3

#sources

a3.sources.r3.type= avro
a3.sources.r3.bind= localhost
a3.sources.r3.port= 5555

#channels
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100 


#sinks
a3.sinks.k3.type = logger

#bind
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

flume-ng agent --conf conf/ --name a3  --conf-file $FLUME_HOME/jobs/Interceptor/flume3.conf  -Dflume.root.logger=INFO,console
flume4.conf
#agent
a4.sources = r4 
a4.channels  = c4
a4.sinks = k4

#sources

a4.sources.r4.type= avro
a4.sources.r4.bind= localhost
a4.sources.r4.port= 7777

#channels
a4.channels.c4.type = memory
a4.channels.c4.capacity = 1000
a4.channels.c4.transactionCapacity = 100 


#sinks
a4.sinks.k4.type = logger

#bind
a4.sources.r4.channels = c4
a4.sinks.k4.channel = c4
#start
flume-ng agent --conf conf/ --name a4  --conf-file $FLUME_HOME/jobs/Interceptor/flume4.conf  -Dflume.root.logger=INFO,console
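
To test the routing, send lines to the netcat source (assuming the nc client is available): lines containing 向晚 should arrive at the agent on port 5555 (flume3), lines containing 贝拉 at port 6666 (flume2), and everything else at port 7777 (flume4).

nc zzz01 44444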

 

Custom Source

According to the official docs, a custom MySource needs to extend the AbstractSource class and implement the Configurable and PollableSource interfaces.

Custom source code

Methods to implement:
getBackOffSleepIncrement()  //backoff step
getMaxBackOffSleepInterval()  //maximum backoff time
configure(Context context)  //initialization: read the configuration
process()  //fetch data, wrap it into events, and write them to the channel; called in a loop
Use cases: reading data from MySQL or other systems

Custom Source
package MySource;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;

import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class MySource extends AbstractSource implements Configurable, PollableSource {
    String prefix;

    /**
     * Main processing method; called in a loop to pull in batches of data
     *
     * @return the status of this poll (READY or BACKOFF)
     * @throws EventDeliveryException
     */
    @Override
    public Status process() throws EventDeliveryException {
        // sleep on each call to throttle the rate
        try {
            TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        Status status = null;

        try {

            // receive new data and wrap it into an Event
            Event event = getData();

            // hand the event to the ChannelProcessor, which writes it into the channel(s)
            getChannelProcessor().processEvent(event);
            // report success
            status = Status.READY;
        } catch (Throwable t) {
            // status to return when processing fails

            status = Status.BACKOFF;

        }
        return status;
    }

    /**
     * How this source collects and wraps data; both header and body can be set here
     *
     * @return the assembled event
     */
    private Event getData() {
        // generate a random String as test data
        String data = UUID.randomUUID().toString();
        // Event is an interface and cannot be instantiated directly; use the SimpleEvent implementation
        Event event = new SimpleEvent();
        // build the body: the prefix from the configuration is prepended,
        // and the body must be a byte array
        String line = prefix + data;
        event.setBody(line.getBytes());
        // put something in the header so a downstream multiplexing selector can route by this key
        event.getHeaders().put("star", "kira");

        return event;
    }


    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }

    /**
     * Reads the configuration; a default value can be supplied
     *
     * @param context
     */
    @Override
    public void configure(Context context) {
        // read the prefix property from the source config; fall back to the default if it is not set
        prefix = context.getString("prefix", "Ava-a");
    }
}

Using the custom Source

#agent
a1.sources = r1 
a1.channels = c1 
a1.sinks = k1

#sources
#custom components are referenced by their fully-qualified class name
a1.sources.r1.type = MySource.MySource
a1.sources.r1.prefix = log->

#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 


#sinks
a1.sinks.k1.type = logger
#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 




flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/mysource/mysource.conf  -Dflume.root.logger=INFO,console

Custom Sink

Custom sink code

The official docs also describe the custom sink interface:
https://flume.apache.org/FlumeDeveloperGuide.html#sink

According to the official docs, a custom MySink needs to extend the AbstractSink class and implement the Configurable interface.
Methods to implement:
configure(Context context)  //initialization: read the configuration
process()  //read events from the Channel; called in a loop
Use cases: reading Channel data and writing it to MySQL or other systems

package MySink;

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;
import org.apache.flume.Context;
import org.apache.flume.conf.Configurable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.StandardCharsets;
import java.util.Map;

/* The official docs also describe the custom sink interface:
https://flume.apache.org/FlumeDeveloperGuide.html#sink
A custom MySink extends the AbstractSink class and implements the Configurable interface.
Methods to implement:
configure(Context context)  //initialization: read the configuration
process()  //read events from the Channel; called in a loop
Use cases: reading Channel data and writing it to MySQL or other systems
*/
public class MySink extends AbstractSink implements Configurable {
    Logger logger = LoggerFactory.getLogger(MySink.class);

    /**
     * Main processing logic of the sink
     *
     * @return
     * @throws EventDeliveryException
     */
    @Override
    public Status process() throws EventDeliveryException {

        Status status = null;
        // pull data from the channel attached to this sink
        Channel ch = getChannel();
        // get a transaction; sinks are fully transactional
        Transaction txn = ch.getTransaction();
        // begin the transaction
        txn.begin();
        try {
            // take an event from the channel
            Event event = ch.take();
            // process the event
            processEvent(event);
            // commit the transaction
            txn.commit();
            status = Status.READY;
        } catch (Throwable t) {
            txn.rollback();

            // Log exception, handle individual exceptions as needed

            status = Status.BACKOFF;


        } finally {
            // close the transaction
            txn.close();
        }
        return status;
    }

    /* custom event handling: log the header and the body */
    private void processEvent(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        String result = headers.toString() + "^_^" + body;
        logger.info(result);


    }


    /**
     * Reads the configuration (nothing to configure in this simple example)
     */
    @Override
    public void configure(Context context) {
    }
}

Using the custom Sink

--custom sink
#agent
a1.sources = r1 
a1.channels = c1 
a1.sinks = k1

#sources
#custom components are referenced by their fully-qualified class name
a1.sources.r1.type = MySource.MySource
a1.sources.r1.prefix = log->

#channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100 


#sinks
a1.sinks.k1.type = MySink.MySink

#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 

flume-ng agent --conf conf/ --name a1  --conf-file $FLUME_HOME/jobs/mysink/mysink.conf  -Dflume.root.logger=INFO,console

 
