Flume Sink

1. Logger Sink

Logs events at the configured level (e.g. INFO, DEBUG, ERROR); typically used for testing and debugging.

It requires a log4j configuration file in the directory specified by the --conf parameter.

By design, the logger sink truncates the event body to 16 bytes so that the console is not flooded with output. If you need to see the full content while debugging, use a different sink, for example the file_roll sink, which writes events to the local file system.

Configuration properties:

Property   Description
channel    The channel to bind to
type       Must be logger
Note: a log4j configuration file is required in the directory given by --conf; log4j can also be set manually at startup with -Dflume.root.logger=INFO,console.

 

Example configuration:

# Components of agent a1
a1.sources=r1
a1.channels=c1
a1.sinks=s1

# Describe/configure source r1 of a1
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=44444

# Describe sink s1 of a1
a1.sinks.s1.type=logger

# Describe channel c1 of a1
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
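To try it, the agent can be started the same way as in the later examples; the file name logger.conf below is an assumption, use whatever name you saved the configuration under, and the test assumes the agent runs on hadoop101.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/logger.conf -Dflume.root.logger=INFO,console

In a new ssh window, send a test line and watch it appear (truncated to 16 bytes) on the agent console:

nc hadoop101 44444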

 

2. File Roll Sink

Stores events on the local file system.

At a fixed interval, it rolls to a new file that holds the events collected during that interval.

Configuration properties:

Property           Default  Description
channel                     The channel to bind to
type                        Must be file_roll
sink.directory              The directory where the files are stored
sink.rollInterval  30       Roll to a new log file every 30 seconds. Setting this to 0 disables rolling, so all data is written into a single file.

Example configuration:

 

# Components of agent a1
a1.sources=r1
a1.sinks=s1
a1.channels=c1

# Describe/configure source r1 of a1
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=8888

# Describe the sink
a1.sinks.s1.type=file_roll
a1.sinks.s1.sink.directory=/home/work/rolldata
# Roll to a new file every 60 seconds
a1.sinks.s1.sink.rollInterval=60

# Describe the memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
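A quick way to test this (the file name fileroll.conf is an assumption; the directory /home/work/rolldata must exist before the agent starts, otherwise the sink fails):

mkdir -p /home/work/rolldata

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/fileroll.conf -Dflume.root.logger=INFO,console

In a new ssh window, send a few lines and check the directory; a new file should appear roughly every 60 seconds:

nc hadoop101 8888
ls /home/work/rolldata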

 

3. HDFS Sink

This sink writes events into the Hadoop Distributed File System (HDFS).

It currently supports creating text files and sequence files.

Both file types support compression.

Files can be rolled (a new file started) based on elapsed time, data size, or number of events.

It can also bucket/partition data by attributes such as the timestamp or the machine where the event originated.

The HDFS directory path may contain formatting escape sequences that the HDFS sink replaces in order to generate the directory and file names used to store the events.

Using this sink requires that Hadoop is installed, so that Flume can use the Hadoop jars to communicate with HDFS.

 

 

Property           Default       Description
channel                          The channel to bind to
type                             Must be hdfs
hdfs.path                        HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.inUseSuffix   .tmp          Suffix appended to files that Flume is currently writing
hdfs.rollInterval  30            Number of seconds to wait before rolling the current file (0 = never roll based on time). Example: if data keeps arriving for 40 seconds, the first 30 seconds go into one file and the remaining 10 seconds fall into the next 30-second window, producing a second file.
hdfs.rollSize      1024          File size to trigger a roll, in bytes (0 = never roll based on file size)
hdfs.rollCount     10            Number of events written to a file before it is rolled (0 = never roll based on number of events)
hdfs.fileType      SequenceFile  File format: currently SequenceFile, DataStream, or CompressedStream

Configuration file: hdfssource.conf

# Components of agent a1
a1.sources=r1
a1.sinks=s1
a1.channels=c1

# Describe/configure source r1 of a1
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=8888

# Describe the sink
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://hadoop101:9000/flume
a1.sinks.s1.hdfs.fileType=DataStream

# Describe the memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/hdfssource.conf -Dflume.root.logger=INFO,console

In a new ssh window, run: nc hadoop101 8888

Further question: if the HDFS NameNode goes down and an HA failover occurs, how does the Flume sink follow the switch?
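A minimal sketch of one common answer, assuming HDFS HA is configured with a nameservice called mycluster (the nameservice name and file locations are assumptions, not from this post): point hdfs.path at the logical nameservice instead of a single NameNode, and make the cluster's client configs visible on Flume's classpath so the HDFS client itself retries against the active NameNode after a failover.

# Assumed HA setup: nameservice "mycluster" defined in hdfs-site.xml
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://mycluster/flume
a1.sinks.s1.hdfs.fileType=DataStream

# Copy (or symlink) the cluster client configs, e.g. core-site.xml and
# hdfs-site.xml, into Flume's conf directory so the HDFS client can resolve
# "mycluster" and fail over transparently.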

 

4. Avro Sink

The Avro sink is the building block for multi-hop flows, fan-out flows (one to many), and fan-in flows (many to one).

Configuration properties:

Property   Description
channel    The channel to bind to
type       Must be avro
hostname   The hostname or IP address of the destination (the downstream Avro source)
port       The port of the destination

Multi-hop flow:

Configuration file: multi.conf (on hadoop101)

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Configuration file: multi.conf (on hadoop102)

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Configuration file: multi.conf (on hadoop103)

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start the agents in reverse order: first the Flume agent on hadoop103, then the one on hadoop102, and finally the one on hadoop101.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/multi.conf -Dflume.root.logger=INFO,console

On hadoop101, open a new ssh window and run: nc hadoop101 8090

 

Fan-in: shanru.conf

Configuration on hadoop101

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Configuration on hadoop102

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Configuration on hadoop103

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Start in reverse order: the Flume agent on hadoop103 must be started first; hadoop101 and hadoop102 can then be started in any order.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/shanru.conf -Dflume.root.logger=INFO,console

On hadoop101, open a new ssh window and run: nc hadoop101 8090 (the same test can be run against hadoop102, which also listens on 8090)

 

Fan-out: shanchu.conf

Configuration on hadoop101

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1,c2
a1.sinks.k1.channel  = c1
a1.sinks.k2.channel  = c2

Configuration on hadoop102

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Configuration on hadoop103

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel  = c1

Start in reverse order: first the Flume agents on hadoop103 and hadoop102, then the one on hadoop101.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/shanchu.conf -Dflume.root.logger=INFO,console

On hadoop101, open a new ssh window and run: nc hadoop101 8090

 

Routing (multiplexing) mode: multiplexing.conf, a modification of the fan-out setup.

In this mode you can specify routing rules, and the selector distributes events to channels according to those rules.

 

Configuration properties:

Property            Description
selector.type       multiplexing (enables routing mode)
selector.header     The name of the event header to inspect
selector.mapping.*  The mapping rules (header value -> channel)
selector.default    The channel events are sent to when no mapping rule matches

 

Configuration on hadoop101 (hadoop102 and hadoop103 use the same configuration as in the fan-out example):

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.s1.type = http
a1.sources.s1.port = 8090

a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = class
a1.sources.s1.selector.mapping.big1902 = c1
a1.sources.s1.selector.mapping.big1903 = c2
# If class is neither big1902 nor big1903, route to c2 by default (i.e. to hadoop103)
a1.sources.s1.selector.default = c2

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1,c2
a1.sinks.k1.channel  = c1
a1.sinks.k2.channel  = c2

Start in reverse order: first the Flume agents on hadoop103 and hadoop102, then the one on hadoop101.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/multiplexing.conf -Dflume.root.logger=INFO,console

On hadoop101, open a new ssh window and run:

  curl -X POST -d '[{"headers":{"class":"big1902"},"body":"This is 1902"}]' http://0.0.0.0:8090 -- goes to c1, i.e. to hadoop102

  curl -X POST -d '[{"headers":{"class":"big1903"},"body":"This is 1903"}]' http://0.0.0.0:8090 -- goes to c2, i.e. to hadoop103

  curl -X POST -d '[{"headers":{"class":"big1900"},"body":"This is 1900"}]' http://0.0.0.0:8090 -- no rule matches, so it goes to the default channel c2, i.e. to hadoop103

 

Interceptors: timein.conf

1. Configuration on hadoop101: the interceptor type is timestamp

a1.sources=r1
a1.channels=c1
a1.sinks=s1

a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=8090
a1.sources.r1.interceptors=i1 
a1.sources.r1.interceptors.i1.type=timestamp

a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

a1.sinks.s1.type=logger

a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/timein.conf -Dflume.root.logger=INFO,console
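To see the interceptor at work, open a new ssh window, run nc hadoop101 8090 and type a line such as hello. The logger sink should print the event with a timestamp header (epoch milliseconds) that was not present in the original event; the output looks roughly like the following, where the exact timestamp value is only illustrative:

Event: { headers:{timestamp=1593413385000} body: 68 65 6C 6C 6F    hello }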

 

2. Writing data to HDFS partitioned by day:

Configuration:

 

# Components of agent a1
a1.sources=r1
a1.sinks=s1
a1.channels=c1

# Describe/configure source r1 of a1
a1.sources.r1.type=netcat
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=8888
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

# Describe the sink (%Y-%m-%d yields one directory per day)
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://hadoop101:9000/flume/time=%Y-%m-%d
a1.sinks.s1.hdfs.fileType=DataStream

# Describe the memory channel
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100

# Bind source and sink to the channel
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
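A quick check, assuming the configuration above is saved as timehdfs.conf (the file name and the date shown below are only illustrative): start the agent, send a few lines with nc, and a per-day directory should appear in HDFS.

Run from the bin directory: ./flume-ng agent -n a1 -c ../conf/ -f ../data/timehdfs.conf -Dflume.root.logger=INFO,console

In a new ssh window:

nc hadoop101 8888

Then verify the per-day directory in HDFS:

hadoop fs -ls /flume/time=2020-06-29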

 

 

 

3. search_replace - search and replace

searchin.conf configuration (replaces the digits 0-9 with *):

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = http
a1.sources.s1.port = 8090
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=search_replace
a1.sources.s1.interceptors.i1.searchPattern=[0-9]
a1.sources.s1.interceptors.i1.replaceString= *

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
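A quick test of the search_replace interceptor (the body text is illustrative; the port matches the config above): post an event through the http source and every digit should show up as * in the logger output.

curl -X POST -d '[{"headers":{},"body":"order 12345 from user 42"}]' http://0.0.0.0:8090 -- logged body reads: order ***** from user **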

 

Regex filter: regex_filter

regexin.conf configuration:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = http
a1.sources.s1.port = 8090
a1.sources.s1.interceptors=i1
a1.sources.s1.interceptors.i1.type=regex_filter
a1.sources.s1.interceptors.i1.regex=.*[0-9].*
# true: drop events that match the regex; false: drop events that do not match (default false)
a1.sources.s1.interceptors.i1.excludeEvents=true

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
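To see the filter in action (illustrative bodies; the port matches the config above): with excludeEvents=true, any event whose body contains a digit is dropped and only the purely alphabetic one reaches the logger.

curl -X POST -d '[{"headers":{},"body":"contains 7 digits"}]' http://0.0.0.0:8090 -- dropped (matches the regex)
curl -X POST -d '[{"headers":{},"body":"no digits here"}]' http://0.0.0.0:8090 -- logged (does not match)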

 

5. Failover Sink Processor

1. Maintains a prioritized list of sinks, guaranteeing that as long as one sink is available, events will be processed.

2. The failure-handling mechanism assigns a cool-down period to a failed sink; the sink is only retried after the cool-down has elapsed.

3. Each sink can be assigned a priority; the higher the number, the higher the priority.

4. If a sink fails while sending an event, the sink with the next highest priority is tried next.

5. If no priorities are specified, the order is determined by the order in which the sinks are configured; sinks configured earlier take precedence over those configured later.

6. To configure it, set up a sink group processor and assign a priority to each sink (see the sketch below this list).

7. Priorities must be unique.

8. The maxpenalty property can be set to cap the failure back-off time.
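A minimal sketch of a failover sink group, assuming an agent a1 that already defines two Avro sinks k1 and k2 (the group name g1 and the priority values are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# k2 has the larger number, so it is the preferred (higher-priority) sink
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
# Maximum back-off period for a failed sink, in milliseconds
a1.sinkgroups.g1.processor.maxpenalty = 10000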

 

6. Load Balancing Sink Processor

1. Provides load balancing across multiple sinks.

2. Maintains an indexed list of active sinks over which the load is distributed.

3. Supports round-robin and random selection; the default is round-robin, and the mechanism can be changed via configuration (see the sketch below this list).

4. A custom selection mechanism can be plugged in by extending the AbstractSinkSelector class.
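A minimal sketch of a load-balancing sink group, again assuming an agent a1 with two sinks k1 and k2 (the group name and values are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily back off a failed sink instead of retrying it immediately
a1.sinkgroups.g1.processor.backoff = true
# Selection mechanism: round_robin (default) or random
a1.sinkgroups.g1.processor.selector = round_robin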
