Flume Knowledge Summary

flume

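1. What is Flume

A highly available, distributed tool for collecting log data.

2. Flume architecture

Before version 1.0 (Flume OG)

collector
master

Drawback: collection was fast but processing was slow, so data was easily lost in between.

From version 1.0 onward (Flume NG)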

source
channel
sink

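3. Flume core concepts

agent

Agent: a server-side process that performs Flume data collection.

Each agent corresponds to one JVM process.
Each agent contains at least one set of source, channel, and sink.
Multiple agents can be started on a single node.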

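source

Data source: where Flume collects data from. Sources come in several forms:

avro source: receives Avro events on a user-specified host and port (commonly used)
exec source: reads the output of a Linux command
Spooling Directory Source: watches a folder on the local disk for newly added files
taildir Source: roughly spooldir source + exec source; can monitor multiple directories
netcat source: tcp|udp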

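channel

Data channel: buffers events between the source and the sink. Interceptors and channel selectors are configured on the source that feeds the channel, and sink processors on the sink group that drains it.

memory channel: buffers events in memory.
file channel: a file-based staging area that stores events in files on the local disk.
jdbc channel: a database-based staging area that stores events in a database.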
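
A memory channel loses whatever it is holding if the agent process stops, while a file channel keeps events on disk. As a minimal sketch of the file channel, assuming hypothetical directory paths that are not part of the original setup:

```properties
# Sketch only: both directory paths are hypothetical
a1.channels = c1
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
# Comma-separated list of directories for the event data files
a1.channels.c1.dataDirs = /home/hadoop/flume/data
```

The trade-off is throughput: every event goes through the disk, so a file channel is usually slower than a memory channel.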

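sink

The destination the data is delivered to.

hdfs sink: writes the collected data to HDFS. Commonly used.
hive sink: writes the data into Hive. Rarely used.
avro sink: sends the data to a user-specified host and port, typically the avro source of the next agent
logger sink: prints the collected data to the console. Mainly used for testing.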

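event

The smallest unit in which data is wrapped and transferred; each record is wrapped in one event.

header: metadata such as timestamp | origin, stored as key-value pairs
body: the actual data, carried as a byte array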

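4. Classic Flume deployment topologies

1) Single agent collecting data
2) Multiple agents chained in series
3) Multiple agents consolidated into one downstream agent
4) Multiplexing
5) High-availability configuration

Note: each agent can only collect data from the node it runs on.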
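
The multiplexing topology (item 4 above) relies on a channel selector configured on the source. The fragment below is a minimal sketch, assuming events carry a header called name (for example set by a static interceptor) and that the channels c1, c2, and c3 are hypothetical:

```properties
# Sketch only: channel names and header values are illustrative
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
# Route each event by the value of its "name" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = name
a1.sources.r1.selector.mapping.access = c1
a1.sources.r1.selector.mapping.nginx = c2
# Events with any other value go to c3
a1.sources.r1.selector.default = c3
```

Without a selector block the source uses the replicating selector, which copies every event to all of its channels.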

source

Exec source

Configuration file

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /home/hadoop/zookeeper.out

# Configure the agent's sink
a1.sinks.k1.type = logger

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

avro source

Configuration file

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44555

# Configure the agent's sink
a1.sinks.k1.type = logger

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Spooling Directory Source

Configuration file

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/tmpdata
# spoolDir watches all files under the given directory
# A file keeps its name until it has been collected; once collected, the suffix .COMPLETED is appended to its name
# Any new file that appears in the directory is picked up

# Configure the agent's sink
a1.sinks.k1.type = logger

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
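
The renaming behaviour described in the comments above can be tuned. The fragment below is a sketch of a few optional spooldir settings that could be added to the same source; the values shown are illustrative choices, not part of the original configuration:

```properties
# Optional spooldir settings; values are illustrative
# Suffix appended to fully ingested files (.COMPLETED is the default)
a1.sources.r1.fileSuffix = .COMPLETED
# never (the default) keeps ingested files; immediate deletes them
a1.sources.r1.deletePolicy = never
# Add a header holding the absolute path of the file each event came from
a1.sources.r1.fileHeader = true
# Skip files whose names match this regex, e.g. files still being written
a1.sources.r1.ignorePattern = ^.*[.]tmp$
```

Files should be complete before they land in the spool directory, since the source expects them not to change afterwards.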

sink

hdfs sink

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /home/hadoop/apps/apache-flume-1.8.0-bin/conf/flume-env.sh

# Configure the agent's sink
a1.sinks.k1.type = hdfs
# HDFS path where the data is stored
a1.sinks.k1.hdfs.path = /data/flume/test_01

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

If the Hadoop cluster is a high-availability cluster, copy hdfs-site.xml and core-site.xml into Flume's conf directory so that the sink can resolve the HDFS nameservice.

Improved version

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /home/hadoop/apps/apache-flume-1.8.0-bin/conf/flume-env.sh

# Configure the agent's sink
a1.sinks.k1.type = hdfs
# HDFS path where the data is stored
a1.sinks.k1.hdfs.path = /data/flume/test_02
# Comments sit on their own lines: in a properties file, text after a value becomes part of the value
# File name prefix
a1.sinks.k1.hdfs.filePrefix = log
# File name suffix
a1.sinks.k1.hdfs.fileSuffix = .1811
# Roll the file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# Roll the file once it reaches 128 MB (value in bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# Roll the file after 10000 events
a1.sinks.k1.hdfs.rollCount = 10000

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
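
The HDFS sink can also bucket its output by time. The fragment below is a sketch under the assumption of a hypothetical output path; the time escapes in the path need a timestamp, which comes either from a timestamp header on each event or from the agent's local clock as shown here:

```properties
# Sketch only: the path is hypothetical
# Time escapes in the path create one directory per time bucket
a1.sinks.k1.hdfs.path = /data/flume/events/%Y%m%d/%H%M
# Round the timestamp down to 10-minute buckets
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Use the agent's local clock instead of a timestamp header on the event
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
```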

Avro Sink

Chaining multiple agents

agent1: hdp03
exec source --> channel --> avro sink
# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = exec
a1.sources.r1.command = cat /home/hadoop/apps/apache-flume-1.8.0-bin/conf/flume-env.sh

# Configure the agent's sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp02
a1.sinks.k1.port = 44555

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

agent2: hdp02
avro source --> channel --> hdfs sink
# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = avro
a1.sources.r1.bind = hdp02
a1.sources.r1.port = 44555

# Configure the agent's sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /test/flume01

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
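
When two agents are chained over Avro, the hop itself can be tuned. The fragment below is a sketch of optional settings, with illustrative values; the first two lines would go in agent1's file and the last line in agent2's file:

```properties
# Sketch only: values are illustrative

# agent1 (sending side, avro sink): events sent per batch
a1.sinks.k1.batch-size = 100
# agent1: compress the traffic, either none (default) or deflate
a1.sinks.k1.compression-type = deflate

# agent2 (receiving side, avro source): must match the sink's compression-type
a1.sources.r1.compression-type = deflate
```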

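5. Interceptors

interceptors
An interceptor sits on a source and adds header information to each event, which makes the data easier to use downstream.
Types of interceptors in Flume:
1) Timestamp Interceptor
Adds a timestamp to each event.
headers:{timestamp=1554707017331}
key: timestamp
value: the current system timestamp
2) Host Interceptor
Adds the hostname | ip to the header of each event.
key: host
value: the hostname | ip of the current host
3) Static Interceptor
Static interceptor: applies to every event. You define the key and value by hand, and that pair is added to the header so the data can be classified later.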
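
Interceptors also take options, and Flume ships further types beyond the three listed, such as a regex filter. The fragment below is a sketch that assumes illustrative aliases and an illustrative regex; it writes the hostname instead of the IP and drops events whose body starts with DEBUG:

```properties
# Sketch only: interceptor aliases and the regex are illustrative
a1.sources.r1.interceptors = i1 i2
# Host interceptor: store the hostname rather than the IP in the "host" header
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false
# Regex filter interceptor: drop events whose body matches the regex
a1.sources.r1.interceptors.i2.type = regex_filter
a1.sources.r1.interceptors.i2.regex = ^DEBUG.*
a1.sources.r1.interceptors.i2.excludeEvents = true
```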

Example 1: a single interceptor

Configuration file

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Configure the agent's sink
a1.sinks.k1.type = logger

# Configure the agent's channel
a1.channels.c1.type = memory

# Configure the interceptor
# Interceptor alias
a1.sources.r1.interceptors = i1
# Interceptor type: host
a1.sources.r1.interceptors.i1.type = host

# Interceptor type: static (alternative to the host interceptor above)
# a1.sources.r1.interceptors.i1.type = static
# Key for the static interceptor
# a1.sources.r1.interceptors.i1.key = class
# Value for the static interceptor
# a1.sources.r1.interceptors.i1.value = bd1811

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

flume-ng agent --conf conf --conf-file /home/hadoop/app/flume/conf/test_intereptor_01  --name a1 -Dflume.root.logger=INFO,console

Each collected event looks like this

host interceptor:
Event: { headers:{host=192.168.191.203} body: 68 65 6C 6C 6F 20 74 6F 6D 0D             hello tom. }
static interceptor:
headers:{class=bd1811}

Example 2: combining multiple interceptors

Configuration file

# Name the sources, sinks, and channels of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the agent's source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Define the interceptors
# Interceptor aliases
a1.sources.r1.interceptors = i1 i2
# Type of interceptor i1
a1.sources.r1.interceptors.i1.type = static
# Key for interceptor i1
a1.sources.r1.interceptors.i1.key = class
# Value for interceptor i1
a1.sources.r1.interceptors.i1.value = bd1811
# Type of interceptor i2
a1.sources.r1.interceptors.i2.type = timestamp

# Configure the agent's sink
a1.sinks.k1.type = logger

# Configure the agent's channel
a1.channels.c1.type = memory

# Bind r1 and k1 to channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Each collected event looks like this

headers:{ class=bd1811, timestamp=1554711940387}

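6. Comprehensive example

Requirements

Two log servers, A and B, produce logs in real time. The main log types are access.log, nginx.log, and web.log.
The requirement:
Collect access.log, nginx.log, and web.log from machines A and B onto machine C, then write them all to HDFS.
The required directory layout in HDFS is:
/source/logs/access/20160101/**
/source/logs/nginx/20160101/**
/source/logs/web/20160101/**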

Agent configuration on cdh01 and cdh02 (config files agent1 and agent2)

# Name the core components
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1
# Source 1: access.log
# The static interceptor inserts a user-defined key-value pair into the header of each collected event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/flume/access.log
# Interceptor for r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = name
a1.sources.r1.interceptors.i1.value = access

# Source 2: nginx.log
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /home/hadoop/data/flume/nginx.log
# Interceptor for r2
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = name
a1.sources.r2.interceptors.i2.value = nginx

# Source 3: web.log
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /home/hadoop/data/flume/web.log
# Interceptor for r3
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = name
a1.sources.r3.interceptors.i3.value = web

# Configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = cdh03
a1.sinks.k1.port = 41414

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Bind sources and sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
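
An exec source running tail -F does not remember how far it has read, so events can be missed or repeated when the agent restarts. As an alternative sketch, not the configuration used in this example, a single taildir source could replace the three exec sources, with per-group headers playing the role of the static interceptors. The position file path is hypothetical:

```properties
# Sketch only: one taildir source in place of the three exec sources
a1.sources = r1
a1.sources.r1.type = TAILDIR
# Hypothetical location of the file that records read offsets
a1.sources.r1.positionFile = /home/hadoop/data/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2 f3
a1.sources.r1.filegroups.f1 = /home/hadoop/data/flume/access.log
a1.sources.r1.filegroups.f2 = /home/hadoop/data/flume/nginx.log
a1.sources.r1.filegroups.f3 = /home/hadoop/data/flume/web.log
# Per-group headers take the place of the static interceptors
a1.sources.r1.headers.f1.name = access
a1.sources.r1.headers.f2.name = nginx
a1.sources.r1.headers.f3.name = web
a1.sources.r1.channels = c1
```

The downstream agent would not need to change, since it only reads the name header.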

Aggregating agent: deployed on cdh03 (config file agent3)

# Name the agent's source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = cdh03
a1.sources.r1.port = 41414
# Add a timestamp interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.TimestampInterceptor$Builder

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /source/logs/%{name}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Use the local time of this host for the time escapes in hdfs.path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Do not roll files by event count
a1.sinks.k1.hdfs.rollCount = 0
# Roll files every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll files once they reach 10 MB (value in bytes)
a1.sinks.k1.hdfs.rollSize = 10485760
# Rolling is evaluated per file

# Number of events written to HDFS per batch (tuning)
a1.sinks.k1.hdfs.batchSize = 20
# Number of threads Flume uses for HDFS operations such as create and write (tuning)
a1.sinks.k1.hdfs.threadsPoolSize=10
# Timeout for HDFS operations, in milliseconds
a1.sinks.k1.hdfs.callTimeout=30000

# Wire up the source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start:

cdh03 (start this aggregating agent first so that its avro source is listening)

bin/flume-ng agent --conf conf --conf-file /home/hadoop/app/flume/conf/agent3 --name a1

cdh01, cdh02

bin/flume-ng agent --conf conf --conf-file /home/hadoop/app/flume/conf/agent1 --name a1
bin/flume-ng agent --conf conf --conf-file /home/hadoop/app/flume/conf/agent2 --name a1

7. Advanced Flume usage

High-availability setup

hdp01 hdp02 hdp03    webserver
hdp03                primary aggregating agent
hdp04                backup aggregating agent

Configuration files

Agent on the webserver side: hdp01 hdp02 hdp03

# Name the components
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1 k2

# Name the sink group
agent1.sinkgroups = g1

# Configure the source
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/flume_data/access.log
# Define the interceptors
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp
# Configure the channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Sink 1: the primary aggregating agent
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hdp03
agent1.sinks.k1.port = 52020
# Sink 2: the backup aggregating agent
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = hdp04
agent1.sinks.k2.port = 52020

# Members of the sink group
agent1.sinkgroups.g1.sinks = k1 k2
# Failover policy: priority sets each sink's priority, and a larger number means a higher priority.
# Events go to the highest-priority sink that is alive; while it is alive, the lower-priority sink receives no data.
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
# Upper limit on the backoff period for a failed sink, in milliseconds
agent1.sinkgroups.g1.processor.maxpenalty = 10000

# Bind source and sinks to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
agent1.sinks.k2.channel = c1
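
Failover keeps the backup collector idle while the primary is healthy. If both collectors should share the traffic instead, the same sink group could use the load-balancing processor. This is a sketch of that alternative, not part of the original setup:

```properties
# Sketch only: same sink group, spreading events across both collectors
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
# Selection strategy: round_robin (the default) or random
agent1.sinkgroups.g1.processor.selector = round_robin
# Temporarily skip a failed sink, backing off exponentially before retrying it
agent1.sinkgroups.g1.processor.backoff = true
```

With load balancing, a failed sink's share of events is picked up by the remaining sink until the failed one recovers.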

The two aggregating agents
hdp03

# Name the components
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Configure the source
## Set this to the host name of the machine the agent runs on
a2.sources.r1.type = avro
a2.sources.r1.bind = hdp03
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
## Set this to the host name of the machine the agent runs on
a2.sources.r1.interceptors.i1.value = hdp03

# Configure the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Configure the sink
a2.sinks.k1.type=hdfs
a2.sinks.k1.hdfs.path= /flume_ha/loghdfs
a2.sinks.k1.hdfs.fileType=DataStream
a2.sinks.k1.hdfs.writeFormat=TEXT
a2.sinks.k1.hdfs.rollInterval=10
a2.sinks.k1.hdfs.filePrefix=%Y-%m-%d

# Bind source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel=c1

hdp04

# Name the components
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Configure the source
## Set this to the host name of the machine the agent runs on
a2.sources.r1.type = avro
a2.sources.r1.bind = hdp04
a2.sources.r1.port = 52020
a2.sources.r1.interceptors = i1
a2.sources.r1.interceptors.i1.type = static
a2.sources.r1.interceptors.i1.key = Collector
## Set this to the host name of the machine the agent runs on
a2.sources.r1.interceptors.i1.value = hdp04

# Configure the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Configure the sink
a2.sinks.k1.type=hdfs
a2.sinks.k1.hdfs.path= /flume_ha/loghdfs
a2.sinks.k1.hdfs.fileType=DataStream
a2.sinks.k1.hdfs.writeFormat=TEXT
a2.sinks.k1.hdfs.rollInterval=10
a2.sinks.k1.hdfs.filePrefix=%Y-%m-%d

# Bind source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel=c1

posted @ 2021-01-16 14:14 凯尔哥
