Flume

- Tool: a program you launch when you need its functionality, and simply shut down when you are done.
- Framework: a semi-finished piece of software; developers fill in the core code according to their business logic to form a complete program that provides tool or service functionality.
What is Flume?

- In data-processing scenarios, data is generated on servers scattered across time and space, and it must be collected automatically from each server into a single HDFS cluster; this calls for an automated collection tool.
- Flume is a tool in the Hadoop ecosystem dedicated to collecting, aggregating, and moving massive volumes of log data.
Core concepts

- agent: a JVM instance started by Flume is called an Agent.
- source: a component inside an agent that reads raw data (files, network connections over various protocols), wraps the raw data into event objects, and sends the events to a channel.
- channel: a queue, backed by memory or by files on disk, that buffers events.
- sink: reads buffered events from the channel, unwraps the data inside each event, and sends it to a downstream component (e.g. HDFS).
- event: Flume's own object for wrapping data (for example, one line of a log file is wrapped into one event).
Flume installation

1. Upload and extract the archive.
2. Make sure JAVA_HOME points at the JDK (e.g. JAVA_HOME=/opt/java/modul/jdk1.8.0_291), then register Flume on the PATH:

```shell
echo 'export FLUME_HOME=/opt/java/modul/flume-1.9.0' >> /etc/profile
echo 'export PATH=$PATH:$FLUME_HOME/bin' >> /etc/profile
source /etc/profile
```
Flume operations

- Flume's decoupled design splits the data-collection side, the data-storage side, and the buffer in between into three modules.
- To collect data with Flume, simply pick the source, channel, and sink that match your needs, write an agent configuration file that wires the three modules into one agent, and start the program.
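The wiring described above always follows the same pattern; a minimal skeleton (the names in angle brackets are placeholders to be filled in) looks like:

```properties
# <agent-name>.sources / .sinks / .channels declare the components
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# each component gets a type, plus type-specific settings
a1.sources.r1.type  = <source type>
a1.channels.c1.type = <channel type>
a1.sinks.k1.type    = <sink type>

# wiring: a source may feed several channels (plural key),
# but a sink drains exactly one channel (singular key)
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```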
Common sources

| Comparison | exec | spooldir | taildir |
|---|---|---|---|
| Function | Reads data by executing a shell command; usually paired with `tail -F file` for real-time reading of a single file | Monitors an entire directory and reads every file in it; no recursion | Monitors changes to multiple files or directories |
| Strengths | Best performance | Can monitor a whole directory | Can monitor files or directories, and records read progress |
| Weaknesses | Single file only; no resume from a breakpoint | No resume from a breakpoint; does not pick up changes to already-read files | - |
Common channels

| Comparison | memory | file |
|---|---|---|
| Performance | Better | Slightly weaker |
| Safety | Slightly weaker | Better |
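When the safety of the file channel is needed, it is configured with explicit on-disk directories (the paths below are illustrative):

```properties
a1.channels.c1.type = file
# where the channel persists its checkpoint and event data,
# so buffered events survive an agent restart
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
```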
Common sinks

- logger: usually used for testing; conveniently prints the data to the console.
- HDFS sink: writes the data to an HDFS cluster.
Case 1:
Add the following configuration to agent_test_1.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Start the listener (data sent to localhost:44444, e.g. with `nc localhost 44444`, is printed to the console):

```shell
flume-ng agent --conf /opt/module/flume-1.9.0/conf/ --conf-file agent_test_1.conf --name a1 -Dflume.root.logger=INFO,console
```
Case 2:
Add the configuration to agent_test_1.conf.
Case 3:
hdfs

flume2:
Create the project in IDEA, and configure the file as follows:
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
# source alias r1
exec2logger.sources = r1
# sink alias k1
exec2logger.sinks = k1
# channel alias c1
exec2logger.channels = c1

# Describe/configure the source
# source type
exec2logger.sources.r1.type = exec
# command that produces the data
exec2logger.sources.r1.command = tail -F /opt/log.txt
#exec2logger.sources.r1.port = 44444

# Describe the sink
# output type
exec2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
# buffer type
exec2logger.channels.c1.type = memory
# capacity
exec2logger.channels.c1.capacity = 200000
# max events per transaction
exec2logger.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
exec2logger.sources.r1.channels = c1
exec2logger.sinks.k1.channel = c1
```
Copy the exec2logger file to the Flume installation directory on Linux, /opt/module/flume-1.9.0/agents.
Start the agent with `flume-ng agent` as in Case 1, pointing --conf-file at this file and --name at exec2logger.
JAR conflict: Flume 1.9 ships guava-11.0.2.jar, which commonly conflicts with the newer Guava used by Hadoop 3.x; deleting `$FLUME_HOME/lib/guava-11.0.2.jar` resolves the resulting NoSuchMethodError at startup.
Case 2
Write spooldir2logger.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
# source alias r1
spooldir2logger.sources = r1
# sink alias k1
spooldir2logger.sinks = k1
# channel alias c1
spooldir2logger.channels = c1

# Describe/configure the source
# source type
spooldir2logger.sources.r1.type = spooldir
# directory to watch
spooldir2logger.sources.r1.spoolDir = /opt/log
#spooldir2logger.sources.r1.port = 44444

# Describe the sink
# output type
spooldir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
# buffer type
spooldir2logger.channels.c1.type = memory
# capacity
spooldir2logger.channels.c1.capacity = 200000
# max events per transaction
spooldir2logger.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
spooldir2logger.sources.r1.channels = c1
spooldir2logger.sinks.k1.channel = c1
Note: create the /opt/log directory first.
Start the agent, then move log.txt into the /opt/log directory: its data is ingested.
Appending new lines to a file that has already been read produces no reaction (spooldir does not re-read files).
Renaming the file (changing its suffix) makes its data be ingested again as a new file.
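This behavior follows from how spooldir tracks files: a finished file is renamed with a .COMPLETED suffix (the default fileSuffix), and only a file with a new name counts as new input. A local simulation of the renaming (no Flume involved; paths are illustrative):

```shell
# simulate the spooling directory from the example
rm -rf /tmp/spool_demo && mkdir -p /tmp/spool_demo

# a file dropped into the directory is read once...
echo "first batch" > /tmp/spool_demo/log.txt
# ...after which the spooldir source renames it (default suffix .COMPLETED)
mv /tmp/spool_demo/log.txt /tmp/spool_demo/log.txt.COMPLETED

# appending to the completed file is ignored; only a file with a
# new name is treated as fresh input
echo "second batch" > /tmp/spool_demo/log2.txt

ls /tmp/spool_demo
```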
Case 3
taildir2logger.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2logger.sources = r1
taildir2logger.sinks = k1
taildir2logger.channels = c1

# Describe/configure the source
taildir2logger.sources.r1.type = TAILDIR
taildir2logger.sources.r1.filegroups = g1 g2 g3
taildir2logger.sources.r1.filegroups.g1 = /opt/1.txt
taildir2logger.sources.r1.filegroups.g2 = /opt/.*.txt
taildir2logger.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2logger.sources.r1.port = 44444

# Describe the sink
taildir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
taildir2logger.channels.c1.type = memory
taildir2logger.channels.c1.capacity = 1000
taildir2logger.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
taildir2logger.sources.r1.channels = c1
taildir2logger.sinks.k1.channel = c1
```

Check the .flume directory under the home directory (~): the taildir source records its read positions there, which is what enables it to resume after a restart.
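The position file defaults to ~/.flume/taildir_position.json and is a plain JSON array of per-file offsets. A sample of what it might contain (the inode and pos values are illustrative, and a temp path stands in for the home directory):

```shell
# write an illustrative copy of what ~/.flume/taildir_position.json looks like
mkdir -p /tmp/flume_home/.flume
cat > /tmp/flume_home/.flume/taildir_position.json <<'EOF'
[{"inode":67149,"pos":1024,"file":"/opt/2.txt"}]
EOF

# pos is the byte offset already read; on restart the source resumes there
cat /tmp/flume_home/.flume/taildir_position.json
```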
Case 4
taildir2hdfs.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2hdfs.sources = r1
# sink alias k1
taildir2hdfs.sinks = k1
# channel alias c1
taildir2hdfs.channels = c1

# Describe/configure the source
# source type
taildir2hdfs.sources.r1.type = TAILDIR
# files/directories to watch
taildir2hdfs.sources.r1.filegroups = g1 g2 g3
taildir2hdfs.sources.r1.filegroups.g1 = /opt/1.txt
taildir2hdfs.sources.r1.filegroups.g2 = /opt/log/.*.txt
taildir2hdfs.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2hdfs.sources.r1.port = 44444

# Describe the sink
# use local time for the %Y-%m-%d-%H-%M escapes in the path
taildir2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
taildir2hdfs.sinks.k1.type = hdfs
taildir2hdfs.sinks.k1.hdfs.path = hdfs://bd0801:8020/flume/%Y-%m-%d-%H-%M
taildir2hdfs.sinks.k1.hdfs.fileType = DataStream
taildir2hdfs.sinks.k1.hdfs.writeFormat = Text
# roll a new file every 300 s or at ~13 MB, never by event count
taildir2hdfs.sinks.k1.hdfs.rollInterval = 300
taildir2hdfs.sinks.k1.hdfs.rollSize = 13000000
taildir2hdfs.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir2hdfs.channels.c1.type = memory
taildir2hdfs.channels.c1.capacity = 200000
taildir2hdfs.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
taildir2hdfs.sources.r1.channels = c1
taildir2hdfs.sinks.k1.channel = c1
```
Run it and check the result. To simulate a continuous data source, create printer.sh (vi printer.sh):

```shell
#!/bin/bash

while true
do
    echo '[298444.812] (--) Found matching XKB configuration "Chinese (Simplified)"
[298444.812] (--) Model = "pc105" Layout = "cn" Variant = "none" Options = "none"
[298444.812] Rules = "base" Model = "pc105" Layout = "cn" Variant = "none" Options = "none"' >> /opt/2.txt
done
```

Run the script.
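For a quick local check of what the generator produces, a bounded variant (three iterations, writing to /tmp instead of /opt) can be run without Flume:

```shell
# bounded variant of printer.sh: three iterations into a temp file
rm -f /tmp/2.txt
for i in 1 2 3
do
    echo '[298444.812] (--) Found matching XKB configuration "Chinese (Simplified)"' >> /tmp/2.txt
done

wc -l < /tmp/2.txt   # → 3
```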
Avro (avro.apache.org) is the RPC format Flume uses between agents.
Install Flume on bd0802 as well.
To free up the terminal, run the agent in the background.
Start the server side first (the agent whose avro source listens on bd0802:12345):
avro2hdfs.conf
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source: an avro server listening on bd0802:12345
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = bd0802
avro2hdfs.sources.r1.port = 12345

# Describe the sink
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://bd0801:8020/flume/%Y-%m-%d
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 13000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 200000
avro2hdfs.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1
```
Then start the client side, whose avro sink forwards events to that server:
taildir2avro.conf
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2 g3
taildir2avro.sources.r1.filegroups.g1 = /opt/1.txt
taildir2avro.sources.r1.filegroups.g2 = /opt/log/.*.txt
taildir2avro.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2avro.sources.r1.port = 44444

# Describe the sink: forward events to the avro source on bd0802:12345
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = bd0802
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 200000
taildir2avro.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1
```