Flume

- Tool: a program you launch when you need its functionality, and simply shut down when you are done.
- Framework: a semi-finished piece of software; developers fill in the core code according to their business logic to form a complete program that provides tool or service functionality.
What is Flume?

- In data-processing scenarios, data is generated on servers scattered across time and space, and it must be collected automatically from each server into a single HDFS cluster; this calls for an automated collection tool.
- Flume is a tool in the Hadoop ecosystem dedicated to collecting, aggregating, and moving massive volumes of log data.
Core concepts

- agent: a JVM instance started by Flume is called an Agent.
- source: a component inside an agent that reads raw data (files, network connections over various protocols), wraps the raw data into event objects, and sends the events to a channel.
- channel: a queue, backed by memory or by files on disk, that buffers events.
- sink: reads buffered events from the channel, unwraps the data inside each event, and sends it to a downstream component (e.g. HDFS).
- event: Flume's own object for wrapping data (for example, one line of a log file is wrapped into one event).
Flume installation

1. Upload and extract the archive.
2. Make sure JAVA_HOME points at the JDK (e.g. JAVA_HOME=/opt/java/modul/jdk1.8.0_291), then register Flume on the PATH:

```shell
echo 'export FLUME_HOME=/opt/java/modul/flume-1.9.0' >> /etc/profile
echo 'export PATH=$PATH:$FLUME_HOME/bin' >> /etc/profile
source /etc/profile
```
Flume operations

- Flume's decoupled design splits the data-collection side, the data-storage side, and the buffer in between into three modules.
- To collect data with Flume, simply pick the source, channel, and sink that match your needs, write an agent configuration file that wires the three modules into one agent, and start the program.
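The wiring described above always follows the same pattern; a minimal skeleton (the names in angle brackets are placeholders to be filled in) looks like:

```properties
# <agent-name>.sources / .sinks / .channels declare the components
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# each component gets a type, plus type-specific settings
a1.sources.r1.type  = <source type>
a1.channels.c1.type = <channel type>
a1.sinks.k1.type    = <sink type>

# wiring: a source may feed several channels (plural key),
# but a sink drains exactly one channel (singular key)
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
```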
Common sources

| Comparison | exec | spooldir | taildir |
|---|---|---|---|
| Function | Reads data by executing a shell command; usually paired with `tail -F file` for real-time reading of a single file | Monitors an entire directory and reads every file in it; no recursion | Monitors changes to multiple files or directories |
| Strengths | Best performance | Can monitor a whole directory | Can monitor files or directories, and records read progress |
| Weaknesses | Single file only; no resume from a breakpoint | No resume from a breakpoint; does not pick up changes to already-read files | - |
Common channels

| Comparison | memory | file |
|---|---|---|
| Performance | Better | Slightly weaker |
| Safety | Slightly weaker | Better |
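When the safety of the file channel is needed, it is configured with explicit on-disk directories (the paths below are illustrative):

```properties
a1.channels.c1.type = file
# where the channel persists its checkpoint and event data,
# so buffered events survive an agent restart
a1.channels.c1.checkpointDir = /opt/flume/checkpoint
a1.channels.c1.dataDirs = /opt/flume/data
```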
Common sinks

- logger: usually used for testing; conveniently prints the data to the console.
- HDFS sink: writes the data to an HDFS cluster.
Case 1:
Add the following configuration to agent_test_1.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Start the listener (data sent to localhost:44444, e.g. with `nc localhost 44444`, is printed to the console):

```shell
flume-ng agent --conf /opt/module/flume-1.9.0/conf/ --conf-file agent_test_1.conf --name a1 -Dflume.root.logger=INFO,console
```
Case 2:
Add the configuration to agent_test_1.conf.
Case 3:
hdfs

flume2:
Create the project in IDEA, and configure the file as follows:
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
# source alias r1
exec2logger.sources = r1
# sink alias k1
exec2logger.sinks = k1
# channel alias c1
exec2logger.channels = c1

# Describe/configure the source
# source type
exec2logger.sources.r1.type = exec
# command that produces the data
exec2logger.sources.r1.command = tail -F /opt/log.txt
#exec2logger.sources.r1.port = 44444

# Describe the sink
# output type
exec2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
# buffer type
exec2logger.channels.c1.type = memory
# capacity
exec2logger.channels.c1.capacity = 200000
# max events per transaction
exec2logger.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
exec2logger.sources.r1.channels = c1
exec2logger.sinks.k1.channel = c1
```
Copy the exec2logger file to the Flume installation directory on Linux, /opt/module/flume-1.9.0/agents.
Start the agent with `flume-ng agent` as in Case 1, pointing --conf-file at this file and --name at exec2logger.
JAR conflict: Flume 1.9 ships guava-11.0.2.jar, which commonly conflicts with the newer Guava used by Hadoop 3.x; deleting `$FLUME_HOME/lib/guava-11.0.2.jar` resolves the resulting NoSuchMethodError at startup.
Case 2
Write spooldir2logger.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
# source alias r1
spooldir2logger.sources = r1
# sink alias k1
spooldir2logger.sinks = k1
# channel alias c1
spooldir2logger.channels = c1

# Describe/configure the source
# source type
spooldir2logger.sources.r1.type = spooldir
# directory to watch
spooldir2logger.sources.r1.spoolDir = /opt/log
#spooldir2logger.sources.r1.port = 44444

# Describe the sink
# output type
spooldir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
# buffer type
spooldir2logger.channels.c1.type = memory
# capacity
spooldir2logger.channels.c1.capacity = 200000
# max events per transaction
spooldir2logger.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
spooldir2logger.sources.r1.channels = c1
spooldir2logger.sinks.k1.channel = c1
Note: create the /opt/log directory first.
Start the agent, then move log.txt into the /opt/log directory: its data is ingested.
Appending new lines to a file that has already been read produces no reaction (spooldir does not re-read files).
Renaming the file (changing its suffix) makes its data be ingested again as a new file.
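This behavior follows from how spooldir tracks files: a finished file is renamed with a .COMPLETED suffix (the default fileSuffix), and only a file with a new name counts as new input. A local simulation of the renaming (no Flume involved; paths are illustrative):

```shell
# simulate the spooling directory from the example
rm -rf /tmp/spool_demo && mkdir -p /tmp/spool_demo

# a file dropped into the directory is read once...
echo "first batch" > /tmp/spool_demo/log.txt
# ...after which the spooldir source renames it (default suffix .COMPLETED)
mv /tmp/spool_demo/log.txt /tmp/spool_demo/log.txt.COMPLETED

# appending to the completed file is ignored; only a file with a
# new name is treated as fresh input
echo "second batch" > /tmp/spool_demo/log2.txt

ls /tmp/spool_demo
```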
Case 3
taildir2logger.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2logger.sources = r1
taildir2logger.sinks = k1
taildir2logger.channels = c1

# Describe/configure the source
taildir2logger.sources.r1.type = TAILDIR
taildir2logger.sources.r1.filegroups = g1 g2 g3
taildir2logger.sources.r1.filegroups.g1 = /opt/1.txt
taildir2logger.sources.r1.filegroups.g2 = /opt/.*.txt
taildir2logger.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2logger.sources.r1.port = 44444

# Describe the sink
taildir2logger.sinks.k1.type = logger

# Use a channel which buffers events in memory
taildir2logger.channels.c1.type = memory
taildir2logger.channels.c1.capacity = 1000
taildir2logger.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
taildir2logger.sources.r1.channels = c1
taildir2logger.sinks.k1.channel = c1
```

Check the .flume directory under the home directory (~): the taildir source records its read positions there, which is what enables it to resume after a restart.
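The position file defaults to ~/.flume/taildir_position.json and is a plain JSON array of per-file offsets. A sample of what it might contain (the inode and pos values are illustrative, and a temp path stands in for the home directory):

```shell
# write an illustrative copy of what ~/.flume/taildir_position.json looks like
mkdir -p /tmp/flume_home/.flume
cat > /tmp/flume_home/.flume/taildir_position.json <<'EOF'
[{"inode":67149,"pos":1024,"file":"/opt/2.txt"}]
EOF

# pos is the byte offset already read; on restart the source resumes there
cat /tmp/flume_home/.flume/taildir_position.json
```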
Case 4
taildir2hdfs.conf:

```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2hdfs.sources = r1
# sink alias k1
taildir2hdfs.sinks = k1
# channel alias c1
taildir2hdfs.channels = c1

# Describe/configure the source
# source type
taildir2hdfs.sources.r1.type = TAILDIR
# files/directories to watch
taildir2hdfs.sources.r1.filegroups = g1 g2 g3
taildir2hdfs.sources.r1.filegroups.g1 = /opt/1.txt
taildir2hdfs.sources.r1.filegroups.g2 = /opt/log/.*.txt
taildir2hdfs.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2hdfs.sources.r1.port = 44444

# Describe the sink
# use local time for the %Y-%m-%d-%H-%M escapes in the path
taildir2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
taildir2hdfs.sinks.k1.type = hdfs
taildir2hdfs.sinks.k1.hdfs.path = hdfs://bd0801:8020/flume/%Y-%m-%d-%H-%M
taildir2hdfs.sinks.k1.hdfs.fileType = DataStream
taildir2hdfs.sinks.k1.hdfs.writeFormat = Text
# roll a new file every 300 s or at ~13 MB, never by event count
taildir2hdfs.sinks.k1.hdfs.rollInterval = 300
taildir2hdfs.sinks.k1.hdfs.rollSize = 13000000
taildir2hdfs.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
taildir2hdfs.channels.c1.type = memory
taildir2hdfs.channels.c1.capacity = 200000
taildir2hdfs.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
taildir2hdfs.sources.r1.channels = c1
taildir2hdfs.sinks.k1.channel = c1
```
Run it and check the result. To simulate a continuous data source, create printer.sh (vi printer.sh):

```shell
#!/bin/bash

while true
do
    echo '[298444.812] (--) Found matching XKB configuration "Chinese (Simplified)"
[298444.812] (--) Model = "pc105" Layout = "cn" Variant = "none" Options = "none"
[298444.812] Rules = "base" Model = "pc105" Layout = "cn" Variant = "none" Options = "none"' >> /opt/2.txt
done
```

Run the script.
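For a quick local check of what the generator produces, a bounded variant (three iterations, writing to /tmp instead of /opt) can be run without Flume:

```shell
# bounded variant of printer.sh: three iterations into a temp file
rm -f /tmp/2.txt
for i in 1 2 3
do
    echo '[298444.812] (--) Found matching XKB configuration "Chinese (Simplified)"' >> /tmp/2.txt
done

wc -l < /tmp/2.txt   # → 3
```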
Avro (avro.apache.org) is the RPC format Flume uses between agents.
Install Flume on bd0802 as well.
To free up the terminal, run the agent in the background.
Start the server side first (the agent whose avro source listens on bd0802:12345):
avro2hdfs.conf
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
avro2hdfs.sources = r1
avro2hdfs.sinks = k1
avro2hdfs.channels = c1

# Describe/configure the source: an avro server listening on bd0802:12345
avro2hdfs.sources.r1.type = avro
avro2hdfs.sources.r1.bind = bd0802
avro2hdfs.sources.r1.port = 12345

# Describe the sink
avro2hdfs.sinks.k1.hdfs.useLocalTimeStamp = true
avro2hdfs.sinks.k1.type = hdfs
avro2hdfs.sinks.k1.hdfs.path = hdfs://bd0801:8020/flume/%Y-%m-%d
avro2hdfs.sinks.k1.hdfs.fileType = DataStream
avro2hdfs.sinks.k1.hdfs.writeFormat = Text
avro2hdfs.sinks.k1.hdfs.rollInterval = 300
avro2hdfs.sinks.k1.hdfs.rollSize = 13000000
avro2hdfs.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
avro2hdfs.channels.c1.type = memory
avro2hdfs.channels.c1.capacity = 200000
avro2hdfs.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
avro2hdfs.sources.r1.channels = c1
avro2hdfs.sinks.k1.channel = c1
```
Then start the client side, whose avro sink forwards events to that server:
taildir2avro.conf
```properties
# example.conf: A single-node Flume configuration

# Name the components on this agent
taildir2avro.sources = r1
taildir2avro.sinks = k1
taildir2avro.channels = c1

# Describe/configure the source
taildir2avro.sources.r1.type = TAILDIR
taildir2avro.sources.r1.filegroups = g1 g2 g3
taildir2avro.sources.r1.filegroups.g1 = /opt/1.txt
taildir2avro.sources.r1.filegroups.g2 = /opt/log/.*.txt
taildir2avro.sources.r1.filegroups.g3 = /opt/2.txt
#taildir2avro.sources.r1.port = 44444

# Describe the sink: forward events to the avro source on bd0802:12345
taildir2avro.sinks.k1.type = avro
taildir2avro.sinks.k1.hostname = bd0802
taildir2avro.sinks.k1.port = 12345

# Use a channel which buffers events in memory
taildir2avro.channels.c1.type = memory
taildir2avro.channels.c1.capacity = 200000
taildir2avro.channels.c1.transactionCapacity = 1024

# Bind the source and sink to the channel
taildir2avro.sources.r1.channels = c1
taildir2avro.sinks.k1.channel = c1
```