|NO.Z.00044|——————————|BigDataEnd|——|Hadoop&Flume.V07|——|Flume.v07|Flume.v1.9 Cases.v05|
1. Collecting data from a monitored directory into HDFS
### --- Collecting data from a monitored directory into HDFS
~~~ # Business requirement:
~~~ Monitor a given directory and upload the collected data to HDFS as it arrives
### --- Requirement analysis:
~~~ Use spooldir as the source.
~~~ spooldir guarantees no data loss and can resume where it left off, but its latency is higher and it cannot monitor files in real time
~~~ Use memory as the channel
~~~ Use HDFS as the sink
### --- The spooldir source watches a specified directory:
~~~ whenever a new file is added to that directory, the source picks it up,
~~~ parses its contents, and writes events to the channel. Once the sink has finished processing,
~~~ the file is marked as done by appending the .COMPLETED suffix to its name.
~~~ Although the whole directory is monitored automatically, only whole files are tracked:
~~~ if content is appended to a file that has already been processed, the source does not notice it.
~~~ # Things to note:
~~~ A file copied into the spool directory must not be opened and edited afterwards (a common workaround is shown in the sketch after this list)
~~~ Changes inside subdirectories are not picked up (by default)
~~~ The watched directory is scanned for changes every 500 milliseconds
~~~ This source is suitable for syncing new files, but not for tailing and syncing files that are being appended to in real time
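The sketch referenced above is not prescribed by the original text, but is a common convention that fits this config: write the file under a temporary name first, then rename it when complete. A rename within the same filesystem is atomic, and the ignorePattern in the config below skips .tmp files, so the source never ingests a half-written file (app.log is a hypothetical input):
# copy under a .tmp name the source ignores, then atomically rename
[root@linux123 ~]# cp /var/log/app.log /root/upload/app.log.tmp
[root@linux123 ~]# mv /root/upload/app.log.tmp /root/upload/app.log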
2. Creating the configuration file flume-spooldir-hdfs.conf
### --- Create the configuration file flume-spooldir-hdfs.conf
[root@linux123 ~]# vim $FLUME_HOME/conf/flume-spooldir-hdfs.conf
# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
# Describe/configure the source
a3.sources.r3.type = spooldir
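# Directory to watch; fully ingested files are renamed with the suffix below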
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
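# Add a "file" header to each event carrying the absolute path of the ingested file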
a3.sources.r3.fileHeader = true
# Ignore files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux121:9000/flume/upload/%Y%m%d/%H%M
# Prefix for the uploaded file names
a3.sinks.k3.hdfs.filePrefix = upload-
# Use the local time when resolving the time escapes in hdfs.path
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Flush to HDFS once 500 events have accumulated
a3.sinks.k3.hdfs.batchSize = 500
# File type; DataStream writes plain, uncompressed output
a3.sinks.k3.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll when the file reaches 134217700 bytes (just under the 128 MB HDFS block size)
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of HDFS block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
3. Starting the agent
### --- Prepare the directory
[root@linux123 upload]# pwd
/root/upload
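If the directory does not exist yet, create it first (a standard preparatory step; the path matches spoolDir in the config above):
[root@linux123 ~]# mkdir -p /root/upload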
### --- Start the agent
$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf-file $FLUME_HOME/conf/flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console
~~~ # Startup log output:
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: c3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c3 started
INFO node.Application: Starting Sink k3
INFO node.Application: Starting Source r3
INFO source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /root/upload
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: k3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k3 started
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: r3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: r3 started
4. Adding a file to the upload directory
### --- Write a file
[root@linux123 ~]# cp nohup.out upload/b.log
### --- Agent log output
INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /root/upload/b.log to /root/upload/b.log.COMPLETED
INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
INFO hdfs.BucketWriter: Creating hdfs://linux121:9000/flume/upload/20210828/1307/upload-.1630127221375.tmp
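As the log above shows, the file is renamed once it has been fully ingested; a directory listing confirms this (output reconstructed from the log line above, not captured from a live run):
[root@linux123 ~]# ls /root/upload
b.log.COMPLETED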
5. Inspecting the data on HDFS
[root@linux123 ~]# hdfs dfs -ls /flume/upload
drwxrwxrwx - root supergroup 0 2021-08-28 13:08 /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1307
-rw-r--r-- 3 root supergroup 48490 2021-08-28 13:08 /flume/upload/20210828/1307/upload-.1630127221375
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1310
-rw-r--r-- 3 root supergroup 35127 2021-08-28 13:10 /flume/upload/20210828/1310/upload-.1630127424794.tmp
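To spot-check the uploaded content, cat the finished file (the path is taken from the listing above; the file still carrying the .tmp suffix is the one currently being written):
[root@linux123 ~]# hdfs dfs -cat /flume/upload/20210828/1307/upload-.1630127221375 | head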
6. HDFS Sink
### --- The HDFS Sink is generally used with rolling file generation; the rolling strategies are (a combined sketch follows this list):
~~~ # Time-based
~~~ hdfs.rollInterval
~~~ Default: 30 (seconds)
~~~ 0 disables it
~~~ # File-size-based
~~~ hdfs.rollSize
~~~ Default: 1024 (bytes)
~~~ 0 disables it
~~~ # Event-count-based
~~~ hdfs.rollCount
~~~ Default: 10
~~~ 0 disables it
~~~ # Idle-time-based
~~~ hdfs.idleTimeout
~~~ Default: 0 (disabled)
~~~ # Based on the HDFS block replica count
~~~ hdfs.minBlockReplicas
~~~ Default: the same as the HDFS replication factor
~~~ Set this parameter to 1; otherwise replication of the blocks backing the HDFS file can trigger spurious rolls
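The combined sketch referenced above: roll purely by size, with the other triggers disabled (values are illustrative; the a1/k1 names follow the convention of the config at the end of this section):
# roll only when the file reaches exactly one 128 MB HDFS block
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1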
7. Other important settings:
~~~ # hdfs.useLocalTimeStamp
~~~ Use the local time instead of the timestamp from the event header
~~~ Default: false
~~~ # hdfs.round
~~~ Whether the timestamp should be rounded down
~~~ Default: false
~~~ If true, it affects all time-based escape sequences except %t
~~~ # hdfs.roundValue
~~~ Round down to the highest multiple of this value (in the unit configured by hdfs.roundUnit) that is less than the current time
~~~ Default: 1
~~~ # hdfs.roundUnit
~~~ Allowed values: second, minute, hour
~~~ Default: second
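A sketch of how rounding plays out, modeled on the example in the Flume user guide (the path here is illustrative):
a1.sinks.k1.hdfs.path = hdfs://linux121:9000/flume/events/%Y%m%d/%H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
With these settings, an event timestamped 13:07:34 on 2021-08-28 lands under .../20210828/1300: the timestamp is rounded down to the last multiple of 10 minutes.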
~~~ # To keep the HDFS Sink from producing many small files, consider the following settings:
~~~ roll at most once per hour, disable size-, count- and idle-based rolling, and set minBlockReplicas to 1 so replication cannot force extra rolls
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.minBlockReplicas=1
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0