|NO.Z.00044|——————————|BigDataEnd|——|Hadoop&Flume.V07|——|Flume.v07|Flume v1.9 Case Study v05|

1. Collecting data from a monitored directory into HDFS
### --- Collecting data from a monitored directory into HDFS

~~~     # Business requirement:
~~~     Monitor a specified directory and upload the collected data to HDFS as it arrives.
### --- Requirements analysis:

~~~     For the source, choose spooldir.
~~~     spooldir guarantees no data loss and can resume after an interruption,
~~~     but its latency is relatively high and it cannot monitor in real time.
~~~     For the channel, choose memory.
~~~     For the sink, choose HDFS.
### --- The spooldir source monitors a specified directory:
~~~     whenever a new file is added to the directory, the source picks it up,
~~~     parses its contents, and writes them to the channel. Once the sink has
~~~     finished processing, the file is marked as done by appending a
~~~     .COMPLETED suffix to its name.
~~~     Although the whole directory is monitored automatically, only new files
~~~     are detected; if content is appended to a file that has already been
~~~     processed, the source will not notice it.

~~~     # Points to note:
~~~     Files copied into the spool directory must not be opened and edited afterwards.
~~~     Changes inside subdirectories are not monitored.
~~~     The monitored directory is scanned for changes every 500 ms.
~~~     Suitable for syncing new files, but not for tailing and syncing a log
~~~     file that is being appended to in real time; see the delivery sketch
~~~     after this list.
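
~~~     # Because a file must not change once the source has picked it up, a
~~~     # safe delivery pattern is to write the file somewhere else first and
~~~     # then move it into the spool directory; mv within one filesystem is
~~~     # atomic. A minimal sketch (/root/staging and app.log are placeholders):

[root@linux123 ~]# mkdir -p /root/staging
[root@linux123 ~]# cp /var/log/app.log /root/staging/app-$(date +%s).log
[root@linux123 ~]# mv /root/staging/app-*.log /root/upload/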
2. Create the configuration file flume-spooldir-hdfs.conf
### --- Create the configuration file flume-spooldir-hdfs.conf

[root@linux123 ~]# vim $FLUME_HOME/conf/flume-spooldir-hdfs.conf
# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true

# Ignore files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux121:9000/flume/upload/%Y%m%d/%H%M

# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-

# Use the local time instead of a timestamp from the event header (needed here:
# the path uses time escapes and the spooldir source sets no timestamp header)
a3.sinks.k3.hdfs.useLocalTimeStamp = true

# Flush to HDFS once 500 events have accumulated
a3.sinks.k3.hdfs.batchSize = 500

# File type (DataStream = plain text, no compression)
a3.sinks.k3.hdfs.fileType = DataStream

# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60

# Roll once the file reaches 134217700 bytes, just under the 128 MiB HDFS block size
a3.sinks.k3.hdfs.rollSize = 134217700

# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Minimum number of block replicas (1, so replication does not trigger rolling)
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
3. Start the agent
### --- Prepare the directory
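
~~~     # If the spool directory does not exist yet, create it first:

[root@linux123 ~]# mkdir -p /root/upload
[root@linux123 ~]# cd /root/upload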

[root@linux123 upload]# pwd
/root/upload
### --- Start the agent

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console
### --- Agent startup log
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: c3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c3 started
INFO node.Application: Starting Sink k3
INFO node.Application: Starting Source r3
INFO source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /root/upload
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: k3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k3 started
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: r3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: r3 started
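
~~~     # The agent above runs in the foreground. To run it in the background
~~~     # instead, and collect its console output in nohup.out (the file copied
~~~     # in the next step), a nohup variant is one option (a sketch):

[root@linux123 ~]# nohup $FLUME_HOME/bin/flume-ng agent --name a3 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console &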
4. Add a file to the upload directory
### --- Write a file
[root@linux123 ~]# cp nohup.out upload/b.log
 
### --- Agent log output
INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /root/upload/b.log to /root/upload/b.log.COMPLETED
INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
INFO hdfs.BucketWriter: Creating hdfs://linux121:9000/flume/upload/20210828/1307/upload-.1630127221375.tmp
5. View the data on HDFS
[root@linux123 ~]# hdfs dfs -ls /flume/upload
drwxrwxrwx   - root supergroup          0 2021-08-28 13:08 /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1307
-rw-r--r--   3 root supergroup      48490 2021-08-28 13:08 /flume/upload/20210828/1307/upload-.1630127221375
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1310
-rw-r--r--   3 root supergroup      35127 2021-08-28 13:10 /flume/upload/20210828/1310/upload-.1630127424794.tmp
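
~~~     # To spot-check the uploaded contents, cat the completed file (path
~~~     # taken from the listing above):

[root@linux123 ~]# hdfs dfs -cat /flume/upload/20210828/1307/upload-.1630127221375 | head -5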
6. HDFS Sink
### --- The HDFS Sink is normally used with rolling file generation; the rolling strategies are:

~~~     # Time-based
~~~     hdfs.rollInterval
~~~     default: 30 (seconds)
~~~     0 disables it

~~~     # File-size-based
~~~     hdfs.rollSize
~~~     default: 1024 (bytes)
~~~     0 disables it

~~~     # Event-count-based
~~~     hdfs.rollCount
~~~     default: 10
~~~     0 disables it

~~~     # Idle-time-based
~~~     hdfs.idleTimeout
~~~     default: 0 (disabled)

~~~     # HDFS-replica-based
~~~     hdfs.minBlockReplicas
~~~     default: the same as the HDFS replication factor
~~~     set this parameter to 1; otherwise replication of the blocks backing
~~~     the HDFS file will trigger file rolling (see the sketch below)
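
~~~     # For example, to roll purely on file size just below one 128 MiB HDFS
~~~     # block, with all other triggers disabled (a sketch reusing the values
~~~     # from the config above):

a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1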
7. Other important settings:
~~~     # hdfs.useLocalTimeStamp
~~~     use the local time instead of the timestamp from the event header
~~~     default: false

~~~     # hdfs.round
~~~     whether the timestamp should be rounded down
~~~     default: false
~~~     if true, it affects all time-based escape sequences except %t

~~~     # hdfs.roundValue
~~~     round the timestamp down to the highest multiple of this value (in the
~~~     unit given by hdfs.roundUnit) that is less than the current time
~~~     default: 1

~~~     # hdfs.roundUnit
~~~     allowed values: second, minute, hour
~~~     default: second
~~~     (see the rounding sketch below)
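
~~~     # For example, to bucket files into ten-minute directories, round the
~~~     # timestamp down to a multiple of 10 minutes (a sketch; an event at
~~~     # 13:07 then lands under .../1300 rather than .../1307):

a1.sinks.k1.hdfs.path = hdfs://linux121:9000/flume/events/%Y%m%d/%H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute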
~~~     # To keep the HDFS Sink from producing lots of small files, settings along these lines can be used:

a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/events/%Y/%m/%d/%H/%M

# Roll once per hour; disable every other rolling trigger
a1.sinks.k1.hdfs.minBlockReplicas=1
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0
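
~~~     # With these settings, files roll only on the hourly rollInterval;
~~~     # size-, count-, idle-, and replication-triggered rolling are all
~~~     # disabled, so each time bucket collects into a single file instead
~~~     # of many small ones.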

 
 
 
 
 
 
 
 
 
