|NO.Z.00044|——————————|BigDataEnd|——|Hadoop&Flume.V07|——|Flume.v07|Flume.v1.9 Cases.v05|
1. Collecting data from a monitored directory into HDFS
### --- Collecting data from a monitored directory into HDFS
~~~ # Business requirement:
~~~ Monitor a given directory and upload the collected data to HDFS as it arrives
### --- Requirement analysis:
~~~ Use spooldir as the source.
~~~ spooldir guarantees no data loss and can resume where it left off, but its latency is higher and it cannot monitor files in real time
~~~ Use memory as the channel
~~~ Use HDFS as the sink
### --- The spooldir source watches a specified directory:
~~~ whenever a new file is added to that directory, the source picks it up,
~~~ parses its contents, and writes events to the channel. Once the sink has finished processing,
~~~ the file is marked as done by appending the .COMPLETED suffix to its name.
~~~ Although the whole directory is monitored automatically, only whole files are tracked:
~~~ if content is appended to a file that has already been processed, the source does not notice it.
~~~ # Things to note:
~~~ A file copied into the spool directory must not be opened and edited afterwards (a common workaround is shown in the sketch after this list)
~~~ Changes inside subdirectories are not picked up (by default)
~~~ The watched directory is scanned for changes every 500 milliseconds
~~~ This source is suitable for syncing new files, but not for tailing and syncing files that are being appended to in real time
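The sketch referenced above is not prescribed by the original text, but is a common convention that fits this config: write the file under a temporary name first, then rename it when complete. A rename within the same filesystem is atomic, and the ignorePattern in the config below skips .tmp files, so the source never ingests a half-written file (app.log is a hypothetical input):
# copy under a .tmp name the source ignores, then atomically rename
[root@linux123 ~]# cp /var/log/app.log /root/upload/app.log.tmp
[root@linux123 ~]# mv /root/upload/app.log.tmp /root/upload/app.log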
2. Creating the configuration file flume-spooldir-hdfs.conf
### --- Create the configuration file flume-spooldir-hdfs.conf
[root@linux123 ~]# vim $FLUME_HOME/conf/flume-spooldir-hdfs.conf
# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
# Describe/configure the source
a3.sources.r3.type = spooldir
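# Directory to watch; fully ingested files are renamed with the suffix below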
a3.sources.r3.spoolDir = /root/upload
a3.sources.r3.fileSuffix = .COMPLETED
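# Add a "file" header to each event carrying the absolute path of the ingested file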
a3.sources.r3.fileHeader = true
# Ignore files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://linux121:9000/flume/upload/%Y%m%d/%H%M
# Prefix for the uploaded file names
a3.sinks.k3.hdfs.filePrefix = upload-
# Use the local time when resolving the time escapes in hdfs.path
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Flush to HDFS once 500 events have accumulated
a3.sinks.k3.hdfs.batchSize = 500
# File type; DataStream writes plain, uncompressed output
a3.sinks.k3.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll when the file reaches 134217700 bytes (just under the 128 MB HDFS block size)
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of HDFS block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
3. Starting the agent
### --- Prepare the directory
[root@linux123 upload]# pwd
/root/upload
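If the directory does not exist yet, create it first (a standard preparatory step; the path matches spoolDir in the config above):
[root@linux123 ~]# mkdir -p /root/upload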
### --- Start the agent
$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf-file $FLUME_HOME/conf/flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console
~~~ # Startup log output:
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: CHANNEL, name: c3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: c3 started
INFO node.Application: Starting Sink k3
INFO node.Application: Starting Source r3
INFO source.SpoolDirectorySource: SpoolDirectorySource source starting with directory: /root/upload
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: k3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k3 started
INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: r3: Successfully registered new MBean.
INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: r3 started
4. Adding a file to the upload directory
### --- Write a file
[root@linux123 ~]# cp nohup.out upload/b.log
### --- Agent log output
INFO avro.ReliableSpoolingFileEventReader: Last read took us just up to a file boundary. Rolling to the next file, if there is one.
INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /root/upload/b.log to /root/upload/b.log.COMPLETED
INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
INFO hdfs.BucketWriter: Creating hdfs://linux121:9000/flume/upload/20210828/1307/upload-.1630127221375.tmp
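As the log above shows, the file is renamed once it has been fully ingested; a directory listing confirms this (output reconstructed from the log line above, not captured from a live run):
[root@linux123 ~]# ls /root/upload
b.log.COMPLETED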
5. Inspecting the data on HDFS
[root@linux123 ~]# hdfs dfs -ls /flume/upload
drwxrwxrwx - root supergroup 0 2021-08-28 13:08 /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1307
-rw-r--r-- 3 root supergroup 48490 2021-08-28 13:08 /flume/upload/20210828/1307/upload-.1630127221375
[root@linux123 ~]# hdfs dfs -ls /flume/upload/20210828/1310
-rw-r--r-- 3 root supergroup 35127 2021-08-28 13:10 /flume/upload/20210828/1310/upload-.1630127424794.tmp
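To spot-check the uploaded content, cat the finished file (the path is taken from the listing above; the file still carrying the .tmp suffix is the one currently being written):
[root@linux123 ~]# hdfs dfs -cat /flume/upload/20210828/1307/upload-.1630127221375 | head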
6. HDFS Sink
### --- The HDFS Sink is generally used with rolling file generation; the rolling strategies are (a combined sketch follows this list):
~~~ # Time-based
~~~ hdfs.rollInterval
~~~ Default: 30 (seconds)
~~~ 0 disables it
~~~ # File-size-based
~~~ hdfs.rollSize
~~~ Default: 1024 (bytes)
~~~ 0 disables it
~~~ # Event-count-based
~~~ hdfs.rollCount
~~~ Default: 10
~~~ 0 disables it
~~~ # Idle-time-based
~~~ hdfs.idleTimeout
~~~ Default: 0 (disabled)
~~~ # Based on the HDFS block replica count
~~~ hdfs.minBlockReplicas
~~~ Default: the same as the HDFS replication factor
~~~ Set this parameter to 1; otherwise replication of the blocks backing the HDFS file can trigger spurious rolls
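The combined sketch referenced above: roll purely by size, with the other triggers disabled (values are illustrative; the a1/k1 names follow the convention of the config at the end of this section):
# roll only when the file reaches exactly one 128 MB HDFS block
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1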
7. Other important settings:
~~~ # hdfs.useLocalTimeStamp
~~~ Use the local time instead of the timestamp from the event header
~~~ Default: false
~~~ # hdfs.round
~~~ Whether the timestamp should be rounded down
~~~ Default: false
~~~ If true, it affects all time-based escape sequences except %t
~~~ # hdfs.roundValue
~~~ Round down to the highest multiple of this value (in the unit configured by hdfs.roundUnit) that is less than the current time
~~~ Default: 1
~~~ # hdfs.roundUnit
~~~ Allowed values: second, minute, hour
~~~ Default: second
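A sketch of how rounding plays out, modeled on the example in the Flume user guide (the path here is illustrative):
a1.sinks.k1.hdfs.path = hdfs://linux121:9000/flume/events/%Y%m%d/%H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
With these settings, an event timestamped 13:07:34 on 2021-08-28 lands under .../20210828/1300: the timestamp is rounded down to the last multiple of 10 minutes.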
~~~ # To keep the HDFS Sink from producing many small files, consider the following settings:
~~~ roll at most once per hour, disable size-, count- and idle-based rolling, and set minBlockReplicas to 1 so replication cannot force extra rolls
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.path=hdfs://linux121:9000/flume/events/%Y/%m/%d/%H/%M
a1.sinks.k1.hdfs.minBlockReplicas=1
a1.sinks.k1.hdfs.rollInterval=3600
a1.sinks.k1.hdfs.rollSize=0
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0