Flume Hands-On Example -- Collecting a File into HDFS
Requirement analysis:
- Collection requirement: a business system writes its logs with log4j, and the log file keeps growing; the data appended to the log file needs to be collected into HDFS in real time.
Based on this requirement, first define the following three key components (a minimal skeleton of the wiring follows this list):
- The collection source, i.e. the source -- monitors content appended to the file: exec source running 'tail -f file'
- The sink target, i.e. the sink -- the HDFS file system: hdfs sink
- The channel between source and sink -- either a file channel or a memory channel can be used
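The sketch below only illustrates how these three components map onto Flume agent properties; agent1, source1, sink1 and channel1 are the names used in the full configuration that follows, where all remaining properties are filled in.

# skeleton only -- see the complete tail-file.conf below
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# source: run tail against the log file
agent1.sources.source1.type = exec
# sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
# channel: buffer events in memory (a file channel also works)
agent1.channels.channel1.type = memory

# wire source -> channel -> sink
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1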
Developing the Flume configuration file
- Create the configuration file on hadoop03

cd /bigdata/install/flume-1.9.0/conf
vim tail-file.conf
- Configuration file contents

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the tail -F source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /bigdata/install/mydata/flume/taillogs/access_log
agent1.sources.source1.channels = channel1

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hadoop01:8020/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
# maximum number of files kept open at once; beyond 5000 the oldest files are closed
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
# timeout in seconds for adding an event to, or removing an event from, the channel
agent1.channels.channel1.keep-alive = 120
# setting the capacity much larger than this brings no noticeable benefit
agent1.channels.channel1.capacity = 5000
agent1.channels.channel1.transactionCapacity = 4500

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
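For orientation, the %y-%m-%d/%H-%M escapes in hdfs.path are expanded from the event timestamp (taken from the local clock here, because useLocalTimeStamp is true), and round/roundValue/roundUnit round that timestamp down to 10-minute buckets. The path below is only a hypothetical illustration of where an event written at, say, 14:23 on 2024-06-01 would land; the numeric part of the file name is a millisecond counter chosen by the sink, and the .tmp suffix marks a file still being written.

/weblog/flume-collection/24-06-01/14-20/access_log.1717223000000.tmp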
Component reference: the properties of the exec source, HDFS sink and memory channel used above are documented in the Flume user guide at https://flume.apache.org/FlumeUserGuide.html
Start Flume
cd /bigdata/install/flume-1.9.0
bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1 -Dflume.root.logger=INFO,console
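The command above keeps the agent in the foreground with its log on the console, which is convenient for this test. Once it works, a common alternative (plain nohup, nothing Flume-specific) is to leave the agent running in the background, for example:

cd /bigdata/install/flume-1.9.0
nohup bin/flume-ng agent -c conf -f conf/tail-file.conf -n agent1 > nohup.out 2>&1 &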
Develop a shell script that keeps appending content to the file
mkdir -p /home/hadoop/shells/
cd /home/hadoop/shells/
vim tail-file.sh
- Script contents:
#!/bin/bash
# append the current timestamp to the tailed log file every 0.5 seconds
while true
do
date >> /bigdata/install/mydata/flume/taillogs/access_log;
sleep 0.5;
done
- Create the directory that the script writes to and the source tails
mkdir -p /bigdata/install/mydata/flume/taillogs/
- Run the script
chmod u+x tail-file.sh
sh /home/hadoop/shells/tail-file.sh
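Running the script this way ties up the terminal; if you want the same shell free for the verification step below, a background invocation works just as well:

nohup sh /home/hadoop/shells/tail-file.sh > /dev/null 2>&1 &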
- Verify the result: the collected files appear under the target path in the HDFS web UI, and the upload activity is visible in the Flume console output; a quick command-line check is sketched below.
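A hedged sketch of checking the result from the command line; the path matches hdfs.path in the configuration above, while the date and time subdirectories will of course differ on your cluster:

hdfs dfs -ls -R /weblog/flume-collection
hdfs dfs -cat /weblog/flume-collection/*/*/access_log.* | tail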