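Setting Up Flume with Hadoop on CentOS 7 to Collect Nginx Logs
Building on my earlier posts, this article aims to build a closed-loop Hadoop and Spark server setup. It also serves as a summary of lessons learned.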
Version information
CentOS: Linux localhost.localdomain 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
JDK: Oracle jdk1.8.0_241 , https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
Hadoop : hadoop-3.2.1.tar.gz
Flume: apache-flume-1.9.0-bin.tar.gz, http://flume.apache.org/download.html
Server setup
Hadoop: see "Deploying Hadoop 3.2.1 on CentOS 7 (pseudo-distributed)"
Nginx: see "Installing nginx with yum on CentOS 6.7"
Setting up Flume
1. Download the Flume binary package and extract it to the target directory
mkdir -p /data/server/flume/
cd /data/server/flume/
wget https://mirrors.tuna.tsinghua.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar zxvf apache-flume-1.9.0-bin.tar.gz
mv apache-flume-1.9.0-bin 1.9.0
2. Install the JDK
Download the Java SDK from https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
rz    # pick the file you downloaded; it is uploaded to the current directory
tar zxvf jdk-8u241-linux-x64.tar.gz
Edit the env file
cp 1.9.0/conf/flume-env.sh.template 1.9.0/conf/flume-env.sh
vi 1.9.0/conf/flume-env.sh
Append the following to the end of the file:
export JAVA_HOME=/data/server/flume/jdk1.8.0_241/
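Before moving on, it can help to confirm that the JAVA_HOME value really points at an extracted JDK. The following is a minimal sketch, not part of the original setup; the function name check_java_home is hypothetical and only tests that bin/java exists and is executable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Flume): succeed only if the given
# directory contains an executable bin/java.
check_java_home() {
    local java_home="${1%/}"    # strip one trailing slash, if present
    if [ -x "${java_home}/bin/java" ]; then
        echo "OK: ${java_home}/bin/java is executable"
        return 0
    fi
    echo "ERROR: ${java_home}/bin/java not found or not executable" >&2
    return 1
}
```

For the layout used in this article it would be called as check_java_home /data/server/flume/jdk1.8.0_241/ .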
3. Configure Flume
Create the configuration file flume.conf
cp 1.9.0/conf/flume-conf.properties.template 1.9.0/conf/flume.conf
vi 1.9.0/conf/flume.conf
Add the following:
## Agent
myagent.sources = r1
myagent.sinks = k1
myagent.channels = c1
#
# Source
myagent.sources.r1.type = exec
myagent.sources.r1.channels = c1
myagent.sources.r1.deserializer.outputCharset = UTF-8
#
# Log file to monitor
myagent.sources.r1.command = tail -F /usr/local/nginx/logs/flume-test.access.log
#
# Sink
myagent.sinks.k1.type = hdfs
myagent.sinks.k1.channel = c1
myagent.sinks.k1.hdfs.useLocalTimeStamp = true
myagent.sinks.k1.hdfs.path = hdfs://172.16.1.126:9000/flume/nginx_logs/%Y%m%d
myagent.sinks.k1.hdfs.filePrefix = %Y-%m-%d-%H
myagent.sinks.k1.hdfs.fileSuffix = .log
myagent.sinks.k1.hdfs.minBlockReplicas = 1
myagent.sinks.k1.hdfs.fileType = DataStream
myagent.sinks.k1.hdfs.writeFormat = Text
myagent.sinks.k1.hdfs.rollInterval = 86400
myagent.sinks.k1.hdfs.rollSize = 1000000
myagent.sinks.k1.hdfs.rollCount = 10000
myagent.sinks.k1.hdfs.idleTimeout = 0
#
# Channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 1000
myagent.channels.c1.transactionCapacity = 100
#
# Wire the three together (a source takes "channels", a sink takes "channel")
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1
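A common slip in a configuration like the one above is mixing up the two wiring keys: a source is bound with "channels" (plural) and a sink with "channel" (singular). The sketch below is one possible way to catch that before starting the agent. The function name check_flume_wiring is hypothetical, and it assumes one property per line with the agent name as the key prefix, as in this article's flume.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report sources without a "channels" key and sinks
# without a "channel" key. Returns 1 if anything is missing.
check_flume_wiring() {
    local conf="$1" agent="$2" status=0 name sources sinks
    # Component lists declared for the agent, e.g. "myagent.sources = r1"
    sources=$(sed -n "s/^${agent}\.sources[[:space:]]*=[[:space:]]*//p" "$conf")
    sinks=$(sed -n "s/^${agent}\.sinks[[:space:]]*=[[:space:]]*//p" "$conf")
    for name in $sources; do
        if ! grep -Eq "^${agent}\.sources\.${name}\.channels[[:space:]]*=" "$conf"; then
            echo "source ${name}: missing 'channels'"
            status=1
        fi
    done
    for name in $sinks; do
        if ! grep -Eq "^${agent}\.sinks\.${name}\.channel[[:space:]]*=" "$conf"; then
            echo "sink ${name}: missing 'channel'"
            status=1
        fi
    done
    return $status
}
```

With the paths used here, the call would be check_flume_wiring 1.9.0/conf/flume.conf myagent ; no output and exit status 0 mean every declared source and sink is wired.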
4. Write the start and stop scripts
start_flume.sh
#!/usr/bin/env bash
CURRENT_DIR=$(pwd)
BIN_DIR="/data/server/flume/1.9.0/"

# Print the PID(s) of any Flume process started from BIN_DIR
find_flume_pid() {
    ps aux | grep "${BIN_DIR}" | grep 'flume' | grep -v grep | awk '{print $2}'
}

cd "${BIN_DIR}" || exit 1
FLUME_PID=$(find_flume_pid)
if [ -n "${FLUME_PID}" ]; then
    echo "Flume is running, please kill the flume process"
    cd "${CURRENT_DIR}"
    exit 0
fi
nohup ./bin/flume-ng agent --conf ./conf -f ./conf/flume.conf --name myagent > ../nohup.out 2>&1 &
# Wait 3 seconds before checking whether the agent came up
sleep 3
FLUME_PID=$(find_flume_pid)
if [ -n "${FLUME_PID}" ]; then
    echo "Flume is running!"
fi
cd "${CURRENT_DIR}"
stop_flume.sh
#!/usr/bin/env bash
CURRENT_DIR=$(pwd)
BIN_DIR="/data/server/flume/1.9.0/"

# Print the PID(s) of any Flume process started from BIN_DIR
find_flume_pid() {
    ps aux | grep "${BIN_DIR}" | grep 'flume' | grep -v grep | awk '{print $2}'
}

cd "${BIN_DIR}" || exit 1
FLUME_PID=$(find_flume_pid)
if [ -z "${FLUME_PID}" ]; then
    echo "Flume is not running"
    cd "${CURRENT_DIR}"
    exit 0
fi
# Left unquoted on purpose so that several PIDs are passed as separate arguments
kill -9 $FLUME_PID
echo "Flume stopped!"
cd "${CURRENT_DIR}"
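The stop script uses kill -9, which ends the JVM immediately. With a memory channel, events still buffered at that moment are lost, and the HDFS sink likely has no chance to close its open .tmp file. A gentler variant, sketched below under the assumption that the agent reacts to SIGTERM through the JVM's normal shutdown path, sends SIGTERM first and only falls back to SIGKILL after a timeout. The function name stop_gracefully and the 10-second default are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ask a process to exit with SIGTERM, wait up to
# <timeout> seconds, then fall back to SIGKILL.
stop_gracefully() {
    local pid="$1" timeout="${2:-10}" waited=0
    kill -TERM "$pid" 2>/dev/null || return 0    # process is already gone
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -KILL "$pid" 2>/dev/null        # did not exit in time
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In stop_flume.sh this could replace the kill -9 line, for example as: for pid in $FLUME_PID; do stop_gracefully "$pid" 10; done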
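5. Install Nginx
1. For installation, see "Installing nginx with yum on CentOS 6.7"
2. Make Nginx write its log entries as JSON strings
Edit /etc/nginx/nginx.conf, look for the keyword log_format inside the http{} block, and add the following on a new line: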
log_format post_json '{"remote_addr":"$remote_addr","http_x_forwarded_for":"$http_x_forwarded_for","remote_user":"$remote_user","time_local":"$time_local","server_protocol":"$server_protocol","request_time":"$request_time","request_method":"$request_method","request_uri":"$request_uri","status":$status,"body_bytes_sent":$body_bytes_sent,"http_token":"$http_token","http_referer":"$http_referer","http_user_agent":"$http_user_agent","request_body":"$request_body"}';
Add a test site of your own, /etc/nginx/conf.d/test.flume.conf
server {
    listen 8881;
    access_log logs/flume-test.access.log post_json;
    location / {
        root /data/www/test;
        index index.html index.htm;
    }
}
Reload the configuration
nginx -s reload
Configuration is now complete!
Verifying the service
1. Start the service:
./1.9.0/bin/flume-ng agent --conf 1.9.0/conf/ -f 1.9.0/conf/flume.conf --name myagent
The following error appears:
Info: Sourcing environment configuration script /data/server/flume/1.9.0/conf/flume-env.sh
Info: Including Hadoop libraries found via (/data/server/hadoop/3.2.1/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /data/server/flume/jdk1.8.0_241//bin/java -Xmx20m -cp '/data/server/flume/1.9.0/conf:/data/server/flume/1.9.0/lib/*:/data/server/hadoop/3.2.1/etc/hadoop:/data/server/hadoop/3.2.1/share/hadoop/common/lib/*:/data/server/hadoop/3.2.1/share/hadoop/common/*:/data/server/hadoop/3.2.1/share/hadoop/hdfs:/data/server/hadoop/3.2.1/share/hadoop/hdfs/lib/*:/data/server/hadoop/3.2.1/share/hadoop/hdfs/*:/data/server/hadoop/3.2.1/share/hadoop/mapreduce/lib/*:/data/server/hadoop/3.2.1/share/hadoop/mapreduce/*:/data/server/hadoop/3.2.1/share/hadoop/yarn:/data/server/hadoop/3.2.1/share/hadoop/yarn/lib/*:/data/server/hadoop/3.2.1/share/hadoop/yarn/*:/lib/*' -Djava.library.path=:/data/server/hadoop/3.2.1/lib/native org.apache.flume.node.Application -f 1.9.0/conf/flume.conf --name myagent
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/data/server/flume/1.9.0/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/data/server/hadoop/3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
    at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
    at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:221)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:572)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:412)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
    at java.lang.Thread.run(Thread.java:748)
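The cause is that the guava jar bundled with Flume (guava-11.0.2) is too old: the Hadoop 3.2.1 classes call a Preconditions.checkArgument overload that this version does not have. Download a newer jar from the Maven repository, guava-28.1-jre.jar. Reference: https://blog.csdn.net/GQB1226/article/details/102555820
Move the old version out of Flume's lib directory and download the new one:
mv 1.9.0/lib/guava-11.0.2.jar ./
wget https://repo1.maven.org/maven2/com/google/guava/guava/28.1-jre/guava-28.1-jre.jar -P 1.9.0/lib/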
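Since the error comes from two different Guava versions ending up on one classpath, it may be worth listing which Guava jar each lib directory contains, both before and after the swap. The helper below is only a sketch; the name list_guava_jars is hypothetical, and the directories in the usage note are the install paths visible in the startup output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the file names of Guava jars found directly
# inside the given lib directory (prints nothing if there are none).
list_guava_jars() {
    local lib_dir="$1" jar
    for jar in "$lib_dir"/guava-*.jar; do
        [ -e "$jar" ] || continue    # the glob matched nothing
        basename "$jar"
    done
}
```

For example: list_guava_jars /data/server/flume/1.9.0/lib and list_guava_jars /data/server/hadoop/3.2.1/share/hadoop/common/lib . After the fix, Flume's lib directory should list only the newer jar.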
Start the agent manually again. The console output shows that a new temporary file was created, hdfs://172.16.1.126:9000/flume/nginx_logs/20200331/2020-03-31-05.1585647655051.log.tmp:
31 Mar 2020 05:40:50,916 INFO [lifecycleSupervisor-1-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start:62) - Configuration provider starting
31 Mar 2020 05:40:50,921 INFO [conf-file-poller-0] (org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run:138) - Reloading configuration file:./conf/flume.conf
31 Mar 2020 05:40:50,927 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig:1203) - Processing:k1
31 Mar 2020 05:40:50,929 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1117) - Added sinks: k1 Agent: myagent
31 Mar 2020 05:40:50,929 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addComponentConfig:1203) - Processing:r1
31 Mar 2020 05:40:50,935 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet:623) - Agent configuration for 'myagent' has no configfilters.
31 Mar 2020 05:40:50,956 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:163) - Post-validation flume configuration contains configuration for agents: [myagent]
31 Mar 2020 05:40:50,957 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:151) - Creating channels
31 Mar 2020 05:40:50,963 INFO [conf-file-poller-0] (org.apache.flume.channel.DefaultChannelFactory.create:42) - Creating instance of channel c1 type memory
31 Mar 2020 05:40:50,967 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.loadChannels:205) - Created channel c1
31 Mar 2020 05:40:50,968 INFO [conf-file-poller-0] (org.apache.flume.source.DefaultSourceFactory.create:41) - Creating instance of source r1, type exec
31 Mar 2020 05:40:50,974 INFO [conf-file-poller-0] (org.apache.flume.sink.DefaultSinkFactory.create:42) - Creating instance of sink: k1, type: hdfs
31 Mar 2020 05:40:50,983 INFO [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.getConfiguration:120) - Channel c1 connected to [r1, k1]
31 Mar 2020 05:40:50,985 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:162) - Starting new configuration:{ sourceRunners:{r1=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:r1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@7007686a counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }
31 Mar 2020 05:40:50,986 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:169) - Starting Channel c1
31 Mar 2020 05:40:51,032 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: CHANNEL, name: c1: Successfully registered new MBean.
31 Mar 2020 05:40:51,032 INFO [lifecycleSupervisor-1-0] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: CHANNEL, name: c1 started
31 Mar 2020 05:40:51,034 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:196) - Starting Sink k1
31 Mar 2020 05:40:51,035 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:207) - Starting Source r1
31 Mar 2020 05:40:51,035 INFO [lifecycleSupervisor-1-4] (org.apache.flume.source.ExecSource.start:170) - Exec source starting with command: tail -F /usr/local/nginx/logs/hadoop.access.log
31 Mar 2020 05:40:51,036 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: SINK, name: k1: Successfully registered new MBean.
31 Mar 2020 05:40:51,036 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: SINK, name: k1 started
31 Mar 2020 05:40:51,036 INFO [lifecycleSupervisor-1-4] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:119) - Monitored counter group for type: SOURCE, name: r1: Successfully registered new MBean.
31 Mar 2020 05:40:51,037 INFO [lifecycleSupervisor-1-4] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:95) - Component type: SOURCE, name:r1 started
31 Mar 2020 05:40:55,050 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSDataStream.configure:57) - Serializer = TEXT, UseRawLocalFileSystem = false
31 Mar 2020 05:40:55,167 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:246) - Creating hdfs://172.16.1.126:9000/flume/nginx_logs/20200331/2020-03-31-05.1585647655051.log.tmp
31 Mar 2020 05:40:59,225 INFO [Thread-9] (org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend:239) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
Open the Hadoop web UI to check: http://172.16.1.126:9870/explorer.html#/flume/nginx_logs/20200331
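The same check can be done from the command line. Because the sink path ends in %Y%m%d and useLocalTimeStamp is true, the directory name follows the agent host's local date. The sketch below rebuilds that name with date; the function name todays_log_dir is hypothetical, and it assumes the shell runs in the same time zone as the Flume agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HDFS directory the sink is expected to use
# today, mirroring the %Y%m%d escape in hdfs.path.
todays_log_dir() {
    local base="${1:-/flume/nginx_logs}"
    printf '%s/%s\n' "${base%/}" "$(date +%Y%m%d)"
}
```

On a host with the Hadoop client installed, the listing would then be: hadoop fs -ls "$(todays_log_dir)"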
Write a test script, mockRequest2NginxForTestFlume.sh:
#!/bin/bash
step=1    # interval between requests in seconds; must not exceed 60
user_agent_list=("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0")
user_agent_list[1]="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
referer_list=("https://www.baidu.com" "https://www.qq.com" "https://www.sina.com" "https://weibo.com/")
while true
do
    random=$((RANDOM))
    num=$(((RANDOM%7)+1))
    agent=$((RANDOM%2))
    referer=$((RANDOM%4))
    url="http://172.16.1.126:8881/${num}.html?r=${random}"
    # This fixed URL overrides the random one above; delete the line to request random pages
    url="http://172.16.1.126:8881/888.html"
    echo " $(date '+%Y-%m-%d %H:%M:%S') get $url"
    # Send the request with a random User-Agent and Referer
    curl -s -A "${user_agent_list[$agent]}" -e "${referer_list[$referer]}" "$url" > /dev/null
    sleep $step
done
Watch the HDFS file:
[root@localhost lib]# hadoop fs -tail -f /flume/nginx_logs/20200331/2020-03-31-05.1585647655051.log.tmp
2020-03-31 05:47:31,491 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36","request_body":"-"}
{"remote_addr":"172.16.39.19","http_x_forwarded_for":"-","remote_user":"-","time_local":"31/Mar/2020:04:53:30 -0400","server_protocol":"HTTP/1.1","request_time":"0.000","request_method":"GET","request_uri":"/999.html","status":404,"body_bytes_sent":193,"http_token":"-","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36","request_body":"-"}
{"remote_addr":"172.16.39.19","http_x_forwarded_for":"-","remote_user":"-","time_local":"31/Mar/2020:04:54:05 -0400","server_protocol":"HTTP/1.1","request_time":"0.000","request_method":"GET","request_uri":"/888.html","status":404,"body_bytes_sent":193,"http_token":"-","http_referer":"-","http_user_agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36","request_body":"-"}
As you can see, the logs have been collected into Hadoop!
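As a quick sanity check on what was collected, the JSON lines can be summarized by HTTP status with standard text tools. This is a rough sketch rather than a real JSON parser: the function name count_status is hypothetical, and it relies on the "status":NNN field produced by the post_json log format used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read JSON access-log lines on stdin and print
# "<status> <count>" for each HTTP status code, sorted by status.
count_status() {
    grep -o '"status":[0-9]*' | cut -d: -f2 | sort | uniq -c | awk '{print $2, $1}'
}
```

For example: hadoop fs -cat '/flume/nginx_logs/20200331/*' | count_status . For anything beyond a spot check, a proper JSON tool would be the safer choice.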
OK, that's a wrap!
PS:
Using Flume: Flume + Hive for offline log collection and analysis
Hadoop: collecting Nginx logs into a Hive transactional table with Flume