ELK log collection: a deep pitfall when rsyslog monitors files through symlinks
In this setup, rsyslog monitors local files to collect redis and mc (memcached) slow logs and pushes them to the ELK cluster for analysis. The logs rotate daily, one file per day, and every night at 00:05 a cron job repoints a symlink so that a fixed path always refers to the new day's file. Recently a large amount of log data went missing, and the investigation uncovered a deep pitfall. First, the current configuration:
The rsyslog configuration is as follows; it watches the fixed file paths:
local6.* @r-log.cms.sina.cn:1514

main_queue(
    queue.workerthreads="10"        # threads to work on the queue
    queue.dequeueBatchSize="1000"   # max number of messages to process at once
    queue.size="50000"              # max queue size
)

module(load="imfile")
module(load="omkafka")
$PreserveFQDN on

############################ php curl log #####################################
input(type="imfile"
    File="/data1/ms/log/php_common/php_slow_log"
    Tag="tag1"
    Facility="local5"
    freshStartTail="on"
    deleteStateOnFileDelete="off"
    reopenOnTruncate="on"   ## use this option so rsyslog does not re-read the file after log rotation
)
local5.* @r-log.cms.sina.cn:1515

########################## redis log ########################################
$template redislog7215,"%$myhostname%`redis7215`%msg%"
ruleset(name="redislog7215") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="redis-log"
        template="redislog7215"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/redis7215.log"
    Tag=""
    ruleset="redislog7215"
    freshStartTail="on"
    reopenOnTruncate="on"
)

$template redislog7216,"%$myhostname%`redis7216`%msg%"
ruleset(name="redislog7216") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="redis-log"
        template="redislog7216"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/redis7216.log"
    Tag=""
    ruleset="redislog7216"
    freshStartTail="on"
    reopenOnTruncate="on"
)

$template redislog7242,"%$myhostname%`redis7242`%msg%"
ruleset(name="redislog7242") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="redis-log"
        template="redislog7242"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/redis7242.log"
    Tag=""
    ruleset="redislog7242"
    freshStartTail="on"
    reopenOnTruncate="on"
)

$template redislog7243,"%$myhostname%`redis7243`%msg%"
ruleset(name="redislog7243") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="redis-log"
        template="redislog7243"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/redis7243.log"
    Tag=""
    ruleset="redislog7243"
    freshStartTail="on"
    reopenOnTruncate="on"
)

#################### mc .log ###############################
$template mc40016,"%$myhostname%`mc40016`%msg%"
ruleset(name="mc40016") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="cms-front-mc"
        template="mc40016"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/memcached_get_err_40016.log"
    Tag=""
    ruleset="mc40016"
    freshStartTail="on"
    reopenOnTruncate="on"
)

$template mc40023,"%$myhostname%`mc40023`%msg%"
ruleset(name="mc40023") {
    action(
        broker=["10.13.88.190:9092","10.13.88.191:9092","10.13.88.192:9092","10.13.88.193:9092"]
        type="omkafka"
        topic="cms-front-mc"
        template="mc40023"
        partitions.auto="on"
    )
}
input(type="imfile"
    File="/data1/ms/log/front/memcached_get_err_40023.log"
    Tag=""
    ruleset="mc40023"
    freshStartTail="on"
    reopenOnTruncate="on"
)
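With a configuration this repetitive it is easy to leave an unbalanced brace or mistype a parameter name, so it is worth validating the syntax before relying on it. A minimal check, assuming the snippet above lives in the main /etc/rsyslog.conf (adjust the path if it sits under /etc/rsyslog.d/, and the service name if it differs):

# Dry-run config check at verbosity level 1; exits non-zero on syntax errors.
rsyslogd -N1 -f /etc/rsyslog.conf

# Reload only after the check passes (service name assumed to be "rsyslog").
systemctl restart rsyslog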
The switch is performed by a cron job at 00:05 every night; the crontab entry and the script are as follows:
5 0 * * * root sh /usr/local/script/elkslowlog.sh > /dev/null 2>&1
#!/bin/bash
DATE=`date +%F`
DATE2=`date +%Y%m%d`
ln -sf /data1/ms/log/php_common/curl-$DATE /data1/ms/log/php_common/php_slow_log
ln -sf /data1/ms/log/front/redis7215.$DATE2.log /data1/ms/log/front/redis7215.log
ln -sf /data1/ms/log/front/redis7216.$DATE2.log /data1/ms/log/front/redis7216.log
ln -sf /data1/ms/log/front/redis7242.$DATE2.log /data1/ms/log/front/redis7242.log
ln -sf /data1/ms/log/front/redis7243.$DATE2.log /data1/ms/log/front/redis7243.log
ln -sf /data1/ms/log/front/memcached_get_err_40023.$DATE2.log /data1/ms/log/front/memcached_get_err_40023.log
ln -sf /data1/ms/log/front/memcached_get_err_40016.$DATE2.log /data1/ms/log/front/memcached_get_err_40016.log
Problem analysis: At first glance nothing looks wrong. But with the exact same collection method, the curl php_slow_log never lost a line, while the mc and redis logs that lost entries share one trait: their volume is low, sometimes only a single entry in half an hour. The curl log, by contrast, is busy enough that the day's file already exists by 00:05. Thinking through that difference points to the cause: at the moment the symlink is repointed at 00:05, the day's low-volume log file may not have been created yet, so the symlink is created dangling. When the log file does appear later, rsyslog does not detect the change behind the symlink, and the entire day's logs are lost. A quick test confirmed exactly this behavior; the test steps are omitted here.
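To make the failure mode concrete, here is a minimal reproduction sketch on a scratch directory (the /tmp paths and the date in the file name are hypothetical stand-ins for the production paths in the script above):

#!/bin/bash
# Reproduce the dangling-symlink scenario outside production.
mkdir -p /tmp/symlink-repro

# 00:05 - the cron job repoints the symlink, but the day's file does not
# exist yet, so the link is created dangling.
ln -sf /tmp/symlink-repro/redis.20990101.log /tmp/symlink-repro/redis.log
ls -l /tmp/symlink-repro/redis.log   # the target is not there yet
cat /tmp/symlink-repro/redis.log     # fails: No such file or directory

# Some time later, the first slow-log entry finally creates the real file.
echo "slow command ..." >> /tmp/symlink-repro/redis.20990101.log

# An imfile input that was already watching /tmp/symlink-repro/redis.log while
# the link was dangling does not pick up the target appearing afterwards,
# which matches the whole-day loss observed above.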
Solution: modify the script so that, before repointing the symlink, it makes sure the day's log file exists. Appending an empty line with echo creates the file if it is missing (a blank line does not affect collection), and only then is the symlink switched. The modified script:
#!/bin/bash
DATE=`date +%F`
DATE2=`date +%Y%m%d`
echo >> /data1/ms/log/php_common/curl-$DATE && ln -sf /data1/ms/log/php_common/curl-$DATE /data1/ms/log/php_common/php_slow_log
echo >> /data1/ms/log/front/redis7215.$DATE2.log && ln -sf /data1/ms/log/front/redis7215.$DATE2.log /data1/ms/log/front/redis7215.log
echo >> /data1/ms/log/front/redis7216.$DATE2.log && ln -sf /data1/ms/log/front/redis7216.$DATE2.log /data1/ms/log/front/redis7216.log
echo >> /data1/ms/log/front/redis7242.$DATE2.log && ln -sf /data1/ms/log/front/redis7242.$DATE2.log /data1/ms/log/front/redis7242.log
echo >> /data1/ms/log/front/redis7243.$DATE2.log && ln -sf /data1/ms/log/front/redis7243.$DATE2.log /data1/ms/log/front/redis7243.log
echo >> /data1/ms/log/front/memcached_get_err_40023.$DATE2.log && ln -sf /data1/ms/log/front/memcached_get_err_40023.$DATE2.log /data1/ms/log/front/memcached_get_err_40023.log
echo >> /data1/ms/log/front/memcached_get_err_40016.$DATE2.log && ln -sf /data1/ms/log/front/memcached_get_err_40016.$DATE2.log /data1/ms/log/front/memcached_get_err_40016.log
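A cheap sanity check can be bolted onto the end of the script. The following is an optional sketch of my own, not part of the original fix: after the switch, confirm that every symlink resolves to an existing file and, if any is still dangling, raise a warning through the local6 facility that the config above already forwards.

#!/bin/bash
# Optional post-switch check (my own addition, not in the original fix):
# warn via syslog if any symlink still points at a missing file.
for link in /data1/ms/log/php_common/php_slow_log \
            /data1/ms/log/front/redis7215.log \
            /data1/ms/log/front/redis7216.log \
            /data1/ms/log/front/redis7242.log \
            /data1/ms/log/front/redis7243.log \
            /data1/ms/log/front/memcached_get_err_40023.log \
            /data1/ms/log/front/memcached_get_err_40016.log; do
    if [ ! -e "$link" ]; then    # -e follows the symlink to its target
        logger -p local6.warning "elkslowlog: $link is dangling"
    fi
done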
With that, the problem is fully resolved.