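Web Log Collection in Practice
To collect website access logs, we built a log collection system that gathers request data with a JS probe. This avoids the large volume of useless data that comes with collecting the web server's own access logs (requests for js, css and similar files, which account for roughly 70% of all requests).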
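Let's first look at the overall flow diagram. In short: pages served by the application server (spark2) load the JS probe, the probe sends the collected data to the log server (spark3) as a request for a tiny GIF image, nginx on the log server writes one log line per request, and logstash collects the log file.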
1: Setting up the application server
Install nginx and edit the configuration file (/etc/nginx/conf.d/default.conf)
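server {
    listen 80;
    server_name spark2;
    location / {
        root /data/nginx/app;
        index index.html index.htm;
        # Caution: nginx reads "on" as a log file path, not as a switch.
        # Access logging is already enabled by default.
        access_log on;
    }
}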
Add the HTML pages index.html and content.html
index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>首页</title>
</head>
<body>
    <a href="content.html">hello nginx</a>
    <script type="text/javascript" src="track.js"></script>
</body>
</html>
content.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>内容</title>
</head>
<body>
    <h3>来看内容啊</h3>
    <script type="text/javascript" src="track.js"></script>
</body>
</html>
Start nginx (service nginx start)
2: Implementing the JS probe
JS embedded in the page
<script type="text/javascript">
    var _maq = _maq || [];
    _maq.push(['_setAccount', 'zx5352']);
    (function() {
        var ma = document.createElement('script');
        ma.type = 'text/javascript';
        ma.async = true;
        ma.src = 'http://flow.itcast.zx/ma.js';
        var s = document.getElementsByTagName('script')[0];
        s.parentNode.insertBefore(ma, s);
    })();
</script>
track.js
(function () {
    var params = {};
    // Data from the document object
    if (document) {
        params.domain = document.domain || '';
        params.url = document.URL || '';
        params.title = document.title || '';
        params.referrer = document.referrer || '';
    }
    // Data from the window object
    if (window && window.screen) {
        params.sh = window.screen.height || 0;
        params.sw = window.screen.width || 0;
        params.cd = window.screen.colorDepth || 0;
    }
    // Data from the navigator object
    if (navigator) {
        params.lang = navigator.language || '';
    }
    // Parse the _maq configuration (typeof avoids a ReferenceError when the page never defined _maq)
    if (typeof _maq !== 'undefined' && _maq) {
        for (var i in _maq) {
            switch (_maq[i][0]) {
                case '_setAccount':
                    params.account = _maq[i][1];
                    break;
                default:
                    break;
            }
        }
    }
    // Build the parameter string
    var args = '';
    for (var i in params) {
        if (args != '') {
            args += '&';
        }
        args += i + '=' + encodeURIComponent(params[i]);
    }
    // Request the backend script through an Image object
    var img = new Image(1, 1);
    img.src = 'http://spark3/log.gif?' + args;
})();
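URL requested by the JS (shown unencoded for readability; in the real request the values are escaped by encodeURIComponent):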
http://spark3/log.gif?domain=spark2&url=http://spark2/content.html&title=内容&referrer=http://spark2/&sh=768&sw=1366&cd=24&lang=zh-CN&account=hll
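To make the encoding step concrete, here is a minimal sketch in plain JavaScript of how the parameter string is assembled and how a receiver could decode it again. The helper names `buildQuery` and `parseQuery` are hypothetical and not part of track.js; the sample values are taken from the URL above.

```javascript
// Hypothetical helpers for illustration only; they are not part of track.js.

// Join key/value pairs the way track.js does: key=encodeURIComponent(value).
function buildQuery(params) {
  return Object.keys(params)
    .map(function (key) {
      return key + '=' + encodeURIComponent(params[key]);
    })
    .join('&');
}

// Decode a query string back into a plain object (all values become strings).
function parseQuery(query) {
  var result = {};
  new URLSearchParams(query).forEach(function (value, key) {
    result[key] = value;
  });
  return result;
}

var query = buildQuery({
  domain: 'spark2',
  url: 'http://spark2/content.html',
  title: '内容',
  sh: 768
});

console.log(query);
console.log(parseQuery(query));
```

The encoded form is what actually travels over the wire, which is why the log server has to unescape each parameter before writing it to the log.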
3: Setting up the log server
1. Install the dependencies
    yum -y install gcc perl pcre-devel openssl openssl-devel
2. Upload LuaJIT-2.0.4.tar.gz and install LuaJIT
    tar -zxvf LuaJIT-2.0.4.tar.gz -C /usr/local/src/
    cd /usr/local/src/LuaJIT-2.0.4/
    make && make install PREFIX=/usr/local/luajit
3. Set the environment variables
    export LUAJIT_LIB=/usr/local/luajit/lib
    export LUAJIT_INC=/usr/local/luajit/include/luajit-2.0
4. Create a modules directory to hold the nginx modules
    mkdir -p /usr/local/nginx/modules
5. Upload openresty-1.9.7.3.tar.gz and the modules it depends on: lua-nginx-module-0.10.0.tar.gz, set-misc-nginx-module-0.29.tar.gz, ngx_devel_kit-0.2.19.tar.gz and echo-nginx-module-0.58.tar.gz
6. Unpack the modules directly into /usr/local/nginx/modules; they do not need to be compiled or installed
    tar -zxvf lua-nginx-module-0.10.0.tar.gz -C /usr/local/nginx/modules/
    tar -zxvf set-misc-nginx-module-0.29.tar.gz -C /usr/local/nginx/modules/
    tar -zxvf ngx_devel_kit-0.2.19.tar.gz -C /usr/local/nginx/modules/
    tar -zxvf echo-nginx-module-0.58.tar.gz -C /usr/local/nginx/modules/
7. Unpack openresty-1.9.7.3.tar.gz
    tar -zxvf openresty-1.9.7.3.tar.gz -C /usr/local/src/
    cd /usr/local/src/openresty-1.9.7.3/
8. Compile and install openresty
    ./configure --prefix=/usr/local/openresty --with-luajit && make && make install
9. Upload and unpack nginx
    tar -zxvf nginx-1.8.1.tar.gz -C /usr/local/src/
    cd /usr/local/src/nginx-1.8.1/
10. Compile nginx with support for the extra modules
    ./configure --prefix=/usr/local/nginx \
        --with-ld-opt="-Wl,-rpath,/usr/local/luajit/lib" \
        --add-module=/usr/local/nginx/modules/ngx_devel_kit-0.2.19 \
        --add-module=/usr/local/nginx/modules/lua-nginx-module-0.10.0 \
        --add-module=/usr/local/nginx/modules/set-misc-nginx-module-0.29 \
        --add-module=/usr/local/nginx/modules/echo-nginx-module-0.58
    make -j2
    make install
11. Edit the nginx configuration file
    worker_processes 2;

    events {
        worker_connections 1024;
    }

    http {
        include mime.types;
        default_type application/octet-stream;

        log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";
        access_log logs/access.log tick;

        sendfile on;
        keepalive_timeout 65;

        server {
            listen 80;
            server_name localhost;

            # Must match the path requested by track.js (http://spark3/log.gif)
            location /log.gif {
                # Disguise the response as a gif file
                default_type image/gif;
                # access_log is off for this location; the log is written through a subrequest
                access_log off;

                access_by_lua "
                    -- The user-tracking cookie is named __utrace
                    local uid = ngx.var.cookie___utrace
                    if not uid then
                        -- If there is none, generate a tracking cookie: md5(timestamp + IP + client info)
                        -- ('or' guards against requests that send no User-Agent header)
                        uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. (ngx.var.http_user_agent or ''))
                    end
                    ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
                    if ngx.var.arg_domain then
                        -- Record the log through a subrequest to /i-log, passing the parameters and the tracking cookie along
                        ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
                    end
                ";

                # Do not cache this request
                add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
                add_header Pragma "no-cache";
                add_header Cache-Control "no-cache, max-age=0, must-revalidate";

                # Return an empty 1x1 gif image
                empty_gif;
            }

            location /i-log {
                # Internal location; it cannot be accessed directly from outside
                internal;

                # Set the variables; note that they need to be unescaped
                set_unescape_uri $u_domain $arg_domain;
                set_unescape_uri $u_url $arg_url;
                set_unescape_uri $u_title $arg_title;
                set_unescape_uri $u_referrer $arg_referrer;
                set_unescape_uri $u_sh $arg_sh;
                set_unescape_uri $u_sw $arg_sw;
                set_unescape_uri $u_cd $arg_cd;
                set_unescape_uri $u_lang $arg_lang;
                set_unescape_uri $u_utrace $arg_utrace;
                set_unescape_uri $u_account $arg_account;

                # Enable logging of subrequests
                log_subrequest on;
                # Write the log to track.log (the file logstash reads) in the tick format;
                # in real use it is best to add a buffer
                access_log /var/nginx_logs/track.log tick;

                # Output an empty string
                echo '';
            }
        }
    }
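The tracking-cookie logic in the access_by_lua block can be sketched outside nginx as follows. This is only an illustration in JavaScript for Node.js, not the code that runs on the log server: `makeTrackingId` and `trackingCookie` are hypothetical names, the sample inputs are made up, and because the timestamp may be formatted differently than in Lua the IDs are not guaranteed to match the ones nginx generates.

```javascript
var crypto = require('crypto');

// Hypothetical helper: md5 over timestamp + client IP + user agent,
// hex-encoded, following the same recipe as the Lua block.
function makeTrackingId(nowSeconds, remoteAddr, userAgent) {
  return crypto
    .createHash('md5')
    .update(String(nowSeconds) + remoteAddr + (userAgent || ''))
    .digest('hex');
}

// Hypothetical helper: reuse the ID from an existing __utrace cookie or
// create a new one, and build the Set-Cookie value sent back either way.
function trackingCookie(existingId, nowSeconds, remoteAddr, userAgent) {
  var uid = existingId || makeTrackingId(nowSeconds, remoteAddr, userAgent);
  return { uid: uid, setCookie: '__utrace=' + uid + '; path=/' };
}

// Made-up sample inputs: a first visit without a cookie, then a second
// visit that presents the cookie issued by the first one.
var first = trackingCookie(null, 1489718383.17, '192.168.154.2', 'Mozilla/5.0');
var second = trackingCookie(first.uid, 1489718385.448, '192.168.154.2', 'Mozilla/5.0');

console.log(first.setCookie);
console.log(second.uid === first.uid);
```

Because the ID is reused once the cookie exists, all requests from one browser share the same value in the log, which is what makes it possible to group log lines by visitor later on.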
Check the log:
1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\xE6\xA3\xA3\xE6\xA0\xAD\xE3\x80\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\xE5\x86\x85\xE5\xAE\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
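Each log line is a list of values in the order given by the tick log_format, so it can be split back into named fields. The sketch below is one way to do that in JavaScript; `TICK_FIELDS` and `parseTickLine` are hypothetical names. It assumes the separator in the file is the single control character 0x01, which is usually displayed as ^A; since the sample here is copied from the printed log, it passes the two-character text '^A' as the separator instead.

```javascript
// Field order follows the log_format named "tick" in the nginx configuration.
var TICK_FIELDS = [
  'msec', 'remote_addr', 'domain', 'url', 'title', 'referrer',
  'sh', 'sw', 'cd', 'lang', 'user_agent', 'utrace', 'account'
];

// Hypothetical helper: split one log line into a record keyed by field name.
// Returns null when the number of fields does not match the format.
function parseTickLine(line, separator) {
  var sep = separator === undefined ? '\u0001' : separator;
  var parts = line.split(sep);
  if (parts.length !== TICK_FIELDS.length) {
    return null;
  }
  var record = {};
  TICK_FIELDS.forEach(function (name, index) {
    record[name] = parts[index];
  });
  return record;
}

// Second sample line from the log above, rebuilt with the printed '^A' marker.
var sampleLine = [
  '1489718385.448', '192.168.154.2', 'spark2', 'http://spark2/content.html',
  '\\xE5\\x86\\x85\\xE5\\xAE\\xB9', 'http://spark2/', '768', '1366', '24', 'zh-CN',
  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
  '0f21f45cf2c1ba459e9812ee3de17d8a', 'zx5352'
].join('^A');

var record = parseTickLine(sampleLine, '^A');
console.log(record.url, record.account);
```

Rejecting lines with the wrong field count is a deliberate choice in this sketch: a value that itself contains the separator would otherwise shift every later field.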
4: Log collection
logstash configuration file
input {
    file {
        type => "syslog"
        path => "/var/nginx_logs/track.log"
        discover_interval => 10
        start_position => "beginning"
    }
}
output {
    stdout { codec => rubydebug }
}
[root@spark3 logstash]# bin/logstash -f config/log.conf
Log events printed to the screen by logstash
{
    "message" => "1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\\xE6\\xA3\\xA3\\xE6\\xA0\\xAD\\xE3\\x80\\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
    "@version" => "1",
    "@timestamp" => "2017-03-17T03:12:34.380Z",
    "path" => "/var/nginx_logs/track.log",
    "host" => "spark3",
    "type" => "syslog"
}
{
    "message" => "1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\\xE5\\x86\\x85\\xE5\\xAE\\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
    "@version" => "1",
    "@timestamp" => "2017-03-17T03:12:34.906Z",
    "path" => "/var/nginx_logs/track.log",
    "host" => "spark3",
    "type" => "syslog"
}
- You can use logstash filters to filter the logs, and an output plugin to write them to a storage system such as kafka or es for later processing.
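As an illustration of the kind of cleanup such a filtering step might perform, the sketch below restores the \xHH escapes that nginx wrote for non-ASCII bytes and converts the $msec value into an ISO timestamp. It is written in JavaScript rather than as logstash configuration, and `decodeNginxEscapes` and `msecToIso` are hypothetical helper names; it assumes the escaped bytes are UTF-8.

```javascript
// Hypothetical helper: turn runs of \xHH escapes back into text,
// assuming the escaped bytes are UTF-8.
function decodeNginxEscapes(text) {
  return text.replace(/(?:\\x[0-9A-Fa-f]{2})+/g, function (run) {
    var bytes = run
      .split('\\x')
      .slice(1) // the first element is the empty string before the first \x
      .map(function (hex) {
        return parseInt(hex, 16);
      });
    return new TextDecoder('utf-8').decode(Uint8Array.from(bytes));
  });
}

// Hypothetical helper: convert nginx's $msec (seconds with millisecond
// resolution) into an ISO 8601 timestamp in UTC.
function msecToIso(msec) {
  return new Date(Math.round(parseFloat(msec) * 1000)).toISOString();
}

// Values taken from the log lines shown earlier.
console.log(decodeNginxEscapes('\\xE5\\x86\\x85\\xE5\\xAE\\xB9'));
console.log(msecToIso('1489718383.170'));
```

Plain ASCII values such as URLs pass through `decodeNginxEscapes` unchanged, so it could be applied to every field of a record without special cases.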