Web Log Collection in Practice

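To collect website access logs, I built a log collection system that uses a JavaScript probe to capture request data. This avoids the large volume of irrelevant entries that comes with harvesting the web server's own access log, where requests for js, css and similar static files account for roughly 70% of the lines.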

First, the overall flow diagram:

[Figure: overall flow diagram]

1: Setting up the application server

Install nginx and edit its configuration file (/etc/nginx/conf.d/default.conf):

server {
  listen 80;
  server_name spark2;

  location / {
    root /data/nginx/app;
    index index.html index.htm;
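    # "access_log on;" would log to a file literally named "on"; give a path instead (this path is an assumption)
    access_log /var/log/nginx/access.log;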
  }
}

Add two HTML pages, index.html and content.html (both listings follow, index.html first):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>首页</title>
</head>
<body>
<a href="content.html">hello nginx</a>

<script type="text/javascript" src="track.js"></script>
</body>
</html>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>内容</title>
</head>
<body>

<h3>来看内容啊</h3>

<script type="text/javascript" src="track.js"></script>
</body>
</html>
Start nginx (service nginx start).
 
2: Implementing the JS probe

JavaScript snippet embedded in the page:

<script type="text/javascript">
    var _maq = _maq || [];
    _maq.push(['_setAccount', 'zx5352']);
 
    (function() {
        var ma = document.createElement('script'); 
        ma.type = 'text/javascript';
        ma.async = true;
        ma.src = 'http://flow.itcast.zx/ma.js';
        var s = document.getElementsByTagName('script')[0]; 
        s.parentNode.insertBefore(ma, s);
    })();
</script>
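
The `_maq` array in the snippet acts as a command queue: the page pushes `[name, value]` pairs and the probe script reads them later. A minimal sketch of reading such a queue is shown here; `readCommands` is a hypothetical helper name, not something the original probe defines, and it only knows the `_setAccount` command used in this article.

```javascript
// Hypothetical helper: reads a _maq-style command queue into a settings object.
function readCommands(queue) {
  const settings = {};
  if (!Array.isArray(queue)) return settings; // no queue on the page: nothing to read
  for (const entry of queue) {
    if (!Array.isArray(entry) || entry.length < 2) continue; // skip malformed entries
    const [name, value] = entry;
    switch (name) {
      case '_setAccount':
        settings.account = value;
        break;
      default:
        break; // unknown commands are ignored
    }
  }
  return settings;
}

// Same shape as the queue built by the embedded snippet.
const _maq = [];
_maq.push(['_setAccount', 'zx5352']);
console.log(readCommands(_maq));
```

Iterating with `for...of` over the array avoids the inherited-property pitfalls that `for...in` has on arrays.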

 

track.js

(function () {
    var params = {};
    // Data from the document object
    if(document) {
        params.domain = document.domain || ''; 
        params.url = document.URL || ''; 
        params.title = document.title || ''; 
        params.referrer = document.referrer || ''; 
    }   
    // Data from the window object
    if(window && window.screen) {
        params.sh = window.screen.height || 0;
        params.sw = window.screen.width || 0;
        params.cd = window.screen.colorDepth || 0;
    }   
    // Data from the navigator object
    if(navigator) {
        params.lang = navigator.language || ''; 
    }   
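    // Parse the _maq configuration (guarded so a page without the snippet does not throw)
    if(typeof _maq !== 'undefined' && _maq) {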
        for(var i in _maq) {
            switch(_maq[i][0]) {
                case '_setAccount':
                    params.account = _maq[i][1];
                    break;
                default:
                    break;
            }   
        }   
    }   
    // Build the query string
    var args = ''; 
    for(var i in params) {
        if(args != '') {
            args += '&';
        }   
        args += i + '=' + encodeURIComponent(params[i]);
    }   
 
    // Request the backend endpoint through an Image object
    var img = new Image(1, 1); 
    img.src = 'http://spark3/log.gif?' + args;
})();
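
The query-string step in track.js can also be written as a small pure function, which makes the encoding easier to check in isolation. This is only a sketch: `buildQuery` is a hypothetical name that does not appear in track.js, and the sample values are taken from the pages in this article.

```javascript
// Hypothetical helper, not part of track.js: serializes a params object into a query string.
function buildQuery(params) {
  return Object.keys(params)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(params[key]))
    .join('&');
}

const query = buildQuery({
  domain: 'spark2',
  url: 'http://spark2/content.html',
  title: '内容',
});
console.log('http://spark3/log.gif?' + query);
```

Because every value goes through `encodeURIComponent`, characters such as `:`, `/` and non-ASCII text in the page title are percent-encoded before they are sent.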

The URL requested by the script (shown here with the parameters unencoded for readability):

http://spark3/log.gif?domain=spark2&url=http://spark2/content.html&title=内容&referrer=http://spark2/&sh=768&sw=1366&cd=24&lang=zh-CN&account=hll
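
To see what a receiver gets out of such a request, the sketch below decodes a beacon URL back into its parameters using the standard `URL` API. `parseBeacon` is a hypothetical helper name, and the sample URL is assembled in encoded form, the way the browser would actually send it.

```javascript
// Hypothetical helper: decodes the query string of a beacon URL into a plain object.
function parseBeacon(rawUrl) {
  const url = new URL(rawUrl);
  // searchParams percent-decodes each value; all values come back as strings.
  return Object.fromEntries(url.searchParams);
}

const sample = 'http://spark3/log.gif?domain=spark2'
  + '&url=' + encodeURIComponent('http://spark2/content.html')
  + '&title=' + encodeURIComponent('内容')
  + '&sh=768&sw=1366';
console.log(parseBeacon(sample));
```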

  

3: Setting up the log server

1. Install dependencies

yum -y install gcc perl pcre-devel openssl openssl-devel

2. Upload LuaJIT-2.0.4.tar.gz and install LuaJIT

tar -zxvf LuaJIT-2.0.4.tar.gz -C /usr/local/src/

cd /usr/local/src/LuaJIT-2.0.4/

make PREFIX=/usr/local/luajit && make install PREFIX=/usr/local/luajit

3. Set environment variables

export LUAJIT_LIB=/usr/local/luajit/lib

export LUAJIT_INC=/usr/local/luajit/include/luajit-2.0

4. Create a modules directory to hold the nginx modules

mkdir -p /usr/local/nginx/modules

 

5. Upload openresty-1.9.7.3.tar.gz and the modules it depends on: lua-nginx-module-0.10.0.tar.gz, set-misc-nginx-module-0.29.tar.gz, ngx_devel_kit-0.2.19.tar.gz, echo-nginx-module-0.58.tar.gz

 

6. Extract the dependency modules directly into /usr/local/nginx/modules; they do not need to be compiled or installed separately

tar -zxvf lua-nginx-module-0.10.0.tar.gz -C /usr/local/nginx/modules/

tar -zxvf set-misc-nginx-module-0.29.tar.gz -C /usr/local/nginx/modules/

tar -zxvf ngx_devel_kit-0.2.19.tar.gz -C /usr/local/nginx/modules/

tar -zxvf echo-nginx-module-0.58.tar.gz -C /usr/local/nginx/modules/

 

7. Extract openresty-1.9.7.3.tar.gz

tar -zxvf openresty-1.9.7.3.tar.gz -C /usr/local/src/

cd /usr/local/src/openresty-1.9.7.3/

8. Compile and install openresty

./configure --prefix=/usr/local/openresty --with-luajit && make && make install

 

9. Upload nginx-1.8.1.tar.gz and extract it

tar -zxvf nginx-1.8.1.tar.gz -C /usr/local/src/

cd /usr/local/src/nginx-1.8.1/

10. Compile nginx with the additional modules

./configure --prefix=/usr/local/nginx \
    --with-ld-opt="-Wl,-rpath,/usr/local/luajit/lib" \
    --add-module=/usr/local/nginx/modules/ngx_devel_kit-0.2.19 \
    --add-module=/usr/local/nginx/modules/lua-nginx-module-0.10.0 \
    --add-module=/usr/local/nginx/modules/set-misc-nginx-module-0.29 \
    --add-module=/usr/local/nginx/modules/echo-nginx-module-0.58

make -j2

make install

 

11. Edit the nginx configuration file

worker_processes  2;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format tick "$msec^A$remote_addr^A$u_domain^A$u_url^A$u_title^A$u_referrer^A$u_sh^A$u_sw^A$u_cd^A$u_lang^A$http_user_agent^A$u_utrace^A$u_account";
    
    access_log  logs/access.log  tick;

    sendfile        on;

    keepalive_timeout  65;

    server {
        listen       80;
        server_name  localhost;
        location /log.gif {
            # Masquerade as a gif file
            default_type image/gif;    
            # Turn off access_log for this location itself; logging is done through a subrequest
            access_log off;
        
            access_by_lua "
                -- The user-tracking cookie is named __utrace
                local uid = ngx.var.cookie___utrace        
                if not uid then
                    -- If it is missing, generate a tracking cookie: md5(timestamp + IP + client info)
                    uid = ngx.md5(ngx.now() .. ngx.var.remote_addr .. (ngx.var.http_user_agent or ''))
                end 
                ngx.header['Set-Cookie'] = {'__utrace=' .. uid .. '; path=/'}
                if ngx.var.arg_domain then
                    -- Issue a subrequest to /i-log to write the log, passing along the parameters and the tracking cookie
                    ngx.location.capture('/i-log?' .. ngx.var.args .. '&utrace=' .. uid)
                end 
            ";  
        
            # Do not cache this request
            add_header Expires "Fri, 01 Jan 1980 00:00:00 GMT";
            add_header Pragma "no-cache";
            add_header Cache-Control "no-cache, max-age=0, must-revalidate";
        
            # Return an empty 1×1 gif image
            empty_gif;
        }   
    
        location /i-log {
            # Internal location; direct external access is not allowed
            internal;
        
            # Set variables; note that the values need to be unescaped
            set_unescape_uri $u_domain $arg_domain;
            set_unescape_uri $u_url $arg_url;
            set_unescape_uri $u_title $arg_title;
            set_unescape_uri $u_referrer $arg_referrer;
            set_unescape_uri $u_sh $arg_sh;
            set_unescape_uri $u_sw $arg_sw;
            set_unescape_uri $u_cd $arg_cd;
            set_unescape_uri $u_lang $arg_lang;
            set_unescape_uri $u_utrace $arg_utrace;
            set_unescape_uri $u_account $arg_account;
        
            # Enable logging of subrequests
            log_subrequest on;
            # Write the log to track.log (the file logstash reads) in the tick format; in production it is best to add a buffer
            access_log /var/nginx_logs/track.log tick;
        
            # Output an empty string
            echo '';
        }
    }
}
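
The cookie logic in the Lua block is: reuse `__utrace` if the browser already sent it, otherwise mint an id as md5(timestamp + IP + client info). The sketch below mirrors that logic in Node.js so it can be tried outside nginx. The function names are hypothetical, and because Lua and JavaScript format the timestamp differently, the ids it produces are not expected to match the ones nginx generates.

```javascript
const nodeCrypto = require('crypto');

// Hypothetical stand-in for the Lua logic: md5 over timestamp + IP + user agent.
function makeTraceId(nowSeconds, remoteAddr, userAgent) {
  const input = String(nowSeconds) + remoteAddr + (userAgent || '');
  return nodeCrypto.createHash('md5').update(input, 'utf8').digest('hex');
}

// Reuse the __utrace cookie if the request carries one, otherwise mint a new id.
function resolveTraceId(cookieHeader, nowSeconds, remoteAddr, userAgent) {
  const match = /(?:^|;\s*)__utrace=([^;]+)/.exec(cookieHeader || '');
  return match ? match[1] : makeTraceId(nowSeconds, remoteAddr, userAgent);
}

// First visit: no cookie yet, so a fresh 32-character hex id is generated.
console.log(resolveTraceId('', 1489718383.17, '192.168.154.2', 'Mozilla/5.0'));
```

An id built this way is only a convenience identifier for grouping requests from one browser; it is not meant to be unguessable.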

Inspect the log:

1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\xE6\xA3\xA3\xE6\xA0\xAD\xE3\x80\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\xE5\x86\x85\xE5\xAE\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352
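
Each line holds the fields of the `tick` format in order, and non-ASCII bytes (the page title here) appear as `\xHH` escapes. The following sketch splits a line into named fields and turns the escapes back into UTF-8 text. The short field names and both helper names are my own labels, and the code assumes the separator is either the raw `\x01` byte or the two visible characters `^A`; the user agent in the sample is shortened.

```javascript
// Field order mirrors the tick log_format; the short names are illustrative labels.
const TICK_FIELDS = ['msec', 'remote_addr', 'domain', 'url', 'title', 'referrer',
  'sh', 'sw', 'cd', 'lang', 'user_agent', 'utrace', 'account'];

// Turn runs of \xHH escapes (how nginx writes non-ASCII bytes) back into UTF-8 text.
function decodeHexEscapes(text) {
  return text.replace(/(?:\\x[0-9A-Fa-f]{2})+/g, (run) =>
    Buffer.from(run.replace(/\\x/g, ''), 'hex').toString('utf8'));
}

// Assumption: the separator is either the raw \x01 byte or the two characters "^A".
function parseTickLine(line) {
  const separator = line.includes('\x01') ? '\x01' : '^A';
  const values = line.split(separator);
  const record = {};
  TICK_FIELDS.forEach((name, i) => {
    record[name] = decodeHexEscapes(values[i] || '');
  });
  return record;
}

const line = ['1489718385.448', '192.168.154.2', 'spark2', 'http://spark2/content.html',
  '\\xE5\\x86\\x85\\xE5\\xAE\\xB9', 'http://spark2/', '768', '1366', '24', 'zh-CN',
  'Mozilla/5.0', '0f21f45cf2c1ba459e9812ee3de17d8a', 'zx5352'].join('^A');
console.log(parseTickLine(line));
```

Decoding whole runs of escapes at once matters because one UTF-8 character spans several bytes; decoding byte by byte would break multi-byte characters apart.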

 

4: Log collection

logstash configuration file:

input {
  file {
    type => "syslog"
    path => "/var/nginx_logs/track.log"
    discover_interval => 10
    start_position => "beginning" 
  }
    
}
output { stdout { codec => rubydebug } }

[root@spark3 logstash]# bin/logstash -f config/log.conf

Log events that logstash prints to the screen:

{
       "message" => "1489718383.170^A192.168.154.2^Aspark2^Ahttp://spark2/^A\\xE6\\xA3\\xA3\\xE6\\xA0\\xAD\\xE3\\x80\\x89^A^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
      "@version" => "1",
    "@timestamp" => "2017-03-17T03:12:34.380Z",
          "path" => "/var/nginx_logs/track.log",
          "host" => "spark3",
          "type" => "syslog"
}
{
       "message" => "1489718385.448^A192.168.154.2^Aspark2^Ahttp://spark2/content.html^A\\xE5\\x86\\x85\\xE5\\xAE\\xB9^Ahttp://spark2/^A768^A1366^A24^Azh-CN^AMozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36^A0f21f45cf2c1ba459e9812ee3de17d8a^Azx5352",
      "@version" => "1",
    "@timestamp" => "2017-03-17T03:12:34.906Z",
          "path" => "/var/nginx_logs/track.log",
          "host" => "spark3",
          "type" => "syslog"
}

 

  • You can use logstash filters to filter the log entries, and an output plugin to write them to a store such as kafka or es for downstream processing.
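
As a rough picture of what such a filtering step does, the sketch below turns one raw log line into a JSON document that could then be shipped to a store. It is plain JavaScript rather than a logstash filter; `toEvent`, the chosen output fields and the `^A` default separator are all assumptions made for illustration.

```javascript
// Hypothetical filter step: drop malformed lines, keep selected fields, add an ISO timestamp.
const FIELD_COUNT = 13; // number of fields in the tick log_format

function toEvent(rawLine, separator = '^A') {
  const parts = rawLine.split(separator);
  if (parts.length !== FIELD_COUNT) return null; // filter out lines that do not match the format
  const millis = Math.round(parseFloat(parts[0]) * 1000); // $msec is seconds with millisecond precision
  if (Number.isNaN(millis)) return null;
  return {
    '@timestamp': new Date(millis).toISOString(),
    remote_addr: parts[1],
    domain: parts[2],
    url: parts[3],
    referrer: parts[5],
    utrace: parts[11],
    account: parts[12],
  };
}

const rawLine = ['1489718383.170', '192.168.154.2', 'spark2', 'http://spark2/', '', '',
  '768', '1366', '24', 'zh-CN', 'Mozilla/5.0',
  '0f21f45cf2c1ba459e9812ee3de17d8a', 'zx5352'].join('^A');
console.log(JSON.stringify(toEvent(rawLine)));
```

Returning `null` for lines with the wrong field count gives the caller a simple way to discard noise before anything is written downstream.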

 

posted @ 2017-03-17 11:19  huangll99