Log Cleaning with Hadoop MapReduce (Part 1)

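1. Log content formats
The logs I have worked with so far fall into three kinds: web request logs, event-tracking logs, and backend system logs.
1.1 Request logs
A request log records the resource requests that are sent or submitted to the server when a user visits a site, whether by opening a URL or by clicking an element on a page.
(Forum log)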
27.38.53.84 - - [30/May/2013:23:37:57 +0800] "GET /uc_server/data/avatar/000/00/50/90_avatar_small.jpg HTTP/1.1" 200 1828
218.28.247.140 - - [30/May/2013:23:37:57 +0800] "GET /static/image/common/swfupload.swf?preventswfcaching=1369928282717 HTTP/1.1" 200 13333
123.147.245.79 - - [30/May/2013:23:37:57 +0800] "GET /static/js/swfupload.queue.js?y7a HTTP/1.1" 304 -
182.242.227.232 - - [30/May/2013:23:37:56 +0800] "GET /misc.php?mod=patch&action=ipnotice&inajax=1&ajaxtarget=ip_notice HTTP/1.1" 200 65
183.67.254.204 - - [30/May/2013:23:37:56 +0800] "POST /forum.php?mod=post&action=newthread&fid=72&extra=&topicsubmit=yes&inajax=1 HTTP/1.1" 200 425
110.255.113.85 - - [30/May/2013:23:37:59 +0800] "GET /uc_server/avatar.php?uid=26294&size=middle HTTP/1.1" 301 -
111.37.4.243 - - [30/May/2013:23:37:58 +0800] "POST /source/plugin/pcmgr_url_safeguard/url_api.inc.php HTTP/1.1" 200 1300
125.82.229.229 - - [30/May/2013:23:38:05 +0800] "GET /uc_server/data/avatar/000/07/18/34_avatar_middle.jpg HTTP/1.1" 200 3790
122.70.237.247 - - [30/May/2013:23:38:03 +0800] "GET /forum.php?mod=image&aid=18696&size=300x300&key=3e12991ed5ff7ecd&nocache=yes&type=fixnone&ramdom=dZqQb HTTP/1.1" 200 39594
111.37.4.243 - - [30/May/2013:23:38:04 +0800] "GET /forum.php?mod=misc&action=postreview&do=support&tid=11228&pid=44989&hash=29c64660&infloat=yes&handlekey=login&referer=http%3A%2F%2Fbbs.itcast.cn%2Fforum.php%3Fmod%3Dviewthread%26tid%3D11228&inajax=1&ajaxtarget=fwin_content_login HTTP/1.1" 302 -
49.5.1.14 - - [30/May/2013:23:38:09 +0800] "GET /api/connect/like.php HTTP/1.1" 200 722

 

(Online store log)
183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:50:08 +0000] "-" 400 0 "-" "-"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.248.178.212 - - [18/Sep/2013:06:51:40 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 200 786 "http://blog.fens.me/nodejs-grunt-intro/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; MDDR; InfoPath.2; .NET4.0C)"
180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
180.168.34.26 - - [18/Sep/2013:07:11:08 +0000] "-" 400 0 "-" "-"
50.116.27.194 - - [18/Sep/2013:07:11:29 +0000] "POST /wp-cron.php?doing_wp_cron=1379488288.8893849849700927734375 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
222.35.232.69 - - [18/Sep/2013:16:14:17 +0000] "GET /wp-content/uploads/2013/05/favicon.ico HTTP/1.1" 200 1150 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
114.252.89.91 - - [18/Sep/2013:16:14:20 +0000] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 58 "http://blog.fens.me/wp-admin/post.php?post=2445&action=edit&message=10" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"
58.209.132.183 - - [18/Sep/2013:16:29:17 +0000] "GET /images/2.jpg HTTP/1.1" 200 105089 "http://image.baidu.com/i?ct=503316480&z=&tn=baiduimagedetail&ipn=d&word=%E6%B5%99%E6%B1%9F%E5%AE%89%E5%90%89&step_word=&ie=utf-8&in=17038&cl=2&lm=-1&st=&pn=0&rn=1&di=47839122900&ln=1998&fr=&&fmq=1379521091792_R&ic=&s=&se=&sme=0&tab=&width=&height=&face=&is=&istype=&ist=&jit=&objurl=http%3A%2F%2Fnews.eastday.com%2Feastday%2F06news%2Fchina%2Fzh2green%2Fanji%2Fnode327399%2Fimages%2F01517676.jpg" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)"
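The two samples above differ only in their tail: the forum lines stop after the response size, while the store lines also carry a quoted referer and user agent. As a minimal sketch (not the cleaning job itself), the following plain-Java class splits either shape into fields. The class name `AccessLogLine`, its field names, and the choice of `-1` for a size logged as `-` are assumptions made for illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one parsed line of a forum-style or store-style request log.
class AccessLogLine {
    // ip, identity, user, [time], "request", status, size, then optionally "referer" "agent".
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)"
        + "(?: \"([^\"]*)\" \"([^\"]*)\")?$");

    // Matches timestamps such as 30/May/2013:23:37:57 +0800.
    private static final DateTimeFormatter TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String ip;
    final OffsetDateTime time;
    final String request;   // e.g. GET /path HTTP/1.1, or "-" for a bad request
    final int status;
    final long bytes;       // -1 when the log shows "-"
    final String referer;   // null when the line has no referer/agent tail
    final String agent;     // null when the line has no referer/agent tail

    private AccessLogLine(Matcher m) {
        ip = m.group(1);
        time = OffsetDateTime.parse(m.group(4), TIME);
        request = m.group(5);
        status = Integer.parseInt(m.group(6));
        bytes = "-".equals(m.group(7)) ? -1L : Long.parseLong(m.group(7));
        referer = m.group(8);
        agent = m.group(9);
    }

    /** Returns the parsed line, or null when the line does not have the expected shape. */
    static AccessLogLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        try {
            return new AccessLogLine(m);
        } catch (DateTimeParseException e) {
            return null; // timestamp present but not in the expected format
        }
    }

    public static void main(String[] args) {
        AccessLogLine forum = parse("123.147.245.79 - - [30/May/2013:23:37:57 +0800] "
            + "\"GET /static/js/swfupload.queue.js?y7a HTTP/1.1\" 304 -");
        AccessLogLine store = parse("163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "
            + "\"HEAD / HTTP/1.1\" 200 20 \"-\" \"DNSPod-Monitor/1.0\"");
        System.out.println(forum.ip + " " + forum.time + " " + forum.status);
        System.out.println(store.request + " " + store.bytes + " " + store.agent);
    }
}
```

Returning null for lines that do not match keeps the decision about dirty records (drop, count, or route elsewhere) with the caller.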

 

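1.2 Event-tracking logs
Event tracking is a technique used by e-commerce sites: as a user browses the products shown to them, the site proactively records the list of exposed products, the dwell time, the products clicked, the components clicked, and similar information. The data supports operations and helps optimize the store layout. Common event-tracking logs are page-view, click, and exposure logs.
(Page view)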
2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, refer: https://m.leeyk99.com/ru/user/login?redirection=%2Fru%2FSneakers-c-1913.html%3Ficn%3Dsneakers%26ici%3Dmru_navbar15menu01dir02&prot=1, agent: Mozilla/5.0 (Linux; Android 5.1.1; SM-G531H Build/LMY48B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mru","language":"ru","money_type":"RUB","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"5.1.1","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428798994,"local_time":"2018/8/28 10:59:58","device_id":"","cookie_id":"5BCE0E1F_DAFD_2E64_F24E_B3B6D5D6BAC5","member_id":"","login":0,"page_id":3,"page_name":"page_real_class","page_param":{"category_id":"1913","source_category_id":"1745"},"start_time":1535428764401,"end_time":1535428798994,"tab_page_id":"page_real_class1535428764401"}
2018-08-28 11:59:58,272 - site: leeyk99, ip: 74.205.199.213, refer: https://m.leeyk99.com/us/Watermelon-Print-Round-Beach-Blanket-p-365584-cat-1866.html, agent: Mozilla/5.0 (Linux; Android 6.0; HUAWEI CAM-L21 Build/HUAWEICAM-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Mobile Safari/537.36, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"360X640","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"Android","os_versions":"6.0","browser_name":"Chrome","browser_versions":"68.0.3440.91","session_id":"","timestamp":1535428797165,"local_time":"2018/8/27 20:59:57","device_id":"","cookie_id":"B66A47CF_5522_DC84_F221_F0848C812BCA","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":365584,"traceid":"sm`1535428371336`B66A47CF_5522_DC84_F221_F0848C812BCA"},"start_time":1535428797165,"end_time":"","tab_page_id":"page_goods_detail1535428797165"}
2018-08-28 11:59:58,274 - site: leeyk99, ip: 99.174.207.56, refer: https://m.leeyk99.com/us/Striped-Ringer-Tee-p-469810-cat-1738.html?rrec=true, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1, body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"Mobile Safari","browser_versions":"11.0","session_id":"","timestamp":1535428797977,"local_time":"2018/8/27 22:59:57","device_id":"","cookie_id":"D56B15A4_37D3_9164_CA60_3B4CDB382F2D","member_id":"","login":0,"page_id":7,"page_name":"page_goods_detail","page_param":{"goods_id":469810,"traceid":"sm`1535428730780`D56B15A4_37D3_9164_CA60_3B4CDB382F2D"},"start_time":1535428797977,"end_time":"","tab_page_id":"page_goods_detail1535428797977"}
2018-08-28 11:59:58,293 - site: leeyk99, ip: 172.56.35.21, refer: https://m.leeyk99.com/us/FB-US-Striped-20180402-A-D7-vc-64042.html?utm_source=facebook.com&utm_medium=cpc&utm_campaign=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_&url_from=fbadsus_20180408_mobmpa_Food_FB-US-Striped-20180402-A-D7-vc-64042_3554_, agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15G77 Instagram 24.0.0.14.205 (iPhone7,2; iOS 11_4_1; en_US; en-US; scale=2.00; gamut=normal; 750x1334), body: {"device_type":"m","home_site":"leeyk99","sub_site":"mus","language":"en","money_type":"USD","device_country":"","app_versions":"","network_type":"","ip":"","screen_pixel":"375X667","screen_size":"","device_class":0,"device_brand":"","device_name":"","device_model":"","os_type":0,"os_name":"iOS","os_versions":"11.4.1","browser_name":"WebKit","browser_versions":"605.1.15","session_id":"","timestamp":1535428797377,"local_time":"2018/8/27 23:59:57","device_id":"","cookie_id":"3BFCB287_A97B_AA24_DBCC_86DC346D3100","member_id":"","login":0,"page_id":2,"page_name":"page_virtual_class","page_param":{"category_id":"64042"},"start_time":1535428797377,"end_time":"","tab_page_id":"page_virtual_class1535428797377"}

 

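Click and exposure logs look much like page-view logs. The fields collected vary slightly with the tracking requirements, but the core content is always what sits in body.
Because event-tracking logs are designed to a requirement, they are well-formed and consistent, so a Hive regex is usually enough to filter and load them. For this page-view log, you only need to create one table and specify the following regex; the log can then be loaded and queried: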
'input.regex'='([0-9\\.\\- :,]+) \\- site: ([\\w]+), ip: ([0-9\\.\\- :,]+), refer: (.*), agent: (.*), body: ([\\[\\{].*[\\}\\]])'
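Before creating the table, it can help to try the regex against a sample line locally. The sketch below reuses the pattern text from `input.regex` above in plain Java, assuming Hive reduces each `\\` in the quoted property value to a single `\` the same way a Java string literal does. The class name `PageViewRegexCheck` and the shortened sample line are made up for illustration; the six capture groups would correspond to the six columns of the table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical local check for the page-view regex; not part of any Hive API.
class PageViewRegexCheck {
    // Same pattern text as the input.regex property, split over two literals for readability.
    static final Pattern PAGE_VIEW = Pattern.compile(
        "([0-9\\.\\- :,]+) \\- site: ([\\w]+), ip: ([0-9\\.\\- :,]+), "
        + "refer: (.*), agent: (.*), body: ([\\[\\{].*[\\}\\]])");

    /** Returns the six captured columns, or null when the whole line does not match. */
    static String[] split(String line) {
        Matcher m = PAGE_VIEW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        // Shortened, made-up line in the same layout as the page-view samples.
        String line = "2018-08-28 11:59:58,263 - site: leeyk99, ip: 188.133.207.46, "
            + "refer: https://m.leeyk99.com/ru/user/login, "
            + "agent: Mozilla/5.0 (Linux; Android 5.1.1), "
            + "body: {\"device_type\":\"m\",\"page_id\":3}";
        String[] cols = split(line);
        String[] names = {"log_time", "site", "ip", "refer", "agent", "body"};
        for (int i = 0; i < cols.length; i++) {
            System.out.println(names[i] + " = " + cols[i]);
        }
    }
}
```

Note that `(.*)` is greedy, so a refer or agent value that itself contained the text `, agent: ` or `, body: ` would shift the column boundaries.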

 
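1.3 Backend system logs
Backend system logs are written by the system itself. Typically a frontend or another system calls a backend API, and the backend records the request or the returned result. The format is usually agreed between the systems, so the data is very regular and can also be processed directly with Hive regex.
For example:
Format 1 (result information):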
2018-07-03 06:50:00,142 [XNIO-2 task-28] INFO  com.leeyk99.bi.abt.rest.CoreApiController - 1A42F7C6_B904_A334_AB87_5A69A7034DA0  leeyk99PcRealClass 66 158

 

Format 2 (API request information):
2018-07-03 20:39:46,043 [XNIO-2 task-211] INFO  com.leeyk99.bi.abt.filter.LogFilter - GET  /api/v1/bi/abt?cid=973EA838_E20E_74E4_41AB_E218DA91D73E&uid=&site=mtw&terminal=leeyk99-M&lan=zh-tw took 1ms and returned 200
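As an illustration of how regular this format is, the sketch below pulls the fields out of a Format 2 line and then splits the query string into key/value pairs. The class name `ApiLogLine`, the method names, and the field order in the returned array are assumptions for this example; values are left as logged, with no URL decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for "Format 2" lines written by LogFilter.
class ApiLogLine {
    // time, [thread], level, logger, then: METHOD uri took <n>ms and returned <status>
    static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\[([^\\]]+)\\] (\\w+)\\s+(\\S+) - "
        + "(\\w+)\\s+(\\S+) took (\\d+)ms and returned (\\d{3})$");

    /** Returns {time, thread, level, logger, method, uri, tookMs, status}, or null on mismatch. */
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    /** Splits the query part of a URI into an ordered map; "uid=" yields an empty value. */
    static Map<String, String> queryParams(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : uri.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String line = "2018-07-03 20:39:46,043 [XNIO-2 task-211] INFO  "
            + "com.leeyk99.bi.abt.filter.LogFilter - GET  "
            + "/api/v1/bi/abt?cid=973EA838_E20E_74E4_41AB_E218DA91D73E&uid=&site=mtw"
            + "&terminal=leeyk99-M&lan=zh-tw took 1ms and returned 200";
        String[] f = fields(line);
        System.out.println(f[4] + " " + f[7] + " in " + f[6] + "ms");
        System.out.println(queryParams(f[5]));
    }
}
```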
 
(In 1.2 and 1.3, leeyk99 replaces a company brand name that appears in the source data.)
 
For handling well-formed log data with Hive regex, see https://www.cnblogs.com/leeyuki/p/9548811.html (cnblogs) or my note "ABT日志入库记录" (Evernote).
This post covers cleaning request logs with Hadoop MapReduce.
 
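2. Collecting and loading request logs
Our data warehouse team generally does not collect log files directly from production systems. Operations or a dedicated team handles log collection and usually lands the files on HDFS, on S3, or on an interface machine; the warehouse then ingests those files and cleans and processes them.
The ELK stack (Elasticsearch, Logstash, Kibana) provides an end-to-end solution built from open-source tools that are designed to work together and cover many use cases efficiently. It is aimed at platform or system users who need to view and monitor logs and track how a system is running.
Flume, originally developed by Cloudera and now an Apache project, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. Typical combinations built around it include:
  • Flume + Kafka + Storm + MySQL: a real-time big data system
  • Flume + HDFS + Kafka + Storm: real-time recommendation, anti-crawler services, and similar services
  • Flume + Hadoop + Hive: offline analysis of users' browsing paths on a site
  • Flume + Logstash + Kafka + Spark Streaming: real-time log processing and analysis
  • Flume + Spark + ELK: a real-time monitoring platform for data systems
FTP file transfer is also an important way to deliver files, but it may not suit large volumes of logs, unless the logs are archived offline first and then transferred to an interface machine for third parties to pick up.
I have not yet worked with real-time collection or similar modes.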
 
3. Configuring the Maven-Hadoop environment
3.1 Project initialization
<groupId>com.leeyk99.udp</groupId>
<artifactId>hadoop-mapreduce</artifactId>
<version>1.0-SNAPSHOT</version>

 

Goal: create a Maven project and configure the JAR files that the Hadoop runtime environment needs.
 
3.2 Configuring pom.xml
Configure the JAR files Hadoop needs to run:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
 
    <groupId>com.leeyk99.udp</groupId>
    <artifactId>hadoop-mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
 
   <!-- <packaging>jar</packaging>-->
 
    <dependencies>
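        <!-- hadoop-core 1.2.1 is the Hadoop 1.x artifact. It is left out here because
             mixing it with the 2.7.6 artifacts below puts two incompatible Hadoop
             versions on the classpath; hadoop-client 2.7.6 already pulls in the
             MapReduce client classes. -->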
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.6</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>
    <!--<build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>-->
</project>

 

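For configuring automatic JAR downloads for Maven projects in IDEA, see my note "Maven".
After the automatic download, IDEA has fetched many JAR files for this Maven project (listed under External Libraries). Besides the core artifacts configured above, the dependencies they require were downloaded as well, which saves downloading them one by one.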
 
 
 
 
 
posted @ 2018-08-30 16:09  leeyuki