[Hive_add_11] Hive 使用 UDTF 实现日志降维


0. 说明

  对日志进行降维处理,将日志分为几个小表

  通过编写 UDTF ,对日志降维,将日志聚合体相关字段抽取出来,形成新表。

 

 

 


1. 操作流程

  1.0 日志部分内容

1532868065.518#192.168.23.1#1532868067548#200#{\"appChannel\":\"appstore\",\"appErrorLogs\":

[{\"createdAtMs\":1530455040000,\"errorBrief\":\"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)\",
\"errorDetail\":\"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67) at sun.reflect.DelegatingMethodAccessorImpl.
invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)\"},{\"createdAtMs\":1530393180000,
\"errorBrief\":\"at cn.lift.appIn.control.CommandUtil.getInfo(CommandUtil.java:67)\",\"errorDetail\":
\"at cn.lift.dfdfdf.control.CommandUtil.getInfo(CommandUtil.java:67)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)\"}
],\"appEventLogs\":[{\"createdAtMs\":1530412800000,\"eventId\":
\"share\",\"logType\":\"event\",\"mark\":\"4\",\"musicID\":\"傲红尘\"}
],\"appPageLogs\":[{\"createdAtMs\":1530449520000,
\"logType\":\"page\",\"nextPage\":\"list.html\",\"pageId\":\"list.html\",\"pageViewCntInSession\":0,\"visitIndex\":\"2\"}
],
\"appPlatform\":\"ios\",\"appStartupLogs\":[{\"brand\":\"联想\",\"carrier\":\"中国联通\",\"country\":\"china\",\"createdAtMs\":
1530385560000,\"logType\":\"startup\",\"network\":\"cell\",\"province\":\"hebei\",\"screenSize\":\"960 * 640\"},
{\"brand\":\"魅族\",\"carrier\":\"中国铁通\",\"country\":\"america\",\"createdAtMs\":1530412860000,\"logType\":\"startup\",
\"network\":\"3g\",\"province\":\"guangxi\",\"screenSize\":\"480 * 320\"}
],\"appUsageLogs\":[{\"createdAtMs\":1530379200000,
\"logType\":\"usage\",\"singleDownloadTraffic\":\"12800\",\"singleUploadTraffic\":\"128\",\"singleUseDurationSecs\":\"123\"}
],
\"appVersion\":\"1.0.0\",\"deviceId\":\"Device000099\",\"deviceStyle\":\"oppo 1\",\"osType\":\"1.4.0\"}

 

  1.1 创建 logAgg表

  创建 logAgg表,分区表 => year, month, day

    create table logAgg(serverTime string,remoteIp string,clientTime string,status string, json string)
    partitioned by(year string, month string, day string)
    row format delimited
    fields terminated by '#' ;

 

  1.2 load 数据到 logAgg表

  load data local inpath '/home/centos/files/2018-07-01.log' into table logagg partition(year='2018',month='07',day='01');

 

  1.3 降维处理

  1. 代码编写

  ParseJsonUtil.java

 

  ParseEvent.java

 

 

  2. 上传并同步

  先打包再放入 /soft/hive/lib 中

    cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

 

  3. 注册临时函数

    create temporary function parseEvent as 'com.share.udtf.ParseEvent';

 

  4.测试

    select parseEvent(json) from logAgg;

 

  1.4 创建 logEvent表

    create table logevent(deviceId string, createdAtMs string, eventId string, logType string , mark string, musicID string)
    stored as parquet tblproperties('parquet.compression'='GZIP');

 

  1.5 转储

    insert into logevent 
    select parseEvent(json) from logagg
    where year='2018' and month='07' and day='01';

 

  1.6 对 logEvent表进行操作

  1. 计算每个用户对每首歌的评分

select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent
where musicId is not null
group by deviceid, musicid;

 

  2. 计算每个用户对每首歌的评分与最高评分

select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
from (
select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent where musicId is not null group by deviceid, musicid
)a;

 

  3. 使用 sql 计算出,每个用户最喜欢(评分最高)的歌曲,及其评分

select deviceid , musicid, sum from (
select deviceid , musicid, sum, max(sum)over(partition by deviceid) as sum2
from (
select deviceid, musicid, sum(cast(mark as int)) as sum
from logevent where musicId is not null group by deviceid, musicid
)a
)b 
where sum=sum2 ;

 

 


2. 附加内容 

  2.1 Hive 加载 jar 方式

  1. 将 jar 放在 Hive 的 lib 下,并重启 Hive
    /soft/hive/lib

  2. 通过配置文件指定 jar,并重启 Hive
    hive.aux.jars.path=/x/x/x.jar

  3. 临时加载 jar
    hive> add jar /x/x/x.jar

  2.2 注册函数方法

  1. create function as '';    // 永久

  2. create temporary function as '';    // 临时

  3. create function xxx as '' using jar 'hdfs://mycluster/xxx.jar';    // 将 jar 包放入 HDFS 中,避免重启

 


 

posted @ 2019-02-11 12:01  山间一棵松  阅读(381)  评论(0编辑  收藏  举报