Pig

 

http://www.codelast.com/?p=4249

http://www.klshu.com/656.html

 

 

 

 
 
 
 
o_data = LOAD 'hdfs://hadoop/user/xuting/analy/userDevtmp' using PigStorage('`') AS (userId,userMac,Time_Stamp,reqUserAgent,device,time);
 
 
pig
cd hdfs:///user/ xuting/analy/userDevtmp


data2015 = FILTER o_data BY time MATCHES  '2015.*';
因为我们自己添加了两个字段   设备号和时间   这个是肯定没有问题的   你可以匹配最后一个时间的字段  如果不满足‘2015.*’这样滤掉数据

data_head_error = FILTER data2015 BY userId MATCHES  '^A';
 
 

汇总
 B = GROUP data all;
 C = FOREACH B GENERATE COUNT(data);
 DUMP C;
 
 B = GROUP data all;
C = FOREACH B GENERATE COUNT_STAR(data);
 DUMP C;

 B = GROUP data by userId;
 C = FOREACH B GENERATE  group, COUNT(data);
 

空值
item = FOREACH data GENERATE reqUserAgent;
A = FILTER item BY reqUserAgent is null;
B = GROUP A all;
C = FOREACH B GENERATE COUNT_STAR(A) ;
dump C;

Missing
item = FOREACH data GENERATE reqUserAgent;
A = FILTER item BY reqUserAgent =='Missing';
B = GROUP A all;
C = FOREACH B GENERATE COUNT_STAR(A) ;
dump C;

异常
item = FOREACH data GENERATE reqUserAgent;
A = FILTER item BY reqUserAgent =='-';
B = GROUP A all;
C = FOREACH B GENERATE COUNT_STAR(A) ;
dump C;

匹配
item = FOREACH data GENERATE userId;
A = FILTER data BY NOT userId MATCHES '^U[0-9a-zA-Z]{47}$';
B = GROUP A all;
C = FOREACH B GENERATE COUNT_STAR(A) ;
dump C;
 

 
1. userId : 以U开头,后面接47个[0-9a-zA-Z]范围内的字符
'^U[0-9a-zA-Z]{47}$'
 
2. userMac : '^[0-9a-zA-Z]{2}:[0-9a-zA-Z]{2}:[0-9a-zA-Z]{2}:[0-9a-zA-Z]{2}:[0-9a-zA-Z]{2}:[0-9a-zA-Z]{2}$'
 
3. Time_Stamp : [0-9]范围内的数字共10位
'^[0-9]{10}$'

4. reqUserAgent 这个数据的异常情况先不用分析
数据样例:
Dalvik/1.6.0 (Linux; U; Android 4.3; ZTE Q805T Build/Q805TV1.0.0B02)
Mozilla/5.0 (Linux; U; Android 4.4.4; zh-cn; YL-Coolpad 8297-T01 Build/KVT49L) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
 
 
 
 
 
 
 
 

 

posted @ 2015-11-14 10:17  Man_华  阅读(235)  评论(0编辑  收藏  举报