Hive练习--蚂蚁森林习题二
问题:蚂蚁森林低碳用户排名分析
查询user_low_carbon表中每日流水记录,条件为:
用户在2017年,连续三天(或以上)的天数里,每天减少碳排放(low_carbon)都超过100g的用户低碳流水。
需要查询返回满足以上条件的user_low_carbon表中的记录流水。
例如用户u_002符合条件的记录如下,因为2017/1/2~2017/1/5连续四天的碳排放量之和都大于等于100g:
seq(key) user_id data_dt low_carbon xxxxx10 u_002 2017/1/2 150 xxxxx11 u_002 2017/1/2 70 xxxxx12 u_002 2017/1/3 30 xxxxx13 u_002 2017/1/3 80 xxxxx14 u_002 2017/1/4 150 xxxxx14 u_002 2017/1/5 101 备注:统计方法不限于sql、procedure、python,java等
提供的数据说明:
user_low_carbon: u_001 2017/1/1 10 u_001 2017/1/2 150 u_001 2017/1/2 110 u_001 2017/1/2 10 u_001 2017/1/4 50 ...
一、准备工作
创建表和导入数据,这在之前已经做好了,详情看练习一。表结构具体如下所示:
user_low_carbon(user_id String,data_dt String,low_carbon int) plant_carbon(plant_id string,plant_name String,low_carbon int)
二、解决步骤
1、将2017年,每个用户在一天内减少碳排放量大于100g的日期过滤出来
select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100; a1
# 输出结果
user_id data_dt u_001 2017-01-02 u_001 2017-01-06 u_002 2017-01-02 u_002 2017-01-03 u_002 2017-01-04 u_002 2017-01-05 u_003 2017-01-02 u_003 2017-01-03 u_003 2017-01-05 u_003 2017-01-07 u_004 2017-01-01 ……
2、对于日期这一列的数据,将每一条记录的前两行和后两行在这一列的数据放置在当前行
select user_id,data_dt,lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1, lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2, lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1, lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2 from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) a1;a2
输出结果:
user_id data_dt lag1 lag2 lead1 lead2 u_001 2017-01-02 1970-01-01 1970-01-01 2017-01-06 1970-01-01 u_001 2017-01-06 2017-01-02 1970-01-01 1970-01-01 1970-01-01 u_002 2017-01-02 1970-01-01 1970-01-01 2017-01-03 2017-01-04 u_002 2017-01-03 2017-01-02 1970-01-01 2017-01-04 2017-01-05 u_002 2017-01-04 2017-01-03 2017-01-02 2017-01-05 1970-01-01 u_002 2017-01-05 2017-01-04 2017-01-03 1970-01-01 1970-01-01 u_003 2017-01-02 1970-01-01 1970-01-01 2017-01-03 2017-01-05 u_003 2017-01-03 2017-01-02 1970-01-01 2017-01-05 2017-01-07 u_003 2017-01-05 2017-01-03 2017-01-02 2017-01-07 1970-01-01 u_003 2017-01-07 2017-01-05 2017-01-03 1970-01-01 1970-01-01 u_004 2017-01-01 1970-01-01 1970-01-01 2017-01-03 2017-01-06 ………………………………
3、计算当前日期和前两条记录日期的差值,当前日期和后两条记录的差值
select user_id,data_dt,datediff(data_dt,lag1) d_lag_1,datediff(data_dt,lag2) d_lag_2,datediff(data_dt,lead1) d_lead_1,datediff(data_dt,lead2) d_lead_2 from (select user_id,data_dt,lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1, lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2, lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1, lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2 from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) a1) a2; a3
输出结果:
user_id data_dt d_lag_1 d_lag_2 d_lead_1 d_lead_2 u_001 2017-01-02 17168 17168 -4 17168 u_001 2017-01-06 4 17172 17172 17172 u_002 2017-01-02 17168 17168 -1 -2 u_002 2017-01-03 1 17169 -1 -2 u_002 2017-01-04 1 2 -1 17170 u_002 2017-01-05 1 2 17171 17171 u_003 2017-01-02 17168 17168 -1 -3 u_003 2017-01-03 1 17169 -2 -4 u_003 2017-01-05 2 3 -2 17171 u_003 2017-01-07 2 4 17173 17173 u_004 2017-01-01 17167 17167 -2 -5 u_004 2017-01-03 2 17169 -3 -4 u_004 2017-01-06 3 5 -1 17172 u_004 2017-01-07 1 4 17173 17173 u_005 2017-01-02 17168 17168 -1 -2 u_005 2017-01-03 1 17169 -1 -3 …………………………
4、过滤出连续三天减碳量超过100g的用户
select user_id,data_dt from a3 where (d_lag_1=1 and d_lag_2=2) or (d_lead_1=-1 and d_lead_2=-2) or (d_lead_1=-1 and d_lag_1=1); a4 select user_id,data_dt from (select user_id,data_dt,datediff(data_dt,lag1) d_lag_1,datediff(data_dt,lag2) d_lag_2,datediff(data_dt,lead1) d_lead_1,datediff(data_dt,lead2) d_lead_2 from (select user_id,data_dt,lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1, lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2, lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1, lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2 from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) a1) a2) a3 where (d_lag_1=1 and d_lag_2=2) or (d_lead_1=-1 and d_lead_2=-2) or (d_lead_1=-1 and d_lag_1=1);
输出结果:
user_id data_dt u_002 2017-01-02 u_002 2017-01-03 u_002 2017-01-04 u_002 2017-01-05 u_005 2017-01-02 u_005 2017-01-03 u_005 2017-01-04 u_008 2017-01-04 u_008 2017-01-05 u_008 2017-01-06 u_008 2017-01-07 u_009 2017-01-02 u_009 2017-01-03 u_009 2017-01-04 u_010 2017-01-04 …… ……
5、和原表进行内连接
select users.user_id,users.data_dt,users.low_carbon from (select user_id,data_dt from (select user_id,data_dt,datediff(data_dt,lag1) d_lag_1,datediff(data_dt,lag2) d_lag_2,datediff(data_dt,lead1) d_lead_1,datediff(data_dt,lead2) d_lead_2 from (select user_id,data_dt,lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1, lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2, lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1, lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2 from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) a1) a2) a3 where (d_lag_1=1 and d_lag_2=2) or (d_lead_1=-1 and d_lead_2=-2) or (d_lead_1=-1 and d_lag_1=1)) a join user_low_carbon users where users.user_id=a.user_id and a.data_dt=date_format(regexp_replace(users.data_dt,'/','-'),'yyyy-MM-dd');
输出结果:
users.user_id users.data_dt users.low_carbon u_002 2017/1/2 150 u_002 2017/1/2 70 u_002 2017/1/3 30 u_002 2017/1/3 80 u_002 2017/1/4 150 u_002 2017/1/5 101 u_005 2017/1/2 50 u_005 2017/1/2 80 u_005 2017/1/3 180 u_005 2017/1/4 180 u_005 2017/1/4 10 u_008 2017/1/4 260 u_008 2017/1/5 360 u_008 2017/1/6 160 …… ……
三、第二种解法
上面的解法对于求连续三天的情况还是可以的,但是如果想求连续十天的情况,那么过滤条件就会写的十分的复杂。利用一个等差数列和另一个等差数列,对应位置元素相减,那么每一个位置的一对元素的差值都是相等的这一规律来实现我们的第二种解法。
1、将2017年,每个用户在一天内减少碳排放量大于100g的日期过滤出来
select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100; a1
2、对数据按照用户id进行分区,在分区内部按照日期进行排序,然后给每个分区的每一条数据一个标记,这个标记就是rank函数产生的值,对于区内按照日期排序后的数据而言,如果日期是连续的,就会对应的标记值是两个对应的等差数列,那么二者相减,得到的一个结果日期就会是相等,如果一个分区内,相减之后得到结果日期值相同的记录数量多于或者等于三个,那么就是连续三天减碳量超过100g的用户:
select user_id,data_dt,rank() over(partition by user_id order by data_dt) rk from t1;t2 select user_id,data_dt,rank() over(partition by user_id order by data_dt) rk from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) t1;t2
输出结果:
user_id data_dt rk u_001 2017-01-02 1 u_001 2017-01-06 2 u_002 2017-01-02 1 u_002 2017-01-03 2 u_002 2017-01-04 3 u_002 2017-01-05 4 u_003 2017-01-02 1 u_003 2017-01-03 2 u_003 2017-01-05 3 u_003 2017-01-07 4 u_004 2017-01-01 1 u_004 2017-01-03 2 u_004 2017-01-06 3 u_004 2017-01-07 4 u_005 2017-01-02 1 u_005 2017-01-03 2 …… …… ……
3、将日期减去当前的rank值
select user_id,data_dt,data_sub(data_dt,rk) data_sub_rk from t2;t3 select user_id,data_dt,date_sub(data_dt,rk) data_sub_rk from (select user_id,data_dt,rank() over(partition by user_id order by data_dt) rk from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) t1) t2;t3
# 输出结果
user_id data_dt data_sub_rk u_001 2017-01-02 2017-01-01 u_001 2017-01-06 2017-01-04 u_002 2017-01-02 2017-01-01 u_002 2017-01-03 2017-01-01 u_002 2017-01-04 2017-01-01 u_002 2017-01-05 2017-01-01 u_003 2017-01-02 2017-01-01 u_003 2017-01-03 2017-01-01 u_003 2017-01-05 2017-01-02 u_003 2017-01-07 2017-01-03 u_004 2017-01-01 2016-12-31 u_004 2017-01-03 2017-01-01 u_004 2017-01-06 2017-01-03 u_004 2017-01-07 2017-01-03 …… …… ……
4、过滤出连续三天减碳量超过100g的用户
select user_id from t3 group by user_id,data_sub_rk having count(*)>=3; select user_id from (select user_id,data_dt,date_sub(data_dt,rk) data_sub_rk from (select user_id,data_dt,rank() over(partition by user_id order by data_dt) rk from (select user_id,date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt from user_low_carbon where substring(data_dt,1,4)="2017" group by user_id,data_dt having sum(low_carbon)>100) t1) t2) t3 group by user_id,data_sub_rk having count(*)>=3;