Hive 求出连续三天有销售记录的店铺
原始数据
A,2017-10-11,300
A,2017-10-12,200
A,2017-10-13,100
A,2017-10-15,100
A,2017-10-16,300
A,2017-10-17,150
A,2017-10-18,340
A,2017-10-19,360
B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600
C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600
求出连续三天有销售记录的店铺,给每个用户一个编号,用日期减去编号,如果是同一天,那么就是连续的 。
A,2017-10-11,300,1,2017-10-10
A,2017-10-12,200,2,2017-10-10
A,2017-10-13,100,3,2017-10-10
A,2017-10-15,100,4,2017-10-11
A,2017-10-16,300,5,2017-10-11
A,2017-10-17,150,6,2017-10-11
A,2017-10-18,340,7,2017-10-11
A,2017-10-19,360,8,2017-10-11
B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600
C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600
step1 建表 并加载数据
hive (default)> drop table if exists t_jd;
OK
Time taken: 8.41 seconds
hive (default)> create table t_jd(shopid string,dt string,sale int)
> row format delimited fields terminated by ',';
OK
Time taken: 0.119 seconds
hive (default)> load data local inpath '/opt/module/data/sales.txt' into table t_jd;
Loading data to table default.t_jd
Table default.t_jd stats: [numFiles=1, totalSize=363]
OK
Time taken: 1.811 seconds
下面在hive中执行的代码 省略了日志
step2 每一条记录按照店铺id分区 按照日期全局排序 row_number()编号
hive (default)> select shopid,dt,sale,
> row_number() over(partition by shopid order by dt) as rn
> from t_jd;
shopid dt sale rn
A 2017-10-11 300 1
A 2017-10-12 200 2
A 2017-10-13 100 3
A 2017-10-15 100 4
A 2017-10-16 300 5
A 2017-10-17 150 6
A 2017-10-18 340 7
A 2017-10-19 360 8
B 2017-10-11 400 1
B 2017-10-12 200 2
B 2017-10-15 600 3
C 2017-10-11 350 1
C 2017-10-13 250 2
C 2017-10-14 300 3
C 2017-10-15 400 4
C 2017-10-16 200 5
D 2017-10-13 500 1
D 2017-10-14 600 2
E 2017-10-14 600 1
E 2017-10-15 500 2
step3 根据编号 生成连续日期
hive (default)> select shopid,dt,sale,rn,
> date_sub(to_date(dt),rn)
> from
> (select shopid,dt,sale,
> row_number() over(partition by shopid order by dt) as rn
> from t_jd) tmp;
shopid dt sale rn _c4
A 2017-10-11 300 1 2017-10-10
A 2017-10-12 200 2 2017-10-10
A 2017-10-13 100 3 2017-10-10
A 2017-10-15 100 4 2017-10-11
A 2017-10-16 300 5 2017-10-11
A 2017-10-17 150 6 2017-10-11
A 2017-10-18 340 7 2017-10-11
A 2017-10-19 360 8 2017-10-11
B 2017-10-11 400 1 2017-10-10
B 2017-10-12 200 2 2017-10-10
B 2017-10-15 600 3 2017-10-12
C 2017-10-11 350 1 2017-10-10
C 2017-10-13 250 2 2017-10-11
C 2017-10-14 300 3 2017-10-11
C 2017-10-15 400 4 2017-10-11
C 2017-10-16 200 5 2017-10-11
D 2017-10-13 500 1 2017-10-12
D 2017-10-14 600 2 2017-10-12
E 2017-10-14 600 1 2017-10-13
E 2017-10-15 500 2 2017-10-13
step4 分组 求count
hive (default)> select shopid,count(1) as cnt
> from
> (select shopid,dt,sale,rn,
> date_sub(to_date(dt),rn) as flag
> from
> (select shopid,dt,sale,
> row_number() over(partition by shopid order by dt) as rn
> from t_jd) tmp) tmp2
> group by shopid,flag;
shopid cnt
A 3
A 5
B 2
B 1
C 1
C 4
D 2
E 2
step5 筛选出连续天数大于等于3的
hive (default)> select shopid from
> (select shopid,count(1) as cnt
> from
> (select shopid,dt,sale,rn,
> date_sub(to_date(dt),rn) as flag
> from
> (select shopid,dt,sale,
> row_number() over(partition by shopid order by dt) as rn
> from t_jd) tmp) tmp2
> group by shopid,flag) tmp3
> where tmp3.cnt>=3;
shopid
A
A
C
step6 去重
hive (default)> select distinct shopid from
> (select shopid,count(1) as cnt
> from
> (select shopid,dt,sale,rn,
> date_sub(to_date(dt),rn) as flag
> from
> (select shopid,dt,sale,
> row_number() over(partition by shopid order by dt) as rn
> from t_jd) tmp) tmp2
> group by shopid,flag) tmp3
> where tmp3.cnt>=3;
shopid
A
C