【企业流行新数仓】Day02:DWS层(按日分区的宽表)、DWT层(全量累计表)、ADS层、总结
一、DWS层
1、概括
dwd层的数据,每日轻度聚合,建宽表
粒度 | |
---|---|
dws_uv_detail_daycount | 一个设备是一行 |
dws_user_action_daycount(只统计今天登录的会员) | 一个会员是一行 |
dws_sku_action_daycount(只统计被下单或平均或支付或加购或收藏的商品) | 一个商品是一行 |
dws_coupon_use_daycount(只统计未过期的优惠券) | 一个优惠券是一行 |
dws_activity_info_daycount(统计所有活动) | 一个活动是一行 |
dws_sale_detail_daycount(每日购买数据) |
2、dws_uv_detail_daycount(每日设备行为)-一台设备有多个行为,每列的多个行为进行字符串拼接
(1)建表
create external table dws_uv_detail_daycount ( -- 从启动日志dwd_start_log表取以下字段 `mid_id` string COMMENT '设备唯一标识', `user_id` string COMMENT '用户标识', `version_code` string COMMENT '程序版本号', `version_name` string COMMENT '程序版本名', `lang` string COMMENT '系统语言', `source` string COMMENT '渠道号', `os` string COMMENT '安卓系统版本', `area` string COMMENT '区域', `model` string COMMENT '手机型号', `brand` string COMMENT '手机品牌', `sdk_version` string COMMENT 'sdkVersion', `gmail` string COMMENT 'gmail', `height_width` string COMMENT '屏幕宽高', `app_time` string COMMENT '客户端日志产生时的时间', `network` string COMMENT '网络模式', `lng` string COMMENT '经度', `lat` string COMMENT '纬度', -- 从启动日志dwd_start_log表按照mid_id进行聚合,之后count(*)取以下字段 `login_count` bigint COMMENT '活跃次数' )
(2)数据导入
insert overwrite table dws_uv_detail_daycount PARTITION(dt='2020-05-06') select mid_id, concat_ws('|',collect_set(user_id)), concat_ws('|',collect_set(version_code)), concat_ws('|',collect_set(version_name)), concat_ws('|',collect_set(lang)), concat_ws('|',collect_set(source)), concat_ws('|',collect_set(os)), concat_ws('|',collect_set(area)), concat_ws('|',collect_set(model)), concat_ws('|',collect_set(brand)), concat_ws('|',collect_set(sdk_version)), concat_ws('|',collect_set(gmail)), concat_ws('|',collect_set(height_width)), concat_ws('|',collect_set(app_time)), concat_ws('|',collect_set(network)), concat_ws('|',collect_set(lng)), concat_ws('|',collect_set(lat)), count(*) FROM dwd_start_log where dt='2020-05-06' GROUP by mid_id
3、dws_user_action_daycount(每日会员行为)
用户登录、加购、下单、支付次数及金额
导入数据时,类似于建表/视图操作
witht1
as (select user_id,count(*) login_count from dwd_start_log where dt='2020-05-06' and user_id is not NULL GROUP BY user_id),
t3 as (select user_id,count(*) order_count, sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-05-06' GROUP by user_id ),
t4 as (select user_id,count(*) payment_count,sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-05-06' GROUP by user_id),
t2 as (select user_id,count(*) cart_count,sum(cart_price*sku_num) cart_amount from dwd_fact_cart_info where dt='2020-05-06' and date_format(create_time,'yyyy-MM-dd')='2020-05-06' GROUP by user_id )
insert overwrite TABLE dws_user_action_daycount PARTITION(dt='2020-05-06')select t1.user_id,login_count, nvl(cart_count,0), nvl(cart_amount,0), nvl(order_count,0),
nvl(order_amount,0), nvl(payment_count,0), nvl(payment_amount,0)from t1 left join t2 on t1.user_id=t2.user_idleft join t3 on t1.user_id=t3.user_idleft join t4 on t1.user_id=t4.user_id
4、dws_sku_action_daycount(每日商品行为)
被下单次数、被支付、退款、加购、好评、差评次数
with t1 as (select sku_id,count(*) order_count,sum(sku_num) order_num, sum(total_amount) order_amount from dwd_fact_order_detail where dt='2020-05-06' GROUP by sku_id), t2 as (select sku_id, sum(sku_num) payment_num,sum(total_amount) payment_amount, count(*) payment_count from (SELECT order_id,sku_id,sku_num,total_amount from dwd_fact_order_detail where dt='2020-05-06' or dt=date_sub('2020-05-06',1)) tmp1 join (select order_id from dwd_fact_payment_info where dt='2020-05-06') tmp2 on tmp1.order_id=tmp2.order_id GROUP by sku_id), t3 as (SELECT sku_id, count(*) refund_count,sum(refund_num) refund_num, sum(refund_amount) refund_amount from dwd_fact_order_refund_info where dt='2020-05-06' GROUP by sku_id), t4 as (select sku_id, count(*) cart_count,sum(sku_num) cart_num from dwd_fact_cart_info where dt='2020-05-06' and sku_num>0 GROUP by sku_id), t5 as (SELECT sku_id, count(*) favor_count from dwd_fact_favor_info where dt='2020-05-06' and is_cancel=0 group by sku_id), t6 as (SELECT sku_id, sum(if(appraise='1201',1,0)) appraise_good_count, sum(if(appraise='1202',1,0)) appraise_mid_count, sum(if(appraise='1203',1,0)) appraise_bad_count, sum(if(appraise='1204',1,0)) appraise_default_count from dwd_fact_comment_info where dt='2020-05-06' group by sku_id) insert overwrite table dws_sku_action_daycount partition(dt='2020-05-06') SELECT nvl(nvl(nvl(nvl(nvl(t1.sku_id,t2.sku_id),t3.sku_id),t4.sku_id),t5.sku_id),t6.sku_id), nvl(order_count,0), nvl(order_num,0), nvl(order_amount,0), nvl(payment_count,0), nvl(payment_num,0), nvl(payment_amount,0), nvl(refund_count,0), nvl(refund_num,0), nvl(refund_amount,0), nvl(cart_count,0), nvl(cart_num,0), nvl(favor_count,0), nvl(appraise_good_count,0), nvl(appraise_mid_count,0), nvl(appraise_bad_count,0), nvl(appraise_default_count,0) from t1 full join t2 on t1.sku_id=t2.sku_id full join t3 on t1.sku_id=t3.sku_id full join t4 on t1.sku_id=t4.sku_id full join t5 on t1.sku_id=t5.sku_id full join t6 on t1.sku_id=t6.sku_id
5、dws_coupon_use_daycount(每日优惠券使用行为)
范围、商品id、品牌、品类、领用次数、下单次数
insert overwrite table dws_coupon_use_daycount PARTITION(dt='2020-05-06') select t1.id coupon_id,coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, get_count,using_count, used_count from (SELECT * from dwd_dim_coupon_info where dt='2020-05-06' and nvl(expire_time,'9999-99-99') >'2020-05-06') t1 left join (select coupon_id, sum(if(date_format(get_time,'yyyy-MM-dd')='2020-05-06',1,0)) get_count, sum(if(date_format(using_time,'yyyy-MM-dd')='2020-05-06',1,0)) using_count, sum(if(date_format(used_time,'yyyy-MM-dd')='2020-05-06',1,0)) used_count from dwd_fact_coupon_use GROUP by coupon_id) t2 on t1.id=t2.coupon_id
6、dws_activity_info_daycount(每日活动行为)
活动类型、时间、下单、支付次数
with t1 as (select id,activity_name,activity_type, start_time,end_time,create_time from dwd_dim_activity_info where dt='2020-05-06' GROUP by id,activity_name,activity_type, start_time,end_time,create_time), t2 as (select activity_id,count(*) order_count from dwd_fact_order_info where dt='2020-05-06' GROUP by activity_id), t5 as (SELECT activity_id,count(*) payment_count from (SELECT order_id,id from dwd_fact_payment_info where dt='2020-05-06') t3 join (SELECT id,activity_id from dwd_fact_order_info WHERE dt='2020-05-06' or dt=date_sub('2020-05-06',1)) t4 on t3.order_id=t4.id GROUP by activity_id) insert overwrite table dws_activity_info_daycount partition(dt='2020-05-06') SELECT t1.id,activity_name, activity_type, start_time, end_time, create_time, nvl(order_count,0), nvl(payment_count,0) from t1 left join t2 on t1.id=t2.activity_id left join t5 on t1.id=t5.activity_id
7、dws_sale_detail_daycount(每日用户购买商品详情)
用户、商品、sku、购买次数、下单次数、下单金额
GROUP by user_id,sku_id) insert overwrite table dws_sale_detail_daycount PARTITION(dt='2020-05-06') SELECT t7.user_id, t7.sku_id, user_gender, user_age, user_level, order_price, sku_name, sku_tm_id, sku_category3_id, sku_category2_id, sku_category1_id, sku_category3_name, sku_category2_name, sku_category1_name, spu_id, sku_num, order_count, order_amount from (select nvl(t3.user_id,t4.user_id) user_id, nvl(t3.sku_id,t4.sku_id) sku_id, nvl(order_count,0) order_count, nvl(order_amount,0) order_amount, nvl(sku_num,0) sku_num FROM t3 full join t4 on t3.user_id=t4.user_id and t3.sku_id=t4.sku_id) t7 join t1 on t7.user_id=t1.user_id join t2 on t7.sku_id=t2.sku_id
二、DWT层-当前表及DWS表中按日期的汇总(合并及更新)
1、概述
将DWS层每日聚合的数据进行累积
不是分区表,是一个累积型全量表
累积型全量表: ①查询要改动的旧数据 ②查询新增和变化的新数据 ③新旧关联,以新换旧 ④导入覆盖
2、dwt_uv_topic
create external table dwt_uv_topic
用户及设备信息、首次活跃、当日活跃、末次活跃、累计活跃天数
今天未登录的老用户:new.mid_id is null
老用户:old.mid_id is not null
新用户:old.mid_id is null
今天登录的老用户:new.mid_id is not null and old.mid_id is not null
insert overwrite table gmall.dwt_uv_topic select nvl(old.mid_id,new.mid_id), concat_ws('|',old.user_id,new.user_id), concat_ws('|',old.version_code,new.version_code), concat_ws('|',old.version_name,new.version_name), concat_ws('|',old.lang,new.lang), concat_ws('|',old.source,new.source), concat_ws('|',old.os,new.os), concat_ws('|',old.area,new.area), concat_ws('|',old.model,new.model), concat_ws('|',old.brand,new.brand), concat_ws('|',old.sdk_version,new.sdk_version), concat_ws('|',old.gmail,new.gmail), concat_ws('|',old.height_width,new.height_width), concat_ws('|',old.app_time,new.app_time), concat_ws('|',old.network,new.network), concat_ws('|',old.lng,new.lng), concat_ws('|',old.lat,new.lat), nvl(old.login_date_first,'2020-05-06') login_date_first, IF(new.mid_id is null,old.login_date_last,'2020-05-06') login_date_last, nvl(new.login_count,0) login_day_count, nvl(old.login_count,0)+if(new.login_count is not null,1,0) login_count from dwt_uv_topic old full join (select * from dws_uv_detail_daycount where dt='2020-05-06') new on old.mid_id=new.mid_id
3、dwt_user_topic
用户首末次登录、下单时间、天数,累计、最近30天下单支付金额、次数
insert overwrite table dwt_user_topic SELECT t1.user_id,login_date_first, login_date_last, login_count, nvl(login_last_30d_count,0), order_date_first, order_date_last, order_count, order_amount, nvl(order_last_30d_count,0), nvl(order_last_30d_amount,0), payment_date_first, payment_date_last, payment_count, payment_amount, nvl(payment_last_30d_count,0), nvl(payment_last_30d_amount,0) from (SELECT nvl(old.user_id,new.user_id) user_id, nvl(old.login_date_first,'2020-05-06') login_date_first, nvl(old.order_date_first,if(new.order_count>0,'2020-05-06',null)) order_date_first, nvl(old.payment_date_first,if(new.payment_count>0,'2020-05-06',null)) payment_date_first, if(new.user_id is null,old.login_date_last,'2020-05-06') login_date_last, if(new.order_count>0,'2020-05-06',old.order_date_last) order_date_last, if(new.payment_count>0,'2020-05-06',old.payment_date_last) payment_date_last, nvl(old.login_count,0)+if(new.user_id is not null,1,0) login_count, nvl(old.order_count,0)+nvl(new.order_count,0) order_count, nvl(old.order_amount,0)+nvl(new.order_amount,0) order_amount, nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count, nvl(old.payment_amount,0)+nvl(new.payment_amount,0) payment_amount from dwt_user_topic old full join (select * from dws_user_action_daycount where dt='2020-05-06') new on old.user_id=new.user_id) t1 left join ( SELECT user_id, sum(order_count) order_last_30d_count, sum(order_amount) order_last_30d_amount, sum(payment_count) payment_last_30d_count, sum(payment_amount) payment_last_30d_amount, count(*) login_last_30d_count FROM dws_user_action_daycount where dt BETWEEN date_sub('2020-05-06',29) and '2020-05-06' GROUP by user_id) t2 on t1.user_id=t2.user_id
4、dwt_sku_topic
最近30日及累计下单、支付、退款、加购、收藏、好中差评次数
create external table dwt_sku_topic ( sku_id string comment 'sku_id', spu_id string comment 'spu_id', -- 从dws_sku_action_daycount 取,where 30天之前<=dt<=今天,sum() order_last_30d_count bigint comment '最近30日被下单次数',
insert overwrite TABLE dwt_sku_topic SELECT t2.sku_id, t2.spu_id, nvl(order_last_30d_count,0),
5、dwt_coupon_topic
优惠券当日及累计领用、下单、支付次数
insert overwrite table dwt_coupon_topic select nvl(old.coupon_id,new.coupon_id) coupon_id, nvl(new.get_count,0) get_day_count, nvl(new.using_count,0) using_day_count, nvl(new.used_count,0) used_day_count, nvl(old.get_count,0)+nvl(new.get_count,0) get_count, nvl(old.get_count,0)+nvl(new.using_count,0) using_count, nvl(old.get_count,0)+nvl(new.used_count,0) used_count from dwt_coupon_topic old full join (select * from dws_coupon_use_daycount where dt='2020-05-06')new on old.coupon_id=new.coupon_id
6、dwt_activity_topic
活动当日及累计下单、支付次数
insert overwrite table dwt_activity_topic select nvl(old.id,new.id) id, nvl(old.activity_name,new.activity_name) activity_name, nvl(new.order_count,0) order_day_count, nvl(new.payment_count,0) payment_day_count, nvl(old.order_count,0)+nvl(new.order_count,0) order_count, nvl(old.payment_count,0)+nvl(new.payment_count,0) payment_count from dwt_activity_topic old full join (select * from dws_activity_info_daycount where dt='2020-05-06')new on old.id=new.id
三、ADS层
1、概述
将需求根据要查询的数据源进行分类
同一类需求创建一张表进行统计
创建的表均为全量表
2、构造数据
将集群的时间,调整到要导入数据的前一天
上传jar包
启动采集通道,启动hive
执行脚本
3、设备主题
(1)活跃设备数(日、周、月)
从dws_uv_daycount 或 dwt_uv_topic 表取数据
日活、周活、月活【字段:是否是周末或月末】:至少活跃一次
create external table ads_uv_count( `dt` string COMMENT '统计日期', // 从dws层取当天的,也可以从dwt层取 `day_count` bigint COMMENT '当日用户数量', // 从dws层取当周的,也可以从dwt层取 `wk_count` bigint COMMENT '当周用户数量', // 从dws层取当月的,也可以从dwt层取 `mn_count` bigint COMMENT '当月用户数量', // 借助next_day() `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果', // 借助last_day() `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果' ) COMMENT '活跃设备数'
insert into table ads_uv_count SELECT '2020-05-06',day_count,wk_count,mn_count, if('2020-05-06'=date_sub(next_day('2020-05-06','MO'),1),'Y','N') is_weekend, if('2020-05-06'=last_day('2020-05-06'),'Y','N') is_monthend from (SELECT '2020-05-06' dt,count(*) day_count
(2)每日新增设备, login_date_first=今天
insert into ads_new_mid_count SELECT '2020-05-06' create_date, count(*) new_mid_count FROM dwt_uv_topic where login_date_first='2020-05-06';
(3)沉默用户数
只在安装当天启动过: login_date_first='当天'= login_date_last
启动时间是在7天前: login_date_last< 今天的7天前
insert into table ads_silent_count SELECT '2020-05-06', count(*) from dwt_uv_topic where login_date_first=login_date_last and login_date_last<date_sub('2020-05-06',7)
(4)本周回流用户数
本周登录过的,没在上周登录过的老用户数
本周活跃与上周活跃,两个结果集使用left join后取差集(本周活跃但上周不活跃):
on t1.mid_id=t2.mid_id
where t2.mid_id is null
login_date_last >= date_sub(next_day('2020-05-06','MO'),7)
(5)流失用户数:连续7天未活跃的设备
login_date_last<date_sub('2020-05-06',7)
(6)留存率:留存用户 占 某天新增用户的 比率
某天新增的用户中,在n天后继续使用的用户称为留存用户
①某一天新增的人数
②留存的天数,留存的日期=新增的天数+留存的天数
③取留存日期当天的留存人数
(7)最近连续三周活跃用户数
用户在这三周中,都至少需要出现一次
(8)最近七天内连续三天活跃用户数
4、会员主题
(1)会员信息
用户新鲜率、活跃率、付费率
cast(sum(if(login_date_last='2020-05-19',1,0)) / count(*) * 100 as decimal(10,2)) day_users2users
(2)转化率
访问/加购,加购/下单,下单/支付
cast( sum(if(payment_count>0,1,0)) / sum(if(order_count>0,1,0)) * 100 as decimal(10,2))
5、商品主题
(1)商品个数信息-各个商品的种类数
insert into table ads_product_info SELECT '2020-05-19' dt, count(*) sku_num, count(DISTINCT spu_id) spu_num from dwt_sku_topic
(2)商品累积销量排名
FROM dwt_sku_topic where payment_num>0 order by payment_num desc limit 10
(3)商品收藏排名
(4)加入购物车排名
(5)最近30天退款率
(6)差评率排名
6、营销主题
(1)每日下单统计ads_order_daycount
(2)每日支付统计ads_payment_daycount
(3)品牌的月复购率ads_sale_tm_category1_stat_mn
单次、多次复购率
四、总结
1、数据来源
2、各层数据的来源于导入
建模 | 如何导入数据 | 备注 | |
---|---|---|---|
hdfs | 采用lzo压缩的格式 | ||
ODS | 原数据有几个字段是什么类型,就怎么建模 | 必须指定ODS的表使用能够读取LZO压缩格式的输入格式,为LZO格式创建索引 | |
用户行为DWD | 用户行为数据根据不同类型数据的字段明细,进行建模 | 启动日志: get_json_object 事件日志: 自定义UDF,UDTF,将事件日志中的每个事件,解析到一个base_event表中,再使用get_json_object展开事件明细。 | |
业务数据DWD | 维度表:维度退化,将多个同一类型维度的字段合并到一张表中。事实表:采取星型模型,基于3w原则,按照选取业务线---确认粒度---选取维度---选取度量进行建模 | 维度表:多表Join 事实表:选择一张事实表作为主表,通过外键关联维度表,选取维度字段。再选取度量! | |
事务型事实表:选取ods层某一天分区的数据,再关联维度表,选取维度字段,再选取度量! | |||
周期型快照事实表:直接从ODS层全量导入(加入购物车,收藏表) | |||
累积型快照事实表: 按照事实发生最初的事件作为分区字段!①选择要覆盖的老的分区的所有数据②选取今日新增和变化的新数据③新旧交替,以新换旧④覆盖到指定的分区 | |||
拉链表(缓慢变化维度):old left join new ,将old中过期的数据的end_date修改为new中start_date的前一天。 再union all new。导入到临时表,再导入到原表 | |||
dws层 | 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表 | 取dwd层每日最新的分区,进行多表关联 | |
dwt层 | 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(设备,用户,商品,优惠券,活动,购买行为),创建宽表 | dwt full join dws 当日分区的数据①新旧交替,以新换旧②覆盖原表 | |
ads | 紧紧贴合需求。将同一类型的需求,汇总,分类,以某个需求的统计目标为主题(用户,商品,会员,营销), | 取某一天的历史切片数据,从dws层取,如果要取当前的数据或累计状态,从dwt层取 | |
导出mysql | update_mode: allowinsert update-key: dt |
本文来自博客园,作者:哥们要飞,转载请注明原文链接:https://www.cnblogs.com/liujinhui/p/15718787.html
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· Manus重磅发布:全球首款通用AI代理技术深度解析与实战指南
· 被坑几百块钱后,我竟然真的恢复了删除的微信聊天记录!
· 没有Manus邀请码?试试免邀请码的MGX或者开源的OpenManus吧
· 园子的第一款AI主题卫衣上架——"HELLO! HOW CAN I ASSIST YOU TODAY
· 【自荐】一款简洁、开源的在线白板工具 Drawnix