数据仓库| 1.4 ODS& DWD& DWS&DWT& ADS
数仓分层搭建导入数据
1.ODS层
① 用户行为数据日志表
创建输入支持lzo压缩、输出是text,支持Json解析解析的分区表
hive (gmall)>
drop table if exists ods_log;
CREATE EXTERNAL TABLE ods_log (`line` string)
PARTITIONED BY (`dt` string) -- 按照时间创建分区
STORED AS -- 指定存储方式,读数据采用LzoTextInputFormat;
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_log' -- 指定数据在hdfs上的存储位置
;
说明Hive的LZO压缩:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
加载数据
hive (gmall)>
load data inpath '/origin_data/gmall/log/topic_log/2020-06-14' into table ods_log partition(dt='2020-06-14');
注意:时间格式都配置成YYYY-MM-DD格式,这是Hive默认支持的时间格式
为lzo压缩文件创建索引
[kris@hadoop101 bin]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
-Dmapreduce.job.queuename=hive \
/warehouse/gmall/ods/ods_log/dt=2020-06-14
在hadoop101的/home/kris/bin目录下创建脚本: vim hdfs_to_ods_log.sh
#!/bin/bash # 定义变量方便修改 APP=gmall hive=/opt/module/hive/bin/hive hadoop=/opt/module/hadoop-3.1.3/bin/hadoop # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi echo ================== 日志日期为 $do_date ================== sql=" load data inpath '/origin_data/$APP/log/topic_log/$do_date' into table ${APP}.ods_log partition(dt='$do_date'); " $hive -e "$sql" $hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer -Dmapreduce.job.queuename=hive /warehouse/$APP/ods/ods_log/dt=$do_date
(1)说明1:
[ -n 变量值 ] 判断变量的值,是否为空
-- 变量的值,非空,返回true
-- 变量的值,为空,返回false
注意:[ -n 变量值 ]不会解析数据,使用[ -n 变量值 ]时,需要对变量加上双引号(" ")
(2)说明2:
查看date命令的使用,date --help
3)脚本使用
[kris@hadoop101 module]$ hdfs_to_ods_log.sh 2020-06-15
② 业务数据
表信息及同步方式(增量、增量及更新、新增及变化、全量、特殊)如下
订单详情表( 增量 )ods_order_detail
支付流水表( 增量 ) ods_payment_info
订单状态表( 增量 ) ods_order_status_log
商品评论表( 增量 )ods_comment_info
退单表( 增量 ) ods_order_refund_info
活动订单关联表( 增量 ) ods_activity_order
订单表( 增量及更新 )ods_order_info
用户表( 增量及更新 )ods_user_info
优惠券领用表( 新增及变化 ) ods_coupon_use
SKU商品表( 全量 )ods_sku_info
商品一级分类表( 全量 )ods_base_category1
商品二级分类表( 全量 )ods_base_category2
商品三级分类表( 全量 )ods_base_category3
品牌表( 全量 )ods_base_trademark
SPU商品表( 全量 ) ods_spu_info
加购表( 全量 ) ods_cart_info
商品收藏表( 全量 ) ods_favor_info
优惠券表( 全量 ) ods_coupon_info
活动表( 全量 ) ods_activity_info
优惠规则表( 全量 ) ods_activity_rule
编码字典表( 全量 )ods_base_dic
省份表( 特殊 ) ods_base_province
地区表( 特殊 ) ods_base_region
订单表建表语句如下,其他类似:
drop table if exists ods_order_info;
create external table ods_order_info (
`id` string COMMENT '订单号',
`final_total_amount` decimal(16,2) COMMENT '订单金额',
`order_status` string COMMENT '订单状态',
`user_id` string COMMENT '用户id',
`out_trade_no` string COMMENT '支付流水号',
`create_time` string COMMENT '创建时间',
`operate_time` string COMMENT '操作时间',
`province_id` string COMMENT '省份ID',
`benefit_reduce_amount` decimal(16,2) COMMENT '优惠金额',
`original_total_amount` decimal(16,2) COMMENT '原价金额',
`feight_fee` decimal(16,2) COMMENT '运费'
) COMMENT '订单表'
PARTITIONED BY (`dt` string) -- 按照时间创建分区
row format delimited fields terminated by '\t' -- 指定分割符为\t
STORED AS -- 指定存储方式,读数据采用LzoTextInputFormat;输出数据采用TextOutputFormat
INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/warehouse/gmall/ods/ods_order_info/' -- 指定数据在hdfs上的存储位置
;
ODS层加载数据脚本,在/home/kris/bin/目录下创建hdfs_to_ods_db.sh,并赋予执行权限。
#!/bin/bash APP=gmall hive=/opt/module/hive/bin/hive # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$2" ] ;then do_date=$2 else do_date=`date -d "-1 day" +%F` fi sql1=" load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table ${APP}.ods_order_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table ${APP}.ods_order_detail partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table ${APP}.ods_sku_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table ${APP}.ods_user_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table ${APP}.ods_payment_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table ${APP}.ods_base_category1 partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table ${APP}.ods_base_category2 partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table ${APP}.ods_base_category3 partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/base_trademark/$do_date' OVERWRITE into table ${APP}.ods_base_trademark partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/activity_info/$do_date' OVERWRITE into table ${APP}.ods_activity_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/activity_order/$do_date' OVERWRITE into table ${APP}.ods_activity_order partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/cart_info/$do_date' OVERWRITE into table ${APP}.ods_cart_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/comment_info/$do_date' OVERWRITE into table ${APP}.ods_comment_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/coupon_info/$do_date' OVERWRITE into table ${APP}.ods_coupon_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/coupon_use/$do_date' OVERWRITE into table ${APP}.ods_coupon_use partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/favor_info/$do_date' OVERWRITE into table ${APP}.ods_favor_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/order_refund_info/$do_date' OVERWRITE into table ${APP}.ods_order_refund_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/order_status_log/$do_date' OVERWRITE into table ${APP}.ods_order_status_log partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/spu_info/$do_date' OVERWRITE into table ${APP}.ods_spu_info partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/activity_rule/$do_date' OVERWRITE into table ${APP}.ods_activity_rule partition(dt='$do_date'); load data inpath '/origin_data/$APP/db/base_dic/$do_date' OVERWRITE into table ${APP}.ods_base_dic partition(dt='$do_date'); " sql2=" load data inpath '/origin_data/$APP/db/base_province/$do_date' OVERWRITE into table ${APP}.ods_base_province; load data inpath '/origin_data/$APP/db/base_region/$do_date' OVERWRITE into table ${APP}.ods_base_region; " case $1 in "first"){ $hive -e "$sql1$sql2" };; "all"){ $hive -e "$sql1" };; esac
脚本的使用如下:
1)初次导入
初次导入时,脚本的第一个参数应为first,线上环境不传第二个参数,自动获取前一天日期
[kris@hadoop101 bin]$ hdfs_to_ods_db.sh first 2020-06-14
2)每日导入
每日重复导入,脚本的第一个参数应为all,线上环境不传第二个参数,自动获取前一天日期。
[kris@hadoop101 bin]$ hdfs_to_ods_db.sh all 2020-06-15
2.DWD层
1)对用户行为数据解析。
2)对核心数据进行判空过滤。
3)对业务数据采用维度模型重新建模。
对ODS层数据进行判空过滤。对商品分类表进行维度退化(降维)。
1)维度退化要付出什么代价?或者说会造成什么样的需求处理不了?
如果被退化的维度,还有其他业务表使用,退化后处理起来就麻烦些。
2)想想在实际业务中还有那些维度表可以退化
城市的三级分类(省、市、县)等
用户行为日志解析
① 启动日志表解析
启动日志解析思路:启动日志表中每行数据对应一个启动记录,一个启动记录应该包含日志中的公共信息+启动信息+时间。先将所有包含start字段的日志过滤出来,然后使用get_json_object函数解析每个字段。
1) 建表:
drop table if exists dwd_start_log; CREATE EXTERNAL TABLE dwd_start_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `entry` string COMMENT ' icon手机图标 notice 通知 install 安装后启动', `loading_time` bigint COMMENT '启动加载时间', `open_ad_id` string COMMENT '广告页ID ', `open_ad_ms` bigint COMMENT '广告总共播放时间', `open_ad_skip_ms` bigint COMMENT '用户跳过广告时点', `ts` bigint COMMENT '时间' ) COMMENT '启动日志表' PARTITIONED BY (dt string) -- 按照时间创建分区 stored as parquet -- 采用parquet列式存储 LOCATION '/warehouse/gmall/dwd/dwd_start_log' -- 指定在HDFS上存储位置 TBLPROPERTIES('parquet.compression'='lzo') -- 采用LZO压缩 ; 说明:数据采用parquet存储方式,是可以支持切片的,不需要再对数据创建索引。如果单纯的text方式存储数据,需要采用支持切片的,lzop压缩方式并创建索引。
2) 数据导入 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_start_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.ts') from ods_log where dt='2020-06-14' and get_json_object(line,'$.start') is not null;
Hive读取索引文件问题
select * 和 select count(*) 两次查询结果不一致。
原因是select * from ods_log不执行MR操作,默认采用的是ods_log建表语句中指定的DeprecatedLzoTextInputFormat,能够识别lzo.index为索引文件。
select count(*) from ods_log执行MR操作,默认采用的是CombineHiveInputFormat,不能识别lzo.index为索引文件,将索引文件当做普通文件处理。更严重的是,这会导致LZO文件无法切片。
hive (gmall)> set hive.input.format; // 查看Hive默认的InputFormat
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
解决办法:修改CombineHiveInputFormat为 HiveInputFormat
② 页面日志表
页面日志解析思路:页面日志表中每行数据对应一个页面访问记录,一个页面访问记录应该包含日志中的公共信息和页面信息。先将所有包含page字段的日志过滤出来,然后使用
get_json_object函数解析每个字段。
建表并 导入数据
---建表 drop table if exists dwd_page_log; CREATE EXTERNAL TABLE dwd_page_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `during_time` bigint COMMENT '持续时间毫秒', `page_item` string COMMENT '目标id ', `page_item_type` string COMMENT '目标类型', `last_page_id` string COMMENT '上页类型', `page_id` string COMMENT '页面ID ', `source_type` string COMMENT '来源类型', `ts` bigint ) COMMENT '页面日志表' PARTITIONED BY (dt string) stored as parquet LOCATION '/warehouse/gmall/dwd/dwd_page_log' TBLPROPERTIES('parquet.compression'='lzo'); --导入数据 SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_page_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts') from ods_log where dt='2020-06-14' and get_json_object(line,'$.page') is not null;
③ 动作日志表
动作日志解析思路:动作日志表中每行数据对应用户的一个动作记录,一个动作记录应当包含公共信息、页面信息以及动作信息。先将包含action字段的日志过滤出来,然后通过
UDTF函数,将action数组“炸开”(类似于explode函数的效果),然后使用get_json_object函数解析每个字段。
--建表
drop table if exists dwd_action_log;
CREATE EXTERNAL TABLE dwd_action_log(
`area_code` string COMMENT '地区编码',
`brand` string COMMENT '手机品牌',
`channel` string COMMENT '渠道',
`model` string COMMENT '手机型号',
`mid_id` string COMMENT '设备id',
`os` string COMMENT '操作系统',
`user_id` string COMMENT '会员id',
`version_code` string COMMENT 'app版本号',
`during_time` bigint COMMENT '持续时间毫秒',
`page_item` string COMMENT '目标id ',
`page_item_type` string COMMENT '目标类型',
`last_page_id` string COMMENT '上页类型',
`page_id` string COMMENT '页面id ',
`source_type` string COMMENT '来源类型',
`action_id` string COMMENT '动作id',
`item` string COMMENT '目标id ',
`item_type` string COMMENT '目标类型',
`ts` bigint COMMENT '时间'
) COMMENT '动作日志表'
PARTITIONED BY (dt string)
stored as parquet
LOCATION '/warehouse/gmall/dwd/dwd_action_log'
TBLPROPERTIES('parquet.compression'='lzo');
--创建UDTF函数,如下:
UDTF函数的编写及创建Hive永久函数。
比如输入数据: get_json_object(line,'$.actions')
[ {"action_id":"favor_add","item":"2","item_type":"sku_id","ts":1592123787234},
{"action_id":"cart_add","item":"2","item_type":"sku_id","ts":1592123789884} ]
输出:
{"action_id":"favor_add","item":"2","item_type":"sku_id","ts":1592123787234}
{"action_id":"cart_add","item":"2","item_type":"sku_id","ts":1592123789884}
LATERAL VIEW udtf(expression) tableAlias AS columnAlias
解释:用于和split, explode等UDTF一起使用,它能够将一列数据拆成多行数据,在此基础上可以对拆分后的数据进行聚合。
--引入依赖 <dependencies> <!--添加hive依赖--> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>3.1.2</version> </dependency> </dependencies> ---代码如下 import org.apache.hadoop.hive.ql.exec.UDFArgumentException; import org.apache.hadoop.hive.ql.metadata.HiveException; import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF; import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory; import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector; import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory; import org.json.JSONArray; import java.util.ArrayList; import java.util.List; public class ExplodeJSONArray extends GenericUDTF { @Override public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException { // 1 参数合法性检查 if (argOIs.getAllStructFieldRefs().size() != 1){ throw new UDFArgumentException("ExplodeJSONArray 只需要一个参数"); } // 2 第一个参数必须为string if(!"string".equals(argOIs.getAllStructFieldRefs().get(0).getFieldObjectInspector().getTypeName())){ throw new UDFArgumentException("json_array_to_struct_array的第1个参数应为string类型"); } // 3 定义返回值名称和类型 List<String> fieldNames = new ArrayList<String>(); List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(); fieldNames.add("items"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector); return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs); } public void process(Object[] objects) throws HiveException { // 1 获取传入的数据 String jsonArray = objects[0].toString(); // 2 将string转换为json数组 JSONArray actions = new JSONArray(jsonArray); // 3 循环一次,取出数组中的一个json,并写出 for (int i = 0; i < actions.length(); i++) { String[] result = new String[1]; result[0] = actions.getString(i); forward(result); } } public void close() throws HiveException { } } ---创建函数 打包将hivefunction-1.0-SNAPSHOT.jar上传到hadoop101的/opt/module,然后再将该jar包上传到HDFS的/user/hive/jars路径下 [kirs@hadoop101 module]$ hadoop fs -mkdir -p /user/hive/jars [kirs@hadoop101 module]$ hadoop fs -put hivefunction-1.0-SNAPSHOT.jar /user/hive/jars 创建永久函数与开发好的java class关联 hive (gmall)> create function explode_json_array as 'com.atguigu.hive.udtf.ExplodeJSONArray' using jar 'hdfs://hadoop101:8020/user/hive/jars/hivefunction-1.0-SNAPSHOT.jar'; 注意:如果修改了自定义函数重新生成jar包怎么处理?只需要替换HDFS路径上的旧jar包,然后重启Hive客户端即可。
--导入数据
hive (gmall)>
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
insert overwrite table dwd_action_log partition(dt='2020-06-14')
select
get_json_object(line,'$.common.ar'),
get_json_object(line,'$.common.ba'),
get_json_object(line,'$.common.ch'),
get_json_object(line,'$.common.md'),
get_json_object(line,'$.common.mid'),
get_json_object(line,'$.common.os'),
get_json_object(line,'$.common.uid'),
get_json_object(line,'$.common.vc'),
get_json_object(line,'$.page.during_time'),
get_json_object(line,'$.page.item'),
get_json_object(line,'$.page.item_type'),
get_json_object(line,'$.page.last_page_id'),
get_json_object(line,'$.page.page_id'),
get_json_object(line,'$.page.sourceType'),
get_json_object(action,'$.action_id'),
get_json_object(action,'$.item'),
get_json_object(action,'$.item_type'),
get_json_object(action,'$.ts')
from ods_log lateral view explode_json_array(get_json_object(line,'$.actions')) tmp as action
where dt='2020-06-14'
and get_json_object(line,'$.actions') is not null;
④ 曝光日志表
曝光日志解析思路:曝光日志表中每行数据对应一个曝光记录,一个曝光记录应当包含公共信息、页面信息以及曝光信息。先将包含display字段的日志过滤出来,然后通过UDTF函数,将display数组“炸开”(类
似于explode函数的效果),然后使用get_json_object函数解析每个字段。
建表并导入数据
drop table if exists dwd_display_log; CREATE EXTERNAL TABLE dwd_display_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `during_time` bigint COMMENT 'app版本号', `page_item` string COMMENT '目标id ', `page_item_type` string COMMENT '目标类型', `last_page_id` string COMMENT '上页类型', `page_id` string COMMENT '页面ID ', `source_type` string COMMENT '来源类型', `ts` bigint COMMENT 'app版本号', `display_type` string COMMENT '曝光类型', `item` string COMMENT '曝光对象id ', `item_type` string COMMENT 'app版本号', `order` bigint COMMENT '出现顺序' ) COMMENT '曝光日志表' PARTITIONED BY (dt string) stored as parquet LOCATION '/warehouse/gmall/dwd/dwd_display_log' TBLPROPERTIES('parquet.compression'='lzo'); 数据导入 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_display_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts'), get_json_object(display,'$.displayType'), get_json_object(display,'$.item'), get_json_object(display,'$.item_type'), get_json_object(display,'$.order') from ods_log lateral view explode_json_array(get_json_object(line,'$.displays')) tmp as display where dt='2020-06-14' and get_json_object(line,'$.displays') is not null;
⑤ 错误日志表
错误日志解析思路:错误日志表中每行数据对应一个错误记录,为方便定位错误,一个错误记录应当包含与之对应的公共信息、页面信息、曝光信息、动作信息、启动信息以及错误信息。先将包含err字段的日志
过滤出来,然后使用get_json_object函数解析所有字段。
建表并导入数据
drop table if exists dwd_error_log; CREATE EXTERNAL TABLE dwd_error_log( `area_code` string COMMENT '地区编码', `brand` string COMMENT '手机品牌', `channel` string COMMENT '渠道', `model` string COMMENT '手机型号', `mid_id` string COMMENT '设备id', `os` string COMMENT '操作系统', `user_id` string COMMENT '会员id', `version_code` string COMMENT 'app版本号', `page_item` string COMMENT '目标id ', `page_item_type` string COMMENT '目标类型', `last_page_id` string COMMENT '上页类型', `page_id` string COMMENT '页面ID ', `source_type` string COMMENT '来源类型', `entry` string COMMENT ' icon手机图标 notice 通知 install 安装后启动', `loading_time` string COMMENT '启动加载时间', `open_ad_id` string COMMENT '广告页ID ', `open_ad_ms` string COMMENT '广告总共播放时间', `open_ad_skip_ms` string COMMENT '用户跳过广告时点', `actions` string COMMENT '动作', `displays` string COMMENT '曝光', `ts` string COMMENT '时间', `error_code` string COMMENT '错误码', `msg` string COMMENT '错误信息' ) COMMENT '错误日志表' PARTITIONED BY (dt string) stored as parquet LOCATION '/warehouse/gmall/dwd/dwd_error_log' TBLPROPERTIES('parquet.compression'='lzo'); 说明:此处为对动作数组和曝光数组做处理,如需分析错误与单个动作或曝光的关联,可先使用explode_json_array函数将数组“炸开”,再使用get_json_object函数获取具体字段。 数据导入 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_error_log partition(dt='2020-06-14') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.actions'), get_json_object(line,'$.displays'), get_json_object(line,'$.ts'), get_json_object(line,'$.err.error_code'), get_json_object(line,'$.err.msg') from ods_log where dt='2020-06-14' and get_json_object(line,'$.err') is not null;
DWD层用户行为数据加载脚本
1)在hadoop101的/home/kris/bin目录下创建脚本 vim ods_to_dwd_log.sh 并赋予执行权限。
#!/bin/bash hive=/opt/module/hive/bin/hive APP=gmall # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" SET mapreduce.job.queuename=hive; SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_start_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.start') is not null; insert overwrite table ${APP}.dwd_action_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(action,'$.action_id'), get_json_object(action,'$.item'), get_json_object(action,'$.item_type'), get_json_object(action,'$.ts') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.actions')) tmp as action where dt='$do_date' and get_json_object(line,'$.actions') is not null; insert overwrite table ${APP}.dwd_display_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts'), get_json_object(display,'$.displayType'), get_json_object(display,'$.item'), get_json_object(display,'$.item_type'), get_json_object(display,'$.order') from ${APP}.ods_log lateral view ${APP}.explode_json_array(get_json_object(line,'$.displays')) tmp as display where dt='$do_date' and get_json_object(line,'$.displays') is not null; insert overwrite table ${APP}.dwd_page_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.during_time'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.ts') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.page') is not null; insert overwrite table ${APP}.dwd_error_log partition(dt='$do_date') select get_json_object(line,'$.common.ar'), get_json_object(line,'$.common.ba'), get_json_object(line,'$.common.ch'), get_json_object(line,'$.common.md'), get_json_object(line,'$.common.mid'), get_json_object(line,'$.common.os'), get_json_object(line,'$.common.uid'), get_json_object(line,'$.common.vc'), get_json_object(line,'$.page.item'), get_json_object(line,'$.page.item_type'), get_json_object(line,'$.page.last_page_id'), get_json_object(line,'$.page.page_id'), get_json_object(line,'$.page.sourceType'), get_json_object(line,'$.start.entry'), get_json_object(line,'$.start.loading_time'), get_json_object(line,'$.start.open_ad_id'), get_json_object(line,'$.start.open_ad_ms'), get_json_object(line,'$.start.open_ad_skip_ms'), get_json_object(line,'$.actions'), get_json_object(line,'$.displays'), get_json_object(line,'$.ts'), get_json_object(line,'$.err.error_code'), get_json_object(line,'$.err.msg') from ${APP}.ods_log where dt='$do_date' and get_json_object(line,'$.err') is not null; " $hive -e "$sql"
业务数据
业务数据方面DWD层的搭建主要注意点在于维度的退化,减少后续大量Join操作。
维度表| 商品、优惠券、活动、地区、时间等
① 商品维度表(全量)
商品维度表主要是将商品表SKU表、商品一级分类、商品二级分类、商品三级分类、商品品牌表和商品SPU表退化为商品表。
建表并导入数据
DROP TABLE IF EXISTS `dwd_dim_sku_info`; CREATE EXTERNAL TABLE `dwd_dim_sku_info` ( `id` string COMMENT '商品id', `spu_id` string COMMENT 'spuid', `price` decimal(16,2) COMMENT '商品价格', `sku_name` string COMMENT '商品名称', `sku_desc` string COMMENT '商品描述', `weight` decimal(16,2) COMMENT '重量', `tm_id` string COMMENT '品牌id', `tm_name` string COMMENT '品牌名称', `category3_id` string COMMENT '三级分类id', `category2_id` string COMMENT '二级分类id', `category1_id` string COMMENT '一级分类id', `category3_name` string COMMENT '三级分类名称', `category2_name` string COMMENT '二级分类名称', `category1_name` string COMMENT '一级分类名称', `spu_name` string COMMENT 'spu名称', `create_time` string COMMENT '创建时间' ) COMMENT '商品维度表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_dim_sku_info/' tblproperties ("parquet.compression"="lzo"); SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --商品维度表 dwd_dim_sku_info insert overwrite table dwd_dim_sku_info partition(dt='2020-06-14') select sku.id, sku.spu_id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.tm_id, ob.tm_name, sku.category3_id, c2.id category2_id, c1.id category1_id, c3.name category3_name, c2.name category2_name, c1.name category1_name, spu.spu_name, sku.create_time from ( --sku商品表 ods_sku_info select * from ods_sku_info where dt='2020-06-14' )sku join ( --品牌表 ods_base_trademark select * from ods_base_trademark where dt='2020-06-14' )ob on sku.tm_id=ob.tm_id join ( --spu商品表 ods_spu_info select * from ods_spu_info where dt='2020-06-14' )spu on spu.id = sku.spu_id join ( --商品三级分类表 ods_base_category3 select * from ods_base_category3 where dt='2020-06-14' )c3 on sku.category3_id=c3.id join ( --商品二级分类表 ods_base_category2 select * from ods_base_category2 where dt='2020-06-14' )c2 on c3.category2_id=c2.id join ( --商品一级分类表 ods_base_category1 select * from ods_base_category1 where dt='2020-06-14' )c1 on c2.category1_id=c1.id;
② 优惠券维度表(全量)
把ODS层ods_coupon_info表数据导入到DWD层优惠卷维度表,在导入过程中可以做适当的清洗。
建表并加载数据:
drop table if exists dwd_dim_coupon_info; create external table dwd_dim_coupon_info( `id` string COMMENT '购物券编号', `coupon_name` string COMMENT '购物券名称', `coupon_type` string COMMENT '购物券类型 1 现金券 2 折扣券 3 满减券 4 满件打折券', `condition_amount` decimal(16,2) COMMENT '满额数', `condition_num` bigint COMMENT '满件数', `activity_id` string COMMENT '活动编号', `benefit_amount` decimal(16,2) COMMENT '减金额', `benefit_discount` decimal(16,2) COMMENT '折扣', `create_time` string COMMENT '创建时间', `range_type` string COMMENT '范围类型 1、商品 2、品类 3、品牌', `spu_id` string COMMENT '商品id', `tm_id` string COMMENT '品牌id', `category3_id` string COMMENT '品类id', `limit_num` bigint COMMENT '最多领用次数', `operate_time` string COMMENT '修改时间', `expire_time` string COMMENT '过期时间' ) COMMENT '优惠券维度表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_dim_coupon_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --优惠券维度表 insert overwrite table dwd_dim_coupon_info partition(dt='2020-06-14') select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, operate_time, expire_time from ods_coupon_info --优惠券表 where dt='2020-06-14';
③ 活动维度表(全量)
drop table if exists dwd_dim_activity_info; create external table dwd_dim_activity_info( `id` string COMMENT '编号', `activity_name` string COMMENT '活动名称', `activity_type` string COMMENT '活动类型', `start_time` string COMMENT '开始时间', `end_time` string COMMENT '结束时间', `create_time` string COMMENT '创建时间' ) COMMENT '活动信息表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_dim_activity_info/' tblproperties ("parquet.compression"="lzo"); --数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --活动维度表 insert overwrite table dwd_dim_activity_info partition(dt='2020-06-14') select id, activity_name, activity_type, start_time, end_time, create_time from ods_activity_info --活动表 where dt='2020-06-14';
④ 地区维度表(特殊)
DROP TABLE IF EXISTS `dwd_dim_base_province`; CREATE EXTERNAL TABLE `dwd_dim_base_province` ( `id` string COMMENT 'id', `province_name` string COMMENT '省市名称', `area_code` string COMMENT '地区编码', `iso_code` string COMMENT 'ISO编码', `region_id` string COMMENT '地区id', `region_name` string COMMENT '地区名称' ) COMMENT '地区维度表' stored as parquet location '/warehouse/gmall/dwd/dwd_dim_base_province/' tblproperties ("parquet.compression"="lzo"); --数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --地区维度表 insert overwrite table dwd_dim_base_province select bp.id, bp.name, bp.area_code, bp.iso_code, bp.region_id, br.region_name from ( --省份表 select * from ods_base_province ) bp join ( --地区表 select * from ods_base_region ) br on bp.region_id = br.id;
⑤ 时间维度表(特殊)
1)建表 DROP TABLE IF EXISTS `gmall.dwd_dim_date_info`; CREATE EXTERNAL TABLE `gmall.dwd_dim_date_info`( `date_id` string COMMENT '日', `week_id` string COMMENT '周', `week_day` string COMMENT '周的第几天', `day` string COMMENT '每月的第几天', `month` string COMMENT '第几月', `quarter` string COMMENT '第几季度', `year` string COMMENT '年', `is_workday` string COMMENT '是否是周末', `holiday_id` string COMMENT '是否是节假日' ) COMMENT '时间维度表' stored as parquet location '/warehouse/gmall/dwd/dwd_dim_date_info/' tblproperties ("parquet.compression"="lzo"); 2)把date_info.txt文件上传到hadoop101的/opt/module/db_log/路径 3)数据装载 注意:由于dwd_dim_date_info是列式存储+LZO压缩。直接将date_info.txt文件导入到目标表,并不会直接转换为列式存储+LZO压缩。我们需要创建一张普通的临时表dwd_dim_date_info_tmp,将date_info.txt加载到该临时表中。最后通过查询临时表数据,把查询到的数据插入到最终的目标表中。 Ⅰ 创建临时表,非列式存储 DROP TABLE IF EXISTS `gmall.dwd_dim_date_info_tmp`; CREATE EXTERNAL TABLE `gmall.dwd_dim_date_info_tmp`( `date_id` string COMMENT '日', `week_id` string COMMENT '周', `week_day` string COMMENT '周的第几天', `day` string COMMENT '每月的第几天', `month` string COMMENT '第几月', `quarter` string COMMENT '第几季度', `year` string COMMENT '年', `is_workday` string COMMENT '是否是周末', `holiday_id` string COMMENT '是否是节假日' ) COMMENT '时间临时表' row format delimited fields terminated by '\t' location '/warehouse/gmall/dwd/dwd_dim_date_info_tmp/'; Ⅱ 将数据导入临时表 load data local inpath '/opt/module/db_log/date_info.txt' into table gmall.dwd_dim_date_info_tmp; Ⅲ 将数据导入正式表 insert overwrite table gmall.dwd_dim_date_info select * from gmall.dwd_dim_date_info_tmp;
事实表| 事务型、周期型快照、累计型快照
① 支付事实表(事务型事实表)
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
支付 |
√ |
√ |
√ |
|
|
金额 |
drop table if exists dwd_fact_payment_info; create external table dwd_fact_payment_info ( `id` string COMMENT 'id', `out_trade_no` string COMMENT '对外业务编号', `order_id` string COMMENT '订单编号', `user_id` string COMMENT '用户编号', `alipay_trade_no` string COMMENT '支付宝交易流水编号', `payment_amount` decimal(16,2) COMMENT '支付金额', `subject` string COMMENT '交易内容', `payment_type` string COMMENT '支付类型', `payment_time` string COMMENT '支付时间', `province_id` string COMMENT '省份ID' ) COMMENT '支付事实表表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_payment_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --支付事实表(事务型事实表) insert overwrite table dwd_fact_payment_info partition(dt='2020-06-14') select pi.id, pi.out_trade_no, pi.order_id, pi.user_id, pi.alipay_trade_no, pi.total_amount, pi.subject, pi.payment_type, pi.payment_time, oi.province_id from ( --支付流水表 select * from ods_payment_info where dt='2020-06-14' )pi join ( --订单表 select id, province_id from ods_order_info where dt='2020-06-14' )oi on pi.order_id = oi.id;
② 退款事实表(事务型事实表)
把ODS层ods_order_refund_info表数据导入到DWD层退款事实表,在导入过程中可以做适当的清洗。
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
退款 |
√ |
√ |
|
√ |
|
|
件数/金额 |
drop table if exists dwd_fact_order_refund_info; create external table dwd_fact_order_refund_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户ID', `order_id` string COMMENT '订单ID', `sku_id` string COMMENT '商品ID', `refund_type` string COMMENT '退款类型', `refund_num` bigint COMMENT '退款件数', `refund_amount` decimal(16,2) COMMENT '退款金额', `refund_reason_type` string COMMENT '退款原因类型', `create_time` string COMMENT '退款时间' ) COMMENT '退款事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_order_refund_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_order_refund_info partition(dt='2020-06-14') select id, user_id, order_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, create_time from ods_order_refund_info where dt='2020-06-14';
③ 评价事实表(事务型事实表)
把ODS层ods_comment_info表数据导入到DWD层评价事实表,在导入过程中可以做适当的清洗。
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
评价 |
√ |
√ |
|
√ |
|
|
个数 |
drop table if exists dwd_fact_comment_info; create external table dwd_fact_comment_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户ID', `sku_id` string COMMENT '商品sku', `spu_id` string COMMENT '商品spu', `order_id` string COMMENT '订单ID', `appraise` string COMMENT '评价', `create_time` string COMMENT '评价时间' ) COMMENT '评价事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_comment_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_comment_info partition(dt='2020-06-14') select id, user_id, sku_id, spu_id, order_id, appraise, create_time from ods_comment_info where dt='2020-06-14';
④ 订单明细事实表(事务型事实表)
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
订单详情 |
√ |
√ |
√ |
√ |
|
|
件数/金额
|
drop table if exists dwd_fact_order_detail; create external table dwd_fact_order_detail ( `id` string COMMENT '订单编号', `order_id` string COMMENT '订单号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'sku商品id', `sku_name` string COMMENT '商品名称', `order_price` decimal(16,2) COMMENT '商品价格', `sku_num` bigint COMMENT '商品数量', `create_time` string COMMENT '创建时间', `province_id` string COMMENT '省份ID', `source_type` string COMMENT '来源类型', `source_id` string COMMENT '来源编号', `original_amount_d` decimal(20,2) COMMENT '原始价格分摊', `final_amount_d` decimal(20,2) COMMENT '购买价格分摊', `feight_fee_d` decimal(20,2) COMMENT '分摊运费', `benefit_reduce_amount_d` decimal(20,2) COMMENT '分摊优惠' ) COMMENT '订单明细事实表表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_order_detail/' tblproperties ("parquet.compression"="lzo"); --数据装载 SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; --订单明细事实表(事务型事实表) insert overwrite table dwd_fact_order_detail partition(dt='2020-06-14') select id, order_id, user_id, sku_id, sku_name, order_price, sku_num, create_time, province_id, source_type, source_id, original_amount_d, if(rn=1,final_total_amount -(sum_div_final_amount - final_amount_d),final_amount_d), if(rn=1,feight_fee - (sum_div_feight_fee - feight_fee_d),feight_fee_d), if(rn=1,benefit_reduce_amount - (sum_div_benefit_reduce_amount -benefit_reduce_amount_d), benefit_reduce_amount_d) from ( select od.id, od.order_id, od.user_id, od.sku_id, od.sku_name, od.order_price, od.sku_num, od.create_time, oi.province_id, od.source_type, od.source_id, round(od.order_price*od.sku_num,2) original_amount_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.final_total_amount,2) final_amount_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.feight_fee,2) feight_fee_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.benefit_reduce_amount,2) benefit_reduce_amount_d, row_number() over(partition by od.order_id order by od.id desc) rn, oi.final_total_amount, --订单金额 oi.feight_fee, --运费 oi.benefit_reduce_amount,--优惠金额 sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.final_total_amount,2)) over(partition by od.order_id) sum_div_final_amount, sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.feight_fee,2)) over(partition by od.order_id) sum_div_feight_fee, sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.benefit_reduce_amount,2)) over(partition by od.order_id) sum_div_benefit_reduce_amount from ( --订单详情表 select * from ods_order_detail where dt='2020-06-14' ) od join ( --订单表 select * from ods_order_info where dt='2020-06-14' ) oi on od.order_id=oi.id )t1;
⑤ 加购事实表(周期型快照事实表,每日快照)
由于购物车的数量是会发生变化,所以导增量不合适。
每天做一次快照,导入的数据是全量,区别于事务型事实表是每天导入新增。
周期型快照事实表劣势:存储的数据量会比较大。
解决方案:周期型快照事实表存储的数据比较讲究时效性,时间太久了的意义不大,可以删除以前的数据。
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
加购 |
√ |
√ |
|
√ |
|
|
件数/金额 |
drop table if exists dwd_fact_cart_info; create external table dwd_fact_cart_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'skuid', `cart_price` string COMMENT '放入购物车时价格', `sku_num` string COMMENT '数量', `sku_name` string COMMENT 'sku名称 (冗余)', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '修改时间', `is_ordered` string COMMENT '是否已经下单。1为已下单;0为未下单', `order_time` string COMMENT '下单时间', `source_type` string COMMENT '来源类型', `srouce_id` string COMMENT '来源编号' ) COMMENT '加购事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_cart_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_cart_info partition(dt='2020-06-14') select id, user_id, sku_id, cart_price, sku_num, sku_name, create_time, operate_time, is_ordered, order_time, source_type, source_id from ods_cart_info where dt='2020-06-14';
⑥ 收藏事实表(周期型快照事实表,每日快照)
收藏的标记,是否取消,会发生变化,做增量不合适。
每天做一次快照,导入的数据是全量,区别于事务型事实表是每天导入新增。
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
收藏 |
√ |
√ |
|
√ |
|
|
个数 |
drop table if exists dwd_fact_favor_info; create external table dwd_fact_favor_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'skuid', `spu_id` string COMMENT 'spuid', `is_cancel` string COMMENT '是否取消', `create_time` string COMMENT '收藏时间', `cancel_time` string COMMENT '取消时间' ) COMMENT '收藏事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_favor_info/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_favor_info partition(dt='2020-06-14') select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time from ods_favor_info where dt='2020-
⑦ 优惠券领用事实表(累积型快照事实表)
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
优惠券领用 |
√ |
√ |
|
√ |
|
|
个数 |
优惠卷的生命周期:领取优惠卷-》用优惠卷下单-》优惠卷参与支付
累积型快照事实表使用:统计优惠卷领取次数、优惠卷下单次数、优惠卷参与支付次数
drop table if exists dwd_fact_coupon_use; create external table dwd_fact_coupon_use( `id` string COMMENT '编号', `coupon_id` string COMMENT '优惠券ID', `user_id` string COMMENT 'userid', `order_id` string COMMENT '订单id', `coupon_status` string COMMENT '优惠券状态', `get_time` string COMMENT '领取时间', `using_time` string COMMENT '使用时间(下单)', `used_time` string COMMENT '使用时间(支付)' ) COMMENT '优惠券领用事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_coupon_use/' tblproperties ("parquet.compression"="lzo"); 注意:dt是按照优惠卷领用时间get_time做为分区。 --加载数据 hive (gmall)> set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_coupon_use partition(dt) select if(new.id is null,old.id,new.id), if(new.coupon_id is null,old.coupon_id,new.coupon_id), if(new.user_id is null,old.user_id,new.user_id), if(new.order_id is null,old.order_id,new.order_id), if(new.coupon_status is null,old.coupon_status,new.coupon_status), if(new.get_time is null,old.get_time,new.get_time), if(new.using_time is null,old.using_time,new.using_time), if(new.used_time is null,old.used_time,new.used_time), date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd') from ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from dwd_fact_coupon_use where dt in ( select date_format(get_time,'yyyy-MM-dd') from ods_coupon_use where dt='2020-06-14' ) )old full outer join ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from ods_coupon_use where dt='2020-06-14' )new on old.id=new.id;
⑧ 订单事实表(累积型快照事实表)
时间 |
用户 |
地区 |
商品 |
优惠券 |
活动 |
编码 |
度量值 |
|
订单 |
√ |
√ |
√ |
√ |
|
件数/金额 |
订单生命周期:创建时间=》支付时间=》取消时间=》完成时间=》退款时间=》退款完成时间。
由于ODS层订单表只有创建时间和操作时间两个状态,不能表达所有时间含义,所以需要关联订单状态表。订单事实表里面增加了活动id,所以需要关联活动订单表。
drop table if exists dwd_fact_order_info; create external table dwd_fact_order_info ( `id` string COMMENT '订单编号', `order_status` string COMMENT '订单状态', `user_id` string COMMENT '用户id', `out_trade_no` string COMMENT '支付流水号', `create_time` string COMMENT '创建时间(未支付状态)', `payment_time` string COMMENT '支付时间(已支付状态)', `cancel_time` string COMMENT '取消时间(已取消状态)', `finish_time` string COMMENT '完成时间(已完成状态)', `refund_time` string COMMENT '退款时间(退款中状态)', `refund_finish_time` string COMMENT '退款完成时间(退款完成状态)', `province_id` string COMMENT '省份ID', `activity_id` string COMMENT '活动ID', `original_total_amount` decimal(16,2) COMMENT '原价金额', `benefit_reduce_amount` decimal(16,2) COMMENT '优惠金额', `feight_fee` decimal(16,2) COMMENT '运费', `final_total_amount` decimal(16,2) COMMENT '订单金额' ) COMMENT '订单事实表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dwd/dwd_fact_order_info/' tblproperties ("parquet.compression"="lzo"); --加载数据 hive (gmall)> set hive.exec.dynamic.partition.mode=nonstrict; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_fact_order_info partition(dt) select if(new.id is null,old.id,new.id), if(new.order_status is null,old.order_status,new.order_status), if(new.user_id is null,old.user_id,new.user_id), if(new.out_trade_no is null,old.out_trade_no,new.out_trade_no), if(new.tms['1001'] is null,old.create_time,new.tms['1001']),--1001对应未支付状态 if(new.tms['1002'] is null,old.payment_time,new.tms['1002']), if(new.tms['1003'] is null,old.cancel_time,new.tms['1003']), if(new.tms['1004'] is null,old.finish_time,new.tms['1004']), if(new.tms['1005'] is null,old.refund_time,new.tms['1005']), if(new.tms['1006'] is null,old.refund_finish_time,new.tms['1006']), if(new.province_id is null,old.province_id,new.province_id), if(new.activity_id is null,old.activity_id,new.activity_id), if(new.original_total_amount is null,old.original_total_amount,new.original_total_amount), if(new.benefit_reduce_amount is null,old.benefit_reduce_amount,new.benefit_reduce_amount), if(new.feight_fee is null,old.feight_fee,new.feight_fee), if(new.final_total_amount is null,old.final_total_amount,new.final_total_amount), date_format(if(new.tms['1001'] is null,old.create_time,new.tms['1001']),'yyyy-MM-dd') from ( select id, order_status, user_id, out_trade_no, create_time, payment_time, cancel_time, finish_time, refund_time, refund_finish_time, province_id, activity_id, original_total_amount, benefit_reduce_amount, feight_fee, final_total_amount from dwd_fact_order_info where dt in ( select date_format(create_time,'yyyy-MM-dd') from ods_order_info where dt='2020-06-14' ) )old full outer join ( select info.id, info.order_status, info.user_id, info.out_trade_no, info.province_id, act.activity_id, log.tms, info.original_total_amount, info.benefit_reduce_amount, info.feight_fee, info.final_total_amount from ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms from ods_order_status_log --订单状态表 where dt='2020-06-14' group by order_id )log join ( --订单表 select * from ods_order_info where dt='2020-06-14' )info on log.order_id=info.id left join ( --活动订单关联表 select * from ods_activity_order where dt='2020-06-14' )act on log.order_id=act.order_id )new on old.id=new.id;
拉链表| 用户维度
用户维度表(拉链表)
用户表中的数据每日既有可能新增,也有可能修改,但修改频率并不高,属于缓慢变化维度,此处采用拉链表存储用户维度数据。
订单表新增及变化,判断这两个条件创建时间create time 和操作时间operation time,同理订单表也可以做成拉链表的形式。
拉链表,是记录每条信息的生命周期,一旦一条记录的生命周期结束,就重新开始一条新的记录,并把当前日期放入生效开始日期。
如果当前信息至今有效,在生效结束日期中填入一个极大值(如9999-99-99)。
拉链表适用于:数据会发生变化,但是大部分是不变的。(即缓慢变化维)
比如:用户信息会发生变化,但是每天变化的比例不高。如果数据量有一定规模,按照每日全量的方式保存效率很低,比如 1亿用户 * 365天,每天一份用户信息。(做每日全量效率低)
拉链表的使用:
通过,生效开始日期start_date <= 某个日期 且 生效结束日期end_date,能够得到某个时间点的数据全量切片。
拉链表的形成过程
拉链表制作流程
拉链表的制作过程
步骤1:初始化拉链表(首次独立执行),建立并初始化拉链表
(1)建立拉链表 hive (gmall)> drop table if exists dwd_dim_user_info_his; create external table dwd_dim_user_info_his( `id` string COMMENT '用户id', `name` string COMMENT '姓名', `birthday` string COMMENT '生日', `gender` string COMMENT '性别', `email` string COMMENT '邮箱', `user_level` string COMMENT '用户等级', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间', `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期' ) COMMENT '用户拉链表' stored as parquet location '/warehouse/gmall/dwd/dwd_dim_user_info_his/' tblproperties ("parquet.compression"="lzo"); (2)初始化拉链表 hive (gmall)> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_dim_user_info_his select id, name, birthday, gender, email, user_level, create_time, operate_time, '2020-06-14', '9999-99-99' from ods_user_info oi where oi.dt='2020-06-14';
步骤2:制作当日变动数据(包括新增,修改)每日执行
(1)如何获得每日变动表
- a.最好表内有创建时间和变动时间(Lucky!)
- b.如果没有,可以利用第三方工具监控比如canal,监控MySQL的实时变化进行记录(麻烦)。
- c.逐行对比前后两天的数据,检查md5(concat(全部有可能变化的字段))是否相同(low)
- d.要求业务数据库提供变动流水(人品,颜值)
(2)因为ods_user_info本身导入过来就是新增变动明细的表,所以不用处理
- a)数据库中新增2020-06-15一天的数据
- b)通过Sqoop把2020-06-15日所有数据导入 mysql_to_hdfs.sh all 2020-06-15
- c)ods层数据导入hdfs_to_ods_db.sh all 2020-06-15
步骤3:先合并变动信息,再追加新增信息,插入到临时表中。建立临时表并导入数据
1)建立临时表 hive (gmall)> drop table if exists dwd_dim_user_info_his_tmp; create external table dwd_dim_user_info_his_tmp( `id` string COMMENT '用户id', `name` string COMMENT '姓名', `birthday` string COMMENT '生日', `gender` string COMMENT '性别', `email` string COMMENT '邮箱', `user_level` string COMMENT '用户等级', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间', `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期' ) COMMENT '订单拉链临时表' stored as parquet location '/warehouse/gmall/dwd/dwd_dim_user_info_his_tmp/' tblproperties ("parquet.compression"="lzo"); 导入数据 SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table dwd_dim_user_info_his_tmp select * from ( select id, name, birthday, gender, email, user_level, create_time, operate_time, '2020-06-15' start_date, '9999-99-99' end_date from ods_user_info where dt='2020-06-15' union all select uh.id, uh.name, uh.birthday, uh.gender, uh.email, uh.user_level, uh.create_time, uh.operate_time, uh.start_date, if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1), uh.end_date) end_date from dwd_dim_user_info_his uh left join ( select * from ods_user_info where dt='2020-06-15' ) ui on uh.id=ui.id )his order by his.id, start_date;
步骤4:把临时表覆盖给拉链表,导入数据
hive (gmall)>insert overwrite table dwd_dim_user_info_his
select * from dwd_dim_user_info_his_tmp;
订单表的拉链表:
1)初始化拉链表(首次独立执行) hive (gmall)> drop table if exists dwd_order_info_his; create table dwd_order_info_his( `id` string COMMENT '订单编号', `total_amount` decimal(10,2) COMMENT '订单金额', `order_status` string COMMENT '订单状态', `user_id` string COMMENT '用户id' , `payment_way` string COMMENT '支付方式', `out_trade_no` string COMMENT '支付流水号', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间' , `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期' ) COMMENT '订单拉链表' stored as parquet location '/warehouse/gmall/dwd/dwd_order_info_his/' tblproperties ("parquet.compression"="snappy"); 就在原来基础上添加两个字段:start_date, end_date insert overwrite table gmall.dwd_order_info_his select id, total_amount, order_status, user_id, payment_way, out_trade_no, create_time, operate_time, '2019-02-13', '9999-99-99' from ods_order_info oi where oi.dt='2019-02-13'; 步骤1:制作当日变动数据(包括新增,修改)每日执行 步骤2:先合并变动信息,再追加新增信息,插入到临时表中 1)建立临时表 drop table if exists dwd_order_info_his_tmp; create external table dwd_order_info_his_tmp( `id` string COMMENT '订单编号', `total_amount` decimal(10,2) COMMENT '订单金额', `order_status` string COMMENT '订单状态', `user_id` string COMMENT '用户id' , `payment_way` string COMMENT '支付方式', `out_trade_no` string COMMENT '支付流水号', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间', `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期' ) COMMENT '订单拉链临时表' stored as parquet location '/warehouse/gmall/dwd/dwd_order_info_his_tmp/' tblproperties ("parquet.compression"="snappy"); insert overwrite table dwd_order_info_his_tmp select * from( select id, total_amount, order_status, user_id, payment_way, out_trade_no, create_time, operate_time, '2019-02-14' start_date, '2019-99-99' end_date from dwd_order_info where dt='2019-02-14' union all select oh.id, oh.total_amount, oh.order_status, oh.user_id, oh.payment_way, oh.out_trade_no, oh.create_time, oh.operate_time, oh.start_date, if(oi.id is null, oh.end_date, date_add(oi.dt, -1)) end_date ##没匹配上即不为空就还是原来的 from dwd_order_info_his oh left join( select * from dwd_order_info where dt="2019-02-14" )oi on oh.id = oi.id and oh.end_date="9999-99-99" ##要改的就是它 )his order by his.id, start_date; left join下,能join上id(不为空)匹配上之后,判断结束的生效日期是否是9999-99-99,把匹配上的日期当前日期-1 步骤3:把临时表覆盖给拉链表 overwrite回原来的状态; 一般1个月拉链1次; insert overwrite table dwd_order_info_his select * from dwd_order_info_his_tmp;
DWD层业务数据导入脚本
在/home/kris/bin目录下创建脚本ods_to_dwd_db.sh
#!/bin/bash APP=gmall hive=/opt/module/hive/bin/hive # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$2" ] ;then do_date=$2 else do_date=`date -d "-1 day" +%F` fi sql1=" set mapreduce.job.queuename=hive; set hive.exec.dynamic.partition.mode=nonstrict; SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; insert overwrite table ${APP}.dwd_dim_sku_info partition(dt='$do_date') select sku.id, sku.spu_id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.tm_id, ob.tm_name, sku.category3_id, c2.id category2_id, c1.id category1_id, c3.name category3_name, c2.name category2_name, c1.name category1_name, spu.spu_name, sku.create_time from ( select * from ${APP}.ods_sku_info where dt='$do_date' )sku join ( select * from ${APP}.ods_base_trademark where dt='$do_date' )ob on sku.tm_id=ob.tm_id join ( select * from ${APP}.ods_spu_info where dt='$do_date' )spu on spu.id = sku.spu_id join ( select * from ${APP}.ods_base_category3 where dt='$do_date' )c3 on sku.category3_id=c3.id join ( select * from ${APP}.ods_base_category2 where dt='$do_date' )c2 on c3.category2_id=c2.id join ( select * from ${APP}.ods_base_category1 where dt='$do_date' )c1 on c2.category1_id=c1.id; insert overwrite table ${APP}.dwd_dim_coupon_info partition(dt='$do_date') select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, operate_time, expire_time from ${APP}.ods_coupon_info where dt='$do_date'; insert overwrite table ${APP}.dwd_dim_activity_info partition(dt='$do_date') select id, activity_name, activity_type, start_time, end_time, create_time from ${APP}.ods_activity_info where dt='$do_date'; insert overwrite table ${APP}.dwd_fact_order_detail partition(dt='$do_date') select id, order_id, user_id, sku_id, sku_num, order_price, sku_num, create_time, province_id, source_type, source_id, original_amount_d, if(rn=1,final_total_amount-(sum_div_final_amount-final_amount_d),final_amount_d), if(rn=1,feight_fee-(sum_div_feight_fee-feight_fee_d),feight_fee_d), if(rn=1,benefit_reduce_amount-(sum_div_benefit_reduce_amount-benefit_reduce_amount_d),benefit_reduce_amount_d) from ( select od.id, od.order_id, od.user_id, od.sku_id, od.sku_name, od.order_price, od.sku_num, od.create_time, oi.province_id, od.source_type, od.source_id, round(od.order_price*od.sku_num,2) original_amount_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.final_total_amount,2) final_amount_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.feight_fee,2) feight_fee_d, round(od.order_price*od.sku_num/oi.original_total_amount*oi.benefit_reduce_amount,2) benefit_reduce_amount_d, row_number() over(partition by od.order_id order by od.id desc) rn, oi.final_total_amount, oi.feight_fee, oi.benefit_reduce_amount, sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.final_total_amount,2)) over(partition by od.order_id) sum_div_final_amount, sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.feight_fee,2)) over(partition by od.order_id) sum_div_feight_fee, sum(round(od.order_price*od.sku_num/oi.original_total_amount*oi.benefit_reduce_amount,2)) over(partition by od.order_id) sum_div_benefit_reduce_amount from ( select * from ${APP}.ods_order_detail where dt='$do_date' ) od join ( select * from ${APP}.ods_order_info where dt='$do_date' ) oi on od.order_id=oi.id )t1; insert overwrite table ${APP}.dwd_fact_payment_info partition(dt='$do_date') select pi.id, pi.out_trade_no, pi.order_id, pi.user_id, pi.alipay_trade_no, pi.total_amount, pi.subject, pi.payment_type, pi.payment_time, oi.province_id from ( select * from ${APP}.ods_payment_info where dt='$do_date' )pi join ( select id, province_id from ${APP}.ods_order_info where dt='$do_date' )oi on pi.order_id = oi.id; insert overwrite table ${APP}.dwd_fact_order_refund_info partition(dt='$do_date') select id, user_id, order_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, create_time from ${APP}.ods_order_refund_info where dt='$do_date'; insert overwrite table ${APP}.dwd_fact_comment_info partition(dt='$do_date') select id, user_id, sku_id, spu_id, order_id, appraise, create_time from ${APP}.ods_comment_info where dt='$do_date'; insert overwrite table ${APP}.dwd_fact_cart_info partition(dt='$do_date') select id, user_id, sku_id, cart_price, sku_num, sku_name, create_time, operate_time, is_ordered, order_time, source_type, source_id from ${APP}.ods_cart_info where dt='$do_date'; insert overwrite table ${APP}.dwd_fact_favor_info partition(dt='$do_date') select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time from ${APP}.ods_favor_info where dt='$do_date'; insert overwrite table ${APP}.dwd_fact_coupon_use partition(dt) select if(new.id is null,old.id,new.id), if(new.coupon_id is null,old.coupon_id,new.coupon_id), if(new.user_id is null,old.user_id,new.user_id), if(new.order_id is null,old.order_id,new.order_id), if(new.coupon_status is null,old.coupon_status,new.coupon_status), if(new.get_time is null,old.get_time,new.get_time), if(new.using_time is null,old.using_time,new.using_time), if(new.used_time is null,old.used_time,new.used_time), date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd') from ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from ${APP}.dwd_fact_coupon_use where dt in ( select date_format(get_time,'yyyy-MM-dd') from ${APP}.ods_coupon_use where dt='$do_date' ) )old full outer join ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from ${APP}.ods_coupon_use where dt='$do_date' )new on old.id=new.id; insert overwrite table ${APP}.dwd_fact_order_info partition(dt) select if(new.id is null,old.id,new.id), if(new.order_status is null,old.order_status,new.order_status), if(new.user_id is null,old.user_id,new.user_id), if(new.out_trade_no is null,old.out_trade_no,new.out_trade_no), if(new.tms['1001'] is null,old.create_time,new.tms['1001']),--1001对应未支付状态 if(new.tms['1002'] is null,old.payment_time,new.tms['1002']), if(new.tms['1003'] is null,old.cancel_time,new.tms['1003']), if(new.tms['1004'] is null,old.finish_time,new.tms['1004']), if(new.tms['1005'] is null,old.refund_time,new.tms['1005']), if(new.tms['1006'] is null,old.refund_finish_time,new.tms['1006']), if(new.province_id is null,old.province_id,new.province_id), if(new.activity_id is null,old.activity_id,new.activity_id), if(new.original_total_amount is null,old.original_total_amount,new.original_total_amount), if(new.benefit_reduce_amount is null,old.benefit_reduce_amount,new.benefit_reduce_amount), if(new.feight_fee is null,old.feight_fee,new.feight_fee), if(new.final_total_amount is null,old.final_total_amount,new.final_total_amount), date_format(if(new.tms['1001'] is null,old.create_time,new.tms['1001']),'yyyy-MM-dd') from ( select id, order_status, user_id, out_trade_no, create_time, payment_time, cancel_time, finish_time, refund_time, refund_finish_time, province_id, activity_id, original_total_amount, benefit_reduce_amount, feight_fee, final_total_amount from ${APP}.dwd_fact_order_info where dt in ( select date_format(create_time,'yyyy-MM-dd') from ${APP}.ods_order_info where dt='$do_date' ) )old full outer join ( select info.id, info.order_status, info.user_id, info.out_trade_no, info.province_id, act.activity_id, log.tms, info.original_total_amount, info.benefit_reduce_amount, info.feight_fee, info.final_total_amount from ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms from ${APP}.ods_order_status_log where dt='$do_date' group by order_id )log join ( select * from ${APP}.ods_order_info where dt='$do_date' )info on log.order_id=info.id left join ( select * from ${APP}.ods_activity_order where dt='$do_date' )act on log.order_id=act.order_id )new on old.id=new.id; " sql2=" insert overwrite table ${APP}.dwd_dim_base_province select bp.id, bp.name, bp.area_code, bp.iso_code, bp.region_id, br.region_name from ${APP}.ods_base_province bp join ${APP}.ods_base_region br on bp.region_id=br.id; " sql3=" insert overwrite table ${APP}.dwd_dim_user_info_his_tmp select * from ( select id, name, birthday, gender, email, user_level, create_time, operate_time, '$do_date' start_date, '9999-99-99' end_date from ${APP}.ods_user_info where dt='$do_date' union all select uh.id, uh.name, uh.birthday, uh.gender, uh.email, uh.user_level, uh.create_time, uh.operate_time, uh.start_date, if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1), uh.end_date) end_date from ${APP}.dwd_dim_user_info_his uh left join ( select * from ${APP}.ods_user_info where dt='$do_date' ) ui on uh.id=ui.id )his order by his.id, start_date; insert overwrite table ${APP}.dwd_dim_user_info_his select * from ${APP}.dwd_dim_user_info_his_tmp; " case $1 in "first"){ $hive -e "$sql1$sql2" };; "all"){ $hive -e "$sql1$sql3" };; esac
1)初次导入
(1)时间维度表
(2)用户维度表
(3)其余表
初次导入时,脚本的第一个参数应为first,线上环境不传第二个参数,自动获取前一天日期。
[kris@hadoop101 bin]$ ods_to_dwd_db.sh first 2020-06-14
2)每日定时导入
每日定时导入,脚本的第一个参数应为all,线上环境不传第二个参数,自动获取前一天日期。
[kris@hadoop101 bin]$ ods_to_dwd_db.sh all 2020-06-15
3.DWS层
把每个用户单日的行为聚合起来组成一张多列宽表,以便之后关联用户维度信息后进行,不同角度的统计分析。
1)用户
用户以设备为判断标准,在移动统计中,每个独立设备认为是一个独立用户。Android系统根据IMEI号,IOS系统根据OpenUDID来标识一个独立用户,每部手机一个用户。
2)新增用户
首次联网使用应用的用户。如果一个用户首次打开某APP,那这个用户定义为新增用户;卸载再安装的设备,不会被算作一次新增。新增用户包括日新增用户、周新增用户、月新增用户。
3)活跃用户
打开应用的用户即为活跃用户,不考虑用户的使用情况。每天一台设备打开多次会被计为一个活跃用户。
4)周(月)活跃用户
某个自然周(月)内启动过应用的用户,该周(月)内的多次启动只记一个活跃用户。
5)月活跃率
月活跃用户与截止到该月累计的用户总和之间的比例。
6)沉默用户
用户仅在安装当天(次日)启动一次,后续时间无再启动行为。该指标可以反映新增用户质量和用户与APP的匹配程度。
7)版本分布
不同版本的周内各天新增用户数,活跃用户数和启动次数。利于判断APP各个版本之间的优劣和用户行为习惯。
8)本周回流用户
上周未启动过应用,本周启动了应用的用户。
9)连续n周活跃用户
连续n周,每周至少启动一次。
10)忠诚用户
连续活跃5周以上的用户
11)连续活跃用户
连续2周及以上活跃的用户
12)近期流失用户
连续n(2<= n <= 4)周没有启动应用的用户。(第n+1周没有启动过)
13)留存用户
某段时间内的新增用户,经过一段时间后,仍然使用应用的被认作是留存用户;这部分用户占当时新增用户的比例即是留存率。
例如,5月份新增用户200,这200人在6月份启动过应用的有100人,7月份启动过应用的有80人,8月份启动过应用的有50人;则5月份新增用户一个月后的留存率是50%,二个月后的留存率是40%,三个月后的留存率是25%。
14)用户新鲜度
每天启动应用的新老用户比例,即新增用户数占活跃用户数的比例。
15)单次使用时长
每次启动使用的时间长度。
16)日使用时长
累计一天内的使用时间长度。
17)启动次数计算标准
IOS平台应用退到后台就算一次独立的启动;Android平台我们规定,两次启动之间的间隔小于30秒,被计算一次启动。用户在使用过程中,若因收发短信或接电话等退出应用30秒又再次返回应用中,那这两次行为应该是延续而非独立的,所以可以被算作一次使用行为,即一次启动。业内大多使用30秒这个标准,但用户还是可以自定义此时间间隔。
①每日设备行为
每日设备行为,主要按照设备id统计
mid_id 设备id, brand 手机品牌, model 手机型号, login_count 活跃次数, page_stats 页面访问统计 array<struct<page_id:string,page_count:bigint>> page_stats:页面访问统计 array类型 [ {"page_id":"comment","page_count":1}, {"page_id":"discovery","page_count":1}, {"page_id":"good_detail","page_count":1}, {"page_id":"good_spec","page_count":1}, {"page_id":"home","page_count":1}, {"page_id":"payment","page_count":1}, {"page_id":"trade","page_count":1} ] page_id 页面id
建表并导入数据
hive (gmall)> drop table if exists dws_uv_detail_daycount; create external table dws_uv_detail_daycount ( `mid_id` string COMMENT '设备id', `brand` string COMMENT '手机品牌', `model` string COMMENT '手机型号', `login_count` bigint COMMENT '活跃次数', `page_stats` array<struct<page_id:string,page_count:bigint>> COMMENT '页面访问统计' ) COMMENT '每日设备行为表' partitioned by(dt string) stored as parquet location '/warehouse/gmall/dws/dws_uv_detail_daycount' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> with tmp_start as ( select mid_id, brand, model, count(*) login_count from dwd_start_log where dt='2020-06-14' group by mid_id,brand,model ), tmp_page as ( select mid_id, brand, model, collect_set(named_struct('page_id',page_id,'page_count',page_count)) page_stats from ( select mid_id, brand, model, page_id, count(*) page_count from dwd_page_log where dt='2020-06-14' group by mid_id,brand,model,page_id )tmp group by mid_id,brand,model ) insert overwrite table dws_uv_detail_daycount partition(dt='2020-06-14') select nvl(tmp_start.mid_id,tmp_page.mid_id), nvl(tmp_start.brand,tmp_page.brand), nvl(tmp_start.model,tmp_page.model), tmp_start.login_count, tmp_page.page_stats from tmp_start full outer join tmp_page on tmp_start.mid_id=tmp_page.mid_id and tmp_start.brand=tmp_page.brand and tmp_start.model=tmp_page.model;
②每日会员行为
user_id '用户 id', login_count '登录次数', cart_count '加入购物车次数', order_count '下单次数', order_amount '下单金额', payment_count '支付次数', payment_amount '支付金额', order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> ‘下单明细统计’ collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats order_detail_stats 下单明细统计: [ {"sku_id":"5",
"sku_num":3,
"order_count":1,
"order_amount":658
}]
sku_id sku商品id
sku_num 商品数量
hive (gmall)> drop table if exists dws_user_action_daycount; create external table dws_user_action_daycount ( user_id string comment '用户 id', login_count bigint comment '登录次数', cart_count bigint comment '加入购物车次数', order_count bigint comment '下单次数', order_amount decimal(16,2) comment '下单金额', payment_count bigint comment '支付次数', payment_amount decimal(16,2) comment '支付金额', order_detail_stats array<struct<sku_id:string,sku_num:bigint,order_count:bigint,order_amount:decimal(20,2)>> comment '下单明细统计' ) COMMENT '每日会员行为' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dws/dws_user_action_daycount/' tblproperties ("parquet.compression"="lzo"); hive (gmall)> with tmp_login as ( select user_id, count(*) login_count from dwd_start_log where dt='2020-06-14' and user_id is not null group by user_id ), tmp_cart as ( select user_id, count(*) cart_count from dwd_action_log where dt='2020-06-14' and user_id is not null and action_id='cart_add' group by user_id ),tmp_order as ( select user_id, count(*) order_count, sum(final_total_amount) order_amount from dwd_fact_order_info where dt='2020-06-14' group by user_id ) , tmp_payment as ( select user_id, count(*) payment_count, sum(payment_amount) payment_amount from dwd_fact_payment_info where dt='2020-06-14' group by user_id ), tmp_order_detail as ( select user_id, collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats from ( select user_id, sku_id, sum(sku_num) sku_num, count(*) order_count, cast(sum(final_amount_d) as decimal(20,2)) order_amount from dwd_fact_order_detail where dt='2020-06-14' group by user_id,sku_id )tmp group by user_id ) insert overwrite table dws_user_action_daycount partition(dt='2020-06-14') select tmp_login.user_id, login_count, nvl(cart_count,0), nvl(order_count,0), nvl(order_amount,0.0), nvl(payment_count,0), nvl(payment_amount,0.0), order_stats from tmp_login left join tmp_cart on tmp_login.user_id=tmp_cart.user_id left join tmp_order on tmp_login.user_id=tmp_order.user_id left join tmp_payment on tmp_login.user_id=tmp_payment.user_id left join tmp_order_detail on tmp_login.user_id=tmp_order_detail.user_id;
③每日商品行为
商品 sku_id,
被下单的次数、件数
被支付的次数、件数、金额
被退款的次数、件数、金额
被加入购物车次数
被收藏次数
好评、中评、差评、默认评价数
hive (gmall)> drop table if exists dws_sku_action_daycount; create external table dws_sku_action_daycount ( sku_id string comment 'sku_id', order_count bigint comment '被下单次数', order_num bigint comment '被下单件数', order_amount decimal(16,2) comment '被下单金额', payment_count bigint comment '被支付次数', payment_num bigint comment '被支付件数', payment_amount decimal(16,2) comment '被支付金额', refund_count bigint comment '被退款次数', refund_num bigint comment '被退款件数', refund_amount decimal(16,2) comment '被退款金额', cart_count bigint comment '被加入购物车次数', favor_count bigint comment '被收藏次数', appraise_good_count bigint comment '好评数', appraise_mid_count bigint comment '中评数', appraise_bad_count bigint comment '差评数', appraise_default_count bigint comment '默认评价数' ) COMMENT '每日商品行为' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dws/dws_sku_action_daycount/' tblproperties ("parquet.compression"="lzo"); 数据装载 注意:如果是23点59下单,支付日期跨天。需要从订单详情里面取出支付时间是今天,且订单时间是昨天或者今天的订单。 hive (gmall)> with tmp_order as ( select sku_id, count(*) order_count, sum(sku_num) order_num, --sku_num 商品数量 sum(final_amount_d) order_amount --购买价格分摊 from dwd_fact_order_detail --订单明细事实表 where dt='2020-06-14' group by sku_id ), tmp_payment as ( select sku_id, count(*) payment_count, sum(sku_num) payment_num, sum(final_amount_d) payment_amount from dwd_fact_order_detail where dt='2020-06-14' and order_id in ( select id from dwd_fact_order_info --订单事实表 where (dt='2020-06-14' or dt=date_add('2020-06-14',-1)) and date_format(payment_time,'yyyy-MM-dd')='2020-06-14' ) group by sku_id ), tmp_refund as ( select sku_id, count(*) refund_count, sum(refund_num) refund_num, --退款件数 sum(refund_amount) refund_amount --退款金额 from dwd_fact_order_refund_info --退款事实表 where dt='2020-06-14' group by sku_id ), tmp_cart as ( select item sku_id, --item 目标id count(*) cart_count from dwd_action_log --动作日志表 where dt='2020-06-14' and user_id is not null and action_id='cart_add' --action_id='cart_add' 添加购物车 group by item ),tmp_favor as ( select item sku_id, --item 目标id count(*) favor_count from dwd_action_log --动作日志表 where dt='2020-06-14' and user_id is not null and action_id='favor_add' --action_id='favor_add' 添加收藏 group by item ), tmp_appraise as ( select sku_id, sum(if(appraise='1201',1,0)) appraise_good_count, --appraise='1201' 好评 sum(if(appraise='1202',1,0)) appraise_mid_count, --appraise='1202' 中评 sum(if(appraise='1203',1,0)) appraise_bad_count, --appraise='1203' 差评 sum(if(appraise='1204',1,0)) appraise_default_count --appraise='1204' 默认评价 from dwd_fact_comment_info --评价事实表 where dt='2020-06-14' group by sku_id ) insert overwrite table dws_sku_action_daycount partition(dt='2020-06-14') select sku_id, sum(order_count), sum(order_num), sum(order_amount), sum(payment_count), sum(payment_num), sum(payment_amount), sum(refund_count), sum(refund_num), sum(refund_amount), sum(cart_count), sum(favor_count), sum(appraise_good_count), sum(appraise_mid_count), sum(appraise_bad_count), sum(appraise_default_count) from ( select sku_id, order_count, order_num, order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_order union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, payment_count, payment_num, payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_payment union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, refund_count, refund_num, refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_refund union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_cart union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_favor union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, appraise_good_count, appraise_mid_count, appraise_bad_count, appraise_default_count from tmp_appraise )tmp group by sku_id;
④每日活动统计
活动activity_id
活动编号、名称、类型、开始、结束、创建时间
曝光次数
下单次数、金额
支付次数、金额
hive (gmall)> drop table if exists dws_activity_info_daycount; create external table dws_activity_info_daycount( `id` string COMMENT '编号', `activity_name` string COMMENT '活动名称', `activity_type` string COMMENT '活动类型', `start_time` string COMMENT '开始时间', `end_time` string COMMENT '结束时间', `create_time` string COMMENT '创建时间', `display_count` bigint COMMENT '曝光次数', `order_count` bigint COMMENT '下单次数', `order_amount` decimal(20,2) COMMENT '下单金额', `payment_count` bigint COMMENT '支付次数', `payment_amount` decimal(20,2) COMMENT '支付金额' ) COMMENT '每日活动统计' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dws/dws_activity_info_daycount/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> with tmp_op as ( select activity_id, sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-14',1,0)) order_count, sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-14',final_total_amount,0)) order_amount, sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-14',1,0)) payment_count, sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-14',final_total_amount,0)) payment_amount from dwd_fact_order_info --订单事实表 where (dt='2020-06-14' or dt=date_add('2020-06-14',-1)) and activity_id is not null group by activity_id ), tmp_display as ( select item activity_id, count(*) display_count from dwd_display_log --曝光日志表 where dt='2020-06-14' and item_type='activity_id' --item_type='activity_id' 曝光对象为活动 group by item ), tmp_activity as ( select * from dwd_dim_activity_info --活动维度表 where dt='2020-06-14' ) insert overwrite table dws_activity_info_daycount partition(dt='2020-06-14') select nvl(tmp_op.activity_id,tmp_display.activity_id), tmp_activity.activity_name, tmp_activity.activity_type, tmp_activity.start_time, tmp_activity.end_time, tmp_activity.create_time, tmp_display.display_count, tmp_op.order_count, tmp_op.order_amount, tmp_op.payment_count, tmp_op.payment_amount from tmp_op full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id;
⑤每日地区统计
地区
地区编号、编码、iso编码、名称、省份
活跃设备数
下单次数、金额
支付次数、金额
hive (gmall)> drop table if exists dws_area_stats_daycount; create external table dws_area_stats_daycount( `id` bigint COMMENT '编号', `province_name` string COMMENT '省份名称', `area_code` string COMMENT '地区编码', `iso_code` string COMMENT 'iso编码', `region_id` string COMMENT '地区ID', `region_name` string COMMENT '地区名称', `login_count` string COMMENT '活跃设备数', `order_count` bigint COMMENT '下单次数', `order_amount` decimal(20,2) COMMENT '下单金额', `payment_count` bigint COMMENT '支付次数', `payment_amount` decimal(20,2) COMMENT '支付金额' ) COMMENT '每日地区统计表' PARTITIONED BY (`dt` string) stored as parquet location '/warehouse/gmall/dws/dws_area_stats_daycount/' tblproperties ("parquet.compression"="lzo"); 数据装载 hive (gmall)> with tmp_login as ( select area_code, count(*) login_count from dwd_start_log --启动日志表 where dt='2020-06-14' group by area_code ), tmp_op as ( select province_id, sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-14',1,0)) order_count, sum(if(date_format(create_time,'yyyy-MM-dd')='2020-06-14',final_total_amount,0)) order_amount, sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-14',1,0)) payment_count, sum(if(date_format(payment_time,'yyyy-MM-dd')='2020-06-14',final_total_amount,0)) payment_amount from dwd_fact_order_info --订单事实表 where (dt='2020-06-14' or dt=date_add('2020-06-14',-1)) group by province_id ) insert overwrite table dws_area_stats_daycount partition(dt='2020-06-14') select pro.id, pro.province_name, pro.area_code, pro.iso_code, pro.region_id, pro.region_name, nvl(tmp_login.login_count,0), nvl(tmp_op.order_count,0), nvl(tmp_op.order_amount,0.0), nvl(tmp_op.payment_count,0), nvl(tmp_op.payment_amount,0.0) from dwd_dim_base_province pro --地区维度表 left join tmp_login on pro.area_code=tmp_login.area_code left join tmp_op on pro.id=tmp_op.province_id;
DWS层数据导入脚本
1)在/home/kris/bin目录下创建脚本dwd_to_dws.sh并赋予执行权限
#!/bin/bash APP=gmall hive=/opt/module/hive/bin/hive # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" set mapreduce.job.queuename=hive; with tmp_start as ( select mid_id, brand, model, count(*) login_count from ${APP}.dwd_start_log where dt='$do_date' group by mid_id,brand,model ), tmp_page as ( select mid_id, brand, model, collect_set(named_struct('page_id',page_id,'page_count',page_count)) page_stats from ( select mid_id, brand, model, page_id, count(*) page_count from ${APP}.dwd_page_log where dt='$do_date' group by mid_id,brand,model,page_id )tmp group by mid_id,brand,model ) insert overwrite table ${APP}.dws_uv_detail_daycount partition(dt='$do_date') select nvl(tmp_start.mid_id,tmp_page.mid_id), nvl(tmp_start.brand,tmp_page.brand), nvl(tmp_start.model,tmp_page.model), tmp_start.login_count, tmp_page.page_stats from tmp_start full outer join tmp_page on tmp_start.mid_id=tmp_page.mid_id and tmp_start.brand=tmp_page.brand and tmp_start.model=tmp_page.model; with tmp_login as ( select user_id, count(*) login_count from ${APP}.dwd_start_log where dt='$do_date' and user_id is not null group by user_id ), tmp_cart as ( select user_id, count(*) cart_count from ${APP}.dwd_action_log where dt='$do_date' and user_id is not null and action_id='cart_add' group by user_id ),tmp_order as ( select user_id, count(*) order_count, sum(final_total_amount) order_amount from ${APP}.dwd_fact_order_info where dt='$do_date' group by user_id ) , tmp_payment as ( select user_id, count(*) payment_count, sum(payment_amount) payment_amount from ${APP}.dwd_fact_payment_info where dt='$do_date' group by user_id ), tmp_order_detail as ( select user_id, collect_set(named_struct('sku_id',sku_id,'sku_num',sku_num,'order_count',order_count,'order_amount',order_amount)) order_stats from ( select user_id, sku_id, sum(sku_num) sku_num, count(*) order_count, cast(sum(final_amount_d) as decimal(20,2)) order_amount from ${APP}.dwd_fact_order_detail where dt='$do_date' group by user_id,sku_id )tmp group by user_id ) insert overwrite table ${APP}.dws_user_action_daycount partition(dt='$do_date') select tmp_login.user_id, login_count, nvl(cart_count,0), nvl(order_count,0), nvl(order_amount,0.0), nvl(payment_count,0), nvl(payment_amount,0.0), order_stats from tmp_login left outer join tmp_cart on tmp_login.user_id=tmp_cart.user_id left outer join tmp_order on tmp_login.user_id=tmp_order.user_id left outer join tmp_payment on tmp_login.user_id=tmp_payment.user_id left outer join tmp_order_detail on tmp_login.user_id=tmp_order_detail.user_id; with tmp_order as ( select sku_id, count(*) order_count, sum(sku_num) order_num, sum(final_amount_d) order_amount from ${APP}.dwd_fact_order_detail where dt='$do_date' group by sku_id ), tmp_payment as ( select sku_id, count(*) payment_count, sum(sku_num) payment_num, sum(final_amount_d) payment_amount from ${APP}.dwd_fact_order_detail where dt='$do_date' and order_id in ( select id from ${APP}.dwd_fact_order_info where (dt='$do_date' or dt=date_add('$do_date',-1)) and date_format(payment_time,'yyyy-MM-dd')='$do_date' ) group by sku_id ), tmp_refund as ( select sku_id, count(*) refund_count, sum(refund_num) refund_num, sum(refund_amount) refund_amount from ${APP}.dwd_fact_order_refund_info where dt='$do_date' group by sku_id ), tmp_cart as ( select item sku_id, count(*) cart_count from ${APP}.dwd_action_log where dt='$do_date' and user_id is not null and action_id='cart_add' group by item ),tmp_favor as ( select item sku_id, count(*) favor_count from ${APP}.dwd_action_log where dt='$do_date' and user_id is not null and action_id='favor_add' group by item ), tmp_appraise as ( select sku_id, sum(if(appraise='1201',1,0)) appraise_good_count, sum(if(appraise='1202',1,0)) appraise_mid_count, sum(if(appraise='1203',1,0)) appraise_bad_count, sum(if(appraise='1204',1,0)) appraise_default_count from ${APP}.dwd_fact_comment_info where dt='$do_date' group by sku_id ) insert overwrite table ${APP}.dws_sku_action_daycount partition(dt='$do_date') select sku_id, sum(order_count), sum(order_num), sum(order_amount), sum(payment_count), sum(payment_num), sum(payment_amount), sum(refund_count), sum(refund_num), sum(refund_amount), sum(cart_count), sum(favor_count), sum(appraise_good_count), sum(appraise_mid_count), sum(appraise_bad_count), sum(appraise_default_count) from ( select sku_id, order_count, order_num, order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_order union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, payment_count, payment_num, payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_payment union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, refund_count, refund_num, refund_amount, 0 cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_refund union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, cart_count, 0 favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_cart union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, favor_count, 0 appraise_good_count, 0 appraise_mid_count, 0 appraise_bad_count, 0 appraise_default_count from tmp_favor union all select sku_id, 0 order_count, 0 order_num, 0 order_amount, 0 payment_count, 0 payment_num, 0 payment_amount, 0 refund_count, 0 refund_num, 0 refund_amount, 0 cart_count, 0 favor_count, appraise_good_count, appraise_mid_count, appraise_bad_count, appraise_default_count from tmp_appraise )tmp group by sku_id; with tmp_login as ( select area_code, count(*) login_count from ${APP}.dwd_start_log where dt='$do_date' group by area_code ), tmp_op as ( select province_id, sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count, sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount, sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count, sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount from ${APP}.dwd_fact_order_info where (dt='$do_date' or dt=date_add('$do_date',-1)) group by province_id ) insert overwrite table ${APP}.dws_area_stats_daycount partition(dt='$do_date') select pro.id, pro.province_name, pro.area_code, pro.iso_code, pro.region_id, pro.region_name, nvl(tmp_login.login_count,0), nvl(tmp_op.order_count,0), nvl(tmp_op.order_amount,0.0), nvl(tmp_op.payment_count,0), nvl(tmp_op.payment_amount,0.0) from ${APP}.dwd_dim_base_province pro left join tmp_login on pro.area_code=tmp_login.area_code left join tmp_op on pro.id=tmp_op.province_id; with tmp_op as ( select activity_id, sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',1,0)) order_count, sum(if(date_format(create_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) order_amount, sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',1,0)) payment_count, sum(if(date_format(payment_time,'yyyy-MM-dd')='$do_date',final_total_amount,0)) payment_amount from ${APP}.dwd_fact_order_info where (dt='$do_date' or dt=date_add('$do_date',-1)) and activity_id is not null group by activity_id ), tmp_display as ( select item activity_id, count(*) display_count from ${APP}.dwd_display_log where dt='$do_date' and item_type='activity_id' group by item ), tmp_activity as ( select * from ${APP}.dwd_dim_activity_info where dt='$do_date' ) insert overwrite table ${APP}.dws_activity_info_daycount partition(dt='$do_date') select nvl(tmp_op.activity_id,tmp_display.activity_id), tmp_activity.activity_name, tmp_activity.activity_type, tmp_activity.start_time, tmp_activity.end_time, tmp_activity.create_time, tmp_display.display_count, tmp_op.order_count, tmp_op.order_amount, tmp_op.payment_count, tmp_op.payment_amount from tmp_op full outer join tmp_display on tmp_op.activity_id=tmp_display.activity_id left join tmp_activity on nvl(tmp_op.activity_id,tmp_display.activity_id)=tmp_activity.id; " $hive -e "$sql"
4.DWT层
①设备主题宽表
设备id
手机品牌、型号
首次活跃时间、末次活跃时间
当日活跃次数、累计活跃天数
old full outer join on today
drop table if exists dwt_uv_topic;
create external table dwt_uv_topic
(
`mid_id` string comment '设备id',
`brand` string comment '手机品牌',
`model` string comment '手机型号',
`login_date_first` string comment '首次活跃时间',
`login_date_last` string comment '末次活跃时间',
`login_day_count` bigint comment '当日活跃次数',
`login_count` bigint comment '累积活跃天数'
) COMMENT '设备主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_uv_topic'
tblproperties ("parquet.compression"="lzo");
数据装载
hive (gmall)>
insert overwrite table dwt_uv_topic
select
nvl(new.mid_id,old.mid_id),
nvl(new.model,old.model),
nvl(new.brand,old.brand),
if(old.mid_id is null,'2020-06-14',old.login_date_first),
if(new.mid_id is not null,'2020-06-14',old.login_date_last),
if(new.mid_id is not null, new.login_count,0),
nvl(old.login_count,0)+if(new.login_count>0,1,0)
from
(
select
*
from dwt_uv_topic
)old
full outer join
(
select
*
from dws_uv_detail_daycount
where dt='2020-06-14'
)new
on old.mid_id=new.mid_id;
②会员主题宽表
宽表字段怎么来?维度关联的事实表度量值+开头、结尾+累积+累积一个时间段。
用户
首次、末次登录时间
首次、末次下单时间
首次、末次支付时间
最近30天下单次数、金额
最近30天支付次数、金额
累计登录天数、最近30天登录天数
累计下单次数、金额
累计支付次数、金额
date_add(today, -30) full outer join on today
drop table if exists dwt_user_topic;
create external table dwt_user_topic
(
user_id string comment '用户id',
login_date_first string comment '首次登录时间',
login_date_last string comment '末次登录时间',
login_count bigint comment '累积登录天数',
login_last_30d_count bigint comment '最近30日登录天数',
order_date_first string comment '首次下单时间',
order_date_last string comment '末次下单时间',
order_count bigint comment '累积下单次数',
order_amount decimal(16,2) comment '累积下单金额',
order_last_30d_count bigint comment '最近30日下单次数',
order_last_30d_amount bigint comment '最近30日下单金额',
payment_date_first string comment '首次支付时间',
payment_date_last string comment '末次支付时间',
payment_count decimal(16,2) comment '累积支付次数',
payment_amount decimal(16,2) comment '累积支付金额',
payment_last_30d_count decimal(16,2) comment '最近30日支付次数',
payment_last_30d_amount decimal(16,2) comment '最近30日支付金额'
)COMMENT '会员主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_user_topic/'
tblproperties ("parquet.compression"="lzo");
2)数据装载
hive (gmall)>
insert overwrite table dwt_user_topic
select
nvl(new.user_id,old.user_id),
if(old.login_date_first is null and new.login_count>0,'2020-06-14',old.login_date_first),
if(new.login_count>0,'2020-06-14',old.login_date_last),
nvl(old.login_count,0)+if(new.login_count>0,1,0),
nvl(new.login_last_30d_count,0),
if(old.order_date_first is null and new.order_count>0,'2020-06-14',old.order_date_first),
if(new.order_count>0,'2020-06-14',old.order_date_last),
nvl(old.order_count,0)+nvl(new.order_count,0),
nvl(old.order_amount,0)+nvl(new.order_amount,0),
nvl(new.order_last_30d_count,0),
nvl(new.order_last_30d_amount,0),
if(old.payment_date_first is null and new.payment_count>0,'2020-06-14',old.payment_date_first),
if(new.payment_count>0,'2020-06-14',old.payment_date_last),
nvl(old.payment_count,0)+nvl(new.payment_count,0),
nvl(old.payment_amount,0)+nvl(new.payment_amount,0),
nvl(new.payment_last_30d_count,0),
nvl(new.payment_last_30d_amount,0)
from
dwt_user_topic old
full outer join
(
select
user_id,
sum(if(dt='2020-06-14',login_count,0)) login_count,
sum(if(dt='2020-06-14',order_count,0)) order_count,
sum(if(dt='2020-06-14',order_amount,0)) order_amount,
sum(if(dt='2020-06-14',payment_count,0)) payment_count,
sum(if(dt='2020-06-14',payment_amount,0)) payment_amount,
sum(if(login_count>0,1,0)) login_last_30d_count,
sum(order_count) order_last_30d_count,
sum(order_amount) order_last_30d_amount,
sum(payment_count) payment_last_30d_count,
sum(payment_amount) payment_last_30d_amount
from dws_user_action_daycount
where dt>=date_add( '2020-06-14',-30)
group by user_id
)new
on old.user_id=new.user_id;
③商品主题宽表
商品sku_id
最近30天被下单次数、件数、金额、
最近30天被支付次数、件数、金额
最近30天被退款次数、件数、金额
最近30天被收藏次数
最近30天被加入购物车次数
最近30天被好评、中评、差评、默认评价数
累计被下单次数、件数、金额
累计被支付次数、件数、金额
累计被退款次数、件数、金额
累计被收藏次数
累计被加入购物车次数
累积被好评、中评、差评、默认评价数
date_add(today, -30) full out join today
1)建表语句
hive (gmall)>
drop table if exists dwt_sku_topic;
create external table dwt_sku_topic
(
sku_id string comment 'sku_id',
spu_id string comment 'spu_id',
order_last_30d_count bigint comment '最近30日被下单次数',
order_last_30d_num bigint comment '最近30日被下单件数',
order_last_30d_amount decimal(16,2) comment '最近30日被下单金额',
order_count bigint comment '累积被下单次数',
order_num bigint comment '累积被下单件数',
order_amount decimal(16,2) comment '累积被下单金额',
payment_last_30d_count bigint comment '最近30日被支付次数',
payment_last_30d_num bigint comment '最近30日被支付件数',
payment_last_30d_amount decimal(16,2) comment '最近30日被支付金额',
payment_count bigint comment '累积被支付次数',
payment_num bigint comment '累积被支付件数',
payment_amount decimal(16,2) comment '累积被支付金额',
refund_last_30d_count bigint comment '最近三十日退款次数',
refund_last_30d_num bigint comment '最近三十日退款件数',
refund_last_30d_amount decimal(16,2) comment '最近三十日退款金额',
refund_count bigint comment '累积退款次数',
refund_num bigint comment '累积退款件数',
refund_amount decimal(16,2) comment '累积退款金额',
cart_last_30d_count bigint comment '最近30日被加入购物车次数',
cart_count bigint comment '累积被加入购物车次数',
favor_last_30d_count bigint comment '最近30日被收藏次数',
favor_count bigint comment '累积被收藏次数',
appraise_last_30d_good_count bigint comment '最近30日好评数',
appraise_last_30d_mid_count bigint comment '最近30日中评数',
appraise_last_30d_bad_count bigint comment '最近30日差评数',
appraise_last_30d_default_count bigint comment '最近30日默认评价数',
appraise_good_count bigint comment '累积好评数',
appraise_mid_count bigint comment '累积中评数',
appraise_bad_count bigint comment '累积差评数',
appraise_default_count bigint comment '累积默认评价数'
)COMMENT '商品主题宽表'
stored as parquet
location '/warehouse/gmall/dwt/dwt_sku_topic/'
tblproperties ("parquet.compression"="lzo");
2)数据装载
hive (gmall)>
insert overwrite table dwt_sku_topic
select
nvl(new.sku_id,old.sku_id),
sku_info.spu_id,
nvl(new.order_count30,0),
nvl(new.order_num30,0),
nvl(new.order_amount30,0),
nvl(old.order_count,0) + nvl(new.order_count,0),
nvl(old.order_num,0) + nvl(new.order_num,0),
nvl(old.order_amount,0) + nvl(new.order_amount,0),
nvl(new.payment_count30,0),
nvl(new.payment_num30,0),
nvl(new.payment_amount30,0),
nvl(old.payment_count,0) + nvl(new.payment_count,0),
nvl(old.payment_num,0) + nvl(new.payment_count,0),
nvl(old.payment_amount,0) + nvl(new.payment_count,0),
nvl(new.refund_count30,0),
nvl(new.refund_num30,0),
nvl(new.refund_amount30,0),
nvl(old.refund_count,0) + nvl(new.refund_count,0),
nvl(old.refund_num,0) + nvl(new.refund_num,0),
nvl(old.refund_amount,0) + nvl(new.refund_amount,0),
nvl(new.cart_count30,0),
nvl(old.cart_count,0) + nvl(new.cart_count,0),
nvl(new.favor_count30,0),
nvl(old.favor_count,0) + nvl(new.favor_count,0),
nvl(new.appraise_good_count30,0),
nvl(new.appraise_mid_count30,0),
nvl(new.appraise_bad_count30,0),
nvl(new.appraise_default_count30,0) ,
nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0),
nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0),
nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0),
nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0)
from
dwt_sku_topic old
full outer join
(
select
sku_id,
sum(if(dt='2020-06-14', order_count,0 )) order_count,
sum(if(dt='2020-06-14',order_num ,0 )) order_num,
sum(if(dt='2020-06-14',order_amount,0 )) order_amount ,
sum(if(dt='2020-06-14',payment_count,0 )) payment_count,
sum(if(dt='2020-06-14',payment_num,0 )) payment_num,
sum(if(dt='2020-06-14',payment_amount,0 )) payment_amount,
sum(if(dt='2020-06-14',refund_count,0 )) refund_count,
sum(if(dt='2020-06-14',refund_num,0 )) refund_num,
sum(if(dt='2020-06-14',refund_amount,0 )) refund_amount,
sum(if(dt='2020-06-14',cart_count,0 )) cart_count,
sum(if(dt='2020-06-14',favor_count,0 )) favor_count,
sum(if(dt='2020-06-14',appraise_good_count,0 )) appraise_good_count,
sum(if(dt='2020-06-14',appraise_mid_count,0 ) ) appraise_mid_count ,
sum(if(dt='2020-06-14',appraise_bad_count,0 )) appraise_bad_count,
sum(if(dt='2020-06-14',appraise_default_count,0 )) appraise_default_count,
sum(order_count) order_count30 ,
sum(order_num) order_num30,
sum(order_amount) order_amount30,
sum(payment_count) payment_count30,
sum(payment_num) payment_num30,
sum(payment_amount) payment_amount30,
sum(refund_count) refund_count30,
sum(refund_num) refund_num30,
sum(refund_amount) refund_amount30,
sum(cart_count) cart_count30,
sum(favor_count) favor_count30,
sum(appraise_good_count) appraise_good_count30,
sum(appraise_mid_count) appraise_mid_count30,
sum(appraise_bad_count) appraise_bad_count30,
sum(appraise_default_count) appraise_default_count30
from dws_sku_action_daycount
where dt >= date_add ('2020-06-14', -30)
group by sku_id
)new
on new.sku_id = old.sku_id
left join
(select * from dwd_dim_sku_info where dt='2020-06-14') sku_info
on nvl(new.sku_id,old.sku_id)= sku_info.id;
④活动主题宽表
活动
名称、类型、开始、结束、创建时间
当日曝光次数
当日下单次数、金额
当日支付次数、金额
累计曝光次数
累计下单次数、金额
累计支付次数、金额
old full outer join today
1)建表语句 hive (gmall)> drop table if exists dwt_activity_topic; create external table dwt_activity_topic( `id` string COMMENT '编号', `activity_name` string COMMENT '活动名称', `activity_type` string COMMENT '活动类型', `start_time` string COMMENT '开始时间', `end_time` string COMMENT '结束时间', `create_time` string COMMENT '创建时间', `display_day_count` bigint COMMENT '当日曝光次数', `order_day_count` bigint COMMENT '当日下单次数', `order_day_amount` decimal(20,2) COMMENT '当日下单金额', `payment_day_count` bigint COMMENT '当日支付次数', `payment_day_amount` decimal(20,2) COMMENT '当日支付金额', `display_count` bigint COMMENT '累积曝光次数', `order_count` bigint COMMENT '累积下单次数', `order_amount` decimal(20,2) COMMENT '累积下单金额', `payment_count` bigint COMMENT '累积支付次数', `payment_amount` decimal(20,2) COMMENT '累积支付金额' ) COMMENT '活动主题宽表' stored as parquet location '/warehouse/gmall/dwt/dwt_activity_topic/' tblproperties ("parquet.compression"="lzo"); 2)数据装载 hive (gmall)> insert overwrite table dwt_activity_topic select nvl(new.id,old.id), nvl(new.activity_name,old.activity_name), nvl(new.activity_type,old.activity_type), nvl(new.start_time,old.start_time), nvl(new.end_time,old.end_time), nvl(new.create_time,old.create_time), nvl(new.display_count,0), nvl(new.order_count,0), nvl(new.order_amount,0.0), nvl(new.payment_count,0), nvl(new.payment_amount,0.0), nvl(new.display_count,0)+nvl(old.display_count,0), nvl(new.order_count,0)+nvl(old.order_count,0), nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0), nvl(new.payment_count,0)+nvl(old.payment_count,0), nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0) from ( select * from dwt_activity_topic )old full outer join ( select * from dws_activity_info_daycount where dt='2020-06-14' )new on old.id=new.id;
⑤地区主题宽表
地区
编号、编码、iso编码、名称、省份
当天活跃设备数、下单次数、下单金额、支付次数、支付金额
最近30天活跃设备数、下单次数、下单金额、支付次数、支付金额
old full outer join today
1)建表语句 hive (gmall)> drop table if exists dwt_area_topic; create external table dwt_area_topic( `id` bigint COMMENT '编号', `province_name` string COMMENT '省份名称', `area_code` string COMMENT '地区编码', `iso_code` string COMMENT 'iso编码', `region_id` string COMMENT '地区ID', `region_name` string COMMENT '地区名称', `login_day_count` string COMMENT '当天活跃设备数', `login_last_30d_count` string COMMENT '最近30天活跃设备数', `order_day_count` bigint COMMENT '当天下单次数', `order_day_amount` decimal(16,2) COMMENT '当天下单金额', `order_last_30d_count` bigint COMMENT '最近30天下单次数', `order_last_30d_amount` decimal(16,2) COMMENT '最近30天下单金额', `payment_day_count` bigint COMMENT '当天支付次数', `payment_day_amount` decimal(16,2) COMMENT '当天支付金额', `payment_last_30d_count` bigint COMMENT '最近30天支付次数', `payment_last_30d_amount` decimal(16,2) COMMENT '最近30天支付金额' ) COMMENT '地区主题宽表' stored as parquet location '/warehouse/gmall/dwt/dwt_area_topic/' tblproperties ("parquet.compression"="lzo"); 2)数据装载 hive (gmall)> insert overwrite table dwt_area_topic select nvl(old.id,new.id), nvl(old.province_name,new.province_name), nvl(old.area_code,new.area_code), nvl(old.iso_code,new.iso_code), nvl(old.region_id,new.region_id), nvl(old.region_name,new.region_name), nvl(new.login_day_count,0), nvl(new.login_last_30d_count,0), nvl(new.order_day_count,0), nvl(new.order_day_amount,0.0), nvl(new.order_last_30d_count,0), nvl(new.order_last_30d_amount,0.0), nvl(new.payment_day_count,0), nvl(new.payment_day_amount,0.0), nvl(new.payment_last_30d_count,0), nvl(new.payment_last_30d_amount,0.0) from ( select * from dwt_area_topic )old full outer join ( select id, province_name, area_code, iso_code, region_id, region_name, sum(if(dt='2020-06-14',login_count,0)) login_day_count, sum(if(dt='2020-06-14',order_count,0)) order_day_count, sum(if(dt='2020-06-14',order_amount,0.0)) order_day_amount, sum(if(dt='2020-06-14',payment_count,0)) payment_day_count, sum(if(dt='2020-06-14',payment_amount,0.0)) payment_day_amount, sum(login_count) login_last_30d_count, sum(order_count) order_last_30d_count, sum(order_amount) order_last_30d_amount, sum(payment_count) payment_last_30d_count, sum(payment_amount) payment_last_30d_amount from dws_area_stats_daycount where dt>=date_add('2020-06-14',-30) group by id,province_name,area_code,iso_code,region_id,region_name )new on old.id=new.id;
DWT层数据导入脚本
1)在/home/kris/bin目录下创建脚本dws_to_dwt.sh
#!/bin/bash APP=gmall hive=/opt/module/hive/bin/hive # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" set mapreduce.job.queuename=hive; insert overwrite table ${APP}.dwt_uv_topic select nvl(new.mid_id,old.mid_id), nvl(new.model,old.model), nvl(new.brand,old.brand), if(old.mid_id is null,'$do_date',old.login_date_first), if(new.mid_id is not null,'$do_date',old.login_date_last), if(new.mid_id is not null, new.login_count,0), nvl(old.login_count,0)+if(new.login_count>0,1,0) from ( select * from ${APP}.dwt_uv_topic )old full outer join ( select * from ${APP}.dws_uv_detail_daycount where dt='$do_date' )new on old.mid_id=new.mid_id; insert overwrite table ${APP}.dwt_user_topic select nvl(new.user_id,old.user_id), if(old.login_date_first is null and new.login_count>0,'$do_date',old.login_date_first), if(new.login_count>0,'$do_date',old.login_date_last), nvl(old.login_count,0)+if(new.login_count>0,1,0), nvl(new.login_last_30d_count,0), if(old.order_date_first is null and new.order_count>0,'$do_date',old.order_date_first), if(new.order_count>0,'$do_date',old.order_date_last), nvl(old.order_count,0)+nvl(new.order_count,0), nvl(old.order_amount,0)+nvl(new.order_amount,0), nvl(new.order_last_30d_count,0), nvl(new.order_last_30d_amount,0), if(old.payment_date_first is null and new.payment_count>0,'$do_date',old.payment_date_first), if(new.payment_count>0,'$do_date',old.payment_date_last), nvl(old.payment_count,0)+nvl(new.payment_count,0), nvl(old.payment_amount,0)+nvl(new.payment_amount,0), nvl(new.payment_last_30d_count,0), nvl(new.payment_last_30d_amount,0) from ${APP}.dwt_user_topic old full outer join ( select user_id, sum(if(dt='$do_date',login_count,0)) login_count, sum(if(dt='$do_date',order_count,0)) order_count, sum(if(dt='$do_date',order_amount,0)) order_amount, sum(if(dt='$do_date',payment_count,0)) payment_count, sum(if(dt='$do_date',payment_amount,0)) payment_amount, sum(if(login_count>0,1,0)) login_last_30d_count, sum(order_count) order_last_30d_count, sum(order_amount) order_last_30d_amount, sum(payment_count) payment_last_30d_count, sum(payment_amount) payment_last_30d_amount from ${APP}.dws_user_action_daycount where dt>=date_add( '$do_date',-30) group by user_id )new on old.user_id=new.user_id; insert overwrite table ${APP}.dwt_sku_topic select nvl(new.sku_id,old.sku_id), sku_info.spu_id, nvl(new.order_count30,0), nvl(new.order_num30,0), nvl(new.order_amount30,0), nvl(old.order_count,0) + nvl(new.order_count,0), nvl(old.order_num,0) + nvl(new.order_num,0), nvl(old.order_amount,0) + nvl(new.order_amount,0), nvl(new.payment_count30,0), nvl(new.payment_num30,0), nvl(new.payment_amount30,0), nvl(old.payment_count,0) + nvl(new.payment_count,0), nvl(old.payment_num,0) + nvl(new.payment_count,0), nvl(old.payment_amount,0) + nvl(new.payment_count,0), nvl(new.refund_count30,0), nvl(new.refund_num30,0), nvl(new.refund_amount30,0), nvl(old.refund_count,0) + nvl(new.refund_count,0), nvl(old.refund_num,0) + nvl(new.refund_num,0), nvl(old.refund_amount,0) + nvl(new.refund_amount,0), nvl(new.cart_count30,0), nvl(old.cart_count,0) + nvl(new.cart_count,0), nvl(new.favor_count30,0), nvl(old.favor_count,0) + nvl(new.favor_count,0), nvl(new.appraise_good_count30,0), nvl(new.appraise_mid_count30,0), nvl(new.appraise_bad_count30,0), nvl(new.appraise_default_count30,0) , nvl(old.appraise_good_count,0) + nvl(new.appraise_good_count,0), nvl(old.appraise_mid_count,0) + nvl(new.appraise_mid_count,0), nvl(old.appraise_bad_count,0) + nvl(new.appraise_bad_count,0), nvl(old.appraise_default_count,0) + nvl(new.appraise_default_count,0) from ( select sku_id, spu_id, order_last_30d_count, order_last_30d_num, order_last_30d_amount, order_count, order_num, order_amount , payment_last_30d_count, payment_last_30d_num, payment_last_30d_amount, payment_count, payment_num, payment_amount, refund_last_30d_count, refund_last_30d_num, refund_last_30d_amount, refund_count, refund_num, refund_amount, cart_last_30d_count, cart_count, favor_last_30d_count, favor_count, appraise_last_30d_good_count, appraise_last_30d_mid_count, appraise_last_30d_bad_count, appraise_last_30d_default_count, appraise_good_count, appraise_mid_count, appraise_bad_count, appraise_default_count from ${APP}.dwt_sku_topic )old full outer join ( select sku_id, sum(if(dt='$do_date', order_count,0 )) order_count, sum(if(dt='$do_date',order_num ,0 )) order_num, sum(if(dt='$do_date',order_amount,0 )) order_amount , sum(if(dt='$do_date',payment_count,0 )) payment_count, sum(if(dt='$do_date',payment_num,0 )) payment_num, sum(if(dt='$do_date',payment_amount,0 )) payment_amount, sum(if(dt='$do_date',refund_count,0 )) refund_count, sum(if(dt='$do_date',refund_num,0 )) refund_num, sum(if(dt='$do_date',refund_amount,0 )) refund_amount, sum(if(dt='$do_date',cart_count,0 )) cart_count, sum(if(dt='$do_date',favor_count,0 )) favor_count, sum(if(dt='$do_date',appraise_good_count,0 )) appraise_good_count, sum(if(dt='$do_date',appraise_mid_count,0 ) ) appraise_mid_count , sum(if(dt='$do_date',appraise_bad_count,0 )) appraise_bad_count, sum(if(dt='$do_date',appraise_default_count,0 )) appraise_default_count, sum(order_count) order_count30 , sum(order_num) order_num30, sum(order_amount) order_amount30, sum(payment_count) payment_count30, sum(payment_num) payment_num30, sum(payment_amount) payment_amount30, sum(refund_count) refund_count30, sum(refund_num) refund_num30, sum(refund_amount) refund_amount30, sum(cart_count) cart_count30, sum(favor_count) favor_count30, sum(appraise_good_count) appraise_good_count30, sum(appraise_mid_count) appraise_mid_count30, sum(appraise_bad_count) appraise_bad_count30, sum(appraise_default_count) appraise_default_count30 from ${APP}.dws_sku_action_daycount where dt >= date_add ('$do_date', -30) group by sku_id )new on new.sku_id = old.sku_id left join (select * from ${APP}.dwd_dim_sku_info where dt='$do_date') sku_info on nvl(new.sku_id,old.sku_id)= sku_info.id; insert overwrite table ${APP}.dwt_activity_topic select nvl(new.id,old.id), nvl(new.activity_name,old.activity_name), nvl(new.activity_type,old.activity_type), nvl(new.start_time,old.start_time), nvl(new.end_time,old.end_time), nvl(new.create_time,old.create_time), nvl(new.display_count,0), nvl(new.order_count,0), nvl(new.order_amount,0.0), nvl(new.payment_count,0), nvl(new.payment_amount,0.0), nvl(new.display_count,0)+nvl(old.display_count,0), nvl(new.order_count,0)+nvl(old.order_count,0), nvl(new.order_amount,0.0)+nvl(old.order_amount,0.0), nvl(new.payment_count,0)+nvl(old.payment_count,0), nvl(new.payment_amount,0.0)+nvl(old.payment_amount,0.0) from ( select * from ${APP}.dwt_activity_topic )old full outer join ( select * from ${APP}.dws_activity_info_daycount where dt='$do_date' )new on old.id=new.id; insert overwrite table ${APP}.dwt_area_topic select nvl(old.id,new.id), nvl(old.province_name,new.province_name), nvl(old.area_code,new.area_code), nvl(old.iso_code,new.iso_code), nvl(old.region_id,new.region_id), nvl(old.region_name,new.region_name), nvl(new.login_day_count,0), nvl(new.login_last_30d_count,0), nvl(new.order_day_count,0), nvl(new.order_day_amount,0.0), nvl(new.order_last_30d_count,0), nvl(new.order_last_30d_amount,0.0), nvl(new.payment_day_count,0), nvl(new.payment_day_amount,0.0), nvl(new.payment_last_30d_count,0), nvl(new.payment_last_30d_amount,0.0) from ( select * from ${APP}.dwt_area_topic )old full outer join ( select id, province_name, area_code, iso_code, region_id, region_name, sum(if(dt='$do_date',login_count,0)) login_day_count, sum(if(dt='$do_date',order_count,0)) order_day_count, sum(if(dt='$do_date',order_amount,0.0)) order_day_amount, sum(if(dt='$do_date',payment_count,0)) payment_day_count, sum(if(dt='$do_date',payment_amount,0.0)) payment_day_amount, sum(login_count) login_last_30d_count, sum(order_count) order_last_30d_count, sum(order_amount) order_last_30d_amount, sum(payment_count) payment_last_30d_count, sum(payment_amount) payment_last_30d_amount from ${APP}.dws_area_stats_daycount where dt>=date_add('$do_date',-30) group by id,province_name,area_code,iso_code,region_id,region_name )new on old.id=new.id; " $hive -e "$sql"
5.ADS层
①设备主题
活跃设备数(日、周、月)
需求定义:
- 日活:当日活跃的设备数
- 周活:当周活跃的设备数
- 月活:当月活跃的设备数
hive (gmall)> drop table if exists ads_uv_count; create external table ads_uv_count( `dt` string COMMENT '统计日期', `day_count` bigint COMMENT '当日用户数量', `wk_count` bigint COMMENT '当周用户数量', `mn_count` bigint COMMENT '当月用户数量', `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果', `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果' ) COMMENT '活跃设备数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_uv_count/'; 2)导入数据 hive (gmall)> insert into table ads_uv_count select '2020-06-14' dt, daycount.ct, wkcount.ct, mncount.ct, if(date_add(next_day('2020-06-14','MO'),-1)='2020-06-14','Y','N') , if(last_day('2020-06-14')='2020-06-14','Y','N') from --日活 ( select '2020-06-14' dt, count(*) ct from dwt_uv_topic where login_date_last='2020-06-14' )daycount join --周活 ( select '2020-06-14' dt, count (*) ct from dwt_uv_topic where login_date_last>=date_add(next_day('2020-06-14','MO'),-7) and login_date_last<= date_add(next_day('2020-06-14','MO'),-1) ) wkcount on daycount.dt=wkcount.dt join --月活 ( select '2020-06-14' dt, count (*) ct from dwt_uv_topic where date_format(login_date_last,'yyyy-MM')=date_format('2020-06-14','yyyy-MM') )mncount on daycount.dt=mncount.dt;
每日新增设备
1)建表语句 hive (gmall)> drop table if exists ads_new_mid_count; create external table ads_new_mid_count ( `create_date` string comment '创建时间' , `new_mid_count` BIGINT comment '新增设备数量' ) COMMENT '每日新增设备数量' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_new_mid_count/'; 2)导入数据 hive (gmall)> insert into table ads_new_mid_count select '2020-06-14', count(*) from dwt_uv_topic where login_date_first='2020-06-14';
留存率
login_date_first首次登陆时间
==> n日留存率 ( 留存率 = 留存数量 / 设备新增数量 )
login_date_last 末次登陆时间
1)建表语句 hive (gmall)> drop table if exists ads_user_retention_day_rate; create external table ads_user_retention_day_rate ( `stat_date` string comment '统计日期', `create_date` string comment '设备新增日期', `retention_day` int comment '截止当前日期留存天数', `retention_count` bigint comment '留存数量', `new_mid_count` bigint comment '设备新增数量', `retention_ratio` decimal(16,2) comment '留存率' ) COMMENT '留存率' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_user_retention_day_rate/'; 2)导入数据 hive (gmall)> insert into table ads_user_retention_day_rate select '2020-06-15', date_add('2020-06-15',-1), 1,--留存天数 sum(if(login_date_first=date_add('2020-06-15',-1) and login_date_last='2020-06-15',1,0)), sum(if(login_date_first=date_add('2020-06-15',-1),1,0)), sum(if(login_date_first=date_add('2020-06-15',-1) and login_date_last='2020-06-15',1,0))/sum(if(login_date_first=date_add('2020-06-15',-1),1,0))*100 from dwt_uv_topic union all select '2020-06-15', date_add('2020-06-15',-2), 2, sum(if(login_date_first=date_add('2020-06-15',-2) and login_date_last='2020-06-15',1,0)), sum(if(login_date_first=date_add('2020-06-15',-2),1,0)), sum(if(login_date_first=date_add('2020-06-15',-2) and login_date_last='2020-06-15',1,0))/sum(if(login_date_first=date_add('2020-06-15',-2),1,0))*100 from dwt_uv_topic union all select '2020-06-15', date_add('2020-06-15',-3), 3, sum(if(login_date_first=date_add('2020-06-15',-3) and login_date_last='2020-06-15',1,0)), sum(if(login_date_first=date_add('2020-06-15',-3),1,0)), sum(if(login_date_first=date_add('2020-06-15',-3) and login_date_last='2020-06-15',1,0))/sum(if(login_date_first=date_add('2020-06-15',-3),1,0))*100 from dwt_uv_topic;
沉默用户数
需求定义:
沉默用户:只在安装当天启动过,且启动时间是在7天前
1)建表语句 hive (gmall)> drop table if exists ads_silent_count; create external table ads_silent_count( `dt` string COMMENT '统计日期', `silent_count` bigint COMMENT '沉默设备数' ) COMMENT '沉默用户数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_silent_count'; 2)导入2020-06-25数据 hive (gmall)> insert into table ads_silent_count select '2020-06-25', count(*) from dwt_uv_topic where login_date_first=login_date_last and login_date_last<=date_add('2020-06-25',-7);
本周回流用户数
需求定义:
本周回流用户:上周未活跃,本周活跃的设备,且不是本周新增设备
1)建表语句 hive (gmall)> drop table if exists ads_back_count; create external table ads_back_count( `dt` string COMMENT '统计日期', `wk_dt` string COMMENT '统计日期所在周', `wastage_count` bigint COMMENT '回流设备数' ) COMMENT '本周回流用户数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_back_count'; 2)导入数据: hive (gmall)> insert into table ads_back_count select '2020-06-25', concat(date_add(next_day('2020-06-25','MO'),-7),'_', date_add(next_day('2020-06-25','MO'),-1)), count(*) --上周不活跃, 且 最后登陆时间在 本周范围内 且第一次登陆不在本周; from ( select mid_id from dwt_uv_topic --设备主题, 筛选 最后登陆时间在 本周范围内 且第一次登陆不在本周; where login_date_last>=date_add(next_day('2020-06-25','MO'),-7) -- 本周的周一 and login_date_last<= date_add(next_day('2020-06-25','MO'),-1) -- 本周的周日 and login_date_first<date_add(next_day('2020-06-25','MO'),-7) -- 本周的周一 )current_wk left join ( select mid_id from dws_uv_detail_daycount --每日设备行为 where dt>=date_add(next_day('2020-06-25','MO'),-7*2) --当前日期的上周的周一 and dt<= date_add(next_day('2020-06-25','MO'),-7-1) --当前日期的上周的周日 group by mid_id )last_wk on current_wk.mid_id=last_wk.mid_id where last_wk.mid_id is null;
流失用户数
需求定义:
流失用户:最近7天未活跃的设备
1)建表语句 hive (gmall)> drop table if exists ads_wastage_count; create external table ads_wastage_count( `dt` string COMMENT '统计日期', `wastage_count` bigint COMMENT '流失设备数' ) COMMENT '流失用户数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_wastage_count'; 2)导入2020-06-25数据 hive (gmall)> insert into table ads_wastage_count select '2020-06-25', count(*) from ( select mid_id from dwt_uv_topic where login_date_last<=date_add('2020-06-25',-7) group by mid_id )t1;
最近连续三周活跃用户数
1)建表语句 hive (gmall)> drop table if exists ads_continuity_wk_count; create external table ads_continuity_wk_count( `dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期', `wk_dt` string COMMENT '持续时间', `continuity_count` bigint COMMENT '活跃次数' ) COMMENT '最近连续三周活跃用户数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_continuity_wk_count'; 2)导入2020-06-25所在周的数据 hive (gmall)> insert into table ads_continuity_wk_count select '2020-06-25', concat(date_add(next_day('2020-06-25','MO'),-7*3),'_',date_add(next_day('2020-06-25','MO'),-1)), count(*) from ( select mid_id from ( --本周活跃的 mid_id select mid_id from dws_uv_detail_daycount where dt>=date_add(next_day('2020-06-25','monday'),-7) --本周的周一 and dt<=date_add(next_day('2020-06-25','monday'),-1) --本周的周日 group by mid_id union all --上周活跃的 mid_id select mid_id from dws_uv_detail_daycount where dt>=date_add(next_day('2020-06-25','monday'),-7*2) --上周的周一 and dt<=date_add(next_day('2020-06-25','monday'),-7-1) --上周的周日 group by mid_id union all --上上周活跃的 mid_id select mid_id from dws_uv_detail_daycount where dt>=date_add(next_day('2020-06-25','monday'),-7*3) --上上周的周一 and dt<=date_add(next_day('2020-06-25','monday'),-7*2-1) --上上周的周日 group by mid_id )t1 group by mid_id having count(*)=3 )t2;
最近七天内连续三天活跃用户数
1)建表语句 hive (gmall)> drop table if exists ads_continuity_uv_count; create external table ads_continuity_uv_count( `dt` string COMMENT '统计日期', `wk_dt` string COMMENT '最近7天日期', `continuity_count` bigint ) COMMENT '最近七天内连续三天活跃用户数' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_continuity_uv_count'; 2)写出导入数据的SQL语句 hive (gmall)> insert into table ads_continuity_uv_count select '2020-06-16', concat(date_add('2020-06-16',-6),'_','2020-06-16'), count(*) from ( select mid_id from ( select mid_id from ( select mid_id, date_sub(dt,rank) date_dif --dt与 rank排名的差值 from ( --mid_id 按dt升序排名 select mid_id, dt, rank() over(partition by mid_id order by dt) rank from dws_uv_detail_daycount where dt>=date_add('2020-06-16',-6) and dt<='2020-06-16' --dt 为当前日期往前推 6天 即连续7天内 )t1 )t2 group by mid_id,date_dif having count(*)>=3 )t3 group by mid_id )t4;
②会员主题
会员信息
会员信息
活跃、新增、新增消费、总付费、总 会员数
会员活跃率
会员付费率
会员新鲜度
1)建表 hive (gmall)> drop table if exists ads_user_topic; create external table ads_user_topic( `dt` string COMMENT '统计日期', `day_users` string COMMENT '活跃会员数', `day_new_users` string COMMENT '新增会员数', `day_new_payment_users` string COMMENT '新增消费会员数', `payment_users` string COMMENT '总付费会员数', `users` string COMMENT '总会员数', `day_users2users` decimal(16,2) COMMENT '会员活跃率', -- 活跃会员数 /总会员数 `payment_users2users` decimal(16,2) COMMENT '会员付费率', --总付费会员数 /总会员数 `day_new_users2users` decimal(16,2) COMMENT '会员新鲜度' --新增会员数/ 总会员数 ) COMMENT '会员信息表' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_user_topic'; 2)导入数据 hive (gmall)> insert into table ads_user_topic select '2020-06-14', sum(if(login_date_last='2020-06-14',1,0)), sum(if(login_date_first='2020-06-14',1,0)), sum(if(payment_date_first='2020-06-14',1,0)), sum(if(payment_count>0,1,0)), count(*), sum(if(login_date_last='2020-06-14',1,0))/count(*), sum(if(payment_count>0,1,0))/count(*), sum(if(login_date_first='2020-06-14',1,0))/sum(if(login_date_last='2020-06-14',1,0)) from dwt_user_topic;
漏斗分析
统计“浏览首页->浏览商品详情页->加入购物车->下单->支付”的转化率
思路:统计各个行为的人数,然后计算比值。
1)建表语句 hive (gmall)> drop table if exists ads_user_action_convert_day; create external table ads_user_action_convert_day( `dt` string COMMENT '统计日期', `home_count` bigint COMMENT '浏览首页人数', `good_detail_count` bigint COMMENT '浏览商品详情页人数', `home2good_detail_convert_ratio` decimal(16,2) COMMENT '首页到商品详情转化率', `cart_count` bigint COMMENT '加入购物车的人数', `good_detail2cart_convert_ratio` decimal(16,2) COMMENT '商品详情页到加入购物车转化率', `order_count` bigint COMMENT '下单人数', `cart2order_convert_ratio` decimal(16,2) COMMENT '加入购物车到下单转化率', `payment_amount` bigint COMMENT '支付人数', `order2payment_convert_ratio` decimal(16,2) COMMENT '下单到支付的转化率' ) COMMENT '漏斗分析' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_user_action_convert_day/'; 2)数据装载 hive (gmall)> with tmp_uv as ( select '2020-06-14' dt, sum(if(array_contains(pages,'home'),1,0)) home_count, sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count from ( select mid_id, collect_set(page_id) pages from dwd_page_log where dt='2020-06-14' and page_id in ('home','good_detail') --page_id = 'home' 浏览首页 | page_id = 'good_detail'浏览商品详情页 group by mid_id )tmp ), tmp_cop as ( select '2020-06-14' dt, sum(if(cart_count>0,1,0)) cart_count, sum(if(order_count>0,1,0)) order_count, sum(if(payment_count>0,1,0)) payment_count from dws_user_action_daycount where dt='2020-06-14' ) insert into table ads_user_action_convert_day select tmp_uv.dt, tmp_uv.home_count, --浏览首页人数 tmp_uv.good_detail_count, --浏览商品详情页人数 tmp_uv.good_detail_count/tmp_uv.home_count*100, --首页到商品详情 转化率 tmp_cop.cart_count, --加入购物车的人数 tmp_cop.cart_count/tmp_uv.good_detail_count*100,--商品详情页到加入购物车 转化率 tmp_cop.order_count, --下单人数 tmp_cop.order_count/tmp_cop.cart_count*100, --加入购物车到下单转化率 tmp_cop.payment_count, --支付人数 tmp_cop.payment_count/tmp_cop.order_count*100 --下单到支付的转化率 from tmp_uv join tmp_cop on tmp_uv.dt=tmp_cop.dt;
③商品主题
商品个数信息
1)建表语句 hive (gmall)> drop table if exists ads_product_info; create external table ads_product_info( `dt` string COMMENT '统计日期', `sku_num` string COMMENT 'sku个数', `spu_num` string COMMENT 'spu个数' ) COMMENT '商品个数信息' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_product_info'; 2)导入数据 hive (gmall)> insert into table ads_product_info select '2020-06-14' dt, sku_num, spu_num from ( select '2020-06-14' dt, count(*) sku_num from dwt_sku_topic ) tmp_sku_num join ( select '2020-06-14' dt, count(*) spu_num from ( select spu_id from dwt_sku_topic group by spu_id ) tmp_spu_id ) tmp_spu_num on tmp_sku_num.dt=tmp_spu_num.dt;
商品销量排名
1)建表语句 hive (gmall)> drop table if exists ads_product_sale_topN; create external table ads_product_sale_topN( `dt` string COMMENT '统计日期', `sku_id` string COMMENT '商品ID', `payment_amount` bigint COMMENT '销量' ) COMMENT '商品销量排名' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_product_sale_topN'; 2)导入数据 hive (gmall)> insert into table ads_product_sale_topN select '2020-06-14' dt, sku_id, payment_amount from dws_sku_action_daycount where dt='2020-06-14' order by payment_amount desc limit 10;
商品收藏排名
1)建表语句 hive (gmall)> drop table if exists ads_product_favor_topN; create external table ads_product_favor_topN( `dt` string COMMENT '统计日期', `sku_id` string COMMENT '商品ID', `favor_count` bigint COMMENT '收藏量' ) COMMENT '商品收藏排名' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_product_favor_topN'; 2)导入数据 hive (gmall)> insert into table ads_product_favor_topN select '2020-06-14' dt, sku_id, favor_count from dws_sku_action_daycount where dt='2020-06-14' order by favor_count desc limit 10;
商品加入购物车排名
1)建表语句 hive (gmall)> drop table if exists ads_product_cart_topN; create external table ads_product_cart_topN( `dt` string COMMENT '统计日期', `sku_id` string COMMENT '商品ID', `cart_count` bigint COMMENT '加入购物车次数' ) COMMENT '商品加入购物车排名' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_product_cart_topN'; 2)导入数据 hive (gmall)> insert into table ads_product_cart_topN select '2020-06-14' dt, sku_id, cart_count from dws_sku_action_daycount where dt='2020-06-14' order by cart_count desc limit 10;
商品退款率排名(最近30天)
1)建表语句 hive (gmall)> drop table if exists ads_product_refund_topN; create external table ads_product_refund_topN( `dt` string COMMENT '统计日期', `sku_id` string COMMENT '商品ID', `refund_ratio` decimal(16,2) COMMENT '退款率' ) COMMENT '商品退款率排名' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_product_refund_topN'; 2)导入数据 hive (gmall)> insert into table ads_product_refund_topN select '2020-06-14', sku_id, refund_last_30d_count/payment_last_30d_count*100 refund_ratio from dwt_sku_topic order by refund_ratio desc limit 10;
商品差评率
1)建表语句 hive (gmall)> drop table if exists ads_appraise_bad_topN; create external table ads_appraise_bad_topN( `dt` string COMMENT '统计日期', `sku_id` string COMMENT '商品ID', `appraise_bad_ratio` decimal(16,2) COMMENT '差评率' ) COMMENT '商品差评率' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_appraise_bad_topN'; 2)导入数据 hive (gmall)> insert into table ads_appraise_bad_topN select '2020-06-14' dt, sku_id, appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio from dws_sku_action_daycount where dt='2020-06-14' order by appraise_bad_ratio desc limit 10;
④营销主题(用户+商品+购买行为)
下单数目统计
需求分析:统计每日下单数,下单金额及下单用户数。
1)建表语句 hive (gmall)> drop table if exists ads_order_daycount; create external table ads_order_daycount( dt string comment '统计日期', order_count bigint comment '单日下单笔数', order_amount bigint comment '单日下单金额', order_users bigint comment '单日下单用户数' ) comment '下单数目统计' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_order_daycount'; 2)导入数据 hive (gmall)> insert into table ads_order_daycount select '2020-06-14', sum(order_count), sum(order_amount), sum(if(order_count>0,1,0)) from dws_user_action_daycount where dt='2020-06-14';
支付信息统计
每日支付金额、支付人数、支付商品数、支付笔数以及下单到支付的平均时长(取自DWD)
1)建表 hive (gmall)> drop table if exists ads_payment_daycount; create external table ads_payment_daycount( dt string comment '统计日期', order_count bigint comment '单日支付笔数', order_amount bigint comment '单日支付金额', payment_user_count bigint comment '单日支付人数', payment_sku_count bigint comment '单日支付商品数', payment_avg_time decimal(16,2) comment '下单到支付的平均时长,取分钟数' ) comment '支付信息统计' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_payment_daycount'; 2)导入数据 hive (gmall)> insert into table ads_payment_daycount select tmp_payment.dt, tmp_payment.payment_count, tmp_payment.payment_amount, tmp_payment.payment_user_count, tmp_skucount.payment_sku_count, tmp_time.payment_avg_time from ( select '2020-06-14' dt, sum(payment_count) payment_count, --单日支付笔数 sum(payment_amount) payment_amount, --单日支付金额 sum(if(payment_count>0,1,0)) payment_user_count --单日支付人数 from dws_user_action_daycount where dt='2020-06-14' )tmp_payment join ( select '2020-06-14' dt, sum(if(payment_count>0,1,0)) payment_sku_count --单日支付商品数 from dws_sku_action_daycount where dt='2020-06-14' )tmp_skucount on tmp_payment.dt=tmp_skucount.dt join ( select '2020-06-14' dt, sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time --下单到支付的平均时长 from dwd_fact_order_info where dt='2020-06-14' and payment_time is not null )tmp_time on tmp_payment.dt=tmp_time.dt;
品牌复购率
1)建表语句 hive (gmall)> drop table ads_sale_tm_category1_stat_mn; create external table ads_sale_tm_category1_stat_mn ( tm_id string comment '品牌id', category1_id string comment '1级品类id ', category1_name string comment '1级品类名称 ', buycount bigint comment '购买人数', buy_twice_last bigint comment '两次以上购买人数', buy_twice_last_ratio decimal(16,2) comment '单次复购率', buy_3times_last bigint comment '三次以上购买人数', buy_3times_last_ratio decimal(16,2) comment '多次复购率', stat_mn string comment '统计月份', stat_date string comment '统计日期' ) COMMENT '品牌复购率统计' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/'; 2)数据导入 hive (gmall)> with tmp_order as ( select user_id, order_stats_struct.sku_id sku_id, order_stats_struct.order_count order_count from dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct where date_format(dt,'yyyy-MM')=date_format('2020-06-14','yyyy-MM') ), tmp_sku as ( select id, tm_id, category1_id, category1_name from dwd_dim_sku_info where dt='2020-06-14' ) insert into table ads_sale_tm_category1_stat_mn select tm_id, category1_id, category1_name, sum(if(order_count>=1,1,0)) buycount, sum(if(order_count>=2,1,0)) buyTwiceLast, sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio, sum(if(order_count>=3,1,0)) buy3timeLast , sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio , date_format('2020-06-14' ,'yyyy-MM') stat_mn, '2020-06-14' stat_date from ( select tmp_order.user_id, tmp_sku.category1_id, tmp_sku.category1_name, tmp_sku.tm_id, sum(order_count) order_count from tmp_order join tmp_sku on tmp_order.sku_id=tmp_sku.id group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id )tmp group by tm_id, category1_id, category1_name;
⑤地区主题
地区主题信息
1)建表语句 hive (gmall)> drop table if exists ads_area_topic; create external table ads_area_topic( `dt` string COMMENT '统计日期', `id` bigint COMMENT '编号', `province_name` string COMMENT '省份名称', `area_code` string COMMENT '地区编码', `iso_code` string COMMENT 'iso编码', `region_id` string COMMENT '地区ID', `region_name` string COMMENT '地区名称', `login_day_count` bigint COMMENT '当天活跃设备数', `order_day_count` bigint COMMENT '当天下单次数', `order_day_amount` decimal(16,2) COMMENT '当天下单金额', `payment_day_count` bigint COMMENT '当天支付次数', `payment_day_amount` decimal(16,2) COMMENT '当天支付金额' ) COMMENT '地区主题信息' row format delimited fields terminated by '\t' location '/warehouse/gmall/ads/ads_area_topic/'; 2)数据装载 hive (gmall)> insert into table ads_area_topic select '2020-06-14', id, province_name, area_code, iso_code, region_id, region_name, login_day_count, order_day_count, order_day_amount, payment_day_count, payment_day_amount from dwt_area_topic;
ADS层导入脚本
1)在/home/kris/bin目录下创建脚本dwt_to_ads.sh
#!/bin/bash hive=/opt/module/hive/bin/hive APP=gmall # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi sql=" set mapreduce.job.queuename=hive; insert into table ${APP}.ads_uv_count select '$do_date' dt, daycount.ct, wkcount.ct, mncount.ct, if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') , if(last_day('$do_date')='$do_date','Y','N') from ( select '$do_date' dt, count(*) ct from ${APP}.dwt_uv_topic where login_date_last='$do_date' )daycount join ( select '$do_date' dt, count (*) ct from ${APP}.dwt_uv_topic where login_date_last>=date_add(next_day('$do_date','MO'),-7) and login_date_last<= date_add(next_day('$do_date','MO'),-1) ) wkcount on daycount.dt=wkcount.dt join ( select '$do_date' dt, count (*) ct from ${APP}.dwt_uv_topic where date_format(login_date_last,'yyyy-MM')=date_format('$do_date','yyyy-MM') )mncount on daycount.dt=mncount.dt; insert into table ${APP}.ads_new_mid_count select login_date_first, count(*) from ${APP}.dwt_uv_topic where login_date_first='$do_date' group by login_date_first; insert into table ${APP}.ads_silent_count select '$do_date', count(*) from ${APP}.dwt_uv_topic where login_date_first=login_date_last and login_date_last<=date_add('$do_date',-7); insert into table ${APP}.ads_back_count select '$do_date', concat(date_add(next_day('$do_date','MO'),-7),'_', date_add(next_day('$do_date','MO'),-1)), count(*) from ( select mid_id from ${APP}.dwt_uv_topic where login_date_last>=date_add(next_day('$do_date','MO'),-7) and login_date_last<= date_add(next_day('$do_date','MO'),-1) and login_date_first<date_add(next_day('$do_date','MO'),-7) )current_wk left join ( select mid_id from ${APP}.dws_uv_detail_daycount where dt>=date_add(next_day('$do_date','MO'),-7*2) and dt<= date_add(next_day('$do_date','MO'),-7-1) group by mid_id )last_wk on current_wk.mid_id=last_wk.mid_id where last_wk.mid_id is null; insert into table ${APP}.ads_wastage_count select '$do_date', count(*) from ( select mid_id from ${APP}.dwt_uv_topic where login_date_last<=date_add('$do_date',-7) group by mid_id )t1; insert into table ${APP}.ads_user_retention_day_rate select '$do_date',--统计日期 date_add('$do_date',-1),--新增日期 1,--留存天数 sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0)),--$do_date的1日留存数 sum(if(login_date_first=date_add('$do_date',-1),1,0)),--$do_date新增 sum(if(login_date_first=date_add('$do_date',-1) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-1),1,0))*100 from ${APP}.dwt_uv_topic union all select '$do_date',--统计日期 date_add('$do_date',-2),--新增日期 2,--留存天数 sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0)),--$do_date的2日留存数 sum(if(login_date_first=date_add('$do_date',-2),1,0)),--$do_date新增 sum(if(login_date_first=date_add('$do_date',-2) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-2),1,0))*100 from ${APP}.dwt_uv_topic union all select '$do_date',--统计日期 date_add('$do_date',-3),--新增日期 3,--留存天数 sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0)),--$do_date的3日留存数 sum(if(login_date_first=date_add('$do_date',-3),1,0)),--$do_date新增 sum(if(login_date_first=date_add('$do_date',-3) and login_date_last='$do_date',1,0))/sum(if(login_date_first=date_add('$do_date',-3),1,0))*100 from ${APP}.dwt_uv_topic; insert into table ${APP}.ads_continuity_wk_count select '$do_date', concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)), count(*) from ( select mid_id from ( select mid_id from ${APP}.dws_uv_detail_daycount where dt>=date_add(next_day('$do_date','monday'),-7) and dt<=date_add(next_day('$do_date','monday'),-1) group by mid_id union all select mid_id from ${APP}.dws_uv_detail_daycount where dt>=date_add(next_day('$do_date','monday'),-7*2) and dt<=date_add(next_day('$do_date','monday'),-7-1) group by mid_id union all select mid_id from ${APP}.dws_uv_detail_daycount where dt>=date_add(next_day('$do_date','monday'),-7*3) and dt<=date_add(next_day('$do_date','monday'),-7*2-1) group by mid_id )t1 group by mid_id having count(*)=3 )t2; insert into table ${APP}.ads_continuity_uv_count select '$do_date', concat(date_add('$do_date',-6),'_','$do_date'), count(*) from ( select mid_id from ( select mid_id from ( select mid_id, date_sub(dt,rank) date_dif from ( select mid_id, dt, rank() over(partition by mid_id order by dt) rank from ${APP}.dws_uv_detail_daycount where dt>=date_add('$do_date',-6) and dt<='$do_date' )t1 )t2 group by mid_id,date_dif having count(*)>=3 )t3 group by mid_id )t4; insert into table ${APP}.ads_user_topic select '$do_date', sum(if(login_date_last='$do_date',1,0)), sum(if(login_date_first='$do_date',1,0)), sum(if(payment_date_first='$do_date',1,0)), sum(if(payment_count>0,1,0)), count(*), sum(if(login_date_last='$do_date',1,0))/count(*), sum(if(payment_count>0,1,0))/count(*), sum(if(login_date_first='$do_date',1,0))/sum(if(login_date_last='$do_date',1,0)) from ${APP}.dwt_user_topic; with tmp_uv as ( select '$do_date' dt, sum(if(array_contains(pages,'home'),1,0)) home_count, sum(if(array_contains(pages,'good_detail'),1,0)) good_detail_count from ( select mid_id, collect_set(page_id) pages from ${APP}.dwd_page_log where dt='$do_date' and page_id in ('home','good_detail') group by mid_id )tmp ), tmp_cop as ( select '$do_date' dt, sum(if(cart_count>0,1,0)) cart_count, sum(if(order_count>0,1,0)) order_count, sum(if(payment_count>0,1,0)) payment_count from ${APP}.dws_user_action_daycount where dt='$do_date' ) insert into table ${APP}.ads_user_action_convert_day select tmp_uv.dt, tmp_uv.home_count, tmp_uv.good_detail_count, tmp_uv.good_detail_count/tmp_uv.home_count*100, tmp_cop.cart_count, tmp_cop.cart_count/tmp_uv.good_detail_count*100, tmp_cop.order_count, tmp_cop.order_count/tmp_cop.cart_count*100, tmp_cop.payment_count, tmp_cop.payment_count/tmp_cop.order_count*100 from tmp_uv join tmp_cop on tmp_uv.dt=tmp_cop.dt; insert into table ${APP}.ads_product_info select '$do_date' dt, sku_num, spu_num from ( select '$do_date' dt, count(*) sku_num from ${APP}.dwt_sku_topic ) tmp_sku_num join ( select '$do_date' dt, count(*) spu_num from ( select spu_id from ${APP}.dwt_sku_topic group by spu_id ) tmp_spu_id ) tmp_spu_num on tmp_sku_num.dt=tmp_spu_num.dt; insert into table ${APP}.ads_product_sale_topN select '$do_date' dt, sku_id, payment_amount from ${APP}.dws_sku_action_daycount where dt='$do_date' order by payment_amount desc limit 10; insert into table ${APP}.ads_product_favor_topN select '$do_date' dt, sku_id, favor_count from ${APP}.dws_sku_action_daycount where dt='$do_date' order by favor_count desc limit 10; insert into table ${APP}.ads_product_cart_topN select '$do_date' dt, sku_id, cart_count from ${APP}.dws_sku_action_daycount where dt='$do_date' order by cart_count desc limit 10; insert into table ${APP}.ads_product_refund_topN select '$do_date', sku_id, refund_last_30d_count/payment_last_30d_count*100 refund_ratio from ${APP}.dwt_sku_topic order by refund_ratio desc limit 10; insert into table ${APP}.ads_appraise_bad_topN select '$do_date' dt, sku_id, appraise_bad_count/(appraise_good_count+appraise_mid_count+appraise_bad_count+appraise_default_count) appraise_bad_ratio from ${APP}.dws_sku_action_daycount where dt='$do_date' order by appraise_bad_ratio desc limit 10; insert into table ${APP}.ads_order_daycount select '$do_date', sum(order_count), sum(order_amount), sum(if(order_count>0,1,0)) from ${APP}.dws_user_action_daycount where dt='$do_date'; insert into table ${APP}.ads_payment_daycount select tmp_payment.dt, tmp_payment.payment_count, tmp_payment.payment_amount, tmp_payment.payment_user_count, tmp_skucount.payment_sku_count, tmp_time.payment_avg_time from ( select '$do_date' dt, sum(payment_count) payment_count, sum(payment_amount) payment_amount, sum(if(payment_count>0,1,0)) payment_user_count from ${APP}.dws_user_action_daycount where dt='$do_date' )tmp_payment join ( select '$do_date' dt, sum(if(payment_count>0,1,0)) payment_sku_count from ${APP}.dws_sku_action_daycount where dt='$do_date' )tmp_skucount on tmp_payment.dt=tmp_skucount.dt join ( select '$do_date' dt, sum(unix_timestamp(payment_time)-unix_timestamp(create_time))/count(*)/60 payment_avg_time from ${APP}.dwd_fact_order_info where dt='$do_date' and payment_time is not null )tmp_time on tmp_payment.dt=tmp_time.dt; with tmp_order as ( select user_id, order_stats_struct.sku_id sku_id, order_stats_struct.order_count order_count from ${APP}.dws_user_action_daycount lateral view explode(order_detail_stats) tmp as order_stats_struct where date_format(dt,'yyyy-MM')=date_format('$do_date','yyyy-MM') ), tmp_sku as ( select id, tm_id, category1_id, category1_name from ${APP}.dwd_dim_sku_info where dt='$do_date' ) insert into table ${APP}.ads_sale_tm_category1_stat_mn select tm_id, category1_id, category1_name, sum(if(order_count>=1,1,0)) buycount, sum(if(order_count>=2,1,0)) buyTwiceLast, sum(if(order_count>=2,1,0))/sum( if(order_count>=1,1,0)) buyTwiceLastRatio, sum(if(order_count>=3,1,0)) buy3timeLast , sum(if(order_count>=3,1,0))/sum( if(order_count>=1,1,0)) buy3timeLastRatio , date_format('$do_date' ,'yyyy-MM') stat_mn, '$do_date' stat_date from ( select tmp_order.user_id, tmp_sku.category1_id, tmp_sku.category1_name, tmp_sku.tm_id, sum(order_count) order_count from tmp_order join tmp_sku on tmp_order.sku_id=tmp_sku.id group by tmp_order.user_id,tmp_sku.category1_id,tmp_sku.category1_name,tmp_sku.tm_id )tmp group by tm_id, category1_id, category1_name; insert into table ${APP}.ads_area_topic select '$do_date', id, province_name, area_code, iso_code, region_id, region_name, login_day_count, order_day_count, order_day_amount, payment_day_count, payment_day_amount from ${APP}.dwt_area_topic; " $hive -e "$sql"
6.Sqoop导出脚本
1)编写Sqoop导出脚本
在/home/kris/bin目录下创建脚本hdfs_to_mysql.sh
#!/bin/bash hive_db_name=gmall mysql_db_name=gmall_report export_data() { /opt/module/sqoop/bin/sqoop export \ -Dmapreduce.job.queuename=hive \ --connect "jdbc:mysql://hadoop102:3306/${mysql_db_name}?useUnicode=true&characterEncoding=utf-8" \ --username root \ --password 000000 \ --table $1 \ --num-mappers 1 \ --export-dir /warehouse/$hive_db_name/ads/$1 \ --input-fields-terminated-by "\t" \ --update-mode allowinsert \ --update-key $2 \ --input-null-string '\\N' \ --input-null-non-string '\\N' } case $1 in "ads_uv_count") export_data "ads_uv_count" "dt" ;; "ads_user_action_convert_day") export_data "ads_user_action_convert_day" "dt" ;; "ads_user_topic") export_data "ads_user_topic" "dt" ;; "ads_area_topic") export_data "ads_area_topic" "dt,iso_code" ;; "all") export_data "ads_user_topic" "dt" export_data "ads_area_topic" "dt,iso_code" #其余表省略未写 ;; esac
关于导出update还是insert的问题
- --update-mode:
updateonly 只更新,无法插入新数据
allowinsert 允许新增
- --update-key:允许更新的情况下,指定哪些字段匹配视为同一条数据,进行更新而不增加。多个字段用逗号分隔。
- --input-null-string和--input-null-non-string:
分别表示,将字符串列和非字符串列的空串和“null”转义。
官网地址:http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL(like IS NULL) will not work correctly.
You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.
Because sqoop is using those parameters in generated code, you need to properly escape value \N to \\N: Hive中的Null在底层是以“\N”来存储,而MySQL中的Null在底层就是Null,为了保证数据两端的一致性。
在导出数据时采用--input-null-string和--input-null-non-string两个参数。
导入数据时采用--null-string和--null-non-string。
7.Azkaban调度
编译写Azkaban工作流程配置文件
1)编写azkaban.project文件,内容如下
azkaban-flow-version: 2.0
2)编写gmall.flow文件,内容如下
nodes:
- name: mysql_to_hdfs
type: command
config:
command: /home/kris/bin/mysql_to_hdfs.sh all ${dt}
- name: hdfs_to_ods_log
type: command
config:
command: /home/kris/bin/hdfs_to_ods_log.sh ${dt}
- name: hdfs_to_ods_db
type: command
dependsOn:
- mysql_to_hdfs
config:
command: /home/kris/bin/hdfs_to_ods_db.sh all ${dt}
- name: ods_to_dwd_log
type: command
dependsOn:
- hdfs_to_ods_log
config:
command: /home/kris/bin/ods_to_dwd_log.sh ${dt}
- name: ods_to_dwd_db
type: command
dependsOn:
- hdfs_to_ods_db
config:
command: /home/kris/bin/ods_to_dwd_db.sh all ${dt}
- name: dwd_to_dws
type: command
dependsOn:
- ods_to_dwd_log
- ods_to_dwd_db
config:
command: /home/kris/bin/dwd_to_dws.sh ${dt}
- name: dws_to_dwt
type: command
dependsOn:
- dwd_to_dws
config:
command: /home/kris/bin/dws_to_dwt.sh ${dt}
- name: dwt_to_ads
type: command
dependsOn:
- dws_to_dwt
config:
command: /home/kris/bin/dwt_to_ads.sh ${dt}
- name: hdfs_to_mysql
type: command
dependsOn:
- dwt_to_ads
config:
command: /home/kris/bin/hdfs_to_mysql.sh ads_user_topic
3)将azkaban.project、gmall.flow文件压缩到一个zip文件,文件名称必须是英文。
4)在WebServer新建项目:http://hadoop101:8081/index
Azkaban多Executor模式下注意事项
Azkaban多Executor模式是指,在集群中多个节点部署Executor。在这种模式下, Azkaban web Server会根据策略,选取其中一个Executor去执行任务。
由于我们需要交给Azkaban调度的脚本,以及脚本需要的Hive,Sqoop等应用只在hadoop102部署了,为保证任务顺利执行,我们须在以下两种方案任选其一,推荐使用方案二。
方案一:指定特定的Executor(hadoop102)去执行任务。
1)在MySQL中azkaban数据库executors表中,查询hadoop102上的Executor的id。 mysql> use azkaban; Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> select * from executors; +----+-----------+-------+--------+ | id | host | port | active | +----+-----------+-------+--------+ | 1 | hadoop103 | 35985 | 1 | | 2 | hadoop102 | 36363 | 1 | | 3 | hadoop101 | 12321 | 1 | +----+-----------+-------+--------+ 3 rows in set (0.00 sec)
2)在执行工作流程时加入useExecutor属性,如下
方案二:在Executor所在所有节点部署任务所需脚本和应用。
1)分发脚本、hive以及sqoop
[kirs@hadoop101 ~]$ xsync /home/atguigu/bin/
[kirs@hadoop101 ~]$ xsync /opt/module/hive
[kirs@hadoop101 ~]$ xsync /opt/module/sqoop
数据在hdfs上保存时间,半或1年清理下,可下载压缩保存下来