拉链表设计
一、创建拉链表
- 创建表
CREATE TABLE `ods_login`( | | `uid` string, | | `name` string, | | `phone` string, | | `date` string) | | PARTITIONED BY ( | | `dt` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' | | WITH SERDEPROPERTIES ( | | 'field.delim'=',', | | 'serialization.format'=',') | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.mapred.TextInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' | | LOCATION | | 'hdfs://namenode:8020/user/hive/warehouse/ods.db/ods_login' | | TBLPROPERTIES ( | | 'transient_lastDdlTime'='1645972509')
-
测试数据
-- 2022-02-24.dat 0001,mike,1398836123,2022-02-24 0002,lucy,1396542122,2022-02-24 0003,jack,1398836145,2022-02-24 0004,baran,1398836178,2022-02-24 -- 2022-02-25.dat, 增量数据 0001,mike,1398836241,2022-02-25 0004,baran,1398836756,2022-02-25
0005,xinyi,1398836549,2022-02-25 -
创建dz表
+----------------------------------------------------+ | CREATE TABLE `dim_user_dz`( | | `uid` string, | | `name` string, | | `phone` string, | | `date` string, | | `start_date` string, --开始日期 | | `end_date` string) --结束日期 | | PARTITIONED BY ( | | `dt` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' | | STORED AS INPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' | | LOCATION | | 'hdfs://namenode:8020/user/hive/warehouse/ods.db/dim_user_dz' | | TBLPROPERTIES ( | | 'transient_lastDdlTime'='1645973732') | +----------------------------------------------------+
-
拉链表创建思路
1)假如首日是2022-02-24,首先将数据从ods层加载到dim层,分区日期和结束日期都为9999-00-00
2)第二日2022-02-25,一部分用户新增变化,需要把新增的和变化的装载到dim层,分区结束日期是9999分区,但要注意9999分区有一部分过期数据(过期理解为数据发生了变化后,变化前的数据是过期数据)需要装载到变化前一日即5月1日分区(过期的用户数据分区),就是将过期数据装载到前一日分区,有效数据是9999分区;
3)第三日2022-02-26,和25日一样,新增数据放入9999分区,过期数据放入前一日分区;
- SQL语句实现
-- 首日加载 INSERT INTO dim_user_dz PARTITION(dt='9999-00-00') SELECT uid, name, phone, `date`, '2022-02-24', '9999-00-00' FROM ods_login WHERE dt = '2022-02-24'; -- 每日装载 -- 首先找出9999的数据与增量数据做外连接, 这里创建一张临时表 CREATE TABLE tmp_20220225 AS SELECT * FROM ( SELECT uid as old_uid, name as old_name, phone as old_phone, `date` as old_date, start_date as old_start_date, end_date as old_end_date FROM dim_user_dz WHERE dt = '9999-00-00' ) t1 LEFT OUTER JOIN ( SELECT uid, name, phone, `date`,'2022-02-25' as start_date,'9999-00-00' as end_date FROM ods_login WHERE dt = '2022-02-25' ) t2 ON t1.old_uid = t2.uid; -- 获取全量最新数据和过期数据,动态分区写入 set hive.exec.dynamic.partition.mode=nonstrict; INSERT OVERWRITE TABLE dim_user_dz PARTITION(dt) SELECT nvl(uid, old_uid) as uid, nvl(name, old_name) as name, nvl(phone, old_phone) as phone, nvl(`date`, old_date) as `date`, nvl(start_date, old_start_date) as start_date, nvl(end_date, old_end_date) as end_date, nvl(end_date, old_end_date) as dt FROM tmp_20220225 UNION ALL SELECT old_uid as uid, old_name as name, old_phone as phone, old_date as `date`, old_start_date as start_date, cast(date_sub('2022-02-25', 1) as string) as end_date, cast(date_sub('2022-02-25', 1) as string) as dt FROM tmp_20220225 WHERE uid is not null and old_uid is not null;
二、拉链表回滚