Keeping Historical Data in a Data Warehouse: the Zipper Table
I. The Data Warehouse
A data warehouse is a subject-oriented, integrated, relatively stable (non-volatile) collection of data that reflects historical change and is used to support management decision-making.
- Subject-oriented: a traditional database is organized around transaction processing, whereas a data warehouse organizes its data around a business domain; a subject is a closely related set of data the users care about.
- Integrated: warehouse data is drawn from many separate operational databases, external sources, and unstructured data, and is consolidated into one integrated collection.
- Relatively stable: data in the warehouse should not be modified through ad hoc DML; it is maintained through batch processing.
- Reflects history: the warehouse keeps the historical versions of its data.
What this article introduces is one way for a data warehouse to keep historical versions of data: the zipper table.
Here is a quick overview of the architecture our warehouse uses. It consists of a staging (ODS) layer, a detail layer, a summary layer, a mart layer, a report layer, and a dimension layer:
- Staging (ODS) layer: data extracted from the various source systems lands here first. Pay attention to how the source data is extracted, incrementally or in full; a well-designed source system should support incremental extraction (a question left for the reader: what requirements must incrementally extracted data satisfy? a sketch follows this list), which reduces the load the extraction puts on both the warehouse and the source systems.
- Detail layer: data is stored here in normalized form and is fed mainly from the staging layer. Its goals are the basic tasks a warehouse has to handle: subject-oriented storage design, integration of the different source systems, unified coding standards, and retention of historical data (this is the layer where the zipper table is designed and built).
- Summary layer: metrics that need aggregation are computed from the detail layer according to the business rules, and the normalized detail tables are joined into lightly denormalized small wide tables to make downstream processing easier.
- Mart layer: star schemas are designed with dimensional modeling for the different consumers; once this layer is in place, producing reports and serving routine data-pull requests should be straightforward. Some warehouse designers take another approach here and build large wide tables instead of star schemas, mainly on the grounds that wide tables are easier for end users to work with.
- Report layer: produces the reports the various departments need.
- Dimension layer: stores the warehouse's dimension tables in one place.
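To make the incremental-extraction idea concrete, here is a minimal sketch; the source table src_customer, its last_update_ts column, and the etl_watermark bookkeeping table are my own assumptions, not part of the original systems. The point is that incrementally extracted data needs a reliable change marker (a timestamp, sequence, or change log, plus a way to see deletes) so each run picks up exactly what changed since the last high-water mark.
-- hypothetical example: pull only the rows changed since the last successful extract
select s.*
  from src_customer s
 where s.last_update_ts > (select w.last_extract_ts
                             from etl_watermark w
                            where w.table_name = 'SRC_CUSTOMER');
-- after a successful load, advance the watermark
update etl_watermark
   set last_extract_ts = systimestamp
 where table_name = 'SRC_CUSTOMER';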
Warehouse design today has two main camps, the Kimball and the Inmon architectures; I will not go into them in detail here. In my project experience, taking either architecture to its extreme leaves the warehouse project with a very low chance of success, so I suggest combining the strengths of both: build the detail layer in third normal form and design the mart layer with star schemas. Combining the two sensibly avoids both the tedious work a purely data-driven approach brings and the later maintenance and extensibility problems a purely requirement-driven approach causes.
II. How a Zipper Table Works
A small made-up example illustrates the idea:
1. On 2017-01-01 we load the initial user data into the warehouse. The customer table loaded into the warehouse gets two extra columns, start_date and end_date, which mark the lifetime of each row:
cus_id  job     start_date  end_date
----------------------------------------------
10001   oracle  2017-01-01  3000-12-31
10002   pgsql   2017-01-01  3000-12-31
10003   mysql   2017-01-01  3000-12-31
10004   java    2017-01-01  3000-12-31
10005   python  2017-01-01  3000-12-31
2. On 2017-01-02, user 10004 is deleted, users 10006 and 10007 are added, and user 10003's job changes from mysql to mongodb. The detail rows now look like this:
cus_id  job      start_date  end_date
----------------------------------------------
10001   oracle   2017-01-01  3000-12-31
10002   pgsql    2017-01-01  3000-12-31
10003   mysql    2017-01-01  2017-01-02
10003   mongodb  2017-01-02  3000-12-31
10004   java     2017-01-01  2017-01-02
10005   python   2017-01-01  3000-12-31
10006   docker   2017-01-02  3000-12-31
10007   redis    2017-01-02  3000-12-31
3. On 2017-01-03, user 10007 is deleted, user 10006's job changes from docker to openstack, user 10003's job changes from mongodb to hive, and user 10008 is added:
cus_id  job        start_date  end_date
----------------------------------------------
10001   oracle     2017-01-01  3000-12-31
10002   pgsql      2017-01-01  3000-12-31
10003   mysql      2017-01-01  2017-01-02
10003   mongodb    2017-01-02  2017-01-03
10003   hive       2017-01-03  3000-12-31
10004   java       2017-01-01  2017-01-02
10005   python     2017-01-01  3000-12-31
10006   docker     2017-01-02  2017-01-03
10006   openstack  2017-01-03  3000-12-31
10007   redis      2017-01-02  2017-01-03
10008   hadoop     2017-01-03  3000-12-31
How the zipper table captures this: take user 10003 as an example. Tracking the points at which 10003's data changed gives this timeline:
2017-01-01  first registered, job is mysql;
2017-01-02  job changes to mongodb;
2017-01-03  job changes to hive.
Along this timeline user 10003 holds exactly one job at any point in time: from 2017-01-01 to 2017-01-02 the job is mysql, from 2017-01-02 to 2017-01-03 it is mongodb, and from 2017-01-03 to 3000-12-31 it is hive. Every row in a zipper table obeys this rule. So how do we query a zipper table correctly?
How to query a zipper table:
1. The current (latest) data:
select * from customer t where t.end_date = '3000-12-31';
2. The historical snapshot as of 2017-01-01:
select * from customer t where t.start_date <= '2017-01-01' and t.end_date > '2017-01-01';
3. The historical snapshot as of 2017-01-02:
select * from customer t where t.start_date <= '2017-01-02' and t.end_date > '2017-01-02';
4. All history for user 10003:
select * from customer t where t.cus_id = '10003';
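One way to sanity-check the rule described above (this query is my own sketch, not from the original article): for each cus_id, consecutive versions must chain head to tail, with each row's end_date equal to the next row's start_date.
-- flag any customer whose versions overlap or leave a gap
select cus_id, start_date, end_date, next_start
  from (select t.cus_id, t.start_date, t.end_date,
               lead(t.start_date) over (partition by t.cus_id order by t.start_date) as next_start
          from customer t)
 where next_start is not null
   and end_date <> next_start;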
III. Implementing a Zipper Table
1. Prepare the data:
1) Initial load on 2017-01-01:
cus_id | job | start_date | end_date | dtype | dw_status | dw_ins_date |
10001 | oracle | 2017-01-01 | 3000-12-31 | C | I | 2017-01-01 |
10002 | pgsql | 2017-01-01 | 3000-12-31 | C | I | 2017-01-01 |
10003 | mysql | 2017-01-01 | 3000-12-31 | C | I | 2017-01-01 |
10004 | java | 2017-01-01 | 3000-12-31 | C | I | 2017-01-01 |
10005 | python | 2017-01-01 | 3000-12-31 | C | I | 2017-01-01 |
2) Incremental data for 2017-01-02:
cus_id | job | dw_status | dw_ins_date |
10003 | mongodb | U | 2017-01-02 |
10004 | java | D | 2017-01-02 |
10006 | docker | I | 2017-01-02 |
10007 | redis | I | 2017-01-02 |
3) Incremental data for 2017-01-03:
cus_id | job | dw_status | dw_ins_date |
10003 | hive | U | 2017-01-03 |
10007 | redis | D | 2017-01-03 |
10006 | openstack | U | 2017-01-03 |
10008 | hadoop | I | 2017-01-03 |
2. Load the data:
1) Initialize the customer table:
drop table customer;
create table customer(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(end_date)
(
partition cus_par20170101 values('2017-01-01') tablespace users,
partition cus_par20170102 values('2017-01-02') tablespace users,
partition cus_par20170103 values('2017-01-03') tablespace users,
partition cus_par30001231 values('3000-12-31') tablespace users
);
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10001,'oracle','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10002,'pgsql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10003,'mysql','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10004,'java','2017-01-01','3000-12-31','C','I','2017-01-01');
insert into customer(cus_id,job,start_date,end_date,dtype,dw_status,dw_ins_date) values (10005,'python','2017-01-01','3000-12-31','C','I','2017-01-01');
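One caveat worth noting (my own observation, not part of the original steps): because customer is list-partitioned on end_date, every closing date that a later refresh will write must already have a matching partition, otherwise the insert fails. A sketch, following the naming convention above:
-- add the partition for the next closing date before loading that day's refresh
alter table customer add partition cus_par20170104 values('2017-01-04') tablespace users;
-- alternatively, a default partition catches any closing date that is not listed
-- alter table customer add partition cus_par_default values(default) tablespace users;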
2) Initialize the incremental table for 2017-01-02:
create table customer_inc(
cus_id int,
job varchar2(20),
dw_status varchar2(1),
dw_ins_date varchar2(10)
);
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'mongodb','U','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10004,'java','D','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'docker','I','2017-01-02');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','I','2017-01-02');
3) Create the intermediate (staging) table, list-partitioned on dtype so that closed-out history rows ('H') and still-open current rows ('C') land in separate partitions:
drop table customer_tmp0;
create table customer_tmp0(
cus_id int,
job varchar2(20),
start_date varchar2(10),
end_date varchar2(10),
dtype varchar2(1),
dw_status varchar2(1),
dw_ins_date varchar2(10)
)
partition by list(dtype)
(
partition cus_dtype_H values('H') tablespace users,
partition cus_dtype_C values('C') tablespace users
);
3. Apply the customer_inc data to customer (the 2017-01-02 refresh):
1) Join the open (latest) partition of customer to the updated and deleted rows in customer_inc: matched rows are closed out (their end_date becomes the refresh date and their dtype becomes 'H'), unmatched rows are carried over unchanged:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-02' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
2) Insert the new versions of the inserted and updated rows from customer_inc into the customer_tmp0 staging table:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-02' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-03' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
3) Load the staging data back into the customer fact table; this step could also be done with an exchange-partition operation (a sketch follows the statements below):
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
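For the exchange-partition variant mentioned above, a rough sketch (the non-partitioned staging table customer_stg is my own assumption; EXCHANGE PARTITION requires a non-partitioned table with the same column layout as customer). The open partition is swapped in a single dictionary operation, and only the closed-out 'H' rows still need a conventional insert into their dated partition:
-- non-partitioned table with the same structure
create table customer_stg as select * from customer_tmp0 where 1 = 0;
-- rows that remain open go into the staging table
insert into customer_stg select * from customer_tmp0 where dtype = 'C';
-- swap the staging table with the open partition (no truncate needed)
alter table customer exchange partition cus_par30001231 with table customer_stg;
-- closed-out versions are inserted into their dated partition as before
insert into customer select * from customer_tmp0 where dtype = 'H';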
4) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 3000-12-31 C U 2017-01-03
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 3000-12-31 C I 2017-01-03
10007 redis 2017-01-02 3000-12-31 C I 2017-01-03
8 rows selected
SQL>
4. Apply the customer_inc data to customer (the 2017-01-03 refresh)
1) Re-initialize the incremental table with the 2017-01-03 increment:
truncate table customer_inc;
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10003,'hive','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10008,'hadoop','I','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10006,'openstack','U','2017-01-03');
insert into customer_inc(cus_id,job,dw_status,dw_ins_date)values(10007,'redis','D','2017-01-03');
2) Join the open partition of customer to the updated and deleted rows in customer_inc, closing out the rows that changed:
truncate table customer_tmp0;
insert into customer_tmp0
select
t1.cus_id,
t1.job,
t1.start_date,
case when t2.cus_id is null then t1.end_date else '2017-01-03' end as end_date,
case when t2.cus_id is null then 'C' else 'H' end dtype,
case when t2.cus_id is null then t1.dw_status else t2.dw_status end dw_status,
case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end as dw_ins_date
from customer t1 left join customer_inc t2 on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
where t1.end_date = '3000-12-31'
order by cus_id asc
;
3) Insert the new versions of the inserted and updated rows from customer_inc into the customer_tmp0 staging table:
insert into customer_tmp0
select
t1.cus_id,
t1.job,
'2017-01-03' as start_date,
'3000-12-31' as end_date,
'C' as dtype,
t1.dw_status,
'2017-01-04' as dw_ins_date
from customer_inc t1
where t1.dw_status in ('I','U')
;
4) Load the staging data back into the customer fact table (again, an exchange-partition operation could be used):
alter table customer truncate partition cus_par30001231;
insert into customer
select * from customer_tmp0;
5) Check the result:
SQL> select * from customer order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
----------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
11 rows selected
SQL>
5. Query the zipper table:
1) The current (latest) data:
SQL> select * from customer where end_date = '3000-12-31' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 openstack 2017-01-03 3000-12-31 C U 2017-01-04
10008 hadoop 2017-01-03 3000-12-31 C I 2017-01-04
6 rows selected
SQL>
2) The historical snapshot as of 2017-01-01:
SQL> select * from customer where start_date <= '2017-01-01' and end_date > '2017-01-01' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
--------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10004 java 2017-01-01 2017-01-02 H D 2017-01-02
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
SQL>
3) The historical snapshot as of 2017-01-02:
SQL> select * from customer where start_date <= '2017-01-02' and end_date > '2017-01-02' order by cus_id asc;
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10001 oracle 2017-01-01 3000-12-31 C I 2017-01-01
10002 pgsql 2017-01-01 3000-12-31 C I 2017-01-01
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10005 python 2017-01-01 3000-12-31 C I 2017-01-01
10006 docker 2017-01-02 2017-01-03 H U 2017-01-03
10007 redis 2017-01-02 2017-01-03 H D 2017-01-03
6 rows selected
SQL>
4) All history for user 10003:
SQL> select * from customer where cus_id = '10003';
CUS_ID JOB START_DATE END_DATE DTYPE DW_STATUS DW_INS_DATE
---------- -------------------- ---------- ---------- ----- --------- -----------
10003 mysql 2017-01-01 2017-01-02 H U 2017-01-02
10003 mongodb 2017-01-02 2017-01-03 H U 2017-01-03
10003 hive 2017-01-03 3000-12-31 C U 2017-01-04
SQL>
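The 2017-01-02 and 2017-01-03 refreshes above repeat the same three statements with only the date changing, so in practice the refresh is worth parameterizing. A rough closing sketch (my own, not part of the original article): it assumes customer_inc has already been reloaded with the day's increment, that a partition exists for the new closing date, and, for simplicity, it carries dw_ins_date over from the increment.
create or replace procedure refresh_customer_chain(p_load_date in varchar2) as
begin
  execute immediate 'truncate table customer_tmp0';
  -- close out the rows in the open partition that were updated or deleted
  insert into customer_tmp0
  select t1.cus_id, t1.job, t1.start_date,
         case when t2.cus_id is null then t1.end_date else p_load_date end,
         case when t2.cus_id is null then 'C' else 'H' end,
         case when t2.cus_id is null then t1.dw_status else t2.dw_status end,
         case when t2.cus_id is null then t1.dw_ins_date else t2.dw_ins_date end
    from customer t1
    left join customer_inc t2
      on t1.cus_id = t2.cus_id and t2.dw_status in ('D','U')
   where t1.end_date = '3000-12-31';
  -- open a new version for every inserted or updated row
  insert into customer_tmp0
  select t1.cus_id, t1.job, p_load_date, '3000-12-31', 'C', t1.dw_status, t1.dw_ins_date
    from customer_inc t1
   where t1.dw_status in ('I','U');
  -- rebuild the open partition from the staging data
  execute immediate 'alter table customer truncate partition cus_par30001231';
  insert into customer select * from customer_tmp0;
  commit;
end;
/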