Storing Historical Data in a Data Warehouse: Zipper Tables

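Suppose we have an account table that needs to be stored in Hive. The data is synced from an online MySQL instance by reading its binlog, so every individual change is available.

Structure of the account table: account_id, username, followers_count, modified_at

The two storage approaches we use most often are snapshot tables and change-log tables. A snapshot table stores a full copy of the data at a fixed time granularity (for example, one copy per day). A change-log table records every individual change made to the data.

Now suppose there is a requirement: we need to keep a record of how each account has changed over time.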

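Snapshot table implementation

Here we use a daily granularity and simply store the final state of every account for each day.

In Hive the table is partitioned by day, so to read the historical state for a given day we only need to specify the partition.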

-- Read the state of an account on 20190801
select * from account_snapshot where ds = "20190801" and account_id = ***

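The drawback of a snapshot table: when a single table is large, storing a full snapshot every day incurs unnecessary resource cost, because rows that did not change are stored again.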
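The trade-off above can be illustrated with a small Python sketch. It is not Hive code: the dictionary keyed by `ds` stands in for daily partitions, and the account ids, the `abc` username, and all follower counts are made-up sample values.

```python
# Hypothetical daily snapshots: each "partition" (keyed by ds) stores every
# account, whether or not it changed. Rows are (account_id, username, followers_count).
snapshots = {
    "20190730": [(1, "mwf", 10), (2, "abc", 5)],
    "20190731": [(1, "mwf", 10), (2, "abc", 5)],
    "20190801": [(1, "mwf", 12), (2, "abc", 5)],
}


def state_on(ds, account_id):
    """Read one account's state by going straight to that day's partition."""
    for row in snapshots[ds]:
        if row[0] == account_id:
            return row
    return None


# Total rows stored across all days versus the number of distinct row versions.
stored_rows = sum(len(rows) for rows in snapshots.values())
distinct_versions = len({row for rows in snapshots.values() for row in rows})
```

Reading a day is trivial (`state_on("20190801", 1)`), but in this sample six rows are stored to represent only three distinct versions.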

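Change-log table implementation

A change-log table records every change to the data: each change that arrives is inserted as a new row.

This storage method is not very friendly to the people who consume the data, because reading the state at a point in time means finding the latest row per account: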

-- Query the state of an account as of the end of 20190801
select * from (
  select
    *, row_number() over (partition by account_id order by modified_at desc) as ro
  from account
  where modified_at < "2019-08-02 00:00:00" and account_id = ***
) t where ro = 1
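The same "latest row before a cutoff" logic can be sketched in plain Python. The rows below are hypothetical sample data, and `state_as_of` is an illustrative helper name, not something from the original post. The cutoff is treated as exclusive, so a cutoff of 2019-08-02 00:00:00 returns the state at the end of August 1.

```python
from datetime import datetime

# Hypothetical change-log rows: (account_id, username, followers_count, modified_at)
change_log = [
    (1, "mwf", 10, datetime(2019, 7, 30, 9, 0)),
    (1, "mwf", 12, datetime(2019, 8, 1, 15, 30)),
    (1, "mwf", 20, datetime(2019, 8, 2, 8, 0)),
]


def state_as_of(rows, account_id, cutoff):
    """Return the latest row for account_id modified strictly before cutoff."""
    candidates = [r for r in rows if r[0] == account_id and r[3] < cutoff]
    # Equivalent to row_number() ... order by modified_at desc, keeping row 1.
    return max(candidates, key=lambda r: r[3]) if candidates else None


latest = state_as_of(change_log, 1, datetime(2019, 8, 2))
```

Every reader of the change log has to repeat this filtering and ranking step, which is the usability cost described above.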

Both approaches above have their problems. Next we look at how a zipper table is used.

Zipper table

A zipper table is a way of maintaining both the historical states and the latest state of the data.

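A zipper table optimizes the snapshot table: depending on the zipper granularity (usually time), it drops the data that did not change within that granularity.

A zipper table maintains two timestamps (start_time, end_time) that mark whether a record is still valid and make historical data easier to locate.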
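As a rough illustration of what one zipper record carries, here is a minimal Python sketch. The class name `ZipRecord`, the helper methods, and the sample values are assumptions for illustration; the far-future end_time of 9999-12-31 is the "still valid" marker used later in this post. Treating the validity range as half-open (start inclusive, end exclusive) is a choice made in this sketch.

```python
from dataclasses import dataclass
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)  # sentinel end_time meaning "still valid"


@dataclass
class ZipRecord:
    account_id: int
    username: str
    followers_count: int
    modified_at: datetime
    start_time: datetime           # when this version became valid
    end_time: datetime = OPEN_END  # when this version stopped being valid

    def is_current(self) -> bool:
        """True if this is the latest version of the account."""
        return self.end_time == OPEN_END

    def valid_at(self, t: datetime) -> bool:
        """True if this version was the valid one at time t (half-open range)."""
        return self.start_time <= t < self.end_time


# Hypothetical record that was valid only during July 31.
old_version = ZipRecord(1, "mwf", 10, datetime(2019, 7, 30, 9, 0),
                        start_time=datetime(2019, 7, 31),
                        end_time=datetime(2019, 8, 1))
```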

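Prerequisites:

First, there must be a full copy of the data at some point in time to serve as the starting table.

Second, there must be either a change-log table or a snapshot table to serve as the source of changes.

Implementation: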

-- Source data
create table account (
  account_id int,
  username string,
  followers_count int,
  modified_at timestamp
);

-- Create the zipper table
create table account_zip (
  account_id int,
  username string,
  followers_count int,
  modified_at timestamp,
  start_time timestamp, -- start of the record's validity period
  end_time timestamp    -- end of the record's validity period
);

Today is August 1, and we start recording from the data as of July 31.

First we load the July 31 data into the zipper table.

insert into account_zip select *, "2019-07-31 00:00:00" as start_time , "9999-12-31 00:00:00" as end_time from account ;

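Next, on August 1, one account is modified and one account is added.

In the original illustration, the left side shows the July 31 data and the right side shows the August 1 data.

We can see that on August 1 one record was modified (the followers_count of mwf changed) and one record was added (a new user with account_id 5).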

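For the modification:

The zipper table already contains a record for mwf, and the August 1 change modifies it.

We set the end_time of the previous record to August 1, which means that version stops being valid from August 1 onward.

Then we write the August 1 change into the zipper table as a new record with start_time August 1 and end_time 9999-12-31.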

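For the addition:

We write the new record straight into the zipper table with start_time August 1 and end_time 9999-12-31.

After August 1, the zipper table holds the closed July 31 version of mwf, a new open version of mwf, an open record for account_id 5, and the unchanged records as they were.

With that, we have implemented a zipper table.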
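The modify and add rules above can be sketched in a few lines of Python. This is an in-memory illustration, not the Hive implementation: `apply_change` is a hypothetical helper, mwf's account_id of 1, the username `new_user`, and the follower counts are made-up values, and the modified_at column is left out for brevity.

```python
from datetime import datetime

OPEN_END = datetime(9999, 12, 31)  # end_time of a record that is still valid


def apply_change(zip_rows, account_id, username, followers_count, effective_from):
    """Apply one modification or addition to the zipper table (a list of dicts)."""
    for row in zip_rows:
        if row["account_id"] == account_id and row["end_time"] == OPEN_END:
            # Modification: close the previous version at the change time.
            row["end_time"] = effective_from
    # Both modifications and additions append a new open record.
    zip_rows.append({
        "account_id": account_id,
        "username": username,
        "followers_count": followers_count,
        "start_time": effective_from,
        "end_time": OPEN_END,
    })


# Starting table: the July 31 state (sample values).
zip_rows = [{"account_id": 1, "username": "mwf", "followers_count": 10,
             "start_time": datetime(2019, 7, 31), "end_time": OPEN_END}]

aug1 = datetime(2019, 8, 1)
apply_change(zip_rows, 1, "mwf", 12, aug1)      # modification of mwf
apply_change(zip_rows, 5, "new_user", 0, aug1)  # addition of account_id 5
```

After the two calls, `zip_rows` contains three records: the old mwf version closed at August 1, the new open mwf version, and the open record for account_id 5.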

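Querying records

-- Latest state: records that are still valid
select * from account_zip where end_time = "9999-12-31 00:00:00"

-- State on July 31: the record took effect on or before July 31 and had not yet expired on July 31.
-- The two timestamps together bound the range; end_time is compared strictly so that a record
-- closed at exactly this moment is not returned together with its successor.
select * from account_zip where start_time <= "2019-07-31 00:00:00" and end_time > "2019-07-31 00:00:00"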
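To try the two kinds of query without a Hive cluster, the sketch below runs equivalent SQL against an in-memory SQLite database from Python. SQLite stands in for Hive purely for illustration, timestamps are stored as ISO-formatted text so that string comparison orders them correctly, and the rows are hypothetical. The point-in-time query here uses a strict `end_time > ?` comparison so that a record closed at exactly the queried moment is not returned alongside its successor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table account_zip ("
    "account_id int, username text, followers_count int, "
    "start_time text, end_time text)"
)
# Hypothetical zipper rows after the August 1 update.
conn.executemany(
    "insert into account_zip values (?, ?, ?, ?, ?)",
    [
        (1, "mwf", 10, "2019-07-31 00:00:00", "2019-08-01 00:00:00"),
        (1, "mwf", 12, "2019-08-01 00:00:00", "9999-12-31 00:00:00"),
        (5, "new_user", 0, "2019-08-01 00:00:00", "9999-12-31 00:00:00"),
    ],
)


def current_rows():
    """Latest state: records whose end_time is still the open sentinel."""
    return conn.execute(
        "select account_id, followers_count from account_zip "
        "where end_time = '9999-12-31 00:00:00' order by account_id"
    ).fetchall()


def rows_as_of(ts):
    """State at time ts: started on or before ts and not yet expired at ts."""
    return conn.execute(
        "select account_id, followers_count from account_zip "
        "where start_time <= ? and end_time > ? order by account_id",
        (ts, ts),
    ).fetchall()
```

For example, `rows_as_of("2019-07-31 00:00:00")` returns only the old mwf version, while `current_rows()` returns the new mwf version and account_id 5.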

Generating the zipper table from a snapshot table

-- Combine the two branches and write the result into a temporary zipper table
-- (overwrite, so the temporary table holds only the result of this run)
insert overwrite table account_zip_tmp
select * from (
  -- Branch 1: existing zipper rows, closing the open rows that are no longer valid
  -- md5 is used here to compare whether the data is the same
  select
    bak.account_id,
    bak.username,
    bak.followers_count,
    bak.modified_at,
    bak.start_time,
    case
      when bak.end_time = "9999-12-31 00:00:00" and md5(concat(
        coalesce(bak.username, 'null'),
        coalesce(cast(bak.followers_count as string), 'null'),
        coalesce(cast(bak.modified_at as string), 'null')
      )) != md5(concat(
        coalesce(new.username, 'null'),
        coalesce(cast(new.followers_count as string), 'null'),
        coalesce(cast(new.modified_at as string), 'null')
      )) then cast("2019-08-01 00:00:00" as timestamp) -- closed on August 1, the day being processed
      else bak.end_time
    end as end_time
  from account_zip as bak
  left join (
    select * from account_snapshot where ds = "20190801"
  ) as new on bak.account_id = new.account_id
  union all
  -- Branch 2: rows that were modified or added on August 1
  select
    a.account_id,
    a.username,
    a.followers_count,
    a.modified_at,
    cast("2019-08-01 00:00:00" as timestamp) as start_time,
    cast("9999-12-31 00:00:00" as timestamp) as end_time
  from (
    select * from account_snapshot where ds = "20190801"
  ) as a
  left join (
    select * from account_zip
    where end_time = "9999-12-31 00:00:00"
  ) as b on a.account_id = b.account_id
  where md5(concat(
    coalesce(a.username, 'null'),
    coalesce(cast(a.followers_count as string), 'null'),
    coalesce(cast(a.modified_at as string), 'null')
  )) != md5(concat(
    coalesce(b.username, 'null'),
    coalesce(cast(b.followers_count as string), 'null'),
    coalesce(cast(b.modified_at as string), 'null')
  ))
) t;

-- Write the temporary zipper table back to the zipper table
insert overwrite table account_zip
select * from account_zip_tmp;
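The two-branch merge can be followed more easily in a small Python sketch that mirrors its structure. It is an illustration under assumptions, not a translation of the Hive job: `fingerprint` and `merge_snapshot` are hypothetical helper names, the sample rows are made up, and unlike the SQL the fingerprint joins fields with a `|` separator so that adjacent values cannot run together.

```python
import hashlib

OPEN_END = "9999-12-31 00:00:00"
COMPARED = ("username", "followers_count", "modified_at")


def fingerprint(row):
    """md5 over the compared columns; None becomes 'null', '|' separates fields."""
    parts = ["null" if row.get(c) is None else str(row[c]) for c in COMPARED]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


def merge_snapshot(zip_rows, snapshot_rows, day_start):
    """Return a new zipper table built from the old one plus one day's snapshot."""
    new_by_id = {r["account_id"]: r for r in snapshot_rows}
    open_by_id = {r["account_id"]: r for r in zip_rows if r["end_time"] == OPEN_END}
    result = []
    # Branch 1: copy existing rows, closing open rows that changed or disappeared.
    for old in zip_rows:
        row = dict(old)
        new = new_by_id.get(row["account_id"])
        if row["end_time"] == OPEN_END and (
            new is None or fingerprint(new) != fingerprint(row)
        ):
            row["end_time"] = day_start
        result.append(row)
    # Branch 2: snapshot rows that are new or differ from the open version.
    for new in snapshot_rows:
        old = open_by_id.get(new["account_id"])
        if old is None or fingerprint(new) != fingerprint(old):
            result.append({**new, "start_time": day_start, "end_time": OPEN_END})
    return result


# Hypothetical zipper table as of July 31 and snapshot for August 1.
zip_rows = [
    {"account_id": 1, "username": "mwf", "followers_count": 10,
     "modified_at": "2019-07-30 09:00:00",
     "start_time": "2019-07-31 00:00:00", "end_time": OPEN_END},
    {"account_id": 2, "username": "abc", "followers_count": 5,
     "modified_at": "2019-07-29 08:00:00",
     "start_time": "2019-07-31 00:00:00", "end_time": OPEN_END},
]
snapshot = [
    {"account_id": 1, "username": "mwf", "followers_count": 12,
     "modified_at": "2019-08-01 15:30:00"},
    {"account_id": 2, "username": "abc", "followers_count": 5,
     "modified_at": "2019-07-29 08:00:00"},
    {"account_id": 5, "username": "new_user", "followers_count": 0,
     "modified_at": "2019-08-01 10:00:00"},
]
merged = merge_snapshot(zip_rows, snapshot, "2019-08-01 00:00:00")
```

In this sample, `merged` has four rows: the closed mwf version, the untouched record for account_id 2, the new open mwf version, and the new record for account_id 5. The input list is left unchanged, which corresponds to writing into a temporary table before overwriting the zipper table.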

