Implementing a Historical Zipper Table in a Hive Data Warehouse

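Background

I've noticed that I'm blogging less and less often lately. Too busy? Or just not in the mood? Ha! The revolution has not yet succeeded; comrades must keep working. I had a little free time this afternoon, so I'm using it to write up how to implement the data-warehouse history zipper algorithm on Hive.

I tested with Hive 1.1.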

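History storage approaches

Let's first look at the ways historical data can be stored.

Ways of storing data history

1. Snapshot (slice) storage: a data date records the point in time at which the data was loaded.

2. Zipper storage: a start date and an end date record how each primary-key record changes over time.

Neither approach is better than the other. Both preserve the full data history; each simply suits a different kind of data.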
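To make the difference between the two approaches concrete, here is a minimal Python sketch (not Hive code) that collapses daily snapshots into zipper rows. The function name, the sample balances, and the choice of closing a row with the date on which the next version starts are assumptions made for illustration; 30001231 is the open-ended end date used later in this post.

```python
from typing import Dict, List, Tuple

MAX_DT = "30001231"  # open-ended end date, as used later in this post


def snapshots_to_zipper(
    snapshots: Dict[str, Dict[str, int]]
) -> List[Tuple[str, int, str, str]]:
    """Collapse {date: {id: balance}} snapshots into zipper rows
    (id, balance, start_dt, end_dt)."""
    open_rows: Dict[str, Tuple[int, str]] = {}  # id -> (balance, start_dt)
    closed: List[Tuple[str, int, str, str]] = []
    for dt in sorted(snapshots):
        for key, balance in snapshots[dt].items():
            current = open_rows.get(key)
            if current is None:
                # First time this key is seen: open a row.
                open_rows[key] = (balance, dt)
            elif current[0] != balance:
                # Value changed: close the old row and open a new one.
                closed.append((key, current[0], current[1], dt))
                open_rows[key] = (balance, dt)
    still_open = [(k, b, s, MAX_DT) for k, (b, s) in open_rows.items()]
    return sorted(closed + still_open, key=lambda r: (r[0], r[2]))


# Hypothetical sample data: three daily snapshots of two accounts.
snapshots = {
    "20140101": {"001": 2000, "002": 500},
    "20140102": {"001": 3000, "002": 500},
    "20140103": {"001": 3000, "002": 500},
}
zipper = snapshots_to_zipper(snapshots)
snapshot_row_count = sum(len(day) for day in snapshots.values())
```

In this sample the snapshot form holds six rows while the zipper form holds three, because unchanged records are not repeated for every day.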

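Zipper storage, insert: a new key gets a new row whose start date is the load date and whose end date is the open-ended maximum date.

Zipper storage, delete: the key's open row is closed by setting its end date; no new row is added.

Zipper storage, modify: the key's open row is closed, and a new open row carrying the changed values is added.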
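The three operations can be sketched in a few lines of Python. This is only an illustration of the idea, not the Hive implementation: the helper names `open_row`, `close_row` and `modify_row`, the in-memory list, and the sample values are all hypothetical.

```python
MAX_DT = "30001231"  # open-ended end date


def open_row(table, key, balance, dt):
    """Insert: add an open row that starts on dt."""
    table.append({"id": key, "balance": balance, "start_dt": dt, "end_dt": MAX_DT})


def close_row(table, key, dt):
    """Delete: close the key's open row by setting its end date to dt."""
    for row in table:
        if row["id"] == key and row["end_dt"] == MAX_DT:
            row["end_dt"] = dt


def modify_row(table, key, balance, dt):
    """Modify: close the open row, then open a new one with the new value."""
    close_row(table, key, dt)
    open_row(table, key, balance, dt)


table = []
open_row(table, "001", 2000, "20140101")    # insert
modify_row(table, "001", 3000, "20140102")  # modify
close_row(table, "001", "20140103")         # delete
```

After these three calls the list holds two closed rows for key 001, one for each value the record ever had.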

Advantages of zipper storage

1. Saves storage space

2. Records how the data changes

3. Makes the data easy to retrieve
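As an illustration of the third point, retrieving the state of the data on any given day is a single range filter over the start and end dates. The Python sketch below assumes that a row is valid from its start date up to, but not including, its end date, and that dates are yyyymmdd strings so that string comparison matches date order; the function name `as_of` and the sample rows are hypothetical.

```python
def as_of(rows, dt):
    """Return the rows that were valid on date dt (yyyymmdd string)."""
    return [r for r in rows if r["start_dt"] <= dt < r["end_dt"]]


# Hypothetical zipper rows for one account.
rows = [
    {"id": "001", "balance": 2000, "start_dt": "20140101", "end_dt": "20140102"},
    {"id": "001", "balance": 3000, "start_dt": "20140102", "end_dt": "30001231"},
]

balance_on_first_day = [r["balance"] for r in as_of(rows, "20140101")]
balance_on_later_day = [r["balance"] for r in as_of(rows, "20140315")]
before_any_data = as_of(rows, "20131231")
```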

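Why we use zipper tables on a large scale

1. The guiding principle in building a data platform is to preserve data history.

2. Following the way the data actually behaves lets us build a standard, flexible and highly usable data platform.

Choosing a storage approach

1. Record objects: except in special cases, zipper storage is generally used.

Event tables stored as zippers: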

Implementing the zipper algorithm

Creating the test tables

-- Technical buffer layer
drop table if exists t_account;
create table t_account (
    id      varchar(30),
    name    varchar(60),
    balance decimal(14,4)
)
row format delimited fields terminated by '\177';

-- Near-source model layer: staging (temporary) table
drop table if exists n_account;
create table n_account (
    id       varchar(30),
    name     varchar(60),
    balance  decimal(14,4),
    start_dt varchar(8),
    end_dt   varchar(8)
)
stored as parquet;  -- tblproperties (...) left out: its contents are cut off in the source

-- Near-source model layer: data table, partitioned by end date
drop table if exists o_account;
create table o_account (
    id       varchar(30),
    name     varchar(60),
    balance  decimal(14,4),
    start_dt varchar(8)
)
partitioned by (end_dt varchar(8))
stored as parquet;  -- tblproperties (...) left out: its contents are cut off in the source

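This test uses an account table. The technical buffer layer table t_account is stored in the default text format, so that raw files whose layout already matches can be loaded into Hive directly. The near-source model layer is stored as Parquet, a columnar format that is said to be "born for computation".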

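Zipper steps

Loading from the technical buffer layer into the near-source model layer, using the zipper algorithm for state-type data without deletes (left join approach)

Step 1, truncate the staging table:

truncate table default.n_account;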

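Step 2, select the incremental data (new or modified rows) into the staging table ('$' stands for the load date):

insert into table default.n_account
select distinct
    t.id,
    t.name,
    t.balance,
    '$'        as start_dt,
    '30001231' as end_dt
from default.t_account t
left join default.o_account o
    on o.id = coalesce(rtrim(t.id), '')
   and o.start_dt < '$'   -- the comparison is cut off in the source; "< '$'" is assumed
   and o.end_dt >= '$'
where o.id is null
   or o.id <> coalesce(rtrim(t.id), '')
   or o.name <> coalesce(rtrim(t.name), '')
   or o.balance <> coalesce(t.balance, 0);   -- 0 instead of '' keeps the comparison numeric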

Step 3, select the currently valid data into the staging table, closing the rows that have been superseded:

insert into table default.n_account
select distinct
    o.id,
    o.name,
    o.balance,
    o.start_dt,
    case
        when n.id is not null then n.start_dt
        else '30001231'
    end as end_dt
from default.o_account o
left join default.n_account n
    on o.id = n.id
where o.start_dt < '$'   -- the comparison is cut off in the source; "< '$'" is assumed
  and o.end_dt >= '$';

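Step 4, drop the partitions of the target table that hold the currently valid data (step 5 rewrites them from the staging table):

alter table default.o_account drop partition (end_dt >= '$');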

Step 5, insert all of the staging table's data into the target table:

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table default.o_account partition (end_dt)
select
    id,
    name,
    balance,
    start_dt,
    end_dt
from default.n_account;
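The five steps can be traced end to end with a small Python simulation. This is a sketch of the logic only, not a replacement for the Hive statements. It assumes that "currently valid" means start_dt < run date <= end_dt, and the function name `run_zipper`, the tuple layout, and the second-day balance of 3000 are hypothetical.

```python
MAX_DT = "30001231"  # open-ended end date used in this post


def run_zipper(source, target, run_dt):
    """Simulate steps 1-5 for one run date.

    source: iterable of (id, name, balance) rows, like t_account
    target: list of (id, name, balance, start_dt, end_dt) rows, like o_account
    """
    # Rows that were valid just before this run
    # (assumed filter: start_dt < run_dt <= end_dt).
    current = [r for r in target if r[3] < run_dt <= r[4]]
    current_by_id = {r[0]: r for r in current}

    # Steps 1-2: start from an empty staging list, add new or changed rows.
    staging = []
    for key, name, balance in sorted(set(source)):
        old = current_by_id.get(key)
        if old is None or (old[1], old[2]) != (name, balance):
            staging.append((key, name, balance, run_dt, MAX_DT))
    changed = {r[0] for r in staging}

    # Step 3: carry over the current rows, closing the superseded ones.
    for key, name, balance, start_dt, _ in current:
        end_dt = run_dt if key in changed else MAX_DT
        staging.append((key, name, balance, start_dt, end_dt))

    # Step 4: drop every target row whose end_dt >= run_dt.
    kept = [r for r in target if r[4] < run_dt]

    # Step 5: write the staging rows back into the target.
    return sorted(kept + staging, key=lambda r: (r[0], r[3]))


day1 = run_zipper([("001", "zhangsan", 2000)], [], "20140101")
day2 = run_zipper([("001", "zhangsan", 3000)], day1, "20140102")
rerun = run_zipper([("001", "zhangsan", 3000)], day2, "20140102")
```

Under the assumed filter, running the same date a second time reproduces the same target, which is what dropping and rewriting the partitions with end_dt >= run date is meant to allow.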

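Testing the zipper

Load data into the technical buffer layer (this step can instead be done with tools such as Pig, MapReduce, Spark or Sqoop):

insert into table default.t_account values
    ('001', 'zhangsan', ...

Run the zipper steps one by one, replacing the $ variable in the SQL with the test date (for example 20140101).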

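truncate table default.t_account;

insert into table default.t_account values
    ('001', 'zhangsan', ...

Run the zipper steps one by one again, replacing the $ variable in the SQL with the test date (for example 20140102).

Check the test result:

At this point you can see that the row for zhangsan with a balance of 2000 has been closed.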

Overwrite the target table directly with the day's incremental data:

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table default.o_account partition (etl_dt)
select
    id,
    name,
    balance,
    end_dt
from default.n_account t
where t.etl_dt = '$';

This completes the implementation and testing of the zipper algorithm.
