Implementing incremental updates in Hive

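An insurance company has a table that records customer information, including each customer's id, name, and age (only these fields are listed, to keep the demo short).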

Create the Hive table:

create table customer
(id int,
age tinyint,
name string
)partitioned by(dt string)
row format delimited
fields terminated by '|'
stored as textfile;

Load the initial data:

load data local inpath '/home/hadoop/hivetestdata/customer.txt' into table customer partition(dt = '201506');

hive> select * from customer order by id;
customer.id customer.age customer.name customer.dt
1 25 jiangshouzhuang 201506
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506

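For an insurance company, customer data changes every day. We use a staging table, customer_temp, to hold each day's customer records; its columns and properties match those of the customer table: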

create table customer_temp like customer;

load data local inpath '/home/hadoop/hivetestdata/customer_temp.txt' into table customer_temp partition(dt = '201506');

It contains sample data like the following:

hive> select * from customer_temp;
customer_temp.id customer_temp.age customer_temp.name customer_temp.dt
1 26 jiangshouzhuang 201506
5 45 xiaosan 201506

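To update the customer table incrementally, we need to merge the two tables so that rows modified or added in customer_temp replace the corresponding rows in customer. A full outer join is one way to combine them, but the query below instead uses union all with a left outer join: it takes every row from customer_temp, then adds the customer rows whose id does not appear in customer_temp.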

hive (hive)> select * from customer_temp
> union all
> select a.* from customer a
> left outer join customer_temp b
> on a.id = b.id where b.id is null;
_u1.id _u1.age _u1.name _u1.dt
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506
1 26 jiangshouzhuang 201506
5 45 xiaosan 201506
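The merge logic above can be checked without a Hive cluster. The following is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for Hive. That substitution is an assumption made only for illustration: SQLite is not Hive, the dt partition is modeled as an ordinary column, and tinyint/string are replaced by int/text. The query text is the same as the one above, and the rows are the sample data shown earlier.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption for illustration only);
# dt is an ordinary column because SQLite has no partitions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customer (id int, age int, name text, dt text);
    create table customer_temp (id int, age int, name text, dt text);
""")

# Sample rows copied from the article.
conn.executemany("insert into customer values (?, ?, ?, ?)", [
    (1, 25, "jiangshouzhuang", "201506"),
    (2, 23, "zhangyun", "201506"),
    (3, 24, "yiyi", "201506"),
    (4, 32, "mengmeng", "201506"),
])
conn.executemany("insert into customer_temp values (?, ?, ?, ?)", [
    (1, 26, "jiangshouzhuang", "201506"),
    (5, 45, "xiaosan", "201506"),
])

# Every staging row, plus the customer rows that have no staging row with the same id.
MERGE_SQL = """
    select * from customer_temp
    union all
    select a.* from customer a
    left outer join customer_temp b
    on a.id = b.id where b.id is null
"""

# Sort in Python because the query itself does not guarantee an order.
merged = sorted(conn.execute(MERGE_SQL).fetchall())
for row in merged:
    print(row)
```

Customer 1 appears once, with the updated age 26, and customer 5 appears as a new row. The query only selects the merged rows; the article does not show how the result is written back to customer, so that step is left out here.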

I have seen approaches like the following used online, and they appear to be problematic:

hive> select customer.id,
coalesce(customer_temp.age,customer.age),
customer.name,
coalesce(customer_temp.dt,customer.dt)
from customer_temp
full outer join customer on customer_temp.id = customer.id;

Running it produces:

customer.id _c1 customer.name _c3
1 26 jiangshouzhuang 201506
2 23 zhangyun 201506
3 24 yiyi 201506
4 32 mengmeng 201506
null 45 null 201506
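One plausible reading of the problem is visible in the last row: id and name are taken only from customer, so a customer that exists only in customer_temp (id 5) loses both values. The sketch below models the full outer join in plain Python to show this, and then applies coalesce to every column with the customer_temp value first. The helper names (coalesce, full_outer_join) and the all-column fix are illustrative assumptions, not something the article states.

```python
# Rows are (id, age, name, dt), copied from the article's sample data.
customer = [
    (1, 25, "jiangshouzhuang", "201506"),
    (2, 23, "zhangyun", "201506"),
    (3, 24, "yiyi", "201506"),
    (4, 32, "mengmeng", "201506"),
]
customer_temp = [
    (1, 26, "jiangshouzhuang", "201506"),
    (5, 45, "xiaosan", "201506"),
]

NULL_ROW = (None, None, None, None)


def coalesce(*values):
    """Return the first value that is not None, like SQL coalesce."""
    return next((v for v in values if v is not None), None)


def full_outer_join(temp_rows, cust_rows):
    """Pair rows of both tables by id; a missing side is filled with None."""
    temp_by_id = {r[0]: r for r in temp_rows}
    cust_by_id = {r[0]: r for r in cust_rows}
    for key in sorted(temp_by_id.keys() | cust_by_id.keys()):
        yield temp_by_id.get(key, NULL_ROW), cust_by_id.get(key, NULL_ROW)


# Projection from the query above: id and name come from customer only.
flawed = [
    (c[0], coalesce(t[1], c[1]), c[2], coalesce(t[3], c[3]))
    for t, c in full_outer_join(customer_temp, customer)
]

# Hypothetical fix: coalesce every column, preferring customer_temp.
fixed = [
    tuple(coalesce(tv, cv) for tv, cv in zip(t, c))
    for t, c in full_outer_join(customer_temp, customer)
]

print(flawed[-1])
print(fixed[-1])
```

With every column coalesced, the new customer keeps its id and name, and the result contains the same rows as the union all query.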
