Hive: creating tables and loading data (load data)

The official wiki gives the following syntax for creating a table:

create [external] table [if not exists] table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type
[comment col_comment], ...)]
[clustered by (col_name, col_name, ...)
[sorted by (col_name [asc|desc], ...)]
into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location hdfs_path]

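The [row format delimited] clause sets the column (field) delimiter that the table expects when data is loaded into it.

The [stored as file_format] clause sets the file format of the loaded data. In the version described here, Hive itself supports only two file formats: text file and sequence file. If the data is plain text, use [stored as textfile]. If the data needs to be compressed, use [stored as sequencefile]. In general, unless we need to store serialized objects, we default to [stored as textfile].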

So the Hive SQL for creating an ordinary Hive table looks like this:

create table test_1(id int, name string, city string) row format delimited fields terminated by '\t' stored as textfile

Hive does not support many column types. Roughly speaking, they are numeric types plus a string type. The full list is:

tinyint smallint int bigint boolean float double

string

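Hive tables differ considerably from tables in an ordinary relational database such as MySQL. Every Hive table is ultimately just files, because Hive is built on top of Hadoop's file system.

Overall, Hive tables fall into three types.

1. Ordinary tables

Creating an ordinary table was shown above, so it is not repeated here. Each table corresponds to a location on the file system named after the table (in practice, a directory that holds the table's data files).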

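2. External tables

The external keyword lets the user create an external table and, at creation time, specify a path (location) that points to the actual data. For an internal table, Hive moves the data into the path designated by the data warehouse. For an external table, Hive only records where the data lives and does not move it. When a table is dropped, an internal table's metadata and data are deleted together, whereas dropping an external table deletes only the metadata and leaves the data in place. The SQL looks like this:

create external table test_1(id int, name string, city string) row format delimited fields terminated by '\t' stored as textfile location 'hdfs://../../..'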

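3. Partitioned tables

A partitioned table is created with the partitioned by clause. A table can have one or more partitions, and each partition lives in its own directory. In addition, both tables and partitions can be bucketed on a column with clustered by, which distributes the data into buckets according to that column's value. sorted by can also be used to sort the data. This can improve performance for particular applications. The SQL looks like this:

create table test_1(id int, name string, city string) partitioned by (pt string) row format delimited fields terminated by '\t' stored as textfile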

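Because of how it is implemented underneath, sorting in Hive differs from ordinary sorting, so it is not covered here.

Buckets exist mainly for performance: think of them as a further subdivision of the data within a partition, based on a column. Underneath, a bucket is actually a file. If a table is split into too many buckets, the number of files explodes, and hitting the file system's file-count limit is a disaster. What the optimal number is, I honestly do not know.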
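
To make the bucket idea concrete, here is a minimal sketch of a bucketed, partitioned table. The table name test_bucketed, the choice of id as the bucketing column, and the bucket count of 4 are assumptions for illustration only, not recommendations.

```sql
-- Hypothetical table: test_bucketed, the bucketing column, and the
-- bucket count are illustrative assumptions.
create table test_bucketed (id int, name string, city string)
partitioned by (pt string)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t'
stored as textfile;
```

With this layout each partition would hold up to 4 bucket files, which is why a large bucket count multiplied by many partitions inflates the file count so quickly.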

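A partitioned table is really a directory whose name is the table name, and each partition is a separate subdirectory under it. Partitions can be defined by time, location, and so on. For example, with one partition per day, each day's data is stored under that day; with one per city, each city's data is stored under that city. When querying, a condition such as where pt='2010_08_23' is enough to select the data for a given date.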
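
As a sketch of such a query, assuming the partitioned test_1 table defined earlier with its string partition column pt:

```sql
-- pt is a string partition column, so the value is quoted.
-- Only the files under the pt='2010_08_23' partition need to be read.
select id, name, city
from test_1
where pt = '2010_08_23';
```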

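In summary, an ordinary table resembles a MySQL table, and an external table is mostly a mapping to a data path. The partitioned table is the hardest to understand and is also Hive's biggest strength. A later post will cover partitioned tables specifically.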

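In the Hive version described here, inserting rows one at a time with insert statements is not supported, and neither is update. Data gets into an existing table by being loaded. Once imported, the data cannot be modified: you either drop the whole table, or create a new table and import new data.

The official syntax is: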

load data [local] inpath 'filepath' [overwrite] into table tablename [partition (partcol1=val1, partcol2=val2 ...)]

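Broadly, Hive loads data in two ways: loading a file, or inserting the result of a query on one table into a target table.

Broken down further, I count four different ways to load:

1. Load data into a specified table

This loads a file directly into the specified table, which can be an ordinary table or a partitioned table. The SQL looks like this: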

load data local inpath '/home/admin/test/test.txt' overwrite into table test_1

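The [overwrite] keyword replaces the data already in the table. Without it, the existing data is kept.

The [local] keyword means the file being loaded comes from the local file system. Without it, the file is read from HDFS.

Here 'home/admin/test/test.txt' is a relative path,

and '/home/admin/test/test.txt' is an absolute path.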
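
For contrast, here is a sketch of the same load with both keywords omitted. The HDFS path below is a made-up example, not one from the original setup.

```sql
-- No "local": the path is resolved on HDFS (hypothetical path).
-- No "overwrite": the file is added alongside the data already in test_1.
load data inpath '/user/admin/test/test.txt' into table test_1;
```

Note that when the source is already on HDFS, Hive moves the file into the table's location rather than copying it, so it will no longer be at the original path afterwards.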

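2. Load into a partition of a specified table

This loads a file directly into a specific partition of the specified table. The table must itself be partitioned; with an ordinary table the import reports success, but no data is actually imported. The SQL looks like this:

load data local inpath '/home/admin/test/test.txt' overwrite into table test_1 partition (pt='***x')

load also accepts a directory, in which case every file in the directory is loaded into the specified table. HDFS spreads the files under a directory path across different physical locations, so when the data volume is large Hive can load from many files rather than being restricted to a single one.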
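
A minimal sketch of a directory load, assuming /home/admin/test/ contains only data files in the tab-delimited format that test_1 expects:

```sql
-- The path names a directory, not a single file; every file inside it
-- is loaded into test_1 (assumes the directory holds only data files).
load data local inpath '/home/admin/test/' overwrite into table test_1;
```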

3. insert + select

This way of importing data is completely different from the file-based ones. The official syntax is:

standard syntax:
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement 
hive extension (multiple inserts):
from from_statement
insert overwrite table tablename1 [partition (partcol1=val1, partcol2=val2 ...)] select_statement1
[insert overwrite table tablename2 [partition ...] select_statement2] ...
hive extension (dynamic partition inserts):
insert overwrite table tablename partition (partcol1[=val1], partcol2[=val2] ...) select_statement from from_statement

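Its usage is entirely different from the two file-based methods above. Read as SQL, it simply writes the rows returned by a query straight into another table. It is not analyzed in detail here; a later post on queries will cover it.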
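
As a rough sketch of the standard form, assume test_1 is the ordinary table from earlier, test_2 is a hypothetical table with the same three columns, and test_2_part is a hypothetical partitioned variant. The city value is also made up.

```sql
-- Replace the contents of test_2 with the rows selected from test_1.
insert overwrite table test_2
select id, name, city from test_1 where city = 'hangzhou';

-- Same idea, writing into one static partition of a partitioned table.
insert overwrite table test_2_part partition (pt='2010_08_23')
select id, name, city from test_1;
```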

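4. alter table: operating on partitions

(1) Add a new partition

When altering a table's structure we can add a new partition and, by giving a location in the same statement, make the data at that location available in the new partition directly.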

alter table table_name add  
partition_spec [ location 'location1' ]   
partition_spec [ location 'location2' ] ...
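
A concrete instance of this syntax might look like the following. The partition value and the HDFS location are assumptions for illustration.

```sql
-- Register a new partition of test_1 whose data already sits at the
-- given (hypothetical) HDFS directory.
alter table test_1 add
partition (pt='2010_08_23') location '/user/admin/test/2010_08_23';
```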

(2) Drop a partition

alter table table_name drop partition (partcol1[=value1]);
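
For example, assuming test_1 has a partition pt='2010_08_23' (a made-up value), dropping it and then checking what remains could look like this:

```sql
-- Remove one partition from test_1.
alter table test_1 drop partition (pt='2010_08_23');

-- List the partitions that are still registered for the table.
show partitions test_1;
```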
