Big data: common Hive problems and solutions

Data skew

1. Reference: /display/hive/languagemanual+ddl#languagemanualddl-skewedtables

2. Example 1:
create table list_bucket_single (key string, value string)
skewed by (key) on (1,5,6) [stored as directories];
stored as directories: optional; controls whether a separate subdirectory is created for each of the listed skewed values.

3. Example 2:
create table list_bucket_multiple (col1 string, col2 int, col3 string)
skewed by (col1, col2) on (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [stored as directories];

4. If skew was not specified when the table was created, it can be changed afterwards with alter table.
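As a minimal sketch of item 4, the statements below apply skew settings to an existing table. They reuse the list_bucket_single table and the values from Example 1; whether you want list bucketing (stored as directories) depends on your data.

```sql
-- Declare key as skewed on an existing table (values mirror Example 1).
alter table list_bucket_single skewed by (key) on (1,5,6) stored as directories;

-- Keep the skew information but stop using separate subdirectories.
alter table list_bucket_single not stored as directories;

-- Remove the skew information altogether.
alter table list_bucket_single not skewed;
```

Running describe formatted list_bucket_single afterwards should show the skewed columns and values currently recorded for the table.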

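1. Adjust the number of reducers
set hive.exec.reducers.max=2000;
set mapred.reduce.tasks=2000; -- increase the number of reducers

2. Skew during join
set hive.skewjoin.key=1000000; -- a join key with more rows than this value is treated as skewed and processed separately; choose the value according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true when skew occurs during a join

3. Skew during group by
set hive.groupby.mapaggr.checkinterval=1000000; -- number of rows after which map-side aggregation checks the size of the grouping keys; choose the value according to the actual data volume
set hive.groupby.skewindata=true; -- set to true when skew occurs during a group by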
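hive.groupby.skewindata is commonly described as splitting the aggregation into two jobs: the first spreads rows randomly across reducers and pre-aggregates, the second merges the partial results by the real key. The query below is a hand-written approximation of that idea, not what Hive generates internally. The table events, the column user_id and the salt range of 10 are all assumptions made for illustration.

```sql
-- Stage 1 (inner queries): attach a random salt so that rows of a hot
-- user_id are spread over up to 10 groups, then pre-aggregate per group.
-- Stage 2 (outer query): merge the partial counts by the real key.
select user_id, sum(partial_cnt) as cnt
from (
  select user_id, salt, count(*) as partial_cnt
  from (
    select user_id, cast(floor(rand() * 10) as int) as salt
    from events
  ) s
  group by user_id, salt
) p
group by user_id;
```

This only works directly for aggregates that can be merged from partial results, such as count and sum.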

Function usage details

Hive tuning

1. Reference 1:
2. Reference 2:

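2. Filter data as early as possible to reduce the amount of data at each stage. For partitioned tables, always add a partition filter, and select only the columns that are actually needed.
select ... from a
join b on a.key = b.key
where a.userid > 10
and b.userid < 10
and a.dt = '20120417'
and b.dt = '20120417';
should be rewritten as:
select ... from (select ... from a where dt = '20120417'
and userid > 10) a join
(select ... from b where dt = '20120417'
and userid < 10) b on (a.key = b.key);
The reason is that the join is evaluated before the where clause, so conditions that should filter the rows taking part in the join belong in the subqueries or in the on clause.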
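For comparison, here is a sketch of the same filters written in the on clause instead of in subqueries. The tables and filter values come from the example above; the selected columns and their aliases are only illustrative.

```sql
-- Same filters as the subquery rewrite, expressed in the on clause.
select a.key, a.userid as a_userid, b.userid as b_userid
from a
join b
  on (a.key = b.key
      and a.dt = '20120417' and a.userid > 10
      and b.dt = '20120417' and b.userid < 10);
```

For an inner join this returns the same rows as the where version. For outer joins the two forms are not equivalent, so the placement of each condition has to be chosen deliberately. Prefixing either query with explain shows where Hive actually applies the filters.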

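3. Temporary tables
3.1. If a table has many columns but the join uses only a few of them, and the table is used frequently, extract those columns into an intermediate table and join against that; this greatly reduces the time spent scanning the table.
3.2. If you do not want to create a regular managed table as described above, because the intermediate table is not needed once the session is over, use create temporary table table_name_here (key string, value string) to create a temporary table, which is dropped automatically when the session ends.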
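A minimal sketch combining 3.1 and 3.2: extract only the needed columns into a temporary table, then join against it. All table and column names here (wide_table, wide_table_slim, other_table, user_id, amount, dt) are hypothetical.

```sql
-- Keep only the columns the join needs; the table disappears
-- automatically when the session ends.
create temporary table wide_table_slim as
select user_id, amount
from wide_table
where dt = '20120417';

-- Join against the narrow copy instead of the wide table.
select o.user_id, s.amount
from other_table o
join wide_table_slim s on (o.user_id = s.user_id);
```

If the narrow copy is needed across sessions, drop the temporary keyword and manage the table's lifetime yourself.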

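4. Put the small table on the left. In a regular join, Hive buffers the rows of the tables on the left in memory and streams the last (rightmost) table through row by row, matching each row against the buffered data, so putting the small table on the left reduces disk and memory usage. A small table here means one with few rows that occupies little space.

If a single SQL statement joins several tables, join the tables that occupy little space last, so that the large tables do not take up too much disk space when intermediate results are spilled to disk.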
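When reordering the tables in a query is inconvenient, Hive also offers settings and hints that affect which table is buffered and which is streamed. The sketch below assumes two hypothetical tables, big_t and small_t; whether automatic map-join conversion kicks in depends on your Hive version and size thresholds.

```sql
-- Let Hive turn joins against sufficiently small tables into map joins.
set hive.auto.convert.join=true;

-- Explicitly name the table to stream, regardless of its position.
select /*+ streamtable(b) */ b.key, s.value
from big_t b
join small_t s on (b.key = s.key);
```

With the streamtable hint, big_t is streamed and small_t is buffered even though big_t is written first.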

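5. order by sends all of the data to a single reducer for sorting, which can put that reducer under heavy pressure so that the job takes a very long time to finish. (In strict mode, hive.mapred.mode=strict, order by must be accompanied by a limit clause before the query is allowed to run; sorting with a limit is also faster.) If only the values need to be ordered and the keys can be left unordered, a partitioned sort is the better choice. For example, to list each user's transaction amounts from high to low, only the amounts within each user need to be sorted; the user ids themselves do not. The syntax looks like this:
select user_id, amount from table distribute by user_id sort by user_id, amount desc
This uses distribute by and sort by instead of order by. distribute by specifies the key by which map output is distributed. All rows with the same user_id are then sorted on the same reducer, while different user_ids can be sorted on several reducers at the same time. Compared with order by, which can only sort on a single reducer, this is several times faster.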

Characteristics of RCFile and other file formats

.com/articles/ynfqn2

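Getting the latest partition

# Take the last line of the partition list, keep the value after the first '=', and drop any sub-partition after '/'.
latest_user_relation_dt=$($hive_bin -S -e "show partitions feature_offline_relation" | awk -F '=' 'END {print $2}' | awk -F '/' '{print $1}')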
