Sorting clauses in Hive

Hive has several commonly used sorting clauses:

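order by # Global sort. Because the sort is global, the job runs with a single reduce task, and the settings that control the number of reduce tasks have no effect on it.

sort by # With one reduce task this is a global sort. With multiple reduce tasks, each one sorts only its own data. To speed up a global sort, you can first do a local sort with sort by and then do the global sort.

distribute by # Hash partitioning. Rows are distributed to reduce tasks according to the key and the number of reduce tasks, using hashing by default.

cluster by # Equivalent to distribute by + sort by on the same columns (note how it is used together with the hive.enforce.bucketing parameter).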
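
To make the difference between the four clauses concrete, here is a small Java sketch of what distribute by followed by sort by does to a list of keys. It is a toy model, not Hive's implementation: the class and method names are made up for this example, and the hash function is simply Java's integer hash taken modulo the number of reduce tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of DISTRIBUTE BY + SORT BY; not Hive's actual implementation.
class DistributeSortSketch {

    // Route a key to a reduce task by hash, as DISTRIBUTE BY does by default.
    // Math.floorMod keeps the index non-negative for negative hash codes.
    static int reducerFor(int key, int numReducers) {
        return Math.floorMod(Integer.hashCode(key), numReducers);
    }

    // Partition the keys, then sort inside each partition (SORT BY).
    static List<List<Integer>> distributeAndSort(List<Integer> keys, int numReducers) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            partitions.get(reducerFor(key, numReducers)).add(key);
        }
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(7, 2, 9, 4, 1, 8);
        // Two reduce tasks: each output is sorted, the concatenation is not.
        System.out.println(distributeAndSort(keys, 2)); // [[2, 4, 8], [1, 7, 9]]
        // One reduce task: the single output is globally sorted, like ORDER BY.
        System.out.println(distributeAndSort(keys, 1)); // [[1, 2, 4, 7, 8, 9]]
    }
}
```

With two reduce tasks every output file is sorted on its own, but reading the files one after another does not give a sorted sequence. That is why only order by (or sort by with a single reduce task) gives a true global order.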

Two examples show how these clauses are used in practice:

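1) Optimizing the CDN job

Anyone familiar with how map slots are allocated knows that gzip-compressed text files (text gz) cannot be split, so each such file is processed by at most one map task, no matter how much data it holds. When the file is large and the computation is complex (CDN reports, for example, need various aggregations plus IP address resolution), processing is slow.

One approach is to create an intermediate table, load it from the raw table with distribute by, and run the complex computation on the intermediate table. The reduce tasks write the intermediate table as multiple files, so the computation can use multiple map tasks and runs faster.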
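
The effect on the number of map tasks can be sketched with a rough model in Java. This is a deliberate simplification: real split planning has more rules, the 128 MiB split size is a hypothetical value chosen for illustration, and the class and method names are made up for this example.

```java
import java.util.Arrays;

// Simplified model of map-task counts; real split planning has more rules.
class MapCountSketch {

    // A non-splittable file (such as gzip-compressed text) gets one map task.
    // A splittable file gets roughly one map task per split.
    static long mapTasksForFile(long fileSizeBytes, long splitSizeBytes, boolean splittable) {
        if (!splittable) {
            return 1;
        }
        long splits = (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
        return Math.max(1, splits);
    }

    // Total map tasks over a set of files that share the same format.
    static long mapTasksForFiles(long[] fileSizes, long splitSizeBytes, boolean splittable) {
        long total = 0;
        for (long size : fileSizes) {
            total += mapTasksForFile(size, splitSizeBytes, splittable);
        }
        return total;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long splitSize = 128L * 1024 * 1024; // hypothetical split size

        // One 10 GiB gzip file: a single map task does all the work.
        long[] singleFile = {10 * gib};
        System.out.println(mapTasksForFiles(singleFile, splitSize, false)); // 1

        // The same data redistributed into 10 files: 10 map tasks,
        // even if each file is still not splittable.
        long[] tenFiles = new long[10];
        Arrays.fill(tenFiles, gib);
        System.out.println(mapTasksForFiles(tenFiles, splitSize, false)); // 10
    }
}
```

In this model the gain comes entirely from the number of files in the intermediate table, which is set by the number of reduce tasks used in the distribute by step.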

2) Testing with Solr

When testing hive2solr, we need to simulate writing from a single file and from multiple files.

The SQL used by default is:

insert into table *** select * from data_for_solr;

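This is an HDFS operation that does not involve MapReduce, so it is not a good way to simulate writing multiple files.

You can append a distribute by clause and set the number of reduce tasks appropriately (mapred.reduce.tasks), so that each reduce task produces its own output file.

insert into table *** select * from data_for_solr distribute by session_id;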

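Note: the number of reduce slots allocated is determined by the following:

1) The type of SQL statement (for example, order by uses only one reduce task)

2) Parameters

mapred.reduce.tasks

The default number of reduce tasks per job. The default is -1, which means the number of reduce tasks is chosen automatically based on the job. When set explicitly, it takes precedence over the two settings below.

hive.exec.reducers.bytes.per.reducer

The number of reduce tasks is derived from the size of the reduce input (the map output, whether compressed or not). The default is 1 GB (256 MB since Hive 0.14).

hive.exec.reducers.max

The maximum number of reduce tasks. The default is 999 (1009 since Hive 0.14).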

Calculation:

reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);

reducers = Math.max(1, reducers);

reducers = Math.min(maxReducers, reducers);
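
As a worked example, the calculation can be wrapped in a small runnable Java class. This is only a sketch: the class name, the method names, and the chooseReducers helper are made up for illustration, and 1 GB is taken as 10^9 bytes. The helper models the rule stated above that an explicit mapred.reduce.tasks value wins over the estimate.

```java
// Sketch of the reducer-count estimate; variable names mirror the formula above.
class ReducerEstimateSketch {

    // Ceiling division of input size by bytes-per-reducer, clamped to [1, maxReducers].
    static int estimateReducers(long totalInputFileSize, long bytesPerReducer, int maxReducers) {
        int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
        reducers = Math.max(1, reducers);
        reducers = Math.min(maxReducers, reducers);
        return reducers;
    }

    // Hypothetical helper: a positive mapred.reduce.tasks overrides the estimate,
    // while -1 (the default) falls back to the formula.
    static int chooseReducers(int mapredReduceTasks, long totalInputFileSize,
                              long bytesPerReducer, int maxReducers) {
        if (mapredReduceTasks > 0) {
            return mapredReduceTasks;
        }
        return estimateReducers(totalInputFileSize, bytesPerReducer, maxReducers);
    }

    public static void main(String[] args) {
        long gb = 1_000_000_000L; // 1 GB taken as 10^9 bytes for this example

        System.out.println(estimateReducers(10 * gb + 1, gb, 999)); // 11 (rounds up)
        System.out.println(estimateReducers(0, gb, 999));           // 1 (lower bound)
        System.out.println(estimateReducers(5000 * gb, gb, 999));   // 999 (upper bound)
        System.out.println(chooseReducers(4, 5000 * gb, gb, 999));  // 4 (explicit setting)
        System.out.println(chooseReducers(-1, 3 * gb, gb, 999));    // 3 (estimated)
    }
}
```

Adding bytesPerReducer - 1 before dividing turns the integer division into a ceiling, so even one byte over a multiple of bytesPerReducer adds another reduce task.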

The allocation rules for map/reduce slots will be covered another time.
