Hive Tuning and Data Skew

Tuning:

Fetch task conversion (select, filter, limit). This mainly matters on older Hive versions, where the default is minimal; newer versions default to more.

-- none: every query, including a plain select, is converted to a MapReduce job

hive> set hive.fetch.task.conversion=none;

hive> select * from emp;

query id = yangting_20180812081546_8bbe72fd-e30b-4892-ae71-1f3704d678ed

total jobs = 1

launching job 1 out of 1

number of reduce tasks is set to 0 since there's no reduce operator

job running in-process (local hadoop)

2018-08-12 08:15:53,011 stage-1 map = 100%, reduce = 0%

ended job = job_local108385281_0001

mapreduce jobs launched:

stage-stage-1: hdfs read: 2322 hdfs write: 1651 success

total mapreduce cpu time spent: 0 msec

ok

7369 smith clerk 7902 1980-12-17 800.0 null 20

7499 allen salesman 7698 1981-2-20 1600.0 300.0 30

7521 ward salesman 7698 1981-2-22 1250.0 500.0 30

7566 jones manager 7839 1981-4-2 2975.0 null 20

7654 martin salesman 7698 1981-9-28 1250.0 1400.0 30

7698 blake manager 7839 1981-5-1 2850.0 null 30

7782 clark manager 7839 1981-6-9 2450.0 null 10

7788 scott analyst 7566 1987-4-19 3000.0 null 20

7839 king president null 1981-11-17 5000.0 null 10

7844 turner salesman 7698 1981-9-8 1500.0 0.0 30

7876 adams clerk 7788 1987-5-23 1100.0 null 20

7900 james clerk 7698 1981-12-3 950.0 null 30

7902 ford analyst 7566 1981-12-3 3000.0 null 20

7934 miller clerk 7782 1982-1-23 1300.0 null 10

time taken: 6.61 seconds, fetched: 14 row(s)

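-- minimal: data is fetched directly without MapReduce (a filter on a non-partition column still triggers MapReduce)

hive> set hive.fetch.task.conversion=minimal;

-- more: data is fetched directly without MapReduce, including select, filter, and limit

hive> set hive.fetch.task.conversion=more;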
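
One way to see the effect of this setting without waiting for a job is to compare query plans. The following is a minimal sketch, assuming the emp table used above and its deptno column; the filter value 20 and the limit are illustrative, and the exact plan text varies by Hive version.

```sql
-- With "more", the plan for a simple select/filter/limit query is expected
-- to contain only a Fetch Operator stage and no Map Reduce stage.
set hive.fetch.task.conversion=more;
explain select * from emp where deptno = 20 limit 5;

-- With "none", the same query is expected to show a Map Reduce stage.
set hive.fetch.task.conversion=none;
explain select * from emp where deptno = 20 limit 5;
```

Because set only affects the current session, switching back to more afterwards restores the faster behaviour for simple queries.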

Local mode

1) Background

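Most Hadoop jobs need the full scalability that Hadoop provides in order to process large data sets. Sometimes, however, the input to a Hive query is very small. In that case the overhead of launching the job can be much larger than the time the job itself takes to run. For most such cases, Hive can use local mode to handle all the tasks on a single machine.

For small data sets this can shorten execution time noticeably. Setting hive.exec.mode.local.auto to true lets Hive apply this optimization automatically when appropriate.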

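-- enable local MapReduce

set hive.exec.mode.local.auto=true;

-- maximum input size for local MapReduce; local mode is used when the input is smaller than this value. Default: 134217728 (128 MB)

set hive.exec.mode.local.auto.inputbytes.max=50000000;

-- maximum number of input files for local MapReduce; local mode is used when the number of input files is smaller than this value. Default: 4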

set hive.exec.mode.local.auto.input.files.max=10;
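
Before changing these thresholds it can help to check what the session is currently using. A minimal sketch, assuming the Hive CLI or Beeline and the emp table from the earlier example; the group by query is only an illustration of a small job.

```sql
-- A set command without "=value" prints the current value of the property.
set hive.exec.mode.local.auto;
set hive.exec.mode.local.auto.inputbytes.max;
set hive.exec.mode.local.auto.input.files.max;

-- Enable automatic local mode for this session only.
set hive.exec.mode.local.auto=true;

-- A small aggregation like this one can then run locally,
-- provided its input stays under both thresholds.
select deptno, count(*) from emp group by deptno;
```

According to the Hive documentation, local mode is also only chosen when the job needs zero or one reduce task, so a query that requires several reducers still goes to the cluster.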

2) Hands-on example:

(1) Enable local mode and run the query

hive (default)> set hive.exec.mode.local.auto=true;

hive (default)> select * from emp cluster by deptno;

time taken: 1.328 seconds, fetched: 14 row(s)

(2) Disable local mode and run the same query

hive (default)> set hive.exec.mode.local.auto=false;

hive (default)> select * from emp cluster by deptno;

time taken: 20.09 seconds, fetched: 14 row(s)

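Joining large and small tables

Put the table whose keys are relatively evenly distributed and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors.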

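Going further, a map join can load a small dimension table (fewer than 1,000 records) into memory first, so that the join completes on the map side and no reduce phase is needed.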

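Testing shows that newer Hive versions already optimize both small-table-join-large-table and large-table-join-small-table queries; placing the small table on the left or on the right makes no noticeable difference.

Hands-on example

(0) Goal: compare the efficiency of joining a large table to a small table with joining a small table to a large table

(1) Statements to create the large table, the small table, and the join result table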

create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int,

click_num int, click_url string) row format delimited fields terminated by '\t';

create table smalltable(id bigint, time bigint, uid string, keyword string,

url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

create table jointable(id bigint, time bigint,

uid string,

keyword string,

url_rank int,click_num int,

click_url string)

row format delimited fields terminated by '\t';

(2) Load data into the large table and the small table

hive (default)> load data local inpath '/home/yc/desktop/bigtable.txt' into table bigtable;

hive (default)> load data local inpath '/home/yc/desktop/smalltable.txt' into table smalltable;

(3) Disable automatic map join conversion (it is enabled by default)

set hive.auto.convert.join = false;

(4) Threshold that separates small tables from large ones (by default, tables under 25 MB count as small)

set hive.mapjoin.smalltable.filesize=25000000;

(5) Run the small-table join large-table statement

insert overwrite table jointable

select

b.id,

b.time,

b.uid,

b.keyword,

b.url_rank,

s.click_num,

s.click_url

from smalltable s left join bigtable b on b.id = s.id;

time taken: 35.921 seconds

(6) Run the large-table join small-table statement

insert overwrite table jointable

select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url

from bigtable b

left join smalltable s

on s.id = b.id;

time taken: 34.196 seconds
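
Both timings above were measured with map join conversion turned off, so they compare two reduce-side joins. To see what the optimizer does when the conversion is left on, the plan can be inspected with explain. This is a minimal sketch using the bigtable and smalltable definitions above; the inner join and the expectation about the plan are assumptions, and the plan text depends on the Hive version and on the actual file sizes.

```sql
-- Re-enable automatic map join conversion (the default).
set hive.auto.convert.join=true;

-- Tables below this size in bytes are treated as small (same value as above).
set hive.mapjoin.smalltable.filesize=25000000;

-- If smalltable is under the threshold, the plan is expected to contain
-- a Map Join Operator instead of a reduce-side Join Operator.
explain
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
join smalltable s
on s.id = b.id;
```

If the plan still shows a reduce-side join, the small table is probably larger than the configured threshold.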

Custom partitioning

Increase JVM memory

Increase the number of reducers

Merge files
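
The four items above are listed without settings, so the following is only a sketch of how each one might be expressed in a Hive session. All values are illustrative, the property names are the Hadoop 2+ and Hive names, and reading "custom partitioning" as distribute by is an assumption, since the text does not say how the partitioning is customized.

```sql
-- Custom partitioning (assumed reading): distribute by chooses the column
-- that decides which reducer each row is sent to.
select * from emp distribute by deptno sort by deptno;

-- Increase JVM memory for map and reduce tasks.
set mapreduce.map.memory.mb=2048;
set mapreduce.map.java.opts=-Xmx1638m;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276m;

-- Increase the number of reducers by fixing the count.
set mapreduce.job.reduces=10;
-- Alternative: lower the data volume per reducer so Hive picks more reducers.
-- This only takes effect when mapreduce.job.reduces is left at -1.
set hive.exec.reducers.bytes.per.reducer=128000000;

-- Merge files: combine small input files into fewer splits
-- and merge small output files at the end of the job.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
```

Which of these helps depends on where the skew comes from, so it is worth changing one setting at a time and comparing job counters before and after.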
