Hive map side join: introduction and test

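As the name suggests, a mapjoin joins tables during the map phase instead of waiting for the reduce phase. This avoids the large volume of data transfer that the shuffle phase would otherwise require, which is what makes it an optimization.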

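In other words, the join happens on the map side. The underlying principle is a broadcast join: the small table is treated as a complete driving table for the join. Normally, the data of each table being joined is spread across different map tasks, so the values for one key may sit in different maps and the join has to wait until the reduce phase. For a mapjoin to work, one condition must hold: apart from one table whose data is distributed across the maps, every other table in the join must have a complete copy available in each map. A mapjoin reads the entire small table into memory and, during the map phase, matches rows of the other table directly against the in-memory data. Because the join is done in the map phase, the reduce phase is skipped and execution is much more efficient.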
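The broadcast idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Hive's implementation: the table contents, the two "splits", and the function names build_hash_table and map_side_join are all hypothetical.

```python
from collections import defaultdict

def build_hash_table(small_table):
    """Load the whole small table into memory, keyed by the join key."""
    table = defaultdict(list)
    for key, value in small_table:
        table[key].append(value)
    return table

def map_side_join(big_table_split, hash_table):
    """Join one split of the big table against the in-memory small table."""
    for key, value in big_table_split:
        for small_value in hash_table.get(key, []):
            yield (key, value, small_value)

# Hypothetical data: the big table is divided into two splits, one per map task.
small_table = [(1, "dept-a"), (2, "dept-b")]
splits = [[(1, "alice"), (3, "carol")], [(2, "bob"), (1, "dave")]]

hash_table = build_hash_table(small_table)  # every map task gets a full copy
joined = [row for split in splits for row in map_side_join(split, hash_table)]
print(joined)
```

Each split is joined independently, so no record has to be shuffled to a reducer to meet its matching key.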

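Typical use cases for mapjoin are joins in which one of the tables is very small, and non-equi joins. As the analysis above suggests, mapjoin does not fit every scenario. It is usually applied when, of the two tables being joined, one is very large and the other is small enough to be held in memory without hurting performance. The small table's file is then copied to the local disk of every map task, and each map reads the file into memory for later use. In a hand-written MapReduce program this takes two steps: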

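1) In the MapReduce driver program, call the static method DistributedCache.addCacheFile() to register the small-table file to be copied. Before the job starts, the JobTracker obtains this list of URIs and copies the corresponding files to the local disk of each TaskTracker.

2) In the setup method of the map class, call DistributedCache.getLocalCacheFiles() to obtain the local file paths, then read the files with the standard file I/O API.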
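The two steps above can be imitated in plain Python to show the control flow. This is a stand-in, not the Hadoop API: the JoinMapper class, the temporary CSV file, and the sample records are assumptions made for illustration, and the small table is assumed to have unique keys.

```python
import csv
import os
import tempfile

class JoinMapper:
    """Plain-Python stand-in for a mapper; not the Hadoop API."""

    def setup(self, local_cache_path):
        # Analogue of step 2: read the locally cached small-table file into memory.
        self.small = {}
        with open(local_cache_path, newline="") as f:
            for key, value in csv.reader(f):
                self.small[key] = value

    def map(self, key, value):
        # Emit a joined record only when the key exists in the small table.
        if key in self.small:
            yield (key, value, self.small[key])

# Analogue of step 1: a small-table file that has been copied to local disk.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    csv.writer(f).writerows([["1", "dept-a"], ["2", "dept-b"]])
    cache_path = f.name

mapper = JoinMapper()
mapper.setup(cache_path)  # runs once per map task, before any record is processed
records = [("1", "alice"), ("3", "carol"), ("2", "bob")]
output = [row for k, v in records for row in mapper.map(k, v)]
os.remove(cache_path)
print(output)
```

The point of doing the load in setup is that the file is read once per map task rather than once per input record.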

Mapjoin is one of the optimizations built into Hive.

Before Hive v0.7, Hive applied the mapjoin optimization only when the query gave an explicit mapjoin hint.

From Hive v0.7 onward, the optimization can be applied without a mapjoin hint. It is controlled by the following configuration parameter:

hive> set hive.auto.convert.join=true;

Since Hive 0.11, when the table sizes satisfy the settings

hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=10000
hive.mapjoin.smalltable.filesize=25000000

a join is converted to a map join by default (hive.ignore.mapjoin.hint and hive.auto.convert.join both default to true). However, map join in Hive 0.11 has quite a few bugs, so you can turn automatic conversion off by default and fall back to hints when needed:

hive.auto.convert.join=false
hive.ignore.mapjoin.hint=false

In Hive v0.12.0, the mapjoin optimization is enabled by default, that is:

hive.auto.convert.join=true

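Hive also provides another parameter, the size of the table file (in bytes, about 25 MB here), as the threshold for switching mapjoin on and off:

hive.mapjoin.smalltable.filesize=25000000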
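To make the role of this threshold concrete, here is a deliberately simplified sketch of a size check. It is not Hive's planner logic, which also considers the join type and which side may be streamed; the function name can_use_mapjoin and the table sizes are hypothetical.

```python
SMALLTABLE_FILESIZE = 25_000_000  # value of hive.mapjoin.smalltable.filesize shown above

def can_use_mapjoin(table_sizes, threshold=SMALLTABLE_FILESIZE):
    """Simplified check: treat the largest table as the streamed table and
    require the combined size of all the others to stay under the threshold."""
    if len(table_sizes) < 2:
        return False
    sizes = sorted(table_sizes.values())
    return sum(sizes[:-1]) < threshold

# Hypothetical table sizes in bytes.
print(can_use_mapjoin({"orders": 8_000_000_000, "regions": 4_096}))
print(can_use_mapjoin({"orders": 8_000_000_000, "users": 900_000_000}))
```

With these assumed sizes, the first join qualifies because the small side is far below the threshold, while the second does not.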

Hive 2.1.1

The output below shows that the map-side join is enabled by default:

hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> set hive.mapjoin.smalltable.filesize;
hive.mapjoin.smalltable.filesize=25000000

The test joins two small tables. The execution log shows that there is no reduce phase:

hive> select * from u1 left join u2 on u1.id=u2.id;
Query ID = root_20201228193725_f83e5aa0-ec9f-44b3-8efd-655d34bbb638
Total jobs = 1
2020-12-28 19:38:01 Starting to launch local task to process map join; maximum memory = 477626368
2020-12-28 19:38:06 Dump the side-table for tag: 1 with group count: 4 into file: file:/usr/local/hive/iotmp/root/54f6a901-c015-4394-a032-97f861b9cdb2/hive_2020-12-28_19-37-25_392_4405802127612271879-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2020-12-28 19:38:06 Uploaded 1 File to: file:/usr/local/hive/iotmp/root/54f6a901-c015-4394-a032-97f861b9cdb2/hive_2020-12-28_19-37-25_392_4405802127612271879-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (348 bytes)
2020-12-28 19:38:06 End of local task; Time Taken: 4.661 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2020-12-28 19:38:24,939 Stage-3 map = 0%, reduce = 0%
2020-12-28 19:38:26,089 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local1658451598_0001
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 25 HDFS Write: 240 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a NULL NULL
2 b 2 bb
3 c 3 cc
4 d NULL NULL
7 y 7 yy
8 u NULL NULL
NULL NULL NULL NULL
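The result rows can be reproduced with a small in-memory simulation of a left outer map join. The u1 rows are read off the query output; the u2 rows are an assumption, since only its matching rows are visible in the output, and the function left_map_join is hypothetical.

```python
def left_map_join(left_rows, right_rows):
    # Build the in-memory hash table from the small (right) table.
    table = {}
    for key, value in right_rows:
        if key is not None:  # SQL semantics: NULL keys never match
            table.setdefault(key, []).append(value)
    result = []
    for key, value in left_rows:
        matches = table.get(key) if key is not None else None
        if matches:
            result.extend((key, value, key, m) for m in matches)
        else:
            # Left outer join: keep the left row and pad the right side with NULLs.
            result.append((key, value, None, None))
    return result

u1 = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u"), (None, None)]
u2 = [(2, "bb"), (3, "cc"), (7, "yy")]  # assumed contents
for row in left_map_join(u1, u2):
    print(row)
```

The right-hand table is the one loaded into the hash table, which matches the log line that dumps a side-table for tag 1 while the left table is streamed through the map tasks.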
