Having looked at HDFS, Hadoop's storage component, let's now turn to another important component: the computation engine, MapReduce. HDFS handles storage at scale; MapReduce handles computation at scale. Like other mature open-source projects, Hadoop ships with a rich set of demos. Below we use the bundled MapReduce demo to run a word frequency count.
# Switch to the home directory
cd
# Enter Hadoop's bin directory
cd hadoop-2.5.2/bin
# Run vim word, add the following lines, then save and quit. Feel free to use any
# other text; this is the file whose word frequencies we are about to count.
hello i am zhangli
hello i am xiaoli
hi i am ali
who are you
i am xiaoli
# Upload the word file to HDFS
./hdfs dfs -put word /word
# Verify the upload
./hdfs dfs -cat /word
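If the upload succeeded, the cat command prints back exactly the file we created:
hello i am zhangli
hello i am xiaoli
hi i am ali
who are you
i am xiaoli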
# Start the word count. In this command:
# ./yarn launches the job
# jar indicates that we are running a jar package
# /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar is the jar to run
# wordcount is the name of the example program inside the jar
# /word is the HDFS path of the uploaded file we want to analyze
# /output is the HDFS path where the results will be written
./yarn jar /root/hadoop-2.5.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /word /output
# After a short wait, you will see output like the following
19/05/30 12:29:41 INFO client.RMProxy: Connecting to ResourceManager at hadoop1/192.168.100.192:8032
19/05/30 12:29:46 INFO input.FileInputFormat: Total input paths to process : 1
19/05/30 12:29:47 INFO mapreduce.JobSubmitter: number of splits:1
19/05/30 12:29:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1559056674360_0002
19/05/30 12:29:51 INFO mapreduce.Job: Running job: job_1559056674360_0002
19/05/30 12:30:19 INFO mapreduce.Job: Job job_1559056674360_0002 running in uber mode : false
19/05/30 12:30:19 INFO mapreduce.Job:  map 0% reduce 0%
19/05/30 12:30:36 INFO mapreduce.Job:  map 100% reduce 0%
19/05/30 12:30:46 INFO mapreduce.Job:  map 100% reduce 100%
19/05/30 12:30:49 INFO mapreduce.Job: Job job_1559056674360_0002 completed successfully
19/05/30 12:30:50 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=111
        FILE: Number of bytes written=194141
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=156
        HDFS: Number of bytes written=65
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=15451
        Total time spent by all reduces in occupied slots (ms)=7614
        Total time spent by all map tasks (ms)=15451
        Total time spent by all reduce tasks (ms)=7614
        Total vcore-seconds taken by all map tasks=15451
        Total vcore-seconds taken by all reduce tasks=7614
        Total megabyte-seconds taken by all map tasks=15821824
        Total megabyte-seconds taken by all reduce tasks=7796736
    Map-Reduce Framework
        Map input records=5
        Map output records=17
        Map output bytes=135
        Map output materialized bytes=111
        Input split bytes=89
        Combine input records=17
        Combine output records=10
        Reduce input groups=10
        Reduce shuffle bytes=111
        Reduce input records=10
        Reduce output records=10
        Spilled Records=20
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=697
        CPU time spent (ms)=8200
        Physical memory (bytes) snapshot=445980672
        Virtual memory (bytes) snapshot=4215586816
        Total committed heap usage (bytes)=322437120
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=67
    File Output Format Counters
        Bytes Written=65
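A few of these counters are worth checking against our input: Map input records=5 matches the five lines of the word file, and Reduce output records=10 matches the number of distinct words in the final result shown below. One practical note: MapReduce will not write into an output directory that already exists, so before any re-run you must remove /output first.
# Only needed before a re-run; the job fails if /output already exists
./hdfs dfs -rm -r /output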
# List /output. It contains two files: _SUCCESS is an empty marker indicating the job succeeded, and part-r-00000 holds the actual results
./hdfs dfs -ls /output
Output:
Found 2 items
-rw-r--r--   2 root supergroup          0 2019-05-30 12:30 /output/_SUCCESS
-rw-r--r--   2 root supergroup         65 2019-05-30 12:30 /output/part-r-00000
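If you prefer to inspect the result with local tools, you can also copy it out of HDFS first (an optional step, not part of the session above):
# Download the result file into the current local directory
./hdfs dfs -get /output/part-r-00000 .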
# View the word count results
./hdfs dfs -cat /output/part-r-00000
# Output:
ali 1
am 4
are 1
hello 1
hi 1
i 4
who 1
xiaoli 2
you 1
zhangli 1
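Note that this result is sorted by key, i.e. alphabetically by word. If you would rather see it ordered by frequency, a quick option is to pipe the output through the local sort command (a small convenience sketch, not part of the session above):
# Sort by the count column, numerically, descending
./hdfs dfs -cat /output/part-r-00000 | sort -k2,2nr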
That is the whole process of running a word frequency count with Hadoop's built-in demo and viewing the results.