Understanding the MapReduce computing framework

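Task: write a wordcount program in Python

The wordcount program

Input: a text file containing a large number of words

Output: every word in the file together with its number of occurrences (its frequency), sorted alphabetically by word. Each word and its frequency occupy one line, separated by whitespace.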
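To make the expected output concrete, here is a minimal single-process sketch of the specification in plain Python, without MapReduce. The function name `word_count_lines` and the sample text are illustrative assumptions, not part of the original task.

```python
from collections import Counter


def word_count_lines(text):
    """Return 'word<TAB>count' lines sorted by word (hypothetical helper)."""
    counts = Counter(text.split())  # split on any run of whitespace
    return ['%s\t%d' % (word, n) for word, n in sorted(counts.items())]


if __name__ == '__main__':
    sample = "foo bar foo quux bar foo"  # made-up sample input
    for out_line in word_count_lines(sample):
        print(out_line)
```

The MapReduce version described in the following steps should produce the same kind of lines, but splits the counting into a map stage and a reduce stage so that it can run across many machines.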

Write the map function and the reduce function

cd /home/hadoop

mkdir wc

cd /home/hadoop/wc

touch mapper.py reducer.py

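Write the two functions

mapper.py:

#!/usr/bin/env python
import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))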

reducer.py:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    # Split the mapper output into word and count
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # The word changed: emit the finished total
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word, if any input was seen
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
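Because the reducer's input is already sorted by key, the same aggregation can also be written with `itertools.groupby`. The following is a sketch of that alternative, not the reducer used in this post; the helper names `parse_pairs` and `reduce_counts` and the sample records are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, sep='\t'):
    """Yield (word, count) tuples from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        parts = line.rstrip('\n').split(sep, 1)
        if len(parts) != 2:
            continue  # no separator: not a mapper record
        try:
            count = int(parts[1])
        except ValueError:
            continue  # count was not an integer
        yield parts[0], count


def reduce_counts(lines):
    """Sum counts per word; assumes equal words are adjacent (sorted input)."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # Made-up, already sorted mapper output
    sample = ['bar\t1\n', 'foo\t1\n', 'foo\t1\n', 'quux\t1\n']
    for word, total in reduce_counts(sample):
        print('%s\t%d' % (word, total))
```

Passing `sys.stdin` to `reduce_counts` instead of the sample list would turn this into a streaming reducer. Unlike the loop version, malformed lines without a tab are skipped rather than raising an error.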

Make the scripts executable

chmod a+x /home/hadoop/wc/mapper.py

chmod a+x /home/hadoop/wc/reducer.py

Test locally
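Before submitting anything to Hadoop, the map, sort, and reduce stages can be imitated in a single Python process. The sketch below is one assumed way to test locally, not the command used in the original post; `map_words`, `reduce_sorted`, `run_local`, and the sample lines are hypothetical. The `sorted` call stands in for the shuffle-and-sort step that Hadoop performs between the mapper and the reducer.

```python
def map_words(lines):
    """Map stage: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Reduce stage: sum counts of adjacent equal words (input must be sorted)."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


def run_local(lines):
    """Chain map -> sort -> reduce, as the streaming job would."""
    return list(reduce_sorted(sorted(map_words(lines))))


if __name__ == '__main__':
    sample = ["foo foo quux", "labs foo bar quux"]  # made-up input
    for word, total in run_local(sample):
        print('%s\t%d' % (word, total))
```

If the sort step is removed, equal words are no longer adjacent and the reducer emits the same word several times, which shows why the reducer depends on sorted input.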

Run on HDFS

cd /home/hadoop/wc

wget

hdfs dfs -put /home/hadoop/hadoop/gutenberg/*.txt /user/hadoop/input

Submit the job with the Hadoop Streaming command

ls /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar

gedit ~/.bashrc

export stream=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar

source ~/.bashrc
echo $stream

gedit run.sh

hadoop jar $stream \
-file /home/hadoop/wc/mapper.py \
-mapper /home/hadoop/wc/mapper.py \
-file /home/hadoop/wc/reducer.py \
-reducer /home/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput

source run.sh
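Once the job finishes, each result file under the output directory should match the format described at the start: one word and its count per line, in sorted key order. A small checker like the following could confirm that on a local copy of one result file. The function name `check_output` and the sample lines are assumptions for illustration, and the ordering test uses plain string comparison, which is assumed here to agree with the job's key order.

```python
def check_output(lines):
    """Return a list of problems found in 'word<TAB>count' lines (hypothetical helper)."""
    problems = []
    previous = None
    for number, line in enumerate(lines, 1):
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 2 or not parts[1].isdigit():
            problems.append('line %d: expected word<TAB>count' % number)
            continue
        if previous is not None and parts[0] < previous:
            problems.append('line %d: %r is out of order' % (number, parts[0]))
        previous = parts[0]
    return problems


if __name__ == '__main__':
    sample = ['bar\t1\n', 'foo\t3\n', 'quux\t2\n']  # made-up output lines
    print(check_output(sample) or 'output looks well-formed')
```

To check a real result, the list could be replaced by an open file object, since the function only iterates over lines.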
