#!/usr/bin/python
import sys

# input comes from stdin (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to stdout (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
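To see what the mapper emits without starting Hadoop, its per-line logic can be pulled into a function and called directly. This is only a minimal sketch: the names map_line and format_pairs are hypothetical helpers introduced for illustration and are not part of the script above.

```python
def map_line(line):
    """Return the (word, 1) pairs the mapper would emit for one input line."""
    return [(word, 1) for word in line.strip().split()]


def format_pairs(pairs):
    """Render pairs as the tab-delimited lines the mapper writes to stdout."""
    return ['%s\t%s' % (word, count) for word, count in pairs]


# One line of sample input; repeated words are emitted once per occurrence.
sample = "aa bb cc dd aa cc\n"
for out in format_pairs(map_line(sample)):
    print(out)
```

The mapper does no aggregation of its own: "aa" appears twice in the input line, so it is written twice with a count of 1, and summing is left to the reduce step.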
To verify, run the following command:
This produces the following result:
View the word-count result:
#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# input comes from stdin
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the tab-delimited input produced by the mapper
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this if-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to stdout
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if there is one!
if current_word:
    print('%s\t%s' % (current_word, current_count))
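Because the reducer's input arrives sorted by word, the same summing can also be written with itertools.groupby from the standard library. The following is a sketch of that alternative, not the script used in this article; parse and reduce_sorted are hypothetical names, and the input list stands in for the sorted mapper output.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) from tab-delimited lines, skipping malformed ones."""
    for line in lines:
        parts = line.strip().split('\t', 1)
        if len(parts) != 2:
            continue  # no tab separator on this line
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum the counts per word; `lines` must already be sorted by word."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(parse(lines), key=lambda pair: pair[0])]


# Stand-in for mapper output after sorting by key.
sorted_input = ["aa\t1", "aa\t1", "bb\t1", "cc\t1", "cc\t1", "dd\t1"]
for word, total in reduce_sorted(sorted_input):
    print('%s\t%s' % (word, total))
```

Like the explicit current_word bookkeeping, groupby only merges adjacent equal keys, so it relies on the same sorted-input guarantee.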
To verify, run the following command:
This produces the following result:
View the word-count result:
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
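Before submitting the job it helps to know what totals to expect. The sketch below counts the five sample lines above locally with collections.Counter, assuming those lines are the full contents of info.txt; SAMPLE and expected_counts are hypothetical names used only for this check.

```python
from collections import Counter

# The five sample lines, assumed here to be the whole input file.
SAMPLE = """\
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
"""


def expected_counts(text):
    """Count whitespace-separated words, the same tokenisation the mapper uses."""
    return Counter(text.split())


for word, total in sorted(expected_counts(SAMPLE).items()):
    print('%s\t%s' % (word, total))
# expected totals: aa 10, bb 5, cc 11, dd 6
```

If the job output differs from these totals, the problem lies in the job setup rather than in the data.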
hdfs dfs -mkdir /data
hdfs dfs -put info.txt /data/info
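$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
-input "/data/*" \
-output "/out99" \
-reducer "python reducer.py" \
-file "/root/reducer.py"
Note: $HADOOP_HOME is Hadoop's home directory. As listed, the command passes only the reducer; the mapper script must likewise be supplied through the -mapper option with its own -file entry.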
That completes the word-frequency count with Python.