2 Installing Spark and Python Practice

Check the base environment: Hadoop and the JDK

Configuration files

Environment variables
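The environment checks above can be sketched in Python. This is only a rough illustration, not the procedure used in the post: the helper name `check_environment` is hypothetical, and the variable names `JAVA_HOME`, `HADOOP_HOME`, and `SPARK_HOME` are assumed from common convention rather than taken from the post.

```python
import os
import shutil


def check_environment(commands=("java", "hadoop"),
                      variables=("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")):
    """Report which commands are on PATH and which variables are set.

    The default command and variable names are assumptions; adjust them
    to match the actual installation.
    """
    report = {}
    for cmd in commands:
        # shutil.which returns the full path, or None if not found on PATH
        report[cmd] = shutil.which(cmd)
    for var in variables:
        # os.environ.get returns None when the variable is not defined
        report[var] = os.environ.get(var)
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print("{:<12} {}".format(name, value if value else "NOT FOUND"))
```

Any entry reported as NOT FOUND points to a missing installation or to an environment variable that has not taken effect in the current shell.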

Start Spark

Try running Python

Prepare a text file (txt)

Read the file

txt = open("bumi.txt", "r", encoding='utf-8').read()

Preprocessing: case, punctuation, stop words

Convert uppercase letters to lowercase

txt = txt.lower()

Remove punctuation and stop words

for ch in '!"@#$%^&*()+,-./:;<=>?@[\\]_`~':
    txt = txt.replace(ch, "")
words = txt.split()
stop_words = ['so', 'out', 'all', 'for', 'of', 'to', 'on', 'in', 'if', 'by',
              'under', 'it', 'at', 'into', 'with', 'about']
# Keep only the words that are not in the stop-word list
afterwords = [word for word in words if word not in stop_words]
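The preprocessing steps above (lowercasing, punctuation removal, splitting, stop-word filtering) can also be gathered into one function. The following is a minimal sketch, not the post's own code: the name `preprocess` is hypothetical, and it uses `string.punctuation`, which covers slightly more characters than the list in the post (for example the apostrophe and the braces).

```python
import string

# Same stop-word list as in the post, stored as a set for fast lookups
STOP_WORDS = {"so", "out", "all", "for", "of", "to", "on", "in", "if", "by",
              "under", "it", "at", "into", "with", "about"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return [w for w in cleaned.split() if w not in stop_words]


if __name__ == "__main__":
    print(preprocess("It is OUT of the Box, so what?"))
```

Using `str.translate` removes all punctuation in a single pass instead of one `replace` call per character, and the set makes each stop-word check a constant-time lookup.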

Count the occurrences of each word

counts = {}
for word in afterwords:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
# Sort the (word, count) pairs by count, highest first
items.sort(key=lambda x: x[1], reverse=True)
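The counting and sorting above can also be done with the standard library's `collections.Counter`. This is an alternative sketch rather than the approach used in the post; the helper name `top_words` is hypothetical.

```python
from collections import Counter


def top_words(words, n=None):
    """Return (word, count) pairs sorted from most to least frequent.

    With n=None every word is returned; otherwise only the n most common.
    """
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "a", "b"]
    for word, count in top_words(sample):
        print(word, count)
```

`most_common` returns a list of tuples in the same shape as `items`, so the rest of the script could use it unchanged.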

Sort by word frequency

# Print each word with its count, most frequent first
for word, count in items:
    print("{0:<20}{1:>5}".format(word, count))

Write the results to a file

# The with block closes the file even if the write fails
with open("bumi001.txt", "w", encoding='utf-8') as f:
    f.write(str(items))
print("File written successfully")

The result is shown in the figure.
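Writing `str(items)` stores the whole list on a single line, which is awkward to read back. As a possible alternative, the results could be saved one word per line, tab-separated. This is a sketch under that assumption; the function names `save_counts` and `load_counts` and the output file name are hypothetical and do not appear in the post.

```python
def save_counts(items, path):
    """Write (word, count) pairs as tab-separated lines, one pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in items:
            f.write("{}\t{}\n".format(word, count))


def load_counts(path):
    """Read the tab-separated file back into a list of (word, count) pairs."""
    items = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            items.append((word, int(count)))
    return items


if __name__ == "__main__":
    # Hypothetical sample data and output file name
    save_counts([("spark", 3), ("python", 2)], "bumi001.tsv")
    print(load_counts("bumi001.tsv"))
```

Because the words come from `split()`, they contain no tabs or newlines, so each line can be parsed back unambiguously.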
