Following up on the previous post, I scraped job postings related to big data. The resulting word cloud is shown in the figure:
# -*- coding: utf-8 -*-
"""
Created on Thu Aug 10 07:57:56 2017

@author: lenovo
"""
from wordcloud import WordCloud
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jieba

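def cloud(root, name, stopwords):
    # Read the scraped posting text.
    filepath = root + '\\' + name
    with open(filepath, 'r', encoding='utf-8') as f:
        txt = f.read()
    # Segment the text with jieba and collect the tokens.
    words = []
    for i in jieba.cut(txt):
        words.append(i)
    df = pd.DataFrame({'words': words})
    # Count how often each token occurs, most frequent first.
    counts = df.groupby('words').size().sort_values(ascending=False)
    # Drop stop words, then convert to a {word: count} dict.
    freqs = counts[~counts.index.isin(stopwords['stopword'])].to_dict()
    # A font with CJK glyphs is required to render Chinese words.
    wordcloud = WordCloud(font_path=r'e:\python\machine learning\simhei.ttf',
                          background_color='black')
    wordcloud.fit_words(freqs)
    plt.imshow(wordcloud)
    # Save the image next to the source file, with a .png extension.
    pngfile = root + '\\' + name.split('.')[0] + '.png'
    wordcloud.to_file(pngfile)
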
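import os

# Load a custom dictionary for jieba, then the stop-word list
# (the file needs a header row naming the column 'stopword').
jieba.load_userdict(r'e:\python\machine learning\nlpstopwords.txt')
stopwords = pd.read_csv(r'e:\python\machine learning\stopwordscn.txt',
                        encoding='utf-8', index_col=False)

# Walk the directory of scraped postings and build one cloud per .txt file.
for root, dirs, files in os.walk(r'e:\職位資訊'):
    for name in files:
        if name.split('.')[-1] == 'txt':
            print(name)
            cloud(root, name, stopwords)
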
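The word cloud shows that some noise words were not removed, such as 相關 ("related") and 以上學歷 ("degree or above"), which carry no useful information. I had planned to identify stop words by document frequency (DF), but I did not account for this when scraping, and the number of records is small anyway, so I did not go looking for stop words specific to job postings. The cloud does make clear, though, that algorithms and experience matter a great deal. Keep at it!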
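
For readers who want to try the document-frequency idea, here is a minimal sketch of one way it could work. It is not something the script above does. It assumes each posting is kept as a separate list of tokens (for example, one jieba.lcut(text) result per posting). The function name high_df_terms and the threshold values are made up for illustration.

```python
from collections import Counter


def high_df_terms(docs, min_ratio=0.8):
    """Return terms that appear in at least `min_ratio` of the documents.

    `docs` is a list of token lists, one per job posting. The function
    name and the default threshold are illustrative assumptions, not
    values taken from the original script.
    """
    if not docs:
        return set()
    df = Counter()
    for tokens in docs:
        # Count each term once per document, however often it repeats.
        df.update(set(tokens))
    n_docs = len(docs)
    return {term for term, count in df.items() if count / n_docs >= min_ratio}


# Toy example: three postings that are already tokenized.
postings = [
    ["相關", "經驗", "演算法", "以上學歷"],
    ["相關", "Hadoop", "以上學歷"],
    ["相關", "Spark", "經驗", "以上學歷"],
]
candidates = high_df_terms(postings, min_ratio=0.9)
```

Terms that show up in nearly every posting are candidates for a domain-specific stop-word list. With only a few records the threshold is hard to tune, so the candidates are best reviewed by hand before they are added to the stop-word file.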