Generating a Word Cloud of Job Requirements with Python

2022-07-22 07:33:11

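Following up on the previous post, where I scraped job postings related to big data, here is the script that turns the job requirements into word clouds: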

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 10 07:57:56 2017

@author: lenovo
"""
import os

import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from wordcloud import WordCloud


def cloud(root, name, stopwords):
    # Read the scraped job description text.
    filepath = os.path.join(root, name)
    with open(filepath, 'r', encoding='utf-8') as f:
        txt = f.read()

    # Segment the Chinese text into words.
    words = list(jieba.cut(txt))

    # Count how often each word occurs, most frequent first.
    df = pd.DataFrame({'words': words})
    s = (df.groupby(df['words'])['words']
           .agg([('size', np.size)])
           .sort_values(by='size', ascending=False))

    # Drop stop words, then convert to {'size': {word: count}}.
    s = s[~s.index.isin(stopwords['stopword'])].to_dict()

    # A font with CJK glyphs is required, otherwise Chinese words render as boxes.
    wordcloud = WordCloud(font_path=r'e:\python\machine learning\simhei.ttf',
                          background_color='black')
    wordcloud.fit_words(s['size'])
    plt.imshow(wordcloud)

    # Save the image next to the source file, with a .png extension.
    pngfile = os.path.join(root, name.split('.')[0] + '.png')
    wordcloud.to_file(pngfile)


jieba.load_userdict(r'e:\python\machine learning\nlpstopwords.txt')

# The stop-word file is expected to have a header row naming the column 'stopword'.
stopwords = pd.read_csv(r'e:\python\machine learning\stopwordscn.txt',
                        encoding='utf-8', index_col=False)

# Build one word cloud for every .txt file under the job-postings folder.
for root, dirs, files in os.walk(r'e:\職位資訊'):
    for name in files:
        if name.split('.')[-1] == 'txt':
            print(name)
            cloud(root, name, stopwords)
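The counting and stop-word filtering step in the script can also be tried on its own, without pandas. The following is a minimal sketch, not the script's actual method: it swaps the groupby for `collections.Counter`, the helper name `word_frequencies` and the sample tokens are made up for illustration, and it additionally skips whitespace-only tokens, which the script above does not do. The resulting dict has the word-to-count shape that `WordCloud.fit_words` accepts.

```python
from collections import Counter


def word_frequencies(tokens, stopwords):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    stop = set(stopwords)
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stop
    )
    return dict(counts)


# Hypothetical tokens, standing in for the output of jieba.cut(txt).
tokens = ["熟悉", "hadoop", "相關", "經驗", "hadoop", " ", "spark", "經驗", "hadoop"]
freqs = word_frequencies(tokens, stopwords=["相關"])
print(freqs)
```

Passing `freqs` to `fit_words` would then take the place of `s['size']` in the script.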

The resulting word cloud looks like this:

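Some noise words were clearly not removed, such as 相關 ("related") and 以上學歷 ("degree or above"), which carry no useful information. I had intended to identify stop words by DF (document frequency), but I did not plan for that when scraping, and the number of records is small anyway, so I did not go looking for a stop-word list specific to job postings. Even so, the cloud makes it clear that algorithms and experience matter a great deal. Keep going!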
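The idea of finding stop words by document frequency was not implemented in this post. As a rough sketch of what it could look like, the code below counts how many postings each word appears in and flags words that occur in most of them. The function names, the `min_ratio` threshold, and the sample postings are all assumptions made up for illustration; each posting is assumed to be already segmented into tokens (for example with `jieba.cut`).

```python
from collections import Counter


def document_frequency(docs):
    """Return {word: number of documents that contain it}."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per document
    return df


def stopword_candidates(docs, min_ratio=0.8):
    """Words that appear in at least min_ratio of the documents."""
    n = len(docs)
    if n == 0:
        return []
    df = document_frequency(docs)
    return sorted(w for w, c in df.items() if c / n >= min_ratio)


# Hypothetical, already-segmented job postings.
docs = [
    ["相關", "經驗", "hadoop", "相關"],
    ["相關", "經驗", "spark"],
    ["相關", "學歷", "hive"],
    ["相關", "經驗", "python"],
]
candidates = stopword_candidates(docs, min_ratio=0.75)
print(candidates)
```

With so few documents the threshold is fragile: here 經驗 ("experience") is flagged along with 相關, so the candidates would still need a manual review before being added to the stop-word file.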
