Goal: for a given movie, scrape the high-frequency words from its hot reviews and generate a word cloud.
Breaking the goal down:
1. Scrape the hot-review content, keeping only the text.
2. Save the review text to a local txt file for later segmentation.
3. Segment the text into words.
4. Generate the word cloud.
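These steps lean on four third-party libraries: requests and beautifulsoup4 for scraping, jieba for Chinese word segmentation, and wordcloud for rendering. If they are not installed yet, one way is:

pip install requests beautifulsoup4 jieba wordcloud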
Pick a movie. Its hot reviews sit on its reviews list page. To get each review's full text, we first need the detail-page URL of every review.
This grabs the detail-page URLs of the first 100 hot reviews:

import requests
from bs4 import BeautifulSoup

for i in range(5):
    # base URL elided in the original; prepend the movie's Douban reviews URL,
    # e.g. https://movie.douban.com/subject/<id>/reviews?start=...
    allurl = 'reviews?start=' + str(i * 20)   # 20 reviews per page
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_='article').find('div', class_='review-list').find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        print(comment_url)
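One practical caveat not covered above: Douban often answers the default python-requests User-Agent with an error status, so it usually helps to send a browser-like header. A minimal sketch (the UA string and test URL are just examples):

import requests

# A browser-like User-Agent; the exact string is illustrative
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get('https://movie.douban.com/', headers=headers)
print(res.status_code)   # expect 200 once the header is accepted

The same headers= argument can be added to every requests.get() call in the scripts below.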
For each review URL collected above, fetch the detail page and extract the text content:

import requests
from bs4 import BeautifulSoup

for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)   # base URL elided; see the note in the first script
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_='article').find('div', class_='review-list').find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        # print(comment_url)
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        # the review body is the <p> tags inside <div id="link-report">
        items2 = soup2.find('div', class_='article').find('div', id='link-report').find_all('p')
        for item2 in items2:
            print(item2.text)
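Chained find() calls like the ones above assume every page has the same layout; if a review has been deleted or the page restructured, find() returns None and the next attribute access raises AttributeError. A small defensive sketch (extract_review_text is a hypothetical helper, not part of the original):

from bs4 import BeautifulSoup

def extract_review_text(html2):
    # Hypothetical helper: return the review's <p> tags, or [] when the
    # expected containers are missing, instead of crashing
    soup2 = BeautifulSoup(html2, 'html.parser')
    article = soup2.find('div', class_='article')
    report = article.find('div', id='link-report') if article else None
    return report.find_all('p') if report else []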
Now write the scraped text to a local txt file for the segmentation step, and remember to close the file when done:

import requests
from bs4 import BeautifulSoup

comments = open('comments.txt', 'w+', encoding='utf-8')
for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)   # base URL elided; see the note in the first script
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_='article').find('div', class_='review-list').find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        # print(comment_url)
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        items2 = soup2.find('div', class_='article').find('div', id='link-report').find_all('p')
        for item2 in items2:
            # print(item2.text)
            comments.write(item2.text)   # append this paragraph's text
comments.close()
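A with block is an alternative that closes the file automatically, even if the scrape dies halfway. A sketch with stand-in data (the two strings below just stand in for the scraped item2.text values):

# The with statement closes comments.txt when the block exits
with open('comments.txt', 'w', encoding='utf-8') as comments:
    for paragraph in ['第一段文字', '第二段文字']:   # stand-ins for item2.text
        comments.write(paragraph)
# no explicit comments.close() needed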
Use the jieba library to segment the text:

import jieba

f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)    # precise-mode segmentation; returns a list of words
txt = ' '.join(ls)    # WordCloud expects space-separated tokens
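The raw cut keeps punctuation, single characters, and function words such as 的 and 了, which would otherwise dominate the frequency count. A minimal filtering sketch (the stopword set here is a tiny illustrative sample, not from the original):

import jieba

t = '這部電影的故事和演員都很好'                      # stand-in for the review text
ls = jieba.lcut(t)
stopwords = {'的', '了', '和', '都', '是'}          # illustrative sample only
words = [w for w in ls if len(w) > 1 and w not in stopwords]
print(' '.join(words))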
Finally, combine segmentation and word-cloud generation:

import jieba
import wordcloud

f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = ' '.join(ls)
# font_path must point to a font with Chinese glyphs (msyh.ttc is Microsoft YaHei),
# otherwise the Chinese words render as empty boxes
w = wordcloud.WordCloud(width=800, height=600, background_color='white',
                        font_path='msyh.ttc', max_words=100)
w.generate(txt)
w.to_file('豆瓣某電影熱評.png')
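Since the stated goal is high-frequency words, it can be worth computing the frequencies explicitly instead of letting generate() count them internally; WordCloud accepts a frequency mapping via generate_from_frequencies(). A sketch (assumes comments.txt from the steps above exists; the output filename is a hypothetical alternate):

import jieba
import wordcloud
from collections import Counter

with open('comments.txt', 'r', encoding='utf-8') as f:
    t = f.read()
words = [w for w in jieba.lcut(t) if len(w) > 1]   # drop single characters and most punctuation
freq = Counter(words)                               # word -> occurrence count
print(freq.most_common(10))                         # inspect the top words first
w = wordcloud.WordCloud(width=800, height=600, background_color='white',
                        font_path='msyh.ttc', max_words=100)
w.generate_from_frequencies(freq)
w.to_file('豆瓣某電影熱評_freq.png')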
The complete script, from scraping to the finished word-cloud image:

import requests
import jieba
import wordcloud
from bs4 import BeautifulSoup

# Step 1: scrape the first 100 hot reviews and save their text
comments = open('comments.txt', 'w+', encoding='utf-8')
for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)   # base URL elided; see the note in the first script
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_='article').find('div', class_='review-list').find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        items2 = soup2.find('div', class_='article').find('div', id='link-report').find_all('p')
        for item2 in items2:
            comments.write(item2.text)
comments.close()

# Step 2: segment the text with jieba
f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = ' '.join(ls)

# Step 3: generate and save the word cloud
w = wordcloud.WordCloud(width=800, height=600, background_color='white',
                        font_path='msyh.ttc', max_words=100)
w.generate(txt)
w.to_file('豆瓣某電影熱評.png')
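One last note not covered above: the full script fires roughly 105 requests back to back, which sites like Douban may throttle or block. Pausing between requests is a common courtesy; a sketch (the URLs and the 1-second interval are arbitrary):

import time
import requests

urls = ['https://example.com/review/1',
        'https://example.com/review/2']   # stand-ins for the comment_url values
for u in urls:
    res = requests.get(u)
    time.sleep(1)   # pause between requests; the interval is arbitrary

In the scripts above, the same time.sleep(1) would go inside the inner loop, right after each detail-page request.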
That's all!