A Scrapy Crawler for IT之家 Comments

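I have recently been learning web scraping with Python, and I would like to recommend a blog series for getting started with crawlers.

The author writes in a very beginner-friendly way, which makes the series a good starting point.

This post covers the improvements I made and the problems I ran into.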

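Following the original post, my approach was:

Get the URLs of the articles in the "hottest" ranking on the home page.

Extract the newsid from each URL, then POST the newsid and type to the comment endpoint to get the hot comments it returns.

I expected this to go smoothly, but I still ran into a few problems.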

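The original post used requests with multiprocessing to speed up crawling. Scrapy already crawls concurrently on its own (through its asynchronous, Twisted-based engine rather than multiple processes), and I also wanted to try a different approach, so I used Scrapy here. The problems I ran into are described below.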
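In Scrapy, the degree of concurrency is controlled through project settings. Below is a minimal sketch of a settings.py fragment; the setting names are standard Scrapy settings, but the values are illustrative assumptions and not the ones used in this project.

```python
# settings.py (illustrative values, not taken from the original project)
CONCURRENT_REQUESTS = 16            # max requests the downloader handles at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap on simultaneous requests per domain
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to one site
```

A small delay like this is a common courtesy when crawling a single site, at the cost of some speed.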

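In the original post the newsid appears directly in the URL, but in the URLs of the articles in the hottest ranking the newsid is split into two parts (for example 388 and 110).

This is easy to solve: match the parts with a regular expression and concatenate them. The code is as follows: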

import re

# Pick out the newsid parts, e.g. this matches ['388', '110']
pattern = re.compile(r'(\d\d\d)')
newsid_list = pattern.findall(link)
newsid = newsid_list[0] + newsid_list[1]  # concatenate into the newsid
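Matching any three digits can pick up unrelated numbers elsewhere in a link. As a slightly stricter sketch, the helper below anchors on the path separators and returns None when the pattern is missing. The function name `extract_newsid` and the example URL shape are assumptions made for illustration; they do not come from the original post.

```python
import re


def extract_newsid(link):
    """Join two consecutive 3-digit path segments of a URL into a newsid."""
    # Anchor on the slashes so stray digits elsewhere in the URL are ignored.
    match = re.search(r'/(\d{3})/(\d{3})', link)
    if match is None:
        return None
    return match.group(1) + match.group(2)


# Hypothetical URL shape, assumed for illustration only.
print(extract_newsid('https://example.com/0/388/110.htm'))  # 388110
```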

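# The hash is inside a script tag
script = response.xpath('/html/head/script[3]/text()').extract()[0]
# Pick out the hash, e.g. var ch11 = '0a56bca76ae1ad61'; matches 0a56bca76ae1ad61
# (the hash in the example is 16 characters long)
pattern = re.compile(r'\w{16}')
hash = pattern.search(script).group()
# print(hash)
post_url = ''  # post url
# The POST data is newsid, hash, pid, type
yield scrapy.FormRequest(
    url=post_url,
    meta=...,      # elided in the original
    formdata=...,  # elided in the original; carries newsid, hash, pid and type
    callback=get_hot_comment
)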
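Calling `.group()` directly on the result of `search` raises an AttributeError when the page layout changes and nothing matches. The sketch below shows a more defensive variant, assuming the hash is always a 16-character token in single quotes, as in the example from the comment. The helper name `extract_hash` is hypothetical.

```python
import re


def extract_hash(script_text):
    """Return the 16-character quoted token from the script text, or None."""
    # Requiring the surrounding quotes avoids matching other long identifiers.
    match = re.search(r"'(\w{16})'", script_text)
    return match.group(1) if match else None


sample = "var ch11 = '0a56bca76ae1ad61';"  # the example given in the comment
print(extract_hash(sample))  # 0a56bca76ae1ad61
```

Returning None lets the spider log the failure and skip the article instead of crashing mid-crawl.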

# response.text turns out to be JSON; its 'html' key holds the HTML source
html = json.loads(response.text)['html']
# Parse the HTML source
soup = BeautifulSoup(html, 'lxml')
# Each hot comment sits in an li tag with class='entry'
li_list = soup.find_all('li', class_='entry')
for li in li_list:
    # Pull the fields of each hot comment out of the HTML
    item['username'] = li.find('span', class_='nick').text
    item['time'] = li.find('span', class_='posandtime').text.split('\xa0')[1]
    item['content'] = li.find('p').text
    like = li.find('a', class_='s').text
    hate = li.find('a', class_='a').text
    # Pick out the like and dislike counts, e.g. 支援(100) yields 100
    item['like_num'] = re.search(r'\d+', like).group()
    item['hate_num'] = re.search(r'\d+', hate).group()
    # print(item)
    yield item
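The two text-cleaning steps in the loop (taking the time after the non-breaking space, and pulling the number out of a label) can be isolated into small helpers that tolerate missing data. This is a minimal sketch; the helper names and the sample time string are assumptions for illustration, and unlike the code above it returns the counts as integers.

```python
import re


def parse_count(label):
    """Pull the number out of a label such as '支援(100)'; 0 if none is found."""
    match = re.search(r'\d+', label)
    return int(match.group()) if match else 0


def parse_time(pos_and_time):
    """Return the part after the first non-breaking space, else the whole string."""
    parts = pos_and_time.split('\xa0')
    return parts[1] if len(parts) > 1 else pos_and_time


print(parse_count('支援(100)'))  # 100
# Made-up sample string in the "location\xa0time" shape the code above expects.
print(parse_time('somewhere\xa02018-10-16 10:00:00'))  # 2018-10-16 10:00:00
```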

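Once the four problems above are solved, the rest of the project holds no major difficulties. The most important step is obtaining the hash value. One more thing to note: the endpoint currently expects newsid, hash, pid and type as POST data, but IT之家 may change this later, so check whether the POST data has changed before crawling.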
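Since the endpoint may change, it can help to check the shape of the reply before parsing it. The sketch below assumes only what the parsing code relies on, namely a JSON object with an 'html' key; the helper name `extract_comment_html` is hypothetical.

```python
import json


def extract_comment_html(response_text):
    """Return the 'html' field of the JSON reply, or None if the shape changed."""
    try:
        payload = json.loads(response_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(payload, dict):
        return None
    html = payload.get('html')
    # Treat a missing, empty or non-string field as "no usable HTML".
    return html if isinstance(html, str) and html else None
```

A None result is a cheap signal that the POST fields or the response format should be rechecked in the browser's network panel.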

The project link is above. If you find it useful, please give it a star.
