Python Web Scraping in Practice: Strategies for Getting Past Anti-Scraping Mechanisms (Alibaba)

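Without further ado, let's go straight to the CSDN website.

(Truth is, Alibaba's anti-scraping has worn me down so much that I don't feel like talking...)

Look at the underlined part of the URL in the screenshot above and it's easy to see that p is the page number and q is the keyword.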
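To make the p/q observation concrete, here is a minimal sketch of building such a search URL with the standard library. The base URL is a placeholder I made up (the real one is not shown in this post), and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the article does not show the real one.
BASE_URL = "https://example.com/search"


def build_search_url(page, keyword):
    """Return the search URL for one results page.

    p is the page number and q is the keyword, as observed in the URL.
    """
    query = urlencode({"p": page, "q": keyword})
    return "%s?%s" % (BASE_URL, query)


print(build_search_url(1, "python"))
print(build_search_url(2, "python 爬蟲"))
```

urlencode takes care of percent-encoding non-ASCII keywords and spaces, so the keyword can be passed in as ordinary text.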

2. XPath paths

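Open the browser's developer tools and find the tags that hold the information we need:

Use //dd[@class='author-time']/span[@class='link']/a/@href to match the URL of each blog post;

Use //h1[@class='title-article']/text() to match the title of each blog post.

Note: if you have questions about XPath paths, review "Learning XPath Syntax and Using the lxml Module".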
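The two XPath expressions can be tried offline before touching the real site. The sketch below runs them against small HTML fragments that I made up to mimic the structure described above; the real pages will contain much more markup.

```python
import lxml.etree as le

# A made-up fragment shaped like the search result list described above.
SAMPLE_LIST = b"""
<html><body>
<dl>
  <dd class="author-time">
    <span class="link"><a href="https://example.com/blog/1">post 1</a></span>
  </dd>
  <dd class="author-time">
    <span class="link"><a href="https://example.com/blog/2">post 2</a></span>
  </dd>
</dl>
</body></html>
"""

# A made-up blog page containing only the title element.
SAMPLE_BLOG = b"<html><body><h1 class='title-article'>Hello XPath</h1></body></html>"

# le.HTML accepts bytes and returns the root element of the parsed tree.
hrefs = le.HTML(SAMPLE_LIST).xpath("//dd[@class='author-time']/span[@class='link']/a/@href")
titles = le.HTML(SAMPLE_BLOG).xpath("//h1[@class='title-article']/text()")

print(hrefs)
print(titles)
```

Note that xpath() always returns a list, and the list is empty when nothing matches, so indexing it with [0] is only safe after checking that it is non-empty.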

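Results do come out, but what on earth is "list index out of range"??? (That error is what you get when an XPath query returns an empty list and the code then indexes it with [0].)

Print the result of urlopen to check: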

for href in href_s:
    try:
        print(href)
        response_blog = ur.urlopen(getrequest(href)).read()
        print(response_blog)
    except Exception as e:
        print(e)

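The output looks like this:

We did scrape something, but it doesn't seem to be the page content we wanted??? Looking at this jumble of \x1f\x8b\x08\x00, I immediately thought of an "encoding" problem.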
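As a side note, the two leading bytes \x1f\x8b are the magic number of the gzip format, so a body that starts with them is usually gzip-compressed data rather than mis-decoded text. The sketch below shows one way to detect and decompress such a body with the standard library. maybe_gunzip is a hypothetical helper name, and whether compression is what actually happened in this case depends on the request headers, which this post does not show.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Decompress body if it starts with the gzip magic bytes, else return it unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Demo with locally compressed data, so no network is involved.
raw = "<h1>標題</h1>".encode("utf-8")
compressed = gzip.compress(raw)

print(compressed[:2] == GZIP_MAGIC)
print(maybe_gunzip(compressed).decode("utf-8"))
print(maybe_gunzip(b"plain bytes"))
```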

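The eighty-one tribulations:

encode and decode;

utf-8 and gbk;

bytes and str;

urlencode and unquote;

dumps and loads;

request and requests;

even accept-encoding: gzip, deflate;

······

and so on and so forth;

And the result! Unbelievably! It was none of them!!! Aaaargh~ ~ ~ ~

Oh well, the bug still has to be fixed _(:3」∠❀)_

A flash of inspiration! Could it be the cookie???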

# Return a Request object
def getrequest(url):
    return ur.Request(
        url=url,
        headers={
            # header values omitted in the original
        }
    )

The output looks like this:

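Success!!

Only then did I realize I had been heading in the wrong direction all along. It was Alibaba's anti-scraping mechanism messing with me the whole time. I hate you. ╭(╯^╰)╮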
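Since the header values are not shown in this post, here is a minimal sketch of what attaching a cookie to a request can look like with urllib. The URL, cookie string and User-Agent below are placeholders I made up, and make_request is a hypothetical helper; in practice you would copy the real values from your own logged-in browser session.

```python
import urllib.request as ur


def make_request(url, cookie, user_agent):
    """Build a Request carrying browser-like headers (placeholder values)."""
    return ur.Request(
        url=url,
        headers={
            "User-Agent": user_agent,
            "Cookie": cookie,
        },
    )


# Placeholder values; nothing is sent over the network here.
req = make_request(
    "https://example.com/blog/1",
    cookie="name=value",
    user_agent="Mozilla/5.0",
)

print(req.full_url)
print(req.get_header("Cookie"))
print(req.header_items())
```

One detail worth knowing: Request stores header names capitalized (for example "User-agent"), so that is the spelling get_header expects.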

Write the page to a local file:

with open('blog/%s.html' % title, 'wb') as f:
    f.write(response_blog)
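Two things can trip up the write above: open() fails if the blog directory does not exist yet, and a title containing characters such as / or ? is not a valid file name on every system. The sketch below is one possible way to guard against both. safe_filename and save_page are hypothetical helpers, not part of the original code.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip()
    return cleaned or "untitled"


def save_page(directory, title, body):
    """Write body (bytes) to directory/<safe title>.html and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "%s.html" % safe_filename(title))
    with open(path, "wb") as f:
        f.write(body)
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(os.path.join(tmp, "blog"), "Python/爬蟲: 入門?", b"<html></html>")
    print(os.path.basename(saved))
```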

Add exception handling with try-except:

try:
    ······
except Exception as e:
    print(e)

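That's all for this walkthrough. If you feel like it, you can improve the code in this post yourselves: for example, write the exceptions to a txt file so they are easier to analyze later, or filter the scraped results to make the data more targeted, and so on.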
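For the first suggestion, writing exceptions to a txt file, a minimal sketch could look like the following. log_error, the file name and the tab-separated line format are all my own assumptions; the deliberately failing [][0] stands in for an XPath query that matched nothing.

```python
import os
import tempfile
from datetime import datetime


def log_error(path, href, exc):
    """Append one tab-separated line describing a failed URL to the text file at path."""
    timestamp = datetime.now().isoformat(timespec="seconds")
    line = "%s\t%s\t%r\n" % (timestamp, href, exc)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    error_file = os.path.join(tmp, "errors.txt")
    try:
        [][0]  # stands in for xpath(...)[0] on a page with no title
    except Exception as e:
        log_error(error_file, "https://example.com/blog/1", e)
    with open(error_file, encoding="utf-8") as f:
        logged = f.read()

print(logged)
```

In the scraper itself, the call to log_error would replace print(e) inside the inner except block, so each failed href ends up on its own line.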

Full code:

import urllib.request as ur
import user_agent
import lxml.etree as le

# Return a Request object
def getrequest(url):
    return ur.Request(
        url=url,
        headers={
            # header values omitted in the original
        }
    )

# pn_start, pn_end and keyword are not defined in this excerpt
# Note the +1
for pn in range(pn_start, pn_end+1):
    url = "" % (pn, keyword)  # URL template omitted in the original
    # Build the Request object
    request = getrequest(url)
    try:
        # Open the Request object
        response = ur.urlopen(request).read()
        # response is bytes, so le.HTML can parse it directly into an element tree
        href_s = le.HTML(response).xpath("//dd[@class='author-time']/span[@class='link']/a/@href")
        # print(href_s)
        for href in href_s:
            try:
                print(href)
                response_blog = ur.urlopen(getrequest(href)).read()
                # print(response_blog)
                title = le.HTML(response_blog).xpath("//h1[@class='title-article']/text()")[0]
                print(title)
                with open('blog/%s.html' % title, 'wb') as f:
                    f.write(response_blog)
            except Exception as e:
                print(e)
    except Exception:
        # Skip a results page that fails to load or parse
        pass

Full code (proxy IP):

import urllib.request as ur
import lxml.etree as le
import user_agent

def getrequest(url):
    return ur.Request(
        url=url,
        headers={
            # header values omitted in the original
        }
    )

def getproxyopener():
    # The URL of the proxy API is omitted in the original
    proxy_address = ur.urlopen('').read().decode('utf-8').strip()
    proxy_handler = ur.ProxyHandler(
        # proxy mapping omitted in the original
    )
    return ur.build_opener(proxy_handler)

# pn_start, pn_end and keyword are not defined in this excerpt
for pn in range(pn_start, pn_end+1):
    request = getrequest(
        '' % (pn,keyword)  # URL template omitted in the original
    )
    try:
        response = getproxyopener().open(request).read()
        href_s = le.HTML(response).xpath('//span[@class="down fr"]/../span[@class="link"]/a/@href')
        for href in href_s:
            try:
                response_blog = getproxyopener().open(
                    getrequest(href)
                ).read()
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0]
                print(title)
                with open('blog/%s.html' % title,'wb') as f:
                    f.write(response_blog)
            except Exception as e:
                print(e)
    except Exception:
        # Skip a results page that fails to load or parse
        pass
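The argument passed to ProxyHandler is missing from the listing above, so the following is only a sketch of how such an opener is commonly built, assuming the proxy service returns a plain host:port string. The address below is a placeholder from a reserved documentation range, and build_proxy_opener is a hypothetical name.

```python
import urllib.request as ur


def build_proxy_opener(proxy_address):
    """Return an opener that routes http and https requests through proxy_address."""
    proxy_handler = ur.ProxyHandler({
        "http": "http://%s" % proxy_address,
        "https": "http://%s" % proxy_address,
    })
    return ur.build_opener(proxy_handler)


# Placeholder address; replace it with a real proxy before opening any URL.
opener = build_proxy_opener("192.0.2.10:8080")

# Inspect the configured proxies without sending a request.
proxies = [h.proxies for h in opener.handlers if isinstance(h, ur.ProxyHandler)]
print(proxies)
```

Requests are then sent with opener.open(request) instead of ur.urlopen(request), which is how the listing above uses getproxyopener().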
