Web Scraping Study Notes (Part 1)

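import json
import re

def parse_one_page(html):
    # re.S makes "." match newlines too, so the pattern can span several lines of HTML.
    # The literal HTML tags that sat between the groups are missing from this listing.
    pattern = re.compile('.*?(.*?).*?src="(.*?)".*?(.*?).*?.*?: (.*?)&.*?', re.S)
    items = re.findall(pattern, html)
    for item in items:
        # Each item is a tuple holding the captured groups of one match.
        yield item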

def write_to_file(content):
    # Append one JSON document per line; ensure_ascii=False keeps Chinese text readable.
    with open('result.text', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + "\n")

def main():
    url = '...'  # the target URL is not given in this listing
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

if __name__ == '__main__':
    main()
    print("Scraping succeeded!")
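
main() calls get_one_page, whose definition does not appear in the listing above. As a rough idea of what such a helper could look like, here is a minimal sketch built only on the standard library's urllib.request. The User-Agent string, the timeout parameter, and the choice to return None on failure are all my assumptions for illustration, not the original implementation.

```python
import urllib.request


def get_one_page(url, timeout=10):
    """Fetch a URL and return its body as text, or None if the request fails."""
    # Assumed header: some sites reject requests that lack a browser-like User-Agent.
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; study-notes-scraper)'}
    try:
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared in the response headers, falling back to UTF-8.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts; ValueError covers malformed URLs.
        return None
```

Returning None instead of raising keeps main() short, but the caller should then check for None before passing the result to parse_one_page.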

Run result:

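Personally, I find regular expressions too cumbersome: when a pattern fails to match, nothing gets scraped, and it is hard to tell which part of the pattern went wrong. I recommend a parsing library such as bs4 instead. I will redo this scrape with bs4 in a few days and update the post then!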
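
To make the point about silent failures concrete, here is a small sketch. The HTML snippets, the pattern, and the ImageCollector and parse_images names are all invented for this example. It also uses the standard library's html.parser rather than bs4, purely so that it runs without installing anything; bs4 offers a friendlier interface built on the same idea of parsing the markup instead of pattern-matching the raw text.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, invented for this example.
ORIGINAL = '<dd><img src="a.jpg" alt="Movie A"></dd>'
REORDERED = '<dd><img alt="Movie A" src="a.jpg"></dd>'  # same data, attributes swapped

# A regex that silently assumes src always comes before alt.
pattern = re.compile(r'<img src="(.*?)" alt="(.*?)">', re.S)


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            attributes = dict(attrs)
            self.images.append((attributes.get('src'), attributes.get('alt')))


def parse_images(html):
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.images


regex_original = pattern.findall(ORIGINAL)
regex_reordered = pattern.findall(REORDERED)  # no match, and no error is raised
parser_reordered = parse_images(REORDERED)
print(regex_original, regex_reordered, parser_reordered)
```

The regex returns an empty list for the reordered markup without any hint about what went wrong, while the parser-based version still finds the image.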

Reference: Python3網路爬蟲開發實戰 (Python 3 Web Crawler Development in Practice), by Cui Qingcai (崔慶才).
