Crawler Notes 04: Crawler Exercise, Multithreaded Version

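The single-threaded crawler from the previous post took about 450 seconds to fetch 224 pages. This time I rewrite it with Python threads (the threading module). People often call Python multithreading close to useless, but it exists for a reason, and the approach is simple: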

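1. Create a Queue and, while looping over the links, put each link into the queue.

2. Start several threads that all take links from the same queue; a thread blocks in get() whenever the queue is momentarily empty.

3. Once every queued link has been processed, put one None per thread on the queue to tell the threads to exit.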
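The three steps above can be sketched without any network access. This is a minimal illustration, not the crawler itself: run_workers and its parameters are hypothetical names, and doubling each item stands in for fetching a page.

```python
import threading
from queue import Queue


def run_workers(items, num_threads=4):
    """Process items with worker threads and return the collected results."""
    work_queue = Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work_queue.get()          # blocks until an item is available
            if item is None:                 # sentinel: no more work for this thread
                break
            try:
                with results_lock:
                    results.append(item * 2)  # stand-in for fetching a page
            finally:
                work_queue.task_done()       # lets work_queue.join() return

    # Step 1: put every item on the queue.
    for item in items:
        work_queue.put(item)

    # Step 2: start the worker threads.
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    # Step 3: wait until the queue is drained, then send one None per worker.
    work_queue.join()
    for _ in threads:
        work_queue.put(None)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(run_workers(range(10))))
```

The results are sorted before printing because the order in which threads finish is not deterministic.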

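import time
import threading
from queue import Queue

import requests
from lxml import etree

start_url = ""                    # entry page URL (left empty here)
download_pages = 0                # number of pages fetched so far
download_lock = threading.Lock()  # guards download_pages across threads
cookies = {}
link_queue = Queue()
threads_num = 50
threads = []


def fetch(url):
    """Request a page and return its text with tab characters removed."""
    global download_pages
    print("Requesting %s" % url)
    r = requests.get(url, cookies=cookies)
    print("Response: %s" % r.reason)
    if r.status_code != 200:
        r.raise_for_status()
    # "+=" on a shared global is not atomic, so take the lock first
    with download_lock:
        download_pages += 1
    return r.text.replace('\t', '')


def parse_university(url):
    """Parse one detail page into a dict, or return None if the table is malformed."""
    time.sleep(1)
    selector = etree.HTML(fetch(url))
    # print(dir(selector))
    data = {}
    data['name'] = selector.xpath('//div[@id="wikicontent"]/h1/text()')[0]
    table = selector.xpath('//div[@id="wikicontent"]/div[@class="infobox"]/table')
    if table:
        table = table[0]
        keys = table.xpath('.//td[1]/p/text()')
        cols = table.xpath('.//td[2]')
        values = [' '.join(col.xpath('.//text()')) for col in cols]
        if len(keys) != len(values):
            return None
        data.update(zip(keys, values))
    return data


def process_data(data):
    if data:
        print(data)


def download():
    """Worker loop: take links from the queue until a None sentinel arrives."""
    while True:
        # Block until a link is available on the queue
        link = link_queue.get()
        if link is None:
            break
        try:
            data = parse_university(link)
            process_data(data)
        except Exception as exc:
            print("Failed to process %s: %s" % (link, exc))
        finally:
            # Always mark the item done; otherwise link_queue.join() never returns
            link_queue.task_done()
        print("%s links left in the queue" % link_queue.qsize())


if __name__ == '__main__':
    # Record the program start time
    start_time = time.time()
    # 1. Request the entry page
    selector = etree.HTML(fetch(start_url))
    # 2. Extract the href of each <a> tag on the page
    links = selector.xpath('//div[@id="content"]//tr[position()>1]/td[2]/a/@href')
    for link in links:
        # Put each link into the queue
        link_queue.put(link)
    # Start the threads and keep each thread object in a list
    for i in range(threads_num):
        t = threading.Thread(target=download)
        t.start()
        threads.append(t)
    # Block until every link in the queue has been processed
    link_queue.join()
    # Put one None per thread on the queue to tell the threads to exit
    for i in range(threads_num):
        link_queue.put(None)
    # Wait for the threads to exit
    for t in threads:
        t.join()
    cost_time = time.time() - start_time
    print("download %d pages, cost %.2f seconds" % (download_pages, cost_time))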
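A counter such as download_pages is incremented from every worker thread, and "+=" on a shared variable is not an atomic operation in Python. The following is a small sketch of guarding such a counter with threading.Lock; PageCounter and count_with_threads are hypothetical helpers written for illustration and are not part of the crawler.

```python
import threading


class PageCounter:
    """Thread-safe counter (hypothetical helper, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # Only one thread at a time may read, add and write back the value.
        with self._lock:
            self.value += 1


def count_with_threads(num_threads=8, per_thread=10_000):
    """Increment one shared counter from several threads and return the total."""
    counter = PageCounter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads())
```

With the lock in place the total always equals num_threads * per_thread; without it, updates can occasionally be lost.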

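Main changes: the main block now dispatches the links through the queue to worker threads, and a new download function was added.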

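Results:

The single-threaded version took more than 400 seconds; with 50 threads the run took a little over 90 seconds. Python threads do help here, because the crawler spends most of its time waiting on network responses rather than computing.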
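The speedup can be reproduced on a small scale without a network. In the sketch below, time.sleep stands in for waiting on a response (an assumption made for illustration; real request latency varies), and fake_fetch, run_sequential and run_threaded are hypothetical names. Threads that are blocked in sleep or in network I/O do not hold the interpreter lock, so their waits overlap.

```python
import threading
import time


def fake_fetch(delay):
    """Stand-in for a network request: just wait for `delay` seconds."""
    time.sleep(delay)


def run_sequential(n, delay):
    """Run n fake requests one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_fetch(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    """Run n fake requests in n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, delay = 20, 0.05
    print("sequential: %.2f s" % run_sequential(n, delay))
    print("threaded:   %.2f s" % run_threaded(n, delay))
```

The sequential time grows with the number of requests, while the threaded time stays close to a single delay. The real crawler gains less than this ideal case because parsing and thread scheduling add overhead.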
