Python Crawler Basics (4): Crawling with Multiple Threads

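First, a quick review of concepts covered earlier:

1. A CPU core can run only one task at a time; several cores can run several tasks at the same time.

2. A CPU core runs only one process at any given moment; the other processes are not running.

3. The units of execution inside a process are called threads; a process can contain multiple threads.

4. A process's memory space is shared: every thread in the process can use it.

5. While one thread is using a lock-protected part of that shared space, the other threads that need it must wait (they are blocked).

6. A mutex (mutual-exclusion lock) prevents multiple threads from using the same memory at the same time. The first thread to arrive locks it and the others wait; they can enter only once the lock is released.

7. Process: one execution of a program.

8. Thread: the basic unit the CPU schedules.

9. GIL (Global Interpreter Lock): the execution pass in Python (CPython), of which there is only one. The thread holding it can run on the CPU; threads without the GIL cannot execute.

10. Python multithreading suits I/O-heavy workloads.

11. Python multiprocessing suits compute-heavy parallel workloads.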
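Point 10 can be checked with a small experiment. The sketch below uses `time.sleep` as a stand-in for a network request (an assumption made to keep it self-contained); `fake_download` and `timed_run` are illustrative names and are not part of the crawler.

```python
import threading
import time


def fake_download(seconds):
    """Stand-in for a network request: the thread simply waits."""
    time.sleep(seconds)


def timed_run(num_tasks=5, seconds=0.2, threaded=True):
    """Run num_tasks waits, one per thread or back to back; return elapsed seconds."""
    start = time.perf_counter()
    if threaded:
        workers = [threading.Thread(target=fake_download, args=(seconds,))
                   for _ in range(num_tasks)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    else:
        for _ in range(num_tasks):
            fake_download(seconds)
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential: %.2f s" % timed_run(threaded=False))
    print("threaded:   %.2f s" % timed_run(threaded=True))
```

The threaded run should take roughly the time of one wait rather than five, because a thread that is sleeping or waiting on I/O releases the GIL and lets the others proceed.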

# Thread support
import threading

# Thread-safe FIFO queue
from queue import Queue

# HTML parsing
from lxml import etree

# HTTP requests
import requests

# JSON serialization
import json

import time

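class ThreadCrawl(threading.Thread):
    def __init__(self, threadname, pagequeue, dataqueue):
        # Call the parent class initializer
        super().__init__()
        # Thread name
        self.threadname = threadname
        # Queue of page numbers
        self.pagequeue = pagequeue
        # Queue of downloaded pages
        self.dataqueue = dataqueue
        # Request headers. The original value is missing from this listing;
        # the User-Agent below is only a placeholder.
        self.headers = {"User-Agent": "Mozilla/5.0"}

    def run(self):
        print("Starting " + self.threadname)
        while not crawl_exit:
            try:
                # Take one page number off the queue (first in, first out).
                # get() takes an optional block argument, True by default:
                # 1. If the queue is empty and block is True, get() does not
                #    return; it blocks until a new item arrives.
                # 2. If the queue is empty and block is False, get() raises
                #    queue.Empty.
                page = self.pagequeue.get(False)
                # The base URL is blank in this listing; fill it in before running
                url = "" + str(page) + "/"
                content = requests.get(url, headers=self.headers).text
                time.sleep(1)
                self.dataqueue.put(content)
            except Exception:
                # Queue empty or request failed: pause briefly, then poll again
                time.sleep(0.1)
        print("Exiting " + self.threadname)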
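The two `get()` cases described in the comments of `run()` can be tried in isolation. This is a minimal sketch using only the standard library; `drain` is a hypothetical helper and is not part of the crawler.

```python
from queue import Queue, Empty


def drain(q):
    """Take items until the queue is empty, using non-blocking get()."""
    items = []
    while True:
        try:
            # block=False: raise queue.Empty at once instead of waiting
            items.append(q.get(block=False))
        except Empty:
            break
    return items


if __name__ == "__main__":
    pages = Queue()
    for i in range(1, 4):
        pages.put(i)
    print(drain(pages))  # items come out in FIFO order
    # A blocking get() with a timeout also raises Empty once the wait expires
    try:
        pages.get(timeout=0.1)
    except Empty:
        print("queue is empty")
```

Catching `Empty` specifically, rather than every exception, keeps an empty queue distinguishable from a genuine failure such as a network error.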

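class ThreadParse(threading.Thread):
    def __init__(self, threadname, dataqueue, filename, lock):
        super().__init__()
        # Thread name
        self.threadname = threadname
        # Queue of downloaded pages
        self.dataqueue = dataqueue
        # Open file object that receives the parsed data
        self.filename = filename
        # Lock guarding writes to the file
        self.lock = lock

    def run(self):
        print("Starting " + self.threadname)
        while not parse_exit:
            try:
                html = self.dataqueue.get(False)
                self.parse(html)
            except Exception:
                # Queue empty or page failed to parse: pause, then poll again
                time.sleep(0.1)
        print("Exiting " + self.threadname)

    def parse(self, html):
        # Parse into an HTML DOM
        html = etree.HTML(html)
        node_list = html.xpath('//div[contains(@id, "qiushi_tag")]')
        for node in node_list:
            # xpath() returns a list; here it holds one item, taken by index: the user name
            username = node.xpath('./div/a/@title')[0]
            # Image links
            image = node.xpath('.//div[@class="thumb"]//@src')
            # Text inside the tag: the joke itself
            content = node.xpath('.//div[@class="content"]/span')[0].text
            # Text inside the tags: like count, then comment count
            zan = node.xpath('.//i')[0].text
            comments = node.xpath('.//i')[1].text
            # The dict body is missing from this listing; the keys below are
            # reconstructed from the variables above.
            items = {
                "username": username,
                "image": image,
                "content": content,
                "zan": zan,
                "comments": comments,
            }
            # A with statement guarantees two calls: __enter__ and __exit__.
            # Whatever happens inside, the lock is acquired on entry and
            # released on exit: acquire, write, release.
            with self.lock:
                # Write the parsed record as one JSON line
                self.filename.write(json.dumps(items, ensure_ascii=False) + "\n")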
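The lock-guarded write at the end of `parse()` can be exercised on its own. In the sketch below an in-memory `io.StringIO` stands in for the output file, and `write_records`, `write_concurrently`, and the record fields are made up for illustration.

```python
import io
import json
import threading


def write_records(out, lock, records):
    """Serialize each record to one JSON line and append it under the lock."""
    for record in records:
        # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
        line = json.dumps(record, ensure_ascii=False) + "\n"
        with lock:  # only one thread writes at a time
            out.write(line)


def write_concurrently(num_threads=3, per_thread=50):
    """Have several threads write to one in-memory file; return the parsed lines."""
    out = io.StringIO()  # stands in for the real output file
    lock = threading.Lock()
    workers = []
    for t in range(num_threads):
        records = [{"thread": t, "n": i, "content": "段子"} for i in range(per_thread)]
        worker = threading.Thread(target=write_records, args=(out, lock, records))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
    return [json.loads(line) for line in out.getvalue().splitlines()]


if __name__ == "__main__":
    rows = write_concurrently()
    print(len(rows), "records written, each on its own line")
```

Building the line before taking the lock keeps the critical section down to the single `write()` call, so threads spend as little time as possible waiting on each other.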

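# Exit flags polled by the worker threads
crawl_exit = False
parse_exit = False


def main():
    global crawl_exit, parse_exit

    # Queue of page numbers, sized for 20 pages
    pagequeue = Queue(20)
    # Put in the numbers 1 to 20, first in, first out
    for i in range(1, 21):
        pagequeue.put(i)

    # Queue of crawl results (the HTML source of each page); no argument means no size limit
    dataqueue = Queue()

    filename = open("duanzi.json", "a", encoding="utf-8")
    # Create the lock
    lock = threading.Lock()

    # Names of the three crawl threads
    crawllist = ["Crawl thread 1", "Crawl thread 2", "Crawl thread 3"]
    # List holding the three crawl threads
    crawl_threads = []
    for threadname in crawllist:
        thread = ThreadCrawl(threadname, pagequeue, dataqueue)
        thread.start()
        crawl_threads.append(thread)

    # Names of the three parse threads
    parselist = ["Parse thread 1", "Parse thread 2", "Parse thread 3"]
    # List holding the three parse threads
    parse_threads = []
    for threadname in parselist:
        thread = ThreadParse(threadname, dataqueue, filename, lock)
        thread.start()
        parse_threads.append(thread)

    # Wait until pagequeue is empty, i.e. every page number has been taken
    while not pagequeue.empty():
        time.sleep(0.1)

    # pagequeue is empty: tell the crawl threads to leave their loop
    crawl_exit = True
    print("pagequeue is empty")

    for thread in crawl_threads:
        thread.join()
    print("1")  # progress marker: crawl threads have finished

    # Wait until every downloaded page has been taken for parsing
    while not dataqueue.empty():
        time.sleep(0.1)

    parse_exit = True
    for thread in parse_threads:
        thread.join()
    print("2")  # progress marker: parse threads have finished

    with lock:
        # Close the file
        filename.close()
    print("Thanks for using!")


if __name__ == "__main__":
    main()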
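The polling loops and global exit flags in `main()` work, but a common alternative is to let workers block on `get()` and stop when they receive a sentinel value. The sketch below shows that alternative pattern and is not the method used in this article; `worker`, `run_pipeline`, the `STOP` sentinel, and the doubling "work" are all hypothetical.

```python
import threading
from queue import Queue

STOP = object()  # sentinel: tells a worker there is no more work


def worker(tasks, results):
    """Process tasks until the sentinel arrives."""
    while True:
        item = tasks.get()  # blocks until an item is available
        if item is STOP:
            break
        results.put(item * 2)  # placeholder for real work (download, parse, ...)


def run_pipeline(items, num_workers=3):
    """Fan items out to num_workers threads and collect the results."""
    tasks, results = Queue(), Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(STOP)  # one sentinel per worker
    for w in workers:
        w.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)


if __name__ == "__main__":
    print(run_pipeline(range(1, 6)))  # [2, 4, 6, 8, 10]
```

With this design no thread spins on an empty queue, and no shared flag has to be set at exactly the right moment: each worker exits as soon as it pulls its own sentinel.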
