Crawling the Douban Movie Top 75 to Test Multithreading

2021-09-12 17:22:02

Using the threading module, write a simple multithreaded crawler and a single-threaded crawler, then compare how fast each one crawls.

import requests
import re
import threading
import time

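# Single-threaded crawl
def spider(url, headers):
    # headers must be passed by keyword: the second positional
    # argument of requests.get is params, not headers
    response = requests.get(url, headers=headers).text
    # Pattern for the link to each movie's detail page
    pattern = re.compile('.*?', re.S)
    linklist = pattern.findall(response)
    for link in linklist:
        html = requests.get(link, headers=headers).text
        p1 = re.compile('(.*?)', re.S)  # rank
        p2 = re.compile('(.*?)', re.S)  # title
        num = re.findall(p1, html)
        title = re.findall(p2, html)
        print(num[0], ':', title[0])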
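The patterns in this crawler are compiled with re.S (re.DOTALL) so that a lazy `.*?` can match across line breaks in the HTML. Here is a minimal sketch of that behaviour, assuming a made-up HTML fragment: the markup, the class names and the `extract` helper are hypothetical and are not Douban's real page structure.

```python
import re

# Hypothetical HTML fragment for illustration; NOT Douban's real markup.
SAMPLE_HTML = """
<div class="item">
  <span class="rank">
    No.1
  </span>
  <h1 class="name">Example Movie</h1>
</div>
"""

# re.S (re.DOTALL) lets "." match newlines, so the lazy group can span lines.
rank_pattern = re.compile(r'<span class="rank">(.*?)</span>', re.S)
title_pattern = re.compile(r'<h1 class="name">(.*?)</h1>', re.S)


def extract(html):
    """Return (rank, title) from the fragment, or None if either is missing."""
    ranks = rank_pattern.findall(html)
    titles = title_pattern.findall(html)
    if not ranks or not titles:
        return None  # avoids the IndexError that indexing [0] would raise
    return ranks[0].strip(), titles[0].strip()


print(extract(SAMPLE_HTML))  # ('No.1', 'Example Movie')
```

Without re.S the rank pattern finds nothing in this fragment, because the text between the tags contains newlines. Checking for an empty result before indexing also keeps one unexpected page from stopping the whole crawl.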

# Multithreaded crawl (three threads)
lock = threading.RLock()  # lock mechanism for the threads

# Crawl the rank and title of a single movie
def infospider(link, headers):
    html = requests.get(link, headers=headers).text
    p1 = re.compile('(.*?)', re.S)  # rank
    p2 = re.compile('(.*?)', re.S)  # title
    num = re.findall(p1, html)
    title = re.findall(p2, html)
    print(num[0], ':', title[0])

# Thread 1: links 0, 3, 6, ..., 24
def a(linklist, headers):
    # lock.acquire()
    for i in range(0, 25, 3):
        url = linklist[i]
        infospider(url, headers)
    # lock.release()

# Thread 2: links 1, 4, 7, ..., 22
def b(linklist, headers):
    # lock.acquire()
    for i in range(1, 25, 3):
        url = linklist[i]
        infospider(url, headers)
    # lock.release()

# Thread 3: links 2, 5, 8, ..., 23
def c(linklist, headers):
    # lock.acquire()
    for i in range(2, 25, 3):
        url = linklist[i]
        infospider(url, headers)
    # lock.release()
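The three worker functions differ only in their starting offset, so the same split can be written once with the offset as a parameter. This is a sketch of that idea, not the original author's code: `indices_for`, `worker` and `handle` are hypothetical names, and the links are placeholder strings.

```python
def indices_for(offset, total=25, workers=3):
    """Indices handled by the worker that starts at the given offset."""
    return list(range(offset, total, workers))


def worker(linklist, offset, handle, workers=3):
    """Call handle(link) on every workers-th link, starting at offset."""
    for i in range(offset, len(linklist), workers):
        handle(linklist[i])


# Together the three strides cover indices 0..24 exactly once.
covered = sorted(i for k in range(3) for i in indices_for(k))
print(covered == list(range(25)))  # True

seen = []
links = ["link-%d" % i for i in range(25)]  # placeholder strings, not real URLs
worker(links, 1, seen.append)
print(seen[:3])  # ['link-1', 'link-4', 'link-7']
```

Using len(linklist) instead of a fixed 25 also avoids an IndexError when a page yields fewer links than expected.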

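def spider2(url, headers):
    response = requests.get(url, headers=headers).text
    # Pattern for the link to each movie's detail page
    pattern = re.compile('.*?', re.S)
    linklist = pattern.findall(response)
    # One thread per worker function; each takes every third link
    t1 = threading.Thread(target=a, args=(linklist, headers))
    t2 = threading.Thread(target=b, args=(linklist, headers))
    t3 = threading.Thread(target=c, args=(linklist, headers))
    # Start all three threads, then wait for all of them to finish
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()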
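The start-then-join pattern can be tried without touching the network. The sketch below assumes a stand-in `fake_fetch` that only sleeps, and it collects results in a shared list guarded by a lock instead of printing them. All names here are hypothetical.

```python
import threading
import time


def fake_fetch(link):
    """Stand-in for a network request: sleeps briefly, touches no network."""
    time.sleep(0.01)
    return "page for " + link


def crawl_stride(links, offset, step, results, lock):
    """Fetch every step-th link starting at offset and record the result."""
    for i in range(offset, len(links), step):
        page = fake_fetch(links[i])
        with lock:  # only one thread appends at a time
            results.append((i, page))


def crawl_all(links, n_threads=3):
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_stride,
                         args=(links, k, n_threads, results, lock))
        for k in range(n_threads)
    ]
    for t in threads:
        t.start()  # launch all threads before waiting on any
    for t in threads:
        t.join()   # block until every thread has finished
    return sorted(results)


links = ["link-%d" % i for i in range(25)]  # placeholder strings, not real URLs
pages = crawl_all(links)
print(len(pages))  # 25
print(pages[0])    # (0, 'page for link-0')
```

Starting every thread before joining any of them matters: joining a thread right after starting it would run the three workers one after another. The `with lock:` block acquires and releases the lock automatically, even if the body raises.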

def main():
    # Request headers
    headers =
    # Single-threaded test
    start1 = time.time()
    for i in range(3):
        # List page i, starting at offset 0, 25, 50
        url = '' % (i * 25)
        spider(url, headers)
    end1 = time.time()
    # Multithreaded test
    start2 = time.time()
    for i in range(3):
        url = '' % (i * 25)
        spider2(url, headers)
    end2 = time.time()
    print(end1 - start1)  # single-threaded run time
    print(end2 - start2)  # multithreaded run time

if __name__ == '__main__':
    main()

In this test, the three-thread crawl was roughly three times as fast as the single-threaded one.
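Threads help with this kind of workload because each request spends most of its time waiting on the network, and those waits can overlap. The sketch below illustrates the effect with a simulated delay instead of real requests. `DELAY`, `fake_request` and the request count are assumptions chosen for the demonstration, and the measured times will vary from machine to machine.

```python
import threading
import time

DELAY = 0.05  # simulated per-request latency in seconds (an assumption)


def fake_request(_):
    """Stand-in for a network request: just waits."""
    time.sleep(DELAY)


def run_sequential(n):
    """Issue n simulated requests one after another; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        fake_request(i)
    return time.perf_counter() - start


def run_threaded(n, n_threads=3):
    """Split n simulated requests across n_threads; return elapsed seconds."""
    def work(offset):
        for i in range(offset, n, n_threads):
            fake_request(i)

    start = time.perf_counter()
    threads = [threading.Thread(target=work, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


seq = run_sequential(9)
par = run_threaded(9)
print("sequential: %.2fs, threaded: %.2fs" % (seq, par))
```

time.perf_counter is used here because it is designed for measuring short intervals. With real requests the speedup also depends on server response times and on any rate limiting the site applies.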
