A Multithreaded Novel Crawler for 書包網

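Given that, why not use it for a practice crawler project?

Straight to the ** then. This multithreaded crawler can scrape all sorts of similar **. The key requirement is that the ** can handle high concurrency; otherwise it goes down within minutes.

After all, 5 minutes for an 18 MB ** counts as very fast.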

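from lxml import etree
import requests
from threading import Thread, enumerate  # note: this shadows the built-in enumerate()
import os
from time import sleep, time

# The original header values were lost in extraction; fill in as needed (for example a User-Agent).
headers = {}

def thread_it(func, *args):
    # Run func(*args) in a daemon thread, so it is killed when the main thread exits.
    t = Thread(target=func, args=args)
    t.daemon = True
    t.start()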
def getall(url=""):
    # Fetch the novel's index page and return
    # (name, author, type, [(chapter title, chapter link), ...]).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return None, None, None, None
    page_source = etree.HTML(r.text)
    name = page_source.xpath('//*[@id="info"]/h1/text()')
    author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
    novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
    title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
    link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
    link = map(lambda x: '' + x, link)  # prepend a prefix to every link (the prefix is blank in the source)
    novel_list = list(zip(title, link))  # pair titles with links and turn the zip object into a list
    if len(novel_list) > 0:
        return name[0], author[0], novel_type[0], novel_list
    return None, None, None, None
def getone(link=('第0001章 絕地中走出的少年', '/views/201506/04/id_xndmymja1_1.html')):
    # link is a (chapter title, chapter link) pair; return (title, chapter text).
    r = requests.get(link[1], headers=headers)
    if r.status_code != 200:
        return None, None
    page_source = etree.HTML(r.text)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    node_content = "".join(node_content).replace("\n \xa0 \xa0", "")
    if len(node_title) > 0:
        return node_title, node_content
    return None, None

def writeone(title, content):
    # Format one chapter: indented title, body text, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt
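def download(name, novel_list, cwd=False):
    # NOTE: the enclosing definition was lost in extraction. The name `download`
    # and its signature are assumptions made so that the listing runs.
    article_num = len(novel_list)
    xc_num = article_num // 20 + 1  # one thread per 20 chapters
    print(f"Threads to start: {xc_num}")

    def inter(link, f, i):
        try:
            title, content = getone(link)
            txt = writeone(title, content)
            f.write(txt)
            print(f"\rThread {i} is writing {title}", end="")
        except Exception:
            print("\nCrawling too fast, connection refused; waiting 1 s, then retrying recursively")
            sleep(1)
            inter(link, f, i)

    def inner(name, i, begin, end, cwd):
        with open(f"downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The part file is closed (and flushed) before any merge happens.
        print(f"\nThread {i} finished")
        print(f"\nThreads remaining: {len(enumerate())}")
        # Baseline thread count: main thread plus this one
        # (4 when cwd is set; the source does not explain cwd).
        base_xc = 2 if not cwd else 4
        if len(enumerate()) <= base_xc:
            print(enumerate())
            t2 = time()
            print(f"Elapsed: {t2 - t1:.1f} s")  # t1 is set in the __main__ block
            hebing(f"downloads/{name}")

    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        if i == xc_num:
            print("\nAll threads started")
        thread_it(inner, name, i, begin, end, cwd)
        sleep(0.5)

def paixurule(elem):
    # Sort key: the integer before the extension (part files are 1.txt, 2.txt, ...).
    return int(elem.split(".")[0])

def hebing(path):
    # Merge the numbered part files in `path`, in order, into `path`.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixurule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("** merge complete")

if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getall(url="")  # the index-page URL is blank in the source
    print(name)
    if name is None:
        raise SystemExit("Could not read the index page")
    os.makedirs("downloads/" + name, exist_ok=True)
    download(name, novel_list)
    while True:  # keep the main thread alive, because the workers are daemon threads
        pass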
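The listing above gives each thread a batch of 20 chapters and computes the thread count as `article_num//20+1`. When the chapter count is an exact multiple of 20, that formula produces one extra batch that is empty. A minimal sketch of the same partitioning with ceiling division follows; `chunk_ranges` is a hypothetical helper, not part of the original code.

```python
def chunk_ranges(total, size=20):
    """Return (begin, end) slice bounds that cover range(total) in batches of `size`."""
    count = -(-total // size)  # ceiling division without floats
    return [(size * i, min(size * (i + 1), total)) for i in range(count)]


print(chunk_ranges(45))  # [(0, 20), (20, 40), (40, 45)]
print(chunk_ranges(40))  # [(0, 20), (20, 40)]
```

Each pair can be used directly as `novel_list[begin:end]`, and an empty chapter list yields no batches at all.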
