Parse the webpage to get each chapter's URL
This time we are going to crawl a novel called 《元尊》 (Yuan Zun). Its index page is:
url = ''
Open the page in a browser and bring up the developer tools to locate the chapter links.
This gives us the address of every chapter.
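As a minimal sketch of just this step: the XPath below is the one used in the complete code further down, and the site address is left blank here, as it is in the source.

import requests
from lxml import etree

# Fetch the index page of the novel (the actual site URL is omitted in the original)
index_url = ''
r = requests.get(index_url, headers={'User-Agent': 'Mozilla/5.0'})

# The chapter links sit under a fixed XPath found with the developer tools
html = etree.HTML(r.text)
chapter_urls = html.xpath('/html/body/div[3]/div[3]/dl/dd/a/@href')
print(len(chapter_urls), 'chapters found')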
Save each chapter locally
Open any chapter, bring up the developer tools again, and you can easily locate the title and the body text.
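A minimal sketch of extracting those two pieces from a single chapter page, using the same selectors (the div with class "inner" for the title, the div with id "booktext" for the paragraphs) that appear in the complete code below; the chapter address is again left blank as in the source.

import requests
from bs4 import BeautifulSoup

# chapter_url would be one of the addresses collected from the index page
chapter_url = ''
r = requests.get(chapter_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)

soup = BeautifulSoup(r.text, 'lxml')
title = soup.find('div', class_='inner').h1.get_text()        # chapter title
paragraphs = soup.find('div', id='booktext').find_all('p')    # body paragraphs
text = ''.join(p.get_text() for p in paragraphs)
print(title, len(text))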
Add multi-threading and we can crawl the novel at a much faster speed. The complete code is as follows:
# -*- coding: utf-8 -*-
# @ModuleName: novel
# @Function:
# @Author: shenfugui
# @Email: [email protected]
# @Time: 3/13/2020 3:36 PM
import requests
import os
import time
import threading
from lxml import etree
from queue import Queue
from bs4 import BeautifulSoup


# Collect the URL of every chapter and feed them to the worker threads
def get_urls(headers, threads, q):
    url = ''  # index page of the novel (address omitted in the source)
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    urls = html.xpath('/html/body/div[3]/div[3]/dl/dd/a/@href')
    for url in urls:
        n_url = '' + url  # prepend the site domain (omitted in the source)
        q.put(n_url)
    for i in range(10):
        t = threading.Thread(target=download_novel, args=(headers, q))
        t.start()
        threads.append(t)
    q.join()
    for i in range(10):
        q.put(None)  # one sentinel per worker thread
    for t in threads:
        t.join()
    print('finished')


def download_novel(headers, q):
    while True:
        # Block until a URL is taken from the queue
        url = q.get()
        if url is None:
            break
        try:
            r = requests.get(url, headers=headers, timeout=10)
            path = './novel/'
            if not os.path.exists(path):
                os.mkdir(path)
            soup = BeautifulSoup(r.text, 'lxml')
            title = soup.find('div', class_="inner").h1.get_text()
            contents = soup.find('div', id="booktext").find_all('p')
            with open(path + title + '.txt', 'a', encoding='utf-8') as f:
                for content in contents:
                    f.write(content.get_text())
            print('%s' % title)  # log message text elided in the source
        except requests.exceptions.ConnectionError:
            pass
        except requests.exceptions.Timeout:
            pass
        except requests.exceptions.ReadTimeout:
            pass
        q.task_done()


def main():
    start = time.time()
    q = Queue()
    threads = []
    headers = {'User-Agent': 'Mozilla/5.0'}  # original headers omitted; generic UA as placeholder
    get_urls(headers, threads, q)
    end = time.time()
    print('Total time: %s s' % (end - start))


if __name__ == '__main__':
    main()
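The threading design above is a simple producer/consumer pattern: the main thread pushes chapter URLs into a Queue, ten worker threads pull from it, q.join() blocks until every queued URL has been marked done, and one None sentinel per worker tells the workers to exit. A stripped-down sketch of just that coordination, with a placeholder print standing in for the real download:

import threading
from queue import Queue

def worker(q):
    while True:
        item = q.get()              # block until an item is available
        if item is None:            # sentinel: time to stop
            break
        print('processing', item)   # placeholder for the real download
        q.task_done()

q = Queue()
for i in range(20):
    q.put(i)

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()

q.join()                            # wait until every queued item is task_done()
for _ in threads:
    q.put(None)                     # one sentinel per worker
for t in threads:
    t.join()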