Beginner-level crawler: scraping a chosen novel from the 17k novel site

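Before running the .py file, install these two packages first. The script also asks BeautifulSoup for the 'html5lib' parser, so html5lib has to be installed as well (pip install html5lib).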

pip install beautifulsoup4

pip install requests
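
A quick way to confirm the installs worked is to ask Python whether each module can be found. The sketch below uses only the standard library; the helper name `missing_modules` is made up for illustration and is not part of the original script. Note that the beautifulsoup4 package is imported under the name `bs4`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # beautifulsoup4 installs the module "bs4"; html5lib is the parser the script requests
    required = ["bs4", "requests", "html5lib"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If anything is reported missing, rerun the matching pip install command before starting the crawler.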

#!/usr/bin/env python3.7
# -*- coding: utf-8 -*-
# author by slo
from bs4 import BeautifulSoup
import requests


# The class name is masked as "**********" in the source; Downloader is a placeholder.
class Downloader(object):
    def __init__(self):
        # Base URL of the site; the value is empty in the source.
        self.url = ''
        self.target = '/list/349579.html'
        # Stores the chapter names of the **
        self.names = []
        # Stores the URL of each corresponding chapter of the **
        self.urls = []
        self.nums = 0

    def get_download_url(self):
        html = requests.get(url=self.target).content.decode('utf-8')
        soup = BeautifulSoup(html, 'html5lib')
        dl = soup.find_all('dl', class_='volume')
        a_bf = BeautifulSoup(str(dl[1:]), 'html5lib')
        a = a_bf.find_all('a', target='_blank')
        span = BeautifulSoup(str(a), 'html5lib')
        charp_url = span.find_all('a', target='_blank')
        norml_txt = span.find_all('span', class_='ellipsis')
        # [:140] because this ** has 140 chapters in total.
        # The two loop bodies are missing in the source; the appends are reconstructed from context.
        for each in norml_txt[:140]:
            self.names.append(each.text.strip())
        for each in charp_url[:140]:
            self.urls.append(self.url + each.get('href'))
        # Count only chapters that have both a name and a URL.
        self.nums = min(len(self.names), len(self.urls))

    # Fetch the content of one chapter
    def get_contents(self, target):
        req = requests.get(url=target).content.decode('utf-8')
        dv_area = BeautifulSoup(req, 'html5lib')
        dv_area_text = dv_area.find_all('div', class_='readareabox content')
        if len(dv_area_text) != 0:
            dv_p = BeautifulSoup(str(dv_area_text[0]), 'html5lib')
            dv_p_txt = dv_p.find_all('div', class_='p')
            if dv_p_txt:
                # Paragraphs start with two full-width spaces; turn them into line breaks.
                texts = dv_p_txt[0].text.replace('　　', '\n')
                return texts
        return None

    # Append the content to a text file.
    # name: chapter name; path: file name in the current directory that the ** is saved under; text: chapter content
    def writer(self, name, path, text):
        if name is not None and path is not None and text is not None:
            with open(path, 'a', encoding='utf-8') as f:
                f.write(name + '\n')
                f.write(text)
                f.write('\n\n')


if __name__ == "__main__":
    dl = Downloader()
    dl.get_download_url()
    print()
    for i in range(dl.nums):
        dl.writer(dl.names[i], '天才相師.txt', dl.get_contents(str(dl.urls[i])))
    print()
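
If the `href` values collected from the index page are site-relative paths like the `/list/349579.html` target above, they need the site's base URL in front before `requests.get` can fetch them. As an alternative to plain string concatenation, `urllib.parse.urljoin` from the standard library copes with leading slashes and with links that are already absolute. A minimal sketch, assuming a placeholder base URL `https://example.com` and made-up chapter paths, none of which come from the original post; the helper name `absolute_urls` is also hypothetical:

```python
from urllib.parse import urljoin


def absolute_urls(base, hrefs):
    """Resolve each href against base, skipping empty or missing values."""
    return [urljoin(base, href) for href in hrefs if href]


if __name__ == "__main__":
    base = "https://example.com"  # placeholder, not the real site address
    hrefs = [
        "/chapter/1.html",                      # site-relative path
        "https://example.com/chapter/3.html",   # already absolute, left unchanged
        None,                                   # an <a> tag without href, skipped
    ]
    for url in absolute_urls(base, hrefs):
        print(url)
```

Skipping empty values matters because `each.get('href')` returns `None` for an anchor tag that has no `href` attribute, and concatenating that with a string would raise a `TypeError`.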

