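Scraping Novels with Python

This script mainly follows chapter links on 筆趣閣 (Biquge) sites.

Because Biquge limits how many requests it accepts within a short period, each run can only fetch ten chapters.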
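The ten-chapter cap can be illustrated in isolation. The following is a minimal sketch, not the script's own code: `crawl_chapters`, `fetch_chapter`, `max_chapters`, and `delay_seconds` are hypothetical names, the one-second pause is an assumed value rather than a documented limit of the site, and `fake_fetch` stands in for a real HTTP request so the loop can be tried offline.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: given a chapter URL, return (chapter_text, next_url or None).
FetchChapter = Callable[[str], Tuple[str, Optional[str]]]


def crawl_chapters(start_url: str,
                   fetch_chapter: FetchChapter,
                   max_chapters: int = 10,
                   delay_seconds: float = 1.0) -> List[str]:
    """Follow next-chapter links in a loop, stopping after max_chapters."""
    chapters = []
    url = start_url
    while url is not None and len(chapters) < max_chapters:
        text, url = fetch_chapter(url)
        chapters.append(text)
        # Pause only if another request is going to follow (assumed interval).
        if url is not None and len(chapters) < max_chapters:
            time.sleep(delay_seconds)
    return chapters


def fake_fetch(url: str) -> Tuple[str, Optional[str]]:
    # Stand-in for a real request: chapter n always links to chapter n + 1.
    n = int(url)
    return "chapter %d" % n, str(n + 1)


result = crawl_chapters("1", fake_fetch, max_chapters=10, delay_seconds=0)
print(len(result))
```

Using a loop instead of recursion keeps the chapter count in one place, and passing the fetch function in as an argument makes it possible to check the stopping rule without touching the network.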

# coding:utf-8
import re

import chardet
import requests
from bs4 import BeautifulSoup

i = 0  # number of chapters fetched so far


# Limit the number of characters per line
def cut_text(text, length):
    # Split text into consecutive chunks of at most `length` characters
    return [text[k:k + length] for k in range(0, len(text), length)]


def read(url, i):
    i += 1
    if i > 10:  # stop after ten chapters per run
        return
    r = requests.get(url)
    encode = chardet.detect(r.content)
    r.encoding = encode['encoding']  # decode with the detected page encoding
    doc = r.text  # response body returned by the server
    soup = BeautifulSoup(doc, "html.parser")
    for title in soup.find_all('h1'):
        print(" ")
        print(title.string)  # print the chapter heading text
    for txt in soup.find_all(id='content'):
        # Limit the number of characters per line
        # Biquge 1
        # for line in txt.text.split('　　'):
        #     if len(line) > 40:
        #         for cutline in cut_text(line, 40):
        #             print(cutline)
        #     else:
        #         print(line)
        # Biquge 2
        for line in txt.text.split():
            if len(line) > 40:
                for cutline in cut_text(line, 40):
                    print('   ' + cutline)
            else:
                print('   ' + line)
    href = ''  # stays empty if no next-chapter link is found
    # Biquge 1
    # for c in soup.find_all(class_='page_chapter'):
    #     d = c.find_all('a')
    #     href = d[2]['href']
    # Biquge 2-1
    for c in soup.find_all(id='pager_next'):
        href = c['href']
    # Biquge 2-2
    # for c in soup.find_all(class_='bottem2'):
    #     d = c.find_all('a')
    #     href = d[3]['href']
    if len(href) > 2:
        # The empty string is where the site's address prefix goes
        # Biquge 1
        # href = '' + href
        # Biquge 2-1
        href = '' + href
        # Biquge 2-2
        # href = '' + href
        print(href)
        read(href, i)


# The starting links below need to be kept up to date.
# Biquge 1, novel 最初進化
# url = '/75_75537/521520773.html'
# Biquge 2-1
url = '2647622.html'
# Biquge 2-2
# url = '96_96741/33481914.html'
read(url, i)
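
The script builds the next chapter's address by prepending a fixed prefix to `href`, with a different prefix for each site variant. As an alternative sketch (not what the script above does), the standard library's `urllib.parse.urljoin` can resolve the `href` against the page it was found on, which handles both relative and root-relative links. The `https://example.com` address, the chapter file name `33481915.html`, and the helper name `next_chapter_url` are placeholders assumed for illustration only.

```python
from urllib.parse import urljoin

# Placeholder address, assumed for illustration; not the real site.
current_page = "https://example.com/96_96741/33481914.html"


def next_chapter_url(page_url, href):
    """Resolve a next-chapter href against the page it was found on."""
    return urljoin(page_url, href)


# A relative href replaces the last path segment of the current page.
print(next_chapter_url(current_page, "33481915.html"))
# → https://example.com/96_96741/33481915.html

# A root-relative href replaces the whole path.
print(next_chapter_url(current_page, "/75_75537/521520773.html"))
# → https://example.com/75_75537/521520773.html
```

With this approach, `read` would only need the URL of the page it just fetched, so no per-site prefix has to be maintained.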
