Scraping novels with requests

1. Parse the URL

2. Send the request

3. Receive the response

4. Parse the data

5. Save the data

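Goal: scrape the title, author, most recently updated chapter, and update time of the novels on a specified page of 國風中文網, and save them locally.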

Folks, our weapons of choice today are requests and XPath.

Step 1: Import the modules

import requests

from lxml import etree

import json

Step 2: Define the class and its methods

class yuedu:
    def __init__(self):
        pass

    # 1. Send the request
    def get_page(self, url):
        pass

    # 2. Parse the data
    def parse_page(self, html):
        pass

    # 3. Save the data
    def write_page(self, item_list):
        pass

    def startwork(self):
        pass

# Create an instance
yuedu = yuedu()
# Start the program
yuedu.startwork()
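Step 3: Following this outline, fill in the details
1. In __init__ you can set the request headers so that the site's anti-scraping checks are less likely to reject the request. By default, requests sends a User-Agent that identifies it as python-requests.
    def __init__(self):
        # Request headers (the original header values are not shown; this User-Agent is a placeholder)
        self.headers = {"User-Agent": "Mozilla/5.0"}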
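As a rough illustration of the header step, the sketch below compares the default User-Agent that requests sends with a browser-like one set on a session. The `BROWSER_UA` string, the `build_session` helper, and the example.com URL are assumptions made for this example, not values from the original post. No request is actually sent; the request is only prepared so its headers can be inspected.

```python
import requests

# Hypothetical browser-like User-Agent string (an assumption for this sketch)
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_session(user_agent=BROWSER_UA):
    """Return a Session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


# What requests sends when no User-Agent is set explicitly
default_ua = requests.utils.default_user_agent()

# Prepare (but do not send) a request to see the merged headers
session = build_session()
prepared = session.prepare_request(requests.Request("GET", "http://example.com/"))

print(default_ua)
print(prepared.headers["User-Agent"])
```

Setting the header once on a session, rather than passing `headers=` on every call, keeps the value in one place if the scraper later fetches more than one kind of page.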
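2. Send the request, get the response body, and parse it into an HTML document
    def get_page(self, url):
        # .content returns the response body as raw bytes
        response = requests.get(url, headers=self.headers).content
        # print(response)
        # Use etree.HTML to parse the bytes into an HTML document
        html = etree.HTML(response)
        # Return the parsed document
        return html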
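A slightly more defensive variant of the fetch step might look like the sketch below. The `fetch_html` name, the 10-second timeout, and the sample markup are assumptions for illustration. The parsing half is demonstrated on a bytes literal so it can be tried without touching the network.

```python
import requests
from lxml import etree


def fetch_html(url, headers=None, timeout=10):
    """Fetch a page and parse it, failing loudly on HTTP errors.

    Hypothetical helper: the timeout value is an arbitrary choice.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return etree.HTML(response.content)


# etree.HTML accepts the same kind of bytes that response.content returns
sample = b"<html><body><p>hello</p></body></html>"
doc = etree.HTML(sample)
texts = doc.xpath("//p/text()")
print(texts)
```

A timeout and `raise_for_status()` make intermittent network failures surface as exceptions at the fetch step rather than as empty XPath results later on.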
3. Parse the HTML document and return a list of records
    # Parse the data
    def parse_page(self, html):
        # Get every row under these two kinds of node
        select1 = html.xpath('//tr[@class="odd"]')
        select2 = html.xpath('//tr[@class="even"]')
        # Concatenate the two lists
        select_list = select1 + select2
        # Holds the records for all books
        item_list = []
        # Iterate over select_list to get each row
        for select in select_list:
            # Holds the information for one book
            item = {}
            # Book title
            item['bookname'] = select.xpath('.//div[@class="book-name"]/a/text()')
            # print(item['bookname'])
            # Latest chapter
            item['newbook'] = select.xpath('.//div[@class="book-newest-chapter"]/text()')
            # print(item['newbook'])
            # Author name
            item['name'] = select.xpath('.//td[3]/text()')
            # print(item['name'])
            # Word count
            item['number'] = select.xpath('.//td[4]/text()')
            # print(item['number'])
            # Update time
            item['time'] = select.xpath('.//td[5]/text()')
            # print(item['time'])
            # Add this book to the result list
            item_list.append(item)
        return item_list
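Concatenating the odd rows and the even rows puts all odd rows first, so the page order is lost. One possible alternative, sketched below on made-up markup, is a single XPath with `or`, which returns the rows in document order. The `SAMPLE` table, the `first_text` helper, and the `parse_rows` function are assumptions for this example; only the class names and column positions are taken from the code above.

```python
from lxml import etree

# Made-up markup that mimics the row structure used above
SAMPLE = """
<table>
  <tr class="odd"><td><div class="book-name"><a>Book A</a></div></td><td>x</td><td>Author A</td></tr>
  <tr class="even"><td><div class="book-name"><a>Book B</a></div></td><td>x</td><td>Author B</td></tr>
  <tr class="odd"><td><div class="book-name"><a>Book C</a></div></td><td>x</td><td>Author C</td></tr>
</table>
"""


def first_text(node, path):
    """Return the first XPath text match, stripped, or '' if nothing matched."""
    result = node.xpath(path)
    return result[0].strip() if result else ""


def parse_rows(html):
    # One query for both classes keeps the rows in page order
    rows = html.xpath('//tr[@class="odd" or @class="even"]')
    return [
        {
            "bookname": first_text(row, './/div[@class="book-name"]/a/text()'),
            "name": first_text(row, './/td[3]/text()'),
        }
        for row in rows
    ]


items = parse_rows(etree.HTML(SAMPLE))
for item in items:
    print(item)
```

Unlike the code above, this sketch stores a plain string per field instead of a one-element list, which is a design choice and changes the shape of the saved JSON.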

4. Save the data in JSON format

    # Save the data
    def write_page(self, item_list):
        # Convert the Python data to a JSON string before saving.
        # json.dumps() escapes non-ASCII characters by default; ensure_ascii=False disables the escaping and keeps the Unicode characters as they are.
        # Without ensure_ascii=False the output looks like: [{"bookname": ["\u4e07\u53e4\u4e39\u5e1d"],
        # With it, the output looks like: [{"bookname": ["万古丹帝"],
        # json.dumps() returns the data as a JSON string.
        # Write the file with UTF-8 encoding; otherwise the write can fail with
        # UnicodeEncodeError: 'ascii' codec can't encode characters in position 16-22: ordinal not in range(128)
        content = json.dumps(item_list, ensure_ascii=False)
        with open("yuedu.json", "a", encoding="utf-8") as f:
            f.write(content)

    def startwork(self, num):
        for page in range(1, num + 1):
            # The base URL is left blank in the original
            url = '' + str(page) + '/s30'
            # Fetch the listing page
            html = self.get_page(url)
            # Parse the data
            item_list = self.parse_page(html)
            # Call the write method
            self.write_page(item_list)

# Create an instance
yuedu = yuedu()
# Start the program
yuedu.startwork(num=1)
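To make the effect of `ensure_ascii` concrete, and to show one way of keeping the file parseable when several pages are appended, here is a small sketch that writes one JSON object per line (the JSON Lines convention). This is a swapped-in storage format, not the one used above. The `append_items` and `read_items` helpers, the temporary path, and the sample records are assumptions for illustration.

```python
import json
import os
import tempfile


def append_items(path, item_list):
    """Append each record as its own JSON line, encoded as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        for item in item_list:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_items(path):
    """Read back every non-empty line as one JSON record."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


record = {"bookname": ["万古丹帝"]}
escaped = json.dumps(record)                       # non-ASCII characters escaped
readable = json.dumps(record, ensure_ascii=False)  # characters kept as-is
print(escaped)
print(readable)

# Two separate appends, as if two listing pages had been scraped
path = os.path.join(tempfile.mkdtemp(), "yuedu.jsonl")
append_items(path, [record])
append_items(path, [{"bookname": ["Book B"]}])
loaded = read_items(path)
print(loaded)
```

Appending a whole JSON array per page, as the code above does, leaves several arrays back to back in one file, which `json.load` cannot read in a single call once more than one page is scraped. One record per line avoids that.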
