Scraping famous quotes from gushiwen (古詩文網) with the Scrapy framework

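This post uses the Scrapy framework to scrape famous quotes (名句) from the site. Only two fields are collected: the quote itself and its source. The project files are walked through below.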

items.py holds the data model for the scraped items, as follows:

import scrapy


class QsbkItem(scrapy.Item):
    # The quote itself
    content = scrapy.Field()
    # The source of the quote
    auth = scrapy.Field()

pipelines.py saves the items to a JSON file. Scrapy offers two exporters for this: JsonItemExporter and JsonLinesItemExporter.

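1. JsonItemExporter wraps all items in a single JSON array, opened by start_exporting() and closed by finish_exporting(). The advantage is that the finished file is one valid JSON document. The drawback is that the file is only valid once the export has finished, and with a large amount of data, reading such a file back usually means loading the whole array into memory.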

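2. JsonLinesItemExporter writes each item to disk as soon as export_item is called. The drawback is that every dictionary sits on its own line, so the file as a whole is not a valid JSON document. The advantage is that each record is complete the moment it is written: an interrupted crawl loses little data, and the file can be processed line by line without putting pressure on memory.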
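To make the difference between the two output shapes concrete, here is a minimal sketch that uses only the standard-library json module, not Scrapy's exporters. The sample records and the two helper functions are made up for illustration; they only mimic the shape of the files the exporters produce.

```python
import io
import json

# Hypothetical sample records with the same two fields as the item model.
records = [
    {"content": "quote one", "auth": "source one"},
    {"content": "quote two", "auth": "source two"},
]


def write_json_array(fp, items):
    """Write all records as one JSON array (the array-style output)."""
    json.dump(list(items), fp, ensure_ascii=False)


def write_json_lines(fp, items):
    """Write one JSON object per line (the line-style output)."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


array_buf = io.StringIO()
write_json_array(array_buf, records)

lines_buf = io.StringIO()
write_json_lines(lines_buf, records)

# The array output parses as a single JSON document.
parsed_array = json.loads(array_buf.getvalue())

# The line-based output has to be parsed one line at a time.
parsed_lines = [json.loads(line) for line in lines_buf.getvalue().splitlines()]
```

Both variants carry the same data; the line-based text just cannot be handed to json.loads in one piece.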

# Line-by-line export, suited to large amounts of data
from scrapy.exporters import JsonLinesItemExporter


class QsbkPipeline(object):
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode
        self.fp = open('mj.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Close the file so buffered data is flushed to disk
        self.fp.close()
        print("Spider finished...")

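# Using the array exporter
from scrapy.exporters import JsonItemExporter


# Suited to small amounts of data: all items go into a single JSON array,
# which is only valid JSON once finish_exporting() has written the closing bracket.
# Reading a large array back usually means loading it into memory in one go.
class QsbkPipeline(object):
    def __init__(self):
        self.fp = open('mj.json', 'wb')
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")
        # Begin the export
        self.exporter.start_exporting()

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # End the export, then close the file
        self.exporter.finish_exporting()
        self.fp.close()
        print("Spider finished...")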

Use either one of the two pipelines above, not both.
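Whichever pipeline is chosen, the resulting mj.json has to be read back differently. The following is a rough sketch of a loader that accepts either shape; load_items and the temporary file names are hypothetical helpers for illustration, not part of Scrapy, and the detection rule (a leading "[") is an assumption that holds for files containing only item dictionaries.

```python
import json
import os
import tempfile


def load_items(path):
    """Load items from a file holding either a JSON array or JSON lines."""
    with open(path, encoding="utf-8") as fp:
        text = fp.read().strip()
    if not text:
        return []
    if text.startswith("["):
        # The whole file is a single JSON array.
        return json.loads(text)
    # Otherwise treat every non-empty line as its own JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Small round trip with made-up records, written in both shapes.
sample = [{"content": "a", "auth": "b"}, {"content": "c", "auth": "d"}]
with tempfile.TemporaryDirectory() as tmp:
    array_path = os.path.join(tmp, "array.json")
    lines_path = os.path.join(tmp, "lines.json")
    with open(array_path, "w", encoding="utf-8") as fp:
        json.dump(sample, fp, ensure_ascii=False)
    with open(lines_path, "w", encoding="utf-8") as fp:
        for row in sample:
            fp.write(json.dumps(row, ensure_ascii=False) + "\n")
    from_array = load_items(array_path)
    from_lines = load_items(lines_path)
```

For a very large line-based file, iterating over the file object instead of calling read() would avoid holding the whole text in memory.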

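settings.py holds the crawler's configuration (for example the request headers, the delay between requests, and an IP proxy pool). The changes made to settings.py are as follows: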

# Project name
BOT_NAME = 'qsbk'
# Where the spider modules live
SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'
# Whether to obey robots.txt rules (True means obey)
ROBOTSTXT_OBEY = False
# Wait 2 seconds between requests
DOWNLOAD_DELAY = 2
# Override the default request headers:
DEFAULT_REQUEST_HEADERS =
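The settings shown stop at the request headers and do not show how the pipeline is switched on. In Scrapy a pipeline only runs once it is listed in ITEM_PIPELINES, so a settings.py for this project would presumably also contain something like the sketch below. The header values are placeholders rather than the author's originals, and the dotted path assumes the pipeline class is named QsbkPipeline and lives in qsbk/pipelines.py.

```python
# Hypothetical additions to settings.py.

# Placeholder request headers; the original values are not shown in the post.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "User-Agent": "Mozilla/5.0 (placeholder; replace with a real browser string)",
}

# Enable the item pipeline. Pipelines with lower numbers run first;
# values are conventionally chosen in the 0-1000 range.
ITEM_PIPELINES = {
    "qsbk.pipelines.QsbkPipeline": 300,
}
```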

The spider itself, in the qsbk.spiders package:

# -*- coding: utf-8 -*-
import scrapy
from qsbk.items import QsbkItem


class GswspiderSpider(scrapy.Spider):
    # Name of the spider
    name = 'gswspider'
    # Allowed domains
    allowed_domains = ['gushiwen.org']
    # Start URL
    start_urls = ['']
    # Site root, prepended to the relative "next page" link
    domains = ''

    def parse(self, response):
        # Parse the page with XPath
        all_mjs = response.xpath('//div[@class="left"]//div[@class="sons"]//div[@class="cont"]')
        print(len(all_mjs))
        for mj in all_mjs:
            # get() turns the first matching selector into text (None if nothing matches)
            content = mj.xpath(".//a[1]/text()").get()
            auth = mj.xpath(".//a[2]/text()").get()
            item = QsbkItem(content=content, auth=auth)
            # yield turns parse() into a generator
            yield item
        # XPath of the "next page" link: //*[@id="frompage"]/div/a[1]
        next_url = response.xpath("//*[@id='frompage']/div/a[1]/@href").get()
        print("***********************")
        print(next_url)
        print("***********************")
        if not next_url:
            return
        # First query parameter of the link, compared against "p=21" below
        p = next_url.split("?")[1].split("&")[0]
        if p == "p=21":
            # Stop once the next page would be page 21
            return
        yield scrapy.Request(self.domains + next_url, callback=self.parse)
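The stop condition in parse() relies on the page parameter being the first item in the query string and breaks if the link has no "?" at all. As a hedged alternative, the page number could be read with urllib.parse instead. The helper names, the last_page default, and the example URLs in the usage note are assumptions for illustration, not taken from the site.

```python
from urllib.parse import parse_qs, urlsplit


def page_number(next_url):
    """Return the value of the `p` query parameter as an int, or None."""
    values = parse_qs(urlsplit(next_url).query).get("p")
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None


def should_follow(next_url, last_page=20):
    """Decide whether the spider should request next_url."""
    if not next_url:
        return False
    page = page_number(next_url)
    # Unlike the original check, a link without a usable `p` is not followed.
    return page is not None and page <= last_page
```

With a made-up link such as "/mingju/default.aspx?p=20&c=&t=", should_follow returns True, while the same link with p=21 returns False, matching the original cut-off after page 20.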
