Pipeline-based persistent storage in the Scrapy framework

Crawling data from every page of a site

How to send a POST request:

yield scrapy.FormRequest(url=new_url, callback=self.parse, formdata={})

The five core components (objects)

How to improve Scrapy's crawling efficiency

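Increase concurrency: by default Scrapy handles 16 concurrent requests, and this number can be raised as appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.

Lower the log level: Scrapy prints a large amount of log output while it runs. To reduce CPU usage, restrict the log output to INFO or ERROR. In the settings file, write: LOG_LEVEL = 'ERROR'

Disable cookies: if cookies are not actually needed, disabling them while crawling reduces CPU usage and improves crawling efficiency. In the settings file, write: COOKIES_ENABLED = False

Disable retries: re-sending failed HTTP requests (retrying) slows crawling down, so retries can be turned off. In the settings file, write: RETRY_ENABLED = False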
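Put together, the four tweaks could look like the following fragment of the project's settings.py. This is a sketch: the value 100 is the one used in the text above, and whether it suits a given target site is an assumption to check.

```python
# settings.py (fragment)

# Allow up to 100 concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS = 100

# Only log errors, which cuts down on logging overhead
LOG_LEVEL = "ERROR"

# Skip cookie handling when the site does not require cookies
COOKIES_ENABLED = False

# Do not re-send failed requests
RETRY_ENABLED = False
```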

Passing data between requests

Requirement: scrape the name and the synopsis.

Implementation flow

Implementation:

# -*- coding: utf-8 -*-
import scrapy


class MvspidersSpider(scrapy.Spider):
    name = 'mvspiders'
    # allowed_domains = ['']
    start_urls = ['']
    # URL template for the listing pages; it needs one %d placeholder for the page number
    url = ""
    pagenum = 1

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            a_href = li.xpath('./div/a/@href').extract_first()
            url = '' + a_href
            # Manually send a request for the detail page URL
            # Passing data between requests:
            # the meta argument is a dict, and that dict is handed to the callback
            yield scrapy.Request(url, callback=self.infoparse)
        # Crawl the remaining pages of the site
        if self.pagenum < 5:
            self.pagenum += 1
            new_url = self.url % self.pagenum
            # Recursive step: parse is also the callback for the next listing page
            yield scrapy.Request(new_url, callback=self.parse)

    def infoparse(self, response):
        title = response.xpath("/html/body/div[1]/div/div/div/div[2]/h1/text()").extract_first()
        content = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        # Assumed completion: hand both fields on to the item pipelines
        yield {'title': title, 'content': content}

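In items.py: this class declares the fields of the item. To store data in an item from the spider file above, the corresponding field name must first be added to this class. scrapy.Field() is simply an alias of the built-in dict class; it provides no extra methods or attributes and exists only to support the class-attribute-based item declaration syntax.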

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
import scrapy


class MvItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

Several ways of storing data in pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See:
import json

import pymysql
from redis import Redis


# Write the data to a text file
class DuanziproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    # Called once per item; the item argument is the Item object that was received
    def process_item(self, item, spider):
        # print(item)  # item behaves like a dict
        self.fp.write(item['title'] + ':' + item['content'] + '\n')
        return item  # hand the item on to the next pipeline class to be executed

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished!!!')


# Write the data to MySQL
class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='222', db='spider', charset='utf8')
        # Create the cursor once so that close_spider can always close it
        self.cursor = self.conn.cursor()
        print(self.conn)

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the values, which avoids
        # broken SQL when a title or synopsis contains quotes
        sql = 'insert into duanzi values (%s, %s)'
        try:
            self.cursor.execute(sql, (item['title'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


# Write the data to Redis
class RedisPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        # Redis list values must be strings, bytes or numbers,
        # so the item is serialized to JSON before it is pushed
        self.conn.lpush('duanzidata', json.dumps(dict(item), ensure_ascii=False))
        return item
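The pipeline classes only run once they are registered in the ITEM_PIPELINES setting, as the header comment in pipelines.py points out. A possible settings.py fragment is sketched below. The package name myproject is hypothetical, since the project name is not given, and the class names are assumed to be DuanziproPipeline, MysqlPipeline and RedisPipeline.

```python
# settings.py (fragment)

# Lower numbers run first; each pipeline receives the item returned
# by the pipeline that ran before it.
ITEM_PIPELINES = {
    "myproject.pipelines.DuanziproPipeline": 300,  # text file
    "myproject.pipelines.MysqlPipeline": 301,      # MySQL
    "myproject.pipelines.RedisPipeline": 302,      # Redis
}
```

Because every process_item returns the item, all three stores receive the same data in the order given by these numbers.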
