Scrapy Framework Usage Notes

These notes record basic Scrapy usage; they do not explain the framework's internals.

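Create a project: scrapy startproject ***

Enter the project directory: cd ***

Create a spider: scrapy genspider *** (spider name) ***.com (domain to crawl)

Export to a file: scrapy crawl *** -o ***.json (runs the spider and writes the scraped items to a file of the given type)

Run a spider: scrapy crawl ***

List all spiders: scrapy list

Show configuration values: scrapy settings [options]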

(1) [Command-line input]

> scrapy startproject weather

> cd weather

> scrapy genspider tianqi tianqihoubao.com

(2) [Initial project file structure and notes]

File structure:

weather
--|| weather
     --|| spiders
          --|| tianqi.py
     --|| items.py
     --|| middlewares.py
     --|| pipelines.py
     --|| settings.py
--|| scrapy.cfg

Key code in each file:

tianqi.py:

import scrapy
from weather.items import weatheritem
class tianqispider(scrapy.Spider):
    name = 'tianqi'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['']  # first URL(s) the spider requests
    def parse(self, response):  # parse the downloaded page
        infoselector = response.xpath('')  # use a selector: xpath() / css()
        info = infoselector
        item = weatheritem(info=info)
        yield item  # hand the item over to the item pipeline
        next_url = ''
        if next_url:
            yield scrapy.Request(url=next_url, callback=self.parse)  # crawl the new URL
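scrapy.Request rejects a URL that has no scheme, so next_url has to be absolute before it is yielded. The following is a minimal sketch of resolving a relative href against the page it was found on, using only the standard library. The host example.com, the sample hrefs, and the helper name absolute_url are placeholders assumed for illustration, not values from this project. Inside a spider callback, response.urljoin(href) resolves against the response URL in the same way.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href taken from a page against that page's own URL."""
    return urljoin(page_url, href)


if __name__ == '__main__':
    # Placeholder page URL; the real site address is not assumed here.
    page = 'http://example.com/weather/province.aspx?id=360000'
    # A path-relative href is resolved inside the page's directory.
    print(absolute_url(page, 'top/nanchang.html'))
    # A root-relative href replaces the whole path.
    print(absolute_url(page, '/lishi/nanchang.html'))
```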

items.py:

import scrapy
class weatheritem(scrapy.Item):  # wraps the data sent from tianqi.py; field names must match the ones the spider sets on the item
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

pipelines.py:

# Persists the items to storage.
# The item pipeline must also be enabled in settings.py:
# ITEM_PIPELINES =
class weatherpipeline(object):
    def process_item(self, item, spider):
        # put the data-cleaning / file-saving code here
        return item
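The value assigned to ITEM_PIPELINES is not given in these notes. A minimal sketch of what the settings.py entry might look like, assuming the project package is weather and the class is weatherpipeline in pipelines.py; both the dotted path and the priority 300 are assumptions:

```python
# settings.py (sketch): register the pipeline so Scrapy calls its process_item().
# Keys are dotted import paths to pipeline classes; values are priorities
# (conventionally 0-1000), and pipelines with lower numbers run first.
# The path and the number below are assumed, not taken from the original notes.
ITEM_PIPELINES = {
    'weather.pipelines.weatherpipeline': 300,
}
```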

(3) Final complete project source code:

Project goal: crawl the past month's weather for each locality in Jiangxi and save it in JSON format.

tianqi.py:

import re
import scrapy
from weather.items import weatheritem
class tianqispider(scrapy.Spider):
    name = 'tianqi'
    allowed_domains = ['tianqihoubao.com']
    # First URL the spider requests. The scheme and host are missing from this
    # string as recorded here; Scrapy needs the full absolute URL.
    start_urls = ['weather/province.aspx?id=360000']
    def parse(self, response):  # parse step 1: the province page listing each locality
        names = response.xpath('//tr/td/a/text()').extract()  # use a selector: xpath() / css()
        urls = response.xpath('//tr/td/a/@href').extract()
        for i in range(0, len(names)):
            place = names[i]
            next_url = 'weather/' + urls[i]  # the URL prefix is incomplete here as well
            if next_url:
                yield scrapy.Request(url=next_url, callback=self.parse_detail)  # crawl the new URL
    def parse_detail(self, response):  # parse step 2: the weather table of one locality
        place = response.xpath('//table/tr[3]/td[1]/b/text()').extract_first()
        print("******************************==start crawling [", place, "] weather******************************===")
        tr = response.xpath('//table/tr')
        weatherlist = []
        for i in range(2, len(tr)):
            wea = {}
            wea['date'] = tr[i].xpath('./td[2]/b/a/text()').extract_first()
            s = tr[i].xpath('./td[3]/text()').extract_first()
            pattern = re.compile(r'([\u4e00-\u9fa5])')  # weather type: Chinese characters
            try:
                match = pattern.search(s)
                wea['type'] = match.group(1)
            except Exception:  # s is None or contains no match
                print("***************=!!!!!!!!!!!!!\n", s, "!!!!!!!!!!!!!!***************\n", i)
            wea['wind'] = tr[i].xpath('./td[4]/text()').extract_first()
            high_temp = tr[i].xpath('./td[5]/text()').extract_first()
            s = tr[i].xpath('./td[8]/text()').extract_first()
            pattern = re.compile(r'(-?\d*℃)')  # low temperature, e.g. -3℃
            low_temp = "-999℃"  # sentinel used when no temperature can be extracted
            try:
                match = pattern.search(s)
                low_temp = match.group(1)
            except Exception:
                print("***************=!!!!!!!!!!!!!\n", s, "!!!!!!!!!!!!!!**********==\n", i)
            wea['temperature'] = low_temp + "-" + high_temp
            weatherlist.append(wea)  # collect the row; without this the list stays empty
        print("****************************************√√√success!***********************************=")
        item = weatheritem()
        info = {}
        info['name'] = place
        info['weather'] = weatherlist
        item['info'] = info
        yield item
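The two regular expressions in parse_detail can be tried out without running the spider. The sketch below reuses the same patterns; the helper name first_group and the sample cell strings are assumptions made for illustration, since the real cell contents depend on the site's HTML. It checks for None explicitly instead of catching a broad Exception.

```python
import re

# Same patterns as in parse_detail.
TYPE_RE = re.compile(r'([\u4e00-\u9fa5])')
TEMP_RE = re.compile(r'(-?\d*℃)')


def first_group(pattern, text, default=None):
    """Return group(1) of the first match, or default if text is None or nothing matches."""
    if text is None:
        return default
    match = pattern.search(text)
    return match.group(1) if match else default


if __name__ == '__main__':
    # Hypothetical cell contents, padded with whitespace as table cells often are.
    print(first_group(TEMP_RE, '\r\n   -3℃\r\n', '-999℃'))   # -> -3℃
    print(first_group(TEMP_RE, None, '-999℃'))               # -> -999℃
    print(first_group(TYPE_RE, '\r\n   多雲 /小雨'))           # -> 多
```

As written, the weather-type pattern has no quantifier, so it captures only the first Chinese character of the cell. If the whole word is wanted, adding `+` after the character class would be one way to get it; whether that matches the original intent is an assumption.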

items.py:

import scrapy
class weatheritem(scrapy.Item):  # wraps the data sent from tianqi.py; field names must match the ones the spider sets on the item
    # define the fields for your item here like:
    info = scrapy.Field()

pipelines.py:

import json
# Persists the items to storage.
class weatherpipeline(object):
    def process_item(self, item, spider):
        # data cleaning / file saving: append each item's info to tianqi.json
        with open('tianqi.json', 'a+', encoding='utf-8') as fp:
            json.dump(item['info'], fp=fp, skipkeys=True, indent=4, ensure_ascii=False)
        return item
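Because the file is opened in append mode and json.dump runs once per item, tianqi.json ends up holding several JSON objects back to back, which json.load cannot read as a single document. One possible alternative is to write one object per line (JSON Lines). This is a sketch under that assumption; the class name jsonlinespipeline, the helper read_items, and the file name tianqi.jsonl are hypothetical and not part of the original project.

```python
import json
import os
import tempfile


class jsonlinespipeline(object):
    """Hypothetical variant of weatherpipeline that writes one JSON object per line."""

    def __init__(self, path='tianqi.jsonl'):
        self.path = path

    def process_item(self, item, spider):
        # One compact JSON object per line keeps the file parseable line by line.
        with open(self.path, 'a', encoding='utf-8') as fp:
            fp.write(json.dumps(item['info'], ensure_ascii=False) + '\n')
        return item


def read_items(path):
    """Load every non-empty line of a JSON Lines file back into Python objects."""
    with open(path, encoding='utf-8') as fp:
        return [json.loads(line) for line in fp if line.strip()]


if __name__ == '__main__':
    # Plain dicts stand in for Scrapy items; both support item['info'].
    demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl')
    pipeline = jsonlinespipeline(demo_path)
    pipeline.process_item({'info': {'name': '南昌', 'weather': []}}, spider=None)
    pipeline.process_item({'info': {'name': '九江', 'weather': []}}, spider=None)
    print(read_items(demo_path))
```

Scrapy creates pipeline objects without arguments, so the default path is what would be used inside a project; the path parameter is there only to make the sketch easy to try in isolation.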

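In short, Scrapy is fairly easy to pick up: once the project is created, you only need to write code in three files (the spider, items.py and pipelines.py). The only relatively hard part is writing the selectors.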
