Open the cmd command line and switch to the D drive:
#cmd
d:
Create the article folder and generate the project and spider:
mkdir article
scrapy startproject article
scrapy genspider xinwen www.hbskzy.cn
# the command takes the spider name followed by the domain
# the spider name must not be the same as the project name
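For reference, scrapy genspider builds the spider from Scrapy's default "basic" template; with the names above the generated file should look roughly like this (a sketch of the stock template, not code from the original post):

import scrapy

class XinwenSpider(scrapy.Spider):
    name = 'xinwen'
    allowed_domains = ['www.hbskzy.cn']
    start_urls = ['http://www.hbskzy.cn/']

    def parse(self, response):
        # parsing logic goes here
        pass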
items file:
# Define here the models for your scraped items
#
# See documentation in:
import scrapy

class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
spider file (note: this example comes from a different project, shuichan, than the article project created above):
import scrapy
from shuichan.items import ShuichanItem
from scrapy import Request

class YuSpider(scrapy.Spider):
    name = 'yu'
    allowed_domains = ['bbs.liyang-tech.com']

    def start_requests(self):
        # the URL format string was lost from the original post;
        # it should be the forum's paginated list URL with a page-number slot
        urls = ['' % (i) for i in range(1, 20)]
        for i in urls:
            yield Request(url=i, callback=self.next_parse)

    def next_parse(self, response):
        www = ''  # base URL; the string was lost from the original post
        item = ShuichanItem()
        title = response.xpath('//*/tr/th/a[2]/text()')[3:].extract()
        href = response.xpath('//*/tr/th/a[2]/@href')[3:].extract()
        for i in range(len(title)):
            item['title'] = title[i]
            item['link'] = href[i]
            yield item
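The @href values extracted here are typically relative links, which is presumably what the (now lost) www base-URL variable was for. A minimal sketch of joining them into absolute links with Scrapy's built-in response.urljoin instead of manual string concatenation:

# inside next_parse(): resolve each relative href against the page URL
for i in range(len(title)):
    item['title'] = title[i]
    item['link'] = response.urljoin(href[i])
    yield item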
pipelines file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See:
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class ArticlePipeline:
    def process_item(self, item, spider):
        return item
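As the template comment says, the pipeline only runs after it is registered in settings.py. A minimal sketch, assuming the default module layout created by scrapy startproject article:

# settings.py: enable the pipeline; the value (0-1000) orders
# multiple pipelines when more than one is enabled
ITEM_PIPELINES = {
    'article.pipelines.ArticlePipeline': 300,
}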
settings file:
# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#
#
BOT_NAME = 'article'

SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'

FEED_FORMAT = 'csv'    # optional
FEED_URI = 'filename.csv'    # optional

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
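Note that FEED_FORMAT and FEED_URI are the pre-2.1 way of configuring exports; Scrapy 2.1 and later deprecate them in favour of a single FEEDS dict. A sketch of the equivalent modern setting (same output file, just the newer syntax):

# settings.py, Scrapy >= 2.1: map each output URI to its export options
FEEDS = {
    'filename.csv': {'format': 'csv'},
}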
After writing everything, carefully check each file for errors.
Then start the spider from the cmd command line:
cd /d d:/article/article
scrapy crawl xinwen
The page-analysis process is not covered here; test and debug the spider in advance, otherwise you are likely to run into error messages.
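Alternatively, the export file can be requested straight from the command line with crawl's -o option (a standard Scrapy flag; the file name here is only an example), in which case FEED_FORMAT/FEED_URI need not be set at all:

scrapy crawl xinwen -o result.csv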
When it finishes, a CSV file is generated in the article directory; open it to inspect the results.
Tip: writing empty values will also raise an error; you can add an if check in the pipeline file:
if 'key' in item:
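A minimal sketch of that guard inside the pipeline (the field names follow the ArticleItem defined above; DropItem is Scrapy's standard exception for discarding an item):

from scrapy.exceptions import DropItem

class ArticlePipeline:
    def process_item(self, item, spider):
        # keep the item only if both fields were actually filled in
        if 'title' in item and 'link' in item:
            return item
        raise DropItem('missing field in %s' % item)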