A Simple Scrapy Crawler Example

2021-10-09 08:37:18 · 3,192 characters · 5,653 views

Open a cmd prompt, switch to drive D:, create an `article` folder, then generate the project and the spider:

```shell
d:
mkdir article
cd article
scrapy startproject article
scrapy genspider xinwen www.hbskzy.cn
```

The `genspider` command takes the spider name followed by the domain; the spider name must not be the same as the project name.
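For orientation, `scrapy startproject article` generates the standard Scrapy project skeleton, and `genspider` then drops the spider module into `spiders/`:

```
article/
├── scrapy.cfg            # deploy configuration
└── article/              # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── xinwen.py     # created by scrapy genspider
```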

The items file:

```python
# Define here the models for your scraped items
# See documentation in the Scrapy docs
import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
```

The spider file (note: this example spider comes from a different project, `shuichan`, crawling `bbs.liyang-tech.com`, so its import path differs from the `article` project created above):

```python
import scrapy
from shuichan.items import ShuichanItem
from scrapy import Request


class YuSpider(scrapy.Spider):
    name = 'yu'
    allowed_domains = ['bbs.liyang-tech.com']

    def start_requests(self):
        # the URL template was elided in the original post
        urls = ['' % (i) for i in range(1, 20)]
        for i in urls:
            yield Request(url=i, callback=self.next_parse)

    def next_parse(self, response):
        www = ''  # base URL prefix, elided in the original post
        item = ShuichanItem()
        title = response.xpath('//*/tr/th/a[2]/text()')[3:].extract()
        href = response.xpath('//*/tr/th/a[2]/@href')[3:].extract()
        for i in range(len(title)):
            item['title'] = title[i]
            item['link'] = href[i]
            yield item
```
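The loop at the end pairs each extracted title with the link at the same index and yields one item per pair. The pairing logic can be sketched in plain Python without Scrapy (the sample titles and links below are hypothetical):

```python
# Pair each extracted title with its link. In the spider, the [3:] slice
# has already dropped the first few header/sticky rows before this point.
titles = ['post A', 'post B', 'post C']
hrefs = ['thread.php?id=1', 'thread.php?id=2', 'thread.php?id=3']

items = []
for title, link in zip(titles, hrefs):  # zip pairs the lists index by index
    items.append({'title': title, 'link': link})

print(items[0])  # {'title': 'post A', 'link': 'thread.php?id=1'}
```

Using `zip` instead of `range(len(title))` also protects against the two lists having different lengths: iteration stops at the shorter one.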

The pipelines file:

```python
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# Useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ArticlePipeline:
    def process_item(self, item, spider):
        return item
```

The settings file:

```python
# Scrapy settings for article project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings by consulting the documentation.

BOT_NAME = 'article'

SPIDER_MODULES = ['article.spiders']
NEWSPIDER_MODULE = 'article.spiders'

FEED_FORMAT = 'csv'        # optional
FEED_URI = 'filename.csv'  # optional

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'article (+'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```
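With `FEED_FORMAT = 'csv'`, Scrapy writes each yielded item as one CSV row, using the item's field names as the header. The equivalent serialization can be sketched with the stdlib `csv` module (the sample items are hypothetical stand-ins for what the spider yields):

```python
import csv
import io

# Items as the spider would yield them (hypothetical sample data)
items = [
    {'title': 'post A', 'link': 'thread.php?id=1'},
    {'title': 'post B', 'link': 'thread.php?id=2'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'link'])
writer.writeheader()     # header row: title,link
writer.writerows(items)  # one row per item

print(buf.getvalue())
```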

When you have finished writing the code, check each file carefully for errors.

Then start the crawler. Open a cmd prompt and enter:

```shell
cd /d d:/article/article
scrapy crawl xinwen
```
The page-analysis process is not covered here; test and debug the spider in advance, otherwise it is easy to run into errors.

When the crawl finishes, a CSV file is generated under the article directory.

[Screenshot of the opened CSV file]

Tip: writing empty values will also raise an error; you can add an if check in the pipeline file:

```python
if 'key' in item:
```
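A minimal sketch of that guard, with plain dicts standing in for items so it runs without Scrapy (in a real pipeline you would raise `scrapy.exceptions.DropItem` instead of returning `None`; the field names are taken from the item definition above):

```python
# Sketch of the empty-value guard from the tip above.
def process_item(item):
    # Keep the item only if both fields exist and are non-empty
    if item.get('title') and item.get('link'):
        return item
    return None  # stand-in for raising DropItem


print(process_item({'title': 'post A', 'link': 'thread.php?id=1'}))  # passes through
print(process_item({'title': '', 'link': 'thread.php?id=2'}))        # dropped -> None
```

`item.get('title')` is falsy both when the key is missing and when the value is an empty string, so one check covers both failure modes.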
