Advanced Web Crawling: The Scrapy Framework

Notes on the places where it is easy to go wrong.

from movie.items import movieitem
# pattern: from folder.file_name import class_name

import scrapy
from ..items import movieitem


class shuichanspider(scrapy.Spider):
    name = 'shuichan'
    allowed_domains = ['bbs.liyang-tech.com']
    start_urls = ['']

    def parse(self, response):
        # build the query suffixes for pages 1 through 50
        urls = ['&page=%s' % (i) for i in range(1, 51)]
        for i in urls:
            # pass the method name without parentheses
            yield response.follow(i, self.parse_title)

    def parse_title(self, response):
        item = movieitem()
        txt = response.xpath('//*/tr/th/a[2]/text()').extract()
        for i in txt:
            item['title'] = i
            yield item

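The second argument to response.follow is the callback method itself. Pass it without parentheses (self.parse_title, not self.parse_title()); with parentheses the method is called on the spot and an error is raised.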
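The difference between passing a method and calling it can be seen without Scrapy at all. The sketch below is only an illustration: DemoSpider and follow are hypothetical stand-ins written for this note, not Scrapy classes or functions.

```python
class DemoSpider:
    """Hypothetical stand-in for a spider; not a real Scrapy class."""

    def parse_title(self, response):
        return 'parsed ' + response


def follow(url, callback):
    """Hypothetical stand-in for response.follow: call the callback
    later, once a (fake) response for the URL is available."""
    return callback('response for ' + url)


spider = DemoSpider()

# Correct: hand over the bound method itself.
result = follow('&page=1', spider.parse_title)

# Wrong: the parentheses call parse_title immediately, with no response,
# so Python raises TypeError before follow() is even entered.
try:
    follow('&page=1', spider.parse_title())
    error = None
except TypeError as exc:
    error = exc
```

In the wrong variant the error comes from Python itself, because parse_title is invoked without its required response argument.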

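We often use Scrapy's built-in XPath support to parse a page and pull out the information we want. The return value of response.xpath(), however, is a list of Selector objects rather than strings, so it cannot be combined with str objects directly. Call extract() first to turn the selectors into a list of Unicode strings; after that, ordinary str operations work.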

Once parsing is done, be sure to return (yield) the data; otherwise nothing comes out at all.

First instantiate the class defined in the items file, then return the parsed data at the end.

item = movieitem()

Finally, return the item (yield item).
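One related pitfall: the spider above creates a single item and overwrites its title on every loop iteration. Whether that matters depends on how the items are consumed, but if anything keeps references to the yielded objects, every reference ends up showing the last title. The sketch below uses plain dicts as a stand-in for the item class, so it runs without Scrapy; parse_shared and parse_fresh are hypothetical names.

```python
def parse_shared(titles):
    item = {}  # one object reused for every title
    for t in titles:
        item['title'] = t
        yield item


def parse_fresh(titles):
    for t in titles:
        item = {}  # a new object for each title
        item['title'] = t
        yield item


titles = ['a', 'b', 'c']

# Collecting the results keeps references to the yielded objects.
shared = list(parse_shared(titles))
fresh = list(parse_fresh(titles))

# shared holds the same dict three times, so every entry shows 'c';
# fresh holds three separate dicts with 'a', 'b' and 'c'.
```

Instantiating the item inside the loop is the safer habit when each title should become its own record.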

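settings.py is the configuration file for the whole project. It controls things such as the number of concurrent requests, the download delay, the output format, and the default headers. For this project we can write settings such as: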

BOT_NAME =
SPIDER_MODULES =
NEWSPIDER_MODULE =
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Everything above is generated automatically; our own settings start here
# Pipelines to use
ITEM_PIPELINES =
FEED_FORMAT = 'csv'  # format of the final output file
FEED_URI =  # name of the final output file
# To avoid putting too much load on the site being crawled, enable
# auto-throttling and set its target concurrency to 5
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 5
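On Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favour of a single FEEDS dictionary. A sketch of the equivalent settings might look like this; the file name items.csv is an assumption made for the example, not a value from the original project.

```python
# settings.py (fragment)

# 'items.csv' is a hypothetical output file name chosen for this sketch.
FEEDS = {
    'items.csv': {
        'format': 'csv',      # must be a format Scrapy has an exporter for
        'encoding': 'utf8',
    },
}

# Auto-throttling: adjust delays so that, on average, about 5 requests
# are sent in parallel to the remote site.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 5.0
```

Note that AUTOTHROTTLE_TARGET_CONCURRENCY is a target average rather than a hard cap.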

Always watch the capitalization and spelling of setting names and values.

For example, setting the output file format to txt is a mistake.

Scrapy has no built-in exporter for that format; csv works.
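If plain-text output is really needed, one possible workaround is a small item pipeline that writes the file itself. The sketch below is an assumption-laden illustration, not part of the original project: TxtPipeline and titles.txt are hypothetical names, and it relies only on the usual pipeline methods open_spider, process_item, and close_spider.

```python
class TxtPipeline:
    """Hypothetical pipeline that writes one title per line to a text file."""

    def __init__(self, path='titles.txt'):
        # 'titles.txt' is an assumed default file name.
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(item['title'] + '\n')
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To enable it, the class would be registered in ITEM_PIPELINES, for example as {'movie.pipelines.TxtPipeline': 300}, assuming it lives in the project's pipelines module.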
