Python Crawlers: Incremental, General-Purpose, and Focused Crawlers

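General-purpose crawler

Focused crawler

Incremental crawler

General-purpose and focused crawlers were covered in earlier blog posts; this post concentrates on incremental crawlers.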

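An incremental crawler suits data from ** that you track continuously.

For example, suppose all of the data was crawled three months ago and ** has since added 100 records. Only those 100 new records need to be crawled now.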

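Specify the start URL and use CrawlSpider to obtain the pagination URLs.

Request the pagination URLs matched by the Rule.

From each pagination page, parse out the detail-page URLs it lists.

(Core step) Check whether each detail URL has been visited before.

Record the URL of every detail page that is crawled.

The URLs are kept in a Redis set: SADD returns 0 if the URL is already in the set and 1 if it was newly added.

Send a request to each new detail page and parse the data.

Persist the results.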
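The steps above can be sketched without Scrapy or Redis. This is a minimal illustration, assuming an in-memory Python set stands in for the Redis set; `sadd` and `select_new_urls` are hypothetical helper names, and the example.com URLs are placeholders.

```python
def sadd(seen, member):
    """Mimic Redis SADD for one member: 1 if newly added, 0 if already present."""
    if member in seen:
        return 0
    seen.add(member)
    return 1


def select_new_urls(detail_urls, seen):
    """Return only the detail URLs that have not been recorded before."""
    return [url for url in detail_urls if sadd(seen, url) == 1]


# Stand-in for the Redis set. Redis would keep it between runs of the crawler.
seen_urls = set()

# First crawl: every detail URL is new, so all of them are selected.
first_run = select_new_urls(
    [
        "https://example.com/detail/1.html",
        "https://example.com/detail/2.html",
    ],
    seen_urls,
)

# Later crawl: one page was added, so only that page is selected.
second_run = select_new_urls(
    [
        "https://example.com/detail/1.html",
        "https://example.com/detail/2.html",
        "https://example.com/detail/3.html",
    ],
    seen_urls,
)

print(first_run)
print(second_run)
```

Because the check and the insert happen in one call, a URL can never be selected twice, which is the property the spider relies on when it uses the return value of `sadd`.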

scrapy startproject tvpro
cd tvpro

scrapy genspider -t crawl tv www.***.com

settings.py

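BOT_NAME = 'tvpro'

SPIDER_MODULES = ['tvpro.spiders']
NEWSPIDER_MODULE = 'tvpro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = ''  # value not given in the source; supply your own User-Agent string

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Only log errors
LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {}  # value not given in the source; register the pipeline class from pipelines.py here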

items.py

import scrapy


class TvproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    desc = scrapy.Field()

pipelines.py

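import json

from itemadapter import ItemAdapter


class FbsproPipeline:
    conn = None

    def open_spider(self, spider):
        # Reuse the Redis connection created on the spider
        self.conn = spider.conn

    def process_item(self, item, spider):
        # The right-hand side is missing in the source; converting the item
        # to a plain dict here is an assumption.
        dic = ItemAdapter(item).asdict()
        # Redis list values must be strings or bytes, so serialize the dict
        self.conn.lpush('tvdata', json.dumps(dic, ensure_ascii=False))
        return item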

tv.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from tvpro.items import TvproItem


class TvSpider(CrawlSpider):
    name = 'tv'
    # allowed_domains = ['www.***.com']
    start_urls = ['']

    rules = (
        # Follow every pagination link and hand each page to parse_item
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'),
             callback='parse_item', follow=True),
    )

    # Shared Redis connection, also used by the pipeline
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            detail_url = '' + \
                li.xpath('./div[1]/a/@href').extract_first()
            print(detail_url)
            # sadd returns 1 if the URL was newly added, 0 if it was already in the set
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('not crawled yet')
                yield scrapy.Request(url=detail_url, callback=self.parse_datail)
            else:
                print('already crawled')

    def parse_datail(self, response):
        item = TvproItem()
        name = response.xpath(
            '/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        desc = response.xpath(
            '/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['name'] = name
        item['desc'] = desc
        yield item
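The spider builds each detail URL by prepending a site prefix to the extracted href. The same step can be sketched with the standard library's `urljoin`, which also copes with hrefs that are already absolute. The base URL and paths below are placeholders, not the site used in this post, and `absolute_url` is a hypothetical helper name. Inside a Scrapy callback, `response.urljoin(href)` resolves an href against the current page in the same way.

```python
from urllib.parse import urljoin

# Placeholder listing page; stands in for whichever page the href was found on.
BASE_URL = "https://example.com/frim/index1-2.html"


def absolute_url(href, base=BASE_URL):
    """Resolve an href from a listing page into an absolute detail URL."""
    return urljoin(base, href)


# A root-relative href is resolved against the scheme and host of the base.
relative = absolute_url("/detail/42.html")

# An href that is already absolute is returned unchanged.
already_absolute = absolute_url("https://example.com/detail/7.html")

print(relative)
print(already_absolute)
```

Resolving URLs to one canonical absolute form before calling `sadd` also helps the de-duplication, since the set compares strings exactly.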

Deciding whether a URL has already been crawled is the core of an incremental crawler:

ex = self.conn.sadd('urls', detail_url)
if ex == 1:
    print('not crawled yet')
    yield scrapy.Request(url=detail_url, callback=self.parse_datail)
else:
    print('already crawled')
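One consequence of calling `sadd` before the request is sent: the URL is recorded even if the detail request later fails, so later runs will skip it. A possible variant, sketched here under assumptions, checks membership first and records the URL only after the page has been fetched. `SeenStore`, `crawl_once`, and `make_fetch` are hypothetical names and the store is an in-memory set; with Redis the two operations would correspond to the SISMEMBER and SADD commands.

```python
class SeenStore:
    """In-memory stand-in for a Redis set of crawled URLs."""

    def __init__(self):
        self._urls = set()

    def is_seen(self, url):
        return url in self._urls

    def mark_seen(self, url):
        self._urls.add(url)


def crawl_once(urls, store, fetch):
    """Fetch unseen URLs; mark a URL as seen only when the fetch succeeds."""
    results = []
    for url in urls:
        if store.is_seen(url):
            continue
        try:
            data = fetch(url)
        except IOError:
            # Leave the URL unmarked so that the next run retries it.
            continue
        store.mark_seen(url)
        results.append(data)
    return results


def make_fetch(failing):
    """Build a fake fetch function that fails for the given URLs."""
    def fetch(url):
        if url in failing:
            raise IOError("simulated network failure for " + url)
        return "page:" + url
    return fetch


store = SeenStore()
urls = ["https://example.com/a", "https://example.com/b"]

# Run 1: the request for /b fails, so only /a is recorded.
run1 = crawl_once(urls, store, make_fetch({"https://example.com/b"}))

# Run 2: /a is skipped and /b is retried successfully.
run2 = crawl_once(urls, store, make_fetch(set()))

print(run1)
print(run2)
```

The trade-off is that check-then-mark is no longer a single atomic step, so two concurrent requests for the same URL could both pass the check. The single `sadd` call used in the spider avoids that at the cost of not retrying failed pages.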

Run the project

scrapy crawl tv
