Scrapy Crawler Notes

The first Scrapy program

First, enter the following in cmd:

scrapy startproject ***

*** is the name of our first project.

Create a new Python file under the spiders folder:

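import scrapy


class FirstSpider(scrapy.Spider):
    name = "first_spider"  # spider name, used with `scrapy crawl`
    start_urls = [
        ''  # URL to crawl
    ]

    def parse(self, response):
        famous = response.css('div.quote')  # select every quote block on the page
        for f in famous:
            content = f.css('.text::text').extract_first()  # text of this quote
            author = f.css('.author::text').extract_first()  # author of this quote
            tags = f.css('.tags .tag::text').extract()  # a quote has several tags, so take them all
            tags = ','.join(tags)  # list -> comma-separated string
            # File output: append each quote to a file named after its author
            filename = '%s-quotes.txt' % author
            with open(filename, "a+", encoding="utf-8") as file:
                file.write(content)
                file.write('\n')  # newline
                file.write('Tags: ' + tags)
                file.write('\n------------------------------------\n')
        # Follow the link to the next page, if there is one.
        # The selector below is an assumption; the notes do not show how next_page is obtained.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)  # relative path -> absolute URL
            yield scrapy.Request(next_page, callback=self.parse)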
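
The relative-to-absolute step in the pagination code can be tried on its own, without running a crawl. As far as I know, response.urljoin resolves a link against the page's own URL in the same way the standard library's urljoin does, so this sketch calls urllib.parse.urljoin directly. The page URL and the hrefs are made-up values for illustration, and absolute_link is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Hypothetical page URL; it stands in for response.url
page_url = "http://example.com/page/1/"


def absolute_link(base, href):
    """Resolve an href found on a page against that page's URL."""
    return urljoin(base, href)


# A root-relative link replaces the whole path
print(absolute_link(page_url, "/page/2/"))
# A bare file name is resolved inside the current directory
print(absolute_link(page_url, "next.html"))
# An already absolute URL is returned unchanged
print(absolute_link(page_url, "http://other.example.com/a"))
```

This is why the spider can pass whatever href it finds straight to urljoin before building the next request.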

To run the Scrapy program, open cmd in the project directory and run:

scrapy crawl ***    (*** is the spider name, e.g. first_spider)

To download images, first make the pipeline class inherit from Scrapy's ImagesPipeline and override its get_media_requests(self, item, info) method:

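from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class FirstSpiderPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # One download request per image URL; the title travels in meta for file_path
        for imgurl in item['imgurl']:
            yield Request(imgurl, meta={'name': item['imgname']})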

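Then add the fields to the items file:

imgurl = scrapy.Field()   # list of image URLs found on the page
imgname = scrapy.Field()  # post title; the spider also fills this field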

Create a new file under the spiders folder; this is where the spider itself goes:

import scrapy
from ..items import ImageSpiderItem


class ImageSpider(scrapy.Spider):
    name = 'imgspider'
    # start_urls must be absolute URLs; the host part is missing from these notes
    start_urls = [
        '/archives/57.html',
        '/archives/55.html',
    ]

    def parse(self, response):
        item = ImageSpiderItem()  # instantiate the item
        imgurls = response.css(".post-content img::attr(src)").extract()  # collect the image URLs
        item['imgurl'] = imgurls
        imgnames = response.css(".post-title a::text").extract_first()  # post title
        item['imgname'] = imgnames
        yield item

# settings.py
# Storage location
IMAGES_STORE = r'd:\imagespider'
ITEM_PIPELINES =
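
The ITEM_PIPELINES setting maps a pipeline class's import path to a priority number. Here is a minimal sketch of the two settings together, assuming a hypothetical project package named myproject (the notes only give the project name as ***) and the pipeline class living in its pipelines module. The value 300 is just a commonly used priority, not something taken from the notes.

```python
# settings.py (sketch)

# Folder where downloaded images are stored; a raw string keeps the backslash literal
IMAGES_STORE = r'd:\imagespider'

# 'myproject' is a hypothetical package name; substitute the real project name.
# Lower numbers run earlier; values are conventionally in the 0-1000 range.
ITEM_PIPELINES = {
    'myproject.pipelines.FirstSpiderPipeline': 300,
}
```

Note that Scrapy's image pipeline also needs the Pillow package installed to process the downloaded images.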

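Finally, run the spider, and the images are saved to the storage location. To control the file names, also override file_path in the pipeline class: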

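import re

# Method of the ImagesPipeline subclass
def file_path(self, request, response=None, info=None, *, item=None):
    # Rename the file. Without this override the name is a hash, i.e. an unreadable string.
    image_guid = request.url.split('/')[-1]  # last segment of the URL becomes the file name
    # Receive the name passed in through meta by get_media_requests
    name = request.meta['name']
    name = re.sub(r'[?\\*|"<>:/]', '', name)  # strip characters not allowed in Windows paths
    # Key to saving into separate folders: {0} is name, {1} is image_guid
    filename = u'{0}/{1}'.format(name, image_guid)
    return filename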
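
The naming logic in file_path can be checked outside of Scrapy. This sketch pulls the same steps into a plain function; build_image_path is a hypothetical helper name, and the title and URL are made-up sample values.

```python
import re


def build_image_path(name, url):
    """Return 'folder/file' for an image, following the file_path logic above."""
    image_guid = url.split('/')[-1]              # last URL segment is the file name
    folder = re.sub(r'[?\\*|"<>:/]', '', name)   # drop characters Windows forbids in names
    return '{0}/{1}'.format(folder, image_guid)


# Sample values for illustration only
print(build_image_path('Cats: "best of"?', 'http://example.com/img/001.jpg'))
# prints: Cats best of/001.jpg
```

Because the folder part comes from the post title, all images from one post end up in the same subfolder of the storage location.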
