Python crawling: scraping Douban movies with Scrapy (practice)

Development environment: Windows + PyCharm + MongoDB + Scrapy

Goal: scrape the Douban Movie Top 250 list and store the data in MongoDB.

items.py file

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
import scrapy


class DoubanItem(scrapy.Item):
    # Movie title
    title = scrapy.Field()
    # Basic information
    bd = scrapy.Field()
    # Rating
    star = scrapy.Field()
    # One-line summary
    quote = scrapy.Field()

Spider file (in the spiders directory)

# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class DoubanTopSpider(scrapy.Spider):
    name = "doubantop"
    allowed_domains = ["movie.douban.com"]
    offset = 0
    # Base URL of the Top 250 list (left blank in the source); the page offset is appended to it
    url = ""
    start_urls = (
        url + str(offset),
    )

    def parse(self, response):
        movies = response.xpath('//div[@class="info"]')
        for each in movies:
            # Create a fresh item per movie so a missing quote is not inherited from the previous one
            item = DoubanItem()
            # Movie title
            item['title'] = each.xpath('.//span[@class="title"][1]/text()').extract()[0]
            # Basic information
            item['bd'] = each.xpath('.//div[@class="bd"]/p/text()').extract()[0]
            # Rating
            item['star'] = each.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # One-line summary (not every movie has one)
            quote = each.xpath('.//p[@class="quote"]/span/text()').extract()
            if len(quote) != 0:
                item['quote'] = quote[0]
            yield item

        # After the whole page is parsed, request the next one until offset 225 has been fetched
        if self.offset < 225:
            self.offset += 25
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
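
The paging logic in parse() can be checked on its own. Below is a minimal sketch, assuming the same step of 25 and last offset of 225 that the spider uses. The function name page_urls and the example.com base URL are made up for illustration and are not part of the project.

```python
PAGE_SIZE = 25     # step the spider adds to the offset after each page
LAST_OFFSET = 225  # the spider stops paging once this offset has been requested


def page_urls(base_url, page_size=PAGE_SIZE, last_offset=LAST_OFFSET):
    """Return, in order, every URL the spider would request."""
    return [
        base_url + str(offset)
        for offset in range(0, last_offset + 1, page_size)
    ]


# Hypothetical base URL, used only to show the shape of the result
urls = page_urls("https://example.com/list?start=")
print(len(urls))   # number of pages requested
print(urls[0])     # first page
print(urls[-1])    # last page
```

With these values the list holds 10 URLs, which matches 10 pages of 25 movies each for the Top 250.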

pipelines.py file

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
import pymongo
from scrapy.utils.project import get_project_settings


class DoubanPipeline(object):
    def __init__(self):
        # scrapy.conf is no longer available in current Scrapy releases,
        # so the project settings are loaded explicitly
        settings = get_project_settings()
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]
        # Open the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection that stores the scraped data
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        self.post.insert_one(data)
        return item
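
As an alternative layout, Scrapy also lets a pipeline read its settings through from_crawler and open or close its resources in open_spider and close_spider. The following is only a sketch under that assumption, not the code used in this exercise. The class name MongoPipelineSketch and the client_factory parameter are hypothetical; client_factory exists only so the class can be tried without a running MongoDB.

```python
class MongoPipelineSketch:
    """Hypothetical variant of the pipeline that uses Scrapy's lifecycle hooks."""

    def __init__(self, host, port, dbname, sheetname, client_factory=None):
        self.host = host
        self.port = port
        self.dbname = dbname
        self.sheetname = sheetname
        # client_factory is an illustrative hook; None means "use pymongo.MongoClient"
        self.client_factory = client_factory
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, which exposes the project settings
        s = crawler.settings
        return cls(
            host=s.get("MONGODB_HOST"),
            port=s.getint("MONGODB_PORT"),
            dbname=s.get("MONGODB_DBNAME"),
            sheetname=s.get("MONGODB_SHEETNAME"),
        )

    def open_spider(self, spider):
        factory = self.client_factory
        if factory is None:
            # Imported here so the class can be defined without pymongo installed
            import pymongo
            factory = pymongo.MongoClient
        self.client = factory(host=self.host, port=self.port)
        self.collection = self.client[self.dbname][self.sheetname]

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```

The difference from the version above is mainly when things happen: the connection is opened when the spider starts and closed when it finishes, instead of being created in __init__ and never closed.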

settings.py

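BOT_NAME = 'douban'
SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Configure item pipelines
ITEM_PIPELINES = {
    # The original entry is missing here; the class path follows from pipelines.py,
    # and the priority 300 is an assumed value
    'douban.pipelines.DoubanPipeline': 300,
}
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# Port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = "douban"
# Name of the collection that stores the data
MONGODB_SHEETNAME = "doubanmovies"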

Final result:
