Goal: a simple Scrapy exercise that scrapes the top 20% of Douban's drama-film ranking and writes the results to a file.
Notes on the page:
1. The `100:90` part of the URL restricts the ranking to films whose scores fall in the top 20%.
2. The page-analysis steps are omitted here.
Environment: Windows 7, PyCharm, Python 3.6.
1. Write the item
# -*- coding: utf-8 -*-
import scrapy


class DoubanMovieItem(scrapy.Item):
    name = scrapy.Field()
    score = scrapy.Field()
    url = scrapy.Field()
2. Write the spider
# -*- coding: utf-8 -*-
import scrapy
import json
from douban_movie.items import DoubanMovieItem


class CatchMovieSpider(scrapy.Spider):
    name = 'catch_movie'
    allowed_domains = ['douban.com']
    # The ranking-API URL was stripped from the original post;
    # insert the Douban ranking JSON endpoint here.
    start_urls = ['']
    offset = 0

    def parse(self, response):
        # print(response.body.decode())
        movie_list = json.loads(response.body.decode())
        if not movie_list:  # an empty JSON array means there are no more pages
            return
        for movie in movie_list:
            # create a fresh item per movie instead of reusing one instance
            item = DoubanMovieItem()
            item['name'] = movie['title']
            item['score'] = movie['score']
            item['url'] = movie['url']
            yield item
        self.offset += 20
        # The URL template was also stripped from the original post;
        # it should contain a start={} placeholder for the new offset.
        new_url = ''.format(self.offset)
        yield scrapy.Request(url=new_url, callback=self.parse)
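The stop condition in `parse()` can be exercised without running Scrapy at all: the ranking endpoint returns a plain JSON array per page, and an empty array marks the last page. A minimal stdlib sketch of that logic, using an invented sample payload (the real response has more fields per movie):

```python
import json

# Invented one-movie sample of the JSON array the ranking endpoint returns.
page1 = ('[{"title": "肖申克的救赎", "score": "9.7", '
         '"url": "https://movie.douban.com/subject/1292052/"}]')
page2 = '[]'  # an empty array signals that paging is finished


def extract(body):
    """Mirror of the spider's parse(): return (name, score, url) rows."""
    movie_list = json.loads(body)
    if not movie_list:  # same stop condition as the spider
        return []
    return [(m['title'], m['score'], m['url']) for m in movie_list]


print(extract(page1))  # one (title, score, url) tuple
print(extract(page2))  # []
```

Because the payload is a bare array rather than an object, `json.loads` yields a Python list directly, so `if not movie_list` is enough to detect the final page.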
3. Write the pipeline
# -*- coding: utf-8 -*-
import json


class DoubanMoviePipeline(object):
    def open_spider(self, spider):
        self.file = open('douban_movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the output file
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()
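The `ensure_ascii=False` argument is what keeps the Chinese titles human-readable in `douban_movie.txt`; with the default setting, `json.dumps` escapes every non-ASCII character. A quick stdlib comparison (sample title chosen for illustration):

```python
import json

item = {"name": "霸王别姬", "score": "9.6"}

escaped = json.dumps(item)                       # default: ASCII-only output
readable = json.dumps(item, ensure_ascii=False)  # keeps the UTF-8 characters

print(escaped)   # {"name": "\u9738\u738b\u522b\u59ec", "score": "9.6"}
print(readable)  # {"name": "霸王别姬", "score": "9.6"}
```

Both strings decode back to the same dict with `json.loads`; the difference is only in how the file looks when opened in a text editor, which matters since the pipeline writes the lines for direct inspection.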
4. Write the settings
# Crawl responsibly by identifying yourself (and your website) on the User-Agent
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See the Scrapy documentation on item pipelines
ITEM_PIPELINES = {
    'douban_movie.pipelines.DoubanMoviePipeline': 300,
}
5. Write the main script
from scrapy.cmdline import execute
execute('scrapy crawl catch_movie'.split())
Screenshot of the saved file contents: (image not preserved in this capture)
Notes:
1. The main script exists only to make debugging from PyCharm convenient; the spider can equally be started from the command line.
2. This ranking is limited to a score interval by a URL parameter (the `100:90`-style value).