The Scrapy Framework: Today's Movies

The City Movie site, Guangzhou, today's movies **:

① items.py: define the item to scrape and add its class members

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
import scrapy


class todaymovieitem(scrapy.Item):
    # Define the fields for your item here like:
    # name = scrapy.Field()
    movietitle = scrapy.Field()  # movie title
    movietype = scrapy.Field()   # movie genre
    moviepf = scrapy.Field()     # movie rating
    movieurl = scrapy.Field()    # movie URL

② guangzhou_bot.py: define the crawling rules

# -*- coding: utf-8 -*-
import scrapy
from todaymovie.items import todaymovieitem
from lxml import etree


class guangzhoubotspider(scrapy.Spider):
    # Name of the spider
    name = 'guangzhou_bot'
    # Domains the spider is allowed to crawl
    allowed_domains = ['guangzhou.movie.iecity.com']
    # Start page: the City Movie site's "Today's Movies in Guangzhou" **
    start_urls = ['']

    def parse(self, response):
        # Serialize each <li> to an HTML string so lxml can re-parse it on its own
        movielist = response.xpath('//*[@id="left"]/div[2]/div/ul/li').extract()
        items = []
        # Define the extraction rules (XPath expressions)
        role_url = '//a[1]/@href'
        role_title = '//div[@class="movietitle clearfix"]//*[@itemprop="name"]/text()'
        role_pf = '//div[@class="movietitle clearfix"]//*[@class="pf"]/text()'
        role_type = '//div[@class="moviedetail"]/text()'
        # Begin crawling
        for movie in movielist:
            # Parse one <li> fragment into its own tree
            tree = etree.HTML(movie)
            item = todaymovieitem()
            item['movietitle'] = tree.xpath(role_title)[0]
            item['movietype'] = tree.xpath(role_type)[0].replace('\r\n', ' ')
            # Not every movie has a rating; fall back to 'none' when it is missing
            item['moviepf'] = 'none' if len(tree.xpath(role_pf)) == 0 else tree.xpath(role_pf)[0]
            item['movieurl'] = '' + \
                tree.xpath(role_url)[0]
            # Collect the item so that parse() actually returns it
            items.append(item)
        return items
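
The per-movie extraction in parse() can be tried without a network request. The following is only a minimal sketch: it swaps lxml for the standard library's xml.etree.ElementTree (which supports a smaller XPath subset), and SAMPLE_LI, first_text and parse_movie are hypothetical names with made-up markup, since the real page's HTML is not shown here. It illustrates the "use a default when the rating is missing" rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one <li> of the listing page.
# The real markup is an assumption; only the class/attribute names come from the spider.
SAMPLE_LI = """
<li>
  <a href="/movie/1.html">poster</a>
  <div class="movietitle clearfix">
    <span itemprop="name">Example Movie</span>
  </div>
  <div class="moviedetail">Drama
    Comedy</div>
</li>
"""


def first_text(node, path, default='none'):
    """Return the text of the first element matching `path`, or `default`."""
    found = node.find(path)
    if found is None or found.text is None:
        return default
    return found.text


def parse_movie(fragment):
    """Extract the four fields of one movie from a single <li> fragment."""
    li = ET.fromstring(fragment)
    link = li.find(".//a")
    raw_type = first_text(li, ".//div[@class='moviedetail']")
    return {
        'movietitle': first_text(li, ".//*[@itemprop='name']"),
        # Collapse line breaks and repeated spaces into single spaces
        'movietype': ' '.join(raw_type.split()),
        # No element with class "pf" in the sample, so the default is used
        'moviepf': first_text(li, ".//*[@class='pf']"),
        'movieurl': link.get('href') if link is not None else 'none',
    }


movie = parse_movie(SAMPLE_LI)
print(movie)
```

Real pages are rarely well-formed XML, which is one reason the spider above relies on lxml's HTML parser rather than ElementTree.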

③ pipelines.py: save the scraped results

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See:
from time import strftime, localtime
import codecs


class todaymoviepipeline(object):
    def process_item(self, item, spider):
        movie_today = strftime('%Y-%m-%d', localtime())
        # Name the file after the current date and append the content to it
        filename = 'city_guangzhou_' + movie_today + '.csv'
        # The spider returns a list, but here items can only be written one at a time
        with codecs.open(filename, 'a+', 'utf-8') as fp:
            fp.write('%s,%s,%s,%s\n' %
                     (item['movietitle'],
                      item['movietype'],
                      item['moviepf'],
                      item['movieurl']))
        # Hand the item on to any later pipeline
        return item
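
One thing to watch with the plain '%s,%s,%s,%s' formatting is that a title or genre containing a comma would spill into extra columns. As a possible alternative (not what the pipeline above does), the row could be written with the standard csv module, which quotes such values. This is a sketch under that assumption; FIELDS and append_item are hypothetical names, and an in-memory buffer stands in for the daily file.

```python
import csv
import io

# Column order used by the pipeline above
FIELDS = ('movietitle', 'movietype', 'moviepf', 'movieurl')


def append_item(fp, item):
    """Write one item as a CSV row, quoting any value that contains a comma."""
    csv.writer(fp).writerow([item[name] for name in FIELDS])


# In-memory buffer used here instead of the real 'city_guangzhou_<date>.csv' file
buffer = io.StringIO()
append_item(buffer, {
    'movietitle': 'Hello, World',   # made-up title that contains a comma
    'movietype': 'Drama',
    'moviepf': 'none',
    'movieurl': '/movie/1.html',
})

# Reading the row back shows the title survived as a single column
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Inside process_item, the same helper could be called on the file object opened in append mode.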

④ settings.py: dispatch tasks and make the pipeline take effect

Uncomment ITEM_PIPELINES:

# Configure item pipelines
# See
ITEM_PIPELINES =

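# The key is the class that processes the results; the value is the order in which that class runs. The smaller the number, the earlier it runs.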
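
The value of the setting is not shown above. A plausible form, assuming the project package is todaymovie (as in the spider's import), the pipeline lives in a module named pipelines, and the priority is 300, might look like the sketch below. The second entry and the run_order helper are hypothetical and exist only to illustrate the "smaller number runs first" rule; they are not part of Scrapy.

```python
# Assumed dotted path: package `todaymovie`, module `pipelines`, class as spelled above.
# The priority 300 is also an assumption; adjust both to match the real project.
ITEM_PIPELINES = {
    'todaymovie.pipelines.todaymoviepipeline': 300,
}


def run_order(pipelines):
    """Illustration only: list pipeline paths from lowest to highest number."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


# Add a hypothetical second pipeline with a smaller number to show the ordering
example = dict(ITEM_PIPELINES)
example['todaymovie.pipelines.hypotheticalcleanup'] = 100

print(run_order(example))
```

With both entries enabled, the hypothetical cleanup pipeline would see each item before the CSV-writing pipeline does.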
