Web Crawlers (4): Scraping NetEase News with Scrapy

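Code (items.py):

import scrapy


class NewsItem(scrapy.Item):
    news_thread = scrapy.Field()  # article ID taken from the last URL segment
    news_title = scrapy.Field()   # headline
    news_url = scrapy.Field()     # URL of the article page
    news_time = scrapy.Field()    # publication time
    news_source = scrapy.Field()  # name of the original source
    source_url = scrapy.Field()   # link to the original source
    news_text = scrapy.Field()    # body paragraphs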

pipelines.py

Code:

from scrapy.exporters import CsvItemExporter


class NewsPipeline(object):
    def __init__(self):
        # Create a file named news_data.csv and set up the exporter that writes into it
        self.file = open('news_data.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='gbk')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        # Write each scraped item as one CSV row
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Finish the export and close the file
        self.exporter.finish_exporting()
        self.file.close()

Purpose: organize the scraped data by field and store it in the .csv file created above.
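
To make the pipeline's three stages (start, export each item, finish) more concrete, here is a minimal sketch of the same lifecycle using only Python's standard `csv` module. It is not how `CsvItemExporter` is implemented; the `SimpleCsvWriter` class and the `export_items` helper are hypothetical names used purely for illustration.

```python
import csv
import io


class SimpleCsvWriter:
    """Hypothetical stand-in that mimics the start/export/finish lifecycle."""

    def __init__(self, fields, encoding="gbk"):
        self.fields = fields
        self.raw = io.BytesIO()  # plays the role of the file opened in 'wb' mode
        # The csv module writes text, so wrap the byte stream with an encoder.
        self.text = io.TextIOWrapper(self.raw, encoding=encoding, newline="")
        self.writer = csv.DictWriter(self.text, fieldnames=fields)

    def start_exporting(self):
        # Write the header row once, before any item.
        self.writer.writeheader()

    def export_item(self, item):
        # Fields missing from the item are written as empty cells.
        self.writer.writerow({f: item.get(f, "") for f in self.fields})

    def finish_exporting(self):
        # Flush buffered text so the byte stream holds the complete CSV.
        self.text.flush()
        return self.raw.getvalue()


def export_items(items, fields, encoding="gbk"):
    """Run the full lifecycle and return the encoded CSV bytes."""
    writer = SimpleCsvWriter(fields, encoding)
    writer.start_exporting()
    for item in items:
        writer.export_item(item)
    return writer.finish_exporting()
```

Calling `export_items([{"news_title": "..."}], ["news_title", "news_url"])` returns the encoded bytes, which is roughly what ends up in `news_data.csv` once the spider closes.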

Main code:

The spider's file name is determined by the name chosen when the spider was created.

The code is:

import scrapy
from news.items import NewsItem
from scrapy.linkextractors import LinkExtractor  # link extractor
from scrapy.spiders import CrawlSpider, Rule


class News163Spider(CrawlSpider):  # inherits from CrawlSpider
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['']  # left empty in the source; a start page on news.163.com is needed
    rules = (
        # Pages whose URL matches the allow regular expression are handed to parse_news
        Rule(LinkExtractor(allow=r"/18/09\d+/*"),
             callback="parse_news", follow=True),
    )

    def parse_news(self, response):
        item = NewsItem()  # instantiate the item; it is used like a dict
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_time(response, item)
        self.get_source(response, item)
        self.get_url(response, item)
        self.get_source_url(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.css('title::text').extract()
        print('*' * 20)
        if title:  # skip when nothing was extracted
            print('title:{}'.format(title[0][:-5]))
            item['news_title'] = title[0][:-5]

    def get_time(self, response, item):
        time = response.css('.post_time_source::text').extract()
        if time:
            print('time:{}'.format(time[0][:-5]))
            item['news_time'] = time[0][:-5]

    def get_source(self, response, item):
        source = response.css('#ne_article_source::text').extract()
        if source:
            print('source:{}'.format(source[0]))
            item['news_source'] = source[0]

    def get_source_url(self, response, item):
        # ::attr(href) selects the element's href attribute
        source_url = response.css('#ne_article_source::attr(href)').extract()
        if source_url:
            print('source_url:{}'.format(source_url[0]))
            item['source_url'] = source_url[0]

    def get_text(self, response, item):
        text = response.css('#endtext p::text').extract()
        if text:
            print('text:{}'.format(text))
            item['news_text'] = text

    def get_url(self, response, item):
        url = response.url
        if url:
            print('news_url:{}'.format(url))
            item['news_url'] = url
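
As a quick check of what the crawl rule selects, the sketch below applies the same pattern with Python's `re.search` and repeats the slicing that `parse_news` uses to build `news_thread`. It assumes the pattern may match anywhere in the URL (search semantics). The helper names `is_news_link` and `thread_id` and both sample URLs are made up for illustration and are not taken from the site.

```python
import re

ALLOW = re.compile(r"/18/09\d+/*")  # pattern copied from the crawl rule


def is_news_link(url):
    # Assumption: the pattern may match anywhere in the URL (search semantics).
    return ALLOW.search(url) is not None


def thread_id(url):
    # Last path segment with its final five characters (e.g. ".html") removed.
    return url.strip().split('/')[-1][:-5]


# Hypothetical URLs, for illustration only
sample = "http://news.163.com/18/0901/10/ABCDEFGH0001XYZ.html"
other = "http://news.163.com/17/1201/10/ABCDEFGH0001XYZ.html"

print(is_news_link(sample), is_news_link(other))
print(thread_id(sample))
```

Note that `/*` in the pattern means "zero or more slashes", so in practice the rule only requires `/18/09` followed by at least one digit.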

After running the spider, you get a neatly organized CSV file.
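
Because the pipeline writes the file with `encoding='gbk'`, it has to be read back with the same encoding. A minimal sketch using the standard `csv` module follows; the helper name `read_news_csv` is hypothetical, and the path is whatever the pipeline wrote (`news_data.csv` in this post).

```python
import csv


def read_news_csv(path, encoding="gbk"):
    """Return the rows of the exported CSV as a list of dicts."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.DictReader(f))
```

For example, `rows = read_news_csv('news_data.csv')` gives one dict per article, keyed by the column names in the header row.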
