Let's step back to the search results page and analyze its elements, as shown in the figure:
All of the search results sit in a single list on the page (the #j-search-list element queried by the spider below), and each entry's data-pn attribute carries the app's package name. A util.py module starts with patterns for cleaning up the scraped text:

# -*- coding: utf-8 -*-
import re

space = u'\u00a0'  # non-breaking space that shows up in scraped text
brackets = ur'\(.*\)|\[.*\]|【.*】|（.*）'  # bracketed spans; unicode raw string so it matches unicode text
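The source never shows how these two definitions are applied; a minimal sketch of the intended cleanup, with a hypothetical helper name, could sit in the same module:

def clean_name(name):
    # hypothetical helper: normalize non-breaking spaces, drop bracketed suffixes
    name = name.replace(space, u' ')
    return re.sub(brackets, u'', name).strip()

# e.g. clean_name(u'微信（官方版）') -> u'微信'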
The URL helpers live in wandoujia.py (the spider below calls them as wandoujia.get_kw_url and wandoujia.get_pkg_url):

# -*- coding: utf-8 -*-
import urllib

def get_kw_url(kw):
    """concatenate the url for searching"""
    base_url = u""  # the real template (elided here) must contain a %s placeholder for the keyword
    return base_url % (urllib.quote(kw.encode("utf8")))

def get_pkg_url(pkg):
    """get the detail url according to pkg"""
    # body elided in the source; it builds the app detail URL from the package name
Every Scrapy spider inherits from the scrapy.Spider class; its main attributes and methods, all of which appear in the spider below, are name, allowed_domains, start_requests(), and parse(). Items are the containers that hold the scraped data and behave much like Python dicts:
import scrapy

class WandoujiaItem(scrapy.Item):  # class name assumed; fields are those used below
    kw = scrapy.Field()    # key word
    name = scrapy.Field()  # app name
    tag = scrapy.Field()   # category tag
    desc = scrapy.Field()  # app description
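Because an Item behaves like a dict, fields are read and written by key, and dict(item) converts it for serialization, which is exactly what the JsonWriterPipeline below relies on:

item = WandoujiaItem()
item['kw'] = u'微信'   # set a field by key
print item['kw']       # -> 微信
print dict(item)       # -> {'kw': u'\u5fae\u4fe1'}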
The Wandoujia spider:
# -*- coding: utf-8 -*-
# @time   : 2016/6/23
# @author : rain
import scrapy
import codecs

# the helper modules defined above; import paths depend on the project layout
import util
import wandoujia
from items import WandoujiaItem

class WandoujiaSpider(scrapy.Spider):
    name = "wandoujiaspider"
    allowed_domains = ["www.wandoujia.com"]

    def __init__(self):
        # body elided in the source; presumably it loads the keyword list
        self.keywords = []

    def start_requests(self):
        # reconstructed: only the callback kwarg survives in the source; each
        # keyword's search page is requested, with kw carried along in meta
        for kw in self.keywords:
            yield scrapy.Request(url=wandoujia.get_kw_url(kw),
                                 callback=self.parse_search_result,
                                 meta={'kw': kw})

    def parse(self, response):
        # default callback: parses the app detail page
        # (the original presumably also extracts item['name'] here, elided,
        # since the dedup pipeline below checks it)
        item = WandoujiaItem()
        item['kw'] = response.meta['kw']
        item['tag'] = response.css('.crumb>.second>a>span::text').extract_first()
        desc = response.css('.desc-info>.con::text').extract()
        item['desc'] = util.parse_desc(desc)
        item['desc'] = u"" if not item["desc"] else item["desc"].strip()
        yield item

    def parse_search_result(self, response):
        # the first result's data-pn attribute is the app's package name;
        # the detail request falls through to the default callback, parse
        pkg = response.css("#j-search-list>li::attr(data-pn)").extract_first()
        yield scrapy.Request(url=wandoujia.get_pkg_url(pkg), meta=response.meta)
The parse_desc helper in util.py joins the scraped description fragments, stripping whitespace from each:

def parse_desc(desc):
    # Python 2: reduce is a builtin
    return reduce(lambda a, b: a.strip() + b.strip(), desc, '')
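A quick check with made-up input shows the fragments are joined with no separator:

print parse_desc([u'  Here is ', u'\n the description.  '])
# -> Here isthe description.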
The scraped results may contain duplicates or empty entries (keywords with no search results); in addition, when Python 2 serializes JSON, Chinese characters are escaped as \uXXXX sequences by default. Custom pipelines can handle both problems:
# -*- coding: utf-8 -*-
import json
import codecs

from scrapy.exceptions import DropItem

class CheckPipeline(object):
    """check item, and drop the duplicate one"""
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        if item['name']:
            if item['name'] in self.names_seen:
                raise DropItem("duplicate item found: %s" % item)
            else:
                self.names_seen.add(item['name'])
                return item
        else:
            raise DropItem("missing name in %s" % item)

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = codecs.open('./output/output.json', 'wb', 'utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable in the output
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
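To see why ensure_ascii=False (together with the codecs-opened UTF-8 file) matters, compare Python 2's default behavior:

import json
print json.dumps({'name': u'微信'})
# {"name": "\u5fae\u4fe1"}
print json.dumps({'name': u'微信'}, ensure_ascii=False)
# {"name": "微信"}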
The pipelines must also be registered in settings.py; the original snippet is truncated, so the module path and priority values below are illustrative:

ITEM_PIPELINES = {
    'wandoujia.pipelines.CheckPipeline': 300,
    'wandoujia.pipelines.JsonWriterPipeline': 800,
}

The integer assigned to each class determines the order in which the pipelines run: items pass through them from lower values to higher. By convention these numbers are kept in the 0-1000 range.
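With everything wired up, the crawl is launched from the project root with Scrapy's standard CLI, and the JSON lines end up in ./output/output.json as configured above:

scrapy crawl wandoujiaspider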