1. Python crawler practice, part 1: scraping jokes from Qiushibaike (糗事百科)
2. Create myproject in the working directory
scrapy startproject myproject
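scrapy startproject generates the standard project skeleton; the files edited in the following steps sit under myproject/myproject/ (the exact file list varies slightly between Scrapy versions):

myproject/
    scrapy.cfg                # deploy/config entry point
    myproject/
        __init__.py
        items.py              # item definitions (step 3)
        settings.py           # project settings (steps 5 and 8)
        pipelines.py
        spiders/
            __init__.py
            myspider.py       # created by hand in step 4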
3. Edit /myproject/myproject/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
#     https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

# An Item formats the scraped data we need, which makes later processing easier
class MyItem(scrapy.Item):
    user = scrapy.Field()
    content = scrapy.Field()
    godcomment = scrapy.Field()
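Since a Scrapy Item behaves like a dictionary, a quick check in a Python shell (with purely hypothetical values) looks like this:

from myproject.items import MyItem

item = MyItem(user='some_user', content='a joke', godcomment='a top comment')
print(item['user'])       # fields are read like dictionary keys
print(dict(item))         # an Item converts cleanly to a plain dict for export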
4. Edit /myproject/myproject/spiders/myspider.py
# -*- coding:utf-8 -*-
import scrapy
import re
from myproject.items import MyItem

# The spider specifies the URLs, sends the requests and receives the raw data,
# then turns that data into Items
class MySpider(scrapy.Spider):
    name = 'myspider'

    # a pageindex argument can be passed in to build the full URL
    def __init__(self, pageindex=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['' % pageindex]  # page URL template (with pageindex substituted) not shown here

    # turn the response into Items
    def parse(self, response):
        # print(response.body.decode(response.encoding))  # dump the raw page
        # three capture groups: user, joke content, "god comment"; the HTML
        # fragments that belong between the groups are not shown here and
        # have to match the markup of the target page
        pattern = re.compile('.*?' +
                             '.*?(.*?).*?' +
                             '.*?(.*?).*?' +
                             '.*?' +
                             '(.*?)', re.S)
        items = re.findall(pattern, response.body.decode(response.encoding))
        print("lin len: %d" % (len(items)))
        for item in items:
            print("lin user: %s" % (item[0].strip()))
            print("lin content: %s" % (item[1].strip()))
            print("lin god comments: %s" % (item[2].strip()))
            myitems = MyItem(user=item[0], content=item[1], godcomment=item[2])
            yield myitems
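Because the regular expression depends on the exact page markup (which is not reproduced above), a more robust sketch replaces the body of parse() with Scrapy's own selectors. The CSS classes used here (article, author, content, main-text) are assumptions about the Qiushibaike markup and have to be checked against the real page; .get()/.getall() need a reasonably recent Scrapy, older versions use extract_first()/extract():

    # alternative parse(), a drop-in for the method above; selectors instead of re
    def parse(self, response):
        for post in response.css('div.article'):              # one block per joke (assumed selector)
            user = post.css('div.author h2::text').get()       # poster name (assumed selector)
            content = ''.join(post.css('div.content span::text').getall())
            godcomment = post.css('div.main-text::text').get()
            if user and content:
                yield MyItem(user=user.strip(),
                             content=content.strip(),
                             godcomment=(godcomment or '').strip())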
5. Set the request headers in /myproject/myproject/settings.py
DEFAULT_REQUEST_HEADERS =
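The header values themselves were left blank above; a typical choice for this kind of crawl is a browser-like User-Agent plus standard Accept headers. The exact strings below are only illustrative:

DEFAULT_REQUEST_HEADERS = {
    # a browser-like User-Agent so the site does not reject the request;
    # the string here is only an example, not taken from the original post
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
}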
6. Run myspider with pageindex=1 and save the results to items.json
scrapy crawl myspider -a pageindex=1 -o items.json
7. The console output and items.json show a Unicode character-set problem in the exported data
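By default the JSON exporter escapes non-ASCII text, so Chinese characters show up as \uXXXX sequences. The records below use made-up values purely to illustrate the before/after difference:

# what items.json looks like before the fix (escaped, hard to read)
{"user": "\u67d0\u67d0", "content": "\u6bb5\u5b50\u5167\u5bb9", "godcomment": "\u795e\u8a55\u8ad6"}
# what it should look like after step 8 (Chinese written as-is)
{"user": "某某", "content": "段子內容", "godcomment": "神評論"}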
8. Fixing the Unicode export problem in Scrapy
8.1 Edit /myproject/myproject/settings.py
from scrapy.exporters import JsonLinesItemExporter

class CustomJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        # only the parent class's ensure_ascii option needs to be set to False
        super(CustomJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

# the new exporter class must also be enabled in the settings file
FEED_EXPORTERS =
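The dictionary value above was left blank; a common way to wire the exporter in is to route the json feed format through the new class. The dotted path below assumes the class is defined in settings.py, as in step 8.1:

FEED_EXPORTERS = {
    # route the 'json' feed format (used by -o items.json) through the exporter above
    'json': 'myproject.settings.CustomJsonLinesItemExporter',
}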
8.2 Run the crawl again; the Unicode export problem in items.json is resolved. items.json is written under \myproject