An example of handling project data with Scrapy in Python

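After processing data, we tend to leave it wherever it happened to end up, which carries a hidden risk. Once new data is added, or for any number of other reasons, the file can be hard to find when we need it again, and collecting and reorganizing the data from scratch is clearly impractical. Let's look at how a Scrapy-based Python crawler project handles its data.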

1. Fetch the project

$ git clone
$ cd TweetScraper/
$ pip install -r requirements.txt  # add '--user' if you are not root
$ scrapy list
$ # if the output is 'TweetScraper', then you are ready to go.

2. Data persistence

According to the documentation, the project offers three ways to persist data: saving to files, saving to MongoDB, and saving to a MySQL database. Because the scraped data will be analyzed later, we store it in MySQL.

By default, the scraped data is saved to disk in JSON format under ./data/tweet/, so the configuration file TweetScraper/settings.py needs to be modified.

ITEM_PIPELINES =

# settings for MySQL
MYSQL_SERVER = "18.126.219.16"
MYSQL_DB = "scraper"
MYSQL_TABLE = "tweets"  # the table will be created automatically
MYSQL_USER = "root"  # MySQL user to use (should have INSERT access granted to the database/table)
MYSQL_PWD = "admin123456"  # MySQL user's password
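To make the persistence step more concrete, here is a minimal sketch of what a database-backed item pipeline can look like. It is not the project's own MySQL pipeline, which the article does not show. As an assumption for illustration, it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the class name SqliteTweetPipeline and the item fields id, author, and body are hypothetical. The only Scrapy convention relied on is that a pipeline is a plain class with open_spider, process_item, and close_spider methods.

```python
import sqlite3


class SqliteTweetPipeline:
    """Hypothetical pipeline: stores each scraped item as one table row.

    sqlite3 stands in for MySQL here; the field names are assumptions.
    """

    def __init__(self, db_path=":memory:", table="tweets"):
        self.db_path = db_path
        self.table = table
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table
        # if it does not exist yet.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} "
            "(id TEXT PRIMARY KEY, author TEXT, body TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item; INSERT OR IGNORE skips ids that
        # are already stored, so re-running a crawl does not duplicate rows.
        self.conn.execute(
            f"INSERT OR IGNORE INTO {self.table} (id, author, body) "
            "VALUES (?, ?, ?)",
            (item["id"], item["author"], item["body"]),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

In a real project, a class like this would be enabled by listing its dotted path in ITEM_PIPELINES in settings.py, and a MySQL version would read the connection values from the MYSQL_* settings instead of a local file path.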

Additional content:

scrapy.cfg is the project's configuration file. A minimal spider looks like this:

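import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    # The two start URLs are left empty here; fill them in before running.
    start_urls = [
        "", ""
    ]

    def parse(self, response):
        # Name the output file after the second-to-last URL segment.
        filename = response.url.split("/")[-2]
        # Write the raw response body to disk and close the file afterwards.
        with open(filename, "wb") as f:
            f.write(response.body)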
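The expression response.url.split("/")[-2] in parse can look odd at first, so the short sketch below shows what it yields. The helper name filename_from_url and the example.com URLs are made up for illustration. When the URL ends with a slash, the last element of the split is an empty string, so index -2 picks the final path segment.

```python
def filename_from_url(url):
    # Splitting on "/" leaves an empty string after a trailing slash,
    # so index -2 is the last real path segment in that case.
    return url.split("/")[-2]


# URL with a trailing slash: the last path segment is used.
print(filename_from_url("http://example.com/books/python/"))  # python

# URL without a trailing slash: the segment before the last one is used.
print(filename_from_url("http://example.com/books/python"))  # books
```

This means the spider's output file names depend on whether the start URLs end with a slash, which is worth checking before a crawl so that two pages do not overwrite the same file.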
