Scraping fund data with Python

# coding=utf-8
# Ported to Python 3: html.unescape replaces HTMLParser().unescape, and
# insert_one replaces the removed Collection.insert.

import html

import requests
from lxml import etree
from pymongo import MongoClient

# The request URL, form data and headers were not included in the original
# listing; fill them in for the target site before running.
url = ''
data = {}
headers = {}


def main():
    client = MongoClient('localhost', 27017)
    db = client.sciencefund
    # The original called db.authenticate("", "") with blank credentials.
    # Current PyMongo takes username= and password= on MongoClient instead.
    collection = db.science_fund

    for i in range(1, 43184):
        print(i)
        data['currentpage'] = i
        result = requests.post(url, data=data, headers=headers)
        tree = etree.HTML(result.text)
        # Each record is one <dl class="time_dl"> element.
        table = tree.xpath("//dl[@class='time_dl']")
        for item in table:
            content = etree.tostring(item, method='html', encoding='unicode')
            content = html.unescape(content)
            # print(content)
            bson = jiexi(content)
            collection.insert_one(bson)


def jiexi(content):
    """Parse one record's HTML into a dict.

    The end-marker arguments to find() were stripped from the original
    listing and survive only as ''. Replace each '' with the markup that
    closes the field, otherwise the slices do not capture the field values.
    The labels must match the page text exactly.
    """
    # Title
    title1 = content.find('">', 20)
    title2 = content.find('')
    title = content[title1 + 2:title2]

    # Approval number
    standard_no1 = content.find('批准號', title2)
    standard_no2 = content.find('', standard_no1)
    standard_no = content[standard_no1 + 4:standard_no2].strip()

    # Project category
    standard_type1 = content.find('專案類別', standard_no2)
    standard_type2 = content.find('', standard_type1)
    standard_type = content[standard_type1 + 5:standard_type2].strip()

    # Supporting institution
    supporting_institution1 = content.find('依託單位', standard_type2)
    supporting_institution2 = content.find('', supporting_institution1)
    supporting_institution = content[supporting_institution1 + 5:supporting_institution2].strip()

    # Project leader
    project_principal1 = content.find('專案負責人', supporting_institution2)
    project_principal2 = content.find('', project_principal1)
    project_principal = content[project_principal1 + 6:project_principal2].strip()

    # Funding amount
    funds1 = content.find('資助經費', project_principal2)
    funds2 = content.find('', funds1)
    funds = content[funds1 + 5:funds2].strip()

    # Approval year
    year1 = content.find('批准年度', funds2)
    year2 = content.find('', year1)
    year = content[year1 + 5:year2].strip()

    # Keywords (the line defining keywords1 is missing from the original;
    # the label below is inferred from the +4 offset and is an assumption)
    keywords1 = content.find('關鍵詞', year2)
    keywords2 = content.find('', keywords1)
    keywords = content[keywords1 + 4:keywords2].strip()

    dc = {}
    dc['title'] = title
    dc['standard_no'] = standard_no
    dc['standard_type'] = standard_type
    dc['supporting_institution'] = supporting_institution
    dc['project_principal'] = project_principal
    dc['funds'] = funds
    dc['year'] = year
    dc['keywords'] = keywords
    return dc


if __name__ == '__main__':
    main()
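
The parsing in jiexi repeats one pattern for every field: find a label, find the markup that ends the field, and slice out the text in between. A minimal sketch of that pattern as a reusable helper is given here. The helper name extract_field, the sample markup and the '</dd>' end marker are assumptions made for illustration; none of them come from the original page.

```python
def extract_field(content, label, end_marker, start=0):
    """Return (value, end_index) for the text between label and end_marker.

    One extra character after the label is skipped, mirroring the
    offsets in the listing (label length + 1, e.g. for a colon).
    Returns (None, start) if either marker is missing.
    """
    begin = content.find(label, start)
    if begin == -1:
        return None, start
    begin += len(label) + 1
    end = content.find(end_marker, begin)
    if end == -1:
        return None, start
    return content[begin:end].strip(), end


# Hypothetical record markup, used only to exercise the helper.
sample = "<dl><dd>Approval No:12345678</dd><dd>Year:2015</dd></dl>"

number, pos = extract_field(sample, "Approval No", "</dd>")
year, pos = extract_field(sample, "Year", "</dd>", pos)
print(number, year)
```

Returning the end index lets each lookup resume where the previous one stopped, which is what the listing does by passing the previous field's end position as the start of the next find. Checking for -1 keeps a missing label from silently producing a wrong slice.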

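Scraping dynamically loaded pages with Python (Zhihu, Weibo posts)

When first learning to write scrapers, we find that many sites load their data dynamically with Ajax. Scraping such a page the usual way returns only a pile of HTML with no data in it. Weibo is one example: scrolling down loads more posts. How should pages like this be scraped? Taking a Weibo user's posts as an example, we scrape a given user's po...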
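
For pages like this, the data usually arrives as JSON from a separate request that the page makes in the background, so one common approach is to call that endpoint directly and read the JSON instead of parsing HTML. The sketch below shows the general shape only. The payload layout (data, cards, text) and the function name extract_texts are hypothetical placeholders, not Weibo's actual response format, and would need to be replaced with whatever the browser's network panel shows.

```python
import json


def extract_texts(payload):
    """Collect the text of each post from a decoded JSON payload.

    The keys used here ("data", "cards", "text") are hypothetical.
    """
    cards = payload.get("data", {}).get("cards", [])
    return [card["text"] for card in cards if "text" in card]


# A stand-in for the body of one Ajax response (hypothetical layout).
raw = '{"data": {"cards": [{"text": "first post"}, {"id": 2}, {"text": "second post"}]}}'

texts = extract_texts(json.loads(raw))
print(texts)
```

With requests, the same function could be fed the result of requests.get(endpoint, params=...).json() for successive pages, where the endpoint and its paging parameter are again placeholders to be read off the real site.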

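Scraping company information from Tiantian Fund (天天基金網)

coding utf-8; import requests; import parsel; import re; import pandas as pd; def tiantianjijin_main: set the URL to scrape and the headers; the headers identify the browser as running on 64-bit Windows 10, browser engine ...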

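Scraping novels with Python

This one is a real pain: written the normal way, the first few runs work, and then, without changing a single line, the next run fails. The likely cause is a failed network request or some anti-scraping measure. Either way it makes writing this kind of code very frustrating, so robustness is extremely important! import requests from bs4 import BeautifulSoup impo...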
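
One common way to get that robustness is to retry a failed request a few times with a pause in between, rather than letting a single network error stop the run. A minimal sketch follows. The helper name fetch_with_retry, the retry count and the delay are illustrative assumptions, and the fetch function is passed in so the logic can be tried without a network connection.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping `delay` seconds after each failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "page content for " + url


result = fetch_with_retry(flaky_fetch, "http://example.com/chapter1", retries=3, delay=0)
print(result)
```

With requests, fetch could be a small function that returns requests.get(url, timeout=10).text, and the except clause could be narrowed to requests.RequestException so that programming errors are not retried.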
