Learning Web Crawlers: Crawling Sina News

Study reference: Python網路爬蟲實戰 (Python Web Crawling in Practice)

The source code is as follows:

import json

import pandas
import requests
from bs4 import BeautifulSoup

results = []  # one dict per news item

# URL of the news-list API (left empty in the source)
zturl = ''
res = requests.get(zturl)
# The response is JSONP: strip the callback wrapper to get plain JSON
jd = json.loads(res.text.lstrip('  newsloadercallback(').rstrip(');'))


def getcomment(news_url):
    """Return the total comment count for one news article."""
    # The news ID is the file name without the 'doc-i' prefix and '.shtml' suffix
    newsid = news_url.split('/')[-1].lstrip('doc-i').rstrip('.shtml')
    # The base of the comment API URL is missing from the source;
    # only the query string survives
    commenturl = ('format=js&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8&'
                  'page=1&page_size=20').format(newsid)
    res2 = requests.get(commenturl)
    jd1 = json.loads(res2.text.lstrip('var data='))
    return jd1['result']['count']['total']


for allnews in jd['result']['data']:
    newssum = {}
    res1 = requests.get(allnews['url'])
    res1.encoding = 'utf-8'
    soup = BeautifulSoup(res1.text, 'html.parser')
    # The first child of .time-source is the publication time
    newssum['time'] = soup.select('.time-source')[0].contents[0].strip()
    newssum['title'] = allnews['title']
    newssum['url'] = allnews['url']
    # newssum['article'] = soup.select('.article')[0].text
    newssum['comment'] = getcomment(allnews['url'])
    newssum['source'] = soup.select('.time-source a')[0]['href']
    results.append(newssum)  # collect the row for the DataFrame

# Write all collected rows to an Excel file
df = pandas.DataFrame(results)
df.to_excel('news.xlsx')
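
One detail worth knowing about the script above: str.lstrip and str.rstrip remove any characters from the given set, not an exact prefix or suffix, so they can eat into the payload when it happens to begin or end with one of those characters. The following is a minimal sketch of a stricter alternative, not the method of the original tutorial. The helper name unwrap_jsonp and the sample string are assumptions made up for illustration.

```python
import json


def unwrap_jsonp(text, prefix, suffix):
    """Hypothetical helper: remove an exact prefix/suffix, then parse JSON."""
    text = text.strip()
    # Remove the wrapper only when it matches exactly
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return json.loads(text)


# Made-up sample that mimics the shape of a JSONP response
sample = '  newsloadercallback({"result": {"data": []}});'
parsed = unwrap_jsonp(sample, 'newsloadercallback(', ');')
print(parsed['result']['data'])
```

The same helper could stand in for the lstrip('var data=') call by passing 'var data=' as the prefix and an empty string as the suffix.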

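Of course, you will also run into some problems. For example, the error "IndexError: list index out of range" may appear. This happens when the original news link has been deleted: the page no longer contains the expected element, so the list returned by the selector is empty and indexing it with [0] fails. Other problems of this kind can occur as well.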
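
One way to avoid this error is to check whether the list is empty before indexing it. The sketch below uses plain lists as stand-ins for what soup.select(...) returns, so it runs without a network connection. The helper name first_or_default and the sample values are assumptions for illustration only.

```python
def first_or_default(matches, default=None):
    """Return the first element of `matches`, or `default` if it is empty."""
    return matches[0] if matches else default


# Stand-ins for the result of soup.select(...): a list of matches,
# which is empty when the news page has been deleted
found = ['2017-05-01 10:00']
missing = []

print(first_or_default(found))
print(first_or_default(missing, default='N/A'))
```

In the crawler loop, a None result could be used to skip the deleted article with continue instead of letting the whole run stop.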
