Python Web Scraping, Day 2

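Converting time strings

contents returns the child nodes of a tag

strftime formats a datetime as a string (strptime does the reverse and parses a string into a datetime)

To extract the article body, write the selector argument as the container followed by a space and the next-level tag, for example '#artibody p'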
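The time conversion noted above can be sketched with the standard library alone. The sample timestamp is made up for illustration; only its layout follows the format string used later in this post.

```python
from datetime import datetime

# Made-up timestamp text, laid out like the page timestamp parsed later on.
raw = '2016年05月03日09:40'

# strptime parses text into a datetime object.
dt = datetime.strptime(raw, '%Y年%m月%d日%H:%M')

# strftime goes the other way and renders a datetime as text.
formatted = dt.strftime('%Y-%m-%d %H:%M')
print(formatted)
```

Format codes are case-sensitive: %Y is a four-digit year and %y a two-digit one, while %M means minutes and %m means month.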

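import requests
import json

# comments is the response returned by requests.get for the comment URL;
# 'part to strip' stands for the non-JSON text wrapped around the body
jd = json.loads(comments.text.strip('part to strip'))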
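One detail worth knowing about the line above is that str.strip treats its argument as a set of characters rather than a literal prefix. The sketch below shows a prefix-safe alternative; the response body is a hypothetical sample, not real data from any site.

```python
import json

# Hypothetical response body, shaped like the one this post parses.
raw = 'var data={"result": {"count": {"total": 42}}}'
prefix = 'var data='

# str.strip(chars) treats its argument as a set of characters, so a string made
# only of those characters is stripped away entirely.
stripped = 'data'.strip(prefix)

# Slicing off the literal prefix leaves the JSON text untouched.
payload = raw[len(prefix):] if raw.startswith(prefix) else raw
jd = json.loads(payload)
total = jd['result']['count']['total']
print(total)
```

strip works in the post's case because the JSON body begins with '{' and ends with '}', neither of which is in the character set, but slicing off the prefix does not rely on that.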

Wrap the extraction logic in functions:

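commenturl = '網{}址'  # placeholder for the comment URL, with {} in place of the newsid

import re
import json

def getcommentcounts(newsurl):
    # The newsid is the part of the article URL between 'doc-i' and '.shtml'
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    # Remove the 'var data=' wrapper so that only the JSON remains
    jd = json.loads(comments.text.strip('var data='))
    return jd['result']['count']['total']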
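The newsid extraction can be tried offline. This is a minimal sketch: extract_newsid is a hypothetical helper name, and the example.com URLs are invented for illustration.

```python
import re

def extract_newsid(newsurl):
    """Return the text between 'doc-i' and '.shtml', or None if absent."""
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    # re.search returns None when nothing matches, so guard before group().
    return m.group(1) if m else None

# Invented URLs that only imitate the 'doc-i<newsid>.shtml' pattern.
sample_url = 'http://news.example.com/c/2016-05-03/doc-ifxryhhh1234567.shtml'
newsid = extract_newsid(sample_url)
missing = extract_newsid('http://news.example.com/index.html')
print(newsid, missing)
```

Checking for None avoids an AttributeError on URLs that do not follow the pattern.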

import requests
from datetime import datetime
from bs4 import BeautifulSoup

def getnewsdetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#artibodytitle')[0].text
    result['newssource'] = soup.select('.time-source span a')[0].text
    # contents[0] is the first child of the element, here the timestamp text
    timesource = soup.select('.time-source')[0].contents[0].strip()
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    # Join every paragraph of the body except the last one
    result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    result['comments'] = getcommentcounts(newsurl)
    return result

This returns a dictionary of information for one news article. Calling the function in a loop lets you scrape multiple articles.
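One possible shape for that loop is sketched below. scrape_all and fake_fetch are hypothetical names: fake_fetch stands in for getnewsdetail so the example runs without network access, and the URLs are invented.

```python
def scrape_all(newsurls, fetch):
    """Call fetch on every URL and collect the dictionaries it returns."""
    results = []
    for url in newsurls:
        try:
            results.append(fetch(url))
        except Exception as exc:
            # One broken page should not abort the whole run.
            print('skipped', url, exc)
    return results


def fake_fetch(url):
    # Hypothetical stand-in for getnewsdetail so the sketch runs offline.
    if 'bad' in url:
        raise ValueError('page layout not recognised')
    return {'url': url, 'title': 'stub title'}


urls = [
    'http://news.example.com/doc-i001.shtml',
    'http://news.example.com/bad-page.html',
    'http://news.example.com/doc-i002.shtml',
]
news = scrape_all(urls, fake_fetch)
print(len(news))
```

In real use you would pass getnewsdetail as fetch. Catching the exception per URL is a design choice here, so that a page with an unexpected layout is skipped rather than stopping the crawl.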
