0. Get the click count from a news URL and wrap it up as a function
1. Get the news details from a news URL: a dict, anews
3. Generate the URLs of all list pages and fetch all the news: list.extend(list), allnews
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import locale
import re
locale.setlocale(locale.LC_CTYPE, 'chinese')
def getclickcount(newsurl):
    newsid = re.findall('\_(.*).html', newsurl)[0].split('/')[1]  # extract the news id with a regular expression
    clickurl = ''.format(newsid)  # the click-counter API URL was omitted in the original; fill in the real endpoint
    clickstr = requests.get(clickurl).text
    return re.search("hits'\).html\('(.*)'\);", clickstr).group(1)
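As a quick sanity check of the id extraction, the regex can be run against a sample article URL; the URL below is a hypothetical example of the gzcc-style path the pattern assumes, not one from the original:

import re
sample = 'http://news.gzcc.cn/html/2018/xiaoyuanxinwen_0404/9183.html'  # hypothetical URL
print(re.findall('\_(.*).html', sample)[0])                # -> '0404/9183'
print(re.findall('\_(.*).html', sample)[0].split('/')[1])  # -> '9183', the news id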
def getnewscontent(content):
    f = open('gzccnews.txt', 'a', encoding='utf8')  # append each article's text to one file
    f.write(content)
    f.close()
def getnewdetail(newsurl):
    resd = requests.get(newsurl)  # returns a response object
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    info = soupd.select('.show-info')[0].text
    # pull the 'YYYY-MM-DD HH:MM:SS' timestamp out of the info line
    time = re.search('\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', info).group(0)
    dtime = datetime.strptime(time, '%Y-%m-%d %H:%M:%S')
    # the labels below ('作者:' etc.) match the field labels on the article page
    if info.find('作者:') > 0:
        author = info[info.find('作者:'):].split()[0].lstrip('作者:')
    else:
        author = '無'
    if info.find('審核:') > 0:
        check = info[info.find('審核:'):].split()[0].lstrip('審核:')
    else:
        check = '無'
    if info.find('來源:') > 0:
        source = info[info.find('來源:'):].split()[0].lstrip('來源:')
    else:
        source = '無'
    if info.find('攝影:') > 0:
        photo = info[info.find('攝影:'):].split()[0].lstrip('攝影:')
    else:
        photo = '無'
    clickcount = getclickcount(newsurl)
    print('click count: ' + clickcount)
    content = soupd.select('.show-content')[0].text
    getnewscontent(content)
    # print(content)
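Heading 1 asks for the details as a dict (anews), while getnewdetail above prints and saves them instead. Here is a minimal sketch of the dict variant; the field names and the '.show-title' selector are assumed by analogy, not taken from the original:

def getnewdetaildict(newsurl):
    # hypothetical helper: same parsing as getnewdetail, but returns a dict
    resd = requests.get(newsurl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    anews = {}
    anews['url'] = newsurl
    anews['title'] = soupd.select('.show-title')[0].text  # selector assumed by analogy with '.show-info'
    anews['clickcount'] = getclickcount(newsurl)
    return anews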
def getliurl(listpageurl):
    res = requests.get(listpageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # print(soup.select('li'))
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:  # only real news items carry this class
            a = news.a.attrs['href']
            getnewdetail(a)
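Heading 3 mentions building allnews with list.extend(list); since getliurl prints as it goes, here is a hedged sketch of a variant that returns each page's news as a list so the results can be accumulated (getnewdetaildict is the hypothetical helper sketched above):

allnews = []
def getlistpage(listpageurl):
    # hypothetical variant of getliurl that collects instead of printing
    res = requests.get(listpageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newslist.append(getnewdetaildict(news.a.attrs['href']))
    return newslist
# usage: allnews.extend(getlistpage(pageurl)) for each list page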
firsturl = ''  # the list-page URL was omitted in the original
print('page 1:')
getliurl(firsturl)
res = requests.get(firsturl)
res.encoding = 'utf-8'
soupn = BeautifulSoup(res.text, 'html.parser')
# '.a1' holds the total article count ending in '條'; 10 articles per list page
n = int(soupn.select('.a1')[0].text.rstrip('條')) // 10 + 1
# for i in range(2, n):
#     pageurl = '{}.html'.format(i)  # the page-URL template was omitted in the original
#     print('page {}:'.format(i))
#     getliurl(pageurl)
#     break
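To make the page arithmetic concrete, a worked example of the count (the figure 608 is illustrative, not from the original):

print(int('608條'.rstrip('條')) // 10 + 1)  # 608 articles at 10 per page -> 61 list pages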
4. Set a reasonable crawl interval

import time
import random
for i in range(5):
    time.sleep(random.random() * 3)  # sleep a random 0-3 seconds between requests

5. Do simple data processing with pandas and save the result
Save to a csv or excel file:

import pandas as pd
pd.Series(allnews)
newsdf = pd.DataFrame(allnews)
newsdf.to_csv(r'f:\duym\爬蟲\gzccnews.csv')
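The same frame can also be written to a SQL file; a minimal sketch using sqlite3, where the database file and table names are assumptions rather than values from the original:

import sqlite3
with sqlite3.connect('gzccnewsdb.sqlite') as db:  # hypothetical database file
    newsdf.to_sql('gzccnews', con=db, if_exists='replace')  # hypothetical table name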