This homework is a classmate's: because I did not implement extraction of the news information, I could not add the news fields to the dictionary. I have practised the relevant pandas methods and exported an Excel file. PS: I will fix my own version as soon as possible!
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas
# Get the click count for a news article
def getnewsid(url):
    # Take the last four digits of the article id from the URL
    newsid = re.findall(r'\_(.*).html', url)[0][-4:]
    # The click-count API URL was elided in the original post;
    # the empty string below is a placeholder
    clickurl = ''.format(newsid)
    clickres = requests.get(clickurl)
    # Extract the click count with a regular expression
    clickcount = int(re.search(r"hits'\).html\('(.*)'\);", clickres.text).group(1))
    return clickcount
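To see what the regular expression in getnewsid() is doing, here is a minimal sketch run against an invented sample of the JavaScript snippet such click-count APIs typically return (the sample text is an assumption for illustration, not the real response):

```python
import re

# Hypothetical response body from the (elided) click-count API
sample = "$('#hits').html('5423');"

# Same pattern as in getnewsid() above: capture the number between
# .html(' and ');
clickcount = int(re.search(r"hits'\).html\('(.*)'\);", sample).group(1))
print(clickcount)  # 5423
```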
# Append the news body to a file, so earlier content is not overwritten
def writenewscontenttofile(content):
    f = open('gzccnews.txt', 'a', encoding='utf-8')
    f.write(content)
    f.close()
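The 'a' mode is what keeps each article from overwriting the previous one: repeated opens append rather than truncate. A small self-contained demonstration (the file path here is a temporary demo file, not the gzccnews.txt used above):

```python
import os
import tempfile

# Demo file in the system temp directory
path = os.path.join(tempfile.gettempdir(), 'gzccnews_demo.txt')

# 'w' truncates the file and writes fresh content
with open(path, 'w', encoding='utf-8') as f:
    f.write('first\n')

# 'a' appends, so both lines survive
with open(path, 'a', encoding='utf-8') as f:
    f.write('second\n')

with open(path, encoding='utf-8') as f:
    print(f.read())  # first\nsecond\n
```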
# Get the details of one news article
def getnewsdetail(newsurl):
    resd = requests.get(newsurl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    newsdict = {}
    content = soupd.select('#content')[0].text
    writenewscontenttofile(content)
    info = soupd.select('.show-info')[0].text
    newsdict['title'] = soupd.select('.show-title')[0].text
    # Match the publication time string
    date = re.search(r'(\d.\d.\d\s\d.\d.\d)', info).group(1)
    # Pick out up to three optional fields. The field labels and regex
    # patterns for author, check and sources were lost in the original
    # post; the empty strings below are placeholders, so these branches
    # always fall through to 'none'
    if info.find('') > 0:
        newsdict['author'] = re.search('', info).group(1)
    else:
        newsdict['author'] = 'none'
    if info.find('') > 0:
        newsdict['check'] = re.search('', info).group(1)
    else:
        newsdict['check'] = 'none'
    if info.find('') > 0:
        newsdict['sources'] = re.search('', info).group(1)
    else:
        newsdict['sources'] = 'none'
    if info.find('攝影:') > 0:
        newsdict['photo'] = re.search(r'攝影:(.*)\s*點', info).group(1)
    else:
        newsdict['photo'] = 'none'
    # Convert the time string to a datetime object with datetime.strptime
    # (format codes must be %Y/%H/%M/%S, uppercase)
    newsdict['datetime'] = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    # Call getnewsid() to get the click count
    newsdict['click'] = getnewsid(newsurl)
    return newsdict
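The strptime call above is the step that turns the matched text into a real datetime value. A minimal sketch with an invented date string in the same format the page uses:

```python
from datetime import datetime

# Hypothetical date string, as matched from the .show-info text
date = '2018-04-11 14:01:33'

# Uppercase format codes: %Y four-digit year, %H 24-hour clock,
# %M minutes, %S seconds
dt = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
print(dt.year, dt.hour)  # 2018 14
```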
# Walk one list page and process every article on it
def getlistpage(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    for new in soup.select('li'):
        if len(new.select('.news-list-title')) > 0:
            title = new.select('.news-list-title')[0].text
            description = new.select('.news-list-description')[0].text
            newsurl = new.select('a')[0]['href']
            # The contents of this list and the print format string were
            # elided in the original post; empty placeholders keep the
            # structure intact
            list = []
            print(''.format(title, description, newsurl))
            # Call getnewsdetail() to fetch the article details
            dict = getnewsdetail(newsurl)
            total.extend(list)
# total collects the scraped rows; its initial value was elided in the
# original post, but an empty list matches the extend() call above
total = []
# The news list-page URL was elided in the original post
listurl = ''
getlistpage(listurl)
res = requests.get(listurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# Page count: the '.a1' element holds the total item count (e.g. "238條"),
# with ten items per page
listcount = int(soup.select('.a1')[0].text.rstrip('條')) // 10 + 1
df = pandas.DataFrame(total)
df.to_excel('newsresult.xlsx')
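The page-count arithmetic above is easy to check in isolation. Assuming the '.a1' element's text looks like "238條" ("238 items"), rstrip() drops the trailing character and floor division gives the number of ten-item pages:

```python
# Hypothetical text of the .a1 element
a1_text = '238條'

# Strip the counter word, then 238 // 10 + 1 = 24 pages
listcount = int(a1_text.rstrip('條')) // 10 + 1
print(listcount)  # 24
```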
print(df[['click', 'author', 'datetime', 'sources']][(df['click'] > 3000) & (df['sources'] == u'學生處')])
print(df[(df['click'] > 3000) & (df['sources'] == '學校綜合辦')])
print(df[['click', 'author', 'sources']].head(6))
news_info = ['國際學院', '學生工作處']
print(df[df['sources'].isin(news_info)])