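Data Structuring and Storage

1. Save the body text of each news article to a text file.

2. Structure the news data as a list of dictionaries.

3. Install pandas and call pandas.DataFrame(newstotal) to create a DataFrame object, df.

4. Use df to save the extracted data to a CSV or Excel file.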
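
Steps 2 to 4 can be tried in isolation before the crawler is wired in. The following is a minimal sketch with made-up records: the field values and the file name news_sample.csv are assumptions for illustration, not data from the site.

```python
import pandas

# Hypothetical records shaped like the dictionaries the crawler builds.
newstotal = [
    {'title': 'Sample A', 'source': 'Office A', 'click': 120},
    {'title': 'Sample B', 'source': 'Office B', 'click': 4500},
]

# A list of dictionaries becomes one row per dictionary, one column per key.
df = pandas.DataFrame(newstotal)

# utf_8_sig writes a BOM so that Excel detects UTF-8 when opening the CSV.
df.to_csv('news_sample.csv', index=False, encoding='utf_8_sig')

# Reading the file back confirms the round trip.
restored = pandas.read_csv('news_sample.csv', encoding='utf_8_sig')
print(restored)
```

Calling df.to_excel('news.xlsx') works the same way, but pandas needs an Excel writer package such as openpyxl installed for .xlsx output.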

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas

# Save the body text of a news article to a text file (append mode).
def writenewsdetail(content):
    with open('gzccnews.txt', 'a', encoding='utf-8') as f:
        f.write(content)

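# Click count of one news article
def getclickcount(newsurl):
    # Pull the numeric news id out of the article URL.
    newid = re.search(r'_(.*)\.html', newsurl).group(1).split('/')[1]
    # The click-count URL template is missing from the source; fill it in before running.
    clickurl = "".format(newid)
    # Keep only the number from the tail of the response text.
    return int(requests.get(clickurl).text.split('.html')[-1].lstrip("('").rstrip("');"))

# All the information for one news article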
def getnewsdetail(newsurl):
    resd = requests.get(newsurl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')  # open and parse the news detail page
    news = {}
    news['title'] = soupd.select('.show-title')[0].text
    info = soupd.select('.show-info')[0].text
    # The label strings passed to lstrip('') and find('') below are missing
    # from the source; only '攝影：' (photographer) survived.
    news['dt'] = datetime.strptime(info.lstrip('')[0:19], '%Y-%m-%d %H:%M:%S')
    if info.find('') > 0:
        news['source'] = info[info.find(''):].split()[0].lstrip('')
    else:
        news['source'] = 'none'
    if info.find('') > 0:
        news['author'] = info[info.find(''):].split()[0].lstrip('')
    else:
        news['author'] = 'none'
    if info.find('攝影：') > 0:  # 攝影： (photographer)
        news['photograph'] = info[info.find('攝影：'):].split()[0].lstrip('攝影：')
    else:
        news['photograph'] = 'none'
    if info.find('') > 0:
        news['auditing'] = info[info.find(''):].split()[0].lstrip('')
    else:
        news['auditing'] = 'none'
    news['content'] = soupd.select('.show-content')[0].text.strip()
    writenewsdetail(news['content'])
    news['click'] = getclickcount(newsurl)
    news['newsurl'] = newsurl
    return news

# All the news on one list page
def getlistpage(pageurl):
    res = requests.get(pageurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newsurl = news.select('a')[0].attrs['href']  # link to the article
            newslist.append(getnewsdetail(newsurl))  # one dictionary per article
    return newslist

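# Total number of news list pages
def getpagen():
    res = requests.get('')  # the list-page URL is missing from the source
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # '.a1' holds the total article count followed by the character '條'.
    n = int(soup.select('.a1')[0].text.rstrip('條'))
    return n // 10 + 1  # 10 articles per page
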
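newstotal = []
firstpageurl = ''  # the first list-page URL is missing from the source
newstotal.extend(getlistpage(firstpageurl))
n = getpagen()
# range(n, n + 1) fetches only the last list page; widen the range to crawl more.
for i in range(n, n + 1):
    listpageurl = '{}.html'.format(i)  # the URL prefix is missing from the source
    newstotal.extend(getlistpage(listpageurl))
dt = pandas.DataFrame(newstotal)
dt.to_excel("news.xlsx")
print(dt)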
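
The repeated find/split/lstrip blocks in getnewsdetail could be folded into one helper. This is a sketch rather than the original author's code: extract_field is a hypothetical name and the sample info line is invented, with only the '攝影：' label taken from the code above. It slices the label off by length instead of calling lstrip, because lstrip removes a set of characters rather than a prefix and can eat the start of the value.

```python
def extract_field(info, label, default='none'):
    """Return the whitespace-delimited token that follows label, or default."""
    pos = info.find(label)
    if pos < 0:  # find() returns -1 when the label is absent
        return default
    rest = info[pos + len(label):].split()
    return rest[0] if rest else default


# Invented info line for illustration; the real page text may differ.
sample = '2018-04-01 11:57:00 攝影：影子 other text'
print(extract_field(sample, '攝影：'))    # token right after the label
print(extract_field(sample, 'missing:'))  # falls back to the default

# The lstrip pitfall: every leading character found in the argument is
# removed, so a value that starts with '影' loses that character too.
print('攝影：影子'.lstrip('攝影：'))
```

The helper also tests for `pos < 0` rather than `> 0`, so a label sitting at the very start of the line is still recognised.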

5. Use the functions and methods that pandas provides for data analysis:

Extract the first 6 rows of the click count, title, and source columns:
print(dt[['click', 'title', 'source']].head(6))

Extract the news published by '學校綜合辦' with a click count above 3000:
sourcelist = ['學校綜合辦']
print(dt[dt['source'].isin(sourcelist) & (dt['click'] > 3000)])

Extract the news published by '國際學院' and '學生工作處':
sourcelist = ['學生工作處', '國際學院']
print(dt[dt['source'].isin(sourcelist)])
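
Combining two conditions is the step most likely to go wrong, so here is a small sketch on invented rows (the titles and click counts are assumptions; the source names reuse the ones from the queries above). Each condition produces a boolean Series, and the two are combined into a single mask before indexing the DataFrame once.

```python
import pandas

# Hypothetical rows standing in for the crawled data.
dt = pandas.DataFrame([
    {'title': 'A', 'source': '學校綜合辦', 'click': 3500},
    {'title': 'B', 'source': '學校綜合辦', 'click': 800},
    {'title': 'C', 'source': '國際學院', 'click': 5000},
])

# & combines the boolean Series element-wise. The comparison needs
# parentheses because & binds more tightly than > in Python.
mask = dt['source'].isin(['學校綜合辦']) & (dt['click'] > 3000)
popular = dt[mask]
print(popular)
```

Applying `&` to two already-filtered DataFrames, as in `dt[cond1] & dt[cond2]`, does not select rows; the conditions have to be combined first and used as one index.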
