1. Save the body text of the news article to a text file.
soup = BeautifulSoup(res.text, 'html.parser')
content = soup.select('.show-content')[0].text
f = open('news.txt', 'w', encoding='utf-8')
f.write(content)
f.close()

2. Structure the news data as a list of dictionaries:
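A minimal sketch of step 2, with made-up field values; the helper name make_news_dict is illustrative, but the keys match the ones used later (title, describe, clickcount):

```python
# Structure scraped news items as a list of dictionaries:
# one dict per news item, one key per field.
newstotal = []

def make_news_dict(title, describe, clickcount):
    # pandas can build a DataFrame directly from such a list.
    return {'title': title, 'describe': describe, 'clickcount': clickcount}

newstotal.append(make_news_dict('campus news', 'a short summary', 1024))
newstotal.append(make_news_dict('another item', 'more text', 3500))
print(len(newstotal))  # 2
```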
3. Install pandas and use pandas.DataFrame(newstotal) to create a DataFrame object df.
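For example (a sketch with made-up rows; newstotal stands for the list of dictionaries built in step 2):

```python
import pandas

# Each dict becomes a row, each key becomes a column.
newstotal = [
    {'title': 'news A', 'clickcount': 1200},
    {'title': 'news B', 'clickcount': 2500},
]
df = pandas.DataFrame(newstotal)
print(df.shape)  # (2, 2)
```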
4. Use df to save the extracted data to a CSV or Excel file.
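A small sketch of step 4 with a one-row stand-in DataFrame; the filenames are arbitrary:

```python
import os
import pandas

df = pandas.DataFrame([{'title': 'news A', 'clickcount': 1200}])
# to_csv has no extra dependency; df.to_excel('news.xlsx') writes an
# Excel file the same way but requires the openpyxl package.
df.to_csv('news.csv', index=False, encoding='utf-8')
print(os.path.exists('news.csv'))  # True
```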
5. Use the functions and methods provided by pandas for data analysis:
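One kind of analysis used below is boolean filtering; a self-contained sketch with invented rows (mirroring the super = df[...] query at the end of the full code):

```python
import pandas

df = pandas.DataFrame([
    {'title': 'news A', 'clickcount': 1200, 'source': 'office'},
    {'title': 'news B', 'clickcount': 2500, 'source': 'office'},
    {'title': 'news C', 'clickcount': 3100, 'source': 'library'},
])
# Combine two boolean conditions with & (note the parentheses).
hot = df[(df['clickcount'] > 2000) & (df['source'] == 'office')]
print(hot['title'].tolist())  # ['news B']
```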
Screenshot as follows:

import requests
import re
import pandas
import openpyxl  # required by DataFrame.to_excel
from bs4 import BeautifulSoup
from datetime import datetime

homepage = ''  # URL elided in the original post
res = requests.get(homepage)  # returns a Response object
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
# '條' is the counter word in the page's "N 條" total
newscount = int(soup.select('.a1')[0].text.split('條')[0])
newspages = newscount // 10 + 1
alllistnews = []


def get_new_click_count(click_url):
    # The click-count endpoint returns a JavaScript snippet;
    # extract the number between ('#hits').html(' and ');
    res = requests.get(click_url)
    res.encoding = 'utf-8'
    text = BeautifulSoup(res.text, 'html.parser').text
    return text.split("('#hits').html")[1].lstrip("('").rstrip("');")


def get_new_click_content(newurl):
    # Split the detail page's .show-info line into its fields.
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    info = soup.select('.show-info')[0].text.split()
    distributetime = info[0]
    author = info[2]
    trial = info[3]
    orgin = info[4]
    photograph = info[5].lstrip('攝影:')
    return distributetime, author, trial, orgin, photograph


def get_all_news(newurl):
    allnews = []  # kept local so pages are not duplicated when extending
    res = requests.get(newurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newlist = soup.select('.news-list')[0].select('li')
    for newitem in newlist:
        dictionary = {}
        title = newitem.select('.news-list-title')[0].text
        describe = newitem.select('.news-list-description')[0].text
        newurl = newitem.a.attrs['href']
        newcontenturl = re.search(r'(\d+\.html)', newurl).group(1)
        newcontenturl2 = newcontenturl.rstrip('.html')
        click_url = '' + newcontenturl2 + '&modelid=80'  # base URL elided in the original
        newclicktimes = get_new_click_count(click_url)
        (dictionary['distributetime'], dictionary['author'],
         dictionary['trial'], dictionary['orgin'],
         dictionary['photograph']) = get_new_click_content(newurl)
        dictionary['title'] = title
        dictionary['describe'] = describe
        dictionary['clickcount'] = int(newclicktimes)
        allnews.append(dictionary)
    return allnews


for i in range(2, 6):
    page = '{}.html'.format(i)  # list-page URL pattern, prefix elided in the original
    alllistnews.extend(get_all_news(page))

df = pandas.DataFrame(alllistnews)
print(df)
df.to_excel('text.xlsx')
print(df.head(6))
# note: 'super' shadows the builtin; the filter assumes a 'source' column,
# while the scraper above stores the origin field under the key 'orgin'
super = df[(df['clickcount'] > 2000) & (df['source'] == '學校綜合辦')]
print(super)
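The click-count extraction in get_new_click_count is plain string surgery on the JavaScript returned by the hits endpoint. It can be tried offline; the response text below is a made-up stand-in for the real endpoint's output:

```python
# Stand-in for the JavaScript snippet returned by the click-count endpoint.
fake_response = "$('#hits').html('1234');"

# Same steps as get_new_click_count: split on the marker,
# then strip the surrounding ('...'); characters.
count = fake_response.split("('#hits').html")[1].lstrip("('").rstrip("');")
print(count)  # 1234
```

Note that the result is still a string; the full code converts it with int() before comparing against 2000.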