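1. Use the requests and BeautifulSoup libraries to scrape the title, link, body text, and show-info of each news item on the campus news home page.
2. Parse the info string to extract each item's publication time, author, photographer, and other fields.
3. Convert the publication time from a string to a datetime object.
4. Use a regular expression to extract the news ID.
5. Build the request URL for the click count.
6. Fetch the click count.
7. Wrap steps 4, 5, and 6 in a function: def getclickcount(newsurl):
8. Wrap the logic that fetches the news details in a function: def getnewdetail(newsurl):
9. Try using regular expressions to parse the show-info string and the click-count string.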
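Steps 3 and 4 can be tried offline before any request is made. The following is a minimal sketch under stated assumptions: the helper names `parse_publish_time` and `extract_news_id`, the sample timestamp, and the sample URL are all hypothetical, and the sketch assumes the news ID is the numeric file name just before `.html`.

```python
import re
from datetime import datetime


def parse_publish_time(text):
    """Convert a 'YYYY-MM-DD HH:MM:SS' string into a datetime object (step 3)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def extract_news_id(news_url):
    """Return the numeric file name before '.html', or None if absent (step 4)."""
    match = re.search(r"/(\d+)\.html$", news_url)
    return match.group(1) if match else None


# Hypothetical inputs, made up for illustration only.
sample_time = "2018-04-01 11:57:00"
sample_url = "http://news.example.com/html/2018/campus_0401/9167.html"

print(parse_publish_time(sample_time))
print(extract_news_id(sample_url))
```

Returning `None` for an unexpected URL, instead of calling `.group(1)` on a failed match, means a malformed link does not raise `AttributeError` in the middle of a crawl.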
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import locale
import re

# The 'chinese' locale name is typically accepted only on Windows.
locale.setlocale(locale.LC_CTYPE, 'chinese')

url = ""  # the listing-page URL is blank in the original post
res = requests.get(url)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
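
def getclickcount(newsurl):
    # Steps 4-6: extract the news ID, build the click-count URL, fetch the count.
    newsid = re.search(r"_(.*)\.html", newsurl).group(1)[-4:]
    clicktimesurl = ("").format(newsid)  # the URL template is blank in the original post
    responsetext = requests.get(clicktimesurl).text
    # The count is the quoted value after ".html(" at the end of the response.
    clicktimes = int(responsetext.split(".html(")[-1].lstrip("'").rstrip("');"))
    return clicktimes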

def getnewsdetail(newsurl):
    resdet = requests.get(newsurl)
    resdet.encoding = 'utf-8'
    soupdet = BeautifulSoup(resdet.text, 'html.parser')
    contentdetail = soupdet.select('#content')[0].text
    showinfo = soupdet.select('.show-info')[0].text
    # The lstrip prefix and the three patterns below are blank in the original post.
    date = showinfo.lstrip("")[:19]
    author = re.search('', showinfo).group(1)
    checker = re.search('', showinfo).group(1)
    source = re.search('', showinfo).group(1)
    # Use the function parameter rather than the global loop variable.
    clicktimes = getclickcount(newsurl)
    # Named pubtime so the local variable does not shadow the imported datetime class.
    pubtime = datetime.strptime(date, '%Y-%m-%d %H:%M:%S')
    # The format string is blank in the original post.
    print("".format(pubtime, author, checker, source, clicktimes))
    print(contentdetail)

# Walk the list items on the home page and keep only the news entries.
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        title = news.select('.news-list-title')[0].text
        description = news.select('.news-list-description')[0].text
        info = news.select('.news-list-info')[0].text
        address = news.select('a')[0]['href']
        # The format string is blank in the original post.
        print("".format(title, description, info, address))
        getnewsdetail(address)
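
For step 9, both strings can be parsed with regular expressions instead of slicing and stripping. The following is a sketch only: the show-info sample and its field labels are made up, the click-count sample is shaped after the `.html('...');` form implied by the `split` and `strip` calls above, and `parse_show_info` and `parse_click_count` are hypothetical helper names. The labels on the real page will differ and the patterns would need adjusting.

```python
import re
from datetime import datetime

# Hypothetical show-info text; labels and values are made up for illustration.
SAMPLE_INFO = "Published: 2018-04-01 11:57:00  Author: Alice  Reviewer: Bob  Source: News Office"
# Hypothetical click-count response body.
SAMPLE_CLICKS = "$('#hits').html('1234');"


def parse_show_info(info):
    """Extract the timestamp and labelled fields from a show-info string."""
    fields = {}
    time_match = re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", info)
    if time_match:
        fields["published"] = datetime.strptime(
            time_match.group(0), "%Y-%m-%d %H:%M:%S"
        )
    for label in ("Author", "Reviewer", "Source"):
        # A value runs until two or more whitespace characters or end of string.
        match = re.search(label + r":\s*(.+?)(?=\s{2,}|$)", info)
        if match:
            fields[label.lower()] = match.group(1)
    return fields


def parse_click_count(text):
    """Return the integer inside .html('...'), or None if it is not found."""
    match = re.search(r"\.html\('(\d+)'\)", text)
    return int(match.group(1)) if match else None


print(parse_show_info(SAMPLE_INFO))
print(parse_click_count(SAMPLE_CLICKS))
```

Because missing fields are skipped instead of raising, an item without a reviewer or source still yields a usable dictionary.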