The requirements for this assignment come from:

Import the required packages:
import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime
import time
import random
import pandas as pd
0. Get the click count from a news URL and wrap it in a function, click:
def click(url):
    # Extract the numeric news id from the URL
    id = re.findall('\d+', url)[-1]
    clickurl = "".format(id)  # click-count API URL template (elided in the original post)
    clicktext = requests.get(clickurl).text
    # The count is embedded in the trailing fragment of the response; slice it out
    click = int(clicktext.split('.html')[-1][2:-3])
    return click
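A quick check of the slicing logic in click, using a stand-in response string (the real click-count API URL is elided in this post, so the string below is only an assumed example of the common "('#hits').html('...')" reply pattern):

sample = "$('#hits').html('595');"              # hypothetical API response text
print(sample.split('.html')[-1])                # ('595');
print(int(sample.split('.html')[-1][2:-3]))     # 595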
Get the publish time:

def newsdt(showinfo):
    # e.g. the first whitespace-separated field is "發布時間:2019-04-01", the second "11:32:00"
    newsdate = showinfo.split()[0].split(':')[1]
    newstime = showinfo.split()[1]
    newsdt = newsdate + ' ' + newstime
    # strptime format codes must be upper-case: %Y, %H, %M, %S
    dt = datetime.strptime(newsdt, '%Y-%m-%d %H:%M:%S')
    return dt
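A usage sketch for newsdt, assuming a show-info string of the usual "發布時間:YYYY-MM-DD HH:MM:SS ..." shape (the actual page text is not reproduced in this post):

info = '發布時間:2019-04-01 11:32:00 作者: 審核: 來源:'   # hypothetical input
print(newsdt(info))   # 2019-04-01 11:32:00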
1. Get the news details from a news URL (as a dict), anews:

def anews(url):
    newsdetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsdetail['newstitle'] = soup.select('.show-title')[0].text
    showinfomation = soup.select('.show-info')[0].text
    newsdetail['newsdt'] = newsdt(showinfomation)
    newsdetail['newsclick'] = click(url)
    return newsdetail
newsurl = ""
print(anews(newsurl))
2. Get the news URLs from a list page:

def alist(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        # Only <li> elements carrying a news-list-title are real news entries
        if len(news.select('.news-list-title')) > 0:
            newsurl = news.select('a')[0]['href']
            newsdesc = news.select('.news-list-description')[0].text
            newsdict = anews(newsurl)
            newsdict['newsurl'] = newsurl
            newsdict['description'] = newsdesc
            newslist.append(newsdict)  # collect each article's detail dict
    return newslist
listurl = ''  # list-page URL (elided in the original post)
allnews = alist(listurl)
for newtro in allnews:
    print(newtro)
3. Generate the URLs of all list pages and fetch all the news; list.extend(list) accumulates everything into allnews.
*Each student crawls the 10 list pages starting from the last digit of their student ID.

allnews = []
for i in range(2, 12):
    listurl = '{}.html'.format(i)  # list-page URL prefix elided in the original
    allnews.extend(alist(listurl))
for n in allnews:
    print(n)

# Count the total number of news items crawled
print(len(allnews))
4. Set a reasonable crawl interval:

import time
import random
time.sleep(random.random() * 3)  # pause up to 3 seconds
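A single sleep outside the loop does not actually space the requests out; a minimal sketch of the step-3 loop with the interval applied per page (same elided URL prefix and alist function as above):

allnews = []
for i in range(2, 12):
    listurl = '{}.html'.format(i)       # URL prefix elided in the original
    allnews.extend(alist(listurl))
    time.sleep(random.random() * 3)     # pause 0-3 seconds between list pages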
5. Use pandas for simple data processing and save the results.

Save to a CSV or Excel file (the DataFrame must be built before it can be written):

newsdf = pd.DataFrame(allnews)
newsdf.to_csv(r'f:\duym\爬蟲\gzccnews.csv')

# Demonstrate the crawl interval: pause a random 0-3 seconds, five times
for i in range(5):
    print(i)
    time.sleep(random.random() * 3)

print(newsdf)

# Save to a local CSV file
newsdf.to_csv(r'e:\大三用的軟體\pycharm community edition 2018.3.5\homework\testnews.csv', encoding='utf-8')
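The heading also allows saving to Excel, and the assignment requirements mention an SQL file as well; a hedged sketch of both, assuming the openpyxl package is installed for to_excel and using a local SQLite file (both output file names here are made up for illustration):

import sqlite3

newsdf.to_excel('gzccnews.xlsx', index=False)   # Excel; needs openpyxl

with sqlite3.connect('gzccnews.db') as db:      # SQLite via DataFrame.to_sql
    newsdf.to_sql('news', con=db, if_exists='replace', index=False)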