1. Get the news details from a news URL
2. Get the news URLs from a list-page URL
3. Generate the URLs of all list pages and fetch all the news
4. Set a reasonable crawl interval
5. Do simple data processing with pandas and save the result as CSV and SQL files
1. News details:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas as pd
import time
import random
import sqlite3
newsurl = ''  # news detail URL (omitted in the original)
listurl = ''  # list-page URL (omitted in the original)
def click(url):
    # extract the news id from the URL and query the click-count endpoint
    id = re.findall(r'(\d+)', url)[-1]
    clickurl = ''.format(id)  # click-count URL template (omitted in the original)
    resclick = requests.get(clickurl)
    # the response is a JS snippet; the count sits after the last ".html(" call
    newsclick = int(resclick.text.split('.html')[-1].lstrip("('").rstrip("');"))
    return newsclick
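# A minimal check of the parsing step above, assuming the endpoint returns a
# JS snippet of the shape sketched here (the sample string is an assumption,
# since the real click-count URL is omitted):
sample = "$('#todaydowns').html('12');$('#hits').html('2303');"
assert int(sample.split('.html')[-1].lstrip("('").rstrip("');")) == 2303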
def newsdt(showinfo):
    # the show-info line is of the form "label:date time ..."
    newsdate = showinfo.split()[0].split(':')[1]
    newstime = showinfo.split()[1]
    newsdt = newsdate + ' ' + newstime
    dt = datetime.strptime(newsdt, '%Y-%m-%d %H:%M:%S')
    return dt
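# A quick check of newsdt, assuming a show-info string shaped like
# "label:date time ..." (the label text itself is an assumption):
assert newsdt('發布時間:2019-04-01 11:57:00 作者:test') == datetime(2019, 4, 1, 11, 57, 0)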
def anews(url):  # get the news details from a news URL: a dict, anews
    newsdetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsdetail['newstitle'] = soup.select('.show-title')[0].text
    showinfo = soup.select('.show-info')[0].text
    newsdetail['newsdt'] = newsdt(showinfo)
    newsdetail['newsclick'] = click(url)  # was click(newsurl); use the function argument
    return newsdetail
def alist(listurl):
    # get the news URLs from a list-page URL and collect each news dict
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newslist = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:
            newsurl = news.select('a')[0]['href']
            newsdesc = news.select('.news-list-description')[0].text
            newsdict = anews(newsurl)
            newsdict['description'] = newsdesc
            newslist.append(newsdict)
    return newslist

alist(listurl)
res = requests.get('')  # list-page URL (omitted in the original)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for news in soup.select('li'):
    if len(news.select('.news-list-title')) > 0:
        newsurl = news.select('a')[0]['href']
        print(anews(newsurl))
allnews = []
for i in range(97, 107):  # crawl the 10 list pages starting from the last digits of the student ID
    listurl = '{}.html'.format(i)  # list-page URL prefix (omitted in the original)
    allnews.extend(alist(listurl))
print("allnewslength={}".format(len(allnews)))
print(allnews)
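# extend vs append, as used above: extend merges each page's list of dicts
# into allnews, whereas append would nest a list inside a list.
demo = [1]
demo.extend([2, 3])   # demo == [1, 2, 3]
demo.append([4, 5])   # demo == [1, 2, 3, [4, 5]]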
s1 = pd.Series([100, 23, 'bugingcode'])  # a quick pandas Series demo
print(s1)
pd.Series(anews(newsurl))  # a single news dict also converts to a Series
newsdf = pd.DataFrame(allnews)
for i in range(5):
    print(i)
    time.sleep(random.random() * 3)  # set the crawl interval
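# In a real crawl the pause belongs inside the page loop itself; a minimal
# sketch, reusing alist and the (omitted) list-page URL prefix from above:
#
#     for i in range(97, 107):
#         allnews.extend(alist('{}.html'.format(i)))
#         time.sleep(random.random() * 3)  # pause between page requests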
print(newsdf)
newsdf.to_csv(r'd:\py_file\gzcc.csv', encoding='utf_8_sig')  # save as CSV; utf_8_sig avoids garbled characters
with sqlite3.connect(r'd:\py_file\gzccnewsdb.sqlite') as db:  # save to a SQL database
    newsdf.to_sql('gzccnewsdb', db)
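# To sanity-check the saved files, both can be read back with pandas
# (same paths as above; read_sql_query queries the table written by to_sql):
check_csv = pd.read_csv(r'd:\py_file\gzcc.csv')
with sqlite3.connect(r'd:\py_file\gzccnewsdb.sqlite') as db:
    check_sql = pd.read_sql_query('SELECT * FROM gzccnewsdb', db)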
2. News list:
3. Save as a CSV file:
4. Save as a SQL file