Scraping historical weather data from a static web page with Python

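This post uses the Python libraries requests and BeautifulSoup to scrape the content of a static web page.

The example scrapes historical weather data from a weather site.

(To be updated.)

Python code attached.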

Most websites have a robots.txt file that tells crawlers which paths they may and may not access. To view it, append /robots.txt to the site's home page address.
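Building that address can be sketched with the standard library. This is a minimal illustration only: the helper name robots_url and the example.com address are made up here and are not part of the original code.

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin replace the whole path,
    # so every page on a site maps to the same robots.txt.
    return urljoin(page_url, "/robots.txt")


# example.com is a placeholder host, not the site discussed in this post.
print(robots_url("http://example.com/t/some_page.htm"))
```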

For example, for the site scraped here:

Contents of robots.txt

user-agent: *
disallow: /css/dhtmlxchart.css
disallow: /o/
disallow: /d/
disallow: /*/*/scripts/expressinstall.swf
disallow: /indexs.htm
disallow: /t/city/
disallow: /t/city_m/
disallow: /t/city_m2/
disallow: /t/_s/
disallow: /t/air/
disallow: /t/today/
disallow: /t/tomorrow/
disallow: /t/typhoon/
disallow: /t/wea_history/
disallow: /t/wea_hour/
disallow: /t/timezone/
disallow: /t/detect2009v2.php
disallow: /news/
disallow: /s/
disallow: /t/news/
disallow: /life/
disallow: /t/life/
disallow: /t/lifenews/
disallow: *.htm?*
disallow: *php?*
disallow: *?*
disallow: *index*
disallow: /t/shikuang/alert/
disallow: *?from=lm*

sitemap:

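A brief explanation:

user-agent: * means the rules that follow apply to every crawler (that is, to all search engines and other bots);

disallow: /t/wea_history/ means crawlers are asked not to access anything under the /t/wea_history/ path.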
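These rules can also be checked from Python with the standard library's urllib.robotparser. The sketch below is only an illustration: it feeds in three rules copied from the listing above, and the host example.com and the helper name is_allowed are placeholders. As far as I know, this parser matches rules by path prefix, so wildcard rules such as *?* are left out of the sample.

```python
from urllib.robotparser import RobotFileParser

# A few prefix rules taken from the robots.txt listing above.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /t/wea_history/",
    "Disallow: /news/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)


def is_allowed(path, user_agent="*"):
    """Return True if the rules permit user_agent to fetch path."""
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(user_agent, "http://example.com" + path)


print(is_allowed("/t/wea_history/201903.htm"))  # under a disallowed prefix
print(is_allowed("/about.htm"))                 # not covered by any rule
```

For a live site, RobotFileParser.set_url() followed by read() loads the file over HTTP instead of from a list of lines.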

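The ×× site disallows direct access to its historical temperature data, and the page returned by requests.get() contains only the current day's temperature table, because the historical table is loaded dynamically. The approach taken here is to open the monthly query page in a browser and save it locally as a static page through the right-click menu. The saved file turns out to contain the whole month's historical data.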
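One way to confirm the difference between the fetched page and the saved page is to count the rows of the first table in each. The sketch below uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The class name RowCounter, the helper count_rows, and both HTML snippets are invented for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count the <tr> elements inside the first <table> of a page."""

    def __init__(self):
        super().__init__()
        self.tables_seen = 0
        self.in_first_table = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables_seen += 1
            self.in_first_table = self.tables_seen == 1
        elif tag == "tr" and self.in_first_table:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_first_table = False


def count_rows(html_text):
    counter = RowCounter()
    counter.feed(html_text)
    return counter.rows


# Hypothetical snippets: a fetched page with one data row,
# and a browser-saved page with the rows filled in.
fetched = "<table><tr><th>Date</th></tr><tr><td>2019-03-31</td></tr></table>"
saved = (
    "<table><tr><th>Date</th></tr>"
    "<tr><td>2019-03-01</td></tr>"
    "<tr><td>2019-03-02</td></tr>"
    "<tr><td>2019-03-03</td></tr></table>"
)

print(count_rows(fetched), count_rows(saved))
```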

''' Python dependencies
'''
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
import sys

# Define the class
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup


class spyder:
    ''' Scrape weather information from the ×× weather site
    '''

    def __init__(self, method='url'):
        # 'url' fetches the page over HTTP; 'file' reads a saved .htm file
        self.method = method
        ## Regular expressions: each match is one non-empty line of text
        self.pat_content = r'.[^\n]*'
        self.pat_title = self.pat_content
        ## State
        self.soup = None
        self.weather_pd = pd.DataFrame()

    def read_htm(self, url):
        if self.method == 'url':
            # url = ''
            # Fetch the page with a GET request
            strhtml = requests.get(url)
            strhtml.encoding = 'utf-8'
            htm_text = strhtml.text
        elif self.method == 'file':
            # url_text = './res/spyder/順義歷史天氣查詢_歷史天氣預報查詢_2345天氣預報201903.htm'
            with open(url, "r", encoding="utf-8") as htm:
                htm_text = htm.read()
        else:
            raise ValueError("method must be 'url' or 'file'")
        # Parse the page
        self.soup = BeautifulSoup(htm_text, 'lxml')

    def res_weather(self):
        # Get the content we need (the weather table)
        data = self.soup.select('table')
        weather_table = data[0].get_text()
        ## Split the table text into rows with a regular expression:
        ## a blank line followed by five lines, one per column
        pat_line = '\n\n.*\n.*\n.*\n.*\n.*'
        weather_lins = re.findall(pat_line, weather_table)
        # Copy the rows into the pandas DataFrame; the first match is the header
        title = re.findall(self.pat_title, weather_lins[0][1:])
        weather_id = len(self.weather_pd)
        for wl in range(1, len(weather_lins)):
            cont = re.findall(self.pat_content, weather_lins[wl][1:])
            for ti in range(len(title)):
                self.weather_pd.loc[weather_id, title[ti]] = cont[ti]
            weather_id += 1

    def init_weather(self):
        # Reset the collected data
        self.weather_pd = pd.DataFrame()

    def get_weather(self):
        return self.weather_pd

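## Scraper
# The ×× site disallows direct access to its historical temperature data, and requests.get()
# returns only the current day's temperature table: the historical table is loaded dynamically.
# The approach here is to save the monthly query page locally from the browser's right-click
# menu; the saved file contains the month's historical data.
## Directory holding the saved static pages
spy_path = './res/spyder/'
url_list = []
if os.path.exists(spy_path):
    htm_files = os.listdir(spy_path)
    for fi in htm_files:
        url_file = spy_path + fi
        # Collect the path so the loop below can read it
        url_list.append(url_file)
## Scraper object
spy = spyder(method='file')
spy.init_weather()
# Loop over each month's htm file
for url in url_list:
    spy.read_htm(url)
    spy.res_weather()
## get weather
weather = spy.get_weather()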
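The row-splitting step in res_weather can be tried on its own, without a saved page. This is a minimal sketch that assumes the table text has exactly one blank line between rows and five columns per row, which is what pat_line expects. The sample text and the helper name parse_rows are made up for illustration.

```python
import re

PAT_LINE = '\n\n.*\n.*\n.*\n.*\n.*'  # same pattern as pat_line
PAT_CONTENT = r'.[^\n]*'             # same pattern as pat_content


def parse_rows(table_text):
    """Split table text into a list of dicts keyed by the header row."""
    lines = re.findall(PAT_LINE, table_text)
    if not lines:
        return []
    # Drop the first newline of each match, then take every non-empty line.
    title = re.findall(PAT_CONTENT, lines[0][1:])
    rows = []
    for line in lines[1:]:
        cont = re.findall(PAT_CONTENT, line[1:])
        rows.append(dict(zip(title, cont)))
    return rows


# Hypothetical table text: a header row and two data rows.
SAMPLE_TEXT = (
    "\n\nDate\nHigh\nLow\nWeather\nWind"
    "\n\n2019-03-01\n12\n1\nSunny\nNW 2"
    "\n\n2019-03-02\n14\n2\nCloudy\nS 1\n"
)

for row in parse_rows(SAMPLE_TEXT):
    print(row)
```

If a real page puts a different number of blank lines between rows, or has a different number of columns, the pattern has to be adjusted to match. A list of dicts like this can also be passed to pd.DataFrame() in one call instead of filling the frame cell by cell.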
