Python Web Scraping: Maoyan Movies

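#!/usr/bin/env python
# coding: utf-8
import json
import re
# Maoyan has added anti-scraping measures and stops responding when requests
# come too fast, so a delay is added between pages.
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        # The header values were lost from the source; fill in the headers
        # to send (for example a User-Agent) here.
        headers = {}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text  # get_one_page() returns the page as text
        return None
    except RequestException:
        return None

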
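def parse_one_page(html):
    # The HTML tags inside this pattern were stripped from the source; the
    # <dd>, </i>, </a>, </p> and </dd> literals are reconstructed assumptions.
    pattern = re.compile(
        '<dd>.*?board-index.*?>(.*?)</i>.*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>'
        '.*?releasetime.*?>(.*?)</p>.*?'
        'integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
        re.S)  # compile the regex that captures the fields to save
    items = re.findall(pattern, html)  # find every match of pattern in the html
    for item in items:  # pack each match into a dict
        # The original dict was lost from the source; these keys are assumed
        # from the class names that appear in the pattern.
        yield {
            'index': item[0],
            'name': item[1],
            'star': item[2].strip(),
            'releasetime': item[3],
            'score': item[4] + item[5],
        }

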
def write_to_file(content):  # append each record to result.txt as one JSON line
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


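def main(offset):  # process one page of the top 100 movies
    # The base URL was lost from the source; prepend it to the offset here.
    url = '' + str(offset)  # takes an offset
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':  # generate the offsets
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)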
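
To see how a lazy-matching pattern of this shape pulls fields out of a list entry, the sketch below runs it against a small inline sample instead of the live site. The sample markup, the tag literals in the pattern, the dict keys and the helper name parse_sample are all assumptions made for illustration; the real page may be structured differently.

```python
import re

# Hypothetical sample of one list entry; the real page markup may differ.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a href="/films/1" title="Example Movie">Example Movie</a></p>
  <p class="star">
      Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Released: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# re.S lets '.' match newlines, so one pattern can span a whole <dd> block.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>.*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>'
    r'.*?releasetime.*?>(.*?)</p>.*?'
    r'integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>',
    re.S)


def parse_sample(html):
    """Yield one dict per <dd> entry found in html (assumed key names)."""
    for index, name, star, releasetime, integer, fraction in PATTERN.findall(html):
        yield {
            'index': index,
            'name': name,
            'star': star.strip(),  # drop the whitespace around the cast line
            'releasetime': releasetime,
            'score': integer + fraction,  # the score is split across two tags
        }


for entry in parse_sample(SAMPLE_HTML):
    print(entry)
```

Each `(.*?)` group captures as little as possible up to the closing tag that follows it, which is why the closing tags matter: without them every group would match the empty string.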

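Python Web Scraping in Practice: A Maoyan Movies Case Study

Background: packet capture. For pages loaded asynchronously via AJAX, the URL that actually loads the data has to be found by capturing network traffic. To check whether a page is loaded asynchronously, right-click and view the page source; if the text in the source does not match what the front end displays, the page is loaded asynchronously. In that case, press F12 to open the Network tab of the developer tools and refresh the page to see the real URL. As shown in the figure below, in the developer tools the red-boxed ur...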
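
The source-versus-display check described above can be sketched in a few lines. This is only an illustration: the function name is_probably_async and both HTML strings are hypothetical, and a real check would compare the text fetched from the site with something visible in the browser.

```python
def is_probably_async(page_source, visible_text):
    """Return True when text seen in the browser is absent from the raw source."""
    return visible_text not in page_source


# Hypothetical raw sources: one rendered on the server, one filled in by script.
static_source = '<html><body><p class="name">Example Movie</p></body></html>'
ajax_source = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(is_probably_async(static_source, 'Example Movie'))  # False
print(is_probably_async(ajax_source, 'Example Movie'))    # True
```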

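Learning Python Web Scraping: Getting the Maoyan Movies Top 10

We use regular expressions for this task and write what we read to a text file. First, get the page's HTML. Note: do not rely on the developer tools to inspect the page source, because that source may differ from response.text. Then use the third-party Python library requests to fetch the page's HTML. Note: 1. Before fetching the source, we need to set the user age...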
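
As a minimal sketch of attaching a User-Agent header before fetching, the example below builds a request object without sending it. It uses the standard library's urllib.request rather than requests so that it runs without third-party packages; with requests, the same dict would be passed as the headers argument of requests.get. The URL and the User-Agent string are placeholder assumptions, not values from the original post.

```python
import urllib.request

# Hypothetical values for illustration only.
EXAMPLE_URL = 'http://example.com/board?offset=0'
EXAMPLE_USER_AGENT = 'Mozilla/5.0 (compatible; example-scraper)'


def build_request(url, user_agent):
    """Create a request object that carries a User-Agent header (nothing is sent)."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


request = build_request(EXAMPLE_URL, EXAMPLE_USER_AGENT)
# urllib stores header names capitalized as 'User-agent'.
print(request.get_header('User-agent'))
```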

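Python Web Scraping: Scraping Maoyan Movie Data

Define a function that fetches the Maoyan movie data: import requests def main url url html requests.get url text print html if name main main. Use regex matching to extract the information we want: dd i class board index board...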
