Scraping the Maoyan movie rankings with Python

The complete code is available here:

With some free time on my hands, I rewrote the regex-based HTML parsing using XPath and BeautifulSoup. Each approach has its own strengths.

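With regex, extraction happens in one continuous pass: a single pattern can pull out every field you need, provided the pattern is written correctly. That makes regex somewhat more complex to write than XPath or BeautifulSoup, and harder to debug when an extraction goes wrong.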
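To make the "one pass" point concrete, here is a minimal sketch that uses only the standard library. The markup in `SAMPLE_HTML` is a simplified, hypothetical stand-in for the ranking page, and `PATTERN` and `parse_sample_regex` are names invented for this illustration, not the pattern from the original code.

```python
import re

# Hypothetical, simplified markup standing in for one page of the ranking.
SAMPLE_HTML = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <a href="/films/1" title="Movie A" class="image-link"></a>
    <p class="star">Starring: Actor One</p>
    <p class="releasetime">Release date: 1993-01-01</p>
  </dd>
  <dd>
    <i class="board-index">2</i>
    <a href="/films/2" title="Movie B" class="image-link"></a>
    <p class="star">Starring: Actor Two</p>
    <p class="releasetime">Release date: 1994-09-10</p>
  </dd>
</dl>
"""

# A single pattern captures every field of one <dd> entry.
# re.S lets "." match newlines, so the pattern can span several lines.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>'
    r'.*?title="(.*?)"'
    r'.*?class="star">(.*?)</p>'
    r'.*?class="releasetime">(.*?)</p>',
    re.S,
)


def parse_sample_regex(html):
    """Yield one dict per movie; all fields come from the same match."""
    for rank, title, star, releasetime in PATTERN.findall(html):
        yield {
            'rank': rank,
            'title': title,
            'star': star.strip(),
            'releasetime': releasetime.strip(),
        }


movies = list(parse_sample_regex(SAMPLE_HTML))
for movie in movies:
    print(movie)
```

The trade-off described above shows up here too: if any one of the four sub-patterns is wrong, the entire entry fails to match, and nothing tells you which part was at fault.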

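XPath is easier to understand than BeautifulSoup when extracting information. Once you have learned a handful of rules, the extraction code is quick and fluent to write. The one shortcoming I noticed is nested queries: you can locate nodes by index and the like, but this is far less convenient than BeautifulSoup's find() and find_all() methods.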

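BeautifulSoup: of the three, this felt like the simplest code to write, and it is easier to understand than regex. A web page's markup is itself a structure of nested layers. The previous two approaches tend to fetch all matching nodes at once, whereas querying in this nested fashion feels easier to work with.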
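The difference between fetching all matching nodes at once and querying entry by entry can be sketched as follows. To keep the example runnable without third-party packages, it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml and BeautifulSoup. The `SAMPLE` markup and the `parse_nested` function are hypothetical. The second entry deliberately has no "star" paragraph.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the second <dd> has no "star" paragraph.
SAMPLE = """
<dl>
  <dd><i>1</i><a class="image-link" title="Movie A"/><p class="star">Starring: Actor One</p></dd>
  <dd><i>2</i><a class="image-link" title="Movie B"/></dd>
  <dd><i>3</i><a class="image-link" title="Movie C"/><p class="star">Starring: Actor Three</p></dd>
</dl>
"""

root = ET.fromstring(SAMPLE)

# Flat extraction: build one list per field, then pair them up by index.
titles = [a.get('title') for a in root.findall(".//a[@class='image-link']")]
stars = [p.text for p in root.findall(".//p[@class='star']")]
flat = list(zip(titles, stars))


def parse_nested(root):
    """Nested extraction: query each <dd> on its own, so fields stay aligned."""
    for dd in root.findall('.//dd'):
        star = dd.find("p[@class='star']")
        # A missing field becomes None instead of shifting later entries.
        yield dd.find('a').get('title'), (star.text if star is not None else None)


nested = list(parse_nested(root))

print(flat)    # "Movie B" ends up paired with the third entry's star text
print(nested)  # "Movie B" is paired with None
```

With the flat lists, one missing node silently shifts every later value, which is why the per-entry style pairs naturally with an explicit check for `None`.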

These are all just my personal opinions; corrections are welcome.

Extraction with XPath:

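from lxml import etree


def parse_one_page_xpath(html):
    # lxml's HTML parsing entry point is etree.HTML (upper case).
    html = etree.HTML(html)
    rank = html.xpath('//dd/i/text()')
    title = html.xpath('//dl//dd/a/@title')
    star = html.xpath('//p[@class="star"]/text()')
    releasetime = html.xpath('//p[@class="releasetime"]/text()')
    # Assumption: each score is split across two <i> elements (integer and
    # fraction parts), so this list holds two entries per movie.
    score = html.xpath('//p[@class="score"]//i//text()')
    for i in range(len(title)):
        # The original yield body was lost; this dict is an assumed reconstruction.
        yield {
            'rank': rank[i],
            'title': title[i],
            'star': star[i].strip(),
            'releasetime': releasetime[i].strip(),
            'score': score[2 * i] + score[2 * i + 1],
        }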

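Extraction with BeautifulSoup:

from bs4 import BeautifulSoup


def parse_one_page_beautifulsoup(html):
    soup = BeautifulSoup(html, 'lxml')
    for dd in soup.select('dd'):
        rank = dd.find('i').string
        name = dd.find(class_='image-link')['title']
        star = dd.find(class_='star')
        if star is not None:
            # Strip surrounding whitespace, then drop the 3-character label prefix.
            star = star.string.strip()[3:]
        # Named releasetime rather than time to avoid shadowing the time module.
        releasetime = dd.find(class_='releasetime')
        if releasetime is not None:
            # Drop the 5-character label prefix.
            releasetime = releasetime.string.strip()[5:]
        # The original yield body was lost; this dict is an assumed reconstruction.
        yield {'rank': rank, 'name': name, 'star': star, 'releasetime': releasetime}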
