Python Web Scraping: Parsing with XPath

2022-09-22 03:09:08

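Instantiate an etree object and load the source of the page to be parsed into it.

Call the etree object's xpath method with an XPath expression to locate tags and capture their content.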
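As a minimal sketch of these two steps, assuming lxml is already installed, the snippet below parses a small inline page and pulls out link text and href values. The HTML, class name, and titles are made up for illustration and do not come from any real site.

```python
from lxml import etree

# Hypothetical page source, used only for illustration.
page_text = """
<html><body>
  <ul class="books">
    <li><a href="/b/1">Dune</a></li>
    <li><a href="/b/2">Emma</a></li>
  </ul>
</body></html>
"""

# Step 1: load the page source into an etree object.
tree = etree.HTML(page_text)

# Step 2: locate tags with XPath and capture text or attribute values.
titles = tree.xpath("//ul[@class='books']/li/a/text()")
links = tree.xpath("//ul[@class='books']/li/a/@href")

print(titles)  # ['Dune', 'Emma']
print(links)   # ['/b/1', '/b/2']
```

Note that xpath always returns a list, even when only one node matches.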

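Install lxml and import etree:

pip install lxml

from lxml import etree

1. Load the source of a local HTML file into an etree object:

etree.parse(filepath)

2. Load source fetched from the internet into the object:

etree.HTML(page_text)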
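The two loading routes can be compared side by side. This is a sketch under assumptions: the markup and the temporary file are invented for the demo, and the file is parsed with an explicit etree.HTMLParser() because etree.parse uses the XML parser by default, which rejects typical HTML such as an unclosed br tag.

```python
import os
import tempfile

from lxml import etree

# Hypothetical markup with an unclosed <br>, which is valid HTML but not valid XML.
html = "<html><body><p id='msg'>hello<br>world</p></body></html>"

# Route 2: source held in a string (for example, fetched over the network).
tree_from_text = etree.HTML(html)

# Route 1: source stored in a local file. A temporary file stands in for it here.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as f:
    f.write(html)
    path = f.name
try:
    # Passing an HTML parser makes etree.parse tolerant of real-world HTML.
    tree_from_file = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)

from_text = tree_from_text.xpath("//p[@id='msg']/text()")
from_file = tree_from_file.xpath("//p[@id='msg']/text()")
print(from_text, from_file)
```

Both trees answer the same XPath query with the same result, so the rest of a scraper does not need to care where the source came from.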
import requests
from lxml import etree
from fake_useragent import UserAgent

url = ''

# The header dict was lost from the source; a random User-Agent is the usual fake_useragent pattern
headers = {'User-Agent': UserAgent().random}

response = requests.get(url=url, headers=headers)
page_text = response.text
tree = etree.HTML(page_text)

# One node per listing
room_list = tree.xpath("//section[@class='list']/div/a/div[2]")
for room in room_list:
    # Expressions starting with "." search only inside the current listing
    title = room.xpath(".//div[@class='property-content-title']//h3/text()")[0]
    price = room.xpath(
        ".//div[@class='property-price']//span[@class='property-price-total-num']/text()")[0]
    avg = room.xpath(".//p[@class='property-price-average']/text()")[0]
    print(title, price, avg)
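Indexing each result with [0] raises IndexError as soon as one listing lacks a field. A hedged sketch of a more forgiving variant follows; first_text is a hypothetical helper, not part of lxml, and the HTML is invented for the demo.

```python
from lxml import etree

# Hypothetical listing markup; the second item has no price.
page_text = """
<html><body>
<section class="list">
  <div class="item"><h3> Flat A </h3><span class="price">120</span></div>
  <div class="item"><h3>Flat B</h3></div>
</section>
</body></html>
"""


def first_text(node, expression, default=None):
    """Return the first string matched by `expression`, stripped, or `default`."""
    matches = node.xpath(expression)
    return matches[0].strip() if matches else default


tree = etree.HTML(page_text)
rows = []
for item in tree.xpath("//section[@class='list']/div"):
    # The leading "." keeps each query scoped to the current item.
    rows.append({
        "title": first_text(item, ".//h3/text()"),
        "price": first_text(item, ".//span[@class='price']/text()"),
    })

print(rows)
```

Dropping the leading dot (writing "//h3/text()") would search the whole document from every item and return the first title each time, which is a common mistake.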

from lxml import etree
import requests
from fake_useragent import UserAgent
import time

url = ''
ua = UserAgent()


# Scraping 彼岸圖網
def pic_down(page, searchid):
    # The query parameters were lost from the source; presumably they were built from page and searchid
    param = {}
    headers = {'User-Agent': ua.random}  # random UA
    response = requests.get(url=url, headers=headers, params=param)
    if response.status_code != 200:
        print("Current status code: ", response.status_code)
        return False
    page_text = response.text
    # Collect the main-page link of every picture on the current page
    index_etree = etree.HTML(page_text)
    index_list = index_etree.xpath("//ul[@class='clearfix']/li/a/@href")
    for picture_index_url in index_list:
        headers = {'User-Agent': ua.random}  # random UA
        pic_response = requests.get(url=picture_index_url, headers=headers)
        pic_etree = etree.HTML(pic_response.text)
        pic_link = '' + \
            pic_etree.xpath("//a[@id='img']/img/@src")[0]
        fp.write(pic_link + '\n')
        print(pic_link)
    print('Successfully scraped page {}\n'.format(page))
    return True


if __name__ == '__main__':
    # fp is a module-level file handle that pic_down writes to
    fp = open('鏈結.txt', 'w', encoding='utf-8')
    for i in range(0, 5):
        if pic_down(i, 16):
            time.sleep(3)  # pause between pages
        else:
            print('Scraping failed')
            break
    fp.close()
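The script above builds absolute links by prepending a fixed prefix to each extracted src or href. One alternative, shown here as a sketch, is urllib.parse.urljoin from the standard library, which resolves relative, root-relative, and protocol-relative values against the page URL. The base URL and paths below are placeholders, since the post does not give the real ones.

```python
from urllib.parse import urljoin

# Hypothetical page URL; the real one is not given in the post.
base_url = "https://example.com/gallery/index.html"

# Typical shapes of values returned by an @href or @src query.
hrefs = [
    "/tupian/1.html",               # root-relative
    "detail/2.html",                # relative to the current directory
    "//cdn.example.com/3.jpg",      # protocol-relative
    "https://other.example/4.jpg",  # already absolute
]

absolute = [urljoin(base_url, href) for href in hrefs]
for link in absolute:
    print(link)
```

Because urljoin leaves absolute URLs untouched, the same call can be applied to every extracted link without checking its form first.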

import requests
from lxml import etree
from fake_useragent import UserAgent

ua = UserAgent()
url = ''

# "//div[@id='container']/div/a/@href"
headers = {'User-Agent': ua.random}
response = requests.get(url=url, headers=headers)
page_text = response.content  # raw bytes rather than decoded text
index_tree = etree.HTML(page_text)
index_link = index_tree.xpath("//div[@id='container']/div/a/@href")
for link in index_link:
    # "https://" + link is the link to the item's main page
    page_url = "https://" + link
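This last script hands response.content (bytes) to etree.HTML instead of response.text. A small sketch of that route follows, with no network involved: the bytes are produced locally, and the encoding is stated explicitly through etree.HTMLParser(encoding=...) rather than left to detection. The sample markup is an assumption made for the demo.

```python
from lxml import etree

# Hypothetical response body, encoded the way a server might send it.
raw = "<html><body><h1>你好</h1></body></html>".encode("utf-8")

# Telling the parser the encoding avoids garbled non-ASCII text
# when the page itself does not declare one.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.HTML(raw, parser=parser)

heading = tree.xpath("//h1/text()")[0]
print(heading)
```

When the page encoding is known, passing bytes plus an explicit encoding keeps the decoding step in one place instead of relying on the guess behind response.text.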
