Python Crawler Notes 02: XPath

1. Syntax

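Expression                      Description

nodename                        Selects the child elements of the current node that are named nodename.
/                               Selects from the root node.
//                              Selects matching descendant nodes at any depth, regardless of their position.
.                               The current node.
..                              The parent of the current node.
@                               Selects an attribute.
*                               Matches any element node.
contains(@attr, "substring")    Fuzzy match: true when the attribute value contains the given substring.
text()                          Selects the text content.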
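
As a rough illustration of the syntax listed above, the sketch below evaluates each piece with lxml. The library/shelf/item document and all variable names are made up for this example and are not part of the original notes.

```python
from lxml import etree

# Hypothetical sample document, invented for this sketch.
doc = etree.fromstring(
    '<library>'
    '<shelf id="a"><item kind="paperback novel">Dune</item></shelf>'
    '<shelf id="b"><item kind="novel">Emma</item></shelf>'
    '</library>'
)

# "/" starts at the root; "//" searches at any depth.
shelves = doc.xpath('/library/shelf')
items = doc.xpath('//item')

# "@" selects an attribute; text() selects text content.
shelf_ids = doc.xpath('/library/shelf/@id')
titles = doc.xpath('//item/text()')

# "." is the context node and ".." is its parent.
first_item = items[0]
same_node = first_item.xpath('.')[0]
parent_id = first_item.xpath('../@id')

# "*" matches any element; contains() does a substring match.
children = [el.tag for el in doc.xpath('/library/*')]
novels = doc.xpath('//item[contains(@kind, "novel")]/text()')

print(shelf_ids, titles, parent_id, children, novels)
```

Note that a relative expression such as `../@id` is evaluated from the element that `xpath()` is called on, while an expression beginning with `/` always starts from the document root.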

2. Usage examples

<bookstore>
  <book>
    <title class="tit alive" lang="eng">harry potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title class="tit" lang="eng">learning xml</title>
    <price>39.95</price>
  </book>
</bookstore>

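Expression                          Result

/bookstore/book[1]                  Selects the first book element that is a child of bookstore. (XPath indexes start at 1.)
/bookstore/book[last()]             Selects the last book element that is a child of bookstore.
/bookstore/book[position() < 3]     Selects the first two book elements that are children of bookstore.
//title[@lang]                      Selects every title element that has a lang attribute.
//title[@class='tit']               Selects every title element whose class attribute is exactly "tit". (The first title is not selected, because its class value is "tit alive".)
//title[contains(@class, 'tit')]    Selects every title element whose class attribute contains "tit"; both titles are selected.
/bookstore/book[price>35.00]        Selects every book element under bookstore whose price element has a value greater than 35.00.
//title/text()                      Selects the text content of every title element.
//title/@lang                       Selects the lang attribute of every title element.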
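
The expressions in the table can be checked with a short script. This is a minimal sketch that assumes the sample document is a bookstore with two book elements, each holding a title and a price; the helper `titles_of` is a hypothetical name introduced only for this example.

```python
from lxml import etree

# Assumed sample document: two books, each with a title and a price.
xml = """
<bookstore>
  <book>
    <title class="tit alive" lang="eng">harry potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title class="tit" lang="eng">learning xml</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()
root = etree.fromstring(xml)


def titles_of(books):
    """Hypothetical helper: return the title text of each book element."""
    return [book.findtext('title') for book in books]


first = titles_of(root.xpath('/bookstore/book[1]'))
last = titles_of(root.xpath('/bookstore/book[last()]'))
first_two = titles_of(root.xpath('/bookstore/book[position() < 3]'))
with_lang = root.xpath('//title[@lang]')
exact = root.xpath("//title[@class='tit']/text()")
fuzzy = root.xpath("//title[contains(@class, 'tit')]/text()")
pricey = titles_of(root.xpath('/bookstore/book[price>35.00]'))
texts = root.xpath('//title/text()')
langs = root.xpath('//title/@lang')

print(first, last, exact, fuzzy, pricey, langs)
```

The exact match on `@class` returns only "learning xml", while `contains()` returns both titles, which is the difference the table describes.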

3. Using XPath in Python

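# 1. Import the library
from lxml import etree as le
# 2. Prepare the page source as a str
html = '......'
# 3. Parse the str into an element tree that supports XPath queries
html_x = le.HTML(html)
# 4. Evaluate an XPath expression; ret is a list
ret = html_x.xpath('XPath expression')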
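
To make the four steps concrete, here is a hedged end-to-end sketch. The HTML snippet, its class names, and its links are invented stand-ins for a downloaded page; only `le.HTML` and `xpath` come from the steps above.

```python
from lxml import etree as le

# Hypothetical HTML standing in for a page fetched by a crawler.
html = """
<html><body>
  <div class="post featured"><a href="/a">First post</a></div>
  <div class="post"><a href="/b">Second post</a></div>
  <div class="ad"><a href="/c">Sponsored</a></div>
</body></html>
"""

# Parse the str with lxml's HTML parser.
html_x = le.HTML(html)

# Fuzzy-match the class attribute, then take each link's href and text.
links = html_x.xpath('//div[contains(@class, "post")]/a/@href')
titles = html_x.xpath('//div[contains(@class, "post")]/a/text()')

# An expression that matches nothing returns an empty list, not None.
missing = html_x.xpath('//h1/text()')

for href, title in zip(links, titles):
    print(href, title)
```

Because these queries always return a list, it is worth checking that the list is non-empty before indexing into it.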
