Python lxml Library Study Notes, Part 2

Finding text with XPath

Another way to extract the text content of an XML tree is XPath:

>>> print(html.xpath("string()")) # lxml.etree only!

texttail

>>> print(html.xpath("//text()")) # lxml.etree only!

['text', 'tail']

If you use the expression often, you can wrap it in a callable:

>>> build_text_list = etree.XPath("//text()") # lxml.etree only!

>>> print(build_text_list(html))

['text', 'tail']
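The session above relies on an `html` tree built earlier in these notes. As a minimal self-contained sketch, the block below assumes a small sample document (the markup is an assumption chosen to match the outputs shown) and compares `string()`, `//text()`, and a compiled `etree.XPath` object.

```python
from lxml import etree

# Assumed sample document; the notes build `html` in an earlier part.
html = etree.fromstring("<html><body>text<br/>tail</body></html>")

# string() concatenates all text in the tree, in document order.
joined = html.xpath("string()")

# //text() returns every text node as a separate list item.
pieces = html.xpath("//text()")

# etree.XPath compiles the expression once; the result is a callable
# that can be applied to any element or tree.
build_text_list = etree.XPath("//text()")
compiled_pieces = build_text_list(html)

print(joined)
print(pieces)
print(compiled_pieces)
```

Compiling the expression is mainly worthwhile when the same query runs against many documents, since the expression is parsed only once.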

Each string in the result also remembers where it came from, so you can get its parent node with the getparent() method:

>>> texts = build_text_list(html)

>>> print(texts[0])

text

>>> parent = texts[0].getparent()

>>> print(parent.tag)

body

>>> print(texts[1])

tail

>>> print(texts[1].getparent().tag)

br

You can also find out if it's normal text content or tail text:

>>> print(texts[0].is_text)

True

>>> print(texts[1].is_text)

False

>>> print(texts[1].is_tail)

True
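To tie `getparent()`, `is_text`, and `is_tail` together, here is a small sketch. The helper name `describe_text_nodes` and the sample markup are assumptions made for illustration; neither is part of lxml.

```python
from lxml import etree

# Assumed sample document matching the outputs shown in these notes.
html = etree.fromstring("<html><body>text<br/>tail</body></html>")


def describe_text_nodes(root):
    """Return (text, parent tag, kind) for every text node in the tree.

    Hypothetical helper: it only combines the attributes shown above.
    """
    rows = []
    for node in root.xpath("//text()"):
        # Each XPath string result knows whether it is the .text of an
        # element or the .tail that follows an element.
        kind = "text" if node.is_text else "tail"
        rows.append((str(node), node.getparent().tag, kind))
    return rows


rows = describe_text_nodes(html)
for row in rows:
    print(row)
```

Note that the parent of a tail string is the element it follows (here `br`), not the enclosing `body` element.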

Tree iteration:

Elements provide a tree iterator that visits the elements of the tree in document order.

>>> root = etree.Element("root")

>>> etree.SubElement(root, "child").text = "child 1"

>>> etree.SubElement(root, "child").text = "child 2"

>>> etree.SubElement(root, "another").text = "child 3"

>>> print(etree.tostring(root, pretty_print=True).decode())

<root>
  <child>child 1</child>
  <child>child 2</child>
  <another>child 3</another>
</root>

>>> for element in root.iter():

...     print("%s - %s" % (element.tag, element.text))

root - None

child - child 1

child - child 2

another - child 3

If you know which tag you are interested in, pass its name to the iter method and it will filter the results for you.

>>> for element in root.iter("child"):

...     print("%s - %s" % (element.tag, element.text))

child - child 1

child - child 2
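The two loops above can be condensed into a short script. This sketch rebuilds the same tree and collects the results into lists, which makes the effect of the tag filter easy to compare side by side.

```python
from lxml import etree

# Rebuild the small tree used in the session above.
root = etree.Element("root")
etree.SubElement(root, "child").text = "child 1"
etree.SubElement(root, "child").text = "child 2"
etree.SubElement(root, "another").text = "child 3"

# Without arguments, iter() walks the whole tree in document order,
# starting with the element it is called on.
all_tags = [element.tag for element in root.iter()]

# Passing a tag name keeps only the elements with that tag.
child_texts = [element.text for element in root.iter("child")]

print(all_tags)
print(child_texts)
```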

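By default, the iterator yields every node in the tree, including ProcessingInstruction, Comment and Entity instances. If you want to make sure that only Element objects are returned, pass the Element factory as the tag argument.

>>> for element in root.iter(tag=etree.Element):

...     print("%s - %s" % (element.tag, element.text))

root - None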

child - child 1

child - child 2

another - child 3

>>> for element in root.iter(tag=etree.Entity):

...     print(element.text)
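The tree built above contains only elements, so the Entity loop has nothing to print. The sketch below assumes a tree that also holds a comment and an entity reference (both added here purely for illustration) to show how the factory functions act as filters.

```python
from lxml import etree

# Assumed example tree: one element child, one comment, one entity.
root = etree.Element("root")
etree.SubElement(root, "child").text = "child 1"
root.append(etree.Comment("some comment"))
root.append(etree.Entity("#234"))

# Passing the Element factory skips comments, entities and
# processing instructions.
elements_only = [node.tag for node in root.iter(tag=etree.Element)]

# The Comment and Entity factories select only those node types.
comments = [node.text for node in root.iter(tag=etree.Comment)]
entities = [node.text for node in root.iter(tag=etree.Entity)]

print(elements_only)
print(comments)
print(entities)
```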
