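The lxml XML Parsing Library: Tutorial, Part 2

Dictionary interface

In XML or HTML, every tag can carry attributes. The Element class supports attribute operations through a dictionary-like interface.

Creating an Element object with attributes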

from lxml import etree

# Element with attributes
root = etree.Element("root", name="root", **{"class": "main"})
# Serialize to an XML tag
root_s = etree.tostring(root).decode()
print(root_s)

Output

<root name="root" class="main"/>

Note that class is a Python keyword, so class="main" cannot be passed as a keyword argument. Unpacking a dictionary with ** works around this.
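As an alternative sketch, attribute names that collide with Python keywords can also be supplied as a dictionary through the attrib parameter of etree.Element. The element name div and the id value here are illustrative and do not come from the original example.

```python
from lxml import etree

# "class" is a Python keyword, so it cannot be written as class="main".
# Passing a dictionary through the attrib parameter avoids the problem.
div = etree.Element("div", attrib={"class": "main", "id": "content"})

print(div.get("class"))  # main
print(div.get("id"))     # content
```

Both forms produce the same element; the dictionary form tends to read better when several attribute names are not valid Python identifiers.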

Accessing attributes

# Access an existing attribute
print("name:", root.get("name"))
# When the attribute does not exist, a default value can be supplied
print("title:", root.get("title", "element"))

Output

name: root

title: element

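Unlike a dictionary, an element's attributes cannot be read or set with root["name"]. The Element magic method __getitem__ does not accept string indexes; it is used to index child elements by position.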
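The difference can be seen in a short sketch. The child element is added only for illustration: an integer index selects a child, a string index raises TypeError, and get() is the way to read an attribute.

```python
from lxml import etree

root = etree.Element("root", name="root")
child = etree.SubElement(root, "child")

# Integer indexes select child elements, like a list
print(root[0].tag)  # child

# A string index is not an attribute lookup and is rejected
try:
    root["name"]
except TypeError as exc:
    print("TypeError:", exc)

# Attribute values are read with get()
print(root.get("name"))  # root
```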

Setting attributes

# Set an attribute
root.set("id", "root-elem")
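Two details of set() are worth a quick sketch: calling it again overwrites the existing value, and lxml expects the value to be a string. The item element and the count attribute are made up for this example.

```python
from lxml import etree

item = etree.Element("item")

# set() creates the attribute, and a second call overwrites it
item.set("id", "first")
item.set("id", "second")
print(item.get("id"))  # second

# Attribute values must be strings; other types are rejected
try:
    item.set("count", 3)
except TypeError as exc:
    print("TypeError:", exc)

# Convert explicitly before setting
item.set("count", str(3))
print(item.get("count"))  # 3
```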

Iterating over attributes

# Iterate over key-value pairs
for k, v in root.items():
    # f-strings require Python 3.6 or later
    print(f"{k}: {v}")

Output

name: root
class: main
id: root-elem

Besides items(), root also supports keys() and values(), just like a dictionary.
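A minimal sketch of these methods, using a fresh element so that it runs on its own. The attribute order shown by keys() and values() can depend on the lxml version, so no output is given for those two calls.

```python
from lxml import etree

root = etree.Element("root", name="root", id="root-elem")

# keys() returns the attribute names, values() the attribute values
print(root.keys())
print(root.values())

# items() pairs them up, so it converts directly into a plain dict
attrs = dict(root.items())
print(attrs["id"])  # root-elem
```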

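Getting the attribute dictionary

The attrib property of an Element object exposes the element's attributes as a dictionary-like object.

# Treat the Element's attrib property as a dictionary
print("root.attrib:", root.attrib)
# Modify and read tag attributes by string index
root.attrib["id"] = "root-attrib"
print('root.attrib["id"]:', root.attrib["id"])
# After a change through attrib, the Element itself is changed as well
root_s = etree.tostring(root).decode()
print("root:", root_s)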

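Output

root.attrib: {'name': 'root', 'class': 'main', 'id': 'root-elem'}
root.attrib["id"]: root-attrib
root: <root name="root" class="main" id="root-attrib"/>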

Changes made to the element's attributes are reflected in attrib, and vice versa.
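The two-way link can be checked with a small sketch. The attribute values are illustrative; the last step shows that dict(root.attrib) is an independent copy that no longer follows the element.

```python
from lxml import etree

root = etree.Element("root")

# A change made through the Element method shows up in attrib
root.set("id", "via-set")
print(root.attrib["id"])  # via-set

# A change made through attrib shows up in the Element method
root.attrib["id"] = "via-attrib"
print(root.get("id"))  # via-attrib

# dict() takes a snapshot that does not track later changes
snapshot = dict(root.attrib)
root.set("id", "changed-later")
print(snapshot["id"])  # via-attrib
```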

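Working with the attribute dictionary

root.attrib behaves more like a dictionary than root itself does.

# Update attributes
root.attrib.update({"id": "root-update"})
# Iterate over values only
for v in root.attrib.values():
    print("attribute value:", v)

Output

attribute value: root
attribute value: main
attribute value: root-update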
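A few more dictionary operations that attrib supports, shown as a sketch on a fresh element: membership tests, len(), pop() and del. The attribute names repeat those used above.

```python
from lxml import etree

root = etree.Element("root", name="root", id="root-elem")
attrib = root.attrib

# Membership test and length, as with a dict
print("id" in attrib)  # True
print(len(attrib))     # 2

# pop() removes an attribute and returns its value
removed = attrib.pop("id")
print(removed)         # root-elem
print(root.get("id"))  # None

# del removes an attribute by name
del attrib["name"]
print(len(attrib))     # 0
```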

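Adding text content

If an Element object is regarded as an element node, the text it contains can be regarded as a text node. In lxml, text content can be added to an Element object.

Creating an Element object with text content

from lxml import etree

root = etree.Element("root")
# Set the text content
root.text = "text"
root_s = etree.tostring(root).decode()
print("root:", root_s)

Output

root: <root>text</root>

Because it now contains text, root is serialized as a paired opening and closing tag.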

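In XML, text content is usually found only inside a paired tag, whereas in HTML text can also sit between tags, for example after a child element has closed. The Element class therefore has a tail attribute, which holds the text that directly follows the element's closing tag.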

html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "body"
span = etree.SubElement(body, "span")
span.text = "span_1"
html_1 = etree.tostring(html, pretty_print=True).decode()
print(html_1)
# Add text after the closing tag of span
span.tail = "span_2"
html_2 = etree.tostring(html, pretty_print=True).decode()
print(html_2)

Output

<html>
  <body>body<span>span_1</span></body>
</html>

<html>
  <body>body<span>span_1</span>span_2</body>
</html>
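The split between text and tail is easiest to see on an empty element such as br. This sketch uses illustrative element names and strings: text belongs to the parent before its first child, while the tail belongs to the child it follows.

```python
from lxml import etree

p = etree.Element("p")
p.text = "Hello"                 # text before the first child
br = etree.SubElement(p, "br")
br.tail = "World"                # text after <br/>, still inside <p>

print(etree.tostring(p).decode())  # <p>Hello<br/>World</p>
print(p.tail)                      # None
```

The string "World" sits inside p in the serialized output, but in the tree it is stored on br, not on p.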

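When serializing an Element object, you can leave out its tail text, or output only the text content. The with_tail argument applies to the tail of the element passed to tostring(), so span itself is serialized here.

# Omit the tail text of the serialized element
html_3 = etree.tostring(span, with_tail=False).decode()
print(html_3)
# Text content only
html_4 = etree.tostring(html, method="text").decode()
print(html_4)

Output

<span>span_1</span>
bodyspan_1span_2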
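When the text is needed as Python strings instead of serialized output, itertext() is another option. The sketch below rebuilds the same tree so that it runs on its own.

```python
from lxml import etree

html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "body"
span = etree.SubElement(body, "span")
span.text = "span_1"
span.tail = "span_2"

# itertext() yields text and tail strings in document order
parts = list(html.itertext())
print(parts)           # ['body', 'span_1', 'span_2']
print("".join(parts))  # bodyspan_1span_2
```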
