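Processing XML with Python

It has been a long time since I wrote a blog post, and just as long since I touched Python. Work happened to call for it, so I got my hands dirty a little.

Since this was a new machine, I installed Python 3 straight away; I will have to get used to it sooner or later, so better sooner.

The first problem I ran into when processing XML with Python was encoding.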

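The XML parser in Python's standard library does not support gb2312 (Python itself ships a gb2312 codec; it is the expat-based parser that rejects multi-byte encodings it does not know), so an XML file that declares encoding="gb2312" triggers an error. The encoding of the file being read can also raise an exception by itself, in which case you need to specify the encoding when opening the file. On top of that there is the Chinese text contained in the XML nodes.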

In my case the handling is fairly simple: I only need to change the encoding named in the XML declaration at the top of the file.

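#!/usr/bin/env python
import re

def replacexmlencoding(filepath, oldencoding='gb2312', newencoding='utf-8'):
    # The file is read and written with the platform default encoding;
    # pass encoding=... to open() if that default cannot decode the file.
    with open(filepath, mode='r') as f:
        content = f.read()
    # Replace only the first occurrence, which is the XML declaration
    content = re.sub(re.escape(oldencoding), newencoding, content, count=1)
    with open(filepath, mode='w') as f:
        f.write(content)

if __name__ == "__main__":
    replacexmlencoding('./activateaccount.xml')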
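
An alternative that leaves the files on disk untouched is to decode the bytes yourself and hand the parser a string. The sketch below assumes the file really is gb2312-encoded; parse_gb2312_xml is a hypothetical helper name, not part of the scripts in this post. It strips the XML declaration after decoding, so the parser never sees the unsupported encoding name.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0" encoding="gb2312"?>
_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')

def parse_gb2312_xml(data, encoding='gb2312'):
    """Decode raw XML bytes and return the root element (hypothetical helper)."""
    text = data.decode(encoding)
    # Drop the declaration so the declared encoding no longer matters
    text = _DECLARATION.sub('', text, count=1)
    return ET.fromstring(text)

# Build a small gb2312-encoded document in memory instead of reading a real file
sample = ('<?xml version="1.0" encoding="gb2312"?>'
          '<root><resitem id="登录"/></root>').encode('gb2312')
root = parse_gb2312_xml(sample)
print(root.find('resitem').get('id'))
```

The trade-off is that every caller has to know the real encoding up front, whereas rewriting the declaration fixes the file once.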

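Next comes manipulating the XML file with xml.etree.ElementTree.

Defining a __call__ method in a class makes instances of that class callable, as in the last few lines of the code below, inside the if __name__ == "__main__" block. It also shows quite clearly that in the Python world everything is an object, including the object itself :)

I have always found the if __name__ == "__main__" block really handy for testing.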

#!/usr/bin/env python
import os, re
import xml.etree.ElementTree as etree

locale_path = "./locale.txt"

class xmlextractor(object):
    def __call__(self, filepath):
        retdict = {}
        with open(filepath, 'r') as f:
            # Count the lines, then rewind so the parser sees the whole file
            retdict['line'] = len(f.readlines())
            f.seek(0)
            tree = etree.parse(f)
        # The first <resitem> under the document root carries the title id
        root = tree.find("resitem")
        retdict['title'] = root.get("id")
        # Direct <resitem> children plus the outer one
        retdict['resitemcount'] = len(root.findall("resitem")) + 1
        retdict['chinesetip'] = 'none'
        for child in root:
            attrdict = child.attrib
            if 'name' in attrdict and attrdict['name'] == "caption":
                if len(attrdict['value']) > 1:
                    # Drop a leading '~' from the caption value
                    if attrdict['value'][0] == '~':
                        title = attrdict['value'][1:]
                    else:
                        title = attrdict['value']
                    #print(title)
                    with open(locale_path) as localefile:
                        chs = localefile.read()
                    # NOTE: the pattern is kept as published; it does not use
                    # title, so part of it was probably lost
                    pattern = '[^>]+>'
                    m = re.search(pattern, chs)
                    if m is not None:
                        # Strip any tags from the match to get the plain text
                        realtitle = re.sub('<[^>]+>', '', m.group(0))
                        retdict['chinesetip'] = realtitle
        return retdict

if __name__ == "__main__":
    fo = xmlextractor()
    d = fo('./activateaccount.xml')
    print(d)
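
The callable-class idea used above can be shown in isolation. This is a minimal sketch; LineCounter is a made-up class for illustration and has nothing to do with the extractor itself.

```python
class LineCounter(object):
    """Hypothetical example: instances count the lines of a string."""

    def __init__(self):
        self.calls = 0  # how many times the instance has been called

    def __call__(self, text):
        # Runs whenever the instance is used like a function
        self.calls += 1
        return len(text.splitlines())

counter = LineCounter()
print(counter("a\nb\nc"))   # dispatches to LineCounter.__call__
print(callable(counter))    # instances with __call__ report as callable
```

Compared with a plain function, the instance can keep state between calls (here the calls counter), which is the main reason to reach for __call__.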

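Finally there is the entry script, which imports the two files above and uses xml.dom and os.listdir to process the XML files recursively and build a result set.

I have always found Python's UnboundLocalError rather interesting; I wonder whether it comes down to one symbol table entry shadowing another.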

#!/usr/bin/env python
import os
from xmlextractor import *
from replacexmlencoding import *
from xml.dom import minidom, Node

doc = minidom.Document()
extractor = xmlextractor()
totallines = 0
totalresitemcnt = 0
totalxmlfilecnt = 0
totalerrorcnt = 0
errorfilelist = []
xmlroot = doc.createElement("xmlresourcefile")
# Assumed intent: attach the root, otherwise the result document stays empty
doc.appendChild(xmlroot)

def mywalkdir(level, path):
    global doc, extractor, totallines, totalresitemcnt, totalxmlfilecnt
    global totalerrorcnt, errorfilelist
    global xmlroot
    for i in os.listdir(path):
        fullpath = os.path.join(path, i)
        if i[-3:] == 'xml':
            totalxmlfilecnt += 1
            try:
                # First change the XML encoding from gb2312 to utf-8
                replacexmlencoding(fullpath)
                # Then extract the needed information from the XML document
                info = extractor(fullpath)
                # Create the nodes only if the two lines above raised no exception
                #print(info)
                #print(type(i))
                xmlnode = doc.createElement("xmlfile")
                xmlname = doc.createElement("filename")
                xmlname.setAttribute('value', i)
                filepath = doc.createElement("filepath")
                # Drops the first 34 characters, presumably a machine-specific prefix
                filepath.setAttribute('value', path[34:])
                titlenode = doc.createElement("title")
                titlenode.setAttribute('value', str(info['title']))
                chsnode = doc.createElement("chinesetip")
                chsnode.setAttribute('value', str(info['chinesetip']))
                resitemnode = doc.createElement("resitemcount")
                resitemnode.setAttribute('value', str(info['resitemcount']))
                linenode = doc.createElement("linecount")
                linenode.setAttribute('value', str(info['line']))
                descnode = doc.createElement("description")
                descnode.setAttribute('value', '')
                # Assumed intent: attach the nodes so they appear in the output
                for n in (xmlname, filepath, titlenode, chsnode,
                          resitemnode, linenode, descnode):
                    xmlnode.appendChild(n)
                xmlroot.appendChild(xmlnode)
            except Exception as errordetail:
                totalerrorcnt += 1
                errorfilelist.append(fullpath)
                print(fullpath, errordetail)
        if os.path.isdir(fullpath):
            mywalkdir(level+1, fullpath)

if __name__ == "__main__":
    path = os.path.join(os.getcwd(), 'themes')
    mywalkdir(0, path)
    print(totalxmlfilecnt, totalerrorcnt)
    #print(doc.toprettyxml(indent = "    "))
    # Write as UTF-8 so the Chinese text matches the XML default encoding
    with open("./xmlresourcelist.xml", "w", encoding="utf-8") as resultxml:
        resultxml.write(doc.toprettyxml(indent = "    "))
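
The recursion itself could also be left to os.walk from the standard library. The sketch below is not the script above rewritten; collect_xml_files and build_index are hypothetical helpers, and the attribute layout of the result is simplified for illustration.

```python
import os
from xml.dom import minidom

def collect_xml_files(root_dir):
    """Return the paths of all .xml files under root_dir, relative to it."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith('.xml'):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, root_dir))
    return sorted(found)

def build_index(paths):
    """Build a minidom document with one <xmlfile> element per path."""
    doc = minidom.Document()
    root = doc.createElement("xmlresourcefile")
    doc.appendChild(root)
    for rel in paths:
        node = doc.createElement("xmlfile")
        node.setAttribute("filename", os.path.basename(rel))
        node.setAttribute("filepath", os.path.dirname(rel))
        root.appendChild(node)
    return doc

if __name__ == "__main__":
    # 'themes' mirrors the directory name used in the entry script
    index = build_index(collect_xml_files(os.path.join(os.getcwd(), 'themes')))
    print(index.toprettyxml(indent="    "))
```

Using os.path.relpath also avoids a hard-coded slice length for trimming the base directory from each path.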
