Fetching and processing Word documents with Python

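The previous post covered processing PDF content; today let's look at handling Word documents in Python. Python's support for Word documents is actually rather limited.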

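To read the contents of a docx file, you can use the following approach:

(1) Fetch the remote Word docx file with urlopen;

(2) Convert it into an in-memory byte stream;

(3) Unzip it (a docx file is a compressed archive);

(4) Read the extracted file as XML;

(5) Find the tags in the XML that hold the body text and process them.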

Below is the code; just pass in the URL.

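from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

from bs4 import BeautifulSoup  # requires the beautifulsoup4 and lxml packages


def wordtocontent(url):
    # Download the remote .docx and wrap its bytes in an in-memory stream
    wordfile = BytesIO(urlopen(url).read())
    # A .docx file is a zip archive; the body text lives in word/document.xml
    document = ZipFile(wordfile)
    xml_content = document.read("word/document.xml")
    wordobj = BeautifulSoup(xml_content.decode("utf-8"), "lxml")
    # Each w:t tag holds a run of body text
    textstrings = wordobj.find_all("w:t")
    return "".join(textelem.text for textelem in textstrings)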
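
The unzip-and-parse steps (3) to (5) can also be sketched with only the standard library, which makes them easy to try without a network connection. This is a minimal sketch, not the method used above: the names docx_bytes_to_text and make_sample_docx are hypothetical helpers introduced for illustration, and the sample archive contains only word/document.xml, so it is enough for this demo but is not a file Word itself would open. Unlike the function above, this version joins paragraphs with newlines.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Return the body text of a .docx given its raw bytes (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Each w:p element is a paragraph; its w:t descendants hold the text runs
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Build a minimal in-memory archive holding only word/document.xml (demo only)."""
    ns = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(p) for p in paragraphs
    )
    xml = '<w:document xmlns:w="{}"><w:body>{}</w:body></w:document>'.format(ns, body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "World"])
    print(docx_bytes_to_text(sample))
```

To use it on a remote file, the bytes returned by urlopen(url).read() could be passed straight to docx_bytes_to_text.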

Note that this only works for .docx documents; the older .doc format is not supported.
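
Since a .doc file passed to the zip-based approach would simply fail to open as an archive, it can help to check the container type first. The sketch below is one possible check, under stated assumptions: classify_word_bytes is a hypothetical helper name, a file is treated as docx only if it is a zip archive containing word/document.xml, and a legacy .doc is recognized by the signature bytes of the OLE2 compound file container (which other legacy Office formats share, so "doc" here really means "OLE2 container").

```python
import io
import zipfile

# Signature bytes of the OLE2 compound file container used by legacy .doc files
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def classify_word_bytes(data):
    """Guess the container type from raw bytes: 'docx', 'doc', or 'unknown'."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # A docx archive keeps its body text in word/document.xml
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "unknown"
    if data.startswith(OLE_MAGIC):
        return "doc"
    return "unknown"
```

A caller could run this on the downloaded bytes and only attempt extraction when the result is "docx".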
