Reading the contents of a Word document with Python

1. Convert the Word document to HTML, then use BeautifulSoup from bs4 to extract the required content from the HTML

pip install bs4

pip install pydocx

# Read the content of the Word document
from pydocx import PyDocX

from bs4 import BeautifulSoup  # parses the HTML into a tree of objects

Step 2: read the content of the Word document and parse it

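html = PyDocX.to_html("c:\\users\\administrator\\desktop\\test.docx")
soup = BeautifulSoup(html, 'html.parser')
# The first argument (html here) is the HTML content to be parsed;
# 'html.parser' names the parser to use.
soup.prettify()  # prettify() returns the HTML formatted for display
# print(soup.prettify())
# select() takes a CSS selector; the stray attrs= argument, whose value is missing in the source, is dropped.
title_list = soup.select("h2>span[style='text-indent:1.25em']")
# The attrs value is missing in the source; {} is a placeholder that matches every <span>.
# To filter by attribute, for example tags whose class is title, pass class_='title';
# the trailing underscore is needed because class is a Python keyword.
content_list = soup.find_all('span', attrs={})
print(len(content_list))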
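
The two lookups above can be tried without a Word file. The following is a minimal sketch, assuming a small hand-written HTML string in place of the real PyDocX output; the markup, the class name title, and the variable names are made up for illustration and may not match what PyDocX actually produces for a given document.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the result of PyDocX.to_html();
# real output depends on the Word document.
sample_html = """
<html><body>
<h2><span style="text-indent:1.25em">Chapter 1</span></h2>
<p><span class="title">Overview</span></p>
<p><span>Body text</span></p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: <span> elements that are direct children of <h2>
# and whose style attribute is exactly 'text-indent:1.25em'.
titles = [s.get_text() for s in soup.select("h2 > span[style='text-indent:1.25em']")]

# Filter by class with the class_ keyword argument.
title_spans = [s.get_text() for s in soup.find_all("span", class_="title")]

# The same filter written as an attrs dictionary.
same_spans = [s.get_text() for s in soup.find_all("span", attrs={"class": "title"})]

print(titles)
print(title_spans)
```

select() is convenient when the position in the tree matters (here, a span directly under h2), while find_all() with class_ or attrs is enough for a simple attribute filter.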

2. Read the Word document as text, one paragraph at a time, and use paragraph styles to pick out the content

pip install python-docx

# Import
from docx import Document

Step 2: read the content of the Word document

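title = ""
content = ""
titlearr = []  # value missing in the source; an empty list is assumed
document = Document("c:\\users\\administrator\\desktop\\test.docx")
# Get all paragraphs
all_paragraphs = document.paragraphs
for paragraph in all_paragraphs:
    if paragraph.style.name == 'Normal':
        # Body text: append it to the content of the current section
        content = content + paragraph.text + '\n'
    else:
        # Any other style is treated as a title
        obj = {}  # value missing in the source; an empty dict is assumed
        if content != '':
            content = ""
        title = paragraph.text
        # print(obj)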
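
The loop above is incomplete in the source, so what it stored in obj and titlearr is unknown. The following is a minimal sketch of one plausible reading, in which each title is paired with the body text that follows it. The function name group_by_heading, the dictionary keys, and the sample data are assumptions for illustration; the function works on plain (style name, text) pairs so it can run without a Word file.

```python
def group_by_heading(paragraphs, body_style="Normal"):
    """Pair each non-body paragraph (treated as a title) with the body text after it."""
    sections = []
    title = ""
    lines = []
    for style_name, text in paragraphs:
        if style_name == body_style:
            # Body paragraph: collect it under the current title.
            lines.append(text)
        else:
            # A new title starts: store the section collected so far, if any.
            if title or lines:
                sections.append({"title": title, "content": "\n".join(lines)})
            title = text
            lines = []
    # Store the last section, which no later title would flush.
    if title or lines:
        sections.append({"title": title, "content": "\n".join(lines)})
    return sections


# Hypothetical paragraphs as (style name, text) pairs.
sample = [
    ("Heading 1", "Introduction"),
    ("Normal", "First paragraph."),
    ("Normal", "Second paragraph."),
    ("Heading 1", "Summary"),
    ("Normal", "Closing paragraph."),
]

sections = group_by_heading(sample)
for section in sections:
    print(section["title"], "->", repr(section["content"]))
```

With python-docx, the input pairs could be built as [(p.style.name, p.text) for p in Document(path).paragraphs]. Flushing once more after the loop keeps the final section, which a loop that only stores a section when the next title appears would lose.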
