How to extract hyperlinks from a docx file with Python

How to parse the content in between with Python

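Use the paragraph XML together with regular expressions.

If you only iterate with for paragraph in document.paragraphs to get the paragraphs that contain no hyperlinks, you still need to read each paragraph's .text attribute.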
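To see why the XML route picks up hyperlink text, the idea can be tried on a bare string first. This is a minimal sketch using only the standard library: `SAMPLE_XML` is a hypothetical, heavily trimmed paragraph, and `WT_PATTERN` and `text_from_paragraph_xml` are names made up for this illustration, not part of python-docx.

```python
import re

# Hypothetical paragraph XML, trimmed to the parts that matter here:
# one plain run followed by a hyperlink that wraps its own run.
SAMPLE_XML = (
    '<w:p>'
    '<w:r><w:t xml:space="preserve">See </w:t></w:r>'
    '<w:hyperlink r:id="rId4">'
    '<w:r><w:t>https://blog.csdn.net</w:t></w:r>'
    '</w:hyperlink>'
    '</w:p>'
)

# Matches <w:t> and <w:t attr="...">, but not <w:tab/> or <w:tbl>.
WT_PATTERN = re.compile(r'<w:t(?:\s[^>]*)?>([\S\s]*?)</w:t>')


def text_from_paragraph_xml(xml_str):
    """Join the contents of every <w:t> element, inside or outside a hyperlink."""
    return ''.join(WT_PATTERN.findall(xml_str))


print(text_from_paragraph_xml(SAMPLE_XML))
```

Because the pattern looks at every `<w:t>` element regardless of its parent, text nested inside `<w:hyperlink>` is collected along with the ordinary runs. Note that a regular expression leaves XML entities such as `&amp;` unescaped.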

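import re
from docx import Document


def get_paragraph_from_docx(file_name):
    """
    Hyperlink: https://blog.csdn.net, this is a paragraph with a hyperlink
    This is a paragraph without a hyperlink
    Handles text that contains hyperlinks; paragraphs with no text are skipped automatically.
    :param file_name: path to the .docx file
    :return: list of paragraph strings
    """
    text = []
    document = Document(file_name)
    for paragraph in document.paragraphs:
        t_para = u""
        # Works whether or not the paragraph contains a hyperlink.
        xml_str = str(paragraph.paragraph_format.element.xml)
        # The pattern was lost in the source; matching <w:t> elements is
        # assumed from the variable name and the tag stripping below.
        wt_list = re.findall(r'<w:t(?:\s[^>]*)?>[\S\s]*?</w:t>', xml_str)
        for wt in wt_list:
            # Strip the tags and keep only the text content.
            wt_content = re.sub(r'<[\S\s]*?>', u"", wt)
            t_para += wt_content
        if t_para:
            t_para = t_para.strip()
            # Removes every whitespace character, including spaces between words.
            t_para = re.sub(r'[\s]', '', t_para)
            if t_para:
                text.append(t_para)
    return text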

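import re
import docx

d = docx.Document("./test.docx")
for p in d.paragraphs:
    xml = p.paragraph_format.element.xml
    xml_str = str(xml)
    # The pattern was lost in the source; a <w:t> matcher is assumed here.
    wt_list = re.findall(r'<w:t(?:\s[^>]*)?>[\S\s]*?</w:t>', xml_str)
    hyperlink = u''
    for wt in wt_list:
        # Strip the tags and keep only the text content.
        wt_content = re.sub(r'<[\S\s]*?>', u'', wt)
        hyperlink += wt_content
    print(hyperlink)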
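The snippets above recover the text of a paragraph, including the display text of a hyperlink, but not the URL it points to. In a .docx file the target of an external link is stored in a separate relationships part and referenced by an `r:id` attribute. The following is only a sketch under stated assumptions: it swaps the regular expressions for `xml.etree.ElementTree`, reads the archive directly with `zipfile` instead of python-docx, looks only at the main document body, and assumes `word/_rels/document.xml.rels` is present. `extract_hyperlinks` is a hypothetical helper name.

```python
import zipfile
from xml.etree import ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(docx_file):
    """Return (display_text, url) pairs for hyperlinks in the document body.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        document_xml = zf.read("word/document.xml")
        rels_xml = zf.read("word/_rels/document.xml.rels")

    # Map relationship ids (for example "rId4") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in ET.fromstring(rels_xml).iter(f"{{{REL_NS}}}Relationship")
    }

    links = []
    for link in ET.fromstring(document_xml).iter(f"{{{W_NS}}}hyperlink"):
        # A hyperlink can wrap several runs; join all of their text.
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        r_id = link.get(f"{{{R_NS}}}id")
        # Links to internal bookmarks carry no r:id, so their url is None.
        links.append((text, targets.get(r_id)))
    return links
```

Calling `extract_hyperlinks("./test.docx")` would return one tuple per `<w:hyperlink>` element, in document order.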

