Reading text from a PDF with Python

Processing PDFs with Python is a common task. For Python 3, pdfminer3k is a very good tool.

pip install pdfminer3k

First, to cover what most people need, here is a fairly general script that reads the text in a PDF:

from io import StringIO
from io import open
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # resource manager: shared store for fonts and other PDF resources
    rsrcmgr = PDFResourceManager()
    # in-memory buffer that collects the extracted text
    retstr = StringIO()
    laparams = LAParams()
    # device: converts the parsed pages to plain text
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the text into lines
    lines = str(content).split("\n")
    return lines


if __name__ == '__main__':
    with open('t1.pdf', "rb") as my_pdf:
        print(read_pdf(my_pdf))
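
The list returned by read_pdf is flat, with no page structure. The second script in this post matches a header that begins with '\x0c' (a form feed), which suggests that a form feed marks the start of each new page in the extracted text. Assuming that holds for your PDF, a minimal sketch for regrouping the lines by page could look like this. The name group_pages and the sample lines are made up for illustration:

```python
def group_pages(lines):
    """Group lines into pages, starting a new page at each leading form feed.

    Assumption: a line that begins with '\x0c' is the first line of a new page.
    """
    pages = [[]]
    for line in lines:
        if line.startswith('\x0c'):
            pages.append([])
            line = line[1:]  # drop the form feed itself
        pages[-1].append(line)
    return pages


# Hypothetical lines, shaped like the output of read_pdf
sample = ['first page, line 1', 'first page, line 2', '\x0csecond page, line 1']
for number, page in enumerate(group_pages(sample), start=1):
    print(number, page)
```

If the text ends with a form feed, the last group holds only an empty string and can be ignored.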

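My main goal was to pull certain key pieces of information out of a PDF, so I needed to find what those pieces had in common. Fortunately, every line holding key information contains '//', so I only need to find the lines that contain '//'. The following script does that and can be used as is: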

from io import StringIO
from io import open
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def read_pdf(pdf):
    # resource manager
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # device
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdf)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    # split the text into lines
    lines = str(content).split("\n")
    # ordinal positions of the unit headers whose entries should be kept
    units = [1, 2, 3, 5, 7, 8, 9, 11, 12, 13]
    # a unit header line starts with a form feed followed by 'unit '
    header = '\x0cunit '
    # print(lines[0:100])
    count = 0      # number of unit headers seen so far
    flag = False   # True while inside a unit listed in `units`
    text = open('words.txt', 'w+')
    for line in lines:
        if line.startswith(header):
            flag = False
            count += 1
            if count in units:
                flag = True
                print(line)
                text.write(line + '\n')
        if '//' in line and flag:
            # keep the part before '//', then what follows the last '. '
            text_line = line.split('//')[0].split('. ')[-1]
            print(text_line)
            text.write(text_line + '\n')
    text.close()


def _main():
    my_pdf = open('t1.pdf', "rb")
    read_pdf(my_pdf)
    my_pdf.close()


if __name__ == '__main__':
    _main()

In fact, the line lines = str(content).split("\n") is all you really need: print lines and you can see the content of the PDF.
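
When printing the lines to look for a pattern, it helps to show each one with repr(), so that invisible characters such as the '\x0c' in the header become visible. A small sketch follows. preview_lines is a hypothetical helper name, and the sample list stands in for the real output of read_pdf:

```python
def preview_lines(lines, limit=20):
    """Return the first `limit` lines, numbered and rendered with repr()."""
    return ["{}: {!r}".format(i, line) for i, line in enumerate(lines[:limit])]


# Hypothetical lines; in practice pass the list returned by read_pdf
sample = ['\x0cunit 1', '1. apple // a fruit', '']
for row in preview_lines(sample):
    print(row)
```

Scanning a preview like this is how a marker such as '//' or a page header can be spotted before writing any filtering code.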

From there, processing a PDF file becomes ordinary string processing, so the rest of the script needs little explanation.
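
To make the string-processing part easier to try without a PDF at hand, the filtering logic can be separated from the file handling. The sketch below is my restatement of the loop in the script, not a different method: extract_entries is a made-up name, it returns a list instead of writing words.txt, and the sample lines are invented.

```python
def extract_entries(lines, units, header='\x0cunit '):
    """Collect kept unit headers and the '//' entries that belong to them.

    `units` holds the ordinal positions (1-based) of the headers to keep.
    """
    entries = []
    count = 0     # number of headers seen so far
    keep = False  # True while inside a unit listed in `units`
    for line in lines:
        if line.startswith(header):
            count += 1
            keep = count in units
            if keep:
                entries.append(line)
        if '//' in line and keep:
            # same expression as in the script: the text before '//',
            # then whatever follows the last '. '
            entries.append(line.split('//')[0].split('. ')[-1])
    return entries


# Invented sample data, shaped like the lines the script expects
sample = [
    '\x0cunit 1', '1. apple // a fruit', 'plain text',
    '\x0cunit 2', '1. bridge // a structure',
    '\x0cunit 3', '2. candle // a light',
]
for entry in extract_entries(sample, units=[1, 3]):
    print(repr(entry))
```

Note that the split keeps any space that precedes '//', so calling strip() on each entry may be worthwhile if trailing whitespace matters.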
