Reading PDF files with Python

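pdfplumber is a library for working with the information in PDF files. It can look up detailed information about each text character, rectangle, and line, and it can also extract tables and support visual debugging.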

Documentation reference

It can be installed directly with pip. Enter the following on the command line:

pip install pdfplumber

import pdfplumber

with pdfplumber.open("path/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])  # dictionary describing the first character

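The pdfplumber.PDF class exposes two main properties, .metadata and .pages.

metadata is a dictionary containing information about the PDF.

pages is a list containing one entry for each page.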
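As a minimal sketch of how these two properties might be used together, the helper below summarizes any object that exposes .metadata and .pages. The name summarize_pdf and the stand-in object with made-up values are assumptions introduced for illustration, so the example runs without a real file.

```python
from types import SimpleNamespace


def summarize_pdf(pdf):
    """Summarize an object exposing .metadata (dict) and .pages (list).

    summarize_pdf is a hypothetical helper, not part of pdfplumber.
    """
    return {
        "page_count": len(pdf.pages),
        "metadata_keys": sorted(pdf.metadata.keys()),
    }


# Stand-in object with made-up values, used only to exercise the helper.
fake_pdf = SimpleNamespace(
    metadata={"Producer": "example", "Title": "demo"},
    pages=["page-1", "page-2", "page-3"],
)
summary = summarize_pdf(fake_pdf)
print(summary)

# With a real file (the path is a placeholder):
# import pdfplumber
# with pdfplumber.open("path/file.pdf") as pdf:
#     print(summarize_pdf(pdf))
```

Keeping the helper independent of how the PDF was opened makes it easy to try out on a stub before pointing it at a real document.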

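Each instance of the pdfplumber.Page class has several main properties and methods.

page_number: the page number

width: the page width

height: the page height

.objects/.chars/.lines/.rects: .chars, .lines and .rects are each a list holding one dictionary per object, and .objects collects the same dictionaries grouped by object type. Each dictionary describes one object on the page (a character, a line, a rectangle and so on), including its position.

extract_text(): extracts the text of the page by collating all of its character objects into a single string

extract_words(): returns all the words on the page together with their related information

extract_tables(): extracts the tables on the page

to_image(): used for visual debugging; returns an instance of the PageImage class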
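The properties listed above can be read together to get a quick overview of a page. The sketch below is one possible way to do that; describe_page is a hypothetical helper, and the stub page with invented coordinates only stands in for a real pdfplumber page so the code runs on its own.

```python
from types import SimpleNamespace


def describe_page(page):
    """Collect a few basic facts about a page-like object.

    describe_page is a hypothetical helper; it only reads the
    page_number, width, height, chars, lines and rects properties.
    """
    return {
        "page_number": page.page_number,
        "size": (page.width, page.height),
        "n_chars": len(page.chars),
        "n_lines": len(page.lines),
        "n_rects": len(page.rects),
        # Each char dictionary stores its character under the "text" key.
        "preview": "".join(char["text"] for char in page.chars[:10]),
    }


# Stub page with invented values, standing in for pdf.pages[0].
stub_page = SimpleNamespace(
    page_number=1,
    width=595,
    height=842,
    chars=[
        {"text": "H", "x0": 72.0, "top": 50.0},
        {"text": "i", "x0": 80.0, "top": 50.0},
    ],
    lines=[],
    rects=[{"x0": 70.0, "top": 45.0, "x1": 90.0, "bottom": 65.0}],
)
info = describe_page(stub_page)
print(info)
```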

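table_settings

Table extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell separators. The method can, however, be heavily customized through the table_settings argument, whose individual settings each have a default value.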

Table extraction strategies

Options for the vertical_strategy and horizontal_strategy settings

"lines"

Use the page's graphical lines, including the sides of rectangle objects, as the borders of potential table cells.

"lines_strict"

Use the page's graphical lines, but not the sides of rectangle objects, as the borders of potential table cells.

"text"

For vertical_strategy: deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table cells. For horizontal_strategy: the same, but using the tops of words.

"explicit"

Only use the lines explicitly defined in explicit_vertical_lines / explicit_horizontal_lines.
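To make the four strategies concrete, here is a small sketch that assembles a table_settings dictionary and checks the strategy names before it is passed to extract_tables. The helper build_table_settings and its validation rules are assumptions made for illustration; only the dictionary keys and the strategy names come from the options described above.

```python
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def build_table_settings(vertical="lines", horizontal="lines",
                         explicit_vertical_lines=None,
                         explicit_horizontal_lines=None):
    """Build a table_settings dict (hypothetical helper, not a pdfplumber API)."""
    for name in (vertical, horizontal):
        if name not in VALID_STRATEGIES:
            raise ValueError("unknown strategy: %r" % (name,))
    # This sketch insists on explicit lines whenever "explicit" is chosen.
    if vertical == "explicit" and not explicit_vertical_lines:
        raise ValueError("explicit vertical strategy needs explicit_vertical_lines")
    if horizontal == "explicit" and not explicit_horizontal_lines:
        raise ValueError("explicit horizontal strategy needs explicit_horizontal_lines")

    settings = {
        "vertical_strategy": vertical,
        "horizontal_strategy": horizontal,
    }
    if explicit_vertical_lines:
        settings["explicit_vertical_lines"] = list(explicit_vertical_lines)
    if explicit_horizontal_lines:
        settings["explicit_horizontal_lines"] = list(explicit_horizontal_lines)
    return settings


# Columns inferred from word alignment, rows from ruled lines.
text_settings = build_table_settings(vertical="text", horizontal="lines")
print(text_settings)

# Columns at hand-picked x positions (the numbers are invented).
explicit_settings = build_table_settings(
    vertical="explicit", explicit_vertical_lines=[50, 200, 350, 500]
)
print(explicit_settings)

# Usage with a real page would look like:
# page.extract_tables(table_settings=text_settings)
```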

Reading text

import pdfplumber

with pdfplumber.open("e:\\600aaa_2.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # Get all the text on the current page, including text inside tables
        print(page.extract_text())
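When the text is needed as a single string rather than printed page by page, the same loop can collect it instead. The sketch below assumes that extract_text() may return None or an empty string for a page without text and treats both the same way; collect_text and the stub pages are hypothetical and exist only so the example runs without a PDF.

```python
from types import SimpleNamespace


def collect_text(pages):
    """Join the text of all pages, each preceded by a page header.

    collect_text is a hypothetical helper; `pages` is any iterable of
    objects with a page_number property and an extract_text() method.
    """
    parts = []
    for page in pages:
        text = page.extract_text() or ""  # tolerate None or empty results
        parts.append("---------- Page [%d] ----------" % page.page_number)
        parts.append(text)
    return "\n".join(parts)


# Stub pages with made-up content, standing in for pdf.pages.
stub_pages = [
    SimpleNamespace(page_number=1, extract_text=lambda: "hello"),
    SimpleNamespace(page_number=2, extract_text=lambda: None),
]
full_text = collect_text(stub_pages)
print(full_text)

# With a real file: full_text = collect_text(pdf.pages)
```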

Reading tables

import re
import pdfplumber

with pdfplumber.open("e:\\600aaa_1.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # Settings such as the tolerance for merging intersecting cell edges
        # go in table_settings; an empty dict keeps the defaults.
        for pdf_table in page.extract_tables(table_settings={}):
            # print(pdf_table)
            for row in pdf_table:
                # Strip carriage returns, line breaks and other whitespace
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
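The cleaned rows can also be kept for later use instead of only being printed. As one possible approach, the sketch below applies the same whitespace stripping and writes the result as CSV with the standard library. The helpers clean_table and table_to_csv, the choice to turn empty cells into empty strings, and the sample table are all assumptions made for illustration.

```python
import csv
import io
import re


def clean_table(table):
    """Remove all whitespace from each cell; empty (None) cells become ''."""
    return [
        [re.sub(r"\s+", "", cell) if cell is not None else "" for cell in row]
        for row in table
    ]


def table_to_csv(table):
    """Return the cleaned table as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(table))
    return buffer.getvalue()


# Made-up table shaped like one element of page.extract_tables():
# a list of rows, each row a list of strings or None.
sample_table = [["Name", "Amo\nunt"], ["A B", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Note that, as in the loop above, the pattern removes every whitespace character, so spaces between words inside a cell are dropped as well.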

