Tesseract OCR Usage Guide

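Adding the package source

Open the sources list for editing:

sudo vi /etc/apt/sources.list

Then add a deb entry for the bionic universe component.

Here bionic is the codename of the installed Ubuntu release; change it to match your own system.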

Installation

sudo apt install tesseract-ocr

sudo apt install libtesseract-dev

Installing languages

Tesseract offers over 130 languages and over 35 scripts. The language packs are named tesseract-ocr-langcode and tesseract-ocr-script-scriptcode, where langcode is a three-letter language code and scriptcode is a four-letter script code.

For example: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script).
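The naming rule above can be sketched in a few lines of Python. The helper names (language_package, script_package, apt_install_command) are made up for illustration, and the underscore-to-hyphen conversion is an assumption inferred from the chi_sim / tesseract-ocr-chi-sim pair.

```python
def language_package(code: str) -> str:
    """Map a language code such as 'chi_sim' to its apt package name."""
    # Assumption: underscores in the code become hyphens in the package name,
    # as in chi_sim -> tesseract-ocr-chi-sim.
    return "tesseract-ocr-" + code.lower().replace("_", "-")


def script_package(code: str) -> str:
    """Map a four-letter script code such as 'Latn' to its apt package name."""
    return "tesseract-ocr-script-" + code.lower()


def apt_install_command(langs=(), scripts=()):
    """Build (but do not run) the apt command that installs the given packs."""
    packages = [language_package(c) for c in langs]
    packages += [script_package(c) for c in scripts]
    return ["sudo", "apt", "install"] + packages


# Print the command so it can be reviewed before running it by hand.
print(" ".join(apt_install_command(["eng", "chi_sim"], ["Latn"])))
```

The function only builds the command, so nothing is installed until the printed line is run in a shell.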

On Ubuntu 18.04 these packages come from the universe component of the bionic release.

Command format

tesseract file outputbase [options]... [configfile]...

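Arguments

file

The input file. It can be an image or a text file. When it is a text file, each line names one image.

outputbase

The base name of the output file; the extension of the chosen output format is appended to it.

options

See the options table below.

configfile

See the configfile table below.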
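As a rough sketch of the text-file form of the file argument, the Python below writes a list with one image path per line and builds the matching command. The file names (page1.png, page2.png, images.txt, result) and the helper names are placeholders, not anything Tesseract prescribes.

```python
import tempfile
from pathlib import Path


def write_image_list(image_paths, list_path):
    """Write one image path per line, the layout expected for a text input file."""
    text = "".join(f"{path}\n" for path in image_paths)
    Path(list_path).write_text(text, encoding="utf-8")
    return str(list_path)


def basic_command(input_file, outputbase):
    """Build the minimal invocation: tesseract file outputbase."""
    return ["tesseract", str(input_file), str(outputbase)]


# Hypothetical image names; the list is written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_file = write_image_list(["page1.png", "page2.png"], Path(tmp) / "images.txt")
    print(" ".join(basic_command(list_file, "result")))
```

Once the listed images really exist, the returned list can be passed to subprocess.run(cmd, check=True) to perform the recognition.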

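options

-c configvar=value

Sets the value of a configuration parameter; the option may be given more than once.

--dpi n

Specifies the resolution of the input image in DPI; a typical value of n is 300. Without this option the resolution is read from the image metadata, and if the image does not carry that information, Tesseract tries to guess it.

-l lang

-l script

Specifies the language or script to use. The default is English (eng). Several languages can be given, joined with +.

--psm n

Sets the page segmentation mode, that is, how the text is laid out in the image.

--oem n

Selects the OCR engine mode: the legacy Tesseract engine or the LSTM engine.

--tessdata-dir path

Specifies the location of the tessdata directory.

--user-patterns file

Specifies the location of the user patterns file.

--user-words file

Specifies the location of the user words file.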
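To show how these options combine on one command line, here is a minimal Python sketch that assembles the argument list. The function name build_command, the file names, and the particular values (psm 6, oem 1, dpi 300) are illustrative assumptions; --help-psm and --help-oem list the values your installation accepts.

```python
def build_command(image, outputbase, langs=("eng",), psm=None, oem=None,
                  dpi=None, tessdata_dir=None, config_vars=None):
    """Assemble a tesseract argument list from the options described above."""
    cmd = ["tesseract", str(image), str(outputbase)]
    if langs:
        # Several languages are joined with '+', e.g. eng+chi_sim.
        cmd += ["-l", "+".join(langs)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    # Each configuration parameter gets its own -c name=value pair.
    for name, value in (config_vars or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Hypothetical input: scan.png, English plus Simplified Chinese.
cmd = build_command("scan.png", "result", langs=["eng", "chi_sim"],
                    psm=6, oem=1, dpi=300)
print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.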

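configfile

alto

Writes the output in ALTO format as outputbase.xml.

hocr

Writes the output as outputbase.hocr.

pdf

Writes the output as outputbase.pdf.

tsv

Writes the output as outputbase.tsv.

txt

Writes the output as outputbase.txt.

get.images

Writes the processed input images to a file.

logfile

Redirects debug messages to a log file.

lstm.train

Writes the files used for LSTM training.

makebox

Writes a bounding-box file.

quiet

Redirects debug messages to /dev/null.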
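The relation between configfile names and output files can be sketched as below. Only the four formats whose extension simply repeats the configfile name are mapped; the helper names and the scan.png / result file names are assumptions made for the example.

```python
from pathlib import Path

# Extensions for the hocr, pdf, tsv and txt rows of the table above.
OUTPUT_EXTENSIONS = {"hocr": ".hocr", "pdf": ".pdf", "tsv": ".tsv", "txt": ".txt"}


def command_with_configs(image, outputbase, configfiles):
    """Append the configfile names after the input file and outputbase."""
    return ["tesseract", str(image), str(outputbase), *configfiles]


def expected_outputs(outputbase, configfiles):
    """List the files expected for the output formats that were requested."""
    return [Path(f"{outputbase}{OUTPUT_EXTENSIONS[name]}")
            for name in configfiles if name in OUTPUT_EXTENSIONS]


configs = ["txt", "pdf", "quiet"]
print(" ".join(command_with_configs("scan.png", "result", configs)))
for path in expected_outputs("result", configs):
    print(path)
```

Configfiles such as quiet change logging rather than the output format, so they add no entry to the list of expected files.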

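Other options

-h

Shows the help message.

--help-extra

Shows help for advanced usage.

--help-psm

Shows the page segmentation modes.

--help-oem

Shows the OCR engine modes.

--list-langs

Lists the available languages.

--print-parameters

Prints the Tesseract parameters.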
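A script can check which languages are installed before choosing a value for -l. The sketch below calls --list-langs only when a tesseract binary is found on the PATH; the header-line handling is an assumption about the output layout, which may differ between versions.

```python
import shutil
import subprocess


def parse_list_langs(output: str):
    """Extract language codes from the text printed by --list-langs."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first line is a header beginning with
    # 'List of available languages'; it is dropped when present.
    if lines and lines[0].lower().startswith("list of available languages"):
        lines = lines[1:]
    return lines


def installed_languages():
    """Return the installed language codes, or None if tesseract is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, timeout=30)
    # Fall back to stderr in case the list is not printed on stdout.
    return parse_list_langs(result.stdout or result.stderr)


langs = installed_languages()
print("tesseract not found" if langs is None else langs)
```

Separating the parsing from the subprocess call keeps the parser usable on saved output from another machine.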
