Important options for the tesseract library in Python

When calling tesseract, the three most important options are -l, --oem and --psm.

The -l option sets the language of the text to recognize. Run tesseract --list-langs to see which language data files are installed.
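
To check the installed languages from a script, one option is to run the same command and parse what it prints. The sketch below assumes the output is one header line followed by one language code per line. parse_list_langs and installed_languages are hypothetical helper names, not part of any library.

```python
from __future__ import annotations

import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: the first line is a header and every following non-empty
    line holds one language code.
    """
    lines = output.splitlines()[1:]  # drop the header line
    return sorted(line.strip() for line in lines if line.strip())


def installed_languages(tesseract_cmd: str = "tesseract") -> list[str]:
    """Run `tesseract --list-langs` and return the parsed language codes."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: some builds may print the list to stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)
```

With this in place, a check such as "chi_sim" in installed_languages() can run before any OCR call, provided tesseract is on the PATH.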

　　Chinese support: download the Chinese language data from https://github.com/tesseract-ocr/tessdata and copy chi_sim.traineddata from it into the **\Tesseract-OCR\tessdata directory.
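
A quick way to confirm that the file landed in the right place is to look for it directly. This is a minimal sketch: has_traineddata is a hypothetical helper, and the tessdata directory is whatever path your installation uses (it is passed in, not guessed).

```python
from pathlib import Path


def has_traineddata(tessdata_dir: str, lang: str = "chi_sim") -> bool:
    """Return True if <tessdata_dir>/<lang>.traineddata exists as a file."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()
```

If this returns False for your tessdata directory, -l chi_sim will most likely fail because the language data cannot be loaded.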

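The --oem option sets the OCR engine mode, that is, which recognition algorithm Tesseract uses. Run tesseract --help-oem to list the available modes. There are usually four, and the default is the fourth (mode 3). Pass --oem 1 to use only the deep-learning LSTM engine.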

　　OCR Engine modes:
　　　　0 Legacy engine only.
　　　　1 Neural nets LSTM engine only.
　　　　2 Legacy + LSTM engines.
　　　　3 Default, based on what is available.
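
To avoid bare numbers in code, the four modes above can be given names. The following is only a sketch: the OEM enum and oem_option are hypothetical names introduced here, and only the numeric values come from the list above.

```python
from enum import IntEnum


class OEM(IntEnum):
    """Engine modes as listed by `tesseract --help-oem`."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def oem_option(mode: OEM = OEM.DEFAULT) -> str:
    """Return the command-line fragment that selects the given engine mode."""
    return f"--oem {int(mode)}"
```

For example, oem_option(OEM.LSTM_ONLY) produces the string "--oem 1", which can be placed in the config string passed to pytesseract.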

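The --psm option sets the page segmentation mode Tesseract uses. Run tesseract --help-psm to list the modes. In my experience modes 6 and 7 work well for small pieces of text; for large blocks of text, try the default mode 3.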

　　Page segmentation modes:
　　　　0 Orientation and script detection (OSD) only.
　　　　1 Automatic page segmentation with OSD.
　　　　2 Automatic page segmentation, but no OSD, or OCR.
　　　　3 Fully automatic page segmentation, but no OSD. (Default)
　　　　4 Assume a single column of text of variable sizes.
　　　　5 Assume a single uniform block of vertically aligned text.
　　　　6 Assume a single uniform block of text.
　　　　7 Treat the image as a single text line.
　　　　8 Treat the image as a single word.
　　　　9 Treat the image as a single word in a circle.
　　　　10 Treat the image as a single character.
　　　　11 Sparse text. Find as much text as possible in no particular order.
　　　　12 Sparse text with OSD.
　　　　13 Raw line. Treat the image as a single text line,
　　　　 bypassing hacks that are Tesseract-specific.
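
Since the best mode depends on the image, it can help to run the same image through a few modes and compare the results. The sketch below assumes the value ranges shown in the two lists above (0 to 3 for --oem, 0 to 13 for --psm). build_config and try_psm_modes are hypothetical helpers, and the OCR call is passed in as a function so that the helpers do not depend on any particular library.

```python
from typing import Callable, Dict, Iterable


def build_config(lang: str, oem: int, psm: int) -> str:
    """Validate the mode numbers and build a tesseract option string."""
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be 0-3, got {oem}")
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be 0-13, got {psm}")
    return f"-l {lang} --oem {oem} --psm {psm}"


def try_psm_modes(
    ocr: Callable[[str], str],
    lang: str = "chi_sim",
    oem: int = 1,
    modes: Iterable[int] = (6, 7, 3),
) -> Dict[int, str]:
    """Run `ocr` once per page segmentation mode and collect the results.

    `ocr` is any callable that takes a config string and returns text, e.g.
    lambda cfg: pytesseract.image_to_string(img, config=cfg)
    """
    return {psm: ocr(build_config(lang, oem, psm)).strip() for psm in modes}
```

The returned dictionary maps each mode number to the recognized text, so the outputs for modes 6, 7 and 3 can be inspected side by side.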

Usage:

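import pytesseract
from PIL import Image
img = Image.open('./img.png')
# Simplified Chinese, LSTM engine only, treat the image as a single text line
config = "-l chi_sim --oem 1 --psm 7"
text = pytesseract.image_to_string(img, config=config)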

posted @ 2019-08-12 15:21 wanglai
