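Applying Python + Tesseract-OCR to Character Recognition in Images

Environment: Windows 10 + Python 3.7.7

1 Prepare the software and the matching language data

I downloaded tesseract-ocr-w64-setup-v4.0.0.20181030.exe from https://digi.bib.uni-mannheim.de/tesseract/, then downloaded the Simplified Chinese language data from https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata

Reportedly you also need https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim_vert.traineddata and https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

I downloaded all of these as well.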
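
The links above are GitHub file pages (note the /blob/ in the path), so saving them directly gives an HTML page rather than the model file. Below is a minimal sketch of fetching the files with Python's standard library, assuming the same paths are served as raw files when /blob/ is replaced by /raw/. The helper names raw_url and download_traineddata are made up for this illustration.

```python
import urllib.request
from pathlib import Path

# Base of the file-page URLs quoted in this step.
BLOB_BASE = "https://github.com/tesseract-ocr/tessdata/blob/master/"


def raw_url(blob_url: str) -> str:
    """Turn a GitHub file-page URL into its raw-download counterpart."""
    return blob_url.replace("/blob/", "/raw/", 1)


def download_traineddata(lang: str, dest_dir: str) -> Path:
    """Download <lang>.traineddata into dest_dir and return the saved path."""
    target = Path(dest_dir) / f"{lang}.traineddata"
    urllib.request.urlretrieve(raw_url(BLOB_BASE + f"{lang}.traineddata"), str(target))
    return target

# Example call (needs network access):
#   for lang in ("chi_sim", "chi_sim_vert", "eng"):
#       download_traineddata(lang, ".")
```

The files are tens of megabytes each, so the download can take a while on a slow connection.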

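If you cannot download these files, there is no point in continuing.

2 Install tesseract-ocr-w64

I installed it to D:\Tesseract-OCR.

3 Set the environment variable

Press Win+S, type "environment variables", and open the Control Panel entry "Edit the system environment variables". On the Advanced tab, click "Environment Variables", double-click Path under "User variables for XX", click New, and enter "D:\Tesseract-OCR". Click OK three times to close the dialogs.

Open a new cmd window and run tesseract -v to check that the version is printed.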
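
The same check can be done from Python. This is a small sketch, assuming the Path entry from this step has taken effect in the current session; find_tesseract and tesseract_version are illustrative names, not part of any library.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(name: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def tesseract_version(exe: str) -> str:
    """Run `<exe> -v` and return whatever it prints (stdout and stderr merged)."""
    done = subprocess.run([exe, "-v"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return done.stdout.decode("utf-8", errors="replace")

# Example call:
#   exe = find_tesseract()
#   print(tesseract_version(exe) if exe else "tesseract is not on PATH")
```

If find_tesseract() returns None, the Path entry is probably missing or the program was started before the variable was changed.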

4 Put the downloaded language data into the tessdata folder of the Tesseract-OCR installation.

That is, copy chi_sim.traineddata from step 1 into D:\Tesseract-OCR\tessdata.
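
As a sketch, the copy and a quick check of what is installed could look like this in Python. The function names are hypothetical, and the directory is whatever you chose in step 2.

```python
import shutil
from pathlib import Path
from typing import List


def install_traineddata(src_file: str, tessdata_dir: str) -> Path:
    """Copy one .traineddata file into the tessdata directory."""
    dest = Path(tessdata_dir) / Path(src_file).name
    shutil.copy(src_file, str(dest))
    return dest


def installed_languages(tessdata_dir: str) -> List[str]:
    """List the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))

# Example call:
#   install_traineddata("chi_sim.traineddata", r"D:\Tesseract-OCR\tessdata")
#   print(installed_languages(r"D:\Tesseract-OCR\tessdata"))
```

On the command line, tesseract --list-langs reports the languages that tesseract itself can see.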

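5 Recognition

Prepare an image containing Chinese text, open a cmd window, and run: tesseract <image name> <output file base name> -l <language>

For example, I took a screenshot with Win+Shift+S and saved it as test.png, then changed to the image's directory in cmd and ran:

tesseract test.png result -l chi_sim

result.txt then appears in the image directory. Judging from its content, the recognition accuracy is quite high.

(Reportedly you sometimes need to add a TESSDATA_PREFIX environment variable to specify the location of tessdata, but I did not need it in my test.)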
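
If you do need TESSDATA_PREFIX, it can be set for a single run from Python without touching the system settings. This is a sketch only: env_with_tessdata is a made-up helper, and whether the variable should name the tessdata folder itself or its parent directory may depend on the Tesseract version, so check the error message tesseract prints.

```python
import os
from typing import Dict, Optional


def env_with_tessdata(tessdata_dir: str,
                      base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TESSDATA_PREFIX set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env

# Example call (path assumed from step 4):
#   import subprocess
#   subprocess.run(["tesseract", "test.png", "result", "-l", "chi_sim"],
#                  env=env_with_tessdata(r"D:\Tesseract-OCR\tessdata"))
```

Because the function returns a copy, the environment of the calling Python process is left unchanged.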

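If the image contains only a single line of text, you can run:

tesseract test2.png result -l chi_sim --psm 7

--psm 7 tells tesseract that the image is a single line of text, which can reduce recognition errors. The default is 3.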
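
The two command lines in this step differ only in the optional --psm part. A small sketch of building them in Python, where build_command is an illustrative helper and not an existing API:

```python
from typing import List, Optional


def build_command(image: str, out_base: str, lang: str = "chi_sim",
                  psm: Optional[int] = None, exe: str = "tesseract") -> List[str]:
    """Build the argument list for: tesseract IMAGE OUTBASE -l LANG [--psm N]."""
    cmd = [exe, image, out_base, "-l", lang]
    if psm is not None:
        # Only add the flag when a mode is requested; otherwise tesseract uses its default.
        cmd += ["--psm", str(psm)]
    return cmd
```

The returned list can be passed directly to subprocess.run, which avoids quoting problems with paths that contain spaces or non-ASCII characters.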

6 Code

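import subprocess

# Arguments: tesseract executable, image path, output base name, language.
# The executable path follows the install directory from step 2 and "chi_sim" is
# the language data installed in step 4; adjust both paths to your own machine.
subprocess.run(
    [r"D:\Tesseract-OCR\tesseract.exe", r"E:\PythonCode\SpiderBaidu\tesseract图片识别\test2.png", "result", "-l",
     "chi_sim"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)

# Tesseract appends ".txt" to the output base name and writes into the current directory.
with open(r"result.txt", "r", encoding='utf-8') as result:
    print(result.read())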
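
A variation that skips the intermediate result.txt: tesseract accepts the special output base name stdout and then prints the recognized text instead of writing a file. This is a sketch under the assumption that tesseract is on Path and the output is UTF-8; ocr_to_text is a hypothetical name.

```python
import subprocess


def ocr_to_text(image: str, lang: str = "chi_sim", exe: str = "tesseract") -> str:
    """Run tesseract with "stdout" as the output base and return the text."""
    done = subprocess.run(
        [exe, image, "stdout", "-l", lang],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
    return done.stdout.decode("utf-8")

# Example call:
#   print(ocr_to_text("test2.png"))
```

With check=True, a non-zero exit code from tesseract raises subprocess.CalledProcessError, so a missing image or language file does not pass silently.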

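7 Training language data

I did not pursue this; see the original article: https://www.jianshu.com/p/3326c7216696

tesseract --help | --help-extra shows the command-line help; running tesseract --print-parameters on the command line prints all available parameters.

psm option values for pytesseract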

Page segmentation modes: 
0    Orientation and script detection (OSD) only.  
1    Automatic page segmentation with OSD.  
2    Automatic page segmentation, but no OSD, or OCR.  
3    Fully automatic page segmentation, but no OSD. (Default)  
4    Assume a single column of text of variable sizes.  
5    Assume a single uniform block of vertically aligned text. 
6    Assume a single uniform block of text.  
7    Treat the image as a single text line.  
8    Treat the image as a single word.  
9    Treat the image as a single word in a circle. 
10    Treat the image as a single character. 
11    Sparse text. Find as much text as possible in no particular order. 
12    Sparse text with OSD. 
13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
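
When calling tesseract through pytesseract, the mode is usually passed as part of a config string. A minimal sketch of building that string with a range check against the table above; psm_config is an illustrative helper, not something pytesseract provides.

```python
PSM_MIN, PSM_MAX = 0, 13  # range of the modes listed in the table


def psm_config(psm: int) -> str:
    """Return the "--psm N" option string, rejecting values outside the table."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    return f"--psm {psm}"
```

Assuming pytesseract and Pillow are installed, a call would look roughly like pytesseract.image_to_string(Image.open("test2.png"), lang="chi_sim", config=psm_config(7)).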

References:

https://www.jianshu.com/p/3326c7216696

https://www.pythonf.cn/read/83901

https://www.cnblogs.com/wangkevin5626/p/9640165.html

https://blog.csdn.net/huitailangyz/article/details/80390090

https://blog.csdn.net/hanoil/article/details/74171371

posted on 2020-04-16 11:15 by pu369com
