Simple image-to-text recognition (OCR) in Python 3

1. Install pytesseract

pip install pytesseract

pytesseract is only a wrapper around the Tesseract engine, so Tesseract itself must be installed as well.
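Because pytesseract calls the Tesseract executable behind the scenes, it can help to confirm that the binary is reachable before running any OCR. A minimal sketch, assuming the executable is named `tesseract` and is looked up on PATH; the helper name `find_tesseract` is hypothetical:

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary if it is on PATH, else None."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it or fix PATH first")
else:
    print("tesseract found at", path)
```

If this prints the "not found" message, the installation steps in the next section are the place to look.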

2. Install Tesseract

Official site: https://github.com/tesseract-ocr/tesseract
Official documentation: https://github.com/tesseract-ocr/tessdoc
Language data: https://github.com/tesseract-ocr/tessdata
Download (Windows installers): https://digi.bib.uni-mannheim.de/tesseract/

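Notes:
1. After installing, add the Tesseract install directory to the PATH environment variable.
2. The installer can install the Simplified Chinese language pack directly. I tried this myself and it was not as slow as expected. (Once installed, lang='chi_sim' can be used.)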
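To check that the Simplified Chinese pack really was installed, the Tesseract command line can list its languages with `--list-langs`. The sketch below assumes the executable is named `tesseract` and is on PATH; the helper names `parse_list_langs` and `installed_languages` are made up for this example, and the parsing assumes the listing starts with a "List of ..." header line:

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ...:'
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(executable="tesseract"):
    """Return installed language codes, or [] if the binary is not on PATH."""
    if shutil.which(executable) is None:
        return []
    result = subprocess.run(
        [executable, "--list-langs"], capture_output=True, text=True
    )
    # Some versions print the list to stderr, so both streams are checked.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


langs = installed_languages()
print("chi_sim installed:", "chi_sim" in langs)
```

If changing PATH is not an option, pytesseract also allows pointing at the binary directly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the Tesseract executable.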


Reference: https://www.jianshu.com/p/f7cb0b3f337a

3. Usage

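# OCR: extract text from an image
import pytesseract
from PIL import Image


def imageToStr(image_url, lang):
    # image_url is a local file path; lang is a Tesseract language code such as 'eng' or 'chi_sim'
    im = Image.open(image_url)
    im = im.convert('L')  # convert to grayscale before recognition
    im_str = pytesseract.image_to_string(im, lang=lang)
    return im_str


img_url = r'C:\Users\peng\Desktop\50.png'

# img_str = imageToStr(img_url, 'eng')
# print('Recognized English text:', img_str)

# print('Recognized Chinese text:')
cn_img_str = imageToStr(img_url, 'chi_sim')
print(cn_img_str)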

4. Result

静夜思
作者: 李

床前明月光，疑是地上入
举头望明月，低头思故乡
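To see which parts of a result like this the engine was unsure about, pytesseract's `image_to_data` can report a confidence value per recognized word. The sketch below only shows the filtering step, assuming the dictionary form (`output_type=pytesseract.Output.DICT`) with parallel `text` and `conf` lists. The helper `low_confidence_words`, the threshold of 60, and the sample dictionary are illustrative and are not taken from the run above:

```python
def low_confidence_words(data, threshold=60.0):
    """Return (text, conf) pairs whose confidence is below threshold.

    `data` is expected to look like the dict returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    flagged = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # conf may arrive as a string or a number
        # Entries with a negative conf are layout blocks, not recognized words.
        if conf < 0 or not text.strip():
            continue
        if conf < threshold:
            flagged.append((text, conf))
    return flagged


# Hypothetical sample in the same shape, for illustration only (not real OCR output).
sample = {"text": ["", "hello", "wor1d"], "conf": ["-1", "96", "41"]}
print(low_confidence_words(sample))
```

With a real image, `data` would come from `pytesseract.image_to_data(im, lang='chi_sim', output_type=pytesseract.Output.DICT)`.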

5. Original image

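6. Problems:

Clearly,
1. The 白 in 李白 (Li Bai) was not recognized.
2. 霜 was recognized as 入.

I have not studied machine learning in depth; the likely cause is the low resolution of the image.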
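If low resolution is indeed the cause, one common thing to try is enlarging and binarizing the image before handing it to Tesseract. A minimal sketch using Pillow; the helper name `preprocess`, the 2x scale, and the threshold of 128 are assumptions to tune, and whether this recovers 白 and 霜 on the original image has not been verified:

```python
from PIL import Image


def preprocess(im, scale=2, threshold=128):
    """Upscale, convert to grayscale, and binarize an image before OCR."""
    width, height = im.size
    im = im.resize((width * scale, height * scale), Image.LANCZOS)
    im = im.convert("L")
    # Map every pixel to pure black or pure white around the threshold.
    return im.point(lambda p: 255 if p > threshold else 0)


# Small synthetic image so the sketch runs without the original screenshot.
demo = Image.new("L", (10, 5), 200)
out = preprocess(demo)
print(out.size, out.mode)
```

The result can be passed straight to the recognizer, for example `pytesseract.image_to_string(preprocess(Image.open(img_url)), lang='chi_sim')`. Tesseract's page segmentation mode (for example `config='--psm 6'`) is another setting worth experimenting with.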

...

posted @ 2022-04-17 14:10 王希有
