Official build guide: https://github.com/tesseract-ocr/tesseract/wiki/Compiling
Version tested:
root@9a2a063f9534:/tesseract/testing# tesseract -v
tesseract 4.00.00dev-697-gcdc3533
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
 Found AVX2
 Found AVX
 Found SSE
1. Docker + Ubuntu
git clone git@github.com:tesseract-ocr/tesseract.git
cd tesseract
docker pull ubuntu:latest
docker build -t google-ocr:latest .
docker run -itd --name ocr google-ocr:latest /bin/bash
docker exec -it ocr /bin/bash
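Once inside the container, install the build dependencies. If you also need the training tools, run the second command as well.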
apt-get install -y g++ autoconf automake libtool autoconf-archive pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev git
# training
apt-get install -y libicu-dev libpango1.0-dev libcairo2-dev
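Before moving on to the builds, it can be worth confirming that the toolchain really landed on PATH. The following is only a sketch: `missing_tools` is a hypothetical helper name, and the list of commands is my assumption about what the packages above provide (on Ubuntu the `libtool` package ships `libtoolize`).

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Commands assumed to come from the packages installed above.
missing=$(missing_tools g++ autoconf automake libtoolize pkg-config git)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

If anything is reported missing, re-run the `apt-get install` line before starting the Leptonica build.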
Leptonica
Tesseract   Leptonica   Ubuntu
4.00        1.74.2      Must build from source
The official wiki says Leptonica must be built from source for this version, so fetch the source and build it:
cd /tmp
git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica
autoreconf -vi
./autobuild
./configure
make
make install
Build and install Tesseract itself
cd /tesseract
./autogen.sh
LIBLEPT_HEADERSDIR=/usr/include ./configure --with-extra-libraries=/usr/local/lib
make install
Check that the installation succeeded
tesseract
tesseract -v
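To make this check scriptable, the version string can be pulled out of the `tesseract -v` output. A rough sketch: `tesseract_version` is a hypothetical helper, and it assumes the first output line has the form `tesseract <version>`, as in the output shown at the top of these notes.

```shell
#!/bin/sh
# Hypothetical helper: read `tesseract -v` output on stdin and print
# the second field of the first line that starts with "tesseract".
tesseract_version() {
    awk '$1 == "tesseract" { print $2; exit }'
}

if command -v tesseract >/dev/null 2>&1; then
    # Some builds print version info on stderr, so merge both streams.
    echo "Installed version: $(tesseract -v 2>&1 | tesseract_version)"
else
    echo "tesseract not found on PATH"
fi
```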
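Download the language data (trained models); pick only the languages you need.
Language data: https://github.com/tesseract-ocr/tessdata
Documentation: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
Put the language data in the expected directory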
export TESSDATA_PREFIX=/tesseract/tessdata
cp xxx.traineddata /tesseract/
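A missing or misplaced language file is the most common reason the test run fails, so a quick lookup can save time. This is a sketch under assumptions: `find_traineddata` is a hypothetical helper, and it checks both the prefix directory and a `tessdata/` subdirectory beneath it, because as far as I know some Tesseract versions treat `TESSDATA_PREFIX` as the parent of `tessdata` while others treat it as the data directory itself.

```shell
#!/bin/sh
# Hypothetical helper: print the path of <lang>.traineddata if it exists
# in the given prefix or in its tessdata/ subdirectory; fail otherwise.
find_traineddata() {
    td_prefix=$1
    td_lang=$2
    for td_dir in "$td_prefix" "$td_prefix/tessdata"; do
        if [ -f "$td_dir/$td_lang.traineddata" ]; then
            printf '%s\n' "$td_dir/$td_lang.traineddata"
            return 0
        fi
    done
    return 1
}

# Languages used in these notes; the fallback prefix is the value exported above.
for lang in eng chi_sim; do
    if path=$(find_traineddata "${TESSDATA_PREFIX:-/tesseract/tessdata}" "$lang"); then
        echo "$lang: $path"
    else
        echo "$lang: not found"
    fi
done
```

`tesseract --list-langs` gives the authoritative answer, since it reports what the installed binary can actually see.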
Run a test
cd /tesseract/testing
# english
tesseract phototest.tif result -l eng
# chinese
tesseract chi.jpg result1 -l chi_sim
Check the output
cat result.txt
cat result1.txt
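Instead of reading the files by eye, the check can be reduced to "did OCR produce any text at all". A minimal sketch: `check_result` is a hypothetical helper, and it only tests that the file is non-empty, which says nothing about recognition quality.

```shell
#!/bin/sh
# Hypothetical helper: report the line count of an OCR result file,
# or fail if the file is missing or empty.
check_result() {
    if [ -s "$1" ]; then
        echo "$1: $(wc -l < "$1" | tr -d ' ') line(s) of text"
    else
        echo "$1: empty or missing"
        return 1
    fi
}

# Result files written by the two test commands above.
for f in result.txt result1.txt; do
    check_result "$f" || echo "  re-run tesseract for $f"
done
```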
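Accuracy can be improved by training; see the official documentation for the procedure. I have not tried this myself.
Appendix:
Calling Tesseract from Python: https://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/
Python wrapper dependency (pytesseract): https://github.com/madmaze/pytesseract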