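Notes on training a font with text2image on Windows 10
Getting Tesseract to work with a font that is already installed (for a font that is missing, download and install it first; the rest is the same). Pitfalls I ran into:
Error 1: Fontconfig error: Cannot load default config file
Solution:
Set these environment variables: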
FONTCONFIG_FILE: E:\python\Tesseract-OCR\fonts.conf
FONTCONFIG_PATH: C:\Windows\Fonts
If fonts.conf does not exist, create it yourself with the following content:
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<!-- C:\WINDOWS\fonts.conf file to configure system font access -->
<fontconfig>
  <!-- Find fonts in these directories -->
  <dir>C:\WINDOWS\fonts</dir>
  <cache>C:\WINDOWS\Cache\Fontcache</cache>
  <cachedir>C:\WINDOWS\Cache\Fontcache</cachedir>

  <!-- Accept deprecated 'mono' alias, replacing it with 'monospace' -->
  <match target="pattern">
    <test qual="any" name="family"><string>mono</string></test>
    <edit name="family" mode="assign"><string>monospace</string></edit>
  </match>

  <!-- Names not including any well known alias are given 'sans' -->
  <!-- seems not to work and therefore commented out
  <match target="pattern">
    <test qual="all" name="family" mode="not_eq">sans</test>
    <test qual="all" name="family" mode="not_eq">serif</test>
    <test qual="all" name="family" mode="not_eq">monospace</test>
    <edit name="family" mode="append_last"><string>sans</string></edit>
  </match>
  -->

  <!-- Settings for TFT-Monitors -->
  <match target="font">
    <edit mode="assign" name="hinting"><bool>true</bool></edit>
  </match>
  <match target="font">
    <edit mode="assign" name="hintstyle"><const>hintfull</const></edit>
  </match>
  <match target="font">
    <edit mode="assign" name="antialias"><bool>true</bool></edit>
  </match>
  <match target="font">
    <edit mode="assign" name="rgba"><const>rgb</const></edit>
  </match>

  <!-- Provide required aliases for standard names -->
  <alias>
    <family>serif</family>
    <prefer>
      <family>DejaVu Serif</family>
      <family>Bitstream Vera Serif</family>
      <family>Times New Roman</family>
      <family>Thorndale AMT</family>
      <family>Luxi Serif</family>
      <family>Nimbus Roman No9 L</family>
      <family>Times</family>
    </prefer>
  </alias>
  <alias>
    <family>sans-serif</family>
    <prefer>
      <family>BPG Glaho International</family> <!-- lat,cyr,arab,geor -->
      <family>DejaVu Sans</family>
      <family>Bitstream Vera Sans</family>
      <family>Luxi Sans</family>
      <family>Nimbus Sans L</family>
      <family>Arial</family>
      <family>Albany AMT</family>
      <family>Helvetica</family>
      <family>Verdana</family>
      <family>Lucida Sans Unicode</family>
      <family>Tahoma</family> <!-- lat,cyr,greek,heb,arab,thai -->
    </prefer>
  </alias>
  <alias>
    <family>monospace</family>
    <prefer>
      <family>DejaVu Sans Mono</family>
      <family>Bitstream Vera Sans Mono</family>
      <family>Luxi Mono</family>
      <family>Nimbus Mono L</family>
      <family>Andale Mono</family>
      <family>Courier New</family>
      <family>Cumberland AMT</family>
      <family>Courier</family>
    </prefer>
  </alias>
</fontconfig>
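Instead of setting the two variables system-wide, they can also be set only for the process that runs the training tools. The following is a minimal Python sketch of that idea, not something from the original setup: the helper name fontconfig_env is hypothetical, and the paths are simply the ones used in this note.

```python
import os


def fontconfig_env(conf_file, conf_dir, base_env=None):
    """Return a copy of the environment with the two Fontconfig variables set.

    conf_file and conf_dir are whatever values you would otherwise put in
    FONTCONFIG_FILE and FONTCONFIG_PATH (assumption: same values as above).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FONTCONFIG_FILE"] = conf_file
    env["FONTCONFIG_PATH"] = conf_dir
    return env


# Example (not executed here): pass the result to subprocess.run so that
# only the child process sees the variables.
#   import subprocess
#   env = fontconfig_env(r"E:\python\Tesseract-OCR\fonts.conf", r"C:\Windows\Fonts")
#   subprocess.run(["text2image", "--list_available_fonts",
#                   r"--fonts_dir=C:\Windows\Fonts"], env=env, check=True)
```

The copy is deliberate: the current process's own environment is left untouched, so other tools started later are not affected.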
Error 2: Could not find font named 'xxxx'.
Reported when running the command below (the first training step):
text2image --text="E:\python\Tesseract-OCR\training\chi_sim.training_text.txt" --outputbase=naruto.FZYiHei-M20S.exp0 --font="FZYiHei-M20S" --fonts_dir="E:\python\Tesseract-OCR\training"
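Solution:
List the fonts that can be used with: text2image --list_available_fonts --fonts_dir=C:\\Windows\\Fonts
Sure enough, the font (FZYiHei-M20S, which I had already installed in Windows) was not in the list; the list only had the fonts that ship with the system. Yet the font does show up in that Fonts folder, which felt very strange! I spent a long time searching for a fix.
By accident I changed fonts_dir to the folder where I had downloaded the font's .ttf file, and it worked.
Facepalm~~ After all that: the font is installed, but its file is apparently not under windows\fonts? And what this command needs is the directory that holds the .ttf file? Then is installing even needed? It feels like installing is unnecessary~~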
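To check which directory really holds the .ttf before passing it to --fonts_dir, a short script can list the font files found on disk. This is only a sketch: list_font_files is a hypothetical helper, and the set of extensions is an assumption. One possible explanation for the behaviour above, not verified here, is that recent Windows 10 builds install fonts per user under %LOCALAPPDATA%\Microsoft\Windows\Fonts while still showing them in the Fonts folder view.

```python
from pathlib import Path

# Extensions treated as font files (assumption: extend as needed).
FONT_SUFFIXES = {".ttf", ".ttc", ".otf"}


def list_font_files(fonts_dir):
    """Return the sorted names of font files directly inside fonts_dir.

    Returns an empty list when the directory does not exist.
    """
    root = Path(fonts_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
```

Running it on both C:\Windows\Fonts and the download folder shows where the FZYiHei-M20S file physically lives; that directory is the one to give to --fonts_dir.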
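Error 3: shapeclustering, mftraining and cntraining all crash with an error dialog when run on Win10
I suspected that the latest version (5.0) has compatibility problems. After switching to version 3.0.5 and repeating these steps, they ran through~
Pitfall 4: the training material (the text file) has formatting requirements; it cannot be laid out arbitrarily (_(¦3」∠)_)
I recommend downloading the official training text instead. Link: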
https://raw.githubusercontent.com/tesseract-ocr/langdata/master/chi_sim/chi_sim.training_text
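Fetching that file can be scripted. The sketch below uses only the Python standard library; download_training_text is a hypothetical helper name, and the UTF-8 check reflects the assumption that the training text is meant to be UTF-8 encoded.

```python
import urllib.request
from pathlib import Path

# URL taken from the link above.
TRAINING_TEXT_URL = (
    "https://raw.githubusercontent.com/tesseract-ocr/langdata/"
    "master/chi_sim/chi_sim.training_text"
)


def download_training_text(dest, url=TRAINING_TEXT_URL):
    """Download the training text to dest and return dest as a Path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    # Fail early (UnicodeDecodeError) if the payload is not valid UTF-8.
    data.decode("utf-8")
    # Write the bytes unchanged so line endings are not rewritten.
    dest.write_bytes(data)
    return dest
```

Saving the raw bytes rather than re-encoding text keeps the file exactly as published, which avoids accidentally changing its layout.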
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Training-related commands:
List the available fonts:
text2image --list_available_fonts --fonts_dir=C:\\Windows\\Fonts
Match fonts against the training text and check their coverage (some rare characters are still not covered):
text2image --text="E:\python\Tesseract-OCR\training\chi_sim.training_text.txt" --outputbase=eng --fonts_dir="E:\python\Tesseract-OCR\training" --find_fonts --min_coverage=1.0 --render_per_font=false
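When the same check is run from a script, building the argument list in Python avoids the quoting and doubled-backslash issues of typing paths on the command line. A minimal sketch, using only the flags that appear in the command above; the function name and its defaults are hypothetical.

```python
def text2image_find_fonts_args(text_file, fonts_dir, outputbase="eng",
                               min_coverage=1.0):
    """Build the argument list for the font-coverage check shown above."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--fonts_dir={fonts_dir}",
        "--find_fonts",
        f"--min_coverage={min_coverage}",
        "--render_per_font=false",
    ]


# Example (not executed here; requires text2image on PATH):
#   import subprocess
#   args = text2image_find_fonts_args(
#       r"E:\python\Tesseract-OCR\training\chi_sim.training_text.txt",
#       r"E:\python\Tesseract-OCR\training")
#   subprocess.run(args, check=True)
```

Because the arguments are passed as a list, each path is handed over as a single argument and needs no surrounding quotes.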
Full training steps >>
1. Generate the ~.tif and ~.box files:
text2image --text="E:\python\Tesseract-OCR\training\chi_sim.training_text.txt" --outputbase=naruto.FZYiHei-M20S.exp0 --font="FZYiHei-M20S" --fonts_dir="E:\python\Tesseract-OCR\training"
2. Generate the character feature file (produces the ~.tr file)
tesseract naruto.FZYiHei-M20S.exp0.tif naruto.FZYiHei-M20S.exp0 nobatch box.train
3. Compute the character set (produces the unicharset file)
unicharset_extractor naruto.FZYiHei-M20S.exp0.box
4. Define the font properties file
font_properties.txt
Mine contains: FZYiHei-M20S 0 0 0 0 0
5. Cluster the character features
1) shapeclustering -F font_properties.txt -U unicharset naruto.FZYiHei-M20S.exp0.tr Note: if font_properties is given without the .txt extension, it may report an error
2) mftraining -F font_properties.txt -U unicharset -O naruto.unicharset naruto.FZYiHei-M20S.exp0.tr
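This uses the unicharset file produced earlier (step 3) to generate the character set file for the new language, naruto.unicharset. It also produces the shape prototype file inttemp and pffmtable, the file that holds the number of features for each character. The most important one is inttemp, which contains the shape prototypes of all the characters to be produced.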
3) cntraining naruto.FZYiHei-M20S.exp0.tr
6. Prefix the five files unicharset, inttemp, pffmtable, shapetable and normproto in the directory with naruto.
Then run combine_tessdata naruto.
7. Test
tesseract invoice2b.jpg invoice2bnum1 -l num1
tesseract C:\Users\Administrator\Desktop\pub_04\4.png output -l naruto
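The two manual file chores in the steps above, writing font_properties.txt (step 4) and adding the naruto. prefix (step 6), are easy to get wrong by hand. The sketch below shows one way to script them. Both function names are hypothetical. The column order italic, bold, fixed, serif, fraktur is my understanding of the legacy font_properties format, so treat it as an assumption to check against the Tesseract 3.x training documentation.

```python
from pathlib import Path

# The five files named in step 6.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def write_font_properties(path, font_name, italic=0, bold=0, fixed=0,
                          serif=0, fraktur=0):
    """Write a one-line font_properties file and return the line written."""
    line = f"{font_name} {italic} {bold} {fixed} {serif} {fraktur}\n"
    Path(path).write_text(line, encoding="utf-8")
    return line


def prefix_training_files(work_dir, lang):
    """Rename each training output in work_dir to '<lang>.<name>'.

    Files that are missing, or whose prefixed name already exists, are
    skipped. Returns the list of new file names.
    """
    work_dir = Path(work_dir)
    renamed = []
    for name in TRAINING_OUTPUTS:
        src = work_dir / name
        dst = work_dir / f"{lang}.{name}"
        if src.is_file() and not dst.exists():
            src.rename(dst)
            renamed.append(dst.name)
    return renamed
```

Skipping targets that already exist is a deliberate choice: mftraining with -O naruto.unicharset has already written that file in step 5, and it should not be overwritten by the plain unicharset. After prefix_training_files(".", "naruto"), combine_tessdata naruto. can be run as in step 6.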