Installing tesseract-ocr on CentOS 7 for CAPTCHA recognition

Summary:

  Installing dependencies on CentOS 7

  Configuring tesseract

  Code example

Installing dependencies on CentOS 7

  • Install the CentOS system dependencies

    yum install -y automake autoconf libtool gcc gcc-c++ 
    yum install -y libpng-devel libjpeg-devel libtiff-devel
  • Install leptonica

    wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
    tar xvzf leptonica-1.72.tar.gz
    cd leptonica-1.72/ 
    ./configure 
    make && make install
  • Install tesseract-ocr

    wget https://github.com/tesseract-ocr/tesseract/archive/3.04.zip
    unzip 3.04.zip
    cd tesseract-3.04/ 
    ./configure
    make && make install 
    sudo ldconfig
  • Deploy the model

  • Install the Python dependencies listed in requirements.txt

    pip install -r requirements.txt
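
After the build steps above, it can help to confirm that the tesseract binary is actually reachable before moving on to configuration. The following is a minimal sketch, not part of the original setup: the helper name tesseract_version is hypothetical, and it assumes the binary accepts --version. Both output streams are merged because some releases are reported to print the version banner to stderr rather than stdout.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of the version banner, or None if the binary is missing.

    The function name and the default binary name are illustrative assumptions.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    # Merge stderr into stdout so the banner is captured whichever stream is used.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract not found on PATH")
```

If the function returns None right after make install, the install prefix (typically /usr/local/bin for a source build) is probably not on PATH.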

Configuring tesseract

  • Create eng.user-patterns in /usr/local/share/tessdata with the following content

    \n\n\n\n\n\n

    This pattern matches six characters, each a letter or a digit.

  • Create myconfig in /usr/local/share/tessdata/configs with the following content

    # Character whitelist for recognition
    tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789
    # Suffix of the user patterns file
    user_patterns_suffix user-patterns
  • psm parameter reference

    -psm N
      Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
    
      0 = Orientation and script detection (OSD) only.
      1 = Automatic page segmentation with OSD.
      2 = Automatic page segmentation, but no OSD, or OCR.
      3 = Fully automatic page segmentation, but no OSD. (Default)
      4 = Assume a single column of text of variable sizes.
      5 = Assume a single uniform block of vertically aligned text.
      6 = Assume a single uniform block of text.
      7 = Treat the image as a single text line.
      8 = Treat the image as a single word.
      9 = Treat the image as a single word in a circle.
      10 = Treat the image as a single character.
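
The configuration above can be tied together when calling pytesseract, whose image_to_string accepts a config string that is passed through to the tesseract command line. The following is a minimal sketch under stated assumptions: the helper names build_config and recognize are hypothetical, psm 7 (single text line) is only a guess at what suits a typical CAPTCHA, and the legacy_flags switch reflects that Tesseract 3.x spells the option -psm while 4.x and later use --psm.

```python
def build_config(psm=7, config_name="myconfig", legacy_flags=True):
    """Build the extra command-line arguments passed to tesseract.

    psm          page segmentation mode, 0 to 10 as in the table above
    config_name  config file under tessdata/configs ("myconfig" from this post)
    legacy_flags True for Tesseract 3.x ("-psm"), False for 4.x and later ("--psm")
    """
    if not 0 <= psm <= 10:
        raise ValueError("psm must be between 0 and 10")
    flag = "-psm" if legacy_flags else "--psm"
    return "{} {} {}".format(flag, psm, config_name)


def recognize(image_path, psm=7):
    """Run OCR on one image using the config built above (hypothetical helper)."""
    # Imported lazily so build_config works without the OCR dependencies installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, config=build_config(psm)).strip()
```

Trying psm 7 and psm 8 on a few sample images is a reasonable way to see which mode fits a given CAPTCHA layout.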

Code example

    import pytesseract
    from PIL import Image

    # Open the CAPTCHA image and run OCR on it
    image = Image.open('code.png')
    code = pytesseract.image_to_string(image)
    print(code)
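
OCR output often carries stray whitespace or an occasional wrong character, so it may be worth validating the result before submitting it. The following is a minimal sketch, assuming the CAPTCHA is always six characters drawn from the lowercase letters and digits in the whitelist configured earlier; the helper name clean_code is hypothetical.

```python
import re

# Six characters, each a lowercase letter or a digit. This mirrors the
# whitelist and the six-character user pattern from the configuration section.
CODE_RE = re.compile(r"[a-z0-9]{6}")


def clean_code(raw):
    """Return the code with whitespace removed, or None if it does not look valid."""
    candidate = "".join(raw.split())
    return candidate if CODE_RE.fullmatch(candidate) else None
```

A None result can be treated as a signal to fetch a fresh CAPTCHA image and retry rather than submit a code that is likely wrong.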

 

posted @ 2017-10-12 11:53  混沌战神阿瑞斯