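I was recently scraping a movie box-office site (URL: http://58921.com/alltime). The total box-office figures on that page are actually rendered as images, so to get the numbers I have to recognize the text in those images.
 
The first solution that came to mind was Tesseract. Use version 3, not 2: in my experience, 3 supports Chinese somewhat better than 2.
 
So I started installing the relevant libraries with pip: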
 
$ pip install Pillow
$ pip install pytesser3
$ pip install pytesseract
 
Step one, run:
 
$ pip install pillow
 
This failed with the following error:
 
Collecting pillow
  Could not fetch URL https://pypi.python.org/simple/pillow/: There was a problem confirming the ssl certificate: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:661) - skipping
  Could not find a version that satisfies the requirement pillow (from versions: )
No matching distribution found for pillow
 
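My first reaction was to add sudo and run sudo pip install pillow, but that produced the same error.
 
 
The real problem is that the installed pip is too old: the TLSV1_ALERT_PROTOCOL_VERSION message means the server rejected the TLS version that pip offered. So I tried upgrading pip with the following command: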
 
python -m pip install --upgrade pip
 
This failed with:
 
Could not fetch URL https://pypi.python.org/simple/pip/: There was a problem confirming the ssl certificate: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:661) - skipping
Requirement already up-to-date: pip in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
 
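Still no luck! pip cannot reach the package index at all, so it simply reports itself as up to date.
 
So I tried another way of upgrading pip: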
 
$ pip install -U pip
 
The same error again:
 
Could not fetch URL https://pypi.python.org/simple/pip/: There was a problem confirming the ssl certificate: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:661) - skipping
Requirement already up-to-date: pip in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages
 
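Then I tried yet another way of upgrading pip, this time downloading the installer with curl instead of going through pip itself: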
 
curl https://bootstrap.pypa.io/get-pip.py | python
 
Note the last part of the command. If you are on Python 3, use:
 
curl https://bootstrap.pypa.io/get-pip.py | python3
 
That finally worked!
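
To see why the earlier attempts failed, it can help to check which TLS library the Python interpreter is linked against. The following is a minimal sketch, not part of the original fix: `tls_report` is a hypothetical helper name, and the sketch assumes the SSL error comes from an interpreter whose OpenSSL build is too old to speak the TLS version the package index requires.

```python
import ssl
import sys


def tls_report():
    """Collect basic facts about the TLS library this interpreter uses."""
    # ssl.HAS_TLSv1_2 exists on newer Python 3 releases; on older
    # interpreters, fall back to checking for the protocol constant.
    has_tls_1_2 = getattr(ssl, "HAS_TLSv1_2", hasattr(ssl, "PROTOCOL_TLSv1_2"))
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        "has_tls_1_2": bool(has_tls_1_2),
    }


if __name__ == "__main__":
    for key, value in sorted(tls_report().items()):
        print("%s: %s" % (key, value))
```

If `has_tls_1_2` comes back False, or the reported OpenSSL version is very old, that would be consistent with the TLSV1_ALERT_PROTOCOL_VERSION error shown above.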
 
I found this final solution through an online reference.
 

 
Then install Pillow with:
 
$ pip install pillow
 
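Incidentally, use Pillow rather than PIL: PIL stopped being updated many years ago, while Pillow is a fork of it that is still maintained.
 
With the upgraded pip, all of the libraries listed above can now be installed.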
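
A quick way to confirm that the installs succeeded is to try importing each package. This is a small sketch under stated assumptions: `check_modules` is a hypothetical helper, and the module names are the import names (Pillow is imported as `PIL`), not the pip package names.

```python
import importlib


def check_modules(names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Pillow installs the "PIL" package; pytesseract installs "pytesseract".
    for name, ok in sorted(check_modules(["PIL", "pytesseract"]).items()):
        print("%s: %s" % (name, "ok" if ok else "missing"))
```

Note that a successful import of pytesseract only proves the Python wrapper is installed; it says nothing about the tesseract executable itself.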
 

 
 
Later I wrote a test.py and found that calling pytesseract.image_to_string() crashed with the following traceback:
 
Traceback (most recent call last):
  File "/Users/baorunchen/Documents/code/repo/python/advanced/image_recognition_test.py", line 29, in <module>
    main()
  File "/Users/baorunchen/Documents/code/repo/python/advanced/image_recognition_test.py", line 26, in main
    run_log(pytesseract.image_to_string(im))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 193, in image_to_string
    return run_and_get_output(image, 'txt', lang, config, nice)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 140, in run_and_get_output
    run_tesseract(**kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pytesseract/pytesseract.py", line 111, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

 

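The cause: pytesseract is only a wrapper that launches the tesseract executable. If Tesseract-OCR is not installed, or is installed but not on the PATH (installing it does not add it to the PATH by default), the launch fails with this error.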
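
The traceback ends inside subprocess.Popen, and the same failure can be reproduced without pytesseract. The sketch below is an illustration only; `can_launch` is a hypothetical helper that reports whether a command can be started at all.

```python
import errno
import subprocess
import sys


def can_launch(command):
    """Return False if the executable is missing, True if it could be started."""
    try:
        proc = subprocess.Popen(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except OSError as exc:
        # Errno 2 (ENOENT) is the "No such file or directory" case above.
        if exc.errno == errno.ENOENT:
            return False
        raise
    proc.communicate()  # wait for the process to finish and drain its pipes
    return True


if __name__ == "__main__":
    print("python: %s" % can_launch([sys.executable, "--version"]))
    print("tesseract: %s" % can_launch(["tesseract", "--version"]))
```

If the second line prints False, the Python side is fine and the missing piece is the tesseract binary.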
 
Solution:
First, install Tesseract itself on the Mac:
 
$ brew install tesseract
 
That failed too, with the following output:
 
touch: /usr/local/Homebrew/.git/FETCH_HEAD: Permission denied
touch: /usr/local/Homebrew/Library/Taps/caskroom/homebrew-cask/.git/FETCH_HEAD: Permission denied
touch: /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/.git/FETCH_HEAD: Permission denied
fatal: Unable to create '/usr/local/Homebrew/.git/index.lock': Permission denied
error: could not lock config file .git/config: Permission denied
==> Downloading https://homebrew.bintray.com/bottles/tesseract-3.05.01.high_sierra.bottle.tar.gz
Already downloaded: /Users/baorunchen/Library/Caches/Homebrew/tesseract-3.05.01.high_sierra.bottle.tar.gz
==> Pouring tesseract-3.05.01.high_sierra.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink share/man/man1/ambiguous_words.1
/usr/local/share/man/man1 is not writable.
 
You can try again using:
  brew link tesseract
==> Summary
🍺  /usr/local/Cellar/tesseract/3.05.01: 79 files, 38.7MB
 
In between, I tried updating brew and then running brew install tesseract again, which did not help:
 
$ brew update
$ sudo brew update
$ brew upgrade
$ brew cleanup
$ brew install tesseract
 
So I ran the command suggested in the error output:
 
$ brew link tesseract
 
which produced the following error:
 
Linking /usr/local/Cellar/tesseract/3.05.01...
Error: Could not symlink share/man/man5/unicharambigs.5
/usr/local/share/man/man5 is not writable.
 
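To fix the failed brew link, look at the error message: "/usr/local/share/man/man5 is not writable."
The directory is not writable, which means the current user lacks permission on it, so I made the current user the owner of the directory with the following command: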
$ sudo chown ${USER} /usr/local/share/man/man5
 
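Then run brew link tesseract again. If it reports another directory as not writable (for example /usr/local/share/man/man1 from the earlier output), apply the same chown to that directory and retry until the link succeeds.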
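
Rather than discovering the unwritable directories one failed brew link at a time, they can be listed up front. This is a rough sketch, assuming the man page directories live under /usr/local/share/man as in the error output; `unwritable_dirs` is a hypothetical helper name.

```python
import os


def unwritable_dirs(paths):
    """Return the paths the current user cannot write to (or that are missing)."""
    return [path for path in paths if not os.access(path, os.W_OK)]


if __name__ == "__main__":
    man_root = "/usr/local/share/man"  # taken from the brew error output
    candidates = [os.path.join(man_root, "man%d" % n) for n in range(1, 9)]
    for path in unwritable_dirs(candidates):
        print("not writable (or missing): %s" % path)
```

Each directory it prints that actually exists is a candidate for the same chown command used above.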
 
 
Next, pytesseract has to be told where the tesseract executable is. Add this to the code:
 
pytesseract.pytesseract.tesseract_cmd = '<path-to-tesseract-bin>'
 
To find the path, run this in the terminal:
 
$ which tesseract
 
Before brew link had succeeded, the result would have been:
 
tesseract not found
 
Now that it has succeeded, the result is:
 
/usr/local/bin/tesseract
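
The same lookup can be done from Python, which avoids hard-coding a path that may differ between machines. The sketch below is only an illustration: `find_program` is a hypothetical helper, and the fallback location is simply the path reported by `which tesseract` above.

```python
import os

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2 fallback


def find_program(name, fallbacks=()):
    """Return the path to an executable: search PATH first, then fallbacks."""
    path = which(name)
    if path:
        return path
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    found = find_program("tesseract", fallbacks=("/usr/local/bin/tesseract",))
    print(found if found else "tesseract not found")
```

When `find_program` returns a path, that value can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of a literal string.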

 

So add this to the code:
 
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
 
After that, pytesseract.image_to_string() should no longer raise the error.
 

The full code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
# @version: python 2.7.13
# @author: baorunchen(runchen0518@gmail.com)
# @date: 2018/5/4
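import os
import sys
import time

from PIL import Image
import pytesseract

# Tell pytesseract where the tesseract executable is (from `which tesseract`).
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'

pic_path = '/Users/baorunchen/Desktop/test.png'


def run_log(log):
    # Print a timestamped log line; this form works on both Python 2 and 3.
    print('%s - %s' % (time.strftime('%Y-%m-%d %H:%M:%S', time.localtime()), log))


def main():
    if not os.path.exists(pic_path):
        run_log('pic does not exist!')
        sys.exit(1)

    # Open the image and log the text Tesseract recognizes in it.
    im = Image.open(pic_path)
    run_log(pytesseract.image_to_string(im))


if __name__ == '__main__':
    main()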

 

 
 posted on 2018-05-04 11:52 by keria