Recognizing Image CAPTCHAs

OCR technology:

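(1) When writing a crawler, you will inevitably run into all kinds of CAPTCHAs, and most of them are still image CAPTCHAs. These can often be recognized directly with OCR.
(2) OCR (Optical Character Recognition) is the process of scanning characters and translating them into electronic text based on their shapes.
(3) tesserocr is an OCR library for Python. It is really a Python API wrapper around tesseract, so tesseract is its core. tesseract therefore has to be installed before tesserocr.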

Installing tesserocr on Windows:

1. First install tesseract. Download: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-3.05.01.exe
2. Then install tesserocr with pip3: pip3 install tesserocr pillow

Installing tesserocr on Linux:

yum install -y tesseract
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract/tessdata
pip3 install tesserocr pillow
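
After installation it can be useful to confirm that the tesseract binary is actually on PATH before moving on. The following is a minimal sketch using only the standard library; the helper name find_tesseract is hypothetical and not part of tesseract or tesserocr. Depending on the release, the version banner may be written to stdout or stderr, so both are captured.

```python
import shutil
import subprocess


def find_tesseract():
    """Return (path, version_banner), or (None, None) if tesseract is not on PATH."""
    path = shutil.which("tesseract")
    if path is None:
        return None, None
    # The banner may land on stdout or stderr depending on the release,
    # so capture both and use whichever is non-empty.
    proc = subprocess.run([path, "--version"], capture_output=True, text=True)
    banner = (proc.stdout or proc.stderr).strip()
    return path, banner


if __name__ == "__main__":
    path, banner = find_tesseract()
    if path is None:
        print("tesseract not found on PATH")
    else:
        print(path)
        print(banner)
```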

Recognizing an image CAPTCHA with Python:

import tesserocr
from PIL import Image

image = Image.open('1.png')                 # Opens and identifies the given image file
result = tesserocr.image_to_text(image)     # Recognize OCR text from an image object
print(result)
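
The string returned by image_to_text often carries stray spaces or line breaks, so it may be worth normalizing it before comparing or submitting it. A minimal sketch, assuming the CAPTCHA itself contains no meaningful spaces; clean_captcha_text is a hypothetical helper, not part of tesserocr.

```python
def clean_captcha_text(raw):
    """Drop all whitespace (spaces, tabs, newlines) from raw OCR output."""
    return "".join(raw.split())


if __name__ == "__main__":
    # A made-up sample string standing in for real OCR output.
    print(clean_captcha_text("  ab 3D\n\n"))
```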

Recognizing a noisy image CAPTCHA with Python:

import tesserocr
from PIL import Image

image = Image.open('2.png')

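image = image.convert('L')          # Convert to grayscale
threshold = 127                     # Gray levels below this map to 0, the rest to 1
table = []                          # 256-entry lookup table, one entry per gray level
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')     # Apply the table and binarize (mode '1')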
result = tesserocr.image_to_text(image)
print(result)
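
A threshold of 127 will not suit every image, so it can help to wrap the binarization step in a function and try a few values. The sketch below mirrors the loop above; build_threshold_table and binarize are hypothetical helper names, and binarize assumes it is handed a PIL Image object.

```python
def build_threshold_table(threshold=127):
    """Return a 256-entry lookup table: 0 below the threshold, 1 at or above it."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=127):
    """Grayscale a PIL image, then binarize it through the lookup table."""
    return image.convert('L').point(build_threshold_table(threshold), '1')


if __name__ == "__main__":
    table = build_threshold_table(160)
    # Entries just below and at the threshold show where the cut falls.
    print(len(table), table[159], table[160])
```

With a helper like this, the same image can be passed through binarize with several thresholds and each result handed to tesserocr.image_to_text to see which one reads best.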

Posted 2019-04-01 17:10 by 孔雀东南飞