ocr智能图文识别 tess4j 图文，验证码识别

最近写爬虫采集数据，遇到网站登录需要验证码校验，想了想有两种解决办法

　　　　　　　　　1，利用htmlunit，将验证码输入到swing中，并弹出一个输入框，手动输入验证码，这种实现方式，如果网站需要登录一次可以使用，如果每个5分钟就让你重新登录，校验验证码，那这法指定很麻烦，我总不能一直在这看着，每五分钟手动输入一次吧

　　　　　　　　　2，为了避免上一个法子的弊端，就想到有没有可以自动识别验证码，让程序自己验证而不需要人工手动输入，然后从网上找到了解决方案，ocr - tesseract，但是网上的博客什么的都是一样的，把别人的博客copy过来，也不管代码到底能不能正常运行，因此写了这篇文章，希望可以帮助正卡在tesseract这的盆友（说的大义凛然）

对tess4j的使用总结

1，tess4j 封装了 tesseract-ocr 的操作

　　　　　　可以用很简洁的几行代码就实现原本tesseract-ocr 复杂的实现逻辑

　　　　　　如果你也想了解tesseract-ocr是怎么实现验证码识别的请移步我的另一篇文章

2，网上有很多说发布jar或war包之后需要自己加载dll，这是错误的

　　不需要再自己加载dll，tess4j已经自己封装了加载dll的操作

3，使用tess4j需要先安装 tesseract-ocr-setup-3.02.02

4，如果抛Invalid memory access 无效的内存访问异常，导致这个异常的原因是tessdata语言包的位置没有找到

5，下面就是我使用tess4j的一个使用demo

目录结构

tessdata 语言包放在了和src同级的目录

maven 依赖

 1  <dependencies>
 2   
 3       <dependency>  
 4         <groupId>net.java.dev.jna</groupId>  
 5         <artifactId>jna</artifactId>  
 6         <version>4.2.1</version>  
 7     </dependency>  
 8   
 9       <dependency>  
10         <groupId>net.sourceforge.tess4j</groupId>  
11         <artifactId>tess4j</artifactId>  
12         <version>2.0.1</version>  
13         <exclusions>  
14             <exclusion>  
15                 <groupId>com.sun.jna</groupId>  
16                 <artifactId>jna</artifactId>  
17             </exclusion>  
18         </exclusions>  
19     </dependency>  
20   
21   </dependencies>

3，测试代码

 1 package com.sinosoft.ocr;
 2 
 3 import java.awt.image.BufferedImage;
 4 import java.io.File;
 5 import java.io.FileInputStream;
 6 import java.io.IOException;
 7 import java.io.InputStream;
 8 
 9 import javax.imageio.ImageIO;
10 
11 import net.sourceforge.tess4j.ITesseract;
12 import net.sourceforge.tess4j.Tesseract;
13 import net.sourceforge.tess4j.TesseractException;
14 import net.sourceforge.tess4j.util.ImageHelper;
15 
16 public class OcrTest {
17 
18      public static void main(String[] args) {  
19             File imageFile = new File("E:\\valimg\\fx\\fx.tif");  
20             ITesseract instance = new Tesseract();  // JNA Interface Mapping  
21             
22             try {  
23                 //读取一个文件夹下的所有图片并验证
24                 /*    String[] filelist = imageFile.list();
25                 for (int i = 0; i < filelist.length; i++) {
26                         File readfile = new File("E:\\valimg" + "\\" + filelist[i]);
27                         if (!readfile.isDirectory()) {
28                                 System.out.println("path=" + readfile.getPath());
29                                 System.out.println("absolutepath="
30                                                 + readfile.getAbsolutePath());
31                                 System.out.println("name=" + readfile.getName());
32 
33                                 String result = instance.doOCR(readfile);
34                                 //String result = instance.doOCR(change(readfile));
35                                 System.err.println(readfile.getName() +" result："+  result);
36                      }
37                 }*/ 
38                  instance.setLanguage("chi_sim"); //加载语言包
39                  String result = instance.doOCR(imageFile);
40                 
41                  System.err.println(imageFile.getName() +" result："+  result);
42                  
43             } catch (TesseractException e) {  
44                 System.err.println(e.getMessage());  
45             }  
46      } 
47      
48      public static BufferedImage change(File file){
49          
50             // 读取图片字节数组
51              BufferedImage textImage = null;
52             try {
53                  InputStream in = new FileInputStream(file);
54                  BufferedImage image = ImageIO.read(in);
55                  textImage = ImageHelper.convertImageToGrayscale(ImageHelper.getSubImage(image, 0, 0, image.getWidth(), image.getHeight()));  //对图片进行处理
56                  textImage = ImageHelper.getScaledInstance(image, image.getWidth() * 5, image.getHeight() * 5);  //将图片扩大5倍
57 
58             } catch (IOException e) {
59                 e.printStackTrace();
60             }
61             
62             return textImage;
63      }
64 }

如果是web项目，需要指定 instance.setDatapath("E:\\ocr\\tesseract"); //tessdata 的目录为E:\\ocr\tesseract\tessdata,如果不指定也会抛Invalid memory access 异常

posted @ 2017-06-13 09:58 叫我明羽阅读(12424) 评论(3) 编辑收藏举报

刷新页面返回顶部

陈明羽

ocr智能图文识别 tess4j 图文，验证码识别

公告