Learning Lucene Step by Step (Step 2: Examples)

The previous part introduced Lucene, what it does, and when it is a good fit. This part works through an example to show the Lucene API in detail and how Lucene operates.

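Downloading Lucene

This is straightforward: search for Lucene on Baidu or Google, and the first result is usually the link we want. The Lucene download link is:

http://lucene.apache.org/

Figure: Lucene download home page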

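Configuring the environment

We will run many tests and create many test projects below. Adding the jar files to each project by hand is tedious, so we set up a user library in Eclipse instead.

Open Eclipse and choose Window -> Preferences -> Java -> Build Path -> User Libraries.

Add all the jar files from the Lucene download above to this user library.

Figure: The user library added in Eclipse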

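Building the index

The following program reads the files in a given folder and builds an index from them. It takes two arguments: the directory where the index is stored, and the directory containing the files to index.

Place the files in a directory on disk, then run the program to build the index.

The indexing program is as follows: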


import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Written against the Lucene 3.0 API (see Version.LUCENE_30 below).
public class Indexer {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0]; // directory in which the index is created
        String dataDir = args[1]; // directory containing the files to index

        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();

        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // Create the IndexWriter; "true" creates a new index, overwriting any existing one.
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close(); // commit pending changes and close the IndexWriter
    }

    public int index(String dataDir, FileFilter filter) throws Exception {

        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when dataDir is not a readable directory
            throw new IOException("Not a readable directory: " + dataDir);
        }

        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }

        return writer.numDocs(); // number of documents in the index
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            // Index only files with a .txt extension (case-insensitive)
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        // File contents: analyzed and indexed from the Reader, but not stored
        doc.add(new Field("contents", new FileReader(f)));
        // File name: stored and indexed as a single, unanalyzed token
        doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Full path: stored and indexed as a single, unanalyzed token
        doc.add(new Field("fullpath", f.getCanonicalPath(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        writer.addDocument(doc); // add the document to the index
    }

}
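
The file-selection rule in Indexer.index() can be tried on its own, without Lucene. The following is a minimal sketch using only the JDK. The class name TxtFileLister and the method listIndexable are made up for this illustration; the checks mirror the ones in the loop of index() (not a directory, not hidden, readable, name ending in .txt).

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: applies the same checks as Indexer.index().
final class TxtFileLister {

    // Accepts file names ending in ".txt", ignoring case.
    static final FileFilter TXT_FILTER = new FileFilter() {
        @Override
        public boolean accept(File path) {
            return path.getName().toLowerCase(Locale.ROOT).endsWith(".txt");
        }
    };

    // Returns the names of the files in dataDir that would be indexed, in pathname order.
    static List<String> listIndexable(File dataDir) {
        List<String> names = new ArrayList<String>();
        File[] files = dataDir.listFiles(); // null if dataDir is not a readable directory
        if (files == null) {
            return names;
        }
        Arrays.sort(files);
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.canRead() && TXT_FILTER.accept(f)) {
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : listIndexable(dir)) {
            System.out.println("Would index " + name);
        }
    }
}
```

Running it against the data directory before indexing is a quick way to check which files the Indexer will pick up.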

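Then right-click the project and choose Run -> Run Configurations, create a new Java Application, and enter the two arguments: the index directory and the directory containing the files.

Figure: Run configuration dialog

Running the program prints the indexing results. The output naturally depends on what the indexed directory contains.

Figure: Output when indexing txt files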

Searching the index

Since we have not covered Chinese text yet, we search for all documents that contain "RUNNING".

The program is as follows:


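import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Written against the Lucene 3.0 API (see Version.LUCENE_30 below).
public class Searcher {

    public static void main(String[] args) throws IllegalArgumentException,
            IOException, ParseException {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Searcher.class.getName() + " <index dir> <query>");
        }

        String indexDir = args[0]; // directory of the index built by Indexer
        String q = args[1]; // query string

        search(indexDir, q);
    }

    public static void search(String indexDir, String q) throws IOException,
            ParseException {

        // Open the index
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir);
        try {
            // Parse the query string against the "contents" field
            QueryParser parser = new QueryParser(Version.LUCENE_30,
                    "contents",
                    new StandardAnalyzer(Version.LUCENE_30));
            Query query = parser.parse(q);

            long start = System.currentTimeMillis();
            TopDocs hits = is.search(query, 10); // fetch the top 10 matches
            long end = System.currentTimeMillis();

            // Report search statistics
            System.err.println("Found " + hits.totalHits
                    + " document(s) (in " + (end - start)
                    + " milliseconds) that matched query '"
                    + q + "':");

            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc); // load the matching document
                System.out.println(doc.get("fullpath")); // print its stored path
            }
        } finally {
            is.close(); // close the IndexSearcher
        }
    }
}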

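As before, configure a new Java Application, as shown in the figure below:

Figure: Configuring the query arguments

Click Run to see the results.

The results are the files we indexed above. Search time varies with the number and size of the files. This is only a demo program; you can set the query conditions to suit your own application.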

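Core classes in the indexing process

IndexWriter

IndexWriter is the central component of the indexing process. It creates a new index or opens an existing one, and it adds, deletes, and updates the documents in that index.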
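
To make "adding a document to an index" more concrete, here is a toy inverted index in plain Java. It is only a conceptual sketch: the class ToyIndex and its methods are invented for this illustration and are not Lucene APIs, and the index Lucene actually writes to disk is far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Conceptual toy, not Lucene code: maps each term to the ids of the documents containing it.
final class ToyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private final List<String> names = new ArrayList<String>();

    // Adds one document and returns its id (ids are assigned 0, 1, 2, ...).
    int addDocument(String name, String text) {
        int docId = names.size();
        names.add(name);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Returns the names of the documents containing the term, in the order they were added.
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                hits.add(names.get(id));
            }
        }
        return hits;
    }

    int numDocs() {
        return names.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("a.txt", "Lucene is running");
        index.addDocument("b.txt", "Nothing to see here");
        index.addDocument("c.txt", "Still RUNNING fine");
        System.out.println(index.search("RUNNING")); // prints [a.txt, c.txt]
    }
}
```

The point of the structure is that a search looks up the term directly instead of scanning every file, which is why the index has to be built before anything can be searched.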

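Directory

The Directory class represents the location where a Lucene index is stored. The example above uses FSDirectory, which keeps the index in a directory on the file system.

Analyzer

Analyzer is the abstract base class for analyzers (tokenizers). Text must pass through an Analyzer before it is indexed. Commonly used Chinese analyzers include Paoding and IKAnalyzer.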
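
As a rough picture of what analysis does, the sketch below splits text into tokens, lowercases them, and drops a few stop words. It is a toy in plain Java, not a Lucene Analyzer: the class ToyAnalyzer and its tiny stop-word list are assumptions made for this example. StandardAnalyzer, used in the programs above, performs comparable lowercasing and English stop-word removal, along with more careful tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual toy, not a Lucene Analyzer: tokenize, lowercase, drop stop words.
final class ToyAnalyzer {
    // A deliberately tiny stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Returns the tokens that would be indexed for the given text.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The fox is RUNNING to the river")); // prints [fox, running, river]
    }
}
```

Because the Searcher passes the query through the same StandardAnalyzer, a query for "RUNNING" is lowercased too and can match text that was indexed as "running".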

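Document

A Document object represents a collection of fields (Field). Put simply, in this example each Document corresponds to one file, which could be a text file, a Word document, a PDF, and so on.

Field

A Field is one of the named fields that make up a document.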
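
The relationship between a document and its fields, and the difference between a stored field and an index-only field, can be modeled in a few lines. This is a toy under stated assumptions: ToyDocument is a made-up class, not Lucene's Document. It mirrors the Indexer above, where "filename" and "fullpath" are added with Field.Store.YES while "contents" is added from a Reader and is therefore not stored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual toy, not Lucene's Document: a set of named fields, only some of which are stored.
final class ToyDocument {
    private final Map<String, String> stored = new LinkedHashMap<String, String>();

    // store=true keeps the original value for retrieval; store=false models an index-only field.
    void add(String name, String value, boolean store) {
        if (store) {
            stored.put(name, value);
        }
        // A real index would also pass the value to the analyzer here; omitted in this toy.
    }

    // Returns the stored value, or null if the field was not stored.
    String get(String name) {
        return stored.get(name);
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("filename", "notes.txt", true);
        doc.add("contents", "the full text of the file", false);
        System.out.println(doc.get("filename")); // prints notes.txt
        System.out.println(doc.get("contents")); // prints null
    }
}
```

This is why the Searcher prints doc.get("fullpath") rather than the file contents: only stored fields can be read back from a search hit.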

The flow of building an index in Lucene is shown below:

Figure: Lucene index-building flow

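Core classes in the search process

IndexSearcher

IndexSearcher searches the index that IndexWriter created.

Term

A Term is the basic unit of search. Much like a Field, it pairs a field name with a text value, and it holds the condition we want to query for.

Query

Query is the abstract base class for Lucene queries. It has many subclasses, which we can use to build rich queries.

TermQuery

TermQuery is the most basic query type Lucene provides. It matches the documents that contain a given Term.

TopDocs

TopDocs is a simple container of pointers to the top N ranked search results, that is, the documents that match the query.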
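
The idea behind TopDocs can be sketched without Lucene: keep the total number of matches, plus only the N highest-scoring (document id, score) pairs. The class ToyTopDocs below is invented for this illustration. Its field names totalHits, scoreDocs, doc, and score follow the ones the Searcher reads from Lucene's TopDocs and ScoreDoc, but the sorting shown here is an assumption about the concept, not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Conceptual toy, not Lucene's TopDocs: total hit count plus the n best (doc, score) pairs.
final class ToyTopDocs {

    // One search hit: a document id and its relevance score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    final int totalHits;        // how many documents matched in total
    final List<Hit> scoreDocs;  // at most n hits, best score first

    ToyTopDocs(List<Hit> allHits, int n) {
        List<Hit> sorted = new ArrayList<Hit>(allHits);
        Collections.sort(sorted, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score); // highest score first
            }
        });
        this.totalHits = allHits.size();
        this.scoreDocs = new ArrayList<Hit>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<Hit> all = Arrays.asList(
                new Hit(0, 0.2f), new Hit(1, 0.9f), new Hit(2, 0.5f), new Hit(3, 0.7f));
        ToyTopDocs top = new ToyTopDocs(all, 2);
        System.out.println("Found " + top.totalHits + " document(s)");
        for (Hit hit : top.scoreDocs) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

Note that totalHits can be larger than the number of entries in scoreDocs, which is also the case in the Searcher when more than 10 documents match.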

Figure: Lucene query request flow

[Source code download]

Posted 2012-07-31 08:32 by skyme
