Searching file contents with Lucene 7.1.0

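Using Lucene comes down to two steps:

  1 Create the index: use an IndexWriter to index the files and save the result in the location where the index files are stored.

  2 Look up the documents related to a keyword through the index.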
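To make the two steps concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene code: the class name TinyInvertedIndex and its naive tokenizer are made up for illustration, and Lucene's real index format and analysis chain are far more elaborate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the ids of the documents containing it. */
class TinyInvertedIndex {
    private final List<String> documents = new ArrayList<>();
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Step 1: index a document and return its id. */
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Step 2: look up the documents that contain the keyword. */
    List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        String term = keyword.toLowerCase(Locale.ROOT);
        for (int docId : postings.getOrDefault(term, new TreeSet<>())) {
            hits.add(documents.get(docId));
        }
        return hits;
    }

    /** Naive analysis: lower-case, then split on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Another file about something else.");
        System.out.println(index.search("text"));
    }
}
```

The point of the structure is that a search never scans the documents: it only looks the term up in the map that was built at indexing time.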

First, we need to define an analyzer.

Analyzer analyzer = new IKAnalyzer(true);

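Note the differences between the available analyzers; see http://blog.csdn.net/silentmuh/article/details/78451786 for details.

Take the sentence “我爱我们的中国！” ("I love our China!"). It has to be split into terms, the stop word “的” has to be dropped, and keywords such as “我”, “我们” and “中国” have to be extracted. That is the job of the Analyzer. The snippet above uses IKAnalyzer, which is aimed at Chinese text; StandardAnalyzer is the general-purpose alternative, and paoding is another analyzer built specifically for Chinese.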
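As a rough illustration of what such an analyzer does with the sentence above, here is a toy forward-maximum-matching segmenter with a stop-word filter. The dictionary, the stop-word list and the class name ToySegmenter are assumptions made for this one sentence; this is not how IKAnalyzer is implemented. The characters are written as Unicode escapes so the source stays ASCII-only: \u6211 is 我, \u7231 is 爱, \u6211\u4eec is 我们, \u7684 is 的, \u4e2d\u56fd is 中国, and \uff01 is the full-width exclamation mark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward-maximum-matching segmenter with a stop-word filter. */
class ToySegmenter {
    // Hypothetical mini dictionary, just large enough for the example sentence.
    private static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "\u6211",        // "I"
            "\u7231",        // "love"
            "\u6211\u4eec",  // "we / our"
            "\u7684",        // possessive particle "de"
            "\u4e2d\u56fd"   // "China"
    ));
    // Stop words are matched but not emitted.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("\u7684"));
    private static final int MAX_WORD_LENGTH = 2;

    static List<String> segment(String text) {
        List<String> terms = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String match = null;
            // Try the longest candidate first, then shorter ones.
            for (int len = Math.min(MAX_WORD_LENGTH, text.length() - pos); len >= 1; len--) {
                String candidate = text.substring(pos, pos + len);
                if (DICTIONARY.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                pos++; // unknown character (for example punctuation): skip it
            } else {
                if (!STOP_WORDS.contains(match)) {
                    terms.add(match);
                }
                pos += match.length();
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The example sentence from the text.
        System.out.println(segment("\u6211\u7231\u6211\u4eec\u7684\u4e2d\u56fd\uff01"));
    }
}
```

With this dictionary the sentence is split into the terms for "I", "love", "we / our" and "China"; the particle and the punctuation mark are dropped.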

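Step two: decide where the index files are stored. Lucene offers two options, local file storage (FSDirectory) and in-memory storage (RAMDirectory). This example uses local file storage:

Directory directory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));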
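FSDirectory.open takes a java.nio.file.Path, so the index folder can be prepared with the JDK alone before Lucene is involved. The sketch below uses only standard library calls; the helper name prepareIndexDir and the temp-folder location are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class IndexPathDemo {
    /** Resolves the index folder and creates it (including parents) if it does not exist yet. */
    static Path prepareIndexDir(String dir) throws IOException {
        // Paths.get(dir) is shorthand for FileSystems.getDefault().getPath(dir).
        Path path = Paths.get(dir);
        return Files.createDirectories(path);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location under the system temp folder.
        Path base = Files.createTempDirectory("lucene-demo");
        Path indexDir = prepareIndexDir(base.resolve("luceneIndex").toString());
        System.out.println(indexDir + " is a directory: " + Files.isDirectory(indexDir));
    }
}
```

The returned Path is what would then be handed to FSDirectory.open.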

Step three: create the IndexWriter that writes the index files.

IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

Step four: extract the content and store it in the index.

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
indexWriter.addDocument(doc);
indexWriter.close();

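  Line 1 creates a Document object, which is comparable to a row in a database table.

  Line 2 is the string we are about to index.

  Line 3 adds the string as a field named "fieldname" (think of it as the "table name"). Because TextField.TYPE_STORED is used, the original text is stored as well; if you do not want it stored, other field types are available, see the official documentation.

  Line 4 adds the doc object to the index being built.

  Line 5 closes the IndexWriter, which commits what was written.

That is the whole index creation process.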
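The difference between "indexed" and "stored" is easy to miss, so here is a small plain-Java sketch of the idea. ToyDocument is a made-up class, not Lucene's Document: every field is searchable, but only fields added with store = true keep their original text for later retrieval.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class StoredFieldDemo {
    /** Toy document: every field is indexed, only "stored" fields keep their original text. */
    static final class ToyDocument {
        private final Map<String, String> stored = new LinkedHashMap<>();
        private final Map<String, Set<String>> indexedTerms = new HashMap<>();

        void add(String field, String value, boolean store) {
            Set<String> terms = indexedTerms.computeIfAbsent(field, f -> new HashSet<>());
            for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
            if (store) {
                stored.put(field, value);
            }
        }

        /** True if the field contains the term, whether or not the field was stored. */
        boolean matches(String field, String term) {
            return indexedTerms.getOrDefault(field, new HashSet<>())
                    .contains(term.toLowerCase(Locale.ROOT));
        }

        /** Returns the original text, or null when the field was indexed without being stored. */
        String get(String field) {
            return stored.get(field);
        }
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add("fieldname", "This is the text to be indexed.", true);
        doc.add("body", "Searchable but not retrievable", false);
        System.out.println(doc.matches("body", "searchable") + " / " + doc.get("body"));
    }
}
```

A field that is indexed but not stored can still produce a hit, but its text cannot be read back from the hit afterwards.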

Looking up the documents related to a keyword through the index:

  Step one: open the storage location

DirectoryReader ireader = DirectoryReader.open(directory);

  Step two: create the searcher

IndexSearcher isearcher = new IndexSearcher(ireader);

  Step three: run a keyword query, much like SQL

QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.",hitDoc.get("fieldname"));
}

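  Here we create a query parser and give it the analyzer and the "table name" to search, "fieldname". The search returns a collection of hits, similar to a SQL ResultSet, from which we can read the stored content.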
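A search call that asks for the best n hits reports two numbers: how many documents matched in total, and the (at most n) hits actually returned. The plain-Java sketch below mimics that shape. The names TopHitsDemo, Hit and Result are invented for illustration, and the score is a bare occurrence count, which is an assumption and not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopHitsDemo {
    /** A hit: the document's position in the input list plus a naive score. */
    static final class Hit {
        final int doc;
        final int score;
        Hit(int doc, int score) { this.doc = doc; this.score = score; }
    }

    /** Search result: the total number of matching documents and the best n hits. */
    static final class Result {
        final int totalHits;
        final List<Hit> hits;
        Result(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    static Result search(List<String> docs, String keyword, int n) {
        List<Hit> matches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            int count = countOccurrences(docs.get(i), keyword);
            if (count > 0) {
                matches.add(new Hit(i, count));
            }
        }
        // Highest score first; List.sort is stable, so ties keep document order.
        matches.sort(Comparator.comparingInt((Hit h) -> h.score).reversed());
        List<Hit> top = new ArrayList<>(matches.subList(0, Math.min(n, matches.size())));
        return new Result(matches.size(), top);
    }

    private static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int from = text.indexOf(keyword); from >= 0;
                from = text.indexOf(keyword, from + keyword.length())) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("text text");
        docs.add("no match");
        docs.add("text");
        Result r = search(docs, "text", 1);
        System.out.println("total matches: " + r.totalHits + ", returned: " + r.hits.size());
    }
}
```

So the total can be larger than the length of the returned array whenever more documents match than were asked for.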

  For the other kinds of queries, see the official manual or the recommended slides.

  Step four: close the reader and the directory.

ireader.close();
directory.close();

Finally, here is a simple example I wrote: it builds an index over the contents of a folder, filters the files by keyword, and reads their contents.

package muh.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.xssf.usermodel.XSSFCell;
import org.apache.poi.xssf.usermodel.XSSFRow;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class LuceneTest {

    private static String INDEX_DIR = "E:\\luceneIndex";
    private static String Source_DIR = "E:\\luceneSource";

    /**
     * Lists all files under a path, including files in subfolders; if the path is itself a file, returns just that file.
     * @Title: listAllFiles
     * @author hegg
     * @date 2017-11-06 20:28:54
     * @param filePath the path to traverse
     * @param fileNameFilter file name filter
     * @return List<File>
     */
    public static List<File> listAllFiles(String filePath, FilenameFilter fileNameFilter) {
        List<File> files = new ArrayList<File>();
        try {
            File root = new File(filePath);
            if (!root.exists())
                return files;
            if (root.isFile())
                files.add(root);
            else {
                for (File file : root.listFiles(fileNameFilter)) {
                    if (file.isFile())
                        files.add(file);
                    else if (file.isDirectory()) {
                        files.addAll(listAllFiles(file.getAbsolutePath(), fileNameFilter));
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return files;
    }

    /**
     * Deletes a file, or a directory together with everything in it.
     * @Title: deleteDir
     * @author hegg
     * @date 2017-11-06 20:29:16
     * @param file
     * @return true if the file or directory itself was deleted
     */
    public static boolean deleteDir(File file) {
        if (file.isDirectory()) {
            File[] files = file.listFiles();
            if (files != null) { // listFiles() returns null on an I/O error
                for (File child : files) {
                    deleteDir(child);
                }
            }
        }
        return file.delete();
    }

    /**
     * Reads the content of a txt file.
     * @Title: readTxt
     * @author hegg
     * @date 2017-11-06 20:15:49
     * @param file
     * @return String
     */
    public static String readTxt(File file) {
        StringBuilder result = new StringBuilder();
        // Read the file through a BufferedReader; try-with-resources closes it even on error
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String s;
            while ((s = br.readLine()) != null) { // readLine reads one line at a time
                result.append("\n").append(s);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    /**
     * Reads Word content, both the 03 format (.doc) and the 07 format (.docx).
     * @Title: readWord
     * @author hegg
     * @date 2017-11-06 20:15:14
     * @param file
     * @param type "doc" or "docx"
     * @return String
     */
    public static String readWord(File file, String type) {
        String result = "";
        try (FileInputStream fis = new FileInputStream(file)) {
            if ("doc".equals(type)) {
                HWPFDocument doc = new HWPFDocument(fis);
                Range rang = doc.getRange();
                result += rang.text();
            }
            if ("docx".equals(type)) {
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                result = extractor.getText();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result;
    }

    /**
     * Reads Excel content, both the 03 format (.xls) and the 07 format (.xlsx).
     * @Title: readExcel
     * @author hegg
     * @date 2017-11-06 20:14:04
     * @param file
     * @param type "xls" or "xlsx"
     * @return String
     */
    public static String readExcel(File file, String type) {
        StringBuilder sb = new StringBuilder();
        try (FileInputStream fis = new FileInputStream(file)) {
            if ("xlsx".equals(type)) {
                XSSFWorkbook xwb = new XSSFWorkbook(fis);
                for (int i = 0; i < xwb.getNumberOfSheets(); i++) {
                    XSSFSheet sheet = xwb.getSheetAt(i);
                    for (int j = 0; j < sheet.getPhysicalNumberOfRows(); j++) {
                        XSSFRow row = sheet.getRow(j);
                        if (row == null) continue; // skip rows that are not defined
                        for (int k = 0; k < row.getPhysicalNumberOfCells(); k++) {
                            XSSFCell cell = row.getCell(k);
                            // toString() also works for non-text cells (e.g. numeric),
                            // where getRichStringCellValue() throws
                            if (cell != null) sb.append(cell.toString()).append(' ');
                        }
                    }
                }
            }
            if ("xls".equals(type)) {
                // Get the Excel workbook object
                HSSFWorkbook hwb = new HSSFWorkbook(fis);
                for (int i = 0; i < hwb.getNumberOfSheets(); i++) {
                    HSSFSheet sheet = hwb.getSheetAt(i);
                    for (int j = 0; j < sheet.getPhysicalNumberOfRows(); j++) {
                        HSSFRow row = sheet.getRow(j);
                        if (row == null) continue; // skip rows that are not defined
                        for (int k = 0; k < row.getPhysicalNumberOfCells(); k++) {
                            HSSFCell cell = row.getCell(k);
                            if (cell != null) sb.append(cell.toString()).append(' ');
                        }
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    /**
     * Reads the content of a pdf file.
     * @Title: readPDF
     * @author hegg
     * @date 2017-11-06 20:13:41
     * @param file
     * @return String
     */
    public static String readPDF(File file) {
        String result = null;
        FileInputStream is = null;
        PDDocument document = null;
        try {
            is = new FileInputStream(file);
            PDFParser parser = new PDFParser(is);
            parser.parse();
            document = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            result = stripper.getText(document);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return result;
    }

    /**
     * Reads the content of an html file.
     * @Title: readHtml
     * @author hegg
     * @date 2017-11-06 20:13:08
     * @param file
     * @return String
     */
    public static String readHtml(File file) {
        StringBuilder content = new StringBuilder();
        // Read the page. The charset must match the one declared in the HTML header,
        // otherwise the text comes out garbled.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append("\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content.toString();
    }

    /**
     * Creates the index.
     * @Title: creatIndex
     * @author hegg
     * @date 2017-11-06 20:29:37
     */
    public static void creatIndex() {
        Date begin = new Date();
        // 1. The Analyzer; note the difference between SimpleAnalyzer and StandardAnalyzer
        Analyzer analyzer  = null;
        // 2. The Directory that holds the index; it can be kept in memory or on disk
        Directory directory = null;
        // 3. The IndexWriter that builds the index
        IndexWriter indexWriter = null;
        try {
//            analyzer = new StandardAnalyzer();
//            analyzer = new SimpleAnalyzer();
            analyzer = new IKAnalyzer(true);
//            directory = FSDirectory.open(new File(INDEX_DIR));
            directory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));
            // 4. Create the IndexWriterConfig with the analyzer
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            // 5. Create the IndexWriter from the IndexWriterConfig
            indexWriter = new IndexWriter(directory, config);
            indexWriter.deleteAll();

            File docDirectory = new File(Source_DIR);
            for (File file : docDirectory.listFiles()) {
                String content = "";
                // Get the file extension
                String type = file.getName().substring(file.getName().lastIndexOf(".")+1);
                if("txt".equalsIgnoreCase(type)){
                    content += readTxt(file);
                }else if("doc".equalsIgnoreCase(type)){
                    content += readWord(file,"doc");
                }else if("docx".equalsIgnoreCase(type)){
                    content += readWord(file,"docx");
                }else if("xls".equalsIgnoreCase(type)){
                    content += readExcel(file,"xls");
                }else if("xlsx".equalsIgnoreCase(type)){
                    content += readExcel(file,"xlsx");
                }else if("pdf".equalsIgnoreCase(type)){
                    content += readPDF(file);
                }else if("html".equalsIgnoreCase(type)){
                    content += readHtml(file);
                }
                // 6. Create the document
                Document document = new Document();
                document.add(new Field("content", content, TextField.TYPE_STORED));
                document.add(new Field("fileName", file.getName(), TextField.TYPE_STORED));
                document.add(new Field("filePath", file.getAbsolutePath(), TextField.TYPE_STORED));
                document.add(new Field("updateTime", file.lastModified() + "", TextField.TYPE_STORED));
                indexWriter.addDocument(document);
            }
            indexWriter.commit();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (analyzer != null) analyzer.close();
                if (indexWriter != null) indexWriter.close();
                if (directory != null) directory.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        Date end = new Date();
        System.out.println("Index creation took: " + (end.getTime() - begin.getTime()) + "ms\n");
    }

    /**
     * Searches the index and prints the files that match.
     * @Title: searchIndex
     * @author hegg
     * @date 2017-11-06 20:29:31
     * @param keyWord the keyword to search for
     */
    public static void searchIndex(String keyWord) {
        Date begin = new Date();
        // 1. The Analyzer; note the difference between SimpleAnalyzer and StandardAnalyzer
        Analyzer analyzer  = null;
        // 2. The folder that holds the index
        Directory indexDirectory = null;
        // 3. The DirectoryReader
        DirectoryReader directoryReader = null;
        try {
//            analyzer = new StandardAnalyzer();
//            analyzer = new SimpleAnalyzer();
            analyzer = new IKAnalyzer(true);
//            indexDirectory = FSDirectory.open(new File(INDEX_DIR));
            indexDirectory = FSDirectory.open(FileSystems.getDefault().getPath(INDEX_DIR));
            directoryReader = DirectoryReader.open(indexDirectory);
            // 3. Create the IndexSearcher from the DirectoryReader
            IndexSearcher indexSearcher = new IndexSearcher(directoryReader);
            // 4. Create the query and specify the field to search (single-field variant)
//            QueryParser parser = new QueryParser("content", analyzer);
//            Query query1 = parser.parse(keyWord);
//            ScoreDoc[] hits = indexSearcher.search(query1, 1000).scoreDocs;
//            for (int i = 0; i < hits.length; i++) {
//                Document hitDoc = indexSearcher.doc(hits[i].doc);
//                System.out.println("____________________________");
//                System.out.println(hitDoc.get("content"));
//                System.out.println(hitDoc.get("fileName"));
//                System.out.println(hitDoc.get("filePath"));
//                System.out.println(hitDoc.get("updateTime"));
//                System.out.println("____________________________");
//            }

            String[] fields = { "fileName", "content" }; // fields to search; a search usually covers more than one field
            // Boolean relation between the fields: MUST means and, MUST_NOT means not, SHOULD means or;
            // there must be exactly one clause per field
            BooleanClause.Occur[] clauses = { BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD };
            Query query2 = MultiFieldQueryParser.parse(keyWord, fields, clauses, analyzer);
            // 5. Search with the searcher and get the TopDocs back
            TopDocs topDocs = indexSearcher.search(query2, 100); // return the top 100 results
            // totalHits counts every matching document; scoreDocs.length is capped at the 100 requested above
            System.out.println("Total matching documents: " + topDocs.totalHits);
            // 6. Get the ScoreDoc objects from the TopDocs
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            System.out.println("Documents returned: " + scoreDocs.length);
            QueryScorer scorer = new QueryScorer(query2, "content");
            // 7. Custom highlighting markup
            SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<span style=\"background-color:black;color:red\">", "</span>");
            Highlighter highlighter = new Highlighter(htmlFormatter, scorer);
            highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));
            for (ScoreDoc scoreDoc : scoreDocs) {
                // 8. Get the actual Document from the searcher and the ScoreDoc
                Document document = indexSearcher.doc(scoreDoc.doc);
                System.out.println("-----------------------------------------");
                System.out.println(document.get("fileName") + ":" + document.get("filePath"));
                System.out.println(highlighter.getBestFragment(analyzer, "content", document.get("content")));
                System.out.println("-----------------------------------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (InvalidTokenOffsetsException e) {
            e.printStackTrace();
        } finally {
            try {
                if (analyzer != null) analyzer.close();
                if (directoryReader != null) directoryReader.close();
                if (indexDirectory != null) indexDirectory.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        Date end = new Date();
        System.out.println("Keyword search took: " + (end.getTime() - begin.getTime()) + "ms\n");
    }

    public static void main(String[] args) throws Exception {
        // Start from an empty index folder
        File fileIndex = new File(INDEX_DIR);
        deleteDir(fileIndex);
        fileIndex.mkdir();

        creatIndex();
        searchIndex("天安门"); // "Tiananmen"
    }
}
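One detail in the listing above is worth a second look: taking substring(lastIndexOf(".") + 1) returns the whole file name when there is no dot at all. A slightly more defensive helper could look like the sketch below; the class and method names (FileTypes, extensionOf) are made up for illustration and are not part of the original example.

```java
import java.util.Locale;

class FileTypes {
    /** Returns the lower-case extension without the dot, or "" if the name has none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, only a leading dot (".gitignore"), or a trailing dot: treat as "no extension".
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.PDF"));
        System.out.println(extensionOf("README").isEmpty());
    }
}
```

Because the result is already lower-case, the dispatch on "txt", "doc", "pdf" and so on could then use plain equals or a switch on the string.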


Finally, the jars used in this example can be downloaded from http://pan.baidu.com/s/1jI26UgQ (password: qix6).

posted @ 2017-11-05 20:35 silentmuh
