Lucene 3.6.0 Study Notes

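The mainstream full-text indexing tools today are Lucene, Sphinx, Solr, and ElasticSearch. Solr and ElasticSearch are both built on Lucene. Sphinx is not an Apache project; if you want to put Sphinx into a commercial product, you have to buy a commercial license. (In truth I have only studied Lucene and know Solr only in passing. A project needed it these last couple of days, so I dug into it. This post is a personal study memo.)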

Chapter 1 Lucene Basics

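Every full-text indexing tool is made up of three parts: indexing, analysis (tokenization), and search.

Core classes of the indexing part

IndexWriter: creates the index and adds documents to it.

Directory: represents the location where the index is stored; it is an abstract class.

Analyzer: tokenizes the document content and hands the resulting tokens to IndexWriter to build the index.

Document: made up of several Fields; comparable to one record (row) in a database.

Field: comparable to one column of a database record.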

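Core classes of the analysis part

Analyzer: the built-in analyzers include the simple analyzer (SimpleAnalyzer), the stop-word analyzer (StopAnalyzer), the whitespace analyzer (WhitespaceAnalyzer), and the standard analyzer (StandardAnalyzer).

TokenStream: the stream through which the information about each token is read.

Tokenizer: receives a character stream (a Reader) and splits it into tokens.

TokenFilter: applies filtering of various kinds to the tokens produced by tokenization.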
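To make the Tokenizer, TokenFilter, and Analyzer roles concrete, here is a minimal sketch in plain Java. It is a toy model and does not use Lucene's API: the class name ToyAnalysisChain, its methods, and the stop-word list are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an analysis chain. This is NOT Lucene's API.
final class ToyAnalysisChain {

    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // "Tokenizer" step: split the raw text on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // First "TokenFilter": lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Second "TokenFilter": drop the stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": wires the tokenizer and the filters into one chain.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        // prints [quick, fox, friend, lucene]
        System.out.println(analyze("The quick Fox is a friend of Lucene"));
    }
}
```

The point of the layering is that the filters can be swapped or reordered without touching the tokenizer, which is roughly how the different built-in analyzers differ from one another.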

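Core classes of the search part

IndexSearcher: runs searches against an index that has already been built.

Term: the basic unit of search.

Query: the query string entered by the user is wrapped into a Query that Lucene can understand.

TermQuery: a subclass of the abstract class Query; its constructor takes a single argument, a Term object.

TopDocs: holds the results returned by a search.

ScoreDoc: holds one hit, namely the document number (doc) and its score; the Document object itself is fetched with IndexSearcher.doc(...).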
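How these search classes relate can be sketched with a toy searcher in plain Java. This is not Lucene code: ToySearcher and Hit are hypothetical names, and the score is simply the raw term frequency, which is an assumption made for brevity and not Lucene's real scoring formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy single-term searcher. This is NOT Lucene's API.
final class ToySearcher {

    // Plays the role of ScoreDoc: a document number plus a score, not the document itself.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final List<String> storedDocs = new ArrayList<>();
    // term -> (document number -> how often the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void addDocument(String body) {
        int doc = storedDocs.size();
        storedDocs.add(body);
        for (String word : body.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(word, k -> new HashMap<>()).merge(doc, 1, Integer::sum);
        }
    }

    // Plays the role of searching with a TermQuery: the n best hits for one term.
    List<Hit> search(String term, int n) {
        List<Hit> hits = new ArrayList<>();
        Map<Integer, Integer> freqs = postings.get(term);
        if (freqs != null) {
            for (Map.Entry<Integer, Integer> e : freqs.entrySet()) {
                hits.add(new Hit(e.getKey(), e.getValue().floatValue()));
            }
        }
        // Highest score first; ties are broken by the lower document number.
        hits.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    // Plays the role of IndexSearcher.doc(int): resolve a document number to its content.
    String doc(int docId) {
        return storedDocs.get(docId);
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToySearcher();
        searcher.addDocument("lucene builds an index");
        searcher.addDocument("solr is built on lucene and lucene is a library");
        searcher.addDocument("sphinx is another tool");
        for (Hit hit : searcher.search("lucene", 10)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score + " text=" + searcher.doc(hit.doc));
        }
    }
}
```

Note the two-step lookup: the search returns only document numbers and scores, and the content is fetched afterwards by number, which mirrors the ScoreDoc and IndexSearcher.doc(...) split described above.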

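Chapter 2 Building the Index

Building the index is the process of extracting information from all the structured and unstructured data of the real world and creating an index from it, as shown in the figure below:

Example: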

package text;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestFileIndexer {
    public static void main(String[] args) throws Exception {
        /* Folder whose files will be indexed: the "source" folder on drive C */
        File fileDir = new File("c:\\source");
        /* Folder where the index files are written */
        File indexDir = new File("c:\\index");
        Directory dir = FSDirectory.open(indexDir); // keep the index on disk
        Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36); // analyzer
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
        // CREATE always builds a new index and overwrites an existing one;
        // use CREATE_OR_APPEND to create the index or append to an existing one
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(dir, iwc); // writes documents into the index
        File[] textFiles = fileDir.listFiles(); // every entry in the source folder
        if (textFiles == null) {
            textFiles = new File[0]; // the folder does not exist or cannot be read
        }
        long startTime = System.currentTimeMillis();
        try {
            // add one Document per file
            for (int i = 0; i < textFiles.length; i++) {
                if (!textFiles[i].isFile()) {
                    continue; // skip sub-folders, which cannot be read as text
                }
                System.out.println(":;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;");
                System.out.println("File " + textFiles[i].getCanonicalPath() + " is being indexed...");
                String temp = FileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
                System.out.println(temp);
                Document document = new Document();
                // path: stored but not indexed, so it can be shown but not searched
                Field fieldPath = new Field("path", textFiles[i].getPath(), Field.Store.YES, Field.Index.NO);
                // body: stored, tokenized, and indexed with term vectors
                Field fieldBody = new Field("body", temp, Field.Store.YES, Field.Index.ANALYZED,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
                // the field name (key) is "modified"; stored so it can be read back at search time
                NumericField modifiedField = new NumericField("modified", Field.Store.YES, true);
                modifiedField.setLongValue(textFiles[i].lastModified()); // the file's own timestamp, not the folder's
                document.add(fieldPath);
                document.add(fieldBody);
                document.add(modifiedField);
                indexWriter.addDocument(document);
            }
        } finally {
            indexWriter.close();
        }
        // report how long indexing took
        long endTime = System.currentTimeMillis();
        System.out.println("Took " + (endTime - startTime) + " ms to add the documents in "
                + fileDir.getPath() + " to the index");
    }

    /* Reads a whole file into one String using the given charset */
    public static String FileReaderAll(String FileName, String charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(FileName), charset));
        try {
            StringBuilder temp = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                temp.append(line).append('\n'); // keep the line break so words on adjacent lines do not run together
            }
            return temp.toString();
        } finally {
            reader.close();
        }
    }
}

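Field.Store.YES: stored. The original value can be retrieved from the index.

NO: not stored. The value cannot be retrieved, but it can still be indexed.

Field.Index.ANALYZED: tokenized before indexing.

NOT_ANALYZED: not tokenized; the whole value is indexed as a single term.

NOT_ANALYZED_NO_NORMS: not tokenized and not weighted (that is, no norms information is stored).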
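The difference between storing and indexing is easy to mix up, so here is a small plain-Java model of it. It is only a sketch under my own assumptions: ToyFieldOptions and its two enums merely borrow the option names and are not Lucene classes, and norms are left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the Store / Index options. This is NOT Lucene's API.
final class ToyFieldOptions {
    enum Store { YES, NO }
    enum Index { NO, ANALYZED, NOT_ANALYZED }

    // document number -> (field name -> stored value)
    private final List<Map<String, String>> stored = new ArrayList<>();
    // "field:term" -> document numbers containing that term
    private final Map<String, Set<Integer>> inverted = new HashMap<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int doc, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            stored.get(doc).put(name, value); // the value can be read back later
        }
        if (index == Index.ANALYZED) {
            // tokenized: every lower-cased word becomes a searchable term
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    addTerm(name, token, doc);
                }
            }
        } else if (index == Index.NOT_ANALYZED) {
            addTerm(name, value, doc); // the whole value is one single term
        }
    }

    private void addTerm(String field, String term, int doc) {
        inverted.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(doc);
    }

    Set<Integer> search(String field, String term) {
        Set<Integer> docs = inverted.get(field + ":" + term);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    String get(int doc, String field) {
        return stored.get(doc).get(field);
    }

    public static void main(String[] args) {
        ToyFieldOptions idx = new ToyFieldOptions();
        int doc = idx.newDocument();
        idx.addField(doc, "path", "c:\\source\\a.txt", Store.YES, Index.NO);
        idx.addField(doc, "body", "Hello Lucene", Store.NO, Index.ANALYZED);
        idx.addField(doc, "id", "Doc 42", Store.YES, Index.NOT_ANALYZED);

        System.out.println(idx.get(doc, "path"));                     // c:\source\a.txt (stored)
        System.out.println(idx.search("path", "c:\\source\\a.txt"));  // [] (not indexed)
        System.out.println(idx.search("body", "lucene"));             // [0] (tokenized and indexed)
        System.out.println(idx.get(doc, "body"));                     // null (not stored)
        System.out.println(idx.search("id", "Doc 42"));               // [0] (whole value is the term)
        System.out.println(idx.search("id", "doc"));                  // [] (value was never split)
    }
}
```

The null in the fourth line is the case to watch for in practice: a field that was indexed but not stored can be searched, yet reading it back from a hit gives nothing.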

Basic querying of the index

package text;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestQuery {
    public static void main(String[] args) throws ParseException, IOException {
        String index = "c:\\index"; // path of the index to search
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)));
        IndexSearcher searcher = new IndexSearcher(reader); // search tool
        try {
            String queryString = "测试"; // the keyword to search for ("test")
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            // parses the user's input; "body" is the default field to search
            QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer);
            // a ParseException is allowed to propagate: swallowing it would
            // leave the query null and make the search below fail
            Query query = qp.parse(queryString);
            TopDocs results = searcher.search(query, 10); // only the ten best-ranked hits
            ScoreDoc[] hits = results.scoreDocs;
            for (int i = 0; i < hits.length; i++) {
                Document document = searcher.doc(hits[i].doc);
                String body = document.get("body");
                String path = document.get("path");
                // the field name is "modified"; this is null unless the field was stored at index time
                String modifiedTime = document.get("modified");
                System.out.println(body + "        ");
                System.out.println(path);
            }
            if (hits.length > 0) {
                System.out.println("Found " + hits.length + " result(s)");
            }
        } finally {
            searcher.close();
            reader.close();
        }
    }
}

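What the index files do

Once the index has been built, files with several different extensions are generated on disk automatically (see the figure below). None of them can be missing. Here is a brief description of what each extension is for:

.fdt: stores the field values (that is, the fields added with Store.YES).

.fdx: the index into .fdt; for each document it holds a pointer to where that document's stored fields begin.

.fnm: stores how many fields this segment contains, along with each field's name and how it is indexed.

.frq: stores the postings lists with frequencies (which documents contain each term, and how many times it occurs in each).

.nrm: stores the norms, the normalization factors used in scoring.

.prx: stores position information, that is, where each term occurs within the documents that contain it.

.tis: stores the term dictionary, that is, all the terms of this segment sorted in lexicographic order.

.tii: the index into .tis, used to locate terms in the term dictionary quickly.

Notes:

① As in the figure above, files with the same prefix belong to the same segment; the figure shows two segments, "_0" and "_1".

② An index can contain several segments. Segments are independent of one another; adding new documents can create a new segment, and different segments can be merged.

③ These index files can be opened with lukeall-3.5.0.jar; how to use it is described in a later chapter.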
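To picture what the term dictionary, the frequency file, and the position file hold between them, here is a toy in-memory version in plain Java. It only imitates the logical content; the class ToyPostings is hypothetical and says nothing about Lucene's actual on-disk encoding.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Toy postings structure. This is NOT Lucene's file format.
final class ToyPostings {

    // Sorted term dictionary (the role of .tis); each term maps to
    // document number -> positions of the term in that document.
    // The number of positions is the frequency (.frq), the positions themselves are what .prx keeps.
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    void add(int doc, String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(words[pos], k -> new TreeMap<>())
                    .computeIfAbsent(doc, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // All terms in lexicographic order.
    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    // Word positions of a term inside one document (empty if it does not occur).
    List<Integer> positions(String term, int doc) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(doc)) {
            return Collections.emptyList();
        }
        return docs.get(doc);
    }

    // How many times a term occurs in one document.
    int freq(String term, int doc) {
        return positions(term, doc).size();
    }

    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.add(0, "to be or not to be");
        index.add(1, "to index is to search");

        System.out.println(index.terms());            // [be, index, is, not, or, search, to]
        System.out.println(index.freq("to", 0));      // 2
        System.out.println(index.positions("to", 0)); // [0, 4]
        System.out.println(index.positions("to", 1)); // [0, 3]
    }
}
```

Keeping the terms sorted is what makes a small lookup index over the dictionary possible, and keeping positions separate from frequencies means a plain term search never has to read them.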

Parse the query string into a Query

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer); // parses the user's input
Query query = qp.parse(queryString);

Get the TopDocs from the Query

TopDocs results = searcher.search(query, 10); // returns at most 10 hits

Get the ScoreDocs from the TopDocs

ScoreDoc[] hits = results.scoreDocs;

for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc); // resolve the document number to the Document
    String body = document.get("body");
    String path = document.get("path");
    String modifiedTime = document.get("modified"); // the field was indexed under the name "modified"
    System.out.println(body + "        ");
    System.out.println(path);
}

posted @ 2016-05-31 11:33 赤子之心_timefast
