2020-6-4-Lucene

概述、创建索引流程、查询流程、创建索引实例、搜索案例、查看分词器分词效果、索引维护、查询

1概述

基于java开发的全文检索包

2创建索引流程

1）获得文档

原始文档：基于那些数据进行搜索，那么这些数据就是原始文档

搜索引擎：使用爬虫获得原始文档

站内搜索：数据库中的数据

2）构建文档对象

对每个原始文档创建一个Document对象，每个Document对象中包含多个域，每个文档有一个唯一编号

3）分析文档

根据空格进行拆分，单词统一小写，去除标段符号，去除停用词。每个关键词都封装成一个Term，包含域和关键词，不同域中拆分出来的相同关键词，是不同的Term

4）创建索引

通过词语找文档，这种结构称为倒排索引

3查询流程

(1)用户查询接口

(2把关键词封装成一个查询对象

(3)执行查询

(4)渲染结果

4创建索引实例

1）导包

	<dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>8.5.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>

2）创建

package com.zhanghuan.loginapi;


import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;


import java.io.File;
import java.io.IOException;


class LoginapiApplicationTests {

    @Test
    void contextLoads() throws IOException {
        //创建索引，并获得索引写入器
        FSDirectory dictionary=FSDirectory.open (new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
        IndexWriter indexWriter=new IndexWriter (dictionary,new IndexWriterConfig ());//这里可将分词器对象传入

        //遍历文件
        File file=new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\search_source");
        for (File f:
             file.listFiles ()) {
            //获取文件信息
            String filename=f.getName ();
            String filepath=f.getPath ();
            String filecontent= FileUtils.readFileToString (f,"utf-8");
            Long filesize=FileUtils.sizeOf (f);

            //将文件信息存入字段中
            Field name=new TextField ("name",filename,Field.Store.YES);
            Field path=new StoredField ("path",filepath);
            Field content=new TextField ("content",filecontent,Field.Store.YES);
            Field size=new StoredField ("size",filesize);
            Field size_num=new LongPoint ("size",filesize);

            //创建文档，将字段放入文档
            Document document=new Document ();
            document.add (name);
            document.add (path);
            document.add (content);
            document.add (size);
            document.add (size_num);

            //将文档写入索引里
            indexWriter.addDocument (document);
        }
        //关闭索引写入器
        indexWriter.close ();

    }

}

5搜索案例

package com.zhanghuan.loginapi;


import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;


import java.io.File;
import java.io.IOException;

class LoginapiApplicationTests {



    @Test
    void read() throws IOException {
        //打开索引
        FSDirectory dictionary=FSDirectory.open (new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
        DirectoryReader indexReader= DirectoryReader.open (dictionary);

        //搜索关键词
        IndexSearcher indexSearcher=new IndexSearcher (indexReader);
        Query query=new TermQuery (new Term ("content","java"));
        TopDocs topDocs=indexSearcher.search (query,10);
        ScoreDoc[] docs=topDocs.scoreDocs;

        //打印搜索结果
        System.out.println ("最大数量:"+topDocs.totalHits);
        for (ScoreDoc doc:
             docs) {
            int docId=doc.doc;
            Document document=indexSearcher.doc (docId);
            System.out.println (document.get("name"));
            System.out.println (document.get("path"));
            System.out.println (document.get("size"));
            System.out.println (document.get("content"));
            System.out.println ("----------------------------------");
        }
        indexReader.close ();

     }

}

6查看分词器分词效果

1）普通分词器

//核心代码
	@Test
     void analyzer() throws IOException {
        //创建一个分析器对象
         StandardAnalyzer analyzer= new StandardAnalyzer ();

         //获取Tokenstream
         TokenStream tokenStream=analyzer.tokenStream ("","java sc a");

         //向Tokenstream设置一个引用，相当于一个指针
         CharTermAttribute charTermAttribute=tokenStream.addAttribute (CharTermAttribute.class);

         //重置
         tokenStream.reset ();

         //循环遍历
        while (tokenStream.incrementToken ()){
            System.out.println (charTermAttribute.toString ());
        }
        /*
        java
        sc
        a
         */
        tokenStream.close ();
     }

对于中文，会每个中文算一个词

2）IKAnalyzer分词器

	<dependency>
            <groupId>com.jianggujin</groupId>
            <artifactId>IKAnalyzer-lucene</artifactId>
            <version>8.0.0</version>
        </dependency>

	@Test
     void analyzer() throws IOException {
        //创建一个分析器对象
         IKAnalyzer analyzer= new IKAnalyzer ();

         //获取Tokenstream
         TokenStream tokenStream=analyzer.tokenStream ("","java sc a 你好");

         //向Tokenstream设置一个引用，相当于一个指针
         CharTermAttribute charTermAttribute=tokenStream.addAttribute (CharTermAttribute.class);

         //重置
         tokenStream.reset ();

         //循环遍历
        while (tokenStream.incrementToken ()){
            System.out.println (charTermAttribute.toString ());
        }
        tokenStream.close ();
     }

使用前需要定义两个词典ext.dic和stopword.dic

7不同域对比

Field类	支持类型	分析（分词）	索引	存储
StringField	字符串	N	Y	Y或N
LongPoint	Long类型	Y	Y	N
StoredField	重载方法，支持各种类型	N	N	Y
TextField	字符串式流	Y	Y	Y或N

8索引维护

1）新增

2）删除

	//核心代码
	@Test
     void del() throws IOException {
         FSDirectory dictionary=FSDirectory.open
                 (new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
         IndexWriter indexWriter=new IndexWriter (dictionary,new IndexWriterConfig (new IKAnalyzer ()));
         indexWriter.deleteDocuments (new Term ("content","序列化"));//删除content包含数据库的文档
         //indexWriter.deleteAll ();//删除所有
         indexWriter.close ();
     }

3）修改

删除后在新增

9查询

1）使用Query的子类

	//核心代码
	@Test
     void query() throws IOException {
         FSDirectory dictionary=FSDirectory.open (new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
         DirectoryReader reader=DirectoryReader.open (dictionary);
         
         IndexSearcher indexSearcher=new IndexSearcher (reader);
         Query query=LongPoint.newRangeQuery ("size",10,100);//范围查询
         TopDocs topDocs=indexSearcher.search (query,10);
         ScoreDoc[] docs=topDocs.scoreDocs;
         
         for (ScoreDoc doc:
              docs) {
             int docId=doc.doc;
             Document document=indexSearcher.doc (docId);
             System.out.println (document.get("name"));
             System.out.println (document.get("path"));
             System.out.println (document.get("size"));
             System.out.println (document.get("content"));
             System.out.println ("----------------------------------");
         }
     }

2）使用QueryParser

	@Test
     void query() throws IOException, ParseException {
         FSDirectory dictionary=FSDirectory.open (new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
         DirectoryReader reader=DirectoryReader.open (dictionary);
         
         IndexSearcher indexSearcher=new IndexSearcher (reader);
         QueryParser queryParser=new QueryParser ("name",new IKAnalyzer ());//参数1为域名，参数2为解析器对象
         Query query=queryParser.parse ("java开发");
         TopDocs topDocs=indexSearcher.search (query,10);
         
         ScoreDoc[] docs=topDocs.scoreDocs;
         for (ScoreDoc doc:
              docs) {
             int docId=doc.doc;
             Document document=indexSearcher.doc (docId);
             System.out.println (document.get("name"));
             System.out.println (document.get("path"));
             System.out.println (document.get("size"));
             System.out.println (document.get("content"));
             System.out.println ("----------------------------------");
         }
     }