2020-6-4-Lucene

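Overview, index creation flow, query flow, index creation example, search example, inspecting analyzer output, field types compared, index maintenance, queries

1 Overview

Lucene is a full-text search library written in Java.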

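2 Index creation flow

1) Obtain the documents

Raw documents: whatever data the search runs against is the set of raw documents.

Search engine: raw documents are collected by a crawler.

Site search: raw documents are the rows in the database.

2) Build document objects

Create one Document object per raw document. Each Document contains multiple fields, and each document has a unique ID.

3) Analyze the documents

Split the text on whitespace, lowercase every word, strip punctuation, and remove stop words. Each resulting keyword is wrapped in a Term, which holds both the field and the keyword. The same keyword extracted from different fields yields different Terms.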
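The analysis steps above can be sketched in plain Java, without Lucene. This is only an illustration of the idea, not how Lucene's analyzers are implemented: the class name `AnalysisSketch`, the tiny stop-word list, and the `field:keyword` string standing in for a Term are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalysisSketch {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    /**
     * Splits on whitespace, lowercases, strips punctuation and drops stop words.
     * Each result is "field:keyword", a stand-in for a Term.
     */
    static List<String> analyze(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String word = raw.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            terms.add(field + ":" + word);
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [name:lucene, name:java, name:library]
        System.out.println(analyze("name", "Lucene is a Java library."));
        // Same text, different field: the terms are different.
        System.out.println(analyze("content", "Lucene is a Java library."));
    }
}
```

Because the field is part of each term, `name:java` and `content:java` never compare equal, which is why a query has to name the field it searches.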

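4) Create the index

The index maps each term to the documents that contain it, so documents are found by word. This structure is called an inverted index.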
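A minimal sketch of an inverted index in plain Java may make the structure concrete. It assumes whitespace-separated text and integer document IDs; the class `InvertedIndexSketch` and its methods are hypothetical and far simpler than Lucene's on-disk format.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Indexes one document: every word points back to docId. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the IDs of all documents containing the term (empty if none). */
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "java full text search");
        index.add(1, "lucene is written in java");
        System.out.println(index.lookup("java"));   // prints [0, 1]
        System.out.println(index.lookup("lucene")); // prints [1]
    }
}
```

A lookup is a single map access on the term, independent of how long the documents are, which is the point of inverting the document-to-word relationship.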

3 Query flow

(1) Provide a query interface to the user

(2) Wrap the keywords in a query object

(3) Execute the query

(4) Render the results

4 Index creation example

1) Add the dependencies

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>8.5.2</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>

2) Create the index

package com.zhanghuan.loginapi;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class LoginapiApplicationTests {

    @Test
    void contextLoads() throws IOException {
        // Open the index directory and create an index writer.
        // An Analyzer can be passed to IndexWriterConfig; the no-arg constructor uses StandardAnalyzer.
        try (FSDirectory directory = FSDirectory.open(new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath());
             IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig())) {

            // Walk the source files (listFiles returns null if the path is not a directory)
            File sourceDir = new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\search_source");
            File[] files = sourceDir.listFiles();
            if (files == null) {
                return;
            }
            for (File f : files) {
                if (!f.isFile()) {
                    continue; // skip subdirectories
                }
                // Read the file's metadata and content
                String filename = f.getName();
                String filepath = f.getPath();
                String filecontent = FileUtils.readFileToString(f, StandardCharsets.UTF_8);
                long filesize = FileUtils.sizeOf(f);

                // Put the file information into fields
                Field name = new TextField("name", filename, Field.Store.YES);
                Field path = new StoredField("path", filepath);
                Field content = new TextField("content", filecontent, Field.Store.YES);
                Field size = new StoredField("size", filesize);   // stored, so it can be returned
                Field sizeNum = new LongPoint("size", filesize);  // indexed, so it can be range-queried

                // Create the document and add the fields to it
                Document document = new Document();
                document.add(name);
                document.add(path);
                document.add(content);
                document.add(size);
                document.add(sizeNum);

                // Write the document to the index
                indexWriter.addDocument(document);
            }
        } // the writer and directory are closed here
    }

}

5 Search example

package com.zhanghuan.loginapi;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

import java.io.File;
import java.io.IOException;

class LoginapiApplicationTests {

    @Test
    void read() throws IOException {
        // Open the index
        try (FSDirectory directory = FSDirectory.open(new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath());
             DirectoryReader indexReader = DirectoryReader.open(directory)) {

            // Search the content field for the keyword (lowercase, because terms were lowercased at index time)
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            Query query = new TermQuery(new Term("content", "java"));
            TopDocs topDocs = indexSearcher.search(query, 10); // return at most 10 hits
            ScoreDoc[] docs = topDocs.scoreDocs;

            // Print the search results
            System.out.println("Total hits: " + topDocs.totalHits);
            for (ScoreDoc doc : docs) {
                int docId = doc.doc;
                Document document = indexSearcher.doc(docId);
                System.out.println(document.get("name"));
                System.out.println(document.get("path"));
                System.out.println(document.get("size"));
                System.out.println(document.get("content"));
                System.out.println("----------------------------------");
            }
        } // the reader and directory are closed here
    }

}

6 Inspecting analyzer output

1) StandardAnalyzer

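    // Core code (a method of the test class). It needs these imports:
    // org.apache.lucene.analysis.TokenStream
    // org.apache.lucene.analysis.standard.StandardAnalyzer
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    @Test
    void analyzer() throws IOException {
        // Create an analyzer
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Get a TokenStream for the text (the field name is left empty)
        TokenStream tokenStream = analyzer.tokenStream("", "java sc a");

        // Register an attribute on the TokenStream; it acts like a pointer to the current token's text
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

        // Reset the stream before reading from it
        tokenStream.reset();

        // Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        /*
        java
        sc
        a
         */
        tokenStream.end();
        tokenStream.close();
    }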

For Chinese text, StandardAnalyzer treats every Chinese character as a separate token.

2) IKAnalyzer

        <dependency>
            <groupId>com.jianggujin</groupId>
            <artifactId>IKAnalyzer-lucene</artifactId>
            <version>8.0.0</version>
        </dependency>

    @Test
    void analyzer() throws IOException {
        // Create an analyzer
        IKAnalyzer analyzer = new IKAnalyzer();

        // Get a TokenStream for the text (the field name is left empty)
        TokenStream tokenStream = analyzer.tokenStream("", "java sc a 你好");

        // Register an attribute on the TokenStream; it acts like a pointer to the current token's text
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

        // Reset the stream before reading from it
        tokenStream.reset();

        // Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        tokenStream.end();
        tokenStream.close();
    }

Before using it, define two dictionary files: ext.dic (extension words) and stopword.dic (stop words).

7 Field types compared

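Field class    Supported type                            Analyzed    Indexed    Stored
StringField    String                                    N           Y          Y or N
LongPoint      long                                      N           Y          N
StoredField    Overloaded constructors, many types       N           N          Y
TextField      String or stream (Reader)                 Y           Y          Y or N

LongPoint values do not pass through the analyzer: they are indexed as numeric points for range queries and are never stored. To get the value back in search results, add a StoredField with the same name next to it, as the index creation example does for size.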

8 Index maintenance

1) Add

Call indexWriter.addDocument, as in the index creation example in section 4.

2) Delete

    // Core code
    @Test
    void del() throws IOException {
        FSDirectory directory = FSDirectory.open(
                new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath());
        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig(new IKAnalyzer()));
        // Delete every document whose content field contains the term "序列化"
        indexWriter.deleteDocuments(new Term("content", "序列化"));
        // indexWriter.deleteAll(); // delete all documents
        indexWriter.close();
    }

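3) Update

An update is a delete followed by an add. IndexWriter.updateDocument(Term, Document) does both in one call: it deletes every document containing the term, then adds the new document.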
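The delete-then-add behaviour can be sketched in plain Java with an in-memory list standing in for the index. Everything here is hypothetical: the class `UpdateSketch`, the map-based documents, and in particular the `id` field, which the documents indexed earlier in this post do not have.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UpdateSketch {
    // Each document is a field -> value map, a stand-in for a Lucene Document.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int delete(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Update = delete all matching documents, then add the replacement. */
    void update(String field, String value, Map<String, String> replacement) {
        delete(field, value);
        add(replacement);
    }

    int size() {
        return docs.size();
    }

    /** Helper that builds a two-field document (hypothetical id and name fields). */
    static Map<String, String> doc(String id, String name) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("name", name);
        return d;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(doc("1", "old.txt"));
        index.add(doc("2", "other.txt"));
        index.update("id", "1", doc("1", "new.txt"));
        System.out.println(index.size()); // prints 2
    }
}
```

Since the delete step removes every match, updating by a term that appears in many documents would collapse them all into the single replacement. Keying the update on a field that is unique per document avoids that.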

9 Queries

1) Using subclasses of Query

    // Core code
    @Test
    void query() throws IOException {
        FSDirectory directory = FSDirectory.open(new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath());
        DirectoryReader reader = DirectoryReader.open(directory);

        IndexSearcher indexSearcher = new IndexSearcher(reader);
        // Range query on the size field; both bounds are inclusive
        Query query = LongPoint.newRangeQuery("size", 10L, 100L);
        TopDocs topDocs = indexSearcher.search(query, 10);
        ScoreDoc[] docs = topDocs.scoreDocs;

        for (ScoreDoc doc : docs) {
            int docId = doc.doc;
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println(document.get("content"));
            System.out.println("----------------------------------");
        }
        reader.close();
    }

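2) Using QueryParser

QueryParser and ParseException are in the package org.apache.lucene.queryparser.classic, which ships in the separate lucene-queryparser artifact, so that artifact must be added as a dependency alongside lucene-core.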

    @Test
    void query() throws IOException, ParseException {
        FSDirectory directory = FSDirectory.open(new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath());
        DirectoryReader reader = DirectoryReader.open(directory);

        IndexSearcher indexSearcher = new IndexSearcher(reader);
        // Argument 1 is the default field name, argument 2 is the analyzer used to parse the query text
        QueryParser queryParser = new QueryParser("name", new IKAnalyzer());
        Query query = queryParser.parse("java开发");
        TopDocs topDocs = indexSearcher.search(query, 10);

        ScoreDoc[] docs = topDocs.scoreDocs;
        for (ScoreDoc doc : docs) {
            int docId = doc.doc;
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println(document.get("content"));
            System.out.println("----------------------------------");
        }
        reader.close();
    }
posted @ SylvesterZhang