2020-6-4-Lucene
概述、创建索引流程、查询流程、创建索引实例、搜索案例、查看分词器分词效果、索引维护、查询
1
概述
基于java开发的全文检索包
2
创建索引流程
1)获得文档
原始文档:基于那些数据进行搜索,那么这些数据就是原始文档
搜索引擎:使用爬虫获得原始文档
站内搜索:数据库中的数据
2)构建文档对象
对每个原始文档创建一个Document对象,每个Document对象中包含多个域,每个文档有一个唯一编号
3)分析文档
根据空格进行拆分,单词统一小写,去除标段符号,去除停用词。每个关键词都封装成一个Term,包含域和关键词,不同域中拆分出来的相同关键词,是不同的Term
4)创建索引
通过词语找文档,这种结构称为倒排索引
3
查询流程
(1)用户查询接口
(2把关键词封装成一个查询对象
(3)执行查询
(4)渲染结果
4
创建索引实例
1)导包
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>8.5.2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>
2)创建
package com.zhanghuan.loginapi;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import java.io.File;
import java.io.IOException;
class LoginapiApplicationTests {
@Test
void contextLoads() throws IOException {
//创建索引,并获得索引写入器
FSDirectory dictionary=FSDirectory.open (new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
IndexWriter indexWriter=new IndexWriter (dictionary,new IndexWriterConfig ());//这里可将分词器对象传入
//遍历文件
File file=new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\search_source");
for (File f:
file.listFiles ()) {
//获取文件信息
String filename=f.getName ();
String filepath=f.getPath ();
String filecontent= FileUtils.readFileToString (f,"utf-8");
Long filesize=FileUtils.sizeOf (f);
//将文件信息存入字段中
Field name=new TextField ("name",filename,Field.Store.YES);
Field path=new StoredField ("path",filepath);
Field content=new TextField ("content",filecontent,Field.Store.YES);
Field size=new StoredField ("size",filesize);
Field size_num=new LongPoint ("size",filesize);
//创建文档,将字段放入文档
Document document=new Document ();
document.add (name);
document.add (path);
document.add (content);
document.add (size);
document.add (size_num);
//将文档写入索引里
indexWriter.addDocument (document);
}
//关闭索引写入器
indexWriter.close ();
}
}
5
搜索案例
package com.zhanghuan.loginapi;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import java.io.File;
import java.io.IOException;
class LoginapiApplicationTests {
@Test
void read() throws IOException {
//打开索引
FSDirectory dictionary=FSDirectory.open (new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
DirectoryReader indexReader= DirectoryReader.open (dictionary);
//搜索关键词
IndexSearcher indexSearcher=new IndexSearcher (indexReader);
Query query=new TermQuery (new Term ("content","java"));
TopDocs topDocs=indexSearcher.search (query,10);
ScoreDoc[] docs=topDocs.scoreDocs;
//打印搜索结果
System.out.println ("最大数量:"+topDocs.totalHits);
for (ScoreDoc doc:
docs) {
int docId=doc.doc;
Document document=indexSearcher.doc (docId);
System.out.println (document.get("name"));
System.out.println (document.get("path"));
System.out.println (document.get("size"));
System.out.println (document.get("content"));
System.out.println ("----------------------------------");
}
indexReader.close ();
}
}
6
查看分词器分词效果
1)普通分词器
//核心代码
@Test
void analyzer() throws IOException {
//创建一个分析器对象
StandardAnalyzer analyzer= new StandardAnalyzer ();
//获取Tokenstream
TokenStream tokenStream=analyzer.tokenStream ("","java sc a");
//向Tokenstream设置一个引用,相当于一个指针
CharTermAttribute charTermAttribute=tokenStream.addAttribute (CharTermAttribute.class);
//重置
tokenStream.reset ();
//循环遍历
while (tokenStream.incrementToken ()){
System.out.println (charTermAttribute.toString ());
}
/*
java
sc
a
*/
tokenStream.close ();
}
对于中文,会每个中文算一个词
2)IKAnalyzer分词器
<dependency>
<groupId>com.jianggujin</groupId>
<artifactId>IKAnalyzer-lucene</artifactId>
<version>8.0.0</version>
</dependency>
@Test
void analyzer() throws IOException {
//创建一个分析器对象
IKAnalyzer analyzer= new IKAnalyzer ();
//获取Tokenstream
TokenStream tokenStream=analyzer.tokenStream ("","java sc a 你好");
//向Tokenstream设置一个引用,相当于一个指针
CharTermAttribute charTermAttribute=tokenStream.addAttribute (CharTermAttribute.class);
//重置
tokenStream.reset ();
//循环遍历
while (tokenStream.incrementToken ()){
System.out.println (charTermAttribute.toString ());
}
tokenStream.close ();
}
使用前需要定义两个词典ext.dic和stopword.dic
7
不同域对比
Field类 | 支持类型 | 分析(分词) | 索引 | 存储 |
---|---|---|---|---|
StringField | 字符串 | N | Y | Y或N |
LongPoint | Long类型 | Y | Y | N |
StoredField | 重载方法,支持各种类型 | N | N | Y |
TextField | 字符串式流 | Y | Y | Y或N |
8
索引维护
1)新增
2)删除
//核心代码
@Test
void del() throws IOException {
FSDirectory dictionary=FSDirectory.open
(new File ("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
IndexWriter indexWriter=new IndexWriter (dictionary,new IndexWriterConfig (new IKAnalyzer ()));
indexWriter.deleteDocuments (new Term ("content","序列化"));//删除content包含数据库的文档
//indexWriter.deleteAll ();//删除所有
indexWriter.close ();
}
3)修改
删除后在新增
9
查询
1)使用Query的子类
//核心代码
@Test
void query() throws IOException {
FSDirectory dictionary=FSDirectory.open (new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
DirectoryReader reader=DirectoryReader.open (dictionary);
IndexSearcher indexSearcher=new IndexSearcher (reader);
Query query=LongPoint.newRangeQuery ("size",10,100);//范围查询
TopDocs topDocs=indexSearcher.search (query,10);
ScoreDoc[] docs=topDocs.scoreDocs;
for (ScoreDoc doc:
docs) {
int docId=doc.doc;
Document document=indexSearcher.doc (docId);
System.out.println (document.get("name"));
System.out.println (document.get("path"));
System.out.println (document.get("size"));
System.out.println (document.get("content"));
System.out.println ("----------------------------------");
}
}
2)使用QueryParser
@Test
void query() throws IOException, ParseException {
FSDirectory dictionary=FSDirectory.open (new File("C:\\Users\\admin\\IdeaProjects\\logina_pi\\src\\main\\resources\\static\\index").toPath ());
DirectoryReader reader=DirectoryReader.open (dictionary);
IndexSearcher indexSearcher=new IndexSearcher (reader);
QueryParser queryParser=new QueryParser ("name",new IKAnalyzer ());//参数1为域名,参数2为解析器对象
Query query=queryParser.parse ("java开发");
TopDocs topDocs=indexSearcher.search (query,10);
ScoreDoc[] docs=topDocs.scoreDocs;
for (ScoreDoc doc:
docs) {
int docId=doc.doc;
Document document=indexSearcher.doc (docId);
System.out.println (document.get("name"));
System.out.println (document.get("path"));
System.out.println (document.get("size"));
System.out.println (document.get("content"));
System.out.println ("----------------------------------");
}
}
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 分享一个免费、快速、无限量使用的满血 DeepSeek R1 模型,支持深度思考和联网搜索!
· 使用C#创建一个MCP客户端
· 基于 Docker 搭建 FRP 内网穿透开源项目(很简单哒)
· ollama系列1:轻松3步本地部署deepseek,普通电脑可用
· 按钮权限的设计及实现