ES custom analysis plugin
0. Data preparation
1. Create the index
curl -X PUT "localhost:9200/user" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "name": { "type": "keyword" },
      "age": { "type": "integer" }
    }
  }
}'
2. View index info
(base) xxx@58deMacBook-Pro business_scf_productservice % curl localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   user  uWw_V1ECRbSmZxyLF0TdBg   2   0          0            0       452b           452b
3. Insert data (a POST to /_doc/ without an ID lets ES auto-generate one)
curl -X POST -H 'Content-Type:application/json' -d '{"description":"this is a good boy","name":"zhangsan","age":20}' localhost:9200/user/_doc/
4. Query data
curl -X GET localhost:9200/user/_search
curl -X GET "localhost:9200/user/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "description": "good boy"
    }
  }
}'
1. When does ES run analysis (tokenization)?
1. Indexing
The text fields of an incoming document are analyzed synchronously, and the resulting tokens drive the subsequent storage (term dictionary, postings lists, etc.).
Set a breakpoint at org.apache.lucene.analysis.standard.StandardTokenizer#incrementToken and inspect the call chain: the call comes from org.apache.lucene.index.DefaultIndexingChain#processField.
2. Querying
The query input is analyzed synchronously as well; the resulting terms are then looked up in the term dictionary and the inverted (postings) lists.
Set the same breakpoint at org.apache.lucene.analysis.standard.StandardTokenizer#incrementToken: this time the call arrives from org.elasticsearch.index.search.MatchQueryParser#parse.
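Both call chains can also be observed without a debugger. As a quick check, the _termvectors API returns the tokens actually indexed for a document's field (computing them on the fly when term vectors are not stored), i.e. exactly what the write path produced; <doc_id> below is a placeholder for the auto-generated _id returned by the insert in step 0.3.
curl -X GET "localhost:9200/user/_termvectors/<doc_id>?fields=description&pretty"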
2. What does analysis return?
The result is a stream of token objects, each carrying the term text, its start/end offsets, a type, and a position; ES uses these to build the inverted index.
1. Testing with curl
curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
"analyzer": "standard",
"text": "Hello, world! This is a test. 123@example.com! 我是中国人~"
}
'
--- Result
{
"tokens": [{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
}, {
"token": "world",
"start_offset": 7,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
}, {
"token": "this",
"start_offset": 14,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
}, {
"token": "is",
"start_offset": 19,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
}, {
"token": "a",
"start_offset": 22,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
}, {
"token": "test",
"start_offset": 24,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 5
}, {
"token": "123",
"start_offset": 30,
"end_offset": 33,
"type": "<NUM>",
"position": 6
}, {
"token": "example.com",
"start_offset": 34,
"end_offset": 45,
"type": "<ALPHANUM>",
"position": 7
}, {
"token": "我",
"start_offset": 47,
"end_offset": 48,
"type": "<IDEOGRAPHIC>",
"position": 8
}, {
"token": "是",
"start_offset": 48,
"end_offset": 49,
"type": "<IDEOGRAPHIC>",
"position": 9
}, {
"token": "中",
"start_offset": 49,
"end_offset": 50,
"type": "<IDEOGRAPHIC>",
"position": 10
}, {
"token": "国",
"start_offset": 50,
"end_offset": 51,
"type": "<IDEOGRAPHIC>",
"position": 11
}, {
"token": "人",
"start_offset": 51,
"end_offset": 52,
"type": "<IDEOGRAPHIC>",
"position": 12
}]
}
2. Testing in code
package qz.es;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.StringReader;

public class AnalyzerTest {

    public static void main(String[] args) throws Exception {
        String text = "Hello, world! This is a test. 123@example.com! 我是中国人~";
        // Wrap the input text in a StringReader
        StringReader reader = new StringReader(text);
        // Create the StandardTokenizer
        StandardTokenizer tokenizer = new StandardTokenizer();
        // CharTermAttribute exposes the term text of the current token
        CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
        // Run the tokenization loop
        tokenizer.setReader(reader);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            String token = termAtt.toString();
            System.out.println(token);
        }
        tokenizer.end();
        tokenizer.close();
        reader.close();
    }
}
--- Result
Hello
world
This
is
a
test
123
example.com
我
是
中
国
人
Inspecting the full attribute set of a single returned token (a sketch follows):
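A minimal sketch extending the test above with the remaining standard Lucene attribute classes (OffsetAttribute, TypeAttribute, PositionIncrementAttribute); the class name TokenAttributeTest is just an illustrative choice. It prints the same fields the _analyze API returned:
package qz.es;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import java.io.StringReader;

public class TokenAttributeTest {

    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer = new StandardTokenizer();
        // Each addAttribute() call registers the attribute and returns the shared instance
        CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);
        TypeAttribute typeAtt = tokenizer.addAttribute(TypeAttribute.class);
        PositionIncrementAttribute posAtt = tokenizer.addAttribute(PositionIncrementAttribute.class);

        tokenizer.setReader(new StringReader("Hello, world! 我是中国人"));
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Mirrors token / start_offset / end_offset / type / position from _analyze
            System.out.printf("token=%s start=%d end=%d type=%s posInc=%d%n",
                    termAtt.toString(), offsetAtt.startOffset(), offsetAtt.endOffset(),
                    typeAtt.type(), posAtt.getPositionIncrement());
        }
        tokenizer.end();
        tokenizer.close();
    }
}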
3. Writing your own analysis plugin
Standalone, outside ES, only two components are needed: Analyzer and Tokenizer. Analyzer is the broader concept and covers the whole analysis pipeline, including but not limited to tokenization (e.g. token filtering and other refinement steps); Tokenizer is responsible only for the splitting step itself. A minimal standalone sketch follows.
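A minimal sketch of wiring the two together, assuming Lucene 7+ (where LowerCaseFilter lives in org.apache.lucene.analysis); MyAnalyzer is a hypothetical name:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // The Tokenizer does the raw splitting...
        Tokenizer source = new StandardTokenizer();
        // ...and the Analyzer chains token filters on top of it
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
Any text analyzed through this Analyzer is first split by the Tokenizer and then lower-cased by the filter chain, which is exactly the Analyzer-vs-Tokenizer split described above.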
Integrating with ES takes two components: AbstractIndexAnalyzerProvider and AnalysisPlugin (in testing it also works without an AbstractTokenizerFactory); a sketch follows the reference link.
// Reference project: https://gitee.com/Qiao-Zhi/custom_analyzer_es_plugin
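A hedged sketch of the ES integration, assuming an ES 7.x-era plugin API; the provider and constructor signatures differ between ES versions, so see the reference project above for the exact variant. MyAnalysisPlugin and MyAnalyzerProvider are hypothetical names, and MyAnalyzer is the standalone sketch from earlier in this section:
import java.util.Collections;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;

public class MyAnalysisPlugin extends Plugin implements AnalysisPlugin {

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        // "my_analyzer" becomes usable in index settings and mappings
        return Collections.singletonMap("my_analyzer", MyAnalyzerProvider::new);
    }

    public static class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {

        private final MyAnalyzer analyzer = new MyAnalyzer();

        // Matches AnalysisProvider#get(IndexSettings, Environment, String, Settings)
        public MyAnalyzerProvider(IndexSettings indexSettings, Environment env,
                                  String name, Settings settings) {
            super(indexSettings, name, settings);
        }

        @Override
        public MyAnalyzer get() {
            return analyzer;
        }
    }
}
A packaged plugin additionally needs a plugin-descriptor.properties file before bin/elasticsearch-plugin install will accept it.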
4. How can your plugin delegate to another analyzer (extending tokenization)?
1. org.apache.lucene.analysis.Tokenizer#reset is where the text to be analyzed becomes available: drain the Reader, call processAnalyzer(sb.toString()), and cache the result on the object (a sketch of the delegate call follows the snippet). Note that Lucene's own reset() takes no parameters and reads from the protected input field; the variant below receives the Reader explicitly from the surrounding adapter.
private BufferedReader reader;

public void reset(Reader input) throws IOException {
    if (BufferedReader.class.isAssignableFrom(input.getClass())) {
        reader = (BufferedReader) input;
    } else {
        reader = new BufferedReader(input);
    }
    // Drain the Reader into a single string
    CharBuffer buffer = CharBuffer.allocate(256);
    StringBuilder sb = new StringBuilder();
    while (reader.read(buffer) != -1) {
        sb.append(buffer.flip());
        buffer.clear();
    }
    // Clear the previous result, then analyze the buffered string;
    // processAnalyzer() repopulates the cached terms
    terms = null;
    processAnalyzer(sb.toString());
}
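How processAnalyzer delegates is up to you. A minimal sketch, assuming the delegate is an ordinary Lucene Analyzer (StandardAnalyzer stands in for whichever analyzer you actually wrap) and that CustomTerm is a simple (word, offset) holder; imports needed: java.util.List, java.util.ArrayList, org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.standard.StandardAnalyzer, and the tokenattributes classes:
private void processAnalyzer(String text) throws IOException {
    List<CustomTerm> result = new ArrayList<>();
    // Run the delegate analyzer over the buffered text and collect its tokens
    try (Analyzer delegate = new StandardAnalyzer();
         TokenStream ts = delegate.tokenStream("", text)) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            result.add(new CustomTerm(termAtt.toString(), offsetAtt.startOffset()));
        }
        ts.end();
    }
    // Cache for replay; nextTerm() in the next snippet hands these out one by one
    terms = result;
}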
2. org.apache.lucene.analysis.TokenStream#incrementToken replays the cached result, one token per call:
@Override
public final boolean incrementToken() throws IOException {
    // Pull the next cached term produced by processAnalyzer()
    CustomTerm customTerm = tokenizerAdapter.nextTerm();
    if (customTerm == null) {
        return false; // stream exhausted
    }
    String word = customTerm.word;
    int startOffset = customTerm.offset;
    int endOffset = startOffset + word.length();
    termAtt.setEmpty().append(word);
    // correctOffset() maps offsets back through any CharFilters in front
    offsetAtt.setOffset(correctOffset(startOffset), correctOffset(endOffset));
    return true;
}
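For this snippet to compile, the two attributes must be declared on the Tokenizer itself; addAttribute() (a real Lucene method) both registers the attribute with the stream and returns the shared instance:
// Fields assumed by incrementToken() above
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);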