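Elasticsearch: Search in Depth
Term-Based and Full-Text Search
Term-based queries (for performance, they can usually be configured to skip scoring)
Term query examples
Data preparation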
DELETE products PUT products { "settings": { "number_of_shards": 1 } } POST /products/_bulk { "index": { "_id": 1 }} { "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" } { "index": { "_id": 2 }} { "productID" : "KDKE-B-9947-#kL5","desc":"iPad" } { "index": { "_id": 3 }} { "productID" : "JODL-X-1937-#pV7","desc":"MBP" } GET /products
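Run the query once with each value, uncommenting one line at a time
POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone"
        "value": "iphone"
      }
    }
  }
}
The data was indexed with mixed case ("iPhone"), yet a term query for "iPhone" finds nothing. The standard analyzer lowercases text at index time, while a term query does not analyze its input, so only the lowercase value "iphone" matches.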
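The mismatch can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not the real analyzer: `standard_analyze`, `build_inverted_index` and `term_query` are hypothetical helpers that only imitate "split and lowercase at index time, look up verbatim at query time".

```python
import re

def standard_analyze(text):
    """Rough stand-in for the standard analyzer: split into words, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(docs, field):
    """Map each analyzed token to the set of document ids that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for token in standard_analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index

def term_query(index, value):
    """A term query looks the value up verbatim; the input is not analyzed."""
    return sorted(index.get(value, set()))

docs = {1: {"desc": "iPhone"}, 2: {"desc": "iPad"}, 3: {"desc": "MBP"}}
desc_index = build_inverted_index(docs, "desc")

upper_hits = term_query(desc_index, "iPhone")  # the index only holds "iphone"
lower_hits = term_query(desc_index, "iphone")
print(upper_hits, lower_hits)
```

Only the lowercase lookup finds document 1, which mirrors the behavior of the two commented values in the query above.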
Multi-field mapping and term queries
Use the keyword sub-field for exact matching
POST /products/_search { "explain": true, "query": { "term": { "productID.keyword": { "value": "XHDK-A-1293-#fJ3" } } } }
The explain output shows that a score is still calculated by default.
Use the compound constant_score query to convert the query into a filter
POST /products/_search { "explain": true, "query": { "constant_score": { "filter": { "term": { "productID.keyword": "XHDK-A-1293-#fJ3" } } } } }
Full-text queries
Minimum match degree
Phrase queries
How a match query is executed
Related reading
Structured search
Structured data
Structured search in Elasticsearch
Data preparation
# Structured search, exact matching
DELETE products
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }
GET products/_mapping
Boolean values
# term query on a boolean value: a score is calculated
POST products/_search
{ "profile": "true", "explain": true, "query": { "term": { "avaliable": true } } }
# The same term query wrapped in constant_score runs as a filter: no scoring
POST products/_search
{ "profile": "true", "explain": true, "query": { "constant_score": { "filter": { "term": { "avaliable": true } } } } }
Numbers
# Numeric type
POST products/_search
{ "profile": "true", "explain": true, "query": { "term": { "price": 30 } } }
# Numeric type, matching any value in a set
POST products/_search
{ "query": { "constant_score": { "filter": { "terms": { "price": [ "10", "30" ] } } } } }
Numeric ranges
GET products/_search { "query": { "constant_score": { "filter": { "range": { "price": { "gte": 20, "lte": 30 } } } } } }
Date ranges
Find documents whose date falls within the last three years (date >= now-3y)
POST products/_search { "query": { "constant_score": { "filter": { "range": { "date": { "gte": "now-3y" } } } } } }
Handling null values with the exists query: some documents have no date field
POST products/_search { "query": { "constant_score": { "filter": { "exists": { "field": "date" } } } } }
Find documents that have no date field
POST products/_search { "query": { "constant_score": { "filter": { "bool": { "must_not": { "exists": { "field": "date" } } } } } } }
Handling multi-valued fields: a term query tests containment, not equality
Related reading
- https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html
- https://www.elastic.co/guide/en/elasticsearch/reference/7.1/term-level-queries.html
Relevance scoring in search
Term frequency (TF)
Viewing TF with the explain API
Data preparation
PUT testscore { "settings": { "number_of_shards": 1 }, "mappings": { "properties": { "content": { "type": "text" } } } } PUT testscore/_bulk { "index": { "_id": 1 }} { "content":"we use Elasticsearch to power the search" } { "index": { "_id": 2 }} { "content":"we like elasticsearch" } { "index": { "_id": 3 }} { "content":"The scoring of documents is caculated by the scoring formula" } { "index": { "_id": 4 }} { "content":"you know, for search" }
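Querying Elasticsearch
POST /testscore/_search
{
  //"explain": true,
  "query": {
    "match": {
      //"content": "you"
      "content": "elasticsearch"
      //"content": "the"
      //"content": "the elasticsearch"
    }
  }
}
For the query "elasticsearch", document 2 ranks ahead of document 1: the term occurs once in each, but document 2 is shorter, so it receives the higher score.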
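The effect of document length can be checked with a few lines of arithmetic. This is a sketch of the textbook BM25 formula with the usual defaults `k1 = 1.2` and `b = 0.75`; it leaves out the constant boost factor that the explain output multiplies in, which does not change the ordering. The function name `bm25_term_score` is made up for illustration, and the lengths are simply the word counts of the four testscore documents.

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents enlarge the denominator and lower the score.
    tf = freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

# Word counts of the four documents in the testscore index.
doc_lengths = {1: 7, 2: 3, 3: 10, 4: 4}
avg_len = sum(doc_lengths.values()) / len(doc_lengths)

# "elasticsearch" occurs once in document 1 and once in document 2.
score_doc1 = bm25_term_score(1, doc_lengths[1], avg_len, 4, 2)
score_doc2 = bm25_term_score(1, doc_lengths[2], avg_len, 4, 2)
print(score_doc1, score_doc2)
```

With equal term frequency and equal IDF, the only difference between the two scores is the length normalization, which favors the three-word document.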
boosting relevance
POST testscore/_search { "query": { "boosting": { "positive": { "term": { "content": "elasticsearch" } }, "negative": { "term": { "content": "like" } }, "negative_boost": 0.2 } } }
Query & Filtering and multi-string, multi-field queries
Combining conditions
The bool query
bool query syntax
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }
# Basic syntax
POST /products/_search
{
  "query": {
    "bool": {
      "must": { "term": { "price": "30" } },
      "filter": { "term": { "avaliable": "true" } },
      "must_not": { "range": { "price": { "lte": 10 } } },
      "should": [
        { "term": { "productID.keyword": "JODL-X-1937-#pV7" } },
        { "term": { "productID.keyword": "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match": 1
    }
  }
}
How to handle the fact that a term query on structured data tests containment rather than equality
Add a genre_count field and combine the conditions with bool
POST /newmovies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy","genre_count":1 }
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"],"genre_count":2 }
# must: the clauses contribute to the score
POST /newmovies/_search
{ "query": { "bool": { "must": [ {"term": {"genre.keyword": {"value": "Comedy"}}}, {"term": {"genre_count": {"value": 1}}} ] } } }
# filter: the clauses do not contribute to the score, so the resulting _score is 0
POST /newmovies/_search
{ "query": { "bool": { "filter": [ {"term": {"genre.keyword": {"value": "Comedy"}}}, {"term": {"genre_count": {"value": 1}}} ] } } }
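The difference between "contains" and "equals" is easy to see on plain data. The sketch below models a term query on a multi-valued field as list membership; `term_matches` is a hypothetical helper, and the two movies are reduced to the fields used by the queries above.

```python
def term_matches(field_value, term):
    """Containment test: a multi-valued field matches if any of its values equals the term."""
    values = field_value if isinstance(field_value, list) else [field_value]
    return term in values

movies = {
    1: {"genre": "Comedy", "genre_count": 1},
    2: {"genre": ["Comedy", "Romance"], "genre_count": 2},
}

# A term query alone: every movie whose genre list contains "Comedy".
contains_comedy = [doc_id for doc_id, m in movies.items()
                   if term_matches(m["genre"], "Comedy")]

# Adding the genre_count condition narrows this to movies that are only "Comedy".
only_comedy = [doc_id for doc_id, m in movies.items()
               if term_matches(m["genre"], "Comedy") and m["genre_count"] == 1]

print(contains_comedy, only_comedy)
```

The extra counter field is what turns the containment test into an equality test; the bool query simply requires both conditions at once.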
# Filtering context
POST _search
{ "query": { "bool": { "filter": { "term": { "avaliable": "true" } }, "must_not": { "range": { "price": { "lte": 10 } } } } } }
# Query context
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }
POST /products/_search
{ "query": { "bool": { "should": [ { "term": { "productID.keyword": { "value": "JODL-X-1937-#pV7" } } }, { "term": { "avaliable": { "value": true } } } ] } } }
Nested bool queries
# Nesting a bool query inside should implements "should not" logic
POST /products/_search
{
  "query": {
    "bool": {
      "must": { "term": { "price": "30" } },
      "should": [
        { "bool": { "must_not": { "term": { "avaliable": "false" } } } }
      ],
      "minimum_should_match": 1
    }
  }
}
Controlling field boosting
Here, matches in title are given a higher boost so that they score higher and are shown first.
DELETE blogs POST /blogs/_bulk { "index": { "_id": 1 }} {"title":"Apple iPad", "content":"Apple iPad,Apple iPad" } { "index": { "_id": 2 }} {"title":"Apple iPad,Apple iPad", "content":"Apple iPad" } POST blogs/_search { "query": { "bool": { "should": [ {"match": { "title": { "query": "apple,ipad", "boost": 4 } }}, {"match": { "content": { "query": "apple,ipad", "boost":1 } }} ] } } }
Find documents that contain apple (favoring recall: return as many as possible)
DELETE news POST /news/_bulk { "index": { "_id": 1 }} { "content":"Apple Mac" } { "index": { "_id": 2 }} { "content":"Apple iPad" } { "index": { "_id": 3 }} { "content":"Apple employee like Apple Pie and Apple Juice" } POST news/_search { "query": { "bool": { "must": { "match":{"content":"apple"} } } } }
Find documents that contain apple but not pie (favoring precision)
POST news/_search { "query": { "bool": { "must": { "match": { "content": "apple" } }, "must_not": { "match": { "content": "pie" } } } } }
Show documents that contain apple first, and rank those that also contain pie lower
POST news/_search { "query": { "boosting": { "positive": { "match": { "content": "apple" } }, "negative": { "match": { "content": "pie" } }, "negative_boost": 0.5 } } }
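How negative_boost demotes a document without excluding it can be sketched in a few lines. The positive scores below are invented for illustration (they are not taken from a real response), and `boosting_score` is a hypothetical helper: a document that also matches the negative query has its positive score multiplied by negative_boost.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Demote, rather than exclude, documents that match the negative query."""
    return positive_score * negative_boost if matches_negative else positive_score

# Invented scores of the match on "apple" for the three news documents.
positive_scores = {1: 0.2, 2: 0.2, 3: 0.3}
contains_pie = {1: False, 2: False, 3: True}

final_scores = {doc_id: boosting_score(score, contains_pie[doc_id], 0.5)
                for doc_id, score in positive_scores.items()}

plain_ranking = sorted(positive_scores, key=lambda d: -positive_scores[d])
boosted_ranking = sorted(final_scores, key=lambda d: -final_scores[d])
print(plain_ranking, boosted_ranking)
```

Unlike must_not, the document about apple pie stays in the result set; it only moves to the end of the list.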
Related reading
Single-string, multi-field queries: Dis Max Query
Single-string queries
A single-string query example
PUT /blogs/_doc/1 { "title": "Quick brown rabbits", "body": "Brown rabbits are commonly seen." } PUT /blogs/_doc/2 { "title": "Keeping pets healthy", "body": "My quick brown fox eats rabbits on a regular basis." } POST /blogs/_search { "query": { "bool": { "should": [ { "match": { "title": "Brown fox" } }, { "match": { "body": "Brown fox" } } ] } } }
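Document 2 contains every query term, yet it scores lower than document 1. The bool should query adds the title score and the body score together, so a document that matches one term in both fields can outrank a document that matches all terms in a single field.
How the score is computed
The Dis Max Query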
POST blogs/_search { "query": { "dis_max": { "queries": [ { "match": { "title": "Quick pets" } }, { "match": { "body": "Quick pets" } } ] } } }
Tuning best-field matching
Adjust the result with the tie_breaker parameter
POST blogs/_search { "query": { "dis_max": { "queries": [ { "match": { "title": "Quick pets" } }, { "match": { "body": "Quick pets" } } ], "tie_breaker": 0.2 } } }
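The two ways of combining per-field scores can be compared numerically. This is a sketch with invented per-field scores, not values from a real response: `bool_should_score` sums the matching clauses, while `dis_max_score` takes the best clause and adds tie_breaker times the scores of the remaining matching clauses.

```python
def bool_should_score(clause_scores):
    """bool/should: the scores of all matching clauses are added together."""
    return sum(clause_scores)

def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max: best clause score plus tie_breaker times the other matching scores."""
    matched = [s for s in clause_scores if s > 0]
    if not matched:
        return 0.0
    best = max(matched)
    return best + tie_breaker * (sum(matched) - best)

# Invented (title, body) scores for the query "Brown fox".
doc1 = [0.6, 0.5]   # "brown" matches weakly in both fields
doc2 = [0.0, 1.0]   # both terms match in body only

summed = (bool_should_score(doc1), bool_should_score(doc2))
best_field = (dis_max_score(doc1), dis_max_score(doc2))
with_tie_breaker = (dis_max_score(doc1, 0.2), dis_max_score(doc2, 0.2))
print(summed, best_field, with_tie_breaker)
```

With a tie_breaker of 0 only the best field counts; a value of 1 would reduce dis_max to the plain sum, so small values such as 0.2 keep the best field dominant while still rewarding matches in other fields.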
Related reading
Multi Match Query
POST blogs/_search { "query": { "dis_max": { "queries": [ { "match": { "title": "Quick pets" } }, { "match": { "body": "Quick pets" } } ], "tie_breaker": 0.2 } } } POST blogs/_search { "query": { "multi_match": { "type": "best_fields", "query": "Quick pets", "fields": [ "title", "body" ], "tie_breaker": 0.2, "minimum_should_match": "20%" } } }
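In the following example the title field uses the english analyzer, which stems terms, so the match query effectively searches for bark OR dog. Both documents match, and document 1 scores higher because it is shorter.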
PUT /titles { "mappings": { "properties": { "title": { "type": "text", "analyzer": "english" } } } } POST titles/_bulk { "index": { "_id": 1 }} { "title": "My dog barks" } { "index": { "_id": 2 }} { "title": "I see a lot of barking dogs on the road " } GET titles/_search { "query": { "match": { "title": "barking dogs" } } }
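Raise a field's weight to control the order of results (title.std is the standard-analyzed sub-field defined in the mapping of this section)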
GET /titles/_search { "query": { "multi_match": { "query": "barking dogs", "type": "most_fields", "fields": [ "title^10", "title.std" ] } } }
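Solving the problem with most_fields matching
The english analyzer stems terms so that as many documents as possible match; the standard analyzer does no stemming, so the title.std sub-field rewards documents that contain the exact word forms.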
DELETE /titles PUT /titles { "mappings": { "properties": { "title": { "type": "text", "analyzer": "english", "fields": { "std": { "type": "text", "analyzer": "standard" } } } } } } POST titles/_bulk { "index": { "_id": 1 }} { "title": "My dog barks" } { "index": { "_id": 2 }} { "title": "I see a lot of barking dogs on the road " } GET /titles/_search { "query": { "multi_match": { "query": "barking dogs", "type": "most_fields", "fields": [ "title", "title.std" ] } } }
Cross-field search
PUT address/_doc/1 { "street": "5 Poland Street", "city": "London", "country": "United Kingdom", "postcode": "W1V 3DG" } POST address/_search { "query": { "multi_match": { "query": "Poland Street W1V", "type": "most_fields", "operator": "and", "fields": [ "street", "city", "country", "postcode" ] } } }
Using cross_fields
PUT address/_doc/1 { "street": "5 Poland Street", "city": "London", "country": "United Kingdom", "postcode": "W1V 3DG" } POST address/_search { "query": { "multi_match": { "query": "Poland Street W1V", "type": "cross_fields", "operator": "and", "fields": [ "street", "city", "country", "postcode" ] } } }
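Why the same request finds the address with cross_fields but not with most_fields comes down to where the and operator is applied. The following sketch is a simplification that ignores scoring: `most_fields_and` requires all terms inside one field (the operator is applied per field), while `cross_fields_and` requires each term to appear in at least one of the fields. Both helpers and the whitespace `analyze` function are hypothetical.

```python
def analyze(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return text.lower().split()

def most_fields_and(doc, query, fields):
    """Field-centric: some single field must contain every query term."""
    terms = analyze(query)
    return any(all(t in analyze(doc[f]) for t in terms) for f in fields)

def cross_fields_and(doc, query, fields):
    """Term-centric: every query term must appear in at least one field."""
    terms = analyze(query)
    return all(any(t in analyze(doc[f]) for f in fields) for t in terms)

address = {
    "street": "5 Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V 3DG",
}
fields = ["street", "city", "country", "postcode"]

found_most_fields = most_fields_and(address, "Poland Street W1V", fields)
found_cross_fields = cross_fields_and(address, "Poland Street W1V", fields)
print(found_most_fields, found_cross_fields)
```

No single field holds "poland", "street" and "w1v" together, so the field-centric check fails, while the term-centric check succeeds because street and postcode jointly cover all three terms.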
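Multilingual search and Chinese word segmentation
Natural language and query recall
Challenges of mixing multiple languages
Challenges of word segmentation
The state of Chinese word segmentation
Some Chinese analyzers
The HanLP analyzer
A natural language processing toolkit built for production environments
Website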
- https://www.hanlp.com/
Installation
./elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.1.0/elasticsearch-analysis-hanlp-7.1.0.zip
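hanlp: default HanLP segmentation
hanlp_standard: standard segmentation
hanlp_index: index segmentation
hanlp_nlp: NLP segmentation
hanlp_n_short: N-shortest-path segmentation
hanlp_dijkstra: shortest-path segmentation
hanlp_crf: CRF segmentation (deprecated since HanLP 1.6.6)
hanlp_speed: high-speed dictionary segmentation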
POST _analyze { "analyzer": "hanlp_standard", "text": [ "剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜" ] }
The IK analyzer
Supports hot updates of the dictionary
Installation
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
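ik_max_word: finest-grained segmentation (splits the text into as many terms as possible)
ik_smart: coarsest-grained segmentation
The pinyin analyzer
Installation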
./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.1.0/elasticsearch-analysis-pinyin-7.1.0.zip
A simple example
PUT /artists/ { "settings": { "analysis": { "analyzer": { "user_name_analyzer": { "tokenizer": "whitespace", "filter": "pinyin_first_letter_and_full_pinyin_filter" } }, "filter": { "pinyin_first_letter_and_full_pinyin_filter": { "type": "pinyin", "keep_first_letter": true, "keep_full_pinyin": false, "keep_none_chinese": true, "keep_original": false, "limit_first_letter_length": 16, "lowercase": true, "trim_whitespace": true, "keep_none_chinese_in_first_letter": true } } } } } GET /artists/_analyze { "text": [ "刘德华 张学友 郭富城 黎明 四大天王" ], "analyzer": "user_name_analyzer" }
Response
{ "tokens" : [ { "token" : "ldh", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "zxy", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 1 }, { "token" : "gfc", "start_offset" : 8, "end_offset" : 11, "type" : "word", "position" : 2 }, { "token" : "lm", "start_offset" : 12, "end_offset" : 14, "type" : "word", "position" : 3 }, { "token" : "sdtw", "start_offset" : 15, "end_offset" : 19, "type" : "word", "position" : 4 } ] }
A full-text search example
{ "_source": ["title","overview"], "size":20, "query": { "multi_match": { "type": "most_fields", "query": "basketball with cartoon aliens", "fields": ["title","overview"] } }, "highlight" : { "fields" : { "overview" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }, "title" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] } } } }
Querying with Search Templates and Index Aliases
Search Template
The search engineer creates a search template
POST _scripts/tmdb { "script": { "lang": "mustache", "source": { "_source": [ "title", "overview" ], "size": 20, "query": { "multi_match": { "query": "{{q}}", "fields": [ "title", "overview" ] } } } } }
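Developers query through the template (when the search engineer changes the template, the developers' code does not need to change)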
GET _scripts/tmdb POST tmdb/_search/template { "id":"tmdb", "params": { "q": "basketball with cartoon aliens" } }
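The division of labor works because the application only supplies parameters. The sketch below imitates the substitution step in plain Python; `render` is a hypothetical helper that handles only simple string placeholders such as `{{q}}` and is not a Mustache implementation.

```python
import json

def render(template, params):
    """Replace {{name}} placeholders with JSON-escaped string parameters."""
    text = json.dumps(template)
    for name, value in params.items():
        escaped = json.dumps(value)[1:-1]  # escape the value, drop the outer quotes
        text = text.replace("{{" + name + "}}", escaped)
    return json.loads(text)

# Same structure as the "source" of the tmdb template above.
tmdb_template = {
    "_source": ["title", "overview"],
    "size": 20,
    "query": {"multi_match": {"query": "{{q}}", "fields": ["title", "overview"]}},
}

rendered = render(tmdb_template, {"q": "basketball with cartoon aliens"})
print(json.dumps(rendered, indent=2))
```

If the search engineer later changes the fields or the query type in the stored template, the caller still passes the same single parameter `q`.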
Using index aliases for zero-downtime operations
Create an alias for an index, then read and write data through the alias
PUT movies-2019/_doc/1 { "name": "the matrix", "rating": 5 } PUT movies-2019/_doc/2 { "name": "Speed", "rating": 3 } POST _aliases { "actions": [ { "add": { "index": "movies-2019", "alias": "movies-latest" } } ] } POST movies-latest/_search { "query": { "match_all": {} } }
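The search returns both documents.
Because clients only know the alias, it can be changed without touching them, which is what enables zero-downtime operations: to repoint an alias, remove it from the old index and add it to the new one in a single _aliases request. An alias can also carry a filter, as in the following example, which only exposes documents with a rating of 4 or higher.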
POST _aliases { "actions": [ { "add": { "index": "movies-2019", "alias": "movies-lastest-highrate", "filter": { "range": { "rating": { "gte": 4 } } } } } ] } POST movies-lastest-highrate/_search { "query": { "match_all": {} } }
Combined ranking: tuning scores with the Function Score Query
Scoring and sorting
field_value_factor (use the value of a specified field to influence the score)
DELETE blogs PUT /blogs/_doc/1 { "title": "About popularity", "content": "In this post we will talk about...", "votes": 0 } PUT /blogs/_doc/2 { "title": "About popularity", "content": "In this post we will talk about...", "votes": 100 } PUT /blogs/_doc/3 { "title": "About popularity", "content": "In this post we will talk about...", "votes": 1000000 } POST /blogs/_search { "query": { "function_score": { "query": { "multi_match": { "query": "popularity", "fields": [ "title", "content" ] } }, "field_value_factor": { "field": "votes" } } } }
The response is as follows
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 3, "relation" : "eq" }, "max_score" : 133531.39, "hits" : [ { "_index" : "blogs", "_type" : "_doc", "_id" : "3", "_score" : 133531.39, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 1000000 } }, { "_index" : "blogs", "_type" : "_doc", "_id" : "2", "_score" : 13.353139, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 100 } }, { "_index" : "blogs", "_type" : "_doc", "_id" : "1", "_score" : 0.0, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 0 } } ] } }
Smoothing the curve with modifier
Without a modifier, the vote count has far too large an effect on the score.
POST /blogs/_search { "query": { "function_score": { "query": { "multi_match": { "query": "popularity", "fields": [ "title", "content" ] } }, "field_value_factor": { "field": "votes", "modifier": "log1p" } } } }
Output
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 3, "relation" : "eq" }, "max_score" : 0.8011884, "hits" : [ { "_index" : "blogs", "_type" : "_doc", "_id" : "3", "_score" : 0.8011884, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 1000000 } }, { "_index" : "blogs", "_type" : "_doc", "_id" : "2", "_score" : 0.26763982, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 100 } }, { "_index" : "blogs", "_type" : "_doc", "_id" : "1", "_score" : 0.0, "_source" : { "title" : "About popularity", "content" : "In this post we will talk about...", "votes" : 0 } } ] } }
Introducing factor
POST /blogs/_search { "query": { "function_score": { "query": { "multi_match": { "query": "popularity", "fields": [ "title", "content" ] } }, "field_value_factor": { "field": "votes", "modifier": "log1p", "factor": 0.1 } } } }
boost_mode and max_boost (max_boost caps the value contributed by the function)
POST /blogs/_search { "query": { "function_score": { "query": { "multi_match": { "query": "popularity", "fields": [ "title", "content" ] } }, "field_value_factor": { "field": "votes", "modifier": "log1p", "factor": 0.1 }, "boost_mode": "sum", "max_boost": 3 } } }
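The scores in the responses above can be reproduced by hand. This sketch assumes that the query score of every blog post is 0.13353139 (derived from the first response: 13.353139 divided by 100 votes) and covers only the "none" and "log1p" modifiers and the "multiply" and "sum" boost modes; the function names are chosen for illustration.

```python
import math

def field_value_factor(value, factor=1.0, modifier="none"):
    """Apply factor first, then the modifier, to the field value."""
    scaled = factor * value
    if modifier == "none":
        return scaled
    if modifier == "log1p":
        return math.log10(1 + scaled)
    raise ValueError("modifier not covered by this sketch: " + modifier)

def function_score(query_score, function_value, boost_mode="multiply",
                   max_boost=float("inf")):
    """Combine the query score with the (capped) function value."""
    function_value = min(function_value, max_boost)
    if boost_mode == "multiply":
        return query_score * function_value
    if boost_mode == "sum":
        return query_score + function_value
    raise ValueError("boost_mode not covered by this sketch: " + boost_mode)

QUERY_SCORE = 0.13353139  # assumption: derived from the first response above

raw_100 = function_score(QUERY_SCORE, field_value_factor(100))
log_100 = function_score(QUERY_SCORE, field_value_factor(100, modifier="log1p"))
log_1m = function_score(QUERY_SCORE, field_value_factor(1000000, modifier="log1p"))
capped_1m = function_score(
    QUERY_SCORE,
    field_value_factor(1000000, factor=0.1, modifier="log1p"),
    boost_mode="sum",
    max_boost=3,
)
print(raw_100, log_100, log_1m, capped_1m)
```

The first three values agree with the _score fields shown in the two responses. In the last case the function value, about 5.0, is capped at 3 before it is added to the query score.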
Consistent random scoring
POST /blogs/_search { "query": { "function_score": { "random_score": { "seed": 911119 } } } }
The resulting order is 1, 3, 2
POST /blogs/_search { "query": { "function_score": { "random_score": { "seed": 100 } } } }
The resulting order is 1, 3, 2
Term & Phrase Suggester (suggestions)
What are search suggestions
Elasticsearch Suggester API
suggest_mode Missing Mode
DELETE articles POST articles/_bulk { "index" : { } } { "body": "lucene is very cool"} { "index" : { } } { "body": "Elasticsearch builds on top of lucene"} { "index" : { } } { "body": "Elasticsearch rocks"} { "index" : { } } { "body": "elastic is the company behind ELK stack"} { "index" : { } } { "body": "Elk stack rocks"} { "index" : {} } { "body": "elasticsearch is rock solid"} POST /articles/_search { "size": 1, "query": { "match": { "body": "lucen rock" } }, "suggest": { "term-suggestion": { "text": "lucen rock", "term": { "suggest_mode": "missing", "field": "body" } } } }
suggest_mode popular Mode
POST /articles/_search { "suggest": { "term-suggestion": { "text": "lucen rock", "term": { "suggest_mode": "popular", "field": "body" } } } }
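Sort by frequency & prefix_length (the number of leading characters that must match)
By default (prefix_length is 1), no suggestion is offered when the first character of the word is wrong.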
POST /articles/_search { "suggest": { "term-suggestion": { "text": "lucen hocks", "term": { "suggest_mode": "always", "field": "body", "prefix_length":0, "sort": "frequency" } } } }
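The role of prefix_length can be illustrated with a small edit-distance search. This is a simplified sketch rather than the suggester's actual algorithm: it uses plain Levenshtein distance, a hand-picked subset of terms with their document frequencies from the articles index, and a hypothetical `suggest` helper that sorts candidates by frequency.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, term_freqs, prefix_length=1, max_edits=2):
    """Return indexed terms close to `word`, most frequent first."""
    candidates = []
    for term, freq in term_freqs.items():
        if term == word:
            continue
        if term[:prefix_length] != word[:prefix_length]:
            continue  # the first prefix_length characters must match exactly
        if edit_distance(word, term) <= max_edits:
            candidates.append((term, freq))
    candidates.sort(key=lambda c: (-c[1], c[0]))
    return [term for term, _ in candidates]

# Subset of indexed terms with the number of articles that contain them.
term_freqs = {"lucene": 2, "rocks": 2, "rock": 1, "stack": 2}

for_lucen = suggest("lucen", term_freqs)
hocks_default = suggest("hocks", term_freqs)
hocks_no_prefix = suggest("hocks", term_freqs, prefix_length=0)
print(for_lucen, hocks_default, hocks_no_prefix)
```

With the default prefix length nothing is suggested for "hocks", because no indexed term starts with "h"; setting prefix_length to 0 lets "rocks" and "rock" through.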
Phrase suggester
POST /articles/_search { "suggest": { "my-suggestion": { "text": "lucne and elasticsear rock hello world ", "phrase": { "field": "body", "max_errors": 2, "confidence": 1, "direct_generator": [ { "field": "body", "suggest_mode": "always" } ], "highlight": { "pre_tag": "<em>", "post_tag": "</em>" } } } } }
If confidence is raised to 2, only one suggestion is returned.
Autocomplete and context-based search
Autocomplete
Steps for using completion
DELETE articles
# Define the mapping
PUT articles
{ "mappings": { "properties": { "title_completion": { "type": "completion" } } } }
# Index the data
POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}
# Run the suggester
POST articles/_search?pretty
{ "size": 0, "suggest": { "article-suggester": { "prefix": "elk ", "completion": { "field": "title_completion" } } } }
Context-based search
Implementing context-based search
Define the mapping
DELETE comments PUT comments PUT comments/_mapping { "properties": { "comment_autocomplete":{ "type": "completion", "contexts":[{ "type":"category", "name":"comment_category" }] } } }
Index the data
POST comments/_doc { "comment": "I love the star war movies", "comment_autocomplete": { "input": [ "star wars" ], "contexts": { "comment_category": "movies" } } } POST comments/_doc { "comment": "Where can I find a Starbucks", "comment_autocomplete": { "input": [ "starbucks" ], "contexts": { "comment_category": "coffee" } } }
Different suggestions for different contexts
POST comments/_search { "suggest": { "MY_SUGGESTION": { "prefix": "sta", "completion": { "field": "comment_autocomplete", "contexts": { "comment_category": "coffee" } } } } }
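Conceptually, the context acts as an extra filter on top of the prefix match. The sketch below shows the idea on an in-memory list; `complete` is a hypothetical helper and does not reflect how the completion suggester stores its data internally.

```python
def complete(entries, prefix, category):
    """Return inputs that start with the prefix and belong to the given category."""
    prefix = prefix.lower()
    return [entry["input"] for entry in entries
            if entry["input"].lower().startswith(prefix)
            and entry["category"] == category]

# The two suggestions indexed above, with their comment_category context.
entries = [
    {"input": "star wars", "category": "movies"},
    {"input": "starbucks", "category": "coffee"},
]

coffee_suggestions = complete(entries, "sta", "coffee")
movie_suggestions = complete(entries, "sta", "movies")
print(coffee_suggestions, movie_suggestions)
```

The same prefix "sta" yields a different completion depending on whether the user is browsing movies or coffee.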
Cross-cluster search
Pain points of horizontal scaling
Configuration and queries
Apply the following settings on every cluster
PUT /_cluster/settings { "persistent": { "cluster": { "remote": { "cluster0": { "seeds": [ "127.0.0.1:9300" ], "transport.ping_schedule": "30s" }, "cluster1": { "seeds": [ "127.0.0.1:9301" ], "transport.compress": true, "skip_unavailable": true }, "cluster2": { "seeds": [ "127.0.0.1:9302" ] } } } } }
Insert data
# Create test data
curl -XPOST "http://localhost:9200/users/_doc" -H 'Content-Type: application/json' -d '{"name":"user1","age":10}'
curl -XPOST "http://localhost:9201/users/_doc" -H 'Content-Type: application/json' -d '{"name":"user2","age":20}'
curl -XPOST "http://localhost:9202/users/_doc" -H 'Content-Type: application/json' -d '{"name":"user3","age":30}'
Query
GET /users,cluster1:users,cluster2:users/_search { "query": { "range": { "age": { "gte": 20, "lte": 40 } } } }