Elasticsearch Study Notes (Part 1)
1. Data preparation
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
{ "index": { "_id": 5 }}
{ "articleID" : "QQPX-R-3333-#aD8", "userID" : 3, "hidden": true, "postDate": "2021-01-02" }
2. View the mapping (the counterpart of SQL's show create table)
GET /forum/article/_mapping
3. Filter by a field value: select * from forum.article where hidden = ?
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "term": { "hidden": false } } } } }
4. Inspect how a value is analyzed (tokenized)
GET /forum/_analyze
{ "field": "articleID", "text": "XHDK-A-1293-#fJ3" }
5. The following search returns no results
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "term": { "articleID": "XHDK-A-1293-#fJ3" } } } } }
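Why:
(1) A term filter searches by exact value. Numbers, booleans, and dates support this naturally.
(2) A text field is analyzed at index time, so the index holds the individual tokens rather than the whole ID, and a term query does not analyze its input. To use a term query on a string, the field has to be indexed unanalyzed (not_analyzed on old string fields, or type keyword).
(3) A term filter is the counterpart of a single where condition in SQL.
To make the search work, the index can be rebuilt with an explicit mapping: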
DELETE /forum
PUT /forum
{ "mappings": { "article": { "properties": { "articleID": { "type": "keyword" } } } } }
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
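Alternatively, with the original dynamic mapping, append .keyword to the field name (dynamic mapping gives string fields a keyword sub-field):
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "term": { "articleID.keyword": "XHDK-A-1293-#fJ3" } } } } }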
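The failed term search can be reproduced without a cluster. The sketch below is only a rough stand-in for the standard analyzer (lowercase, then split on non-alphanumeric characters); rough_standard_analyze and term_matches are hypothetical helper names, not Elasticsearch APIs.

```python
import re

def rough_standard_analyze(text):
    """Rough approximation of the standard analyzer: lowercase, then split
    on anything that is not a letter or digit. (Assumption: close enough
    for this ID; the real analyzer follows Unicode segmentation rules.)"""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(indexed_terms, value):
    """A term query does not analyze its input: it looks up the exact term."""
    return value in indexed_terms

article_id = "XHDK-A-1293-#fJ3"

# text field: the inverted index stores the analyzed tokens
text_terms = rough_standard_analyze(article_id)
# keyword field: the inverted index stores the whole value as a single term
keyword_terms = [article_id]

print(text_terms)                                # ['xhdk', 'a', '1293', 'fj3']
print(term_matches(text_terms, article_id))      # False
print(term_matches(keyword_terms, article_id))   # True
```

The whole ID never appears among the tokens of the analyzed field, which is why the term filter finds nothing until the field is stored as a single keyword term.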
6. Complex where conditions => find posts whose post date is 2017-01-01 or whose article ID is XHDK-A-1293-#fJ3, and whose post date is definitely not 2017-01-02
select *
from forum.article
where (post_date='2017-01-01' or article_id='XHDK-A-1293-#fJ3')
and post_date!='2017-01-02'
GET /forum/article/_search
{
  "query": { "constant_score": { "filter": { "bool": {
    "should": [
      { "term": { "postDate": "2017-01-01" } },
      { "term": { "articleID": "XHDK-A-1293-#fJ3" } }
    ],
    "must_not": { "term": { "postDate": "2017-01-02" } }
  } } } }
}
must, should, must_not: every clause must match; matching any one of the clauses is enough; no clause may match
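To check how the bool filter maps onto the SQL where clause, here is a minimal in-memory sketch. The term and bool_filter helpers are hypothetical, the data is the first four sample docs, and should is treated as required only when there is no must clause (an assumption that covers the two queries in sections 6 and 7; filter-context defaults vary by Elasticsearch version).

```python
docs = {
    1: {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},
    2: {"articleID": "KDKE-B-9947-#kL5", "postDate": "2017-01-02"},
    3: {"articleID": "JODL-X-1937-#pV7", "postDate": "2017-01-01"},
    4: {"articleID": "QQPX-R-3956-#aD8", "postDate": "2017-01-02"},
}

def term(field, value):
    """Exact-value comparison, like a term filter on a keyword or date field."""
    return lambda doc: doc.get(field) == value

def bool_filter(must=(), should=(), must_not=()):
    def check(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        # Assumption: with no must clause, at least one should clause has to match.
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return check

# Section 6: (postDate = 2017-01-01 OR articleID = XHDK...) AND postDate != 2017-01-02
q6 = bool_filter(
    should=[term("postDate", "2017-01-01"), term("articleID", "XHDK-A-1293-#fJ3")],
    must_not=[term("postDate", "2017-01-02")],
)
# Section 7: articleID = XHDK... OR (articleID = JODL... AND postDate = 2017-01-01)
q7 = bool_filter(should=[
    term("articleID", "XHDK-A-1293-#fJ3"),
    bool_filter(must=[term("articleID", "JODL-X-1937-#pV7"),
                      term("postDate", "2017-01-01")]),
])

hits6 = [doc_id for doc_id, doc in docs.items() if q6(doc)]
hits7 = [doc_id for doc_id, doc in docs.items() if q7(doc)]
print(hits6, hits7)  # [1, 3] [1, 3]
```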
7. Complex where conditions, example 2 => find posts whose article ID is XHDK-A-1293-#fJ3, or whose article ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01
select *
from forum.article
where article_id='XHDK-A-1293-#fJ3'
or (article_id='JODL-X-1937-#pV7' and post_date='2017-01-01')
GET /forum/article/_search
{
  "query": { "constant_score": { "filter": { "bool": {
    "should": [
      { "term": { "articleID": "XHDK-A-1293-#fJ3" } },
      { "bool": { "must": [
        { "term": { "articleID": "JODL-X-1937-#pV7" } },
        { "term": { "postDate": "2017-01-01" } }
      ] } }
    ]
  } } } }
}
8. SQL's in
select * from tbl where col in ('value1', 'value2')
Add a new field (tag)
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }
Find posts whose tag is in ("java", "hadoop")
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "terms": { "tag": ["java", "hadoop"] } } } } }
9. SQL's between
Add data
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }
Find posts with a view count between 30 and 60 (gt and lt exclude the endpoints; gte and lte would include them, as SQL's between does)
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "range": { "view_cnt": { "gt": 30, "lt": 60 } } } } } }
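The difference between the exclusive and inclusive bounds shows up on the sample data. A minimal sketch, with range_filter as a hypothetical helper that mirrors the gt/gte/lt/lte parameters:

```python
# view_cnt per doc id, as set by the bulk update in this section
view_cnt = {1: 30, 2: 50, 3: 100, 4: 80}

def range_filter(values, gt=None, gte=None, lt=None, lte=None):
    """Return the ids whose value satisfies every bound that is given."""
    def ok(v):
        return ((gt is None or v > gt) and (gte is None or v >= gte)
                and (lt is None or v < lt) and (lte is None or v <= lte))
    return sorted(doc_id for doc_id, v in values.items() if ok(v))

exclusive = range_filter(view_cnt, gt=30, lt=60)    # the query in the notes
inclusive = range_filter(view_cnt, gte=30, lte=60)  # SQL between semantics
print(exclusive, inclusive)  # [2] [1, 2]
```

Doc 1 has exactly 30 views, so it only appears once the lower bound is inclusive.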
Find posts published within the 30 days before a given date
POST /forum/article/_bulk
{ "index": { "_id": 5 }}
{ "articleID" : "DHJK-B-1395-#Ky5", "userID" : 3, "hidden": false, "postDate": "2017-03-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10 }
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "range": { "postDate": { "gt": "2017-03-10||-30d" } } } } } }
Within the last 30 days
GET /forum/article/_search
{ "query": { "constant_score": { "filter": { "range": { "postDate": { "gt": "now-30d" } } } } } }
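The date-math expressions in the two range filters resolve to a concrete lower bound. The sketch below handles only the two shapes used here (now-Nd and YYYY-MM-DD||-Nd) at day precision; resolve_date_math is a hypothetical helper, and real date math supports more units and time-of-day precision.

```python
import re
from datetime import date, timedelta

def resolve_date_math(expr, today=None):
    """Resolve 'now-30d' or '2017-03-10||-30d' to a date (day units only)."""
    m = re.fullmatch(r"(?:now|(\d{4}-\d{2}-\d{2})\|\|)([+-]\d+)d", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr}")
    # An explicit anchor date comes before '||'; otherwise the anchor is 'now'.
    anchor = date.fromisoformat(m.group(1)) if m.group(1) else (today or date.today())
    return anchor + timedelta(days=int(m.group(2)))

post_dates = {
    1: date(2017, 1, 1), 2: date(2017, 1, 2), 3: date(2017, 1, 1),
    4: date(2017, 1, 2), 5: date(2017, 3, 1),
}

lower = resolve_date_math("2017-03-10||-30d")
hits = sorted(doc_id for doc_id, d in post_dates.items() if d > lower)
print(lower, hits)  # 2017-02-08 [5]
```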
10. Everything above searches by exact value; what follows covers full-text search
Data preparation
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }
Find blog posts whose title contains java or elasticsearch
GET /forum/article/_search
{ "query": { "match": { "title": "java elasticsearch" } } }
First step toward precise results: use the and operator. When every search keyword has to match, set operator to and; a plain match query (which defaults to or) cannot express this.
GET /forum/article/_search
{ "query": { "match": { "title": { "query": "java elasticsearch", "operator": "and" } } } }
Second step toward precise results: require that at least a given share of the keywords match before a doc is returned
GET /forum/article/_search
{ "query": { "match": { "title": { "query": "java elasticsearch spark hadoop", "minimum_should_match": "75%" } } } }
Combine several conditions on title with bool
GET /forum/article/_search
{
  "query": { "bool": {
    "must": { "match": { "title": "java" } },
    "must_not": { "match": { "title": "spark" } },
    "should": [
      { "match": { "title": "hadoop" } },
      { "match": { "title": "elasticsearch" } }
    ]
  } }
}
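Summary:
How a bool query that combines several conditions computes the relevance score:
Add up the scores of the matching must and should clauses, then divide by the total number of must and should clauses.
Rank 1: contains java plus every should keyword (hadoop and elasticsearch)
Rank 2: contains java plus elasticsearch from the should clauses
Rank 3: contains java but none of the should keywords
should clauses do affect the relevance score.
must guarantees that a doc contains the keyword, and the doc's relevance score for that condition is computed from the must clause as well.
Once must is satisfied, the should clauses are optional, but the more of them match, the higher the doc's relevance score.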
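The ranking above follows from the sum-then-divide rule as stated in these notes. A minimal sketch of that rule, assuming a hypothetical score of 1.0 for every matching clause; actual Lucene scores depend on the version and similarity in use, so only the ordering is meaningful here.

```python
def bool_score(clause_scores):
    """clause_scores has one entry per must/should clause:
    the clause's score if it matched, or None if it did not."""
    matched = [s for s in clause_scores if s is not None]
    return sum(matched) / len(clause_scores)

# Clause order: must java, should hadoop, should elasticsearch
scores = {
    4: bool_score([1.0, 1.0, 1.0]),    # java, elasticsearch, hadoop
    1: bool_score([1.0, None, 1.0]),   # java, elasticsearch
    2: bool_score([1.0, None, None]),  # java only
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [4, 1, 2]
```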
Search for java, hadoop, spark, elasticsearch, requiring at least 3 of the keywords
GET /forum/article/_search
{
  "query": { "bool": {
    "should": [
      { "match": { "title": "java" } },
      { "match": { "title": "elasticsearch" } },
      { "match": { "title": "hadoop" } },
      { "match": { "title": "spark" } }
    ],
    "minimum_should_match": 3
  } }
}
11. How match queries are rewritten into term queries
{ "match": { "title": "java elasticsearch" } }
is rewritten as
{ "bool": { "should": [ { "term": { "title": "java" } }, { "term": { "title": "elasticsearch" } } ] } }

{ "match": { "title": { "query": "java elasticsearch", "operator": "and" } } }
is rewritten as
{ "bool": { "must": [ { "term": { "title": "java" } }, { "term": { "title": "elasticsearch" } } ] } }

{ "match": { "title": { "query": "java elasticsearch hadoop spark", "minimum_should_match": "75%" } } }
is rewritten as
{ "bool": { "should": [ { "term": { "title": "java" } }, { "term": { "title": "elasticsearch" } }, { "term": { "title": "hadoop" } }, { "term": { "title": "spark" } } ], "minimum_should_match": 3 } }
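The three rewrites follow one pattern, sketched below. rewrite_match is a hypothetical helper, a whitespace split stands in for the field's analyzer (an assumption), and a percentage is converted to a clause count by rounding down.

```python
def rewrite_match(field, query, operator="or", minimum_should_match=None):
    """Build the bool/term query that a match query corresponds to."""
    terms = query.lower().split()  # assumption: stands in for real analysis
    clauses = [{"term": {field: t}} for t in terms]
    if operator == "and":
        return {"bool": {"must": clauses}}
    body = {"should": clauses}
    if minimum_should_match is not None:
        msm = minimum_should_match
        if isinstance(msm, str) and msm.endswith("%"):
            # percentage of the optional clauses, rounded down
            msm = len(clauses) * int(msm[:-1]) // 100
        body["minimum_should_match"] = msm
    return {"bool": body}

print(rewrite_match("title", "java elasticsearch hadoop spark",
                    minimum_should_match="75%"))
```

With four terms, 75% becomes 3 required clauses; with three terms the same setting would round down to 2.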
12. Use boost to raise the weight of an individual search clause
GET /forum/article/_search
{
  "query": { "bool": {
    "must": [ { "match": { "title": "blog" } } ],
    "should": [
      { "match": { "title": { "query": "java" } } },
      { "match": { "title": { "query": "hadoop" } } },
      { "match": { "title": { "query": "elasticsearch" } } },
      { "match": { "title": { "query": "spark", "boost": 5 } } }
    ]
  } }
}
13. Docs in which a single field matches as many keywords as possible should rank first, rather than docs in which many fields each match only a few keywords
Data preparation
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }
With bool should, the scores of both fields add up:
GET /forum/article/_search
{
  "query": { "bool": { "should": [
    { "match": { "title": "java solution" } },
    { "match": { "content": "java solution" } }
  ] } }
}
With dis_max, the best-matching field decides the score:
GET /forum/article/_search
{
  "query": { "dis_max": { "queries": [
    { "match": { "title": "java solution" } },
    { "match": { "content": "java solution" } }
  ] } }
}
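14. dis_max takes only the highest score among its queries and ignores the scores of the others
Use tie_breaker to take the other queries' scores into account as well
tie_breaker multiplies the scores of the other queries by its value and combines the result with the score of the best-scoring query
So besides the highest score, the other queries also contribute
tie_breaker is a decimal between 0 and 1
GET /forum/article/_search
{
  "query": { "dis_max": {
    "queries": [
      { "match": { "title": "java beginner" } },
      { "match": { "content": "java beginner" } }
    ],
    "tie_breaker": 0.3
  } }
}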
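The three scoring behaviours from sections 13 and 14 can be compared side by side. A minimal sketch with made-up per-field scores (the numbers are hypothetical, and a field that does not match is modelled as 0.0):

```python
def bool_should_score(scores):
    """bool/should: the scores of all matching clauses are summed."""
    return sum(scores)

def dis_max_score(scores, tie_breaker=0.0):
    """dis_max: best score, plus tie_breaker times the other scores."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# [title score, content score]
doc_a = [0.4, 0.4]  # each field matches one keyword
doc_b = [0.0, 0.7]  # one field matches both keywords

print(bool_should_score(doc_a), bool_should_score(doc_b))    # doc_a ranks first
print(dis_max_score(doc_a), dis_max_score(doc_b))            # doc_b ranks first
print(dis_max_score(doc_a, 0.3), dis_max_score(doc_b, 0.3))  # doc_b first, doc_a gets some credit
```

With tie_breaker set, doc_a's second field adds a little to its score, but the doc with the single best field still wins.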
15. minimum_should_match
Cutting off the long tail
Long tail: say you search for 5 keywords, but many results match only 1 of them and are far from what you wanted; those results are the long tail
minimum_should_match controls how precise the results are: only docs that match a minimum number of keywords are returned
GET /forum/article/_search
{
  "query": { "multi_match": {
    "query": "java solution",
    "type": "best_fields",
    "fields": [ "title^2", "content" ],
    "tie_breaker": 0.3,
    "minimum_should_match": "50%"
  } }
}
GET /forum/article/_search
{
  "query": { "dis_max": {
    "queries": [
      { "match": { "title": { "query": "java beginner", "minimum_should_match": "50%", "boost": 2 } } },
      { "match": { "content": { "query": "java beginner", "minimum_should_match": "30%" } } }
    ],
    "tie_breaker": 0.3
  } }
}
16. Switching from the best_fields strategy to most_fields
best_fields ranks first the docs in which a single field matches as many keywords as possible
most_fields ranks first the docs in which as many fields as possible match the keywords
Data preparation
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"sub_title" : "learning more courses"} }
{ "update": { "_id": "2"} }
{ "doc" : {"sub_title" : "learned a lot of course"} }
{ "update": { "_id": "3"} }
{ "doc" : {"sub_title" : "we have a lot of fun"} }
{ "update": { "_id": "4"} }
{ "doc" : {"sub_title" : "both of them are good"} }
{ "update": { "_id": "5"} }
{ "doc" : {"sub_title" : "haha, hello world"} }
The query below assumes sub_title is mapped with a sub-field named std (for example, an English-analyzed sub_title with a standard-analyzed std sub-field).
GET /forum/article/_search
{
  "query": { "multi_match": {
    "query": "learning courses",
    "type": "most_fields",
    "fields": [ "sub_title", "sub_title.std" ]
  } }
}
17. Searching across multiple fields
Data preparation
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }
Option 1: copy_to. The mapping copies two name fields into one combined field. copy_to only fills new_author_full_name for docs indexed with new_author_first_name and new_author_last_name after the mapping exists, so the author data has to be indexed under those field names before the search returns anything.
PUT /forum/_mapping/article
{
  "properties": {
    "new_author_first_name": { "type": "text", "copy_to": "new_author_full_name" },
    "new_author_last_name": { "type": "text", "copy_to": "new_author_full_name" },
    "new_author_full_name": { "type": "text" }
  }
}
GET /forum/article/_search
{ "query": { "match": { "new_author_full_name": "Peter Smith" } } }
Option 2: cross_fields
GET /forum/article/_search
{
  "query": { "multi_match": {
    "query": "Peter Smith",
    "type": "cross_fields",
    "operator": "and",
    "fields": ["author_first_name", "author_last_name"]
  } }
}
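What cross_fields with operator and does to the sample authors can be sketched in memory: each query term must appear in at least one of the listed fields, as if the fields were one combined field. The helper names are hypothetical and a whitespace split stands in for analysis; per_field_and is included only as a field-centric contrast.

```python
authors = {
    1: {"author_first_name": "Peter", "author_last_name": "Smith"},
    2: {"author_first_name": "Smith", "author_last_name": "Williams"},
    3: {"author_first_name": "Jack", "author_last_name": "Ma"},
    4: {"author_first_name": "Robbin", "author_last_name": "Li"},
    5: {"author_first_name": "Tonny", "author_last_name": "Peter Smith"},
}
FIELDS = ["author_first_name", "author_last_name"]

def tokens(value):
    return value.lower().split()  # assumption: stands in for the analyzer

def cross_fields_and(doc, query, fields):
    """Term-centric: every term must appear in at least one of the fields."""
    combined = {t for f in fields for t in tokens(doc.get(f, ""))}
    return all(t in combined for t in tokens(query))

def per_field_and(doc, query, fields):
    """Field-centric: all terms must appear together in a single field."""
    return any(all(t in tokens(doc.get(f, "")) for t in tokens(query))
               for f in fields)

cross_hits = [i for i, d in authors.items() if cross_fields_and(d, "Peter Smith", FIELDS)]
field_hits = [i for i, d in authors.items() if per_field_and(d, "Peter Smith", FIELDS)]
print(cross_hits, field_hits)  # [1, 5] [5]
```

Doc 1 is only found by the term-centric check, because Peter and Smith live in different fields.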
19. match_phrase syntax
GET /forum/article/_search
{ "query": { "match_phrase": { "content": "java spark" } } }
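20. Recall versus precision
Recall
Say you search for java spark and there are 100 docs in total: how many of them come back as results is the recall
Precision
Say you search for java spark: whether docs that contain java spark, or in which java and spark are very close together, are ranked first is the precision
Using a match_phrase search alone means every term has to appear in the doc field, within the distance allowed by slop, for the doc to match
match phrase and proximity match require a doc to contain all the terms; a doc that is missing any one term is not returned
java spark --> hello world java --> not returned
java spark --> hello world, java spark --> returned
With proximity matching, recall is comparatively low and precision is very high
Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want match_phrase's distance-based scoring, so that the closer the terms are, the higher the score and the earlier the doc is returned
In other words, recall comes first: for java spark, return docs containing java, docs containing spark, and docs containing both. Precision is kept as well: among docs containing both, the closer java and spark are, the higher the doc ranks
This can be done by combining a match query and a match_phrase query in a bool query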
The must clause (match) returns docs containing java, spark, or both, and ranks docs with both higher, but it cannot tell how far apart java and spark are, so a doc in which they sit close together does not necessarily come first.
The should clause (match_phrase) adds its own relevance score whenever java spark matches a doc within the slop; the closer java and spark are, the higher that score.
GET /forum/article/_search
{
  "query": { "bool": {
    "must": { "match": { "title": { "query": "java spark" } } },
    "should": { "match_phrase": { "title": { "query": "java spark", "slop": 50 } } }
  } }
}
The same combination on the content field:
GET /forum/article/_search
{
  "query": { "bool": {
    "must": [ { "match": { "content": "java spark" } } ],
    "should": [ { "match_phrase": { "content": { "query": "java spark", "slop": 50 } } } ]
  } }
}
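How much slop a doc needs can be estimated with a simplified position model: shift each term's position in the doc by its position in the query, and take the spread of the shifted positions. This is a sketch of the commonly described behaviour, not Lucene's actual scorer (which also handles repeated terms and scoring); needed_slop is a hypothetical helper.

```python
from itertools import product

def needed_slop(query, text):
    """Smallest slop at which the phrase would match, or None if a term is missing."""
    q_terms = query.lower().split()
    d_terms = text.lower().replace(",", " ").split()
    # all positions of each query term in the doc
    positions = [[i for i, t in enumerate(d_terms) if t == q] for q in q_terms]
    if any(not p for p in positions):
        return None  # match_phrase requires every term to be present
    best = None
    for combo in product(*positions):
        # subtract each term's offset in the query, then measure the spread
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        cost = max(shifted) - min(shifted)
        best = cost if best is None else min(best, cost)
    return best

print(needed_slop("java spark", "hello world, java spark"))  # 0
print(needed_slop("java spark", "java is spark"))            # 1
print(needed_slop("java spark", "spark java"))               # 2
print(needed_slop("java spark", "hello world java"))         # None
```

A doc matches the match_phrase clause when its needed slop is at most the configured slop; a smaller needed slop means the terms are closer together.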
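21. Difference between match and phrase match (proximity match)
match --> as soon as a single term matches, the docs for that term can be returned; it only scans the inverted index
phrase match --> first fetch the doc list of every term; find the docs that contain all the terms; then, for each doc, check whether the positions of the terms fall within the allowed range; with slop, a more complex computation decides whether the doc can be matched by moving terms within the slop
A match query is much faster than a phrase match or a proximity match (with slop), because the latter two have to compute position distances.
As a rough rule of thumb, a match query is about 10 times faster than a phrase match and about 20 times faster than a proximity match.
There is no need to worry too much, though: Elasticsearch generally answers in milliseconds. A match query usually takes a few milliseconds to a few tens of milliseconds, while phrase match and proximity match take tens to hundreds of milliseconds, which is still acceptable.
Optimizing proximity match mostly means reducing the number of documents it has to run on. The idea is to filter the data with a match query first and then use proximity match to raise doc scores according to term distance. The proximity match is applied only to the top n docs by score on each shard and readjusts their scores; this is called rescoring. Users normally page through results and only see the first few pages, so there is no need to run proximity match on every result.
As described above, match + proximity match gives both recall and precision
By default, if match hits 1000 docs, the proximity match has to run on every one of them, check whether it matches within the slop, and contribute its score
In most cases, though, the user pages through those 1000 docs and looks at the first few pages at most; with 10 docs per page and perhaps 5 pages, that is 50 docs
The proximity match only needs to run on the top 50 docs and contribute its score there; it does not need to compute and score all 1000 docs
rescore: rescoring
match: 1000 docs, each of which already has a score; proximity match: rescore only the top 50 docs, so that among them the docs whose terms are closer together rank higher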
GET /forum/article/_search
{
  "query": { "match": { "content": "java spark" } },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": { "content": { "query": "java spark", "slop": 50 } }
      }
    }
  }
}
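The effect of window_size can be sketched with plain numbers. The model below assumes the default combination (original score plus rescore score, both weights 1) and a single shard; the scores and the rescore helper are hypothetical.

```python
def rescore(first_pass, rescore_fn, window_size,
            query_weight=1.0, rescore_query_weight=1.0):
    """Re-rank only the top window_size docs; the rest keep their order."""
    ranked = sorted(first_pass.items(), key=lambda kv: kv[1], reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    rescored = [
        (doc, query_weight * score + rescore_query_weight * rescore_fn(doc))
        for doc, score in window
    ]
    rescored.sort(key=lambda kv: kv[1], reverse=True)
    return rescored + rest

# Scores from the match query (first pass)
match_scores = {"a": 3.0, "b": 2.5, "c": 2.0, "d": 1.0}
# Score the match_phrase query would add; 0 when the phrase does not match
phrase_scores = {"b": 1.0, "c": 5.0, "d": 9.0}

result = rescore(match_scores, lambda doc: phrase_scores.get(doc, 0.0), window_size=3)
order = [doc for doc, _ in result]
print(order)  # ['c', 'b', 'a', 'd']
```

Doc d would gain the most from the phrase query, but it falls outside the window of 3 and is never rescored, which is the trade-off that keeps the expensive proximity computation cheap.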