Elasticsearch (Part 2)
Insert the data:
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
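(2) Search posts by user ID
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "userID": 1 } }
    }
  }
}
term filter/query: the search text is not analyzed. It is looked up in the inverted index exactly as typed, so whatever you enter is what gets matched.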
(3) Search posts that are not hidden
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "hidden": false } }
    }
  }
}
(4) Search posts by post date
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "postDate": "2017-01-01" } }
    }
  }
}
(5) Search posts by post ID
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "articleID": "XHDK-A-1293-#fJ3" } }
    }
  }
}
Result (no hits):
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 0, "max_score": null, "hits": [] }
}
Correct form:
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "articleID.keyword": "XHDK-A-1293-#fJ3" } }
    }
  }
}
(6) Inspect how the field is analyzed:
GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}
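articleID.keyword is a sub-field that newer versions of Elasticsearch create automatically for dynamically mapped strings, and it is not analyzed. So each incoming articleID value is indexed twice:
once as the articleID field itself, which is analyzed, with the resulting tokens placed in the inverted index; and once as articleID.keyword, which is not analyzed
and goes into the inverted index as one whole string. By default this sub-field only indexes values of at most 256 characters (ignore_above: 256); longer values are skipped for it rather than truncated.
So when a term filter has to match a text field, consider matching against the built-in field.keyword sub-field.
The default 256-character limit is a drawback, though, so where possible it is better to define the mapping yourself and make the field not_analyzed.
In recent versions of Elasticsearch there is no need to specify not_analyzed; setting the field's type to keyword is enough.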
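The difference between the analyzed field and its keyword sub-field can be illustrated with a toy inverted index. This is only a sketch: the analyze function below is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and build_index and term_lookup are hypothetical helpers, not Elasticsearch APIs.

```python
import re
from collections import defaultdict

def analyze(text):
    """Very rough stand-in for the standard analyzer (an assumption)."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs, analyzed):
    """Map each indexed term to the set of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # Analyzed field: one entry per token. Keyword field: the whole string.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index

def term_lookup(index, term):
    """A term query looks its input up as-is, without analyzing it."""
    return sorted(index.get(term, set()))

docs = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}
text_index = build_index(docs, analyzed=True)      # like articleID (text)
keyword_index = build_index(docs, analyzed=False)  # like articleID.keyword

print(analyze("XHDK-A-1293-#fJ3"))                    # ['xhdk', 'a', '1293', 'fj3']
print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))    # []
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3")) # [1]
print(term_lookup(text_index, "xhdk"))                # [1]
```

The full ID is never stored as a term in the analyzed index, which is why the first term filter on articleID returned no hits.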
(7) Rebuild the index
DELETE /forum
PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": { "type": "keyword" }
      }
    }
  }
}
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
(8) Search again by post ID and post date
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "articleID": "XHDK-A-1293-#fJ3" } }
    }
  }
}
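(1) term filter: searches by exact value; numbers, booleans, and dates support this natively
(2) a text field has to be mapped as not_analyzed (type keyword in recent versions) when the index is created before a term query can match its whole value
(3) it is equivalent to a single WHERE condition in SQL
The advantage of a filter over a query is caching. What gets cached is not
the complete list of documents the filter returned, but the filter's bitset (one bit per document, set when the document matches).
In most cases filters run before the query, so that as much data as possible is filtered out first.
query: computes each document's relevance score against the search condition and sorts by that score
filter: simply filters out the wanted data; it computes no relevance score and does no sorting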
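The bitset idea above can be sketched in a few lines. This is a conceptual illustration only, not how Elasticsearch stores its caches internally; build_bitset, cached_filter, and the cache dictionary are hypothetical names, and the documents mirror the sample data.

```python
docs = [
    {"id": 1, "userID": 1, "postDate": "2017-01-01"},
    {"id": 2, "userID": 1, "postDate": "2017-01-02"},
    {"id": 3, "userID": 2, "postDate": "2017-01-01"},
    {"id": 4, "userID": 2, "postDate": "2017-01-02"},
]

def build_bitset(docs, field, value):
    """One bit per document: 1 if the document matches the filter."""
    return [1 if d.get(field) == value else 0 for d in docs]

cache = {}  # hypothetical filter cache keyed by (field, value)

def cached_filter(docs, field, value):
    """Build the bitset once, then reuse it for later searches."""
    key = (field, value)
    if key not in cache:
        cache[key] = build_bitset(docs, field, value)
    return cache[key]

def intersect(a, b):
    """Combining two filters is a cheap bitwise AND of their bitsets."""
    return [x & y for x, y in zip(a, b)]

by_date = cached_filter(docs, "postDate", "2017-01-01")
by_user = cached_filter(docs, "userID", 1)
print(by_date)                      # [1, 0, 1, 0]
print(by_user)                      # [1, 1, 0, 0]
print(intersect(by_date, by_user))  # [1, 0, 0, 0]
```

Because no score has to be computed, the cached bits can be reused as-is by any later search that contains the same filter.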
Combining multiple filter conditions with bool
1. Search for posts whose post date is 2017-01-01 or whose post ID is XHDK-A-1293-#fJ3, while requiring that the post date is definitely not 2017-01-02
select * from forum.article where (post_date='2017-01-01' or article_id='XHDK-A-1293-#fJ3') and post_date!='2017-01-02'
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "postDate": "2017-01-01" } },
            { "term": { "articleID": "XHDK-A-1293-#fJ3" } }
          ],
          "must_not": { "term": { "postDate": "2017-01-02" } }
        }
      }
    }
  }
}
must: has to match; should: matching any one of the clauses is enough; must_not: must not match; filter: has to match, but contributes nothing to the score
2. Search for posts whose post ID is XHDK-A-1293-#fJ3, or whose post ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01
select * from forum.article where article_id='XHDK-A-1293-#fJ3' or (article_id='JODL-X-1937-#pV7' and post_date='2017-01-01')
Because the rebuilt index maps articleID directly as keyword, there is no articleID.keyword sub-field any more, so the query uses articleID:
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "articleID": "XHDK-A-1293-#fJ3" } },
            {
              "bool": {
                "must": [
                  { "term": { "articleID": "JODL-X-1937-#pV7" } },
                  { "term": { "postDate": "2017-01-01" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
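To check which sample posts the two bool filters should return, the boolean logic can be replayed locally. The sketch below is a simplified evaluator, assuming exact-value term clauses only and the rule that a bool without must needs at least one should clause to match; matches is a hypothetical helper, not part of any Elasticsearch client.

```python
docs = [
    {"id": 1, "articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},
    {"id": 2, "articleID": "KDKE-B-9947-#kL5", "postDate": "2017-01-02"},
    {"id": 3, "articleID": "JODL-X-1937-#pV7", "postDate": "2017-01-01"},
    {"id": 4, "articleID": "QQPX-R-3956-#aD8", "postDate": "2017-01-02"},
]

def as_list(clauses):
    """bool accepts either a single clause or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]

def matches(doc, clause):
    """Evaluate a term clause or a (possibly nested) bool clause."""
    if "term" in clause:
        (field, value), = clause["term"].items()
        return doc.get(field) == value
    body = clause["bool"]
    must = as_list(body.get("must", []))
    should = as_list(body.get("should", []))
    must_not = as_list(body.get("must_not", []))
    if not all(matches(doc, c) for c in must):
        return False
    if any(matches(doc, c) for c in must_not):
        return False
    # Simplified rule: with no must clause, one should clause has to match.
    if should and not must:
        return any(matches(doc, c) for c in should)
    return True

first = {"bool": {
    "should": [{"term": {"postDate": "2017-01-01"}},
               {"term": {"articleID": "XHDK-A-1293-#fJ3"}}],
    "must_not": {"term": {"postDate": "2017-01-02"}}}}
second = {"bool": {"should": [
    {"term": {"articleID": "XHDK-A-1293-#fJ3"}},
    {"bool": {"must": [{"term": {"articleID": "JODL-X-1937-#pV7"}},
                       {"term": {"postDate": "2017-01-01"}}]}}]}}

print([d["id"] for d in docs if matches(d, first)])   # [1, 3]
print([d["id"] for d in docs if matches(d, second)])  # [1, 3]
```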
Searching multiple values with terms, and searching multi-valued fields
1. Add a tag field to the posts
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag" : ["java", "hadoop"]} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag" : ["java"]} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag" : ["hadoop"]} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag" : ["java", "elasticsearch"]} }
2. Search for posts whose articleID is KDKE-B-9947-#kL5 or QQPX-R-3956-#aD8, then search for posts whose tag contains java
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": { "articleID": [ "KDKE-B-9947-#kL5", "QQPX-R-3956-#aD8" ] }
      }
    }
  }
}
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": { "terms": { "tag": ["java"] } }
    }
  }
}
Refining terms results on multi-valued fields
Add a tag_cnt field holding the number of tags, so that a search can require tag to contain only java:
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"tag_cnt" : 2} }
{ "update": { "_id": "2"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "3"} }
{ "doc" : {"tag_cnt" : 1} }
{ "update": { "_id": "4"} }
{ "doc" : {"tag_cnt" : 2} }
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "tag_cnt": 1 } },
            { "terms": { "tag": ["java"] } }
          ]
        }
      }
    }
  }
}
Range filtering with the range filter
1. Add a view-count field to the posts
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }
2. Search for posts with a view count between 30 and 60 (gt and lt exclude both bounds; use gte and lte to include them)
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "view_cnt": { "gt": 30, "lt": 60 } }
      }
    }
  }
}
3. Search for posts published within the last month
POST /forum/article/_bulk
{"index":{"_id":5}}
{"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2017-03-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "postDate": { "gt": "2017-03-10||-30d" } }
      }
    }
  }
}
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "postDate": { "gt": "now-30d" } }
      }
    }
  }
}
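To see what the date-math expression in the first range query resolves to, here is a small sketch. resolve_date_math is a hypothetical helper that only handles an ISO anchor date followed by + or - offsets in days (d) or weeks (w); Elasticsearch's real date math supports more units and rounding.

```python
import re
from datetime import date, timedelta

def resolve_date_math(expr):
    """Resolve 'YYYY-MM-DD||-30d' style expressions (days and weeks only)."""
    anchor, _, math = expr.partition("||")
    resolved = date.fromisoformat(anchor)
    for sign, amount, unit in re.findall(r"([+-])(\d+)([dw])", math):
        delta = timedelta(days=int(amount) * (7 if unit == "w" else 1))
        resolved = resolved + delta if sign == "+" else resolved - delta
    return resolved

post_dates = {
    1: "2017-01-01", 2: "2017-01-02", 3: "2017-01-01",
    4: "2017-01-02", 5: "2017-03-01",
}

lower = resolve_date_math("2017-03-10||-30d")
print(lower)  # 2017-02-08

# "gt" keeps only posts strictly later than the resolved lower bound.
hits = [doc_id for doc_id, d in post_dates.items()
        if date.fromisoformat(d) > lower]
print(hits)   # [5]
```

Only the post added with the date 2017-03-01 falls inside the 30-day window; the four January posts do not.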
Manually controlling the precision of full-text search results
1. Add a title field to the posts
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }
2. Search for blogs whose title contains java or elasticsearch
GET /forum/article/_search
{
  "query": {
    "match": { "title": "java elasticsearch" }
  }
}
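3. Search for blogs whose title contains both java and elasticsearch
The first step in controlling result precision is the and operator: when every search keyword has to match, use and.
This achieves something a plain match query cannot.
GET /forum/article/_search
{
  "query": {
    "match": {
      "title": { "query": "java elasticsearch", "operator": "and" }
    }
  }
}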
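4. Search for blogs that contain at least 3 of the 4 keywords java, elasticsearch, spark, hadoop
The second step in controlling result precision: specify how many of the given keywords have to match, at a minimum, before a document is returned
GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}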
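The effect of "75%" on four keywords can be worked through by hand. The sketch below assumes the simplest two forms of minimum_should_match (a positive integer or a positive percentage, with the percentage rounded down) and a naive tokenizer; required_matches and tokens are hypothetical helpers.

```python
import re

def required_matches(clause_count, spec):
    """Minimum number of keywords that must match (simple forms only)."""
    if spec.endswith("%"):
        # Integer arithmetic rounds the percentage down.
        return clause_count * int(spec[:-1]) // 100
    return int(spec)

def tokens(text):
    """Naive tokenizer: lowercase words, punctuation dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

keywords = ["java", "elasticsearch", "spark", "hadoop"]
needed = required_matches(len(keywords), "75%")
print(needed)  # 3

hits = [doc_id for doc_id, title in titles.items()
        if sum(k in tokens(title) for k in keywords) >= needed]
print(hits)    # [4]
```

Only post 4 contains three of the four keywords, so it is the only one that survives the 75% threshold.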
5. Use bool to combine several search conditions on title
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "java" } },
      "must_not": { "match": { "title": "spark" } },
      "should": [
        { "match": { "title": "hadoop" } },
        { "match": { "title": "elasticsearch" } }
      ]
    }
  }
}
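6. How bool computes the relevance score when combining several conditions
The scores of the matching must and should clauses are added together and divided by the total number of must and should clauses (a simplified description; the exact normalization varies between Elasticsearch versions)
Rank 1: contains java and every should keyword (hadoop, elasticsearch)
Rank 2: contains java and the should keyword elasticsearch
Rank 3: contains java and none of the should keywords
should clauses can influence the relevance score.
must guarantees that the keyword is present, and the document's relevance score for the search is also computed from the must condition.
On top of satisfying must, the should conditions do not have to match, but the more of them match, the higher the document's relevance score.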
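The ranking described above can be reproduced with a toy calculation. This sketch applies the simplified rule from the text (sum of matching clause scores divided by the number of must and should clauses) and assumes, purely for illustration, that every matching clause scores 1.0; real clause scores come from the similarity model and differ per document.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def has(title, word):
    """True if the word appears as a token of the title."""
    return word in re.findall(r"[a-z]+", title.lower())

def bool_score(title, must, should):
    """Sum of matching clause scores / total number of must+should clauses.

    Each matching clause is given a hypothetical score of 1.0.
    """
    matched = sum(1.0 for w in must + should if has(title, w))
    return matched / (len(must) + len(should))

must, must_not, should = ["java"], ["spark"], ["hadoop", "elasticsearch"]

scores = {
    doc_id: bool_score(title, must, should)
    for doc_id, title in titles.items()
    if all(has(title, w) for w in must)
    and not any(has(title, w) for w in must_not)
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [4, 1, 2]
```

Posts 3 and 5 are excluded because they lack java; among the rest, each additional matching should clause raises the score.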
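7. Search for java, hadoop, spark, elasticsearch, requiring at least 3 of these keywords
By default, none of the should clauses has to match; in the search above, for example, "this is java blog" matches no should condition at all.
There is one exception: if there is no must clause, at least one should clause has to match.
In the search below, should holds 4 conditions, and by default a document matching any one of them is returned.
You can, however, control precisely how many of the 4 should conditions have to match before a document is returned.
GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" } },
        { "match": { "title": "elasticsearch" } },
        { "match": { "title": "hadoop" } },
        { "match": { "title": "spark" } }
      ],
      "minimum_should_match": 3
    }
  }
}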
1. There are two ways to search for several values in full-text search: a match query, or should clauses
2. To control result precision: the and operator, and minimum_should_match
Fine-grained weighting of search conditions with boost
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "blog" } }
      ],
      "should": [
        { "match": { "title": { "query": "java" } } },
        { "match": { "title": { "query": "hadoop" } } },
        { "match": { "title": { "query": "elasticsearch" } } },
        { "match": { "title": { "query": "spark", "boost": 5 } } }
      ]
    }
  }
}
Multi-field search with the best fields strategy using dis_max
1. Add a content field to the posts
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"content" : "i like to write best elasticsearch article"} }
{ "update": { "_id": "2"} }
{ "doc" : {"content" : "i think java is the best programming language"} }
{ "update": { "_id": "3"} }
{ "doc" : {"content" : "i am only an elasticsearch beginner"} }
{ "update": { "_id": "4"} }
{ "doc" : {"content" : "elasticsearch and hadoop are all very good solution, i am a beginner"} }
{ "update": { "_id": "5"} }
{ "doc" : {"content" : "spark is best big data solution based on scala ,an programming language similar to java"} }
2. Search for posts whose title or content contains java or solution
This is a multi-field search.
GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}
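The best fields strategy means that a document in which a single field matches as many keywords as possible should rank first,
rather than a document in which many fields each match only a few keywords.
The dis_max syntax simply takes the score of the highest-scoring query among several queries.
GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java solution" }},
        { "match": { "content": "java solution" }}
      ]
    }
  }
}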
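The difference between the bool/should form and dis_max comes down to summing versus taking the maximum. The following sketch uses made-up per-field scores (they are assumptions, not values Elasticsearch would return) to show how the ranking can flip; should_score and dis_max_score are hypothetical helpers.

```python
def should_score(field_scores):
    """bool/should style: scores of all matching fields are added up."""
    return sum(field_scores.values())

def dis_max_score(field_scores):
    """dis_max style: only the best-matching field counts."""
    return max(field_scores.values())

# Hypothetical per-field scores for the query "java solution".
# Doc A: title and content each match one keyword weakly.
# Doc B: content alone matches both keywords strongly.
doc_a = {"title": 0.5, "content": 0.5}
doc_b = {"title": 0.0, "content": 0.9}

print(should_score(doc_a), should_score(doc_b))    # 1.0 0.9  -> A ranks first
print(dis_max_score(doc_a), dis_max_score(doc_b))  # 0.5 0.9  -> B ranks first
```

With dis_max, the document whose single field matches both keywords wins, which is the behavior the best fields strategy asks for.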
Tuning dis_max results with the tie_breaker parameter
1. Search for posts whose title or content contains java beginner
GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java beginner" }},
        { "match": { "content": "java beginner" }}
      ]
    }
  }
}
dis_max only takes the score of the highest-scoring query.
2. dis_max uses the maximum score of a single query and ignores the scores of the other queries entirely
3. Use tie_breaker to take the scores of the other queries into account as well
GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "java beginner" }},
        { "match": { "content": "java beginner" }}
      ],
      "tie_breaker": 0.3
    }
  }
}
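The way tie_breaker folds the other queries back in can be written as: best score plus tie_breaker times the sum of the remaining scores. The sketch below uses made-up per-query scores (assumptions for illustration only) to show how a tie under plain dis_max is resolved; dis_max_score is a hypothetical helper.

```python
def dis_max_score(query_scores, tie_breaker=0.0):
    """Best score, plus tie_breaker times the sum of the other scores."""
    best = max(query_scores)
    others = sum(query_scores) - best
    return best + tie_breaker * others

# Hypothetical scores of the title query and the content query.
doc_x = [0.6, 0.0]  # only the title matches
doc_y = [0.6, 0.4]  # title matches equally well, content matches too

# Plain dis_max cannot tell the two documents apart.
print(dis_max_score(doc_x), dis_max_score(doc_y))  # 0.6 0.6

# With tie_breaker, the extra content match lifts doc_y above doc_x.
x = dis_max_score(doc_x, tie_breaker=0.3)
y = dis_max_score(doc_y, tie_breaker=0.3)
print(y > x)  # True
```

A tie_breaker of 0 gives pure dis_max, while a value of 1 makes every query count in full, so small values keep the best field dominant while still rewarding extra matches.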
Implementing dis_max + tie_breaker with the multi_match syntax
GET /forum/article/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "type": "best_fields",
      "fields": [ "title^2", "content" ],
      "tie_breaker": 0.3,
      "minimum_should_match": "50%"
    }
  }
}
GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": { "query": "java beginner", "minimum_should_match": "50%", "boost": 2 }
          }
        },
        {
          "match": {
            "content": { "query": "java beginner", "minimum_should_match": "30%" }
          }
        }
      ],
      "tie_breaker": 0.3
    }
  }
}
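What minimum_should_match is for
It removes the long tail.
The long tail: say you search for 5 keywords, but many results match only 1 of them and are far from what you actually want; those results are the long tail.
minimum_should_match controls result precision: only documents matching a certain number of keywords are returned.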