Elasticsearch Study Notes (12): filter vs. query
I. The keyword sub-field and the keyword data type

1. Prepare test data
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
2. Check the mapping
GET /forum/_mapping/article

Result:

{
  "forum": {
    "mappings": {
      "article": {
        "properties": {
          "articleID": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "hidden": {
            "type": "boolean"
          },
          "postDate": {
            "type": "date"
          },
          "userID": {
            "type": "long"
          }
        }
      }
    }
  }
}
In ES 5.2, for a field whose data type is text (type=text), Elasticsearch sets up two fields by default: the field itself, e.g. articleID, which is analyzed; and a sub-field field.keyword, e.g. articleID.keyword, which is not analyzed and keeps at most 256 characters (ignore_above: 256). So when an articleID comes in, it is indexed twice: once as the field itself, analyzed, with the resulting terms placed in the inverted index; and once as articleID.keyword, not analyzed, stored in the inverted index as a single string of at most 256 characters. Therefore, when you need a term filter on a text field, you can match against the built-in field.keyword. The catch is that default 256-character limit, so whenever possible it is better to define the mapping yourself and mark the field as not_analyzed. In newer versions of ES, not_analyzed is no longer needed: just set type=keyword.

3. Tests

Test 1: search by articleID

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

Result: the document is not found.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Test 2: search by articleID.keyword

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

Result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01"
        }
      }
    ]
  }
}
Test 3: term query on a numeric field

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userID": 1
        }
      }
    }
  }
}

A term filter/query does not analyze the search text; it takes your input as-is and looks it up in the inverted index: whatever you type is exactly what gets matched. By contrast, if the search text were analyzed, "hello world" would become "hello" and "world", and each word would be looked up in the inverted index separately; with term, "hello world" is looked up in the inverted index as the single string "hello world".
4. Inspect the analysis

GET /forum/_analyze
{
  "field": "articleID",
  "text": "XHDK-A-1293-#fJ3"
}

GET /forum/_analyze
{
  "field": "articleID.keyword",
  "text": "XHDK-A-1293-#fJ3"
}

A text field is analyzed by default, so when the inverted index is built every articleID is analyzed; after analysis the original articleID no longer exists as a single term, and only the individual words produced by the analyzer are present in the inverted index. term does not analyze the search text, so XHDK-A-1293-#fJ3 stays XHDK-A-1293-#fJ3; but when articleID was indexed, XHDK-A-1293-#fJ3 became the terms xhdk, a, 1293, fj3. That is why Test 1 finds nothing while Test 2, which runs against the unanalyzed articleID.keyword, finds the document.
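For reference, the first _analyze request should return something along these lines (the offsets and token types are illustrative and depend on the analyzer in use), while the second returns the whole string as a single keyword token:

{
  "tokens": [
    { "token": "xhdk", "start_offset": 0,  "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "a",    "start_offset": 5,  "end_offset": 6,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "1293", "start_offset": 7,  "end_offset": 11, "type": "<NUM>",      "position": 2 },
    { "token": "fj3",  "start_offset": 13, "end_offset": 16, "type": "<ALPHANUM>", "position": 3 }
  ]
}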
5. Define a field with the keyword data type
(1) Delete the index

DELETE /forum
(2) Recreate the index
PUT /forum
{
  "mappings": {
    "article": {
      "properties": {
        "articleID": {
          "type": "keyword"
        }
      }
    }
  }
}
(3) Load the data
POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }
(4) Test the articleID query again
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "articleID": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

This time the term filter on articleID matches document 1, because a keyword field is never analyzed.

6. Summary

(1) term filter: searches by exact value; numeric, boolean and date fields support it natively.
(2) A text field must be indexed as not_analyzed (or mapped as type=keyword) for term queries to work on it.
(3) It is the equivalent of a single WHERE condition in SQL.
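As a quick illustration of point (1), here is a minimal sketch of my own against the sample data (not one of the original tests): a term filter applies directly to the boolean hidden field and the date postDate field, with no keyword mapping tricks needed:

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "hidden": true } },
            { "term": { "postDate": "2017-01-02" } }
          ]
        }
      }
    }
  }
}

Against the sample data this should return only document 4.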
II. How filter execution works under the hood

1. The bitset mechanism

For each filter, ES builds a bitset from the result of looking the condition up in the inverted index, and uses it to store the search result. A simple data structure doing a complex job: it saves memory and improves performance. A bitset is just a binary array whose elements are 0 or 1, marking whether a doc matches the filter condition: 1 if it matches, 0 if it does not, e.g. [0, 1, 1].

ES then walks the bitsets of the individual filter conditions, starting from the sparsest one, to find the documents that satisfy all conditions (iterating the sparsest bitset first rules out as many documents as possible as early as possible).
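A worked illustration derived from the four sample documents above (my own numbers, not part of the original notes):

postDate = 2017-01-01   ->  [1, 0, 1, 0]   (matches docs 1 and 3)
userID   = 1            ->  [1, 1, 0, 0]   (matches docs 1 and 2)
AND of the two bitsets  ->  [1, 0, 0, 0]   (only doc 1 satisfies both filters)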
2. Caching bitsets

ES tracks queries: within the most recent 256 queries, if a filter condition is used more than a certain number of times (the threshold is not fixed), its bitset is automatically cached. Bitsets are not cached for small segments: if a segment has fewer than 1,000 documents, or its size is less than 3% of the index, its filter results are not cached. Such a segment is so small that even scanning it is fast, and segments are merged in the background anyway, so a small segment will soon be merged into a larger one and a cached bitset for it would become useless almost immediately.

Automatic updates of cached bitsets: if documents are added or modified, the cached bitsets are updated automatically.
3. filter vs. query

The advantage of filter over query is caching. In most cases filters are executed before the query, so that as much data as possible is discarded up front.
query: computes each doc's relevance score against the search condition and sorts results by that score.
filter: simply filters out the documents you want; it computes no relevance score and does no score-based sorting.
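As a concrete sketch (my own example against the sample data, not from the original notes): the first request runs in query context, so every hit gets a relevance score and results are ranked by it; the second applies the same term condition as a filter inside constant_score, so no scoring is done and the filter's bitset can be cached:

GET /forum/article/_search
{
  "query": {
    "term": { "userID": 1 }
  }
}

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "userID": 1 }
      }
    }
  }
}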
III. Combining multiple filter conditions with bool
1. Find posts whose postDate is 2017-01-01 or whose articleID is XHDK-A-1293-#fJ3, and whose postDate is definitely not 2017-01-02
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "postDate": "2017-01-01" } },
            { "term": { "articleID": "XHDK-A-1293-#fJ3" } }
          ],
          "must_not": {
            "term": { "postDate": "2017-01-02" }
          }
        }
      }
    }
  }
}
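Against the sample data this should return documents 1 and 3: document 1 matches both should clauses, document 3 matches the postDate clause, and documents 2 and 4 are ruled out by the must_not on postDate 2017-01-02.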
2. Find posts whose articleID is XHDK-A-1293-#fJ3, or whose articleID is JODL-X-1937-#pV7 and whose postDate is 2017-01-01
GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "term": { "articleID": "XHDK-A-1293-#fJ3" } },
            {
              "bool": {
                "must": [
                  { "term": { "articleID": "JODL-X-1937-#pV7" } },
                  { "term": { "postDate": "2017-01-01" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}
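Against the sample data this should again return documents 1 and 3: document 1 via the first should clause, and document 3 via the nested bool (articleID JODL-X-1937-#pV7 and postDate 2017-01-01).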
IV. term and terms
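A minimal sketch of my own using the sample articleIDs: terms is the multi-value counterpart of term and matches any of the listed exact values, much like SQL's IN:

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "articleID": ["XHDK-A-1293-#fJ3", "JODL-X-1937-#pV7"]
        }
      }
    }
  }
}

Against the sample data this should return documents 1 and 3.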
V. range filter

Test data: add a view-count field (view_cnt) to the posts.
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"view_cnt" : 30} }
{ "update": { "_id": "2"} }
{ "doc" : {"view_cnt" : 50} }
{ "update": { "_id": "3"} }
{ "doc" : {"view_cnt" : 100} }
{ "update": { "_id": "4"} }
{ "doc" : {"view_cnt" : 80} }
1. Find posts with a view count between 30 and 60

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "view_cnt": {
            "gt": 30,   // gt = greater than, gte = greater than or equal to
            "lt": 60    // lt = less than, lte = less than or equal to
          }
        }
      }
    }
  }
}

2. Find posts published within the last month

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "2017-03-10||-30d"   // date math: 30 days before 2017-03-10
          }
        }
      }
    }
  }
}

GET /forum/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "postDate": {
            "gt": "now-30d"   // 30 days before now
          }
        }
      }
    }
  }
}
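With the sample view counts (30, 50, 100, 80), the first query should match only document 2 (view_cnt 50): document 1's count of 30 is excluded because gt is strict; use gte/lte for inclusive bounds.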
VI. match query and search precision
Test data: add a title field to the posts (document 5 does not exist yet, so its update sets doc_as_upsert so that the document is created).

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"}, "doc_as_upsert" : true }
1. match query

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": "java elasticsearch"
    }
  }
}

This is equivalent to:

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}

If the title field is analyzed, match runs a full-text search and returns documents whose title contains "java", or "elasticsearch", or both. If the field is not_analyzed, match behaves as an exact-value search (essentially a term query) and only returns documents whose title is exactly "java elasticsearch".

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"   // full text: only return documents containing both "java" and "elasticsearch"
      }
    }
  }
}

This is equivalent to:

{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }}
    ]
  }
}

GET /forum/article/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"   // full text: only return documents matching at least 75% of the terms
      }
    }
  }
}

This is equivalent to:

{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch" }},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3
  }
}
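Working this through against the sample titles (my own reasoning, not from the original notes): "75%" of four terms rounds down to three required matches, so only document 4 ("this is java, elasticsearch, hadoop blog") should be returned; document 1 matches only java and elasticsearch (two of four) and is dropped.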
2. Combining multiple search conditions on title with bool

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark" }},
      "should": [
        { "match": { "title": "hadoop" }},
        { "match": { "title": "elasticsearch" }}
      ]
    }
  }
}
How does bool compute the relevance score when combining multiple search conditions?

The scores of the matching must and should clauses are added up and divided by the total number of must and should clauses.
Ranked first: documents containing java plus all of the should keywords (hadoop and elasticsearch).
Ranked second: documents containing java plus elasticsearch from the should clauses.
Ranked third: documents containing java but none of the should keywords.
should clauses can influence the relevance score. must decides which keyword every result must contain, and the must condition also contributes to each document's relevance score. On top of satisfying must, the should conditions do not have to match, but the more of them a document matches, the higher its relevance score.
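To see how the score is actually assembled per document, you can add "explain": true to the search body (a sketch; the per-document explanation output is verbose and omitted here):

GET /forum/article/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark" }},
      "should": [
        { "match": { "title": "hadoop" }},
        { "match": { "title": "elasticsearch" }}
      ]
    }
  }
}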
By default, should clauses do not have to match at all. In the search above, "this is java blog" matches none of the should conditions but is still returned. There is one exception: if the bool has no must clause, then at least one should clause must match. In the search below there are four should conditions; by default a document matching any one of them would be returned, but minimum_should_match lets you control precisely how many of the four must match before a document is returned:

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch" }},
        { "match": { "title": "hadoop" }},
        { "match": { "title": "spark" }}
      ],
      "minimum_should_match": 3
    }
  }
}