Elasticsearch Basics 2: A First Look at Queries and Filters

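Combining queries

Combining multiple queries
Real-world search requirements are never that simple: they need to search several fields for a variety of text and filter by a range of criteria. To build such advanced queries, you need a way to combine multiple queries into a single query.

You can do this with the bool query. It combines multiple queries into whatever Boolean combination you need, and accepts the following parameters: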

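must
- Documents must match these clauses to be included.
must_not
- Documents must not match these clauses to be included.
should
- If any of these clauses match, the _score increases; otherwise they have no effect. They are mainly used to refine the relevance score of each document.
filter
- Must match, but runs in non-scoring, filtering mode. These clauses contribute nothing to the score; they only include or exclude documents according to the filter criteria.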

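Since this is the first query we have seen that contains other queries, it is worth discussing how relevance scores are combined. Each sub-query computes the relevance score of a document independently. Once those scores have been computed, the bool query merges them and returns a single score that represents the whole Boolean operation.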
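The way a bool query combines clauses can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's implementation: the function name bool_score and its arguments are hypothetical, and each must or should clause is represented by the score it would give one document (None meaning that the clause did not match).

```python
from typing import Optional, Sequence


def bool_score(
    must: Sequence[Optional[float]] = (),
    should: Sequence[Optional[float]] = (),
    filter_matches: Sequence[bool] = (),
    must_not_matches: Sequence[bool] = (),
) -> Optional[float]:
    """Return the combined score for one document, or None if it is excluded."""
    # Every must and filter clause has to match; no must_not clause may match.
    if any(score is None for score in must):
        return None
    if not all(filter_matches):
        return None
    if any(must_not_matches):
        return None
    matched_should = [score for score in should if score is not None]
    # Without any must or filter clause, at least one should clause has to match.
    if should and not must and not filter_matches and not matched_should:
        return None
    # Only must and should clauses contribute to the score; they are added up.
    return float(sum(must) + sum(matched_should))


print(bool_score(must=[1.0], filter_matches=[True]))  # 1.0
print(bool_score(filter_matches=[True]))              # 0.0
print(bool_score(must=[0.5], should=[0.25, None]))    # 0.75
```

The filter and must_not arguments are plain booleans because those clauses only decide whether a document is kept; they never change its score.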

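The following query matches documents that satisfy all of these conditions:

The title field contains at least one of the words how, go, to (a match query combines its terms with OR by default).
The content field contains the word Elasticsearch.
The status field contains the exact term published.
The publish_date field contains a date from 1 January 2015 onwards.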

GET /_search
{
  "query": { 
    "bool": { 
      "must": [
        { "match": { "title":   "how go to"        }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [ 
        { "term":  { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015/01/01" }}}
      ]
    }
  }
}

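The following query looks for documents whose title field matches how to make millions and that are not tagged as spam. Documents that are tagged starred, or dated 2014 or later, rank higher than the others. Documents that satisfy both rank higher still: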

GET /_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "how to make millions" }},
      "must_not": { "match": { "tag":   "spam" }},
      "should": [
        { "match": { "tag": "starred" }},
        { "range": { "date": { "gte": "2014/01/01" }}}
      ]
    }
  }
}

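To try these queries, first index a test document. Note that with the default dynamic mapping its tag value, not spam, is analyzed into the tokens not and spam, so the must_not clause of the second query excludes this document; change the tag if you want that query to return it.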

POST /blog/_doc
{
  "title": "how Search go to make millions",
  "content": "content Elasticsearch",
  "tag":  "not spam",
  "status":  "published",
  "publish_date":  "2015/02/01",
  "date":  "2015/01/01"
}

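About compound queries

Compound queries wrap other compound or leaf queries, either to combine their results and scores, to change their behavior, or to switch from query context to filter context.

Scoring

The following bool query contains a match_all query, which assigns a score of 1.0 to all documents.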

GET _search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "status": "published"
        }
      }
    }
  }
}

Or, with curl:

curl -XGET "http://localhost:9200/_search" -H 'Content-Type: application/json' -d'{  "query": {    "bool": {      "must": {        "match_all": {}      },      "filter": {        "term": {          "status": "published"        }      }    }  }}'

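The constant_score query below behaves in the same way as the bool example above (it just filters on a different status value). A constant_score query assigns a score of 1.0 to every document matched by the filter.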

GET _search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "status": "active"
        }
      }
    }
  }
}

Because no scoring query is specified, the following query assigns a score of 0 to all documents:

GET _search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "status": "published"
        }
      }
    }
  }
}

Boosting query

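Returns documents that match a positive query, while reducing the relevance score of documents that also match a negative query.
You can use the boosting query to demote certain documents without excluding them from the search results.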

GET /_search
{
    "query": {
        "boosting" : {
            "positive" : {
                "term" : {
                    "text" : "apple"
                }
            },
            "negative" : {
                 "term" : {
                     "text" : "pie tart fruit crumble tree"
                }
            },
            "negative_boost" : 0.5
        }
    }
}
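The scoring rule of the boosting query can be modelled as below. This is only a sketch of the documented behavior, not Elasticsearch code; the function name boosting_score and its parameters are made up for illustration.

```python
from typing import Optional


def boosting_score(
    positive_score: Optional[float],
    negative_matches: bool,
    negative_boost: float,
) -> Optional[float]:
    """Model of the boosting query for a single document."""
    # A document that does not match the positive query is not returned at all.
    if positive_score is None:
        return None
    # A negative match multiplies the score once; the document stays in the results.
    if negative_matches:
        return positive_score * negative_boost
    return positive_score


print(boosting_score(0.8, negative_matches=False, negative_boost=0.5))  # 0.8
print(boosting_score(0.8, negative_matches=True, negative_boost=0.5))   # 0.4
```

Because the multiplication happens once per document, matching more of the negative terms does not lower the score any further in this model.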

Test data:

POST /blog/_doc
{
  "text": "how to search pie and tree"
}

POST /blog/_doc
{
  "text": "how to search apple and pie and tree"
}

POST /blog/_doc
{
  "text": "how to search apple and apple tree"
}

POST /blog/_doc
{
  "text": "how to search apple and apple bar"
}

POST /blog/_doc
{
  "text": "my name is apple"
}

POST /blog/_doc
{
  "text": "how to search apple and pie and tree pie tart"
}

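Result:
Only the five documents that contain apple are returned. Note that the negative clause has no effect in this run: the two top hits have the same score even though one of them contains tree. A term query does not analyze its input, so it looks for the single token pie tart fruit crumble tree, which does not exist in an analyzed text field. The score differences come only from how often apple occurs and how long the field is. To actually demote such documents, the negative clause would need a match query; each document matching it would then have its score multiplied by negative_boost (0.5) once, no matter how many of the negative words it contains.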

{
  "took" : 922,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 0.9642221,
    "hits" : [
      {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "f78jg28BdESjTDDpG97d",
        "_score" : 0.9642221,
        "_source" : {
          "text" : "how to search apple and apple tree"
        }
      },
      {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "gL8jg28BdESjTDDpKd7F",
        "_score" : 0.9642221,
        "_source" : {
          "text" : "how to search apple and apple bar"
        }
      },
      {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "gb8jg28BdESjTDDpNN5f",
        "_score" : 0.8504159,
        "_source" : {
          "text" : "my name is apple"
        }
      },
      {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "fr8jg28BdESjTDDpFd5h",
        "_score" : 0.6669828,
        "_source" : {
          "text" : "how to search apple and pie and tree"
        }
      },
      {
        "_index" : "blog",
        "_type" : "_doc",
        "_id" : "gr8lg28BdESjTDDpsN4K",
        "_score" : 0.6020521,
        "_source" : {
          "text" : "how to search apple and pie and tree pie tart"
        }
      }
    ]
  }
}

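Constant score query

Wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value.
In the following example, every matching document gets a score of 1.2: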

GET /_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                "term" : { "text" : "apple"}
            },
            "boost" : 1.2
        }
    }
}

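Disjunction max query

Returns documents that match one or more of the wrapped queries. If a document matches several of them, it gets the highest score from any matching clause, plus tie_breaker times the scores of the other matching clauses. The example uses match clauses so that the analyzed title and body fields of the test data can be matched; a term query for the two-word string Quick pets would not match any single token in a text field.

GET /_search
{
    "query": {
        "dis_max" : {
            "queries" : [
                { "match" : { "title" : "Quick pets" }},
                { "match" : { "body" : "Quick pets" }}
            ],
            "tie_breaker" : 0.7
        }
    }
}

Test data:
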
POST /blog/_doc
{
  "title": "Quick pets",
  "body": "how to search Quick pets"
}

POST /blog/_doc
{
  "title": "Quick pets",
  "body": "how to search "
}

POST /blog/_doc
{
  "title": "Quick pets",
  "body": "how to search pets"
}

POST /blog/_doc
{
  "title": "how to search",
  "body": "how to search pets"
}

POST /blog/_doc
{
  "title": "how to search",
  "body": "how to search Quick pets"
}
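How a dis_max query turns its clause scores into one score can be sketched as follows. The helper dis_max_score is hypothetical and only models the documented rule: take the best clause score, then add tie_breaker times the scores of the other matching clauses.

```python
from typing import Optional, Sequence


def dis_max_score(
    clause_scores: Sequence[Optional[float]],
    tie_breaker: float = 0.0,
) -> Optional[float]:
    """Model of dis_max scoring for one document (None = clause did not match)."""
    matched = [score for score in clause_scores if score is not None]
    # A document has to match at least one of the wrapped queries.
    if not matched:
        return None
    best = max(matched)
    # The other matching clauses only count for a fraction of their score.
    return best + tie_breaker * (sum(matched) - best)


# Matches in both title and body, compared with a match in title only.
print(dis_max_score([1.0, 0.5], tie_breaker=0.7))
print(dis_max_score([1.0, None], tie_breaker=0.7))
```

With tie_breaker left at 0.0 only the best clause counts; with 1.0 the result is the plain sum of all matching clauses.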

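Exact-value queries

When looking up exact values, we use filters. Filters matter because they are very fast: they do not compute relevance (the whole scoring phase is skipped) and they are easy to cache.
term query with numbers
We start with the term query, the one used most often. It can handle numbers, Booleans, dates, and text.
Prepare some data: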

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

We want to find all products with a certain price. Anyone with a relational database background will know SQL; expressed in SQL, the request looks like this:

SELECT document
FROM   products
WHERE  price = 20

In the Elasticsearch query DSL, we can use a term query to do the same thing. The term query looks for the exact value we specify.

GET /products/_search
{
    "query" : {
        "term" : {
            "price" : 20
        }
    }
}

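Usually, when looking for an exact value, we do not want the query to be scored; we only want documents to be included or excluded. So we use a constant_score query to run the term query in non-scoring mode and apply a uniform score of 1.

The final combination is a constant_score query that contains a term query: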

GET /products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "price" : 20
                }
            }
        }
    }
}

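A query placed inside the filter clause is not scored for relevance, so every result gets the default score of 1.
As mentioned at the start of this section, the term query can match strings as well as numbers. To find the product with a specific UPC ID, the SQL would be: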

SELECT product
FROM products
WHERE productID = "XHDK-A-1293-#fJ3"

Converted to the query DSL, again with a term query, it looks like this:

GET /products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

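But there is a catch: we do not get the expected result. Why? The problem is not the term query but the way the data was indexed. With the analyze API we can see that the UPC code has been split into several smaller tokens: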

GET /products/_analyze
{
    "field": "productID",
    "text": "XHDK-A-1293-#fJ3"
}

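A few points to note:

Elasticsearch represents this UPC with 4 separate tokens instead of a single token.
All letters have been lowercased.
The hyphens and the hash sign ( # ) have been lost.
So when we use a term query to look for the exact value XHDK-A-1293-#fJ3, no document is found, because that value is not in the inverted index. As the analysis output shows, the index holds four tokens instead.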
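The effect can be reproduced with a rough stand-in for the analyzer. The function below is only an approximation written for this example (split on anything that is not an ASCII letter or digit, then lowercase); the real standard analyzer uses Unicode text segmentation rules, but for this particular UPC the outcome is the same.

```python
import re
from typing import List


def rough_standard_analyze(text: str) -> List[str]:
    """Approximate the standard analyzer for simple ASCII input."""
    # Keep runs of letters and digits, drop everything else, then lowercase.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


upc = "XHDK-A-1293-#fJ3"
tokens = rough_standard_analyze(upc)
print(tokens)         # ['xhdk', 'a', '1293', 'fj3']
# A term query compares its input with the indexed tokens as-is.
print(upc in tokens)  # False
```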

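Clearly this is not how we want ID codes, or any other exact values, to be handled.

To avoid the problem, we need to tell Elasticsearch that the field holds an exact value and must not be analyzed. In old versions this was the not_analyzed setting; since Elasticsearch 5.0 the field is mapped with the keyword type instead. With the default dynamic mapping, a string field also gets a productID.keyword sub-field, which a term query can target directly.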
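A minimal sketch of such a mapping, assuming Elasticsearch 7.x, where exact-value string fields are declared with the keyword type. The request bodies are built as Python dictionaries and printed as JSON so they can be pasted into the console; the variable names are illustrative, and the index would have to be deleted and re-created with this mapping before reindexing the products.

```python
import json

# Body for: PUT /products
# Assumption: the index does not exist yet (an existing field cannot change type).
mapping_body = {
    "mappings": {
        "properties": {
            # keyword fields are indexed as a single, unanalyzed token.
            "productID": {"type": "keyword"}
        }
    }
}

# Body for: GET /products/_search
# With the keyword mapping, the whole UPC is one token and term can match it.
query_body = {
    "query": {
        "constant_score": {
            "filter": {"term": {"productID": "XHDK-A-1293-#fJ3"}}
        }
    }
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(query_body, indent=2))
```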

Combining filters

For example, how would we express the following SQL in Elasticsearch?

SELECT product
FROM products
WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3")
AND (price != 30)

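In this case we need a bool filter. It is a compound filter that accepts several other filters as arguments and combines them into various Boolean (logical) combinations.

must
- All of the clauses must match; equivalent to AND.
must_not
- None of the clauses may match; equivalent to NOT.
should
- At least one of the clauses must match; equivalent to OR.

Note that should behaves differently depending on its context. When the bool query also has a must or filter clause, the should clauses only affect the score. When there is no must or filter clause, as in the query below, at least one should clause has to match, so they act as an OR.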

GET /products/_search
{
    "query" : {
        "bool" : {
            "should" : [
                { "term" : { "price" : 20 }},
                { "term" : { "productID" : "XHDK-A-1293-#fJ3" }}
            ],
            "must_not" : {
                "term" : { "price" : 30 }
            }
        }
    }
}
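To check which documents the SQL condition selects, the same logic can be run over the four sample products in plain Python. This is only a model of the Boolean logic; it compares productID as one exact string, which corresponds to querying an unanalyzed (keyword) field rather than the analyzed text field.

```python
products = [
    {"_id": 1, "price": 10, "productID": "XHDK-A-1293-#fJ3"},
    {"_id": 2, "price": 20, "productID": "KDKE-B-9947-#kL5"},
    {"_id": 3, "price": 30, "productID": "JODL-X-1937-#pV7"},
    {"_id": 4, "price": 30, "productID": "QQPX-R-3956-#aD8"},
]


def matches(doc: dict) -> bool:
    """(price = 20 OR productID = 'XHDK-A-1293-#fJ3') AND price != 30"""
    should = [doc["price"] == 20, doc["productID"] == "XHDK-A-1293-#fJ3"]
    must_not = [doc["price"] == 30]
    # At least one should clause has to match, and no must_not clause may match.
    return any(should) and not any(must_not)


hits = [doc["_id"] for doc in products if matches(doc)]
print(hits)  # [1, 2]
```

Against the analyzed productID field, the second should clause would not match, and only document 2 would come back.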

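Finding multiple exact values
The term query is useful for finding a single value, but often we want to search for several values. How do we find documents whose price field is $20 or $30? The terms query (note the s) takes an array of values and matches documents that contain any of them: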

GET /products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "terms" : { 
                    "price" : [20, 30]
                }
            }
        }
    }
}

Posted 2020-01-09 15:10 by 昕友软件开发
