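Elasticsearch Notes: Scripting (script)

Use cases

Scripts handle complex business requirements such as custom fields, custom scoring, custom updates, and custom aggregations.

Drawbacks

Performance. The performance tuning section of the official documentation states explicitly that using scripts degrades performance.
Unless a script is really necessary, avoid it and use another mechanism instead. For example:
A prefix search implemented with a script: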

POST seats/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['theatre'].value.startsWith('Down')"
          }
        }
      }
    }
  }
}

The same search can be done with a prefix query, which improves performance about fivefold.
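
A minimal sketch of the prefix alternative. It assumes theatre is mapped as a keyword field; the mapping is not shown in this post, so that is an assumption:

```console
POST seats/_search
{
  "query": {
    "bool": {
      "filter": {
        "prefix": {
          "theatre": "Down"
        }
      }
    }
  }
}
```

The prefix query works on the indexed terms directly, so no script has to run for each document.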

Script template

"script": {
    "lang":   "...",  
    "source" | "id": "...", 
    "params": { ... } 
}

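lang: the scripting language; defaults to painless.
source | id: either source, which holds an inline script, or id, which refers to a stored script.
params: named parameters passed to the script as variables.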

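Introduction to Painless scripting

Painless is a simple, secure scripting language designed specifically for use with Elasticsearch. Since ES 5.0 it has been the default scripting language for Elasticsearch and can safely be used for inline and stored scripts.
Features of Painless:
Performance: Painless scripts run several times faster than the alternatives (including Groovy).
Security: an allowlist restricts which functions and fields a script can access, avoiding potential security holes.
Optional typing: variables and parameters can use explicit types or the dynamic def type.
Easy to learn: the syntax extends a subset of Java's syntax and is compatible with Groovy-style scripting-language features.
Optimizations: designed by Elastic specifically for Elasticsearch scripting.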

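Example: updating a document with a script

One way to change a field is to read the whole document, modify the field, and write the document back. That is cumbersome; a Painless script can modify the field directly: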

POST index9/_update/1
{
  "script": {
    "source": "ctx._source.description='desc1'"
  }
}

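Here source holds our Painless code. We wrote only a small amount of code directly in the DSL; such a script is called inline. The script accesses the description field of _source directly through ctx._source.description, so the description is modified programmatically.
This works, but a script has to be compiled the first time Elasticsearch sees its source; the compiled script is then cached for later use. Because the new value is hard-coded in the source above, changing the value produces a different script that must be compiled again. A better approach is: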

POST index9/_update/1
{
  "script": {
    "source": "ctx._source.description=params.desc",
    "params":{
      "desc":"new desc"
    }
  }
}

This way the script source never changes and is compiled only once. On later calls, only the values in params need to change.
In Elasticsearch:

    "script": {
      "source": "ctx._source.num_of_views += 2"
    }

and

    "script": {
      "source": "ctx._source.num_of_views += 3"
    }

are treated as two different scripts that must be compiled separately, so the best approach is to pass the value in through params.
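
A sketch of the parameterized form of the increment above. The index name my_index, the document id 1, and the parameter name count are placeholders chosen for illustration:

```console
POST my_index/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.num_of_views += params.count",
    "params": {
      "count": 2
    }
  }
}
```

Sending the same request with "count": 3 reuses the compiled script, because only params differs.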

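Stored scripts

Scripts can be stored in and retrieved from the cluster state using the _scripts endpoint.
The following example uses a stored script located at /_scripts/{id}.
First, create a script named calculate-score in the cluster state: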

POST _scripts/calculate-score
{
  "script": {
    "lang": "painless",
    "source": "Math.log(_score * 2) + params.my_modifier"
  }
}

The script can be retrieved with:

GET _scripts/calculate-score

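A stored script is used by specifying its id. Because this script returns a number rather than a boolean, it is used here in a script_score query:

GET _search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "id": "calculate-score",
        "params": {
          "my_modifier": 2
        }
      }
    }
  }
}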

Delete the script:

DELETE _scripts/calculate-score

The update example above, implemented with a stored script

# Create the stored script
POST _scripts/update_desc
{
  "script":{
    "lang": "painless",
    "source": "ctx._source.description=params.desc"
  }
}
# Update the field using the stored script
POST index9/_update/1
{
  "script": {
    "id":"update_desc",
    "params":{
      "desc":"new desc22"
    }
  }
}

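Accessing fields in the source

The syntax Painless uses to access field values depends on the context. Understanding the context is essential: the fields contained in ctx differ from one API to another. The cases below are analyzed one by one. Elasticsearch has many different Painless contexts; as the official list linked below shows, they include ingest processor, update, update by query, sort, filter, and more.

Ingest pipeline context: access fields through ctx, as ctx.field_name.
update, update_by_query, and reindex contexts: use ctx._source.field_name.
Search and aggregation contexts: use doc['field_name'].

Elasticsearch has many more contexts than these. Recommended further reading:
Official context examples: https://www.elastic.co/guide/en/elasticsearch/painless/7.15/painless-contexts.html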

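What about more complex script logic?
Case: extracting a substring from a string
For example, to take a substring of a string, can Java's substring still be used?
If the script runs in an ingest processor, how do you check whether it is officially supported?
Step 1: find the shared API.
This is the entry point to the detailed API documentation:
https://www.elastic.co/guide/en/elasticsearch/painless/master/painless-api-reference-shared.html
Step 2: find String.
The page checked here is from version 7.13; earlier versions such as 7.2 still listed a String class, which 7.13 no longer does.
Step 3: find substring.
Step 4: follow the link to the Java API.
This leads to the Oracle website.
Reference: https://blog.csdn.net/laoyang360/article/details/121738408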
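
Once substring is confirmed to be available, a quick way to try it in the ingest context is the pipeline simulate API. This is only a sketch: the field names message and message_prefix and the parameter len are hypothetical, chosen for illustration:

```console
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": "ctx.message_prefix = ctx.message.substring(0, params.len)",
          "params": {
            "len": 5
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "hello world"
      }
    }
  ]
}
```

The simulated document should come back with message_prefix set to "hello". As in Java, substring throws an exception if message is shorter than len, so a real pipeline would need a length check.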

Painless script examples

Using a script in a pipeline to add a new field

This pipeline creates a new field, field_c, whose value is the sum of field_a and field_b multiplied by 2.

 PUT _ingest/pipeline/add_field_c
    {
      "processors": [
        {
          "script": {
            "lang": "painless",
            "source": "ctx.field_c = (ctx.field_a + ctx.field_b) * params.value",
            "params": {
              "value": 2
            }
          }
        }
      ]
    }

# Specify the pipeline when indexing a document
PUT test_script/_doc/1?pipeline=add_field_c
    {
      "field_a": 10,
      "field_b": 20
    }

Using a script in a pipeline to modify metadata such as _index and _type

   PUT _ingest/pipeline/my_index
    {
        "processors": [
          {
            "script": {
              "source": """
                ctx._index = 'my_index';
                ctx._type = '_doc';
              """
            }
          }
        ]
    }

Using the pipeline above, we can try indexing a document into any_index:

    PUT any_index/_doc/1?pipeline=my_index
    {
      "message": "text"
    }

The response is:

    {
      "_index": "my_index",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "result": "created",
      "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
      },
      "_seq_no": 89,
      "_primary_term": 1
    }

In other words, the document is actually stored in my_index, not in any_index.

Using a script with reindex

    POST _reindex
    {
      "source": {
        "index": "blogs"
      },
      "dest": {
        "index": "blogs_fixed"
      },
      "script": {
        "source": """
          if (ctx._source.category == "") {
            ctx._source.category = "None"
          }
        """
      }
    }

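In the example above, documents whose category is an empty string are written with "None" during the reindex. The two examples show the difference: in a pipeline we operate directly on ctx.field, whereas in reindex (as in update) we operate on fields under ctx._source. This is the difference between contexts mentioned earlier.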

Adding to and removing from a list

    PUT test/_doc/1
    {
        "counter" : 1,
        "tags" : ["red"]
    }

You can use an update script to add a tag to the tags list (it is just a list, so the tag is added even if it already exists).
Add a tag:

    POST test/_update/1
    {
        "script" : {
            "source": "ctx._source.tags.add(params.tag)",
            "lang": "painless",
            "params" : {
                "tag" : "blue"
            }
        }
    }

Remove a tag:

    POST test/_update/1
    {
      "script": {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) }",
        "lang": "painless",
        "params": {
          "tag": "blue"
        }
      }
    }
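
Because tags is a plain list, adding the same tag twice produces a duplicate. A sketch of one way to avoid that, reusing the test index and tag parameter from the examples above: the script adds the tag only when it is absent and otherwise sets ctx.op to 'none', which tells the update API to leave the document unchanged.

```console
POST test/_update/1
{
  "script": {
    "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'none' } else { ctx._source.tags.add(params.tag) }",
    "lang": "painless",
    "params": {
      "tag": "blue"
    }
  }
}
```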

Custom fields

Return values for fields that are not defined in the original mapping.
1.1. Return twice the value of the id field as idplus.

GET index6/_search
{
  "script_fields": {
    "idplus": {
      "script": {
        "lang": "expression",
        "source": "doc['id'] * multiplier",
        "params": {
          "multiplier": 2
        }
      }
    }
  }
}

Result:

{
... (omitted)
    "hits" : [
      {
        "_index" : "index6",
        "_type" : "_doc",
        "_id" : "2001",
        "_score" : 1.0,
        "fields" : {
          "idplus" : [
            4002.0
          ]
        }
      },
... (omitted)
    ]
}

1.2. Return the year, month, or day from a date field.

GET hockey/_search
{
  "script_fields": {
    "birth_year": {
      "script": {
        "source": "doc.born.value.year"
      }
    }
  }
}

Custom scoring

GET my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text": "quick brown fox"
        }
      },
      "script_score": {
        "script": {
          "lang": "expression",
          "source": "_score * doc['popularity']"
        }
      }
    }
  }
}
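
The request above uses the expression language, where doc['popularity'] evaluates directly to the field's numeric value. For comparison, a sketch of the same scoring written in Painless, where the value has to be read through .value:

```console
GET my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text": "quick brown fox"
        }
      },
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "_score * doc['popularity'].value"
        }
      }
    }
  }
}
```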

Custom updates

1.1. _update: set field values from script parameters.

POST index6/_update/1
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.firstName = params.firstName;
      ctx._source.lastName = params.lastName
    """,
    "params": {
      "firstName": "aa",
      "lastName": "bb"
    }
  }
}

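1.2. _update_by_query: append matched to the last field of every document whose last field contains b. Note the regex: /b/ with the =~ operator matches b anywhere in the string; use /^b/ to require it at the start.

POST hockey/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": """
      if (ctx._source.last =~ /b/) {
        ctx._source.last += "matched";
      } else {
        ctx.op = "noop";
      }
    """
  }
}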

Custom reindex

An Elasticsearch certification exam question:
index_a contains some documents. Create index_b and use the reindex API to index the documents of index_a into index_b.
Requirements:
1) Add an integer field whose value is the character length of field_x in index_a.
2) Add an array field whose value is the set of words in field_y.
(field_y is a group of words separated by spaces, for example "foo bar"; after indexing into index_b it must become ["foo", "bar"].)

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "index_a"
  },
  "dest": {
    "index": "index_b"
  },
  "script": {
    "source": "ctx._source.parts = / /.split(ctx._source.field_y); ctx._source.tag = ctx._source.field_x.length();"
  }
}

Custom aggregations

GET /_search
{
  "aggs": {
    "genres": {
      "terms": {
        "script": {
          "source": "doc['genre'].value",
          "lang": "painless"
        }
      }
    }
  }
}

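Checking for missing fields

doc['field'].value throws an exception if the field is missing from the document.
To check whether a document is missing a value, call doc['field'].size() == 0.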
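
A sketch of the size() check used as a guard in a script field, reusing the hockey index and goals field that appear elsewhere in this post. The script field name goals_value and the fallback parameter missing are hypothetical, and the sketch assumes goals exists in the index mapping:

```console
GET hockey/_search
{
  "script_fields": {
    "goals_value": {
      "script": {
        "lang": "painless",
        "source": "doc['goals'].size() == 0 ? params.missing : doc['goals'].value",
        "params": {
          "missing": -1
        }
      }
    }
  }
}
```

Documents without a goals value return the fallback instead of failing the whole request.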

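Debugging scripts

Currently, the best way to debug embedded scripts is to throw exceptions at chosen places. You can throw your own exceptions (throw new Exception('whatever')), but Painless's sandbox prevents you from accessing useful information such as the type of an object. Painless therefore provides a utility method, Debug.explain, which throws the exception for you. For example, you can use _explain to explore the context available to a script query.
Index a document: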

PUT /hockey/_doc/1?refresh
{"first":"johnny","last":"gaudreau","goals":[9,27,1],"assists":[17,46,0],"gp":[26,82,1]}

Inspect goals:

GET /hockey/_search
{
   "query": {
    "script": {
      "script": "Debug.explain(doc.goals)"
    }
  }
}

The response shows that the type of goals is org.elasticsearch.index.fielddata.ScriptDocValues$Longs.
You can use the same trick to see that _source is a LinkedHashMap in the _update API:

POST /hockey/_update/1
{
   "script": "Debug.explain(ctx._source)"
}
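
The Debug.explain(doc.goals) check can also be run through the _explain API against a single document instead of a search. A sketch, using the document with id 1 indexed above:

```console
POST /hockey/_explain/1
{
  "query": {
    "script": {
      "script": "Debug.explain(doc.goals)"
    }
  }
}
```

The request fails on purpose: the error response carries the class of the object passed to Debug.explain.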

References:
https://www.cnblogs.com/sanduzxcvbnm/p/12083590.html
https://blog.csdn.net/laoyang360/article/details/121738408

posted @ 2021-03-08 23:39 .Neterr
