ELK: Logstash 2.2 mutate plugin [translation + practice]
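In this article
- Syntax
- Test data
- Configuration options
The mutate plugin performs transformations on fields, including renaming, removing, replacing, and modifying them. It is one of the most commonly used plugins.
For example:
- You have already used a Grok expression to put the contents of a Tomcat log into separate fields, and you want to convert the status code, byte size, or response time to integers;
- You have already used a regular expression to put the log contents into separate fields, but the field values come in mixed case, which is clearly not much use for full-text search in Elasticsearch. You can use this plugin to convert the field contents to lowercase.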
Moved to: http://www.bdata-cap.com/newsinfo/1712678.html
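Syntax
The plugin's options must be wrapped in a mutate block, as shown below:
mutate {}
The available configuration options are listed in the following table:
Setting | Input type | Required | Default value |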
add_field | hash | No | {} |
add_tag | array | No | [] |
convert | hash | No | |
gsub | array | No | |
join | hash | No | |
lowercase | array | No | |
merge | hash | No | |
periodic_flush | boolean | No | false |
remove_field | array | No | [] |
remove_tag | array | No | [] |
rename | hash | No | |
replace | hash | No | |
split | hash | No | |
strip | array | No | |
update | hash | No | |
uppercase | array | No | |
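Of these, add_field, remove_field, add_tag, and remove_tag are available in every Logstash filter plugin. They take effect only after the filter has been applied successfully. Although Logstash calls these plugins filters, they do more than filter.
The purpose of a tag is to mark an event when, while processing its fields, you expect to do further processing later. Each event has a tags array that holds the tags produced along the way, whether they were generated by Logstash itself or added by you. For example, if you parse a log line with grok and the match fails, Logstash adds a _grokparsefailure tag on its own. In the output stage you can then skip the log lines that failed to parse.
The field options, by contrast, operate on fields. For example, you may want to create a new field from existing ones. More on this later.
You will also notice that every option in the table above is either a verb or a verb-object phrase. Each option corresponds to an operation in the plugin, which is written in Ruby, and whatever follows the "=>" is its arguments (if you were writing the program, you would do the same). The first argument is the field you want the operation to act on; the arguments from the second one onward are the operation's specific parameters.
What is a field? Say you want to parse a Tomcat log. After splitting one access log line, you put the client IP, byte size, response time, and so on into named variables. Each of those variables is a field.
The options are described one by one below.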
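As a rough sketch of the tag-based handling described above, the output block below only prints events that do not carry the _grokparsefailure tag. It assumes the stdout output used in this article's examples; substitute whatever output you actually use.

```
output {
  # skip events that grok could not parse
  if "_grokparsefailure" not in [tags] {
    stdout {
      codec => rubydebug
    }
  }
}
```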
Test data
Suppose we have the following Tomcat access log:
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/goLogin" "" 8080 200 1692 23 "http://10.1.8.193:8080/goMain" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/js/common/jquery-1.10.2.min.js" "" 8080 304 - 67 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/css/common/login.css" "" 8080 304 - 75 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/js/system/login.js" "" 8080 304 - 53 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
It was produced by the following Tomcat configuration:
<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
prefix="localhost_access_log." suffix=".txt"
pattern="%h %l %u %t %m &quot;%U&quot; &quot;%q&quot; %p %s %b %D &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;" />
If we parse this log with the following Grok expression:
%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}
we get the following result:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T08:26:07.794Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
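Note the data types of the fields after the log line has been split. The port, statusCode, bytes, and reqTime fields should be (or at least had better be) numbers, but for now they are strings. Conversion is covered later; the examples below all build on this result.
Configuration options
add_field
- The value is a hash, that is, key-value pairs, for example add_field => {"field1"=>"value1" "field2"=>"value2"}.
- The default value is an empty hash: {}
Adds new fields to the event.
Example: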
input {
stdin {
}
}
filter {
grok {
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]
}
mutate {
add_field=>{
"SayHi"=>"Hello , %{clientip}"
}
}
}
output{
stdout{
codec=>rubydebug
}
}
Note the mutate block. Parsing the Tomcat access log above with this configuration gives the following result:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T04:52:02.031Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"SayHi" => "Hello , 192.168.6.25"
}
You can see an extra SayHi field. Its name is hard-coded here, but it can also be dynamic. If you change
"SayHi"=>"Hello , %{clientip}"
to:
"another_%{clientip}"=>"Hello , %{clientip}"
you will see the following result:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T06:38:04.427Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"another_192.168.6.25" => "Hello , 192.168.6.25"
}
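This example is a bit contrived, but it shows that the values of existing fields can be used to build both the name and the value of a new field. The example above adds only one field; you can also add several: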
add_field=>{
"another_%{clientip}"=>"Hello , %{clientip}"
"another_%{http_method}"=>"Hello, %{http_method}"
}
add_tag
- The value is an array
- The default value is an empty array: []
Adds new tags to the event.
Example:
mutate {
add_tag=>[
"foo_%{clientip}"
]
}
You will see the following result:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T06:48:43.278Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"tags" => [
[0] "foo_192.168.6.25"
]
}
As with add_field, you can add several tags at once.
Note that add_tag takes an array [], not a hash {}.
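For instance, a minimal sketch that adds two tags in one go. The tag name tomcat_access is made up for illustration:

```
mutate {
  # array elements are separated by commas
  add_tag => ["foo_%{clientip}", "tomcat_access"]
}
```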
convert
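- The value is a hash
- No default value
Converts a field's value to a different data type.
When converting to boolean, the accepted input values are:
- True: true, t, yes, y, and 1
- False: false, f, no, n, and 0
You can also convert to integer, float, and string.
Example: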
mutate {
#convert=>["reqTime","integer","statusCode","integer","bytes","integer"]
convert=>{"port"=>"integer"}
}
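convert can be written in two ways: as an array, whose elements are read in pairs, or as a hash. The configuration above gives the following result: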
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T09:06:25.360Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => 8080,
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
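Note:
- The value of the port field no longer has double quotes around it.
- The value types of mutate options are kept very simple: each is either a hash (key-value pairs) or an array. For example, in convert=>["reqTime","integer","statusCode","integer"] the elements are read in pairs: the first is the field and the second is the target data type. No nested or composite types are used. The author's intent seems to be simplicity. Complex data types may look convenient, but they come at a cost. Simple is fine as long as the convention is agreed on. Many Logstash plugins and their options work this way.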
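The hash form can also carry several fields at once. The sketch below is an assumed variation on the example above: it converts port and statusCode to integer and reqTime to float. Entries in a Logstash hash are separated by whitespace, not commas.

```
mutate {
  convert => {
    "port" => "integer"
    "statusCode" => "integer"
    "reqTime" => "float"
  }
}
```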
gsub
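- The value is an array
- No default value
String substitution, using either a regular expression or a plain string. It only works on strings; if the value is not a string, it does nothing and raises no error.
The value is an array read in groups of three: the field name, the string (or regular expression) to match, and the replacement string.
Example: when parsing Tomcat logs, the byte size of a resource may be "-". It therefore needs to be replaced with 0 and then converted to a number with convert.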
input {
stdin {
}
}
filter {
grok {
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]
}
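mutate {
# replace a bytes value that is exactly "-" with "0"
gsub=>["bytes","^-$","0"]
}
# convert sits in its own mutate block so that it runs after gsub
mutate {
convert=>["port","integer","reqTime","integer","statusCode","integer","bytes","integer"]
}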
}
output{
stdout{
codec=>rubydebug
}
}
The result is:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/js/common/jquery-1.10.2.min.js\" \"\" 8080 304 - 67 \"http://10.1.8.193:8080/goLogin\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T09:17:21.745Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/js/common/jquery-1.10.2.min.js\"",
"request_query" => "\"\"",
"port" => 8080,
"statusCode" => 304,
"bytes" => 0,
"reqTime" => 67,
"referer" => "\"http://10.1.8.193:8080/goLogin\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
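gsub accepts several triplets in one array. As a sketch, the following removes the literal double quotes that the Grok expression leaves around request, referer, and userAgent in the sample data. The pattern is written as a single-quoted string so that the double quote needs no escaping.

```
mutate {
  gsub => [
    # field, pattern, replacement
    "request", '"', "",
    "referer", '"', "",
    "userAgent", '"', ""
  ]
}
```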
join
- The value is a hash
- No default value
Joins an array with a separator. If the field is not an array, it does nothing.
Example:
filter { mutate { join =>{"fieldname"=>","}}}
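To tie this to the sample data, here is a sketch that first splits clientip on "." and then joins the pieces back together with "-". The two steps are placed in separate mutate blocks so that their order is explicit. With the sample event, clientip should end up as 192-168-6-25.

```
filter {
  mutate {
    # "192.168.6.25" becomes ["192", "168", "6", "25"]
    split => { "clientip" => "." }
  }
  mutate {
    # the array is joined back into one string
    join => { "clientip" => "-" }
  }
}
```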
lowercase and uppercase
- The value is an array
- No default value
Converts a string to lowercase or uppercase.
Example:
filter {
mutate {
lowercase =>["fieldname"]}}
Example:
filter {
mutate {
uppercase =>["fieldname"]}}
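Applied to the sample data, a sketch might look like this. With the sample event, http_method changes from GET to get.

```
filter {
  mutate {
    lowercase => ["http_method"]
  }
}
```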
merge
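- The value is a hash
- No default value
Merges two fields that are arrays or hashes. There are three cases; where the merge works, the result is an array:
- An array and a string can be merged
- A string and a string can be merged
- An array and a hash cannot be merged
Example: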
mutate {
add_field=>{"arr_clientip"=>"%{clientip}"}
add_field=>{"arrmstr_clientip"=>"%{clientip}"}
add_field=>{"arrmarr_clientip"=>"%{clientip}"}
#merge=>{"merge_clientip"=>"clientip"}
}
mutate {
split=>{"arr_clientip"=>"."}
split=>{"arrmstr_clientip"=>"."}
split=>{"arrmarr_clientip"=>"."}
}
mutate {
merge=>{"arrmstr_clientip"=>"clientip"}
merge=>{"arrmarr_clientip"=>"arr_clientip"}
}
The value of the field after => is merged into the field before it.
The result is:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-18T02:53:35.671Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"arr_clientip" => [
[0] "192",
[1] "168",
[2] "6",
[3] "25"
],
"arrmstr_clientip" => [
[0] "192",
[1] "168",
[2] "6",
[3] "25",
[4] "192.168.6.25"
],
"arrmarr_clientip" => [
[0] "192",
[1] "168",
[2] "6",
[3] "25",
[4] "192",
[5] "168",
[6] "6",
[7] "25"
]
}
periodic_flush
- The value is a boolean
- The default value is false
Calls the filter's flush method at regular intervals. Optional.
remove_field
- The value is an array
- The default value is an empty array: []
Removes fields from the event.
Example: remove the message field.
mutate {
remove_field=>["message"]
}
The result is:
{
"@version" => "1",
"@timestamp" => "2016-05-18T02:04:16.879Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
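The message field is gone. It holds the raw log line, so keeping it means the log is stored twice: once before splitting and once after.
Of course, you can also remove several fields at once.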
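A minimal sketch of removing several fields in one go. Dropping identd and auth is only an illustrative choice, since both are "-" in the sample data:

```
mutate {
  remove_field => ["message", "identd", "auth"]
}
```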
remove_tag
- The value is an array
- The default value is an empty array: []
Removes tags from the event.
Example:
filter {
mutate {
remove_tag =>["foo_%{somefield}"]}}
You can also remove several tags at once:
filter {
mutate {
remove_tag =>["foo_%{somefield}","sad_unwanted_tag"]}}
rename
- The value is a hash
- No default value
Renames one or more fields.
Example:
input {
stdin {
}
}
filter {
grok {
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]
}
mutate {
rename=>{"clientip"=>"host"}
}
}
output{
stdout{
codec=>rubydebug
}
}
The result is:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-17T09:29:44.018Z",
"host" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
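In the Grok expression the client IP is called clientip, but mutate can rename it to host. Note that this overwrites the original host value (vcyber).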
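Since rename takes a hash, several fields can be renamed in one block. In the sketch below, the new names status_code and request_time are made up for illustration:

```
mutate {
  rename => {
    "statusCode" => "status_code"
    "reqTime" => "request_time"
  }
}
```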
replace
- The value is a hash
- No default value
Replaces the value of the given field with a new value.
Example:
input {
stdin {
}
}
filter {
grok {
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]
}
mutate {
replace=>{"message"=>"%{clientip}: My new Message."}
}
}
output{
stdout{
codec=>rubydebug
}
}
The result is:
{
"message" => "192.168.6.25: My new Message.",
"@version" => "1",
"@timestamp" => "2016-05-18T01:55:34.566Z",
"host" => "vcyber",
"clientip" => "192.168.6.25",
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
The value of the message field has changed.
split
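- The value is a hash
- No default value
Splits a string into an array using a separator string or character. It only works on string fields.
Example: split the client IP into an array on the period character.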
mutate {
split=>{"clientip"=>"."}
}
The result is:
{
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",
"@version" => "1",
"@timestamp" => "2016-05-18T01:58:40.687Z",
"host" => "vcyber",
"clientip" => [
[0] "192",
[1] "168",
[2] "6",
[3] "25"
],
"identd" => "-",
"auth" => "-",
"timestamp" => "24/Apr/2016:01:25:53 +0800",
"http_method" => "GET",
"request" => "\"/goLogin\"",
"request_query" => "\"\"",
"port" => "8080",
"statusCode" => "200",
"bytes" => "1692",
"reqTime" => "23",
"referer" => "\"http://10.1.8.193:8080/goMain\"",
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""
}
strip
- The value is an array
- No default value
Strips whitespace from the beginning and end of a field's value.
Example:
filter {
mutate {
strip =>["field1","field2"]}}
update
- The value is a hash
- No default value
Updates an existing field with a new value. If the field does not exist, no action is taken.
Example:
filter { mutate { update =>{"sample"=>"My new message"}}}
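As a sketch against the sample data, the block below rewrites message from two existing fields. The field name no_such_field is hypothetical and stands for a field that is absent from the event, so that entry should leave the event untouched rather than create a new field.

```
filter {
  mutate {
    update => {
      # message exists, so its value is overwritten
      "message" => "%{http_method} %{request}"
      # no_such_field does not exist, so nothing happens for it
      "no_such_field" => "never written"
    }
  }
}
```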