ES Character Filters & Token Filters
1. Character filters (official documentation)
A character filter preprocesses the text before it is handed to the tokenizer, stripping or rewriting unwanted characters. Elasticsearch ships with the following three character filters:
(1) HTML Strip (official documentation)
Strips HTML tags from the text. Its main parameter, escaped_tags, lists the tags that should be kept. Example:
PUT test_index { "settings": { "analysis": { "analyzer": { "custom_analyzer":{ //指定分词器 "tokenizer":"keyword", //指定分析器的字符串过滤器 "char_filter":"custom_char_filter" } }, //字符过滤器 "char_filter": { "custom_char_filter":{ //字符过滤器的类型 "type":"html_strip", //跳过过滤的html标签 "escaped_tags": [ "a" ] } } } } }
Test the filter:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
The result is:
{ "tokens" : [ { "token" : """this is address of baidu<a>baidu</a> baidu content """, "start_offset" : 0, "end_offset" : 56, "type" : "word", "position" : 0 } ] }
As the result shows, every HTML tag except <a> has been stripped.
(2) Mapping (official documentation)
Commonly used for sensitive-word masking. Example:
PUT test_index { "settings": { "analysis": { "analyzer": { "custom_analyzer":{ "tokenizer":"keyword", "char_filter":["custom_char_filter","custom_mapping_filter"] } }, "char_filter": { "custom_char_filter":{ "type":"html_strip", "escaped_tags": [ "a" ] }, "custom_mapping_filter":{ "type": "mapping", //当内容出现baidu或者is 全都用**替换 "mappings": [ "baidu=>**", "is=>**" ] } } } } }
Run the analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>"]
}
The result is:
{ "tokens" : [ { "token" : """th** ** address of **<a>**</a> ** content """, "start_offset" : 0, "end_offset" : 56, "type" : "word", "position" : 0 } ] }
On top of html_strip, the mapping filter masks the sensitive words baidu and is. Note that the replacement is a plain substring substitution, so the "is" inside "this" is masked as well (th**).
(3) Pattern Replace (official documentation)
Mainly used to replace structured content, i.e. anything that can be matched with a regular expression. Example:
PUT test_index { "settings": { "analysis": { "analyzer": { "custom_analyzer":{ "tokenizer":"keyword", "char_filter":["custom_char_filter","custom_mapping_filter","custom_pattern_replace_filter"] } }, "char_filter": { "custom_char_filter":{ "type":"html_strip", "escaped_tags": [ "a" ] }, "custom_mapping_filter":{ "type": "mapping", "mappings": [ "baidu=>**", "is=>**" ] }, "custom_pattern_replace_filter":{ "type":"pattern_replace", "pattern": "(\\d{3})\\d{4}(\\d{4})", "replacement": "$1****$2" } } } } }
Building on (1) and (2), custom_pattern_replace_filter is added to perform a regex replacement; here it masks the middle digits of a phone number.
Analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": ["this is address of baidu<a>baidu</a><p>baidu content</p>telphone:13311112222"]
}
The result is:
{ "tokens" : [ { "token" : """th** ** address of **<a>**</a> ** content telphone:133****2222""", "start_offset" : 0, "end_offset" : 76, "type" : "word", "position" : 0 } ] }
The phone number 13311112222 has been replaced with 133****2222.
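The article exercises the analyzer only through the _analyze API. As a minimal sketch (not from the original article), the same analyzer can also be attached to a field so the character filters run at index time; the index name test_index_masked and the content field below are illustrative choices:
PUT test_index_masked
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [ "custom_pattern_replace_filter" ]
        }
      },
      "char_filter": {
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      // hypothetical text field; it simply references the analyzer defined above by name
      "content": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
With this mapping, any phone number in content is masked before its tokens reach the inverted index.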
2. Token Filters (official documentation)
There are far too many token filters to cover here; see the official documentation for the full list. Below are a few commonly used ones.
(1) Synonym filter (synonym)
Step 1: create an analysis folder under the config directory of the Elasticsearch installation and add a synonym.txt file to it; repeat this on every node in the cluster. The file contents are as follows:
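The original file contents are not reproduced here. As a hedged sketch only: synonym.txt uses the standard Solr synonym format, one rule per line, and rules along the following lines would be consistent with the analyzer output shown further below (the exact word pairs in the original file are not known):
# Solr synonym format: "=>" rewrites the term(s) on the left into the term(s) on the right
啦啦啦 => 嘻嘻
呵呵 => 嘎嘎
啧啧 => 么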
Step 2: configure the index settings to point at the synonym file:
PUT test_index { "settings": { "analysis": { "analyzer": { "custom_analyzer":{ "tokenizer":"ik_max_word", "filter":["synonym"] } }, "filter": { "synonym":{ "type": "synonym", "synonyms_path": "analysis/synonym.txt" } } } } }
The ik tokenizer is used here; if you are not familiar with it, see the companion article on the ik Chinese tokenizer (ES 中文分词器ik).
After creating the index, run the analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "啦啦啦,呵呵呵,啧啧啧"
}
The result is:
{ "tokens" : [ { "token" : "嘻嘻", "start_offset" : 0, "end_offset" : 3, "type" : "SYNONYM", "position" : 0 }, { "token" : "嘻嘻", "start_offset" : 4, "end_offset" : 6, "type" : "SYNONYM", "position" : 1 }, { "token" : "嘎嘎", "start_offset" : 4, "end_offset" : 6, "type" : "SYNONYM", "position" : 1 }, { "token" : "嘎嘎", "start_offset" : 5, "end_offset" : 7, "type" : "SYNONYM", "position" : 2 }, { "token" : "么", "start_offset" : 8, "end_offset" : 10, "type" : "SYNONYM", "position" : 3 }, { "token" : "么", "start_offset" : 9, "end_offset" : 11, "type" : "SYNONYM", "position" : 4 }, { "token" : "么", "start_offset" : 9, "end_offset" : 11, "type" : "SYNONYM", "position" : 5 } ] }
In the result, every term covered by a synonym rule has been replaced with the corresponding term from synonym.txt, and the emitted tokens are typed as SYNONYM.
(2) Stop words (stop) (official documentation)
Stop words specified in the settings are removed from the token stream, so no inverted-index entries are created for them.
PUT test_index { "settings": { "analysis": { "analyzer": { "custom_analyzer":{ "tokenizer":"ik_max_word", "filter":["custom_stop_filter"] } }, "filter": { "custom_stop_filter":{ "type": "stop", "ignore_case": true, "stopwords": [ "and", "is","friend" ] } } } } }
Create the index with the settings above, then run the following analyze request:
GET test_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "You and me IS friend"
}
The result is:
{ "tokens" : [ { "token" : "you", "start_offset" : 0, "end_offset" : 3, "type" : "ENGLISH", "position" : 0 }, { "token" : "me", "start_offset" : 8, "end_offset" : 10, "type" : "ENGLISH", "position" : 1 } ] }
Note: instead of listing the words inline, you can also point the filter at a stop-word file path, similar to how the ik tokenizer is configured; see the official documentation for details.
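As a minimal sketch of that file-based variant, assuming a file named stopword.txt placed under config/analysis on every node (the directory and file name here are illustrative):
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["custom_stop_filter"]
        }
      },
      "filter": {
        "custom_stop_filter": {
          "type": "stop",
          "ignore_case": true,
          // path is resolved relative to the config directory; one stop word per line in the file
          "stopwords_path": "analysis/stopword.txt"
        }
      }
    }
  }
}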