Document Normalization

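I. Normalization

Normalization is applied to tokens after the text has been split. It includes case conversion, removal of stop words and filler words (for example is, an), and singular/plural conversion.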

Each analyzer applies its own normalization strategy, as shown in the figure below.
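As a rough illustration of how strategies differ, the two requests below run the same sentence through the built-in standard and english analyzers using the _analyze API. The sample sentence is made up for this example. The standard analyzer mainly lowercases tokens, while english also removes English stop words and stems the remaining tokens, so the two token lists should differ.

```
GET _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Foxes are running"
}

GET _analyze
{
  "analyzer": "english",
  "text": "The Quick Brown Foxes are running"
}
```

Comparing the "tokens" arrays of the two responses shows which normalization steps each analyzer performs.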

 

II. Char Filter

A char filter (character filter) processes the raw text before it is tokenized.

1. HTML Strip

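The html_strip char filter removes HTML elements from the text and decodes HTML entities. The optional escaped_tags parameter lists the elements (written without angle brackets) that should be left in place; here <a> tags are kept.

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "html_strip",
              "escaped_tags": ["a"]
            }
          }
        }
      }
    }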
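To check the filter, the analyzer can be run against a small HTML snippet with the _analyze API. This is a minimal sketch that assumes my_index was created with the HTML Strip settings above; the sample text is made up for this example. Because the keyword tokenizer emits the whole input as one token, the response should contain a single token in which the <p> tags are stripped and the <a> tags are kept.

```
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>Hello <a>world</a></p>"
}
```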

2. Mapping

The mapping char filter replaces every occurrence of a key with its value. The example below masks three characters that commonly appear in insults (滚, and the two characters of 垃圾) with *.

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "滚 => *",
                "垃 => *",
                "圾 => *"
              ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          }
        }
      }
    }
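The effect can be checked with the _analyze API. This sketch assumes my_index currently holds the Mapping settings above (an index of the same name created earlier would have to be deleted first); the sample text is made up for this example. Each mapped character should come back as *, so the single token is expected to read 这是**,*.

```
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "这是垃圾,滚"
}
```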

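3. Pattern Replace (regex replacement)

The pattern_replace char filter replaces text that matches a Java regular expression. The pattern below matches a run of 11 digits, keeps the first three and the last four, and masks the middle four with ****, which is a common way to hide part of a mobile phone number.

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d{3})\\d{4}(\\d{4})",
              "replacement": "$1****$2"
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          }
        }
      }
    }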
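A quick check with the _analyze API, assuming my_index currently holds the Pattern Replace settings above. The 11-digit number is a made-up sample, not a real phone number. The first group captures 138, the middle four digits are dropped, and the second group captures 5678, so the resulting token should be 138****5678.

```
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "13812345678"
}
```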

III. Tokenizer

The tokenizer's main job is to split the text into tokens. The default tokenizer is standard.
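The tokenizer can be tried on its own, without any char filter or token filter, through the _analyze API. The sample text below is chosen only for illustration. The standard tokenizer splits on word boundaries and drops most punctuation but does not change case, so the expected tokens are The, 2, QUICK, Brown, Foxes.

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes"
}
```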

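IV. Token Filter

Token filters run after the tokenizer and modify the token stream. They cover case conversion, tense normalization (stemming), stop-word removal, synonym handling, and removal of filler words.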
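As a minimal sketch, several built-in token filters can be chained in one _analyze request. The choice of lowercase, stop, and stemmer and the sample sentence are assumptions made for this example; synonym handling needs a custom filter definition and is left out. The filters run in the order listed: tokens are lowercased first, then English stop words are removed, then the remaining tokens are stemmed.

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "stemmer"],
  "text": "The Foxes are running"
}
```

Removing one filter from the list and rerunning the request shows what that filter contributes.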

 

posted @ 2022-02-11 07:43  showMeTheCodes