elasticsearch - Analyzers (Part 5)

Official introduction

An analyzer, whether built-in or custom, is simply a package of three lower-level building blocks: character filters, a tokenizer, and token filters.
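For example, all three building blocks can be exercised in a single ad-hoc _analyze request. This is a minimal sketch using only built-in components (the html_strip char filter, the standard tokenizer, and the lowercase token filter):

GET /_analyze
{
  "char_filter": [ "html_strip" ],   //1. character filter: strip HTML tags from the raw text
  "tokenizer": "standard",           //2. tokenizer: split the cleaned text into words
  "filter": [ "lowercase" ],         //3. token filter: lowercase each token
  "text": "<b>The QUICK Fox</b>"
}

This should produce the tokens [ the, quick, fox ].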

 

 

CharFilter

Character filters preprocess the character stream before it is passed to the tokenizer.

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or strip HTML elements such as <b> from the stream.

Built-in character filters: HTML Strip Character Filter, Mapping Character Filter, Pattern Replace Character Filter.

HTML Strip Character Filter

Used to strip HTML tags, for example:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

After filtering:

[ \nI'm so happy!\n ]

 

Mapping Character Filter

Replaces arbitrary characters according to a mapping:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5",
        "٦ => 6",
        "٧ => 7",
        "٨ => 8",
        "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}

After filtering:

[ My license plate is 25015 ]

Pattern Replace Character Filter

Replaces characters based on a regular expression.

Example 1

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

After filtering:

[ My, credit, card, is, 123_456_789 ]

Example 2

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}

After filtering:

[ the, foo, bar, baz, method ]

tokenizer

The tokenizer is the part of the analyzer responsible for splitting text into individual terms (tokens).

It receives the character stream, breaks it into individual tokens (usually single words), and emits a token stream. For example, the text "Quick brown fox!" becomes [Quick, brown, fox!].

The tokenizer also records the order (position) of each term, as well as the start and end character offsets of the original word it represents.

An analyzer must have exactly one tokenizer.

Documentation for the built-in tokenizers: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

 

 

For example:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

After tokenization:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

token filters

Token filters further process the tokens, for example removing stop words or changing case.

A token filter receives the token stream and can add, remove, or change tokens.

For example, case conversion: an uppercase token filter could convert the following tokens to uppercase:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

A token filter is not allowed to change the position or character offsets of tokens.

An analyzer can have zero or more token filters, which are applied in order.
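For example, a minimal sketch chaining the built-in lowercase and stop filters; because lowercase runs first, the leading "The" is lowercased before the stop filter removes it:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],   //applied in order: lowercase first, then stop-word removal
  "text": "The QUICK Brown Foxes"
}

Expected tokens: [ quick, brown, foxes ]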

token graph

https://www.elastic.co/guide/en/elasticsearch/reference/current/token-graphs.html

After the tokenizer has produced tokens, each token's position is recorded, along with the number of positions it spans (positionLength).

For example, "quick brown fox" becomes [quick, brown, fox].

 

 

A token filter can also add new tokens on top of existing ones; the new token gets the same position and span as the existing one, for example adding fast as a synonym of quick.

 

 

A token filter can also add tokens that span multiple positions. For example, "domain name system is fragile" becomes [domain, name, system, is, fragile], and adding the abbreviation dns for "domain name system" allows the document to be matched by dns as well.

In the resulting token graph, dns sits at position 0 but has a positionLength of 3.

 

 

At search time, using the same analyzer with a match or match_phrase query, both "domain name system is fragile" and "dns is fragile" will match documents containing either phrasing.

Elasticsearch also has two token filters, synonym and word_delimiter, that can add tokens spanning multiple positions but only record a positionLength of 1 for them.
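Their graph-aware counterparts, synonym_graph and word_delimiter_graph, do record positionLength accurately and are the ones recommended for multi-word synonyms at search time. A sketch based on the token graphs documentation follows; the index, filter, and analyzer names are made up for illustration:

PUT my-graph-index
{
  "settings": {
    "analysis": {
      "filter": {
        "dns_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "dns, domain name system" ]
        }
      },
      "analyzer": {
        "dns_search_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "dns_synonyms" ]
        }
      }
    }
  }
}

Used as the search_analyzer of a text field, this lets a match_phrase for "dns is fragile" also match documents containing "domain name system is fragile", and vice versa.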

Custom analyzers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html

For an example Chinese/English search configuration, see https://www.cnblogs.com/LQBlog/p/10449637.html

A synonym token filter can be used so that, for example, a search for 土豆 also finds documents containing 马铃薯 (both mean potato); a sketch follows below.
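A minimal sketch of such a synonym setup. The index and analyzer names are illustrative, and it assumes the ik_max_word tokenizer (from the analysis-ik plugin listed at the end of this post) keeps 土豆 and 马铃薯 each as a single token; with the standard tokenizer, Chinese text would be split character by character and the rule would not apply cleanly:

PUT my-synonym-index
{
  "settings": {
    "analysis": {
      "filter": {
        "potato_synonyms": {
          "type": "synonym",
          "synonyms": [ "土豆, 马铃薯" ]
        }
      },
      "analyzer": {
        "potato_analyzer": {
          "tokenizer": "ik_max_word",   //assumes the analysis-ik plugin is installed
          "filter": [ "potato_synonyms" ]
        }
      }
    }
  }
}

With this analyzer on a field, a query for 土豆 should also match documents that only contain 马铃薯.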

 

The three stages of the analysis process

1. Character filters: strip or transform special characters in the raw text (char_filter)

2. Tokenizer: split the text into terms (tokenizer)

3. Token filters: operate on the resulting terms, e.g. changing case or replacing terms (token filter)

The final output of these stages is a token graph (the _analyze example below shows the output of each stage).
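To inspect what each stage produces, the _analyze API supports "explain": true, which returns the text after every char filter, the tokens emitted by the tokenizer, and the tokens after every token filter. A minimal sketch using the same built-in components as above:

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "explain": true,
  "text": "<p>Quick FOX</p>"
}

The response's detail object lists each component by name together with its output, which makes it easy to see where a given term was produced or dropped.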

Thoughts on different tokenization algorithms

Common APIs

Testing an analyzer

Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/test-analyzer.html

GET request: http://127.0.0.1:9200/_analyze

body:

{
    "analyzer":"standard",//分词器
    "text":"Set the shape to semi-transparent by calling set_trans(5)"//测试分词的fulltext
}

Result:

{
    "tokens": [
        {
            "token": "set",//被索引的词
            "start_offset": 0,//原文本起始位置
            "end_offset": 3,//原文本结束位置
            "type": "<ALPHANUM>",
            "position": 0//第几个出现
        },
        {
            "token": "the",
            "start_offset": 4,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "shape",
            "start_offset": 8,
            "end_offset": 13,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "to",
            "start_offset": 14,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "semi",
            "start_offset": 17,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "transparent",
            "start_offset": 22,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "by",
            "start_offset": 34,
            "end_offset": 36,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "calling",
            "start_offset": 37,
            "end_offset": 44,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "set_trans",
            "start_offset": 45,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "5",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<NUM>",
            "position": 9
        }
    ]
}

Viewing the tokenization result (term vectors) of a document

GET /${index}/${type}/${id}/_termvectors?fields=${fields_name}

type defaults to _doc

Example

http://10.4.3.230:9200/assist_award_punishment/_doc/AP1816861322877568258/_termvectors?fields=acceptOrgPath

Response

{
    "_index": "dev_assist_award_punishment_s0",
    "_type": "_doc",
    "_id": "AP1816861322877568258",
    "_version": 1,
    "found": true,
    "took": 0,
    "term_vectors": {
        "acceptOrgPath": {
            "field_statistics": {
                "sum_doc_freq": 64,
                "doc_count": 8,
                "sum_ttf": 64
            },
            "terms": {
                "10010000": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 8
                        }
                    ]
                },
                "10010001": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 9,
                            "end_offset": 17
                        }
                    ]
                },
                "10010010": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 2,
                            "start_offset": 18,
                            "end_offset": 26
                        }
                    ]
                },
                "10010400": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 4,
                            "start_offset": 36,
                            "end_offset": 44
                        }
                    ]
                },
                "10010401": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 5,
                            "start_offset": 45,
                            "end_offset": 53
                        }
                    ]
                },
                "10050120": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 3,
                            "start_offset": 27,
                            "end_offset": 35
                        }
                    ]
                },
                "10052279": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 6,
                            "start_offset": 54,
                            "end_offset": 62
                        }
                    ]
                },
                "a003": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 7,
                            "start_offset": 63,
                            "end_offset": 67
                        }
                    ]
                }
            }
        }
    }
}

 

Analyzing text with a specific analyzer on a specific index

GET /${index}/_analyze?analyzer=${analyzer_name}&text=2,3,4,5,100-100

Or

GET http://10.4.3.230:9200/_analyze

{
    "analyzer":"standard",//分词器
    "text":"1232-2323-4335|大幅度发-123"//测试分词的fulltext
}

Response

{
    "tokens": [
        {
            "token": "1232",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<NUM>",
            "position": 0
        },
        {
            "token": "2323",
            "start_offset": 5,
            "end_offset": 9,
            "type": "<NUM>",
            "position": 1
        },
        {
            "token": "4335",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<NUM>",
            "position": 2
        },
        {
            "token": "大",
            "start_offset": 15,
            "end_offset": 16,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "幅",
            "start_offset": 16,
            "end_offset": 17,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "度",
            "start_offset": 17,
            "end_offset": 18,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "发",
            "start_offset": 18,
            "end_offset": 19,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "123",
            "start_offset": 20,
            "end_offset": 23,
            "type": "<NUM>",
            "position": 7
        }
    ]
}

Viewing the query execution plan

Use the following API to see how a query will be rewritten and executed:

GET http://127.0.0.1:9200/opcm3/_validate/query?explain

{
    "query":{
        "match_phrase":{
            "productName":{
                "query":"纯生"
            }
        }
    }
}
{
    "valid": true,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "explanations": [
        {
            "index": "opcm3",
            "valid": true,
            "explanation": "productName:\"(chun c) (sheng s)\""
        }
    ]
}

This can be read as (chun OR c) AND (sheng OR s).

Custom analyzer: a hands-on example

1. Delete the index if it already exists

curl --location --request DELETE 'http://10.4.3.230:9200/lq_test_analysis_index'

2. Create an index with a custom analyzer

curl --location --request PUT 'http://10.4.3.230:9200/lq_test_analysis_index' \
--header 'Content-Type: application/json' \
--data '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
           "type": "pattern",
          "pattern": "-"
        }
      }
    }
  }
}'

3. Test the analyzer

curl --location --request GET 'http://10.4.3.230:9200/lq_test_analysis_index/_analyze' \
--header 'Content-Type: application/json' \
--data '{
    "analyzer":"char_analyzer",
    "text":"李强"
}'

Response

{
    "tokens": [
        {
            "token": "",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "",
            "start_offset": 1,
            "end_offset": 2,
            "type": "word",
            "position": 1
        }
    ]
}

 

4. Add a test document for search testing

curl --location 'http://10.4.3.230:9200/lq_test_analysis_index/_doc/1' \
--header 'Content-Type: application/json' \
--data '{
  "content": "李强"
}'

 

5. Test the search

curl --location --request GET 'http://10.4.3.230:9200/lq_test_analysis_index/_search' \
--header 'Content-Type: application/json' \
--data '{
  "query": {
    "match_phrase_prefix": {
      "content": ""  
    }
  }
}'

Response

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0925692,
        "hits": [
            {
                "_index": "lq_test_analysis_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0925692,
                "_source": {
                    "content": "李强"
                }
            }
        ]
    }
}

 

 

 

Adding an analyzer to an existing index

POST /{index}/_close   # close the target index before applying the settings update; the index cannot be read or written while it is closed

PUT /{index}/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_word": {              # the analyzer to add
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}

POST /{index}/_open   # reopen the index

Listing the installed analysis plugins

GET /_cat/plugins?v

Response

name      component       version
es_node01 analysis-ik     7.10.1
es_node01 analysis-pinyin 7.10.1

 
