Elasticsearch Analyzers (Part 5)
Official Introduction
An analyzer, whether built-in or custom, is simply a package of three sub-modules: character filters, tokenizers, and token filters.
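As a rough sketch of how the three parts fit together (the index name my-index and analyzer name my_custom_analyzer are made up for illustration; html_strip, standard, and lowercase are built-in components), a custom analyzer is declared in the index settings:

PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],   // character filters run first
          "tokenizer": "standard",           // exactly one tokenizer
          "filter": [ "lowercase" ]          // token filters run last, in order
        }
      }
    }
  }
}

Each of the three modules is described below.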
CharFilter
Character filters preprocess the character stream before it is passed to the tokenizer.
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter can convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or strip HTML elements such as <b> from the stream.
Built-in character filters: HTML Strip Character Filter, Mapping Character Filter, Pattern Replace Character Filter
HTML Strip Character Filter
Strips HTML tags from the text, for example:
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
After filtering:
[ \nI'm so happy!\n ]
Mapping Character Filter
Replaces arbitrary characters according to a configured mapping:
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5",
        "٦ => 6",
        "٧ => 7",
        "٨ => 8",
        "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
After filtering:
[ My license plate is 25015 ]
Pattern Replace Character Filter
Replaces characters based on a regular expression.
Example 1
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "my_char_filter" ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
After filtering:
[ My, credit, card, is, 123_456_789 ]
Example 2
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "my_char_filter" ],
          "filter": [ "lowercase" ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}
After filtering:
[ the, foo, bar, baz, method ]
Tokenizer
The tokenizer is the part of the analyzer responsible for splitting text into individual tokens.
It receives a character stream, breaks it into individual tokens (usually single words), and outputs a token stream. For example, the text "Quick brown fox!" is converted into
[Quick, brown, fox!]
The tokenizer is also responsible for recording the order or position of each token, as well as the start and end character offsets of the original word each token represents.
An analyzer must have exactly one tokenizer.
Built-in tokenizers are documented at: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Example:
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
After tokenization:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Token Filters
Token filters further process the tokens, for example removing stop words or changing case.
A token filter receives the token stream and can add, remove, or change tokens.
For example, case conversion:
an uppercase token filter would convert the following tokens to uppercase:
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Token filters are not allowed to change the position or character offsets of the tokens.
An analyzer may have zero or more token filters, which are applied in order.
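As a minimal sketch, a token filter can be tried out directly with the _analyze API, e.g. the built-in uppercase filter on top of the standard tokenizer:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "uppercase" ],
  "text": "the quick brown foxes jumped over the lazy dog's bone"
}

The response contains the same tokens with their text uppercased, while each token's position and offsets stay unchanged.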
Token Graph
https://www.elastic.co/guide/en/elasticsearch/reference/current/token-graphs.html
After tokenization, each token also records its position and a positionLength (the number of positions the token spans).
For example, quick brown fox is tokenized into [quick, brown, fox].
A token filter may add new tokens on top of existing ones; such a token gets the same position and positionLength as the existing token, for example fast as a synonym of quick.
A filter may also add tokens that span multiple positions. For example, domain name system is fragile is tokenized into [domain, name, system, is, fragile]; adding dns as an abbreviation of domain name system allows queries for dns to match.
Here dns sits at position 0 but has a positionLength of 3.
When searching with the same analyzer via match or match_phrase,
both domain name system is fragile and dns is fragile will match documents containing either form.
Elasticsearch provides two token filters (synonym and word_delimiter) that can add tokens spanning multiple positions but only record a positionLength of 1.
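For a correct positionLength on multi-position synonyms, the graph variants synonym_graph and word_delimiter_graph can be used instead. A minimal sketch of the dns example above, assuming made-up names my-graph-test, my_synonyms, and my_graph_analyzer:

PUT /my-graph-test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "dns, domain name system" ]
        }
      },
      "analyzer": {
        "my_graph_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms" ]
        }
      }
    }
  }
}

GET /my-graph-test/_analyze
{
  "analyzer": "my_graph_analyzer",
  "text": "domain name system is fragile"
}

In the response, the inserted dns token should show position 0 and positionLength 3.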
Custom Analyzer
Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html
For a combined Chinese/English search configuration, see https://www.cnblogs.com/LQBlog/p/10449637.html
A synonym token filter can also be configured so that, for example, a search for 土豆 (potato) also returns documents containing 马铃薯 (another word for potato); a sketch follows.
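A minimal sketch of such a synonym setup, assuming the analysis-ik plugin is installed (see the plugin listing at the end of this post); the index name my-synonym-index, the analyzer name my_synonym_analyzer, and the filter name potato_synonyms are made up:

PUT /my-synonym-index
{
  "settings": {
    "analysis": {
      "filter": {
        "potato_synonyms": {
          "type": "synonym",
          "synonyms": [ "土豆, 马铃薯" ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "ik_smart",
          "filter": [ "potato_synonyms" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_synonym_analyzer"
      }
    }
  }
}

With this mapping, a match query for 土豆 on the content field should also return documents that only contain 马铃薯.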
Steps of the analysis process
1. Character filters: strip or replace special characters in the text (char filter)
2. Tokenizer: split the text into tokens (tokenizer)
3. Token filters: operate on the tokens, e.g. changing case or replacing tokens after tokenization (token filter)
4. Finally, a token graph is produced
Thoughts on different tokenization algorithms
Common APIs
Testing an analyzer
Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/test-analyzer.html
GET request: http://127.0.0.1:9200/_analyze
Body:
{ "analyzer":"standard",//分词器 "text":"Set the shape to semi-transparent by calling set_trans(5)"//测试分词的fulltext }
Result:
{ "tokens": [ { "token": "set",//被索引的词 "start_offset": 0,//原文本起始位置 "end_offset": 3,//原文本结束位置 "type": "<ALPHANUM>", "position": 0//第几个出现 }, { "token": "the", "start_offset": 4, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "shape", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 }, { "token": "to", "start_offset": 14, "end_offset": 16, "type": "<ALPHANUM>", "position": 3 }, { "token": "semi", "start_offset": 17, "end_offset": 21, "type": "<ALPHANUM>", "position": 4 }, { "token": "transparent", "start_offset": 22, "end_offset": 33, "type": "<ALPHANUM>", "position": 5 }, { "token": "by", "start_offset": 34, "end_offset": 36, "type": "<ALPHANUM>", "position": 6 }, { "token": "calling", "start_offset": 37, "end_offset": 44, "type": "<ALPHANUM>", "position": 7 }, { "token": "set_trans", "start_offset": 45, "end_offset": 54, "type": "<ALPHANUM>", "position": 8 }, { "token": "5", "start_offset": 55, "end_offset": 56, "type": "<NUM>", "position": 9 } ] }
Viewing the tokenization of a specific document
GET /${index}/${type}/${id}/_termvectors?fields=${fields_name}
type defaults to _doc
Example
http://10.4.3.230:9200/assist_award_punishment/_doc/AP1816861322877568258/_termvectors?fields=acceptOrgPath
Response
{ "_index": "dev_assist_award_punishment_s0", "_type": "_doc", "_id": "AP1816861322877568258", "_version": 1, "found": true, "took": 0, "term_vectors": { "acceptOrgPath": { "field_statistics": { "sum_doc_freq": 64, "doc_count": 8, "sum_ttf": 64 }, "terms": { "10010000": { "term_freq": 1, "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 8 } ] }, "10010001": { "term_freq": 1, "tokens": [ { "position": 1, "start_offset": 9, "end_offset": 17 } ] }, "10010010": { "term_freq": 1, "tokens": [ { "position": 2, "start_offset": 18, "end_offset": 26 } ] }, "10010400": { "term_freq": 1, "tokens": [ { "position": 4, "start_offset": 36, "end_offset": 44 } ] }, "10010401": { "term_freq": 1, "tokens": [ { "position": 5, "start_offset": 45, "end_offset": 53 } ] }, "10050120": { "term_freq": 1, "tokens": [ { "position": 3, "start_offset": 27, "end_offset": 35 } ] }, "10052279": { "term_freq": 1, "tokens": [ { "position": 6, "start_offset": 54, "end_offset": 62 } ] }, "a003": { "term_freq": 1, "tokens": [ { "position": 7, "start_offset": 63, "end_offset": 67 } ] } } } } }
Testing a specific analyzer on a specific index
GET /${index}/_analyze?analyzer=${analyzer_name}&text=2,3,4,5,100-100
Or
GET http://10.4.3.230:9200/_analyze
{
  "analyzer": "standard",  // the analyzer to use
  "text": "1232-2323-4335|大幅度发-123"  // the full text to analyze
}
Response
{ "tokens": [ { "token": "1232", "start_offset": 0, "end_offset": 4, "type": "<NUM>", "position": 0 }, { "token": "2323", "start_offset": 5, "end_offset": 9, "type": "<NUM>", "position": 1 }, { "token": "4335", "start_offset": 10, "end_offset": 14, "type": "<NUM>", "position": 2 }, { "token": "大", "start_offset": 15, "end_offset": 16, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "幅", "start_offset": 16, "end_offset": 17, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "度", "start_offset": 17, "end_offset": 18, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "发", "start_offset": 18, "end_offset": 19, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "123", "start_offset": 20, "end_offset": 23, "type": "<NUM>", "position": 7 } ] }
Viewing the query execution plan
Use the following API to see how a query will actually be executed:
GET http://127.0.0.1:9200/opcm3/_validate/query?explain
{ "query":{ "match_phrase":{ "productName":{ "query":"纯生" } } } }
{ "valid": true, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "explanations": [ { "index": "opcm3", "valid": true, "explanation": "productName:\"(chun c) (sheng s)\"" } ] }
This can be read as (chun OR c) AND (sheng OR s).
Custom analyzer walkthrough
1. Delete the index if it already exists
curl --location --request DELETE 'http://10.4.3.230:9200/lq_test_analysis_index'
2. Create an index and configure a custom analyzer
curl --location --request PUT 'http://10.4.3.230:9200/lq_test_analysis_index' \
--header 'Content-Type: application/json' \
--data '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "pattern",
          "pattern": "-"
        }
      }
    }
  }
}'
3. Test the analyzer
curl --location --request GET 'http://10.4.3.230:9200/lq_test_analysis_index/_analyze' \
--header 'Content-Type: application/json' \
--data '{
  "analyzer": "char_analyzer",
  "text": "李强"
}'
Response
{ "tokens": [ { "token": "李", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 }, { "token": "强", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 } ] }
4. Index a test document for the search test
curl --location 'http://10.4.3.230:9200/lq_test_analysis_index/_doc/1' \
--header 'Content-Type: application/json' \
--data '{
  "content": "李强"
}'
5. Test the search
curl --location --request GET 'http://10.4.3.230:9200/lq_test_analysis_index/_search' \
--header 'Content-Type: application/json' \
--data '{
  "query": {
    "match_phrase_prefix": {
      "content": "李"
    }
  }
}'
Response
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1.0925692, "hits": [ { "_index": "lq_test_analysis_index", "_type": "_doc", "_id": "1", "_score": 1.0925692, "_source": { "content": "李强" } } ] } }
Adding an analyzer to an existing index
POST /{index}/_close    # close the target index; it cannot be used while the settings update is applied

PUT /{index}/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_word": {    # the analyzer to add
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}

POST /{index}/_open    # reopen the index
Listing installed analysis plugins
GET /_cat/plugins?v
Response
name        component         version
es_node01   analysis-ik       7.10.1
es_node01   analysis-pinyin   7.10.1