elasticsearch - Analyzers (Part 5)

Official introduction

An analyzer, whether built-in or custom, is just a package that contains three lower-level building blocks: character filters, tokenizers, and token filters.

CharFilter

A character filter preprocesses the character stream before it is passed to the tokenizer.

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or strip HTML elements such as <b> from the stream.

The built-in character filters are: HTML Strip Character Filter, Mapping Character Filter, and Pattern Replace Character Filter.

HTML Strip Character Filter

Used to strip HTML tags from the text, for example:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

After filtering:

[ \nI'm so happy!\n ]
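The html_strip char filter can also be declared in the index settings and parameterized; a minimal sketch, where the index and analyzer names are illustrative and escaped_tags keeps the listed tags instead of stripping them:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}

With this configuration, <b> tags are preserved while all other HTML tags are stripped.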

 

Mapping Character Filter

Replaces arbitrary characters according to a configured mapping:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5",
        "٦ => 6",
        "٧ => 7",
        "٨ => 8",
        "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}

After filtering:

[ My license plate is 25015 ]

Pattern Replace Character Filter

Replaces characters matched by a regular expression:

Example 1

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

After filtering:

[ My, credit, card, is, 123_456_789 ]

Example 2

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}

After filtering:

[ the, foo, bar, baz, method ]

tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, the text "Quick brown fox!" is converted into [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term, as well as the start and end character offsets of the original word that the term represents.

An analyzer must have exactly one tokenizer.

The built-in tokenizers are documented at: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

For example:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

After tokenization:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

token filters

A token filter receives the token stream and can add, remove, or change tokens.

For example, case conversion: an uppercase token filter would convert the following tokens to uppercase (see the sketch at the end of this section):

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

A token filter is not allowed to change the position or the character offsets of a token.

An analyzer can have zero or more token filters, and they are applied in order.
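The case conversion mentioned above can be tried directly with _analyze; a minimal sketch using the built-in uppercase token filter:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "uppercase"
  ],
  "text": "the quick brown foxes jumped over the lazy dog's bone"
}

The output tokens become [ THE, QUICK, BROWN, FOXES, JUMPED, OVER, THE, LAZY, DOG'S, BONE ]; the positions and character offsets are unchanged.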

token graph

https://www.elastic.co/guide/en/elasticsearch/reference/current/token-graphs.html

After the tokenizer has produced the tokens, it also records each token's position as well as the number of positions the token spans (positionLength).

For example, quick brown fox is converted into [quick, brown, fox].

A token filter can also add new tokens on top of existing ones; the added token gets the same position and positionLength as the existing token, for example adding fast as a synonym of quick.

Token filters can also add tokens that span multiple positions. For example, domain name system is fragile is tokenized into [domain, name, system, is, fragile]; adding dns as an abbreviation of domain name system then allows the text to be matched via dns.

In this example, dns has position 0, but its positionLength is 3.

When searching with the same analyzer, both match and match_phrase queries for domain name system is fragile or dns is fragile will find documents containing either form.
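One way to reproduce the dns example is the synonym_graph token filter, which emits the multi-position synonym token; a minimal sketch with an inline filter definition:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym_graph",
      "synonyms": [
        "dns, domain name system"
      ]
    }
  ],
  "text": "domain name system is fragile"
}

In the result, dns is emitted at position 0 with a positionLength of 3, alongside domain, name and system.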

Elasticsearch also provides two token filters of this kind that record tokens spanning multiple positions with a positionLength of only 1; see the token graphs documentation linked above for details.

Custom analyzers

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html

For a Chinese/English search configuration, see https://www.cnblogs.com/LQBlog/p/10449637.html

Synonym filters can also be added to a custom analyzer, so that, for example, searching for 土豆 also returns documents containing 马铃薯; a sketch follows.
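A minimal sketch of such a setup, assuming the IK plugin (which provides the ik_max_word tokenizer used later in this post) is installed; the index, analyzer, and filter names are illustrative:

PUT /my-synonym-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "土豆, 马铃薯"
          ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_synonym_analyzer"
      }
    }
  }
}

With this mapping, a match query for 土豆 on the title field should also hit documents that contain 马铃薯.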

 

The analyzer processing steps

1. Character filters: pre-process the raw text, e.g. stripping special characters (char_filter)

2. Tokenizer: split the text into tokens (tokenizer)

3. Token filters: operate on the resulting tokens, e.g. case conversion or token replacement (token filter)

4. Finally, the token graph is generated
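All three stages can be exercised in a single _analyze request; a minimal sketch combining a char filter, a tokenizer, and a token filter (the sample text is arbitrary):

GET /_analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "standard",
  "filter": [
    "lowercase"
  ],
  "text": "<p>The 2 QUICK Brown-Foxes!</p>"
}

The HTML tags are stripped first, the standard tokenizer then splits the text, and the lowercase filter finally yields [ the, 2, quick, brown, foxes ].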

Testing an analyzer

Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/test-analyzer.html

GET request: http://127.0.0.1:9200/_analyze

body:

{
    "analyzer":"standard",// the analyzer to use
    "text":"Set the shape to semi-transparent by calling set_trans(5)"// the full text to analyze
}

Result:

{
    "tokens": [
        {
            "token": "set",//被索引的词
            "start_offset": 0,//原文本起始位置
            "end_offset": 3,//原文本结束位置
            "type": "<ALPHANUM>",
            "position": 0//第几个出现
        },
        {
            "token": "the",
            "start_offset": 4,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "shape",
            "start_offset": 8,
            "end_offset": 13,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "to",
            "start_offset": 14,
            "end_offset": 16,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "semi",
            "start_offset": 17,
            "end_offset": 21,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "transparent",
            "start_offset": 22,
            "end_offset": 33,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "by",
            "start_offset": 34,
            "end_offset": 36,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "calling",
            "start_offset": 37,
            "end_offset": 44,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "set_trans",
            "start_offset": 45,
            "end_offset": 54,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "5",
            "start_offset": 55,
            "end_offset": 56,
            "type": "<NUM>",
            "position": 9
        }
    ]
}

Viewing the tokenization result of a specific document

GET /${index}/${type}/${id}/_termvectors?fields=${fields_name}
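A concrete example, assuming a hypothetical index my-index with a text field named content (on Elasticsearch 7+ the typeless form is used):

GET /my-index/_termvectors/1?fields=content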

Viewing how a given analyzer of a given index tokenizes a piece of text

GET /${index}/_analyze?analyzer={analyzer_name}&text=2,3,4,5,100-100
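On recent Elasticsearch versions the request-body form of _analyze is preferred; a sketch against a hypothetical index, using the ik_max_word analyzer mentioned below:

GET /my-index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "2,3,4,5,100-100"
}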

Adding an analyzer

Adding a new analyzer to an existing index:

POST /{index}/_close # close the target index before applying the settings update; the index cannot be used while it is closed

PUT /{index}/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_word": { # the analyzer to add
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}

POST /{index}/_open # reopen the index
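Once the index is open again, the new analyzer can be verified with _analyze (ik_word is the analyzer defined above; the sample text is arbitrary, and the ik_max_word tokenizer requires the IK plugin to be installed):

POST /{index}/_analyze
{
  "analyzer": "ik_word",
  "text": "中华人民共和国国歌"
}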

Thoughts on different tokenization algorithms

 
