
Elasticsearch Analyzers

I. Elasticsearch Built-in Analyzers

#Standard Analyzer – the default; splits on word boundaries and lowercases tokens
#Simple Analyzer – splits on non-letter characters (symbols are dropped), lowercases tokens
#Stop Analyzer – lowercases tokens and removes stop words (the, a, is, ...)
#Whitespace Analyzer – splits on whitespace; does not lowercase
#Keyword Analyzer – no tokenization; the input is emitted as a single token
#Pattern Analyzer – splits on a regular expression, \W+ (non-word characters) by default
#Language – analyzers for 30+ common languages

1. Standard Analyzer

The default analyzer: it splits text on word boundaries (Unicode text segmentation), lowercases every token, and keeps numbers and stop words.

2. Simple Analyzer

Splits on any non-letter character; digits and symbols are discarded, and tokens are lowercased.

3. Whitespace Analyzer

Splits on whitespace only; tokens keep their original case and punctuation.

4. Stop Analyzer

Like the Simple Analyzer, but additionally removes English stop words such as the, a, and is.

5. Keyword Analyzer

Performs no tokenization: the entire input becomes a single token.

6. Pattern Analyzer

Splits on a regular expression, \W+ (runs of non-word characters) by default, and lowercases tokens.

#Sample text: 2 running Quick brown-foxes leap over lazy dogs in the summer evening

#Compare the output of the different analyzers
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
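
The standard analyzer is the baseline: brown-foxes is split into brown and foxes, every token is lowercased, and the number 2 and stop words such as "the" are kept.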

#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}


#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
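
Unlike standard, the english analyzer removes English stop words and stems the remaining tokens, so running becomes run, foxes becomes fox, and dogs becomes dog.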


#icu_analyzer (requires the analysis-icu plugin; see Section II below)
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

#standard, for comparison on the same Chinese sentence
POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}

 

II. Chinese Word Segmentation: the ICU Analyzer
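
The ICU analyzer is provided by the analysis-icu plugin, which does not ship with Elasticsearch by default. Install it from the Elasticsearch home directory, then restart the node:

bin/elasticsearch-plugin install analysis-icu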

1. Testing the ICU Analyzer

//Specify the analyzer directly to test it
GET _analyze
{
  "analyzer":"icu_analyzer",
  "text":"你好中国"
}
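
With the plugin installed, 你好中国 should come back as the two words 你好 and 中国; the standard analyzer, by contrast, would break the same text into four single-character tokens.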

2. Other Chinese Analysis Plugins

Popular alternatives include the IK and HanLP plugins, covered in the sections below.

III. Custom Analyzers

A custom analyzer chains three kinds of building blocks, applied in order: zero or more character filters that preprocess the raw text, exactly one tokenizer that splits it into tokens, and zero or more token filters that transform the token stream. The example below uses all three.

//Define a custom analyzer: a mapping char_filter that rewrites & to "and",
//a stopword token filter, and an analyzer that chains them together
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and " ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [ "the", "a" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}
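
Note that html_strip, the standard tokenizer, and lowercase are Elasticsearch built-ins, while &_to_and and my_stopwords are the custom components defined in the same settings block.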

//Set the mapping: username uses the custom analyzer, password keeps the default
PUT /my_index/_mapping
{
  "properties": {
    "username": {
      "type": "text",
      "analyzer": "my_analyzer"
    },
    "password": {
      "type": "text"
    }
  }
}

//Index a document with the same text in both fields
PUT /my_index/_doc/1
{
  "username":"The quick & brown fox ",
  "password":"The quick & brown fox "
}

//Verify: analyze the same text through each field's analyzer
GET my_index/_analyze
{
  "field":"username",
  "text":"The quick & brown fox"
}

GET my_index/_analyze
{
  "field":"password",
  "text":"The quick & brown fox"
}
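
If everything is wired correctly, the username field (my_analyzer) should return the tokens quick, and, brown, fox: the character filter rewrites & to "and", the tokens are lowercased, and the stop filter drops "the". The password field falls back to the standard analyzer and returns the, quick, brown, fox.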

 

IV. The IK Analysis Plugin

1. Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

The plugin version must match your Elasticsearch version.

2. Create an analysis-ik directory under the plugins folder.

3. Copy the downloaded zip file into the analysis-ik directory and unzip it.

4. Start Elasticsearch. The shell sketch below walks through steps 2-4.
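
A minimal sketch of steps 2-4 (the version 7.6.2, the download location, and the Elasticsearch home directory are assumptions; substitute your own):

# run from the Elasticsearch home directory
mkdir -p plugins/analysis-ik
cp ~/Downloads/elasticsearch-analysis-ik-7.6.2.zip plugins/analysis-ik/
cd plugins/analysis-ik && unzip elasticsearch-analysis-ik-7.6.2.zip && rm elasticsearch-analysis-ik-7.6.2.zip
cd ../.. && bin/elasticsearch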

//IK ships two analyzers: ik_max_word (finest-grained; emits overlapping terms)
//and ik_smart (coarsest-grained split)
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
} 
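
For comparison, the same sentence can be run through ik_smart, which produces the coarsest split (a sketch; the exact tokens depend on the IK dictionary version):

POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
}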

V. The HanLP Analysis Plugin

1. Download: https://github.com/KennFalcon/elasticsearch-analysis-hanlp
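
Once installed (via the same plugins-directory procedure as IK, or with elasticsearch-plugin install and the release URL), it can be exercised the same way. A sketch, assuming the plugin registers its default analyzer under the name hanlp:

POST _analyze
{
  "analyzer": "hanlp",
  "text": "他说的确实在理"
}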

 

posted on 2020-04-06 22:43 by Hunter