Elasticsearch Fundamentals: Custom Analyzers
I. Analyzer
An analyzer is made up of three kinds of components, applied in order:
1. Character filter
e.g. stripping HTML markup, or converting "&" to "and"
2. Tokenizer
e.g. splitting on whitespace
3. Token filter
e.g. lowercasing, removing stop words, adding synonyms
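The three stages can be combined in a single `_analyze` request to see the whole pipeline at work. A minimal sketch using only built-in components (the sample text is my own):

```
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The QUICK &amp; brown fox</p>"
}
```

The `html_strip` char filter removes the tags and decodes `&amp;`, the standard tokenizer splits and drops the `&`, and the `lowercase` and `stop` token filters should leave just the tokens `quick`, `brown`, `fox`.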
II. Built-in analyzers
Standard analyzer
Splits on word boundaries, removes most punctuation, and lowercases.
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Simple analyzer
Splits on any non-letter character, discards the non-letters (so the digit 2 is dropped), and lowercases.
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Stop analyzer
Like the simple analyzer (splits on non-letters, discards them, lowercases), plus stop-word removal.
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Whitespace analyzer
Splits on whitespace only; does not lowercase.
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Pattern analyzer
Splits on a regular expression, by default \W+ (runs of non-word characters), and lowercases.
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
Keyword analyzer
No tokenization; the entire input is emitted as a single token.
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
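A built-in analyzer is applied per field via the `analyzer` mapping parameter. A minimal sketch (`my_index` and `title` are placeholder names):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
```

With this mapping, text indexed into `title` is split on whitespace and keeps its original case, so a query for `quick` would not match a document containing `Quick`.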
III. Custom analyzers
Character filters
# html_strip
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
# mapping replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["-=>_"]
    }
  ],
  "text": "123-456, i-test"
}
# regex replacement
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Tokenizers
# path hierarchy tokenization
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/abc/efg"
}
# whitespace tokenization with a stop filter ("The" survives: the default stop list is lowercase and no lowercase filter runs first)
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
Token filters
# lowercase, then stop-word removal
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The rain in Spain falls mainly on the plain."
}
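The `_analyze` calls above only test components ad hoc; to make them reusable, register them under `settings.analysis` as a custom analyzer. A sketch under assumed placeholder names (`my_index`, `my_analyzer`, `dash_to_underscore`), combining the pieces shown in this section:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dash_to_underscore": {
          "type": "mapping",
          "mappings": ["-=>_"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "dash_to_underscore"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<b>123-456</b>"
}
```

Mapping `-` to `_` keeps `123-456` together: the standard tokenizer would split on the hyphen but treats the underscore as part of one token, so the request should return the single token `123_456`.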
Further reading
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html#_character_filters