Elasticsearch: An Introduction to All Built-in Analyzers (5)

Elasticsearch ships with a range of built-in analyzers that can be used in any index without further configuration:

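　　1) Standard Analyzer

　　　　This analyzer splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It drops most punctuation and lowercases every term.

　　　　It can also remove common stop words (such as the, a), but stop word removal is disabled by default. To enable it, set the analyzer's stopwords parameter, for example to _english_.

　　　　This is the default analyzer in ES.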

POST _analyze
{
  "analyzer":"standard",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result: [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
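　　　　To make the rule concrete, here is a minimal Python sketch that reproduces the result above for plain ASCII text. It is not the Unicode Text Segmentation algorithm: the regular expression is only a rough stand-in, and the names standard_like and stopwords are made up for this illustration.

```python
import re

# Rough ASCII-only stand-in for the standard tokenizer (an assumption, not the
# real algorithm): runs of letters/digits, with an apostrophe allowed between
# two such runs so that "dog's" stays one token.
_TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def standard_like(text, stopwords=frozenset()):
    """Lowercase, tokenize, and optionally drop stop words (none by default)."""
    return [t for t in _TOKEN.findall(text.lower()) if t not in stopwords]

sample = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(standard_like(sample))
# Passing a stop word set mimics enabling the stopwords parameter.
print(standard_like(sample, stopwords={"the"}))
```

　　　　The empty default for stopwords mirrors the fact that the real analyzer removes no stop words unless it is configured to.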

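　　2) Simple Analyzer

　　　　This analyzer breaks text into terms at every character that is not a letter, discards those non-letter characters, and lowercases every term.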

POST _analyze
{
  "analyzer":"simple",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
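　　　　The same rule can be sketched in a few lines of Python. This is an approximation only: it relies on Python's str.isalpha to decide what counts as a letter, which may differ from Lucene in edge cases, and simple_like is a hypothetical name.

```python
from itertools import groupby

def simple_like(text):
    """Split at every non-letter character, then lowercase each run of letters."""
    return ["".join(run).lower()
            for is_letter, run in groupby(text, key=str.isalpha)
            if is_letter]

print(simple_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

　　　　Digits are not letters, so "2" disappears, and the apostrophe splits "dog's" into "dog" and "s".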

　　3) Whitespace Analyzer

　　　　This analyzer splits text into terms at whitespace characters. It does not lowercase the terms and does not strip punctuation.

POST _analyze
{
  "analyzer":"whitespace",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
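　　　　For comparison, a Python sketch of this behaviour is nearly a one-liner. It assumes that Python's notion of whitespace in str.split is close enough to the one the analyzer uses, and whitespace_like is a hypothetical name.

```python
def whitespace_like(text):
    """Split on runs of whitespace; case and punctuation are left untouched."""
    return text.split()

tokens = whitespace_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
# Punctuation attached to a word stays part of the token.
print(tokens[-1])
```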

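　　4) Stop Analyzer

　　　　This analyzer is similar to the Simple Analyzer: it splits at non-letter characters and lowercases every term. In addition it removes stop words, using the English stop word list by default.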

POST _analyze
{
  "analyzer":"stop",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result: [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
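　　　　As a sketch, the stop analyzer can be pictured as the simple analyzer followed by a filter. The Python below is only an illustration: stop_like is a hypothetical name, and the word list is my understanding of Lucene's default English stop words, so treat it as an assumption rather than an authoritative copy.

```python
from itertools import groupby

# Assumed to match the default English stop word list; verify before relying on it.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def stop_like(text, stopwords=ENGLISH_STOP_WORDS):
    """Simple-analyzer tokenization followed by stop word removal."""
    letters_only = ("".join(run).lower()
                    for is_letter, run in groupby(text, key=str.isalpha)
                    if is_letter)
    return [t for t in letters_only if t not in stopwords]

print(stop_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

　　　　Both occurrences of "the" are dropped, while "s" survives because it is not in the list.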

　　5) Keyword Analyzer

　　　　This is a "noop" analyzer: it performs no tokenization at all and returns the entire input text as a single term.

POST _analyze
{
  "analyzer":"keyword",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result:

{
  "tokens" : [
    {
      "token" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
      "start_offset" : 0,
      "end_offset" : 56,
      "type" : "word",
      "position" : 0
    }
  ]
}
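　　　　The response above can be mimicked with a small Python sketch. The function name keyword_like is hypothetical, and using len(text) for end_offset is an assumption that holds for ASCII input like this sample.

```python
def keyword_like(text):
    """Return the whole input as one token, in the shape of an _analyze response."""
    return {
        "tokens": [
            {
                "token": text,
                "start_offset": 0,
                "end_offset": len(text),  # assumption: plain ASCII input
                "type": "word",
                "position": 0,
            }
        ]
    }

response = keyword_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(len(response["tokens"]), response["tokens"][0]["end_offset"])
```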

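　　6) Pattern Analyzer

　　　　This analyzer uses a regular expression to split text into terms. The expression must match the separators between terms, not the terms themselves. Terms are lowercased afterwards, and stop words are supported (none are removed by default). The default pattern is \W+: \w matches a letter, digit, or underscore, so \W+ matches one or more characters that are none of those.

　　　　Note: the pattern analyzer uses Java regular expressions. A badly written regular expression can run very slowly, or even throw a StackOverflowError and cause the node it is running on to exit.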

POST _analyze
{
  "analyzer":"pattern",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result: [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
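　　　　Because the pattern describes the separators, the behaviour is close to a regex split. The Python sketch below illustrates this with the default \W+ pattern. It uses Python's re module rather than Java's regex engine, passes re.ASCII on the assumption that this is closer to Java's default handling of \w, and the function name pattern_like is hypothetical (its keyword arguments merely echo the analyzer's parameter names).

```python
import re

def pattern_like(text, pattern=r"\W+", lowercase=True, stopwords=frozenset()):
    """Split on the separator pattern, drop empty pieces, lowercase, filter."""
    pieces = [p for p in re.split(pattern, text, flags=re.ASCII) if p]
    if lowercase:
        pieces = [p.lower() for p in pieces]
    return [p for p in pieces if p not in stopwords]

print(pattern_like("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# A custom separator, for example a comma for comma-separated values.
print(pattern_like("Red,GREEN,blue", pattern=","))
```

　　　　The trailing full stop in the sample produces an empty piece after the split, which is why empty strings are filtered out.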

　　7) Language Analyzer

　　　　Elasticsearch provides analyzers for many specific languages, such as english or french.

POST _analyze
{
  "analyzer":"english",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

　　　　Result (the english analyzer removes stop words and the possessive 's, and also stems each term): [ 2, quick, brown, fox, jump, over, lazi, dog, bone ]

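　　8) Fingerprint Analyzer

　　　　This analyzer lowercases the input text, normalizes it to remove extended characters, then sorts the terms, removes duplicates, and concatenates them into a single token. Typical uses:

　　　　　　1) Unique identification: generating an identifier for a piece of data so that it can be tracked and managed.

　　　　　　2) Data comparison: comparing the fingerprints of different pieces of data to tell quickly whether they are the same.

　　　　　　3) Security and compliance: helping to check the consistency and integrity of data in scenarios with strict security and compliance requirements.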

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

　　　　Result:

{
  "tokens" : [
    {
      "token" : "and consistent godel is said sentence this yes",
      "start_offset" : 0,
      "end_offset" : 52,
      "type" : "fingerprint",
      "position" : 0
    }
  ]
}
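　　　　The steps of the fingerprint analyzer (lowercase, fold extended characters, deduplicate, sort, join) can be sketched in Python as follows. This is an illustration under assumptions: Unicode NFKD decomposition stands in for the analyzer's ASCII folding, a simple regex stands in for its tokenizer, and the names ascii_fold and fingerprint_like are hypothetical.

```python
import re
import unicodedata

def ascii_fold(text):
    """Approximate ASCII folding: decompose characters, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fingerprint_like(text, separator=" "):
    """Lowercase, fold, tokenize, deduplicate, sort, and join into one token."""
    tokens = re.findall(r"[a-z0-9]+", ascii_fold(text.lower()))
    return separator.join(sorted(set(tokens)))

print(fingerprint_like("Yes yes, Gödel said this sentence is consistent and."))
# Word order and repetition do not change the fingerprint.
print(fingerprint_like("b a a") == fingerprint_like("a b"))
```

　　　　Since the output ignores word order, case, and repetition, two texts that differ only in those respects end up with the same fingerprint, which is what makes it usable for comparing data.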

Reference (official documentation): Built-in analyzer reference

posted on 2024-04-19 16:43 by 花阴偷移
