Elasticsearch built-in token filters


Preface

Elasticsearch ships with a truly huge number of built-in filters (far more than you will usually need), which fall into two groups: token filters and character filters.

Common token filters

Here we list only a few of the common token filters:

  • Standard Token Filter: deprecated in 6.5.0 and scheduled for removal in the next major version. In recent versions it did essentially nothing anyway; in much older versions of Lucene it stripped a trailing "s" from words and removed stray period characters, but even those small jobs have since been taken over by other tokenizers and token filters.
  • ASCII Folding Token Filter: converts alphabetic, numeric and symbolic Unicode characters that are not among the first 127 ASCII characters (the Basic Latin Unicode block) into their ASCII equivalents, if such equivalents exist (see the sketch right after this list).
  • Flatten Graph Token Filter: accepts an arbitrary graph token stream, such as the one produced by the Synonym Graph Token Filter, and flattens it into a single linear chain of tokens suitable for indexing. This is a lossy process, since separate side paths are squashed on top of one another, but it is necessary if you need to use a graph token stream at index time, because a Lucene index currently cannot represent a graph. For this reason it is best to apply graph analyzers only at search time, which preserves the full graph structure and gives correct matches for proximity queries. This feature is marked experimental in Lucene.
  • Length Token Filter: removes tokens that are too long or too short. It is configurable; we can set its bounds in the index settings.
  • Lowercase Token Filter: normalizes tokens to lowercase. Through its language parameter it also provides Greek, Irish and Turkish lowercase token filters.
  • Uppercase Token Filter: normalizes tokens to uppercase.
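For example, the ASCII folding filter can be tried directly against the _analyze API using the built-in filter name asciifolding; the sample phrase below is just an illustration, and the accented characters come back folded to plain ASCII (acai, a, la, carte):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["asciifolding"],
  "text": "açaí à la carte"
}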

The remaining token filters are not listed one by one here; see the official documentation for details.

Custom token filters

Next, let's walk through defining two custom token filters, starting with a length token filter.

PUT pattern_test3
{
  "settings": {
    "analysis": {
      "filter": {
        "my_test_length":{
          "type":"length",
          "max":8,
          "min":2
        }
      }
    }
  }
}

In the example above we define a length filter that drops tokens longer than 8 characters or shorter than 2.
Note that the max parameter is the maximum token length and defaults to Integer.MAX_VALUE, i.e. 2147483647 (2^31 - 1), while min is the minimum length and defaults to 0.

POST pattern_test3/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_test_length"],
  "text":"a Small word and a longerword"
}

The result is:

{
  "tokens" : [
    {
      "token" : "Small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
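So far we have tested my_test_length by naming it directly in the _analyze request. In practice you would usually wrap it in a custom analyzer inside the index settings, much as we do for the lowercase filter in the next section. A minimal sketch of that wiring (the index name pattern_test4 and analyzer name my_length_analyzer are made up for illustration):

PUT pattern_test4
{
  "settings": {
    "analysis": {
      "filter": {
        "my_test_length": {
          "type": "length",
          "max": 8,
          "min": 2
        }
      },
      "analyzer": {
        "my_length_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_test_length"]
        }
      }
    }
  }
}

POST pattern_test4/_analyze
{
  "analyzer": "my_length_analyzer",
  "text": "a Small word and a longerword"
}

This should return the same three tokens as above.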

A custom lowercase token filter

Now let's define a custom lowercase token filter for Greek text:

PUT lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}

The text we run through it is:

POST lowercase_example/_analyze
{
  "tokenizer": "standard",
  "filter": ["greek_lowercase"],
  "text":"Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}

The result is:

{
  "tokens" : [
    {
      "token" : "ενα",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "φιλτρο",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 11,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "τυπου",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "πεζα",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "s",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "ομαλοποιει",
      "start_offset" : 36,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "το",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "κειμενο",
      "start_offset" : 50,
      "end_offset" : 57,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "διακριτικου",
      "start_offset" : 58,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "σε",
      "start_offset" : 70,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "χαμηλοτερη",
      "start_offset" : 73,
      "end_offset" : 83,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "θηκη",
      "start_offset" : 84,
      "end_offset" : 88,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}
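Note that the lowercase_example index above also defines two complete analyzers, standard_lowercase_example and greek_lowercase_example, which wrap these filters. The Greek one can be tested by name as well; this is simply an alternative way of running the same analysis and should return the same tokens as above:

POST lowercase_example/_analyze
{
  "analyzer": "greek_lowercase_example",
  "text": "Ένα φίλτρο διακριτικού τύπου πεζά s ομαλοποιεί το κείμενο διακριτικού σε χαμηλότερη θήκη"
}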

Multiple token filters

Beyond that, we can apply several token filters at once. For example, we can combine the length filter with the lowercase filter, or add even more.

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["length","lowercase"],
  "text":"a Small word and a longerword"
}

In the example above we pass the filters as a list, and they are applied in order. Because the built-in length filter is used here with its defaults (min 0, max Integer.MAX_VALUE), it removes nothing, so only the lowercase filter has a visible effect.
The result is:

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "small",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "word",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "and",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "longerword",
      "start_offset" : 19,
      "end_offset" : 29,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
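If we want the length bounds to take effect as well, the _analyze API also accepts anonymous filter definitions inline, so a custom length filter can be combined with lowercase without creating an index first. A sketch using the same 2-8 character bounds as before; with it, "a" and "longerword" are dropped and "Small" comes back as "small":

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {"type": "length", "min": 2, "max": 8},
    "lowercase"
  ],
  "text": "a Small word and a longerword"
}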

See also: Standard Token Filter | ASCII Folding Token Filter | Flatten Graph Token Filter | Length Token Filter | Lowercase Token Filter | Uppercase Token Filter | NGram Token Filter | Edge NGram Token Filter | Porter Stem Token Filter | Shingle Token Filter | Stop Token Filter | Word Delimiter Token Filter | Word Delimiter Graph Token Filter | Multiplexer Token Filter | Conditional Token Filter | Predicate Token Filter Script | Stemmer Token Filter | Stemmer Override Token Filter | Keyword Marker Token Filter | Keyword Repeat Token Filter | KStem Token Filter | Snowball Token Filter | Phonetic Token Filter | Synonym Token Filter | Parsing synonym files | Synonym Graph Token Filter | Parsing synonym files | Compound Word Token Filters | Reverse Token Filter | Elision Token Filter | Truncate Token Filter | Unique Token Filter | Pattern Capture Token Filter | Pattern Replace Token Filter | Trim Token Filter | Limit Token Count Token Filter | Hunspell Token Filter | Common Grams Token Filter | Normalization Token Filter | CJK Width Token Filter | CJK Bigram Token Filter | Delimited Payload Token Filter | Keep Words Token Filter | Keep Types Token Filter | Exclude mode settings example | Classic Token Filter | Apostrophe Token Filter | Decimal Digit Token Filter | Fingerprint Token Filter | Minhash Token Filter | Remove Duplicates Token Filter | Character Filters | HTML Strip Char Filter | Mapping Char Filter | Pattern Replace Char Filter

 
 
 
posted @ 2019-04-05 15:37 heshun