Elasticsearch (10): Inverted Index Core Principles, Analyzers, Exact Match vs. Full-Text Search

Core principles of the inverted index

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenize the text and build a preliminary inverted index

word      doc1    doc2
I         *       *
really    *
liked     *       *
my        *       *
small     *
dogs      *       *
and       *
think     *
mom       *       *
also      *
them      *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

This shows the simplest possible way an inverted index is built.
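As a minimal sketch of the table above, the following Python builds the same word-to-document mapping. The names `tokenize`, `build_index`, and `search` are illustrative, and the regex tokenizer is an assumption standing in for a real analyzer. No normalization is applied yet, so words only match in exactly the form they were indexed.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    # Naive tokenizer (assumption): keep runs of ASCII letters, drop punctuation.
    return re.findall(r"[A-Za-z]+", text)

def build_index(documents):
    # Map each word to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    # Return every document that contains at least one query word.
    hits = set()
    for word in tokenize(query):
        hits |= index.get(word, set())
    return sorted(hits)

index = build_index(docs)
print(sorted(index["liked"]))                    # ['doc1', 'doc2']
print(search(index, "mother like little dog"))   # []
```

The empty result for the second query is the problem that normalization is meant to solve.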

Search

Searching for mother like little dog cannot return any results, because none of the four terms is in the index:

mother

like

little

dog

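Is this the search result we want? Definitely not. To a reader, mother and mom are synonyms. like and liked mean the same thing and differ only in tense. little and small are synonyms. dog and dogs differ only in number, singular versus plural.

Normalization: when the inverted index is built, each token produced by tokenization is processed so that later searches are more likely to find related documents.

Typical conversions are tense, singular/plural, synonyms, and letter case.

mother --> mom

liked --> like

small --> little

dogs --> dog

Rebuild the inverted index with normalization applied, then search for mother like little dog again. This time the search finds matches.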

word      doc1    doc2    normalization
I         *       *
really    *
like      *       *       liked --> like
my        *       *
little    *               small --> little
dog       *       *       dogs --> dog
and       *
think     *
mom       *       *
also      *
them      *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

The query mother like little dog is tokenized and normalized in the same way:

mother	--> mom

like	--> like

little	--> little

dog	--> dog

Both doc1 and doc2 are returned:

doc1: I really liked my small dogs, and I think my mom also liked them.

doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
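The effect of normalization can be sketched in Python as follows. The `NORMALIZE` lookup table is a hypothetical stand-in for real stemming and synonym handling, and `analyze`, `build_index`, and `search` are illustrative names. The key point is that the same `analyze` function runs at index time and at query time.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization table: tense, plural, and synonym mappings.
NORMALIZE = {"liked": "like", "dogs": "dog", "small": "little", "mother": "mom"}

def analyze(text):
    # Tokenize, then map each token to its normalized form if one is defined.
    tokens = re.findall(r"[A-Za-z]+", text)
    return [NORMALIZE.get(token, token) for token in tokens]

def build_index(documents):
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # The query goes through the same analysis as the documents.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index(docs)
print(analyze("mother like little dog"))         # ['mom', 'like', 'little', 'dog']
print(search(index, "mother like little dog"))   # ['doc1', 'doc2']
```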

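An inverted index is a data structure well suited to search.

Structure of the inverted index

(1) The list of documents that contain the term

(2) The number of documents that contain the term: document frequency, the basis for IDF (inverse document frequency)

(3) The number of times the term appears in each document: TF (term frequency)

(4) The position of the term within each document

(5) The length of each document: length norm

(6) The average length of all documents that contain the term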

word      doc1    doc2
dog       *       *
hello     *
you               *
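To make the six items concrete, here is a small Python sketch that stores positions per term and derives the other statistics from them. The two document texts are hypothetical, chosen only to be consistent with the dog/hello/you table, and the function names are illustrative. This is not how Lucene lays out its index on disk.

```python
from collections import defaultdict

# Hypothetical document texts consistent with the small table above.
docs = {"doc1": "hello dog", "doc2": "dog you dog"}

# postings[term][doc_id] is the list of positions of the term in that document.
postings = defaultdict(dict)
doc_length = {}

for doc_id, text in docs.items():
    tokens = text.split()
    doc_length[doc_id] = len(tokens)                           # (5) document length
    for position, term in enumerate(tokens):
        postings[term].setdefault(doc_id, []).append(position)  # (1) and (4)

def document_frequency(term):
    # (2) how many documents contain the term
    return len(postings.get(term, {}))

def term_frequency(term, doc_id):
    # (3) how many times the term occurs in one document
    return len(postings.get(term, {}).get(doc_id, []))

def average_length(term):
    # (6) average length of the documents that contain the term
    doc_ids = postings.get(term, {})
    return sum(doc_length[d] for d in doc_ids) / len(doc_ids)

print(postings["dog"])                  # {'doc1': [1], 'doc2': [0, 2]}
print(document_frequency("dog"))        # 2
print(term_frequency("dog", "doc2"))    # 2
print(average_length("dog"))            # 2.5
```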

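Benefits of an immutable inverted index

(1) No locking is needed, which improves concurrency and avoids lock-related problems

(2) Because the data never changes, it can stay in the OS cache as long as there is enough memory

(3) The filter cache can stay in memory for the same reason: the data never changes

(4) The index can be compressed, reducing disk I/O and the memory needed to cache it

Drawback of an immutable inverted index: every change requires rebuilding the whole index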

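Analyzers

1. What is an analyzer

An analyzer splits text into terms and normalizes them, which improves recall.

Given a sentence, the analyzer breaks it into individual words and normalizes each one (tense conversion, singular/plural conversion).

Recall: the share of relevant documents that a search actually returns. Normalization raises recall by letting more related documents match.

character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing characters, for example & --> and (I&you --> I and you)

tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me

token filter: lowercase, stop words, synonyms, and similar steps: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

The analyzer matters because it applies all of this processing to the text, and only the final output is used to build the inverted index.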
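The three stages can be sketched as a small Python pipeline. This is an illustration of the idea rather than Elasticsearch's implementation: the function names, the stop-word set, and the `NORMALIZE` table are all assumptions made for the example.

```python
import re

def char_filter(text):
    # Stage 1: preprocess raw text. Strip HTML tags, spell out "&".
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    # Stage 2: split the text into tokens (runs of ASCII letters here).
    return re.findall(r"[A-Za-z]+", text)

STOP_WORDS = {"a", "an", "the"}                      # assumed stop-word list
NORMALIZE = {"dogs": "dog", "liked": "like",         # hypothetical stemming
             "mother": "mom", "small": "little"}     # and synonym table

def token_filter(tokens):
    # Stage 3: lowercase, drop stop words, apply stemming and synonyms.
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(NORMALIZE.get(token, token))
    return result

def analyze(text):
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs"))  # ['tom', 'like', 'little', 'dog']
print(analyze("I&you"))                                  # ['i', 'and', 'you']
```

Keeping the stages separate mirrors how an analyzer is composed: each stage can be swapped without touching the others.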

2. Built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

language analyzer (language-specific, for example the english analyzer): set, shape, semi, transpar, call, set_tran, 5
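Two of these behaviors are simple enough to approximate in a few lines of Python. The functions below are rough emulations written for this example, not the Elasticsearch implementations: `simple_like` splits on anything that is not an ASCII letter and lowercases, and `whitespace_like` splits on whitespace only. The standard and language analyzers involve Unicode segmentation rules and stemming and are not reproduced here.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def simple_like(text):
    # Approximation: break on every non-letter character, then lowercase.
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

def whitespace_like(text):
    # Approximation: break on whitespace only, keep case and punctuation.
    return text.split()

print(simple_like(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_like(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```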

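Exact match vs. full-text search

1. exact value

With 2017-01-01 as an exact value, a search matches only if the query is exactly 2017-01-01.

Searching for 01 alone finds nothing.

2. full text

(1) Abbreviation vs. full name: cn vs. china

(2) Word forms: like, liked, likes

(3) Case: Tom vs. tom

(4) Synonyms: like vs. love

2017-01-01 is tokenized into 2017, 01, 01, so searching for 2017 or for 01 finds it.

china: searching for cn also finds china.

likes: searching for like also finds likes.

Tom: searching for tom also finds Tom.

like: searching for love, a synonym, also finds like.

Full-text search does not simply match one complete value. It matches against the terms produced by tokenizing the value, and it can also match through abbreviations, tense, case, and synonyms.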
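The difference can be sketched in Python with two matching functions. Both are illustrative assumptions: `exact_match` compares whole values, while `full_text_match` tokenizes and lowercases both sides and matches if any term overlaps. Abbreviation, stemming, and synonym handling are left out to keep the sketch short.

```python
import re

def terms(text):
    # Split into lowercase alphanumeric terms.
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(stored, query):
    # Exact value: the whole stored value must equal the query.
    return stored == query

def full_text_match(stored, query):
    # Full text: match if any query term appears among the stored terms.
    stored_terms = set(terms(stored))
    return any(term in stored_terms for term in terms(query))

print(exact_match("2017-01-01", "2017-01-01"))   # True
print(exact_match("2017-01-01", "01"))           # False
print(full_text_match("2017-01-01", "01"))       # True
print(full_text_match("Tom", "tom"))             # True
```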

posted @ 2018-05-21 17:19 91vincent
