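For text data types the query engine provides the ~, ~*, LIKE and ILIKE operators. It also provides full-text search, which identifies natural-language documents that match a query and can sort them by relevance. Two data types support full-text search: tsvector and tsquery.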

1.2 Document (tsvector) type

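A tsvector value represents a searchable unit, that is, a document. A document is usually a text field of one row in a database table, or a combination (concatenation) of such fields; the parts may also be stored in several tables or be obtained dynamically. The value itself is a sorted list of distinct lexemes, which are words normalized so that different variants of the same word are merged. Sorting and duplicate elimination happen automatically on input. The to_tsvector function is normally used to parse and normalize a document string.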

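In other words, a tsvector value is a sorted list of distinct lexemes: the words of a sentence become separate entries, duplicate entries are dropped automatically, and the remaining entries are stored in a fixed order. For example: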

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

tsvector

----------------------------------------------------

'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

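As the example shows, casting a string to tsvector splits it on whitespace. In the output above the entries appear in alphabetical order, with the duplicates of a and fat removed. To include whitespace or punctuation inside a lexeme, surround the lexeme with quotes: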

SELECT $$the lexeme ' ' contains spaces$$::tsvector;

tsvector

-------------------------------------------

' ' 'contains' 'lexeme' 'spaces' 'the'

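Dollar quoting ($$ ... $$) is used for the literal in this example so that the single quotes inside it do not have to be doubled. Integer positions can also be attached to each lexeme, for example: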

SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;

tsvector

-------------------------------------------------------------------------------

'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4

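It is important to understand that the tsvector type itself does not normalize anything; it assumes that the words it is given are already normalized for the application. For example: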

select 'The Fat Rats'::tsvector;

tsvector

--------------------

'Fat' 'Rats' 'The'

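For an English full-text search application the words above are not normalized, but the tsvector type has no way of knowing that. Raw document text should be passed through the to_tsvector function, which normalizes the words so that they are suitable for searching: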

SELECT to_tsvector('english', 'The Fat Rats');

to_tsvector

-----------------

'fat':2 'rat':3
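The practical effect of the difference can be sketched with the match operator @@ (introduced in section 1.4). The column aliases below are made up for illustration; the two inputs are the ones used above.

```sql
-- A raw cast keeps 'The', 'Fat' and 'Rats' exactly as typed, so the
-- normalized query lexeme 'rat' finds nothing to match.
-- to_tsvector lowercases, stems and drops the stop word first.
SELECT 'The Fat Rats'::tsvector @@ to_tsquery('english', 'rat')
           AS raw_cast_matches,
       to_tsvector('english', 'The Fat Rats') @@ to_tsquery('english', 'rat')
           AS normalized_matches;
```

With the built-in english configuration the first column should come back false and the second true, which is why raw text is normally fed through to_tsvector rather than cast directly.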

1.3 Query (tsquery) type

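A tsquery value represents a search condition. It stores the lexemes to search for and combines them with the Boolean operators & (AND), | (OR) and ! (NOT); parentheses can be used to enforce the grouping of the operators. As with tsvector, every word must be normalized before it is converted to tsquery. The to_tsquery and plainto_tsquery functions are convenient for performing that normalization.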

SELECT 'fat & rat'::tsquery;

tsquery

---------------

'fat' & 'rat'

SELECT 'fat & (rat | cat)'::tsquery;

tsquery

---------------------------

'fat' & ( 'rat' | 'cat' )

SELECT to_tsquery('english', 'fat & rat');

to_tsquery

---------------

'fat' & 'rat'

The text passed to to_tsquery must consist of single words joined by the Boolean operators & (AND), | (OR) and ! (NOT), optionally grouped with parentheses. For example:

SELECT to_tsquery('english', 'Fat Rats');

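Executing the statement above raises a syntax error, because the two words are not joined by an operator. plainto_tsquery, by contrast, turns unformatted text into a valid tsquery: for the same input it normalizes the words and inserts the & operator between them automatically.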

SELECT plainto_tsquery('english','Fat Rats');

plainto_tsquery

-----------------

'fat' & 'rat'

However, plainto_tsquery does not recognize Boolean operators or weight labels in its input:

SELECT plainto_tsquery('english','The Fat & Rats:C');

plainto_tsquery

---------------------

'fat' & 'rat' & 'c'
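The two functions can be compared side by side by matching them against the same document. This is only a sketch: the sample sentence and the column aliases are invented for illustration.

```sql
-- plainto_tsquery suits free text typed by a user;
-- to_tsquery suits queries that need explicit operators such as ! (NOT).
SELECT to_tsvector('english', 'The fat rats ate the cheese')
           @@ plainto_tsquery('english', 'Fat Rats')   AS matches_plain,
       to_tsvector('english', 'The fat rats ate the cheese')
           @@ to_tsquery('english', 'fat & !cat')      AS matches_negation;
```

Both columns should be true: the document contains fat and rat, and it does not contain cat.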

1.4 Searching a table

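Full-text search in the query engine is based on the match operator @@, which returns true if a tsvector matches a tsquery. A full-text search can be run without an index. The following simple query returns the title of every row whose body field contains the word friend: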

SELECT title

FROM web

WHERE to_tsvector('english', body) @@ to_tsquery('english','friend');

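The first argument of to_tsvector and to_tsquery names the text search configuration, that is, the language settings used for parsing. It can usually be omitted, in which case the configuration set by default_text_search_config is used: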

SELECT title

FROM web

WHERE to_tsvector(body) @@ to_tsquery('friend');
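To try the query without an existing web table, a throwaway table can stand in for it. The table name web_demo and its two rows are assumptions made up for this sketch.

```sql
-- Hypothetical sample data; the temporary table disappears at session end.
CREATE TEMP TABLE web_demo (title text, body text);

INSERT INTO web_demo (title, body) VALUES
    ('Greetings', 'Hello to all my friends'),
    ('Weather',   'It will rain tomorrow');

-- 'friends' in the body is stemmed to 'friend', so the first row matches.
SELECT title
FROM web_demo
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
```

Only the Greetings row should be returned. Naming the configuration explicitly, as done here, keeps the result independent of the session's default_text_search_config.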

1.5 Creating indexes

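Two index types, GiST and GIN, can be used to speed up full-text searches. An index is not mandatory for full-text search, but a column that is searched on a regular basis benefits greatly from one. A GiST index can be created on a column of type tsvector or tsquery; a GIN index requires a column of type tsvector. For example: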

CREATE INDEX web_idx ON web USING gin(to_tsvector('english', body));

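Indexes can be created in several ways. An index can even be built over the concatenation of two columns. The ' ' separator keeps the last word of title from being fused with the first word of body:

CREATE INDEX web_idx ON web USING gin(to_tsvector('english', title || ' ' || body));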

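Another approach is to create a separate tsvector column and fill it with the output of to_tsvector over the concatenated source columns, here title and body. The coalesce function ensures that a row can still be indexed when one of the fields is NULL:

ALTER TABLE web ADD COLUMN textsearchable_index_col tsvector;
UPDATE web SET textsearchable_index_col =
to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));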

Then create an inverted (GIN) index on the new column:

CREATE INDEX textsearch_idx ON web USING gin(textsearchable_index_col);

Once the index has been created, full-text searches can use it:

SELECT title

FROM web

WHERE textsearchable_index_col @@ to_tsquery('create & table')

ORDER BY last_mod_date DESC LIMIT 10;
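The one-off UPDATE shown earlier fills the tsvector column only for rows that already exist. One way to keep such a column current, not covered in the text above, is the built-in tsvector_update_trigger function. The sketch below uses a hypothetical temporary table web_demo and a hypothetical trigger name so that it does not touch the real web table.

```sql
-- Hypothetical stand-in for the web table.
CREATE TEMP TABLE web_demo (
    title                    text,
    body                     text,
    textsearchable_index_col tsvector
);

-- Recompute the tsvector column from title and body on every write.
-- Recent PostgreSQL releases also accept EXECUTE FUNCTION here.
CREATE TRIGGER web_demo_tsv_update
    BEFORE INSERT OR UPDATE ON web_demo
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(
        textsearchable_index_col, 'pg_catalog.english', title, body);

INSERT INTO web_demo (title, body)
VALUES ('Create table', 'How to create a table');

-- The row is searchable without any manual UPDATE.
SELECT title
FROM web_demo
WHERE textsearchable_index_col @@ to_tsquery('english', 'create & table');
```

The trigger function treats NULL source columns as empty, so no coalesce call is needed in this variant.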

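1.6 Weight matching (Weight)

The setweight function introduces the concept of a weight: a label that records how important an entry is relative to other entries. setweight labels the entries of a tsvector with one of four weight levels, A, B, C or D. The levels are typically used to mark entries that come from different parts of a document, such as the title versus the body, and they can be taken into account when search results are ranked. For example: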

UPDATE tt SET ti =

setweight(to_tsvector(coalesce(title,'')), 'A') ||

setweight(to_tsvector(coalesce(keyword,'')), 'B') ||

setweight(to_tsvector(coalesce(abstract,'')), 'C') ||

setweight(to_tsvector(coalesce(body,'')), 'D');

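Weights can also be used in a tsquery. When weight labels are attached to a search term, that term matches only tsvector lexemes that carry one of those weights.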

SELECT to_tsquery('english', 'Fat | Rats:AB');

to_tsquery

------------------

'fat' | 'rat':AB
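The restriction can be observed on a small document built inline. The two text fragments and the column aliases are invented for this sketch; note that D is the weight a lexeme carries when no other weight has been set.

```sql
-- 'Fat cats' is labelled A (for example a title),
-- 'ate rats' is labelled D (for example a body).
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'Fat cats'), 'A') ||
           setweight(to_tsvector('english', 'ate rats'), 'D') AS vec
)
SELECT vec @@ to_tsquery('english', 'cat:A') AS cat_in_a,
       vec @@ to_tsquery('english', 'rat:A') AS rat_in_a,
       vec @@ to_tsquery('english', 'rat:D') AS rat_in_d
FROM doc;
```

The expected pattern is true, false, true: rat is present in the document, but only with weight D, so the query term rat:A does not match it.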

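1.7 Relevance (Ranking) queries

Ranking attempts to measure how relevant each document is to a particular query, so that when there are many matches the most relevant ones are shown first. The query engine provides two predefined ranking functions, ts_rank and ts_rank_cd. They take into account how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document is in which they occur.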

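The syntax of the two functions is:

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

The optional first argument of both functions is the weights array; the concept of weights was introduced in the previous section.

The array has the form {D-weight, C-weight, B-weight, A-weight}. If the argument is omitted, the default {0.1, 0.2, 0.4, 1.0} is used.

vector tsvector is the document to be ranked, including the positions of its lexemes.

query tsquery is the search query whose terms are looked up in the document.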

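Because a longer document has a greater chance of containing the search terms, a 100-word document with five occurrences of a term is probably more relevant than a 1000-word document with five occurrences. The last argument, normalization, therefore specifies whether and how the length of the document affects its rank. It is a bit mask: one or more of the flags below can be combined with | (for example 2|4). The flags are defined as follows: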

0 (the default) ignores the document length

1 divides the rank by 1 + the logarithm of the document length

2 divides the rank by the document length

4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)

8 divides the rank by the number of unique words in document

16 divides the rank by 1 + the logarithm of the number of unique words in document

32 divides the rank by itself + 1
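The optional arguments can be exercised on a single inline document. This is a sketch under assumptions: the sample text, the weights chosen for its two parts and the column aliases are all made up, and no particular rank values are claimed.

```sql
-- Hypothetical document: a short part weighted A plus a longer part weighted D.
WITH doc AS (
    SELECT setweight(to_tsvector('english', 'Dark matter survey'), 'A') ||
           setweight(to_tsvector('english',
               'A long body of text about dark matter and neutrinos'), 'D') AS vec
)
SELECT ts_rank(vec, q)                                AS default_weights,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', vec, q)        AS explicit_weights,
       ts_rank_cd(vec, q, 2 | 4)                      AS flags_2_and_4,
       ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', vec, q, 32) AS flag_32
FROM doc, to_tsquery('english', 'dark & matter') AS q;
```

The first two columns should be identical, because the array passed explicitly is the documented default. The flag_32 column always falls between 0 and 1, which makes it convenient when ranks have to be shown on a fixed scale.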

The following example returns the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank

FROM apod, to_tsquery('neutrino|(dark & matter)') query

WHERE query @@ textsearch

ORDER BY rank DESC

LIMIT 10;

                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank

FROM apod, to_tsquery('neutrino|(dark & matter)') query

WHERE query @@ textsearch

ORDER BY rank DESC

LIMIT 10;

                     title                     |       rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
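The relation between the two result tables follows directly from flag 32, which maps each rank to rank / (rank + 1). The check below redoes that arithmetic for three of the raw ranks listed earlier; the column aliases are chosen for illustration.

```sql
-- Raw ranks copied from the first result table.
SELECT r                     AS raw_rank,
       round(r / (r + 1), 6) AS normalized_rank
FROM (VALUES (3.1), (2.4), (0.818218)) AS t(r);
```

The results are approximately 0.756098, 0.705882 and 0.450011, in line with the normalized table up to float4 rounding. Because the mapping is monotonic, the order of the rows is the same in both tables.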

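1.8 Index statistics function

ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer)

The function executes sqlquery, which must return a single tsvector column, and returns one statistics record for each distinct lexeme:

word text: the lexeme

ndoc integer: the number of documents (tsvector values) in which the lexeme occurs

nentry integer: the total number of occurrences of the lexeme across all documents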

For example:

SELECT * FROM ts_stat('SELECT tsv FROM messages')

ORDER BY nentry DESC, ndoc DESC, word

LIMIT 10;

The result of the query is:

 word  | ndoc | nentry
-------+------+--------
 test  |    2 |      3
 title |    2 |      2
 test  |    1 |      2
 body  |    1 |      1

As shown above, ts_stat makes it possible to inspect how lexemes are distributed in an indexed tsvector column.
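A self-contained way to experiment with ts_stat is a temporary table holding a few tsvector values. The table name messages_demo and its two rows are hypothetical, and the simple configuration is used so that the words are not stemmed.

```sql
-- Two small documents; 'test' appears in both, twice in the second one.
CREATE TEMP TABLE messages_demo (tsv tsvector);

INSERT INTO messages_demo (tsv) VALUES
    (to_tsvector('simple', 'test title')),
    (to_tsvector('simple', 'test test body'));

SELECT *
FROM ts_stat('SELECT tsv FROM messages_demo')
ORDER BY nentry DESC, ndoc DESC, word;
```

For this data test should be reported with ndoc 2 and nentry 3, while body and title should each have ndoc 1 and nentry 1.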

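1.9 Stop words

Stop words are words that are very common, appear in almost every document, and carry no useful meaning for searching, so they are ignored in full-text search. Every English text, for example, contains words such as a and the, and there is no point in storing them in an index. Stop words do, however, affect the positions of the remaining words in the tsvector: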

SELECT to_tsvector('english','in the list of stop words');

to_tsvector

----------------------------

'list':3 'stop':5 'word':6

Ranks calculated for documents with and without stop words are also quite different, for example:

SELECT ts_rank_cd (to_tsvector('english','in the list of stop words'), to_tsquery('list & stop'));

ts_rank_cd

------------

0.05

SELECT ts_rank_cd (to_tsvector('english','list stop words'), to_tsquery('list & stop'));

ts_rank_cd

------------

0.1
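Whether a word is treated as a stop word depends on the text search configuration in use. As a sketch, the same sentence can be parsed with the built-in english and simple configurations; the column aliases are chosen for illustration.

```sql
-- english removes stop words and stems the remaining words;
-- simple, as shipped, only lowercases the words and keeps all of them.
SELECT to_tsvector('english', 'in the list of stop words') AS english_config,
       to_tsvector('simple',  'in the list of stop words') AS simple_config;
```

With the default dictionaries the simple_config column should keep all six words, including in, the and of, with positions 1 to 6, whereas english_config keeps only list, stop and word at positions 3, 5 and 6.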

Posted on 2014-10-27 17:49 by XIAO的博客. Post title: PostgreSQL full-text search syntax (postgresql全文检索语法).