使用全文搜索

一、检查相关组件

1. 检查全文搜索组建是否安装

SELECT SERVERPROPERTY('IsFullTextInstalled');

全文索引支持的数据类型包括CHAR, VARCHAR, NCHAR, NVARCHAR,TEXT, NTEXT, IMAGE, XML, and VARBINARY(MAX)

 

2. 检查装了哪些文档过滤器

过滤器(ifilters) 用来提取文本信息,并且移除文档的格式

EXEC sys.sp_help_fulltext_system_components 'filter';
或者
SELECT document_type, path
FROM sys.fulltext_document_types;

可以安装其他filter,比如 Microsoft Office 2010 document

http://www.microsoft.com/en-us/download/details.aspx?id=17062

安装好filter包以后需要在SQL Server中注册,使用以下语句

EXEC sys.sp_fulltext_service 'load_os_resources', 1;

可能需要重启一下在用sys.sp_help_fulltext_components检查是否安装成功。

 

3. 设置语言断子符(Word breakers)和词干分析器(Stemmers)

断字符和词干分析器用于对所有全文索引数据执行语言分析。语言分析将涉及到查找词边界(断字)和组合动词(词干分析)。断字符和词干分析器是特定于语言的,并且各语言的语言分析规则也各不相同。对于给定语言,“断字符”通过根据语言的词法规则确定词的边界位置来标识各个词。每个词(也称为“标记”tokens)使用压缩表示形式插入全文索引以减少其大小。“词干分析器”根据该语言的规则生成特定词的变形形式(inflectional forms of a word)(例如,“running”、“ran”和“runner”是单词“run”的不同形式)。

以下语句可以查看SQL Server支持的语言

SELECT  lcid ,
        name
FROM    sys.fulltext_languages
ORDER BY name;

默认非本地化的情况下使用的是英语。可以用存储sys.sp_configure来更改全文搜索的默认语言

USE AdventureWorks2012 ;
GO
EXEC sp_configure 'show advanced options', 1 ;
GO
RECONFIGURE
GO
EXEC sp_configure 'default full-text language', 1043 ;
GO
RECONFIGURE
GO

 

4. 非索引字表

有些干扰词(noise words)无益于查询,可以创建非索引字表(stoplists of stopwords)

以下语句可以查看当前数据库中用的非索引字符

SELECT stoplist_id, names
FROM sys.fulltext_stoplists;
SELECT stoplist_id, stopword, language
FROM sys.fulltext_stopwords;

 

5. 同义词查询

全文搜索也支持同义词查询。 这些同义词放在thesaurus文件中。每种语言都分配了一个XML thesaurus 文件。目录如下:

SQL_Server_install_path\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTDATA\

image

diacritics_sensitive 是否区分重音 ,0 不区分,1 区分。
expansion 使用附加词进行搜索,比如你添加了“auther”和“writer”。当用户搜索“writer”的时候同事搜索“auther”
replacement 搜索某个词的时候替换成另外一个词进行搜索。比如用“Windows 2008”替换“Win 2k8”。

Thesaurus文件改好以后还需要加载

EXEC sys.sp_fulltext_load_thesaurus_file 1033

 

二、创建并管理全文目录和索引

全文目录是一个虚拟的对象,它不属于任何文件组,里面存储了全文索引。 其中全文索引语法中多了一个选项 STATISTICAL_SEMANTICS (语义搜索)

统计语义搜索通过提取统计上相关的“关键短语”并对其进行索引,提供对 SQL Server 中存储的非结构化文档的更深层次剖析。  然后,它还使用这些关键短语标识“相似或相关文档”并对其进行索引。 您通过使用三个 Transact-SQL 行集函数将结果作为结构化数据检索,查询这些语义索引。

语义搜索扩展了全文搜索的功能,可以让你查询文档的含义。比如说,你可以通过关键词索引建立文档分类。你可以通过简历查询相匹配的工作描述。语义搜索使得你可以建立自己的文本挖掘方案。

创建表格和全文组件

1.创建表格

我们先建立一个临时表,填充一些文本,然后建立stopwords和stoplist。最后搜索文档内容

column name Data type nullability remarks
id INT NOT NULL IDENTITY, PRIMARY KEY
title NVARCHAR(100) NOT NULL Name of the documents you are going to import
doctype NCHAR(4) NOT NULL Type of the documents you are going to import
docexcerpt NVARCHAR(1000) NOT NULL Excerpt of the documents you are going to import
doccontent VARBINARY(MAX) NOT NULL Documents you are going to import
 
--Table for documents
CREATE TABLE dbo.Documents
(
  id INT IDENTITY(1,1) NOT NULL,
  title NVARCHAR(100) NOT NULL,
  doctype NCHAR(4) NOT NULL,
  docexcerpt NVARCHAR(1000) NOT NULL,
  doccontent VARBINARY(MAX) NOT NULL,
  CONSTRAINT PK_Documents 
   PRIMARY KEY CLUSTERED(id)
);
GO

2.插入数据

--Insert data
INSERT INTO dbo.Documents
(title, doctype, docexcerpt, doccontent)
SELECT N'Columnstore Indices and Batch Processing', 
 N'docx',
 N'You should use a columnstore index on your fact tables,
   putting all columns of a fact table in a columnstore index. 
   In addition to fact tables, very large dimensions could benefit 
   from columnstore indices as well. 
   Do not use columnstore indices for small dimensions. ',
 bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\ColumnstoreIndicesAndBatchProcessing.docx', 
                SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents
(title, doctype, docexcerpt, doccontent)
SELECT N'Introduction to Data Mining', 
 N'docx',
 N'Using Data Mining is becoming more a necessity for every company 
   and not an advantage of some rare companies anymore. ',
 bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\IntroductionToDataMining.docx', 
                SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents
(title, doctype, docexcerpt, doccontent)
SELECT N'Why Is Bleeding Edge a Different Conference', 
 N'docx',
 N'During high level presentations attendees encounter many questions. 
   For the third year, we are continuing with the breakfast Q&A session. 
   It is very popular, and for two years now, 
   we could not accommodate enough time for all questions and discussions! ',
 bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\WhyIsBleedingEdgeADifferentConference.docx', 
                SINGLE_BLOB) AS doc;
INSERT INTO dbo.Documents
(title, doctype, docexcerpt, doccontent)
SELECT N'Additivity of Measures', 
 N'docx',
 N'Additivity of measures is not exactly a data warehouse design problem. 
   However, you have to realize which aggregate functions you will use 
   in reports for which measure, and which aggregate functions 
   you will use when aggregating over which dimension.',
 bulkcolumn
FROM OPENROWSET(BULK 'C:\TK461\AdditivityOfMeasures.docx', 
                SINGLE_BLOB) AS doc;
GO

3. 创建搜索属性列表

文档属性的内容先前无法与文档正文的内容区分。  此局限性将全文查询限制为针对整个文档进行一般搜索。 但现在,对于 varbinary、varbinary(max)(包括 FILESTREAM)或 image 二进制数据列中支持的文档类型,您可以配置全文索引以支持对特定属性(如 Author 和 Title)进行属性范围内的搜索。 这种形式的搜索称为“属性搜索”。

-- Search property list
CREATE SEARCH PROPERTY LIST WordSearchPropertyList;
GO
ALTER SEARCH PROPERTY LIST WordSearchPropertyList
 ADD 'Authors' 
 WITH (PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9', 
       PROPERTY_INT_ID = 4, 
       PROPERTY_DESCRIPTION = 'System.Authors - authors of a given item.');
GO

4. 创建stopwords

-- Stopwords list
CREATE FULLTEXT STOPLIST SQLStopList;
GO
ALTER FULLTEXT STOPLIST SQLStopList
 ADD 'SQL' LANGUAGE 'English';
GO
--  Check the Stopwords list
SELECT w.stoplist_id,
 l.name,
 w.stopword,
 w.language
FROM sys.fulltext_stopwords AS w
 INNER JOIN sys.fulltext_stoplists AS l
  ON w.stoplist_id = l.stoplist_id;
GO

5. 使用sys.dm_fts_parser 检阅结果

sys.dm_fts_parser 返回将给定断字符、同义词库和非索引字表组合应用于查询字符串输入后生成的最终词语切分结果。此词语切分结果等效于全文引擎针对指定查询字符串的输出。

--第一个参数是用来分析的字符串,第二个是语言ID,第三个是stoplist ID (用前面的语句查询),第四个是否解析重音
SELECT *
FROM sys.dm_fts_parser
(N'"Additivity of measures is not exactly a data warehouse design problem.
However, you have to realize which aggregate functions you will use
in reports for which measure, and which aggregate functions
you will use when aggregating over which dimension."', 1033, 5, 0);
SELECT *
FROM sys.dm_fts_parser
('FORMSOF(INFLECTIONAL,'+ 'function' + ')', 1033, 5, 0);

安装语意数据库并且创建全文索引

1. 检查是否安装了语意数据库

如果以下语句无任何结果返回则需要安装,安装文件是x64\Setup 里面的SemanticLanguageDatabase.msi

SELECT *
FROM sys.fulltext_semantic_language_statistics_database;

2. 附加语义数据库

CREATE DATABASE semanticsdb ON
(FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb.mdf'),
(FILENAME = 'C:\Program Files\Microsoft Semantic Language Database\semanticsdb_log.ldf')
FOR ATTACH;

3. 注册数据库

EXEC sp_fulltext_semantic_register_language_statistics_db
 @dbname = N'semanticsdb';
GO

4. 创建全文目录

CREATE FULLTEXT CATALOG DocumentsFtCatalog;

5. 创建全文索引

-- Full-text index
CREATE FULLTEXT INDEX ON dbo.Documents
( 
  docexcerpt Language 1033, 
  doccontent TYPE COLUMN doctype
  Language 1033
  STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocumentsFtCatalog
WITH STOPLIST = SQLStopList, 
     SEARCH PROPERTY LIST = WordSearchPropertyList, 
     CHANGE_TRACKING AUTO;
GO

三、 使用CONTAINS 和 FREETEXT 谓词

CONTAINS和FREETEXT 他们都是在 WHERE 子句中使用的一个谓词,用于搜索含有基于字符的数据类型的列,以查找含义与搜索条件中的单词相同但不完全匹配的值。首先,CONTAINS可用于检索单个词或短语的精确或模糊匹配。FREETEXT则全文查询引擎将在内部对 输入的单词、短语或句子执行一些操作,如断词,变形形式,或者thesaurus制定的扩展或替代词等。再进行查询操作。

使用CONTAINS谓词

1. 查询 dbo.Documents 表中docexcerpt列包含“data”的行。

SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data');

2. 查询 dbo.Documents 表中docexcerpt列包含“data”或 “index”的行

--  Logical operators - OR
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data OR index');

3.查询 dbo.Documents 表中docexcerpt列包含“data”但是不包含“mining”的行。

-- Logical operators - AND NOT
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data AND NOT mining');

4. 找到所有dbo.Documents 表中docexcerpt列包含“data”或者 包含“fact”与“warehouse”的行。

-- Logical operators - parentheses
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'data OR (fact AND warehouse)');

5. 查询短语

-- Phrase
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'"data warehouse"');

6. 前缀

--  Prefix
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'"add*"');

7. NEAR函数

查找单词”problem“与”data“在一起的行,未指定次序以及距离

--  Simple proximity
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR(problem, data)');

设置最大距离

-- Proximity with max distance
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5)');
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),1)');

按照次序

--  Proximity with max distance and order
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'NEAR((problem, data),5, TRUE)');

8. 查找单词的变形形式

-- Inflectional forms
-- The next query does not return any rows
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'presentation');
-- The next query returns a row
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'FORMSOF(INFLECTIONAL, presentation)');
GO

使用同义词(Synonyms)和FREETEXT

1. 修改英文的thesaurus文件。内容如下:

<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
        <expansion>
            <sub>need</sub>
            <sub>necessity</sub>
        </expansion>
    </thesaurus>
</XML>

2. 加载thesaurus文件

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

3. 同义词查询

--  Synonyms
-- The next query does not return any rows
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'need');
-- The next query returns a row
SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(docexcerpt, N'FORMSOF(THESAURUS, need)');

4. 文档属性查询

SELECT id, title, docexcerpt
FROM dbo.Documents
WHERE CONTAINS(PROPERTY(doccontent,'Authors'), 'Dejan');

5. 查询包含单词”data“,”presentation“或者”need“的行。 这些单词同义词,变化形式也会查询。

SELECT id, title, doctype, docexcerpt
FROM dbo.Documents
WHERE FREETEXT(docexcerpt, N'data presentation need');

四、 使用表值函数查询全文和语意

全文索引可以用权重来对文档排名(排名计算很复杂,比如词频,数量,NEAR字句,ISABOUT字句索引的行数等)。 可以用CONTAINSTABLE 和 FREETEXTTABLE函数。此外语意查询可用三个表值函数 SEMANTICKEYPHRASETABLE, SEMANTICSIMILARITYDETAILSTABLE,和 SEMANTICSIMILARITYTABLE

使用全文搜索函数

1. 使用CONTAINSTABLE排序

--   Rank with CONTAINSTABLE
SELECT D.id, D.title, CT.[RANK], D.docexcerpt
FROM CONTAINSTABLE(dbo.Documents, docexcerpt, 
      N'data OR level') AS CT
 INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;

2. 使用FREETEXTTABLE排序

--  Rank with FREETEXTTABLE
SELECT D.id, D.title, FT.[RANK], D.docexcerpt
FROM FREETEXTTABLE (dbo.Documents, docexcerpt, 
      N'data level') AS FT
 INNER JOIN dbo.Documents AS D
  ON FT.[KEY] = D.id
ORDER BY FT.[RANK] DESC;

3. 使用权重排序

--  Weighted terms
SELECT D.id, D.title, CT.[RANK], D.docexcerpt
FROM CONTAINSTABLE
      (dbo.Documents, docexcerpt, 
       N'ISABOUT(data weight(0.8), level weight(0.2))') AS CT
 INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;

4. 相近词汇

--  Proximity term
SELECT D.id, D.title, CT.[RANK]
FROM CONTAINSTABLE (dbo.Documents, doccontent, 
      N'NEAR((data, row), 30)') AS CT
 INNER JOIN dbo.Documents AS D
  ON CT.[KEY] = D.id
ORDER BY CT.[RANK] DESC;

使用语义搜索函数

-- 1. Top 20 semantic key phrases
SELECT TOP (20)
 D.id, D.title, SKT.keyphrase, SKT.score
FROM SEMANTICKEYPHRASETABLE
      (dbo.Documents, doccontent) AS SKT
 INNER JOIN dbo.Documents AS D
  ON SKT.document_key = D.id
ORDER BY SKT.score DESC;

-- 2. Documents that are similar to document 1
SELECT SST.matched_document_key, 
 D.title, SST.score
FROM SEMANTICSIMILARITYTABLE
     (dbo.Documents, doccontent, 1) AS SST
 INNER JOIN dbo.Documents AS D
  ON SST.matched_document_key = D.id
ORDER BY SST.score DESC;

-- 3. Semantic search key phrases that are common to two documents
SELECT SSDT.keyphrase, SSDT.score
FROM SEMANTICSIMILARITYDETAILSTABLE
      (dbo.Documents, doccontent, 1,
       doccontent, 4) AS SSDT
ORDER BY SSDT.score DESC;
GO

-- 4. Clean up
DROP TABLE dbo.Documents;
DROP FULLTEXT CATALOG DocumentsFtCatalog;
DROP SEARCH PROPERTY LIST WordSearchPropertyList;
DROP FULLTEXT STOPLIST SQLStopList;
GO

 

参考文档

断字符和词干分析器
http://blog.csdn.net/tianlianchao1982/article/details/5806193

SQL Server -- 从全文索引中去掉普通字符串
http://blog.csdn.net/fan158/article/details/6273813

CREATE FULLTEXT INDEX (Transact-SQL)

https://msdn.microsoft.com/zh-cn/library/ms187317.aspx

语义搜索 (SQL Server)

https://msdn.microsoft.com/zh-cn/library/gg492075.aspx#find1

使用搜索属性列表搜索文档属性

https://msdn.microsoft.com/zh-SG/library/ee677637

Find Property Set GUIDs and Property Integer IDs for Search Properties

https://msdn.microsoft.com/en-us/library/ee677618.aspx

Windows Properties

https://msdn.microsoft.com/library/dd561977.aspx

全文检索,CONTAINS 和FREETEXT的用法比较

https://social.msdn.microsoft.com/Forums/zh-CN/42f48bfe-0095-4e1d-a0f8-5127de3ff898/contains-freetext

posted @ 2015-03-01 18:03  Haseo  阅读(1513)  评论(1编辑  收藏  举报