Sql2005 fulltext search 确是对中文支持不怎么样.
使用contains 和 FreeText 搜索中文的老是有问题.
http://technet.microsoft.com/zh-cn/library/ms170355(SQL.90).aspx
这个里面提到的很多东西还没有玩过
NTAINS (Transact-SQL)
更新日期: 2006 年 4 月 14 日一个谓词,用于在包含字符数据类型的列中搜索单个词和短语的精确或模糊(不太精确)匹配项、一定差异范围内的相邻词或加权匹配项。
在 SQL Server 2005 中,可以在 CONTAINS 或 FREETEXT 全文谓词中使用由四部分组成的名称,对链接服务器执行查询。
CONTAINS 谓词可以搜索:
- 词或短语。
- 词或短语的前缀。
- 与另一个词相邻的词。
- 由另一个词的词形变化而生成的词(例如,drive 一词是 drives、drove、driving 和 driven 词形变化的词干)。
- 使用同义词库确定的另一个词的同义词(例如,metal 一词可能有 aluminum 和 steel 等同义词)。
Extending SQL 2005 Fulltext Search
http://blogs.msdn.com/shajan/
Extending SQL 2005 fulltext search
Author: Shajan Dasan (Development Lead, Microsoft Search)
Date: 1/24/2005
Fulltext indexing consists of extracting text out of documents, breaking the text into individual words and storing the words in an index for fast lookup. A filter (IFilter) is a COM component that extracts text out of documents. SQL 2005 (SQL Server 2005) ships with many filters out of the box – e.g. filters capable of extracting text out of Word, PowerPoint, html. WordBreakers (IWordBreaker) are language specific COM components capable of separating words from text. SQL Server 2005 supports WordBreakers for many languages out of the box – e.g. English, German, Japanese. It is possible to extend SQL Server 2005 fulltext search by implementing custom Filters and WordBreakers. The IFilter and IWordBreaker interfaces are documented in MSDN. This document describes authoring and installing Filters and Wordbreakers in SQL Server 2005.
Component loading model
SQL Server 2005 component (IFilter/IWordBreaker) loading model is different from previous versions of SQL Server. There is a backward compatible mode, which is described later in admin settings. Changes were made to better isolate different instances of SQL Server 2005 and improve the security of the indexing process. Different instances of SQL Server 2005 can be installed on the same machine. A change in components of one instance does not affect other instances. To improve the security of the indexing process, Authenticode signing of components is required by default. Components signature is verified before it is loaded.
Fulltext Search component installation involves (a) Installing the binaries and resources to the SQL Server 2005 instance folder and (b) Optionally installing the Authenticode certificates of the component publisher if the publisher is not trusted on the machine (c) Updating the instance specific registry keys.
Multi Instance isolation
Multiple instances of SQL can be installed on a machine. Each instance has its own copy of search components. Changes to components in one instance do not affect other instances. Each instance of SQL Server 2005 binaries are installed in a different folder. Each instance of SQL Server 2005 has a corresponding instance of fulltext search service. Fulltext search service instance has a registry root and install path. The search registry root is a node in the SQL instance registry and the search install path is the Binn directory of the SQL Server 2005 instance. Fulltext search instance information (registry root and install path) can also be found by expanding HKEY_LOCAL_MACHINE"SOFTWARE"Microsoft"MSFTESQLInstMap/<instance#> where instance# is the SQL Server 2005 instance number (1, 2, 3 …). The value of Path gives the path to search binaries, and RegRoot points to the registry root for the instance. The example below illustrates the registry layout for a multi instance SQL installation.
e.g.:
Fulltext Search instance map for instance #1, pointing to install path and registry root
Binn folder D:"MSSQL"MSSQL.1"MSSQL"Binn
Registry root HKEY_LOCAL_MACHINE"SOFTWARE"Microsoft"Microsoft SQL Server"MSSQL.1"MSSearch
Fulltext Search instance map for instance #2, pointing to install path and registry root
Binn folder D:"Program Files"Microsoft SQL Server"MSSQL.2"MSSQL"Binn
Registry root HKEY_LOCAL_MACHINE"SOFTWARE"Microsoft"Microsoft SQL Server"MSSQL.2"MSSearch
Fulltext search registry root for instance 1, 2 inside SQL Server 2005 instance root
Registering Components
Components need to be installed for individual instances of SQL Server 2005. WordBreakers and Filters have slightly different installation procedure. First, install the binaries and configuration files (if any) to the SQL Server 2005 instance Binn folder, such as C:"Program Files"Microsoft SQL Server"MSSQL.1"MSSQL"Binn. Follow the Registration procedure below.
WordBreaker registration:
1. Add the COM ClassID(s) as a key to <InstanceRoot>"MSSearch"CLSID. The default value of this key is the path to the component. If the component is installed in the instance Binn folder, the full path is optional. ThreadingModel can also be specified here. It is not required to register the ClassIDs as regular COM components.
2. Add a key to the <InstanceRoot>"MSSearch"Language for the installed language. Fill in the Locale number and StemmerClass, WBreakerClass ClassIDs. See IStemmer interface in MSDN for more information about stemming.
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/wbrscenario_4ckl.asp
3. Configuration files include lexicons, noise word files and thesaurus. These are optional. Lexicons are configuration files commonly used by WordBreakers, and are internal to WordBreaker implementations. Noise files are used to eliminate common words from the index. A language specific noise word can be provided and its path specified in the NoiseFile field. (See below for a sample). Thesaurus file is also optional and can be specified. A thesaurus is used at query time to expand query phrases.
Component registration sample for English WordBreaker / Stemmer.
Filter registration:
1. Add the COM ClassID(s) as a key to <InstanceRoot>"MSSearch"CLSID. The default value of this key is the path to the component. If the component is installed in the instance Binn folder, the full path is optional. ThreadingModel can also be specified here. It is not required to register the ClassIDs as regular COM components.
2. Add a key to the <InstanceRoot>"MSSearch"Filters for the installed language. Fill in the Locale number and StemmerClass, WBreakerClass ClassIDs. See IStemmer interface in MSDN for more information about stemming.
Component registration sample for .aspx IFilter
Signature Validation
Component signature is validated before it is used. All components need to be signed, by the component publisher, and the publisher needs to be trusted on the machine. More information on code signing is available at MSDN. Signature verification can be disabled, this is not recommended, see admin settings below.
Admin Settings
Load OS Resources: By default, loading components (filters/word breakers) installed with the OS is disabled. This is the recommended setting, and helps isolate different instances of SQL Server 2005. This setting can be changed by the command
Exec Sp_fulltext_service ‘load_os_resources’, 1
Verify Signature: Signature verification is enabled by default. Disabling this is not recommended. This setting can be changed by the command
Exec Sp_fulltext_service ‘verify_signature’, 0
Apendix
For example, for the default SQL 2005 instance, the German word breaker registry key is at: HKEY_LOCAL_MACHINE"SOFTWARE"Microsoft"Microsoft SQL Server"MSSQL.1"MSSearch"CLSID"{6DE705A5-6467-43DC-9CE3-CCFE08B3F645}
The stemmer registry key for the default instance is at:
HKEY_LOCAL_MACHINE"SOFTWARE"Microsoft"Microsoft SQL Server"MSSQL.1"MSSearch"CLSID"{9EA69E16-AD5E-43ED-B90E-73D58771D3F6}
In general it is better to register a new language for the specific wordbreaker. As long as the column that need to be indexed in that specific language, you still have this feature plus all of the regular language support. This will be a better solution rather than replacing the default language. If a column just needs regular word breaking, you can still use default language wordbreaker.
When registering a new LCID, the following TSQL command needs to execute in SQL Server 2005:
exec sp_fulltext_service 'update_languages'
It is possible to create a new LCID , for example LCID 1034, and register text parser, stemmer, ignored tokens file and thesaurus file for it. Then Fulltext Index can be created using TSQL command
CREATE FULLTEXT INDEX ON table1(column2 LANGUAGE 1034) KEY INDEX ncidx1 ON SpecialTCDB_Catalog
The TSQL code to create registry key and populate it:
exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE"Microsoft"MSSQLSERVER"MSSearch"Language"LCID', 'Locale', 'REG_DWORD', 1034
exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE"Microsoft"MSSQLSERVER"MSSearch"Language"LCID', 'NoiseFile', 'REG_SZ', 'noiseenu.txt'
exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE"Microsoft"MSSQLSERVER"MSSearch"Language"LCID', 'StemmerClass', 'REG_SZ', '{EEED4C20-7F1B-11CE-BE57-00AA0051FE20}'
exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE"Microsoft"MSSQLSERVER"MSSearch"Language"LCID', 'TSaurusFile', 'REG_SZ', 'TSENU.XML'
exec master.dbo.xp_instance_regwrite 'HKEY_LOCAL_MACHINE', 'SOFTWARE"Microsoft"MSSQLSERVER"MSSearch"Language"LCID', 'WBreakerClass', 'REG_SZ', '{188D6CC5-CB03-4C01-912E-47D21295D77E}'
作者: David Yang(David Yang's Tech Blog)
出处:http://davidyang.cnblogs.com/
本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接,否则保留追究法律责任的权利。
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· TypeScript + Deepseek 打造卜卦网站:技术与玄学的结合
· 阿里巴巴 QwQ-32B真的超越了 DeepSeek R-1吗?
· 【译】Visual Studio 中新的强大生产力特性
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 【设计模式】告别冗长if-else语句:使用策略模式优化代码结构