A Study of DotLucene (Lucene.Net) [repost]

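1. Create a new project

a) Add a reference to Lucene.Net.dll and import the namespaces used in this article:

using System;
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Documents;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;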

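2. Creating an index
Before you can search, you need to create an index and add the data you want to search to it.

a) An index is created with the IndexWriter class:
IndexWriter writer = new IndexWriter(@"C:\your\index\directory", new StandardAnalyzer(), true);
This creates a new index in a directory on disk (you need write permission for that directory). An index can also be stored in memory or in a database.

b) Once the index has been created, open it with false as the last argument whenever you need to add documents to it; passing true again would replace the existing index with an empty one.
IndexWriter writer = new IndexWriter(@"C:\your\index\directory", new StandardAnalyzer(), false);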

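3. Adding documents to the index

Once the index is open, you can add documents to it. Every entity you insert into the index is a document, and every field of a document holds one piece of related information. For each field you have to specify how it is handled, that is, whether it is:

stored in the index (stored values can be read back from the index later; useful for short fields such as author or title)

indexed (a field must be indexed if you want to search on it)

tokenized (the text is split into words before it is indexed)

A field can be created with this constructor:

public Field(string fieldName, string fieldValue, bool stored, bool indexed, bool tokenized);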

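For convenience, static methods are also available for creating a new field instance:

Field.Keyword() – the value is stored and indexed, but not tokenized

Field.Text(string name, StreamReader val) – the value is not stored, but it is indexed and tokenized
Field.Text(string name, string val) – the value is stored, indexed and tokenized

Field.UnIndexed() – the value is only stored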
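The flag combinations behind these helper methods can be summarized in a small lookup table. The sketch below does not call Lucene.Net at all; FieldFlags and FieldKinds are names made up here for illustration. The Field.UnStored entry is not part of the list above; it is the helper for values that are indexed and tokenized but not stored.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration type: the three flags a field is created with.
public struct FieldFlags
{
    public readonly bool Stored, Indexed, Tokenized;

    public FieldFlags(bool stored, bool indexed, bool tokenized)
    {
        Stored = stored;
        Indexed = indexed;
        Tokenized = tokenized;
    }
}

public static class FieldKinds
{
    // Helper-method name -> flags, following the descriptions in the list above.
    public static readonly Dictionary<string, FieldFlags> Table =
        new Dictionary<string, FieldFlags>
        {
            { "Keyword",      new FieldFlags(true,  true,  false) },
            { "Text(string)", new FieldFlags(true,  true,  true)  },
            { "Text(reader)", new FieldFlags(false, true,  true)  },
            { "UnIndexed",    new FieldFlags(true,  false, false) },
            { "UnStored",     new FieldFlags(false, true,  true)  }
        };

    // A field can be searched only if it is indexed.
    public static bool IsSearchable(string kind)
    {
        return Table[kind].Indexed;
    }

    // A field value can be read back from a search hit only if it is stored.
    public static bool IsRetrievable(string kind)
    {
        return Table[kind].Stored;
    }
}
```

For instance, FieldKinds.IsSearchable("UnIndexed") returns false, while FieldKinds.IsRetrievable("UnIndexed") returns true: such a field can be displayed in results but not queried.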

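A value can only be tokenized and indexed if it is plain text. To index text in any other format you need a parser for that format; an HTML document, for example, has to be parsed into plain text first.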
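As a rough picture of what tokenizing plain text means, here is a minimal sketch that lowercases the input, splits it on anything that is not a letter or digit, and drops a few stop words. It is only an illustration and not how StandardAnalyzer is implemented; SimpleTokenizer and its three-word stop list are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SimpleTokenizer
{
    // Stop words used for this sketch only; a real analyzer ships its own list.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "the", "and" };

    // Splits plain text into lowercase words and drops stop words.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        // The appended space makes sure the last word is flushed too.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                string word = current.ToString();
                if (!StopWords.Contains(word))
                {
                    tokens.Add(word);
                }
                current.Length = 0;   // start collecting the next word
            }
        }
        return tokens;
    }
}
```

Tokenize("The Quick and the Brown fox") yields the three tokens quick, brown and fox. Markup such as HTML tags would end up in the token list as ordinary words, which is why formatted text has to be converted to plain text beforehand.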

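In our example we index a text file, using these fields:

"filename" (stored, not indexed, not tokenized)

"text" (not stored, indexed, tokenized)

With fields set up this way, only the content (the "text" field) can be searched.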

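void addDocument(IndexWriter writer, string path)
{
    Document doc = new Document();
    StreamReader sr = new StreamReader(path, System.Text.Encoding.Default);
    try
    {
        // "text": indexed and tokenized, but not stored
        doc.Add(Field.Text("text", sr));
        // "filename": stored only, so it can be shown in the results
        doc.Add(Field.UnIndexed("filename", path));
        writer.AddDocument(doc);
    }
    finally
    {
        // the reader is consumed while the document is added, so close it afterwards
        sr.Close();
    }
}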

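4. Saving the index
Don't forget to close the writer so that the index is saved:

writer.Close();

You can also optimize the index before closing the IndexWriter:

writer.Optimize();
writer.Close();

Optimizing may take some time, but it improves search performance. Run it once all documents have been added, just before the writer is closed.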

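5. Searching
Before you can search an index, you have to open it with IndexSearcher. You can open and search an index even while another process is adding documents to it or optimizing it.

IndexSearcher searcher = new IndexSearcher(@"C:\your\index\directory");

Then create a query:

string q = "dotlucene";
Query query = QueryParser.Parse(q, "text", new StandardAnalyzer());

Now fetch the search results and print them. Only the file name can be printed, because no other field is stored in the index.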

Hits hits = searcher.Search(query);
Console.WriteLine("Found " + hits.Length() + " document(s) that matched query '" + q + "':\r\n");
for (int i = 0; i < hits.Length(); i++)
{
    Document doc = hits.Doc(i);
    Console.WriteLine(doc.Get("filename") + "\r\n");
}

Finally, don't forget to close the searcher:

searcher.Close();

6. Query syntax

Query	Example	Notes
Single term	document	Searches for documents that contain the term "document" in the default field.
Phrase	"important document"	Searches for documents that contain the phrase "important document" in the default field.
Searching fields	title:document	Searches for documents that contain the term "document" in the "title" field.
Wildcard search	doc?ment	Single-character wildcard search. It will match "document" and "dociment" but not "docooment".
	document*	Multi-character wildcard search. It will match "document" and "documentation".
Fuzzy search	document~	Search based on similar spelling.
	document~0.9	Search based on similar spelling. 0.9 is the required similarity (default: 0.5).
Proximity search	"important document"~5	Finds the words of a phrase even when they are not next to each other. The maximum distance in this example is 5 words.
Range search	author:{Einstein TO Newton}	Searches for documents whose "author" field value lies between the specified values. Curly braces exclude the end values; square brackets include them.
	date:{20050101 TO 20050201}	Searches for documents whose "date" field (DateTime type) value lies between the specified dates.
Relevance (boost)	important^4 document	Sets the boost factor of the term "important" to 4. The default boost factor is 1.
	"important document"^4 "search engine"	You can set a boost factor for phrases too.
OR operator	important document	"OR" is the default operator.
	important OR document	The default field must contain either "important" or "document".
AND operator	important AND document	The default field must contain both words.
+ operator	important +document	The default field must contain "document" and may contain "important".
NOT/- operator	-important document	The default field must contain "document" but not "important".
Grouping	(important OR office) AND document	Use parentheses to group expressions.
	author:(Einstein OR Newton)	Parentheses work with fields as well.
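Range queries on dates are usually assembled from DateTime values in code. The following sketch builds the query text shown in the table, assuming the "date" field was indexed as yyyyMMdd text as the table example suggests; RangeQueryText and ForDates are hypothetical names, and no Lucene.Net call is involved.

```csharp
using System;
using System.Globalization;

public static class RangeQueryText
{
    // Builds query text such as date:{20050101 TO 20050201}.
    // Curly braces mean that the two end dates themselves are excluded.
    public static string ForDates(string field, DateTime from, DateTime to)
    {
        // Fixed-width, zero-padded text keeps string order equal to date order.
        string lower = from.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
        string upper = to.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
        return field + ":{" + lower + " TO " + upper + "}";
    }
}
```

ForDates("date", new DateTime(2005, 1, 1), new DateTime(2005, 2, 1)) returns date:{20050101 TO 20050201}, which can then be passed to QueryParser.Parse like any other query string.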

7. Prohibited queries

Query	Examples	Notes
Wildcard at the beginning of a term	?ocument, *ocument	Throws Lucene.Net.QueryParsers.ParseException.
Stop words	a, the, and	Stop words are not indexed.
Special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \	\+, \:	Use a backslash to escape the special characters.
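When query text comes from user input, the escaping rule in the last row can be applied in code. Below is a minimal hand-written sketch; QueryEscaper is a made-up helper for this article, not a Lucene.Net class, and for simplicity it escapes every single & and | rather than only the && and || pairs.

```csharp
using System;
using System.Text;

public static class QueryEscaper
{
    // The characters listed in the table above as special to the query parser.
    const string Special = "+-&|!(){}[]^\"~*?:\\";

    // Puts a backslash in front of every special character in the term.
    public static string Escape(string term)
    {
        var sb = new StringBuilder();
        foreach (char c in term)
        {
            if (Special.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Escape("C++") returns C\+\+ and Escape("title:a") returns title\:a, so the text is searched literally instead of being read as operators or a field name.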

Study notes:

1. TermQuery: searches for one specific term

Term t = new Term("isbn", "1930110995");
Query query = new TermQuery(t);

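2. RangeQuery: searches a range of terms, typically used for date and time queries

For example:
Query query = new RangeQuery(new Term("time", "20070516"), new Term("time", "20070517"), false);

The third argument of RangeQuery specifies whether the start and end terms themselves are included in the range.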
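Range bounds are compared as strings, not as numbers or dates. The sketch below imitates that comparison without using Lucene.Net, to show what the inclusive flag changes; TermRange is a hypothetical helper, and using an ordinal string comparison is an assumption of this illustration.

```csharp
using System;

public static class TermRange
{
    // True if value lies between lower and upper when compared as plain strings.
    // The inclusive flag plays the role of RangeQuery's third argument.
    public static bool Contains(string lower, string upper, bool inclusive, string value)
    {
        int vsLower = string.CompareOrdinal(value, lower);
        int vsUpper = string.CompareOrdinal(value, upper);
        return inclusive
            ? (vsLower >= 0 && vsUpper <= 0)
            : (vsLower > 0 && vsUpper < 0);
    }
}
```

With inclusive set to false, as in the example above, Contains("20070516", "20070517", false, "20070516") is false: an eight-digit value equal to one of the bounds falls outside the range, while a longer value such as "200705161230" falls inside it. String comparison is also the reason for fixed-width values: Contains("1", "20", true, "100") is true, because "100" sorts between "1" and "20".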

3. PrefixQuery: matches terms that start with a given prefix, commonly used for catalog (category) lookups.
PrefixQuery query = new PrefixQuery(new Term("category", "/Computers"));

4. BooleanQuery: combines several conditions.

// books whose subject is "junit"
TermQuery searchingBooks = new TermQuery(new Term("subject", "junit"));

// books published from 2003-01 to 2003-12, both months included
RangeQuery currentBooks = new RangeQuery(new Term("pubmonth", "200301"),
                                         new Term("pubmonth", "200312"),
                                         true);

// both clauses are required (true) and not prohibited (false)
BooleanQuery currentSearchingBooks = new BooleanQuery();
currentSearchingBooks.Add(searchingBooks, true, false);
currentSearchingBooks.Add(currentBooks, true, false);

// directory: the index location, defined elsewhere
IndexSearcher searcher = new IndexSearcher(directory);
Hits hits = searcher.Search(currentSearchingBooks);

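When is the result an AND and when is it an OR? That depends on the arguments passed to the Add method of BooleanQuery.

The first argument is the query clause to add.

The second argument, required, says whether the clause must be satisfied: true means it must match, false means it does not have to.

The third argument, prohibited, says whether the clause must be rejected: true means results that satisfy the clause are excluded, false means results may satisfy it.

Two required clauses, as in the example above, therefore behave like AND; clauses that are neither required nor prohibited behave like OR.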
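The correspondence between the two flags and the query-syntax operators can be written down as a small function. This is only an illustration with a made-up name (ClauseKind) and no Lucene.Net dependency; rejecting the combination true/true is a choice made in this sketch, since a clause cannot sensibly be both required and prohibited.

```csharp
using System;

public static class ClauseKind
{
    // Maps the (required, prohibited) pair to the matching query-syntax prefix.
    public static string Describe(bool required, bool prohibited)
    {
        if (required && prohibited)
        {
            throw new ArgumentException(
                "A clause cannot be both required and prohibited.");
        }
        if (required)
        {
            return "+ (must match, AND-like)";
        }
        if (prohibited)
        {
            return "- (must not match, NOT)";
        }
        return "(may match, OR-like)";
    }
}
```

Describe(true, false) corresponds to the two clauses in the example above, Describe(false, true) to a NOT clause, and Describe(false, false) to an optional clause.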

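Note that AND and OR must be written in upper case. To express "A and not B", write A AND -B; +A -B works as well.

By default, QueryParser treats a space between terms as OR. You can change this default through the QueryParser object.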

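5. PhraseQuery
PhraseQuery searches for a phrase. Its central concept is the slop: how far the terms may be displaced from their positions in the phrase. The slop influences the score of a result, and a slop of 0 means an exact phrase match. The example below makes this easier to see; users do not need to know exactly how slop is calculated. A very large slop does slow queries down, however, so keep the value small in practice. With enough slop, PhraseQuery also matches the terms in a different order (swapping two adjacent terms costs a slop of 2). This raises the hit rate but also has a large effect on performance; SpanNearQuery lets you control whether the order of the terms matters, which can improve performance.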

private bool matched(String[] phrase, int slop)
{
    PhraseQuery query = new PhraseQuery();
    query.SetSlop(slop);

    for (int i = 0; i < phrase.Length; i++)
    {
        query.Add(new Term("field", phrase[i]));
    }

    Hits hits = searcher.Search(query);
    return hits.Length() > 0;
}
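For readers who do want a feel for the numbers, the slop needed by a phrase can be estimated from the positions of its terms in the document. The following is a simplified model, not the matching code used by Lucene.Net: it assumes each phrase term occurs exactly once, and SlopModel is a name invented for this sketch.

```csharp
using System;

public static class SlopModel
{
    // positions[i] is the position in the document of the i-th phrase term.
    // Returns the smallest slop at which this simplified model reports a match.
    public static int RequiredSlop(int[] positions)
    {
        if (positions.Length == 0)
        {
            return 0;
        }

        int min = int.MaxValue;
        int max = int.MinValue;
        for (int i = 0; i < positions.Length; i++)
        {
            // Where the phrase would have to start for term i to sit here.
            int shifted = positions[i] - i;
            min = Math.Min(min, shifted);
            max = Math.Max(max, shifted);
        }
        // All terms agree on one start position exactly when max == min.
        return max - min;
    }
}
```

Take a document containing "quick brown fox" (positions 0, 1, 2). The phrase "quick brown" gives RequiredSlop(new[] { 0, 1 }) = 0, "quick fox" gives RequiredSlop(new[] { 0, 2 }) = 1, and the reversed phrase "fox quick" gives RequiredSlop(new[] { 2, 0 }) = 3, which is why out-of-order matches need noticeably more slop.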

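When QueryParser is used for phrase queries, the slop has to be set first. There are two ways to do this:

// Option 1: put the slop into the query string
Query q2 = QueryParser.Parse("\"quick fox\"~1", "field", new SimpleAnalyzer());

// Option 2: set the phrase slop on the parser, then parse
QueryParser qp = new QueryParser("field", new SimpleAnalyzer());
qp.SetPhraseSlop(1);
Query q3 = qp.Parse("\"quick fox\"");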

Original article: http://www.shenjk.com/detail/277

posted @ 2013-01-07 11:39 M'
