Lucene.Net RangeQuery really is inefficient

Posted on 2008-11-03 12:07 by eaglet

Many articles mention that Lucene.Net's RangeQuery is very slow. I tested it today, and it is indeed very slow; the results did not look right either.

Test code:

Indexing

public void Index(int count)
{
    // Create (or overwrite) the index in INDEX_DIR.
    IndexWriter writer = new IndexWriter(INDEX_DIR, new Lucene.Net.Analysis.SimpleAnalyzer(), true);
    _Count = count;

    // Note: this Document is created once and reused, so fields accumulate:
    // every added document also carries the Id and Text values of all earlier ones.
    Document doc = new Document();

    for (int i = 0; i < count; i++)
    {
        if (IndexProgress != null)
        {
            IndexProgress(i);
        }

        // Zero-pad the Id so that string order matches numeric order.
        string iStr = string.Format("{0:00000}", i);
        Field field = new Field("Id", iStr, Field.Store.YES, Field.Index.UN_TOKENIZED);
        doc.Add(field);

        field = new Field("Text", "Test " + i.ToString(), Field.Store.YES, Field.Index.TOKENIZED);
        doc.Add(field);

        writer.AddDocument(doc);
    }

    if (IndexProgress != null)
    {
        IndexProgress(count);
    }

    writer.Optimize();
    writer.Close();
}

Searching

public string Search(int begin, int end)
{
    IndexSearcher search = new IndexSearcher(INDEX_DIR);

    try
    {
        // Pad the bounds the same way the Id field was indexed.
        string bStr = string.Format("{0:00000}", begin);
        string eStr = string.Format("{0:00000}", end);

        // Inclusive range over the Id field.
        RangeQuery query = new RangeQuery(new Term("Id", bStr), new Term("Id", eStr), true);

        //QueryParser qp = new QueryParser("Text", new Lucene.Net.Analysis.SimpleAnalyzer());
        //Query q = qp.Parse("Test");
        //BooleanQuery bq = new BooleanQuery();
        //bq.Add(query, BooleanClause.Occur.MUST);
        //bq.Add(q, BooleanClause.Occur.MUST);

        // Time only the search call and the hit count.
        _StopWatch.Reset();
        _StopWatch.Start();
        Hits hits = search.Search(query);
        int retCount = hits.Length();
        _StopWatch.Stop();
    }
    catch (Exception e)
    {
        return e.Message;
    }
    finally
    {
        search.Close();
    }

    StringBuilder report = new StringBuilder();
    report.AppendLine("**************TestRange Report******************");
    report.AppendFormat("Index count = {0}\r\n", _Count);
    report.AppendFormat("Begin {0} to {1}\r\n", begin, end);
    report.AppendFormat("ElapsedMilliseconds = {0}\r\n", _StopWatch.ElapsedMilliseconds);
    report.AppendLine("**************End Report************************");
    return report.ToString();
}

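I used this code to index 3000 records.

Searching the range 0 to 1000 took more than 2 seconds; searching 0 to 2000 made Lucene throw an error.

The results also look wrong: hitDocs holds 1000 entries, yet hits.Length() returns 3000.

And hitDocs is a private member of Hits, so there is no way to read it from outside.

The 0 to 2000 search fails because the query expands to more clauses than the allowed maximum, which defaults to 1024.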

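From this I roughly conclude that Lucene does not evaluate a range with a traditional B+ tree or a similar structure. Instead it handles the range the full-text way: when searching 00000 to 01000 it generates 1001 clauses, "00000", "00001", ..., "01000", matches each of them against the full-text index, scores the matching records, and outputs those with the highest scores.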
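The clause expansion described above can be sketched without Lucene at all. This is only a minimal model under my own assumptions: RangeExpansionSketch, ExpandToClauses, and BuildIdTerms are hypothetical names, and the clause limit is passed in as a plain parameter rather than read from the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a term-expanding range query; not Lucene.Net code.
public static class RangeExpansionSketch
{
    // Collects every indexed term in [lower, upper]; each one would become a clause.
    public static List<string> ExpandToClauses(
        IEnumerable<string> sortedTerms, string lower, string upper, int maxClauses)
    {
        var clauses = new List<string>();
        foreach (string term in sortedTerms)
        {
            if (string.CompareOrdinal(term, lower) < 0) continue; // before the range
            if (string.CompareOrdinal(term, upper) > 0) break;    // past the range
            if (clauses.Count >= maxClauses)
                throw new InvalidOperationException("too many clauses");
            clauses.Add(term);
        }
        return clauses;
    }

    // Builds zero-padded Id terms for 0 .. count - 1, as the indexing code does.
    public static List<string> BuildIdTerms(int count)
    {
        var terms = new List<string>(count);
        for (int i = 0; i < count; i++)
            terms.Add(string.Format("{0:00000}", i));
        return terms;
    }
}
```

With 3000 Id terms and a limit of 1024, the range 00000 to 01000 expands to 1001 clauses, while 00000 to 02000 hits the limit and throws. The cost grows with the number of terms in the range, not with the number of results wanted.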

The Lucene code that runs the search is as follows:

if (hitDocs.Count > min)
{
    min = hitDocs.Count;
}

int n = min * 2; // double # retrieved
TopDocs topDocs = (sort == null) ? searcher.Search(weight, filter, n) : searcher.Search(weight, filter, n, sort);
length = topDocs.totalHits;
ScoreDoc[] scoreDocs = topDocs.scoreDocs;

float scoreNorm = 1.0f;
if (length > 0 && topDocs.GetMaxScore() > 1.0f)
{
    scoreNorm = 1.0f / topDocs.GetMaxScore();
}

int end = scoreDocs.Length < length ? scoreDocs.Length : length;
for (int i = hitDocs.Count; i < end; i++)
{
    hitDocs.Add(new HitDoc(scoreDocs[i].score * scoreNorm, scoreDocs[i].doc));
}

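Here we can see that length is assigned topDocs.totalHits rather than topDocs.scoreDocs.Length.

And topDocs.totalHits is always 3000, the total number of records in the index. (The indexing code above adds the same Document instance on every iteration, so each record also carries the Id 00000 and matches any range that starts at 0, which is consistent with this count.)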
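The gap between the two numbers can be reproduced with a small model of a top-N search. This is a sketch under my own assumptions, not Lucene's collector: TopDocsSketch and TopNSketch are hypothetical names, and documents are kept in arrival order instead of being ranked by score.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of a top-N result; it mirrors TopDocs only loosely.
public sealed class TopDocsSketch
{
    public int TotalHits;                          // every document that matched
    public List<int> ScoreDocs = new List<int>();  // only the first n that were kept
}

public static class TopNSketch
{
    public static TopDocsSketch Search(IEnumerable<int> matchingDocIds, int n)
    {
        var result = new TopDocsSketch();
        foreach (int docId in matchingDocIds)
        {
            result.TotalHits++;                    // counted whether kept or not
            if (result.ScoreDocs.Count < n)
                result.ScoreDocs.Add(docId);       // a real collector would rank by score
        }
        return result;
    }
}
```

If 3000 documents match and n is 100, TotalHits is 3000 while ScoreDocs holds 100 entries, so a length taken from the total will always exceed the number of documents actually retrieved so far.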

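Setting aside whether the results are correct, this algorithm is far too inefficient, and it restricts a range to at most as many terms as the clause limit allows (1024 by default). A wider range cannot be queried at all (you can raise the clause limit manually, but that greatly increases system overhead).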

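In vertical search, combining a range with a full-text query is very common: for example, finding books whose title contains a given name and whose price falls within some range.

Implementing this kind of requirement with Lucene's RangeQuery is basically impossible.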
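For comparison, here is a sketch of the kind of range handling I would expect for this case: the range is resolved once into a set of document numbers, and that set is intersected with the full-text hits, so no per-term clause or score is involved. Everything here is an assumption made for illustration: RangeFilterSketch, sortedTerms, and docsForTerm are hypothetical in-memory structures, not Lucene.Net internals.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical in-memory index; illustrates the idea only.
public static class RangeFilterSketch
{
    // sortedTerms[i] is the i-th term in ordinal order;
    // docsForTerm[i] lists the documents that contain it.
    public static BitArray DocsInRange(
        List<string> sortedTerms, List<int[]> docsForTerm,
        string lower, string upper, int docCount)
    {
        var bits = new BitArray(docCount);
        // Binary search finds the first term >= lower instead of scanning from the start.
        int start = sortedTerms.BinarySearch(lower, StringComparer.Ordinal);
        if (start < 0) start = ~start;
        for (int i = start; i < sortedTerms.Count; i++)
        {
            if (string.CompareOrdinal(sortedTerms[i], upper) > 0) break;
            foreach (int doc in docsForTerm[i]) bits.Set(doc, true);
        }
        return bits;
    }

    // Keeps only the text hits whose document is also inside the range.
    public static List<int> Intersect(IEnumerable<int> textHits, BitArray inRange)
    {
        var result = new List<int>();
        foreach (int doc in textHits)
            if (inRange.Get(doc)) result.Add(doc);
        return result;
    }
}
```

The range lookup here is bounded by the terms inside the range plus one binary search, and the clause limit never comes into play because the range contributes a bit set rather than query clauses.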

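This test makes me doubt Lucene's prospects for commercial use. I plan to adjust the design and development plan of Hubble.Net and give priority to replacing Lucene, in part or in whole.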
