[Reading notes] Building full-text search for a website in just 37 lines of code

Original English title: DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code

Original article: http://www.codeproject.com/KB/aspnet/DotLuceneSearch.aspx

Author: Dan Letecky

DotLucene is a capable full-text search engine. This article shows how to build full-text search for a website with 37 lines of core code.

Create the index:

    // The third argument (true) creates a new index, overwriting any existing one.
    IndexWriter writer =
        new IndexWriter(directory, new StandardAnalyzer(), true);

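Add a document:

    public void AddHtmlDocument(string path)
    {
        Document doc = new Document();

        // Read the file and strip the HTML markup so only plain text is indexed.
        string rawText;
        using (StreamReader sr =
            new StreamReader(path, System.Text.Encoding.Default))
        {
            rawText = parseHtml(sr.ReadToEnd());
        }

        // "text" is indexed for searching but not stored in the index.
        doc.Add(Field.UnStored("text", rawText));
        // "path" is stored as-is so it can be read back from each search hit.
        doc.Add(Field.Keyword("path", path));
        writer.AddDocument(doc);
    }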
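The parseHtml helper called in AddHtmlDocument is not shown in these notes. The following is a minimal sketch of what such a helper could look like, assuming its only job is to turn HTML into plain text for indexing. The class name HtmlText and the regex-based approach are assumptions made for illustration, not the original author's implementation; a real HTML parser would cope better with malformed markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical stand-in for the parseHtml helper used in the notes.
    public static string ParseHtml(string html)
    {
        // Drop script and style blocks entirely, including their contents.
        string text = Regex.Replace(html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words stay separate.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Tags are replaced with spaces rather than removed outright so that words separated only by markup, as in "one<br>two", are not fused into a single token before the analyzer sees them.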

Optimize and save the index:

    writer.Optimize();
    writer.Close();

Open the index for searching:

    IndexSearcher searcher = new IndexSearcher(directory);

Run the search:

    Query query =
        QueryParser.Parse(q, "text", new StandardAnalyzer());
    Hits hits = searcher.Search(query);

The variable hits is a collection of result documents. The code below iterates over the results and stores them in a DataTable.

    DataTable dt = new DataTable();
    dt.Columns.Add("path", typeof(string));
    dt.Columns.Add("sample", typeof(string));

    for (int i = 0; i < hits.Length(); i++)
    {
        // get the document from the index
        Document doc = hits.Doc(i);

        // get the document's path; the text itself cannot be read
        // back from the index because it was not stored there
        DataRow row = dt.NewRow();
        row["path"] = doc.Get("path");
        dt.Rows.Add(row);
    }

Highlight the search keywords:

    QueryHighlightExtractor highlighter =
        new QueryHighlightExtractor(query, new StandardAnalyzer(),
            "<B>", "</B>");

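While assembling the results, the following code keeps only the parts of each document that are relevant to the query:

    for (int i = 0; i < hits.Length(); i++)
    {
        // ...
        string plainText;
        // The file path was stored in the "path" field when the document was indexed.
        using (StreamReader sr =
            new StreamReader(doc.Get("path"),
                System.Text.Encoding.Default))
        {
            plainText = parseHtml(sr.ReadToEnd());
        }
        row["sample"] =
            highlighter.GetBestFragments(plainText, 80, 2, "...");
        // ...
    }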
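To make the idea of a highlighted fragment more concrete, here is a simplified sketch of the technique. It is not the QueryHighlightExtractor implementation: SimpleHighlighter and BestFragment are hypothetical names, and the sketch handles a single literal search term and returns a single fragment, whereas the real class works from a parsed Query and an analyzer. It cuts a window of a given size around the first match, wraps each occurrence of the term inside that window in the supplied tags, and adds a separator on any side where text was cut off.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleHighlighter
{
    // Hypothetical helper: returns up to `size` characters around the first
    // occurrence of `term`, with matches wrapped in openTag/closeTag.
    public static string BestFragment(string text, string term, int size,
                                      string openTag, string closeTag,
                                      string separator)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term))
        {
            return string.Empty;
        }

        int hit = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (hit < 0)
        {
            return string.Empty; // the term does not occur in the text
        }

        // Center the window on the hit, clamped to the bounds of the text.
        int lead = Math.Max(0, size - term.Length) / 2;
        int start = Math.Max(0, hit - lead);
        int length = Math.Min(size, text.Length - start);
        string fragment = text.Substring(start, length);

        // Wrap each occurrence, keeping the casing found in the text.
        string marked = Regex.Replace(fragment, Regex.Escape(term),
            m => openTag + m.Value + closeTag, RegexOptions.IgnoreCase);

        // Add the separator on any side where the text was cut off.
        return (start > 0 ? separator : "") + marked
             + (start + length < text.Length ? separator : "");
    }
}
```

For example, calling BestFragment("The quick brown fox jumps over the lazy dog", "fox", 13, "<B>", "</B>", "...") yields "...rown <B>fox</B> jump...". The window is cut by character count, so words at the edges can be truncated; a production highlighter would normally snap to word boundaries.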
