[Reading Notes] Building Full-Text Search for a Website in Just 37 Lines of Code
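Original English title: DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code
Original article: http://www.codeproject.com/KB/aspnet/DotLuceneSearch.aspx
DotLucene is a capable full-text search engine. These notes walk through the 37 core lines of code needed to build full-text search for a website.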
Create the index:
// The third argument (true) creates a new index, replacing any existing one.
IndexWriter writer =
    new IndexWriter(directory, new StandardAnalyzer(), true);
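Add a document:
public void AddHtmlDocument(string path)
{
    Document doc = new Document();
    string rawText;
    using (StreamReader sr =
        new StreamReader(path, System.Text.Encoding.Default))
    {
        // strip the markup so only the visible text is indexed
        rawText = parseHtml(sr.ReadToEnd());
    }
    // "text" is indexed for searching but not stored in the index
    doc.Add(Field.UnStored("text", rawText));
    // "path" is stored as-is so it can be read back from search results
    doc.Add(Field.Keyword("path", path));
    writer.AddDocument(doc);
}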
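The parseHtml helper called above is not shown in these notes. As a rough illustration only, here is a minimal sketch of what such a helper might do, assuming it only needs to drop script and style blocks, strip the remaining tags, and decode entities. The class name HtmlText and the regex-based approach are assumptions made for this example, not the original article's implementation; a real HTML parser would be more robust on malformed markup.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical stand-in for the parseHtml helper used in AddHtmlDocument.
    public static string ParseHtml(string html)
    {
        // Remove <script> and <style> blocks together with their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words do not merge.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string keeps words from neighbouring elements apart, which matters because the analyzer tokenizes the resulting text.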
Optimize and save the index:
writer.Optimize();
writer.Close();
Open the index for searching:
IndexSearcher searcher = new IndexSearcher(directory);
Run the search:
Query query =
    QueryParser.Parse(q, "text", new StandardAnalyzer());
Hits hits = searcher.Search(query);
The hits variable is a collection of the matching documents. The code below iterates over the results and stores them in a DataTable.
DataTable dt = new DataTable();
dt.Columns.Add("path", typeof(string));
dt.Columns.Add("sample", typeof(string));
for (int i = 0; i < hits.Length(); i++)
{
    // get the document from the index
    Document doc = hits.Doc(i);
    // get the document path; the text itself cannot be read
    // from the index because it was not stored there
    DataRow row = dt.NewRow();
    row["path"] = doc.Get("path");
    dt.Rows.Add(row);
}
Highlight the search keywords:
QueryHighlightExtractor highlighter =
    new QueryHighlightExtractor(query, new StandardAnalyzer(),
        "<B>", "</B>");
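While assembling the results, the following code extracts only the fragments of each document that are relevant to the query:
for (int i = 0; i < hits.Length(); i++)
{
    // ...
    string plainText;
    // re-read the file from the stored "path" field,
    // since the text was not stored in the index
    using (StreamReader sr =
        new StreamReader(doc.Get("path"),
            System.Text.Encoding.Default))
    {
        plainText = parseHtml(sr.ReadToEnd());
    }
    row["sample"] =
        highlighter.GetBestFragments(plainText, 80, 2, "...");
    // ...
}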