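A Study of the Lucene.NET 1.4.3 Index Architecture
A Study of the Lucene Index Architecture
The indexing part of Lucene is divided into three main blocks: the index content readers, the per-document statistics of terms, and the data structures for terms and fields.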
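I. Index Content Readers
Figure 1. Index content readers
IndexReader wraps the unified public interface and exposes information such as the term frequency (tf) and document frequency (df) of a given term. Documents are retrieved by calling FieldsReader, and term information is retrieved through TermInfosReader.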
IndexReader
IndexReader is an abstract class, providing an interface for accessing an index.
/// <summary>Returns the directory this index resides in. </summary>
public virtual Directory Directory()
/// <summary>Returns the number of documents in this index. </summary>
public abstract int NumDocs();
/// <summary>Returns one greater than the largest possible document number.
/// This may be used to, e.g., determine how big to allocate an array which
/// will have an element for every document number in an index.
/// </summary>
public abstract int MaxDoc();
/// <summary>Returns the stored fields of the <code>n</code><sup>th</sup>
/// <code>Document</code> in this index.
/// </summary>
public abstract Document Document(int n);
/// <summary>Returns an enumeration of all the terms in the index.
/// The enumeration is ordered by Term.compareTo(). Each term
/// is greater than all that precede it in the enumeration.
/// </summary>
public abstract TermEnum Terms();
/// <summary>Returns an enumeration of all terms after a given term.
/// The enumeration is ordered by Term.compareTo(). Each term
/// is greater than all that precede it in the enumeration.
/// </summary>
public abstract TermEnum Terms(Term t);
/// <summary>Returns the number of documents containing the term <code>t</code>. </summary>
public abstract int DocFreq(Term t);
/// <summary>Returns an enumeration of all the documents which contain
/// <code>term</code>. For each document, the document number, the frequency of
/// the term in that document is also provided, for use in search scoring.
/// Thus, this method implements the mapping:
/// <p><ul>
/// Term => <docNum, freq><sup>*</sup>
/// </ul></p>
/// <p>The enumeration is ordered by document number. Each document number
/// is greater than all that precede it in the enumeration.</p>
/// </summary>
public virtual TermDocs TermDocs(Term term)
/// <summary>Returns an enumeration of all the documents which contain
/// <code>term</code>. For each document, in addition to the document number
/// and frequency of the term in that document, a list of all of the ordinal
/// positions of the term in the document is available. Thus, this method
/// implements the mapping:
/// <p><ul>
/// Term => <docNum, freq, <positions>>
/// </ul></p>
/// <p>This positional information facilitates phrase and proximity searching.</p>
/// <p>The enumeration is ordered by document number. Each document number is
/// greater than all that precede it in the enumeration.</p>
/// </summary>
public virtual TermPositions TermPositions(Term term)
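To make the difference between NumDocs(), MaxDoc(), df and tf concrete, here is a minimal in-memory sketch. ToyIndex and all of its members are hypothetical stand-ins written for illustration, not Lucene.NET classes. It assumes that documents are split on spaces, that a deleted document keeps its number, and that DocFreq ignores deletion marks.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for an index; not a Lucene.NET class.
public class ToyIndex
{
    // term text -> (document number -> frequency of the term in that document)
    private readonly SortedDictionary<string, SortedDictionary<int, int>> postings =
        new SortedDictionary<string, SortedDictionary<int, int>>(StringComparer.Ordinal);
    private readonly List<bool> deleted = new List<bool>();

    // Adds a space-separated document and returns its document number.
    public int AddDocument(string text)
    {
        int docNum = deleted.Count;
        deleted.Add(false);
        foreach (string word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            SortedDictionary<int, int> docs;
            if (!postings.TryGetValue(word, out docs))
            {
                docs = new SortedDictionary<int, int>();
                postings[word] = docs;
            }
            int freq;
            docs.TryGetValue(docNum, out freq); // freq stays 0 if the entry is missing
            docs[docNum] = freq + 1;
        }
        return docNum;
    }

    // Marks a document as deleted; its number is not reused.
    public void Delete(int docNum) { deleted[docNum] = true; }

    // One greater than the largest document number ever assigned.
    public int MaxDoc() { return deleted.Count; }

    // Number of documents that are not marked as deleted.
    public int NumDocs()
    {
        int n = 0;
        foreach (bool d in deleted) if (!d) n++;
        return n;
    }

    // df: number of documents containing the term (deletions ignored here).
    public int DocFreq(string term)
    {
        SortedDictionary<int, int> docs;
        return postings.TryGetValue(term, out docs) ? docs.Count : 0;
    }

    // tf: frequency of the term within one document.
    public int TermFreq(string term, int docNum)
    {
        SortedDictionary<int, int> docs;
        int freq;
        if (postings.TryGetValue(term, out docs) && docs.TryGetValue(docNum, out freq)) return freq;
        return 0;
    }
}
```

An array sized with MaxDoc() can be indexed by any document number, whereas NumDocs() only counts the documents that are still live.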
TermInfosReader
Called by IndexReader; provides access to terms.
/// <summary>Returns the number of term/value pairs in the set. </summary>
internal long Size()
/// <summary>Returns the TermInfo for a Term in the set, or null. </summary>
public TermInfo Get(Term term)
/// <summary>Returns the nth term in the set. </summary>
internal Term Get(int position)
/// <summary>Returns an enumeration of all the Terms and TermInfos in the set. </summary>
public SegmentTermEnum Terms()
FieldsReader
Reads the stored field values from the .fdt and .fdx files and builds Document objects from them.
/// <summary>
/// Builds the n-th document.
/// </summary>
public /*internal*/ Document Doc(int n)
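The two-file lookup can be sketched as follows. This is a simplified illustration that assumes the index file holds one fixed-size 8-byte pointer per document into the data file. ToyFieldsStore and its byte layout are hypothetical, and the encoding used by BinaryWriter is not Lucene's actual on-disk format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical toy store; the byte layout is simplified and is not Lucene's.
public class ToyFieldsStore
{
    private readonly MemoryStream data = new MemoryStream();   // plays the role of .fdt
    private readonly MemoryStream index = new MemoryStream();  // plays the role of .fdx
    private readonly BinaryWriter dataWriter;
    private readonly BinaryWriter indexWriter;
    private readonly BinaryReader dataReader;
    private readonly BinaryReader indexReader;

    public ToyFieldsStore()
    {
        dataWriter = new BinaryWriter(data);
        indexWriter = new BinaryWriter(index);
        dataReader = new BinaryReader(data);
        indexReader = new BinaryReader(index);
    }

    // Appends a document (field name -> value) and records where it starts.
    public void AddDocument(IDictionary<string, string> fields)
    {
        index.Seek(0, SeekOrigin.End);
        data.Seek(0, SeekOrigin.End);
        indexWriter.Write(data.Position);   // one 8-byte pointer per document
        dataWriter.Write(fields.Count);     // number of stored fields
        foreach (KeyValuePair<string, string> field in fields)
        {
            dataWriter.Write(field.Key);
            dataWriter.Write(field.Value);
        }
        indexWriter.Flush();
        dataWriter.Flush();
    }

    // Builds the n-th document: look up its pointer, then read its fields.
    public Dictionary<string, string> Doc(int n)
    {
        index.Seek(n * 8L, SeekOrigin.Begin);
        data.Seek(indexReader.ReadInt64(), SeekOrigin.Begin);
        var doc = new Dictionary<string, string>();
        int count = dataReader.ReadInt32();
        for (int i = 0; i < count; i++)
        {
            string name = dataReader.ReadString();
            doc[name] = dataReader.ReadString();
        }
        return doc;
    }
}
```

Because every pointer has the same width, finding document n costs one seek in the index stream and one seek in the data stream, regardless of how many documents precede it.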
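II. Per-Document Term Statistics
Figure 2. Per-document term statistics
TermDocs
An enumerator-style accessor that provides the interface for reading a term's frequency in the documents that contain it. It is a three-part structure: the Term must first be selected with Seek, after which Next(), Doc() and Freq() are called to read the term's freq in each doc.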
/// <summary>Sets this to the data for a term.
/// The enumeration is reset to the start of the data for this term.
/// </summary>
void Seek(Term term);
/// <summary>Returns the current document number. This is invalid until {@link
/// #Next()} is called for the first time.
/// </summary>
int Doc();
/// <summary>Returns the frequency of the term within the current document. This
/// is invalid until {@link #Next()} is called for the first time.
/// </summary>
int Freq();
/// <summary>Attempts to read multiple entries from the enumeration, up to length of
/// <i>docs</i>. Document numbers are stored in <i>docs</i>, and term
/// frequencies are stored in <i>freqs</i>. The <i>freqs</i> array must be as
/// long as the <i>docs</i> array.
///
/// <p>Returns the number of entries read. Zero is only returned when the
/// stream has been exhausted.</p>
/// </summary>
int Read(int[] docs, int[] freqs);
/// <summary>Moves to the next pair in the enumeration. <p> Returns true iff there is
/// such a next pair in the enumeration.
/// </summary>
bool Next();
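The Seek / Next / Doc / Freq calling order can be illustrated with a small in-memory enumerator. ToyTermDocs and TermDocsDemo are hypothetical names, and terms are plain strings here rather than Term objects. The sketch only mirrors the protocol; it does not read any index files.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory enumerator that follows the Seek/Next/Doc/Freq protocol.
public class ToyTermDocs
{
    // term text -> postings sorted by document number, each entry is {docNum, freq}
    private readonly Dictionary<string, int[][]> postings;
    private int[][] current = new int[0][];
    private int pos = -1;

    public ToyTermDocs(Dictionary<string, int[][]> postings) { this.postings = postings; }

    // Selects a term and resets the enumeration to just before its first entry.
    public void Seek(string term)
    {
        if (!postings.TryGetValue(term, out current)) current = new int[0][];
        pos = -1;
    }

    // Moves to the next <docNum, freq> pair; false once the entries are exhausted.
    public bool Next()
    {
        if (pos + 1 >= current.Length) return false;
        pos++;
        return true;
    }

    // Both accessors are only valid after Next() has returned true.
    public int Doc() { return current[pos][0]; }
    public int Freq() { return current[pos][1]; }
}

public static class TermDocsDemo
{
    // Sums the term's frequency over all documents that contain it.
    public static int TotalFreq(ToyTermDocs termDocs, string term)
    {
        int total = 0;
        termDocs.Seek(term);
        while (termDocs.Next()) total += termDocs.Freq();
        return total;
    }
}
```

Calling Seek again with another term reuses the same enumerator, which is why the protocol separates selecting a term from iterating over its entries.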
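TermPositions
An enumerator-style accessor that provides the same frequency interface as TermDocs and additionally exposes the set of positions at which the term occurs in each document.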
/// <summary>Returns next position in the current document.
/// </summary>
int NextPosition();
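As a rough illustration of why positions matter for phrase searching, the sketch below reads each term's positions in one document by calling NextPosition() exactly Freq() times, then checks whether the second term ever directly follows the first. ToyPositions and PhraseDemo are hypothetical helpers; the adjacency test is an assumption made for the example, not how Lucene's phrase scoring is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical cursor over one term's positions in the current document.
public class ToyPositions
{
    private readonly int[] positions; // ascending ordinal positions of the term
    private int next = 0;

    public ToyPositions(int[] positions) { this.positions = positions; }

    // Number of occurrences of the term in the current document.
    public int Freq() { return positions.Length; }

    // Returns the next position; call at most Freq() times per document.
    public int NextPosition() { return positions[next++]; }
}

public static class PhraseDemo
{
    // True if some occurrence of the second term directly follows the first term.
    public static bool Adjacent(ToyPositions first, ToyPositions second)
    {
        var firstSet = new HashSet<int>();
        for (int i = 0, n = first.Freq(); i < n; i++) firstSet.Add(first.NextPosition());
        for (int i = 0, n = second.Freq(); i < n; i++)
        {
            if (firstSet.Contains(second.NextPosition() - 1)) return true;
        }
        return false;
    }
}
```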
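III. Data Structures for Terms and Fields
Term wraps the field that a term belongs to together with the term's text. TermInfo holds the term's document frequency plus two pointers: one to the table of the term's frequencies in the documents that contain it, and one to the positions at which it occurs. These two pointers are used by the SegmentTermDocs and SegmentTermPositions enumerators.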
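The two structures can be pictured roughly as below. ToyTerm, ToyTermInfo and TermDictionaryDemo are hypothetical simplified mirrors, and their member names are assumptions. The binary search in Get is a stand-in chosen for brevity and is not taken from the Lucene.NET source; it only reproduces the "TermInfo or null" contract of a term lookup.

```csharp
using System;

// Hypothetical simplified mirror of Term: a field name plus the term text.
public sealed class ToyTerm : IComparable<ToyTerm>
{
    public readonly string Field;
    public readonly string Text;

    public ToyTerm(string field, string text) { Field = field; Text = text; }

    // Orders by field name first, then by term text.
    public int CompareTo(ToyTerm other)
    {
        int c = string.CompareOrdinal(Field, other.Field);
        return c != 0 ? c : string.CompareOrdinal(Text, other.Text);
    }
}

// Hypothetical simplified mirror of TermInfo.
public sealed class ToyTermInfo
{
    public int DocFreq;       // number of documents containing the term
    public long FreqPointer;  // offset of the term's <docNum, freq> entries
    public long ProxPointer;  // offset of the term's position entries
}

public static class TermDictionaryDemo
{
    // Returns the info for a term in a sorted dictionary, or null if it is absent.
    // sortedTerms[i] and infos[i] describe the same term.
    public static ToyTermInfo Get(ToyTerm[] sortedTerms, ToyTermInfo[] infos, ToyTerm term)
    {
        int i = Array.BinarySearch(sortedTerms, term);
        return i >= 0 ? infos[i] : null;
    }
}
```

An enumerator over frequencies would start reading at FreqPointer, and one over positions would additionally start at ProxPointer, which is why TermInfo carries both.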
TermEnum
/// <summary>Increments the enumeration to the next element. True if one exists.</summary>
public abstract bool Next();
/// <summary>Returns the current Term in the enumeration.</summary>
public abstract Term Term();
/// <summary>Returns the docFreq of the current Term in the enumeration.</summary>
public abstract int DocFreq();
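A typical way to consume such an enumeration is a loop over Next() that reads Term() and DocFreq() for each element. The sketch below uses a hypothetical in-memory ToyTermEnum with plain strings as terms; it starts positioned before the first element, which is an assumption of this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory enumerator following the Next/Term/DocFreq protocol.
public class ToyTermEnum
{
    // (term text, docFreq) pairs, sorted by term text
    private readonly List<KeyValuePair<string, int>> entries;
    private int pos = -1;

    public ToyTermEnum(IDictionary<string, int> docFreqs)
    {
        entries = new List<KeyValuePair<string, int>>(docFreqs);
        entries.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));
    }

    // Advances to the next element; false once the enumeration is exhausted.
    public bool Next()
    {
        if (pos + 1 >= entries.Count) return false;
        pos++;
        return true;
    }

    public string Term() { return entries[pos].Key; }
    public int DocFreq() { return entries[pos].Value; }
}

public static class TermEnumDemo
{
    // Walks the enumeration and returns the term with the highest docFreq (null if empty).
    public static string MostCommon(ToyTermEnum terms)
    {
        string best = null;
        int bestFreq = -1;
        while (terms.Next())
        {
            if (terms.DocFreq() > bestFreq)
            {
                best = terms.Term();
                bestFreq = terms.DocFreq();
            }
        }
        return best;
    }
}
```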
Figure 3. Data structures for terms and fields