A Study of the Lucene.NET 1.4.3 Index Architecture


 

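The indexing part of Lucene falls into three main blocks: the index content readers, the per-document statistics of terms, and the data structures for terms and fields.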

 

I. Index content readers

[Figure 1: Index content readers]

 

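IndexReader wraps the unified public interface of the index. Through it a caller can obtain information about a given term, such as its tf (term frequency) and df (document frequency). Documents are retrieved by delegating to FieldsReader, and term information is retrieved through TermInfosReader.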

IndexReader

 IndexReader is an abstract class, providing an interface for accessing an index.

 

              /// <summary>Returns the directory this index resides in. </summary>

              public virtual Directory Directory()

 

              /// <summary>Returns the number of documents in this index. </summary>

              public abstract int NumDocs();

 

              /// <summary>Returns one greater than the largest possible document number.

              /// This may be used to, e.g., determine how big to allocate an array which

              /// will have an element for every document number in an index.

              /// </summary>

              public abstract int MaxDoc();

 

              /// <summary>Returns the stored fields of the <code>n</code><sup>th</sup>

              /// <code>Document</code> in this index.

              /// </summary>

              public abstract Document Document(int n);

 

              /// <summary>Returns an enumeration of all the terms in the index.

              /// The enumeration is ordered by Term.compareTo().  Each term

              /// is greater than all that precede it in the enumeration.

              /// </summary>

              public abstract TermEnum Terms();

 

              /// <summary>Returns an enumeration of all terms after a given term.

              /// The enumeration is ordered by Term.compareTo().  Each term

              /// is greater than all that precede it in the enumeration.

              /// </summary>

              public abstract TermEnum Terms(Term t);

 

              /// <summary>Returns the number of documents containing the term <code>t</code>. </summary>

              public abstract int DocFreq(Term t);

 

              /// <summary>Returns an enumeration of all the documents which contain

              /// <code>term</code>. For each document, the document number and the frequency of

              /// the term in that document are also provided, for use in search scoring.

              /// Thus, this method implements the mapping:

              /// <p><ul>

              /// Term  =>  <docNum, freq><sup>*</sup>

              /// </ul></p>

              /// <p>The enumeration is ordered by document number.  Each document number

              /// is greater than all that precede it in the enumeration.</p>

              /// </summary>

              public virtual TermDocs TermDocs(Term term)

 

           /// <summary>Returns an enumeration of all the documents which contain

              /// <code>term</code>.  For each document, in addition to the document number

              /// and frequency of the term in that document, a list of all of the ordinal

              /// positions of the term in the document is available.  Thus, this method

              /// implements the mapping:

              /// <p><ul>

              /// Term  =>  <docNum, freq, <positions>><sup>*</sup>

              /// </ul>

              /// <p> This positional information facilitates phrase and proximity searching.

              /// <p>The enumeration is ordered by document number.  Each document number is

              /// greater than all that precede it in the enumeration.

              /// </summary>

              public virtual TermPositions TermPositions(Term term)
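
To make the contract above easier to picture, here is a minimal in-memory sketch. `ToyIndexReader`, its `Add`/`Delete` helpers and the plain-string terms are hypothetical stand-ins invented for illustration, not Lucene.NET classes; only the method names `NumDocs`, `MaxDoc`, `DocFreq` and `TermDocs` mirror the signatures listed above. It mainly shows why `MaxDoc()` (not `NumDocs()`) is the right size for a per-document array once documents have been deleted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for an index; not Lucene.NET code.
public sealed class ToyIndexReader
{
    // term text -> postings ordered by document number
    private readonly SortedDictionary<string, List<(int Doc, int Freq)>> postings =
        new SortedDictionary<string, List<(int Doc, int Freq)>>(StringComparer.Ordinal);
    private readonly List<bool> deleted = new List<bool>();

    // Adds a document made of the given tokens and returns its document number.
    public int Add(params string[] tokens)
    {
        int doc = deleted.Count;
        deleted.Add(false);
        foreach (var group in tokens.GroupBy(t => t))
        {
            if (!postings.TryGetValue(group.Key, out var list))
                postings[group.Key] = list = new List<(int Doc, int Freq)>();
            list.Add((doc, group.Count()));
        }
        return doc;
    }

    public void Delete(int doc) => deleted[doc] = true;

    // One greater than the largest possible document number.
    public int MaxDoc() => deleted.Count;

    // Number of documents that have not been deleted.
    public int NumDocs() => deleted.Count(d => !d);

    // Number of documents containing the term (simplification: deletions are not subtracted).
    public int DocFreq(string term) =>
        postings.TryGetValue(term, out var list) ? list.Count : 0;

    // Term => <docNum, freq>*, ordered by document number, skipping deleted documents.
    public IEnumerable<(int Doc, int Freq)> TermDocs(string term) =>
        postings.TryGetValue(term, out var list)
            ? list.Where(p => !deleted[p.Doc])
            : Enumerable.Empty<(int Doc, int Freq)>();
}

public static class Program
{
    public static void Main()
    {
        var reader = new ToyIndexReader();
        reader.Add("lucene", "index", "lucene"); // doc 0
        reader.Add("index");                     // doc 1
        reader.Add("lucene");                    // doc 2
        reader.Delete(1);

        // Document numbers keep their gaps, so the array is sized with MaxDoc().
        var freqByDoc = new int[reader.MaxDoc()];
        foreach (var (doc, freq) in reader.TermDocs("lucene"))
            freqByDoc[doc] = freq;

        Console.WriteLine($"NumDocs={reader.NumDocs()} MaxDoc={reader.MaxDoc()}");
        Console.WriteLine(string.Join(",", freqByDoc));
    }
}
```

In this toy, deleting document 1 lowers `NumDocs()` but leaves `MaxDoc()` untouched, which is the behaviour the doc comment on `MaxDoc()` hints at.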

 

TermInfosReader

       Called by IndexReader; provides access to terms.

 

              /// <summary>Returns the number of term/value pairs in the set. </summary>

              internal long Size()

 

              /// <summary>Returns the TermInfo for a Term in the set, or null. </summary>

              public TermInfo Get(Term term)

 

              /// <summary>Returns the nth term in the set. </summary>

              internal Term Get(int position)

 

              /// <summary>Returns an enumeration of all the Terms and TermInfos in the set. </summary>

              public SegmentTermEnum Terms()
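
One way to picture how `Get(Term term)` can find an entry without holding every term in memory is a sparse index over a sorted term list: keep every k-th term in memory, binary-search those, then scan one short block. The sketch below assumes that design. `ToyTermDictionary`, the `interval` parameter, plain-string terms and an `int` value in place of `TermInfo` are all hypothetical simplifications, not the Lucene.NET implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sparse term dictionary; names and layout are illustrative only.
public sealed class ToyTermDictionary
{
    private readonly string[] terms;      // full sorted term list (stands in for the on-disk dictionary)
    private readonly int[] values;        // one value per term (stands in for TermInfo)
    private readonly string[] indexTerms; // every interval-th term, kept in memory
    private readonly int interval;

    public ToyTermDictionary(IDictionary<string, int> entries, int interval)
    {
        this.interval = interval;
        terms = entries.Keys.ToArray();
        Array.Sort(terms, StringComparer.Ordinal);
        values = terms.Select(t => entries[t]).ToArray();
        indexTerms = new string[(terms.Length + interval - 1) / interval];
        for (int j = 0; j < indexTerms.Length; j++) indexTerms[j] = terms[j * interval];
    }

    // Number of term/value pairs in the set.
    public long Size() => terms.Length;

    // Returns the n-th term in sorted order.
    public string Get(int position) => terms[position];

    // Returns the value stored for a term, or null when the term is absent.
    public int? Get(string term)
    {
        if (indexTerms.Length == 0) return null;

        // Binary search for the last in-memory index term <= the target.
        int lo = 0, hi = indexTerms.Length - 1;
        while (lo < hi)
        {
            int mid = (lo + hi + 1) / 2;
            if (string.CompareOrdinal(indexTerms[mid], term) <= 0) lo = mid; else hi = mid - 1;
        }

        // Sequential scan of at most `interval` entries in that block.
        int start = lo * interval;
        int end = Math.Min(start + interval, terms.Length);
        for (int k = start; k < end; k++)
        {
            int c = string.CompareOrdinal(terms[k], term);
            if (c == 0) return values[k];
            if (c > 0) break; // passed the place where the term would be
        }
        return null;
    }
}

public static class Program
{
    public static void Main()
    {
        var dict = new ToyTermDictionary(new Dictionary<string, int>
        {
            ["apple"] = 3, ["banana"] = 1, ["cherry"] = 2, ["date"] = 5, ["fig"] = 1
        }, interval: 2);

        Console.WriteLine(dict.Size());
        Console.WriteLine(dict.Get(3));
        Console.WriteLine(dict.Get("date").HasValue);
        Console.WriteLine(dict.Get("grape").HasValue);
    }
}
```

The trade-off in this sketch is the usual one: a larger `interval` means less memory for the in-memory index but a longer scan per lookup.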

 

FieldsReader

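       Reads the stored field values and builds Document objects from them. It reads the .fdt file (field data) and the .fdx file (field index).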

 

        /// <summary>

        /// Builds the n-th document.

        /// </summary>

public /*internal*/ Document Doc(int n)
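
The two-file split can be sketched roughly as follows: the index file holds one fixed-width pointer per document, so building document n takes one seek in each file. This is a toy under stated assumptions. `ToyStoredFields`, its `Write` helper and the `BinaryWriter` encoding are hypothetical, and the real .fdt/.fdx byte layout differs from what is written here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical two-stream store: an index of fixed-width pointers plus a data stream.
public static class ToyStoredFields
{
    // Writes each document to the data stream and records its start offset in the index stream.
    public static void Write(IEnumerable<Dictionary<string, string>> docs, Stream fdx, Stream fdt)
    {
        // Writers are intentionally not disposed: disposing would close the caller's streams.
        var index = new BinaryWriter(fdx);
        var data = new BinaryWriter(fdt);
        foreach (var doc in docs)
        {
            data.Flush();
            index.Write(fdt.Position);        // 8-byte pointer to this document's data
            data.Write(doc.Count);            // number of stored fields
            foreach (var field in doc)
            {
                data.Write(field.Key);        // field name
                data.Write(field.Value);      // field value
            }
        }
        index.Flush();
        data.Flush();
    }

    // Builds the n-th document: one seek in the index, one seek in the data.
    public static Dictionary<string, string> Doc(int n, Stream fdx, Stream fdt)
    {
        fdx.Seek(n * 8L, SeekOrigin.Begin);
        long pointer = new BinaryReader(fdx).ReadInt64();

        fdt.Seek(pointer, SeekOrigin.Begin);
        var reader = new BinaryReader(fdt);
        int count = reader.ReadInt32();
        var doc = new Dictionary<string, string>();
        for (int i = 0; i < count; i++)
        {
            string name = reader.ReadString();
            doc[name] = reader.ReadString();
        }
        return doc;
    }
}

public static class Program
{
    public static void Main()
    {
        var fdx = new MemoryStream();
        var fdt = new MemoryStream();
        ToyStoredFields.Write(new[]
        {
            new Dictionary<string, string> { ["title"] = "first", ["path"] = "/a" },
            new Dictionary<string, string> { ["title"] = "second" }
        }, fdx, fdt);

        var doc = ToyStoredFields.Doc(1, fdx, fdt);
        Console.WriteLine(doc["title"]);
    }
}
```

Because every index entry has the same width, the position of document n's pointer is simply `n * 8`, with no scanning needed.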

 

II. Per-document statistics of terms

[Figure 2: Per-document statistics of terms]

 

TermDocs

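       An enumerator-style accessor that exposes the frequency of a term in each document that contains it. Conceptually it is a three-part structure (term, document, frequency): first select the Term with Seek, then call Next(), Doc() and Freq() to read the term's frequency in each successive document.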

 

           /// <summary>Sets this to the data for a term.

              /// The enumeration is reset to the start of the data for this term.

              /// </summary>

              void  Seek(Term term);

 

           /// <summary>Returns the current document number.   This is invalid until {@link

              /// #Next()} is called for the first time.

              /// </summary>

              int Doc();

 

              /// <summary>Returns the frequency of the term within the current document.  This

              /// is invalid until {@link #Next()} is called for the first time.

              /// </summary>

              int Freq();

 

              /// <summary>Attempts to read multiple entries from the enumeration, up to length of

              /// <i>docs</i>.  Document numbers are stored in <i>docs</i>, and term

              /// frequencies are stored in <i>freqs</i>.  The <i>freqs</i> array must be as

              /// long as the <i>docs</i> array.

              ///

              /// <p>Returns the number of entries read.  Zero is only returned when the

              /// stream has been exhausted. 

              /// </summary>

              int Read(int[] docs, int[] freqs);

 

              /// <summary>Moves to the next pair in the enumeration.  <p> Returns true iff there is

              /// such a next pair in the enumeration.

              /// </summary>

              bool Next();
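
The calling protocol is easier to see in a loop. The sketch below is a hypothetical in-memory class, `ToyTermDocs`, that follows the same Seek / Next / Doc / Freq / Read shape as the interface above; the postings dictionary and string terms are assumptions made for illustration and nothing here is Lucene.NET code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory enumerator following the Seek/Next/Doc/Freq protocol.
public sealed class ToyTermDocs
{
    private readonly Dictionary<string, (int Doc, int Freq)[]> postings;
    private (int Doc, int Freq)[] current = new (int Doc, int Freq)[0];
    private int pos = -1;

    public ToyTermDocs(Dictionary<string, (int Doc, int Freq)[]> postings)
    {
        this.postings = postings;
    }

    // Resets the enumeration to the start of the data for this term.
    public void Seek(string term)
    {
        current = postings.TryGetValue(term, out var found) ? found : new (int Doc, int Freq)[0];
        pos = -1;
    }

    // Moves to the next <doc, freq> pair; false once the data is exhausted.
    public bool Next()
    {
        if (pos + 1 >= current.Length) { pos = current.Length; return false; }
        pos++;
        return true;
    }

    // Only valid after Next() has returned true.
    public int Doc() => current[pos].Doc;
    public int Freq() => current[pos].Freq;

    // Bulk read: fills both arrays and returns the number of entries read (0 when exhausted).
    public int Read(int[] docs, int[] freqs)
    {
        int n = 0;
        while (n < docs.Length && Next())
        {
            docs[n] = Doc();
            freqs[n] = Freq();
            n++;
        }
        return n;
    }
}

public static class Program
{
    public static void Main()
    {
        var termDocs = new ToyTermDocs(new Dictionary<string, (int Doc, int Freq)[]>
        {
            ["lucene"] = new[] { (0, 2), (3, 1), (7, 4) }
        });

        // One entry at a time.
        termDocs.Seek("lucene");
        while (termDocs.Next())
            Console.WriteLine($"doc={termDocs.Doc()} freq={termDocs.Freq()}");

        // In batches of two.
        termDocs.Seek("lucene");
        var docs = new int[2];
        var freqs = new int[2];
        int count;
        while ((count = termDocs.Read(docs, freqs)) > 0)
            Console.WriteLine($"read {count} entries");
    }
}
```

The batch form exists to cut per-entry call overhead; the loop ends on a return value of zero, matching the doc comment on `Read`.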

 

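TermPositions

       An enumerator-style accessor that exposes the frequency of a term in each document, as TermDocs does, and additionally the set of positions at which the term occurs within each document.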

 

              /// <summary>Returns next position in the current document. 

              /// </summary>

              int NextPosition();
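
Positions are what make phrase queries possible: two terms form a phrase in a document when one occurs exactly one position after the other. The sketch below illustrates that idea with a hypothetical `ToyTermPositions` class and a `PhraseDocs` helper, both invented here. It is a simplified two-term matcher under those assumptions, not how Lucene.NET implements phrase scoring.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical positional enumerator: for each document, the positions of one term.
public sealed class ToyTermPositions
{
    private readonly (int Doc, int[] Positions)[] entries;
    private int entry = -1;
    private int next;

    public ToyTermPositions(params (int Doc, int[] Positions)[] entries)
    {
        this.entries = entries;
    }

    public bool Next()
    {
        if (entry + 1 >= entries.Length) return false;
        entry++;
        next = 0;
        return true;
    }

    public int Doc() => entries[entry].Doc;

    // Freq() tells the caller how many times NextPosition() may be called for this document.
    public int Freq() => entries[entry].Positions.Length;

    public int NextPosition() => entries[entry].Positions[next++];
}

public static class PhraseMatcher
{
    // Returns the documents in which `second` occurs directly after `first`.
    public static List<int> PhraseDocs(ToyTermPositions first, ToyTermPositions second)
    {
        var hits = new List<int>();
        bool a = first.Next(), b = second.Next();
        while (a && b)
        {
            // Both enumerations are ordered by document number, so advance the smaller one.
            if (first.Doc() < second.Doc()) a = first.Next();
            else if (first.Doc() > second.Doc()) b = second.Next();
            else
            {
                var starts = new HashSet<int>();
                for (int i = 0, n = first.Freq(); i < n; i++) starts.Add(first.NextPosition());

                bool match = false;
                for (int i = 0, n = second.Freq(); i < n; i++)
                {
                    if (starts.Contains(second.NextPosition() - 1)) match = true;
                }
                if (match) hits.Add(first.Doc());

                a = first.Next();
                b = second.Next();
            }
        }
        return hits;
    }
}

public static class Program
{
    public static void Main()
    {
        var index = new ToyTermPositions((0, new[] { 1, 5 }), (2, new[] { 0 }));
        var reader = new ToyTermPositions((0, new[] { 2 }), (1, new[] { 4 }), (2, new[] { 3 }));
        Console.WriteLine(string.Join(",", PhraseMatcher.PhraseDocs(index, reader)));
    }
}
```

The merge over document numbers relies on the ordering guarantee stated in the doc comments: each document number is greater than all that precede it.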

 

III. Data structures for terms and fields

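Term encapsulates the field a term belongs to and the term's text. TermInfo holds the term's document frequency together with two pointers: one to the table of the term's frequencies in each document, and one to the list of positions at which it occurs. These two pointers are what the SegmentTermDocs and SegmentTermPositions enumerations use.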
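
A minimal sketch of these two structures, and of how a frequency pointer might be followed, is given below. `ToyTerm`, `ToyTermInfo`, the flat `int[]` standing in for the frequency file and the `ReadPostings` helper are all hypothetical names and layouts chosen for illustration. The field-then-text ordering in `CompareTo` reflects how terms are sorted in the index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical term: a field name plus the term text within that field.
public sealed class ToyTerm : IComparable<ToyTerm>
{
    public string Field { get; }
    public string Text { get; }

    public ToyTerm(string field, string text) { Field = field; Text = text; }

    // Terms are ordered by field first, then by text.
    public int CompareTo(ToyTerm other)
    {
        int c = string.CompareOrdinal(Field, other.Field);
        return c != 0 ? c : string.CompareOrdinal(Text, other.Text);
    }

    public override string ToString() => Field + ":" + Text;
}

// Hypothetical per-term record: document frequency plus two pointers.
public sealed class ToyTermInfo
{
    public int DocFreq;       // number of documents containing the term
    public long FreqPointer;  // where the term's <doc, freq> pairs start
    public long ProxPointer;  // where the term's positions start (unused in this sketch)
}

public static class ToyPostings
{
    // Follows FreqPointer into a flat array of doc, freq, doc, freq, ... values.
    public static List<(int Doc, int Freq)> ReadPostings(int[] freqFile, ToyTermInfo info)
    {
        var result = new List<(int Doc, int Freq)>();
        for (int i = 0; i < info.DocFreq; i++)
        {
            int offset = (int)info.FreqPointer + 2 * i;
            result.Add((freqFile[offset], freqFile[offset + 1]));
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        int[] freqFile = { 0, 1, 2, 3, 1, 2 };
        var dictionary = new SortedDictionary<ToyTerm, ToyTermInfo>
        {
            [new ToyTerm("title", "lucene")] = new ToyTermInfo { DocFreq = 1, FreqPointer = 4 },
            [new ToyTerm("body", "index")] = new ToyTermInfo { DocFreq = 2, FreqPointer = 0 }
        };

        foreach (var pair in dictionary)
        {
            var postings = ToyPostings.ReadPostings(freqFile, pair.Value);
            Console.WriteLine($"{pair.Key} df={pair.Value.DocFreq} postings={string.Join(" ", postings)}");
        }
    }
}
```

In this toy, `DocFreq` tells the reader how many pairs to consume and `FreqPointer` tells it where to start, which is why the enumerations need both.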

 

TermEnum

 

              /// <summary>Increments the enumeration to the next element.  True if one exists.</summary>

              public abstract bool Next();

             

              /// <summary>Returns the current Term in the enumeration.</summary>

              public abstract Term Term();

             

              /// <summary>Returns the docFreq of the current Term in the enumeration.</summary>

       public abstract int DocFreq();
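
A typical use of this kind of enumeration is a single pass over the whole term dictionary. The sketch below walks a hypothetical `ToyTermEnum` (an in-memory stand-in, not the Lucene.NET class) with the same Next / Term / DocFreq shape and picks the term with the highest document frequency; the tuple-based term and the sample data are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical term enumeration over an in-memory, sorted term list.
public sealed class ToyTermEnum
{
    private readonly (string Field, string Text, int DocFreq)[] entries;
    private int pos = -1;

    public ToyTermEnum(IEnumerable<(string Field, string Text, int DocFreq)> source)
    {
        // Sorted by field, then text, so each term is greater than all that precede it.
        entries = source
            .OrderBy(e => e.Field, StringComparer.Ordinal)
            .ThenBy(e => e.Text, StringComparer.Ordinal)
            .ToArray();
    }

    // Advances to the next element; true if one exists.
    public bool Next()
    {
        if (pos + 1 >= entries.Length) return false;
        pos++;
        return true;
    }

    // Current term; only valid after Next() has returned true.
    public (string Field, string Text) Term() => (entries[pos].Field, entries[pos].Text);

    // Document frequency of the current term.
    public int DocFreq() => entries[pos].DocFreq;
}

public static class Program
{
    public static void Main()
    {
        var terms = new ToyTermEnum(new[]
        {
            ("title", "lucene", 2),
            ("body", "index", 5),
            ("body", "reader", 3)
        });

        (string Field, string Text) best = (null, null);
        int bestDocFreq = -1;
        while (terms.Next())
        {
            Console.WriteLine($"{terms.Term().Field}:{terms.Term().Text} df={terms.DocFreq()}");
            if (terms.DocFreq() > bestDocFreq)
            {
                bestDocFreq = terms.DocFreq();
                best = terms.Term();
            }
        }
        Console.WriteLine($"most frequent: {best.Field}:{best.Text}");
    }
}
```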

[Figure 3: Data structures for terms and fields]

 

 

Posted on 2008-06-26 09:23 by 薛定颚的猫
