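A Study of the Lucene.NET 1.4.3 Index Architecture

A Study of the Lucene Index Architecture

The indexing part of Lucene falls into three main blocks: index content readers, per-document term statistics, and the data structures for terms and fields.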

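I. Index Content Readers

Figure 1. Index content readers

IndexReader wraps everything behind a single public interface, through which callers can obtain information such as the tf and df of a given term. Retrieving a Document is delegated to FieldsReader, and retrieving term information is delegated to TermInfoReader.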

IndexReader

IndexReader is an abstract class, providing an interface for accessing an index.

/// <summary>Returns the directory this index resides in. </summary>

public virtual Directory Directory()

/// <summary>Returns the number of documents in this index. </summary>

public abstract int NumDocs();

/// <summary>Returns one greater than the largest possible document number.

/// This may be used to, e.g., determine how big to allocate an array which

/// will have an element for every document number in an index.

/// </summary>

public abstract int MaxDoc();

/// <summary>Returns the stored fields of the <code>n</code>th

/// <code>Document</code> in this index.

/// </summary>

public abstract Document Document(int n);

/// <summary>Returns an enumeration of all the terms in the index.

/// The enumeration is ordered by Term.compareTo(). Each term

/// is greater than all that precede it in the enumeration.

/// </summary>

public abstract TermEnum Terms();

/// <summary>Returns an enumeration of all terms after a given term.

/// The enumeration is ordered by Term.compareTo(). Each term

/// is greater than all that precede it in the enumeration.

/// </summary>

public abstract TermEnum Terms(Term t);

/// <summary>Returns the number of documents containing the term <code>t</code>. </summary>

public abstract int DocFreq(Term t);

/// <summary>Returns an enumeration of all the documents which contain

/// <code>term</code>. For each document, the document number, the frequency of

/// the term in that document is also provided, for use in search scoring.

/// Thus, this method implements the mapping:

/// <ul>

/// Term => <docNum, freq>*

/// </ul>

/// The enumeration is ordered by document number. Each document number

/// is greater than all that precede it in the enumeration.

/// </summary>

public virtual TermDocs TermDocs(Term term)

/// <summary>Returns an enumeration of all the documents which contain

/// <code>term</code>. For each document, in addition to the document number

/// and frequency of the term in that document, a list of all of the ordinal

/// positions of the term in the document is available. Thus, this method

/// implements the mapping:

/// <ul>

/// Term => <docNum, freq, <positions>>*

/// </ul>

/// This positional information facilitates phrase and proximity searching.

/// The enumeration is ordered by document number. Each document number is

/// greater than all that precede it in the enumeration.

/// </summary>

public virtual TermPositions TermPositions(Term term)
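The MaxDoc() comment above suggests sizing per-document arrays by MaxDoc() rather than NumDocs(). Here is a minimal sketch of that pattern. ToyReader and PerDocArrays are hypothetical in-memory stand-ins, not Lucene.NET classes, and the sketch assumes that deleted documents keep their document numbers, so NumDocs() can be smaller than MaxDoc().

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an index reader; not the Lucene.NET class.
public sealed class ToyReader
{
    private readonly bool[] deleted;                        // one flag per document number
    private readonly Dictionary<string, int[][]> postings;  // term -> {docNum, freq} pairs

    public ToyReader(bool[] deleted, Dictionary<string, int[][]> postings)
    {
        this.deleted = deleted;
        this.postings = postings;
    }

    // One greater than the largest possible document number.
    public int MaxDoc() { return deleted.Length; }

    // Number of documents that are not marked as deleted.
    public int NumDocs()
    {
        int n = 0;
        foreach (bool d in deleted) if (!d) n++;
        return n;
    }

    // The {docNum, freq} pairs for a term, or an empty list if the term is unknown.
    public int[][] Postings(string term)
    {
        int[][] list;
        return postings.TryGetValue(term, out list) ? list : new int[0][];
    }
}

public static class PerDocArrays
{
    // Allocates one slot per document number and fills in the term's frequency.
    public static int[] FreqByDoc(ToyReader reader, string term)
    {
        int[] freqs = new int[reader.MaxDoc()];
        foreach (int[] pair in reader.Postings(term))
            freqs[pair[0]] = pair[1];
        return freqs;
    }
}
```

Sizing the array by NumDocs() would overflow as soon as a document number above the live-document count appears, which is why the array length comes from MaxDoc().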

TermInfoReader

Called by IndexReader to provide access to terms. (In the Lucene.NET source this class is named TermInfosReader.)

/// <summary>Returns the number of term/value pairs in the set. </summary>

internal long Size()

/// <summary>Returns the TermInfo for a Term in the set, or null. </summary>

public TermInfo Get(Term term)

/// <summary>Returns the nth term in the set. </summary>

internal Term Get(int position)

/// <summary>Returns an enumeration of all the Terms and TermInfos in the set. </summary>

public SegmentTermEnum Terms()
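The lookup methods above can be pictured as operations on a sorted term dictionary. The sketch below is only an illustration: ToyTerm, ToyTermInfo and ToyTermInfoReader are hypothetical types, and the whole dictionary is held in memory and searched with a binary search, which is a simplification and not a description of how the real reader works against the index files.

```csharp
using System;

// Hypothetical model of a term: the field it belongs to plus its text.
public sealed class ToyTerm : IComparable<ToyTerm>
{
    public readonly string Field;
    public readonly string Text;

    public ToyTerm(string field, string text) { Field = field; Text = text; }

    // Order by field first, then by text (ordinal comparison).
    public int CompareTo(ToyTerm other)
    {
        int c = string.CompareOrdinal(Field, other.Field);
        return c != 0 ? c : string.CompareOrdinal(Text, other.Text);
    }
}

// Hypothetical model of the per-term record.
public sealed class ToyTermInfo
{
    public readonly int DocFreq;       // number of documents containing the term
    public readonly long FreqPointer;  // offset of the term's <doc, freq> data
    public readonly long ProxPointer;  // offset of the term's position data

    public ToyTermInfo(int docFreq, long freqPointer, long proxPointer)
    {
        DocFreq = docFreq;
        FreqPointer = freqPointer;
        ProxPointer = proxPointer;
    }
}

public sealed class ToyTermInfoReader
{
    private readonly ToyTerm[] terms;      // sorted ascending
    private readonly ToyTermInfo[] infos;  // infos[i] belongs to terms[i]

    public ToyTermInfoReader(ToyTerm[] sortedTerms, ToyTermInfo[] infos)
    {
        this.terms = sortedTerms;
        this.infos = infos;
    }

    // Number of term/value pairs in the set.
    public long Size() { return terms.Length; }

    // The nth term in the set.
    public ToyTerm Get(int position) { return terms[position]; }

    // The record for a term, or null if the term is not in the set.
    public ToyTermInfo Get(ToyTerm term)
    {
        int i = Array.BinarySearch(terms, term);
        return i >= 0 ? infos[i] : null;
    }
}
```

Keeping the terms sorted is what makes both the positional Get(int) and the keyed Get(ToyTerm) cheap in this model.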

FieldsReader

Reads stored field values from the .fdt and .fdx files and builds a Document object.

/// <summary>

/// Builds the nth document.

/// </summary>

public /*internal*/ Document Doc(int n)
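The two-file arrangement can be sketched as an index of fixed-width pointers into a variable-length data stream. The sketch assumes that the index file holds one 8-byte pointer per document. ToyFieldsFile and ToyFieldsReader are hypothetical, and the byte encoding used here (BinaryWriter defaults) is deliberately simplified and is not the real .fdx/.fdt format.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical writer for the simplified two-stream layout.
public static class ToyFieldsFile
{
    // Each document is a flat array: name0, value0, name1, value1, ...
    public static void Write(string[][] docs, Stream index, Stream data)
    {
        BinaryWriter ix = new BinaryWriter(index);
        BinaryWriter dt = new BinaryWriter(data);
        foreach (string[] doc in docs)
        {
            ix.Write(data.Position);   // 8-byte pointer to where this document starts
            dt.Write(doc.Length / 2);  // number of stored fields
            foreach (string s in doc) dt.Write(s);
            dt.Flush();
        }
        ix.Flush();
    }
}

// Hypothetical reader: index stream plays the role of .fdx, data stream of .fdt.
public sealed class ToyFieldsReader
{
    private readonly BinaryReader index;
    private readonly BinaryReader data;

    public ToyFieldsReader(Stream index, Stream data)
    {
        this.index = new BinaryReader(index);
        this.data = new BinaryReader(data);
    }

    // Builds the nth document: look up its pointer, then decode its fields.
    public Dictionary<string, string> Doc(int n)
    {
        index.BaseStream.Seek(n * 8L, SeekOrigin.Begin);
        long start = index.ReadInt64();
        data.BaseStream.Seek(start, SeekOrigin.Begin);
        int count = data.ReadInt32();
        var doc = new Dictionary<string, string>();
        for (int i = 0; i < count; i++)
        {
            string name = data.ReadString();
            doc[name] = data.ReadString();
        }
        return doc;
    }
}
```

Because every index entry has the same width, document n is reachable with one seek in the index and one seek in the data stream, without scanning earlier documents.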

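II. Per-Document Term Statistics

Figure 2. Per-document term statistics

TermDocs

An enumerator-style accessor that exposes a term's per-document frequency. It models <term, document, frequency> triples: first select the Term with Seek, then call Next(), Doc() and Freq() to read the term's frequency in each document that contains it.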

/// <summary>Sets this to the data for a term.

/// The enumeration is reset to the start of the data for this term.

/// </summary>

void Seek(Term term);

/// <summary>Returns the current document number. This is invalid until {@link

/// #Next()} is called for the first time.

/// </summary>

int Doc();

/// <summary>Returns the frequency of the term within the current document. This

/// is invalid until {@link #Next()} is called for the first time.

/// </summary>

int Freq();

/// <summary>Attempts to read multiple entries from the enumeration, up to length of

/// docs. Document numbers are stored in docs, and term

/// frequencies are stored in freqs. The freqs array must be as

/// long as the docs array.

///

/// Returns the number of entries read. Zero is only returned when the

/// stream has been exhausted.

/// </summary>

int Read(int[] docs, int[] freqs);

/// <summary>Moves to the next pair in the enumeration. Returns true iff there is

/// such a next pair in the enumeration.

/// </summary>

bool Next();
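The Seek/Next/Doc/Freq calling sequence described for TermDocs can be sketched with an in-memory enumerator. ToyTermDocs and TermDocsDemo are hypothetical, terms are plain strings here instead of Term objects, and the postings are assumed to be supplied already sorted by document number.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory enumerator that follows the Seek/Next/Doc/Freq protocol.
public sealed class ToyTermDocs
{
    private readonly Dictionary<string, int[][]> postings;  // term -> {docNum, freq}, sorted by docNum
    private int[][] current = new int[0][];
    private int pos = -1;

    public ToyTermDocs(Dictionary<string, int[][]> postings) { this.postings = postings; }

    // Resets the enumeration to the start of the data for this term.
    public void Seek(string term)
    {
        int[][] list;
        current = postings.TryGetValue(term, out list) ? list : new int[0][];
        pos = -1;
    }

    // Moves to the next pair; false once the data is exhausted.
    public bool Next()
    {
        if (pos + 1 >= current.Length) return false;
        pos++;
        return true;
    }

    // Both are only valid after Next() has returned true.
    public int Doc() { return current[pos][0]; }
    public int Freq() { return current[pos][1]; }

    // Bulk read of up to docs.Length entries; returns the number read.
    public int Read(int[] docs, int[] freqs)
    {
        int n = 0;
        while (n < docs.Length && Next())
        {
            docs[n] = Doc();
            freqs[n] = Freq();
            n++;
        }
        return n;
    }
}

public static class TermDocsDemo
{
    // Sums the term's frequency over every document that contains it.
    public static int TotalFreq(ToyTermDocs termDocs, string term)
    {
        termDocs.Seek(term);
        int total = 0;
        while (termDocs.Next()) total += termDocs.Freq();
        return total;
    }
}
```

The loop in TotalFreq is the typical shape of a consumer: one Seek, then Next() until it returns false, reading Doc() and Freq() only inside the loop.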

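TermPositions

An enumerator-style accessor that exposes a term's per-document frequency and, in addition, the set of positions at which the term occurs in each document.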

/// <summary>Returns next position in the current document.

/// </summary>

int NextPosition();
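As a rough illustration of why positions matter for phrase searching, the sketch below walks two position enumerators in step and reports the documents where the second term occurs immediately after the first. ToyPosting, ToyTermPositions and PhraseDemo are hypothetical in-memory types, and the matching logic is a simplified sketch, not how Lucene.NET scores phrases.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical posting: a document number plus the term's positions in it.
public sealed class ToyPosting
{
    public readonly int Doc;
    public readonly int[] Positions;  // ascending; Positions.Length is the frequency

    public ToyPosting(int doc, params int[] positions) { Doc = doc; Positions = positions; }
}

// Hypothetical enumerator over one term's postings, sorted by document number.
public sealed class ToyTermPositions
{
    private readonly IList<ToyPosting> postings;
    private int docIndex = -1;
    private int posIndex = 0;

    public ToyTermPositions(IList<ToyPosting> postings) { this.postings = postings; }

    public bool Next()
    {
        if (docIndex + 1 >= postings.Count) return false;
        docIndex++;
        posIndex = 0;
        return true;
    }

    public int Doc() { return postings[docIndex].Doc; }
    public int Freq() { return postings[docIndex].Positions.Length; }

    // Next position in the current document; call at most Freq() times per document.
    public int NextPosition() { return postings[docIndex].Positions[posIndex++]; }
}

public static class PhraseDemo
{
    // Documents in which the second term occurs directly after the first.
    public static List<int> AdjacentDocs(ToyTermPositions first, ToyTermPositions second)
    {
        var hits = new List<int>();
        bool hasFirst = first.Next(), hasSecond = second.Next();
        while (hasFirst && hasSecond)
        {
            if (first.Doc() < second.Doc()) hasFirst = first.Next();
            else if (first.Doc() > second.Doc()) hasSecond = second.Next();
            else
            {
                // Same document: collect the slots right after each occurrence of the first term.
                var expected = new HashSet<int>();
                for (int i = first.Freq(); i > 0; i--) expected.Add(first.NextPosition() + 1);
                bool match = false;
                for (int i = second.Freq(); i > 0; i--)
                    if (expected.Contains(second.NextPosition())) match = true;
                if (match) hits.Add(first.Doc());
                hasFirst = first.Next();
                hasSecond = second.Next();
            }
        }
        return hits;
    }
}
```

Both enumerators are ordered by document number, so the outer loop is a plain merge join, and positions are only read for documents that contain both terms.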

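III. Data Structures for Terms and Fields

Term encapsulates the field a term belongs to and the term's text. TermInfo holds the term's document frequency together with two pointers: one to the term's per-document frequency data and one to its position data. These two pointers are what the SegmentTermDocs and SegmentTermPositions enumerators use.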

TermEnum

/// <summary>Increments the enumeration to the next element. True if one exists.</summary>

public abstract bool Next();

/// <summary>Returns the current Term in the enumeration.</summary>

public abstract Term Term();

/// <summary>Returns the docFreq of the current Term in the enumeration.</summary>

public abstract int DocFreq();

Figure 3. Data structures for terms and fields

Posted on 2008-06-26 09:23 by 薛定颚的猫
