Lucene Study Notes, Part 4: Analysis of the Lucene Indexing Process

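Besides writing terms (Term) into the inverted lists and ultimately into Lucene's index files, the Lucene indexing process also includes analysis (Analyzer) and segment merging (merge segments). This article does not cover those two parts; they will be analyzed in later articles.

Many blogs and articles describe the Lucene indexing process. I recommend searching online for "Annotated Lucene" (its Chinese title appears to be 《Lucene源码剖析》), which is quite good.

The best way to really understand how Lucene builds its index files is to step through the code in a debugger and read the code alongside the article. This gives you the most detailed and accurate picture of the indexing process (descriptions always deviate a little, whereas the code does not lie), and it also lets you learn some of Lucene's well-crafted implementations for use in your own later work. Lucene is, after all, one of the better open-source projects.

Since Lucene has been upgraded to 3.0.0, the indexing process described here is that of Lucene 3.0.0.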

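I. Architecture of the Indexing Process

Indexing in Lucene 3.0 is a rather complex process. The various pieces of information are analyzed, processed, and written by different objects. To support multithreading, each thread creates its own set of similarly structured objects, and to improve efficiency some of these object sets are reused, which makes the indexing process even more complex.

The indexing process is essentially a pass through the index chain shown in the figure below. Each node in the chain is responsible for indexing a different part of a document's information, and once a document has passed through the whole chain it is fully processed. We call this original chain the basic index chain.

To let multiple threads process documents concurrently, each thread builds its own index chain so that it can work independently. The per-thread chain built on top of the basic index chain is called the thread index chain. Each node of the thread index chain is created by the corresponding node of the basic index chain through a call to addThread.

To improve efficiency, note that the same field is handled in a similar way from document to document and needs roughly the same buffers. A thread therefore does not have to re-create a whole series of objects for every document; it reuses them instead. A separate chain is built for each field as well, which we call the field index chain. Each node of the field index chain is created by the corresponding node of the thread index chain through a call to addField.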
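
The three layers described above can be sketched roughly as follows. This is a toy model under stated assumptions, not Lucene's code: the classes ChainSketch, BasicNode, PerThreadNode, and PerFieldNode are hypothetical, and the method names addThread and addField are only modelled on the chain-building calls mentioned above. The sketch shows just one idea: a basic node creates per-thread nodes, and a per-thread node creates a per-field node once and then reuses it for later documents.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one node of the index chain (hypothetical, not a Lucene class). */
final class ChainSketch {

    /** Per-field node: created once per field name, then reused across documents. */
    static final class PerFieldNode {
        final String fieldName;
        int documentsSeen; // stands in for per-field buffers that get reused

        PerFieldNode(String fieldName) {
            this.fieldName = fieldName;
        }
    }

    /** Per-thread node: owns the per-field nodes used by one indexing thread. */
    static final class PerThreadNode {
        private final Map<String, PerFieldNode> fields = new HashMap<>();

        /** Returns the existing per-field node, creating it only on first use. */
        PerFieldNode addField(String fieldName) {
            return fields.computeIfAbsent(fieldName, PerFieldNode::new);
        }

        int fieldCount() {
            return fields.size();
        }
    }

    /** Basic node: the factory for per-thread nodes. */
    static final class BasicNode {
        PerThreadNode addThread() {
            return new PerThreadNode();
        }
    }

    /** Processes documents on one per-thread node; each document is an array of field names. */
    static void process(PerThreadNode thread, String[][] docs) {
        for (String[] doc : docs) {
            for (String field : doc) {
                thread.addField(field).documentsSeen++;
            }
        }
    }

    public static void main(String[] args) {
        BasicNode basic = new BasicNode();
        PerThreadNode t1 = basic.addThread();
        process(t1, new String[][] {{"path", "contents"}, {"path", "modified", "contents"}});
        System.out.println("per-field nodes: " + t1.fieldCount());
        System.out.println("docs seen by 'path': " + t1.addField("path").documentsSeen);
    }
}
```

The point of the map lookup is that the second document does not allocate new objects for "path" or "contents"; only "modified" causes a new per-field node to be created.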

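Once documents have been processed, all the information has to be written to the index files. Writing the index files is synchronized rather than multithreaded, and the information is again written part by part along the basic index chain.

This process is analyzed in detail below.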

II. The Indexing Process in Detail

1. Creating the IndexWriter object

Code:

IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

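An IndexWriter object mainly holds the following information:

For indexing documents
- Directory directory; points to the index directory.
- Analyzer analyzer; the analyzer.
- Similarity similarity = Similarity.getDefault(); affects the normalization factor part of scoring. A document's score has two parts: one is computed at index time and is independent of the query, the other is computed at search time and depends on the query.
- SegmentInfos segmentInfos = new SegmentInfos(); holds the segment information. You will notice that it corresponds almost one-to-one to the contents of segments_N.
- IndexFileDeleter deleter; this object is not used to delete documents but to manage index files.
- Lock writeLock; only one IndexWriter can be open on an index directory at a time, hence the lock.
- Set<SegmentInfo> segmentsToOptimize = new HashSet<SegmentInfo>(); holds the segments currently being optimized. When optimize is called, all current segments are added to this set; segments created afterwards do not take part in that optimization.
For merging segments (described in detail in the article on segment merging)
- SegmentInfos localRollbackSegmentInfos;
- HashSet<SegmentInfo> mergingSegments = new HashSet<SegmentInfo>();
- MergePolicy mergePolicy = new LogByteSizeMergePolicy(this);
- MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
- LinkedList<MergePolicy.OneMerge> pendingMerges = new LinkedList<MergePolicy.OneMerge>();
- Set<MergePolicy.OneMerge> runningMerges = new HashSet<MergePolicy.OneMerge>();
- List<MergePolicy.OneMerge> mergeExceptions = new ArrayList<MergePolicy.OneMerge>();
- long mergeGen;
For index integrity, consistency, and transactionality
- SegmentInfos rollbackSegmentInfos; after adding or deleting documents through IndexWriter, you can call commit to write the changes to the index files, or call rollback to discard the changes made since the last commit.
- SegmentInfos localRollbackSegmentInfos; used mainly when merging other index directories into this one. It keeps the original segment information so that the operation can be rolled back if it fails halfway.
Configuration
- long writeLockTimeout; the timeout for obtaining the write lock. A timeout indicates that the index directory has already been opened by another IndexWriter.
- int termIndexInterval; the same as indexInterval in the tii and tis files.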
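
To make the role of rollbackSegmentInfos more concrete, here is a minimal sketch of the snapshot idea behind commit and rollback. It is an assumption-laden toy, not IndexWriter's implementation: the class RollbackSketch and its methods are hypothetical, segments are reduced to plain names, and nothing touches the disk. It only illustrates that commit saves a copy of the current segment list and rollback restores that copy.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of commit/rollback over a list of segment names (hypothetical, not Lucene code). */
final class RollbackSketch {
    private List<String> segmentInfos = new ArrayList<>();         // live view
    private List<String> rollbackSegmentInfos = new ArrayList<>(); // snapshot taken at the last commit

    void addSegment(String name) {
        segmentInfos.add(name);
    }

    /** Makes the current state the new rollback point. */
    void commit() {
        rollbackSegmentInfos = new ArrayList<>(segmentInfos);
    }

    /** Discards everything added since the last commit. */
    void rollback() {
        segmentInfos = new ArrayList<>(rollbackSegmentInfos);
    }

    /** Returns a copy so callers cannot modify the live list. */
    List<String> segments() {
        return new ArrayList<>(segmentInfos);
    }

    public static void main(String[] args) {
        RollbackSketch writer = new RollbackSketch();
        writer.addSegment("_0");
        writer.commit();
        writer.addSegment("_1");
        writer.rollback();
        System.out.println(writer.segments()); // prints [_0]
    }
}
```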

2. Creating the Document object and adding fields (Field)

Code:

Document doc = new Document();

doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("modified",DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));

doc.add(new Field("contents", new FileReader(f)));

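A Document object mainly consists of the following:

- The document's boost, 1 by default. A value greater than 1 marks the document as more important than an ordinary one; a value less than 1 marks it as less important.
- An ArrayList holding all the fields of the document.
- Each field consists of a field name, a field value, and some flags, corresponding to the descriptions of the fnm, fdx, and fdt files.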

3. Adding the document to IndexWriter

Code:

writer.addDocument(doc);
--> IndexWriter.addDocument(Document doc, Analyzer analyzer)
--> doFlush = docWriter.addDocument(doc, analyzer);
--> DocumentsWriter.updateDocument(Document, Analyzer, Term)
Note: --> denotes one level of function call.

IndexWriter calls DocumentsWriter.addDocument, which in turn calls DocumentsWriter.updateDocument.

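4. Adding the document to DocumentsWriter

Code:

DocumentsWriter.updateDocument(Document doc, Analyzer analyzer, Term delTerm)
-->(1) DocumentsWriterThreadState state = getThreadState(doc, delTerm);
-->(2) DocWriter perDoc = state.consumer.processDocument();
-->(3) finishDocument(state, perDoc);

A DocumentsWriter object mainly holds the following:

For writing index files
- IndexWriter writer;
- Directory directory;
- Similarity similarity: the Similarity used for the normalization factor in scoring.
- String segment: the current segment name. On every flush, the index is written to a segment with this name.

IndexWriter.doFlushInternal()
--> String segment = docWriter.getSegment(); // returns segment
--> newSegment = new SegmentInfo(segment,……);
--> docWriter.createCompoundFile(segment); // creates the cfs file based on segment

- String docStoreSegment: the target segment that stored fields are written to (described in detail in the article on the index file format).
- int docStoreOffset: the offset of the stored fields within the target segment.
- int nextDocID: the ID of the next document to be added to this index. It is unique per index directory and is accessed under synchronization.
- DocConsumer consumer; the core of the whole indexing process and the head of the IndexChain.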
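
The remark that nextDocID is unique and accessed under synchronization can be illustrated with a small, self-contained sketch. It assumes nothing about DocumentsWriter beyond that remark: the class DocIdSketch and its methods are hypothetical. Several threads request document IDs at the same time, and the synchronized method guarantees that no ID is handed out twice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of synchronized document-ID allocation (hypothetical, not Lucene code). */
final class DocIdSketch {
    private int nextDocID; // guarded by "this"

    /** Hands out the next document ID; synchronized so no two threads receive the same one. */
    synchronized int allocateDocID() {
        return nextDocID++;
    }

    synchronized int docCount() {
        return nextDocID;
    }

    /** Starts the given number of threads, each allocating idsPerThread IDs, and returns all IDs seen. */
    static Set<Integer> allocateConcurrently(DocIdSketch writer, int threads, int idsPerThread) {
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < idsPerThread; i++) {
                    seen.add(writer.allocateDocID());
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        DocIdSketch writer = new DocIdSketch();
        Set<Integer> ids = allocateConcurrently(writer, 4, 1000);
        System.out.println(ids.size() + " unique IDs, next ID = " + writer.docCount());
    }
}
```

Only the ID allocation is serialized here; the per-document work that each thread would do on its own chain can still run in parallel.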

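5. DocumentsWriter's buffer management with CharBlockPool, ByteBlockPool, and IntBlockPool

During indexing, DocumentsWriter stores term text (term) in CharBlockPool and stores document IDs (doc ID), term frequencies (freq), and positions (prox) in ByteBlockPool.
In ByteBlockPool the buffer is allocated in slices. Slices are organized into levels: the higher the level, the larger its slices, and all slices of one level have the same size.
- nextLevelArray gives, for each level, the level that follows it. Levels are numbered from 0, and the entry for level 9 is again 9, so level 9 is the highest level and there are 10 levels in total.
- levelSizeArray gives the slice size of each level: 5 bytes for the first level (level 0), 14 bytes for the second, and so on.

The ByteBlockPool class has the following static variables:

final static int[] nextLevelArray = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
final static int[] levelSizeArray = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};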
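
To see how the two tables work together, the following sketch walks a single stream of postings through successive slices. The two arrays are copied from the listing above; the class SliceLevels and its methods are hypothetical helpers written for this note. The sketch counts raw slice sizes only and ignores any bytes a real pool reserves inside a slice for bookkeeping.

```java
/** Walks the slice-level tables quoted above (helper names are hypothetical). */
final class SliceLevels {
    // Copied from the ByteBlockPool tables in the text.
    static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    /** Level of the n-th slice (0-based) allocated for one stream. */
    static int levelOfSlice(int n) {
        int level = 0;
        for (int i = 0; i < n; i++) {
            level = NEXT_LEVEL[level];
        }
        return level;
    }

    /** Total raw bytes allocated after the first `slices` slices. */
    static int bytesAllocated(int slices) {
        int total = 0;
        int level = 0;
        for (int i = 0; i < slices; i++) {
            total += LEVEL_SIZE[level];
            level = NEXT_LEVEL[level];
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 0; n < 12; n++) {
            int level = levelOfSlice(n);
            System.out.println("slice " + n + ": level " + level
                    + ", size " + LEVEL_SIZE[level]
                    + ", total so far " + bytesAllocated(n + 1));
        }
    }
}
```

A rare term that occurs once or twice only ever occupies a 5-byte slice, while a frequent term quickly reaches level 9 and from then on grows in 200-byte steps. That appears to be the motivation for the graded sizes: small waste for rare terms, few allocations for common ones.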

6. Closing the IndexWriter object

Code:

writer.close();

--> IndexWriter.closeInternal(boolean)

--> (1) Write the index data from memory to disk: flush(waitForMerges, true, true);
--> (2) Merge segments: mergeScheduler.merge(this);

Segment merging is discussed in a later chapter; here we only discuss how the index data is written from memory to disk.

Code:

IndexWriter.flush(boolean triggerMerge, boolean flushDocStores, boolean flushDeletes)

--> IndexWriter.doFlush(boolean flushDocStores, boolean flushDeletes)

--> IndexWriter.doFlushInternal(boolean flushDocStores, boolean flushDeletes)

Writing the index to disk involves the following steps:

- Get the name of the segment to write: String segment = docWriter.getSegment();
- DocumentsWriter writes the buffered information into the segment: docWriter.flush(flushDocStores);
- Create the new segment info object: newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());
- Prepare the document deletions: docWriter.pushDeletes();
- Create the cfs segment: docWriter.createCompoundFile(segment);
- Delete the documents: applyDeletes();

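6.1. Getting the name of the segment to write

Code:

SegmentInfo newSegment = null;

final int numDocs = docWriter.getNumDocsInRAM(); // total number of documents

String docStoreSegment = docWriter.getDocStoreSegment(); // name of the segment that stored fields and term vectors are written to, "_0"

int docStoreOffset = docWriter.getDocStoreOffset(); // offset of the stored fields and term vectors within that segment

String segment = docWriter.getSegment(); // segment name, "_0"

As described in detail in the chapter on Lucene's index file structure, stored fields and term vectors can be kept in a different segment from the indexed fields.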

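6.2. Writing the buffered content into the segment

Code:

flushedDocCount = docWriter.flush(flushDocStores);

This step in turn consists of two phases:

- Close the stored fields and term vectors, following the basic index chain.
- Write the indexing results into the segment, following the structure of the basic index chain.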

6.3. Creating the new segment info object

Code:

newSegment = new SegmentInfo(segment, flushedDocCount, directory, false, true, docStoreOffset, docStoreSegment, docStoreIsCompoundFile, docWriter.hasProx());

segmentInfos.add(newSegment);

6.4. Preparing the document deletions

Code:

docWriter.pushDeletes();

--> deletesFlushed.update(deletesInRAM);

Here everything in deletesInRAM is added to deletesFlushed, and deletesInRAM is then cleared. The reason has been explained above.
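
The move from deletesInRAM to deletesFlushed can be pictured with the following sketch. It is a simplification under stated assumptions: BufferedDeletesSketch is a hypothetical class that buffers only delete terms as plain strings, whereas the real buffers hold more kinds of pending deletions. The only behaviour taken from the text is that update adds everything from the source buffer and then empties it.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy buffer of pending delete terms (hypothetical, not Lucene's class). */
final class BufferedDeletesSketch {
    final Set<String> terms = new HashSet<>();

    void bufferDeleteTerm(String term) {
        terms.add(term);
    }

    /** Adds everything buffered in `in` to this buffer, then empties `in`. */
    void update(BufferedDeletesSketch in) {
        terms.addAll(in.terms);
        in.terms.clear();
    }

    boolean any() {
        return !terms.isEmpty();
    }

    public static void main(String[] args) {
        BufferedDeletesSketch deletesInRAM = new BufferedDeletesSketch();
        BufferedDeletesSketch deletesFlushed = new BufferedDeletesSketch();
        deletesInRAM.bufferDeleteTerm("path:/tmp/a.txt");
        deletesFlushed.update(deletesInRAM); // the "pushDeletes" step
        System.out.println("in RAM: " + deletesInRAM.any() + ", flushed: " + deletesFlushed.any());
    }
}
```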

6.5. Creating the cfs segment

Code:

docWriter.createCompoundFile(segment);

newSegment.setUseCompoundFile(true);

The code is:

DocumentsWriter.createCompoundFile(String segment) {
  CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, segment + "." + IndexFileNames.COMPOUND_FILE_EXTENSION);
  // Add all the file names recorded during the flush above to the cfs writer.
  for (final String flushedFile : flushState.flushedFiles)
    cfsWriter.addFile(flushedFile);
  cfsWriter.close();
}
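
The idea of a compound file, several files packed into one with a small directory in front, can be sketched without Lucene as follows. Everything here is an assumption made for illustration: the class CompoundFileSketch is hypothetical, it works on in-memory byte arrays, and the layout (entry count, then offset, length, and name per file, then the raw data) is invented for this sketch and is not claimed to be Lucene's actual .cfs format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Packs several named byte arrays into one blob (hypothetical layout, not the real .cfs format). */
final class CompoundFileSketch {

    /** Layout: entry count, then (offset, length, name) per file, then the raw file data. */
    static byte[] pack(Map<String, byte[]> files) {
        try {
            // Pass 1: measure the directory so the data offsets are known up front.
            int headerSize = writeDirectory(files, 0).length;
            // Pass 2: write the real directory followed by the file contents.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(writeDirectory(files, headerSize));
            for (byte[] data : files.values()) {
                out.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] writeDirectory(Map<String, byte[]> files, int firstOffset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(files.size());
        long offset = firstOffset;
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            out.writeLong(offset);                  // fixed width, so the header size does not depend on it
            out.writeInt(entry.getValue().length);
            out.writeUTF(entry.getKey());
            offset += entry.getValue().length;
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads one named file back out of the blob, or returns null if it is not present. */
    static byte[] read(byte[] blob, String name) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                long offset = in.readLong();
                int length = in.readInt();
                if (in.readUTF().equals(name)) {
                    return Arrays.copyOfRange(blob, (int) offset, (int) offset + length);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("_0.fnm", new byte[] {1, 2, 3});
        files.put("_0.frq", new byte[] {9});
        byte[] cfs = pack(files);
        System.out.println(Arrays.toString(read(cfs, "_0.frq"))); // prints [9]
    }
}
```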

6.6. Deleting documents

Code:

applyDeletes();

The code is:

boolean applyDeletes(SegmentInfos infos) {
  if (!hasDeletes())
    return false;

  final int infosEnd = infos.size();
  int docStart = 0; // index-wide ID of the first document in the current segment
  boolean any = false;

  for (int i = 0; i < infosEnd; i++) {
    assert infos.info(i).dir == directory;
    SegmentReader reader = writer.readerPool.get(infos.info(i), false);
    try {
      any |= applyDeletes(reader, docStart);
      docStart += reader.maxDoc();
    } finally {
      writer.readerPool.release(reader);
    }
  }

  deletesFlushed.clear();
  return any;
}

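Lucene can delete documents through either a reader or a writer, but in the end the deletion is always carried out by a reader.
A reader can delete in three ways:
- By term: delete all documents that contain the term.
- By document ID.
- By query object: delete all documents that match the query.
All three ways, however, ultimately come down to deleting by document ID, that is, to writing the .del file.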
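
Since everything ends up as deletion by document ID, the bookkeeping with docStart in the loop above can be sketched as follows. This is a self-contained toy under stated assumptions: the class DeleteByDocIdSketch and its Segment type are hypothetical, and a BitSet stands in for the deleted-documents information that would end up in a .del file. The sketch shows how an index-wide document ID is translated into a segment-local ID by subtracting the running docStart.

```java
import java.util.BitSet;

/** Toy model of applying deletes by document ID across segments (hypothetical, not Lucene code). */
final class DeleteByDocIdSketch {

    /** One segment: a document count plus a deleted-docs bitmap (a stand-in for the .del file). */
    static final class Segment {
        final int maxDoc;
        final BitSet deleted;

        Segment(int maxDoc) {
            this.maxDoc = maxDoc;
            this.deleted = new BitSet(maxDoc);
        }
    }

    /** Marks the given index-wide doc IDs as deleted; returns true if any bit was newly set. */
    static boolean applyDeletes(Segment[] segments, int[] globalDocIDs) {
        boolean any = false;
        int docStart = 0; // index-wide ID of the current segment's first document
        for (Segment segment : segments) {
            for (int docID : globalDocIDs) {
                int local = docID - docStart;
                if (local >= 0 && local < segment.maxDoc && !segment.deleted.get(local)) {
                    segment.deleted.set(local);
                    any = true;
                }
            }
            docStart += segment.maxDoc;
        }
        return any;
    }

    public static void main(String[] args) {
        Segment[] segments = {new Segment(3), new Segment(4)};
        applyDeletes(segments, new int[] {1, 5});
        System.out.println("segment 0 deleted: " + segments[0].deleted); // prints {1}
        System.out.println("segment 1 deleted: " + segments[1].deleted); // prints {2}
    }
}
```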

posted @ 2014-08-05 14:33 塞北鸿雁飞
