Lucene in Action notes: indexing
I. What to do before indexing
1. Convert the content to be indexed into text
The files you need to handle may be PDF, Word, or HTML. Convert them all to plain text, because Lucene can only process text.
2. Analyze the text
Before indexing, the text must go through a series of analysis steps: tokenization, i.e. splitting the text into words, and then filtering out words with no discriminating power, such as stop words.
This step is language-dependent; different languages are processed differently. For English you also have to handle letter case and derived word forms.
This step matters a great deal for indexing, so choosing or developing a suitable analyzer is a key decision when working with Lucene.
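To see what an analyzer actually does, here is a minimal sketch using the Lucene 1.4-era token API (TokenStream.next() and Token.termText() were replaced by an attribute-based API in later versions); the field name "contents" and the sample sentence are arbitrary:
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // SimpleAnalyzer lowercases and splits on non-letter characters.
        Analyzer analyzer = new SimpleAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("The Quick Brown Fox, 1999"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText()); // the, quick, brown, fox (digits dropped)
        }
    }
}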
3. Write the index
This step performs the actual indexing. Lucene uses an inverted index, the index structure used by almost all mainstream search engines today.
An inverted index answers the question "which documents does this word appear in?" The index at the back of a book is in fact an inverted index: it lists the chapters in which each keyword appears.
A forward index, by contrast, answers the question "which words appear in this document?"
II. How Lucene indexes documents
1. Create an IndexWriter
IndexWriter writer = new IndexWriter(dir, getAnalyzer(), true); // true = create a new index rather than append to an existing one
2. Create a Document object
Document doc = new Document();
doc.add(Field.Keyword("id", keywords[i]));         // not analyzed, indexed, stored
doc.add(Field.UnIndexed("country", unindexed[i])); // stored only, not indexed
doc.add(Field.UnStored("contents", unstored[i]));  // analyzed and indexed, not stored
doc.add(Field.Text("city", text[i]));              // analyzed, indexed, stored
3. Add the document to the index
writer.addDocument(doc);
4. Optimize and close
writer.optimize(); // merge segments so searches open fewer files
writer.close();
That's all it takes.
A few points here deserve discussion.
1. Types of index storage directory
There are two kinds, FSDirectory and RAMDirectory; as the names suggest, one is backed by the file system and the other by memory.
A RAM directory serves two purposes. The main one is speed, for indexes with demanding latency requirements. The other is testing: a unit test can build its index in memory and leave no garbage files behind when it finishes.
The book notes that "the performance difference between RAMDirectory and FSDirectory is less visible when Lucene is used on operating systems that cache files in memory."
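As a sketch of the unit-testing use, the Lucene 1.4-era walkthrough from section II can be pointed at a RAMDirectory instead of a file-system path (the field name and value are illustrative):
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory(); // the index lives entirely in memory
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("city", "Amsterdam"));
        writer.addDocument(doc);
        writer.close();
        // when dir goes out of scope, nothing is left behind on disk
    }
}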
2. Document and Field
To index a document, you first create an abstract Document object and then add the content to be indexed into its various fields.
A Document is simply a collection of fields; fields hold the document content itself as well as the document's metadata.
A field consists of a name and a value.
Lucene 1.4 mainly provides the following four Field types:
Keyword: not analyzed, indexed, stored
UnIndexed: not analyzed, not indexed, stored. Use this for values that should be displayed with search results but are not themselves search terms.
UnStored: analyzed, indexed, not stored. Use this for large runs of text, such as the document body.
Text: analyzed, indexed; stored when constructed from a String, not stored when constructed from a Reader.
In Lucene 2.0 the specific field type is instead expressed as a combination of three inner classes: Field.Index, Field.Store, and Field.TermVector. Specifically:
Field.Store.COMPRESS: store compressed; for long text or binary data
Field.Store.YES: store the value
Field.Store.NO: do not store the value
Field.Index.NO: do not index
Field.Index.TOKENIZED: analyze (tokenize) and index
Field.Index.UN_TOKENIZED: index without analyzing
Field.Index.NO_NORMS: index without analyzing, and do not store norms. Norms normally cost one byte per indexed field per document, so omitting them saves space.
Field.TermVector.NO: do not store term vectors
Field.TermVector.YES: store term vectors
Field.TermVector.WITH_POSITIONS: store term vectors with token position information
Field.TermVector.WITH_OFFSETS: store term vectors with token offsets
Field.TermVector.WITH_POSITIONS_OFFSETS: store term vectors with both token positions and token offsets
The Field constructors also use these three inner classes:
Field(String, byte[], Field.Store)
Field(String, Reader)
Field(String, Reader, Field.TermVector)
Field(String, String, Field.Store, Field.Index)
Field(String, String, Field.Store, Field.Index, Field.TermVector)
Field(String, Reader) and Field(String, Reader, Field.TermVector) default to Field.Index.TOKENIZED and Field.Store.NO.
The old and new types correspond as follows:
Keyword <==> Store.YES,Index.UN_TOKENIZED;
UnIndexed <==> Store.YES,Index.NO;
UnStored <==> Store.NO,Index.TOKENIZED;
Text(String, Reader) <==> Store.NO,Index.TOKENIZED;
Text(String,String) <==> Store.YES,Index.TOKENIZED.
The passage above is adapted from http://hi.baidu.com/z57354658/blog/item/915579182bf948be4aedbcbf.html
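To make the mapping concrete, here is a sketch of the section II walkthrough rewritten against the Lucene 2.0 Field constructor (field names and values are illustrative):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldMappingDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // Keyword("id", ...)
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        // UnIndexed("country", ...)
        doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
        // UnStored("contents", ...)
        doc.add(new Field("contents", "the quick brown fox", Field.Store.NO, Field.Index.TOKENIZED));
        // Text("city", String)
        doc.add(new Field("city", "Amsterdam", Field.Store.YES, Field.Index.TOKENIZED));
        System.out.println(doc);
    }
}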
In the newest Lucene 3.0, the Field options look like this:
Field options for indexing
Index.ANALYZED
– use the analyzer to break the Field’s value into a stream of separate tokens and make each token searchable.
Index.NOT_ANALYZED
– do index the field, but do not analyze the String. Instead, treat the
Field’s entire value as a single token and make that token searchable.
Index.ANALYZED_NO_NORMS
– an advanced variant of Index.ANALYZED which does not store norms information in the index.
Index.NOT_ANALYZED_NO_NORMS
– just like Index.NOT_ANALYZED, but also do not store norms.
Index.NO
– don’t make this field’s value available for searching at all.
Field options for storing fields
Store.YES
— store the value. When the value is stored, the original String in its
entirety is recorded in the index and may be retrieved by an
IndexReader.
Store.NO
– do not store the value. This is often used along with Index.ANALYZED
to index a large text field that doesn’t need to be retrieved in its
original form.
Field options for term vectors
TermVector.YES
– record the unique terms that occurred, and their counts, in each
document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS
– record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS
– record the unique terms and their counts, with the offsets (start
& end character position) of each occurrence of every term, but no
positions.
TermVector.WITH_POSITIONS_OFFSETS
– store unique terms and their counts, along with positions and offsets.
TermVector.NO
– do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.
Some examples of how these options are combined in practice:
Index | Store | TermVector | Example usage
NOT_ANALYZED | YES | NO | Identifiers (file names, primary keys), telephone and social security numbers, URLs, personal names, dates
ANALYZED | YES | WITH_POSITIONS_OFFSETS | Document title, document abstract
ANALYZED | NO | WITH_POSITIONS_OFFSETS | Document body
NO | YES | NO | Document type, database primary key
NOT_ANALYZED | NO | NO | Hidden keywords
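As a sketch, here are two rows of this table expressed with the Lucene 3.0 five-argument Field constructor (the field names and values are illustrative):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionsDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // Identifier row: NOT_ANALYZED / YES / NO
        doc.add(new Field("id", "doc-42",
                Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
        // Document-body row: ANALYZED / NO / WITH_POSITIONS_OFFSETS
        doc.add(new Field("body", "the quick brown fox",
                Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        System.out.println(doc);
    }
}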
When Lucene builds the inverted index, by default it stores all
necessary information to implement the Vector Space model. This model
requires the count of every term that occurred in the document, as well
as the positions of each occurrence (needed for phrase searches).
You can tell Lucene to skip indexing the term frequency and positions by calling:
Field.setOmitTermFreqAndPositions(true)
III. Some indexing tips
1. Boost
You can set the importance of a document or a field; the default is 1.0.
Calling setBoost(1.5f) changes this value.
Since boost values can be set on both documents and fields, where are they stored (in Lucene 3.0)?
At index time, every field of every document has a floating-point boost value. Floats are awkward to store and take space, so the float boosts are encoded (quantized) into one-byte norms for storage. At search time all the norms are loaded into memory and decoded back into floats when needed for scoring.
This has a downside: norms can consume a lot of RAM. At one byte per field per document, a huge index needs a large amount of memory for them.
If boosting is not important to your system, you can turn it off with:
Field.setOmitNorms(true)
Alternatively, choose ANALYZED_NO_NORMS when picking the field's Index option.
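A minimal sketch of both knobs against the Lucene 3.0 API (the names and boost values are arbitrary):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.setBoost(2.0f); // this whole document scores higher than the default 1.0

        Field title = new Field("title", "Lucene in Action",
                Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(1.5f); // this field matters more than the others
        doc.add(title);

        Field body = new Field("body", "some long text",
                Field.Store.NO, Field.Index.ANALYZED);
        body.setOmitNorms(true); // no norms for this field: saves one byte per document
        doc.add(body);
    }
}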
2. Indexing dates
You can use the Date type directly, but it has two drawbacks. When indexing a Date, Lucene first converts it to a String with precision down to the millisecond, which is usually far more precise than you need and creates a performance problem; and the Date type cannot handle dates before 1970.
So you can convert dates yourself into strings of the form YYYYMMDD and index those instead.
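A sketch of the conversion (day granularity is an assumption; pick whatever precision your queries need):
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateToStringDemo {
    public static void main(String[] args) {
        // Formatted at day granularity, the string sorts in date order.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        String day = fmt.format(new Date());
        System.out.println(day); // e.g. "20100315"
        // index `day` as a Keyword / NOT_ANALYZED field
    }
}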
3. Indexing numbers
Numbers must also be converted to strings before indexing. For numbers inside body text, first choose a suitable analyzer, such as WhitespaceAnalyzer or StandardAnalyzer.
If you choose SimpleAnalyzer or StopAnalyzer, numbers will be filtered out.
For numbers stored in keyword fields that you want to run range queries against, you must zero-pad them,
because the lexicographic order of 7, 21, 71 is 21, 7, 71. Padding them to 007, 021, 071 guarantees the correct order.
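A quick check of that claim, with a hypothetical fixed width of three digits:
import java.util.Arrays;

public class ZeroPadDemo {
    public static void main(String[] args) {
        String[] raw = {"7", "21", "71"};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw)); // [21, 7, 71] -- wrong order

        String[] padded = {String.format("%03d", 7),
                           String.format("%03d", 21),
                           String.format("%03d", 71)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded)); // [007, 021, 071] -- correct
    }
}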
4. Fields used for sorting must be indexed but not analyzed, i.e. use a Keyword-style field.
IV. Improving indexing performance
The indexing bottleneck is mainly disk writes; the key optimization is using buffers effectively to reduce the number of disk reads and writes.
1. Tuning parameters
Parameter | System property | Default | Description
mergeFactor | org.apache.lucene.mergeFactor | 10 | Controls segment size and how often segments are merged (it plays both roles)
maxMergeDocs | org.apache.lucene.maxMergeDocs | Integer.MAX_VALUE | Limits the number of documents in a single segment
minMergeDocs | org.apache.lucene.minMergeDocs | 10 | Number of documents buffered in memory before being flushed to disk
maxFieldLength | org.apache.lucene.maxFieldLength | 10000 | Maximum number of terms indexed per field; terms beyond the limit are ignored, never indexed, and therefore never found by searches
Increasing mergeFactor and minMergeDocs usually speeds up indexing considerably, but both consume a lot of memory and can exhaust the heap.
Likewise, increasing the JVM's initial and maximum heap sizes can improve indexing performance:
java -server -Xms128m -Xmx256m
Beware of the operating system's limit on the number of open files.
On Linux, use ulimit -n to check and change it.
Lucene opens at most this many files at once: (1 + mergeFactor) * FilesPerSegment.
That does not seem like much; you will not normally exceed the limit.
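A sketch of setting these knobs against the Lucene 2.x API, where the 1.4 public fields became setters and minMergeDocs was renamed maxBufferedDocs (the values are illustrative, not recommendations):
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuningDemo {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/lucene-index", new SimpleAnalyzer(), true);
        writer.setMergeFactor(100);      // merge less often; more segments on disk
        writer.setMaxBufferedDocs(1000); // buffer more documents in memory (was minMergeDocs)
        writer.setMaxFieldLength(10000); // terms per field before truncation
        // ... add documents ...
        writer.close();
    }
}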
2. In-memory indexing: RAMDirectory
Everything that FSDirectory does on disk, RAMDirectory performs in memory, and is thus much faster.
As you gradually increase mergeFactor or minMergeDocs, FSDirectory-based indexing starts to approach the speed of the RAMDirectory-based one.
3. Batch indexing by using RAMDirectory as a buffer
Tuning parameters to speed up indexing is fiddly, not very effective, and hard to control, so use this approach instead (a sketch follows the steps below):
1 Create an FSDirectory-based index.
2 Create a RAMDirectory-based index.
3 Add Documents to the RAMDirectory-based index.
4 Every so often, flush everything buffered in RAMDirectory into FSDirectory.
5 Go to step 3. (Who says GOTO is dead?)
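Here is a sketch of that loop against the Lucene 2.x API; the corpus, batch size, and path are stand-ins:
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Step 1: the FSDirectory-based destination index.
        Directory fsDir = FSDirectory.getDirectory("/tmp/lucene-index", true);
        IndexWriter fsWriter = new IndexWriter(fsDir, new SimpleAnalyzer(), true);

        // Step 2: the RAMDirectory-based buffer.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);

        String[] corpus = {"first doc", "second doc", "third doc"}; // stand-in data
        int batchSize = 2; // hypothetical flush threshold
        int buffered = 0;

        for (String text : corpus) {
            // Step 3: add documents to the RAM-based index.
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            ramWriter.addDocument(doc);

            if (++buffered == batchSize) {
                // Step 4: flush everything buffered in RAM into the FS index.
                ramWriter.close();
                fsWriter.addIndexes(new Directory[] { ramDir });
                // Step 5: start a fresh buffer and go back to step 3.
                ramDir = new RAMDirectory();
                ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
                buffered = 0;
            }
        }
        // Flush whatever is left in the buffer, then close.
        ramWriter.close();
        fsWriter.addIndexes(new Directory[] { ramDir });
        fsWriter.close();
    }
}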
4. Parallelizing indexing by working with multiple indexes
This is a flexible strategy: multiple threads index in parallel, each into its own RAMDirectory, and then either:
when a RAMDirectory reaches its threshold, its contents are flushed into a single shared index directory; or
when a RAMDirectory reaches its threshold, its contents are flushed to that thread's own on-disk index directory, with one dedicated process responsible for merging all the per-thread indexes into the master index.
The threads can even be scaled out to separate computers, forming an indexing cluster.
V. Index optimizing
The optimize operation merges all index segments into one. The merge can temporarily consume twice the disk space and performs a large amount of disk I/O.
So use optimize judiciously; running it too often causes performance problems of its own.
It’s important to emphasize that optimizing an index only affects the speed of searches against that index, and doesn’t affect the speed of indexing.
The reason is that when there are many segment files, a search has to open too many files, which wastes time.
Optimizing while indexing is still in progress is not recommended; it is best to optimize once indexing has finished.
VI. Concurrency, thread-safety, and locking issues
1. Lucene’s concurrency rules are simple but should be strictly followed:
■ Any number of read-only operations may be executed concurrently.
■ Any number of read-only operations may be executed while an index is being modified.
■ Only a single index-modifying operation may execute at a time.
In short: reads can run in parallel, but writes must be serialized.
2. Thread-safety
IndexWriter and IndexReader are thread-safe. A single instance of either class can therefore be shared among multiple threads, and all calls to its index-modifying methods are properly synchronized so that index modifications execute one after the other.
Although each class is thread-safe on its own, note that an IndexWriter and an IndexReader must not modify the index at the same time.
Lucene must ensure that the index-modifying operations of these two classes don't overlap:
an index-modifying IndexReader operation can't be executed while an index-modifying IndexWriter operation is in progress.
3. Index locking
The write.lock file is used to keep processes from concurrently attempting to modify an index.
The commit.lock is used whenever segments are being read or merged.
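A sketch of inspecting and clearing a stale write lock with the Lucene 1.4/2.x helpers (later versions moved these to IndexWriter.isLocked and IndexWriter.unlock); only unlock if you are sure no other process is writing:
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LockDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/lucene-index", false);
        if (IndexReader.isLocked(dir)) {
            // Forcibly releases the write lock -- dangerous if a writer is still alive.
            IndexReader.unlock(dir);
        }
    }
}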
VII. Debugging indexing
To see what IndexWriter is doing as it writes and merges segments, point its infoStream at standard output:
IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
writer.infoStream = System.out; // a public field in Lucene 1.4; later versions use writer.setInfoStream(System.out)