Problems using the IK analyzer after upgrading Lucene to 4.6.0 or later

After upgrading lucene-core from 4.5.1 to 4.7.0, the following code, which uses the IK analyzer, throws an exception:

IKAnalyzer analyzer = new IKAnalyzer(true);
StringReader reader = new StringReader(line);
TokenStream ts = analyzer.tokenStream("", reader);
CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
// reset() is never called before incrementToken(), which Lucene 4.6.0+ rejects
while (ts.incrementToken()) {
    ...
}

Exception message:

java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

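It turned out that the required usage of TokenStream changed starting with Lucene 4.6.0: reset() must be called before incrementToken() is used. See the API documentation for details: http://lucene.apache.org/core/4_6_0/core/index.html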

The workflow of the new TokenStream API is as follows:

1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
2. The consumer calls reset().
3. The consumer retrieves attributes from the stream and stores local references to all attributes it wants to access.
4. The consumer calls incrementToken() until it returns false consuming the attributes after each call.
5. The consumer calls end() so that any end-of-stream operations can be performed.
6. The consumer calls close() to release any resource when finished using the TokenStream.
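
To make the ordering in this workflow concrete, here is a small dependency-free Java sketch. It is not Lucene code: ToyTokenStream, term() and consume() are hypothetical names made up for this illustration, and the state check is only a rough model of the kind of contract enforcement that produces the exception above. The toy splits on whitespace instead of doing real analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy stand-in for a token stream; NOT Lucene's implementation.
class ToyTokenStream {
    private final String[] tokens;
    private int pos = 0;
    private boolean ready = false; // true only between reset() and close()

    ToyTokenStream(String text) {
        // Whitespace splitting stands in for real analysis.
        this.tokens = text.trim().split("\\s+");
    }

    void reset() {
        if (ready) {
            throw new IllegalStateException("contract violation: reset() called multiple times");
        }
        pos = 0;
        ready = true;
    }

    boolean incrementToken() {
        if (!ready) {
            throw new IllegalStateException("contract violation: reset()/close() call missing");
        }
        if (pos >= tokens.length) {
            return false;
        }
        pos++;
        return true;
    }

    // Returns the token that the last successful incrementToken() moved to.
    String term() {
        return tokens[pos - 1];
    }

    void end() {
        // Nothing to finalize in this toy model.
    }

    void close() {
        ready = false; // a later reset() is allowed again
    }
}

class TokenStreamContractSketch {
    // Follows the consuming order: reset, incrementToken loop, end, close.
    static List<String> consume(ToyTokenStream ts) {
        List<String> terms = new ArrayList<>();
        ts.reset();
        try {
            while (ts.incrementToken()) {
                terms.add(ts.term());
            }
            ts.end();
        } finally {
            ts.close(); // always release the stream, even if consumption fails
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream("hello lucene world");
        try {
            ts.incrementToken(); // skipping reset() fails, as in the post
        } catch (IllegalStateException e) {
            System.out.println("Without reset(): " + e.getMessage());
        }
        System.out.println(consume(ts));
    }
}
```

Putting close() in a finally block is a design choice of this sketch, so that the stream is released even when consuming it throws.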

After changing the code as follows, it runs correctly:

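IKAnalyzer analyzer = new IKAnalyzer(true);
StringReader reader = new StringReader(line);
TokenStream ts = analyzer.tokenStream("", reader);
CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
ts.reset(); // must be called before the first incrementToken()
while (ts.incrementToken()) {
    ...
}
ts.end();   // lets end-of-stream operations run
ts.close(); // releases resources once the stream is consumed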

posted on 2014-04-14 14:10 图样图森破