Notes on Synonym Expansion and Token Deduplication in Apache Solr with a Custom QParser


At first I followed the instructions on the Solr wiki and configured the following filters in schema.xml:

<analyzer type="query">
  <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

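In practice, however, RemoveDuplicatesTokenFilterFactory did not filter out the duplicate tokens. For example, after synonym expansion (here and below, expansion between a brand's Chinese and English names), the query "摩托罗拉 motorola 里程碑2代" became "摩托罗拉 摩托 motorola moto motorola 摩托罗拉 摩托 moto 里程碑 2代", in which "摩托罗拉", "摩托", "motorola", and "moto" each appear twice. Because I use a custom QParser based on DisMaxQParser, the expanded synonyms affect the min-should-match parameter and lower matching precision.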
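To make the min-should-match effect concrete, here is a minimal sketch. It assumes, purely for illustration, that every token in the rewritten query string becomes one optional clause and that mm is set to 75%; neither value comes from the setup described here. `requiredClauses`, `matchedClauses`, `passes`, and the brand-only document are hypothetical helpers, and the percentage rule is a simplified stand-in rather than Solr's own code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MinShouldMatchSketch {

    // Tokens of the original query and of the expanded query quoted above.
    static final List<String> ORIGINAL = Arrays.asList("摩托罗拉", "motorola", "里程碑", "2代");
    static final List<String> EXPANDED = Arrays.asList(
            "摩托罗拉", "摩托", "motorola", "moto",
            "motorola", "摩托罗拉", "摩托", "moto",
            "里程碑", "2代");

    // Hypothetical document that mentions the brand but not the model.
    static final Set<String> BRAND_ONLY_DOC =
            new HashSet<String>(Arrays.asList("摩托罗拉", "摩托", "motorola", "moto"));

    // Assumption: simplified percentage rule, rounded down.
    static int requiredClauses(int clauseCount, int percent) {
        return clauseCount * percent / 100;
    }

    // Counts how many query clauses are satisfied by the document's terms.
    static int matchedClauses(List<String> queryTokens, Set<String> docTerms) {
        int matched = 0;
        for (String token : queryTokens) {
            if (docTerms.contains(token)) {
                matched++;
            }
        }
        return matched;
    }

    // True if the document satisfies at least the required number of clauses.
    static boolean passes(List<String> queryTokens, Set<String> docTerms, int percent) {
        return matchedClauses(queryTokens, docTerms) >= requiredClauses(queryTokens.size(), percent);
    }

    public static void main(String[] args) {
        System.out.println("original query passes: " + passes(ORIGINAL, BRAND_ONLY_DOC, 75));
        System.out.println("expanded query passes: " + passes(EXPANDED, BRAND_ONLY_DOC, 75));
    }
}
```

Under these assumptions, a document that mentions only the brand terms satisfies 8 of the 10 expanded clauses and clears the threshold of 7, although it contains neither 里程碑 nor 2代. Against the 4 original tokens it satisfies only 2 clauses and falls short of the required 3.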


@Override
public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
        final char term[] = termAttribute.buffer();
        final int length = termAttribute.length();
        final int posIncrement = posIncAttribute.getPositionIncrement();
        // A new position starts: forget the terms seen at the previous position.
        if (posIncrement > 0) {
            previous.clear();
        }
        // Only a repeat at the same position (posIncrement == 0) counts as a duplicate.
        boolean duplicate = (posIncrement == 0 && previous.contains(term, 0, length));
        // clone the term, and add to the set of seen terms.
        char saved[] = new char[length];
        System.arraycopy(term, 0, saved, 0, length);
        previous.add(saved);
        if (!duplicate) {
            return true;
        }
    }
    return false;
}
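The loop above treats a term as a duplicate only when it repeats at the same position, because the set of seen terms is cleared each time the position advances. The following sketch replays that rule without Lucene on the expanded query from the example. The `Tok` class and the position increments assigned to each token are assumptions made for illustration (each original word at increment 1, its synonyms at increment 0), not values captured from a real analysis chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SamePositionDedupSketch {

    /** Hypothetical stand-in for a token: its text and position increment. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Assumed increments: 1 for each original word, 0 for its synonyms.
    static final List<Tok> EXPANDED = Arrays.asList(
            new Tok("摩托罗拉", 1), new Tok("摩托", 0), new Tok("motorola", 0), new Tok("moto", 0),
            new Tok("motorola", 1), new Tok("摩托罗拉", 0), new Tok("摩托", 0), new Tok("moto", 0),
            new Tok("里程碑", 1), new Tok("2代", 1));

    // Same rule as the loop above: the seen-set is reset whenever the position advances.
    static List<String> removeSamePositionDuplicates(List<Tok> input) {
        Set<String> previous = new HashSet<String>();
        List<String> out = new ArrayList<String>();
        for (Tok t : input) {
            if (t.posInc > 0) {
                previous.clear();
            }
            boolean duplicate = t.posInc == 0 && previous.contains(t.term);
            previous.add(t.term);
            if (!duplicate) {
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> kept = removeSamePositionDuplicates(EXPANDED);
        System.out.println(kept.size() + " of " + EXPANDED.size() + " tokens kept: " + kept);
    }
}
```

With these assumed increments all 10 tokens survive: the second "motorola" starts a new position, so the set no longer remembers the brand terms emitted for "摩托罗拉".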



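  1. Find a way to obtain Solr's TokenizerChain inside the QParser, and get the SynonymFilterFactory from it
  2. In the QParser, get the Analyzer used for tokenization, and build a SynonymFilter instance on top of the Analyzer's TokenStream
  3. Iterate over the tokens through the SynonymFilter (by calling incrementToken), and branch on the positionIncrement of each token produced by synonym expansion:
    • If positionIncrement > 0, check whether the term has already appeared; if not, let it through and put it in a Set for later duplicate checks
    • If positionIncrement == 0, only put it in the Set for later duplicate checks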



// Query analyzer of the schema, plus the query analysis chain of the "title" field.
Analyzer analyzer = req.getSchema().getQueryAnalyzer();
final TokenizerChain tokenizerChain = (TokenizerChain) req.getSchema().getField("title").getType().getQueryAnalyzer();
// Locate the SynonymFilterFactory configured in the chain.
SynonymFilterFactory sff = null;
for (TokenFilterFactory tf : tokenizerChain.getTokenFilterFactories()) {
    if (tf instanceof SynonymFilterFactory) {
        sff = (SynonymFilterFactory) tf;
    }
}
// Nothing to do without an analyzer or a configured synonym filter.
if (null == analyzer || null == sff) {
    return;
}
// …………
StringReader reader = new StringReader(qstr);
StringBuilder buffer = new StringBuilder(128);
// Tokens that make it into the rewritten query, in their original order.
Set<String> tokenSet = new LinkedHashSet<String>();
// …………
TokenStream tokens = analyzer.reusableTokenStream("title", reader);
SynonymFilter sf = sff.create(tokens);
sf.reset();
TermAttribute termAtt = sf.getAttribute(TermAttribute.class);
PositionIncrementAttribute positionIncrementAttribute = sf.getAttribute(PositionIncrementAttribute.class);
OffsetAttribute offsetAttribute = sf.getAttribute(OffsetAttribute.class);
// Every term seen so far, whether an original word or an expanded synonym.
Set<String> duplicatedTokenSet = new HashSet<String>();
while (sf.incrementToken()) {
    final String token = (new String(termAtt.termBuffer(), 0, termAtt.termLength())).toLowerCase();
    final int posIncr = positionIncrementAttribute.getPositionIncrement();
    if (posIncr > 0) {
        // New position: keep the token only if it has not been seen before.
        if (!duplicatedTokenSet.contains(token)) {
            duplicatedTokenSet.add(token);
            tokenSet.add(token);
        }
    } else {
        // Same position (a synonym): remember it, but do not emit it.
        duplicatedTokenSet.add(token);
    }
}
// …………
// Rebuild the query string from the surviving tokens.
for (String tok : tokenSet) {
    buffer.append(tok).append(" ");
}
if (buffer.length() > 0) {
    qstr = buffer.toString();
}
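The position-increment rule used in the loop above can be tried in isolation, without a running Solr instance. The sketch below is a simplified model rather than the QParser code itself: `Tok` is a hypothetical stand-in for a token, and the position increments are assumed (each original word at increment 1, its synonyms at increment 0).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CrossPositionDedupSketch {

    /** Hypothetical stand-in for a token: its text and position increment. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Assumed increments: 1 for each original word, 0 for its synonyms.
    static final List<Tok> EXPANDED = Arrays.asList(
            new Tok("摩托罗拉", 1), new Tok("摩托", 0), new Tok("motorola", 0), new Tok("moto", 0),
            new Tok("motorola", 1), new Tok("摩托罗拉", 0), new Tok("摩托", 0), new Tok("moto", 0),
            new Tok("里程碑", 1), new Tok("2代", 1));

    // Keeps a token only if it opens a new position and has never been seen,
    // at any earlier position, either as a word or as a synonym.
    static Set<String> dedupAcrossPositions(List<Tok> input) {
        Set<String> seen = new HashSet<String>();
        Set<String> kept = new LinkedHashSet<String>();
        for (Tok t : input) {
            String term = t.term.toLowerCase(Locale.ROOT);
            if (t.posInc > 0) {
                if (seen.add(term)) {
                    kept.add(term);
                }
            } else {
                seen.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder();
        for (String tok : dedupAcrossPositions(EXPANDED)) {
            buffer.append(tok).append(" ");
        }
        System.out.println(buffer.toString().trim());
    }
}
```

With the assumed increments, three tokens remain: 摩托罗拉, 里程碑, and 2代. The second "motorola" is dropped because it was already recorded as a synonym of "摩托罗拉", so the rebuilt query string carries each brand once and no longer inflates the clause count.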





