Creating the Weight Object
The Weight object is created via IndexSearcher.createNormalizedWeight(Query query):
 1 public Weight createNormalizedWeight(Query query) throws IOException {
 2   query = rewrite(query);
 3   Weight weight = query.createWeight(this);
 4   float v = weight.getValueForNormalization();
 5   float norm = getSimilarity().queryNorm(v);
 6   if (Float.isInfinite(norm) || Float.isNaN(norm)) {
 7     norm = 1.0f;
 8   }
 9   weight.normalize(norm, 1.0f);
10   return weight;
11 }
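For context, here is a minimal, hedged sketch of how this entry point is reached from user code, assuming Lucene 4.x (where createNormalizedWeight takes only a Query); the field name, term text and version constant are illustrative, not from the original article:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CreateWeightDemo {
  public static void main(String[] args) throws Exception {
    // Index a single document in memory so the searcher has something to score.
    Directory dir = new RAMDirectory();
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
    IndexWriter writer = new IndexWriter(dir, cfg);
    Document doc = new Document();
    doc.add(new TextField("content", "lucene weight demo", Field.Store.NO));
    writer.addDocument(doc);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    Query query = new TermQuery(new Term("content", "lucene"));

    // The entry point analyzed in this article: rewrite -> createWeight -> normalize.
    Weight weight = searcher.createNormalizedWeight(query);
    System.out.println(weight.getQuery());
  }
}

IndexSearcher.search(...) goes through the same path internally; calling createNormalizedWeight directly just exposes the step this article examines.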
1. Taking TermQuery as an example, let's trace how the Weight object is created.
TermQuery.createWeight(IndexSearcher searcher)
public Weight createWeight(IndexSearcher searcher) throws IOException {
  final IndexReaderContext context = searcher.getTopReaderContext();
  final TermContext termState;
  if (perReaderTermState == null || perReaderTermState.topReaderContext != context) {
    // make TermQuery single-pass if we don't have a PRTS or if the context differs!
    termState = TermContext.build(context, term);
  } else {
    // PRTS was pre-build for this IS
    termState = this.perReaderTermState;
  }

  // we must not ignore the given docFreq - if set use the given value (lie)
  if (docFreq != -1)
    termState.setDocFreq(docFreq);

  return new TermWeight(searcher, termState);
}
Here termState holds the term's statistics (a TermContext built across the reader).
The TermWeight constructor:
 1 public TermWeight(IndexSearcher searcher, TermContext termStates)
 2     throws IOException {
 3   assert termStates != null : "TermContext must not be null";
 4   this.termStates = termStates;
 5   this.similarity = searcher.getSimilarity();
 6   this.stats = similarity.computeWeight(
 7       getBoost(),
 8       searcher.collectionStatistics(term.field()),
 9       searcher.termStatistics(term, termStates));
10 }
Note lines 8 and 9 of the TermWeight constructor: line 8 passes CollectionStatistics and line 9 passes TermStatistics. Both are statistics related to the term, but CollectionStatistics describes all the terms of the field that the query term belongs to, while TermStatistics describes the query term itself.
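To make the distinction concrete, here is a hedged helper sketch (accessor names as in Lucene 4.x; the class and method names are made up for illustration) that prints both kinds of statistics for a term:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermStatistics;

public class StatsDemo {
  // Field-level statistics (all terms of the field) vs. statistics of this one term.
  static void printTermStats(IndexSearcher searcher, Term term) throws IOException {
    CollectionStatistics cs = searcher.collectionStatistics(term.field());
    System.out.println("maxDoc=" + cs.maxDoc()             // total docs in the index
        + " docCount=" + cs.docCount()                     // docs that have this field
        + " sumTotalTermFreq=" + cs.sumTotalTermFreq());    // total tokens in this field

    TermContext tc = TermContext.build(searcher.getTopReaderContext(), term);
    TermStatistics ts = searcher.termStatistics(term, tc);
    System.out.println("docFreq=" + ts.docFreq()            // docs containing this term
        + " totalTermFreq=" + ts.totalTermFreq());           // occurrences of this term
  }
}

It could be called with the searcher and term from the earlier sketch; the first printout corresponds to what computeWeight receives as collectionStats, the second to termStats.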
similarity is the IndexSearcher's default DefaultSimilarity; its computeWeight implementation comes from TFIDFSimilarity:
public final SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats) {
  final Explanation idf = termStats.length == 1
      ? idfExplain(collectionStats, termStats[0])
      : idfExplain(collectionStats, termStats);
  return new IDFStats(collectionStats.field(), idf, queryBoost);
}
Note the idf computed here:
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
  final long df = termStats.docFreq();
  final long max = collectionStats.maxDoc();
  final float idf = idf(df, max);
  return new Explanation(idf, "idf(docFreq=" + df + ", maxDocs=" + max + ")");
}
This idf value is then passed as a parameter to the IDFStats constructor:
public IDFStats(String field, Explanation idf, float queryBoost) {
  // TODO: Validate?
  this.field = field;
  this.idf = idf;
  this.queryBoost = queryBoost;
  this.queryWeight = idf.getValue() * queryBoost; // compute query weight
}
queryWeight is the product of idf and queryBoost. queryBoost is supplied as a parameter at query time; idf is a score factor based on the term's document frequency, which can be roughly thought of as total number of documents / docFreq. The larger the docFreq, the smaller the term's influence on the score; conversely, the larger the idf, the greater its influence on the score.
Below is the idf implementation in DefaultSimilarity:
public float idf(long docFreq, long numDocs) {
  return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
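A quick worked example with made-up numbers, plugging them into the formula above:

public class IdfExample {
  public static void main(String[] args) {
    long docFreq = 9;     // assumed: the term appears in 9 documents
    long numDocs = 1000;  // assumed: the index contains 1000 documents
    // Same formula as DefaultSimilarity.idf: ln(numDocs / (docFreq + 1)) + 1
    double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    System.out.println(idf);  // ln(100) + 1 ≈ 5.605
  }
}

A rarer term (smaller docFreq) yields a larger idf and thus contributes more to the score.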
2. Line 4 of createNormalizedWeight
public float getValueForNormalization() {
  return stats.getValueForNormalization();
}
stats was built in the TermWeight constructor.
DefaultSimilarity extends TFIDFSimilarity, and getValueForNormalization is implemented in TFIDFSimilarity (in its IDFStats class):
public float getValueForNormalization() {
  // TODO: (sorta LUCENE-1907) make non-static class and expose this squaring via a nice method to subclasses?
  return queryWeight * queryWeight;  // sum of squared weights
}
So the value of v at line 4 of createNormalizedWeight is queryWeight * queryWeight.
3. Line 5 of createNormalizedWeight. For a single TermQuery the sum of squared weights is just queryWeight², so queryNorm works out to 1 / queryWeight:
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}
4. Line 9 of createNormalizedWeight
public void normalize(float queryNorm, float topLevelBoost) {
  stats.normalize(queryNorm, topLevelBoost);
}
public void normalize(float queryNorm, float topLevelBoost) {
  this.queryNorm = queryNorm * topLevelBoost;
  queryWeight *= this.queryNorm;          // normalize query weight
  value = queryWeight * idf.getValue();   // idf for document
}
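Putting the steps together, here is a hedged numeric trace for a single TermQuery, reusing the assumed idf from the earlier example and leaving both boosts at their default of 1:

public class NormalizeTrace {
  public static void main(String[] args) {
    // Assumed example values, not from the original article.
    float idf = 5.605f;
    float queryBoost = 1.0f, topLevelBoost = 1.0f;

    float queryWeight = idf * queryBoost;         // IDFStats constructor:          5.605
    float v = queryWeight * queryWeight;          // getValueForNormalization():   31.416
    float norm = (float) (1.0 / Math.sqrt(v));    // queryNorm(v) = 1/queryWeight:  0.178
    float queryNorm = norm * topLevelBoost;       // IDFStats.normalize()
    queryWeight *= queryNorm;                     // back to 1.0 for a one-term query
    float value = queryWeight * idf;              // value == idf == 5.605

    System.out.println("queryNorm=" + queryNorm + " queryWeight=" + queryWeight + " value=" + value);
  }
}

For a one-clause query the normalization cancels out: the normalized queryWeight comes back to 1 and value ends up equal to the idf. With more clauses the sum of squared weights grows, and each clause's normalized weight shrinks accordingly.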