Apache Lucene Scoring: Principles and Code Analysis

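The IndexSearcher class has a method that reports how Lucene scored a document:

    public Explanation explain(Weight weight, int doc) throws IOException {
        return weight.explain(reader, doc);
    }

The returned Explanation instance describes how the score of a Document was computed. A small test makes it easy to see what information an Explanation actually records about a Document.

The test class is as follows: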

    package org.shirdrn.lucene.learn;

    import java.io.IOException;
    import java.util.Date;

    import net.teamhot.lucene.ThesaurusAnalyzer;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.LockObtainFailedException;

    // Written against the Lucene 2.x API (Hits, TermDocs, Field.Index.TOKENIZED).
    public class AboutLuceneScore {

        private String path = "E:\\Lucene\\index";

        public void createIndex() {
            IndexWriter writer;
            try {
                // true: create a new index, replacing any existing one
                writer = new IndexWriter(path, new ThesaurusAnalyzer(), true);

                // Five documents, each with a single stored, tokenized "contents" field
                Field fieldA = new Field("contents", "一人", Field.Store.YES, Field.Index.TOKENIZED);
                Document docA = new Document();
                docA.add(fieldA);

                Field fieldB = new Field("contents", "一人 之交 一人之交", Field.Store.YES, Field.Index.TOKENIZED);
                Document docB = new Document();
                docB.add(fieldB);

                Field fieldC = new Field("contents", "一人 之下 一人之下", Field.Store.YES, Field.Index.TOKENIZED);
                Document docC = new Document();
                docC.add(fieldC);

                Field fieldD = new Field("contents", "一人 做事 一人当 一人做事一人当", Field.Store.YES, Field.Index.TOKENIZED);
                Document docD = new Document();
                docD.add(fieldD);

                Field fieldE = new Field("contents", "一人 做事 一人當 一人做事一人當", Field.Store.YES, Field.Index.TOKENIZED);
                Document docE = new Document();
                docE.add(fieldE);

                writer.addDocument(docA);
                writer.addDocument(docB);
                writer.addDocument(docC);
                writer.addDocument(docD);
                writer.addDocument(docE);

                writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (LockObtainFailedException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            AboutLuceneScore aus = new AboutLuceneScore();
            aus.createIndex();    // build the index
            try {
                String keyword = "一人";
                Term term = new Term("contents", keyword);
                Query query = new TermQuery(term);
                IndexSearcher searcher = new IndexSearcher(aus.path);
                Date startTime = new Date();
                Hits hits = searcher.search(query);

                // Part 1: how often the term occurs in each matching document
                TermDocs termDocs = searcher.getIndexReader().termDocs(term);
                while (termDocs.next()) {
                    System.out.print("Document " + termDocs.doc() + " contains keyword <" + keyword + ">");
                    System.out.println(" " + termDocs.freq() + " time(s)");
                }
                System.out.println("********************************************************************");

                // Part 2: score and Explanation of every hit
                for (int i = 0; i < hits.length(); i++) {
                    System.out.println("Internal Document id: " + hits.id(i));
                    System.out.println("Document content: " + hits.doc(i));
                    System.out.println("Document score: " + hits.score(i));
                    Explanation e = searcher.explain(query, hits.id(i));
                    System.out.println("Explanation:\n" + e);
                    System.out.println("Selected values from the Explanation:");
                    System.out.println("Explanation getValue(): " + e.getValue());
                    System.out.println("Explanation getDescription(): " + e.getDescription());
                    System.out.println("********************************************************************");
                }
                System.out.println("Found " + hits.length() + " matching Document(s).");
                Date finishTime = new Date();
                long timeOfSearch = finishTime.getTime() - startTime.getTime();
                System.out.println("Search took " + timeOfSearch + " ms");
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

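The test class provides a createIndex() method that builds the index, then searches for the keyword "一人" and prints information about the Documents that match it.

The first part of the output shows how many times the keyword "一人" occurs in each Document.

The second part shows the Explanation and score of each matching Document.

The test output is as follows: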

Document 0 contains keyword <一人> 1 time(s)
Document 1 contains keyword <一人> 1 time(s)
Document 2 contains keyword <一人> 1 time(s)
Document 3 contains keyword <一人> 2 time(s)
Document 4 contains keyword <一人> 2 time(s)
********************************************************************
Internal Document id: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  1.0 = fieldNorm(field=contents, doc=0)

Selected values from the Explanation:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:
********************************************************************
Internal Document id: 3
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人当一人做事一人当>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 3), product of:
  1.4142135 = tf(termFreq(contents:一人)=2)
  0.81767845 = idf(docFreq=5)
  0.4375 = fieldNorm(field=contents, doc=3)

Selected values from the Explanation:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 3), product of:
********************************************************************
Internal Document id: 4
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人做事一人當一人做事一人當>>
Document score: 0.5059127
Explanation:
0.5059127 = (MATCH) fieldWeight(contents:一人 in 4), product of:
  1.4142135 = tf(termFreq(contents:一人)=2)
  0.81767845 = idf(docFreq=5)
  0.4375 = fieldNorm(field=contents, doc=4)

Selected values from the Explanation:
Explanation getValue(): 0.5059127
Explanation getDescription(): fieldWeight(contents:一人 in 4), product of:
********************************************************************
Internal Document id: 1
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之交一人之交>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 1), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  0.5 = fieldNorm(field=contents, doc=1)

Selected values from the Explanation:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 1), product of:
********************************************************************
Internal Document id: 2
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人之下一人之下>>
Document score: 0.40883923
Explanation:
0.40883923 = (MATCH) fieldWeight(contents:一人 in 2), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  0.5 = fieldNorm(field=contents, doc=2)

Selected values from the Explanation:
Explanation getValue(): 0.40883923
Explanation getDescription(): fieldWeight(contents:一人 in 2), product of:
********************************************************************
Found 5 matching Document(s).
Search took 79 ms

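Starting from the test output, we can draw the following conclusions:

■ The value of hits.score(i) in the test equals the value of the Explanation's getValue(); this is the score Lucene uses by default;

■ By default, Lucene sorts search results by Document score, highest first;

■ By default, Documents with the same score are ordered by their internal Document id. Above, Documents 3 and 4 form one group with equal scores and Documents 1 and 2 form another, and within each group the Document with the lower id comes first;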
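The ordering rule in the last two points can be sketched in plain Java, without Lucene. The Hit class and the rank and sampleHits methods below are hypothetical names introduced for illustration, and the scores are copied from the test output; this is a sketch of the rule as observed, not Lucene's own sorting code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TieBreakSketch {
    // Hypothetical holder for one search hit; not a Lucene class.
    public static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Orders hits by score (highest first), then by document id (lowest first),
    // and returns the document ids in that order.
    public static List<Integer> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        List<Integer> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.doc);
        }
        return ids;
    }

    // Scores taken from the test output, listed in arbitrary order.
    public static List<Hit> sampleHits() {
        return Arrays.asList(
            new Hit(2, 0.40883923f),
            new Hit(4, 0.5059127f),
            new Hit(0, 0.81767845f),
            new Hit(1, 0.40883923f),
            new Hit(3, 0.5059127f));
    }

    public static void main(String[] args) {
        System.out.println(rank(sampleHits()));
    }
}
```

With the five scores from the test, the resulting order is 0, 3, 4, 1, 2, the same order in which the Documents appear in the output.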

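Look again at the explain method of IndexSearcher:

    public Explanation explain(Weight weight, int doc) throws IOException {
        return weight.explain(reader, doc);
    }

It simply calls the explain() method of the Weight interface. (The test calls the explain(Query, int) overload, which creates the Weight for the Query and then reaches this method.) A Weight is tied to a Query: it holds the state computed for one particular search with that Query, so the Query instance itself can be reused.

Concretely, explain() can be traced to TermWeight, a class that implements Weight and is defined as an inner class of TermQuery. The explain() method of TermWeight is as follows: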

    public Explanation explain(IndexReader reader, int doc)
          throws IOException {

          ComplexExplanation result = new ComplexExplanation();
          result.setDescription("weight("+getQuery()+" in "+doc+"), product of:");

          Explanation idfExpl = new Explanation(idf, "idf(docFreq=" + reader.docFreq(term) + ")");

          // explain query weight
          Explanation queryExpl = new Explanation();
          queryExpl.setDescription("queryWeight(" + getQuery() + "), product of:");

          Explanation boostExpl = new Explanation(getBoost(), "boost");
          if (getBoost() != 1.0f)
            queryExpl.addDetail(boostExpl);
          queryExpl.addDetail(idfExpl);

          Explanation queryNormExpl = new Explanation(queryNorm,"queryNorm");
          queryExpl.addDetail(queryNormExpl);

          queryExpl.setValue(boostExpl.getValue() * idfExpl.getValue() * queryNormExpl.getValue());

          result.addDetail(queryExpl);

          // explain field weight
          String field = term.field();
          ComplexExplanation fieldExpl = new ComplexExplanation();
          fieldExpl.setDescription("fieldWeight("+term+" in "+doc+"), product of:");

          Explanation tfExpl = scorer(reader).explain(doc);
          fieldExpl.addDetail(tfExpl);
          fieldExpl.addDetail(idfExpl);

          Explanation fieldNormExpl = new Explanation();
          byte[] fieldNorms = reader.norms(field);
          float fieldNorm =
            fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;
          fieldNormExpl.setValue(fieldNorm);
          fieldNormExpl.setDescription("fieldNorm(field="+field+", doc="+doc+")");
          fieldExpl.addDetail(fieldNormExpl);

          fieldExpl.setMatch(Boolean.valueOf(tfExpl.isMatch()));
          fieldExpl.setValue(tfExpl.getValue() * idfExpl.getValue() * fieldNormExpl.getValue());

          result.addDetail(fieldExpl);
          result.setMatch(fieldExpl.getMatch());

          // combine them
          result.setValue(queryExpl.getValue() * fieldExpl.getValue());

          if (queryExpl.getValue() == 1.0f)
            return fieldExpl;

          return result;
        }

Comparing the search output with the explain() method of TermWeight above, the strings in the output correspond one to one with the descriptions set in the code, for example idf (inverse document frequency), fieldNorm and fieldWeight.

The information for the first Document in the results is:

Internal Document id: 0
Document content: Document<stored/uncompressed,indexed,tokenized<contents:一人>>
Document score: 0.81767845
Explanation:
0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:
  1.0 = tf(termFreq(contents:一人)=1)
  0.81767845 = idf(docFreq=5)
  1.0 = fieldNorm(field=contents, doc=0)

Selected values from the Explanation:
Explanation getValue(): 0.81767845
Explanation getDescription(): fieldWeight(contents:一人 in 0), product of:
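The nested output above reflects a small tree: each Explanation carries a value, a description, and a list of detail Explanations. A minimal stand-alone sketch of that shape follows. The Node class and its add and render methods are illustrative stand-ins introduced here, not Lucene's Explanation API, and the shortened descriptions are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ExplanationSketch {
    // Simplified stand-in for an explanation node: value, description, nested details.
    public static final class Node {
        final float value;
        final String description;
        final List<Node> details = new ArrayList<>();

        Node(float value, String description) {
            this.value = value;
            this.description = description;
        }

        Node add(Node detail) {
            details.add(detail);
            return this;
        }

        // Renders this node and its details, indenting two spaces per nesting level.
        public String render(int depth) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(value).append(" = ").append(description).append('\n');
            for (Node d : details) {
                sb.append(d.render(depth + 1));
            }
            return sb.toString();
        }
    }

    // Builds the fieldWeight node as the product of its three details.
    public static Node fieldWeight(float tf, float idf, float fieldNorm) {
        return new Node(tf * idf * fieldNorm, "fieldWeight, product of:")
            .add(new Node(tf, "tf"))
            .add(new Node(idf, "idf"))
            .add(new Node(fieldNorm, "fieldNorm"));
    }

    public static void main(String[] args) {
        // Values for Document 0, taken from the test output.
        System.out.print(fieldWeight(1.0f, 0.81767845f, 1.0f).render(0));
    }
}
```

The parent value is computed from the children rather than stored independently, which is why the top line of each Explanation in the output equals the product of the three lines beneath it.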

Computing tf

The tf value above is the term frequency, which is documented in the org.apache.lucene.search.Similarity class. Lucene does not use the raw frequency of the term directly; it uses the square root of the frequency:

tf(t in d) = frequency^½

This is computed by the following method of DefaultSimilarity, a subclass of org.apache.lucene.search.Similarity:

    /** Implemented as <code>sqrt(freq)</code>. */
    public float tf(float freq) {
        return (float)Math.sqrt(freq);
    }

That is, the tf of a Document is the square root of freq, the number of times the query term occurs in that Document.

For example, our test output reports:

Document 0 contains keyword <一人> 1 time(s)
Document 1 contains keyword <一人> 1 time(s)
Document 2 contains keyword <一人> 1 time(s)
Document 3 contains keyword <一人> 2 time(s)
Document 4 contains keyword <一人> 2 time(s)

The tf of each Document is therefore:

Document 0: tf = (float)Math.sqrt(1) = 1.0;
Document 1: tf = (float)Math.sqrt(1) = 1.0;
Document 2: tf = (float)Math.sqrt(1) = 1.0;
Document 3: tf = (float)Math.sqrt(2) = 1.4142135;
Document 4: tf = (float)Math.sqrt(2) = 1.4142135;
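The tf values can be reproduced with a few lines of plain Java. This is a sketch that copies the formula quoted from DefaultSimilarity into a stand-alone class (TfSketch is a name made up here); the frequencies come from the test output, and the extra call with a frequency of 100 is an invented input that shows how the square root dampens repeated occurrences.

```java
public class TfSketch {
    // Same formula as the tf method quoted above: the square root of the raw frequency.
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        int[] freqs = {1, 1, 1, 2, 2}; // term frequencies of Documents 0..4 in the test
        for (int doc = 0; doc < freqs.length; doc++) {
            System.out.println("Document " + doc + ": tf = " + tf(freqs[doc]));
        }
        // Hypothetical input: 100 occurrences only count ten times as much as one.
        System.out.println("freq 100: tf = " + tf(100));
    }
}
```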

Computing idf

Every Document in the results carries an idf. It depends only on the term and on the index, not on the individual Document, so it is the same for all five. The implementation can be seen in DefaultSimilarity:

    /** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
    public float idf(int docFreq, int numDocs) {
        return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
    }

Here docFreq is the number of Documents that contain the query term, which is 5 in our test, and numDocs is the total number of Documents in the index. Our test is a special case in which every Document matches, so numDocs = 5 as well.

The idf of each Document is:

Document 0: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 1: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 2: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 3: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
Document 4: idf = (float)(Math.log(5/(double)(5+1)) + 1.0) = 0.81767845;
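As a quick check, the idf formula can be run on its own. The sketch below copies the quoted formula into a stand-alone class (IdfSketch is a made-up name). The first call uses the values from the test; the second uses a hypothetical docFreq of 1 to show that a rarer term receives a larger idf.

```java
public class IdfSketch {
    // Same formula as the idf method quoted above: log(numDocs / (docFreq + 1)) + 1.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Values from the test: the term occurs in all 5 of 5 Documents.
        System.out.println("idf(docFreq=5, numDocs=5) = " + idf(5, 5));
        // Hypothetical rarer term: occurs in only 1 of 5 Documents.
        System.out.println("idf(docFreq=1, numDocs=5) = " + idf(1, 5));
    }
}
```

Because of the +1 in the denominator, a term that occurs in every Document gets an idf slightly below 1.0 rather than exactly 1.0, which is what the test output shows.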

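Computing lengthNorm

The implementation of lengthNorm can be seen in DefaultSimilarity:

    public float lengthNorm(String fieldName, int numTerms) {
        return (float)(1.0 / Math.sqrt(numTerms));
    }

Here numTerms is the total number of terms in the field, not the number of times the query term occurs in it. The exact counts depend on how ThesaurusAnalyzer tokenizes each field, which the test does not print, so only the first value below is certain. The others are inferred from the fieldNorm values in the output, keeping in mind that the stored norm has only one byte of precision and is rounded down:

Document 0 (a single term): lengthNorm = (float)(1.0 / Math.sqrt(1)) = 1.0;
Documents 1 and 2: fieldNorm = 0.5, consistent with 3 or 4 terms, since 1/sqrt(4) = 0.5 and 1/sqrt(3) ≈ 0.5773503 is stored as 0.5;
Documents 3 and 4: fieldNorm = 0.4375, consistent with 5 terms, since 1/sqrt(5) ≈ 0.4472136 is stored as 0.4375;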
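To see how lengthNorm falls off as a field gets longer, the formula can be evaluated for a few field lengths. This sketch copies the quoted formula into a stand-alone class (LengthNormSketch is a made-up name, and the unused fieldName parameter is dropped); the term counts 1 to 5 are illustrative inputs, not values printed by the test.

```java
public class LengthNormSketch {
    // Same formula as the lengthNorm method quoted above; numTerms is the
    // total number of terms in the field.
    public static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int numTerms = 1; numTerms <= 5; numTerms++) {
            System.out.println("numTerms = " + numTerms + ": lengthNorm = " + lengthNorm(numTerms));
        }
    }
}
```

Shorter fields get a larger lengthNorm, so a match in a short field contributes more to the score than the same match in a long field.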

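About fieldNorm

fieldNorm is written to the index at indexing time. It combines lengthNorm with any Document and Field boosts and is stored as a single byte per Document and field. At search time it is read back from the index and decoded into a float, which is then used to compute the Document's score.

In the org.apache.lucene.search.TermQuery.TermWeight class, the explain method reads the norms through the open IndexReader. reader.norms(field) returns a byte[] with one encoded byte per Document, and the byte for the current Document has to be decoded:

    byte[] fieldNorms = reader.norms(field);
    float fieldNorm = fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;

The decodeNorm method of the Similarity class converts a single byte into a float by table lookup:

    public static float decodeNorm(byte b) {
        return NORM_TABLE[b & 0xFF]; // & 0xFF maps negative bytes to positive above 127
    }

The float value of fieldNorm is now available and can take part in the arithmetic that produces the Document's score.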
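Storing the norm in one byte loses precision, which would explain why the output shows a fieldNorm of 0.4375 rather than a value of the form 1/sqrt(numTerms). The sketch below illustrates one such lossy encoding. It is written from memory of how Lucene's one-byte norm format is usually described (keep the float's exponent and its top two mantissa bits, truncating the rest), so the bit layout, the constant and the method names encode, decode and roundTrip should all be read as assumptions of this sketch rather than as a copy of Lucene's source.

```java
public class NormCodecSketch {
    // Assumed offset: the smallest exponent the byte can represent.
    private static final int ZERO_POINT = (63 - 15) << 3;

    // Packs a non-negative float into one byte by dropping its 21 low bits.
    public static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21; // truncates, so values are rounded down
        if (small <= ZERO_POINT) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero, negative, or underflow
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // overflow: largest representable value
        }
        return (byte) (small - ZERO_POINT);
    }

    // Expands the byte back into a float; the dropped bits come back as zeros.
    public static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = ((b & 0xFF) + ZERO_POINT) << 21;
        return Float.intBitsToFloat(bits);
    }

    public static float roundTrip(float f) {
        return decode(encode(f));
    }

    public static void main(String[] args) {
        for (int numTerms = 1; numTerms <= 5; numTerms++) {
            float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
            System.out.println("numTerms = " + numTerms
                + ": lengthNorm = " + lengthNorm
                + ", after one-byte round trip = " + roundTrip(lengthNorm));
        }
    }
}
```

Under these assumptions a lengthNorm of 1/sqrt(5) comes back as 0.4375 and 1/sqrt(3) comes back as 0.5, which agrees with the fieldNorm values in the test output.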

Computing queryWeight

The computation of queryWeight can be seen in the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

    public float sumOfSquaredWeights() {
        queryWeight = idf * getBoost();             // compute query weight
        return queryWeight * queryWeight;           // square it
    }

By default queryWeight = idf, because the default boost in Lucene is 1.0.

In our test, queryWeight is the same for every Document:

queryWeight = 0.81767845 * 1.0 = 0.81767845;

Note that the method returns the square of queryWeight, 0.81767845 * 0.81767845 = 0.6685980475944025, not queryWeight itself.

Computing queryNorm

queryNorm is implemented in DefaultSimilarity as follows:

    /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
    public float queryNorm(float sumOfSquaredWeights) {
        return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }

Here sumOfSquaredWeights is computed by the sumOfSquaredWeights method of the org.apache.lucene.search.TermQuery.TermWeight class:

    public float sumOfSquaredWeights() {
        queryWeight = idf * getBoost();             // compute query weight
        return queryWeight * queryWeight;           // square it
    }

By default sumOfSquaredWeights = idf * idf, because the default boost in Lucene is 1.0.

For the test above, sumOfSquaredWeights is:

sumOfSquaredWeights = 0.81767845 * 0.81767845 = 0.6685980475944025;

From this, queryNorm can be computed:

queryNorm = (float)(1.0 / Math.sqrt(0.6685980475944025)) ≈ 1.2229746;

Computing value

The org.apache.lucene.search.TermQuery.TermWeight class also defines a value member:

    private float value;

value is computed in the normalize method of the same class:

    public void normalize(float queryNorm) {
        this.queryNorm = queryNorm;
        queryWeight *= queryNorm;                   // normalize query weight
        value = queryWeight * idf;                  // idf for document
    }

Written in terms of the queryWeight from before normalization (idf * boost), normalize computes:

value = queryNorm * queryWeight * idf;

For the test above:

value = 1.2229746 * 0.81767845 * 0.81767845 ≈ 0.81767845;

The normalized queryWeight is queryNorm * queryWeight = 1.2229746 * 0.81767845 ≈ 1.0, so for a single-term query with boost 1.0 the query-side factor drops out and value equals idf.
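The three steps (sumOfSquaredWeights, queryNorm, normalize) can be chained in a small stand-alone sketch to check the arithmetic. The class QueryNormSketch and its normalize method are hypothetical names; the method only mirrors the order of operations in the quoted code and returns the intermediate values in an array for inspection.

```java
public class QueryNormSketch {
    // Returns { queryNorm, normalized queryWeight, value } for a single-term query.
    public static float[] normalize(float idf, float boost) {
        // Step 1, as in sumOfSquaredWeights(): raw query weight and its square.
        float queryWeight = idf * boost;
        float sumOfSquaredWeights = queryWeight * queryWeight;

        // Step 2, as in queryNorm(): 1 / sqrt(sumOfSquaredWeights).
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));

        // Step 3, as in normalize(): scale the query weight, then multiply by idf.
        queryWeight *= queryNorm;
        float value = queryWeight * idf;

        return new float[] { queryNorm, queryWeight, value };
    }

    public static void main(String[] args) {
        float[] r = normalize(0.81767845f, 1.0f); // idf from the test, default boost
        System.out.println("queryNorm              = " + r[0]);
        System.out.println("normalized queryWeight = " + r[1]);
        System.out.println("value                  = " + r[2]);
    }
}
```

For a query with a single term, queryNorm is by construction the reciprocal of the raw queryWeight, so the normalized queryWeight comes out at (approximately) 1.0 whatever the idf is. The normalization only starts to matter when a query combines several weighted clauses.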

About fieldWeight

The search output contains the line:

0.81767845 = (MATCH) fieldWeight(contents:一人 in 0), product of:

The string "(MATCH) " is produced by the getSummary method of the ComplexExplanation class:

    protected String getSummary() {
        if (null == getMatch())
          return super.getSummary();

        return getValue() + " = "
          + (isMatch() ? "(MATCH) " : "(NON-MATCH) ")
          + getDescription();
    }

The value of this fieldWeight is in fact equal to the Document's score. First, look at how fieldWeight is computed in the explain method of the org.apache.lucene.search.TermQuery.TermWeight class:

    ComplexExplanation fieldExpl = new ComplexExplanation();
    fieldExpl.setDescription("fieldWeight("+term+" in "+doc+
                             "), product of:");

    Explanation tfExpl = scorer(reader).explain(doc);
    fieldExpl.addDetail(tfExpl);
    fieldExpl.addDetail(idfExpl);

    Explanation fieldNormExpl = new Explanation();
    byte[] fieldNorms = reader.norms(field);
    float fieldNorm =
      fieldNorms!=null ? Similarity.decodeNorm(fieldNorms[doc]) : 0.0f;
    fieldNormExpl.setValue(fieldNorm);
    fieldNormExpl.setDescription("fieldNorm(field="+field+", doc="+doc+")");
    fieldExpl.addDetail(fieldNormExpl);

    fieldExpl.setMatch(Boolean.valueOf(tfExpl.isMatch()));
    fieldExpl.setValue(tfExpl.getValue() *
                       idfExpl.getValue() *
                       fieldNormExpl.getValue());

    result.addDetail(fieldExpl);
    result.setMatch(fieldExpl.getMatch());

    // combine them
    result.setValue(queryExpl.getValue() * fieldExpl.getValue());

    if (queryExpl.getValue() == 1.0f)
      return fieldExpl;

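Above, several properties are set on the ComplexExplanation fieldExpl, and this is where the computation of fieldWeight can be read off.

The key computation is:

    fieldExpl.setValue(tfExpl.getValue() *
                       idfExpl.getValue() *
                       fieldNormExpl.getValue());

As a formula:

fieldWeight = tf * idf * fieldNorm

fieldNorm is written to the index at indexing time, so its values are simply taken from the test output above in order to verify the Document scores.

Using the values computed earlier, the fieldWeight of each retrieved Document is:

Document 0: fieldWeight = 1.0 * 0.81767845 * 1.0 = 0.81767845;
Document 1: fieldWeight = 1.0 * 0.81767845 * 0.5 = 0.408839225;
Document 2: fieldWeight = 1.0 * 0.81767845 * 0.5 = 0.408839225;
Document 3: fieldWeight = 1.4142135 * 0.81767845 * 0.4375 = 0.5059127074089703125;
Document 4: fieldWeight = 1.4142135 * 0.81767845 * 0.4375 = 0.5059127074089703125;

Comparing these with the output, each Document's score is exactly its fieldWeight; the values agree once rounded to float precision (the products above are left unrounded). The query-side factor boost * idf * queryNorm is 1.0 here, which is why explain() returned fieldExpl directly instead of the combined result.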
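The verification above can also be done in float arithmetic, the precision Lucene itself uses. The sketch below recomputes fieldWeight = tf * idf * fieldNorm for the five Documents; FieldWeightSketch is a made-up class name, and the frequencies, idf and fieldNorm values are copied from the test output.

```java
public class FieldWeightSketch {
    // fieldWeight = tf * idf * fieldNorm, with tf = sqrt(freq) as in DefaultSimilarity.
    public static float fieldWeight(int freq, float idf, float fieldNorm) {
        float tf = (float) Math.sqrt(freq);
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 0.81767845f;                                  // idf(docFreq=5)
        int[] freqs = {1, 1, 1, 2, 2};                            // termFreq per Document
        float[] fieldNorms = {1.0f, 0.5f, 0.5f, 0.4375f, 0.4375f}; // fieldNorm per Document

        for (int doc = 0; doc < freqs.length; doc++) {
            System.out.println("Document " + doc + ": fieldWeight = "
                + fieldWeight(freqs[doc], idf, fieldNorms[doc]));
        }
    }
}
```

The printed values should agree with the Document scores in the test output to float precision.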

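Summary

The scores above were computed with Lucene's default settings, for example a default boost of 1.0. A boost expresses the importance of a Document, and it enters the score through fieldWeight.

A boost can be set on a Document as a whole and also on an individual Field within it, because the weight of a Document is expressed through its Fields; this is why the fieldWeight computed above equals the Document's score.

Raising a Document's boost moves it up in the default ordering of the results, marking it as more important. In other words, changing the boost changes the order of the search results.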
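The effect of a boost on the ordering can be illustrated with a small sketch. It assumes that the boost is multiplied into the norm at indexing time (fieldNorm = lengthNorm * boost) and ignores the one-byte encoding of the norm. The class BoostSketch, its score method and the boost value of 3.0 are all made up for this illustration.

```java
public class BoostSketch {
    // Assumption of this sketch: score = tf * idf * (lengthNorm * boost),
    // with the precision loss of the stored norm ignored.
    public static float score(int freq, float idf, float lengthNorm, float boost) {
        float tf = (float) Math.sqrt(freq);
        return tf * idf * (lengthNorm * boost);
    }

    public static void main(String[] args) {
        float idf = 0.81767845f; // idf from the test

        // Document 0 and Document 1 with the default boost of 1.0.
        float doc0 = score(1, idf, 1.0f, 1.0f);
        float doc1 = score(1, idf, 0.5f, 1.0f);
        System.out.println("default boost:   doc0 = " + doc0 + ", doc1 = " + doc1);

        // Hypothetical: Document 1 indexed with a boost of 3.0 now outranks Document 0.
        float doc1Boosted = score(1, idf, 0.5f, 3.0f);
        System.out.println("doc1 boosted x3: doc0 = " + doc0 + ", doc1 = " + doc1Boosted);
    }
}
```

Since the norm is written when a Document is indexed, a change of boost in this model only takes effect after the Document has been indexed again.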

Reposted from: http://www.blogjava.net/ashutc/archive/2011/04/15/348339.html

posted @ 2011-11-15 10:01 爱开卷360