Introduction to the Okapi BM25 Algorithm

Adapted from the English edition of wikipedia.org; this post was originally my translation of that article into Chinese.


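  BM25 (Best Matching 25) is a ranking function used by information retrieval systems to score documents against a given query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others. BM25 was first implemented in the Okapi system, which is why it is also called Okapi BM25.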

  

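  BM25 is a bag-of-words model. A bag-of-words model looks only at how often each word occurs in a document and ignores sentence structure and grammatical relations: the document is treated as a bag of words, and the words inside the bag may be in any order. BM25 is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.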

  Given a query Q containing keywords q_1, ..., q_n, the BM25 score of a document D is:

 \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})},
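Here IDF is the inverse document frequency from the previous post, "TF-IDF"; f(q_i, D) is the term frequency (TF) from that post, i.e. the number of times q_i occurs in D; |D| is the length of document D in words; and avgdl is the average document length over the whole corpus. k_1 and b are free parameters, usually chosen, in the absence of advanced optimization, as k_1 \in [1.2,2.0] and b = 0.75.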
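
To make the formula concrete, here is a minimal Python sketch of the scoring function. It is an illustration under some assumptions, not a reference implementation: documents and queries are assumed to be pre-tokenized lists of words, the corpus is assumed to be non-empty, and IDF is assumed to be the plain log(N / n(q_i)) form, where N is the number of documents and n(q_i) is the number of documents containing q_i. The function name `bm25_scores` and the toy corpus are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # f(q_i, D): bag-of-words counts, word order ignored
        # Length normalization term: k1 * (1 - b + b * |D| / avgdl)
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for q in query:
            if tf[q] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(n_docs / df[q])  # assumption: plain log(N / n) IDF
            score += idf * tf[q] * (k1 + 1) / (tf[q] + norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, whitespace-tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
scores = bm25_scores("cat mat".split(), docs)
print(scores)
```

In this toy corpus the first document matches both query words and should rank highest, the second matches only "cat", and the third matches neither (there is no stemming, so "cats" does not count as "cat") and scores 0. Raising b penalizes long documents more strongly; setting b = 0 switches length normalization off.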
posted @ 2012-08-22 13:45  lycan785