[IR] Probabilistic Model

If the user has marked some documents as relevant and some as irrelevant, we can build a probabilistic classifier, such as a Naive Bayes model.

Can we use probabilities to quantify our uncertainties?

Ranking method:

Rank documents by their probability of relevance with respect to the information need:

P(relevant | document i, query)

Bayes' Optimal Decision Rule: x is relevant iff p(R|x) > p(NR|x)
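As a minimal sketch of the ranking and the decision rule above, assuming the probabilities p(R | d, q) have already been estimated somehow (the function names and the numbers below are hypothetical, for illustration only):

```python
def rank_by_relevance(prob_relevant):
    """Return (doc_id, p) pairs sorted by descending probability of relevance."""
    return sorted(prob_relevant.items(), key=lambda item: item[1], reverse=True)


def is_relevant(p_r):
    """Bayes optimal decision: relevant iff p(R|x) > p(NR|x) = 1 - p(R|x)."""
    return p_r > 1.0 - p_r


# Hypothetical estimates of p(R | d, q) for three documents.
probs = {"d1": 0.2, "d2": 0.9, "d3": 0.55}
ranking = rank_by_relevance(probs)
relevant_docs = [doc for doc, p in ranking if is_relevant(p)]
```

Because p(NR|x) = 1 - p(R|x), the decision rule is the same as thresholding p(R|x) at 0.5.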

C - cost of retrieving a relevant document

C′ - cost of retrieving a non-relevant document

If

C ⋅ p(R | d) + C′ ⋅ (1 − p(R | d)) ≤ C ⋅ p(R | d′) + C′ ⋅ (1 − p(R | d′))

for all d′ not yet retrieved, then d is the next document to be retrieved.
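The cost rule above can be sketched as follows. This is only an illustration: the function names, the default costs, and the probabilities are assumptions, not values from the notes.

```python
def expected_cost(p_r, cost_rel, cost_nonrel):
    """Expected cost of retrieving a document: C * p(R|d) + C' * (1 - p(R|d))."""
    return cost_rel * p_r + cost_nonrel * (1.0 - p_r)


def next_document(prob_relevant, cost_rel=0.0, cost_nonrel=1.0):
    """Pick the not-yet-retrieved document with the lowest expected cost."""
    return min(
        prob_relevant,
        key=lambda d: expected_cost(prob_relevant[d], cost_rel, cost_nonrel),
    )


# Hypothetical p(R | d) for the documents not yet retrieved.
remaining = {"d1": 0.2, "d2": 0.9, "d3": 0.55}
chosen = next_document(remaining)
```

Whenever C < C′, the expected cost decreases as p(R | d) increases, so the lowest-cost document is also the one with the highest probability of relevance.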

How do we compute all those probabilities?

Binary Independence Model (BIM)

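(The query q stays in the conditioning position throughout; we work with odds.)

The denominators cancel.

Given that the document is relevant to the query, what is the probability of seeing this particular document vector? This has to be estimated.

Think: for a given query, should a particular word appear in the document?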

Assumptions (important):

p_i= p ( x_i = 1 | R , q );

r_i = p ( x_i = 1 | NR , q );

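(Once the x_i = 0 restriction is dropped, the second product covers more factors: it now also includes the terms with x_i = 1, q_i = 1. Multiplying the first product by the reciprocal of those factors restores the balance.)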
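For reference, the standard textbook form of the Binary Independence Model derivation, which these remarks appear to follow, can be written out with the p_i and r_i defined above. The extra assumption that p_i = r_i for every term not in the query (q_i = 0) comes from the usual presentation of the model and is not stated in these notes.

```latex
\begin{align*}
O(R \mid q, \vec{x})
  &= \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})}
   = \underbrace{\frac{p(R \mid q)}{p(NR \mid q)}}_{O(R \mid q)}
     \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} \\
  &= O(R \mid q) \cdot \prod_{i} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}
     && \text{(term independence)} \\
  &= O(R \mid q) \cdot \prod_{x_i = 1} \frac{p_i}{r_i}
     \cdot \prod_{x_i = 0} \frac{1 - p_i}{1 - r_i} \\
  &= O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i}{r_i}
     \cdot \prod_{x_i = 0,\, q_i = 1} \frac{1 - p_i}{1 - r_i}
     && \text{(assuming } p_i = r_i \text{ when } q_i = 0\text{)} \\
  &= O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
     \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}
\end{align*}
```

In the last line, the first factor and the last product do not depend on the document, so only the middle product matters for ranking.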

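Two constants (for a given query):

  the odds that the query gets a relevant result back;

  a factor that depends only on how each word in the vocabulary relates to the query.

One variable (depends on the document):

  Retrieval Status Value (RSV)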
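A minimal sketch of the Retrieval Status Value, assuming the usual BIM definition: RSV is the sum of c_i over terms that occur in both the query and the document, with c_i = log [ p_i (1 − r_i) / (r_i (1 − p_i)) ]. The function names and the per-term estimates below are hypothetical.

```python
import math


def term_weight(p_i, r_i):
    """c_i: log of the odds ratio p_i (1 - r_i) / (r_i (1 - p_i))."""
    return math.log(p_i * (1.0 - r_i) / (r_i * (1.0 - p_i)))


def rsv(query_terms, doc_terms, weights):
    """Sum c_i over the terms present in both the query and the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(weights[t] for t in shared if t in weights)


# Hypothetical (p_i, r_i) estimates per term, for illustration only.
estimates = {"ibm": (0.8, 0.2), "computer": (0.5, 0.5)}
weights = {t: term_weight(p, r) for t, (p, r) in estimates.items()}

score = rsv(["ibm", "computer"], ["ibm", "laptop"], weights)
```

A term with p_i = r_i gets weight 0: it is equally likely in relevant and non-relevant documents, so it carries no evidence either way.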

So, how do we compute the c_i's from our data?

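For each term i, look at this table of document counts (N = all documents, n = documents containing term i, S = relevant documents, s = relevant documents containing term i):

             Relevant    Non-relevant     Total
  x_i = 1    s           n - s            n
  x_i = 0    S - s       N - n - S + s    N - n
  Total      S           N - S            N

(Relationship between a term and a document: a term can occur without being relevant, and can be relevant without occurring, e.g. "computer" and "IBM".)

p_i ≈ s / S, so p_i / (1 - p_i) = s / (S - s)

r_i ≈ (n - s) / (N - S), so r_i / (1 - r_i) = (n - s) / (N - n - S + s)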

Add 1/2 smoothing: add 0.5 to each count so that no estimate becomes zero or undefined.
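As a sketch of how c_i could be computed from the counts with add-1/2 smoothing, take N = all documents, n = documents containing term i, S = relevant documents, and s = relevant documents containing term i. The function name and the example counts are assumptions made for illustration.

```python
import math


def smoothed_term_weight(N, n, S, s, k=0.5):
    """c_i from document counts, adding k (default 1/2) to every cell."""
    rel_odds = (s + k) / (S - s + k)             # smoothed p_i / (1 - p_i)
    nonrel_odds = (n - s + k) / (N - n - S + s + k)  # smoothed r_i / (1 - r_i)
    return math.log(rel_odds / nonrel_odds)


# Hypothetical counts: 1000 documents, 100 contain the term,
# 10 are relevant, and 8 of the relevant ones contain the term.
w = smoothed_term_weight(N=1000, n=100, S=10, s=8)

# Without smoothing, s = 0 would mean taking log(0); with it the weight is finite.
w_unseen = smoothed_term_weight(N=1000, n=100, S=10, s=0)
```

A term concentrated in the relevant documents gets a positive weight, while a term that never appears in any relevant document gets a finite negative weight.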

Conclusion: when a new document arrives, tally the relationship between every term and that document to obtain the c_i.

Okapi BM25: a non-binary model (omitted here)

posted @ 2016-11-07 20:08 郝壹贰叁
