[IR] Probabilistic Model
If the user has marked some documents as relevant and some as irrelevant, we can build a probabilistic classifier, such as a Naive Bayes model.
Can we use probabilities to quantify our uncertainties?
Ranking method:
Rank documents by their probability of relevance w.r.t. the information need:
P(relevant | document i, query)
Bayes' Optimal Decision Rule: x is relevant iff p(R|x) > p(NR|x)
C: cost of retrieving a relevant document
C′: cost of retrieving a non-relevant document
If
C ⋅ p(R | d) + C′ ⋅ (1 − p(R | d)) ≤ C ⋅ p(R | d′) + C′ ⋅ (1 − p(R | d′))
for all d′ not yet retrieved, then d is the next document to be retrieved.
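The selection rule above can be sketched in Python. This is only an illustration: the function names `expected_cost` and `next_document` and the probabilities are made up, not taken from the notes. Since the expected cost equals C′ − (C′ − C) ⋅ p(R | d), choosing the cheapest document is the same as choosing the one with the highest p(R | d) whenever C < C′.

```python
def expected_cost(p_rel, c_rel, c_nonrel):
    """Expected cost of retrieving a document whose relevance probability is p_rel."""
    return c_rel * p_rel + c_nonrel * (1.0 - p_rel)


def next_document(p_rel_by_doc, c_rel, c_nonrel):
    """Return the id of the not-yet-retrieved document with the lowest expected cost."""
    return min(
        p_rel_by_doc,
        key=lambda d: expected_cost(p_rel_by_doc[d], c_rel, c_nonrel),
    )


# Illustrative (made-up) relevance probabilities for three unretrieved documents.
candidates = {"d1": 0.2, "d2": 0.7, "d3": 0.5}

# Assumed costs: retrieving a relevant document is free, a non-relevant one costs 1.
chosen = next_document(candidates, c_rel=0.0, c_nonrel=1.0)
```

With these assumed costs, `chosen` is the document with the largest probability of relevance, which is why the cost-based rule reduces to ranking by p(R | d).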
- How do we compute all those probabilities?
- Binary Independence Model
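(q stays on the conditioning side throughout; we work with odds, i.e. the ratio p / (1 - p).)
The denominator cancels when the ratio is taken.
Given that the document is relevant to the query, what is the probability of observing this document vector? This has to be estimated.
Think: for a given query, should a particular term appear in the document or not?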
Assumptions (important):
pi = p(xi = 1 | R, q);
ri = p(xi = 1 | NR, q);
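(After dropping the xi = 0 restriction, the second product runs over more terms: it now also includes the terms with xi = 1, qi = 1. Multiplying the first product by the reciprocal of those extra factors restores the balance.)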
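For reference, the standard textbook derivation of the Binary Independence Model that these remarks follow can be written as below. This is a sketch under the usual assumptions, not necessarily the exact form used in the original notes: x is the binary term-incidence vector of the document, O denotes odds, and pi = ri is assumed for every term that does not occur in the query (qi = 0).

```latex
\begin{align*}
O(R \mid q, \vec{x})
  &= \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})}
   = \frac{p(R \mid q)}{p(NR \mid q)} \cdot
     \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} \\
  &= O(R \mid q) \prod_{i} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}
     && \text{(term independence)} \\
  &= O(R \mid q) \prod_{x_i = 1} \frac{p_i}{r_i}
     \prod_{x_i = 0} \frac{1 - p_i}{1 - r_i} \\
  &= O(R \mid q) \prod_{x_i = q_i = 1} \frac{p_i}{r_i}
     \prod_{x_i = 0,\, q_i = 1} \frac{1 - p_i}{1 - r_i}
     && \text{(assuming } p_i = r_i \text{ when } q_i = 0\text{)} \\
  &= O(R \mid q) \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
     \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}
\end{align*}
```

The last step is the rebalancing described above: the second product is extended from the terms with xi = 0, qi = 1 to all query terms, and each extra factor (1 - pi) / (1 - ri) is compensated by its reciprocal in the first product.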
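Two constants (for a given query):
The prior odds that the query returns relevant results, O(R | q).
The product over all query terms (qi = 1) of (1 - pi) / (1 - ri), which depends only on the query and the vocabulary, not on the document.
One variable (the only part that depends on the document):
Retrieval Status Value (RSV): the sum of ci = log [ pi (1 - ri) / (ri (1 - pi)) ] over the terms with xi = qi = 1.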
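A minimal Python sketch of the Retrieval Status Value, assuming the usual BIM definition (RSV is the sum of ci = log [pi (1 - ri) / (ri (1 - pi))] over the terms present in both query and document). The names `term_weight` and `rsv` and the probabilities below are hypothetical, chosen only for illustration.

```python
import math


def term_weight(p_i, r_i):
    """c_i: log odds ratio of a term appearing in relevant vs. non-relevant documents."""
    return math.log(p_i * (1.0 - r_i) / (r_i * (1.0 - p_i)))


def rsv(query_terms, doc_terms, weights):
    """Sum c_i over the terms that occur in both the query and the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(weights[t] for t in shared if t in weights)


# Made-up (p_i, r_i) estimates for two query terms.
estimates = {"probabilistic": (0.8, 0.1), "model": (0.5, 0.5)}
weights = {t: term_weight(p, r) for t, (p, r) in estimates.items()}

score = rsv(["probabilistic", "model"], ["probabilistic", "model", "ranking"], weights)
```

A term with pi = ri (here "model") gets weight 0, so it does not affect the ranking; only terms that discriminate between relevant and non-relevant documents contribute.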
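So, how do we compute the ci's from our data?
For each term i, look at this table of document counts (N documents in total, S of them relevant; n documents contain term i, s of them relevant):
            Relevant    Non-relevant     Total
xi = 1      s           n - s            n
xi = 0      S - s       N - n - S + s    N - n
Total       S           N - S            N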
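(Relationship between a term and a document: a term can appear in a document that is not relevant, and a document can be relevant without containing the term, e.g. "computer" and "IBM".)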
pi ≈ s / S, so pi / (1 - pi) = s / (S - s)
ri ≈ (n - s) / (N - S), so ri / (1 - ri) = (n - s) / (N - n - S + s)
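Add-1⁄2 smoothing: add 1⁄2 to each of the four counts so that no count is zero:
ci = log [ ((s + 1/2) / (S - s + 1/2)) / ((n - s + 1/2) / (N - n - S + s + 1/2)) ]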
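The smoothed estimate can be sketched in Python as follows. The function name `c_i` and the parameter `k` are assumptions made for this example; the counts N, n, S, s are the ones used in the estimates above (N documents in total, n containing the term, S relevant, s relevant and containing the term).

```python
import math


def c_i(N, n, S, s, k=0.5):
    """Smoothed term weight: log of (odds in relevant docs) / (odds in non-relevant docs).

    k is added to each of the four cell counts; k = 0.5 is add-1/2 smoothing.
    """
    rel_odds = (s + k) / (S - s + k)
    nonrel_odds = (n - s + k) / (N - n - S + s + k)
    return math.log(rel_odds / nonrel_odds)


# With no relevance information (S = s = 0) the weight depends only on N and n.
weight_no_feedback = c_i(N=100, n=10, S=0, s=0)

# Zero cells (here n - s = 0 and S - s = 0) no longer cause a division by zero.
weight_zero_cells = c_i(N=10, n=5, S=5, s=5)
```

Without smoothing, the second call would divide by zero; adding 1⁄2 to every cell keeps the logarithm finite.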
Conclusion: when a new document arrives, tally the relationship between every term and that document to obtain the ci's.
- Okapi BM25: a non-binary model (omitted)