[IR] Evaluation
Evaluation of unranked retrieval results:
Precision: P = tp / (tp + fp)
Recall: R = tp / (tp + fn)
Accuracy = (tp + tn) / (tp + fp + fn + tn)
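A minimal sketch of the three measures (variable names are my own; the four counts come from the retrieved-vs-relevant contingency table):

```python
# Unranked evaluation from the four confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Example counts (made up): 30 tp, 10 fp, 20 fn, 40 tn.
print(precision(30, 10))         # 0.75
print(recall(30, 20))            # 0.6
print(accuracy(30, 10, 20, 40))  # 0.7
```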
Evaluation of ranked retrieval results:
A precision-recall curve
Tuning a search engine this way only captures its performance on a single query.
You need to average performance over a whole bunch of queries.
In essence, you plot the trend curve under the trade-off that lowering precision necessarily raises recall (this is Interpolated Precision: the interpolated precision at recall level r is the highest precision found at any recall level ≥ r, which smooths away the sawtooth of the raw curve).
What is the interpolated precision of the system at 25% recall?
1.0, 0.67, 0.5, 0.4, 0.36, 0.36, 0.36
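A sketch of how such interpolated values could be computed, assuming the ranked list is given as a boolean relevance vector and the total number of relevant documents is known (helper names are my own):

```python
# Interpolated precision at recall level r: the maximum precision at any
# recall level >= r, which removes the sawtooth of the raw P-R curve.
def precision_recall_points(rels, total_relevant):
    """rels: booleans in ranked order; returns (recall, precision) points."""
    points, hits = [], 0
    for k, rel in enumerate(rels, start=1):  # k is the 1-based rank
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / k))
    return points

def interpolated_precision(points, r):
    return max((p for rec, p in points if rec >= r), default=0.0)
```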
Mean average precision (MAP)
System ranking: D1, D2, D4, D3 (R = relevant, NR = non-relevant)
k = 1: R,  P@1 = 1/1
k = 2: NR, n/a
k = 3: NR, n/a
k = 4: R,  P@4 = 2/4
MAP = (1/1 + 2/4)/2 = 3/4
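A sketch of average precision matching the worked example above, plus the mean over queries (function names are my own):

```python
# Average precision for one query: sum of precision@k at the ranks where a
# relevant document appears, divided by the total number of relevant docs.
def average_precision(rels, total_relevant):
    hits, score = 0, 0.0
    for k, rel in enumerate(rels, start=1):  # k is the 1-based rank
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant document
    return score / total_relevant

# The worked example: D1 (R), D2 (NR), D4 (NR), D3 (R).
print(average_precision([True, False, False, True], 2))  # 0.75

# MAP: mean of average precision over a set of queries.
def mean_average_precision(per_query):
    """per_query: list of (rels, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)
```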
What is the largest possible mean average precision that this system could have?
If the last two relevant documents appear at ranks 21 and 22, i.e., as early as possible.
MAP = (1.0+1.0+0.33+0.36+0.33+0.3+0.33+0.36)/8 = 0.503
What is the smallest possible mean average precision that this system could have?
If the last two relevant documents appear at ranks 9999 and 10000, i.e., as late as possible.
MAP = (1.0+1.0+0.33+0.36+0.33+0.3+0.0007+0.0008)/8 = 0.416
How large is the error if we use the MAP observed so far to estimate the true MAP?
MAP = (1.0 + 1.0 + 0.33 + 0.36 + 0.33 + 0.3)/6 = 0.555
The true MAP lies between 0.416 and 0.503; taking the midpoint of that range as the estimate, the error could be 0.555 - (0.503 + 0.416)/2 ≈ 0.095.
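A quick sketch checking the three numbers above; the ranks 1, 2, 9, 11, 15, 20 of the six retrieved relevant documents are inferred from the listed precision values (1/1, 2/2, 3/9 ≈ 0.33, 4/11 ≈ 0.36, 5/15 ≈ 0.33, 6/20 = 0.3):

```python
# Average precision given only the ranks of the relevant documents.
def ap_at_ranks(ranks, total_relevant):
    ranks = sorted(ranks)
    return sum((i + 1) / r for i, r in enumerate(ranks)) / total_relevant

seen = [1, 2, 9, 11, 15, 20]                 # ranks inferred from the example
print(ap_at_ranks(seen + [21, 22], 8))       # ~0.503, best case
print(ap_at_ranks(seen + [9999, 10000], 8))  # ~0.416, worst case
print(ap_at_ranks(seen, 6))                  # ~0.555, observed so far
```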
Kappa Measure
P(A) = observed agreement rate between the two judges (their "accuracy" with respect to each other)
P(E) = [ (person1-yes + person2-yes)/(total*2) ]^2 + [ (person1-no + person2-no)/(total*2) ]^2
Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
Kappa > 0.8 // good agreement
0.67 < Kappa < 0.8 // “tentative conclusions” (Carletta ’96)
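A sketch of the kappa computation for two judges making yes/no relevance judgments (list names are hypothetical):

```python
# Kappa: agreement between two judges, corrected for chance agreement.
def kappa(judge1, judge2):
    """judge1, judge2: parallel lists of booleans (True = judged relevant)."""
    n = len(judge1)
    p_a = sum(a == b for a, b in zip(judge1, judge2)) / n  # observed agreement
    yes = judge1.count(True) + judge2.count(True)
    no = judge1.count(False) + judge2.count(False)
    # Chance agreement from the pooled marginals, as in the formula above.
    p_e = (yes / (2 * n)) ** 2 + (no / (2 * n)) ** 2
    return (p_a - p_e) / (1 - p_e)
```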
Relevance feedback: it has something of a reinforcement learning flavor.