If the user has marked some documents as relevant and some as irrelevant, we can build a probabilistic classifier, such as a Naive Bayes model.

Can we use probabilities to quantify our uncertainties?

Ranking method: 

Rank documents by their probability of relevance with respect to the information need.

P(relevant | document i, query)

Bayes' Optimal Decision Rule: x is relevant iff p(R|x) > p(NR|x)

C - cost of retrieving a relevant document

C′ - cost of retrieving a non-relevant document

If

C ⋅ p(R | d) + C′ ⋅ (1 − p(R | d)) ≤ C ⋅ p(R | d′) + C′ ⋅ (1 − p(R | d′))

for all d′ not yet retrieved, then d is the next document to be retrieved.
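
The retrieval rule above can be sketched in a few lines of Python. This is only an illustration: the function names, the document ids, and the probability values are hypothetical, and the costs are assumed to satisfy C < C′ (retrieving a relevant document is cheaper than retrieving a non-relevant one). Under that assumption, picking the document with minimal expected cost is the same as picking the one with the highest p(R | d).

```python
def expected_cost(p_rel, cost_rel, cost_nonrel):
    """Expected cost C * p(R|d) + C' * (1 - p(R|d)) of retrieving one document."""
    return cost_rel * p_rel + cost_nonrel * (1.0 - p_rel)


def next_document(probs, cost_rel=0.0, cost_nonrel=1.0):
    """Return the id of the not-yet-retrieved document with minimal expected cost.

    probs maps document id -> p(R | d); the default costs are illustrative only.
    """
    return min(probs, key=lambda d: expected_cost(probs[d], cost_rel, cost_nonrel))


# Hypothetical relevance probabilities for three documents.
probs = {"d1": 0.2, "d2": 0.9, "d3": 0.5}

# Full ranking: ascending expected cost, i.e. descending p(R | d) when C < C'.
ranking = sorted(probs, key=lambda d: expected_cost(probs[d], 0.0, 1.0))
```

Because the expected cost equals C′ + (C − C′) ⋅ p(R | d), it decreases as p(R | d) grows whenever C < C′, which is why the cost-based rule reduces to ranking by probability of relevance.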


  • How do we compute all those probabilities?

  • Binary Independence Model (BIM)

(q keeps its position as a conditioning variable throughout; the derivation works with odds.)
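
The formula this note refers to is not present in the text. As a sketch, the standard odds form of Bayes' rule used in textbook presentations of the Binary Independence Model is shown below; the notation (x as the binary term-incidence vector of a document, O for odds) is an assumption taken from those presentations, not from these notes.

```latex
O(R \mid q, \vec{x})
  = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})}
  = \frac{p(R \mid q)}{p(NR \mid q)}
    \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}
  = O(R \mid q) \cdot \prod_{i} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}
```

The last step uses the independence assumption that gives the model its name: terms occur independently of each other given relevance. The factor O(R | q) is the same for every document, so it does not affect the ranking.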




Assumptions (important):

pi = p(xi = 1 | R, q);

ri = p(xi = 1 | NR, q);

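(After dropping the xi = 0 restriction, the second product runs over more factors: it now also includes the terms with xi = 1, qi = 1. To keep the expression balanced, the first product is multiplied by the reciprocals of those extra factors.)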
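
The rearrangement described in this note can be written out as follows. This is a sketch of the standard textbook step, and it additionally assumes that pi = ri for every term that does not occur in the query (qi = 0), so that those terms cancel and only query terms remain.

```latex
\begin{aligned}
O(R \mid q, \vec{x})
  &= O(R \mid q)
     \cdot \prod_{x_i = 1,\, q_i = 1} \frac{p_i}{r_i}
     \cdot \prod_{x_i = 0,\, q_i = 1} \frac{1 - p_i}{1 - r_i} \\
  &= O(R \mid q)
     \cdot \prod_{x_i = 1,\, q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
     \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}
\end{aligned}
```

In the second line the last product runs over all query terms and is therefore constant for a given query. Only the middle product, over terms that appear in both the query and the document, differs from document to document.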



  These are the probabilities relating every query to each word in the vocabulary.


  Retrieval Status Value
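
As a minimal sketch, assuming the usual textbook definition RSV = Σ ci over the terms with xi = qi = 1, where ci = log [pi (1 − ri)] / [ri (1 − pi)], the score can be computed as below. The function names, the term strings, and the pi and ri values are hypothetical and chosen only for illustration.

```python
import math


def term_weight(p_i, r_i):
    """c_i = log of the odds ratio [p_i (1 - r_i)] / [r_i (1 - p_i)]."""
    return math.log((p_i * (1.0 - r_i)) / (r_i * (1.0 - p_i)))


def rsv(query_terms, doc_terms, weights):
    """Sum c_i over the terms that occur in both the query and the document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(weights[t] for t in shared if t in weights)


# Hypothetical per-term estimates of p_i and r_i.
p = {"bayes": 0.8, "model": 0.5}
r = {"bayes": 0.1, "model": 0.4}
weights = {t: term_weight(p[t], r[t]) for t in p}

query = ["bayes", "model"]
doc_a = ["naive", "bayes", "model"]   # contains both query terms
doc_b = ["language", "model"]         # contains only one query term

score_a = rsv(query, doc_a, weights)
score_b = rsv(query, doc_b, weights)
```

Documents are then ranked by descending RSV. Terms that are absent from the query or from the document contribute nothing to the sum.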

So, how do we compute the ci's from our data?

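For each term i, look at this table of document counts:

Documents    Relevant    Non-relevant      Total
xi = 1       s           n − s             n
xi = 0       S − s       N − n − S + s     N − n
Total        S           N − S             N

Estimates:

pi ≈ s / S, so pi / (1 − pi) ≈ s / (S − s)

ri ≈ (n − s) / (N − S), so ri / (1 − ri) ≈ (n − s) / (N − n − S + s)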

Add-1/2 smoothing: add 1/2 to each count so that no estimate is zero or undefined.
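
A small sketch of the smoothed estimate, assuming the usual meaning of the counts: N documents in total, n containing term i, S relevant documents, and s relevant documents containing term i. The function name and the example counts are hypothetical.

```python
import math


def smoothed_term_weight(N, n, S, s, k=0.5):
    """Estimate c_i from document counts, adding k (default 1/2) to each cell.

    c_i ~ log( [(s + k) / (S - s + k)] / [(n - s + k) / (N - n - S + s + k)] )
    """
    relevant_odds = (s + k) / (S - s + k)
    nonrelevant_odds = (n - s + k) / (N - n - S + s + k)
    return math.log(relevant_odds / nonrelevant_odds)


# Hypothetical collection: 1000 documents, 100 contain the term, 10 are relevant.
w_frequent_in_relevant = smoothed_term_weight(N=1000, n=100, S=10, s=8)

# With s = 0 the unsmoothed estimate would be log(0); smoothing keeps it finite.
w_absent_from_relevant = smoothed_term_weight(N=1000, n=100, S=10, s=0)
```

A term that is concentrated in the relevant documents gets a positive weight, while a term that never appears in them gets a negative but finite weight.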


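Conclusion: when a new document arrives, examine how every term relates to that document (whether the term occurs in it) and obtain the corresponding ci.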

  • Okapi BM25: a non-binary model (omitted)


