WAND (Weak-AND) Algorithm

Search queries are usually short, but when the query is long, say a passage of text for which we want to find similar documents, the WAND algorithm is typically needed. It has mature applications in advertising systems, notably the AdSense scenario, where we search for ads similar to a page's content.

The idea behind WAND, in short: when computing text relevance we usually retrieve candidates through an inverted index. This is already far faster than scanning every document, but it can still be slow.
The reason is that we usually only want the top n results, yet documents that are clearly poor matches still go through the expensive full relevance computation. The weak-and (WAND) algorithm estimates an upper bound on each document's relevance from the per-term contribution upper bounds, and uses a threshold to prune candidates in the inverted index, yielding a significant speedup.


WAND first estimates an upper bound on each term's contribution to the relevance score. The simplest relevance measure is TF*IDF. A term's TF within the query is usually 1 and its IDF is fixed, so the problem reduces to estimating an upper bound on the term's TF within a document. TF is typically normalized by document length (divided by the total number of terms in the document), so what we need is the maximum fraction of a document that the term can occupy. This can be computed offline.


Given each term's contribution upper bound, the upper bound on the relevance between a query and a document follows immediately: it is the sum of the upper bounds of the terms they share.

So for a query we collect the contribution upper bounds of all its terms; for each document we sum the bounds of the terms it shares with the query, and compare that sum against a preset threshold. If it exceeds the threshold, the document proceeds to the full relevance computation; otherwise it is discarded.
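The pruning check can be sketched in a few lines. The `UB` values and threshold here are assumed numbers, chosen only to make the example concrete:

```python
# Hypothetical per-term upper bounds and threshold (assumed for illustration).
UB = {"t0": 0.5, "t1": 1.0, "t2": 2.0}
THRESHOLD = 2.5

def may_exceed_threshold(query_terms, doc_terms):
    """Sum the upper bounds of the terms shared by query and document;
    only documents whose bound clears the threshold deserve full scoring."""
    bound = sum(UB[t] for t in query_terms if t in doc_terms)
    return bound >= THRESHOLD

print(may_exceed_threshold(["t0", "t1", "t2"], {"t1", "t2"}))  # True: 1.0 + 2.0 >= 2.5
print(may_exceed_threshold(["t0", "t1", "t2"], {"t0", "t1"}))  # False: 0.5 + 1.0 < 2.5
```

Since the sum is an upper bound, a document that fails this check can never make the top-n list, so skipping it is safe.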


Computing the n most similar documents this way would still require pulling out every candidate document, computing its bound, comparing against the threshold, and deciding whether it belongs in the top-n. That works, but it can be optimized. The starting point of the optimization is to minimize the number of bound computations. The algorithm from the WAND paper is implemented below (see also):

http://wulc.me/2018/03/18/Wand%20%E7%AE%97%E6%B3%95%E4%BB%8B%E7%BB%8D%E4%B8%8E%E5%AE%9E%E7%8E%B0/


import heapq

UB = {"t0":0.5,"t1":1,"t2":2,"t3":3,"t4":4} # upper bound of each term's contribution
LAST_ID = 999999999999 # sentinel id, larger than every doc id in the inverted index
THETA = 2 # threshold for checking whether to fully score the relevance between query and doc
TOPN = 3 # max number of results

class WAND:
    def __init__(self, InvertIndex):
        """init inverted index and necessary variable"""
        self.result_list = [] #result list
        self.inverted_index = InvertIndex #InvertIndex: term -> docid1, docid2, docid3 ...
        self.current_doc = 0
        self.current_inverted_index = {} #posting
        self.query_terms = []
        self.sort_terms = []
        self.threshold = THETA
        self.last_id = LAST_ID

    def __init_query(self, query_terms):
        """init variable with query"""
        self.current_doc = 0
        self.current_inverted_index = {}
        self.query_terms = []
        self.sort_terms = []
        
        for term in query_terms:
            if term in self.inverted_index:  # terms may not appear in inverted_index
                doc_id = self.inverted_index[term][0]
                self.query_terms.append(term)
                self.current_inverted_index[term] = [doc_id, 0] #[ docid, index ]
                self.sort_terms.append([doc_id, term])

    def __pick_term(self, pivot_index):
        """select the term before pivot_index in sorted term list
         paper recommends returning the term with max idf, here we just return the firt term,
         also return the index of the term instead of the term itself for speeding up"""
        return 0
        
    def __find_pivot_term(self):
        """find pivot term"""
        score = 0
        for i in range(len(self.sort_terms)):
            score += UB[self.sort_terms[i][1]]
            if score >= self.threshold:
                return [self.sort_terms[i][1], i] #[term, index]
        return [None, len(self.sort_terms)]

    def __iterator_invert_index(self, change_term, docid, pos):
        """find the first new_doc_id in the doc list of change_term such that new_doc_id >= docid;
        every doc list ends with self.last_id, so such a doc id always exists"""
        doc_list = self.inverted_index[change_term]
        new_doc_id, new_pos = self.last_id, len(doc_list) - 1 # fallback: the sentinel at the end
        for i in range(pos, len(doc_list)):
            if doc_list[i] >= docid:
                new_pos = i
                new_doc_id = doc_list[i]
                break
        return [new_doc_id, new_pos]

    def __advance_term(self, change_index, doc_id ):
        """change the first doc of term self.sort_terms[change_index] in the current inverted index
        return whether the action succeed or not"""
        change_term = self.sort_terms[change_index][1]
        pos = self.current_inverted_index[change_term][1]
        new_doc_id, new_pos = self.__iterator_invert_index(change_term, doc_id, pos)
        self.current_inverted_index[change_term] = [new_doc_id, new_pos]
        self.sort_terms[change_index][0] = new_doc_id

    def __next(self):
        while True:
            self.sort_terms.sort() #sort terms by doc id
            pivot_term, pivot_index = self.__find_pivot_term() #find pivot term > threshold
            if pivot_term is None: #no more candidate
                return None
            pivot_doc_id = self.current_inverted_index[pivot_term][0]
            if pivot_doc_id == self.last_id: # no more candidate
                return None
            if pivot_doc_id <= self.current_doc:
                change_index = self.__pick_term(pivot_index)
                self.__advance_term(change_index, self.current_doc + 1)
            else:
                first_doc_id = self.sort_terms[0][0]
                if pivot_doc_id == first_doc_id:
                    self.current_doc = pivot_doc_id
                    return self.current_doc # return the doc for fully calculating
                else:
                    # pick all preceding term instead of just one, then advance all of them to pivot
                    change_index = 0
                    while change_index < pivot_index:
                        self.__advance_term(change_index, pivot_doc_id)
                        change_index += 1
            # print(self.sort_terms, self.current_doc, pivot_doc_id)

    def __insert_heap(self, doc_id, score):
        """store the Top N result"""
        if len(self.result_list) < TOPN:
            heapq.heappush(self.result_list, (score, doc_id))
        else:
            heapq.heappushpop(self.result_list, (score, doc_id))


    def __calculate_doc_relevence(self, docid):
        """fully calculate relevence between doc and query"""
        score = 0
        for term in self.query_terms:
            if docid in self.inverted_index[term]:
                score += UB[term]
        return score


    def perform_query(self, query_terms):
        self.__init_query(query_terms)
        while True:
            candidate_docid = self.__next()
            if candidate_docid is None:
                break
            #insert candidate_docid to heap
            print('candidate doc', candidate_docid)
            full_doc_score = self.__calculate_doc_relevence(candidate_docid)
            self.__insert_heap(candidate_docid, full_doc_score)
            print("result list ", self.result_list)
        return self.result_list


if __name__ == "__main__":
    testIndex = {}
    testIndex["t0"] = [1, 3, 26, LAST_ID]
    testIndex["t1"] = [1, 2, 4, 10, 100, LAST_ID]
    testIndex["t2"] = [2, 3, 6, 34, 56, LAST_ID]
    testIndex["t3"] = [1, 4, 5, 23, 70, 200, LAST_ID]
    testIndex["t4"] = [5, 14, 78, LAST_ID]
    
    w = WAND(testIndex)
    final_result = w.perform_query(["t0", "t1", "t2", "t3", "t4"])
    print("=================final result=======================")
    for i in reversed(range(len(final_result))):
        print("doc {0}, relevence score {1}".format(final_result[i][1], final_result[i][0]))

posted on 2020-11-18 18:45 by 不忘初衷,方能致远