topk算法 -- heap,Quicksort

topk：即求大量数据中的前k大。
本文首先参照STL源码。提出了用heap和Quicksort两套求topk的方案。然后对他们进行了详细的分析与比较。

一、heap概述

堆是一种经过排序的树形数据结构，通常我们所说的堆，是指binary heap(二叉堆)。所谓binary heap，就是一种complete binary tree(完全二叉树)，也就是说，整棵binary tree除了最底层的叶节点之外，是填满的，且最顶层的叶节点由左至右又不得有空隙。
complete binary tree整棵树内没有任何节点漏洞，这带来一个极大的好处：我们可以利用array来储存所有节点。即某个节点位于i处时，其左子节点必位于array的2i处，其右子节点比位于array的2i+1处，其父节点必位于"i/2"处。这么一来，我们需要的工具就简单了：一个array和一组heap算法。array的缺点是无法动态改变大小，而heap却需要这项功能，因此，以vector代替array是更好的选择。
根据元素的排序方式，heap分为max-heap和min-heap。max-heap每个节点的key都大于或等于其子节点，min-heap每个节点的key都小于或等于其子节点。因此，max-heap的最大值在根节点，并总是位于vector的其头处。min-heap同理。

二、heap算法

注：以下源码最后一个参数接收一个functor(仿函数)，用来决定是max-heap或min-heap：less -> max-heap，greater -> min-heap。

push_heap算法

为了满足complete binary tree条件，新元素必插入在底层vector的end()处。那么新元素是否适合于现有位置呢？为满足max-heap条件，我们执行一个所谓的percolate up(上溯)程序：如果键值比父节点大，就父子对换位置。如此一直上溯，直到不需对换或到根节点位置。这是一个复杂度为O(logn)的操作。

push_heap

//------------------------------------------------------------------------------
// push_heap

template <class RandomAccessIterator,class Compare>
inline void push_heap(RandomAccessIterator first,
                        RandomAccessIterator last,
                        Compare Comp)
// 注意，此函数被调用时，新元素应已置于底部容器的最尾端
{
    __push_heap_aux(first,
        last,
        distance_type(first),
        value_type(first),
        Comp);

}

template <class RandomAccessIterator, class Distance, class T,class Compare>
inline void __push_heap_aux(RandomAccessIterator first,
                        RandomAccessIterator last,
                        Distance*, T*, Compare Comp)
{
    __push_heap(first,
        Distance((last - first) - 1),    // holeIndex
        Distance(0),    // topIndex
        T(*(last - 1)),
        Comp);    // value
    // 以上系根据implicit representation heap 的结构特性
    // 即：新值必置于底部容器的最尾端，此即第一个洞号：(last-first)-1
}

template <class RandomAccessIterator, class Distance, class T,class Compare>
void __push_heap(RandomAccessIterator first,
                Distance holeIndex,
                Distance topIndex,
                T value,Compare Comp)
{
    Distance parent = (holeIndex - 1) / 2; //  找出父节点
    while(holeIndex > topIndex && Comp(*(first+parent),value)/**(first + parent) < value*/)
    {
        // 当尚未到达顶端，且父节点小于新值（于是不符合heap的次序特性）
        // 由于以上使用operator <，可知stl heap是max-heap
        *(first + holeIndex) = *(first + parent); // 即把父节点往下"拉"
        holeIndex = parent;    // percolate up: 调整洞号，向上提升至父节点
        parent = (holeIndex - 1) / 2;    // 新洞的父节点

    }    // 持续至顶端，或满足heap的次序特性为止
    *(first + holeIndex) = value;    // 令洞值为新值
}

pop_heap算法

既然身为max-heap，最大值必在根节点vector[0]。pop操作取走根节点(注意，此处STL算法并没有移除他，只是把他移到vector的尾端)，然后为了满足complete binary tree条件，必须割舍最下层最右边的叶节点vector[last-1]，并将其重新安插置max-heap。这种操作的复杂度也为O(logn)。

pop_heap

//-----------------------------------------------------------------
// pop_heap

template <class RandomAccessIterator,class Compare>
inline void pop_heap(RandomAccessIterator first,
                    RandomAccessIterator last,
                    Compare Comp)
{
    __pop_heap_aux(first, last, value_type(first),Comp);

}

template <class RandomAccessIterator, class T,class Compare>
inline void __pop_heap_aux(RandomAccessIterator first,
                            RandomAccessIterator last,
                            T*,Compare Comp)
{
    __pop_heap(first,            // first
                last-1,            // last
                last-1,            // resule
                T(*(last-1)),    // value,即把最后一个元素“挑”出来
                distance_type(first),
                Comp);    
    // 以上，根据implicit representation heap 的次序特性，pop操作的结果
    // 应为底部容器的第一个元素。因此，首先设定欲调整值为尾指，然后将首值调制
    // 尾节点（所以以上将迭代器result设为last-1）。然后重整[first,last-1)，
    // 使之重新成一个合格的heap
}

template <class RandomAccessIterator, class T, class Distance,class Compare>
inline void __pop_heap(RandomAccessIterator first,
                    RandomAccessIterator last,
                    RandomAccessIterator result,
                    T value,
                    Distance*,
                    Compare Comp)
{
    *result = *first;    // 设定尾值为首值，于是尾值即为欲求结果
                        // 可由客户端稍后再以底层容器之pop_back()取出尾值
    
    // 重新调整heap，洞号为0（亦即树根处），欲调整值为value（原尾值）
    __adjust_heap(first,        // first    
        Distance(0),            // holeIndex
        Distance(last-first),    // len
        value,
        Comp);                    // value
}


template <class RandomAccessIterator,class T,class Distance,class Compare>
void  __adjust_heap(RandomAccessIterator first,
    Distance holeIndex,
    Distance len,
    T value,
    Compare Comp)        
{
    Distance topIndex = holeIndex;
    Distance secondChild = 2 * holeIndex + 2;    // 洞节点之右子节点
    
    while (secondChild < len){
        // 比较洞节点之左右两个子值，然后以sencondChild代表较大子节点
        if (Comp(*(first + secondChild),*(first + (secondChild - 1)))
            /**(first + secondChild) < *(first + (secondChild - 1))*/)
            secondChild--;

        // Percolate down:令较大子值为洞值，再令洞号下移至较大子节点处
        *(first+holeIndex) = *(first+secondChild);
        holeIndex = secondChild;
        // 找出新洞节点的右子节点
        secondChild = 2 * secondChild + 2;
    }
    if (secondChild == len)    {    // 没有右子节点，只有左子节点
        // Percolate down: 令左子值为洞值，再令洞号下移至左子结点处。
        *(first + holeIndex) = *(first + (secondChild - 1));
        holeIndex = secondChild - 1;
    }


    // 此时（可能）尚未满足次序特性，执行一次percolate up操作

    // 有右子节点，即把右子节点再push进堆中即可
    __push_heap(first,holeIndex,topIndex,value,Comp);

    // 直接复制到hole中也可啊
    // 错啦！因为该值可能大于其父节点啦~~
    // "wrong" : *(first+holeIndex) = value;
}

make_heap算法

make_heap

template <class RandomAccessIterator,class T,class Distance,class Compare>
inline void __make_heap(RandomAccessIterator first,
    RandomAccessIterator last,T*,
    Distance*,
    Compare Comp)
{
    if (last - first < 2)
        return;        // 如果长度为0或1，不必重新排列
    Distance len = last - first;

    Distance parent = (len - 2)/2;

    while(true){
        __adjust_heap(first,
            parent,
            len,
            T(*(first + parent)),
            Comp);
        if (parent == 0) return;
        parent--;
    }
}

template <class RandomAccessIterator,class Compare>
inline void make_heap(RandomAccessIterator first,
    RandomAccessIterator last,
    Compare Comp)
{
    __make_heap(first, last, value_type(first), distance_type(first),Comp);
}

这个算法用来将一段现有数据转化为一个heap。

// for example
　　int ia[9] = {0,1,2,3,4,8,9,3,5};
　　vector<int> ivec(ia,ia+9);
　　make_heap(ivec.begin(), ivec.end());　　// 9 5 8 3 4 0 2 3 1

sort_heap算法

持续对整个heap做pop_heap操作，每次将操作范围从后向前缩减一个元素(因为pop_heap会把键值最大元素放在底部容器最尾端)，所以最后我们会得到一个递增序列。

sort_heap

//-----------------------------------------------------------------
// sort_heap
template <class RandomAccessIterator,class Compare>
void sort_heap(RandomAccessIterator first,
                RandomAccessIterator last,
                Compare Comp)
{
    while (last - first > 1)
        mySTL::pop_heap(first, last--,Comp);
}

注意，排序过后，原来的heap就不是一个合法的heap了。

三、partial_sort算法

现在，就来看看怎样利用heap算法来计算大量数据中的topk。
举例来说，比如求前k小，那么首先取前k个元素做成max-heap，然后遍历剩下的所有元素，分别于堆中的最大元素(root节点)比较，如果它小于最大元素，则替换掉那个元素。这样一来，当把序列扫描一遍之后，该max-heap中的k个元素就是最小的k个元素，效率很高。

partial_sort

//------------------------------------------------------------------------------
// partial_sort

template <class RandomAccessIterator,class T,class Compare>
inline void __partial_sort(RandomAccessIterator first,
    RandomAccessIterator middle,
    RandomAccessIterator last, T*,Compare Comp)
    //            less    ---    max-heap
    //            greater    ---    min-heap
{
    mySTL::make_heap(first, middle,Comp); 
    for (RandomAccessIterator i = middle; i < last; ++i)
        if ( Comp(*i,*first)  )
            mySTL::__pop_heap(first, middle, i, T(*i), distance_type(first),Comp);
    mySTL::sort_heap(first,middle,Comp);
}


// paitial_sort的任务是找出middle - first个最小元素。
template <class RandomAccessIterator,class Compare>
inline void partial_sort(RandomAccessIterator first,
    RandomAccessIterator middle,
    RandomAccessIterator last,
    Compare Comp)
{
    __partial_sort(first, middle, last, value_type(first),Comp);
}

四、用Quicksort求topk

Quicksort相关原理就不多说了，STL中Quicksort算法的实现我在<stl sort源码剖析>中一文也详细说明过。所以这里就只说说怎么样用Quicksort来求topk。
其实很简单，大家都知道Quicksort每次分割后即把比轴小的值和比轴大的值给cut开了。那么，假如是想求前k小(以下源码即是)，我就只关心比轴小的一边，继续分割这一边，直到如果分割后的元素没有k个了，就结束循环。我们用源码说话：

topk_qsort

// 利用快速排序思想求topk小
template <class Iter,class T>
void __topk_qsort(Iter first,
                  Iter last,
                  int k,
                  T*)
{
    if (last - first < k)  throw runtime_error("too little elements!");
    Iter cut = last;
    do{
        
        // 只考虑元素小的一边，以为是求前k小。
        last = cut;    
        cut = mySTL::__unguarded_partition            
            (first, last, T(mySTL::__median(*first,
                                            *(first + (last - first)/2),
                                            *(last - 1))));
        // [first - cut] -> 小
        // [cut - last]     -> 大    

    }while(cut - first > k);    // 当[cut-first]已经不足k个元素时停止qsort
    mySTL::sort(first,last);
}

/--------------------------------------华丽的分割线--------------------------------------/

五、分析分析

............　　

posted on 2012-03-28 23:50 imAkaka 阅读(5871) 评论(0) 编辑收藏举报

刷新页面返回顶部

imAkaka