A Method for Finding the Median: Rank_Select

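While studying Kd-trees, I learned that building a Kd-tree requires ordering the data points by their value in the split dimension and then picking the point that sits exactly in the middle. I asked quite a few classmates, and they suggested running a heap sort first and then taking the element at index (n-1)/2. That seemed reasonable to me.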
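The classmates' suggestion can be sketched as follows. This is only a minimal sketch: it uses the standard library's qsort in place of a hand-written heap sort (the C standard does not say which algorithm qsort uses), and the names median_by_sort and cmp_double are made up for this illustration. It sorts a copy so that the caller's array is left untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: orders doubles ascending. */
static int cmp_double( const void* a, const void* b )
{
    double x = *(const double*)a;
    double y = *(const double*)b;
    return ( x > y ) - ( x < y );
}

/* Hypothetical helper: returns the element of rank (n-1)/2
   by sorting a private copy of the input. */
static double median_by_sort( const double* array, int n )
{
    double* copy;
    double med;

    assert( array != NULL && n > 0 );
    copy = (double*)malloc( (size_t)n * sizeof( double ) );
    assert( copy != NULL );
    memcpy( copy, array, (size_t)n * sizeof( double ) );

    qsort( copy, (size_t)n, sizeof( double ), cmp_double );
    med = copy[( n - 1 ) / 2];

    free( copy );
    return med;
}
```

Sorting costs O(n log n) even though only one element of the sorted order is needed, which is what motivates a selection method.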

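However, I also came across another approach, the Rank_Select method, which picks out the element of a given rank without fully sorting the array.

The code is as follows:

The Rank_Select method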

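#include <stdlib.h>   /* calloc, free */
/* cvCeil(), used in rank_select below, comes from OpenCV; its header is needed as well */

/* Sorts array[0..n-1] in ascending order by insertion sort. */
static void insertion_sort( double* array, int n )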
{
    double k;
    int i, j;

    for( i = 1; i < n; i++ )
    {
        k = array[i];
        j = i-1;
        while( j >= 0  &&  array[j] > k )
        {
            array[j+1] = array[j];
            j -= 1;
        }
        array[j+1] = k;
    }
}



/*
Partitions an array around a specified value.

@param array an array
@param n number of elements
@param pivot value around which to partition

@return Returns index of the pivot after partitioning
*/
static int partition_array( double* array, int n, double pivot )
{
    double tmp;
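    int p = 0, i, j;   /* p: current position of the pivot inside the "<= pivot" prefix */

    i = -1;
    for( j = 0; j < n; j++ )
        if( array[j] <= pivot )
        {
            /* grow the "<= pivot" prefix by swapping array[j] into it */
            tmp = array[++i];
            array[i] = array[j];
            array[j] = tmp;
            if( array[i] == pivot )
                p = i;
        }
    /* move the pivot to the end of the prefix, which is its final sorted position */
    array[p] = array[i];
    array[i] = pivot;

    return i;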
}

static double rank_select( double* array, int n, int r )
{
    double* tmp, med;
    int gr_5, gr_tot, rem_elts, i, j;

    /* base case */
    if( n == 1 )
        return array[0];

    /* divide array into groups of 5 and sort them */
    gr_5 = n / 5;
    gr_tot = cvCeil( n / 5.0 );
    rem_elts = n % 5;
    tmp = array;
    for( i = 0; i < gr_5; i++ )
    {
        insertion_sort( tmp, 5 );
        tmp += 5;
    }
    insertion_sort( tmp, rem_elts );

    /* recursively find the median of the medians of the groups of 5; first collect every group median */
    tmp = (double*)calloc( gr_tot, sizeof( double ) );
    for( i = 0, j = 2; i < gr_5; i++, j += 5 )
        tmp[i] = array[j];
    if( rem_elts )
        tmp[i++] = array[n - 1 - rem_elts/2];
    /* take the median of the collected medians */
    med = rank_select( tmp, i, ( i - 1 ) / 2 );
    free( tmp );

    /* partition around median of medians and recursively select if necessary */
    j = partition_array( array, n, med );
    if( r == j )
        return med;
    else if( r < j )
        return rank_select( array, j, r );
    else
    {
        array += j+1;
        return rank_select( array, ( n - j - 1 ), ( r - j - 1 ) );
    }
}

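This method is recursive.

Its inputs are the array holding the data, the number of elements in the array, and the index (rank) of the value to find; for the median, that index is the middle one.

Yesterday I read in a book that in C/C++, when an array is passed to a function, the parameter is really just a pointer. Inside the function, sizeof(a) therefore gives the size of a pointer (4 on a 32-bit build, 8 on a typical 64-bit one), not the length of the array. That is why the element count must be passed explicitly as the second parameter, while the third parameter tells the function which rank to look for.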
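The sizeof behaviour described above can be seen in a small sketch. The names ARRAY_LEN, size_in_callee, and count_in_caller are made up for this illustration; the parameter is written as double* because a parameter declared as double a[] means exactly the same thing.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a real array; only valid where the array type is visible. */
#define ARRAY_LEN( a ) ( sizeof( a ) / sizeof( ( a )[0] ) )

/* In the callee the argument is only a pointer, so this returns
   the size of a pointer, whatever the length of the caller's array. */
static size_t size_in_callee( double* a )
{
    return sizeof( a );
}

/* In the scope that declares the array, sizeof sees all of it. */
static size_t count_in_caller( void )
{
    double data[7] = { 13, 25, 67, 2, 10, 30, 20 };

    assert( sizeof( data ) == 7 * sizeof( double ) );
    return ARRAY_LEN( data );
}
```

Note that even in the caller, sizeof yields a size in bytes, so it has to be divided by the element size to obtain the number of elements.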

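First, the data is split into groups of 5 elements each (the last group may be smaller). Each group is sorted with insertion sort, and the median of each group is taken.

Next, the medians of all the groups are placed into a new array, and the median of that array is found by repeating the same procedure. The recursion stops when the second parameter is 1, at which point the first element of the array is returned.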
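The grouping step described above can be isolated in a short sketch. The function names sort_small and collect_group_medians are made up for this illustration, and the caller is assumed to supply a medians buffer with room for at least (n+4)/5 entries; the index arithmetic mirrors the code above.

```c
#include <assert.h>
#include <stddef.h>

/* Insertion sort for a short run of elements. */
static void sort_small( double* a, int n )
{
    int i, j;

    for( i = 1; i < n; i++ )
    {
        double k = a[i];
        for( j = i - 1; j >= 0 && a[j] > k; j-- )
            a[j + 1] = a[j];
        a[j + 1] = k;
    }
}

/* Sorts each group of 5 in place (plus one shorter trailing group)
   and writes one median per group into medians[].
   Returns the number of medians written. */
static int collect_group_medians( double* array, int n, double* medians )
{
    int full = n / 5;   /* number of complete groups */
    int rem = n % 5;    /* size of the trailing group, possibly 0 */
    int i, count = 0;

    assert( array != NULL && medians != NULL && n > 0 );

    for( i = 0; i < full; i++ )
    {
        sort_small( array + 5 * i, 5 );
        medians[count++] = array[5 * i + 2];
    }
    if( rem )
    {
        sort_small( array + 5 * full, rem );
        /* same index as in the code above: n - 1 - rem/2 */
        medians[count++] = array[n - 1 - rem / 2];
    }
    return count;
}
```

For a trailing group with an even number of elements, the index n - 1 - rem/2 picks the lower of the two middle values.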

The example array (7 elements):

index   0    1    2    3    4    5    6
value   13   25   67   2    10   30   20

(1) Find the element with index r=3.

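After insertion-sorting each group (the first five elements, then the remaining two), the array is:

index   0    1    2    3    4    5    6
value   2    10   13   25   67   20   30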

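tmp now contains two entries: tmp[0]=13, tmp[1]=20.

(2) The next call therefore receives tmp, 2, 0 as its arguments.

(3) Inside that call, the two elements form a single group whose median is 13, so the third-level call receives a one-element array holding 13, hits the base case n == 1, and returns 13 to step (2). There, partitioning 13, 20 around 13 gives index 0, which equals r, so 13 is returned to step (1).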

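Back in the top-level call, the second-level recursion has returned med=13. The partition_array function returns 2 as the index of 13, but 2 is smaller than 3, so we recurse into the part made up of 25, 67, 20, 30 to find the required value. The second argument is n-(j+1): 13 sits too far forward, so the search continues among the elements after it. The third argument is r-(j+1): the sought index has to be expressed relative to the start of that sub-array, which means subtracting the j+1 elements that were skipped.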
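The argument arithmetic in this step can be written down as a tiny sketch. The type next_step and the function choose_side are made up for this illustration; they only reproduce the three-way decision that rank_select makes once partition_array has returned j.

```c
#include <assert.h>

/* Hypothetical record describing what happens after partitioning. */
typedef struct
{
    int found;    /* 1 if the pivot itself has rank r */
    int offset;   /* where the sub-array starts inside the current array */
    int n;        /* number of elements in the sub-array */
    int r;        /* rank to look for inside the sub-array */
} next_step;

/* n: current element count, r: sought rank, j: pivot index after partitioning */
static next_step choose_side( int n, int r, int j )
{
    next_step s = { 0, 0, 0, 0 };

    assert( n > 0 && r >= 0 && r < n && j >= 0 && j < n );

    if( r == j )
        s.found = 1;            /* the pivot is the answer */
    else if( r < j )
    {
        s.n = j;                /* keep the j elements before the pivot */
        s.r = r;                /* rank is unchanged */
    }
    else
    {
        s.offset = j + 1;       /* skip the pivot and everything before it */
        s.n = n - j - 1;
        s.r = r - j - 1;        /* rank relative to the new start */
    }
    return s;
}
```

With n=7, r=3, and j=2 this gives offset 3, a sub-array of 4 elements, and rank 0, matching the values in the walk-through.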

Then a new round begins. The array holds the data below, the second argument is 4, and the third argument is 0.

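index   0    1    2    3
value   25   67   20   30

Since there are fewer than 5 elements, they form a single group. After insertion sort:

index   0    1    2    3
value   20   25   30   67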

tmp[0] = array[n-1-rem_elts/2], where the index works out to 4-1-4/2 = 1,

so tmp[0]=25.

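The partition_array function then returns 1 as the index of 25. The third argument (the sought index) is 0, so 25 lies too far back and the target must be in the front part of the array. Another round of recursion therefore begins.

This time array is passed with 1 as the second argument (the index just returned, which is also the number of elements in front of the pivot). The third argument is passed through unchanged, because the sought index already lies within this front part. With n == 1 the base case returns array[0], which is 20, the value we were looking for.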
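To check the walk-through above, here is a compact, self-contained sketch of the same procedure that does not depend on OpenCV. It is not the code listed earlier: the names select_rank, sort_group, and partition_around are made up for this illustration, the group count is computed with integer arithmetic, (n+4)/5, instead of cvCeil, and the pivot position is initialized defensively. Like the original, it reorders the array it is given.

```c
#include <assert.h>
#include <stdlib.h>

/* Insertion sort for one small group. */
static void sort_group( double* a, int n )
{
    int i, j;
    for( i = 1; i < n; i++ )
    {
        double k = a[i];
        for( j = i - 1; j >= 0 && a[j] > k; j-- )
            a[j + 1] = a[j];
        a[j + 1] = k;
    }
}

/* Partitions a[0..n-1] around pivot (which must occur in a)
   and returns the pivot's final index. */
static int partition_around( double* a, int n, double pivot )
{
    int i = -1, j, p = 0;
    for( j = 0; j < n; j++ )
        if( a[j] <= pivot )
        {
            double t = a[++i];
            a[i] = a[j];
            a[j] = t;
            if( a[i] == pivot )
                p = i;
        }
    assert( i >= 0 );
    a[p] = a[i];
    a[i] = pivot;
    return i;
}

/* Returns the element that would sit at index r if a[0..n-1] were sorted. */
static double select_rank( double* a, int n, int r )
{
    double* meds;
    double med;
    int groups, g, j;

    assert( a != NULL && n > 0 && r >= 0 && r < n );
    if( n == 1 )
        return a[0];

    /* one median per group of (up to) 5 elements */
    groups = ( n + 4 ) / 5;
    meds = (double*)malloc( (size_t)groups * sizeof( double ) );
    assert( meds != NULL );
    for( g = 0; g < groups; g++ )
    {
        int start = 5 * g;
        int len = ( n - start < 5 ) ? n - start : 5;
        sort_group( a + start, len );
        meds[g] = a[start + ( len - 1 ) / 2];
    }
    med = select_rank( meds, groups, ( groups - 1 ) / 2 );
    free( meds );

    /* partition around the median of medians, then pick a side */
    j = partition_around( a, n, med );
    if( r == j )
        return med;
    if( r < j )
        return select_rank( a, j, r );
    return select_rank( a + j + 1, n - j - 1, r - j - 1 );
}
```

Calling select_rank on the example array 13, 25, 67, 2, 10, 30, 20 with n=7 and r=3 returns 20, the same value the walk-through arrives at.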

My analysis of this algorithm may not be entirely accurate; corrections are welcome.

posted on 2011-11-08 16:34 Ming明