The Reservoir Sampling Problem

2013-10-01 04:18 youxin

Google once asked a rather interesting interview question:

I have a linked list of numbers of length N. N is very large and I don’t know in advance the exact value of N.

How can I most efficiently write a function that will return k completely random numbers from the list

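The problem is simple to state: given a linked list of N elements whose length is not known in advance, write a function that efficiently picks k random elements from it.

At first I had no idea how to approach it. After some reading I found that it is not a new problem: Problem 10 in Column 12 of Programming Pearls describes it as follows: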

　　How could you select one of n objects at random, where you see the objects sequentially but you do not know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don’t know the number of lines in advance?

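The problem can be simplified to: how do you pick one line at random from a file without knowing the total number of lines?

The first question to ask is whether we have solved a similar problem before. We have: when the number of lines is known, we can use the C runtime library's rand function to pick a random line number and return that line. Here, though, the line count is unknown, so what can we do? We need a technique that gives every line the same probability of being selected, which is what "random" means here. That technique is reservoir sampling.

Wikipedia (http://en.wikipedia.org/wiki/Reservoir_sampling) explains it in detail:

Reservoir sampling is a family of randomized algorithms for choosing k samples from a set S containing n items, where n is either very large or unknown. It is especially useful when the n items cannot all be held in main memory. The best-known example is Algorithm R, described by Jeffrey Vitter in his paper^[1].

Following the O(n) algorithm given in the Dictionary of Algorithms and Data Structures^[2], the steps are as follows (assuming the array S is indexed from 0):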

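Take the first k items of S and put them into the "reservoir"
For each item S[j] (j ≥ k):
   Generate a random integer r in the range 0 to j, inclusive
   If r < k, replace item r of the reservoir with S[j]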

array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

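C++ implementation:

#include<iostream>
#include<cstdlib>
#include<ctime>
using namespace std;

int main()
{
    int S[10]={0,1,2,3,4,5,6,7,8,9};
    const int n=sizeof(S)/sizeof(S[0]);
    const int k=4;
    int R[k];
    int i,j;

    srand(time(NULL));      // seed once, not inside the loop

    // fill the reservoir with the first k items
    for(i=0;i<k;i++)
        R[i]=S[i];

    // S[i] is the (i+1)-th item, so it enters the reservoir with probability k/(i+1)
    for(i=k;i<n;i++)
    {
        j=rand()%(i+1);     // random integer in [0, i], inclusive
        if(j<k)
            R[j]=S[i];
    }

    for(i=0;i<k;i++)
        cout<<R[i]<<' ';
    cout<<endl;

}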
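The interview question asks about a linked list rather than an array. As a rough sketch of how the same idea might carry over, the function below walks a std::forward_list once without ever asking for its length. The name reservoir_sample and its signature are made up for illustration, and it uses the C++11 <random> facilities instead of rand().

```cpp
#include <cstddef>
#include <forward_list>
#include <random>
#include <vector>

// Hypothetical helper (not from the original post): returns k items chosen
// uniformly at random from a singly linked list, visiting each node once.
// If the list holds fewer than k items, all of them are returned.
std::vector<int> reservoir_sample(const std::forward_list<int>& list,
                                  std::size_t k, std::mt19937& rng)
{
    std::vector<int> reservoir;
    reservoir.reserve(k);
    std::size_t seen = 0;                   // number of items visited so far
    for (int value : list) {
        ++seen;
        if (reservoir.size() < k) {
            reservoir.push_back(value);     // fill phase: keep the first k items
        } else {
            // Draw r uniformly from [0, seen - 1]; r < k happens with
            // probability k / seen, and then slot r is overwritten.
            std::uniform_int_distribution<std::size_t> dist(0, seen - 1);
            std::size_t r = dist(rng);
            if (r < k)
                reservoir[r] = value;
        }
    }
    return reservoir;
}
```

A caller would create one generator, for example std::mt19937 rng(std::random_device{}()), and pass it to reservoir_sample(list, 4, rng). Seeding once and reusing the generator keeps successive draws independent.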

Why the name reservoir sampling? Because the array R[k] acts like a reservoir that holds the current sample:

The algorithm creates a "reservoir" array of size k and populates it with the first k items of S. It then iterates through the remaining elements of S until S is exhausted. At the i^th element of S, the algorithm generates a random number j between 1 and i. If j is less than or equal to k, the j^th element of the reservoir array is replaced with the i^th element of S. In effect, for all i, the i^th element of S is chosen to be included in the reservoir with probability k/i. Similarly, at each iteration the j^th element of the reservoir array is chosen to be replaced with probability 1/k * k/i, which simplifies to 1/i. It can be shown that when the algorithm has finished executing, each item in S has equal probability (i.e. k/length(S)) of being chosen for the reservoir.
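The final claim can be checked with a short calculation. In this sketch n stands for length(S) and positions are 1-based as in the pseudocode (symbols chosen here for illustration). At step m > k the random index is uniform on 1..m, so any particular reservoir slot is overwritten with probability 1/m and survives with probability 1 - 1/m. The product telescopes:

```latex
\begin{align*}
\Pr[\text{item } i \text{ is in the final reservoir}]
  &= \frac{k}{i} \prod_{m=i+1}^{n} \left(1 - \frac{1}{m}\right)
   = \frac{k}{i} \cdot \frac{i}{n} = \frac{k}{n}, && i > k, \\
\Pr[\text{item } i \text{ is in the final reservoir}]
  &= \prod_{m=k+1}^{n} \left(1 - \frac{1}{m}\right) = \frac{k}{n}, && i \le k.
\end{align*}
```

The first factor k/i is the chance that item i enters the reservoir at all; the items with i ≤ k enter with certainty and only need to survive the later steps.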

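With this technique in hand, return to the original question: how do you pick one line at random from a file without knowing the total number of lines? The solution is to keep the selected line in choice. Take the first line as choice; when the second line arrives, replace choice with it with probability 1/2; when the third line arrives, replace choice with it with probability 1/3; and so on. In pseudocode:

i = 0
while more input lines
    with probability 1.0/++i
        choice = this input line
print choice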
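A small worked case shows why every line ends up equally likely. Take a file of three lines (a number picked only for illustration): a line must be selected when it is read and then not be replaced by any later line.

```latex
\begin{align*}
\Pr[\text{choice} = \text{line 1}] &= 1 \cdot \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3} \\
\Pr[\text{choice} = \text{line 2}] &= \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3} \\
\Pr[\text{choice} = \text{line 3}] &= \tfrac{1}{3}
\end{align*}
```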

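#include<iostream>
#include<cstdlib>
#include<ctime>
using namespace std;

int main()
{
    // simulate a file whose lines are numbered 1..n
    const int n=10;
    int choice=1;           // the first line is always taken initially
    srand(time(NULL));      // seed once, not inside the loop
    for(int i=2;i<=n;i++)
    {
        // rand()%i is 0 with probability 1/i (ignoring slight modulo bias)
        if(rand()%i==0)
            choice=i;
    }
    cout<<choice<<endl;

}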