Reservoir Sampling

Reservoir sampling solves the following problem: randomly choose k items from a stream of n\gg k elements, where n may be very large or unknown in advance, such that every element in the stream is equally likely to be selected, with probability \frac{k}{n}.

The algorithm works as follows.

Let’s first take a look at a simple example with k=1. When a new item a_i arrives, we either keep a_i with probability \frac{1}{i} or keep the previously selected item with probability 1-\frac{1}{i}. We repeat this process until the end of the stream, i.e., until all elements a_1,a_2,\cdots,a_n have been visited. For a_i to be the final choice, it must be kept at step i and then survive every later step, so the probability that a_i is chosen in the end is \mathit{Pr}(a_i)=\frac{1}{i}\times (1-\frac{1}{i+1})\times(1-\frac{1}{i+2})\times\cdots\times(1-\frac{1}{n})=\frac{1}{i}\times\frac{i}{i+1}\times\frac{i+1}{i+2}\times\cdots\times\frac{n-1}{n}=\frac{1}{n}

This proves that the algorithm gives every element the same probability of being chosen. A Java implementation of this algorithm might look like this:

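import java.util.Random;

// Returns an index in [1, n], each with probability 1/n.
int random(int n) {
    Random rnd = new Random();
    int ret = 0;
    for (int i = 1; i <= n; i++)
        // nextInt(i) is uniform on [0, i), so this holds with probability 1/i
        if (rnd.nextInt(i) == 0)
            ret = i;
    return ret;
}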
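Since n may be unknown in advance, the same rule can also be applied to an iterator whose length is never queried. The following is only a minimal sketch under that assumption; the class name StreamPick and the method pickOne are hypothetical names introduced here for illustration and are not part of the original post.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; the name and signature are illustrative only.
class StreamPick {
    // Returns one element chosen uniformly at random from the stream,
    // or null if the stream is empty. The length is never needed up front.
    static <T> T pickOne(Iterator<T> stream, Random rnd) {
        T chosen = null;
        int i = 0; // number of items seen so far
        while (stream.hasNext()) {
            T item = stream.next();
            i++;
            // Keep the i-th item with probability 1/i.
            if (rnd.nextInt(i) == 0) {
                chosen = item;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c", "d");
        System.out.println(pickOne(words.iterator(), new Random()));
    }
}
```

Passing the Random instance in as a parameter is a design choice of this sketch: it lets a caller supply a seeded generator when reproducible runs are wanted.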

 

The case k>1 is a little trickier. One straightforward way is to simply run the previous algorithm k times. However, this requires multiple passes over the stream. Here we discuss another approach that selects k elements at random in a single pass.

For item a_i, there are two cases to handle:

  1. When i\leq k, we always keep a_i.
  2. When i>k, we keep a_i with probability \frac{k}{i}.

A simple implementation needs memory for the k selected elements, say s_1,s_2,\cdots,s_k. For every a_i with i>k, we first draw a random integer j uniformly from 1\leq j\leq i and keep a_i when 1\leq j\leq k by setting s_j=a_i. Otherwise a_i is discarded. This guarantees the \frac{k}{i} probability in the second case.

The proof is similar to the previous one. For i>k, the probability that a_i is chosen is \mathit{Pr}(a_i)=\frac{k}{i}\times(1-\frac{1}{i+1})\times\cdots\times(1-\frac{1}{n})=\frac{k}{i}\times\frac{i}{i+1}\times\cdots\times\frac{n-1}{n}=\frac{k}{n}

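Here 1-\frac{1}{i+1} is the probability that a_i is not replaced by a_{i+1} as s_j, because a_{i+1} lands in any particular slot with probability \frac{k}{i+1}\times\frac{1}{k}=\frac{1}{i+1}. For i\leq k, a_i is kept with probability 1 and the product starts at step k+1, which gives \frac{k}{k+1}\times\cdots\times\frac{n-1}{n}=\frac{k}{n} as well.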
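As a quick sanity check of the formula, here is a small worked instance with k=2 and n=4 (the numbers are chosen here purely for illustration). It covers one item that arrives after the reservoir is full and one that arrives before.

```latex
% Item a_3 (i > k): kept with probability k/i = 2/3, then must survive step 4.
\mathit{Pr}(a_3) = \frac{2}{3}\times\left(1-\frac{1}{4}\right)
                 = \frac{2}{3}\times\frac{3}{4} = \frac{1}{2} = \frac{k}{n}

% Item a_1 (i \leq k): kept with probability 1, then must survive steps 3 and 4.
\mathit{Pr}(a_1) = 1\times\left(1-\frac{1}{3}\right)\times\left(1-\frac{1}{4}\right)
                 = \frac{2}{3}\times\frac{3}{4} = \frac{1}{2} = \frac{k}{n}
```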

Below is a sample implementation in Java:

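import java.util.Random;

// Returns k elements of a chosen uniformly at random (assumes a.length >= k).
int[] random(int[] a, int k) {
    int[] s = new int[k];
    Random rnd = new Random();
    // The first k items fill the reservoir.
    for (int i = 0; i < k; i++)
        s[i] = a[i];
    // i is the 1-based position in the stream, so item a_i is a[i - 1].
    for (int i = k + 1; i <= a.length; i++) {
        int j = rnd.nextInt(i); // uniform on [0, i)
        if (j < k) s[j] = a[i - 1]; // happens with probability k/i
    }
    return s;
}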
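The \frac{k}{n} result can also be checked empirically. The sketch below is only an illustration under stated assumptions: the class ReservoirCheck and its methods sample and frequencies are hypothetical names introduced here, and the sampler reads from an iterator so that the stream length is never needed in advance.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; names are illustrative only.
class ReservoirCheck {
    // Single-pass reservoir of size k over a stream of unknown length.
    // Returns fewer than k items if the stream is shorter than k.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> s = new ArrayList<>(k);
        int i = 0; // 1-based position of the current item
        while (stream.hasNext()) {
            T item = stream.next();
            i++;
            if (i <= k) {
                s.add(item);            // first k items fill the reservoir
            } else {
                int j = rnd.nextInt(i); // uniform on [0, i)
                if (j < k) {
                    s.set(j, item);     // replaces slot j with probability k/i
                }
            }
        }
        return s;
    }

    // Counts how often each of the values 0..n-1 ends up in the sample.
    static int[] frequencies(int n, int k, int trials, Random rnd) {
        List<Integer> data = new ArrayList<>();
        for (int v = 0; v < n; v++) data.add(v);
        int[] counts = new int[n];
        for (int t = 0; t < trials; t++) {
            for (int v : sample(data.iterator(), k, rnd)) counts[v]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 10, k = 3, trials = 100000;
        int[] counts = frequencies(n, k, trials, new Random());
        for (int v = 0; v < n; v++) {
            // Each ratio should be close to k/n = 0.3.
            System.out.printf("%d: %.3f%n", v, counts[v] / (double) trials);
        }
    }
}
```

With enough trials, every value should appear in roughly the same fraction of samples, regardless of its position in the stream.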

 

posted @ 2017-11-08 12:10  rkk