Derivation of Reservoir Sampling


I had to implement reservoir sampling during an interview and had no idea how, so I figured it out on the spot. This is where all those years of induction paid off, since there's a nice derivation that comes from writing down the definitions of everything and noticing a natural inductive pattern.

The Problem

Sample \(k\) elements from a stream.

Definitions

Sample
Uniformly at random.
Stream
A sequence of finite length (uniform sampling wouldn’t make sense on a countably infinite sequence). It can only be accessed one element at a time in a linear order, and is exhausted by iteration.

Python iterators like (i for i in range(5)) are good examples of streams.
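
Concretely, here's a quick demo of that behavior (nothing here beyond Python's standard iterator protocol):

```python
# A generator is a stream: one element at a time, exhausted by iteration.
stream = (i for i in range(5))

print(next(stream))  # 0
print(next(stream))  # 1
print(list(stream))  # [2, 3, 4]: consuming the rest exhausts the stream

try:
    next(stream)
except StopIteration:
    print("stream exhausted")
```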

The Solution

The constraints of the stream data structure suggest a solution: consider simple cases, and notice that you don't know how long the stream is.

We need an algorithm that works for any length, but we don't know that length up front. All we get is one element at a time (with something like StopIteration when the stream is exhausted), and all we can do with that element is store it or ignore it.

Let’s call that unknown length \(N\) and consider simple cases.

For now, assume \(k = 1\). We only want to sample a single element.

\(N = 1\)

Our stream has only one element. Let's call it \(x_1\). We only find out that there's exactly one element when we try to access the stream twice and get \(x_1\) and then StopIteration.

So for our algorithm to cover this case, it has to store \(x_1\) no matter what.

\(N = 2\)

Since we don’t know that \(N = 2\), we have to cover the \(N = 1\) case and keep \(x_1\).

Then to have a uniform sampling over \(\{x_1, x_2\}\), we have to discard \(x_1\) and keep \(x_2\) with probability \(\frac{1}{2}\).

\(N = 3\)

This is the \(N = 2\) case with one additional element. So we keep \(x_1\) with probability 1 and \(x_2\) with probability \(\frac{1}{2}\).

To keep the sampling uniform, \(x_3\) needs to be kept with probability \(\frac{1}{3}\).
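
Here's the quick check. If \(x_3\) is kept with probability \(\frac{1}{3}\), each of the three elements ends up as the sample with the same probability:

\[
P(x_1) = 1 \cdot \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3}, \qquad
P(x_2) = \tfrac{1}{2} \cdot \tfrac{2}{3} = \tfrac{1}{3}, \qquad
P(x_3) = \tfrac{1}{3},
\]

where each factor is the probability of being kept at one step or of not being displaced at a later step.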

This suggests a pattern of keeping the \(i\)-th element with probability \(\frac{1}{i}\). This is correct, and I’ll let you look up the proof on Wikipedia.
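
To make this concrete, here's a minimal Python sketch of the \(k = 1\) case (the function name is mine, not standard):

```python
import random

def reservoir_sample_one(stream):
    """Uniformly sample one element from an iterable of unknown length."""
    sample = None
    seen = 0
    for x in stream:
        seen += 1
        # Keep the i-th element with probability 1/i. For i = 1 this
        # always fires, so the first element is stored no matter what.
        if random.random() < 1 / seen:
            sample = x
    if seen == 0:
        raise ValueError("cannot sample from an empty stream")
    return sample
```

Calling reservoir_sample_one(i for i in range(5)) returns each of 0 through 4 with probability \(\frac{1}{5}\).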

Now we consider how to sample \(k\) elements rather than just 1. We start with the smallest valid case.

\(N = k\)

As before, we need to keep the first \(k\) elements no matter what in case there are no others.

(The data structure that stores these \(k\) elements is our reservoir.)

\(N = k + 1\)

A bit of arithmetic shows that we need to keep the \((k+1)\)-th element with probability \(\frac{k}{k+1}\), and in general the \(i\)-th element with probability \(\frac{k}{i}\).

If we keep it, we uniformly pick one element from our reservoir and replace it with this new element.
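
Here's why \(\frac{k}{i}\) keeps things uniform. An element already in the reservoir is evicted at step \(i\) only if the \(i\)-th element is kept and lands on its slot, so it survives step \(i\) with probability

\[
1 - \frac{k}{i} \cdot \frac{1}{k} = \frac{i-1}{i}.
\]

An element \(x_j\) with \(j > k\) enters the reservoir with probability \(\frac{k}{j}\) and must survive steps \(j+1, \dots, N\), and the survival factors telescope:

\[
\frac{k}{j} \cdot \frac{j}{j+1} \cdot \frac{j+1}{j+2} \cdots \frac{N-1}{N} = \frac{k}{N}.
\]

An element with \(j \le k\) enters with probability \(1\) and survives steps \(k+1, \dots, N\) with probability \(\frac{k}{k+1} \cdots \frac{N-1}{N} = \frac{k}{N}\) as well, so every element ends up in the final reservoir with the same probability \(\frac{k}{N}\).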

And then we're done. The stream data structure's constraints of unknown length and nonrandom access basically solve the problem for us.
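
Putting the pieces together, here's a sketch of the full algorithm in Python (this is essentially the classic Algorithm R; the function name is mine):

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k elements from an iterable of unknown length."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            # Store the first k elements no matter what, in case the
            # stream ends here.
            reservoir.append(x)
        elif random.random() < k / i:
            # Keep the i-th element with probability k/i, replacing a
            # uniformly chosen element of the reservoir.
            reservoir[random.randrange(k)] = x
    if len(reservoir) < k:
        raise ValueError("stream had fewer than k elements")
    return reservoir
```

For example, reservoir_sample((i for i in range(100)), 3) returns 3 elements, with each of the 100 elements landing in the result with probability \(\frac{3}{100}\).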
