While reading recent papers, I have found that we occasionally need to estimate the size of a subset. The subset comes from an extremely large set, so the subset itself may also be extremely large. The naive way to find its size is to enumerate all of its elements, but at this scale that is infeasible in practice. What we need is a way to estimate the size of the subset quickly. A typical scenario is the ranking problem: for each instance, we need to know how many instances rank before it. Because this operation is applied to every instance, scanning the whole training set each time is out of the question.

# Sampling

A reasonable approach is to estimate the size of the subset by sampling.

Let's first set up some notation. We denote the whole set as \(Y\) and its size as \(\vert Y\vert\). The set \(S\subset Y\) is the subset, and \(\vert S\vert\) is what we need to estimate.

## Geometric Distribution

Weston et al. (2011) describe a sampling method based on the geometric distribution. The sampling process is as follows:

Keep randomly choosing (with replacement) an element \(y\) from \(Y\) until \(y\in S\). Record the number of draws, denoted as \(N\). Then the size of \(S\) can be estimated as \(\vert S \vert=\frac{\vert Y \vert}{E[N]}\approx\frac{\vert Y\vert}{N}\).

Proof:

The proof is straightforward. Let \(p\) denote the probability that a uniformly drawn element \(y\in Y\) belongs to \(S\):

\[p = \frac{\vert S\vert}{\vert Y\vert}\]

Based on the sampling process, we get:

\[P(N=i) = (1-p)^{i-1}p\]

When \(N=i\), we sampled \(i\) times, which means the first \(i-1\) draws all failed to pick an element of \(S\) and the \(i^{th}\) draw succeeded. This is exactly a geometric distribution, so its expectation is:

\[E[N] = \frac{1}{p} = \frac{\vert Y\vert}{\vert S\vert}\]

Rearranging the equation above:

\[\vert S\vert = \frac{\vert Y\vert}{E[N]}\]

Also, as the number of sampling rounds \(m\) goes to infinity, the empirical mean converges to the expectation:

\[E[N] = \lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^{m}N_i\]

where \(N_i\) is the number of draws in the \(i^{th}\) round of sampling. To estimate quickly, we can set \(m=1\), and we finally get:

\[\vert S\vert \approx \frac{\vert Y\vert}{N}\]

\(\blacksquare\)
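The geometric procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation from Weston et al.: it assumes \(Y\) is represented by the integers `0 .. universe_size - 1`, and the names `estimate_size_geometric`, `in_subset`, and `rounds` (the \(m\) in the proof) are hypothetical.

```python
import random


def estimate_size_geometric(universe_size, in_subset, rounds=1, rng=None):
    """Estimate |S| as |Y| / mean(N), where N counts draws until the first hit.

    Assumes Y = {0, ..., universe_size - 1} and that S is non-empty;
    with an empty S the inner loop would never terminate.
    """
    rng = rng or random.Random()
    total_draws = 0
    for _ in range(rounds):
        draws = 0
        while True:
            draws += 1
            y = rng.randrange(universe_size)  # uniform draw with replacement
            if in_subset(y):
                break
        total_draws += draws
    mean_draws = total_draws / rounds  # empirical stand-in for E[N]
    return universe_size / mean_draws


if __name__ == "__main__":
    # Toy setup (an assumption for illustration): S is the first quarter of Y.
    estimate = estimate_size_geometric(
        10_000, lambda y: y < 2_500, rounds=2_000, rng=random.Random(0)
    )
    print(estimate)
```

With `rounds=1` this reduces to the quick estimate \(\vert Y\vert / N\); larger values of `rounds` average more draws before dividing.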

## Bernoulli Distribution

Apart from the previous sampling method, there are other options. One of them is to use the Bernoulli distribution. The process is as follows:

Randomly choose (with replacement) an element \(y\) from \(Y\) \(N\) times, then count how many of these \(N\) elements belong to \(S\); denote this count as \(M\). Clearly, \(M \le N\). Then the size of \(S\) can be estimated as \(\vert S\vert=\frac{E[M]\vert Y \vert}{N}\approx \frac{M\vert Y \vert}{N}\).

Proof:

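Similarly, let \(p\) denote the probability that a uniformly drawn element \(y\in Y\) belongs to \(S\):

\[p = \frac{\vert S\vert}{\vert Y\vert}\]

Based on this sampling method, we know:

\[P(M=i) = \binom{N}{i}p^{i}(1-p)^{N-i}\]

This means we independently choose (with replacement) \(N\) elements from \(Y\), of which \(i\) belong to \(S\). Each draw is a Bernoulli trial with success probability \(p\), so \(M\) follows a binomial distribution and its expectation is:

\[E[M] = Np = \frac{N\vert S\vert}{\vert Y\vert}\]

Rewrite this as:

\[\vert S\vert = \frac{E[M]\vert Y\vert}{N}\]

Based on the definition of expectation, with \(M_i\) the count in the \(i^{th}\) round of sampling:

\[E[M] = \lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^{m}M_i\]

Similarly, we can set \(m=1\), and we get:

\[\vert S\vert \approx \frac{M\vert Y\vert}{N}\]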

\(\blacksquare\)
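The fixed-sample procedure above can be sketched the same way. Again this is only an illustration under assumptions: \(Y\) is taken to be the integers `0 .. universe_size - 1`, and `estimate_size_fixed_sample`, `in_subset`, and `num_draws` (the \(N\) above) are hypothetical names.

```python
import random


def estimate_size_fixed_sample(universe_size, in_subset, num_draws, rng=None):
    """Estimate |S| as M * |Y| / N, where M counts hits among N uniform draws.

    Assumes Y = {0, ..., universe_size - 1}; draws are made with replacement.
    """
    rng = rng or random.Random()
    hits = sum(
        1 for _ in range(num_draws) if in_subset(rng.randrange(universe_size))
    )
    return hits * universe_size / num_draws


if __name__ == "__main__":
    # Toy setup (an assumption for illustration): S is the first quarter of Y.
    estimate = estimate_size_fixed_sample(
        10_000, lambda y: y < 2_500, num_draws=4_000, rng=random.Random(0)
    )
    print(estimate)
```

Unlike the geometric version, the number of draws here is fixed in advance, so the cost does not depend on how rare \(S\) is, but a small `num_draws` can return zero hits when \(S\) is a tiny fraction of \(Y\).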

# Experiments

To compare the two methods, I ran some experiments. The results are shown below.

## Accuracy

I first compare the estimation accuracy of the two sampling methods. I produced four plots, one for each ratio of \(\vert S\vert\) to \(\vert Y \vert\).

From these four plots, we can conclude:

**When the ratio \(\vert S\vert / \vert Y \vert\) is larger (\(0.5\)), the Bernoulli method is more stable; when the ratio becomes smaller (\(0.0005\)), the geometric method becomes more stable.**
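For readers who want to rerun this kind of comparison, here is a rough sketch of one possible harness. It is not the code behind the plots: it assumes \(S\) is the first `subset_size` integers of \(Y\), uses a single round (\(m=1\)) per estimate, and all function names and default parameters are hypothetical.

```python
import random
import statistics


def geometric_estimate(universe_size, subset_size, rng):
    """One-round estimate |Y| / N, with S assumed to be {0, ..., subset_size - 1}."""
    draws = 1
    while rng.randrange(universe_size) >= subset_size:  # miss: draw again
        draws += 1
    return universe_size / draws


def fixed_sample_estimate(universe_size, subset_size, num_draws, rng):
    """One-round estimate M * |Y| / N over a fixed number of draws."""
    hits = sum(rng.randrange(universe_size) < subset_size for _ in range(num_draws))
    return hits * universe_size / num_draws


def compare_spread(universe_size, subset_size, num_draws=100, trials=500, seed=0):
    """Return (mean, standard deviation) of repeated estimates for each method."""
    rng = random.Random(seed)
    geo = [geometric_estimate(universe_size, subset_size, rng) for _ in range(trials)]
    fixed = [
        fixed_sample_estimate(universe_size, subset_size, num_draws, rng)
        for _ in range(trials)
    ]
    return {
        "geometric": (statistics.mean(geo), statistics.stdev(geo)),
        "fixed_sample": (statistics.mean(fixed), statistics.stdev(fixed)),
    }


if __name__ == "__main__":
    for subset_size in (50_000, 50):  # ratios 0.5 and 0.0005 of |Y| = 100000
        print(subset_size, compare_spread(100_000, subset_size))
```

The spread reported for the fixed-sample method depends heavily on `num_draws`, so any comparison should state the value used.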

## Time

Another important aspect of sampling is time. After all, the whole point is to speed up the computation. The following four plots compare the running time under the same settings.

The conclusion is:

**When the ratio \(\vert S\vert / \vert Y \vert\) is larger (\(0.5\)), the geometric method is faster than the Bernoulli method; when the ratio becomes smaller (\(0.0005\)), the geometric method takes more and more time and the Bernoulli method gains the advantage.**
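One way to read this result, assuming the running time is dominated by the number of draws: the geometric method needs \(E[N]\) draws per round on average, which grows as \(S\) becomes rarer, while the Bernoulli method always uses the fixed \(N\) chosen in advance. For the two extreme ratios mentioned above:

\[E[N] = \frac{1}{p} = \frac{\vert Y\vert}{\vert S\vert}, \qquad p = 0.5 \Rightarrow E[N] = 2, \qquad p = 0.0005 \Rightarrow E[N] = 2000\]

So the geometric method is cheaper whenever the fixed sample size of the Bernoulli method exceeds \(1/p\), and more expensive otherwise.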

## Bibliography

Jason Weston, Samy Bengio, and Nicolas Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
In *Twenty-Second International Joint Conference on Artificial Intelligence*. 2011.
