# Hoeffding's inequality  Consider a simple problem: we have a jar containing some red and blue balls like this: We define the fraction of blue balls in the jar as $\mu$. Next, we draw 1 ball from the jar, write down its color, put the ball back into the jar and repeat the whole process $m$ times (we will call this action “Draw $m$ balls with replacement”). We call the fraction of blue balls over the $m$ balls we drew $\nu$. What we want is a way to know how well can $\nu$ track $\mu$ or, in other terms, how well can the in-sample mean $E_{in}$ track the out-of-sample mean $E_{out}$.

Let’s define $p$ as the probability of getting a blue ball and $q = 1 - p$ the probability of getting a red ball. Since we have replacement, the whole process is memoryless and therefore a Bernoulli trial. We can then say that the probability of getting exactly k blue balls in $n$ draws is:

And the probability of getting at most $k$ blue balls in $n$ draws is

Hoeffding’s inequality states that:

##Sketch of a (quasi)proof of Hoeffding’s inequality

I failed proving this myself (I get a slightly worse upper bound), but I can briefly sketch what I did to help you convince yourself this works.

By using De Moivre-Laplace we approximate the discrete series with an exponential1.

$F(k; n, p)$ is therefore:

Now, call $\beta = \frac{1}{4pq}$ and note that $\beta > 1$. Also, note that the coefficient is strictly less than 1. We then get:

Now, note that when $i$ is growing, the negative exponent is shrinking and therefore:

I am not able to remove the $k$ factor that separates my bound from that of Hoeffding’s, but this should be enough to convince ourselves that the bound is true, where it comes from and thus reduce the height of the leap of faith required to accept it.

##Making this inequality useful in Machine Learning

We are not over yet. Now we want to use this formula to predict how much information about the out-of-sample mean we can get from the in-sample mean.

Before doing that, let us recap what we have so far. Recall that $k$ is the amount of blue balls in $n$ draws and that $p$ is the probability of drawing a blue ball, while $F(k; n, p)$ is the probability of getting at most $k$ draws.

Then, it follows that $k$ is the in-sample mean of blue balls (integer) and $np$ is the out-of-sample mean, being the expected number of blue balls (real-valued, not necessarily integer).

To make this useful, we set up a threshold $\epsilon$ (not necessarily integer) on the probability we deduce from the sample. In other words, $\epsilon$ is the maximum difference we can tolerate between the probability we get by dividing $k$ by $n$ and the ideal probability $p$. We now want to obtain a measure of the probability of success when $k$ is close to $np$ by at most $\epsilon$.

$\epsilon = \lvert p - \frac{k}{n} \rvert$.

This in turn means that we have two cases:

1. $\epsilon = p - \frac{k}{n}$ $\Rightarrow k = n (p - \epsilon)$.

2. $\epsilon = \frac{k}{n} - p$ $\Rightarrow k = n (p + \epsilon)$.

Point 2 is wrong, but I can’t find why. Wikipedia points out that 2 should be the probability of drawing AT LEAST $k$ blue balls, which should therefore be $1 - F(n (p + \epsilon); n, p)$. Also, $F(k; n, p)$ must be symmetrical with $k = np$ having the highest probability, therefore one should be able to assert that point 2 refers to drawing at least $k$ blue balls, but I cannot pin this down mathematically :-(

Moving on, we have that the probability of taking LESS blue balls than $\epsilon$ and that of taking MORE blue balls than $\epsilon$ sum up (independent) and we obtain the final result:

$P(\| np - k \| > \epsilon) \leq 2 \exp\left(-2\epsilon^2 n\right)$.

This formula is non-asymptotic and can tell us how to quantify the relationship between $np$ and $k$ (and therefore between $\mu$ and $\nu$, or between $E_{out}$ and $E_{in}$).

1. De Moivre-Laplace’s formula uses Stirling’s approximation to convert the binomial coefficients into exponentials (in case you were wondering where the exponentials were coming from).