Consider a simple problem: we have a jar containing some red and blue balls.
We define the fraction of blue balls in the jar as $\mu$. Next, we draw one ball from the jar, write down its color, put the ball back into the jar, and repeat the whole process $N$ times (we will call this action “drawing $N$ balls with replacement”). We call the fraction of blue balls among the $N$ balls we drew $\nu$. What we want is a way to know how well $\nu$ can track $\mu$ or, in other terms, how well the in-sample mean can track the out-of-sample mean.
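To make the experiment concrete, here is a minimal simulation sketch. The jar fraction `mu = 0.3`, the fixed seed, and the function name `draw_with_replacement` are all assumptions chosen for illustration; nothing here comes from a real jar.

```python
import random

def draw_with_replacement(mu, n_draws, rng):
    """Return the fraction of blue balls seen in n_draws independent draws.

    mu is the (assumed) true fraction of blue balls in the jar, so each
    draw is blue with probability mu, independently of the others.
    """
    blue = sum(1 for _ in range(n_draws) if rng.random() < mu)
    return blue / n_draws

rng = random.Random(0)  # fixed seed so the run is repeatable
mu = 0.3                # hypothetical jar: 30% blue balls
for n_draws in (10, 100, 10_000):
    nu = draw_with_replacement(mu, n_draws, rng)
    print(f"N={n_draws:>6}  sample fraction={nu:.3f}  gap={abs(nu - mu):.3f}")
```

The gap between the sample fraction and `mu` tends to shrink as the number of draws grows, which is the behaviour the rest of the post tries to quantify.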
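Let’s define $p$ as the probability of getting a blue ball and $q = 1 - p$ as the probability of getting a red ball. Since we draw with replacement, the draws are independent and identically distributed, so each draw is a Bernoulli trial and the number of blue balls follows a binomial distribution. We can then say that the probability of getting exactly $k$ blue balls in $N$ draws is:

$$P(k; N, p) = \binom{N}{k} p^k q^{N-k}$$

And the probability of getting at most $k$ blue balls in $N$ draws is

$$F(k; N, p) = \sum_{i=0}^{k} \binom{N}{i} p^i q^{N-i}$$

Hoeffding’s inequality states that, for $k \le Np$:

$$F(k; N, p) \le \exp\left(-\frac{2 (Np - k)^2}{N}\right)$$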
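As a sanity check, the exact probability of drawing at most $k$ blue balls can be compared numerically with the Hoeffding bound $\exp(-2(Np - k)^2 / N)$, which applies when $k \le Np$. The sketch below uses only the standard library; the function names and the values `n = 100`, `p = 0.5` are assumptions picked for illustration.

```python
from math import comb, exp

def binom_pmf(k, n, p):
    """Probability of exactly k blue balls in n draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Probability of at most k blue balls in n draws."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def hoeffding_lower_tail(k, n, p):
    """Hoeffding bound on binom_cdf(k, n, p); meaningful for k <= n * p."""
    return exp(-2 * (n * p - k) ** 2 / n)

n, p = 100, 0.5  # hypothetical experiment: 100 draws from a half-blue jar
for k in (30, 40, 45, 50):
    exact = binom_cdf(k, n, p)
    bound = hoeffding_lower_tail(k, n, p)
    print(f"k={k:>3}  exact={exact:.3e}  bound={bound:.3e}")
```

In every row the exact value should sit below the bound. The bound is loose far out in the tail and becomes trivial (equal to 1) at $k = Np$.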
## Sketch of a (quasi-)proof of Hoeffding’s inequality
I failed to prove this myself (I get a slightly worse upper bound), but I can briefly sketch what I did to help you convince yourself that it works.
Now, call and note that . Also, note that the coefficient is strictly less than 1. We then get:
Now, note that when is growing, the negative exponent is shrinking and therefore:
I am not able to remove the factor that separates my bound from Hoeffding’s, but this should be enough to convince ourselves that the bound is true and to show where it comes from, which reduces the leap of faith required to accept it.
## Making this inequality useful in Machine Learning
We are not done yet. Now we want to use this formula to quantify how much information about the out-of-sample mean we can get from the in-sample mean.
Before doing that, let us recap what we have so far. Recall that $k$ is the number of blue balls in $N$ draws and that $p$ is the probability of drawing a blue ball, while $F(k; N, p)$ is the probability of getting at most $k$ blue balls in $N$ draws.
Then, it follows that $k$ is the in-sample number of blue balls (an integer) and $Np$ is its out-of-sample mean, being the expected number of blue balls (real-valued, not necessarily an integer).
To make this useful, we set up a threshold $\varepsilon > 0$ (a real number) on the probability we deduce from the sample. In other words, $\varepsilon$ is the maximum difference we can tolerate between the probability we get by dividing $k$ by $N$ and the ideal probability $p$. We now want to bound the probability that $k/N$ lies within $\varepsilon$ of $p$.
This in turn means that we have two cases:
Point 2 is wrong, but I can’t find why. Wikipedia points out that point 2 should be the probability of drawing AT LEAST $(p + \varepsilon)N$ blue balls, which should therefore be $P(k \ge (p + \varepsilon)N)$. Also, the distribution of $k$ peaks near $k = Np$ (although it is exactly symmetrical only when $p = 1/2$), therefore one should be able to assert that point 2 refers to drawing at least $(p + \varepsilon)N$ blue balls, but I cannot pin this down mathematically :-(
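One possible way to pin the upper tail down, offered as a sketch rather than a full proof, is to count red balls instead of blue ones. It assumes the lower-tail bound holds for any success probability. Let $r = N - k$ be the number of red balls in $N$ draws, where $k$ is the number of blue balls, $p$ the probability of blue, $q = 1 - p$ the probability of red, and $\varepsilon$ the threshold. Then $r$ is binomial with parameters $N$ and $q$, and applying the lower-tail bound to $r$ at the level $(q - \varepsilon)N$ gives:

$$
\begin{aligned}
P\big(r \le (q - \varepsilon)N\big) &\le \exp\left(-\frac{2\,(Nq - (q - \varepsilon)N)^2}{N}\right) = e^{-2\varepsilon^2 N}, \\
r \le (q - \varepsilon)N \iff N - k \le (1 - p - \varepsilon)N &\iff k \ge (p + \varepsilon)N, \\
\text{hence}\quad P\big(k \ge (p + \varepsilon)N\big) &\le e^{-2\varepsilon^2 N}.
\end{aligned}
$$

So no symmetry of the distribution is needed: the upper tail for blue balls is the lower tail for red balls.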
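Moving on, the event of drawing at most $(p - \varepsilon)N$ blue balls and the event of drawing at least $(p + \varepsilon)N$ blue balls (with $\varepsilon$ the threshold) are mutually exclusive, so their probabilities add up, and we obtain the final result:

$$P\left(\left|\frac{k}{N} - p\right| \ge \varepsilon\right) \le 2 e^{-2 \varepsilon^2 N}$$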
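This formula is non-asymptotic and tells us how to quantify the relationship between $k/N$ and $p$ (and therefore between the sample fraction $\nu$ and the jar fraction $\mu$, that is, between the in-sample mean and the out-of-sample mean).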
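One practical use of the two-sided bound $P(|k/N - p| \ge \varepsilon) \le 2e^{-2\varepsilon^2 N}$ is to solve it for $N$: requiring the right-hand side to be at most some failure probability $\delta$ gives $N \ge \ln(2/\delta) / (2\varepsilon^2)$. The sketch below computes both quantities; the name `delta` and the example values are assumptions introduced for illustration.

```python
from math import ceil, exp, log

def hoeffding_two_sided(eps, n):
    """Upper bound on the probability that the sample fraction misses p by eps or more."""
    return 2 * exp(-2 * eps**2 * n)

def samples_needed(eps, delta):
    """Smallest N for which the two-sided bound is at most delta."""
    return ceil(log(2 / delta) / (2 * eps**2))

eps, delta = 0.05, 0.05  # hypothetical tolerance and failure probability
n = samples_needed(eps, delta)
print(f"draws needed: {n}")
print(f"bound at that N: {hoeffding_two_sided(eps, n):.4f}")
```

Note that the required number of draws does not depend on $p$ or on the size of the jar, only on the tolerance and on the confidence we ask for.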