# Feasibility of learning

These notes are written in English and refer to the fantastic Caltech online course you can find here: http://work.caltech.edu/telecourse.html. It is also worth noting you can find it on iTunes U, and if you have an iPad this makes a fantastic combo :-) The book is here: http://www.amazon.com/gp/product/1600490069. I make ample use of Yaser’s material, including linking the relevant lecture PDF in each note and using his pictures (this is much faster than drawing stuff myself). Please note this is just for personal use and I give ALL the due credit to him!

A disclaimer on my notes in general: they are meant for my personal use and, although I’d love to know they helped somebody else, they are not written with clarity for third parties in mind. However, if you do use them and you would like to point out any errors, please write me an email!

## Refining the theory: feasibility

RELEVANT SLIDES: Lecture 2

What the previous diagram lacks is an explanation of the feasibility of the learning process. What we would like is a mathematical tool that tells us how well we can learn from the data we have.

### Hoeffding's inequality

To do that, we need a mathematical tool: Hoeffding's inequality. I do not prove this formula here, but the derivation nonetheless carries some interesting insights. It would be too long to reproduce in this note, so just look at the note about the theory of Hoeffding's inequality.

Here, we are just content with looking at the formula:

$$P(|\nu - \mu| > \epsilon) \leq 2e^{-2\epsilon^2 m}$$

Notation: the notation here differs from the note about Hoeffding's inequality: we use $m$ instead of $n$, to be consistent with the notation I used before. You have $m$ training examples, and this is the same $m$ you plug into the formula.

This inequality is important because it puts an upper bound on the probability of a bad event (the in-sample mean being too far away from the out-of-sample mean we wanted to learn, i.e. our model sucks).
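As a sanity check, here is a small Monte Carlo sketch (all numbers are made-up assumptions, not from the course): we repeatedly draw samples of size $m$ from a bin with a known $\mu$ and count how often $|\nu - \mu| > \epsilon$, comparing the empirical frequency with the bound $2e^{-2\epsilon^2 m}$.

```python
import math
import random

random.seed(0)

mu = 0.6      # true fraction of green marbles in the bin (unknown in real learning)
m = 100       # sample size: number of marbles we draw
eps = 0.1     # tolerance on |nu - mu|
trials = 20000

bad = 0
for _ in range(trials):
    # nu: in-sample fraction of green marbles
    nu = sum(random.random() < mu for _ in range(m)) / m
    if abs(nu - mu) > eps:
        bad += 1

empirical = bad / trials
bound = 2 * math.exp(-2 * eps ** 2 * m)
print(f"empirical P(|nu-mu|>eps) = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```

The empirical frequency stays well below the bound, as it must: Hoeffding holds for any $\mu$ and is often quite loose.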

Now the question becomes: “How do I use this in learning?”.

### Applying Hoeffding to the learning process

The learning process involves using a learning algorithm to search the space $H$ (the hypothesis set), trying various functions $h$ until it finds what it thinks is the best one, calls it $g$, and outputs it.

However, the performance of $g$ does not depend on this process alone: it depends on the data as well (if not more!). For example, if you only have 1 training example, or if you have lots of useless data (for example, full of points that are all close to each other, but then you want to predict a point which is very far away), then there is no hypothesis set and no algorithm that will help you!

To help us formalize this, we modify our learning diagram a little bit and introduce a probability distribution that generates the training examples.

Note that the distribution can be anything, so it does not restrict our domain in any sense. Its importance is that we now have a way to quantify the qualitative difference in the previous example, and this is where Hoeffding's inequality kicks in: instead of thinking about balls in a jar, we now think about how well a function can predict outcomes given a training set generated by some probability distribution.

This time, the bin becomes the domain $X$ and each marble is therefore a point $\textbf{x} \in X$. We still define $\mu$ as the fraction of green marbles out of sample (in the whole of $X$ now) and $\nu$ as the fraction of green marbles in sample (of size $m$), but this time we change the meaning of the colors. Now a green ball means $h(\textbf{x}) = f(\textbf{x})$ and a red ball means $h(\textbf{x}) \neq f(\textbf{x})$, so we see $f$ and $h$ as responsible for the coloring of the balls, although not directly: neither of them colors a ball alone; you need both to determine the color. They do not appear directly in the process: they sit somewhere, like Platonic ideas, and together they influence the color of the marbles, which in turn gives us a value of $\mu$ for the whole bin and a value of $\nu$ in each sample we draw.

Notice that the agreement between $\mu$ and $\nu$ does not measure the accuracy of the model, but rather the accuracy of the TEST. In other words, we use a test set of limited size and find out its performance $\nu$. Then, Hoeffding helps us tell how indicative $\nu$ is of the out-of-sample performance $\mu$. Without this formalization, there can be no verification, let alone learning!

Summing up:

- all the marbles in the bin -> $X$
- each marble -> $\textbf{x} \in X$
- green ball -> $h(\textbf{x}) = f(\textbf{x})$
- red ball -> $h(\textbf{x}) \neq f(\textbf{x})$
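To make the ball-coloring concrete, here is a toy sketch. The domain, the target $f$, and the hypothesis $h$ are all made-up assumptions for illustration: we color each point of $X$ green when $h$ and $f$ agree, compute $\mu$ over the whole bin, and compute $\nu$ on a random sample.

```python
import random

random.seed(0)

# Toy domain X, plus a made-up target f and hypothesis h
X = list(range(1000))

def f(x):
    return x % 2 == 0        # the "unknown" target function (assumed here)

def h(x):
    return x % 4 in (0, 1)   # some hypothesis we are testing

# green ball: h(x) == f(x); red ball: h(x) != f(x)
mu = sum(h(x) == f(x) for x in X) / len(X)            # fraction of green in the whole bin
sample = random.sample(X, 50)                         # draw m = 50 marbles
nu = sum(h(x) == f(x) for x in sample) / len(sample)  # fraction of green in the sample

print(f"mu = {mu}, nu = {nu}")
```

Note that neither $f$ nor $h$ "is" the color: only their agreement on each point determines it, exactly as described above.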

However, this is not learning. Learning involves using an algorithm to search a space $H$ and try different functions $h \in H$. Here, we have already picked some specific function and are testing its performance on a sample, using maths to guarantee the accuracy of the test within some threshold we are willing to tolerate.

### Moving on to learning

Since one bin is clearly not enough, we use more bins. Each bin contains exactly the same marbles ($X$ is the same), but the colors will differ because each bin has a different function $h$ associated with it ($f$, of course, remains the same). For now, we assume the bins are limited in number¹: there are $M$ of them.

At this point, we cannot apply Hoeffding again to the whole process, because it is not a Bernoulli trial anymore. Each single bin is, but the totality of the bins isn’t.

Instead of relying on Hoeffding to find some upper bound on the bad probability (the probability of learning being useless because we trained on a bad set), we assume we are in the worst possible case.

Imagine we have a coin.

- If you toss it 10 times, the probability of getting 10 heads is P(10 heads on 10 tosses, 1 coin) $= (\frac{1}{2})^{10} \approx 0.1\%$
- If you instead toss 1000 coins 10 times each, what is the probability that at least one coin will get 10 heads? It is P(10 heads on 10 tosses, 1000 coins) $= 1 - [1 - (\frac{1}{2})^{10}]^{1000} \approx 62.4\%$
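The two coin probabilities can be checked directly (a one-off computation, not course code):

```python
p_one = 0.5 ** 10                   # P(10 heads on 10 tosses, 1 coin)
p_any = 1 - (1 - p_one) ** 1000     # P(at least one of 1000 coins gets 10 heads)

print(f"one coin: {p_one:.4%}, at least one of 1000 coins: {p_any:.1%}")
```

A once-in-a-thousand event per coin becomes more likely than not when you give it a thousand chances: this is exactly the effect a large hypothesis set has.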

The same reasoning applies to our learning problem: if I have a set $H$ containing 1000 functions $h$ to choose from, what is the probability that at least one of them sucks?

We follow the very same reasoning: we want to know the probability of at least one failing.

This can be bounded by the union bound², which intuitively says that the probability of at least one of $M$ events occurring is at most the sum of their individual probabilities (with equality when the events are disjoint):

$$P\left(\bigcup_{i=1}^{M} B_i\right) \leq \sum_{i=1}^{M} P(B_i)$$

Therefore, since the final $g$ is one of the $M$ hypotheses (which one depends on the data, so we must account for all of them):

$$P(|\nu_g - \mu_g| > \epsilon) \leq \sum_{i=1}^{M} P(|\nu_i - \mu_i| > \epsilon) \leq 2Me^{-2\epsilon^2 m}$$

where $\nu_i$ and $\mu_i$ are the in-sample and out-of-sample means for hypothesis $h_i$.

The conclusion may seem both awkward and obvious: the bigger the hypothesis set, the higher the probability that at least one function is badly misjudged by its sample. In the event that we have an infinite hypothesis set, this bound goes to infinity and tells us nothing.
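A small simulation sketch of this bound (parameters are made-up; in this toy version each bin even draws its own independent sample, whereas in learning all hypotheses share one data set, but the union bound holds either way): for growing $M$ we estimate the probability that at least one bin's $\nu$ deviates from $\mu$ by more than $\epsilon$, and compare with $2Me^{-2\epsilon^2 m}$, clipped at 1.

```python
import math
import random

random.seed(0)

m, eps, mu = 50, 0.2, 0.5   # sample size, tolerance, true green fraction
trials = 2000

results = []
for M in (1, 10, 100):       # number of bins (hypotheses)
    bad = 0
    for _ in range(trials):
        # a trial is "bad" if at least one of the M bins deviates by more than eps
        for _ in range(M):
            nu = sum(random.random() < mu for _ in range(m)) / m
            if abs(nu - mu) > eps:
                bad += 1
                break
    bound = min(1.0, 2 * M * math.exp(-2 * eps ** 2 * m))
    results.append((M, bad / trials, bound))

for M, empirical, bound in results:
    print(f"M = {M:4d}: empirical = {empirical:.3f} <= bound = {bound:.3f}")
```

Both the empirical probability and the bound grow with $M$; for large enough $M$ the bound saturates at 1 and, as noted above, tells us nothing.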

1. Later, this will be expanded to a possibly infinite set.

2. Formalization and proof: http://en.wikipedia.org/wiki/Union_bound