These notes are written in English and refer to the fantastic Caltech online course you can find here: http://work.caltech.edu/telecourse.html. It is also worth noting you can find it on iTunes U, and if you have an iPad this makes a fantastic combo :-) The book is here: http://www.amazon.com/gp/product/1600490069. I make ample use of Yaser’s material, including linking the relevant lecture PDF in each note and using his pictures (this is much faster than drawing stuff myself). Please note this is just for personal use and I give ALL the due credit to him!

A disclaimer on my notes in general: they are meant for my personal use and, although I’d love to know they helped somebody else, they are not written with clarity for third parties in mind. However, if you do use them and you would like to point out any errors, please write me an email!

#Learning and error

RELEVANT SLIDES: Lecture 4

The learning diagram as we left it:

Now, we are going to expand on it a lot more.

###Error measure: theory

The first part that was overlooked last time is our error measure (or performance measure). To us, “doing well” at the task of learning means finding some hypothesis $h$ for which $h \approx f$, where $f$ is the (unknown) target function. However, you can say a function is doing badly or well only with respect to a certain error metric.

Therefore, there must exist (in general) an **error measure** $E(h, f)$ which is 0 when $h = f$ and increases as the two functions grow “different”. However, this definition is too general: $E(h, f)$ is not a function itself, but a functional. It is not an easy object to deal with, and therefore we always resort to a *pointwise error metric*, $e(h(\mathbf{x}), f(\mathbf{x}))$, where $\mathbf{x}$ is a point.

Examples of pointwise error are the squared error in regression and the binary error in classification:

Squared error: $e(h(\mathbf{x}), f(\mathbf{x})) = (h(\mathbf{x}) - f(\mathbf{x}))^2$

Binary error: $e(h(\mathbf{x}), f(\mathbf{x})) = [\, h(\mathbf{x}) \neq f(\mathbf{x}) \,]$
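The two pointwise measures can be written as plain Python functions (a minimal sketch; the function names are my own, not from the course):

```python
def squared_error(h_x, f_x):
    """Squared error: e(h(x), f(x)) = (h(x) - f(x))^2, used in regression."""
    return (h_x - f_x) ** 2

def binary_error(h_x, f_x):
    """Binary error: 1 if h(x) != f(x), else 0, used in classification."""
    return 1 if h_x != f_x else 0

print(squared_error(0.8, 1.0))  # ≈ 0.04
print(binary_error(+1, -1))     # 1
print(binary_error(+1, +1))     # 0
```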

To obtain an overall error $E(h, f)$ we average the pointwise errors over all the examples we have.

In-sample error:

$$E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} e(h(\mathbf{x}_n), f(\mathbf{x}_n))$$

where each example $\mathbf{x}_n \in \mathbb{R}^d$ ($d$ is the number of features).

Out-of-sample error:

$$E_{\text{out}}(h) = \mathbb{E}_{\mathbf{x}}\left[ e(h(\mathbf{x}), f(\mathbf{x})) \right]$$

Notice that this time we don’t take a specific point from the training set; instead, we pick a completely generic point $\mathbf{x} \in \mathcal{X}$. Now, instead of computing an average over the sample, we compute the **expected value** of the error using the probability distribution that generates the examples (like we did in the previous article about feasibility).

The difference is subtle but crucial: $E_{\text{in}}$ is the *frequency* of errors (in the sample), while $E_{\text{out}}$ is the *probability* of error (over the input space $\mathcal{X}$).
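The frequency-vs-probability distinction can be made concrete with a small simulation. The target, hypothesis, and input distribution below are toy assumptions of mine, chosen so that $E_{\text{out}}$ has a known exact value:

```python
import random

random.seed(0)

def f(x):
    """Toy (normally unknown) target: the sign of x."""
    return 1 if x >= 0 else -1

def h(x):
    """A hypothesis with a slightly misplaced threshold."""
    return 1 if x >= 0.1 else -1

def binary_error(a, b):
    return 1 if a != b else 0

# E_in: the *frequency* of errors on the N sampled training points.
sample = [random.uniform(-1, 1) for _ in range(100)]
E_in = sum(binary_error(h(x), f(x)) for x in sample) / len(sample)

# E_out: the *probability* of error over the whole input space.
# With x uniform on [-1, 1], h disagrees with f exactly on [0, 0.1),
# so E_out = 0.1 / 2 = 0.05; a large Monte Carlo estimate approaches it.
big = [random.uniform(-1, 1) for _ in range(100_000)]
E_out_estimate = sum(binary_error(h(x), f(x)) for x in big) / len(big)

print(E_in, E_out_estimate)  # E_in fluctuates around 0.05 from sample to sample
```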

We can now add this to the learning diagram:

###Error measure in practice: precision and recall

The error measure need not be symmetric: one kind of mistake can cost far more than the other, which is what **precision** and **recall** capture. For example:

In case a supermarket uses this to quickly identify customers without needing them to use cards, we can assume that annoying a legitimate customer (a false reject) is far more costly than giving an undeserved discount to an intruder (a false accept): the supermarket should prefer **recall** to precision.

In case the CIA uses this, they definitely prefer to lock out an authorized officer a few times (and have another one try instead, if need be) than to have an intruder steal their secrets: here **precision** is what matters.
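The two scenarios amount to different costs for false accepts and false rejects. A sketch of such an asymmetric pointwise error (the cost numbers are made up for illustration, not from the lecture):

```python
# h and y take values +1 (accept / legitimate) and -1 (reject / intruder).

def error(h, y, false_accept_cost, false_reject_cost):
    """Asymmetric pointwise error: 0 if correct, otherwise the cost of
    the specific mistake that was made."""
    if h == y:
        return 0
    return false_accept_cost if h == +1 else false_reject_cost

# Supermarket: rejecting a good customer is the expensive mistake.
supermarket = lambda h, y: error(h, y, false_accept_cost=1, false_reject_cost=10)

# CIA: accepting an intruder is the expensive mistake.
cia = lambda h, y: error(h, y, false_accept_cost=1000, false_reject_cost=1)

print(supermarket(-1, +1))  # 10: turned away a legitimate customer
print(cia(+1, -1))          # 1000: let an intruder in
```

Minimizing the averaged error then automatically pushes the learned hypothesis toward the behavior each client wants.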

The take-home lesson is that **the error measure should be specified by the user**. Only the user knows what they really want.

However, when this is not possible, you can resort to:

**Plausible** measures: e.g. squared error, which is the best choice when the noise is Gaussian.

**Friendly** measures: e.g. measures that admit a closed-form solution or lead to a convex optimization problem.
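As an illustration of why squared error is “friendly”: for a linear model it admits a closed-form minimizer. A minimal one-feature sketch (the data points are made up; for a single feature the normal equations reduce to the classic slope/intercept formulas):

```python
# Fit y ≈ w0 + w1*x by minimizing the in-sample squared error.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x, so the fit is exact

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form solution: no iterative search needed.
w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w1 * mean_x

print(w0, w1)  # 1.0 2.0
```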

This causes another addition to the learning diagram:

###Noisy data

Machine Learning deals with real data, which often involves people. This means that there can be inconsistencies we may have to deal with.

For example, if we were to build a machine learning system for credit card approval, we may have two “identical” customers (two customers whose features have the very same values) and still have one labeled as “approved” and the other labeled as “rejected”. After all, two different people made the two decisions, and maybe it was a close call. If this is the case, then we have two different output values for the same point! This means that the target is not really a function!

The theory can be (slightly) stretched to accommodate these problems.

Instead of a target function $f: \mathcal{X} \to \mathcal{Y}$, we now think of a *target distribution* $P(y \mid \mathbf{x})$.

The pair of example and output $(\mathbf{x}, y)$ is now generated by the joint distribution $P(\mathbf{x}) P(y \mid \mathbf{x})$.

NOTE: My website does not support a single bar | inside math mode. Just consider as being |.

Our previous deterministic target is included in this theory by writing $y = f(\mathbf{x}) + \epsilon(\mathbf{x})$, where $\epsilon$ is the noise “responsible” for the confusion. We also define $f(\mathbf{x}) = \mathbb{E}[y \mid \mathbf{x}]$, so that $\mathbb{E}[\epsilon(\mathbf{x}) \mid \mathbf{x}] = 0$ (in other terms, the noise is zero-averaged).

The deterministic case is still there: it is sufficient to take $P(y \mid \mathbf{x})$ to be zero everywhere except at $y = f(\mathbf{x})$.
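The decomposition can be checked numerically: sample $y$ from a toy target distribution and verify that averaging $y$ at a fixed $\mathbf{x}$ recovers $f(\mathbf{x})$. Everything below (the target, the noise level) is my own toy assumption:

```python
import random

random.seed(1)

def f(x):
    """Deterministic part of the target: f(x) = E[y | x]."""
    return 2 * x + 1

def sample_y(x):
    """Draw y from a toy P(y|x): f(x) plus zero-mean Gaussian noise."""
    return f(x) + random.gauss(0, 0.5)

x = 3.0
ys = [sample_y(x) for _ in range(50_000)]
estimate = sum(ys) / len(ys)

print(estimate)  # close to f(3.0) = 7.0, since the noise averages to zero
```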

Guess what? This changes the learning diagram once more:

There is an important distinction between these distributions: the target distribution $P(y \mid \mathbf{x})$ is what we are trying to learn, while the input distribution $P(\mathbf{x})$ only quantifies the relative importance of each point $\mathbf{x}$; we are not trying to learn $P(\mathbf{x})$.

###Wrapping up

Let us wrap up what we know so far.

The aim of the game of learning is to find a function $g$ such that:

$$g \approx f$$

(never mind the noise for now)

This in turn means that we want:

$$E_{\text{out}}(g) \approx 0$$

$E_{\text{out}}(g) \approx 0$ is achieved through:

$$E_{\text{out}}(g) \approx E_{\text{in}}(g) \quad \text{and} \quad E_{\text{in}}(g) \approx 0$$

The first point is achieved through sound statistics about the significance of the training set, while the second is achieved through a good learning algorithm.

Learning is thus split into two questions:

- Can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$? This is also called **generalization**.
- Can we make $E_{\text{in}}(g)$ small enough? This is also called the **fitness** of the model.

###Next steps