Machine Learning

Konstantinos Perrakis

Training Error, Prediction Error and Model Selection Evaluation

Training Error

The training error of a linear model is evaluated via the residual sum of squares which, as we have seen, is given by \[RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,\] where \[\hat{y}_i = \hat{\beta}_0+\hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2}+ \ldots + \hat{\beta}_px_{ip}.\] Another very common metric is the mean squared error, which is simply a scaled-down version of \(RSS\): \[MSE = \frac{RSS}{n}.\] Neither of these metrics is appropriate when comparing models of different dimensionality (different numbers of predictors)!
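
As a minimal sketch (using simulated data and variable names chosen purely for illustration), \(RSS\) and \(MSE\) can be computed from a least-squares fit as follows:

```python
import numpy as np

# Minimal sketch with assumed simulated data: fit OLS and compute training RSS and MSE.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
beta_true = np.array([1.0, 1.5, -0.5, 0.0])
y = X @ beta_true + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates
y_hat = X @ beta_hat                              # fitted values
RSS = np.sum((y - y_hat) ** 2)                    # residual sum of squares
MSE = RSS / n                                     # training mean squared error
print(RSS, MSE)
```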

Overfitting

A hypothetical example

Let us assume a hypothetical scenario where the true model is \[y= \beta_0 +\beta_1 x_1 +\beta_2 x_2 +\epsilon,\] where \(\beta_0 = 1\), \(\beta_1 = 1.5\), \(\beta_2 = -0.5\) for some given values of the predictors \(x_1\) and \(x_2\). Let us also say that the errors are normally distributed with zero mean and variance equal to one; i.e., \(\epsilon \sim N(0,1)\).
We will further assume that we have three additional predictors \(x_3\), \(x_4\) and \(x_5\) which are irrelevant to \(y\) (in the sense that \(\beta_3=\beta_4=\beta_5=0\)), but of course we do not know that beforehand.
This leads to the example we have previously seen: \[y= \beta_0 + \underset{\color{green} \mbox{relevant}}{{\color{green}\beta_1x_1} + {\color{green}\beta_2x_2}} + \underset{\color{red} \mbox{irrelevant}}{{\color{red}\beta_3x_3} +{\color{red}\beta_4x_4} +{\color{red}\beta_5x_5}} +\epsilon.\]

Hypothetical example: simulation experiment

Training vs. Prediction MSE (\(n_{train} = n_{pred} = 10\)) as we add predictors...
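
A rough sketch of how such a simulation might be coded is given below; the seed, predictor values and exact experimental design are illustrative assumptions rather than the setup behind the plotted results.

```python
import numpy as np

# Sketch of the experiment: x1, x2 are relevant, x3-x5 are irrelevant noise predictors.
rng = np.random.default_rng(1)
n_train = n_pred = 10
b0, b1, b2 = 1.0, 1.5, -0.5                        # true coefficients from the example

X_tr = rng.normal(size=(n_train, 5))
X_pr = rng.normal(size=(n_pred, 5))
y_tr = b0 + b1 * X_tr[:, 0] + b2 * X_tr[:, 1] + rng.normal(size=n_train)
y_pr = b0 + b1 * X_pr[:, 0] + b2 * X_pr[:, 1] + rng.normal(size=n_pred)

for d in range(1, 6):                              # add predictors one at a time
    A_tr = np.column_stack([np.ones(n_train), X_tr[:, :d]])
    A_pr = np.column_stack([np.ones(n_pred), X_pr[:, :d]])
    b_hat, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    mse_train = np.mean((y_tr - A_tr @ b_hat) ** 2)   # keeps decreasing as d grows
    mse_pred = np.mean((y_pr - A_pr @ b_hat) ** 2)    # eventually increases (overfitting)
    print(d, round(mse_train, 3), round(mse_pred, 3))
```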

What about \(R^2\)?

The “R square” is another metric for evaluating model fit \[R^2 = 1 - \frac{RSS}{TSS},\] where \(TSS=\sum_{i=1}^{n} (y_i-\bar{y})^2\) is the total sum of squares. As we know \(R^2\) ranges from 0 (worst fit) to 1 (perfect fit).


Since \(TSS\) is constant for a given dataset, \(R^2\) is simply a decreasing function of \(RSS\), so as we add further predictors \(R^2\) will just keep increasing...
In the case of \(n=p\) we would have an \(R^2\) of 1 (complete overfitting).
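
A small sketch illustrating this on simulated data (the data-generating details are assumptions for illustration only): with \(TSS\) fixed, \(RSS\) can only decrease, so \(R^2\) is non-decreasing as predictors are added.

```python
import numpy as np

# Sketch: R^2 can only increase as (even irrelevant) predictors are added.
rng = np.random.default_rng(2)
n = 10
X = rng.normal(size=(n, 5))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
TSS = np.sum((y - y.mean()) ** 2)                  # fixed for this dataset

for d in range(1, 6):
    A = np.column_stack([np.ones(n), X[:, :d]])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    RSS = np.sum((y - A @ b) ** 2)
    print(d, round(1 - RSS / TSS, 3))              # R^2 is non-decreasing in d
```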

So, how to choose which model?

Given that in practice the actual prediction error is unknown, we have two choices:

1. indirectly estimate the prediction error by adjusting the training error so that it accounts for model size (model selection criteria), or
2. directly estimate the prediction error using a validation set or cross-validation.

Under the first strategy we have several model selection criteria at our disposal.
The second strategy is computational in nature.

Model Selection Criteria: \(Cp\) and \(AIC\)

For a given model with \(d\) predictors (out of the available \(p\) predictors) Mallows’ \(Cp\) criterion is \[Cp = \frac{1}{n}(RSS + 2 d \hat{\sigma}^2),\] where \(\hat{\sigma}^2\) is an estimate of the error variance, typically obtained from the full model containing all \(p\) predictors (i.e., \(\hat{\sigma}^2=RSS_{full}/(n-p-1)\)), so that the same \(\hat{\sigma}^2\) is used for every candidate model.
In practice, we choose the model with the minimum \(Cp\) value: models of higher dimensionality are penalised more heavily (the larger \(d\) is, the greater the penalty).
For linear models Mallows’ \(Cp\) is equivalent to the Akaike information criterion (AIC) (as the two are proportional) which is given by \[AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2 d \hat{\sigma}^2).\]
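
A sketch of how \(Cp\) and AIC could be computed over a sequence of nested candidate models, assuming \(\hat{\sigma}^2\) is estimated from the full model and using simulated data for illustration:

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of an OLS fit (A includes the intercept column)."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

# Sketch: Cp and AIC for nested models with d = 1, ..., p predictors (simulated data).
rng = np.random.default_rng(3)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Error variance estimated from the full model with all p predictors (assumption, see text).
sigma2 = rss(np.column_stack([np.ones(n), X]), y) / (n - p - 1)

for d in range(1, p + 1):
    A = np.column_stack([np.ones(n), X[:, :d]])
    r = rss(A, y)
    Cp = (r + 2 * d * sigma2) / n
    AIC = (r + 2 * d * sigma2) / (n * sigma2)      # proportional to Cp for linear models
    print(d, round(Cp, 3), round(AIC, 3))
```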

Model Selection Criteria: \(BIC\)

Another metric is the Bayesian information criterion (BIC) \[BIC = \frac{1}{n\hat{\sigma}^2}(RSS + \log(n)d \hat{\sigma}^2),\] where again the model with the minimum BIC value is selected.
In comparison to \(Cp\)/AIC, where the penalty is \(2d \hat{\sigma}^2\), the BIC penalty is \(\log(n)d \hat{\sigma}^2\): since \(\log(n)>2\) for \(n>7\), BIC generally imposes a heavier penalty \(\implies\) BIC selects models with fewer variables than \(Cp\)/AIC, i.e. BIC leads to sparser models.
In general, all three criteria rest on rigorous asymptotic (\(n\rightarrow \infty\)) theoretical justifications: \(Cp\)/AIC are unbiased estimators of the true prediction MSE, while BIC is based on well-justified Bayesian arguments.
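
For completeness, a self-contained sketch of BIC over the same kind of nested models (again with simulated data and a full-model estimate of \(\hat{\sigma}^2\) as assumptions); the only change from \(Cp\)/AIC is the \(\log(n)\) factor in the penalty:

```python
import numpy as np

# Sketch: BIC for nested models; heavier penalty than Cp/AIC since log(n) > 2 for n > 7.
rng = np.random.default_rng(3)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

def rss(A, y):
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ b) ** 2)

sigma2 = rss(np.column_stack([np.ones(n), X]), y) / (n - p - 1)  # full-model estimate (assumption)
for d in range(1, p + 1):
    A = np.column_stack([np.ones(n), X[:, :d]])
    BIC = (rss(A, y) + np.log(n) * d * sigma2) / (n * sigma2)
    print(d, round(BIC, 3))                        # choose the model with minimum BIC
```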

Another model selection technique: adjusted \(R^2\)

A simple method, which is not backed by statistical theory but often works well in practice, is to adjust the \(R^2\) metric so that it takes the number of predictors into account.
The adjusted \(R^2\) for a model with \(d\) variables is calculated as follows \[\mbox{Adjusted~} R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}.\]
In this case we choose the model with the maximum adjusted \(R^2\) value.
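
A short sketch of adjusted \(R^2\) across nested models (simulated data, for illustration only), showing how the \(n-d-1\) divisor penalises unnecessary predictors:

```python
import numpy as np

# Sketch: adjusted R^2 penalises extra predictors through the n - d - 1 divisor.
rng = np.random.default_rng(4)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
TSS = np.sum((y - y.mean()) ** 2)

for d in range(1, p + 1):
    A = np.column_stack([np.ones(n), X[:, :d]])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    RSS = np.sum((y - A @ b) ** 2)
    adj_r2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
    print(d, round(adj_r2, 3))                     # choose the model with maximum adjusted R^2
```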

Back to the example

Here \(Cp\)/AIC/BIC are in agreement (2 predictors). Adjusted \(R^2\) would select 3 predictors.

Direct estimation of prediction error

Validation – a reminder

The approach is based on the following simple idea: randomly split the available data into a training set and a validation (hold-out) set, fit the model on the training set, and estimate the prediction error from how well the fitted model predicts the validation set.
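
A minimal sketch of the validation-set idea (the 70/30 split and the simulated data are illustrative assumptions):

```python
import numpy as np

# Sketch: hold out part of the data to estimate prediction error directly.
rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

idx = rng.permutation(n)
train, valid = idx[:70], idx[70:]                  # e.g. a 70/30 split (assumption)

A_tr = np.column_stack([np.ones(len(train)), X[train]])
A_va = np.column_stack([np.ones(len(valid)), X[valid]])
b, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
val_mse = np.mean((y[valid] - A_va @ b) ** 2)      # estimated prediction error
print(val_mse)
```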

Cross-validation with 5 folds – a visualisation

[Figure: schematic of cross-validation with 5 folds. Image from http://ethen8181.github.io/machine-learning/model_selection/model_selection.html]
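
A compact sketch of a 5-fold cross-validation estimate for a single model (the fold construction and simulated data are illustrative assumptions, not the implementation behind the figure):

```python
import numpy as np

# Sketch: 5-fold cross-validation estimate of prediction MSE for one model.
rng = np.random.default_rng(6)
n, p, K = 100, 5, 5
X = rng.normal(size=(n, p))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), K)      # K roughly equal folds
mse = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    A_tr = np.column_stack([np.ones(len(train)), X[train]])
    A_te = np.column_stack([np.ones(len(test)), X[test]])
    b, *_ = np.linalg.lstsq(A_tr, y[train], rcond=None)
    mse.append(np.mean((y[test] - A_te @ b) ** 2))
print(np.mean(mse))                                # CV estimate of the prediction error
```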

Next topic

Next we will discuss model selection methods and how to evaluate results based on the methods and techniques presented today.