Training Error
The training error of a linear model is evaluated via the residual sum of squares which as we have seen is given by \[RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,\] where \[\hat{y}_i = \hat{\beta_0}+\hat{\beta_1}x_{i1} + \hat{\beta_2}x_{i2}+ \ldots + \hat{\beta_p}x_{ip}.\] Another very common metric is the mean squared error which is simply a scaled down version of \(RSS\) \[MSE = \frac{RSS}{n}.\] Neither of these metrics is appropriate when comparing models of different dimensionality (different number of predictors)!
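As a rough illustration, here is a minimal Python sketch (toy data, arbitrary seed; not taken from the lecture) of fitting a linear model by least squares and computing \(RSS\) and \(MSE\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(size=n)

# Add an intercept column and fit by ordinary least squares.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

y_hat = X1 @ beta_hat
RSS = np.sum((y - y_hat) ** 2)   # residual sum of squares
MSE = RSS / n                    # mean squared error
print(f"RSS = {RSS:.3f}, MSE = {MSE:.3f}")
```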
Overfitting
The reason: both \(RSS\) and \(MSE\) generally decrease as we add predictors to a linear model.
In fact, in the extreme case of \(n=p\) both metrics will be 0!
This does not mean that we have a “good” model; we have simply “overfitted” our linear model so that it adjusts perfectly to the training data.
Overfitted models exhibit poor predictive performance (low training error but high prediction error).
Our aim is to have simple, interpretable models with relatively small \(p\) (in relation to \(n\)), which have good predictive performance.
A hypothetical example
Let us assume a hypothetical scenario where the true model is \[y= \beta_0 +\beta_1 x_1 +\beta_2 x_2 +\epsilon,\] where \(\beta_0 = 1\), \(\beta_1 = 1.5\), \(\beta_2 = -0.5\) for some given values of the predictors \(x_1\) and \(x_2\). Let's also say that the errors are normally distributed with zero mean and variance equal to one; i.e., \(\epsilon \sim N(0,1)\).
We will further assume that we have three additional predictors \(x_3\), \(x_4\) and \(x_5\) which are irrelevant to \(y\) (in the sense that \(\beta_3=\beta_4=\beta_5=0\)), but of course we do not know that beforehand.
This leads to the example we have previously seen: \[y= \beta_0 + \underset{\color{green} \mbox{relevant}}{{\color{green}\beta_1x_1} + {\color{green}\beta_2x_2}} + \underset{\color{red} \mbox{irrelevant}}{{\color{red}\beta_3x_3} +{\color{red}\beta_4x_4} +{\color{red}\beta_5x_5}} +\epsilon.\]
Hypothetical example: simulation experiment
Training vs. Prediction MSE (\(n_{train} = n_{pred} = 10\)) as we add predictors...
Training error: steady (negligible) decline after adding \(x_3\) – not really obvious how many predictors to include.
Prediction error: clearly here one would select the model with \(x_1\) and \(x_2\)!
Prediction errors are larger than training errors – this is almost always the case!
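A rough Python sketch of this kind of simulation experiment, assuming the setup above (\(n_{train} = n_{pred} = 10\), true coefficients \(1, 1.5, -0.5, 0, 0, 0\)); the exact numbers in the figure will of course differ:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = n_pred = 10
beta = np.array([1.0, 1.5, -0.5, 0.0, 0.0, 0.0])  # intercept + 5 slopes

def simulate(n):
    X = rng.normal(size=(n, 5))
    y = beta[0] + X @ beta[1:] + rng.normal(size=n)
    return X, y

X_tr, y_tr = simulate(n_train)   # training sample
X_te, y_te = simulate(n_pred)    # prediction (test) sample

# Fit nested models using the first d predictors and compare the two MSEs.
for d in range(1, 6):
    Xd_tr = np.column_stack([np.ones(n_train), X_tr[:, :d]])
    Xd_te = np.column_stack([np.ones(n_pred), X_te[:, :d]])
    b, *_ = np.linalg.lstsq(Xd_tr, y_tr, rcond=None)
    mse_tr = np.mean((y_tr - Xd_tr @ b) ** 2)
    mse_te = np.mean((y_te - Xd_te @ b) ** 2)
    print(f"d={d}: training MSE={mse_tr:.2f}, prediction MSE={mse_te:.2f}")
```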
What about \(R^2\)?
The “R square” is another metric for evaluating model fit \[R^2 = 1 - \frac{RSS}{TSS},\] where \(TSS=\sum_{i=1}^{n} (y_i-\bar{y})^2\) is the total sum of squares. As we know \(R^2\) ranges from 0 (worst fit) to 1 (perfect fit).
\(R^2\) is a decreasing function of \(RSS\) (since \(TSS\) is a constant), so as we add further predictors \(R^2\) will only increase...
In the case of \(n=p\) we would have an \(R^2\) of 1 (overfitting).
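A small sketch on toy data (arbitrary simulated example) showing that \(R^2\) is non-decreasing as predictors are added, even when the extra predictors are irrelevant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 5))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

TSS = np.sum((y - y.mean()) ** 2)
for d in range(1, 6):
    Xd = np.column_stack([np.ones(n), X[:, :d]])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    RSS = np.sum((y - Xd @ b) ** 2)
    print(f"d={d}: R^2 = {1 - RSS / TSS:.4f}")  # non-decreasing in d
```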
So, how to choose which model?
Given that in practice the actual prediction error is unknown we have two choices:
Indirectly estimate prediction error by making an adjustment to the training error that accounts for overfitting.
Directly estimate prediction error using validation or cross-validation techniques.
Under the first strategy, several model selection criteria are available.
The second strategy is computational in nature.
Model Selection Criteria: \(Cp\) and \(AIC\)
For a given model with \(d\) predictors (out of the available \(p\) predictors) Mallows’ \(Cp\) criterion is \[Cp = \frac{1}{n}(RSS + 2 d \hat{\sigma}^2),\] where \(\hat{\sigma}^2\) is an estimate of the error variance, typically obtained from the full model containing all \(p\) predictors (i.e., \(\hat{\sigma}^2=RSS_{full}/(n-p-1)\)).
In practice, we choose the model which has the minimum \(Cp\) value: models of higher dimensionality are penalised more heavily (the larger \(d\) is, the greater the penalty).
For linear models Mallows’ \(Cp\) is equivalent to the Akaike information criterion (AIC) (as the two are proportional) which is given by \[AIC = \frac{1}{n\hat{\sigma}^2}(RSS + 2 d \hat{\sigma}^2).\]
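A minimal sketch of these two formulas in Python; the numbers passed in are purely illustrative, and `sigma2_hat` is assumed to be the error-variance estimate from the full model, as noted above:

```python
def cp_aic(RSS, n, d, sigma2_hat):
    """Mallows' Cp and (ISLR-style) AIC for a model with d predictors."""
    Cp = (RSS + 2 * d * sigma2_hat) / n
    AIC = (RSS + 2 * d * sigma2_hat) / (n * sigma2_hat)
    return Cp, AIC

# Illustrative values only.
print(cp_aic(RSS=28.0, n=30, d=2, sigma2_hat=1.1))
```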
Model Selection Criteria: \(BIC\)
Another metric is the Bayesian information criterion (BIC) \[BIC = \frac{1}{n\hat{\sigma}^2}(RSS + \log(n)d \hat{\sigma}^2),\] where again the model with the minimum BIC value is selected.
In comparison to \(Cp\)/AIC, where the penalty is \(2d \hat{\sigma}^2\), the BIC penalty is \(\log(n)d \hat{\sigma}^2\): this means that generally BIC has a heavier penalty (because \(\log(n)>2\) for \(n>7\)) \(\implies\) BIC selects models with fewer variables than \(Cp\)/AIC. BIC leads to sparser models.
In general all three criteria are based on rigorous theoretical asymptotic (\(n\rightarrow \infty\)) justifications: \(Cp\)/AIC are unbiased estimators of the true prediction MSE, while BIC is based on well-justified Bayesian arguments.
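A short sketch of choosing a model by minimum BIC; the \(RSS\) values per model size and the variance estimate are hypothetical, for illustration only:

```python
import numpy as np

def bic(RSS, n, d, sigma2_hat):
    """(ISLR-style) BIC for a model with d predictors."""
    return (RSS + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)

# Hypothetical RSS values for nested models with d = 1,...,5 predictors.
n, sigma2_hat = 30, 1.1
rss_by_d = {1: 60.0, 2: 28.0, 3: 27.5, 4: 27.1, 5: 27.0}
best_d = min(rss_by_d, key=lambda d: bic(rss_by_d[d], n, d, sigma2_hat))
print("BIC selects d =", best_d)
```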
Another model selection technique: adjusted \(R^2\)
A simple method which is not backed by statistical theory, but often works well in practice, is to adjust the \(R^2\) metric by taking into account the number of predictors.
The adjusted \(R^2\) for a model with \(d\) variables is calculated as follows \[\mbox{Adjusted~} R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}.\]
In this case we choose the model with the maximum adjusted \(R^2\) value.
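A sketch of the adjusted \(R^2\) formula above (the inputs are placeholder numbers); the model with the largest value is chosen:

```python
def adjusted_r2(RSS, TSS, n, d):
    """Adjusted R^2 for a model with d predictors."""
    return 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))

# Illustrative values only.
print(adjusted_r2(RSS=28.0, TSS=150.0, n=30, d=2))
```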
Back to the example
Here \(Cp\)/AIC/BIC are in agreement (2 predictors). Adjusted \(R^2\) would select 3 predictors.
Direct estimation of prediction error
A shortcoming of the previous approaches is that they are not applicable for models that have more variables than the size of the sample (\(d>n\)).
Also, in the setting of penalised regression (which we will see later in the course), deciding what \(d\) actually is becomes a bit problematic.
Validation and cross-validation (which you have seen in ISDS) are effective computational tools which can be used in all settings.
Validation – a reminder
The approach is based on the following simple idea:
Split the data into a training sample (\(y, x_1,\ldots,x_d\)) of size \(n\) and a test/validation sample (\(y^*, x_1^*,\ldots,x_d^*\)) of size \(n^*\)
Train the model using the training sample to get \(\hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_d\)
Use the validation sample to get predictions: \(\hat{y}^*= \hat{\beta}_0+\hat{\beta}_1x_1^*+\ldots+\hat{\beta}_dx_d^*\)
Calculate the validation \(RSS=\sum_{i=1}^{n^*}(y_i^*-\hat{y}_i^*)^2\)
This procedure is performed for all possible models and the model with the smallest validation RSS is selected.
A common option is to use 2/3 of the sample for training and 1/3 for testing/validating.
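A minimal Python sketch of this validation approach on simulated data (toy example with a 2/3–1/3 split; the data-generating step is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
X = rng.normal(size=(n, 5))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Random 2/3 training, 1/3 validation split.
idx = rng.permutation(n)
train, valid = idx[: 2 * n // 3], idx[2 * n // 3 :]

def validation_rss(cols):
    """Fit OLS on the training sample and score on the validation sample."""
    Xtr = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
    Xva = np.column_stack([np.ones(len(valid)), X[np.ix_(valid, cols)]])
    b, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
    return np.sum((y[valid] - Xva @ b) ** 2)

# Compare nested models with the first d predictors.
for d in range(1, 6):
    print(f"d={d}: validation RSS = {validation_rss(list(range(d))):.2f}")
```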
One criticism is that we do not use the entire sample for training: this can be problematic especially when the sample size is small to begin with.
Solution: K-fold cross-validation.
Cross-validation with 5 folds – a visualisation
Image from http://ethen8181.github.io/machine-learning/model_selection/model_selection.html
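Along the lines of the visualisation above, a short sketch of 5-fold cross-validation for a linear model with the first \(d\) predictors (simulated toy data; the CV error averages the fold-wise validation MSEs):

```python
import numpy as np

def cv_mse(X, y, d, K=5, seed=0):
    """K-fold cross-validated MSE for the model with the first d predictors."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errs = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        Xtr = np.column_stack([np.ones(len(train)), X[train][:, :d]])
        Xva = np.column_stack([np.ones(len(valid)), X[valid][:, :d]])
        b, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.append(np.mean((y[valid] - Xva @ b) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = 1 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=50)
print([round(cv_mse(X, y, d), 2) for d in range(1, 6)])
```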
Next topic
Next we will discuss model selection methods and how to evaluate results based on the methods and techniques presented today.