In this workshop we will start with an analysis of the Hitters dataset included in package ISLR. This dataset consists of 322 records of baseball players. The response variable is the players' salary and the number of predictors is 19, including variables such as number of hits, years in the league and so forth.
The first thing to do is to load the data, so we have to install package ISLR in R by typing install.packages('ISLR') (if you have already installed the package ignore this). Next we load the library and use help(), which will open the help window in RStudio with information about the dataset.
library(ISLR)
help(Hitters)
To get a clearer picture of how the data look we can use the commands names(), head() and dim(). The first returns the names of the 20 variables, the second returns the first 6 rows (type help(head) to see how to adjust this) and the third gives us the number of rows (sample size) and number of columns (1 response + 19 predictors).
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
dim(Hitters)
## [1] 322 20
Prior to proceeding to any analysis we must check whether the data contain any missing values. To do this we can use a combination of the commands sum() and is.na(). We find that we have 59 missing Salary entries. To remove these 59 rows we make use of the command na.omit() and then double-check.
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
Now we are ready to start with best subset selection. There are several packages in R which perform best subset as well as stepwise selection. We will use the function regsubsets() from library leaps (again, install.packages('leaps') is needed if not already installed). We will store the results from best subset selection in an object called best (any name would do) and summarize the results via summary(), which outputs the best models of different dimensionality.
library(leaps)
best = regsubsets(Salary~., Hitters)
summary(best)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
The asterisks * indicate which variables are included in each model. For instance, we see that the best model with one predictor contains only CRBI, while for the model with four predictors the selected variables are Hits, CRBI, DivisionW and PutOuts. As we see, the summary() command displayed output only up to the best model with 8 predictors; this is the default option in regsubsets(). To change this we can use the extra argument nvmax. Below we re-run the command for all best models (19 in total). Now we also store the output of summary() in an object called results and further investigate what it contains via names(). As we see, it holds a lot of useful information! Specifically, it returns \(R^2\), RSS, adjusted-\(R^2\), \(C_p\) and BIC. We can easily extract these quantities; for instance, below we see the \(R^2\) values of the 19 models.
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
results = summary(best)
names(results)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
results$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
Let's store these quantities as separate objects. We can also stack them together into one matrix, as done below, and inspect the values.
RSS = results$rss
r2 = results$rsq
Cp = results$cp
BIC = results$bic
Adj_r2 = results$adjr2
cbind(RSS, r2, Cp, BIC, Adj_r2)
## RSS r2 Cp BIC Adj_r2
## [1,] 36179679 0.3214501 104.281319 -90.84637 0.3188503
## [2,] 30646560 0.4252237 50.723090 -128.92622 0.4208024
## [3,] 29249297 0.4514294 38.693127 -135.62693 0.4450753
## [4,] 27970852 0.4754067 27.856220 -141.80892 0.4672734
## [5,] 27149899 0.4908036 21.613011 -144.07143 0.4808971
## [6,] 26194904 0.5087146 14.023870 -147.91690 0.4972001
## [7,] 25906548 0.5141227 13.128474 -145.25594 0.5007849
## [8,] 25136930 0.5285569 7.400719 -147.61525 0.5137083
## [9,] 24814051 0.5346124 6.158685 -145.44316 0.5180572
## [10,] 24500402 0.5404950 5.009317 -143.21651 0.5222606
## [11,] 24387345 0.5426153 5.874113 -138.86077 0.5225706
## [12,] 24333232 0.5436302 7.330766 -133.87283 0.5217245
## [13,] 24289148 0.5444570 8.888112 -128.77759 0.5206736
## [14,] 24248660 0.5452164 10.481576 -123.64420 0.5195431
## [15,] 24235177 0.5454692 12.346193 -118.21832 0.5178661
## [16,] 24219377 0.5457656 14.187546 -112.81768 0.5162219
## [17,] 24209447 0.5459518 16.087831 -107.35339 0.5144464
## [18,] 24201837 0.5460945 18.011425 -101.86391 0.5126097
## [19,] 24200700 0.5461159 20.000000 -96.30412 0.5106270
From lectures we know that RSS should steadily decrease and \(R^2\) should steadily increase as we add predictors. Let's check that by plotting the values of RSS and \(R^2\). We would like to have the two plots next to each other, so we will make use of the command par(mfrow = c(1, 2)) before calling the command plot().
par(mfrow = c(1, 2))
plot(RSS, xlab = "Number of Predictors", ylab = "RSS", type = "l", lwd = 2)
plot(r2, xlab = "Number of Predictors", ylab = "R-square", type = "l", lwd = 2)
Results are as expected. Above, the argument type = 'l' is used so that lines appear (otherwise by default R would simply plot the points) and the argument lwd controls the thickness of the lines. Plots in R are very customisable (type ?par to see all available options).
Now let us find the optimal models under \(C_p\), BIC and adjusted-\(R^2\) and produce plots similar to the ones we have seen in lectures.
which.min(Cp)
## [1] 10
which.min(BIC)
## [1] 6
which.max(Adj_r2)
## [1] 11
par(mfrow = c(1, 3))
plot(Cp, xlab = "Number of Predictors", ylab = "Cp", type = 'l', lwd = 2)
points(10, Cp[10], col = "red", cex = 2, pch = 8, lwd = 2)
plot(BIC, xlab = "Number of Predictors", ylab = "BIC", type = 'l', lwd = 2)
points(6, BIC[6], col = "red", cex = 2, pch = 8, lwd = 2)
plot(Adj_r2, xlab = "Number of Predictors", ylab = "Adjusted RSq", type = "l", lwd = 2)
points(11, Adj_r2[11], col = "red", cex = 2, pch = 8, lwd = 2)
\(C_p\) and adjusted-\(R^2\) select 10 and 11 predictors, respectively, while the optimal BIC model has only 6 predictors. In the code above we use the command points() to highlight the optimal model; the first two arguments of this command correspond to the x-axis and y-axis co-ordinates. So, for example, we know that the optimal \(C_p\) model is the one with 10 predictors; therefore the first argument is set to 10 and the second argument is set to Cp[10] (i.e., \(C_p\) at its minimum).
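Rather than hard-coding the positions of the optima, we can also locate them programmatically; a small sketch for the \(C_p\) panel (the other two panels work the same way):

```r
# Highlight the Cp optimum without hard-coding its position:
# which.min(Cp) gives the x-coordinate, min(Cp) the y-coordinate.
plot(Cp, xlab = "Number of Predictors", ylab = "Cp", type = "l", lwd = 2)
points(which.min(Cp), min(Cp), col = "red", cex = 2, pch = 8, lwd = 2)
```

This way the highlighted point stays correct even if the data (and hence the optimal model size) change.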
The regsubsets() function has a built-in plotting function, which is very convenient when the output from summary() is difficult to read due to many variables. Below we see the visualisation for the BIC models. Remark: because previously we used par(mfrow = c(1, 3)) we first have to close the 1-by-3 window splitting by typing dev.off() (this returns the plotting window to its default state).
plot(best, scale = "bic")
The top row corresponds to the best model and the bottom row to the worst model under the chosen criterion. Black squares correspond to variables that are included in each model; white squares correspond to variables that are excluded.
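The same plotting function accepts other criteria via the scale argument; for example, using the best object fitted above:

```r
# Visualise the model rankings under Cp and adjusted R^2 instead of BIC
plot(best, scale = "Cp")
plot(best, scale = "adjr2")
```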
Finally, we can extract the model coefficients for any model we choose via the command coef(). For instance, the best models under each approach are the following.
coef(best,10) # Cp
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
coef(best,6) # BIC
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(best,11) # adj-Rsq
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
We can use the same function to perform forward or backward selection by changing only one of its arguments. Forward selection can be implemented in the following way.
fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The analysis would then be essentially identical to the previous analysis for best subset selection.
One interesting thing to check is whether the results from best subset and forward stepwise selection are in agreement (we know this is not necessarily the case). For example, let's have a look at the models with 6 and 7 predictors.
coef(best, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(fwd, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(best, 7)
## (Intercept) Hits Walks CAtBat CHits CHmRun
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
## DivisionW PutOuts
## -129.9866432 0.2366813
coef(fwd, 7)
## (Intercept) AtBat Hits Walks CRBI CWalks
## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
## DivisionW PutOuts
## -127.1223928 0.2533404
The results for the model with 6 predictors are in agreement (which is good to know because this would be one of the three models one would likely select), but as we see the results for the models with 7 predictors are completely different!
So, we have three potentially useful models for our data. We may take a further step and ask the question “Which one of the three to choose?” From a machine-learning perspective the answer to this question is to choose the model with the better predictive performance.
If we had new data we would be able to do that, but we don't. However, we can do the following: (i) split our data into a training sample and a testing sample, (ii) use the training sample for model fitting, (iii) generate predictions based on the held-out predictors of the test sample, and (iv) compare the predictions to the held-out values of the response variable from the test sample (e.g. by calculating the correlation). Let's see how we can do this in R.
Say we want to use approximately 66% of the observations for training; i.e., 175 observations in this case. So we need to randomly sample 175 rows of Hitters, which we will use for training, and keep the remaining 88 rows for testing. We can use the command sample() for this and then create two separate datasets as done below.
dim(Hitters)
## [1] 263 20
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs, ]
Hitters.test = Hitters[-training.obs, ]
dim(Hitters.train)
## [1] 175 20
dim(Hitters.test)
## [1] 88 20
We also need a mechanism to predict from our linear models. Unfortunately, regsubsets() doesn't have its own built-in predict() function (most similar model-fitting functions in R do). We could of course code this step by hand, but it is a bit fiddly. Here we will just make use of the function below, which will do the predictions for us (the details of the function are not important here).
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # extract the model formula
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the best model of size id
  xvars = names(coefi)                 # variables included in that model
  mat[, xvars] %*% coefi               # return the predictions
}
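As a quick sanity check that the function works, we can fit the models on the training sample created above and predict from, say, the 6-predictor model:

```r
# Fit on the training data and predict Salary for the test data
fit.train = regsubsets(Salary ~ ., data = Hitters.train, nvmax = 19)
pred6 = predict.regsubsets(fit.train, Hitters.test, id = 6)
head(pred6)  # predicted salaries for the first few test observations
```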
Finally, we would like to repeat this procedure several times in order to take into account the variability of the procedure. This can be done in R using a for loop. Let's say we want to repeat this 50 times and calculate the correlations of the \(C_p\), BIC and adjusted-\(R^2\) predictions with the held-out response values. So, first we define the number of repetitions and then we create three empty vectors which will be filled during the for loop.
repetitions = 50
cor.cp = c()
cor.bic = c()
cor.adjr2 = c()
The for loop is given below. The inner parts should not be difficult to understand based on the previous analysis. Note that the criteria must be evaluated on the summary of the training fit (results.train below), not on the earlier full-data fit.
set.seed(1) # for the results to be reproducible
for(i in 1:repetitions){
# Step (i): data splitting
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs, ]
Hitters.test = Hitters[-training.obs, ]
# Step (ii): training phase
regfit.train = regsubsets(Salary~., data = Hitters.train, nvmax = 19)
results.train = summary(regfit.train)
min.cp = which.min(results.train$cp)
min.bic = which.min(results.train$bic)
max.adjr2 = which.max(results.train$adjr2)
# Step (iii): generating predictions
predict.cp = predict.regsubsets(regfit.train, Hitters.test, min.cp)
predict.bic = predict.regsubsets(regfit.train, Hitters.test, min.bic)
predict.adjr2 = predict.regsubsets(regfit.train, Hitters.test, max.adjr2)
# Step (iv): evaluating predictive performance
cor.cp[i] = cor(Hitters.test$Salary, predict.cp)
cor.bic[i] = cor(Hitters.test$Salary, predict.bic)
cor.adjr2[i] = cor(Hitters.test$Salary, predict.adjr2)
}
boxplot(cor.cp, cor.bic, cor.adjr2, names = c('Cp','BIC','adjRsq'), ylab = 'Test correlation', col = 2)
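Besides the boxplots, we may want a single summary number per criterion; for instance, the mean test correlation over the 50 repetitions:

```r
# Average test correlation across repetitions for each criterion
colMeans(cbind(Cp = cor.cp, BIC = cor.bic, adjRsq = cor.adjr2))
```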
We now turn to the fat dataset from package faraway. With these data we want to predict body fat (variable brozek) using the other variables available, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulas) and free (it is computed using the brozek formula). So, these variables need to be removed. To get started, first install.packages('faraway') (if you have already installed the package ignore this) and then type the following commands.
library(faraway)
help(fat) # read the help file for more information about the variables
head(fat)
## brozek siri density age weight height adipos free neck chest abdom hip
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8
## thigh knee ankle biceps forearm wrist
## 1 59.0 37.3 21.9 32.0 27.4 17.1
## 2 58.7 37.3 23.4 30.5 28.9 18.2
## 3 59.6 38.9 24.0 28.8 25.2 16.6
## 4 60.1 37.3 22.8 32.4 29.4 18.2
## 5 63.2 42.2 24.0 32.2 27.7 17.7
## 6 66.0 42.0 25.6 35.7 30.6 18.8
dim(fat)
## [1] 252 18
names(fat)
## [1] "brozek" "siri" "density" "age" "weight" "height" "adipos"
## [8] "free" "neck" "chest" "abdom" "hip" "thigh" "knee"
## [15] "ankle" "biceps" "forearm" "wrist"
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252 15
sum(is.na(fat$brozek))
## [1] 0
Initial exploratory analysis. Calculate the correlations of brozek with the 14 predictor variables. Which three predictors have the highest correlations with the response brozek? Using the command pairs(), produce a 4-by-4 frame of scatter plots of the response variable and these three predictors. What does this plot tell us about potential correlations among the three predictors?
Forward stepwise regression. Use regsubsets() to perform forward stepwise selection and carry out your analysis along the lines presented previously for best subset selection on the Baseball data (feel free to experiment with different plotting options to get familiarised with RStudio). Which models are selected by \(C_p\), BIC and adjusted-\(R^2\)? Use the special plot() function to visualise the models obtained by \(C_p\).
Comparisons with best subset and backward selection. Do the best models from best subset and backward stepwise selection have the same number of predictors? If yes, are the predictors the same as those from forward selection? Hint: if it is not obvious how to perform backward selection, type ?regsubsets and see the information about the argument method.
Predictive performance. Based on forward selection, evaluate the predictive accuracy of the three models based on training and testing samples. Keep approximately 75% of the samples for training. Use 30 repetitions to evaluate the results. Is the predictive accuracy better or worse in comparison to the Baseball data?
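If it helps to get started, the structure of the required loop mirrors the Baseball analysis; a minimal sketch for the BIC criterion only (75% of 252 is approximately 189 training rows; extend similarly for \(C_p\) and adjusted-\(R^2\); the predict.regsubsets() function defined earlier is assumed to be loaded):

```r
repetitions = 30
cor.fwd.bic = c()
set.seed(1)
for(i in 1:repetitions){
  # split: ~75% training, rest testing
  training.obs = sample(1:252, 189)
  fat.train = fat[training.obs, ]
  fat.test = fat[-training.obs, ]
  # forward selection on the training sample
  fit = regsubsets(brozek ~ ., data = fat.train, nvmax = 14, method = "forward")
  res = summary(fit)
  # predict from the BIC-optimal model and record the test correlation
  pred = predict.regsubsets(fit, fat.test, which.min(res$bic))
  cor.fwd.bic[i] = cor(fat.test$brozek, pred)
}
boxplot(cor.fwd.bic, ylab = 'Test correlation')
```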