We will continue analysing the Hitters dataset included in the package ISLR, this time applying lasso and PCR regression. We first perform the usual preparatory steps that we have seen in previous workshops.
library(ISLR)
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
We will now use glmnet() to implement lasso regression. Having learned how to implement ridge, the analysis is very easy: to run lasso all we have to do is remove the option alpha = 0. We do not need to specify alpha = 1 explicitly because this is the default option in glmnet() (a short check of this is given after we construct x and y below).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1
In this part we again call the response y and extract the matrix of predictors, which we call x, using the command model.matrix().
y = Hitters$Salary
x = model.matrix(Salary~., Hitters)[,-1] # here we exclude the first column
# because it is the intercept column that model.matrix() adds
head(x)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## -Al Newman 185 37 1 23 8 21 2 214 42 1
## CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby 321 414 375 1 1 632 43 10
## -Alvin Davis 224 266 263 0 1 880 82 14
## -Andre Dawson 828 838 354 1 0 200 11 3
## -Andres Galarraga 48 46 33 1 0 805 40 4
## -Alfredo Griffin 501 336 194 0 1 282 421 25
## -Al Newman 30 9 24 1 0 76 127 7
## NewLeagueN
## -Alan Ashby 1
## -Alvin Davis 0
## -Andre Dawson 1
## -Andres Galarraga 1
## -Alfredo Griffin 0
## -Al Newman 0
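As a quick check of the role of alpha, the short sketch below (the object names ridge.check and lasso.check are just illustrative) reuses the x and y defined above.
ridge.check = glmnet(x, y, alpha = 0) # alpha = 0 gives ridge, as in the previous workshop
lasso.check = glmnet(x, y, alpha = 1) # alpha = 1 gives lasso and is identical to glmnet(x, y)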
We then run lasso regression initially without cross-validation for \(\lambda\).
lasso = glmnet(x, y)
names(lasso)
## [1] "a0" "beta" "df" "dim" "lambda" "dev.ratio"
## [7] "nulldev" "npasses" "jerr" "offset" "call" "nobs"
lasso$lambda
## [1] 255.2820965 232.6035387 211.9396814 193.1115442 175.9560469 160.3245967
## [7] 146.0818014 133.1042968 121.2796779 110.5055256 100.6885193 91.7436287
## [13] 83.5933776 76.1671724 69.4006907 63.2353246 57.6176727 52.4990774
## [19] 47.8352041 43.5856564 39.7136268 36.1855777 32.9709507 30.0419023
## [25] 27.3730625 24.9413151 22.7255974 20.7067180 18.8671902 17.1910810
## [31] 15.6638728 14.2723375 13.0044224 11.8491453 10.7964999 9.8373686
## [37] 8.9634439 8.1671562 7.4416086 6.7805166 6.1781542 5.6293040
## [43] 5.1292122 4.6735471 4.2583620 3.8800609 3.5353670 3.2212947
## [49] 2.9351238 2.6743755 2.4367913 2.2203135 2.0230670 1.8433433
## [55] 1.6795857 1.5303760 1.3944216 1.2705450 1.1576733 1.0548288
## [61] 0.9611207 0.8757374 0.7979393 0.7270526 0.6624632 0.6036118
## [67] 0.5499886 0.5011291 0.4566102 0.4160462 0.3790858 0.3454089
## [73] 0.3147237 0.2867645 0.2612891 0.2380769 0.2169268 0.1976557
## [79] 0.1800965 0.1640972
dim(lasso$beta)
## [1] 19 80
lasso$beta[,1:3] # this returns only the predictor coefficients
## 19 x 3 sparse Matrix of class "dgCMatrix"
## s0 s1 s2
## AtBat . . .
## Hits . . .
## HmRun . . .
## Runs . . .
## RBI . . .
## Walks . . .
## Years . . .
## CAtBat . . .
## CHits . . .
## CHmRun . . .
## CRuns . . 0.01515244
## CRBI . 0.07026614 0.11961370
## CWalks . . .
## LeagueN . . .
## DivisionW . . .
## PutOuts . . .
## Assists . . .
## Errors . . .
## NewLeagueN . . .
coef(lasso)[,1:3] # this returns intercept + predictor coefficients
## 20 x 3 sparse Matrix of class "dgCMatrix"
## s0 s1 s2
## (Intercept) 535.9259 512.70866861 490.92996092
## AtBat . . .
## Hits . . .
## HmRun . . .
## Runs . . .
## RBI . . .
## Walks . . .
## Years . . .
## CAtBat . . .
## CHits . . .
## CHmRun . . .
## CRuns . . 0.01515244
## CRBI . 0.07026614 0.11961370
## CWalks . . .
## LeagueN . . .
## DivisionW . . .
## PutOuts . . .
## Assists . . .
## Errors . . .
## NewLeagueN . . .
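We are not restricted to the values stored in lasso$lambda: coef() also accepts a numeric value through the argument s, in which case glmnet interpolates along the fitted path by default. A small sketch (the value 50 is purely illustrative):
coef(lasso, s = 50) # coefficients at lambda = 50, interpolated from the fitted path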
We now produce a 1 by 2 plot displaying on the left regularisation paths based on \(\log\lambda\) and on the right the paths based on the \(\ell_1\)-norm.
par(mfrow=c(1, 2))
plot(lasso, xvar = 'lambda')
plot(lasso, xvar = 'norm') # or simply plot(lasso) because xvar = 'norm' is the default
As we did with ridge, we will use the function cv.glmnet() to find the min-CV \(\lambda\) and the 1-standard-error (1-se) \(\lambda\) under lasso. The code is the following.
set.seed(1)
lasso.cv = cv.glmnet(x, y)
lasso.cv$lambda.min
## [1] 1.843343
lasso.cv$lambda.1se
## [1] 69.40069
We see that in this case the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\). This means that we expect the coefficients under the 1-se \(\lambda\) to be much smaller. Let's have a look at the corresponding coefficients from the two models, rounding them to 3 decimal places.
round(cbind(
coef(lasso.cv, s = 'lambda.min'),
coef(lasso.cv, s = 'lambda.1se')), # here we can also use coef(lasso.cv)
# because 'lambda.1se' is the default
3)
## 20 x 2 sparse Matrix of class "dgCMatrix"
## 1 1
## (Intercept) 142.878 127.957
## AtBat -1.793 .
## Hits 6.188 1.423
## HmRun 0.233 .
## Runs . .
## RBI . .
## Walks 5.148 1.582
## Years -10.393 .
## CAtBat -0.004 .
## CHits . .
## CHmRun 0.585 .
## CRuns 0.764 0.160
## CRBI 0.388 0.337
## CWalks -0.630 .
## LeagueN 34.227 .
## DivisionW -118.981 -8.062
## PutOuts 0.279 0.084
## Assists 0.224 .
## Errors -2.436 .
## NewLeagueN . .
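To see how much sparser the 1-se solution is, we can count the non-zero coefficients under each choice of \(\lambda\); a small sketch reusing lasso.cv is given below.
sum(coef(lasso.cv, s = 'lambda.min')[-1, ] != 0) # predictors retained under min-CV lambda
sum(coef(lasso.cv, s = 'lambda.1se')[-1, ] != 0) # predictors retained under 1-se lambda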
Finally, let's plot the results based on the outputs of the objects lasso and lasso.cv which we have created.
par(mfrow=c(1,2))
plot(lasso.cv)
plot(lasso, xvar = 'lambda')
abline(v = log(lasso.cv$lambda.min), lty = 3) # careful to use the log here and below
abline(v = log(lasso.cv$lambda.1se), lty = 3)
Let's see under which of the two options lasso performs better in terms of prediction.
repetitions = 50
cor.1 = c()
cor.2 = c()
set.seed(1)
for(i in 1:repetitions){
# Step (i) data splitting
training.obs = sample(1:263, 175)
y.train = Hitters$Salary[training.obs]
x.train = model.matrix(Salary~., Hitters[training.obs, ])[,-1]
y.test = Hitters$Salary[-training.obs]
x.test = model.matrix(Salary~., Hitters[-training.obs, ])[,-1]
# Step (ii) training phase
lasso.train = cv.glmnet(x.train, y.train)
# Step (iv) generating predictions
predict.1 = predict(lasso.train, x.test, s = 'lambda.min')
predict.2 = predict(lasso.train, x.test, s = 'lambda.1se')
# Step (v) evaluating predictive performance
cor.1[i] = cor(y.test, predict.1)
cor.2[i] = cor(y.test, predict.2)
}
boxplot(cor.1, cor.2, names = c('min-CV lasso','1-se lasso'), ylab = 'Test correlation', col = 7)
The approach based on the min-CV \(\lambda\) performs better, which makes sense as this choice minimises the cross-validation error. In general, in applications where the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\) (as is the case here), we expect results like the above.
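Test correlation is only one way of summarising predictive performance; we could equally record the test mean squared error inside the same loop. As a minimal sketch, the code below computes it for the predictions from the last repetition only.
mean((y.test - predict.1)^2) # test MSE under min-CV lambda (last repetition)
mean((y.test - predict.2)^2) # test MSE under 1-se lambda (last repetition)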
For PCR regression we will make use of the package pls, so we first have to install it by executing install.packages('pls') once. The main command is called pcr(). The argument scale = TRUE standardises the columns of the predictor matrix, while the argument validation = 'CV' computes cross-validation errors for each number of principal components, which we can then use to pick the optimal model. The command summary() provides information on CV error and variance explained.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation = 'CV')
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 352.5 351.6 352.3 350.7 346.1 345.5
## adjCV 452 352.1 351.2 351.8 350.1 345.5 344.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 345.4 348.5 350.4 353.2 354.5 357.5 360.3
## adjCV 344.5 347.5 349.3 351.8 353.0 355.8 358.5
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 352.4 354.3 345.6 346.7 346.6 349.4
## adjCV 350.2 352.3 343.6 344.5 344.3 346.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26 94.96
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69 46.75
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 96.28 97.26 97.98 98.65 99.15 99.47 99.75
## Salary 46.86 47.76 47.82 47.85 48.10 50.40 50.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.89 99.97 99.99 100.00
## Salary 53.01 53.85 54.61 54.61
We can use the command validationplot() to plot the validation error. The argument val.type = 'MSEP' selects mean squared error; other options are root mean squared error and \(R^2\). Type ?validationplot for help.
validationplot(pcr.fit, val.type = 'MSEP')
From the plot above it is not obvious which PCR model minimises CV. We would like to extract this information automatically, but it turns out that this is not so easy with pcr()! One way to do it is via the code below.
min.pcr = which.min(MSEP(pcr.fit)$val[1, 1, ]) - 1 # subtract 1 because the first entry of
# MSEP(pcr.fit)$val corresponds to the intercept-only model (0 components)
min.pcr
## 7 comps
## 7
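As a quick check, we can also read off the CV error of the selected model directly from MSEP(); the + 1 below skips the intercept-only entry.
MSEP(pcr.fit)$val[1, 1, min.pcr + 1] # CV mean squared error of the selected 7-component model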
So, it is the model with 7 principal components that minimises CV. Now that we know which model it is, we can find the corresponding estimates in terms of the original \(\hat{\beta}\)’s using the command coef() that we have used before. We can also generate predictions via predict() (below the first 6 predictions are displayed).
coef(pcr.fit, ncomp = min.pcr)
## , , 7 comps
##
## Salary
## AtBat 27.005477
## Hits 28.531195
## HmRun 4.031036
## Runs 29.464202
## RBI 18.974255
## Walks 47.658639
## Years 24.125975
## CAtBat 30.831690
## CHits 32.111585
## CHmRun 21.811584
## CRuns 34.054133
## CRBI 28.901388
## CWalks 37.990794
## LeagueN 9.021954
## DivisionW -66.069150
## PutOuts 74.483241
## Assists -3.654576
## Errors -6.004836
## NewLeagueN 11.401041
head(predict(pcr.fit,ncomp = min.pcr))
## , , 7 comps
##
## Salary
## -Alan Ashby 568.7940
## -Alvin Davis 670.3840
## -Andre Dawson 921.6077
## -Andres Galarraga 495.8124
## -Alfredo Griffin 560.3198
## -Al Newman 135.5378
What about regularisation paths with pcr()? This is again tricky because there is no built-in command. The piece of code below can be used to produce regularisation paths. The regression coefficients from all models using 1 up to 19 principal components are stored in pcr.fit$coefficients, but this is a three-dimensional array (predictors by responses by number of components), which is not very convenient to work with directly. So we create a matrix called coef.mat which has 19 rows and 19 columns. Using a for loop, we then store the coefficients from pcr.fit$coefficients in the columns of coef.mat, starting from the model that has 19 principal components. We then use plot() to plot the path of the first row of coef.mat against the number of components, and another for loop calling the command lines() to superimpose the paths of the remaining rows of coef.mat.
coef.mat = matrix(NA, 19, 19)
for(i in 1:19){
  coef.mat[,i] = pcr.fit$coefficients[,,20-i] # column i holds the model with 20 - i components
}
# plot against the actual number of components (19 down to 1), so that the x-axis
# and the vertical line at min.pcr below are on the same scale
plot(19:1, coef.mat[1,], type = 'l', ylab = 'Coefficients',
     xlab = 'Number of components', ylim = c(min(coef.mat), max(coef.mat)))
for(i in 2:19){
  lines(19:1, coef.mat[i,], col = i)
}
abline(v = min.pcr, lty = 3)
Use 50 repetitions of data-splitting with 175 samples for training to compare the predictive performance of the following models:
Remember: for best subset you will have to load the library leaps and make use of the following function for prediction (a short illustration of its use is given after the function):
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]]) # extract the formula used in the regsubsets() call
  mat = model.matrix(form, newdata)   # design matrix for the new data
  coefi = coef(object, id = id)       # coefficients of the best model with 'id' predictors
  xvars = names(coefi)                # names of the selected predictors
  mat[, xvars] %*% coefi              # predictions
}
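As an illustration of how this function is meant to be used within a single split (the object name best.train and the choice id = 8 are purely illustrative):
library(leaps)
training.obs = sample(1:263, 175)
best.train = regsubsets(Salary~., data = Hitters[training.obs, ], nvmax = 19)
predict.regsubsets(best.train, Hitters[-training.obs, ], id = 8) # predictions from the best
# 8-predictor model on the held-out observations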
Here you will use glmnet() and cv.glmnet() for logistic regression. For this we will turn the continuous response Salary into a binary response. Let's first use boxplot() and summary() to understand better how the salaries of the baseball players are distributed.
boxplot(Hitters$Salary, ylab = 'Salary')
summary(Hitters$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.5 190.0 425.0 535.9 750.0 2460.0
From the above we see that 75% of the players earn up to 750 thousand dollars (salaries are recorded in thousands of dollars). So one potentially interesting cut-off is to model the probability of earning more than that. In this case we can define the response y as follows.
y = rep(0, 263)
y[Hitters$Salary > 750] = 1
table(y)
## y
## 0 1
## 201 62
You may of course choose a different cut-off; e.g., above average salary.
Carry out an analysis using lasso and/or ridge for this logistic model along the lines of today’s tutorial: plot regularisation paths, print out coefficients under the min-CV and 1-se \(\lambda\), and so forth. Do not consider predictive performance in this case. Are there any predictor variables that seem to stand out in the logistic models? Tips: use the option family = 'binomial'. In addition, in cv.glmnet() you may want to choose type.measure = 'class'. Read the help documentation.
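To get started, a minimal sketch of how these options fit together is given below (it reuses the x and binary y defined above; the object names are illustrative and this is only a starting point, not the full analysis).
lasso.log = glmnet(x, y, family = 'binomial') # lasso logistic regression paths
plot(lasso.log, xvar = 'lambda')
set.seed(1)
lasso.log.cv = cv.glmnet(x, y, family = 'binomial', type.measure = 'class')
coef(lasso.log.cv, s = 'lambda.min') # coefficients under the min-CV lambda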