Tutorial: Baseball Data

We will continue analysing the Hitters dataset included in package ISLR, this time applying lasso regression and principal components regression (PCR).

Removing missing entries

We perform the usual preliminary steps, as we have seen in previous workshops: we remove observations with missing entries and check that none remain.

library(ISLR)
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263  20
sum(is.na(Hitters))
## [1] 0

Lasso regression

We will now use glmnet() to implement lasso regression. Having already learned how to implement ridge, the analysis is very easy: to run lasso all we have to do is remove the option alpha = 0. We do not need to specify alpha = 1 explicitly because this is the default option in glmnet().

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1

As before, we assign the response to y and extract the matrix of predictors, which we call x, using the command model.matrix().

y = Hitters$Salary
x = model.matrix(Salary~., Hitters)[,-1] # Here we exclude the first column 
                                         # because it is the intercept, which glmnet() handles itself
head(x)
##                   AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Alan Ashby         315   81     7   24  38    39    14   3449   835     69
## -Alvin Davis        479  130    18   66  72    76     3   1624   457     63
## -Andre Dawson       496  141    20   65  78    37    11   5628  1575    225
## -Andres Galarraga   321   87    10   39  42    30     2    396   101     12
## -Alfredo Griffin    594  169     4   74  51    35    11   4408  1133     19
## -Al Newman          185   37     1   23   8    21     2    214    42      1
##                   CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## -Alan Ashby         321  414    375       1         1     632      43     10
## -Alvin Davis        224  266    263       0         1     880      82     14
## -Andre Dawson       828  838    354       1         0     200      11      3
## -Andres Galarraga    48   46     33       1         0     805      40      4
## -Alfredo Griffin    501  336    194       0         1     282     421     25
## -Al Newman           30    9     24       1         0      76     127      7
##                   NewLeagueN
## -Alan Ashby                1
## -Alvin Davis               0
## -Andre Dawson              1
## -Andres Galarraga          1
## -Alfredo Griffin           0
## -Al Newman                 0
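
As a quick reminder of the connection to ridge (a minimal sketch; the object name ridge below is ours, not part of this tutorial), the two fits differ only in the value of alpha:

ridge = glmnet(x, y, alpha = 0)  # ridge penalty, as in the previous workshop
# lasso: glmnet(x, y, alpha = 1), or simply glmnet(x, y), since alpha = 1 is the default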

We then run lasso regression initially without cross-validation for \(\lambda\).

lasso = glmnet(x, y)
names(lasso)
##  [1] "a0"        "beta"      "df"        "dim"       "lambda"    "dev.ratio"
##  [7] "nulldev"   "npasses"   "jerr"      "offset"    "call"      "nobs"
lasso$lambda
##  [1] 255.2820965 232.6035387 211.9396814 193.1115442 175.9560469 160.3245967
##  [7] 146.0818014 133.1042968 121.2796779 110.5055256 100.6885193  91.7436287
## [13]  83.5933776  76.1671724  69.4006907  63.2353246  57.6176727  52.4990774
## [19]  47.8352041  43.5856564  39.7136268  36.1855777  32.9709507  30.0419023
## [25]  27.3730625  24.9413151  22.7255974  20.7067180  18.8671902  17.1910810
## [31]  15.6638728  14.2723375  13.0044224  11.8491453  10.7964999   9.8373686
## [37]   8.9634439   8.1671562   7.4416086   6.7805166   6.1781542   5.6293040
## [43]   5.1292122   4.6735471   4.2583620   3.8800609   3.5353670   3.2212947
## [49]   2.9351238   2.6743755   2.4367913   2.2203135   2.0230670   1.8433433
## [55]   1.6795857   1.5303760   1.3944216   1.2705450   1.1576733   1.0548288
## [61]   0.9611207   0.8757374   0.7979393   0.7270526   0.6624632   0.6036118
## [67]   0.5499886   0.5011291   0.4566102   0.4160462   0.3790858   0.3454089
## [73]   0.3147237   0.2867645   0.2612891   0.2380769   0.2169268   0.1976557
## [79]   0.1800965   0.1640972
dim(lasso$beta)
## [1] 19 80
lasso$beta[,1:3]   # this returns only the predictor coefficients
## 19 x 3 sparse Matrix of class "dgCMatrix"
##            s0         s1         s2
## AtBat       . .          .         
## Hits        . .          .         
## HmRun       . .          .         
## Runs        . .          .         
## RBI         . .          .         
## Walks       . .          .         
## Years       . .          .         
## CAtBat      . .          .         
## CHits       . .          .         
## CHmRun      . .          .         
## CRuns       . .          0.01515244
## CRBI        . 0.07026614 0.11961370
## CWalks      . .          .         
## LeagueN     . .          .         
## DivisionW   . .          .         
## PutOuts     . .          .         
## Assists     . .          .         
## Errors      . .          .         
## NewLeagueN  . .          .
coef(lasso)[,1:3]  # this returns intercept + predictor coefficients
## 20 x 3 sparse Matrix of class "dgCMatrix"
##                   s0           s1           s2
## (Intercept) 535.9259 512.70866861 490.92996092
## AtBat         .        .            .         
## Hits          .        .            .         
## HmRun         .        .            .         
## Runs          .        .            .         
## RBI           .        .            .         
## Walks         .        .            .         
## Years         .        .            .         
## CAtBat        .        .            .         
## CHits         .        .            .         
## CHmRun        .        .            .         
## CRuns         .        .            0.01515244
## CRBI          .        0.07026614   0.11961370
## CWalks        .        .            .         
## LeagueN       .        .            .         
## DivisionW     .        .            .         
## PutOuts       .        .            .         
## Assists       .        .            .         
## Errors        .        .            .         
## NewLeagueN    .        .            .

We now produce a 1-by-2 plot displaying, on the left, the regularisation paths against \(\log\lambda\) and, on the right, the paths against the \(\ell_1\)-norm of the coefficient vector.

par(mfrow=c(1, 2))
plot(lasso, xvar = 'lambda')
plot(lasso, xvar = 'norm')  # or simply plot(lasso) because xvar = 'norm' is the default

Lasso regression with cross-validation for \(\lambda\)

As we did with ridge, we will use the function cv.glmnet() to find the min-CV \(\lambda\) and the 1-standard-error (1-se) \(\lambda\) under lasso. The code is as follows.

set.seed(1)
lasso.cv = cv.glmnet(x, y)
lasso.cv$lambda.min
## [1] 1.843343
lasso.cv$lambda.1se 
## [1] 69.40069

We see that in this case the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\). This means that the 1-se solution applies heavier shrinkage, so we expect its coefficients to be much smaller, with more of them set exactly to zero. Let's have a look at the corresponding coefficients from the two models, rounding them to 3 decimal places.

round(cbind(
  coef(lasso.cv, s = 'lambda.min'),
  coef(lasso.cv, s = 'lambda.1se')),   # here we can also use coef(lasso.cv) 
                                       # because 'lambda.1se' is the default
3)
## 20 x 2 sparse Matrix of class "dgCMatrix"
##                    1       1
## (Intercept)  142.878 127.957
## AtBat         -1.793   .    
## Hits           6.188   1.423
## HmRun          0.233   .    
## Runs           .       .    
## RBI            .       .    
## Walks          5.148   1.582
## Years        -10.393   .    
## CAtBat        -0.004   .    
## CHits          .       .    
## CHmRun         0.585   .    
## CRuns          0.764   0.160
## CRBI           0.388   0.337
## CWalks        -0.630   .    
## LeagueN       34.227   .    
## DivisionW   -118.981  -8.062
## PutOuts        0.279   0.084
## Assists        0.224   .    
## Errors        -2.436   .    
## NewLeagueN     .       .
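
A quick way to see how many predictors each choice retains is to count the non-zero coefficients (a small addition using the fitted object above; the intercept is dropped with [-1, ]):

sum(coef(lasso.cv, s = 'lambda.min')[-1, ] != 0)  # non-zero predictor coefficients at the min-CV lambda
sum(coef(lasso.cv, s = 'lambda.1se')[-1, ] != 0)  # non-zero predictor coefficients at the 1-se lambda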

Finally, let's plot the results based on the outputs of the lasso and lasso.cv objects which we have created.

par(mfrow=c(1,2))
plot(lasso.cv)
plot(lasso, xvar = 'lambda')
abline(v = log(lasso.cv$lambda.min), lty = 3) # careful to use the log here and below
abline(v = log(lasso.cv$lambda.1se), lty = 3)
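
Predictions from the cross-validated fit are obtained with predict(), which the comparison below relies on (a small illustration on the first six players; the argument s selects which \(\lambda\) to use):

predict(lasso.cv, newx = x[1:6, ], s = 'lambda.min')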

Comparing predictive performance of min-CV \(\lambda\) vs. 1-se \(\lambda\) lasso

Let's see under which of the two options lasso performs better in terms of prediction.

repetitions = 50
cor.1 = c()
cor.2 = c()

set.seed(1)                
for(i in 1:repetitions){
  # Step (i) data splitting
  training.obs = sample(1:263,  175)
  y.train = Hitters$Salary[training.obs]
  x.train = model.matrix(Salary~., Hitters[training.obs, ])[,-1]
  y.test = Hitters$Salary[-training.obs]
  x.test = model.matrix(Salary~., Hitters[-training.obs, ])[,-1]
  
  # Step (ii) training phase
  lasso.train = cv.glmnet(x.train, y.train)
  
  # Step (iv) generating predictions
  predict.1 = predict(lasso.train, x.test, s = 'lambda.min')
  predict.2 = predict(lasso.train, x.test, s = 'lambda.1se')
  
  # Step (v) evaluating predictive performance
  cor.1[i] = cor(y.test, predict.1)
  cor.2[i] = cor(y.test, predict.2)
}

boxplot(cor.1, cor.2, names = c('min-CV lasso','1-se lasso'), ylab = 'Test correlation', col = 7)

The approach based on the min-CV \(\lambda\) performs better, which makes sense as it minimises the cross-validation error. In general, in applications where the 1-se \(\lambda\) is much larger than the min-CV \(\lambda\) (as it is here), we expect results like those above.
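
To back up the visual comparison with a summary number (a small addition using the vectors created in the loop above), we can also compare the average test correlations:

mean(cor.1)  # average test correlation under the min-CV lambda
mean(cor.2)  # average test correlation under the 1-se lambda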

PCR Regression

For PCR we will make use of the package pls, so we first have to install it by executing install.packages('pls') once. The main command is called pcr(). The argument scale = TRUE standardises the columns of the predictor matrix, while the argument validation = 'CV' evaluates each number of principal components via cross-validation, which we can then use to pick the optimal number. The command summary() provides information on CV error and the variance explained.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation = 'CV')
summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    352.5    351.6    352.3    350.7    346.1    345.5
## adjCV          452    352.1    351.2    351.8    350.1    345.5    344.6
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       345.4    348.5    350.4     353.2     354.5     357.5     360.3
## adjCV    344.5    347.5    349.3     351.8     353.0     355.8     358.5
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        352.4     354.3     345.6     346.7     346.6     349.4
## adjCV     350.2     352.3     343.6     344.5     344.3     346.9
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
## Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## Salary     53.01     53.85     54.61     54.61

We can use the command validationplot() to plot the validation error. The argument val.type = 'MSEP' selects the mean squared error of prediction; other options are the root mean squared error ('RMSEP') and \(R^2\) ('R2'). Type ?validationplot for help.

validationplot(pcr.fit, val.type = 'MSEP')
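
If we prefer to look at \(R^2\) instead (a small variation on the same command, using one of the other val.type options mentioned above):

validationplot(pcr.fit, val.type = 'R2')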

From the plot above it is not obvious which PCR model minimises the CV error. We would like to extract this information automatically, but it turns out that this is not so easy with pcr()! One way to do it is via the code below.

min.pcr = which.min(MSEP(pcr.fit)$val[1,1, ]) - 1  # minus 1 because the first entry is the intercept-only model (0 components)
min.pcr
## 7 comps 
##       7

So it is the model with 7 principal components that minimises the CV error. Now that we know which model it is, we can find the corresponding estimates in terms of the original \(\hat{\beta}\)'s using the command coef() that we have used before. We can also generate predictions via predict() (below, the first 6 predictions are displayed).

coef(pcr.fit, ncomp = min.pcr)
## , , 7 comps
## 
##                Salary
## AtBat       27.005477
## Hits        28.531195
## HmRun        4.031036
## Runs        29.464202
## RBI         18.974255
## Walks       47.658639
## Years       24.125975
## CAtBat      30.831690
## CHits       32.111585
## CHmRun      21.811584
## CRuns       34.054133
## CRBI        28.901388
## CWalks      37.990794
## LeagueN      9.021954
## DivisionW  -66.069150
## PutOuts     74.483241
## Assists     -3.654576
## Errors      -6.004836
## NewLeagueN  11.401041
head(predict(pcr.fit,ncomp = min.pcr))
## , , 7 comps
## 
##                     Salary
## -Alan Ashby       568.7940
## -Alvin Davis      670.3840
## -Andre Dawson     921.6077
## -Andres Galarraga 495.8124
## -Alfredo Griffin  560.3198
## -Al Newman        135.5378
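
As an aside, recent versions of the pls package also provide selectNcomp() for automatic selection of the number of components (hedged: check that your installed version includes it; its 'onesigma' method applies a one-standard-error rule rather than the minimum-CV rule used above, so it may suggest a smaller model):

selectNcomp(pcr.fit, method = 'onesigma', plot = TRUE)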

What about regularisation paths with pcr()? This is again tricky because there is no built-in command, but the piece of code below produces them. The regression coefficients from all models using 1 up to 19 principal components are stored in pcr.fit$coefficients; this is a three-dimensional array (predictors by responses by number of components), which is not very convenient to work with. So we create a matrix called coef.mat with 19 rows and 19 columns and use a for loop to store in column i the coefficients of the model with i principal components. We then use plot() to draw the path of the first row of coef.mat, and a second for loop calling lines() to superimpose the paths of the remaining rows. Finally, abline() adds a dashed vertical line at the min-CV number of components found earlier.

coef.mat = matrix(NA, 19, 19)                # rows: predictors, columns: models with 1 to 19 components
for(i in 1:19){
  coef.mat[,i] = pcr.fit$coefficients[,,i]   # coefficients of the model with i components
}

plot(coef.mat[1,], type = 'l', ylab = 'Coefficients', xlab = 'Number of components', 
     ylim = c(min(coef.mat), max(coef.mat)))
for(i in 2:19){
  lines(coef.mat[i,], col = i)
}
abline(v = min.pcr, lty = 3)                 # dashed line at the min-CV number of components
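
The summary() output above reports the predictor variance explained cumulatively; if we want the contribution of each component on its own, pls provides explvar() (a small addition, assuming the usual pls interface):

explvar(pcr.fit)           # % of predictor variance explained by each component
cumsum(explvar(pcr.fit))   # cumulative version, matching the summary() output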

Practice 1: Overall comparison of Baseball Data prediction.

Use 50 repetitions of data-splitting with 175 samples for training to compare the predictive performance of the following models:

Remember: for best subset you will have to load library leaps and make use of the following function for prediction:

predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])   # formula used in the regsubsets() call
  mat = model.matrix(form, newdata)     # design matrix for the new data
  coefi = coef(object, id = id)         # coefficients of the model with 'id' predictors
  xvars = names(coefi)                  # names of the selected variables
  mat[, xvars]%*%coefi                  # predictions: selected columns times coefficients
}
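
As a hedged illustration of how this helper would be used inside one training/test split (the object names and the choice id = 7 below are ours, for illustration only, not part of the exercise):

library(leaps)
set.seed(1)
training.obs = sample(1:263, 175)
best.train = regsubsets(Salary ~ ., data = Hitters[training.obs, ], nvmax = 19)
pred.best = predict.regsubsets(best.train, Hitters[-training.obs, ], id = 7)   # model with 7 predictors
head(pred.best)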

Practice 2: Analysing the Baseball Data with Logistic Regression

Here you will use glmnet() and cv.glmnet() for logistic regression. For this we will turn the continuous response Salary into a binary response. Let's first use boxplot() and summary() to better understand how the salaries of the baseball players are distributed.

boxplot(Hitters$Salary, ylab = 'Salary')

summary(Hitters$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    67.5   190.0   425.0   535.9   750.0  2460.0

From the above we see that 75% of the players earn at most 750 thousand dollars (Salary is recorded in thousands of dollars). So one potentially interesting cut-off is to model the probability of earning more than that. In this case we can define the response y as follows.

y = rep(0, 263)
y[Hitters$Salary > 750] = 1
table(y)
## y
##   0   1 
## 201  62

You may of course choose a different cut-off; e.g., above average salary.

Carry out an analysis using lasso and/or ridge for this logistic model along the lines of today's tutorial: plot regularisation paths, print out coefficients under the min-CV and 1-se \(\lambda\)'s, and so forth. Do not consider predictive performance in this case. Are there any predictor variables that seem to stand out in the logistic models? Tips: use the option family = 'binomial'. In addition, in cv.glmnet() you may want to choose type.measure = 'class'. Read the help documentation.
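
As a hedged starting point (object names are ours; this only sets up the fits, leaving the analysis and interpretation to you), the fitting commands would look like the following:

lasso.logistic = glmnet(x, y, family = 'binomial')                               # lasso logistic regression
lasso.logistic.cv = cv.glmnet(x, y, family = 'binomial', type.measure = 'class') # CV using misclassification error
plot(lasso.logistic, xvar = 'lambda')                                            # regularisation paths
coef(lasso.logistic.cv, s = 'lambda.1se')                                        # coefficients under the 1-se lambda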