In this workshop we will start with an analysis of the Hitters dataset included in package ISLR. This dataset consists of 322 records of baseball players. The response variable is the players' salary and the number of predictors is 19, including variables such as number of hits, years in the league and so forth.
The first thing to do is to load the data, so we have to install package ISLR in R by typing install.packages('ISLR') (if you have already installed the package ignore this). Next we load the library and use help(), which will open the help window in RStudio with information about the dataset.
library(ISLR)
help(Hitters)
To get a clearer picture of how the data look we can use the commands names(), head() and dim(). The first returns the names of the 20 variables, the second returns the first 6 rows (type help(head) to see how to adjust this) and the third gives us the number of rows (sample size) and number of columns (1 response + 19 predictors).
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
head(Hitters)
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## -Andy Allanson 293 66 1 30 29 14 1 293 66 1
## -Alan Ashby 315 81 7 24 38 39 14 3449 835 69
## -Alvin Davis 479 130 18 66 72 76 3 1624 457 63
## -Andre Dawson 496 141 20 65 78 37 11 5628 1575 225
## -Andres Galarraga 321 87 10 39 42 30 2 396 101 12
## -Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 19
## CRuns CRBI CWalks League Division PutOuts Assists Errors
## -Andy Allanson 30 29 14 A E 446 33 20
## -Alan Ashby 321 414 375 N W 632 43 10
## -Alvin Davis 224 266 263 A W 880 82 14
## -Andre Dawson 828 838 354 N E 200 11 3
## -Andres Galarraga 48 46 33 N E 805 40 4
## -Alfredo Griffin 501 336 194 A W 282 421 25
## Salary NewLeague
## -Andy Allanson NA A
## -Alan Ashby 475.0 N
## -Alvin Davis 480.0 A
## -Andre Dawson 500.0 N
## -Andres Galarraga 91.5 N
## -Alfredo Griffin 750.0 A
dim(Hitters)
## [1] 322 20
Prior to proceeding to any analysis we must check whether the data contain any missing values. To do this we can use a combination of the commands sum() and is.na(). We find that we have 59 missing Salary entries. To remove these 59 rows we make use of the command na.omit() and then double-check.
sum(is.na(Hitters$Salary))
## [1] 59
Hitters = na.omit(Hitters)
dim(Hitters)
## [1] 263 20
sum(is.na(Hitters))
## [1] 0
Now we are ready to start with best subset selection. There are several packages in R which perform best subset as well as stepwise selection. We will use the function regsubsets() from library leaps (again, install.packages('leaps') is needed if not already installed). We will store the results from best subset selection in an object called best (any name would do) and summarize the results via summary(), which outputs the best models of different dimensionality.
library(leaps)
best = regsubsets(Salary~., Hitters)
summary(best)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
The asterisks * indicate which variables are included in each model. For instance, we see that the best model with one predictor contains only CRBI, while for the model with four predictors the selected variables are Hits, CRBI, DivisionW and PutOuts. As we see, the summary() command displayed output only up to the best model with 8 predictors; this is the default option in regsubsets(). To change this we can use the extra argument nvmax. Below we re-run the command for all best models (19 in total). Now we also store the output of summary() in an object called results and further investigate what it contains via names(). As we see, it holds a lot of useful information! Specifically, it returns \(R^2\), RSS, adjusted-\(R^2\), \(C_p\) and BIC. We can easily extract these quantities; for instance, below we see the \(R^2\) values of the 19 models.
best = regsubsets(Salary~., data = Hitters, nvmax = 19)
results = summary(best)
names(results)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
results$rsq
## [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
## [8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
## [15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
Let's store these quantities as separate objects. We can also stack them together into one matrix, as done below, and inspect the values.
RSS = results$rss
r2 = results$rsq
Cp = results$cp
BIC = results$bic
Adj_r2 = results$adjr2
cbind(RSS, r2, Cp, BIC, Adj_r2)
## RSS r2 Cp BIC Adj_r2
## [1,] 36179679 0.3214501 104.281319 -90.84637 0.3188503
## [2,] 30646560 0.4252237 50.723090 -128.92622 0.4208024
## [3,] 29249297 0.4514294 38.693127 -135.62693 0.4450753
## [4,] 27970852 0.4754067 27.856220 -141.80892 0.4672734
## [5,] 27149899 0.4908036 21.613011 -144.07143 0.4808971
## [6,] 26194904 0.5087146 14.023870 -147.91690 0.4972001
## [7,] 25906548 0.5141227 13.128474 -145.25594 0.5007849
## [8,] 25136930 0.5285569 7.400719 -147.61525 0.5137083
## [9,] 24814051 0.5346124 6.158685 -145.44316 0.5180572
## [10,] 24500402 0.5404950 5.009317 -143.21651 0.5222606
## [11,] 24387345 0.5426153 5.874113 -138.86077 0.5225706
## [12,] 24333232 0.5436302 7.330766 -133.87283 0.5217245
## [13,] 24289148 0.5444570 8.888112 -128.77759 0.5206736
## [14,] 24248660 0.5452164 10.481576 -123.64420 0.5195431
## [15,] 24235177 0.5454692 12.346193 -118.21832 0.5178661
## [16,] 24219377 0.5457656 14.187546 -112.81768 0.5162219
## [17,] 24209447 0.5459518 16.087831 -107.35339 0.5144464
## [18,] 24201837 0.5460945 18.011425 -101.86391 0.5126097
## [19,] 24200700 0.5461159 20.000000 -96.30412 0.5106270
From lectures we know that RSS should steadily decrease and \(R^2\) should steadily increase as we add predictors. Let's check that by plotting the values of RSS and \(R^2\). We would like to have the two plots next to each other, so we will make use of the command par(mfrow = c(1, 2)) before calling the command plot().
par(mfrow = c(1, 2))
plot(RSS, xlab = "Number of Predictors", ylab = "RSS", type = "l", lwd = 2)
plot(r2, xlab = "Number of Predictors", ylab = "R-square", type = "l", lwd = 2)
Results are as expected. Above, the argument type = 'l' is used so that lines appear (otherwise by default R would simply plot the points) and the argument lwd controls the thickness of the lines. Plots in R are very customisable (type ?par to see all available options).
Now let us find the optimal models under \(C_p\), BIC and adjusted-\(R^2\) and produce plots similar to the ones we have seen in lectures.
which.min(Cp)
## [1] 10
which.min(BIC)
## [1] 6
which.max(Adj_r2)
## [1] 11
par(mfrow = c(1, 3))
plot(Cp, xlab = "Number of Predictors", ylab = "Cp", type = 'l', lwd = 2)
points(10, Cp[10], col = "red", cex = 2, pch = 8, lwd = 2)
plot(BIC, xlab = "Number of Predictors", ylab = "BIC", type = 'l', lwd = 2)
points(6, BIC[6], col = "red", cex = 2, pch = 8, lwd = 2)
plot(Adj_r2, xlab = "Number of Predictors", ylab = "Adjusted RSq", type = "l", lwd = 2)
points(11, Adj_r2[11], col = "red", cex = 2, pch = 8, lwd = 2)
\(C_p\) and adjusted-\(R^2\) select 10 and 11 predictors, respectively, while the optimal BIC model has only 6 predictors. In the code above we use the command points() to highlight the optimal model; the first two arguments of this command correspond to the x-axis and y-axis co-ordinates. So, for example, we know that the optimal \(C_p\) model is the one with 10 predictors; therefore the first argument is set to 10 and the second argument is set to Cp[10] (i.e., \(C_p\) at its minimum).
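Rather than hard-coding the positions of the optima, we can also locate them programmatically; a small sketch for the \(C_p\) panel (the other two panels work the same way):

```r
# Highlight the Cp optimum without hard-coding its position:
# which.min(Cp) gives the x-coordinate, min(Cp) the y-coordinate.
plot(Cp, xlab = "Number of Predictors", ylab = "Cp", type = "l", lwd = 2)
points(which.min(Cp), min(Cp), col = "red", cex = 2, pch = 8, lwd = 2)
```

This way the highlighted point stays correct even if the data (and hence the optimal model size) change.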
The regsubsets() function has a built-in plotting function, which is very convenient when the output from summary() is difficult to read due to many variables. Below we see the visualisation for the BIC models. Remark: because previously we used par(mfrow = c(1, 3)) we first have to close the 1-by-3 window splitting by typing dev.off() (this returns the plotting window to its default state).
plot(best, scale = "bic")
The top row corresponds to the best model and the bottom row to the worst model under the chosen criterion. Black squares correspond to variables that are included in each model; white squares correspond to variables that are excluded.
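The same plotting function accepts other criteria via the scale argument; for example, using the best object fitted above:

```r
# Visualise the model rankings under Cp and adjusted R^2 instead of BIC
plot(best, scale = "Cp")
plot(best, scale = "adjr2")
```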
Finally, we can extract the model coefficients for any model we choose via the command coef(). For instance, the best models under each approach are the following.
coef(best,10) # Cp
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
## CRBI CWalks DivisionW PutOuts Assists
## 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
coef(best,6) # BIC
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(best,11) # adj-Rsq
## (Intercept) AtBat Hits Walks CAtBat CRuns
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914 1.4553310
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.7852528 -0.8228559 43.1116152 -111.1460252 0.2894087 0.2688277
We can use the same function to perform forward or backward selection by changing only one of its arguments. Forward selection can be implemented in the following way.
fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(fwd)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
## 9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " " "
## 5 ( 1 ) " " " " "*" "*" " " " " " "
## 6 ( 1 ) " " " " "*" "*" " " " " " "
## 7 ( 1 ) "*" " " "*" "*" " " " " " "
## 8 ( 1 ) "*" " " "*" "*" " " " " " "
## 9 ( 1 ) "*" " " "*" "*" " " " " " "
## 10 ( 1 ) "*" " " "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 12 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The analysis would then be essentially identical to the previous analysis for best subset selection.
One interesting thing to check is whether the results from best subset and forward stepwise selection are in agreement (we know this is not necessarily the case). For example, let's have a look at the models with 6 and 7 predictors.
coef(best, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(fwd, 6)
## (Intercept) AtBat Hits Walks CRBI DivisionW
## 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
## PutOuts
## 0.2643076
coef(best, 7)
## (Intercept) Hits Walks CAtBat CHits CHmRun
## 79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
## DivisionW PutOuts
## -129.9866432 0.2366813
coef(fwd, 7)
## (Intercept) AtBat Hits Walks CRBI CWalks
## 109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
## DivisionW PutOuts
## -127.1223928 0.2533404
The results for the model with 6 predictors are in agreement (which is good to know because this would be one of the three models one would likely select), but as we see the results for the models with 7 predictors are completely different!
So, we have three potentially useful models for our data. We may take a further step and ask the question “Which one of the three to choose?” From a machine-learning perspective the answer to this question is to choose the model with the better predictive performance.
If we had new data we would be able to do that, but we don't. However, we can do the following: (i) split our data into a training sample and a testing sample, (ii) use the training sample for model fitting, (iii) generate predictions based on the held-out predictors of the test sample, and (iv) compare the predictions to the held-out values of the response variable from the test sample (e.g. by calculating the correlation). Let's see how we can do this in R.
Say we want to use approximately 66% of the observations for training; i.e., 175 observations in this case. So we need to randomly sample 175 rows of Hitters, which we will use for training, and keep the remaining 88 rows for testing. We can use the command sample() for this and then create two separate datasets as done below.
dim(Hitters)
## [1] 263 20
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs, ]
Hitters.test = Hitters[-training.obs, ]
dim(Hitters.train)
## [1] 175 20
dim(Hitters.test)
## [1] 88 20
We also need a mechanism to predict from our linear models. Unfortunately, regsubsets() doesn't have its own built-in predict() function (most similar model-fitting functions in R do). We could of course code this step by hand, but it is a bit fiddly. Here we will just make use of the function below, which will do the predictions for us (the details of the function are not important here).
predict.regsubsets = function(object, newdata, id, ...){
  form = as.formula(object$call[[2]])  # extract the model formula
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the best model of size id
  xvars = names(coefi)                 # variables included in that model
  mat[, xvars] %*% coefi               # return the predictions
}
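As a quick sanity check that the function works, we can fit the models on the training sample created above and predict from, say, the 6-predictor model:

```r
# Fit on the training data and predict Salary for the test data
fit.train = regsubsets(Salary ~ ., data = Hitters.train, nvmax = 19)
pred6 = predict.regsubsets(fit.train, Hitters.test, id = 6)
head(pred6)  # predicted salaries for the first few test observations
```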
Finally, we would like to repeat this procedure several times in order to take into account the variability of the procedure. This can be done in R using a for loop. Let's say we want to repeat this 50 times and calculate the correlations of the \(C_p\), BIC and adjusted-\(R^2\) predictions with the held-out response values. So, first we define the number of repetitions and then we create three empty vectors which will be filled during the for loop.
repetitions = 50
cor.cp = c()
cor.bic = c()
cor.adjr2 = c()
The for loop is given below. The inner parts should not be difficult to understand based on the previous analysis. Note that the criteria must be evaluated on the summary of the training fit (results.train below), not on the earlier full-data fit.
set.seed(1) # for the results to be reproducible
for(i in 1:repetitions){
# Step (i): data splitting
training.obs = sample(1:263, 175)
Hitters.train = Hitters[training.obs, ]
Hitters.test = Hitters[-training.obs, ]
# Step (ii): training phase
regfit.train = regsubsets(Salary~., data = Hitters.train, nvmax = 19)
results.train = summary(regfit.train)
min.cp = which.min(results.train$cp)
min.bic = which.min(results.train$bic)
max.adjr2 = which.max(results.train$adjr2)
# Step (iii): generating predictions
predict.cp = predict.regsubsets(regfit.train, Hitters.test, min.cp)
predict.bic = predict.regsubsets(regfit.train, Hitters.test, min.bic)
predict.adjr2 = predict.regsubsets(regfit.train, Hitters.test, max.adjr2)
# Step (iv): evaluating predictive performance
cor.cp[i] = cor(Hitters.test$Salary, predict.cp)
cor.bic[i] = cor(Hitters.test$Salary, predict.bic)
cor.adjr2[i] = cor(Hitters.test$Salary, predict.adjr2)
}
boxplot(cor.cp, cor.bic, cor.adjr2, names = c('Cp','BIC','adjRsq'), ylab = 'Test correlation', col = 2)
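Besides the boxplots, we may want a single summary number per criterion; for instance, the mean test correlation over the 50 repetitions:

```r
# Average test correlation across repetitions for each criterion
colMeans(cbind(Cp = cor.cp, BIC = cor.bic, adjRsq = cor.adjr2))
```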
We now turn to the fat dataset from package faraway. With these data we want to predict body fat (variable brozek) using the other variables available, except for siri (another way of computing body fat), density (it is used in the brozek and siri formulas) and free (it is computed using the brozek formula). So, these variables need to be removed. To get started, first install.packages('faraway') (if you have already installed the package ignore this) and then type the following commands.
library(faraway)
help(fat) # read the help file for more information about the variables
head(fat)
## brozek siri density age weight height adipos free neck chest abdom hip
## 1 12.6 12.3 1.0708 23 154.25 67.75 23.7 134.9 36.2 93.1 85.2 94.5
## 2 6.9 6.1 1.0853 22 173.25 72.25 23.4 161.3 38.5 93.6 83.0 98.7
## 3 24.6 25.3 1.0414 22 154.00 66.25 24.7 116.0 34.0 95.8 87.9 99.2
## 4 10.9 10.4 1.0751 26 184.75 72.25 24.9 164.7 37.4 101.8 86.4 101.2
## 5 27.8 28.7 1.0340 24 184.25 71.25 25.6 133.1 34.4 97.3 100.0 101.9
## 6 20.6 20.9 1.0502 24 210.25 74.75 26.5 167.0 39.0 104.5 94.4 107.8
## thigh knee ankle biceps forearm wrist
## 1 59.0 37.3 21.9 32.0 27.4 17.1
## 2 58.7 37.3 23.4 30.5 28.9 18.2
## 3 59.6 38.9 24.0 28.8 25.2 16.6
## 4 60.1 37.3 22.8 32.4 29.4 18.2
## 5 63.2 42.2 24.0 32.2 27.7 17.7
## 6 66.0 42.0 25.6 35.7 30.6 18.8
dim(fat)
## [1] 252 18
names(fat)
## [1] "brozek" "siri" "density" "age" "weight" "height" "adipos"
## [8] "free" "neck" "chest" "abdom" "hip" "thigh" "knee"
## [15] "ankle" "biceps" "forearm" "wrist"
fat = fat[,-c(2, 3, 8)]
dim(fat)
## [1] 252 15
sum(is.na(fat$brozek))
## [1] 0
Initial exploratory analysis. Calculate the correlations of brozek with the 14 predictor variables. Which three predictors have the highest correlations with the response brozek? Using the command pairs(), produce a 4-by-4 frame of scatter plots of the response variable and these three predictors. What does this plot tell us about potential correlations among the three predictors?
Forward stepwise regression. Use regsubsets() to perform forward stepwise selection and carry out your analysis along the lines presented previously for best subset selection on the Baseball data (feel free to experiment with different plotting options to get familiarised with RStudio). Which models are selected by \(C_p\), BIC and adjusted-\(R^2\)? Use the special plot() function to visualise the models obtained by \(C_p\).
Comparisons with best subset and backward selection. Do the best models from best subset and backward stepwise selection have the same number of predictors? If yes, are the predictors the same as those from forward selection? Hint: if it is not obvious how to perform backward selection, type ?regsubsets and see the information about the argument method.
Predictive performance. Based on forward selection, evaluate the predictive accuracy of the three models based on training and testing samples. Keep approximately 75% of the samples for training. Use 30 repetitions to evaluate the results. Is the predictive accuracy better or worse in comparison to the Baseball data?
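If it helps to get started, the structure of the required loop mirrors the Baseball analysis; a minimal sketch for the BIC criterion only (75% of 252 is approximately 189 training rows; extend similarly for \(C_p\) and adjusted-\(R^2\); the predict.regsubsets() function defined earlier is assumed to be loaded):

```r
repetitions = 30
cor.fwd.bic = c()
set.seed(1)
for(i in 1:repetitions){
  # split: ~75% training, rest testing
  training.obs = sample(1:252, 189)
  fat.train = fat[training.obs, ]
  fat.test = fat[-training.obs, ]
  # forward selection on the training sample
  fit = regsubsets(brozek ~ ., data = fat.train, nvmax = 14, method = "forward")
  res = summary(fit)
  # predict from the BIC-optimal model and record the test correlation
  pred = predict.regsubsets(fit, fat.test, which.min(res$bic))
  cor.fwd.bic[i] = cor(fat.test$brozek, pred)
}
boxplot(cor.fwd.bic, ylab = 'Test correlation')
```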