In this practical, we will look at two ensemble methods: bagging and random forests. For fitting decision trees, we will use the rpart package in R, and for random forests, the aptly named randomForest package. As before, some of the functions here may be new to you; I encourage you to read their help files to understand what they do.

I also encourage you to clear your environment before starting the practical tasks, because your machine's memory will soon get clogged up, especially as we move to bigger datasets.

Bagging

Recall that bagging is a method that involves bootstrapping the data and fitting a prediction model to each bootstrapped sample. For regression, the final prediction is the average of the predictions of all the models.

Task 1 - Reproduce the following example

For this practical, we will use the Boston data from the ISLR2 package, which contains information about housing in Boston. The goal is to predict medv, the median value of owner-occupied homes.

library(ISLR2)

# Load the data
data("Boston")
head(Boston)
dim(Boston)

We will first split the data into a training and test set (80/20).

????
train_data <- ????
test_data <- ????
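One possible way to fill this in, sketched here for illustration (the seed value is arbitrary, and a simple random split is just one of several reasonable choices):

```r
# Split the data into training (80%) and test (20%) sets
set.seed(123)  # for reproducibility; the seed value is arbitrary
train_idx <- sample(1:nrow(Boston), size = floor(0.8 * nrow(Boston)))
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
```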

Now, we will fit a bagged model to the training data (medv ~ .). To make each step explicit, we will use only the rpart package rather than a dedicated bagging package.

library(rpart)

# Create bootstrapped samples
n <- nrow(train_data)
nboot <- 1000
boot_samples <- lapply(1:nboot,
                       function(i) sample(1:n,
                                          replace = TRUE))

# Fit a decision tree to each bootstrapped sample
trees <- lapply(boot_samples,
                function(i) ????)

# Make predictions
preds <- sapply(trees,
                function(tree) ????)

# Average the predictions
pred_ave <- rowMeans(preds)
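One way the two missing steps above could be completed (fitting a tree to each bootstrap sample, then predicting on the test set so that preds has one column per tree):

```r
# Fit a decision tree to each bootstrapped sample
trees <- lapply(boot_samples,
                function(idx) rpart(medv ~ ., data = train_data[idx, ]))

# Make predictions on the test set: one column per tree
preds <- sapply(trees,
                function(tree) predict(tree, newdata = test_data))
```

Note that each element of boot_samples is a vector of row indices, so `train_data[idx, ]` selects the corresponding bootstrap sample.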

Let’s evaluate the model performance.

# Calculate the MSE
????
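A sketch of the test-set MSE calculation, assuming pred_ave holds the averaged test-set predictions as above (the object name mse_bagged is chosen here for illustration):

```r
# Calculate the MSE of the bagged predictions on the test set
mse_bagged <- mean((test_data$medv - pred_ave)^2)
mse_bagged
```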

Plot the predictions against the actual values for just one of the trees.

????
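One possible plot, using the first column of preds (i.e. the first tree's predictions) and a reference line where predicted equals actual:

```r
# Predictions vs actual values for the first tree only
plot(test_data$medv, preds[, 1],
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Single tree from the ensemble")
abline(0, 1)  # reference line: predicted = actual
```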

Compare this with the average of the predictions over the bagged ensemble.

????
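The analogous plot for the bagged average might look like this:

```r
# Predictions vs actual values for the bagged average
plot(test_data$medv, pred_ave,
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Bagged average of 1000 trees")
abline(0, 1)  # reference line: predicted = actual
```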

And with a single decision tree trained on the entire training set.

tree_single <- rpart(????)
????
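One way to complete this step (the object name preds_single is chosen here for illustration):

```r
# A single decision tree fitted to the full training set
tree_single <- rpart(medv ~ ., data = train_data)
preds_single <- predict(tree_single, newdata = test_data)

# Predictions vs actual values for the single tree
plot(test_data$medv, preds_single,
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Single tree on full training data")
abline(0, 1)  # reference line: predicted = actual
```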

And we can evaluate the MSE of the single tree.

????
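Assuming the single tree's test-set predictions are stored in a vector named preds_single (a name chosen here for illustration), the MSE could be computed as:

```r
# MSE of the single tree on the test set
mse_single <- mean((test_data$medv - preds_single)^2)
mse_single
```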

Task 2 - Reproduce the bagging model using the ipred package

The ipred package in R provides a function called bagging that can be used to fit a bagged model. We can use this package to reproduce the above example.

library(ipred)

# Fit a bagged model
bagged <- bagging(medv ~ .,
                  data = train_data,
                  nbagg = 1000)

# Make predictions
preds_bagged <- ????

# Calculate the MSE
????
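The prediction and evaluation steps might look like this (objects returned by ipred's bagging work with the standard predict method; the name mse_ipred is chosen here for illustration):

```r
# Make predictions with the ipred bagged model
preds_bagged <- predict(bagged, newdata = test_data)

# Calculate the MSE on the test set
mse_ipred <- mean((test_data$medv - preds_bagged)^2)
mse_ipred
```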

How does this compare with the model built up using rpart? Are there any differences in the implementations?

Random Forests

Task 3 - Improve on the bagged model using the randomForest package

The randomForest package in R provides a function called randomForest that can be used to fit a random forest model. We can use this package to improve the model from the above example.

library(randomForest)

# Fit a random forest model
rf <- ????

# Make predictions
preds_rf <- ????

# Calculate the MSE
????

# Plot the predictions against the actual
????
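One possible way to fill in these steps; ntree = 1000 is chosen here only to match the nbagg value used earlier, and for regression randomForest defaults mtry to roughly one third of the predictors:

```r
# Fit a random forest model
rf <- randomForest(medv ~ ., data = train_data, ntree = 1000)

# Make predictions on the test set
preds_rf <- predict(rf, newdata = test_data)

# Calculate the MSE on the test set
mse_rf <- mean((test_data$medv - preds_rf)^2)
mse_rf

# Plot the predictions against the actual values
plot(test_data$medv, preds_rf,
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Random forest")
abline(0, 1)  # reference line: predicted = actual
```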

Task 4 - Deciphering the random forest model

The randomForest package provides a function called getTree that can be used to extract the trees from the random forest model. We can use this function to extract the first tree from the random forest model and display information about the splits.

# Extract the first tree
tree_rf <- getTree(rf, k = 1, labelVar = TRUE)

# Display the first tree structure
tree_rf

We can also use the following to plot the importance of the variables.

# Plot the importance of the variables
varImpPlot(rf)

Investigate what this plot is showing by looking at the corresponding help file: ?varImpPlot.

Task 5 - Use the caret package to fit a random forest model

As we have seen, the caret package in R provides a function called train that can also be used to fit a random forest model. We can use this package to reproduce the above example.

library(caret)

# Fit a random forest model
rf_caret <- train(????)

# Make predictions
preds_rf_caret <- ????

# Calculate the MSE
????

# Plot the predictions against the actual
????
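A sketch of one way to complete this task using method = "rf" (which wraps the randomForest package); note that by default train will also tune mtry via resampling, so fitting takes longer than a direct randomForest call:

```r
# Fit a random forest model via caret
rf_caret <- train(medv ~ .,
                  data = train_data,
                  method = "rf")

# Make predictions on the test set
preds_rf_caret <- predict(rf_caret, newdata = test_data)

# Calculate the MSE on the test set
mse_caret <- mean((test_data$medv - preds_rf_caret)^2)
mse_caret

# Plot the predictions against the actual values
plot(test_data$medv, preds_rf_caret,
     xlab = "Actual medv", ylab = "Predicted medv",
     main = "Random forest via caret")
abline(0, 1)  # reference line: predicted = actual
```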

As we have also seen in the lectures, there are many methods in caret that can be used to fit random forest models. Investigate the help file for train to see what other methods are available, and try a few to see if there is any appreciable difference in model performance.