In this practical, we will look at two ensemble methods: bagging and random forests. For decision tree fitting, we will use the rpart package in R, and for random forests we will use the aptly-named randomForest package. As before, I expect there are functions here that you have not seen before. I encourage you to look at the help files for these functions to understand what they do.
I also encourage you to clear your environment before starting the practical tasks: your machine's memory will soon get clogged up, especially as we move to bigger datasets.
Recall that bagging is a method that involves bootstrapping the data and fitting a prediction model to each bootstrapped sample; the final prediction is the average of the predictions from all the models.
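In symbols: if $\hat{f}^{*b}(x)$ denotes the prediction from the model fitted to the $b$-th of $B$ bootstrapped samples, the bagged prediction is

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$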
For this practical, we will use the Boston data from the ISLR2 package, which contains information about housing in Boston. The goal is to predict medv, the median value of owner-occupied homes.
library(ISLR2)
# Load the data
data("Boston")
head(Boston)
dim(Boston)
We will first split the data into a training and test set (80/20).
????
train_data <- ????
test_data <- ????
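If you are stuck, here is one possible way to fill in the blanks, assuming we sample row indices with sample() and set a seed for reproducibility (the seed value is an arbitrary choice):

# One possible 80/20 split
set.seed(1)
train_idx <- sample(nrow(Boston), size = floor(0.8 * nrow(Boston)))
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]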
Now we will fit a bagged model (medv ~ .) to the training data. To demonstrate the steps involved, we will not use any packages beyond rpart.
library(rpart)
# Create bootstrapped samples
n <- nrow(train_data)
nboot <- 1000
boot_samples <- lapply(1:nboot, function(i) sample(1:n, replace = TRUE))
# Fit a decision tree to each bootstrapped sample
trees <- lapply(boot_samples, function(i) ????)
# Make predictions on the test set with each tree
preds <- sapply(trees, function(tree) ????)
# Average the predictions across the trees
pred_ave <- rowMeans(preds)
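One possible completion of the two blanks above, assuming each tree is fitted with rpart() on the resampled rows and then used to predict on the test set:

# Possible completion: fit rpart to the resampled rows of the training data
trees <- lapply(boot_samples, function(i) rpart(medv ~ ., data = train_data[i, ]))
# ... and predict medv on the test set, giving a (test rows) x (trees) matrix
preds <- sapply(trees, function(tree) predict(tree, newdata = test_data))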
Let’s evaluate the model performance.
# Calculate the MSE
????
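For example, one way to do this, using the pred_ave vector computed above:

# Test MSE of the averaged (bagged) predictions
mean((test_data$medv - pred_ave)^2)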
Plot the predictions against the actual values for just one of the trees.
????
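A minimal sketch using base graphics, taking the first tree as the example:

# Predictions from the first bagged tree against the actual values
plot(test_data$medv, preds[, 1],
     xlab = "Actual medv", ylab = "Predicted medv (first tree)")
abline(0, 1)  # 45-degree reference line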
Compare this with the average of the predictions across the bagged trees.
????
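The same plot for the averaged predictions might look like:

# Averaged (bagged) predictions against the actual values
plot(test_data$medv, pred_ave,
     xlab = "Actual medv", ylab = "Predicted medv (bagged)")
abline(0, 1)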
And with a single decision tree trained on the entire training set.
tree_single <- rpart(????)
????
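One possible completion (preds_single is a name chosen here for the single-tree predictions):

# A single tree on the full training set, and its test-set predictions
tree_single <- rpart(medv ~ ., data = train_data)
preds_single <- predict(tree_single, newdata = test_data)
plot(test_data$medv, preds_single,
     xlab = "Actual medv", ylab = "Predicted medv (single tree)")
abline(0, 1)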
And we can evaluate the MSE of the single tree.
????
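Using the preds_single object defined above:

# Test MSE of the single tree, for comparison with the bagged MSE
mean((test_data$medv - preds_single)^2)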
The ipred package in R provides a function called bagging that can be used to fit a bagged model. We can use this package to reproduce the above example.
library(ipred)
# Fit a bagged model
bagged <- bagging(medv ~ ., data = train_data, nbagg = 1000)
# Make predictions
preds_bagged <- ????
# Calculate the MSE
????
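One way to complete this, assuming predictions are made on the same test set as before:

# Predict with the ipred bagged model and compute the test MSE
preds_bagged <- predict(bagged, newdata = test_data)
mean((test_data$medv - preds_bagged)^2)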
How does this compare with the model built up using rpart? Are there any differences in the implementations?
The randomForest package in R provides a function called randomForest that can be used to fit a random forest model. We can use this package to improve the model from the above example.
library(randomForest)
# Fit a random forest model
rf <- ????
# Make predictions
preds_rf <- ????
# Calculate the MSE
????
# Plot the predictions against the actual
????
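One possible completion, using the randomForest() defaults apart from ntree (set to 1000 to match the bagging example) and importance = TRUE (so that the permutation importances used further below are computed):

# Fit a random forest; importance = TRUE stores permutation importances
rf <- randomForest(medv ~ ., data = train_data, ntree = 1000, importance = TRUE)
# Test-set predictions and MSE
preds_rf <- predict(rf, newdata = test_data)
mean((test_data$medv - preds_rf)^2)
# Predicted against actual values
plot(test_data$medv, preds_rf,
     xlab = "Actual medv", ylab = "Predicted medv (random forest)")
abline(0, 1)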
The randomForest package provides a function called getTree that can be used to extract individual trees from the random forest model. We can use this function to extract the first tree and display information about its splits.
# Extract the first tree
tree_rf <- getTree(rf, k = 1, labelVar = TRUE)
# Display the first tree structure
tree_rf
We can also use the following to plot the importance of the variables.
# Plot the importance of the variables
varImpPlot(rf)
Investigate what this plot is showing by looking at the corresponding help file: ?varImpPlot.
Using the caret package to fit a random forest model

As we have seen, the caret package in R provides a function called train that can also be used to fit a random forest model. We can use this package to reproduce the above example.
library(caret)
# Fit a random forest model
rf_caret <- train(????)
# Make predictions
preds_rf_caret <- ????
# Calculate the MSE
????
# Plot the predictions against the actual
????
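One possible completion, assuming method = "rf" with 5-fold cross-validation (both are arbitrary choices here):

# Fit a random forest via caret, with 5-fold CV to tune mtry
rf_caret <- train(medv ~ ., data = train_data, method = "rf",
                  trControl = trainControl(method = "cv", number = 5))
# Test-set predictions, MSE, and predicted-against-actual plot
preds_rf_caret <- predict(rf_caret, newdata = test_data)
mean((test_data$medv - preds_rf_caret)^2)
plot(test_data$medv, preds_rf_caret,
     xlab = "Actual medv", ylab = "Predicted medv (caret)")
abline(0, 1)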
As we have also seen in the lectures, there are many methods in caret that can be used to fit random forest models. Investigate the help file for train to see what other methods are available, and try a few to see if there is any appreciable difference in model performance.
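For instance, one alternative you might try is the "ranger" method (this assumes the ranger package is installed):

# A faster random forest implementation available through caret
rf_ranger <- train(medv ~ ., data = train_data, method = "ranger",
                   trControl = trainControl(method = "cv", number = 5))
rf_ranger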