MLNN ensembles

JP Gosling

2024-11-19

Running examples


Source: Created using the Image Creator in Bing

Handwriting classification (Categorical)


We have a training set of 2,000 digits and a test set of 1,000 digits.


The digits are 28x28 pixels in size, giving us 784 variables.

Handwriting classification (Categorical)

Terrible fake data (Categorical)

Wine data (Continuous)


We have 1,599 observations of Portuguese red wines. The dataset was originally used to predict the quality of the wine based on 11 physicochemical tests.


The quality is a score between 0 and 10.


We want a continuous output so we will use alcohol as the output and ignore quality.

What did you expect? (Continuous)

End of section

Stacking

Source: xkcd.com/1885

Weak and strong learners

A weak learner is a model that is only slightly better than random guessing.


A strong learner is a model that is significantly better than random guessing.


As weak learners are so fast to train, we can use many of them to create a strong learner.


See notes

The algorithm


  1. Train a number of weak learners on the data.


  2. Combine the weak learners to create a strong learner.

The basic idea


The choices


  • How many models should we use?


  • What models should we use?


  • How should we combine the models?


  • How should we train the models?

1d example (1)

1d example (2)

1d example (3)

1d example (4)

1d example (5)

1d example (6)

# Fit a linear model to the predictions
preds <- data.frame(lm = predict(lm_full), dt = predict(dt_full), y = y)
lm_stack <- lm(y ~ ., data = preds)
summary(lm_stack)
## 
## Call:
## lm(formula = y ~ ., data = preds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34334 -0.09499 -0.01761  0.10400  0.35431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01954    0.03308  -0.591    0.556    
## lm           0.39483    0.07548   5.231 9.73e-07 ***
## dt           0.61552    0.07512   8.194 1.04e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1494 on 97 degrees of freedom
## Multiple R-squared:  0.9774, Adjusted R-squared:  0.9769 
## F-statistic:  2094 on 2 and 97 DF,  p-value: < 2.2e-16

1d example (7)

Over-fitting?


How many parameters will we have in the combined model?


Sum up all the parameters in the individual models and add the parameters needed to combine the models.
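For the 1d example above, a rough count might look like this (a sketch, assuming lm_full and lm_stack are the linear models from the earlier slides and dt_full is an rpart tree):

# Parameters in the individual models
p_lm <- length(coef(lm_full))               # intercept + slope
p_dt <- sum(dt_full$frame$var == "<leaf>")  # one fitted value per leaf
                                            # (ignoring the split points)

# Parameters needed to combine the models
p_stack <- length(coef(lm_stack))           # intercept + one weight per model

p_lm + p_dt + p_stack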

Terrible fake data (1)


Let’s try stacking on our terrible fake data.


The following three models have already failed to predict the class of the data:


  • A linear discriminant analysis model,
  • A decision tree model,
  • A k-nearest neighbours model.

Terrible fake data (2)

# Predict the classes on the training data
lda_pred <- predict(lda, X[,1:2])$class
dt_pred <- predict(dt, X[,1:2], type = "class")
knn_pred <- knn(X[,1:2],
                test = X[,1:2],
                cl = X[,3],
                k = 7)

# Stack the training-set predictions in a new data frame
train_pred <- data.frame(lda = lda_pred,
                         dt = dt_pred,
                         knn = knn_pred,
                         Class = X$Class)
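The stacked model summarised on the next slide is a naive Bayes classifier fitted to these three columns of predictions; a minimal sketch of how the nb_stack object used later might be created, assuming the e1071 package:

library(e1071)

# Combine the three sets of predictions using a naive Bayes classifier
nb_stack <- naiveBayes(Class ~ ., data = train_pred)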

Terrible fake data (3)

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         1         2         3         4 
## 0.2475248 0.2475248 0.2475248 0.2574257 
## 
## Conditional probabilities:
##    lda
## Y            1          2          3          4
##   1 0.30000000 0.20000000 0.28000000 0.22000000
##   2 0.10000000 0.70000000 0.04000000 0.16000000
##   3 0.00000000 0.10000000 0.60000000 0.30000000
##   4 0.21153846 0.28846154 0.42307692 0.07692308
## 
##    dt
## Y           1         2         3         4
##   1 0.6600000 0.0400000 0.2200000 0.0800000
##   2 0.1600000 0.6800000 0.1400000 0.0200000
##   3 0.1200000 0.0400000 0.7600000 0.0800000
##   4 0.0000000 0.3653846 0.0000000 0.6346154
## 
##    knn
## Y            1          2          3          4
##   1 0.54000000 0.06000000 0.26000000 0.14000000
##   2 0.14000000 0.54000000 0.14000000 0.18000000
##   3 0.08000000 0.04000000 0.78000000 0.10000000
##   4 0.03846154 0.09615385 0.01923077 0.84615385

Terrible fake data (4)

# Predict the classes for the test data
lda_pred_test <- predict(lda, X_test[,1:2])$class
dt_pred_test <- predict(dt, X_test[,1:2], type = "class")
knn_pred_test <- knn(X[,1:2], 
                     cl = X[,3],
                     test = X_test[,1:2], 
                     k = 7)

# Stack the predictions in a new data frame
test_pred <- data.frame(lda = lda_pred_test,
                        dt = dt_pred_test,
                        knn = knn_pred_test,
                        Class = X_test$Class)

# Predict the classes for the test data
nb_pred_test <- predict(nb_stack, test_pred)

Terrible fake data (5)

Terrible fake data (6)


We can calculate the accuracy of the individual models and the stacked model.


Model           Accuracy
LDA             0.393
Decision tree   0.67
k-NN            0.759
Stacked         0.786
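These numbers can be computed directly from the prediction vectors; a minimal sketch, assuming the test-set objects created on the previous slides:

# Accuracy of each individual model and of the stacked model on the test data
mean(lda_pred_test == X_test$Class)
mean(dt_pred_test == X_test$Class)
mean(knn_pred_test == X_test$Class)
mean(nb_pred_test == X_test$Class)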

Wine data (1)


Let’s try stacking on the wine data.


We will use the following two models:


  • A linear regression model,
  • A k-nearest neighbours model.

Wine data (2)


# Fit the models and predict the alcohol content
lm_full <- lm(alcohol ~ . - quality,
              data = wine_train)
lm_pred <- predict(lm_full, wine_train)

library(caret)
knn_full <- train(x = wine_train[, -12],
                  y = wine_train$alcohol,
                  method = "knn",
                  trControl = trainControl(method = "cv"))
knn_pred <- predict(knn_full,
                    wine_train)

Wine data (3)


# Stack the predictions in a new data frame
wine_train_pred <- data.frame(lm = lm_pred,
                              knn = knn_pred,
                              alcohol = wine_train$alcohol)

# Fit a decision tree model to the predictions
dt_stack <- rpart(alcohol ~ .,
                  data = wine_train_pred,
                  method = "anova")
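The test-set errors reported below can be computed along these lines (a sketch, assuming a hold-out data frame wine_test with the same columns as wine_train):

# Base-model predictions on the test set
lm_pred_test  <- predict(lm_full, wine_test)
knn_pred_test <- predict(knn_full, wine_test)

# Feed these into the stacked decision tree
wine_test_pred <- data.frame(lm = lm_pred_test,
                             knn = knn_pred_test)
stack_pred_test <- predict(dt_stack, wine_test_pred)

# Root mean squared errors
sqrt(mean((wine_test$alcohol - lm_pred_test)^2))
sqrt(mean((wine_test$alcohol - knn_pred_test)^2))
sqrt(mean((wine_test$alcohol - stack_pred_test)^2))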

Wine data (4)

Wine data (5)


We are totally ignoring the linear model in the combined model.


Model               RMSE
Linear regression   0.61
k-NN                0.228
Stacked             0.284


We have simply made things worse…

Is this the best use of our resources?


With stacking we are training multiple models in complete isolation before combining them. The combined model is then trained on the predictions of the individual models alone.


  • We are training with the same data multiple times.
  • We are asking each model to be the best over the entire dataset.
  • The combined model doesn’t have access to the original explanatory variables.

Some variations?

These questions lead us to consider some variations on the stacking algorithm.


  • Can we train the models on different subsets of the data?
  • Can we train the models on different subsets of the explanatory variables?
  • Can we get each model to have its own speciality?
  • Are we allowed to use the original explanatory variables in the combined model?

End of section

Bagging


Source: xkcd.com/990

Bootstrapping (1)


Recall that bootstrapping is a method of resampling the data with replacement.


The idea is to create a new dataset of the same size as the original by sampling from the original dataset with replacement, and then to use the new dataset to calculate a statistic.


See notes

Bootstrapping (2)


The algorithm looks like this in R:

# We have data stored in a data frame called my_data
n <- nrow(my_data)

# We repeat the sampling many times
statistic <- NULL
for (i in 1:1000) {
  # We want to create a new dataset of the same size
  new_data <- my_data[sample(1:n, n, replace = TRUE), ]
  
  # We can now calculate a statistic on the new dataset
  statistic[i] <- mean(new_data$variable)
}

# We can now calculate the mean and standard deviation of the statistic
mean_statistic <- mean(statistic)
sd_statistic <- sd(statistic)

Bootstrapping (2)


The same algorithm using the boot package:

# We use the boot package to do the same thing
library(boot)
statistic <- boot(data = my_data,
                  statistic = function(data, i) mean(data$variable[i]),
                  R = 1000)

# We can extract the mean and standard deviation of the statistic
mean_statistic <- mean(statistic$t)
sd_statistic <- sd(statistic$t)

Bagging


Bagging is short for bootstrap aggregating.


The idea is to train a number of models on bootstrapped samples of the data and then combine the models to create a stronger learner.


The aggregation is often a majority vote for classification problems or an average for regression problems.
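The aggregation step itself is simple; a minimal sketch, assuming a matrix preds whose columns hold the predictions from the individual models:

# Regression: average the predictions across the models
bagged_regression <- rowMeans(preds)

# Classification: majority vote across the models
majority_vote <- function(votes) names(which.max(table(votes)))
bagged_classification <- apply(preds, 1, majority_vote)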

The algorithm


  1. Train a number of models on bootstrapped samples of the data.


  2. Combine the models to create a strong learner.

The choices


  • What model should we use as a base model?


  • How many bootstrap samples should we use?


  • What model should we use to combine the predictions?

Points to consider


  • Bagging is a way to reduce the variance of a model.


  • We might want to preserve diversity in the models.


  • How is bagging related to the CV-based estimates of the error?


  • We mustn’t forget that we are not creating new data and might be reusing individual data points many times in a single model.
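The last point can be checked empirically: on average only around 63% of the original data points appear in any one bootstrap sample, which is also what makes out-of-bag (OOB) validation possible. A quick sketch:

# Fraction of distinct original points appearing in a bootstrap sample
n <- 1000
frac_used <- replicate(1000,
                       length(unique(sample(1:n, n, replace = TRUE))) / n)
mean(frac_used)   # roughly 1 - exp(-1), about 0.632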

Handwriting data (1)

Bagging can be handled in R using the caret package.

# Load the package
library(caret)

# Fit a bagged model
bag <- train(x = MNIST_train[, -1],
             y = as.factor(MNIST_train$label),
             method = "treebag",
             trControl = trainControl(method = "oob"),
             keepX = TRUE)

treebag refers to a bagged decision tree model.

oob refers to the out-of-bag validation.

Handwriting data (2)


## Bagged CART 
## 
## 2000 samples
##  784 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6803444  0.6443445

Compare this with the accuracy from a single decision tree model of 0.429.

Handwriting data (3)

Handwriting data (4)


We have 25 trees in the bagged model. We could be here for a while…


How do we interpret this mess?


Recall for the single decision tree model we had the following.

Handwriting data (5)

Handwriting data (6)


This is not an entirely honest comparison as the bagged model is based on a different implementation of the decision tree model.


We used rpart whereas treebag uses e1071 with a different set-up.


https://topepo.github.io/caret/train-models-by-tag.html

What did you expect? (1)

What did you expect? (2)

# Fit a bagged MARS model
library(caret)
library(earth)
bag <- train(x = X[, -3],
             y = X$y,
             method = "bagEarth",
             trControl = trainControl(method = "oob"),
             keepX = TRUE,
             B = 30,
             tuneGrid = expand.grid(degree = 1:5,
                                    nprune = 5))

What did you expect? (3)


## Bagged MARS 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   degree  RMSE       Rsquared   MAE      
##   1       0.3896525  0.1972920  0.2768313
##   2       0.3026231  0.5071554  0.2157730
##   3       0.3128187  0.4428551  0.2186416
##   4       0.3277315  0.4547187  0.2312810
##   5       0.3485411  0.3822639  0.2343792
## 
## Tuning parameter 'nprune' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 2.

We originally had an RMSE of 0.25 from a standard MARS model…

What did you expect? (4)


Are we falling into the trap (yet again) of over-fitting?


  • If we have a new set of 100 data points, how do the models compare?


Model         RMSE
Bagged MARS   0.232
MARS          0.292

End of section

Random forests


Source: Created using the Image Creator in Bing

The basic idea (1)


The basic idea (2)


The algorithm

We essentially have bagging with an extra twist…


  1. Create a bootstrapped sample of the data.


  2. Choose a random subset of the explanatory variables.


  3. Train a decision tree model on the bootstrapped sample using the random subset of explanatory variables.


  4. Repeat many times to get a "forest" of decision trees.
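A minimal hand-rolled sketch of these four steps (assuming rpart, a data frame my_data with a continuous response y, and choosing the variable subset per tree as described above; the randomForest package below does all of this, and more, for us):

library(rpart)

B <- 100                              # number of trees
m <- 2                                # explanatory variables per tree
n <- nrow(my_data)
x_names <- setdiff(names(my_data), "y")

forest <- vector("list", B)
for (b in 1:B) {
  # 1. Bootstrapped sample of the data
  boot_rows <- sample(1:n, n, replace = TRUE)

  # 2. Random subset of the explanatory variables
  vars <- sample(x_names, m)

  # 3. Decision tree on the bootstrapped sample using those variables
  forest[[b]] <- rpart(reformulate(vars, response = "y"),
                       data = my_data[boot_rows, ])
}

# 4. Combine: average the trees' predictions (regression case)
rf_pred <- rowMeans(sapply(forest, predict, newdata = my_data))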

The choices


  • How many trees should we use?


  • How many explanatory variables should we use?


  • All the usual choices for decision trees.


  • How do we combine the predictions?

Implementation (1)


In R, we can use the randomForest package.

# Load the package
library(randomForest)

# Fit a random forest model
rf <- randomForest(y ~ ., data = my_data)

Implementation (1)


In R, we can use the randomForest package.

# Load the package
library(randomForest)

# Fit a random forest model changing defaults
rf <- randomForest(y ~ ., data = my_data,
                   ntree = 100,
                   mtry = 2)

Implementation (2)


Unsurprisingly, the randomForest package can also be used through the train function in the caret package.

# Load the package
library(randomForest)
library(caret)

# Fit a random forest model changing defaults
rf <- train(x = my_data[, -1],
            y = my_data$y,
            method = "rf",
            trControl = trainControl(method = "oob"),
            tuneGrid = expand.grid(mtry = 1:3))

Terrible fake data (1)

Terrible fake data (2)


## Random Forest 
## 
## 202 samples
##   2 predictor
##   4 classes: '1', '2', '3', '4' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.5940594  0.4586220
##   2     0.6138614  0.4849971
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Terrible fake data (3)

Terrible fake data (4)

Handwriting data (1)


# Fit a random forest model
rf <- train(x = MNIST_train[, -1],
            y = as.factor(MNIST_train$label),
            method = "rf",
            trControl = trainControl(method = "oob"),
            tuneGrid = expand.grid(mtry = 1:5),
            keep.forest = TRUE)

Handwriting data (2)


## Random Forest 
## 
## 2000 samples
##  784 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa    
##   1     0.4055    0.3336649
##   2     0.7675    0.7408913
##   3     0.8705    0.8558992
##   4     0.9050    0.8943217
##   5     0.9130    0.9032293
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

Compare this accuracy with the accuracy from the bagged model of 0.68.

The number of trees in the forest is 500…

End of section

Boosting


Source: Created using the Image Creator in Bing

The basic idea (1)


Instead of training a number of models in isolation and then combining them, we will train a number of models in sequence.


The idea is to train a model on the data and then train a new model on the residuals of the first model.


Boosting algorithms typically consist of iteratively learning weak classifiers and adding them together to get a final strong classifier.
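For a continuous response, the "fit to the residuals" idea can be sketched directly (a minimal sketch, assuming rpart, a data frame my_data with response y and a small shrinkage factor; the gradient boosting methods mentioned later refine exactly this loop):

library(rpart)

M   <- 50     # number of weak learners
eta <- 0.1    # shrinkage (learning rate)

# Start from a constant prediction and repeatedly fit the residuals
pred  <- rep(mean(my_data$y), nrow(my_data))
trees <- vector("list", M)
for (m in 1:M) {
  resid_data <- my_data
  resid_data$y <- my_data$y - pred                            # current residuals
  trees[[m]] <- rpart(y ~ ., data = resid_data,
                      control = rpart.control(maxdepth = 2))  # weak learner
  pred <- pred + eta * predict(trees[[m]], my_data)
}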

The basic idea (2)


  • Train weak learners iteratively: Boosting starts by training a weak learner on the entire dataset.
  • Focus on errors: After the first weak learner is trained, the algorithm identifies the data points that the model incorrectly predicted. These points are then given higher importance (weight) in the next iteration.
  • Train subsequent learners: A new weak learner is trained using the adjusted weights.
  • Combine predictions: Finally, the predictions from all the weak learners are combined.


  • See notes

AdaBoost


AdaBoost is short for adaptive boosting. It was devised in 1996.


See notes

Wine data (1)

library(adabag)

# Convert the quality rating to a factor
wine_train$quality <- as.factor(wine_train$quality)
wine_test$quality <- as.factor(wine_test$quality)

# Fit an AdaBoost model
ada <- boosting(quality ~ ., 
                data = wine_train,
                control = rpart.control(maxdepth = 3),
                mfinal = 5)

Wine data (2)

Wine data (3)


ada$weights
## [1] 0.12178700 0.14231017 0.08112155 0.04562269 0.06489436


The weights here are used in the calculation of the final prediction. That is, they give the relative importance of each tree in the final model.
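A sketch of how those weights enter the final prediction, assuming (as the adabag documentation describes) that the fitted rpart trees are stored in ada$trees; this is roughly what predict.boosting does for us:

# Weighted vote over the individual trees for the test set
classes <- levels(wine_train$quality)
votes <- matrix(0, nrow = nrow(wine_test), ncol = length(classes),
                dimnames = list(NULL, classes))

for (m in seq_along(ada$trees)) {
  tree_pred <- predict(ada$trees[[m]], wine_test, type = "class")
  for (k in classes) {
    votes[, k] <- votes[, k] + ada$weights[m] * (tree_pred == k)
  }
}

# Predicted class = the one with the largest total weight
ada_pred <- classes[max.col(votes)]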

Wine data (4)

For a simple decision tree model, we had the following confusion matrix:


##        
## dt_pred  3  4  5  6  7  8
##       5  1  6 96 34  0  0
##       6  1  3 50 89 22  2
##       7  0  0  3  8  5  0


With an accuracy of 0.594.

Wine data (5)

For our AdaBoost model, we have the following confusion matrix:


##    
##       3   4   5   6   7   8
##   5   2   6 104  45   0   0
##   6   0   3  44  78  14   2
##   7   0   0   1   8  13   0


With an accuracy of 0.609.

Wine data (6)

For our AdaBoost model with 100 trees, we have the following confusion matrix:


##    
##       3   4   5   6   7   8
##   5   1   6 113  48   0   0
##   6   1   3  36  80  22   2
##   7   0   0   0   3   5   0


With an accuracy of 0.619.

Other boosting algorithms


  • Gradient boosting: It looks at the partial derivatives of the loss function to estimate the residuals of the model and to decide where to focus the next model’s efforts.
  • XGBoost is short for extreme gradient boosting: This is probably the most popular boosting algorithm. It is very similar to gradient boosting, but it considers the second derivative of the loss function too.


  • These algorithms are based on the idea of functional gradient descent, and you will meet this next term.

Wine data (1)

# Load the package
library(xgboost)

# Fit a XGBoost model for the alcohol response
xgb <- xgboost(data = as.matrix(wine_train[, -12]),
               label = wine_train$alcohol,
               nrounds = 10,
               objective = "reg:squarederror",
               verbose = 1)
## [1]  train-rmse:7.021846 
## [2]  train-rmse:4.926508 
## [3]  train-rmse:3.457073 
## [4]  train-rmse:2.427373 
## [5]  train-rmse:1.705298 
## [6]  train-rmse:1.198682 
## [7]  train-rmse:0.843322 
## [8]  train-rmse:0.594671 
## [9]  train-rmse:0.420442 
## [10] train-rmse:0.298416

Wine data (2)

The RMSE for the XGBoost model is 0.29.
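This test-set RMSE can be obtained along these lines (a sketch, assuming wine_test has the same column layout as wine_train):

# Predict the alcohol content for the test data and compute the RMSE
xgb_pred <- predict(xgb, as.matrix(wine_test[, -12]))
sqrt(mean((wine_test$alcohol - xgb_pred)^2))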

library(rpart)

# Compare with a simple tree approach
dt <- rpart(alcohol ~ ., data = wine_train)

# Predict the classes for the test data
dt_pred <- predict(dt, wine_test)

# Calculate the RMSE
sqrt(mean((wine_test$alcohol - dt_pred)^2))
## [1] 0.7162616

Wine data (3)

The RMSE for the XGBoost model is 0.29.

library(randomForest)

# Compare with a RF approach
rf <- randomForest(alcohol ~ ., data = wine_train)

# Predict the classes for the test data
rf_pred <- predict(rf, wine_test)

# Calculate the RMSE
sqrt(mean((wine_test$alcohol - rf_pred)^2))
## [1] 0.5138576

End of section

Error-correcting output codes


Source: Created using the Image Creator in Bing

The basic idea

Error-correcting output codes (ECOC) is a method for multi-class classification. To understand its origins, we need to go back to the early days of error correction in communication theory…


The idea is to transform a multi-class classification problem into a series of binary classification problems. (We have seen this before with one-vs-all.)


The difference is that we are not just considering one-vs-all, but all possible combinations of classes.

Error correcting codes

In communication theory, error-correcting codes are used to transmit data over a noisy channel. We send information with a level of redundancy so that, if some of the information is corrupted in transit, we can still recover the original message. For example:

Received message   Interpretation
000                0
001                0
010                0
100                0
011                1
101                1
110                1
111                1
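In code, interpreting a received three-bit message is just a majority vote over the bits; a small sketch of the table above:

# Majority-vote decoding of a three-bit repetition code
decode <- function(bits) as.integer(sum(bits) >= 2)

decode(c(0, 0, 1))   # interpreted as 0
decode(c(1, 0, 1))   # interpreted as 1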

The algorithm


  1. Create a code matrix: This is a matrix where each row corresponds to a class and each column corresponds to a binary classifier.
  2. Train a binary classifier: Train a binary classifier for each column in the code matrix.
  3. Predict with each classifier: For a new data point, obtain a binary prediction from each of the classifiers.
  4. Look for the closest code: Find the row in the code matrix that is closest to the predictions.

Terrible fake data (1)


We have four classes in the data. We will use the following code matrix:

Class  M1  M2  M3  M4  M5  M6  M7  M8
1       1   0   0   1   0   1   1   0
2       0   1   1   0   0  -1   0   1
3      -1   0   0  -1   1   0   0  -1
4       0  -1  -1   0  -1   0  -1   0

To cover every pairwise comparison of the four classes, we would only need 6 binary classifiers; here we train 8, so some comparisons are repeated to give redundancy.

Terrible fake data (2)


# Create the code matrix
code_matrix <- matrix(c(1, 0, 0, 1, 0, 1, 1, 0,
                        0, 1, 1, 0, 0, -1, 0, 1,
                        -1, 0, 0, -1, 1, 0, 0, -1,
                        0, -1, -1, 0, -1, 0, -1, 0), nrow = 4, byrow = TRUE)

Terrible fake data (3)


We build 8 binary classifiers. For instance, M3 is trained to distinguish between classes 2 and 4.

# Train the binary classifiers
library(caret)

# Concentrate on class 2 and 4
class_2_4 <- ifelse(X$Class == 2, 1,
                    ifelse(X$Class == 4, -1, 0))

# Train the binary classifier
M3 <- train(x = X[class_2_4 != 0, 1:2],
            y = as.factor(class_2_4[class_2_4 != 0]),
            method = "rpart")

Terrible fake data (4)

# Predict the classes for the test data
X_test$o1 <- predict(M1, X_test)
X_test$o2 <- predict(M2, X_test)
X_test$o3 <- predict(M3, X_test)
X_test$o4 <- predict(M4, X_test)
X_test$o5 <- predict(M5, X_test)
X_test$o6 <- predict(M6, X_test)
X_test$o7 <- predict(M7, X_test)
X_test$o8 <- predict(M8, X_test)

# Display some predictions
head(X_test[, c("o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8")])
##   o1 o2 o3 o4 o5 o6 o7 o8
## 1 -1  1 -1 -1  1  1  1 -1
## 2  1  1  1  1 -1  1  1  1
## 3 -1  1 -1  1  1  1  1 -1
## 4  1  1 -1  1 -1  1  1  1
## 5  1  1  1  1 -1 -1 -1  1
## 6  1  1 -1  1 -1  1  1  1

Terrible fake data (5)


Compare the codes with the code matrix.

# Calculate the Hamming distance between the predictions and 
# each row of the code matrix
hamming <- function(x, y) {
  sum(x != y)
}

# Predict the classes
X_test$Class_pred <- apply(X_test[, c("o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8")], 1,
                           function(x) which.min(apply(code_matrix, 1,
                                                       function(y) hamming(x, y))))

Terrible fake data (6)


End of chapter