We have a training set of 2,000 digits and a test set of 1,000 digits.
The digits are 28x28 pixels in size giving us 784 variables.
We have 1,599 observations of Portuguese red wines. The dataset was originally used to predict the quality of the wine based on 11 physicochemical tests.
The quality is a score between 0 and 10. We want a continuous output, so we will use alcohol as the output and ignore quality.
A weak learner is a model that is only slightly better than random guessing.
A strong learner is a model that is significantly better than random guessing.
As weak learners are so fast to train, we can use many of them to create a strong learner.
See notes
# Fit a linear model to the predictions
preds <- data.frame(lm = predict(lm_full), dt = predict(dt_full), y = y)
lm_stack <- lm(y ~ ., data = preds)
summary(lm_stack)
##
## Call:
## lm(formula = y ~ ., data = preds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34334 -0.09499 -0.01761 0.10400 0.35431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01954 0.03308 -0.591 0.556
## lm 0.39483 0.07548 5.231 9.73e-07 ***
## dt 0.61552 0.07512 8.194 1.04e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1494 on 97 degrees of freedom
## Multiple R-squared: 0.9774, Adjusted R-squared: 0.9769
## F-statistic: 2094 on 2 and 97 DF, p-value: < 2.2e-16
How many parameters will we have in the combined model?
Sum up all the parameters in the individual models and add the parameters needed to combine the models.
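As a rough sketch of that count in R (assuming lm_full is an ordinary lm fit and dt_full is an rpart tree, so its split points act as its parameters):
n_lm    <- length(coef(lm_full))               # coefficients of the linear model
n_dt    <- sum(dt_full$frame$var != "<leaf>")  # one split per internal node of the tree
n_stack <- length(coef(lm_stack))              # intercept plus one weight per base model
n_lm + n_dt + n_stack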
Let’s try stacking on our terrible fake data.
The following three models have already failed to predict the class of the data:
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 1 2 3 4
## 0.2475248 0.2475248 0.2475248 0.2574257
##
## Conditional probabilities:
## lda
## Y 1 2 3 4
## 1 0.30000000 0.20000000 0.28000000 0.22000000
## 2 0.10000000 0.70000000 0.04000000 0.16000000
## 3 0.00000000 0.10000000 0.60000000 0.30000000
## 4 0.21153846 0.28846154 0.42307692 0.07692308
##
## dt
## Y 1 2 3 4
## 1 0.6600000 0.0400000 0.2200000 0.0800000
## 2 0.1600000 0.6800000 0.1400000 0.0200000
## 3 0.1200000 0.0400000 0.7600000 0.0800000
## 4 0.0000000 0.3653846 0.0000000 0.6346154
##
## knn
## Y 1 2 3 4
## 1 0.54000000 0.06000000 0.26000000 0.14000000
## 2 0.14000000 0.54000000 0.14000000 0.18000000
## 3 0.08000000 0.04000000 0.78000000 0.10000000
## 4 0.03846154 0.09615385 0.01923077 0.84615385
# Predict the classes for the test data with each base model
library(class)   # for knn
lda_pred_test <- predict(lda, X_test[, 1:2])$class
dt_pred_test  <- predict(dt, X_test[, 1:2], type = "class")
knn_pred_test <- knn(train = X[, 1:2],
                     test = X_test[, 1:2],
                     cl = X[, 3],
                     k = 7)
# Stack the predictions in a new data frame
test_pred <- data.frame(lda = lda_pred_test,
                        dt = dt_pred_test,
                        knn = knn_pred_test,
                        Class = X_test$Class)
# Predict the classes for the test data with the stacked naive Bayes model
nb_pred_test <- predict(nb_stack, test_pred)
We can calculate the accuracy of the individual models and the stacked model.
Model | Accuracy |
---|---|
LDA | 0.393 |
Decision tree | 0.67 |
k-NN | 0.759 |
Stacked | 0.786 |
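As a minimal sketch, these accuracies can be computed directly from the stacked test-set predictions built above:
# Accuracy of each base model and of the stacked model on the test data
mean(test_pred$lda == test_pred$Class)   # LDA
mean(test_pred$dt  == test_pred$Class)   # decision tree
mean(test_pred$knn == test_pred$Class)   # k-NN
mean(nb_pred_test  == test_pred$Class)   # stacked naive Bayes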
Let’s try stacking on the wine data.
We will use the following two models:
# Fit the models and predict the alcohol content
lm_full <- lm(alcohol ~ . - quality, data = wine_train)
lm_pred <- predict(lm_full, wine_train)
library(caret)
knn_full <- train(x = wine_train[, -12],
                  y = wine_train$alcohol,
                  method = "knn",
                  trControl = trainControl(method = "cv"))
knn_pred <- predict(knn_full, wine_train)
We are totally ignoring the linear model in the combined model.
Model | RMSE |
---|---|
Linear regression | 0.61 |
k-NN | 0.228 |
Stacked | 0.284 |
We have simply made things worse…
With stacking we are training multiple models in complete isolation before combining them. The combined model is then trained on the predictions of the individual models alone.
These questions lead us to consider some variations on the stacking algorithm.
Recall that bootstrapping is a method of resampling the data with replacement.
The idea is to create a new dataset of the same size as the original dataset by sampling from the original dataset with replacement and then use the new dataset to calculate a statistic.
See notes
The algorithm looks like this in R:
# We have data stored in a data frame called my_data
n <- nrow(my_data)
# We repeat the sampling many times
statistic <- NULL
for (i in 1:1000) {
  # We want to create a new dataset of the same size
  new_data <- my_data[sample(1:n, n, replace = TRUE), ]
  # We can now calculate a statistic on the new dataset
  statistic[i] <- mean(new_data$variable)
}
# We can now calculate the mean and standard deviation of the statistic
mean_statistic <- mean(statistic)
sd_statistic <- sd(statistic)
Bagging is short for bootstrap aggregating.
The idea is to train a number of models on bootstrapped samples of the data and then combine the models to create a stronger learner.
The aggregation is often a majority vote for classification problems or an average for regression problems.
For each bootstrapped sample, we can calculate the error of the model on the out-of-bag data.
These are the data points that were not included in the bootstrapped sample.
Unlike cross-validation, the coverage of the out-of-bag data is not guaranteed as we have a random element in the sampling.
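To make the idea concrete, here is a hand-rolled bagging sketch with an out-of-bag error estimate for the wine data (illustrative only; caret automates all of this below):
library(rpart)
n <- nrow(wine_train)
B <- 25                                        # number of bootstrapped trees
oob_pred <- matrix(NA, nrow = n, ncol = B)
for (b in 1:B) {
  idx  <- sample(1:n, n, replace = TRUE)       # bootstrap sample
  tree <- rpart(alcohol ~ . - quality, data = wine_train[idx, ])
  oob  <- setdiff(1:n, idx)                    # observations left out of this sample
  oob_pred[oob, b] <- predict(tree, wine_train[oob, ])
}
# Aggregate by averaging each observation's out-of-bag predictions
bag_oob <- rowMeans(oob_pred, na.rm = TRUE)
sqrt(mean((wine_train$alcohol - bag_oob)^2, na.rm = TRUE))   # OOB RMSE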
Bagging can be handled in R using the caret package.
# Load the package
library(caret)
# Fit a bagged model
bag <- train(x = MNIST_train[, -1],
             y = as.factor(MNIST_train$label),
             method = "treebag",
             trControl = trainControl(method = "oob"),
             keepX = TRUE)
treebag refers to a bagged decision tree model. oob refers to out-of-bag validation.
## Bagged CART
##
## 2000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling results:
##
## Accuracy Kappa
## 0.6803444 0.6443445
Compare this with the accuracy from a single decision tree model of 0.429.
We have 25 trees in the bagged model. We could be here for a while…
How do we interpret this mess?
Recall that for the single decision tree model we had the following.
This is not an entirely honest comparison, as the bagged model is based on a different implementation of the decision tree model.
We used rpart, whereas treebag uses the ipred package with a different set-up.
## Bagged MARS
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling results across tuning parameters:
##
## degree RMSE Rsquared MAE
## 1 0.3896525 0.1972920 0.2768313
## 2 0.3026231 0.5071554 0.2157730
## 3 0.3128187 0.4428551 0.2186416
## 4 0.3277315 0.4547187 0.2312810
## 5 0.3485411 0.3822639 0.2343792
##
## Tuning parameter 'nprune' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 2.
We originally had an RMSE of 0.25 from a standard MARS model…
Are we falling into the trap (yet again) of over-fitting?
Model | RMSE |
---|---|
Bagged MARS | 0.232 |
MARS | 0.292 |
We essentially have bagging with an extra twist…
In R, we can use the randomForest package.
Unsurprisingly, the randomForest package has a train method in the caret package.
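As a sketch, the call for the fake four-class data might look like this (the resampling scheme and tuning grid are assumptions):
library(caret)
rf_fake <- train(x = X[, 1:2],
                 y = as.factor(X[, 3]),
                 method = "rf",
                 tuneGrid = data.frame(mtry = 1:2),
                 trControl = trainControl(method = "cv"))
rf_fake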
## Random Forest
##
## 202 samples
## 2 predictor
## 4 classes: '1', '2', '3', '4'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.5940594 0.4586220
## 2 0.6138614 0.4849971
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Random Forest
##
## 2000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.4055 0.3336649
## 2 0.7675 0.7408913
## 3 0.8705 0.8558992
## 4 0.9050 0.8943217
## 5 0.9130 0.9032293
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
Compare this accuracy with the accuracy from the bagged model of 0.68.
The number of trees in the forest is 500…
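caret forwards extra arguments to randomForest, so the forest size can be reduced via ntree if 500 trees take too long; a sketch:
# A smaller forest trains faster, usually at some cost in accuracy
rf_small <- train(x = MNIST_train[, -1],
                  y = as.factor(MNIST_train$label),
                  method = "rf",
                  ntree = 100,
                  trControl = trainControl(method = "oob"))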
Instead of training a number of models in isolation and then combining them, we will train a number of models in sequence.
The idea is to train a model on the data and then train a new model on the residuals of the first model.
Boosting algorithms typically consist of iteratively learning weak classifiers and adding them to obtain a final strong classifier.
AdaBoost is short for adaptive boosting. It was devised in 1996.
See notes
## [1] 0.12178700 0.14231017 0.08112155 0.04562269 0.06489436
The weights here are used in the calculation of the final prediction; they give the relative importance of each tree in the final model.
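One way to obtain such weights is with the adabag package (an assumption; the notes may use a different implementation):
library(adabag)
wq <- wine_train
wq$quality <- as.factor(wq$quality)                  # adabag needs a factor response
ada <- boosting(quality ~ ., data = wq, mfinal = 5)  # 5 boosted trees
ada$weights                                          # relative weight of each tree in the final vote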
For a simple decision tree model, we had the following confusion matrix:
##
## dt_pred 3 4 5 6 7 8
## 5 1 6 96 34 0 0
## 6 1 3 50 89 22 2
## 7 0 0 3 8 5 0
With an accuracy of 0.594.
For our AdaBoost model, we have the following confusion matrix:
##
## 3 4 5 6 7 8
## 5 2 6 104 45 0 0
## 6 0 3 44 78 14 2
## 7 0 0 1 8 13 0
With an accuracy of 0.609.
For our AdaBoost model with 100 trees, we have the following confusion matrix:
##
## 3 4 5 6 7 8
## 5 1 6 113 48 0 0
## 6 1 3 36 80 22 2
## 7 0 0 0 3 5 0
With an accuracy of 0.619.
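The quoted accuracies are simply the proportion of test wines whose quality class is predicted correctly; as a sketch (ada_pred is a hypothetical name for the AdaBoost test-set predictions):
mean(as.character(ada_pred) == as.character(wine_test$quality))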
# Load the package
library(xgboost)
# Fit an XGBoost model for the alcohol response
xgb <- xgboost(data = as.matrix(wine_train[, -12]),
               label = wine_train$alcohol,
               nrounds = 10,
               objective = "reg:squarederror",
               verbose = 1)
## [1] train-rmse:7.021846
## [2] train-rmse:4.926508
## [3] train-rmse:3.457073
## [4] train-rmse:2.427373
## [5] train-rmse:1.705298
## [6] train-rmse:1.198682
## [7] train-rmse:0.843322
## [8] train-rmse:0.594671
## [9] train-rmse:0.420442
## [10] train-rmse:0.298416
The RMSE for the XGBoost model is 0.29.
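That test RMSE can be computed in the same way as for the other models (a sketch, assuming wine_test has the same column layout as wine_train):
# Predict the alcohol content for the test data and calculate the RMSE
xgb_pred <- predict(xgb, as.matrix(wine_test[, -12]))
sqrt(mean((wine_test$alcohol - xgb_pred)^2))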
library(rpart)
# Compare with a simple tree approach
dt <- rpart(alcohol ~ ., data = wine_train)
# Predict the alcohol content for the test data
dt_pred <- predict(dt, wine_test)
# Calculate the RMSE
sqrt(mean((wine_test$alcohol - dt_pred)^2))
## [1] 0.7162616
Recall that the RMSE for the XGBoost model was 0.29.
library(randomForest)
# Compare with a random forest approach
rf <- randomForest(alcohol ~ ., data = wine_train)
# Predict the alcohol content for the test data
rf_pred <- predict(rf, wine_test)
# Calculate the RMSE
sqrt(mean((wine_test$alcohol - rf_pred)^2))
## [1] 0.5138576
Error-correcting output codes (ECOC) is a method for multi-class classification. To understand its origins, we need to go back to the early days of error correcting in communication theory…
The idea is to transform a multi-class classification problem into a series of binary classification problems. (We have seen this before with one-vs-all.)
The difference is that we are not just considering one-vs-all, but all possible combinations of classes.
In communication theory, error correcting codes are used to transmit data over a noisy channel. We send information with a level of redundancy so that if some of the information is lost, we can still recover the original message. For example:
Sent message | Message interpretation |
---|---|
000 | 0 |
001 | 0 |
010 | 0 |
100 | 0 |
011 | 1 |
101 | 1 |
110 | 1 |
111 | 1 |
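As a toy sketch, this three-bit repetition code is decoded by majority vote:
decode <- function(bits) as.integer(sum(bits) >= 2)
decode(c(1, 0, 1))   # one bit corrupted in transit, still decodes to 1
decode(c(0, 1, 0))   # decodes to 0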
We have four classes in the data. We will use the following code matrix:
Class | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | -1 | 0 | 1 |
3 | -1 | 0 | 0 | -1 | 1 | 0 | 0 | -1 |
4 | 0 | -1 | -1 | 0 | -1 | 0 | -1 | 0 |
To cover every pairwise comparison we would need at least 6 binary classifiers.
Here we build 8 binary classifiers, one per column of the code matrix, so some comparisons are duplicated for redundancy. For instance, M3 is trained to distinguish between classes 2 and 4.
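As a sketch of how one column of the code matrix is turned into a binary classifier (the base learner, rpart, and the layout of the training data X are assumptions):
library(rpart)
# Train M3: classes coded 0 in its column are left out entirely
m3_data        <- X[X$Class %in% c(2, 4), ]
m3_data$target <- factor(ifelse(m3_data$Class == 2, 1, -1))  # +1 for class 2, -1 for class 4
M3 <- rpart(target ~ . - Class, data = m3_data)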
# Predict the classes for the test data
X_test$o1 <- predict(M1, X_test)
X_test$o2 <- predict(M2, X_test)
X_test$o3 <- predict(M3, X_test)
X_test$o4 <- predict(M4, X_test)
X_test$o5 <- predict(M5, X_test)
X_test$o6 <- predict(M6, X_test)
X_test$o7 <- predict(M7, X_test)
X_test$o8 <- predict(M8, X_test)
# Display some predictions
head(X_test[, c("o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8")])
## o1 o2 o3 o4 o5 o6 o7 o8
## 1 -1 1 -1 -1 1 1 1 -1
## 2 1 1 1 1 -1 1 1 1
## 3 -1 1 -1 1 1 1 1 -1
## 4 1 1 -1 1 -1 1 1 1
## 5 1 1 1 1 -1 -1 -1 1
## 6 1 1 -1 1 -1 1 1 1
Compare the codes with the code matrix.
# Calculate the Hamming distance between the predictions and
# each row of the code matrix
hamming <- function(x, y) {
  sum(x != y)
}
# Predict the class as the code-matrix row closest to the predicted code
X_test$Class_pred <- apply(X_test[, c("o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8")], 1,
                           function(x) which.min(apply(code_matrix, 1,
                                                       function(y) hamming(x, y))))
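The predicted class for each test point is the code-matrix row with the smallest Hamming distance; its accuracy can then be checked directly:
# Accuracy of the ECOC classifier on the test data
mean(X_test$Class_pred == X_test$Class)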