We have a training set of 2,000 digits and a test set of 1,000 digits.
The digits are 28x28 pixels in size giving us 784 variables.
There is massive correlation in the variables, but the algorithms presented here will tend to ignore this.
The \(k\)-nearest neighbours algorithm is a lazy algorithm. This means that it does not learn a model from the training data. Instead, it classifies new data points based on the training data.
The \(k\)-nearest neighbours algorithm has two main choices: the number of neighbours \(k\), and the distance metric used to decide which training points are “nearest”.
library(class)
# Train the model
knn_model <- knn(train = MNIST_train[, -1],
                 test = MNIST_test[, -1],
                 cl = MNIST_train$label,
                 k = 3)
# Compute the accuracy
sum(knn_model == MNIST_test$label) / length(knn_model)
## [1] 0.868
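The resampling results below, by contrast, use cross-validation on the training set (via caret) to choose \(k\). The exact call is not shown here, but a sketch of the kind of call that would produce this output (the object name knn_cv, the 20-fold CV and the grid k = 1 to 10 are read off the output, so treat them as assumptions) is:
library(caret)
# Sketch: 20-fold cross-validation over k = 1, ..., 10 (assumed call)
knn_cv <- train(x = MNIST_train[, -1],
                y = as.factor(MNIST_train$label),
                method = "knn",
                trControl = trainControl(method = "cv", number = 20),
                tuneGrid = data.frame(k = 1:10))
knn_cv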
## k-Nearest Neighbors
##
## 2000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 1900, 1900, 1902, 1898, 1900, 1901, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9124143 0.9025703
## 2 0.8979975 0.8865268
## 3 0.9075030 0.8971012
## 4 0.9020126 0.8909903
## 5 0.9009925 0.8898518
## 6 0.8959968 0.8842878
## 7 0.8930119 0.8809638
## 8 0.8939970 0.8820677
## 9 0.8914812 0.8792735
## 10 0.8940216 0.8820880
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
## k-Nearest Neighbors
##
## 202 samples
## 2 predictor
## 4 classes: '1', '2', '3', '4'
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 191, 191, 193, 193, 191, 193, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.6426894 0.5198390
## 2 0.6090530 0.4715523
## 3 0.6311237 0.5020650
## 4 0.5773232 0.4303253
## 5 0.5945076 0.4547095
## 6 0.5940530 0.4522104
## 7 0.5987753 0.4593627
## 8 0.6131944 0.4783842
## 9 0.6088258 0.4702154
## 10 0.6002399 0.4614984
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
I earlier brushed over the technicality of ties in the majority vote. This can be handled in a number of ways: breaking the tie at random, reducing \(k\) until the tie is broken, or weighting the votes by distance.
There is also the issue of tied distances. The knn function handles this by including every training point whose distance ties with that of the \(k\)th nearest neighbour in the vote (the default, use.all = TRUE) or, if use.all = FALSE, by selecting at random among the tied points so that exactly \(k\) neighbours are used.
In the absence of any information, what proportion of the data belongs to each class?
If it did belong to class A, how probable is it that the explanatory variables would look like this?
See notes
The Naive Bayes classifier is an eager algorithm. This means that it learns a model from the training data. It is also probabilistic in nature, but most implementations will return a class label alone.
The Naive Bayes classifier has one main choice: the form of distribution assumed for the likelihood of each explanatory variable (for example, normal for continuous variables and categorical for discrete ones, as below).
\(x_1\) | \(x_2\) | \(y\) |
---|---|---|
1.5 | A | Positive |
1.7 | B | Positive |
1.3 | A | Positive |
1.9 | B | Positive |
2.1 | A | Positive |
2.3 | B | Negative |
2.5 | A | Negative |
2.7 | B | Negative |
1.9 | A | Negative |
We have estimated class prior probabilities of \(P(\text{Positive}) = 5/9\) and \(P(\text{Negative}) = 4/9\).
We have also estimated the likelihoods of the variables as follows: \[ \begin{aligned} x_1 \mid \text{Positive} &\sim N(1.7, 0.1),\\ x_1 \mid \text{Negative} &\sim N(2.35, 0.1),\\ x_2 \mid \text{Positive} &\sim \text{Categorical}(P_A = 0.6, P_B = 0.4),\\ x_2 \mid \text{Negative} &\sim \text{Categorical}(P_A = 0.5, P_B = 0.5). \end{aligned} \]
We can now estimate the probability of new data points falling into the positive class.
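Under the naive assumption that \(x_1\) and \(x_2\) are conditionally independent given the class, the posterior for a new point is proportional to the prior times the two likelihoods (writing \(f\) for the normal density): \[ P(\text{Positive} \mid x_1, x_2) \propto P(\text{Positive}) \, f(x_1 \mid \text{Positive}) \, P(x_2 \mid \text{Positive}), \] and similarly for the negative class; dividing by the sum of the two unnormalised values gives the probabilities in the table below.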
\(x_1\) | \(x_2\) | \(P(\text{Positive}|x)\) |
---|---|---|
1.8 | A | 0.87 |
2.2 | B | 0.24 |
2.0 | B | 0.54 |
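As a check, the first of these posteriors can be reproduced in R from the estimates above. This is a sketch: it assumes the second parameter of \(N(\cdot,\cdot)\) above is the variance, so dnorm (which takes a standard deviation) is given sqrt(0.1).
# Priors and likelihoods estimated above (sketch; 0.1 taken as the variance)
prior_pos <- 5 / 9
prior_neg <- 4 / 9
# Unnormalised posteriors for the first new point: x1 = 1.8, x2 = "A"
pos <- prior_pos * dnorm(1.8, mean = 1.7, sd = sqrt(0.1)) * 0.6
neg <- prior_neg * dnorm(1.8, mean = 2.35, sd = sqrt(0.1)) * 0.5
# Normalise to get P(Positive | x); approximately 0.87, as in the table
pos / (pos + neg)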
library(e1071)
# Train the model
nb_model <- naiveBayes(x = MNIST_train[, -1],
                       y = as.factor(MNIST_train$label))
# Summarise
nb_model$apriori
## as.factor(MNIST_train$label)
## 0 1 2 3 4 5 6 7 8 9
## 191 220 198 191 214 180 200 224 172 210
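The test-set accuracy and confusion matrix below come from a prediction step that is not shown above; a sketch consistent with the object name nb_predict in the output would be:
# Predict the test labels and compute the accuracy (sketch)
nb_predict <- predict(nb_model, MNIST_test[, -1])
sum(nb_predict == MNIST_test$label) / length(nb_predict)
# Cross-tabulate the true labels against the predictions
table(MNIST_test$label, nb_predict)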
## [1] 0.498
## nb_predict
## 0 1 2 3 4 5 6 7 8 9
## 0 74 1 0 0 0 1 2 0 7 0
## 1 0 125 0 0 0 0 1 0 0 0
## 2 6 22 20 5 0 4 18 0 41 0
## 3 9 20 1 26 0 5 7 2 32 5
## 4 5 10 0 0 18 0 12 1 26 38
## 5 10 13 1 0 2 1 4 0 54 2
## 6 4 5 1 0 0 0 72 0 5 0
## 7 2 24 0 1 2 1 1 29 13 26
## 8 1 19 1 1 2 2 0 0 60 3
## 9 1 13 0 0 1 0 1 0 5 73
## X$Class
## 1 2 3 4
## 50 50 50 52
The decision tree algorithm is another eager algorithm. It is also deterministic in most implementations, although its output can be converted into probabilistic predictions.
The decision tree algorithm has a number of choices: the impurity measure used to select splits, how deep the tree is allowed to grow, and how much it is pruned back (governed by the complexity parameter cp tuned below).
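The cross-validated results below again come from caret. The call is not shown, but a sketch that would give this kind of output (the object name tree_cv matches the predict call further down; method "rpart" fits a CART model and, by default, tunes three values of cp, so treat the exact arguments as assumptions) is:
library(caret)
# Sketch: 20-fold cross-validation of a CART model over cp (assumed call)
tree_cv <- train(x = MNIST_train[, -1],
                 y = as.factor(MNIST_train$label),
                 method = "rpart",
                 trControl = trainControl(method = "cv", number = 20))
tree_cv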
## CART
##
## 2000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 1903, 1901, 1900, 1900, 1899, 1899, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.05574324 0.4139438 0.3446133
## 0.06813063 0.3640007 0.2898130
## 0.08417793 0.2654192 0.1768648
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.05574324.
# Predict for test data
tree_predict <- predict(tree_cv, MNIST_test[, -1])
# Calculate accuracy
sum(tree_predict == MNIST_test$label) / length(tree_predict)
## [1] 0.341
Let’s find a linear combination of the data that best separates the classes.
The linear combination here is a projection onto the real line.
The “machine learning” part of the perceptron is the updating of the weights. The weights are updated based on classification errors.
See notes
The perceptron algorithm is as follows:
1. Initialise the weights (for example, to zero) and choose a learning rate \(\alpha\).
2. For each training observation, compute the predicted label \(\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x})\).
3. Update the weights based on the classification error.
4. Repeat the pass over the training data for a fixed number of iterations, or until the weights stop changing.
The perceptron is a simple linear classifier that is closely related to LDA. Recall that LDA is similar in that it also computes a linear combination of the explanatory variables (a projection onto the real line) and classifies a point according to where its score falls.
We start by setting the weights to zero and the learning rate to 0.1.
With zero weights, the prediction is effectively a random allocation to one of the classes.
As the classes are balanced here, the expected accuracy is 50%.
We can compute new weights for the perceptron using the update rule.
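For a training point with inputs \(\mathbf{x}\) and label \(y \in \{-1, +1\}\), the rule implemented in the code below is \[ \mathbf{w} \leftarrow \mathbf{w} + \alpha \, (y - \hat{y}) \, \mathbf{x}, \qquad \hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x}), \] so the weights move only when the point is misclassified (or its score is exactly zero), and they move in the direction of \(\mathbf{x}\) scaled by the learning rate \(\alpha\).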
# Set the learning rate
alpha <- 0.1
# Initialise the weights
w <- c(0, 0)
# For each input in the training data
for (j in 1:nrow(X)) {
  # Compute the predicted class label
  y_hat <- sign(sum(w * X[j, 1:2]))
  # Update the weights based on the classification error
  w <- w + alpha * (X[j, 3] - y_hat) * X[j, 1:2]
}
w
## 1 x1
## 1 0.1 0.2786494
Let’s do 30 more iterations.
Let’s do 100 more iterations.
The score function has the following weights:
\(w_0\) | \(w_1\) |
---|---|
-1.5 | 1.5 |
So, if \(x_1 = 1\), the score is \(w_0 + w_1 x_1 = -1.5 + 1.5 \times 1 = 0\); the decision boundary therefore lies at \(x_1 = 1\).
The perceptron can be extended to multiple classes using the one-vs-all approach.
Here, we train a perceptron for each class against the rest. We then assign the class with the highest score.
# Set the learning rate
alpha <- 0.1
# Initialise the 784 weights
weights <- rep(0, ncol(MNIST_train) - 1)
# Set maximum number of iterations
max_iter <- 3
# Update the weights based on the classification error
for (i in 1:max_iter) {
  for (j in 1:nrow(MNIST_train)) {
    # Compute the predicted class label
    y_hat <- sign(sum(weights * MNIST_train[j, -1]))
    # Update the weights based on the classification error
    weights <- weights + alpha * (MNIST_train[j, 1] - y_hat) * MNIST_train[j, -1]
  }
}
# Now make predictions for the test data
predictions <- NULL
for (j in 1:nrow(MNIST_test)) {
  predictions[j] <- sign(sum(weights * MNIST_test[j, -1]))
}
# Calculate accuracy
sum(predictions == MNIST_test$label) / nrow(MNIST_test)
## [1] 0.996
# Set the learning rate
alpha <- 0.1
# Initialise the weights (1 set for each class)
weights <- matrix(0, nrow = 4, ncol = 3)
# Set maximum number of iterations
max_iter <- 50
# Update the weights based on the classification error for each class
for (k in 1:4) {
  # Recode the class labels
  recoded <- ifelse(X$Class == k, 1, -1)
  # Update the weights based on the classification error
  for (i in 1:max_iter) {
    for (j in 1:nrow(X)) {
      # Compute the predicted class label
      y_hat <- sign(sum(weights[k, ] * X[j, -4]))
      # Update the weights based on the classification error
      weights[k, ] <- weights[k, ] +
        as.numeric(alpha * (recoded[j] - y_hat) * X[j, -4])
    }
  }
}
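The loop above only fits the four weight vectors; to assign classes we still need the "highest score wins" step. A sketch of that prediction step on the training data follows, assuming (as in the loop above) that the first three columns of X are the inputs and the fourth is Class; max.col picks, for each row, the column with the largest score.
# Sketch: score each observation against all four weight vectors
scores <- as.matrix(X[, -4]) %*% t(weights)
# Assign the class with the highest score and check the training accuracy
predicted_class <- max.col(scores)
sum(predicted_class == X$Class) / nrow(X)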
If there is one class with far more observations than the other, the classifier will be biased towards the majority class. Is this a problem?
If the training data are truly representative of the population, then this is not a problem. Or is it?
There are three main strategies for dealing with imbalanced classes: undersampling the majority class, oversampling the minority class, and adjusting the misclassification costs (or class weights) used by the classifier.
##
## 1 2 3
## 50 50 5
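As a concrete illustration of the second strategy, random oversampling of the minority class might look like this. This is only a sketch: the data frame name imbalanced_df and its Class column are hypothetical, with the class counts shown in the table above.
# Sketch: randomly oversample class 3 (5 observations) up to 50
minority <- imbalanced_df[imbalanced_df$Class == 3, ]
extra <- minority[sample(nrow(minority), 45, replace = TRUE), ]
balanced_df <- rbind(imbalanced_df, extra)
table(balanced_df$Class)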