MLNN classification

JP Gosling

2024-10-29

Running examples


Source: Created using the Image Creator in Bing

Handwriting classification (1)

Handwriting classification (2)

Handwriting classification (3)

Handwriting classification (4)

Handwriting classification (5)

Handwriting classification (6)


We have a training set of 2,000 digits and a test set of 1,000 digits.


The digits are 28x28 pixels in size, giving us 784 variables.

Handwriting classification (6)


We have a training set of 2,000 digits and a test set of 1,000 digits.


The digits are 28x28 pixels in size, giving us 784 variables.


There is massive correlation between the variables, but the algorithms presented here will tend to ignore this structure.

Terrible fake data

End of section

\(k\)-nearest neighbours


Source: Created using the Image Creator in Bing

The algorithm

The \(k\)-nearest neighbours algorithm is a lazy algorithm: it does not learn a model from the training data. Instead, it classifies each new data point directly from the stored training data at prediction time (a minimal sketch in code follows the steps below).


  1. Compute the distance between the new input and each training input.
  2. Sort the distances in ascending order.
  3. Select the \(k\) nearest neighbours of the new input.
  4. Classify the new input based on the majority class of the neighbours.
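
As a concrete illustration of these four steps, here is a minimal sketch in R of the classifier for a single new input, using Euclidean distance and a simple majority vote. The function knn_predict and its arguments are illustrative only and not part of any package.

# A minimal k-NN classifier for one new input (illustrative sketch only)
knn_predict <- function(new_x, train_x, train_labels, k = 3) {
  new_x <- as.numeric(unlist(new_x))
  
  # 1. Compute the Euclidean distance to every training input
  distances <- sqrt(rowSums(sweep(as.matrix(train_x), 2, new_x)^2))
  
  # 2. and 3. Sort the distances and keep the k nearest neighbours
  neighbours <- order(distances)[1:k]
  
  # 4. Classify by majority vote (ties resolved arbitrarily here)
  votes <- table(train_labels[neighbours])
  names(votes)[which.max(votes)]
}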

Choices

The \(k\)-nearest neighbours algorithm has two main choices:


  • The distance metric to use.


  • The number of neighbours to consider.


  • (Plus a technicality about ties for the majority.)

Handwriting classification (1)

library(class)

# Train the model
knn_model <- knn(train = MNIST_train[, -1], 
                 test = MNIST_test[, -1], 
                 cl = MNIST_train$label, 
                 k = 3)

# Compute the accuracy
sum(knn_model == MNIST_test$label) / length(knn_model)
## [1] 0.868

Handwriting classification (2)

Handwriting classification (3)

Handwriting classification (4)

library(caret)

# 20-fold cross-validation (each fold holds out 100 digits)
knn_cv <- train(x = MNIST_train[, -1], 
                y = as.factor(MNIST_train$label), 
                method = "knn", 
                tuneGrid = expand.grid(k = 1:10), 
                trControl = trainControl(method = "cv",
                                         number = 20))

# Plot the results
plot(knn_cv)

Handwriting classification (5)

Handwriting classification (6)

knn_cv
## k-Nearest Neighbors 
## 
## 2000 samples
##  784 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 1900, 1900, 1902, 1898, 1900, 1901, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9124143  0.9025703
##    2  0.8979975  0.8865268
##    3  0.9075030  0.8971012
##    4  0.9020126  0.8909903
##    5  0.9009925  0.8898518
##    6  0.8959968  0.8842878
##    7  0.8930119  0.8809638
##    8  0.8939970  0.8820677
##    9  0.8914812  0.8792735
##   10  0.8940216  0.8820880
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

Terrible fake data (1)

library(caret)

# 20-fold cross-validation
knn_cv <- train(x = X[, -3], 
                y = X$Class, 
                method = "knn", 
                tuneGrid = expand.grid(k = 1:10), 
                trControl = trainControl(method = "cv",
                                         number = 20))

# Plot the results
plot(knn_cv)

Terrible fake data (2)

Terrible fake data (3)

knn_cv
## k-Nearest Neighbors 
## 
## 202 samples
##   2 predictor
##   4 classes: '1', '2', '3', '4' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 191, 191, 193, 193, 191, 193, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.6426894  0.5198390
##    2  0.6090530  0.4715523
##    3  0.6311237  0.5020650
##    4  0.5773232  0.4303253
##    5  0.5945076  0.4547095
##    6  0.5940530  0.4522104
##    7  0.5987753  0.4593627
##    8  0.6131944  0.4783842
##    9  0.6088258  0.4702154
##   10  0.6002399  0.4614984
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

Terrible fake data (4)

Back to the technicalities (1)


I earlier glossed over the technicality of ties for the majority class. Ties can be handled in a number of ways:


  • Randomly select one of the majority classes.
  • Add extra neighbours until the tie is broken.
  • Remove the furthest neighbour until the tie is broken.
  • Weight the votes based on the distance.
  • Set a threshold for the class to be considered a majority.

Back to the technicalities (2)


There is also the issue of tied distances, that is, training points exactly tied with the \(k\)-th nearest neighbour. The knn function handles this (via its use.all argument) by either:


  • Randomly selecting the appropriate number of tied neighbours.
  • Using all of the tied points in the vote.
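
A brief sketch of the two settings, using the MNIST objects from the earlier slides:

library(class)

# use.all = FALSE: ties at the kth distance are broken at random,
# so exactly k neighbours vote
knn_random <- knn(train = MNIST_train[, -1], test = MNIST_test[, -1],
                  cl = MNIST_train$label, k = 3, use.all = FALSE)

# use.all = TRUE (the default): every training point tied with the
# kth nearest distance gets a vote
knn_all <- knn(train = MNIST_train[, -1], test = MNIST_test[, -1],
               cl = MNIST_train$label, k = 3, use.all = TRUE)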

End of section

Naive Bayes classifier


Source: xkcd.com/2059

The basic idea


In the absence of any information, what proportion of the data belongs to each class?


If it did belong to class A, how probable is it that the explanatory variables would look like this?


See notes

The algorithm


The Naive Bayes classifier is an eager algorithm: it learns a model from the training data before any new data are seen. It is also probabilistic in nature, although most implementations will return a class label alone by default.


  1. Estimate the class probabilities.
  2. Estimate the conditional probabilities of the explanatory variables given the class.
  3. Use Bayes’ theorem to estimate the probability of the class given the explanatory variables for new data.

Choices


The Naive Bayes classifier has one main choice:


  • The distributions of the explanatory variables given the class.


  • (Plus altering the initial class probabilities.)

Simple example (1)


\(x_1\) \(x_2\) \(y\)
1.5 A Positive
1.7 B Positive
1.3 A Positive
1.9 B Positive
2.1 A Positive
2.3 B Negative
2.5 A Negative
2.7 B Negative
1.9 A Negative

Simple example (2)


We have estimated class prior probabilities of \(P(\text{Positive}) = 5/9\) and \(P(\text{Negative}) = 4/9\).

Simple example (2)


We have estimated class prior probabilities of \(P(\text{Positive}) = 5/9\) and \(P(\text{Negative}) = 4/9\).


We have also estimated the likelihoods of the variables as follows: \[ \begin{aligned} x_1 | \text{Positive}~ &\sim N(1.7, 0.1),\\ x_1 | \text{Negative} &\sim N(2.35, 0.1),\\ x_2 | \text{Positive}~ &\sim \text{Categorical}(P_A = 0.6, P_B = 0.4),\\ x_2 | \text{Negative} &\sim \text{Categorical}(P_A = 0.5, P_B = 0.5).\\ \end{aligned} \]

Simple example (3)


We can now estimate the probability of each new data point falling into the positive class; these values are checked in code after the table.


\(x_1\) \(x_2\) \(P(\text{Positive}|x)\)
1.8 A 0.87
2.2 B 0.24
2.0 B 0.54
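
These values can be reproduced directly in R. The sketch below checks the first row, remembering that the normal distributions above are written as \(N(\text{mean}, \text{variance})\), so the standard deviation is \(\sqrt{0.1}\).

# Estimated class priors
p_pos <- 5 / 9
p_neg <- 4 / 9

# Unnormalised posteriors for x1 = 1.8, x2 = "A"
num_pos <- p_pos * dnorm(1.8, mean = 1.7, sd = sqrt(0.1)) * 0.6
num_neg <- p_neg * dnorm(1.8, mean = 2.35, sd = sqrt(0.1)) * 0.5

# Normalise to get P(Positive | x): roughly 0.87, as in the table
num_pos / (num_pos + num_neg)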

Handwriting classification (1)

library(e1071)

# Train the model
nb_model <- naiveBayes(x = MNIST_train[, -1], 
                       y = as.factor(MNIST_train$label))

# Summarise
nb_model$apriori
## as.factor(MNIST_train$label)
##   0   1   2   3   4   5   6   7   8   9 
## 191 220 198 191 214 180 200 224 172 210
# Predict for test data
nb_predict <- predict(nb_model, MNIST_test[, -1])
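
Although the predict call above returns hard class labels, the fitted model is probabilistic: e1071 will return the posterior class probabilities instead if asked, via the type = "raw" option of predict.

# Posterior class probabilities rather than hard labels
nb_probs <- predict(nb_model, MNIST_test[, -1], type = "raw")

# Each row gives the estimated P(class | x) for one test digit
head(round(nb_probs, 3))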

Handwriting classification (2)

# Calculate accuracy
sum(nb_predict == MNIST_test$label) / length(nb_predict)
## [1] 0.498
# Confusion matrix
table(MNIST_test$label, nb_predict)
##    nb_predict
##       0   1   2   3   4   5   6   7   8   9
##   0  74   1   0   0   0   1   2   0   7   0
##   1   0 125   0   0   0   0   1   0   0   0
##   2   6  22  20   5   0   4  18   0  41   0
##   3   9  20   1  26   0   5   7   2  32   5
##   4   5  10   0   0  18   0  12   1  26  38
##   5  10  13   1   0   2   1   4   0  54   2
##   6   4   5   1   0   0   0  72   0   5   0
##   7   2  24   0   1   2   1   1  29  13  26
##   8   1  19   1   1   2   2   0   0  60   3
##   9   1  13   0   0   1   0   1   0   5  73

Terrible fake data (1)

# Train the model
nb_model <- naiveBayes(x = X[, -3], 
                       y = X$Class)

# Summarise
nb_model$apriori
## X$Class
##  1  2  3  4 
## 50 50 50 52
# Predict for test data
nb_predict <- predict(nb_model, X_test[, -3])

Terrible fake data (2)

Terrible fake data (3)

Terrible fake data (3)

Terrible fake data (4)

End of section

Decision trees


Source: xkcd.com/518

Slicing the variable space (1)

Slicing the variable space (2)

Slicing the variable space (3)

Slicing the variable space (4)

Slicing the variable space (5)

The algorithm


The decision tree algorithm is another eager algorithm. It is also deterministic in most implementations, although its outputs can be converted into probabilistic predictions (for example, by using the class proportions within each leaf).


  1. Find the best orthogonal split of the data based on some criterion (a rough sketch of this step is given after this list).
  2. Recursively apply this to the resulting subsets.
  3. Stop when some stopping criterion is met.
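
As a rough sketch of step 1 only, the code below searches for the single best orthogonal split of a data frame under the Gini impurity criterion. This is for illustration; the rpart function used in the following slides does this far more efficiently. The data frame dat and the column name Class are assumptions made for the sketch.

# Gini impurity of a set of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Search every variable and threshold for the split minimising the
# weighted Gini impurity of the two resulting subsets
best_split <- function(dat, class_col = "Class") {
  y <- dat[[class_col]]
  best <- list(impurity = Inf)
  for (v in setdiff(names(dat), class_col)) {
    for (cut in sort(unique(dat[[v]]))) {
      left <- dat[[v]] <= cut
      if (all(left) || !any(left)) next
      impurity <- mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
      if (impurity < best$impurity) {
        best <- list(variable = v, threshold = cut, impurity = impurity)
      }
    }
  }
  best
}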

Choices


The decision tree algorithm has a number of choices:


  • The criterion for splitting the data.
  • The stopping criterion.
  • The maximum depth of the tree.
  • The minimum number of observations in a node.
  • (Plus a technicality about ties for the majority.)


  • See notes

A 2d illustration (1)

A 2d illustration (2)

library(rpart)

# Train the model
tree_model <- rpart(Class ~ ., 
                    data = X, 
                    method = "class")

# Plot the tree
plot(tree_model)
text(tree_model)

A 2d illustration (3)

A 2d illustration (4)

library(rpart)

# Train the model
tree_model <- rpart(Class ~ ., 
                    data = X, 
                    method = "class")

# Plot the tree
library(rpart.plot)
prp(tree_model)

A 2d illustration (4)

A 2d illustration (5)

library(rpart)

# Train the model
tree_model <- rpart(Class ~ ., 
                    data = X, 
                    method = "class")

# Plot the tree
library(rpart.plot)
rpart.plot(tree_model,
           box.palette = list("grey", "red", "green",
                              "blue", "cyan"))

A 2d illustration (5)

A 2d illustration (6)

Handwriting classification (1)

library(caret)

# Train the model with 20-fold cross-validation
tree_cv <- train(x = MNIST_train[, -1], 
                 y = as.factor(MNIST_train$label), 
                 method = "rpart",
                 maxdepth = 10,
                 trControl = trainControl(method = "cv",
                                          number = 20))

Handwriting classification (2)

tree_cv
## CART 
## 
## 2000 samples
##  784 predictor
##   10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' 
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 1903, 1901, 1900, 1900, 1899, 1899, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.05574324  0.4139438  0.3446133
##   0.06813063  0.3640007  0.2898130
##   0.08417793  0.2654192  0.1768648
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.05574324.

Handwriting classification (3)

Handwriting classification (4)

# Predict for test data
tree_predict <- predict(tree_cv, MNIST_test[, -1])

# Calculate accuracy
sum(tree_predict == MNIST_test$label) / length(tree_predict)
## [1] 0.341

Terrible fake data (1)

Terrible fake data (2)

Terrible fake data (3)

Terrible fake data (4)

End of section

Perceptron


Source: xkcd.com/1838

The basic idea (1)


Let’s find a linear combination of the data that best separates the classes.


The linear combination here is a projection onto the real line.


The “machine learning” part of the perceptron is the updating of the weights. The weights are updated based on classification errors.
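
For a labelled training input \((\mathbf{x}, y)\) with \(y \in \{-1, +1\}\), the update used in the code later in this section can be written as \[ \mathbf{w} \leftarrow \mathbf{w} + \alpha \, (y - \hat{y}) \, \mathbf{x}, \] where \(\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x})\) and \(\alpha\) is the learning rate, so the weights only move when the prediction is wrong.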


See notes

The algorithm


The perceptron algorithm is as follows:


  • Initialise the weights to zero.
  • Set the learning rate.
  • Set the maximum number of iterations.
  • Repeat the following steps until the maximum number of iterations is reached:
    • For each input in the training data:
      • Compute the predicted class label.
      • Update the weights based on the classification error.
  • Return the weights.

Relationship to LDA


The perceptron is a simple linear classifier that is closely related to LDA. Recall that LDA is similar in that it:


  • Transforms the data into a new space.
  • Uses a linear combination of the variables.
  • Uses a threshold to classify the data.


  • Unlike LDA, the perceptron (in its basic form) is not set up for multiple classes.

A 1-d illustration (1)

A 1-d illustration (2)


We start by setting the weights to zero and the learning rate to 0.1.


With zero weights, every score is zero, so the prediction amounts to an arbitrary allocation to one of the classes.


As the classes are balanced here, the accuracy will be 50%.
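
One way to see the zero-weight case in code (purely illustrative, with an arbitrary input; the training loops that follow just use the output of sign directly):

# With w = (0, 0) every score is zero and sign() returns 0
w <- c(0, 0)
y_hat <- sign(sum(w * c(1, 0.7)))

# One convention, matching the random allocation described above
if (y_hat == 0) y_hat <- sample(c(-1, 1), 1)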

A 1-d illustration (3)

We can compute new weights for the perceptron using the update rule.

# Set the learning rate
alpha <- 0.1

# Initialise the weights
w <- c(0, 0)

# For each input in the training data
for (j in 1:nrow(X)) {
  # Compute the predicted class label
  y_hat <- sign(sum(w * X[j, 1:2]))
  
  # Update the weights based on the classification error
  w <- w + alpha * (X[j, 3] - y_hat) * X[j, 1:2]
}

w
##     1        x1
## 1 0.1 0.2786494

A 1-d illustration (4)

A 1-d illustration (5)

Let’s do 30 more iterations.

max_iter <- 30

# Update the weights based on the classification error
for (i in 1:max_iter) {
  for (j in 1:nrow(X)) {
    # Compute the predicted class label
    y_hat <- sign(sum(w * X[j, 1:2]))
    
    # Update the weights based on the classification error
    w <- w + alpha * (X[j, 3] - y_hat) * X[j, 1:2]
  }
}

A 1-d illustration (6)

A 1-d illustration (7)

Let’s do 100 more iterations.

max_iter <- 100

# Update the weights based on the classification error
for (i in 1:max_iter) {
  for (j in 1:nrow(X)) {
    # Compute the predicted class label
    y_hat <- sign(sum(w * X[j, 1:2]))
    
    # Update the weights based on the classification error
    w <- w + alpha * (X[j, 3] - y_hat) * X[j, 1:2]
  }
}

A 1-d illustration (8)

A 1-d illustration (9)


The score function has the following weights:

\(w_0\) \(w_1\)
-1.5 1.5
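
These give the score function \[ s(x_1) = w_0 + w_1 x_1 = -1.5 + 1.5\,x_1. \]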


So, if \(x_1\) is 1, the score function is zero.

Multiple classes


The perceptron can be extended to multiple classes using the one-vs-all approach.


Here, we train a perceptron for each class against the rest. We then assign the class with the highest score.
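
In symbols, if \(\mathbf{w}_k\) denotes the weight vector trained for class \(k\) against the rest, a new input \(\mathbf{x}\) is assigned to \[ \hat{c}(\mathbf{x}) = \arg\max_k \, \mathbf{w}_k^\top \mathbf{x}, \] which is what the later fake-data example implements with which.max over the per-class scores.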

Handwriting classification (1)

# We need to add in the constant term
MNIST_train <- cbind(MNIST_train, 1)
MNIST_test <- cbind(MNIST_test, 1)

# Let's recode to see if we can identify the number 1
MNIST_train$label <- ifelse(MNIST_train$label == 1, 1, -1)
MNIST_test$label <- ifelse(MNIST_test$label == 1, 1, -1)

Handwriting classification (2)

# Set the learning rate
alpha <- 0.1

# Initialise the 785 weights (784 pixels plus the constant term)
weights <- rep(0, ncol(MNIST_train) - 1)

# Set maximum number of iterations
max_iter <- 3

# Update the weights based on the classification error
for (i in 1:max_iter) {
  for (j in 1:nrow(MNIST_train)) {
    # Compute the predicted class label
    y_hat <- sign(sum(weights * MNIST_train[j, -1]))
    
    # Update the weights based on the classification error
    weights <- weights + alpha * (MNIST_train[j, 1] - y_hat) * MNIST_train[j, -1]
  }
}

Handwriting classification (3)

# Now make predictions for the test data
predictions <- numeric(nrow(MNIST_test))

for (j in 1:nrow(MNIST_test)) {
  predictions[j] <- sign(sum(weights * MNIST_test[j, -1]))
}

# Calculate accuracy
sum(predictions == MNIST_test$label) / nrow(MNIST_test)
## [1] 0.996

Terrible fake data (1)

Terrible fake data (2)

# Set the learning rate
alpha <- 0.1

# Initialise the weights (1 set for each class)
weights <- matrix(0, nrow = 4, ncol = 3)

# Set maximum number of iterations
max_iter <- 50

# Update the weights based on the classification error for each class
for (k in 1:4){
  # Recode the class labels
  recoded <- ifelse(X$Class == k, 1, -1)
  
  # Update the weights based on the classification error
  for (i in 1:max_iter) {
    for (j in 1:nrow(X)) {
      # Compute the predicted class label
      y_hat <- sign(sum(weights[k, ] * X[j, -4]))
      
      # Update the weights based on the classification error
      weights[k, ] <- weights[k, ] + 
        as.numeric(alpha * (recoded[j] - y_hat) * X[j, -4])
    }
  }
}

Terrible fake data (3)

Terrible fake data (4)

# Calculate the scores for each of the four models
scores <- matrix(0, nrow = nrow(X_test), ncol = 4)
for (k in 1:4){
  scores[, k] <- X_test$x * weights[k, 2] + X_test$y * weights[k, 3] + weights[k, 1]
}

# Select the class with the highest score
X_test$Predicted <- apply(scores, 1, which.max)

Terrible fake data (5)

Terrible fake data (6)

Terrible fake data (7)

Terrible fake data (8)

Terrible fake data (9)

The multilayer perceptron


Source: Taken from Hassen Bouzgou’s PhD thesis.

End of section

Balancing classes


Source: Created using the Image Creator in Bing

Who cares?


If there is one class with far more observations than the other, the classifier will be biased towards the majority class. Is this a problem?


  • With k-NN, the majority class will dominate the prediction because the neighbours are more likely to come from that class.
  • With the naive Bayes classifier, the class “priors” will be biased towards the majority class.
  • With the decision tree, the majority class will be the default prediction.


Who cares?


If there is one class with far more observations than the other, the classifier will be biased towards the majority class. Is this a problem?


If the training data are truly representative of the population, then this is not a problem. Or is it?

Strategies


There are three main strategies for dealing with imbalanced classes:


  • Over-sampling the minority class(es). This involves duplicating observations from the minority class(es) to balance the classes.
  • Under-sampling the majority class(es). This involves removing observations from the majority class(es) to balance the classes (a minimal sketch follows this list).
  • Synthetic data generation. This involves creating new observations that are similar to the minority class(es).
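
As a counterpart to the over-sampling example that follows, here is a minimal sketch of under-sampling in base R. It assumes a data frame X with a Class column, as in the running examples.

# Under-sample every class down to the size of the smallest class
min_size <- min(table(X$Class))

X_under <- do.call(rbind, lapply(split(X, X$Class), function(class_data) {
  class_data[sample(nrow(class_data), min_size), ]
}))

# Check the classes are now balanced
table(X_under$Class)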

Over-sampling example (1)

Over-sampling example (2)

Over-sampling example (3)

Over-sampling example (4)

# How many are in each class?
table(X$Class)
## 
##  1  2  3 
## 50 50  5
# Over-sample the minority class to match the majority classes
X_over <- X
for (i in 1:9) {
  X_over <- rbind(X_over, X[X$Class == 3, ])
}

Over-sampling example (5)

Over-sampling example (6)

End of chapter