MLNN unsupervised learning

JP Gosling

2024-12-09

Running examples


Source: Created using the Image Creator in Bing

iris dataset (Categorical)

Handwriting classification (Categorical)


We have a training set of 2,000 digits and a test set of 1,000 digits.


The digits are 28x28 pixels in size giving us 784 variables.

Handwriting classification (Categorical)

End of section

Clustering


Source: Created using the Image Creator in Bing

Hierarchical clustering (1)


The most well-known clustering technique in statistics is agglomerative hierarchical clustering.


The idea is simple: we start with each observation in its own cluster and then merge the closest clusters until we have one big cluster.


The definition of “closest” (the linkage criterion) is the key to the algorithm.
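
As a minimal sketch (assuming the standard iris measurements and hclust’s default complete linkage), agglomerative clustering is available in base R:

# Agglomerative hierarchical clustering on the iris measurements
hc <- hclust(dist(iris[, -5]), method = "complete")

# Cut the dendrogram to obtain, say, three clusters
clusters <- cutree(hc, k = 3)
table(clusters, iris$Species)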

Hierarchical clustering (2)

Hierarchical clustering (3)

Hierarchical clustering (4)

\(k\)-means clustering


Source: xkcd.com/1450

The basic idea


The \(k\)-means algorithm is a simple iterative algorithm that partitions a dataset into \(k\) clusters.


It sounds a bit like \(k\)-nearest neighbours, but it’s actually quite different.

The algorithm

The algorithm works as follows:


  • Randomly initialise \(k\) cluster centroids.


  • Assign each data point to the nearest cluster centroid.
  • Update the cluster centroids by computing the mean of all data points assigned to each cluster.


  • Repeat the assign and update steps until the cluster centroids no longer change.

The choices

The algorithm is simple, but there are some choices to be made:


  • How do we initialise the cluster centroids?
  • How do we measure the distance between data points?
  • How do we choose the number of clusters?
  • How do we know when to stop?


  • See notes

iris dataset (1)


# Initialise
k <- 3
centroids <- matrix(c(5,3,5,2,
                      6,3,2,1,
                      5,4,3,1), 
                    byrow = TRUE, nrow = k, ncol = 4)

# Measure the Euclidean distance
distances <- as.matrix(dist(rbind(as.matrix(iris[, -5]),
                                  centroids)))[1:nrow(iris),
                                               (nrow(iris) + 1):
                                                 (nrow(iris)+k)]

iris dataset (2)


head(distances)
##        151      152      153
## 1 4.057093 1.435270 1.860108
## 2 4.026164 1.486607 2.051828
## 3 4.130375 1.691153 2.063977
## 4 3.957272 1.691153 1.964688
## 5 4.069398 1.536229 1.833030
## 6 3.797368 1.272792 1.489966
# Assign each data point to the nearest cluster centroid
clusters <- apply(distances, 1, which.min)
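
The update and repeat steps are not shown above. A minimal sketch of the rest of the loop, continuing from the clusters vector just computed (and assuming no cluster ever becomes empty), might look like this:

# Alternate between updating the centroids and reassigning points
repeat {
  # Update: each centroid becomes the mean of its assigned points
  centroids <- t(sapply(1:k, function(j) colMeans(iris[clusters == j, -5])))

  # Assign: recompute distances and find the nearest centroid
  distances <- as.matrix(dist(rbind(as.matrix(iris[, -5]),
                                    centroids)))[1:nrow(iris),
                                                 (nrow(iris) + 1):
                                                   (nrow(iris) + k)]
  new_clusters <- apply(distances, 1, which.min)

  # Stop when the assignments no longer change
  if (all(new_clusters == clusters)) break
  clusters <- new_clusters
}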

iris dataset (3)

iris dataset (4)


Fortunately, we don’t have to implement the \(k\)-means algorithm ourselves.


kmeans_result <- kmeans(iris[, -5], centers = 3)
kmeans_result$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.175758    3.624242     1.472727   0.2727273
## 2     6.314583    2.895833     4.973958   1.7031250
## 3     4.738095    2.904762     1.790476   0.3523810
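
Note that kmeans starts from a random initialisation, so repeated runs can give different solutions. A common pragmatic fix is to run it from several random starts and keep the best one via the nstart argument:

# Run k-means from 20 random starts and keep the solution with the
# lowest total within-cluster sum of squares
kmeans_result <- kmeans(iris[, -5], centers = 3, nstart = 20)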

iris dataset (5)

iris dataset (6)

iris dataset (7)

iris dataset (8)

How many clusters?


The choice of the number of clusters is a difficult one: we can’t simply read it off the data, and very many methods have been proposed.


See notes

iris dataset (9)

# Total sum of squares
kmeans_result$totss
## [1] 681.3706
# Within sum of squares
sum(kmeans_result$withinss)
## [1] 142.7535
# Between sum of squares
kmeans_result$betweenss
## [1] 538.6171
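
One common heuristic (one of many; see the notes) is the elbow method: run k-means over a range of values of k and look for a kink in the total within-cluster sum of squares. A minimal sketch:

# Elbow heuristic: total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k)
  kmeans(iris[, -5], centers = k, nstart = 20)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")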

iris dataset (10)

iris dataset (11)

Handwriting classification (1)

Handwriting classification (2)


Let’s look at the \(k=16\) case.


We have 16 cluster centroids, each of which is a 28x28 pixel image.


kmeans_result <- kmeans(MNIST_train[, -1], centers = 16)
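
Assuming each row of MNIST_train[, -1] stores the 784 pixel intensities of a 28x28 image in row order, a single centroid can be inspected as an image; a sketch (the orientation may need flipping depending on the pixel ordering):

# View the first cluster centroid as a 28x28 greyscale image
# (assumes row-major pixel ordering; flip/rotate if it looks transposed)
centroid <- matrix(as.numeric(kmeans_result$centers[1, ]),
                   nrow = 28, byrow = TRUE)
image(centroid[, 28:1], col = grey.colors(256), axes = FALSE)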

Handwriting classification (3)


DBSCAN


DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.


  • Like \(k\)-means clustering, we use an iterative algorithm to partition the data into clusters.


  • Unlike \(k\)-means clustering, we don’t have to specify the number of clusters.

The algorithm


  • For each data point in turn, find all points within a distance of \(\epsilon\).
  • If a point has at least \(m\) points within \(\epsilon\), it is considered a core point.
  • Expand the cluster by adding all points that are density-reachable from the core point.
  • Repeat the process for all unvisited points until all points are assigned to a cluster or marked as outliers.

The choices


Again, the algorithm is simple, but there are some choices to be made:


  • How do we measure the distance between data points?
  • How do we choose \(\epsilon\) and \(m\)?
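
A widely used heuristic for \(\epsilon\) is to fix \(m\), plot the sorted distances to each point’s \(m\)-th nearest neighbour, and pick a value near the “knee” of the curve; the dbscan package provides a kNNdistplot helper for this. A sketch, assuming the iris measurements:

library(dbscan)

# Sorted distances to each point's 5th nearest neighbour;
# a sharp bend suggests a reasonable value for eps
kNNdistplot(iris[, -5], k = 5)
abline(h = 0.5, lty = 2)  # mark a candidate eps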

iris dataset (1)


The dbscan package provides a dbscan function that implements the DBSCAN algorithm.


library(dbscan)

# Perform DBSCAN on the iris dataset
dbscan_result <- dbscan(iris[, -5], eps = 0.5, minPts = 5)

iris dataset (2)

iris dataset (3)

Standardisation


It might be fairer to compare the variables on the same scale, so we might use the Pearson distance.


This is equivalent to standardising the variables.

iris_standardised <- scale(iris[, -5])
dbscan_result <- dbscan(iris_standardised, eps = 0.5, minPts = 5)

iris dataset (4)

iris dataset (5)

iris dataset (6)

Handwriting classification (1)


library(dbscan)

# Perform DBSCAN
dbscan_result <- dbscan(MNIST_train[, -1], eps = 1000, minPts = 5)

# Number of clusters (cluster 0 labels the noise points, hence the - 1)
length(unique(dbscan_result$cluster)) - 1
## [1] 6

Handwriting classification (2)

Handwriting classification (3)


Other clustering techniques


There are many other clustering techniques available.


  • Expectation-maximisation (EM) clustering: a probabilistic approach to clustering.
  • Affinity propagation: a method that identifies exemplars in the data.
  • Spectral clustering: a method that uses the eigenvectors of a similarity matrix.
  • BIRCH: a method that uses a tree structure to cluster the data.
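
As one concrete example, EM (model-based) clustering is available through the mclust package, assuming it is installed; a minimal sketch:

library(mclust)

# Fit Gaussian mixture models by EM; the number of clusters
# (and covariance structure) is chosen by BIC
mclust_result <- Mclust(iris[, -5])
table(mclust_result$classification, iris$Species)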

End of section

Dimension reduction


Source: Created using the Image Creator in Bing

The problem


One problem with high-dimensional data is that it can be difficult to visualise.


Another problem is that more dimensions typically means more parameters in our models.


The frustration is that many of these dimensions are redundant or irrelevant.

Principal component analysis (PCA)


PCA has been around for a long time: it was first proposed by Karl Pearson in 1901…


It involves finding the “directions” in the data that explain the most variance.

# Perform PCA on the iris dataset
pca_result <- prcomp(iris[, -5],
                     scale = TRUE)

Principal component analysis (PCA)


The directions are linear combinations of the original variables and are called principal components. We can ask how much of the total variation in a dataset is explained by each principal component.


Principal component    % of variance explained
PC1                    73
PC2                    23
PC3                     4
PC4                     1
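
These percentages can be read straight off the prcomp output:

# Proportion of variance explained by each principal component
summary(pca_result)

# Or directly from the component standard deviations
round(100 * pca_result$sdev^2 / sum(pca_result$sdev^2))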

Principal component analysis (PCA)


We can also look at the loadings of the principal components.


Variable        PC1    PC2    PC3    PC4
Sepal.Length   0.52  -0.38   0.72   0.26
Sepal.Width   -0.27  -0.92  -0.24  -0.12
Petal.Length   0.58  -0.02  -0.14  -0.80
Petal.Width    0.56  -0.07  -0.63   0.52
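
The loadings are stored in the rotation element of the prcomp output:

# Loadings: one column per principal component
round(pca_result$rotation, 2)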

Handwriting classification (1)

# Perform PCA
pca_result <- prcomp(MNIST_train[, -1],
                     scale = TRUE)
Error in prcomp.default(MNIST_train[, -1], scale = TRUE) : 
  cannot rescale a constant/zero column to unit variance

Handwriting classification (1)

# Perform PCA adding a smidgen of noise
pca_result <- prcomp(MNIST_train[, -1] + 
                       matrix(rnorm(784*2000, 0, 0.1), nrow = 2000),
                     scale = TRUE)

Principal component    % of variance explained
PC1                    5
PC2                    4
PC3                    3
PC4                    3
PC5                    2
PC6                    2

Handwriting classification (2)

Handwriting classification (3)

t-distributed stochastic neighbour embedding (t-SNE)


t-SNE is another dimension-reduction technique that outperforms PCA at preserving the non-linear structure of high-dimensional data.


It is a more modern technique: it was proposed by van der Maaten and Hinton in 2008.

The “basic” idea


The underpinning idea is to model the high-dimensional data as a set of pairwise similarities.


We then seek low-dimensional points whose pairwise similarities match those of the original data as closely as possible.


See notes
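
For reference, the standard definitions from van der Maaten and Hinton (2008) are as follows. The high-dimensional similarities are built from Gaussian kernels,

\[
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n},
\]

where each \(\sigma_i\) is tuned so that the conditional distribution has a user-specified perplexity, and the low-dimensional similarities use a Student-t kernel,

\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\]

with the low-dimensional points \(y_i\) chosen to minimise the Kullback–Leibler divergence \(\mathrm{KL}(P \,\|\, Q)\).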

Handwriting classification (1)


We can choose how many dimensions we want to reduce to.


library(Rtsne)

# Perform t-SNE
tsne_result <- Rtsne(MNIST_train[, -1],
                     dims = 2)

Handwriting classification (2)

Handwriting classification (3)


As always, there are choices to be made.


library(Rtsne)

# Perform t-SNE
tsne_result <- Rtsne(MNIST_train[, -1],
                     dims = 2, 
                     initial_dims = 250)

Rtsne runs an initial PCA step to reduce the number of dimensions before the t-SNE optimisation itself; initial_dims is the number of principal components retained in that step.

Handwriting classification (4)

Handwriting classification (5)


As always, there are choices to be made.


library(Rtsne)

# Perform t-SNE
tsne_result <- Rtsne(MNIST_train[, -1],
                     dims = 2, 
                     perplexity = 50)

The perplexity parameter is a measure of the effective number of neighbours when estimating the probability distributions.

Handwriting classification (6)

End of chapter