We have a training set of 2,000 digits and a test set of 1,000 digits (a subset of the MNIST handwritten-digit data).
Each digit is a 28x28 pixel image, giving us 784 variables.
The best-known clustering technique in statistics is agglomerative hierarchical clustering.
The idea is simple: we start with each observation in its own cluster and then merge the closest clusters until we have one big cluster.
The definition of “closest” is the key to the algorithm.
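As a sketch of what this looks like in R, using the built-in iris measurements and complete linkage as one possible definition of “closest”:
# Agglomerative hierarchical clustering of the iris measurements
d_iris <- dist(iris[, 1:4])                 # pairwise Euclidean distances
hc <- hclust(d_iris, method = "complete")   # repeatedly merge the closest clusters
plot(hc, labels = FALSE)                    # dendrogram of the merges
clusters <- cutree(hc, k = 3)               # cut the tree to obtain three clusters
table(clusters, iris$Species)               # compare with the known species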
The \(k\)-means algorithm is a simple iterative algorithm that partitions a dataset into \(k\) clusters.
It sounds a bit like \(k\)-nearest neighbours, but it’s actually quite different.
The algorithm works as follows:

1. Choose \(k\) initial cluster centroids.
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.

The algorithm is simple, but there are some choices to be made: the number of clusters \(k\), how the centroids are initialised, and the distance used to define “nearest”.
##        151      152      153
## 1 4.057093 1.435270 1.860108
## 2 4.026164 1.486607 2.051828
## 3 4.130375 1.691153 2.063977
## 4 3.957272 1.691153 1.964688
## 5 4.069398 1.536229 1.833030
## 6 3.797368 1.272792 1.489966
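The output above appears to show the distance from the first six flowers to three candidate centroids stored as rows 151-153 of an augmented data matrix. A sketch of how such a table might be computed, assuming random starting centroids (the original centroids are not shown, so the values will differ):
X <- as.matrix(iris[, 1:4])                # the four iris measurements
set.seed(1)
centroids <- X[sample(nrow(X), 3), ]       # three random observations as starting centroids
D <- as.matrix(dist(rbind(X, centroids)))  # distances within the augmented matrix
round(D[1:6, 151:153], 6)                  # first six flowers vs the three centroids
# The assignment step: each flower goes to its nearest centroid
cluster <- apply(D[1:150, 151:153], 1, which.min)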
Fortunately, we don’t have to implement the \(k\)-means algorithm ourselves: base R provides the kmeans() function.
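A minimal sketch of the kind of call that might produce centroids like those below (an assumption; the exact values depend on the random starting points):
set.seed(1)
# k-means with three clusters on the four iris measurements
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
km$centers   # one row per cluster centroid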
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     4.738095    2.904762     1.790476   0.3523810
## 2     5.175758    3.624242     1.472727   0.2727273
## 3     6.314583    2.895833     4.973958   1.7031250
The choice of the number of clusters is a difficult one: we can’t simply look at the data, and very many methods have been proposed.
See notes
## [1] 681.3706
## [1] 142.7535
## [1] 538.6171
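Sums of squares like these underpin one widely used heuristic, the elbow plot (shown here purely as an illustration): run \(k\)-means for a range of values of \(k\) and plot the total within-cluster sum of squares, looking for the point where adding clusters stops helping much.
# Elbow plot for the iris measurements
set.seed(1)
wss <- sapply(1:10, function(k)
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")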
Let’s look at the \(k=16\) case for the MNIST digits.
We have 16 cluster centroids, each of which is a 28x28 pixel image.
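A sketch of how those centroids might be computed and displayed, assuming (as in the code below) that the first column of MNIST_train holds the digit label and the remaining 784 columns the pixel values:
set.seed(1)
# k-means with 16 clusters on the raw pixel values
km16 <- kmeans(MNIST_train[, -1], centers = 16, iter.max = 50, nstart = 5)
# Draw each centroid as a 28x28 image
# (the orientation depends on how the pixels are stored)
par(mfrow = c(4, 4), mar = c(0.5, 0.5, 0.5, 0.5))
for (i in 1:16) {
  image(matrix(as.numeric(km16$centers[i, ]), nrow = 28),
        col = grey.colors(256), axes = FALSE)
}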
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
The idea is to group together points that lie in densely populated regions of the data and to label isolated points in sparse regions as noise.
Again, the algorithm is simple, but there are some choices to be made: the neighbourhood radius eps, the minimum number of points minPts needed for a neighbourhood to count as dense, and the distance measure itself.
The package dbscan provides a function dbscan() that implements the DBSCAN algorithm.
It might be fairer to compare the variables on the same scale, so we might use the Pearson distance.
This is equivalent to standardising the variables and then using the ordinary Euclidean distance.
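A small sketch of that equivalence, using the iris measurements as an illustration:
# Euclidean distance on standardised variables is the Pearson distance
iris_scaled <- scale(iris[, 1:4])   # centre each variable and scale to unit variance
d_pearson <- dist(iris_scaled)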
library(dbscan)
# Perform DBSCAN
dbscan_result <- dbscan(MNIST_train[, -1], eps = 1000, minPts = 5)
# Number of clusters (cluster 0 is the noise label, hence the subtraction)
length(unique(dbscan_result$cluster)) - 1
## [1] 6
There are many other clustering techniques available.
One problem with high-dimensional data is that it can be difficult to visualise.
Another problem is that more dimensions typically means more parameters in our models.
The frustration is that many of these dimensions are redundant or irrelevant.
PCA has been around for a long time: it was first proposed by Karl Pearson in 1901…
It involves finding the “directions” in the data that explain the most variance.
The directions are linear combinations of the original variables and are called principal components. We can ask how much of the total variation in a dataset is explained by each principal component. For the four (standardised) iris measurements:
| Principal component | % of variance explained | 
|---|---|
| PC1 | 73 | 
| PC2 | 23 | 
| PC3 | 4 | 
| PC4 | 1 | 
We can also look at the loadings of the principal components.
| Variable | PC1 | PC2 | PC3 | PC4 | 
|---|---|---|---|---|
| Sepal.Length | 0.52 | -0.38 | 0.72 | 0.26 | 
| Sepal.Width | -0.27 | -0.92 | -0.24 | -0.12 | 
| Petal.Length | 0.58 | -0.02 | -0.14 | -0.8 | 
| Petal.Width | 0.56 | -0.07 | -0.63 | 0.52 | 
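Both tables can be reproduced (up to the arbitrary signs of the loadings) by running prcomp() on the standardised iris measurements; a minimal sketch:
# PCA of the four iris measurements on the standardised (correlation) scale
pca_iris <- prcomp(iris[, 1:4], scale = TRUE)
summary(pca_iris)            # proportion of variance explained by each PC
round(pca_iris$rotation, 2)  # loadings of each principal component
The same function handles the 784 pixel variables of the MNIST digits.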
# Perform PCA, adding a smidgen of noise
# (constant pixel columns have zero variance and could not otherwise be scaled)
pca_result <- prcomp(MNIST_train[, -1] + 
                       matrix(rnorm(784*2000, 0, 0.1), nrow = 2000),
                     scale = TRUE)
| Principal component | % of variance explained | 
|---|---|
| PC1 | 5 | 
| PC2 | 4 | 
| PC3 | 3 | 
| PC4 | 3 | 
| PC5 | 2 | 
| PC6 | 2 | 
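Even so, the first two components can be plotted to give a two-dimensional picture of the digits; a sketch, assuming the digit labels are in the first column of MNIST_train:
# Plot the first two principal component scores, coloured by digit label
plot(pca_result$x[, 1], pca_result$x[, 2],
     col = as.integer(MNIST_train[, 1]) + 1, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2")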
t-SNE is another dimension-reduction technique, one that outperforms PCA at preserving the non-linear structure of high-dimensional data.
It is a more modern technique: it was proposed by van der Maaten and Hinton in 2008.
The underpinning idea is to model the high-dimensional data as a set of pairwise similarities.
We then look for low-dimensional points whose pairwise similarities are as close as possible to those of the original data.
See notes
We can choose how many dimensions we want to reduce to.
As always, there are choices to be made.
library(Rtsne)
# Perform t-SNE
tsne_result <- Rtsne(MNIST_train[, -1],
                     dims = 2, 
                     initial_dims = 250)
We have an initial PCA step to reduce the number of dimensions; initial_dims is the number of dimensions to reduce to via PCA.
Another choice to be made is the perplexity parameter, a measure of the effective number of neighbours used when estimating the probability distributions.
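As with PCA, the two-dimensional embedding can be plotted; a sketch, again assuming the digit labels are in the first column of MNIST_train (the perplexity is left at its default here, but it can be set explicitly in the Rtsne() call):
# Plot the 2-D t-SNE embedding, coloured by digit label
plot(tsne_result$Y[, 1], tsne_result$Y[, 2],
     col = as.integer(MNIST_train[, 1]) + 1, pch = 19, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
# Setting the perplexity explicitly, e.g.:
# tsne_result <- Rtsne(MNIST_train[, -1], dims = 2, perplexity = 50)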