[Plot: iris dataset (Categorical)]

We have a training set of 2,000 digits and a test set of 1,000 digits.
The digits are 28x28 pixels in size, giving us 784 variables.
The best-known clustering technique in statistics is agglomerative hierarchical clustering.
The idea is simple: we start with each observation in its own cluster and then merge the closest clusters until we have one big cluster.
The definition of ‘closest’ (the linkage criterion) is the key to the algorithm.
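As a rough sketch of how this might look in R on the iris measurements (the complete-linkage and Euclidean-distance choices here are ours, purely for illustration):

# Agglomerative hierarchical clustering of the four iris measurements
iris_dist <- dist(iris[, 1:4])                 # Euclidean distances between observations
hc <- hclust(iris_dist, method = "complete")   # repeatedly merge the closest clusters
plot(hc)                                       # dendrogram showing the sequence of merges
clusters <- cutree(hc, k = 3)                  # cut the tree to obtain 3 clusters
table(clusters, iris$Species)                  # compare clusters with the known species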
The \(k\)-means algorithm is a simple iterative algorithm that partitions a dataset into \(k\) clusters.
It sounds a bit like \(k\)-nearest neighbours, but it’s actually quite different.
The algorithm works as follows:

1. Choose \(k\) initial centroids.
2. Assign each observation to its nearest centroid.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
The algorithm is simple, but there are some choices to be made: the number of clusters \(k\), how the initial centroids are chosen, and the distance measure used to define ‘nearest’.
[Plots: iris dataset (1) and (2)]

## 151 152 153
## 1 4.057093 1.435270 1.860108
## 2 4.026164 1.486607 2.051828
## 3 4.130375 1.691153 2.063977
## 4 3.957272 1.691153 1.964688
## 5 4.069398 1.536229 1.833030
## 6 3.797368 1.272792 1.489966
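The matrix above looks like the Euclidean distances from the first few observations to three initial centroids. One way to compute something of this kind (a sketch only; taking three random observations as the initial centroids is an assumption):

X <- as.matrix(iris[, 1:4])
set.seed(1)                          # arbitrary seed, for reproducibility
init <- X[sample(nrow(X), 3), ]      # three random observations as initial centroids
# Euclidean distance from each of the 150 observations to each of the 3 centroids
d <- as.matrix(dist(rbind(X, init)))[1:150, 151:153]
head(d)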
[Plots: iris dataset (3) and (4)]

Fortunately, we don’t have to implement the \(k\)-means algorithm ourselves.
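The cluster centres printed below are the kind of output produced by base R’s kmeans() function; a call along the following lines would give a table of this form (the seed, and indeed the exact arguments, are assumptions):

set.seed(42)                           # arbitrary seed
km <- kmeans(iris[, 1:4], centers = 3)
km$centers                             # one row of variable means per cluster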
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.175758 3.624242 1.472727 0.2727273
## 2 6.314583 2.895833 4.973958 1.7031250
## 3 4.738095 2.904762 1.790476 0.3523810
[Plots: iris dataset (5)–(8)]

The choice of the number of clusters is a difficult one: we can’t simply read it off the data, and a great many methods have been proposed.
See notes
[Plot: iris dataset (9)]

## [1] 681.3706
## [1] 142.7535
## [1] 538.6171
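The second and third numbers sum to the first, which suggests these are the total, within-cluster and between-cluster sums of squares of a fitted \(k\)-means object. Assuming that object is called km, as in the sketch above, they could be extracted like this:

km$totss         # total sum of squares in the data
km$tot.withinss  # total within-cluster sum of squares (smaller = tighter clusters)
km$betweenss     # between-cluster sum of squares (totss minus tot.withinss)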
[Plots: iris dataset (10) and (11)]

Let’s look at the \(k=16\) case.
We have 16 cluster centroids, each of which is a 28x28 pixel image.
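A sketch of how those centroid images might be produced and displayed (the iter.max value and the plotting details are assumptions; the pixel ordering may need adjusting for the images to appear upright):

# k-means with 16 clusters on the raw pixel values
km16 <- kmeans(MNIST_train[, -1], centers = 16, iter.max = 30)

# Draw each of the 16 centroids as a 28x28 greyscale image
par(mfrow = c(4, 4), mar = c(0.5, 0.5, 0.5, 0.5))
for (i in 1:16) {
  img <- matrix(as.numeric(km16$centers[i, ]), nrow = 28)
  image(img, col = grey.colors(256), axes = FALSE)
}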
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
Again, the algorithm is simple, but there are some choices to be made: the neighbourhood radius eps and the minimum number of points minPts required to form a dense region.
[Plot: iris dataset (1)]

The package dbscan provides a function dbscan() that implements the DBSCAN algorithm.
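On the iris measurements, a call might look like the following (the eps and minPts values here are illustrative assumptions rather than tuned choices):

library(dbscan)
# DBSCAN on the four iris measurements
db_iris <- dbscan(iris[, 1:4], eps = 0.5, minPts = 5)
table(db_iris$cluster)   # cluster 0 holds the points labelled as noise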
[Plots: iris dataset (2) and (3)]

It might be fairer to compare the variables on the same scale, so we might use the Pearson distance.
This is equivalent to standardising the variables.
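In practice the same effect can be had by standardising the columns with scale() before clustering; a minimal sketch (the eps value would need re-choosing on the standardised scale, and the value here is an assumption):

iris_std <- scale(iris[, 1:4])                    # centre and scale each variable
db_std <- dbscan(iris_std, eps = 0.8, minPts = 5) # DBSCAN on the standardised data
table(db_std$cluster)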
[Plots: iris dataset (4)–(6)]

library(dbscan)
# Perform DBSCAN
dbscan_result <- dbscan(MNIST_train[, -1], eps = 1000, minPts = 5)
# Number of clusters (cluster 0 is noise, hence the -1)
length(unique(dbscan_result$cluster)) - 1
## [1] 6
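To see how the points are spread across those clusters, and how many were labelled as noise, one might follow up with something like:

# Cluster sizes; the count under "0" is the number of noise points
table(dbscan_result$cluster)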
There are many other clustering techniques available.
One problem with high-dimensional data is that it can be difficult to visualise.
Another problem is that more dimensions typically means more parameters in our models.
The frustration is that many of these dimensions are redundant or irrelevant.
PCA has been around for a long time: it was first proposed by Karl Pearson in 1901…
It involves finding the “directions” in the data that explain the most variance.
The directions are linear combinations of the original variables and are called principal components.

We can ask how much of the total variation in a dataset is explained by each principal component.
Principal component | % of variance explained |
---|---|
PC1 | 73 |
PC2 | 23 |
PC3 | 4 |
PC4 | 1 |
We can also look at the loadings of the principal components.
Variable | PC1 | PC2 | PC3 | PC4 |
---|---|---|---|---|
Sepal.Length | 0.52 | -0.38 | 0.72 | 0.26 |
Sepal.Width | -0.27 | -0.92 | -0.24 | -0.12 |
Petal.Length | 0.58 | -0.02 | -0.14 | -0.80 |
Petal.Width | 0.56 | -0.07 | -0.63 | 0.52 |
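Both tables can be reproduced along the following lines (a sketch; prcomp may return some loadings with the opposite sign, which is arbitrary, and the rounding here is coarser than R’s default output):

pca_iris <- prcomp(iris[, 1:4], scale = TRUE)
summary(pca_iris)             # proportion of variance explained by each PC
round(pca_iris$rotation, 2)   # loadings of each PC on the original variables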
# Perform PCA, adding a smidgen of noise so that no pixel column is constant
# (scale = TRUE cannot rescale a zero-variance column to unit variance)
pca_result <- prcomp(MNIST_train[, -1] +
                       matrix(rnorm(784 * 2000, 0, 0.1), nrow = 2000),
                     scale = TRUE)
Principal component | % of variance explained |
---|---|
PC1 | 5 |
PC2 | 4 |
PC3 | 3 |
PC4 | 3 |
PC5 | 2 |
PC6 | 2 |
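The percentages in this table can be computed from the standard deviations stored in the prcomp object, for example:

# Proportion of variance explained by the first six principal components
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(100 * var_explained[1:6])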
t-SNE is another dimension-reduction technique, one that outperforms PCA at preserving the non-linear structure of high-dimensional data.
It is a more modern technique: it was proposed by van der Maaten and Hinton in 2008.
The underpinning idea is to model the high-dimensional data as a set of pairwise similarities.
We then look for low-dimensional points whose pairwise similarities match those of the high-dimensional data as closely as possible.
See notes
We can choose how many dimensions we want to reduce to.
As always, there are choices to be made.
library(Rtsne)
# Perform t-SNE
tsne_result <- Rtsne(MNIST_train[, -1],
                     dims = 2,
                     initial_dims = 250)
We have an initial PCA step to reduce the number of dimensions: initial_dims is the number of dimensions to reduce to via PCA.
As always, there are choices to be made.
The perplexity parameter is a measure of the effective number of neighbours when estimating the probability distributions.
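Perplexity is passed directly to Rtsne (its default is 30); a sketch of trying a different value and plotting the result, assuming the first column of MNIST_train holds the digit labels as numbers (an assumption based on the earlier code):

# Re-run t-SNE with a larger perplexity and colour the embedding by digit label
tsne_p50 <- Rtsne(MNIST_train[, -1],
                  dims = 2,
                  initial_dims = 250,
                  perplexity = 50)
plot(tsne_p50$Y, col = MNIST_train[, 1] + 1,
     pch = 20, xlab = "t-SNE 1", ylab = "t-SNE 2")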