Today we will continue working with the iris dataset, which is a built-in dataset. Let's load the dataset again.
data('iris')
class(iris)
## [1] "data.frame"
dim(iris)
## [1] 150 5
head(iris, n=5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
?iris
We recall that the dataset contains measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Initially, let's store the numerical data in an object y, set the sample size n, and produce some visualisations of the data.
y = iris[, 1:4]
n = nrow(y)
par(mfrow = c(2,2))
for (i in 1:4) {
hist(y[,i], col=i+1, main = '', xlab = names(y)[i])
}
pairs(y)
We see that the distributions of petal length and petal width appear to be bimodal; there is a clear indication of the existence of two groups. Let's start by estimating the density of petal length non-parametrically.
If not installed already, we need to install the package first before loading the library.
install.packages('BNPmix')
Now, load the BNPmix library:
library(BNPmix)
The package BNPmix uses a very efficient C++ implementation for posterior sampling (Gibbs algorithms) from complex Dirichlet process mixtures. In fact, BNPmix uses the Pitman-Yor (PY) process, a generalization of the DP that includes it as a special case. We will not have time to learn about this process, but in practical terms it is sufficient to know that, in the notation we use in the lectures, we would assume that \[ G \sim \mathrm{PY}(\beta,\alpha,G_0), \] where \(\beta \in [0,1)\) is called the discount parameter, \(\alpha > -\beta\) the strength parameter and \(G_0\) is the usual centering measure. When the discount is zero, i.e. \(\beta=0\), we recover a \(\mathrm{DP}(\alpha,G_0)\) as we know it from lectures.
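As a small aside, for the DP case (\(\beta=0\)) the prior expected number of clusters in a sample of size \(n\) is \(\mathbb{E}[K_n]=\sum_{i=1}^{n}\alpha/(\alpha+i-1)\). The short sketch below (plain R, not a BNPmix function; the function name is chosen here just for illustration) computes this quantity, which is essentially what PYcalibrate (introduced below) inverts when the discount is zero.
# A small sketch: prior expected number of clusters under a DP(alpha, G0)
# for a sample of size n, E[K_n] = sum_i alpha/(alpha + i - 1)
expected.clusters = function(alpha, n) sum(alpha / (alpha + (1:n) - 1))
expected.clusters(alpha = 1, n = 150)     # roughly 5.6 clusters a priori
expected.clusters(alpha = 0.39, n = 150)  # roughly 3, compare with PYcalibrate later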
For univariate DPM models the package has the capability to fit infinite mixtures of normal densities via the stick-breaking representation \[\begin{equation} p(y_i) = \sum_{k=1}^\infty\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2), \label{eq1} \end{equation}\] with prior distributions for \(\mu_k\) and \(\sigma_k^2\) based on a conditional normal/inverse-gamma centering measure \(G_0(\mu_k,\sigma_k^2)=G_0(\mu_k ~|~ \sigma_k^2)G_0(\sigma_k^2)\) as follows: \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0),\] where \(a_0>0\) and \(b_0>0\) are the shape and scale parameters, respectively, so that \(\mathbb{E}[\sigma_k^2]=\displaystyle\frac{b_0}{a_0-1}\) for \(a_0>1\), and \(\mathbb{V}\mathrm{ar}[\sigma_k^2]=\displaystyle\frac{b_0^2}{(a_0-1)^2(a_0-2)}\) for \(a_0>2\). The prior on \(\mu_k\) is just slightly different from the prior we presented in lectures, as it includes a scalar \(k_0\) in its variance.
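To make the stick-breaking representation concrete, the short sketch below (plain R, not BNPmix; the truncation level and hyper-parameter values are illustrative assumptions, not the package defaults) draws one random density from this prior by truncating the infinite sum.
# A minimal prior-predictive sketch of the stick-breaking model above
set.seed(1)
K.trunc = 50                                   # truncate the infinite sum at 50 components
alpha   = 1                                    # DP precision (strength)
m0 = 0; k0 = 1; a0 = 2; b0 = 1                 # base-measure hyper-parameters
v      = rbeta(K.trunc, 1, alpha)              # stick-breaking fractions
pi.k   = v * cumprod(c(1, 1 - v))[1:K.trunc]   # mixture weights pi_k (sum slightly below 1)
sigma2 = 1 / rgamma(K.trunc, shape = a0, rate = b0)       # sigma_k^2 ~ InvGamma(a0, b0)
mu     = rnorm(K.trunc, mean = m0, sd = sqrt(sigma2/k0))  # mu_k | sigma_k^2 ~ N(m0, sigma_k^2/k0)
grid = seq(-5, 5, length.out = 200)
dens = sapply(grid, function(x) sum(pi.k * dnorm(x, mu, sqrt(sigma2))))
plot(grid, dens, type = 'l', xlab = 'y', ylab = 'One prior draw of p(y)')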
BNPmix uses two basic functions. The first is PYcalibrate, which requires the specification of three basic arguments:
PYcalibrate(Ek, n, discount = 0)
- Ek: a prior expectation for the number of clusters. This must be specified. For instance, if we expect to have 4 clusters we can set it equal to 4 (or close to 4). If we have no expectation at all we can set it equal to some large number.
- n: the sample size.
- discount: the discount parameter (the default is 0, resulting in a DP).
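For example, a quick call (the values here are purely illustrative) would look like this:
PYcalibrate(Ek = 4, n = 150, discount = 0)  # returns a list with elements $strength and $discount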
The function for density estimation via posterior sampling is PYdensity. Typing ?PYdensity will open the help window. This function requires the specification of 4 arguments:
PYdensity(y, mcmc = list(), prior = list(), output = list())
Let's go through these one by one and see which are the most important arguments we need to define.
- y: The data vector (for univariate estimation) or matrix (for multivariate estimation).
- mcmc: List with arguments for the MCMC implementation:
    niter: the number of MCMC iterations.
    nburn: the number of initial iterations to discard (the “burn-in” period).
- prior: List with arguments for the prior specification:
    discount: the discount parameter (here it must be set to 0 for a DP).
    strength: the strength parameter of the PY process or, correspondingly, the precision of the DP.
    hyper: logical; when TRUE (which is the default option) a hierarchical DPM model (seen later in lectures) is implemented, when FALSE a standard DPM model is implemented.
- output: List with arguments for the output returned by the function:
    grid: important, as it defines the grid of points at which to evaluate the estimated posterior mean density. For instance, if \(y\) is univariate, we can define the grid to range from \(\min(y)\) to \(\max(y)\) and also to have a specific length (e.g., 100).
    out_param: logical with default equal to FALSE. It is recommended to set it to TRUE: this returns the full set of parameters in the model; e.g., also the \(\mu_k\), \(\sigma_k^2\), etc.
Let's apply a DPM model to the variable Petal Length from the Iris dataset, whose density certainly does not appear unimodal. So, in R we are going to use the vector y[,3].
Tasks:
1. In PYcalibrate set the number of expected clusters equal to 3 and also the sample size.
2. Define an object mcmc as a list with elements niter equal to 11000 and nburn equal to 1000.
3. Define an object prior as a list with elements strength equal to 1, discount equal to 0 and hyper equal to FALSE.
4. Define an object output as a list with element grid defined as a sequence from the min of y[,3] to the max of y[,3] with length equal to 100 (type ?seq for help) and element out_param set equal to TRUE.
5. Then, set.seed(1) (for reproducibility) and execute fit1 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output). One possible setup is sketched below.
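One possible way to set this up is sketched below (check it against your own attempt; the object names follow the tasks above).
DPprior = PYcalibrate(Ek = 3, n = n, discount = 0)                  # Task 1
mcmc    = list(niter = 11000, nburn = 1000)                         # Task 2
prior   = list(strength = 1, discount = 0, hyper = FALSE)           # Task 3
output  = list(grid = seq(min(y[,3]), max(y[,3]), length.out = 100),
               out_param = TRUE)                                    # Task 4
set.seed(1)                                                         # Task 5
fit1 = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)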
If everything is done correctly, the command summary() will return the following output from fit1:
summary(fit1)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 0.93 seconds
## Average number of groups: 3.5286
## Min number of groups: 2 ; Max number of groups: 10
This provides information about the runtime, the number of iterations and the burn-in period used. Importantly, it tells us that over the 10000 retained MCMC iterations the minimum number of clusters formed was 2 and the maximum was 10, with the average number of clusters being approximately 3.5 (4 when rounded).
If we want to overlay the posterior mean density on top of the histogram and also show the 95% posterior credible bands for the density, we can use the line of code below. Overall, we see that the 95% credible bands capture the uncertainty in the data almost everywhere.
plot(fit1, band = TRUE, show_hist = TRUE, xlab = "Petal length", ylab = 'Density')
Note that BNPmix transfers arguments from the package function plot.BNPdens to the generic R function plot. Standard arguments within plot such as col, xlab, etc. can be modified (but not all). Typing ?plot.BNPdens reveals the list of arguments. We can see, for instance, that we can suppress the credible bands or change them (the default option is conf_level = c(0.025, 0.975)). We can also add the average posterior cluster assignment for the observations, with coloured points on the \(x\)-axis, as follows.
plot(fit1, band = TRUE, show_clust = TRUE, xlab = "Petal length", ylab = 'Density')
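Similarly, the credible level of the bands can be changed from the default; for instance, 90% bands could be requested as below (a sketch based on the conf_level argument mentioned above).
plot(fit1, band = TRUE, conf_level = c(0.05, 0.95), xlab = "Petal length", ylab = 'Density')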
We saw previously that the command summary(fit1) provides the average number of clusters. However, what about the most frequently visited number of clusters? We are more interested in this quantity as it is the maximum a posteriori (MAP) estimate of the number of clusters. Unfortunately, BNPmix does not provide an immediate command for that, but there is a workaround with some additional coding. First, let's have a look at the names of the outputs that fit1 contains.
names(fit1)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
We are interested in the output clust, which contains the cluster label of each observation for each of the 10000 posterior samples. Let's have a closer look.
cluster.labels = fit1$clust
class(cluster.labels)
## [1] "matrix" "array"
dim(cluster.labels)
## [1] 10000 150
cluster.labels[1,] # labels from 1st MCMC sample, 3 clusters in total are visited
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0
Below, n.cluster is constructed by first applying unique to each row of cluster.labels (giving the distinct cluster labels visited at each MCMC iteration) and then taking the length of each of these, i.e. the number of clusters at each iteration. Finally, unlist just transforms the class of n.cluster from a list to a vector of integers (as verified by the class(n.cluster) check below).
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
length(n.cluster); class(n.cluster)
## [1] 10000
## [1] "integer"
Almost there! Now we can just use the commands table() and plot() to create a (relatively) nice figure for the posterior probabilities of the number of clusters.
counts = table(n.cluster)
counts
## n.cluster
## 2 3 4 5 6 7 8 9 10
## 2239 3302 2394 1352 494 164 44 8 3
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters',
ylab = 'Posterior probability', main = 'alpha = 1')
As seen, the most visited model is the one with \(K=3\), although \(K=2\) also has a relatively high posterior probability (but so does the model with \(K=4\)). This is not exactly what we expected based on the exploratory visualisations of the dataset at the beginning.
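For completeness, the MAP number of clusters can also be read off programmatically from the counts table:
as.integer(names(which.max(counts)))  # the most visited number of clusters (here 3)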
We may wish to improve this aspect of the model (although that is not very important for density estimation). Could it be that the results are affected by setting the expected number of clusters equal to 3? We did this by setting Ek = 3 in the function PYcalibrate. You may consider changing this option to Ek = 2 (left as an exercise), but as you will see it will not affect things (at least under set.seed(1)).
What we can consider, however, is to change the prior precision of the DP (the strength, in terms of the PY process). Previously, we defined the prior through
prior = list(strength = 1, discount = 0, hyper = FALSE)
BNPmix has an internal procedure (PYcalibrate) for automatically selecting the strength parameter based on the prior expected number of clusters, the sample size and the discount parameter. We can find this value as follows:
DPprior = PYcalibrate(Ek = 3, n = n, discount = 0)
DPprior
## $strength
## [1] 0.3938936
##
## $discount
## [1] 0
Task:
6. Refit the model (with the same seed) using the above value for strength and plot the resulting posterior probabilities of the number of clusters for this model. A sketch of one possible approach is given below.
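A sketch of one possible approach (reusing the objects defined earlier; the name fit1b is just a choice made here):
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
set.seed(1)
fit1b = PYdensity(y = y[,3], mcmc = mcmc, prior = prior, output = output)
# then recompute cluster.labels, n.cluster and counts from fit1b$clust exactly as before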
If this is done correctly, the results should be as shown below. Now the MAP estimate for the number of clusters is, indeed, \(K=2\).
counts
## n.cluster
## 2 3 4 5 6 7
## 5472 3236 1022 228 38 4
plot(counts/sum(counts), col = 1:length(counts), bty = 'l', xlab = 'Number of clusters',
ylab = 'Posterior probability', main = 'alpha = 0.3938936')
Above, in 3.1 we estimated the posterior density and its related uncertainty, and in 3.2 we inferred the most likely a posteriori number of clusters. These two are typically the main goals of nonparametric analyses. However, what about posterior inference for the parameters: is that still possible, and does BNPmix provide this extra inferential option?
The answer is yes, but of course the only way posterior summaries of parameters would make sense is when we work conditional upon a given \(K\). A sensible option is to select the MAP estimate, which is \(K=2\) in this application. So we will basically infer about the 2-component mixture of the form
\[ p(y_i) = \sum_{k=1}^2\pi_k \mathrm{N}(y_i ~|~ \mu_k, \sigma_k^2). \] Clarification note: The results below are not equivalent to the results obtained from using a 2-component Bayesian normal mixture model as a starting point (even if the prior specification is similar in form). If the main aim is inference about parameters, then a more disciplined approach would be to first determine the number of clusters (as we did above) and then infer (using MCMC) from a finite 2-component mixture. That being said, in practical terms the results will likely look quite similar!
Let's see how we can extract posterior summaries for the means \(\mu_1\) and \(\mu_2\). The key thing to remember is that we can see the objects contained in fit1 via names(fit1), as we have seen above. The object of interest in this case is mean. Below we store this information in a new object means; this is a list: means[[1]] and means[[10000]] show the resulting means from the first and last MCMC sample, respectively (the burn-in samples are removed automatically).
means = fit1$mean
means[[1]]; means[[10000]]
## [,1]
## [1,] 5.095031
## [2,] 1.426748
## [3,] 2.687789
## [4,] 1.651989
## [,1]
## [1,] 4.8656932
## [2,] 1.4430385
## [3,] 1.3996998
## [4,] 1.3686386
## [5,] 0.9059466
Now, we are only interested in the posterior samples of the first two components, so let's initially create a matrix of NAs with 10000 rows (the MCMC sample size) and \(K=2\) columns. Next, we fill the matrix post.means row by row, with the first column containing the samples of \(\mu_1\) and the second the samples of \(\mu_2\).
post.means = matrix(NA, nrow = 10000, ncol = 2)
for(i in 1:10000){
post.means[i,1] = means[[i]][1]
post.means[i,2] = means[[i]][2]
}
We now have everything we require and can evaluate posterior summaries of \((\mu_1,\mu_2)\).
summary(post.means)
## V1 V2
## Min. :3.527 Min. :0.9085
## 1st Qu.:4.850 1st Qu.:1.4430
## Median :4.910 Median :1.4622
## Mean :4.912 Mean :1.4830
## 3rd Qu.:4.970 3rd Qu.:1.4822
## Max. :5.792 Max. :5.5915
We see, for instance, that the posterior means are \(\hat{\mu}_1 = 4.912\) and \(\hat{\mu}_2 = 1.483\). Does this generally make sense? It seems so: if we have another look at the histogram of the variable Petal length, one group mode seems to be around 5 while the other group mode is somewhere around 1.5 (approximately). We can also produce relatively nice-looking density plots.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density',
xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density',
xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3)
We observe that the posterior density of \(\mu_2\) is certainly peaked around the posterior mode, but it is also quite spread out, taking values in extreme regions with very low probability density. We can improve this plot by controlling the \(x\)-axis range via the argument xlim within plot, as done below.
par(mfrow = c(1,2))
plot(density(post.means[,1]), main = '', lwd = 2, ylab = 'Posterior density',
xlab = expression(mu[1]), bty = 'l', col = 1, cex.lab = 1.3)
plot(density(post.means[,2]), main = '', lwd = 2, ylab = 'Posterior density',
xlab = expression(mu[2]), bty = 'l', col = 2, cex.lab = 1.3, xlim = c(1,2))
Tasks:
7. Produce similar posterior summaries and density plots for \(\sigma_1^2\) and \(\sigma_2^2\) (hint: for the plot of \(\sigma_2^2\) use xlim = c(0,0.2)). A sketch of one approach is given below.
8. Produce similar posterior summaries and density plots for \(\pi_1\) and \(\pi_2\).
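A sketch of one way to approach Task 7 (Task 8 is analogous, replacing fit1$sigma2 with fit1$probs):
# assuming fit1$sigma2 has the same list structure as fit1$mean (one element per MCMC sample)
post.vars = matrix(NA, nrow = 10000, ncol = 2)
for(i in 1:10000){
  post.vars[i,1] = fit1$sigma2[[i]][1]
  post.vars[i,2] = fit1$sigma2[[i]][2]
}
summary(post.vars)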
If done correctly, you should get the following results and figures:
## V1 V2
## Min. :0.08956 Min. :0.01327
## 1st Qu.:0.60264 1st Qu.:0.03177
## Median :0.66451 Median :0.03729
## Mean :0.67357 Mean :0.04093
## 3rd Qu.:0.73744 3rd Qu.:0.04389
## Max. :1.24008 Max. :5.38477
## V1 V2
## Min. :0.0101 Min. :0.002686
## 1st Qu.:0.6273 1st Qu.:0.298695
## Median :0.6583 Median :0.325601
## Mean :0.6497 Mean :0.321325
## 3rd Qu.:0.6866 3rd Qu.:0.353695
## Max. :0.7967 Max. :0.513049
Finally, let us discuss the aspect of setting the hyper-parameters of the model. As mentioned above, the prior assumptions under BNPmix are the following: \[\mu_k ~|~ \sigma_k^2\sim G_0(\mu_k ~|~ \sigma_k^2)\equiv \mathrm{N}(m_0, \sigma_k^2/k_0) \mbox{ and } \sigma_k^2\sim G_0(\sigma_k^2) \equiv \mathrm{InvGamma}(a_0,b_0).\] The default options used by BNPmix are as follows: \[m_0 = \bar{y}= \frac{1}{n}\sum_iy_i, ~k_0 =1, ~a_0 =2, ~b_0 = \frac{1}{(n-1)}\sum_i(y_i-\bar{y})^2.\] So, as we see, BNPmix actually uses an empirical Bayes procedure where the prior hyper-parameters \(m_0\) and \(b_0\) are estimated from the data! Practically, this is fine and quite understandable from the side of the developers, as this is a generally “safe” option so that the algorithmic procedures implemented “under the hood” will provide stable estimates (or at least, not crash completely) across different types of datasets.
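As an illustration, for the variable Petal length these defaults would amount to centring the prior at the sample mean and setting \(b_0\) equal to the sample variance:
mean(y[,3])  # the default m0 for Petal length
var(y[,3])   # the default b0 for Petal length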
That being acknowledged, it should be mentioned that hard-core Bayesians would not like this approach at all! Even more moderate Bayesian statisticians would like to have some control over this and potentially apply a fully Bayesian approach. So the question is: can we change the prior specification?
The answer is yes. The setting of the prior hyper-parameters can be controlled via the prior list, which, apart from the arguments listed above in Section 2.3, can take the following additional arguments:
- prior: List with arguments for the prior specification:
    discount: the discount parameter (here it must be set to 0 for a DP).
    strength: the strength parameter of the PY process or, correspondingly, the precision of the DP.
    hyper: provides control for choosing a DPM or a hierarchical DPM.
    m0: the prior mean of each \(\mu_k\).
    k0: the scalar \(k_0>0\) controlling the variance of each \(\mu_k\).
    a0: the shape \(a_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
    b0: the scale \(b_0>0\) of the inverse-gamma distribution of each \(\sigma_k^2\).
Suppose we want to adopt a non-informative prior such that \[\mu_k ~|~\sigma_k^2\sim\mathrm{N}(0,\sigma^2_k) ~\mbox{and}~\sigma_k^2\sim\mathrm{InvGamma}(1,1000).\] Note that for \(a_0=1\) the mean and the variance of the inverse-gamma distribution do not exist (they are infinite), but we can still check empirically that draws from this inverse-gamma are typically very large; for instance:
set.seed(1)
mean(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 450650.4
var(1/rgamma(1000, shape = 1, rate = 1000))
## [1] 694001710
Above we made use of the following relationship: if \(X\sim \mathrm{Gamma}(a,b)\) (where \(b\) is the rate parameter), then \(1/X\sim \mathrm{InvGamma}(a,b)\).
Tasks:
9. Using the automatic BNPmix option for setting strength, while discount should always be equal to 0 and hyper equal to FALSE, define the prior list so that \(m_0=0\), \(k_0=1\) (this is actually the default) and \(a_0=1\), \(b_0=1000\). Then run the new model (without changing the rest of the previous specifications) using again set.seed(1) and call this fit2. A sketch of the prior specification is given after this list.
10. Produce the two density plots we saw above, now for fit2 (change the colour; e.g., use col = 2 within the plot).
11. Visualise, as done before, the posterior probabilities for the number of clusters under model fit2.
12. Calculate posterior summaries and produce posterior density plots for the means under the MAP model resulting from fit2.
Results should look as follows:
## n.cluster
## 2 3 4 5 6 7 8
## 5196 3441 1070 253 36 1 3
Did results change under the modified prior specification?
In this application the related inference was basically unaffected. So, we avoided using empirical Bayes and went with a fully Bayesian approach which, in the end, provided more or less the same results.
Sometimes things can go wrong for reasons that are obscure. For instance, let's say that we want to analyse Petal width (the 4th column of y). In this case:
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE)
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100),
out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
If you execute this code you will receive the following error message:
Error in cICS(y, grid_use, niter, nburn, m0, k0, a0, b0, m1, s21, tau1, : randg(): incorrect distribution parameters; a and b must be greater than zero
Now, the fact that the error starts with cICS gives reason to suspect that something is going wrong with the MCMC sampler used. BNPmix can implement three algorithms: an Importance Conditional Sampler (ICS), a Marginal Sampler (MAR) and a Slice Sampler (SLI). These can be controlled via the mcmc list through the extra argument method. The default option is method = 'ICS'.
However, before changing the algorithm, let's first consider a practical solution that might help. In this type of application it is sometimes better to scale the data first (for each column, subtract its mean and divide by its standard deviation). So, let's try this:
y.scaled = scale(y)
apply(y.scaled,2,mean); apply(y.scaled,2,var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## -4.484318e-16 2.034094e-16 -2.895326e-17 -3.663049e-17
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 1 1 1
output = list(grid = seq(min(y.scaled[,4]), max(y.scaled[,4]), length.out = 100),
out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y.scaled[,4], mcmc = mcmc, prior = prior, output = output)
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 1.12 seconds
## Average number of groups: 5
## Min number of groups: 5 ; Max number of groups: 5
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width",
ylab = 'Density', col = 3)
Well, in this case we managed to run the sampler, but the results are awful! So, let's change the sampler and use MAR instead.
mcmc = list(niter = 11000, nburn = 1000, method = 'MAR')
output = list(grid = seq(min(y[,4]), max(y[,4]), length.out = 100),
out_param = TRUE)
set.seed(1)
fit3 = PYdensity(y = y[,4], mcmc = mcmc, prior = prior, output = output)
## Completed: 1100/11000 - in 0.124563 sec
## Completed: 2200/11000 - in 0.254853 sec
## Completed: 3300/11000 - in 0.38127 sec
## Completed: 4400/11000 - in 0.518943 sec
## Completed: 5500/11000 - in 0.637626 sec
## Completed: 6600/11000 - in 0.765306 sec
## Completed: 7700/11000 - in 0.898981 sec
## Completed: 8800/11000 - in 1.02484 sec
## Completed: 9900/11000 - in 1.16216 sec
## Completed: 11000/11000 - in 1.29038 sec
##
## Estimation done in 1.29039 seconds
summary(fit3)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 1.29 seconds
## Average number of groups: 3.1109
## Min number of groups: 2 ; Max number of groups: 30
plot(fit3, band = TRUE, show_hist = TRUE, xlab = "Petal width",
ylab = 'Density', col = 3)
This worked, even without scaling the data.
Warning note: if you try to run this with method = 'SLI', RStudio will crash for some reason!