Mixture models are a useful class of models for data whose distribution is too heterogeneous to be captured by a single distribution. They are widely used across scientific fields, including astronomy, biology, genomics, finance, medicine and engineering, for purposes such as density estimation, unsupervised clustering and capturing unobserved heterogeneity. Mixture modelling is also highly flexible and can be applied to many types of data: categorical, discrete or continuous, univariate or multivariate.
The basic finite mixture model provides a model for a heterogeneous population that consists of, say, \(K\) unobserved homogeneous sub-groups, often called components, mixed at random in proportion to the relative sizes, \(\eta_1,\ldots,\eta_K\), where \(\sum_{k=1}^K \eta_k = 1\). If some random feature \(Y\) of the population is heterogeneous across and homogeneous within sub-groups, then we model \(Y\) as having a different probability distribution in each sub-group, denoting the \(k\)th component distribution by \(p(y | \boldsymbol{\theta}_k)\), which depends on a parameter \(\boldsymbol{\theta}_k\). It is common, though not necessary, to assume that the component distributions \(p(y | \boldsymbol{\theta}_k)\) are all from the same parametric family, such as the normal distribution, differing only in the value of the parameter. The distribution of \(Y\) is then given by \[\begin{equation*} p(y | \boldsymbol{\theta},\boldsymbol{\eta}) = \sum_{k=1}^K \eta_k p(y | \boldsymbol{\theta}_k), \end{equation*}\] where \(\boldsymbol{\theta} = (\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_K)\) and \(\boldsymbol{\eta}=(\eta_1,\ldots,\eta_K)^T\). An example with normal component distributions is provided in Figure 1.
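The generative description above (pick a sub-group with probability \(\eta_k\), then draw \(Y\) from that sub-group's distribution) can be sketched in a few lines. This is only an illustrative sketch: the project itself uses R, whereas Python's standard library is used here. The function name `sample_normal_mixture` is hypothetical, and the weights, means and standard deviations are those of mixture A in Figure 1.

```python
import random


def sample_normal_mixture(n, weights, means, sds, seed=0):
    """Draw n values from a finite normal mixture via latent sub-group labels."""
    rng = random.Random(seed)
    draws, labels = [], []
    for _ in range(n):
        # Step 1: pick a sub-group k with probability eta_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw Y from the k-th component distribution.
        draws.append(rng.gauss(means[k], sds[k]))
        labels.append(k)
    return draws, labels


# Mixture A from Figure 1 (standard deviations, not variances).
eta_a = [0.3, 0.5, 0.2]
sim_y, sim_z = sample_normal_mixture(5000, eta_a, [1.0, 3.0, 4.0], [1.0, 0.5, 1.3])

# The observed share of each label should be close to the mixture weights.
sim_props = [sim_z.count(k) / len(sim_z) for k in range(len(eta_a))]
print(sim_props)
```

In a real analysis the labels are unobserved; they are returned here only so the simulated proportions can be compared with \(\boldsymbol{\eta}\).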
Figure 1: Example of three normal densities, giving two very different mixture distributions, \(\eta_1 \times \mathrm{N}(1, 1) + \eta_2 \times \mathrm{N}(3, 0.5^2) + \eta_3 \times \mathrm{N}(4, 1.3^2)\), where for mixture A, \(\boldsymbol{\eta}=(0.3, 0.5, 0.2)^T\) and for mixture B, \(\boldsymbol{\eta}=(0.84, 0.02, 0.14)^T\).
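To make the contrast between mixtures A and B concrete, the following sketch evaluates both densities on a grid and locates the highest point of each. It is a minimal illustration in standard-library Python (the project itself uses R); the helper names `normal_pdf` and `mixture_pdf`, the grid range and the grid spacing are all assumptions chosen for illustration.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, weights, means, sds):
    """Finite mixture density: sum over k of eta_k * p(y | theta_k)."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))


# Component parameters and weights taken from the Figure 1 caption.
MEANS = [1.0, 3.0, 4.0]
SDS = [1.0, 0.5, 1.3]
ETA_A = [0.3, 0.5, 0.2]
ETA_B = [0.84, 0.02, 0.14]

# Assumed evaluation grid from -3 to 9 in steps of 0.01.
STEP = 0.01
grid = [-3.0 + STEP * i for i in range(1201)]
dens_a = [mixture_pdf(y, ETA_A, MEANS, SDS) for y in grid]
dens_b = [mixture_pdf(y, ETA_B, MEANS, SDS) for y in grid]

# Grid point at which each mixture density is largest.
peak_a = grid[max(range(len(grid)), key=dens_a.__getitem__)]
peak_b = grid[max(range(len(grid)), key=dens_b.__getitem__)]
print(peak_a, peak_b)
```

The same three components give quite different shapes because only the weights change: the weights decide which component dominates the overall density.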
The classic Bayesian analysis completes the model specification through the choice of a prior for the mixture weights, \(\boldsymbol{\eta}\), and the component parameters, \(\boldsymbol{\theta}\), usually factorised as \(\pi(\boldsymbol{\theta},\boldsymbol{\eta}) = \pi(\boldsymbol{\theta}) \pi(\boldsymbol{\eta})\). Given a random sample of \(n\) observations from the finite mixture model, \(\mathbf{y}=(y_1,\ldots,y_n)^T\), the posterior distribution of interest is then given by \[\begin{equation*} \pi(\boldsymbol{\theta},\boldsymbol{\eta} | \mathbf{y}) \propto \pi(\boldsymbol{\theta}) \pi(\boldsymbol{\eta}) \prod_{i=1}^n \left\{ \sum_{k=1}^K \eta_k p(y_i | \boldsymbol{\theta}_k) \right\}. \end{equation*}\] The posterior is analytically intractable in all but toy examples, so the usual approach is to sample from it using Markov chain Monte Carlo (MCMC) methods. However, the product of sums in the likelihood, which expands into \(K^n\) terms, poses a number of challenges for MCMC.
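As a rough illustration of the target an MCMC sampler works with, the sketch below evaluates the unnormalised log posterior for a normal mixture, using the log-sum-exp trick to keep the inner sums numerically stable. It is written in standard-library Python purely for illustration (the project itself uses R). The priors are assumptions, not part of the project specification: a symmetric Dirichlet on \(\boldsymbol{\eta}\) and independent normal priors on the component means, with the component standard deviations treated as known. All function names and the toy data are hypothetical.

```python
import math


def log_normal_pdf(y, mean, sd):
    """Log density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_sum_exp(values):
    """Stable evaluation of log(sum(exp(v))) for a list of log-scale terms."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_likelihood(y, weights, means, sds):
    """Sum over i of log{ sum over k of eta_k * p(y_i | theta_k) }."""
    total = 0.0
    for yi in y:
        terms = [math.log(w) + log_normal_pdf(yi, m, s)
                 for w, m, s in zip(weights, means, sds)]
        total += log_sum_exp(terms)
    return total


def log_prior(weights, means, alpha=1.0, prior_mean=0.0, prior_sd=10.0):
    """ASSUMED priors: Dirichlet(alpha, ..., alpha) on eta (up to a constant)
    and independent N(prior_mean, prior_sd^2) on each component mean."""
    lp = sum((alpha - 1.0) * math.log(w) for w in weights)
    lp += sum(log_normal_pdf(m, prior_mean, prior_sd) for m in means)
    return lp


def log_posterior(y, weights, means, sds):
    """Unnormalised log posterior, with sds treated as fixed and known."""
    return log_prior(weights, means) + log_likelihood(y, weights, means, sds)


# Hypothetical toy data, for illustration only.
toy_y = [0.2, 1.1, 2.9, 3.4, 4.8]
lp_original = log_posterior(toy_y, [0.3, 0.5, 0.2], [1.0, 3.0, 4.0], [1.0, 0.5, 1.3])
# Same model with the component labels permuted.
lp_relabelled = log_posterior(toy_y, [0.2, 0.3, 0.5], [4.0, 1.0, 3.0], [1.3, 1.0, 0.5])
print(lp_original, lp_relabelled)
```

Under these exchangeable priors the two values agree up to floating-point error, because relabelling the components does not change the model. This invariance is one source of the difficulties that MCMC faces with mixture posteriors, commonly known as label switching.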
The goal of this project is to introduce and explore Bayesian inference for mixture models.
The group project will revolve around learning about the mathematical formulation and properties of finite mixture models as well as Bayesian methodology for working with them.
By the end of the group project you will have learned:
By the end of the group project you will be able to:
The project will revolve around learning through reading and programming in R. Students will demonstrate their understanding by comparing theory to simulation results, writing R code to implement core methodology, analysing simulated and real data sets, and clearly communicating the material in both written and oral formats.
The individual project will build on the knowledge gained in the group project and will explore additional advanced topics. A few examples of topics you will be able to investigate are:
Prerequisites: Statistical Inference II, Data Science and Statistical Modelling II.
Co-requisites: Bayesian Computation and Modelling III.
If you would like more information about this project, please contact me at sarah.e.heaps@durham.ac.uk
label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69(1), 1–24.