EM {UEM}R Documentation

Fitting and updating mixtures via the EM algorithm

Description

The function EM implements the usual EM algorithm for multivariate Gaussian mixtures or Poisson mixtures. The function updateEM allows for updating the result from a fitted mixture model, after new observations have been coming in. The function UEM allows to split a data set (from the start) into several batches, and apply and update EM sequentially on those batches.

Usage

EM(y, K, init = "quantile", family = "Gaussian", iter = -1, threshold =
    0.0001, lambda = 0.999, tol=0.5, verbose = FALSE, plot = FALSE, ...)
UEM(y, K, init = "quantile", split = NULL, randomize = FALSE, 
    family = "Gaussian", iter = -1, threshold = 0.001, max.time = NULL, 
    lambda = 0.999, verbose = FALSE, plot = FALSE, ...)
updateEM(z = NULL, theta, iter = -1, threshold = 0.001, max.time = NULL, 
    lambda = 0.999, verbose = FALSE, plot = FALSE, ...) 

Arguments

y

a univariate or multivariate data set.

z

new data to be added for updating the estimate theta.

theta

the value of the parameter vector theta from which UEM is started.

K

the number of components.

init

the type of initialization used. Options include "random" (randomly chosen from the data), "scatter" (uniformly sampled from the _support_ of the data), "quantile" (quantile-based), "shortruns" (short runs of EM), "gq" (based on Gauss Quadrature points). See also help file for init.

split

For the use in UEM, a vector giving the split points between the batches.

randomize

Boolean. For the use in UEM. If TRUE, data are randomized before splitting.

family

Response family. At present, "Gaussian" (default) and "Poisson" are supported. "Cauchy" is in preparation.

iter

Number of EM iterations. For UEM this can be a vector which gives the number of iterations in the individual batches (in this case its length has just to be the length of split+1) or a scalar (in this case all batches have the same size).

threshold

Convergence threshold (in terms of a log-likelihood difference).

lambda

calibrates between globally equal component variances (lambda=0) or unequal variances (lambda=1). The only reason to set this to a value different than 0 or 1 is to avoid likelihood spikes when using unequal variances. In this case a value like 0.999 is suitable, which means that the component variances are computed by taking 0.999 times the component-specific (‘unequal’) variances plus 0.001 the globally computed (‘equal’) component variance.

tol

tuning parameter which scales EM starting points inwards or outwards

max.time

Time limit after which execution stops (in seconds).

verbose

Boolean. If TRUE, displays iteration count on the screen.

plot

Boolean. If TRUE, provides graphical output (fitted mixture).

...

Arguments to be passed to plot.

Value

A fitted mixture object, of class umix.

Author(s)

J. Einbeck, D. Bonetti

References

Einbeck, Jochen & Bonetti, Daniel (2014), A study of online and blockwise updating of the EM algorithm for Gaussian mixtures, in Kneib, Thomas, Sobotka, Fabian, Fahrenholz, Jan & Irmer, Henriette eds, Proceedings of the 29th International Workshop on Statistical Modelling. Goettingen, Germany, 14-18 July 2014 II: 29th International Workshop on Statistical Modelling. Goettingen, University of Goettingen, 35-38.

Examples

### Univariate Gaussian Example:

data(pistonrings, package="qcc")
boxplot(diameter ~ sample, data=pistonrings)
dm <- as.matrix(pistonrings$diameter)

# EM all at once:
fit  <- EM(dm,2, threshold=0.005)
# Now via update EM
fit2  <- UEM(dm,2, split=seq(100,200,by=5), iter= c(10, rep(2,20),-1),plot=TRUE )
# (this gives 100 data points first and iterates 10 times. Then it gives 20 batches
# of size five and iterates twice after each batch. Finally it iterates until convergence). 
# Compare log-likelihoods:
logLike(fit,dm)
logLike(fit2,dm) 

### Bivariate Gaussian Example: 

require(mvtnorm)
s1 <- matrix(c(10,3,3,2),2,2)
s2 <- matrix(c(1,3,3,16),2,2)
m1 = rmvnorm(n=40, c(4,2), s1)
m2 = rmvnorm(n=60, c(9,4), s2)
x = rbind(m1, m2)
par(mfrow=c(2,2))
plot(x)

thetar = EM(x, 2, iter=10) # Standard EM
plot(thetar,x, main="EM")

i = sample(100, 50)
theta0 = EM(x[-i, ], 2, iter=10) # remove 50 points, fit EM to remaining points
theta1 = updateEM(x[i,],  theta0, iter=10) # put points back, update EM

plot(theta0,x[-i, ], col=1, main= "EM (subset)")
plot(theta1,x, col= 1+ (1:100)%in%i,  main = "update EM")

### Poisson Example:

theta  <- list("mu"=c(1,8,30),"pi"=c(0.2,0.5,0.3))
theta2 <- list("mu"=c(5,10,100),"pi"=c(0.2,0.2,0.6))

pdat   <- poisSimN(100, theta)
pdat.z <- poisSimN(20, theta2)

poisfit <- EM(pdat, 3, iter=100,  family="Poisson")
plot.umix(poisfit,pdat)
poisup  <- updateEM(pdat.z, poisfit, iter=100, dist="Poisson", plot=TRUE)

# equivalently, at once:
poisall <- UEM(c(pdat,pdat.z), 3,split=100,  iter=100,  family="Poisson", plot=TRUE)

poisup$mu
poisall$mu
 # identical!
 

[Package UEM version 0.3-1 Index]