Models for zero-inflated count data

Supervisor: Jochen Einbeck | Research Area: Statistics

Background

Count data are a frequently arising and very important type of statistical data. For instance, we may count the number of vehicles per minute entering some motorway junction, the number of goals in a football match, or the number of insects of a certain type per plot in a field. Many modern-day data sets, such as web-click data or RNA sequencing data, also come in the form of count data. Statistical modelling of count data is usually carried out by identifying an appropriate count data distribution from the exponential family of distributions, and then carrying out the required inferences within the framework of the theory of generalised linear models.

A persistent problem with count data is that the observed frequency of zero counts is often not in line with what would be expected under the postulated count data distribution. Typically there are more zeros than expected, a problem known as "zero-inflation". Such zero-inflation often has a solid structural reason. To give a rather trivial example, assume we want to model the number of fish caught by each visitor at some fishing ground. Some visitors might not attempt fishing at all, and hence structurally return zero fish, but we do not have the information on whether an individual engaged in fishing or not. Hence, our data set contains two types of zeros: those from the active visitors who tried to catch some fish but failed, and those from the idle visitors who did not actually try. In other examples, it may be the measurement process itself which causes difficulties: assume an ecologist would like to describe the number of eggs laid by a certain bird species. It may be difficult to even identify a nest as a nest if there are no eggs in it, which is likely to lead to "zero-deflation" or even "zero-truncation" (if the data by construction cannot contain zeros at all).

All of these situations have in common that traditional count data models, such as those based on the Poisson distribution, will likely fail. The solution is to address the generation of zeros explicitly in the model, by essentially building a mixture model which, with some probability, say p, produces a structural zero, and with probability 1-p draws from a count distribution (which may lead to further, random, zeros). Such models are known as zero-inflated regression models, sometimes also zero-modified regression models, with the most important special cases being the zero-inflated Poisson (ZIP) and the zero-inflated negative binomial (ZINB) regression models.
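
To make the mixture construction concrete, here is a minimal sketch of the zero-inflated Poisson probability function and a simple sampler. It is written in Python using only the standard library, purely for illustration (the project itself works in R). The function names, the parameter values p = 0.3 and lam = 2.0, and the sample size are assumptions chosen for this example, not part of any particular package.

```python
import math
import random


def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, p, lam):
    """P(Y = k) under a zero-inflated Poisson model.

    With probability p the outcome is a structural zero; with
    probability 1 - p it is a draw from Poisson(lam), which may
    itself be zero (a "random" zero).
    """
    base = (1.0 - p) * poisson_pmf(k, lam)
    return p + base if k == 0 else base


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(n, p, lam, seed=0):
    """Draw n values from the zero-inflated Poisson mixture."""
    rng = random.Random(seed)
    return [0 if rng.random() < p else sample_poisson(lam, rng)
            for _ in range(n)]


# Illustrative (assumed) parameter values.
p, lam = 0.3, 2.0
counts = sample_zip(10000, p, lam, seed=1)

observed_zero_share = counts.count(0) / len(counts)
zip_zero_prob = zip_pmf(0, p, lam)        # p + (1 - p) * exp(-lam)
poisson_zero_prob = poisson_pmf(0, lam)   # exp(-lam), no inflation

print(f"share of zeros in the sample: {observed_zero_share:.3f}")
print(f"P(Y = 0) under ZIP:           {zip_zero_prob:.3f}")
print(f"P(Y = 0) under plain Poisson: {poisson_zero_prob:.3f}")
```

Comparing the three printed numbers shows the point of the model: the share of zeros in the simulated sample should sit close to the ZIP probability and well above what a plain Poisson distribution with the same lam would predict.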

This project

In this project, we will study the statistical methodology behind zero-inflated count regression models, and use R to apply them in practice and understand their behaviour. We will look into the most commonly used publicly available implementations of zero-inflated count regression models, and apply them to real data sets (which you may choose). Simulation studies may be carried out to investigate their behaviour. We will also look at inferential questions, such as testing different "nested" count distributions against each other, for instance through likelihood ratio or score tests. There will be opportunity for engagement with state-of-the-art literature (recent publications) and/or own methodological development, for instance by implementing "your own" zero-inflated model in the context of a specific count distribution. There is also opportunity to look at related concepts, such as hurdle models.
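
As a flavour of what implementing "your own" zero-inflated model and comparing nested models might involve, the following is a minimal sketch, under simplifying assumptions, of an intercept-only ZIP model (no covariates) fitted by the EM algorithm, followed by a likelihood ratio statistic against the plain Poisson model. It is written in Python with the standard library only, for illustration; the project itself would do this in R. All function names, the simulated data, and the parameter values are hypothetical.

```python
import math
import random


def zip_loglik(counts, p, lam):
    """Log-likelihood of an intercept-only ZIP model; p = 0 gives plain Poisson."""
    total = 0.0
    for k in counts:
        if k == 0:
            total += math.log(p + (1.0 - p) * math.exp(-lam))
        else:
            total += (math.log(1.0 - p) - lam + k * math.log(lam)
                      - math.lgamma(k + 1))
    return total


def fit_zip_em(counts, iterations=200):
    """Estimate (p, lam) by EM, treating 'structural zero' as the hidden label."""
    n, n_zero, total = len(counts), counts.count(0), sum(counts)
    p, lam = 0.5 * n_zero / n, max(total / n, 1e-8)  # crude starting values
    for _ in range(iterations):
        # E-step: probability that an observed zero is structural.
        w = p / (p + (1.0 - p) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean.
        expected_structural = w * n_zero
        p = expected_structural / n
        lam = total / (n - expected_structural)
    return p, lam


def sample_zip(n, p, lam, seed=0):
    """Simulate ZIP data (Knuth's method for the Poisson part)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < p:
            out.append(0)  # structural zero
            continue
        k, prod, threshold = 0, rng.random(), math.exp(-lam)
        while prod > threshold:
            k += 1
            prod *= rng.random()
        out.append(k)
    return out


# Simulated data with assumed true values p = 0.3 and lam = 2.0.
counts = sample_zip(2000, 0.3, 2.0, seed=42)

p_hat, lam_hat = fit_zip_em(counts)
poisson_mean = sum(counts) / len(counts)  # Poisson maximum likelihood estimate

loglik_zip = zip_loglik(counts, p_hat, lam_hat)
loglik_poisson = zip_loglik(counts, 0.0, poisson_mean)
lr_stat = 2.0 * (loglik_zip - loglik_poisson)

print(f"p_hat = {p_hat:.3f}, lam_hat = {lam_hat:.3f}, LR statistic = {lr_stat:.1f}")
```

Note that the Poisson model corresponds to p = 0, which lies on the boundary of the parameter space, so the usual chi-squared reference distribution for the likelihood ratio statistic needs care. This is one of the inferential questions the project can explore.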

Mode of operation and evidence of learning

The project will be based on reading and data analysis using the programming language R. In addition, your project may involve statistical paper-and-pencil derivations, simulation studies, and advanced R programming, depending on the focus that your particular project takes in the course of the academic year.

Additional information

If you would like more information about this project, please contact me at jochen.einbeck@durham.ac.uk.

Resources