Models for count data
Supervisor: Jochen Einbeck | Research Area: Statistics
Background
Count data are a frequently arising and very important type of statistical data. For instance, we may count the number of vehicles per minute entering a motorway junction, the number of goals in a football match, or the number of insects of a certain type per plot in a field. Many modern-day data sets, such as web-click data or RNA sequencing data, also come in the form of count data. While count data can be easily summarized and visualized through frequency diagrams (bar charts), their statistical modelling poses some challenges.
The most commonly used statistical model is the linear regression model. This model makes some rather strong assumptions, including a continuous scale and an unbounded domain for the response variable. Both assumptions are clearly violated for count data, which are discrete and cannot be negative. Hence, an alternative modelling framework is required. It is essentially established by replacing the Gaussian response distribution with a more appropriate count distribution, and then basing inference on the likelihood induced by this distributional assumption. One possible choice is the Poisson distribution, which, however, comes with its own constraints, notably that its variance equals its mean. Several alternative count distributions can be considered to accommodate practically relevant count data patterns, most of which fall within the "generalised linear model" framework, which is taught in the second term of Advanced Statistical Modelling (ASM) III. However, these count data models are also of interest in their own right, so this project is entirely suitable whether or not you will be attending ASM III.
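As a minimal sketch of likelihood-based inference under a Poisson assumption, the snippet below evaluates the Poisson log-likelihood for an intercept-only model, where the maximum likelihood estimate of the rate is the sample mean. The project itself uses R; Python is used here purely for illustration, and the data values and the names `poisson_loglik`, `counts`, and `lam_hat` are hypothetical.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of independent Poisson(lam) observations."""
    # log P(Y = y) = y * log(lam) - lam - log(y!), with log(y!) = lgamma(y + 1)
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)


# Hypothetical counts (e.g. vehicles per minute), made up for illustration.
counts = [0, 2, 1, 3, 1, 0, 4, 2, 1, 2]

# For an intercept-only Poisson model, the maximum likelihood
# estimate of the rate is the sample mean.
lam_hat = sum(counts) / len(counts)

# Compare the log-likelihood at the estimate with two nearby rates.
for lam in (lam_hat - 0.5, lam_hat, lam_hat + 0.5):
    print(f"lambda = {lam:.2f}, log-likelihood = {poisson_loglik(counts, lam):.3f}")
```

With covariates, the rate would typically be modelled as a function of a linear predictor and the same log-likelihood maximised numerically, which is what regression software does internally.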
Group project
The group project will revolve around understanding count data and commonly used count data distributions, with a view to statistical modelling in situations where the response variable is a count variable.
By the end of the group project, you will have knowledge about:
- how to visualize count data, and identify common features such as zero-inflation or overdispersion;
- distributions commonly used for the modelling of count data;
- statistical tests for assessing goodness-of-fit;
- how to set up a statistical model using the Poisson (or another count) distribution and estimate regression parameters from it.
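The first of these points can be sketched with two simple descriptive checks: the ratio of sample variance to sample mean (which is close to 1 for Poisson data) and the observed share of zeros compared with the share a Poisson distribution with the same mean would predict. This is only an illustrative sketch in Python, whereas the project uses R; the data and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical counts (e.g. insects per plot), made up for illustration.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 5, 0, 3, 0, 7, 0, 1, 4, 0, 6]

n = len(counts)
mean = sum(counts) / n
# Unbiased sample variance.
var = sum((y - mean) ** 2 for y in counts) / (n - 1)

# Under a Poisson model the variance equals the mean, so a ratio
# well above 1 points towards overdispersion.
dispersion_index = var / mean

# Share of zeros in the data versus the Poisson probability of a zero,
# exp(-mean), evaluated at the sample mean.
observed_zero_share = counts.count(0) / n
poisson_zero_share = math.exp(-mean)

print(f"mean = {mean:.2f}, variance = {var:.2f}, ratio = {dispersion_index:.2f}")
print(f"zeros: observed {observed_zero_share:.2f}, Poisson {poisson_zero_share:.2f}")

# Text-based frequency diagram (bar chart) of the counts.
freq = Counter(counts)
for value in range(max(counts) + 1):
    print(f"{value:2d} | {'#' * freq[value]}")
```

These checks are informal: excess zeros also inflate the variance, so a large ratio alone does not separate zero-inflation from other sources of overdispersion.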
By the end of this group project, you will be able to:
- for a given count data set, identify a suitable count data model, and formulate it in R;
- interpret and critically assess the model output;
- engage with quantitative indicators and graphical output to diagnose the goodness of fit of your model;
- compare model fit between competing models.
Mode of operation and evidence of learning
The project will be based on reading, mild statistical paper-and-pencil derivations, and some light-touch programming tasks using R. The focus lies on building an efficient practical toolbox, and on the related understanding, interpretation, and communication (both oral and written), rather than on hard-core statistical theory or programming.
Individual project
The individual project will build on the knowledge we have gained in the group project and will explore additional advanced topics. A few examples you will be able to investigate are:
- Count data distributions as special cases of the exponential family, giving rise to a "generalised linear model"-type framework
- More specific consideration of model classes which deal with zero-inflation, zero-deflation, or zero-truncation
- Advanced two- or three-parameter count distributions which do not fall into a GLM-type framework
- The statistical mechanism behind state-of-the-art graphical tools to carry out model diagnostics for count data (rootograms, half-normal plots)
- Inferential considerations, such as the validity of statistical tests when operating at the boundary of the parameter space
- Application to specialized data problems, such as in statistical dosimetry or traffic engineering.
Mode of operation and evidence of learning
Adding to the corresponding section from the Group project, the Individual project will involve at least one of:
- more substantial theoretical work (derivations, literature work);
- more advanced programming, including the building of simulation studies;
- a deeper dive into an application area, including study of the subject matter background and some advanced data analysis.
Prerequisites and Co-requisites
Prerequisites: Statistical Inference II, Data Science and Statistical Modelling II.
Co-requisites: None. The project may be chosen with or without ASMIII. Whether or not you attend ASMIII will have no implication at all on the Group work part of this project. However, the focus of your Individual project may be tuned accordingly, since the GLM is covered in the second term of ASMIII.
Additional information
If you would like more information about this project, please contact me at jochen.einbeck@durham.ac.uk.
Resources
Cameron, Adrian Colin; Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press.
Dupuy, Jean-Francois (2018). Statistical Methods for Overdispersed Count Data. ISTE Press - Elsevier.
Hardin, James W. (2020). Count Data Models. SAGE Publications Ltd.
Hilbe, Joseph (2014). Modeling Count Data. Cambridge University Press.