Durham University Statistics and Probability

## Stats4Grads

Welcome to the Stats4Grads website! Here you will find all the information about the seminar series.

Stats4Grads is a weekly seminar in statistics organised by and aimed at postgraduate students. The seminars take place on Wednesdays from 13:00 to 14:00, usually in CM105, with tea, coffee and biscuits provided by the Department of Mathematical Sciences.

Stats4Grads is a great opportunity to learn about the research of other postgraduate students and their use of statistics. This includes recent developments in statistics as well as applications to "real-world" problems and cross-disciplinary work. Moreover, Stats4Grads provides a relaxed forum in which to discuss and develop ideas, exchange knowledge, and get help and insight from students who have a deeper understanding of the theory and methodology.

Feel free to invite a friend or collaborator from another institution or department to give a talk if they are in Durham!

Organiser: Jonathan Owen. For information or to give a talk, contact jonathan.owen@durham.ac.uk.

For details of previous years' seminars, click here.

### Stats4Grads Timetable 2019/2020

The story of the data: insight into pre-clinical research and the reproducibility crisis

Speaker: Andrea Simkus, Department of Mathematical Sciences, Durham University
Wednesday 11 March 2020: 13:00, CM105

Abstract:

In my PhD work I collaborate with the pharmaceutical company AstraZeneca in a bid to explore our approach to statistical reproducibility in the context of their test scenarios. Having been exposed (intellectually) to pre-clinical research, I acquired an insight into where the data I get comes from. In my presentation I want to give you a story of that data. In my work, data is not just numbers but rather a chain of processes leading to its acquisition. There is the ethics (and animal welfare) side of this story too, and it is one of the main reasons why experiment design matters. Inter alia, I will talk about what I consider to be the main differences between industry and university and share some insights into what this industry is looking for in a statistician. I will further discuss the reproducibility crisis and the solutions we propose to it: put simply, we see statistical reproducibility as a prediction problem, and we employ nonparametric predictive inference to quantify it.

High dimension optimal design using Fisher Information Gain

Speaker: Sophie Harbisher, School of Mathematics, Statistics and Physics, Newcastle University
Wednesday 4 March 2020: 13:00, CM105

Abstract:

Finding high dimensional designs is increasingly important in applications of experimental design, but is computationally demanding under existing methods. We introduce an efficient approach applying recent advances in stochastic gradient optimisation. To allow rapid gradient calculations we work with a computationally convenient utility function, the trace of the Fisher information. We provide a decision theoretic justification for this utility, analogous to work by Bernardo (1979) on the Shannon information gain. Due to this similarity we refer to our utility as the Fisher information gain. We compare our optimisation scheme, SGO-FIG, to existing state-of-the-art methods and show our approach is quicker at finding designs which maximise expected utility, allowing designs with hundreds of choices to be produced in under a minute in one example.

Likelihood Free SAMC

Speaker: Kieran Richards, Department of Mathematical Sciences, Durham University
Wednesday 26 February 2020: 13:00, CG91

Abstract:

Approximate Bayesian Computation (ABC) has become a valuable tool for Bayesian Uncertainty Quantification, as it enables inference to be made even when the likelihood is intractable. ABC methods can produce unreliable inference when they introduce high approximation bias into the posterior through careless specification of the ABC kernel. Additionally, MCMC-ABC methods often suffer from the local trapping problem, which causes poor mixing when the tolerance parameter is low. We introduce a new ABC algorithm, the Stochastic Approximation Monte Carlo ABC (SAMC-ABC), which enables Bayesian Uncertainty Quantification in increasingly complex systems where inference was previously unreliable. SAMC-ABC adaptively constructs the so-called ABC kernel, both reducing the approximation bias and providing immunity to the local trapping problem. We demonstrate the performance of the proposed algorithm with some benchmark examples and find that the method outperforms its competitors. We use our algorithm to analyse a computer model which describes the transmission of the Ebola virus against data from the 2014-15 Ebola outbreak in Liberia.

Multilevel Emulation of Stochastic Computer Codes

Speaker: Jack Kennedy, School of Mathematics, Statistics and Physics, Newcastle University
Wednesday 19 February 2020: 13:00, CM105

Abstract:

Increasingly, stochastic computer models are being used in science and engineering to predict complex phenomena. Such stochastic models are implemented as computer simulators which may take minutes, hours or even days to run. A common approach to alleviate this problem is to build a statistical surrogate model, known as an emulator. Emulators of stochastic computer models should accurately predict not only the mean response surface of the simulator but also its level of noise. A particularly flexible approach is to emulate the stochastic simulator via a heteroscedastic Gaussian process.

Many complex simulators can be run at different levels of accuracy and hence different computational cost. Although the cheapest-to-run simulators will be inaccurate, they may be informative about the more accurate, but slower, runs of the computer simulator. We present a method to incorporate cheap simulator runs into a heteroscedastic GP emulator.

What are the chances?

Speaker: Clare Wallace, Department of Mathematical Sciences, Durham University
Wednesday 12 February 2020: 13:00, CG91

Abstract:

The Monty Hall problem is a well-known example of a question in probability whose answer is unintuitive. Beginning with cars and goats, and travelling past princesses in towers, poisonous frogs, and some cryptic comments from our parents, we will take a whirlwind tour of some other "controversial" probability models, and (hopefully) settle on some answers!
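For readers who want to check the Monty Hall answer empirically before the talk, here is a minimal simulation sketch. It assumes the standard version of the game (three doors, one car, and a host who always opens a goat door that the contestant did not pick); the function names `play` and `win_rate` are illustrative only and do not come from the talk.

```python
import random


def play(switch, rng):
    """Play one round of the standard three-door game; return True on a win."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that hides a goat and is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the single remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, rounds=100_000, seed=0):
    """Estimate the probability of winning the car by Monte Carlo."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(rounds)) / rounds


if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Under these assumptions the estimated win rate should settle near 1/3 for staying and 2/3 for switching.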

Bayesian Forecasting and Dynamic Linear Models

Speaker: Jordan Oakley, School of Mathematics, Statistics and Physics, Newcastle University
Wednesday 5 February 2020: 13:00, CM105

Abstract:

Dynamic models offer a powerful framework for the modelling and analysis of time series which are subject to abrupt changes in pattern. They are used in many time series applications, from finance and econometrics to biological series used in clinical monitoring. In this talk I will describe how dynamic models can be used to model time series, following work from West and Harrison. In particular, I will focus on the specific problem of monitoring kidney deterioration in patients that have just had cardiac surgery. This work is a collaboration with the cardiac surgery unit at the University Hospital of South Manchester. The particular problem studied is that of developing an on-line statistical procedure to monitor the progress of kidney function in individual patients who have recently had heart surgery.

Analysis of Overdispersion in Gamma-H2AX Data

Speaker: Adam Errington, Department of Mathematical Sciences, Durham University
Wednesday 4 December 2019: 13:00, E101

Abstract:

Count data which exhibit overdispersion are common in a wide variety of disciplines, such as public health and environmental science. It is typically assumed that the total (aggregated) number of gamma-H2AX foci (DNA repair proteins) produced in a sample of blood cells is Poisson distributed, with an expected yield (average foci per cell) that can be represented by a linear function of the absorbed dose. However, in practice, because of unobserved heterogeneity in the cell population, the standard Poisson assumption of equidispersion will most likely be violated, which causes the variance of the aggregated foci counts to be larger than their mean. This phenomenon is perceptible in both whole and partial body exposure, unlike in the context of the “gold-standard” dicentric assay, in which overdispersion is only linked to partial exposure. In such situations, a model that can handle overdispersion, such as the quasi-Poisson, may be preferable to the standard Poisson.

There are many different possible causes of overdispersion, and in any modelling situation a number of these could be involved. For our data, some possibilities include experimental variability (for example, a change of technology used in the scoring of cells) and correlation between individual foci counts (or cells), neither of which is accounted for by a fitted model. We will see that the behaviour of dispersion estimates differs considerably between using aggregated data and the full frequency distribution (raw data). To our knowledge, this phenomenon has not been investigated in the literature, either within or outside the field of biodosimetry. I will explain through simulation how accounting for dependence between observations can impact the estimated dispersion.
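The link between unobserved heterogeneity and overdispersion can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the speaker's analysis: per-cell counts are simulated either with a common Poisson rate or with a gamma-distributed rate per cell, and dispersion is estimated by the Pearson statistic for a constant-mean Poisson fit, which reduces to the sample variance divided by the sample mean. The mean of 4 foci per cell, the gamma shape of 2 and all function names are made up for illustration.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_counts(heterogeneous, n_cells=5000, mean_foci=4.0, seed=42):
    """Simulate foci counts per cell (all parameter values are hypothetical)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cells):
        if heterogeneous:
            # Each cell has its own rate: gamma with shape 2 and mean mean_foci.
            rate = rng.gammavariate(2.0, mean_foci / 2.0)
        else:
            rate = mean_foci
        counts.append(poisson(rate, rng))
    return counts


def dispersion(counts):
    """Pearson dispersion estimate for a constant-mean Poisson fit."""
    return statistics.variance(counts) / statistics.fmean(counts)


if __name__ == "__main__":
    print("common rate:  ", dispersion(simulate_counts(heterogeneous=False)))
    print("per-cell rate:", dispersion(simulate_counts(heterogeneous=True)))
```

With a common rate the estimate should sit close to 1 (equidispersion). With gamma-distributed rates the marginal variance exceeds the mean, so the estimate should be clearly above 1.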

A Bayesian statistical approach to decision support for petroleum reservoir well control optimisation

Speaker: Jonathan Owen, Department of Mathematical Sciences, Durham University
Wednesday 27 November 2019: 13:00, CM105

Abstract:

Complex mathematical computer models are used across many scientific disciplines and industry to improve the understanding of the behaviour of physical systems and increasingly to aid decision makers. Major limitations to the use of computer simulators include their complex structure, high-dimensional parameter spaces and large number of unknown model parameters, all of which are further compounded by their long evaluation times. Decision support, commonly misrepresented as an optimisation task, often requires a large number of model evaluations, rendering traditional optimisation methods intractable; these methods also fail to incorporate uncertainty. Consequently, they may yield non-robust decisions.

I will present an iterative decision support strategy which imitates the history matching procedure, aiming to identify a robust class of decisions. Bayes linear emulators provide fast, statistical approximations to computer models, yielding predictions for as yet unevaluated parameter settings, along with a corresponding quantification of uncertainty. Appropriate structured uncertainties are accurately quantified and incorporated to link the sophisticated computer model and the actual system in order to obtain robust decisions for the real-world problem.

In the petroleum industry, TNO devised a field development optimisation challenge under uncertainty providing an ensemble of 50 fictitious oil reservoir models generated using a stochastic geology model. This challenge exhibits many of the common issues associated with computer experimentation. I will demonstrate the robust decision support strategy applied to the TNO challenge for a greatly reduced computational cost versus ensemble optimisers. This includes the construction of a targeted Bayesian design as well as methods of identifying subsets of models as representatives for the entire ensemble.

Reducing bias is as easy as ABC with applications to modelling Ebola

Speaker: Kieran Richards, Department of Mathematical Sciences, Durham University
Wednesday 20 November 2019: 13:00, CM105

Abstract:

Approximate Bayesian Computation (ABC) has enabled us in recent years to use increasingly complex models to solve problems that were previously intractable. ABC methods can produce unreliable inference when they introduce high approximation bias into the posterior through careless specification of the ABC kernel. Additionally, MCMC-ABC methods often suffer from the local trapping problem, which causes poor mixing when the tolerance parameter is low. We propose an alternative ABC algorithm which we show can be used to reduce the approximation bias and provide immunity to local trapping by adaptively constructing the ABC kernel. We demonstrate the new algorithm on real data, calibrating a complex SEIR model to data from the Ebola outbreak of 2014 and estimating the pre-intervention transmission rate of the disease.
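To make the role of the tolerance concrete, here is a minimal sketch of plain rejection ABC, which is the simplest ABC scheme and not the adaptive algorithm proposed in the talk. The toy model (a Normal with unknown mean and unit variance), the uniform prior on [-5, 5], the sample-mean summary statistic and all function names are assumptions chosen for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator: n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def rejection_abc(observed, eps, draws=20_000, seed=1):
    """Keep prior draws whose simulated summary lies within eps of the observed one."""
    rng = random.Random(seed)
    s_obs = statistics.fmean(observed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(-5.0, 5.0)  # draw from the (assumed) uniform prior
        s_sim = statistics.fmean(simulate(theta, len(observed), rng))
        # Uniform ABC kernel: accept only if the summaries are within eps.
        if abs(s_sim - s_obs) <= eps:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    observed = simulate(2.0, 50, random.Random(0))
    for eps in (1.0, 0.1):
        post = rejection_abc(observed, eps)
        print(eps, len(post), statistics.fmean(post), statistics.stdev(post))
```

In this toy setting a loose tolerance accepts many draws but yields a visibly wider approximation to the posterior, while a tight tolerance is more accurate and accepts far fewer draws. That trade-off is the approximation bias the abstract refers to.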

A Sensitivity Analysis of Adaptive Lasso

Speaker: Tathagata Basu, Department of Mathematical Sciences, Durham University
Wednesday 13 November 2019: 13:00, CM105

Abstract:

Sparse regression is an efficient statistical modelling technique which is of major relevance for high-dimensional statistics. There are several ways of achieving sparse regression, the well-known lasso being one of them. However, lasso variable selection may not be consistent in selecting the true sparse model. Zou proposed an adaptive form of the lasso which overcomes this issue, and showed that data-driven weights on the penalty term will result in a consistent variable selection procedure. We are interested in the case where the weights are informed by a prior execution of ridge regression. We carry out a sensitivity analysis of the adaptive lasso through the power parameter of the weights, and demonstrate that, in effect, this parameter takes over the role of the usual lasso penalty parameter. In addition, we use the parameter as an input variable to obtain an error bound on the adaptive lasso.
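For reference, the adaptive lasso with ridge-informed weights can be written as follows. The notation is an assumption made here for illustration and may differ from the speaker's: $\lambda$ is the penalty parameter, $\gamma$ is the power parameter of the weights, and $\hat\beta^{\text{ridge}}$ is the initial ridge regression estimate.

$$
\hat\beta^{\text{adapt}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j \, |\beta_j|,
\qquad
w_j = \frac{1}{\bigl|\hat\beta_j^{\text{ridge}}\bigr|^{\gamma}}, \quad \gamma > 0 .
$$

Setting $\gamma = 0$ gives $w_j = 1$ for every $j$ and recovers the ordinary lasso. Larger values of $\gamma$ penalise coefficients with small initial estimates more heavily.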

Keywords:

Adaptive lasso, sensitivity analysis, oracle properties, variable selection, ridge regression.

This work is funded by the European Commission's H2020 programme, through the UTOPIAE Marie Curie Innovative Training Network, H2020-MSCA-ITN-2016, Grant Agreement number 722734.

Bayes goes to Space: inferring chemical model parameters for tomorrow’s Space journeys

Speaker: Anabel del Val, von Karman Institute for Fluid Dynamics, Belgium
Wednesday 6 November 2019: 13:00, CM105

Abstract:

Venturing into Space requires large amounts of energy to reach orbital and interplanetary velocities. The bulk of this energy is exchanged during the entry phase by converting the kinetic energy of the vehicle into thermal energy in the surrounding atmosphere through the formation of a strong bow shock ahead of the vehicle. The way engineers protect spacecraft from the intense heat of atmospheric entry is by designing two kinds of protection systems: reusable and ablative. Reusable systems are characterized by re-radiating a significant amount of energy from the hot surface back into the atmosphere. Ablative materials, on the other hand, transform the thermal energy into decomposition and removal of the material.

The resulting aerothermal environment surrounding a vehicle during atmospheric entry is extremely complex. As such, we often need efficient uncertainty quantification techniques to extract knowledge from experimental data that can appropriately inform the proposed models. We develop robust Bayesian frameworks that aim at characterizing chemical model parameters for re-entry plasma flows in the presence of both types of protection systems. Special care is devoted to the treatment of nuisance parameters, which are unavoidable when performing flow simulations in need of proper boundary conditions beyond the interest of the specific inference. Our formulation involves a particular treatment of these nuisance parameters by solving an auxiliary maximum likelihood problem. Results will be shown for real-world cases.

Analysis of clickstream data

Speaker: Ryan Jessop, Department of Mathematical Sciences, Durham University and Clicksco
Wednesday 30 October 2019: 13:00, CM105

Abstract:

Online user browsing generates vast quantities of typically unexploited data. Investigating this data and uncovering the valuable information it contains can be of substantial value to online businesses, and statistics plays a key role in this process.

The data takes the form of an anonymous digital footprint associated with each unique visitor, resulting in 10^6 unique profiles across 10^7 individual page visits on a daily basis. Exploring, cleaning and transforming data of this scale and high dimensionality (2TB+ of memory) is particularly challenging, and requires cluster computing.

We consider the problem of predicting customer purchases (known as conversions), from the customer’s journey or clickstream, which is the sequence of pages seen during a single visit to a website. We consider each page as a discrete state with probabilities of transitions between the pages, providing the basis for a simple Markov model. Further, Hidden Markov models (HMMs) are applied to relate the observed clickstream to a sequence of hidden states, uncovering meta-states of user activity. We can also apply conventional logistic regression to model conversions in terms of summaries of the profile’s browsing behaviour and incorporate both into a set of tools to solve a wide range of conversion types where we can directly compare the predictive capability of each model.

Predicting, in real time, profiles that are likely to follow similar behaviour patterns to known conversions will have a critical impact on targeted advertising. We illustrate these analyses with results from real data collected by an Audience Management Platform (AMP) - Carbon.
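The simple Markov model described in this abstract treats each page as a state and estimates transition probabilities from observed clickstreams. A minimal sketch of that estimation step is given below. The page names, the three example sessions and the function name `transition_probabilities` are hypothetical and are not taken from the Carbon data.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count each observed move from one page to the next within a visit.
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    # Normalise each row of counts so that it sums to one.
    return {
        page: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for page, nexts in counts.items()
    }


# Hypothetical clickstreams: one list of visited pages per visit.
sessions = [
    ["home", "product", "basket", "purchase"],
    ["home", "product", "home", "exit"],
    ["home", "search", "product", "basket", "exit"],
]
probs = transition_probabilities(sessions)

if __name__ == "__main__":
    for page, row in probs.items():
        print(page, row)
```

In this toy data, half of the transitions out of `basket` lead to `purchase`. A conversion model built on these estimates would score a partial clickstream by the probability of eventually reaching the purchase state.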

Stats4Grads Welcome Session

Speaker: everyone!
Wednesday 23 October 2019: 13:00, CM105

Abstract:

Come along to CM105 on Wednesday 23rd October at 13:00 to get to know your fellow Statisticians! This is a relaxed and informal event to introduce ourselves and our research area (briefly!), as well as meet others with an interest in Statistics. There will be free pizza! ;)
