Bayesian modelling motivation

JP Gosling

Some preliminaries

Prof. John Paul Gosling

Email: john-paul.gosling@durham.ac.uk

Office: MSC3068

Office hours: Fridays, 14:00-16:00

A disclaimer

Bayesian methods are widely used, but many academic statisticians do not buy into the philosophy. Throughout this course, we will argue that it is a natural way of handling problems, but be aware that the approach has many opponents.

Expect all of my material to be biased.

Dose response modelling

[Figures: a motivating dose-response relationship, built up over several slides]

The model

Mathematically, the relationship between the response, \(r\), and the dose, \(d\), can be written simply as

\[ r(d) = \begin{cases} 0, & d \leq \theta,\\ 100\frac{d-\theta}{\kappa-\theta}, & \theta < d < \kappa,\\ 100, & d \geq \kappa. \end{cases} \]

And we have two unknown parameters \(\theta\) and \(\kappa\).

The model (2nd form)

Mathematically, the relationship between the response, \(r\), and the dose, \(d\), can be written simply as

\[ r(d) = \begin{cases} 0, & d \leq \theta,\\ 100\frac{d-\theta}{\delta}, & \theta < d < \theta + \delta,\\ 100, & d \geq \theta + \delta. \end{cases} \]

And we have two unknown parameters \(\theta\) and \(\delta\).

Reality

Unfortunately, real-world observations are unlikely to be this well behaved, so we also need to consider errors.

We believe that the test has a form of measurement error associated with it. We assume that the error is normally distributed:

\[ \begin{align*} O_i &= r(d_i) + \epsilon_i, \\ \epsilon_i &\sim \text{N}\left(0,\sigma^2_M\right). \end{align*} \]
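As a quick illustration, here is a minimal R sketch of simulating noisy observations from this model; the doses and the values \(\theta = 40\), \(\delta = 80\) and \(\sigma_M = 5\) are made up purely for illustration:

# piecewise-linear response (second form of the model)
r <- function(d, theta, delta) {
  pmin(pmax(100 * (d - theta) / delta, 0), 100)
}

set.seed(1)
d <- c(10, 40, 70, 100, 130)
O <- r(d, theta = 40, delta = 80) + rnorm(length(d), mean = 0, sd = 5)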

Directed acyclic graph

Here is a visual representation of a single observation \(O\):

[Figure: DAG for a single observation, built up over several slides]

Directed acyclic graph

Here is a visual representation of two observations (ignoring noise), \(O_1\) and \(O_2\):

[Figure: DAG for two observations]

Directed acyclic graph

Consider one path through the DAG. Reading the graph from parents to children gives the factorisation of the joint density:

\[ \small \pi(\theta,R_1,R_2,O_1,O_2) = \pi(\theta)\pi(R_1|\theta)\pi(O_1|R_1)\pi(R_2|\theta)\pi(O_2|R_2) \]

Priors

Being proper Bayesians, we must consider our prior beliefs about each of the unknown parameters.

We have three parameters to consider: \(\theta\), \(\delta\) and \(\sigma_M\).

Prior for \(\theta\)

This parameter is the threshold dose in Molar at which the chemical becomes harmful.

Questions:

  • What does experience of other chemicals lead me to believe about this value?

  • What range of values are possible?

  • What circumstances would lead this to be a relatively high or low number?

  • What evidence do you have to back all of this up?

Prior for \(\theta\)

Can you give a value for \(\theta\) for which you believe there is an equal chance of the true value being above or below?

Prior for \(\theta\)

Suppose \(\theta\) is definitely greater than 50: can you give a value for \(\theta\) for which you believe there is an equal chance of the true value being above or below?

Prior for \(\theta\)

Suppose \(\theta\) is definitely below 50: can you give a value for \(\theta\) for which you believe there is an equal chance of the true value being above or below?

Prior for \(\theta\)

We have the following histogram representation of our beliefs about \(\theta\):

[Figure: histogram of elicited beliefs about \(\theta\)]

Prior for \(\theta\)

We can fit a distribution that represents our beliefs about \(\theta\):

[Figure: fitted prior density for \(\theta\)]

Prior for \(\delta\)

The upper limit \(\kappa\) of the dose response (after which complete damage is guaranteed) is also uncertain. Recall that \(\delta\) is the gap between the lower limit \(\theta\) and \(\kappa\).

Preposterior distribution

# Priors
theta <- rgamma(1000, 1.73, 0.0268)
delta <- rgamma(1000, 3.88, 0.0849)

# Plausible relationships?
dose_try <- seq(0, 200, 0.4)
response <- function(dose, theta, delta) {
  ifelse(dose <= theta, 0,
         ifelse(dose >= theta + delta, 100,
                100 * (dose - theta) / delta))
}
response_try <- list()
for (i in 1:1000) {
  response_try[[i]] <- response(dose_try, theta[i], delta[i])
}
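As a minimal sketch, the sampled curves can be overlaid to produce the preposterior plots below:

# overlay the sampled curves with heavy transparency
plot(dose_try, response_try[[1]], type = "l", col = rgb(0, 0, 0, 0.05),
     ylim = c(0, 100), xlab = "Dose", ylab = "Response")
for (i in 2:1000) {
  lines(dose_try, response_try[[i]], col = rgb(0, 0, 0, 0.05))
}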

Preposterior distribution

[Figures: the sampled dose-response curves under the priors]

Prior for \(\sigma_M\)

We are uncertain about the measurement error that we encode in \(\sigma_M\).

This comes into our model here:

\[ \epsilon_i \sim \text{N}\left(0,\sigma^2_M\right). \]

Prior for \(\sigma_M\)

I would be extremely surprised if \(\sigma_M\) were greater than 20.
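We can check that a candidate prior respects this statement. A minimal sketch using the Gamma(2.01, 0.382) prior that appears in the Stan model below:

# prior probability that sigma_M exceeds 20
pgamma(20, shape = 2.01, rate = 0.382, lower.tail = FALSE)
# about 0.004, so such values would indeed be extremely surprising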

Preposterior distribution

[Figures: preposterior draws including measurement error]

Study in healthy adults

Dose   Response
  30          7
  30          1
  30          2
  40         17
  40         10
  70         29
 100         64

Study in healthy adults

[Figure: the observed dose-response data]

Sampling from the posterior

We use the probabilistic programming language Stan, which helps us to shortcut the coding and tuning of MCMC.

Stan requires us to specify multiple code blocks that cover the model, data and what we want to output.

  • data,
  • parameters,
  • transformed parameters,
  • model,
  • generated quantities.

Sampling from the posterior

data {
  int<lower=0> N_obs;   // number of observations
  real D[N_obs];        // doses
  real O[N_obs];        // observed responses
}

Sampling from the posterior

parameters {
  real<lower=0> theta;    // threshold dose
  real<lower=0> delta;    // gap between the threshold and complete damage
  real<lower=0> sigma_M;  // measurement-error standard deviation
}

Sampling from the posterior

transformed parameters {
  real<lower=0> R[N_obs];   // noise-free responses

  for (n in 1:N_obs) {
    R[n] = 100 * (D[n] - theta) / delta;
    if (R[n] < 0) R[n] = 0;
    if (R[n] > 100) R[n] = 100;
  }
}

Sampling from the posterior

model {
  // Priors (Stan's gamma uses shape and rate, matching R's rgamma)
  theta ~ gamma(1.73, 0.0268);
  delta ~ gamma(3.88, 0.0849);
  sigma_M ~ gamma(2.01, 0.382);

  // Likelihood
  for (n in 1:N_obs)
    O[n] ~ normal(R[n], sigma_M);
}

Sampling from the posterior

library(rstan)

# compile the model
model <- stan_model(file = 'First model.stan')

# the data
N_obs <- 7
D <- c(30,30,30,40,40,70,100)
O <- c(7,1,2,17,10,29,64)

# generate a posterior distribution
fit <- sampling(model,
                chains = 1,
                iter = 100000,
                data = list(N_obs = N_obs,
                            D = D,
                            O = O))
summary(fit,
        pars = c('theta','delta','sigma_M'),
        probs = c(0.05,0.5,0.95))

# extract the posterior samples
theta_samples <- extract(fit, pars = 'theta')$theta
delta_samples <- extract(fit, pars = 'delta')$delta
sigma_M_samples <- extract(fit, pars = 'sigma_M')$sigma_M

Sampling from the posterior

library(rstan)

# compile the model
model <- stan_model(file = 'First model.stan')

Sampling from the posterior

# the data
N_obs <- 7
D <- c(30,30,30,40,40,70,100)
O <- c(7,1,2,17,10,29,64)

# generate a posterior distribution
fit <- sampling(model,
                chains = 1,
                iter = 100000,
                data = list(N_obs = N_obs,
                            D = D,
                            O = O))

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 9e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:     1 / 100000 [  0%]  (Warmup)
Chain 1: Iteration: 10000 / 100000 [ 10%]  (Warmup)
Chain 1: Iteration: 20000 / 100000 [ 20%]  (Warmup)
Chain 1: Iteration: 30000 / 100000 [ 30%]  (Warmup)
Chain 1: Iteration: 40000 / 100000 [ 40%]  (Warmup)
Chain 1: Iteration: 50000 / 100000 [ 50%]  (Warmup)
Chain 1: Iteration: 50001 / 100000 [ 50%]  (Sampling)
Chain 1: Iteration: 60000 / 100000 [ 60%]  (Sampling)
Chain 1: Iteration: 70000 / 100000 [ 70%]  (Sampling)
Chain 1: Iteration: 80000 / 100000 [ 80%]  (Sampling)
Chain 1: Iteration: 90000 / 100000 [ 90%]  (Sampling)
Chain 1: Iteration: 100000 / 100000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 1.313 seconds (Warm-up)
Chain 1:                1.61 seconds (Sampling)
Chain 1:                2.923 seconds (Total)
Chain 1: 

Sampling from the posterior

summary(fit, pars = c('theta','delta','sigma_M'), probs = c(0.05,0.5,0.95))
$summary
              mean   se_mean        sd        5%        50%       95%    n_eff
theta    30.756202 0.1405303  9.149246 21.827272  27.977051  52.96344 4238.679
delta   110.484309 0.2647553 18.793685 69.695085 113.929210 134.73253 5038.887
sigma_M   5.981208 0.0276066  2.194064  3.385563   5.488194  10.18207 6316.447
            Rhat
theta   1.000616
delta   1.000477
sigma_M 1.000000

$c_summary
, , chains = chain:1

         stats
parameter       mean        sd        5%        50%       95%
  theta    30.756202  9.149246 21.827272  27.977051  52.96344
  delta   110.484309 18.793685 69.695085 113.929210 134.73253
  sigma_M   5.981208  2.194064  3.385563   5.488194  10.18207

# extract the posterior samples
theta_samples <- extract(fit, pars = 'theta')$theta
delta_samples <- extract(fit, pars = 'delta')$delta
sigma_M_samples <- extract(fit, pars = 'sigma_M')$sigma_M

Comparison with priors

[Figures: prior and posterior densities for \(\theta\), \(\delta\) and \(\sigma_M\)]

Study in healthy adults

[Figure: posterior fit against the observed data]

Predictive distribution

[Figures: posterior predictive distribution for the response]

Resolution

We can consider the amount of uncertainty that has been resolved by observing the data through: \[ \text{R}(\theta) = 1 - \frac{\text{Var}(\theta|\text{Data})}{\text{Var}(\theta)}. \] We get \[ \text{R}(\theta) = 0.96,~\text{R}(\delta) = 0.32~\text{and}~\text{R}(\sigma_M) = 0.64. \]
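A minimal sketch of this calculation for \(\theta\), using fresh draws from its Gamma(1.73, 0.0268) prior and the posterior samples extracted earlier:

# resolution for theta: one minus the ratio of posterior to prior variance
theta_prior <- rgamma(100000, 1.73, 0.0268)
1 - var(theta_samples) / var(theta_prior)   # approximately 0.96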

An alternative model

As noted, there have been a few dodgy elements in the proposed model. Also, there is a theory that there is no real threshold for harm. We can consider an alternative formulation:

\[ r(d) = 100~\Phi\left(\frac{d-\mu}{\nu} \right), \] with parameters \(\mu\) and \(\nu\).
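Since \(100~\Phi\left((d-\mu)/\nu\right)\) is just a rescaled normal CDF, the curve is easy to sketch in R; the values \(\mu = 100\) and \(\nu = 30\) below are made up for illustration:

# smooth dose-response curve: 100 * Phi((d - mu) / nu) = 100 * pnorm(d, mu, nu)
r2 <- function(d, mu, nu) 100 * pnorm(d, mean = mu, sd = nu)
curve(r2(x, mu = 100, nu = 30), from = 0, to = 200,
      xlab = "Dose", ylab = "Response")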

An alternative model

Of course, reality is messy:

\[ \text{logit}\left[O_i/100\right] = \text{logit}\left[r(d_i)/100\right] + \epsilon_i. \]

Logit transformation

The logit transformation takes a doubly-bounded quantity and maps it uniquely to the real line.

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right). \]
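In R, the logit transformation and its inverse are available directly as qlogis and plogis:

qlogis(0.5)           # 0: a response of 50% sits at the centre of the real line
qlogis(0.99)          # about 4.6
plogis(qlogis(0.3))   # 0.3: plogis inverts the transformation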

Logit transformation

[Figure: the logit function]

An alternative model

Recall the observation model on the logit scale:

\[ \text{logit}\left[O_i/100\right] = \text{logit}\left[r(d_i)/100\right] + \epsilon_i. \]

We further assume that the errors are independent of each other and have the following distribution:

\[ \epsilon_i \sim \text{N}\left(0, \sigma^2\right). \]
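Putting the pieces together, a minimal sketch of simulating observations from the alternative model, with illustrative values \(\mu = 100\), \(\nu = 30\) and \(\sigma = 0.5\):

set.seed(2)
d <- c(30, 60, 90, 120)
r <- 100 * pnorm(d, mean = 100, sd = 30)
# add noise on the logit scale, then map back to the response scale
O <- 100 * plogis(qlogis(r / 100) + rnorm(length(d), 0, 0.5))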


Prior for \(\mu\)

This is related to the priors for \(\theta\) and \(\delta\) from the first model.

Prior for \(\nu\)

This is more related to the prior for \(\delta\) from the first model.

Preposterior distribution

[Figures: curves drawn from the priors of the alternative model]

Prior for \(\sigma\)

This is difficult to consider directly due to the logit transformation.
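One way to build intuition is to push a one-standard-deviation error back through the inverse logit. A minimal sketch at a true response of 50:

# with sigma = 1, a +1 SD error moves a response of 50 up to about 73
100 * plogis(qlogis(0.5) + 1)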

Sampling from the posterior

transformed data {
  real logit_O[N_obs];   // observations mapped to the logit scale

  for (n in 1:N_obs) {
    logit_O[n] = logit(O[n] / 100);
  }
}

Sampling from the posterior

parameters {
  real<lower=0> mu;     // dose at which the response reaches 50
  real<lower=0> nu;     // spread of the dose-response curve
  real<lower=0> sigma;  // error standard deviation on the logit scale
}

Sampling from the posterior

transformed parameters {
  real<lower=0> R[N_obs];   // noise-free responses
  real logit_R[N_obs];

  for (n in 1:N_obs) {
    R[n] = 100 * normal_cdf(D[n], mu, nu);
    logit_R[n] = logit(R[n] / 100);
  }
}

Sampling from the posterior

model {
  // Priors
  mu ~ gamma(5.16, 0.0457);
  nu ~ gamma(2.03, 0.0685);
  sigma ~ gamma(3.02, 1.98);

  // Likelihood
  for (n in 1:N_obs)
    logit_O[n] ~ normal(logit_R[n], sigma);
}

Sampling from the posterior

# compile the model
model <- stan_model(file = 'Second model.stan')

Sampling from the posterior

fit2 <- sampling(model, 
                 chains = 1,
                 iter = 100000,
                 data = list(N_obs = N_obs,
                             D = D,
                             O = O),
                 init = list(list(mu = 100,
                                  nu = 10,
                                  sigma = 0.5)))

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 9e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:     1 / 100000 [  0%]  (Warmup)
Chain 1: Iteration: 10000 / 100000 [ 10%]  (Warmup)
Chain 1: Iteration: 20000 / 100000 [ 20%]  (Warmup)
Chain 1: Iteration: 30000 / 100000 [ 30%]  (Warmup)
Chain 1: Iteration: 40000 / 100000 [ 40%]  (Warmup)
Chain 1: Iteration: 50000 / 100000 [ 50%]  (Warmup)
Chain 1: Iteration: 50001 / 100000 [ 50%]  (Sampling)
Chain 1: Iteration: 60000 / 100000 [ 60%]  (Sampling)
Chain 1: Iteration: 70000 / 100000 [ 70%]  (Sampling)
Chain 1: Iteration: 80000 / 100000 [ 80%]  (Sampling)
Chain 1: Iteration: 90000 / 100000 [ 90%]  (Sampling)
Chain 1: Iteration: 100000 / 100000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 2.186 seconds (Warm-up)
Chain 1:                2.421 seconds (Sampling)
Chain 1:                4.607 seconds (Total)
Chain 1: 

Sampling from the posterior

summary(fit2, pars = c('mu','nu','sigma'), probs = c(0.05,0.5,0.95))
$summary
           mean     se_mean         sd         5%       50%        95%    n_eff
mu    93.133184 0.155396233 16.6173437 73.0107639 89.920636 124.431135 11435.16
nu    37.017227 0.111375878 11.7295596 23.7531219 34.350136  59.541693 11091.27
sigma  1.081619 0.002930401  0.3623079  0.6399465  1.010335   1.765676 15286.27
          Rhat
mu    1.000429
nu    1.000251
sigma 1.000111

$c_summary
, , chains = chain:1

         stats
parameter      mean         sd         5%       50%        95%
    mu    93.133184 16.6173437 73.0107639 89.920636 124.431135
    nu    37.017227 11.7295596 23.7531219 34.350136  59.541693
    sigma  1.081619  0.3623079  0.6399465  1.010335   1.765676

# extract the posterior samples
mu_samples <- extract(fit2, pars = 'mu')$mu
nu_samples <- extract(fit2, pars = 'nu')$nu
sigma_samples <- extract(fit2, pars = 'sigma')$sigma

Comparison with priors

[Figures: prior and posterior densities for \(\mu\), \(\nu\) and \(\sigma\)]

Predictive distribution

[Figures: posterior predictive distribution for the alternative model]
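As a minimal sketch, such a predictive distribution can be built from the posterior samples extracted above (subsampling 1,000 draws for speed):

# pointwise predictive draws for the response over a dose grid
d_grid <- seq(0, 200, 1)
idx <- sample(length(mu_samples), 1000)
pred <- sapply(idx, function(i) {
  r <- 100 * pnorm(d_grid, mu_samples[i], nu_samples[i])
  100 * plogis(qlogis(r / 100) + rnorm(length(d_grid), 0, sigma_samples[i]))
})

# 90% predictive interval at each dose
pred_int <- apply(pred, 1, quantile, probs = c(0.05, 0.95))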

Model selection

We have proposed two fundamentally different models. How do we choose which is best?

  • personal choice,
  • Bayes factors,
  • Bayesian model averaging,
  • predictive performance,
  • (cross-validation).

Model selection

Bayes factors use the model evidence to show which model is best supported by the data.

Bayesian model averaging produces weights, closely related to the model evidence, that are used to form a weighted average of predictive distributions.

The model evidence is essentially the normalisation constant in the Bayesian update…
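To make that concrete: for a model \(M\) with parameter vector \(\psi\), the evidence is the likelihood integrated against the prior,

\[ \pi(\text{Data}\,|\,M) = \int \pi(\text{Data}\,|\,\psi, M)\,\pi(\psi\,|\,M)\,\text{d}\psi, \]

and the Bayes factor for comparing two models is the ratio of their evidences.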

Model selection

We estimate the expected log pointwise predictive density using leave-one-out (LOO) calculations.

This can be related to the model evidence, and a pseudo model weighting can then be derived.

Model selection

library(loo)

# LOO for model 1
loo1 <- loo(fit)
loo1

Computed from 50000 by 7 log-likelihood matrix

         Estimate  SE
elpd_loo    -24.5 2.0
p_loo         3.3 1.1
looic        49.0 4.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     4     57.1%   4220      
 (0.5, 0.7]   (ok)       1     14.3%   477       
   (0.7, 1]   (bad)      2     28.6%   202       
   (1, Inf)   (very bad) 0      0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Model selection

# LOO for model 2
loo2 <- loo(fit2)
loo2

Computed from 50000 by 7 log-likelihood matrix

         Estimate  SE
elpd_loo    -10.8 1.4
p_loo         1.8 0.6
looic        21.6 2.8
------
Monte Carlo SE of elpd_loo is 0.0.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     5     71.4%   12788     
 (0.5, 0.7]   (ok)       2     28.6%   1542      
   (0.7, 1]   (bad)      0      0.0%   <NA>      
   (1, Inf)   (very bad) 0      0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Model selection

# Get pseudo-BMA weights
loo_list <- list(loo1, loo2)
loo_model_weights(loo_list, 
                  method = "pseudobma", 
                  BB = FALSE)

Method: pseudo-BMA
------
       weight
model1 0.000 
model2 1.000 

Predictive distribution for model 1

[Figure: posterior predictive distribution under model 1]

Predictive distribution for model 2

[Figure: posterior predictive distribution under model 2]

Conclusions

  • careful Bayesian modelling is time-consuming,
  • there are subjective choices at every step,
  • our priors will often affect our results,
  • preposterior and predictive distributions are important tools,
  • model selection is very challenging.