Models and Methods in Health Data Science

Lecture 3: Linking SIR models to the real world

Rachel Oughton

15/02/2023

1 Overview

This lecture will think about linking the SIR model we developed in Lectures 1 and 2 to the real world.

These could easily each be a lecture (or more!) on their own, so please ask questions at any time!

2 Linking to the real world

Our model structure (SIR, SEIRS etc) should relate to the real world
(Think of all the questions we considered)

But the compartment populations vary hugely with \(\beta\) and \(\gamma\):

The measures we’ve discussed have simple definitions but are impossible to measure:

Can we use our model and real world data to understand

3 The early stages of an epidemic

In this section:

The early stages of an epidemic

Using early stage data, we can estimate \(R_0\).
This relies on the assumption that almost all the population is susceptible.
This helps us to understand:

To estimate \(R_0\), we first need to calculate the growth rate.

3.1 The growth rate of an epidemic

Key idea:

During the first stages of an epidemic, the rate of increase in infectious individuals should be approximately constant.


In our SIR model: \[ \frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I. \] At the start of the epidemic, we have \(S\approx{N}\), and so approximately \[ \frac{dI\left(t\right)}{dt} = \beta I\left(t\right) - \gamma I\left(t\right) = \left(\beta - \gamma\right) I\left(t\right), \]

Therefore an approximate solution is \[ I\left(t\right) \approx I\left(0\right)e^{\left(\gamma-\beta\right)t} \]

The growth rate of an epidemic

More generally (not assuming a simple SIR model) we write

\[ I\left(t\right) \approx I\left(0\right)e^{\Lambda t}, \] where \(\Lambda\) is the growth rate.

Taking logs, we therefore have \[ \ln\left(I\left(t\right)\right) \approx \ln\left(I\left(0\right)\right) + \Lambda t. \] This is a straight line, with gradient \(\Lambda\).

Therefore, if we plot \(\ln{\left[I\left(t\right)\right]}\), we should see a straight line with gradient \(\Lambda\).

Estimating the growth rate

But

What if we don’t know the number of infectious individuals at each time step? (normally we won’t)

At log scale, the cumulative case count increases at a similar rate early on:

Natural logs of the I population, and the cumulative sum of I.

Natural logs of the I population, and the cumulative sum of I.

3.1.1 Scale invariance

Another problem

Incidences of an infectious disease are often under-reported (especially early on).

Say, \[I_{reported} = k_r I_{true} \text{ where } 0<k_r<1.\] Our estimate of the growth-rate \(\Lambda\) is unaffected: \[ \begin{align*} I_{true}\left(t\right) & \approx I_{true}\left(0\right)e^{\Lambda t}\\ & \approx \frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}, \end{align*} \] and therefore \[\begin{align*} \ln \left[I_{true}\left(0\right)\right] & \approx \ln{\left[\frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}\right]}\\ &\approx{\ln\left(\frac{1}{k_r}I_{reported}\left(0\right)\right) + \Lambda t}. \end{align*}\]

The intercept of the straight line changes, but not the gradient \(\Lambda\).

3.1.2 Example: Covid-19 in the UK - new cases

From the UK government website you can download daily case data for Covid-19 in the UK.

New daily cases and cumulative counts, for each of England, Scotland, Wales and Northern Ireland.

Plotting the time series for the whole of the pandemic up to January 2022 we see

Daily new Covid-19 cases in the UK from early 2020 to present.

Daily new Covid-19 cases in the UK from early 2020 to present.

Cumulative Covid-19 cases in the UK from early 2020 to present.

Cumulative Covid-19 cases in the UK from early 2020 to present.

Example: plotting the logs

Plotting the natural log, for just the first month (March 2020):

Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.

Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.

Example: fitting straight lines

Fit straight lines to each of these using linear regression:

Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.

Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.

The gradients and their standard errors are:

Coefficients from the fitted linear models.
Region Coefficients SE
England 0.224 0.00461
Northern Ireland 0.194 0.00823
Scotland 0.229 0.00940
Wales 0.277 0.00992

The ‘Coefficients’ column gives the estimate of the growth rate \(\Lambda\).

3.2 Using the growth rate to estimate \(R_0\)

There are many ways to estimate \(R_0\) from \(\Lambda\)

For example many assume that the pre-infectious and/or infectious periods follow an exponential distribution (which we also do when we use rate parameters).

One such estimate is

\[ \begin{equation} R_0 \approx 1 + \Lambda T_S \end{equation} \]

(\(T_S\) is the serial interval, between the onsets of successive cases).

Another is

\[\begin{equation} R_0\approx \left(1+\Lambda D\right)\left(1+\Lambda D'\right), \end{equation}\] where \(D\) and \(D'\) are the durations of the infectious and pre-infectious (latent) periods, respectively.

3.2.1 Example continued: Covid-19 in the UK

Nishiura, Linton, and Akhmetzhanov (2020) estimate \(T_s\) of Covid-19 to be 4.6 with a 95% credible interval of \(\left(3.5, 5.9\right)\).

We can use these with our estimates of \(\Lambda\) to estimate \(R_0\):

Estimates of R[0] using the median and 95% credible interval from Nishiura, Linton, and Akhmetzhanov (2020).
Region Coefficients SE R0_lower95 R0_median R0_upper95
England 0.224 0.00461 1.7840 2.0304 2.3216
Northern Ireland 0.194 0.00823 1.6790 1.8924 2.1446
Scotland 0.229 0.00940 1.8015 2.0534 2.3511
Wales 0.277 0.00992 1.9695 2.2742 2.6343


Note: this ignores uncertainty in the estimates of \(\Lambda\).

4 Estimating parameters by model fitting

In this section:

Estimating parameters by model fitting

We can also use outbreak data to find appropriate values for the parameters of an SIR-type model.

We call this calibration (a general model-fitting term)

Parameter space

If a model has \(n\) parameters, we can think of them as living in an \(n\)-dimension space: the parameter space (or sometimes input space or model space)

Each point in parameter space will correspond to a particular set of parameter values.

Difficult to visualise for large \(n\).

4.1 SIR parameters in parameter space

Think of the SIR model with two parameters (so \(n=2\)):

We can see the point \(\left(\beta=1.25,\,\gamma=0.5\right)\) in parameter space, and the corresponding output:

The parameter values from our example, plotted in 2D parameter space.

The parameter values from our example, plotted in 2D parameter space.

The model output corresponding to the parameter values from this example.

The model output corresponding to the parameter values from this example.

4.2 Moving around parameter space

How does the output change for as we move around parameter space?

To answer this question, we need to choose some points.

But we have to be careful how we choose them…

Problem: We won’t about what happens when \(\beta\) and \(\gamma\) are varied independently

Problem: If one parameter turns to have little/no effect, we effectively have only 3 points.

4.2.1 Latin hypercube

To build a \(p\)-point latin hypercube design (LHD) we

  1. Divide each dimension (parameter) into \(p\) equal intervals.
  2. Allocate \(p\) points such that there is exactly one point in each interval.

This means the diagonal design is in fact an LHD, so we add another constraint.

A maximin LHD is an LHD where the minimum distance between points has been maximised.


Now we have

10-point latin hypercube

We can run the SIR model for each \(\left(\beta,\,\gamma\right)\) pair in our 10-point maximin latin hypercube:

A maximin LHS of size 10.

A maximin LHS of size 10.

Infectious populations corresponding to the 10 points.

Infectious populations corresponding to the 10 points.

This is a Monte Carlo approach, where we use random sampling to generate results.

4.3 Example: Mumps in a virgin population

Philip, Reinhard, and Lackman (1959) reported on two outbreaks of Mumps in 1957, on St. Lawrence Island

St. Lawrence Island

St. Lawrence Island

Since mumps is immunizing and the populations started as susceptible, this dataset should work well with an SIR model.

Let’s try to find appropriate values of \(\beta\) and \(\gamma\)

Working with real outbreak data

The data are weekly counts of new incidences - this is not the same as \(I\left(t\right)\) (or any of our other compartments).

We need to create an analagous quantity we can calculate using the SIR model

##           week cases cum_cases  Village time_days
## 141 1957-03-30     1         1 Savoonga         1
## 151 1957-04-06     0         1 Savoonga         8
## 161 1957-04-13     6         7 Savoonga        15
## 171 1957-04-20     9        16 Savoonga        22
## 181 1957-04-27     3        19 Savoonga        29
## 191 1957-05-04    20        39 Savoonga        36
## 201 1957-05-11    26        65 Savoonga        43
## 211 1957-05-18    31        96 Savoonga        50
## 221 1957-05-25    28       124 Savoonga        57
## 231 1957-06-01    21       145 Savoonga        64
## 241 1957-06-08     8       153 Savoonga        71
## 251 1957-06-15    10       163 Savoonga        78
## 261 1957-06-22     1       164 Savoonga        85
## 271 1957-06-29     0       164 Savoonga        92
## 281 1957-07-06     0       164 Savoonga        99
## 291 1957-07-13     0       164 Savoonga       106
## 301 1957-07-20     0       164 Savoonga       113
## 311 1957-07-27     0       164 Savoonga       120
## 321 1957-08-03     0       164 Savoonga       127
## 331 1957-08-10     0       164 Savoonga       134
## 341 1957-08-17     0       164 Savoonga       141
## 351 1957-08-24     0       164 Savoonga       148

Example: Savoonga dataset

Weekly counts of new incidences - \(I\left(t\right)\) unknown.

We can generate a cumulative sum of new infections

Cumulative case count for Savoonga, 1957.

Cumulative case count for Savoonga, 1957.

Our goal is to find values for \(\beta,\,\gamma\) that lead to a good fit.

4.3.1 Goodness of fit

Goal: find values for \(\beta\) and \(\gamma\) that produce an \(R\) curve that is as close as possible to the actual data.

To do this, we need some measure of the goodness-of-fit of model output to the data.

We use the measure to assess the fit across the parameter space, and thus find the best point.

The measure we will use is the root mean squared error, RMSE.

There are many MANY other ways to assess model fit!

Root mean squared error (RMSE)

The root mean squared error is defined as \[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\left(\hat{y}_i - y_i\right)^2}}. \]

The RMSE gives the root of the mean squared distance between model and data, and is on the same scale as the data.

The smaller the RMSE, the better the fit of the model to the data.

Example: RMSE

Let’s set \(\beta=1.1,\,\gamma=0.75\), and compare the resulting model curve to the Savoonga data.

Cumulative case count for Savoonga, 1957 (blue), with the model produced by beta=1.1, gamma=0.75 (black), and the distance between each pair of observed and predicted values marked in red.

Cumulative case count for Savoonga, 1957 (blue), with the model produced by beta=1.1, gamma=0.75 (black), and the distance between each pair of observed and predicted values marked in red.

\[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\left(\hat{y}_i - y_i\right)^2}}. \]

4.3.2 Example: Exploring the space

Sampling approach

  1. Generate a selection of parameter values that cover the space well
  2. Run the model for each point and calculate RMSE
  3. Identify which regions of parameter space give a better fit.

Example: 1000 point maximin LHD

## [1] "Smallest RMSE is 9.33929 at beta=0.216393, gamma=0.127752"
1000 points in SIR parameter space, coloured by the RMSE as a measure of fit to the Savoonga mumps data.

1000 points in SIR parameter space, coloured by the RMSE as a measure of fit to the Savoonga mumps data.

Notice that the best pairs of values for \(\beta\) and \(\gamma\) lie on a straight line, representing the same \(R_0\) value (remember that for this model \(R_0= \frac{\beta}{\gamma}\))

Example: The best fit (from the sample)

Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found.

Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found.

This is still not the best fit - it is the best from the 1,000 candidate models we generated.

4.3.3 Focussing in

Based on the best parameter values, we will refocus on the region \(\beta \in [0.23,\,0.3],\;\gamma \in [0.125,\,0.2]\).

## [1] "Smallest RMSE is 5.0837 at beta=0.247978, gamma=0.154183"

Model fit from zoomed-in LHD

We can plot the \(I^{cum}_t\) output for the SIR model with these parameters, to see the fit.

Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found in the refocussed region.

Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found in the refocussed region.

If we wanted to, we could repeat this process to gain an even better fit.

4.4 Some caveats!

This process is subject to many uncertainties, errors and problems. Here are a few:

5 The Optimization approach

In this section:

The Optimization approach

Rather than evaluating the SIR model and RMSE hundreds or thousands of times, we can use a step-by-step approach:

There are many optimisation methods - we will look at Nelder-Mead.

5.1 Optimisation using Nelder-Mead

Optimisation methods usually require a function

\[ f: \mathbb{R}^n \rightarrow{\mathbb{R}}.\]

In our case,

Nelder-Mead

Nelder-Mead starts with a randomly generated simplex in \(\mathbb{R}^n\).

A simplex is the simplest possible shape spanning all \(n\) dimensions, and has \(n+1\) vertices. A 2-simplex is a triangle.

The function is evaluated at each vertex, and the worst \(\left(x_h\right)\), second worst \(\left(x_s\right)\) and best \(\left(x_l\right)\) are labelled.

Then several new points are proposed.

Nelder-Mead: new point

Reflection

Reflection

Nelder-Mead: Changes

5.1.1 A stopping criterion

With the four instructions above, the algorithm would continue indefinitely; it needs to know when to stop and return the ‘optimal’ point.

In general the Nelder-Mead algorithm terminates when either

Video

5.2 Optim: Mumps example

Starting point \(\beta = 0.6,\,\gamma=0.6\).

Optim example - fit to data

Slightly better than before

5.2.1 Optimisation - some thoughts

6 Summary

In this lecture we have thought about using observed data with SIR-type models.

In the workshop we will develop these ideas more.

7 References

Nishiura, Hiroshi, Natalie M Linton, and Andrei R Akhmetzhanov. 2020. “Serial Interval of Novel Coronavirus (COVID-19) Infections.” International Journal of Infectious Diseases 93: 284–86.
Philip, BN, Karl R Reinhard, and DB Lackman. 1959. “Observations on a Mumps Epidemic in a "Virgin" Population.” American Journal of Epidemiology 69 (2): 91–111.