Rachel Oughton
15/02/2023
This lecture will think about linking the SIR model we developed in Lectures 1 and 2 to the real world.
These could easily each be a lecture (or more!) on their own, so please ask questions at any time!
Our model structure (SIR, SEIRS etc) should relate
to the real world
(Think of all the questions we
considered)
But the compartment populations vary hugely with \(\beta\) and \(\gamma\):
The quantities we’ve discussed have simple definitions but are very difficult to measure directly:
Can we use our model and real world data to understand
- What’s happening?
- What’s likely to happen next?
In this section:
Using early stage data, we can estimate \(R_0\).
This relies on the assumption
that almost all the population is susceptible.
This helps
us to understand:
To estimate \(R_0\), we first need to calculate the growth rate.
Key idea:
During the first stages of an epidemic, the rate of increase in infectious individuals should be approximately constant.
In our SIR model: \[
\frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I.
\] At the start of the epidemic, we have \(S\approx{N}\), and so approximately \[
\frac{dI\left(t\right)}{dt} = \beta I\left(t\right) - \gamma
I\left(t\right) = \left(\beta - \gamma\right) I\left(t\right),
\]
Therefore an approximate solution is \[ I\left(t\right) \approx I\left(0\right)e^{\left(\beta-\gamma\right)t} \]
More generally (not assuming a simple SIR model) we write
\[ I\left(t\right) \approx I\left(0\right)e^{\Lambda t}, \] where \(\Lambda\) is the growth rate (in the SIR case, \(\Lambda = \beta - \gamma\)).
Taking logs, we therefore have \[
\ln\left(I\left(t\right)\right) \approx \ln\left(I\left(0\right)\right)
+ \Lambda t.
\] This is a straight line, with gradient \(\Lambda\).
Therefore, if we plot \(\ln{\left[I\left(t\right)\right]}\), we should see a straight line with gradient \(\Lambda\).
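This suggests a simple recipe: take logs of the counts and fit a straight line. A minimal Python sketch (the lecture's own computations are in R; the counts here are synthetic, with a made-up growth rate of 0.25 per day):

```python
import numpy as np

# Synthetic early-epidemic counts following I(t) = I(0) * exp(Lambda * t)
t = np.arange(10)                 # days
I = 5.0 * np.exp(0.25 * t)        # true growth rate Lambda = 0.25 per day

# Fit a straight line to ln I(t); the gradient estimates Lambda
Lambda_hat, intercept = np.polyfit(t, np.log(I), deg=1)
print(Lambda_hat)  # ~0.25
```

With real data the counts are noisy, so the fitted gradient comes with a standard error.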
But
What if we don’t know the number of infectious individuals at each time step? (normally we won’t)
On a log scale, the cumulative case count increases at a similar rate early on:
Natural logs of the I population, and the cumulative sum of I.
Another problem
Cases of an infectious disease are often under-reported (especially early on).
Say, \[I_{reported} = k_r I_{true} \text{ where } 0<k_r<1.\] Our estimate of the growth rate \(\Lambda\) is unaffected: \[ \begin{align*} I_{true}\left(t\right) & \approx I_{true}\left(0\right)e^{\Lambda t}\\ & = \frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}, \end{align*} \] and therefore \[\begin{align*} \ln \left[I_{true}\left(t\right)\right] & \approx \ln{\left[\frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}\right]}\\ &= \ln\left(\frac{1}{k_r}I_{reported}\left(0\right)\right) + \Lambda t. \end{align*}\]
The intercept of the straight line changes, but not the gradient \(\Lambda\).
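We can check this numerically: scaling the counts by a constant reporting fraction shifts the fitted intercept but leaves the gradient alone. A small Python sketch (the reporting fraction \(k_r = 0.4\) is made up for illustration):

```python
import numpy as np

t = np.arange(15)
I_true = 3.0 * np.exp(0.2 * t)     # true counts, growth rate 0.2
k_r = 0.4                          # hypothetical reporting fraction
I_reported = k_r * I_true

slope_true, int_true = np.polyfit(t, np.log(I_true), 1)
slope_rep, int_rep = np.polyfit(t, np.log(I_reported), 1)

print(slope_true - slope_rep)      # ~0: the gradient is unchanged
print(int_rep - int_true)          # ~ln(k_r): only the intercept moves
```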
From the UK government website you can download daily case data for Covid-19 in the UK.
New daily cases and cumulative counts, for each of England, Scotland, Wales and Northern Ireland.
Plotting the time series for the whole of the pandemic up to January 2022 we see
Daily new Covid-19 cases in the UK from early 2020 to present.
Cumulative Covid-19 cases in the UK from early 2020 to present.
Plotting the natural log, for just the first month (March 2020):
Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.
Fit straight lines to each of these using linear regression:
Natural logarithms of cumulative Covid-19 cases in the UK in early 2020.
The gradients and their standard errors are:
| Region | Coefficients | SE |
|---|---|---|
| England | 0.224 | 0.00461 |
| Northern Ireland | 0.194 | 0.00823 |
| Scotland | 0.229 | 0.00940 |
| Wales | 0.277 | 0.00992 |
The ‘Coefficients’ column gives the estimate of the growth rate \(\Lambda\).
There are many ways to estimate \(R_0\) from \(\Lambda\)
For example many assume that the pre-infectious and/or infectious periods follow an exponential distribution (which we also do when we use rate parameters).
One such estimate is
\[ \begin{equation} R_0 \approx 1 + \Lambda T_S \end{equation} \]
(\(T_S\) is the serial interval: the time between the onsets of successive cases).
Another is
\[\begin{equation} R_0\approx \left(1+\Lambda D\right)\left(1+\Lambda D'\right), \end{equation}\] where \(D\) and \(D'\) are the durations of the infectious and pre-infectious (latent) periods, respectively.
Nishiura, Linton, and Akhmetzhanov (2020) estimate the serial interval \(T_S\) of Covid-19 to be 4.6 days, with a 95% credible interval of \(\left(3.5, 5.9\right)\).
We can use these with our estimates of \(\Lambda\) to estimate \(R_0\):
| Region | Coefficients | SE | R0_lower95 | R0_median | R0_upper95 |
|---|---|---|---|---|---|
| England | 0.224 | 0.00461 | 1.7840 | 2.0304 | 2.3216 |
| Northern Ireland | 0.194 | 0.00823 | 1.6790 | 1.8924 | 2.1446 |
| Scotland | 0.229 | 0.00940 | 1.8015 | 2.0534 | 2.3511 |
| Wales | 0.277 | 0.00992 | 1.9695 | 2.2742 | 2.6343 |
Note: this ignores uncertainty in the estimates of \(\Lambda\).
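As a check, the England row of the table can be reproduced directly from \(R_0 \approx 1 + \Lambda T_S\), plugging in the lower, median and upper values of \(T_S\):

```python
Lambda_england = 0.224  # gradient estimated from the regression above

# R0 ~ 1 + Lambda * T_S for the lower, median and upper serial interval
estimates = [round(1 + Lambda_england * T_S, 4) for T_S in (3.5, 4.6, 5.9)]
print(estimates)  # [1.784, 2.0304, 2.3216] -- the England row
```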
In this section:
We can also use outbreak data to find appropriate values for the parameters of an SIR-type model.
We call this calibration (a general model-fitting term)
Parameter space
If a model has \(n\) parameters, we can think of them as living in an \(n\)-dimensional space: the parameter space (or sometimes input space or model space).
Each point in parameter space will correspond to a particular set of parameter values.
Difficult to visualise for large \(n\).
Think of the SIR model with two parameters (so \(n=2\)):
We can see the point \(\left(\beta=1.25,\,\gamma=0.5\right)\) in parameter space, and the corresponding output:
The parameter values from our example, plotted in 2D parameter space.
The model output corresponding to the parameter values from this example.
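For readers who want to reproduce a run like this, here is a minimal Python sketch of the SIR model using `scipy` (the lecture's own code is in R; the population size \(N = 1000\) and the initial condition are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma, N):
    """Right-hand side of the SIR equations."""
    S, I, R = y
    dS = -beta * I * S / N
    dI = beta * I * S / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

N = 1000
beta, gamma = 1.25, 0.5   # the example point in parameter space
sol = solve_ivp(sir_rhs, (0, 50), [N - 1, 1, 0], args=(beta, gamma, N))
S, I, R = sol.y           # compartment trajectories over time
# S + I + R stays equal to N throughout
```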
How does the output change as we move around parameter space?
To answer this question, we need to choose some points.
But we have to be careful how we choose them…
Problem: We won’t learn about what happens when \(\beta\) and \(\gamma\) are varied independently.
Problem: If one parameter turns out to have little/no effect, we effectively have only 3 points.
To build a \(p\)-point Latin hypercube design (LHD), we divide each parameter’s range into \(p\) equal intervals, and choose the points so that each interval of each parameter contains exactly one point.
This means the diagonal design is in fact an LHD, so we add another constraint.
A maximin LHD is an LHD where the minimum distance between points has been maximised.
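A simple way to build one is by random search: generate many random LHDs and keep the one with the largest minimum pairwise distance. A Python sketch (more efficient constructions exist; the scaling onto \(\beta,\,\gamma\) ranges at the end is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def latin_hypercube(p, n, rng):
    """A random LHD on [0,1]^n: each of the p intervals per dimension gets one point."""
    design = np.empty((p, n))
    for j in range(n):
        # one point in each of the p equal-width strata, in random order
        design[:, j] = rng.permutation((np.arange(p) + rng.random(p)) / p)
    return design

def maximin_lhd(p, n, rng, n_candidates=200):
    """Crude maximin search: keep the candidate with the largest minimum distance."""
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        d = latin_hypercube(p, n, rng)
        diff = d[:, None, :] - d[None, :, :]
        dists = np.sqrt((diff ** 2).sum(axis=-1))
        score = dists[np.triu_indices(p, k=1)].min()
        if score > best_score:
            best, best_score = d, score
    return best

design = maximin_lhd(10, 2, rng)
# scale [0,1]^2 onto e.g. beta in [0, 2], gamma in [0, 1]
beta_gamma = design * np.array([2.0, 1.0])
```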
Now we have a design that covers parameter space much more evenly:
We can run the SIR model for each \(\left(\beta,\,\gamma\right)\) pair in our 10-point maximin Latin hypercube:
A maximin LHS of size 10.
Infectious populations corresponding to the 10 points.
This is a Monte Carlo approach, where we use random sampling to generate results.
Philip, Reinhard, and Lackman (1959) reported on two outbreaks of mumps in 1957 on St. Lawrence Island.
St. Lawrence Island
Since mumps is immunizing and the populations started as susceptible, this dataset should work well with an SIR model.
Let’s try to find appropriate values of \(\beta\) and \(\gamma\).
The data are weekly counts of new cases - this is not the same as \(I\left(t\right)\) (or any of our other compartments).
We need to create an analogous quantity that we can calculate using the SIR model.
## week cases cum_cases Village time_days
## 141 1957-03-30 1 1 Savoonga 1
## 151 1957-04-06 0 1 Savoonga 8
## 161 1957-04-13 6 7 Savoonga 15
## 171 1957-04-20 9 16 Savoonga 22
## 181 1957-04-27 3 19 Savoonga 29
## 191 1957-05-04 20 39 Savoonga 36
## 201 1957-05-11 26 65 Savoonga 43
## 211 1957-05-18 31 96 Savoonga 50
## 221 1957-05-25 28 124 Savoonga 57
## 231 1957-06-01 21 145 Savoonga 64
## 241 1957-06-08 8 153 Savoonga 71
## 251 1957-06-15 10 163 Savoonga 78
## 261 1957-06-22 1 164 Savoonga 85
## 271 1957-06-29 0 164 Savoonga 92
## 281 1957-07-06 0 164 Savoonga 99
## 291 1957-07-13 0 164 Savoonga 106
## 301 1957-07-20 0 164 Savoonga 113
## 311 1957-07-27 0 164 Savoonga 120
## 321 1957-08-03 0 164 Savoonga 127
## 331 1957-08-10 0 164 Savoonga 134
## 341 1957-08-17 0 164 Savoonga 141
## 351 1957-08-24 0 164 Savoonga 148
Weekly counts of new incidences - \(I\left(t\right)\) unknown.
We can generate a cumulative sum of new infections
Cumulative case count for Savoonga, 1957.
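In Python this is a single call (the weekly counts are copied from the Savoonga table above):

```python
import numpy as np

# weekly new mumps cases for Savoonga (from the table above)
new_cases = np.array([1, 0, 6, 9, 3, 20, 26, 31, 28, 21, 8, 10, 1])
cum_cases = np.cumsum(new_cases)
print(cum_cases.tolist())
# [1, 1, 7, 16, 19, 39, 65, 96, 124, 145, 153, 163, 164] -- matches cum_cases
```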
Our goal is to find values for \(\beta\) and \(\gamma\) that produce a cumulative case curve as close as possible to the observed data.
To do this, we need some measure of the goodness-of-fit of
model output to the data.
We use the measure to assess the fit across the parameter
space, and thus find the best point.
The measure we will use is the root mean squared error, RMSE.
There are many, many other ways to assess model fit!
The root mean squared error is defined as \[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\left(\hat{y}_i - y_i\right)^2}}. \]
- \(n\) is the number of observed values
- \(y_i\), for \(1\leq{i}\leq{n}\), are the observed values
- \(\hat{y}_i\) are the corresponding model predictions
The RMSE gives the root of the mean squared distance between model and data, and is on the same scale as the data.
The smaller the RMSE, the better the fit of the model to the data.
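As a function, the RMSE is only a few lines; a Python sketch:

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean squared error between model predictions and observations."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

print(rmse([1, 2, 3], [1, 2, 3]))   # 0.0 -- a perfect fit
print(rmse([0, 0], [3, 4]))         # sqrt((9 + 16) / 2) ~ 3.54
```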
Let’s set \(\beta=1.1,\,\gamma=0.75\), and compare the resulting model curve to the Savoonga data.
Cumulative case count for Savoonga, 1957 (blue), with the model produced by beta=1.1, gamma=0.75 (black), and the distance between each pair of observed and predicted values marked in red.
\[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\left(\hat{y}_i - y_i\right)^2}}. \]
Sampling approach
## [1] "Smallest RMSE is 9.33929 at beta=0.216393, gamma=0.127752"
1000 points in SIR parameter space, coloured by the RMSE as a measure of fit to the Savoonga mumps data.
Notice that the best pairs of values for \(\beta\) and \(\gamma\) lie on a straight line, representing the same \(R_0\) value (remember that for this model \(R_0= \frac{\beta}{\gamma}\))
Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found.
This is still not the best fit - it is the best from the 1,000 candidate models we generated.
Based on the best parameter values, we will refocus on the region \(\beta \in [0.23,\,0.3],\;\gamma \in [0.125,\,0.2]\).
## [1] "Smallest RMSE is 5.0837 at beta=0.247978, gamma=0.154183"
We can plot the \(I^{cum}_t\) output for the SIR model with these parameters, to see the fit.
Cumulative case count for Savoonga, 1957 (blue), with the model produced by best pair of parameter values found in the refocussed region.
If we wanted to, we could repeat this process to gain an even better fit.
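The sample-and-refocus loop can be sketched generically. Here a toy function with its minimum placed near the fitted \(\left(\beta,\,\gamma\right)\) stands in for "run the SIR model and compute the RMSE"; the shrink factor and point counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "run the SIR model and compute the RMSE": a toy function
# with its minimum at (0.25, 0.15), echoing the fitted (beta, gamma)
def fit_measure(beta, gamma):
    return np.hypot(beta - 0.25, gamma - 0.15)

def refocus(lo, hi, n_points=1000, rounds=3, shrink=0.25):
    lo, hi = np.array(lo, float), np.array(hi, float)
    for _ in range(rounds):
        pts = rng.uniform(lo, hi, size=(n_points, 2))
        scores = fit_measure(pts[:, 0], pts[:, 1])
        best = pts[scores.argmin()]
        # refocus the search box around the current best point
        half = (hi - lo) * shrink / 2
        lo, hi = best - half, best + half
    return best

best = refocus([0.0, 0.0], [2.0, 1.0])
print(best)  # close to (0.25, 0.15)
```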
This process is subject to many uncertainties, errors and problems. Here are a few:
In this section:
Rather than evaluating the SIR model and RMSE hundreds or thousands of times, we can use a step-by-step approach:
There are many optimisation methods - we will look at Nelder-Mead.
Optimisation methods usually require a function
\[ f: \mathbb{R}^n \rightarrow{\mathbb{R}}.\]
In our case, \(f\) maps a parameter pair \(\left(\beta,\,\gamma\right)\) to the RMSE between the model output and the data, so \(n=2\).
Nelder-Mead starts with an initial (often randomly generated) simplex in \(\mathbb{R}^n\).
A simplex is the simplest possible shape spanning all \(n\) dimensions, and has \(n+1\) vertices. A 2-simplex is a triangle.
The function is evaluated at each vertex, and the worst \(\left(x_h\right)\), second worst \(\left(x_s\right)\) and best \(\left(x_l\right)\) are labelled.
Then several new points are proposed.
Reflection
Expansion
Contraction
Shrinking
With the four instructions above, the algorithm would continue indefinitely; it needs to know when to stop and return the ‘optimal’ point.
In general the Nelder-Mead algorithm terminates when either the simplex (or the spread of function values at its vertices) has become sufficiently small, or a maximum number of iterations has been reached.
Starting point \(\beta = 0.6,\,\gamma=0.6\).
Slightly better than before
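In practice we rarely implement Nelder-Mead ourselves; `scipy` provides it. A Python sketch, again with a toy quadratic standing in for the RMSE surface (minimum placed near the fitted values for illustration), started from the same point \(\left(0.6,\,0.6\right)\):

```python
import numpy as np
from scipy.optimize import minimize

# toy stand-in for RMSE(beta, gamma), with its minimum near the fitted values
def objective(x):
    beta, gamma = x
    return (beta - 0.25) ** 2 + (gamma - 0.15) ** 2

result = minimize(objective, x0=[0.6, 0.6], method="Nelder-Mead")
print(result.x)   # close to [0.25, 0.15]
```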
In this lecture we have thought about using observed data with SIR-type models.
In the workshop we will develop these ideas more.