This lecture will think about linking the SIR model we developed in Lectures 1 and 2 to the real world. In particular, we will cover:
These could easily each be a lecture (or more!) on their own, so please ask questions at any time!
So far we have looked at the development of SIR-type models, and some key extensions, but we haven’t yet thought about how to use them in the face of a real epidemic. The model development stages we discussed in the first session do of course link to the real world, since the model structure we choose should be rooted in our understanding of the behaviour of the disease. This may still be a huge source of uncertainty, especially in the face of a new infectious disease; recall that early in the Covid-19 pandemic it was unknown whether people retained their immunity after being infected, for example.
The main difficulty, once a model structure has been decided upon, is to estimate the parameter values, and quantities such as the basic reproduction number. There are two main strands to this area:
Although many of the quantities we have mentioned (reproduction numbers, transmission rates etc.) have intuitively simple definitions, they are difficult or even impossible to measure directly. However, by combining the mathematical model with data we can estimate them.
Using data on the incidence of an infectious disease during the early stages of an epidemic, we can estimate the basic reproduction number, \(R_0\). This is useful as it gives an indicator of the transmissibility of the disease and the herd immunity threshold. To estimate \(R_0\), we first need to calculate the growth rate.
For more detail on this section, see Vynnycky and White (2010).
During the first stages of an epidemic, the per-capita rate of increase in infectious individuals should be approximately constant, leading to exponential growth. To see why this is, we first think of a simple differential equation \[ \frac{dQ\left(t\right)}{dt} = -kQ\left(t\right), \] where \(Q\left(t\right)\) is some function of time that decays at a rate proportional to its current value, with constant \(k\). The solution to this equation (ie. an equation for the actual function \(Q\left(t\right)\)) is \[ Q\left(t\right) = Q\left(0\right)e^{-kt}. \] Since the derivative of \(e^{-kt}\) with respect to \(t\) is \(-ke^{-kt}\), we can see that \[ \frac{dQ\left(t\right)}{dt} = -kQ\left(0\right)e^{-kt} = -kQ\left(t\right), \] and that this function therefore solves the differential equation.
The differential equation for the number of Infectious individuals is (in a simple SIR model) \[ \frac{dI}{dt} = \frac{\beta I S}{N} - \gamma I. \] At the start of the epidemic, we have \(S\approx{N}\), and so approximately \[ \frac{dI\left(t\right)}{dt} = \beta I\left(t\right) - \gamma I\left(t\right) = \left(\beta - \gamma\right) I\left(t\right), \] and therefore an approximate solution is \[ I\left(t\right) \approx I\left(0\right)e^{\left(\beta-\gamma\right)t}, \] where \(I\left(0\right)\) is the number of infectious individuals at time 0. More generally (ie. not assuming the simplest SIR model), we write
\[ I\left(t\right) \approx I\left(0\right)e^{\Lambda t}, \] where \(\Lambda\) is the growth rate. Taking logs, we therefore have \[ \ln\left(I\left(t\right)\right) \approx \ln\left(I\left(0\right)\right) + \Lambda t. \] This is a straight line, with gradient \(\Lambda\). Therefore, if we plot the natural logarithm of the number of infectious individuals against time, we should see a straight line with gradient \(\Lambda\).
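As a quick illustration (a sketch in Python rather than R, using made-up numbers rather than real case data), we can simulate exponential growth and recover the growth rate \(\Lambda\) as the gradient of a straight-line fit to the logged counts:

```python
import numpy as np

# Synthetic early-epidemic counts following I(t) = I(0) * exp(Lambda * t).
# Lambda = 0.22 and I(0) = 5 are illustrative values, not estimates from data.
true_lambda = 0.22
t = np.arange(0, 30)                  # days since the first observation
I = 5.0 * np.exp(true_lambda * t)

# Fitting a straight line to ln(I(t)) against t by ordinary least squares
# recovers the growth rate as the gradient.
gradient, intercept = np.polyfit(t, np.log(I), deg=1)
print(f"Estimated growth rate: {gradient:.3f}")
```

With real data the counts would be noisy, so the fitted gradient would only approximate \(\Lambda\); here the data are exactly exponential, so the fit is essentially perfect.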
This leads to an obvious question: what if we don’t know the number of infectious individuals at each time step? Thankfully, on a log scale the cumulative case count increases at approximately the same rate during the early phase, as shown in Figure 3.1.
It is often the case that incidences of an infectious disease are under-reported. Sometimes this is modelled using a parameter representing the proportion of cases that are reported, say \(k_r\), with \(0<k_r<1\). In this case,
\[I_{reported} = k_r I_{true} \] and therefore
\[I_{true} = \frac{I_{reported}}{k_r},\] where \(I_{true}\) is the true number of infectious cases, and \(I_{reported}\) the number that have been reported. Under-reporting is especially likely at the start of an epidemic, but thankfully our estimate of the growth rate \(\Lambda\) is unaffected. This is because \[ \begin{align*} I_{true}\left(t\right) & \approx I_{true}\left(0\right)e^{\Lambda t}\\ & \approx \frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}, \end{align*} \] and therefore
\[\begin{align*} \ln \left(I_{true}\left(t\right)\right) & \approx \ln{\left[\frac{1}{k_r}I_{reported}\left(0\right) e^{\Lambda t}\right]}\\ &\approx{\ln\left(\frac{1}{k_r}I_{reported}\left(0\right)\right) + \Lambda t}. \end{align*}\]
This means that the intercept of the straight line changes, but not the gradient (which is the part we are interested in).
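We can check this numerically with a small sketch (in Python, with illustrative values): scaling the counts by a reporting fraction \(k_r\) shifts the logged series by \(\ln(k_r)\) but leaves its gradient unchanged.

```python
import numpy as np

# Illustrative true epidemic curve with growth rate 0.25, I(0) = 10.
t = np.arange(0, 30)
I_true = 10.0 * np.exp(0.25 * t)

# Only a fraction k_r of cases is reported (k_r = 0.3 is illustrative).
k_r = 0.3
I_reported = k_r * I_true

# The gradient of the log-linear fit is the same for both series;
# only the intercept differs (by ln(k_r)).
slope_true, _ = np.polyfit(t, np.log(I_true), 1)
slope_reported, _ = np.polyfit(t, np.log(I_reported), 1)
print(slope_true, slope_reported)
```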
From the UK government website you can download daily case data for Covid-19 in the UK. This contains both new daily cases and cumulative counts, for each of England, Scotland, Wales and Northern Ireland.
Plotting the time series for 2020 and 2021 we see the trends in Figures 3.2 and 3.3.
We can take logs of the cumulative count and focus on just the first month, and this is shown in Figure 3.4.
We can fit straight lines to each of these using ordinary least squares, as in Figure 3.5.
The gradients and their standard errors are given in Table 3.1.
Region | Coefficients | SE |
---|---|---|
England | 0.224 | 0.00461 |
Northern Ireland | 0.194 | 0.00823 |
Scotland | 0.229 | 0.00940 |
Wales | 0.277 | 0.00992 |
The ‘Coefficients’ column in Table 3.1 gives an estimate of the growth rate \(\Lambda\) for each UK region. These values are all fairly similar, though some variation between regions is plausible, since factors such as the different population densities in the regions could affect the growth rate of Covid-19.
There are many ways to estimate \(R_0\) from \(\Lambda\), each relying on different model assumptions and using different summaries of the disease’s behaviour. For example, many of the simpler estimates assume that the pre-infectious and/or infectious periods follow an exponential distribution (which we also do when we use rate parameters).
One such estimate is
\[ \begin{equation} R_0 \approx 1 + \Lambda T_S \tag{3.1} \end{equation} \]
where \(T_S\) is the serial interval (the time between successive cases in a chain of transmission). In the early stages this can be estimated using data from cases where the chain of infection is almost certainly known.
Another is
\[\begin{equation} R_0\approx \left(1+\Lambda D\right)\left(1+\Lambda D'\right),\tag{3.2} \end{equation}\] where \(D\) and \(D'\) are the durations of the infectious and pre-infectious (latent) periods, respectively.
Nishiura, Linton, and Akhmetzhanov (2020) estimate the serial interval of Covid-19 to be 4.6 days (using their most certain primary and secondary case pairs), with a 95% credible interval of \(\left(3.5, 5.9\right)\). We can therefore combine our estimates of \(\Lambda\) from Table 3.1 with these serial interval estimates in Equation (3.1) to produce estimates of \(R_0\). These are shown in Table 3.2. Notice that this paper was published in April 2020, so these estimates were available relatively early on in the pandemic.
Region | Coefficients | SE | R0 (lower 95%) | R0 (median) | R0 (upper 95%) |
---|---|---|---|---|---|
England | 0.224 | 0.00461 | 1.7840 | 2.0304 | 2.3216 |
Northern Ireland | 0.194 | 0.00823 | 1.6790 | 1.8924 | 2.1446 |
Scotland | 0.229 | 0.00940 | 1.8015 | 2.0534 | 2.3511 |
Wales | 0.277 | 0.00992 | 1.9695 | 2.2742 | 2.6343 |
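The calculation behind Table 3.2 is just Equation (3.1) applied region by region; a short sketch (in Python, using the growth rates from Table 3.1 and the serial interval estimates above):

```python
# Growth-rate estimates from Table 3.1, and the serial-interval estimates of
# Nishiura, Linton, and Akhmetzhanov (2020): median 4.6 days, 95% CrI (3.5, 5.9).
growth_rates = {"England": 0.224, "Northern Ireland": 0.194,
                "Scotland": 0.229, "Wales": 0.277}
serial_intervals = {"lower95": 3.5, "median": 4.6, "upper95": 5.9}

# Equation (3.1): R0 is approximately 1 + Lambda * T_S.
for region, lam in growth_rates.items():
    r0 = {name: 1 + lam * ts for name, ts in serial_intervals.items()}
    print(region, r0)
```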
At this point, we haven’t made use of the uncertainty in the estimates of \(\Lambda\); incorporating it would widen the credible intervals further.
We can also use data from an outbreak of an infectious disease to find appropriate values for the parameters of an SIR-type model. We call this calibration (this is a general model-fitting term, not just for SIR-type or compartmental models). If a model has \(n\) parameters, we can think of them as forming an \(n\)-dimensional space. We can call this space the parameter space. Each point in the space will correspond to a particular ‘setting’ of the model (ie. to a particular set of specific parameter values). In general this can be difficult to visualise, as \(n\) can be quite large, but for 2 or 3 parameters we can easily see points in the space.
To look into this, we will return to the simplest model type we looked at in Lecture 1, the SIR model with no vital dynamics. This model involves two parameters: the per capita transmission rate \(\beta\) and the recovery rate \(\gamma\). In model fitting generally, we talk about the parameter space (or sometimes input space), and in this case, with two parameters, it is a 2D surface. If we set upper and lower limits on the value of each parameter, we have a rectangle. Because there are only two parameters, we can plot any pair of values. For instance, in our example we had \(\beta=1.25,\,\gamma=0.5\), and we can see this point in parameter space in Figure 4.1.
This led to the model output shown in Figure 4.2.
We can now visualise how the model output varies as we move around the parameter space. Choosing points effectively to maximise the information gained is an important area of statistics (and science generally) known as experimental design.
There are many (infinitely many!) ways we could arrange \(n\) points in 2-dimensional space.
In Figure 4.3, 10 points are arranged in parameter space such that \(\beta=\gamma\). With this design, we will learn about what happens when the parameters are varied together, but not about what will happen if they are varied independently.
In Figure 4.4 we see what is called a full-factorial design with nine points: every combination of each of three values for both parameters is present. This is an improvement, but if one of the parameters turns out to have little effect on the output (as is not uncommon with real models), our design is effectively reduced to three points.
Instead, we will use a latin-hypercube design (LHD). To make an LHD of size \(p\), the parameter space is divided into a \(p\times{p}\) grid, and there must be one point in each row and each column. This in fact means that the design in Figure 4.3 is a latin hypercube, so we will add a further constraint. A maximin LHD is an LHD in which the smallest distance between any two points is maximised. Now we have \(p\) distinct values for each parameter, and the space is, in some sense, being filled. This is a bit of a digression into experimental design, but if you want to learn more about it the original paper is Morris and Mitchell (1995).
We can generate a 10-point maximin LHD in \(\beta\) and \(\gamma\) automatically in R, using various packages (eg. lhs).
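For illustration, here is a minimal maximin-LHD construction sketched in Python (rather than using the R package lhs): generate many random latin hypercubes and keep the one whose smallest pairwise distance is largest. The parameter ranges used at the end are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_lhd(p, d, rng):
    # One point per row and column: permute the cell indices 0..p-1 in each
    # dimension, then jitter each point uniformly within its cell.
    cells = np.column_stack([rng.permutation(p) for _ in range(d)])
    return (cells + rng.uniform(size=(p, d))) / p   # scaled to the unit square

def min_pairwise_distance(x):
    # Smallest Euclidean distance between any two design points.
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists[np.triu_indices(len(x), k=1)].min()

# Crude maximin search: keep the best of 500 random LHDs.
best = max((random_lhd(10, 2, rng) for _ in range(500)),
           key=min_pairwise_distance)

# Rescale the unit-square design to (illustrative) parameter ranges.
beta = best[:, 0] * 1.5     # beta in [0, 1.5]
gamma = best[:, 1] * 1.0    # gamma in [0, 1]
```

Dedicated packages use more sophisticated optimisation than this random search, but the latin-hypercube and maximin properties are the same.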
Each of these points corresponds to specific values for \(\beta\) and \(\gamma\), and so for each point we can run the SIR model to produce output like that shown in Figure 4.2. To demonstrate the correspondence between point and output, Figure 4.6 shows just the \(I\) compartment, coloured to correspond to the colours in Figure 4.5. This is a Monte Carlo approach, where we use random sampling to generate results.
We can see that varying the values of \(\beta\) and \(\gamma\) leads to very different trajectories. We will employ this technique to narrow down the parameter space to suitable values given some data.
Philip, Reinhard, and Lackman (1959) reported on two outbreaks of Mumps in 1957, on St. Lawrence Island in the Bering Sea. This island is 40 miles from Alaska and 120 miles from Siberia (see Figure 4.7), and at the time was visited very infrequently from either mainland. There are two villages on St Lawrence Island, Gambell (1957 population 302) and Savoonga (1957 population 259). No prior outbreak of mumps was known there, and there was no word in the native language to convey the idea of mumps. Since mumps is an immunising infection, and the populations of these villages each started as entirely (or very nearly entirely) susceptible, this dataset should be a good candidate to calibrate the parameters of an SIR model.
The data points given are weekly counts of new cases. Because of this, we don’t have the actual number of infectious people at any time, which is what the model’s \(I\) compartment gives us. We therefore use the observed data to compute the cumulative sum of cases, and derive a corresponding value from the model.
In this model, as people leave the \(S\) compartment they are newly infected. As time progresses, more and more people leave the \(S\) compartment, passing through the \(I\) compartment into \(R\). These all count in a measure of cumulative infections. The only people who don’t count in the cumulative total of infections are those still in \(S\). Therefore at time \(t\), the cumulative total of infections is \[I^{cum}_t = N - S_{t}.\] We can use this value along with the recorded cumulative infection data to calibrate the SIR model to this disease outbreak.
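As a sketch of this calculation (in Python, with illustrative parameter values rather than fitted ones), we can solve the SIR equations numerically and read off \(I^{cum}_t = N - S_t\):

```python
import numpy as np
from scipy.integrate import odeint

# A minimal SIR model; cumulative infections at time t are N - S(t), since
# everyone who has left S has been infected at some point.
def sir(y, t, beta, gamma, N):
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return [dS, dI, dR]

N = 259                         # Savoonga's 1957 population
t = np.linspace(0, 100, 101)    # days; the time span is illustrative
S, I, R = odeint(sir, [N - 1, 1, 0], t, args=(1.1, 0.75, N)).T
I_cum = N - S                   # cumulative infections at each time point
```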
In this case it is simple to derive an analogous quantity from the model, but in cases with more compartments, or with re-infection, it can be more complicated.
In this lecture we will look at the epidemic in Savoonga.
Our goal is to find values for \(\beta\) and \(\gamma\) that produce an \(I^{cum}_t\) curve that is as close as possible to the actual data in Figure 4.8.
To compare candidate parameter values, we need some measure of the goodness-of-fit of a model curve to the data. The one we will use is the root mean squared error, RMSE.
Definition 4.1 The root mean squared error is defined as \[ RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}{\left(\hat{y}_i - y_i\right)^2}}. \] The \(y_i\), for \(1\leq{i}\leq{n}\), are the observed values, and \(n\) is therefore the number of observed values. The \(\hat{y}_i\) are the corresponding model predictions. As its name suggests, the RMSE gives the square root of the mean of the squared errors between model and data, and is on the same scale as the data. The smaller the RMSE, the better the fit of the model to the data.
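A direct implementation of Definition 4.1 (sketched in Python for illustration):

```python
import numpy as np

# Root mean squared error between model predictions y_hat and observations y.
def rmse(y_hat, y):
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return np.sqrt(np.mean((y_hat - y) ** 2))

# A perfect fit gives RMSE = 0; a constant error of size 1 gives RMSE = 1.
print(rmse([1, 2, 3], [1, 2, 3]))
print(rmse([2, 3, 4], [1, 2, 3]))
```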
We can demonstrate this by choosing some values for \(\beta\) and \(\gamma\), say \(\beta=1.1,\,\gamma=0.75\), and comparing the resulting model curve to the Savoonga data.
Figure 4.9 shows the model’s \(I^{cum}_t\) population curve in black, with red lines to show the distance between each observed value and the model prediction. These are the \(\left(\hat{y}_i - y_i\right)\) in the RMSE definition. This particular pair of parameter values appears to give a fairly poor fit, with RMSE = 54.71, rather large in relation to the observed values.
By generating a selection of parameter values that cover the space well, we can get some idea as to which regions of parameter space might give a better fit. Figure 4.10 shows 1000 different points in our SIR model’s parameter space. For each one, the model has been generated and the output (specifically \(I^{cum}_t\)) compared to the Savoonga data. For each point, we can therefore calculate an RMSE value, and this is shown by the colour. We can see that the lowest RMSE is now much lower than that in Figure 4.9.
## [1] "Smallest RMSE is 12.4548 at beta=0.26039, gamma=0.155666"
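The procedure behind this result can be sketched as follows (in Python, with synthetic ‘observations’ generated from known parameters in place of the Savoonga data, so we know where the search should end up): sample many \(\left(\beta,\,\gamma\right)\) pairs, run the model at each, and score the cumulative-infection curve by RMSE.

```python
import numpy as np
from scipy.integrate import odeint

# Cumulative infections N - S(t) from a simple SIR model.
def i_cum(beta, gamma, N, t):
    def sir(y, t):
        S, I, R = y
        return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]
    return N - odeint(sir, [N - 1, 1, 0], t)[:, 0]

N, t = 259, np.linspace(0, 70, 11)
data = i_cum(0.5, 0.2, N, t)       # synthetic "observations", truth known

# Score 1000 uniformly sampled parameter points by RMSE against the data.
rng = np.random.default_rng(0)
samples = rng.uniform([0.01, 0.01], [1.5, 1.0], size=(1000, 2))
scores = [np.sqrt(np.mean((i_cum(b, g, N, t) - data) ** 2))
          for b, g in samples]
best_beta, best_gamma = samples[int(np.argmin(scores))]
print(best_beta, best_gamma, min(scores))
```

Plain uniform sampling is used here for brevity; the lecture's maximin LHD would cover the space more evenly for the same number of points.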
Notice that the best pairs of values for \(\beta\) and \(\gamma\) approximately lie on a straight line through \(\beta=0,\,\gamma=0\), representing the same \(R_0\) value (remember that for this model \(R_0= \frac{\beta}{\gamma}\)). We can plot the \(I^{cum}_t\) output for the SIR model with these parameters, to see the fit.
This is still not necessarily the best possible fit; it is just the best of the 1,000 candidate models we generated. From the plot we can see that most of the region can be ruled out, and we can ‘zoom in’ on the part that seems most plausible.
Based on the best value from our coarse experiment in Figure 4.11, we will refocus on the region \(\beta \in [0.23,\,0.3],\;\gamma \in [0.125,\,0.2]\).
## [1] "Smallest RMSE is 5.14844 at beta=0.24675, gamma=0.153755"
We can plot the \(I^{cum}_t\) output for the SIR model with these parameters, to see the fit.
If we wanted to, we could repeat this process to gain an even better fit.
This process is subject to many uncertainties, errors and problems. Here are a few:
Instead of evaluating a function (in this case the SIR model and RMSE calculation) hundreds or thousands of times, we can use a step-by-step approach, employing some sort of strategy to select our next points to run. There are many different optimisation algorithms, but the one we will use is Nelder-Mead. This is rarely the fastest method, but it is relatively robust and intuitive, and does not require information about the gradient of the function, which many methods do.
Optimisation methods require a function
\[ f: \mathbb{R}^n \rightarrow{\mathbb{R}}.\] That is, a function whose input can (and usually will) be multivariate, but whose output is univariate - we need a single valued summary of each parameter point (measuring ‘how good it is’ in some sense) so that we can arrive at the optimal point. In the case of our SIR mumps example, the input space is \(\left(\beta,\,\gamma\right)\) and the output space is the RMSE value of the SIR output compared to the observed infection data.
Nelder-Mead optimisation starts with a randomly generated \(n\)-simplex in \(\mathbb{R}^n\). An \(n\)-simplex is the simplest possible shape spanning all \(n\) dimensions, and has \(n+1\) vertices. A 2-simplex (ie. in \(\mathbb{R}^2\)) is a triangle, a 3-simplex is a tetrahedron, and so on.
The function is evaluated at each vertex, and the worst (\(x_h\), the maximum), second worst (\(x_s\)) and best (\(x_l\), the minimum) vertices are labelled. The following changes are proposed in order:

1. Reflection: reflect the worst vertex \(x_h\) through the centroid of the remaining vertices.
2. Expansion: if the reflected point is the best seen so far, try moving it further in the same direction.
3. Contraction: if the reflected point is no better than \(x_s\), try a point closer to the simplex instead.
4. Shrink: if the contracted point is still no improvement, shrink the whole simplex towards the best vertex \(x_l\).
With the four instructions above, the algorithm would continue indefinitely; it needs to know when to stop and return the ‘optimal’ point. In general the Nelder-Mead algorithm terminates when either the vertices (or their function values) become sufficiently close together, or a maximum number of iterations or function evaluations is reached.
This video demonstrates how the simplex moves around the space until it finds the optimal point.
Continuing with our mumps example, we can use optim to find the point in parameter space yielding the smallest value of RMSE (up to some tolerance). Figure 5.5 shows the RMSE over the same region as before by colour, with the sequence of points resulting from the Nelder-Mead algorithm shown in orange. The starting point is \(\beta=0.6,\,\gamma=0.6\). The number of function evaluations reported by the optimisation function is 57, so this is much more efficient than the Monte Carlo method.
The point found by Nelder-Mead is slightly better than the result from the zoomed-in Latin hypercube method, since it is not constrained to a set of randomly generated points. As before, we can plot the fitted cumulative \(I\) curve against the mumps outbreak data from Savoonga (see Figure 5.6).
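The whole optimisation step can be sketched as follows (in Python, using scipy's Nelder-Mead implementation rather than R's optim, and with synthetic data generated from known parameters in place of the Savoonga counts): wrap the SIR model and RMSE calculation into a single function \(f(\beta,\,\gamma) \rightarrow \mathbb{R}\), then minimise it.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

# Cumulative infections N - S(t) from a simple SIR model.
def i_cum(beta, gamma, N, t):
    def sir(y, t):
        S, I, R = y
        return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]
    return N - odeint(sir, [N - 1, 1, 0], t)[:, 0]

N, t = 259, np.linspace(0, 70, 11)
data = i_cum(0.5, 0.2, N, t)     # synthetic "data" from known parameters

# The objective: RMSE of the model's cumulative curve against the data.
def objective(params):
    beta, gamma = params
    return np.sqrt(np.mean((i_cum(beta, gamma, N, t) - data) ** 2))

# Nelder-Mead needs only function evaluations, no gradients.
result = minimize(objective, x0=[0.8, 0.4], method="Nelder-Mead")
print(result.x, result.fun, result.nfev)
```

Because the data here were generated by the same model, the optimiser should drive the RMSE close to zero and recover values near \(\beta=0.5,\,\gamma=0.2\).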
In this lecture we have thought about using observed data with SIR-type models. In particular
In the early stages of an epidemic (or using data from the early stages only) we can estimate the growth rate \(\Lambda\), and use this to estimate the basic reproduction number \(R_0\).
We can calibrate the parameters of an SIR model to observed data once an epidemic has passed. We have done this in two ways: by running the model over a space-filling (maximin latin hypercube) sample of parameter values and comparing RMSE scores, and by using the Nelder-Mead optimisation algorithm.
In the next workshop you will be able to explore these methods in more detail.