Models and Methods in Health Data Science

Lectures 1 & 2: Introduction to compartmental infectious disease models

Rachel Oughton

13/02/2023

1 Overview

In this lecture we will cover:

Much of the material in this section of the course is built on (Bjørnstad 2018) and (Vynnycky and White 2010), which are good references if you’d like to know more (see the References for details).

2 Modelling infectious diseases

In this section:

What is a model?

Before we think about modelling infectious diseases, we need to think about models.

We are familiar with many sorts of model:

All have a similar purpose: to display the key attributes of the thing they are modelling.

A scale model building gives us an idea of the layout, proportions and styling but will be lacking many attributes of the true thing.

For example:

Likewise, a mathematical model of an infectious disease will [attempt to] replicate some of the features of a disease, but not all of them.

Modelling infectious diseases

What do we want the model to tell us?


Features of the disease we need to consider:


We are unlikely to be interested in things like symptoms.

2.1 Stages of an infectious disease

Important stages in an infection

The progress of an infectious disease in a patient. From @arzt2019quantitative

The progress of an infectious disease in a patient. From Arzt et al. (2019)

We distinguish between

2.1.1 Secondary infections

During the infectious period, the patient can go on to infect one or more secondary cases.

The serial interval, \(T_S\) is the time interval from the onset of a primary case to the onset of a secondary case (one who was infected directly by the primary case).

The serial interval is:

We are concerned only with whether or not the person is infectious, not with whether they have symptoms.

2.2 Trajectory

Most epidemiological models think of each person as belonging to a particular state (or compartment, as we will soon call it).

For example, a person can be susceptible, infectious, recovered and so on.

When modelling any disease, it is important to consider

  1. What are the states a person can be in?
  2. Which of these states are important for understanding how the disease spreads?
  3. What are the possible pathways through those states?

The main states we will encounter in our models are:

A person must be in one (and only one) state at a time

2.2.1 Model diagrams

We use these states to label models.

In an SIR model a person starts as susceptible, can become infectious and subsequently recover with permanent immunity.




In an SEIRS model there is a latent period following infection (during which a person is in the ‘E’ state), and after recovering people can be reinfected

Important questions

When thinking about the structure of a model, it is important to ask questions like:

2.3 Transmission

An important feature of any disease model is some measure of how easily the disease is transmitted from one person to another.


There are several measures of transmissibility:

2.3.1 Secondary attack rate

This is perhaps the most accurate, but not the most widely used.

The secondary attack rate is the proportion of susceptibles in contact with a primary case (for example those in the same house or office) who are then infected.

2.3.1.1 Example

Suppose two of these become infected, then we estimate the secondary attack rate to be \[\frac{2}{6} = \frac{1}{3}.\]

2.3.2 Reproduction number

The reproduction number, \(R\) is defined as the (average) number of successful direct transmissions per infectious person.

This is the number we came to know as “the R number” in Covid times.


There are two widely used versions of the reproduction number.

  1. The basic reproduction number (\(R_0\))
  2. The effective/net reproduction number (\(R_n\))

The basic reproduction number (\(R_0\))

\(R_0\) is the basic reproduction number.

This is the expected number of secondary cases arising from an infectious person in an otherwise susceptible population.

The basic reproduction number illustrated, with \(R_0 = 2\). From The Conversation.

The effective reproduction number (\(R_n\))

However: The proportion of susceptibles decreases throughout the epidemic (and may not be everyone to start with), so the number of secondary infections from each infectious person will be less than \(R_0\).

\(R_n\) is the effective reproduction number (or sometimes net reproduction number).

\(R_n\) is calculated as \[\begin{equation} R_n = R_0 \times{s}, \end{equation}\] where \(s\) is the proportion of the population that is susceptible.

The effective reproduction number (\(R_n\))

The effective reproduction number illustrated if \(R_0=2\) and \(s=0.5\), therefore \(R_n = 1\). Adapted from The Conversation.”

Reproduction number = 1


An outbreak will only occur if the number of infectious people grows from the small group that started it.

The reproduction number needs to be greater than one for the epidemic to continue growing

2.3.3 Herd immunity threshold (HIT)

Idea: Use the \(R_n\) formula to find the proportion of the population that needs to be immune (ie. not susceptible) to achieve \(R_n<1\).

We have \(R_n = R_0\times{s}.\) So, for \(R_n=1\) we need \[s = \frac{1}{R_0}\] Then, the proportion who are immune, and therefore the herd-immunity threshold (HIT) is \[HIT = 1-s = 1-\frac{1}{R_0} = \frac{R_0-1}{R_0}.\]

The Herd Immunity Threshold can be used

For example (from Vynnycky and White 2010)

This means that when \(s<0.5\) for Influenza, or \(s<0.08\) for Measles, \(R_n\approx{1}\) and the infection should start to die out.

2.4 General modelling issues

2.4.1 Modelling vocabulary quiz

Words Definitions
1. Rate A. A description of a system using mathematical rules.
2. Risk B. The whole number of individuals in a mathematical model.
3. Population C. The measure of frequency of some event for a given number of individuals.
4. Model D. The chance/probability of a particular event happening.

2.5 Question: which features are necessary?

Modern computers allow for complex models. But it’s harder to learn from more complex models.

We will look into some of these this week.

3 The SIR model

In this section:

3.1 Compartmental models

Compartmental models are a widely used class of computer model (not just for epidemiology)

In epidemiology we often name a model by the trajectory of states, eg. ‘SIR model’, ‘SEIRS model’.

We will look at the SIR model to understand how this works.

3.2 The SIR model

The SIR model was developed in the early 20th Century to model how a disease would spread through a population of susceptible people.

The three compartments are:

The total number of people remains constant at \[N= S + I + R,\] and therefore the model is said to be closed.


The model starts with everyone Susceptible apart from a small number (often one) who are Infectious.

3.3 Difference equations

Key Question: Given the number of people in each compartment today, can we calculate how many will be in each compartment tomorrow?

Difference equations relate the number of people in a compartment at time \(t\) to the number in each compartment at time \(t-1\) (the previous time).

Notation

So for example

Possible transfers between compartments

We have: \[\begin{align*} S_{t} & = S_{\left(t-1\right)} - \left(\text{# newly infected at time }\left(t-1\right)\right)\\ I_{t} & = I_{\left(t-1\right)} + \left(\text{# newly infected at time }\left(t-1\right)\right) - \left(\text{# newly removed at time }\left(t-1\right)\right)\\ R_{t} & = R_{\left(t-1\right)} + \left(\text{# newly removed at time }\left(t-1\right)\right) \end{align*}\]

The population of each compartment at time \(t\) is equal to the population at time \(\left(t-1\right)\)

In a closed epidemic (ie. one with no births, deaths or migrations) we should have \[ S_{\left(t-1\right)} + I_{\left(t-1\right)} + R_{\left(t-1\right)} = S_t + I_t + R_t = N\] for all \(t\), and it is clear here that this is the case.

Parameters for difference equations

We have: \[\begin{align*} S_{t} & = S_{\left(t-1\right)} - \left(\text{# newly infected at time }\left(t-1\right)\right)\\ I_{t} & = I_{\left(t-1\right)} + \left(\text{# newly infected at time }\left(t-1\right)\right) - \left(\text{# newly removed at time }\left(t-1\right)\right)\\ R_{t} & = R_{\left(t-1\right)} + \left(\text{# newly removed at time }\left(t-1\right)\right) \end{align*}\]

In this model there are therefore two quantities to think about:

In this course we will follow the conventions of Kermack and McKendrick (1927), which are still commonly in use.

3.3.1 Newly infected

The number of people newly infected at each time step will depend on

Model parameter \(\beta\) (Beta)

\(\beta\) is the per capita transmission rate. It is often broken down as

\[ \beta = \text{contact rate} \times \text{transmission probability}. \] If an infectious person has contact with 6 [susceptible] people per day, and the probability of transmitting the disease [to a susceptible person] is 0.4, they will infect 2.4 people per day.

We could think of \(\beta\) in this case as \[ \beta = 6 \times{0.4} = 2.4.\]

Note: We’ve written ‘day’, but really what we mean is ‘time unit’

Effective contact

Contact here is used to mean effective contact - a level of contact which would be sufficient to lead to infection.

The actual level of contact this means depends on the mode of transmission (for example respiratory / vector-borne / faecal-oral / sexual etc.).

The number of newly infected people at each time step

We can use \(\beta\) to model this quantity

Those people moving from \(S\) (susceptible) to \(I\) (infectious) can be broken down as \[I \times{\beta} \times{\frac{S}{N}}= \frac{\beta I S}{N}. \]

The Mass Action principle

An important point here is the mass action principle.


In a compartmental model, we assume

There is no real concept of ‘individuals’, or of some people being more proximal to one another.

This is an assumption we will return to later, when we look at Agent Based Models.

3.3.2 Newly recovered

The number of people recovering at each time step depends on

Model parameter \(\gamma\) (‘Gamma’), the recovery rate

This is the proportion of infectious people who will recover at each time step.

The number of newly recovered people at each time step

At each time-step, \(\gamma \times{I}\) people will move move from \(I\) (infectious) to \(R\) (removed)

The period of time for which a person is infectious is therefore approximately \(\frac{1}{\gamma}\).

3.3.3 Difference equations specified

Our difference equations therefore become

\[\begin{align*} S_{t} & = S_{\left(t-1\right)} - \frac{\beta I_{\left(t-1\right)} S_{\left(t-1\right)}}{N}\\ I_{t} & = I_{\left(t-1\right)} + \frac{\beta I_{\left(t-1\right)}S_{\left(t-1\right)}}{N} - \gamma I_{\left(t-1\right)}\\ R_{t} & = R_{\left(t-1\right)} + \gamma I_{\left(t-1\right)} \end{align*}\]

Given

we can compute the trajectory of an epidemic (according to this model).

Exercise

Talk about this with the people around you

Think of the interventions used during the Covid-19 pandemic.
Which model parameters would they affect, and how?


Reminder:

3.3.4 A note on risks and rates

When we think about events happening, we are really thinking about risk.

There is uncertainty about if/when people will get infected or recover.

However, in compartmental models we work in terms of rates.

We assume that individuals become infectious at a constant rate, and that once infected they recover at a constant rate.

Working with rates

Convenient:

\[ \scriptstyle \text{Rate of occurence} = \frac{1}{\text{Average time to event}}. \]

Unrealistic:

Implies that the event duration is exponentially distributed:

  • Memoryless property
  • Unrealistic shape
Distribution of the time between becoming infectious and recovery, assuming that infectious individuals recover at a rate of 0.3 per day.

Distribution of the time between becoming infectious and recovery, assuming that infectious individuals recover at a rate of 0.3 per day.

3.3.5 Reproduction numbers revisited

From our SIR model we can calculate a theoretical value for the basic reproduction number \(R_0\).

An epidemic grows if the number of infected individuals is increasing, that is if \[ \frac{\beta I_{\left(t-1\right)}S_{\left(t-1\right)}}{N} - \gamma I_{\left(t-1\right)} = \frac{\beta S_{\left(t-1\right)}}{N} - \gamma > 0\] assuming \(I_{\left(t-1\right)}>0\).

We can rearrange this as \[ \frac{\beta S_{\left(t-1\right)}}{N} > \gamma \] or \[ \frac{\beta S_{\left(t-1\right)}}{\gamma N} > 1. \]

Reproduction numbers revisited

At the start, we have \(S_{\left(t-1\right)}\approx{N}\) because almost everyone is still susceptible, so \[\frac{S_{\left(t-1\right)}}{N}\approx{1}\] and therefore an epidemic occurs if \[\frac{\beta}{\gamma}>1 \] and we have (for this model set-up) \(R_0=\frac{\beta}{\gamma}\).

Example: SIR

Suppose we have a population of \(N=1000\) people, of which one is infectious with some disease, and none have immunity. This means we have \[ \begin{align*} S_1 & = 999 \\ I_1 &= 1 \\ R_1 &= 0 \end{align*}\] Assume that each person is in effective contact with 5 people per day, and that the probability of an infectious person passing the disease on to a susceptible person when in effective contact is 0.25.

With this disease, an infectious person is infectious for 2 days (on average). This means that \[ \begin{align*} \beta &= 5 \times{0.25} = 1.25 \\ \gamma &= \frac{1}{2} \end{align*}\]

Example continued

To calculate the compartment populations at \(t=2\) therefore, we have

\[ \begin{align*} S_{2} &= S_1 -\frac{\beta I_1 S_1}{N} & =& 999 -\frac{1.25 \times{1}\times{999}}{1000} & = 997.8 \text{ (3 s.f.)}\\ I_2 &=I_1 + \frac{\beta I_1 S_1}{N} - \gamma I_1 & = &1+ \frac{1.25 \times{1}\times{999}}{1000} - \frac{1}{2}\times{1} &= 1.75\text{ (3 s.f.)} \\ R_2 &=R_1+ \gamma I_1 & =&0+ \frac{1}{2}\times{1} & = 0.5 \text{ (3 s.f.)} \end{align*} \]

Exercise

Programming SIR models

We can program these equations in R (or another language) and plot the progress of an epidemic

The example above continued for 60 days. Points show the calculated values, the lines interpolate these points to make the paths clearer.

The example above continued for 60 days. Points show the calculated values, the lines interpolate these points to make the paths clearer.

3.3.6 Final epidemic size

The total number of people infected over the course of the epidemic is known as the final epidemic size.


A closed system (where the total number of people, \(N\), stays the same) will tend towards \(\lbrace S^*, I^*, R^*\rbrace\), where

The final epidemic size is \(R^*\), the total number of people who contracted the disease, and \(S^*\) is the number who escaped it.

In the example, we can see that \(S^*=70, R^*=930\).

3.3.7 Example: Reproduction numbers

Having calculated \(R_0 = 2.5\), we can now calculate \(R_n\) using the proportion of susceptibles at each time point: \[R_n\left(t\right) = \frac{S_t}{N}R_0\]

The values of $R_n$ and $R_0$ for the model described above.

The values of \(R_n\) and \(R_0\) for the model described above.


Notice that the point at which \(R_n\approx{1}\) corresponds to the point at which the number of infectious people begins to decrease, around day 13.

Instability!

Unfortunately, this discrete time-step approach is sensitive and can be unstable, especially for

Changing time-step in our example

We will re-run our example using 6-hourly and 3-daily time steps.

This means adjusting our parameters to have the same overall meaning:

\[\begin{align*} \beta_{6\text{hourly}} &= 0.25\times{\beta_{\text{daily}}}\\ \gamma_{6\text{hourly}} &=0.25\times\gamma_{\text{daily}} \end{align*}\]

and similarly \[\begin{align*} \beta_{3\text{daily}} &= 3\times{\beta_{\text{daily}}}\\ \gamma_{3\text{daily}} &=3\times\gamma_{\text{daily}} \end{align*}\]

Example: changing time-step

Our SIR model with three different timesteps.

Our SIR model with three different timesteps.

The shorter the timestep, the smoother the curves and the less erratic the behaviour.
In particular, with a 3 day timestep we have some negative \(I\) values and some \(R\) values greater than \(N\).

4 From difference equations to differential equations

Because the disease is progressing continuously (rather than in discrete steps) our model needs to work continuously.


In this section

4.1 Thinking continuously

For a continuous model, it is sometimes impossible to write down a formula for the populations of the compartments at a particular time point

We therefore think in terms of the rate of change in each compartment over time.

Think first of the change in compartment populations from time \(t\) to time \(t+1\)

For example, \[ \frac{I_{\left(t+1\right)} - I_{t}}{\left(t+1\right) - t} = I_{\left(t+1\right)} - I_{t}, \] This is basically a gradient - how the population of \(I\) changes from one time step to the next.

Thinking continuously

We can do this for some amount of time \(\delta t\) \[ \frac{I_{\left(t+\delta t\right)} - I_{t}}{\left(t+\delta t\right) - t}=\frac{I_{\left(t+\delta t\right)} - I_{t}}{\delta t}, \] where \(\delta t\) represents some very small (but non-zero) time interval.

This doesn’t quite make sense, since we are no longer thinking in discrete time steps, so we change to continuous function notation:

\[ \frac{I\left(t+\delta t\right) - I\left(t\right)}{\left(t+\delta t\right) - t} = \frac{I\left(t+\delta t\right) - I\left(t\right)}{\delta t}. \]

Shrinking the time-step

The smaller the time-step \(\delta t\), the more accurate our gradient \[\frac{I\left(t+\delta t\right) - I\left(t\right)}{\delta t}\] becomes.

Thinking continuously

We would like the gradient to change continuously as the function \(I\left(t\right)\) changes, so we think about the limit as \(\delta t\) “tends to zero”: \[ \frac{dI\left(t\right)}{dt} = \lim\limits_{\delta t \rightarrow{0}}{\frac{I\left(t+\delta t\right) - I\left(t\right)}{\delta t}}. \] We call this ‘the derivative of \(I\left(t\right)\) with respect to \(t\)’, and often write it \[\frac{dI}{dt}.\]

4.1.1 Continuous SIR model

We can now specify our SIR model as a system of ordinary differential equations

These equate the rates of change of the compartments to the populations of the compartments as follows:

\[ \begin{align*} \frac{dS}{dt} &= -\frac{\beta I S}{N}\\ \frac{dI}{dt} &= \frac{\beta I S}{N} - \gamma I\\ \frac{dR}{dt} &= \gamma I \end{align*}. \]

Notice that we have dropped the \(\left(t\right)\) notation, but \(S,\,I\) and \(R\) are each still functions of \(t\).

4.2 Solving the ODEs in R

The differential equations need to be solved for us to calculate \(S\left(t\right),\,I\left(t\right)\) and \(R\left(t\right)\).

To set up an SIR model in R, there are five steps:

  1. Define an R function containing the general set of equations, those that govern the flow of people from one compartment to another.
  2. Specify the time points for which we want R to output a solution.
  3. Supply values for the parameters (for example \(\beta\) and \(\gamma\)).
  4. Supply initial values for the populations of each compartment.
  5. Call the function that will put all these together.

There are several ways to solve systems of ordinary differential equations in R, but we will mainly use the package deSolve.

4.2.1 Defining the set of equations

# Lines like this one, with a hash at the start, are comments

library(deSolve)
sir_model = function (t, y, parms)
{
  # label state variables from within the y vector of initial values
  S = y[1]        # susceptibles
  I = y[2]        # infectious
  R = y[3]        # recovered
  N = S + I + R
  
  # Pull parameter values from the parms vector
  beta = parms["beta"]
  gamma = parms["gamma"]

  # Define equations (dS etc. are simplified notation for dS/dt)
  dS = -(beta*S*I)/N
  dI = (beta*S*I)/N - gamma*I
  dR = gamma * I

  # return list of gradients   
  res = c(dS, dI, dR)
  list(res)
}

4.2.2 Specifying time points

We must also specify the time points for which we would like the model to return output, for example:

timepoints = seq (0, 50, by=0.5)
timepoints
##   [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
##  [16]  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
##  [31] 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0
##  [46] 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5
##  [61] 30.0 30.5 31.0 31.5 32.0 32.5 33.0 33.5 34.0 34.5 35.0 35.5 36.0 36.5 37.0
##  [76] 37.5 38.0 38.5 39.0 39.5 40.0 40.5 41.0 41.5 42.0 42.5 43.0 43.5 44.0 44.5
##  [91] 45.0 45.5 46.0 46.5 47.0 47.5 48.0 48.5 49.0 49.5 50.0

Whatever time unit we think in terms of while defining our parameters, that is our time unit.

Note: Unlike in the discrete case, this has nothing to do with the resolution at which R calculates the output.

4.2.3 Parameter Values

We must supply values for parms

For example, to recreate our example we’d put

params_vec = c(
  beta = 1.25, 
  gamma = 1/2
  )
params_vec
##  beta gamma 
##  1.25  0.50

Because we supply a named vector (with the names matching the parameters listed in sir_model) the function knows how to use each value.

4.2.4 Initial values

We supply starting values for each compartment, for example:

initial_values = c (S = 999, I = 1, R = 0)

This is to be used as the y argument to the function sir_model.

Again, the elements of the vector are named with the same names as those used in sir_model, so that their intended use is clear.

4.2.5 Putting it all together

Finally, we supply all this information to the function ode:

output1 = ode (y=initial_values, times = timepoints, func = sir_model, parms= params_vec)
output1 = as.data.frame(output1)
head(output1)
##   time        S        I         R
## 1  0.0 999.0000 1.000000 0.0000000
## 2  0.5 998.2430 1.453761 0.3032041
## 3  1.0 997.1439 2.112208 0.7438629
## 4  1.5 995.5498 3.066331 1.3838414
## 5  2.0 993.2416 4.446083 2.3123499
## 6  2.5 989.9071 6.435446 3.6574847

The object output1 is then a matrix (which we have now turned into a data.frame) with four columns: time, S, I and R, showing the populations in each compartment at each time step in timepoints.

Reproduction numbers

Using the parameters we have specified we can also calculate \(R_0\), and then \(R_n\) at each time point:

# calculate R0:
R0 = params_vec['beta']/params_vec['gamma']

# calculate total number of people from initial values
pop_total = sum(initial_values)

# Calculate Rn and add as a column of output1
output1$Rn = R0*(1/pop_total)*output1$S

head(output1)
##   time        S        I         R       Rn
## 1  0.0 999.0000 1.000000 0.0000000 2.497500
## 2  0.5 998.2430 1.453761 0.3032041 2.495608
## 3  1.0 997.1439 2.112208 0.7438629 2.492860
## 4  1.5 995.5498 3.066331 1.3838414 2.488875
## 5  2.0 993.2416 4.446083 2.3123499 2.483104
## 6  2.5 989.9071 6.435446 3.6574847 2.474768

Changing to long format

It is often simpler to work with the data in ‘long’ format (more rows, fewer columns), so we write a plotting function which pivots the data frame to be ‘long’ before plotting the data using the package ggplot2 (Wickham 2016):

plot_sir = function(
  sir_df, 
  Compartments = c("S", "I", "R") # default
){
# Collect the values in columns S, I and R (or those given as "Compartments")
#  into one column named 'value', and create an additional column named 
# "Compartment".
    output_tidy = pivot_longer(sir_df, cols = all_of(Compartments), names_to = "Compartment")
  # Plot the values, with a different coloured line for each compartment
    ggplot(data=output_tidy, aes(x=time, y=value, col=Compartment)) + 
      geom_line() + ylab("People")  
  }

There is a new ‘Compartment’ column, and the populations are in the ‘value’ column.

This is much simpler to plot using the package ggplot2 (Wickham 2016):

The continuous model

plot_sir(output1, Compartments= c("S", "I", "R"))

4.3 Assumptions

You may have been thinking that this model wouldn’t accurately represent many real-life infectious diseases, and you would be correct.

Some of the assumptions / simplifications that we will go on to investigate are:

5 Summary

We’ve covered a lot of ground in these first two lectures.

The main points to take away are:

6 Some further reading

There are many, many articles about SIR models (especially since Covid-19!) so here are some that may be of interest:

7 References

Arzt, Jonathan, Matthew A Branan, Amy H Delgado, Shankar Yadav, Karla I Moreno-Torres, Michael J Tildesley, and Carolina Stenfeldt. 2019. “Quantitative Impacts of Incubation Phase Transmission of Foot-and-Mouth Disease Virus.” Scientific Reports 9 (1): 1–13.
Bjørnstad, O. N. 2018. Epidemics: Models and Data Using r. Use r! Springer International Publishing. https://books.google.co.uk/books?id=4sJ1DwAAQBAJ.
Brauer, Fred. 2017. “Mathematical Epidemiology: Past, Present, and Future.” Infectious Disease Modelling 2 (2): 113–27.
Kermack, William Ogilvy, and Anderson G McKendrick. 1927. “A Contribution to the Mathematical Theory of Epidemics.” Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character 115 (772): 700–721.
Lazzari, Gianrocco, Giovanni Colavizza, Fabio Bortoluzzi, Davide Drago, Andrea Erboso, Francesca Zugno, Frédéric Kaplan, and Marcel Salathé. 2020. “A Digital Reconstruction of the 1630–1631 Large Plague Outbreak in Venice.” Scientific Reports 10 (1): 1–7.
Vynnycky, E., and R. White. 2010. An Introduction to Infectious Disease Modelling. OUP Oxford. https://books.google.co.uk/books?id=8iVz3NHLnYUC.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.