1 Overview

In this lecture we will continue to think about some issues in health economics:

  • Decision theory
  • Tests and diagnosis

2 Decision Theory

An important aspect of health economics problems is that a decision needs to be made in the presence of uncertainty. This is often done in the framework of decision theory. In the rest of this lecture, we will introduce decision theory and explore some practical examples. This lecture is in part based on content from Briggs, Claxton, and Sculpher (2006), which is a good reference for this material.

A decision analytic model uses a set of mathematical relationships to define a series of possible consequences resulting from a set of alternative options. As well as the decisions we make, there are events that are random, and we incorporate these using probability. Each decision we make is associated with a cost. By combining

  • the decisions made,
  • the cost of each decision made,
  • the probabilities of random events and
  • an outcome for each combination of decisions and events,

we can calculate the expected outcome and cost associated with each set of decisions. In general in decision theory we express the outcome in terms of utility, but in health economics we will usually use QALYs. You can read about the link between utility and the QALY in Whitehead and Ali (2010).

In health economics, the decisions relate to resource allocation. For example, should a new drug for asthma be funded? What is the most cost-effective diagnostic strategy for urinary tract infections in children?

An important part of decision theory is that it incorporates the uncertainty inherent in any decision. For example, apparently similar patients will respond differently to the same treatment, with no way to anticipate which would work. Grouping the possible responses to treatment is an important part of this, for example one might have ‘Response’ and ‘No response’, or more detail might be needed.

In health economic decision analysis there are two basic groups of questions:

  1. Should this treatment / technology / intervention etc. be adopted, given the existing evidence and the uncertainty surrounding its outcomes? If so, which strategy should be adopted, and for which cohort(s)?
  2. Is more evidence / information required before question 1 can be answered?

We will look at some aspects of each of these.

2.1 Summary

Decision theory addresses the question “How should one decide what to do when things are uncertain?”. In decision theory, we combine all the ingredients of the decision problem into a formal framework and, using mathematical rules, find the optimal decision. We can also use decision theory to answer questions like “How much should we be willing to pay for more information?”

2.2 The ingredients of a decision problem

So, what are the ingredients of a decision problem? We will illustrate this throughout with the simple example of a lottery ticket.

2.2.1 Decisions

The first task is to list all the possible decisions you can make. We could write these as \(d_1,\ldots,d_k\). For example, let’s say you’re deciding whether to buy a lottery ticket. Then we might have \[ \begin{align*} d_1: & \text{ Buy a ticket} \\ d_2: & \text{ Don't buy a ticket} \end{align*} \]

In this example, there is only one ‘phase’ of decision-making, but in many problems there might be several points at which a decision needs to be made, and options might depend on what has happened before.

2.2.2 Events

The second set of ingredients is the events. We use the word ‘events’ in the probabilistic sense, to mean the set of possible things that might happen after you have made your decision (or between subsequent decisions, if there are multiple phases). We often label these \(E_1,\ldots,E_m\). When listing the possible events, it is important to make sure that exactly one of them will definitely happen.

In our example about buying a lottery ticket, we might have:

\[ \begin{align*} E_1: & \text{ You win the lottery} \\ E_2: & \text{ You don't win the lottery} \end{align*} \]

The key thing is that at the point of making the decision, we don’t know which of these events will occur.

2.2.2.1 Uncertainties

As well as the collection of events that might occur, we also need to include the probability of each event. Remember that all probabilities must be between zero and one. An event with probability zero is impossible. An event with probability one is certain.

These probabilities will often be formed subjectively, based on all the information available. The probabilities may also depend on what decisions have been taken and any events that have already happened.

Here, we might have:

\[ \begin{align*} p\left(E_1\mid{d_1}\right) & = 0.0001 \text{ (if you bought a ticket)} \\ p\left(E_1\mid{d_2}\right) & = 0 \text{ (if you didn't buy a ticket)} \end{align*} \]

The vertical line here is used to denote conditional probability. For example, \(p\left(E_1\mid{d_2}\right)\) is the probability of \(E_1\) happening (you win the lottery) given that \(d_2\) (you don’t buy a ticket) has already happened.

2.2.3 Rewards / payoffs

Finally, we consider the set of rewards (or payoffs): the consequences following each combination of decisions and events. Note that although these are called ‘rewards’, they can sometimes be bad!

To denote the reward you receive if you chose decision \(d_i\) and then event \(E_j\) happened, we write \(r\left(d_i,\,E_j\right)\). For example, using our ‘lottery’ set-up, \(r\left(d_1,\,E_2\right)\) is the ‘reward’ we receive if we buy a lottery ticket and don’t win.

2.2.4 Costs

Linked to rewards is the idea of ‘costs’. Usually, each decision will incur some sort of cost. Often this will be monetary, but it could be more general (for example the inconvenience or discomfort). This is where the topic of utility is important, especially if there are costs of different kinds.

In our example let’s say the cost of buying a lottery ticket is £1, and the reward if you win is £500.

2.3 The decision tree

Now that we have all the ingredients, we can put them together into a decision tree. A decision tree is made of nodes and branches, arranged to show

  • the sequence of what could happen
  • the outcomes of each sequence (in terms of cost and reward)
  • the probability of each sequence

There are two types of node: decision nodes (where a decision has to be made) and chance nodes (where an event will occur). At each node there are then a number of branches, depending on the number of possible events or decisions. Finally, for any combination of decisions and events, we end up at a reward / outcome. It is important that the pathways are mutually exclusive.

Figure 2.1: The decision tree for our lottery example.

For example, in the tree above, if we decide to buy a lottery ticket, and that ticket wins, our outcome is £499. Note that in this case, the possible events (winning or not winning) depend on our decision to buy a ticket. Sometimes this will not be the case.

2.4 Solving the decision tree

The point of the decision tree was to help us to make a decision. So, how do we use it to do that?

The idea in solving a decision tree is that we combine the probabilities and outcomes to find the expected value of each decision. At each decision node, we can rule out the options with lower expected values, so that only one path remains. Of course, depending on what happens at the chance nodes, this sequence of decisions may not actually lead to a desirable outcome, but given the information we have it is the one with the best expected outcome.

So, how do we solve a decision tree?

2.4.1 Backwards induction

We construct the tree from left (the first decision node) to right (the final outcomes), but we solve it from right to left. We work through the tree, removing each node as we go, from right to left.

  • For each chance node, we calculate the expectation over its branches (using the probabilities and outcomes). We then write this value at the chance node and remove the part of the tree to its right.

  • For decision nodes, we choose the option that leads to the highest expected outcome, and cross out all other branches.

We continue these two steps until we have reached the root decision node, by which time we will have found

  • The optimal path of decisions to make (ie. those not crossed out)
  • The expected outcome of that path of decisions

We have already seen an example of this with the ‘Standard gamble’ method for eliciting HRQoL values.

2.4.2 Lottery example

For our lottery ticket example, we have one decision node (the root node), and one chance node. The chance node is the furthest to the right, so we will start there.

At the chance node, we have probability \(0.0001\) of an outcome of £499, and probability \(0.9999\) of an outcome of -£1. Therefore our expected outcome (in £) is

\[0.0001 \times{499} + 0.9999 \times{-1} = -0.95 \] Our tree therefore becomes

Figure 2.2: The decision tree for our lottery example, with the chance node solved.

We can now look at the two branches from the decision node. The expected outcome if we choose \(d_1\) is -£0.95, which is less than the expected value of \(d_2\) (£0). Therefore we cross out \(d_1\) and our optimal decision is not to buy a ticket.
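This backward-induction procedure can also be sketched in code. Below is a minimal, illustrative implementation (the node format and function names are my own invention, not a standard library), applied to the lottery tree:

```python
# A minimal backward-induction sketch for a decision tree.
# A node is a terminal payoff, a chance node, or a decision node.

def solve(node):
    """Return (expected value, chosen decision labels) for a tree node."""
    kind = node["type"]
    if kind == "payoff":
        return node["value"], []
    if kind == "chance":
        # Expectation over the branches, weighted by their probabilities.
        value = sum(p * solve(child)[0] for p, child in node["branches"])
        return value, []
    # Decision node: keep the branch with the highest expected value.
    best_label, best_child = max(
        node["options"].items(), key=lambda kv: solve(kv[1])[0]
    )
    value, tail = solve(best_child)
    return value, [best_label] + tail

# The lottery example: buy (win £499 w.p. 0.0001, lose £1 otherwise)
# versus don't buy (outcome £0).
lottery = {
    "type": "decision",
    "options": {
        "buy": {"type": "chance", "branches": [
            (0.0001, {"type": "payoff", "value": 499}),
            (0.9999, {"type": "payoff", "value": -1}),
        ]},
        "don't buy": {"type": "payoff", "value": 0},
    },
}

value, decisions = solve(lottery)
print(decisions, round(value, 2))  # optimal decision and its expected value
```

Solving the 'buy' branch on its own gives its expected value of -£0.95, matching the calculation above.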

Figure 2.3: The decision tree for our lottery example, fully solved.

2.5 Expected value of perfect information (EVPI)

Decision theory is based on the assumption that we can make some decisions, but don’t know what the outcomes will be (because of the chance nodes / events). However, suppose we have an option to find out what will happen at each chance node - this would almost certainly have a big impact on our decision-making!

Definition 2.1 The expected value of perfect information is the difference in the expected outcome when we know beforehand what will happen at each chance node (ie. we have perfect information), compared to the expected outcome when we don't know what will happen.

This is calculated from the perspective of someone who is deciding whether or not to pay to gain the perfect information, but doesn’t yet know the outcome, so the probabilities are still important.

For our lottery example:

  • If the perfect information revealed that our ticket would win, we would definitely buy a ticket, and our outcome would be £499 .
  • If the perfect information revealed that our ticket would not win, we would not buy a ticket, and our outcome would be £0.

The probabilities of the information revealing each event are the same as the probabilities of the event, so we have an expected outcome with perfect information of:

\[ 0.0001\times{£499} + 0.9999\times{£0} = £0.0499.\] Recalling that our optimal decision (not to buy a ticket) had expected outcome £0, our expected value of perfect information is

\[ EVPI = £0.0499 - £0 = £0.0499. \]

This very small amount of money reflects the very small chance that the information will reveal that our ticket would win.
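As a quick check, the EVPI calculation can be reproduced in a few lines (a sketch using only the numbers given in the text):

```python
# EVPI for the lottery example.
p_win = 0.0001

# Expected outcome WITHOUT extra information: the optimal decision was
# not to buy a ticket, with expected outcome £0.
ev_without = 0.0

# With perfect information we buy only when told our ticket would win
# (outcome £499), and otherwise don't buy (outcome £0).
ev_with = p_win * 499 + (1 - p_win) * 0

evpi = ev_with - ev_without
print(round(evpi, 4))  # expected value of perfect information, in pounds
```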

3 Decision analysis for treatments

The example above was clearly not a health economics one! In a health economics context, we will often be dealing with decisions between treatments, and the resulting outcome in terms of QALYs and costs per treatment.

3.1 Example: Angina operation

Williams (1985) presents several scenarios regarding angina patients, and performs a cost-effectiveness analysis of coronary artery bypass grafting (an operation) compared to the standard ongoing treatment. For this example I have simplified the problem and changed some numbers, but the paper is still a useful reference.

Figure 3.1 shows the outcomes of the different options in terms of QALYs (we have already seen this when we learned about QALYs).

Figure 3.1: Expected value of health-related quality of life for patients with severe angina and left main vessel disease, taken from Williams (1985).

With each outcome there is an associated probability. These are the probabilities of those outcomes if the operation is performed:

  • p(Improvement) = 0.67
  • p(No change) = 0.3
  • p(Operative mortality) = 0.03

We see that there is a fairly high chance the operation will be successful and the patient will improve, some chance that there will be no change and the outcome will be as though they hadn’t had the operation, and a small chance that they will die during the operation.

We can formulate this into a decision tree:
Figure 3.2: The decision tree for our angina example. The QALY values have been estimated by eye from Williams (1985).

We can solve this with backward induction as before. For the chance node, we have an expected outcome (compared to continued medical management) of

\[ \underbrace{0.67\times{9}}_{\text{Improvement}} + \underbrace{0.3\times{4.4}}_{\text{No change}} + \underbrace{0.03 \times{0}}_{\text{Operative mortality}} = 7.35 \text{ QALYs.}\] This is greater than the expected outcome of not performing the operation (4.4 QALYs), and so the optimal strategy would be to perform the operation.

Exercise 3.1 Calculate the expected value of perfect information in this case.

3.1.1 Incorporating costs

So far we have only considered the expected outcome in terms of QALYs - we would choose whichever decision is best for the patient. In fact, each course of treatment or decision option has a cost associated with it. In this case, the operation and subsequent care are estimated to cost approximately £3000 per patient, whereas the medical management costs around £500 per patient.

We can use this, with our previous results, to calculate the incremental cost-effectiveness ratio (ICER):

\[ ICER = \frac{3000-500}{7.35 - 4.4} = \frac{2500}{2.95} = 847.46, \] so if our willingness-to-pay threshold is above £847.46 per QALY, the operation is considered cost-effective (these prices are from 1983, so this was actually quite high!).

How much would we pay in this case for perfect information? Our outcome with perfect information would be

\[\underbrace{0.67\times{9}}_{\text{Operation successful}} + \underbrace{0.3\times{4.4}}_{\text{No change}} + \underbrace{0.03\times{4.4}}_{\text{Operative mortality}} = 7.482.\]

This is \(7.482 - 7.35 = 0.132\) QALYs more than our expected outcome. Therefore we would pay up to 0.132 times our willingness-to-pay threshold for perfect information.
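The angina calculations (expected QALYs, ICER and EVPI) can be collected into a short sketch, using only the numbers given in the text:

```python
# Angina example: probabilities and QALY outcomes from the text.
p = {"improve": 0.67, "no_change": 0.3, "mortality": 0.03}
qalys = {"improve": 9.0, "no_change": 4.4, "mortality": 0.0}

# Expected QALYs with the operation, versus medical management alone.
ev_operation = sum(p[k] * qalys[k] for k in p)   # 7.35
ev_medical = 4.4

# ICER: extra cost per extra QALY (operation £3000, management £500).
icer = (3000 - 500) / (ev_operation - ev_medical)

# With perfect information we would avoid the operation whenever it would
# end in operative mortality, keeping the medical-management outcome 4.4.
ev_perfect = (p["improve"] * 9.0 + p["no_change"] * 4.4
              + p["mortality"] * 4.4)
evpi_qalys = ev_perfect - ev_operation           # in QALYs

print(round(ev_operation, 2), round(icer, 2), round(evpi_qalys, 3))
```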

You can find a much more realistic (complex!) EVPI example in McCullagh, Walsh, and Barry (2012).

3.2 Example: GORD - a more complex set-up

We won’t cover this in the lecture as it requires a lot of detail for relatively little gain.

This example comes from Goeree et al. (1999), in which six competing treatments for GORD (gastro-oesophageal reflux disease) are evaluated. For this example we will skirt over much of the detail, but describe the general process of the cost-effectiveness analysis. If you’d like to see more about the analysis you can find the details in the References.

The most common symptom of GORD is severe heartburn. The treatments compared fall into two categories: \(H_2RA\)s, which had been the standard treatment, and PPIs (proton pump inhibitors), which are more expensive but thought to be more effective. Various health funding bodies therefore required evidence of the cost-effectiveness of PPIs before funding these treatments. There are also PAs (prokinetic agents). The study compared the costs and outcomes of six alternative strategies over 1 year.

The six treatments compared were:

  • A: Intermittent PPI: Acute treatment with PPI for 8 weeks, then no further treatment until recurrence
  • B: Maintenance PPI: Acute treatment with PPI for 8 weeks, then continuous maintenance with PPI (on same dose)
  • C: Maintenance \(H_2RA\): Acute treatment with \(H_2RA\) for 8 weeks, then continuous treatment with \(H_2RA\) (same dose)
  • D: Step-down maintenance PA: Acute treatment with PA for 12 weeks, then continuous maintenance with lower dose of PA
  • E: Step-down maintenance \(H_2RA\): Acute treatment with \(H_2RA\) for 8 weeks then continuous maintenance with lower dose of \(H_2RA\).
  • F: Step-down maintenance PPI: Acute treatment with PPI for 8 weeks, then continuous maintenance with a lower dose of PPI.

For all these treatment strategies, it is assumed that if a patient doesn't respond to the first-line treatment, the dose is increased or another drug is used.

You can see the pathways in Figure 3.3.

Figure 3.3: The pathways from the six treatment strategies. From Goeree et al. (1999).

You can see that the tree branches again for each of the periods in which recurrence was monitored (0-6 months and 6-12 months), and Table 1 of the paper gives details of the treatment pathways in each case. The paper gives a good idea of the amount of work and research involved in such a study, with tables of costs per drug, results (in terms of both healing and recurrence) from many studies involving each treatment, and so on.

The analysis had three main components:

  1. The decision model (Figure 3.3) was constructed and used to compare the expected costs and outcomes of the 6 strategies.
  2. Many other studies were systematically used to estimate the probabilities of each clinical event (eg. healing and recurrence)
  3. Cost-effectiveness analysis was used to compare the treatments in terms of dominance (to be explained soon) and incremental cost-effectiveness.

In order to be able to use results from the existing literature on GORD treatments (specifically the probabilities of different events under particular treatments and circumstances), Goeree et al. (1999) use ‘symptom-free time’ or ‘oesophagitis-free time’ in a follow-up period as their objective function (rather than QALYs).

As well as cost information for the treatments, Goeree et al. (1999) also had to estimate the cost of ongoing healthcare following the treatment, depending on the patient’s clinical condition. These data came from a study investigating clinical practice patterns.

Combining the expected cost with the expected outcome, the authors then performed a cost-effectiveness analysis. The first step was to see whether any of the options were dominated by others.

Definition 3.1 In decision theory, one option dominates another if it is sometimes better, but never worse, than the other.

In this setting, a treatment that had a worse outcome and was more costly than another treatment would be dominated by that second treatment. Ruling out the strategies that were dominated by other strategies left three: A, B and E, which can be joined to form the efficient frontier shown in Figures 3.4 and 3.5. The other treatments are plotted relative to ‘maintenance \(H_2RA\)’ (at a different price point in each plot), and the paper found that whether or not the new strategy ‘Step-down maintenance PPI’ was part of this efficient frontier depended strongly on the price of \(H_2RA\).
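A simple dominance check is easy to sketch in code. The strategy names, costs and effects below are made up for illustration (they are not the values from Goeree et al. (1999)), and the check covers only simple dominance, not the extended dominance sometimes used in cost-effectiveness analysis:

```python
# Hypothetical strategies with a cost and an effectiveness measure.
strategies = {
    "A": {"cost": 500, "effect": 40.0},
    "B": {"cost": 800, "effect": 45.0},
    "C": {"cost": 900, "effect": 38.0},  # costs more than B, worse effect
}

def dominated(name):
    """True if some other strategy costs no more, is at least as effective,
    and is strictly better on at least one of the two criteria."""
    s = strategies[name]
    return any(
        other["cost"] <= s["cost"]
        and other["effect"] >= s["effect"]
        and (other["cost"] < s["cost"] or other["effect"] > s["effect"])
        for o_name, other in strategies.items()
        if o_name != name
    )

# The non-dominated strategies form the candidate efficient frontier.
frontier = [name for name in strategies if not dominated(name)]
print(frontier)
```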

Figure 3.4: The six treatment strategies placed in cost-benefit space relative to maintenance H2RA. From Goeree et al. (1999).

Figure 3.5: The six treatment strategies placed in cost-benefit space relative to maintenance H2RA (at a different price point). From Goeree et al. (1999).

Another detailed example of decision theory in practice can be found in McGuinness et al. (2021). This paper gives an example of using decision theory to decide whether or not to operate on patients with abdominal aortic aneurysm during the Covid-19 pandemic.

4 Diagnostic tests

For a given condition there may be a range of different diagnostic tests, each with different costs and accuracy rates. We are probably familiar with this from Covid-19 times, as we compare the more expensive but more accurate PCR test with the less expensive but less accurate LFD test. Ultimately, the aim of diagnostic testing is to find people with the disease, and treat them effectively. We would like to avoid missing people who have the disease, or treating people unnecessarily (ie. when they don’t have the disease).

In what follows we will use \(D\) to denote a disease (or condition) and \(T\) to denote a test. We will use superscript \(+\) and \(-\) to denote the outcome. For example, \(T^-\) means that a test is negative, whereas \(D^+\) means the disease/condition is present.

4.1 Measures of test accuracy

Two measures are important when considering diagnostic test accuracy:

  • Sensitivity: \(p\left(T^+\mid{D^+}\right)\): The probability that the test is positive given that the patient does have the disease (a true positive). Can the test pick up the disease when it is present?
  • Specificity: \(p\left(T^-\mid{D^-}\right)\): The probability that the test is negative given that the disease is not present (a true negative). Can the test rule out the disease when it is absent?

It is possible to have a very high sensitivity but a very low specificity. At the extreme, if a test identifies everyone as having a disease (ie. \(p\left(T^+\right)=1\)) then the sensitivity would be 1 (because it would correctly diagnose everyone who had the disease) but the specificity would be 0 (because it would incorrectly identify everyone who didn’t have the disease as having the disease). Clearly this would not be a very useful test. We will return to this idea when we look at ROC analysis.

We can visualise these measures using a confusion matrix.

             Test positive         Test negative
Disease      True positives (TP)   False negatives (FN)
No disease   False positives (FP)  True negatives (TN)

4.2 Predictive value

What we really want to know from a diagnostic test is whether someone has the disease or not. We are using the test to provide more information for a particular patient, in the hope that we can calculate a more accurate and informed probability for them than the population baseline \(p\left(D^+\right)\). What we want to know are \(p\left(D^+\mid{T^+}\right)\) and \(p\left(D^+\mid{T^-}\right)\); that is, the probability that the patient has the disease given either test result.

We can work this out using Bayes Theorem:

Theorem 4.1 Bayes Theorem states that for two events \(A\) and \(B\), with \(p\left(B\right)>0\),

\[ p\left(A\mid{B}\right) = \frac{p\left(B\mid{A}\right)p\left(A\right)}{p\left(B\right)}.\]

In terms of our events \(T\) and \(D\) we can calculate the probability of someone having the disease given that they have tested positive (for example) as:

\[p\left(D\mid{T}\right) = \frac{p\left(T\mid{D}\right)p\left(D\right)}{p\left(T\right)}.\]

There are three important quantities we need to know to calculate \(p\left(D^+\mid{T^+}\right)\):

  • \(p\left(T\mid{D}\right)\). Depending on which disease state and which test outcome we are interested in, this could be the sensitivity, the specificity, or one minus one of those quantities.
  • The prevalence \(p\left(D^+\right)\). We see that if the disease is very prevalent in the population, the probability of having the disease given either test outcome is higher than if the prevalence is low.
  • The probability of the test outcome, \(p\left(T\right)\).

We can calculate the probability of the test outcome, \(p\left(T^+\right)\) for example, using the partition theorem (the law of total probability) as follows:

\[p\left(T^+\right) = p\left(T^+\mid{D^+}\right)p\left(D^+\right) + p\left(T^+\mid{D^-}\right)p\left(D^-\right).\]

This leads us to another two important measures of diagnostic accuracy:

  • Positive predictive value: \(p\left(D^+\mid{T^+}\right)\), the probability of having the disease given a positive test result.
  • Negative predictive value: \(p\left(D^-\mid{T^-}\right)\), the probability of not having the disease given a negative test result.

Because of the dependence on the prevalence, these quantities may need to be re-calculated often.

Example

Suppose a diagnostic test for a particular disease has sensitivity 0.99 and specificity 0.8. That is, \[p\left(T^+\mid{D^+}\right) = 0.99\] and \[p\left(T^-\mid{D^-}\right)=0.8.\] The prevalence of the disease in the population is 1%, that is \[ p\left(D^+\right) = 0.01.\] To find \(p\left(D^+\mid{T^+}\right)\), the probability that someone has the disease given that they have tested positive, we first need to calculate \(p\left(T^+\right)\).

\[\begin{align*} p\left(T^+\right) & = p\left(T^+\mid{D^+}\right)p\left(D^+\right) + p\left(T^+\mid{D^-}\right)p\left(D^-\right)\\ &= 0.99 \times{0.01} + \left(1-p\left(T^-\mid{D^-}\right)\right)p\left(D^-\right)\\ & = 0.99 \times{0.01} + 0.2 \times{0.99}\\ & = 0.0099 + 0.198 \\ & = 0.2079. \end{align*}\]

Now we can calculate the positive predictive value (we will do this in class).

Even though the sensitivity of the test is high \(\left(p\left(T^+\mid{D^+}\right) = 0.99\right)\) the posterior probability of someone having the disease given a positive test result is still very low.
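For reference, the Bayes Theorem calculation can be sketched in code, using only the numbers given above (sensitivity 0.99, specificity 0.8, prevalence 1%):

```python
# Positive predictive value via Bayes' theorem.
sens, spec, prev = 0.99, 0.8, 0.01

# p(T+) by the partition theorem:
# p(T+) = p(T+|D+)p(D+) + p(T+|D-)p(D-).
p_pos = sens * prev + (1 - spec) * (1 - prev)   # 0.2079

# p(D+|T+) = p(T+|D+)p(D+) / p(T+).
ppv = sens * prev / p_pos

print(round(p_pos, 4), round(ppv, 4))
```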

This is linked to the prosecutor’s fallacy, where the two probabilities are confused, sometimes with catastrophic results - this misunderstanding of statistics has led to many wrong convictions. Westreich and Iliinsky (2014) give a brief overview containing examples of the prosecutor’s fallacy in epidemiology.

4.3 Decision trees for diagnostic testing

In this section we will use decision trees as a way to calculate probabilities (note that there are no decision nodes!), but this is not unusual practice, as they can be a useful visual tool. Rautenberg, Gerritsen, and Downes (2020) give a review of the use of decision theory in diagnostic testing, as well as setting forward good practice.

This approach begins with a cohort of patients whose disease status is known, for example using some gold standard test. The proportion of patients with the disease would be based on disease prevalence information.

Figure 4.1: A disease-based approach.

4.3.1 Example

Suppose some disease has the prevalence \(p\left(D^+\right)= 0.2\), and we know that

  • Sensitivity = \(p\left(T^+\mid{D^+}\right) = 0.86 = 1- p\left(T^-\mid{D^+}\right)\)
  • Specificity = \(p\left(T^-\mid{D^-}\right) = 0.7 = 1- p\left(T^+\mid{D^-}\right)\)

We can then fill in the tree as follows:

Figure 4.2: A disease-based approach.

Competing diagnostics can be laid out in the same way and the results compared. In the above, we have used the decision tree to lay out some fairly simple probabilistic calculations. The probabilities at the end show the expected proportions of true positives, true negatives etc. However, this procedure can also be used to help understand new tests as part of a sequential diagnosis.
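The branch probabilities in the tree are simple products of the prevalence with the sensitivity or specificity (or their complements). A short sketch, using the numbers from the example above:

```python
# Branch probabilities for the disease-based tree
# (prevalence 0.2, sensitivity 0.86, specificity 0.7).
prev, sens, spec = 0.2, 0.86, 0.7

branches = {
    "true positive":  prev * sens,              # D+, T+
    "false negative": prev * (1 - sens),        # D+, T-
    "false positive": (1 - prev) * (1 - spec),  # D-, T+
    "true negative":  (1 - prev) * spec,        # D-, T-
}

# Scaled to a cohort of 1000 patients.
cohort = {k: round(1000 * v) for k, v in branches.items()}
print(cohort)
```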

4.4 Sequential diagnostic testing

Rather than exploring whether a new diagnostic test might replace an existing one, we could also see how they would work in sequence. This might be the case, for example, if one test is more accurate but also very expensive. The first test is used to ‘triage’ patients so that only those who test positive are given the second test.

4.4.1 Example continued

Let’s imagine that the diagnostic test shown in Figure 4.2 is the first test (which we will now call \(T_1\)), which everyone is sent for. If 1000 people go, we expect:

  • 172 to be true positives
  • 28 to be false negatives
  • 240 to be false positives
  • 560 to be true negatives

Only the \(172 + 240 = 412\) with a positive result would be sent for the second test \(T_2\). Note that even with this very high disease prevalence of 0.2, less than half of the positive tests are correct. The 28 who do have the disease but tested negative would not be sent for the further testing, which could potentially be very serious. In medical sciences, a false negative is often given much more weight than a false positive, for example when developing diagnostics.

Let’s say that for the second, much more expensive, test \(T_2\) we have

  • Sensitivity = \(p\left(T_2^+\mid{D^+}\right) = 0.95 = 1- p\left(T_2^-\mid{D^+}\right)\)
  • Specificity = \(p\left(T_2^-\mid{D^-}\right) = 0.80 = 1- p\left(T_2^+\mid{D^-}\right)\)

For this second test we also have a new within-sample prevalence, since we will only test those who tested positive with the first diagnostic. We have replaced (or updated) \(p\left(D^+\right)=0.2\) with \(p\left(D^+\mid{T_1^+}\right)=\frac{172}{412}={0.417}\) (3 s.f.).

We also assume here that the results of \(T_1\) and \(T_2\) are conditionally independent given the disease state, or \(T_1\perp{T_2}\mid{D}\), and therefore \(T_2\mid{D,T_1}\sim T_2\mid{D}\). Crucially, this means we can use the sensitivity and specificity stated, even though the patients will have already been tested with \(T_1\).

Figure 4.3: Sequential testing.

From this we can compare the results after just the first test with the results following both tests. Table 4.1 shows the results from just test 1, and Table 4.2 shows the results of the sequential testing, where only the [expected] 412 people who tested positive on Test \(T_1\) were then given test \(T_2\).

Table 4.1: Diagnostic table for just test 1.

         D+     D-
T1+     172    240
T1-      28    560

In this Table 4.2, a negative test outcome could come either from testing negative on \(T_1\), or from testing positive for \(T_1\) but subsequently negative on \(T_2\). A positive outcome comes only from testing positive in both tests.

Table 4.2: Diagnostic table for the sequential testing.

            D+     D-
Test+    163.4     48
Test-     36.6    752
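The numbers in Tables 4.1 and 4.2 can be reproduced with a short sketch (the cohort size and both tests' accuracies are taken from the text; conditional independence of the two tests given disease status is assumed, as above):

```python
# Sequential testing: T1 on everyone, T2 only on T1-positives.
prev = 0.2
sens1, spec1 = 0.86, 0.7    # test T1
sens2, spec2 = 0.95, 0.8    # test T2

n = 1000
tp1 = n * prev * sens1               # 172 sent on to T2
fp1 = n * (1 - prev) * (1 - spec1)   # 240 sent on to T2
fn1 = n * prev * (1 - sens1)         # 28 (wrongly) ruled out at stage 1
tn1 = n * (1 - prev) * spec1         # 560 correctly ruled out at stage 1

# A final positive requires positives on BOTH tests; a final negative
# comes from a negative on T1, or a positive on T1 then a negative on T2.
tp = tp1 * sens2                      # expected true positives
fp = fp1 * (1 - spec2)                # expected false positives
fn = fn1 + tp1 * (1 - sens2)          # expected false negatives
tn = tn1 + fp1 * spec2                # expected true negatives

print(tp, fp, fn, tn)
```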

We see that using the sequential testing method, approximately 200.6 people (those who test positive on \(T_1\) but subsequently negative on \(T_2\)) would be ruled out for treatment. Of these, around 8.6 do in fact have the disease. Whether this is acceptable would depend on various factors:

  • What is the saving from not treating these \(\sim{200}\) people?
  • How soon would the disease be likely to be picked up in the \(\sim{8.6}\) false negatives?

In that sense, the calculations we have performed here would feed into a health economic model, for example as the probabilities in a decision tree. You can read more of this, and some of the difficulties surrounding decision making with diagnostic tests, in Sutton et al. (2008).

5 Receiver-operating characteristic (ROC) analysis

Up to this point we have assumed that diagnostic tests give a simple, dichotomous ‘Negative’ or ‘Positive’ value. In fact, in most cases this isn’t true. The test gives a continuous output of some kind (for example the concentration of a particular substance) and we must employ a decision threshold to determine which values are classed as positive, and which are negative. There is usually no definitively true cut-off point.

ROC analysis was developed during the second world war, as radar operators analysed their classification accuracy in distinguishing signal (eg. an enemy plane) from noise. It is still widely used in the field of statistical classification, including in medical diagnostics.

Suppose we are interested in some medical measurement, for example the concentration of a particular hormone. We believe that the distribution of values of this measurement is different for people with a particular disease than for those without it. Without loss of generality, we will assume that people without the disease generally have a lower value than those with the disease.

Figure 5.1: Probability distributions of a measurement for people with (D) and without (No D) a disease.

We can demonstrate how ROC analysis works by assuming both have a normal distribution, as in Figure 5.1, but note that ROC analysis doesn’t make any distributional assumptions in general.

If we assume the distributions in Figure 5.1, then a decision threshold of zero (the mean of the non-disease measurements) would lead us to classify half of those people without the disease as negative, and most (but not all) of those with the disease as positive. This is shown more clearly in Figure 5.2, where the shaded parts of each distribution show the proportions that are being correctly classified.

Figure 5.2: Probability distributions of a measurement for people with (D) and without (No D) a disease, with a decision threshold of zero.

Now we see that 90% of those with the disease are correctly classified as having it. However, we have many false positives as well (half of all those without the disease). If we increase the decision threshold to reduce the number of false positives, as in Figure 5.3, we see that the sensitivity (the proportion of those with the disease who are correctly classified as positive) is also reduced.
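Under the normal-distribution assumption, sensitivity and specificity at any threshold are just tail probabilities. The sketch below assumes No D ~ N(0, 1) and D ~ N(1.28, 1); the disease-group mean of 1.28 is an assumption chosen so that a threshold of 0 reproduces roughly the 90% sensitivity and 50% specificity described above.

```python
from statistics import NormalDist

# Sketch, assuming No-D ~ N(0, 1) and D ~ N(1.28, 1). The disease-group
# mean of 1.28 is an assumption chosen so that a threshold of 0 gives
# roughly 90% sensitivity and 50% specificity, as in the text.
no_d = NormalDist(0.0, 1.0)
d = NormalDist(1.28, 1.0)

def sensitivity(t):
    """Proportion of people with the disease whose value exceeds t."""
    return 1 - d.cdf(t)

def specificity(t):
    """Proportion of people without the disease whose value is below t."""
    return no_d.cdf(t)

print(sensitivity(0.0), specificity(0.0))   # ~0.90 and exactly 0.50
print(sensitivity(0.5), specificity(0.5))   # raising t trades one for the other
```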

Figure 5.3: Probability distributions of a measurement for people with (D) and without (No D) a disease, with a decision threshold of 0.5

Any decision threshold value produces a pair (Sensitivity, Specificity), and in ROC analysis it is common to plot these with sensitivity on the \(y\) axis and (1-specificity) on the \(x\) axis. The two threshold values we have tried above, \(t=0\) and \(t=0.5\), are shown below.

Figure 5.4: The performance of the two threshold values 0 and 0.5 in ROC space.

If we vary the decision threshold continuously, we can produce a ROC curve, which traces out the sensitivity and 1-specificity as the decision threshold varies. The ROC curve for the measurement shown in Figures 5.1 to 5.3 is shown in Figure 5.5.
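Tracing the curve amounts to sweeping the threshold and recording the point (1 − specificity, sensitivity) at each value. A minimal sketch, under the assumed (illustrative) distributions No D ~ N(0, 1) and D ~ N(1.28, 1):

```python
from statistics import NormalDist

# Sketch: sweep the threshold and record one ROC point per value.
# No-D ~ N(0, 1) and D ~ N(1.28, 1) are illustrative assumptions.
no_d = NormalDist(0.0, 1.0)
d = NormalDist(1.28, 1.0)

thresholds = [t / 10 for t in range(-40, 41)]   # sweep t from -4 to 4
roc = [(1 - no_d.cdf(t), 1 - d.cdf(t)) for t in thresholds]

# A very low threshold calls everyone positive (top right of the plot);
# a very high one calls everyone negative (bottom left)
print(roc[0])    # near (1, 1)
print(roc[-1])   # near (0, 0)
```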

Figure 5.5: The ROC curve for the measurement shown in Figures 5.1-5.3. AUC is ‘area under the curve’, an idea we will explore shortly.

Each point on the ROC curve corresponds to a value of the decision threshold, as shown in Figures 5.6 to 5.9.

Figure 5.6: Decision threshold = -1

Figure 5.7: Decision threshold = 0

Figure 5.8: Decision threshold = 2

Figure 5.9: Decision threshold = 4

There are two main points we will explore from the ROC curve:

  1. The ROC curve shows us the overall performance of the measurement as a classifier, and we can summarize this using the area under the curve (AUC, or sometimes AUROC)
  2. There is no obvious ‘best’ value of the decision threshold (but there are some methods for choosing a value).

5.1 Overall diagnostic performance

The shape of the ROC curve is determined (at least in part) by the degree of separation in the probability distribution of the measurement for people with and without the disease. If there is a good degree of separation, as in Figure 5.10, then there will be some values of the decision threshold where there is (almost) perfect accuracy.

Figure 5.10: Good separation in distributions of a measurement for people with (D) and without (No D) a disease.

The ROC curve for this diagnostic measurement is shown in Figure 5.11.

Figure 5.11: The ROC curve for the measurement with good separation.

This shape of ROC curve indicates an ideal diagnostic: the area under the curve (AUC) is 1, the best it can be. The diagonal dashed line shows the performance expected of random guessing (or of a diagnostic that is no better than random guessing), and is often included as a baseline; it is the ROC curve we would expect if the diagnostic measurement had the same distribution for people with and without the disease. The area under this diagonal is 0.5, so a useful diagnostic must achieve an AUC above 0.5.

On the other hand, suppose we have a high degree of overlap between the measurement distributions for people with and without the disease, as shown in Figure 5.12.

Figure 5.12: Poor separation in distributions of a measurement for people with (D) and without (No D) a disease.

Although some values of the decision threshold are better than others, none perform very well, and we see this in the shape of the ROC curve, shown in Figure 5.13.

Figure 5.13: The ROC curve for the measurement with poor separation.

The area under the curve (AUC) is much lower, showing a lower degree of accuracy.

So, the area under the ROC curve is used as a general summary of the usefulness of a measurement as a classifier, averaged over the possible threshold values.
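Numerically, the AUC can be approximated with the trapezoidal rule applied to the swept (1 − specificity, sensitivity) points. The sketch below again uses the illustrative assumption No D ~ N(0, 1) and D ~ N(1.28, 1); for two equal-variance normals the AUC also has a closed form, \(\Phi\left(\Delta/\sqrt{2}\right)\) where \(\Delta\) is the difference in means, which serves as a check.

```python
from statistics import NormalDist

# Sketch: AUC by the trapezoidal rule, sweeping the threshold.
# No-D ~ N(0, 1) and D ~ N(1.28, 1) are illustrative assumptions.
no_d = NormalDist(0.0, 1.0)
d = NormalDist(1.28, 1.0)

ts = [t / 100 for t in range(-600, 601)]      # thresholds from -6 to 6
fpr = [1 - no_d.cdf(t) for t in ts]           # x: 1 - specificity
tpr = [1 - d.cdf(t) for t in ts]              # y: sensitivity

# fpr decreases as the threshold increases, so pair consecutive points
auc = sum((fpr[i] - fpr[i + 1]) * (tpr[i] + tpr[i + 1]) / 2
          for i in range(len(ts) - 1))

# Closed form for two equal-variance normals: Phi(1.28 / sqrt(2)) ~ 0.817
print(f"AUC ~ {auc:.3f}")
```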

5.2 Choosing a value for the decision threshold

If we are performing a diagnostic test we might need to specify a value for the decision threshold, so that people can test positive or negative.

The optimal value will depend on our application, and in particular:

  • The consequences of failing to detect true positives
  • The consequences of raising false alarms.

As we have seen, there is a trade-off between sensitivity and specificity. We will assume an equal balance, but the measures in this section can be weighted to favour one sort of error over the other.

5.2.1 Youden’s index

The decision threshold according to Youden's index \(J\) is the value of \(T\) that maximizes \(J\left(T\right)\):

\[J\left(T\right) = \operatorname{sensitivity}\left(T\right) + \operatorname{specificity}\left(T\right) - 1\]

5.2.2 Distance from (0,1)

We can also choose to minimize the distance to the top left corner (the ideal point, where specificity and sensitivity are both one):

\[ER\left(T\right) = \sqrt{\left(1-\text{specificity}\right)^2 + \left(1-\text{sensitivity}\right)^2}.\]
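Both criteria can be optimized by a simple grid search over the threshold. The sketch below uses the illustrative assumption No D ~ N(0, 1) and D ~ N(1.28, 1); in this equal-variance case the two criteria happen to pick the same threshold, the point where the two densities cross.

```python
import math
from statistics import NormalDist

# Sketch: grid search for the Youden and distance-to-(0,1) thresholds.
# No-D ~ N(0, 1) and D ~ N(1.28, 1) are illustrative assumptions.
no_d = NormalDist(0.0, 1.0)
d = NormalDist(1.28, 1.0)

def sens(t):
    return 1 - d.cdf(t)

def spec(t):
    return no_d.cdf(t)

def youden(t):
    return sens(t) + spec(t) - 1

def er(t):
    # Euclidean distance from (1 - specificity, sensitivity) to (0, 1)
    return math.hypot(1 - spec(t), 1 - sens(t))

ts = [t / 100 for t in range(-400, 401)]
t_youden = max(ts, key=youden)   # maximize J(T)
t_er = min(ts, key=er)           # minimize distance to the ideal point
print(t_youden, t_er)            # both land at the midpoint, 0.64
```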

Figure 5.14 compares the decision thresholds found from these two methods with our poor-separation scenario, and Figure 5.15 does the same with a different pair of distributions.

Figure 5.14: The Youden point (blue) and ER point (red)

Figure 5.15: The Youden point and ER point for our original scenario.

5.3 Example: ROC analysis with data

So far we have studied how ROC analysis works with theoretical distributions. ROC analysis doesn't rely on these distributional assumptions; rather, they were used to help us visualize how the method works.

In a real application, the data would take the form of measurements (or some continuous score value) and true disease state. The true disease state is usually learned from some ‘gold standard’ test. A ROC curve derived from real data is likely to be much less smooth than the theoretically derived curves we have been looking at so far.
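With real data, the curve is built directly from the observed measurements: each observed value is a candidate threshold, and sensitivity and specificity become sample proportions. The data in the sketch below are simulated, standing in for real measurements labelled by a gold-standard test.

```python
import random

# Sketch: an empirical ROC curve from (measurement, disease) pairs.
# The data are simulated, standing in for real measurements labelled
# by a gold-standard test.
random.seed(1)
healthy = [random.gauss(0.0, 1.0) for _ in range(60)]    # no disease
diseased = [random.gauss(1.5, 1.0) for _ in range(40)]   # disease

def roc_point(t):
    """(1 - specificity, sensitivity) at decision threshold t."""
    sens = sum(x > t for x in diseased) / len(diseased)
    spec = sum(x <= t for x in healthy) / len(healthy)
    return 1 - spec, sens

# Each observed value is a candidate threshold; the resulting curve is
# a step function, hence less smooth than the theoretical curves above
thresholds = sorted(healthy + diseased)
curve = [roc_point(t) for t in thresholds]
```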

Figure 5.16 shows 100 measurements of some diagnostic test, coloured by whether or not each person has the disease in question. Notice that there is some overlap, and so this measurement cannot produce a perfectly accurate diagnostic test.

Figure 5.16: An empirical dataset.

If we vary the decision threshold we obtain the ROC curve shown in Figure 5.17. The Youden point is shown, as is the point minimizing the distance from (0,1). These are quite often the same point!

Figure 5.17: The ROC curve for our empirical data.

Using the information gained from the ROC analysis, we can assess our confidence in this measurement as a diagnostic for the disease in question, and can select an optimal (in some sense we choose) value for the decision threshold.

We have only skimmed the surface of ROC analysis, but you can read more about it in Zou, O’Malley, and Mauri (2007) and Fawcett (2006).

6 Summary

In this lecture we have studied

  • Decision trees as a tool for modelling decisions under uncertainty
  • Diagnostic tests
    • Measures of test accuracy
    • Decision trees as tools for probabilistic calculations
    • ROC analysis

References

Briggs, A. H., K. Claxton, and M. J. Sculpher. 2006. Decision Modelling for Health Economic Evaluation. Oxford University Press. https://books.google.co.uk/books?id=OgJOllOt_dkC.
Fawcett, Tom. 2006. “An Introduction to ROC Analysis.” Pattern Recognition Letters 27 (8): 861–74.
Goeree, Ron, Bernie O’Brien, Richard Hunt, Gordon Blackhouse, Andrew Willan, and Jan Watson. 1999. “Economic Evaluation of Long Term Management Strategies for Erosive Oesophagitis.” Pharmacoeconomics 16 (6): 679–97.
McCullagh, Laura, Cathal Walsh, and Michael Barry. 2012. “Value-of-Information Analysis to Reduce Decision Uncertainty Associated with the Choice of Thromboprophylaxis After Total Hip Replacement in the Irish Healthcare Setting.” Pharmacoeconomics 30 (10): 941–59.
McGuinness, Brandon, Michael Troncone, Lyndon P James, Steve P Bisch, and Vikram Iyer. 2021. “Reassessing the Operative Threshold for Abdominal Aortic Aneurysm Repair in the Context of COVID-19.” Journal of Vascular Surgery 73 (3): 780–88.
Rautenberg, Tamlyn, Annette Gerritsen, and Martin Downes. 2020. “Health Economic Decision Tree Models of Diagnostics for Dummies: A Pictorial Primer.” Diagnostics 10 (3): 158.
Sutton, Alexander J, Nicola J Cooper, Steve Goodacre, and Matthew Stevenson. 2008. “Integration of Meta-Analysis and Economic Decision Modeling for Evaluating Diagnostic Tests.” Medical Decision Making 28 (5): 650–67.
Westreich, Daniel, and Noah Iliinsky. 2014. “Epidemiology Visualized: The Prosecutor’s Fallacy.” American Journal of Epidemiology 179 (9): 1125–27.
Whitehead, Sarah J, and Shehzad Ali. 2010. “Health Outcomes in Economic Evaluation: The QALY and Utilities.” British Medical Bulletin 96 (1): 5–21.
Williams, Alan. 1985. “Economics of Coronary Artery Bypass Grafting.” Br Med J (Clin Res Ed) 291 (6491): 326–29.
Zou, Kelly H, A James O’Malley, and Laura Mauri. 2007. “Receiver-Operating Characteristic Analysis for Evaluating Diagnostic Tests and Predictive Models.” Circulation 115 (5): 654–57.