$$ \newcommand{\pr}[1]{\mathbb{P}\left(#1\right)} \newcommand{\cpr}[2]{\mathbb{P}\left(#1\mid\,#2\right)} $$
3 Conditional probability and independence
3.1 Conditional probability
In this course, when \(\mathbb{P}(B)=0\), \(\mathbb{P}(A \mid B)\) is undefined. The usual interpretation is that \(\mathbb{P}(A \mid B)\) represents our probability for \(A\) after we have observed \(B\). Conditional probability is therefore very important for statistical reasoning, for example:
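For reference, the defining ratio (this is the standard definition, which the calculations in this section rely on) is, for \(\pr{B}>0\), \[\cpr{A}{B} = \frac{\pr{A \cap B}}{\pr{B}} .\]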
Legal trials. How can we use DNA (or other) evidence to determine the chance that an accused person is guilty?
Medical screening. How can we make best use of the information from large-scale cancer screening programs?
Unfortunately, conditional probability is not always well understood. There are several well-known legal cases that have involved a serious error in probabilistic reasoning: see e.g. Example 2.4.5 of (Anderson, Seppäläinen, and Valkó 2018).
For example, if we roll a fair six-sided die, the conditional probability that the score is odd, given that the score is at most 3, is \[\mathbb{P}(\text{odd} \mid \text{ at most 3}) = \frac{\mathbb{P}(\{1,3\})}{\mathbb{P}(\{1,2,3\})} = \frac{2/6}{3/6} = \frac{2}{3}. \]
3.2 Properties of conditional probability
In this section, we’ll meet five key properties of conditional probability.
For example, C6 for conditional probabilities says that, if \(\mathbb{P}(C)>0\), \[\mathbb{P}(A \cup B \mid C) = \mathbb{P}(A \mid C) +\mathbb{P}(B \mid C) - \mathbb{P}(A \cap B \mid C) .\]
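One way to check this (a short calculation from the definition, using the unconditional inclusion-exclusion rule for the middle step) is \[\cpr{A \cup B}{C} = \frac{\pr{(A \cup B) \cap C}}{\pr{C}} = \frac{\pr{A \cap C} + \pr{B \cap C} - \pr{A \cap B \cap C}}{\pr{C}} = \cpr{A}{C} + \cpr{B}{C} - \cpr{A \cap B}{C} .\]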
Some people refer to P2 as the multiplication rule for probabilities.
Both P1 and P2 can be deduced from the definition of conditional probability. For example, Equation 3.1 follows from the fact that \[\mathbb{P}(B \mid C) \, \mathbb{P}(A \mid B\cap C) = \frac{\mathbb{P}(B \cap C)}{\mathbb{P}(C)} \cdot \frac{\mathbb{P}(A \cap B \cap C)}{\mathbb{P}(B \cap C)} = \frac{\mathbb{P}(A \cap B \cap C)}{\mathbb{P}(C)} = \mathbb{P}(A \cap B \mid C) .\]
Our next property is a more general version of the multiplication rule.
When \(k=2\), we get P2; for \(k=3\), this becomes \[\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A)\, \mathbb{P}(B \mid A) \, \mathbb{P}(C\mid A\cap B).\] We can prove this by repeatedly applying Equation 3.1 (in this case, we use it twice).
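As an illustration (an example added here, not taken from the text): if we draw three cards from a well-shuffled deck without replacement, and \(A_i\) is the event that the \(i\)th card is an ace, then \[\pr{A_1 \cap A_2 \cap A_3} = \pr{A_1}\, \cpr{A_2}{A_1}\, \cpr{A_3}{A_1 \cap A_2} = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} = \frac{1}{5525} .\]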
This result is often called the partition theorem, or the law of total probability. (If you’ve forgotten what a partition is, head back to Section 1.6.)
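In symbols (this appears to be the identity labelled Equation 3.2 in the proof below): if \(E_1, \dots, E_k\) is a partition of \(\Omega\) with \(\pr{E_i} > 0\) for each \(i\), then for any event \(A\), \[\pr{A} = \sum_{i=1}^k \pr{E_i}\, \cpr{A}{E_i} .\]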
To prove P4, we first apply P2 to each term on the right-hand side of Equation 3.2 to get \[\sum_{i=1}^k \pr{E_i} \cpr{A}{E_i} = \sum_{i=1}^k \pr{A \cap E_i} .\] Since the \(E_i\) form a partition, they are pairwise disjoint, and hence so are the events \(A \cap E_i\); so by C7, \[\sum_{i=1}^k \pr{A \cap E_i} = \pr{\cup_{i=1}^k (A \cap E_i ) } = \pr {A \cap (\cup_{i=1}^k E_i ) } .\] Finally, since the \(E_i\) form a partition, \(\cup_{i=1}^k E_i = \Omega\), so the last expression is \(\pr{A \cap \Omega} = \pr{A}\), giving the result. You should check that P4 remains true (with \(k=\infty\)) for infinite partitions.
The most important result in conditional probability is Bayes’ theorem. It allows us to express the conditional probability of an event \(A\) given \(B\) in terms of the “inverse” conditional probability of \(B\) given \(A\).
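In its simplest form, for events \(A\) and \(B\) with \(\pr{A} > 0\) and \(\pr{B} > 0\), the theorem states \[\cpr{A}{B} = \frac{\pr{A}\, \cpr{B}{A}}{\pr{B}} ,\] which follows by writing \(\pr{A \cap B}\) in two ways using the multiplication rule.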
We can also combine properties P4 and P5 to make a mega-property of conditional probability: Bayes’ theorem for partitions.
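Explicitly, if \(E_1, \dots, E_k\) is a partition of \(\Omega\) with \(\pr{E_i} > 0\) for each \(i\), and \(A\) is an event with \(\pr{A} > 0\), then \[\cpr{E_j}{A} = \frac{\pr{E_j}\, \cpr{A}{E_j}}{\sum_{i=1}^k \pr{E_i}\, \cpr{A}{E_i}} \quad \text{for each } j.\] As an illustration with made-up numbers (not data from the text), suppose a screening test for a disease \(D\) has \(\pr{D} = 0.01\), \(\cpr{+}{D} = 0.9\) and \(\cpr{+}{D^c} = 0.05\), where \(+\) is the event of a positive result and \(D^c\) is the complement of \(D\). Then \[\cpr{D}{+} = \frac{0.01 \times 0.9}{0.01 \times 0.9 + 0.99 \times 0.05} = \frac{0.009}{0.0585} \approx 0.154 ,\] so even after a positive test, the chance of having the disease is only about 15%.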
3.3 Independence of events
Tied to the idea of conditional probability is the idea of independence: the property that two events are unrelated, or have no bearing on each other’s likelihood.
For example, if we pick a card from a well-shuffled deck, the events “the card is red” (\(R\)) and “the card is an Ace” (\(A\)) are independent.
By counting, we have that \(\pr{R} = \frac{26}{52} = \frac{1}{2}\) and \(\mathbb{P}(A) = \frac{4}{52} = \frac{1}{13}\). Now, \(A \cap R = \{ A\diamondsuit, A\heartsuit\}\), so \(\pr{A \cap R } = \frac{2}{52} = \frac{1}{26}\). We check that \(\frac{1}{26} = \frac{1}{2} \cdot \frac{1}{13}\), so \(R\) and \(A\) are indeed independent.
Never confuse disjoint events with independent events! For independent events, we have that \(\pr{A\cap B} = \mathbb{P}(A)\mathbb{P}(B)\), but for disjoint events, \(\pr{A \cap B}=0\) because \(A\cap B=\emptyset\). In particular, two disjoint events with positive probabilities can never be independent.
Disjointness is a property of the sets only (it can be seen from the Venn diagram). Independence is a property of probabilities (it cannot be seen from the Venn diagram).
The next theorem explains why independence is called independence:
Consider any two events \(A\) and \(B\) with \(\mathbb{P}(A)>0\) and \(\mathbb{P}(B)>0\). The following statements are equivalent.
\(\pr{A\cap B} = \mathbb{P}(A)\mathbb{P}(B)\).
\(\mathbb{P}(A \mid B)=\mathbb{P}(A)\).
\(\mathbb{P}(B \mid A)=\mathbb{P}(B)\).
In other words, learning about \(B\) will not tell us anything new about \(A\), and similarly, learning about \(A\) will not tell us anything new about \(B\).
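One way to see this: if \(\pr{A \cap B} = \pr{A}\pr{B}\), then \[\cpr{A}{B} = \frac{\pr{A \cap B}}{\pr{B}} = \frac{\pr{A}\pr{B}}{\pr{B}} = \pr{A} ,\] and each step reverses, so the first statement holds exactly when the second does; swapping the roles of \(A\) and \(B\) gives the equivalence with the third.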
For conditional independence, we have a similar result.
Consider any three events \(A\), \(B\), and \(C\), with \(\pr{A \cap B \cap C}>0\). The following statements are equivalent.
\(\cpr{A\cap B}{C} = \mathbb{P}(A \mid C)\mathbb{P}(B \mid C)\).
\(\cpr{A}{B\cap C}=\mathbb{P}(A \mid C)\).
\(\cpr{B}{A\cap C}=\mathbb{P}(B \mid C)\).
In other words, if we know \(C\) then learning about \(B\) will not tell us anything new about \(A\), and similarly, if we know \(C\) then learning about \(A\) will not tell us anything new about \(B\).
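The calculation is the same as before, with everything conditioned on \(C\): if \(\cpr{A \cap B}{C} = \cpr{A}{C}\, \cpr{B}{C}\), then \[\cpr{A}{B \cap C} = \frac{\pr{A \cap B \cap C}}{\pr{B \cap C}} = \frac{\cpr{A \cap B}{C}\, \pr{C}}{\cpr{B}{C}\, \pr{C}} = \cpr{A}{C} ,\] and again the steps reverse.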
Consider the card-shuffling example again. The probability that our card is an Ace is \(\mathbb{P}(A) = 1/13\) and the probability that it is an Ace, given it is red, is \[\cpr{A}{R} = \frac{\pr{A\cap R}}{\pr{R}} = \mathbb{P}(A) ,\] by independence. The ‘reason’ for the independence is that the proportion of aces in the deck (\(4/52\)) is the same as that of aces among the red cards (\(2/26\)).
It can be extremely useful to recognize situations where (conditional) independence can be applied. Of course, it is equally important not to assume (conditional) independence where there really are dependencies.
The simplest case beyond two events is that of three events. We say that the events \(A\), \(B\), and \(C\) are mutually independent if all of the following equalities are satisfied: \[\begin{aligned} \pr{A \cap B \cap C} &=\mathbb{P}(A)\mathbb{P}(B)\mathbb{P}(C), \\ \pr{A\cap B}&=\mathbb{P}(A)\mathbb{P}(B),\\ \pr{B \cap C}&=\mathbb{P}(B)\mathbb{P}(C),\\ \pr{C \cap A}&=\mathbb{P}(C)\mathbb{P}(A). \end{aligned}\]
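One useful consequence (a short check, not stated explicitly above): if \(A\), \(B\), and \(C\) are mutually independent, then \(A\) is also independent of \(B \cap C\), since \[\pr{A \cap (B \cap C)} = \pr{A}\pr{B}\pr{C} = \pr{A}\, \pr{B \cap C} .\]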
Suppose we roll 4 dice and their values are independent.
To find the probability that we throw no sixes, let \(A_i\) be the event ‘the \(i\)th throw is not a 6’. By assumption, \(A_1\), …, \(A_4\) are independent, so \[\pr{\text{no sixes on 4 dice}} = \pr{\bigcap_{i=1}^4 A_i} = \prod_{i=1}^4 \pr{A_i} = \Bigl(\frac{5}{6} \Bigr)^4.\] The same result is obtained from the classical model, by selection with replacement.
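Indeed, counting equally likely outcomes for the four throws (selection with replacement), there are \(6^4\) outcomes in total and \(5^4\) with no six, so \[\pr{\text{no sixes on 4 dice}} = \frac{5^4}{6^4} = \Bigl(\frac{5}{6}\Bigr)^4 ,\] as before.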
It is possible for events to be pairwise independent without being mutually independent, as the next example demonstrates.
3.4 Historical context
Bayes’ theorem is named after the Reverend Thomas Bayes (1701–1761); it was published after his death, in 1763. In our modern approach to probability, the theorem is a very simple consequence of our definitions; however, the result may be interpreted more widely, and is one of the most important results regarding statistical reasoning.