$$ \newcommand{\pr}[1]{\mathbb{P}\left(#1\right)} \newcommand{\cpr}[2]{\mathbb{P}\left(#1\mid\,#2\right)} \newcommand{\expec}[1]{\mathbb{E}\left[#1\right]} \newcommand{\var}[1]{\text{Var}\left(#1\right)} \newcommand{\sd}[1]{\sigma\left(#1\right)} \newcommand{\cov}[1]{\text{Cov}\left(#1\right)} \newcommand{\cexpec}[2]{\mathbb{E}\left[#1 \vert#2 \right]} $$
8 Expectation
8.1 Definition and interpretation
In a relative frequency interpretation (discussed earlier in Section 4.1), suppose that we run \(n\) trials of an experiment, observing in each trial the outcome of some real-valued random variable \(X:\Omega\to\mathbb{R}\). Let \(x_i\) denote the observed value of \(X\) in the \(i\)th trial; the sequence of observations \(x_1\), \(x_2\), …, \(x_n\) is called a sample. The sample mean is then simply \(\frac{1}{n}\sum_{i=1}^n x_i\). As a mathematical idealization, we may suppose that there is a unique empirical limiting value for the sample mean as \(n\) tends to infinity, which we call the expectation of \(X\).
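For instance, repeatedly rolling a fair six-sided die and averaging the outcomes produces sample means that stabilize near \(3.5\). The short simulation below (a minimal sketch in Python using numpy, included purely for illustration) shows this stabilization:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate n rolls of a fair six-sided die and track the running sample mean.
n = 100_000
rolls = rng.integers(1, 7, size=n)            # observed values x_1, ..., x_n
running_mean = np.cumsum(rolls) / np.arange(1, n + 1)

for k in (10, 100, 1_000, 100_000):
    print(f"sample mean after {k:>7} trials: {running_mean[k - 1]:.4f}")
# The printed values settle towards 3.5, the expectation of X.
```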
In a betting interpretation (also discussed earlier), you can simply consider your ‘fair price’ for a bet which pays \(X\); that price is what we call your expectation of \(X\).
The idea of expectation is very interesting mathematically, and also provides ways to use probability in a host of practical applications. Regardless of interpretation, the expectation of \(X\) can be connected to the probability mass function \(p(x)\) (if \(X\) is discrete) or the probability density function \(f(x)\) (if \(X\) is continuously distributed) in the following way:
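\[\begin{aligned} \mathbb{E}[X] &:= \sum_{x} x \, p(x) &&\text{if $X$ is discrete, and} \\ \mathbb{E}[X] &:= \int_{-\infty}^{\infty} x \, f(x) \, \mathrm{d} x &&\text{if $X$ is continuously distributed,} \end{aligned} \tag{8.1}\] where the sum runs over all possible values \(x\) of \(X\), provided that the sum or integral exists.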
8.2 Expectation of functions of random variables
Let \(X\) be a discrete random variable with \(\pr {X \in \mathcal{X}}=1\) for a finite or countable set \(\mathcal{X}\), and let \(g: \mathcal{X}\to\mathbb{R}\) be a real-valued function. As seen in Section 6.11, \(g(X):= g\circ X\) is again a random variable. Indeed, \(g(X)\) is discrete, since \(\pr{ g(X) \in g(\mathcal{X}) } =1\) where \(g(\mathcal{X}) := \{ g(x) : x \in \mathcal{X} \}\) is finite or countable, and \(g(X)\) is a real-valued random variable, so we can define its expectation.
To find the expectation of \(g(X)\) via the definition in Equation 8.1, we would first need to find the probability mass function \(p_{g(X)}()\). It turns out, however, that we can express \(\mathbb{E}[g(X)]\) directly in terms of \(p_X()\), saving us the effort of having to calculate \(p_{g(X)}()\) from \(p_X()\).
For any \(y\in g(\mathcal{X})\), \[\begin{aligned} p_{g(X)}(y) &=\pr{g(X)=y} =\sum_{x\in \mathcal{X}}\cpr{g(X)=y}{X=x}\pr{X=x} =\sum_{x \in \mathcal{X} : g(x)=y} p(x), \end{aligned} \tag{8.3}\] since \[\cpr{ g(X) = y}{X=x} = \begin{cases} 1 & \text{if } y = g(x), \\ 0 &\text{otherwise} .\end{cases}\] It follows that \[\begin{aligned} \mathbb{E}[g(X)] &=\sum_{y\in g(\mathcal{X})} y \, p_{g(X)}(y) =\sum_{y\in g(\mathcal{X})} y \left(\sum_{x \in \mathcal{X} : g(x)=y}p(x)\right) \\ & =\sum_{y\in g(\mathcal{X})} \left(\sum_{x \in \mathcal{X} : g(x)=y}yp(x)\right) =\sum_{x\in \mathcal{X}} \left(\sum_{y\in g(\mathcal{X}):y=g(x)} y p(x)\right) \\ & =\sum_{x\in \mathcal{X}} \left(\sum_{y\in g(\mathcal{X}):y=g(x)} y \right) p(x) =\sum_{x\in \mathcal{X}} g(x)p(x), \end{aligned}\] where we applied the definition of expectation, Equation 8.3, distributivity, change of order of summation, and distributivity again. A similar result can be proven when \(X\) is continuously distributed. Concluding, we have the following result, which is sometimes known as the Law of the Unconscious Statistician:
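For any function \(g : \mathcal{X} \to \mathbb{R}\), \[\begin{aligned} \mathbb{E}[g(X)] &= \sum_{x\in \mathcal{X}} g(x) \, p(x) &&\text{if $X$ is discrete, and} \\ \mathbb{E}[g(X)] &= \int_{-\infty}^{\infty} g(x) \, f(x) \, \mathrm{d} x &&\text{if $X$ is continuously distributed,} \end{aligned}\] provided that the sum or integral exists.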
For multiple random variables, the Law of the Unconscious Statistician reads as follows:
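For any real-valued function \(g\) of two variables, \[\expec{g(X,Y)} = \sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} g(x,y) \, p(x,y)\] if \(X\) and \(Y\) are discrete, and \[\expec{g(X,Y)} = \iint\limits_{\mathbb{R}^2} g(x,y) \, f(x,y) \, \mathrm{d} x \, \mathrm{d} y\] if \(X\) and \(Y\) are jointly continuously distributed, provided that the double sum or double integral exists.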
8.3 Linearity of expectation
Remember that summation and integration are linear operators, i.e., \[\begin{aligned} \sum_i \alpha f(x_i) + \beta g(x_i) &= \alpha \sum_i f(x_i) + \beta \sum_i g(x_i), \end{aligned}\] and \[\begin{aligned} \int_A \left( \alpha f(x)+\beta g(x) \right) \, \mathrm{d} x &= \alpha \int_A f(x) \, \mathrm{d} x + \beta \int_A g(x) \, \mathrm{d} x . \end{aligned}\] Consequently,
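For any real-valued random variables \(X\) and \(Y\) on the same probability space and any real numbers \(\alpha\) and \(\beta\), \[\expec{\alpha X + \beta Y} = \alpha \mathbb{E}[X] + \beta \mathbb{E}[Y],\] provided that the expectations on the right-hand side exist.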
A similar, but deeper, result is the following.
8.4 Variance and covariance
As mentioned earlier, we can interpret the expectation of \(X\) as a long-run average of a sample from the distribution of \(X\). A popular and mathematically convenient way to measure the variability of \(X\)—i.e. to measure how much \(X\) varies around \(\mathbb{E}[X]\) in the long run—goes via the expectation of the random variable \((X-\mathbb{E}[X])^2\).
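The variance of \(X\) is defined as \[\var{X} := \expec{(X-\mathbb{E}[X])^2},\] provided that this expectation exists, and the standard deviation of \(X\) is defined as \[\sd{X} := \sqrt{\var{X}}.\]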
Note that both \(\var{X}\) and \(\sd{X}\) are non-negative numbers.
Using LOTUS, we can immediately derive the following expressions for the variance: \[\begin{aligned} \var{X} &= \sum_{x\in \mathcal{X}} (x - \mathbb{E}[X])^2 p(x) &&\text{if $X$ is discrete, and} \\ \var{X} &= \int_{-\infty}^{\infty} (x - \mathbb{E}[X])^2 f(x) \, \mathrm{d} x &&\text{if $X$ is continuously distributed,} \end{aligned}\] provided that the sum or integral exists.
For two real-valued random variables, we can ask ourselves how they vary jointly.
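The covariance of two real-valued random variables \(X\) and \(Y\) is defined as \[\cov{X,Y} := \expec{(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])},\] provided that this expectation exists.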
We also use the following qualitative terminology.
If \(\cov{X,Y}>0\) it means that \(X-\mathbb{E}[X]\) and \(Y-\mathbb{E}[Y]\) tend to have the same sign. That is, if \(X > \mathbb{E}[X]\) then it tends to be the case that \(Y > \mathbb{E}[Y]\) (and, similarly, if \(X < \mathbb{E}[X]\) then it tends to be the case that \(Y < \mathbb{E}[Y]\) too). In this case we say that \(X\) and \(Y\) are positively correlated.
If \(\cov{X,Y} <0\) we say that \(X\) and \(Y\) are negatively correlated. Now \(X - \mathbb{E}[X]\) and \(Y-\mathbb{E}[Y]\) tend to have opposite signs.
If \(\cov{X,Y}=0\) we say that \(X\) and \(Y\) are uncorrelated.
Note: uncorrelated is not the same as independent (more on this later). A quantification of the correlation is provided by the correlation coefficient, given by \[\rho ( X, Y) := \frac{\cov{X,Y}}{\sqrt{ \var{X} \var{Y} } }.\] It can be proved (see Exercises 8.x and 8.x) that \[-1 \leq \rho(X,Y) \leq 1 .\] We will see an example below where the correlation coefficient is 1.
Using LOTUS for multiple random variables, we immediately derive the following expressions for the covariance: \[\cov{X,Y} = \sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) p(x,y)\] if \(X\) and \(Y\) are discrete, and \[ \cov{X,Y} = \iint\limits_{\mathbb{R}^2} (x - \mathbb{E}[X])(y - \mathbb{E}[Y]) f(x,y) \, \mathrm{d} x \, \mathrm{d} y,\] if \(X\) and \(Y\) are jointly continuously distributed, provided that the double sum or double integral exists.
We note two simple but important properties.
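For any real-valued random variables \(X\) and \(Y\), \[\cov{X,Y} = \cov{Y,X} \qquad\text{and}\qquad \cov{X,X} = \var{X}.\]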
As immediate consequences of linearity of expectation (see Section 8.3), we obtain the formulæ:
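For any real-valued random variables \(X\), \(Y\), and \(Z\), and any real number \(\alpha\), \[\cov{\alpha X, Y} = \alpha \cov{X,Y}, \tag{8.6}\] \[\cov{X+Y, Z} = \cov{X,Z} + \cov{Y,Z}, \tag{8.7}\] \[\cov{X, Y+Z} = \cov{X,Y} + \cov{X,Z}. \tag{8.8}\]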
Note: Equation 8.6 through Equation 8.8 mean that \(\cov{}\) is a bilinear operator.
We also obtain a slightly different way of calculating the variance:
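\[\var{X} = \expec{X^2} - \left(\mathbb{E}[X]\right)^2.\]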
The next example shows how our various formulae can be put to good use.
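For example, let \(X\) be the outcome of a single roll of a fair six-sided die (a standard illustration, sketched here). By LOTUS, \[\mathbb{E}[X] = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = \tfrac{7}{2} \qquad\text{and}\qquad \expec{X^2} = \sum_{x=1}^{6} x^2 \cdot \tfrac{1}{6} = \tfrac{91}{6},\] so that, by the formula above, \[\var{X} = \expec{X^2} - \left(\mathbb{E}[X]\right)^2 = \tfrac{91}{6} - \tfrac{49}{4} = \tfrac{35}{12} \approx 2.92.\]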
We also obtain a slightly different way of calculating the covariance:
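\[\cov{X,Y} = \expec{XY} - \mathbb{E}[X]\,\mathbb{E}[Y].\]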
Similarly to the preceding results, we have the following.
Finally, we can now also say something about the variance of sums of random variables:
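For real-valued random variables \(X_1, \dots, X_n\), \[\var{\sum_{i=1}^n X_i} = \sum_{i=1}^n \var{X_i} + 2\sum_{1\leq i<j\leq n} \cov{X_i, X_j}.\]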
So, in general, the variance of a sum is not equal to the sum of the variances, unless all covariances are zero (zero covariance occurs under independence, covered later).
For example, for three real-valued random variables \(X\), \(Y\), and \(Z\), \[\var{X+Y+Z} = \var{X} + \var{Y} + \var{Z} + 2\left( \cov{X,Y} + \cov{X,Z} + \cov{Y,Z} \right). \]
8.5 Conditional expectation
We now turn to the expectation of a random variable given an event (such as the value of another random variable). Recall that the indicator random variable of an event \(A\) is given by \[𝟙\{A\}(\omega)= \begin{cases} 1&\text{if }\omega\in A, \\ 0&\text{otherwise}. \end{cases}\] Then \(\expec{𝟙\{A\}}= 1 \cdot \pr{A} + 0 \cdot \pr{A^\textrm{c}} = \pr{A}\).
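For any event \(B\) with \(\pr{B}>0\), the conditional expectation of \(X\) given \(B\) is defined as \[\cexpec{X}{B} := \frac{\expec{X \, 𝟙\{B\}}}{\pr{B}}.\]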
Conditional expectation generalizes the concept of conditional probability. For example, if \(A\) and \(B\) are any events such that \(\pr{B}>0\) then because \(𝟙\{A\}𝟙\{B\}=𝟙\{A\cap B\}\), it follows that \(\expec{𝟙\{A\}\mid B}=\expec{𝟙\{A\}𝟙\{B\}}/\pr{B}=\pr{A\cap B}/\pr{B}=\cpr{A}{B}\).
One may also view \(\cexpec{ \, \cdot \,}{A}\) as expectation with respect to the conditional probability \(\cpr{ \, \cdot \,}{A}\):
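For a discrete random variable \(X\), \[\cexpec{X}{A} = \sum_{x\in\mathcal{X}} x \, \cpr{X=x}{A}.\]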
It was no coincidence that the answers to parts (b) and (c) of the example above were the same. This is an example of the following important theorem.
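If \(B_1, B_2, \dots\) form a partition of \(\Omega\) with \(\pr{B_i}>0\) for each \(i\), then \[\mathbb{E}[X] = \sum_i \cexpec{X}{B_i} \pr{B_i}.\]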
A more demanding concept is the following:
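The conditional expectation of \(X\) given \(Y\), denoted \(\cexpec{X}{Y}\), is the random variable that takes the value \(\cexpec{X}{Y=y}\) whenever \(Y\) takes the value \(y\).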
You may see this written more compactly in books as \[\cexpec{X}{Y}(\omega) := \cexpec{X}{Y=Y(\omega)}, \text{ for all }\omega\in\Omega,\] but our definition above is a little easier to digest. In the case where \(Y\) is discrete, \(\cexpec{X}{Y}\) is a random variable that takes values \(\cexpec{X}{Y=y}\) with probabilities \(\pr{Y=y}\). We concentrate on the discrete case here.
The next result is of considerable importance.
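\[\expec{\cexpec{X}{Y}} = \mathbb{E}[X].\] In the discrete case, this follows by applying the previous theorem to the partition formed by the events \(\{Y=y\}\).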
This theorem is sometimes called the law of iterated expectation. We can use this result to calculate \(\mathbb{E}[X]\) in some tricky cases.
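For instance (a standard illustration, sketched here): roll a fair die to obtain \(Y\), then toss a fair coin \(Y\) times and let \(X\) be the number of heads. Then \(\cexpec{X}{Y=y} = y/2\), so \(\cexpec{X}{Y} = Y/2\) and \[\mathbb{E}[X] = \expec{\cexpec{X}{Y}} = \expec{Y/2} = \tfrac{1}{2}\cdot\tfrac{7}{2} = \tfrac{7}{4},\] without ever having to compute the distribution of \(X\) itself.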
8.6 Independence: multiplication rule for expectation
Remember that two discrete random variables are independent when their joint probability mass function factorizes, i.e. when \(p(x,y) = p(x) p(y)\). Similarly, two jointly continuously distributed random variables are independent when their joint probability density function factorizes, i.e. when \(f(x,y) = f(x) f(y)\). It turns out that in these cases, expectation factorizes as well:
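If \(X\) and \(Y\) are independent, then \[\expec{XY} = \mathbb{E}[X]\,\mathbb{E}[Y],\] provided that the expectations exist. In particular, combining this with the covariance formula above, independent random variables are uncorrelated: \(\cov{X,Y} = \expec{XY} - \mathbb{E}[X]\,\mathbb{E}[Y] = 0\).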
The converse of the last property is not true: \(X\) and \(Y\) may be dependent but uncorrelated.
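A standard counterexample, sketched here: let \(X\) take each of the values \(-1\), \(0\), \(1\) with probability \(1/3\), and let \(Y = X^2\). Then \(Y\) is completely determined by \(X\), so the two are clearly dependent; yet \(\mathbb{E}[X] = 0\) and \(\expec{XY} = \expec{X^3} = 0\), so \(\cov{X,Y} = \expec{XY} - \mathbb{E}[X]\,\mathbb{E}[Y] = 0\) and \(X\) and \(Y\) are uncorrelated.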
An important consequence of this Corollary is a simplification of the formula for the variance of a sum for pairwise independent random variables:
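If \(X_1, \dots, X_n\) are pairwise independent, then all the covariance terms vanish, so \[\var{\sum_{i=1}^n X_i} = \sum_{i=1}^n \var{X_i}.\]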
8.7 Expectation and probability inequalities
By the monotonicity properties of summation and integration, namely that if \(f(x) \geq g(x)\) for all \(x\) then \[\begin{aligned} \sum_i f(x_i) \geq \sum_i g(x_i), \text{ and } \int_A f(x) \, \mathrm{d} x \geq \int_A g(x) \, \mathrm{d} x , \end{aligned}\] we immediately get the following.
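If \(\pr{X \geq 0} = 1\), then \(\mathbb{E}[X] \geq 0\), provided that the expectation exists.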
For instance, suppose that \(X\) and \(Y\) have \(\pr { X \leq Y } = 1\). Then \(\pr{ Y-X \geq 0 } =1\) so \(\expec{Y - X} \geq 0\) and hence \(\mathbb{E}[X] \leq \mathbb{E}[Y]\).
This simple property has various interesting consequences:
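Markov’s inequality: if \(\pr{X \geq 0} = 1\), then for every \(a > 0\), \[\pr{X \geq a} \leq \frac{\mathbb{E}[X]}{a}.\] Chebyshev’s inequality: if \(X\) has finite variance, then for every \(a > 0\), \[\pr{\left|X - \mathbb{E}[X]\right| \geq a} \leq \frac{\var{X}}{a^2}.\]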
Markov and Chebyshev bounds are often quite loose when full distributional information is available, as seen in the next examples. Nevertheless, their generality and simplicity make these inequalities very valuable in complex probability calculations.
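For a concrete comparison, take \(X\) exponentially distributed with rate \(1\), so that \(\mathbb{E}[X] = \var{X} = 1\) and \(\pr{X \geq a} = e^{-a}\). The sketch below (Python with scipy, for illustration only; since \(X \geq a\) implies \(|X-1| \geq a-1\) for \(a>1\), Chebyshev is applied with threshold \(a-1\)) tabulates the exact tail probability against both bounds:

```python
from scipy import stats

# X ~ Exponential(rate 1): E[X] = 1 and Var(X) = 1.
X = stats.expon()

for a in (2, 4, 8):
    exact = X.sf(a)                 # exact tail probability P(X >= a) = e^{-a}
    markov = 1 / a                  # Markov: P(X >= a) <= E[X] / a
    chebyshev = 1 / (a - 1) ** 2    # Chebyshev: P(|X - 1| >= a - 1) <= Var(X) / (a - 1)^2
    print(f"a = {a}: exact = {exact:.4f}, Markov bound = {markov:.4f}, "
          f"Chebyshev bound = {chebyshev:.4f}")
```

For \(a = 8\), for example, the exact probability is about \(0.0003\), while Markov gives \(0.125\) and Chebyshev gives about \(0.0204\).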
8.8 Historical context
There are approaches to probability theory that start out from expectation of random variables directly, rather than starting out from probability of events as we have done here; see e.g. (Whittle 1992).
Pafnuty Chebyshev (1821–1894) and his student Andrei Markov (1856–1922) made several important contributions to early probability theory. What we call Markov’s inequality was actually published by Chebyshev, as was what we call Chebyshev’s inequality; our nomenclature is standard, and at least has the benefit of distinguishing the two. A version of Chebyshev’s inequality was first formulated by Irénée-Jules Bienaymé (1796–1878).

