A Some probability revision
(Original version produced by A.R.Wade.)
Probability spaces and random variables
A probability space (\Omega,\cF,\Pr) consists of
- the sample space \Omega (the set of all possible outcomes);
- a \sigma-algebra \cF of events;
- a probability measure \Pr, which assigns to each event A \in \cF a number \Pr (A) satisfying the probability axioms.
To say that \cF is a \sigma-algebra means that \cF is a collection of subsets of \Omega such that:
- \Omega \in \cF.
- If A \in \cF then its complement A^\mathrm{c} \in \cF too.
- If A_1, A_2, \ldots \in \cF then \bigcup_{n=1}^\infty A_n \in \cF too.
A random variable X is a function X : \Omega \to \R which is \cF-measurable, meaning that \{ \omega \in \Omega : X(\omega) \leq x \} \in \cF \text{ for all } x \in \R . An important example is the indicator random variable \1_A of an event A \in \cF, defined by \1_A ( \omega ) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A . \end{cases} If \cG \subseteq \cF is another (smaller) \sigma-algebra, then a random variable X is \cG-measurable if \{ \omega \in \Omega : X(\omega) \leq x \} \in \cG \text{ for all } x \in \R .
Roughly speaking, X is \cG-measurable if “knowing” \cG means “knowing” X.
Example. Consider the sample space for a die roll \Omega = \{1,2,3,4,5,6\}. Take \cF = 2^\Omega, the power set of \Omega (the set of all subsets). Take \cG \subset \cF given by \cG = \bigl\{ \emptyset, E , E^\mathrm{c}, \Omega \bigr\} , where E = \{ 2,4,6\} is the event that the score is even, and its complement E^\mathrm{c} = \{1,3,5\} is the event that the score is odd. Note that \cG is a \sigma-algebra. Take random variables X(\omega) = \omega (the score) and Y(\omega) = \1_E (\omega) (the indicator that the score is even).
Both X and Y are clearly \cF-measurable.
Moreover, Y is \cG-measurable, since e.g., \{ \omega : Y (\omega) \leq 1/2 \} = E^\mathrm{c} \in \cG, but X is not \cG-measurable, since e.g., \{ \omega : X (\omega ) \leq 1 \} = \{ 1 \} \notin \cG.
The normal distribution
A real-valued random variable X has the normal distribution with mean \mu and variance \sigma^2 if it is a continuous random variable with probability density function f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ - \frac{(x-\mu)^2}{2\sigma^2} \right\}, \text{ for } x \in \R . We write X \sim \mathcal{N} (\mu,\sigma^2).
The case \mu =0, \sigma^2 = 1 is the standard normal distribution Z \sim \mathcal{N} (0,1). We usually write the density of the standard normal as \phi, and its cumulative distribution function is N ( x ) = \Pr ( Z \leq x ) = \int_{-\infty}^x \phi (y) \, \mathrm{d} y . If X \sim \mathcal{N} (\mu,\sigma^2), then \alpha + \beta X \sim \mathcal{N} (\alpha + \beta \mu, \beta^2 \sigma^2 ). In particular, if X \sim \mathcal{N} (\mu,\sigma^2), then \frac{X - \mu}{\sigma} \sim \mathcal{N} (0,1) .
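For example (with purely illustrative numbers), if X \sim \mathcal{N} (2, 9), then standardizing as above gives \Pr ( X \leq 5 ) = \Pr \left( \frac{X - 2}{3} \leq \frac{5-2}{3} \right) = \Pr ( Z \leq 1 ) = N(1) \approx 0.8413 .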
The moment generating function of Z \sim \mathcal{N} (0,1) is M_Z (t) = \Exp \left( \mathrm{e}^{t Z} \right) = \mathrm{e}^{t^2 /2}, for t \in \R.
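A quick sketch of why this holds: completing the square in the exponent, M_Z (t) = \int_{-\infty}^\infty \mathrm{e}^{tz} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-z^2/2} \, \mathrm{d} z = \mathrm{e}^{t^2/2} \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-(z-t)^2/2} \, \mathrm{d} z = \mathrm{e}^{t^2/2} , since the final integrand is the density of \mathcal{N} (t,1) and so integrates to 1.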
You may not all have seen the multivariate normal distribution, although it is important in most statistics courses. A random vector X \in \R^k has the k-dimensional normal distribution with mean vector \mu \in \R^k and covariance matrix \Sigma (a k \times k symmetric, positive-definite matrix) if it has the k-dimensional probability density function f(x) = \frac{1}{\sqrt{(2 \pi)^k \det \Sigma}} \exp \left\{ - \frac{(x-\mu)^\tra \Sigma^{-1} (x-\mu)}{2} \right\}, \text{ for } x \in \R^k . We write X \sim \mathcal{N}_k (\mu, \Sigma).
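As an illustration, in the bivariate case k = 2, writing \sigma_1^2, \sigma_2^2 for the variances of the two components and \rho \in (-1,1) for their correlation (illustrative parameters, not notation used elsewhere in these notes), \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} , \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} , and each component is itself normal: X_1 \sim \mathcal{N} (\mu_1, \sigma_1^2) and X_2 \sim \mathcal{N} (\mu_2, \sigma_2^2).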
The central limit theorem
Let X_1, X_2, \ldots be independent, identically distributed (i.i.d.) random variables on a probability space (\Omega, \cF, \Pr). Set S_0 = 0 and S_n = \sum_{i=1}^n X_i for n \geq 1.
The central limit theorem says that if \Exp ( X_i ) = \mu and \mathbb{V}\mathrm{ar} (X_i) = \sigma^2 \in (0,\infty), then \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \tod \mathcal{N} (0,1) . In other words, \lim_{n \to \infty} \Pr \left[ \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \leq z \right] = N (z) \text{ for all } z \in \R .
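As an illustrative application (with made-up numbers), suppose the X_i are fair coin flips with \Pr ( X_i = 1 ) = \Pr ( X_i = 0 ) = 1/2, so that \mu = 1/2 and \sigma^2 = 1/4. Then for n = 100, \Pr ( S_{100} \leq 55 ) = \Pr \left( \frac{S_{100} - 50}{\sqrt{25}} \leq 1 \right) \approx N(1) \approx 0.84 .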
Conditional expectation
For a random variable X and an event A \in \cF, \Exp ( X \mid A ) = \frac{\Exp ( X \1_A )}{\Pr (A) } is the conditional expectation of X given A.
If Y is another random variable, then taking A= \{ Y = y\} we can define \Exp ( X \mid Y = y ) (at least if \Pr (Y = y) >0, and more generally in certain cases).
We can define the random variable \Exp (X \mid Y) by \Exp (X \mid Y ) = g (Y), \text{ where } g(y ) = \Exp ( X \mid Y = y ) .
Theorem. \Exp ( \Exp (X \mid Y ) ) = \Exp (X).
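A sketch of why this holds when Y is discrete (assuming the interchange of sum and expectation is justified): writing g(y) = \Exp ( X \mid Y = y ) = \Exp ( X \1_{\{Y=y\}} ) / \Pr ( Y = y ), \Exp ( \Exp (X \mid Y ) ) = \sum_y g(y) \Pr ( Y = y ) = \sum_y \Exp \left( X \1_{\{Y=y\}} \right) = \Exp \left( X \sum_y \1_{\{Y=y\}} \right) = \Exp (X) , since the events \{ Y = y \} partition \Omega, so that \sum_y \1_{\{Y=y\}} = 1.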
Example. As in the previous example, take \Omega = \{1,2,3,4,5,6\} and E = \{2,4,6\}, and define random variables X(\omega) = \omega and Y(\omega) = \1_E (\omega). Then
\begin{align*} \Exp ( X \mid Y = 0 ) & = \sum_x x \Pr ( X= x \mid Y = 0 ) = \frac{1+3+5}{3} = 3 ,\\ \Exp ( X \mid Y = 1 ) & = \sum_x x \Pr ( X= x \mid Y = 1 ) = \frac{2+4+6}{3} = 4 . \end{align*}
So the random variable \Exp (X \mid Y) is given by \Exp (X \mid Y ) = \begin{cases} 3 & \text{if } Y = 0,\\ 4 & \text{if } Y = 1. \end{cases} A compact way to write this is \Exp (X \mid Y ) = 3 + Y.
So, by the theorem, \Exp (X) = \Exp ( \Exp (X \mid Y ) ) = \Exp ( 3 + Y) = 7/2 (as you would expect).
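Indeed, computing directly from the distribution of the score, \Exp (X) = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = \frac{7}{2}, agreeing with the theorem.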
Convergence of random variables
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). We say that X_n converges to X almost surely, written X_n \toas X, if \Pr \left( \bigl\{ \omega: \lim_{n \to \infty} X_n (\omega) = X(\omega) \bigr\} \right) = 1. Another way to say the same thing: if N_\eps (a random quantity, possibly infinite) is the smallest integer such that | X_n - X | \leq \eps \text{ for all } n \geq N_\eps , then X_n \to X a.s. if and only if \Pr \left( N_\eps < \infty \text{ for all } \eps > 0 \right) = 1.
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). We say that X_n converges to X in probability, written X_n \toP X, if \lim_{n \to \infty} \Pr \left( | X_n - X | > \eps \right) = 0 \text{ for all } \eps > 0 .
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). For q \geq 1, we say X_n converges to X in L^q if \lim_{n \to \infty} \Exp \left[ | X_n - X |^q \right] = 0 .
Let X and X_0, X_1, X_2, \ldots be random variables (not necessarily on the same probability space) with cumulative distribution functions F (x) = \Pr ( X \leq x) and F_n (x) = \Pr ( X_n \leq x ). We say X_n converges to X in distribution, written X_n \tod X, if \lim_{n \to \infty} F_n(x) = F(x) \text{ for all } x \text{ at which } F \text{ is continuous.}
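For example, the central limit theorem stated above is exactly a statement of convergence in distribution: it says that (S_n - n\mu)/\sqrt{n \sigma^2} \tod Z, where Z \sim \mathcal{N} (0,1) has the (everywhere continuous) distribution function N.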
It can be shown that
- X_n \to X a.s. implies that X_n \toP X.
- X_n \to X in L^q implies that X_n \toP X.
- X_n \toP X implies that X_n \tod X.
Example. To prove the first implication, note that \Pr ( | X_n - X | > \eps ) \leq \Pr (n < N_\eps ). The events \{ n < N_\eps \} decrease to \{ N_\eps = \infty \} as n \to \infty, so by continuity of probability \lim_{n \to \infty} \Pr \left( | X_n - X | > \eps \right) \leq \lim_{n \to \infty} \Pr (n < N_\eps ) = \Pr ( N_\eps = \infty ) = 0 , since almost sure convergence gives \Pr ( N_\eps < \infty ) = 1 for each \eps > 0. Hence X_n \toP X.
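A similar short sketch works for the second implication, using Markov's inequality: for any \eps > 0, \Pr \left( | X_n - X | > \eps \right) = \Pr \left( | X_n - X |^q > \eps^q \right) \leq \frac{ \Exp \left[ | X_n - X |^q \right] }{\eps^q} \to 0 as n \to \infty whenever X_n \to X in L^q.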