A Some probability revision
(Original version produced by A.R.Wade.)
Probability spaces and random variables
A probability space (\Omega,\cF,\Pr) consists of
- the sample space \Omega (the set of all possible outcomes);
- a \sigma-algebra \cF of events;
- a probability measure \Pr, which assigns to each event A \in \cF a number \Pr (A) satisfying the probability axioms.
To say that \cF is a \sigma-algebra means that \cF is a collection of subsets of \Omega such that:
- \Omega \in \cF.
- If A \in \cF then its complement A^\mathrm{c} \in \cF too.
- If A_1, A_2, \ldots \in \cF then \bigcup_{n=1}^\infty A_n \in \cF too.
A random variable X is a function X : \Omega \to \R which is \cF-measurable, meaning that \{ \omega \in \Omega : X(\omega) \leq x \} \in \cF \text{ for all } x \in \R . An important example is the indicator random variable \1_A of an event A \in \cF, defined by \1_A ( \omega ) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A . \end{cases} If \cG \subseteq \cF is another (smaller) \sigma-algebra, then a random variable X is \cG-measurable if \{ \omega \in \Omega : X(\omega) \leq x \} \in \cG \text{ for all } x \in \R .
Roughly speaking, X is \cG-measurable if “knowing” \cG means “knowing” X.
Example. Consider the sample space for a die roll \Omega = \{1,2,3,4,5,6\}. Take \cF = 2^\Omega, the power set of \Omega (the set of all subsets). Take \cG \subset \cF given by \cG = \bigl\{ \emptyset, E , E^\mathrm{c}, \Omega \bigr\} , where E = \{ 2,4,6\} is the event that the score is even, and its complement E^\mathrm{c} = \{1,3,5\} is the event that the score is odd. Note that \cG is a \sigma-algebra. Take random variables X(\omega) = \omega (the score) and Y(\omega) = \1_E (\omega) (the indicator that the score is even).
Both X and Y are clearly \cF-measurable.
Moreover, Y is \cG-measurable, since e.g., \{ \omega : Y (\omega) \leq 1/2 \} = E^\mathrm{c} \in \cG, but X is not \cG-measurable, since e.g., \{ \omega : X (\omega ) \leq 1 \} = \{ 1 \} \notin \cG.
The normal distribution
A real-valued random variable X has the normal distribution with mean \mu and variance \sigma^2 if it is a continuous random variable with probability density function f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ - \frac{(x-\mu)^2}{2\sigma^2} \right\}, \text{ for } x \in \R . We write X \sim \mathcal{N} (\mu,\sigma^2).
The case \mu =0, \sigma^2 = 1 is the standard normal distribution Z \sim \mathcal{N} (0,1). We usually write the density of the standard normal as \phi, and its cumulative distribution function is N ( x ) = \Pr ( Z \leq x ) = \int_{-\infty}^x \phi (y) \, \mathrm{d} y . If X \sim \mathcal{N} (\mu,\sigma^2), then \alpha + \beta X \sim \mathcal{N} (\alpha + \beta \mu, \beta^2 \sigma^2 ). In particular, if X \sim \mathcal{N} (\mu,\sigma^2), then \frac{X - \mu}{\sigma} \sim \mathcal{N} (0,1) .
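For example (with purely illustrative numbers), if X \sim \mathcal{N} (2, 9), then standardizing as above gives \Pr ( X \leq 5 ) = \Pr \left( \frac{X - 2}{3} \leq \frac{5-2}{3} \right) = \Pr ( Z \leq 1 ) = N(1) \approx 0.8413 .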
The moment generating function of Z \sim \mathcal{N} (0,1) is M_Z (t) = \Exp \left( \mathrm{e}^{t Z} \right) = \mathrm{e}^{t^2 /2}, for t \in \R.
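A quick sketch of why this holds: completing the square in the exponent, M_Z (t) = \int_{-\infty}^\infty \mathrm{e}^{tz} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-z^2/2} \, \mathrm{d} z = \mathrm{e}^{t^2/2} \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-(z-t)^2/2} \, \mathrm{d} z = \mathrm{e}^{t^2/2} , since the final integrand is the density of \mathcal{N} (t,1) and so integrates to 1.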
You may not all have seen the multivariate normal distribution, although it is important in most statistics courses. A random vector X \in \R^k has the k-dimensional normal distribution with mean vector \mu \in \R^k and covariance matrix \Sigma (a k \times k symmetric, positive-definite matrix) if it has the k-dimensional probability density function f(x) = \frac{1}{\sqrt{(2 \pi)^k \det \Sigma}} \exp \left\{ - \frac{(x-\mu)^\tra \Sigma^{-1} (x-\mu)}{2} \right\}, \text{ for } x \in \R^k . We write X \sim \mathcal{N}_k (\mu, \Sigma).
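As an illustration, in the bivariate case k = 2, writing \sigma_1^2, \sigma_2^2 for the variances of the two components and \rho \in (-1,1) for their correlation (illustrative parameters, not notation used elsewhere in these notes), \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} , \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} , and each component is itself normal: X_1 \sim \mathcal{N} (\mu_1, \sigma_1^2) and X_2 \sim \mathcal{N} (\mu_2, \sigma_2^2).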
The central limit theorem
Let X_1, X_2, \ldots be independent, identically distributed (i.i.d.) random variables on a probability space (\Omega, \cF, \Pr). Set S_0 = 0 and S_n = \sum_{i=1}^n X_i for n \geq 1.
The central limit theorem says that if \Exp ( X_i ) = \mu and \mathbb{V}\mathrm{ar} (X_i) = \sigma^2 \in (0,\infty), then \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \tod \mathcal{N} (0,1) . In other words, \lim_{n \to \infty} \Pr \left[ \frac{ S_n - n \mu}{\sqrt{n \sigma^2}} \leq z \right] = N (z) \text{ for all } z \in \R .
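As an illustrative application (with made-up numbers), suppose the X_i are fair coin flips with \Pr ( X_i = 1 ) = \Pr ( X_i = 0 ) = 1/2, so that \mu = 1/2 and \sigma^2 = 1/4. Then for n = 100, \Pr ( S_{100} \leq 55 ) = \Pr \left( \frac{S_{100} - 50}{\sqrt{25}} \leq 1 \right) \approx N(1) \approx 0.84 .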
Conditional expectation
For a random variable X and an event A \in \cF, \Exp ( X \mid A ) = \frac{\Exp ( X \1_A )}{\Pr (A) } is the conditional expectation of X given A.
If Y is another random variable, then taking A= \{ Y = y\} we can define \Exp ( X \mid Y = y ) (at least if \Pr (Y = y) >0, and more generally in certain cases).
We can define the random variable \Exp (X \mid Y) by \Exp (X \mid Y ) = g (Y), \text{ where } g(y ) = \Exp ( X \mid Y = y ) .
Theorem. \Exp ( \Exp (X \mid Y ) ) = \Exp (X).
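A sketch of why this holds when Y is discrete (assuming the interchange of sum and expectation is justified): writing g(y) = \Exp ( X \mid Y = y ) = \Exp ( X \1_{\{Y=y\}} ) / \Pr ( Y = y ), \Exp ( \Exp (X \mid Y ) ) = \sum_y g(y) \Pr ( Y = y ) = \sum_y \Exp \left( X \1_{\{Y=y\}} \right) = \Exp \left( X \sum_y \1_{\{Y=y\}} \right) = \Exp (X) , since the events \{ Y = y \} partition \Omega, so that \sum_y \1_{\{Y=y\}} = 1.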
Example. As in the previous example, take \Omega = \{1,2,3,4,5,6\} and E = \{2,4,6\}, and define random variables X(\omega) = \omega and Y(\omega) = \1_E (\omega). Then
\begin{align*} \Exp ( X \mid Y = 0 ) & = \sum_x x \Pr ( X= x \mid Y = 0 ) = \frac{1+3+5}{3} = 3 ,\\ \Exp ( X \mid Y = 1 ) & = \sum_x x \Pr ( X= x \mid Y = 1 ) = \frac{2+4+6}{3} = 4 . \end{align*}
So the random variable \Exp (X \mid Y) is given by \Exp (X \mid Y ) = \begin{cases} 3 & \text{if } Y = 0,\\ 4 & \text{if } Y = 1. \end{cases} A compact way to write this is \Exp (X \mid Y ) = 3 + Y.
So, by the theorem, \Exp (X) = \Exp ( \Exp (X \mid Y ) ) = \Exp ( 3 + Y) = 7/2 (as you would expect).
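Indeed, computing directly from the distribution of the score, \Exp (X) = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = \frac{7}{2}, agreeing with the theorem.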
Convergence of random variables
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). We say that X_n converges to X almost surely, written X_n \toas X, if \Pr \left( \bigl\{ \omega: \lim_{n \to \infty} X_n (\omega) = X(\omega) \bigr\} \right) = 1. Another way to say the same thing: if N_\eps (a random quantity, possibly infinite) is the smallest integer such that | X_n - X | \leq \eps \text{ for all } n \geq N_\eps , then X_n \to X a.s. if and only if \Pr \left( N_\eps < \infty \text{ for all } \eps > 0 \right) = 1.
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). We say that X_n converges to X in probability, written X_n \toP X, if \lim_{n \to \infty} \Pr \left( | X_n - X | > \eps \right) = 0 \text{ for all } \eps > 0 .
Let X and X_0, X_1, X_2, \ldots be random variables on a probability space (\Omega, \cF, \Pr). For q \geq 1, we say X_n converges to X in L^q if \lim_{n \to \infty} \Exp \left[ | X_n - X |^q \right] = 0 .
Let X and X_0, X_1, X_2, \ldots be random variables (not necessarily on the same probability space) with cumulative distribution functions F (x) = \Pr ( X \leq x) and F_n (x) = \Pr ( X_n \leq x ). We say X_n converges to X in distribution, written X_n \tod X, if \lim_{n \to \infty} F_n(x) = F(x) \text{ for all } x \text{ at which } F \text{ is continuous.}
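For example, the central limit theorem stated above is exactly a statement of convergence in distribution: it says that (S_n - n\mu)/\sqrt{n \sigma^2} \tod Z, where Z \sim \mathcal{N} (0,1) has the (everywhere continuous) distribution function N.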
It can be shown that
- X_n \to X a.s. implies that X_n \toP X.
- X_n \to X in L^q implies that X_n \toP X.
- X_n \toP X implies that X_n \tod X.
Example. To prove the first implication, note that \Pr ( | X_n - X | > \eps ) \leq \Pr (n < N_\eps ). The events \{ n < N_\eps \} decrease to \{ N_\eps = \infty \} as n \to \infty, so by continuity of probability \lim_{n \to \infty} \Pr \left( | X_n - X | > \eps \right) \leq \lim_{n \to \infty} \Pr (n < N_\eps ) = \Pr ( N_\eps = \infty ) = 0 , since almost sure convergence gives \Pr ( N_\eps < \infty ) = 1 for each \eps > 0. Hence X_n \toP X.
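A similar short sketch works for the second implication, using Markov's inequality: for any \eps > 0, \Pr \left( | X_n - X | > \eps \right) = \Pr \left( | X_n - X |^q > \eps^q \right) \leq \frac{ \Exp \left[ | X_n - X |^q \right] }{\eps^q} \to 0 as n \to \infty whenever X_n \to X in L^q.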