$$ \newcommand{\pr}[1]{\mathbb{P}\left(#1\right)} \newcommand{\cpr}[2]{\mathbb{P}\left(#1\mid\,#2\right)} \newcommand{\expec}[1]{\mathbb{E}\left[#1\right]} \newcommand{\var}[1]{\text{Var}\left(#1\right)} \newcommand{\sd}[1]{\sigma\left(#1\right)} \newcommand{\cov}[1]{\text{Cov}\left(#1\right)} \newcommand{\cexpec}[2]{\mathbb{E}\left[#1 \vert#2 \right]} $$
6 Random variables
In many experiments we are often interested in a numerical value rather than the elementary event \(\omega \in \Omega\) per se: for example, in the financial industry we may not care about the behaviour of the stock price throughout a given period, only whether it reached a certain level or not; in weather forecasting we may not be interested in the detailed variation of atmospheric pressure and temperature, only in how much rain is going to fall, and so on. These uncertain quantities associated with random scenarios have as their mathematical idealization the concept of random variable.
Put simply, a random variable is a function or mapping of the sample space: for each \(\omega \in \Omega\), the random variable \(X\) gives the output \(X(\omega)\). In this chapter, we study both discrete and continuous univariate random variables, and discuss some important examples: binomial, geometric, Poisson, uniform, exponential, and normal distributions. To enable practical calculations, we also discuss cumulative distribution functions, standard tables, and how probabilities behave under transformations.
6.1 Definition and notation
Typically, we find ourselves in one of the following situations:
\(X(\Omega)\subseteq\mathbb{R}\), in which case we say that \(X\) is a real-valued random variable or a univariate random variable; or
\(X (\Omega) \subseteq\mathbb{R}^d\), in which case we say that \(X\) is a vector-valued random variable, or a multivariate random variable; in this case we can identify \(X\) with a vector \((X_1,\ldots,X_d)\) of real-valued random variables, where \(X_i(\omega):= [X(\omega)]_i\), that is, \(X_i(\omega)\) is the \(i\)th component of \(X(\omega)\).
For any \(B \subseteq X(\Omega)\), we write ‘\(X \in B\)’ to denote the event \(\{ \omega \in \Omega: X(\omega) \in B \}\). For any \(x\in X(\Omega)\), we write ‘\(X = x\)’ to mean the event \(X\in\{x\}\), that is, \(\{\omega \in \Omega: X(\omega) = x\}\). We sometimes also write \(\{X=x\}\) and \(\{X\in B\}\) to emphasize that these are sets.
For example, if we consider throwing two standard dice, with sample space \[\Omega=\{(i,j): i\in\{1,2,3,4,5,6\},\,j\in\{1,2,3,4,5,6\}\},\] then the sum of the numbers that show on the dice corresponds to a real-valued random variable \(X\) defined by: \[X(i,j):= i+j, \text{ for all } (i,j) \in \Omega.\]
Then the notation \(X=10\) denotes the event \(\{(4,6),(5,5),(6,4)\},\) and \(X\in[0,4]\) denotes the event \[\{(1,1),(1,2),(2,1),(1,3),(2,2),(3,1)\}.\]
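As a quick sanity check, the whole distribution of the dice-sum variable \(X\) can be computed by enumerating the 36 equally likely outcomes. The following is a small Python sketch, not part of the text itself:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes (i, j) and tally X(i, j) = i + j.
pmf = {}
for i, j in product(range(1, 7), repeat=2):
    x = i + j
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 36)

print(pmf[10])                          # P(X = 10): 3 outcomes out of 36, i.e. 1/12
print(sum(pmf[x] for x in (2, 3, 4)))   # P(X in [0, 4]): 6 outcomes out of 36, i.e. 1/6
```

The two printed values correspond exactly to the events \(X=10\) and \(X\in[0,4]\) listed above.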
A simple but important class of random variable is formed by the indicator random variables, denoted for an event \(A \in \mathcal{F}\) by \(𝟙_{A}\) and given by the mapping \[𝟙_{A}(\omega)= \begin{cases} 1&\text{if }\omega\in A, \\ 0&\text{otherwise}. \end{cases}\] Note that \(\pr{𝟙_{A} = 1} = \pr{ \{ \omega \in \Omega : \omega \in A \} } = \pr{A}\) and \(\pr{𝟙_{A}=0} = 1-\pr{A}\).
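The defining property \(\pr{𝟙_{A} = 1} = \pr{A}\) is easy to confirm by direct enumeration. Here is an illustrative sketch on the two-dice sample space, with \(A\) chosen (arbitrarily) to be the event that the first die shows a 6:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two-dice sample space, 36 outcomes

def indicator_A(w):
    # The indicator 1_A, with A = {first die shows a 6} (an illustrative choice).
    return 1 if w[0] == 6 else 0

p_one = Fraction(sum(indicator_A(w) for w in omega), len(omega))
print(p_one)       # P(1_A = 1) = 6/36 = 1/6, which is exactly P(A)
print(1 - p_one)   # P(1_A = 0) = 1 - P(A) = 5/6
```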
Recall the definition of probability from Section 1.5. Given a probability distribution \(\pr{}\) on the sample space \(\Omega\), a random variable \(X:\Omega\to X (\Omega)\) induces a probability distribution \(\mathbb{P}_X(\cdot)\) on the sample space \(X (\Omega)\) as follows.
6.2 Discrete random variables
The function \(\mathbb{P}_X(\cdot)\) tells us everything we might need to know about the random variable \(X\): it is called the distribution of \(X\). In general it is a large and unwieldy object, as we need a value of \(\mathbb{P}_X(B)\) for every \(B \subseteq X(\Omega)\). However, there are two special cases where we can give an efficient description of the distribution. The first is in the case of discrete random variables (the second, which we will see a little later, is the case of continuous random variables).
Recall that a set is countable if there exists a bijection between that set and a subset of the natural numbers \(\mathbb{N}\).
The fact that \(p(x) \in [0,1]\) is an immediate consequence of A1 and C4. Here are some further important properties of the probability mass function.
The probability mass function of a discrete random variable summarizes all information we have about \(X\). Specifically, it allows us to calculate the probability of every event of the form \(\{X\in B\}\).
In simple cases, the set \(\mathcal{X}\) is often just \(X(\Omega)\), but this definition is necessary to cover all cases. If the random variable under consideration is not clear from the context, we may write \(p_X(\cdot)\) for the probability mass function of \(X\).
6.3 The binomial and geometric distributions
Consider the following random experiment, called a binomial scenario:
A sequence of \(n\) trials will be carried out, where \(n\) is known in advance of the experiment.
Trials are independent.
Each trial has only two outcomes, usually denoted ‘success’ or ‘failure’.
Each trial succeeds with the same probability \(p\).
Consider the random variable \(X\), the total number of successes in the \(n\) trials.
The usual sample space \(\Omega\) for the binomial scenario is the set of all the possible length-\(n\) sequences of successes and failures; if we represent a success by 1 and a failure by 0, then each \(\omega \in \Omega\) is a string \(\omega = \omega_1 \omega_2 \cdots \omega_n\) with each \(\omega_i \in \{0,1\}\). The random variable \(X\) takes values in \(X(\Omega):=\{0,1,\dots,n\}\), which is finite, so \(X\) is a discrete random variable. As a mapping on the sample space, we have \(X(\omega) = \sum_{i=1}^n \omega_i\), the total number of 1s in the string.
For each \(x\in\{0,1,\dots,n\}\):
because trials are independent (see the definition of independence of multiple events in Section 3.3), every sequence \(\omega\in\Omega\) with exactly \(x\) successes and \(n-x\) failures has probability \(\pr{\{\omega\}}=p^x (1-p)^{n-x}\); and
there are \(\binom{n}{x}\) sequences with exactly \(x\) successes.
By (C7), we can sum the probabilities of the outcomes in the event \[\{X=x\} = \{ \omega \in\Omega : \sum_{i} \omega_i = x\}\] to obtain \[p(x) =\pr{X=x} =\sum_{\omega \in\Omega : \sum_{i} \omega_i = x}\pr{\{\omega\}}.\] Putting everything together, the probability mass function of \(X\) is given by the following, which we take as a definition.
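The counting argument above can be checked numerically for a small \(n\): enumerate all \(2^n\) strings, sum the probabilities of those with exactly \(x\) ones, and compare with \(\binom{n}{x} p^x (1-p)^{n-x}\). A Python sketch, with \(n\) and \(p\) chosen arbitrarily for illustration:

```python
from itertools import product
from math import comb, isclose

n, p = 5, 0.3   # illustrative choices of trial count and success probability

for x in range(n + 1):
    # Sum P({omega}) = p^x (1-p)^(n-x) over all strings omega with exactly x ones.
    by_enumeration = sum(
        p**sum(w) * (1 - p)**(n - sum(w))
        for w in product((0, 1), repeat=n)
        if sum(w) == x
    )
    # Closed form: C(n, x) strings, each with the same probability.
    by_formula = comb(n, x) * p**x * (1 - p)**(n - x)
    assert isclose(by_enumeration, by_formula)

print("enumeration agrees with the binomial formula for every x")
```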
In the case of just a single trial (\(n=1\)), the binomial scenario is often referred to as a Bernoulli trial, and \(X\sim\mathrm{Bin}(1,p)\) is often referred to as a Bernoulli random variable with parameter \(p\).
Suppose that we extend the binomial scenario indefinitely, to an unlimited number of trials, and we repeat the trials until we obtain the first success. The (random) number of trials up to and including the first success has the geometric distribution. Note that \[\begin{aligned} \pr{ \text{first success occurs on trial } n } & = \pr{ \text{first } n-1 \text{ trials are failures, then trial $n$ is a success} } \\ & = (1-p)^{n-1} p ,\end{aligned}\] by independence of the trials (see the definition of independence of multiple events in Section 3.3). Note that, provided \(p > 0\), this is a probability distribution on \(\{1,2,3,\ldots\}\) because, by the geometric series formula, \[\sum_{n=1}^\infty (1-p)^{n-1} p = p \cdot \frac{1}{1-(1-p)} = 1 .\] Thus we are led to the following definition.
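The geometric series calculation can be illustrated numerically: partial sums of \((1-p)^{n-1}p\) approach 1, and the tail probability \(\pr{\text{first success after trial } N}\) equals \((1-p)^N\). A sketch with an arbitrary illustrative value of \(p\):

```python
from math import isclose

p = 0.25   # illustrative success probability

# Partial sum of P(first success on trial n) = (1-p)^(n-1) * p for n = 1, ..., 199.
total = sum((1 - p)**(n - 1) * p for n in range(1, 200))
print(total)   # very close to 1, as the geometric series formula predicts

# The tail P(first success after trial N) is (1-p)^N:
N = 10
tail = 1 - sum((1 - p)**(n - 1) * p for n in range(1, N + 1))
assert isclose(tail, (1 - p)**N)
```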
6.4 The Poisson distribution
The Poisson distribution is used to model counts of events which occur randomly in time at a constant average rate \(r\) per unit time, under some natural assumptions. Specifically, \[\pr{\text{event occurs in }(x, x+h)} \approx r h, \text{ for small } h.\] If \(X\) is the count of the number of events over a period of length \(t\) then it has distribution \(\mathrm{Po}(rt)\). Thus, the interpretation of the parameter \(\lambda\) in \(\mathrm{Po}(\lambda)\) is as the average number of events: we will return to this more formally later. Typical applications are
calls to a telephone exchange,
radioactive decay events,
jobs at a printer queue,
accidents at busy traffic intersections,
earthquakes at a tectonic boundary,
fish biting at an angler’s line.
Another situation where the Poisson distribution arises is as an approximation to the binomial distribution when \(p\) is small and \(n\) is large, i.e., events are rare. More precisely, we have the following result.
This means that if \(X \sim \mathrm{Bin}(n,p),\) where \(n\) is large and \(p\) is small, then approximately \(X \sim \mathrm{Po}(np)\). As a rule of thumb, with \(p \leq 0.05\), we find \(n=20\) gives a reasonable approximation, while \(n = 100\) gives a good approximation. The approximation is useful when it allows us to sidestep calculating \(\binom{n}{x}\) for large values of \(n\) and \(x\).
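The quality of the approximation is easy to inspect directly by comparing the two probability mass functions. A Python sketch, with \(n\) and \(p\) chosen arbitrarily in the "rare events" regime:

```python
from math import comb, exp, factorial

def binom_pmf(n, p, x):
    # P(X = x) for X ~ Bin(n, p).
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(lam, x):
    # P(X = x) for X ~ Po(lam).
    return exp(-lam) * lam**x / factorial(x)

# Rare events: p small, n large, lambda = n * p.
n, p = 100, 0.02
for x in range(5):
    # The two columns agree to roughly two decimal places.
    print(x, round(binom_pmf(n, p, x), 4), round(poisson_pmf(n * p, x), 4))
```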
6.5 Continuous random variables
If \(X\) is continuously distributed, then taking \(a=b=x\) in Equation 6.1 we see that \(\pr{X=x}=0\) for all \(x\in\mathbb{R}\), and so for any \(a <b\) we have \[\pr{X\in[a,b]}=\pr{X\in[a,b)}=\pr{X\in(a,b]}=\pr{X\in(a,b)}.\] Roughly speaking, \(f(x)\) has the interpretation \[ \pr{X\in [x,x+ \,\mathrm{d} x]} = f(x) \,\mathrm{d} x, \text{ for all $x$ at which $f$ is continuous}. \tag{6.2}\] All of the continuous random variables that we will see in this course have a probability density function that is piecewise continuous.
If the random variable under consideration is not clear from the context, we may write \(f_X(\cdot)\) for the probability density function of \(X\).
The probability density function of a random variable determines its probability distribution for all events in a reasonable collection of subsets of \(\mathbb{R}\) (but not all subsets of \(\mathbb{R}\)). The relevant concept here is again that of a \(\sigma\)-algebra. We will not give the exact definition of this \(\sigma\)-algebra here, but events of the following type can be assigned probabilities.
Note that we can also evaluate probabilities of unbounded intervals: \[\begin{aligned} \pr{ -\infty < X \leq b} & = \int_{-\infty}^b f(x) \,\mathrm{d} x ;\\ \pr{ a \leq X < \infty } & = \int_a^\infty f(x) \,\mathrm{d} x ;\\ \pr{ -\infty < X < \infty } & = \int_{-\infty}^\infty f(x) \,\mathrm{d} x = 1 .\end{aligned}\] To see this, we can for example use (C9) (continuity along monotone limits) to get \[\begin{aligned} \pr { a \leq X < \infty} & = \pr { \cup_{n=1}^\infty \{ a \leq X \leq a+n \} } \\ & = \lim_{n \to \infty} \pr { a \leq X \leq a + n} \\ & = \lim_{n \to \infty} \int_a^{a+n} f(x) \,\mathrm{d} x \\ & = \int_a^\infty f(x) \,\mathrm{d} x ,\end{aligned}\] as claimed. In particular, we recover a version of (C10) for probability density functions:
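The monotone-limit argument can also be seen numerically: approximations to \(\int_a^{a+n} f(x)\,\mathrm{d}x\) increase towards the improper integral as \(n\) grows. A sketch using the (assumed, illustrative) density \(f(x)=e^{-x}\) for \(x\ge 0\), with \(a=1\), where the limit is \(e^{-a}\):

```python
from math import exp, isclose

def riemann(f, lo, hi, steps=100_000):
    # Midpoint Riemann sum approximating the integral of f over [lo, hi].
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h

f = lambda x: exp(-x)   # density of the Exp(1) distribution on x >= 0
a = 1.0

# P(a <= X <= a + n) increases with n towards P(a <= X < infinity) = e^(-a).
for n in (1, 2, 5, 10, 20):
    print(n, riemann(f, a, a + n))

assert isclose(riemann(f, a, a + 30), exp(-a), rel_tol=1e-6)
```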
There are random variables that are neither discrete nor continuous, but they do not arise in practice very often. Can you think how to construct one? We will return to this briefly in Section 6.9.
6.6 The uniform distribution
When \(X \sim \mathrm{U}(a,b)\) then \(X\) can take any value in the continuous range of values from \(a\) to \(b\) and the probability of finding \(X\) in any interval \([x,x+h]\subseteq [a,b]\) does not depend on \(x\).
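Since the density of \(\mathrm{U}(a,b)\) is constant on \([a,b]\), we get \(\pr{X\in[x,x+h]} = h/(b-a)\) regardless of where the interval sits. A quick numerical illustration (the values of \(a\), \(b\), \(h\) are arbitrary):

```python
from math import isclose

a, b = 2.0, 7.0   # illustrative endpoints of the uniform range
h = 0.5

def uniform_prob(lo, hi):
    # P(X in [lo, hi]) for X ~ U(a, b), assuming [lo, hi] lies inside [a, b].
    return (hi - lo) / (b - a)

# An interval of length h has the same probability wherever it sits in [a, b]:
probs = [uniform_prob(x, x + h) for x in (2.0, 3.3, 6.5)]
print(probs)   # each equals h / (b - a) = 0.1
assert all(isclose(q, h / (b - a)) for q in probs)
```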
6.7 The exponential distribution
Suppose a bell chimes randomly in time at rate \(\beta >0\). Let \(T > 0\) denote the time of the first chime (a random variable). The events \(\{\text{no chimes in } [0, \tau]\}\) and \(\{T > \tau\}\) are the same. We know that the number of chimes in the interval \([0,\tau]\) has the \(\mathrm{Po}(\beta \tau)\) distribution, and so the probability of no chimes is \(e^{-\beta \tau}\). Hence \[\pr{T > \tau} = e^{-\beta \tau} \quad\mbox{or}\quad \pr{T \leq \tau} = 1 - e^{-\beta \tau} \quad\text{for all } \tau \geq 0.\] As \(1 - e^{-\beta \tau} = \int_0^\tau \beta e^{-\beta t}\,\mathrm{d} t\), we see that \(T\) is a continuous random variable with probability density function \(f(t) = \beta e^{-\beta t}\) for all \(t\ge 0\).
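The identity \(1 - e^{-\beta\tau} = \int_0^\tau \beta e^{-\beta t}\,\mathrm{d}t\) can be confirmed numerically. A sketch with an arbitrary illustrative rate \(\beta\):

```python
from math import exp, isclose

beta = 1.5   # illustrative chime rate

def cdf_by_integration(tau, steps=200_000):
    # Midpoint Riemann sum for the integral of beta * e^(-beta * t) over [0, tau].
    h = tau / steps
    return sum(beta * exp(-beta * (k + 0.5) * h) for k in range(steps)) * h

for tau in (0.5, 1.0, 2.0):
    assert isclose(cdf_by_integration(tau), 1 - exp(-beta * tau), rel_tol=1e-8)
print("P(T <= tau) matches 1 - exp(-beta * tau)")
```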
6.8 The normal distribution
The normal distribution is one of the most important probability distributions for several reasons, some of which we will see later in this course. As fate would have it, its density is also a little more complicated.
The normal distribution is also sometimes called the Gaussian distribution. The probability density function of the normal distribution is bell shaped. Roughly speaking, the parameter \(\mu\) determines the location of the bell, and \(\sigma\) determines the spread of the bell: for smaller \(\sigma\), the bell is narrower. The relevant formal concepts are expectation and standard deviation, which will be introduced later on.
Unfortunately, there is no closed analytical form for \(\pr{X\in[a,b]}=\int_a^bf(x)\,\mathrm{d} x\) when \(X\) is normally distributed.
We can compute \(\pr{X \in [a,b]}\) by numerical integration, but we would like a quick reference to these computations. This looks unwieldy, since there are four parameters upon which \(\pr{X \in [a,b]}\) depends: \(a\), \(b\), \(\mu\), and \(\sigma\). We show by a series of steps how to reduce everything to a single parameter, so that values can be easily tabulated. The relevant concepts that we will need to carry out this simplification are the cumulative distribution function, and transformations of random variables. Both of these concepts are important for many other applications too, so we spend a little time on them.
6.9 Cumulative distribution functions
If the random variable under consideration is not clear from the context, we may write \(F_X\) for the cumulative distribution function of \(X\).
By (C1), it follows that \[\pr{X\in (a,b]}=\pr{X \leq b}- \pr{X \leq a} = F(b) - F(a).\] Now for a continuous random variable \(X\) this is the same as \(\pr{X \in (a,b)}\), \(\pr { X \in [a,b]}\), and so on, so knowledge of the cumulative distribution function is sufficient to calculate the probability of virtually any event of practical interest, that is, any finite union of intervals. This is also true for the discrete case, but is a little more complicated as it requires a limiting procedure. The continuous case is thus somewhat simpler, so we start with that.
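For instance, for a continuous random variable with the (illustrative) exponential cumulative distribution function \(F(x) = 1 - e^{-x}\) for \(x\ge 0\), interval probabilities are just differences of \(F\):

```python
from math import exp, isclose

def F(x):
    # CDF of the Exp(1) distribution (an illustrative choice).
    return 1 - exp(-x) if x >= 0 else 0.0

a, b = 0.5, 2.0
prob = F(b) - F(a)   # P(a < X <= b), which here also equals P(a <= X <= b) etc.
print(prob)          # equals e^(-0.5) - e^(-2.0)
assert isclose(prob, exp(-a) - exp(-b))
```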
Now let us move on to the case of a discrete random variable. Let us start with an example.
We state a result on the properties of the cumulative distribution function in general, that are already apparent from our examples in the special cases of discrete and continuous random variables.
We do not give the proof here, but you can try to prove this or consult the recommended text books.
We have already seen, in the discrete and continuous cases, that we can recover the probability mass function and probability density function, respectively, from the cumulative distribution function. In other words, the cumulative distribution function determines the distribution. This is true in general.
Now we can see that there are some cumulative distribution functions that correspond to random variables that are neither discrete nor continuous. For example, we might have some jumps but also some continuously increasing parts. There are even examples where the cumulative distribution function is continuous but no probability density function exists: these singular distributions are quite pathological and rarely occur in practice.
6.10 Standard normal tables
Because the standard normal distribution plays such a central role in many practical probability calculations, we use a special symbol to denote its probability density function and cumulative distribution function:1 \[\begin{aligned} \phi(z)&:= f(z)=\frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, & \Phi(z)&:= F(z)=\int_{-\infty}^z \phi(t)\,\mathrm{d} t.\end{aligned}\]
As already mentioned, \(\Phi\) has no closed analytical form; however, it can be tabulated. Such a tabulation is called a standard normal table. Because \(\phi(z) = \phi(-z)\), it follows that \(\Phi(z) = 1 - \Phi(-z)\), and so we only need to tabulate \(\Phi\) for non-negative values of \(z\). Some values that are often useful are:
| \(z\) | \(0\) | \(1.28\) | \(1.64\) | \(1.96\) | \(2.58\) |
|---|---|---|---|---|---|
| \(\Phi(z)\) | \(0.5\) | \(0.9\) | \(0.95\) | \(0.975\) | \(0.995\) |
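Although \(\Phi\) has no closed form in elementary functions, it can be expressed via the error function as \(\Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)\), which gives a way to reproduce the table entries using only Python's standard library (a sketch, not part of the original text):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF, written in terms of the error function erf.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Reproduce the tabulated values (to the table's rounding):
for z in (0, 1.28, 1.64, 1.96, 2.58):
    print(z, round(Phi(z), 4))

# The symmetry Phi(-z) = 1 - Phi(z):
assert abs(Phi(-1.96) - (1 - Phi(1.96))) < 1e-12
```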
We can use normal tables for \(\Phi\) to also calculate \(\pr{X\in[a,b]}\) for any \(X\sim\mathcal{N}(\mu,\sigma^2)\). To explain this, we need to look at transformations of random variables, introduced next.
6.11 Functions of random variables
Suppose \(X\colon\Omega\to X(\Omega)\) is a random variable, and \(g\colon X(\Omega)\to\mathcal{S}\) is some function. Then \(g(X)\) is also a random variable, namely the outcome to a ‘new experiment’ obtained by running the ‘old experiment’ to produce a value \(x\) for \(X\), and then evaluating \(g(x)\). Formally, as a function of \(\omega \in \Omega\), \(g(X):= g\circ X\), i.e., \[g(X)(\omega):= g(X(\omega))\text{ for all }\omega\in\Omega.\] For example, \[\pr{g(X)\in B}=\pr{ \{\omega\in\Omega\colon g(X(\omega))\in B\} } \text{ for all }B\subseteq\mathcal{S}.\]
There is one particularly important function which enables us to get the cumulative distribution function of any normally distributed random variable, using just the standard normal tables (i.e. \(\Phi\), the cumulative distribution function of the standard normal).
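To preview the key step: if \(X\sim\mathcal{N}(\mu,\sigma^2)\), then \(Z=(X-\mu)/\sigma\) is standard normal, so \(\pr{X\in[a,b]}=\Phi\!\left(\frac{b-\mu}{\sigma}\right)-\Phi\!\left(\frac{a-\mu}{\sigma}\right)\). A numerical sanity check of this claim (with \(\mu\), \(\sigma\), \(a\), \(b\) chosen arbitrarily, and \(\Phi\) computed via the error function):

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 10.0, 2.0   # illustrative parameters
a, b = 9.0, 13.0

# Standardize and read the answer off Phi:
by_standardizing = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# Compare with a direct Riemann sum of the N(mu, sigma^2) density over [a, b]:
f = lambda x: exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
steps = 100_000
h = (b - a) / steps
by_integration = sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

assert abs(by_standardizing - by_integration) < 1e-8
print(by_standardizing)   # P(9 <= X <= 13) for X ~ N(10, 4)
```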
6.12 Historical context
The normal distribution appeared already in work of de Moivre, and is sometimes known as the Gaussian distribution after Carl Friedrich Gauss (1777–1855). The name ‘normal distribution’ was applied by eugenicist and biometrician Sir Francis Galton (1822–1911) and statistician Karl Pearson (1857–1936) to mark the distribution’s ubiquity in biometric data.




There is a great deal of subtle and interesting mathematics on the subject of what functions are integrable over what sets. You may see some of this in the third year probability course. The Riemann integral that we use here is sufficient for integrating piecewise continuous functions over finite unions of intervals. Here, we will only consider continuous random variables which have a piecewise continuous probability density function. Other approaches to integration are required to deal with more general functions.
For instance, for infinite countable unions of intervals, we would need the Lebesgue integral (see for instance (Rosenthal 2007)). More precisely, \(f(\cdot)\) still determines the value of \(\pr{X\in B}\) when \(B\) is an infinite countable union of intervals, but that value is not necessarily given by the Riemann integral.
The treatment of discrete and continuous random variables separately is a little irksome. A general treatment of random variables, which covers both cases, as well as cases that are neither discrete nor continuous, in a unified setting, requires the mathematical framework of measure theory; you will see some of this if you take later probability courses.
In the programming language R, \(\phi\) is the function `dnorm`, and \(\Phi\) is the function `pnorm`.↩︎