Stochastic Processes (2021/22)
Michaelmas term

1 Introduction

Here we shall briefly recall the most important facts from basic probability theory; for a more detailed recap, check your notes from the previous years.

1.1 Probability revision

Probability triple

A probability triple is a collection \(\bigl(\Omega,{\mathcal{F}},{\mathsf{P}}\bigr)\), where:

  • \(\Omega\) is a sample space (eg., the collection \(\bigl\{{\mathsf{H}},{\mathsf{T}}{\mathsf{H}},{\mathsf{T}}{\mathsf{T}}{\mathsf{H}},\dots\bigr\}\) of all possible results in a coin tossing experiment until the first occurrence of \({\mathsf{H}}\));

  • the \(\sigma\)-field \({\mathcal{F}}\) is the collection of all events under consideration (more precisely, it is a set of subsets of \(\Omega\) satisfying (i) \(\varnothing\in {\mathcal{F}}\), (ii) \(A\in {\mathcal{F}}\Rightarrow \Omega\setminus A\in {\mathcal{F}}\), and (iii) if \(A_k\in {\mathcal{F}}\) for all \(k\ge 1\), then \(\cup_{k\ge 1} A_k\in {\mathcal{F}}\)); and

  • the \(\sigma\)-additive probability measure \({\mathsf{P}}:{\mathcal{F}}\to[0,1]\) assigns probabilities to all these events in a consistent way. That is, \[{\mathsf{P}}:A\mapsto{\mathsf{P}}(A)\in[0,1]\,,\qquad {\mathsf{P}}(\varnothing)=0\,,\qquad{\mathsf{P}}(\Omega)=1, \text{ and }\] \[A_k\in{\mathcal{F}}\text{ for all } k\ge 1 \text{ and } A_j\cap A_k=\varnothing \text{ for all } j\ne k\quad \Rightarrow \quad {\mathsf{P}}(\cup_{k\ge 1} A_k)=\sum_{k\ge 1} {\mathsf{P}}(A_k).\]

One immediate consequence is monotonicity: \[C, D \in{\mathcal{F}}\,\quad\text{ and }\quad C\subseteq D\qquad\Longrightarrow\qquad{\mathsf{P}}(C)\le{\mathsf{P}}(D)\,.\] Another important property is the following continuity result:
If the events \(A_k\in{\mathcal{F}}\), \(k\ge1\), form a monotone increasing sequence, ie., \(A_k\subseteq A_{k+1}\) for all \(k\ge1\), then we say that \[A_k \nearrow A \stackrel{{\sf def}}{=}\bigcup\limits_{j\ge1}A_j \text{ as } k\to \infty.\] We then have that \[{\mathsf{P}}(A)=\lim_{k\to\infty}{\mathsf{P}}(A_k)\,.\] The proof is a straightforward but instructive exercise. By taking complements, one deduces an analogous result for monotone decreasing sequences \(B_k\in{\mathcal{F}}\), \(k\ge1\), of events, ie., satisfying \(B_k\supseteq B_{k+1}\) for all \(k\ge1\).

In the standard coin flipping experiment, let \(B_k\) be the event \(\{\text{the first $k$ results are ${\mathsf{T}}$}\}\,;\) then \(B_k\searrow B\equiv\{\text{all results are ${\mathsf{T}}$}\}\). If the coin shows \({\mathsf{H}}\) with probability \(p>0\), then \({\mathsf{P}}(B_k)=(1-p)^k\searrow0\) as \(k\to\infty\) and, by continuity, \({\mathsf{P}}(B)=0\).
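This computation is easy to confirm numerically. The following is a minimal Monte Carlo sketch (the function name and parameter choices are ours, not from the notes): it estimates \({\mathsf{P}}(B_k)=(1-p)^k\), which shrinks towards \(0\) as \(k\) grows, in line with the continuity argument.

```python
import random

def all_tails_first_k(p, k, trials=100_000, seed=0):
    """Monte Carlo estimate of P(B_k) = P(first k flips all show T),
    for a coin showing H with probability p."""
    rng = random.Random(seed)
    hits = sum(all(rng.random() >= p for _ in range(k)) for _ in range(trials))
    return hits / trials

# P(B_k) = (1-p)^k decreases to 0; by continuity, P(all flips show T) = 0.
est = all_tails_first_k(0.5, 5)   # exact value: (1/2)^5 = 0.03125
```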

In the setup of Example [exmle:1718-00-seeing-only-same-coin-side-probability], let \(A_k\) be the event \[\{\text{no ${\mathsf{H}}$ observed from $k^{\text{th}}$ flip onwards}\}.\] Show that \({\mathsf{P}}(A_k)=0\) for all \(k\ge1\). Verify that \(\bigl(A_k\bigr)_{k\ge1}\) is a monotone sequence of events with the limit \[A\equiv\bigl\{\text{at most finitely many ${\mathsf{H}}$ observed}\,\bigr\}\,.\] Use the continuity property of probability measures to deduce that \({\mathsf{P}}(A)=0\).

Conditional probability

If \(A\), \(B\) are events (ie., \(A\), \(B\in{\mathcal{F}}\)) with \({\mathsf{P}}(B)>0\), then the conditional probability of \(A\) given \(B\) is \[{\mathsf{P}}\bigl(\,A\mid B\,\bigr)\stackrel{{\sf def}}{=}\frac{{\mathsf{P}}(A\cap B)}{{\mathsf{P}}(B)}\,.\] Notice that \({\mathsf{P}}(\,\cdot\mid B\,):{\mathcal{F}}\to[0,1]\) is a probability measure.

Formula of total probability (partition theorem)

Events \(B_1\), \(B_2\), …\(\in{\mathcal{F}}\) are said to form a partition of \(\Omega\), if they are pairwise disjoint: \[B_i\cap B_j=\varnothing \text{ for } i\neq j,\] and cover the whole \(\Omega\), ie., \[\cup_kB_k=\Omega.\]
If \(\bigl\{B_1,B_2,\dots\bigr\}\) form a partition of \(\Omega\), then for every \(A\in{\mathcal{F}}\) the following formula of total probability holds (tacitly assuming that \({\mathsf{P}}\bigl(A\mid B_k\bigr)\,{\mathsf{P}}(B_k)\equiv{\mathsf{P}}(A\cap B_k)=0\) for \({\mathsf{P}}(B_k)=0\)): \[{\mathsf{P}}(A)=\sum_{k\ge1}{\mathsf{P}}\bigl(A\mid B_k\bigr)\,{\mathsf{P}}(B_k)\,.\]

(Discrete Renewals) Consider a sequence of events that can only occur at discrete times \(k=1\), \(2\), …  (eg., a light bulb burns out and is immediately replaced with a new one). Assume that the intervals \(X\) between consecutive events have a common distribution \(f_k={\mathsf{P}}(X=k)\), \(k\ge1\). Let \(r_n\) denote the probability that an event occurs at time \(n\); ie., \(r_n={\mathsf{P}}(A_n)\) with \(A_n=\bigl\{\text{ replacement at time $n$ }\bigr\}\); we shall also assume that a replacement also occurs at time \(0\) so that \(r_0=1\). Since the events \(B_k=\{\text{ first bulb burns out at time $k$ }\}\) form a countable partition of \(\Omega\), and \({\mathsf{P}}(\,A_n\mid B_k\,)=r_{n-k}\) for all \(n\ge k\) (with \(r_1=f_1\)), the partition theorem implies \[\begin{equation} \label{eq:discrete-renewals-convolution} r_n=\sum_{k=1}^n\,r_{n-k}\,f_k\,. \end{equation}\] As \(f_0=0\), the RHS above is a convolution of the sequences \((r_n)_{n\ge0}\) and \((f_k)_{k\ge0}\); the large \(n\) behaviour of \(r_n\) can be analysed by taking various transforms (eg., generating functions) of these sequences.
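The recursion \(r_n=\sum_{k=1}^n r_{n-k}f_k\) is straightforward to evaluate numerically. The sketch below (function name ours) uses geometric lifetimes as an illustrative choice; by memorylessness one expects \(r_n=p\) for every \(n\ge1\), which the recursion indeed reproduces.

```python
def renewal_probabilities(f, n_max):
    """Compute r_n = sum_{k=1}^n r_{n-k} f_k with r_0 = 1.
    f is the lifetime pmf as a dict {k: P(X=k)} for k >= 1."""
    r = [1.0] + [0.0] * n_max
    for n in range(1, n_max + 1):
        r[n] = sum(r[n - k] * f.get(k, 0.0) for k in range(1, n + 1))
    return r

# Illustrative choice: geometric lifetimes f_k = p(1-p)^(k-1).
# By memorylessness the renewal probability should be r_n = p for all n >= 1.
p = 0.3
f = {k: p * (1 - p) ** (k - 1) for k in range(1, 200)}
r = renewal_probabilities(f, 50)
```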

Random variables

Random variables are “nice” functions \(X:\Omega\to{\mathbb{R}}\) characterized by the fact that for every \(a\in{\mathbb{R}}\) we have \(\bigl\{\omega\in\Omega:X(\omega)\le a\bigr\}\in{\mathcal{F}}\). In other words, every inverse image \(X^{-1}\bigl((-\infty,a]\bigr)\) is an event.
For discrete random variables, ie., those attaining at most countably many values, it is often convenient to replace the previous condition with its equivalent: for every \(a\in{\mathbb{R}}\) the set \(X^{-1}\bigl(\{a\}\bigr)\) is an event.

Expectation

If \(X\) is a discrete random variable taking values in \(\bigl\{x_1,x_2,\dots\bigr\}\) with probabilities \[p_k={\mathsf{P}}(X=x_k)\equiv{\mathsf{P}}\bigl(\{\omega\in\Omega:X(\omega)=x_k\}\bigr)\,,\] then if \(\sum_{k\ge 1} |x_k|\, p_k<\infty\), we say that \(X\) is integrable and the expectation \({\mathsf{E}}(X)\) of \(X\) is given by \[{\mathsf{E}}(X)\stackrel{{\sf def}}{=}\sum_{k\ge1}\,x_k\,p_k\equiv\sum_{k\ge1}x_k\,{\mathsf{P}}(X=x_k)\,.\] (If \(\sum |x_k| p_k =\infty\) we also write \({\mathsf{E}}(|X|)=\infty\)).

It is clear from this definition that if \(X,Y\) are two random variables defined on the same probability space with \({\mathsf{P}}(X\le Y)=1\) then \({\mathsf{E}}(X)\le {\mathsf{E}}(Y)\). In particular:

(Markov’s inequality) Suppose that \(X\) is a random variable taking real values, and \(a>0\). Then \[{\mathsf{E}}(|X|)\ge {\mathsf{E}}\bigl(|X|\mathsf{1}_{|X|\ge a}\bigr)\ge a\,{\mathsf{P}}(|X|\ge a)\,,\] which rearranges to give \[\begin{equation} \label{eq:Markov} {\mathsf{P}}(|X|\ge a)\le \frac{{\mathsf{E}}(|X|)}{a}. \end{equation}\] This is a very useful tool for bounding the tail probabilities of \(X\) (ie., the LHS above with \(a\) large).
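A quick empirical illustration (not a proof): for any finite sample, the empirical analogue of Markov's inequality holds exactly, since the empirical distribution is itself a probability measure. In the sketch below the function name and the choice of an exponential sample are our own.

```python
import random

def check_markov(sample, a):
    """Compare the empirical tail P(|X| >= a) with the bound E|X| / a."""
    n = len(sample)
    tail = sum(abs(x) >= a for x in sample) / n
    bound = sum(abs(x) for x in sample) / (n * a)
    return tail, bound

rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(100_000)]  # E|X| = 1
tail, bound = check_markov(sample, 5.0)  # bound should be near 1/5
```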

It is also straightforward to verify that expectation is linear: if \(X_1, \dots, X_n\) are integrable random variables and \(a_1, \dots, a_n\in \mathbb{R}\) then \[{\mathsf{E}}(\sum_i a_i X_i)=\sum_i a_i {\mathsf{E}}(X_i).\]

The conditional expectation of \(X\) given \(B\in{\mathcal{F}}\) with \({\mathsf{P}}(B)>0\) is computed similarly: \[{\mathsf{E}}\bigl(\,X\mid B\,\bigr)=\sum_{k\ge1}x_k\,{\mathsf{P}}(\,X=x_k\mid B\,)\,.\] We then have the following result.

Partition theorem for expectations

If \(\bigl\{B_1,B_2,\dots\bigr\}\) form a partition of \(\Omega\), then for every random variable \(X\) \[{\mathsf{E}}(X)=\sum_{k\ge1}{\mathsf{E}}\bigl(\,X\mid B_k\,\bigr)\,{\mathsf{P}}(B_k)\,.\] In some cases, the RHS above might not be well defined (i.e., the partial sums of the above series may not converge). However, everything is fine if \[\sum_{k\ge1}\,\bigl|{\mathsf{E}}\bigl(\,X\mid B_k\,\bigr)\bigr|\,{\mathsf{P}}(B_k)<\infty\,\] (and we have equality in the sense that both sides are equal to \(\infty\) if \({\mathsf{P}}(X\ge 0)=1\)).

Recall also that for a random variable \(X\) with \({\mathsf{E}}(X^2)<\infty\), we define its variance \[{\mathsf{Var}}(X)={\mathsf{E}}(X^2)-{\mathsf{E}}(X)^2.\] This measures, roughly speaking, how much the random variable deviates from its expected value. For example, if \({\mathsf{P}}(X=x)=1\) for some fixed value \(x\), then \({\mathsf{Var}}(X)=x^2-(x)^2=0\).

Independence

Events \(A\), \(B\in{\mathcal{F}}\) are independent if \[{\mathsf{P}}(A\cap B)={\mathsf{P}}(A)\,{\mathsf{P}}(B)\,.\] Notice that if \(A\), \(B\in{\mathcal{F}}\) are independent and if \({\mathsf{P}}(\,A\mid B\,)\) is well defined, then \({\mathsf{P}}(\,A\mid B\,)={\mathsf{P}}(A)\). Also, if \(B\in{\mathcal{F}}\) satisfies \({\mathsf{P}}(B)\in\{0,1\}\), then for every \(A\in{\mathcal{F}}\) the events \(A\) and \(B\) are independent.
In general, a collection of events \(\bigl(A_\alpha\bigr)_{\alpha\in{\mathcal{A}}}\) is independent if every finite subcollection is independent, ie., for every \(k\ge1\) and all \(\alpha_1\), …, \(\alpha_k\in{\mathcal{A}}\), \[{\mathsf{P}}\bigl(A_{\alpha_1}\cap A_{\alpha_2}\cap\ldots\cap A_{\alpha_k}\bigr) ={\mathsf{P}}(A_{\alpha_1})\,{\mathsf{P}}(A_{\alpha_2})\,\dots\,{\mathsf{P}}(A_{\alpha_k})\,.\] Two (discrete) random variables \(X\) and \(Y\) are independent if, for all \(x\), \(y\), \[{\mathsf{P}}\bigl(X=x,Y=y\bigr)={\mathsf{P}}(X=x)\,{\mathsf{P}}(Y=y).\] For general random variables (taking potentially uncountably many values in \(\mathbb{R}\)), the last condition needs to be replaced with, say, \[{\mathsf{P}}\bigl(X\in[a,b],Y\in[c,d]\bigr)={\mathsf{P}}\bigl(X\in[a,b]\bigr)\,{\mathsf{P}}\bigl(Y\in[c,d]\bigr)\] for all finite or infinite real \(a\), \(b\), \(c\), \(d\). Recall that in the discrete case the collection of numbers \({\mathsf{P}}\bigl(X=x,Y=y\bigr)\) is the joint distribution of the pair \((X,Y)\) of random variables.

Of course, the above idea can also be used to define independence of arbitrary collections of random variables.

Let \(D\) be the result of a single roll of a standard fair die. Next, flip a fair coin \(D\) times, and let \(H\) be the total number of ‘heads’ observed. Write the joint distribution of the pair \((D,H)\). Are \(D\) and \(H\) independent?

If \(X\) and \(Y\) are independent discrete random variables, and \(f\), \(g:{\mathbb{R}}\to{\mathbb{R}}\) are arbitrary functions, then \(f(X)\) and \(g(Y)\) are independent random variables and \({\mathsf{E}}\bigl[f(X)g(Y)\bigr]={\mathsf{E}}f(X)\cdot{\mathsf{E}}g(Y)\). For example, \({\mathsf{E}}(s^{X+Y})={\mathsf{E}}(s^X)\,{\mathsf{E}}(s^Y)\) for \(|s|\le1\).

Notice that knowing the joint distribution of a random vector \((X,Y)\), we can derive the so-called marginal distributions of its components \(X\) and \(Y\). The inverse operation of constructing the joint distribution of a vector from its marginal distributions is not well posed, and often has no unique answer (see below).

Let \(X \sim {\mathsf{Ber}}_{\pm 1}(p_1)\), ie., \({\mathsf{P}}(X=1)=1-{\mathsf{P}}(X=-1)=p_1\) and let \(Y\) be a \({\mathsf{Ber}}_{\pm 1}(p_2)\) random variable, ie., \({\mathsf{P}}(Y=1)=1-{\mathsf{P}}(Y=-1)=p_2\). Without loss of generality we may assume that \(p_1\le p_2\). Then both of the below are valid joint distributions with given marginals (write \(q_i=1-p_i\), \(i=1,2\)):

  1. (A): \[\begin{array}{c|cc|c} & Y=-1 & Y=1 & \\ \hline X=-1 & q_1q_2 & q_1p_2 & q_1 \\ X=1 & p_1q_2 & p_1p_2 & p_1 \\ \hline & q_2 & p_2 & \end{array}\]

  2. (B): \[\begin{array}{c|cc|c} & Y=-1 & Y=1 & \\ \hline X=-1 & q_2 & p_2-p_1 & q_1 \\ X=1 & 0 & p_1 & p_1 \\ \hline & q_2 & p_2 & \end{array}\]

(in each table the final column and row show the marginal distributions of \(X\) and \(Y\), respectively).

In (A) the variables \(X\) and \(Y\) are independent, whereas in (B) they are not. This demonstrates that we cannot reconstruct the joint distribution of \(X\) and \(Y\) knowing only their individual distributions.
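The couplings are easy to encode explicitly. The sketch below (function name ours) builds table (B) and recovers the prescribed marginals; note the zero entry, which forces \(X\le Y\) with probability one.

```python
def coupling_B(p1, p2):
    """Joint pmf of (X, Y) from table (B): P(X=1, Y=-1) = 0, so X <= Y always.
    Assumes p1 <= p2; marginals are Ber_{+-1}(p1) and Ber_{+-1}(p2)."""
    assert p1 <= p2
    q1, q2 = 1 - p1, 1 - p2
    return {(-1, -1): q2, (-1, 1): p2 - p1, (1, -1): 0.0, (1, 1): p1}

joint = coupling_B(0.3, 0.6)
# Marginals recovered from the joint distribution:
pX1 = joint[(1, -1)] + joint[(1, 1)]    # should equal p1
pY1 = joint[(-1, 1)] + joint[(1, 1)]    # should equal p2
```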

As we shall see below, this flexibility in constructing several variables on a common probability space (or “coupling”) often allows for intuitive and clear probabilistic arguments.

Let \((w^X_n)_{n\ge0}\) be the random walk generated by independent copies of \(X \sim {\mathsf{Ber}}_{\pm 1}(p_1)\); \(w^X_0=0\). Similarly, let \((w^Y_n)_{n\ge0}\) be the random walk generated by independent copies of \(Y \sim {\mathsf{Ber}}_{\pm 1}(p_2)\); \(w^Y_0=0\). If \(p_1<p_2\), then the law of large numbers implies that \(\frac1n\,w^X_n\) grows linearly with slope \(2p_1-1\), whereas \(\frac1n\,w^Y_n\) grows linearly with slope \(2p_2-1>2p_1-1\). In other words, for times \(n\) large enough the trajectories of \((w^X_n)_{n\ge0}\) will lie below those of \((w^Y_n)_{n\ge0}\). In fact, using the joint distribution from Example [exmle:first-couple] (B), one can construct a joint distribution for the entire trajectories of these random walks such that the inequality \(w^X_n\le w^Y_n\) holds for all times \(n\ge0\), and not only for \(n\) large enough. This is useful. For example, as a result, for every monotone increasing function \(f:{\mathbb{R}}\to{\mathbb{R}}\) one has \[f\bigl(w^X_n\bigr)\le f\bigl(w^Y_n\bigr) \text{ for all } n \ge 0, \text{ and therefore } {\mathsf{E}}f\bigl(w^X_n\bigr)\le {\mathsf{E}}f\bigl(w^Y_n\bigr).\]
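The monotone coupling of the two walks can be simulated by driving both with the same uniform random variables, which is exactly the coupling of table (B) applied step by step. A minimal sketch (function name ours):

```python
import random

def coupled_walks(p1, p2, n, seed=0):
    """Run both walks off the same uniforms (the coupling of table (B)):
    each step draws one U ~ Unif(0,1); X steps up iff U < p1, Y iff U < p2.
    For p1 <= p2 this forces w^X_n <= w^Y_n at every time n."""
    rng = random.Random(seed)
    wX = wY = 0
    pathX, pathY = [0], [0]
    for _ in range(n):
        u = rng.random()
        wX += 1 if u < p1 else -1
        wY += 1 if u < p2 else -1
        pathX.append(wX)
        pathY.append(wY)
    return pathX, pathY

pathX, pathY = coupled_walks(0.3, 0.7, 1000)
```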

1.2 Generating functions: key properties

Lengthy calculations arising from even quite straightforward counting problems can be simplified by using generating functions. Recall that the generating function of a real sequence \((a_k)_{k\ge0}\) is \[\begin{equation} \label{eq:generating-function-def} G(s)=G_a(s)\stackrel{{\sf def}}{=}\sum_{k=0}^\infty a_k\,s^k \end{equation}\] (defined whenever the sum on the RHS converges). Similarly, the (probability) generating function of a random variable \(X\) with values in \[{\mathbb{Z}}^+\stackrel{{\sf def}}{=}\{0,1,\dots\}\] is just the generating function of its probability mass function: \[\begin{equation} \label{eq:proba-generating-function} G(s)\equiv G_X(s)\stackrel{{\sf def}}{=}{\mathsf{E}}\bigl(s^X\bigr)=\sum_{k=0}^\infty s^k{\mathsf{P}}(X=k)\,. \end{equation}\] Notice that each probability generating function satisfies \[|G_X(s)| \le G_X(1)=\sum_{k=0}^\infty {\mathsf{P}}(X=k) \le 1\,,\] i.e., is well defined and finite for all (complex) \(s\) with \(|s|\le1\). In particular, \(G_X(s)\) can be differentiated term-by-term any number of times in the open unit disk \(|s|<1\).

Generating functions are very useful when studying sums of independent random variables. Indeed, Example [exmle:expectation-and-independence] implies the following important fact:

If \(X\) and \(Y\) are independent random variables with values in \({\mathbb{Z}}^+\) and \(Z= X+Y\), then their generating functions satisfy \[G_Z(s)=G_{X+Y}(s)=G_X(s)\,G_Y(s)\,\] for all \(s\) such that the RHS is well defined.

Let \(X, X_1, \dots, X_n\) be independent identically distributed random variables with values in \(\{0,1,2,\dots\}\) and let \(S_n=X_1+\dots+X_n\). Suppose that \(G_X(s)\) is well-defined. Then \[G_{S_n}(s)=G_{X_1}(s)\dots G_{X_n}(s)\equiv\bigl[G_X(s)\bigr]^n\,.\]

Let \(X, X_1, X_2, \dots\) be i.i.d. with values in \(\{0,1,2,\dots\}\) and let \(N\ge0\) be an integer-valued random variable independent of \(\{X_k\}_{k\ge1}\). Then \(S_N= X_1+\dots+X_N\) has generating function \[\begin{equation} \label{eq:GN-GX} G_{S_N}=G_{N}\circ G_X \end{equation}\] This is a straightforward application of the partition theorem for expectations. Alternatively, the result follows from the standard properties of conditional expectation: \[{\mathsf{E}}\bigl(s^{S_N}\bigr)={\mathsf{E}}\Bigl({\mathsf{E}}\bigl(s^{S_N}\mid N\bigr)\Bigr) ={\mathsf{E}}\Bigl(\bigl[G_X(s)\bigr]^N\Bigr)=G_{N}\bigl(G_X(s)\bigr)\,.\]
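The identity \(G_{S_N}=G_N\circ G_X\) can be checked exactly for finitely supported distributions. In the sketch below, the function names and the particular pmfs are our own illustrative choices; the pmf of \(S_N\) is computed directly via the partition theorem and both sides are evaluated at a point.

```python
def gf(pmf, s):
    """Evaluate the generating function G(s) = sum_k pmf[k] * s^k."""
    return sum(p * s ** k for k, p in pmf.items())

def pmf_sum_iid(pmf_X, n):
    """pmf of X_1 + ... + X_n by repeated convolution."""
    out = {0: 1.0}
    for _ in range(n):
        new = {}
        for a, pa in out.items():
            for b, pb in pmf_X.items():
                new[a + b] = new.get(a + b, 0.0) + pa * pb
        out = new
    return out

def pmf_random_sum(pmf_N, pmf_X):
    """pmf of S_N = X_1 + ... + X_N via the partition theorem over N."""
    out = {}
    for n, pn in pmf_N.items():
        for k, pk in pmf_sum_iid(pmf_X, n).items():
            out[k] = out.get(k, 0.0) + pn * pk
    return out

# Illustrative finite distributions (arbitrary choices):
pmf_X = {0: 0.2, 1: 0.5, 2: 0.3}
pmf_N = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}
s = 0.7
lhs = gf(pmf_random_sum(pmf_N, pmf_X), s)   # G_{S_N}(s)
rhs = gf(pmf_N, gf(pmf_X, s))               # G_N(G_X(s))
```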

In general, we say a sequence \({\mathbf{c}}=(c_n)_{n\ge0}\) is the convolution of \({\mathbf{a}}=(a_k)_{k\ge0}\) and \({\mathbf{b}}=(b_m)_{m\ge0}\) (write \({\mathbf{c}}={\mathbf{a}}\star {\mathbf{b}}\)), if \[\begin{equation} \label{eq:convolution} c_n=\sum_{k=0}^na_k\,b_{n-k}\,,\qquad n\ge0\,. \end{equation}\]

If \(c=a\star b\), show that the generating functions \(G_c\), \(G_a\), and \(G_b\) satisfy \(G_c=G_a \times G_b\).
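A numerical check of this identity (not a proof), with arbitrary illustrative sequences and our own function names:

```python
def convolve(a, b):
    """c_n = sum_{k=0}^n a_k b_{n-k}, the convolution of two finite sequences."""
    c = [0.0] * (len(a) + len(b) - 1)
    for k, ak in enumerate(a):
        for m, bm in enumerate(b):
            c[k + m] += ak * bm
    return c

def G(seq, s):
    """Generating function of a finite sequence evaluated at s."""
    return sum(x * s ** k for k, x in enumerate(seq))

a = [1.0, 2.0, 3.0]
b = [0.5, 0.25]
c = convolve(a, b)   # G_c(s) should equal G_a(s) * G_b(s)
```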

In the setup of Example [exmle:1718-00-discrete-renewals], let \(G_f\) and \(G_r\) be the generating functions of the sequences \({\mathbf{f}}=(f_k)_{k\ge1}\) and \({\mathbf{r}}=(r_n)_{n\ge0}\). Show that \(G_r(s)=1/(1-G_f(s))\) for all \(|s|\le1\).

If the generating function \(G_a\) of \((a_n)_{n\ge0}\) is analytic in a neighbourhood of the origin, then there is a one-to-one correspondence between \(G_a\) and \((a_n)_{n\ge0}\). Namely, \(a_k\) can be recovered from \(G_a\) via \[\begin{equation} \label{eq:gen-fn-uniqueness} a_k=\frac{1}{k!}\,\frac{d^k}{ds^k}\,G_a(s)\bigm|_{s=0} \qquad\text{ or }\qquad a_k=\frac1{2\pi i}\oint_{|z|=\rho}\frac{G_a(z)}{z^{k+1}}\,dz\,, \end{equation}\] for suitable \(\rho>0\). This result is often referred to as the uniqueness property of generating functions.

Let \(X\sim{\sf Poi}(\lambda)\) and \(Y\sim{\sf Poi}(\mu)\) be independent. A straightforward computation gives \(G_X(s)=e^{\lambda(s-1)}\) for all \(s\) so that if \(Z=X+Y\), Example [exmle:generating-functions-convolution] implies that \[G_Z(s)=G_X(s)\,G_Y(s)=e^{\lambda(s-1)}\,e^{\mu(s-1)}\equiv e^{(\lambda+\mu)(s-1)}\,.\] This means that \(Z\) is \({\sf Poi}(\lambda+\mu)\) distributed.
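One can also confirm this conclusion term by term: convolving the two Poisson pmfs directly reproduces the \({\sf Poi}(\lambda+\mu)\) pmf. A minimal sketch (function names ours):

```python
from math import exp, factorial

def poi_pmf(lam, k):
    """pmf of a Poi(lam) random variable at k."""
    return exp(-lam) * lam ** k / factorial(k)

def sum_pmf(lam, mu, n):
    """P(X + Y = n) for independent X ~ Poi(lam), Y ~ Poi(mu), by convolution."""
    return sum(poi_pmf(lam, k) * poi_pmf(mu, n - k) for k in range(n + 1))

# Each convolution term should match the Poi(lam + mu) pmf:
diffs = [abs(sum_pmf(1.5, 2.0, n) - poi_pmf(3.5, n)) for n in range(15)]
```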

A similar argument can be used in the following exercise.

If \(X\sim{\mathsf{Bin}}(n,p)\) and \(Y\sim{\mathsf{Bin}}(m,p)\) are independent, show that \(X+Y\sim{\mathsf{Bin}}(n+m,p)\).

Another useful property of probability generating functions is that they can be used to compute moments:

If \(X\) has generating function \(G_X\), then \[{\mathsf{E}}\bigl(X(X-1)\dots(X-k+1)\bigr)=G^{(k)}_X(1)\,,\] where \(G^{(k)}(1)\) is shorthand for \(G^{(k)}(1_-)\equiv\lim_{s\uparrow1}G^{(k)}(s)\), the limiting value of the \(k\)th derivative of \(G(s)\) at \(s=1\). Since \(G^{(k)}(s)\) is non-decreasing in \(s\in(0,1)\), this limit exists and is either \(+\infty\) or finite. In the latter case, \(X(X-1)\dots (X-k+1)\) is integrable and the equality above holds. In the former, \(X(X-1)\dots (X-k+1)\) is not integrable.

The quantity \({\mathsf{E}}\bigl(X(X-1)\dots(X-k+1)\bigr)\) is called the \(k\)th factorial moment of \(X\). Notice also that \[\begin{equation} \label{eq:variance-via-gen-fns} {\mathsf{Var}}(X)=G_X''(1)+G_X'(1)-\bigl(G_X'(1)\bigr)^2\,. \end{equation}\]
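For a finitely supported pmf, the factorial moments, and hence the variance formula above, can be evaluated exactly. A small sketch with an arbitrary illustrative pmf (function name ours):

```python
def gf_derivative_at_1(pmf, order):
    """k-th derivative of G_X at s = 1, i.e. the k-th factorial moment
    E[X(X-1)...(X-k+1)], computed exactly for a finitely supported pmf."""
    out = 0.0
    for x, p in pmf.items():
        term = p
        for j in range(order):
            term *= (x - j)
        out += term
    return out

pmf = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}            # arbitrary illustrative pmf
m1 = gf_derivative_at_1(pmf, 1)                    # E(X) = G'(1)
var = gf_derivative_at_1(pmf, 2) + m1 - m1 ** 2    # Var(X) = G''(1)+G'(1)-G'(1)^2
```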

Notice that \(\lim_{s\nearrow1}G_X(s)\equiv\lim_{s\nearrow1}{\mathsf{E}}(s^X)={\mathsf{P}}(X<\infty)\). This allows us to check whether the random variable \(X\) is finite, if we do not know this a priori. See Example [exmle:SRW-hitting-time] below.

The fact that a probability generating function is finite at \(u=1\) (or has a finite left derivative there) does not, in general, imply any regularity beyond the unit disk. Indeed, let \(X\) be a random variable satisfying \[{\mathsf{P}}(X=k)=\tfrac1{k(k+1)}\quad\text{ for all $k\ge1$,}\] and let \(G_X\) be its generating function. It is easy to check that \(G_X(1)=1\) while \({\mathsf{E}}(X)=G_X'(1_-)=\infty\), and thus \(|G_X(u)|\le G_X(1)=1\) if \(|u|\le1\) but \(G_X(u)=\infty\) for all \(|u|>1\).

Let \({\mathsf{P}}(X=k)=4/\bigl(k(k+1)(k+2)\bigr)\) for \(k\ge1\). Show that the generating function \(G_X(u)={\mathsf{E}}(u^X)\) satisfies \(G_X(1)=1\), \(G_X'(1_-)=2<\infty\), but \(G_X''(1_-)=\infty\). Notice that in this case \(|G_X'(u)|\le2\) uniformly in \(|u|<1\) while \(G_X(u)=\infty\) for all \(|u|>1\).

Following the approach of Exercise [exse:generating-function-bounded-derivative-on-closed-disk-but-infinite-outside] or otherwise, for \(m\in{\mathbb{N}}\) find a generating function \(G\), which is continuous and bounded on the closed unit disk \(|u|\le1\) together with derivatives up to order \(m\), while \(G(u)=\infty\) for all \(|u|>1\).

Let \(S_N=X_1+\ldots+X_N\) be a random sum of random variables, whose generating function is \(G_{S_N}(u)\equiv G_N\bigl(G_X(u)\bigr)\), recall Example [exmle:GN-GX]. Use Theorem [thm:moments-from-generating-functions] to express \({\mathsf{E}}\bigl(S_N\bigr)\) and \({\mathsf{Var}}\bigl(S_N\bigr)\) in terms of \({\mathsf{E}}(X)\), \({\mathsf{E}}(N)\), \({\mathsf{Var}}(X)\) and \({\mathsf{Var}}(N)\). Check your result for \({\mathsf{E}}\bigl(S_N\bigr)\) and \({\mathsf{Var}}\bigl(S_N\bigr)\) by directly applying the partition theorem for expectations.

A bird lays \(N\) eggs, each being pink with probability \(p\) and blue otherwise. Assuming that \(N\sim{\mathsf{Poi}}(\lambda)\), find the distribution of the total number \(K\) of pink eggs.

Suppose that in a population, each mature individual produces immature offspring according to a probability generating function \(F\).

  1. Assume that we start with a population of \(k\) immature individuals, each of which grows to maturity with probability \(p\) and then reproduces, independently of other individuals. Find the probability generating function of the number of immature individuals in the next generation.

  2. Find the probability generating function of the number of mature individuals in the next generation, given that there are \(k\) mature individuals in the parent generation.

  3. Show that the distributions in a) and b) above have the same mean, but not necessarily the same variance. You might prefer to first consider the case \(k=1\), and then generalise.

The next example is very important for applications.

Let \(X_k\), \(k\ge1\) be i.i.d. with the common distribution \[{\mathsf{P}}(X_k=1)=p\,,\qquad {\mathsf{P}}(X_k=-1)=q=1-p\,.\] Define the simple random walk \((S_n)_{n\ge0}\) via \(S_0=0\) and \(S_n=X_1+\dots+X_n\) for \(n\ge1\) and let \[T\stackrel{{\sf def}}{=}\inf\bigl\{n\ge1:S_n=1\bigr\}\] be the first time this random walk hits \(1\).

To calculate the generating function \(G_T\), write \(p_k={\mathsf{P}}(T=k)\), so that \(G_T(s)\equiv{\mathsf{E}}(s^T)=\sum_{k\ge0}s^k\,p_k\). Conditioning on the outcome of the first step, and applying the partition theorem for expectations, we get \[G_T(s)\equiv{\mathsf{E}}\bigl(s^T\bigr)={\mathsf{E}}\bigl(s^T\mid X_1=1\bigr)\,p+{\mathsf{E}}\bigl(s^T\mid X_1=-1\bigr)\,q=ps+qs\,{\mathsf{E}}\bigl(s^{T_2}\bigr)\,,\] where \(T_2\) is the time of the first visit to state \(1\) starting from \(S_0=-1\). By partitioning on the time \(T_1\) of the first visit to \(0\), it follows that \[{\mathsf{P}}(T_2=m)=\sum_{k=1}^{m-1}{\mathsf{P}}\bigl(T_1=k,T_2=m\bigr)\,,\] where, of course, on the event \(\{S_0=-1\}\), \[{\mathsf{P}}(T_2=m\mid S_0=-1)\equiv{\mathsf{P}}\bigl(S_1<1,\dots,S_{m-1}<1,S_m=1\mid S_0=-1\bigr)\] \[{\mathsf{P}}(T_1=k\mid S_0=-1)\equiv{\mathsf{P}}\bigl(S_1<0,\dots,S_{k-1}<0,S_k=0\mid S_0=-1\bigr).\] Notice that by translation invariance the last probability is just \({\mathsf{P}}(T=k\mid S_0=0)=p_k\). We also observe that \[{\mathsf{P}}\bigl(T_2=m\mid T_1=k\bigr)={\mathsf{P}}\bigl(\text{ first hit $1$ from $0$ after $m-k$ steps }\bigr)\equiv p_{m-k}\,.\] The partition theorem now implies that \[{\mathsf{P}}(T_2=m)=\sum_{k=1}^{m-1}{\mathsf{P}}\bigl(T_2=m\mid T_1=k\bigr)\,{\mathsf{P}}(T_1=k)\equiv\sum_{k=1}^{m-1}p_k\,p_{m-k}=\sum_{k=0}^{m}p_k\,p_{m-k}\,,\] ie., \(G_{T_2}(s)=\bigl(G_T(s)\bigr)^2\). We deduce that \(G_T(s)\) solves the quadratic equation \(\varphi=ps+qs\varphi^2\), so that \[G_T(s)=\frac{1-\sqrt{1-4pqs^2}}{2qs}=\frac{2ps}{1+\sqrt{1-4pqs^2}}\,.\] Finally this allows us to deduce that \[{\mathsf{P}}(T<\infty)\equiv G_T(1)=\frac{1-|p-q|}{2q}=\begin{cases}1\,,&\quad p\ge q\,,\\ p/q\,,&\quad p<q\,.\end{cases}\] In particular, \({\mathsf{E}}(T)=\infty\) for \(p<q\) (because \({\mathsf{P}}(T=\infty)=(q-p)/q>0\)).
For \(p\ge q\) we obtain \[{\mathsf{E}}(T)\equiv G_T'(1)=\frac{1-|p-q|}{2q|p-q|}=\begin{cases}\frac1{p-q}\,,&\quad p>q\,,\\ \infty\,,&\quad p=q\,,\end{cases}\] ie., \({\mathsf{E}}(T)<\infty\) if \(p>q\) and \({\mathsf{E}}(T)=\infty\) otherwise.

Notice that at criticality (\(p=q=1/2\)), the variable \(T\) is finite with probability \(1\), but has infinite expectation.
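The value \({\mathsf{P}}(T<\infty)=p/q\) for \(p<q\) can be checked by simulation. Since the walk then has negative drift, capping the number of steps gives a good proxy for the event \(\{T=\infty\}\); the cap, the function name, and the parameters below are our own choices.

```python
import random

def hit_prob_estimate(p, trials=20_000, cap=500, seed=2):
    """Estimate P(T < infinity) for the first hit of level 1 by the simple
    random walk with up-probability p. With p < q the walk drifts to
    -infinity, so T > cap essentially means T = infinity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0
        for _ in range(cap):
            s += 1 if rng.random() < p else -1
            if s == 1:
                hits += 1
                break
    return hits / trials

est = hit_prob_estimate(0.3)   # theory: P(T < infinity) = p/q = 3/7 for p = 0.3
```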

In a sequence of independent Bernoulli experiments with success probability \(p\in(0,1)\), let \(D\) be the first time that two consecutive successful outcomes have occurred (ie., if the first two experiments are both successes, then \(D=2\)).

To find the generating function of \(D\), there are two good methods, and both involve deriving a recursion relation.
Method 1: For \(n\ge 2\) let us write \(d_n={\mathsf{P}}(D=n)\), and consider the events

  • \(A:=\{(n-2) \text{ failures followed by 2 successes}\},\)

  • \(A_k:=\{\text{first failure immediately preceded by a success occurs at experiment } k\}\) for \(k=2,\dots, n-2\), and

  • \(B=(A\cup \bigcup_{k=2}^{n-2}A_k)^c.\)

These form a partition of the probability space, since they are disjoint by construction, and the definition of \(B\) means that they cover the whole space. Note that for \(\{D=n\}\) to occur, it must be that one of \(A\) or \(A_2,\dots, A_{n-2}\) occurs, so that \({\mathsf{P}}(\{D=n\}\cap B)=0\). Also it is clear that \(A\subset \{D=n\}\), so \({\mathsf{P}}(\{D=n\}\cap A)={\mathsf{P}}(A)=q^{n-2}p^2\). Thus we can write \[d_n={\mathsf{P}}(D=n)=q^{n-2}p^2+\sum_{k=2}^{n-2}{\mathsf{P}}(\{D=n\}\cap A_k)\] and we are left to calculate \({\mathsf{P}}(\{D=n\}\cap A_k)\) for \(k=2,\dots, n-2\). For this, we observe that for \(\{D=n\}\cap A_k\) to occur, it must be that the first \((k-2)\) experiments are failures, the \((k-1)\)st is a success, the \(k\)th is a failure again, and then, for the new sequence of experiments starting from the \((k+1)\)st, the first time that 2 consecutive successes are seen is \((n-k)\). By independence of the experiments, the probability of this happening is just \(q^{k-2}pq\,d_{n-k}\). Hence we obtain \[d_n=q^{n-2}p^2+\sum_{k=2}^{n-2}q^{k-2}pq\,d_{n-k}\,,\] and a standard generating-function manipulation (multiply by \(s^n\) and sum over \(n\ge2\)) implies that \[G_D(s)=\frac{p^2s^2}{1-qs}+\frac{pqs^2}{1-qs}\,G_D(s)\,,\qquad\text{ or }\qquad G_D(s)=\frac{p^2s^2}{1-qs-pqs^2}\,.\] A straightforward computation gives \(G_D'(1)=\frac{1+p}{p^2}\), so that on average it takes \(42\) rolls of a standard fair die until the first two consecutive sixes appear.
Method 2: we derive a recursion relation directly for the generating function, by conditioning on the result of the first experiment. That is, we use the partition theorem for expectation to write \[\begin{aligned} {\mathsf{E}}(s^D) & = & {\mathsf{E}}(s^{D}\,|\, \text{failure}\,){\mathsf{P}}(\text{failure})+{\mathsf{E}}(s^D\, | \, \text{success}\,){\mathsf{P}}(\text{success})\\ & = &s{\mathsf{E}}(s^{D-1}\, |\, \text{failure}\,)q+s{\mathsf{E}}(s^{D-1}\, | \, \text{success}\,)p.\end{aligned}\]

Now, \({\mathsf{E}}(s^{D-1}\, |\, \text{failure}\,)={\mathsf{E}}(s^D)\), since the new sequence of experiments starting from the 2nd has the same distribution as the whole sequence, and observing a failure for the first experiment means we are still just asking for the first time that two consecutive successes are observed in this new sequence. However, this is not the case for \({\mathsf{E}}(s^{D-1}\, | \, \text{success}\,)\), since if we have already seen one success it could be the first one in a consecutive pair. So for this conditional expectation we condition again, now on the result of the 2nd experiment. If it is a success then \(s^{D-1}\) equals \(s\) with conditional probability one, and if it is a failure then we start from scratch again. Thus we obtain that \({\mathsf{E}}(s^{D-1}\, | \, \text{success}\,)=ps+qs{\mathsf{E}}(s^D)\). Putting this all together and writing \({\mathsf{E}}(s^D)=G_D(s)\), we see that \[G_D(s)=qsG_D(s)+p^2s^2+pqs^2G_D(s)\] and can rearrange to reach the same conclusion as for method 1.
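Both methods give \({\mathsf{E}}(D)=G_D'(1)=(1+p)/p^2\), which is easy to corroborate by simulation (the function name and parameters below are ours); for \(p=1/2\) the exact mean is \((1+\tfrac12)/\tfrac14=6\).

```python
import random

def first_double_success(p, rng):
    """Number of Bernoulli(p) trials until two consecutive successes occur."""
    run, n = 0, 0
    while run < 2:
        n += 1
        run = run + 1 if rng.random() < p else 0
    return n

rng = random.Random(3)
p = 0.5
mean_D = sum(first_double_success(p, rng) for _ in range(50_000)) / 50_000
# Theory: E(D) = (1 + p) / p^2 = 6 for p = 1/2.
```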

In a sequence of independent Bernoulli experiments with success probability \(p\in(0,1)\), let \(M\) be the first time that \(m\) consecutive successful outcomes have occurred. Using the approach of Example [exmle:double-success-generating-function] or otherwise, find the generating function of \(M\).

In the setup of Example [exmle:double-success-generating-function], show that \(d_0=d_1=0\), \(d_2=p^2\), \(d_3=qp^2\), and, conditioning on the value of the first outcome, that \(d_n=q\,d_{n-1}+pq\,d_{n-2}\) for \(n\ge3\). Use these relations to re-derive the generating function \(G_D\).

Use the method of Exercise [exse:double-success-first-result-decomposition], to derive an alternative solution to Exercise [exse:repeated-success-generating-function]. Compare the resulting expectation to that in Example [exmle:double-success-generating-function].

A biased coin showing ‘heads’ with probability \(p\in(0,1)\) is flipped repeatedly. Let \(C_w\) be the first time that the word \(w\) appears in the observed sequence of results. Find the generating function of \(C_w\) and the expectation \({\mathsf{E}}\bigl[C_w\bigr]\) for each of the following words: HH, HT, TH and TT.

If \(X_n\sim{\sf Bin}(n,p)\) with \(p=p_n\) satisfying \(n\cdot p_n\to\lambda\) as \(n\to\infty\), then \(G_{X_n}(s)\equiv\bigl(1+p_n(s-1)\bigr)^n\to\exp\{\lambda(s-1)\}\) for every fixed \(s\in[0,1]\), so that the distribution of \(X_n\) converges to that of \(X\sim{\sf Poi}(\lambda)\).
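This convergence can be observed numerically by comparing the two pmfs directly (function names ours); the maximal pointwise discrepancy shrinks as \(n\) grows.

```python
from math import comb, exp, factorial

def bin_pmf(n, p, k):
    """pmf of Bin(n, p) at k."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poi_pmf(lam, k):
    """pmf of Poi(lam) at k."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 2.0

def err(n):
    """Largest pointwise gap between Bin(n, lam/n) and Poi(lam) on {0,...,9}."""
    return max(abs(bin_pmf(n, lam / n, k) - poi_pmf(lam, k)) for k in range(10))

e_small, e_large = err(10), err(1000)   # e_large should be much smaller
```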

For each \(n\ge1\) let \(Y_n=\sum_{k=1}^nX^{(n)}_k\), where \(X^{(n)}_k\) are independent Bernoulli random variables, \[p^{(n)}_k\stackrel{{\sf def}}{=}{\mathsf{P}}(X^{(n)}_k=1)=1-{\mathsf{P}}(X^{(n)}_k=0).\] Assume that \[\delta^{(n)}\stackrel{{\sf def}}{=}\max_{1\le k\le n}p^{(n)}_k\to0\] as \(n\to\infty\) and that for a positive constant \(\lambda\) we have \[{\mathsf{E}}(Y_n)\equiv\sum_{k=1}^np^{(n)}_k\to\lambda.\] Using generating functions or otherwise, show that the distribution of \(Y_n\) converges to that of a \({\mathsf{Poi}}(\lambda)\) random variable. This result is known as the law of rare events.

More generally, we have the following continuity result:

For every fixed \(n\), suppose that the sequence \(a_{0,n}\), \(a_{1,n}\), … is a probability distribution, ie., \(a_{k,n}\ge0\) and \(\sum_{k\ge0}a_{k,n}=1\), and let \(G_n\) be the corresponding generating function, \(G_n(s)=\sum_{k\ge0}a_{k,n}s^k\) for all \(s\) such that the RHS converges. Then \(\lim_{n\to\infty}a_{k,n}=a_k\) for every fixed \(k\) if and only if \(\lim\limits_{n\to\infty}G_n(s)= G_a(s)\) for every fixed \(s\in[0,1)\), where \(G_a(s)=\sum_{k\ge0}a_ks^k\) is the generating function of the limiting sequence \((a_k)\).

In the probabilistic context, the convergence above: \[a_{k,n}\equiv{\mathsf{P}}(X_n=k)\overset{n\to \infty}{\rightarrow}{\mathsf{P}}(X=k)=a_k\, \text{ for each } k,\] is known as convergence in distribution.

In applications one often needs to describe the distribution of a random variable, which is obtained as a result of some limiting approach (or approximation). Then Theorem [thm:convergence-in-distribution-via-generating-functions] can help to simplify the argument. This method is similar to proving the central limit theorem using moment generating functions \({\mathsf{E}}(\exp\bigl\{tX_n\bigr\})\equiv G_n(e^t)\). Notice that \(G_n(e^t)\) exists for some \(t>0\) only if the sequence \(a_{k,n}\equiv{\mathsf{P}}(X_n=k)\) decays sufficiently fast, recall Remark [rem:generating-function-without-expectation] above.

2 Branching processes

2.1 Classification and extinction

Informally, a branching process is described as follows: starting from a single individual (generation zero), every individual in each generation independently produces a random number of offspring distributed according to a fixed probability mass function (pmf) \(\{p_k\}_{k\ge0}\), and these offspring together form the next generation.

This model has a number of applications in biology (eg., it can be thought of as a model of population growth), physics (chain reactions in nuclear fission), queueing theory, etc. Originally it arose from a study of the likelihood of survival of family names (“how fertile must a family be to ensure that in future generations the family name will not die out?”).

Formally, let \(\{Z_{n,k}\}\), \(n\ge1\), \(k\ge1\), be a family of i.i.d. random variables in \({\mathbb{Z}}^+\), each having a common probability mass function \(\{p_k\}_{k\ge0}\). Then the branching process \((Z_n)_{n\ge0}\) (generated by \(\{p_k\}_{k\ge0}\)) is defined by setting \(Z_0=1\), and, for \(n\ge1\), \[\begin{equation} \label{eq:branch-process-recursion} Z_n\stackrel{{\sf def}}{=}Z_{n,1}+Z_{n,2}+\dots+Z_{n,Z_{n-1}}\,, \end{equation}\] where the empty sum is interpreted as zero. Notice that \((Z_n)_{n\ge0}\) is a Markov chain in \({\mathbb{Z}}^+\). We shall use \({\mathsf{P}}(\,\cdot\,)\equiv{\mathsf{P}}_1(\,\cdot\,)\) and \({\mathsf{E}}(\,\cdot\,)\equiv{\mathsf{E}}_1(\,\cdot\,)\) to denote the corresponding probability measure and the expectation operator. If \(\varphi_n(s)\equiv{\mathsf{E}}(s^{Z_n})\) is the generating function of \(Z_n\), a straightforward induction based on ([eq:branch-process-recursion]) and ([eq:GN-GX]) implies that \[\varphi_0(s)\equiv s\,,\qquad \varphi(s)\equiv\varphi_1(s)\equiv{\mathsf{E}}\bigl(s^{Z_1}\bigr)\,,\] \[\begin{equation} \label{eq:branch-process-gener-fn} \varphi_k(s)=\varphi_{k-1}\bigl(\varphi(s)\bigr)\equiv\varphi\bigl(\varphi_{k-1}(s)\bigr)=\underset{k \text{ times}}{\underbrace{\varphi\bigl(\varphi(\cdots\varphi(s)\cdots)\bigr)}}\,,\quad k>1\,. \end{equation}\] Usually explicit calculations are hard, but at least in principle, equations ([eq:branch-process-gener-fn]) determine the distribution of \(Z_n\) for any \(n\ge0\).

Let \(\varphi_1(s)\equiv\varphi(s)=q+ps\) for some \(0<p=1-q<1\). Then \[\varphi_n(s)\equiv q(1+p+\dots+p^{n-1})+p^ns=1+p^n(s-1)\,.\] Notice that here we have \(\varphi_n(s)\to1\) as \(n\to\infty\), for all \(s\in[0,1]\). In other words, the distribution of \(Z_n\) converges to that of \(Z_\infty\equiv0\), recall Theorem [thm:convergence-in-distribution-via-generating-functions].

The following result is a straightforward corollary of ([eq:GN-GX]).

In a branching process \((Z_n)_{n\ge0}\) with \(Z_0=1\), let the offspring distribution have mean \(m\). Then \({\mathsf{E}}(Z_n)=m^n\) for all \(n\ge 1\).

Recall that a random variable \(X\) with values in \(\mathbb{Z}_{\ge 0}\) has finite mean if and only if the limit \(G_X'(1_{-}):=\lim_{s\uparrow 1} G_{X}'(s)\) exists and is finite, in which case \({\mathsf{E}}(X)=G_X'(1_{-})\). Since \(Z_1\) is assumed to have finite mean \(m\), this implies that \(\varphi'(1_{-}):=\lim_{s\uparrow 1} \varphi'(s)=m\) (where \(\varphi\) is the generating function of \(Z_1\)). We also know by ([eq:GN-GX]) that the generating function of \(Z_n\) is given by \(\varphi_n\), which is just the composition of \(\varphi\) with itself \(n\) times. By the chain rule, and since \(\varphi_k(1)=1\) for all \(k\), we see that \[\varphi_n'(1_{-})=\bigl(\varphi'(1_{-})\bigr)^n=m^n\,,\] implying the result. This can alternatively be shown by induction, using a conditioning argument.

In a branching process \((Z_n)_{n\ge0}\) with \(Z_0=1\), let the offspring distribution have mean \(m\), variance \(\sigma^2\), and generating function \(\varphi\). Write \(\varphi_n\) for the generating function of the \(n\)th generation size \(Z_n\), \(\varphi_n(s)\equiv{\mathsf{E}}(s^{Z_n})\).

  1. Using ([eq:branch-process-gener-fn]) or otherwise, show that \({\mathsf{Var}}(Z_n)=\sigma^2m^{n-1}(m^n-1)/(m-1)\) if \(m\neq1\) and \({\mathsf{Var}}(Z_n)=\sigma^2n\) if \(m=1\).

  2. Deduce that \({\mathsf{E}}\bigl((Z_n/m^n)^2\bigr)\) is uniformly bounded in \(n\) when \(m>1\).

This result suggests that if \(m\equiv{\mathsf{E}}(Z_1) \neq1\), the branching process might explode (for \(m>1\)) or die out (for \(m<1\)). One therefore classifies branching processes as critical (if \(m=1\)), subcritical (\(m<1\)), or supercritical (\(m>1\)).

It is straightforward to describe the case \(m<1\). Indeed, the Markov inequality ([eq:Markov]) implies that \[{\mathsf{P}}(Z_n>0)={\mathsf{P}}(Z_n\ge1)\le{\mathsf{E}}(Z_n)=m^n\,,\] so that \({\mathsf{P}}(Z_n>0)\to0\) as \(n\to\infty\) (ie., \(Z_n\to0\) in probability). We also notice that the average total population in this case is finite, \({\mathsf{E}}\bigl(\sum_{n\ge0}Z_n\bigr)=\sum_{n\ge0}m^n=(1-m)^{-1}<\infty\).
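The bound \({\mathsf{P}}(Z_n>0)\le m^n\) is easy to check numerically by iterating \(\varphi_n(0)=\varphi(\varphi_{n-1}(0))\). A short Python sketch, with an illustrative subcritical pmf of our choosing (\(p_0=0.5\), \(p_1=0.3\), \(p_2=0.2\), so \(m=0.7\)):

```python
# Illustrative subcritical offspring pmf (our choice): m = 0.3 + 2*0.2 = 0.7 < 1
pmf = {0: 0.5, 1: 0.3, 2: 0.2}
m = sum(k * p for k, p in pmf.items())

def phi(s):
    """Offspring generating function phi(s) = sum_k p_k s^k."""
    return sum(p * s**k for k, p in pmf.items())

# P(Z_n > 0) = 1 - phi_n(0), computed by iterating phi from s = 0
s, survival = 0.0, []
for _ in range(20):
    s = phi(s)
    survival.append(1.0 - s)   # survival[n-1] = P(Z_n > 0) <= m**n
```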

The extinction event \({\mathcal{E}}\) is the event \({\mathcal{E}}=\cup_{n=1}^\infty\bigl\{Z_n=0\bigr\}\). Since \(\bigl\{Z_n=0\bigr\}\subseteq\bigl\{Z_{n+1}=0\bigr\}\) for all \(n\ge0\), the extinction probability \(\rho\) is defined as \[\rho={\mathsf{P}}({\mathcal{E}})=\lim_{n\to\infty}{\mathsf{P}}\bigl(Z_n=0\bigr)\,,\] where \({\mathsf{P}}\bigl(Z_n=0\bigr)\equiv\varphi_n(0)\) is the probability of extinction before the \((n+1)\)st generation.

The following result helps to derive the extinction probability \(\rho\) without needing to compute the iterates \(\varphi_n(\,\cdot\,)\) precisely. To avoid trivialities we shall assume that \(p_0={\mathsf{P}}(Z_1=0)\) satisfies  10 \(0<p_0<1\); notice that under this assumption \(\varphi(s)\) is a strictly increasing function of \(s\in[0,1]\).

If \(0<p_0<1\), then the extinction probability \(\rho\) is given by the smallest positive solution to the equation \[\begin{equation} \label{eq:extinction-probability-solution} s=\varphi(s)\,. \end{equation}\] In particular, if \(m={\mathsf{E}}Z_1\le1\), then \(\rho=1\); otherwise, we have \(0<\rho<1\).

In words, if the branching process is subcritical or critical then it eventually becomes extinct with probability one. However, if it is supercritical, the process has a positive probability to survive for all time.
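In practice \(\rho\) can be computed by iterating \(\rho_n=\varphi(\rho_{n-1})\) from \(\rho_0=0\), since \(\rho_n=\varphi_n(0)\nearrow\rho\). A minimal Python sketch (the function name and the example pmf are ours):

```python
def extinction_probability(pmf, tol=1e-12, max_iter=10**5):
    """Iterate rho_{n+1} = phi(rho_n) from rho_0 = 0; the iterates
    rho_n = phi_n(0) = P(Z_n = 0) increase to the extinction probability."""
    phi = lambda s: sum(p * s**k for k, p in pmf.items())
    s = 0.0
    for _ in range(max_iter):
        s_new = phi(s)
        if abs(s_new - s) < tol:
            break
        s = s_new
    return s_new

# Supercritical example: p0 = 1/4, p1 = 1/4, p2 = 1/2, so m = 5/4 > 1;
# here s = phi(s) reads 2s^2 - 3s + 1 = 0, with roots 1/2 and 1, so rho = 1/2
rho = extinction_probability({0: 0.25, 1: 0.25, 2: 0.5})
```

For a subcritical or critical pmf the same iteration converges to \(1\), in line with the theorem.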

There is a clear probabilistic intuition behind the relation \(\rho=\varphi(\rho)\). Indeed, if \(\rho={\mathsf{P}}_1({\mathcal{E}})\) is the extinction probability starting from a single individual, \(Z_0=1\), then by independence we get \({\mathsf{P}}_k({\mathcal{E}})\equiv{\mathsf{P}}({\mathcal{E}}\mid Z_0=k)=\rho^k\), and thus the first step decomposition for the Markov chain \(Z_n\) gives \[\begin{aligned} \rho={\mathsf{P}}({\mathcal{E}})&=\sum_{k\ge0}{\mathsf{P}}({\mathcal{E}},Z_1=k)=\sum_{k\ge0}{\mathsf{P}}({\mathcal{E}}\mid Z_1=k){\mathsf{P}}(Z_1=k) \\ &=\sum_{k\ge0}\rho^k\,{\mathsf{P}}(Z_1=k)\equiv{\mathsf{E}}\bigl(\rho^{Z_1}\bigr)\equiv\varphi(\rho)\,, \end{aligned}\] in agreement with ([eq:extinction-probability-solution]).

Let us now give the proof of Theorem [thm:branching-extinction]. You may find it helpful to draw a picture!

Denote \(\rho_n={\mathsf{P}}\bigl(Z_n=0\bigr)\equiv\varphi_n(0)\). By continuity and strict monotonicity of \(\varphi(\,\cdot\,)\) we have (recall ([eq:branch-process-gener-fn])) \[0<\rho_1=\varphi(0)<\rho_2=\varphi(\rho_1)<\dots<1\,,\] so that the extinction probability \(\rho\in (0,1]\) is the increasing limit of \(\rho_n\) as \(n\to \infty\), and satisfies \[\rho=\lim_n\varphi_n(0)=\lim_n \varphi(\varphi_{n-1}(0))=\varphi(\lim_n \varphi_{n-1}(0))=\varphi(\rho).\] On the other hand, if \(\bar\rho\) is any other fixed point of \(\varphi(\,\cdot\,)\) in \([0,1]\), ie., \(\bar\rho=\varphi(\bar\rho)\), then \(\bar\rho=\varphi_n(\bar\rho)\ge \varphi_n(0)\) for all \(n\), meaning that \(\bar\rho \ge \lim_{n\to \infty} \varphi_n(0)=\rho\). So, \(\rho\) is indeed the smallest positive solution to ([eq:extinction-probability-solution]).

Next, we turn to the extinction criterion in terms of \(m\). For this, observe that \(\varphi(\,\cdot\,)\) is convex on \([0,1]\), since \(\varphi''(s)={\mathsf{E}}(Z_1(Z_1-1)s^{Z_1-2})\ge 0\) for \(s\in [0,1]\) (in fact, unless \(p_1=1-p_0\), ie., provided there is some possibility of having more than one child, \(\varphi''(s)\) is strictly positive and so \(\varphi\) is strictly convex on \((0,1)\)). Hence if \(m=\varphi'(1_{-})>1\) we must have \(\varphi(s_0)<s_0\) for some \(s_0<1\), and therefore the curves \(y=s\) and \(y=\varphi(s)\) must cross at some point strictly inside \((0,1)\) (recall that \(\varphi(0)=p_0>0\)). A completely rigorous way to justify this is to define \(f(s):=\varphi(s)-s\), which is continuous on \([0,s_0]\) with \(f(s_0)<0\) and \(f(0)>0\), so by the intermediate value theorem \(f(s)=0\), ie., \(\varphi(s)=s\), for some \(s\in (0,s_0)\). Conversely, suppose that \(m\le 1\) and \(p_1\ne 1-p_0\). Then the condition \(m=\varphi'(1_{-})\le1\) together with strict convexity implies that \(\varphi(s)-s\) is strictly decreasing on \([0,1]\), from \(p_0\) at \(0\) to \(0\) at \(1\), and therefore cannot vanish at any \(s<1\). The case \(p_1=1-p_0\) gives \(\varphi(s)=p_0+p_1 s\), and it is immediate that the smallest solution of \(\varphi(s)=s\) in \([0,1]\) is at \(1\). This completes the proof.

If \(s\in[0,1)\), we have \(\varphi_n(s)\equiv{\mathsf{E}}\bigl(s^{Z_n}\bigr)\to\rho\in(0,1]\) as \(n\to\infty\).

As a result, the distribution of \(Z_n\) converges to that of \(Z_\infty\), where \({\mathsf{P}}(Z_\infty=0)=\rho\) and \({\mathsf{P}}(Z_\infty=\infty)=1-\rho\).

For a branching process with generating function \(\varphi(s)=as^2+bs+c\), where \(a>0\), \(b>0\), \(c>0\), \(\varphi(1)=1\), compute the extinction probability \(\rho\) and give the condition for sure extinction. Can you interpret your results?

Let \((Z_n)_{n\ge0}\) be a branching process with generating function \(\varphi(s)\equiv{\mathsf{E}}{s^{Z_1}}\) satisfying \(0<\varphi(0)<1\). Let \[\bar\varphi_n(u)\stackrel{{\sf def}}{=}{\mathsf{E}}\bigl(u^{\bar{Z}_n}\bigr)\] be the generating function of \[\bar{Z}_n=\sum_{k=0}^nZ_k,\] the total population size up to time \(n\).

  1. Show that \(\bar\varphi_{n+1}(u)=u\,\varphi\bigl(\bar\varphi_n(u)\bigr)\) for all \(n\ge0\) and \(u\ge0\).

  2. If \(u\in(0,1)\), show that \(\bar\varphi_n(u)\to\bar\varphi(u)\stackrel{{\sf def}}{=}{\mathsf{E}}\bigl(u^{\bar{Z}}\bigr)\) as \(n\to\infty\), where \(\bar{Z}\) is the total population size, \(\sum_{k\ge0}Z_k\), of \((Z_n)_{n\ge0}\). Show that the limiting generating function \(\bar\varphi(u)\) is given by the smallest positive solution \(s\) to the equation \(s=u\varphi(s)\).

  3. Let the process \((Z_n)_{n\ge0}\) be subcritical (\(\varphi'(1_{-})<1\)) with offspring distribution having exponential tails (\(\varphi(s)<\infty\) for some \(s>1\)). Show that for some \(u>1\) the equation \(s=u\varphi(s)\) has positive solutions \(s\), the smallest of which coincides with \(\bar\varphi(u)={\mathsf{E}}\bigl(u^{\bar{Z}}\bigr)\).

  4. Let the process \((Z_n)_{n\ge0}\) be supercritical (\({\mathsf{E}}{Z_1}>1\)) with \(0<{\mathsf{P}}(Z_1=0)<1\) and let \(u>1\) be such that the equation \(s=u\varphi(s)\) has positive solutions \(s\). Show that \(\bar\varphi_n(u)\to\infty\), in agreement with the fact that with positive probability the process \((Z_n)_{n\ge0}\) survives forever, \({\mathsf{P}}(\bar{Z}=\infty)>0\).

  5. In the setting of part d), let \(\hat\varphi_n(u)\stackrel{{\sf def}}{=}{\mathsf{E}}\bigl(u^{\bar{Z}_n}\mathsf{1}_{Z_n=0}\bigr)\), the generating function of the total population on the event that the process \((Z_n)_{n\ge0}\) dies out by time \(n\). Show that \(\hat\varphi_{n+1}(u)=u\varphi\bigl(\hat\varphi_n(u)\bigr)\) for all \(n\ge0\) and \(u\ge0\). Deduce that for each \(u>1\) such that the fixed point equation \(s=u\varphi(s)\) has positive solutions \(s\), we have \(\hat\varphi_n(u)\to{\mathsf{E}}\bigl(u^{\bar{Z}}\mathsf{1}_{\bar{Z}<\infty}\bigr)\), where the latter coincides with the smallest positive \(s\) satisfying \(s=u\varphi(s)\).

We now turn to classification of states for the Markov chain \(Z_n\) in \({\mathbb{Z}}^+\). Of course, since \(0\) is an absorbing state, it is recurrent.

If \(p_1={\mathsf{P}}(Z_1=1)\neq1\), then every state \(k\in{\mathbb{N}}\) is transient. As a result, \[{\mathsf{P}}(Z_n\to\infty)=1-{\mathsf{P}}(Z_n\to0)=1-\rho\,.\]

We first show that every \(k\in{\mathbb{N}}\) is transient. If \(p_0=0\), then \(Z_n\) is a non-decreasing Markov chain (ie., \(Z_{n+1}\ge Z_n\)), so that for every \(k\in{\mathbb{N}}\) the first passage probability \(f_{kk}\) satisfies \[f_{kk}={\mathsf{P}}\bigl(Z_{n+1}=k\mid Z_n=k\bigr)={\mathsf{P}}_k(Z_1=k)=(p_1)^k<1\,.\] On the other hand, for \(p_0\in(0,1]\) we have \[f_{kk}\le{\mathsf{P}}\bigl(Z_{n+1}\neq0\mid Z_n=k\bigr)={\mathsf{P}}_k(Z_1\neq0)=1-{\mathsf{P}}_k(Z_1=0)=1-(p_0)^k<1\,.\] This means that \[{\mathsf{P}}(Z_n=k \text{ i.o.})=\lim_{m\to \infty} {\mathsf{P}}(Z_n \text{ returns to } k \text{ at least } m \text{ times})\le \lim_{m\to \infty} f_{kk}^{m-1}=0\] and the state \(k\) is transient.

Fix arbitrary \(K>0\). Since the states \(1\), \(2\), …, \(K\) are transient, we see that \({\mathsf{P}}\bigl(\{Z_n=0\}\cup\{Z_n>K\}\bigr)\to1\) as \(n\to \infty\) and therefore \[{\mathsf{P}}\bigl(Z_n\to0\text{ or }Z_n\to\infty\bigr)=1\,.\] As the LHS above equals \({\mathsf{P}}(Z_n\to0)+{\mathsf{P}}(Z_n\to\infty)\), the result follows from the observation that \({\mathsf{P}}(Z_n\to0)\equiv{\mathsf{P}}({\mathcal{E}})=\rho\).

For a supercritical branching process \((Z_n)_{n\ge0}\), let \(T_0=\min\{n\ge0:Z_n=0\}\) be its extinction time and let \(\rho={\mathsf{P}}(T_0<\infty)>0\) be its extinction probability. Define \((\widehat{Z}_n)_{n\ge0}\) as \(Z_n\) conditioned on extinction, ie., \(\widehat{Z}_n=\bigl(Z_n\mid T_0<\infty\bigr)\).

  1. Show that the transition probabilities \(\hat{p}_{xy}\) of \((\widehat{Z}_n)_{n\ge0}\) and the transition probabilities \(p_{xy}\) of the original process \((Z_n)_{n\ge0}\) are related via \(\hat{p}_{xy}=p_{xy}\rho^{y-x}\), \(x,y\ge0\).

  2. Deduce that the generating functions \(\widehat\varphi(s)\equiv{\mathsf{E}}_1\bigl[s^{\widehat{Z}_1}\bigr]\) and \(\varphi(s)\equiv{\mathsf{E}}_1\bigl[s^{Z_1}\bigr]\) are related via 11 \(\widehat\varphi(s)=\frac1\rho\varphi(\rho s)\), \(0\le s\le1\).

  3. If the offspring distribution of \((Z_n)_{n\ge0}\) is \({\mathsf{Poi}}(\lambda)\) with \(\lambda>1\), use the fixed point equation \(\rho=e^{\lambda(\rho-1)}\) to show that \(\widehat\varphi(s)=e^{\lambda\rho(s-1)}\), ie., that the offspring distribution for \((\widehat{Z}_n)_{n\ge0}\) is just \({\mathsf{Poi}}(\lambda\rho)\).

Let \((Z_n)_{n\ge0}\) be a supercritical branching process with offspring distribution \(\{p_k\}_{k\ge0}\), offspring generating function \(\varphi(s)\) and extinction probability \(\rho\in[0,1)\).

  1. If \(Z_0=1\), let \(\tilde p_k\) be the probability that conditioned on survival the first generation has exactly \(k\) individuals with an infinite line of descent. Show that \[\tilde p_k=\frac1{1-\rho}\sum_{n=k}^\infty p_n\binom{n}{k}(1-\rho)^k\rho^{n-k}\,.\]

  2. Let \((\widetilde{Z}_n)_{n\ge0}\) count only those individuals in \((Z_n)_{n\ge0}\), who conditioned on survival have an infinite line of descent. Show that \((\widetilde{Z}_n)_{n\ge0}\) is a branching process with offspring generating function [rescaled-generating-function] \[\widetilde\varphi(s)=\frac1{1-\rho}\Bigl(\varphi\bigl((1-\rho)s+\rho\bigr)-\rho\Bigr)\,.\]

Let \((Z_n)_{n\ge0}\) be a subcritical branching process whose generating function \(\varphi(s)={\mathsf{E}}(s^{Z_1})\) is finite for some \(s>1\), ie., the offspring distribution has finite exponential moments in a neighbourhood of the origin.

  1. Using the result of Exercise [exse:branching-process-total-population-size-generating-function] or otherwise, show that the total population size \(\bar{Z}=\sum_{k\ge0}Z_k\) satisfies \({\mathsf{E}}\bigl(u^{\bar{Z}}\bigr)<\infty\) for some \(u>1\).

  2. Suppose that for each \(1\le i\le\bar{Z}\), individual \(i\) produces wealth of size \(W_i\), where \(W_i\) are independent random variables with common distribution satisfying \({\mathsf{E}}\bigl(s^W\bigr)<\infty\) for some \(s>1\). Show that for some \(u>1\) we have \({\mathsf{E}}\bigl(u^{\overline{W}}\bigr)<\infty\), where \(\overline{W}=W_1+\dots+W_{\bar{Z}}\) is the total wealth generated by \((Z_n)_{n\ge0}\).

2.2 Critical case \(m=1\)

The following example is one of very few for which the computation in the critical case \(m={\mathsf{E}}(Z_1) =1\) can be done explicitly.

Consider the so-called linear-fractional case, where the offspring distribution is given by \(p_j=2^{-(j+1)}\), \(j\ge0\). Then the offspring generating function is \(\varphi(s)=\sum_{j\ge0}s^j/2^{j+1}=(2-s)^{-1}\) and a straightforward induction gives (check this!) \[\varphi_k(s)=\frac{k-(k-1)s}{(k+1)-ks} =\frac{k}{k+1}+\frac{1}{k(k+1)}\sum_{m\ge1}\Bigl(\frac{ks}{k+1}\Bigr)^m\,,\] so that \({\mathsf{P}}(Z_k=0)=\varphi_k(0)=k/(k+1)\), \({\mathsf{P}}(Z_k>0)=1/(k+1)\), and \[{\mathsf{P}}\bigl(Z_k=m\mid Z_k>0\bigr)=\frac{1}{k+1}\,\Bigl(\frac{k}{k+1}\Bigr)^{m-1}\,,\qquad m\ge1\,,\] ie., \((Z_k\mid Z_k>0\bigr)\) has geometric distribution with success probability \(1/(k+1)\).
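The closed form for \(\varphi_k\) above can be verified against the \(k\)-fold iteration of \(\varphi(s)=(2-s)^{-1}\), eg., in exact rational arithmetic. A small Python sketch (the helper names are ours):

```python
from fractions import Fraction

def phi(s):
    """Offspring generating function for p_j = 2^{-(j+1)}: phi(s) = 1/(2 - s)."""
    return 1 / (2 - s)

def phi_k_closed(k, s):
    """Claimed closed form: phi_k(s) = (k - (k-1)s) / ((k+1) - ks)."""
    return (k - (k - 1) * s) / ((k + 1) - k * s)

def phi_k_iterated(k, s):
    """phi composed with itself k times, evaluated at s."""
    for _ in range(k):
        s = phi(s)
    return s
```

Evaluating at `Fraction` inputs keeps every step exact, so the two expressions can be compared for equality rather than up to rounding; in particular \(\varphi_k(0)=k/(k+1)\).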

For each \(k\ge 0\), by the partition theorem, \[1={\mathsf{E}}(Z_k)={\mathsf{E}}\bigl(Z_k\mid Z_k>0\bigr)\,{\mathsf{P}}(Z_k>0)+{\mathsf{E}}\bigl(Z_k\mid Z_k=0\bigr)\,{\mathsf{P}}(Z_k=0)\,,\] so that in the previous example we have \[{\mathsf{E}}\bigl(Z_k\mid Z_k>0\bigr)=\frac{1}{{\mathsf{P}}(Z_k>0)}=k+1\,,\] ie., conditional on survival, the average generation size grows linearly with time.

The following example is known as the general linear-fractional case:

For fixed \(b>0\) and \(p\in(0,1)\), consider a branching process with offspring distribution \(p_j=b\,p^{j-1}\), \(j\ge1\), and \(p_0=1-\sum_{j\ge1}p_j\).

  1. Show that for \(b\in(0,1-p)\) the distribution above is well defined; find the corresponding \(p_0\), and show that \[\varphi(s)=\frac{1-b-p}{1-p}+\frac{bs}{1-ps}\,;\]

  2. Find \(b\) for which the branching process is critical and show that then \[\varphi_k(s)={\mathsf{E}}\bigl(s^{Z_k}\bigr)=\frac{kp-(kp+p-1)s}{(1-p+kp)-kps}\,;\]

  3. Deduce that \((Z_k\mid Z_k>0\bigr)\) is geometrically distributed with parameter \(\frac{1-p}{kp+1-p}\).

Straightforward computer experiments show that a similar linear growth of \({\mathsf{E}}\bigl(Z_k\mid Z_k>0\bigr)\) takes place for other critical offspring distributions, eg., the one with \(\varphi(s)=(1+s^2)/2\).

If the offspring distribution of the branching process \((Z_k)_{k\ge0}\) has mean \(m=1\) and finite variance \(\sigma^2>0\), then \(k\,{\mathsf{P}}(Z_k>0)\to\frac{2}{\sigma^2}\) as \(k\to\infty\); equivalently, \[\begin{equation} \label{eq:critical-branching-expectation} \frac1k\,{\mathsf{E}}(Z_k\mid Z_k>0\bigr)\to\frac{\sigma^2}2\quad\text{ as $k\to\infty$.} \end{equation}\]

This general result suggests that, conditional on survival, a general critical branching process exhibits linear intermittent behaviour;  12 namely, with small probability (of order \(2/(k\sigma^2)\)) the values of \(Z_k\) are of order \(k\).
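For the offspring law \(p_0=p_2=\tfrac12\) (so \(m=1\), \(\sigma^2=1\), and \(\varphi(s)=(1+s^2)/2\)) the theorem predicts \(k\,{\mathsf{P}}(Z_k>0)\to 2\); this is easy to check numerically by iterating \(\varphi\). A small Python sketch:

```python
# Critical offspring law p_0 = p_2 = 1/2: m = 1, sigma^2 = 1, so by the
# theorem k * P(Z_k > 0) should approach 2/sigma^2 = 2 as k grows
def phi(s):
    return (1 + s * s) / 2

s = 0.0
k_max = 20000
for _ in range(k_max):
    s = phi(s)                      # after k steps, s = phi_k(0) = P(Z_k = 0)
survival_scaled = k_max * (1 - s)   # k * P(Z_k > 0), expected to be near 2
```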

Our argument is based on the following general fact:  13

Let \((y_n)_{n\ge0}\) be a real-valued sequence. If for some constant \(a\) we have \(y_{n+1}-y_n\to a\) as \(n\to\infty\), then \(n^{-1}y_n\to a\) as \(n\to\infty\).

By changing the variables \(y_n\mapsto y'_n=y_n-na\) if necessary, we can and shall assume that \(a=0\). Fix an arbitrary \(\delta>0\) and find \(K>0\) such that \(|y_{n+1}-y_n|\le\delta\) for all \(n\ge K\). Writing, for \(n>K\), \(y_n-y_K=\sum\limits_{j=K}^{n-1}\bigl(y_{j+1}-y_j\bigr)\), we deduce that \(|y_n-y_K|\le\delta(n-K)\), so that the claim follows from the estimate \[\Bigl|\frac{y_n}{n}\Bigr|\le\Bigl|\frac{y_n-y_K}{n}\Bigr|+\Bigl|\frac{y_K}{n}\Bigr|\le\delta+\Bigl|\frac{y_K}{n}\Bigr|\le2\delta\,,\] valid for all sufficiently large \(n\).

(of Theorem [thm:critical-branching]). We only derive the second claim of the theorem, ([eq:critical-branching-expectation]). By the assumptions and Taylor’s theorem (here we also use Theorem [thm:moments-from-generating-functions]) the offspring generating function \(\varphi\) satisfies \[1-\varphi(s)=(1-s)-\Bigl(\frac{\sigma^2}{2}+R(s)\Bigr)(1-s)^2\,, \text{ where } R(s)\to 0 \text{ as } s\uparrow 1.\] Since \(\varphi_n(0)={\mathsf{P}}(Z_n=0)\to 1\) as \(n\to \infty\), this means in particular that \[1-\varphi_{n+1}(0)=1-\varphi(\varphi_n(0))=\bigl(1-\varphi_n(0)\bigr)-\bigl(1-\varphi_n(0)\bigr)^2\Bigl(\frac{\sigma^2}{2}+r_n\Bigr)\] with \(r_n:=R(\varphi_n(0))\to 0\) as \(n\to \infty\). Setting \[y_n=\frac{1}{1-\varphi_n(0)} =\frac{{\mathsf{E}}(Z_n)}{{\mathsf{P}}(Z_n>0)}={\mathsf{E}}(Z_n\mid Z_n>0)\] we therefore have \[y_{n+1}-y_n = \frac{1}{1-\varphi_n(0)}\biggl(\frac{1}{1-(1-\varphi_n(0))(\sigma^2/2+r_n)}-1\biggr) = \frac{\sigma^2/2+r_n}{1-(1-\varphi_n(0))(\sigma^2/2+r_n)}\] for each \(n\). It therefore follows that \(y_{n+1}-y_n\to \sigma^2/2\) as \(n\to \infty\), and hence \[\frac{y_n}{n}=\frac{{\mathsf{E}}(Z_n\mid Z_n>0)}{n}\to \frac{\sigma^2}{2}\] as well (by the Lemma). This completes the proof.

With a bit of extra work 14 one can generalize the above proof to show that \[\lim\limits_{n\to\infty}n^{-1}\Bigl(\frac{1}{1-\varphi_n(s)}-\frac{1}{1-s}\Bigr)=\frac{\sigma^2}{2}\] for any \(s\in [0,1)\) and use this relation to derive the convergence in distribution:

If \({\mathsf{E}}Z_1=1\) and \({\mathsf{Var}}(Z_1)=\sigma^2\in(0,\infty)\), then for every \(z\ge0\) we have \[\lim_{k\to\infty}{\mathsf{P}}\Bigl(\frac{Z_k}{k}>z\mid Z_k>0\Bigr)=\exp\Bigl\{-\frac{2z}{\sigma^2}\Bigr\}\,,\] ie., the distribution of \(\bigl(k^{-1}Z_k\mid Z_k>0\bigr)\) is approximately exponential with parameter \(2/\sigma^2\).

In the setup of Example [exmle:linear-fractional-branching-process], we have \[{\mathsf{P}}\bigl(Z_k>m\mid Z_k>0\bigr)=\Bigl(\frac{k}{k+1}\Bigr)^m=\Bigl(1-\frac1{k+1}\Bigr)^m\,,\] so that \({\mathsf{P}}\bigl(Z_k>kz\mid Z_k>0\bigr)\to e^{-z}\) as \(k\to\infty\); in other words, for large \(k\) the distribution of \(\bigl(k^{-1}Z_k\mid Z_k>0\bigr)\) is approximately \({\mathsf{Exp}}(1)\).

Let \((Z_n)_{n\ge0}\) be the critical branching process from Exercise [exse:general-linear-fractional-branching-process], namely, the one whose offspring distribution is given by \((p_j)_{j\ge0}\), \[p_j=b\,p^{j-1}\,,\quad j\ge1\,,\qquad p_0=1-\sum_{j\ge1}p_j\,,\] where \(b>0\) and \(p\in(0,1)\) are fixed parameters. Show that the result of Theorem [thm:critical-branching-limit-distribution] holds: for every \(z\ge0\) \[\lim_{k\to\infty}{\mathsf{P}}\Bigl(\frac{Z_k}{k}>z\mid Z_k>0\Bigr)=\exp\Bigl\{-\frac{2z}{\sigma^2}\Bigr\}\,,\] where \({\mathsf{Var}}(Z_1)=\sigma^2\in(0,\infty)\).

2.3 Non-homogeneous case

If the offspring distribution changes with time, the previous approach must be modified. Let \(\psi_n(u)\) be the generating function of the offspring distribution of a single ancestor in the \((n-1)\)st generation, \[\psi_n(u)={\mathsf{E}}\bigl(u^{Z_n}\mid Z_{n-1}=1\bigr)\,\] (so in the cases considered up to now, \(\psi_n=\varphi\) for every \(n\)). Then the generating function \(\varphi_n(u)={\mathsf{E}}\bigl(u^{Z_n}\mid Z_0=1\bigr)\) of the population size at time \(n\) given a single ancestor at time \(0\), can be defined recursively as follows: \[\varphi_0(u)\equiv u\,,\qquad \varphi_n(u)=\varphi_{n-1}\bigl(\psi_n(u)\bigr)\,,\quad \forall n\ge1\,.\] If \(\mu_n={\mathsf{E}}(Z_n\mid Z_{n-1}=1)=\psi'_n(1)\) denotes the average offspring size in the \(n\)th generation given a single ancestor in the previous generation, then \[m_n\equiv{\mathsf{E}}\bigl(Z_n\mid Z_0=1\bigr)=\mu_1\mu_2\ldots\mu_{n-1}\mu_n\,.\] It is natural to call the process \((Z_n)_{n\ge0}\) supercritical if \(m_n\to\infty\) and subcritical if \(m_n\to0\) as \(n\to\infty\).
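The recursion \(\varphi_n(u)=\varphi_{n-1}(\psi_n(u))\) unrolls to \(\varphi_n(u)=\psi_1(\psi_2(\dots\psi_n(u)\dots))\), which is easy to implement. As a sanity check (the setup is ours, not from the notes), take Bernoulli offspring laws \(\psi_n(u)=(1-p_n)+p_nu\): every composition then stays affine, and \(\varphi_n(u)=1+(p_1\cdots p_n)(u-1)\), so that \({\mathsf{P}}(Z_n=0)=1-p_1\cdots p_n\).

```python
def compose_generating_functions(psis, u):
    """phi_n(u) via phi_n = phi_{n-1} o psi_n, starting from phi_0(u) = u.

    Unrolling the recursion gives phi_n(u) = psi_1(psi_2(...psi_n(u)...)),
    so the innermost function psi_n is applied first."""
    for psi in reversed(psis):
        u = psi(u)
    return u

# Bernoulli offspring in generation n: psi_n(u) = (1 - p_n) + p_n * u,
# for which phi_n(u) = 1 + (p_1 ... p_n)(u - 1) in closed form
ps = [0.9, 0.5, 0.8, 0.7]
psis = [lambda u, p=p: (1 - p) + p * u for p in ps]
prod = 1.0
for p in ps:
    prod *= p
```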

A strain of phototrophic bacteria uses light as the main source of energy. As a result individual organisms reproduce with probability mass function \(p_0=1/4\), \(p_1=1/4\) and \(p_2=1/2\) per unit of time in light environment, and with probability mass function \(p_0=1-p\), \(p_1=p\) (with some \(p>0\)) per unit of time in dark environment. A colony of such bacteria is grown in a laboratory, with alternating light and dark unit time intervals.
a) Model this experiment as a time non-homogeneous branching process \((Z_n)_{n\ge0}\) and describe the generating function of the population size at the end of the \(n\)th interval.
b) Characterise all values of \(p\) for which the branching process \(Z_n\) is subcritical and for which it is supercritical.
c) Let \((D_k)_{k\ge0}\) be the original process observed at the end of each even interval, \(D_k\stackrel{{\sf def}}{=}Z_{2k}\). Find the generating function of \((D_k)_{k\ge0}\) and derive the condition for sure extinction. Compare your result with that of part b).

3 Coupling

Two random variables, say \(X\) and \(Y\), are coupled if they are defined on the same probability space. To couple two given variables \(X\) and \(Y\) means to define a random vector \(\bigl(\widetilde{X},\widetilde{Y}\bigr)\) with joint probability \(\widetilde{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) on some probability space 15 so that the marginal distribution of \(\widetilde{X}\) coincides with the distribution of \(X\) and the marginal distribution of \(\widetilde{Y}\) coincides with the distribution of \(Y\). Recall the following example:

Fix \(p_1\), \(p_2\in[0,1]\) such that \(p_1\le p_2\) and consider the following joint distributions (we write \(q_i=1-p_i\)):

  1. \[\begin{array}{c|cc|c} & \widetilde{Y}=0 & \widetilde{Y}=1 & \\ \hline \widetilde{X}=0 & q_1q_2 & q_1p_2 & q_1 \\ \widetilde{X}=1 & p_1q_2 & p_1p_2 & p_1 \\ \hline & q_2 & p_2 & \end{array}\]

  2. \[\begin{array}{c|cc|c} & \widetilde{Y}=0 & \widetilde{Y}=1 & \\ \hline \widetilde{X}=0 & q_2 & p_2-p_1 & q_1 \\ \widetilde{X}=1 & 0 & p_1 & p_1 \\ \hline & q_2 & p_2 & \end{array}\]

(in each table the last column gives the marginal distribution of \(\widetilde{X}\) and the last row that of \(\widetilde{Y}\)).

It is easy to see that in both cases  16 \[\widetilde{X}\sim{\mathsf{Ber}}(p_1)\, , \, \widetilde{Y}\sim{\mathsf{Ber}}(p_2),\] though in the first case \(\widetilde{X}\) and \(\widetilde{Y}\) are independent, whereas in the second case we have \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\).
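The second table can be checked mechanically: written out as a joint pmf on \(\{0,1\}^2\), it has the \({\mathsf{Ber}}(p_1)\) and \({\mathsf{Ber}}(p_2)\) marginals and puts full mass on \(\{\widetilde{X}\le\widetilde{Y}\}\). A small Python sketch with illustrative values \(p_1=0.3\le p_2=0.7\):

```python
# The second (monotone) coupling table, written out as a joint pmf on {0,1}^2,
# for illustrative values p1 <= p2
p1, p2 = 0.3, 0.7
q1, q2 = 1 - p1, 1 - p2
joint = {(0, 0): q2, (0, 1): p2 - p1, (1, 0): 0.0, (1, 1): p1}

# Marginal distributions and the ordering probability
p_x_is_1 = joint[(1, 0)] + joint[(1, 1)]                      # should equal p1
p_y_is_1 = joint[(0, 1)] + joint[(1, 1)]                      # should equal p2
p_ordered = sum(p for (x, y), p in joint.items() if x <= y)   # should equal 1
```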

3.1 Stochastic domination

If \(X\sim{\mathsf{Ber}}(p)\), its tail probabilities \({\mathsf{P}}(X>a)\) satisfy \[{\mathsf{P}}(X>a)=\begin{cases}1,&a<0\,,\\p,&0\le a<1\,,\\0,&a\ge1\,.\end{cases}\] Consequently, in the setup of [exmle:coupled-bernoulli], for the variables \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) with \(p_1\le p_2\) we have \({\mathsf{P}}(X>a)\le{\mathsf{P}}(Y>a)\) for all \(a\in{\mathbb{R}}\). The last inequality is useful enough to deserve a name:

A random variable \(X\) is stochastically smaller than a random variable \(Y\) (write \(X\preccurlyeq Y\)) if the inequality \[\begin{equation} \label{eq:stoch-domination} {\mathsf{P}}\bigl(X>x\bigr)\le{\mathsf{P}}\bigl(Y>x\bigr) \end{equation}\] holds for all \(x\in{\mathbb{R}}\).

If \(X\preccurlyeq Y\) and \(g(\,\cdot\,)\ge0\) is an arbitrary increasing function on \({\mathbb{R}}\), then \(g(X)\preccurlyeq g(Y)\). If, in addition, \(X\ge0\), \(Y\ge0\), and \(g(\cdot)\) is smooth with \(g(0)=0\), then \[\begin{equation} \label{eq:stochastic-domination-expectation-of-monotone-function} {\mathsf{E}}g(X)\equiv\int_0^\infty g'(z){\mathsf{P}}(X>z)\,dz\le\int_0^\infty g'(z){\mathsf{P}}(Y>z)\,dz\equiv{\mathsf{E}}g(Y)\,. \end{equation}\]

If \(X\) is a random variable and \(a\ge0\) is a fixed constant, it is immediate that \(Y=a+X\) stochastically dominates \(X\), ie., \(X\preccurlyeq Y\): for each \(x\in{\mathbb{R}}\) we have \({\mathsf{P}}(X>x)\le{\mathsf{P}}(X>x-a)={\mathsf{P}}(Y>x)\). Similarly, if \(a\ge1\) and \(Z=aX\ge0\), then \(X\preccurlyeq Z\).

In the setup of [exmle:coupled-bernoulli], if \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) then \(X\) is stochastically smaller than \(Y\) (ie., \(X\preccurlyeq Y\)) if and only if \(p_1\le p_2\); moreover, this is equivalent to existence of a coupling \(\bigl(\widetilde{X},\widetilde{Y}\bigr)\) of \(X\) and \(Y\) in which these variables are ordered with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\). The next result shows that this is a rather generic situation.

A random variable \(X\) is stochastically smaller than a random variable \(Y\) if and only if there exists a coupling \(\bigl(\widetilde{X},\widetilde{Y})\) of \(X\) and \(Y\) such that \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\).

Notice that one claim of [lem:coupling-vs-stoch-domination] is immediate from \[{\mathsf{P}}(x<X)\equiv\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{X}\bigr)=\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{X}\le\widetilde{Y}\bigr) \le\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{Y}\bigr)\equiv{\mathsf{P}}\bigl(x<Y\bigr)\,;\] the other claim requires a more advanced argument (we shall not do it here!).

If \(X\sim{\mathsf{Bin}}(m,p)\) and \(Y\sim{\mathsf{Bin}}(n,p)\) with \(m\le n\), then \(X\preccurlyeq Y\). Indeed, let \(Z_1\sim{\mathsf{Bin}}(m,p)\) and \(Z_2\sim{\mathsf{Bin}}(n-m,p)\) be independent variables defined on the same probability space. We then put \(\widetilde{X}=Z_1\) and \(\widetilde{Y}=Z_1+Z_2\) so that \(\widetilde{Y}-\widetilde{X}=Z_2\ge0\) with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\), and \(X\sim\widetilde{X}\), \(Y\sim\widetilde{Y}\).

If \(X\sim{\mathsf{Poi}}(\lambda)\) and \(Y\sim{\mathsf{Poi}}(\mu)\) with \(\lambda\le\mu\), then \(X\preccurlyeq Y\). Indeed, let \(Z_1\sim{\mathsf{Poi}}(\lambda)\) and \(Z_2\sim{\mathsf{Poi}}(\mu-\lambda)\) be independent variables defined on the same probability space.  17 We then put \(\widetilde{X}=Z_1\) and \(\widetilde{Y}=Z_1+Z_2\) so that \(\widetilde{Y}-\widetilde{X}=Z_2\ge0\) with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\), and \(X\sim\widetilde{X}\), \(Y\sim\widetilde{Y}\).

Let \((X_n)_{n\ge0}\) be a branching process with offspring distribution \(\{p_m\}_{m\ge0}\) and \(X_0=1\). Let \((Y_n)_{n\ge0}\) be a branching process with the same offspring distribution and \(Y_0=2\). We can use stochastic domination to show that \({\mathsf{P}}(X_n=0)\ge{\mathsf{P}}(Y_n=0)\) for all \(n\).

Indeed, it is enough to show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\). To this end consider two independent branching processes, \((Z'_n)_{n\ge0}\) and \((Z''_n)_{n\ge0}\), having the same offspring distribution \(\{p_m\}_{m\ge0}\) and satisfying \(Z'_0=Z''_0=1\). We then put \(\widetilde{X}_n=Z'_n\) and \(\widetilde{Y}_n=Z'_n+Z''_n\) for all \(n\ge0\), so that \(\widetilde{Y}_n-\widetilde{X}_n=Z''_n\ge0\), ie., \(\widetilde{{\mathsf{P}}}(\widetilde{X}_n\le\widetilde{Y}_n)=1\) and \(X_n\sim \widetilde{X}_n\), \(Y_n\sim \widetilde{Y}_n\) for all \(n\ge0\).

For a given offspring distribution \(\{p_m\}_{m\ge0}\), let \((X_n)_{n\ge0}\) be the branching process with \(X_0=k\) and let \((Y_n)_{n\ge0}\) be the branching process with \(Y_0=l\), \(k<l\). Show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\).

Let \(X\), \(Y\), \(Z\) be random variables in \({\mathbb{Z}}_+\). If \(X\preccurlyeq Y\) and \(Z\) is independent of \(\{X,Y\}\), show that \(X+Z\preccurlyeq Y+Z\).
Let \(\{X_i,Y_i\}_{i=1,\dots,n}\) be random variables in \({\mathbb{Z}}_+\). If the pairs \((X_i,Y_i)\) are mutually independent and \(X_i\preccurlyeq Y_i\) for each \(i\), show that \(X_1+\dots+X_n\preccurlyeq Y_1+\dots+Y_n\).

Let \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) be standard branching processes with \(X_0=Y_0=1\). Assume that the offspring distribution of \(X\) is stochastically smaller than that of \(Y\) (\(X_1\preccurlyeq Y_1\)), ie., for all integer \(k\ge0\), \({\mathsf{P}}(X_1>k|X_0=1)\le{\mathsf{P}}(Y_1>k|Y_0=1)\). Show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\).

3.2 Total variation distance

Let \(\mu\) and \(\nu\) be two probability measures on the same probability space. The total variation distance between \(\mu\) and \(\nu\) is \[\begin{equation} \label{eq:TV-dist-def} {\mathsf{d_{TV}}}(\mu,\nu)\stackrel{{\sf def}}{=}\max_A\bigl|\mu(A)-\nu(A)\bigr|\,. \end{equation}\] If \(X\), \(Y\) are random variables in \({\mathbb{Z}}^+\) with respective p.m.f.s \(\{p\} = \{p_k\}_{k \geq 0}\) and \(\{q\}=\{q_\ell\}_{\ell \geq 0}\), then \[{\mathsf{d_{TV}}}(X,Y) \stackrel{{\sf def}}{=}{\mathsf{d_{TV}}}(\{p\},\{q\}) = \max_{A\subseteq {\mathbb{Z}}^+}\bigl|\textstyle\sum_{k\in A} p_k - \sum_{k \in A} q_k \bigr|.\]

If \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) we have \[{\mathsf{d_{TV}}}(X,Y)=\max\bigl\{|p_1-p_2|,|q_1-q_2|\bigr\}=|p_1-p_2|=\tfrac12\bigl(|p_1-p_2|+|q_1-q_2|\bigr)\,.\]

Let probability measures \(\mu\) and \(\nu\) have respective p.m.f.s \(\{p_x\}\) and \(\{q_y\}\). Show that the total variation distance between \(\mu\) and \(\nu\) is \[{\mathsf{d_{TV}}}(\mu,\nu)\equiv{\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)=\frac12\sum_z\bigl|p_z-q_z\bigr|\,.\] Deduce that \({\mathsf{d_{TV}}}(\cdot,\cdot)\) is a distance between probability measures 18 (ie., it is non-negative, symmetric, and satisfies the triangle inequality) such that \({\mathsf{d_{TV}}}(\cdot,\cdot)\le1\) for all probability measures \(\mu\) and \(\nu\).
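The half-sum formula from the exercise gives a convenient way to compute \({\mathsf{d_{TV}}}\) for pmfs. A minimal Python sketch (the function name is ours):

```python
def tv_distance(p, q):
    """d_TV({p},{q}) = (1/2) * sum_z |p_z - q_z| for pmfs given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in support)
```

For \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) it returns \(|p_1-p_2|\), in agreement with the example above; for pmfs with disjoint supports it returns \(1\).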

An important relation between coupling and the total variation distance is explained by the following fact.

Let random variables \(X\) and \(Y\) be such that \({\mathsf{P}}(X=x)=p_x\) and \({\mathsf{P}}(Y=y)=q_y\). Define the coupling \(\bigl(\widetilde{X},\widetilde{Y}\bigr)\) via 19 \[\begin{equation} \label{eq:max-coupling} \widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=z\bigr)=\min\bigl(p_z,q_z\bigr)\,, \end{equation}\]\[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=x,\widetilde{Y}=y\bigr)=\frac{\bigl(p_x-\min(p_x,q_x)\bigr)\,\bigr(q_y-\min(p_y,q_y)\bigr)}{{\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)}\,,\quad x\neq y\,.\]

Show that \[\sum_x\bigl(p_x-\min(p_x,q_x)\bigr)=\sum_y\bigr(q_y-\min(p_y,q_y)\bigr)={\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)\,,\] and deduce that \(\eqref{eq:max-coupling}\) is indeed a coupling of \(X\) and \(Y\) (ie., that \(\eqref{eq:max-coupling}\) defines a probability distribution with correct marginals).

Consider \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) with \(p_1\le p_2\). It is a straightforward exercise to check that the second table in [exmle:coupled-bernoulli] provides the maximal coupling of \(X\) and \(Y\). We notice also that in this case \[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)=p_2-p_1={\mathsf{d_{TV}}}(X,Y)\,.\]

Let \(\widehat{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) be the maximal coupling \(\eqref{eq:max-coupling}\) of \(X\) and \(Y\). Then for every other coupling \(\widetilde{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) of \(X\) and \(Y\) we have \[\begin{equation} \label{eq:coupling-inequality-dTV} \widetilde{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\ge\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)={\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\,. \end{equation}\]

Summing the inequalities \(\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=z\bigr)\le\min(p_z,q_z)\) we deduce \[\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\ge1-\sum_z\min(p_z,q_z)=\sum_z\bigl(p_z-\min(p_z,q_z)\bigr)={\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)\,,\] in view of [exse:dTV-vs-coupling] and [exmle:max-coupling-Bernoulli].

Notice that according to \(\eqref{eq:coupling-inequality-dTV}\), \[\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)\le\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)=1-{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\,,\] ie., the probability that \(\widetilde{X}=\widetilde{Y}\) is maximised under the optimal coupling \(\widehat{{\mathsf{P}}}(\cdot\,,\,\cdot)\).

The maximal coupling of \(X\sim{\mathsf{Ber}}(p)\) and \(Y\sim{\mathsf{Poi}}(p)\) satisfies \[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=0\bigr)=1-p\,,\qquad \widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=1\bigr)=pe^{-p}\,,\] and \(\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=x\bigr)=0\) for all \(x>1\). As a result, \[\begin{equation} \label{eq:Bernoulli-Poisson-max-coupling} {\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)=1-\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)=p\bigl(1-e^{-p}\bigr)\le p^2\,. \end{equation}\] Is either of the variables \(X\) and \(Y\) stochastically dominated by the other?
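The diagonal masses and the resulting bound \(\eqref{eq:Bernoulli-Poisson-max-coupling}\) can be confirmed numerically; a small Python sketch (the value \(p=0.1\) is an arbitrary choice of mine):

```python
from math import exp, factorial

# Check of the Bernoulli(p) vs Poisson(p) maximal-coupling computation above:
# the diagonal masses are min(p_z, q_z), and d_TV = p(1 - e^{-p}) <= p^2.
p = 0.1
ber = {0: 1 - p, 1: p}
poi = {k: exp(-p) * p**k / factorial(k) for k in range(20)}  # truncated, tail negligible

diag0 = min(ber[0], poi[0])   # = 1 - p, since e^{-p} >= 1 - p
diag1 = min(ber[1], poi[1])   # = p e^{-p}
assert abs(diag0 - (1 - p)) < 1e-12
assert abs(diag1 - p * exp(-p)) < 1e-12

d_tv = 1 - diag0 - diag1      # the Bernoulli law puts no mass on z >= 2
assert abs(d_tv - p * (1 - exp(-p))) < 1e-12
assert d_tv <= p**2
```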

3.3 Applications to convergence

3.3.1 Coupling of Markov chains

Coupling is a very useful tool for proving convergence towards equilibrium for various processes, including Markov chains and random walks. The following example uses the independent coupling 20 of two random walks on a complete graph.

On the complete graph \(K_m\) on \(m\) vertices, consider the random walk \((X_n)_{n\ge0}\) with \(X_0=x\), such that at every step it jumps to a vertex of \(K_m\) chosen uniformly at random (including, possibly, the current one, so that each of the \(m\) vertices is selected with probability \(1/m\)). To show that eventually \((X_n)_{n\ge0}\) forgets its initial distribution, one couples \((X_n)_{n\ge0}\) with another copy \((Y_n)_{n\ge0}\), \(Y_0=y\), of this random walk, and shows that for large \(n\) both processes coincide with large probability.

To do this, we treat the pair \((\widetilde{X}_n,\widetilde{Y}_n)_{n\ge0}\) as a two-component random walk on \(K_m\times{K_m}\) with the following jump probabilities: the transition \((x,y)\mapsto(x',y')\) occurs with probability \[\tilde{p}_{(x,y)(x',y')}= \begin{cases} \frac1{m^2}\,,&\quad \text{ if } x\neq y\,,\\ \frac1m\,,&\quad \text{ if $x=y$ and $x'=y'$\,,} \end{cases}\] and set \(\tilde{p}_{(x,y)(x',y')}=0\) otherwise (ie., if \(x=y\) and \(x'\neq y'\)). It is straightforward to check that this is a correct coupling: 21 \[\sum_{y'}\tilde{p}_{(x,y)(x',y')}=m\cdot\frac1{m^2}=p_{x,x'}\,,\qquad \sum_{y'}\tilde{p}_{(x,x)(x',y')}=\tilde{p}_{(x,x)(x',x')}=\frac1m=p_{x,x'}\,.\] Notice that the walks \((\widetilde{X}_n)_{n\ge0}\) and \((\widetilde{Y}_n)_{n\ge0}\) run independently until they meet, and from that moment onwards they move together. For \(x\neq y\), denote \[\begin{equation} \label{eq:coupling-time-def} T\stackrel{{\sf def}}{=}\min\{n\ge0:\widetilde{X}_n=\widetilde{Y}_n\}\,. \end{equation}\] While the walks are apart, at each step they choose their new positions independently and uniformly, and thus meet with probability \(\frac1m\); hence, by a straightforward induction, \[\widetilde{{\mathsf{P}}}(T>n)=\Bigl(1-\frac1m\Bigr)^{n}\,,\] ie., \(T\) is a geometric random variable. Since \(\widetilde{X}_n\) and \(\widetilde{Y}_n\) coincide after they meet, and since the pair \((\widetilde{X}_n,\widetilde{Y}_n)\) under \(\widetilde{{\mathsf{P}}}\) gives a coupling of \((X_n,Y_n)\) for every \(n\), we have that \[\begin{aligned} {\mathsf{d_{TV}}}(X_n,Y_n)\le \widetilde{{\mathsf{P}}}(\widetilde{X}_n\ne \widetilde{Y}_n)=\widetilde{{\mathsf{P}}}(T>n)\,, \end{aligned}\] by \(\eqref{eq:coupling-inequality-dTV}\). It therefore only remains to observe that the RHS above is exponentially small, \(\widetilde{{\mathsf{P}}}(T>n)=\bigl(1-\frac1m\bigr)^n\le e^{-n/m}\). In other words, the random walk \((X_n)_{n\ge0}\) forgets its initial state \(X_0=x\) at least exponentially fast.
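The coupling above is easy to simulate. A Monte Carlo sketch (the choice \(m=5\) and the trial count are mine; the coupling time \(T\) is geometric with success probability \(1/m\), so \({\mathsf{E}}(T)=m\)):

```python
import random

# Independent coupling of two uniform walks on the complete graph K_m:
# while apart the walks jump independently (each uniform over all m
# vertices); they stick together once they meet.
random.seed(0)
m, trials = 5, 20000
total = 0
for _ in range(trials):
    x, y = 0, 1          # distinct starting states
    t = 0
    while x != y:
        x = random.randrange(m)   # independent uniform jumps while apart
        y = random.randrange(m)
        t += 1
    total += t
mean_T = total / trials
assert abs(mean_T - m) < 0.3      # E(T) = m, up to Monte Carlo error
```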

Notice that if \((Y_n)_{n\ge0}\) starts from the equilibrium (ie., its initial position is selected in \(K_m\) uniformly at random), our argument shows that for every initial state \(x\) and every state \(j\) we have \(|{\mathsf{P}}(X_n=j)-\frac1m|\le e^{-n/m}\), ie., convergence towards the equilibrium distribution is at least exponentially fast. 22

The next example is important for computer science.

An \(n\)-dimensional hypercube is just \({\mathcal{H}}_n\equiv\{0,1\}^n\), a graph whose vertex set is the collection of all \(n\)-sequences made of \(0\) and \(1\) (there are \(2^n\) of them) and whose edges connect vertices which differ at a single position. The “lazy” random walk \((W_k)_{k\ge0}\) on \({\mathcal{H}}_n\) is defined as follows: if \(W_k=v\in{\mathcal{H}}_n\), select \(m\), \(1\le m\le n\), uniformly at random and flip a fair coin. If the coin shows heads, set the \(m\)th coordinate of \(v\) to \(1\), otherwise set it to \(0\). The walk \((W_k)_{k\ge0}\) is aperiodic and irreducible; as \(k\) increases, it tends to forget its initial state.
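One step of this walk is straightforward to code; a minimal Python sketch (the function name `lazy_step` is mine):

```python
import random

# One step of the lazy random walk on the hypercube H_n = {0,1}^n:
# choose a coordinate uniformly at random and set it to a fair coin flip,
# so with probability 1/2 the walk stays where it is.
def lazy_step(v, rng):
    w = list(v)
    m = rng.randrange(len(v))     # coordinate to refresh
    w[m] = rng.randrange(2)       # fair coin: new value 0 or 1
    return tuple(w)

rng = random.Random(1)
v = (0,) * 10
for _ in range(100):
    w = lazy_step(v, rng)
    # each step changes at most one coordinate
    assert sum(a != b for a, b in zip(v, w)) <= 1
    v = w
```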

Let \((W_k)_{k\ge0}\) be the random walk on the hypercube \({\mathcal{H}}_n\), recall [exmle:random-walk-on-hypercube]. Use [exmle:coupled-totally-uniform-walks-on-complete-graph] to study its distribution for large \(k\).

Let \((\xi_j)_{j\ge1}\) be independent variables with \({\mathsf{P}}(\xi=1)=1-{\mathsf{P}}(\xi=-1)=p_x\) and let \((\eta_j)_{j\ge1}\) be independent variables with \({\mathsf{P}}(\eta=1)=1-{\mathsf{P}}(\eta=-1)=p_y\). Define simple random walks \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) on \({\mathbb{Z}}\) via \(X_n=x+\sum_{j=1}^n\xi_j\) and \(Y_n=y+\sum_{j=1}^n\eta_j\).
a) Show that the random walk \((X_n)_{n\ge0}\) on \({\mathbb{Z}}\) forgets its initial state \(x\); namely, for \(p_x=p_y\) and \(y-x=2k\), construct a coupling of \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) similar to that of [exmle:coupled-totally-uniform-walks-on-complete-graph].
b) Show that the random walk \((X_n)_{n\ge0}\) depends monotonically on \(p_x\); namely, for \(x\le y\) and \(p_x\le p_y\), use the ideas from [exmle:coupled-bernoulli] to construct a monotone coupling of \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\), ie., one such that \(\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}_n\le\widetilde{Y}_n\bigr)=1\) for all \(n\ge0\).

For many practical purposes it is important to generate a random permutation of a collection of cards \(1\), \(2\), …, \(m\). One way is to define a Markov chain \((X_n)_{n\ge0}\) on the space of all possible permutations (so at any time \(n\), \(X_n\) is a permutation of \(\{1,\dots, m\}\)), and run it until its distribution is close to the stationary one. The stationary distribution is, as usual, the uniform distribution on all possible permutations.

One of the simplest algorithms—the random-to-top shuffling—is defined as follows: for a given state \(X_n=(\sigma(1),\dots, \sigma(m))\) of the chain, take the \(j\)th card by position (with \(j\) chosen uniformly at random between \(1\) and \(m\)) and put it at the top of the deck to get the new state \(X_{n+1}=(\sigma(j),\sigma(1),\dots, \sigma(j-1), \sigma(j+1),\dots,\sigma(m))\). Note that we could take the \(j\)th card by value instead, and then the new state would be \((j,\sigma(1),\dots, \sigma(k-1),\sigma(k+1),\dots,\sigma(m))\) for the unique \(k\) such that \(\sigma(k)=j\). Since \(j\) is chosen uniformly at random, this gives the same algorithm (a Markov chain with the same transition probabilities).
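One random-to-top move can be sketched in a few lines of Python (the function name `random_to_top` and the deck size are mine):

```python
import random

# One random-to-top move: pick a position j uniformly at random and move
# the card at position j to the top of the deck (position 0).
def random_to_top(deck, rng):
    j = rng.randrange(len(deck))
    return [deck[j]] + deck[:j] + deck[j + 1:]

rng = random.Random(42)
deck = list(range(1, 6))          # cards 1..5, position 0 is the top
shuffled = deck
for _ in range(50):
    shuffled = random_to_top(shuffled, rng)
assert sorted(shuffled) == deck   # still a permutation of the same cards
```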

To study the approach to stationarity, one couples two copies of the Markov chain, \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\), one starting from a fixed state and the other starting from equilibrium, and shuffles until both configurations \(X_n\) and \(Y_n\) agree.

Let \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) be random-to-top shuffling (RTTS) Markov chains, recall [exmle:random-to-top-shuffling]. Suppose that a random card in \(X_n\) (by position) is chosen, say of value \(j\), and then put at the top to get \(X_{n+1}\). In \(Y_n\) find the (randomly positioned) card with number \(j\) and also move it to the top to get \(Y_{n+1}\). Let \(T:=\min\{k\ge 0: Y_k=X_k\}\) be the coupling time for \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\).
a) Show that the described algorithm provides a coupling 23 of the RTTS (by position) for \((X_n)_{n\ge0}\) and the RTTS (by value) for \((Y_n)_{n\ge0}\).
b) Describe the distribution of \(T\).

3.3.2 The law of rare events

This subsection is optional and will not be examined.
The convergence result in [exse:sum-of-nonhomogeneous-Bernoulli-converges-to-Poisson] can also be derived using coupling:

Let \(X=\sum_{k=1}^nX_k\), where \(X_k\sim{\mathsf{Ber}}(p_k)\) are independent random variables. Let, further, \(Y\sim{\mathsf{Poi}}(\lambda)\), where \(\lambda=\sum_{k=1}^np_k\). Then the maximal coupling of \(X\) and \(Y\) satisfies \[{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\le\sum_{k=1}^n\bigl(p_k\bigr)^2\,.\]

Write \(Y=\sum_{k=1}^nY_k\), where \(Y_k\sim{\mathsf{Poi}}(p_k)\) are independent rv’s. Of course, \({\mathsf{P}}\bigl(\sum_{k=1}^nX_k\neq\sum_{k=1}^nY_k\bigr)\le\sum_{k=1}^n{\mathsf{P}}(X_k\neq Y_k)\) for every joint distribution of \((X_k)_{k=1}^n\) and \((Y_k)_{k=1}^n\). Let \(\widehat{{\mathsf{P}}}_k\) be the maximal coupling for the pair \(\{X_k,Y_k\}\), and let \(\widehat{{\mathsf{P}}}_0\) be the maximal coupling for the two sums. The LHS above is not smaller than \({\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}_0\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\); on the other hand, under the (independent) product measure \(\widehat{{\mathsf{P}}}_1\times\dots\times\widehat{{\mathsf{P}}}_n\) the RHS becomes just \(\sum_{k=1}^n\widehat{{\mathsf{P}}}_k\bigl(\widetilde{X}_k\neq\widetilde{Y}_k\bigr)\). The result now follows from \(\eqref{eq:Bernoulli-Poisson-max-coupling}\).

Let \(X\sim{\mathsf{Bin}}(n,\frac{\lambda}{n})\) and \(Y\sim{\mathsf{Poi}}(\lambda)\) for some \(\lambda>0\). Show that \[\begin{equation} \label{eq:Bin2PoiDistance} \frac12\bigl|{\mathsf{P}}(X=k)-{\mathsf{P}}(Y=k)\bigr|\le{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\le\frac{\lambda^2}{n}\,\qquad\text{for every $k\ge0$.} \end{equation}\] Deduce that for every fixed \(k\ge0\), we have \({\mathsf{P}}(X=k)\to\frac{\lambda^k}{k!}e^{-\lambda}\) as \(n\to\infty\).

Notice that \(\eqref{eq:Bin2PoiDistance}\) implies that if \(X\sim{\mathsf{Bin}}(n,p)\) and \(Y\sim{\mathsf{Poi}}(np)\) then for every \(k\ge0\) the probabilities \({\mathsf{P}}(X=k)\) and \({\mathsf{P}}(Y=k)\) differ by at most \(2np^2\). In particular, if \(n=10\) and \(p=0.01\), any pair of such probabilities differ by at most \(0.002\), ie., they essentially agree in the first two decimal places.
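This pointwise bound can be verified exactly in Python (a sketch; the instance \(n=10\), \(p=0.01\) is the one discussed above):

```python
from math import comb, exp, factorial

# Numerical check: for X ~ Bin(n, p) and Y ~ Poi(np), every pointwise
# difference |P(X=k) - P(Y=k)| is at most 2*n*p^2.
n, p = 10, 0.01
lam = n * p
for k in range(n + 1):
    binom_pk = comb(n, k) * p**k * (1 - p) ** (n - k)
    poi_pk = exp(-lam) * lam**k / factorial(k)
    assert abs(binom_pk - poi_pk) <= 2 * n * p**2   # = 0.002 here
```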

3.4 Additional problems

In the setting of [exmle:coupled-bernoulli] show that every convex linear combination of tables \(T_1\) and \(T_2\), ie., each table of the form \(T_\alpha = \alpha T_1 + (1-\alpha) T_2\) with \(\alpha\in[0,1]\), gives an example of a coupling of \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\). Can you find all possible couplings for these variables?

Generalise the inequality \(\eqref{eq:stochastic-domination-expectation-of-monotone-function}\) to a broader class of functions \(g(\,\cdot\,)\) and verify that if \(X\preccurlyeq Y\), then \({\mathsf{E}}(X^{2k+1})\le{\mathsf{E}}(Y^{2k+1})\), \({\mathsf{E}}e^{tX}\le{\mathsf{E}}e^{tY}\) with \(t>0\), \({\mathsf{E}}s^X\le{\mathsf{E}}s^Y\) with \(s>1\), etc.

Let \(\xi\sim{\mathcal{U}}(0,1)\) be a standard uniform random variable. For fixed \(p\in(0,1)\), define \(X=\mathsf{1}_{\xi<p}\). Show that \(X\sim{\mathsf{Ber}}(p)\), a Bernoulli random variable with parameter \(p\). Now suppose that \(X=\mathsf{1}_{\xi<p_1}\) and \(Y=\mathsf{1}_{\xi<p_2}\) for some \(0<p_1\le p_2<1\) and \(\xi\) as above. Show that \(X\preccurlyeq Y\) and that \({\mathsf{P}}(X\le Y)=1\). Compare your construction to the second table in [exmle:coupled-bernoulli].
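This shared-uniform construction is easy to explore numerically (a sketch, not a proof; the parameters and tolerances are mine):

```python
import random

# Monotone coupling of Ber(p1) and Ber(p2) through one shared uniform xi:
# X = 1_{xi < p1}, Y = 1_{xi < p2}; with p1 <= p2 this forces X <= Y always.
def coupled_bernoullis(p1, p2, rng):
    xi = rng.random()
    return int(xi < p1), int(xi < p2)

rng = random.Random(7)
p1, p2 = 0.3, 0.6
samples = [coupled_bernoullis(p1, p2, rng) for _ in range(10000)]
assert all(x <= y for x, y in samples)            # P(X <= Y) = 1
mean_x = sum(x for x, _ in samples) / len(samples)
mean_y = sum(y for _, y in samples) / len(samples)
assert abs(mean_x - p1) < 0.03 and abs(mean_y - p2) < 0.03  # correct marginals
```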

Let \(X\sim{\mathsf{Geom}}(p)\) and \(Y\sim{\mathsf{Geom}}(r)\) be two geometric random variables. If \(0<p\le r<1\), are \(X\) and \(Y\) stochastically ordered? Justify your answer.

Let \(X\sim\Gamma(a,\lambda)\) and \(Y\sim\Gamma(b,\lambda)\) be two gamma random variables. 24 If \(0<a\le b<\infty\), are \(X\) and \(Y\) stochastically ordered? Justify your answer.

4 Martingales

4.1 Definition and some examples

A martingale is a generalized version of a “fair game”.

A process \((M_n)_{n\ge0}\) is a martingale if

  1. for every \(n\ge0\) the expectation \({\mathsf{E}}M_n\) is finite, equivalently, \({\mathsf{E}}|M_n|<\infty\);

  2. for every \(n\ge0\) and all \(m_n\), \(m_{n-1}\), …, \(m_0\) we have \[\begin{equation} \label{eq:martingale-def} {\mathsf{E}}\bigl(M_{n+1}\mid M_n=m_n,\dots,M_0=m_0\bigr)=m_n\,. \end{equation}\]

For those familiar with the notion of conditioning on a random variable (see next section), \(\eqref{eq:martingale-def}\) can just be written as \({\mathsf{E}}\bigl(M_{n+1}\mid M_n,\dots,M_0\bigr)=M_n\).

We say that \((M_n)_{n\ge0}\) is a supermartingale 25 if the equality in \(\eqref{eq:martingale-def}\) holds with \(\le\), ie., \({\mathsf{E}}\bigl(M_{n+1}\mid M_n,\dots,M_0\bigr)\le M_n\); and we say that \((M_n)_{n\ge0}\) is a submartingale, if \(\eqref{eq:martingale-def}\) holds with \(\ge\), ie., \({\mathsf{E}}\bigl(M_{n+1}\mid M_n,\dots,M_0\bigr)\ge M_n\).

Let \((\xi_n)_{n\ge1}\) be independent random variables  26 with \[{\mathsf{E}}\xi_n=0\] for all \(n\ge1\). Then the process \((M_n)_{n\ge0}\) defined via \[M_n\stackrel{{\sf def}}{=}M_0+\xi_1+\dots+\xi_n\] is a martingale as long as the random variable \(M_0\) is independent of \((\xi_n)_{n\ge1}\) and \({\mathsf{E}}|M_0|<\infty\). For example, we will often take \(M_0=0\) or some other deterministic constant.

Indeed, by the triangle inequality, \({\mathsf{E}}|M_n|\le{\mathsf{E}}|M_0|+\sum_{j=1}^n{\mathsf{E}}|\xi_j|<\infty\) for all \(n\ge0\); on the other hand, the independence property implies \[{\mathsf{E}}\bigl(M_{n+1}-M_n\mid M_n,\dots,M_0\bigr)\equiv{\mathsf{E}}\bigl(\xi_{n+1}\mid M_n,\dots,M_0\bigr)={\mathsf{E}}\xi_{n+1}=0\,.\]

Notice that if \({\mathsf{E}}\xi_n\ge0\) for all \(n\ge1\), then \((M_n)_{n\ge0}\) is a submartingale, whereas if \({\mathsf{E}}\xi_n\le0\) for all \(n\ge1\), then \((M_n)_{n\ge0}\) is a supermartingale. More generally, if \((\xi_n)_{n\ge1}\) are independent random variables with \({\mathsf{E}}|\xi_n|<\infty\) for all \(n\ge1\), then the process \(M_n=M_0+(\xi_1-{\mathsf{E}}\xi_1)+\dots+(\xi_n-{\mathsf{E}}\xi_n)\), \(n\ge0\), is a martingale.

Given a sequence \((\xi_k)_{k\ge1}\) of independent Bernoulli variables with common distribution \({\mathsf{P}}(\xi=1)=p\) and \({\mathsf{P}}(\xi=-1)=q=1-p\), define the generated random walk via \(X_n=\sum_{k=1}^n\xi_k\), \(n\ge0\). Show that the process \(M_n=(q/p)^{X_n}\) is a martingale.
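A numerical check of the identity underlying this exercise (not a substitute for the proof): the one-step multiplicative increment has unit conditional mean.

```python
# For one Bernoulli step xi with P(xi=1)=p, P(xi=-1)=q=1-p, the increment
# of M_n = (q/p)^{X_n} is (q/p)^{xi}, and
#   E[(q/p)^xi] = p*(q/p) + q*(p/q) = q + p = 1,
# which is why E(M_{n+1} | M_n, ..., M_0) = M_n.
for p in (0.1, 0.3, 0.7):
    q = 1 - p
    step_mean = p * (q / p) + q * (p / q)
    assert abs(step_mean - 1.0) < 1e-12
```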

Let \((Z_n)_{n\ge0}\) be a branching process with \({\mathsf{E}}Z_1=m<\infty\). We have \({\mathsf{E}}|Z_n|<\infty\) and \({\mathsf{E}}(Z_{n+1}\mid Z_n,\dots,Z_0)=mZ_n\) for all \(n\ge0\). In other words, the process \((Z_n)_{n\ge0}\) is a martingale, a submartingale or a supermartingale depending on whether \(m=1\), \(m>1\) or \(m<1\).
Notice also that for every \(m\in(0,\infty)\) the process \((Z_n/m^n)_{n\ge0}\) is a martingale.

Let \(\rho\) be the extinction probability for a branching process \((Z_n)_{n\ge0}\); show that \(M_n=\rho^{Z_n}\) is a martingale.

We can also construct many examples of submartingales and supermartingales if we have a martingale and apply certain functions to it. This is a consequence of:

[Conditional Jensen’s inequality] Suppose that \(X\) is an integrable random variable on \((\Omega, \mathcal{F},{\mathsf{P}})\) and \(A\in \mathcal{F}\) with \({\mathsf{P}}(A)>0\).

Suppose that \(\varphi\) is a convex function such that \(\varphi(X)\) is also integrable. Then \[{\mathsf{E}}(\varphi(X)|A)\ge \varphi({\mathsf{E}}(X|A)).\]

This follows from the non-conditional version of Jensen’s inequality, and using the fact that \({\mathsf{P}}(\cdot|A)\) is a valid probability measure.

Let \((X_n)_{n\ge0}\) be a martingale. If, for some convex function \(f(\,\cdot\,)\) we have \({\mathsf{E}}\bigl(|f(X_n)|\bigr)<\infty\) for all \(n\ge0\), then the process \(f(X):=\bigl(f(X_n)\bigr)_{n\ge 0}\) is a submartingale.

Similarly, if for some concave \(f(\,\cdot\,)\) we have \({\mathsf{E}}\bigl(|f(X_n)|\bigr)<\infty\) for all \(n\ge0\), then the process \(f(X)\) is a supermartingale.

Both these claims follow immediately from the Jensen inequalities for conditional expectations.

4.2 A few remarks on conditional expectation

Recall the following basic definition:

Let \((\Omega,{\mathcal{F}},{\mathsf{P}})\) be a probability triple and let \(X\) and \(Y\) be random variables taking values \(x_1\), \(x_2\), …, \(x_m\) and \(y_1\), \(y_2\), …, \(y_n\) respectively. On the event \(\{Y=y_j\}\) one defines the conditional probability \[{\mathsf{P}}\bigl(X=x_i\mid Y=y_j\bigr)\stackrel{{\sf def}}{=}\frac{{\mathsf{P}}(X=x_i,Y=y_j)}{{\mathsf{P}}(Y=y_j)}\] and the conditional expectation: \({\mathsf{E}}\bigl(X\mid Y=y_j\bigr)\equiv\sum_ix_i{\mathsf{P}}\bigl(X=x_i\mid Y=y_j\bigr)\). Then the conditional expectation \(Z={\mathsf{E}}(X\,|\,Y)\) of \(X\) given \(Y\) is defined as follows: \[\text{if } Y(\omega)=y_j, \quad\text{ then }\quad Z(\omega)\stackrel{{\sf def}}{=}z_j\equiv{\mathsf{E}}\bigl(X\mid Y=y_j\bigr)\,.\] Notice that the value \(z_j\) is completely determined by \(y_j\); in other words, \(Z\) is a function of \(Y\), and as such, a random variable. Of course, if \(X\) and \(Y\) are independent, we have \(Z(\omega)\equiv{\mathsf{E}}X\).
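For a finite joint pmf, this definition is a direct computation; a Python sketch (the joint table is an invented example of mine):

```python
# E(X | Y) computed from a finite joint pmf, following the definition above:
# on the event {Y=y}, Z takes the value sum_i x_i * P(X=x_i | Y=y).
joint = {(1, 0): 0.1, (2, 0): 0.3, (1, 1): 0.4, (2, 1): 0.2}  # P(X=x, Y=y)

def cond_exp_given(y):
    p_y = sum(w for (x, yy), w in joint.items() if yy == y)
    return sum(x * w for (x, yy), w in joint.items() if yy == y) / p_y

# Z = E(X|Y) is a function of Y ...
z = {y: cond_exp_given(y) for y in (0, 1)}
# ... and E(Z) = E(X) (property 1 of the lemma below):
EZ = sum(z[y] * sum(w for (x, yy), w in joint.items() if yy == y) for y in (0, 1))
EX = sum(x * w for (x, y), w in joint.items())
assert abs(EZ - EX) < 1e-12
```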

If \({\mathcal{D}}=\{D_1,D_2,\dots,D_m\}\) is a finite partition  27 of \(\Omega\), the collection \({\mathcal{G}}=\sigma({\mathcal{D}})\) of all \(2^m\) possible subsets of \(\Omega\) constructed from blocks \(D_i\) is called the \(\sigma\)-field generated by \({\mathcal{D}}\). A random variable \(Y\) is measurable with respect to \({\mathcal{G}}\) if it is constant on every block \(D_i\) of the initial partition \({\mathcal{D}}\).

In [exmle:conditional-expectation] above we see that \(Z\) is constant on the events \(\{Y=y_j\}\); ie., \(Z\) is measurable w.r.t. the \(\sigma\)-field \(\sigma(Y)\equiv\sigma(\{Y=y_j\}_j)\). Moreover, for every \(G_j\equiv\{Y=y_j\}\), we have \[\begin{aligned} {\mathsf{E}}\bigl(Z\,\mathsf{1}_{G_j})&=z_j{\mathsf{P}}(Y=y_j)=\sum_ix_i{\mathsf{P}}\bigl(X=x_i\mid Y=y_j\bigr){\mathsf{P}}(Y=y_j) \\&=\sum_ix_i{\mathsf{P}}\bigl(X=x_i,Y=y_j\bigr)={\mathsf{E}}\bigl(X\mathsf{1}_{G_j})\,, \end{aligned}\] where the last equality follows from the observation that the random variable \(X\mathsf{1}_{G_j}\) equals \(x_i\) on every event \(G_j\cap\{X=x_i\}=\{Y=y_j\}\cap\{X=x_i\}\) and vanishes identically outside the set \(G_j\).

The last two properties can be used to define conditional expectation in general: Let \((\Omega, {\mathcal{F}},{\mathsf{P}})\) be a probability triple, let \({\mathcal{G}}\subseteq{\mathcal{F}}\) be a \(\sigma\)-field, and let \(X\) be a rv, \(X:\Omega\to{\mathbb{R}}\). The conditional expectation of \(X\) w.r.t. \({\mathcal{G}}\) is the unique random variable \(Z\) such that: \(Z\) is \({\mathcal{G}}\) measurable and for every set \(G\in{\mathcal{G}}\) we have \(\int_GZ{\mathsf{d}}{\mathsf{P}}\equiv{\mathsf{E}}\bigl(Z\mathsf{1}_G)={\mathsf{E}}\bigl(X\mathsf{1}_G)\equiv\int_GX{\mathsf{d}}{\mathsf{P}}\).

Notice that when \({\mathcal{G}}=\sigma({\mathcal{D}})\), the definition in the remark above coincides with Definition [def:ce_partition].

The following are the most important properties of conditional expectation:

Let \((\Omega, {\mathcal{F}},{\mathsf{P}})\) be a probability triple, and let \({\mathcal{G}}\subset{\mathcal{F}}\) a \(\sigma\)-field. Then for all random variables \(X\), \(X_1\), \(X_2\) and constants \(a_1\), \(a_2\), the following properties hold:

  1. If \(Z={\mathsf{E}}\bigl(X\,|\,{\mathcal{G}}\bigr)\) then \({\mathsf{E}}Z={\mathsf{E}}X\);

  2. If \(X\) is \({\mathcal{G}}\)-measurable, then \({\mathsf{E}}\bigl(X\,|\,{\mathcal{G}}\bigr)=X\);

  3. \({\mathsf{E}}\bigl(a_1X_1+a_2X_2\,|\,{\mathcal{G}}\bigr)=a_1{\mathsf{E}}\bigl(X_1\,|\,{\mathcal{G}}\bigr)+a_2{\mathsf{E}}\bigl(X_2\,|\,{\mathcal{G}}\bigr)\);

  4. If \(X\ge0\), then \({\mathsf{E}}\bigl(X\,|\,{\mathcal{G}}\bigr)\ge0\);

  5. [tower-property-conditional-expectation] If \({\mathcal{H}}\) and \({\mathcal{G}}\) are two \(\sigma\)-fields in \(\Omega\) such that \({\mathcal{H}}\subseteq{\mathcal{G}}\), then \[{\mathsf{E}}\Bigl[{\mathsf{E}}\bigl(X\,|\,{\mathcal{G}}\bigr)\,|\,{\mathcal{H}}\Bigr]={\mathsf{E}}\Bigl[{\mathsf{E}}\bigl(X\,|\,{\mathcal{H}}\bigr)\,|\,{\mathcal{G}}\Bigr]={\mathsf{E}}\bigl(X\,|\,{\mathcal{H}}\bigr)\,;\]

  6. If \(Z\) is \({\mathcal{G}}\)-measurable, then \({\mathsf{E}}\bigl[ZX\,|\,{\mathcal{G}}\bigr]=Z\,{\mathsf{E}}(X\,|\,{\mathcal{G}})\).

  7. If \(Z\) is independent of \({\mathcal{G}}\), then \({\mathsf{E}}\bigl[Z\,|\,{\mathcal{G}}\bigr]={\mathsf{E}}[Z]\).

If \((X_k)_{k\ge1}\) is a sequence of random variables, we can define the generated \(\sigma\)-fields \({\mathcal{F}}_1^X\), \({\mathcal{F}}_2^X\), …, via \[\begin{equation} \label{eq:Xn-sigma-field-def} {\mathcal{F}}_k^X\stackrel{{\sf def}}{=}\sigma(X_1,X_2,\dots,X_k)\,; \end{equation}\] here, the \(\sigma\)-field \({\mathcal{F}}_k^X\) stores all information about the process \((X_n)_{n\ge1}\) up to time \(k\). Observe that these \(\sigma\)-fields form a filtration \(\bigl({\mathcal{F}}_n^X\bigr)_{n\ge1}\) in the sense that \[\begin{equation} \label{eq:FXn-filtration} {\mathcal{F}}_1^X\subseteq{\mathcal{F}}_2^X\subseteq\ldots\subseteq{\mathcal{F}}_k^X\subseteq\ldots\,. \end{equation}\] The notion of filtration is very useful when working with martingales. Indeed, a process \((M_n)_{n\ge0}\) is a martingale if for all \(n\ge0\), \({\mathsf{E}}|M_n|<\infty\) and \({\mathsf{E}}\bigl(M_{n+1}\mid{\mathcal{F}}^M_n)=M_n\). We can also generalise the definition of a martingale:

\((M_n)_{n\ge 0}\) is a martingale with respect to a filtration \((\mathcal{F}_n)_{n\ge 0}\) if for all \(n\ge 0\), \({\mathsf{E}}(|M_n|)<\infty\) and \[{\mathsf{E}}(M_{n+1}|\mathcal{F}_n)=M_n.\]

The original definition of martingale is sometimes referred to as “being a martingale with respect to the natural filtration”.

We say that \(M\) is a martingale with respect to a sequence \((X_n)_{n\ge 0}\) if it is a martingale with respect to the filtration \((\mathcal{F}_n^X)_{n\ge 0}\).
We also notice that by repeatedly applying the tower property in claim [tower-property-conditional-expectation] of [lem:conditional-expectation-properties] above to the sequence \(\eqref{eq:FXn-filtration}\), we get the following result:

If \((M_n)_{n\ge0}\) is a submartingale w.r.t. \((X_n)_{n\ge0}\), then for all integer \(n\ge k\ge0\), we have \({\mathsf{E}}M_n\ge{\mathsf{E}}M_k\).

Changing signs, we get the inequality \({\mathsf{E}}M_n\le{\mathsf{E}}M_k\) for supermartingales, and therefore the equality \({\mathsf{E}}M_n={\mathsf{E}}M_k\) for martingales.

4.3 Martingales and stopping times

Martingales are extremely useful in studying stopping times:

Let \((X_k)_{k\ge1}\) be a sequence of i.i.d. Bernoulli random variables with common distribution \({\mathsf{P}}(X_1=1)=p\), \({\mathsf{P}}(X_1=-1)=q=1-p\), where \(p\in(0,1)\), \(p\neq1/2\). For fixed \(k\in\{0,1,\dots,N\}\), define the random walk \((S_n)_{n\ge0}\) via \(S_n=k+\sum_{j=1}^nX_j\), \(n\ge0\), and consider the process \((Y_n)_{n\ge0}\), where \(Y_n=(q/p)^{S_n}\). Clearly, \((Y_n)_{n\ge0}\) is an \(({\mathcal{F}}_n^X)_{n\ge0}\)-martingale, so that \({\mathsf{E}}(Y_n)={\mathsf{E}}(Y_0)=(q/p)^k\) for all \(n\ge0\), recall [lem:martingale-property].

Let \(T\) be the first time \(S_n\) hits \(0\) or \(N\). If an analogue of the above equality, \({\mathsf{E}}(Y_T)={\mathsf{E}}(Y_0)=(q/p)^k\) were true for (random) time \(T\), we could find the exit probability \({\mathsf{p}}_k={\mathsf{P}}(\, S \text{ hits $0$ before $N$ }|S_0=k)\) from the expression \[{\mathsf{E}}\bigl(Y_T\bigr)=(q/p)^0{\sf p}_k+(q/p)^N\bigl(1-{\sf p}_k\bigr)\,,\] thus obtaining \(\displaystyle{\sf p}_k=\bigl((q/p)^k-(q/p)^N\bigr)/\bigl(1-(q/p)^N\bigr)\).
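The derived formula for \({\sf p}_k\) can be cross-checked against first-step analysis: it satisfies the recursion \({\sf p}_k=q\,{\sf p}_{k-1}+p\,{\sf p}_{k+1}\) with boundary values \({\sf p}_0=1\), \({\sf p}_N=0\). A Python sketch (the parameters \(p=0.6\), \(N=10\) are an arbitrary test case of mine):

```python
# Check that the exit probability derived above,
#   p_k = ((q/p)^k - (q/p)^N) / (1 - (q/p)^N),
# satisfies the first-step recursion p_k = q*p_{k-1} + p*p_{k+1}
# with boundary values p_0 = 1 and p_N = 0.
p, N = 0.6, 10
q = 1 - p
r = q / p

def exit_prob(k):
    return (r**k - r**N) / (1 - r**N)

assert abs(exit_prob(0) - 1.0) < 1e-12 and abs(exit_prob(N)) < 1e-12
for k in range(1, N):
    assert abs(exit_prob(k) - (q * exit_prob(k - 1) + p * exit_prob(k + 1))) < 1e-12
```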

The method used in the previous example relies on the assumption \({\mathsf{E}}(Y_T)={\mathsf{E}}(Y_0)\) and on the formula for \({\mathsf{E}}(Y_T)\) for a certain random variable \(T\). An important part of the theory of martingales is to study random variables \(T\) for which the above statements are true.  28

Let \(S_n=\sum_{k=1}^nX_k\) be the simple symmetric random walk in \(\{-K,\dots,K\}\), generated by a sequence of i.i.d. symmetric Bernoulli r.v. \(X_k\), where \({\mathsf{P}}(X_1=\pm1)=1/2\). Similarly to [exmle:gamblers-ruin] one can study the hitting time \(T\) of the boundary, \(T=\inf\bigl\{n\ge0:|S_n|=K\bigr\}\): namely, since  29 \[{\mathsf{E}}\bigl((S_T)^2\bigr)=K^2{\mathsf{P}}(S_T=K)+K^2{\mathsf{P}}(S_T=-K)=K^2\,,\] the same heuristic applied to the martingale \((Y_n)_{n\ge0}\), \(Y_n\stackrel{{\sf def}}{=}\bigl(S_n\bigr)^2-n\), leads to \(0={\mathsf{E}}(Y_0)={\mathsf{E}}(Y_T)={\mathsf{E}}\bigl((S_T)^2\bigr)-{\mathsf{E}}(T)\); i.e., it suggests that \({\mathsf{E}}(T)=K^2\).
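The suggested answer \({\mathsf{E}}(T)=K^2\) can be confirmed by first-step analysis: the expected exit times \(e_k={\mathsf{E}}(T\mid S_0=k)\) satisfy \(e_k=1+\tfrac12(e_{k-1}+e_{k+1})\) with \(e_{\pm K}=0\), and \(e_k=K^2-k^2\) solves this system exactly. A quick Python verification (the value \(K=6\) is an arbitrary choice of mine):

```python
# Verify that e_k = K^2 - k^2 solves the first-step recursion
# e_k = 1 + (e_{k-1} + e_{k+1})/2 with boundary values e_{-K} = e_K = 0.
K = 6
e = {k: K**2 - k**2 for k in range(-K, K + 1)}
assert e[-K] == 0 and e[K] == 0
for k in range(-K + 1, K):
    assert e[k] == 1 + (e[k - 1] + e[k + 1]) / 2
assert e[0] == K**2   # starting from the origin, E(T) = K^2
```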

One of our aims is to discuss general results that justify the above heuristics. To this end, we need to carefully define what we mean by a “stopping time”.

A variable \(T\) is a stopping time for a process \((X_n)_{n\ge0}\) if the occurrence/non-occurrence of the event \(\{T=n\}=\)“we stop at time \(n\)” can be determined by looking at the values \(X_0\), \(X_1\), …, \(X_n\) of the process up to time \(n\). Equivalently, if we have \(\{T\le n\}\in{\mathcal{F}}^X_n\equiv\sigma(X_0,\dots,X_n)\) for every \(n\ge 0\).

Let \((X_n)_{n\ge0}\) be a stochastic process with values in \(S\), and let \(T\) be the hitting time of a set \(A\subset S\), namely, \(T=\min\bigl\{n\ge0:X_n\in A\bigr\}\). Then \(T\) is a stopping time for \((X_n)_{n\ge0}\).

Indeed, for every fixed \(n\ge0\), we have \(\{T>n\}\equiv\{X_0\notin A,X_1\notin A,\dots,X_n\notin A\}\); therefore, the event \(\{T>n\}\) and its complement \(\{T\le n\}\) both belong to \({\mathcal{F}}^X_n\).

By contrast, the last time that \(X\) visits \(A\), \(\tilde{T}=\max\{n\ge 0: X_n\in A\}\), is not in general a stopping time, because we typically cannot tell just by looking at \(X_0,\dots, X_n\) whether the process will visit \(A\) after time \(n\) or not.

Let \(k\in{\mathbb{N}}\) be fixed, and let \(S\) and \(T\) be stopping times for a process \((X_n)_{n\ge0}\). Show that the following are stopping times:

  1. \(T\equiv k\),

  2. \(S\land T\equiv\min(S,T)\),

  3. \(S\lor T\equiv\max(S,T)\).

Let \(A\) be a set of states, and let \(T=T_A\) be the moment of the first visit to \(A\), ie., \(T=\min\{n\ge0:X_n\in A\}\). Consider \(S=S_A=\min\{n>T_A:X_n\in A\}\), the moment of the second visit to \(A\). Show that \(S_A\) is a stopping time for \((X_n)_{n\ge0}\). Is the variable \(L=\max\bigl\{n\ge0:X_n\in A\bigr\}\) a stopping time for \((X_n)_{n\ge0}\)?

Let \((M_n)_{n\ge0}\) be a submartingale w.r.t. \((X_n)_{n\ge0}\), and let \(T\) be a stopping time for \((X_n)_{n\ge0}\). Show that the process \((L_n^T)_{n\ge0}\), \[L_n^T\stackrel{{\sf def}}{=}M_{n\land T}\equiv\begin{cases}M_n\,,&\quad n\le T\,,\\ M_T\,,&\quad n>T\,,\end{cases}\] is a submartingale w.r.t. \((X_n)_{n\ge0}\). Deduce that if \((M_n)_{n\ge0}\) is a martingale w.r.t. \((X_n)_{n\ge0}\), then the stopped process \((L_n^T)_{n\ge0}\) is also a martingale w.r.t. \((X_n)_{n\ge0}\).

4.4 Optional stopping theorem

The optional stopping (or sampling) theorem (OST) tells us that, under quite general assumptions, if \((M_n)_{n\ge0}\) is a martingale and \(T\) is a stopping time, then \({\mathsf{E}}(M_T)={\mathsf{E}}(M_0)\). Such results are very useful in deriving inequalities and probabilities of various events associated with such stochastic processes.

[Optional Stopping Theorem][thm:OST] Let \((M_n)_{n\ge0}\) be a martingale w.r.t. \((X_n)_{n\ge0}\), and let \(T\) be a stopping time for \((X_n)_{n\ge0}\). Then the equality \[\begin{equation} \label{eq:OST} {\mathsf{E}}\bigl[M_T\bigr]={\mathsf{E}}[M_0] \end{equation}\] holds whenever one of the following conditions holds:

(OST-1)

the stopping time \(T\) is bounded, i.e., \({\mathsf{P}}(T\le N)=1\) for some \(N<\infty\);

(OST-2)

\({\mathsf{E}}T<\infty\), and the martingale \((M_n)_{n\ge0}\) has bounded increments, i.e., \(|M_{n+1}-M_n|\le K\) for all \(n\) and some constant \(K\);

(OST-3)

\({\mathsf{P}}(T<\infty)=1\), and the martingale \((M_n)_{n\ge0}\) is bounded, i.e., \(|M_n|\le K\) for all \(n\) and some constant \(K\).

If \(M_n\) records a gambler's fortune, then by (OST-3) one cannot make money from a fair game, unless an unlimited amount of credit is available.

One can generalize (OST-3) by replacing the condition of boundedness, \(|M_n|\le K\), by that of uniform integrability for the martingale \((M_n)_{n\ge0}\): a sequence of random variables \((Y_n)_{n\ge0}\) is uniformly integrable if \[\begin{equation} \label{eq:UI-def} \lim_{K\to\infty}\sup_{n\ge0}{\mathsf{E}}\bigl(|Y_n|\mathsf{1}_{\{|Y_n|>K\}}\bigr)=0\,. \end{equation}\]

Let the SSRW \((S_n)_{n\ge0}\) be generated as in [exmle:SSRW-expected-exit-time-via-martingales]. Put \[H\stackrel{{\sf def}}{=}\inf\{n\ge0:S_n=1\}\,.\] Since this RW is recurrent,  30 we deduce that \({\mathsf{P}}(H<\infty)=1\). However, the (OST) does not apply, as \(0={\mathsf{E}}(S_0)\neq{\mathsf{E}}(S_H)\equiv1\). It is an instructive Exercise to check which conditions in each of the above (OST) are violated.

We now give the proof of [thm:OST].

Let us first assume that \(T\) satisfies (OST-1); that is, there is some \(N<\infty\) with \(0\le T\le N\). Then the decomposition \[M_T=\sum_{n=0}^NM_T\mathsf{1}_{\{T=n\}}=\sum_{n=0}^NM_n\mathsf{1}_{\{T=n\}}=M_0+\sum_{n=0}^{N-1}\bigl(M_{n+1}-M_n\bigr)\mathsf{1}_{\{T>n\}}\] holds. Now for each \(0\le n\le N-1\), properties 1 and 6 of Lemma [lem:conditional-expectation-properties] allow us to rewrite \({\mathsf{E}}\bigl( (M_{n+1}-M_n)\mathsf{1}_{\{T>n\}}\bigr)\) as \[{\mathsf{E}}\bigl({\mathsf{E}}\bigl((M_{n+1}-M_n)\mathsf{1}_{\{T>n\}}|\,{\mathcal{F}}^X_n\bigr)\bigr)={\mathsf{E}}\bigl(\mathsf{1}_{\{T>n\}}{\mathsf{E}}\bigl(M_{n+1}-M_n|\,{\mathcal{F}}^X_n\bigr)\bigr)\,,\] where we have used that \(\{T>n\}\) is \(\mathcal{F}_n^X\)-measurable (by the definition of a stopping time). Since \(M\) is a martingale, \({\mathsf{E}}\bigl(M_{n+1}-M_n|\,{\mathcal{F}}^X_n\bigr)=0\) and hence \({\mathsf{E}}\bigl( (M_{n+1}-M_n)\mathsf{1}_{\{T>n\}}\bigr)=0\) for each \(n\). Linearity of expectation then implies that \({\mathsf{E}}(M_T)={\mathsf{E}}(M_0)\) as required.

Now let us assume that \(T\) satisfies (OST-2). For fixed \(n\), \(\min(T,n)=T\wedge n\) is clearly a bounded stopping time (it is at most \(n\) with probability one), so by (OST-1) we have that \({\mathsf{E}}M_{T\land n}={\mathsf{E}}M_0\) for all \(n\ge0\). We now want to take a limit as \(n\to \infty\). For this we write \[|M_{T}-M_{T\land n}|=\Bigl|\sum_{k>n}(M_k-M_n)\mathsf{1}_{\{T=k\}}\Bigr|=\Bigl|\sum_{k>n}\bigl(M_k-M_{k-1}\bigr)\mathsf{1}_{\{T\ge k\}}\Bigr|\] (in particular, noting that this gives \(|M_T|\le|M_0|+KT\) when \(n=0\), so that \(M_T\) is integrable). Taking expectations then gives that \({\mathsf{E}}(|M_T-M_{T\land n}|)\le K\sum_{k>n} {\mathsf{P}}(T\ge k)\), which tends to \(0\) as \(n\to \infty\) since \({\mathsf{E}}T<\infty\). Hence \({\mathsf{E}}(M_T)=\lim_{n\to \infty}{\mathsf{E}}(M_{T\land n})=\lim_{n\to \infty} {\mathsf{E}}(M_0)={\mathsf{E}}(M_0)\).

Finally, we can assume (OST-3) and deduce the result in a similar way. The strategy is the same, but instead of writing \(M_T-M_{T\wedge n}\) as a telescoping sum, we note that \(|M_{T}-M_{T\wedge n}|\le 2K\mathsf{1}_{\{T>n\}}\) so that \[\bigl|{\mathsf{E}}M_T-{\mathsf{E}}M_{T\land n}\bigr|\le 2K{\mathsf{P}}(T>n)\to0\qquad\text{ as $n\to\infty$\,.}\]

For those who are familiar with the dominated convergence theorem, we remark that in (OST-3) it is sufficient to assume that \(\bigl|M_{n\land T}\bigr|\le K\) for all \(n\ge0\) and \({\mathsf{P}}(T<\infty)=1\). Indeed, the dominated convergence theorem then gives \(\bigl|M_T\bigr|\le K\), so that the argument in the proof above applies.

With suitable martingales (OST) gives very powerful results.

The following instructive example is due to D. Williams:

(ABRACADABRA) A monkey types symbols at random, one per unit time, on a typewriter having \(26\) keys. How long on average will it take him to produce the sequence ‘ABRACADABRA’?

Solution. To compute the expected time, imagine a sequence of gamblers, each initially having £1, playing at a fair gambling casino. The gambler arriving just before time \(n\) (\(n\ge1\)) bets £1 on the event \(\{n\text{th letter will be {\sf A}}\}\). If he loses, he leaves. If he wins, he receives £26, all of which he bets on the event \(\{n+1\text{\,st letter will be {\sf B}}\}\). If he loses, he leaves. If he wins, he receives £\(26^2\), all of which he bets on the event \(\{n+2\text{\,nd letter will be {\sf R}}\}\), and so on through the whole ABRACADABRA sequence.

Let \(X_n\) denote the total winnings of the casino after the \(n\)th day. Since all bets are fair the process \((X_n)_{n\ge0}\) is a martingale with mean zero. Let \(N\) denote the time until the sequence ABRACADABRA appears. At time \(N\), gambler \(N-10\) would have won £\(26^{11}-1\), gambler \(N-3\) would have won £\(26^4-1\), gambler \(N\) would have won £\(26-1\) and all other \(N-3\) gamblers would have lost their initial fortune. Therefore, \[X_N=N-3-(26^{11}-1)-(26^4-1)-(26-1)=N-26^{11}-26^4-26\] and since (OST-2) can be applied (check this!), we deduce that \({\mathsf{E}}(X_N)={\mathsf{E}}(X_0)=0\), that is \[{\mathsf{E}}(N)=26+26^4+26^{11}\,.\]
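The gambling argument above reduces the computation to a sum of \(26^k\) over those \(k\) for which the length-\(k\) prefix of the target word equals its length-\(k\) suffix. A minimal sketch of this bookkeeping (the function name `expected_wait` is ours, not from the notes):

```python
def expected_wait(word: str, alphabet_size: int = 26) -> int:
    """Expected number of uniform random letters until `word` first appears."""
    total = 0
    for k in range(1, len(word) + 1):
        if word[:k] == word[-k:]:           # prefix of length k equals suffix
            total += alphabet_size ** k     # payout to the surviving gambler
    return total

print(expected_wait("ABRACADABRA") == 26 + 26 ** 4 + 26 ** 11)  # True
```

For ABRACADABRA the only such overlaps have lengths \(1\), \(4\) and \(11\), recovering \(26+26^4+26^{11}\).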

Notice that the answer could also be obtained by considering a finite state Markov chain \(X_n\) on the state space of \(12\) strings recording the longest suffix of the typed text that is a prefix of the target word ABRACADABRA, ie., \(\bigl\{{\sf ABRACADABRA, ABRACADABR, \dots, ABRA, ABR, AB, A},\varnothing \bigr\}\): there \(N\) is just the hitting time of the state ABRACADABRA from the initial condition \(X_0=\varnothing\).

Use an appropriate (OST) to carefully derive the probability \({\sf p}_k\) in [exmle:gamblers-ruin].

Use an appropriate (OST) to carefully derive the expectation \({\mathsf{E}}(T)\) in [exmle:SSRW-expected-exit-time-via-martingales].

Consider the simple symmetric random walk \((S_n)_{n\ge0}\), generated by a sequence of i.i.d. Bernoulli r.v. \(X_k\) with \({\mathsf{P}}(X_1=\pm1)=1/2\), ie., \(S_n=\sum_{k=1}^nX_k\). For integer \(a<0<b\), let \(T\) be the stopping time \(T=\inf\bigl\{n\ge0:S_n\notin(a,b)\bigr\}\equiv\inf\bigl\{n\ge0:S_n\in\{a,b\}\bigr\}\).

  1. Show that \((S_n)_{n\ge0}\) is a martingale and use an appropriate (OST) to find \({\mathsf{P}}(S_T=a)\) and \({\mathsf{P}}(S_T=b)\).

  2. Show that \((M_n)_{n\ge0}\) defined via \(M_n=(S_n)^2-n\) is a martingale w.r.t. the process \((S_n)_{n\ge0}\).

  3. For fixed integer \(K>0\), carefully apply an appropriate (OST) to \(M_n\) and prove that \({\mathsf{E}}(T\land K)={\mathsf{E}}\bigl[(S_{T\land K})^2\bigr]\).

  4. Deduce that \({\mathsf{E}}(T)=-ab\).
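As a sanity check on parts 3 and 4, the one-step equations \(h(x)=1+\tfrac12\bigl(h(x-1)+h(x+1)\bigr)\), \(h(a)=h(b)=0\), can be solved exactly in rational arithmetic and compared with \(-ab\). The solver below is our own sketch, written for this tridiagonal system only:

```python
from fractions import Fraction

def expected_exit_time(a: int, b: int, start: int = 0) -> Fraction:
    """Solve h(x) = 1 + (h(x-1) + h(x+1))/2 on a < x < b, with h(a) = h(b) = 0."""
    states = list(range(a + 1, b))
    n = len(states)
    idx = {x: i for i, x in enumerate(states)}
    # augmented matrix [A | 1] of the tridiagonal system, exact arithmetic
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for i, x in enumerate(states):
        M[i][i] = Fraction(1)
        for nb in (x - 1, x + 1):
            if nb in idx:
                M[i][idx[nb]] = Fraction(-1, 2)
        M[i][n] = Fraction(1)
    # Gauss-Jordan elimination; pivots stay nonzero for this dominant system
    for col in range(n):
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for row in range(n):
            if row != col and M[row][col] != 0:
                f = M[row][col]
                M[row] = [u - f * v for u, v in zip(M[row], M[col])]
    return M[idx[start]][n]

print(expected_exit_time(-3, 5))  # -ab = 15
```

For every \(a<0<b\) we tried, the exact solution from \(0\) agrees with \(-ab\).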

A coin showing heads with probability \(p\) is tossed repeatedly. Let \(w\) be a fixed sequence of outcomes such as ‘HTH’, and let \(N\) denote the number of (independent) tosses until the word \(w\) is observed. Using an appropriate martingale, find the expectation \({\mathsf{E}}N\) for each of the following sequences: ‘HH’, ‘HTH’, ‘HHTTHH’.
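For a biased coin the same sequence-of-gamblers argument suggests that every prefix of \(w\) equal to a suffix of \(w\) contributes the product of \(1/{\mathsf{P}}(\text{letter})\) over that prefix. A hedged sketch (`pattern_wait` is our name for the helper, not notation from the notes):

```python
def pattern_wait(w: str, p: float) -> float:
    """Expected tosses until w appears; 'H' has probability p, 'T' has 1 - p."""
    prob = {"H": p, "T": 1.0 - p}
    total = 0.0
    for k in range(1, len(w) + 1):
        if w[:k] == w[-k:]:                  # prefix of length k equals suffix
            payout = 1.0
            for c in w[:k]:
                payout /= prob[c]            # fair-game payout along the prefix
            total += payout
    return total

for w in ("HH", "HTH", "HHTTHH"):
    print(w, pattern_wait(w, 0.5))           # 6.0, 10.0, 70.0 for a fair coin
```

For \(p=\tfrac12\) this reproduces the familiar fair-coin answers \(6\), \(10\) and \(70\).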

Let \((Y_n)_{n\ge0}\) be a supermartingale w.r.t. a sequence \((X_n)_{n\ge0}\) and let \(H_n\in{\mathcal{F}}_{n-1}^X=\sigma(X_0,\dots,X_{n-1})\) satisfy \(0\le H_n\le c_n\), where the constant \(c_n\) might depend on \(n\). Then the process \(W_n=W_0+\sum_{m=1}^nH_m\bigl(Y_m-Y_{m-1}\bigr)\), \(n\ge0\), is a supermartingale w.r.t. \((X_n)_{n\ge0}\).

Following the proof of the optional stopping theorem, we observe that since \(H_m\ge0\) is \({\mathcal{F}}^X_{m-1}\)-measurable and \((Y_n)_{n\ge0}\) is a supermartingale w.r.t. \((X_n)_{n\ge0}\), \[{\mathsf{E}}\bigl(H_m(Y_m-Y_{m-1})\bigr)={\mathsf{E}}\bigl[H_m\,{\mathsf{E}}(Y_m-Y_{m-1}\,|\,{\mathcal{F}}^X_{m-1})\bigr]\le0\,.\]

If \((Y_n)_{n\ge0}\) describes the stock price process, and \(H_m\) is the number of stocks held during the time \((m-1,m]\) (decided when the price \(Y_{m-1}\) is known), then \(W_n\) describes the fortune of an investor at time \(n\). As \((W_n)_{n\ge0}\) is a supermartingale w.r.t. \((X_n)_{n\ge0}\), we have \({\mathsf{E}}W_n\le{\mathsf{E}}W_0\) for all \(n\ge0\).

The famous “doubling martingale” corresponds to doubling the bet size until one wins, ie., to taking \(H_m=2^{m-1}\mathsf{1}_{\{T\ge m\}}\), where \(T\) is the first moment when the price goes up, ie., \(T=\min\{m>0:Y_m-Y_{m-1}=1\}\); notice that \(\{T\ge m\}\in{\mathcal{F}}^X_{m-1}\), so this strategy is admissible. Since the stopped process \((W_{n\land T})_{n\ge0}\) is a supermartingale, for all \(n\ge0\) we have \({\mathsf{E}}(W_{n\land T})\le{\mathsf{E}}(W_0)\), ie., on average, the doubling strategy does not produce money if one bets against a (super)martingale.

Let \((S_n)_{n\ge0}\) be a random walk generated by a sequence \((X_n)_{n\ge0}\) of i.i.d. steps with \({\mathsf{E}}|X|<\infty\) and \({\mathsf{E}}(X)=m\). If \(T\) is a stopping time for \((X_n)_{n\ge0}\) with \({\mathsf{E}}(T)<\infty\), then the optional stopping theorem implies that \(S_T\) is integrable and \[{\mathsf{E}}(S_T-S_0)=m\,{\mathsf{E}}(T).\]

To show this, first notice that \(S_n-n{\mathsf{E}}X=S_n-mn\) is a martingale and for every \(n\ge0\) the variable \(T\land n\) is a bounded stopping time. By (OST-1), we have \[\begin{equation} \label{eq:Wald-stopped-time} {\mathsf{E}}(S_0)={\mathsf{E}}\bigl(S_{n\land T}-m(n\land T)\bigr). \end{equation}\] This rearranges to \[{\mathsf{E}}\bigl(S_{n\land T}-S_0\bigr)=m{\mathsf{E}}(n\land T)\] for every \(n\). Now, the RHS converges to \({\mathsf{E}}(T)\) as \(n\to \infty\) since \(|{\mathsf{E}}(T)-{\mathsf{E}}(n\land T)|\le {\mathsf{E}}(T\mathsf{1}_{\{T>n\}})=\sum_{k>n} k{\mathsf{P}}(T=k)\), where the tail sums on the right go to zero as \(n\to \infty\) by the assumption that \(T\) is integrable. Next, by bounding \(|S_T|\le\sum_{k\ge 1} |X_k|\mathsf{1}_{\{T\ge k\}}\), where \({\mathsf{E}}(|X_k|\mathsf{1}_{\{T\ge k\}})={\mathsf{P}}(T\ge k){\mathsf{E}}(|X_1|)\) (since \(\{T\ge k\}=\{T\le k-1\}^c\) is independent of \(X_k\)) and \(\sum_{k} {\mathsf{P}}(T\ge k)<\infty\), we see that \(|S_T|\) is integrable. 31 Similarly, we can bound \({\mathsf{E}}(|S_T-S_{T\wedge n}|)\le {\mathsf{E}}(|X_1|)\sum_{k>n} {\mathsf{P}}(T\ge k)\) which tends to \(0\) as \(n\to \infty\). This implies that \({\mathsf{E}}(S_{T\land n})\to {\mathsf{E}}(S_T)\) as \(n\to \infty\), and combining the above completes the argument.
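Wald's identity can be checked exactly in a finite setting. In the sketch below (all the data is our choice, not from the notes) the steps are uniform on \(\{1,2\}\), so \(m=\tfrac32\), and \(T\) is the first time the walk reaches level \(3\): a bounded stopping time, so every expectation is a finite sum over stopped paths.

```python
from fractions import Fraction
from itertools import product

def stopped_paths(target=3, steps=(1, 2)):
    """Yield (prob, T, S_T) over all paths stopped at the first S_n >= target."""
    horizon = -(-target // min(steps))       # ceil: longest possible path
    p_step = Fraction(1, len(steps))
    seen = set()
    for path in product(steps, repeat=horizon):
        s = t = 0
        for x in path:
            s += x
            t += 1
            if s >= target:
                break
        if path[:t] in seen:                 # many extensions, one stopped prefix
            continue
        seen.add(path[:t])
        yield p_step ** t, t, s

E_T = sum(p * t for p, t, s in stopped_paths())
E_ST = sum(p * s for p, t, s in stopped_paths())
print(E_ST, Fraction(3, 2) * E_T)            # both equal 27/8
```

Here \({\mathsf{E}}(T)=\tfrac94\) and \({\mathsf{E}}(S_T)=\tfrac{27}{8}=\tfrac32\cdot\tfrac94\), exactly as Wald's identity predicts.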

4.5 Martingale convergence theorem

This subsection is optional and will not be examined.
The following example has a number of important applications.

[Pólya’s urn][exmle:Polya-urn] An urn contains one green and one red ball. At every step a ball is selected at random, and then replaced together with another ball of the same colour. Let \(X_n\) be the number of green balls after the \(n\)th draw, \(X_0=1\). Then the fraction \(M_n=X_n/(n+2)\) of green balls is a martingale w.r.t. the filtration \(({\mathcal{F}}^X_n)_{n\ge0}\).

Indeed, as \(|M_n|\le1\) we have \({\mathsf{E}}|M_n|\le1\) for all \(n\ge0\), and since \[{\mathsf{P}}(X_{n+1}=k+1\mid X_n=k)=\frac{k}{n+2}\,,\qquad{\mathsf{P}}(X_{n+1}=k\mid X_n=k)=1-\frac{k}{n+2}\,,\] we get \({\mathsf{E}}\bigl(X_{n+1}\mid{\mathcal{F}}_n^X\bigr)=\frac{n+3}{n+2}\,X_n\), equivalently, \({\mathsf{E}}(M_{n+1}\mid{\mathcal{F}}^X_n)=M_n\).

Show that \({\mathsf{P}}(M_n=\frac{k}{n+2})=\frac{1}{n+1}\) for \(1\le k\le n+1\), ie., \(M_n\) is uniformly distributed in \(\bigl\{\tfrac1{n+2},\tfrac2{n+2},\ldots,\tfrac{n+1}{n+2}\bigr\}\).
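A finite computation supporting the claimed uniformity (not a proof, and the helper name is ours): propagate the exact law of \(X_n\) through the transition probabilities above.

```python
from fractions import Fraction

def green_distribution(n: int):
    """Exact law of X_n for the urn started from one green and one red ball."""
    dist = {1: Fraction(1)}                  # X_0 = 1
    for step in range(n):
        total = step + 2                     # balls in the urn before this draw
        new = {}
        for k, prob in dist.items():
            up = Fraction(k, total)          # P(draw green | X = k)
            new[k + 1] = new.get(k + 1, Fraction(0)) + prob * up
            new[k] = new.get(k, Fraction(0)) + prob * (1 - up)
        dist = new
    return dist

d = green_distribution(10)
print(sorted(d.items()))   # each of X_10 = 1, ..., 11 has probability 1/11
```
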

[exse:Polya-urn-one] suggests that in the limit \(n\to\infty\) the distribution of \(M_n\) becomes uniform in \((0,1)\):

Show that \(\lim\limits_{n\to\infty}{\mathsf{P}}\bigl(M_n<x\bigr)=x\) for every \(x\in(0,1)\).

In view of [exse:Polya-urn-one], a natural question is: does the proportion \(M_n\) of green balls fluctuate between \(0\) and \(1\) infinitely often or does it eventually settle down to a particular value? The following example shows that the latter is true. Our argument is based upon the following observation: if a real sequence \(y_n\) does not converge, then for some real \(a\), \(b\) with \(-\infty<a<b<\infty\) the sequence \(y_n\) must go from the region below \(a\) to the region above \(b\) (and back) infinitely often.

For fixed \(n\ge0\), suppose that \(M_n<a\) for some \(a\in(0,1)\), and let \(N=\min\{k>n: M_k>b\}\) for some \(b\in(a,1)\). Since \(N_m=N\land m\) is a bounded stopping time, by (OST-1) we have \({\mathsf{E}}{M_{N_m}}={\mathsf{E}}{M_n}<a\) whenever \(m>n\). On the other hand, since \(M_k\ge0\) for all \(k\), \[{\mathsf{E}}{M_{N_m}}\ge{\mathsf{E}}\bigl(M_{N_m}\mathsf{1}_{N\le m}\bigr)\equiv{\mathsf{E}}\bigl(M_{N}\mathsf{1}_{N\le m}\bigr)>b\,{\mathsf{P}}(N\le m)\,.\] In other words, \({\mathsf{P}}(N\le m)<\frac{a}{b}\) and consequently \({\mathsf{P}}(N<\infty)\le\frac{a}{b}<1\), ie., the fraction \(M_n\) of green balls ever gets above level \(b\) with probability at most \(\frac{a}{b}\). Suppose that at a certain moment \(N\in(n,\infty)\) the fraction of green balls became bigger than \(b\). Then a similar argument shows that with probability at most \((1-b)/(1-a)\) the value \(M_n\) becomes smaller than \(a\) at a later moment.

Put \(S_0=\min\{n\ge0:M_n<a\}\), and then, inductively, for \(k\ge0\), \[\begin{equation} \label{eq:Polya-urn-ST-times} T_k=\min\bigl\{n>S_k:M_n>b\bigr\}\,,\qquad S_{k+1}=\min\bigl\{n>T_k:M_n<a\bigr\}\,. \end{equation}\] The argument above implies that \[\begin{equation} \label{eq:probability-of-upcrossings} {\mathsf{P}}(S_k<\infty)\le\prod_{j=1}^k\Bigl({\mathsf{P}}(T_{j-1}<\infty\mid S_{j-1}<\infty){\mathsf{P}}(S_j<\infty\mid T_{j-1}<\infty)\Bigr) \end{equation}\] with the RHS bounded above by \(\bigl(\frac{a}{b}\bigr)^k\bigl(\frac{1-b}{1-a}\bigr)^k\to0\) as \(k\to\infty\). As a result, the probability of infinitely many crossings (ie., \(S_k<\infty\) for all \(k\ge0\)) vanishes.

Clearly, the argument above applies to all strips \((a,b)\subset(0,1)\) with rational endpoints. Thus, with probability one, 32 trajectories of \(M_n\) eventually converge to a particular value. 33

Write \(U_{(a,b)}\) for the total number of upcrossings of the strip \((a,b)\) by the process \((M_n)_{n\ge0}\). By using the approach of [exmle:Polya-urn-convergence] and noticing that \(\bigl\{U_{(a,b)}\ge m\bigr\}\subset\bigl\{S_{m-1}<\infty\bigr\}\) or otherwise, show that \({\mathsf{E}}U_{(a,b)}<\infty\).

The argument in [exmle:Polya-urn-convergence] also works in general. Let \((M_n)_{n\ge0}\) be a martingale w.r.t. filtration \(({\mathcal{F}}^X_n)_{n\ge0}\). For real \(a\), \(b\) with \(-\infty<a<b<\infty\) let \(U_{(a,b)}\) be the total number of upcrossings of the strip \((a,b)\). The following result (or some of its variants) is often referred to as Doob’s Upcrossing Lemma:

Let the martingale \((M_n)_{n\ge0}\) have uniformly bounded expectations, ie., for some constant \(K\) and all \(n\ge0\), \({\mathsf{E}}|M_n|<K<\infty\). If \(U_{(a,b)}\) is the number of upcrossings of a strip \((a,b)\), then \({\mathsf{E}}U_{(a,b)}<\infty\).

With stopping times as in \(\eqref{eq:Polya-urn-ST-times}\), put \(H_n=1\) if \(S_m<n\le T_{m}\) for some \(m\ge0\), and put \(H_n=0\) otherwise. Then the process \(W_n=\sum_{k=1}^nH_k(M_k-M_{k-1})\) is a martingale w.r.t. \((M_n)_{n\ge0}\), cf. [lem:betting-martingale-transform]. It is easy to check that \(W_n\ge(b-a)\,U_n-|M_n-a|\) (draw the picture!), where \(U_n=\max\{m\ge0:T_m\le n\}\) is the number of upcrossings of the strip \((a,b)\) up to time \(n\). As a result  \[0={\mathsf{E}}W_0={\mathsf{E}}W_n\ge(b-a)\,{\mathsf{E}}U_n-{\mathsf{E}}|M_n-a|\ge(b-a)\,{\mathsf{E}}U_n-(K+|a|)\,,\] so that \({\mathsf{E}}U_n\le(K+|a|)/(b-a)\) for all \(n\ge0\), and thus \({\mathsf{E}}U_{(a,b)}<\infty\).

Let \((M_n)_{n\ge0}\) be a martingale as in [lem:Doob-upcrossings]. Then there exists a random variable \(M_\infty\) such that \(M_n\to M_\infty\) with probability one.

If \(M_n\) does not converge, for some rational \(a\), \(b\) with \(-\infty<a<b<\infty\) we must have \(U_{(a,b)}=\infty\). However, by [lem:Doob-upcrossings], \({\mathsf{E}}U_{(a,b)}<\infty\) implying that \({\mathsf{P}}(U_{(a,b)}=\infty)=0\). As the number of such pairs \((a,b)\) is countable, the result follows.

Let \((X_k)_{k\ge1}\) be independent variables with \[{\mathsf{P}}\Bigl(X=\frac32\Bigr)={\mathsf{P}}\Bigl(X=\frac12\Bigr)=\frac12\,.\] Put \(M_n=X_1\cdot\ldots\cdot X_n\) with \(M_0=1\). Show that \(M_n\) is an \(({\mathcal{F}}^X_n)_{n\ge0}\) martingale. Deduce that \(M_n\to M_\infty\) with probability one. Can you compute \({\mathsf{E}}(M_\infty)\)?
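A finite computation illustrating what convergence does and does not preserve here (the helper name is ours): the law of \(M_n\) is supported on the points \((3/2)^k(1/2)^{n-k}\) with binomial weights, so \({\mathsf{E}}(M_n)=1\) exactly for every \(n\), while almost all of the mass drifts towards \(0\).

```python
from fractions import Fraction
from math import comb

def law_of_Mn(n: int):
    """Exact law of M_n = X_1 * ... * X_n: values (3/2)^k (1/2)^(n-k)."""
    return [(Fraction(3, 2) ** k * Fraction(1, 2) ** (n - k),
             Fraction(comb(n, k), 2 ** n)) for k in range(n + 1)]

n = 200
law = law_of_Mn(n)
mean = sum(v * p for v, p in law)
mass_above = sum(p for v, p in law if v >= Fraction(1, 100))
print(mean)                # exactly 1: E(M_n) = 1 for every n
print(float(mass_above))   # already tiny: P(M_n >= 1/100) -> 0 as n grows
```

This is consistent with \(M_n\to M_\infty=0\) almost surely while \({\mathsf{E}}(M_n)\equiv1\ne{\mathsf{E}}(M_\infty)\).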

4.6 Additional problems

Let \((\eta_n)_{n\ge1}\) be independent positive random variables with \({\mathsf{E}}\eta_n=1\) for all \(n\ge1\). If a random variable \(M_0>0\) is independent of \((\eta_n)_{n\ge1}\) and \({\mathsf{E}}M_0<\infty\), then the process \((M_n)_{n\ge0}\) defined via \(M_n=M_0\prod_{j=1}^n\eta_j\) is a martingale w.r.t. \((\eta_n)_{n\ge1}\).

Interpreting \(\eta_n-1\) as the (fractional) change in the value of a stock during the \(n\)th time interval, the martingale \((M_n)_{n\ge0}\) can be used to model stock prices. Two often used examples are:
Discrete Black-Scholes model: take \(\eta_j=e^{\zeta_j}\), where \(\zeta_j\) is Gaussian, \(\zeta_j\sim{\mathcal{N}}(\mu,\sigma^2)\);
Binomial model: take \(\eta_j=(1+a)e^{-r}\) and \(\eta_j=(1+a)^{-1}e^{-r}\) with probabilities \(p\) and \(1-p\) respectively.

Let \((S_n)_{n\ge0}\) be the random walk from [exmle:SSRW-expected-exit-time-via-martingales]. Find constants \(a\), \(b\), \(c\) such that the process \((S_n)^4+an(S_n)^2+bn^2+cn\) is an \((X_n)_{n\ge0}\)-martingale. Use the heuristic in [exmle:SSRW-expected-exit-time-via-martingales] to predict the value of the second moment \({\mathsf{E}}(T^2)\) of the exit time \(T\).
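One way to hunt for the constants is to impose \({\mathsf{E}}(M_{n+1}\mid{\mathcal{F}}^X_n)=M_n\) and match coefficients, using \({\mathsf{E}}\bigl[(S+X)^2\mid S\bigr]=S^2+1\) and \({\mathsf{E}}\bigl[(S+X)^4\mid S\bigr]=S^4+6S^2+1\) for \(X=\pm1\). The sketch below solves the resulting linear equations and spot-checks the martingale identity; treat it as a check of a candidate answer, not a substitute for the derivation.

```python
from fractions import Fraction

# Matching coefficients in E(M_{n+1} | F_n) = M_n:
#   S^2 terms:      6 + a = 0
#   n terms:        a + 2b = 0
#   constant terms: 1 + a + b + c = 0
a = Fraction(-6)
b = -a / 2
c = -1 - a - b
print(a, b, c)

def M(s, n):
    return s ** 4 + a * n * s * s + b * n * n + c * n

# spot-check: M(S_n, n) equals the average over the two equally likely steps
for s in range(-5, 6):
    for n in range(8):
        assert (M(s - 1, n + 1) + M(s + 1, n + 1)) / 2 == M(s, n)
```
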

A standard symmetric die is rolled repeatedly. Let \(N\) be the number of (independent) rolls until a fixed pattern is observed. Using an appropriate martingale, find \({\mathsf{E}}N\) for the sequences ‘123456’ and ‘123321’.

Suppose that the process in [exmle:Polya-urn] is modified as follows: for a fixed integer \(c>1\), every time a random ball is selected, it is replaced together with \(c\) further balls of the same colour. If, as before, \(X_n\) denotes the total number of green balls after \(n\) draws, show that the fraction \(M_n=\frac{X_n}{2+nc}\) of green balls forms a martingale w.r.t. \(({\mathcal{F}}^X_n)_{n\ge0}\).

Find the large-\(n\) limit of the distribution of the martingale \((M_n)_{n\ge0}\) from [exse:Polya-urn-general].

Let \((X_n)_{n\ge0}\) be a birth-and-death process in \({\mathcal{S}}=\{0,1,\dots\}\), ie., a Markov chain in \({\mathcal{S}}\) with transition probabilities \(p_{00}=r_0\), \(p_{01}=p_0\), and \(p_{m,m-1}=q_m\), \(p_{m,m}=r_m\), \(p_{m,m+1}=p_m\) for \(m>0\), while \(p_{m,k}=0\) for all other pairs \((m,k)\in{\mathcal{S}}^2\). Let \(X_0=x\), and for \(y\ge0\) denote \(T_y\stackrel{{\sf def}}{=}\min\{n\ge0:X_n=y\}\).

  1. Show that the process \(\bigl(\varphi(X_n)\bigr)_{n\ge0}\) with \(\varphi(z)\stackrel{{\sf def}}{=}\sum\limits_{y=1}^z\prod\limits_{x=1}^{y-1}\dfrac{q_x}{p_x}\) is a martingale.

  2. Show that for all \(0\le a<X_0=x<b\) we have \[{\mathsf{P}}\bigl(T_b<T_a\bigr)=\bigl(\varphi(x)-\varphi(a)\bigr)/\bigl(\varphi(b)-\varphi(a)\bigr)\,.\] Deduce that state \(0\) is recurrent iff \(\varphi(b)\to\infty\) as \(b\to\infty\).

  3. Now suppose that \(p_m\equiv p\), \(q_m\equiv q=1-p\), and \(r_m=0\) for \(m>0\), whereas \(p_0=p\) and \(r_0=q\). Show that in this case the result in part 2 above becomes \[{\mathsf{P}}\bigl(T_b<T_a\bigr)=\bigl((q/p)^a-(q/p)^x\bigr)/\bigl((q/p)^a-(q/p)^b\bigr)\,.\]

  4. Find \({\mathsf{P}}\bigl(T_b<T_a\bigr)\) if in the setup of part 3 one has \(p=q=1/2\).
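A finite sanity check of part 2 (the parameter values are our choice): for a chain with constant \(p_m=3/5\) and \(q_m=1/5\) (so \(r_m=1/5\)), compare the \(\varphi\)-formula with a direct exact solution of the one-step equations.

```python
from fractions import Fraction

b = 6
p = {m: Fraction(3, 5) for m in range(1, b)}     # p_m = 3/5 for 0 < m < b
q = {m: Fraction(1, 5) for m in range(1, b)}     # q_m = 1/5, so r_m = 1/5

def phi(z):
    """phi(z) = sum over y = 1..z of the product of q_x/p_x for x = 1..y-1."""
    total = Fraction(0)
    for y in range(1, z + 1):
        prod = Fraction(1)
        for x in range(1, y):
            prod *= q[x] / p[x]
        total += prod
    return total

def hit_prob(a, x):
    """P(T_b < T_a) from x, solving u(y) = p_y u(y+1) + q_y u(y-1) + r_y u(y)."""
    u = {a: Fraction(0), a + 1: Fraction(1)}     # shooting: scale fixed via u(b)
    for y in range(a + 1, b):
        u[y + 1] = ((p[y] + q[y]) * u[y] - q[y] * u[y - 1]) / p[y]
    return u[x] / u[b]

for x in range(1, b):
    assert hit_prob(0, x) == (phi(x) - phi(0)) / (phi(b) - phi(0))
print("formula of part 2 confirmed for a = 0, b = 6")
```
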

Let \((\xi_k)_{k\ge1}\) be i.i.d. random variables with \({\mathsf{P}}(\xi=1)=p>\tfrac12\), \({\mathsf{P}}(\xi=-1)=q=1-p\), so that \({\mathsf{E}}\xi>0\). Let \((S_n)_{n\ge0}\) be the generated random walk, \(S_n=x+\xi_1+\dots+\xi_n\), and let \(T_0=\min\{n\ge0:S_n=0\}\) be the hitting time of \(0\). Deduce that for all \(x>0\), \({\mathsf{P}}(T_0<\infty)=(q/p)^x\). Compare this to the result of [exmle:SRW-hitting-time].

Let \((Z_n)_{n\ge0}\) be a homogeneous branching process with \(Z_0=1\), \(m={\mathsf{E}}Z_1>0\) and finite variance \(\sigma^2={\mathsf{Var}}(Z_1)\). Show that \(M_n=Z_n/m^n\) is a martingale.

  1. Let \(m>1\). By using [exse:branching-process-two-moments] or otherwise show that \({\mathsf{E}}\bigl[(M_n)^2\bigr]\) is uniformly bounded. Deduce that \(M_n\to M_\infty\) almost surely. What can you say about \({\mathsf{E}}M_\infty\)?

  2. What happens if \(m\le1\)? Compute \({\mathsf{E}}(M_\infty)\).

Hint: Recall [exse:uniform-integrability].

Let \((X_n)_{n\ge0}\) be a sequence of i.i.d. Bernoulli random variables with \({\mathsf{P}}(X=1)=p\) and \({\mathsf{P}}(X=-1)=1-p=q\). Let \((S_n)_{n\ge0}\) be the generated random walk with \(S_0=x>0\), and let \(T=\min\{n\ge0:S_n=0\}\) be the hitting time of the origin. [exmle:SRW-hitting-time] suggests that \({\mathsf{E}}(T)=x/(q-p)<\infty\) for \(q>p\).

  1. Use \(\eqref{eq:Wald-stopped-time}\) to deduce that \((q-p){\mathsf{E}}(n\land T)={\mathsf{E}}\bigl(S_0-S_{n\land T}\bigr)\le{\mathsf{E}}(S_0)=x\); then take \(n\to\infty\) to show that \({\mathsf{E}}(T)<\infty\);

  2. Use the Wald equation to deduce that indeed \({\mathsf{E}}(T)=\frac{x}{q-p}\). Can you give a heuristic explanation of this result?

  3. Argue, without using the Wald equation, that \({\mathsf{E}}(T)=cx\) for some constant \(c\).

  4. Use the Wald equation and an argument by contradiction to show that if \(p\ge q\), then \({\mathsf{E}}(T)=\infty\) for all \(x>0\).

Let a variable \(Y\) satisfy \({\mathsf{E}}(Y^2)<\infty\). Show that \({\mathsf{E}}|Y|<\infty\). Hint Notice that \({\mathsf{Var}}(|Y|)\ge0\).

Let \((X_n)_{n\ge0}\) be an irreducible Markov chain in \({\mathcal{S}}=\{0,1,\dots\}\) with bounded jumps, and let a function \(\varphi:{\mathcal{S}}\to{\mathbb{R}}^+\) satisfy \(\varphi(x)\to\infty\) as \(x\to\infty\). Let \(K\ge0\) be such that \[{\mathsf{E}}_x\varphi(X_1)\stackrel{{\sf def}}{=}{\mathsf{E}}\bigl[\varphi(X_1)|X_0=x\bigr]\le\varphi(x)\] for all \(x\ge K\).

  1. If the function \(\varphi(x)\) is monotone, show that the set of states \(\{0,1,\dots,K\}\), and thus the whole space \({\mathcal{S}}\) is recurrent for \((X_n)_{n\ge0}\).
    Hint If \(H_K=\min\bigl\{n\ge0:0\le X_n\le K\bigr\}\), show that \(\varphi(X_{n\land H_K})\) is a supermartingale. Deduce that if \(T_M=\min\bigl\{n\ge0:X_n\ge M\bigr\}\), then \(\varphi(x)\ge\varphi(M){\mathsf{P}}(T_M<H_K)\).

  2. Argue that the result holds for \(\varphi(x)\ge0\) not necessarily monotone, but only satisfying \(\varphi(x)\to\infty\) as \(x\to\infty\).
    Hint With \(T_M\) as above, show that \(\varphi_M^*\stackrel{{\sf def}}{=}\min\{\varphi(x):x\ge M\}\to\infty\) as \(M\to\infty\).

Let \((X_k)_{k\ge1}\) be independent variables with \({\mathsf{P}}(X=\pm1)=\frac12\). Show that the process \[M_n=\sum_{k=1}^n\frac1k\,X_k\] is a martingale w.r.t. \(({\mathcal{F}}^X_n)_{n\ge0}\) and that \({\mathsf{E}}\bigl[(M_n)^2\bigr]<K<\infty\) for some constant \(K\) and all \(n\ge0\). By using [exse:uniform-integrability] or otherwise, deduce that with probability one, \(M_n\to M_\infty\) for some random variable \(M_\infty\). In other words, the random sign harmonic series converges with probability one.

Thinking of a population of \(N\) haploid individuals who have one copy of each of their chromosomes, consider a fixed population of \(N\) genes that can be of two types \(A\) or \(a\). In the simplest version of this model the population at time \(n+1\) is obtained by sampling with replacement from the population at time \(n\). If we let \(X_n\) be the number of \(A\) alleles at time \(n\), then \(X_n\) is a Markov chain with transition probability \[p_{ij}=\binom{N}{j}\Bigl(\frac{i}{N}\Bigr)^j\Bigl(1-\frac{i}{N}\Bigr)^{N-j}\,.\] Starting from \(i\) of the \(A\) alleles and \(N-i\) of \(a\) alleles, what is the probability that the population fixates in the all \(A\) state? Hint You can use the heuristics of [exmle:gamblers-ruin] but need to justify your computation!
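The key step behind the hint is that \((X_n)_{n\ge0}\) is a martingale: given \(X_n=i\), the next state is \(\text{Binomial}(N,i/N)\), whose mean is exactly \(i\). The snippet below confirms this in exact arithmetic for a small population (\(N=6\) is our choice); the gambler's-ruin heuristic then suggests fixation probability \(i/N\).

```python
from fractions import Fraction
from math import comb

N = 6
for i in range(N + 1):
    # mean of Binomial(N, i/N), computed exactly from the transition row p_ij
    mean = sum(j * Fraction(comb(N, j)) * Fraction(i, N) ** j
               * Fraction(N - i, N) ** (N - j) for j in range(N + 1))
    assert mean == i
print("E(X_{n+1} | X_n = i) = i for every i, so (X_n) is a martingale")
```
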

Let \((X_n)_{n\ge0}\) be a Markov chain with a (countable) state space \({\mathcal{S}}\) and the transition matrix \({\mathbf{P}}\), and let \(h(x,n)\) be a function of the state \(x\) and time \(n\) such that  34 \[h(x,n)=\sum_yp_{xy}h(y,n+1)\,.\] Show that \((M_n)_{n\ge0}\) with \(M_n=h(X_n,n)\) is a martingale w.r.t. \((X_n)_{n\ge0}\).

Let \((X_n)_{n\ge0}\) be a Markov chain with a (countable) state space \({\mathcal{S}}\) and the transition matrix \({\mathbf{P}}\). If \(\boldsymbol{\psi}\) is a right eigenvector of \({\mathbf{P}}\) corresponding to the eigenvalue \(\lambda>0\), ie., \({\mathbf{P}}\boldsymbol{\psi}=\lambda\boldsymbol{\psi}\), show that the process \(M_n=\lambda^{-n}\psi\bigl(X_n\bigr)\) is a martingale w.r.t. \((X_n)_{n\ge0}\).


  1. current version by Ellen Powell (ellen.g.powell@durham.ac.uk); based on previous notes by Nicholas Georgiou and Ostap Hyrniv↩︎

  2. also called countably additive;↩︎

  3. See your 2nd year notes.↩︎

  4. from now on we shall often abbreviate this to just i.i.d.↩︎

  5. If \(X\) and \(Y\) are independent variables in \({\mathbb{Z}}_+\) and \(Z=X+Y\), their p.m.f.s satisfy this equation.↩︎

  6. if a power series \(G_a(s)\) is finite for \(|s|<r\) with \(r>0\), then it can be differentiated in the disk \(|s|<r\); recall that each probability generating function is analytic in the unit disk \(|s|<1\).↩︎

  7. by recalling the fact that \(G_T(s)\to0\) as \(s\to0\);↩︎

  8. sometimes called a Galton-Watson-Bienaymé process↩︎

  9. If \(Z_0=k\), we shall explicitly write \({\mathsf{P}}_k(\,\cdot\,)\) and \({\mathsf{E}}_k(\,\cdot\,)\).↩︎

  10. otherwise the model is degenerate: if \(p_0=0\), then \(Z_n\ge1\) for all \(n\ge0\) so that \(\rho=0\); if \(p_0=1\), then \({\mathsf{P}}(Z_1=0)=\rho=1\).↩︎

  11. [rescaled-generating-function] Geometrically, the graph of this generating function is a rescaled version of that of \(\varphi(\cdot)\).↩︎

  12. Intermittency follows from the criticality condition, \(1={\mathsf{E}}(Z_k\mid Z_k>0\bigr){\mathsf{P}}(Z_k>0)\); it is the linearity which is surprising here!↩︎

  13. Compare the result to Cesàro limits of real sequences: if \((a_k)_{k\ge1}\) is a real-valued sequence, and \(s_n=a_1+\dots+a_n\) is its \(n\)th partial sum, then \(\frac1ns_n\) are called the Cesàro averages for the sequence \((a_k)_{k\ge1}\). Lemma [lem:cesaro-limit] claims that if \(a_k\to a\) as \(k\to\infty\), then the sequence of its Cesàro averages also converges to \(a\). The converse is, of course, false. (Find a counterexample!)↩︎

  14. using the fact that every \(s\in(0,1)\) satisfies \(0<s<\varphi_k(0)<1\) for some \(k\ge1\);↩︎

  15. A priori the original variables \(X\) and \(Y\) can be defined on arbitrary probability spaces, so that one has no reason to expect that these spaces can be “joined” in any way!↩︎

  16. and in fact, every convex linear combination of these two tables provides a joint distribution with the same marginals.↩︎

  17. here and below we assume that \(Z\sim{\mathsf{Poi}}(0)\) means that \({\mathsf{P}}(Z=0)=1\).↩︎

  18. so that all probability measures form a metric space for this distance!↩︎

  19. By [exse:TV-distance-explicit], if \({\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)=0\), we have \(p_z=q_z\) for all \(z\), and thus all off-diagonal terms in \(\eqref{eq:max-coupling}\) vanish.↩︎

  20. Recall the first table in [exmle:coupled-bernoulli].↩︎

  21. the argument for the other marginal is similar;↩︎

  22. [exse:maximal-coupling-walk-on-graph] gives a more precise information about this convergence.↩︎

  23. try coupling RTTS (by value) and RTTS (by value) chains!↩︎

  24. The density of \(Z\sim\Gamma(a,\lambda)\) is \(\frac{\lambda^a}{\Gamma(a)}x^{a-1}e^{-\lambda x}\) (\(x>0\)); notice that \(\Gamma(1,\lambda)\) is just \({\mathsf{Exp}}(\lambda)\).↩︎

  25. If \(M_n\) traces your fortune, then “there is nothing super about a supermartingale”.↩︎

  26. Notice that we do not assume that all \(\xi_n\) have the same distribution!↩︎

  27. [general-measurability] In the general countable setting, if \({\mathcal{D}}=\{D_1,D_2,\dots\}\) forms a denumerable (ie., infinite countable) partition of \(\Omega\), then the generated \(\sigma\)-field \({\mathcal{G}}=\sigma({\mathcal{D}})\) consists of all possible subsets of \(\Omega\) which are obtained by taking countable unions of blocks of \({\mathcal{D}}\). Similarly, a variable \(Y\) is measurable w.r.t. \({\mathcal{G}}\), if for every real \(y\) the event \(\{\omega:Y(\omega)\le y\}\) belongs to \({\mathcal{G}}\) (equivalently, can be expressed as a countable union of blocks of \({\mathcal{D}}\)).↩︎

  28. Notice that the gambler’s ruin problem can be solved by using the methods of finite Markov chains, so we indeed know that the result above is correct.↩︎

  29. the equality is correct, because \({\mathsf{P}}(T<\infty)=1\) here!↩︎

  30. alternatively, recall the result in [exmle:SRW-hitting-time].↩︎

  31. Here we are really using the monotone convergence theorem: if \(0\le Z_n\le Z\) for every \(n\) and \(Z_n\uparrow Z\) as \(n\to \infty\), then \({\mathsf{E}}(Z_n)\to{\mathsf{E}}(Z)\) as \(n\to \infty\).↩︎

  32. If \(M_n\) does not converge, it must cross at least one of countably many strips \((a,b)\) with rational points infinitely many times.↩︎

  33. which is random and depends on the trajectory↩︎

  34. This result is useful, eg., if \(h(x,n)=x^2-cn\) or \(h(x,n)=\exp\{x-cn\}\) for a suitable \(c\).↩︎