Stochastic Processes (2021/22)
Michaelmas term 1

3 Coupling

Two random variables, say \(X\) and \(Y\), are coupled if they are defined on the same probability space. To couple two given variables \(X\) and \(Y\) means to define a random vector \(\bigl(\widetilde{X},\widetilde{Y}\bigr)\) with joint distribution \(\widetilde{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) on some probability space 2 so that the marginal distribution of \(\widetilde{X}\) coincides with the distribution of \(X\) and the marginal distribution of \(\widetilde{Y}\) coincides with the distribution of \(Y\). Recall the following example:

Fix \(p_1\), \(p_2\in[0,1]\) such that \(p_1\le p_2\) and consider the following joint distributions (we write \(q_i=1-p_i\)):

  1. \[\begin{array}{c|cc|c}
      & \widetilde{Y}=0 & \widetilde{Y}=1 & \\ \hline
     \widetilde{X}=0 & q_1q_2 & q_1p_2 & q_1 \\
     \widetilde{X}=1 & p_1q_2 & p_1p_2 & p_1 \\ \hline
      & q_2 & p_2 &
     \end{array}\]
  2. \[\begin{array}{c|cc|c}
      & \widetilde{Y}=0 & \widetilde{Y}=1 & \\ \hline
     \widetilde{X}=0 & q_2 & p_2-p_1 & q_1 \\
     \widetilde{X}=1 & 0 & p_1 & p_1 \\ \hline
      & q_2 & p_2 &
     \end{array}\]

It is easy to see that in both cases  3 \[\widetilde{X}\sim{\mathsf{Ber}}(p_1)\, , \, \widetilde{Y}\sim{\mathsf{Ber}}(p_2),\] though in the first case \(\widetilde{X}\) and \(\widetilde{Y}\) are independent, whereas in the second case we have \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\).
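Both couplings are easy to simulate. The following minimal sketch (assuming Python with numpy; the values of \(p_1,p_2\) are purely illustrative) realises the first table by drawing the two indicators independently, and the second table by driving both indicators with one common uniform variable; compare with the uniform construction among the additional problems below.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 0.3, 0.7                   # illustrative values with p1 <= p2
N = 100_000

# First table: independent coupling.
X_ind = rng.random(N) < p1
Y_ind = rng.random(N) < p2

# Second table: both indicators driven by the same uniform variable.
U = rng.random(N)
X_mon, Y_mon = U < p1, U < p2

print(X_ind.mean(), Y_ind.mean())   # ~ p1 and ~ p2: correct marginals
print(X_mon.mean(), Y_mon.mean())   # same marginals again
print(np.mean(X_mon <= Y_mon))      # 1.0: the ordering holds with probability one
```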

3.1 Stochastic domination

If \(X\sim{\mathsf{Ber}}(p)\), its tail probabilities \({\mathsf{P}}(X>a)\) satisfy \[{\mathsf{P}}(X>a)=\begin{cases}1,&a<0\,,\\p,&0\le a<1\,,\\0,&a\ge1\,.\end{cases}\] Consequently, in the setup of [exmle:coupled-bernoulli], for the variables \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) with \(p_1\le p_2\) we have \({\mathsf{P}}(X>a)\le{\mathsf{P}}(Y>a)\) for all \(a\in{\mathbb{R}}\). The last inequality is useful enough to deserve a name:

A random variable \(X\) is stochastically smaller than a random variable \(Y\) (write \(X\preccurlyeq Y\)) if the inequality \[\begin{equation} \label{eq:stoch-domination} {\mathsf{P}}\bigl(X>x\bigr)\le{\mathsf{P}}\bigl(Y>x\bigr) \end{equation}\] holds for all \(x\in{\mathbb{R}}\).

If \(X\preccurlyeq Y\) and \(g(\,\cdot\,)\ge0\) is an arbitrary increasing function on \({\mathbb{R}}\), then \(g(X)\preccurlyeq g(Y)\). If, in addition, \(X\ge0\), \(Y\ge0\), and \(g(\cdot)\) is smooth with \(g(0)=0\), then \[\begin{equation} \label{eq:stochastic-domination-expectation-of-monotone-function} {\mathsf{E}}g(X)\equiv\int_0^\infty g'(z){\mathsf{P}}(X>z)\,dz\le\int_0^\infty g'(z){\mathsf{P}}(Y>z)\,dz\equiv{\mathsf{E}}g(Y)\,. \end{equation}\]
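For completeness, the identity used in \(\eqref{eq:stochastic-domination-expectation-of-monotone-function}\) follows by writing \(g(X)=\int_0^X g'(z)\,dz\) (recall \(g(0)=0\)) and exchanging expectation and integration:
\[{\mathsf{E}}g(X)={\mathsf{E}}\int_0^{X}g'(z)\,dz={\mathsf{E}}\int_0^{\infty}g'(z)\,\mathsf{1}_{\{X>z\}}\,dz=\int_0^{\infty}g'(z)\,{\mathsf{P}}(X>z)\,dz\,,\]
and similarly for \(Y\).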

If \(X\) is a random variable and \(a\ge0\) is a fixed constant, it is immediate that \(Y=a+X\) stochastically dominates \(X\), ie., \(X\preccurlyeq Y\): for each \(x\in{\mathbb{R}}\) we have \({\mathsf{P}}(X>x)\le{\mathsf{P}}(X>x-a)={\mathsf{P}}(Y>x)\). Similarly, if \(a\ge1\) and \(Z=aX\ge0\), then \(X\preccurlyeq Z\).

In the setup of [exmle:coupled-bernoulli], if \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) then \(X\) is stochastically smaller than \(Y\) (ie., \(X\preccurlyeq Y\)) if and only if \(p_1\le p_2\); moreover, this is equivalent to the existence of a coupling \(\bigl(\widetilde{X},\widetilde{Y})\) of \(X\) and \(Y\) in which these variables are ordered with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\). The next result shows that this is a rather generic situation.

A random variable \(X\) is stochastically smaller than a random variable \(Y\) if and only if there exists a coupling \(\bigl(\widetilde{X},\widetilde{Y})\) of \(X\) and \(Y\) such that \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\).

Notice that one claim of [lem:coupling-vs-stoch-domination] is immediate from \[{\mathsf{P}}(x<X)\equiv\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{X}\bigr)=\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{X}\le\widetilde{Y}\bigr) \le\widetilde{{\mathsf{P}}}\bigl(x<\widetilde{Y}\bigr)\equiv{\mathsf{P}}\bigl(x<Y\bigr)\,;\] the other claim requires a more advanced argument (we shall not do it here!).
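A construction often used for the other direction couples \(X\) and \(Y\) through a single uniform variable and the generalised inverses of their distribution functions. The sketch below (assuming numpy) illustrates it numerically for \({\mathbb{Z}}_+\)-valued variables; it is an illustration only, not the omitted proof. The binomial example anticipates the next paragraph.

```python
import numpy as np

def quantile_coupling(pmf_x, pmf_y, n_samples=100_000, seed=0):
    """Couple two Z_+-valued variables via one uniform variable:
    X~ = F_X^{-1}(U) and Y~ = F_Y^{-1}(U); if P(X > k) <= P(Y > k)
    for all k, then X~ <= Y~ holds pointwise."""
    rng = np.random.default_rng(seed)
    U = rng.random(n_samples)
    Fx, Fy = np.cumsum(pmf_x), np.cumsum(pmf_y)   # distribution functions
    X = np.searchsorted(Fx, U)                    # generalised inverse of F_X at U
    Y = np.searchsorted(Fy, U)
    return X, Y

# Bin(2, 1/2) is stochastically smaller than Bin(3, 1/2).
px = [0.25, 0.5, 0.25]
py = [0.125, 0.375, 0.375, 0.125]
X, Y = quantile_coupling(px, py)
print(np.mean(X <= Y))                            # 1.0
```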

If \(X\sim{\mathsf{Bin}}(m,p)\) and \(Y\sim{\mathsf{Bin}}(n,p)\) with \(m\le n\), then \(X\preccurlyeq Y\). Indeed, let \(Z_1\sim{\mathsf{Bin}}(m,p)\) and \(Z_2\sim{\mathsf{Bin}}(n-m,p)\) be independent variables defined on the same probability space. We then put \(\widetilde{X}=Z_1\) and \(\widetilde{Y}=Z_1+Z_2\), so that \(\widetilde{Y}-\widetilde{X}=Z_2\ge0\) with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\), and \(X\sim\widetilde{X}\), \(Y\sim\widetilde{Y}\).

If \(X\sim{\mathsf{Poi}}(\lambda)\) and \(Y\sim{\mathsf{Poi}}(\mu)\) with \(\lambda\le\mu\), then \(X\preccurlyeq Y\). Indeed, let \(Z_1\sim{\mathsf{Poi}}(\lambda)\) and \(Z_2\sim{\mathsf{Poi}}(\mu-\lambda)\) be independent variables defined on the same probability space. 4 We then put \(\widetilde{X}=Z_1\) and \(\widetilde{Y}=Z_1+Z_2\), so that \(\widetilde{Y}-\widetilde{X}=Z_2\ge0\) with probability one, \(\widetilde{{\mathsf{P}}}(\widetilde{X}\le\widetilde{Y})=1\), and \(X\sim\widetilde{X}\), \(Y\sim\widetilde{Y}\).
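Both couplings follow the same additive pattern: realise the smaller variable, then add an independent non-negative increment with the right law. A quick numerical check of the Poisson case (a sketch, assuming numpy; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 2.0, 3.5                        # illustrative values with lambda <= mu
N = 100_000

Z1 = rng.poisson(lam, N)                  # ~ Poi(lambda)
Z2 = rng.poisson(mu - lam, N)             # ~ Poi(mu - lambda), independent of Z1
X_tilde, Y_tilde = Z1, Z1 + Z2            # Y~ - X~ = Z2 >= 0 always

print(X_tilde.mean(), Y_tilde.mean())     # ~ lambda and ~ mu: correct marginals
print(np.all(X_tilde <= Y_tilde))         # True
```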

Let \((X_n)_{n\ge0}\) be a branching process with offspring distribution \(\{p_m\}_{m\ge0}\) and \(X_0=1\). Let \((Y_n)_{n\ge0}\) be a branching process with the same offspring distribution and \(Y_0=2\). We can use stochastic domination to show that \({\mathsf{P}}(X_n=0)\ge{\mathsf{P}}(Y_n=0)\) for all \(n\).

Indeed, it is enough to show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\). To this end consider two independent branching processes, \((Z'_n)_{n\ge0}\) and \((Z''_n)_{n\ge0}\), having the same offspring distribution \(\{p_m\}_{m\ge0}\) and satisfying \(Z'_0=Z''_0=1\). We then put \(\widetilde{X}_n=Z'_n\) and \(\widetilde{Y}_n=Z'_n+Z''_n\) for all \(n\ge0\), so that \(\widetilde{Y}_n-\widetilde{X}_n=Z''_n\ge0\), ie., \(\widetilde{{\mathsf{P}}}(\widetilde{X}_n\le\widetilde{Y}_n)=1\) and \(X_n\sim \widetilde{X}_n\), \(Y_n\sim \widetilde{Y}_n\) for all \(n\ge0\).
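The same construction can be simulated directly. The sketch below (assuming numpy; the Poisson(1.2) offspring law is purely illustrative) runs two independent copies started from one ancestor each and checks the coupled ordering generation by generation.

```python
import numpy as np

rng = np.random.default_rng(2)

def offspring(k):
    """Illustrative offspring law: each of the k individuals has Poi(1.2) children."""
    return int(rng.poisson(1.2, size=k).sum()) if k > 0 else 0

Z_prime, Z_second = 1, 1                             # two independent copies, Z'_0 = Z''_0 = 1
for n in range(1, 11):
    Z_prime, Z_second = offspring(Z_prime), offspring(Z_second)
    X_tilde, Y_tilde = Z_prime, Z_prime + Z_second   # coupled copies with X_0 = 1, Y_0 = 2
    assert X_tilde <= Y_tilde                        # holds by construction
    print(n, X_tilde, Y_tilde)
```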

For a given offspring distribution \(\{p_m\}_{m\ge0}\), let \((X_n)_{n\ge0}\) be the branching process with \(X_0=k\) and let \((Y_n)_{n\ge0}\) be the branching process with \(Y_0=l\), \(k<l\). Show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\).

Let \(X\), \(Y\), \(Z\) be random variables in \({\mathbb{Z}}_+\). If \(X\preccurlyeq Y\) and \(Z\) is independent of \(\{X,Y\}\), show that \(X+Z\preccurlyeq Y+Z\).
Let \(\{X_i,Y_i\}_{i=1,\dots,n}\) be random variables in \({\mathbb{Z}}_+\). If the pairs \((X_i,Y_i)\) are mutually independent and \(X_i\preccurlyeq Y_i\) for each \(i\), show that \(X_1+\dots+X_n\preccurlyeq Y_1+\dots+Y_n\).

Let \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) be standard branching processes with \(X_0=Y_0=1\). Assume that the offspring distribution of \(X\) is stochastically smaller than that of \(Y\) (\(X_1\preccurlyeq Y_1\)), ie., for all integer \(k\ge0\), \({\mathsf{P}}(X_1>k|X_0=1)\le{\mathsf{P}}(Y_1>k|Y_0=1)\). Show that \(X_n\preccurlyeq Y_n\) for all \(n\ge0\).

3.2 Total variation distance

Let \(\mu\) and \(\nu\) be two probability measures on the same probability space. The total variation distance between \(\mu\) and \(\nu\) is \[\begin{equation} \label{eq:TV-dist-def} {\mathsf{d_{TV}}}(\mu,\nu)\stackrel{{\sf def}}{=}\max_A\bigl|\mu(A)-\nu(A)\bigr|\,. \end{equation}\] If \(X\), \(Y\) are random variables in \({\mathbb{Z}}^+\) with respective p.m.f.s \(\{p\} = \{p_k\}_{k \geq 0}\) and \(\{q\}=\{q_\ell\}_{\ell \geq 0}\), then \[{\mathsf{d_{TV}}}(X,Y) \stackrel{{\sf def}}{=}{\mathsf{d_{TV}}}(\{p\},\{q\}) = \max_{A\subseteq {\mathbb{Z}}^+}\bigl|\textstyle\sum_{k\in A} p_k - \sum_{k \in A} q_k \bigr|.\]

If \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) we have \[{\mathsf{d_{TV}}}(X,Y)=\max\bigl\{|p_1-p_2|,|q_1-q_2|\bigr\}=|p_1-p_2|=\tfrac12\bigl(|p_1-p_2|+|q_1-q_2|\bigr)\,.\]

Let probability measures \(\mu\) and \(\nu\) have respective p.m.f.s \(\{p_x\}\) and \(\{q_y\}\). Show that the total variation distance between \(\mu\) and \(\nu\) is \[{\mathsf{d_{TV}}}(\mu,\nu)\equiv{\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)=\frac12\sum_z\bigl|p_z-q_z\bigr|\,.\] Deduce that \({\mathsf{d_{TV}}}(\cdot,\cdot)\) is a distance between probability measures 5 (ie., it is non-negative, symmetric, and satisfies the triangle inequality) such that \({\mathsf{d_{TV}}}(\cdot,\cdot)\le1\) for all probability measures \(\mu\) and \(\nu\).
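A quick numerical sanity check of the half-sum formula for pmfs on a common finite support (a sketch, assuming numpy; an illustration, not a proof of the exercise):

```python
import numpy as np

def total_variation(p, q):
    """dTV({p},{q}) = (1/2) * sum_z |p_z - q_z| for pmfs on a common support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

p1, p2 = 0.3, 0.7
print(total_variation([1 - p1, p1], [1 - p2, p2]))   # 0.4 = |p1 - p2|, as in the Bernoulli example
```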

An important relation between coupling and the total variation distance is explained by the following fact.

Let random variables \(X\) and \(Y\) be such that \({\mathsf{P}}(X=x)=p_x\) and \({\mathsf{P}}(Y=y)=q_y\). Define their maximal coupling \(\bigl(\widetilde{X},\widetilde{Y}\bigr)\) via 6 \[\begin{equation} \label{eq:max-coupling} \widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=z\bigr)=\min\bigl(p_z,q_z\bigr)\,, \end{equation}\]\[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=x,\widetilde{Y}=y\bigr)=\frac{\bigl(p_x-\min(p_x,q_x)\bigr)\,\bigl(q_y-\min(p_y,q_y)\bigr)}{{\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)}\,,\quad x\neq y\,.\]

Show that \[\sum_x\bigl(p_x-\min(p_x,q_x)\bigr)=\sum_y\bigl(q_y-\min(p_y,q_y)\bigr)={\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)\,,\] and deduce that \(\eqref{eq:max-coupling}\) is indeed a coupling of \(X\) and \(Y\) (ie., that \(\eqref{eq:max-coupling}\) defines a probability distribution with correct marginals).
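For variables with a common finite support the maximal coupling \(\eqref{eq:max-coupling}\) can be written down as a matrix. A minimal sketch (assuming numpy; a numerical illustration of the exercise, not a proof):

```python
import numpy as np

def maximal_coupling(p, q):
    """Joint pmf of the maximal coupling of two pmfs on {0,...,N-1}:
    diagonal entries min(p_z, q_z); off-diagonal entries proportional to
    (p_x - min(p_x,q_x)) * (q_y - min(p_y,q_y)) / dTV."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = np.minimum(p, q)
    d_tv = (p - m).sum()                  # = dTV({p},{q}) by the exercise above
    joint = np.diag(m)
    if d_tv > 0:                          # if p = q there is no off-diagonal mass
        # (p_z - m_z)(q_z - m_z) = 0 for every z, so nothing is added on the diagonal
        joint += np.outer(p - m, q - m) / d_tv
    return joint

# Ber(0.3) vs Ber(0.7): reproduces the second table of the opening example.
J = maximal_coupling([0.7, 0.3], [0.3, 0.7])
print(J)                                  # [[0.3, 0.4], [0.0, 0.3]]
print(J.sum(axis=1), J.sum(axis=0))       # marginals (0.7, 0.3) and (0.3, 0.7)
print(1 - np.trace(J))                    # P(X~ != Y~) = 0.4 = dTV
```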

Consider \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\) with \(p_1\le p_2\). It is a straightforward exercise to check that the second table in [exmle:coupled-bernoulli] provides the maximal coupling of \(X\) and \(Y\). We notice also that in this case \[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)=p_2-p_1={\mathsf{d_{TV}}}(X,Y)\,.\]

Let \(\widehat{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) be the maximal coupling \(\eqref{eq:max-coupling}\) of \(X\) and \(Y\). Then for every other coupling \(\widetilde{{\mathsf{P}}}(\,\cdot\,,\,\cdot\,)\) of \(X\) and \(Y\) we have \[\begin{equation} \label{eq:coupling-inequality-dTV} \widetilde{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\ge\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)={\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\,. \end{equation}\]

Summing the inequalities \(\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=z\bigr)\le\min(p_z,q_z)\) we deduce \[\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\ge1-\sum_z\min(p_z,q_z)=\sum_z\bigl(p_z-\min(p_z,q_z)\bigr)={\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)\,,\] in view of [exse:dTV-vs-coupling] and [exmle:max-coupling-Bernoulli].

Notice that according to \(\eqref{eq:coupling-inequality-dTV}\), \[\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)\le\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)=1-{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\,,\] ie., the probability that \(\widetilde{X}=\widetilde{Y}\) is maximised under the optimal coupling \(\widehat{{\mathsf{P}}}(\cdot\,,\,\cdot)\).

The maximal coupling of \(X\sim{\mathsf{Ber}}(p)\) and \(Y\sim{\mathsf{Poi}}(p)\) satisfies \[\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=0\bigr)=1-p\,,\qquad \widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=1\bigr)=pe^{-p}\,,\] and \(\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}=x\bigr)=0\) for all \(x>1\). As a result, \[\begin{equation} \label{eq:Bernoulli-Poisson-max-coupling} {\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)=1-\widehat{{\mathsf{P}}}\bigl(\widetilde{X}=\widetilde{Y}\bigr)=p\bigl(1-e^{-p}\bigr)\le p^2\,. \end{equation}\] Is either of the variables \(X\) and \(Y\) stochastically dominated by the other?
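The value \(p(1-e^{-p})\) is easy to confirm numerically by truncating the Poisson pmf at a large cut-off (the neglected tail is tiny). A sketch, assuming numpy; the value of \(p\) is illustrative:

```python
import numpy as np
from math import exp, factorial

p, K = 0.2, 30                                    # illustrative p; truncate Poi(p) at K
ber = np.array([1 - p, p] + [0.0] * (K - 1))      # Ber(p) padded to the same support
poi = np.array([exp(-p) * p**k / factorial(k) for k in range(K + 1)])

d_tv = 0.5 * np.abs(ber - poi).sum()
print(d_tv, p * (1 - exp(-p)), p**2)              # ~0.03625, 0.03625, 0.04
```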

3.3 Applications to convergence

3.3.1 Coupling of Markov chains

Coupling is a very useful tool for proving convergence towards equilibrium for various processes, including Markov chains and random walks. The following example uses the independent coupling 7 of two random walks on a complete graph.

On the complete graph \(K_m\) on \(m\) vertices, consider the random walk \((X_n)_{n\ge0}\) with \(X_0=x\), such that at every step it jumps to a vertex of \(K_m\) chosen uniformly at random (independently of the past, and possibly equal to the current one), so that \(p_{x,x'}=\frac1m\) for all \(x,x'\). To show that eventually \((X_n)_{n\ge0}\) forgets its initial state, one couples \((X_n)_{n\ge0}\) with another copy \((Y_n)_{n\ge0}\), \(Y_0=y\), of this random walk, and shows that for large \(n\) both processes coincide with large probability.

To do this, we treat the pair \((\widetilde{X}_n,\widetilde{Y}_n)_{n\ge0}\) as a two-component random walk on \(K_m\times K_m\) with the following jump probabilities: the transition \((x,y)\mapsto(x',y')\) occurs with probability \[\tilde{p}_{(x,y)(x',y')}= \begin{cases} \frac1{m^2}\,,&\quad \text{ if } x\neq y\,,\\ \frac1m\,,&\quad \text{ if $x=y$ and $x'=y'$\,,} \end{cases}\] and we set \(\tilde{p}_{(x,y)(x',y')}=0\) otherwise (ie., if \(x=y\) and \(x'\neq y'\)). It is straightforward to check that this is a correct coupling: 8 \[\sum_{y'}\tilde{p}_{(x,y)(x',y')}=m\cdot\frac1{m^2}=\frac1m=p_{x,x'}\,,\qquad \sum_{y'}\tilde{p}_{(x,x)(x',y')}=\tilde{p}_{(x,x)(x',x')}=\frac1m=p_{x,x'}\,.\] Notice that the walks \((\widetilde{X}_n)_{n\ge0}\) and \((\widetilde{Y}_n)_{n\ge0}\) run independently until they meet, and from that moment onwards they move together. For \(x\neq y\), denote \[\begin{equation} \label{eq:coupling-time-def} T\stackrel{{\sf def}}{=}\min\{n\ge0:\widetilde{X}_n=\widetilde{Y}_n\}\,. \end{equation}\] Since at each step before time \(T\) the two walks jump independently and land on the same vertex with probability \(\frac1m\), a straightforward induction shows that \[\widetilde{{\mathsf{P}}}(T>n)=\Bigl(1-\frac1m\Bigr)^{n}\,,\] ie., \(T\) is a geometric random variable with parameter \(\frac1m\). Since \(\widetilde{X}_n\) and \(\widetilde{Y}_n\) coincide after they meet, and since under \(\widetilde{{\mathsf{P}}}\) the pair \((\widetilde{X}_n,\widetilde{Y}_n)\) is a coupling of \((X_n,Y_n)\) for every \(n\), we have \[\begin{aligned} {\mathsf{d_{TV}}}(X_n,Y_n)\le \widetilde{{\mathsf{P}}}(\widetilde{X}_n\ne \widetilde{Y}_n)=\widetilde{{\mathsf{P}}}(T>n)\,, \end{aligned}\] by \(\eqref{eq:coupling-inequality-dTV}\). It therefore only remains to observe that the RHS above is exponentially small, \(\widetilde{{\mathsf{P}}}(T>n)=\bigl(1-\frac1m\bigr)^n\le e^{-n/m}\). In other words, the random walk \((X_n)_{n\ge0}\) forgets its initial state \(X_0=x\) at least exponentially fast.
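A minimal Monte-Carlo sketch of this coupling (assuming numpy; the values of \(m\) and the number of runs are illustrative) estimates the tail of the coupling time and compares it with \((1-\frac1m)^n\):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_runs = 10, 20_000

def coupling_time():
    """Run the two walks independently from distinct vertices until they first meet."""
    x, y, t = 0, 1, 0
    while x != y:
        x, y = rng.integers(m), rng.integers(m)   # both jump uniformly on K_m
        t += 1
    return t

T = np.array([coupling_time() for _ in range(n_runs)])
for n in (1, 5, 10):
    print(n, (T > n).mean(), (1 - 1 / m) ** n)    # empirical P(T > n) vs (1 - 1/m)^n
```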

Notice that if \((Y_n)_{n\ge0}\) starts from the equilibrium (ie., its initial position is selected in \(K_m\) uniformly at random), our argument shows that for every initial state \(x\) and every state \(j\) we have \(|{\mathsf{P}}(X_n=j)-\frac1m|\le e^{-n/m}\), ie., convergence towards the equilibrium distribution is at least exponentially fast. 9

The next example is important for computer science.

An \(n\)-dimensional hypercube is just \({\mathcal{H}}_n\equiv\{0,1\}^n\), a graph whose vertex set is the collection of all \(n\)-sequences made of \(0\) and \(1\) (there are \(2^n\) of them) and whose edges connect vertices which differ at a single position. The “lazy” random walk \((W_k)_{k\ge0}\) on \({\mathcal{H}}_n\) is defined as follows: if \(W_k=v\in{\mathcal{H}}_n\), select \(m\), \(1\le m\le n\), uniformly at random and flip a fair coin. If the coin shows heads, set the \(m\)th coordinate of \(v\) to \(1\), otherwise set it to \(0\). The walk \((W_k)_{k\ge0}\) is aperiodic and irreducible; as \(k\) increases, it tends to forget its initial state.
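A one-step simulation of this walk is immediate (a sketch, assuming numpy; the dimension and number of steps are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def lazy_step(v):
    """One step of the lazy walk on {0,1}^n: pick a coordinate uniformly
    at random and set it to the outcome of a fair coin."""
    w = v.copy()
    m = rng.integers(len(v))          # coordinate chosen uniformly at random
    w[m] = rng.integers(2)            # fair coin: 0 or 1
    return w

v = np.zeros(10, dtype=int)           # start from the all-zeros vertex of H_10
for _ in range(200):
    v = lazy_step(v)
print(v)                              # for large k, W_k is close to uniform on {0,1}^10
```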

Let \((W_k)_{k\ge0}\) be the random walk on the hypercube \({\mathcal{H}}_n\), recall [exmle:random-walk-on-hypercube]. Use [exmle:coupled-totally-uniform-walks-on-complete-graph] to study its distribution for large \(k\).

Let \((\xi_j)_{j\ge1}\) be independent variables with \({\mathsf{P}}(\xi_j=1)=1-{\mathsf{P}}(\xi_j=-1)=p_x\) and let \((\eta_j)_{j\ge1}\) be independent variables with \({\mathsf{P}}(\eta_j=1)=1-{\mathsf{P}}(\eta_j=-1)=p_y\). Define simple random walks \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) on \({\mathbb{Z}}\) via \(X_n=x+\sum_{j=1}^n\xi_j\) and \(Y_n=y+\sum_{j=1}^n\eta_j\).
a) Show that the random walk \((X_n)_{n\ge0}\) on \({\mathbb{Z}}\) forgets its initial state \(x\); namely, for \(p_x=p_y\) and \(y-x=2k\), construct a coupling of \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) similar to that of [exmle:coupled-totally-uniform-walks-on-complete-graph].
b) Show that the random walk \((X_n)_{n\ge0}\) depends monotonically on \(p_x\); namely, for \(x\le y\) and \(p_x\le p_y\), use the ideas from [exmle:coupled-bernoulli] to construct a monotone coupling of \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\), ie., such that \(\widetilde{{\mathsf{P}}}\bigl(\widetilde{X}_n\le\widetilde{Y}_n\bigr)=1\) for all \(n\ge0\).

For many practical purposes it is important to generate a random permutation of a collection of cards \(1\), \(2\), …, \(m\). One way is to define a Markov chain \((X_n)_{n\ge0}\) on the space of all possible permutations (so at any time \(n\), \(X_n\) is a permutation of \(\{1,\dots, m\}\)) and run it until its distribution is close to the stationary one. The stationary distribution is, as usual, the uniform distribution on all \(m!\) permutations.

One of the simplest algorithms, the random-to-top shuffling, is defined as follows: for a given state \(X_n=(\sigma(1),\dots, \sigma(m))\) of the chain, take the \(j\)th card by position (with \(j\) chosen uniformly at random between \(1\) and \(m\)) and put it at the top of the deck to get the new state \(X_{n+1}=(\sigma(j),\sigma(1),\dots, \sigma(j-1), \sigma(j+1),\dots,\sigma(m))\). Note that we could take the \(j\)th card by value instead, and then the new state would be \((j,\sigma(1),\dots, \sigma(k-1),\sigma(k+1),\dots,\sigma(m))\) for the unique \(k\) such that \(\sigma(k)=j\). Since \(j\) is chosen uniformly at random, this gives the same algorithm (a Markov chain with the same transition probabilities).
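One step of the shuffle is a few lines of code. A sketch (assuming Python; the deck size is illustrative):

```python
import random

def random_to_top(deck):
    """One step of random-to-top shuffling: choose a position uniformly
    at random and move the card sitting there to the top of the deck."""
    j = random.randrange(len(deck))    # uniform position in {0,...,m-1}
    return [deck[j]] + deck[:j] + deck[j + 1:]

deck = list(range(1, 6))               # the ordered deck (1,...,5)
for _ in range(50):
    deck = random_to_top(deck)
print(deck)                            # after many steps, close to a uniform permutation
```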

To study the approach to stationarity, one couples two copies of the Markov chain, \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\), one starting from a fixed state and the other starting from equilibrium, and shuffles until both configurations \(X_n\) and \(Y_n\) agree.

Let \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\) be random-to-top shuffling (RTTS) Markov chains, recall [exmle:random-to-top-shuffling]. Suppose that a random card in \(X_n\) (by position) is chosen, say of value \(j\), and then put at the top to get \(X_{n+1}\). In \(Y_n\) find the (randomly positioned) card with number \(j\) and also move it to the top to get \(Y_{n+1}\). Let \(T:=\min\{k\ge 0: Y_k=X_k\}\) be the coupling time for \((X_n)_{n\ge0}\) and \((Y_n)_{n\ge0}\).
a) Show that the described algorithm provides a coupling 10 of the RTTS (by position) for \((X_n)_{n\ge0}\) and the RTTS (by value) for \((Y_n)_{n\ge0}\).
b) Describe the distribution of \(T\).

3.3.2 The law of rare events

This subsection is optional and will not be examined.
The convergence result in [exse:sum-of-nonhomogeneous-Bernoulli-converges-to-Poisson] can also be derived using coupling:

Let \(X=\sum_{k=1}^nX_k\), where \(X_k\sim{\mathsf{Ber}}(p_k)\) are independent random variables. Let, further, \(Y\sim{\mathsf{Poi}}(\lambda)\), where \(\lambda=\sum_{k=1}^np_k\). Then the maximal coupling of \(X\) and \(Y\) satisfies \[{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\le\sum_{k=1}^n\bigl(p_k\bigr)^2\,.\]

Write \(Y=\sum_{k=1}^nY_k\), where \(Y_k\sim{\mathsf{Poi}}(p_k)\) are independent random variables. Of course, \({\mathsf{P}}\bigl(\sum_{k=1}^nX_k\neq\sum_{k=1}^nY_k\bigr)\le\sum_{k=1}^n{\mathsf{P}}(X_k\neq Y_k)\) for every joint distribution of \((X_k)_{k=1}^n\) and \((Y_k)_{k=1}^n\). Let \(\widehat{{\mathsf{P}}}_k\) be the maximal coupling of the pair \(\{X_k,Y_k\}\), and let \(\widehat{{\mathsf{P}}}_0\) be the maximal coupling of the two sums. By \(\eqref{eq:coupling-inequality-dTV}\), under any coupling the LHS above is not smaller than \({\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\equiv\widehat{{\mathsf{P}}}_0\bigl(\widetilde{X}\neq\widetilde{Y}\bigr)\); on the other hand, under the (independent) product measure \(\widehat{{\mathsf{P}}}_1\times\dots\times\widehat{{\mathsf{P}}}_n\) the RHS becomes just \(\sum_{k=1}^n\widehat{{\mathsf{P}}}_k\bigl(\widetilde{X}_k\neq\widetilde{Y}_k\bigr)\). The result now follows from \(\eqref{eq:Bernoulli-Poisson-max-coupling}\).

Let \(X\sim{\mathsf{Bin}}(n,\frac{\lambda}{n})\) and \(Y\sim{\mathsf{Poi}}(\lambda)\) for some \(\lambda>0\). Show that \[\begin{equation} \label{eq:Bin2PoiDistance} \frac12\bigl|{\mathsf{P}}(X=k)-{\mathsf{P}}(Y=k)\bigr|\le{\mathsf{d_{TV}}}\bigl(\widetilde{X},\widetilde{Y}\bigr)\le\frac{\lambda^2}{n}\,\qquad\text{for every $k\ge0$.} \end{equation}\] Deduce that for every fixed \(k\ge0\), we have \({\mathsf{P}}(X=k)\to\frac{\lambda^k}{k!}e^{-\lambda}\) as \(n\to\infty\).

Notice that \(\eqref{eq:Bin2PoiDistance}\) implies that if \(X\sim{\mathsf{Bin}}(n,p)\) and \(Y\sim{\mathsf{Poi}}(np)\) then for every \(k\ge0\) the probabilities \({\mathsf{P}}(X=k)\) and \({\mathsf{P}}(Y=k)\) differ by at most \(2np^2\). In particular, if \(n=10\) and \(p=0.01\) the discrepancy between any pair of such probabilities is bounded above by \(0.002\), ie., they coincide in the first two decimal places.
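This pointwise closeness is easy to confirm numerically. A sketch (assuming Python; only the standard library is used, and the values of \(n\) and \(p\) are those of the remark above):

```python
from math import comb, exp, factorial

n, p = 10, 0.01
lam = n * p
for k in range(4):
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)      # P(Bin(n,p) = k)
    poiss = exp(-lam) * lam**k / factorial(k)           # P(Poi(np) = k)
    print(k, round(binom, 5), round(poiss, 5), abs(binom - poiss) <= 2 * n * p**2)
```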

3.4 Additional problems

In the setting of [exmle:coupled-bernoulli] show that every convex linear combination of tables \(T_1\) and \(T_2\), ie., each table of the form \(T_\alpha = \alpha T_1 + (1-\alpha) T_2\) with \(\alpha\in[0,1]\), gives an example of a coupling of \(X\sim{\mathsf{Ber}}(p_1)\) and \(Y\sim{\mathsf{Ber}}(p_2)\). Can you find all possible couplings for these variables?

Generalise the inequality \(\eqref{eq:stochastic-domination-expectation-of-monotone-function}\) to a broader class of functions \(g(\,\cdot\,)\) and verify that if \(X\preccurlyeq Y\), then \({\mathsf{E}}(X^{2k+1})\le{\mathsf{E}}(Y^{2k+1})\), \({\mathsf{E}}e^{tX}\le{\mathsf{E}}e^{tY}\) with \(t>0\), \({\mathsf{E}}s^X\le{\mathsf{E}}s^Y\) with \(s>1\), etc.

Let \(\xi\sim{\mathcal{U}}(0,1)\) be a standard uniform random variable. For fixed \(p\in(0,1)\), define \(X=\mathsf{1}_{\xi<p}\). Show that \(X\sim{\mathsf{Ber}}(p)\), a Bernoulli random variable with parameter \(p\). Now suppose that \(X=\mathsf{1}_{\xi<p_1}\) and \(Y=\mathsf{1}_{\xi<p_2}\) for some \(0<p_1\le p_2<1\) and \(\xi\) as above. Show that \(X\preccurlyeq Y\) and that \({\mathsf{P}}(X\le Y)=1\). Compare your construction to the second table in [exmle:coupled-bernoulli].

Let \(X\sim{\mathsf{Geom}}(p)\) and \(Y\sim{\mathsf{Geom}}(r)\) be two geometric random variables. If \(0<p\le r<1\), are \(X\) and \(Y\) stochastically ordered? Justify your answer.

Let \(X\sim\Gamma(a,\lambda)\) and \(Y\sim\Gamma(b,\lambda)\) be two gamma random variables. 11 If \(0<a\le b<\infty\), are \(X\) and \(Y\) stochastically ordered? Justify your answer.


  1. current version by Ellen Powell (ellen.g.powell@durham.ac.uk); based on previous notes by Nicholas Georgiou and Ostap Hyrniv↩︎

  2. A priori the original variables \(X\) and \(Y\) can be defined on arbitrary probability spaces, so that one has no reason to expect that these spaces can be “joined” in any way!↩︎

  3. and in fact, every convex linear combination of these two tables provides a joint distribution with the same marginals.↩︎

  4. here and below we assume that \(Z\sim{\mathsf{Poi}}(0)\) means that \({\mathsf{P}}(Z=0)=1\).↩︎

  5. so that all probability measures form a metric space for this distance!↩︎

  6. By [exse:TV-distance-explicit], if \({\mathsf{d_{TV}}}\bigl(\{p\},\{q\}\bigr)=0\), we have \(p_z=q_z\) for all \(z\), and thus all off-diagonal terms in \(\eqref{eq:max-coupling}\) vanish.↩︎

  7. Recall the first table in [exmle:coupled-bernoulli].↩︎

  8. the argument for the other marginal is similar.↩︎

  9. [exse:maximal-coupling-walk-on-graph] gives a more precise information about this convergence.↩︎

  10. try coupling RTTS (by value) and RTTS (by value) chains!↩︎

  11. The density of \(Z\sim\Gamma(a,\lambda)\) is \(\frac{\lambda^a}{\Gamma(a)}x^{a-1}e^{-\lambda x}\) (\(x>0\)); notice that \(\Gamma(1,\lambda)\) is just \({\mathsf{Exp}}(\lambda)\).↩︎