1 Opening remarks

Bayesian statistical methods are not just another set of techniques for statisticians. They provide us with a different way of thinking about statistical inference and uncertainty: a posterior distribution encodes our uncertainty about the quantities and events of interest. Frequentist statisticians produce point estimates, confidence intervals and hypothesis tests; Bayesians have analogues for all of these, but the interpretations are far more natural.

2 What I expect you to know

2.1 Integration and summation by inspection

Let \(g(x)\) be a function such that \(g(x)=cf(x)\), where \(f(x)\) is a probability density function on \(\mathcal{X}\) and \(c\) is a constant. Then \[\begin{equation*} \int_\mathcal{X} g(x) dx = \int_\mathcal{X} cf(x) dx=c \int_\mathcal{X} f(x) dx=c. \end{equation*}\]

Example

  1. \(\int_{-\infty}^{\infty} e^{-0.5(x-5)^2} dx= \sqrt{2\pi} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-0.5(x-5)^2} dx=\sqrt{2\pi}\) N(5,1)

  2. \(\int_{0}^{\infty}x^4 e^{-\frac{x}{2}}dx=2^5 \Gamma(5)\int_{0}^{\infty}\frac{x^4 e^{-\frac{x}{2}}}{2^5 \Gamma(5)} dx=2^5 \Gamma(5)\) Ga(5,0.5)
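
A quick numerical check of these two results (my own sketch, not part of the notes), assuming NumPy and SciPy are available:

```python
# Verify the two integration-by-inspection examples numerically.
import numpy as np
from scipy import integrate
from scipy.special import gamma

# Example 1: the integrand is sqrt(2*pi) times the N(5, 1) density.
val1, _ = integrate.quad(lambda x: np.exp(-0.5 * (x - 5) ** 2), -np.inf, np.inf)
print(val1, np.sqrt(2 * np.pi))        # both approximately 2.5066

# Example 2: the integrand is 2^5 * Gamma(5) times the Ga(5, 0.5) density
# (shape 5, rate 1/2).
val2, _ = integrate.quad(lambda x: x ** 4 * np.exp(-x / 2), 0, np.inf)
print(val2, 2 ** 5 * gamma(5))         # both 768.0
```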

Similarly, if \(f(x)\) is a probability mass function on a discrete set \(\mathcal{X}\) and \(g(x)=cf(x)\) on that same set, then \[\begin{equation*}\sum_\mathcal{X} g(x) = \sum_\mathcal{X} cf(x)=c\sum_\mathcal{X} f(x)=c. \end{equation*}\]

Example

\(\sum_{n=0}^{\infty}\frac{3^n}{n!}=e^3\sum_{n=0}^{\infty}\frac{3^n e^{-3}}{n!}=e^3\) Po(3)
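
The same trick can be checked numerically (again my own sketch; the infinite sum is truncated at \(n=50\)):

```python
# Verify the summation-by-inspection example by truncating the Poisson(3) sum.
import numpy as np
from scipy.special import factorial

n = np.arange(51)
print(np.sum(3.0 ** n / factorial(n)), np.exp(3))   # both approximately 20.0855
```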

2.2 The distributions in Bayesian inference

2.2.1 Prior distribution

We are uncertain about \(\theta\) that takes some value from the set \(\Theta\). We use the prior density, \(\pi(\theta)\), to encode our uncertainty about \(\theta\). If we had to guess a value for \(\theta\), we might report the mean. Before we see data, this is called our prior mean: \[\begin{equation*} E_\theta(\theta)=\int_{\Theta} \theta \pi(\theta) d\theta. \end{equation*}\]
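
For concreteness, here is a minimal sketch (my own illustration; the Beta(2, 3) prior is an assumed example, not something from the notes) of computing a prior mean numerically; the exact answer is \(2/(2+3)=0.4\):

```python
# Prior mean E[theta] = integral of theta * pi(theta) d(theta)
# for an illustrative Beta(2, 3) prior on theta in [0, 1].
from scipy import integrate, stats

prior = stats.beta(2, 3)
prior_mean, _ = integrate.quad(lambda t: t * prior.pdf(t), 0, 1)
print(prior_mean, prior.mean())   # both 0.4
```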

2.2.2 The likelihood

The likelihood tells us, in relative terms, how plausible different values of \(\theta\) are given the observed data \(x\). The likelihood is only specified up to a constant of proportionality: \[\begin{equation*} l(\theta;x)\propto \pi(x \mid \theta). \end{equation*}\]

The likelihood encodes our beliefs about the data-generating process, and it links the data to the uncertain parameter \(\theta\).

Note that I tend to use \(L(\theta;x)\) for the log-likelihood.
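
As an illustration (my own running example, not from the notes), take a single observation \(x^*=7\) from a Binomial(10, \(\theta\)) model; the likelihood is then a function of \(\theta\) with the data held fixed:

```python
# Likelihood l(theta; x*) = pi(x* | theta) for x* = 7 from Binomial(n=10, theta),
# evaluated on a grid of theta values (illustrative choices).
import numpy as np
from scipy import stats

x_star, n = 7, 10
theta = np.linspace(0.01, 0.99, 99)
lik = stats.binom.pmf(x_star, n, theta)          # pi(x* | theta) as a function of theta
log_lik = stats.binom.logpmf(x_star, n, theta)   # the log-likelihood L(theta; x*)
print(theta[np.argmax(lik)])                     # maximised near theta = 0.7
```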

2.2.3 Preposterior (or prior-predictive) distribution

Given that we are uncertain about \(\theta\) and we believe \(x\) is generated from some stochastic process, we are uncertain about the value of \(x\) that we will observe. The preposterior distribution encodes this uncertainty: \[\begin{equation*} \pi(x)=\int_{\Theta} \pi(x,\theta)d\theta=\int_{\Theta} \pi(x \mid \theta)\pi(\theta)d\theta=E_{\theta}[\pi(x \mid \theta)]. \end{equation*}\]
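
A sketch of the preposterior for the illustrative Beta(2, 3) prior and Binomial(10, \(\theta\)) likelihood (my own example), computed by averaging the sampling distribution over the prior:

```python
# Preposterior pi(x) = integral of pi(x | theta) * pi(theta) d(theta)
# for the illustrative Beta(2, 3) prior and Binomial(10, theta) likelihood.
import numpy as np
from scipy import integrate, stats

prior = stats.beta(2, 3)
n = 10

def preposterior(x):
    val, _ = integrate.quad(lambda t: stats.binom.pmf(x, n, t) * prior.pdf(t), 0, 1)
    return val

probs = np.array([preposterior(x) for x in range(n + 1)])
print(probs.sum())   # approximately 1: a proper distribution over x = 0, ..., 10
```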

2.2.4 The posterior distribution

After we observe some data \(x^*\) say, we update our beliefs using \[\pi(\theta \mid x^*) \propto \pi(x^* \mid \theta)\pi(\theta),\] where \(\pi(\theta \mid x^*)\) is our posterior density.

We find the constant of proportionality through \[\begin{equation*} \pi(\theta \mid x^*)=c\pi(x^* \mid \theta)\pi(\theta) \implies \int_{\Theta} \pi(\theta \mid x^*)d\theta =1=c\int_{\Theta} \pi(x^* \mid \theta) \pi(\theta)d\theta\end{equation*}\]

\[\begin{equation*}\implies \frac{1}{c}=\int_{\Theta} \pi(x^* \mid \theta) \pi(\theta) d\theta = \pi(x^*) \end{equation*}\] The quantity \(\pi(x^*)\) is called the evidence; it is the preposterior distribution evaluated at the observed data \(x^*\).
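
A grid-based sketch (my own; the same illustrative Beta(2, 3)/Binomial(10, \(\theta\)) setup with \(x^*=7\)) of normalising the unnormalised posterior by the evidence \(\pi(x^*)\):

```python
# Posterior by brute force: pi(theta | x*) is proportional to pi(x* | theta) * pi(theta),
# and the evidence pi(x*) is the normalising constant.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

x_star, n = 7, 10
theta = np.linspace(0, 1, 2001)
unnorm = stats.binom.pmf(x_star, n, theta) * stats.beta.pdf(theta, 2, 3)
evidence = trapezoid(unnorm, theta)       # pi(x*), evaluated numerically
posterior = unnorm / evidence
# Conjugacy check: the exact posterior here is Beta(2 + 7, 3 + 3) = Beta(9, 6).
print(np.allclose(posterior, stats.beta.pdf(theta, 9, 6), atol=1e-3))
```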

2.2.5 The (posterior-)predictive distribution

After observing \(x^*\), we are still uncertain about \(\theta\) and, hence, about the next data value \(x\). This uncertainty is encoded in our predictive distribution: \[\begin{align*} \pi(x \mid x^*) &= \int_{\Theta} \pi(x, \theta \mid x^*) d\theta \\ &= \int_{\Theta} \pi(x \mid \theta, x^*)\pi(\theta \mid x^*) d\theta\\ &=E_{\theta \mid x^*} [\pi(x \mid \theta)], \end{align*}\] where the last line uses the assumption that, given \(\theta\), \(x\) is conditionally independent of \(x^*\), so that \(\pi(x \mid \theta, x^*)=\pi(x \mid \theta)\).
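
Finally, a matching sketch of the posterior predictive as a posterior expectation, again with the illustrative Beta(2, 3)/Binomial(10, \(\theta\)) setup, approximating \(E_{\theta \mid x^*}[\pi(x \mid \theta)]\) by Monte Carlo:

```python
# Posterior predictive pi(x | x*) = E_{theta | x*}[pi(x | theta)], approximated by
# averaging pi(x | theta) over draws from the Beta(9, 6) posterior (illustrative setup).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10
theta_draws = stats.beta(9, 6).rvs(size=100_000, random_state=rng)
x = np.arange(n + 1)
predictive = stats.binom.pmf(x[:, None], n, theta_draws).mean(axis=1)
print(predictive.sum())   # approximately 1: a distribution over the next observation
```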