Visual Guide to Statistics. Part II: Bayesian Statistics
Part II introduces a different approach to parameter estimation called Bayesian statistics.
Basic definitions
We noted in the previous part that it is extremely unlikely to get a uniformly best estimator. An alternative way to compare risk functions is to look either at averaged values (weighted by a prior distribution over parameters) or at maximum values for worst-case scenarios.
In the Bayesian interpretation the parameter $\vartheta$ is random, namely a realization of a random variable $\theta: \Omega \rightarrow \Theta$ with distribution $\pi$. We call $\pi$ a prior distribution for $\vartheta$. For an estimator $g \in \mathcal{K}$ with risk function $R(\cdot, g)$, the quantity
\[R(\pi, g) = \int_{\Theta} R(\vartheta, g) \pi(d \vartheta)\]is called the Bayes risk of $g$ with respect to $\pi$. An estimator $\tilde{g} \in \mathcal{K}$ is called a Bayes estimator if it minimizes the Bayes risk over all estimators, that is
\[R(\pi, \tilde{g}) = \inf_{g \in \mathcal{K}} R(\pi, g).\]The right-hand side of the equation above is called the Bayes risk of the prior $\pi$. The quantity $R(\pi, g)$ plays the role of an average of the risk function, where the possible values of $\theta$ are weighted according to their prior probabilities. The distribution $\pi$ can be interpreted as the statistician's prior knowledge about the unknown parameter.
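To make the definition concrete, here is a small numerical sketch in Python (the setup is an assumed toy example, not part of the derivation): for a binomial sample the estimator $g(x) = x/n$ has risk $\vartheta(1-\vartheta)/n$ under quadratic loss (this example reappears below), and with a uniform prior the Bayes risk is simply the average of this risk over $\vartheta$, which equals $1/(6n)$.

```python
import numpy as np

# A minimal sketch of the Bayes risk as a prior-weighted average of the risk function.
# Assumed toy setup: X ~ Bin(n, theta), estimator g(x) = x / n, quadratic loss,
# so R(theta, g) = theta * (1 - theta) / n, and a uniform prior pi = U(0, 1).

rng = np.random.default_rng(0)
n = 20

def risk(theta):
    # risk function of g(x) = x / n under quadratic loss
    return theta * (1 - theta) / n

# Monte Carlo approximation of R(pi, g) = integral of R(theta, g) over pi
thetas = rng.uniform(0.0, 1.0, size=200_000)
print("Monte Carlo Bayes risk:", risk(thetas).mean())

# exact value: integral of theta (1 - theta) / n over (0, 1) equals 1 / (6 n)
print("exact Bayes risk:      ", 1 / (6 * n))
```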
In the following we will denote the conditional distribution of $X$ (given $\theta = \vartheta$) as
\[P_\vartheta = Q^{X \mid \theta=\vartheta}\]and the joint distribution of $(X, \theta)$ as $Q^{X, \theta}$:
\[Q^{X, \theta}(A) = \int_\Theta \int_\mathcal{X} 1_A(x,\vartheta) P_\vartheta (dx) \pi(d \vartheta).\]Before the experiment we have $\pi = Q^\theta$, the marginal distribution of $\theta$ under $Q^{X, \theta}$, which is the assumed distribution of the parameter $\vartheta$. After observing $X(\omega)=x$ the information about $\theta$ changes from $\pi$ to $Q^{\theta \mid X=x}$, which we call the posterior distribution of $\theta$ given $X=x$.
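A tiny discrete sketch of this update (an assumed toy setup with arbitrarily chosen values): if $\theta$ can take only a few candidate values, the posterior weights are just the prior weights multiplied by the likelihood of the observation and renormalized.

```python
import numpy as np
from math import comb

# A toy sketch of the prior -> posterior update (assumed setup, values chosen freely):
# theta takes one of a few candidate values with prior weights pi, X ~ Bin(n, theta),
# and after observing X = x the weights become proportional to P_theta(X = x) * pi(theta).

candidates = np.array([0.2, 0.5, 0.8])     # possible values of theta
prior = np.array([1 / 3, 1 / 3, 1 / 3])    # prior distribution pi

n, x = 10, 7                               # observation: 7 successes out of 10 trials
likelihood = np.array([comb(n, x) * t**x * (1 - t)**(n - x) for t in candidates])

posterior = likelihood * prior
posterior /= posterior.sum()               # normalize to a probability distribution

for t, p in zip(candidates, posterior):
    print(f"theta = {t:.1f}: posterior weight {p:.3f}")
# most of the mass moves to theta = 0.8 after seeing x = 7
```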
Posterior risk
Recall that the risk function is the expected value of a loss function $L$:
\[R(\vartheta, g) = \int_{\mathcal{X}} L(\gamma(\vartheta), g(x)) P_\vartheta(dx).\]Then
\[\begin{aligned} R(\pi,g) & =\int_\Theta R(\vartheta, g) \pi(d\vartheta) \\ &=\int_{\Theta} \int_{\mathcal{X}} L(\gamma(\vartheta), g(x)) P_\vartheta(dx) \pi(d\vartheta)\\ & = \int_{\Theta \times \mathcal{X}} L(\gamma(\vartheta), g(x)) Q^{X,\theta} (dx, d\vartheta) \\ &=\int_{\mathcal{X}} {\color{Salmon}{ \int_{\Theta} L(\gamma(\vartheta), g(x)) Q^{\theta \mid X = x} (d\vartheta)}} Q^X(dx) \\ & = \int_{\mathcal{X}} {\color{Salmon}{R_{\pi}^x(g)}} Q^X(dx). \end{aligned}\]The term
\[R_{\pi}^x(g) :=\int_{\Theta} L(\gamma(\vartheta), g(x)) Q^{\theta | X = x} (d\vartheta)\]is called the posterior risk of $g$ given $X=x$. It can be shown that for an estimator $g^*$ of $\vartheta$ to be Bayes, it must minimize the posterior risk:
\[R_{\pi}^x(g^*)=\inf_{g \in \mathcal{K}}R_{\pi}^x(g)=\inf_{a \in \Theta} \int L(\vartheta, a) Q^{\theta \mid X = x}(d\vartheta),\]because $R(\pi, g)$ is minimal if and only if $R_\pi^x(g)$ is minimal. In particular, for quadratic loss $L(\vartheta,a) = (\vartheta-a)^2$ the Bayes estimator is
\[g^*(x) = \mathbb{E}[\theta \mid X = x] = \int_{\Theta} \vartheta Q^{\theta \mid X=x} (d \vartheta).\]Say $P_\vartheta$ has density $f(x \mid \vartheta)$ and $\pi$ has density $h(\vartheta)$. Then the posterior distribution $Q^{\theta \mid X=x}$ has density
\[f(\vartheta|x) = \frac{f(x|\vartheta) h(\vartheta)}{ \int_\Theta f(x|\vartheta) h(\vartheta) d\vartheta }.\]The posterior and Bayes risks are then, respectively,
\[R_\pi^x(g) = \frac{\int_\Theta L(\vartheta, g(x))f(x|\vartheta) h(\vartheta) d\vartheta}{\int_\Theta f(x|\vartheta) h(\vartheta) d\vartheta}\]and
\[R(\pi, g)=\int_{\mathcal{X}}\int_\Theta L(\vartheta, g(x))f(x|\vartheta) h(\vartheta) d\vartheta dx.\]Let’s take an example: estimating the probability parameter of a binomial distribution. Let $\Theta = (0, 1)$, $\mathcal{X} = \lbrace 0, \dots, n \rbrace$ and
\[P_\vartheta(X=x) = \binom n x \vartheta^x (1-\vartheta)^{n-x}.\]We take the quadratic loss function $L(x,y)=(x-y)^2$ and assume that we have observed only one sample $X=x$. From the previous post we know that the binomial distribution belongs to the exponential family and therefore $g(x) = \frac{x}{n}$ is a UMVU estimator for $\vartheta$ with
\[\operatorname{Var}(g(X)) = \frac{\vartheta(1-\vartheta)}{n}.\]On the other hand, we have the density
\[f(x | \vartheta) = \binom n x \vartheta^x (1-\vartheta)^{n-x} 1_{ \lbrace 0, \dots, n \rbrace }(x).\]If we take the uniform prior $\pi = \mathcal{U}(0, 1)$, then $ h(\vartheta) = 1_{(0, 1)}(\vartheta)$ and the posterior density is
\[f(\vartheta \mid x) = \frac{\vartheta^x (1-\vartheta)^{n-x} 1_{(0,1)}(\vartheta)}{B(x+1, n-x+1)},\]where the denominator is the beta function:
\[B(a,b)=\int_{0}^{1} \vartheta^{a-1} (1-\vartheta)^{b-1} d \vartheta.\]Then the Bayes estimator is
\[\begin{aligned} g^*(x)&=\mathbb{E}[\theta|X=x]\\ &=\int_0^1 \frac{\vartheta^{x+1}(1-\vartheta)^{n-x}}{B(x+1, n-x+1)} d\vartheta\\ &=\frac{B(x+2, n-x+1)}{B(x+1, n-x+1)} =\frac{x+1}{n+2}, \end{aligned}\]and the Bayes risk is
\[\begin{aligned} R(\pi,g^*) & =\int_0^1 R(\vartheta, g^*) d\vartheta\\ &=\int_0^1 \mathbb{E}\Big[\Big(\frac{X+1}{n+2}-\vartheta \Big)^2\Big]d\vartheta \\ & =\frac{1}{(n+2)^2} \int_0^1 \big(\operatorname{Var}_\vartheta(X) + (1-2\vartheta)^2\big)\ d\vartheta\\ & =\frac{1}{(n+2)^2} \int_0^1 (n\vartheta - n\vartheta^2+1-4\vartheta+4\vartheta^2)\ d\vartheta\\ &=\frac{1}{6(n+2)}. \end{aligned}\]Here we used $\frac{X+1}{n+2}-\vartheta = \frac{(X - n\vartheta) + (1-2\vartheta)}{n+2}$ together with $\mathbb{E}_\vartheta[X]=n\vartheta$ and $\operatorname{Var}_\vartheta(X)=n\vartheta(1-\vartheta)$. Let’s take another example: $X_1, \dots, X_n$ i.i.d. $\sim P_\mu^1 = \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known in advance. Take for $\mu$ a prior distribution with Gaussian density
\[h(\mu) = \frac{1}{\sqrt{2 \pi \tau^2}} \exp \Big( -\frac{(\mu-\nu)^2}{2\tau^2} \Big).\]Taking the density of $X = (X_1, \dots, X_n)$,
\[f(x|\mu)=\Big( \frac{1}{\sqrt{2\pi \sigma^2}}\Big)^n \exp \Big( -\frac{1}{2\sigma^2}\sum_{j=1}^n(x_j-\mu)^2 \Big ),\]we get the posterior distribution
\[Q^{\mu|X=x} \sim \mathcal{N} \Big( g_{\nu, \tau^2}(x), \Big( \frac{n}{\sigma^2} + \frac{1}{\tau^2}\Big)^{-1} \Big),\]where
\[g_{\nu, \tau^2}(x)=\Big( 1 + \frac{\sigma^2}{n \tau^2} \Big)^{-1} \overline{x}_n+\Big( \frac{n \tau^2}{\sigma^2}+1 \Big)^{-1} \nu.\]For quadratic loss $g_{\nu, \tau^2}(x)$ is the Bayes estimator. It can be interpreted as follows: for large values of $\tau$ (little prior information) the estimator $g_{\nu, \tau^2}(x) \approx \overline{x}_n$.
Otherwise, for small values of $\tau$ (strong prior information), $g_{\nu, \tau^2}(x) \approx \nu$.
Fig. 1. Bayesian inference for the normal distribution.
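Here is a minimal sketch of this conjugate update (the values of $\nu$, $\tau^2$, $\sigma$ and $n$ below are arbitrary choices for illustration): the Bayes estimator interpolates between the sample mean $\overline{x}_n$ and the prior mean $\nu$, with weights controlled by $\tau^2$.

```python
import numpy as np

# A sketch of the conjugate normal update (nu, tau^2, sigma, n are assumed values).
# X_1, ..., X_n ~ N(mu, sigma^2) with sigma known, prior mu ~ N(nu, tau^2);
# the posterior is N(g_{nu, tau^2}(x), (n / sigma^2 + 1 / tau^2)^{-1}).

rng = np.random.default_rng(1)
mu_true, sigma, n = 2.0, 1.0, 30
x = rng.normal(mu_true, sigma, size=n)
x_bar = x.mean()

def posterior_params(nu, tau2):
    w = 1.0 / (1.0 + sigma**2 / (n * tau2))   # weight on the sample mean
    mean = w * x_bar + (1.0 - w) * nu         # Bayes estimator g_{nu, tau^2}(x)
    var = 1.0 / (n / sigma**2 + 1.0 / tau2)   # posterior variance
    return mean, var

print("sample mean:", x_bar)
print("weak prior   (tau^2 = 100): ", posterior_params(0.0, 100.0))  # close to x_bar
print("strong prior (tau^2 = 0.01):", posterior_params(0.0, 0.01))   # pulled towards nu = 0
```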
Minimax estimator
For an estimator $g$
\[R^*(g) = \sup_{\vartheta \in \Theta} R(\vartheta, g)\]is called the maximum risk and
\[R^*(g^*) = \inf_{g \in \mathcal{K}} R^*(g)\]is called the minimax risk, and the corresponding estimator $g^*$ is called a minimax estimator. The use of a minimax estimator is aimed at protecting against large losses. It is also not hard to see that
\[R^*(g) = \sup_{\pi \in \mathcal{M}} R(\pi, g),\]where $\mathcal{M}$ is the set of all prior measures $\pi$. If for some $\pi^*$ we have
\[\inf_{g \in \mathcal{K}} R(\pi^*, g) \geq \inf_{g \in \mathcal{K}} R(\pi, g) \quad \forall \pi \in \mathcal{M},\]then $\pi^*$ is called a least favorable prior. If $g_\pi$ is a Bayes estimator for a prior $\pi$ and also
\[R(\pi, g_\pi) = \sup_{\vartheta \in \Theta} R(\vartheta, g_\pi),\]then for any $g \in \mathcal{K}$:
\[\sup_{\vartheta \in \Theta}R(\vartheta, g) \geq \int_{\Theta}R(\vartheta, g)\pi(d\vartheta) \geq \int_{\Theta}R(\vartheta, g_\pi)\pi(d\vartheta)=R(\pi, g_\pi)=\sup_{\vartheta \in \Theta}R(\vartheta, g_\pi)\]and therefore $g_\pi$ is a minimax estimator. Also, $\pi$ is a least favorable prior, because for any distribution $\mu$
\[\begin{aligned} \inf_{g \in \mathcal{K}} \int_{\Theta} R(\vartheta, g)\mu(d\vartheta) &\leq \int_{\Theta}R(\vartheta, g_\pi)\mu(d\vartheta) \\& \leq \sup_{\vartheta \in \Theta} R(\vartheta, g_\pi) \\&= R(\pi, g_\pi) \\ &= \inf_{g \in \mathcal{K}} \int_{\Theta}R(\vartheta, g) \pi(d\vartheta). \end{aligned}\]Sometimes the risk of a Bayes estimator is constant over $\Theta$:
\[R(\vartheta, g_\pi) = c \quad \forall \vartheta \in \Theta.\]Then
\[\sup_{\vartheta \in \Theta} R(\vartheta, g_\pi) = c = \int_{\Theta} R(\vartheta, g_\pi) \pi(d\vartheta) = R(\pi, g_\pi),\]so $g_\pi$ is minimax and $\pi$ is a least favorable prior.
Let’s get back to the example with the binomial distribution:
\[P_\vartheta(X = x) = \binom{n}{x} \vartheta^x (1-\vartheta)^{n-x}.\]Again we use quadratic loss, but this time we take the parameterized beta distribution $B(a, b)$ as our prior:
\[h(\vartheta) = \frac{\vartheta^{a-1}(1-\vartheta)^{b-1}1_{[0,1]}(\vartheta)}{B(a, b)}.\]Note that for $a = b = 1$ we have $\theta \sim \mathcal{U}(0, 1)$. Now the posterior distribution is $Q^{\theta \mid X=x} \sim B(x+a,\, n-x+b)$ with density
\[f(\vartheta | x)= \frac{\vartheta^{x+a-1}(1-\vartheta)^{n-x+b-1}1_{[0,1]}(\vartheta)}{B(x+a,n-x+b)}.\]We use the known fact that for a random variable $Z \sim B(p, q)$
\[\mathbb{E}[Z] = \frac{p}{p+q} \quad \text{and} \quad \operatorname{Var}(Z)=\frac{pq}{(p+q)^2(p+q+1)}.\]Recall that for quadratic loss the posterior expectation of $\theta$ is the Bayes estimator. Therefore,
\[g_{a,b}(x)=\frac{x+a}{n+a+b}\]is the Bayes estimator, and its risk is
\[\begin{aligned} R(\vartheta, g_{a,b})&=\mathbb{E}[(g_{a,b}(X)-\vartheta)^2] \\ &=\frac{\vartheta^2((a+b)^2-n)+\vartheta(n-2a(a+b))+a^2}{(n+a+b)^2}. \end{aligned}\]If we choose $\hat{a}=\hat{b}=\frac{\sqrt{n}}{2}$, both coefficients at $\vartheta^2$ and $\vartheta$ vanish and the risk becomes
\[R(\vartheta, g_{\hat{a}, \hat{b}})=\frac{1}{4(\sqrt{n} + 1)^2}.\]This risk doesn’t depend on $\vartheta$, hence the estimator $g_{\hat{a}, \hat{b}}(x) = \frac{x+\sqrt{n}/2}{n+\sqrt{n}}$ is minimax and $B(\hat{a}, \hat{b})$ is a least favorable prior.
Fig. 2. Bayesian inference for the binomial distribution. Note that when the least favorable prior is chosen, the Bayes and minimax estimators coincide regardless of the sample value.
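The following sketch (with an arbitrarily chosen sample size $n$) evaluates the risk functions of the UMVU estimator $x/n$ and of $g_{\hat{a}, \hat{b}}$ on a grid of $\vartheta$ values, and checks that the latter is constant and equal to $1/(4(\sqrt{n}+1)^2)$.

```python
import numpy as np

# A sketch comparing risk functions in the binomial example under quadratic loss
# (the sample size n is an assumed value).  Estimators:
#   g_umvu(x)  = x / n                  with risk theta (1 - theta) / n,
#   g_{a,b}(x) = (x + a) / (n + a + b)  Bayes under the Beta(a, b) prior,
# where a = b = sqrt(n) / 2 gives the minimax estimator with constant risk.

n = 25
a = b = np.sqrt(n) / 2
theta = np.linspace(0.01, 0.99, 99)

def risk_beta_bayes(theta, a, b):
    # R(theta, g_{a,b}) from the formula above
    return (theta**2 * ((a + b)**2 - n)
            + theta * (n - 2 * a * (a + b)) + a**2) / (n + a + b)**2

risk_umvu = theta * (1 - theta) / n
risk_minimax = risk_beta_bayes(theta, a, b)

print("minimax risk range:", risk_minimax.min(), risk_minimax.max())  # constant
print("1 / (4 (sqrt(n) + 1)^2) =", 1 / (4 * (np.sqrt(n) + 1)**2))
print("max risk of x / n:      ", risk_umvu.max())                    # larger worst case
```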
Least favorable sequence of priors
Let
\[r_\pi = \inf_{g \in \mathcal{K}} R(\pi, g), \quad \pi \in \mathcal{M}.\]Then a sequence $(\pi_m)_{m \in \mathbb{N}}$ in $\mathcal{M}$ is called a least favorable sequence of priors if
- $\lim_{m \rightarrow \infty} r_{\pi_m} = r$,
- $r_\pi \leq r \quad \forall \pi \in \mathcal{M}$.
Let $(\pi_m)$ be a sequence in $\mathcal{M}$ such that $r_{\pi_m} \rightarrow r \in \mathbb{R}$, and let there be an estimator $g^* \in \mathcal{K}$ such that
\[\sup_{\vartheta \in \Theta}R(\vartheta, g^*) = r.\]Then for any $g \in \mathcal{K}$
\[\sup_{\vartheta \in \Theta} R(\vartheta, g) \geq \int_{\Theta} R(\vartheta, g) \pi_m(d \vartheta) \geq r_{\pi_m} \rightarrow r = \sup_{\vartheta \in \Theta}R(\vartheta, g^*)\]and therefore $g^*$ is minimax. Also, for any $\pi \in \mathcal{M}$
\[r_\pi \leq R(\pi, g^*) = \int_\Theta R(\vartheta, g^*) \pi (d\vartheta) \leq \sup_{\vartheta \in \Theta} R(\vartheta, g^*) = r,\]hence $(\pi_m)$ is a least favorable sequence of priors.
Let’s get back to our previous example of estimating the mean of a normal distribution with known $\sigma^2$. Say we have a prior distribution with density
\[h_m(\mu)=\frac{1}{\sqrt{2 \pi m}} \exp \Big \{ -\frac{(\mu-\nu)^2}{2m}\Big \}\]with $m \in \mathbb{N}$, i.e. $\pi_m = \mathcal{N}(\nu, m)$. Recall that the Bayes estimator is
\[g_{\nu, m}(x)=\Big( 1 + \frac{\sigma^2}{n m} \Big)^{-1} \overline{x}_n+\Big( \frac{n m}{\sigma^2}+1 \Big)^{-1} \nu.\]For any $\mu \in \mathbb{R}$
\[\begin{aligned} R(\mu, g_{\nu, m}) & = \mathbb{E}[(g_{\nu, m}(X)-\mu)^2] \\ & = \mathbb{E}\Bigg[\bigg(\Big( 1 + \frac{\sigma^2}{n m} \Big)^{-1} (\overline{X}_n-\mu)+\Big( \frac{n m}{\sigma^2}+1 \Big)^{-1} (\nu-\mu)\bigg)^2\Bigg] \\ & = \Big(1 + \frac{\sigma^2}{nm}\Big)^{-2} \frac{\sigma^2}{n} + \Big( 1+\frac{nm}{\sigma^2} \Big)^{-2}(\nu-\mu)^2 \xrightarrow[m \ \rightarrow \infty]{} \frac{\sigma^2}{n} \end{aligned}\]Since the risk is bounded from above:
\[R(\mu, g_{\nu, m}) \leq \frac{\sigma^2}{n} + (\mu - \nu)^2,\]by the Lebesgue dominated convergence theorem (stated in the note at the end of this post) we have
\[r_{\pi_{m}}=R(\pi_{m}, g_{\nu, m})=\int_{\mathbb{R}}R(\mu, g_{\nu, m})\pi_{m}(d\mu) \longrightarrow \frac{\sigma^2}{n}.\]Since for the estimator $g^*(x)=\overline{x}_n$ the equality
\[R(\mu, g^*)=\mathbb{E}[(\overline{X}_n-\mu)^2]=\frac{\sigma^2}{n}\]holds, $g^*(x)=\overline{x}_n$ is minimax and $(\pi_{m})$ is a least favorable sequence of priors.
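As a quick numerical check (with arbitrarily chosen $\sigma$ and $n$), the Bayes risk $r_{\pi_m}$ has a closed form: integrating $R(\mu, g_{\nu, m})$ against $\pi_m = \mathcal{N}(\nu, m)$ and using $\int (\nu - \mu)^2 \pi_m(d\mu) = m$ gives $r_{\pi_m} = \big(1+\frac{\sigma^2}{nm}\big)^{-2}\frac{\sigma^2}{n} + \big(1+\frac{nm}{\sigma^2}\big)^{-2} m$, which tends to $\sigma^2/n$. The sketch below evaluates this expression for growing $m$.

```python
import numpy as np

# A numerical check (sigma and n are assumed values) that r_{pi_m} -> sigma^2 / n.
# Integrating R(mu, g_{nu, m}) over pi_m = N(nu, m) and using E[(nu - mu)^2] = m gives
#   r_{pi_m} = (1 + sigma^2 / (n m))^{-2} * sigma^2 / n + (1 + n m / sigma^2)^{-2} * m.

sigma, n = 2.0, 10

def bayes_risk(m):
    term_mean = (1 + sigma**2 / (n * m))**(-2) * sigma**2 / n   # from Var of the sample mean
    term_prior = (1 + n * m / sigma**2)**(-2) * m               # from the prior spread
    return term_mean + term_prior

for m in [1, 10, 100, 1000]:
    print(f"m = {m:5d}: r_pi_m = {bayes_risk(m):.6f}")

print("limit sigma^2 / n =", sigma**2 / n)   # the minimax risk of the sample mean
```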
Note (Lebesgue dominated convergence theorem). Suppose there is a measurable space $X$ with measure $\mu$, and let $\lbrace f_n \rbrace_{n=1}^\infty$ and $f$ be measurable functions on $X$ such that $f_n(x) \rightarrow f(x)$ almost everywhere. If there exists an integrable function $g$ defined on the same space such that
\[|f_n(x)| \leq g(x) \quad \forall n \in \mathbb{N}\]almost everywhere, then $f_n$ and $f$ are integrable and
\[\lim\limits_{n \rightarrow \infty} \int_X f_n(x) \mu(dx) = \int_X f(x) \mu(dx).\]