
Visual Guide to Statistics. Part II: Bayesian Statistics

Part II introduces a different approach to parameter estimation called Bayesian statistics.

Basic definitions

We noted in the previous part that it is extremely unlikely to get a uniformly best estimator. An alternative way to compare risk functions is to look at averaged values (weighted by the probabilities of the parameters) or at maximum values for worst-case scenarios.

In the Bayesian interpretation the parameter $\vartheta$ is random, namely an instance of a random variable $\theta: \Omega \rightarrow \Theta$ with distribution $\pi$. We call $\pi$ a prior distribution for $\vartheta$. For an estimator $g \in K$ and its risk $R(\cdot, g)$,

$$R(\pi, g) = \int_\Theta R(\vartheta, g)\, \pi(d\vartheta)$$

is called the Bayes risk of $g$ with respect to $\pi$. An estimator $\tilde{g} \in K$ is called a Bayes estimator if it minimizes the Bayes risk over all estimators, that is

$$R(\pi, \tilde{g}) = \inf_{g \in K} R(\pi, g).$$

The right-hand side of the equation above is called the minimal Bayes risk. The function $R(\pi, g)$ plays the role of an average over all risk functions, where the possible values of $\theta$ are weighted according to their probabilities. The distribution $\pi$ can be interpreted as the statistician's prior knowledge about the unknown parameter.

In the following we will denote the conditional distribution of $X$ (given $\theta = \vartheta$) as

$$P_\vartheta = Q_{X|\theta=\vartheta}$$

and the joint distribution of $(X, \theta)$ as $Q_{X,\theta}$:

$$Q_{X,\theta}(A) = \int_\Theta \int_X 1_A(x, \vartheta)\, P_\vartheta(dx)\, \pi(d\vartheta).$$

Before the experiment we have $\pi = Q_\theta$, the marginal distribution of $\theta$ under $Q_{X,\theta}$, i.e. the assumed distribution of the parameter $\vartheta$. After observing $X(\omega) = x$ the information about $\theta$ changes from $\pi$ to $Q_{\theta|X=x}$, which we will call the posterior distribution of the random variable $\theta$ given $X = x$.

Posterior risk

Recall that the risk function is the expected value of a loss function $L$:

$$R(\vartheta, g) = \int_X L(\gamma(\vartheta), g(x))\, P_\vartheta(dx).$$

Then

$$\begin{aligned}
R(\pi, g) &= \int_\Theta R(\vartheta, g)\, \pi(d\vartheta) = \int_\Theta \int_X L(\gamma(\vartheta), g(x))\, P_\vartheta(dx)\, \pi(d\vartheta) \\
&= \int_{\Theta \times X} L(\gamma(\vartheta), g(x))\, Q_{X,\theta}(dx, d\vartheta) \\
&= \int_X \int_\Theta L(\gamma(\vartheta), g(x))\, Q_{\theta|X=x}(d\vartheta)\, Q_X(dx) \\
&= \int_X R_\pi^x(g)\, Q_X(dx).
\end{aligned}$$

The term

$$R_\pi^x(g) := \int_\Theta L(\gamma(\vartheta), g(x))\, Q_{\theta|X=x}(d\vartheta)$$

is called the posterior risk of $g$ given $X = x$. It can be shown that for an estimator $\tilde{g}$ to be Bayes, it must minimize the posterior risk:

$$R_\pi^x(\tilde{g}) = \inf_{g \in K} R_\pi^x(g) = \inf_{a} \int_\Theta L(\vartheta, a)\, Q_{\theta|X=x}(d\vartheta),$$

because $R(\pi, g)$ is minimal if and only if $R_\pi^x(g)$ is minimal for ($Q_X$-almost) every $x$. In particular, for the quadratic loss $L(\vartheta, a) = (\vartheta - a)^2$ the Bayes estimator is

$$\tilde{g}(x) = \mathbb{E}[\theta \mid X = x] = \int_\Theta \vartheta\, Q_{\theta|X=x}(d\vartheta).$$
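To see why, note that under quadratic loss the posterior risk of a decision $a$ expands as

$$\mathbb{E}[(\theta - a)^2 \mid X = x] = \mathbb{E}[\theta^2 \mid X = x] - 2a\, \mathbb{E}[\theta \mid X = x] + a^2,$$

which is a convex parabola in $a$ whose minimum is attained at $a = \mathbb{E}[\theta \mid X = x]$.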

Say $P_\vartheta$ has density function $f(x \mid \vartheta)$ and $\pi$ has density $h(\vartheta)$. Then the posterior distribution $Q_{\theta|X=x}$ has density

$$f(\vartheta \mid x) = \frac{f(x \mid \vartheta)\, h(\vartheta)}{\int_\Theta f(x \mid \vartheta)\, h(\vartheta)\, d\vartheta}.$$

The posterior and Bayes risks are, respectively,

$$R_\pi^x(g) = \frac{\int_\Theta L(\vartheta, g(x))\, f(x \mid \vartheta)\, h(\vartheta)\, d\vartheta}{\int_\Theta f(x \mid \vartheta)\, h(\vartheta)\, d\vartheta}$$

and

$$R(\pi, g) = \int_X \int_\Theta L(\vartheta, g(x))\, f(x \mid \vartheta)\, h(\vartheta)\, d\vartheta\, dx.$$
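To make these formulas concrete, here is a minimal Python sketch (not part of the original derivation) that approximates the posterior density and the posterior risk on a grid; the helper names `posterior_density` and `posterior_risk`, and the binomial/uniform example, are my own choices for illustration.

```python
import numpy as np
from scipy.stats import binom

# Parameter grid over Theta = (0, 1); dtheta is the spacing used for numeric integration
theta = np.linspace(1e-4, 1 - 1e-4, 10_000)
dtheta = theta[1] - theta[0]

def posterior_density(x, likelihood, prior):
    """f(theta | x) = f(x | theta) h(theta) / int_Theta f(x | theta) h(theta) dtheta."""
    unnorm = likelihood(x, theta) * prior(theta)
    return unnorm / (unnorm.sum() * dtheta)

def posterior_risk(a, x, likelihood, prior):
    """Posterior risk of the decision a under quadratic loss, given X = x."""
    post = posterior_density(x, likelihood, prior)
    return ((theta - a) ** 2 * post).sum() * dtheta

# Example: binomial likelihood with n = 10 and a uniform prior on (0, 1)
n, x_obs = 10, 7
likelihood = lambda x, t: binom.pmf(x, n, t)
prior = lambda t: np.ones_like(t)

post = posterior_density(x_obs, likelihood, prior)
bayes_estimate = (theta * post).sum() * dtheta      # posterior mean minimizes the posterior risk
print(bayes_estimate, (x_obs + 1) / (n + 2))        # both ≈ 0.667 (closed form derived below)
print(posterior_risk(bayes_estimate, x_obs, likelihood, prior))  # posterior variance ≈ 0.017
```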

Let's take as an example the estimation of the probability parameter of a binomial distribution. Let $\Theta = (0, 1)$, $X = \{0, \dots, n\}$ and

$$P_\vartheta(X = x) = \binom{n}{x} \vartheta^x (1 - \vartheta)^{n-x}.$$

We take the quadratic loss function $L(x, y) = (x - y)^2$ and suppose we have observed a single sample $X = x$. From the previous post we know that the binomial distribution belongs to an exponential family, and therefore $g(x) = x/n$ is a UMVU estimator for $\vartheta$ with

$$\operatorname{Var}(g(X)) = \frac{\vartheta(1 - \vartheta)}{n}.$$

On the other hand, we have density

$$f(x \mid \vartheta) = \binom{n}{x} \vartheta^x (1 - \vartheta)^{n-x}\, 1_{\{0, \dots, n\}}(x).$$

If we take the uniform prior $\pi \sim \mathcal{U}(0, 1)$, then $h(\vartheta) = 1_{(0,1)}(\vartheta)$ and the posterior density is

$$f(\vartheta \mid x) = \frac{\vartheta^x (1 - \vartheta)^{n-x}\, 1_{(0,1)}(\vartheta)}{B(x + 1, n - x + 1)},$$

where the denominator contains the beta function:

$$B(a, b) = \int_0^1 \vartheta^{a-1} (1 - \vartheta)^{b-1}\, d\vartheta.$$

Then the Bayes estimator is

$$\tilde{g}(x) = \mathbb{E}[\theta \mid X = x] = \frac{\int_0^1 \vartheta^{x+1} (1 - \vartheta)^{n-x}\, d\vartheta}{B(x + 1, n - x + 1)} = \frac{B(x + 2, n - x + 1)}{B(x + 1, n - x + 1)} = \frac{x + 1}{n + 2},$$

and the Bayes risk is

$$R(\pi, \tilde{g}) = \int_0^1 R(\vartheta, \tilde{g})\, d\vartheta = \int_0^1 \mathbb{E}\Big[\Big(\frac{X + 1}{n + 2} - \vartheta\Big)^2\Big]\, d\vartheta = \frac{1}{(n+2)^2} \int_0^1 \big(n\vartheta - n\vartheta^2 + 1 - 4\vartheta + 4\vartheta^2\big)\, d\vartheta = \frac{1}{6(n + 2)}.$$
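As a sanity check, the sketch below (an illustration I'm adding, with variable names of my own choosing) estimates the same Bayes risk by Monte Carlo: draw $\vartheta$ from the uniform prior, draw $X \sim \mathrm{Bin}(n, \vartheta)$, and average the squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 1_000_000

# Sample from the joint distribution: theta ~ U(0, 1), then X | theta ~ Bin(n, theta)
theta = rng.uniform(0.0, 1.0, size=trials)
x = rng.binomial(n, theta)

bayes = (x + 1) / (n + 2)   # Bayes estimator under the uniform prior
umvu = x / n                # UMVU estimator, for comparison

print(np.mean((bayes - theta) ** 2), 1 / (6 * (n + 2)))  # both ≈ 0.0139
print(np.mean((umvu - theta) ** 2))                      # ≈ 1 / (6 n) ≈ 0.0167, i.e. larger on average
```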

Let's take another example: $X_1, \dots, X_n$ i.i.d. $\sim P_\mu^1 = \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known in advance. Take for $\mu$ a prior distribution with Gaussian density

$$h(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\Big(-\frac{(\mu - \nu)^2}{2\tau^2}\Big).$$

Taking the density of $X$

$$f(x \mid \mu) = \Big(\frac{1}{\sqrt{2\pi\sigma^2}}\Big)^n \exp\Big(-\frac{1}{2\sigma^2} \sum_{j=1}^n (x_j - \mu)^2\Big),$$

we get the posterior distribution

$$Q_{\mu|X=x} \sim \mathcal{N}\Big(g_{\nu,\tau^2}(x),\ \Big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\Big)^{-1}\Big),$$

where

$$g_{\nu,\tau^2}(x) = \Big(1 + \frac{\sigma^2}{n\tau^2}\Big)^{-1} \overline{x}_n + \Big(\frac{n\tau^2}{\sigma^2} + 1\Big)^{-1} \nu.$$

For the quadratic loss function $g_{\nu,\tau^2}(x)$ is a Bayes estimator. It can be interpreted as follows: for large values of $\tau$ (not enough prior information) $g_{\nu,\tau^2}(x) \approx \overline{x}_n$; otherwise $g_{\nu,\tau^2}(x) \approx \nu$.
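Here is a minimal sketch of this conjugate update in Python, assuming NumPy; the helper name `normal_posterior` is mine, chosen for this example. It shows how the posterior mean moves between $\overline{x}_n$ and $\nu$ depending on $\tau^2$.

```python
import numpy as np

def normal_posterior(x, sigma2, nu, tau2):
    """Posterior N(mean, var) of mu for an i.i.d. N(mu, sigma2) sample x with known sigma2
    and prior mu ~ N(nu, tau2)."""
    n = len(x)
    var = 1.0 / (n / sigma2 + 1.0 / tau2)     # posterior variance (n/sigma^2 + 1/tau^2)^(-1)
    w = 1.0 / (1.0 + sigma2 / (n * tau2))     # weight on the sample mean
    mean = w * np.mean(x) + (1.0 - w) * nu    # g_{nu, tau^2}(x)
    return mean, var

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=20)   # data with sigma^2 = 1 known

print(normal_posterior(x, sigma2=1.0, nu=0.0, tau2=100.0))  # vague prior: mean ≈ sample mean
print(normal_posterior(x, sigma2=1.0, nu=0.0, tau2=0.01))   # tight prior: mean ≈ nu = 0
```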

[Interactive figure: prior and posterior densities together with the Bayes estimate, with controls for $\overline{X}_n$, $\mu$, $\sigma$, $\nu$ and $\tau$.]

Fig. 1. Bayesian inference for the normal distribution.

Minimax estimator

For an estimator $g$

$$R^*(g) = \sup_{\vartheta \in \Theta} R(\vartheta, g)$$

is called the maximum risk and

$$R^* = \inf_{g \in K} R^*(g)$$

is the minimax risk; an estimator $g$ attaining this infimum is called a minimax estimator. The use of a minimax estimator is aimed at protecting against large losses. It is also not hard to see that

$$R^*(g) = \sup_{\pi \in M} R(\pi, g),$$

where $M$ is the set of all prior measures. If for some $\pi \in M$ we have

$$\inf_{g \in K} R(\pi, g) \geq \inf_{g \in K} R(\pi', g) \quad \forall \pi' \in M,$$

then $\pi$ is called the least favorable prior. If $g_\pi$ is a Bayes estimator for the prior $\pi$ and also

$$R(\pi, g_\pi) = \sup_{\vartheta \in \Theta} R(\vartheta, g_\pi),$$

then for any $g \in K$:

$$\sup_{\vartheta \in \Theta} R(\vartheta, g) \geq \int_\Theta R(\vartheta, g)\, \pi(d\vartheta) \geq \int_\Theta R(\vartheta, g_\pi)\, \pi(d\vartheta) = R(\pi, g_\pi) = \sup_{\vartheta \in \Theta} R(\vartheta, g_\pi)$$

and therefore $g_\pi$ is a minimax estimator. Also, $\pi$ is a least favorable prior, because for any distribution $\mu$

$$\inf_{g \in K} \int_\Theta R(\vartheta, g)\, \mu(d\vartheta) \leq \int_\Theta R(\vartheta, g_\pi)\, \mu(d\vartheta) \leq \sup_{\vartheta \in \Theta} R(\vartheta, g_\pi) = R(\pi, g_\pi) = \inf_{g \in K} \int_\Theta R(\vartheta, g)\, \pi(d\vartheta).$$

Sometimes the risk of a Bayes estimator $g_\pi$ is constant:

$$R(\vartheta, g_\pi) = c \quad \forall \vartheta \in \Theta.$$

Then

$$\sup_{\vartheta \in \Theta} R(\vartheta, g_\pi) = c = \int_\Theta R(\vartheta, g_\pi)\, \pi(d\vartheta) = R(\pi, g_\pi),$$

so $g_\pi$ is minimax and $\pi$ is a least favorable prior.

Let's get back to the example with the binomial distribution:

$$P_\vartheta(X = x) = \binom{n}{x} \vartheta^x (1 - \vartheta)^{n-x}.$$

Again we use quadratic loss, but this time we take a parameterized beta distribution $B(a, b)$ as our prior:

$$h(\vartheta) = \frac{\vartheta^{a-1} (1 - \vartheta)^{b-1}\, 1_{[0,1]}(\vartheta)}{B(a, b)}.$$

Note that for $a = b = 1$ we have $\theta \sim \mathcal{U}(0, 1)$. The posterior distribution is now $Q_{\theta|X=x} \sim B(x + a, n - x + b)$ with density

$$f(\vartheta \mid x) = \frac{\vartheta^{x+a-1} (1 - \vartheta)^{n-x+b-1}\, 1_{[0,1]}(\vartheta)}{B(x + a, n - x + b)}.$$

We use the fact that for a random variable $Z \sim B(p, q)$

$$\mathbb{E}[Z] = \frac{p}{p + q} \quad \text{and} \quad \operatorname{Var}(Z) = \frac{pq}{(p + q)^2 (p + q + 1)}.$$

Recall that for quadratic loss the posterior expectation of $\theta$ is the Bayes estimator. Therefore,

$$g_{a,b}(x) = \frac{x + a}{n + a + b}$$

is a Bayes estimator, and its risk is

$$R(\vartheta, g_{a,b}) = \mathbb{E}[(g_{a,b}(X) - \vartheta)^2] = \frac{\vartheta^2 \big((a + b)^2 - n\big) + \vartheta \big(n - 2a(a + b)\big) + a^2}{(n + a + b)^2}.$$

If we choose $\hat{a} = \hat{b} = \frac{\sqrt{n}}{2}$, then the risk becomes

$$R(\vartheta, g_{\hat{a},\hat{b}}) = \frac{1}{4(\sqrt{n} + 1)^2}.$$

This risk doesn't depend on $\vartheta$, hence the estimator $g_{\hat{a},\hat{b}}(x) = \frac{x + \sqrt{n}/2}{n + \sqrt{n}}$ is minimax and $B(\hat{a}, \hat{b})$ is a least favorable prior.
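The constancy of the risk is easy to verify numerically; the sketch below (illustrative, with the function name `risk_beta_bayes` being my own) evaluates the risk formula above on a grid of $\vartheta$ and compares it with the UMVU risk $\vartheta(1-\vartheta)/n$.

```python
import numpy as np

n = 10
a = b = np.sqrt(n) / 2   # the least favorable choice derived above

def risk_beta_bayes(theta):
    """Risk of g_{a,b}(X) = (X + a) / (n + a + b) under quadratic loss, X ~ Bin(n, theta)."""
    num = theta**2 * ((a + b) ** 2 - n) + theta * (n - 2 * a * (a + b)) + a**2
    return num / (n + a + b) ** 2

theta = np.linspace(0.01, 0.99, 5)
print(risk_beta_bayes(theta))               # constant in theta
print(1 / (4 * (np.sqrt(n) + 1) ** 2))      # ≈ 0.0144, the same value
print(theta * (1 - theta) / n)              # UMVU risk: larger near theta = 1/2, smaller near 0 and 1
```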

[Interactive figure: prior, sample and posterior panels showing the UMVU, Bayes and minimax estimates, with controls for $n$, $a$ and $b$.]

Fig. 2. Bayesian inference for the binomial distribution. Note that when the least favorable prior is chosen, the Bayes and minimax estimators coincide regardless of the sample value.

Least favorable sequence of priors

Let

$$r_\pi = \inf_{g \in K} R(\pi, g), \quad \pi \in M.$$

Then a sequence $(\pi_m)_{m \in \mathbb{N}}$ in $M$ is called a least favorable sequence of priors if

  • $\lim_{m \rightarrow \infty} r_{\pi_m} = r$,
  • $r_\pi \leq r \quad \forall \pi \in M$.

Let $(\pi_m)$ be a sequence in $M$ such that $r_{\pi_m} \rightarrow r \in \mathbb{R}$. Also let there be an estimator $g \in K$ such that

$$\sup_{\vartheta \in \Theta} R(\vartheta, g) = r.$$

Then for any estimator $g' \in K$

$$\sup_{\vartheta \in \Theta} R(\vartheta, g') \geq \int_\Theta R(\vartheta, g')\, \pi_m(d\vartheta) \geq r_{\pi_m} \rightarrow r = \sup_{\vartheta \in \Theta} R(\vartheta, g)$$

and therefore $g$ is minimax. Also, for any $\pi \in M$

$$r_\pi \leq R(\pi, g) = \int_\Theta R(\vartheta, g)\, \pi(d\vartheta) \leq \sup_{\vartheta \in \Theta} R(\vartheta, g) = r,$$

hence $(\pi_m)$ is a least favorable sequence of priors.

Let's get back to our previous example of estimating the mean of a normal distribution with known $\sigma^2$. Say we have the prior density

$$h_m(\mu) = \frac{1}{\sqrt{2\pi m}} \exp\Big(-\frac{(\mu - \nu)^2}{2m}\Big)$$

with $m \in \mathbb{N}$. Recall that the Bayes estimator is

$$g_{\nu,m}(x) = \Big(1 + \frac{\sigma^2}{nm}\Big)^{-1} \overline{x}_n + \Big(\frac{nm}{\sigma^2} + 1\Big)^{-1} \nu.$$

For any $\mu \in \mathbb{R}$

$$\begin{aligned}
R(\mu, g_{\nu,m}) &= \mathbb{E}[(g_{\nu,m}(X) - \mu)^2] \\
&= \mathbb{E}\Big[\Big(\Big(1 + \frac{\sigma^2}{nm}\Big)^{-1} (\overline{X}_n - \mu) + \Big(\frac{nm}{\sigma^2} + 1\Big)^{-1} (\nu - \mu)\Big)^2\Big] \\
&= \Big(1 + \frac{\sigma^2}{nm}\Big)^{-2} \frac{\sigma^2}{n} + \Big(1 + \frac{nm}{\sigma^2}\Big)^{-2} (\nu - \mu)^2 \xrightarrow{m \rightarrow \infty} \frac{\sigma^2}{n}.
\end{aligned}$$

Since the risk is bounded from above:

$$R(\mu, g_{\nu,m}) \leq \frac{\sigma^2}{n} + (\mu - \nu)^2,$$

by the Lebesgue dominated convergence theorem¹ we have

$$r_{\pi_m} = R(\pi_m, g_{\nu,m}) = \int_{\mathbb{R}} R(\mu, g_{\nu,m})\, \pi_m(d\mu) \xrightarrow{m \rightarrow \infty} \frac{\sigma^2}{n}.$$

Since for the estimator $g(x) = \overline{x}_n$ the equality

$$R(\mu, g) = \mathbb{E}[(\overline{X}_n - \mu)^2] = \frac{\sigma^2}{n}$$

holds, $g(x) = \overline{x}_n$ is minimax and $(\pi_m)$ is a least favorable sequence of priors.
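The convergence $r_{\pi_m} \rightarrow \sigma^2/n$ can also be checked directly; the sketch below (a small illustration, with `bayes_risk` as a hypothetical helper) integrates the risk expression above against the prior $\mathcal{N}(\nu, m)$, using the fact that $\int (\nu - \mu)^2\, \pi_m(d\mu) = m$.

```python
import numpy as np

sigma2, n, nu = 1.0, 25, 0.0

def bayes_risk(m):
    """r_{pi_m} = int R(mu, g_{nu,m}) pi_m(d mu) for the prior N(nu, m).

    Plugs the prior's second moment E[(nu - mu)^2] = m into the risk expression above."""
    w = 1.0 / (1.0 + sigma2 / (n * m))                 # weight on the sample mean
    return w**2 * sigma2 / n + (1.0 - w) ** 2 * m

for m in [1, 10, 100, 1_000, 10_000]:
    print(m, bayes_risk(m))                            # increases towards sigma^2 / n

print("sigma^2 / n =", sigma2 / n)                     # 0.04, the minimax risk of the sample mean
```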


  1. Suppose there is a measurable space $X$ with measure $\mu$. Also let $\{f_n\}_{n=1}^{\infty}$ and $f$ be measurable functions on $X$ such that $f_n(x) \rightarrow f(x)$ almost everywhere. If there exists an integrable function $g$ defined on the same space such that

    $$|f_n(x)| \leq g(x) \quad \forall n \in \mathbb{N}$$

    almost everywhere, then $f_n$ and $f$ are integrable and

    $$\lim_{n \rightarrow \infty} \int_X f_n(x)\, \mu(dx) = \int_X f(x)\, \mu(dx).$$

This post is licensed under CC BY 4.0 by the author.