Visual Guide to Statistics. Part IV: Foundations of Testing
This is the fourth and final part of the ‘Visual Guide to Statistics’ series. All the previous parts and other topics related to statistics can be found here. In this post we will test hypotheses about the unknown parameter $\vartheta$. As before, we have a statistical experiment with sample space $\mathcal{X}$ and a family of probability measures $\mathcal{P} = \lbrace P_\vartheta \mid \vartheta \in \Theta \rbrace$.
Introductory example
Let’s discuss a simplified clinical study, in which we want to decide whether a newly invented drug $B$ is better than a well-known drug $A$ or not. Suppose that you know from previous years that $A$ has a chance of healing about $p_a$. The new drug $B$ was tested on $n$ persons and $m$ became healthy. Do we choose $A$ or $B$? In terms of mathematics we test
\[H\colon p_b \leq p_a \quad \text{vs} \quad K\colon p_b > p_a,\]where $p_b$ is the unknown chance of healing with $B$.
Let $\Theta = \Theta_H \cup \Theta_K$ be a partition of $\Theta$.
- $\Theta_H$ is called (null) hypothesis, $\Theta_K$ is called the alternative.
- A randomized test is a measurable map $\varphi: \mathcal{X} \rightarrow [0, 1]$. Here $\varphi(x)$ is the probability of a decision for $\Theta_K$ when $x=X(\omega)$ is observed.
- For a test $\varphi$ we call $\mathcal{K}= \lbrace x \mid \varphi(x)=1 \rbrace$ the critical region and $\mathcal{R}= \lbrace x \mid \varphi(x) \in (0,1) \rbrace$ - the region of randomization. A test $\varphi$ is called non-randomized if $\mathcal{R} = \emptyset$.
In our example we know that the statistic $\overline{X}_n$ (here $\overline{X}_n = m/n$, the observed share of healed patients) is the UMVU estimator for $p_b$. A reasonable decision rule is to decide for $K$ if $\overline{X}_n$ is “large”. For example,
\[\varphi(x) = \left \lbrace \begin{array}{cl} 1, & \overline{X}_n > c, \\ 0, & \overline{X}_n \leq c \end{array} \right.\]with some constant $c$ is a reasonable test. But how “large” must $c$ be?
When deciding for $H$ or $K$ using $\varphi$, two errors can occur:
- Error of the 1st kind: decide for $K$ when $H$ is true.
- Error of the 2nd kind: decide for $H$ when $K$ is true.
Both errors occur with certain probabilities. In our example the probability of a decision for $K$ is
\[P(\varphi(X)=1)=P(\overline{X}_n > c).\]In practice, we can use approximation by normal distribution
\[\begin{aligned} P(\overline{X}_n > c) & = P\bigg(\frac{\sqrt{n}(\overline{X}_n - p_b)}{\sqrt{p_b(1-p_b)}} > \frac{\sqrt{n}(c - p_b)}{\sqrt{p_b(1-p_b)}}\bigg) \\ \color{Salmon}{\text{Central Limit Theorem} \rightarrow} & \approx P\bigg(\mathcal{N}(0,1) > \frac{\sqrt{n}(c - p_b)}{\sqrt{p_b(1-p_b)}}\bigg) \\& = \Phi\bigg(\frac{\sqrt{n}(p_b - c)}{\sqrt{p_b(1-p_b)}}\bigg), \end{aligned}\]where $\Phi$ is the distribution function of $\mathcal{N}(0, 1)$. The probability of error of the 1st kind is bounded from above:
\[\begin{aligned} P(\text{reject } H \mid H \text{ is true}) &= P(\overline{X}_n > c \mid p_b \leq p_a) \\ &\leq P(\overline{X}_n > c \mid p_b = p_a) \\ & =\Phi\bigg(\frac{\sqrt{n}(p_a - c)}{\sqrt{p_a(1-p_a)}}\bigg). \end{aligned}\]By symmetry,
\[P(\text{accept } H \mid K \text{ is true}) \leq 1 - \Phi\bigg(\frac{\sqrt{n}(p_a - c)}{\sqrt{p_a(1-p_a)}}\bigg).\]
Fig. 1. Visualization of basic test experiment. Parameters $p_a$ and $c$ are draggable.
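To get a feeling for how good the normal approximation in the introductory example is, the bound on the error of the 1st kind can be compared with the exact binomial tail. The following is a minimal sketch; the function names and the values $n = 100$, $p_a = 0.6$, $c = 0.7$ are illustrative assumptions and do not come from the original study.

```python
from math import comb, sqrt
from statistics import NormalDist


def type1_bound_normal(n, p_a, c):
    """CLT approximation of P(mean > c) at the boundary p_b = p_a."""
    z = sqrt(n) * (p_a - c) / sqrt(p_a * (1 - p_a))
    return NormalDist().cdf(z)


def type1_bound_exact(n, p_a, c):
    """Exact binomial value of P(mean > c) at the boundary p_b = p_a."""
    return sum(
        comb(n, k) * p_a**k * (1 - p_a) ** (n - k)
        for k in range(n + 1)
        if k / n > c
    )


# Illustrative (assumed) values: 100 patients, known healing chance 0.6.
n, p_a, c = 100, 0.6, 0.7
print("normal approximation:", type1_bound_normal(n, p_a, c))
print("exact binomial tail: ", type1_bound_exact(n, p_a, c))
```

Both numbers bound the probability of wrongly deciding for $K$; increasing $c$ makes them smaller at the cost of a larger error of the 2nd kind.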
Power of a test
Ideally we want to minimize both errors simultaneously and pick the optimal test. The problem is that the tests $\varphi_0(x) \equiv 0$ and $\varphi_1(x) \equiv 1$ are optimal if one needs to minimize only one of the errors, but they don’t minimize both errors at the same time. In practice, an upper bound $\alpha$ is fixed for the probability of an error of the 1st kind, and the probability of an error of the 2nd kind is then minimized. Typically, $0.01 \leq \alpha \leq 0.1$ (the set associated with the more severe consequences is chosen as the alternative).
Now suppose $\varphi$ is a test for $H \colon \vartheta \in \Theta_H$ vs $K \colon \vartheta \in \Theta_K$. Let’s define function
\[\beta_\varphi(\vartheta) = 1 - \mathbb{E}_\vartheta[\varphi(X)].\]Note that for non-randomized test $\varphi$ we have
\[\beta_\varphi(\vartheta) = P_\vartheta(\varphi(X) = 0),\]which is the probability to decide for $H$. In particular,
- $\vartheta \in \Theta_H$: $1 - \beta_\varphi(\vartheta)$ is the probability of an error of the 1st kind,
- $\vartheta \in \Theta_K$: $\beta_\varphi(\vartheta)$ is the probability of an error of the 2nd kind.
The function $1 - \beta_\varphi(\vartheta)$ for $\vartheta \in \Theta_K$, which is the probability of correctly rejecting hypothesis $H$, when alternative $K$ is true, is called power of a test $\varphi$. The same intuition holds for randomized tests. Test $\varphi$ is called a test with significance level $\alpha \in [0, 1]$ if
\[1 - \beta_\varphi(\vartheta) \leq \alpha \quad \forall \vartheta \in \Theta_H.\]A test with significance level $\alpha$ has a probability of an error of the 1st kind, which is bounded by $\alpha$. We will denote set of all tests with significance level $\alpha$ as $\Phi_\alpha$. Test $\varphi$ is also called unbiased with significance level $\alpha$ if $\varphi \in \Phi_\alpha$ and
\[1-\beta_\varphi(\vartheta) \geq \alpha \quad \forall \vartheta \in \Theta_K.\]For an unbiased test with significance level $\alpha$ the probability of deciding for $K$ for every $\vartheta \in \Theta_K$ is not smaller than for $\vartheta \in \Theta_H$. The set of all unbiased tests with level $\alpha$ we will call $\Phi_{\alpha \alpha}$.
Test $\tilde{\varphi} \in \Phi_\alpha$ is called uniformly most powerful (UMP) test with significance level $\alpha$ if
\[\beta_{\tilde{\varphi}}(\vartheta) = \inf_{\varphi \in \Phi_\alpha} \beta_\varphi(\vartheta) \quad \forall \vartheta \in \Theta_K.\]Test $\tilde{\varphi} \in \Phi_{\alpha\alpha}$ is called uniformly most powerful unbiased (UMPU) test with significance level $\alpha$ if
\[\beta_{\tilde{\varphi}}(\vartheta) = \inf_{\varphi \in \Phi_{\alpha\alpha}} \beta_\varphi(\vartheta) \quad \forall \vartheta \in \Theta_K.\]
Neyman-Pearson lemma
Let’s start with simple hypotheses:
\[H\colon \vartheta \in \lbrace \vartheta_0 \rbrace \ \ \text{vs} \ \ K\colon \vartheta \in \lbrace \vartheta_1 \rbrace , \quad \vartheta_0 \neq \vartheta_1.\]Let $p_i = \frac{dP_{\vartheta_i}}{dx}$, $i = 0, 1$, be the corresponding densities. A UMP test with level $\alpha$ maximizes
\[1-\beta_\varphi(\vartheta_1) = \mathbb{E}_{\vartheta_1}[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)p_1(x)dx\]under the constraint
\[1-\beta_\varphi(\vartheta_0) = \mathbb{E}_{\vartheta_0}[\varphi(X)] = \int_{\mathcal{X}} \varphi(x)p_0(x)dx \leq \alpha.\]In the situation of simple hypotheses a test $\varphi$ is called a Neyman-Pearson test (NP test) if $c\in[0, \infty)$ exists such that
\[\varphi(x) = \left \lbrace \begin{array}{cl} 1, & p_1(x) > cp_0(x), \\ 0, & p_1(x) < cp_0(x). \end{array} \right.\]Let $\tilde{\varphi}$ be an NP-test with constant $\tilde{c}$ and let $\varphi$ be some other test with
\[\beta_\varphi(\vartheta_0) \geq \beta_{\tilde{\varphi}}(\vartheta_0).\]Then we have
\[\begin{aligned} \beta_\varphi(\vartheta_1) - \beta_{\tilde{\varphi}}(\vartheta_1) &= (1 - \beta_{\tilde{\varphi}}(\vartheta_1) ) - (1 - \beta_\varphi(\vartheta_1) ) \\&=\int (\tilde{\varphi} - \varphi) p_1 dx \\&= \int (\tilde{\varphi} - \varphi)(p_1 - \tilde{c}p_0)dx + \int \tilde{c} p_0 (\tilde{\varphi} - \varphi) dx. \end{aligned}\]For the first integral note that
\[\begin{aligned}\tilde{\varphi} - \varphi > 0 \Longrightarrow \tilde{\varphi} > 0 \Longrightarrow p_1 \geq \tilde{c}p_0, \\ \tilde{\varphi} - \varphi < 0 \Longrightarrow \tilde{\varphi} < 1 \Longrightarrow p_1 \leq \tilde{c}p_0. \end{aligned}\]Hence, $(\tilde{\varphi} - \varphi)(p_1 - \tilde{c}p_0) \geq 0$ always. The second integral is
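\[\tilde{c}\big(\mathbb{E}_{\vartheta_0}[\tilde{\varphi}(X)] - \mathbb{E}_{\vartheta_0}[\varphi(X)]\big) = \tilde{c}(\beta_\varphi(\vartheta_0) - \beta_{\tilde{\varphi}}(\vartheta_0)) \geq 0.\]Therefore we have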
\[\beta_\varphi(\vartheta_1) \geq \beta_{\tilde{\varphi}}(\vartheta_1)\]and the NP-test $\tilde{\varphi}$ is a UMP test with level $\alpha = \mathbb{E}_{\vartheta_0}[\tilde{\varphi}(X)]$. This statement is called the NP lemma.
There are also other parts of this lemma which I will state here without proof:
- For any $\alpha \in [0, 1]$ there is an NP-test $\varphi$ with $\mathbb{E}_{\vartheta_0}[\varphi(X)] = \alpha$.
- If $\varphi'$ is UMP with level $\alpha$, then $\varphi'$ is (a.s.) an NP-test.
Also, an NP-test $\tilde{\varphi}$ for $H \colon \vartheta = \vartheta_0$ vs $K \colon \vartheta = \vartheta_1$ is uniquely defined outside of
\[S_= =\lbrace x\ |\ p_1(x) = \tilde{c}p_0(x) \rbrace.\]On the set $S_=$ the test can be chosen such that $1 - \beta_{\tilde{\varphi}}(\vartheta_0) = \alpha$.
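For a finite sample space the construction of an NP test can be sketched directly: sort the outcomes by likelihood ratio, reject on the outcomes with the largest ratios and randomize on the boundary outcome so that the level is exhausted exactly. The function `np_test`, the helper `expectation` and the two toy distributions below are hypothetical illustrations, and the sketch assumes $p_0(x) > 0$ for every outcome.

```python
def np_test(p0, p1, alpha):
    """Rejection probabilities phi[x] of an NP test with E_0[phi] = alpha.

    p0, p1: lists of probabilities on the same finite sample space.
    Assumes p0[x] > 0 for all x, so the likelihood ratio is finite.
    """
    # Outcomes ordered by decreasing likelihood ratio p1 / p0.
    order = sorted(range(len(p0)), key=lambda x: p1[x] / p0[x], reverse=True)
    phi = [0.0] * len(p0)
    budget = alpha  # level that is still available under p0
    for x in order:
        if p0[x] <= budget:
            phi[x] = 1.0  # inside the critical region
            budget -= p0[x]
        else:
            phi[x] = budget / p0[x]  # randomize on the boundary outcome
            break
    return phi


def expectation(phi, p):
    """E_p[phi(X)] for a test phi on a finite sample space."""
    return sum(f * q for f, q in zip(phi, p))


# Toy (assumed) distributions on four outcomes.
p0 = [0.25, 0.25, 0.25, 0.25]
p1 = [0.1, 0.2, 0.3, 0.4]
phi = np_test(p0, p1, alpha=0.375)
print("test: ", phi)
print("level:", expectation(phi, p0))
print("power:", expectation(phi, p1))
```

When several outcomes share the boundary ratio, the sketch fills them in an arbitrary order; this is allowed because the lemma leaves the test free on $S_=$.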
It must also be noted that every NP-test $\tilde{\varphi}$ with $\beta_{\tilde{\varphi}}(\vartheta_0) \in (0, 1)$ is unbiased. In particular
\[\alpha := 1 - \beta_{\tilde{\varphi}}(\vartheta_0) < 1 - \beta_{\tilde{\varphi}}(\vartheta_1).\]
Proof
Take the test $\varphi \equiv \alpha$. It has significance level $\alpha$ and, since $\tilde{\varphi}$ is UMP, we have $$1-\beta_\varphi(\vartheta_1) \leq 1-\beta_{\tilde{\varphi}}(\vartheta_1).$$ If $\alpha = 1-\beta_{\tilde{\varphi}}(\vartheta_1) < 1$, then $\varphi \equiv \alpha$ is UMP as well. Since every UMP test is an NP test, we know that $p_1(x) = \tilde{c}p_0(x)$ for almost all $x$. Therefore, $\tilde{c}=1$ and $p_1 = p_0$ a.s. and also $P_{\vartheta_0} = P_{\vartheta_1}$, which is a contradiction.
Confidence interval
Let $X_1, \dots X_n$ i.i.d. $\sim \mathcal{N}(\mu,\sigma^2)$ with $\sigma^2$ known. We test
\[H \colon \mu = \mu_0 \quad \text{vs} \quad K \colon \mu = \mu_1\]with $\mu_0 < \mu_1$. For the density of $X_1, \dots X_n$ it holds
\[p_j(x) = (2 \pi \sigma^2)^{-n/2} \exp \Big( -\frac{1}{2\sigma^2} \Big( \sum_{i=1}^{n} x_i^2 - 2 \mu_j \sum_{i=1}^{n}x_i + n\mu_j^2 \Big)\Big), \quad j = 0, 1.\]For the likelihood-ratio inequality needed to construct the NP test we get
\[\frac{p_1(x)}{p_0(x)} = \exp \Big( \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i(\mu_1 - \mu_0) \Big) \cdot f(\sigma^2, \mu_1, \mu_0) > \tilde{c},\]where the known constant $f(\sigma^2, \mu_1, \mu_0)$ is positive. This inequality is equivalent to
\[\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n}X_i > c,\]for some appropriate $c$ (because of $\mu_1 > \mu_0$). Therefore it is equally well possible to determine $c$ such that
\[P_{\mu_0}(\overline{X}_n > c) = \alpha\]or equivalently
\[\begin{aligned} P_{\mu_0}\Big( &\underbrace{\frac{\sqrt{n}(\overline{X}_n - \mu_0)}{\sigma}} > \frac{\sqrt{n}(c-\mu_0)}{\sigma}\Big) = 1 - \Phi\Big(\frac{\sqrt{n}(c - \mu_0)}{\sigma}\Big) = \alpha. \\ &\quad \color{Salmon}{\sim \mathcal{N}(0, 1)} \end{aligned}\]If we call $u_p$ the p-quantile of $\mathcal{N}(0, 1)$, which is the value such that $\Phi(u_p)=p$, then we get
\[\frac{\sqrt{n}(c - \mu_0)}{\sigma} = u_{1-\alpha} \quad \Longleftrightarrow \quad c = \mu_0 + u_{1-\alpha}\frac{\sigma}{\sqrt{n}}.\]The NP-test becomes
\[\tilde{\varphi}(X) = 1_{\lbrace\overline{X}_n > \mu_0 + u_{1-\alpha} \frac{\sigma}{\sqrt{n}} \rbrace }.\]
Fig. 2. Visualization of simple hypothesis testing with $\mu_0 = -1$ and $\mu_1=1$. Significance level $\alpha$ on the right-hand side is draggable.
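The critical value and the power of this NP test are easy to evaluate numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names are hypothetical, and the values $\sigma = 1$, $n = 4$ are assumptions chosen only to go with $\mu_0 = -1$, $\mu_1 = 1$ from Fig. 2.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def gauss_critical_value(mu0, sigma, n, alpha):
    """c = mu0 + u_{1-alpha} * sigma / sqrt(n)."""
    return mu0 + STD_NORMAL.inv_cdf(1 - alpha) * sigma / sqrt(n)


def gauss_power(mu, mu0, sigma, n, alpha):
    """P_mu(mean > c): probability of deciding for K when the true mean is mu."""
    c = gauss_critical_value(mu0, sigma, n, alpha)
    return 1 - STD_NORMAL.cdf(sqrt(n) * (c - mu) / sigma)


# Assumed setting: mu0 = -1, mu1 = 1 as in Fig. 2, sigma = 1, n = 4.
mu0, mu1, sigma, n, alpha = -1.0, 1.0, 1.0, 4, 0.05
print("critical value c:  ", gauss_critical_value(mu0, sigma, n, alpha))
print("level at mu0:      ", gauss_power(mu0, mu0, sigma, n, alpha))
print("power at mu1:      ", gauss_power(mu1, mu0, sigma, n, alpha))
```

Evaluating `gauss_power` at $\mu_0$ returns $\alpha$ up to rounding, which is exactly the condition used to determine $c$.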
Simple hypotheses like that are not relevant in practice, but:
- They explain intuitively how to construct a test. One needs a so-called confidence interval $c(X) \subset \Theta$ in which the unknown parameter lies with probability $1-\alpha$. In the example above we used that for
\[c(X) = \Big[\overline{X}_n - \frac{\sigma}{\sqrt{n}} u_{1-\alpha}, \infty\Big)\]we have
\[P_{\mu_0}(\mu_0 \in c(X)) = P_{\mu_0}(\overline{X}_n \leq \mu_0 + \frac{\sigma}{\sqrt{n}} u_{1-\alpha}) = 1-\alpha.\]Any such $c(X)$ can be used to construct a test, for example,
\[c'(X) =\Big[\overline{X}_n -u_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}, \overline{X}_n + u_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \Big].\]In addition, simple hypotheses tell you on which side the alternative lies.
- Formal results like the NP lemma are useful to derive more relevant results.
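The link between confidence intervals and tests mentioned in the first bullet can be sketched in a few lines: compute the two-sided interval $c'(X)$ and decide for the alternative whenever $\mu_0$ falls outside of it. The function names and the numbers in the usage lines are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_interval(xbar, sigma, n, alpha):
    """Interval c'(X) = [xbar - u * sigma / sqrt(n), xbar + u * sigma / sqrt(n)]
    with u = u_{1 - alpha/2}."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def reject(mu0, xbar, sigma, n, alpha):
    """Decide against mu = mu0 if mu0 lies outside the confidence interval."""
    lower, upper = two_sided_interval(xbar, sigma, n, alpha)
    return not (lower <= mu0 <= upper)


# Assumed data summary: sample mean 0.5 from n = 25 observations, sigma = 1.
print(two_sided_interval(0.5, 1.0, 25, 0.05))
print(reject(0.0, 0.5, 1.0, 25, 0.05))
print(reject(0.3, 0.5, 1.0, 25, 0.05))
```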
Monotone likelihood ratio
Let $\Theta = \mathbb{R}$, $\mathcal{P} = \lbrace P_\vartheta \mid \vartheta \in \Theta \rbrace$ and $T\colon \mathcal{X} \rightarrow \mathbb{R}$ be some statistic. The family $\mathcal{P}$ is called a class with monotone (isotonic) likelihood ratio if for every $\vartheta_0 < \vartheta_1$ there exists a monotonically increasing function $H_{\vartheta_0, \vartheta_1} \colon \mathbb{R} \rightarrow [0, \infty)$, such that
\[\frac{p_{\vartheta_1}(x)}{p_{\vartheta_0}(x)} =H_{\vartheta_0, \vartheta_1}(T(x)) \quad P_{\vartheta_0} + P_{\vartheta_1}\text{-a.s.}\]In our example above we had
\[\frac{p_{\mu_1}(x)}{p_{\mu_0}(x)} = \exp \Big( \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i(\mu_1 - \mu_0) \Big) \cdot f(\sigma^2, \mu_1, \mu_0),\]which is monotonically increasing in $\overline{x}_n$. This can be generalized to one-parametric exponential families.
Let $\mathcal{P} = \lbrace P_\vartheta \mid \vartheta \in \Theta \rbrace$ be a class with monotone likelihood ratio in $T$, $\vartheta_0 \in \Theta$, $\alpha \in (0, 1)$, and suppose we test the one-sided hypothesis
\[H\colon\vartheta \leq \vartheta_0 \quad \text{vs} \quad K\colon\vartheta > \vartheta_0.\]Let also
\[\tilde{\varphi}(x) = 1_{\lbrace T(x) > c\rbrace} + \gamma 1_{\lbrace T(x) = c\rbrace},\]where $c = \inf \lbrace t \mid P_{\vartheta_0}(T(X) > t) \leq \alpha \rbrace$ and
\[\gamma = \left \lbrace \begin{array}{cl} \frac{\alpha - P_{\vartheta_0}(T(X) > c) }{ P_{\vartheta_0}(T(X) = c) }, & \text{if } P_{\vartheta_0}(T(X) = c) \neq 0 \\ 0, & \text{otherwise}. \end{array} \right.\]Then $1-\beta_{\tilde{\varphi}}(\vartheta_0) = \alpha$ and $\tilde{\varphi}$ is UMP test with significance level $\alpha$.
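The role of the randomization constant $\gamma$ is easiest to see for a discrete statistic. As a sketch, take $T \sim \operatorname{Bin}(n, \vartheta)$, which has a monotone likelihood ratio in $T$; the binomial model, the function names and the values $n = 10$, $\vartheta_0 = 0.5$ are assumptions chosen for illustration.

```python
from math import comb


def binom_pmf(k, n, theta):
    """P_theta(T = k) for T ~ Bin(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def ump_one_sided(n, theta0, alpha):
    """Constants (c, gamma) of the test 1{T > c} + gamma * 1{T = c}."""
    assert 0 < alpha < 1
    c, tail = n, 0.0  # tail = P_theta0(T > c)
    # Lower c as long as P_theta0(T > c - 1) still does not exceed alpha.
    while c > 0 and tail + binom_pmf(c, n, theta0) <= alpha:
        tail += binom_pmf(c, n, theta0)
        c -= 1
    gamma = (alpha - tail) / binom_pmf(c, n, theta0)
    return c, gamma


def rejection_probability(theta, n, c, gamma):
    """E_theta[phi(T)] = P_theta(T > c) + gamma * P_theta(T = c)."""
    tail = sum(binom_pmf(k, n, theta) for k in range(c + 1, n + 1))
    return tail + gamma * binom_pmf(c, n, theta)


n, theta0, alpha = 10, 0.5, 0.05
c, gamma = ump_one_sided(n, theta0, alpha)
print("c =", c, "gamma =", gamma)
for theta in (0.3, 0.5, 0.7):
    print(theta, rejection_probability(theta, n, c, gamma))
```

Without randomization the level $\alpha = 0.05$ could not be attained exactly here, because the tail probabilities of the binomial distribution jump over it.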
Proof
We have $$1-\beta_{\tilde{\varphi}}(\vartheta_0)=P_{\vartheta_0}(T(X)>c) + \gamma P_{\vartheta_0}(T(X) = c) = \alpha. $$ Let $\vartheta_0 < \vartheta_1$, then due to monotonicity $$H_{\vartheta_0, \vartheta_1}(T(x)) > H_{\vartheta_0, \vartheta_1}(c) = s \quad \Longrightarrow \quad T(x) > c $$ and $$\tilde{\varphi}(x) = \left \{ \begin{array}{cl} 1, & H_{\vartheta_0, \vartheta_1}(T(x)) > s, \\ 0, & H_{\vartheta_0, \vartheta_1}(T(x)) < s. \end{array} \right.$$ Therefore $\tilde{\varphi}$ is an NP-test with significance level $\alpha$ and by the NP lemma $$ \beta_{\tilde{\varphi}}(\vartheta_1) = \inf \lbrace \beta_\varphi(\vartheta_1)\ |\ \beta_\varphi(\vartheta_0) = 1-\alpha \rbrace. $$ As $\tilde{\varphi}$ doesn't depend on $\vartheta_1$, this relation holds for all $\vartheta_1 > \vartheta_0$. Finally, let $\varphi'(x) = 1 - \tilde{\varphi}(x)$, so that $\beta_{\varphi'}(\vartheta_0) = \alpha$. Using similar reasoning as above (with the roles of small and large values of $T$ exchanged) one can show that $$\beta_{\varphi'}(\vartheta_2) = \inf \lbrace \beta_\varphi(\vartheta_2)\ |\ \beta_\varphi(\vartheta_0) = \alpha \rbrace \quad \forall \vartheta_2 < \vartheta_0. $$ For the trivial test $\overline{\varphi} \equiv \alpha$ we have $\beta_{\overline{\varphi}}(\vartheta_0) = 1-\alpha$ and hence $\beta_{1-\overline{\varphi}}(\vartheta_0) = \alpha$. We conclude that $$1-\beta_{\tilde{\varphi}}(\vartheta_2) = \beta_{\varphi'}(\vartheta_2) \leq \beta_{1-\overline{\varphi}}(\vartheta_2) = 1-\beta_{\overline{\varphi}}(\vartheta_2) = \alpha. $$ Hence, $1-\beta_{\tilde{\varphi}}(\vartheta_2) \leq \alpha$ for all $\vartheta_2 < \vartheta_0$, so $\tilde{\varphi} \in \Phi_\alpha$ and $\tilde{\varphi}$ is a UMP test. Also for any $\vartheta < \vartheta_0$ we have $$\beta_{\tilde{\varphi}}(\vartheta) = \sup \lbrace \beta_\varphi(\vartheta)\ |\ 1 - \beta_\varphi(\vartheta_0) = \alpha \rbrace,$$ because of $\beta_{\varphi'} = 1 - \beta_{\tilde{\varphi}}$.
Back to our previous example with $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{N}(\mu, \sigma^2)$ and known $\sigma^2$: we know that
\[p_\mu(x) = (2 \pi \sigma^2)^{-\frac{n}{2}} \exp \Big( -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \Big)\]has a monotone likelihood ratio in $T(X) = \overline{X}_n$. A UMP test with level $\alpha$ is given by
\[\tilde{\varphi}(x) = 1_{\lbrace\overline{x}_n > c\rbrace } + \gamma 1_{\lbrace\overline{x}_n = c\rbrace}.\]Since $P_{\mu_0}(T(X) = c) = 0$, we have $\gamma = 0$ and we choose $c$ such that
\[P_{\mu_0}(\overline{X}_n > c) = \alpha \Longleftrightarrow c = \mu_0 + \frac{\sigma}{\sqrt{n}} u_{1-\alpha}.\]This UMP test
\[\tilde{\varphi}(x) = 1_{\lbrace \overline{X}_n > \mu_0 + \frac{\sigma}{\sqrt{n}}u_{1-\alpha} \rbrace }\]is called the one-sided Gauss test.
There is a heuristic way to arrive at the one-sided Gauss test: since $\overline{X}_n$ is UMVU for $\mu$, a reasonable strategy is to decide for $K$ if $\overline{X}_n$ is “large enough”, so the test should be of the form
\[\varphi(x) = 1_{\lbrace \overline{X}_n > c \rbrace }.\]Choosing $c$ happens by controlling the error of the 1st kind. For all $\mu \leq \mu_0$ we have
\[\begin{aligned} 1 - \beta_\varphi(\mu) &= P_\mu(\overline{X}_n > c) \\ &= P_\mu \Big( \frac{\sqrt{n}(\overline{X}_n - \mu) }{\sigma} > \frac{\sqrt{n}(c-\mu)}{\sigma}\Big) \\ &= 1 - \Phi\Big(\frac{\sqrt{n}(c-\mu)}{\sigma}\Big) \\&\leq 1 - \Phi\Big(\frac{\sqrt{n}(c-\mu_0)}{\sigma}\Big). \end{aligned}\]So we need to ensure that
\[1- \Phi\Big(\frac{\sqrt{n}(c-\mu_0)}{\sigma}\Big) \leq \alpha \Longleftrightarrow c \geq \mu_0 + \frac{\sigma}{\sqrt{n}} u_{1-\alpha}.\]We take $c = \mu_0 + \frac{\sigma}{\sqrt{n}} u_{1-\alpha}$ for an error of the 1st kind to be $\alpha$.
This method doesn’t tell you anything about optimality, but at least provides a test. Most importantly, it can be applied in more general situations like unknown $\sigma^2$. In this case one can use
\[\hat{\sigma}_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2.\]As above we obtain
\[1 - \beta_\varphi(\mu) = P_\mu\Big( \frac{\sqrt{n}(\overline{X}_n - \mu) }{\hat{\sigma}_n} > \frac{\sqrt{n}(c-\mu)}{\hat{\sigma}_n}\Big) = 1 - F_{t_{n-1}}\bigg( \frac{c - \mu}{\sqrt{\hat{\sigma}_n^2 / n}} \bigg),\]where $F_{t_{n-1}}$ denotes the distribution function of $t_{n-1}$. A reasonable choice is
\[c = \mu_0 + \frac{\hat{\sigma}_n}{\sqrt{n}}t_{n-1,1-\alpha},\]with the corresponding quantile of a $t_{n-1}$ distribution. The test
\[\varphi(x) = 1_{\lbrace \overline{x}_n > \mu_0 + \frac{\hat{\sigma}_n}{\sqrt{n}}t_{n-1,1-\alpha} \rbrace}\]is called the one-sided t-test.
Two-sided tests
There are in general no UMP tests for
\[H\colon\vartheta = \vartheta_0 \quad \text{vs} \quad K\colon\vartheta \neq \vartheta_0,\]because these have to be optimal for all
\[H'\colon\vartheta = \vartheta_0 \quad \text{vs} \quad K'\colon\vartheta = \vartheta_1\]with $\vartheta_0 \neq \vartheta_1$. In the case of a monotone likelihood ratio, the optimal test for $H'$ vs $K'$ is
\[\varphi(x) = 1_{\lbrace T(x) > c \rbrace} + \gamma(x) 1_{\lbrace T(x) = c\rbrace}\]for $\vartheta_1 > \vartheta_0$ and
\[\varphi'(x) = 1_{\lbrace T(x) < c'\rbrace } + \gamma'(x) 1_{\lbrace T(x) = c'\rbrace}\]for $\vartheta_1 < \vartheta_0$. A single test cannot coincide with both of them, so this is not possible.
There is a theorem for one-parametric exponential family with density
\[p_\vartheta(x) = c(\vartheta)h(x)\exp(Q(\vartheta) T(x))\]with increasing $Q$: UMPU test for
\[H \colon \vartheta \in [\vartheta_1, \vartheta_2] \quad \text{vs} \quad K\colon\vartheta \notin [\vartheta_1, \vartheta_2]\]is
\[\varphi(x) = \left \lbrace \begin{array}{cl} 1, & \text{if } T(x) \notin [c_1, c_2], \\ \gamma_i, & \text{if } T(x) = c_i, \\ 0, & \text{if } T(x) \in (c_1, c_2), \end{array} \right.\]where the constants $c_i, \gamma_i$ are determined from
\[\beta_\varphi(\vartheta_1) = \beta_\varphi(\vartheta_2) = 1-\alpha.\]Similar results hold for $k$-parametric exponential families.
Take an example: let $X$ be an exponentially distributed random variable, $X \sim \operatorname{Exp}(\vartheta)$, with density
\[f_\vartheta(x) = \vartheta e^{-\vartheta x} 1_{[0, \infty)}(x)\]and we test
\[H\colon \vartheta \in [1, 2] \quad \text{vs} \quad K\colon\vartheta \notin [1, 2].\]We have $T(x) = x$ and
\[\varphi(x) = 1_{\lbrace x \notin [c_1, c_2] \rbrace}.\]The distribution function of $X$ is $F(x) = 1 - e^{-\vartheta x}$ for $x \geq 0$, therefore
\[\begin{aligned} P_{1}(X \in [c_1, c_2])&=e^{-c_1}-e^{-c_2} = 1-\alpha, \\ P_{2}(X \in [c_1, c_2])&=e^{-2 c_1}-e^{-2 c_2} = 1-\alpha. \end{aligned}\]Solving this for $c_1$ and $c_2$ we get
\[c_1 = \ln\frac{2}{2-\alpha}, \quad c_2 = \ln\frac{2}{\alpha}.\]
Asymptotic properties of tests
Let $X_1, \dots , X_m$ i.i.d. $\sim \mathcal{N}(\mu_1, \sigma^2)$ and $Y_1, \dots , Y_n$ i.i.d. $\sim \mathcal{N}(\mu_2, \tau^2)$ be two independent samples. We want to test the hypothesis:
\[H\colon \mu_1 \leq \mu_2 \quad \text{vs} \quad K\colon \mu_1 > \mu_2.\]We reject $H$ if $\overline{Y}_n$ is much smaller than $\overline{X}_m$.
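- Let $\sigma^2=\tau^2$, but the variance is unknown. We know from Part I that
\[\overline{X}_m - \overline{Y}_n \sim \mathcal{N}\Big(\mu_1 - \mu_2, \sigma^2\Big(\frac{1}{m} + \frac{1}{n}\Big)\Big)\]and that the pooled variance satisfies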
\[\hat{\sigma}_{m,n}^2=\frac{1}{m+n-2}\Big( \sum_{i=1}^{m}(X_i-\overline{X}_m)^2+\sum_{i=1}^{n}(Y_i-\overline{Y}_n)^2 \Big) \sim \frac{\sigma^2}{m+n-2} \chi_{m+n-2}^2.\]For $\mu_1=\mu_2$ we have
\[T_{m,n}=\sqrt{\frac{mn}{m+n}}\frac{\overline{X}_m-\overline{Y}_n}{\hat{\sigma}_{m,n}} \sim t_{m+n-2},\]therefore test
\[\varphi_{m,n}(x)=1_{\lbrace T_{m,n} > t_{m+n-2, 1-\alpha}\rbrace }\]is UMPU with significance level $\alpha$. This test is called two-sample t-test.
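- Let $\sigma^2 \neq \tau^2$, then
\[\overline{X}_m - \overline{Y}_n \sim \mathcal{N}\Big(\mu_1 - \mu_2, \frac{\sigma^2}{m} + \frac{\tau^2}{n}\Big).\]Unbiased estimators for variances are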
\[\hat{s}_{m}^2(X)=\frac{1}{m-1}\sum_{i=1}^{m}(X_i-\overline{X}_m)^2, \quad \hat{s}_{n}^2(Y)=\frac{1}{n-1}\sum_{i=1}^{n}(Y_i-\overline{Y}_n)^2.\]Let also
\[\hat{s}_{m, n}^2 = \frac{1}{m}\hat{s}_{m}^2(X) + \frac{1}{n}\hat{s}_{n}^2(Y).\]The distribution of random variable
\[T_{m,n}^*=\frac{\overline{X}_m-\overline{Y}_n}{\hat{s}_{m, n}}\]was unknown until recently (the so-called Behrens-Fisher problem) and can’t be expressed in terms of elementary functions, but from the Central Limit Theorem we know that
\[\frac{\overline{X}_m-\overline{Y}_n - (\mu_1-\mu_2)}{\hat{s}_{m,n}} \xrightarrow[]{\mathcal{L}} \mathcal{N}(0,1),\]if $m \rightarrow \infty$, $n \rightarrow \infty$ and $\frac{m}{n}\rightarrow \lambda \in (0, \infty)$. Let
\[\varphi_{m,n}^*(x)=1_{\lbrace T_{m,n}^* > u_{1-\alpha}\rbrace },\]then
\[\begin{aligned} \beta_{\varphi_{m,n}^*}(\mu_1, \mu_2) & =P_{\mu_1, \mu_2}(T_{m,n}^* \leq u_{1-\alpha})\\&=P_{\mu_1, \mu_2}\Big(\frac{\overline{X}_m-\overline{Y}_n - (\mu_1-\mu_2)}{\hat{s}_{m,n}} \leq \frac{- (\mu_1-\mu_2)}{\hat{s}_{m,n}}+ u_{1-\alpha}\Big) \\ & \xrightarrow[m \rightarrow \infty,\ n \rightarrow \infty,\ \frac{m}{n}\rightarrow \lambda]{} \left \lbrace \begin{array}{cl} 0, & \mu_1 > \mu_2, \\ 1-\alpha, & \mu_1=\mu_2, \\ 1, & \mu_1<\mu_2. \end{array} \right. \end{aligned}\]
- We say that a sequence $(\varphi_n)$ has asymptotic significance level $\alpha$ if
\[\limsup_{n \rightarrow \infty} \big(1 - \beta_{\varphi_n}(\vartheta)\big) \leq \alpha \quad \forall \vartheta \in \Theta_H.\]
- We say that a sequence $(\varphi_n)$ is consistent if
\[\lim_{n \rightarrow \infty} \beta_{\varphi_n}(\vartheta) = 0 \quad \forall \vartheta \in \Theta_K.\]
In our example $\varphi_{m,n}^*(x)$ is consistent and has asymptotic significance level $\alpha$.
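Both properties can be checked empirically with a small Monte Carlo experiment. The sketch below implements $\varphi_{m,n}^*$ and estimates how often it decides for $K$; the function names, the seed and the parameter values ($\sigma = 1$, $\tau = 2$, $m = n = 50$) are assumptions, and the rejection rates are simulation estimates rather than exact values.

```python
import random
from math import sqrt
from statistics import NormalDist


def mean_and_variance(sample):
    """Sample mean and unbiased sample variance (divisor len - 1)."""
    k = len(sample)
    avg = sum(sample) / k
    return avg, sum((v - avg) ** 2 for v in sample) / (k - 1)


def asymptotic_two_sample_test(x, y, alpha):
    """Decide for K: mu1 > mu2 if T* exceeds the normal quantile u_{1-alpha}."""
    mean_x, var_x = mean_and_variance(x)
    mean_y, var_y = mean_and_variance(y)
    t_star = (mean_x - mean_y) / sqrt(var_x / len(x) + var_y / len(y))
    return t_star > NormalDist().inv_cdf(1 - alpha)


def rejection_rate(mu1, mu2, sigma, tau, m, n, alpha, reps, seed=0):
    """Monte Carlo estimate of the probability of deciding for K."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu1, sigma) for _ in range(m)]
        y = [rng.gauss(mu2, tau) for _ in range(n)]
        rejections += asymptotic_two_sample_test(x, y, alpha)
    return rejections / reps


# Assumed setting: unequal standard deviations 1 and 2, m = n = 50.
for mu1 in (-1.0, 0.0, 1.0):
    print(mu1, rejection_rate(mu1, 0.0, 1.0, 2.0, 50, 50, 0.05, reps=500))
```

On the boundary $\mu_1 = \mu_2$ the estimated rate should be close to $\alpha$, while it should approach $1$ for $\mu_1 > \mu_2$ and $0$ for $\mu_1 < \mu_2$ as the sample sizes grow.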
Likelihood ratio
The common principle of building tests for
\[H\colon\vartheta \in \Theta_H \quad \text{vs} \quad K\colon\vartheta \in \Theta_K\]is likelihood ratio method. Let $f_n(x^{(n)},\vartheta)$ be density of $P_\vartheta^n$. Then
\[\lambda(x^{(n)})=\frac{\sup_{\vartheta \in \Theta_H}f_n(x^{(n)},\vartheta)}{\sup_{\vartheta \in \Theta}f_n(x^{(n)},\vartheta)}\]is likelihood ratio and
\[\varphi_n(x^{(n)})=1_{\lbrace \lambda(x^{(n)})<c \rbrace }\]is likelihood ratio test. It is common to choose $c$, such that
\[\sup_{\vartheta \in \Theta_H} P_\vartheta(\lambda(X^{(n)})<c) \leq \alpha.\]In general, however, the distribution of $\lambda(X^{(n)})$ can only be determined asymptotically.
Before we continue, we formulate some conditions. Let $\Theta \subset \mathbb{R}^d$ and suppose there exist $\Delta \subset \mathbb{R}^c$ and a twice continuously differentiable function $h:\Delta \rightarrow \Theta$, such that $\Theta_H = h(\Delta)$ and the Jacobian of $h$ is a matrix of full rank.
For example, let $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{N}(\mu_1, \sigma^2)$ and $Y_1, \dots, Y_n$ i.i.d. $\sim \mathcal{N}(\mu_2, \sigma^2)$ be two independent samples. Suppose we want to test the equality of means:
\[H\colon \mu_1 = \mu_2 \quad \text{vs} \quad K\colon \mu_1 \neq \mu_2.\]Then $\Theta = \mathbb{R}^2 \times \mathbb{R}^+$, $\Delta = \mathbb{R} \times \mathbb{R}^+$ and $h(\mu, \sigma^2) = (\mu, \mu, \sigma^2)$. The Jacobian matrix is
\[J = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix},\]matrix of full rank.
Let
\[\hat{\eta}_n=\arg\max_{\eta \in \Delta}f_n(X^{(n)},h(\eta)) \quad \text{and} \quad \hat{\theta}_n=\arg\max_{\vartheta \in \Theta}f_n(X^{(n)},\vartheta)\]be maximum-likelihood estimators for families
\[\mathcal{P}_h = \lbrace P_{h(\eta)}\ |\ \eta \in \Delta\rbrace \quad \text{and} \quad \mathcal{P}_\vartheta = \lbrace P_\vartheta\ |\ \vartheta \in \Theta \rbrace\]respectively. Also let conditions from theorem of asymptotic efficiency for maximum-likelihood estimators for both families be satisfied. Then
\[T_n=-2\log \lambda(X^{(n)})=2(\log f_n(X^{(n)}, \hat{\theta}_n)-\log f_n(X^{(n)}, h(\hat{\eta}_n))) \xrightarrow[]{\mathcal{L}} \chi_{d-c}^2,\]if $\vartheta \in \Theta_H$.
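The limit distribution can be illustrated by simulation in a one-dimensional case. The sketch below assumes an exponential model $X_i \sim \operatorname{Exp}(\vartheta)$ with the point hypothesis $H\colon \vartheta = \vartheta_0$, so that $d = 1$ and $c = 0$; this model, the function names and the seed are illustrative assumptions. In this case $-2\log\lambda = 2n(\vartheta_0 \overline{X}_n - 1 - \log(\vartheta_0 \overline{X}_n))$, and the $\chi_1^2$ quantile is the square of the normal quantile $u_{1-\alpha/2}$.

```python
import random
from math import log
from statistics import NormalDist


def lr_statistic(xs, theta0):
    """-2 log(lambda) for H: theta = theta0 in the Exp(theta) model.

    The unrestricted MLE is 1 / mean(xs), which gives the closed form below.
    """
    n = len(xs)
    r = theta0 * sum(xs) / n
    return 2 * n * (r - 1 - log(r))


def rejection_rate(theta, theta0, n, alpha, reps, seed=0):
    """Monte Carlo estimate of P_theta(-2 log(lambda) > chi^2_{1, 1-alpha})."""
    rng = random.Random(seed)
    # chi^2 with 1 degree of freedom is the square of a standard normal.
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    rejections = 0
    for _ in range(reps):
        xs = [rng.expovariate(theta) for _ in range(n)]
        rejections += lr_statistic(xs, theta0) > critical
    return rejections / reps


print("under H (theta = 1):", rejection_rate(1.0, 1.0, 100, 0.05, reps=2000))
print("under K (theta = 2):", rejection_rate(2.0, 1.0, 100, 0.05, reps=2000))
```

Under $H$ the estimated rejection rate should be close to $\alpha$, in line with the $\chi_1^2$ limit; under the alternative it should be close to $1$.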
Proof
As before we use notation $$\ell(x, \vartheta) = \log f(x, \vartheta).$$ We start with $$\begin{aligned} T_n^{(1)} & = 2(\log f_n(X^{(n)}, \hat{\theta}_n)-\log f_n(X^{(n)}, \vartheta)) \\ & = 2\sum_{i=1}^{n}\Big(\ell(X_i, \hat{\theta}_n) - \ell(X_i, \vartheta)\Big) \\ & = 2(\hat{\theta}_n - \vartheta)^T \sum_{i=1}^{n} \dot{\ell}(X_i, \vartheta) +(\hat{\theta}_n - \vartheta)^T \sum_{i=1}^{n} \ddot{\ell}(X_i, \widetilde{\vartheta}_n)(\hat{\theta}_n - \vartheta) \\ & = 2 (\hat{\theta}_n - \vartheta)^T \Big( \sum_{i=1}^{n} \dot{\ell}(X_i, \vartheta) + \sum_{i=1}^{n} \ddot{\ell}(X_i, \widetilde{\vartheta}_n)(\hat{\theta}_n - \vartheta) \Big) - (\hat{\theta}_n - \vartheta)^T\sum_{i=1}^{n}\ddot{\ell}(X_i, \widetilde{\vartheta}_n)(\hat{\theta}_n - \vartheta) \end{aligned}$$ for some $\widetilde{\vartheta}_n \in [\hat{\theta}_n, \vartheta]$. Using the notations from [Part III](https://astralord.github.io/posts/visual-guide-to-statistics-part-iii-asymptotic-properties-of-estimators/#asymptotic-efficiency-of-maximum-likelihood-estimators) we rewrite the first term of the equation above: $$\begin{aligned} 2n(\hat{\theta}_n - \vartheta)^T& \underbrace{(\dot{L}_n(\vartheta) + \ddot{L}_n(\widetilde{\vartheta}_n)(\hat{\theta}_n - \vartheta))}. \\ & \qquad \qquad\ \color{Salmon}{ = 0 \text{ (by Mean Value Theorem)}} \end{aligned}$$ Hence $$T_n^{(1)} = -\sqrt{n}(\hat{\theta}_n - \vartheta)^T \ddot{L}_n(\widetilde{\vartheta}_n) \sqrt{n}(\hat{\theta}_n - \vartheta), $$ where $$ \begin{aligned} \sqrt{n}(\hat{\theta}_n - \vartheta)^T & \xrightarrow[]{\mathcal{L}} \mathcal{N}(0, I^{-1}(f(\cdot, \vartheta))), \\ \ddot{L}_n(\widetilde{\vartheta}_n)& \xrightarrow[]{\mathbb{P}} -I(f(\cdot, \vartheta)), \\ \sqrt{n}(\hat{\theta}_n - \vartheta) &\xrightarrow[]{\mathcal{L}} \mathcal{N}(0, I^{-1}(f(\cdot, \vartheta))). \end{aligned}$$ We know that for $X \sim \mathcal{N}_d(0, \Sigma)$ with $\Sigma > 0$ we have $$X^T \Sigma^{-1} X \sim \chi_d^2.$$ Applying this with $\Sigma = I^{-1}(f(\cdot, \vartheta))$ gives $$T_n^{(1)} \xrightarrow[]{\mathcal{L}} A \sim \chi_d^2.$$ In the same way, $$ T_n^{(2)} = 2 (\log f_n(X^{(n)}, h(\hat{\eta}_n) ) - \log f_n(X^{(n)},h(\eta))) \xrightarrow[]{\mathcal{L}} B \sim \chi_c^2. $$ If $H$ is true, then $\vartheta = h(\eta)$ and $$T_n = T_n^{(1)} - T_n^{(2)} \xrightarrow[]{\mathcal{L}} A-B \sim \chi_{d-c}^2,$$ which follows from independence of $A-B$ and $B$.
This statement is called Wilks’ theorem and it shows that
\[\varphi_n (X^{(n)}) = \left \{ \begin{array}{cl} 1, & -2\log\lambda(X^{(n)}) > \chi_{d-c, 1-\alpha}^2, \\ 0, & \text{otherwise} \end{array} \right.\]is a test with asymptotic level $\alpha$. Also, the sequence $(\varphi_n)$ is consistent, because
\[\begin{aligned} -\frac{2}{n} \log (\lambda(X^{(n)})) & = \frac{2}{n} \sum_{i=1}^{n} \Big( \ell(X_i, \hat{\theta}_n) - \ell(X_i, h(\hat{\eta}_n)) \Big) \\ & \xrightarrow{\mathcal{L}} 2 \mathbb{E}_\vartheta[\ell(X,\vartheta) - \ell(X, h(\eta))] \\ & = 2 KL(\vartheta | h(\eta)) > 0, \end{aligned}\]if $\vartheta \neq h(\eta)$. Hence for $\vartheta \in \Theta_K$
\[-2\log(\lambda(X^{(n)}))\xrightarrow{\mathcal{L}} \infty.\]
Likelihood-ratio tests
Take an example: let $X_{ij} \sim \mathcal{N}(\mu_i, \sigma_i^2)$, $i = 1, \dots, r$ and $j = 1, \dots, n_i$, where all $n_i \rightarrow \infty$ at the same rate. We test the equality of variances:
\[H\colon \sigma_1^2 = \dots = \sigma_r^2 \quad \text{vs} \quad K \colon \sigma_i^2 \neq \sigma_j^2 \text{ for some } i \neq j.\]Here $\Theta = \mathbb{R}^r \times (\mathbb{R}^+)^r$, $\Delta = \mathbb{R}^r \times \mathbb{R}^+$ and
\[h((x_1, \dots, x_r, y)^T) = (x_1, \dots, x_r, y, \dots, y)^T.\]Maximum-likelihood estimator is
\[\hat{\theta}_n = (\overline{X}_{1 \cdot}, \dots, \overline{X}_{r \cdot}, \hat{s}_1^2, \dots, \hat{s}_r^2)\]with
\[\overline{X}_{i \cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i}X_{ij} \quad \text{and} \quad \hat{s}_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i}(X_{ij} -\overline{X}_{i \cdot})^2.\]Then
\[f_n(X^{(n)}, \hat{\vartheta}_n) = \prod_{i=1}^{r} (2 \pi e \hat{s}_i^2)^{-\frac{n_i}{2}}.\]Under null hypothesis maximum-likelihood estimator maximizes
\[f_n(X^{(n)}, \hat{\eta}_n) = \prod_{i=1}^{r} (2 \pi \sigma^2)^{-\frac{n_i}{2}} \exp \Big( -\frac{1}{2\sigma^2} \sum_{j=1}^{n_i} (X_{ij} - \overline{X}_{i \cdot})^2 \Big ).\]Setting $n = \sum_{i=1}^{r}n_i$, we get
\[\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{r} \sum_{j=1}^{n_i} (X_{ij}-\overline{X}_{i \cdot})^2 = \sum_{i=1}^r \frac{n_i}{n}\hat{s}_i^2.\]Then
\[f_n(X^{(n)}, \hat{\eta}_n) = \prod_{i=1}^{r}(2\pi e\hat{\sigma}^2)^{-\frac{n_i}{2}} = (2\pi e \hat{\sigma}^2)^{-\frac{n}{2}}\]and test statistic becomes
\[T_n = -2\log \lambda(X^{(n)}) = n \log \hat{\sigma}^2 - \sum_{i=1}^{r} n_i \log \hat{s}_i^2.\]The test
\[\varphi_n(X^{(n)}) = \left \{ \begin{array}{cl} 1, & T_n > \chi_{r-1, 1-\alpha}^2, \\ 0, & \text{otherwise} \end{array} \right.\]is called the Bartlett test.
Fig. 3. Visualization of Bartlett test. Up to five normally distributed samples can be added, each with $n_i=30$. Significance level $\alpha$ is fixed at $0.05$. Choose different variations of $\sigma_i^2$ to observe how it affects test statistic $T_n$.
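The test statistic $T_n$ is straightforward to compute from raw samples. The sketch below follows the formula from the text, with maximum-likelihood variances (divisor $n_i$); note that the classical Bartlett statistic found in many statistics libraries uses unbiased variances and an additional correction factor, so the numbers may differ slightly. The function name and the toy samples are assumptions.

```python
from math import log


def bartlett_lr_statistic(samples):
    """T_n = n * log(sigma_hat^2) - sum_i n_i * log(s_i^2),
    with maximum-likelihood variances s_i^2 (divisor n_i)."""
    sizes = [len(s) for s in samples]
    n = sum(sizes)
    variances = []
    for s in samples:
        avg = sum(s) / len(s)
        variances.append(sum((v - avg) ** 2 for v in s) / len(s))
    # Pooled variance under H: weighted mean of the group variances.
    pooled = sum(k * v for k, v in zip(sizes, variances)) / n
    return n * log(pooled) - sum(k * log(v) for k, v in zip(sizes, variances))


# Toy (assumed) samples: same spread vs. very different spread.
print(bartlett_lr_statistic([[1, 2, 3], [11, 12, 13]]))
print(bartlett_lr_statistic([[1, 2, 3], [10, 20, 30]]))
```

The statistic is then compared with the quantile $\chi_{r-1, 1-\alpha}^2$, which has to be taken from a table or a statistics library since the Python standard library does not provide it.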
Take another example: suppose we have two discrete variables $A$ and $B$ (e.g. gender, age, education or income), where $A$ can take $r$ values and $B$ can take $s$ values. Further suppose that $n$ individuals are randomly sampled. A contingency table can be created to display the joint sample distribution of $A$ and $B$.
| | $1$ | $\cdots$ | $s$ | Sum |
|---|---|---|---|---|
| $1$ | $X_{11}$ | $\cdots$ | $X_{1s}$ | $X_{1\cdot} = \sum_{j=1}^s X_{1j}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $r$ | $X_{r1}$ | $\cdots$ | $X_{rs}$ | $X_{r\cdot}$ |
| Sum | $X_{\cdot1}$ | $\cdots$ | $X_{\cdot s}$ | $n$ |
We model vector $X$ with multinomial distribution:
\[(X_{11}, \dots, X_{rs})^T \sim \mathcal{M}(n, p_{11}, \dots, p_{rs}),\]where $\sum_{ij} p_{ij} = 1$. Joint density is
\[f_n(x^{(n)}, p) = P_p(X_{ij}=x_{ij}) = \frac{n!}{\prod_{i,j=1}^{r,s} x_{ij}!} \prod_{i,j=1}^{r,s} (p_{ij})^{x_{ij}},\]where $x_{ij} \in \lbrace 0, \dots, n\rbrace$ and $\sum_{i,j=1}^{r,s} x_{ij} = n$. Maximum-likelihood estimator is
\[\hat{p}_{ij} = \frac{X_{ij}}{n}\](in analogy to binomial distribution) and
\[f_n(X^{(n)}, \hat{p}) = \frac{n!}{\prod_{i,j=1}^{r,s} X_{ij}!} \prod_{i,j=1}^{r,s} \Big(\frac{X_{ij}}{n}\Big)^{X_{ij}}\]Suppose we want to test independence between $A$ and $B$:
\[H\colon p_{ij} = p_i q_j \ \forall i,j \quad \text{vs} \quad K\colon p_{ij} \neq p_i q_j \text{ for some } i, j,\]where $p_i = p_{i \cdot} = \sum_{j=1}^{s}p_{ij}$ and $q_j = p_{\cdot j} = \sum_{i=1}^{r}p_{ij}$. Here $d = rs-1$, $c = r + s - 2$ and $d-c = (r-1)(s-1)$. If the null hypothesis is true, then
\[f_n(X^{(n)}, p, q) = \frac{n!}{\prod_{i,j=1}^{r,s} X_{ij}!} \prod_{i,j=1}^{r,s} (p_i q _j)^{X_{ij}} = \frac{n!}{\prod_{i,j=1}^{r,s} X_{ij}!} \prod_{i=1}^{r} p_i^{X_{i \cdot}} \prod_{j=1}^{s} q_j ^ {X_{\cdot j}}.\]Maximum-likelihood estimators are
\[\hat{p}_i = \frac{X_{i \cdot}}{n} \quad \text{and} \quad \hat{q}_j = \frac{X_{\cdot j}}{n},\]and likelihood function is
\[f_n(X^{(n)}, \hat{p}, \hat{q}) = \frac{n!}{\prod_{i,j=1}^{r,s} X_{ij}!} \prod_{i,j=1}^{r,s} \Big( \frac{X_{i \cdot} X_{\cdot j}}{n^2} \Big)^{X_{ij}}.\]We get
\[T_n = -2 \log \lambda(X^{(n)}) = 2 \sum_{i=1}^r \sum_{j=1}^s X_{ij} \log \Big( \frac{nX_{ij}}{ X_{i \cdot} X_{\cdot j} } \Big)\]and
\[\varphi_n(X^{(n)}) = \left \{ \begin{array}{cl} 1, & T_n > \chi_{(r-1)(s-1), 1-\alpha}^2, \\ 0, & \text{otherwise}, \end{array} \right.\]which is called the chi-square independence test. Using a Taylor expansion together with the Law of Large Numbers, we can get the asymptotically equivalent statistic
\[\tilde{T}_n = \sum_{i=1}^{r} \sum_{j=1}^s \frac{\Big(X_{ij} -\frac{X_{i \cdot} X_{\cdot j}}{n}\Big)^2}{\frac{X_{i \cdot} X_{\cdot j}}{n}}.\]Usually,
\[V_n = \sqrt{\frac{\tilde{T}_n}{n (\min(r, s) - 1)}}\]is used as a dependency measure between $A$ and $B$, because under both the null hypothesis and the alternative we have the convergence
\[V_n^2 \xrightarrow{\mathbb{P}} \frac{1}{\min(r, s) - 1}\sum_{i=1}^{r} \sum_{j=1}^s \frac{(p_{ij} - p_{i \cdot}p_{\cdot j} )^2}{p_{i \cdot}p_{\cdot j}}.\]
Fig. 4. Visualization for chi-square independence test with $r=4$ and $s=5$. Significance level $\alpha$ is fixed at $0.05$. Click on the cell of contingency table to increase $X_{ij}$ value, and CTRL + click to decrease (or ⌘ + click for Mac OS).
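All three quantities of this section, the likelihood-ratio statistic $T_n$, its approximation $\tilde{T}_n$ and the dependency measure $V_n$, can be computed from a contingency table in a few lines. The sketch below assumes that every row and column sum is positive and uses the convention $0 \cdot \log 0 = 0$; the function name and the two toy tables are illustrative assumptions.

```python
from math import log, sqrt


def independence_statistics(table):
    """Return (T_n, T_n_tilde, V_n) for an r x s table of counts.

    Assumes all row and column sums are positive.
    """
    r, s = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(r)) for j in range(s)]

    t_lr = 0.0       # likelihood-ratio statistic T_n
    t_pearson = 0.0  # asymptotically equivalent statistic T_n tilde
    for i in range(r):
        for j in range(s):
            observed = table[i][j]
            expected = row_sums[i] * col_sums[j] / n
            if observed > 0:  # convention: 0 * log(0) = 0
                t_lr += 2 * observed * log(observed / expected)
            t_pearson += (observed - expected) ** 2 / expected

    v = sqrt(t_pearson / (n * (min(r, s) - 1)))
    return t_lr, t_pearson, v


# Toy (assumed) tables: one with visible dependence, one exactly independent.
print(independence_statistics([[10, 20], [20, 10]]))
print(independence_statistics([[10, 20], [20, 40]]))
```

For the second table the counts factorize exactly into row and column shares, so all three quantities vanish up to rounding; for the first table both statistics are of similar size, as the asymptotic equivalence suggests.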