Statistical Inference II
1. Hypothesis Testing
1.1 Mechanics of Hypothesis Testing
- Test of a Hypothesis
- The truth or falsity of a hypothesis can never be known with certainty unless we examine the entire population.
- A hypothesis testing procedure should therefore be developed with the probability of reaching a wrong conclusion in mind.
1.2 Null Hypothesis and Alternative hypothesis
- Null Hypothesis \({H_0}\): is an assertion about one or more population parameters that are assumed to be true until there is sufficient statistical evidence to conclude otherwise.
- Alternative Hypothesis \({H_1}\): is the assertion of all situations not covered by the null hypothesis.
Together the null and the alternative constitute a set of hypotheses that covers all possible values of the parameter or parameters in question.
1.3. Type I error and type II error
- Results of a Test of Hypothesis
Type I error: Rejecting the null hypothesis \(H_0\) when it is true
Type II error: Failing to reject the null hypothesis \(H_0\) when it is false
- Significance level \(\alpha\): the probability of making a type I error:
\[\alpha = \Pr(\text{reject } H_0 \mid H_0 \text{ is true}) \]
- Strong and weak conclusions:
- Because we can usually control the significance level \(\alpha\) (the probability of wrongly rejecting \(H_0\)), rejection of the null hypothesis is a strong conclusion.
- Failing to reject the null hypothesis does not necessarily mean a high probability that \(H_0\) is true. Failing to reject \(H_0\) is a weak conclusion.
1.4. Three approaches to hypothesis testing
- Fixed-significance-level \(\alpha\) approach:
Given a fixed significance level \(\alpha\), all we have to do is determine where to place the critical regions.
- \(p\)-value approach:
The \(p\)-value provides a measure of the credibility of the null hypothesis.
Given the test statistic, the \(p\)-value is the smallest significance level \(\alpha\) that would lead to rejection of the null hypothesis:
\[p < \alpha \quad \Rightarrow \quad \text{reject } H_0 \]
- Confidence interval (CI) approach:
To estimate an unknown parameter \(\theta\), we prefer an interval estimate \(L \leq \theta \leq U\), where \(L\) and \(U\) are determined by the sample.
\[\Pr[L \leq \theta \leq U] = 1 - \alpha \quad \text{or} \quad \Pr[L \leq \theta] = 1 - \alpha \quad \text{or} \quad \Pr[\theta \leq U] = 1 - \alpha \]
The fixed-significance-level approach, the \(p\)-value approach, and the CI approach should lead to the same hypothesis testing conclusion.
2. Inferences Regarding a Single Population
- Note that statistical inference is always with respect to the population.
2.1 Inference on one population mean \(\mu\)
2.1.1 Inference on population mean \(\mu\) with known variance \(\sigma^2\): one-sample \(z\)-test
(1) Theorem (Central Limit Theorem). Assumptions:
- Let \({X_1}, \cdots, X_n\) be a random sample of size \(n\) taken from a population with mean \({\mu}\) and known variance \({\sigma^2}\) with replacement. (No population distribution assumption)
- The population variance \({\sigma^2}\) is known
- Let \({\bar X}\) be the sample mean.
Then, the limiting form of the distribution of the statistic
\[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}\]
is the standard normal distribution; that is, \(Z\) can be regarded as a standard normal random variable.
(2) Hypothesis test
The test used to make inferences about the population mean \(\mu\) with known variance \(\sigma^2\) is called the one-sample \(z\)-test.
(3) Estimation of confidence interval for \(\mu\)
The confidence interval for \(\mu\):
\[\bar X - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar X + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\]
where \(\alpha\) is the level of significance and \(1-\alpha\) is called the confidence level. \(Z_{\alpha/2}\) is the value of \(Z\) such that the area in each tail under the standard normal curve is \(\alpha/2\).
The higher the confidence level, the lower the risk that the confidence interval fails to include the actual value of \(\mu\).
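The one-sample \(z\)-test can be sketched numerically with only the standard library. A minimal sketch, assuming hypothetical summary values (\(n=25\), \(\bar x = 51.3\), known \(\sigma = 4\)); it also shows that the three approaches of Section 1.4 agree:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary data: n = 25, known sigma = 4 (illustrative values)
n, xbar, sigma = 25, 51.3, 4.0
mu0 = 50.0        # H0: mu = mu0  vs  H1: mu != mu0
alpha = 0.05

# One-sample z statistic
z = (xbar - mu0) / (sigma / sqrt(n))

# Fixed-significance-level approach: compare |z| with z_{alpha/2}
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > z_crit

# p-value approach (two-sided)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# CI approach: 100(1 - alpha)% confidence interval for mu
half_width = z_crit * sigma / sqrt(n)
ci = (xbar - half_width, xbar + half_width)

print(z, p_value, ci, reject)
```

With these numbers \(|z| < Z_{\alpha/2}\), \(p > \alpha\), and \(\mu_0\) lies inside the CI, so all three approaches fail to reject \(H_0\).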
2.1.2 Inference on population mean \(\mu\) with unknown variance: one-sample \(t\)-test
(1) Theorem. Assumptions:
- Let \({X_1, X_2, ..., X_n}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \(\mu\) and unknown variance.
- The variance of the normal distribution is unknown.
- Let \(\bar X\) be the sample mean.
- Usually we assume \(n < 40\). When \(n \geq 40\), \(T\) can be regarded as approximately standard normal.
Then, the statistic
\[T = \frac{\bar X - \mu}{S/\sqrt{n}}\]
follows a \(t\)-distribution with \(n-1\) degrees of freedom.
(2) Hypothesis test
The test used to make inferences about the population mean \(\mu\) with unknown variance is called the one-sample \(t\)-test.
(3) Estimation of confidence interval for \(\mu\)
The confidence interval estimator of the population mean \(\mu\) is:
\[\bar X - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \leq \mu \leq \bar X + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}\]
where \({S = \sqrt{ \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar X)^2}}\) is the sample standard deviation.
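A minimal sketch of the one-sample \(t\)-test with a small hypothetical sample. The critical value \(t_{0.025,7} = 2.365\) is an assumed \(t\)-table value, since the standard library has no \(t\) quantile function:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical small sample (n < 40), population variance unknown
data = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 10.3, 9.7]
mu0 = 10.0                        # H0: mu = mu0
n = len(data)

xbar = mean(data)
s = stdev(data)                   # sample standard deviation (n - 1 in the denominator)
t_stat = (xbar - mu0) / (s / sqrt(n))

# t_{0.025, 7} from a t-table (assumed value)
t_crit = 2.365
reject = abs(t_stat) > t_crit

# 95% confidence interval for mu
half_width = t_crit * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(t_stat, ci, reject)
```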
2.2 Inference one population variance \(\sigma^2\)
2.2.1 Inference on variance \(\sigma^2\) of a normal population
(1) Theorem. Assumptions:
- Let \({X_1, X_2, ..., X_n}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \(\mu\) and variance \(\sigma^2\).
The statistic
\[X^2 = \frac{(n-1)S^2}{\sigma^2}\]
follows a \(\chi^2\)-distribution with \(n-1\) degrees of freedom.
(2) Hypothesis test
(3) Estimation of confidence interval for \(\sigma^2\)
The confidence interval estimator of the population variance \(\sigma^2\) is:
\[\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\]
where \(\chi^2_{\alpha/2}\) cuts off an area of \(\alpha/2\) in the right-hand tail of the distribution and \(\chi^2_{1-\alpha/2}\) cuts off an area of \(\alpha/2\) in the left-hand tail.
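A sketch of the variance confidence interval with a hypothetical normal sample. The chi-square critical values \(\chi^2_{0.025,9} = 19.023\) and \(\chi^2_{0.975,9} = 2.700\) are assumed table values:

```python
from statistics import stdev

# Hypothetical sample from a normal population (illustrative values)
data = [5.1, 4.8, 5.4, 5.0, 4.9, 5.2, 5.3, 4.7, 5.1, 5.0]
n = len(data)
s2 = stdev(data) ** 2                # sample variance S^2

# chi2_{0.025, 9} and chi2_{0.975, 9} from a chi-square table (assumed values)
chi2_upper, chi2_lower = 19.023, 2.700

# 95% CI for sigma^2: (n-1)S^2 / chi2_{alpha/2} <= sigma^2 <= (n-1)S^2 / chi2_{1-alpha/2}
ci = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)
print(s2, ci)
```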
2.3 Inference on one population proportion \(p\)
2.3.1 Inference on population proportion \(p\)
(1) Assumption:
- Let \({X_1, X_2, ..., X_n}\) be independent and identically distributed (i.i.d.) random variables following a Bernoulli distribution with success probability \(p\). Thus,
- \(X_i \sim \text{Bernoulli}(p)\), \(\mu = \mathrm{E} (X_i) = p\), \(\sigma^2 = \mathrm{Var} (X_i) = p(1-p)\)
- \(Y=\sum_{i=1}^nX_i \sim B(n,p)\), \(\mu = \mathrm{E} (Y) = np\), \(\sigma^2 = \mathrm{Var} (Y) = np(1-p)\)
- By the central limit theorem, when \(n\) is large, the statistic
\[Z = \frac{\hat p - p}{\sqrt{p(1-p)/n}}\]
approximately follows the standard normal distribution, where \(\bar X = \hat p = Y/n\).
(2) Hypothesis test
(3) Estimation of confidence interval for \(p\)
The confidence interval for the population proportion \(p\) is calculated as follows, under the assumption that \(n\) is sufficiently large (\(np \geq 5\) and \(n(1-p) \geq 5\)):
\[\hat p - Z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}} \leq p \leq \hat p + Z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}\]
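A minimal sketch of the large-sample proportion test and CI, assuming hypothetical counts (52 successes in 100 trials). Note the test statistic uses \(p_0\) under \(H_0\), while the CI uses \(\hat p\):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: 52 successes in n = 100 Bernoulli trials (illustrative values)
n, y = 100, 52
p0 = 0.5                          # H0: p = p0
alpha = 0.05

p_hat = y / n
# Large-sample z statistic; requires n*p0 >= 5 and n*(1-p0) >= 5
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
# 95% CI for p uses p_hat in the standard error
half_width = z_crit * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)
print(z, ci)
```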
3. Inferences Regarding Comparing Two Populations
3.1 Inference on two populations' means \(\mu_1-\mu_2\)
3.1.1 Inference on two populations' means with known variances: two-sample \(z\)-test
(1) Assumption:
- Let \({X_{11}, X_{12}, ..., X_{1n}}\) be a random sample of size \(n_1\) taken with replacement from a population with mean \(\mu_1\) and known variance \(\sigma^2_1\). (No population distribution assumption)
- Let \({X_{21}, X_{22}, ..., X_{2n}}\) be a random sample of size \(n_2\) taken with replacement from another population with mean \(\mu_2\) and known variance \(\sigma^2_2\).
- The two populations are independent.
- The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are known.
The statistic
\[Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2_1/n_1 + \sigma^2_2/n_2}}\]
follows the standard normal distribution.
(2) Hypothesis test
The test used to make inferences about the difference of two population means \(\mu_1-\mu_2\) with known variances \(\sigma_1^2\) and \(\sigma_2^2\) is called the two-sample \(z\)-test.
When \(\sigma^2_1\) and \(\sigma^2_2\) are unknown, if \(n_1>40\) and \(n_2>40\), \(Z\) (with the sample variances in place of \(\sigma^2_1\) and \(\sigma^2_2\)) can still be regarded as approximately standard normal.
(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)
A confidence interval for the difference between two population means \(\mu_1 - \mu_2\):
\[(\bar X_1 - \bar X_2) - Z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}} \leq \mu_1 - \mu_2 \leq (\bar X_1 - \bar X_2) + Z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}}\]
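A minimal sketch of the two-sample \(z\)-test and CI, assuming hypothetical summary statistics for the two samples:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics for two independent samples (illustrative values)
n1, xbar1, var1 = 36, 24.2, 9.0    # known sigma_1^2 = 9
n2, xbar2, var2 = 49, 22.9, 16.0   # known sigma_2^2 = 16
alpha = 0.05

# Standard error of the difference of sample means
se = sqrt(var1 / n1 + var2 / n2)
z = (xbar1 - xbar2 - 0) / se       # H0: mu1 - mu2 = 0

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
diff = xbar1 - xbar2
ci = (diff - z_crit * se, diff + z_crit * se)
print(z, ci)
```

Here \(0\) lies inside the CI, consistent with \(|z| < Z_{\alpha/2}\): \(H_0\) is not rejected at \(\alpha = 0.05\).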
3.1.2 Inference on two populations' means with unknown but equal variances: pooled \(t\)-test
(1) Assumption:
- Let \({X_{11}, X_{12}, ..., X_{1n}}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \(\mu_1\) and unknown variance \(\sigma^2_1\).
- Let \({X_{21}, X_{22}, ..., X_{2n}}\) be independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \(\mu_2\) and unknown variance \(\sigma^2_2\).
- The two normal populations are independent.
- The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
- The variances are equal: \(\sigma^2_1=\sigma^2_2=\sigma^2\).
- Usually we assume \(n_1 < 40\) or \(n_2<40\). When \(n_1 \geq 40\) and \(n_2 \geq 40\), \(T\) can be regarded as approximately standard normal.
The statistic
\[T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{S_p\sqrt{1/n_1 + 1/n_2}}\]
follows a \(t\)-distribution with \(n_1+n_2-2\) degrees of freedom, where \(S_p^2\) is called the pooled estimator of \(\sigma^2\), the weighted average of the two sample variances:
\[S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}\]
(2) Hypothesis test
(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)
A confidence interval for the difference between two population means \(\mu_1 - \mu_2\):
\[(\bar X_1 - \bar X_2) - t_{\alpha/2,\,n_1+n_2-2}\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \leq \mu_1 - \mu_2 \leq (\bar X_1 - \bar X_2) + t_{\alpha/2,\,n_1+n_2-2}\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\]
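The pooled-variance computation can be sketched as follows, assuming two small hypothetical normal samples with equal variances:

```python
from math import sqrt
from statistics import mean, stdev

# Two hypothetical small samples assumed normal with equal variances (illustrative values)
x1 = [14.1, 13.8, 14.5, 14.0, 13.9, 14.3]
x2 = [13.5, 13.9, 13.4, 13.8, 13.6, 13.7]
n1, n2 = len(x1), len(x2)

s1, s2 = stdev(x1), stdev(x2)
# Pooled estimator of the common variance sigma^2 (weighted average of sample variances)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = sqrt(sp2)

# Pooled t statistic under H0: mu1 - mu2 = 0
t_stat = (mean(x1) - mean(x2)) / (sp * sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(t_stat, df)
```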
3.1.3 Inference on two populations' means with unknown and unequal variances: Welch's \(t\)-test
(1) Assumption:
- Let \({X_{11}, X_{12}, ..., X_{1n}}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \(\mu_1\) and unknown variance \(\sigma^2_1\).
- Let \({X_{21}, X_{22}, ..., X_{2n}}\) be independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \(\mu_2\) and unknown variance \(\sigma^2_2\).
- The two normal populations are independent.
- The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
- The variances are unequal: \(\sigma^2_1 \neq \sigma^2_2\).
The statistic is:
\[T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]
\(T\) approximately follows a \(t\)-distribution with degrees of freedom \(\nu\), which is given by:
\[\nu = \left\lfloor \frac{\left(S_1^2/n_1 + S_2^2/n_2\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1} + \dfrac{(S_2^2/n_2)^2}{n_2-1}} \right\rfloor\]
where \(\lfloor \cdot \rfloor\) is the floor function, and \(S_1\) and \(S_2\) are the standard deviations of the two samples.
(2) Hypothesis test
(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)
The confidence interval for the difference between two population means \(\mu_1 - \mu_2\):
\[(\bar X_1 - \bar X_2) - t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \leq \mu_1 - \mu_2 \leq (\bar X_1 - \bar X_2) + t_{\alpha/2,\,\nu}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\]
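The Welch statistic and its degrees of freedom can be sketched as follows, assuming two hypothetical normal samples with visibly different spreads:

```python
from math import sqrt, floor
from statistics import mean, stdev

# Two hypothetical normal samples with unequal variances (illustrative values)
x1 = [20.1, 20.8, 19.9, 21.2, 20.5, 20.3, 20.9, 20.4]
x2 = [18.2, 22.5, 17.9, 23.1, 19.4, 21.8]
n1, n2 = len(x1), len(x2)

# Per-sample variance contributions S_i^2 / n_i
v1, v2 = stdev(x1)**2 / n1, stdev(x2)**2 / n2
t_stat = (mean(x1) - mean(x2)) / sqrt(v1 + v2)   # H0: mu1 - mu2 = 0

# Welch-Satterthwaite degrees of freedom, rounded down to an integer
nu = floor((v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1)))
print(t_stat, nu)
```

Note how the much larger variance of `x2` pulls \(\nu\) well below the pooled value \(n_1+n_2-2 = 12\).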
3.1.4 Inference on two populations' means with paired samples: paired \(t\)-test
(1) Assumption:
- Samples from the two populations are collected in pairs. Each pair \((X_{1i},X_{2i})\) is taken under homogeneous conditions, but these conditions may change from one pair to another. (No population distribution assumption)
- The two random samples may not be independent. (No population independence assumption)
- The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
- Whether the variances are equal is also unknown.
The statistic
\[T = \frac{\bar D - \mu_D}{S_D/\sqrt{n}}\]
follows a \(t\)-distribution with \(n-1\) degrees of freedom, where:
- \(D_i=X_{1i} - X_{2i}, i=1,\cdots,n\)
- \(\mu_D=\mathrm{E}[X_1-X_2] = \mu_1-\mu_2\)
- \(\bar{D} = \dfrac{1}{n}\sum_{i=1}^{n}D_i\)
- \(S_D=\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(D_i-\bar{D})^2}\)
(2) Hypothesis test
(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)
The confidence interval for the difference between two population means \(\mu_1 - \mu_2\):
\[\bar D - t_{\alpha/2,\,n-1}\frac{S_D}{\sqrt{n}} \leq \mu_D \leq \bar D + t_{\alpha/2,\,n-1}\frac{S_D}{\sqrt{n}}\]
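The paired \(t\) statistic reduces to a one-sample \(t\)-test on the differences, which can be sketched with hypothetical paired measurements (e.g., before/after on the same units):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired measurements on the same six units (illustrative values)
x1 = [7.2, 6.8, 7.5, 7.0, 6.9, 7.3]
x2 = [6.9, 6.7, 7.1, 6.8, 6.9, 7.0]

# Work with the pairwise differences D_i = X_1i - X_2i
d = [a - b for a, b in zip(x1, x2)]
n = len(d)
dbar = mean(d)
sd = stdev(d)                       # uses n - 1 in the denominator
t_stat = dbar / (sd / sqrt(n))      # H0: mu_D = 0
df = n - 1
print(dbar, df, t_stat)
```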
3.1.5 Summary: Inference on the Means of Two Populations
| Situations | Variance Known, or Variance Unknown (\(n \geq 40\)) | Variance Unknown (\(n<40\)) | Paired \(t\)-test |
|---|---|---|---|
| Population assumption | two independent distributions | two independent normal distributions | not required |
| Statistic | \(z\)-statistic | \(t\)-statistic | \(t\)-statistic |
| Procedure | two-sample \(z\)-test | equal variances: pooled \(t\)-test; unequal variances: Welch's \(t\)-test | paired \(t\)-test |
| Sampling distribution of statistic | standard normal distribution | \(t\)-distribution | \(t\)-distribution |
3.2 Inference on the ratio of the variances of two populations
3.2.1 Inference on the ratio of the variances of two normal populations: \(F\)-test
(1) Assumption:
- Let \({X_{11}, X_{12}, ..., X_{1n}}\) be independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \(\mu_1\) and variance \(\sigma^2_1\).
- Let \({X_{21}, X_{22}, ..., X_{2n}}\) be independent and identically distributed (i.i.d.) random variables following another normal distribution with unknown mean \(\mu_2\) and variance \(\sigma^2_2\).
- The two normal populations are independent.
- The means \(\mu_1\) and \(\mu_2\) of the two population are unknown.
The statistic
\[F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}\]
follows an \(F\)-distribution with \(n_1-1\) degrees of freedom in the numerator and \(n_2-1\) degrees of freedom in the denominator.
(2) Hypothesis test
(3) Estimation of confidence interval for \(\sigma_1^2 / \sigma_2^2\)
The confidence interval for the ratio of two population variances \(\sigma_1^2/\sigma_2^2\):
\[\frac{S_1^2}{S_2^2}\,F_{1-\alpha/2,\,n_2-1,\,n_1-1} \leq \frac{\sigma_1^2}{\sigma_2^2} \leq \frac{S_1^2}{S_2^2}\,F_{\alpha/2,\,n_2-1,\,n_1-1}\]
For the \(F\)-distribution, we have
\[F_{1-\alpha,\,\nu_1,\,\nu_2} = \frac{1}{F_{\alpha,\,\nu_2,\,\nu_1}}\]
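A minimal sketch of the two-sided \(F\)-test for equal variances with hypothetical samples. The upper critical value \(F_{0.025,7,7} = 4.99\) is an assumed \(F\)-table value; the lower one comes from the reciprocal relation above:

```python
from statistics import stdev

# Two hypothetical normal samples (illustrative values)
x1 = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3]
x2 = [10.1, 10.0, 10.2, 9.9, 10.1, 10.0, 10.2, 9.9]
n1, n2 = len(x1), len(x2)

# F = S1^2 / S2^2 under H0: sigma1^2 = sigma2^2
f_stat = stdev(x1)**2 / stdev(x2)**2

# F_{0.025, 7, 7} from an F-table (assumed value); lower-tail critical value
# via the reciprocal relation F_{1-alpha, nu1, nu2} = 1 / F_{alpha, nu2, nu1}
f_upper = 4.99
f_lower = 1 / f_upper
reject = f_stat > f_upper or f_stat < f_lower
print(f_stat, reject)
```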
3.3 The Chi-Square Goodness-of-Fit Test
If the data are grouped into \(k\) cells, let the observed count in cell \(i\) be \(O_i\) and the expected count (expected under \(H_0\)) be \(E_i\). The summation is over all cells \(i = 1, 2,\dots, k\). The test statistic is
\[\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}\]
The expected counts should be \(5\) or more for all cells. When the sample size is small, or the cells are defined such that the expected frequencies are small, the \(\chi^2\) test is inappropriate.
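A minimal goodness-of-fit sketch with hypothetical counts: 120 die rolls grouped into \(k=6\) cells, tested against a fair-die \(H_0\). The critical value \(\chi^2_{0.05,5} = 11.070\) is an assumed table value:

```python
# Hypothetical die-fairness check: 120 rolls in k = 6 cells (illustrative counts)
observed = [18, 22, 21, 17, 23, 19]
expected = [20] * 6                   # under H0 (fair die): E_i = 120/6; all E_i >= 5

# Goodness-of-fit statistic: sum over cells of (O_i - E_i)^2 / E_i
chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# chi2_{0.05, 5} from a chi-square table (assumed value)
chi2_crit = 11.070
reject = chi2 > chi2_crit
print(chi2, df, reject)
```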
Contingency tables are used to determine whether two classification criteria are independent of each other.
- General Layout of a Contingency Table (columns: first classification category; rows: second classification category)
| | \({1}\) | \({\cdots}\) | \({j}\) | \({\cdots}\) | \({c}\) | Total |
|---|---|---|---|---|---|---|
| \({1}\) | \({O_{11}}\) | \({\cdots}\) | \({O_{1j}}\) | \({\cdots}\) | \({O_{1c}}\) | \({R_1}\) |
| \({\vdots}\) | \({\vdots}\) | \({\ddots}\) | \({\vdots}\) | \({\vdots}\) | \({\vdots}\) | |
| \({i}\) | \({O_{i1}}\) | \({\cdots}\) | \({O_{ij}}\) | \({\cdots}\) | \({O_{ic}}\) | \({R_i}\) |
| \({\vdots}\) | \({\vdots}\) | \({\vdots}\) | \({\ddots}\) | \({\vdots}\) | \({\vdots}\) | |
| \({r}\) | \({O_{r1}}\) | \({\cdots}\) | \({O_{rj}}\) | \({\cdots}\) | \({O_{rc}}\) | \({R_r}\) |
| Total | \({C_1}\) | \({\cdots}\) | \({C_j}\) | \({\cdots}\) | \({C_c}\) | \({n}\) |
The test statistic \(\chi^2\) for the differences between observed and expected frequencies, summed over all rows and columns of a two-way contingency table, is:
\[\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
The test statistic is approximately \(\chi^2\)-distributed with degrees of freedom \(df = (r-1)(c-1)\). \(R_i\) and \(C_j\) are the row and column totals, and \(E_{ij}\) is the expected count in cell \((i, j)\):
\[E_{ij} = \frac{R_i C_j}{n}\]
Note that when contingency tables are based on small sample sizes, or when expected cell frequencies become small, the \(\chi^2\) test statistic is unreliable.
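The contingency-table computation can be sketched end to end, assuming a hypothetical \(2\times 3\) table of observed counts:

```python
# Hypothetical 2x3 contingency table of observed counts O_ij (illustrative values)
O = [[20, 30, 50],
     [30, 20, 50]]
r, c = len(O), len(O[0])

row_totals = [sum(row) for row in O]                               # R_i
col_totals = [sum(O[i][j] for i in range(r)) for j in range(c)]    # C_j
n = sum(row_totals)

# Expected counts under independence: E_ij = R_i * C_j / n
E = [[row_totals[i] * col_totals[j] / n for j in range(c)] for i in range(r)]

# chi^2 statistic summed over all rows and columns
chi2 = sum((O[i][j] - E[i][j])**2 / E[i][j] for i in range(r) for j in range(c))
df = (r - 1) * (c - 1)
print(chi2, df)
```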