统计计算——假设检验
Lecture_9-hypothesis_Testing_with_R
library(tidyverse)
library(knitr)
library(dplyr)
library(palmerpenguins)
library(ggplot2)
Statistical inference
通过样本(sample)去估计总体(population).
- point estimate
- interval estimate (比起点估计,多了一个精确度指标)
- rejection of a hypothesis
Point estimate
Moment estimator
Let \(X_1,...,X_n\) i.i.d. as samples of X.
Maximum likelihood estimator
then take the devarient of \(\frac{dL(\mu, \sigma)}{d\mu} = 0\) and \(\frac{dL(\mu, \sigma)}{d\sigma} = 0\).
For normal distribution, $$ \hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$$, and $$ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \bar X)^2$$
注意:如果方差下面是n,那就是有偏的,所以需要修偏改成n-1。
Confidence Interval
Confidence interval
Note that \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\). If \(\sigma^2\) is known, we have the standardized \(\bar{X}\) follows the standard normal distribution:
Note that
where \(z_{\alpha / 2}\) is the \(\alpha / 2\) upper quantile of \(N(0,1)\). As a result,
which means that
is the \(100(1-\alpha) \%\) confidence interval of the parameter \(\mu\).
If \(\sigma^2\) is unknown, we can use the sample sd to replace the population sd:
where
The \(100(1-\alpha) \%\) confidence interval becomes
where \(t_{\alpha / 2, n-1}\) is the \(\alpha / 2\) upper quantile of \(t_{n-1}\) distribution.
Find the pivotal quantity that follow a specific distribution.
\(z_{\alpha/2}\)可以通过q-distribution函数求出。
ci_normal <- function(x, sigma, alpha = 0.05){
lower <-
upper <-
return c(lower, upper)
}
ci_normal(x, sigma = 7)
如果\(\sigma\)未知,可以使用t分布。
penguins <- drop_na(penguins)
x <- penguins$body_mass_g
xbar <- mean(x)
s <- sd(x)
n <- length(x)
alpha <- 0.05
t.quantile <- qt(alpha/2, df = n - 1, lower = FALSE)
ci.lower <- xbar - t.quantile * s / sqrt(n)
ci.upper <- xbar + t.quantile * s / sqrt(n)
c(ci.lower, ci.upper)
也可以把是否需要simga放在一个函数里。
ci <- function(x, sigma = NULL, alpha = 0.05){
if(is.null(sigma)){
} else {
}
}
Hypothesis testing
\(H_0: \mu = \mu_0\) versus \(H_1: \mu \neq \mu_0\)
Steps to do a hypothesis testing:
- State the null hypothesis and alternative hypothesis.
- Find an estimator of the parameter.
- Derive the distribution of the estimator.
- Based on the distribution of the estimator, find a test statistic whose distribution has no unknown parameters.
- Calculate the p-value. This is the probability, under the null hypothesis, of
sampling a test statistic at least as extreme as that which was observed. - Select a significance level(显著性水平) α, a probability threshold below which the null hypothesis will be rejected, and then make your conclusion.
Solution.
Let \(n=333\), and \(X_1, \ldots, X_n\) be the body mass of the penguins. To test \(\mu=\mu_0\) :
- State the \(H_0\) and \(H_1\).
- Find the estimator:
- Derive the distribution of \(\bar{X}\) :
where \(\mu\) is the population mean of \(X\), and \(\sigma^2\) is the population variance of \(X\). As \(\sigma^2\) is unknown, we can use \(s^2\), the sample variance, to replace \(\sigma^2\). As a result, if \(H_0\) is true,
- Choose \(T\) as our test statistic. Given that we have observed the sample, we can calculate the value of \(T\) by
- Find the \(p\)-value.
Note that the \(p\)-value is the probability of sampling a test statistic at least as extreme as that which was observed. The more extreme event is that we observe a \(t\)-value whose absolute value is larger.
- Select a significance level \(\alpha=0.05\), and we reject \(H_0\) at the significance level \(0.05\) because \(p\)-value is less than \(0.05\).
p值:在原假设成立的情况下,观测到的点足够极端的概率。
如果\(H_0\)不成立,则4中\(\bar x\)的值不是很大,就是很小,会导致t值不是很大,就是很小。
pt(-2.106, df = 333 - 1, lower = TRUE)
pt( 2.106, df = 333 - 1, lower = FALSE)
注意有两种备择假设,单边的或者双边的。
如果是大于的情况。
\(P(T > t) = P(T > -2.106) = 0.9972\) 就不会拒绝了。
t_test <- function(x, mu0, alternative){
if(!is.vector(x) | length(x) < 2){
stop("please input a vector with length > 1")
}
if(!(alternative %in% c('two.sided', 'greater', 'lower'))){
stop("")
}
if(alternative == 'two.sided'){
n <- length(x)
xbar <- mean(x)
s <- sd(x)
t_value <- (xbar - mu0)/(s / sqrt(n))
p_value <- 2 * pt(abs(t_value), df = n - 1, lower = FALSE)
}
return(p_value)
}
penguins <- drop_na(penguins)
t_test(1:10, 4, "two.sided")
Rejection region
先把标准(拒绝域 rejection region)算出来。
t.test
t.test(x,
alternative = c("two.sided", "less", "greater"),
mu = 0,
conf.level = 0.95
)
myt <- with(penguins, t.test(body_mass_g, mu = 4300))
# with 的作用就是body_mass_g是penguins是它的一列
myt$p.value
Likelihood ratio test: how to construct a test statistic
- Assume that a random sample \(\mathbf{X}=\left\{X_1, \cdots, X_n\right\}\) of size \(n\) is available, consider the generic testing problem
construct the test statistic
where \(L(\theta \mid \mathbf{X})\) represents likelihood function of the sample \(\mathbf{X}\). The Likelihood Ratio Test with nominal level \(\alpha\) is to reject \(H_0\) iff \(\lambda(\mathrm{X})<c\), where \(c\) is determined by the size restriction:
One sample test
Eg. One sample test, if \(\sigma\) is known.
Let \(X_1, \cdots, X_n\) be a random sample from \(N\left(\mu, \sigma^2\right)\) population. Consider testing
here we have
Let \(\theta=\mu\) be our parameter.
The numerator of \(\lambda(\mathbf{X})\) is
The denominator of \(\lambda(\mathrm{X})\) is
Thus the likelihood ratio becomes
We reject \(H_0\) if \(\lambda(\mathbf{X})\) is too small, or equivalently, if
The final form of the likelihood ratio test is that $$\text{ reject } H_0 \text{ if } {|\frac{\bar X - \mu_0}{\sigma / \sqrt{n}}| > z_{\alpha/2}}$$, where \(z_{\alpha/2}\) is the upper \(\alpha / 2\) quantile of N(0,1).
Eg. One sample test, if \(\sigma\) is unknown.
Let \(\theta=\left(\mu, \sigma^2\right)\), and let
Let
It can be shown that
Example: One sample test, if \(\sigma\) is unknown
The likelihood ratio is
We reject \(H_0\) is \(\lambda(\mathbf{X})\) is too small, or equivalently, if
This is equivalent to the usual test statistic.
Two sample t test
Let \(X_{11},...,X_{1n_1} \sim N(\mu_1, \sigma_1^2)\) and \(X_{21},...,X_{2n_2} \sim N(\mu_2, \sigma_2^2)\), we want to know \(H_0: \mu_1 = \mu_2, H_1: \mu_1 \neq \mu_2\)
- H0 and H1
- Estimators and their distributions, or likelihood ratio
- test statistic and its distribution
- p-value
区分两种情况:
- \(\sigma_1 = \sigma_2\)
思路:
估计值\(\bar X_1 - \bar X_2\)
\(\bar X_1 \sim N(\mu_1, \sigma_1^2/n_1)\)
\(\bar X_2 \sim N(\mu_2, \sigma_2^2/n_2)\)
此处我们假设\(\sigma_1 = \sigma_2\)
所以\(\bar X_1- \bar X_2 \sim N(\mu_1 - \mu_2, \frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2})\)
所以 \(\frac{\bar X_1 - \bar X_2}{\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\)
但是我们不知道 \(\sigma\).
所以我们用总样本(\(X_1\)和\(X_2\))的无偏方差来估计\(\sigma\)(所以均值就是)
所以就是 \(\frac{1}{n_1+n_2+1} \sum_{i=1}^{n_1}\)
If \(\sigma_1=\sigma_2\)
The test statistic (2-sample t-statistic) is
where
The distribution of \(T\) is
- \(\sigma_1 \neq \sigma_2\)
思路:现在两个sigma不一样了,所以要分开估计\(\hat\sigma_1^2\),所以\(\bar X_1 - \bar X_2\) 是服从\(N(\mu_1 - \mu_2, \frac{\hat\sigma_1^2}{n_1} - \frac{\hat\sigma_2^2}{n_2})\)
If \(\sigma_1 \neq \sigma_2\)
The test statistic (the Welch's t-statistic) is
and its distribution is
test_means <- function(x,y, var.equal = FALSE, alternative){
n1 <- length(x)
n2 <- length(y)
mean_x <- mean(x)
mean_y <- mean(y)
if(n1 < 1 || n2 < 1 || n1+n2 < 3){
stop("x and y is not long enough")
}
if(var.equal==TRUE){
sp_square <- (n1-1)*var(x) + (n2-1)*var(y) / (n1+n2-2)
t_value <- (mean_x - mean_y) / sqrt(sp_square * (1/n1+1/n2))
df <- n1+n2-2
} else {
t.value = (mean_x - mean_y) / sqrt(var(x)/n1 + var(y)/n2)
df <- n1+n2-2###not sure
}
p_value <- 0
if(alternative -- 'two-sided'){
p_value <- 2* pt(abs(t_value), df = df, lower = FALSE)
}
return(p_value)
}
Paired Data
- sample sizes are equal
- samples are not independent
Paired T test
\(D_i = X_{1i} - X_{2i}\), 所以想要检测\(X_1\)的均值和\(X_2\)的均值的是否相等,可以转化为检验\(\mu_D = 0\)。就转化成了一个单样本的检验。
假设\(D \sim N(\mu_D, \sigma_D^2)\)
library(MASS)
data(UScrime)
qqnorm(UScrime$Po1)
在做t检验之间,应该要检测变量是否服从正态分布。
Several ways to test normality
shapiro.test()
检验变量是否服从正态分布,原假设一般是是正态,备择假设一般是不是正态。
nortest::ad.test()
ks.test()
Nonparametric tests
Wilcoxon Signed-Rank Test
既可以用做单样本,也可以用作双样本。
One Sample Case
Let \(Y_1, · · · , Y_n\) be i.i.d. random variables with the common absolutely
continuous distribution F.
- F: symmetric.
we want to test:$$H_0: \mu = 0, \text{ v.s. } H_1: \mu \neq 0$$
Let \(R_1, . . . , R_n\) be the rank of
\(|Y_1 - \mu_0|, ..., |Y_n-\mu_0|\).
Define $$\Psi_i=
\begin{cases}
0& \text{if } Y_i-\mu_0 < 0\
1& \text{if } Y_i-\mu_0 \ge 0
\end{cases}$$
the Wilcoxon’s signed rank statistic is defined by \(W = \sum_{i=1}^n R_i\Psi_i\)
y <- c(3,5,9,10,4)
mu0 <- 5
(y.cen.abs <- abs(y-mu0))
(r <- rank(y.cen.abs))
(psi <- (y > mu0))
(w <- sum(r*psi))
A large sample Wilcoxon signed rank test
Theorem.
When \(H_0:\mu=\mu_0\) is true, we have $$E(W) = \frac{n(n+1)}{4}, \quad Var(W)=\frac{n(n+1)(2n+1)}{24}$$
Also, for \(n\ge 12\), the distribution of $$Z = \frac{W - \frac{n(n+1)}{4}}{\sqrt{Var(W)=\frac{n(n+1)(2n+1)}{24}}} \rightarrow N(0,1)$$
Let \(w\) be the signed rank statistic based on \(n\) independent observations, each drawn from a continuous and symmetric pdf, where \(n \geq 12\). Let
- To test \(H_0: \mu=\mu_0\) and \(H_1: \mu>\mu_0\) at the \(\alpha\) level of significance, we reject \(H_0\) if \(z>z_\alpha\).
- To test \(H_0: \mu=\mu_0\) and \(H_1: \mu<\mu_0\) at the \(\alpha\) level of significance, we reject \(H_0\) if \(z<-z_0\).
- To test \(H_0: \mu=\mu_0\) and \(H_1: \mu \neq \mu_0\) at the \(\alpha\) level of significance, we reject \(H_0\) if \(z>z_{\alpha / 2}\) or \(z<-z_{\alpha / 2}\).
- The p-value can be calculated.
Eg.
y <- c(51, 53, 43, 36, 55, 55, 39, 43, 45, 27, 21, 26, 22, 43)
hist(y)
r <- rank(abs(y-28))
psi <- (y>28)
w <- sum(r * psi)
n <- length(y)
mean_w <- n * (n + 1) / 4
sd_w <- sqrt(n * (n + 1) * (2 * n + 1) / 24)
(z <- (w - mean_w)/sd_w)
(z_crit <- qnorm(0.95))
(p_value <- pnorm(z,lower.tail=FALSE))
wilcox.test(y,mu=28,alternative = "greater",exact=FALSE)$p.value
wilcox.test(y,mu=28,alternative = "greater",exact=FALSE,correct=FALSE)$p.value
Testing Paired data
- A Wilcoxon signed rank test can also be used on paired data to test \(H_0: \mu_D=0\), where \(\mu_D=\mu_X-\mu_Y\).
- Suppose we have a random sample of size \(n,\left(x_i, y_i\right), 1 \leq i \leq n\), and let \(d_i=x_i-y_i\).
- Let \(r_i\) be the rank of \(\left|x_i-y_i\right|\) in the set
- Define \(\psi_i=I\left(x_i>y_i\right)\)
- Let \(w=\sum_{i=1}^n r_i \psi_i\)
Eg.
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
wilcox.test(x,y,paired=TRUE, alternative = "greater")
Testing two independent random samples*
- Let \(X_1 \ldots, X_n\) and \(Y_1, \ldots, Y_m\) be two independent random samples of sizes \(n\) and \(m\) from \(f_X\) and \(f_Y\), respectively.
- For \(1 \leq i \leq n+m\), let \(R_i\) be the rank of the \(i\) th observation in the combined sample, and let
\(\Psi_i= \begin{cases}1 & \text { if the } i \text { th observation came from } f_X, \\ 0 & \text { if the } i \text { th observation came from } f_Y .\end{cases}\)
Define
We have
利用中心极限定理,当n足够大时,\(\bar X - \mu_0}{s/\sqrt{n}\) 服从\(N(0,1)\), 即$$P(N(0,1) > t_{\alpha, n-1} | X_1,...,X_n \sim U[0,2]$$,此处,通过生成M组X,来选择最P最接近\(\alpha\)的一组X。
n <- 20
alpha <- 0.05
mu0 <- 1
M <- 10000
t_value <- numeric(M)
set.seed(308)
for (j in 1:M){
x <- runif(n, mu0 - 1, mu0 + 1)
t_value[j] <- (mean(x) - mu0)/(sd(x) /sqrt(n))
}
is.reject <- (t_value > qt(alpha, df = n - 1, lower = FALSE))
mean(is.reject)
Eg.How about we replace U(0, 2) by the exponential distribution?
n <- 20
alpha <- 0.05
mu0 <- 1
M <- 10000
t_value <- numeric(M)
set.seed(308)
for (j in 1:M){
x <- rexp(n, mu0)
t_value[j] <- (mean(x) - mu0)/(sd(x) / sqrt(n))
}
is.reject <- (t_value > qt(alpha, df = n - 1, lower = FALSE))
mean(is.reject)
type-I-error 小了,但是type-II-error就大了,所以power小了,不好。
在分布未知的情况下,如何去找分位点做假设?
用MC生成一堆样本,用样本0.05的分位点,去做假设。
下面用非参的方法去做检验。(用于真实分布不是正态的)
n <- 20
alpha <- 0.05
mu0 <- 1
M <- 10000
p_value <- numeric(M)
set.seed(308)
for (j in 1:M){
x <- runif(n, mu0 - 1, mu0 + 1)
p_value[j] <- wilcox.test(x, mu = 1, alternative = "greater")$p.value
}
is.reject <- (p_value < alpha)
mean(is.reject)
此处wilcox的效果也不太好,因为不满足所需的对称性。