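Statistics II: Notes
I originally meant to write up everything from this semester's course, but my notes are a bit messy, and the course leans heavily on mathematical derivation, so there are a lot of formulas. So I am only writing up the outline from the recitation sessions, which covers most of the semester's main content (mostly because I am lazy 🐶).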
Estimation: parameter \(\theta\), estimator \(\hat\theta\)
- MLE
- LSE
Unbiased: \(E[\hat\theta]=\theta\)
Hypothesis Testing
\(H_0: \theta\in\Theta_0\leftrightarrow H_1:\theta\in\Theta_0^C\)
e.g. normal dist. \(x_1,...,x_n\overset{iid}{\sim} N(\theta,1)\), estimate the parameter \(\theta\)
Confidence Interval
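As a small illustration of the three ideas above on the \(N(\theta,1)\) example, here is a minimal sketch. The sample values, the simple null \(H_0:\theta=\theta_0\), and the function name `normal_mean_inference` are assumptions made for the example, not part of the course material.

```python
from math import sqrt
from statistics import NormalDist, mean

def normal_mean_inference(xs, theta0=0.0, alpha=0.05):
    """MLE, two-sided z-test and CI for theta when x_i ~ N(theta, 1)."""
    n = len(xs)
    theta_hat = mean(xs)                    # MLE (and LSE) of theta: the sample mean
    se = 1 / sqrt(n)                        # sd of the sample mean, sigma = 1 known
    z = (theta_hat - theta0) / se           # test statistic for H0: theta = theta0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (theta_hat - z_crit * se, theta_hat + z_crit * se)
    return theta_hat, z, abs(z) > z_crit, ci

xs = [0.8, 1.3, 0.4, 1.9, 1.1, 0.7, 1.6, 1.2, 0.9]   # made-up sample
theta_hat, z, reject, ci = normal_mean_inference(xs, theta0=0.0)
print(theta_hat, z, reject, ci)
```

The estimator is unbiased since \(E[\bar x]=\theta\), and the CI is exactly the set of \(\theta_0\) values that the two-sided test would not reject.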
Regression
Model: \(Y=X\beta+\epsilon\)
- Y: response vector
- X: Data matrix, design matrix
- \(\beta\): unknown parameters
- \(\epsilon\): error\(\sim N(0,\sigma^2I_n)\)
Estimation
- LSE: \(\hat\beta=(X^TX)^{-1}X^TY\overset{\triangle}{=}b\)
- MLE
- BLUE: best linear unbiased estimator (Gauss-Markov theorem)
Fitted Model of Y: \(\hat Y=Xb=X(X^TX)^{-1}X^TY\overset{\triangle}{=}HY\)
Residual: \(e=Y-\hat Y=(I-H)Y\)
Residual sum of squares: \(SSE=Y^T(I-H)Y\)
Estimator of \(\sigma^2\): \(\hat\sigma^2\overset{\triangle}{=}s^2={SSE\over n-k-1}\), unbiased
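Distribution of \(b\) and \(s^2\): \(b\sim N(\beta,\sigma^2(X^TX)^{-1})\), \({(n-k-1)s^2\over\sigma^2}\sim\chi^2_{n-k-1}\), and \(b\) and \(s^2\) are independent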
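The estimation formulas above can be sketched numerically as follows. The simulated data, sizes and true coefficients are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2                                    # made-up sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -0.5])               # made-up true coefficients
Y = X @ beta + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y                           # LSE: b = (X^T X)^{-1} X^T Y
H = X @ XtX_inv @ X.T                           # hat matrix
Y_hat = H @ Y                                   # fitted values
e = Y - Y_hat                                   # residuals, e = (I - H) Y
SSE = float(Y @ (np.eye(n) - H) @ Y)            # residual sum of squares
s2 = SSE / (n - k - 1)                          # unbiased estimator of sigma^2
cov_b = s2 * XtX_inv                            # estimated covariance matrix of b
print(b, s2)
```

Forming \((X^TX)^{-1}\) explicitly mirrors the notes; for real data a solver such as `np.linalg.lstsq` is usually the numerically safer choice.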
Inference
Sum of Squares | Degrees of Freedom |
---|---|
Total: \(SST=\sum(y_i-\bar y)^2=Y^T(I-{1\over n}11^T)Y\) | \(df_T=n-1\) |
Regression: \(SSR=\sum(\hat y_i-\bar y)^2=Y^T(H-{1\over n}11^T)Y\) | \(df_R=k\) |
Residual: \(SSE=\sum(y_i-\hat y_i)^2=Y^T(I-H)Y\) | \(df_E=n-1-k\) |
\(SST=SSR+SSE\) | \(df_T=df_R+df_E\) |
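- F-test for regression relation: \(H_0: \beta_1=...=\beta_k=0\)
Test statistic: \(F={SSR/k\over SSE/(n-k-1)}\sim F_{k,n-k-1}\) under \(H_0\)
Reject \(H_0\), when \(F>F_{k,n-k-1}(\alpha)\), where \(F_{p,q}(\alpha)\) denotes the upper-\(\alpha\) point of \(F_{p,q}\)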
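- Test for \(\beta_j\): \(H_0: \beta_j=0\)
Under \(H_0\), \(t={b_j\over s\sqrt{c_{jj}}}\sim t_{n-k-1}\), where \(c_{jj}\) is the diagonal element of \((X^TX)^{-1}\) corresponding to \(\beta_j\)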
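A minimal sketch of the sum-of-squares decomposition and the two tests, assuming NumPy and a design matrix whose first column is all ones. The function name `regression_anova` and the simulated data are made up for the example; \(R^2=SSR/SST\) is included as one common measure of fit.

```python
import numpy as np

def regression_anova(X, Y):
    """Return SST, SSR, SSE, the overall F statistic, the t statistics and R^2.
    X is assumed to have a leading column of ones, so k = X.shape[1] - 1."""
    n, p = X.shape
    k = p - 1
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    Y_hat = X @ b
    SST = float(np.sum((Y - Y.mean()) ** 2))
    SSR = float(np.sum((Y_hat - Y.mean()) ** 2))
    SSE = float(np.sum((Y - Y_hat) ** 2))
    s2 = SSE / (n - k - 1)
    F = (SSR / k) / s2                          # compare with F_{k, n-k-1}
    t = b / np.sqrt(s2 * np.diag(XtX_inv))      # compare each with t_{n-k-1}
    return SST, SSR, SSE, F, t, SSR / SST

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])   # made-up data
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=30)
SST, SSR, SSE, F, t, R2 = regression_anova(X, Y)
print(F, t, R2)
```

Critical values are left to a table or a statistics library. With a single predictor (\(k=1\)) the overall \(F\) equals the square of the slope's \(t\) statistic.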
Model Fitness
e.g.
given values \(x_0=(1,x_{01},...,x_{0k})^T\), find the CI for \(x_0^T\beta\)
so the \(1-\alpha\) CI for \(x_0^T\beta\) is \([x_0^Tb\pm t_{n-k-1}(\alpha/2)s\sqrt{x_0^T(X^TX)^{-1}x_0}]\)
e.g.
CI for a new observation
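The two intervals can be compared side by side in a short sketch. It assumes the standard result that a new observation \(y_0=x_0^T\beta+\epsilon_0\) adds \(\sigma^2\) to the variance, which gives the factor \(\sqrt{1+x_0^T(X^TX)^{-1}x_0}\). The data, the function name `intervals_at`, and the tabulated critical value \(t_7(0.025)\approx2.365\) are assumptions for the example.

```python
import numpy as np

def intervals_at(X, Y, x0, t_crit):
    """CI for x0^T beta and interval for a new observation at x0.
    t_crit is the upper alpha/2 point of t_{n-k-1}, supplied by the caller."""
    n, p = X.shape                              # p = k + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y
    s = np.sqrt(np.sum((Y - X @ b) ** 2) / (n - p))
    centre = float(x0 @ b)
    q = float(x0 @ XtX_inv @ x0)
    half_mean = t_crit * s * np.sqrt(q)         # for the mean response x0^T beta
    half_new = t_crit * s * np.sqrt(1 + q)      # for a new y0 = x0^T beta + eps0
    return ((centre - half_mean, centre + half_mean),
            (centre - half_new, centre + half_new))

# made-up data: n = 9, k = 1, so df = 7
x = np.arange(1.0, 10.0)
X = np.column_stack([np.ones_like(x), x])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0])
ci, pi = intervals_at(X, Y, np.array([1.0, 5.0]), t_crit=2.365)
print(ci, pi)
```

Both intervals share the centre \(x_0^Tb\); the one for a new observation is always wider.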
e.g.
\(H_0: C\beta=0\), where \(C\in R^{d\times (k+1)}\)
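The test statistic is \(F={(SSE_R-SSE_F)/d\over SSE_F/(n-k-1)}\), which follows \(F_{d,n-k-1}\) under \(H_0\) (for \(C\) of rank \(d\))
if \(F>F_{d,n-k-1}(\alpha)\), reject \(H_0\)
- \(SSE_F\): SSE under the full model
- \(SSE_R\): SSE under the reduced model: \(Y=X\beta+\epsilon, C\beta=0\)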
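A minimal sketch of the full-versus-reduced comparison, for the special case where \(C\) picks out a subset of coefficients, so the reduced model simply drops those columns. The simulated data and the helper `sse` are made up for illustration.

```python
import numpy as np

def sse(X, Y):
    """Residual sum of squares of the least-squares fit of Y on X."""
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ b
    return float(r @ r)

rng = np.random.default_rng(1)
n, k = 40, 3                                    # made-up sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
Y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + rng.normal(size=n)

# H0: beta_2 = beta_3 = 0, i.e. C beta = 0 with C of rank d = 2
C = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
d = C.shape[0]

SSE_F = sse(X, Y)                               # full model
SSE_R = sse(X[:, :2], Y)                        # reduced model: last two columns dropped
F = ((SSE_R - SSE_F) / d) / (SSE_F / (n - k - 1))
print(F)                                        # compare with F_{d, n-k-1}(alpha)
```

For a general \(C\) the reduced fit is a constrained least-squares problem. The same statistic can also be written as \((Cb)^T[C(X^TX)^{-1}C^T]^{-1}(Cb)/(d\,s^2)\).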
Model Selection
Logistic Regression
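Model: \(y_1,...,y_n\) independent binary observations, \(y_i=1\) with prob. \(p_i\) and \(y_i=0\) with prob. \(1-p_i\), where \(logit(p_i)=\log{p_i\over 1-p_i}=x_i^T\beta\)
MLE for \(p_i\) (equivalently, for \(\beta\)): maximize \(l(\beta)=\sum_i[y_ix_i^T\beta-\log(1+e^{x_i^T\beta})]\), then \(\hat p_i=e^{x_i^T\hat\beta}/(1+e^{x_i^T\hat\beta})\)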
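The MLE has no closed form, so it is computed iteratively. Below is a minimal Newton-Raphson sketch, assuming the logit model \(logit(p_i)=x_i^T\beta\). The simulated data, the fixed iteration count and the function name `logistic_mle` are made up for the example, and there is no safeguard against perfectly separated data.

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Newton-Raphson for the logistic log-likelihood
    l(beta) = sum_i [ y_i x_i^T beta - log(1 + exp(x_i^T beta)) ]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))             # p_i = inverse logit of x_i^T beta
        score = X.T @ (y - p)                       # gradient of l
        info = X.T @ (X * (p * (1 - p))[:, None])   # minus the Hessian of l
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0]))))   # made-up true model
y = (rng.random(n) < p_true).astype(float)

beta_hat = logistic_mle(X, y)
p_hat = 1 / (1 + np.exp(-X @ beta_hat))             # MLE of p_i, plugging in beta_hat
print(beta_hat)
```

At convergence the score \(X^T(y-\hat p)\) is zero, which is the likelihood equation for this model.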
Generalized linear model
GLM:
- \(y_i\): random component from a particular dist. in the exponential family
- \(X\beta\): linear predictor
- \(g\): link func.
Exponential family
Normal, Poisson, Binary, Gamma
e.g.: Gamma dist:
Mean and variance
Moment generating func.
For \(X\sim\) exponential family with density \(f(x;\lambda)=e^{\lambda x-r(\lambda)}f_0(x)\), \(EX=r'(\lambda), Var(X)=r''(\lambda)\)
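As a quick numerical check of the mean and variance identities on the Gamma example, here is a sketch. It assumes the one-parameter form \(f(x;\lambda)=e^{\lambda x-r(\lambda)}f_0(x)\) and a Gamma distribution with known shape \(a\) and rate \(b\), for which \(\lambda=-b\) and \(r(\lambda)=-a\log(-\lambda)\). The values of \(a\) and \(b\) are made up.

```python
from math import log

# Gamma(shape=a, rate=b), a known: f(x) = exp(lam*x - r(lam)) * f0(x)
# with lam = -b, r(lam) = -a*log(-lam), f0(x) = x**(a-1) / Gamma(a).
a, b = 3.0, 2.0          # made-up shape and rate
lam = -b

def r(l):
    return -a * log(-l)

h = 1e-4
r1 = (r(lam + h) - r(lam - h)) / (2 * h)               # numerical r'(lam)
r2 = (r(lam + h) - 2 * r(lam) + r(lam - h)) / h ** 2   # numerical r''(lam)

print(r1, a / b)         # both approximate the mean a/b
print(r2, a / b ** 2)    # both approximate the variance a/b^2
```

Differentiating by hand gives the same result: \(r'(\lambda)=-a/\lambda=a/b\) and \(r''(\lambda)=a/\lambda^2=a/b^2\).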
- identity link func. \(g(u)=u\) + normal response: GLM reduces to linear regression;
- logit link func. \(g(u)=logit(u)\) + binary response: GLM reduces to logistic regression
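A key idea in GLM is to represent \(\lambda_1,...,\lambda_n\) as linear functions of the covariates: \(\lambda_i=x_i^T\beta\)
density func. of \(y\) is \(f(y;\beta)=\exp\{\beta^TX^Ty-\varphi(\beta)\}f_0(y)\)
where \(\varphi(\beta)=\sum r(x_i^T\beta), f_0(y)=\prod f_0(y_i)\)
loglikelihood: \(l(\beta)=\beta^TX^Ty-\varphi(\beta)+\log f_0(y)\)
then, MLE of \(\beta\), denoted by \(\hat\beta\), satisfies \(X^Ty=\nabla\varphi(\hat\beta)=\sum_i r'(x_i^T\hat\beta)x_i\)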
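To see where the likelihood equation comes from, the sketch below compares a finite-difference gradient of the log-likelihood with the closed form \(X^T(y-r'(X\beta))\). It assumes the canonical form \(\lambda_i=x_i^T\beta\) and uses Poisson responses, for which \(r(\lambda)=e^\lambda\). The simulated data and the evaluation point are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.3, 0.5])                     # made-up evaluation point
y = rng.poisson(np.exp(X @ beta)).astype(float)

r = np.exp                                      # Poisson: r(lambda) = exp(lambda)
r_prime = np.exp                                # so r'(lambda) = exp(lambda) too

def loglik(bv):
    # l(beta) up to the constant log f_0(y)
    return float(bv @ X.T @ y - np.sum(r(X @ bv)))

def score(bv):
    # gradient of l: X^T (y - r'(X beta)); the MLE solves score(beta_hat) = 0
    return X.T @ (y - r_prime(X @ bv))

# central finite differences of loglik along each coordinate
h = 1e-6
num = np.array([(loglik(beta + h * e) - loglik(beta - h * e)) / (2 * h)
                for e in np.eye(2)])
print(num, score(beta))
```

Setting the score to zero gives \(X^Ty=\sum_i r'(x_i^T\hat\beta)x_i\): the fitted means reproduce \(X^Ty\).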
ANOVA (one-way)
Model: \(y_{ij}=\mu_i+\epsilon_{ij}\), \(\epsilon_{ij}\overset{iid}{\sim}N(0,\sigma^2)\)
for \(i=1,...,r\), \(j=1,...,n_i\), with \(\sum_{i=1}^r n_i=n\)
Estimation
LSE, MLE for \(\mu_i\): \(\hat\mu_i=\bar y_{i.}\)
unbiased estimator for \(\sigma^2\): \(MSE=S_e/(n-r)\)
Inference
Hypothesis: \(H_0: \mu_1=...=\mu_r\)
SS | degrees of freedom | mean square |
---|---|---|
\(SST=\sum_i\sum_j(y_{ij}-\bar y_{..})^2\) | \(df=n-1\) | |
\(S_e=\sum_i\sum_j(y_{ij}-\bar y_{i.})^2\) | \(df=n-r\) | \(MSE=S_e/(n-r)\) |
\(S_A=\sum_i\sum_j(\bar y_{i.}-\bar y_{..})^2\) | \(df=r-1\) | \(MSA=S_A/(r-1)\) |
\(SST=S_A+S_e\) | \(n-1=(r-1)+(n-r)\) | |
Test statistic: \(F={MSA\over MSE}={S_A/(r-1)\over S_e/(n-r)}\), with \(F\sim F_{r-1,n-r}\) under \(H_0\)
Reject \(H_0\), when \(F>F_{r-1,n-r}(\alpha)\)
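The one-way decomposition and the \(F\) statistic \(MSA/MSE\) can be computed directly from the group data. This is a minimal sketch in plain Python; the three groups and the function name `one_way_anova` are made up for the example.

```python
def one_way_anova(groups):
    """groups[i] holds the observations y_{i1}, ..., y_{i n_i} of group i."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n                 # \bar y_{..}
    means = [sum(g) / len(g) for g in groups]               # \bar y_{i.}
    S_A = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    S_e = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    SST = sum((y - grand) ** 2 for g in groups for y in g)
    MSA, MSE = S_A / (r - 1), S_e / (n - r)
    return SST, S_A, S_e, MSA / MSE

groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]   # made-up data
SST, S_A, S_e, F = one_way_anova(groups)
print(SST, S_A, S_e, F)   # here S_A = 38, S_e = 6, SST = 44 and F = 19, up to rounding
```

With \(r=3\) and \(n=9\), the statistic would be compared with \(F_{2,6}(\alpha)\) from a table.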
Two-factor ANOVA
Model (sample level)
\(y_{ijk}=\mu_{ij}+\epsilon_{ijk}\), \(\epsilon_{ijk}\overset{iid}{\sim}N(0,\sigma^2)\)
Cell mean: \(\mu_{ij}=\mu+\alpha_i+\beta_j+r_{ij}\), so \(y_{ijk}=\mu+\alpha_i+\beta_j+r_{ij}+\epsilon_{ijk}\), for \(i=1,...,a\), \(j=1,...,b\), \(k=1,...,n\)
Estimation
Inference (with interaction)
\(r_{ij}\ne0\)
Revised (no interaction, \(r_{ij}=0\)): \(y_{ijk}=\mu+\alpha_i+\beta_j+\epsilon_{ijk}\)
SS | Full | Revised |
---|---|---|
\(S_{AB}\) | \(df=ab-1\) | \(df=a+b-2\) |
\(SSA\) | \(df=a-1\) | \(df=a-1\) |
\(SSB\) | \(df=b-1\) | \(df=b-1\) |
\(SSAB\) | \(df=(a-1)(b-1)\) | |
\(S_e\) | \(df=ab(n-1)\) | \(df=abn-a-b+1\) |
\(SST\) | \(df=abn-1\) | \(df=abn-1\) |
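Test for interaction: \(H_0: r_{ij}=0\) for all \(i,j\); \(F={SSAB/((a-1)(b-1))\over S_e/(ab(n-1))}\sim F_{(a-1)(b-1),ab(n-1)}\) under \(H_0\); reject \(H_0\) when \(F>F_{(a-1)(b-1),ab(n-1)}(\alpha)\)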
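For a balanced design the full-model decomposition in the table can be computed from cell, row and column means. The sketch below assumes NumPy and data stored as an array of shape \((a,b,n)\). The random data and the function name `two_way_anova` are made up for the example.

```python
import numpy as np

def two_way_anova(y):
    """y has shape (a, b, n): y[i, j, k] is replicate k in cell (i, j), n >= 2."""
    a, b, n = y.shape
    grand = y.mean()
    mean_i = y.mean(axis=(1, 2))                # \bar y_{i..}
    mean_j = y.mean(axis=(0, 2))                # \bar y_{.j.}
    mean_ij = y.mean(axis=2)                    # \bar y_{ij.}
    SSA = b * n * np.sum((mean_i - grand) ** 2)
    SSB = a * n * np.sum((mean_j - grand) ** 2)
    SSAB = n * np.sum((mean_ij - mean_i[:, None] - mean_j[None, :] + grand) ** 2)
    S_e = np.sum((y - mean_ij[:, :, None]) ** 2)
    SST = np.sum((y - grand) ** 2)
    # interaction test: compare with F_{(a-1)(b-1), ab(n-1)}
    F_AB = (SSAB / ((a - 1) * (b - 1))) / (S_e / (a * b * (n - 1)))
    return SSA, SSB, SSAB, S_e, SST, F_AB

rng = np.random.default_rng(4)
y = rng.normal(size=(3, 4, 5))                  # made-up balanced data: a=3, b=4, n=5
SSA, SSB, SSAB, S_e, SST, F_AB = two_way_anova(y)
print(SSA, SSB, SSAB, S_e, SST, F_AB)
```

In the model without interaction, the error sum of squares is `SSAB + S_e`, with \(df=abn-a-b+1\), matching the second column of the table.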
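One more remark on the two-factor ANOVA above: the table gives the sum-of-squares decomposition and the degrees of freedom both with and without interaction. Hypothesis tests for the interaction or the main effects, and CIs for a mean or for a linear combination of means, all follow the same idea. Only the test statistic for the interaction is given here; for every other quantity the method is the same. First find the distribution of the quantity of interest (e.g. under the null, the sum of squares for the main effect of A or B divided by \(\sigma^2\) is chi-square with the corresponding degrees of freedom, and a mean is normal with a known multiple of \(\sigma^2\) as its variance). That distribution involves the unknown parameter \(\sigma^2\), which is eliminated using \(s^2=MSE=S_e/df\), giving a t or F distribution. Then solve according to what the problem asks for (a hypothesis test or a CI). The one thing to watch is whether the two factors interact, because that determines the degrees of freedom of \(MSE\).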