Statistical Computing: Linear Models
library(dplyr)
library(ggplot2)
library(knitr)
library(palmerpenguins)
library(tidyr)
Regression
keys: response variable, predictor variable
Goal: make predictions, and report how much confidence we can place in each prediction.
Simple Linear Regression
samples: \((X_i,Y_i)\) for \(1 \le i \le n\)
model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
Usually, we assume \(\epsilon_i \sim N(0, \sigma^2)\). If the errors had a nonzero mean, i.e. \(E(\epsilon_i) = \mu \neq 0\), then \(\mu\) could be absorbed into \(\beta_0\), so the solution would not be unique.
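To make the model concrete, here is a minimal simulation sketch; the parameter values below are made up for illustration:

```r
set.seed(1)
n <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 1   # hypothetical true parameters
x_sim <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # errors centered at 0
y_sim <- beta0 + beta1 * x_sim + eps
plot(x_sim, y_sim)                     # scatter around a line with slope 0.5
```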
Least Square Method
For each \(i\), the squared error is \((Y_i - \hat Y_i)^2 = (Y_i - b_0 - b_1X_i)^2\).
The summed error is \(L(b_0, b_1) = \sum_{i=1}^n(Y_i - \hat Y_i)^2 = \sum_{i=1}^n(Y_i - b_0 - b_1X_i)^2\).
Solve the equations \(\frac{\partial L}{\partial b_0} = 0, \frac{\partial L}{\partial b_1} = 0\).
The solution is $$\hat \beta_1 = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i = 1}^n(X_i - \bar X)^2}, \hat \beta_0 = \bar Y - \hat \beta_1\bar X$$
\(\hat \beta_1\) is exactly the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and the fitted line always passes through the point \((\bar X, \bar Y)\).
Eg.
x <- penguins$bill_length_mm
y <- penguins$body_mass_g
ok <- complete.cases(x, y)  # penguins contains NAs; drop those rows first
x <- x[ok]
y <- y[ok]
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)
Conditions of regression
- Quantitative variable condition
- Straight enough condition
- Outlier condition (is each outlier a rare but genuine observation, or an error in the experiment?)
- "Does the plot thicken?" condition (if the \(X_i\) are too concentrated, the denominator \(\sum_{i=1}^n(X_i - \bar X)^2\) in the formula for \(\hat \beta_1\) is small and the estimate becomes very inaccurate)
- Normality condition
- Homoscedastic or heteroscedastic errors?
Residuals
One can show that \(\sum_{i = 1}^n e_i = 0\), so mean(\(e_i\)) = 0.
Exercise: verify that \(\sum_{i = 1}^n x_ie_i = 0\) as well.
yhat <- beta0_hat + beta1_hat * x
residual <- y - yhat
hist(residual)
plot(x, residual)
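The two identities above (\(\sum e_i = 0\) and \(\sum x_i e_i = 0\)) can be checked numerically; this sketch refits the penguins example from scratch, dropping NA rows first:

```r
library(palmerpenguins)
ok <- complete.cases(penguins$bill_length_mm, penguins$body_mass_g)
x <- penguins$bill_length_mm[ok]
y <- penguins$body_mass_g[ok]
b1 <- cov(x, y) / var(x)               # least-squares slope
b0 <- mean(y) - b1 * mean(x)           # least-squares intercept
e <- y - (b0 + b1 * x)                 # residuals
sum(e)      # ~ 0, up to floating-point error
sum(x * e)  # ~ 0 as well
```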
Assume \(\epsilon_i \sim N(0,\sigma^2)\), \(\hat \beta_0\) and \(\hat \beta_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\), respectively.
Given \(X_1,...,X_n\), we can derive the conditional distribution of \(\hat \beta_0\) and \(\hat \beta_1\): $$\hat \beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum_{i = 1}^n(X_i - \bar X)^2}\right), \quad \hat \beta_0 \sim N\left(\beta_0, \frac{\frac{1}{n}\sum_{i=1}^n X_i^2}{\sum_{i= 1}^n (X_i - \bar X)^2}\sigma^2\right)$$
Here, \(\sigma^2\) is the variance of \(\epsilon_i\), which can be estimated by $$s^2 = \frac{1}{n - 2}\sum_{i = 1}^n(y_i - \hat \beta_0 - \hat \beta_1 x_i)^2.$$ We call \(s\) the residual standard error. (d.f. = \(n - 2\) because estimating the two coefficients costs two degrees of freedom.)
The standard errors of the coefficients are $$se(\hat \beta_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}} \approx \hat {se}(\hat \beta_1) = \frac{s}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}}$$
n <- length(x)
s <- sqrt(sum(residual^2)/(n - 2))
beta1_se <- s/sqrt(sum((x - mean(x))^2))
beta1_se
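The sampling distribution of \(\hat \beta_1\) stated above can be checked with a small Monte Carlo experiment; this is a sketch with hypothetical parameter values:

```r
set.seed(42)
beta0 <- 1; beta1 <- 2; sigma <- 0.5   # hypothetical true parameters
x <- runif(50, 0, 5)                   # fixed design points
b1_hats <- replicate(2000, {
  y <- beta0 + beta1 * x + rnorm(50, 0, sigma)
  cov(x, y) / var(x)                   # least-squares slope for this sample
})
mean(b1_hats)                          # close to beta1: unbiased
sd(b1_hats)                            # close to the theoretical se below
sigma / sqrt(sum((x - mean(x))^2))
```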
Moreover, $$\frac{(n-2)s^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \hat Y_i)^2 \sim \chi_{n-2}^2.$$
Therefore, $$\frac{\hat \beta_1 - \beta_1}{\hat {se}(\hat \beta_1)} \sim t_{n-2}$$
Hypothesis testing
\(H_0: \beta_1 = 0\),\(H_1: \beta_1 \neq 0\)
Using $$T = \frac{\hat \beta_1 - 0}{\hat {se} (\hat \beta_1)} \sim t_{n-2}$$
t_value <- beta1_hat / beta1_se
t_value
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)  # two-sided test
p_value
R code for linear models
lm1 <- lm(y ~ x)  # fit from the two vectors
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)  # equivalent, using the data frame (lm drops NA rows automatically)
summary(lm1)
names(lm1)
lm1$coefficients
lm1$rank
lm1$df.residual
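The hand-computed slope, standard error, t value, and p value should all match the coefficient table that summary() produces:

```r
library(palmerpenguins)
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
# columns: Estimate, Std. Error, t value, Pr(>|t|)
summary(lm1)$coefficients
# the slope row can be extracted by name:
summary(lm1)$coefficients["bill_length_mm", ]
```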
Predictions
x_new <- rnorm(10, mean(x), sd(x))
y_new <- beta0_hat + beta1_hat * x_new
y_new
data_new <- data.frame(bill_length_mm = x_new) # the column name must match the predictor name used in the fit
predict(lm1, newdata = data_new)
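predict() can also quantify the uncertainty of a prediction, which is the confidence promised at the start of these notes: interval = "confidence" gives an interval for the mean response, and interval = "prediction" for a single new observation:

```r
library(palmerpenguins)
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
data_new <- data.frame(bill_length_mm = c(40, 45, 50))
ci <- predict(lm1, newdata = data_new, interval = "confidence")
pi <- predict(lm1, newdata = data_new, interval = "prediction")  # always wider
ci
pi
```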
Other useful functions
- summary(): model summary
- coefficients() or coef(): estimated coefficients
- confint(): confidence intervals for the coefficients
- fitted(): fitted values
- residuals(): residuals
- plot(): diagnostic plots
- predict(): predictions for new data
par(mfrow = c(2, 2))
plot(lm1)
Nonlinear regression
lm2 <- lm(log(body_mass_g) ~ log(bill_length_mm), data = penguins)
lm3 <- lm(body_mass_g ~ bill_length_mm + I(bill_length_mm^2), data = penguins)
The formula operators used by lm have the following meanings:
- x1:x2 denotes the interaction term \(x_1 x_2\)
- x1*x2 denotes x1 + x2 + x1:x2
- (x1 + x2)^2 denotes x1 + x2 + x1:x2
- (x1*x2) - x1 denotes x2 + x1:x2
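These expansions can be verified by inspecting the design matrix each formula produces; a sketch with made-up data:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5), y = rnorm(5))
colnames(model.matrix(y ~ x1:x2, d))        # "(Intercept)" "x1:x2"
colnames(model.matrix(y ~ x1 * x2, d))      # adds "x1" and "x2"
colnames(model.matrix(y ~ (x1 + x2)^2, d))  # same columns as x1 * x2
colnames(model.matrix(y ~ x1 * x2 - x1, d)) # "x1" is dropped
```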
Multiple linear regression
Eg.
library(datasets)
df <- state.x77 %>%
as_tibble() %>%
select(Murder, Population, Illiteracy, Income, Frost)
df
cor(df)
library(car)
scatterplotMatrix(df)
mul_lm <- lm(Murder ~ ., data = df)
summary(mul_lm)
Note: the correlation matrix shows that Murder and Frost are negatively correlated, yet the linear model gives Frost a positive coefficient.
The correlation matrix looks at the two variables in isolation, before any model is fit; in the linear model, the Frost coefficient measures its effect while all the other predictors are held fixed.
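The sign flip can be reproduced directly by fitting Frost alone versus together with the other predictors:

```r
library(dplyr)
df <- state.x77 %>%
  as_tibble() %>%
  select(Murder, Population, Illiteracy, Income, Frost)
coef(lm(Murder ~ Frost, data = df))["Frost"]  # negative on its own
coef(lm(Murder ~ ., data = df))["Frost"]      # positive with the others fixed
```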
Eg2. Examine how mpg depends on hp, wt, and their interaction.
mt <- mtcars %>%
as_tibble() %>%
select(mpg, hp, wt)
cor(mt)
int_lm <- lm(mpg ~ hp + wt + hp:wt, data = mt)
summary(int_lm)
The F statistic at the bottom tests whether all the slope coefficients are simultaneously zero.
We have \(\widehat{mpg} = 49.81 - 0.12 \times hp - 8.22 \times wt + 0.03 \times (hp \cdot wt)\).
If wt = 2.2, \(\widehat{mpg} = 31.41 - 0.06 \times hp\)
If wt = 3.2, \(\widehat{mpg} = 23.37 - 0.03 \times hp\)
If wt = 4.2, \(\widehat{mpg} = 15.33 - 0.003 \times hp\)
library(effects)
plot(effect("hp:wt", int_lm, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)
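The simplified equations above can also be generated from the fitted coefficients instead of by hand (mpg ~ hp * wt expands to the same model as hp + wt + hp:wt):

```r
int_lm <- lm(mpg ~ hp * wt, data = mtcars)
b <- coef(int_lm)  # order: (Intercept), hp, wt, hp:wt
for (w in c(2.2, 3.2, 4.2)) {
  cat("wt =", w,
      ": intercept =", round(b[1] + b[3] * w, 2),
      ", hp slope =", round(b[2] + b[4] * w, 3), "\n")
}
```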