Statistical Computing: Linear Models
library(dplyr)
library(ggplot2)
library(knitr)
library(palmerpenguins)
library(tidyr)
Regression
keys: response variable, predictor variable
Goal: make predictions, and report how much confidence we can place in each prediction.
Simple Linear Regression
samples: \((X_i,Y_i)\) for \(1 \le i \le n\)
model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)
Usually, we assume \(\epsilon_i \sim N(0, \sigma^2)\). If the errors had a nonzero mean, i.e. \(E(\epsilon_i) = \mu \neq 0\), then \(\mu\) could be absorbed into \(\beta_0\), so the solution would not be unique.
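To make the model concrete, here is a minimal simulation sketch; the parameter values below are made up for illustration:

```r
set.seed(1)
n <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 1   # hypothetical true parameters
x_sim <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = sigma)  # errors centered at 0
y_sim <- beta0 + beta1 * x_sim + eps
plot(x_sim, y_sim)                     # scatter around a line with slope 0.5
```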
Least Square Method
For each \(i\), the squared error is \((Y_i - \hat Y_i)^2 = (Y_i - b_0 - b_1X_i)^2\).
The summed error is \(L(b_0, b_1) = \sum_{i=1}^n(Y_i - \hat Y_i)^2 = \sum_{i=1}^n(Y_i - b_0 - b_1X_i)^2\).
Solve the equations \(\frac{\partial L}{\partial b_0} = 0, \frac{\partial L}{\partial b_1} = 0\).
The solution is $$\hat \beta_1 = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i = 1}^n(X_i - \bar X)^2}, \hat \beta_0 = \bar Y - \hat \beta_1\bar X$$
\(\hat \beta_1\) is exactly the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and the fitted line always passes through the point \((\bar X, \bar Y)\).
Eg.
x <- penguins$bill_length_mm
y <- penguins$body_mass_g
ok <- complete.cases(x, y)  # penguins contains NAs; drop those rows first
x <- x[ok]
y <- y[ok]
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)
Conditions of regression
- Quantitative variable condition
- Straight enough condition
- Outlier condition (is each outlier a rare but genuine observation, or an error in the experiment?)
- "Does the plot thicken?" condition (if the \(X_i\) are too concentrated, the denominator \(\sum_{i=1}^n(X_i - \bar X)^2\) in the formula for \(\hat \beta_1\) is small and the estimate becomes very inaccurate)
- Normality condition
- Homoscedastic or heteroscedastic errors?
Residuals
One can show that \(\sum_{i = 1}^n e_i = 0\), so mean(\(e_i\)) = 0.
Exercise: verify that \(\sum_{i = 1}^n x_ie_i = 0\) as well.
yhat <- beta0_hat + beta1_hat * x
residual <- y - yhat
hist(residual)
plot(x, residual)
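The two identities above (\(\sum e_i = 0\) and \(\sum x_i e_i = 0\)) can be checked numerically; this sketch refits the penguins example from scratch, dropping NA rows first:

```r
library(palmerpenguins)
ok <- complete.cases(penguins$bill_length_mm, penguins$body_mass_g)
x <- penguins$bill_length_mm[ok]
y <- penguins$body_mass_g[ok]
b1 <- cov(x, y) / var(x)               # least-squares slope
b0 <- mean(y) - b1 * mean(x)           # least-squares intercept
e <- y - (b0 + b1 * x)                 # residuals
sum(e)      # ~ 0, up to floating-point error
sum(x * e)  # ~ 0 as well
```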
Assume \(\epsilon_i \sim N(0,\sigma^2)\), \(\hat \beta_0\) and \(\hat \beta_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\), respectively.
Given \(X_1,...,X_n\), we can derive the conditional distribution of \(\hat \beta_0\) and \(\hat \beta_1\): $$\hat \beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum_{i = 1}^n(X_i - \bar X)^2}\right), \quad \hat \beta_0 \sim N\left(\beta_0, \frac{\frac{1}{n}\sum_{i=1}^n X_i^2}{\sum_{i= 1}^n (X_i - \bar X)^2}\sigma^2\right)$$
Here, \(\sigma^2\) is the variance of \(\epsilon_i\), which can be estimated by $$s^2 = \frac{1}{n - 2}\sum_{i = 1}^n(y_i - \hat \beta_0 - \hat \beta_1 x_i)^2.$$ We call \(s\) the residual standard error. (d.f. = \(n - 2\) because estimating the two coefficients costs two degrees of freedom.)
The standard errors of the coefficients are $$se(\hat \beta_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}} \approx \hat {se}(\hat \beta_1) = \frac{s}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}}$$
n <- length(x)
s <- sqrt(sum(residual^2)/(n - 2))
beta1_se <- s/sqrt(sum((x - mean(x))^2))
beta1_se
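The sampling distribution of \(\hat \beta_1\) stated above can be checked with a small Monte Carlo experiment; this is a sketch with hypothetical parameter values:

```r
set.seed(42)
beta0 <- 1; beta1 <- 2; sigma <- 0.5   # hypothetical true parameters
x <- runif(50, 0, 5)                   # fixed design points
b1_hats <- replicate(2000, {
  y <- beta0 + beta1 * x + rnorm(50, 0, sigma)
  cov(x, y) / var(x)                   # least-squares slope for this sample
})
mean(b1_hats)                          # close to beta1: unbiased
sd(b1_hats)                            # close to the theoretical se below
sigma / sqrt(sum((x - mean(x))^2))
```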
Moreover, $$\frac{(n-2)s^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \hat Y_i)^2 \sim \chi_{n-2}^2.$$
Therefore, $$\frac{\hat \beta_1 - \beta_1}{\hat {se}(\hat \beta_1)} \sim t_{n-2}$$
Hypothesis testing
\(H_0: \beta_1 = 0\),\(H_1: \beta_1 \neq 0\)
Using $$T = \frac{\hat \beta_1 - 0}{\hat {se} (\hat \beta_1)} \sim t_{n-2}$$
t_value <- beta1_hat / beta1_se
t_value
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)  # two-sided test
p_value
R code for linear models
lm1 <- lm(y ~ x)  # fit from the two vectors
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)  # equivalent, using the data frame (lm drops NA rows automatically)
summary(lm1)
names(lm1)
lm1$coefficients
lm1$rank
lm1$df.residual
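The hand-computed slope, standard error, t value, and p value should all match the coefficient table that summary() produces:

```r
library(palmerpenguins)
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
# columns: Estimate, Std. Error, t value, Pr(>|t|)
summary(lm1)$coefficients
# the slope row can be extracted by name:
summary(lm1)$coefficients["bill_length_mm", ]
```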
Predictions
x_new <- rnorm(10, mean(x), sd(x))
y_new <- beta0_hat + beta1_hat * x_new
y_new
data_new <- data.frame(bill_length_mm = x_new) # the column name must match the predictor name used in the fit
predict(lm1, newdata = data_new)
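predict() can also quantify the uncertainty of a prediction, which is the confidence promised at the start of these notes: interval = "confidence" gives an interval for the mean response, and interval = "prediction" for a single new observation:

```r
library(palmerpenguins)
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
data_new <- data.frame(bill_length_mm = c(40, 45, 50))
ci <- predict(lm1, newdata = data_new, interval = "confidence")
pi <- predict(lm1, newdata = data_new, interval = "prediction")  # always wider
ci
pi
```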
Other useful functions
- summary(): model summary
- coefficients() or coef(): estimated coefficients
- confint(): confidence intervals for the coefficients
- fitted(): fitted values
- residuals(): residuals
- plot(): diagnostic plots
- predict(): predictions for new data
par(mfrow = c(2, 2))
plot(lm1)
Nonlinear regression
lm2 <- lm(log(body_mass_g) ~ log(bill_length_mm), data = penguins)
lm3 <- lm(body_mass_g ~ bill_length_mm + I(bill_length_mm^2), data = penguins)
The formula operators used by lm have the following meanings:
- x1:x2 denotes the interaction term \(x_1 x_2\)
- x1*x2 denotes x1 + x2 + x1:x2
- (x1 + x2)^2 denotes x1 + x2 + x1:x2
- (x1*x2) - x1 denotes x2 + x1:x2
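These expansions can be verified by inspecting the design matrix each formula produces; a sketch with made-up data:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5), y = rnorm(5))
colnames(model.matrix(y ~ x1:x2, d))        # "(Intercept)" "x1:x2"
colnames(model.matrix(y ~ x1 * x2, d))      # adds "x1" and "x2"
colnames(model.matrix(y ~ (x1 + x2)^2, d))  # same columns as x1 * x2
colnames(model.matrix(y ~ x1 * x2 - x1, d)) # "x1" is dropped
```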
Multiple linear regression
Eg.
library(datasets)
df <- state.x77 %>%
as_tibble() %>%
select(Murder, Population, Illiteracy, Income, Frost)
df
cor(df)
library(car)
scatterplotMatrix(df)
mul_lm <- lm(Murder ~ ., data = df)
summary(mul_lm)
Note: the correlation matrix shows that Murder and Frost are negatively correlated, yet the linear model gives Frost a positive coefficient.
The correlation matrix looks at the two variables in isolation, before any model is fit; in the linear model, the Frost coefficient measures its effect while all the other predictors are held fixed.
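The sign flip can be reproduced directly by fitting Frost alone versus together with the other predictors:

```r
library(dplyr)
df <- state.x77 %>%
  as_tibble() %>%
  select(Murder, Population, Illiteracy, Income, Frost)
coef(lm(Murder ~ Frost, data = df))["Frost"]  # negative on its own
coef(lm(Murder ~ ., data = df))["Frost"]      # positive with the others fixed
```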
Eg2. Examine how mpg depends on hp, wt, and their interaction.
mt <- mtcars %>%
as_tibble() %>%
select(mpg, hp, wt)
cor(mt)
int_lm <- lm(mpg ~ hp + wt + hp:wt, data = mt)
summary(int_lm)
The F statistic at the bottom tests whether all the slope coefficients are simultaneously zero.
We have \(\widehat{mpg} = 49.81 - 0.12 \times hp - 8.22 \times wt + 0.03 \times (hp \cdot wt)\).
If wt = 2.2, \(\widehat{mpg} = 31.41 - 0.06 \times hp\)
If wt = 3.2, \(\widehat{mpg} = 23.37 - 0.03 \times hp\)
If wt = 4.2, \(\widehat{mpg} = 15.33 - 0.003 \times hp\)
library(effects)
plot(effect("hp:wt", int_lm, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)
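The simplified equations above can also be generated from the fitted coefficients instead of by hand (mpg ~ hp * wt expands to the same model as hp + wt + hp:wt):

```r
int_lm <- lm(mpg ~ hp * wt, data = mtcars)
b <- coef(int_lm)  # order: (Intercept), hp, wt, hp:wt
for (w in c(2.2, 3.2, 4.2)) {
  cat("wt =", w,
      ": intercept =", round(b[1] + b[3] * w, 2),
      ", hp slope =", round(b[2] + b[4] * w, 3), "\n")
}
```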