Statistical Computing: Linear Models

library(dplyr)
library(ggplot2)
library(knitr)
library(palmerpenguins)
library(tidyr)

Regression

keys: response variable, predictor variable
Goal: make predictions, and quantify how reliable those predictions are.

Simple Linear Regression

samples: \((X_i,Y_i)\) for \(1 \le i \le n\)
model: \(Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\)

Usually, we assume \(\epsilon_i \sim N(0, \sigma^2)\). If the errors had nonzero mean, i.e. \(E[\epsilon_i] = \mu \neq 0\), the solution would not be unique: \(\mu\) could be absorbed into \(\beta_0\), so the intercept would not be identifiable.

Least Square Method

For each \(i\), the squared error is \((Y_i - \hat Y_i)^2 = (Y_i - b_0 - b_1X_i)^2\).

The summed error is \(L(b_0, b_1) = \sum_{i=1}^n(Y_i - \hat Y_i)^2 = \sum_{i=1}^n(Y_i - b_0 - b_1X_i)^2\).

Solve the equations \(\frac{\partial L}{\partial b_0} = 0,\ \frac{\partial L}{\partial b_1} = 0\).

The solution is $$\hat \beta_1 = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i = 1}^n(X_i - \bar X)^2}, \hat \beta_0 = \bar Y - \hat \beta_1\bar X$$

\(\hat \beta_1\) is the sample covariance of \(X\) and \(Y\) divided by the sample variance of \(X\), and the fitted line always passes through the point \((\bar X, \bar Y)\).
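This can be checked numerically: on simulated data, `cov()/var()` reproduces the least-squares slope computed by `lm()`. A minimal sketch with made-up true coefficients:

```r
set.seed(1)
n <- 100
x <- rnorm(n, mean = 5, sd = 2)
y <- 2 + 3 * x + rnorm(n)        # true beta0 = 2, beta1 = 3

b1 <- cov(x, y) / var(x)         # slope = sample covariance / sample variance
b0 <- mean(y) - b1 * mean(x)     # line passes through (x-bar, y-bar)

fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)                        # agrees with b0, b1
```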

Eg.

x <- penguins$bill_length_mm
y <- penguins$body_mass_g
ok <- complete.cases(x, y)  # drop missing values, otherwise cov()/var() return NA
x <- x[ok]; y <- y[ok]
beta1_hat <- cov(x, y)/var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat = beta0_hat, beta1_hat = beta1_hat)

Conditions of regression

  • Quantitative variable condition
  • Straight enough condition
  • Outlier condition (is the outlier a rare event, or an error in the experiment?)
  • "Does the plot thicken?" condition (the \(X_i\) should not be too concentrated; otherwise the denominator \(\sum_{i=1}^n(X_i - \bar X)^2\) in \(\hat \beta_1\) is small and the estimate is unstable)
  • Normality condition
  • Heterogeneous or homogeneous variance?

Residuals

By direct computation, \(\sum_{i = 1}^n e_i = 0\), so the residuals have mean zero.

Exercise: verify that \(\sum_{i = 1}^n X_i e_i = 0\) as well.

yhat <- beta0_hat + beta1_hat * x
residual <- y - yhat
hist(residual)

plot(x, residual)
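Both identities can be verified numerically, up to floating-point error. A self-contained sketch on the same penguin data, with missing values dropped first:

```r
library(palmerpenguins)
dat <- na.omit(penguins[, c("bill_length_mm", "body_mass_g")])
x <- dat$bill_length_mm
y <- dat$body_mass_g
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
e  <- y - (b0 + b1 * x)  # residuals
sum(e)                   # ~ 0 up to floating-point error
sum(x * e)               # ~ 0 as well
```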

Assuming \(\epsilon_i \sim N(0,\sigma^2)\), \(\hat \beta_0\) and \(\hat \beta_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\), respectively.

Given \(X_1,\dots,X_n\), we can derive the conditional distribution of \(\hat \beta_0\) and \(\hat \beta_1\): $$\hat \beta_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i = 1}^n(X_i - \bar X)^2}\right), \quad \hat \beta_0 \sim N\left(\beta_0,\ \frac{\frac{1}{n}\sum_{i=1}^n X_i^2}{\sum_{i= 1}^n (X_i - \bar X)^2}\sigma^2\right)$$
Here, \(\sigma^2\) is the variance of \(\epsilon_i\), which can be estimated by $$s^2 = \frac{1}{n - 2}\sum_{i = 1}^n(y_i - \hat \beta_0 - \hat \beta_1 x_i)^2$$ We call \(s\) the residual standard error. (The degrees of freedom are \(n - 2\) because estimating the two coefficients costs two degrees of freedom.)

The standard errors of the coefficients are $$se(\hat \beta_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}} \approx \hat {se}(\hat \beta_1) = \frac{s}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2}}$$

n <- length(x)
s <- sqrt(sum(residual^2)/(n - 2))
beta1_se <- s/sqrt(sum((x - mean(x))^2))
beta1_se
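These hand-computed quantities can be cross-checked against `summary.lm`, which reports the same residual standard error and coefficient standard errors. A self-contained sketch:

```r
library(palmerpenguins)
dat <- na.omit(penguins[, c("bill_length_mm", "body_mass_g")])
x <- dat$bill_length_mm
y <- dat$body_mass_g
n <- length(x)

fit <- lm(y ~ x)
s_manual  <- sqrt(sum(resid(fit)^2) / (n - 2))        # residual standard error
se_manual <- s_manual / sqrt(sum((x - mean(x))^2))    # se of the slope

sm <- summary(fit)
c(s_manual, sm$sigma)                                 # identical
c(se_manual, sm$coefficients["x", "Std. Error"])      # identical
```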

Moreover, $$\frac{(n-2)s^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \hat Y_i)^2 \sim \chi_{n-2}^2$$

Therefore, $$\frac{\hat \beta_1 - \beta_1}{\hat {se}(\hat \beta_1)} \sim t_{n-2}$$

Hypothesis testing

\(H_0: \beta_1 = 0\),\(H_1: \beta_1 \neq 0\)

Using $$T = \frac{\hat \beta_1 - 0}{\hat {se} (\hat \beta_1)} \sim t_{n-2}$$

t_value <- beta1_hat / beta1_se
t_value
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)  # two-sided test
p_value
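The manual t statistic and two-sided p-value should match the Coefficients table that `summary()` prints for the fitted model. A self-contained sketch:

```r
library(palmerpenguins)
dat <- na.omit(penguins[, c("bill_length_mm", "body_mass_g")])
fit <- lm(body_mass_g ~ bill_length_mm, data = dat)
ct  <- summary(fit)$coefficients

se <- ct["bill_length_mm", "Std. Error"]
t_manual <- unname(coef(fit)["bill_length_mm"]) / se
p_manual <- 2 * pt(abs(t_manual), df = fit$df.residual, lower.tail = FALSE)

c(t_manual, ct["bill_length_mm", "t value"])   # identical
c(p_manual, ct["bill_length_mm", "Pr(>|t|)"])  # identical
```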

R-codes for linear model

lm1 <- lm(y ~ x)
lm1 <- lm(body_mass_g ~ bill_length_mm, data = penguins)
summary(lm1)
names(lm1)

lm1$coefficients

lm1$rank

lm1$df.residual

Predictions

x_new <- rnorm(10, mean(x), sd(x))
y_new <- beta0_hat + beta1_hat * x_new
y_new
data_new <- data.frame(bill_length_mm = x_new)  # the column name must match the one used to fit the model
predict(lm1, newdata = data_new)
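`predict()` can also return interval estimates, which quantify how reliable a prediction is: `interval = "confidence"` gives an interval for the mean response, while `interval = "prediction"` gives a (wider) interval for a single new observation. A sketch with a few hypothetical bill lengths:

```r
library(palmerpenguins)
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)
data_new <- data.frame(bill_length_mm = c(40, 45, 50))
conf_int <- predict(fit, newdata = data_new, interval = "confidence")  # mean response
pred_int <- predict(fit, newdata = data_new, interval = "prediction")  # new observation
conf_int
pred_int  # wider than conf_int: it also includes the noise variance sigma^2
```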

Other useful functions

  • summary()
  • coefficients() or coef()
  • confint() confidence intervals
  • fitted() fitted values
  • residuals()
  • plot()
  • predict()
par(mfrow = c(2, 2))
plot(lm1)
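A quick tour of the extractor functions on the penguin model (a sketch):

```r
library(palmerpenguins)
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)
coef(fit)             # estimated coefficients
confint(fit)          # 95% confidence intervals for the coefficients
head(fitted(fit))     # fitted values Y-hat
head(residuals(fit))  # residuals e_i; their mean is 0
```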

Nonlinear regression

lm2 <- lm(log(body_mass_g) ~ log(bill_length_mm), data = penguins)
lm3 <- lm(body_mass_g ~ bill_length_mm + I(bill_length_mm^2), data = penguins)
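Both transformed models are still fitted by ordinary least squares; I() protects arithmetic inside a formula so that ^2 means squaring rather than formula expansion. The linear and quadratic fits can be compared, for example, by adjusted R² (a sketch):

```r
library(palmerpenguins)
lm_lin  <- lm(body_mass_g ~ bill_length_mm, data = penguins)
lm_quad <- lm(body_mass_g ~ bill_length_mm + I(bill_length_mm^2), data = penguins)
c(linear    = summary(lm_lin)$adj.r.squared,
  quadratic = summary(lm_quad)$adj.r.squared)
```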

Useful operators

The meaning of formula operators in lm:

  • x1:x2 denotes the interaction of x1 and x2
  • x1*x2 denotes x1 + x2 + x1:x2
  • (x1 + x2)^2 denotes x1 + x2 + x1:x2
  • (x1*x2) - x1 denotes x2 + x1:x2
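These expansions can be confirmed with terms(), which reports the terms a formula generates (the variables need not exist, since the formula is not evaluated):

```r
attr(terms(y ~ x1:x2), "term.labels")         # "x1:x2"
attr(terms(y ~ x1 * x2), "term.labels")       # "x1" "x2" "x1:x2"
attr(terms(y ~ (x1 + x2)^2), "term.labels")   # same as x1 * x2
attr(terms(y ~ x1 * x2 - x1), "term.labels")  # "x2" "x1:x2"
```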

Multiple linear regression

Eg.

library(datasets)
df <- state.x77 %>%
  as_tibble() %>%
  select(Murder, Population, Illiteracy, Income, Frost)
df


cor(df)

library(car)
scatterplotMatrix(df)
mul_lm <- lm(Murder ~ ., data = df)
summary(mul_lm)

Note: for Murder and Frost, the correlation matrix shows a negative association, while the linear model gives a positive coefficient.
The correlation was computed before fitting the model and considers only those two variables; in the linear model, the coefficient measures the effect of Frost with all other predictors held fixed.
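The sign flip can be seen directly (a sketch, without dplyr, so the block is self-contained):

```r
df <- as.data.frame(state.x77)[, c("Murder", "Population", "Illiteracy",
                                   "Income", "Frost")]
cor(df$Murder, df$Frost)                  # negative marginal correlation
coef(lm(Murder ~ ., data = df))["Frost"]  # positive partial coefficient
```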

Eg2. We want to see how mpg depends on hp, wt, and their interaction.

mt <- mtcars %>%
  as_tibble() %>%
  select(mpg, hp, wt)
cor(mt)
int_lm <- lm(mpg ~ hp + wt + hp:wt, data = mt)
summary(int_lm)

The F statistic at the end tests whether all slope coefficients are simultaneously zero.

We have \(\widehat{mpg} = 49.81 - 0.12 \times hp - 8.22 \times wt + 0.03 \times (hp \cdot wt)\).

If wt = 2.2, \(\widehat{mpg} = 31.73 - 0.06 \times hp\)
If wt = 3.2, \(\widehat{mpg} = 23.51 - 0.03 \times hp\)
If wt = 4.2, \(\widehat{mpg} = 15.29 - 0.003 \times hp\)
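These wt-specific lines follow directly from the fitted coefficients: at a fixed wt, the intercept is \(b_0 + b_{wt} \cdot wt\) and the hp slope is \(b_{hp} + b_{hp:wt} \cdot wt\). A sketch:

```r
# derive the wt-specific regression lines from the interaction model
int_lm <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
b <- coef(int_lm)
for (w in c(2.2, 3.2, 4.2)) {
  cat(sprintf("wt = %.1f: mpg-hat = %.2f %+.3f * hp\n",
              w, b["(Intercept)"] + b["wt"] * w, b["hp"] + b["hp:wt"] * w))
}
```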

library(effects)
plot(effect("hp:wt", int_lm, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)
posted @ 2022-10-28 18:15 爱吃番茄的玛丽亚