Linear Regression - 4 Implement in R

Linear Regression Series:

  • Linear Regression - 1 Theory : site

  • Linear Regression - 2 Proofs of Theory : site

  • Linear Regression - 3 Implement in Python : site

  • Linear Regression - 4 Implement in R : site

1 Linear Regression

(1) Add variables

  • add covariates
# attach(data) is unnecessary here: the data argument already supplies the variables
model <- lm(formula = Y ~ X1 + X2, data = data)
  • all covariates
model <- lm(formula = Y ~ ., data = data)
  • remove covariates
model <- lm(formula = Y ~ . - X1 - X2, data = data)
  • include intercept
model <- lm(formula = Y ~ X1 + X2, data = data)
model <- lm(formula = Y ~ 1 + X1 + X2, data = data)
  • exclude intercept
model <- lm(formula = Y ~ X1 + X2 - 1, data = data)
model <- lm(formula = Y ~ 0 + X1 + X2, data = data)
  • only intercept
model <- lm(formula = Y ~ 1, data = data)
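A quick check of the intercept variants above, using a small simulated data frame (the names `data`, `Y`, `X1`, `X2` follow the placeholders used in this post):

```r
# simulate a small data set
set.seed(1)
data <- data.frame(X1 = rnorm(20), X2 = rnorm(20))
data$Y <- 1 + 2 * data$X1 - data$X2 + rnorm(20, sd = 0.1)

m_with    <- lm(Y ~ X1 + X2, data = data)      # intercept included by default
m_without <- lm(Y ~ 0 + X1 + X2, data = data)  # intercept excluded

names(coef(m_with))     # "(Intercept)" "X1" "X2"
names(coef(m_without))  # "X1" "X2"
```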

(2) Covariates manipulation

  • Transformations
# Y ~ X1, X1^2, X1^3
model <- lm(formula = Y ~ X1 + I(X1^2) + I(X1^3), data = data)

I() isolates the arithmetic from the formula syntax, so that operators like ^ and * are interpreted as ordinary arithmetic rather than as formula operators.

  • Categorical Covariates

Transform a categorical covariate with \(k\) categories into \(k-1\) dummy variables \(\boldsymbol{X}_i\):

\[\boldsymbol{X}_i = \left\{\begin{array}{ll} 1 & \text{for category } i \\ 0 & \text{otherwise} \end{array}\right. \qquad i=1,2,\cdots,k-1 \]

The category for which we do not create a dummy variable is called the reference category.

# suppose X2 takes values in {1, 2, 3}
# X2_2 = as.numeric(X2==2)
# X2_3 = as.numeric(X2==3)
model1 <- lm(Y ~ X1 + X2_2 + X2_3)
model2 <- lm(Y ~ X1 + as.factor(X2))
  • Variable interactions: continuous–continuous

For two continuous variables \(\boldsymbol{X}_i\) and \(\boldsymbol{X}_j\), the interaction of them will generate one new variable \(\boldsymbol{X}_{i} \odot \boldsymbol{X}_{j}\), where \(\odot\) is Hadamard product (also called element-wise product).

# The following models are equivalent for Y ~ X1, X2, X1*X2
model1 <- lm(formula = Y ~ X1*X2, data = data)
# "*" creates both the main effects and the interaction term
model2 <- lm(formula = Y ~ X1 + X2 + X1:X2, data = data)
# ":" creates only the interaction term
model3 <- lm(formula = Y ~ X1 + X2 + I(X1*X2), data = data)
# "^2" expands to all main effects plus all two-way interactions
model4 <- lm(formula = Y ~ (X1 + X2)^2, data = data)
  • Variable interactions: categorical–continuous

Suppose \(\boldsymbol{X}_1\) is categorical with \(k\) categories and \(\boldsymbol{X}_2\) is continuous. Then \(k-1\) new variables \(\boldsymbol{X}_{1i}\ (i=1,2,\cdots,k-1)\) have to be created, each consisting of the product of the continuous variable and a dummy variable: \(\boldsymbol{X}_{1i} \odot \boldsymbol{X}_2\ (i=1,2,\cdots,k-1)\).

  • Variable interactions: categorical–categorical

For two categorical variables \(\boldsymbol{X}_1\) and \(\boldsymbol{X}_2\), with \(k\) and \(l\) categories respectively, the interaction generates \((k-1)\times(l-1)\) new dummy variables: \(\boldsymbol{X}_{1i} \odot \boldsymbol{X}_{2j} \ (i=1,2,\cdots,k-1; \ j=1,2,\cdots,l-1)\).

Note: R automatically treats character (string) variables as categorical variables in linear regression.
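The interaction rules above can be checked with model.matrix(), which shows exactly which dummy and product columns R generates; a minimal sketch with made-up variable names (G is a factor with k = 3 levels):

```r
# toy data: one continuous and one categorical covariate
df <- data.frame(
  X1 = rnorm(30),                            # continuous
  G  = factor(rep(c("a", "b", "c"), each = 10))  # categorical, k = 3
)
df$Y <- 1 + df$X1 + rnorm(30)

# categorical-continuous interaction: k - 1 = 2 product columns are created
m <- lm(Y ~ X1 * G, data = df)
colnames(model.matrix(m))
# "(Intercept)" "X1" "Gb" "Gc" "X1:Gb" "X1:Gc"
```

The level "a" is the reference category, so only dummies Gb and Gc appear, together with their products with X1.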

(3) Residual and fitted values

  • Get residuals \(\hat{\boldsymbol{\epsilon}}\)
resid = residuals(model)
  • Get fitted values \(\hat{\boldsymbol{Y}}\)
# The following calls return the same result
fit.val = fitted(model)
est.val = predict(model)
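The identity \(\boldsymbol{Y} = \hat{\boldsymbol{Y}} + \hat{\boldsymbol{\epsilon}}\) can be verified directly (toy data, names assumed):

```r
set.seed(3)
data <- data.frame(X1 = rnorm(15))
data$Y <- 2 + 3 * data$X1 + rnorm(15)

model <- lm(Y ~ X1, data = data)

# fitted values plus residuals reconstruct Y (up to floating point)
all.equal(unname(fitted(model) + residuals(model)), data$Y)
# and fitted() equals predict() on the training data
all.equal(fitted(model), predict(model))
```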

(4) Prediction

Note the difference between the confidence interval (CI) and the prediction interval (PI): the CI bounds the mean response, while the PI bounds a single new observation, so the PI is wider.

# for interval="confidence", the CI is estimated
pred = predict(model, newdata=data_test, interval="confidence", level=0.95)
# for interval="prediction", the prediction interval (PI) is estimated
pred = predict(model, newdata=data_test, interval="prediction", level=0.95)
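To see the CI/PI difference concretely, a sketch on simulated data (all names are made up):

```r
set.seed(4)
train <- data.frame(X1 = rnorm(50))
train$Y <- 1 + 2 * train$X1 + rnorm(50)
model  <- lm(Y ~ X1, data = train)

newd  <- data.frame(X1 = c(-1, 0, 1))
ci_95 <- predict(model, newdata = newd, interval = "confidence", level = 0.95)
pi_95 <- predict(model, newdata = newd, interval = "prediction", level = 0.95)

# prediction intervals are strictly wider than confidence intervals
(pi_95[, "upr"] - pi_95[, "lwr"]) > (ci_95[, "upr"] - ci_95[, "lwr"])
```

Both calls return a matrix with columns fit, lwr, and upr.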

2 Multicollinearity

2.1 Variance inflation factor (VIF)

model <- lm(formula = Y ~ X1 + X2, data = data)
# find aliases (linearly dependent variables) in the model
alias(model)
# install.packages("car")  # install the "car" package
library(car)
vif(model)
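The VIF can also be computed by hand from its definition \(\mathrm{VIF}_j = 1/(1-R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing \(\boldsymbol{X}_j\) on the remaining covariates. A sketch without the car dependency (vif_manual is a helper defined here, not a library function):

```r
set.seed(5)
n  <- 100
X1 <- rnorm(n)
X2 <- 0.9 * X1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with X1

# VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing X_j on the other covariates
vif_manual <- function(x, others) {
  r2 <- summary(lm(x ~ ., data = as.data.frame(others)))$r.squared
  1 / (1 - r2)
}
vif_manual(X1, data.frame(X2 = X2))  # well above 1 because of the collinearity
```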

2.2 Alias

Aliases must be removed from the model before using vif().

Aliases: variables that are linearly dependent on others (i.e., that cause perfect multicollinearity). For example, if \(\boldsymbol{X}_2 = 3 \boldsymbol{X}_1 + 1\), \(\boldsymbol{X}_2\) will be detected as an alias.

Given the output of alias(model) for some model, as follows:

Model :
V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + 
    V13 + V14 + speed + frwy + art + age + frwynos + artnos + 
    ruralnos + youngm

Complete :
         (Intercept) V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 speed frwy frwynos artnos youngm
art       2          -1  0  0  0  0  0  0  0  0   0   0   0   0   0     1    0       0      0    
age      86           0  0  0  0  0  0  0  0  0   0   0  -1   0   0     0    0       0      0    
ruralnos  0           0  0  1  0  0  0  0  0  0   0   0   0   0   0     0   -1      -1      0  

Variables art, age, and ruralnos are detected as aliases and fulfill the following equations:

\[\begin{aligned} \mathrm{art} &= - \mathrm{V2} + \mathrm{frwy} + 2 \\ \mathrm{age} &= - \mathrm{V13} + 86 \\ \mathrm{ruralnos} &= \mathrm{V4} - \mathrm{frwynos} + \mathrm{artnos} \end{aligned} \]
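A small reproduction of the idea with the \(\boldsymbol{X}_2 = 3\boldsymbol{X}_1 + 1\) example from above: the aliased variable gets an NA coefficient, alias() reports the exact linear dependence, and the model must be refitted without it before calling vif():

```r
set.seed(6)
data <- data.frame(X1 = rnorm(30), X3 = rnorm(30))
data$X2 <- 3 * data$X1 + 1          # perfectly collinear with X1
data$Y  <- 1 + data$X1 + data$X3 + rnorm(30)

model <- lm(Y ~ X1 + X2 + X3, data = data)
coef(model)                 # X2 has an NA coefficient
alias(model)$Complete       # X2 row: (Intercept) 1, X1 3  ->  X2 = 3*X1 + 1

# drop the alias, then vif() can be applied to model2
model2 <- lm(Y ~ X1 + X3, data = data)
```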

posted @ 2023-01-06 12:42 veager