Linear Regression - 4 Implement in R
Linear Regression Series:
- Linear Regression - 1 Theory : site
- Linear Regression - 2 Proofs of Theory : site
- Linear Regression - 3 Implement in Python : site
- Linear Regression - 4 Implement in R : site
1. Linear Regression
(1) Add variables
- add covariates
attach(data)  # optional when the data argument is supplied; some later examples rely on it
model <- lm(formula = Y ~ X1 + X2, data = data)
- all covariates
model <- lm(formula = Y ~ ., data = data)
- remove covariates
model <- lm(formula = Y ~ . - X1 - X2, data = data)
- include intercept
model <- lm(formula = Y ~ X1 + X2, data = data)
model <- lm(formula = Y ~ 1 + X1 + X2, data = data)
- exclude intercept
model <- lm(formula = Y ~ X1 + X2 - 1, data = data)
model <- lm(formula = Y ~ 0 + X1 + X2, data = data)
- only intercept
model <- lm(formula = Y ~ 1, data = data)
(2) Covariates manipulation
- Transformations
# fit Y on X1, X1^2 and X1^3
model <- lm(formula = Y ~ X1 + I(X1^2) + I(X1^3), data = data)
I() is used to protect the arithmetic from being interpreted as formula operators.
- Categorical Covariates
Transform a categorical covariate with \(k\) categories into \(k-1\) dummy variables \(\boldsymbol{X}_i\):
The category for which we do not create a dummy variable is called the reference category.
# suppose X2 takes values in {1, 2, 3}; category 1 is the reference category
# X2_2 = as.numeric(X2 == 2)
# X2_3 = as.numeric(X2 == 3)
model1 <- lm(Y ~ X1 + X2_2 + X2_3)
# equivalently, let R create the dummies:
model2 <- lm(Y ~ X1 + as.factor(X2))
- Variable interactions: continuous–continuous
For two continuous variables \(\boldsymbol{X}_i\) and \(\boldsymbol{X}_j\), their interaction generates one new variable \(\boldsymbol{X}_{i} \odot \boldsymbol{X}_{j}\), where \(\odot\) is the Hadamard product (also called the element-wise product).
# The following models are equivalent: Y ~ X1, X2, X1*X2
model1 <- lm(formula = Y ~ X1*X2, data = data)
# the "*" operator creates both the main effects and the interaction term
model2 <- lm(formula = Y ~ X1 + X2 + X1:X2, data = data)
# the ":" operator creates only the interaction term
model3 <- lm(formula = Y ~ X1 + X2 + I(X1*X2), data = data)
model4 <- lm(formula = Y ~ (X1 + X2)^2, data = data)
# "(X1 + X2)^2" expands to all main effects plus all pairwise interactions
- Variable interactions: categorical–continuous
Suppose one variable \(\boldsymbol{X}_1\) is categorical with \(k\) categories and the other variable \(\boldsymbol{X}_2\) is continuous. Then \(k-1\) new variables \(\boldsymbol{X}_{1i}\ (i=1,2,\cdots,k-1)\) have to be created, each consisting of the product of the continuous variable and a dummy variable: \(\boldsymbol{X}_{1i} \odot \boldsymbol{X}_2\ (i=1,2,\cdots,k-1)\).
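In R none of these products need to be built by hand: interacting a factor with a continuous variable creates the \(k-1\) product terms automatically. A minimal sketch on simulated data (all names here are made up for illustration):

```r
set.seed(1)
# X1: categorical with k = 3 levels; X2: continuous
data <- data.frame(X1 = factor(sample(1:3, 100, replace = TRUE)),
                   X2 = rnorm(100))
data$Y <- 1 + 2 * data$X2 + rnorm(100)

# "*" yields k - 1 = 2 dummies, X2 itself, and 2 dummy-by-X2 product terms
model <- lm(Y ~ X1 * X2, data = data)
names(coef(model))
```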
- Variable interactions: categorical–categorical
For two categorical variables \(\boldsymbol{X}_1\) and \(\boldsymbol{X}_2\), with \(k\) and \(l\) categories, respectively, their interaction generates \((k-1)\times(l-1)\) new dummy variables: \(\boldsymbol{X}_{1i} \odot \boldsymbol{X}_{2j} \ (i=1,2,\cdots,k-1; \ j=1,2,\cdots,l-1)\).
Note: R automatically treats character (string) variables as categorical (factor) variables in linear regression.
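Again R builds the \((k-1)\times(l-1)\) interaction dummies itself once both variables are factors; a sketch on simulated data (hypothetical names, with \(k=3\) and \(l=2\)):

```r
set.seed(1)
data <- data.frame(X1 = factor(sample(1:3, 100, replace = TRUE)),
                   X2 = factor(sample(1:2, 100, replace = TRUE)),
                   Y  = rnorm(100))

# main effects: (3-1) + (2-1) dummies; interaction: (3-1)*(2-1) = 2 dummies
model <- lm(Y ~ X1 * X2, data = data)
length(coef(model))  # intercept + 2 + 1 + 2 = 6 coefficients
```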
(3) Residual and fitted values
- Get residuals \(\hat{\boldsymbol{\epsilon}}\)
resid <- residuals(model)
- Get fitted values \(\hat{\boldsymbol{Y}}\)
# The following calls return the same values on the training data
fit.val <- fitted(model)
est.val <- predict(model)
(4) Prediction
Note the difference between a confidence interval (CI) and a prediction interval (PI): the CI is for the mean response, while the PI is for a single new observation and is therefore wider.
# interval = "confidence" estimates the CI of the mean response
pred <- predict(model, newdata = data_test, interval = "confidence", level = 0.95)
# interval = "prediction" estimates the PI for new observations
pred <- predict(model, newdata = data_test, interval = "prediction", level = 0.95)
2. Multicollinearity
2.1 Variance inflation factor (VIF)
model <- lm(formula = Y ~ X1 + X2, data = data)
alias(model)  # find aliases in the model first (see 2.2); they must be removed before vif()
# install.packages("car")  # install the "car" package if needed
library(car)
vif(model)
2.2 Alias
Aliases must be removed from the model before using vif().
Aliases refer to variables that are linearly dependent on others (i.e. cause perfect multicollinearity). For example, if \(\boldsymbol{X}_2 = 3 \boldsymbol{X}_1 + 1\), \(\boldsymbol{X}_2\) will be detected as an alias.
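The \(\boldsymbol{X}_2 = 3\boldsymbol{X}_1 + 1\) case can be reproduced on simulated data (hypothetical names) to see what alias() reports:

```r
set.seed(1)
data <- data.frame(X1 = rnorm(50))
data$X2 <- 3 * data$X1 + 1          # X2 is an exact linear function of X1
data$Y  <- data$X1 + rnorm(50)

model <- lm(Y ~ X1 + X2, data = data)
coef(model)                          # the coefficient of X2 is NA
alias(model)$Complete                # X2 expressed as 1*(Intercept) + 3*X1
```

With the alias still in the model, vif() refuses to run, which is why aliases are checked first.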
Given the following output of alias(model) for a model:
Model :
V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 +
V13 + V14 + speed + frwy + art + age + frwynos + artnos +
ruralnos + youngm
Complete :
(Intercept) V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 speed frwy frwynos artnos youngm
art 2 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
age 86 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0
ruralnos 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 -1 -1 0
Variables art, age, and ruralnos are detected as aliases and fulfill the following equations (read off from the Complete matrix above):
\(\text{art} = 2 - V2 + \text{frwy}\)
\(\text{age} = 86 - V13\)
\(\text{ruralnos} = V4 - \text{frwynos} - \text{artnos}\)
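Once the aliased variables are known, one way to proceed is to drop them with update() and refit before computing VIF. A self-contained sketch, reusing the toy \(\boldsymbol{X}_2 = 3\boldsymbol{X}_1 + 1\) alias rather than the dataset above (which is not shown in full); the VIF of X1 is computed by hand as \(1/(1-R^2)\) so the sketch needs no extra packages:

```r
set.seed(1)
data <- data.frame(X1 = rnorm(50), X3 = rnorm(50))
data$X2 <- 3 * data$X1 + 1                    # alias of X1
data$Y  <- data$X1 + data$X3 + rnorm(50)

model  <- lm(Y ~ X1 + X2 + X3, data = data)   # X2 is aliased (NA coefficient)
model2 <- update(model, . ~ . - X2)           # drop the alias, refit
anyNA(coef(model2))                           # FALSE: no aliased coefficients left

# VIF of X1 = 1 / (1 - R^2), from regressing X1 on the remaining covariates
r2 <- summary(lm(X1 ~ X3, data = data))$r.squared
1 / (1 - r2)
```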