主成分回归

思路

利用主成分分析,从变量中提取主成分,进而利用主成分做回归

关于主成分分析的基本步骤,可参考> https://www.cnblogs.com/hznudmh/p/16324990.html>

步骤

Step1. 原始数据标准化

Step2. 计算自变量之间的相关矩阵

Step3. 检验是否适合做主成分分析

Step4. 主成分提取

Step5. 计算主成分得分

Step6. 做主成分回归

实例

首先我们导入数据

ce <- read.table(file.choose(), header = T)
ce
## X1 X2 X3 X4 Y
## 1 7 26 6 60 78.5
## 2 1 29 15 52 74.3
## 3 11 56 8 20 104.3
## 4 11 31 8 47 87.6
## 5 7 52 6 33 95.9
## 6 11 55 9 22 109.2
## 7 3 71 17 6 102.7
## 8 1 31 22 44 72.5
## 9 2 54 18 22 93.1
## 10 21 47 4 26 115.9
## 11 1 40 23 34 83.8
## 12 11 66 9 12 113.3
## 13 10 68 8 12 109.4

Step1. 原始数据标准化

scale(ce) -> ce.s
ce.s
## X1 X2 X3 X4 Y
## 1 -0.07846099 -1.42368840 -0.9007209 1.7923096 -1.12492615
## 2 -1.09845379 -1.23089726 0.5044037 1.3143603 -1.40411237
## 3 0.60153422 0.50422298 -0.5884710 -0.5974365 0.59007490
## 4 0.60153422 -1.10236984 -0.5884710 1.0156421 -0.52002268
## 5 -0.07846099 0.24716813 -0.9007209 0.1792310 0.03170246
## 6 0.60153422 0.43995926 -0.4323460 -0.4779492 0.91579215
## 7 -0.75845619 1.46817866 0.8166536 -1.4338477 0.48371824
## 8 -1.09845379 -1.10236984 1.5972783 0.8364111 -1.52376360
## 9 -0.92845499 0.37569555 0.9727785 -0.4779492 -0.15442168
## 10 2.30152223 -0.07415044 -1.2129708 -0.2389746 1.36116064
## 11 -1.09845379 -0.52399643 1.7534033 0.2389746 -0.77261973
## 12 0.60153422 1.14686010 -0.4323460 -1.0753857 1.18833108
## 13 0.43153542 1.27538753 -0.5884710 -1.0753857 0.92908673
## attr(,"scaled:center")
## X1 X2 X3 X4 Y
## 7.461538 48.153846 11.769231 30.000000 95.423077
## attr(,"scaled:scale")
## X1 X2 X3 X4 Y
## 5.882394 15.560881 6.405126 16.738180 15.043723

Step2. 计算自变量之间的相关矩阵

cor(ce[,1:4]) -> ce.cor # 使自变量之间的相关矩阵,因此选前四列

Step3. 检验是否适合做主成分分析

  • KMO法
KMO(ce.cor)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = ce.cor)
## Overall MSA = 0.24
## MSA for each item =
## X1 X2 X3 X4
## 0.22 0.26 0.20 0.26

发现 \(MSA = 0.24 < 0.7\)

  • bartlett法
cortest.bartlett(ce.cor, n = 13) # n为样本个数,即有几行
## $chisq
## [1] 67.28248
##
## $p.value
## [1] 1.473373e-12
##
## $df
## [1] 6

发现 \(p << 0.05\).

虽然 KMO 检验未通过,但 bartlett 检验通过了,我们主要看bartlett检验,因此认为样本适合做主成分.

Step4. 主成分提取

princomp(ce[,1:4], cor = T) -> ce.pca
summary(ce.pca, loadings = T)
## Importance of components:
##                        Comp.1   Comp.2    Comp.3
## Standard deviation     1.495227 1.2554147 0.43197934
## Proportion of Variance 0.558926 0.3940165 0.04665154
## Cumulative Proportion  0.558926 0.9529425 0.99959406
## Comp.4
## Standard deviation 0.0402957285
## Proportion of Variance 0.0004059364
## Cumulative Proportion 1.0000000000
##
## Loadings:
##    Comp.1 Comp.2 Comp.3 Comp.4
## X1 0.476  0.509  0.676  0.241
## X2 0.564  -0.414 -0.314 0.642
## X3 -0.394 -0.605 0.638  0.268
## X4 -0.548 0.451  -0.195  0.677

因此可以选两个或三个主成分,这里我们选三个. 那么保留了 99.96% 的样本信息.

Step5. 计算主成分得分

as.matrix(ce.s[, 1:4]) %*% ce.pca$loadings[, 1:3] -> Z1
Z1
##   Comp.1     Comp.2      Comp.3
## 1 -1.4672378  1.9030357  -0.53000021
## 2 -2.1358287  0.2383537  -0.29018631
## 3 1.1298705   0.1838772  -0.01071260
## 4 -0.6598955  1.5767742  0.17920364
## 5 0.3587646   0.4835379  -0.74012228
## 6 0.9666396   0.1699440  0.08570239
## 7 0.9307051   -2.1348165 -0.17298608
## 8 -2.2321380  -0.6916707 0.45971977
## 9 -0.3515156  -1.4322451 -0.03156438
## 10 1.6625430  1.8280966  0.85119311
## 11 -1.6401800 -1.2951128 0.49417845
## 12 1.6925941  -0.3922488 -0.01981007
## 13 1.7456787  -0.4375255 -0.27461543

Step6. 做主成分回归

ZZ1 <- data.frame(Z1) # 化为数据框
lm(ce.s[, 5] ~ z1 + z2 + z3, data = ZZ1) -> ce.fit1 # data 必须要是数据框
summary(ce.fit1)
##
## Call:
## lm(formula = ce.s[, 5] ~ z1 + z2 + z3, data = ZZ1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20228 -0.12766 0.02411 0.07911 0.25621
##
## Coefficients:
##             Estimate   Std. Error t value Pr(>|t|)
## (Intercept) 2.831e-16  4.281e-02  0.000   1.0000
## z1          6.570e-01  2.980e-02  22.045  3.84e-09 ***
## z2          -8.309e-03 3.549e-02  -0.234  0.8202
## z3          3.028e-01  1.031e-01  2.935   0.0166 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1544 on 9 degrees of freedom
## Multiple R-squared: 0.9821, Adjusted R-squared: 0.9762
## F-statistic: 164.9 on 3 and 9 DF, p-value: 3.5e-08

常数项明显未通过,剔除

lm(ce.s[, 5] ~ 0 + z1 + z2 + z3, data = ZZ1) -> ce.fit2 #data 必须要是数据框
summary(ce.fit2)
##
## Call:
## lm(formula = ce.s[, 5] ~ 0 + z1 + z2 + z3, data = ZZ1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20228 -0.12766 0.02411 0.07911 0.25621
##
## Coefficients:
##    Estimate  Std. Error t value Pr(>|t|)
## z1 0.656958  0.028271   23.238  4.93e-10 ***
## z2 -0.008309 0.033671   -0.247  0.8101
## z3 0.302770  0.097856   3.094   0.0114 *
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1464 on 10 degrees of freedom
## Multiple R-squared: 0.9821, Adjusted R-squared: 0.9768
## F-statistic: 183.2 on 3 and 10 DF, p-value: 4.895e-09

z2 检验未通过,但是主成分回归不太在乎显著性检验,因此可以剔除也可以不剔除,这里我们不剔除.

于是

\[y = 0.656958z_1 - 0.008309z_2 + 0.302770z_3 \]

下面得到 \(y\) 关于 \(x\) 的回归方程

rep(ce.fit2$coefficients, each = 4)*ce.pca$loadings[, 1:3] -> tempp
apply(tempp, 1, sum) -> ce.fit3
ce.fit3
## X1         X2         X3          X4
## 0.51297502 0.27868114 -0.06078483 -0.42288461

于是

\[y = 0.51297502x_1 + 0.27868114x_2 − 0.06078483x_3 − 0.42288461x_4 \]

posted @ 2022-05-31 14:49  只会加减乘除  阅读(410)  评论(0编辑  收藏  举报