『原创』统计建模与R软件-第六章回归分析

2016-01-08 08:28 Digging4 阅读(14910) 评论(0) 收藏举报

摘要: 本文由digging4发表于：http://www.cnblogs.com/digging4/p/5079829.html

统计建模与R软件-第六章回归分析

6.1为了估计山上积雪融化后对下游灌溉的影响，在山上建立一个观测站，测量最大积雪深度X与当年灌溉面积Y，测得连续10年的数据如表6.17所示

表6.17 10年中最大积雪深度与当年灌溉面积的数据
序号 X(米) Y(公顷)
1 5.1 1907
2 3.5 1287
3 7.1 2700
4 6.2 2373
5 8.8 3260
6 7.8 3000
7 4.5 1947
8 5.6 2273
9 8.0 3113
10 6.4 2493
（1）试画出相应的散点图，判断Y与X是否有线性关系；
（2）求出Y关于X的一元线性回归方程；
（3）对方程作显著性检验；
（4）现测得今年的数据是X=7米，给出今年灌溉面积的预测值和相应的区间估计（$\alpha=0.05$）

x <- c(5.1, 3.5, 7.1, 6.2, 8.8, 7.8, 4.5, 5.6, 8, 6.4)
y <- c(1907, 1287, 2700, 2373, 3260, 3000, 1947, 2273, 3113, 2493)
# 从散点图可以看出，这些点基本上在一条直线上，x和y有线性关系
plot(x, y)
# y=141.0 + 364.2*x ，但是截距项不显著
lm.sol <- lm(y ~ x)
summary(lm.sol)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128.59  -70.98   -3.73   49.26  167.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    141.0      125.1    1.13     0.29    
## x              364.2       19.3   18.91  6.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 96.4 on 8 degrees of freedom
## Multiple R-squared: 0.978,	Adjusted R-squared: 0.975 
## F-statistic:  358 on 1 and 8 DF,  p-value: 6.33e-08

# 采用无截距项回归，回归方程通过了显著性检验，且R平方值提高到0.999
# 回归方程 y=385.23*x
lm.sol <- lm(y ~ x - 1)
summary(lm.sol)

## 
## Call:
## lm(formula = y ~ x - 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -130.0  -52.0  -10.1   30.3  213.5 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   385.23       4.76    80.9  3.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 97.8 on 9 degrees of freedom
## Multiple R-squared: 0.999,	Adjusted R-squared: 0.998 
## F-statistic: 6.54e+03 on 1 and 9 DF,  p-value: 3.42e-14

abline(lm.sol)

#
# 预测，就算只有一个值，也要采用数据框的形式，并且数据框的变量名和回归方程的自变量名要一致
new <- data.frame(x = 7)
predict(lm.sol, new, interval = "prediction", level = 0.95)

##    fit  lwr  upr
## 1 2697 2463 2930

# 得到预测值为2697，其95%的可信区间为 [2463, 2930]

6.2研究同一地区土壤所含可给态磷的情况，得到18组数据如表6.18所示，表中$X_1$为土壤内所含无机磷浓度，$X_2$为土壤内溶于$K_2CO_3$溶液并受溴化物水解的有机磷，$X_3$为土壤内溶于$K_2CO_3$溶液但不溶于溴化物水解的有机磷。

（1）求出Y关于X的多元线性回归方程
（2）对方程做显著性检验
（3）对变量做逐步回归分析
序号 X1 X2 X3 Y
1 0.4 52 158 64
2 0.4 23 163 60
3 3.1 19 37 71
4 0.6 34 157 61
5 4.7 24 59 54
6 1.7 65 123 77
7 9.4 44 46 81
8 10.1 31 117 93
9 11.6 29 173 93
10 12.6 58 112 51
11 10.9 37 111 76
12 23.1 46 114 96
13 23.1 50 134 77
14 21.6 44 73 93
15 23.1 56 168 95
16 1.9 36 143 54
17 26.8 58 202 168
18 29.9 51 124 99

soil <- data.frame(X1 = c(0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6, 12.6, 
    10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 26.8, 29.9), X2 = c(52, 23, 19, 34, 24, 
    65, 44, 31, 29, 58, 37, 46, 50, 44, 56, 36, 58, 51), X3 = c(158, 163, 37, 
    157, 59, 123, 46, 117, 173, 112, 111, 114, 134, 73, 168, 143, 202, 124), 
    Y = c(64, 60, 71, 61, 54, 77, 81, 93, 93, 51, 76, 96, 77, 93, 95, 54, 168, 
        99))

lm.sol <- lm(Y ~ X1 + X2 + X3, data = soil)
summary(lm.sol)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = soil)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28.35 -11.38  -2.66  12.10  48.81 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  43.6501    18.0544    2.42   0.0298 * 
## X1            1.7853     0.5398    3.31   0.0052 **
## X2           -0.0833     0.4204   -0.20   0.8458   
## X3            0.1610     0.1116    1.44   0.1710   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 20 on 14 degrees of freedom
## Multiple R-squared: 0.549,	Adjusted R-squared: 0.453 
## F-statistic: 5.69 on 3 and 14 DF,  p-value: 0.00923

# 可以看出X2和X3的系数不显著，可以考虑从画出X2～Y，X3·Y的图进行分析
plot(residuals(lm.sol))

# 发现第17个点的残差比其他点都大，去掉17点后重新计算
lm.sol2 <- lm(Y ~ X1 + X2 + X3, data = soil[-17, ])
summary(lm.sol2)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + X3, data = soil[-17, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.79  -5.68   2.27   4.89  16.78 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.4929    12.2929    5.25  0.00016 ***
## X1            1.3034     0.3576    3.64  0.00297 ** 
## X2           -0.1322     0.2669   -0.50  0.62863    
## X3            0.0227     0.0767    0.30  0.77177    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 12.7 on 13 degrees of freedom
## Multiple R-squared: 0.529,	Adjusted R-squared: 0.42 
## F-statistic: 4.86 on 3 and 13 DF,  p-value: 0.0176

# 和之前一样，X2和X3的系数都没有通过显著性检验。
# 使用step函数，进行逐步回归
lm.step <- step(lm.sol2)

## Start:  AIC=89.77
## Y ~ X1 + X2 + X3
## 
##        Df Sum of Sq  RSS  AIC
## - X3    1        14 2101 87.9
## - X2    1        39 2126 88.1
## <none>              2087 89.8
## - X1    1      2132 4218 99.7
## 
## Step:  AIC=87.89
## Y ~ X1 + X2
## 
##        Df Sum of Sq  RSS  AIC
## - X2    1        31 2131 86.1
## <none>              2101 87.9
## - X1    1      2119 4220 97.7
## 
## Step:  AIC=86.13
## Y ~ X1
## 
##        Df Sum of Sq  RSS  AIC
## <none>              2131 86.1
## - X1    1      2295 4426 96.6

# 可以看到，去掉自变量X2和X3以后，AIC=86.13 达到最小
lm.fin <- lm(Y ~ X1, data = soil[-17, ])
summary(lm.step)

## 
## Call:
## lm(formula = Y ~ X1, data = soil[-17, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.056  -3.061   0.939   5.038  18.016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   62.569      4.452   14.05  4.8e-10 ***
## X1             1.229      0.306    4.02   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 11.9 on 15 degrees of freedom
## Multiple R-squared: 0.519,	Adjusted R-squared: 0.486 
## F-statistic: 16.2 on 1 and 15 DF,  p-value: 0.00111

截距项和X1的系数都通过了显著性检验，因此回归方程为 $Y=62.569+1.229X_1$

6.3已知如下数据，由表6.19所示

序号 X Y
1 1 0.6
2 1 1.6
3 1 0.5
4 1 1.2
5 2 2.0
6 2 1.3
7 2 2.5
8 3 2.2
9 3 2.4
10 3 1.2
11 4 3.5
12 4 4.1
13 4 5.1
14 5 5.7
15 6 3.4
16 6 9.7
17 6 8.6
18 7 4.0
19 7 5.5
20 7 10.5
21 8 17.5
22 8 13.4
23 8 4.5
24 9 30.4
25 11 12.4
26 12 13.4
27 12 26.2
28 12 7.4
（1）画出数据的散点图，求回归直线$y= \widehat{\beta_0}+\widehat{\beta_1}x$，同时将回归直线也画在散点图上。
（2）分析T检验和F检验是否通过
（3）画出残差（普通残差和标准化残差）与预测值的残差图，分析误差是否是等方差的
（4）修正模型，对响应变量Y作开方，再完成（1）-（3）的工作

data <- data.frame(x = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6, 
    7, 7, 7, 8, 8, 8, 9, 11, 12, 12, 12), y = c(0.6, 1.6, 0.5, 1.2, 2, 1.3, 
    2.5, 2.2, 2.4, 1.2, 3.5, 4.1, 5.1, 5.7, 3.4, 9.7, 8.6, 4, 5.5, 10.5, 17.5, 
    13.4, 4.5, 30.4, 12.4, 13.4, 26.2, 7.4))

# 求回归方程
lm.sol <- lm(y ~ x + 1, data)
# 画出散点图
plot(data$x, data$y)
abline(lm.sol)


summary(lm.sol)

## 
## Call:
## lm(formula = y ~ x + 1, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.841 -2.337 -0.021  1.059 17.832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.452      1.835   -0.79     0.44    
## x              1.558      0.281    5.55  7.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.17 on 26 degrees of freedom
## Multiple R-squared: 0.542,	Adjusted R-squared: 0.525 
## F-statistic: 30.8 on 1 and 26 DF,  p-value: 7.93e-06

# 截距项的t value = -0.79， Pr(>|t|)= 0.44>0.05, 此项不显著，没有通过t检验
# F-statistic: 30.8 on 1 and 26 DF, p-value: 7.93e-06 <0.05，表示通过F检验

# 残差属于回归诊断问题 预测值
y.fit <- predict(lm.sol)
# 普通残差，残差在不断增加，因此误差为异方差
y.res.nor <- residuals(lm.sol)
plot(y.res.nor ~ y.fit)
abline(1.5, 0.6)
abline(-0.1, -0.6)


# 标准化残差（学生化残差），残差在不断增加，因此误差为异方差
y.res.nor <- rstandard(lm.sol)
plot(y.res.nor ~ y.fit)
abline(0.3, 0.1)
abline(-0.1, -0.1)


##################################################################################
################################################################################## 修正模型，对响应变量Y作开方，再完成（1）-（3）的工作
################################################################################## 求回归方程
data$y <- sqrt(data$y)

lm.sol <- lm(y ~ x + 1, data)
# 画出散点图
plot(data$x, data$y)
abline(lm.sol)


summary(lm.sol)

## 
## Call:
## lm(formula = y ~ x + 1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5426 -0.4528 -0.0118  0.3493  2.1249 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7665     0.2559    3.00    0.006 ** 
## x             0.2914     0.0391    7.44  6.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.721 on 26 degrees of freedom
## Multiple R-squared: 0.681,	Adjusted R-squared: 0.668 
## F-statistic: 55.4 on 1 and 26 DF,  p-value: 6.64e-08

# 两个系数的 Pr(>|t|)都小于0.05, 通过t检验 F-statistic: 55.4 on 1 and 26
# DF, p-value: 6.64e-08，表示通过F检验

# 残差属于回归诊断问题 预测值
y.fit <- predict(lm.sol)
# 普通残差，残差在一定范围内变化，因此误差为同方差
y.res.nor <- residuals(lm.sol)
plot(y.res.nor ~ y.fit)
abline(1, 0)
abline(-1, 0)


# 标准化残差（学生化残差），残差在不断增加，因此误差为异方差
y.res.nor <- rstandard(lm.sol)
plot(y.res.nor ~ y.fit)
abline(1.3, 0)
abline(-1.3, 0)

得到最终的回归模型为$\sqrt{y}=0.7665+0.2914x$

6.4对牙膏销售数据（数据表见例6.9）得到的线性模型做回归诊断，分析哪些样本点需要作进一步的研究？哪些样本点需要在回归计算中删去，如果有，删去再做线性回归模型的计算。

toothpaste <- data.frame(X1 = c(-0.05, 0.25, 0.6, 0, 0.25, 0.2, 0.15, 0.05, 
    -0.15, 0.15, 0.2, 0.1, 0.4, 0.45, 0.35, 0.3, 0.5, 0.5, 0.4, -0.05, -0.05, 
    -0.1, 0.2, 0.1, 0.5, 0.6, -0.05, 0, 0.05, 0.55), X2 = c(5.5, 6.75, 7.25, 
    5.5, 7, 6.5, 6.75, 5.25, 5.25, 6, 6.5, 6.25, 7, 6.9, 6.8, 6.8, 7.1, 7, 6.8, 
    6.5, 6.25, 6, 6.5, 7, 6.8, 6.8, 6.5, 5.75, 5.8, 6.8), Y = c(7.38, 8.51, 
    9.52, 7.5, 9.33, 8.28, 8.75, 7.87, 7.1, 8, 7.89, 8.15, 9.1, 8.86, 8.9, 8.87, 
    9.26, 9, 8.75, 7.95, 7.65, 7.27, 8, 8.5, 8.75, 9.21, 8.27, 7.67, 7.93, 9.26))

# 例子中得到的模型
lm.sol <- lm(Y ~ X1 + X2 + I(X2^2) + X1:X2, data = toothpaste)
summary(lm.sol)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + I(X2^2) + X1:X2, data = toothpaste)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4372 -0.1175  0.0049  0.1226  0.3841 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   29.113      7.483    3.89  0.00066 ***
## X1            11.134      4.446    2.50  0.01915 *  
## X2            -7.608      2.469   -3.08  0.00496 ** 
## I(X2^2)        0.671      0.203    3.31  0.00282 ** 
## X1:X2         -1.478      0.667   -2.21  0.03610 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.206 on 25 degrees of freedom
## Multiple R-squared: 0.921,	Adjusted R-squared: 0.908 
## F-statistic: 72.8 on 4 and 25 DF,  p-value: 2.11e-13


# 计算残差
y.res <- rstandard(lm.sol)
# 验证残差的正态性，p-value = 0.931>0.05，能通过正态性检验
shapiro.test(y.res)

## 
## 	Shapiro-Wilk normality test
## 
## data:  y.res 
## W = 0.9822, p-value = 0.8817

plot(y.res)

# 通过标准化残差图可以看出，点5和，点11的残差最大，去掉这两个点后重新计算
lm.sol.n <- lm(Y ~ X1 + X2 + I(X2^2) + X1:X2, data = toothpaste[c(-5, -11), 
    ])
summary(lm.sol.n)

## 
## Call:
## lm(formula = Y ~ X1 + X2 + I(X2^2) + X1:X2, data = toothpaste[c(-5, 
##     -11), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3560 -0.1126 -0.0063  0.0965  0.3348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   23.378      6.538    3.58   0.0016 **
## X1            10.292      3.790    2.72   0.0123 * 
## X2            -5.667      2.163   -2.62   0.0153 * 
## I(X2^2)        0.508      0.178    2.85   0.0090 **
## X1:X2         -1.324      0.570   -2.32   0.0295 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.173 on 23 degrees of freedom
## Multiple R-squared: 0.944,	Adjusted R-squared: 0.935 
## F-statistic: 97.6 on 4 and 23 DF,  p-value: 4.42e-14

# 新的模型可以看到，Residual standard error: 从0.206下降到0.173 Multiple
# R-squared: 从0.921上升到 0.944，模型变好

6.5诊断水泥数据（数据见例6.10）是否存在多重共线性，分析例6.10中step()函数去掉的变量是否合理。

cement <- data.frame(X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10), X2 = c(26, 
    29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68), X3 = c(6, 15, 8, 8, 6, 
    9, 17, 22, 18, 4, 23, 9, 8), X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 
    34, 12, 12), Y = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 
    115.9, 83.8, 113.3, 109.4))

# kappa(z, exact = FALSE, ...) 计算矩阵的条件数 κ < 100 不存在多重共线性
# 100 ≤ κ ≤ 1000, 存在一定程度多重共线性 κ > 1000, 存在严重多重共线性

XX <- cor(cement[, 1:4])
kappa(XX, exact = TRUE)

## [1] 1377


# κ =1377 > 1000, 存在严重多重共线性

# 下一步要找出哪些变量有多重共线性，计算矩阵的特征值和特征向量
eigen(XX)

## $values
## [1] 2.235704 1.576066 0.186606 0.001624
## 
## $vectors
##         [,1]    [,2]    [,3]   [,4]
## [1,]  0.4760  0.5090  0.6755 0.2411
## [2,]  0.5639 -0.4139 -0.3144 0.6418
## [3,] -0.3941 -0.6050  0.6377 0.2685
## [4,] -0.5479  0.4512 -0.1954 0.6767

得到 $$\lambda _{min}=0.001624, \varphi =(0.2411,0.6418,0.2685,0.6767)^{T}$$
因此有 $$0.2411X1+0.6418X2+0.2685X3+0.6767X4\approx 0$$

#
# 分析例6.10中step()函数去掉的变量是否合理，例子中去掉了变量X3和X4，从多重共线性角度分析去掉是否合理
XX2 <- cor(cement[, 1:2])
kappa(XX2, exact = TRUE)

## [1] 1.593

# κ =1.593，已经消除了共线性问题

6.6为研究一些因素（如用抗生素、有无危险因子和事先是否有计划）对“破腹产后是否有感染”的影响，表6.20给出的是某医院破腹产后的数据，试用logistic回归模型对这些数据进行研究，分析感染与这些因素的关系。

表6.20 某医院进行破腹产后的数据
---------------------------事先有计划--------------------临时决定
-----------------------有感染----无感染-----------有感染-----无感染
用抗生素--有危险因子----1----------17----------------------11--87
用抗生素--无危险因子----0----------2-----------------------0---0
不用抗生素--有危险因子--28---------30---------------------23--3
不用抗生素--无危险因子--8----------32---------------------0---9

修改下表格的形式
是否用抗生素有无危险因子事先有无计划有感染无感染
用有有 1 17
用有无 11 87
用无有 0 2
用无无 0 0
不有有 28 30
不有无 23 3
不无有 8 32
不无无 0 9

# x1 是否使用抗生素 1用 0 不用 x2 有无危险因子 1有 0 无 x3 事先有无计划
# 1有 0 无
patient <- data.frame(x1 = c(1, 1, 1, 1, 0, 0, 0, 0), x2 = c(1, 1, 0, 0, 1, 
    1, 0, 0), x3 = c(1, 0, 1, 0, 1, 0, 1, 0), y.yes = c(1, 11, 0, 0, 28, 23, 
    8, 0), y.no = c(17, 87, 2, 0, 30, 3, 32, 9))

patient$Ymat <- cbind(patient$y.yes, patient$y.no)
glm.sol <- glm(Ymat ~ x1 + x2 + x3, family = binomial, data = patient)
summary(glm.sol)

## 
## Call:
## glm(formula = Ymat ~ x1 + x2 + x3, family = binomial, data = patient)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.2647  -0.0716  -0.1523   0.0000  -0.7852   1.4962   1.2156  -2.5623  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.821      0.495   -1.66    0.097 .  
## x1            -3.254      0.481   -6.76  1.4e-11 ***
## x2             2.030      0.455    4.46  8.2e-06 ***
## x3            -1.072      0.425   -2.52    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.491  on 6  degrees of freedom
## Residual deviance: 10.997  on 3  degrees of freedom
## AIC: 36.18
## 
## Number of Fisher Scoring iterations: 5


# 可以看到截距项不能通过验证，去掉截距项重新回归，回归方程的都通过了检验
glm.sol <- glm(Ymat ~ x1 + x2 + x3 - 1, family = binomial, data = patient)
summary(glm.sol)

## 
## Call:
## glm(formula = Ymat ~ x1 + x2 + x3 - 1, family = binomial, data = patient)
## 
## Deviance Residuals: 
##      1       2       3       4       5       6       7       8  
##  0.640  -0.156  -0.153   0.000  -0.384   0.794   0.396  -3.532  
## 
## Coefficients:
##    Estimate Std. Error z value Pr(>|z|)    
## x1   -3.596      0.458   -7.85  4.0e-15 ***
## x2    1.577      0.352    4.49  7.2e-06 ***
## x3   -1.545      0.329   -4.70  2.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.439  on 7  degrees of freedom
## Residual deviance:  13.868  on 4  degrees of freedom
## AIC: 37.05
## 
## Number of Fisher Scoring iterations: 4

得到破腹产后是否有感染的概率模型：$$P=\frac{exp(-3.596x_1+1.577x_2-1.545x_3)}{1+exp(-3.596x_1+1.577x_2-1.545x_3)}$$

6.7一位饮食公司的分析人员想调查自助餐馆中的自动咖啡售货机数量与咖啡销售量之间的关系，她选择了14家餐馆来进行实验，这14家餐馆在营业额、顾客类型和地理位置方面都是相近的，放在试验餐馆的自动售货机数量从0（这里由咖啡服务员端来）到6不等，并且是随机分配到每个餐馆的，表6.21所示的是关于试验结果的数据。

（1）作线性回归模型
（2）作多项式回归模型
（3）画出数据的散点图和拟合曲线
表6.21: 自动咖啡售货机数量与咖啡销售量数据
餐馆售货机数量咖啡销售量
1 0 508.1
2 0 498.4
3 1 568.2
4 1 577.3
5 2 651.7
6 2 657.0
7 3 713.4
8 3 697.5
9 4 755.3
10 4 758.9
11 5 787.6
12 5 792.1
13 6 841.4
14 6 831.8

coffee <- data.frame(x = c(0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6), y = c(508.1, 
    498.4, 568.2, 577.3, 651.7, 657, 713.4, 697.5, 755.3, 758.9, 787.6, 792.1, 
    841.4, 831.8))
# 做线性回归模型
lm.sol <- lm(y ~ x, coffee)
summary(lm.sol)

## 
## Call:
## lm(formula = y ~ x, data = coffee)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25.40 -11.48  -3.78  14.63  24.92 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   523.80       8.47    61.8  < 2e-16 ***
## x              54.89       2.35    23.4  2.3e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 17.6 on 12 degrees of freedom
## Multiple R-squared: 0.978,	Adjusted R-squared: 0.977 
## F-statistic:  545 on 1 and 12 DF,  p-value: 2.26e-11

plot(coffee$x, coffee$y)
abline(lm.sol)


# 虽然线性回归模型通过了检验，但是从图上看可以继续做多项式回归

lm.sol2 <- lm(y ~ 1 + x + I(x^2), coffee)
summary(lm.sol2)

## 
## Call:
## lm(formula = y ~ 1 + x + I(x^2), data = coffee)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.664  -5.662  -0.465   5.500  10.668 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  502.556      4.850  103.62  < 2e-16 ***
## x             80.386      3.786   21.23  2.8e-10 ***
## I(x^2)        -4.249      0.606   -7.01  2.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 7.86 on 11 degrees of freedom
## Multiple R-squared: 0.996,	Adjusted R-squared: 0.995 
## F-statistic: 1.39e+03 on 2 and 11 DF,  p-value: 5.95e-14


# 多项式回归的系数都通过了检验 标准话残差由
# 17.6下降到7.86，R平方由0.978上升到0.996，可以认为此多项式回归优于线性回归

# 拟合曲线为
xfit <- seq(0, 6.5, len = 200)
yfit <- predict(lm.sol2, data.frame(x = xfit))
plot(coffee$x, coffee$y)
lines(xfit, yfit)

得到的回归方程为 $$y=502.556+80.386x-4.249x^2$$

6.8 表6.22是40名肺癌病人的生存资料，其中$X_1$表示生活行动能力评分（1-100），$X_2$表示病人的年龄， $X_3$表示由诊断到直入研究时间（月）， $X_4$表示肿瘤的类型（“0”是磷癌，“1”是小型细胞癌，“2”是腺癌，“3”是大型细胞癌）；$X_5$表示两种化疗方法（“1”常规，“2”试验新方法）；$Y$表示病人的生存时间（“0”是生存时间短，即生存时间小于200天，“1”表示生存时间长，即生存时间大于或等于200天）。

序号 X1 X2 X3 X4 X5 Y
1 70 64 5 1 1 1
2 60 63 9 1 1 0
3 70 65 11 1 1 0
4 40 69 10 1 1 0
5 40 63 58 1 1 0
6 70 48 9 1 1 0
7 70 48 11 1 1 0
8 80 63 4 2 1 0
9 60 63 14 2 1 0
10 30 53 4 2 1 0
11 80 43 12 2 1 0
12 40 55 2 2 1 0
13 60 66 25 2 1 1
14 40 67 23 2 1 0
15 20 61 19 3 1 0
16 50 63 4 3 1 0
17 50 66 16 0 1 0
18 40 68 12 0 1 0
19 80 41 12 0 1 1
20 70 53 8 0 1 1
21 60 37 13 1 1 0
22 90 54 12 1 0 1
23 50 52 8 1 0 1
24 70 50 7 1 0 1
25 20 65 21 1 0 0
26 80 52 28 1 0 1
27 60 70 13 1 0 0
28 50 40 13 1 0 0
29 70 36 22 2 0 0
30 40 44 36 2 0 0
31 30 54 9 2 0 0
32 30 59 87 2 0 0
33 40 69 5 3 0 0
34 60 50 22 3 0 0
35 80 62 4 3 0 0
36 70 68 15 0 0 0
37 30 39 4 0 0 0
38 60 49 11 0 0 0
39 80 64 10 0 0 1
40 70 67 18 0 0 1
（1）建立$P(Y=1)$ 对 $X_1~X_5$ 的logistic回归模型，$X_1~X_5$对$P(Y=1)$综合影响是否显著？哪些变量是主要的影响因素，显著水平如何？计算各病人生存时间大于等于200天的概率估计值。
（2——用逐步回归法选取自变量，结果如何？在所选模型下，计算病人生存时间大于等于200天的概率估计值，并将计算结果与（1）中的模型做比较，差异如何？哪一个模型更合理。

patient <- data.frame(X1 = c(70, 60, 70, 40, 40, 70, 70, 80, 60, 30, 80, 40, 
    60, 40, 20, 50, 50, 40, 80, 70, 60, 90, 50, 70, 20, 80, 60, 50, 70, 40, 
    30, 30, 40, 60, 80, 70, 30, 60, 80, 70), X2 = c(64, 63, 65, 69, 63, 48, 
    48, 63, 63, 53, 43, 55, 66, 67, 61, 63, 66, 68, 41, 53, 37, 54, 52, 50, 
    65, 52, 70, 40, 36, 44, 54, 59, 69, 50, 62, 68, 39, 49, 64, 67), X3 = c(5, 
    9, 11, 10, 58, 9, 11, 4, 14, 4, 12, 2, 25, 23, 19, 4, 16, 12, 12, 8, 13, 
    12, 8, 7, 21, 28, 13, 13, 22, 36, 9, 87, 5, 22, 4, 15, 4, 11, 10, 18), X4 = c(1, 
    1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
    1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 0, 0, 0, 0, 0), X5 = c(1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), Y = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 1, 1))
# 建立$P(Y=1)$ 对 $X_1~X_5$ 的logistic回归模型
glm.sol <- glm(Y ~ X1 + X2 + X3 + X4 + X5, data = patient)
summary(glm.sol)

## 
## Call:
## glm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = patient)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6427  -0.3012  -0.0658   0.2848   0.8126  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.322177   0.457570   -0.70   0.4862   
## X1           0.010246   0.003549    2.89   0.0067 **
## X2           0.003468   0.006250    0.55   0.5826   
## X3           0.000788   0.004217    0.19   0.8528   
## X4          -0.118238   0.066063   -1.79   0.0824 . 
## X5          -0.117326   0.125824   -0.93   0.3577   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for gaussian family taken to be 0.1495)
## 
##     Null deviance: 7.5000  on 39  degrees of freedom
## Residual deviance: 5.0838  on 34  degrees of freedom
## AIC: 45
## 
## Number of Fisher Scoring iterations: 2

# 可见只有X1的系数满足了显著性要求，其他自变量对Y的影响不显著。


yfit <- predict(glm.sol, patient[, 1:5])
p <- exp(yfit)/(1 + exp(yfit))
# 得到各病人生存时间大于等于200天的概率估计值
cbind(patient, p)

##    X1 X2 X3 X4 X5 Y      p
## 1  70 64  5  1  1 1 0.5952
## 2  60 63  9  1  1 0 0.5702
## 3  70 65 11  1  1 0 0.5972
## 4  40 69 10  1  1 0 0.5248
## 5  40 63 58  1  1 0 0.5291
## 6  70 48  9  1  1 0 0.5825
## 7  70 48 11  1  1 0 0.5829
## 8  80 63  4  2  1 0 0.5903
## 9  60 63 14  2  1 0 0.5420
## 10 30 53  4  2  1 0 0.4547
## 11 80 43 12  2  1 0 0.5750
## 12 40 55  2  2  1 0 0.4816
## 13 60 66 25  2  1 1 0.5467
## 14 40 67 23  2  1 0 0.4961
## 15 20 61 19  3  1 0 0.4103
## 16 50 63  4  3  1 0 0.4849
## 17 50 66 16  0  1 0 0.5779
## 18 40 68 12  0  1 0 0.5537
## 19 80 41 12  0  1 1 0.6299
## 20 70 53  8  0  1 1 0.6149
## 21 60 37 13  1  1 0 0.5488
## 22 90 54 12  1  0 1 0.6634
## 23 50 52  8  1  0 1 0.5643
## 24 70 50  7  1  0 1 0.6120
## 25 20 65 21  1  0 0 0.5016
## 26 80 52 28  1  0 1 0.6415
## 27 60 70 13  1  0 0 0.6053
## 28 50 40 13  1  0 0 0.5550
## 29 70 36 22  2  0 0 0.5746
## 30 40 44 36  2  0 0 0.5080
## 31 30 54  9  2  0 0 0.4858
## 32 30 59 87  2  0 0 0.5055
## 33 40 69  5  3  0 0 0.4941
## 34 60 50 22  3  0 0 0.5321
## 35 80 62  4  3  0 0 0.5893
## 36 70 68 15  0  0 0 0.6554
## 37 30 39  4  0  0 0 0.5309
## 38 60 49 11  0  0 0 0.6157
## 39 80 64 10  0  0 1 0.6742
## 40 70 67 18  0  0 1 0.6551


# 用逐步回归法选取自变量
glm.new <- step(glm.sol)

## Start:  AIC=45
## Y ~ X1 + X2 + X3 + X4 + X5
## 
##        Df Deviance  AIC
## - X3    1     5.09 43.0
## - X2    1     5.13 43.4
## - X5    1     5.21 44.0
## <none>        5.08 45.0
## - X4    1     5.56 46.6
## - X1    1     6.33 51.8
## 
## Step:  AIC=43.04
## Y ~ X1 + X2 + X4 + X5
## 
##        Df Deviance  AIC
## - X2    1     5.14 41.4
## - X5    1     5.23 42.2
## <none>        5.09 43.0
## - X4    1     5.57 44.6
## - X1    1     6.38 50.1
## 
## Step:  AIC=41.41
## Y ~ X1 + X4 + X5
## 
##        Df Deviance  AIC
## - X5    1     5.26 40.3
## <none>        5.14 41.4
## - X4    1     5.61 42.9
## - X1    1     6.38 48.1
## 
## Step:  AIC=40.35
## Y ~ X1 + X4
## 
##        Df Deviance  AIC
## <none>        5.26 40.3
## - X4    1     5.75 41.9
## - X1    1     6.51 46.9

# 最后只选择了变量X1和X4 ，如果取显著性水平为0.1，则所有系数都通过了检验
summary(glm.new)

## 
## Call:
## glm(formula = Y ~ X1 + X4, data = patient)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5381  -0.3203  -0.0427   0.3250   0.7991  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.15086    0.23036   -0.65   0.5166   
## X1           0.00984    0.00331    2.97   0.0052 **
## X4          -0.11941    0.06433   -1.86   0.0714 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for gaussian family taken to be 0.1421)
## 
##     Null deviance: 7.500  on 39  degrees of freedom
## Residual deviance: 5.258  on 37  degrees of freedom
## AIC: 40.35
## 
## Number of Fisher Scoring iterations: 2


yfit.new <- predict(glm.sol, patient[, 1:5])
p.new <- exp(yfit.new)/(1 + exp(yfit.new))
# 得到各病人生存时间大于等于200天的概率估计值
cbind(patient, p.new)

##    X1 X2 X3 X4 X5 Y  p.new
## 1  70 64  5  1  1 1 0.5952
## 2  60 63  9  1  1 0 0.5702
## 3  70 65 11  1  1 0 0.5972
## 4  40 69 10  1  1 0 0.5248
## 5  40 63 58  1  1 0 0.5291
## 6  70 48  9  1  1 0 0.5825
## 7  70 48 11  1  1 0 0.5829
## 8  80 63  4  2  1 0 0.5903
## 9  60 63 14  2  1 0 0.5420
## 10 30 53  4  2  1 0 0.4547
## 11 80 43 12  2  1 0 0.5750
## 12 40 55  2  2  1 0 0.4816
## 13 60 66 25  2  1 1 0.5467
## 14 40 67 23  2  1 0 0.4961
## 15 20 61 19  3  1 0 0.4103
## 16 50 63  4  3  1 0 0.4849
## 17 50 66 16  0  1 0 0.5779
## 18 40 68 12  0  1 0 0.5537
## 19 80 41 12  0  1 1 0.6299
## 20 70 53  8  0  1 1 0.6149
## 21 60 37 13  1  1 0 0.5488
## 22 90 54 12  1  0 1 0.6634
## 23 50 52  8  1  0 1 0.5643
## 24 70 50  7  1  0 1 0.6120
## 25 20 65 21  1  0 0 0.5016
## 26 80 52 28  1  0 1 0.6415
## 27 60 70 13  1  0 0 0.6053
## 28 50 40 13  1  0 0 0.5550
## 29 70 36 22  2  0 0 0.5746
## 30 40 44 36  2  0 0 0.5080
## 31 30 54  9  2  0 0 0.4858
## 32 30 59 87  2  0 0 0.5055
## 33 40 69  5  3  0 0 0.4941
## 34 60 50 22  3  0 0 0.5321
## 35 80 62  4  3  0 0 0.5893
## 36 70 68 15  0  0 0 0.6554
## 37 30 39  4  0  0 0 0.5309
## 38 60 49 11  0  0 0 0.6157
## 39 80 64 10  0  0 1 0.6742
## 40 70 67 18  0  0 1 0.6551

# 可见新的模型效果更好

得到最终的模型为$$P=\frac{exp(-0.15086+0.00984x_1-0.11941x_4)}{1+exp(-0.15086+0.00984x_1-0.11941x_4)}$$

6.9一位医院管理人员想建立一个回归模型，对重伤病人出院后的长期恢复情况进行预测，自变量是病人住院的天数（X），应变量是病人出院后长期恢复的预后指数（Y），指数的数值越大表示预后结局越好，为此，研究了15个别病人的数据，这些数据列在表6.23中，根据经验表明，病人住院的天数（X）和预后指数（Y）服从非线性模型

\[Y_i=\theta_0exp(\theta_1X_i)+\varepsilon_i,i=1,\cdots, 15 \]

表6.23: 关于重伤病人的数据
病号住院天数(X) 预后指数(Y )
1 2 54
2 5 50
3 7 45
4 10 37
5 14 35
6 19 25
7 26 20
8 31 16
9 34 18
10 38 13
11 45 8
12 52 11
13 53 8
14 60 4
15 65 6
（1）用内在线性模型方法计算其各种参的估计值；
（2）用非线性方法（nls()函数和nlm()函数）计算其各种参的估计值

patient <- data.frame(x = c(2, 5, 7, 10, 14, 19, 26, 31, 34, 38, 45, 52, 53, 
    60, 65), y = c(54, 50, 45, 37, 35, 25, 20, 16, 18, 13, 8, 11, 8, 4, 6))
# 用内在线性模型方法计算其各种参的估计值 对自变量做2次正交
poly(patient$x, degree = 2)

##               1         2
##  [1,] -0.365891  0.385604
##  [2,] -0.327689  0.253961
##  [3,] -0.302221  0.173806
##  [4,] -0.264019  0.064985
##  [5,] -0.213083 -0.058810
##  [6,] -0.149412 -0.179320
##  [7,] -0.060274 -0.284134
##  [8,]  0.003396 -0.313357
##  [9,]  0.041598 -0.312633
## [10,]  0.092534 -0.290367
## [11,]  0.181672 -0.192826
## [12,]  0.270810 -0.020734
## [13,]  0.283544  0.009937
## [14,]  0.372682  0.267231
## [15,]  0.436352  0.496657
## attr(,"degree")
## [1] 1 2
## attr(,"coefs")
## attr(,"coefs")$alpha
## [1] 30.73 33.95
## 
## attr(,"coefs")$norm2
## [1]       1      15    6167 1727980
## 
## attr(,"class")
## [1] "poly"   "matrix"

lm.pol <- lm(y ~ 1 + poly(x, 2), data = patient)
summary(lm.pol)

## 
## Call:
## lm(formula = y ~ 1 + poly(x, 2), data = patient)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.672 -1.290  0.219  1.384  4.074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.333      0.634   36.81  1.0e-13 ***
## poly(x, 2)1  -59.094      2.455  -24.07  1.6e-11 ***
## poly(x, 2)2   19.464      2.455    7.93  4.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.46 on 12 degrees of freedom
## Multiple R-squared: 0.982,	Adjusted R-squared: 0.979 
## F-statistic:  321 on 2 and 12 DF,  p-value: 3.81e-11


# 画出散点图和预测曲线
xfit <- seq(2, 65, len = 200)
yfit <- predict(lm.pol, data.frame(x = xfit))
plot(patient$x, patient$y)
lines(xfit, yfit)

使用内在线性模型方法，得到y关于x的二次回归方程：$y=23.333-59.094*x+19.464*x^2$

# 用非线性方法（nls()函数和nlm()函数）计算其各种参的估计值
nls.sol <- nls(y ~ a * exp(b * x), data = patient, start = list(a = 2, b = 0.01))
summary(nls.sol)

## 
## Formula: y ~ a * exp(b * x)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a 58.60655    1.47216    39.8  5.7e-15 ***
## b -0.03959    0.00171   -23.1  6.0e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.95 on 13 degrees of freedom
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 3.48e-06

# 画出散点图和预测曲线
xfit <- seq(2, 65, len = 200)
yfit <- predict(nls.sol, data.frame(x = xfit))
plot(patient$x, patient$y)
lines(xfit, yfit)

使用非线性方法方法，得到y关于x的回归方程：$y=58.60655*exp(-0.03959*x)$

刷新页面返回顶部

Digging for ...

『原创』统计建模与R软件-第六章回归分析

统计建模与R软件-第六章回归分析

6.1为了估计山上积雪融化后对下游灌溉的影响，在山上建立一个观测站，测量最大积雪深度X与当年灌溉面积Y，测得连续10年的数据如表6.17所示

6.3已知如下数据，由表6.19所示

得到最终的回归模型为\(\sqrt{y}=0.7665+0.2914x\)

6.4对牙膏销售数据（数据表见例6.9）得到的线性模型做回归诊断，分析哪些样本点需要作进一步的研究？哪些样本点需要在回归计算中删去，如果有，删去再做线性回归模型的计算。

6.5诊断水泥数据（数据见例6.10）是否存在多重共线性，分析例6.10中step()函数去掉的变量是否合理。

6.6为研究一些因素（如用抗生素、有无危险因子和事先是否有计划）对“破腹产后是否有感染”的影响，表6.20给出的是某医院破腹产后的数据，试用logistic回归模型对这些数据进行研究，分析感染与这些因素的关系。

About

Digging for ...

『原创』统计建模与R软件-第六章 回归分析

统计建模与R软件-第六章 回归分析

6.1为了估计山上积雪融化后对下游灌溉的影响，在山上建立一个观测站，测量最大积雪深度X与当年灌溉面积Y，测得连续10年的数据如表6.17所示

6.3已知如下数据，由表6.19所示

得到最终的回归模型为\(\sqrt{y}=0.7665+0.2914x\)

6.4对牙膏销售数据（数据表见例6.9）得到的线性模型做回归诊断，分析哪些样本点需要作进一步的研究？哪些样本点需要在回归计算中删去，如果有，删去再做线性回归模型的计算。

6.5诊断水泥数据（数据见例6.10）是否存在多重共线性，分析例6.10中step()函数去掉的变量是否合理。

6.6为研究一些因素（如用抗生素、有无危险因子和事先是否有计划）对“破腹产后是否有感染”的影响，表6.20给出的是某医院破腹产后的数据，试用logistic回归模型对这些数据进行研究，分析感染与这些因素的关系。

About

『原创』统计建模与R软件-第六章回归分析

统计建模与R软件-第六章回归分析