ISLR, Chapter 3: Linear Regression

  • Several common linear models

1 Simple linear regression


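2 Multiple linear regression
3 Extensions of the linear model
Relaxes the additive assumption of the multiple linear model, i.e. the assumption that X1 and X2 have no interaction (synergy) effect.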
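For reference, the three model forms listed above can be written as follows. The notation (coefficients β, error term ε) follows the usual ISLR convention and is an assumption here; the third form shows the interaction term that relaxes the additive assumption.

```latex
\begin{align*}
\text{Simple:}      \quad & Y = \beta_0 + \beta_1 X + \epsilon \\
\text{Multiple:}    \quad & Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \\
\text{Interaction:} \quad & Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon
\end{align*}
```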




  • Evaluation metrics for linear models

Estimating the coefficients: least squares
Residual sum of squares (RSS)
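A minimal sketch of least squares for the simple model, assuming a single predictor and made-up toy data; the function names are illustrative only. It computes the coefficients in closed form and the RSS they achieve.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing RSS for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(x, y, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
best_rss = rss(x, y, b0, b1)
print(b0, b1, best_rss)
```

Any other line, for example one with a slightly different slope, gives a larger RSS on the same data.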


(1) SE (standard error)
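For simple linear regression the standard errors of the two coefficient estimates are usually written as below, assuming the errors are uncorrelated with common variance σ² = Var(ε). In practice σ is unknown and is estimated by the RSE.

```latex
\mathrm{SE}(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right],
\qquad
\mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
```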






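(2) t-statistic
The t-statistic is signed, so compare absolute values: a large absolute value is evidence that the predictor X is associated with Y. Here the t-statistic for newspaper is -0.18, which is close to zero, so newspaper shows no clear association with sales.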
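As a rough illustration, the t-statistic for the slope of a simple regression is the estimate divided by its standard error. The sketch below assumes one predictor and made-up toy data (not the Advertising data); `slope_t_statistic` is a hypothetical helper name.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t) for the slope of the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)          # RSE squared, estimates Var(eps)
    se_b1 = math.sqrt(sigma2_hat / sxx)
    return b1, se_b1, b1 / se_b1


# Toy data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, se_b1, t_stat = slope_t_statistic(x, y)
print(b1, se_b1, t_stat)
```

Here |t| is far from zero, so the slope is clearly distinguishable from 0; a value like -0.18 would not be.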


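(3) p-value
The smaller the p-value, the stronger the evidence that X is associated with Y. In this model newspaper shows no clear association with sales, so it is a candidate for removal when the model is refined later (p81).
The magnitude of a coefficient alone does not tell you whether the variable is related to the response (p134 3.c).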
Consequently, it is a simple matter to compute the probability of observing any value equal to |t| or larger, assuming β1 = 0. We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.
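The probability described above can be approximated numerically. The sketch below is one possible stdlib-only approach, not how statistical packages compute it: it integrates the t density with Simpson's rule. The degrees of freedom n - p - 1 = 196 used in the example assume n = 200 observations and p = 3 predictors.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) for T ~ t(df); steps must be even (Simpson's rule)."""
    upper = abs(t_stat)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3        # P(0 <= T <= |t_stat|)
    return max(0.0, 1.0 - 2.0 * central_mass)


# A t-statistic of -0.18 with 196 degrees of freedom (assumed n = 200, p = 3)
p_newspaper = two_sided_p_value(-0.18, 196)
print(p_newspaper)
```

A t-statistic this close to zero yields a large p-value, which is why the notes treat newspaper as unrelated to sales.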
(4) RSE (residual standard error)
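The RSE is an estimate of the standard deviation of the error term ε. Its usual definition for a model with p predictors is given below; for simple linear regression p = 1, so the denominator is n - 2.

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-p-1}},
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```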


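(5) R2 statistic
Unlike the RSE, which is measured in the units of Y, R2 always lies between 0 and 1. The larger R2 is, the larger the proportion of the variance in Y that the model explains.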
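A small sketch of R2 = 1 - RSS/TSS, assuming made-up toy data and fitted values from the least squares line for that data; `r_squared` is an illustrative name.

```python
def r_squared(y, fitted):
    """R^2 = 1 - RSS / TSS for observed y and model predictions."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - rss / tss


# Toy data (made up); 0.05 + 1.99 * x is the least squares line for it
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [0.05 + 1.99 * xi for xi in x]

r2_line = r_squared(y, fitted)

# Predicting the mean for every observation explains none of the variance
y_bar = sum(y) / len(y)
r2_mean_only = r_squared(y, [y_bar] * len(y))
print(r2_line, r2_mean_only)
```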










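(7) Studentized residuals
Studentized residuals are used to detect outliers in the data. An observation whose studentized residual exceeds 3 in absolute value is generally treated as an outlier and needs to be handled, for example by removing it.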
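The rule of thumb above can be sketched for simple linear regression. This assumes the internally studentized variant (each residual divided by RSE * sqrt(1 - h_i)); some software reports an externally studentized version instead. The data are synthetic, with one point shifted on purpose.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    rse = math.sqrt(sum(e * e for e in resid) / (n - 2))
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (rse * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]


# Synthetic data: a line plus small alternating noise, one shifted point
x = [float(i) for i in range(20)]
y = [1 + 2 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[10] += 15.0                           # planted outlier

stud = studentized_residuals(x, y)
outliers = [i for i, t in enumerate(stud) if abs(t) > 3]
print(outliers)
```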


(8) VIF (variance inflation factor)
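The VIF of a predictor Xj is usually defined as 1 / (1 - R2), where R2 comes from regressing Xj on all the other predictors. Its smallest possible value is 1, and values above 5 or 10 are a common rule of thumb for problematic collinearity. The sketch below covers only the two-predictor case, where that R2 is the squared correlation between the two predictors; the data are made up.

```python
def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r2 = s12 ** 2 / (s11 * s22)         # R^2 of x1 regressed on x2
    return 1 / (1 - r2)


x1 = [1.0, 2.0, 3.0, 4.0]
uncorrelated = [1.0, -1.0, -1.0, 1.0]   # zero correlation with x1
collinear = [1.1, 1.9, 3.2, 3.8]        # nearly a copy of x1

vif_low = vif_two_predictors(x1, uncorrelated)
vif_high = vif_two_predictors(x1, collinear)
print(vif_low, vif_high)
```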




  • Some important questions

1. Is There a Relationship Between the Response and Predictors?

When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.
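The F-statistic referred to here tests the null hypothesis that all p slope coefficients are zero. Its usual definition is sketched below; when H0 holds and the model assumptions are met, F is expected to be close to 1.

```latex
H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0,
\qquad
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n-p-1)},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2
```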
(3) When testing individual variables, note:
It seems likely that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response. However, this logic is flawed, especially when the number of predictors p is large.
Sometimes we have a very large number of variables. If p > n then there are more coefficients βj to estimate than observations from which to estimate them. In this case we cannot even fit the multiple linear regression model using least squares, so the F-statistic cannot be used, and neither can most of the other concepts that we have seen so far in this chapter.

2. Deciding on Important Variables

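There are 2^p possible subsets of p variables, so testing every one of them is impractical.
  1. Forward selection 
    Start from the null model and add variables one at a time until a stopping condition is met.
  2. Backward selection (cannot be used if p > n) 
  3. Mixed selection 
    Each of the two previous methods has its own characteristics. If the predictors X1, X2, ..., Xk were completely independent, the two methods could simply be combined. In real data, however, the predictors are usually not independent but correlated to some degree, so the contribution of some predictors changes as variables are added to or removed from the regression equation. A natural idea is therefore to combine the two methods: as its contribution changes, each predictor may be brought into the regression equation or dropped from it at any time, and the final model is one in which every included predictor is significant and every excluded predictor is not.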
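Forward selection as described in item 1 can be sketched as follows. The stopping rule (stop when the best candidate lowers RSS by no more than 5% of the current RSS) is an assumption chosen for illustration, as are the helper names and the made-up data; the least squares fit uses the normal equations in plain Python.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, pivoting)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss_of_fit(columns, y):
    """RSS of the least squares fit of y on an intercept plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, r)) for r in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(predictors, y, min_rel_drop=0.05):
    """Greedily add the predictor that lowers RSS most (assumed stop rule)."""
    selected = []
    remaining = sorted(predictors)
    current = rss_of_fit([], y)         # null model: intercept only
    while remaining:
        best_rss, best_name = min(
            (rss_of_fit([predictors[s] for s in selected + [name]], y), name)
            for name in remaining
        )
        if current - best_rss <= min_rel_drop * current:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        current = best_rss
    return selected


# Made-up data: y depends on x1 and x2 only; x3 is irrelevant
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "x3": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
noise = [0.1, 0.1, -0.1, -0.1, -0.1, -0.1, 0.1, 0.1]
y = [3 * a + 2 * b + e for a, b, e in zip(predictors["x1"], predictors["x2"], noise)]

chosen = forward_selection(predictors, y)
print(chosen)
```

Backward selection would run the same loop in reverse, starting from the full model and removing one variable at a time.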

3. Model Fit

    1. R2 
      An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable.
    2. RSE 
      Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.
    3. Plots
      Graphical summaries can reveal problems with a model that are not visible from numerical statistics.

4. Predictions

  1. Take confidence intervals into account when making predictions
  2. Model bias
  3. Prediction intervals are always wider than confidence intervals
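Item 3 can be seen directly from the standard-error formulas for simple linear regression: the prediction interval adds the irreducible error variance on top of the uncertainty in the fitted line. The sketch below uses made-up data and a rough multiplier of 2 in place of the exact t quantile, which is an assumption that is only reasonable for larger n.

```python
import math


def interval_half_widths(x, y, x0, multiplier=2.0):
    """Return (y0_hat, conf_half, pred_half) at x0 for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    conf_half = multiplier * rse * math.sqrt(core)      # mean response
    pred_half = multiplier * rse * math.sqrt(1 + core)  # single new response
    return b0 + b1 * x0, conf_half, pred_half


# Toy data (made up for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

y0_hat, conf_half, pred_half = interval_half_widths(x, y, x0=3.0)
print(y0_hat - conf_half, y0_hat + conf_half)   # approximate confidence interval
print(y0_hat - pred_half, y0_hat + pred_half)   # approximate prediction interval
```

Both intervals are centered at the same fitted value; only the width differs.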


  • Potential Problems


Non-linearity of the response-predictor relationships

Correlation of error terms


Non-constant variance of error terms


Outliers



High-leverage points
In order to quantify an observation’s leverage, we compute the leverage statistic.
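For simple linear regression the leverage statistic takes the form below. h_i grows with the distance of x_i from the mean, always lies between 1/n and 1, and averages (p + 1)/n over all observations, so a value well above (p + 1)/n suggests a high-leverage point.

```latex
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}
```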






  • KNN regression
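KNN regression predicts the response at a point x0 by averaging the responses of the K training observations closest to x0. A minimal one-dimensional sketch, with made-up training data and an illustrative function name:

```python
def knn_regress(train_x, train_y, x0, k):
    """Average the responses of the k training points nearest to x0."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up training data following y = x^2
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0]

pred_k1 = knn_regress(train_x, train_y, x0=2.9, k=1)   # nearest point only
pred_k3 = knn_regress(train_x, train_y, x0=2.9, k=3)   # smoother estimate
pred_k5 = knn_regress(train_x, train_y, x0=2.9, k=5)   # mean of all responses
print(pred_k1, pred_k3, pred_k5)
```

Small K follows the training data closely (low bias, high variance); large K smooths toward the overall mean.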


Lab: Linear Regression


