ISL - Ch.3 Linear Regression

Completed exercise solutions: https://github.com/sheepshaker1011/ISL-homework/blob/master/ISL%2B-%2BCh.3%2BEx.ipynb

3.1 Simple Linear Regression

$$\hat y = \hat \beta_0 + \hat \beta_1 x \tag{3.1}$$

3.1.1 Estimating the Coefficients

We define the residual sum of squares (RSS) as

$$RSS = (y_1 - \hat \beta_0 - \hat \beta_1 x_1)^2 + (y_2 - \hat \beta_0 - \hat \beta_1 x_2)^2 + \dots + (y_n - \hat \beta_0 - \hat \beta_1 x_n)^2$$

The least squares approach chooses $\hat \beta_0$ and $\hat \beta_1$ to minimize the RSS. Using some calculus, one can show that the minimizers are

$$\hat \beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x$$
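A minimal sketch of these estimates in NumPy, using simulated data (the arrays, seed, and true coefficients below are made up for illustration, not the book's Advertising data):

```python
import numpy as np

# Simulate data from a known line: true beta0 = 2, beta1 = 3 (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Least squares estimates from the closed-form formulas above
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)   # should be close to 2 and 3
```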

3.1.2 Assessing the Accuracy of the Coefficient Estimates

The population regression line, which is the best linear approximation to the true relationship between X and Y, is defined as

$$Y = \beta_0 + \beta_1 X + \epsilon \tag{3.2}$$

The error term is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y , and there may be measurement error. We typically assume that the error term is independent of X.

So the population regression line is unobserved.

To compute the standard errors associated with $\hat \beta_0$ and $\hat \beta_1$, we use the following formulas:

$$SE(\hat \beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right], \qquad SE(\hat \beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}$$

where $\sigma^2 = Var(\epsilon)$

In general, $\sigma^2$ is not known, but can be estimated from the data. This estimate is known as the residual standard error,

$$RSE = \sqrt{RSS/(n-2)}$$

For linear regression, the 95% confidence interval for β approximately takes the form

$$\hat \beta \pm 2 \cdot SE(\hat \beta)$$
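A short sketch of the RSE and the approximate 95% interval for $\hat \beta_1$, again on simulated data (all values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Fit by least squares
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual standard error and SE(beta1_hat)
n = len(x)
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
rse = np.sqrt(rss / (n - 2))                             # estimate of sigma
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

# Approximate 95% confidence interval for beta1
print(beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
```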

 

Standard errors can also be used to perform hypothesis tests on the coefficients.

Testing the null hypothesis of 

$H_0$ : There is no relationship between X and Y. $H_0: \beta_1 = 0$

versus the alternative hypothesis

$H_A$ : There is some relationship between X and Y. $H_A: \beta_1 \neq 0$

The t-statistic measures the number of standard deviations that $\hat \beta_1$ is away from 0:

$$t= \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)}$$

If there really is no relationship between X and Y, then we expect that t will have a t-distribution with n − 2 degrees of freedom.

The p-value is the probability of observing a value of the test statistic at least as extreme as |t|, assuming $\beta_1 = 0$.

Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%.
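A sketch of this t-test on the same kind of simulated data, using SciPy's t-distribution for the two-sided p-value (the data and seed are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Least squares fit and SE(beta1_hat)
n = len(x)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

# t-statistic for H0: beta1 = 0 and its two-sided p-value
t = beta1_hat / se_beta1
p_value = 2 * stats.t.sf(np.abs(t), df=n - 2)
print(t, p_value)
```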

 

3.1.3 Assessing the Accuracy of the Model

The RSE is considered a measure of the lack of fit of the model to the data. But it is measured in the units of Y.

The $R^2$ statistic measures the proportion of variability in Y that can be explained using X, and is independent of the scale of Y .

$$R^2 = \frac{TSS-RSS}{TSS}$$

where $TSS = \sum_i (y_i - \bar{y})^2$ is the total sum of squares.

Correlation is also a measure of the linear relationship between X and Y. In fact, it can be shown that in the simple linear regression setting, $R^2 = r^2$, where

$$r = Cor(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}}$$
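A quick numerical check of $R^2 = r^2$ on simulated data (all values are made up; `np.corrcoef` supplies the sample correlation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Simple linear regression fit
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# R^2 from TSS and RSS, compared with the squared correlation
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = (tss - rss) / tss
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)   # the two values agree
```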

In the multiple linear regression problem, however, we use several predictors simultaneously to predict the response, and the concept of correlation between the predictors and the response does not extend automatically to this setting. We will see that $R^2$ fills this role.

 

3.2 Multiple Linear Regression

$$Y = \beta_0 +  \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon$$ 

3.2.1 Estimating the Regression Coefficients

We use the same least squares approach that we saw in the context of simple linear regression: we choose $\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p$ to minimize the RSS.

3.2.2 Some Important Questions

One: Is there a relationship between the response and predictors?

Hypothesis test $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

versus the alternative $H_a$ : at least one $\beta_j$ is non-zero.

This hypothesis test is performed by computing the F-statistic

$$ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$$

When there is no relationship between the response and the predictors, one would expect the F-statistic to take on a value close to 1. On the other hand, if $H_a$ is true, we expect F to be greater than 1.

How large does the F-statistic need to be before we can reject $H_0$ and conclude that there is a relationship? It turns out that the answer depends on the values of n and p. We use the p-value associated with the F-statistic, computed from the F-distribution, to determine whether or not to reject $H_0$.
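A sketch of the F-test computed by hand on simulated data with $p = 3$ predictors (the design matrix, seed, and true coefficients below are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# Multiple least squares fit with an intercept column
X1 = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]

# F-statistic for H0: beta1 = ... = betap = 0, and its p-value
rss = np.sum((y - X1 @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)
print(F, p_value)
```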

Two: Deciding on Important Variables

Methods to judge the quality of a model include Mallows' $C_p$, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted $R^2$.

Methods for variable selection: forward, backward, and mixed selection.

More details in Chapter 6!

Three: Model Fit

$R^2$

$RSE = \sqrt{\frac{RSS}{n-p-1}}$

Four: Predictions

We use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities. We interpret this to mean that 95% of intervals of this form will contain the true value of f(X). 

On the other hand, a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city. We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city.
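A sketch of the distinction using statsmodels (assumed to be installed, with simulated data): `get_prediction(...).summary_frame()` reports both the confidence interval for the mean response (`mean_ci_*`) and the prediction interval for a single new observation (`obs_ci_*`).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Fit y ~ x with an intercept
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Predict at two new x values and compare the two kinds of intervals
X_new = sm.add_constant(np.array([0.0, 1.0]))
pred = fit.get_prediction(X_new).summary_frame(alpha=0.05)
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper',
            'obs_ci_lower', 'obs_ci_upper']])
```

The prediction interval is always wider than the confidence interval, since it also accounts for the irreducible error $\epsilon$.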

 

3.3 Other Considerations in the Regression Model

3.3.1 Qualitative Predictors

If a qualitative predictor (also known as a factor) only has two levels, then incorporating it into a regression model is very simple. We simply create an indicator or dummy variable that takes on two possible numerical values.

For example, for a two-level factor such as gender, we can create a dummy variable $x_i = 1$ if the $i$th person is female and $x_i = 0$ if male, and use it as a predictor in the regression $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
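A toy sketch of this encoding with pandas and a statsmodels formula (the small balance/gender data frame below is invented for illustration, not the book's Credit data):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'balance': [500., 600., 550., 700., 650., 620.],
    'gender':  ['Female', 'Male', 'Female', 'Male', 'Male', 'Female'],
})
df['female'] = (df['gender'] == 'Female').astype(int)   # 0/1 dummy variable

fit = smf.ols('balance ~ female', data=df).fit()
# Intercept = average balance for males; 'female' = difference for females
print(fit.params)
```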

3.3.2 Extensions of the Linear Model

The standard linear regression model provides interpretable results and works quite well on many real-world problems. However, it makes two highly restrictive assumptions: that the relationship between the predictors and the response is additive and linear. The additive assumption means that the effect of changes in a predictor $X_j$ on the response Y is independent of the values of the other predictors.

Removing the Additive Assumption

To include an interaction effect, the model adds a third predictor, the interaction term $X_1 X_2$:

$$Y = \beta_0 +  \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$ 

The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
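A sketch with a statsmodels formula on simulated data: `x1 * x2` expands to the two main effects plus the interaction `x1:x2`, which respects the hierarchical principle (all names, seeds, and values below are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 1 + 2 * df.x1 + 3 * df.x2 + 4 * df.x1 * df.x2 + rng.normal(size=200)

# 'y ~ x1 * x2' is shorthand for 'y ~ x1 + x2 + x1:x2'
fit = smf.ols('y ~ x1 * x2', data=df).fit()
print(fit.params)
```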

Non-Linear Relationships

Polynomial regression in Chapter 7!

3.3.3 Potential Problems

1. Non-linearity of the response-predictor relationships: residual plot

2. Correlation of error terms: residual plot

3. Non-constant variance of error terms: log(Y) or sqrt(Y)

4. Outliers: Studentized residual = residual / estimated standard error

5. High-leverage points: Leverage statistic

Outliers are observations for which the response $y_i$ is unusual given the predictor $x_i$. In contrast, observations with high leverage have an unusual value for $x_i$. For simple linear regression, the leverage statistic is

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n (x_{i'} - \bar{x})^2}$$

It is clear from this equation that $h_i$ increases with the distance of $x_i$ from $\bar{x}$.

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another.

It is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity.

Compute the variance inflation factor (VIF):

$$VIF(\hat \beta_j) = \frac{1}{1 - R^2_{X_j|X_{-j}}}$$

where $R^2_{X_j|X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. If $R^2_{X_j|X_{-j}}$ is close to one, then collinearity is present, and so the VIF will be large. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
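A sketch of the VIF computation with statsmodels' `variance_inflation_factor`, on simulated predictors where `x2` is deliberately made collinear with `x1` (all values are hypothetical):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # strongly collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({'const': 1.0, 'x1': x1, 'x2': x2, 'x3': x3})

# VIF for each predictor (the constant column is included in the regressions
# but skipped in the report)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != 'const'}
print(vifs)   # x1 and x2 show large VIFs, x3 stays close to 1
```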

 
