ISL - Ch.2 Statistical Learning

2.1 What Is Statistical Learning? 

In essence, statistical learning refers to a set of approaches for estimating f.

 

2.1.1 Why Estimate f?

Prediction: predict Y using $\hat Y = \hat f(X)$, where $\hat f$ represents our estimate for f, and $\hat Y$ represents the resulting prediction for Y.

Inference: understand the relationship between X and Y, or more specifically, understand how Y changes as a function of $X_1, \ldots, X_p$.

An example: in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case one might be interested in how the individual input variables affect the prices — that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.

 

2.1.2 How Do We Estimate f? 

2 methods: parametric & non-parametric

Parametric: model-based. It reduces the problem of estimating f down to one of estimating a set of parameters.

Non-parametric: makes no explicit assumptions about the functional form of f. Instead, it seeks an estimate of f that gets as close to the data points as possible without being too rough or wiggly.
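A minimal sketch contrasting the two approaches on simulated data, assuming scikit-learn is available; the linear model (parametric) only has to estimate an intercept and a slope, while KNN regression (non-parametric) assumes no functional form and simply averages nearby responses. The sine-shaped f, the sample size, and K = 10 are illustrative choices, not anything from the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Simulated data: the true f is nonlinear, Y = f(X) + epsilon
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Parametric: assume f is linear, then estimate its two parameters
linear = LinearRegression().fit(X, y)
print("intercept, slope:", linear.intercept_, linear.coef_)

# Non-parametric: no functional form assumed; a prediction is the
# average response of the K nearest training points
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)

x_new = np.array([[5.0]])
print("linear prediction:", linear.predict(x_new))
print("KNN prediction:   ", knn.predict(x_new))
```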

 

2.1.3 The Trade-off Between Prediction Accuracy and Model Interpretability 

Example: when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and $X_1, X_2, \ldots, X_p$. In contrast, very flexible approaches, such as the splines discussed in Chapter 7 and the boosting methods discussed in Chapter 8, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

 

2.1.4 Supervised versus Unsupervised Learning 

Supervised: we have both predictor measurements and a response measurement. 

Unsupervised: we have predictor measurements but no response measurement. 

 

2.1.5 Regression versus Classification Problems 

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.  

 

2.2 Assessing Model Accuracy 

In this section, we discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set. Remember, there is no free lunch in statistics: no one method dominates all others over all possible data sets.  

 

2.2.1 Measuring the Quality of Fit 

In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by 

$$MSE = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat f(x_i))^2$$ 

In practice, one can usually compute the training MSE with relative ease, but estimating test MSE is considerably more difficult because usually no test data are available.  
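A short sketch of the distinction, assuming only NumPy and a simulated data set in which part of the observations are held back to play the role of test data; the sine-shaped f, the noise level, and the degree-5 polynomial fit are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: Y = sin(X) + epsilon, split into training and test halves
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
x_train, y_train = x[:200], y[:200]
x_test,  y_test  = x[200:], y[200:]

# Fit a degree-5 polynomial using the training observations only
coefs = np.polyfit(x_train, y_train, deg=5)

def mse(y_obs, y_pred):
    # MSE = average squared difference between observed and predicted Y
    return np.mean((y_obs - y_pred) ** 2)

print("training MSE:", mse(y_train, np.polyval(coefs, x_train)))
print("test MSE:    ", mse(y_test, np.polyval(coefs, x_test)))
```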

Throughout the book, a variety of approaches are discussed that can be used in practice to estimate the level of flexibility at which the test MSE is minimized. One important method is cross-validation (Chapter 5).
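As a preview, a sketch of using cross-validation to estimate the test MSE at several levels of flexibility (here, the number of neighbors K in KNN regression) and locating the smallest estimate; scikit-learn and the simulated data are assumptions of the example, not part of the book's text.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

# Estimate test MSE by 5-fold cross-validation for several values of K
for k in (1, 5, 20, 50):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"K={k:3d}  estimated test MSE = {-scores.mean():.3f}")
```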

 

2.2.2 The Bias-Variance Trade-Off 

The expected test MSE, for a given value $x_0$, can always be decomposed into the sum of three fundamental quantities: the variance of $\hat f(x_0)$, the squared bias of $\hat f(x_0)$, and the variance of the error terms $\epsilon$:

$$E(y_0 - \hat f(x_0))^2 = Var(\hat f(x_0))+[Bias(\hat f(x_0))]^2+Var(\epsilon)$$

Variance refers to the amount by which $\hat f$ would change if we estimated it using a different training data set. 

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. 

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease (bias-variance trade-off).
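A simulation sketch of the decomposition: because the true f is known by construction, we can repeatedly draw training sets, refit, and estimate the variance and squared bias of $\hat f(x_0)$ at a single test point. The choice of f, the noise level, and the two K values (a flexible K = 1 fit versus an inflexible K = 50 fit) are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
f = np.sin                      # the true f, known only because we simulate
x0, sigma = 5.0, 0.3            # test point and noise level
preds = {1: [], 50: []}         # flexible (K=1) vs inflexible (K=50)

# Repeatedly draw training sets and record f_hat(x0) for each fit
for _ in range(500):
    X = rng.uniform(0, 10, size=(100, 1))
    y = f(X).ravel() + rng.normal(scale=sigma, size=100)
    for k in preds:
        fit = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        preds[k].append(fit.predict([[x0]])[0])

for k, p in preds.items():
    p = np.array(p)
    var = p.var()                       # Var(f_hat(x0))
    bias_sq = (p.mean() - f(x0)) ** 2   # [Bias(f_hat(x0))]^2
    print(f"K={k:2d}  variance={var:.4f}  squared bias={bias_sq:.4f}")
```

With this setup the flexible K = 1 fit typically shows the larger variance and the smaller squared bias, and the inflexible K = 50 fit the reverse.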

 

2.2.3 The Classification Setting 

In the classification setting, the most commonly-used measure is the error rate $\frac{1}{n}\sum_{i=1}^{n} I(y_i\neq \hat y_i)$, the proportion of incorrect classifications.

Here $I(y_i \neq \hat y_i)$ is an indicator variable that equals 1 if $y_i \neq \hat y_i$ and 0 if $y_i = \hat y_i$, so the formula computes the fraction of observations that are misclassified.
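A minimal sketch of the computation, with made-up labels purely for illustration:

```python
import numpy as np

y     = np.array(["blue", "orange", "blue", "blue",   "orange"])
y_hat = np.array(["blue", "blue",   "blue", "orange", "orange"])

# I(y_i != y_hat_i) is 1 for a misclassified observation and 0 otherwise,
# so the mean of the indicators is the fraction of mistakes
error_rate = np.mean(y != y_hat)
print(error_rate)  # 0.4 here: 2 of the 5 observations are misclassified
```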

The test error rate is minimized, on average, by the Bayes classifier. In other words, we should simply assign a test observation with predictor vector $x_0$ to the class j for which

$$Pr(Y=j|X=x_0)$$

is largest. Note that this is a conditional probability: the probability that $Y = j$, given the observed predictor vector $x_0$.
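As a concrete illustration, in a simulated setting where the class-conditional densities and prior probabilities are chosen by us, this rule can be applied exactly. A minimal sketch assuming SciPy; the Gaussian densities, the equal priors, and the orange/blue labels are illustrative, not from the book.

```python
from scipy.stats import norm

# Simulated setup: X | Y=orange ~ N(1, 1), X | Y=blue ~ N(-1, 1),
# and the two classes are equally likely a priori.
def bayes_classifier(x0):
    w_orange = norm.pdf(x0, loc=1.0) * 0.5    # density at x0 times Pr(Y=orange)
    w_blue = norm.pdf(x0, loc=-1.0) * 0.5     # density at x0 times Pr(Y=blue)
    # Assign x0 to the class j with the largest Pr(Y = j | X = x0)
    pr_orange = w_orange / (w_orange + w_blue)
    return "orange" if pr_orange > 0.5 else "blue"

print(bayes_classifier(0.3))    # orange: just right of the boundary at x = 0
print(bayes_classifier(-2.0))   # blue
```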

Example:

The orange shaded region reflects the set of points for which Pr(Y = orange|X) is greater than 50%, while the blue shaded region indicates the set of points for which the probability is below 50%. The purple dashed line represents the points where the probability is exactly 50%. This is called the Bayes decision boundary.  

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible.

Instead, many approaches attempt to estimate the conditional distribution of Y given X and then classify an observation to the class with the highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier.

An example:

The choice of K has a drastic effect on the KNN classifier obtained. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier.
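A sketch of this effect, assuming scikit-learn and a simulated two-class data set (make_moons is just an illustrative stand-in, not the data from the book): small K yields a very flexible classifier with low training error, while large K yields a smoother, more nearly linear boundary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Simulated two-class data with a nonlinear decision boundary
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small K: very flexible (low bias, high variance);
# large K: less flexible, boundary close to linear (high bias, low variance)
for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = np.mean(knn.predict(X_train) != y_train)
    test_err = np.mean(knn.predict(X_test) != y_test)
    print(f"K={k:3d}  training error={train_err:.3f}  test error={test_err:.3f}")
```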

 
