Statistical Machine Learning - Introduction to Statistical Learning - Reading Notes - CH4 Classification
response variable:
- quantitative
- qualitative / categorical
Methods for classification:
- First predict the probability that the observation belongs to each of the categories of a qualitative variable; the observation is then assigned to the class with the highest predicted probability.
Reading tips:
- The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular Poisson regression.
4.1 Overview
In the classification setting, we have a set of training observations $(x_1,y_1),...,(x_n,y_n)$ that we can use to build a classifier.
Data set: `Default`, with columns `income`, `balance`, `student`, and the response `default`.
4.2 Why not Linear Regression?
For example, suppose we code a response with three possible diagnoses as $$Y=\begin{cases}1 & \text{if stroke;}\\2 & \text{if drug overdose;}\\3 & \text{if epileptic seizure.}\end{cases}$$
Facts:
- This coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
- If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
- Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
- For a binary response with a 0/1 coding as above, regression by least squares is not completely unreasonable: it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $Pr(Y=1|X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities.
Summary: There are at least two reasons not to perform classification using a regression method:
(a) a regression method cannot accommodate a qualitative response with more than two classes;
(b) a regression method will not provide meaningful estimates of $Pr(Y|X)$, even with just two classes.
4.3 Logistic Regression
Logistic regression models the probability that Y belongs to a particular category.
4.3.1 The Logistic Model
First, consider modeling $p(X)=Pr(Y=1|X)$ with a linear regression model, $p(X)=\beta_0+\beta_1X$: the fitted values can fall below 0 or above 1. Logistic regression instead uses the logistic function $$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
To fit the model above, we use a method called maximum likelihood estimation (MLE).
After a bit of manipulation of the above equation, we find that $$\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X}$$
The quantity $p(X)/(1-p(X))$ is called the odds, and can take on any value between $0$ and $\infty$.
By taking the logarithm of both sides, we arrive at $$log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$$
The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in $X$.
The amount that $p(X)$ changes due to a one-unit change in $X$ depends on the current value of $X$; increasing $X$ by one unit changes the log odds by $\beta_1$.
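As a quick illustration, the short Python sketch below (not from the book, which uses R; the coefficient values are made up) evaluates the logistic function, the odds, and the logit, confirming that the logit is linear in $X$ while $p(X)$ stays in (0, 1).

```python
# Minimal sketch of the logistic model p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)).
# The coefficients below are hypothetical illustration values, not fitted estimates.
import numpy as np

beta0, beta1 = -10.65, 0.0055

def p(x):
    """Logistic function applied to the linear predictor beta0 + beta1 * x."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

x = np.array([1000.0, 2000.0])
prob = p(x)                    # probabilities, always inside (0, 1)
odds = prob / (1 - prob)       # p(X) / (1 - p(X))
logit = np.log(odds)           # equals beta0 + beta1 * x, i.e. linear in x
print(prob, odds, logit)
```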
4.3.2 Estimating the Regression Coefficients
The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ for each observation corresponds as closely as possible to that observation's actual class.
Here, $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize the likelihood function $$\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i':y_{i'}=0}(1-p(x_{i'}))$$
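A minimal sketch of the maximum likelihood idea, assuming synthetic data and using `scipy.optimize.minimize` on the negative log-likelihood (the book fits the model in R via `glm`; this only illustrates the principle):

```python
# Maximum likelihood estimation for simple logistic regression on synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=500)
true_b0, true_b1 = -1.0, 2.0
p_true = 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.binomial(1, p_true)                      # 0/1 responses

def neg_log_likelihood(beta):
    b0, b1 = beta
    z = b0 + b1 * x
    # log-likelihood: sum_i [ y_i * z_i - log(1 + exp(z_i)) ]
    return -np.sum(y * z - np.log1p(np.exp(z)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(res.x)   # MLE estimates, close to (-1.0, 2.0)
```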
4.3.3 Making Predictions
Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book.
For qualitative predictors with two categories, we can simply create a dummy variable that takes the values 0 and 1.
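For example, with pandas one might code a hypothetical `student` variable as a 0/1 dummy like this (a sketch, not the book's lab code):

```python
# Coding a two-level qualitative predictor as a 0/1 dummy variable.
import pandas as pd

df = pd.DataFrame({"student": ["Yes", "No", "No", "Yes"]})
df["student_dummy"] = (df["student"] == "Yes").astype(int)  # 1 = student, 0 = non-student
print(df)
```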
4.3.4 Multiple Logistic Regression
$$log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+...+\beta_pX_p$$ where $X=(X_1,...,X_p)$ are $p$ predictors.
The above equation is equivalent to $$p(X)=\frac{e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+...+\beta_pX_p}}$$
Confounding
For the `Default` dataset, the negative coefficient for `student` in the multiple logistic regression indicates that, for a fixed value of `balance` and `income`, a student is less likely to default than a non-student.
But the bar plot, which shows the default rates for students and non-students averaged over all values of `balance` and `income`, suggests the opposite effect: the overall student default rate is higher than the non-student default rate.
Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.
![[Pasted image 20221017160629.png]]
![[Pasted image 20221017160707.png]]
(Confounding) This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.
4.3.5 Multinomial Logistic Regression
Want to do: classify a response variable that has more than two classes
However, the logistic regression approach that we have seen in this section only allows for K = 2 classes for the response variable.
It turns out that it is possible to extend the two-class logistic regression approach to the setting of K > 2 classes. This extension is sometimes known as multinomial logistic regression.
- We first select a single class to serve as the baseline; without loss of generality, we select the Kth class for this role. Then we replace the model by $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$ for $k=1,...,K-1$, and $$Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$
- so that $$log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p$$ for $k=1,...,K-1$. This indicates that the log odds between any pair of classes is linear in the features.
- The coefficient estimates will differ between the two fitted models due to the differing choice of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.
Softmax coding
The softmax coding is equivalent to the coding just described, in the sense that the fitted values, the log odds between any pair of classes, and the other key model outputs will remain the same regardless of the coding.
In the softmax coding, rather than selecting a baseline class, we treat all K classes symmetrically and assume that for $k=1,...,K$, $$Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}$$
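A minimal numerical sketch of this equivalence, with made-up coefficients for K = 3 classes and a single predictor: shifting all coefficient vectors by the Kth one (the baseline coding) leaves the fitted probabilities unchanged.

```python
import numpy as np

x = 1.5
# Softmax coding: one (intercept, slope) pair per class (hypothetical values).
softmax_beta = np.array([[ 0.2,  1.0],   # class 1
                         [ 0.5, -0.3],   # class 2
                         [-0.1,  0.4]])  # class 3
z = softmax_beta[:, 0] + softmax_beta[:, 1] * x
probs_softmax = np.exp(z) / np.exp(z).sum()

# Baseline coding: subtract the Kth class's coefficients, so the Kth row is zero
# and its exponential contributes the "1 +" term in the denominator.
baseline_beta = softmax_beta - softmax_beta[-1]
z_b = baseline_beta[:, 0] + baseline_beta[:, 1] * x
probs_baseline = np.exp(z_b) / np.exp(z_b).sum()

print(probs_softmax)
print(np.allclose(probs_softmax, probs_baseline))  # True: same fitted values
```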
4.4 Generative Models for Classification
In statistical jargon, we model the conditional distribution of the response Y , given the predictor(s) X.
In this new approach, we model the distribution of the predictors X separately in each of the response classes (i.e. for each value of Y). We then use Bayes' theorem to flip these around into estimates for $Pr(Y=k|X=x)$.
advantages:
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable. The methods that we consider in this section do not suffer from this problem.
- If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
- The methods in this section can be naturally extended to the case of more than two response classes. (In the case of more than two response classes, we can also use multinomial logistic regression from Section 4.3.5.)
Let $\pi_k$ denote the overall or prior probability that a randomly chosen observation comes from the $k$th class, and let $f_k(x)\equiv Pr(X=x|Y=k)$ denote the density function of $X$ for an observation that comes from the $k$th class. Then Bayes' theorem states that $$p_k(x)=Pr(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}$$
$p_k(x)$ is the posterior probability that an observation belongs to the $k$th class.
- In general, estimating $\pi_k$ is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the $k$th class.
- As we will see, to estimate $f_k(x)$, we will typically have to make some simplifying assumptions.
A numerical sketch of this posterior calculation is given below.
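A minimal numerical sketch of Bayes' theorem for classification, assuming Gaussian densities $f_k$ and made-up priors and means for $K=2$ classes:

```python
# Posterior probabilities Pr(Y=k | X=x) = pi_k * f_k(x) / sum_l pi_l * f_l(x).
import numpy as np
from scipy.stats import norm

pi = np.array([0.7, 0.3])     # prior probabilities pi_k (illustrative)
mu = np.array([0.0, 2.0])     # class means (illustrative)
sigma = 1.0                   # shared standard deviation (illustrative)

x = 1.2
f = norm.pdf(x, loc=mu, scale=sigma)     # densities f_k(x) for each class
posterior = pi * f / np.sum(pi * f)      # p_k(x), sums to 1
print(posterior)
```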
4.4.1 Linear Discriminant Analysis for p=1
Assumptions:
- $f_k(x)$ is normal (Gaussian) with class-specific mean $\mu_k$ and a shared variance $\sigma^2$: $$f_k(x)=\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)$$
so we have $$p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)}$$
Taking the log, this is equivalent to assigning the observation to the class for which $$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+log(\pi_k)$$ is largest.
The Bayes decision boundary is the point for which $\delta_1(x)=\delta_2(x)$; with $K=2$ and $\pi_1=\pi_2$, this is $x=(\mu_1+\mu_2)/2$.
In practice, the following estimates are used: $$\hat{\mu}_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i,\qquad \hat{\sigma}^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2,\qquad \hat{\pi}_k=\frac{n_k}{n}$$
The functions $\hat{\delta}_k(x)$ obtained by plugging in these estimates are called discriminant functions.
Summary: the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
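A minimal sketch of these plug-in estimates and one-dimensional discriminant functions on synthetic data (the class means, sizes, and variance below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = rng.normal(-1.0, 1.0, size=40)   # class 0 training observations
x1 = rng.normal( 1.5, 1.0, size=60)   # class 1 training observations
n, K = len(x0) + len(x1), 2

pi_hat = np.array([len(x0), len(x1)]) / n        # prior estimates pi_k
mu_hat = np.array([x0.mean(), x1.mean()])        # class-specific means
# pooled (shared) variance: (1/(n-K)) * sum_k sum_{i in class k} (x_i - mu_k)^2
sigma2_hat = (((x0 - mu_hat[0])**2).sum() + ((x1 - mu_hat[1])**2).sum()) / (n - K)

def delta(x):
    """Linear discriminant functions delta_k(x) for k = 0, 1."""
    return x * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

x_new = 0.3
print(delta(x_new), "-> predicted class:", np.argmax(delta(x_new)))
```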
4.4.2 Linear Discriminant Analysis for p > 1
Assumptions:
- X = (X1, X2,...,Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix.
Note: the multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
An example of the multivariate Gaussian density with $p=2$ predictors:
![[Pasted image 20221017165257.png]]
We write $X\sim N(\mu,\Sigma)$, where $E(X)=\mu$ is the mean vector and $Cov(X)=\Sigma$ is the covariance matrix.
The discriminant functions become $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+log\pi_k$$
Summary:
- We need to estimate the unknown parameters $\mu_1,...,\mu_K$, $\pi_1,...,\pi_K$, and $\Sigma$.
- To assign a new observation $X=x$, LDA plugs these estimates in to obtain the quantities $\hat{\delta}_k(x)$, and classifies $x$ to the class for which $\hat{\delta}_k(x)$ is largest.
- $\hat{\delta}_k(x)$ is a linear function of $x$; that is, the LDA decision rule depends on $x$ only through a linear combination of its elements (see the sketch below).
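A minimal sketch of multivariate LDA using scikit-learn's `LinearDiscriminantAnalysis` on synthetic two-class data with a shared covariance matrix (the book's labs use R; this is only an illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])            # common covariance matrix
X0 = rng.multivariate_normal([0, 0], cov, size=100)  # class 0
X1 = rng.multivariate_normal([2, 2], cov, size=100)  # class 1
X = np.vstack([X0, X1])
y = np.array([0]*100 + [1]*100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))        # class label for a new observation
print(lda.predict_proba([[1.0, 1.0]]))  # posterior probabilities
```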
Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.
LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.
The Bayes classifier works by assigning an observation to the class for which the posterior probability $p_k(X)$ is greatest; in the two-class `Default` setting, this amounts to predicting default whenever $Pr(default=Yes|X=x)>0.5$.
Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default.
As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases.
![[Pasted image 20221017190149.png]]
ROC curve
The ROC (receiver operating characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
![[Pasted image 20221017190418.png]]
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
![[Pasted image 20221017190710.png]]
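A minimal sketch of computing an ROC curve and AUC for an LDA classifier on synthetic data, using scikit-learn's `roc_curve` and `roc_auc_score`; the posterior probability of the positive class serves as the score, and the curve sweeps over all thresholds.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0]*200 + [1]*200)

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.predict_proba(X)[:, 1]            # posterior Pr(Y = 1 | X)

fpr, tpr, thresholds = roc_curve(y, scores)    # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y, scores))        # area under the ROC curve
```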
4.4.3 Quadratic Discriminant Analysis
Assumptions:
- (like LDA) the observations in each class are drawn from a Gaussian distribution
- (like LDA) each class has its own mean vector $\mu_k$
- (unlike LDA) each class has its own covariance matrix $\Sigma_k$
Under these assumptions, the Bayes classifier assigns an observation $X=x$ to the class for which $$\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}log|\Sigma_k|+log\pi_k$$ is largest. The quantity $x$ appears as a quadratic function in $\delta_k(x)$; this is where QDA gets its name.
why would one prefer LDA to QDA, or vice-versa?
The answer lies in the bias-variance trade-off. When there are p predictors, then estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters.
LDA is a much less flexible classifier than QDA, and so has substantially lower variance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.
Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
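A minimal sketch comparing LDA and QDA on synthetic data whose two classes have different covariance matrices, so the common-covariance assumption behind LDA is deliberately violated:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0, 0], [[1.0,  0.6], [ 0.6, 1.0]], size=300)
X1 = rng.multivariate_normal([1, 1], [[1.0, -0.6], [-0.6, 1.0]], size=300)
X = np.vstack([X0, X1])
y = np.array([0]*300 + [1]*300)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
# With unequal covariance matrices, QDA's quadratic boundary typically fits better.
```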
4.4.4 Naive Bayes
Assumption:
- Within the $k$th class, the $p$ predictors are independent. Instead of assuming that the class densities belong to a particular family of distributions (e.g. multivariate normal), we assume that for $k=1,...,K$, $$f_k(x)=f_{k1}(x_1)\times f_{k2}(x_2)\times...\times f_{kp}(x_p)$$ where $f_{kj}$ is the density function of the $j$th predictor among observations in the $k$th class.
Why is this assumption so powerful?
Essentially, estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (that is, the distribution of each predictor on its own) but also the joint distribution of the predictors (that is, the association between the different predictors).
it often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class.
Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.
To estimate the one-dimensional density function $f_{kj}$ using the training data, we have a few options:
- If $X_j$ is quantitative, then we can assume that $X_j|Y=k\sim N(\mu_{jk},\sigma_{jk}^2)$. While this may sound a bit like QDA, there is one key difference: here we are assuming that the predictors are independent, which amounts to QDA with an additional assumption that the class-specific covariance matrices are diagonal.
- If $X_j$ is quantitative, then another option is to use a non-parametric estimate for $f_{kj}$. A very simple way to do this is by making a histogram of the observations of the $j$th predictor within each class. Then we can estimate $f_{kj}(x_j)$ as the fraction of the training observations in the $k$th class that belong to the same histogram bin as $x_j$. Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.
- If $X_j$ is qualitative, then we can simply count the proportion of training observations for the $j$th predictor corresponding to each class.
We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.
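A minimal sketch of Gaussian naive Bayes with scikit-learn's `GaussianNB`, which corresponds to the first option above (a one-dimensional normal density for each predictor within each class), on synthetic data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 4)),
               rng.normal(1.0, 1.0, size=(150, 4))])
y = np.array([0]*150 + [1]*150)

nb = GaussianNB().fit(X, y)
print(nb.predict(X[:5]))          # class predictions
print(nb.predict_proba(X[:5]))    # posteriors under the independence assumption
```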
4.5 A Comparison of Classification Methods
4.5.1 An Analytical Comparison
Equivalently, we can set $K$ as the baseline class and assign an observation to the class that maximizes $log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)$.
- For LDA, $$log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j$$ where $a_k$ and $b_{kj}$ are functions of $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, and $\Sigma$, so the log odds is linear in $x$.
- For QDA, $$log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$ where $a_k$, $b_{kj}$, and $c_{kjl}$ are functions of $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, $\Sigma_k$, and $\Sigma_K$.
- For naive Bayes, $$log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p g_{kj}(x_j)$$ where $a_k=log\frac{\pi_k}{\pi_K}$ and $g_{kj}(x_j)=log\frac{f_{kj}(x_j)}{f_{Kj}(x_j)}$. Hence, the right-hand side takes the form of a generalized additive model (CH7).
Summary 1:
- LDA is a special case of QDA with $c_{kjl}=0$ for all $j$, $l$, and $k$. (Of course, this is not surprising, since LDA is simply a restricted version of QDA with $\Sigma_1=...=\Sigma_K=\Sigma$.)
- Any classifier with a linear decision boundary is a special case of naive Bayes with $g_{kj}(x_j)=b_{kj}x_j$. In particular, this means that LDA is a special case of naive Bayes!
- If we model $f_{kj}(x_j)$ in the naive Bayes classifier using a one-dimensional Gaussian distribution $N(\mu_{kj},\sigma_j^2)$, then we end up with $g_{kj}(x_j)=b_{kj}x_j$ where $b_{kj}=(\mu_{kj}-\mu_{Kj})/\sigma_j^2$. In this case, naive Bayes is actually a special case of LDA with $\Sigma$ restricted to be a diagonal matrix with $j$th diagonal element equal to $\sigma_j^2$.
- Neither QDA nor naive Bayes is a special case of the other. Naive Bayes can produce a more flexible fit, since any choice can be made for $g_{kj}(x_j)$. However, it is restricted to a purely additive fit: the terms are never multiplied. By contrast, QDA includes multiplicative terms of the form $c_{kjl}x_jx_l$. Therefore, QDA has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- For multinomial logistic regression, $$log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=\beta_{k0}+\sum_{j=1}^p\beta_{kj}x_j$$ This is identical to the linear form of LDA. In LDA, the coefficients are functions of estimates of $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, and $\Sigma$, obtained by assuming that $X$ follows a normal distribution within each class. In logistic regression, the coefficients are chosen to maximize the likelihood function. Thus, we expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.
- For K-nearest neighbors (KNN), in order to make a prediction for an observation $X=x$, the $K$ training observations that are closest to $x$ are identified. Then $X$ is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary.
Summary 2:
- Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that $n$ is very large and $p$ is small.
- KNN requires a lot of observations relative to the number of predictors, that is, $n$ much larger than $p$.
- KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
- In settings where the decision boundary is non-linear but $n$ is only modest, or $p$ is not very small, QDA may be preferred to KNN.
- Unlike logistic regression, KNN does not tell us which predictors are important.
4.5.2 An Empirical Comparison
- When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
- When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.
- for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
- Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including $X^2$, $X^3$, and even $X^4$ as predictors. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different. (A small empirical sketch follows below.)
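A minimal empirical sketch in the spirit of this comparison: several of the classifiers above fit to the same simulated two-class problem with a non-linear boundary (`make_moons` is a convenient synthetic dataset, not one of the book's scenarios):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "KNN (K=10)": KNeighborsClassifier(n_neighbors=10),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```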
4.6 Generalized Linear Models
dataset: Bikeshare
4.6.2 Poisson Regression on the Bikeshare Data
Poisson distribution: $$Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}\quad \text{for } k=0,1,2,...$$ with $E(Y)=Var(Y)=\lambda$. Poisson regression models a count response by assuming $Y|X\sim Poisson(\lambda(X))$ with $$log(\lambda(X_1,...,X_p))=\beta_0+\beta_1X_1+...+\beta_pX_p$$
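A minimal sketch of Poisson regression on synthetic count data using scikit-learn's `PoissonRegressor`, which fits the log-linear mean model above (the book analyzes the `Bikeshare` data in R; the variables here are made up):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
lam = np.exp(0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])   # log(lambda) is linear in X
y = rng.poisson(lam)                                 # count response

model = PoissonRegressor(alpha=0.0).fit(X, y)        # alpha=0 disables the penalty
print(model.intercept_, model.coef_)                 # estimates near (0.5, 1.2, -0.8)
print(model.predict(X[:3]))                          # predicted mean counts
```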