Statistical Machine Learning - Introduction to Statistical Learning - Reading Notes - CH4 Classification

response variable:

  • quantitative
  • qualitative / categorical

methods for classification

  • first predict the probability that the observation belongs to each category of the qualitative variable, then assign the observation to the class with the highest predicted probability.

Reading tips:

  • The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular Poisson regression.

4.1 Overview

In the classification setting, we have a set of training observations $(x_1,y_1),\dots,(x_n,y_n)$ that we can use to build a classifier.

Data set: Default, with columns default, student, balance, and income.

4.2 Why not Linear Regression?

For example,
$$Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}$$
Facts:

  • This coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
  • The 1/2/3 coding implies $0.5\cdot\text{epileptic seizure}+0.5\cdot\text{stroke}=\text{drug overdose}$, i.e. drug overdose is treated as the exact average of the other two outcomes.
  • If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
  • Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
  • For a binary response with a 0/1 coding as above, regression by least squares is not completely unreasonable: it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose}\mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, making them hard to interpret as probabilities.

Summary: There are at least two reasons not to perform classification using a regression method:
(a) a regression method cannot accommodate a qualitative response with more than two classes;
(b) a regression method will not provide meaningful estimates of Pr(Y|X), even with just two classes.

4.3 Logistic Regression

Logistic regression models the probability that Y belongs to a particular category.

4.3.1 The Logistic Model

First, consider a linear regression model: $p(X)=\beta_0+\beta_1X$. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict $p(X)<0$ for some values of $X$ and $p(X)>1$ for others (unless the range of $X$ is limited). To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. In logistic regression, we use the logistic function, $$p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$$
To fit the model above, we use a method called maximum likelihood estimation (MLE).

After a bit of manipulation of the above equation, we find that $$\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X}$$
The quantity $p(X)/[1-p(X)]$ is called the odds, and can take on any value between $0$ and $\infty$.
By taking the logarithm of both sides, we arrive at $$\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$$
The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in X.

The amount that p(X) changes due to a one-unit change in X depends on the current value of X. But regardless of the value of X, if β1 is positive then increasing X will be associated with increasing p(X), and if β1 is negative then increasing X will be associated with decreasing p(X).
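
To make the shape of the logistic function concrete, here is a minimal NumPy sketch (the coefficient values below are arbitrary placeholders, not estimates from the Default data):

```python
import numpy as np

def logistic(x, beta0, beta1):
    """p(X) = exp(b0 + b1*X) / (1 + exp(b0 + b1*X)); always strictly between 0 and 1."""
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

x = np.linspace(-10, 10, 5)
print(logistic(x, beta0=0.0, beta1=1.0))   # monotonically increasing, bounded in (0, 1)
```

With $\beta_1>0$ the curve is increasing in $x$, matching the statement above.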

4.3.2 Estimating the Regression Coefficients

The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default for each individual corresponds as closely as possible to the individual's observed default status.
Here, the likelihood is $$\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i':y_{i'}=0}(1-p(x_{i'}))$$
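
A hedged sketch of fitting this model in Python with statsmodels, whose `Logit` routine maximizes the likelihood above numerically; it assumes the Default data are already loaded as a pandas DataFrame named `default_df` with columns `default` ("Yes"/"No") and `balance` (the name `default_df` is just an illustrative assumption):

```python
import statsmodels.api as sm

# Assumes default_df is a pandas DataFrame with columns 'default' and 'balance'.
y = (default_df["default"] == "Yes").astype(int)   # 0/1 coding of the response
X = sm.add_constant(default_df[["balance"]])       # adds the intercept column

fit = sm.Logit(y, X).fit()   # maximum likelihood estimation of beta0, beta1
print(fit.params)            # the estimates beta0_hat and beta1_hat
```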

4.3.3 Making Predictions

Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book.
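
As a concrete numerical example of making predictions, a small sketch that plugs the single-predictor coefficient estimates reported in ISLR for balance ($\hat\beta_0 \approx -10.6513$, $\hat\beta_1 \approx 0.0055$) into the logistic function; your own fit may give slightly different numbers:

```python
import numpy as np

def predict_default_prob(balance, beta0=-10.6513, beta1=0.0055):
    """Predicted probability of default given a credit card balance."""
    z = beta0 + beta1 * balance
    return np.exp(z) / (1.0 + np.exp(z))

print(predict_default_prob(1000))   # roughly 0.006: low default probability
print(predict_default_prob(2000))   # roughly 0.59:  much higher default probability
```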

For qualitative predictors with 2 categories, we may simply create a dummy variable that takes the values 0/1.

4.3.4 Multiple Logistic Regression

$$\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$$

where $X=(X_1,\dots,X_p)$ are $p$ predictors.

The above equation is equivalent to $$p(X)=\frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}$$

confounding

For the Default dataset, the negative coefficient for student in the multiple logistic regression indicates that, for a fixed value of balance and income, a student is less likely to default than a non-student.
But the bar plot, which shows the default rates for students and non-students averaged over all values of balance and income, suggests the opposite effect: the overall student default rate is higher than the non-student default rate.

Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students.

![[Pasted image 20221017160629.png]]
![[Pasted image 20221017160707.png]]

(Confounding) This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.

4.3.5 Multinomial Logistic Regression

Want to do: classify a response variable that has more than two classes

However, the logistic regression approach that we have seen in this section only allows for K = 2 classes for the response variable.

It turns out that it is possible to extend the two-class logistic regression approach to the setting of K > 2 classes. This extension is sometimes known as multinomial logistic regression.

  1. We first select a single class to serve as the baseline; without loss of generality, we select the $K$th class for this role. Then we model $$\Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+\cdots+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+\cdots+\beta_{lp}x_p}}$$ for $k=1,\dots,K-1$, and $$\Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+\cdots+\beta_{lp}x_p}}$$
  2. It follows that $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)=\beta_{k0}+\beta_{k1}x_1+\cdots+\beta_{kp}x_p$$ for $k=1,\dots,K-1$. This indicates that the log odds between any pair of classes is linear in the features.
  3. The coefficient estimates will differ between two fitted models due to a differing choice of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.
Softmax coding

The softmax coding is equivalent to the baseline coding just described, in the sense that the fitted values, the log odds between any pair of classes, and other key model outputs will remain the same regardless of coding.
In the softmax coding, rather than selecting a baseline class, we treat all $K$ classes symmetrically, and assume that for $k=1,\dots,K$, $$\Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+\cdots+\beta_{kp}x_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}x_1+\cdots+\beta_{lp}x_p}}$$ Thus, rather than estimating coefficients for $K-1$ classes, we actually estimate coefficients for all $K$ classes. The log odds ratio between the $k$th and $k'$th classes equals $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=k'|X=x)}\right)=(\beta_{k0}-\beta_{k'0})+(\beta_{k1}-\beta_{k'1})x_1+\cdots+(\beta_{kp}-\beta_{k'p})x_p$$
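
A minimal NumPy sketch of the softmax coding: each class gets its own coefficient vector, and the class probabilities are obtained by exponentiating the $K$ linear scores and normalizing. The coefficient matrix below is a made-up placeholder, not a fitted model:

```python
import numpy as np

def softmax_probs(x, B):
    """B has shape (K, p+1): one row (intercept, coefficients) per class.
    Returns Pr(Y = k | X = x) for k = 1, ..., K under the softmax coding."""
    x1 = np.concatenate(([1.0], x))   # prepend 1 for the intercept term
    scores = B @ x1                   # beta_k0 + beta_k1 x_1 + ... + beta_kp x_p
    scores -= scores.max()            # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

B = np.array([[ 0.5,  1.0, -2.0],     # K = 3 classes, p = 2 predictors (made up)
              [ 0.0,  0.3,  0.4],
              [-1.0, -0.5,  1.5]])
print(softmax_probs(np.array([1.2, -0.7]), B))   # probabilities summing to 1
```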

4.4 Generative Models for Classification

In statistical jargon, logistic regression models the conditional distribution of the response $Y$, given the predictor(s) $X$.
In this new approach, we model the distribution of the predictors X separately in each of the response classes (i.e. for each value of Y ). We then use Bayes’ theorem to flip these around into estimates for Pr(Y=k|X=x). When the distribution of X within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression.

advantages:

  • When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable. The methods that we consider in this section do not suffer from this problem.
  • If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then the approaches in this section may be more accurate than logistic regression.
  • The methods in this section can be naturally extended to the case of more than two response classes. (In the case of more than two response classes, we can also use multinomial logistic regression from Section 4.3.5.)

Let $f_k(X) \equiv \Pr(X|Y=k)$ denote the density function of $X$ for an observation that comes from the $k$th class, and let $\pi_k$ denote the prior probability that a randomly chosen observation comes from the $k$th class. Then Bayes' theorem states that $$\Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

  • $p_k(x)=\Pr(Y=k|X=x)$ is the posterior probability that an observation $X=x$ belongs to the $k$th class.
  • In general, estimating πk is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the kth class.
  • As we will see, to estimate fk(x), we will typically have to make some simplifying assumptions.
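
The Bayes' theorem computation above is easy to sketch directly: given priors $\pi_k$ and class densities $f_k$ evaluated at $x$, the posterior is a normalized product. The Gaussian densities and priors below are toy placeholders:

```python
import numpy as np
from scipy.stats import norm

def posterior(x, priors, densities):
    """Pr(Y=k | X=x) = pi_k f_k(x) / sum_l pi_l f_l(x)."""
    unnorm = np.array([pi * f(x) for pi, f in zip(priors, densities)])
    return unnorm / unnorm.sum()

# Toy example: two classes with one-dimensional Gaussian densities.
priors = [0.7, 0.3]
densities = [norm(loc=0.0, scale=1.0).pdf, norm(loc=2.0, scale=1.0).pdf]
print(posterior(1.2, priors, densities))   # posterior probabilities for the two classes
```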

4.4.1 Linear Discriminant Analysis for p=1

Assumptions:

  1. fk(x) is normal or Gaussian
  2. $\sigma_1^2=\cdots=\sigma_K^2=\sigma^2$ (a shared variance)

so we have $$p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}$$
Taking the log, this is equivalent to assigning the observation to the class for which $$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$$ is largest.
The Bayes decision boundary is the point for which $\delta_1(x)=\delta_2(x)$; one can show that this amounts to $$x=\frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}=\frac{\mu_1+\mu_2}{2}$$ (assuming $\pi_1=\pi_2$).
In practice, the following estimates are used: $$\hat\mu_k=\frac{1}{n_k}\sum_{i:y_i=k}x_i,\qquad \hat\sigma^2=\frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k}(x_i-\hat\mu_k)^2$$
The discriminant functions $\hat\delta_k(x)$ are linear functions of $x$.

Summary: the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance $\sigma^2$, and plugging estimates for these parameters into the Bayes classifier.
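
A minimal Python sketch of these plug-in estimates and discriminant functions for $p=1$ (an illustrative implementation on made-up data, not the book's lab code; labels are assumed to be integers $0,\dots,K-1$):

```python
import numpy as np

def lda_1d_fit(x, y, K):
    """Plug-in estimates: class priors, class means, and the pooled variance."""
    n = len(x)
    pis = np.array([np.mean(y == k) for k in range(K)])
    mus = np.array([x[y == k].mean() for k in range(K)])
    sigma2 = sum(((x[y == k] - mus[k]) ** 2).sum() for k in range(K)) / (n - K)
    return pis, mus, sigma2

def lda_1d_predict(x0, pis, mus, sigma2):
    """Assign x0 to the class with the largest discriminant delta_k(x0)."""
    deltas = x0 * mus / sigma2 - mus ** 2 / (2 * sigma2) + np.log(pis)
    return int(np.argmax(deltas))

# Made-up data: class 0 centered at -1, class 1 centered at +2, shared variance 1.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)])
y = np.array([0] * 50 + [1] * 50)
pis, mus, sigma2 = lda_1d_fit(x, y, K=2)
print(lda_1d_predict(0.8, pis, mus, sigma2))   # predicted class label for x = 0.8
```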

4.4.2 Linear Discriminant Analysis for p > 1

Assumptions:

  • X = (X1, X2,...,Xp) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific multivariate mean vector and a common covariance matrix.

Note: the multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.
An example in which $Var(X_1)=Var(X_2)$ and $Cor(X_1,X_2)=0$: this surface has a characteristic, centrally symmetric bell shape (like a round steamed bun). If $Cor(X_1,X_2)\neq 0$, the bell is stretched and squashed along a diagonal instead.
![[Pasted image 20221017165257.png]]

We write $X \sim N(\mu,\Sigma)$. Here $E(X)=\mu$ is the mean of $X$ (a vector with $p$ components), and $Cov(X)=\Sigma$ is the $p \times p$ covariance matrix of $X$. Formally, the multivariate Gaussian density is defined as $$f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$
The discriminant functions $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$$ are linear functions of $x$. The Bayes decision boundaries are the sets where $$x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k = x^T \Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l$$ for $k \neq l$ (assuming $\pi_k=\pi_l$ for all $k,l$).

Summary:

  • we need to estimate the unknown parameters μ1,...,μK,π1,...,πK, and Σ;
  • To assign a new observation X=x, LDA plugs these estimates to obtain quantities δ^k(x), and classifies to the class for which δ^k(x) is largest.
  • δ^k(x) is a linear function of x; that is, the LDA decision rule depends on x only through a linear combination of its elements.

Class-specific performance is also important in medicine and biology, where the terms sensitivity and specificity characterize the performance of a classifier or screening test.

LDA is trying to approximate the Bayes classifier, which has the lowest total error rate out of all classifiers.

The Bayes classifier works by assigning an observation to the class for which the posterior probability $p_k(X)$ is greatest. Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, this threshold can be changed to some other percentage, such as 20%, if one type of error is more costly than the other.

Figure 4.7 illustrates the trade-off that results from modifying the threshold value for the posterior probability of default.
As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among individuals who do not default increases.
![[Pasted image 20221017190149.png]]

ROC curve

The ROC (receiver operating characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
![[Pasted image 20221017190418.png]]

The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier.
![[Pasted image 20221017190710.png]]
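
A hedged sketch of computing the ROC curve and AUC from predicted posterior probabilities with scikit-learn; `y_true` and `p_hat` below are made-up stand-ins for the true 0/1 labels and the predicted probabilities of the positive class:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_hat  = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.8, 0.2, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_true, p_hat))      # area under the ROC curve
```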

4.4.3 Quadratic Discriminant Analysis

Assumptions:

  1. (like LDA) each class are drawn from a Gaussian distribution
  2. (like LDA) each class has its own mean vector
  3. (unlike LDA) each class has its own covariance matrix

the Bayes classifier assigns an observation $X=x$ to the class for which $$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)-\frac{1}{2}\log|\Sigma_k|+\log \pi_k = -\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log|\Sigma_k|+\log\pi_k$$ is largest. The quantity $x$ appears in $\delta_k(x)$ as a quadratic function.

Why would one prefer LDA to QDA, or vice-versa?

The answer lies in the bias-variance trade-off. When there are $p$ predictors, estimating a covariance matrix requires estimating $p(p+1)/2$ parameters. QDA estimates a separate covariance matrix for each class, for a total of $Kp(p+1)/2$ parameters.

LDA is a much less flexible classifier than QDA, and so has substantially lower variance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.

Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
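
A quick scikit-learn sketch comparing the two on made-up data whose two classes have different covariance matrices (so QDA's extra flexibility should help); in practice one would compare test or cross-validated error on the actual data of interest:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Two classes with different covariance structures (synthetic data).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X1 = rng.multivariate_normal([1, 1], [[1.0, -0.5], [-0.5, 2.0]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())   # mean CV accuracy
```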

4.4.4 Naive Bayes

Assumption:

  1. Within the $k$th class, the $p$ predictors are independent. That is, instead of assuming that the class densities belong to a particular family of distributions (e.g. multivariate normal), we assume that for $k=1,\dots,K$, $$f_k(x)=f_{k1}(x_1)\times f_{k2}(x_2)\times\cdots\times f_{kp}(x_p),$$ where $f_{kj}$ is the density function of the $j$th predictor among observations in the $k$th class.
Why is this assumption so powerful?

Essentially, estimating a $p$-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (the distribution of each predictor on its own) but also the joint distribution of the predictors (the association between the different predictors).

It often leads to pretty decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class.

Essentially, the naive Bayes assumption introduces some bias, but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off.

$$\Pr(Y=k|X=x)=\frac{\pi_k f_{k1}(x_1)f_{k2}(x_2)\cdots f_{kp}(x_p)}{\sum_{l=1}^K \pi_l f_{l1}(x_1)f_{l2}(x_2)\cdots f_{lp}(x_p)}$$

To estimate the one-dimensional density function fkj using training data x1j,...,xnj , we have a few options:

  • If $X_j$ is quantitative, then we can assume that $X_j \mid Y=k \sim N(\mu_{jk},\sigma_{jk}^2)$. While this may sound a bit like QDA, there is one key difference, in that here we are assuming that the predictors are independent; this amounts to QDA with an additional assumption that the class-specific covariance matrix is diagonal.
  • If Xj is quantitative, then another option is to use a non-parametric estimate for fkj . A very simple way to do this is by making a histogram for the observations of the jth predictor within each class. Then we can estimate fkj(xj) as the fraction of the training observations in the kth class that belong to the same histogram bin as xj . Alternatively, we can use a kernel density estimator, which is essentially a smoothed version of a histogram.
  • If Xj is qualitative, then we can simply count the proportion of training observations for the jth predictor corresponding to each class.
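
For the first (Gaussian) option above, scikit-learn's `GaussianNB` fits exactly this model: per-class, per-feature means and variances combined under the independence assumption. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Made-up data: 3 quantitative features, 2 classes.
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

nb = GaussianNB().fit(X, y)
print(nb.theta_)                 # estimated per-class feature means (mu_kj)
print(nb.predict_proba(X[:3]))   # posterior probabilities Pr(Y = k | X = x)
```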

We expect to see a greater pay-off to using naive Bayes relative to LDA or QDA in instances where p is larger or n is smaller, so that reducing the variance is very important.

4.5 A Comparison of Classification Methods

4.5.1 An Analytical Comparison

Equivalently, we can set $K$ as the baseline class and assign an observation to the class that maximizes $\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)$.

  1. For LDA, $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j$$ where $a_k=\log\left(\frac{\pi_k}{\pi_K}\right)-\frac{1}{2}(\mu_k+\mu_K)^T\Sigma^{-1}(\mu_k-\mu_K)$ and $b_{kj}$ is the $j$th component of $\Sigma^{-1}(\mu_k-\mu_K)$.
  2. For QDA, $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$ where $a_k$, $b_{kj}$, and $c_{kjl}$ are functions of $\pi_k$, $\pi_K$, $\mu_k$, $\mu_K$, $\Sigma_k$ and $\Sigma_K$.
  3. For naive Bayes, $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p g_{kj}(x_j)$$ where $a_k=\log\left(\frac{\pi_k}{\pi_K}\right)$ and $g_{kj}(x_j)=\log\left(\frac{f_{kj}(x_j)}{f_{Kj}(x_j)}\right)$. Hence, the right-hand side takes the form of a generalized additive model (Chapter 7).

Summary1:

  • LDA is a special case of QDA with $c_{kjl}=0$ for all $j=1,\dots,p$, $l=1,\dots,p$, and $k=1,\dots,K$. (Of course, this is not surprising, since LDA is simply a restricted version of QDA with $\Sigma_1=\cdots=\Sigma_K=\Sigma$.)
  • Any classifier with a linear decision boundary is a special case of naive Bayes with $g_{kj}(x_j)=b_{kj}x_j$. In particular, this means that LDA is a special case of naive Bayes!
  • If we model $f_{kj}(x_j)$ in the naive Bayes classifier using a one-dimensional Gaussian distribution $N(\mu_{kj},\sigma_j^2)$, then we end up with $g_{kj}(x_j)=b_{kj}x_j$ where $b_{kj}=(\mu_{kj}-\mu_{Kj})/\sigma_j^2$. In this case, naive Bayes is actually a special case of LDA with $\Sigma$ restricted to be a diagonal matrix with $j$th diagonal element equal to $\sigma_j^2$.
  • Neither QDA nor naive Bayes is a special case of the other. Naive Bayes can produce a more flexible fit, since any choice can be made for $g_{kj}(x_j)$. However, it is restricted to a purely additive fit: the $g_{kj}(x_j)$ terms are never multiplied together. By contrast, QDA includes multiplicative terms of the form $c_{kjl}x_jx_l$. Therefore, QDA has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes.
  4. For multinomial logistic regression, $$\log\left(\frac{\Pr(Y=k|X=x)}{\Pr(Y=K|X=x)}\right)=\beta_{k0}+\sum_{l=1}^p\beta_{kl}x_l$$ This is identical in form to the linear expression for LDA. In LDA, the coefficients are functions of estimates of $\pi_k$, $\mu_k$, and $\Sigma$ obtained by assuming that $X_1,\dots,X_p$ follow a normal distribution within each class; in logistic regression, the coefficients are chosen to maximize the likelihood. Thus, we expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.
  5. For K-nearest neighbors (KNN), in order to make a prediction for an observation $X=x$, the training observations that are closest to $x$ are identified. Then $X$ is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: **no assumptions are made about the shape of the decision boundary**.

Summary2:

  • Because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that n is very large and p is small.
  • KNN requires a lot of observations relative to the number of predictors—that is, n much larger than p.
  • KNN is non-parametric, and thus tends to reduce the bias while incurring a lot of variance.
  • In settings where the decision boundary is non-linear but n is only modest, or p is not very small, then QDA may be preferred to KNN.
  • Unlike logistic regression, KNN does not tell us which predictors are important.
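
A minimal scikit-learn sketch of KNN classification on made-up data with a non-linear (circular) class boundary; the number of neighbors is the tuning parameter that controls the smoothness discussed in the next subsection:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Made-up two-class data whose true boundary is a circle.
X = rng.uniform(-1, 1, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # no parametric boundary assumed
print(knn.predict([[0.1, 0.1], [0.9, 0.9]]))          # point inside vs. outside the circle
```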

4.5.2 An Empirical Comparison

  • When the true decision boundaries are linear, then the LDA and logistic regression approaches will tend to perform well.
  • When the boundaries are moderately non-linear, QDA or naive Bayes may give better results.
  • For much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.
  • Finally, recall from Chapter 3 that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including X2,X3, and even X4 as predictors. If we added all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different.

4.6 Generalized Linear Models

dataset: Bikeshare

4.6.2 Poisson Regression on the Bikeshare Data

Poisson Distribution:

$$P(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}$$ for $k=0,1,2,\dots$

Properties:

  • $\lambda = E(Y)=Var(Y)$

The Poisson distribution is typically used to model **counts**.

Assumption: by modeling bike usage with Poisson regression, we implicitly assume that **mean** bike usage in a given hour equals the **variance** of bike usage during that hour.

$$\log(\lambda(X_1,\dots,X_p)) = \beta_0+\beta_1X_1+\dots+\beta_pX_p$$

To estimate the coefficients, we use the MLE approach.

> In fact, the variance in the Bikeshare data appears to be much higher than the mean, a situation referred to as overdispersion. This causes the Z-values to be inflated in Table 4.11. A more careful analysis should account for this overdispersion to obtain more accurate Z-values, and there are a variety of methods for doing this, but they are beyond the scope of this book.

4.6.3 Generalized Linear Models in Greater Generality

Three types of regression models: linear, logistic and Poisson. Common characteristics:

  • Each approach uses predictors $X_1,\dots,X_p$ to predict a response $Y$. We assume that, conditional on $X_1,\dots,X_p$, $Y$ belongs to a certain family of distributions. For linear regression, we typically assume that $Y$ follows a **Gaussian** or normal distribution. For logistic regression, we assume that $Y$ follows a **Bernoulli** distribution. Finally, for Poisson regression, we assume that $Y$ follows a **Poisson** distribution.
  • The mean $E(Y|X_1,\dots,X_p)$ can be expressed in terms of the predictors via a **link function** $\eta$. The link functions for linear, logistic and Poisson regression are $\eta(\mu) = \mu$, $\eta(\mu) = \log(\mu/(1-\mu))$, and $\eta(\mu) = \log(\mu)$, respectively.
  • The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the **exponential family**. In general, we can perform a regression by modeling the response $Y$ as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors. Any regression approach that follows this very general recipe is known as a **generalized linear model (GLM)**. Thus, linear regression, logistic regression, and Poisson regression are three examples of GLMs. Other examples not covered here include Gamma regression and negative binomial regression.
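
As an illustration of the GLM recipe, a hedged sketch of fitting a Poisson regression with statsmodels; it assumes the Bikeshare data are loaded as a pandas DataFrame named `bikeshare_df` with a count response column `bikers` and a numeric predictor `temp` (the DataFrame name and the choice of a single predictor are illustrative assumptions):

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumes bikeshare_df is a pandas DataFrame with columns 'bikers' (counts) and 'temp'.
model = smf.glm("bikers ~ temp", data=bikeshare_df,
                family=sm.families.Poisson())   # Poisson family with its default log link
fit = model.fit()                               # coefficients estimated by maximum likelihood
print(fit.params)                               # estimates on the log(lambda) scale
```

Swapping the family (e.g. `sm.families.Gaussian()` or `sm.families.Binomial()`) gives linear or logistic regression within the same GLM framework.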
