CCJ PRML Study Note - Chapter 1.5 : Decision Theory

Chapter 1.5 : Decision Theory
Christopher M. Bishop, PRML, Chapter 1 Introduction

1. The three theories needed for PRML:

  • Probability theory: provides us with a consistent mathematical framework for quantifying and manipulating uncertainty.
  • Decision theory: allows us to make optimal decisions in situations involving uncertainty such as those encountered in pattern recognition.
  • Information theory: provides tools for quantifying the information content of random variables, introduced later in Section 1.6.

Inference step & Decision step

  • The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with these variables. Determination of p(x, t) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book.
  • In a practical application, however, we must often make a specific prediction for the value of t, or more generally take a specific action based on our understanding of the values t is likely to take, and this aspect is the subject of decision theory.

2. An example

Problem Description:

Consider, for example, a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not.

  • Representation: choose t to be a binary variable such that t = 1 corresponds to class C_1 (presence of cancer) and t = 0 corresponds to class C_2 (absence of cancer).
  • Inference Step: The general inference problem then involves determining the joint distribution p(x, C_k), or equivalently p(x, t), which gives us the most complete probabilistic description of the situation.
  • Decision Step: In the end we must decide either to give treatment to the patient or not, and we would like this choice to be optimal in some appropriate sense. This is the decision step, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities.

How to predict?

Using Bayes’ theorem, these probabilities can be expressed in the form

$$ p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}. $$

If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability. We now show that this intuition is correct, and we also discuss more general criteria for making decisions.

 

Our objective can be one of the following:

  • Minimizing the misclassification rate;
  • Minimizing the expected loss;

    Supplement: criteria for making decisions [Ref-1]
    1) Minimizing the misclassification rate.
    2) Minimizing the expected loss: the consequences of the two kinds of error may differ. For example, diagnosing cancer as "no cancer" has far more serious consequences than diagnosing "no cancer" as cancer; likewise, marking a normal email as spam is worse than letting a spam email through. In such cases it matters more to reduce the first kind of error than the second, and we therefore need a loss function to quantify the cost of each kind of error.
    Let A = {α_1, ..., α_a} be the set of all possible decisions, and let the decision function α(x) map each observation x to a decision. Then

$$ \alpha(\mathbf{x}) = \arg\min_{\alpha_i} R(\alpha_i \mid \mathbf{x}), $$

where R(α_i | x) denotes the conditional risk

$$ R(\alpha_i \mid \mathbf{x}) = \sum_{j} \lambda(\alpha_i \mid \mathcal{C}_j)\, p(\mathcal{C}_j \mid \mathbf{x}). $$

  • λ(α_i | C_j) is the loss (or risk) incurred by taking decision α_i when the true class of the data is C_j;
  • p(C_j | x) is the posterior probability of class C_j once x has been observed. The decision function α(x) always maps the observation to the decision with the smallest conditional risk.
  • The overall risk of the decision function α(x) is therefore
$$ R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}, $$
    i.e., the overall risk is the expectation of the conditional risk R(α(x) | x) over the distribution of all possible observations (the feature space). A small numerical sketch is given below.
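To make the notation above concrete, here is a minimal Python/NumPy sketch (not from PRML) of the conditional risk and an empirical overall risk. The loss values λ(α_i | C_j) and all posterior probabilities are invented purely for illustration.

```python
import numpy as np

# lam[i, j] = lambda(alpha_i | C_j): loss of decision alpha_i when the true class is C_j
# (decision 0 = "treat as cancer", decision 1 = "treat as normal"; classes in the same order)
lam = np.array([[0.0,    1.0],    # decide "cancer": no loss if true cancer, small loss otherwise
                [1000.0, 0.0]])   # decide "normal": huge loss if the patient actually has cancer

# Posterior probabilities p(C_j | x) for a batch of hypothetical observations x
posteriors = np.array([[0.30,   0.70],
                       [0.0005, 0.9995],
                       [0.90,   0.10]])

# Conditional risk R(alpha_i | x) = sum_j lam[i, j] * p(C_j | x), one row per observation
cond_risk = posteriors @ lam.T
decisions = cond_risk.argmin(axis=1)        # alpha(x): pick the minimum-risk decision

# Empirical analogue of the overall risk: average of R(alpha(x) | x) over the observations
overall_risk = cond_risk.min(axis=1).mean()

print(decisions)      # [0 1 0] -> even a 30% cancer posterior triggers the "cancer" decision
print(overall_risk)
```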

3. Minimizing the misclassification rate

3.1 Decision Regions & Boundaries

Suppose that our goal is simply to make as few misclassifications as possible.
  • Decision regions: to divide the input space into regions R_k called decision regions, one for each class, such that all points in R_k are assigned to class C_k;
  • Decision boundaries or decision surfaces: The boundaries between decision regions.

3.2 Two Classes:

Consider the cancer problem for instance. A mistake occurs when an input vector belonging to class C_1 is assigned to class C_2 or vice versa. The probability of this occurring is given by

$$ p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}. \tag{1.78} $$

Clearly to minimize p(mistake) we should arrange that each x is assigned to whichever class has the smaller value of the integrand in (1.78). This result is illustrated for two classes, and a single input variable x, in Figure 1.24.

 

3.3 K Classes:

For the more general case of K classes, it is slightly easier to maximize the probability of being correct, which is given by

$$ p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}, \tag{1.79} $$

which is maximized when the regions R_k are chosen such that each x is assigned to the class for which the joint probability p(x, C_k), or equivalently the posterior probability p(C_k|x), is largest. A minimal sketch of this decision rule follows below.
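As a small sketch (not from the book), minimizing the misclassification rate amounts to an argmax over the posterior probabilities; the posterior values below are made up.

```python
import numpy as np

# Each row holds p(C_1|x), p(C_2|x) for one hypothetical input x
posteriors = np.array([[0.90, 0.10],
                       [0.40, 0.60],
                       [0.55, 0.45]])

# Assign each x to the class with the largest posterior probability
assigned_class = posteriors.argmax(axis=1)
print(assigned_class)   # [0 1 0]  ->  C_1, C_2, C_1
```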

4. Minimizing the expected loss

4.1 Problem Description:

For many applications, our objective will be more complex than simply minimizing the number of misclassifications. Let us consider again the medical diagnosis problem. We note that, if a patient who does not have cancer is incorrectly diagnosed as having cancer, the consequences may be some patient distress plus the need for further investigations. Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death due to lack of treatment. Thus the consequences of these two types of mistake can be dramatically different. It would clearly be better to make fewer mistakes of the second kind, even if this was at the expense of making more mistakes of the first kind.

4.2 How to solve the problem? Introduce loss/cost function

  • loss function or cost function: a single, overall measure of the loss incurred in taking any of the available decisions or actions, whose value we aim to minimize.
  • utility function: a measure whose value we aim to maximize.
  • The optimal solution is the one which minimizes the loss function, or equivalently maximizes the utility function.
  • loss matrix: L_kj gives the loss incurred when the true class is C_k and the input is assigned to class C_j. For the cancer example (Figure 1.25), with rows indexed by the true class {cancer, normal} and columns by the assigned class {cancer, normal},

$$ L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix}. $$

4.3 Optimal Solution

The optimal solution is the one which minimizes the loss function. However, the loss function depends on the true class, which is unknown. For a given input vector x, our uncertainty in the true class is expressed through the joint probability distribution p(x, C_k), and so we seek instead to minimize the average loss, where the average is computed with respect to this distribution, which is given by

$$ \mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}. \tag{1.80} $$

Equivalently, for each x we should minimize Σ_k L_kj p(x, C_k) in order to choose the corresponding optimal region R_j.

Thus the decision rule that minimizes the expected loss (1.80) is the one that assigns each new x to the class j for which the quantity

$$ \sum_{k} L_{kj}\, p(\mathcal{C}_k \mid \mathbf{x}) \tag{1.81} $$

is a minimum. This is clearly trivial to do once we know the posterior class probabilities p(C_k|x); a small sketch follows below.
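The following sketch implements the rule in (1.81) using the cancer loss matrix from Section 4.2; the posterior values are made up for illustration. Here L[k, j] is the loss of assigning class C_j when the true class is C_k.

```python
import numpy as np

# L[k, j]: loss when the true class is C_k (rows) and we assign class C_j (columns)
L = np.array([[0.0, 1000.0],   # true class: cancer
              [1.0,    0.0]])  # true class: normal

def assign_class(posterior, L):
    """Assign x to the class j minimizing sum_k L[k, j] * p(C_k|x), as in (1.81)."""
    expected_loss = L.T @ posterior
    return int(expected_loss.argmin()), expected_loss

posterior = np.array([0.05, 0.95])        # p(cancer|x), p(normal|x) -- made-up numbers
j, expected_loss = assign_class(posterior, L)
print(expected_loss)   # [0.95, 50.0]
print(j)               # 0 -> assign "cancer" despite only a 5% posterior, because a miss is costly
```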

5. The reject option

We have seen that classification errors arise from the regions of input space where the largest of the posterior probabilities p(C_k|x) is significantly less than unity (i.e., 1), or equivalently where the joint distributions p(x, C_k) have comparable values. These are the regions where we are relatively uncertain about class membership.

  • Reject option: In some applications, it will be appropriate to avoid making decisions on the difficult cases in anticipation of a lower error rate on those examples for which a classification decision is made. This is known as the reject option. For example, in our hypothetical medical illustration, it may be appropriate to use an automatic system to classify those X-ray images for which there is little doubt as to the correct class, while leaving a human expert to classify the more ambiguous cases.
  • We can achieve this by introducing a threshold θ and rejecting those inputs x for which the largest of the posterior probabilities p(C_k|x) is less than or equal to θ. This is illustrated for the case of two classes, and a single continuous input variable x, in Figure 1.26.

  • Note that setting θ = 1 will ensure that all examples are rejected, whereas if there are K classes then setting θ < 1/K will ensure that no examples are rejected. Thus the fraction of examples that get rejected is controlled by the value of θ (see the sketch below). We can easily extend the reject criterion to minimize the expected loss, when a loss matrix is given, taking account of the loss incurred when a reject decision is made.
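A minimal sketch of the reject option (the threshold θ and the posterior values are assumed): reject whenever the largest posterior fails to exceed θ, otherwise assign by argmax.

```python
import numpy as np

theta = 0.8                              # rejection threshold (an assumed value)
posteriors = np.array([[0.95, 0.05],
                       [0.55, 0.45],     # ambiguous case
                       [0.15, 0.85]])

max_posterior = posteriors.max(axis=1)
decisions = np.where(max_posterior > theta,
                     posteriors.argmax(axis=1),   # confident: assign the argmax class
                     -1)                          # otherwise: reject (coded as -1)
print(decisions)   # [ 0 -1  1] -> the ambiguous middle example is rejected
```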

6. Three distinct approaches to solving decision problems

The three distinct approaches are given, in decreasing order of complexity, by:

  • generative models: use Bayes’ theorem to find the posterior class probabilities p(C_k|x) from the class-conditional densities and the priors.
  • discriminative models: model the posterior probabilities p(C_k|x) directly.
  • discriminant functions: learn a function f(x) that maps each input x directly onto a class label.

6.1 Generative Models

  • 1-1) Likelihood: First solve the inference problem of determining the class-conditional densities p(x|C_k) for each class C_k individually.
  • 1-2) Prior: separately infer the prior class probabilities p(C_k).
  • 1-3) Posterior: use Bayes’ theorem to find the posterior class probabilities in the form

$$ p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}, \tag{1.82} $$

    where the denominator is obtained by

$$ p(\mathbf{x}) = \sum_{k} p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k). \tag{1.83} $$

  • 2) Equivalently, we can model the joint distribution p(x, C_k) directly and then normalize to obtain the posterior probabilities.
  • 3) Decision Stage: use decision theory to determine class membership for each new input x (a small sketch of this pipeline is given after this list).
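Here is a small, self-contained sketch of this generative pipeline for a 1-D, two-class toy problem; the Gaussian class-conditional densities, their parameters, and the priors are all invented for illustration and are not from the book.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian, used here as the class-conditional model p(x|C_k)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# 1-1) likelihoods p(x|C_k): one Gaussian per class (assumed parameters)
class_params = [(0.0, 1.0),   # class C_1
                (2.0, 1.0)]   # class C_2
# 1-2) priors p(C_k) (assumed values)
priors = np.array([0.3, 0.7])

def posterior(x):
    # 1-3) Bayes' theorem (1.82); the normalizer p(x) is the sum in (1.83)
    likelihoods = np.array([gaussian_pdf(x, mu, sigma) for mu, sigma in class_params])
    joint = likelihoods * priors            # p(x, C_k)
    return joint / joint.sum()              # p(C_k | x)

# 3) decision stage: assign a new input to the class with the largest posterior
x_new = 1.2
post = posterior(x_new)
print(post, "-> class", post.argmax())
```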

Why called generative models?

Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

Pros and Cons:

  • (-) The generative approach is the most demanding because it involves finding the joint distribution over both x and C_k. For many applications, x will have high dimensionality, and consequently we may need a large training set in order to be able to determine the class-conditional densities to reasonable accuracy.
  • (+) It allows the marginal density of the data p(x) to be determined from (1.83). This can be useful for outlier detection or novelty detection.

6.2 Discriminative Models:

  • First solve the inference problem of determining the posterior class probabilities p(C_k|x),
  • and then subsequently use decision theory to assign each new x to one of the classes.

Approaches that model the posterior probabilities directly are called discriminative models.

The classification problem is usually broken down into two separate stages, as in approaches (6.1) and (6.2):

  • inference stage: use training data to learn a model for p(C_k|x).
  • decision stage: use these posterior probabilities to make optimal class assignments.

However, (6.3) provides us with a different approach, which combines the inference and decision stages into a single learning problem.

6.3 Discriminant Functions:

Discriminant functions solve the inference problem and the decision problem together: they simply learn a function f(x) that maps inputs directly onto decisions.

Thus the discriminant function combines inference and decision into a single step [Ref-1].

Disadvantage: we no longer have access to the posterior probabilities p(C_k|x).

6.4 The Relative Merits of These Three Alternatives [Ref-1]

  • Drawback of generative models: if all we want is to make classification decisions, computing the joint distribution is wasteful of computational resources and excessively demanding of data. In general the posterior probabilities p(C_k|x), i.e., a discriminative model, are sufficient.
  • Drawback of discriminant functions: this approach never computes the posterior probabilities, but there are many powerful reasons for wanting to compute them:
    • (1) Minimizing risk: when the loss matrix may change over time (as in a financial application), if the posterior probabilities have already been computed, the minimum-risk decision problem is solved simply by modifying (1.81) appropriately; a discriminant function without posterior probabilities would have to be relearned from scratch.
    • (2) Compensating for class priors: when the positive and negative examples are highly imbalanced (e.g., diagnosing cancer from X-ray images, where cancer is rare and perhaps 99.9% of examples contain no cancer), obtaining a good classifier may require constructing an artificially balanced data set for training. Having trained on the balanced data, we must then compensate for the effects of this modification: take the posterior probabilities obtained from the artificially balanced data set, divide by the class priors of the balanced data set, multiply by the class priors of the real data (i.e., in the population), and normalize, which yields the posterior probabilities for the real data. A discriminant function without posterior probabilities cannot handle class imbalance in this way.
    • (3) Combining models: for complex applications, a problem can be decomposed into a number of smaller subproblems, each of which can be tackled by a separate module. In medical diagnosis, for instance, we may have blood test data x_B in addition to the X-ray image data x_I. Rather than pooling this heterogeneous information into a single input, it may be more effective to build one system to interpret the X-ray images and a different one to interpret the blood data. Assuming conditional independence given the class (the naive Bayes model),

$$ p(\mathbf{x}_I, \mathbf{x}_B \mid \mathcal{C}_k) = p(\mathbf{x}_I \mid \mathcal{C}_k)\, p(\mathbf{x}_B \mid \mathcal{C}_k), \tag{1.84} $$

      the posterior probability, given both the X-ray and blood data, is then

$$ p(\mathcal{C}_k \mid \mathbf{x}_I, \mathbf{x}_B) \;\propto\; p(\mathbf{x}_I \mid \mathcal{C}_k)\, p(\mathbf{x}_B \mid \mathcal{C}_k)\, p(\mathcal{C}_k) \;\propto\; \frac{p(\mathcal{C}_k \mid \mathbf{x}_I)\, p(\mathcal{C}_k \mid \mathbf{x}_B)}{p(\mathcal{C}_k)}. \tag{1.85} $$

      A small numerical sketch of this combination rule is given below.
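A small numerical sketch of the combination rule (1.85); all probability values (the priors and the two module posteriors) are invented for illustration.

```python
import numpy as np

prior      = np.array([0.01, 0.99])   # p(C_k): cancer, normal (assumed population priors)
post_xray  = np.array([0.30, 0.70])   # p(C_k | x_I) from the X-ray module (made up)
post_blood = np.array([0.20, 0.80])   # p(C_k | x_B) from the blood-test module (made up)

# Naive Bayes combination: p(C_k | x_I, x_B) proportional to p(C_k|x_I) * p(C_k|x_B) / p(C_k)
combined = post_xray * post_blood / prior
combined /= combined.sum()            # normalize so the posteriors sum to one
print(combined)    # ~[0.91, 0.09]: two weak "cancer" signals reinforce each other strongly
```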

7. Loss functions for regression

So far, we have discussed decision theory in the context of classification problems. We now turn to the case of regression problems, such as the curve fitting example discussed earlier.

  • Decision stage for regression: The decision stage consists of choosing a specific estimate y(x) of the value of t for each input x. Suppose that in doing so, we incur a loss L(t, y(x)). The average, or expected, loss is then given by

$$ \mathbb{E}[L] = \int\!\!\int L(t, y(\mathbf{x}))\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt. \tag{1.86} $$

  • the squared loss: L(t, y(x)) = {y(x) − t}^2, which when substituted into (1.86) gives

$$ \mathbb{E}[L] = \int\!\!\int \{y(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt. \tag{1.87} $$

  • Solution: assuming a completely flexible function y(x), we can do this formally using the calculus of variations to give

$$ \frac{\delta \mathbb{E}[L]}{\delta y(\mathbf{x})} = 2 \int \{y(\mathbf{x}) - t\}\, p(\mathbf{x}, t)\, dt = 0. \tag{1.88} $$

    Solving for y(x), and using the sum and product rules of probability, we obtain

$$ y(\mathbf{x}) = \frac{\int t\, p(\mathbf{x}, t)\, dt}{p(\mathbf{x})} = \int t\, p(t \mid \mathbf{x})\, dt = \mathbb{E}_t[t \mid \mathbf{x}], \tag{1.89} $$

    which is the conditional average of t conditioned on x and is known as the regression function. This result is illustrated in Figure 1.28.

  • we can identify three distinct approaches to solving regression problems given, in order of decreasing complexity, by:

    • (a) First solve the inference problem of determining the joint density p(x, t). Then normalize to find the conditional density p(t|x), and finally marginalize to find the conditional mean given by (1.89).
    • (b) First solve the inference problem of determining the conditional density p(t|x), and then subsequently marginalize to find the conditional mean given by (1.89).
    • (c) Find a regression function y(x) directly from the training data.
  • The relative merits of these three approaches follow the same lines as for classification problems above.
  • Minkowski loss: one simple generalization of the squared loss, called the Minkowski loss, whose expectation is given by

$$ \mathbb{E}[L_q] = \int\!\!\int |y(\mathbf{x}) - t|^{q}\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt, \tag{1.91} $$

    which reduces to the expected squared loss for q = 2. A small sketch comparing q = 1 and q = 2 is given below.
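The following sketch (synthetic samples, not from the book) illustrates numerically that the minimizer of the expected Minkowski loss is the conditional mean for q = 2 and the conditional median for q = 1.

```python
import numpy as np

# Synthetic samples standing in for t ~ p(t|x) at some fixed x (a skewed distribution)
rng = np.random.default_rng(0)
t_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)

# Scan candidate estimates y and pick the one minimizing the empirical E[|y - t|^q]
candidates = np.linspace(0.0, 6.0, 601)
for q in (1, 2):
    expected_loss = [np.mean(np.abs(y - t_samples) ** q) for y in candidates]
    y_star = candidates[int(np.argmin(expected_loss))]
    print(f"q = {q}: minimizer y* ~ {y_star:.2f}")

print("conditional mean   :", round(t_samples.mean(), 2))     # matches the q = 2 minimizer
print("conditional median :", round(float(np.median(t_samples)), 2))  # matches the q = 1 minimizer
```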

[Ref-1] Page 6 of PRML notes.
[Ref-2] Pages 7-8 of PRML notes.

 