CCJ PRML Study Note - Chapter 1.2 : Probability Theory

Christopher M. Bishop, PRML, Chapter 1 Introduction


1. Uncertainty

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5 (see PRML), it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.


2. Example discussed through this chapter

We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue; in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9. [Figure 1.9: the red box with 2 apples and 6 oranges, and the blue box with 3 apples and 1 orange.]

Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we put it back in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we pick an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$. It can take either of the values $a$ (for apple) or $o$ (for orange). To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $p(B = r) = 4/10$.
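This frequency definition is easy to check empirically. Below is a minimal simulation sketch (my own illustration, not from PRML) that uses the 40%/60% box probabilities of the running example and estimates $p(B = r)$ and the overall probability of drawing an apple by counting:

```python
import random

random.seed(0)
boxes = {"red": ["apple"] * 2 + ["orange"] * 6,
         "blue": ["apple"] * 3 + ["orange"] * 1}

N = 100_000
red_count = 0
apple_count = 0
for _ in range(N):
    # pick the red box 40% of the time, the blue box 60% of the time
    box = "red" if random.random() < 0.4 else "blue"
    fruit = random.choice(boxes[box])  # uniform over the fruit in that box
    red_count += (box == "red")
    apple_count += (fruit == "apple")

p_red = red_count / N      # should approach 4/10
p_apple = apple_count / N  # should approach 11/20
print(p_red, p_apple)
```

With 100,000 trials the two frequencies settle near $4/10$ and $11/20$, the exact values derived later in this note.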


3. Basic Terminology

3.1 Probability densities

  • PDF, Probability Density Function: If the probability of a real-valued variable $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\delta x$ for $\delta x \to 0$, then $p(x)$ is called the probability density over $x$:
    $$p(x \in (a, b)) = \int_a^b p(x)\,\mathrm{d}x,$$
    and the pdf must satisfy the two conditions
    $$p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1.$$

  • PMF, Probability Mass Function: Note that if $x$ is a discrete variable, then $p(x)$ is called a probability mass function because it can be regarded as a set of “probability masses” concentrated at the allowed values of $x$.

  • CDF, Cumulative Distribution Function: The probability that $x$ lies in the interval $(-\infty, z)$ is given by the cumulative distribution function defined by
    $$P(z) = \int_{-\infty}^{z} p(x)\,\mathrm{d}x,$$
    which satisfies $P'(x) = p(x)$.

3.2 Expectations and covariances

  • Expectation of $f(x)$: the average value of some function $f(x)$ under a probability distribution $p(x)$ is called the expectation of $f(x)$ and will be denoted by $\mathbb{E}[f]$:
    $$\mathbb{E}[f] = \sum_{x} p(x)\, f(x) \qquad \text{and} \qquad \mathbb{E}[f] = \int p(x)\, f(x)\,\mathrm{d}x,$$
    for discrete variables and continuous variables, respectively.

  • Approximating expectation using sampling methods: if we are given a finite number $N$ of points drawn from the pdf, then the expectation can be approximated as a finite sum over these points:
    $$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n).$$

  • Expectations of functions of several variables: here we can use a subscript to indicate which variable is being averaged over, so that for instance $\mathbb{E}_x[f(x, y)]$ denotes the average of the function $f(x, y)$ with respect to the distribution of $x$. Note that $\mathbb{E}_x[f(x, y)]$ will be a function of $y$.

  • Variance of $f(x)$: is defined by $\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$, and provides a measure of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$. Expanding out the square, we get $\operatorname{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$.

  • Variance of the variable $x$ itself: $\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.
  • Covariance of two r.v.’s $x$ and $y$: is defined by
    $$\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y].$$

  • Covariance of two vectors of r.v.’s $\mathbf{x}$ and $\mathbf{y}$: is defined by
    $$\operatorname{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^{\mathrm{T}} - \mathbb{E}[\mathbf{y}^{\mathrm{T}}])\big] = \mathbb{E}_{\mathbf{x},\mathbf{y}}[\mathbf{x}\mathbf{y}^{\mathrm{T}}] - \mathbb{E}[\mathbf{x}]\,\mathbb{E}[\mathbf{y}^{\mathrm{T}}].$$

  • Covariance of the components of a vector $\mathbf{x}$ with each other: then we use a slightly simpler notation $\operatorname{cov}[\mathbf{x}] \equiv \operatorname{cov}[\mathbf{x}, \mathbf{x}]$.
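Both the sampling approximation of an expectation and the covariance identity above can be checked numerically. Here is a small sketch (my own illustration; the linear relationship $y = 2x + \text{noise}$ is an assumption chosen so that $\operatorname{cov}[x, y] = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000
x = rng.standard_normal(N)  # samples x_n ~ N(0, 1)

# E[f] ≈ (1/N) Σ f(x_n): estimate E[x²] under N(0, 1), whose exact value is 1
e_x2 = np.mean(x ** 2)

# covariance identity: cov[x, y] = E[xy] − E[x] E[y]
y = 2.0 * x + rng.standard_normal(N)   # constructed so that cov[x, y] = 2
cov_direct = np.mean((x - x.mean()) * (y - y.mean()))
cov_identity = np.mean(x * y) - x.mean() * y.mean()

print(e_x2, cov_direct, cov_identity)
```

The two covariance formulas are algebraically identical, so they agree to floating-point precision, and both converge to 2 as $N$ grows.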

3.3 Joint, Marginal, Conditional Probability

In order to derive the rules of probability, consider the following example shown in Figure 1.10 involving two random variables $X$ and $Y$. We shall suppose that $X$ can take any of the values $x_i$ where $i = 1, \ldots, M$, and $Y$ can take the values $y_j$, where $j = 1, \ldots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value that $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

[Figure 1.10: the joint distribution of two variables $X$ and $Y$, depicted as a grid of cells in which $n_{ij}$ counts the trials falling in cell $(i, j)$; $c_i$ and $r_j$ are the column and row totals.]

  • joint probability: $p(X = x_i, Y = y_j)$ is called the joint probability of $X = x_i$ and $Y = y_j$, and is given by
    $$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}.$$
    Here we are implicitly considering the limit $N \to \infty$.
  • marginal probability: $p(X = x_i) = c_i / N$ is called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$). Since $c_i = \sum_j n_{ij}$, we have
    $$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j).$$
  • conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$ is called the conditional probability of $Y = y_j$ given $X = x_i$. From it we obtain
    $$p(X = x_i, Y = y_j) = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i),$$
    which is called the product rule of probability.

3.4 The Rules of Probability

  • Discrete Variables:
    $$\text{sum rule:}\quad p(X) = \sum_{Y} p(X, Y), \qquad \text{product rule:}\quad p(X, Y) = p(Y \mid X)\, p(X).$$

  • Continuous Variables: if $x$ and $y$ are two real continuous variables, then the sum and product rules take the form
    $$p(x) = \int p(x, y)\,\mathrm{d}y, \qquad p(x, y) = p(y \mid x)\, p(x).$$

  • Bayes’ theorem: From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities:
    $$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}. \tag{1.12}$$
    Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator:
    $$p(X) = \sum_{Y} p(X \mid Y)\, p(Y).$$
    We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of $Y$ equals one.

4. An Important Interpretation of Bayes’ Theorem

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by $p(B = r) = 4/10$ and $p(B = b) = 6/10$, respectively. Note that these satisfy $p(B = r) + p(B = b) = 1$.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is $3/4$, and so $p(F = a \mid B = b) = 3/4$. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box:
$$p(F = a \mid B = r) = 1/4, \qquad p(F = o \mid B = r) = 3/4, \tag{1.16, 1.17}$$
$$p(F = a \mid B = b) = 3/4, \qquad p(F = o \mid B = b) = 1/4. \tag{1.18, 1.19}$$

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple:
$$p(F = a) = p(F = a \mid B = r)\, p(B = r) + p(F = a \mid B = b)\, p(B = b) = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20},$$
from which it follows, using the sum rule, that $p(F = o) = 1 - 11/20 = 9/20$.


Interpretation of Bayes’ Theorem (See Page 17 in PRML)

Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give
$$p(B = r \mid F = o) = \frac{p(F = o \mid B = r)\, p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3}.$$
From the sum rule, it then follows that $p(B = b \mid F = o) = 1 - 2/3 = 1/3$.
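Because the fruit example uses simple fractions, the whole calculation can be reproduced exactly with Python’s `fractions` module (a sketch of the book’s arithmetic, not code from PRML):

```python
from fractions import Fraction as F

p_box = {"r": F(4, 10), "b": F(6, 10)}                    # priors p(B)
p_fruit_given_box = {("a", "r"): F(1, 4), ("o", "r"): F(3, 4),
                     ("a", "b"): F(3, 4), ("o", "b"): F(1, 4)}

# sum + product rules: p(F = o) = Σ_B p(F = o | B) p(B)
p_orange = sum(p_fruit_given_box[("o", b)] * p_box[b] for b in p_box)

# Bayes' theorem: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
posterior_red = p_fruit_given_box[("o", "r")] * p_box["r"] / p_orange
print(p_orange, posterior_red)  # 9/20 2/3
```

Exact rational arithmetic makes it easy to see that observing an orange lifts the probability of the red box from the prior $4/10$ to the posterior $2/3$.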

We can provide an important interpretation of Bayes’ theorem as follows.

  • Prior probability: If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability $p(B)$. We call this the prior probability because it is the probability available before we observe the identity of the fruit.

  • Posterior probability: Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability $p(B \mid F)$, which we shall call the posterior probability because it is the probability obtained after we have observed $F$.

  • Evidence: Note that in this example, the prior probability of selecting the red box was $4/10$, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now $2/3$, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favoring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.

  • Independent: Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that $p(X, Y) = p(X)\, p(Y)$, then $X$ and $Y$ are said to be independent. From the product rule, we see that $p(Y \mid X) = p(Y)$, and so the conditional distribution of $Y$ given $X$ is indeed independent of the value of $X$. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then $p(F \mid B) = p(F)$, so that the probability of selecting, say, an apple is independent of which box is chosen.


5. Bayesian Probability

5.1 Two Interpretations of Probabilities:

  • Classical or Frequentist Interpretation: we have viewed probabilities in terms of the frequencies of random, repeatable events, and have defined the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity.
  • Bayesian Interpretation of Probability: introduces “uncertainty” or “degrees of belief”. Consider an uncertain event, for example whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gases. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

5.2 Excerpted from the [PRML 笔记][1] (PRML Notes)

  • It is less that the Bayesian view is an interpretation of the concept of “probability”, and more that probability happens to provide a means of quantifying the Bayesian notion of “degree of belief”. The Bayesian starting point is “uncertainty”, which is expressed as a “degree of belief”. The use of probability to represent uncertainty, however, is not an ad-hoc choice, but is inevitable if we are to respect common sense while making rational coherent inferences. Cox showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. It is for this reason that we can use the machinery of probability theory to describe the uncertainty in model parameters.
  • Views of the parameters, and the Bayesian reading of prior and posterior probabilities: For a Frequentist, a model parameter $\mathbf{w}$ is a fixed quantity, to be estimated with an “estimator” (the most common approach being maximum likelihood estimation). For a Bayesian, however, $\mathbf{w}$ is itself an uncertain quantity, whose uncertainty is expressed by a prior probability $p(\mathbf{w})$. To pin down the fixed $\mathbf{w}$, the Frequentist imagines repeating the experiment many times, obtaining different data sets $D$; for the Bayesian, there is only a single data set $D$, namely the one that is actually observed. Having obtained an observation $D$, the Bayesian revises the original belief about the parameters (the prior probability) into the posterior probability $p(\mathbf{w} \mid D)$, the revision being carried out by Bayes’ theorem. Bayes’ theorem is thus the central theorem of the Bayesian school: it converts a prior probability into a posterior probability by incorporating the evidence provided by the observed data,
    $$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)},$$
    where the conditional probability $p(D \mid \mathbf{w})$ is called the likelihood, and expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$.
    The denominator $p(D)$ above merely normalizes, ensuring that the left-hand side is indeed a probability; it is computed from the quantities in the numerator as $p(D) = \int p(D \mid \mathbf{w})\, p(\mathbf{w})\,\mathrm{d}\mathbf{w}$.
  • Understanding the posterior probability: it is the revised prior. For example, suppose there are $K$ classes with priors $p(C_k)$. If we are given a data point of unknown class and asked to guess its class, we should clearly guess the class with the largest prior probability. After observing data $x$, we compute the posterior probability $p(C_k \mid x)$, and the “prior” is thereby revised to $p(C_k \mid x)$. If another data point of unknown class now arrives, we still guess by picking the class with the largest prior probability, except that the prior is now $p(C_k \mid x)$.

5.3 Bayes’ theorem and Bayesian Probability

Using examples to understand Bayesian Probability and Bayes’ theorem:

  • Fruit Example: Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data.
  • Polynomial curve fitting example: we can adopt a similar approach when making inferences about quantities such as the parameters $\mathbf{w}$ in the polynomial curve fitting example. We capture our assumptions about $\mathbf{w}$, before observing the data, in the form of a prior probability distribution $p(\mathbf{w})$. The effect of the observed data $\mathcal{D}$ is expressed through the conditional probability $p(\mathcal{D} \mid \mathbf{w})$, and we shall see later, in Section 1.2.5, how this can be represented explicitly.

Bayes’ theorem:

Bayes’ theorem, which takes the form
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}, \tag{1.43}$$
i.e.,
$$\text{posterior} \propto \text{likelihood} \times \text{prior},$$
then allows us to evaluate the uncertainty in $\mathbf{w}$ after we have observed $\mathcal{D}$ in the form of the posterior probability $p(\mathbf{w} \mid \mathcal{D})$.

Bayes’ theorem, where all of these quantities are viewed as functions of $\mathbf{w}$, incorporates four notions:

  • Prior: $p(\mathbf{w})$.
  • Likelihood: The quantity $p(\mathcal{D} \mid \mathbf{w})$ on the right-hand side of Bayes’ theorem is evaluated for the observed data set $\mathcal{D}$ and can be viewed as a function of the parameter vector $\mathbf{w}$, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector $\mathbf{w}$. Note that the likelihood is not a probability distribution over $\mathbf{w}$, and its integral with respect to $\mathbf{w}$ does not (necessarily) equal one.
  • Evidence: $p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\,\mathrm{d}\mathbf{w}$, the denominator in (1.43), is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
  • Posterior: $p(\mathbf{w} \mid \mathcal{D})$.

How to interpret likelihood function in both the Bayesian and frequentist paradigms?

  • In a Frequentist setting: $\mathbf{w}$ is considered to be a fixed parameter, whose value is determined by some form of “estimator”, and error bars on this estimate are obtained by considering the distribution of possible data sets $\mathcal{D}$. A widely used frequentist estimator is maximum likelihood, in which $\mathbf{w}$ is set to the value that maximizes the likelihood function $p(\mathcal{D} \mid \mathbf{w})$; this corresponds to choosing the value of $\mathbf{w}$ for which the probability of the observed data set is maximized. One approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created by repeated sampling (with replacement) from the original data set.
  • From the Bayesian viewpoint, there is only a single data set $\mathcal{D}$ (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over $\mathbf{w}$.

5.4 Pros (+) and Cons (-)

  • Pros(+) of Bayes over Frequentist: the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.
  • Cons(-) of Bayes against Frequentist: one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors. However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation remain useful in areas such as model comparison.
  • Cons of Frequentist: the over-fitting problem can be understood as a general property of maximum likelihood.
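The coin-tossing contrast above can be made concrete with a minimal sketch (my own illustration; the Beta(2, 2) prior is an assumed choice, and Beta–Bernoulli conjugacy is covered in PRML Chapter 2): after three heads in three tosses, maximum likelihood says $\mu = 1$, while a mild prior pulls the estimate back toward $1/2$:

```python
from fractions import Fraction as F

heads, tosses = 3, 3       # a fair-looking coin lands heads three times in a row

# maximum likelihood: mu_ML = (number of heads) / (number of tosses)
mu_ml = F(heads, tosses)   # = 1, predicting heads forever

# Bayesian: with a Beta(a, b) prior on mu, the posterior is
# Beta(a + heads, b + tails), whose mean is (a + heads) / (a + b + tosses)
a, b = 2, 2                # a mild prior centred on mu = 1/2 (an assumption)
posterior_mean = F(a + heads, a + b + tosses)
print(mu_ml, posterior_mean)  # 1 5/7
```

The posterior mean $5/7 \approx 0.71$ is a far less extreme conclusion than the maximum likelihood estimate of 1.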

5.5 Dealing with the over-fitting problem [from Ref 1]

  • Frequentist ways of controlling over-fitting:
    1. Regularization, i.e., adding a penalty term to the objective function: an L2 regularizer gives ridge regression, and an L1 regularizer gives Lasso regression. Adding a penalty is also called a shrinkage method, because it can reduce the value of the coefficients.
    2. Cross-validation, i.e., holding out part of the data for validation. Cross-validation is also a method for model selection: using the held-out validation data, we can choose the best of several trained models.
  • Bayesian way of controlling over-fitting: the prior probability.

5.6 Difficulties in Carrying through the Full Bayesian Procedure: Marginalization

The practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.

(^=^ Reflections like these just have to be said in Chinese to feel right! ^=^) The application of Bayesian methods was long constrained by marginalization: for a full Bayesian procedure to make predictions or to compare different models, a necessary step is to marginalize (sum or integrate) over the whole of parameter space.

The door to the practical use of Bayesian techniques in an impressive range of problem domains has been opened by the following developments:

  1. the development of sampling methods, e.g., Markov Chain Monte Carlo (MCMC). Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.
  2. Dramatic improvements in the speed (i.e. CPU) and memory capacity of computers.
  3. Highly efficient deterministic approximation schemes, such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

6. Maximum-likelihood Estimation (MLE) for a univariate Gaussian Case

6.1 Gaussian distribution:

  • 1-dimension:
    $$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\},$$
    which satisfies $\mathcal{N}(x \mid \mu, \sigma^2) > 0$ and $\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\,\mathrm{d}x = 1$, with
    $$\mathbb{E}[x] = \mu, \qquad \mathbb{E}[x^2] = \mu^2 + \sigma^2, \qquad \operatorname{var}[x] = \sigma^2.$$
  • D-dimension:
    $$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\},$$
    where the D-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D \times D$ matrix $\boldsymbol{\Sigma}$ is called the covariance, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.

6.2 Sampling from a Gaussian distribution [see Ref 2]

[Figure: sampling from a Gaussian distribution, from the slides of Ref 2.]
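Since the original slide image is lost, here is a standard sketch of the usual recipe (it may differ in detail from what Ref 2 shows): draw $z \sim \mathcal{N}(0, 1)$ and transform $x = \mu + \sigma z$, so that $x \sim \mathcal{N}(\mu, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# reparameterization: if z ~ N(0, 1), then x = mu + sigma * z ~ N(mu, sigma^2)
mu, sigma = 3.0, 2.0
z = rng.standard_normal(100_000)
x = mu + sigma * z

print(x.mean(), x.std())  # close to 3 and 2
```

The same shift-and-scale trick generalizes to the D-dimensional case via a Cholesky factor of the covariance matrix.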

6.3 Taking the univariate Gaussian as an example

Now suppose that we have a data set of observations $\mathbf{x} = (x_1, \ldots, x_N)^{\mathrm{T}}$, representing $N$ observations of the scalar variable $x$. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean $\mu$ and variance $\sigma^2$ are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.

Because our data set $\mathbf{x}$ is i.i.d., we can therefore write the probability of the data set, given $\mu$ and $\sigma^2$, in the form
$$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2). \tag{1.53}$$
In practice, it is more convenient to maximize the log of the likelihood function, written in the form
$$\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi). \tag{1.54}$$


Why taking the log?

  • The logarithm is a monotonically increasing function of its argument, so maximization of the log of a function is equivalent to maximization of the function itself.
  • Taking the log simplifies the subsequent mathematical analysis;
  • Taking the log helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.
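The underflow point is easy to demonstrate (a toy illustration with an arbitrarily chosen density value of $10^{-4}$ per point):

```python
import math

# 1000 i.i.d. points, each contributing a small density value p(x_n) = 1e-4
densities = [1e-4] * 1000

# the naive product underflows to exactly 0.0 in double precision
likelihood = 1.0
for p in densities:
    likelihood *= p
print(likelihood)  # 0.0 — underflow

# the sum of log-probabilities is perfectly well behaved
log_likelihood = sum(math.log(p) for p in densities)
print(log_likelihood)  # ≈ -9210.34
```

The true likelihood here is $10^{-4000}$, far below the smallest positive double ($\approx 5 \times 10^{-324}$), which is why the product collapses to zero while the log-sum does not.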

When viewed as a function of $\mu$ and $\sigma^2$, this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14. [Figure 1.14: illustration of the likelihood function for a Gaussian distribution; the likelihood corresponds to the product of the density values evaluated at the observed data points.]

Maximizing (1.54) with respect to $\mu$, we obtain the maximum likelihood solution given by
$$\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \tag{1.55}$$
which is called the sample mean, i.e., the mean of the observed values $\{x_n\}$. Similarly, maximizing (1.54) with respect to $\sigma^2$, we obtain the so-called sample variance measured with respect to the sample mean, in the form
$$\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2. \tag{1.56}$$

6.4 One Limitation of the Maximum Likelihood Approach

Limitation: The maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.

[Figure 1.15: illustration of how bias arises when using maximum likelihood to determine the variance of a Gaussian; averaged across data sets, the mean is estimated correctly, but the variance is systematically under-estimated because it is measured relative to the sample mean rather than the true mean.]

We first note that the maximum likelihood solutions $\mu_{\mathrm{ML}}$ and $\sigma^2_{\mathrm{ML}}$ are functions of the data set values $x_1, \ldots, x_N$. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters $\mu$ and $\sigma^2$. It is straightforward to show that
$$\mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left(\frac{N-1}{N}\right) \sigma^2. \tag{1.57, 1.58}$$

From (1.58) it follows that the following estimate for the variance parameter is unbiased:
$$\tilde{\sigma}^2 = \frac{N}{N-1}\, \sigma^2_{\mathrm{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2. \tag{1.59}$$

 

In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.
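The bias result (1.58) can be confirmed by simulation; the sketch below (my own illustration, with arbitrarily chosen true parameters and $N = 3$) averages $\sigma^2_{\mathrm{ML}}$ over many synthetic data sets:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N = 0.0, 1.0, 3          # true parameters; tiny data sets of N = 3
trials = 200_000

data = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = data.mean(axis=1)            # (1.55) per data set
var_ml = ((data - mu_ml[:, None]) ** 2).mean(axis=1)  # (1.56) per data set

# E[var_ml] should be ((N-1)/N) * sigma2 = 2/3, not 1
print(var_ml.mean())                 # ≈ 0.667
print(var_ml.mean() * N / (N - 1))   # unbiased correction (1.59), ≈ 1.0
```

For $N = 3$ the ML variance under-shoots the true variance by a factor of $2/3$, exactly as (1.58) predicts, and the $N/(N-1)$ correction removes the bias.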


Exercise 1.12

Using the results $\mathbb{E}[x] = \mu$ and $\mathbb{E}[x^2] = \mu^2 + \sigma^2$, show that
$$\mathbb{E}[x_n x_m] = \mu^2 + I_{nm} \sigma^2,$$
where $x_n$ and $x_m$ denote data points sampled from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $I_{nm} = 1$ if $n = m$ and $I_{nm} = 0$ otherwise. Hence prove the results (1.57) and (1.58).

Solution:

Since the data points are i.i.d., $\mathbb{E}[x_n x_m] = \mathbb{E}[x_n]\,\mathbb{E}[x_m] = \mu^2$ for $n \neq m$, while $\mathbb{E}[x_n^2] = \mu^2 + \sigma^2$; combining the two cases gives $\mathbb{E}[x_n x_m] = \mu^2 + I_{nm} \sigma^2$. Then
$$\mathbb{E}[\mu_{\mathrm{ML}}] = \frac{1}{N} \sum_{n} \mathbb{E}[x_n] = \mu,$$
and, using $\sigma^2_{\mathrm{ML}} = \frac{1}{N}\sum_n x_n^2 - \mu_{\mathrm{ML}}^2$ together with $\mathbb{E}[\mu_{\mathrm{ML}}^2] = \frac{1}{N^2} \sum_{n,m} \mathbb{E}[x_n x_m] = \mu^2 + \frac{\sigma^2}{N}$,
$$\mathbb{E}[\sigma^2_{\mathrm{ML}}] = (\mu^2 + \sigma^2) - \left(\mu^2 + \frac{\sigma^2}{N}\right) = \left(\frac{N-1}{N}\right) \sigma^2.$$


7. Curve fitting re-visited

7.1 Purpose:

Roadmap: probabilistic model → MLE (a point estimate) → MAP (a point estimate with a prior) → fully Bayesian treatment.

We have seen how the problem of polynomial curve fitting can be expressed in terms of error minimization in Section 1.1. Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

7.2 Goal in the curve fitting problem:

The goal in the curve fitting problem is to be able to make predictions for the target variable $t$ given some new value of the input variable $x$, on the basis of a set of training data comprising the input values $\mathbf{x} = (x_1, \ldots, x_N)^{\mathrm{T}}$ and their corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$.

7.3 Uncertainty over the value of the target variable

We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of $x$, the corresponding value of $t$ has a Gaussian distribution with a mean equal to the value $y(x, \mathbf{w})$ of the polynomial curve given by (1.1). Thus we have
$$p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1}), \tag{1.60}$$
where, for consistency with the notation in later chapters, we have defined a precision parameter $\beta$ corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.

[Figure 1.16: schematic illustration of the Gaussian conditional distribution for $t$ given $x$, in which the mean is given by the polynomial function $y(x, \mathbf{w})$ and the precision is given by $\beta$.]
For the i.i.d. training data $\{\mathbf{x}, \mathbf{t}\}$, the likelihood function is given by
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}), \tag{1.61}$$
and the log likelihood function takes the form
$$\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi). \tag{1.62}$$

We can use maximum likelihood to determine the precision parameter $\beta$ of the Gaussian conditional distribution:
$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2. \tag{1.63}$$

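The two maximum likelihood steps can be sketched in a few lines (an illustration with assumed sizes $N$, $M$, and noise level, not the book's exact experiment): $\mathbf{w}_{\mathrm{ML}}$ comes from least squares on the sum-of-squares term in (1.62), and $\beta_{\mathrm{ML}}$ from (1.63):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, beta_true = 50, 3, 25.0                       # assumed sizes for this sketch
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta_true ** -0.5, N)

# design matrix with phi_i(x) = x^i, i = 0..M
Phi = np.vander(x, M + 1, increasing=True)

# w_ML: maximizing (1.62) over w reduces to least squares on the sum-of-squares term
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# (1.63): 1/beta_ML is the mean squared residual under w_ML
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)
print(w_ml, beta_ml)
```

Note the order: $\mathbf{w}_{\mathrm{ML}}$ can be found first because the location of the maximum in $\mathbf{w}$ does not depend on $\beta$; $\beta_{\mathrm{ML}}$ is then read off from the residuals.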

7.4 The Likelihood for Linear Regression and its MLE Solution (a Point Estimate) [see Ref-2]

The same idea can be found in Lecture 3 of [Ref-2]. Please note that there $y$ is used to represent the target variable. The maximum likelihood estimate (MLE) of $\theta$ is obtained by setting the derivative of the log-likelihood to zero; the goal is to maximize the likelihood of seeing the training data by modifying the parameters $\theta$.

  • The MLE of $\theta$ is:
    $$\hat{\theta}_{\mathrm{ML}} = (\mathbf{X}^{\mathrm{T}} \mathbf{X})^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y}.$$
  • The MLE of $\sigma^2$ is:
    $$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^{\mathrm{T}} \hat{\theta}_{\mathrm{ML}}\big)^2.$$


7.5 Making predictions:

Because we now have a probabilistic model, predictions are expressed in terms of the predictive distribution that gives the probability distribution over $t$, rather than simply a point estimate, and it is obtained by substituting the maximum likelihood parameters into (1.60) to give
$$p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}), \beta_{\mathrm{ML}}^{-1}\big). \tag{1.64}$$

As a special case, for linear regression the [MLE plug-in prediction] [Ref-2], given the training data $\mathbf{X}, \mathbf{y}$, for a new input $\mathbf{x}_*$ and known $\sigma^2$, is
$$p(y \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}\big(y \mid \mathbf{x}_*^{\mathrm{T}} \hat{\theta}_{\mathrm{ML}}, \sigma^2\big),$$
as shown in the corresponding figure of [Ref-2].

7.6 Taking a step towards a more Bayesian approach

  • Prior distribution over the polynomial coefficients $\mathbf{w}$: For simplicity, let us consider a Gaussian distribution of the form
    $$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}\right\}, \tag{1.65}$$
    where $\alpha$ is the precision of the distribution, and $M + 1$ is the total number of elements in the vector $\mathbf{w}$ for an $M$th-order polynomial.
  • Hyperparameters: Variables such as $\alpha$, which control the distribution of model parameters, are called hyperparameters.
  • Calculate the posterior distribution for $\mathbf{w}$: Using Bayes’ theorem, the posterior distribution for $\mathbf{w}$ is proportional to the product of the likelihood and the prior:
    $$p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha). \tag{1.66}$$
  • MAP, maximum posterior: We can now determine $\mathbf{w}$ by finding the most probable value of $\mathbf{w}$ given the data, in other words by maximizing the posterior distribution. This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of
    $$\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}. \tag{1.67}$$
  • Equivalence between posterior maximization and the regularized sum-of-squares error function: Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by $\lambda = \alpha / \beta$.
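Minimizing (1.67) has the closed form of ridge regression, $(\lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi})\, \mathbf{w} = \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}$ with $\lambda = \alpha / \beta$. The sketch below implements this (the hyperparameter values and data set are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 9
alpha, beta = 5e-3, 11.1                      # assumed hyperparameter values
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)
Phi = np.vander(x, M + 1, increasing=True)    # phi_i(x) = x^i

# MAP / ridge solution of (1.67): (lambda I + Phi^T Phi) w = Phi^T t
lam = alpha / beta
w_map = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(w_map)
```

Setting $\alpha = 0$ (no prior) recovers plain least squares, i.e., $\mathbf{w}_{\mathrm{ML}}$; a larger $\alpha$ shrinks the coefficients more strongly.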

Note:

Although we have included a prior distribution $p(\mathbf{w} \mid \alpha)$, we are so far still making a point estimate of $\mathbf{w}$, and so this does not yet amount to a Bayesian treatment, which is discussed in the following section.


8. Bayesian Curve fitting

In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over (i.e., marginalize) all values of $\mathbf{w}$. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data $\mathbf{x}$ and $\mathbf{t}$, along with a new test point $x$, and our goal is to predict the value of $t$. We therefore wish to evaluate the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$. Here we shall assume that the parameters $\alpha$ and $\beta$ are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\,\mathrm{d}\mathbf{w}. \tag{1.68}$$

  • in RHS: $p(t \mid x, \mathbf{w})$ is given by (1.60), and we have omitted the dependence on $\alpha$ and $\beta$ to simplify the notation.
  • in RHS: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})$ is the posterior distribution over parameters, and can be found by normalizing the right-hand side of equation (1.66). It will be shown in Section 3.3 that this posterior distribution is a Gaussian and can be evaluated analytically.
  • LHS: the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form
    $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x), s^2(x)\big), \tag{1.69}$$
    where the mean and variance are given by
    $$m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathrm{T}}\, \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n, \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathrm{T}}\, \mathbf{S}\, \boldsymbol{\phi}(x). \tag{1.70, 1.71}$$
    Here the matrix $\mathbf{S}$ is given by
    $$\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathrm{T}}, \tag{1.72}$$
    where $\mathbf{I}$ is the unit matrix, and the vector $\boldsymbol{\phi}(x)$ has elements $\phi_i(x) = x^i$ for $i = 0, \ldots, M$.

Analysis of (1.71):

  • the first term, $\beta^{-1}$: represents the uncertainty in the predicted value of $t$ due to the noise on the target variables, and was expressed already in the maximum likelihood predictive distribution (1.64) through $\beta_{\mathrm{ML}}^{-1}$.
  • the second term, $\boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S}\, \boldsymbol{\phi}(x)$: arises from the uncertainty in the parameters $\mathbf{w}$ and is a consequence of the Bayesian treatment.

The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17. [Figure 1.17: the predictive distribution for the sinusoidal data set, showing the predictive mean as a curve together with a region of plus/minus one standard deviation around it.]
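Equations (1.69)–(1.72) translate directly into code. The sketch below (the data set and hyperparameter values are assumptions for illustration) computes the predictive mean and variance on a grid of new inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
alpha, beta = 5e-3, 11.1                   # assumed hyperparameter values
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)

def phi(x):                                # phi_i(x) = x^i, i = 0..M
    return np.power.outer(np.atleast_1d(x), np.arange(M + 1))

Phi = phi(x)                               # N x (M+1) design matrix

# (1.72): S^{-1} = alpha I + beta Σ_n phi(x_n) phi(x_n)^T
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predict(x_new):
    p = phi(x_new)
    m = beta * p @ S @ Phi.T @ t                  # (1.70): predictive mean
    s2 = 1.0 / beta + np.sum(p @ S * p, axis=1)   # (1.71): predictive variance
    return m, s2

m, s2 = predict(np.linspace(0, 1, 5))
print(m, s2)
```

Note that the predictive variance is always at least $\beta^{-1}$ (the noise term) plus a positive parameter-uncertainty term, matching the analysis of (1.71) above.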

9. The three approaches illustrated on the curve fitting example [See Ref-1]

  • 1) MLE: maximize the likelihood function directly, obtaining the parameters $\mathbf{w}_{\mathrm{ML}}$. This is a point estimation method.
  • 2) MAP (poor man’s Bayes): introduce a prior probability and maximize the posterior probability, obtaining $\mathbf{w}_{\mathrm{MAP}}$. MAP amounts to adding an L2 penalty to the MLE objective function (the likelihood function). It is still a point estimation method.
  • 3) Fully Bayesian approach: this requires the sum rule and the product rule (since the machinery of “degree of belief” is the same as that of probability, the two rules hold for degrees of belief as well), and obtaining the predictive distribution further requires marginalizing (summing or integrating) over the whole of the parameter space $\mathbf{w}$:
    $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\,\mathrm{d}\mathbf{w},$$
    where $x$ is the point to be predicted, $\mathbf{x}$ is the observed data set, and $\mathbf{t}$ collects the labels of the data points. This is in effect a weighted average of the probability $p(t \mid x, \mathbf{w})$ over all settings of $\mathbf{w}$, with the posterior probability of the parameters as the weights; the process therefore requires integrating over $\mathbf{w}$, i.e., marginalization.

10. References

[1]: http://www.cvrobot.net/wp-content/uploads/2015/09/PRML%E7%AC%94%E8%AE%B0-Notes-on-Pattern-Recognition-and-Machine-Learning-1.pdf — PRML笔记 (Notes on Pattern Recognition and Machine Learning), Chapter 01, pages 4–6.

[2]: https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/; Slides of Deep Learning Course at Oxford University;

 
posted @ 2016-06-21 01:31 GloryOfFamily