PRML Reading Notes, Chapter 2: Probability Distributions
2.1. Binary Variables
1. Bernoulli distribution, p(x = 1|µ) = µ
2. Binomial distribution: Bin(m|N, µ) = (N choose m) µ^m (1 − µ)^(N−m), the distribution of the number m of observations of x = 1 in N trials.
3. Beta distribution (conjugate prior of the Bernoulli distribution): Beta(µ|a, b) = Γ(a + b)/(Γ(a)Γ(b)) µ^(a−1) (1 − µ)^(b−1).
The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ.
Given m observations of x = 1 and l observations of x = 0, the posterior for µ is proportional to µ^(m+a−1) (1 − µ)^(l+b−1), i.e. another beta distribution with parameters a + m and b + l.
The variance goes to zero for a → ∞ or b → ∞. It is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution steadily decreases.
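A minimal NumPy sketch of this conjugate update (the hyperparameters a, b, the true µ, and the sample size below are arbitrary illustrative choices, not values from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated Bernoulli observations (illustrative true parameter).
mu_true = 0.7
x = rng.binomial(1, mu_true, size=100)

# Beta prior hyperparameters a, b.
a, b = 2.0, 2.0

# Conjugacy: after m ones and l zeros the posterior is Beta(a + m, b + l).
m = x.sum()
l = len(x) - m
a_post, b_post = a + m, b + l

post_mean = a_post / (a_post + b_post)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
print(f"posterior mean {post_mean:.3f}, posterior variance {post_var:.5f}")
```

As more data arrive, a_post + b_post grows and the posterior variance shrinks, matching the statement above.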
2.2. Multinomial Variables
1. Consider discrete variables that can take on one of K possible mutually exclusive states. In the 1-of-K representation, one of the elements xk equals 1 and all remaining elements equal 0, e.g. for K = 6:
x = (0, 0, 1, 0, 0, 0)T
Consider a data set D of N independent observations x1, . . . , xN. The corresponding likelihood function takes the form p(D|µ) = ∏n ∏k µk^xnk = ∏k µk^mk, where mk = ∑n xnk is the number of observations for which xk = 1.
To find the maximum likelihood solution for µ, we maximize the log likelihood ∑k mk ln µk subject to the constraint ∑k µk = 1 by adding a Lagrange multiplier term, giving ∑k mk ln µk + λ(∑k µk − 1).
Setting the derivative with respect to µk to zero, we obtain µk = −mk/λ.
Substituting into the constraint ∑k µk = 1 gives λ = −N, and the solution takes the form µk^ML = mk/N,
which is the fraction of the N observations for which xk = 1.
Consider the joint distribution of the quantities m1, . . . , mK, conditioned on the parameters µ and on the total number N of observations; this is the multinomial distribution, Mult(m1, . . . , mK|µ, N) ∝ ∏k µk^mk.
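A short sketch computing the counts mk and the maximum likelihood estimate µk = mk/N from simulated 1-of-K data (the probabilities and the sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated 1-of-K observations (K = 6, illustrative probabilities).
K = 6
mu_true = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])
N = 1000
X = rng.multinomial(1, mu_true, size=N)   # each row is a 1-of-K vector

# Sufficient statistics m_k = sum_n x_nk, and the ML solution mu_k = m_k / N.
m = X.sum(axis=0)
mu_ml = m / N
print(m, mu_ml)
```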
2. The Dirichlet distribution (conjugate prior of the multinomial distribution)
Multiplying the Dirichlet prior Dir(µ|α) ∝ ∏k µk^(αk−1) by the multinomial likelihood yields a posterior of the same form, Dir(µ|α + m), where m = (m1, . . . , mK)^T.
We can therefore interpret the parameters αk of the Dirichlet prior as an effective number of prior observations of xk = 1.
Note that two-state quantities can either be treated as binary variables using the binomial and beta distributions, or as 1-of-2 variables using the multinomial distribution with K = 2.
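A sketch of the corresponding Dirichlet update (the prior hyperparameters αk and the observed counts are illustrative values):

```python
import numpy as np

# Dirichlet prior hyperparameters alpha_k (illustrative); conjugacy gives
# posterior parameters alpha_k + m_k, so each alpha_k acts like a pseudo-count
# of prior observations of x_k = 1.
alpha = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
m = np.array([96, 104, 310, 198, 187, 105])     # observed counts, summing to N

alpha_post = alpha + m
posterior_mean = alpha_post / alpha_post.sum()  # posterior mean of each mu_k
print(posterior_mean)
```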
2.3. The Gaussian Distribution
For a D-dimensional vector x, N(x|µ, Σ) = (1/(2π)^(D/2)) (1/|Σ|^(1/2)) exp{−(1/2)(x − µ)^T Σ^-1 (x − µ)},
where µ is a D-dimensional mean vector and Σ is a D × D covariance matrix.
eigenvector equation for the covariance matrix:
Σui = λiui
Σ can be expressed as an expansion in terms of its eigenvectors in the form Σ = ∑i λi ui ui^T.
Define yi = ui^T(x − µ), i.e. y = U(x − µ), where U is the matrix whose rows are ui^T.
In the yj coordinate system, the Gaussian distribution takes the form p(y) = ∏j (1/(2πλj)^(1/2)) exp{−yj^2/(2λj)}.
This confirms that the multivariate Gaussian is indeed normalized.
The expectation of the Gaussian distribution is E[x] = µ, and its covariance is cov[x] = Σ.
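The eigendecomposition view can be checked numerically. In the sketch below the 2-D mean and covariance are arbitrary illustrative values, and the sample covariance in the rotated coordinates yi = ui^T(x − µ) comes out approximately diagonal with the eigenvalues λi on the diagonal:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 2-D Gaussian parameters.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigenvector equation Sigma u_i = lambda_i u_i (columns of U are eigenvectors).
lam, U = np.linalg.eigh(Sigma)

# Draw samples and move to the rotated coordinates y_i = u_i^T (x - mu);
# in these coordinates the Gaussian factorizes with variances lambda_i.
X = rng.multivariate_normal(mu, Sigma, size=50_000)
Y = (X - mu) @ U
print(np.cov(Y, rowvar=False))   # approximately diag(lambda_1, lambda_2)
print(lam)
```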
2.3.1 Conditional Gaussian distributions
An important property of the multivariate Gaussian distribution is that if two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian. Similarly, the marginal distribution of either set is also Gaussian.
The mean and covariance of the conditional distribution p(xa|xb) are µa|b = µa + Σab Σbb^-1 (xb − µb) and Σa|b = Σaa − Σab Σbb^-1 Σba.
The conditional mean is a linear function of xb, while the conditional covariance does not depend on xb at all.
2.3.2 Marginal Gaussian distributions
The marginal distribution p(xa) = ∫ p(xa, xb) dxb is also Gaussian, with mean µa and covariance Σaa.
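Both results follow directly from the partitioned mean and covariance. The sketch below uses an arbitrary illustrative 3-D joint Gaussian, with xa the first component and xb the remaining two:

```python
import numpy as np

# Partitioned joint Gaussian over x = (x_a, x_b) with illustrative parameters.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
ia, ib = [0], [1, 2]                       # indices of x_a and x_b

mu_a, mu_b = mu[ia], mu[ib]
S_aa = Sigma[np.ix_(ia, ia)]
S_ab = Sigma[np.ix_(ia, ib)]
S_bb = Sigma[np.ix_(ib, ib)]

xb = np.array([1.5, -0.5])                 # observed value of x_b

# Conditional p(x_a | x_b): mean is linear in x_b, covariance is constant.
mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, xb - mu_b)
Sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

# Marginal p(x_a): simply pick out the corresponding blocks.
mu_marg, Sigma_marg = mu_a, S_aa
print(mu_cond, Sigma_cond, mu_marg, Sigma_marg)
```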
2.3.3 Bayes’ theorem for Gaussian variables
Here we shall suppose that we are given a Gaussian marginal distribution p(x) and a Gaussian conditional distribution p(y|x) in which p(y|x) has a mean that is a linear function of x, and a covariance which is independent of x. We wish to find the marginal distribution p(y) and the conditional distribution p(x|y).
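A numerical sketch of these linear-Gaussian results, with arbitrary illustrative choices for µ, Λ, A, b and L. The marginal of y has mean Aµ + b and covariance L^-1 + AΛ^-1A^T, and the posterior over x has covariance (Λ + A^T L A)^-1:

```python
import numpy as np

# Linear-Gaussian model (illustrative values):
#   p(x) = N(x | mu, Lambda^-1),   p(y | x) = N(y | A x + b, L^-1)
mu = np.array([0.0, 0.0])
Lambda = np.array([[1.0, 0.2], [0.2, 2.0]])      # precision of p(x)
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
b = np.array([0.1, -0.3, 0.5])
L = np.eye(3) * 4.0                               # precision of p(y | x)

Lambda_inv = np.linalg.inv(Lambda)

# Marginal p(y) = N(y | A mu + b, L^-1 + A Lambda^-1 A^T)
mean_y = A @ mu + b
cov_y = np.linalg.inv(L) + A @ Lambda_inv @ A.T

# Posterior p(x | y) = N(x | S (A^T L (y - b) + Lambda mu), S),
# with S = (Lambda + A^T L A)^-1
y = np.array([0.5, 0.2, 1.0])
S = np.linalg.inv(Lambda + A.T @ L @ A)
mean_x_given_y = S @ (A.T @ L @ (y - b) + Lambda @ mu)
print(mean_y, cov_y, mean_x_given_y, S)
```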
2.3.4 Maximum likelihood for the Gaussian
The log likelihood function is given by ln p(X|µ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑n (xn − µ)^T Σ^-1 (xn − µ).
We see that the likelihood function depends on the data set only through the two quantities ∑n xn and ∑n xn xn^T, which are the sufficient statistics for the Gaussian.
The maximum likelihood estimates of the mean and covariance matrix are given by µ_ML = (1/N) ∑n xn and Σ_ML = (1/N) ∑n (xn − µ_ML)(xn − µ_ML)^T.
Evaluating the expectations of the maximum likelihood solutions under the true distribution, we obtain E[µ_ML] = µ and E[Σ_ML] = ((N − 1)/N) Σ.
We see that the expectation of the maximum likelihood estimate for the mean is equal to the true mean. However, the maximum likelihood estimate for the covariance has an expectation that is less than the true value, and hence it is biased. We can correct this bias by using the estimator (1/(N − 1)) ∑n (xn − µ_ML)(xn − µ_ML)^T, whose expectation equals the true covariance.
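A small sketch contrasting the biased and bias-corrected estimators (the true parameters and the small sample size N = 5 are illustrative; with such a small N the 1/N and 1/(N − 1) versions differ noticeably):

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw N samples from a known 2-D Gaussian (illustrative parameters).
mu_true = np.array([1.0, 2.0])
Sigma_true = np.array([[1.0, 0.3], [0.3, 0.5]])
N = 5
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)

# Maximum likelihood estimates.
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / N              # biased: E[Sigma_ml] = (N - 1)/N * Sigma

# Bias-corrected estimator (divide by N - 1 instead of N).
Sigma_unbiased = diff.T @ diff / (N - 1)
print(mu_ml, Sigma_ml, Sigma_unbiased)
```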
2.3.5 Sequential estimation
1. Sequential methods process data points one at a time and then discard them. They are important for on-line applications, and also where large data sets are involved, so that batch processing of all data points at once is infeasible.
This result has a nice interpretation, as follows. After observing N − 1 data points we have estimated µ by µ_ML^(N−1). We now observe data point xN, and we obtain the revised estimate µ_ML^(N) = µ_ML^(N−1) + (1/N)(xN − µ_ML^(N−1)), i.e. the old estimate is moved a small amount, proportional to 1/N, in the direction of the 'error signal' (xN − µ_ML^(N−1)). Note that, as N increases, the contribution from successive data points gets smaller.
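A sketch of this sequential update on a simulated 1-D data stream (the true mean and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.0, size=1000)   # illustrative 1-D stream

# Sequential estimate of the mean: process one point at a time, then discard it.
mu = 0.0
for N, x in enumerate(data, start=1):
    mu = mu + (x - mu) / N      # mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1})

print(mu, data.mean())          # the two agree
```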
2. Robbins-Monro algorithm
The conditional expectation of z given θ defines a deterministic function f(θ) given by f(θ) = E[z|θ].
We shall assume that the conditional variance of z is finite, so that E[(z − f)^2 | θ] < ∞.
The Robbins-Monro procedure then defines a sequence of successive estimates of the root given by θ^(N) = θ^(N−1) + a_(N−1) z(θ^(N−1)),
where z(θ^(N)) is an observed value of z when θ takes the value θ^(N). The coefficients {aN} represent a sequence of positive numbers that satisfy the conditions lim aN = 0, ∑ aN = ∞, and ∑ aN^2 < ∞.
The first condition ensures that the successive corrections decrease in magnitude so that the process can converge to a limiting value. The second condition is required to ensure that the algorithm does not converge short of the root, and the third condition is needed to ensure that the accumulated noise has finite variance and hence does not spoil convergence.
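As a concrete case, the sequential maximum likelihood estimate of a Gaussian mean can be written as a Robbins-Monro procedure, with z the derivative of the single-point log likelihood. The step sizes aN = σ^2/N used below are one valid choice satisfying the three conditions (all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma2 = 3.0, 1.0
data = rng.normal(mu_true, np.sqrt(sigma2), size=5000)

# Robbins-Monro sequential ML for the Gaussian mean:
# z(theta) = d/dtheta ln N(x | theta, sigma2) = (x - theta) / sigma2,
# with step sizes a_N = sigma2 / N (one valid choice of coefficients).
theta = 0.0
for N, x in enumerate(data, start=1):
    a = sigma2 / N
    z = (x - theta) / sigma2
    theta = theta + a * z

print(theta)                    # close to mu_true
```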
2.3.6 Bayesian inference for the Gaussian
The conjugate prior for the precision λ of a Gaussian with known mean is the gamma distribution, Gam(λ|a, b) = (1/Γ(a)) b^a λ^(a−1) exp(−bλ).
The mean and variance of the gamma distribution are given by E[λ] = a/b and var[λ] = a/b^2.
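A sketch of the conjugate gamma update for the precision of a Gaussian with known mean, using aN = a0 + N/2 and bN = b0 + (1/2)∑n(xn − µ)^2 (the prior hyperparameters and the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# Known mean, unknown precision lambda; Gam(lambda | a0, b0) prior is conjugate.
mu = 0.0
lam_true = 4.0                              # true precision (variance 0.25)
x = rng.normal(mu, 1.0 / np.sqrt(lam_true), size=200)

a0, b0 = 1.0, 1.0                           # illustrative prior hyperparameters
N = len(x)
aN = a0 + N / 2.0
bN = b0 + 0.5 * np.sum((x - mu) ** 2)

# Posterior mean and variance of the precision: E[lam] = a/b, var[lam] = a/b^2.
print(aN / bN, aN / bN**2)
```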
2.3.7 Student’s t-distribution
If we have a univariate Gaussian N(x|µ, τ^-1) together with a Gamma prior Gam(τ|a, b) and we integrate out the precision, we obtain the marginal distribution of x in the form p(x|µ, a, b) = ∫ N(x|µ, τ^-1) Gam(τ|a, b) dτ.
Setting ν = 2a and λ = a/b, this can be written St(x|µ, λ, ν),
which is known as Student’s t-distribution. The parameter λ is sometimes called the precision of the t-distribution, even though it is not in general equal to the inverse of the variance. The parameter ν is called the degrees of freedom.
For the particular case of ν = 1, the t-distribution reduces to the Cauchy distribution, while in the limit ν → ∞ the t-distribution St(x|µ, λ, ν) becomes a Gaussian N(x|µ, λ−1) with mean µ and precision λ.
The result is a distribution that in general has longer 'tails' than a Gaussian. This gives the t-distribution an important property called robustness, which means that it is much less sensitive than the Gaussian to the presence of a few data points which are outliers.
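A numerical check of the scale-mixture construction: drawing a precision from the gamma and then a point from the corresponding Gaussian produces Student's t samples, whose variance for ν > 2 is ν/(λ(ν − 2)). The values of µ, a, b below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Student's t as an infinite mixture of Gaussians: draw a precision tau from
# Gam(tau | a, b), then x from N(mu, tau^-1).  With nu = 2a and lam = a/b this
# corresponds to St(x | mu, lam, nu).
mu, a, b = 0.0, 2.0, 2.0                     # illustrative values: nu = 4, lam = 1
tau = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
x = rng.normal(mu, 1.0 / np.sqrt(tau))

nu, lam = 2 * a, a / b
# For nu > 2 the variance of St(x | mu, lam, nu) is nu / (lam * (nu - 2)).
print(x.var(), nu / (lam * (nu - 2)))        # both close to 2.0
```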
2.3.8 Periodic variables
1. Periodic quantities can conveniently be represented using an angular (polar) coordinate 0 ≤ θ < 2π.
We might be tempted to treat periodic variables by choosing some direction as the origin and then applying a conventional distribution such as the Gaussian. Such an approach, however, would give results that were strongly dependent on the arbitrary choice of origin.
To find an invariant measure of the mean, we note that the observations can be viewed as points on the unit circle and can therefore be described instead by two-dimensional unit vectors x1, . . . , xN, where ‖xn‖ = 1 for n = 1, . . . , N.
The Cartesian coordinates of the observations are given by xn = (cos θn, sin θn), and we can write the Cartesian coordinates of the sample mean as x̄ = (1/N) ∑n xn = ((1/N) ∑n cos θn, (1/N) ∑n sin θn), whose direction gives the mean angle θ̄ = tan^-1{(∑n sin θn)/(∑n cos θn)}.
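A small sketch showing why the unit-vector construction matters: for angles that wrap around the 0/2π boundary, the naive arithmetic mean is badly off, while the circular mean recovers the correct direction (the data are simulated and illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Angles clustered near 0 but wrapping across the 0 / 2*pi boundary.
theta = rng.normal(0.0, 0.3, size=500) % (2 * np.pi)

# Naive arithmetic mean is distorted by the wrap-around ...
naive_mean = theta.mean()

# ... whereas viewing each observation as a unit vector (cos, sin) and taking
# the direction of the vector sum gives an origin-independent mean direction.
theta_bar = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum()) % (2 * np.pi)
print(naive_mean, theta_bar)   # naive value near pi, circular mean near 0 (mod 2*pi)
```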
2. von Mises distribution
We will consider distributions p(θ) that have period 2π. Any probability density p(θ) defined over θ must not only be nonnegative and integrate to one, but it must also be periodic. Thus p(θ) must satisfy the three conditions p(θ) ≥ 0, ∫0^2π p(θ) dθ = 1, and p(θ + 2π) = p(θ).
From the last condition it follows that p(θ + M2π) = p(θ) for any integer M.
Consider a Gaussian distribution over two variables x = (x1, x2) having mean µ = (µ1, µ2) and a covariance matrix Σ = σ^2 I, where I is the 2 × 2 identity matrix. Conditioning this distribution on the unit circle x1^2 + x2^2 = 1 yields the von Mises distribution p(θ|θ0, m) = exp{m cos(θ − θ0)}/(2π I0(m)), where θ0 is the mean of the distribution, m is the concentration parameter (analogous to the inverse variance of a Gaussian), and I0(m) is the zeroth-order modified Bessel function of the first kind.
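A minimal sketch of the von Mises density and of the maximum likelihood mean direction θ0^ML = tan^-1{(∑n sin θn)/(∑n cos θn)} (the helper name von_mises_pdf and all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

def von_mises_pdf(theta, theta0, m):
    """p(theta | theta0, m) = exp(m cos(theta - theta0)) / (2 pi I0(m))."""
    return np.exp(m * np.cos(theta - theta0)) / (2 * np.pi * np.i0(m))

# Maximum likelihood for the mean direction theta0 from simulated data
# (concentration fixed for simplicity): theta0_ML = atan2(sum sin, sum cos).
theta = rng.vonmises(mu=1.0, kappa=5.0, size=2000)
theta0_ml = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())
print(theta0_ml)                                  # close to 1.0
print(von_mises_pdf(theta0_ml, theta0_ml, 5.0))   # density at the mode
```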
2.3.9 Mixtures of Gaussians
Consider a superposition of K Gaussian densities of the form
p(x) = ∑k πk N(x|µk, Σk),
which is called a mixture of Gaussians. The parameters πk are called mixing coefficients and satisfy 0 ≤ πk ≤ 1 with ∑k πk = 1.
For example, a mixture with K = 3 components can represent a density with up to three modes.
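A sketch of evaluating a 1-D mixture density (the mixing coefficients, means, and standard deviations for K = 3 are illustrative choices):

```python
import numpy as np

# Evaluate a 1-D mixture of Gaussians p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)
# with illustrative parameters for K = 3 components.
pis = np.array([0.5, 0.3, 0.2])            # mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

def mixture_pdf(x):
    x = np.atleast_1d(x)[:, None]
    comps = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ pis                     # weighted sum of component densities

xs = np.linspace(-5, 6, 5)
print(mixture_pdf(xs))
```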
2.4. The Exponential Family
1. The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994).
The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form p(x|η) = h(x) g(η) exp{η^T u(x)},
where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x. The function g(η) can be interpreted as the coefficient that ensures that the distribution is normalized and therefore satisfies g(η) ∫ h(x) exp{η^T u(x)} dx = 1,
where the integration is replaced by summation if x is a discrete variable.
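As a worked example that the Bernoulli distribution belongs to this family, the sketch below compares the standard form µ^x (1 − µ)^(1−x) with the exponential-family form using η = ln{µ/(1 − µ)}, u(x) = x, h(x) = 1 and g(η) = σ(−η); the value of µ is illustrative:

```python
import numpy as np

# Bernoulli in exponential-family form p(x | eta) = h(x) g(eta) exp(eta * u(x)),
# with u(x) = x, h(x) = 1, eta = ln(mu / (1 - mu)), g(eta) = sigma(-eta).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu = 0.3
eta = np.log(mu / (1.0 - mu))

for x in (0, 1):
    standard = mu**x * (1 - mu) ** (1 - x)
    exp_family = sigmoid(-eta) * np.exp(eta * x)
    print(x, standard, exp_family)        # the two columns agree
```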
2. Conjugate priors
In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
2.5. Nonparametric Methods
First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point.
Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results (this is analogous to the choice of the degree M of the polynomial, or of the value α of the regularization parameter, in curve fitting).
2.5.1 Kernel density estimators
Consider some small region R containing the point x at which we wish to estimate the density. If R is small enough that p(x) is roughly constant over it, we obtain our density estimate in the form p(x) ≃ K/(NV),
where K is the total number of points that lie inside R and V is the volume of R.
To fix V and determine K from the data, take R to be a hypercube of side h centred on x and count points using a kernel function k(u), which equals 1 when every component satisfies |ui| ≤ 1/2 and 0 otherwise; k((x − xn)/h) then indicates whether the data point xn lies inside the cube.
Thus the estimated density at x is p(x) = (1/N) ∑n (1/h^D) k((x − xn)/h),
where we have used h^D for the volume of a hypercube of side h in D dimensions.
The kernel k can also be a Gaussian, giving the smoother estimate p(x) = (1/N) ∑n (1/(2πh^2)^(D/2)) exp{−‖x − xn‖^2/(2h^2)}.
h represents the standard deviation of the Gaussian components and plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h.
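A sketch of a 1-D Gaussian kernel density estimator evaluated for several values of h (the simulated data set, the grid of evaluation points, and the helper name kde_gaussian are illustrative); very small h gives a noisy estimate and very large h over-smooths:

```python
import numpy as np

rng = np.random.default_rng(10)
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.0, 1.0, 300)])

def kde_gaussian(x, data, h):
    """p(x) = (1/N) sum_n N(x | x_n, h^2), a 1-D Gaussian kernel estimator."""
    x = np.atleast_1d(x)[:, None]
    k = np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

xs = np.linspace(-4, 4, 9)
for h in (0.05, 0.3, 1.5):        # too small -> noisy, too large -> over-smoothed
    print(h, np.round(kde_gaussian(xs, data, h), 3))
```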
2.5.2 Nearest-neighbour methods
One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels. In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller.
Thus the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.
Instead of fixing V, we consider a fixed value of K and use the data to find an appropriate value for V: we grow a sphere centred on the point x until it contains exactly K data points, and again set p(x) = K/(NV).
Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges.
If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class. Suppose this sphere has volume V and contains Kk points from class Ck. Then p(Ck|x) = Kk/K, and we assign x to the class having the largest value of Kk.
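A minimal sketch of this classification rule on two simulated 2-D classes (the class locations, spread, K = 5, and the helper name knn_predict are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(11)

# Two illustrative 2-D classes and a K-nearest-neighbour classifier:
# p(C_k | x) = K_k / K, so we assign x to the class with the most
# representatives among its K nearest training points.
X0 = rng.normal([-1.0, 0.0], 0.8, size=(100, 2))
X1 = rng.normal([1.5, 1.0], 0.8, size=(100, 2))
X = np.vstack([X0, X1])
t = np.array([0] * 100 + [1] * 100)

def knn_predict(x, K=5):
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:K]            # indices of the K closest points
    return np.bincount(t[nearest], minlength=2).argmax()

print(knn_predict(np.array([-1.0, 0.2])), knn_predict(np.array([1.4, 0.9])))
```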
An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier.
both the K-nearest-neighbour method, and the kernel density estimator, require the entire training data set to be stored, leading to expensive computation if the data set is large.
This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set.