Machine Learning Study Notes: PRML Chapter 2.0 : Prerequisite 1 - Sufficient Statistics
Christopher M. Bishop, PRML, Chapter 2 Probability Distributions
1. Introduction
In the process of estimating parameters, we summarize, or reduce, the information in a sample of size $n$, $X_1, X_2, \ldots, X_n$, to a single number, such as the sample mean $\bar{x}$. The actual sample values are no longer important to us. That is, if we use a sample mean of $\bar{x} = 3$ to estimate the population mean $\mu$, it doesn't matter whether the original data values were, say, $(1, 3, 5)$ or $(2, 3, 4)$: both samples have the same mean.
Problems:
- Has this process of reducing the $n$ data points to a single number retained all of the information about $\mu$ that was contained in the original $n$ data points?
- Or has some information about the parameter been lost through the process of summarizing the data?
In this lesson, we’ll learn how to find statistics that summarize all of the information in a sample about the desired parameter. Such statistics are called sufficient statistics.
2. Definition of Sufficiency
2.1 Definition:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a probability distribution with unknown parameter $\theta$. Then, the statistic
$$Y = u(X_1, X_2, \ldots, X_n)$$
is said to be sufficient for $\theta$ if the conditional distribution of $X_1, X_2, \ldots, X_n$, given the statistic $Y$, i.e.,
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y = y),$$
does not depend on the parameter $\theta$.
- Why called "sufficient"?
- We say that $Y$ is sufficient for $\theta$ because once the value of $Y$ is known (i.e., given $Y$, we already have all of the available information about the unknown parameter $\theta$), no other function of $X_1, X_2, \ldots, X_n$ will provide any additional information about the possible value of $\theta$.
- Sufficiency means that if we know the value of $Y$, we cannot gain any further information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, \ldots, X_n$.
2.2 Example 1 - Binomial Distribution:
Consider Bernoulli trials: Let $X_1, X_2, \ldots, X_n$ be a random sample of $n$ Bernoulli trials in which success has probability $p$ and failure has probability $1-p$, i.e., $P(X_i = 1) = p$ and $P(X_i = 0) = 1-p$, for $i = 1, 2, \ldots, n$. Suppose, in a random sample of size $n$, that $y = \sum_{i=1}^{n} x_i$ successes occur in total. If we know the value of $Y = \sum_{i=1}^{n} X_i$, the number of successes in $n$ trials, can we gain any further information about the parameter $p$ by considering other functions of the data $X_1, X_2, \ldots, X_n$? Or, equivalently, is $Y = \sum_{i=1}^{n} X_i$ sufficient for $p$?
Solution:
The definition of sufficiency tells us that if the conditional distribution of $X_1, X_2, \ldots, X_n$, given the statistic $Y$, does not depend on $p$, then $Y$ is said to be a sufficient statistic for the unknown parameter $p$. The conditional distribution of $X_1, X_2, \ldots, X_n$, given $Y$, is given by:
$$P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) = \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(Y = y)} \tag{2.1}$$
for any $(x_1, \ldots, x_n)$ with $\sum_{i=1}^{n} x_i = y$; if $\sum_{i=1}^{n} x_i \neq y$, the event in the numerator is impossible and the conditional probability is $0$. In the possible case, we have, by independence:
$$P(X_1 = x_1, \ldots, X_n = x_n) = p^{x_1}(1-p)^{1-x_1} \cdots p^{x_n}(1-p)^{1-x_n} = p^{y}(1-p)^{n-y}$$
Now, the denominator in (2.1) is the binomial probability of getting exactly $y$ successes in $n$ trials with a probability of success $p$. That is, the denominator is:
$$P(Y = y) = \binom{n}{y} p^{y} (1-p)^{n-y}$$
Putting the numerator and denominator together, we get:
$$P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) = \frac{p^{y}(1-p)^{n-y}}{\binom{n}{y} p^{y}(1-p)^{n-y}} = \frac{1}{\binom{n}{y}}$$
Conclusion 1:
We have just shown that the conditional distribution of $X_1, X_2, \ldots, X_n$ given $Y$ does not depend on $p$. Therefore, $Y = \sum_{i=1}^{n} X_i$ is indeed sufficient for $p$. That is, once the value of $Y$ is known, no other function of $X_1, X_2, \ldots, X_n$ will provide any additional information about the possible value of $p$.
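As a sanity check (not part of the original lesson), a short Python simulation can illustrate this conclusion: whatever the value of $p$, the empirical conditional distribution of the individual outcomes given $Y = y$ is approximately uniform over the $\binom{n}{y}$ possible arrangements. The function name `conditional_dist` and the sample sizes are my own illustrative choices.

```python
import random

def conditional_dist(p, n=4, y=2, trials=200_000, seed=0):
    """Empirically estimate P(X_1,...,X_n = x | sum(X) = y) under Bernoulli(p)."""
    rng = random.Random(seed)
    counts = {}
    total = 0
    for _ in range(trials):
        x = tuple(1 if rng.random() < p else 0 for _ in range(n))
        if sum(x) == y:
            counts[x] = counts.get(x, 0) + 1
            total += 1
    return {x: c / total for x, c in counts.items()}

# Whatever p is, each of the C(4,2) = 6 arrangements with two successes
# occurs with conditional probability close to 1/6, i.e. 1 / C(n, y).
for p in (0.2, 0.5, 0.8):
    dist = conditional_dist(p)
    assert len(dist) == 6
    assert all(abs(prob - 1/6) < 0.02 for prob in dist.values())
```

The conditional probabilities cluster around $1/\binom{4}{2} = 1/6$ for every choice of $p$, which is exactly what sufficiency of $Y$ predicts.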
3. Factorization Theorem
3.1 We need an easier method to identify sufficiency:
While the definition of sufficiency may make sense intuitively, it is not always easy to find the conditional distribution of $X_1, X_2, \ldots, X_n$ given $Y$. Not to mention that we'd have to find the conditional distribution of $X_1, X_2, \ldots, X_n$ given $Y$ for every statistic $Y = u(X_1, \ldots, X_n)$ that we'd want to consider as a possible sufficient statistic! Therefore, using the formal definition of sufficiency as a way of identifying a sufficient statistic for a parameter $\theta$ can often be a daunting road to follow. Thankfully, a theorem often referred to as the Factorization Theorem provides an easier alternative!
3.2 Factorization Theorem:
Let $X_1, X_2, \ldots, X_n$ denote random variables with joint probability density function or joint probability mass function $f(x_1, x_2, \ldots, x_n; \theta)$, which depends on the parameter $\theta$. Then, the statistic $Y = u(X_1, X_2, \ldots, X_n)$ is sufficient for $\theta$ if and only if the p.d.f. (or p.m.f.) can be factored into two components, that is:
$$f(x_1, x_2, \ldots, x_n; \theta) = \phi\left[u(x_1, x_2, \ldots, x_n); \theta\right] \cdot h(x_1, x_2, \ldots, x_n)$$
where:
- $\phi$ is a function that depends on the data $x_1, x_2, \ldots, x_n$ only through the function $u(x_1, x_2, \ldots, x_n)$ (and may depend on the parameter $\theta$), and
- the function $h(x_1, x_2, \ldots, x_n)$ does not depend on the parameter $\theta$.
3.3 Example 2 - Poisson Distribution:
Recall that the mathematical constant $e$ is the unique real number such that the value of the derivative (slope of the tangent line) of the function $f(x) = e^{x}$ at the point $x = 0$ is equal to $1$. It turns out that $e$ is irrational, but to five decimal places, it equals $e \approx 2.71828$. Also, note that there are (theoretically) an infinite number of possible Poisson distributions. Any specific Poisson distribution depends on the parameter $\lambda$:
$$f(x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$
Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution with parameter $\lambda > 0$. Find a sufficient statistic for the parameter $\lambda$.
Solution:
Because $X_1, X_2, \ldots, X_n$ is a random sample, the joint probability mass function of $X_1, X_2, \ldots, X_n$ is, by independence:
$$f(x_1, \ldots, x_n; \lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \left(e^{-n\lambda}\lambda^{\sum_{i=1}^{n} x_i}\right) \cdot \frac{1}{x_1! x_2! \cdots x_n!}$$
Hey, look at that! We just factored the joint p.m.f. into two functions, one ($\phi$) being only a function of the statistic $Y = \sum_{i=1}^{n} X_i$ and the other ($h$) not depending on the parameter $\lambda$:
$$\phi\left[u(x_1, \ldots, x_n); \lambda\right] = e^{-n\lambda}\lambda^{\sum_{i=1}^{n} x_i}, \qquad h(x_1, \ldots, x_n) = \frac{1}{x_1! x_2! \cdots x_n!}$$
So the Factorization Theorem tells us that $Y = \sum_{i=1}^{n} X_i$ is a sufficient statistic for $\lambda$. We can also write the joint p.m.f. as:
$$f(x_1, \ldots, x_n; \lambda) = \left(e^{-n\lambda}\lambda^{n\bar{x}}\right) \cdot \frac{1}{x_1! x_2! \cdots x_n!}$$
so the same factorization shows that the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is also a sufficient statistic. If you think about it, it makes sense that $\bar{X}$ and $\sum_{i=1}^{n} X_i$ are both sufficient statistics, because if we know $\bar{X}$, we can easily find $\sum_{i=1}^{n} X_i$, and vice versa.
Conclusion 2:
There can be more than one sufficient statistic for a parameter $\theta$. In general, if $Y$ is a sufficient statistic for a parameter $\theta$, then every one-to-one function of $Y$ not involving $\theta$ is also a sufficient statistic for $\theta$.
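A quick numerical illustration (my own, not from the lesson): if two different samples share the value of the sufficient statistic $\sum_i x_i$, their Poisson likelihood ratio is a constant that does not involve $\lambda$, because the $\lambda$-dependent factor $e^{-n\lambda}\lambda^{\sum x_i}$ cancels, leaving only $h(\mathbf{a})/h(\mathbf{b})$.

```python
import math

def poisson_joint_pmf(xs, lam):
    """Joint p.m.f. of an i.i.d. Poisson(lam) sample: prod_i e^(-lam) lam^x_i / x_i!."""
    return math.prod(math.exp(-lam) * lam ** x / math.factorial(x) for x in xs)

# Two different samples that share the sufficient statistic sum(x) = 6.
a, b = [1, 2, 3], [0, 2, 4]

# The lam-dependent factor e^(-3 lam) lam^6 cancels in the ratio, leaving
# h(a)/h(b) = (0! * 2! * 4!) / (1! * 2! * 3!) = 48 / 12 = 4 for every lam.
ratios = [poisson_joint_pmf(a, lam) / poisson_joint_pmf(b, lam)
          for lam in (0.5, 1.0, 3.0, 7.0)]
```

Since the ratio is free of $\lambda$, observing which of the two samples occurred tells us nothing more about $\lambda$ than the shared value $\sum_i x_i = 6$ already does.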
3.4 Example 3 - Gaussian Distribution $N(\mu, \sigma^2)$:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, where $\sigma^2$ is regarded as known. Find a sufficient statistic for the parameter $\mu$.
Solution:
For i.i.d. data $x_1, x_2, \ldots, x_n$, the joint probability density function of $X_1, X_2, \ldots, X_n$ is
$$f(x_1, \ldots, x_n; \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$
A trick to making the factoring of the joint p.d.f. an easier task is to add $0 = -\bar{x} + \bar{x}$ to the quantity in parentheses in the summation. That is:
$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
Now, squaring the quantity in parentheses, we get:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 + 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \bar{x}) + \sum_{i=1}^{n}(\bar{x} - \mu)^2$$
But, the middle term is $0$, because $\sum_{i=1}^{n}(x_i - \bar{x}) = n\bar{x} - n\bar{x} = 0$, and the last term, because it doesn't depend on the index $i$, can be added up $n$ times:
$$\sum_{i=1}^{n}(x_i - \mu)^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$
So, simplifying, we get:
$$f(x_1, \ldots, x_n; \mu) = \exp\left(-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right) \cdot \left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n}\exp\left(-\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{2\sigma^2}\right)\right]$$
In summary, we have factored the joint p.d.f. into two functions, one ($\phi$) being only a function of the statistic $\bar{X}$ and the other ($h$) not depending on the parameter $\mu$.
Conclusion 3:
- Therefore, the Factorization Theorem tells us that $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\mu$.
- Now, $\sum_{i=1}^{n} X_i$ is also sufficient for $\mu$, because if we are given the value of $\bar{X}$, we can easily get the value of $\sum_{i=1}^{n} X_i$ through the one-to-one function $y = n\bar{x}$, that is: $\sum_{i=1}^{n} X_i = n\bar{X}$.
- However, $\bar{X}^2$ is not a sufficient statistic for $\mu$, because it is not a one-to-one function of $\bar{X}$, with both $\bar{x}$ and $-\bar{x}$ mapped to $\bar{x}^2$.
3.5 Example 4 - Exponential Distribution:
Let $X_1, X_2, \ldots, X_n$ be a random sample from an exponential distribution with parameter $\theta$, i.e., with p.d.f. $f(x; \theta) = \frac{1}{\theta}e^{-x/\theta}$ for $x > 0$ (the mean parametrization). Find a sufficient statistic for the parameter $\theta$.
Solution:
The joint probability density function of $X_1, X_2, \ldots, X_n$ is, by independence:
$$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} \frac{1}{\theta}e^{-x_i/\theta} = \left[\frac{1}{\theta^{n}}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} x_i\right)\right] \cdot 1$$
Conclusion 4:
Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$. And, since $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a one-to-one function of $\sum_{i=1}^{n} X_i$, it implies that $\bar{X}$ is also a sufficient statistic for $\theta$.
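As a small check (a sketch of my own, assuming the mean parametrization $f(x;\theta) = \frac{1}{\theta}e^{-x/\theta}$ used above), the product of the individual densities equals the factored form $\theta^{-n}\exp(-\sum_i x_i/\theta) \cdot 1$ numerically, so the data enter the likelihood only through $\sum_i x_i$.

```python
import math
import random

# Check the factorization of the joint exponential density numerically:
#   prod_i (1/theta) exp(-x_i/theta) == theta^(-n) * exp(-sum(x_i)/theta) * 1
rng = random.Random(1)
xs = [rng.expovariate(1 / 2.5) for _ in range(10)]  # i.i.d. sample, true mean 2.5

for theta in (0.5, 2.5, 8.0):
    product = math.prod((1 / theta) * math.exp(-x / theta) for x in xs)
    factored = theta ** (-len(xs)) * math.exp(-sum(xs) / theta)
    assert math.isclose(product, factored, rel_tol=1e-9)
```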
4. Exponential Form
4.1 Exponential Form
You might not have noticed that in all of the examples we have considered so far in this lesson, every p.d.f. or p.m.f. could be written in what is often called exponential form, that is:
$$f(x; \theta) = \exp\left[K(x)p(\theta) + S(x) + q(\theta)\right]$$
with the support of $x$ not depending on the parameter $\theta$.
1) Exponential Form of Bernoulli Distribution:
For example, the Bernoulli p.m.f. $f(x; p) = p^{x}(1-p)^{1-x}$, $x \in \{0, 1\}$, is written in exponential form as:
$$f(x; p) = \exp\left[x\ln\frac{p}{1-p} + \ln(1-p)\right]$$
with
- (1) $K(x) = x$ and $S(x) = 0$ being functions only of $x$,
- (2) $p(\theta) = \ln\frac{p}{1-p}$ and $q(\theta) = \ln(1-p)$ being functions only of the parameter $p$, and
- (3) the support $x \in \{0, 1\}$ not depending on the parameter $p$.
2) Exponential Form of Poisson Distribution:
The Poisson p.m.f. $f(x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$ is written in exponential form as:
$$f(x; \lambda) = \exp\left[x\ln\lambda - \ln(x!) - \lambda\right]$$
with
- (1) $K(x) = x$ and $S(x) = -\ln(x!)$ being functions only of $x$,
- (2) $p(\lambda) = \ln\lambda$ and $q(\lambda) = -\lambda$ being functions only of the parameter $\lambda$, and
- (3) the support $x \in \{0, 1, 2, \ldots\}$ not depending on the parameter $\lambda$.
3) Exponential Form of Gaussian Distribution $N(\mu, \sigma^2)$ (with $\sigma^2$ known):
$$f(x; \mu) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \exp\left[x \cdot \frac{\mu}{\sigma^2} - \frac{x^2}{2\sigma^2} - \ln\left(\sigma\sqrt{2\pi}\right) - \frac{\mu^2}{2\sigma^2}\right]$$
with
- (1) $K(x) = x$ and $S(x) = -\frac{x^2}{2\sigma^2} - \ln(\sigma\sqrt{2\pi})$ being functions only of $x$,
- (2) $p(\mu) = \frac{\mu}{\sigma^2}$ and $q(\mu) = -\frac{\mu^2}{2\sigma^2}$ being functions only of the parameter $\mu$, and
- (3) the support $x \in \mathbb{R}$ not depending on the parameter $\mu$.
4) Exponential Form of Exponential Distribution:
$$f(x; \theta) = \frac{1}{\theta}e^{-x/\theta} = \exp\left[-\frac{x}{\theta} - \ln\theta\right]$$
with
- (1) $K(x) = x$ and $S(x) = 0$ being functions only of $x$,
- (2) $p(\theta) = -\frac{1}{\theta}$ and $q(\theta) = -\ln\theta$ being functions only of the parameter $\theta$, and
- (3) the support $x > 0$ not depending on the parameter $\theta$.
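These exponential-form rewrites are simple enough to check numerically. The sketch below (my own, assuming the standard forms $p^x(1-p)^{1-x} = \exp[x\ln\frac{p}{1-p} + \ln(1-p)]$ and $\frac{e^{-\lambda}\lambda^x}{x!} = \exp[x\ln\lambda - \ln(x!) - \lambda]$) confirms the Bernoulli and Poisson cases at several parameter values.

```python
import math

# Bernoulli: p^x (1-p)^(1-x) == exp[x ln(p/(1-p)) + ln(1-p)]
for p in (0.2, 0.7):
    for x in (0, 1):
        direct = p ** x * (1 - p) ** (1 - x)
        exp_form = math.exp(x * math.log(p / (1 - p)) + math.log(1 - p))
        assert math.isclose(direct, exp_form, rel_tol=1e-12)

# Poisson: e^(-lam) lam^x / x! == exp[x ln(lam) - ln(x!) - lam]
for lam in (0.5, 4.0):
    for x in range(6):
        direct = math.exp(-lam) * lam ** x / math.factorial(x)
        exp_form = math.exp(x * math.log(lam) - math.log(math.factorial(x)) - lam)
        assert math.isclose(direct, exp_form, rel_tol=1e-12)
```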
4.2 Exponential Criterion
It turns out that writing p.d.f.s and p.m.f.s in exponential form provides us yet a third way of identifying sufficient statistics for our parameters. The following theorem tells us how.
Theorem:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with a p.d.f. or p.m.f. of the exponential form:
$$f(x; \theta) = \exp\left[K(x)p(\theta) + S(x) + q(\theta)\right]$$
with
- (1) $K(x)$ and $S(x)$ being functions only of $x$,
- (2) $p(\theta)$ and $q(\theta)$ being functions only of the parameter $\theta$, and
- (3) the support being free of the parameter $\theta$.
Then, the statistic:
$$Y = \sum_{i=1}^{n} K(X_i)$$
is sufficient for $\theta$.
Proof:
By independence, the joint p.d.f. (or p.m.f.) of $X_1, X_2, \ldots, X_n$ is
$$\prod_{i=1}^{n} f(x_i; \theta) = \exp\left[p(\theta)\sum_{i=1}^{n} K(x_i) + nq(\theta)\right] \cdot \exp\left[\sum_{i=1}^{n} S(x_i)\right]$$
which is factored into two functions:
- one ($\phi$) being only a function of the statistic $\sum_{i=1}^{n} K(x_i)$ (and the parameter $\theta$), and
- the other ($h$) not depending on the parameter $\theta$.
Therefore, the Factorization Theorem tells us that $Y = \sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for $\theta$.
4.3 Example 5 - Geometric Distribution:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a geometric distribution with parameter $p$. Find a sufficient statistic for the parameter $p$.
Solution:
The probability mass function of a geometric random variable is:
$$f(x; p) = (1-p)^{x-1}p, \qquad x = 1, 2, 3, \ldots$$
which can be written in exponential form as:
$$f(x; p) = \exp\left[x\ln(1-p) + \ln\frac{p}{1-p}\right]$$
with $K(x) = x$ and a support that does not depend on $p$.
Conclusion 5:
Therefore, the Exponential Criterion tells us that $Y = \sum_{i=1}^{n} X_i$ is sufficient for $p$. Easy as pie!
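A quick check of the exponential-form rewrite used here (my own sketch, assuming the p.m.f. $f(x;p) = (1-p)^{x-1}p$ on $x = 1, 2, \ldots$):

```python
import math

# Check the exponential-form rewrite of the geometric p.m.f.:
#   (1-p)^(x-1) * p == exp[x ln(1-p) + ln(p/(1-p))]
for p in (0.1, 0.5, 0.9):
    for x in range(1, 8):
        direct = (1 - p) ** (x - 1) * p
        exp_form = math.exp(x * math.log(1 - p) + math.log(p / (1 - p)))
        assert math.isclose(direct, exp_form, rel_tol=1e-12)
```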
5. Two or More Parameters
What happens if a probability distribution has two parameters, $\theta_1$ and $\theta_2$, say, for which we want to find sufficient statistics, $Y_1$ and $Y_2$? Fortunately, the definitions of sufficiency can easily be extended to accommodate two (or more) parameters. Let's start by extending the Factorization Theorem.
5.1 Factorization Theorem
Let $X_1, X_2, \ldots, X_n$ denote random variables with joint p.d.f. or p.m.f. $f(x_1, \ldots, x_n; \theta_1, \theta_2)$, which depends on the parameters $\theta_1$ and $\theta_2$. Then, the statistics $Y_1 = u_1(X_1, \ldots, X_n)$ and $Y_2 = u_2(X_1, \ldots, X_n)$ are jointly sufficient for $\theta_1$ and $\theta_2$ if and only if:
$$f(x_1, \ldots, x_n; \theta_1, \theta_2) = \phi\left[u_1(x_1, \ldots, x_n), u_2(x_1, \ldots, x_n); \theta_1, \theta_2\right] \cdot h(x_1, \ldots, x_n)$$
where $\phi$ depends on the data only through the functions $u_1$ and $u_2$, and $h$ does not depend on either parameter.
5.2 Example 6 - Gaussian Distribution $N(\theta_1, \theta_2)$:
Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$. That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$. Use the Factorization Theorem to find joint sufficient statistics for $\theta_1$ and $\theta_2$.
Solution:
The joint probability density function of $X_1, X_2, \ldots, X_n$ is, by independence:
$$f(x_1, \ldots, x_n; \theta_1, \theta_2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta_2}}\exp\left(-\frac{(x_i - \theta_1)^2}{2\theta_2}\right) = (2\pi\theta_2)^{-n/2}\exp\left(-\frac{1}{2\theta_2}\sum_{i=1}^{n}(x_i - \theta_1)^2\right)$$
Expanding the square in the exponent, we get:
$$f(x_1, \ldots, x_n; \theta_1, \theta_2) = (2\pi\theta_2)^{-n/2}\exp\left(-\frac{1}{2\theta_2}\left[\sum_{i=1}^{n} x_i^2 - 2\theta_1\sum_{i=1}^{n} x_i + n\theta_1^2\right]\right)$$
Simplifying yet more, we get:
$$f(x_1, \ldots, x_n; \theta_1, \theta_2) = \phi\left[\sum_{i=1}^{n} x_i, \sum_{i=1}^{n} x_i^2; \theta_1, \theta_2\right] \cdot 1$$
Look at that! We have factored the joint p.d.f. into two functions, one ($\phi$) being only a function of the statistics $Y_1 = \sum_{i=1}^{n} X_i$ and $Y_2 = \sum_{i=1}^{n} X_i^2$, and the other ($h = 1$) not depending on the parameters $\theta_1$ and $\theta_2$.
Conclusion 6.1:
- Therefore, the Factorization Theorem tells us that $Y_1 = \sum_{i=1}^{n} X_i$ and $Y_2 = \sum_{i=1}^{n} X_i^2$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.
- And, the one-to-one functions of $Y_1$ and $Y_2$, namely:
$$\bar{X} = \frac{Y_1}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \text{and} \qquad S^2 = \frac{1}{n-1}\left(Y_2 - \frac{Y_1^2}{n}\right) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
are also joint sufficient statistics for $\theta_1$ and $\theta_2$.
- We have just shown that the intuitive estimators of $\mu$ and $\sigma^2$ are also sufficient estimators. That is, the data contain no more information than the estimators $\bar{X}$ and $S^2$ do about the parameters $\mu$ and $\sigma^2$. That seems like a good thing!
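To see joint sufficiency concretely, here is a small illustration of my own: two different samples that agree on both $\sum_i x_i$ and $\sum_i x_i^2$ have identical normal likelihoods at every $(\mu, \sigma^2)$, so beyond those two statistics the data carry no extra information about the parameters.

```python
import math

def normal_loglik(xs, mu, var):
    """Log-likelihood of an i.i.d. N(mu, var) sample."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * var)
            - sum((x - mu) ** 2 for x in xs) / (2 * var))

# Two different samples that share both joint sufficient statistics:
# sum = 6 and sum of squares = 18 for each.
a, b = [0.0, 3.0, 3.0], [1.0, 1.0, 4.0]
assert sum(a) == sum(b) and sum(v * v for v in a) == sum(v * v for v in b)

# Their likelihoods agree at every (mu, var): the data enter the normal
# likelihood only via (sum x, sum x^2).
for mu in (-1.0, 0.0, 2.0):
    for var in (0.5, 1.0, 4.0):
        assert abs(normal_loglik(a, mu, var) - normal_loglik(b, mu, var)) < 1e-12
```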
5.3 Exponential Criterion
We have just extended the Factorization Theorem. Now, the Exponential Criterion can also be extended to accommodate two (or more) parameters. It is stated here without proof.
Exponential Criterion:
Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with a p.d.f. or p.m.f. of the form:
$$f(x; \theta_1, \theta_2) = \exp\left[K_1(x)p_1(\theta_1, \theta_2) + K_2(x)p_2(\theta_1, \theta_2) + S(x) + q(\theta_1, \theta_2)\right]$$
with a support that does not depend on the parameters. Then, the statistics $Y_1 = \sum_{i=1}^{n} K_1(X_i)$ and $Y_2 = \sum_{i=1}^{n} K_2(X_i)$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.
5.4 Example 6 - Gaussian Distribution $N(\theta_1, \theta_2)$ (continued):
Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta_1, \theta_2)$. That is, $\theta_1$ denotes the mean $\mu$ and $\theta_2$ denotes the variance $\sigma^2$. Use the Exponential Criterion to find joint sufficient statistics for $\theta_1$ and $\theta_2$.
Solution:
The probability density function of a normal random variable with mean $\theta_1$ and variance $\theta_2$ can be written in exponential form as:
$$f(x; \theta_1, \theta_2) = \exp\left[x^2\left(-\frac{1}{2\theta_2}\right) + x\left(\frac{\theta_1}{\theta_2}\right) - \frac{\theta_1^2}{2\theta_2} - \frac{1}{2}\ln(2\pi\theta_2)\right]$$
with $K_1(x) = x^2$, $K_2(x) = x$, and $S(x) = 0$.
Conclusion 6.2:
Therefore, the Exponential Criterion tells us that the statistics $Y_1 = \sum_{i=1}^{n} X_i^2$ and $Y_2 = \sum_{i=1}^{n} X_i$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.
6. Reference
[1]: Lesson 53: Sufficient Statistics (https://onlinecourses.science.psu.edu/stat414/print/book/export/html/244)