PRML-第2章 概率分布 总结

一些记号

\(D=\{x_1,...,x_N\}\)
观测数据集


2.1 二元变量-伯努利分布

伯努利概率分布为:(x只能取0或1,取1的概率是\(\mu,p(x = 1|\mu) = \mu\))

\[Bern(x|\mu) = \mu^x(1-\mu)^{1 - x} \tag{2.2} \]


均值
\(E[x]=\mu \tag{2.3}\)


方差
\(var[x]=\mu(1-\mu) \tag{2.4}\)


似然函数

\[p(D|\mu) = \prod\limits_{n=1}^Np(x_n|\mu) = \prod\limits_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n} \tag{2.5} \]


对数似然函数

\[\ln p(D|\mu) = \sum\limits_{n=1}^N\ln p(x_n|\mu) = \sum\limits_{n=1}^N{x_n\ln \mu + (1 - x_n)\ln(1-\mu)} \tag{2.6} \]


最大似然解

\[\mu_{ML} = \frac{1}{N}\sum\limits_{n=1}^N x_n \tag{2.7} \]

\[\mu_{ML} = \frac{m}{N} \tag{2.8} \]

m为x=1的观测次数


2.1 二项分布

换一个角度,在给定数据集规模N的条件下,x=1的观测出现数量m的概率分布:二项分布。
(x只能取0或1,取1的概率是\(\mu,p(x = 1|\mu) = \mu\))

\[Bin(m|N, \mu) = \binom{N}{m}\mu^m(1 - \mu)^{N - m} \tag{2.9} \]

其中 $$ \binom{N}{m} \equiv \frac{N!}{(N - m)!m!} \tag{2.10} $$


均值

\[\mathbb{E}[m] \equiv \sum_{m=0}^{N} m \operatorname{Bin}(m \mid N, \mu)=N \mu \tag{2.11} \]


方差

\[var[m] \equiv \sum_{m=0}^{N} (m-E[m])^2 \operatorname{Bin}(m \mid N, \mu)=N \mu(1-\mu) \tag{2.12} \]

2.1.1 Beta分布

注意,这里的\(\mu\)是Beta分布的随机变量,不是均值

\[Beta(\mu|a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} \tag{2.13} \]

其中

\[\Gamma(x) \equiv \int_0^\infty u^{x-1}e^{-u}du \]


Beta函数性质

\[\Gamma(x+1)=x! \]


均值:
\(E[\mu]= \frac{a}{a + b} \tag{2.15}\)


方差
\(\ var[\mu] =\frac{ab}{(a+b)^2(a+b+1)} \tag{2.16}\)
这里a和b是超参数。


\(\Gamma\)分布是二项分布关于参数\(\mu\)的共轭分布族,相关概念见https://www.cnblogs.com/boyknight/p/15898454.html
公式2.13 乘以公式2.9(非\mu的常数项拿掉)得到下面的正比例关系,其中\(l=N-m\)

\[p(\mu|m,l,a,b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1} \tag{2.17} \]

仿照公式2.13 做归一化,得到归一化系数,则二项分布关于参数\(\mu\)的后验分布如下(前提是先验是\(\Gamma\)分布)

\[p(\mu|m,l,a,b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1} \tag{2.18} \]

贝叶斯顺序推断 python实现

书上 图2.3

点击查看代码

import numpy as np
import matplotlib.pyplot as plt

from prml.rv import (
    Bernoulli,
    Beta,
    Categorical,
    Dirichlet,
    Gamma,
    Gaussian,
    MultivariateGaussian,
    MultivariateGaussianMixture,
    StudentsT,
    Uniform
)

np.random.seed(1234)

x = np.linspace(0, 1, 100)
beta = Beta(2, 2)
plt.subplot(2, 1, 1)
plt.xlim(0, 1)
plt.ylim(0, 2)
plt.plot(x, beta.pdf(x))
plt.annotate("prior", (0.1, 1.5))

model = Bernoulli(mu=beta)
model.fit(np.array([1]))
plt.subplot(2, 1, 2)
plt.xlim(0, 1)
plt.ylim(0, 2)
plt.plot(x, model.mu.pdf(x))
plt.annotate("posterior", (0.1, 1.5))

plt.show()


点击查看代码

import numpy as np


class RandomVariable(object):
    """
    base class for random variables
    """

    def __init__(self):
        self.parameter = {}

    def __repr__(self):
        string = f"{self.__class__.__name__}(\n"
        for key, value in self.parameter.items():
            string += (" " * 4)
            if isinstance(value, RandomVariable):
                string += f"{key}={value:8}"
            else:
                string += f"{key}={value}"
            string += "\n"
        string += ")"
        return string

    def __format__(self, indent="4"):
        indent = int(indent)
        string = f"{self.__class__.__name__}(\n"
        for key, value in self.parameter.items():
            string += (" " * indent)
            if isinstance(value, RandomVariable):
                string += f"{key}=" + value.__format__(str(indent + 4))
            else:
                string += f"{key}={value}"
            string += "\n"
        string += (" " * (indent - 4)) + ")"
        return string

    def fit(self, X, **kwargs):
        """
        estimate parameter(s) of the distribution

        Parameters
        ----------
        X : np.ndarray
            observed data
        """
        self._check_input(X)
        if hasattr(self, "_fit"):
            self._fit(X, **kwargs)
        else:
            raise NotImplementedError

    # def ml(self, X, **kwargs):
    #     """
    #     maximum likelihood estimation of the parameter(s)
    #     of the distribution given data

    #     Parameters
    #     ----------
    #     X : (sample_size, ndim) np.ndarray
    #         observed data
    #     """
    #     self._check_input(X)
    #     if hasattr(self, "_ml"):
    #         self._ml(X, **kwargs)
    #     else:
    #         raise NotImplementedError

    # def map(self, X, **kwargs):
    #     """
    #     maximum a posteriori estimation of the parameter(s)
    #     of the distribution given data

    #     Parameters
    #     ----------
    #     X : (sample_size, ndim) np.ndarray
    #         observed data
    #     """
    #     self._check_input(X)
    #     if hasattr(self, "_map"):
    #         self._map(X, **kwargs)
    #     else:
    #         raise NotImplementedError

    # def bayes(self, X, **kwargs):
    #     """
    #     bayesian estimation of the parameter(s)
    #     of the distribution given data

    #     Parameters
    #     ----------
    #     X : (sample_size, ndim) np.ndarray
    #         observed data
    #     """
    #     self._check_input(X)
    #     if hasattr(self, "_bayes"):
    #         self._bayes(X, **kwargs)
    #     else:
    #         raise NotImplementedError

    def pdf(self, X):
        """
        compute probability density function
        p(X|parameter)

        Parameters
        ----------
        X : (sample_size, ndim) np.ndarray
            input of the function

        Returns
        -------
        p : (sample_size,) np.ndarray
            value of probability density function for each input
        """
        self._check_input(X)
        if hasattr(self, "_pdf"):
            return self._pdf(X)
        else:
            raise NotImplementedError

    def draw(self, sample_size=1):
        """
        draw samples from the distribution

        Parameters
        ----------
        sample_size : int
            sample size

        Returns
        -------
        sample : (sample_size, ndim) np.ndarray
            generated samples from the distribution
        """
        assert isinstance(sample_size, int)
        if hasattr(self, "_draw"):
            return self._draw(sample_size)
        else:
            raise NotImplementedError

    def _check_input(self, X):
        assert isinstance(X, np.ndarray)


点击查看代码

import numpy as np
from scipy.special import gamma
from prml.rv.rv import RandomVariable


np.seterr(all="ignore")


class Beta(RandomVariable):
    """
    Beta distribution
    p(mu|n_ones, n_zeros)
    = gamma(n_ones + n_zeros)
      * mu^(n_ones - 1) * (1 - mu)^(n_zeros - 1)
      / gamma(n_ones) / gamma(n_zeros)
    """

    def __init__(self, n_zeros, n_ones):
        """
        construct beta distribution

        Parameters
        ----------
        n_zeros : int, float, or np.ndarray
            pseudo count of zeros
        n_ones : int, float, or np.ndarray
            pseudo count of ones
        """
        super().__init__()
        if not isinstance(n_ones, (int, float, np.number, np.ndarray)):
            raise ValueError(
                "{} is not supported for n_ones"
                .format(type(n_ones))
            )
        if not isinstance(n_zeros, (int, float, np.number, np.ndarray)):
            raise ValueError(
                "{} is not supported for n_zeros"
                .format(type(n_zeros))
            )
        n_ones = np.asarray(n_ones)
        n_zeros = np.asarray(n_zeros)
        if n_ones.shape != n_zeros.shape:
            raise ValueError(
                "the sizes of the arrays don't match: {}, {}"
                .format(n_ones.shape, n_zeros.shape)
            )
        self.n_ones = n_ones
        self.n_zeros = n_zeros

    @property
    def ndim(self):
        return self.n_ones.ndim

    @property
    def size(self):
        return self.n_ones.size

    @property
    def shape(self):
        return self.n_ones.shape

    def _pdf(self, mu):
        return (
            gamma(self.n_ones + self.n_zeros)
            * np.power(mu, self.n_ones - 1)
            * np.power(1 - mu, self.n_zeros - 1)
            / gamma(self.n_ones)
            / gamma(self.n_zeros)
        )

    def _draw(self, sample_size=1):
        return np.random.beta(
            self.n_ones, self.n_zeros, size=(sample_size,) + self.shape
        )


点击查看代码

import numpy as np
from prml.rv.rv import RandomVariable
from prml.rv.beta import Beta


class Bernoulli(RandomVariable):
    """
    Bernoulli distribution
    p(x|mu) = mu^x (1 - mu)^(1 - x)
    """

    def __init__(self, mu=None):
        """
        construct Bernoulli distribution

        Parameters
        ----------
        mu : np.ndarray or Beta
            probability of value 1 for each elzement
        """
        super().__init__()
        self.mu = mu

    @property
    def mu(self):
        return self.parameter["mu"]

    @mu.setter
    def mu(self, mu):
        if isinstance(mu, (int, float, np.number)):
            if mu > 1 or mu < 0:
                raise ValueError(f"mu must be in [0, 1], not {mu}")
            self.parameter["mu"] = np.asarray(mu)
        elif isinstance(mu, np.ndarray):
            if (mu > 1).any() or (mu < 0).any():
                raise ValueError("mu must be in [0, 1]")
            self.parameter["mu"] = mu
        elif isinstance(mu, Beta):
            self.parameter["mu"] = mu
        else:
            if mu is not None:
                raise TypeError(f"{type(mu)} is not supported for mu")
            self.parameter["mu"] = None

    @property
    def ndim(self):
        if hasattr(self.mu, "ndim"):
            return self.mu.ndim
        else:
            return None

    @property
    def size(self):
        if hasattr(self.mu, "size"):
            return self.mu.size
        else:
            return None

    @property
    def shape(self):
        if hasattr(self.mu, "shape"):
            return self.mu.shape
        else:
            return None

    def _fit(self, X):
        if isinstance(self.mu, Beta):
            self._bayes(X)
        elif isinstance(self.mu, RandomVariable):
            raise NotImplementedError
        else:
            self._ml(X)

    def _ml(self, X):
        n_zeros = np.count_nonzero((X == 0).astype(np.int))
        n_ones = np.count_nonzero((X == 1).astype(np.int))
        assert X.size == n_zeros + n_ones, (
            "{X.size} is not equal to {n_zeros} plus {n_ones}"
        )
        self.mu = np.mean(X, axis=0)

    def _map(self, X):
        assert isinstance(self.mu, Beta)
        assert X.shape[1:] == self.mu.shape
        n_ones = (X == 1).sum(axis=0)
        n_zeros = (X == 0).sum(axis=0)
        assert X.size == n_zeros.sum() + n_ones.sum(), (
            f"{X.size} is not equal to {n_zeros} plus {n_ones}"
        )
        n_ones = n_ones + self.mu.n_ones
        n_zeros = n_zeros + self.mu.n_zeros
        self.prob = (n_ones - 1) / (n_ones + n_zeros - 2)

    def _bayes(self, X):
        assert isinstance(self.mu, Beta)
        assert X.shape[1:] == self.mu.shape
        n_ones = (X == 1).sum(axis=0)
        n_zeros = (X == 0).sum(axis=0)

        assert X.size == n_zeros.sum() + n_ones.sum(), (
            "input X must only has 0 or 1"
        )
        self.mu.n_zeros += n_zeros
        self.mu.n_ones += n_ones

    def _pdf(self, X):
        assert isinstance(self.mu, np.ndarray)  # 这里貌似漏了点
        return np.prod(
            self.mu ** X * (1 - self.mu) ** (1 - X)
        )

    def _draw(self, sample_size=1):
        if isinstance(self.mu, np.ndarray):
            return (
                    self.mu > np.random.uniform(size=(sample_size,) + self.shape)
            ).astype(np.int)
        elif isinstance(self.mu, Beta):
            return (
                    self.mu.n_ones / (self.mu.n_ones + self.mu.n_zeros)
                    > np.random.uniform(size=(sample_size,) + self.shape)
            ).astype(np.int)
        elif isinstance(self.mu, RandomVariable):
            return (
                    self.mu.draw(sample_size)
                    > np.random.uniform(size=(sample_size,) + self.shape)
            )


直接采用贝叶斯方法写出后验公式如下(跳过对自然参数的估计,直接采用加法法则和乘法法则),证明方法见公式1.68

\[p(x=1|D) = \int_0^1p(x=1|\mu)p(\mu|D)d\mu = \int_0^1\mu p(\mu|D)d\mu = \mathbb{E}[\mu|D] \tag{2.19} \]


根据公式 2.18,2.13,2.15可知,因为2.19就是后验分布,服从2.18公式,所以,其均值就是2.15所示

\[p(x=1|D) = \frac{m+a}{m+a+l+b} \tag{2.20} \]


\(m,l\)趋近于无穷大,2.20(贝叶斯后验),2.8(最大似然估计)两者会趋近于统一,不仅仅在beta分布下成立,在其他分布也有这样性质)

\[\mu_{ML} = \frac{m}{N} \tag{2.8} \]


\(a,b\to \infty,有var(\mu)\to 0\)


用频率派的角度解释,推导见 https://www.cnblogs.com/boyknight/p/16010734.html

\[\mathbb{E}_\theta[\theta] = \mathbb{E}_D[\mathbb{E}_\theta[\theta|D]] \tag{2.21} \]

其中
$ \mathbb{E}_\theta[\theta] = \int p(\theta)\theta d\theta \tag{2.22}$

\[\mathbb{E}_{\mathcal{D}}\left[\mathbb{E}_{\boldsymbol{\theta}}[\boldsymbol{\theta} \mid \mathcal{D}]\right] \equiv \int\left\{\int \boldsymbol{\theta} p(\boldsymbol{\theta} \mid \mathcal{D}) \mathrm{d} \boldsymbol{\theta}\right\} p(\mathcal{D}) \mathrm{d} \mathcal{D} \tag{2.23} \]

\(\theta\)的后验均值(在产生数据集的分布上的平均)等于\(\theta\)的先验均值。

同样的我们可以得到:

\[var_\theta[\theta] = \mathbb{E}_D[var_\theta[\theta|D]] + var_D[\mathbb{E}_\theta[\theta|D]] \tag{2.24} \]

公式(2.24)中左边是$ \theta \(的先验方差。右边的第一项是\) \theta \(的后验方差的均值。第二项是\) \theta \(的后验均值的方差。因为方差是一个正的量(第二项大于零),所以一般来说,\) \theta $的后验方差小于先验方差。

2.2 多项式变量

二元变量:2个状态。推广到k个互斥状态。用one-hot表示。比如K=6,\(x_3=1\)

\[x = (0, 0, 1, 0, 0, 0)^T \tag{2.25} \]

向量满足\(\sum_{k=1}^K x_k = 1\)。如果用参数\(\mu_k\)来标记\(x_k = 1\)的概率,那么我们就得到\(x\)的分布:

\[p(x|\mu) = \prod\limits_{k=1}^K\mu_k^{x_k} \tag{2.26} \]

其中\(\mu = (\mu_1,...,\mu_K)^T\),由于参数\(\mu_k\)表示概率,所以需要满足$\mu_k \geq 0 \(且\) \sum_k\mu_k = 1 $。公式(2.26)分布可以看作伯努利分布在多于两种输出时的泛化。很容易证明这个分布是标准化的。

$\sum\limits_xp(x|\mu) = \sum\limits_{k=1}^K\mu_k = 1 \tag{2.27} $

\(\mathbb{E}[x|\mu] = \sum\limits_xp(x|\mu)x = (\mu_1,...,\mu_M)^T = \mu \tag{2.28}\)

考虑一个有\(N\)个独立观测值$x_1,...,x_N \(的数据集\)D$。其对应的似然函数的形式为

\[p(D|\mu) = \prod\limits_{n=1}^N\prod\limits_{k=1}^K\mu_k^{x_{nk}} = \prod\limits_{k=1}^K\mu_k^{(\sum_nx_{nk})} = \prod\limits_{k=1}^K\mu_k^{m_k} \tag{2.29} \]

令:

$ m_k = \sum\limits_n x_{nk} \tag{2.30} $
它表示观测到$ x_k = 1 $的次数。这些别称为这个分布的充分统计量。

求最大似然解,我们需要在\(\mu_k\)的和等于1的约束下,关于\(\mu_k\)最大化\(\ln p(D|\mu)\)。这可以通过拉格朗日乘数法得到

\(\mu_k = -m_k / \lambda \tag{2.32}\)

把公式(2.32)代入限制条件$ \sum_k\mu_k = 1 \(,可得\) \lambda = -N $。所以我们的最大似然解:

\[\mu_k^{ML} = \frac{m_k}{N} \tag{2.33} \]

就是观测\(x_k = 1\)出现占总观测的比例。

2.2.1 狄利克雷分布

多项式分布关于参数\(\mu_k\)的共轭分布族是狄利克雷分布

\[Dir(\mu|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)...\Gamma(\alpha_K)}\prod\limits_{k=1}^K\mu_k^{\alpha_k - 1} \tag{2.38} \]

\[\alpha_0 = \sum\limits_{k=1}^K\alpha_k \tag{2.39} \]

用似然2.34乘先验2.38就得到后验,形式为:

\[p(\mu|D,\alpha) \propto p(D|\mu)p(\mu|\alpha) \propto \prod\limits_{k=1}^K\mu_k^{\alpha_k + m_k - 1} \tag{2.40} \]

因为形式与先验相同,对比写出归一化系数:

\[\begin{eqnarray} p(\mu|D, \alpha) &=& Dir(\mu|\alpha + m) \ &=& \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1+m_1)...+\Gamma(\alpha_K+m_K)}\prod\limits_{k=1}^K\mu_k^{\alpha_k + m_k - 1} \tag{2.41} \end{eqnarray} \]

其中\(m = (m_1,...,m_K)^T\)。与二项分布的beta先验一样,可以把狄利克雷分布参数\(\alpha_k\)当成观测到\(x_k = 1\)的数量。而二元变量就是多项式变量的一个特例。

2.3 高斯分布

单变量x,

\[\mathcal{N}\left(x | \mu, \sigma^{2}\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{1}{2}}} \exp \left\{-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right\} \tag{2.42} \]

D维向量x,

\[\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} exp\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x - \mu)\} \tag{2.43} \]

・一组随机变量之和,概率分布随着项(随机变量个数)增加趋近于高斯分布。(拉普拉斯中心极限定理)比如均匀分布还有之前的二项分布

中心极限定理python

点击查看代码

# 均匀分布
uniform = Uniform(low=0, high=1)
plt.figure(figsize=(10, 5))

plt.subplot(1, 3, 1)
plt.xlim(0, 1)
plt.ylim(0, 5)
plt.annotate("N=1", (0.1, 4.5))
plt.hist(uniform.draw(100000), bins=20, density=True)

plt.subplot(1, 3, 2)
plt.xlim(0, 1)
plt.ylim(0, 5)
plt.annotate("N=2", (0.1, 4.5))
plt.hist(0.5 * (uniform.draw(100000) + uniform.draw(100000)), bins=20, density=True)

plt.subplot(1, 3, 3)
plt.xlim(0, 1)
plt.ylim(0, 5)
sample = 0
for _ in range(10):
    sample = sample + uniform.draw(100000)
plt.annotate("N=10", (0.1, 4.5))
plt.hist(sample * 0.1, bins=20, density=True)

plt.show()

点击查看代码

import numpy as np
from prml.rv.rv import RandomVariable


class Uniform(RandomVariable):
    """
    Uniform distribution
    p(x|a, b)
    = 1 / ((b_0 - a_0) * (b_1 - a_1)) if a <= x <= b else 0
    """

    def __init__(self, low, high):
        """
        construct uniform distribution

        Parameters
        ----------
        low : int, float, or np.ndarray
            lower boundary
        high : int, float, or np.ndarray
            higher boundary
        """
        super().__init__()
        low = np.asarray(low)
        high = np.asarray(high)
        assert low.shape == high.shape
        assert (low <= high).all()
        self.low = low
        self.high = high
        self.value = 1 / np.prod(high - low)

    @property
    def low(self):
        return self.parameter["low"]

    @low.setter
    def low(self, low):
        self.parameter["low"] = low

    @property
    def high(self):
        return self.parameter["high"]

    @high.setter
    def high(self, high):
        self.parameter["high"] = high

    @property
    def ndim(self):
        return self.low.ndim

    @property
    def size(self):
        return self.low.size

    @property
    def shape(self):
        return self.low.shape

    @property
    def mean(self):
        return 0.5 * (self.low + self.high)

    def _pdf(self, X):
        higher = np.logical_and.reduce(X >= self.low, 1)
        lower = np.logical_and.reduce(X <= self.high, 1)
        return self.value * np.logical_and(higher, lower)

    def _draw(self, sample_size=1):
        u01 = np.random.uniform(size=(sample_size,) + self.shape)
        return u01 * (self.high - self.low) + self.low



马氏距离,这里\(\Delta\)就是\(x和\mu\)之间的马氏距离。

\[\Delta^2 = (X-\mu)^T\Sigma^{-1}(X-\mu) \tag{2.44} \]


这一段推导了高斯分布在图像上的展示,结论如下
高斯分布的图像是一个椭圆
\(\mathbf{1.\Delta决定过了这个椭圆的大小}\)

\(\mathbf{2.两个特征向量\mu_1,\mu_2决定了椭圆的两个轴的方向}\)

\(\mathbf{3.两个特征值\lambda_1,\lambda_2决定了椭圆两个轴上的距离}\)

\[|\Sigma|^{1/2} = \prod\limits_{j=1}^D \lambda_j^{1/2} \tag{2.55} \]


高斯分布的期望

\[\mathbb{E}[X] = \mu \tag{2.59} \]


高斯分布的协方差矩阵

\[var[X] = \Sigma \]


高斯分布的局限性
1.参数较多
一个协方差矩阵\(\Delta\)\(\frac{D(D+1)}{2}\)个独立参数
2.高斯分布本质上是单峰的(即只有一个最大值)

需要引入隐变量

条件高斯分布

多元高斯性质:两个变量的联合高斯分布,一个变量为条件的高斯分布也是高斯分布。边缘高斯分布也是高斯分布

\(p(x_a,x_b)=p(x_a|x_b)p(x_b)\)

精度矩阵(precision matrix)

\[\Lambda \equiv \Sigma^{-1} \tag{2.68} \]


\(x\)划分为两个不相交的子集\(x_a, x_b\)。令\(x_a\)\(x\)的前\(M\)个分量,令\(x_b\)为剩余的\(D − M\)个分量,得到:

\[\boldsymbol{x}=\left(\begin{array}{l} \boldsymbol{x}_{a} \\ \boldsymbol{x}_{b} \end{array}\right) \tag{2.65} \]

\[\boldsymbol{\mu}=\left(\begin{array}{l} \boldsymbol{\mu}_{a} \\ \boldsymbol{\mu}_{b} \end{array}\right) \tag{2.66} \]

\[\boldsymbol{\Sigma}=\left(\begin{array}{cc} \boldsymbol{\Sigma}_{a a} & \boldsymbol{\Sigma}_{a b} \\ \boldsymbol{\Sigma}_{b a} & \boldsymbol{\Sigma}_{b b} \end{array}\right) \tag{2.67} \]

\[\boldsymbol{\Lambda}=\left(\begin{array}{cc} \boldsymbol{\Lambda}_{a a} & \boldsymbol{\Lambda}_{a b} \\ \boldsymbol{\Lambda}_{b a} & \boldsymbol{\Lambda}_{b b} \end{array}\right) \tag{2.69} \]

条件分布\(p(x_a|x_b)\)由联合分布 \(p(x) = p(x_a, x_b)\)通过固定\(x_b\)为观测到的值,然后标准化所得到的表达式就可以得到\(x_a\)上的有效概率。

\[\begin{eqnarray} &-\frac{1}{2}&(x-\mu)^T\Sigma^{-1}(x-\mu) = \ &-\frac{1}{2}&(x_a - \mu_a)^T\Lambda_{aa}(x_a - \mu_a) -\frac{1}{2}(x_a - \mu_a)^T\Lambda_{ab}(x_b - \mu_b) \ &-\frac{1}{2}&(x_b - \mu_b)^T\Lambda_{ba}(x_a - \mu_a) -\frac{1}{2}(x_b - \mu_b)^T\Lambda_{bb}(x_b - \mu_b) \tag{2.70} \end{eqnarray} \]

因为一个通用的的高斯分布$ \mathcal{N}(x|\mu, \Sigma) $的指数项可以写成:

\[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) = -\frac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu + const \tag{2.71} \]

\(p(x_a|x_b)\)的协方差(精度矩阵的逆):

\[\Sigma_{a|b} = \Lambda_{aa}^{-1} \tag{2.73} \]

所以,\(p(x_a|x_b)\)的均值

\[\begin{aligned} \boldsymbol{\mu}_{a \mid b} &=\boldsymbol{\Sigma}_{a \mid b}\left\{\boldsymbol{\Lambda}_{a a} \boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a b}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right)\right\} \\ &=\boldsymbol{\mu}_{a}-\boldsymbol{\Lambda}_{a a}^{-1} \boldsymbol{\Lambda}_{a b}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \end{aligned} \tag{2.75} \]

其中

\[\begin{array}{c} \boldsymbol{\Lambda}_{a a}=\left(\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a}\right)^{-1} \\ \tag{2.79,2.80} \boldsymbol{\Lambda}_{a b}=-\left(\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a}\right)^{-1} \boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \end{array} \]

故均值和协方差矩阵还可以表示为

\[\begin{array}{c} \boldsymbol{\mu}_{a \mid b}=\boldsymbol{\mu}_{a}+\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1}\left(\boldsymbol{x}_{b}-\boldsymbol{\mu}_{b}\right) \\ \tag{2.81, 2.82} \boldsymbol{\Sigma}_{a \mid b}=\boldsymbol{\Sigma}_{a a}-\boldsymbol{\Sigma}_{a b} \boldsymbol{\Sigma}_{b b}^{-1} \boldsymbol{\Sigma}_{b a} \end{array} \]


总结
条件分布:

\[p(x_a|x_b) = \mathcal{N}(x|\mu_{a|b}, \Lambda_{aa}^{-1}) \tag{2.96} \]

\[\mu_{a|b} = \mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_a - \mu_b) \tag{2.97} \]

2.32 边缘高斯分布

\(p(x_a,x_b)=p(x_a|x_b)p(x_b)\)

\[p(x_a) = \int p(x_a, x_b)dx_b \tag{2.83} \]


\[\begin{aligned} \mathbb{E}\left[\boldsymbol{x}_{a}\right] &=\boldsymbol{\mu}_{a} \\ \operatorname{cov}\left[\boldsymbol{x}_{a}\right] &=\boldsymbol{\Sigma}_{a a} \end{aligned} \tag{2.92,2.93} \]

总结
边缘分布:

\[p(x_a) = \mathcal{N}(x_a|\mu_a, \Sigma_{aa}) \tag{2.98} \]

2.33 高斯变量的贝叶斯定理

本节主要求后面这5个分布的均值和方差,\(p(y|x)p(x)=p(x|y)p(y)=p(x,y)\)

假设给定高斯边缘分布\(p(x)\)和均值是关于\(x\)的线性函数且方差与$ x \(无关的高斯条件分布\)p(y|x)$

\[\begin{array}{r} p(\boldsymbol{x})=\mathcal{N}\left(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}\right) \\ p(\boldsymbol{y} \mid \boldsymbol{x})=\mathcal{N}\left(\boldsymbol{y} \mid \boldsymbol{A} \boldsymbol{x}+\boldsymbol{b}, \boldsymbol{L}^{-1}\right) \end{array} \tag{2.99,2.100} \]

其中$ \mu, A, b \(是控制均值的参数,\) \Lambda , L \(是精度矩阵。设\) x,y \(分别是\) M,D \(维的,那么矩阵\) A \(是\) D \times M $矩阵。


求x,y的联合分布。

\[ z = \left( \begin{array}{c} x \\ y \end{array} \right) \tag{2.101} \]


\(p(x,y)协方差矩阵\)

\[ cov[z] = R^{-1} = \left( \begin{array}{cc} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{array} \right) \tag{2.105} \]


\(p(x,y)均值\)

\[\mathbb{E}[z] = R^{-1} \left( \begin{array}{c} \Lambda\mu - A^TLb \\ Lb \end{array} \right) \tag{2.107} \]

\[\mathbb{E}[z] = \left( \begin{array}{c} \mu \\ A\mu + b \end{array} \right) \tag{2.108} \]


边缘分布y

\[\begin{eqnarray} \mathbb{E}[y] &=& A\mu + b \tag{2.109} \\ cov[y] &=& L^{-1} + A\Lambda^{-1}A^T \tag{2.110} \end{eqnarray} \]


条件分布\(p(\boldsymbol{x} \mid \boldsymbol{y})\)

\[\begin{aligned} \mathbb{E}[\boldsymbol{x} \mid \boldsymbol{y}]=&\left(\boldsymbol{\Lambda}+\boldsymbol{A}^{T} \boldsymbol{L} \boldsymbol{A}\right)^{-1}\left\{\boldsymbol{A}^{T} \boldsymbol{L}(\boldsymbol{y}-\boldsymbol{b})+\boldsymbol{\Lambda} \boldsymbol{\mu}\right\} \\ & \operatorname{cov}[\boldsymbol{x} \mid \boldsymbol{y}]=\left(\boldsymbol{\Lambda}+\boldsymbol{A}^{T} \boldsymbol{L} \boldsymbol{A}\right)^{-1} \end{aligned} \tag{2.111,2.112} \]


总结:

对于$ x \(的边缘高斯分布和\) y \(关于\) x $的条件高斯分布:

\[p(x) = \mathcal{N}(x|\mu,\Lambda^{-1}) \tag{2.113} \]

\[p(y|x) = \mathcal{N}(y|Ax + b,L^{-1}) \tag{2.114} \]

那么$ y \(的边缘分布和\) x \(关于\) y $的条件高斯分布为:

\[p(y) = \mathcal{N}(y|A\mu + b,L^{-1} + A\Lambda^{-1}A^T) \tag{2.115} \]

\[p(x|y) = \mathcal{N}(x|\Sigma\left\{A^TL(y-b) + \Lambda\mu \right\},\Sigma) \tag{2.116} \]

其中

\[\Sigma = (\Lambda + A^TLA)^{-1} \tag{2.117} \]


2.3.4 高斯分布的最大似然估计

给定一个数据集$ X = (x_1,...,x_N)^T $,i.i.d.最大似然来估计参数,对数似然为:

\[\ln p(X|\mu, \Sigma) = -\frac{ND}{2}\ln(2\pi)-\frac{N}{2}\ln |\Sigma| - \frac{1}{2}\sum\limits_{n=1}^{N}(x_n - \mu)^T\Sigma^{-1}(x_n - \mu) \tag{2.118} \]


\(\mu\)求导(C.19):

\[\frac{\partial}{\partial\mu}\ln p(X|\mu,\Sigma) = \sum\limits_{n=1}^N\Sigma^{-1}(x_n - \mu) \tag{2.120} \]


最大似然估计:

\[\mu_{ML} = \frac{1}{N}\sum\limits_{n=1}^Nx_n \tag{2.121} \]


\[\Sigma_{ML} = \frac{1}{N}\sum\limits_{n=1}^N(x_n - \mu_{ML})(x_n - \mu_{ML})^T \tag{2.122} \]


如果我们估计真实概率分布,可以得到有偏的结果。协方差期望小于真实值。

\[\begin{eqnarray} \mathbb{E}[\mu_{ML}] &=& \mu \tag{2.123} \\ \mathbb{E}[\Sigma_{ML}] &=& \frac{N - 1}{N}\Sigma \tag{2.124} \end{eqnarray} \]


所以需要补正。

\[\widetilde{\Sigma} = \frac{1}{N-1}\sum\limits_{n=1}^N(x_n - \mu_{ML})(x_n - \mu_{ML})^T \tag{2.125} \]


2.35 顺序估计

处理一个数据,整合进模型,处理完就丢掉。
最大似然估计:

\[\mu_{ML} = \frac{1}{N}\sum\limits_{n=1}^Nx_n \tag{2.121} \]

把第$ N \(个观察量的估计记作\)\mu_{ML}^{(N)}$,就可以写成:

\[\begin{eqnarray} \mu_{ML}^{(N)} &=& \frac{1}{N}\sum\limits_{n=1}^Nx_n \ &=& \frac{1}{N}x_N + \frac{1}{N}\sum\limits_{n=1}^{N-1}x_n \ &=& \frac{1}{N}x_N + \frac{N-1}{N}\mu_{ML}^{(N-1)} \ &=& \mu_{ML}^{(N-1)} + \frac{1}{N}(x_N - \mu_{ML}^{(N-1)}) \tag{2.126} \end{eqnarray} \]

只要增加一个新观测数据的修正量,就可以得到结果了。随着N增加,修正量的影响也在变小


推广到通用层面:Robbins-Monro算法。

考虑一对有联合分布\(p(z, \theta)\)控制的随机变量$\theta , z $$\theta\(上的\)z\(的条件期望由确定函数\)f(\theta)$给出:

\[f(\theta) \equiv \mathbb{E}[z|\theta] = \int zp(z|\theta) dz \tag{2.127} \]

称之为回归函数。我们假定的目标是找到\(f(\theta^*) = 0\)的根\(\theta^*\)


Robbins-Monro算法的顺序估计就是:

假设:
当$\theta > \theta^∗ \(时\)f(\theta) > 0\(,当\)\theta < \theta^∗ \(时\)f(\theta) < 0$

\[\theta^{(N)} = \theta^{(N-1)} + a_{N-1}z(\theta^{(N-1)}) \tag{2.129} \]

其中$ z(\theta^{(N)}) \(是当\) \theta \(取\) \theta^{(N)} \(时\) z \(的观测值。系数\) {a_N} $表示一个满足下列条件的正数序列:

\[\begin{eqnarray} \lim\limits_{N \to \infty}a_N &=& 0 \tag{2.130} \\ \sum\limits_{N=1}^\infty a_N &=& \infty \tag{2.131} \\ \sum\limits_{N=1}^\infty a_N^2 &<& \infty \tag{2.132} \end{eqnarray} \]

2.129以概率1收敛于根。2.130保证修正越来越小,2.131保证不会收敛到不根的值(阻止太快收束),2.132保证累计噪声是有限的(抑制noise发散),会收敛。


顺序估计应用于最大似然问题

最大似然解$ \theta_{ML} $是对数似然函数的驻点。因此满足:

\[\left.\frac{\partial}{\partial \theta}\left\{\frac{1}{N} \sum_{n=1}^{N}-\ln p\left(x_{n} \mid \theta\right)\right\}\right|_{\theta_{M L}}=0 \tag{2.133} \]

交换求导与求和顺序,且令极限$ N \to \infty $得到:

\[\lim\limits_{N \to \infty}\frac{1}{N}\sum\limits_{n=1}^N\frac{\partial}{\partial\theta}\ln p(x_n|\theta) = \mathbb{E}_x\left[\frac{\partial}{\partial\theta}\ln p(x_n|\theta)\right] \tag{2.134} \]

\(\mathbb{E}_x\left[\frac{\partial}{\partial\theta}\ln p(x_n|\theta)\right] \tag{2.134}\)就是公式2.127的形式,故\(\mathbb{E}_x=0\)即是最大似然解
最大似然的解就是回归函数的根

用Robbins-Monro算法:

\[\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\frac{\partial}{\partial\theta^{(N-1)}}\ln p(x_N|\theta^{(N-1)}) \tag{2.135} \]

参数\(a_N = \sigma^2 / N\)


最大似然问题可以转化为顺序更新问题:N个数据点的均值:N-1个数据点的均值和一个数据点\(x_N\)的贡献。

这里我们看后验分布可以写成:

\[p(\mu|D) \propto \left[p(\mu)\prod\limits_{n=1}^{N-1}p(x_n|\mu)\right]p(x_N|\mu) \tag{2.144} \]

2.144 是一个比较通用的顺序估计公式
方括号里的是观测N-1个数据点后的后验分布(忽略归一化系数) 可以被看作一个 先验分布


2.3.6 高斯分布的贝叶斯推断

一元高斯分布似然函数

\[p(\mathbf{x} \mid \mu)=\prod_{n=1}^{N} p\left(x_{n} \mid \mu\right)=\frac{1}{\left(2 \pi \sigma^{2}\right)^{\frac{N}{2}}} \exp \left\{-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\} \tag{2.137} \]


针对参数\(\mu\),高斯分布的共轭分布是高斯分布,令先验为

\[p(\mu) = \mathcal{N}(\mu|\mu_0, \sigma_0^2) \tag{2.138} \]

且后验分布由:

$ p(\mu|X) \propto p(X|\mu)p(\mu) \tag{2.139} $

给出。 通过简单的配出指数中二次项的操作,可以得到的后验分布为:

\[p(\mu|X) = \mathcal{N}(\mu|\mu_N, \sigma_N^2) \tag{2.140} \]

其中

\[\begin{eqnarray} \mu_N &=& \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{ML} \tag{2.141} \\ \frac{1}{\sigma_N^2} &=& \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \tag{2.142} \end{eqnarray} \]

其中\(\mu_{ML}\)是$ \mu $的最大似然解,由样本均值给出:

\[\mu_{ML} = \frac{1}{N}\sum\limits_{n=1}^Nx_n \tag{2.143} \]

\(\color{red}{高斯分布关于参数\mu的共轭分布是高斯分布}\)


观察有几个结论:

  • 2.141:后验的均值是先验\(\mu_0\)和似然\(\mu_{ML}\)的折中。(2.20推导的时候也有这一点)
    如果N=0,就变成先验;N趋近无穷大,变成似然。
  • 2.142:精度是可以叠加的。每个观测精度之和+先验精度就是后验精度。
    N趋近于无穷大,后验精度趋近于零,在最大似然附近变成尖峰。
  • 2.143:当数据点无穷大,最大似然可以精确地由贝叶斯公式恢复(\(\mu_N=\mu_{ML}\))(通过后验可以计算出来?)。
  • 对于有限的N,如果\(\sigma_0\to \infty\),先验的方差无穷大,2.141后验均值就是最大似然,2.142方差变为$ \sigma_N^2 = \sigma^2 / N $。

假设均值是已知的,推断方差。 同样选择先验是共轭的。定义精度\(\lambda \equiv 1 / \sigma^2\)进行计算是最方便的。关于\(\lambda\)的似然函数为:

\[p(\mathbf{x} \mid \lambda)=\prod_{n=1}^{N} \mathcal{N}\left(x_{n} \mid \mu, \lambda^{-1}\right) \propto \lambda^{\frac{N}{2}} \exp \left\{-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\} \tag{2.145} \]

因此,对应的共轭先验正比于\(\lambda\)的幂次数和\(\lambda\)的线性函数的指数。这就是Gamma分布,定义为:

\[Gam(\lambda|a,b) = \frac{1}{\Gamma(a)}b^a\lambda^{a-1}exp(-b\lambda) \tag{2.146} \]

其中
\(\Gamma(x) \equiv \int_0^\infty u^{x-1}e^{-u}du\)是归一化系数。

Gamma分布的均值和方差为:

\[\begin{eqnarray} \mathbb{E}[\lambda] &=& \frac{a}{b} \tag{2.147} \\ var[\lambda] &=& \frac{a}{b^2} \tag{2.148} \end{eqnarray} \]

设先验

\[Gam(\lambda|a_0,b_0) \]

然后假设先验为$ Gam(\lambda|a_0,b_0) $如果乘以似然函数(2.145),那么就得到后验分布:

\[p(\lambda \mid \mathbf{x}) \propto \lambda^{a_{0}-1} \lambda^{\frac{N}{2}} \exp \left\{-b_{0} \lambda-\frac{\lambda}{2} \sum_{n=1}^{N}\left(x_{n}-\mu\right)^{2}\right\} \tag{2.149} \]

整理一下,看成$Gam(\lambda|a_N, b_N) $的gamma分布,其中:

\[\begin{eqnarray} a_N &=& a_0 + \frac{N}{2} \tag{2.150} \\ b_N &=& b_0 + \frac{1}{2}\sum\limits_{n=1}^N(x_n - \mu)^2 = b_0 + \frac{N}{2}\sigma_{ML}^2 \tag{2.151} \end{eqnarray} \]

其中$ \sigma_{ML}^2 $是对方差的最大似然估计。

观察有几个结论:

  • 2.150:\(N\)个数据点的效果是使\(a\)增加了\(N / 2\)。因此我们可以把先验分布中的参数$ a_0 $看成\(2a_0\)个“有效”先验观测。
  • 2.151:\(N\)个数据点为参数\(b\)贡献了\(N\sigma_{ML}^2/2\)其中\(\sigma_{ML}^2\)是方差,所以把先验中的参数\(b_0\)解释为:\(2a_0\)个方差为\(2b_0/(2a_0)=b_0/a_0\)“有效”的先验观测的效果。

回忆一下,我们在Dirichlet先验中做过类似的有效观测数的解释。这些分布是指数族的例子,我们将会看到,把共轭先验解释为有效的虚拟数据点是指数族分布的一种通用方法。

\(\color{red}{高斯分布关于参数\sigma^2的共轭分布是Gamma分布}\)


现在,假设均值和精度都是未知的。为了找到共轭先验,考虑似然函数对\(\mu, \lambda\)的依赖:

\[\begin{array}{c} p(\mathbf{x} \mid \mu, \lambda)=\prod_{n=1}^{N}\left(\frac{\lambda}{2 \pi}\right)^{\frac{1}{2}} \exp \left\{-\frac{\lambda}{2}\left(x_{n}-\mu\right)^{2}\right\} \\ \propto\left[\lambda^{\frac{1}{2}} \exp \left(-\frac{\lambda \mu^{2}}{2}\right)\right]^{N} \exp \left\{\lambda \mu \sum_{n=1}^{N} x_{n}-\frac{\lambda}{2} \sum_{n=1}^{N} x_{n}^{2}\right\} \end{array} \]

现在,我们在想找到一个对于\(\mu,\lambda\)的依赖与似然函数有着相同的函数形式的先验分布\(p(\mu,\lambda)\)因此,假设形式:

\[\begin{array}{c} p(\mu, \lambda) \propto\left[\lambda^{\frac{1}{2}} \exp \left(-\frac{\lambda \mu^{2}}{2}\right)\right]^{\beta} \exp \{c \lambda \mu-d \lambda\} \\ =\exp \left\{-\frac{\beta \lambda}{2}\left(\mu-\frac{c}{\beta}\right)^{2}\right\} \lambda^{\frac{\beta}{2}} \exp \left\{-\left(d-\frac{c^{2}}{2 \beta}\right) \lambda\right\} \end{array} \tag{2.153} \]

其中\(c, d, \beta\)是常量。由于总有\(p(\mu,\lambda) = p(\mu|\lambda)p(\lambda)\),我们可以通过观察找到\(p(\mu|\lambda), p(\lambda)\)。特别的,\(p(\mu|\lambda)\)是一个精度为关于\(\lambda\)的线性函数的高斯分布,\(p(\lambda)\)是一个gamma分布时,得到的标准化的先验形式为:

\[p(\mu,\lambda) = \mathcal{N}(\mu|\mu_0,(\beta\lambda)^{-1})Gam(\lambda|a,b) \tag{2.154} \]

其中,我们的新常数为
\(\mu_0 = c/\beta, a = 1 + \beta / 2, b = d - c^2/2\beta\)(对比就可以得到)。式(2.154)的分布被称为正态-gamma(normal-gamma)或高斯-gamma(Gaussian-gamma)分布,并在图2.14中展示。

\(\color{red}{高斯分布关于参数\mu,\sigma^2的共轭分布是正态-gamma分布}\)


即使我们选择一个\(\mu\)和\lambda\(相互独立的先验,后验概率中,\)\mu\(的精度和\)\lambda$值也会相互耦合


对于$ D \(维向量\)x \(的多元高斯分布\) \mathcal{N}(x|\mu, \Lambda^{−1}) \(,假设精度已知,那么均值\) \mu \(共轭先验还是高斯分布。对于已知的均值,未知的精度矩阵\) \Lambda $,共轭先验是Wishart分布:

\[\mathcal{W}(\Lambda|W, v) = B|\Lambda|^{(v-D-1)/2}exp\left(-\frac{1}{2}Tr(W^{-1}\Lambda)\right) \tag{2.155} \]

其中$ v \(是分布的自由度,\) W \(是\) D \times D \(的伸缩矩阵,\) TR(.) \(记作迹。标准化常量\) B $为:

\[B(W,v) = |W|^{-v/2}\left(2^{vD/2}\pi^{D(D-1)/4}\prod\limits_{i=1}^D\Gamma\left(\frac{v+1-i}{2}\right)\right)^{-1} \tag{2.156} \]

如果均值和精度同时未知,那么,和一元变量类似的推理得到共轭先验:

$ p(\mu,\Lambda|\mu_0,\beta,W,v) = \mathcal{N}(\mu|\mu_0,(\beta\Lambda)^{-1})\mathcal{W}(\Lambda|W,v) \tag{2.157} $

这被称为正态-Wishart分布或高斯-Wishart分布。

\(\color{red}{多维高斯分布关于参数\mu的共轭分布是Wishart分布}\)


2.3.7 学生t分布

T分布有一个重要的性质,鲁棒性笔记好,在有一些异常值的情况下,正态分布会产生较大的偏离,但是t分布不会

点击查看代码

X = np.random.normal(size=20)
X = np.concatenate([X, np.random.normal(loc=20., size=3)])
plt.hist(X.ravel(), bins=50, density=1., label="samples")

students_t = StudentsT()
gaussian = Gaussian()

gaussian.fit(X)
students_t.fit(X)

print(gaussian)
print(students_t)

x = np.linspace(-5, 25, 1000)
plt.plot(x, students_t.pdf(x), label="student's t", linewidth=2)
plt.plot(x, gaussian.pdf(x), label="gaussian", linewidth=2)
plt.legend()
plt.show()

2.3.9 高斯混合模型

使用kmeans聚类的高斯混合模型

点击查看代码

import numpy as np
import matplotlib.pyplot as plt

from prml.rv import (
    Bernoulli,
    Beta,
    Categorical,
    Dirichlet,
    Gamma,
    Gaussian,
    MultivariateGaussian,
    MultivariateGaussianMixture,
    StudentsT,
    Uniform
)

# 例子
x1 = np.random.normal(size=(100, 2))
x1 += np.array([-5, -5])
x2 = np.random.normal(size=(100, 2))
x2 += np.array([5, -5])
x3 = np.random.normal(size=(100, 2))
x3 += np.array([0, 5])
X = np.vstack((x1, x2, x3))

model = MultivariateGaussianMixture(n_components=3)
model.fit(X)
print(model)

x_test, y_test = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
X_test = np.array([x_test, y_test]).reshape(2, -1).transpose()
probs = model.pdf(X_test)
Probs = probs.reshape(100, 100)
plt.scatter(X[:, 0], X[:, 1])
plt.contour(x_test, y_test, Probs)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.show()


点击查看代码

import numpy as np
from prml.clustering import KMeans
from prml.rv.rv import RandomVariable


class MultivariateGaussianMixture(RandomVariable):
    """
    p(x|mu, L, pi(coef))
    = sum_k pi_k N(x|mu_k, L_k^-1)
    """

    def __init__(self,
                 n_components,
                 mu=None,
                 cov=None,
                 tau=None,
                 coef=None):
        """
        construct mixture of Gaussians

        Parameters
        ----------
        n_components : int
            number of gaussian component
        mu : (n_components, ndim) np.ndarray
            mean parameter of each gaussian component
        cov : (n_components, ndim, ndim) np.ndarray
            variance parameter of each gaussian component
        tau : (n_components, ndim, ndim) np.ndarray
            precision parameter of each gaussian component
        coef : (n_components,) np.ndarray
            mixing coefficients
        """
        super().__init__()
        assert isinstance(n_components, int)
        self.n_components = n_components
        self.mu = mu
        if cov is not None and tau is not None:
            raise ValueError("Cannot assign both cov and tau at a time")
        elif cov is not None:
            self.cov = cov
        elif tau is not None:
            self.tau = tau
        else:
            self.cov = None
            self.tau = None
        self.coef = coef

    @property
    def mu(self):
        return self.parameter["mu"]

    @mu.setter
    def mu(self, mu):
        if isinstance(mu, np.ndarray):
            assert mu.ndim == 2
            assert np.size(mu, 0) == self.n_components
            self.ndim = np.size(mu, 1)
            self.parameter["mu"] = mu
        elif mu is None:
            self.parameter["mu"] = None
        else:
            raise TypeError("mu must be either np.ndarray or None")

    @property
    def cov(self):
        return self.parameter["cov"]

    @cov.setter
    def cov(self, cov):
        if isinstance(cov, np.ndarray):
            assert cov.shape == (self.n_components, self.ndim, self.ndim)
            self._tau = np.linalg.inv(cov)
            self.parameter["cov"] = cov
        elif cov is None:
            self.parameter["cov"] = None
            self._tau = None
        else:
            raise TypeError("cov must be either np.ndarray or None")

    @property
    def tau(self):
        return self._tau

    @tau.setter
    def tau(self, tau):
        if isinstance(tau, np.ndarray):
            assert tau.shape == (self.n_components, self.ndim, self.ndim)
            self.parameter["cov"] = np.linalg.inv(tau)
            self._tau = tau
        elif tau is None:
            self.parameter["cov"] = None
            self._tau = None
        else:
            raise TypeError("tau must be either np.ndarray or None")

    @property
    def coef(self):
        return self.parameter["coef"]

    @coef.setter
    def coef(self, coef):
        if isinstance(coef, np.ndarray):
            assert coef.ndim == 1
            if np.isnan(coef).any():
                self.parameter["coef"] = np.ones(self.n_components) / self.n_components
            elif not np.allclose(coef.sum(), 1):
                raise ValueError(f"sum of coef must be equal to 1 {coef}")
            self.parameter["coef"] = coef
        elif coef is None:
            self.parameter["coef"] = None
        else:
            raise TypeError("coef must be either np.ndarray or None")

    @property
    def shape(self):
        if hasattr(self.mu, "shape"):
            return self.mu.shape[1:]
        else:
            return None

    def _gauss(self, X):
        d = X[:, None, :] - self.mu
        D_sq = np.sum(np.einsum('nki,kij->nkj', d, self.cov) * d, -1)
        return (
                np.exp(-0.5 * D_sq)
                / np.sqrt(
            np.linalg.det(self.cov) * (2 * np.pi) ** self.ndim
        )
        )

    def _fit(self, X):
        cov = np.cov(X.T)  # 计算协方差矩阵
        kmeans = KMeans(self.n_components)  # 直接用kmeans聚类
        kmeans.fit(X)
        self.mu = kmeans.centers
        self.cov = np.array([cov for _ in range(self.n_components)])
        self.coef = np.ones(self.n_components) / self.n_components
        params = np.hstack(
            (self.mu.ravel(),
             self.cov.ravel(),
             self.coef.ravel())
        )
        while True:
            stats = self._expectation(X)
            self._maximization(X, stats)
            new_params = np.hstack(
                (self.mu.ravel(),
                 self.cov.ravel(),
                 self.coef.ravel())
            )
            if np.allclose(params, new_params):
                break
            else:
                params = new_params

    def _expectation(self, X):
        resps = self.coef * self._gauss(X)
        resps /= resps.sum(axis=-1, keepdims=True)
        return resps

    def _maximization(self, X, resps):
        Nk = np.sum(resps, axis=0)
        self.coef = Nk / len(X)
        self.mu = (X.T @ resps / Nk).T
        d = X[:, None, :] - self.mu
        self.cov = np.einsum(
            'nki,nkj->kij', d, d * resps[:, :, None]) / Nk[:, None, None]

    def joint_proba(self, X):
        """
        calculate joint probability p(X, Z)

        Parameters
        ----------
        X : (sample_size, n_features) ndarray
            input data

        Returns
        -------
        joint_prob : (sample_size, n_components) ndarray
            joint probability of input and component
        """
        return self.coef * self._gauss(X)

    def _pdf(self, X):
        joint_prob = self.coef * self._gauss(X)
        return np.sum(joint_prob, axis=-1)

    def classify(self, X):
        """
        classify input
        max_z p(z|x, theta)

        Parameters
        ----------
        X : (sample_size, ndim) ndarray
            input

        Returns
        -------
        output : (sample_size,) ndarray
            corresponding cluster index
        """
        return np.argmax(self.classify_proba(X), axis=1)

    def classify_proba(self, X):
        """
        posterior probability of cluster
        p(z|x,theta)

        Parameters
        ----------
        X : (sample_size, ndim) ndarray
            input

        Returns
        -------
        output : (sample_size, n_components) ndarray
            posterior probability of cluster
        """
        return self._expectation(X)


点击查看代码

import numpy as np
from scipy.spatial.distance import cdist


class KMeans(object):

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters

    def fit(self, X, iter_max=100):
        """
        perform k-means algorithm

        Parameters
        ----------
        X : (sample_size, n_features) ndarray
            input data
        iter_max : int
            maximum number of iterations

        Returns
        -------
        centers : (n_clusters, n_features) ndarray
            center of each cluster
        """
        I = np.eye(self.n_clusters)
        centers = X[np.random.choice(len(X), self.n_clusters, replace=False)]
        for _ in range(iter_max):
            prev_centers = np.copy(centers)
            D = cdist(X, centers)
            cluster_index = np.argmin(D, axis=1)
            cluster_index = I[cluster_index]
            centers = np.sum(X[:, None, :] * cluster_index[:, :, None], axis=0) / np.sum(cluster_index, axis=0)[:, None]
            if np.allclose(prev_centers, centers):
                break
        self.centers = centers

    def predict(self, X):
        """
        calculate closest cluster center index

        Parameters
        ----------
        X : (sample_size, n_features) ndarray
            input data

        Returns
        -------
        index : (sample_size,) ndarray
            indicates which cluster they belong
        """
        D = cdist(X, self.centers)
        return np.argmin(D, axis=1)


2.5 非参数化估计

  • 直方图估计
  • 核密度估计
  • K近邻估计

直方图方法

在实际应用中,直方图方法对于快速地将一维或者二维的数据可视化很有用,但是并不适用于大多数概率密度估计的应用
一个明显的问题是概率密度故连续

2.5.1 核密度估计

假设观测是D维空间未知概率分布p(x),希望估计。看包含x的小区域R,概率密度是:

\[P = \int_Rp(x)dx \tag{2.242} \]

假设收集了R内部的K个数据点,服从二项分布:x落在区域R中被观测到,数量为K个的概率。

\[Bin(K|N,P) = \frac{N!}{K!(N-K)!}P^K(1-P)^{N-K} \tag{2.243} \]

使用2.11,2.12(在给定数据集规模N的条件下,x=1的观测出现数量m的概率分布的期望和方差)

\[\mathbb{E}[m] \equiv \sum_{m=0}^{N} m \operatorname{Bin}(m \mid N, \mu)=N \mu \tag{2.11} \]

\[\mathbb{E}[m] \equiv \sum_{m=0}^{N} (m-E[m])^2 \operatorname{Bin}(m \mid N, \mu)=N \mu(1-\mu) \tag{2.12} \]

得到落在区域内部的数据点的平均比例(mean fraction)为$ \mathbb{E}[K/N] = P \(,同时引用式(2.12)得到这个均值的方差为\) var[K/N] = P(1-P)/N $。

对于大的$ N $值,这个分布将会在均值附近产生尖峰(方差变小),且

\[K \simeq NP \tag{2.244} \]

但是,如果同时假定区域R足够小,使得在这个区域内的概率密度p(x)大致为常数,那么就有

\[P \simeq p(x)V \tag{2.245} \]

其中$V \(是\) R $的体积。结合式(2.244)和(2.245)得到密度估计的形式:

\[p(x) = \frac{K}{NV} \tag{2.246} \]

注意,式(2.246)的成立依赖于两个相互矛盾的假设,即区域$ R \(要足够小,使得这个区域内的概率密度近似为常数,但是也要足够大(关于密度的值),使得落在这个区域内的数据点的数量\) K $足够让二项分布达到尖峰。(太少就没点在区域里)

因为p(x)概率函数和N是数据点。我们有两种思路:

  • 固定K,通过数据确定V,就是k邻近算法。
  • 固定V,通过数据确定K的值,就是核方法。

可以证明在极限$ N \to \infty $下,V随N的增加而收缩,K随N的增加而增大。最终两种方法得到的概率密度都会收敛于真实的概率密度。(Duda and Hart, 1973)。


首先是核方法,我们把区域$ R \(取成以想确定概率密度的点\) x \(为中心的小超立方体。为了统计落在这个区域内的数据点的数量\) K $,定义函数

\[k(\boldsymbol{u})=\left\{\begin{array}{ll} 1, & \left|u_{i}\right| \leq \frac{1}{2}, \quad i=1, \ldots, D \\ 0, & \text { 其他情况 } \end{array}\right. \tag{2.247} \]

这表示一个以原点为中心的单位立方体。函数k(u)就是核函数的一个例子。从式(2.247),如果数据点$ x_n \(位于以\) x \(为中心的边长为\) h \(的立方体中,那么量\) k((x-x_n)/h) $等于1,否则它的值为0。位于这个立方体内的数据点的总数为:

\[K = \sum\limits_{n=1}^Nk\left(\frac{x-x_n}{h}\right) \tag{2.248} \]

把这个表达式代入式(2.246),可以得到点$ x $处的概率密度估计

\[p(x) = \frac{1}{N}\sum\limits_{n=1}^N\frac{1}{h^D}k\left(\frac{x-x_n}{h}\right) \tag{2.249} \]

使用函数$ k(u) \(的对称性(两点距离计算),可以重新解读这个等式为以\) N \(个数据点\) x_n \(为中心的\) N \(个立方体的和,而不是解读为以\) x$为中心的一个立方体。

但问题跟直方图一样,非连续性。(不是1就是0).这个是由密度估计中立方体的边界带来的。如果我们选择一个平滑的核函数,那么就可以得到一个更加光滑的模型。一个常用的选择是高斯核函数,它给出

\[p(\boldsymbol{x})=\frac{1}{N} \sum_{n=1}^{N} \frac{1}{\left(2 \pi h^{2}\right)^{\frac{D}{2}}} \exp \left\{-\frac{\left\|\boldsymbol{x}-\boldsymbol{x}_{n}\right\|^{2}}{2 h^{2}}\right\} \tag{2.250} \]


高斯核密度估计的python实现

点击查看代码

import numpy as np
import matplotlib.pyplot as plt
import math


def gauss(x):
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (x ** 2))


def get_kde(x, data_array, bandwidth=0.1):
    N = len(data_array)
    res = 0
    if len(data_array) == 0:
        return 0
    for i in range(len(data_array)):
        res += gauss((x - data_array[i]) / bandwidth)  # 对应书上的公式 2.250 D=1
    res /= (N * bandwidth)
    return res


input_array = np.random.randn(20000).tolist()
print(input_array)
bandwidth = 1.05 * np.std(input_array) * (len(input_array) ** (-1 / 5))  # 带宽也就是书里面的h 是个超参数
x_array = np.linspace(min(input_array), max(input_array), 50)
y_array = [get_kde(x_array[i], input_array, bandwidth) for i in range(x_array.shape[0])]
# 这里可以看到,每次计算kde 都要代入整个观测数据集,运算很慢

plt.subplot(1, 2, 1)
plt.hist(input_array, bins=40, density=True)
plt.plot(x_array.tolist(), y_array, color='red', linestyle='-')

bandwidth = 0.005
x_array = np.linspace(min(input_array), max(input_array), 50)
y_array = [get_kde(x_array[i], input_array, bandwidth) for i in range(x_array.shape[0])]

plt.subplot(1, 2, 2)
plt.hist(input_array, bins=40, density=True)
plt.plot(x_array.tolist(), y_array, color='red', linestyle='-')
plt.show()


2.5.2

与之前固定$ V \(然后从数据中确定\) K \(的值不同,我们考虑固定\) K \(的值然后使用数据来确定合适的\)

其实这部分很简单,回到公式2.246

\[p(x)=\frac{K}{NV} \]

K,N都是常数,计算V就是计算概率密度了,但是
注意,由K近邻方法得到的模型不是真实的概率密度模型,因为它在整个空间的积分是发散的

posted @ 2022-03-19 10:42  筷点雪糕侠  阅读(495)  评论(0编辑  收藏  举报