PRML 1: Gaussian Distribution
1. Overview of Machine Learning
P.S. A quick flip through the textbook to review a few basic concepts from probability theory.
Probability is a measure defined on a sample space that satisfies non-negativity, normalization, and countable additivity; it describes our degree of belief in random events.
A random variable is a real-valued function of the outcome of a random experiment. For a one-dimensional random variable we can define a real-valued function that is monotone, bounded, and right-continuous, called the (cumulative) distribution function; for multi-dimensional random variables we can further define joint, marginal, and conditional distributions.
In one dimension we can define the mean of a random variable and the covariance between two random variables, $cov[x,y]=E[xy]-E[x]E[y]$; in higher dimensions the corresponding concepts are the mean vector and the covariance matrix, $cov[\vec{x},\vec{y}]=E[\vec{x}\vec{y}^T]-E[\vec{x}]E[\vec{y}]^T$. Two (scalar) random variables are uncorrelated when their covariance is zero, which holds if and only if the mean of their product equals the product of their means, or equivalently the variance of their sum equals the sum of their variances. Independence is a special (stronger) case of uncorrelatedness: it requires the product of the two marginal distribution functions to equal the joint distribution function.
Probability theory has two famous classes of limit theorems, the laws of large numbers and the central limit theorem: the former shows that, as the number of trials grows without bound, the frequency of an outcome converges in probability to the probability of the corresponding event; the latter states that the mean of a large number of i.i.d. random variables is approximately normally distributed, which is one of the reasons the Gaussian distribution is so popular.
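As a quick numerical illustration of the central limit theorem (a minimal numpy sketch; the uniform distribution, sample size, and trial count are arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000                      # sample size per mean, number of repetitions
x = rng.uniform(0.0, 1.0, size=(trials, n))  # i.i.d. Uniform(0, 1) draws
means = x.mean(axis=1)                       # one sample mean per trial

# Standardize: Uniform(0, 1) has mean 1/2 and variance 1/12.
z = (means - 0.5) / np.sqrt(1.0 / 12.0 / n)
print(np.mean(np.abs(z) < 1.96))             # ≈ 0.95 if z is roughly standard normal
```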
2. The Gaussian Distribution
$Gauss(\vec x\text{ | }\vec\mu,\Sigma)=\frac{1}{(2\pi)^{D/2}}\cdot\frac{1}{|\Sigma|^{1/2}}\cdot exp\{-\frac{1}{2}(\vec x-\vec \mu)^T\cdot\Sigma^{-1}\cdot(\vec x-\vec \mu)\}$
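A minimal numpy sketch of this density, checked against scipy.stats.multivariate_normal (the 2-D parameters below are arbitrary test values):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, written exactly as in the formula above."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                    # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -1.0])
print(gauss_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))  # the two values agree
```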
Lemma: If $Mat = \begin{bmatrix} A &B \\ C & D\end{bmatrix}$, then $Mat^{-1}=\begin{bmatrix}S^{-1} & -S^{-1}BD^{-1}\\ -D^{-1}CS^{-1} & D^{-1}(I+CS^{-1}BD^{-1})\end{bmatrix}$ ,
where $S=A-BD^{-1}C$ is the Schur complement of $Mat$ with respect to $D$.
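A quick numerical sanity check of the lemma (a sketch with randomly drawn blocks; the shifts by $5I$ just keep $A$, $D$, and the Schur complement well conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
A = rng.normal(size=(p, p)) + 5 * np.eye(p)   # keep the blocks comfortably invertible
D = rng.normal(size=(q, q)) + 5 * np.eye(q)
B, C = rng.normal(size=(p, q)), rng.normal(size=(q, p))

Mat = np.block([[A, B], [C, D]])
Dinv = np.linalg.inv(D)
S = A - B @ Dinv @ C                          # Schur complement with respect to D
Sinv = np.linalg.inv(S)
Mat_inv = np.block([
    [Sinv,             -Sinv @ B @ Dinv],
    [-Dinv @ C @ Sinv, Dinv @ (np.eye(q) + C @ Sinv @ B @ Dinv)],
])
print(np.allclose(Mat_inv, np.linalg.inv(Mat)))   # True
```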
Partitioned Gaussians: Suppose $\vec x = [{\vec x_1}^T,{\vec x_2}^T]^T$ follows a Gaussian distribution with mean vector $\vec\mu = [{\vec\mu_1}^T,{\vec\mu_2}^T]^T$ and covariance matrix $\Sigma =\begin{bmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}$. Then:
(1) Marginal Distribution: $p(\vec{x_1})=Gauss(\vec{x_1}\text{ | }\vec{\mu_1},\Sigma_{11})$, $p(\vec{x_2})=Gauss(\vec{x_2}\text{ | }\vec{\mu_2},\Sigma_{22})$;
(2) Conditional Distribution: $p(\vec{x_1}\text{ | }\vec{x_2})=Gauss(\vec{x_1}\text{ | }\vec{\mu_1}+ \Sigma_{12}\Sigma^{-1}_{22}(\vec{x_2}-\vec{\mu_2}),\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$.
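A small numpy sketch of both results, assuming an arbitrary 3-D Gaussian split into $\vec{x}_1=x_1$ and $\vec{x}_2=(x_2,x_3)$; the marginal is checked by sampling and the conditional parameters are computed directly from the formula:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
i, j = [0], [1, 2]                                      # x1 = x[0], x2 = x[1:3]

# Marginal of x1: empirical mean/variance of the x1 coordinate vs mu_1, Sigma_11.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples[:, i].mean(), samples[:, i].var())        # ≈ 0.0 and ≈ 2.0

# Conditional of x1 given an observed x2, straight from the formula.
x2 = np.array([1.5, -0.5])
S11, S12 = Sigma[np.ix_(i, i)], Sigma[np.ix_(i, j)]
S21, S22 = Sigma[np.ix_(j, i)], Sigma[np.ix_(j, j)]
cond_mean = mu[i] + S12 @ np.linalg.solve(S22, x2 - mu[j])
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_cov)
```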
Linear Gaussian Model: Given $p(\vec{x})=Gauss(\vec{x}\text{ | }\vec{\mu},\Lambda^{-1})$ and $p(\vec{y}\text{ | }\vec{x})=Gauss(\vec{y}\text{ | }A\cdot\vec{x}+\vec{b},L^{-1})$, we have:
(1) $p(\vec{y})=Gauss(\vec{y}\text{ | }A\cdot\vec{\mu}+\vec{b},L^{-1}+A\cdot\Lambda^{-1}\cdot A^T)$;
(2) $p(\vec{x}\text{ | }\vec{y})=Gauss(\vec{x}\text{ | }\Sigma\cdot\{A^T\cdot L\cdot(\vec{y}-\vec{b})+\Lambda\cdot\vec{\mu}\},\Sigma)$, where $\Sigma=(\Lambda+A^T\cdot L\cdot A)^{-1}$.
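A Monte Carlo sanity check of the marginal $p(\vec{y})$ (a sketch; $A$, $\vec{b}$, and the precision matrices are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])   # precision of p(x)
L = np.diag([4.0, 3.0, 5.0])               # precision of p(y | x)
A = rng.normal(size=(3, 2))
b = np.array([0.1, -0.2, 0.3])
mu = np.array([1.0, -1.0])

N = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=N)
noise = rng.multivariate_normal(np.zeros(3), np.linalg.inv(L), size=N)
y = x @ A.T + b + noise

print(np.allclose(y.mean(axis=0), A @ mu + b, atol=2e-2))                                    # True
print(np.allclose(np.cov(y.T), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T, atol=2e-2))  # True
```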
Maximum Likelihood Estimate: The mean vector can be estimated sequentially by $\vec{\mu}^{(n)}_{ML}=\vec{\mu}^{(n-1)}_{ML}+\frac{1}{n}\cdot(\vec{x_n}-\vec{\mu}^{(n-1)}_{ML})$, whereas the covariance matrix is typically computed in batch form as $\Sigma_{ML}=\frac{1}{N}\cdot\sum_{i=1}^{N}(\vec{x_i}-\vec{\mu}_{ML})\cdot(\vec{x_i}-\vec{\mu}_{ML})^T$, a biased estimate (the unbiased version divides by $N-1$).
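The sequential update is just an incremental form of the sample mean, so running it over any data set reproduces the batch estimate exactly; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))          # any data set works; the update is distribution-free

mu = np.zeros(2)
for n, x in enumerate(X, start=1):
    mu += (x - mu) / n                  # mu^(n) = mu^(n-1) + (x_n - mu^(n-1)) / n

print(np.allclose(mu, X.mean(axis=0)))  # True: sequential estimate equals the batch mean
```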
Convolution: $\int {Gauss(\vec t\text{ | }\vec y,\Sigma_2)\cdot Gauss(\vec y\text{ | }\vec\mu,\Sigma_1)}\,d\vec{y}=Gauss(\vec t\text{ | }\vec\mu,\Sigma_1+\Sigma_2)$, i.e. the marginal of the Linear Gaussian Model above with $A=I$ and $\vec{b}=\vec{0}$.
3. The Exponential Family
A distribution belongs to the Exponential Family if it can be written in the form $p(\vec{x}\text{ | }\vec{\eta})=g(\vec{\eta})\cdot h(\vec{x})\cdot exp\{\vec{\eta}^T\cdot\vec{u}(\vec{x})\}$.
(1) For the Multinomial Distribution $p(\vec{x}\text{ | }\vec{\eta})=(1+\sum_{k=1}^{K-1}exp\{\eta_k\})^{-1}\cdot exp\{\vec{\eta}^T\cdot\vec{x}\}$:
$\vec{\eta}=[\eta_1,\eta_2,...,\eta_{K-1}]^T$, where $\eta_k=ln(\frac{\mu_k}{1-\sum_{i=1}^{K-1}\mu_i})$;
(2) For the Univariate Gaussian Distribution $p(x\text{ | }\vec{\eta})=\frac{1}{\sqrt{2\pi}\sigma}\cdot exp\{-\frac{\mu^2}{2\sigma^2}\}\cdot exp\{\vec{\eta}^T\cdot\vec{u}(x)\}$:
$\vec{\eta}=[\frac{\mu}{\sigma^2},-\frac{1}{2\sigma^2}]^T$, where $\vec{u}(x)=[x,x^2]^T$.
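For (2), this is just the usual density with the square in the exponent expanded: $Gauss(x\text{ | }\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\cdot exp\{-\frac{(x-\mu)^2}{2\sigma^2}\}=\frac{1}{\sqrt{2\pi}\sigma}\cdot exp\{-\frac{\mu^2}{2\sigma^2}\}\cdot exp\{\frac{\mu}{\sigma^2}\cdot x-\frac{1}{2\sigma^2}\cdot x^2\}$, which matches the exponential-family form with the stated $\vec{\eta}$ and $\vec{u}(x)$.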
To make a maximum likelihood estimate of the parameters $\vec{\eta}$, one only needs to maintain the sufficient statistics of the data set, $\sum\vec{u}(\vec{x})$. For the multinomial distribution, maintaining $\sum\vec{x}$ is enough, whereas for the univariate Gaussian distribution both $\sum x$ and $\sum x^2$ are required.
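For instance, the univariate Gaussian ML estimates can be recovered from the two running sums alone (a minimal sketch; the true mean and standard deviation below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)

# Sufficient statistics: the raw samples can be discarded once these are accumulated.
s1, s2, N = x.sum(), (x ** 2).sum(), len(x)

mu_ml = s1 / N
var_ml = s2 / N - mu_ml ** 2            # E[x^2] - E[x]^2
print(mu_ml, np.sqrt(var_ml))           # ≈ 2.0 and ≈ 3.0
```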
Let $\vec{\eta}=\vec{\eta}(\vec{w}^T\vec{x})$ and $p(y\text{ | }\vec{x},\vec{w})=h(y)g(\vec{\eta})e^{\vec{\eta}^T\vec{u}(y)}$; this gives the Generalized Linear Model (GLM), in which $E[y\text{ | }\vec{x},\vec{w}]=f(\vec{w}^T\vec{x})$, an activation function $f$ acting on a linear function of the feature variables (a fitting sketch follows the examples below).
(1) Linear Regression: $p(y\text{ | }\vec{x},\vec{w})=Gauss(y\text{ | }\vec{w}^T\vec{x},\sigma^2)$ for $y\in\mathbb{R}$, $E[y\text{ | }\vec{x},\vec{w}]=\vec{w}^T\vec{x}$,
$\vec{\eta}(\vec{w}^T\vec{x})=[\frac{\vec{w}^T\vec{x}}{\sigma^2},-\frac{1}{2\sigma^2}]^T$, $g(\vec{\eta})=\frac{1}{\sigma}e^{-\frac{(\vec{w}^T\vec{x})^2}{2\sigma^2}}$, $\vec{u}(y)=[y,y^2]^T$, $h(y)=\frac{1}{\sqrt{2\pi}}$;
(2) Logistic Regression: $p(y\text{ | }\vec{x},\vec{w})=\sigma(\vec{w}^T\vec{x})^y(1-\sigma(\vec{w}^T\vec{x}))^{1-y}$ for $y=0,1$, $E[y\text{ | }\vec{x},\vec{w}]=\sigma(\vec{w}^T\vec{x})$,
$\eta(\vec{w}^T\vec{x})=\vec{w}^T\vec{x}$, $g(\vec{\eta})=(1+e^{\vec{w}^T\vec{x}})^{-1}$, $u(y)=y$, $h(y)=1$;
(3) Poisson Regression: $p(y\text{ | }\vec{x},\vec{w})=\frac{\lambda^y e^{-\lambda}}{y!}$ for $y=0,1,2,...$, where $\lambda=E[y\text{ | }\vec{x},\vec{w}]=e^{\vec{w}^T\vec{x}}$,
$\eta(\vec{w}^T\vec{x})=\vec{w}^T\vec{x}$, $g(\eta)=e^{-\lambda}$, $u(y)=y$, $h(y)=\frac{1}{y!}$.
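As a concrete instance of case (2), here is a bare-bones logistic-regression fit by gradient ascent on the log-likelihood (a sketch on synthetic data; the weights, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
w_true = np.array([1.5, -2.0, 0.5])                      # hypothetical ground-truth weights
X = np.c_[rng.normal(size=(5000, 2)), np.ones(5000)]     # two features plus a bias column
y = rng.random(5000) < sigmoid(X @ w_true)               # Bernoulli labels, p = sigma(w^T x)

w = np.zeros(3)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w)) / len(y)           # gradient of the mean log-likelihood
    w += 0.5 * grad                                      # plain gradient ascent

print(w)                      # close to w_true
print(sigmoid(X[:3] @ w))     # E[y | x, w] = sigma(w^T x) for the first few inputs
```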
P.S. Two approaches to density estimation (see the sketch after this list):
(1) Parzen window (fix the volume $V$, count $k$): $p(\vec{x})=\frac{1}{N}\sum_{n=1}^N Gauss(\vec{x}_n\text{ | }\vec{x},\lambda I)$;
(2) kNN (fix $k$, measure the volume $V$ of the smallest hypersphere around $\vec{x}$ that contains $k$ data points): $p(\vec{x})=\frac{k}{N}\cdot(\frac{4}{3}\pi\cdot r(\vec{x})^3)^{-1}$ in three dimensions.
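A minimal sketch of both estimators on synthetic 3-D data (the bandwidth $\lambda$ and $k$ are arbitrary choices), evaluated at a single point and compared with the exact density:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=(2000, 3))     # samples from a standard 3-D Gaussian
x0 = np.zeros(3)                      # point at which to estimate the density

# (1) Parzen window: average of Gaussian kernels centred on the data points.
lam = 0.1
sq_dist = np.sum((data - x0) ** 2, axis=1)
parzen = np.mean(np.exp(-sq_dist / (2 * lam)) / (2 * np.pi * lam) ** 1.5)

# (2) kNN: k / (N * volume of the smallest sphere around x0 containing k points).
k = 50
r = np.sort(np.sqrt(sq_dist))[k - 1]
knn = k / (len(data) * (4.0 / 3.0) * np.pi * r ** 3)

print(parzen, knn, (2 * np.pi) ** -1.5)   # two rough estimates vs. the exact density at x0
```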
References:
1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006