Hiroki

大部分笔记已经转移到 https://github.com/hschen0712/machine_learning_notes ,QQ:357033150, 欢迎交流

贝叶斯线性回归

贝叶斯线性回归

问题背景:

为了与PRML第一章一致,我们假定数据出自一个高斯分布:

\[p(t|x,\mathbf{w},\beta)=\mathcal{N}(t|y(x,\mathbf{w}),\beta^{-1})=\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(t-y(x,\mathbf{w}))^2) \]

其中\(\beta\)是精度,\(y(x,\mathbf{w})=\sum\limits_{j=0}^Mw_jx^j\)
\(\mathbf{w}\)的先验为:

\[p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1}\mathbf{I})=(\frac{\alpha}{2\pi})^{(M+1)/2}\exp(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w}) \]

其中\(\alpha\)是高斯分布的精度。
为了表示方便,我们定义一个变换\(\phi(x)=(1,x,x^2,...,x^M)^T\),那么\(y(x,\mathbf{w})=\mathbf{w}^T\phi(x)\)。为了对\(\mathbf{w}\)作推断,我们需要收集数据更新先验分布,记收集到的数据为\(\mathbf{x}_N=\{x_1,...,x_N\}\)\(\mathbf{t}_N=\{t_1,...,t_N\}\),其中\(t_i\)\(x_i\)对应的响应。进一步我们引入一个矩阵\(\Phi\),其定义如下:

\[\Phi=\begin{bmatrix}\phi(x_1)^T\\\phi(x_2)^T\\\vdots\\\phi(x_N)^T\end{bmatrix} \]

我们可以认为这个矩阵是由\(\phi(x_i)^T\)平铺而成。

详细推导过程

首先我们先验证\(p(\mathbf{w}|\mathbf{x},\mathbf{t})\)是个高斯分布。根据贝叶斯公式我们有:

\[\begin{aligned}p(\mathbf{w}|\mathbf{x},\mathbf{t})&\propto p(\mathbf{w}|\alpha)p(\mathbf{t}|\mathbf{x},\mathbf{w})\\&=(\frac{\alpha}{2\pi})^{(M+1)/2} \exp(-\frac{\alpha}{2}\mathbf{w}^T\mathbf{w})\prod_{i=1}^N\sqrt{\frac{\beta}{2\pi}}\exp(-\frac{\beta}{2}(t_i-\mathbf{w}^T\phi(x_i))^2)\\&\propto \exp(-\frac{1}{2}\Big\{\alpha \mathbf{w}^T\mathbf{w}+\beta\sum_{i=1}^N(t_i^2-2t_i\mathbf{w}^T\phi(x_i)+\mathbf{w}^T\phi(x_i)\phi(x_i)^T\mathbf{w})\Big\})\\&\propto \exp(-\frac{1}{2}\Big\{ \mathbf{w}^T(\alpha \mathbf{I}+\beta\sum_{i=1}^N\phi(x_i)\phi(x_i)^T)\mathbf{w}-2\beta\sum_{i=1}^N t_i\phi(x_i)^T\mathbf{w}\Big\})\\&\propto \exp(-\frac{1}{2}\Big\{\mathbf{w}^T(\alpha \mathbf{I}+\beta \Phi^T\Phi)\mathbf{w} -2\beta(\Phi^T\mathbf{t}_N)^T\mathbf{w}\Big\})\end{aligned} \]

\(S^{-1}=\alpha \mathbf{I}+\beta\Phi^T\Phi\),\(\mu=\beta S\Phi^T\mathbf{t}_N\)(注:原书中式1.72写错了)

\[p(\mathbf{w}|\mathbf{x},\mathbf{t})\propto \exp(-\frac{1}{2}(\mathbf{w}-\mu)^TS^{-1}(\mathbf{w}-\mu))\propto \mathcal{N}(\mathbf{w}|\mu,S) \]

至此,我们证明了后验分布也是个高斯,接下来我们计算predictive distribution,注意到\(p(t|x,\mathbf{x},\mathbf{t})\)是两个高斯分布的卷积,其结果也是一个高斯,但为了严谨起见,还是证明一下。

\[\begin{aligned}p(t|x,\mathbf{x},\mathbf{t})&=\int p(t|x,\mathbf{w})p(\mathbf{w}|\mathbf{x},\mathbf{t})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}} \int \exp(-\frac{1}{2}\Big\{\beta(t-\mathbf{w}^T\phi(x))^2+(\mathbf{w}-\mu)^TS^{-1}(\mathbf{w}-\mu)\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\int\exp(-\frac{1}{2}\Big\{\beta t^2-2\beta t\phi(x)^T\mathbf{w}+\beta\mathbf{w}^T\phi(x)\phi(x)^T\mathbf{w}\\&+\mathbf{w}^TS^{-1}\mathbf{w}-2\mu^T S^{-1}\mathbf{w}+\mu^T S^{-1}\mu\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot \\&\int\exp(-\frac{1}{2}\Big\{-2\beta t\phi(x)^T\mathbf{w}+\beta\mathbf{w}^T\phi(x)\phi(x)^T\mathbf{w}+\mathbf{w}^TS^{-1}\mathbf{w}-2\mu^T S^{-1}\mathbf{w}\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot\\&\int\exp(-\frac{1}{2}\Big\{\mathbf{w}^T(\underbrace{\beta\phi(x)\phi(x)^T+S^{-1})}_{\Lambda^{-1}}\mathbf{w}-2(\underbrace{\beta t\phi(x)^T+\mu^TS^{-1}}_{m^T\Lambda^{-1}})\mathbf{w}\Big\}d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu))\cdot\\&\int\exp(-\frac{1}{2}\Big\{\mathbf{w}^T\Lambda^{-1}\mathbf{w}-2m^T\Lambda^{-1}\mathbf{w}+m^T\Lambda^{-1}m-m^T\Lambda^{-1}m\Big\})d\mathbf{w}\\&=\frac{1}{(2\pi)^{M/2+1}}\frac{1}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\cdot\\&\int\exp(-\frac{1}{2}\Big\{(\mathbf{w}-m)^T\Lambda^{-1}(\mathbf{w}-m)\Big\})d\mathbf{w}\\&=\frac{(2\pi)^{(M+1)/2}}{(2\pi)^{M/2+1}}\frac{|\Lambda|^{1/2}}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\\&=\frac{1}{(2\pi)^{1/2}}\frac{|(\beta\phi(x)\phi(x)^T+S^{-1})^{-1}|^{1/2}}{(\beta^{-1}|S|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\\&=\frac{1}{(2\pi)^{1/2}}\frac{1}{(\beta^{-1}|S||\beta\phi(x)\phi(x)^T+S^{-1}|)^{1/2}}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\end{aligned} \]

到这里不知道怎么推下去了,于是去网上闲逛找解决办法,终于找到了一篇论文《Modeling Inverse Covariance Matrices by Basis Expansion》这篇论文里介绍了一个引理
引理 (对称矩阵的秩1扰动)\(\alpha\in\mathbb{R}\),\(\mathbf{a}\in\mathbb{R}^d\),\(P\in\mathbb{R}^{d\times d}\) 为可逆矩阵。如果\(\alpha\neq\mathbf{a} -(\mathbf{a}^TP\mathbf{a})^{-1}\)那么秩1扰动矩阵\(P+\alpha \mathbf{a} \mathbf{a}^T\)可逆,且

\[(P+\alpha \mathbf{a} \mathbf{a}^T)^{-1}=P^{-1}-\frac{\alpha P^{-1}\mathbf{a}\mathbf{a}^T P^{-1}}{1+\alpha \mathbf{a}^TP^{-1}\mathbf{a}} \]

\[det(P+\alpha \mathbf{a} \mathbf{a}^T)=(1+\alpha \mathbf{a}^T P^{-1}\mathbf{a})det(P) \]

这条定理说的是如果我们给协方差矩阵一个秩为1的向量外积做扰动,我们可以将扰动后的矩阵的逆和行列式进行展开。具体地,我们考察\(|\beta\phi(x)\phi(x)^T+S^{-1}|\),发现

\[|\beta\phi(x)\phi(x)^T+S^{-1}|=(1+\beta \phi(x)^T S\phi(x))det(S^{-1})=(1+\beta \phi(x)^T S\phi(x))/|S| \]

于是

\[\frac{1}{(\beta^{-1}|S||\beta\phi(x)\phi(x)^T+S^{-1}|)^{1/2}}=\frac{1}{(\beta^{-1}|S|\cdot \frac{(1+\beta \phi(x)^T S\phi(x))}{|S|})^{1/2}}=\frac{1}{(\beta^{-1}+\phi(x)^T S\phi(x))^{1/2}} \]

接下来考察指数部分\(\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))\),注意到\(\mu=\beta S\Phi^T\mathbf{t}_N\),于是\(\mu^TS^{-1}\mu=\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)\)。同时,应用上述引理我们有

\[\Lambda=(\beta\phi(x)\phi(x)^T+S^{-1})^{-1}=S-\frac{\beta S\phi(x)\phi(x)^TS}{1+\beta \phi(x)^TS\phi(x)}=S-\frac{ S\phi(x)\phi(x)^TS}{\beta^{-1}+\phi(x)^TS\phi(x)} \]

利用以上两个关系,我们进一步进行推导

\[\begin{aligned}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))&=\exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)-(\beta t\phi(x)^T+\beta (\Phi^T\mathbf{t}_N)^T)\Lambda\Lambda^{-1}\Lambda(\beta t\phi(x)+\beta \Phi^T\mathbf{t}_N)\Big\})\\&=\exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)-\big[\beta^2 t^2\phi(x)^T\Lambda\phi(x)+2\beta^2 t\phi(x)^T\Lambda (\Phi^T\mathbf{t}_N)+\beta^2 (\Phi^T\mathbf{t}_N)^T\Lambda(\Phi^T\mathbf{t}_N)\big]\Big\})\\&= \exp(-\frac{1}{2}\Big\{\beta t^2+\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)+\big[-\beta^2 t^2\phi(x)^TS\phi(x)+\beta^3 t^2 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\\ &-2\beta^2 t\phi(x)^TS(\Phi^T\mathbf{t}_N)+2\beta^3 t\frac{\phi(x)^T\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\\&-\beta^2(\Phi^T\mathbf{t}_N)^TS(\Phi^T\mathbf{t}_N)+\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\big]\Big\})\\&=\exp(-\frac{1}{2}\Big\{\big(\beta-\beta^2\phi(x)^TS\phi(x)+\beta^3 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\big)t^2\\ &+2\big(\beta^3 \frac{\phi(x)^T\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)\big)t+\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}\Big\})\end{aligned}\]

我们考察每一个系数,首先是\(t^2\)的系数

\[\begin{aligned} \beta-\beta^2\phi(x)^TS\phi(x)+\beta^3 \frac{(\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}&=\beta+\frac{-\beta^2\phi(x)^TS\phi(x)[1+\beta\phi(x)^TS\phi(x)]+\beta^3 (\phi(x)^TS\phi(x))^2}{1+\beta\phi(x)^TS\phi(x)}\\&=\beta-\frac{\beta^2\phi(x)^TS\phi(x)}{1+\beta \phi(x)^TS\phi(x)}\\&=\frac{\beta(1+\beta \phi(x)^TS\phi(x))-\beta^2\phi(x)^TS\phi(x)}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{\beta}{1+\beta\phi(x)^TS\phi(x)}=\frac{1}{\beta^{-1}+\phi(x)^TS\phi(x)}\end{aligned} \]

接着是\(t\)的系数

\[\begin{aligned}\beta^3 \frac{\phi(x)^T\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)&=\frac{\beta^3\phi(x)^T\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)(1+\beta\phi(x)^TS\phi(x))}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{\beta^3\phi(x)^T\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^2\phi(x)^TS(\Phi^T\mathbf{t}_N)-\beta^3\phi(x)^TS(\Phi^T\mathbf{t}_N)\phi(x)^TS\phi(x)}{1+\beta\phi(x)^TS\phi(x)}\\&=\frac{-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)}{\beta^{-1}+\phi(x)^TS\phi(x)}\end{aligned} \]

最后我们考察常数项

\[\beta^3 \frac{(\Phi^T\mathbf{t}_N)^TS\phi(x)\phi(x)^TS(\Phi^T\mathbf{t}_N)}{1+\beta\phi(x)^TS\phi(x)}=\frac{\beta^2(\phi(x)^TS(\Phi^T\mathbf{t}_N))^2}{\beta^{-1}+\phi(x)^TS\phi(x)} \]

综合以上,我们有

\[\begin{aligned}\exp(-\frac{1}{2}(\beta t^2+\mu^T S^{-1}\mu-m^T\Lambda^{-1}m))&=\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}\Big\{t^2-2\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)t+\beta^2(\phi(x)^TS(\Phi^T\mathbf{t}_N))^2\Big\})\\&=\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}(t-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N))^2)\end{aligned} \]

综合以上,我们可以得到

\[p(t|x,\mathbf{x},\mathbf{t})=\frac{1}{\sqrt{2\pi\cdot(\beta^{-1}+\phi(x)^TS\phi(x))}}\exp(-\frac{1}{2(\beta^{-1}+\phi(x)^TS\phi(x))}(t-\beta\phi(x)^TS(\Phi^T\mathbf{t}_N))^2) \]

\[ m(x)=\beta\phi(x)^TS(\Phi^T\mathbf{t}_N)=\mu^T\phi(x)\\ s^2(x)=\beta^{-1}+\phi(x)^TS\phi(x)\]

以上两式对应PRML中的式1.70~1.71。式1.71中,第一项表示数据中的噪音(方差越小,数据越集中,不确定性越小);第二项表示关于参数\(\mathbf{w}\)的不确定性,当\(N\to \infty\)时,第二项趋于0,这是由于当数据量趋于无限大时,关于参数的不确定性逐渐消失,先验的影响逐渐减弱。理论上的证明如下,首先我们考察\(S_{N+1}\):

\[S_{N+1}=(\alpha I+\beta \sum_{i=1}^N \phi(x_i)\phi(x_i)^T+\beta \phi(x_{N+1})\phi(x_{N+1})^T)=(S_N^{-1}+\beta \phi(x_{N+1})\phi(x_{N+1})^T)\\=S_N-\beta\frac{S_N\phi(x_{N+1})\phi(x_{N+1})^TS_N}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})}\\=S_N-\frac{\beta}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})} (S_N\phi(x_{N+1}))(S_N\phi(x_{N+1}))^T \]

于是

\[\sigma_{N+1}^2(x)=\beta^{-1}+\phi(x)^TS_{N+1}\phi(x)=\sigma_N^2(x)-\frac{\beta}{1+\beta \phi(x_{N+1})^TS_N\phi(x_{N+1})}[ \phi(x)^T(S_N\phi(x_{N+1}))]^2\leq \sigma_N^2(x) \]

因此序列\(\sigma_N^2(x)\)是单调递减序列,又由于有下界(0),因此当\(N\to\infty\)时,\(\sigma_N^2(x)\to 0\)

于是我们知道

\[p(t|x,\mathbf{x},\mathbf{t})=\mathcal{N}(t|m(x),s^2(x)) \]

也就是说后验预测分布也是一个高斯,\(t\)的且均值、方差取决于\(x\)
需要注意的是当\(x\)满足\(\beta=-(\phi(x)^TS\phi(x))^{-1}\)时,方差

\[s^2(x)=\beta^{-1}+\phi(x)^TS\phi(x)=-\phi(x)^TS\phi(x)+\phi(x)^TS\phi(x)=0 \]

因此在这一点分布未定义

posted on 2016-05-15 16:29  Hiroki  阅读(1786)  评论(0编辑  收藏  举报

导航