PRML Derivation: Equation (1.68)
1. Excerpt 1: An explanation from a Stack Exchange answer
https://stats.stackexchange.com/questions/305078/how-to-compute-equation-1-68-of-bishops-book
I was treating the problem as having four random variables \(x,t,D,w\), where \(D=(X,T)\); then I can only obtain this:
\(P(t,x,D)=\int P(t,x,D,w)dw\)
\(P(t|x,D)P(x,D)=\int P(t|x,D,w)P(x,D,w)dw\)
\(P(t|x,D)=\int P(t|x,D,w)P(w|x,D)dw\)
The book sneakily invoked the concept of "conditional independence".
Suppose we have variables \(A,\) \(B,\) and \(C,\) and that \(A\) and \(B\) are conditionally independent given \(C.\) This means that \(P(A \mid B, C) = P(A \mid C).\) That is, if \(C\) is observed, then \(A\) is independent of \(B.\) However, that independence is conditional, so it's still true that \(P(A \mid B) \ne P(A)\) in general.
(Note from the note-taker: this looks like a concept from probabilistic graphical models.)
In this case, \(t\) is conditionally independent of \(D\) given \(w.\) The reason for this is that \(t\) solely depends on \(w\) and \(x,\) but if you don't know \(w\) then \(D\) gives you a hint to the value of \(w.\) However, if you do know \(w\) then \(D\) is no longer useful for determining the value of \(t.\) This explains why \(D\) was omitted from \(P(t \mid x, w, D)\) but not from \(P(t \mid x, D).\)
Similarly, \(w\) is entirely independent of \(x\) so \(P(w \mid x, D) = P(w \mid D).\)
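A minimal numeric sanity check of this conditional-independence argument (the probability tables below are made-up toy values, and \(x\) is dropped for simplicity): \(t\) and \(D\) are both generated from \(w\), so \(p(t \mid D, w)=p(t \mid w)\), while marginally \(p(t \mid D)\ne p(t)\):

```python
import numpy as np

# Toy tables (made-up values); x is dropped for simplicity.
p_w = np.array([0.3, 0.7])                # prior p(w), w in {0, 1}
p_D_given_w = np.array([[0.9, 0.1],       # p(D|w): row = w, column = D
                        [0.2, 0.8]])
p_t_given_w = np.array([[0.6, 0.4],       # p(t|w): row = w, column = t
                        [0.1, 0.9]])

# Joint p(w, D, t) = p(w) p(D|w) p(t|w), axes ordered [w, D, t]
joint = p_w[:, None, None] * p_D_given_w[:, :, None] * p_t_given_w[:, None, :]

# Without w: t and D are dependent, so p(t|D) != p(t)
p_Dt = joint.sum(axis=0)                              # p(D, t)
p_t_given_D = p_Dt / p_Dt.sum(axis=1, keepdims=True)  # rows indexed by D
p_t = p_Dt.sum(axis=0)
print("p(t|D=0):", p_t_given_D[0], " p(t|D=1):", p_t_given_D[1], " p(t):", p_t)

# With w observed: D adds nothing, p(t|D,w) = p(t|w) -- which is exactly
# why D drops out of p(t|x,w,D) in the answer above.
p_t_given_wD = joint / joint.sum(axis=2, keepdims=True)
assert np.allclose(p_t_given_wD, p_t_given_w[:, None, :])
```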
2. Excerpt 2: A Chinese blogger's explanation (1)
Bayesian curve fitting
The MLE and MAP introduced earlier are both point estimates; this section introduces a more fully Bayesian approach. Recall the goal of curve fitting: we want to predict the corresponding output \(\hat{t}\) for a given input \(\hat{x}\). Here the parameters \(\alpha\) and \(\beta\) are assumed known, so they can be omitted from the posterior over \(\mathbf{w}\), which we write as \(p(\mathbf{w}|\mathbf{x},\mathbf{t})\). Integrating the right-hand side below over \(\mathbf{w}\) gives the posterior predictive distribution of \(t\): $$ p(t|x,\mathbf{x},\mathbf{t})=\int p(t|x,\mathbf{w})\,p(\mathbf{w}|\mathbf{x},\mathbf{t})\,d\mathbf{w}$$ This formula was the first hurdle I hit while reading this book, and it seems many people get stuck on it for a long time. Here is my understanding of it:
First interpretation: in the Bayesian view the data are given and only the parameters \(\mathbf{w}\) are uncertain, so \(x,\mathbf{x},\mathbf{t}\) in the formula are all fixed. For intuition we can drop everything that is known, and the formula becomes $$p(t)=\int p(t|\mathbf{w})p(\mathbf{w})\,d\mathbf{w}=\int p(t,\mathbf{w})\,d\mathbf{w}$$ which is easy to understand: it is just marginalization over \(\mathbf{w}\), using the product and sum rules of probability (in the continuous case the sum becomes an integral). A Monte Carlo sketch of this identity is given below.
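Here is that sketch, under an assumed toy conjugate model (all names and values are illustrative): \(w \sim \mathcal{N}(0,\tau^2)\) and \(t\mid w \sim \mathcal{N}(w,\sigma^2)\), so analytically \(t \sim \mathcal{N}(0,\tau^2+\sigma^2)\). Averaging \(p(t\mid w)\) over prior draws of \(w\) should recover the marginal density:

```python
import numpy as np
from scipy.stats import norm

# Toy conjugate model (assumed for illustration):
#   w ~ N(0, tau^2),  t | w ~ N(w, sigma^2)  =>  t ~ N(0, tau^2 + sigma^2)
rng = np.random.default_rng(0)
tau, sigma = 1.0, 0.5
w_samples = rng.normal(0.0, tau, size=200_000)   # draws from the prior p(w)

t_grid = np.linspace(-3, 3, 7)
# Monte Carlo estimate of p(t) = ∫ p(t|w) p(w) dw: an average over prior draws
mc = np.array([norm.pdf(t, loc=w_samples, scale=sigma).mean() for t in t_grid])
exact = norm.pdf(t_grid, loc=0.0, scale=np.hypot(tau, sigma))

print(np.max(np.abs(mc - exact)))   # small: the integral is just E_w[p(t|w)]
```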
Second interpretation: probabilistic graphical models, using the theory of d-separation (d-separation is a graphical method for deciding whether variables are conditionally independent). Below is the simplest d-separation example; for more of the theory, see Chapter 8 of PRML.
We want to determine the relationship between \(a\) and \(b\) in the graph (figure omitted here: a tail-to-tail graph in which \(c\) is a common parent of \(a\) and \(b\)); there are two cases to discuss.
First, using the factorization given by the graph, we write the joint probability $$p(a,b,c)=p(c)p(a|c)p(b|c)$$ 1) If the random variable \(c\) has been observed, then \(a\) and \(b\) are conditionally independent, i.e. \(p(a,b|c)=p(a|c)p(b|c)\)
The proof is as follows: $$p(a,b|c)=\frac{p(a,b,c)}{p(c)}=\frac{p(c)p(a|c)p(b|c)}{p(c)}=p(a|c)p(b|c)$$ Similarly, we can show \(p(b|a, c)=p(b|c)\): $$p(b|a, c)=\frac{p(a,b,c)}{p(a, c)}=\frac{p(c)p(a|c)p(b|c)}{p(c)p(a|c)}=p(b|c)$$ 2) If \(c\) has not been observed, summing \(p(a,b,c)\) over \(c\) gives the joint probability of \(a\) and \(b\): $$p(a,b)=\sum_{c}p(c)p(a|c)p(b|c)$$ In general \(p(a,b)\ne p(a)p(b)\), so \(a\) and \(b\) are not independent. A small numeric check of both cases follows below.
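Here is that check in Python, a minimal sketch with made-up binary tables for \(p(c)\), \(p(a|c)\), \(p(b|c)\):

```python
import numpy as np

# Tail-to-tail graph: joint p(a,b,c) = p(c) p(a|c) p(b|c), toy binary tables
p_c = np.array([0.4, 0.6])
p_a_c = np.array([[0.7, 0.3], [0.2, 0.8]])   # p(a|c): rows indexed by c
p_b_c = np.array([[0.5, 0.5], [0.9, 0.1]])   # p(b|c): rows indexed by c

joint = p_c[:, None, None] * p_a_c[:, :, None] * p_b_c[:, None, :]  # [c,a,b]

# 1) c observed: p(a,b|c) == p(a|c) p(b|c)
for c in (0, 1):
    p_ab_given_c = joint[c] / joint[c].sum()
    assert np.allclose(p_ab_given_c, np.outer(p_a_c[c], p_b_c[c]))

# 2) c marginalized out: p(a,b) != p(a) p(b) in general
p_ab = joint.sum(axis=0)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
print(np.allclose(p_ab, np.outer(p_a, p_b)))   # False: a and b are dependent
```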
Next, we turn to the graphical model of the regression problem (figure omitted here; it shows \(\mathbf{w}\) as the node linking the training data \(\mathbf{x},\mathbf{t}\) to the new prediction, written \(\hat{t}\) in the figure):
Now let us prove the original formula: $$\begin{aligned}p(t|x,\mathbf{x},\mathbf{t})&=\frac{p(t,x,\mathbf{x},\mathbf{t})}{p(x,\mathbf{x},\mathbf{t})}\\&=\int \frac{p(t,x,\mathbf{x},\mathbf{t}, \mathbf{w})}{p(x,\mathbf{x},\mathbf{t})}\,d\mathbf{w}\\&=\int \frac{p(t,x,\mathbf{x},\mathbf{t}, \mathbf{w})}{p(x,\mathbf{x},\mathbf{t}, \mathbf{w})}\,\frac{p(x,\mathbf{x},\mathbf{t}, \mathbf{w})}{p(x,\mathbf{x},\mathbf{t})}\,d\mathbf{w}\\&=\int p(t|x,\mathbf{x},\mathbf{t}, \mathbf{w})\,p(\mathbf{w}|x,\mathbf{x},\mathbf{t})\,d\mathbf{w}\end{aligned}$$ By d-separation, once \(\mathbf{w}\) is observed, the path from \(\mathbf{x}\) to \(t\) (written \(\hat{t}\) in the figure) is blocked, so \(t\) is independent of \(\mathbf{x}\) and \(\mathbf{t}\): $$p(t|x,\mathbf{x},\mathbf{t}, \mathbf{w})=p(t|x,\mathbf{w})$$ Next consider \(p(\mathbf{w}|x,\mathbf{x},\mathbf{t})\). Since \(t\) has not been observed, d-separation implies that \(\mathbf{w}\) and the new input \(x\) are independent; on the other hand, since \(\mathbf{t}\) has been observed, \(\mathbf{w}\) and \(\mathbf{x}\) are not conditionally independent. Hence $$p(\mathbf{w}|x,\mathbf{x},\mathbf{t})=p(\mathbf{w}|\mathbf{x},\mathbf{t})$$ Putting these together, we conclude $$p(t|x,\mathbf{x},\mathbf{t})=\int p(t|x,\mathbf{w})\,p(\mathbf{w}|\mathbf{x},\mathbf{t})\,d\mathbf{w}$$ A closed-form instance of this predictive distribution is sketched below.
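For the polynomial curve-fitting model this predictive distribution has a closed form, given in the book as equations (1.69)-(1.72): \(p(t|x,\mathbf{x},\mathbf{t})=\mathcal{N}\left(t\mid m(x), s^{2}(x)\right)\). Below is a short sketch of those equations on synthetic \(\sin(2\pi x)\) data (the hyperparameter values follow the book's running example; treat the rest as illustrative):

```python
import numpy as np

# Closed-form predictive for Bayesian polynomial curve fitting
# (PRML eqs. 1.69-1.72); alpha and beta are assumed known.
alpha, beta, M = 5e-3, 11.1, 9           # hyperparameters; M = polynomial degree

def phi(x):
    """Polynomial basis vector phi(x) = (1, x, ..., x^M)."""
    return np.power(np.asarray(x)[..., None], np.arange(M + 1))

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.shape)

Phi = phi(x_train)                        # design matrix, shape (N, M+1)
S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi   # eq. (1.72)
S = np.linalg.inv(S_inv)

def predictive(x):
    """Mean m(x) and variance s^2(x) of p(t | x, x_train, t_train)."""
    ph = phi(x)
    m = beta * ph @ S @ Phi.T @ t_train                       # eq. (1.70)
    s2 = 1.0 / beta + np.einsum('...i,ij,...j->...', ph, S, ph)  # eq. (1.71)
    return m, s2

m, s2 = predictive(np.array([0.25, 0.5, 0.75]))
print(m, np.sqrt(s2))    # predictive mean and standard deviation
```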
3. Excerpt 3: A Chinese blogger's explanation (2)
https://www.codetd.com/article/10631869
This one is very well written and tidy, and ties together the two excerpts above.
In Section 1.2.6 of Chapter 1, we have equation (1.68):
$$p(t | x, \mathbf{x}, \mathbf{t})=\int p(t | x, \boldsymbol{w})\, p(\boldsymbol{w} | \mathbf{x}, \mathbf{t})\, \mathrm{d}\boldsymbol{w}$$
This formula is Bayesian inference for the regression \(t=y(x,\boldsymbol{w})\): given a new input \(x\) (note the bold distinction: \(\mathbf{x}\) denotes the training inputs, which are known), we infer the posterior distribution of \(t\).
I have been slightly puzzled by this formula ever since reading MLAPP. The book passes over it in one line, but as a beginner I decided to derive it myself:
$$p(t | x, \mathbf{x}, \mathbf{t})=\int p(t,\boldsymbol{w}|x,\mathbf{x}, \mathbf{t})\,\mathrm{d}\boldsymbol{w}$$
while
$$p(t,\boldsymbol{w}|x,\mathbf{x}, \mathbf{t})=\frac{p(t,\boldsymbol{w},x,\mathbf{x}, \mathbf{t})}{p(x,\mathbf{x}, \mathbf{t})}$$
$$p(t | x, \boldsymbol{w})\, p(\boldsymbol{w} | \mathbf{x}, \mathbf{t}) =\frac{p(t, x, \boldsymbol{w})\,p(\boldsymbol{w}, \mathbf{x}, \mathbf{t})}{p(x, \boldsymbol{w})\,p(\mathbf{x}, \mathbf{t})}$$
So the goal is to prove
$$\frac{p(t,\boldsymbol{w},x,\mathbf{x}, \mathbf{t})}{p(x,\mathbf{x}, \mathbf{t})}=\frac{p(t, x, \boldsymbol{w})\,p(\boldsymbol{w}, \mathbf{x}, \mathbf{t})}{p(x, \boldsymbol{w})\,p(\mathbf{x}, \mathbf{t})}$$
The equivalence is not exactly self-evident, is it? =皿=
In fact, this step uses a couple of conditional-independence properties.
First, by the product rule (no independence assumption needed yet), \(p(t,\boldsymbol{w}|x,\mathbf{x}, \mathbf{t})=p(t|\boldsymbol{w},x,\mathbf{x}, \mathbf{t})\,p(\boldsymbol{w}|x,\mathbf{x}, \mathbf{t})\).
Clearly \(p(\boldsymbol{w}|x,\mathbf{x}, \mathbf{t})=p(\boldsymbol{w}|\mathbf{x}, \mathbf{t})\): \(x\) is a new sample with no observed target, so it cannot affect the posterior over \(\boldsymbol{w}\).
And \(p(t|\boldsymbol{w},x,\mathbf{x}, \mathbf{t})=p(t|x,\boldsymbol{w})\): the training data \((\mathbf{x}, \mathbf{t})\) influence \(t\) only through \(\boldsymbol{w}\), so once the link \(\boldsymbol{w}\) connecting them is given, \(t\) is conditionally independent of \((\mathbf{x}, \mathbf{t})\).
Thus we obtain
$$p(t | x, \mathbf{x}, \mathbf{t})=\int p(t | x, \boldsymbol{w})\, p(\boldsymbol{w} | \mathbf{x}, \mathbf{t})\, \mathrm{d}\boldsymbol{w}$$
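In a conjugate Gaussian model every factor in (1.68) is available in closed form, so the identity can also be verified numerically: evaluate the right-hand side by quadrature over \(\boldsymbol{w}\) and compare it with the known closed-form predictive on the left. A sketch for an assumed 1-D toy model (all names and values are illustrative):

```python
import numpy as np
from scipy.stats import norm

# 1-D toy model (assumed): prior w ~ N(0, 1/alpha), likelihood t|x,w ~ N(w*x, 1/beta)
alpha, beta = 2.0, 25.0
rng = np.random.default_rng(2)
x_train = rng.uniform(-1, 1, 20)
t_train = 0.7 * x_train + rng.normal(0, 1 / np.sqrt(beta), 20)

# Conjugate posterior p(w | x_train, t_train) is Gaussian N(mu, S)
S = 1.0 / (alpha + beta * np.sum(x_train**2))
mu = beta * S * np.sum(x_train * t_train)

x_new, t_new = 0.4, 0.3
# Right-hand side of (1.68): integrate p(t|x,w) p(w|x,t) over a grid of w
w = np.linspace(mu - 8 * np.sqrt(S), mu + 8 * np.sqrt(S), 20_001)
integrand = norm.pdf(t_new, loc=w * x_new, scale=1 / np.sqrt(beta)) \
          * norm.pdf(w, loc=mu, scale=np.sqrt(S))
rhs = np.trapz(integrand, w)

# Left-hand side: the known Gaussian predictive N(t | mu*x, 1/beta + x^2 S)
lhs = norm.pdf(t_new, loc=mu * x_new, scale=np.sqrt(1 / beta + x_new**2 * S))
print(lhs, rhs)   # agree to quadrature accuracy
```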
Section 1.5.1 gives the formula for the probability of misclassification:
$$\begin{aligned}p(\text{mistake}) &=p\left(\boldsymbol{x} \in \mathcal{R}_{1}, \mathcal{C}_{2}\right)+p\left(\boldsymbol{x} \in \mathcal{R}_{2}, \mathcal{C}_{1}\right) \\&=\int_{\mathcal{R}_{1}} p\left(\boldsymbol{x}, \mathcal{C}_{2}\right) \mathrm{d}\boldsymbol{x}+\int_{\mathcal{R}_{2}} p\left(\boldsymbol{x}, \mathcal{C}_{1}\right) \mathrm{d}\boldsymbol{x}\end{aligned}$$
The book directly states the conclusion: to minimize the misclassification rate, each \(\boldsymbol{x}\) should be assigned to the class with the largest posterior probability \(p(\mathcal{C}_k|\boldsymbol{x})\).
The derivation is as follows:
For \(\mathcal{R}_{1}, \mathcal{R}_{2}\) to be optimal, it suffices that their error probability is no larger than that under any other decision regions \(\mathcal{R}_{1}', \mathcal{R}_{2}'\):
$$\begin{aligned}p(\text{mistake}) &=p\left(\boldsymbol{x} \in \mathcal{R}_{1}, \mathcal{C}_{2}\right)+p\left(\boldsymbol{x} \in \mathcal{R}_{2}, \mathcal{C}_{1}\right) \\&=\int_{\mathcal{R}_{1}} p\left(\boldsymbol{x}, \mathcal{C}_{2}\right) \mathrm{d}\boldsymbol{x}+\int_{\mathcal{R}_{2}} p\left(\boldsymbol{x}, \mathcal{C}_{1}\right) \mathrm{d}\boldsymbol{x}\end{aligned}$$
$$\begin{aligned}p'(\text{mistake}) &=p\left(\boldsymbol{x} \in \mathcal{R}_{1}', \mathcal{C}_{2}\right)+p\left(\boldsymbol{x} \in \mathcal{R}_{2}', \mathcal{C}_{1}\right) \\&=\int_{\mathcal{R}_{1}'} p\left(\boldsymbol{x}, \mathcal{C}_{2}\right) \mathrm{d}\boldsymbol{x}+\int_{\mathcal{R}_{2}'} p\left(\boldsymbol{x}, \mathcal{C}_{1}\right) \mathrm{d}\boldsymbol{x}\end{aligned}$$
Subtracting the two (the contributions from the overlaps \(\mathcal{R}_{1}\cap\mathcal{R}_{1}'\) and \(\mathcal{R}_{2}\cap\mathcal{R}_{2}'\) cancel), we get
$$p(\text{mistake})-p'(\text{mistake}) =\int_{\mathcal{R}_{1}\cap \mathcal{R}_{2}'} \left(p\left(\boldsymbol{x}, \mathcal{C}_{2}\right)-p\left(\boldsymbol{x}, \mathcal{C}_{1}\right)\right)\mathrm{d}\boldsymbol{x}+\int_{\mathcal{R}_{2}\cap \mathcal{R}_{1}'} \left(p\left(\boldsymbol{x}, \mathcal{C}_{1}\right)-p\left(\boldsymbol{x}, \mathcal{C}_{2}\right)\right)\mathrm{d}\boldsymbol{x}$$
Then we only need:
\(p\left(\boldsymbol{x}, \mathcal{C}_{2}\right)-p\left(\boldsymbol{x}, \mathcal{C}_{1}\right) \le 0\) to hold on every possible \(\mathcal{R}_{1}\cap \mathcal{R}_{2}'\);
\(p\left(\boldsymbol{x}, \mathcal{C}_{1}\right)-p\left(\boldsymbol{x}, \mathcal{C}_{2}\right) \le 0\) to hold on every possible \(\mathcal{R}_{2}\cap \mathcal{R}_{1}'\).
Since \(p\left(\boldsymbol{x}\right)\) is the same in both terms (writing \(p(\boldsymbol{x},\mathcal{C}_k)=p(\mathcal{C}_k|\boldsymbol{x})\,p(\boldsymbol{x})\)), the two conditions above are equivalent to:
\(p\left(\mathcal{C}_{2}|\boldsymbol{x}\right)-p\left(\mathcal{C}_{1}|\boldsymbol{x}\right) \le 0\) on every possible \(\mathcal{R}_{1}\cap \mathcal{R}_{2}'\);
\(p\left(\mathcal{C}_{1}|\boldsymbol{x}\right)-p\left(\mathcal{C}_{2}|\boldsymbol{x}\right) \le 0\) on every possible \(\mathcal{R}_{2}\cap \mathcal{R}_{1}'\).
Since \(\mathcal{R}_{1}'\) and \(\mathcal{R}_{2}'\) are arbitrary, ranging over every possible \(\mathcal{R}_{1}\cap \mathcal{R}_{2}'\) covers all of \(\mathcal{R}_{1}\), and ranging over every possible \(\mathcal{R}_{2}\cap \mathcal{R}_{1}'\) covers all of \(\mathcal{R}_{2}\).
So the optimal assignment rule is: assign \(\boldsymbol{x}\) to class 1 where \(p\left(\mathcal{C}_{2}|\boldsymbol{x}\right) \le p\left(\mathcal{C}_{1}|\boldsymbol{x}\right)\), and to class 2 where \(p\left(\mathcal{C}_{1}|\boldsymbol{x}\right) \le p\left(\mathcal{C}_{2}|\boldsymbol{x}\right)\), i.e. always to the class with the larger posterior probability, exactly as the book states. A small numeric check follows below.
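As that check, here is a toy setup (assumed for illustration): two 1-D Gaussian class-conditional densities with equal priors. The boundary where the posteriors cross minimizes \(p(\text{mistake})\); any shifted boundary does worse:

```python
import numpy as np
from scipy.stats import norm

# Toy two-class problem (assumed values): equal priors, Gaussian class densities
pri1, pri2 = 0.5, 0.5
cls1, cls2 = norm(-1, 1), norm(1, 1)     # p(x|C1), p(x|C2)

def p_mistake(b):
    """Error when R1 = (-inf, b] is assigned to C1 and R2 = (b, inf) to C2."""
    return pri2 * cls2.cdf(b) + pri1 * (1 - cls1.cdf(b))

# With equal priors the posteriors cross where the densities are equal: b = 0
print("Bayes boundary:", p_mistake(0.0))
for b in (-0.5, 0.3, 1.0):
    print("boundary", b, ":", p_mistake(b))   # strictly larger error each time
```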