Pattern recognition and machine learning 疑难处汇总

不断更新ing。。。。。。。。。

p141 para 1. 当一个x对应的t值不止一个时，Gaussian nosie assumption就不合适了。因为Gaussian 是unimodal的，这意味着一个x只能对应一个t.
p143 section 3.1.2. 解释下本节的一些难懂的细节。首先，作者假设存在一个 N 维的space, 而\(\mathbf{t}\)的每个元素相当于在此space的坐标轴下的系数，所以N维的\(\mathbf{t}\)位于此space中，而且N维的\(\mathbf{y}\)也会在其中。其次，\(\Phi\) 的每一列位于一个M维的subspace中（假设M<N）。那么问题来了，为啥作者说\(\mathbf{y}\)会位于M维的subspace中呢? 这是因为\( \mathbf{y}=\Phi \mathbf{w} \)，而矩阵与向量乘等价于矩阵的每一列乘以向量的一个元素再求和，也就是\(\mathbf{y}\)是\(\Phi\)的线性组合，这个线性组合的系数由\(\mathbf{w}\)的元素确定: \(\mathbf{y}=\Phi \mathbf{w}=\sum_{m=1}^{M} w_m \Phi_m \)。最后，至于\(\mathbf{y}-\mathbf{t}\)为啥会垂直于M维subspace，请参考习题3.2。
p194 para 1. 本段提到“The total error function is therefore piecewise linear”，那么问题来了，非连续的(4.54)如何求导呢? 解决的方法是拆开(4.54)的\(\sum_{n}\)，而某一个错分类的training sample对应的error是连续的，也就是文中提到的“The contribution to the error associated with a particular misclassified pattern is a linear function of w in regions of w space where the pattern is misclassified and zero in regions where it is correctly classified." 所以，感知机的训练不是采用batch模式而是采用逐sample updata weight的方式。
p200 equation (4.71). 这个likelihood的形式初看比较奇怪，怎么是两个随机变量\((\mathbf{t}, X)\)的joint distr的形式\(p(\mathbf{t}, X| \pi,\mu_1,\mu_2,\Sigma)\)？回忆linear regression的likelihood\(p(\mathbf{t}| \cdot)\)和section 4.3 discriminative model的\(p(\mathbf{t}|\cdot)\)，纵观全书貌似也只有这一个特例。但细想之下eq(4.71)却也合理，因为这个likelihood也写成\(p(\mathbf{t} | X,\pi,\mu_1,\mu_2,\Sigma)\)的形式，但这需要将\(p(\mathbf{x}_n, C_1)\)换成\(\frac{p(\mathbf{x}_n, C_1)}{\sum_{i=1}^{2} p(x|C_i)p(C_i)}\), 这增加了求解的难度，并不值得。那么为啥linear regression和section 4.3 discriminative model的likelihood要写成\(p(\mathbf{t}|\cdot)\)的形式呢?这是因为这俩模型其实都是discriminative的模型，\(\mathbf{t}\)都可以写成一个自变量为\(\mathbf{x}\)的参数函数形式,如此以来\(p(\mathbf{t}|\cdot)\)的形式更为简单。
p206 para 4. 首先解释下决策面。当使用logistic sigmoid function \(\sigma(a)=p(C_1|\mathbf{x})\)时，由Section 1.5知最优的决策面应该满足条件：\(p(C_1|\mathbf{x})= p(C_2|\mathbf{x}) \) 也即：\(\sigma=1-\sigma\)。从Fig 4.10可以看出，此时\(\sigma=0.5\) (注意：无论是两分类还是多分类\(\sigma\)始终为标量)，进而有\(a=\mathbf{w^T}\mathbf{x}=0\)；其次，解释下为啥\(\mathbf{w}\) goes to infinity （参考习题4.14）. 当min(4.90) w.r.t \(y_n\)时 \(\arg\min_y E(\mathbf{w}) \Rightarrow y_n = t_n \) 这意味着 logistic sigmoid function saturated 也就是对于任意的\(\mathbf{x}\)， \(\sigma(a)=\sigma(\mathbf{w^T}\mathbf{x})\)只会输出0 or 1, 而这需要\(\mathbf{w}\rightarrow -\infty\) or \(+\infty\).
p238 equation(5.31)，这个公式是将(5.28)两边对\(\mathbf{w}\)求导而得到。
p242,p243 eq(5.47) eq(5.48)。使用反向传播算法时，output unit上的error \(\delta\) 对于三种error function(线性回归的sum of square error, 两分类的cross entropy error, 多分类的error)，三种output unit active function (线性回归的linear active, 两分类的sigmoid logistic, 多分类的softmax) 都是相同的=eq.(5.48)。作者在4.3.6节用link functions的概念证明了这一点。参考eq.(3.13)(4.91)(4.124)(5.18)和练习5.6。
p431 equation(9.12)，这个公式可谓简约而不简单，先从直觉上感受下：通过构建一个包含latent variable的图模型，自然的推倒出了(9.7)定义的mixture Gaussian；再来看推导过程，从\(\prod_{k=1}^{K}\)变成了\(\sum_{k=1}^{K}\)，利用了hot-coding的编码规则也就是只有一个z_i不为0，这里感觉很精妙。
p445 exercise 9.12 有点复杂，看了答案又用了一个冷门的law of iterated expectations定理后才得到解答。请参考习题答案，并注意到：\( E[xx^T]=E[E[xx^T|k]]=\sum_{k=1}^{K} \pi_k ({\Sigma_k+\mu_k \mu_k^T}) \)，其中第一个等号用到了law of iterated expectations 定理，两个\(EE\)，第一个\(E\)是相对\(k\)，第二个\(E\)是相对\( xx^T \)，条件期望\(E[xx^T|k]\)也是随机变量（随k的不同而不同），而conditioned on k 时，原本是mixing的x（\( x=\sum_{k=1}^{K} \pi_k x_k\)）变成了只有一个component k的\(x_k\)；第二个等号用到了\( cov[x]=E[xx^T]-E[x]E[x]^T \rightarrow E[xx^T]=cov[x]+E[x]E[x]^T \)。ref http://math.stackexchange.com/questions/247795/what-is-the-covariance-of-mixture-bernoulli-distribution
P465 公式(10.7)的const怎么来的？参考了下MLAPP P737 (21.28) 感觉PRML（10.7）的const 是个place holder，以使得后面推到要用的归一化分母有个安身之所。具体说：10.7)中的这个const是不依赖于\(z_j\)的部分，也就是说(10.7)把依赖于z_j的部分分到等式右边的第一个term，不依赖于z_j的部分分到了第二个term。这么做得目的是为了在后续推导（10.9）时，让\(q_j\)的normalization constant有个place holder。
P467, para -2, line -6, nonsingular. 这里的nonsingular是指高斯分布\(p(X)\)的精度\(\Lambda \neq 0\).
P473, Section 10.1.4. 本节的latent var包括了\(Z\) 和 \(m\)，而且因为\(p(Z,m) \neq p(Z)p(m)\)，所以(10.9)的结论不能用了，需要重新推导一种Variational inference. 此时Max (10.35) with respect to \(q(m)\) 或 \(q(Z|m)\) 会使得（10.34）的第二个term趋于0，也即\(q(Z|m)q(m) \approx p(Z,m|X)\). 本节（10.36）是一个重要的结论，它提示我们可以用Lower Bound \(L_m\)来直接做model comparision，例如10.2.4, para 3 、10.3.3 、12.2.3, para 3 都用lower bound来比较模型。
p540 - p541. 作者在11.2.1和11.2.2两节里，讲解的过于精简，所以我在这里做一个markov chains & Metropolis-Hastings alg小议。其一：markov chain的背景知识：设状态空间为\({k=1...K}\)，markov chain由初始分布\(p(z^0)\) 和条件转移矩阵\( T \)唯一确定。其中，\(p(z)\)是向量，每个元素对应状态k的概率\(p(z=k)\)，\( T \)是矩阵，\(T_{i,j}\)是从状态i到状态j转移的条件概率\(p(i|j)\)。在状态空间上可以定义无数的概率分布，但ergodicity的markov chain可以唯一确定一个stationary distribution \(p^*(z)\)，这个分布有个特殊性质也就是从任意的初始分布\( p(z^0)\)出发，一定会在某个时间m，\(p(z^m)\)达到stationary distribution \(p(z^m)=p^*(z)\) ，而且不管以后的再怎么的进行状态转移，\(p(z^m+1)...=p^*(z)\)。另外，ergodicity的markov chain中stationary distribution和状态转移矩阵是相互依赖的，满足detailed balance关系，也就是说确定了状态转移矩阵就能确定stationary distribution，反过来也成立。其二：回到我们要解决的问题：draw samples from \( p(z) \)。MCMC方法是通过constructing a 特定的 markov chain，使得此markov chain 对应唯一一个 stationary distribution，且这个 stationary distribution等于我们想draw samples的分布也就是\( p^*(z) = p(z) \)。注意，当确定了 stationary distribution时，此markov chain的条件转移矩阵也相应的确定了。这样我们就可以proposal 任意的samples，让这些samples按照确定的条件转移矩阵慢慢的转移，最终转移好的samples的分布会和stationary distribution一致，也就和draw samples from 的 \( p(z) \)一致了。其三：公式(11.44)中的A其实就是条件转移矩阵的变身，从（11.45）中知\(q_k(z|z') A_k(z,z')\)就是条件转移矩阵。
P564, (12.14)后面第一句. 因为，\({u_1, \dots, u_M}\)撑起principal subspace, \({u_{M+1}, \dots, u_D}\)撑起与PC正交的subspace，且\(x_n - \widetilde{x}\) is a linear combination of \(u_i\) for \(i=M+1,\dots, D\)也即\(x_n - \widetilde{x}\) 在\({u_{M+1}, \dots, u_D}\)撑起的subspace，所以\(x_n - \widetilde{x}\) 与PC正交。
p575, para 2, last sentence: 作者写的真是简明扼要，下面我阐述下这句话。先总结下本段的中心思想，本段的主要目的是：给columns of \(W_{ML}\)一个直观上的解释，这个解释就是 For the particular case of R = I, we see that the columns of W are the principal component eigenvectors （\(u_i\)）scaled by the variance parameters \(\lambda_i − \sigma^2\)，这是从（12.45）出发给出的解释。注意本章很多地方都讨论了columns of \(W_{ML}\)，而本段就是讨论这个重要的列的性质。由（12.34）和（2.115）(高斯间的卷积)知: margin \(p(x)\)的 \(cov[x]=WIW^T+\sigma^2 I=WW^T+\sigma^2 I\)，这也即是书中提到的the variances are additive. 而x在方向u_i上的方差为\(Var[u_i^T x]=Cov[u_i^T x]=u_i^T Cov[x] u_i\)，（由高斯的线性变换的性质得到），将(12.45)求得的\(W_{ML}\)代入 \(cov[x]\)有：\(cov[x]=U(L-\sigma^2I)^\frac{1}{2} (L-\sigma^2I)^\frac{1}{2} U^T + \sigma^2I = U (L-\sigma^2I) U^T + \sigma^2I\)。再将\(cov[x]\)代入\(Var[u_i^T x]\)有：\(Var[u_i^T x]= Cov[u_i^T x]=u_i^T Cov[x] u_i = u_i^T\{WIW^T+\sigma^2 I\}u_i= \{u_i^T W I W^T u_i\} +\{u_i^T {\sigma^2I} u_i \} =u_i^T \{U(L-\sigma^2I)U^T+\sigma^2I\} u_i = u_i^T {U(L-\sigma^2I)U^T} u_i + u_i^T {\sigma^2I} u_i = (L_{ii} - \sigma^2) + \sigma^2 = L_{ii} = \lambda_i\).
接上，即有：\(Var[u_i^T x]=(\lambda_i - \sigma^2) + \sigma^2 \)。书中提到Thus the variance λi in the direction of an eigenvector ui is composed of the sum of a contribution λi − σ2 from the projection of the unit-variance latent space distribution into data space through the corresponding column of W, plus an isotropic contribution of variance σ2 which is added in all directions by the noise model. 这里的variance \( \lambda_i \) in the direction of an eigenvector \(u_i\) 就是\(Var[u_i^T x]=(\lambda_i - \sigma^2) + \sigma^2 \); 这里的a contribution \(λ_i − \sigma^2\) from the projection of the unit-variance latent space distribution into data space through the corresponding column of W，可用公式表达为\( u_i^T W I W^T u_i = (\lambda_i - \sigma^2) \)；这里的plus an isotropic contribution of variance σ2 which is added in all directions by the noise model，可用公式表达为\(u_i^T {\sigma^2I} u_i = \sigma^2\)；把上两项相加就有\(Var[u_i^T x]=(\lambda_i - \sigma^2) + \sigma^2 \).
p631, equation(13.72), 此公式的推倒可参考equation(4.107)，设序列集合\(X\)对应的类别集合为\(T\), 则\(p(T|X)=\prod_{r=1}^{R} \prod_{m=1}^{M} p(m_r|X_r)^{t_{rm}}\), 取log后有\(\ln p(T|X)=\sum_{r=1}^{R}\sum_{m=1}^{M} t_{rm} p(m_r|X_r)=\sum_{r=1}^{R} \ln p(m_r|X_r)\)。推导得到的(13.72)非常类似与第四章中得(4.108)，不同之处是(4.108)的\(p(C_k|\phi_n)\)可用softmax函数来参数化得到，而这里的\(\ln p(m_r|X_r)\)则无参数化函数的近似，只能用bayes theorem来generative地写出公式(13.73)。

posted @ 2015-04-08 14:49 冷处理场烟囱阅读(452) 评论(0) 编辑收藏举报

刷新页面返回顶部

Pattern recognition and machine learning 疑难处汇总

公告