PRML 7: The EM Algorithm

 

 1. K-means Clustering: clustering can be regarded as a special parameter-estimation problem with latent variables. K-means performs a hard assignment of data points to clusters, in contrast to the Gaussian Mixture Model introduced later (a NumPy sketch follows the steps below).

  (1) Initialization of $K$ mean vectors;

  (2) E Step (Expectation): assign each point to a cluster by

    $y_n=\mathop{argmin}_{C_k}||\vec{x}_n-\vec{\mu}_k||$;

  (3) M Step (Maximization): renew mean vectors by

    $\vec{\mu}_k^{new}=\frac{\sum_{n=1}^N I\{y_n=C_k\}\vec{x}_n}{\sum_{n=1}^N I\{y_n=C_k\}}$;

  (4) Repeat (2) and (3) until convergence.
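
  A minimal NumPy sketch of the four steps above (the function name, the initialization scheme, and the stopping rule are my own choices, not prescribed by the book):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Hard-assignment K-means on data X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    # (1) initialize the K mean vectors with randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # (2) E step: assign each point to the cluster with the nearest mean
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (N, K)
        y = dist.argmin(axis=1)
        # (3) M step: recompute each mean as the average of its assigned points
        mu_new = np.array([X[y == k].mean(axis=0) if np.any(y == k) else mu[k]
                           for k in range(K)])
        # (4) stop once the means no longer move
        if np.allclose(mu_new, mu):
            break
        mu = mu_new
    return mu, y
```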

 

 2. Gaussian Mixture Model: assume $p(\vec{x})=\sum_{k=1}^K\pi_k Gauss(\vec{x}\text{ | }\vec\mu_k,\Sigma_k)$ and that $\vec{x}_1,\vec{x}_2,...,\vec{x}_N$ are observed (a NumPy/SciPy sketch follows the steps below).

  (1) Initialization of all the parameters;

  (2) E Step (Expectation): calculate the responsibility of $\pi_k Gauss(\vec{x}_n\text{ | }\vec{\mu}_k,\Sigma_k)$ for $\vec{x}_n$ by

    $\gamma_{nk}=\frac{\pi_k\cdot Gauss(\vec{x}_n\text{ | }\vec{\mu}_k,\Sigma_k)}{\sum_{i=1}^K\pi_i\cdot Gauss(\vec{x}_n\text{ | }\vec{\mu}_i,\Sigma_i)}$;

  (3) M Step (Maximization): re-estimate the parameters by

    $\vec{\mu}_k^{new}=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\cdot\vec{x}_n$,

    $\Sigma_k^{new}=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}\cdot(\vec{x}_n-\vec{\mu}_k^{new})(\vec{x}_n-\vec{\mu}_k^{new})^T$,

    $\pi_k^{new}=N_k/N$,  where $N_k=\sum_{n=1}^N\gamma_{nk}$;

  (4) Repeat (2) and (3) until convergence.
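
  A NumPy/SciPy sketch of this EM loop (again my own illustration; the initialization and the small regularization term added to the covariances are assumptions of the sketch):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a K-component Gaussian mixture on data X of shape (N, D)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # (1) initialize mixing weights, means, and covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # (2) E step: responsibility gamma[n, k] of component k for point x_n
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                         for k in range(K)], axis=1)            # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # (3) M step: re-estimate the parameters from the responsibilities
        Nk = gamma.sum(axis=0)                                  # effective counts
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, sigma, gamma
```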

 

 3. Forward-Backward Algorithm: a Hidden Markov Model (HMM) is a 3-tuple $\lambda=(A,B,\vec{\pi})$, where $A\in\mathbb{R}^{N\times N}$ is the state transition matrix, $B\in\mathbb{R}^{N\times M}$ is the observation probability matrix, and $\vec{\pi}\in\mathbb{R}^{N\times 1}$ is the initial state probability vector. An HMM assumes that the state at any time depends only on the previous state, and that the observation at any time depends only on the current state. Evaluating $p(O\text{ | }\lambda)=\sum_{I}p(O\text{ | }I,\lambda)\cdot p(I\text{ | }\lambda)$ by enumerating all state sequences $I$ is too computationally expensive, so we use either the forward algorithm or the backward algorithm for HMM evaluation instead (a NumPy sketch of the procedures in this section follows the list below).

  (1) Forward Algorithm: we calculate $\alpha_t(i)=p(o_1,o_2,...,o_t,i_t=q_i\text{ | }\lambda)$ by

    $\alpha_1(i)=\pi_i b_i(o_1)$ and $\alpha_t(i)=[\sum_{j=1}^N\alpha_{t-1}(j)A_{ji}]b_i(o_t) \text{ }(t>1)$,

    then we get $p(O\text{ | }\lambda)=\sum_{i=1}^N\alpha_T(i)$;

  (2) Backward Algorithm: we calculate $\beta_t(i)=p(o_{t+1},o_{t+2},...,o_{T}\text{ | }i_t=q_i,\lambda)$ by

    $\beta_T(i)=1$ and $\beta_t(i)=\sum_{j=1}^N A_{ij}b_j(o_{t+1})\beta_{t+1}(j)\text{ }(t<T)$,

    then we get $p(O\text{ | }\lambda)=\sum_{i=1}^N\pi_i b_i(o_1)\beta_1(i)$;

  (3) Viterbi Decoding: we define $V_t(i)=\mathop{max }_{i_1,i_2,...,i_{t-1}}p(o_1,o_2,...,o_t,i_1,i_2,...,i_{t-1},i_t=q_i\text{ | }\lambda)$ and calculate it by

    $V_1(j)=\pi_j b_j(o_1)$  and  $\begin{cases}\phi_t(j)=\mathop{argmax}_i V_{t-1}(i)A_{ij} \\ V_t(j)=A_{\phi_t(j),j}V_{t-1}(\phi_t(j))\cdot b_j(o_t) \end{cases}$ for $t>1$;

    Then the most probable state sequence can be recovered by back-tracking:

    $q_T^{*}=\mathop{argmax}_j V_T(j)$  and  $q_t^{*}=\phi_{t+1}(q_{t+1}^{*})$ for $t<T$,  while $\mathop{max}_j V_T(j)$ gives the likelihood of that optimal sequence.
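
  A NumPy sketch of the forward, backward, and Viterbi procedures above (my own illustration; `obs` is assumed to be an integer array of observation indices, and `A`, `B`, `pi` follow the notation of this list):

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, i] = p(o_1, ..., o_t, i_t = q_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha                 # p(O | lambda) = alpha[-1].sum()

def backward(A, B, pi, obs):
    """beta[t, i] = p(o_{t+1}, ..., o_T | i_t = q_i, lambda)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta                  # p(O | lambda) = (pi * B[:, obs[0]] * beta[0]).sum()

def viterbi(A, B, pi, obs):
    """Most probable state sequence via the max-product recursion and back-tracking."""
    T, N = len(obs), len(pi)
    V = np.zeros((T, N))
    phi = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] * A          # scores[i, j] = V_{t-1}(i) * A_{ij}
        phi[t] = scores.argmax(axis=0)
        V[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)               # back-track the optimal path
    path[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = phi[t + 1, path[t + 1]]
    return path, V[-1].max()                    # V[-1].max() is the sequence likelihood
```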

  A more general concept is the Probabilistic Graphical Model (PGM), which specifies both a factorization of the joint distribution and a set of conditional independence relations. A PGM can be either (1) a directed acyclic graph, a.k.a. a Bayesian network, or (2) an undirected graph, a.k.a. a Markov network. HMMs and neural networks are special cases of Bayesian networks.

 

 4. Baum-Welch Algorithm: we treat $O$ as the observed variables and $I$ as the latent variables (a NumPy sketch of one iteration follows the steps below).

  (1) Initialization of all the parameters;

  (2) E Step (Expectation): use forward-backward algorithm to calculate

    $\gamma_t(i)=p(i_t=q_i\text{ | }O,\lambda)=\frac{\alpha_t(i)\beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\beta_t(j)}$  and

    $\xi_t(i,j)=p(i_t=q_i\wedge i_{t+1}=q_j\text{ | }O,\lambda)=\frac{\alpha_t(i)A_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^N\sum_{j=1}^N\alpha_t(i)A_{ij}b_j(o_{t+1})\beta_{t+1}(j)}$;

  (3) M Step (Maximization): re-estimate the parameters by

    $A_{ij}^{new}=[\sum_{t=1}^{T-1}\xi_t(i,j)]/[\sum_{t=1}^{T-1}\gamma_t(i)]$,

    $b_j(k)^{new}=[\sum_{t=1}^T I\{o_t=\nu_k\}\cdot\gamma_t(j)]/[\sum_{t=1}^T\gamma_t(j)]$,

    $\pi_i^{new}=\gamma_1(i)$;

  (4) Repeat (2) and (3) until convergence.
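
  A NumPy sketch of one Baum-Welch iteration (my own illustration; `alpha` and `beta` are assumed to be the tables produced by the forward and backward recursions of section 3, and `obs` an integer array of observation indices):

```python
import numpy as np

def baum_welch_step(A, B, obs, alpha, beta):
    """One E+M iteration of Baum-Welch; alpha and beta come from the
    forward and backward recursions sketched in section 3."""
    obs = np.asarray(obs)
    # E step: gamma[t, i] = p(i_t = q_i | O, lambda)
    gamma = alpha * beta
    gamma = gamma / gamma.sum(axis=1, keepdims=True)
    # E step: xi[t, i, j] = p(i_t = q_i, i_{t+1} = q_j | O, lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, obs[1:]].T[:, None, :] * beta[1:, None, :])
    xi = xi / xi.sum(axis=(1, 2), keepdims=True)
    # M step: expected transition counts over expected occupancy counts
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    return A_new, B_new, pi_new
```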

 

 5. EM Algorithm in general: given observed data $X$ and its joint distribution with latent data $Z$ as $p(X,Z\text{ | }\vec{\theta})$, where $\vec{\theta}$ denotes the unknown parameters, we carry out the following steps to maximize the likelihood $p(X\text{ | }\vec{\theta})$ (an abstract Python skeleton of this loop follows the steps below).

  (1) Initialization of parameters $\vec{\theta}^{(0)}$;

  (2) E Step (Expectation): given $\vec{\theta}^{(i)}$, we estimate $q(Z)=p(Z\text{ | }X,\vec{\theta}^{(i)})$;

  (3) M Step (Maximization): re-estimate $\vec{\theta}^{(i+1)}=\mathop{argmax}_{\vec\theta}\sum_Z q(Z)ln{p(X,Z\text{ | }\vec{\theta})}$;

  (4) Repeat (2) and (3) until convergence.
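
  An abstract Python skeleton of this loop (purely illustrative; the callables `e_step` and `m_step` and the flat parameter vector are assumptions of the sketch, since their concrete form depends on the model):

```python
import numpy as np

def em(theta0, e_step, m_step, n_iter=100, tol=1e-8):
    """Generic EM loop.  e_step(theta) should return q(Z) = p(Z | X, theta);
    m_step(q) should return argmax_theta sum_Z q(Z) ln p(X, Z | theta)."""
    theta = np.asarray(theta0, dtype=float)              # (1) initial parameters
    for _ in range(n_iter):
        q = e_step(theta)                                # (2) E step: posterior over Z
        theta_new = np.asarray(m_step(q), dtype=float)   # (3) M step
        if np.max(np.abs(theta_new - theta)) < tol:      # (4) convergence check
            return theta_new
        theta = theta_new
    return theta
```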

 

  For a detailed proof of the correctness of this algorithm, please refer to JerryLead's blog.

  In brief, our objective is to maximize $ln{p(X\text{ | }\vec{\theta})}=Q(\vec{\theta},\vec{\theta}^{(i)})-H(\vec{\theta},\vec{\theta}^{(i)})$,  where

   $Q(\vec{\theta},\vec{\theta}^{(i)})=\sum_Z p(Z\text{ | }X,\vec{\theta}^{(i)})ln{p(X,Z\text{ | }\vec{\theta})}$,  $H(\vec{\theta},\vec{\theta}^{(i)})=\sum_Z p(Z\text{ | }X,\vec{\theta}^{(i)})ln{p(Z\text{ | }X,\vec{\theta})}$.

  Since $H(\vec{\theta},\vec{\theta}^{(i)})\leq H(\vec{\theta}^{(i)},\vec{\theta}^{(i)})$ for any $\vec{\theta}$ (by the non-negativity of the KL divergence, which follows from Jensen's inequality), to make $ln{p(X\text{ | }\vec{\theta})}$ larger it suffices to choose $\vec{\theta}^{(i+1)}$ such that $Q(\vec{\theta}^{(i+1)},\vec{\theta}^{(i)})\geq Q(\vec{\theta}^{(i)},\vec{\theta}^{(i)})$.
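
  Explicitly, the $H$ term cannot increase because, for any $\vec{\theta}$,

   $H(\vec{\theta}^{(i)},\vec{\theta}^{(i)})-H(\vec{\theta},\vec{\theta}^{(i)})=\sum_Z p(Z\text{ | }X,\vec{\theta}^{(i)})ln{\frac{p(Z\text{ | }X,\vec{\theta}^{(i)})}{p(Z\text{ | }X,\vec{\theta})}}=KL(\,p(Z\text{ | }X,\vec{\theta}^{(i)})\,||\,p(Z\text{ | }X,\vec{\theta})\,)\geq 0$,

  which in particular gives $H(\vec{\theta}^{(i+1)},\vec{\theta}^{(i)})\leq H(\vec{\theta}^{(i)},\vec{\theta}^{(i)})$.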

 

 

References:

  1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006.

  2. Li, Hang. 统计学习方法 (Statistical Learning Methods) [M]. Beijing: Tsinghua University Press, 2012.

 

posted on 2015-06-17 23:58  DevinZ