pgm12

As a summary of the inference part, we review here the inference problems of several common models in machine learning (HMM, LDS, MEMM, and CRF). Fortunately, all of them admit tractable algorithms, so we are spared the trouble of approximate inference.

HMM

The HMM in its usual sense is a generative model for discrete states and discrete observations. It consists of

  • the prior distribution over states \pi_k (in the derivations below we can absorb it into the transition probabilities)
  • the transition probability \tau (k, k'), a distribution over k'
  • the emission probability e (k, v), a distribution over v

The implicit assumptions of this model are

  • Markovian property: \Pr (s_t \mid s_1, \ldots, s_{t - 1}) = \Pr (s_t \mid s_{t - 1})
  • time-invariance: \Pr (s_t = k' \mid s_{t - 1} = k) = \tau(k, k') does not depend on t

Hence the joint distribution is

\displaystyle \Pr (o_{1, \ldots, t}, s_{0, \ldots, t}) = \Pr(s_0)\prod_{i = 1}^t \Pr (s_i \mid s_{i - 1}) \Pr (o_i \mid s_i),

where \Pr(s_0) = 1 (s_0 is a dummy start state), so it can be omitted. Below we discuss message passing, belief update, and a few common inference problems for this model.
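As a concrete illustration, here is a minimal sketch of this tabular parameterization and of evaluating the joint probability of a state/observation sequence; the parameter values and the names pi, tau, e are made up for the example, and the prior \pi is kept explicit rather than absorbed into a dummy s_0.

```python
import numpy as np

# Hypothetical HMM with K = 2 states and V = 3 observation symbols (values made up).
pi = np.array([0.6, 0.4])                 # prior pi_k over the first state s_1
tau = np.array([[0.7, 0.3],               # tau[k, k'] = Pr(s_i = k' | s_{i-1} = k)
                [0.2, 0.8]])
e = np.array([[0.5, 0.4, 0.1],            # e[k, v] = Pr(o_i = v | s_i = k)
              [0.1, 0.3, 0.6]])

def joint_prob(states, obs):
    """Pr(o_{1..t}, s_{1..t}) = pi_{s_1} e(s_1, o_1) prod_{i>1} tau(s_{i-1}, s_i) e(s_i, o_i)."""
    p = pi[states[0]] * e[states[0], obs[0]]
    for prev, cur, o in zip(states[:-1], states[1:], obs[1:]):
        p *= tau[prev, cur] * e[cur, o]
    return p

print(joint_prob(states=[0, 1, 1], obs=[2, 0, 1]))
```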

Message passing requires a cluster graph, which here is in fact a clique tree. Its vertices are the clusters C_i, each binding o_i, s_{i-1}, s_i together, for i = 1, \ldots, t; the corresponding factors are \psi_i (s_i, s_{i-1}, o_i) = \Pr (s_i \mid s_{i - 1}) \Pr (o_i \mid s_i), i = 1, \ldots, t. We can then compute the forward messages,

\displaystyle\delta_{1\to 2} (s_1) = \sum_{o_1} \psi_1 (s_1, s_0, o_1), \quad \delta_{i \to i + 1} (s_i) = \sum_{o_i} \sum_{s_{i-1}} \delta_{i - 1\to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i)

where i = 2, \ldots, t - 1. The backward messages are

\displaystyle\delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} \sum_{o_t} \psi_t (s_t, s_{t - 1}, o_t) , \quad \delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \sum_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i)

where i = t - 1, \ldots, 2. Inspecting these messages carefully, we find that the forward message is in fact a marginal distribution,

\displaystyle \delta_{1\to 2} (s_1) = \sum_{o_1} \psi_1 (s_1, s_0, o_1) = \sum_{o_1} \Pr (s_1\mid s_0) \Pr (o_1 \mid s_1) = \Pr (s_1) = \pi_{s_1}

and we can keep substituting into the later messages,

\displaystyle\delta_{i\to i + 1} (s_i) = \sum_{s_{i - 1}} \sum_{o_i} \delta_{i - 1 \to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i)= \sum_{s_{i - 1}} \sum_{o_i} \Pr (s_{i - 1}) \Pr (s_i\mid s_{i - 1}) \Pr (o_i \mid s_i) = \Pr(s_i)

If the observations are given, i.e. o_{1, \ldots, i} is known, what we obtain instead is \Pr (s_i, o_{1, \ldots, i}). As for the backward messages,

\displaystyle\delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} \sum_{o_t} \psi_t (s_t, s_{t - 1}, o_t) = \sum_{s_t} \sum_{o_t} \Pr (s_t \mid s_{t - 1}) \Pr (o_t \mid s_t) \equiv 1

and substituting into the later messages gives

\displaystyle\delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \sum_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i) = \sum_{s_i} \delta_{i+1 \to i} (s_i) \Pr (s_i \mid s_{i - 1}) \equiv 1

so they are all constants. If the observations o_{i, \ldots, t} are given, the backward message instead yields \Pr (o_{i, \ldots, t} \mid s_{i - 1}).
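Continuing the sketch above (reusing the hypothetical pi, tau, e), the sum-product messages with the observations clamped, i.e. the quantities \Pr (s_i, o_{1, \ldots, i}) and \Pr (o_{i+1, \ldots, t} \mid s_i), can be computed roughly as follows; this is the classical forward-backward recursion.

```python
def forward_messages(obs):
    """Clamped forward messages: row i (0-based) holds Pr(s_{i+1} = k, o_1, ..., o_{i+1})."""
    t = len(obs)
    alpha = np.zeros((t, len(pi)))
    alpha[0] = pi * e[:, obs[0]]
    for i in range(1, t):
        alpha[i] = (alpha[i - 1] @ tau) * e[:, obs[i]]
    return alpha

def backward_messages(obs):
    """Clamped backward messages: row i (0-based) holds Pr(o_{i+2}, ..., o_t | s_{i+1} = k)."""
    t = len(obs)
    beta = np.ones((t, len(pi)))
    for i in range(t - 2, -1, -1):
        beta[i] = tau @ (e[:, obs[i + 1]] * beta[i + 1])
    return beta

obs = [2, 0, 1]
alpha, beta = forward_messages(obs), backward_messages(obs)
print(alpha[-1].sum())                  # likelihood Pr(o_1, ..., o_t)
print(alpha * beta / alpha[-1].sum())   # smoothed marginals Pr(s_i | o_1, ..., o_t)
```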

For MAP-type queries we need the max-product algorithm. The forward messages are then (writing e_k = \max_v e(k, v))

\displaystyle \delta_{1\to 2} (s_1) = \max_{o_1} \psi_1 (s_1, s_0, o_1) = \pi_{s_1} e_{s_1}

\displaystyle \delta_{i \to i + 1} (s_i) = \max_{o_i} \max_{s_{i-1}} \delta_{i - 1\to i} (s_{i - 1}) \psi_i (s_i, s_{i - 1}, o_i) = e_{s_i} \max_k \delta_{i-1\to i} (k) \tau(k, s_i)

and the backward messages are

\displaystyle \delta_{t\to t-1} (s_{t - 1}) = \max_{s_t} \max_{o_t} \psi_t (s_t, s_{t - 1}, o_t) = \max_k \tau(s_{t - 1}, k) e_k

\displaystyle \delta_{i\to i-1} (s_{i - 1}) = \max_{s_i} \max_{o_i} \delta_{i+1 \to i} (s_i) \psi_i (s_i, s_{i - 1}, o_i) = \max_k \delta_{i+1 \to i} (k) \tau (s_{i - 1}, k) e_k

For belief update, the beliefs are the marginal distributions over o_i, s_i, s_{i - 1}:

\displaystyle \beta_1 = \psi_1 \delta_{2\to 1} = \Pr (s_0)\Pr (s_1 \mid s_0) \Pr (o_1\mid s_1), \quad \beta_i = \psi_i\delta_{i-1\to i} \delta_{i+1\to i} = \psi_i \Pr (s_{i-1}) = \Pr (s_{i-1}, s_i, o_i), \quad \beta_t = \psi_t \delta_{t-1\to t} = \psi_t \Pr (s_{t-1}) = \Pr (s_{t - 1}, s_t, o_t)

and the corresponding belief update (sepset belief) is

\displaystyle \mu_{i, i+1} (s_i) = \delta_{i\to i + 1} (s_i) \delta_{i + 1\to i} (s_i) = \Pr (s_i)

The MAP-type counterparts can be derived similarly. Thus, for filtering, \Pr (s_i \mid o_{1, \ldots, i}) \propto \Pr (s_i, o_{1, \ldots, i}) is obtained by normalizing the forward message, while for prediction the probability

\displaystyle \Pr (s_{i + 1} \mid o_{1, \ldots, i}) = \sum_{s_i} \Pr (s_{i + 1}, s_i \mid o_{1, \ldots, i}) = \sum_{s_i} \Pr (s_{i + 1} \mid s_i) \Pr (s_i \mid o_{1, \ldots, i}) = \sum_k \tau (k, s_{i + 1}) \Pr (s_i = k \mid o_{1, \ldots, i})

is already normalized. Smoothing here means finding \arg\max \Pr (s_{0, \ldots, t} \mid o_{1, \ldots, t}), which is essentially \arg\max \Pr (s_{0, \ldots, t}, o_{1, \ldots, t}); using the two kinds of MAP-type messages directly yields two (Viterbi-style) algorithms.
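As a sketch of the MAP-type messages in the observed case (so the max over o_i disappears and the given o_i is simply plugged in), the forward max-product pass with back-pointers, again using the hypothetical pi, tau, e from above, looks roughly like this:

```python
def viterbi(obs):
    """Most probable state sequence: arg max over s_{1..t} of Pr(s_{1..t}, o_{1..t})."""
    t, K = len(obs), len(pi)
    delta = np.zeros((t, K))             # max-product forward messages (observations clamped)
    back = np.zeros((t, K), dtype=int)   # back-pointers recording the arg max at each step
    delta[0] = pi * e[:, obs[0]]
    for i in range(1, t):
        scores = delta[i - 1][:, None] * tau        # scores[k, k'] = delta[i-1, k] * tau(k, k')
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) * e[:, obs[i]]
    states = [int(delta[-1].argmax())]
    for i in range(t - 1, 0, -1):                   # back-substitute the recorded arg maxes
        states.append(int(back[i, states[-1]]))
    return states[::-1]

print(viterbi([2, 0, 1]))
```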

LDS

An LDS has the same graph structure as an HMM, but its states and observations are both continuous, and are therefore commonly modeled as Gaussians.

\displaystyle\Pr (o_{1, \ldots, t}, s_{0, \ldots, t})=\Pr (s_0) \prod_{i = 1}^{t} \Pr (s_i \mid s_{i - 1}) \Pr(o_i \mid s_i)

where

\displaystyle\Pr(s_0)\sim N(\cdot \mid\mu_0, \Gamma_0), \quad \Pr(s_i \mid s_{i - 1}) \sim N (\cdot \mid A s_{i - 1}, \Gamma), \quad \Pr (o_i \mid s_i) \sim N (\cdot \mid C s_i, \Sigma)

An equivalent description of these relations uses additive noise,

\displaystyle s_0 = \mu_0 + \epsilon_0, \epsilon_0 \sim N (\cdot \mid 0, \Gamma_0), \quad s_i = A s_{i - 1} + \epsilon_i, \epsilon_i \sim N (\cdot \mid 0, \Gamma), \quad o_i = C s_i + \xi_i, \xi_i \sim N (\cdot \mid 0, \Sigma).
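The additive-noise form makes forward sampling of a trajectory immediate. A minimal sketch, where the matrices A, C, Gamma, Sigma and the other values are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical LDS with a 2-D state and a 1-D observation (all values made up).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
mu0 = np.zeros(2)
Gamma0 = np.eye(2)
Gamma = 0.01 * np.eye(2)
Sigma = 0.1 * np.eye(1)

def sample_lds(t):
    """Draw s_0, ..., s_t and o_1, ..., o_t from the additive-noise model above."""
    s = mu0 + rng.multivariate_normal(np.zeros(2), Gamma0)
    states, obs = [s], []
    for _ in range(t):
        s = A @ s + rng.multivariate_normal(np.zeros(2), Gamma)
        o = C @ s + rng.multivariate_normal(np.zeros(1), Sigma)
        states.append(s)
        obs.append(o)
    return states, obs
```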

The clique tree is the same as before, and the forward messages are

\displaystyle \delta_{1\to 2} (s_1) = \iint N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid As_0, \Gamma) N (o_1 \mid C s_1, \Sigma) \, \mathrm{d} s_0 \mathrm{d} o_1 = N(s_1 \mid A \mu_0, A\Gamma_0A^\top + \Gamma) = \Pr (s_1)

\displaystyle \delta_{i \to i + 1} (s_i) = \iint \delta_{i-1\to i} (s_{i - 1}) N (s_i \mid As_{i - 1}, \Gamma) N (o_i \mid C s_i, \Sigma) \, \mathrm{d} s_{i - 1} \mathrm{d} o_i = N(s_i \mid A \mu_{i - 1}, A\Gamma_{i - 1}A^\top + \Gamma) = \Pr (s_i),

where \mathbb{E}\, s_{i - 1} = \mu_{i - 1} and \operatorname{var} s_{i - 1} = \Gamma_{i - 1}; the backward messages are again all identically 1. For MAP-type queries, the forward message is

\displaystyle \delta_{1\to 2} (s_1) = \max_{s_0} N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid A s_0, \Gamma) N (o_1 \mid Cs_1, \Sigma)

The optimization problem over s_0 is

\displaystyle \min_{s_0}\; s_0^\top (\Gamma_0^{-1} + A^\top\Gamma^{-1} A) s_0 - 2 (\mu_0^\top \Gamma_0^{-1} + s_1^\top \Gamma^{-1} A)s_0

whose solution is

\displaystyle s_0^\star = (\Gamma_0^{-1} + A^\top\Gamma^{-1} A)^{-1} (\Gamma_0^{-1} \mu_0 + A^\top \Gamma^{-1} s_1)

This is a linear function of s_1. The general procedure is therefore: from the quadratic problem in s_{i - 1}, solve for s_{i - 1} as a linear function of s_i; substitute it back to obtain the message over s_i, which is again a quadratic form, and continue forward. Solving the final equation in s_t and back-substituting then recovers all the other hidden variables.

The beliefs are

\displaystyle \beta_1 (s_0, s_1, o_1) = \psi_1 \delta_{2 \to 1} = N (s_0 \mid \mu_0, \Gamma_0) N (s_1 \mid A s_0, \Gamma) N (o_1 \mid C s_1, \Sigma)

\displaystyle\beta_i(s_{i-1}, s_i, o_i) = \psi_i\delta_{i-1\to i}\delta_{i+1\to i}

and the remaining beliefs are analogous.

For the filtering problem, given o_1, \ldots, o_i we compute \Pr (s_i \mid o_{1, \ldots, i}) using the forward messages,

\displaystyle \Pr (s_1 \mid o_1) \propto \int \Pr (s_0) \Pr (s_1\mid s_0) \Pr(o_1 \mid s_1) \, \mathrm{d} s_0 = N (s_1 \mid A\mu_0, A\Gamma_0 A^\top + \Gamma) N (o_1 \mid Cs_1, \Sigma) \propto N (s_1 \mid \xi_1, \Lambda_1)

where, writing \mu_1 = A\mu_0 and \Gamma_1 = A\Gamma_0 A^\top + \Gamma for the one-step predictive mean and covariance,

\displaystyle \Lambda_1 = (\Gamma_1^{-1} + C^\top \Sigma^{-1} C)^{-1}, \quad \xi_1= \Lambda_1 (\Gamma_1^{-1} \mu_1 + C^\top \Sigma^{-1} o_1)

\displaystyle \Pr (s_i \mid o_{1,\ldots, i}) \propto\int \Pr (s_{i - 1} \mid o_{1, \ldots, i - 1}) \Pr (s_i\mid s_{i - 1}) \Pr(o_i \mid s_i)\,\mathrm{d} s_{i - 1} = N (s_i \mid A\xi_{i - 1}, A\Lambda_{i - 1} A^\top + \Gamma) N (o_i \mid Cs_i, \Sigma) \propto N (s_i \mid \xi_i, \Lambda_i),

where

\displaystyle \Lambda_i = \left( (A\Lambda_{i - 1}A^\top + \Gamma)^{-1} + C^\top \Sigma^{-1} C \right)^{-1}, \quad \xi_i = \Lambda_i \left( (A\Lambda_{i - 1}A^\top + \Gamma)^{-1} A\xi_{i - 1} + C^\top \Sigma^{-1} o_i \right).

Setting \Lambda_0 = \Gamma_0 and \xi_0 = \mu_0, the computation above can be written in a single unified form.

For the prediction problem, given o_1, \ldots, o_i, \Pr (s_{i + 1} \mid o_{1, \ldots, i}) can be computed from the filtering result,

\displaystyle \Pr (s_{i + 1} \mid o_{1, \ldots, i}) = \int \Pr (s_{i + 1} \mid s_i) \Pr (s_i \mid o_{1, \ldots, i})\, \mathrm{d} s_i = \int N (s_{i + 1} \mid A s_i, \Gamma) N (s_i \mid \xi_i, \Lambda_i) \,\mathrm{d} s_i = N (s_{i + 1} \mid A \xi_i, A\Lambda_iA^\top + \Gamma).
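A minimal sketch of this filtering/prediction recursion, written in the information form used above (it reuses the hypothetical A, C, Gamma, Sigma, mu0, Gamma0 from the sampling sketch; a production Kalman filter would typically use the gain form for numerical stability):

```python
from numpy.linalg import inv

def kalman_filter(obs):
    """Return (xi_i, Lambda_i) of Pr(s_i | o_{1..i}) for each i, plus the one-step prediction."""
    xi, Lam = mu0, Gamma0                               # Lambda_0 = Gamma_0, xi_0 = mu_0
    filtered = []
    for o in obs:
        P = A @ Lam @ A.T + Gamma                       # predictive covariance of Pr(s_i | o_{1..i-1})
        m = A @ xi                                      # predictive mean
        Lam = inv(inv(P) + C.T @ inv(Sigma) @ C)        # Lambda_i
        xi = Lam @ (inv(P) @ m + C.T @ inv(Sigma) @ o)  # xi_i
        filtered.append((xi, Lam))
    prediction = (A @ xi, A @ Lam @ A.T + Gamma)        # mean and covariance of Pr(s_{t+1} | o_{1..t})
    return filtered, prediction

_, obs = sample_lds(10)
filtered, prediction = kalman_filter(obs)
```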

MEMM

We model \Pr (s_i \mid o_i) directly with a maximum-entropy (ME) model, but to bring in context we split this ME model into several models \Pr_{s_{i - 1}} (s_i \mid o_i); that is, the previous state determines the parameters of the ME model used for the next one. The likelihood is then

\displaystyle \Pr(s_{0, \ldots, t} \mid o_{1, \ldots, t}) = \prod_{i = 1}^t \Pr_{s_{i - 1}} (s_i\mid o_i).

The assumptions here are

  • Markovian property: \Pr (s_i \mid s_{0, \ldots, i - 1}, o_{1, \ldots, i}) = \Pr (s_i \mid s_{i - 1}, o_i)
  • ME assumption: \Pr (s_i \mid s_{i - 1}, o_i) = \Pr_{s_{i - 1}} (s_i \mid o_i) = p_{s_{i - 1}} (s_i \mid o_i)

We use the same cluster graph as for the HMM; the forward messages are

\displaystyle \delta_{1\to 2} (s_1) = p_{s_0} (s_1 \mid o_1), \quad \delta_{i\to i+1} (s_i) = \sum_{s_{i-1}} \delta_{i-1 \to i} (s_{i - 1}) \psi_i = \sum_{s_{i - 1}} \delta_{i-1 \to i} (s_{i - 1}) p_{s_{i-1}} (s_i \mid o_i).

The backward messages are

\displaystyle \delta_{t\to t-1} (s_{t - 1}) = \sum_{s_t} p_{s_{t - 1}} (s_t \mid o_t), \quad \delta_{i\to i-1} (s_{i - 1}) = \sum_{s_i} \delta_{i + 1\to i} (s_i) p_{s_{i - 1}} (s_i \mid o_i).

Max-product message passing merely replaces the sums with maxima. In belief propagation, the beliefs are

\displaystyle \beta_1 (s_0, s_1, o_1) = \psi_1 \delta_{2 \to 1} = \Pr (s_0, s_1 \mid o_{1, \ldots, t}), \quad \beta_i (s_{i - 1}, s_i, o_i) = \psi_i \delta_{i + 1 \to i} \delta_{i - 1\to i} = \Pr (s_{i - 1}, s_i \mid o_{1, \ldots, t}), \quad \beta_t (s_{t - 1}, s_t, o_t) = \psi_t \delta_{t - 1\to t} = \Pr (s_{t - 1}, s_t \mid o_{1, \ldots, t})

and the belief update is

\displaystyle\mu_{i, i + 1} = \delta_{i\to i+ 1} (s_i) \delta_{i+1 \to i} (s_i) =\Pr (s_i \mid o_{1, \ldots, t})

Its filtering, prediction and smoothing algorithms are exactly the same as for the HMM.
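A minimal sketch of the MEMM forward recursion with the observations clamped; the per-previous-state ME model p_{s_{i-1}}(s_i \mid o_i) is faked here by a softmax over random weights W, purely for illustration:

```python
import numpy as np

K, d = 3, 4                              # number of states, dimension of the observation features
rng = np.random.default_rng(1)
W = rng.normal(size=(K, K, d))           # hypothetical ME weights: one K x d block per previous state

def me_prob(prev, o):
    """p_{s_{i-1}}(s_i | o_i): a softmax (log-linear) model whose weights depend on the previous state."""
    logits = W[prev] @ o
    z = np.exp(logits - logits.max())
    return z / z.sum()

def forward(obs, s0=0):
    """delta_{i -> i+1}(s_i) = sum_{s_{i-1}} delta_{i-1 -> i}(s_{i-1}) p_{s_{i-1}}(s_i | o_i)."""
    delta = me_prob(s0, obs[0])
    for o in obs[1:]:
        delta = sum(delta[k] * me_prob(k, o) for k in range(K))
    return delta                         # = Pr(s_t | o_{1..t}, s_0): already the filtering distribution

obs = [rng.normal(size=d) for _ in range(5)]
print(forward(obs))
```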

CRF

Its assumptions are

  • the Markovian property, analogous to the MEMM;
  • invariant factors: for each transition we introduce a log-linear representation, \phi_i (s_{i - 1}, s_i, o_i) = \exp\left( \sum_j \lambda_j f_j(s_{i}, o_i, s_{i - 1})\right), where the f_j are the so-called features and the \lambda_j their weights.

Messages, beliefs, and so on can be defined just as before. To compute the log-likelihood we need the value of the partition function,

\displaystyle Z (o_{1, \ldots, t}) = \sum_{s_{0, \ldots, t}} \prod_{i = 1}^t \exp\left( \sum_j \lambda_j f_j (s_i, o_i, s_{i - 1})\right),

and using the forward messages avoids the exponentially many terms in this sum. When computing the gradient,

\displaystyle \frac{\partial \mathcal{L}}{\partial \lambda_j} = \sum_\text{samples} \sum_{i = 1}^t \left( f_j (s_i, o_i, s_{i - 1}) - \langle f_j (s_i, o_i, s_{i - 1})\rangle \right),

where the latter expectation requires \Pr (s_{i - 1}, s_i \mid o_{1, \ldots, t}), which is exactly the belief.
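A minimal sketch of computing Z(o_{1, \ldots, t}) and the pairwise beliefs \Pr (s_{i-1}, s_i \mid o_{1, \ldots, t}) needed for the gradient, using unnormalized forward/backward messages over hypothetical per-position transition scores phi[i][k, k'] = \exp(\sum_j \lambda_j f_j(s_i = k', o_i, s_{i-1} = k)); a real implementation would work in log space to avoid overflow:

```python
import numpy as np

def crf_forward_backward(phi):
    """phi is a list of t arrays of shape (K, K); phi[i][k, kp] is the transition factor
    exp(sum_j lambda_j f_j) for s = k -> s = kp at position i + 1 (the dummy start state is index 0).
    Returns log Z(o_{1..t}) and the pairwise marginals Pr(s_{i-1}, s_i | o_{1..t})."""
    t, K = len(phi), phi[0].shape[0]
    alpha = np.zeros((t, K))
    beta = np.ones((t, K))
    alpha[0] = phi[0][0]                            # forward message over s_1 (start state clamped)
    for i in range(1, t):
        alpha[i] = alpha[i - 1] @ phi[i]            # unnormalized forward messages
    for i in range(t - 2, -1, -1):
        beta[i] = phi[i + 1] @ beta[i + 1]          # unnormalized backward messages
    Z = alpha[-1].sum()
    pairwise = [np.outer(alpha[i - 1], beta[i]) * phi[i] / Z for i in range(1, t)]
    return np.log(Z), pairwise

rng = np.random.default_rng(2)
phi = [np.exp(rng.normal(size=(3, 3))) for _ in range(5)]   # hypothetical factors, t = 5, K = 3
logZ, pairwise = crf_forward_backward(phi)
```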

