N-gram Language Models

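Language Models

NLP is about understanding and interpreting language. Language models help with a range of tasks, such as spell checking, dialogue generation, content recognition, and machine translation. The N-gram model is one of the classic language models.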

Markov Assumption

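A simplifying assumption: the probability of a word depends only on the previous \(n-1\) words, \(P(w_i|w_1...w_{i-1}) \approx P(w_i|w_{i-n+1}...w_{i-1})\)

So, for the unigram model, n=1,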

\[P(w_1,w_2,...,w_m) = \prod_{i=1}^mP(w_i) \]

For the bigram model, n=2,

\[P(w_1,w_2,...,w_m) = \prod_{i=1}^mP(w_i|w_{i-1}) \]

For the trigram model, n=3,

\[P(w_1,w_2,...,w_m) = \prod_{i=1}^mP(w_i|w_{i-2}w_{i-1}) \]
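To make the factorizations above concrete, here is a minimal sketch that lists the n-grams each model multiplies over. The helper name `ngrams` and the padding convention (n-1 `<s>` symbols in front, one `</s>` at the end) are assumptions for illustration, not something the formulas prescribe.

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return the n-grams of one sentence after padding it.

    Padding is an assumed convention: n-1 start symbols and one end symbol,
    so every real word has a full (n-1)-word history.
    """
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

sentence = ["yes", "no", "no", "yes"]
for n in (1, 2, 3):
    # Each tuple is (history..., word); the model multiplies P(word | history).
    print(n, ngrams(sentence, n))
```

Each tuple's last element is the predicted word and the elements before it are the history the model conditions on.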

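Maximum Likelihood Estimation (MLE)

How do we actually compute those probabilities? We can use maximum likelihood estimation, which uses relative frequencies in the training corpus as probabilities.

For the unigram model, where \(C(\cdot)\) is a count in the training corpus and \(M\) is the total number of tokens,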

\[P(w_i) = \frac{C(w_i)}{M} \]

For the bigram model,

\[P(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i)}{C(w_{i-1})} \]

For the n-gram model,

\[P(w_i|w_{i-n+1}...w_{i-1}) = \frac{C(w_{i-n+1}...w_i)}{C(w_{i-n+1}...w_{i-1})} \]
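The bigram estimate is just a ratio of two counts, which the following sketch computes. The toy `corpus`, the function name `p_mle`, and the `<s>`/`</s>` padding are hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = [["yes", "no", "yes"], ["no", "no", "yes"]]

unigram_counts = Counter()  # C(w_{i-1}): how often a token occurs as a context
bigram_counts = Counter()   # C(w_{i-1} w_i)
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded[:-1])             # </s> is never a context
    bigram_counts.update(zip(padded, padded[1:]))

def p_mle(word: str, prev: str) -> float:
    """MLE bigram probability C(prev word) / C(prev); 0.0 for an unseen context."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("yes", "no"))    # "no yes" occurs 2 times, "no" is a context 3 times
print(p_mle("maybe", "no"))  # unseen bigram, so the MLE estimate is zero
```

The second call shows the weakness of plain MLE: anything not seen in training gets probability zero.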

An Example

Two pieces of text:

What is the probability of <s><s>yes no no yes</s> under a trigram model?
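A calculation of this kind can be sketched as follows. The two training sentences in `corpus` are a made-up stand-in chosen only so the script runs, and the helper names `pad` and `sentence_prob` are hypothetical.

```python
from collections import Counter

# Made-up stand-in corpus (an assumption for illustration only).
corpus = ["yes no no no no yes", "no no no yes yes yes no"]

def pad(tokens, n=3):
    """Add n-1 start symbols and one end symbol."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

trigram_counts, context_counts = Counter(), Counter()
for line in corpus:
    toks = pad(line.split())
    for a, b, c in zip(toks, toks[1:], toks[2:]):
        trigram_counts[(a, b, c)] += 1   # C(w_{i-2} w_{i-1} w_i)
        context_counts[(a, b)] += 1      # C(w_{i-2} w_{i-1})

def sentence_prob(sentence: str) -> float:
    """Product of MLE trigram probabilities over the padded sentence."""
    toks = pad(sentence.split())
    prob = 1.0
    for a, b, c in zip(toks, toks[1:], toks[2:]):
        if context_counts[(a, b)] == 0:
            return 0.0  # unseen history: MLE gives no estimate, treat as zero
        prob *= trigram_counts[(a, b, c)] / context_counts[(a, b)]
    return prob

print(sentence_prob("yes no no yes"))
```

With this stand-in corpus the five factors are 1/2, 1, 1/2, 2/5 and 1/2, so the product is 0.05.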

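Problems with the Model

Language has long-range dependencies, so n has to be large to capture how information carries through a sentence
The computed probabilities can be very small, or even 0 (for words that never appeared in training)

How do we deal with the zero-probability problem? We need smoothing!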

Smoothing

Here are a few common smoothing methods that address very small or zero probabilities.

**1. Laplacian (Add-one) Smoothing**

For the unigram model, where V is the vocabulary,

\[P_{add1}(w_i) = \frac{C(w_i)+1}{M+|V|} \]

For the bigram model,

\[P_{add1}(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i)+1}{C(w_{i-1})+|V|} \]

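**2. Add-K Smoothing (Lidstone Smoothing)**

This works like add-one, but adding a full count of one can move too much probability mass, so we pick a fraction k to control the amount of smoothing.

For the unigram model, where V is the vocabulary and k is a fraction,

\[P_{addk}(w_i) = \frac{C(w_i)+k}{M+k|V|} \]

For the bigram model,

\[P_{addk}(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i)+k}{C(w_{i-1})+k|V|} \]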
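Both add-one and add-k fit in a single function, since add-one is the k=1 case. This is a minimal sketch on a hypothetical toy corpus; the name `p_add_k` and the choice of `vocab` (every token that can be predicted, so `</s>` but not `<s>`) are assumptions.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = [["yes", "no", "yes"], ["no", "no", "yes"]]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded[:-1])             # context counts C(w_{i-1})
    bigram_counts.update(zip(padded, padded[1:]))  # C(w_{i-1} w_i)

# Assumed vocabulary: every token the model may predict.
vocab = {"yes", "no", "</s>"}

def p_add_k(word: str, prev: str, k: float = 1.0) -> float:
    """Add-k smoothed bigram probability; k=1 is Laplace (add-one) smoothing."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * len(vocab))

print(p_add_k("</s>", "no"))         # unseen bigram now gets (0+1)/(3+3)
print(p_add_k("</s>", "no", k=0.1))  # smaller k moves less mass to unseen events
```

For any context, the smoothed probabilities over `vocab` still sum to 1, which is why the denominator adds k|V|.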

**3. Absolute Discounting**

Main idea: "borrow" a fixed probability mass from observed n-gram counts. Define a parameter a, subtract a from each observed count, and give the borrowed counts to the unobserved n-grams.
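A rough sketch of the borrowing step for one context is shown below. The counts, the vocabulary, and the function name are hypothetical, and splitting the borrowed mass evenly among unseen words is an assumption made to keep the example small; the text above does not say how the mass is shared.

```python
from collections import Counter

# Hypothetical bigram counts for the single context "no".
bigram_counts = Counter({("no", "yes"): 2, ("no", "no"): 1})
vocab = ["yes", "no", "maybe", "</s>"]

def p_abs_discount(word: str, prev: str, a: float = 0.5) -> float:
    """Absolute discounting for a word in vocab and a context seen in training."""
    context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    seen = [w for w in vocab if bigram_counts[(prev, w)] > 0]
    unseen = [w for w in vocab if bigram_counts[(prev, w)] == 0]
    if bigram_counts[(prev, word)] > 0:
        # Observed bigram: give up a fixed amount `a` of its count.
        return (bigram_counts[(prev, word)] - a) / context_total
    # Unobserved bigram: share the borrowed mass (a per seen type) evenly.
    return a * len(seen) / context_total / len(unseen)

for w in vocab:
    print(w, p_abs_discount(w, "no"))
```

Two seen bigrams each give up 0.5 of a count, so one full count out of three is shared by the two unseen words.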

**4. Backoff**

Katz Backoff: redistributes the mass based on a lower order model.

\[P_{katz}(w_i|w_{i-1})=\begin{cases} \frac{C(w_{i-1},w_i)-D}{C(w_{i-1})} & \text{if } C(w_{i-1},w_i) > 0 \\ \alpha(w_{i-1}) \cdot \frac{P(w_i)}{\sum_{w_j:C(w_{i-1},w_j)=0}P(w_j)} & \text{otherwise} \end{cases}\]

To unpack this: \(\alpha(w_{i-1})\) is the amount of probability mass that has been discounted for context \(w_{i-1}\), \(P(w_i)\) is the lower-order model's probability of \(w_i\), and \(\sum_{w_j:C(w_{i-1},w_j)=0}P(w_j)\) is the sum of lower-order probabilities over all words that do not co-occur with \(w_{i-1}\).

There is a weakness here, though: if the lower-order model is a unigram model, words that are simply frequent on their own will dominate.
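The two cases of the bigram backoff formula can be sketched like this. The toy corpus, the discount value `d=0.5`, and all function names are hypothetical; \(\alpha(w_{i-1})\) is computed as the total mass removed by discounting, which is D times the number of distinct words seen after the context, divided by the context count.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
corpus = [["yes", "no", "yes"], ["no", "no", "yes"]]

unigram_counts, bigram_counts, context_counts = Counter(), Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded[1:])              # lower-order (unigram) model
    context_counts.update(padded[:-1])             # C(w_{i-1})
    bigram_counts.update(zip(padded, padded[1:]))  # C(w_{i-1}, w_i)

vocab = sorted(unigram_counts)
total = sum(unigram_counts.values())

def p_unigram(w: str) -> float:
    return unigram_counts[w] / total

def p_katz(word: str, prev: str, d: float = 0.5) -> float:
    """Bigram backoff with a fixed discount; assumes prev was seen as a context."""
    c_prev = context_counts[prev]
    if bigram_counts[(prev, word)] > 0:
        return (bigram_counts[(prev, word)] - d) / c_prev
    seen = [w for w in vocab if bigram_counts[(prev, w)] > 0]
    unseen = [w for w in vocab if bigram_counts[(prev, w)] == 0]
    alpha = d * len(seen) / c_prev  # mass freed by discounting in this context
    # Share alpha among unseen words in proportion to their unigram probability.
    return alpha * p_unigram(word) / sum(p_unigram(w) for w in unseen)

for w in vocab:
    print(w, p_katz(w, "no"))
```

Because the unseen words share \(\alpha\) in proportion to their unigram probability, a word that is frequent overall gets most of the backed-off mass regardless of context.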

Kneser-Ney Smoothing

Main idea: redistribute probability mass based on the number of different contexts a word w has appeared in, also called the "continuation probability".

\[P_{KN}(w_i|w_{i-1})=\begin{cases} \frac{C(w_{i-1},w_i)-D}{C(w_{i-1})} & \text{if } C(w_{i-1},w_i) > 0 \\ \alpha(w_{i-1}) \cdot \frac{P_{count}(w_i)}{\sum_{w_j:C(w_{i-1},w_j)=0}P_{count}(w_j)} & \text{otherwise} \end{cases}\]

where

\[P_{count}(w_i) = \frac{|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}{\sum_{w_j}|\{w_{j-1}:C(w_{j-1},w_j)>0\}|} \]
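The continuation probability only looks at how many distinct left contexts a word has, not at how often it occurs. Here is a small sketch with hypothetical counts; `p_continuation` plays the role of \(P_{count}\) in the formula above.

```python
from collections import Counter, defaultdict

# Hypothetical bigram counts: "francisco" is frequent but follows only "san".
bigram_counts = Counter({
    ("san", "francisco"): 5,
    ("the", "cat"): 1,
    ("a", "cat"): 1,
    ("my", "cat"): 1,
})

# word -> set of distinct words that precede it
contexts = defaultdict(set)
for (prev, word), count in bigram_counts.items():
    if count > 0:
        contexts[word].add(prev)

# Denominator: total number of distinct bigram types.
num_bigram_types = sum(len(prevs) for prevs in contexts.values())

def p_continuation(word: str) -> float:
    """Share of all bigram types that end in `word`."""
    return len(contexts.get(word, ())) / num_bigram_types

print(p_continuation("francisco"))  # 1 of 4 bigram types
print(p_continuation("cat"))        # 3 of 4 bigram types
```

Here "francisco" has the higher raw count (5 versus 3) but the lower continuation probability, which is the behaviour that fixes the unigram weakness of plain backoff.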

**5. Interpolation**

Combine several n-gram models. For example, a trigram model becomes:

\[P_{interpolation}(w_m|w_{m-1},w_{m-2}) = \lambda_3P_3^*(w_m|w_{m-1},w_{m-2}) + \lambda_2P_2^*(w_m|w_{m-1}) + \lambda_1P_1^*(w_m) \]

where

\[\sum_{n=1}^{n_{max}}\lambda_n = 1 \]
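As a minimal sketch, interpolation is a weighted sum of the three estimates. The function name and the weights `(0.6, 0.3, 0.1)` are hypothetical; in practice the weights are usually tuned on held-out data.

```python
def interpolate(p3: float, p2: float, p1: float,
                lambdas=(0.6, 0.3, 0.1)) -> float:
    """Mix trigram, bigram and unigram estimates of the same word.

    `lambdas` are assumed example weights (lambda_3, lambda_2, lambda_1)
    and must sum to 1 so the result is still a probability.
    """
    l3, l2, l1 = lambdas
    if abs(l3 + l2 + l1 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l3 * p3 + l2 * p2 + l1 * p1

# An unseen trigram (p3 = 0) still gets mass from the lower-order models.
print(interpolate(p3=0.0, p2=0.2, p1=0.05))
```

Unlike backoff, the lower-order models contribute even when the trigram has been observed.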

Interpolated Kneser-Ney Smoothing (IKN Smoothing)

\[P_{IKN}(w_i|w_{i-1}) = \frac{\max(C(w_{i-1},w_i)-D,\,0)}{C(w_{i-1})} + \beta(w_{i-1})P_{count}(w_i) \]

where \(\beta(w_{i-1})\) is a normalizing constant chosen so that \(P_{IKN}(w_i|w_{i-1})\) sums to 1 over all \(w_i\)
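A bigram sketch of interpolated Kneser-Ney follows, on a hypothetical toy corpus with an assumed discount `d=0.5`. The discounted count is clipped at zero so unseen bigrams do not get a negative first term, and \(\beta(w_{i-1})\) is computed as D times the number of distinct words following the context, divided by the context count, which is what makes the distribution sum to 1.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already tokenized.
corpus = [["yes", "no", "yes"], ["no", "no", "yes"]]

bigram_counts, context_counts = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    context_counts.update(padded[:-1])             # C(w_{i-1})
    bigram_counts.update(zip(padded, padded[1:]))  # C(w_{i-1}, w_i)

preceders = defaultdict(set)  # word -> distinct left contexts
followers = defaultdict(set)  # context -> distinct following words
for prev, word in bigram_counts:
    preceders[word].add(prev)
    followers[prev].add(word)

vocab = sorted(preceders)
num_bigram_types = len(bigram_counts)

def p_count(word: str) -> float:
    """Continuation probability: share of bigram types ending in `word`."""
    return len(preceders[word]) / num_bigram_types

def p_ikn(word: str, prev: str, d: float = 0.5) -> float:
    """Interpolated Kneser-Ney; assumes prev was seen as a context."""
    c_prev = context_counts[prev]
    discounted = max(bigram_counts[(prev, word)] - d, 0) / c_prev
    beta = d * len(followers[prev]) / c_prev  # mass removed by discounting
    return discounted + beta * p_count(word)

for w in vocab:
    print(w, p_ikn(w, "no"))
```

Every word gets the continuation term, so even the unseen bigram "no </s>" receives a small non-zero probability.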

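Generation

One class of NLP tasks requires the machine to generate the next word or a whole sentence. How do we choose what to output? Common approaches are:

Argmax (greedy decoding)
Beam search decoding
Random sampling from the distribution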
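The three strategies above can be compared on a tiny bigram table. Everything here is a hypothetical sketch: the probabilities in `model` are made up, and the beam search is a simple variant that sets finished hypotheses aside and returns the most probable one at the end.

```python
import math
import random

# Made-up next-word distributions P(w | prev).
model = {
    "<s>": {"yes": 0.6, "no": 0.4},
    "yes": {"no": 0.5, "yes": 0.2, "</s>": 0.3},
    "no": {"yes": 0.3, "no": 0.1, "</s>": 0.6},
}

def greedy(max_len=10):
    """Always take the single most probable next word."""
    prev, out = "<s>", []
    while len(out) < max_len:
        prev = max(model[prev], key=model[prev].get)
        if prev == "</s>":
            break
        out.append(prev)
    return out

def sample(rng, max_len=10):
    """Draw each next word at random according to its probability."""
    prev, out = "<s>", []
    while len(out) < max_len:
        words, probs = zip(*model[prev].items())
        prev = rng.choices(words, weights=probs)[0]
        if prev == "</s>":
            break
        out.append(prev)
    return out

def beam_search(beam_size=2, max_len=10):
    """Keep the beam_size best partial sentences by log-probability."""
    beams, finished = [(0.0, ["<s>"])], []
    for _ in range(max_len):
        candidates = [(logp + math.log(p), seq + [w])
                      for logp, seq in beams
                      for w, p in model[seq[-1]].items()]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            # Hypotheses ending in </s> are complete and leave the beam.
            (finished if cand[1][-1] == "</s>" else beams).append(cand)
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[0])
    return [w for w in best[1] if w not in ("<s>", "</s>")]

print(greedy())
print(beam_search())
print(sample(random.Random(0)))
```

On this table greedy decoding commits to "yes" first and ends with a sentence of probability 0.18, while beam search keeps "no" alive and finds the sentence "no" with probability 0.24, so a locally best choice is not always globally best.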

Evaluation

To judge how good an n-gram model is, we usually look at perplexity

\[PP(w_1,w_2,...,w_m) = \sqrt[m]{\frac{1}{P(w_1,w_2,...,w_m)}} \]

which is equivalent to

\[PP(w_1,w_2,...,w_m) = 2^{-\frac{\log_2 P(w_1,w_2,...,w_m)}{m}} \]

So the lower the perplexity, the better!
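In code it is easier to use the log form, since multiplying many small probabilities can underflow. This is a minimal sketch; the function name and the per-token probabilities in `probs` are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability for each of the m tokens.

    Uses the log form 2 ** (-(1/m) * sum(log2 p)), which equals the m-th root
    of 1 / P(w_1, ..., w_m). Every probability must be greater than zero.
    """
    m = len(token_probs)
    log2_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log2_sum / m)

# Hypothetical per-token probabilities for a five-token sequence.
probs = [0.5, 1.0, 0.5, 0.4, 0.5]
print(perplexity(probs))

# A model that is always 25% sure behaves like a uniform choice among 4 words.
print(perplexity([0.25] * 4))
```

A single zero probability makes the perplexity undefined (infinite), which is another reason the model needs smoothing before it is evaluated.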

posted @ 2020-06-19 01:53 MrDoghead
