20160103 Maximum Entropy Model
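Definition of entropy
Joint entropy, relative entropy, conditional entropy, mutual information
The maximum entropy model
Applications of MaxEnt in NLP
Relation between MaxEnt and MLE
1. A motivating problem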
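Consider a problem: a die is thrown N times and the average outcome is 5.5. What are the probabilities of the six faces?
It can be solved as a convex optimization problem, or by maximum likelihood estimation.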
maximize: \(S\left( p \right)=- \sum_i {p_i \ln(p_i)} \)
subject to: \(\sum_i{p_i}=1\)
\(\sum_{i}{i \cdot p_i}=\mu\)
Lagrangian: \(\zeta = -\sum_{i}{p_i\ln p_i}+\lambda_0 \left( 1-\sum_i{p_i} \right)+\lambda_1 \left( \mu - \sum_i{i \cdot p_i} \right)\)
Setting \(\frac{\partial \zeta}{\partial p_i}=0\) gives:
\(p_i=e^{-1-\lambda_0-i \lambda_1}\)
\(\lambda_0=5.932,\lambda_1=-1.087\)
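The multipliers above can be checked numerically. The following is a minimal sketch, assuming the exponential form \(p_i=e^{-1-\lambda_0-i\lambda_1}\) derived above; the function name `maxent_dice` and the bisection bounds are illustrative choices, not part of the original notes. It finds \(\lambda_1\) by bisection on the mean constraint and then recovers \(\lambda_0\) from normalization.

```python
import math

FACES = range(1, 7)

def maxent_dice(mu, lo=-50.0, hi=50.0, iters=200):
    """Solve p_i = exp(-1 - lam0 - i*lam1) subject to sum p_i = 1 and mean mu."""
    def mean_for(lam1):
        weights = [math.exp(-i * lam1) for i in FACES]
        return sum(i * w for i, w in zip(FACES, weights)) / sum(weights)

    # The mean is strictly decreasing in lam1, so bisection is applicable.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_for(mid) > mu:
            lo = mid
        else:
            hi = mid
    lam1 = (lo + hi) / 2.0
    z = sum(math.exp(-i * lam1) for i in FACES)
    lam0 = math.log(z) - 1.0  # from exp(-1 - lam0) = 1 / z
    probs = [math.exp(-1.0 - lam0 - i * lam1) for i in FACES]
    return lam0, lam1, probs

lam0, lam1, probs = maxent_dice(5.5)
```

For \(\mu=5.5\) this should reproduce values close to \(\lambda_0=5.932\) and \(\lambda_1=-1.087\), with most of the probability mass on faces 5 and 6.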
Definition of entropy: \(H\left( X \right) = -\sum_{x \in X}{p \left( x \right) \ln p \left( x \right) }\)
With base e the unit is the nat; with base 2 the unit is the bit.
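As a small illustration of the definition and of the two units, here is a sketch that computes the entropy of a fair die in nats and in bits. The helper name `entropy` and the convention that terms with \(p=0\) contribute zero are assumptions made for this example.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy; terms with p = 0 are skipped (0 * log 0 is taken as 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair = [1 / 6] * 6
h_nats = entropy(fair)          # natural logarithm, unit: nat
h_bits = entropy(fair, base=2)  # base-2 logarithm, unit: bit
```

The two values differ only by the constant factor \(\ln 2\): dividing the entropy in nats by \(\ln 2\) gives the entropy in bits.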
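Ways to understand entropy:
1. It measures uncertainty: the more uncertain the outcome, the larger the entropy.
2. It is a mapping from a probability distribution function to a value.
3. Personal view: nature always evolves in the direction of increasing entropy (cf. the Big Bang theory), which is quite similar in spirit to the maximum entropy model here.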
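Joint entropy and conditional entropy:
The joint distribution of two random variables X and Y defines the joint entropy, written H(X,Y). Subtracting H(Y) from it gives the conditional entropy H(X|Y):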
\(H \left( X,Y \right) -H \left( Y \right)\)
\(=-\sum_{x,y}{p \left( x,y \right) \log p \left( x,y \right) } + \sum_{y}{ p \left( y \right) \log p \left( y \right) }\)
\(=-\sum_{x,y}{p \left( x,y \right) \log p \left( x,y \right) } + \sum_{y} \sum_{x}{ p \left( x,y \right) \log p \left( y \right) }\)
\(=-\sum_{x,y}{p \left( x,y \right) \log p \left( x \mid y \right)}=H\left ( X \mid Y \right )\)
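The identity just derived can be verified on a small example. The sketch below uses a made-up 2x2 joint distribution (the numbers are purely illustrative) and compares the direct definition of \(H(X \mid Y)\) with \(H(X,Y)-H(Y)\).

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy in nats of a distribution given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Marginal p(y), obtained by summing the joint over x.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# Direct definition: -sum_{x,y} p(x,y) log p(x|y), with p(x|y) = p(x,y) / p(y).
h_cond_direct = -sum(p * math.log(p / p_y[y]) for (x, y), p in joint.items() if p > 0)
# Chain-rule form: H(X,Y) - H(Y).
h_cond_chain = H(joint) - H(p_y)
```

Both quantities agree up to floating-point rounding.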
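Relative entropy (also called mutual entropy, discrimination information, KL entropy or KL divergence; sometimes loosely called cross entropy)
Let p(x) and q(x) be two probability distributions over the values of X. The relative entropy of p with respect to q is: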
\(D\left( p \parallel q \right) = \sum_{x}{p\left( x \right) \log \frac{p\left( x \right)}{q\left( x \right)}}=E_{p\left ( x \right )} \log \frac{p\left( x \right)}{q\left( x \right)}\)
Clearly, \(D\left ( p \parallel q \right )\) and \(D\left ( q \parallel p \right )\) are in general not equal.
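A quick numerical example makes the asymmetry concrete. This is a sketch with two arbitrary distributions chosen for illustration; the helper name `kl` is an assumption, and it requires \(q(x)>0\) wherever \(p(x)>0\).

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in nats; requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl(p, q)
d_qp = kl(q, p)
```

Here `d_pq` is about 0.51 nats while `d_qp` is about 0.37 nats, so swapping the arguments changes the value.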
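Mutual information:
The mutual information of two random variables X and Y is defined as the relative entropy between their joint distribution and the product of their marginal distributions.
\(I\left ( X,Y \right )=D\left ( p\left ( X,Y \right ) \parallel p\left ( X \right ) p\left ( Y \right )\right )\)
\(I\left ( X,Y\right )= \sum_{x,y}{p\left ( x,y \right ) \log \frac{p\left ( x,y \right )}{p\left ( x\right )p\left ( y \right )}}\)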
\(H\left ( X \right )-I\left ( X,Y \right )=H\left ( X \mid Y \right )=H\left ( X,Y \right )-H\left ( Y \right )\)
\(H\left ( X \mid Y \right )\leqslant H\left ( X \right )\)
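The two relations above can be checked numerically. The sketch below reuses a made-up 2x2 joint distribution (illustrative numbers only) and the helper names `marginal` and `entropy`, which are assumptions of this example.

```python
import math

# Hypothetical joint distribution p(x, y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(joint, axis):
    """Sum the joint over the other variable; axis 0 gives p(x), axis 1 gives p(y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

p_x, p_y = marginal(joint, 0), marginal(joint, 1)
# I(X,Y) as the relative entropy between p(x,y) and p(x)p(y).
mutual_info = sum(p * math.log(p / (p_x[x] * p_y[y]))
                  for (x, y), p in joint.items() if p > 0)
h_x_given_y = entropy(joint) - entropy(p_y)
```

Since the mutual information is a relative entropy it is non-negative, which is exactly why conditioning cannot increase entropy.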
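Principles of the maximum entropy model:
1. Acknowledge what is known.
2. Make no assumptions about what is unknown; stay unbiased.
General form of MaxEnt: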
\(\underset{p \in P}{\max} H\left (Y\mid X \right) = -\sum_{x,y} p\left ( x,y \right ) \log p\left (y \mid x \right )\)
\(P=\left \{ p \mid p \text{ is a probability distribution on } X \text{ satisfying the constraints} \right \}\)
Summary of the maximum entropy model
Goal: \(p^{*} \left ( y \mid x \right )= \arg \max H\left ( y \mid x \right )\)
Define feature functions: \(f_i \left( x,y \right) \in \left\{ 0,1 \right\},i=1,2,\cdots,m\)
Constraints: \(\sum_{y \in Y} p\left( y \mid x \right)=1\) ①
\(E\left ( f_i \right ) = \widetilde{E}\left ( f_i \right ), \quad i =1,2,\cdots ,m\) ②
\(\tilde{E}\left ( f_i \right )= \sum_{\left ( x,y \right ) \in z} \bar{p}\left ( x,y \right ) f_i\left ( x,y \right )= \frac{1}{N} \sum_{\left ( x,y \right ) \in T} f_i\left ( x,y \right )\), where \(N=\left | T \right |\) is the size of the training set T and \(\bar{p}\) is the empirical distribution.
\(E\left ( f_i \right )=\sum_{\left ( x,y \right ) \in z }p\left ( x,y \right )f_i \left ( x,y \right )= \sum_{\left ( x,y \right ) \in z} \bar{p} \left ( x\right ) p\left ( y \mid x \right ) f_i\left ( x,y \right )\)
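To make the two expectations concrete, here is a small sketch on a toy training set. The data, the labels, the feature `f1`, and the uniform model `uniform` are all hypothetical and only serve to show how the empirical expectation and the model expectation are computed.

```python
from collections import Counter

# Hypothetical training set T of (x, y) pairs and the label set Y.
T = [("sunny", "out"), ("sunny", "out"), ("rainy", "in"), ("sunny", "in")]
LABELS = ["out", "in"]

def f1(x, y):
    """A binary feature: fires when the context is sunny and the label is out."""
    return 1 if (x == "sunny" and y == "out") else 0

def empirical_expectation(f, data):
    """(1/N) * sum of f(x, y) over the training pairs."""
    return sum(f(x, y) for x, y in data) / len(data)

def model_expectation(f, data, p_cond):
    """sum_x p_bar(x) * sum_y p(y|x) * f(x, y), with p_bar(x) taken from the data."""
    n = len(data)
    x_counts = Counter(x for x, _ in data)
    return sum((c / n) * sum(p_cond(y, x) * f(x, y) for y in LABELS)
               for x, c in x_counts.items())

def uniform(y, x):
    """A placeholder model p(y|x) that ignores x."""
    return 1.0 / len(LABELS)

emp = empirical_expectation(f1, T)
mod = model_expectation(f1, T, uniform)
```

With this data the empirical expectation is 0.5 and the uniform model gives 0.375, so the uniform model violates constraint ② for this feature.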
Solving the MaxEnt model:
Lagrangian:
\(\Lambda \left ( p,\vec{\lambda} \right )=H\left ( y \mid x \right )+\sum_{i=1}^{m}\lambda_i \left ( E\left ( f_i \right )-\tilde{E}\left ( f_i \right ) \right )+\lambda_0\left ( \sum_{y \in Y}p\left ( y \mid x \right )-1 \right )\)
\(\Rightarrow L=\sum_{\left ( x,y \right )}p\left ( y \mid x \right )\bar{p}\left ( x \right ) \log \frac{1}{p\left ( y \mid x \right )} + \sum_i \lambda_i \sum_{\left ( x,y \right )}f_i\left ( x,y \right )\left [ p\left ( y \mid x \right ) \bar{p} \left ( x \right ) - \bar{p} \left ( x,y \right )\right ] + \lambda_0 \left[ \sum_y p \left( y \mid x \right) -1 \right]\)
\(\frac{\partial L }{\partial p \left (y \mid x \right ) } = \bar{p} \left ( x \right )\left ( -\log p\left ( y \mid x \right ) -1 \right ) + \sum_i \lambda_i \bar{p} \left ( x \right )f_i\left ( x,y \right )+\lambda_0 \triangleq 0\)
\(\Rightarrow p^{*}\left ( y \mid x \right )=e^{\sum_i \lambda_i f_i\left ( x,y \right )+\frac{\lambda_0}{\bar{p}\left ( x \right )}-1}=\frac{1}{e^{1-\frac{\lambda_0}{\bar{p}\left ( x \right )}}}e^{\sum_i \lambda_i f_i \left ( x,y \right )}\)
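Normalize \(p^*\): the factor that does not depend on y is written as \(\frac{1}{Z_\lambda \left ( x \right )}\) and is fixed by requiring
\(\sum_y p^*\left( y \mid x \right)=1\)
\(\sum_y \frac{1}{Z_\lambda \left ( x \right )} e^{\sum_i \lambda_i f_i \left ( x,y \right )}=1\)
\(\therefore Z_\lambda \left ( x \right )= \sum_y e^{\sum_i \lambda_i f_i\left ( x,y \right )}\)
At this point \(\lambda\) is still unknown, and the MaxEnt model is \(p^*\left( y \mid x \right)=\frac{1}{Z_\lambda \left ( x \right )} e^{\sum_i \lambda_i f_i \left ( x,y \right )}\), where the normalizer \(Z_\lambda \left ( x \right )\) is also written \(z_\lambda \left ( x \right )\) below.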
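The model form above is straightforward to evaluate once \(\lambda\) is given. The following sketch computes \(p^*(y \mid x)\) for one context; the function name `maxent_prob`, the features, and the weights are hypothetical, and subtracting the maximum score before exponentiating is a numerical-stability choice that is not part of the derivation (it cancels in the ratio).

```python
import math

def maxent_prob(x, labels, features, lambdas):
    """Return {y: p(y|x)} with p(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z_lambda(x)."""
    scores = {y: sum(l * f(x, y) for l, f in zip(lambdas, features)) for y in labels}
    m = max(scores.values())  # shift for numerical stability; cancels after normalizing
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())    # plays the role of Z_lambda(x), up to the shift
    return {y: e / z for y, e in exps.items()}

# Hypothetical features and weights, for illustration only.
features = [
    lambda x, y: 1 if (x == "sunny" and y == "out") else 0,
    lambda x, y: 1 if (x == "rainy" and y == "in") else 0,
]
probs = maxent_prob("sunny", ["out", "in"], features, [1.0, 2.0])
```

For the context "sunny" only the first feature can fire, so the result is \(e/(1+e)\) for "out" and \(1/(1+e)\) for "in".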
Two problems remain:
1. Explain the relation between MaxEnt and MLE
2. Find an algorithm for solving for \(\lambda\)
1. MLE
\(L_{\bar{p}}= \prod_x p \left ( x \right )^{\bar{p}\left ( x \right )}\)
where \(p\left ( x \right )\) is the estimated probability distribution and \(\bar{p}\left ( x \right )\) is the empirical distribution observed in the data.
\(\log L\left ( \theta_1,\theta_2,\cdots, \theta_k \right )= \sum_{i=1}^n \log f\left ( x_i;\theta_1,\theta_2,\cdots ,\theta_k \right )\)
Taking the logarithm (the log-likelihood is still written \(L_{\bar{p}}\)):
\(L_{\bar{p}}=\log \left ( \prod _x p\left ( x \right ) ^{\bar{p}\left ( x \right )}\right )=\sum_x \bar{p}\left ( x \right )\log p\left ( x \right )\)
\(L_{\bar{p}}{\left(p\right)}=\sum _{x,y}{\bar{p}\left ( x,y \right ) \log p\left ( x,y \right )}=\sum_{x,y}{\bar{p}\left ( x,y \right ) \log \left [ \bar{p}\left ( x \right ) p\left ( y \mid x \right ) \right ]}\)
\(=\sum_{x,y} \bar{p}\left ( x,y \right )\log p\left ( y \mid x \right )+\sum_{x,y} \bar{p}\left ( x,y \right ) \log \bar{p}\left ( x \right )\)
The first term has the same form as the conditional entropy, and the second is a constant that does not depend on the model. Now return to the Lagrangian:
\(L=\sum_{x,y}p\left ( y \mid x \right ) \bar{p}\left ( x \right ) \log \frac{1}{p\left ( y \mid x \right )} + \sum_{i=1}^k \lambda_i \sum_{x,y} f_i\left ( x,y \right ) \left [ p\left ( y \mid x \right ) \bar{p}\left ( x \right ) - \bar{p}\left ( x,y \right )\right ] + \lambda_0 \left [ \sum_y p\left ( y \mid x \right ) -1 \right ]\)
Substitute \(p_\lambda \left ( y \mid x \right )=\frac{1}{z_\lambda\left ( x \right )}e^{\sum_i \lambda_i f_i\left ( x,y \right )}\) into L (the \(\lambda_0\) term then vanishes because \(p_\lambda\) is normalized):
\(\therefore L\left ( \lambda \right) = - \sum_{x,y} p\left ( y \mid x \right ) \bar{p}\left ( x \right ) \log p\left ( y \mid x \right ) + \sum_{i=1}^k \lambda_i \sum_{x,y} f_i\left (x,y \right )\left [ p \left ( y \mid x \right ) \bar{p}\left ( x \right ) -\bar{p}\left ( x,y \right )\right ]+\lambda_0\left [ \sum_y p\left ( y \mid x \right ) -1 \right ]\)
\( = - \sum\limits_{x,y} {{p_\lambda }\left( {y\left| x \right.} \right)\bar {p} \left( x \right)\log {p_\lambda }\left( {y\left| x \right.} \right)} + \sum\limits_{i = 1}^k {{\lambda _i}\sum\limits_{x,y} {{f_i}\left( {x,y} \right)\left[ {{p_\lambda }\left( {y\left| x \right.} \right)\bar {p} \left( x \right) - \bar {p} \left( {x,y} \right)} \right]} }\)
\( = - \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right)\log {p_\lambda }\left( {y\left| x \right.} \right)} + \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right)\sum\limits_{i = 1}^k {{\lambda _i}{f_i}\left( {x,y} \right)} } - \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^k {{\lambda _i}{f_i}\left( {x,y} \right)} } \)
\( = \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right)\log {z_\lambda }\left( x \right)} - \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^k {{\lambda _i}{f_i}\left( {x,y} \right)} } \)
\( = \sum\limits_x {\bar {p} \left( x \right)\log {z_\lambda }\left( x \right)} - \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^k {{\lambda _i}{f_i}\left( {x,y} \right)} } \)
Substituting the MaxEnt optimal solution \({p_\lambda }^*\left( {y\left| x \right.} \right) = \frac{1}{{{z_\lambda }\left( x \right)}}{e^{\sum\limits_i {{\lambda _i}{f_i}\left( {x,y} \right)} }}\) into the log-likelihood gives:
\(L_{\bar{p} }{\left( p \right)} = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\log p\left( {y\left| x \right.} \right)}\)
\( = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\left( {\sum\limits_{i = 1}^n {{\lambda _i}{f_i}\left( {x,y} \right) - \log {z_\lambda }\left( x \right)} } \right)} \)
\( = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\lambda _i}{f_i}\left( {x,y} \right)} - \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\log {z_\lambda }\left( x \right)} }\)
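\( = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\lambda _i}{f_i}\left( {x,y} \right)} - \sum\limits_x {\bar {p} \left( x \right)\log {z_\lambda }\left( x \right)} } \)
This is the negative of the function \(L\left( \lambda \right)\) obtained from the Lagrangian, so solving the MaxEnt dual problem over \(\lambda\) is equivalent to maximum likelihood estimation of the exponential-form model.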
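The last expression can be checked against the log-likelihood computed sample by sample. The sketch below does this on a toy data set; the data, the features, and the function names are hypothetical, and the log-likelihood is averaged over the N samples so that it matches the empirical-distribution form used above.

```python
import math
from collections import Counter

# Hypothetical training set and binary features.
T = [("sunny", "out"), ("sunny", "out"), ("rainy", "in"), ("sunny", "in")]
LABELS = ["out", "in"]
FEATURES = [
    lambda x, y: 1 if (x == "sunny" and y == "out") else 0,
    lambda x, y: 1 if (x == "rainy" and y == "in") else 0,
]

def score(x, y, lam):
    return sum(l * f(x, y) for l, f in zip(lam, FEATURES))

def log_z(x, lam):
    return math.log(sum(math.exp(score(x, y, lam)) for y in LABELS))

def loglik_direct(lam):
    """(1/N) * sum over samples of log p_lambda(y|x)."""
    return sum(score(x, y, lam) - log_z(x, lam) for x, y in T) / len(T)

def loglik_formula(lam):
    """sum_{x,y} p_bar(x,y) sum_i lambda_i f_i(x,y) - sum_x p_bar(x) log z_lambda(x)."""
    n = len(T)
    first = sum((c / n) * score(x, y, lam) for (x, y), c in Counter(T).items())
    second = sum((c / n) * log_z(x, lam)
                 for x, c in Counter(x for x, _ in T).items())
    return first - second

lam = [0.3, -0.7]
direct, formula = loglik_direct(lam), loglik_formula(lam)
```

The two values agree up to rounding, and at \(\lambda=0\) both reduce to \(-\ln 2\) because the model is then uniform over the two labels.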
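To solve for \(\lambda\), use IIS, the Improved Iterative Scaling algorithm.
IIS: suppose the current parameter vector of the maximum entropy model is \(\lambda\). We look for a new parameter vector \(\lambda+ \delta\) that increases the log-likelihood L of the model, and repeat this step until convergence. The first bound below uses \(-\log \alpha \ge 1-\alpha\) for \(\alpha > 0\).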
\(L\left( {\lambda + \delta } \right) - L\left( \lambda \right) = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right) - \sum\limits_x {\bar {p} \left( x \right)\log \frac{{{z_{\lambda + \delta }}\left( x \right)}}{{{z_\lambda }\left( x \right)}}} } } \)
\( \ge \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right) + 1 - \sum\limits_x {\bar {p} \left( x \right)\frac{{{z_{\lambda + \delta }}\left( x \right)}}{{{z_\lambda }\left( x \right)}}} } }\)
\(= \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right) + 1 - \sum\limits_x {\bar {p} \left( x \right)\sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right){e^{\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right)} }}} } } } \)
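Denote this lower bound by \(A\left( {\delta \left| \lambda \right.} \right)\). To decouple the \(\delta_i\), let \( {f^\# }\left( {x,y} \right) = \sum\limits_i {{f_i}\left( {x,y} \right)} \) and apply Jensen's inequality to the convex function \( f\left( x \right) = {e^x} \):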
\( A\left( {\delta \left| \lambda \right.} \right) = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right) + 1 - \sum\limits_x {\bar {p} \left( x \right)} \sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right){e^{\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right)} }}} } } \)
\( = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right)} } + 1 - \sum\limits_x {\bar {p} \left( x \right)\sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right){e^{{f^\# }\left( {x,y} \right)\sum\limits_{i = 1}^n {\frac{{{\delta _i}{f_i}\left( {x,y} \right)}}{{{f^\# }\left( {x,y} \right)}}} }}} } \)
\( \ge \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right)} } + 1 - \sum\limits_x {\bar {p} \left( x \right)\sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right)\sum\limits_{i = 1}^n {\frac{{{f_i}\left( {x,y} \right)}}{{{f^\# }\left( {x,y} \right)}}{e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} } } \)
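Denote this new lower bound by \(B\left( {\delta \left| \lambda \right.} \right)\); take its partial derivative with respect to each \(\delta_i\) and set it to 0 to solve for \(\delta\).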
\(B\left( {\delta \left| \lambda \right.} \right) = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right)\sum\limits_{i = 1}^n {{\delta _i}{f_i}\left( {x,y} \right)} } + 1 - \sum\limits_x {\bar {p} \left( x \right)\sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right)\sum\limits_{i = 1}^n {\frac{{{f_i}\left( {x,y} \right)}}{{{f^\# }\left( {x,y} \right)}}{e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} } }\)
\(\frac{{\partial B\left( {\delta \left| \lambda \right.} \right)}}{{\partial {\delta _i}}} = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right){f_i}\left( {x,y} \right)} - \sum\limits_x {\bar {p} \left( x \right)} \sum\limits_y {{p_\lambda }\left( {y\left| x \right.} \right){f_i}\left( {x,y} \right){e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} \)
\( = \sum\limits_{x,y} {\bar {p} \left( {x,y} \right){f_i}\left( {x,y} \right)} - \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right){f_i}\left( {x,y} \right){e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} \)
\( = {E_{\bar {p} }}\left( {{f_i}} \right) - \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right)} {f_i}\left( {x,y} \right){e^{{\delta _i}{f^\# }\left( {x,y} \right)}}\)
Setting the expression above to 0 gives
\(\sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right){f_i}\left( {x,y} \right){e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} - {E_{\bar {p} }}\left( {{f_i}} \right) = 0\)
How to solve for \(\delta\):
When \(f^\#\left( {x,y} \right)\) is a constant M for all (x,y):
\({{\delta _i} = \frac{1}{M}\log \frac{{{E_{\bar {p} }}\left( {{f_i}} \right)}}{{{E_p}\left( {{f_i}} \right)}}}\)
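The closed-form update can be sketched on a toy problem. In the example below exactly one feature fires for every pair (x, y), so \(f^\#(x,y)=M=1\); the data, the features, and the function names are hypothetical and chosen only so that the constant case applies.

```python
import math
from collections import Counter

# Hypothetical training set; three of the four labels are "out".
T = [("sunny", "out"), ("sunny", "out"), ("rainy", "in"), ("rainy", "out")]
LABELS = ["out", "in"]
# Exactly one feature fires for every (x, y), hence f#(x, y) = M = 1.
FEATURES = [lambda x, y: 1 if y == "out" else 0,
            lambda x, y: 1 if y == "in" else 0]
M = 1

def p_cond(x, lam):
    """p_lambda(y|x) for each y in LABELS, in order."""
    w = [math.exp(sum(l * f(x, y) for l, f in zip(lam, FEATURES))) for y in LABELS]
    z = sum(w)
    return [v / z for v in w]

def expectations(lam):
    """Return (empirical, model) expectations of every feature."""
    n = len(T)
    emp = [sum(f(x, y) for x, y in T) / n for f in FEATURES]
    mod = [0.0] * len(FEATURES)
    for x, c in Counter(x for x, _ in T).items():
        for y, p in zip(LABELS, p_cond(x, lam)):
            for i, f in enumerate(FEATURES):
                mod[i] += (c / n) * p * f(x, y)
    return emp, mod

def scaling_step(lam):
    """lambda_i <- lambda_i + (1/M) * log(E_emp(f_i) / E_model(f_i))."""
    emp, mod = expectations(lam)
    return [l + math.log(e / m) / M for l, e, m in zip(lam, emp, mod)]

lam = scaling_step([0.0, 0.0])
emp, mod = expectations(lam)
```

In this deliberately simple case a single step already makes the model expectations match the empirical ones; in general the step is repeated until the change in \(\lambda\) is small. Note that the update is undefined when an empirical or model expectation is zero.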
When \(f^\#\left( {x,y} \right)\) is not a constant:
Let \(g\left( {{\delta _i}} \right) = \sum\limits_{x,y} {\bar {p} \left( x \right){p_\lambda }\left( {y\left| x \right.} \right){f_i}\left( {x,y} \right){e^{{\delta _i}{f^\# }\left( {x,y} \right)}}} - {E_{\bar {p} }}\left( {{f_i}} \right)\)
The problem then reduces to finding the root of \(g\left( \delta_i \right) = 0\)
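Newton's method: \(\delta _i^{k + 1} = \delta _i^k - \frac{{g\left( {\delta _i^k} \right)}}{{g'\left( {\delta _i^k} \right)}}\)
Note that this is a root-finding problem for \(g\left( \delta \right) = 0\), not an extremum problem. Alternatively, \(\lambda\) can be obtained by maximizing the log-likelihood directly with quasi-Newton methods such as BFGS or L-BFGS.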
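The Newton iteration for a single \(\delta_i\) can be sketched as follows. The function name `newton_delta` and its input format are assumptions of this example: each term is a pair (a, c) with \(a=\bar p(x)\,p_\lambda(y \mid x)\,f_i(x,y)\) and \(c=f^\#(x,y)\), and `target` is the empirical expectation of \(f_i\), so that \(g(\delta)=\sum a\,e^{\delta c}-\text{target}\) and \(g'(\delta)=\sum a\,c\,e^{\delta c}\).

```python
import math

def newton_delta(terms, target, delta=0.0, tol=1e-12, max_iter=50):
    """Find the root of g(delta) = sum(a * exp(delta * c)) - target by Newton's method."""
    for _ in range(max_iter):
        g = sum(a * math.exp(delta * c) for a, c in terms) - target
        if abs(g) < tol:
            break
        dg = sum(a * c * math.exp(delta * c) for a, c in terms)  # derivative g'(delta)
        delta -= g / dg
    return delta

# Illustrative numbers: f# takes the values 1 and 2, so it is not constant.
delta_mixed = newton_delta([(0.2, 1), (0.1, 2)], target=0.5)
# With constant f# = 2 the root should match the closed form (1/M) * log(target / a).
delta_const = newton_delta([(0.25, 2)], target=0.5)
```

In the first call the equation is \(0.2u+0.1u^2=0.5\) with \(u=e^{\delta}\), whose positive root is \(u=\sqrt{6}-1\); the second call recovers \(\frac{1}{2}\log 2\), matching the constant-\(f^\#\) formula.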
Substituting the \(\lambda\) obtained above into the following expressions gives the model:
\({p^*}\left( {y\left| x \right.} \right) = \frac{1}{{{z_\lambda }\left( x \right)}}{e^{\sum\limits_i {{\lambda _i}{f_i}\left( {x,y} \right)} }}\)
\({z_\lambda }\left( x \right) = \sum\limits_y {{e^{\sum\limits_i {{\lambda _i}{f_i}\left( {x,y} \right)} }}} \)