最大熵对应的概率分布
最大熵定理
设 \(X \sim p(x)\) 是一个连续型随机变量,其微分熵定义为
\[ h(X) = - \int p(x)\log p(x) dx
\]
其中,\(\log\) 一般取自然对数 \(\ln\), 单位为 奈特(nats)。
考虑如下优化问题:
\[\begin{array}{ll}
&\underset{p}{\text{Maximize}} & \displaystyle h(p) = - \int_S p(x)\log p(x) dx \\
&\text{Subject to} &\displaystyle \int_S p(x) dx = 1 \\[2pt]
&~ & p(x) \ge 0 \\[2pt]
&~ & \displaystyle \int_S p(x) f_i(x) dx = \alpha_i, ~i=1,2,3,\dots,n
\end{array}
\]
其中,集合 \(S\) 是随机变量的support,即其所有可能的取值。我们意图找到这样的概率分布 \(p\), 他满足所有的约束(前两条是概率公理的约束,最后一条叫做矩约束,在模型中有时会假设随机变量的矩为常数),并且能够使得熵最大。将上述优化问题写成标准形式:
\[\begin{array}{ll}
&\underset{p}{\text{Minimize}} & \displaystyle \int_S p(x)\log p(x) dx \\
&\text{Subject to} &-p(x) \le 0 \\[2pt]
&~ &\displaystyle \int_S p(x) dx = 1 \\
&~ & \displaystyle \int_S p(x) f_i(x) dx = \alpha_i, ~i=1,2,3,\dots,n
\end{array}
\]
使用Lagrange乘数法得到其Lagrangian
\[L(p,\boldsymbol{\lambda}) = \int_S p\log p ~dx - \mu_{-1}p + \mu_0 \left(\int_S p ~dx - 1\right) + \sum_{j=1}^n \lambda_j \left(\int_S pf_j~dx - \alpha_j\right)
\]
根据KKT条件对Lagrangian求导令为0,可得最优解。
\[\begin{gathered}
\frac{\partial L}{\partial p} = \ln p + 1 - \mu_{-1} + \mu_0 + \sum_{j=1}^n \lambda_jf_j := 0 \\
\implies p = \exp\left(-1 + \mu_{-1} - \mu_0 - \sum_{j=1}^n \lambda_j f_j \right) =\displaystyle c^* e^{-\sum_{j=1}^n\lambda_j^* f_j(x)} := p^*
\end{gathered}
\]
其中,我们要选择 \(c^*, \boldsymbol{\lambda}^*\) 使得 \(p(x)\) 满足约束。到这里我们知道,在所有满足约束的概率分布当中,\(p^*\) 是使得熵达到最大的那一个!
例子
高斯分布
-------->
约束:
- \(E(X) = 0 \implies f_1 = x\)
- \(E(X^2) = \sigma^2 \implies f_2 = x^2\)
根据上面的论证,最大熵分布应具有如下形式:
\[p(x) = ce^{-\lambda_1x - \lambda_2 x^2}
\]
再根据 KKT 条件:
- \(\int_{-\infty}^{+\infty} p(x) = 1\)
- \(\int_{-\infty}^{+\infty} x p(x) = 0\)
- \(\int x^2 p(x) = \sigma^2\)
由条件 \((2) \implies p(x)\) 是偶函数 \(\implies \lambda_1 = 0\), 原条件变成
- \(\int_{-\infty}^{+\infty} ce^{-\lambda_2x^2} = 1\)
- \(\int x^2 ce^{-\lambda_2x^2} = \sigma^2\)
\(\implies c = \frac{1}{\sqrt{2\pi\sigma^2}}, ~\lambda_2 = \frac{1}{2\sigma^2} \implies p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \sim N(0, ~\sigma^2)\)
指数分布
--------->
约束:
- \(X \ge 0\)
- \(E(X) = \frac{1}{\mu}\)
根据上面的论证,最大熵分布应具有如下形式:
\[p(x) = ce^{-\lambda_1x}
\]
再根据 KKT 条件:
- \(\int_{0}^{+\infty} ce^{-\lambda_1x}= 1\)
- \(\int_0^{+\infty} x ce^{-\lambda_1x} = \frac{1}{\mu}\)
推导如下:
\[\begin{gathered}
\int_{0}^{+\infty} e^{-\lambda_1x} = \frac{1}{c} \implies \lambda_1 = c \\
\int_{0}^{+\infty}x e^{-\lambda_1x} = \frac{1}{c\mu} = \frac{1}{\lambda_1\mu} \implies \lambda_1 = \mu
\end{gathered}
\]
\(\implies p(x) = \mu e^{-\mu x} \sim Exp(\mu)\)
均匀分布
--------->
约束:
根据上面的论证,最大熵分布应具有如下形式:
\[\begin{gathered}
p(x) = ce^{- 0x} = c \\
\int_a^b c ~dx = 1 \implies c = \frac{1}{b-a}
\end{gathered}
\]
\(\implies p(x) = \frac{1}{b-a} \sim Unif(a,~b)\)
几何分布
--------->
几何分布计数直到第一次成功前所有的失败次数。\(P(X=k) = q^kp\)
约束:
- \(X = 0,1,2,\dots\)
- \(E(X) = \frac{1-p}{p}\)
根据上面的论证,最大熵分布应具有如下形式:
\[P(X=k) = p_k = ce^{-\lambda_1 k}
\]
再根据 KKT 条件:
- \(\sum_{k=0}^{\infty} p_k = 1\)
- \(\sum_{k=0}^{\infty} k p_k = \frac{1-p}{p}\)
推导如下:
\[\begin{gathered}
\sum_{k=0}^{\infty} ce^{-\lambda_1 k} = c \sum_{k=0}^{\infty} q^k \quad(\text{where }q = e^{-\lambda_1})\\
= \frac{c}{1-q} \implies c = 1-q \\
\sum_{k=0}^{\infty} k ce^{-\lambda_1 k} = c\sum_{k=1}^{\infty} k (e^{-\lambda_1})^k = c\sum_{k=1}^{\infty}k q^k = cq \sum kq^{k-1} \\
= cq \sum (q^k)' = cq \left(\sum_{k=1}^{\infty}q^k\right)' = cq \left(\frac{q}{1-q}\right)' \\
= cq \cdot \frac{1}{(1-q)^2} = \frac{q}{1-q} = \frac{1-p}{p}
\end{gathered}
\]
\(\implies e^{-\lambda_1} = q = 1-p, ~ c =p \implies P(X=k) = p_k = pq^k \sim Geom(p)\)