Information Theory and Coding: Information Measures

Information Measures

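1. Independence and Markov Chains

Independence

For two random variables \(X\) and \(Y\), if for all \((x, y) \in \mathcal{X} \times \mathcal{Y}\) we have

\[p(x, y) = p(x)p(y) \]

then \(X\) and \(Y\) are said to be independent, written \(X \perp Y\).

Here \(p(x), p(y), p(x, y)\) are shorthand for \(\text{Pr}(X=x), \text{Pr}(Y=y), \text{Pr}(X=x, Y=y)\), respectively.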

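Mutual Independence

Given random variables \(X_{1}, \cdots, X_{n}\), if for all \((x_1, \cdots, x_{n}) \in \mathcal{X}_{1} \times \cdots \times \mathcal{X}_{n}\) we have

\[p(x_{1}, \cdots, x_{n}) = p(x_{1})\cdots p(x_{n}) \]

then \(X_{1}, \cdots, X_{n}\) are said to be mutually independent.

Pairwise Independence

Random variables \(X_{1}, \cdots, X_{n}\) are pairwise independent if \(X_{i}\) and \(X_{j}\) are independent for all \(1 \le i \lt j \le n\).

Mutual independence implies pairwise independence; the converse does not hold in general.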
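
To see why the two notions differ, here is a minimal sketch using the standard XOR construction (this example is not from the original notes): \(X\) and \(Y\) are fair independent bits and \(Z = X \oplus Y\). The helper names `marginal` and `is_product` are made up for this illustration.

```python
from itertools import product

# Joint pmf of (X, Y, Z) with X, Y fair independent bits and Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def is_product(pmf, idx):
    """True if the marginal on idx equals the product of its single-variable marginals."""
    joint_m = marginal(pmf, idx)
    singles = [marginal(pmf, (i,)) for i in idx]
    for values in product((0, 1), repeat=len(idx)):
        expected = 1.0
        for single, v in zip(singles, values):
            expected *= single.get((v,), 0.0)
        if abs(joint_m.get(values, 0.0) - expected) > 1e-12:
            return False
    return True

pairwise = all(is_product(joint, pair) for pair in [(0, 1), (0, 2), (1, 2)])
mutual = is_product(joint, (0, 1, 2))
print(pairwise, mutual)  # prints: True False
```

Every pair of variables factorizes, but the triple does not: \(p(0,0,0) = 1/4\) while \(p(0)p(0)p(0) = 1/8\).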

Conditional Independence

For random variables \(X, Y, Z\), if for all \((x, y, z)\) we have

\[p(x,y,z)p(y) = p(x,y)p(y,z) \]

then \(X\) and \(Z\) are said to be independent conditioning on \(Y\), written \(X \perp Z \mid Y\) or \(X \rightarrow Y \rightarrow Z\).
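
The condition can be checked numerically. The sketch below builds a small joint pmf as \(p(x)p(y|x)p(z|y)\) and measures the largest violation of \(p(x,y,z)p(y) = p(x,y)p(y,z)\). All numbers and names (`p_y_given_x`, `p_z_given_y`, `marginal`) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical binary chain X -> Y -> Z built as p(x) p(y|x) p(z|y).
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.5, 1: 0.5}}

p_xyz = {
    (x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

p_y = marginal(p_xyz, (1,))
p_xy = marginal(p_xyz, (0, 1))
p_yz = marginal(p_xyz, (1, 2))

# Largest violation of p(x,y,z) p(y) = p(x,y) p(y,z) over all outcomes.
max_gap = max(
    abs(p_xyz[(x, y, z)] * p_y[(y,)] - p_xy[(x, y)] * p_yz[(y, z)])
    for x, y, z in product((0, 1), repeat=3)
)
print(max_gap)
```

The gap is zero up to floating-point rounding, as expected for a pmf constructed this way.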

Markov Chain

For random variables \(X_{1}, \cdots, X_{n}\) (\(n \ge 3\)), \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) forms a Markov chain if, for all \((x_{1}, \cdots, x_{n})\):

\[p(x_{1},\cdots,x_{n})p(x_{2})\cdots p(x_{n-1}) = p(x_{1},x_{2})\cdots p(x_{n-1},x_{n}) \]

Equivalent definitions of a Markov chain:

  1. \(p(x_{1},\cdots,x_{n})=\begin{cases}p(x_{1})p(x_{2}|x_{1})\cdots p(x_{n}|x_{n-1}) & \text{if}\ \ p(x_{1})\cdots p(x_{n-1}) > 0\\ 0 & \text{otherwise}\end{cases}\)
  2. \(p(x_{t}|x_{1},\cdots,x_{t-1})=p(x_{t}|x_{t-1})\) for \(2 \le t \le n\), whenever the conditional probabilities are defined

Property:

If \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) is a Markov chain, then \(X_{n} \rightarrow \cdots \rightarrow X_{1}\) is also a Markov chain.

Property: Markov Subchains

Let \(X_{1} \rightarrow \cdots \rightarrow X_{n}\) be a Markov chain and \(\mathcal{N}_{n} = \left\{1, 2, \cdots, n\right\}\). For a subset \(\alpha\) of \(\mathcal{N}_{n}\), let \(X_{\alpha}\) denote \(\left\{X_{i}: i \in \alpha\right\}\). Given disjoint subsets \(\alpha_{1}, \cdots, \alpha_{m}\) of \(\mathcal{N}_{n}\), if \(k_{1} \lt \cdots \lt k_{m}\) holds for every choice of \(k_{j} \in \alpha_{j}, j = 1, \cdots, m\), then \(X_{\alpha_{1}}\rightarrow\cdots\rightarrow X_{\alpha_{m}}\) forms a Markov chain.

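2. Shannon's Information Measures

Entropy:

The entropy of a random variable \(X\) is defined as: \(\displaystyle H(X) = -\sum_{x\in \mathcal{X}}p(x)\log p(x) = \sum_{x \in \mathcal{X}}p(x)\log \frac{1}{p(x)}\)

\(\displaystyle \log \frac{1}{p(X)}\) is the information content of \(X\), so entropy is the expected information content, i.e. \(H(X) = E \log\frac{1}{p(X)}\)

Example: entropy of a binary random variable

Let \(X \sim \text{Bernoulli}(p)\); then \(H(p) = p \times \log \frac{1}{p} + (1-p) \times \log \frac{1}{1-p}\). \(H(p)\) is a function of \(p\) that attains its maximum at \(p = 0.5\).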
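
A quick numerical check of the binary entropy function, as a minimal sketch. It assumes base-2 logarithms (the notes leave the base unspecified) and the usual convention that a zero-probability outcome contributes nothing; the function name `binary_entropy` is chosen here for illustration.

```python
import math

def binary_entropy(p):
    """H(p) in bits for X ~ Bernoulli(p), treating 0 * log(1/0) as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# Evaluate on a grid of 101 points in [0, 1] and locate the maximizer.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=binary_entropy)
print(best_p, binary_entropy(best_p))  # prints: 0.5 1.0
```

With base-2 logarithms the maximum is exactly one bit, reached by a fair coin.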

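Joint entropy:

The joint entropy of random variables \(X, Y\) is defined as: \(\displaystyle H(X, Y) = -\sum_{x,y}p(x,y)\log p(x,y) = \sum_{x,y}p(x,y)\log \frac{1}{p(x,y)}\)

\(\log \frac{1}{p(X,Y)}\) is the information content of the pair \((X, Y)\).

Conditional entropy:

For random variables \(X\) and \(Y\), the conditional entropy of \(Y\) given \(X\) is defined as: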

\[\begin{align*} H(Y|X) &= \sum_{x}p(x)H(Y|X=x)\\ &= \sum_{x}p(x)\sum_{y}p(y|x)\log \frac{1}{p(y|x)}\\ &= \sum_{x,y}p(x,y)\log \frac{1}{p(y|x)}\\ &= E\log \frac{1}{p(Y|X)} \end{align*} \]

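Relation between joint entropy and conditional entropy: \(H(X,Y)=H(X)+H(Y|X) = H(Y) + H(X|Y)\)

Conditioning on a mix of random variables and events follows the same pattern, for example:

\(\displaystyle H(X,Y|Z,W=w,S=s,U) = \sum_{x,y,z,u}p(x,y,z,u|w,s)\log \frac{1}{p(x,y|z,w,s,u)}\)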

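Mutual information:

The mutual information between random variables \(X\) and \(Y\) is defined as: \(\displaystyle I(X;Y) = \sum_{x,y}p(x,y)\log \frac{p(x,y)}{p(x)p(y)} = E \log \frac{p(X,Y)}{p(X)p(Y)}\)

Relation between mutual information and conditional entropy: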

\(H(X) = H(X|Y) + I(X;Y)\)

\(H(Y) = H(Y|X) + I(X;Y)\)
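
The identities above can be verified on a concrete joint distribution. The sketch below computes each quantity directly from its definition and then checks \(H(X,Y) = H(X) + H(Y|X)\) and \(H(Y) = H(Y|X) + I(X;Y)\). The joint pmf is a made-up example and logarithms are taken base 2 as an assumption.

```python
import math

# Hypothetical joint pmf p(x, y) on {0, 1} x {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy)

# H(Y|X) from the definition: sum p(x,y) log(1 / p(y|x)), with p(y|x) = p(x,y) / p(x).
h_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in p_xy.items() if p > 0)

# I(X;Y) from the definition: sum p(x,y) log(p(x,y) / (p(x) p(y))).
i_xy = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

print(abs(h_xy - (h_x + h_y_given_x)))  # gap in H(X,Y) = H(X) + H(Y|X)
print(abs(h_y - (h_y_given_x + i_xy)))  # gap in H(Y) = H(Y|X) + I(X;Y)
```

Both gaps are zero up to floating-point rounding.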

Conditional mutual information:

For random variables \(X, Y, Z\), the conditional mutual information between \(X\) and \(Y\) given \(Z\) is defined as:

\[\begin{align*} I(X;Y|Z) &= \sum_{z}p(z)\sum_{x,y}p(x,y|z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)} \\ &= \sum_{x,y,z}p(x,y,z)\log\frac{p(x,y|z)}{p(x|z)p(y|z)}\\ &= E\log\frac{p(X,Y|Z)}{p(X|Z)p(Y|Z)} \end{align*} \]

When conditioning on both an event and a random variable, the definition reads:

\(\displaystyle I(X;Y|Z=z,V)=\sum_{x,y,v}p(x,y,v|z)\log\frac{p(x,y|z,v)}{p(x|z,v)p(y|z,v)}\)

3. Chain Rules

\(\displaystyle H(X_{1}, \dots, X_{n})=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1})\)

\(\displaystyle H(X_{1}, \dots, X_{n} \mid Y)=\sum_{i=1}^{n}H(X_{i} \mid X_{1}, \dots, X_{i-1},Y)\)

\(\displaystyle I(X_{1}, \dots, X_{n};Y) = \sum_{i=1}^{n}I(X_{i};Y|X_{1}, \dots, X_{i-1})\)

\(\displaystyle I(X_{1}, \dots, X_{n};Y\mid Z) = \sum_{i=1}^{n}I(X_{i};Y|X_{1}, \dots, X_{i-1}, Z)\)
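
As a sanity check of the first chain rule, the sketch below compares \(H(X_1, X_2, X_3)\) with \(\sum_i H(X_i \mid X_1, \dots, X_{i-1})\) on an arbitrary joint pmf. The weights, the helper names (`prefix_marginal`, `conditional_entropy`) and the base-2 logarithm are assumptions made for this illustration.

```python
import math
from itertools import product

# Hypothetical joint pmf of (X1, X2, X3) on {0,1}^3; the weights are arbitrary.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product((0, 1), repeat=3), weights)}

def prefix_marginal(pmf, k):
    """pmf of the first k coordinates (k = 0 gives the trivial pmf {(): 1})."""
    out = {}
    for outcome, prob in pmf.items():
        out[outcome[:k]] = out.get(outcome[:k], 0.0) + prob
    return out

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

def conditional_entropy(pmf, i):
    """H(X_i | X_1, ..., X_{i-1}) from the definition, with 1-based i."""
    upto_i = prefix_marginal(pmf, i)
    before_i = prefix_marginal(pmf, i - 1)
    # p(x_i | x_1..x_{i-1}) = p(x_1..x_i) / p(x_1..x_{i-1})
    return sum(p * math.log2(before_i[xs[:-1]] / p) for xs, p in upto_i.items() if p > 0)

lhs = entropy(joint)
rhs = sum(conditional_entropy(joint, i) for i in (1, 2, 3))
print(lhs, rhs)
```

The two printed values agree up to floating-point rounding.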

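4. Informational Divergence

Informational divergence / KL distance / relative entropy

The informational divergence between two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\) is defined as:

\[D(p \parallel q) = \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)} = E_{p}\log \frac{p(X)}{q(X)} \]

Mutual information is a special case: \(\displaystyle I(X;Y) = D(p(x,y)\parallel p(x)p(y))\)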

Property

For two distributions \(p\) and \(q\) on a common alphabet \(\mathcal{X}\), the inequality \(\ln x \ge 1 - \frac{1}{x}\) gives:

\[\begin{align*} D(p\parallel q) &= \sum_{x \in \mathcal{X}}p(x) \log \frac{p(x)}{q(x)}\\ &= \log e \sum_{x \in \mathcal{X}}p(x) \ln \frac{p(x)}{q(x)}\\ &\ge \log e \sum_{x \in \mathcal{X}} p(x) (1 - \frac{q(x)}{p(x)})\\ &= \log e\sum_{x \in \mathcal{X}}(p(x) - q(x))\\ &= 0 \end{align*} \]

Equality holds if and only if \(p = q\).
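
A minimal numerical sketch of these properties, assuming base-2 logarithms and two made-up distributions `p` and `r`. It also shows that \(D(p \parallel r)\) and \(D(r \parallel p)\) generally differ, so the divergence is not symmetric despite the name "KL distance".

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(x) = 0 contribute 0, and q(x) = 0 < p(x) gives infinity."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

# Two hypothetical distributions on the alphabet {a, b, c}.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
r = {"a": 0.8, "b": 0.1, "c": 0.1}

d_pr = kl_divergence(p, r)
d_rp = kl_divergence(r, p)
d_pp = kl_divergence(p, p)
print(d_pr, d_rp, d_pp)
```

Both cross divergences are positive and unequal, while \(D(p \parallel p) = 0\).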

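Metric

A function \(\rho(x, y)\) is a metric if, for all \(x, y, z\):

  1. \(\rho(x, y) \ge 0\)
  2. \(\rho(x, y) = \rho(y, x)\)
  3. \(\rho(x, y) = 0\) if and only if \(x = y\)
  4. \(\rho(x, y) + \rho(y, z) \ge \rho(x, z)\)

Example

\(\rho(X, Y) = H(X|Y) + H(Y|X)\) satisfies conditions 1, 2 and 4. If \(X = Y\) is taken to mean that there exists a one-to-one mapping between \(X\) and \(Y\), then condition 3 holds as well.

Proof of condition 4: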

\[\begin{align*} \rho(X,Z) &= H(X|Z) + H(Z|X)\\ &= I(X;Y|Z) + H(X|Y,Z) + I(Y;Z|X) + H(Z|X,Y)\\ &\le H(Y|Z) + H(X|Y) + H(Y|X)+H(Z|Y)\\ &= H(X|Y) + H(Y|X) + H(Y|Z) + H(Z|Y)\\ &= \rho(X,Y) + \rho(Y,Z) \end{align*} \]
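
The triangle inequality can also be spot-checked numerically. The sketch below draws random joint pmfs of three ternary variables and records the slack \(\rho(X,Y) + \rho(Y,Z) - \rho(X,Z)\). It uses the identity \(H(X|Y) + H(Y|X) = 2H(X,Y) - H(X) - H(Y)\); the helper names and the random test distributions are assumptions for illustration only.

```python
import math
import random
from itertools import product

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in pmf.values() if p > 0)

def rho(pmf, i, j):
    """rho = H(Xi|Xj) + H(Xj|Xi) = 2 H(Xi,Xj) - H(Xi) - H(Xj)."""
    h_ij = entropy(marginal(pmf, (i, j)))
    return 2 * h_ij - entropy(marginal(pmf, (i,))) - entropy(marginal(pmf, (j,)))

rng = random.Random(0)  # fixed seed so the run is repeatable
slack = []
for _ in range(200):
    weights = [rng.random() for _ in range(27)]
    total = sum(weights)
    joint = {o: w / total for o, w in zip(product(range(3), repeat=3), weights)}
    # Coordinates 0, 1, 2 play the roles of X, Y, Z.
    slack.append(rho(joint, 0, 1) + rho(joint, 1, 2) - rho(joint, 0, 2))
print(min(slack))
```

A nonnegative minimum slack is consistent with condition 4; a random search like this illustrates the inequality but does not prove it.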

5. Basic Inequalities

Logarithm Inequality: for \(x > 0\), \(\displaystyle \ln x \le x - 1 \Leftrightarrow \ln x \ge 1 - \frac{1}{x}\)

Jensen's Inequality: if \(f\) is a convex function, \(\lambda_i \ge 0\) and \(\sum \lambda_i = 1\), then \(\displaystyle f\left(\sum \lambda_ix_i\right) \le \sum \lambda_i f(x_i)\)

Relative Entropy Inequality: for probability distributions \(\{p_i\}\) and \(\{q_i\}\), \(\displaystyle \sum_i p_i \log \frac{p_i}{q_i} \ge 0\), with equality if and only if \(p_i = q_i\) for all \(i\)

Log-Sum Inequality: for positive numbers \(u_i, v_i\), \(\displaystyle \sum u_{i} \log \frac{u_i}{v_i} \ge \left(\sum u_{i}\right) \log \frac{\sum u_{i}}{\sum v_{i}}\), with equality if and only if \(\displaystyle \frac{u_{i}}{v_{i}} = \text{constant}\)
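
A small worked check of the log-sum inequality, including its equality case. The sequences `u` and `v` are arbitrary positive numbers picked for illustration, and `log_sum_gap` is a hypothetical helper name.

```python
import math

def log_sum_gap(u, v):
    """Left-hand side minus right-hand side of the log-sum inequality (base-2 logs)."""
    lhs = sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v))
    rhs = sum(u) * math.log2(sum(u) / sum(v))
    return lhs - rhs

u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 4.0]
gap_generic = log_sum_gap(u, v)                           # ratios u_i / v_i differ
gap_proportional = log_sum_gap(u, [2 * ui for ui in u])   # u_i / v_i = 1/2 for all i
print(gap_generic, gap_proportional)
```

The gap is strictly positive when the ratios differ and vanishes when \(u_i / v_i\) is constant.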

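6. Some Inequalities on Information Measures

  • \(H(X) \ge 0\), with equality if and only if \(X\) is deterministic. Proof: \(H(X) = I(X;X) = D(p(x,x)\parallel p(x) p(x)) \ge 0\)

  • \(H(Y|X) \ge 0\), with equality if and only if \(Y\) is a function of \(X\). Proof: \(H(Y|X) = I(Y;Y|X) = D(p(y,y|x)\parallel p(y|x)p(y|x))\ge 0\)

  • \(I(X;Y) \ge 0\), with equality if and only if \(X\) and \(Y\) are independent

  • \(I(X;Y|Z) \ge 0\), with equality if and only if \(X\) and \(Y\) are conditionally independent given \(Z\)

Theorem:

\(H(Y|X) \le H(Y)\), with equality if and only if \(X\) and \(Y\) are independent. Proof: \(H(Y) = H(Y|X) + I(X;Y) \ge H(Y|X)\)

Theorem:

\(\displaystyle H(X_1, X_2, \dots, X_n) \le \sum_{i=1}^{n} H(X_i)\), with equality if and only if the \(X_i\) are mutually independent. Proof: \(\displaystyle H(X_1, \dots, X_n) = \sum_{i=1}^{n}H(X_{i}|X_{1}, \dots, X_{i-1}) \le \sum_{i=1}^{n}H(X_i)\)

Theorem:

\(I(X;Y,Z) \ge I(X;Y)\), with equality if and only if \(X \rightarrow Y \rightarrow Z\) forms a Markov chain. Proof: \(I(X;Y,Z) = I(X;Y) + I(X;Z|Y) \ge I(X;Y)\)

Theorem:

If \(U \rightarrow X \rightarrow Y \rightarrow V\) forms a Markov chain, then \(I(X;Y) \ge I(U;V)\). Proof: since \(U\rightarrow X\rightarrow Y\) is a Markov chain, \(I(U;Y|X) = 0\), so \(I(X;Y) = I(U,X;Y)=I(U;Y)+I(X;Y|U) \ge I(U;Y)\); similarly, since \(U \rightarrow Y \rightarrow V\) is a Markov chain, \(I(U;Y) \ge I(U;V)\)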
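
The last theorem can be illustrated numerically. The sketch below builds a hypothetical chain \(U \rightarrow X \rightarrow Y \rightarrow V\) from a biased bit passed through three binary symmetric channels (the crossover probabilities are made up) and compares \(I(X;Y)\), \(I(U;Y)\) and \(I(U;V)\).

```python
import math
from itertools import product

def bsc(eps):
    """Transition table of a binary symmetric channel with crossover probability eps."""
    return {a: {b: (1 - eps if a == b else eps) for b in (0, 1)} for a in (0, 1)}

# Hypothetical chain U -> X -> Y -> V: a biased bit passed through three noisy channels.
p_u = {0: 0.3, 1: 0.7}
ch_ux, ch_xy, ch_yv = bsc(0.1), bsc(0.2), bsc(0.1)
joint = {
    (u, x, y, v): p_u[u] * ch_ux[u][x] * ch_xy[x][y] * ch_yv[y][v]
    for u, x, y, v in product((0, 1), repeat=4)
}

def mutual_information(pmf, i, j):
    """I(Xi; Xj) in bits, computed from the joint pmf of all coordinates."""
    p_ij, p_i, p_j = {}, {}, {}
    for outcome, prob in pmf.items():
        a, b = outcome[i], outcome[j]
        p_ij[(a, b)] = p_ij.get((a, b), 0.0) + prob
        p_i[a] = p_i.get(a, 0.0) + prob
        p_j[b] = p_j.get(b, 0.0) + prob
    return sum(p * math.log2(p / (p_i[a] * p_j[b])) for (a, b), p in p_ij.items() if p > 0)

i_xy = mutual_information(joint, 1, 2)
i_uy = mutual_information(joint, 0, 2)
i_uv = mutual_information(joint, 0, 3)
print(i_xy, i_uy, i_uv)
```

The printed values decrease from left to right, matching \(I(X;Y) \ge I(U;Y) \ge I(U;V)\): each extra processing step can only lose information.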

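Theorem:

For a random variable \(X\), entropy is maximized when \(X\) is uniformly distributed, i.e. \(H(X) \le \log \left|\mathcal{X}\right|\). Proof: let \(u(x)\) be the uniform distribution on \(\mathcal{X}\); then \(D(p(x)\parallel u(x)) = \log \left|\mathcal{X}\right| - H(X) \ge 0\)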

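Fano's Inequality

Let \(X\) be a random variable and \(\hat{X}\) an estimate of \(X\) (with \(X, \hat{X} \in \mathcal{X}\)), and let the error probability be \(P_e = \text{Pr}(X \neq \hat{X})\). Then, with \(h_b\) denoting the binary entropy function: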

\[H(X\mid \hat{X}) \le h_b(P_e) + P_e \log (\left|\mathcal{X}\right|-1) \]

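Proof: define the error indicator \(Y = 1\cdot\left\{X \neq \hat{X}\right\}\); then \(\text{Pr}(Y=1) = P_e, \text{Pr}(Y=0) = 1 - P_e, H(Y) = h_{b}(P_e)\). Since \(Y\) is a function of \((X, \hat{X})\), \(H(Y|X,\hat{X}) = 0\); and given \(Y = 0\) we have \(X = \hat{X}\), so \(H(X|Y=0,\hat{X}) = 0\). Given \(Y = 1\) and \(\hat{X} = \hat{x}\), \(X\) takes at most \(\left|\mathcal{X}\right| - 1\) values. Hence:

\[\begin{align*} H(X|\hat{X}) &= H(X|\hat{X}) + H(Y|X,\hat{X})\\ &= H(X,Y|\hat{X})\\ &= H(Y|\hat{X})+H(X|Y,\hat{X})\\ &=H(Y|\hat{X}) + \text{Pr}(Y=1)H(X|Y=1,\hat{X})\\ &\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}}\text{Pr}(\hat{X}=\hat{x} \mid Y=1)H(X|Y=1,\hat{X}=\hat{x})\\ &\le H(Y) + \text{Pr}(Y=1)\sum_{\hat{x} \in \mathcal{X}}\text{Pr}(\hat{X}=\hat{x} \mid Y=1)\log (\left|\mathcal{X}\right|-1)\\ &= h_b(P_e) + P_e\log (\left|\mathcal{X}\right|-1) \end{align*} \]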
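
To make the bound concrete, the sketch below evaluates both sides of Fano's inequality on a made-up joint pmf of \((X, \hat{X})\) over a three-symbol alphabet. The numbers are hypothetical and logarithms are taken base 2 as an assumption.

```python
import math

# Hypothetical joint pmf of (X, X_hat) on a 3-symbol alphabet; keys are (x, x_hat).
alphabet = (0, 1, 2)
joint = {
    (0, 0): 0.30, (0, 1): 0.03, (0, 2): 0.02,
    (1, 0): 0.04, (1, 1): 0.25, (1, 2): 0.01,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.25,
}

# Marginal of the estimate X_hat.
p_hat = {}
for (x, x_hat), p in joint.items():
    p_hat[x_hat] = p_hat.get(x_hat, 0.0) + p

# H(X | X_hat) = sum p(x, x_hat) log(p(x_hat) / p(x, x_hat)).
h_x_given_hat = sum(p * math.log2(p_hat[x_hat] / p) for (x, x_hat), p in joint.items() if p > 0)

# Error probability and the right-hand side h_b(P_e) + P_e log(|X| - 1).
p_e = sum(p for (x, x_hat), p in joint.items() if x != x_hat)
h_b = p_e * math.log2(1 / p_e) + (1 - p_e) * math.log2(1 / (1 - p_e))
fano_bound = h_b + p_e * math.log2(len(alphabet) - 1)
print(h_x_given_hat, fano_bound)
```

For this example \(P_e = 0.2\), and the conditional entropy sits slightly below the bound.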

7. Entropy Rate of a Stationary Source

Discrete-time information source: \(\left\{X_{k}: k \ge 1\right\}\)

Entropy rate: the entropy rate of \(\left\{X_{k}\right\}\) is defined as \(H_X=\displaystyle \lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, X_2, \cdots, X_{n})\), when the limit exists.

Example:

Let \(\left\{X_{k}\right\}\) be an i.i.d. source, and let \(X\) denote the random variable at any time step. Then:

\[\lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n}) = \lim_{n\rightarrow \infty}\frac{n\cdot H(X)}{n} = H(X) \]

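The entropy rate exists and equals \(H(X)\).

Example:

Let \(\left\{X_{k}\right\}\) be a source in which the \(X_k\) are mutually independent and \(H(X_{k}) = k\). Then:

\[\lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \cdots, X_{n}) = \lim_{n\rightarrow \infty}\frac{1}{n}\sum_{k=1}^{n}k = \lim_{n\rightarrow \infty}\frac{n+1}{2} = \infty \]

The limit diverges, so the entropy rate does not exist.

Stationary information source: a source \(\left\{X_{k}\right\}\) is called stationary if, for all \(m, l \ge 1\), \(X_1, X_2, \cdots, X_m\) and \(X_{1+l}, X_{2+l}, \cdots, X_{m+l}\) have the same joint distribution.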

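Definition: \(\displaystyle H_X^{'} = \lim_{n\rightarrow \infty}H(X_n|X_1, X_2, \dots, X_{n-1})\)

Theorem: for a stationary source \(\left\{X_k\right\}\), the entropy rate \(H_X\) exists and \(H_X = H_{X}^{'}\)

Proof: conditioning does not increase entropy, and the source is stationary, so \(H(X_n|X_1, X_2, \dots, X_{n-1}) \le H(X_n|X_2, \dots, X_{n-1})=H(X_{n-1}|X_1, X_2, \dots, X_{n-2})\). Let \(a_n = H(X_n|X_1, X_2, \dots, X_{n-1})\); the sequence is non-increasing and bounded below by 0, so its limit exists.

Since the running average (Cesàro mean) of a convergent sequence converges to the same limit:

\(\displaystyle H_{X}^{'} = \lim_{n \rightarrow \infty}a_{n} = \lim_{n\rightarrow \infty}\frac{\sum_{i=1}^{n}a_i}{n}=\lim_{n\rightarrow \infty} \frac{1}{n}\sum_{i=1}^{n}H(X_i|X_1, X_2, \dots, X_{i-1}) = \lim_{n\rightarrow \infty}\frac{1}{n}H(X_1, \dots, X_n) = H_{X}\)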
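
The theorem can be observed on a small stationary source. The sketch below uses a two-state Markov source started from its stationary distribution (an example chosen here, not taken from the notes) and computes both \(a_n = H(X_n|X_1, \dots, X_{n-1})\) and \(\frac{1}{n}H(X_1, \dots, X_n)\) by brute force. The transition table `P` and the helper `block_entropy` are hypothetical.

```python
import math
from itertools import product

# Hypothetical two-state Markov source started from its stationary distribution.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
pi = {0: 0.8, 1: 0.2}  # solves pi = pi P for the table above

def block_entropy(n):
    """H(X_1, ..., X_n) in bits by brute-force enumeration of all 2^n sequences."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        prob = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            prob *= P[a][b]
        total += prob * math.log2(1 / prob)
    return total

# a_n = H(X_1..X_n) - H(X_1..X_{n-1}) by the chain rule, with a_1 = H(X_1).
h = [0.0] + [block_entropy(n) for n in range(1, 13)]
a = [h[n] - h[n - 1] for n in range(1, 13)]
averages = [h[n] / n for n in range(1, 13)]
print(a[-1], averages[-1])
```

The conditional entropies are non-increasing, and the per-symbol averages approach the same value from above, more slowly because they still carry the larger early terms.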

posted @ 2019-12-29 19:44  gxzzz