Elements of Information Theory: Exercises

2.2 Entropy of functions.

Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if
(a) $Y = 2^X$?
(b) $Y = \cos X$?

$H(X) \ge H(Y)$ if $Y$ is a function of $X$. This holds because applying a function can only lose information when the mapping is not one-to-one. In (a), $Y = 2^X$ is one-to-one, so $H(Y) = H(X)$; in (b), $\cos$ is in general many-to-one, so $H(Y) \le H(X)$.
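
One way to make this precise is the standard chain-rule argument: expanding $H(X, g(X))$ in two ways,

$$H(X, g(X)) = H(X) + H(g(X)\mid X) = H(X), \qquad H(X, g(X)) = H(g(X)) + H(X\mid g(X)) \ge H(g(X)),$$

since $H(g(X)\mid X) = 0$ (the value of $g(X)$ is determined by $X$) and $H(X\mid g(X)) \ge 0$. Hence $H(g(X)) \le H(X)$, with equality exactly when $g$ is one-to-one on the support of $X$.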

2.6 Conditional mutual information vs. unconditional mutual information.

Give examples of joint random variables X,Y, and Z such that
(a) $I(X;Y\mid Z) < I(X;Y)$.
(b) $I(X;Y\mid Z) > I(X;Y)$.

(a) Let $X = Y = Z$, all uniform binary. Then $I(X;Y) = H(X) = 1$ bit, while $I(X;Y\mid Z) = H(X\mid Z) - H(X\mid Y, Z) = 0$.

At first it seems counterintuitive that $I(X;Y\mid Z) = 0$ here, but it should be read as "$X$ and $Y$ are conditionally independent given $Z$", in analogy with the fact that $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

Interpretation by ChatGPT

In the case where $X = Y = Z$, the conditional mutual information $I(X;Y\mid Z)$ is zero because the two random variables $X$ and $Y$ are completely determined by the value of the random variable $Z$. In other words, knowing the value of $Z$ gives us complete information about both $X$ and $Y$, so there is no additional information to be gained by considering $X$ and $Y$ together. This means that the two random variables are independent given the value of $Z$, so $I(X;Y\mid Z) = 0$.

(b) Let $X$ and $Y$ be independent and $Z = X + Y$. Intuitively, conditioning on $Z$ destroys the independence of $X$ and $Y$: once $Z$ is known, knowing one of them pins down the other.
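
For concreteness (my own check, assuming $X$ and $Y$ are independent Bernoulli(1/2) bits):

$$I(X;Y\mid Z) = H(X\mid Z) - H(X\mid Y, Z) = \tfrac{1}{2} - 0 = \tfrac{1}{2} > 0 = I(X;Y),$$

since $Z \in \{0, 2\}$ determines $X$, $Z = 1$ (probability $1/2$) leaves $X$ uniform, and $X = Z - Y$.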

2.28 Mixing increases entropy.

Show that the entropy of the probability distribution, $(p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m)$, is less than the entropy of the distribution $(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_i + p_j}{2}, \ldots, p_m)$. Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.
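
A sketch of one standard argument (my own note, not the book's solution): write $P = (p_1, \ldots, p_m)$ and let $P'$ be $P$ with $p_i$ and $p_j$ swapped. The averaged distribution above is $\tfrac{1}{2}P + \tfrac{1}{2}P'$, so by concavity of entropy and the symmetry $H(P') = H(P)$,

$$H\!\left(\tfrac{1}{2}P + \tfrac{1}{2}P'\right) \ge \tfrac{1}{2}H(P) + \tfrac{1}{2}H(P') = H(P).$$

More generally, moving $\delta \in [0, p_i - p_j]$ of probability from a larger $p_i$ to a smaller $p_j$ gives $(p_i - \delta,\, p_j + \delta) = (1 - t)(p_i, p_j) + t\,(p_j, p_i)$ with $t = \delta/(p_i - p_j) \in [0, 1]$, and the same concavity argument applies.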

2.30 Maximum entropy.

Find the probability mass function p(x) that maximizes the entropy H(X) of a nonnegative integer-valued random variable X subject to the constraint

$$E X = \sum_{n=0}^{\infty} n\, p(n) = A$$

for a fixed value A>0. Evaluate this maximum H(X).

Sol. We tackle this problem using Lagrange multipliers:

$$\nabla f = \lambda \nabla g + \mu \nabla h$$

where, writing $f$ for the negative of the entropy (which has the same constrained critical points as $H$),

$$f = \sum_i p_i \log p_i, \qquad g = \sum_i p_i, \qquad h = \sum_i i\, p_i,$$

which becomes

$$\begin{cases} 1 + \log p_i = \lambda + i\mu \\ \sum_i p_i = 1 \\ \sum_i i\, p_i = A \end{cases}$$

we can rewrite $1 + \log p_i = \lambda + i\mu$ in the form

$$p_i = e^{\lambda - 1} \left(e^{\mu}\right)^{i}$$

to simplify the notation, let

$$\alpha = e^{\lambda - 1}, \qquad \beta = e^{\mu},$$

so the remaining constraints become

$$\begin{cases} \sum_{i=0}^{\infty} \alpha \beta^{i} = 1 \\ \sum_{i=0}^{\infty} i\, \alpha \beta^{i} = A \end{cases}$$

Using $\sum_{i=0}^{\infty} \beta^{i} = \frac{1}{1-\beta}$ and $\sum_{i=0}^{\infty} i \beta^{i} = \frac{\beta}{(1-\beta)^2}$ (valid for $0 \le \beta < 1$), this implies

$$\beta = \frac{A}{A+1}, \qquad \alpha = \frac{1}{A+1}.$$

So the entropy maximizing distribution is,

$$p_i = \frac{1}{A+1}\left(\frac{A}{A+1}\right)^{i}.$$

Plugging these values into the expression for the maximum entropy,

$$H(X) = -\log \alpha - A \log \beta = (A+1)\log(A+1) - A\log A.$$
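
A quick numerical sanity check (my own, not part of the book's solution; it assumes the particular value $A = 3$):

Clear[A, a, b, p];
A = 3;
a = 1/(A + 1); b = A/(A + 1);
p[i_] := a b^i;                                (* the claimed maximizing pmf *)
NSum[i p[i], {i, 0, Infinity}]                 (* mean; should return A = 3 *)
NSum[-p[i] Log2[p[i]], {i, 0, Infinity}]       (* entropy in bits *)
N[(A + 1) Log2[A + 1] - A Log2[A]]             (* closed form; should match the line above *)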

2.33 Fano's inequality.

Let $\Pr(X = i) = p_i$, $i = 1, 2, \ldots, m$, and let $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_m$. The minimal probability of error predictor of $X$ is $\hat{X} = 1$, with resulting probability of error $P_e = 1 - p_1$. Maximize $H(p)$ subject to the constraint $1 - p_1 = P_e$ to find a bound on $P_e$ in terms of $H$. This is Fano's inequality in the absence of conditioning.

Solution (Thomas M. Cover, Joy A. Thomas): (Fano's Inequality.) The minimal probability of error predictor when there is no information is $\hat{X} = 1$, the most probable value of $X$. The probability of error in this case is $P_e = 1 - p_1$. Hence if we fix $P_e$, we fix $p_1$. We maximize the entropy of $X$ for a given $P_e$ to obtain an upper bound on the entropy for a given $P_e$. The entropy,

$$\begin{aligned} H(p) &= -p_1 \log p_1 - \sum_{i=2}^{m} p_i \log p_i \\ &= -p_1 \log p_1 - \sum_{i=2}^{m} P_e \frac{p_i}{P_e} \log \frac{p_i}{P_e} - P_e \log P_e \\ &= H(P_e) + P_e H\!\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right) \\ &\le H(P_e) + P_e \log(m-1), \end{aligned}$$

since the maximum of $H\!\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right)$ is attained by a uniform distribution. Hence any $X$ that can be predicted with a probability of error $P_e$ must satisfy

$$H(X) \le H(P_e) + P_e \log(m-1),$$

which is the unconditional form of Fano's inequality. We can weaken this inequality (using $H(P_e) \le 1$ bit) to obtain an explicit lower bound for $P_e$:

$$P_e \ge \frac{H(X) - 1}{\log(m-1)}.$$

My own attempt (is this correct?):

$$H(X\mid\hat{X}) = \Pr\{\hat{X} = 1\}\, H(X\mid\hat{X} = 1) = 1 \cdot H(X\mid\hat{X} = 1) = H(X),$$

or

$$I(X;\hat{X}) = H(\hat{X}) - H(\hat{X}\mid X) = H(X) - H(X\mid\hat{X}),$$

where $H(\hat{X}) = 0$ because $\hat{X} = 1$ with probability 1 (so $H(\hat{X}\mid X) = 0$ as well, giving $I(X;\hat{X}) = 0$), and $H(X\mid\hat{X}) \le H(X)$ because conditioning reduces entropy. So $H(X\mid\hat{X}) = H(X)$.

then

$$H(X\mid\hat{X}) = H(X) \le H(P_e) + P_e \log(m-1).$$
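
A quick numerical check of the bound (my own sketch, not from the book): draw a random pmf on $m = 5$ symbols, sort it so that $p_1$ is the largest mass, and compare $H(X)$ with $H(P_e) + P_e \log(m-1)$.

Clear[m, p, pe, h, bound];
m = 5;
p = Sort[#/Total[#] &[RandomReal[{0, 1}, m]], Greater];          (* random pmf, sorted so p1 is largest *)
pe = 1 - First[p];                                               (* error probability of the guess X^ = 1 *)
h = -Total[p Log2[p]];                                           (* H(X) in bits *)
bound = -pe Log2[pe] - (1 - pe) Log2[1 - pe] + pe Log2[m - 1];   (* H(Pe) + Pe Log2(m - 1) *)
{h, bound, h <= bound}                                           (* the last entry should always be True *)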

2.46 Axiomatic definition of entropy (Difficult).

If we assume certain axioms for our measure of information, we will be forced to use a logarithmic measure such as entropy. Shannon used this to justify his initial definition of entropy. In this book we rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.
If a sequence of symmetric functions $H_m(p_1, p_2, \ldots, p_m)$ satisfies the following properties:

  • Normalization: $H_2\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = 1$,
  • Continuity: $H_2(p, 1-p)$ is a continuous function of $p$,
  • Grouping: $H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2)\, H_2\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)$,

prove that $H_m$ must be of the form

$$H_m(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i, \qquad m = 2, 3, \ldots$$

There are various other axiomatic formulations which result in the same definition of entropy. See, for example, the book by Csiszár and Körner [149].

3.13 Calculation of typical set.

To clarify the notion of a typical set $A_\epsilon^{(n)}$ and the smallest set of high probability $B_\delta^{(n)}$, we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, $X_1, X_2, \ldots, X_n$, where the probability that $X_i = 1$ is 0.6 (and therefore the probability that $X_i = 0$ is 0.4).
(a) Calculate H(X).
(b) With $n = 25$ and $\epsilon = 0.1$, which sequences fall in the typical set $A_\epsilon^{(n)}$? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with $k$ 1's, $0 \le k \le 25$, and finding those sequences that are in the typical set.)
(c) How many elements are there in the smallest set that has probability 0.9 ?
(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

The table in the book seems to be problematic; here is my version. (Note: the fourth column of my table is $-\frac{1}{n}\log_2$ of the total probability of observing $k$ 1's, which includes the $\binom{n}{k}$ factor, whereas the book's fourth column is the per-sequence value $-\frac{1}{n}\log_2 p(x^n) = -\frac{1}{n}\log_2\!\left(p^k (1-p)^{n-k}\right)$, so the two do not agree.)

The answer is not finished and may also be problematic.

Clear[n, k, p];
n = 25;
p = 0.6;

(* columns: k, the number of sequences with k 1's, the probability of observing k 1's,
   and -1/n Log2 of that probability *)
Table[{k, Binomial[n, k],
    Binomial[n, k] p^k (1 - p)^(n - k),
    (-1/n) Log2[Binomial[n, k] p^k (1 - p)^(n - k)]}, {k, 0, 25}] // MatrixForm

(Output: a 26-row table with columns $k$, $\binom{n}{k}$, $P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}$, and $-\frac{1}{n}\log_2 P(K = k)$, for $k = 0, \ldots, 25$.)

Sol.

(a)

p = {0.4, 0.6};
hx = Total[-p Log2[p]]   (* H(X) in bits *)
n = 25;
e = 0.1;
0.970951

$$H(X) = -0.6 \log 0.6 - 0.4 \log 0.4 = 0.97095 \text{ bits}.$$

(b)

hx + e   (* upper edge of the typical interval *)
hx - e   (* lower edge *)
1.07095
0.870951

Choose the items (values of $k$) whose $-\frac{1}{n}\log_2 p(x^n)$ lies in the interval $[0.870951, 1.07095]$.
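
A minimal sketch of this selection step (my own code; it reuses n, e, and hx from the block in part (a), and introduces the name p1 = 0.6 for the success probability, since p there is the pair {0.4, 0.6}):

p1 = 0.6;
(* rows {k, #sequences with k 1's, P(K = k), -1/n Log2 p(x^n)}, using the per-sequence probability in the last column *)
tbl = Table[{k, Binomial[n, k], Binomial[n, k] p1^k (1 - p1)^(n - k),
     -(k Log2[p1] + (n - k) Log2[1 - p1])/n}, {k, 0, n}];
typical = Select[tbl, hx - e <= #[[4]] <= hx + e &];
{typical[[All, 1]],          (* values of k whose sequences are typical *)
 Total[typical[[All, 3]]],   (* probability of the typical set *)
 Total[typical[[All, 2]]]}   (* number of elements in the typical set *)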

(c)

Clear[pr, n, k];
p = 0.6;
n = 25;
pr = Table[{k, Binomial[n, k] p^k (1 - p)^(n - k)}, {k, 0, 25}];   (* rows {k, P(K = k)} *)
Total[pr[[All, 2]]]                          (* sanity check: the probabilities sum to 1 *)
pr = Reverse[SortBy[pr, Last]];              (* order by decreasing probability *)
pr[[All, 2]] = Accumulate[pr[[All, 2]]];     (* running total over all 26 rows *)
pr // MatrixForm

(Output: a 26-row table with columns $k$ and the cumulative probability, the values of $k$ ordered from most probable to least probable.)

We can use a greedy approach: keep adding values of $k$ in order of decreasing $P(K = k)$ and stop once the total probability reaches 0.9. The matrix above shows the corresponding cumulative probability, accumulated from the most probable value of $k$ to the least probable. Note, however, that the smallest set in terms of the number of sequences is obtained by adding individual sequences in order of decreasing $p(x^n) = p^k (1-p)^{n-k}$; since $p = 0.6 > 0.5$, this means taking $k = 25, 24, \ldots$ downwards. Details are omitted.
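
A rough sketch of that sequence-level count (my own addition, not from the book; the last group of sequences may only be partially needed for an exact answer):

(* add whole groups of sequences with k = 25, 24, ... 1's until probability 0.9 is reached *)
shells = Table[{k, Binomial[n, k], Binomial[n, k] p^k (1 - p)^(n - k)}, {k, n, 0, -1}];
cum = Accumulate[shells[[All, 3]]];
last = LengthWhile[cum, # < 0.9 &] + 1;   (* index of the group where 0.9 is first reached *)
{shells[[last, 1]],                       (* smallest k included *)
 Total[shells[[1 ;; last, 2]]],           (* number of sequences in the set *)
 cum[[last]]}                             (* its total probability *)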

(d) Omitted.

5.28 Shannon code.

(p. 176 in the book)

Consider the following method for generating a code for a random variable $X$ that takes on $m$ values $\{1, 2, \ldots, m\}$ with probabilities $p_1, p_2, \ldots, p_m$. Assume that the probabilities are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$. Define

$$F_i = \sum_{k=1}^{i-1} p_k,$$

the sum of the probabilities of all symbols less than $i$. Then the codeword for $i$ is the number $F_i \in [0, 1]$ rounded off to $l_i$ bits, where $l_i = \left\lceil \log \frac{1}{p_i} \right\rceil$.
(a) Show that the code constructed by this process is prefix-free and that the average length satisfies

$$H(X) \le L < H(X) + 1.$$

(b) Construct the code for the probability distribution $(0.5, 0.25, 0.125, 0.125)$.
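
A sketch of the construction in part (b) (my own code, not from the book):

(* cumulative sums F_i, lengths l_i = Ceiling[Log2[1/p_i]], and the first l_i bits of F_i *)
p = {1/2, 1/4, 1/8, 1/8};
F = Prepend[Accumulate[Most[p]], 0];       (* F_i = p_1 + ... + p_(i-1) *)
l = Ceiling[Log2[1/p]];                    (* codeword lengths {1, 2, 3, 3} *)
code = MapThread[IntegerDigits[Floor[#1 2^#2], 2, #2] &, {F, l}];
Grid[Prepend[Transpose[{p, F, l, StringJoin /@ Map[ToString, code, {2}]}],
   {"p", "F", "l", "codeword"}]]

This yields the codewords 0, 10, 110, 111, with average length $L = 1.75$ bits $= H(X)$, consistent with the bound in part (a).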

7.4 Channel capacity.

Consider the discrete memoryless channel $Y = X + Z \pmod{11}$, where

$$Z = \begin{pmatrix} 1 & 2 & 3 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}$$

and $X \in \{0, 1, \ldots, 10\}$. Assume that $Z$ is independent of $X$.

(a) Find the capacity.

Solution

Since Y has the form

$$Y = X + Z \pmod{11},$$

the channel is symmetric: given $X = x$, there are 3 equally likely values of $Y$, so

$$H(Y \mid X = x) = \sum_{y} \Pr(Y = y \mid X = x)\, \log \frac{1}{\Pr(Y = y \mid X = x)} = \log 3 \text{ bits},$$

and hence $H(Y \mid X) = \log 3$ bits.

Hence the channel capacity is

$$\begin{aligned} C &= \max_{p(x)} I(X;Y) = \max_{p(x)} \big(H(Y) - H(Y\mid X)\big) \\ &= \max_{p(x)} H(Y) - \log 3 \\ &= \log 11 - \log 3 \qquad \text{(attained when } \Pr(Y = y) = \tfrac{1}{11} \text{ for all } y\text{, i.e., for uniform } p(x)\text{)} \\ &= \log \tfrac{11}{3} \text{ bits.} \end{aligned}$$
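
A quick numerical check (my own, not part of the solution): with uniform $p(x)$ and $Z$ uniform on $\{1, 2, 3\}$, $Y$ is uniform on $\{0, \ldots, 10\}$, so $I(X;Y) = \log_2 11 - \log_2 3 \approx 1.874$ bits.

Clear[px, pz, py, hY];
px = ConstantArray[1/11, 11];           (* uniform input distribution on {0, ..., 10} *)
pz = {1/3, 1/3, 1/3};                   (* noise Z uniform on {1, 2, 3} *)
py = Table[Sum[px[[x + 1]] pz[[z]] Boole[Mod[x + z, 11] == y], {x, 0, 10}, {z, 1, 3}], {y, 0, 10}];
hY = -Total[py Log2[py]];               (* H(Y); equals Log2[11] here *)
{N[hY], N[hY - Log2[3]], N[Log2[11/3]]} (* H(Y), I(X;Y) = H(Y) - H(Y|X), and the closed form *)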

Reference

  • Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory, 2nd ed.