Elements of Information Theory: Exercises

2.2 Entropy of functions.

Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if
(a) $Y = 2^X$?
(b) $Y = \cos X$?

$H(X) \ge H(Y)$ if $Y$ is a function of $X$. This holds because applying a function can only lose information when the mapping is not one-to-one. In (a), $Y = 2^X$ is one-to-one, so $H(Y) = H(X)$; in (b), $\cos$ is in general many-to-one, so $H(Y) \le H(X)$.
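
One way to make this precise is the standard chain-rule argument: expanding $H(X, g(X))$ in two ways,

$$H(X, g(X)) = H(X) + H(g(X)\mid X) = H(X), \qquad H(X, g(X)) = H(g(X)) + H(X\mid g(X)) \ge H(g(X)),$$

since $H(g(X)\mid X) = 0$ (the value of $g(X)$ is determined by $X$) and $H(X\mid g(X)) \ge 0$. Hence $H(g(X)) \le H(X)$, with equality exactly when $g$ is one-to-one on the support of $X$.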

2.6 Conditional mutual information vs. unconditional mutual information.

Give examples of joint random variables X,Y, and Z such that
(a) $I(X;Y\mid Z) < I(X;Y)$.
(b) $I(X;Y\mid Z) > I(X;Y)$.

(a) Let $X = Y = Z$, all uniform binary. Then $I(X;Y) = H(X) = 1$ bit, while $I(X;Y\mid Z) = H(X\mid Z) - H(X\mid Y, Z) = 0$.

At first it seems counterintuitive that $I(X;Y\mid Z) = 0$ here, but it should be read as "$X$ and $Y$ are conditionally independent given $Z$", in analogy with the fact that $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

Interpretation by ChatGPT

In the case where $X = Y = Z$, the conditional mutual information $I(X;Y\mid Z)$ is zero because the two random variables $X$ and $Y$ are completely determined by the value of the random variable $Z$. In other words, knowing the value of $Z$ gives us complete information about both $X$ and $Y$, so there is no additional information to be gained by considering $X$ and $Y$ together. This means that the two random variables are independent given the value of $Z$, so $I(X;Y\mid Z) = 0$.

(b) Let $X$ and $Y$ be independent and $Z = X + Y$. Intuitively, conditioning on $Z$ destroys the independence of $X$ and $Y$: once $Z$ is known, knowing one of them pins down the other.
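
For concreteness (my own check, assuming $X$ and $Y$ are independent Bernoulli(1/2) bits):

$$I(X;Y\mid Z) = H(X\mid Z) - H(X\mid Y, Z) = \tfrac{1}{2} - 0 = \tfrac{1}{2} > 0 = I(X;Y),$$

since $Z \in \{0, 2\}$ determines $X$, $Z = 1$ (probability $1/2$) leaves $X$ uniform, and $X = Z - Y$.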

2.28 Mixing increases entropy.

Show that the entropy of the probability distribution, $(p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m)$, is less than the entropy of the distribution $(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_i + p_j}{2}, \ldots, p_m)$. Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.
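
A sketch of one standard argument (my own note, not the book's solution): write $P = (p_1, \ldots, p_m)$ and let $P'$ be $P$ with $p_i$ and $p_j$ swapped. The averaged distribution above is $\tfrac{1}{2}P + \tfrac{1}{2}P'$, so by concavity of entropy and the symmetry $H(P') = H(P)$,

$$H\!\left(\tfrac{1}{2}P + \tfrac{1}{2}P'\right) \ge \tfrac{1}{2}H(P) + \tfrac{1}{2}H(P') = H(P).$$

More generally, moving $\delta \in [0, p_i - p_j]$ of probability from a larger $p_i$ to a smaller $p_j$ gives $(p_i - \delta,\, p_j + \delta) = (1 - t)(p_i, p_j) + t\,(p_j, p_i)$ with $t = \delta/(p_i - p_j) \in [0, 1]$, and the same concavity argument applies.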

2.30 Maximum entropy.

Find the probability mass function p(x) that maximizes the entropy H(X) of a nonnegative integer-valued random variable X subject to the constraint

$$E X = \sum_{n=0}^{\infty} n\, p(n) = A$$

for a fixed value A>0. Evaluate this maximum H(X).

Sol. We tackle this problem using Lagrange multipliers:

$$\nabla f = \lambda \nabla g + \mu \nabla h$$

where, writing $f$ for the negative of the entropy (which has the same constrained critical points as $H$),

$$f = \sum_i p_i \log p_i, \qquad g = \sum_i p_i, \qquad h = \sum_i i\, p_i,$$

which becomes

$$\begin{cases} 1 + \log p_i = \lambda + i\mu \\ \sum_i p_i = 1 \\ \sum_i i\, p_i = A \end{cases}$$

we can rewrite $1 + \log p_i = \lambda + i\mu$ in the form

$$p_i = e^{\lambda - 1} \left(e^{\mu}\right)^{i}$$

to simplify the notation, let

$$\alpha = e^{\lambda - 1}, \qquad \beta = e^{\mu},$$

so the remaining constraints become

$$\begin{cases} \sum_{i=0}^{\infty} \alpha \beta^{i} = 1 \\ \sum_{i=0}^{\infty} i\, \alpha \beta^{i} = A \end{cases}$$

Using $\sum_{i=0}^{\infty} \beta^{i} = \frac{1}{1-\beta}$ and $\sum_{i=0}^{\infty} i \beta^{i} = \frac{\beta}{(1-\beta)^2}$ (valid for $0 \le \beta < 1$), this implies

$$\beta = \frac{A}{A+1}, \qquad \alpha = \frac{1}{A+1}.$$

So the entropy maximizing distribution is,

$$p_i = \frac{1}{A+1}\left(\frac{A}{A+1}\right)^{i}.$$

Plugging these values into the expression for the maximum entropy,

$$H(X) = -\log \alpha - A \log \beta = (A+1)\log(A+1) - A\log A.$$
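
A quick numerical sanity check (my own, not part of the book's solution; it assumes the particular value $A = 3$):

Clear[A, a, b, p];
A = 3;
a = 1/(A + 1); b = A/(A + 1);
p[i_] := a b^i;                                (* the claimed maximizing pmf *)
NSum[i p[i], {i, 0, Infinity}]                 (* mean; should return A = 3 *)
NSum[-p[i] Log2[p[i]], {i, 0, Infinity}]       (* entropy in bits *)
N[(A + 1) Log2[A + 1] - A Log2[A]]             (* closed form; should match the line above *)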

2.33 Fano's inequality.

Let $\Pr(X = i) = p_i$, $i = 1, 2, \ldots, m$, and let $p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_m$. The minimal probability of error predictor of $X$ is $\hat{X} = 1$, with resulting probability of error $P_e = 1 - p_1$. Maximize $H(p)$ subject to the constraint $1 - p_1 = P_e$ to find a bound on $P_e$ in terms of $H$. This is Fano's inequality in the absence of conditioning.

Solution (Thomas M. Cover, Joy A. Thomas): (Fano's Inequality.) The minimal probability of error predictor when there is no information is $\hat{X} = 1$, the most probable value of $X$. The probability of error in this case is $P_e = 1 - p_1$. Hence if we fix $P_e$, we fix $p_1$. We maximize the entropy of $X$ for a given $P_e$ to obtain an upper bound on the entropy for a given $P_e$. The entropy,

$$\begin{aligned} H(p) &= -p_1 \log p_1 - \sum_{i=2}^{m} p_i \log p_i \\ &= -p_1 \log p_1 - \sum_{i=2}^{m} P_e \frac{p_i}{P_e} \log \frac{p_i}{P_e} - P_e \log P_e \\ &= H(P_e) + P_e H\!\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right) \\ &\le H(P_e) + P_e \log(m-1), \end{aligned}$$

since the maximum of $H\!\left(\frac{p_2}{P_e}, \frac{p_3}{P_e}, \ldots, \frac{p_m}{P_e}\right)$ is attained by a uniform distribution. Hence any $X$ that can be predicted with a probability of error $P_e$ must satisfy

$$H(X) \le H(P_e) + P_e \log(m-1),$$

which is the unconditional form of Fano's inequality. We can weaken this inequality (using $H(P_e) \le 1$ bit) to obtain an explicit lower bound for $P_e$:

$$P_e \ge \frac{H(X) - 1}{\log(m-1)}.$$

My own attempt (is this correct?):

$$H(X\mid\hat{X}) = \Pr\{\hat{X} = 1\}\, H(X\mid\hat{X} = 1) = 1 \cdot H(X\mid\hat{X} = 1) = H(X),$$

or

$$I(X;\hat{X}) = H(\hat{X}) - H(\hat{X}\mid X) = H(X) - H(X\mid\hat{X}),$$

where $H(\hat{X}) = 0$ because $\hat{X} = 1$ with probability 1 (so $H(\hat{X}\mid X) = 0$ as well, giving $I(X;\hat{X}) = 0$), and $H(X\mid\hat{X}) \le H(X)$ because conditioning reduces entropy. So $H(X\mid\hat{X}) = H(X)$.

then

$$H(X\mid\hat{X}) = H(X) \le H(P_e) + P_e \log(m-1).$$
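
A quick numerical check of the bound (my own sketch, not from the book): draw a random pmf on $m = 5$ symbols, sort it so that $p_1$ is the largest mass, and compare $H(X)$ with $H(P_e) + P_e \log(m-1)$.

Clear[m, p, pe, h, bound];
m = 5;
p = Sort[#/Total[#] &[RandomReal[{0, 1}, m]], Greater];          (* random pmf, sorted so p1 is largest *)
pe = 1 - First[p];                                               (* error probability of the guess X^ = 1 *)
h = -Total[p Log2[p]];                                           (* H(X) in bits *)
bound = -pe Log2[pe] - (1 - pe) Log2[1 - pe] + pe Log2[m - 1];   (* H(Pe) + Pe Log2(m - 1) *)
{h, bound, h <= bound}                                           (* the last entry should always be True *)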

2.46 Axiomatic definition of entropy (Difficult).

If we assume certain axioms for our measure of information, we will be forced to use a logarithmic measure such as entropy. Shannon used this to justify his initial definition of entropy. In this book we rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.
If a sequence of symmetric functions $H_m(p_1, p_2, \ldots, p_m)$ satisfies the following properties:

  • Normalization: $H_2\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = 1$,
  • Continuity: $H_2(p, 1-p)$ is a continuous function of $p$,
  • Grouping: $H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2)\, H_2\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)$,

prove that $H_m$ must be of the form

$$H_m(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i, \qquad m = 2, 3, \ldots$$

There are various other axiomatic formulations which result in the same definition of entropy. See, for example, the book by Csiszár and Körner [149].

3.13 Calculation of typical set.

To clarify the notion of a typical set $A_\epsilon^{(n)}$ and the smallest set of high probability $B_\delta^{(n)}$, we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, $X_1, X_2, \ldots, X_n$, where the probability that $X_i = 1$ is 0.6 (and therefore the probability that $X_i = 0$ is 0.4).
(a) Calculate H(X).
(b) With $n = 25$ and $\epsilon = 0.1$, which sequences fall in the typical set $A_\epsilon^{(n)}$? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with $k$ 1's, $0 \le k \le 25$, and finding those sequences that are in the typical set.)
(c) How many elements are there in the smallest set that has probability 0.9 ?
(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

The table in the book seems to be problematic; here is my version. (Note: the fourth column of my table is $-\frac{1}{n}\log_2$ of the total probability of observing $k$ 1's, which includes the $\binom{n}{k}$ factor, whereas the book's fourth column is the per-sequence value $-\frac{1}{n}\log_2 p(x^n) = -\frac{1}{n}\log_2\!\left(p^k (1-p)^{n-k}\right)$, so the two do not agree.)

The answer is not finished and may also be problematic.

Clear[n, k, p];
n = 25;
p = 0.6;

(* columns: k, the number of sequences with k 1's, the probability of observing k 1's,
   and -1/n Log2 of that probability *)
Table[{k, Binomial[n, k],
    Binomial[n, k] p^k (1 - p)^(n - k),
    (-1/n) Log2[Binomial[n, k] p^k (1 - p)^(n - k)]}, {k, 0, 25}] // MatrixForm

(Output: a 26-row table with columns $k$, $\binom{n}{k}$, $P(K = k) = \binom{n}{k} p^k (1-p)^{n-k}$, and $-\frac{1}{n}\log_2 P(K = k)$, for $k = 0, \ldots, 25$.)

Sol.

(a)

p = {0.4, 0.6};
hx = Total[-p Log2[p]]   (* H(X) in bits *)
n = 25;
e = 0.1;
0.970951

$$H(X) = -0.6 \log 0.6 - 0.4 \log 0.4 = 0.97095 \text{ bits}.$$

(b)

hx + e   (* upper edge of the typical interval *)
hx - e   (* lower edge *)
1.07095
0.870951

Choose the items (values of $k$) whose $-\frac{1}{n}\log_2 p(x^n)$ lies in the interval $[0.870951, 1.07095]$.
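
A minimal sketch of this selection step (my own code; it reuses n, e, and hx from the block in part (a), and introduces the name p1 = 0.6 for the success probability, since p there is the pair {0.4, 0.6}):

p1 = 0.6;
(* rows {k, #sequences with k 1's, P(K = k), -1/n Log2 p(x^n)}, using the per-sequence probability in the last column *)
tbl = Table[{k, Binomial[n, k], Binomial[n, k] p1^k (1 - p1)^(n - k),
     -(k Log2[p1] + (n - k) Log2[1 - p1])/n}, {k, 0, n}];
typical = Select[tbl, hx - e <= #[[4]] <= hx + e &];
{typical[[All, 1]],          (* values of k whose sequences are typical *)
 Total[typical[[All, 3]]],   (* probability of the typical set *)
 Total[typical[[All, 2]]]}   (* number of elements in the typical set *)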

(c)

Clear[pr, n, k];
p = 0.6;
n = 25;
pr = Table[{k, Binomial[n, k] p^k (1 - p)^(n - k)}, {k, 0, 25}];   (* rows {k, P(K = k)} *)
Total[pr[[All, 2]]]                          (* sanity check: the probabilities sum to 1 *)
pr = Reverse[SortBy[pr, Last]];              (* order by decreasing probability *)
pr[[All, 2]] = Accumulate[pr[[All, 2]]];     (* running total over all 26 rows *)
pr // MatrixForm

(Output: a 26-row table with columns $k$ and the cumulative probability, the values of $k$ ordered from most probable to least probable.)

We can use a greedy approach: keep adding values of $k$ in order of decreasing $P(K = k)$ and stop once the total probability reaches 0.9. The matrix above shows the corresponding cumulative probability, accumulated from the most probable value of $k$ to the least probable. Note, however, that the smallest set in terms of the number of sequences is obtained by adding individual sequences in order of decreasing $p(x^n) = p^k (1-p)^{n-k}$; since $p = 0.6 > 0.5$, this means taking $k = 25, 24, \ldots$ downwards. Details are omitted.
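
A rough sketch of that sequence-level count (my own addition, not from the book; the last group of sequences may only be partially needed for an exact answer):

(* add whole groups of sequences with k = 25, 24, ... 1's until probability 0.9 is reached *)
shells = Table[{k, Binomial[n, k], Binomial[n, k] p^k (1 - p)^(n - k)}, {k, n, 0, -1}];
cum = Accumulate[shells[[All, 3]]];
last = LengthWhile[cum, # < 0.9 &] + 1;   (* index of the group where 0.9 is first reached *)
{shells[[last, 1]],                       (* smallest k included *)
 Total[shells[[1 ;; last, 2]]],           (* number of sequences in the set *)
 cum[[last]]}                             (* its total probability *)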

(d) Omitted.

5.28 Shannon code.

(p. 176 in the book)

Consider the following method for generating a code for a random variable $X$ that takes on $m$ values $\{1, 2, \ldots, m\}$ with probabilities $p_1, p_2, \ldots, p_m$. Assume that the probabilities are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$. Define

$$F_i = \sum_{k=1}^{i-1} p_k,$$

the sum of the probabilities of all symbols less than $i$. Then the codeword for $i$ is the number $F_i \in [0, 1]$ rounded off to $l_i$ bits, where $l_i = \left\lceil \log \frac{1}{p_i} \right\rceil$.
(a) Show that the code constructed by this process is prefix-free and that the average length satisfies

$$H(X) \le L < H(X) + 1.$$

(b) Construct the code for the probability distribution $(0.5, 0.25, 0.125, 0.125)$.
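
A sketch of the construction in part (b) (my own code, not from the book):

(* cumulative sums F_i, lengths l_i = Ceiling[Log2[1/p_i]], and the first l_i bits of F_i *)
p = {1/2, 1/4, 1/8, 1/8};
F = Prepend[Accumulate[Most[p]], 0];       (* F_i = p_1 + ... + p_(i-1) *)
l = Ceiling[Log2[1/p]];                    (* codeword lengths {1, 2, 3, 3} *)
code = MapThread[IntegerDigits[Floor[#1 2^#2], 2, #2] &, {F, l}];
Grid[Prepend[Transpose[{p, F, l, StringJoin /@ Map[ToString, code, {2}]}],
   {"p", "F", "l", "codeword"}]]

This yields the codewords 0, 10, 110, 111, with average length $L = 1.75$ bits $= H(X)$, consistent with the bound in part (a).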

7.4 Channel capacity.

Consider the discrete memoryless channel $Y = X + Z \pmod{11}$, where

$$Z = \begin{pmatrix} 1 & 2 & 3 \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}$$

and $X \in \{0, 1, \ldots, 10\}$. Assume that $Z$ is independent of $X$.

(a) Find the capacity.

Solution

Since Y has the form

$$Y = X + Z \pmod{11},$$

the channel is symmetric: given $X = x$, there are 3 equally likely values of $Y$, so

$$H(Y \mid X = x) = \sum_{y} \Pr(Y = y \mid X = x)\, \log \frac{1}{\Pr(Y = y \mid X = x)} = \log 3 \text{ bits},$$

and hence $H(Y \mid X) = \log 3$ bits.

Hence the channel capacity is

$$\begin{aligned} C &= \max_{p(x)} I(X;Y) = \max_{p(x)} \big(H(Y) - H(Y\mid X)\big) \\ &= \max_{p(x)} H(Y) - \log 3 \\ &= \log 11 - \log 3 \qquad \text{(attained when } \Pr(Y = y) = \tfrac{1}{11} \text{ for all } y\text{, i.e., for uniform } p(x)\text{)} \\ &= \log \tfrac{11}{3} \text{ bits.} \end{aligned}$$
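
A quick numerical check (my own, not part of the solution): with uniform $p(x)$ and $Z$ uniform on $\{1, 2, 3\}$, $Y$ is uniform on $\{0, \ldots, 10\}$, so $I(X;Y) = \log_2 11 - \log_2 3 \approx 1.874$ bits.

Clear[px, pz, py, hY];
px = ConstantArray[1/11, 11];           (* uniform input distribution on {0, ..., 10} *)
pz = {1/3, 1/3, 1/3};                   (* noise Z uniform on {1, 2, 3} *)
py = Table[Sum[px[[x + 1]] pz[[z]] Boole[Mod[x + z, 11] == y], {x, 0, 10}, {z, 1, 3}], {y, 0, 10}];
hY = -Total[py Log2[py]];               (* H(Y); equals Log2[11] here *)
{N[hY], N[hY - Log2[3]], N[Log2[11/3]]} (* H(Y), I(X;Y) = H(Y) - H(Y|X), and the closed form *)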

Reference

  • Cover, Thomas M., and Joy A. Thomas. Elements of Information Theory, 2nd ed.