entropy, cross entropy

Information Entropy

Information entropy measures the information missing before reception, that is, the level of uncertainty of a random variable $X$.
Information entropy definition:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$

where $X$ is a random variable and $p(x_i)$ is the probability of $X = x_i$. When $\log$ is $\log_2$, the unit of $H(X)$ is the bit. When $\log$ is $\log_{10}$, the unit of $H(X)$ is the dit.
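
As a quick illustration (a minimal sketch assuming Python's standard math module; the `entropy` helper below is illustrative, not from any library), the definition translates directly into code:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum_i p(x_i) * log p(x_i) in the given base.
    Zero-probability terms are skipped, since p * log p -> 0 as p -> 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits (less uncertainty)
```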

Example

English character

$X$ is a random variable. It could be any one of the characters a, b, c, ..., x, y, z, each with equal probability. The information entropy of $X$:

$$H(X) = -\sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} \approx 4.7$$

This means the information entropy of an English character is about 4.7 bits, so 5 binary digits are enough to encode one English character.

ASCII code

$X$ is a random variable. It could be any one of the 128 ASCII characters, each with equal probability. The information entropy of $X$:

$$H(X) = -\sum_{i=1}^{128} \frac{1}{128} \log_2 \frac{1}{128} = 7$$

This means the information entropy of an ASCII character is 7 bits, so 7 binary digits are enough to encode one ASCII character. We use a byte, which is 8 bits, to represent an ASCII character; the extra bit is used for parity checking.
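
For a uniform distribution over $n$ outcomes the sum collapses to $\log_2 n$, so both results above can be checked with one-liners (a sketch assuming Python's math module):

```python
import math

# Uniform distribution over n outcomes: H(X) = log2(n)
print(math.log2(26))    # ~4.70 bits per English letter
print(math.log2(128))   # 7.0 bits per ASCII character
```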

================================================

Cross Entropy in Machine Learning

In information theory, the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution $q$, rather than the “true” distribution $p$.
Cross entropy definition:

$$S(p, q) = -\sum_{x} p(x) \log q(x)$$

where $x$ ranges over the values in the set, $p$ is the target (true) distribution, and $q$ is the estimated, “unnatural” distribution.

The more similar $p$ and $q$ are, the smaller $S(p, q)$ is, so $S(\cdot)$ can be used as a training objective. TensorFlow provides tf.nn.softmax_cross_entropy_with_logits_v2() for this.
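
A minimal sketch of the definition itself, assuming NumPy (the `cross_entropy` helper and its `eps` argument are illustrative, not part of any library); note that the TensorFlow op above expects raw logits and applies the softmax internally, so it is not fed probabilities directly:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """S(p, q) = -sum_x p(x) * log q(x), using the natural log.
    eps guards against log(0) when q assigns zero probability to some x."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))
```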

Example

The one-hot label of a training instance is y_target = [0, 1, 0, 0, 0]. The distribution predicted by your algorithm is y_tmp = [0.1, 0.1, 0.2, 0.1, 0.5].

You want y_tmp to approximate y_target. In other words, your goal is to make y_tmp[0] smaller, y_tmp[1] greater, y_tmp[2] smaller, and so on. That is a complicated task with a lot to consider.

How about simply making S(y_target, y_tmp) smaller? One objective covers all of it. Better.
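
Plugging the numbers above into the definition (a sketch assuming NumPy): every zero entry of y_target drops out of the sum, so only the predicted probability at the true class matters.

```python
import numpy as np

y_target = np.array([0, 1, 0, 0, 0], dtype=float)
y_tmp    = np.array([0.1, 0.1, 0.2, 0.1, 0.5])

# Only the true class (index 1) contributes: S = -log(0.1)
s = -np.sum(y_target * np.log(y_tmp))
print(s)  # ~2.30 nats (~3.32 bits); it shrinks toward 0 as y_tmp[1] approaches 1
```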

Ref

Cross entropy - Wikipedia
https://en.wikipedia.org/wiki/Cross_entropy

A Friendly Introduction to Cross-Entropy Loss
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/#entropy

Entropy - Wikipedia
https://en.wikipedia.org/wiki/Entropy

posted on 2018-04-09 21:12 by yusisc