entropy, cross entropy

Information Entropy

Information entropy measures how much information is missing before a message is received. Equivalently, it quantifies the uncertainty of a random variable $X$.
Information entropy definition:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$

where $X$ is a random variable, $p(x_i)$ is the probability that $X = x_i$, and $n$ is the number of possible values. When $\log$ is $\log_2$, the unit of $H(X)$ is the bit. When $\log$ is $\log_{10}$, the unit of $H(X)$ is the dit.
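The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original post: the function name `entropy` and the example coin distributions are assumptions chosen for the demonstration.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(X) = -sum(p * log(p)) of a discrete distribution.

    `probs` is a list of probabilities that should sum to 1.
    `base` selects the unit: 2 gives bits, 10 gives dits.
    """
    # Outcomes with p == 0 are skipped: they contribute nothing,
    # because p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical example distributions (not from the original text).
fair_coin = entropy([0.5, 0.5])      # maximum uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # less uncertain than a fair coin
certain = entropy([1.0, 0.0])        # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The fair coin yields 1 bit, the biased coin less than 1 bit, and the certain outcome 0 bits, which matches the reading of entropy as uncertainty.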

Example

English character

$X$ is a random variable that takes one of the 26 characters a, b, c, ..., x, y, z. Assuming every character is equally likely, the information entropy of $X$ is:

$$H(X) = -\sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} \approx 4.7$$

This means the information entropy of an English character is about 4.7 bits, so 5 binary digits are enough to encode one English character ($2^5 = 32 \ge 26$).

ASCII code

$X$ is a random variable that takes one ASCII code. There are 128 ASCII codes in total. Assuming every code is equally likely, the information entropy of $X$ is:

$$H(X) = -\sum_{i=1}^{128} \frac{1}{128} \log_2 \frac{1}{128} = 7$$

This means the information entropy of an ASCII code is 7 bits, so 7 binary digits are enough to encode one ASCII code. In practice a byte, which is 8 bits, is used to store an ASCII code. The extra bit has historically been used as a check (parity) bit.
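Both examples above can be checked numerically. The sketch below assumes, as the examples do, that every symbol is equally likely, in which case the entropy sum collapses to $\log_2 n$. The helper name `uniform_entropy_bits` is made up for this illustration.

```python
import math

def uniform_entropy_bits(n):
    """Entropy in bits of n equally likely symbols.

    With p(x_i) = 1/n for every i, the sum -sum((1/n) * log2(1/n))
    has n identical terms and simplifies to log2(n).
    """
    return math.log2(n)

letters_bits = uniform_entropy_bits(26)    # English characters a..z
ascii_bits = uniform_entropy_bits(128)     # ASCII codes

# A fixed-length binary code needs a whole number of digits,
# so the entropy is rounded up.
letters_code_len = math.ceil(letters_bits)
ascii_code_len = math.ceil(ascii_bits)

print(round(letters_bits, 2), letters_code_len)
print(ascii_bits, ascii_code_len)
```

Real English text does not use its letters equally often, so the uniform assumption gives an upper bound on the per-character entropy rather than a measured value.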

================================================

Cross Entropy in Machine Learning

In information theory, the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution $q$, rather than the “true” distribution $p$.
Cross entropy definition:

$$S(p, q) = -\sum_{x} p(x) \log q(x)$$

where $x$ ranges over every value in the set, $p$ is the target distribution, and $q$ is the temporary, unreal, or “unnatural” distribution.

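For a fixed $p$, the closer $q$ is to $p$, the smaller $S(p, q)$ becomes. It reaches its minimum, the entropy of $p$, when $q = p$. So $S(\cdot)$ can be used as a training objective. TensorFlow provides a function for this, tf.nn.softmax_cross_entropy_with_logits_v2(), which takes unnormalized logits and applies softmax internally before computing the cross entropy.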
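To make the relationship between similarity and cross entropy concrete, here is a small standard-library sketch. The function name `cross_entropy` and the three distributions are hypothetical, picked only to show the trend. It does not reproduce the TensorFlow function, which works on logits rather than probabilities.

```python
import math

def cross_entropy(p, q, base=2.0):
    """Cross entropy S(p, q) = -sum_x p(x) * log q(x).

    Terms with p(x) == 0 are skipped because they contribute nothing.
    If q(x) == 0 where p(x) > 0, math.log raises ValueError; a real
    implementation would clip q away from zero first.
    """
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions for illustration.
p = [0.5, 0.25, 0.25]
q_close = [0.4, 0.3, 0.3]    # similar to p
q_far = [0.1, 0.1, 0.8]      # very different from p

s_same = cross_entropy(p, p)          # equals the entropy of p
s_close = cross_entropy(p, q_close)
s_far = cross_entropy(p, q_far)

print(s_same, s_close, s_far)
```

The three values increase in that order: the cross entropy is smallest when $q$ equals $p$ and grows as $q$ moves away from it.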

Example

The one-hot label of a training instance is y_target = [0, 1, 0, 0, 0]. The distribution predicted by your algorithm is y_tmp = [0.1, 0.1, 0.2, 0.1, 0.5].

You want y_tmp to approximate y_target. In other words, your goal is to make y_tmp[0] smaller, y_tmp[1] greater, y_tmp[2] smaller, and so on. Handled entry by entry, this is a complex task with many things to consider.

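How about simply making S(y_target, y_tmp) smaller? Because y_target is one-hot, S(y_target, y_tmp) reduces to -log y_tmp[1]. Minimizing it pushes y_tmp[1] toward 1, and since the entries of y_tmp sum to 1, the other entries shrink automatically. One objective takes care of everything.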
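The example can be worked through numerically. The sketch below uses the natural logarithm, and `y_better` is a hypothetical improved prediction that does not appear in the original text. It is included only to show the loss dropping as the prediction approaches the label.

```python
import math

def cross_entropy(p, q):
    """S(p, q) = -sum_x p(x) * ln q(x), skipping terms where p(x) == 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

y_target = [0, 1, 0, 0, 0]                   # one-hot label from the example
y_tmp = [0.1, 0.1, 0.2, 0.1, 0.5]            # prediction from the example
y_better = [0.05, 0.8, 0.05, 0.05, 0.05]     # hypothetical better prediction

# Only index 1 has a non-zero target, so each loss is -ln(prediction[1]).
loss_tmp = cross_entropy(y_target, y_tmp)        # -ln(0.1)
loss_better = cross_entropy(y_target, y_better)  # -ln(0.8)

print(loss_tmp, loss_better)
```

The loss for y_tmp is -ln(0.1), roughly 2.30, while the hypothetical better prediction gives -ln(0.8), roughly 0.22. A single scalar therefore captures how far the whole predicted vector is from the label.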

Ref

Cross entropy - Wikipedia
https://en.wikipedia.org/wiki/Cross_entropy

A Friendly Introduction to Cross-Entropy Loss
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/#entropy

Entropy - Wikipedia
https://en.wikipedia.org/wiki/Entropy

posted on 2018-04-09 21:12  yusisc