entropy, cross entropy
Information Entropy
Information entropy measures the information missing before reception, i.e., the level of uncertainty of a random variable $X$.
Information entropy definition:

$$H(X) = -\sum_i P(X = x_i)\,\log P(X = x_i)$$

where $X$ is a random variable and $P(X = x_i)$ is the probability that $X = x_i$. When $\log$ is $\log_2$, the unit of $H(X)$ is the bit. When $\log$ is $\log_{10}$, the unit of $H(X)$ is the dit.
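As a quick illustration, the definition translates directly into a few lines of Python (a minimal sketch; the function name is mine, not from any library):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum(p_i * log(p_i)), in bits when base=2."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin has exactly 1 bit of uncertainty.
print(entropy([0.5, 0.5]))   # 1.0
```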
Example
English character
$X$ is a random variable. It could be any one of the 26 English characters, each equally likely. The information entropy of $X$:

$$H(X) = -\sum_{i=1}^{26} \frac{1}{26}\,\log_2 \frac{1}{26} = \log_2 26 \approx 4.7$$
This means the information entropy of an English character is about 4.7 bits, so 5 binary digits are enough to encode an English character.
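A quick check of that arithmetic in plain Python:

```python
import math

H = math.log2(26)      # entropy of 26 equally likely characters
print(H)               # ~4.70 bits
print(math.ceil(H))    # 5 binary digits are enough
```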
ASCII code
$X$ is a random variable. It could be any one of the ASCII codes. The total number of ASCII codes is 128, each equally likely. The information entropy of $X$:

$$H(X) = -\sum_{i=1}^{128} \frac{1}{128}\,\log_2 \frac{1}{128} = \log_2 128 = 7$$
This means the information entropy of an ASCII code is 7 bits, so 7 binary digits are enough to encode an ASCII code. We use a byte, which is 8 bits, to store an ASCII code; the extra bit is used for checking (parity).
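And the same check for the ASCII case:

```python
import math

H = math.log2(128)   # entropy of 128 equally likely ASCII codes
print(H)             # 7.0 bits -> fits in an 8-bit byte, leaving 1 bit for checking
```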
================================================
Cross Entropy in Machine Learning
In information theory, the cross entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “unnatural” probability distribution $q$, rather than the “true” distribution $p$.
Cross entropy definition:

$$S(p, q) = -\sum_x p(x)\,\log q(x)$$

where $x$ ranges over each value of the set, $p$ is the target distribution, and $q$ is the temporary, unreal or unnatural distribution.
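This formula is also a few lines of Python (a sketch of the formula itself, not of any library's API; the function name is mine):

```python
import math

def cross_entropy(p, q, base=2):
    """S(p, q) = -sum(p(x) * log(q(x))) over all values x."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Cross entropy of a distribution with itself is just its entropy.
print(cross_entropy([0.5, 0.5], [0.5, 0.5]))   # 1.0
print(cross_entropy([0.5, 0.5], [0.9, 0.1]))   # larger, since q differs from p
```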
The more similar $p$ and $q$ are, the smaller $S(p, q)$ is. So $S(\cdot)$ can be used as a training target. TensorFlow provides a function for this, tf.nn.softmax_cross_entropy_with_logits_v2().
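A minimal sketch of how that function is typically called, assuming TensorFlow 1.x (the label and logit values below are made up for illustration). Note that it expects raw logits rather than probabilities, because it applies softmax internally:

```python
import tensorflow as tf  # assuming TensorFlow 1.x

labels = tf.constant([[0., 1., 0., 0., 0.]])        # true one-hot distribution p
logits = tf.constant([[1.0, 2.5, 0.3, 0.1, 1.2]])   # made-up unscaled model outputs

# Applies softmax to `logits` internally, then computes cross entropy against `labels`.
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run(loss))   # per-example cross entropy
```

In TensorFlow 2.x the same computation is exposed as tf.nn.softmax_cross_entropy_with_logits.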
Example
The training instance's one-hot label is y_target = [0, 1, 0, 0, 0]. The distribution predicted by your algorithm is y_tmp = [0.1, 0.1, 0.2, 0.1, 0.5].
You want to make y_tmp approximate y_target. In other words, your goal is to make y_tmp[0] smaller, y_tmp[1] greater, y_tmp[2] smaller, and so on. It's a complex task, with so much to consider.
How about just making S(y_target, y_tmp) smaller? One shot, all done. Better.
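For concreteness, here is that single number computed for the example above (plain Python; since y_target is one-hot, only the term at y_tmp[1] contributes):

```python
import math

y_target = [0, 1, 0, 0, 0]
y_tmp    = [0.1, 0.1, 0.2, 0.1, 0.5]

# S(y_target, y_tmp) = -sum(p * log2(q)); only the index where y_target == 1 matters.
S = -sum(p * math.log2(q) for p, q in zip(y_target, y_tmp) if p > 0)
print(S)   # ~3.32 bits; pushing y_tmp[1] toward 1 drives this toward 0
```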
Ref
Cross entropy - Wikipedia
https://en.wikipedia.org/wiki/Cross_entropy
A Friendly Introduction to Cross-Entropy Loss
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/#entropy
Entropy - Wikipedia
https://en.wikipedia.org/wiki/Entropy