Perplexity vs. Cross-Entropy
Evaluating a Language Model: Perplexity
We have a test set of \(m\) sentences \(s_1, s_2, \cdots, s_m\).
We could look at the probability our model assigns to the whole set, \(\prod_{i=1}^m{p(s_i)}\). Or, more conveniently, the log probability:
\[
\log \prod_{i=1}^m p(s_i) = \sum_{i=1}^m \log p(s_i),
\]
where \(p(s_i)\) is the probability of sentence \(s_i\).
In fact, the usual evaluation measure is perplexity:
\[
\text{PPL} = 2^{-l}, \quad \text{where} \quad l = \frac{1}{M}\sum_{i=1}^m \log_2 p(s_i),
\]
and \(M\) is the total number of words in the test data.
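As a quick numerical sketch of the formula above (the per-sentence probabilities and word counts below are made up purely for illustration):

```python
import math

# Hypothetical per-sentence probabilities p(s_i) under the model, and the
# number of words in each sentence; in practice these come from scoring
# a real test set.
sentence_probs = [1e-4, 5e-6, 2e-5]
word_counts = [5, 8, 6]

M = sum(word_counts)                               # total words in the test data
l = sum(math.log2(p) for p in sentence_probs) / M  # average log2-probability per word
perplexity = 2 ** (-l)
print(perplexity)
```

Note that \(2^{-l} = \left(\prod_i p(s_i)\right)^{-1/M}\), so perplexity is the inverse probability of the test set, normalized per word.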
Cross-Entropy
Given words \(x_1,\cdots,x_t\), a language model predicts the following word \(x_{t+1}\) by modeling:
\[
P(x_{t+1}=v_j \mid x_t,\cdots,x_1),
\]
where \(v_j\) is a word in the vocabulary.
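In neural language models this distribution is typically produced by a softmax over per-word scores. A minimal sketch, with a toy vocabulary and made-up logits:

```python
import math

# Toy vocabulary and unnormalized model scores (logits) for the next word;
# both are assumptions for illustration only.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 1.0, -1.0]

# Softmax: exponentiate and normalize so the scores become a probability
# distribution P(x_{t+1} = v_j | x_t, ..., x_1) over the vocabulary.
exps = [math.exp(z) for z in logits]
total = sum(exps)
hat_y = [e / total for e in exps]
print(dict(zip(vocab, hat_y)))
```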
The predicted output vector \(\hat y^t\in \mathbb{R}^{|V|}\) is a probability distribution over the vocabulary, and we optimize the cross-entropy loss:
\[
CE(y^t, \hat y^t) = -\sum_{j=1}^{|V|} y_j^t \log \hat y_j^t,
\]
where \(y^t\) is the one-hot vector corresponding to the target word. This is a point-wise loss; to evaluate model performance, we sum the cross-entropy loss across all time steps in a sequence, and across all sequences in the dataset.
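With a one-hot target, the sum collapses to the negative log-probability of the correct word. A small sketch (the predicted distribution below is a toy assumption):

```python
import math

# A toy predicted distribution over a 4-word vocabulary; hat_y sums to 1.
hat_y = [0.1, 0.6, 0.2, 0.1]
y = [0, 1, 0, 0]  # one-hot target: the correct word is at index 1

# CE(y, hat_y) = -sum_j y_j * log(hat_y_j); with one-hot y this reduces
# to -log of the probability assigned to the correct word.
ce = -sum(y_j * math.log(hy_j) for y_j, hy_j in zip(y, hat_y))
print(ce)  # equals -log(0.6) ≈ 0.51
```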
The Relationship Between Cross-Entropy and Perplexity

Perplexity at time step \(t\) is defined as:
\[
PP^t(y^t, \hat y^t) = \frac{1}{P(x_{t+1} \mid x_t,\cdots,x_1)},
\]
which is the inverse probability of the correct word \(x_{t+1}\), according to the model distribution \(P\).
Suppose \(y_i^t\) is the only nonzero element of \(y^t\), i.e., index \(i\) marks the correct word, so \(\hat y_i^t = P(x_{t+1} \mid x_t,\cdots,x_1)\). Then, note that:
\[
CE(y^t, \hat y^t) = -\log \hat y_i^t = \log \frac{1}{\hat y_i^t}.
\]
Then, it follows that:
\[
CE(y^t, \hat y^t) = \log PP^t(y^t, \hat y^t), \quad \text{i.e.,} \quad PP^t(y^t, \hat y^t) = e^{CE(y^t, \hat y^t)}.
\]
In fact, minimizing the arithmetic mean of the cross-entropy is identical to minimizing the geometric mean of the perplexity. If the model's predictions are completely random, \(E[\hat y_i^t]=\frac{1}{|V|}\), and the expected cross-entropy is \(\log |V|\) (for \(|V| = 10000\), \(\log 10000\approx 9.21\)).
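This equivalence is easy to check numerically. In the sketch below, the per-step probabilities assigned to the correct word are assumed values for illustration:

```python
import math

# Assumed per-step probabilities of the correct word under some model.
correct_probs = [0.5, 0.1, 0.25, 0.05]

ces = [-math.log(p) for p in correct_probs]  # per-step cross-entropies
ppls = [1 / p for p in correct_probs]        # per-step perplexities

mean_ce = sum(ces) / len(ces)                      # arithmetic mean of CE
geo_mean_ppl = math.prod(ppls) ** (1 / len(ppls))  # geometric mean of PPL

# exp(arithmetic-mean cross-entropy) equals the geometric-mean perplexity.
print(math.exp(mean_ce), geo_mean_ppl)

# A uniform (completely random) model over |V| = 10000 words has
# cross-entropy log|V| at every step.
V = 10000
print(-math.log(1 / V))  # ~9.21
```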