Derivative of Softmax Loss Function
A softmax classifier computes probabilities
\[p_j = \frac{\exp(o_j)}{\sum_{k}\exp(o_k)}
\]
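As a quick numerical sanity check, here is a minimal NumPy sketch of the softmax above (the logit values are made up for illustration):

```python
import numpy as np

def softmax(o):
    # Shift by the max logit for numerical stability; the probabilities are unchanged.
    e = np.exp(o - np.max(o))
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])   # example logits (arbitrary values)
p = softmax(o)
print(p, p.sum())               # the probabilities sum to 1
```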
It is typically used with a cross-entropy loss of the form
\[L = - \sum_{j} y_j \log p_j
\]
where \(o\) is the vector of logits. We need the derivative of \(L\) with respect to \(o\). First, the partial derivatives of \(p_j\) with respect to \(o_i\) are:
\[\frac{\partial{p_j}}{\partial{o_i}} =
\begin{cases}
p_i (1-p_i), & i = j \\
- p_i p_j, & i \ne j
\end{cases}
\]
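These two cases can be checked with central finite differences. The sketch below redefines a small `softmax` helper and uses arbitrary example logits; the tolerance is an assumption chosen for double precision:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])   # example logits (arbitrary values)
p = softmax(o)
eps = 1e-6

for i in range(len(o)):
    for j in range(len(o)):
        # Central-difference estimate of dp_j / do_i.
        op, om = o.copy(), o.copy()
        op[i] += eps
        om[i] -= eps
        numeric = (softmax(op)[j] - softmax(om)[j]) / (2 * eps)
        # Analytic value from the two cases above.
        analytic = p[i] * (1 - p[i]) if i == j else -p[i] * p[j]
        assert abs(numeric - analytic) < 1e-6
```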
Hence the derivative of the loss with respect to \(o_i\) is:
\[\begin{aligned}
\frac{\partial{L}}{\partial{o_i}} & = - \sum_k y_k \frac{\partial{\log p_k}}{\partial{o_i}} \\
& = - \sum_k y_k \frac{1}{p_k} \frac{\partial{p_k}}{\partial{o_i}} \\
& = -y_i(1-p_i) - \sum_{k\ne i} y_k \frac{1}{p_k} (-p_k p_i) \\
& = -y_i + y_i p_i + \sum_{k\ne i} y_k p_i \\
& = p_i \Big(\sum_k y_k\Big) - y_i
\end{aligned}
\]
Since \(y\) is a one-hot vector (exactly one element is 1 and the rest are 0, as in a classification problem), \(\sum_k y_k = 1\), so the gradient simplifies to:
\[\frac{\partial L}{\partial o_i} = p_i - y_i
\]
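The closed form \(p_i - y_i\) can be verified against a numerical gradient of the loss. This is a minimal sketch; the logits and the one-hot target are made-up example values:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def loss(o, y):
    # Cross-entropy loss L = -sum_j y_j * log(p_j).
    return -np.sum(y * np.log(softmax(o)))

o = np.array([2.0, 1.0, 0.1])   # example logits (arbitrary values)
y = np.array([0.0, 1.0, 0.0])   # one-hot target

grad_analytic = softmax(o) - y  # p - y, the result derived above

eps = 1e-6
grad_numeric = np.array([
    (loss(o + eps * e_i, y) - loss(o - eps * e_i, y)) / (2 * eps)
    for e_i in np.eye(len(o))
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~1e-10, i.e. they agree
```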