Reparameterization
Motivation
Why do we use reparameterization? The motivation is to separate the randomness of a random variable from its parameters, so that gradients can propagate through intermediate stochastic nodes that would otherwise be non-differentiable.
Introduction
Suppose we want to calculate the expectation below:
\[\mathbf{E}_{z\sim p_\theta(z)}[f_\theta(z)]\]
Here \(p_\theta(z)\) is the pdf of \(z\), which in general depends on \(\theta\), and \(f_\theta(z)\) is differentiable with respect to \(\theta\). Taking the gradient of the expectation, we have:
\[\nabla_\theta\,\mathbf{E}_{z\sim p_\theta(z)}[f_\theta(z)]=\int \big[\nabla_\theta p_\theta(z)\big]\,f_\theta(z)\,dz+\int p_\theta(z)\,\big[\nabla_\theta f_\theta(z)\big]\,dz\]
Because \(\nabla_\theta p_\theta(z)\) is not a pdf, the first term above cannot be transformed into an expectation, so we cannot use Monte Carlo to estimate it. If we knew the explicit form of \(p_\theta(z)\) we could try to solve this directly, but most of the time we cannot write \(p_\theta(z)\) down explicitly. So all we can do is sample, while keeping the gradient information during the sampling. Reparameterization is exactly a trick that achieves this. It can be divided into two consecutive steps:
- Sample a variable, which we call \(\epsilon\), from a fixed distribution \(q(\epsilon)\) that does not involve \(\theta\).
- Transform \(\epsilon\) into \(z\) through a deterministic transformation \(z=g_\theta(\epsilon)\), where \(\theta\) is involved.
So \(\mathbf{E}_{z\sim p_\theta(z)}[f_\theta(z)]\) becomes \(\mathbf{E}_{\epsilon\sim q(\epsilon)}[f_\theta(g_\theta(\epsilon))]\): the randomness of \(z\) has been moved into \(\epsilon\).
Then we have:
\[\begin{aligned}
&z\sim p_\theta(z)\;\Longleftrightarrow\;\epsilon\sim q(\epsilon),\;z=g_\theta(\epsilon) &\text{(R1)}\\
&\mathbf{E}_{z\sim p_\theta(z)}[f_\theta(z)]=\mathbf{E}_{\epsilon\sim q(\epsilon)}\big[f_\theta(g_\theta(\epsilon))\big] &\text{(R2)}\\
&\nabla_\theta\,\mathbf{E}_{z\sim p_\theta(z)}[f_\theta(z)]=\nabla_\theta\,\mathbf{E}_{\epsilon\sim q(\epsilon)}\big[f_\theta(g_\theta(\epsilon))\big]=\mathbf{E}_{\epsilon\sim q(\epsilon)}\big[\nabla_\theta f_\theta(g_\theta(\epsilon))\big] &\text{(R3)}
\end{aligned}\]
In the formulation above, the first two lines (R1) and (R2) are the general trick of reparameterization: the randomness in the random variable \(z\) is decoupled from the parameters. (R3) is the result of substituting (R2) into the expectation whose gradient we could not obtain before; the core point is the change of the variable over which the expectation is taken. After these three consecutive steps, the gradient of an expectation is converted into the expectation of a gradient, which can then be approximated with the Monte Carlo method, as in the sketch below.
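To make the above concrete, here is a minimal NumPy sketch (the Gaussian distribution, the quadratic \(f\), and the sample size are my own illustrative choices, not taken from the text): it estimates \(\nabla_\mu\,\mathbf{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]\) by sampling parameter-free noise \(\epsilon\) and pushing the gradient through \(z=\mu+\sigma\epsilon\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8          # parameters "theta" of the sampling distribution
n_samples = 100_000

# Reparameterization: sample noise that does not depend on the parameters,
# then transform it with a deterministic, differentiable map z = g_theta(eps).
eps = rng.standard_normal(n_samples)   # eps ~ N(0, 1), parameter-free
z = mu + sigma * eps                   # z = g_theta(eps) ~ N(mu, sigma^2)

# For f(z) = z^2 we have d f(g_theta(eps)) / d mu = 2 * z * dz/dmu = 2 * z,
# so the Monte Carlo gradient estimate is just the sample mean of 2 * z.
grad_mu_estimate = np.mean(2.0 * z)

# Analytic check: E[z^2] = mu^2 + sigma^2, so d/dmu E[z^2] = 2 * mu.
print(grad_mu_estimate, 2.0 * mu)      # the two numbers should be close
```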
Back to VAE
In the VAE, we have the ELBO (evidence lower bound) as below:
\[\mathcal{L}(\theta,\phi;x)=\mathbf{E}_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]-\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)\]
where \(\phi\) is the variational (latent) parameter and \(\theta\) is the model parameter. So the gradient we need is:
\[\nabla_{\theta,\phi}\Big(\mathbf{E}_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big]-\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)\Big)\]
The troublesome part is the gradient of the expectation term with respect to \(\phi\), because the sampling distribution \(q_\phi(z|x)\) itself depends on \(\phi\).
Under the assumption that both the prior and the posterior are Gaussian, the above simplifies: the KL term has a closed form, and the expectation term can be reparameterized with \(z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon\), \(\epsilon\sim\mathcal{N}(0,I)\), so that the gradient flows through \(z\) into \(\phi\).
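For reference, a commonly quoted single-sample form under these Gaussian assumptions (this follows the standard VAE of Kingma and Welling; the diagonal posterior \(\mathcal{N}(\mu_\phi(x),\sigma_\phi^2(x)I)\), standard normal prior, and latent dimension \(d\) are assumptions made here, not taken from the text above):
\[\mathcal{L}(\theta,\phi;x)\approx\log p_\theta\big(x\mid \mu_\phi(x)+\sigma_\phi(x)\odot\epsilon\big)+\frac{1}{2}\sum_{j=1}^{d}\Big(1+\log\sigma_{\phi,j}^{2}(x)-\mu_{\phi,j}^{2}(x)-\sigma_{\phi,j}^{2}(x)\Big),\qquad \epsilon\sim\mathcal{N}(0,I)\]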
From the view of the Encoder and the Decoder: \(q_\phi(z|x)\) is the encoder, \(p_\theta(x|z)\) is the decoder, and the reparameterized sample \(z=\mu_\phi(x)+\sigma_\phi(x)\odot\epsilon\) is the differentiable bridge between them.
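As a concrete illustration of this view, here is a minimal PyTorch-style sketch (the module names, layer sizes, and Bernoulli decoder are illustrative assumptions, not something specified above): the encoder outputs \(\mu_\phi(x)\) and \(\log\sigma^2_\phi(x)\), all randomness enters through \(\epsilon\) in `reparameterize`, and gradients flow through \(z\) into both \(\phi\) and \(\theta\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        # Encoder q_phi(z|x): outputs mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): here a Bernoulli likelihood over pixels.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, with eps ~ N(0, I) carrying all the randomness.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        x_logits = self.dec(z)
        # Negative ELBO = reconstruction term + closed-form KL(q_phi(z|x) || N(0, I)).
        recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl
```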
Discrete
If we replace \(z\) with a discrete variable \(y\), we have:
\[\mathbf{E}_{y\sim p_\theta(y)}[f(y)]=\sum_{y}p_\theta(y)\,f(y)\]
The discrete situation means that, in most cases, the values of \(y\) can be enumerated; in other words, \(p_\theta(y)\) is a \(k\)-class classification model.
We might think the formulation above is easy to compute, since it is just a sum. But what if \(k\) is huge? For example, if \(y\) is a 1000-dimensional vector and each entry follows a Bernoulli distribution, there are \(2^{1000}\) different vectors, and we could never finish summing over all of them.
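To see where the blow-up comes from, consider the case just mentioned: a binary vector \(y\in\{0,1\}^{n}\) whose entries are independent Bernoulli variables (the factorized density below is my own illustration of that setup). The expectation is a sum over every configuration:
\[\mathbf{E}_{y\sim p_\theta(y)}[f(y)]=\sum_{y\in\{0,1\}^{n}}\Big(\prod_{i=1}^{n}p_{\theta,i}^{\,y_i}\,(1-p_{\theta,i})^{1-y_i}\Big)f(y),\]
which has \(2^{n}\) terms; for \(n=1000\) that is \(2^{1000}\) summands.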
Gumbel Max
So, back to the discrete expectation above: if we can find a method that gives an efficient estimate of it without losing the gradient information with respect to \(\theta\), we can easily solve the problem.
So we additionally introduce the Gumbel distribution to help us solve the issue mentioned above.
If the probabilities of the classes are \(p_1,\ p_2,\ \dots,\ p_k\), then the formulation below provides a tool to sample exactly according to these probabilities:
\[\arg\max_{i}\big(\log p_i-\log(-\log\epsilon_i)\big),\qquad \epsilon_i\sim U[0,1]\]
The explanation of the formulation above: first compute the probability \(p_1,\ p_2,\ \dots,\ p_k\) of each class, then draw \(k\) samples \(\epsilon_1,\ \epsilon_2,\ \dots,\ \epsilon_k\) from the uniform distribution \(U[0,1]\), and compute \(\log p_i-\log(-\log\epsilon_i)\) for each class. Finally, take the maximum; the class at which the maximum is attained is the sample we want.
Now the randomness has been moved into \(U[0,1]\), and \(U[0,1]\) involves no parameters, so this is a reparameterization method.
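A minimal NumPy sketch of this sampling procedure (the class probabilities and the sample count are arbitrary illustrative choices): drawing many samples with the \(\arg\max\) formula above should reproduce the probabilities \(p_i\) empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])          # class probabilities p_1, ..., p_k
n_samples = 200_000

# Gumbel-Max: add -log(-log(eps_i)) noise to log(p_i) and take the argmax.
eps = rng.uniform(size=(n_samples, p.size))   # eps_i ~ U[0, 1]
scores = np.log(p) - np.log(-np.log(eps))     # log p_i - log(-log eps_i)
samples = np.argmax(scores, axis=1)

# Empirical class frequencies should be close to p.
print(np.bincount(samples, minlength=p.size) / n_samples)   # ~ [0.2, 0.5, 0.3]
```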
But this is still not enough. In many situations we want to compute derivatives with respect to \(p\), so we need to replace the \(\arg\max\) with a softmax. Finally, we get the differentiable version of Gumbel Max, the Gumbel Softmax:
\[y_i=\frac{\exp\big((\log p_i-\log(-\log\epsilon_i))/\tau\big)}{\sum_{j=1}^{k}\exp\big((\log p_j-\log(-\log\epsilon_j))/\tau\big)},\qquad \epsilon_i\sim U[0,1]\]
The parameter \(\tau\) is the annealing temperature. As \(\tau\) becomes smaller, the output of Gumbel Softmax gets closer to the one-hot output of Gumbel Max, but the loss of gradient information becomes more serious.
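Here is a small NumPy sketch of the relaxation (again, the probabilities and temperatures are illustrative assumptions): as \(\tau\) shrinks, the output vector concentrates on one coordinate and approaches the one-hot output of Gumbel Max.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])

def gumbel_softmax(p, tau, rng):
    # Same noisy scores as Gumbel-Max, but passed through a temperature softmax
    # instead of argmax, so the result is differentiable with respect to p.
    eps = rng.uniform(size=p.shape)
    scores = (np.log(p) - np.log(-np.log(eps))) / tau
    scores -= scores.max()                  # for numerical stability
    y = np.exp(scores)
    return y / y.sum()

for tau in (10.0, 1.0, 0.1):
    print(tau, gumbel_softmax(p, tau, rng)) # smaller tau -> closer to one-hot
```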
The proof of Gumbel Max:
We want to prove that the probability that Gumbel Max outputs class \(i\) is \(p_i\). Without loss of generality, we prove that the probability of outputting class \(1\) is \(p_1\). If Gumbel Max outputs \(1\), it means that \(\log p_1-\log(-\log\epsilon_1)\) is the maximum, which means:\[\begin{gathered} \log p_{1}-\log \left(-\log \varepsilon_{1}\right)>\log p_{2}-\log \left(-\log \varepsilon_{2}\right) \\ \log p_{1}-\log \left(-\log \varepsilon_{1}\right)>\log p_{3}-\log \left(-\log \varepsilon_{3}\right) \\ \vdots \\ \log p_{1}-\log \left(-\log \varepsilon_{1}\right)>\log p_{k}-\log \left(-\log \varepsilon_{k}\right) \end{gathered} \]The key observation is that, for a fixed \(\varepsilon_1\), these inequalities hold or fail independently of each other, because \(\varepsilon_2,\dots,\varepsilon_k\) are independent: whether \(\log p_{1}-\log \left(-\log \varepsilon_{1}\right)>\log p_{2}-\log \left(-\log \varepsilon_{2}\right)\) holds does not affect whether \(\log p_{1}-\log \left(-\log \varepsilon_{1}\right)>\log p_{k}-\log \left(-\log \varepsilon_{k}\right)\) holds.
So we can analyze each inequality separately. Without loss of generality, take the first one. Rearranging,\[\log p_{1}-\log(-\log\varepsilon_{1})>\log p_{2}-\log(-\log\varepsilon_{2})\;\Longleftrightarrow\;-\log\varepsilon_{2}>\frac{p_{2}}{p_{1}}\,(-\log\varepsilon_{1})\;\Longleftrightarrow\;\varepsilon_{2}<\varepsilon_{1}^{p_{2} / p_{1}} \leq 1 \]Because \(\varepsilon_2\) follows the uniform distribution on \([0,1]\), the probability that \(\varepsilon_2< \varepsilon_1^{p_2/p_1}\) for a given \(\varepsilon_1\) is exactly \(\varepsilon_1^{p_2/p_1}\). This is the probability that the first inequality holds. So the probability that all the inequalities hold simultaneously is
\[\varepsilon_{1}^{p_{2} / p_{1}}\, \varepsilon_{1}^{p_{3} / p_{1}} \cdots \varepsilon_{1}^{p_{k} / p_{1}}=\varepsilon_{1}^{\left(p_{2}+p_{3}+\cdots+p_{k}\right) / p_{1}}=\varepsilon_{1}^{\left(1 / p_{1}\right)-1} \]Then we integrate over \(\varepsilon_1\) to get the total probability:
\[\int_0^1 \varepsilon_1^{1/p_1-1}\,d\varepsilon_1=p_1 \]This is the probability that class 1 is the output, and it is exactly \(p_1\).
This proves the correctness of Gumbel-Max sampling.
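As a quick numerical sanity check of the final integral (a throwaway sketch; the value of \(p_1\) and the grid size are arbitrary choices), a simple midpoint-rule estimate of \(\int_0^1\varepsilon_1^{1/p_1-1}\,d\varepsilon_1\) should come out close to \(p_1\):

```python
import numpy as np

p1 = 0.3
n = 2_000_000
eps1 = (np.arange(n) + 0.5) / n               # midpoints of a uniform grid on (0, 1)
integrand = eps1 ** (1.0 / p1 - 1.0)          # the integrand eps1^(1/p1 - 1)
print(integrand.mean(), p1)                   # midpoint-rule estimate vs. p1, both ~0.3
```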