Deep Learning Week 7 Notes
1. Transposed Convolution
Consider a 1d convolution with kernel \(k\):
\[
y_i \;=\; \sum_a x_{i+a-1}\, k_a .
\]
The gradient of the loss \(\ell\) with respect to the input \(x\) is itself a convolution of \(\frac{\partial \ell}{\partial y}\) with the mirrored kernel, the "transposed convolution". This is because, letting \(u = i+a-1\) (i.e. \(a = u-i+1\)), then:
\[
\frac{\partial y_i}{\partial x_u} \;=\; k_{u-i+1} .
\]
We get:
\[
\left[\frac{\partial \ell}{\partial x}\right]_u \;=\; \sum_i \frac{\partial \ell}{\partial y_i}\,\frac{\partial y_i}{\partial x_u} \;=\; \sum_i \left[\frac{\partial \ell}{\partial y}\right]_i k_{u-i+1} .
\]
Transposed Convolution: See Lecture (\(\large \text{Read Carefully}\))
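This identity can be checked numerically in PyTorch (not part of the lecture notes; the shapes below are arbitrary illustrations): the gradient of F.conv1d with respect to its input equals F.conv_transpose1d applied to the upstream gradient with the same kernel.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 10, requires_grad=True)  # (batch, channels, length)
k = torch.randn(1, 1, 3)                       # (out channels, in channels, kernel size)
y = F.conv1d(x, k)
grad_y = torch.randn_like(y)                   # some upstream gradient dl/dy
y.backward(grad_y)
grad_x = F.conv_transpose1d(grad_y, k)         # transposed convolution of dl/dy with k
print(torch.allclose(x.grad, grad_x))          # True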
2. Deep Autoencoder
A good autoencoder, with encoder \(f\) and decoder \(g\), could be characterized by a small value of the quadratic loss:
\[
\mathbb{E}_{x \sim q(X)}\Big[\,\lVert x - g(f(x)) \rVert^2\,\Big].
\]
A simple example of such an autoencoder is one where both \(f\) and \(g\) are linear, in which case the optimal solution is given by \(\text{PCA}\).
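A minimal PyTorch sketch of such an autoencoder with non-linear \(f\) and \(g\), trained with the quadratic loss above (the dimensions, architecture and random data are illustrative assumptions, not from the lecture):
import torch
from torch import nn

d, latent_dim = 784, 32                                                       # assumed dimensions
f = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, latent_dim))  # encoder
g = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, d))  # decoder
optimizer = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

x = torch.randn(64, d)              # stand-in for a batch of samples from q(X)
loss = (x - g(f(x))).pow(2).mean()  # quadratic reconstruction loss
optimizer.zero_grad()
loss.backward()
optimizer.step()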
3. Variational Autoencoders
\(q(X)\) is the data distribution, and the generative model is defined by a prior \(p(Z)\) over a latent variable \(Z\) and a conditional \(p(X|Z)\).
We want to maximize:
\[
\mathbb{E}_{q(X)}\big[\log p(X)\big].
\]
Can also show that, for any conditional \(q(Z|X=x)\):
\[
\log p(x) \;\geq\; \mathbb{E}_{q(Z|X=x)}\big[\log p(x|Z)\big] \;-\; \mathbb{KL}\big(q(Z|X=x)\,\big\|\,p(Z)\big).
\]
So it makes sense to maximize:
\[
\mathbb{E}_{q(X)}\Big[\,\mathbb{E}_{q(Z|X)}\big[\log p(X|Z)\big] \;-\; \mathbb{KL}\big(q(Z|X)\,\big\|\,p(Z)\big)\,\Big],
\]
with
- \(q(X)\) is the data distribution
- \(p(Z) =\mathcal{N}(0,I)\)
Therefore, the loss function to minimize is:
\[
\mathcal{L} \;=\; \mathbb{E}_{q(X)}\Big[\mathbb{KL}\big(q(Z|X)\,\big\|\,p(Z)\big)\Big] \;-\; \mathbb{E}_{q(X),\,q(Z|X)}\big[\log p(X|Z)\big].
\]
Kingma and Welling propose that both the encoder \(f\) and the decoder \(g\) map to Gaussians with diagonal covariance. Hence they map to twice the dimension, \(f(x)=(\mu^f(x),\sigma^f(x))\), and (a possible model definition is sketched after the list below):
- \(q(Z|X=x)\sim \mathcal{N}(\mu^f(x),\text{diag}(\sigma^f(x)))\)
- \(p(X|Z=z)\sim \mathcal{N}(\mu^g(z),\text{diag}(\sigma^g(z)))\)
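One possible definition of the model used in the code below, as a sketch under assumptions: the layer sizes are made up, and the decoder outputs only \(\mu^g\) since its variance is fixed to \(1\) further down.
import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, d=784, latent_dim=32):  # dimensions are placeholders
        super().__init__()
        # the encoder outputs 2 * latent_dim values: (mu_f, log sigma_f)
        self.encoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        # the decoder outputs only mu_g, since sigma_g is fixed to 1 below
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, d))

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

model = VAE()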
The first term of \(\mathcal{L}\) is the average over the \(x_n\) of:
\[
\mathbb{KL}\big(q(Z|X=x_n)\,\big\|\,p(Z)\big) \;=\; -\frac{1}{2}\sum_d \Big(1 + \log \sigma^f_d(x_n) - \mu^f_d(x_n)^2 - \sigma^f_d(x_n)\Big).
\]
\(\text{In code:}\)
param_f = model.encode(input)                               # concatenated (mu_f, log sigma_f)
mu_f, logvar_f = param_f.split(param_f.size(1)//2, 1)       # split into mean and log-variance
kl = - 0.5 * (1 + logvar_f - mu_f.pow(2) - logvar_f.exp())  # per-component KL to N(0, 1)
kl_loss = kl.sum() / input.size(0)                          # sum over components, average over the batch
Following Kingma and Welling (2013), we use a constant variance of \(1\) for the decoder, so the second term of \(\mathcal{L}\) becomes the average of:
\[
\frac{1}{2}\,\big\lVert x_n - \mu^g(z_n) \big\rVert^2
\]
over the \(x_n\), with one \(z_n\) sampled for each, i.e.
\[
z_n \sim \mathcal{N}\big(\mu^f(x_n),\, \text{diag}(\sigma^f(x_n))\big).
\]
\(\text{In code:}\)
std_f = torch.exp(0.5 * logvar_f)          # standard deviation from the log-variance
z = torch.randn_like(mu_f) * std_f + mu_f  # reparameterization: z = mu_f + sigma_f * eps, eps ~ N(0, I)
output = model.decode(z)                   # mu_g(z)
fit = 0.5 * (output - input).pow(2)        # -log p(x | z) up to an additive constant
fit_loss = fit.sum() / input.size(0)       # sum over components, average over the batch
loss = kl_loss + fit_loss
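Not spelled out in the notes: a full training step would then back-propagate this loss and update the parameters. A minimal sketch, assuming optimizer is, e.g., an Adam optimizer over model.parameters() (an assumption):
optimizer.zero_grad()  # clear previous gradients
loss.backward()        # back-propagate kl_loss + fit_loss
optimizer.step()       # update encoder and decoder parameters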
\(\large \text{Note:}\)
- kl_loss aims at making the distribution in the embedding space close to the normal density.
- fit_loss aims at making the reconstructed data point correct in a probabilistic sense: the original data point should be likely under the Gaussian that we get when we come back from the latent space.
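Since kl_loss pushes \(q(Z|X)\) toward the normal prior, new data points can be generated by sampling \(z\) from \(\mathcal{N}(0, I)\) and decoding it. A minimal sketch (not from the notes; model.decode and latent_dim are the assumptions made above):
z = torch.randn(16, latent_dim)  # 16 latent codes drawn from the prior p(Z) = N(0, I)
x_gen = model.decode(z)          # mu_g(z): means of the generated samples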