Deep Learning Week7 Notes

1. Transposed Convolution

Consider a 1d convolution with kernel \(k\):

\[\begin{align} y_i &= (x\circledast k)_i\\ &=\sum_a x_{i+a-1}k_a\\ &=\sum_u x_uk_{u-i+1} \end{align} \]

The last step follows from the change of variable \(u = i+a-1\), which gives:

\[a = u-i+1 \]

For the gradient of the loss \(l\) with respect to the input \(x\), we get:

\[\begin{align} \left[\frac{\partial l}{\partial x}\right]_u &=\frac{\partial l}{\partial x_u}\\ &= \sum_i \frac{\partial l}{\partial y_i}\frac{\partial y_i}{\partial x_u}\\ &= \sum_i \frac{\partial l}{\partial y_i}k_{u-i+1} \end{align} \]

Transposed Convolution: the operation that maps \(\partial l/\partial y\) to \(\partial l/\partial x\) above is linear, and its matrix is the transpose of the convolution's matrix, hence the name. See Lecture (\(\large \text{Read Carefully}\))
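
A minimal sketch in PyTorch (shapes and random data are my own assumptions, not the lecture's code) checking numerically that the gradient of a 1d convolution with respect to its input is computed by a transposed convolution with the same kernel:

import torch
import torch.nn.functional as F

# x: (batch, channels, length), k: (out_channels, in_channels, kernel_size)
x = torch.randn(1, 1, 10, requires_grad=True)
k = torch.randn(1, 1, 3)

y = F.conv1d(x, k)
grad_y = torch.randn_like(y)   # some upstream gradient dl/dy
y.backward(grad_y)             # autograd fills x.grad with dl/dx

# The same gradient, obtained directly as a transposed convolution of dl/dy by k
grad_x = F.conv_transpose1d(grad_y, k)
print(torch.allclose(x.grad, grad_x, atol=1e-6))   # True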

2. Deep Autoencoder

The quality of an autoencoder \(g\circ f\) can be assessed with the expected quadratic reconstruction loss:

\[\begin{align} \mathbb{E}_{X\sim q}[||X-g\circ f(X)||^2] \end{align} \]

A simple example of such an autoencoder takes both \(f\) and \(g\) linear, in which case the optimal solution is given by \(\text{PCA}\), as sketched below.
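
A minimal sketch of this linear case (toy data and dimensions are my own assumptions): encoding with the top \(d\) principal directions and decoding by mapping back minimizes the quadratic loss among linear \(f, g\).

import torch

N, D, d = 1000, 16, 4
X = torch.randn(N, D) @ torch.randn(D, D)   # toy correlated data
mu = X.mean(0)
_, _, V = torch.pca_lowrank(X, q=d)         # (randomized) PCA; V: (D, d) top directions

f = lambda x: (x - mu) @ V                  # linear encoder
g = lambda z: z @ V.T + mu                  # linear decoder

recon = ((X - g(f(X)))**2).sum(1).mean()    # quadratic reconstruction loss
print(recon)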

3. Variational Autoencoders

\(q(X)\) is the data distribution, and the encoder \(f\) parametrizes the conditional distribution of the latent variable given the input:

\[\begin{align} f(x)\sim q(Z|X=x) \end{align} \]

We want to maximize the expected log-likelihood of the data under the model:

\[\mathbb{E}_{q(X)}[\log{p(X)}] \]

One can also show the following lower bound (the ELBO):

\[\begin{align} \log{p(X=x)}\ge \mathbb{E}_{q(Z|X=x)}[\log{p(X=x|Z)}]-\mathbb{D}_{KL}(q(Z | X = x)||p(Z)) \end{align} \]
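
One way to see this: write \(\log p(X=x)\) as an expectation under \(q(Z\mid X=x)\) using \(p(x) = p(x\mid z)\,p(z)/p(z\mid x)\), and split off the intractable posterior term:

\[\begin{align} \log{p(X=x)} &= \mathbb{E}_{q(Z|X=x)}\left[\log{\frac{p(X=x|Z)\,p(Z)}{p(Z|X=x)}}\right]\\ &= \mathbb{E}_{q(Z|X=x)}[\log{p(X=x|Z)}]-\mathbb{D}_{KL}(q(Z|X=x)||p(Z))+\mathbb{D}_{KL}(q(Z|X=x)||p(Z|X=x))\\ &\ge \mathbb{E}_{q(Z|X=x)}[\log{p(X=x|Z)}]-\mathbb{D}_{KL}(q(Z|X=x)||p(Z)), \end{align} \]

since the last KL term is non-negative.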

So it makes sense to maximize:

\[\begin{align} \mathbb{E}_{q(X,Z)}[\log{p(X|Z)}]-\mathbb{E}_{q(X)}[\mathbb{D}_{KL}(q(Z | X)||p(Z))] \end{align} \]

where

  • \(q(X)\) is the data distribution,
  • \(p(Z) =\mathcal{N}(0, I)\) is the prior over the latent space.

Therefore, the loss to minimize is the negative of this objective:

\[\begin{align} L= \mathbb{E}_{q(X)}[\mathbb{D}_{KL}(q(Z | X)||p(Z))]-\mathbb{E}_{q(X,Z)}[\log{p(X|Z)}] \end{align} \]

Kingma and Welling propose that both the encoder \(f\) and the decoder \(g\) map to Gaussians with diagonal covariance. Hence each maps to twice the dimension of its target space, e.g. \(f(x)=(\mu^f(x),\sigma^f(x))\), and:

  • \(q(Z|X=x)\sim \mathcal{N}(\mu^f(x),\text{diag}(\sigma^f(x)))\)
  • \(p(X|Z=z)\sim \mathcal{N}(\mu^g(z),\text{diag}(\sigma^g(z)))\)

The first term of \(L\) is the average of:

\[\begin{align} \mathbb{D}_{KL}(q(Z | X=x)||p(Z))=-\frac{1}{2}\sum_d(1+2\log{\sigma_d^f(x)}-(\mu_d^f(x))^2-(\sigma_d^f(x))^2) \end{align} \]
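
Per dimension, this is the closed-form KL divergence between \(\mathcal{N}(\mu,\sigma^2)\) and \(\mathcal{N}(0,1)\):

\[\begin{align} \mathbb{D}_{KL}(\mathcal{N}(\mu,\sigma^2)||\mathcal{N}(0,1)) &= \mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}\left[-\log{\sigma}-\frac{(z-\mu)^2}{2\sigma^2}+\frac{z^2}{2}\right]\\ &= -\log{\sigma}-\frac{1}{2}+\frac{\sigma^2+\mu^2}{2}\\ &= -\frac{1}{2}(1+2\log{\sigma}-\mu^2-\sigma^2), \end{align} \]

and summing over the (independent) dimensions gives the expression above.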

\(\text{In code:}\)

param_f = model.encode(input)
# split the encoder output into mean and log-variance (logvar = 2 log sigma)
mu_f, logvar_f = param_f.split(param_f.size(1)//2, 1)

# closed-form KL divergence to N(0, I), summed over dimensions, averaged over the batch
kl = - 0.5 * (1 + logvar_f - mu_f.pow(2) - logvar_f.exp())
kl_loss = kl.sum() / input.size(0)

As in Kingma and Welling (2013), we use a constant variance of \(1\) for the decoder, so the second term of \(L\) becomes the average of:

\[-\log{p(X=x|Z=z)}= \frac{1}{2}\sum_d(x_d-\mu_d^g(z))^2+\text{const} \]

over the training samples \(x_n\), with one \(z_n\) sampled for each, i.e.

\[z_n\sim \mathcal{N}(\mu^f(x_n),\text{diag}(\sigma^f(x_n))) \]

\(\text{In code:}\)

# reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
std_f = torch.exp(0.5 * logvar_f)
z = torch.randn_like(mu_f) * std_f + mu_f
output = model.decode(z)

# quadratic reconstruction term (decoder variance fixed to 1), averaged over the batch
fit = 0.5 * (output - input).pow(2)
fit_loss = fit.sum() / input.size(0)

loss = kl_loss + fit_loss
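
The snippets above assume a model exposing encode and decode. A minimal sketch of such a model (hypothetical MLP architecture and dimensions; the lecture's actual model may differ):

import torch
from torch import nn

class VAE(nn.Module):
    def __init__(self, dim_in=784, dim_latent=8, dim_hidden=256):
        super().__init__()
        # encoder outputs 2 * dim_latent values: (mu_f, logvar_f)
        self.enc = nn.Sequential(
            nn.Linear(dim_in, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, 2 * dim_latent),
        )
        # decoder outputs mu_g(z); its variance is fixed to 1
        self.dec = nn.Sequential(
            nn.Linear(dim_latent, dim_hidden), nn.ReLU(),
            nn.Linear(dim_hidden, dim_in),
        )

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)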

\(\large \text{Note:}\)

  • kl_loss aims at making the distribution in the embedding space close to the normal density.
  • fit_loss aims at making the reconstructed data point correct in a probabilistic sense: the original data point should be likely under the Gaussian we get when we come back from the latent space.

\(\large\textbf{For more details, see }\)Column, Blog

posted on 2022-05-23 22:56  Blackzxy