Deep Learning Week 11 Notes

1. Generative Adversarial Networks (GAN)

\(\large\textbf{Aim:}\) learn high-dimensional densities. Two networks are trained jointly:

  • A Discriminator \(\textbf{D}\) to classify samples as "Fake" or "Real"
  • A Generator \(\textbf{G}\) to map a fixed distribution to samples that fool \(\textbf{D}\)

The approach is adversarial since the two networks have antagonistic objectives.

\(\Large\text{Note:}\)

  • The role of the discriminator \(D\) is to detect if a sample is from the real world or was generated.
  • The role of the generator \(G\) is to produce realistic samples: given some random noise following a fixed and simple distribution, it should produce samples which are realistic in the sense that they fool the discriminator.
  • The discriminator \(\textbf{D}\) is optimized to minimize a standard classification loss, and the generator \(\textbf{G}\) is optimized to maximize that loss.
  • \(\large\textbf{A key point }\)is that the generator maximizes that loss through the discriminator. Hence the backward pass propagates the gradient of the loss through the discriminator to the generator, and the generator is constantly updated during training to remove any statistical structure that the discriminator picked up as specific to the synthetic samples.

Let \(\mathcal{X}\) be the signal space, and \(d\) the latent space dimension.

  • generator:

\[\bf{G}:\mathbb{R}^d\rightarrow \mathcal{X} \]

is trained so that [ideally], if it gets a normally-distributed \(Z\) as input, it produces a sample following the data distribution as output.

  • discriminator:

\[\bf{D}:\mathcal{X}\rightarrow [0,1] \]

Given a set of 'real points':

\[x_n\sim \mu, n = 1,...,N \]

and if \(\textbf{G}\) is fixed, we can train \(\textbf{D}\) by generating

\[z_n\sim \mathcal{N}(0,I),n =1,...,N \]

building a two-class dataset:

\[\mathscr{D}=\{\underbrace{\left(x_{1}, 1\right), \ldots,\left(x_{N}, 1\right)}_{\text {real samples } \sim \mu}, \underbrace{\left(\mathbf{G}\left(z_{1}\right), 0\right), \ldots,\left(\mathbf{G}\left(z_{N}\right), 0\right)}_{\text {fake samples } \sim \mu_{\mathbf{G}}}\}, \]

where \(\mu\) is the true data distribution, and \(\mu_G\) is the distribution of \(\textbf{G}(Z)\) with \(Z\sim \mathcal{N}(0,I)\), and minimize the binary cross-entropy:

\[\begin{aligned} \mathscr{L}(\mathbf{D}) &=-\frac{1}{2 N}\left(\sum_{n=1}^{N} \log \mathbf{D}\left(x_{n}\right)+\sum_{n=1}^{N} \log \left(1-\mathbf{D}\left(\mathbf{G}\left(z_{n}\right)\right)\right)\right) \\ &=-\frac{1}{2}\left(\hat{\mathbb{E}}_{X \sim \mu}[\log \mathbf{D}(X)]+\hat{\mathbb{E}}_{X \sim \mu_{\mathbf{G}}}[\log (1-\mathbf{D}(X))]\right) . \end{aligned} \]
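
As a quick sanity check, this is the usual binary cross-entropy computed over \(\mathscr{D}\). A minimal numeric sketch, where `d_real` and `d_fake` are hypothetical placeholder outputs of \(\textbf{D}\) on real and generated samples:

import torch

# Placeholder values for D(x_n) and D(G(z_n)), with N = 4
d_real = torch.tensor([0.9, 0.8, 0.7, 0.95])
d_fake = torch.tensor([0.1, 0.3, 0.2, 0.05])

# L(D) = -1/(2N) (sum_n log D(x_n) + sum_n log(1 - D(G(z_n))))
loss_D = - 0.5 * (d_real.log().mean() + (1 - d_fake).log().mean())
print(loss_D)  # small when D scores real samples high and fake ones low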

The situation is slightly more complicated since we also want to optimize \(\textbf{G}\) to maximize \(\textbf{D}\)’s loss.

Define the loss of \(\textbf{G}\):

\[\mathscr{L}_{\mathbf{G}}(\mathbf{D}, \mathbf{G})=\mathbb{E}_{X \sim \mu}[\log \mathbf{D}(X)]+\mathbb{E}_{X \sim \mu_{\mathbf{G}}}[\log (1-\mathbf{D}(X))] \]

which is high if \(\textbf{D}\) is doing a good job (low cross entropy), and low if \(\textbf{G}\) fools \(\textbf{D}\).

Our ultimate goal is a \(\textbf{G}^*\) that fools any \(\textbf{D}\), so

\[\mathbf{G}^{*}=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\mathbf{D}} \mathscr{L}_{\mathbf{G}}(\mathbf{D}, \mathbf{G}) . \]

If we define the optimal discriminator for a given generator:

\[\mathbf{D_G^*} = \operatorname{argmax}_{\mathbf{D}}\mathscr{L}_{\mathbf{G}}(\mathbf{D}, \mathbf{G}) \]

our objective becomes:

\[\mathbf{G^*} = \operatorname{argmin}_{\mathbf{G}}\mathscr{L}_{\mathbf{G}}(\mathbf{D_G^*}, \mathbf{G}) \]

\(\text{Hence, we have:}\)

\[\begin{aligned} \mathscr{L}_{\mathbf{G}}(\mathbf{D}, \mathbf{G}) &=\mathbb{E}_{X \sim \mu}[\log \mathbf{D}(X)]+\mathbb{E}_{X \sim \mu_{\mathbf{G}}}[\log (1-\mathbf{D}(X))] \\ &=\int_{x} \mu(x) \log \mathbf{D}(x)+\mu_{\mathbf{G}}(x) \log (1-\mathbf{D}(x)) d x \end{aligned} \]

Since

\[\underset{d}{\operatorname{argmax}}\; \mu(x) \log d+\mu_{\mathbf{G}}(x) \log (1-d)=\frac{\mu(x)}{\mu(x)+\mu_{\mathbf{G}}(x)} \]

(setting the derivative with respect to \(d\) to zero gives \(\frac{\mu(x)}{d}-\frac{\mu_{\mathbf{G}}(x)}{1-d}=0\), whose solution is \(d=\frac{\mu(x)}{\mu(x)+\mu_{\mathbf{G}}(x)}\)), and

\[\mathbf{D_G^*} = \operatorname{argmax}_{\mathbf{D}}\mathscr{L}_{\mathbf{G}}(\mathbf{D}, \mathbf{G}), \]

if there is no constraint on \(\textbf{D}\), we get:

\[\forall x, \mathbf{D_G^*}(x) = \frac{\mu(x)}{\mu(x)+\mu_{\mathbf{G}}(x)} \]

Plugging this optimal discriminator back into the loss, we get:

\[\begin{aligned} \mathscr{L}_{\mathbf{G}}\left(\mathbf{D}_{\mathbf{G}}^{*}, \mathbf{G}\right) &=\mathbb{E}_{X \sim \mu}\left[\log \mathbf{D}_{\mathbf{G}}^{*}(X)\right]+\mathbb{E}_{X \sim \mu_{\mathbf{G}}}\left[\log \left(1-\mathbf{D}_{\mathbf{G}}^{*}(X)\right)\right] \\ &=\mathbb{E}_{X \sim \mu}\left[\log \frac{\mu(X)}{\mu(X)+\mu_{\mathbf{G}}(X)}\right]+\mathbb{E}_{X \sim \mu_{\mathbf{G}}}\left[\log \frac{\mu_{\mathbf{G}}(X)}{\mu(X)+\mu_{\mathbf{G}}(X)}\right] \\ &=\mathbb{D}_{\mathrm{KL}}\left(\mu \| \frac{\mu+\mu_{\mathbf{G}}}{2}\right)+\mathbb{D}_{\mathrm{KL}}\left(\mu_{\mathbf{G}} \| \frac{\mu+\mu_{\mathbf{G}}}{2}\right)-\log 4 \\ &=2 \mathbb{D}_{\mathrm{JS}}\left(\mu, \mu_{\mathbf{G}}\right)-\log 4 \end{aligned} \]

where \(\mathbb{D}_{JS}\) is the Jensen-Shannon Divergence, a standard similarity measure between distributions.
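
For discrete distributions, \(\mathbb{D}_{JS}\) can be computed explicitly. A minimal sketch (the two distributions below are arbitrary illustrative choices):

import torch

mu = torch.tensor([0.9, 0.1])    # illustrative discrete distribution
mu_g = torch.tensor([0.5, 0.5])  # illustrative discrete distribution
m = (mu + mu_g) / 2

def kl(p, q):
    # D_KL(p || q) for discrete distributions with full support
    return (p * (p / q).log()).sum()

d_js = 0.5 * kl(mu, m) + 0.5 * kl(mu_g, m)
print(d_js)  # 0 iff mu == mu_g, and at most log 2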

\(\large\textbf{Example:}\) we take \(d=8\), \(\mathcal{X} = \mathbb{R}^2\)

import torch
from torch import nn, optim

z_dim = 8
nb_hidden = 100

# Generator G: R^8 -> R^2
model_G = nn.Sequential(nn.Linear(z_dim, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 2))

# Discriminator D: R^2 -> [0, 1]
model_D = nn.Sequential(nn.Linear(2, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 1),
                        nn.Sigmoid())

batch_size, lr = 10, 1e-3

optimizer_G = optim.Adam(model_G.parameters(), lr = lr)
optimizer_D = optim.Adam(model_D.parameters(), lr = lr)

# Toy "real" data, an assumption for this sketch: a shifted 2d Gaussian
nb_epochs = 500
real_samples = torch.randn(1000, 2) + torch.tensor([2.0, 1.0])

for e in range(nb_epochs):
    for t, real_batch in enumerate(real_samples.split(batch_size)):
        # Z ~ N(0, I), same batch size, dtype and device as the real batch
        z = real_batch.new(real_batch.size(0), z_dim).normal_()
        fake_batch = model_G(z)

        D_scores_on_real = model_D(real_batch)
        D_scores_on_fake = model_D(fake_batch)

        if t % 2 == 0:
            # Generator update: minimize E[log(1 - D(G(Z)))]
            loss = (1 - D_scores_on_fake).log().mean()
            optimizer_G.zero_grad()
            loss.backward()
            optimizer_G.step()
        else:
            # Discriminator update: minimize the binary cross-entropy
            loss = - (1 - D_scores_on_fake).log().mean() \
                   - D_scores_on_real.log().mean()
            optimizer_D.zero_grad()
            loss.backward()
            optimizer_D.step()

Goodfellow et al. suggest replacing \(\mathbb{E}_{X \sim \mu_{\mathbf{G}}}\left[\log \left(1-\mathbf{D}(X)\right)\right]\) with a non-saturating cost:

\[-\mathbb{E}_{X \sim \mu_{\mathbf{G}}}\left[\log \left(\mathbf{D}(X)\right)\right] \]

\(\Large\textbf{Note: }\) The resulting optimization problem has the same optima as the original one, but it provides much stronger gradients to the generator early in training, when \(\textbf{D}\) confidently rejects the generated samples.
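
In the training loop above, this amounts to changing one line of the generator update (a sketch reusing the variable names from the example):

# Non-saturating cost: minimize -log D(G(Z)) instead of log(1 - D(G(Z))),
# so the gradient does not vanish when D confidently rejects the fakes.
loss = - D_scores_on_fake.log().mean()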

2. Deep Convolutional GAN

\(\large\text{Tricks: see }\) Lecture-P18

Additionally, performance is hard to assess. Two standard measures are the Inception Score (Salimans et al., 2016) and the Fréchet Inception Distance (Heusel et al., 2017), but assessment is often a “beauty contest”.

  • The Inception Score checks that when generated images are classified by an inception model (Szegedy et al., 2015), the estimated posterior distribution of classes is similar to the real class distribution, which in particular penalizes a missing class.
  • The Fréchet Inception Distance looks at the distributions of the features in one of the feature maps of the inception model, for the real and synthetic samples, and estimates their similarity under a Gaussian model.

3. Wasserstein GAN

The Wasserstein distance measures the minimum mass displacement required to transform one distribution into the other.

Intuitively, it increases monotonically with the distance between modes.

The Wasserstein distance can be defined as:

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\min _{q \in \Pi\left(\mu, \mu^{\prime}\right)} \mathbb{E}_{\left(X, X^{\prime}\right) \sim q}\left[\left\|X-X^{\prime}\right\|\right] \]

\(\text { where } \Pi\left(\mu, \mu^{\prime}\right) \text { is the set of distributions over } \mathscr{X}^{2} \text { whose marginals are } \mu \text { and } \mu^{\prime} \text {. }\)
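
For instance, if \(\mu = \delta_a\) and \(\mu' = \delta_b\) are two Dirac masses, the only admissible coupling transports all the mass from \(a\) to \(b\), so

\[\mathbb{W}(\delta_a, \delta_b) = \|a - b\|, \]

which grows linearly with the separation between the two modes, whereas \(\mathbb{D}_{JS}(\delta_a, \delta_b)\) remains equal to \(\log 2\) as soon as \(a \neq b\).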

So while it would make a lot of sense to look for a generator matching the density for this metric, that is:

\[\mathbf{G^*} = \operatorname{argmin}_{\mathbf{G}}\mathbb{W}(\mu, \mu_{\mathbf{G}}) \]

there is no practical way to compute, let alone optimize, this minimum over couplings. Fortunately, a duality theorem from Kantorovich and Rubinstein implies:

\[\mathbb{W}\left(\mu, \mu^{\prime}\right)=\max _{\|f\|_{L} \leq 1} \mathbb{E}_{X \sim \mu}[f(X)]-\mathbb{E}_{X \sim \mu^{\prime}}[f(X)] \]

where

\[\|f\|_{L}=\max _{x, x^{\prime}} \frac{\left\|f(x)-f\left(x^{\prime}\right)\right\|}{\left\|x-x^{\prime}\right\|} \]

Using this result, we are looking for a generator:

\[\begin{aligned} \mathbf{G}^{*} &=\underset{\mathbf{G}}{\operatorname{argmin}} \mathbb{W}\left(\mu, \mu_{\mathbf{G}}\right) \\ &=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\|\mathbf{D}\|_{L} \leq 1}\left(\mathbb{E}_{X \sim \mu}[\mathbf{D}(X)]-\mathbb{E}_{X \sim \mu_{\mathbf{G}}}[\mathbf{D}(X)]\right), \end{aligned} \]

The main issue in this formulation is to optimize the network \(\textbf{D}\) under a constraint on its Lipschitz seminorm:

\[\|\mathbf{D}\|_L\leq 1 \]

Arjovsky et al. (2017) achieve this by clipping \(\mathbf{D}\)’s weights to a small range after every update.

In some way, the Wasserstein GAN trades the difficulty of optimizing the \(\textbf{generator}\) for the difficulty of training the [regularized] discriminator.
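
A minimal training sketch with weight clipping, reusing the toy setup of the example in section 1, but with the final `nn.Sigmoid()` removed from `model_D`, since the critic outputs an unconstrained score (the clipping threshold is an illustrative choice):

clip = 0.05  # illustrative clipping threshold

for e in range(nb_epochs):
    for t, real_batch in enumerate(real_samples.split(batch_size)):
        z = real_batch.new(real_batch.size(0), z_dim).normal_()
        fake_batch = model_G(z)

        if t % 2 == 0:
            # Generator: minimize -E[D(G(Z))]
            loss = - model_D(fake_batch).mean()
            optimizer_G.zero_grad()
            loss.backward()
            optimizer_G.step()
        else:
            # Critic: maximize E[D(X)] - E[D(G(Z))]
            loss = - model_D(real_batch).mean() + model_D(fake_batch).mean()
            optimizer_D.zero_grad()
            loss.backward()
            optimizer_D.step()
            # Crude enforcement of ||D||_L <= 1: clip the weights
            with torch.no_grad():
                for p in model_D.parameters():
                    p.clamp_(-clip, clip)

In the original paper the critic is updated several times per generator update; the strict alternation above is kept only to mirror the earlier example.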

Spectral Normalization

Spectral normalization is a layer re-parameterization that estimates the largest singular value of a weight matrix and rescales the matrix accordingly.

While computing the SVD of a matrix is expensive, computing [a good approximation of] the largest SV (singular value) can be done iteratively for a reasonable cost.

\(\Large\text{Note:}\) Spectral normalization addresses the control of the Lipschitz seminorm in a way which is less brutal than the weight clipping.

The largest singular value of a matrix \(W\) is also its spectral norm:

\[\sigma(W)=\max _{h:\|h\|_{2} \leq 1}\|W h\|_{2} \]

To calculate it, the power iteration method starts with a random vector \(u_0\) and iterates:

\[\begin{aligned} v_{n+1} &=\frac{W^{\top} u_{n}}{\left\|W^{\top} u_{n}\right\|_{2}} \\ u_{n+1} &=\frac{W v_{n+1}}{\left\|W v_{n+1}\right\|_{2}} \end{aligned} \]

that gives:

\[\sigma(W) = \lim_{n\rightarrow\infty} u_n^{\top}Wv_n \]

\(\text{Code:}\)

import torch

W = torch.randn(15, 15)
# Reference value: the largest singular value from a full SVD
print(W.svd().S.max())

u = torch.randn(W.size(0))

# Power iteration: a few steps give a good estimate of sigma(W)
for k in range(10):
    v = W.t() @ u
    v = v / v.norm()
    u = W @ v
    u = u / u.norm()

print(u.t() @ W @ v)

prints:

tensor(7.9129)
tensor(7.9129)

The same can be done in PyTorch with torch.nn.utils.spectral_norm, that wraps any linear layer into a module that performs the normalization.
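
A minimal usage sketch:

import torch
from torch import nn

# Wrap a linear layer: at every forward pass (in training mode), the weight
# is divided by a power-iteration estimate of its largest singular value.
layer = nn.utils.spectral_norm(nn.Linear(10, 20))

x = torch.randn(5, 10)
for _ in range(10):  # each forward pass refines the estimate
    _ = layer(x)

# The effective weight now has spectral norm ~1
print(layer.weight.detach().svd().S.max())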

4. Conditional GAN and image translation

The Conditional GAN proposed by Mirza and Osindero (2014) consists of parameterizing both \(\textbf{G}\) and \(\textbf{D}\) by a conditioning quantity \(Y\):

\[V(\mathbf{D}, \mathbf{G})=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(X, Y)]+\mathbb{E}_{Z \sim \mathcal{N}(0, I), Y \sim \mu_{Y}}[\log (1-\mathbf{D}(\mathbf{G}(Z, Y), Y))] \]
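
A common way to implement this conditioning (an illustrative sketch, not necessarily Mirza and Osindero's exact architecture) is to concatenate \(Y\), e.g. as a one-hot vector, to the inputs of both networks:

import torch
from torch import nn

z_dim, nb_classes, nb_hidden = 8, 10, 100

# G(Z, Y): the noise and the one-hot label are concatenated at the input
model_G = nn.Sequential(nn.Linear(z_dim + nb_classes, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 2))

# D(X, Y): the sample and the one-hot label are concatenated at the input
model_D = nn.Sequential(nn.Linear(2 + nb_classes, nb_hidden),
                        nn.ReLU(),
                        nn.Linear(nb_hidden, 1),
                        nn.Sigmoid())

z = torch.randn(16, z_dim)
y = nn.functional.one_hot(torch.randint(nb_classes, (16,)),
                          nb_classes).float()

fake = model_G(torch.cat((z, y), 1))
scores = model_D(torch.cat((fake, y), 1))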

Image-to-image translation

Isola et al. (2016) use a GAN-like setup to address the “translation” of images with pixel-to-pixel correspondence.

They define:

\[\begin{aligned} V(\mathbf{D}, \mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu}[\log \mathbf{D}(Y, X)]+\mathbb{E}_{Z \sim \mu_{Z}, X \sim \mu_{X}}[\log (1-\mathbf{D}(\mathbf{G}(Z, X), X))] \\ \mathscr{L}_{L^{1}}(\mathbf{G}) &=\mathbb{E}_{(X, Y) \sim \mu, Z \sim \mathcal{N}(0, I)}\left[\|Y-\mathbf{G}(Z, X)\|_{1}\right] \end{aligned} \]

and

\[\mathbf{G}^{*}=\underset{\mathbf{G}}{\operatorname{argmin}} \max _{\mathbf{D}} V(\mathbf{D}, \mathbf{G})+\lambda \mathscr{L}_{L^{1}}(\mathbf{G}) . \]

The term \(\mathscr{L}_{L^{1}}\) pushes toward proper pixel-wise prediction, and \(V\) pushes the generator toward realistic images rather than merely a good pixel-wise fit.
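
In code, the generator's objective combines the adversarial term, here in the non-saturating form from section 1, with the \(L^1\) term. A sketch with hypothetical placeholders (`model_D` takes the translated image concatenated with its source, and `lambda_l1` is an illustrative weight):

import torch
from torch import nn

lambda_l1 = 100.0  # illustrative weight of the L1 term

def generator_loss(model_D, fake, target, source):
    # Adversarial term: push D(G(Z, X), X) toward 1 ("real")
    d_out = model_D(torch.cat((fake, source), 1))
    adv = nn.functional.binary_cross_entropy(d_out, torch.ones_like(d_out))
    # Pixel-wise term: stay close to the ground-truth translation Y
    l1 = (target - fake).abs().mean()
    return adv + lambda_l1 * l1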

The main drawback of this technique is that it requires pairs of samples with pixel-to-pixel correspondence.

We consider \(X\), an r.v. on \(\mathscr{X}\), a sample from the first dataset, and \(Y\), an r.v. on \(\mathscr{Y}\), a sample from the second dataset. Zhu et al. (2017) propose to train two mappings at the same time:

\[\begin{aligned} &\mathbf{G}: \mathscr{X} \rightarrow \mathscr{Y} \\ &\mathbf{F}: \mathscr{Y} \rightarrow \mathscr{X} \end{aligned} \]

such that:

\[\begin{align} \textbf{G}(X)&\sim \mu_Y\\ \textbf{F}(\textbf{G}(X)) &\simeq X \end{align} \]

\(\large\text{Illustration: see }\) Lecture-P25

The loss, optimized alternately, is:

\[\begin{aligned} V^{*}\left(\mathbf{G}, \mathbf{F}, \mathbf{D}_{X}, \mathbf{D}_{Y}\right)=& V\left(\mathbf{G}, \mathbf{D}_{Y}, X, Y\right)+V\left(\mathbf{F}, \mathbf{D}_{X}, Y, X\right) \\ &+\lambda\left(\mathbb{E}\left[\|\mathbf{F}(\mathbf{G}(X))-X\|_{1}\right]+\mathbb{E}\left[\|\mathbf{G}(\mathbf{F}(Y))-Y\|_{1}\right]\right) \end{aligned} \]

where \(V\) is a quadratic loss, instead of the usual \(\log\):

\[V\left(\mathbf{G}, \mathbf{D}_{Y}, X, Y\right)=\mathbb{E}\left[\left(\mathbf{D}_{Y}(Y)-1\right)^{2}\right]+\mathbb{E}\left[\mathbf{D}_{Y}(\mathbf{G}(X))^{2}\right] \]

The loss has four terms (see the sketch after this list):

  • \(V\left(\mathbf{G}, \mathbf{D}_{Y}, X, Y\right)\) estimates how much a signal \(X \sim \mu_{X}\) from \(\mathscr{X}\) mapped to \(\mathscr{Y}\) by \(\mathbf{G}\) looks like a signal from \(\mu_{Y}\),
  • \(V\left(\mathbf{F}, \mathbf{D}_{X}, Y, X\right)\) estimates how much a signal \(Y \sim \mu_{Y}\) brought back to \(\mathscr{X}\) by \(\mathbf{F}\) looks like a signal from \(\mu_{X}\),
  • \(\mathbb{E}\left[\|\mathbf{F}(\mathbf{G}(X))-X\|_{1}\right]\) estimates how well \(\mathbf{F} \circ \mathbf{G}\) keeps an \(X \sim \mu_{X}\) unchanged, and
  • \(\mathbb{E}\left[\|\mathbf{G}(\mathbf{F}(Y))-Y\|_{1}\right]\) estimates how well \(\mathbf{G} \circ \mathbf{F}\) keeps a \(Y \sim \mu_{Y}\) unchanged.
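
A sketch of this objective, assuming `model_G`, `model_F`, `model_DX`, `model_DY` and the batches `x`, `y` are defined elsewhere, with an illustrative cycle weight:

lambda_cyc = 10.0  # illustrative weight of the cycle-consistency terms

def lsgan_V(model_D, real, fake):
    # Quadratic ("least-squares") adversarial loss, as above
    return ((model_D(real) - 1)**2).mean() + (model_D(fake)**2).mean()

def cyclegan_loss(x, y):
    fake_y = model_G(x)  # G : X -> Y
    fake_x = model_F(y)  # F : Y -> X
    v = lsgan_V(model_DY, y, fake_y) + lsgan_V(model_DX, x, fake_x)
    cyc = (model_F(fake_y) - x).abs().mean() \
        + (model_G(fake_x) - y).abs().mean()
    return v + lambda_cyc * cyc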

5. Model persistence and checkpoints

The underlying operation is serialization, that is, the transcription of an arbitrary object into a sequence of bytes [that can be saved to disk].

>>> x = 34
>>> torch.save(x, 'x.pth')
>>> y = torch.load('x.pth')
>>> y
34
>>> z = { 'a': torch.randint(10, (2, 3)), 'b': nn.Linear(10, 20) }
>>> torch.save(z, 'z.pth')
>>> w = torch.load('z.pth')
>>> w
{'a': tensor([[4, 4, 4],
[8, 4, 1]]), 'b': Linear(in_features=10, out_features=20, bias=True)}

One can directly save a full model this way, including arbitrary fields:

>>> x = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1))
>>> x.blah = 14
>>> torch.save(x, 'model.pth')
>>>
>>> z = torch.load('model.pth')
>>> z(torch.randn(2, 3))
tensor([[ 0.0665],
[ 0.2116]])
>>> z.blah
14
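
This makes checkpointing straightforward: before the training loop, restore the latest checkpoint if one exists. A minimal sketch, assuming `model`, `optimizer`, `criterion` and the training tensors are already built, and `checkpoint_name` is the chosen file name:

import os
import torch

nb_epochs = 100  # illustrative
checkpoint_name = 'checkpoint.pth'

nb_epochs_finished = 0
if os.path.isfile(checkpoint_name):
    checkpoint = torch.load(checkpoint_name)
    nb_epochs_finished = checkpoint['nb_epochs_finished']
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])

The training loop then resumes from the first unfinished epoch, and saves a checkpoint at the end of every epoch:
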
for k in range(nb_epochs_finished, nb_epochs):
    acc_loss = 0

    for input, targets in zip(train_input.split(batch_size),
                              train_targets.split(batch_size)):
        output = model(input)
        loss = criterion(output, targets)
        acc_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(k, acc_loss)

    # Save a checkpoint at the end of every epoch
    checkpoint = {
        'nb_epochs_finished': k + 1,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict()
    }

    torch.save(checkpoint, checkpoint_name)
