[Basics] Latent Diffusion Model: High-Resolution Image Synthesis with Latent Diffusion Models

Name

Latent Diffusion Model, High-Resolution Image Synthesis with Latent Diffusion Models
Date: 2021.12
Affiliation: Runway

TL;DR

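This paper introduces latent diffusion models (LDMs), a method for high-resolution image synthesis. An LDM runs the diffusion process in the latent space of a pretrained autoencoder, which makes it possible to train a high-quality image synthesis model with limited compute. The method sets new state-of-the-art scores for image inpainting and class-conditional image synthesis and is highly competitive on text-to-image synthesis, while needing far less computation than pixel-based diffusion models.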

Method

As the architecture figure shows, an LDM has three main parts:

AE(auto-encoder)

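The AE converts between image space and latent space. Moving to latent space has two benefits:
a) The subsequent diffusion is much cheaper to compute, which matters a lot for an operation that iterates over many steps.
b) Latent features carry stronger semantic information, which makes them easier to fuse with features from other modalities (for example, text or an initial image).
In addition, as in a VAE, the latent feature distribution is usually aligned to a standard normal with a KL-divergence term, so that the latent space produced by the AE does not become too spread out (high variance).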
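
To make the two points above concrete, here is a minimal sketch of the size reduction and of the KL term. The downsampling factor 8 and the 4 latent channels are assumed here as the Stable Diffusion v1 setting; the function names are made up for illustration.

```python
import math

def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size (assumed f=8, C=4)."""
    return (latent_channels, height // factor, width // factor)

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, logvar))

c, h, w = latent_shape(512, 512)
pixel_values = 3 * 512 * 512        # values in an RGB image
latent_values = c * h * w           # values the diffusion model actually works on
print((c, h, w), pixel_values // latent_values)        # (4, 64, 64) 48
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0
```

The KL term is zero exactly when the encoder output is already a standard normal and grows as the mean or variance drifts away from it.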

LDM(latent diffusion model)

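This works like DDPM, except that z_t is a latent feature: z_0 is the original feature produced by the AE Encoder, and z_T is pure noise. The LDM noise estimator is a UNet that predicts the noise to remove at each denoising step.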
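
A minimal sketch of the training-side computation on a toy latent vector: noise the clean latent to step t, then score a noise prediction with MSE. The linear beta schedule and its endpoints are assumptions taken from the usual DDPM setup (not necessarily the schedule Stable Diffusion uses), and the all-zeros prediction is a stand-in for an untrained UNet.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(z0, alpha_bar_t, noise):
    """z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * n for z, n in zip(z0, noise)]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

random.seed(0)
alpha_bars = make_alpha_bars()
z0 = [0.5, -1.0, 0.25, 2.0]                  # toy "clean" latent from the encoder
t = random.randrange(len(alpha_bars))        # random training step
noise = [random.gauss(0.0, 1.0) for _ in z0]
z_t = q_sample(z0, alpha_bars[t], noise)     # noised latent fed to the UNet
predicted_noise = [0.0] * len(z0)            # stand-in for an untrained UNet output
print(t, mse(predicted_noise, noise))        # the loss the UNet would be trained to reduce
```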

Conditioning Mechanisms

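The condition can be text, an image, or another modality, but it presumably has to be encoded into a feature space the UNet can consume (for example, with CLIP). Taking text as an example, the prompt goes through a Text Encoder to produce features, which are injected through cross attention into the intermediate-stage features of the noise-estimating UNet at every denoising step.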
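
A minimal sketch of that cross-attention step on plain Python lists: image tokens act as queries, condition tokens as keys and values. The learned Q/K/V projection matrices of a real attention layer are deliberately left out, and all numbers are toy values.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: image tokens; keys/values: condition tokens (lists of vectors)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_v = len(values[0])
    out = []
    for q in queries:
        # similarity of this image token to every condition token
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # weighted sum of condition values
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return out

image_tokens = [[1.0, 0.0], [0.0, 1.0]]              # 2 spatial positions
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 prompt tokens
fused = cross_attention(image_tokens, text_tokens, text_tokens)
print(len(fused), len(fused[0]))                     # one fused vector per image token
```

Each image position ends up with a mixture of condition features, weighted by how well it matches each condition token.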

Inference

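Inference should therefore follow the blue box in the architecture figure: given a text prompt (or another conditioning input) and Gaussian noise in latent space, the LDM denoises step by step into a latent feature with the matching semantics, and the Decoder then generates the image in a single pass.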
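
The loop can be sketched end to end on toy data. Everything here is an assumption for illustration: the update is a deterministic DDIM-style step (the post does not say which sampler is used), the "UNet" is an oracle that already knows the clean latent for each prompt, and `toy_decode` stands in for the VAE decoder.

```python
import math
import random

def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        alpha_bars.append(running)
    return alpha_bars

# Hypothetical clean latents per prompt; a real model learns these implicitly.
TARGETS = {"cat": [1.0, -1.0, 0.5], "dog": [-2.0, 0.0, 2.0]}

def make_oracle_eps_model(alpha_bars):
    """Stand-in for the UNet: returns exactly the noise separating z_t from the target."""
    def eps_model(z_t, t, cond):
        a_t = alpha_bars[t]
        return [(z - math.sqrt(a_t) * z0) / math.sqrt(1.0 - a_t)
                for z, z0 in zip(z_t, TARGETS[cond])]
    return eps_model

def sample_latent(eps_model, cond, alpha_bars, dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # start from Gaussian noise
    for t in reversed(range(len(alpha_bars))):
        eps = eps_model(z, t, cond)                  # conditioned noise estimate
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # estimate the clean latent, then move to the previous noise level
        z0_pred = [(zi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for zi, e in zip(z, eps)]
        z = [math.sqrt(a_prev) * z0 + math.sqrt(1.0 - a_prev) * e for z0, e in zip(z0_pred, eps)]
    return z

def toy_decode(z):
    """Stand-in for the VAE decoder, applied once at the end."""
    return [round(v, 3) for v in z]

alpha_bars = make_alpha_bars()
latent = sample_latent(make_oracle_eps_model(alpha_bars), "cat", alpha_bars, 3, random.Random(0))
print(toy_decode(latent))
```

The point of the sketch is the control flow: many cheap steps in latent space, then a single decoder call.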

CodeReading

Reference config yaml: https://github.com/CompVis/stable-diffusion/blob/main/configs/stable-diffusion/v1-inference.yaml
The core models are LatentDiffusion, DiffusionWrapper, and AutoencoderKL. AutoencoderKL, also called first_stage_model, is really a VAE and was covered in the VQ-GAN post; LatentDiffusion's parent class DDPM was covered in the earlier DDPM post.

LatentDiffusion

The call flow is still training_step -> shared_step -> get_input -> forward; compared with DDPM, get_input and forward change substantially.

    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,
                  cond_key=None, return_original_cond=False, bs=None):
        ...
        # Run the VAE encoder on x to build the posterior distribution
        encoder_posterior = self.encode_first_stage(x)
        # Sample a concrete latent feature from the posterior
        z = self.get_first_stage_encoding(encoder_posterior).detach()

        if self.model.conditioning_key is not None:
                ...
                # After some bookkeeping, the key call is get_learned_conditioning, whose core is
                # self.cond_stage_model.encode(c): the CLIP embedder turns the condition text into condition features
                c = self.get_learned_conditioning(xc)
                ...
        ...
        out = [z, c]
        # With this switch on, the input image and its VAE (first stage model) reconstruction are returned as well (usually off)
        if return_first_stage_outputs:
            xrec = self.decode_first_stage(z)
            out.extend([x, xrec])
        # By default the result is the latent feature sampled from the VAE encoder plus the condition features
        return out
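
A minimal sketch of what "sample a latent from the posterior" amounts to for a diagonal Gaussian, on plain lists. The function names are made up for illustration; the constant 0.18215 is the `scale_factor` value in v1-inference.yaml, which the latent is multiplied by before diffusion.

```python
import math
import random

SCALE_FACTOR = 0.18215  # scale_factor from v1-inference.yaml

def sample_posterior(mean, logvar, rng):
    """Reparameterised sample z = mean + std * eps from a diagonal Gaussian."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]

def first_stage_encoding(mean, logvar, rng, scale_factor=SCALE_FACTOR):
    """Sample from the posterior and rescale, as done before the latent enters diffusion."""
    return [scale_factor * z for z in sample_posterior(mean, logvar, rng)]

rng = random.Random(0)
# toy encoder output: per-dimension mean and log-variance
print(first_stage_encoding([1.0, -2.0, 0.5], [-1.0, -1.0, -1.0], rng))
```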

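The core of forward is a call to p_losses, which is almost identical to DDPM's p_losses (noise the latent feature to its step-t state, then predict the added noise). The difference is that LDM's apply_model additionally takes the condition features. It is defined as follows: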

    def apply_model(self, x_noisy, t, cond, return_ids=False):
        ...
            # v1-inference.yaml sets conditioning_key to crossattn, so key resolves to 'c_crossattn':
            # the cond features are normally fused into the UNet through cross attention
            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
            cond = {key: cond}

        if hasattr(self, "split_input_params"):
            ...  # self has no split_input_params here, so this branch is skipped; it processes large latent features in patches
        else:
            # Given the noised latent feature, the step t, and the cond features, the UNet predicts the added noise
            x_recon = self.model(x_noisy, t, **cond)
        return x_recon
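
To show what the `{key: cond}` dict is for, here is a toy version of the dispatch without any tensors. `wrap_cond` mirrors the wrapping logic in the excerpt; `toy_wrapper` is a hypothetical stand-in for the wrapper around the UNet and only reports how the UNet would be called.

```python
def wrap_cond(cond, conditioning_key):
    """Put the condition into the keyword dict that is splatted into the model call."""
    if isinstance(cond, dict):
        return cond                      # already keyed by the caller
    if not isinstance(cond, list):
        cond = [cond]
    key = "c_concat" if conditioning_key == "concat" else "c_crossattn"
    return {key: cond}

def toy_wrapper(x, t, conditioning_key, c_concat=None, c_crossattn=None):
    """Hypothetical stand-in: describes the UNet call for each conditioning mode."""
    if conditioning_key == "concat":
        # condition is stacked onto the input channels
        return {"unet_input": [x] + c_concat, "context": None}
    if conditioning_key == "crossattn":
        # condition is passed separately and consumed by cross attention
        return {"unet_input": [x], "context": c_crossattn}
    raise ValueError(f"unsupported conditioning_key: {conditioning_key}")

cond = wrap_cond("text_features", "crossattn")
print(cond)                                        # {'c_crossattn': ['text_features']}
print(toy_wrapper("z_t", 10, "crossattn", **cond))
```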

UNetModel

    def forward(self, x, timesteps=None, context=None, y=None,**kwargs):
        ...
        # The timestep first gets a sinusoidal encoding from timestep_embedding, then passes through the Linear layers of time_embed
        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
        emb = self.time_embed(t_emb)
        ...
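        # The UNet is built mainly from ResBlocks, interleaved with either:
        #    AttentionBlock (used when there are no cond features; plain self-attention)
        #    or SpatialTransformer (used when there are cond features and the UNet has use_spatial_transformer=True; image features act as Q and cond features as K/V to fuse in the cond information. This design is more intuitive and cheaper to compute)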
        for module in self.input_blocks:
            h = module(h, emb, context)
            hs.append(h)
        h = self.middle_block(h, emb, context)
        for module in self.output_blocks:
            h = th.cat([h, hs.pop()], dim=1)
            h = module(h, emb, context)
        h = h.type(x.dtype)
        if self.predict_codebook_ids:
            return self.id_predictor(h)
        else:
            return self.out(h)
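
The sinusoidal timestep encoding mentioned in the comments can be sketched for a single scalar step as follows. This is a re-implementation for illustration under the usual cos/sin layout with `max_period=10000`, not the repository function itself, and the name `sinusoidal_embedding` is made up.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000):
    """Encode a scalar timestep as [cos(t * f_0), ..., sin(t * f_0), ...]."""
    half = dim // 2
    # geometrically spaced frequencies from 1 down to about 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    emb = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2:
        emb.append(0.0)   # pad odd dimensions
    return emb

print(sinusoidal_embedding(0, 8))    # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(len(sinusoidal_embedding(500, 320)))
```

Nearby steps get similar vectors and distant steps get clearly different ones, which is what lets the same UNet weights behave differently at each noise level.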

Q&A

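Q: How do the many Stable Diffusion versions differ? Which version does the official training code use?
As the README states, the official release provides pretrained weights v1.1 through v1.4. They differ mainly in training resolution, number of training steps, and whether part of the text conditioning is dropped during training. The official code examples use v1.4 as the pretrained model. The README describes the checkpoints as follows: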

sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
sd-v1-2.ckpt: Resumed from sd-v1-1.ckpt. 515k steps at resolution 512x512 on laion-aesthetics v2 5+ (a subset of laion2B-en with estimated aesthetics score > 5.0, and additionally filtered to images with an original size >= 512x512, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using the LAION-Aesthetics Predictor V2).
sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
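
The "10% dropping of the text-conditioning" in v1-3 and v1-4 exists so that the same model can also make unconditional predictions, which classifier-free guidance then combines at sampling time. A minimal sketch, assuming the empty string as the "dropped" prompt and toy noise vectors in place of real UNet outputs:

```python
import random

def maybe_drop_text(prompt, rng, drop_prob=0.1):
    """Training side: replace the prompt by an empty one with probability drop_prob."""
    return "" if rng.random() < drop_prob else prompt

def guided_eps(eps_uncond, eps_cond, guidance_scale):
    """Sampling side: push the noise estimate away from the unconditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

rng = random.Random(0)
prompts = [maybe_drop_text("a photo of a cat", rng) for _ in range(10)]
print(prompts.count(""), "of 10 prompts dropped in this draw")
print(guided_eps([0.0, 1.0], [1.0, 3.0], 7.5))   # [7.5, 16.0]
```

A guidance scale of 1 reduces to the plain conditional prediction; larger values trade diversity for stronger prompt adherence.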

Experiment

Two metrics for measuring image generation quality:

FID

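Take matching sets of real and generated samples, extract features from both with a CNN, fit a Gaussian to each feature set, and measure the distance between the two distributions. For example, zero-shot FID-30K randomly draws 30k prompts from the validation set, has the model under evaluation generate images from those prompts, extracts features with the CNN, and computes the distance between the Gaussian fitted to those features and the Gaussian fitted to the features of all validation-set images.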
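
A simplified sketch of that distance. The full FID uses full covariance matrices and a matrix square root; to stay dependency-free, this version assumes diagonal covariances, for which the Fréchet distance reduces to a per-dimension formula. The feature vectors are toy values rather than real CNN features.

```python
import math

def mean_and_var(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(dim)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / (n - 1) for j in range(dim)]
    return mean, var

def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between two Gaussians with diagonal covariance (simplification)."""
    mu_a, var_a = mean_and_var(feats_a)
    mu_b, var_b = mean_and_var(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb) for va, vb in zip(var_a, var_b))
    return mean_term + cov_term

real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
fake = [[3.0, 1.0], [5.0, 3.0], [7.0, 5.0]]   # same spread, first dimension shifted by 3
print(frechet_distance_diag(real, fake))
```

Lower is better: the distance is zero when the two feature distributions have the same mean and variance.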

CLIP Score

The condition text and the generated image are fed into a pretrained CLIP model, and the resulting score is computed over the whole test set.
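
A minimal sketch of the scoring step, assuming the image and text embeddings have already been produced by a CLIP model. The scaling by 100 and the clamp at zero follow one common convention and are assumptions here, as are the toy embedding values.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def clip_score(image_embs, text_embs, scale=100.0):
    """Average clamped, scaled cosine similarity over (image, text) pairs."""
    scores = [max(scale * cosine(i, t), 0.0) for i, t in zip(image_embs, text_embs)]
    return sum(scores) / len(scores)

# toy embeddings standing in for CLIP outputs of generated images and their prompts
image_embs = [[1.0, 0.0], [0.6, 0.8]]
text_embs = [[2.0, 0.0], [0.0, 1.0]]
print(clip_score(image_embs, text_embs))
```

Higher is better: the score rises when generated images sit closer to their prompts in CLIP's joint embedding space.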

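The experiments show that LDMs set new best scores on several datasets, including for image inpainting and class-conditional image synthesis. While cutting compute cost, LDMs match or outperform state-of-the-art pixel-based diffusion models on multiple tasks. LDMs also demonstrate high-resolution image synthesis in latent space, which was not feasible with earlier models.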

Summary

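LDMs bring a new perspective to high-resolution image synthesis, especially on making generative models more efficient and flexible. By running the diffusion model in latent space, LDMs keep image quality while sharply reducing the compute required. The success of this approach suggests that combining the strengths of autoencoders and diffusion models lets us handle complex image data more efficiently without sacrificing performance. These properties may motivate further research and applications in image synthesis, data augmentation, latent-space exploration, and related areas.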

posted @ 2024-03-14 21:35 fariver
