Diffusion Models Already Have a Semantic Latent Space

Der Steppenwolf

Kwon, Mingi, Jaeseok Jeong, and Youngjung Uh. “Diffusion Models Already Have A Semantic Latent Space.” In The Eleventh International Conference on Learning Representations (ICLR 2023).

What story does this work tell?

  • [Motivation] Diffusion models lack a semantic latent space, which is essential for controlling the generative process.
  • [Insight] The Denoising Diffusion Implicit Model (DDIM) reconstructs original images nearly perfectly, hence it is suitable for image editing, i.e., rendering target attributes on a real image.
  • [Motivation] Simply editing the latent variables (i.e., intermediate noisy images) leads to distorted images or incorrect manipulations.
  • [Motivation] Shifting the noise predicted by the noise predictor at each sampling step fails to manipulate the generated image.
  • [Motivation] More complicated procedures are required: providing guidance in the reverse process, or finetuning the model for each attribute.
  • [Previous Guidance Methods]
    • Image Guidance: Mixes the latent variables of the guiding image with the unconditional latent variables.
      • [Limitation] It is ambiguous which attributes of the guiding image are reflected.
      • [Limitation] Lacks intuitive control over the magnitude of change.
    • Classifier Guidance: Manipulate images by imposing gradients of a classifier on the latent variables in the reverse process to match the target class.
      • [Limitation] Requires training an extra classifier.
      • [Limitation] Computing gradients through the classifier during sampling is costly.
    • Finetuning the whole model (DiffusionCLIP): Requires multiple models to reflect multiple descriptions.
  • [Insight] Generative Adversarial Networks (GANs) inherently provide straightforward image editing in their latent space.
    • [Limitation] Finding the exact latent vector of a real image is often challenging and produces unexpected appearance changes.
  • [Motivation] If diffusion models, with their nearly perfect inversion property, also had such a semantic latent space, they would enable equally compelling image editing.
  • [Previous Method] Diffusion Autoencoder: introduces a latent embedding of the original image as an additional input to the reverse process.
    • [Limitation] Requires training from scratch and is incompatible with pretrained diffusion models.
  • [Contribution] Asymmetric reverse process (Asyrp), a novel controllable reverse (i.e. backward denoising) process that
    • discovers the semantic latent space (“h-space”) of a frozen diffusion model
    • enables attribute editing of the original image through modification in the latent space
  • [Contribution] h-space, a semantic latent space with the following properties to accommodate semantic image manipulation (see the sketch after this list):
    • homogeneity: The same shift in h-space results in the same attribute change for all images.
    • linearity: Linear changes in h-space lead to linear changes in attributes.
    • compositionality: Adding multiple shifts manipulates the corresponding multiple attributes simultaneously.
    • robustness: The changes do not degrade the quality of the resulting images.
    • consistency across timesteps: For a desired attribute change, the shifts across timesteps are almost identical to each other.
  • [Contribution] a principled design of the generative process that facilitates versatile editing and quality boosting by two quantifiable measures:
    • editing strength of an interval
    • quality deficiency at a timestep
  • [Conclusion] Asyrp is generally applicable to various architectures (DDPM++, iDDPM, ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, METFACES).
  • [Conclusion] The discovered h-space effectively accommodates semantic image manipulation.
  • [Conclusion] The principled generative process achieves versatile editing and high quality by measuring the editing strength of an interval and the quality deficiency at a timestep.
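
To make the h-space properties above concrete, here is a minimal sketch of how edits compose. The direction tensors `dh_smile` and `dh_young` and the bottleneck shape are hypothetical stand-ins; in the paper, such directions are obtained by training a small network with a CLIP-based loss.

```python
import torch

# Hypothetical h-space directions for two attributes; the bottleneck
# shape (1, 512, 8, 8) is an assumption for illustration only.
dh_smile = torch.randn(1, 512, 8, 8)
dh_young = torch.randn(1, 512, 8, 8)

# Linearity: scaling a direction scales the strength of the attribute change.
delta_h = 2.0 * dh_smile

# Compositionality: summing directions edits multiple attributes at once.
delta_h = 1.0 * dh_smile + 0.5 * dh_young

# Homogeneity: the same delta_h works for every input image, so it can be
# precomputed once and reused across samples (and, by consistency across
# timesteps, reused throughout the editing interval).
```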

Methodology

Problem Definition

  • Goal: Enable semantic latent manipulation of images generated by a given pretrained and frozen diffusion model.
  • [Insight] Let $\epsilon_t^{\theta}(x_t)$ be the noise predicted at timestep $t$ in the reverse process, and let $\tilde{\epsilon}_t^{\theta}(x_t) = \epsilon_t^{\theta}(x_t) + \Delta\epsilon$ be a shifted prediction. Using $\tilde{\epsilon}_t^{\theta}$ in both terms of the DDIM update barely changes $x_{t-1}$: its effects on the predicted $x_0$ and on the direction pointing to $x_t$ destruct each other in the reverse process.
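
A worked version of this cancellation, sketched from the standard DDIM update ($\eta = 0$; $\alpha_t$ denotes the cumulative noise schedule):

$$
x_{t-1} = \sqrt{\alpha_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_t^{\theta}(x_t)}{\sqrt{\alpha_t}}}_{\mathbf{P}_t:\ \text{predicted } x_0} \;+\; \underbrace{\sqrt{1-\alpha_{t-1}}\,\epsilon_t^{\theta}(x_t)}_{\mathbf{D}_t:\ \text{direction to } x_t}
$$

Replacing $\epsilon_t^{\theta}$ by $\epsilon_t^{\theta} + \Delta\epsilon$ in both terms shifts $x_{t-1}$ by

$$
\Delta x_{t-1} = \left(\sqrt{1-\alpha_{t-1}} - \frac{\sqrt{\alpha_{t-1}}\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\right)\Delta\epsilon \approx 0,
$$

since $\alpha_{t-1} \approx \alpha_t$ for adjacent timesteps. The two contributions nearly cancel, which is why naively shifting the predicted noise fails to edit the image and motivates breaking the symmetry between $\mathbf{P}_t$ and $\mathbf{D}_t$.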

Asymmetric Reverse Process (Asyrp)

  • Formalization of Asyrp: modify the noise prediction only in the $\mathbf{P}_t$ term while keeping $\mathbf{D}_t$ unchanged,

$$
x_{t-1} = \sqrt{\alpha_{t-1}}\,\mathbf{P}_t\big(\epsilon_t^{\theta}(x_t \mid \Delta h_t)\big) + \mathbf{D}_t\big(\epsilon_t^{\theta}(x_t)\big),
$$

where $\epsilon_t^{\theta}(x_t \mid \Delta h_t)$ denotes the frozen U-Net's prediction with the shift $\Delta h_t$ added to its bottleneck feature map $h_t$ (the h-space).
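
A minimal sketch of one Asyrp step under these definitions, assuming a hypothetical `eps_model(x, t, delta_h=...)` wrapper around the frozen U-Net that adds `delta_h` to the bottleneck feature map:

```python
import torch

def asyrp_ddim_step(x_t, t, t_prev, alphas_cumprod, eps_model, delta_h):
    """One DDIM (eta = 0) reverse step with Asyrp's asymmetry: the shifted
    bottleneck feature h_t + delta_h affects only the P_t term (predicted x_0),
    while D_t (direction pointing to x_t) keeps the original noise prediction.
    """
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)                       # original prediction, for D_t
    eps_mod = eps_model(x_t, t, delta_h=delta_h)  # h-shifted prediction, for P_t

    pred_x0 = (x_t - (1 - a_t).sqrt() * eps_mod) / a_t.sqrt()  # P_t
    dir_xt = (1 - a_prev).sqrt() * eps                         # D_t

    return a_prev.sqrt() * pred_x0 + dir_xt
```

With `delta_h = 0` this reduces exactly to a standard DDIM step, so the frozen model's sampling behavior is preserved outside the editing interval.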