Diffusion Models Already Have a Semantic Latent Space

Der Steppenwolf

Kwon, Mingi, Jaeseok Jeong, and Youngjung Uh. “Diffusion Models Already Have A Semantic Latent Space.” In The Eleventh International Conference on Learning Representations (ICLR 2023).

What story does this work tell?

  • [Motivation] Diffusion models lack a semantic latent space, which is essential for controlling the generative process.
  • [Insight] The Denoising Diffusion Implicit Model (DDIM) reconstructs original images nearly perfectly, hence it is suitable for image editing, i.e., rendering target attributes on a real image.
  • [Motivation] Simply editing the latent variables (i.e., intermediate noisy images) leads to distorted images or incorrect manipulations.
  • [Motivation] Shifting the noise predicted by the noise predictor at each sampling step fails to manipulate the generated image.
  • [Motivation] More complicated procedures are required: providing guidance in the reverse process, or finetuning the model for each attribute.
  • [Previous Guidance Methods]
    • Image Guidance: Mixes the latent variables of the guiding image with the unconditional latent variables.
      • [Limitation] It is ambiguous which attributes of the guiding image are reflected.
      • [Limitation] Lacks intuitive control over the magnitude of change.
    • Classifier Guidance: Manipulate images by imposing gradients of a classifier on the latent variables in the reverse process to match the target class.
      • [Limitation] Requires training an extra classifier.
      • [Limitation] Computing gradients through the classifier during sampling is costly.
    • Finetuning the whole model (DiffusionCLIP): Requires multiple models to reflect multiple descriptions.
  • [Insight] Generative Adversarial Networks (GANs) inherently provide straightforward image editing in their latent space.
    • [Limitation] Finding the exact latent vector of a real image is often challenging and produces unexpected appearance changes.
  • [Motivation] If diffusion models, with their nearly perfect inversion property, also had such a semantic latent space, they would enable equally compelling image editing.
  • [Previous Method] Diffusion Autoencoder: introduces a latent embedding of the original image as an additional input to the reverse process.
    • [Limitation] Requires training from scratch and is incompatible with pretrained diffusion models.
  • [Contribution] Asymmetric reverse process (Asyrp), a novel controllable reverse (i.e. backward denoising) process that
    • discovers the semantic latent space (“h-space”) of a frozen diffusion model
    • enables attribute editing of the original image through modification in the latent space
  • [Contribution] h-space, a semantic latent space with the following properties to accommodate semantic image manipulation (see the sketch after this list):
    • homogeneity: The same shift in h-space results in the same attribute change for all images.
    • linearity: Linear changes in h-space lead to linear changes in attributes.
    • compositionality: Adding multiple shifts manipulates the corresponding multiple attributes simultaneously.
    • robustness: The changes do not degrade the quality of the resulting images.
    • consistency across timesteps: For a desired attribute change, the shifts across timesteps are almost identical to each other.
  • [Contribution] a principled design of the generative process that facilitates versatile editing and quality boosting by two quantifiable measures:
    • editing strength of an interval
    • quality deficiency at a timestep
  • [Conclusion] Asyrp is generally applicable to various architectures (DDPM++, iDDPM, ADM) and datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, METFACES).
  • [Conclusion] The discovered h-space effectively accommodates semantic image manipulation.
  • [Conclusion] The principled generative process achieves versatile editing and high quality by measuring the editing strength of an interval and the quality deficiency at a timestep.
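
To make the h-space properties above concrete, here is a minimal sketch of how edits compose. The direction tensors `dh_smile` and `dh_young` and the bottleneck shape are hypothetical stand-ins; in the paper, such directions are obtained by training a small network with a CLIP-based loss.

```python
import torch

# Hypothetical h-space directions for two attributes; the bottleneck
# shape (1, 512, 8, 8) is an assumption for illustration only.
dh_smile = torch.randn(1, 512, 8, 8)
dh_young = torch.randn(1, 512, 8, 8)

# Linearity: scaling a direction scales the strength of the attribute change.
delta_h = 2.0 * dh_smile

# Compositionality: summing directions edits multiple attributes at once.
delta_h = 1.0 * dh_smile + 0.5 * dh_young

# Homogeneity: the same delta_h works for every input image, so it can be
# precomputed once and reused across samples (and, by consistency across
# timesteps, reused throughout the editing interval).
```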

Methodology

Problem Definition

  • Goal: Enable semantic latent manipulation of images generated by a given pretrained and frozen diffusion model.
  • [Insight] Let $\epsilon_t^{\theta}(x_t)$ be the noise predicted at timestep $t$ in the reverse process, and let $\tilde{\epsilon}_t^{\theta}(x_t) = \epsilon_t^{\theta}(x_t) + \Delta\epsilon$ be a shifted prediction. Using $\tilde{\epsilon}_t^{\theta}$ in both terms of the DDIM update barely changes $x_{t-1}$: its effects on the predicted $x_0$ and on the direction pointing to $x_t$ destruct each other in the reverse process.
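
A worked version of this cancellation, sketched from the standard DDIM update ($\eta = 0$; $\alpha_t$ denotes the cumulative noise schedule):

$$
x_{t-1} = \sqrt{\alpha_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_t^{\theta}(x_t)}{\sqrt{\alpha_t}}}_{\mathbf{P}_t:\ \text{predicted } x_0} \;+\; \underbrace{\sqrt{1-\alpha_{t-1}}\,\epsilon_t^{\theta}(x_t)}_{\mathbf{D}_t:\ \text{direction to } x_t}
$$

Replacing $\epsilon_t^{\theta}$ by $\epsilon_t^{\theta} + \Delta\epsilon$ in both terms shifts $x_{t-1}$ by

$$
\Delta x_{t-1} = \left(\sqrt{1-\alpha_{t-1}} - \frac{\sqrt{\alpha_{t-1}}\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\right)\Delta\epsilon \approx 0,
$$

since $\alpha_{t-1} \approx \alpha_t$ for adjacent timesteps. The two contributions nearly cancel, which is why naively shifting the predicted noise fails to edit the image and motivates breaking the symmetry between $\mathbf{P}_t$ and $\mathbf{D}_t$.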

Asymmetric Reverse Process (Asyrp)

  • Formalization of Asyrp: modify the noise prediction only in the $\mathbf{P}_t$ term while keeping $\mathbf{D}_t$ unchanged,

$$
x_{t-1} = \sqrt{\alpha_{t-1}}\,\mathbf{P}_t\big(\epsilon_t^{\theta}(x_t \mid \Delta h_t)\big) + \mathbf{D}_t\big(\epsilon_t^{\theta}(x_t)\big),
$$

where $\epsilon_t^{\theta}(x_t \mid \Delta h_t)$ denotes the frozen U-Net's prediction with the shift $\Delta h_t$ added to its bottleneck feature map $h_t$ (the h-space).
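
A minimal sketch of one Asyrp step under these definitions, assuming a hypothetical `eps_model(x, t, delta_h=...)` wrapper around the frozen U-Net that adds `delta_h` to the bottleneck feature map:

```python
import torch

def asyrp_ddim_step(x_t, t, t_prev, alphas_cumprod, eps_model, delta_h):
    """One DDIM (eta = 0) reverse step with Asyrp's asymmetry: the shifted
    bottleneck feature h_t + delta_h affects only the P_t term (predicted x_0),
    while D_t (direction pointing to x_t) keeps the original noise prediction.
    """
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)

    eps = eps_model(x_t, t)                       # original prediction, for D_t
    eps_mod = eps_model(x_t, t, delta_h=delta_h)  # h-shifted prediction, for P_t

    pred_x0 = (x_t - (1 - a_t).sqrt() * eps_mod) / a_t.sqrt()  # P_t
    dir_xt = (1 - a_prev).sqrt() * eps                         # D_t

    return a_prev.sqrt() * pred_x0 + dir_xt
```

With `delta_h = 0` this reduces exactly to a standard DDIM step, so the frozen model's sampling behavior is preserved outside the editing interval.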