Stanford CS236: Deep Generative Models I 2023 I Lecture 17 - Discrete Latent Variable Models

Score-Based Diffusion Models (SDMs)

  • SDMs are closely connected to denoising diffusion probabilistic models (DDPMs).
  • DDPMs can be interpreted as a VAE where the encoder adds noise to the data and the decoder denoises it.
  • Optimizing the evidence lower bound in DDPMs corresponds to learning a sequence of denoisers, similar to noise conditional score models.
  • The score-based SDE formulation generalizes DDPMs to a continuous spectrum of noise levels, enabling more efficient sampling and likelihood evaluation.
  • The process of adding noise is described by a stochastic differential equation (SDE).
  • The drift term of the SDE changes when the direction of time is reversed.
  • The reverse-time SDE has a drift term that involves the score of the corresponding perturbed data density at each time t (see the equations after this list).
  • Both the forward and reverse SDEs describe the same kind of trajectories, and the only difference is the direction of time.
  • Score-based models can be used to learn generative models by estimating score functions using a neural network.
  • Langevin dynamics, a score-based MCMC method, can generate samples from the perturbed data density at a given time (a minimal sampler sketch follows this list).
  • Discretizing the time axis when simulating the reverse SDE introduces numerical errors, which can be reduced by running additional Langevin dynamics (corrector) steps at each noise level.
  • Score-based models can be converted into flow models by removing the noise injected at every step, resulting in an infinitely deep, continuous-time normalizing flow (the probability flow ODE in the equations below).
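
To make the SDE bullets above concrete, the standard continuous-time formulation can be written as follows (standard notation from the score-based SDE literature, with f the drift, g the diffusion coefficient, and w a Wiener process; this is a general summary rather than the lecture's exact notation):

```latex
% Forward (noising) SDE
dx = f(x, t)\,dt + g(t)\,dw

% Reverse-time SDE: the drift involves the score of the perturbed density p_t
dx = \left[\, f(x, t) - g(t)^2 \,\nabla_x \log p_t(x) \,\right] dt + g(t)\,d\bar{w}

% Probability flow ODE: shares the marginals p_t, but is deterministic
\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2 \,\nabla_x \log p_t(x)
```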
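
And a minimal sketch of the Langevin dynamics sampler mentioned above, assuming some score network `score_fn(x, t)` approximating the gradient of log p_t(x); the function name, step size, and PyTorch usage are illustrative assumptions, not code from the lecture:

```python
import torch

def langevin_sample(score_fn, x, t, n_steps=100, step_size=1e-4):
    """Draw approximate samples from the perturbed density p_t with
    (unadjusted) Langevin dynamics, starting from an initial batch x."""
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        # Move along the estimated score, then inject Gaussian noise
        x = x + step_size * score_fn(x, t) + (2 * step_size) ** 0.5 * noise
    return x
```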

Efficient Sampling Techniques

  • The sampling process in SDMs can be reinterpreted as solving an ODE whose dynamics are defined by the score function of the diffusion model (see the solver sketch after this list).
  • This perspective allows for leveraging techniques from numerical analysis and scientific computing to improve sampling efficiency and generate higher-quality samples.
  • Consistency models are neural networks that directly output the solution of the ODE, enabling fast sampling procedures.
  • Parallel-in-time methods can further accelerate the sampling process by leveraging multiple GPUs to compute the solution of the ODE in parallel.
  • Distillation techniques can be used to train student models that can approximate the solution of the ODE in fewer steps, leading to even faster sampling.
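
As a minimal illustration of the ODE view of sampling referenced above, the probability flow ODE can be integrated backward in time with a simple Euler scheme; `score_fn`, `f`, and `g` are placeholder callables for the score network and the SDE coefficients, not the lecture's code:

```python
import torch

def probability_flow_ode_sample(score_fn, f, g, x_T, T=1.0, n_steps=500):
    """Generate samples by integrating dx/dt = f(x,t) - 0.5*g(t)^2*score(x,t)
    from t = T down to t = 0 with the explicit Euler method.  Higher-order
    solvers (Heun, RK45), consistency models, distillation, or parallel-in-time
    methods are ways to reduce or bypass the number of sequential steps."""
    dt = -T / n_steps                 # negative step: integrate backward in time
    x, t = x_T, T
    for _ in range(n_steps):
        drift = f(x, t) - 0.5 * g(t) ** 2 * score_fn(x, t)
        x = x + drift * dt            # Euler update
        t = t + dt
    return x
```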

Stable Diffusion

  • Stable Diffusion uses a latent diffusion model: an extra encoder and decoder (an autoencoder) are wrapped around the diffusion model, so the diffusion process operates in a lower-dimensional latent space rather than in pixel space.
  • Because the latents are much lower-dimensional than the raw images, training and sampling are faster.
  • Stable Diffusion pre-trains the outer autoencoder and then keeps it fixed while training the diffusion model over the latent space (see the pipeline sketch after this list).
  • To incorporate text into the model, a pre-trained language model is used to map the text to a vector representation, which is then fed into the neural network architecture.
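
A rough sketch of the latent diffusion pipeline described above, assuming a pre-trained, frozen autoencoder `vae`, a reverse-diffusion sampler over latents `diffusion_sample`, and a pre-trained `text_encoder`; these interfaces are hypothetical and not Stable Diffusion's actual API:

```python
import torch

@torch.no_grad()
def latent_diffusion_generate(vae, diffusion_sample, text_encoder, prompt, latent_shape):
    """Text-to-image generation with a latent diffusion model:
    encode the prompt, denoise in latent space, decode to pixels."""
    cond = text_encoder(prompt)              # text -> vector representation
    z_T = torch.randn(latent_shape)          # Gaussian noise in the latent space
    z_0 = diffusion_sample(z_T, cond)        # reverse diffusion in latent space
    return vae.decode(z_0)                   # frozen decoder maps latent -> image
```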

Conditional Generation

  • To control the generation process without training a different model, the prior distribution of the generative model is combined with a classifier's likelihood via Bayes' rule to sample from the posterior distribution of images given a specific label.
  • Computing the denominator (normalizing constant) of this posterior is intractable, making it difficult to sample from the posterior directly.
  • Working at the level of scores simplifies the computation of the posterior score, allowing for easy incorporation of pre-trained models and classifiers.
  • By modifying the drift in the SDE or ODE to include the score of the classifier, one can steer the generative process towards images that are consistent with a desired class or caption.
  • Classifier-free guidance avoids training an explicit classifier by combining the scores of a conditional and an unconditional diffusion model (in practice, the same network with and without the conditioning input) and extrapolating along their difference (see the sketches after this list).
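
The reason working at the level of scores sidesteps the intractable denominator is Bayes' rule: taking the gradient of log p(x | y) with respect to x makes the normalizing constant p(y) drop out, leaving

```latex
\nabla_x \log p(x \mid y) = \nabla_x \log p(x) + \nabla_x \log p(y \mid x)
```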
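
A minimal sketch of how the two guidance variants modify the score that is plugged into the reverse SDE/ODE; `score_fn`, `classifier_log_prob`, and the guidance weight `w` are illustrative names and defaults, not the lecture's code:

```python
import torch

def classifier_guided_score(score_fn, classifier_log_prob, x, t, y):
    """Classifier guidance: add the gradient of a classifier's log-likelihood
    log p(y | x_t) to the unconditional score."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(classifier_log_prob(x, t, y).sum(), x)[0]
    return score_fn(x, t) + grad

def classifier_free_guided_score(score_fn, x, t, y, w=7.5):
    """Classifier-free guidance: combine conditional and unconditional scores
    from the same model (y=None denotes the unconditional branch) and
    extrapolate along their difference."""
    s_cond = score_fn(x, t, y)
    s_uncond = score_fn(x, t, None)
    return s_uncond + (1.0 + w) * (s_cond - s_uncond)
```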
