Stanford CS236: Deep Generative Models | 2023 | Lecture 13 - Score Based Models

Score-based models

  • Score-based models, also known as diffusion models, are a state-of-the-art class of generative models for continuous data modalities like images, videos, speech, and audio.
  • Unlike likelihood-based models that work with probability density functions, score-based models focus on the gradient of the log density, known as the score function.
  • Score-based models offer an alternative interpretation of probability distributions by representing them as vector fields or gradients, which can be computationally advantageous.
  • Score-based models address the challenge of normalization by modeling the score instead of the density, allowing more flexible parameterizations without strict normalization constraints; a worked example follows this list.
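
To see concretely why the score sidesteps normalization, consider an energy-based model $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$. Its score is

$$
s_\theta(x) \;=\; \nabla_x \log p_\theta(x) \;=\; -\nabla_x E_\theta(x) \;-\; \underbrace{\nabla_x \log Z(\theta)}_{=\,0} \;=\; -\nabla_x E_\theta(x),
$$

so the intractable partition function $Z(\theta)$ drops out entirely.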

Score matching

  • Score matching is a technique used to train energy-based models by fitting the model's score function to match the score function of the data distribution.
  • The Fisher divergence, which measures the discrepancy between two distributions through their score functions, can be rewritten (via integration by parts) into a form that depends only on the model's score, enabling optimization without computing the partition function.
  • Score matching can be applied to a wide range of model families beyond energy-based models, as long as the gradient of the log density with respect to the input can be computed.
  • Score matching directly models the gradients (scores) rather than the likelihood, and it does not involve a normalization constant or latent variables.
  • The term "scores" is used in the literature and for loss functions like the Fisher score, hence the name "score matching."
  • Score matching aims to estimate the gradient of the data distribution to model the data.
  • The Fisher divergence is used to measure the difference between the true and estimated vector fields of gradients.
  • Minimizing the Fisher divergence as a function of theta is a reasonable learning objective; a sketch of the resulting loss appears after this list.
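
Below is a minimal PyTorch sketch of the score matching objective obtained after integration by parts, E_p(x)[ 1/2 ||s_theta(x)||^2 + tr(∇_x s_theta(x)) ]. Here `score_net` is a hypothetical score network (not from the lecture) that maps a batch of points of shape (batch, dim) to estimated scores of the same shape; the explicit loop over coordinates is what makes the exact objective expensive in high dimensions.

```python
import torch

def score_matching_loss(score_net, x):
    # Exact score matching loss: E[ 0.5 * ||s_theta(x)||^2 + tr(grad_x s_theta(x)) ]
    x = x.detach().requires_grad_(True)
    s = score_net(x)                                  # estimated score, shape (batch, dim)
    norm_term = 0.5 * (s ** 2).sum(dim=1)             # 0.5 * squared norm of the model score
    # Trace of the Jacobian of the score, one coordinate at a time:
    # this loop is why exact score matching scales poorly with dimension.
    trace_term = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace_term = trace_term + grad_i[:, i]
    return (norm_term + trace_term).mean()
```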

Denoising score matching

  • Denoising score matching is an approach to address the computational challenges of score matching by estimating the gradient of data perturbed with noise.
  • Denoising score matching is computationally more efficient; the trade-off is that it estimates the score of the noise-perturbed data rather than the clean data, which is acceptable when the noise level is relatively small.
  • The speaker introduces a method to approximate the score of the data density perturbed with noise, denoted q_sigma.
  • This approximation is achieved by replacing the Fisher Divergence between the model and the data with the Fisher Divergence between the model and the noise-perturbed data density.
  • The key idea is that when the noise level Sigma is small, the noise-perturbed data density Q Sigma is close to the original data density, making the estimated scores similar.
  • The resulting algorithm samples data points, adds Gaussian noise, and estimates the denoising score matching loss on the mini-batch; a sketch follows this list.
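
A minimal PyTorch sketch of this procedure for Gaussian noise, assuming a hypothetical score network `score_net` and a fixed noise level `sigma`: for x_tilde = x + sigma * eps, the score of q_sigma(x_tilde | x) is -(x_tilde - x) / sigma^2, so no Jacobian trace is needed.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    # Perturb the data: x_tilde = x + sigma * eps, with eps ~ N(0, I)
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    # Score of the Gaussian perturbation kernel q_sigma(x_tilde | x)
    target = -(x_tilde - x) / sigma**2               # equals -eps / sigma
    s = score_net(x_tilde)                           # model's estimate of grad log q_sigma
    return 0.5 * ((s - target) ** 2).sum(dim=1).mean()
```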

Denoising score matching (continued)

  • Denoising score matching is a standard training technique in generative modeling.
  • It involves adding noise to data points and training a model to estimate the noise.
  • The goal is to minimize the loss between the estimated noise and the actual noise.
  • This approach is scalable and easier to implement compared to directly modeling the distribution of clean data.
  • The denoising score matching objective is equivalent to the original score matching objective up to a constant that does not depend on the model parameters (see the identity after this list).
  • The optimal denoising strategy is to follow the gradient of the log density of the perturbed data.
  • The technique is applicable to various noise distributions as long as the gradient can be computed.
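
The equivalence mentioned above is the standard denoising score matching identity; for a Gaussian perturbation kernel the conditional score has a closed form:

$$
\tfrac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x})}\!\left[\big\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log q_\sigma(\tilde{x})\big\|^2\right]
= \tfrac{1}{2}\,\mathbb{E}_{p(x)}\,\mathbb{E}_{q_\sigma(\tilde{x}\mid x)}\!\left[\big\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x)\big\|^2\right] + \text{const},
\qquad
\nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x) = -\frac{\tilde{x}-x}{\sigma^2}.
$$

Because the constant does not depend on $\theta$, minimizing the right-hand side (which only requires the tractable conditional score) also minimizes the original objective on the left.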

Sliced score matching

  • Random projections can be used to efficiently approximate the original score matching loss.
  • Sliced Fisher Divergence is a variant of the Fisher Divergence that involves Jacobian Vector products, which can be efficiently estimated using backpropagation.
  • The projection operation is a dot product between the data and model gradients projected along a random direction.
  • Biasing the projections towards certain directions does not seem to make a significant difference in practice.
  • Sliced score matching requires a number of backpropagation passes that does not grow with the data dimension, and in practice it performs similarly to exact score matching; a sketch follows this list.
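
A minimal PyTorch sketch of the sliced score matching loss, E_v E_x [ v^T (∇_x s_theta(x)) v + 1/2 (v^T s_theta(x))^2 ], which replaces the Jacobian trace with a Jacobian-vector product along a random direction v, computed with a single extra backpropagation. As before, `score_net` is a hypothetical score network mapping (batch, dim) inputs to scores of the same shape.

```python
import torch

def sliced_score_matching_loss(score_net, x):
    x = x.detach().requires_grad_(True)
    v = torch.randn_like(x)                        # random projection direction per sample
    s = score_net(x)                               # estimated score, shape (batch, dim)
    # One backward pass gives J^T v for each sample; dotting with v yields v^T J v
    sv = (s * v).sum()
    jvp = torch.autograd.grad(sv, x, create_graph=True)[0]
    jacobian_term = (jvp * v).sum(dim=1)           # v^T (grad_x s_theta(x)) v
    norm_term = 0.5 * ((s * v).sum(dim=1) ** 2)    # 0.5 * (v^T s_theta(x))^2
    return (jacobian_term + norm_term).mean()
```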

Inference in diffusion models

  • Sampling from a score-based model can be done by following the gradient of the log probability density, using Markov chain Monte Carlo (MCMC) methods.
  • Langevin dynamics sampling is a method for generating samples from a density using only the estimated gradient (see the sketch after this list).
  • Langevin dynamics sampling is a valid MCMC procedure in the limit of small step sizes and an infinite number of steps.
  • Real-world data tends to lie on low-dimensional manifolds, so the score is poorly estimated in low-density regions, which can cause problems for Langevin dynamics sampling.
  • Diffusion models provide a way to fix this problem by estimating the scores more accurately all over the space, giving Langevin dynamics better guidance.
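
A minimal PyTorch sketch of unadjusted Langevin dynamics, x_{t+1} = x_t + (step/2) * s_theta(x_t) + sqrt(step) * z_t with z_t ~ N(0, I), assuming a hypothetical trained score network `score_net`; the step size and number of steps are illustrative placeholders.

```python
import torch

@torch.no_grad()
def langevin_sample(score_net, shape, n_steps=1000, step=1e-4):
    x = torch.randn(shape)                         # arbitrary initialization
    for _ in range(n_steps):
        z = torch.randn_like(x)
        # Gradient step on the estimated log density plus injected Gaussian noise
        x = x + 0.5 * step * score_net(x) + (step ** 0.5) * z
    return x
```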
