Stanford CS236: Deep Generative Models I 2023 I Lecture 18 - Diffusion Models for Discrete Data

Discrete Generative Models

  • Discrete data is harder to model than continuous data because there is no natural way to interpolate between samples.
  • Applications of discrete generative models include natural language processing, biology, and natural sciences.
  • Existing continuous-space models cannot be easily adapted to discrete data because they rely on calculus (gradients), which is not defined on discrete spaces.
  • Transformers can be used to map discrete input sequences to continuous values, but this is not a requirement for discrete generative models.

Autoregressive Modeling

  • Autoregressive modeling is widely used in natural language processing for tasks like language generation and translation.
  • In autoregressive modeling, the probability of a sequence of tokens is decomposed into a product of conditional probabilities.
  • Autoregressive modeling has advantages such as scalability, the ability to represent any probability distribution over sequences, and a reasonable inductive bias for natural language.
  • However, autoregressive modeling also has drawbacks such as sampling drift, lack of inductive bias for non-language tasks, and computational inefficiency.
  • The key challenge in moving beyond autoregressive modeling for discrete data is ensuring that the probabilities of all possible sequences sum to one, a normalization constraint that is computationally intractable to enforce directly (see the formulation below).
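
As a point of reference (standard notation, not specific to this lecture), the chain-rule factorization and the normalization constraint it sidesteps can be written as:

```latex
% Autoregressive (chain-rule) factorization; each factor is a per-token softmax
% over the vocabulary V, so normalization holds by construction:
p_\theta(x_1, \dots, x_L) = \prod_{i=1}^{L} p_\theta(x_i \mid x_{<i})

% A non-autoregressive parameterization must instead enforce the global constraint
\sum_{x \in V^L} p_\theta(x) = 1,
% a sum over |V|^L sequences, which is intractable to evaluate directly.
```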

Score Matching for Discrete Distributions

  • The lecture explores generalizing score matching techniques to the discrete case as a way to address this challenge.
  • The video discusses a method for extending score matching to discrete spaces using a finite difference approximation.
  • The concrete score at a sequence x is defined as the collection of ratios p(y)/p(x) − 1, where p(y) and p(x) are the probabilities of sequences y and x; subtracting one gives a finite-difference analogue of the continuous score.
  • To make the model computationally feasible, only the ratios between sequences that differ at a single position are modeled, rather than all ratios p(y)/p(x).
  • A sequence-to-sequence model is used to predict these ratios, and it can be implemented as a non-autoregressive Transformer (see the sketch after this list).
  • The score entropy loss function is introduced as a generalization of score matching for discrete scores.
  • Two alternative loss functions are mentioned: implicit score entropy and denoising score entropy.
  • The denoising score entropy loss is derived by writing the perturbed distribution as a convolution of a base distribution with a transition kernel.
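
A minimal sketch of these two ingredients, assuming a PyTorch setting: a non-autoregressive Transformer encoder that outputs, for every position and every vocabulary token, an estimate of the probability ratio for the sequence with that position swapped, together with the per-entry score entropy term (zero exactly when the predicted ratio matches the true one). Class and function names are illustrative, and the weighting and noise conditioning used in the actual method are omitted.

```python
import torch
import torch.nn as nn

class ConcreteScoreModel(nn.Module):
    """Non-autoregressive Transformer that predicts, for each position i and each
    token v, the ratio p(x with position i set to v) / p(x)."""
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # note: no causal mask
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                        # x: (batch, length) token ids
        positions = torch.arange(x.size(1), device=x.device)
        h = self.encoder(self.tok(x) + self.pos(positions))
        return torch.exp(self.head(h))           # exp(.) keeps the ratios strictly positive

def score_entropy(pred_ratio, true_ratio, eps=1e-12):
    """Per-entry score entropy: s - r*log(s) + r*(log(r) - 1), which is >= 0 and
    equals 0 exactly when the predicted ratio s matches the true ratio r."""
    return (pred_ratio
            - true_ratio * torch.log(pred_ratio + eps)
            + true_ratio * (torch.log(true_ratio + eps) - 1)).sum(-1).mean()
```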

Score Entropy Discrete Diffusion (SEDD)

  • The video introduces a new generative modeling technique called Score Entropy Discrete Diffusion (SEDD).
  • SEDD outperforms autoregressive Transformers in terms of generation quality and speed.
  • SEDD allows for controllable generation, including prompting from an arbitrary location.
  • The model can generate coherent text sequences and can be used to infill text between given prompt tokens (see the sketch below).
  • SEDD is particularly effective for long sequence generation and can achieve high-quality results with fewer sampling steps compared to autoregressive Transformers.
  • The model size of SEDD is relatively small compared to autoregressive Transformers, making it more efficient for large-scale language generation tasks.
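
One simple way such prompting and infilling can be realized with a parallel denoiser, sketched below under assumptions: the known prompt tokens are clamped back to their given values after every reverse step, while all other positions are denoised freely. The helper reverse_step is hypothetical and stands in for one reverse-diffusion update; this is an illustration, not the lecture's exact procedure.

```python
import torch

def infill(model, x_T, prompt_ids, prompt_positions, n_steps, reverse_step):
    """Sketch of infilling/prompting from arbitrary positions with a discrete
    diffusion sampler; the prompt tokens stay fixed, everything else is generated."""
    x = x_T.clone()                              # fully-noised starting sequence
    for t in reversed(range(n_steps)):
        x = reverse_step(model, x, t)            # denoise the whole sequence in parallel
        x[prompt_positions] = prompt_ids         # clamp the conditioning tokens
    return x
```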

Score-Based Diffusion Models for Discrete Data

  • Score-based diffusion models can be extended to discrete spaces by modeling the ratios of the data distribution, known as concrete scores.
  • A new score matching loss called score entropy is introduced, which can be optimized via its denoising or implicit variants (denoising score entropy and implicit score entropy).
  • Sampling from the score-based model can be done using a forward and reverse diffusion process, which synergizes with the score entropy loss.
  • The generation quality of score-based diffusion models can surpass that of autoregressive models, in part because the whole sequence is generated in parallel rather than token by token.
  • A likelihood bound based on score entropy is proposed, which aligns with the score entropy loss and allows for comparison with autoregressive models.
  • Score-based diffusion models challenge the dominance of autoregressive modeling on large-scale sequence generation tasks.
  • The proposed method outperforms previous continuous diffusion models in terms of likelihood and generation quality.
  • A discretization scheme called tau-leaping is used to efficiently generate sequences from the score-based diffusion model (see the sampling sketch below).
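
A minimal sketch of tau-leaping for the reverse process, under assumptions: the score model is noise-conditioned, the forward process perturbs one position at a time through a transition-rate matrix Q, and the reverse rate toward token y at position i is proportional to Q[y, x_i] times the predicted ratio. Within each step of size tau, every position is updated simultaneously rather than simulating one jump at a time; exact schedules and weightings follow the paper, and the model interface is assumed.

```python
import torch

def tau_leaping_sample(model, x, sigmas, Q):
    """x: (length,) current token ids (start from the fully-noised prior);
    sigmas: decreasing noise levels; Q: (vocab, vocab) forward rate matrix."""
    for t in range(len(sigmas) - 1):
        tau = sigmas[t] - sigmas[t + 1]
        scores = model(x.unsqueeze(0), sigmas[t]).squeeze(0)  # (length, vocab) ratio estimates
        rates = Q[:, x].T * scores                            # reverse rates toward each token
        probs = tau * rates
        probs.scatter_(1, x.unsqueeze(1), 0.0)                # remove self-transition mass
        stay = (1.0 - probs.sum(-1, keepdim=True)).clamp_min(0.0)
        probs.scatter_(1, x.unsqueeze(1), stay)               # leftover mass = stay put
        x = torch.multinomial(probs, 1).squeeze(1)            # update every position in parallel
    return x
```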

Comparison with Autoregressive Models

  • Both score-based diffusion models and autoregressive models can learn any probability distribution over a discrete space, but score-based diffusion models may have a better inductive bias and be more amenable to optimization.

Non-Causal Transformer for Discrete Diffusion

  • The video discusses a method for training the diffusion model using a sequence-to-sequence neural network (a Transformer) with a non-causal attention mask.
  • The architecture uses full bidirectional attention, where every position can attend to every other position, similar to BERT.
  • The model is evaluated using generative perplexity and PET distance metric, showing improvements over previous methods.
  • The challenge lies in efficiently computing the matrix exponential of a large Q matrix, which can be computationally expensive (see the note on structured Q matrices below).
  • Experimenting with more complex Q matrices did not yield better results due to fundamental architectural differences.
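
As a point of reference (a standard linear-algebra fact, not a claim about the lecture's exact implementation): for the simple uniform transition-rate matrix Q = 11ᵀ/N − I over a vocabulary of size N, the matrix exponential has a closed form, which is one reason simple structured Q matrices sidestep the cost of a general matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def uniform_Q_exp(t, N):
    """Closed-form exp(t*Q) for the uniform rate matrix Q = (1/N)*ones - I.
    Since P = (1/N)*ones is a projection (P @ P == P), we get
    exp(t*Q) = exp(-t)*I + (1 - exp(-t))*P, with no numerical matrix exponential."""
    P = np.full((N, N), 1.0 / N)
    return np.exp(-t) * np.eye(N) + (1.0 - np.exp(-t)) * P

# Quick check against a direct matrix exponential on a tiny vocabulary.
N, t = 5, 0.7
Q = np.full((N, N), 1.0 / N) - np.eye(N)
assert np.allclose(uniform_Q_exp(t, N), expm(t * Q))
```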
