Stanford CS236: Deep Generative Models I 2023 I Lecture 3 - Autoregressive Models

Autoregressive Models

  • Autoregressive models are generative models widely used in machine learning, for example in natural language processing.
  • They use the chain rule to factorize a probability distribution over a sequence of random variables (written out as a formula after this list).
  • Autoregressive models can be used for tasks such as text generation, language modeling, and anomaly detection.
  • They work by approximating the true conditional probabilities using neural networks.
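As a concrete statement of the factorization referred to above, here is a minimal sketch in LaTeX; the notation x_{<i} for the variables preceding x_i and the per-conditional parameters theta_i are assumptions for illustration:

```latex
% Chain rule factorization of the joint distribution (exact, holds for any ordering)
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
% Autoregressive model: each conditional is approximated by a neural network
\approx \prod_{i=1}^{n} p_{\theta_i}(x_i \mid x_{<i})
```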

Building Blocks of Autoregressive Models

  • The building block of autoregressive models is a neural network that predicts the conditional distribution of a single variable given a potentially large number of other variables.
  • Logistic regression and neural networks can be used to build autoregressive models.
  • Logistic regression assumes a linear dependency between the input features and the conditional probability, while neural networks can capture more complex, nonlinear dependencies.
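To make the contrast concrete, here is a minimal sketch (NumPy; the function and variable names are illustrative, not from the lecture) of the two kinds of conditionals for a single binary variable given the preceding ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_conditional(x_prev, w, b):
    """p(x_i = 1 | x_<i) under logistic regression:
    a linear function of the preceding variables passed through a sigmoid."""
    return sigmoid(np.dot(w, x_prev) + b)

def mlp_conditional(x_prev, W1, b1, w2, b2):
    """p(x_i = 1 | x_<i) under a one-hidden-layer neural network,
    which can capture nonlinear dependencies on the preceding variables."""
    h = np.tanh(W1 @ x_prev + b1)      # hidden representation of x_<i
    return sigmoid(np.dot(w2, h) + b2)
```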

Autoregressive Models for Image Generation

  • Autoregressive models use the chain rule to predict individual components of a high dimensional output, such as an image, given the previous ones.
  • To use an autoregressive model to define a probability distribution, an ordering of the random variables must be chosen.
  • For an image, a raster scan ordering can be used, where the pixels are ordered from top left to bottom right.
  • The conditional probability of each pixel given the previous pixels can be modeled using a simple logistic regression model.
  • This type of generative model is called a fully visible sigmoid belief network.
  • The joint probability of a data point can be evaluated by multiplying together the predicted probabilities of each pixel given the previous pixels.
  • Autoregressive models allow for efficient sampling of images one pixel at a time.
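A minimal sketch of a fully visible sigmoid belief network over n binary pixels in raster-scan order (NumPy; the parameter shapes and names are assumptions for illustration). It shows both how the joint probability of a data point is evaluated and how an image is sampled one pixel at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 784                                 # e.g. a 28x28 binary image in raster-scan order
# One logistic regression per pixel i, conditioned on pixels 0..i-1.
# W is strictly lower-triangular so pixel i only sees earlier pixels.
W = np.tril(rng.normal(scale=0.01, size=(n, n)), k=-1)
b = np.zeros(n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x):
    """Joint log-probability of a binary image x: the sum of per-pixel
    log p(x_i | x_<i), each given by its own logistic regression."""
    p = sigmoid(W @ x + b)              # p[i] depends only on x[:i] (lower-triangular W)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def sample():
    """Sample an image one pixel at a time, in raster-scan order."""
    x = np.zeros(n)
    for i in range(n):
        p_i = sigmoid(W[i, :i] @ x[:i] + b[i])
        x[i] = rng.binomial(1, p_i)
    return x
```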

Improving Autoregressive Models

  • A simple autoregressive model using logistic regression classifiers does not work well for complex datasets.
  • Replacing the logistic regression classifiers with neural networks that have a single hidden layer improves the model's performance.
  • Tying the weights of the neural network across different pixels reduces the number of parameters and speeds up computation.
  • Tying the weights may also help reduce overfitting and improve generalization.
  • The speaker suggests an alternative parameterization in which the hidden activation for each pixel is built only from the weights attached to the preceding entries, so the computation for one pixel can be reused for the next. This reduces the number of parameters and makes evaluation more efficient (see the sketch after this list).
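A minimal sketch of the tied-weight idea (NADE-style; the names and shapes are illustrative assumptions, not the exact parameterization from the lecture). A shared weight matrix is restricted to the preceding entries, and the pre-activation is updated incrementally as the pixel index grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 784, 50                           # n pixels, d hidden units shared across pixels
W = rng.normal(scale=0.01, size=(d, n))  # shared input-to-hidden weights (tied across i)
c = np.zeros(d)                          # shared hidden bias
V = rng.normal(scale=0.01, size=(n, d))  # per-pixel output weights
b = np.zeros(n)                          # per-pixel output biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditionals(x):
    """Return p(x_i = 1 | x_<i) for all i, reusing computation:
    the pre-activation a_i = c + W[:, :i] @ x[:i] is updated in O(d) per pixel
    because a_i = a_{i-1} + W[:, i-1] * x[i-1]."""
    p = np.zeros(n)
    a = c.copy()                         # running pre-activation, starts at the bias
    for i in range(n):
        h = sigmoid(a)                   # hidden representation of x_<i
        p[i] = sigmoid(V[i] @ h + b[i])
        a += W[:, i] * x[i]              # fold in pixel i for the next conditional
    return p
```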

Autoregressive Models vs. Autoencoders

  • Autoregressive models resemble encoder-decoder models, where the encoder maps the data point to a latent representation and the decoder maps the latent representation back to the original data point.
  • An autoencoder is a model that maps the input data point to a latent representation and then tries to reconstruct the original data point from that latent representation.
  • Autoencoders are typically used for representation learning, to find compressed representations of the data.
  • Autoencoders are not generative models on their own, but a variational autoencoder can be used to generate data by feeding sampled latent vectors (rather than encoder outputs) to the decoder.
  • Variational autoencoders force the latent representations to be distributed according to a simple distribution, such as a Gaussian.
  • Autoregressive models require an ordering to generate data sequentially.
  • To obtain an autoregressive model from an autoencoder, constraints are imposed on the weight matrices of the neural networks to enforce an ordering.
  • Masking the connections in the neural network ensures that the predicted value for each random variable depends only on the past inputs, preventing the model from cheating by looking at future outputs.
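A minimal sketch of masking for a single fully connected layer (NumPy; the mask construction is a simplified, MADE-style assumption rather than the exact scheme from the lecture). The mask zeroes out every weight that would let output i depend on inputs j >= i, which enforces the autoregressive ordering:

```python
import numpy as np

def masked_linear(x, W, b, mask):
    """Apply a linear layer whose connectivity is restricted by a binary mask."""
    return (W * mask) @ x + b

n = 5
# Strictly lower-triangular mask: output i may only see inputs 0..i-1,
# so the prediction for x_i cannot "cheat" by looking at x_i or later inputs.
mask = np.tril(np.ones((n, n)), k=-1)

rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
b = np.zeros(n)
x = rng.binomial(1, 0.5, size=n).astype(float)

logits = masked_linear(x, W, b, mask)   # logits[i] depends only on x[:i]
```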

Recurrent Neural Networks (RNNs) for Autoregressive Modeling

  • Recurrent neural networks (RNNs) offer an alternative approach to autoregressive modeling by recursively updating a summary of the past inputs to predict the next variable.
  • RNNs have a fixed number of learnable parameters regardless of the sequence length, which keeps the model compact even for long sequences.
  • In theory RNNs can represent any computable function, but they are challenging to train in practice, and choosing an ordering remains a problem for autoregressive models.
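A minimal sketch of the RNN approach (NumPy; the architecture and names are illustrative assumptions). A fixed set of parameters recursively updates a hidden summary of the past, and the same parameters are reused at every step regardless of sequence length:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                                   # size of the hidden summary of the past
Wh = rng.normal(scale=0.1, size=(d, d))  # hidden-to-hidden weights (shared across steps)
Wx = rng.normal(scale=0.1, size=(d, 1))  # input-to-hidden weights
wo = rng.normal(scale=0.1, size=d)       # hidden-to-output weights
bh, bo = np.zeros(d), 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x):
    """Log-probability of a binary sequence x: at each step the hidden state
    summarizes x_<t and is used to predict p(x_t = 1 | x_<t)."""
    h = np.zeros(d)
    logp = 0.0
    for x_t in x:
        p_t = sigmoid(wo @ h + bo)                   # prediction from the summary of the past
        logp += x_t * np.log(p_t) + (1 - x_t) * np.log(1 - p_t)
        h = np.tanh(Wh @ h + Wx[:, 0] * x_t + bh)    # fold x_t into the summary
    return logp
```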
