Stanford CS236: Deep Generative Models I 2023 I Lecture 5 - VAEs

Latent Variable Models

  • Latent variable models introduce unobserved random variables (Z) to capture unobserved factors of variation in the data.
  • These models aim to model the joint probability distribution between observed variables (X) and latent variables (Z).
  • Latent variable models offer advantages such as increased flexibility and the ability to extract latent features for representation learning.
  • By conditioning on latent features, modeling the distribution of data points becomes easier as there is less variation to capture.
  • Deep neural networks are used to model the conditional distribution of X given Z: the parameters of that distribution are computed from the latent variables by the networks.
  • The challenge lies in learning the parameters of the neural network since the latent variables are not observed during training.
  • The function of Z is to represent latent variables that affect the distribution of X.
  • In this model, X is not autoregressively generated, and P(x|Z) is a simple Gaussian distribution.
  • The parameters of the Gaussian distribution for P(x|Z) are determined through a potentially complex nonlinear relationship with respect to Z.
  • The individual conditional distributions P(x|Z) are typically modeled with simple distributions like Gaussians, but other distributions can be used as well.
  • The functions used to model P(x|Z) are the same for every Z, but different prior distributions can be used for Z.
  • The motivation for modeling P(x|Z) and P(X) is to make learning easier by clustering the data using the Z variables and to potentially gain insights into the latent factors of variation in the data.
  • The dimensionality of Z (the number of latent variables) is a hyperparameter; it is chosen by the modeler rather than learned during training.
  • Sampling from the model is straightforward: first sample Z from the Gaussian prior, then sample X from the Gaussian whose mean and covariance are predicted by the neural networks (see the sketch after this list).
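A minimal sketch (assumed architecture and sizes, not the lecture's code) of the sampling process described above: draw Z from a standard Gaussian prior, pass it through a neural network that outputs a mean and log-variance, and then sample X from that Gaussian.

```python
# Hypothetical ancestral-sampling sketch for a latent variable model
# p(x, z) = p(z) p(x | z): z ~ N(0, I), x | z ~ N(mu(z), sigma(z)^2).
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784  # illustrative sizes, not from the lecture

# "Decoder" network mapping z to the mean and log-variance of p(x | z).
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 2 * data_dim),
)

def sample_x(num_samples: int) -> torch.Tensor:
    z = torch.randn(num_samples, latent_dim)            # sample z from the prior
    mu, log_var = decoder(z).chunk(2, dim=-1)           # parameters of p(x | z)
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # sample x
```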

Mixture of Gaussians

  • The mixture of Gaussians is a simple example of a latent variable model: Z is a categorical random variable that selects the mixture component, and P(x|Z) is a Gaussian with a different mean and covariance for each component (a small numerical sketch follows this list).
  • Mixture models can be useful for clustering data and can provide a better fit to data compared to a single Gaussian distribution.
  • Unsupervised learning aims to discover meaningful structures in data, but it's not always clear what constitutes a good structure or clustering.
  • Mixture models, such as a mixture of Gaussians, can be used for unsupervised learning by identifying the mixture components and clustering data points accordingly.
  • However, mixture models may not perform well on tasks like image classification unless the number of mixture components is extremely large.
  • An example of using a generative model on MNIST data shows that it can achieve reasonable clustering, but the clustering is not perfect and there are limitations to what can be discovered.
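As a concrete, made-up instance of the mixture model described above: a categorical latent Z selects one of three one-dimensional Gaussian components, and the marginal p(x) is the weighted sum of the component densities over all latent values.

```python
# Illustrative one-dimensional mixture of Gaussians (all numbers made up).
import numpy as np

pi = np.array([0.3, 0.5, 0.2])      # mixing weights p(z = k)
mus = np.array([-2.0, 0.0, 3.0])    # component means
sigmas = np.array([0.5, 1.0, 0.8])  # component standard deviations

def sample(n: int) -> np.ndarray:
    z = np.random.choice(len(pi), size=n, p=pi)   # pick a component (the latent z)
    return np.random.normal(mus[z], sigmas[z])    # sample x from that component

def p_x(x: np.ndarray) -> np.ndarray:
    # Marginal p(x) = sum_k p(z = k) * N(x; mu_k, sigma_k^2)
    dens = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return dens @ pi
```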

Variational Autoencoders (VAEs)

  • Variational autoencoders (VAEs) are a powerful way of combining simple models to create more expressive generative models.
  • VAEs use a continuous latent variable Z, which can take an infinite number of values, instead of a finite number of mixture components.
  • The means and standard deviations of the Gaussian components in a VAE are determined by neural networks, providing more flexibility than a mixture of Gaussians with a lookup table.
  • The sampling process in a VAE is similar to that of a mixture of Gaussians, involving sampling from the latent variable Z and then using neural networks to determine the parameters of the Gaussian distribution from which to sample.
  • In a VAE, Z is continuous, allowing for smooth transitions between clusters, unlike the discrete Z of a mixture of Gaussians.
  • The mean of the latent representation Z can be interpreted as an average representation of all the data points.
  • The prior distribution for Z does not have to be uniform; it can be any simple distribution that allows for efficient sampling.
  • In a Gaussian mixture model (GMM), there is no neural network; a lookup table maps the latent variable Z to the parameters of the corresponding Gaussian (the sketch after this list contrasts the two).
  • The marginal distribution over X in a mixture of Gaussians is obtained by integrating over all possible values of Z.
  • The dimensionality of Z is typically much lower than the dimensionality of X, allowing for dimensionality reduction.
  • It is possible to incorporate more information into the prior distribution by using a more complex model, such as an autoregressive model.
  • The number of components K in a mixture of Gaussians is a modeling choice; in a clustering example like MNIST it can be set to the number of classes.
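To make the lookup-table versus neural-network contrast above concrete, here is a hedged sketch (sizes and architectures are illustrative): a Gaussian mixture model stores one set of Gaussian parameters per discrete value of Z, while a VAE computes the parameters from a continuous Z with a neural network, giving effectively an infinite mixture.

```python
# Hypothetical comparison of how z is mapped to Gaussian parameters.
import torch
import torch.nn as nn

K, latent_dim, data_dim = 10, 2, 784  # illustrative sizes

# GMM: a lookup table -- one (mean, log-variance) row per discrete component z = k.
gmm_means = torch.randn(K, data_dim)
gmm_log_vars = torch.zeros(K, data_dim)

def gmm_params(k: int):
    return gmm_means[k], gmm_log_vars[k]

# VAE: a neural network evaluated at a continuous z (infinitely many "components").
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 2 * data_dim),
)

def vae_params(z: torch.Tensor):
    mu, log_var = decoder(z).chunk(2, dim=-1)
    return mu, log_var
```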

Challenges in Learning Latent Variable Models

  • The challenge in learning mixture models is that the latent variables are missing, requiring marginalization over all possible completions of the data.
  • Evaluating the marginal probability over X requires integrating over all possible values of the latent variables Z, which can be intractable.
  • If the Z variables can only take a finite number of values, the sum can be computed by brute force, but for continuous variables, an integral is required.
  • Gradient computations are also expensive, making direct optimization challenging.
  • One approach is to use Monte Carlo sampling to approximate the sum, by randomly sampling a small number of values for Z and using the sample average as an approximation (see the sketch after this list).
  • Sampling from a uniform distribution over Z works because the sum over latent values can be rewritten as |Z| times an expectation with respect to a uniform distribution, which can then be approximated with a sample average.
  • However, this approach is not ideal because it does not take into account the actual distribution of Z.
  • Uniformly sampling latent variables is not effective: most completions have very low probability under the model, so the resulting estimate has high variance.
  • A smarter way of weighting and selecting latent variables is needed to make the estimate usable.
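For the discrete case described above, the naive Monte Carlo idea can be written directly: rewrite the sum over completions as |Z| times an expectation under a uniform distribution over Z and approximate it with a handful of samples. In this hypothetical sketch, p_xz stands in for the model's joint probability p(x, z).

```python
# Naive Monte Carlo estimate of p(x) = sum_z p(x, z) for discrete z.
import numpy as np

def naive_marginal(p_xz, x, num_z_values: int, num_samples: int = 10) -> float:
    # p(x) = |Z| * E_{z ~ Uniform(Z)}[p(x, z)], approximated by a sample average.
    z_samples = np.random.randint(num_z_values, size=num_samples)
    return num_z_values * float(np.mean([p_xz(x, z) for z in z_samples]))
```

Because most uniformly drawn values of z give a joint probability close to zero, the estimate is dominated by rare lucky draws and has very high variance, which is the problem importance sampling addresses below.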

Latent Variables and Disentangled Representations

  • Latent variables (Z) are not necessarily meaningful features like hair or eye color, but they capture important factors of variation in the data.
  • Z can be a vector of multiple values, allowing for the representation of multiple salient factors of variation.
  • There are challenges in learning disentangled representations where latent variables have clear semantic meanings.
  • If labels are available, semi-supervised learning can be used to steer latent variables towards desired semantic meanings.
  • Importance sampling is used to sample latent variables more efficiently by focusing on the completions that are likely under the model.

Importance Sampling and the Evidence Lower Bound (ELBO)

  • The choice of the proposal distribution Q for importance sampling is crucial and can significantly affect the quality of the estimate.
  • The goal is to estimate the log marginal probability of a data point, which is intractable to compute directly.
  • Importance sampling estimates the marginal probability by sampling Z from a proposal distribution Q(Z|X) and averaging the ratio of the joint probability P(X, Z) to the proposal probability Q(Z|X).
  • This estimate of the marginal probability is unbiased, but its logarithm is not; applying Jensen's inequality to the expected log of the ratio instead yields a lower bound on the log marginal probability.
  • The evidence lower bound (ELBO) is a lower bound on the log marginal probability that can be optimized in place of the intractable log marginal probability (see the sketch after this list).
  • The choice of Q(Z|X) controls how tight the ELBO is, and a good choice of Q(Z|X) can make the ELBO a very good approximation to the true log marginal probability.
  • It is easier to derive a lower bound on the log marginal probability than an upper bound.
  • The gap between the bound and the quantity of interest can be quantified: it is the KL divergence between Q(Z|X) and the true posterior of Z given X.
  • The bound becomes tight when Q is chosen to be the conditional distribution of Z given X under the model.
  • This optimal Q distribution is not easy to evaluate, which is why other methods are needed.
  • The optimal way of inferring the latent variables is to use the true posterior distribution of Z given X under the model.
  • Inverting the neural network to find the likely inputs that would produce a given X is generally hard.
  • The machinery for training a VAE involves optimizing both P and Q.
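A minimal sketch of the two quantities in this section, under assumed inputs: log_p_xz and log_q_z_given_x are hypothetical callables that return the log densities log P(X, Z) and log Q(Z|X) at each sample of Z, and z_samples are drawn from Q(Z|X). The first function returns the log of the (unbiased) importance-sampling estimate of the marginal probability; the second returns the Monte Carlo ELBO, which by Jensen's inequality lower-bounds the log marginal probability.

```python
# Importance-sampling estimate of log p(x) and the ELBO (hypothetical interfaces).
import math
import torch

def log_importance_estimate(log_p_xz, log_q_z_given_x, z_samples) -> torch.Tensor:
    # log p(x) ~= log( (1/K) * sum_k p(x, z_k) / q(z_k | x) ),  z_k ~ q(z | x)
    log_ratios = log_p_xz(z_samples) - log_q_z_given_x(z_samples)
    return torch.logsumexp(log_ratios, dim=0) - math.log(len(z_samples))

def elbo_estimate(log_p_xz, log_q_z_given_x, z_samples) -> torch.Tensor:
    # ELBO = E_{q(z|x)}[ log p(x, z) - log q(z | x) ]  <=  log p(x)
    return (log_p_xz(z_samples) - log_q_z_given_x(z_samples)).mean()
```

In practice, training maximizes the ELBO estimate jointly over the parameters of P (the generative model) and Q (the proposal/inference distribution).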
