In Lecture 3, we developed training schemes that allow a model to generate samples from the data distribution. However, unconditional generation by itself is of limited use. Instead, we want to condition the generation on some context, such as a class label.

On the MNIST dataset, this would be like asking a model to generate an image of a specific digit, say 3, as opposed to just sampling anything from MNIST.

Our problem is now to learn a guided diffusion model

$$u^\theta_t: \mathbb{R}^d \times \mathcal{Y} \times [0,1] \to \mathbb{R}^d, \quad (x,y,t) \mapsto u_t^\theta(x|y)$$

for the SDE

$$X_0 \sim p_{init}$$

$$\text{d}X_t = u_t^\theta(X_t | y) \text{d}t + \sigma_t \text{d}W_t,$$

where $y \in \mathcal{Y}$ is the context variable, and $\sigma_t = 0$ gives us a special case of a guided flow model.

Flow Model Guidance

To guide flow models, we can expand upon the conditional flow matching objective in a very natural way:

$$ \mathcal{L}_{CFM}^\text{guided}(\theta) = \underset{(z,y)\sim p_{data}(z,y), t \sim U, x \sim p_t(\cdot | z)}{\mathbb{E}} \left[ \| u_t^\theta(x | y) - u_t^\text{target}(x | z) \|^2 \right]. $$
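As a concrete reference point, here is a minimal sketch of this guided objective in PyTorch. The network signature `model(x, y, t)` and the choice of the linear Gaussian path $\alpha_t = t$, $\beta_t = 1 - t$ (for which the conditional target evaluates to $z - \varepsilon$ at $x = t z + (1-t)\varepsilon$) are illustrative assumptions, not prescribed by the lecture.

```python
import torch

def guided_cfm_loss(model, z, y):
    """One guided conditional flow matching step for the linear path
    x_t = t * z + (1 - t) * eps, whose conditional target velocity
    evaluates to z - eps at the sampled point.
    `model(x, y, t)` is an assumed network signature."""
    t = torch.rand(z.shape[0], 1)            # t ~ U[0, 1]
    eps = torch.randn_like(z)
    x = t * z + (1 - t) * eps                # x ~ p_t(. | z)
    target = z - eps                         # u_t^target(x | z)
    pred = model(x, y, t)                    # u_t^theta(x | y)
    return ((pred - target) ** 2).sum(dim=-1).mean()
```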

While this objective does allow us to generate samples from $p_{data}(\cdot | y)$ in theory, empirically it has been found to generate images that don’t match $y$ very well. A simple way to overcome this issue is to artificially amplify the effect of guidance, which is the idea behind classifier-free guidance.

Classifier-Free Guidance

To illustrate classifier-free guidance, we will consider Gaussian probability paths, in which

$$u_t^\text{target}(x | y) = a_t x + b_t \nabla \log p_t(x | y)$$

according to Proposition 1, where

$$(a_t, b_t) = \left( \frac{\dot{\alpha}_t}{\alpha_t}, \frac{\dot{\alpha}_t \beta_t^2 - \dot{\beta}_t \beta_t \alpha_t}{\alpha_t} \right).$$
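As a quick sanity check, consider (as an illustrative assumption) the linear schedule $\alpha_t = t$, $\beta_t = 1 - t$, so that $\dot{\alpha}_t = 1$ and $\dot{\beta}_t = -1$. The coefficients then evaluate to

$$(a_t, b_t) = \left( \frac{1}{t},\ \frac{1 \cdot (1-t)^2 - (-1)(1-t)\, t}{t} \right) = \left( \frac{1}{t},\ \frac{1-t}{t} \right).$$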

Now, note that (by Bayes’ rule) we can decompose the score function as

$$ \nabla \log p_t(x | y) = \nabla \log p_t(x) + \nabla \log p_t(y | x), \qquad (67) $$

since $\nabla_x \log p_t(y) = 0$. This allows us to rewrite the target vector field as

$$ u_t^\text{target}(x | y) = u_t^\text{target}(x) + b_t \nabla \log p_t(y | x), $$

and we can artificially scale the effect of guidance by introducing a guidance scale $w \geq 1$:

$$ \tilde{u}_t^\text{target}(x | y) = u_t^\text{target}(x) + w b_t \nabla \log p_t(y | x). $$

We can make the form of $\tilde{u}_t^\text{target}$ more intuitive by re-using Equation (67):

$$ \begin{align*} \tilde{u}_t^\text{target}(x | y) &= u_t^\text{target}(x) + w b_t \nabla \log p_t(y | x) \\ &= u_t^\text{target}(x) + w b_t \left( \nabla \log p_t(x | y) - \nabla \log p_t(x) \right) \\ &= u_t^\text{target}(x) - (w a_t x + w b_t \nabla \log p_t(x)) + (w a_t x + w b_t \nabla \log p_t(x | y)) \\ &= (1 - w) u_t^\text{target}(x) + w u_t^\text{target}(x | y). \end{align*} $$

With this new formulation, we see that $\tilde{u}_t^\text{target}$ is simply a linear combination of the unguided and guided vector fields. Here, by the unguided vector field we really mean

$$u_t^\text{target}(x) = u_t^\text{target}(x | y = \varnothing),$$

i.e., we treat $\varnothing$ as an additional label value, so that we do not need to learn two separate networks. To account for the possibility of not conditioning on $y$, we rewrite the conditional flow matching loss as

$$ \mathcal{L}_{CFM}^\text{CFG}(\theta) = \mathbb{E}_\square \left[ \| u_t^\theta(x | \tilde{y}) - u_t^\text{target}(x | z) \|^2 \right], $$

$$ \square = (z,y)\sim p_{data}(z,y),\ t \sim U,\ x \sim p_t(\cdot | z),\ \tilde{y} = \varnothing \text{ with probability } \eta, \text{ otherwise } \tilde{y} = y.$$
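In code, this is a small change to the earlier training sketch: before evaluating the network, each label is replaced by an (assumed) null token with probability $\eta$. All names below are again illustrative assumptions.

```python
import torch

def cfg_cfm_loss(model, z, y, null_label, eta=0.1):
    """Guided CFM loss with label dropout (classifier-free guidance
    training). With probability eta the label is replaced by the null
    token, so one network learns both u_t(x | y) and u_t(x | null).
    Network signature and the linear path are assumptions."""
    t = torch.rand(z.shape[0], 1)                 # t ~ U[0, 1]
    eps = torch.randn_like(z)
    x = t * z + (1 - t) * eps                     # x ~ p_t(. | z)
    target = z - eps                              # u_t^target(x | z)
    drop = torch.rand(z.shape[0]) < eta           # labels to drop
    y_tilde = torch.where(drop, torch.full_like(y, null_label), y)
    pred = model(x, y_tilde, t)                   # u_t^theta(x | y_tilde)
    return ((pred - target) ** 2).sum(dim=-1).mean()
```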

At inference time, we sample by simulating the ODE with the classifier-free guided vector field $\tilde{u}_t^\theta(x | y) = (1 - w)\, u_t^\theta(x | \varnothing) + w\, u_t^\theta(x | y)$:

$$ X_0 \sim p_{init}(x) $$

$$ \text{d}X_t = \tilde{u}_t^\theta(X_t | y) \text{d}t. $$
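A minimal Euler sampler for this ODE, assuming the same network interface as above and forming the guided field from one conditional and one unconditional forward pass per step, might look like:

```python
import torch

@torch.no_grad()
def cfg_sample(model, y, null_label, w=3.0, n_steps=100, dim=2):
    """Euler simulation of dX_t = u_tilde dt with the CFG field
    u_tilde = (1 - w) * u(x | null) + w * u(x | y).
    Model signature, null token, and step count are assumptions."""
    x = torch.randn(y.shape[0], dim)              # X_0 ~ p_init = N(0, I)
    dt = 1.0 / n_steps
    y_null = torch.full_like(y, null_label)
    for i in range(n_steps):
        t = torch.full((y.shape[0], 1), i * dt)
        u_cond = model(x, y, t)                   # u_t^theta(x | y)
        u_uncond = model(x, y_null, t)            # u_t^theta(x | null)
        u_tilde = (1 - w) * u_uncond + w * u_cond # guided vector field
        x = x + u_tilde * dt                      # Euler step
    return x
```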

Guidance for Diffusion Models

Similar to flow models, we can define a guided score matching loss as

$$ \mathcal{L}_{SM}^\text{guided}(\theta) = \underset{(z,y), t, x}{\mathbb{E}} \left[ \| s_t^\theta(x | y) - \nabla \log p_t(x | z) \|^2 \right]. $$

We derive the classifier-free guidance score in a similar way, again using Equation (67):

$$ \begin{align*} \tilde{s}_t^\text{target}(x | y) &= \nabla \log p_t(x) + w \nabla \log p_t(y | x) \\ &= \nabla \log p_t(x) + w \left( \nabla \log p_t(x | y) - \nabla \log p_t(x) \right) \\ &= (1 - w) \nabla \log p_t(x) + w \nabla \log p_t(x | y). \end{align*} $$

This gives us the CFG-compatible objective

$$ \mathcal{L}_{SM}^\text{CFG}(\theta) = \mathbb{E}_\square \left[ \| s_t^\theta(x | \tilde{y}) - \nabla \log p_t(x | z) \|^2 \right], $$

$$ \square = (z,y)\sim p_{data}(z,y),\ t \sim U,\ x \sim p_t(\cdot | z),\ \tilde{y} = \varnothing \text{ with probability } \eta, \text{ otherwise } \tilde{y} = y.$$

Finally, we generate $X_1$ by simulating the SDE, where $\tilde{u}_t^\theta$ is the guided vector field from before and $\tilde{s}_t^\theta(x | y) = (1 - w)\, s_t^\theta(x | \varnothing) + w\, s_t^\theta(x | y)$ is the classifier-free guided score:

$$ X_0 \sim p_{init}(x) $$

$$ \text{d}X_t = \left[ \tilde{u}_t^\theta (X_t | y) + \frac{\sigma_t^2}{2} \tilde{s}_t^\theta(X_t | y) \right] \text{d}t + \sigma_t \text{d}W_t. $$
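Putting the pieces together, a minimal Euler-Maruyama sketch of this sampler (assuming a callable noise schedule `sigma(t)` and separate vector-field and score networks; in practice both are often derived from a single network) could look like:

```python
import torch

@torch.no_grad()
def cfg_sde_sample(u_model, s_model, y, null_label, sigma, w=3.0,
                   n_steps=500, dim=2):
    """Euler-Maruyama simulation of
    dX_t = [u_tilde + sigma_t^2 / 2 * s_tilde] dt + sigma_t dW_t,
    where u_tilde and s_tilde are the CFG combinations of conditional
    and unconditional predictions. All names, the noise schedule
    `sigma(t)`, and the use of two networks are assumptions."""
    x = torch.randn(y.shape[0], dim)                   # X_0 ~ p_init
    dt = 1.0 / n_steps
    y_null = torch.full_like(y, null_label)
    for i in range(n_steps):
        t = torch.full((y.shape[0], 1), i * dt)
        u = (1 - w) * u_model(x, y_null, t) + w * u_model(x, y, t)
        s = (1 - w) * s_model(x, y_null, t) + w * s_model(x, y, t)
        sig = sigma(i * dt)
        drift = u + 0.5 * sig ** 2 * s
        noise = torch.randn_like(x)                    # dW_t ~ N(0, dt * I)
        x = x + drift * dt + sig * (dt ** 0.5) * noise
    return x
```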