The Math

diffusiongym is built for reward adaptation of flow and diffusion models. Here, we provide a brief overview of the mathematical framework used in the library.

Flow models

Diffusion and flow matching models (collectively referred to as flow models) construct a process with the same time marginals as a reference flow. As such, simulating the process from \(t=0\) to \(t=1\) transforms samples from \(p_0\) into samples from the target distribution.

Given an initial distribution \(\mathbf{x}_0 \sim p_0\) and samples \(\mathbf{x}_1\) from a target distribution, the reference flow is defined as:

(1)\[\mathbf{x}_t = \alpha_t \mathbf{x}_1 + \beta_t \mathbf{x}_0\]

where \(\alpha_t\) and \(\beta_t\) are scalar functions of time \(t\) satisfying \(\alpha_0 = \beta_1 = 0\) and \(\alpha_1 = \beta_0 = 1\).

Note

In diffusiongym, \(\alpha_t\) and \(\beta_t\) are defined as subclasses of the Scheduler abstract class.
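As an illustration, the reference flow can be sketched in a few lines of NumPy. The linear schedule below (\(\alpha_t = t\), \(\beta_t = 1 - t\)) is a hypothetical choice for the example, not necessarily one of the library's built-in Scheduler subclasses:

```python
import numpy as np

# Hypothetical linear schedule: alpha_t = t and beta_t = 1 - t, which
# satisfies the boundary conditions alpha_0 = beta_1 = 0, alpha_1 = beta_0 = 1.
def alpha(t):
    return t

def beta(t):
    return 1.0 - t

def reference_flow(t, x0, x1):
    """Interpolant between a noise sample x0 and a data sample x1."""
    return alpha(t) * x1 + beta(t) * x0

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)  # sample from the initial distribution p_0
x1 = np.ones(4)              # stand-in for a sample from the target
x_half = reference_flow(0.5, x0, x1)  # midpoint of the flow
```

At \(t=0\) the interpolant returns \(\mathbf{x}_0\) and at \(t=1\) it returns \(\mathbf{x}_1\), matching the boundary conditions above.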

Generally, flow matching trains a velocity field \(v(\mathbf{x}_t, t)\) to match the time derivative of the reference flow. Then, the process is defined as the ordinary differential equation (ODE):

(2)\[\mathrm{d} \mathbf{x}_t = v(\mathbf{x}_t, t)\,\mathrm{d}t\]

However, we can also choose to sample from a family of stochastic differential equations (SDEs) that have the same time marginals:

(3)\[\mathrm{d} \mathbf{x}_t = \left( v(\mathbf{x}_t, t) + \frac{\sigma^2(t)}{2\beta_t \left( \frac{\dot{\alpha}_t}{\alpha_t} \beta_t - \dot{\beta}_t \right)} \left( v(\mathbf{x}_t, t) - \frac{\dot{\alpha}_t}{\alpha_t} \mathbf{x}_t \right) \right)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}B_t\]

where \(\sigma(t)\) is an arbitrary diffusion coefficient and \(B_t\) is standard Brownian motion.
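As a sketch, the SDE drift in (3) can be computed directly from a velocity prediction. The linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) is again an illustrative assumption, and the formula is only evaluated for \(t \in (0, 1)\) since \(\alpha_t\) appears in a denominator:

```python
import numpy as np

def sde_drift(v, x, t, sigma):
    """Drift of the marginal-preserving SDE in (3), for alpha_t = t, beta_t = 1 - t.

    Only valid for t in (0, 1): alpha_t appears in a denominator.
    """
    alpha, beta = t, 1.0 - t
    dalpha, dbeta = 1.0, -1.0        # time derivatives of the schedule
    ratio = dalpha / alpha           # alpha'_t / alpha_t
    denom = 2.0 * beta * (ratio * beta - dbeta)
    return v + (sigma**2 / denom) * (v - ratio * x)

x = np.array([0.5, -0.2])
v = np.array([1.0, 0.3])
drift = sde_drift(v, x, t=0.5, sigma=1.0)
```

Setting \(\sigma = 0\) recovers the ODE in (2): the drift reduces to the velocity field itself.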

From now on, we absorb the drift term in (3) into a single function defined through the base model:

(4)\[\mathrm{d} \mathbf{x}_t = b(\mathbf{x}_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}B_t\]

where \(b(\mathbf{x}_t, t)\) is the drift term. The Environment classes in diffusiongym implement Euler–Maruyama sampling of this SDE, where the drift is defined through a BaseModel.
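A minimal Euler–Maruyama loop for (4) might look as follows; the function names are illustrative and do not mirror the Environment API:

```python
import numpy as np

def euler_maruyama(b, sigma, x0, n_steps=100, seed=0):
    """Simulate dx = b(x, t) dt + sigma(t) dB_t from t = 0 to t = 1."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.asarray(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        dB = np.sqrt(dt) * rng.standard_normal(x.shape)  # Brownian increment
        x = x + b(x, t) * dt + sigma(t) * dB
    return x

# Toy drift pulling the state toward 1, with noise that vanishes at t = 1.
# In practice the drift would come from a trained base model.
x1 = euler_maruyama(lambda x, t: 1.0 - x, lambda t: 0.1 * (1.0 - t), x0=np.zeros(3))
```

Note the \(\sqrt{\mathrm{d}t}\) scaling of the Brownian increment, which distinguishes the SDE step from a plain Euler step for the ODE in (2).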

Note

The base model does not have to output the velocity field \(v(\mathbf{x}_t, t)\). It can also output the marginal noise \(\epsilon(\mathbf{x}_t, t)\) as in diffusion models, the endpoint \(\hat{\mathbf{x}}_1(\mathbf{x}_t, t)\), or the score \(\nabla_x \log p_t(\mathbf{x}_t)\). These parameterizations are all equivalent up to a rescaling [1]. You only need to choose the matching environment: VelocityEnvironment, EpsilonEnvironment, EndpointEnvironment, or ScoreEnvironment.
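To make the equivalence concrete, here is a sketch of the conversions to a velocity for the linear schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) with Gaussian \(p_0\). The schedule and function names are illustrative assumptions, not the library's API:

```python
import numpy as np

# All conversions below assume x_t = alpha_t * x1 + beta_t * eps with
# alpha_t = t, beta_t = 1 - t, and eps drawn from a standard Gaussian p_0.
def velocity_from_endpoint(x1_hat, x, t):
    """v = alpha' * x1_hat + beta' * eps_hat, with eps_hat recovered from x_t."""
    alpha, beta, dalpha, dbeta = t, 1.0 - t, 1.0, -1.0
    eps_hat = (x - alpha * x1_hat) / beta
    return dalpha * x1_hat + dbeta * eps_hat

def velocity_from_epsilon(eps_hat, x, t):
    """Recover the endpoint prediction from the noise prediction, then convert."""
    alpha, beta = t, 1.0 - t
    x1_hat = (x - beta * eps_hat) / alpha
    return velocity_from_endpoint(x1_hat, x, t)

def velocity_from_score(score, x, t):
    """For Gaussian p_0 the score satisfies eps_hat = -beta_t * score."""
    return velocity_from_epsilon(-(1.0 - t) * score, x, t)

# Consistency check on a sample x_t built from known x1 and eps.
t, x1, eps = 0.5, np.array([2.0]), np.array([-1.0])
x = t * x1 + (1.0 - t) * eps
v1 = velocity_from_endpoint(x1, x, t)
v2 = velocity_from_epsilon(eps, x, t)
v3 = velocity_from_score(-eps / (1.0 - t), x, t)
```

All three routes produce the same velocity on a consistently constructed sample, which is what lets the different environment classes share one sampling loop.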

Reward adaptation

To adapt the base model to a task, we introduce a reward function \(r\) that is evaluated at the end of the generative process, i.e., at \(t=1\). The goal of reward adaptation is to modify the drift term in (4) so that samples \(\mathbf{x}_1\) achieve high reward.

Note

In diffusiongym, reward functions are implemented as subclasses of the Reward abstract class.

A common objective is KL-regularized reward maximization:

(5)\[\pi^{\star} \in \arg\max_{\pi} \; \mathbb{E}_{p_1^{\pi}} \left[ r(\mathbf{x}_1) \right] - D_{\mathrm{KL}} \left( p_1^{\pi} \;\middle\|\; p_1 \right)\]

Many works [2] propose an equivalent stochastic optimal control (SOC) formulation in which a control term \(u(\mathbf{x}_t, t)\) is added to the drift:

(6)\[\mathrm{d} \mathbf{x}_t = \left( b(\mathbf{x}_t, t) + \sigma(t) u(\mathbf{x}_t, t) \right)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}B_t\]

The objective is then to minimize the cost functional at every state \((\mathbf{x}_t, t)\):

(7)\[J(u; \mathbf{x}_t, t) = \mathbb{E}_{p^u} \left[ \frac{1}{2} \int_t^1 \| u(\mathbf{x}_s, s) \|^2 \,\mathrm{d}s - r(\mathbf{x}_1) \;\middle|\; \mathbf{x}_t \right]\]
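On a discretized trajectory, the cost functional reduces to a Riemann sum of running control costs minus the terminal reward. A small sketch (a hypothetical helper, not the format returned by Environment.sample):

```python
import numpy as np

def cost_functional(controls, dt, reward):
    """Discretization of (7): 0.5 * sum_s ||u_s||^2 * dt - r(x_1)."""
    control_cost = 0.5 * sum(np.sum(u**2) for u in controls) * dt
    return control_cost - reward

# Controls recorded on a 4-step grid over [0, 1], so dt = 0.25.
controls = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
            np.array([1.0, 1.0]), np.array([0.0, 0.0])]
J = cost_functional(controls, dt=0.25, reward=2.0)
```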

Note

Given a policy, the Environment.sample method simulates the controlled SDE and returns the rewards and cost functionals over the sampled trajectory. It also returns other data such as drifts and noises that may be useful for some algorithms.

Fine-tuning schemes define the control \(u(\mathbf{x}_t, t)\) through the difference between the controlled and uncontrolled base models:

(8)\[u(\mathbf{x}_t, t) = \sigma^{-1}(t) \left( b^{\star}(\mathbf{x}_t, t) - b(\mathbf{x}_t, t) \right)\]

where \(b^{\star}(\mathbf{x}_t, t)\) is the drift of the controlled process.
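In code, (8) is a pointwise difference of drifts scaled by the inverse diffusion coefficient. The sketch below assumes a scalar \(\sigma(t)\):

```python
import numpy as np

def control_from_drifts(b_star, b, sigma):
    """u = sigma^{-1} * (b_star - b), assuming a scalar diffusion coefficient."""
    return (b_star - b) / sigma

# If the fine-tuned drift equals the base drift, the control is zero.
b_base = np.array([0.3, -0.7])
b_star = np.array([1.3, -0.7])
u = control_from_drifts(b_star, b_base, sigma=0.5)
```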

Alternatively, some methods learn an auxiliary value function \(V(\mathbf{x}_t, t)\), defined as the optimal cost-to-go:

(9)\[V(\mathbf{x}_t, t) = \inf_{u} J(u; \mathbf{x}_t, t)\]

The optimal control can then be obtained from the value function:

(10)\[u^{\star}(\mathbf{x}_t, t) = -\sigma^\top(t) \nabla_x V(\mathbf{x}_t, t)\]

This approach is more flexible in its resource requirements, since only the value function is trained while the base model stays frozen, and it does not require the reward function to be differentiable.
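As an illustration of (10), consider a hypothetical quadratic value function \(V(\mathbf{x}, t) = \tfrac{c}{2}\|\mathbf{x}\|^2\), whose gradient is available in closed form; a learned approximator would instead supply \(\nabla_x V\) via automatic differentiation:

```python
import numpy as np

def optimal_control(x, t, sigma, c=1.0):
    """u* = -sigma(t)^T grad_x V(x, t) for V(x, t) = 0.5 * c * ||x||^2."""
    grad_V = c * x            # closed-form gradient of the toy value function
    return -sigma * grad_V    # scalar sigma, so sigma^T reduces to sigma

# The control points down the value landscape (here, toward the origin).
x = np.array([2.0, -1.0])
u = optimal_control(x, t=0.5, sigma=0.3)
```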

Note

To facilitate this, diffusiongym provides the ValuePolicy class, which derives the optimal control from a value-function approximator. It can be used with the Environment class by setting the control_policy property.

Footnotes