The Math
diffusiongym is built for reward adaptation of flow and diffusion models. Here, we provide a brief overview of the mathematical framework used in the library.
Flow models
The idea of diffusion and flow matching models (collectively referred to as flow models) is to construct a process with the same time marginals as a reference flow (defined below). As such, simulating the process from \(t=0\) to \(t=1\) transforms samples from \(p_0\) into samples from the target distribution.
Given an initial distribution \(\mathbf{x}_0 \sim p_0\) and samples \(\mathbf{x}_1\) from a target distribution, the reference flow is defined as:

\[ \mathbf{x}_t = \alpha_t\, \mathbf{x}_1 + \beta_t\, \mathbf{x}_0, \tag{1} \]

where \(\alpha_t\) and \(\beta_t\) are scalar functions of time \(t\) satisfying \(\alpha_0 = \beta_1 = 0\) and \(\alpha_1 = \beta_0 = 1\).
Note
In diffusiongym, \(\alpha_t\) and \(\beta_t\) are defined as subclasses of the
Scheduler abstract class.
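As a concrete illustration (not the library's actual `Scheduler` API), a minimal linear schedule satisfying these boundary conditions, together with the interpolation it induces, might look like:

```python
import numpy as np

# Hypothetical linear schedule: alpha_t = t, beta_t = 1 - t.
# It satisfies the boundary conditions alpha_0 = beta_1 = 0 and
# alpha_1 = beta_0 = 1; diffusiongym's Scheduler subclasses may
# use different functional forms.
def alpha(t):
    return t

def beta(t):
    return 1.0 - t

def reference_flow(x0, x1, t):
    """Interpolant x_t = alpha_t * x_1 + beta_t * x_0."""
    return alpha(t) * x1 + beta(t) * x0

x0 = np.zeros(2)           # sample from p_0
x1 = np.array([1.0, 2.0])  # sample from the target distribution
xt = reference_flow(x0, x1, 0.5)  # point halfway along the flow
```

At \(t=0\) the interpolant returns \(\mathbf{x}_0\) and at \(t=1\) it returns \(\mathbf{x}_1\), matching the boundary conditions above.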
Generally, flow matching trains a velocity field \(v(\mathbf{x}_t, t)\) to match the time derivative of the reference flow. The generative process is then defined by the ordinary differential equation (ODE):

\[ \mathrm{d}\mathbf{x}_t = v(\mathbf{x}_t, t)\, \mathrm{d}t. \tag{2} \]

However, we can also choose to sample from a family of stochastic differential equations (SDEs) that share the same time marginals:

\[ \mathrm{d}\mathbf{x}_t = \left[ v(\mathbf{x}_t, t) + \frac{\sigma(t)^2}{2}\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t) \right] \mathrm{d}t + \sigma(t)\, \mathrm{d}B_t, \tag{3} \]

where \(\sigma(t)\) is an arbitrary diffusion coefficient and \(B_t\) is standard Brownian motion.
From now on, we will view the drift term in (3) as a single function defined through the base model:

\[ \mathrm{d}\mathbf{x}_t = b(\mathbf{x}_t, t)\, \mathrm{d}t + \sigma(t)\, \mathrm{d}B_t, \tag{4} \]

where \(b(\mathbf{x}_t, t)\) is the drift term. The Environment classes in diffusiongym
implement Euler-Maruyama sampling of this SDE, where the drift is defined through a BaseModel.
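The sampling loop can be sketched generically. This is an illustrative Euler-Maruyama discretization, not diffusiongym's Environment implementation; `drift_fn` and `sigma_fn` are placeholder names for the base-model drift and the chosen diffusion coefficient:

```python
import numpy as np

def euler_maruyama(drift_fn, sigma_fn, x0, n_steps=100, seed=0):
    """Simulate dx_t = b(x_t, t) dt + sigma(t) dB_t from t=0 to t=1.

    drift_fn(x, t) and sigma_fn(t) stand in for the drift defined
    through a base model and the diffusion coefficient, respectively.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        noise = rng.standard_normal(x.shape)
        # Euler-Maruyama update: deterministic drift step plus
        # Brownian increment scaled by sqrt(dt).
        x = x + drift_fn(x, t) * dt + sigma_fn(t) * np.sqrt(dt) * noise
    return x

# Toy drift pulling the state toward 1, with small constant noise.
sample = euler_maruyama(lambda x, t: 1.0 - x, lambda t: 0.1, x0=np.zeros(3))
```

Setting `sigma_fn` to return zero recovers a plain Euler discretization of the ODE in (2).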
Note
The base model does not have to output the velocity field \(v(\mathbf{x}_t, t)\). It can also
output the marginal noise \(\epsilon(\mathbf{x}_t, t)\) as in diffusion models, the endpoint
\(\hat{\mathbf{x}}_1(\mathbf{x}_t, t)\), or the score \(\nabla_x \log p_t(\mathbf{x}_t)\). These
are all equivalent up to a re-scaling [1]. You only need to make sure to choose the correct
environment: VelocityEnvironment, EpsilonEnvironment, EndpointEnvironment, or
ScoreEnvironment.
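For intuition about these re-scalings, here is one such conversion under an assumed linear schedule (\(\alpha_t = t\), \(\beta_t = 1 - t\)) with \(\mathbf{x}_0\) playing the role of the noise; the function below is an illustrative sketch, not a diffusiongym API:

```python
import numpy as np

# Assumed linear schedule: x_t = t * x1 + (1 - t) * x0, whose time
# derivative (the velocity of the reference flow) is x1 - x0.
def endpoint_to_velocity(x_hat_1, x_t, t):
    """Convert an endpoint prediction x_hat_1 into a velocity prediction.

    From x_t = t * x1 + (1 - t) * x0 we recover the noise estimate
    x0_hat = (x_t - t * x_hat_1) / (1 - t); the velocity estimate is
    then x_hat_1 - x0_hat. Valid for t < 1 under the assumed schedule.
    """
    x0_hat = (x_t - t * x_hat_1) / (1.0 - t)
    return x_hat_1 - x0_hat
```

Other schedules give analogous formulas with \(\alpha_t\), \(\beta_t\) and their derivatives in place of the linear coefficients.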
Reward adaptation
To adapt the base model to a task, we introduce a reward function \(r\) that is evaluated at the end of the generative process, i.e., at \(t=1\). The idea of reward adaptation is to adapt the drift term in (4) such that samples \(\mathbf{x}_1\) have high reward.
Note
In diffusiongym, reward functions are implemented as subclasses of the Reward abstract class.
A common objective is KL-regularized reward maximization:

\[ \max_{p^{\star}} \; \mathbb{E}_{p^{\star}}\!\left[ r(\mathbf{x}_1) \right] - \lambda\, D_{\mathrm{KL}}\!\left( p^{\star} \,\|\, p \right), \tag{5} \]

where \(\lambda > 0\) controls the strength of the regularization toward the base process \(p\).

Many works [2] propose an equivalent stochastic optimal control (SOC) formulation in which a control term \(u(\mathbf{x}_t, t)\) is added to the drift:

\[ \mathrm{d}\mathbf{x}_t = \left[ b(\mathbf{x}_t, t) + \sigma(t)\, u(\mathbf{x}_t, t) \right] \mathrm{d}t + \sigma(t)\, \mathrm{d}B_t, \tag{6} \]

and the objective is to minimize the cost functional at every state \((\mathbf{x}_t, t)\):

\[ J(u; \mathbf{x}_t, t) = \mathbb{E}\left[ \int_t^1 \frac{1}{2} \left\| u(\mathbf{x}_s, s) \right\|^2 \mathrm{d}s \;-\; \frac{1}{\lambda}\, r(\mathbf{x}_1) \;\middle|\; \mathbf{x}_t \right]. \tag{7} \]
Note
Given a policy, the Environment.sample method simulates the controlled SDE and returns the
rewards and cost functionals over the sampled trajectory. It also returns other data such as
drifts and noises that may be useful for some algorithms.
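Along a single discretized trajectory, the cost functional reduces to a running control cost plus a terminal reward term. A minimal sketch of that computation, with illustrative names and a fixed regularization strength `lam` (an assumption, not a library parameter):

```python
import numpy as np

def cost_functional(controls, reward, dt, lam=1.0):
    """Estimate the SOC cost along one sampled trajectory.

    controls: array of shape (n_steps, dim) holding the control
              u(x_s, s) evaluated along the path (placeholder input).
    reward:   terminal reward r(x_1) for that path.
    lam:      assumed KL-regularization strength.
    """
    # Riemann-sum approximation of the integral of ||u||^2 / 2.
    running = 0.5 * np.sum(controls ** 2) * dt
    return running - reward / lam
```

Averaging this quantity over many sampled trajectories gives a Monte-Carlo estimate of the expectation in the cost functional.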
Fine-tuning schemes define the control \(u(\mathbf{x}_t, t)\) through the controlled and uncontrolled base models:

\[ u(\mathbf{x}_t, t) = \frac{b^{\star}(\mathbf{x}_t, t) - b(\mathbf{x}_t, t)}{\sigma(t)}, \tag{8} \]

where \(b^{\star}(\mathbf{x}_t, t)\) is the drift of the controlled process.
Alternative methods instead learn an auxiliary value function \(V(\mathbf{x}_t, t)\), defined as the optimal cost-to-go:

\[ V(\mathbf{x}_t, t) = \min_{u}\, J(u; \mathbf{x}_t, t). \tag{9} \]

The optimal control can then be obtained from the value function:

\[ u^{\star}(\mathbf{x}_t, t) = -\sigma(t)\, \nabla_{\mathbf{x}} V(\mathbf{x}_t, t). \tag{10} \]
This method is more flexible in terms of resource requirements and does not require the reward function to be differentiable.
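To make the last relation concrete: given any scalar value-function approximator, the control \(-\sigma(t)\, \nabla_{\mathbf{x}} V\) can be obtained numerically even without automatic differentiation. A sketch using central finite differences; the quadratic value function below is a toy stand-in, not a learned model:

```python
import numpy as np

def control_from_value(value_fn, x, t, sigma_t, eps=1e-5):
    """Compute u(x, t) = -sigma(t) * grad_x V(x, t) via central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        # Central difference approximation of the i-th partial derivative.
        grad[i] = (value_fn(x + e, t) - value_fn(x - e, t)) / (2 * eps)
    return -sigma_t * grad

# Toy value function V(x, t) = ||x||^2 / 2, so grad_x V = x and u = -sigma * x.
u = control_from_value(lambda x, t: 0.5 * np.dot(x, x),
                       np.array([1.0, -2.0]), t=0.5, sigma_t=0.3)
```

In practice the gradient of a neural value-function approximator would be taken with automatic differentiation rather than finite differences.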
Note
To facilitate this, diffusiongym provides the ValuePolicy class that derives the optimal control from a
value-function approximator. This can be used with the Environment class by setting the
control_policy property.
Footnotes