💡 TL;DR: We propose EasyTune, a reinforcement fine-tuning framework for diffusion models that decouples recursive dependencies and enables (1) dense and effective optimization, (2) memory-efficient training, and (3) fine-grained alignment.
Figure 1. Comparison of training costs and generation performance on HumanML3D. (a) Performance comparison of different fine-tuning methods. (b) Generalization performance across six pre-trained diffusion-based models.
In recent years, motion generative models have advanced significantly, yet aligning them with downstream objectives remains challenging. Recent studies have shown that using differentiable rewards to directly align diffusion models with preferences yields promising results. However, these methods suffer from inefficient, coarse-grained optimization and high memory consumption. In this work, we first theoretically identify the fundamental cause of these limitations: the recursive dependence between steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, enabling (1) dense, fine-grained optimization and (2) memory-efficient training. Furthermore, the scarcity of preference motion pairs limits the training of motion reward models. To address this, we introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and performs preference learning on them. Experiments show that, compared with DRaFT-50, EasyTune achieves an 8.91% improvement in the alignment metric (MM-Dist) while incurring only 31.16% of the additional memory overhead.
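The abstract does not spell out how SPL selects its preference pairs, so the snippet below is only a hypothetical PyTorch sketch of what a self-refinement update could look like under assumed details: generated motions are scored by the current reward model, pairs with a confident score gap (above an assumed `margin`) are kept as pseudo-preferences, and the reward model is refined on them with a Bradley-Terry loss. The `reward_model` call signature, the pairing scheme, and `margin` are all illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn.functional as F

def spl_refine_step(reward_model, motions, texts, optimizer, margin=0.1):
    """Hypothetical sketch of a Self-refinement Preference Learning update.

    The current reward model scores a batch of generated motions; pairs whose
    score gap exceeds `margin` are treated as pseudo-preference pairs, and the
    reward model is refined on them with a Bradley-Terry preference loss.
    """
    with torch.no_grad():
        scores = reward_model(motions, texts)               # [B] scalar scores
    half = scores.shape[0] // 2
    idx_a, idx_b = torch.arange(half), torch.arange(half, 2 * half)
    gap = scores[idx_a] - scores[idx_b]
    keep = gap.abs() > margin                               # confident pairs only
    if not keep.any():
        return None                                         # no reliable pairs in this batch
    win = torch.where(gap > 0, idx_a, idx_b)[keep]          # higher-scored sample of each pair
    lose = torch.where(gap > 0, idx_b, idx_a)[keep]         # lower-scored sample of each pair

    r_win = reward_model(motions[win], [texts[i] for i in win.tolist()])
    r_lose = reward_model(motions[lose], [texts[i] for i in lose.tolist()])
    loss = -F.logsigmoid(r_win - r_lose).mean()             # Bradley-Terry objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```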
Figure 2. The framework of existing differentiable reward-based methods (left) and our proposed EasyTune (right). Existing methods backpropagate gradients through the entire denoising process, resulting in excessive memory consumption and inefficient, coarse-grained optimization. EasyTune instead backpropagates gradients directly at each denoising step.
Existing differentiable reward-based methods suffer from inefficient, coarse-grained optimization and high memory consumption. We identify the fundamental cause: the recursive dependence between steps in the denoising trajectory.
EasyTune instead fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing (1) dense, fine-grained optimization and (2) memory-efficient training:
Figure 3. Core insight of EasyTune. By replacing the recursive gradient in Eq. (4) with the formulation in Eq. (7), EasyTune decouples the recursive dependence in the computation graph.
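To make the contrast in Figure 3 concrete, below is a minimal PyTorch-style sketch of the two optimization schemes (illustrative only, not our released implementation). `denoiser`, `reward_model`, `scheduler_step`, `cond`, and `num_steps` are placeholder names, and the reward model is assumed to be able to score noisy intermediate latents. The key difference: the trajectory-level variant backpropagates one terminal reward through every denoising step, while the step-wise variant detaches the incoming latent and backpropagates a dense reward through the current step only.

```python
import torch

def full_trajectory_update(denoiser, reward_model, scheduler_step,
                           x_T, cond, optimizer, num_steps=50):
    """Trajectory-level differentiable-reward fine-tuning (existing approach):
    a single terminal reward is backpropagated through every denoising step,
    so the entire recursive computation graph must be kept in memory."""
    x_t = x_T
    for t in reversed(range(num_steps)):
        eps = denoiser(x_t, t, cond)               # graph grows with each step
        x_t = scheduler_step(x_t, eps, t)          # x_{t-1} depends on x_t (recursive)
    loss = -reward_model(x_t, cond).mean()         # sparse reward at the final step
    optimizer.zero_grad()
    loss.backward()                                # backprop through all steps
    optimizer.step()

def step_wise_update(denoiser, reward_model, scheduler_step,
                     x_T, cond, optimizer, num_steps=50):
    """Step-wise fine-tuning in the spirit of EasyTune (sketch): the incoming
    latent is detached, so each denoising step is optimized with its own dense
    reward and only a one-step computation graph."""
    x_t = x_T
    for t in reversed(range(num_steps)):
        x_in = x_t.detach()                        # decouple the recursive dependence
        eps = denoiser(x_in, t, cond)
        x_prev = scheduler_step(x_in, eps, t)
        loss = -reward_model(x_prev, cond).mean()  # dense, per-step reward signal
        optimizer.zero_grad()
        loss.backward()                            # gradient spans a single step
        optimizer.step()
        x_t = x_prev.detach()                      # continue sampling outside the graph
```

Because the step-wise variant never stores more than one step's activations, its memory footprint stays roughly constant in the number of denoising steps, and every step receives its own reward signal instead of a single terminal one.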
Table 1. Comparison of SoTA fine-tuning methods on HumanML3D dataset. ↑/↓/→ indicate higher/lower/closer-to-real values are better. Bold and underline highlight best and second-best results.
Table 2. Comparison of text-to-motion generation performance on the HumanML3D dataset.
Table 3. Evaluation on Text-Motion Retrieval Benchmark. "Noise" indicates whether the method can handle noisy motion from the denoising process.
Table 4. Performance enhancement of diffusion-based motion generation methods with EasyTune.
Table 5. Comparison of SoTA fine-tuning methods on KIT-ML dataset.
Figure 4. Loss curves for EasyTune and existing fine-tuning methods (Left). Comparison of winning rates (%) for diffusion models fine-tuned with and without SPL (Right).
Figure 5. Visual results on HumanML3D dataset. "w/o EasyTune" refers to motions generated by the original MLD model, while "w/ EasyTune" indicates motions generated by MLD fine-tuned using EasyTune.
Reward hacking is a known challenge in reinforcement learning: continued optimization after convergence can degrade generation quality. It occurs when the model overfits to semantic alignment while neglecting realistic motion dynamics.
As illustrated in the videos below, the model may misinterpret prompts and over-emphasize specific actions. For example, the instruction "lifts their right foot" may result in continuous, excessive lifting. Similarly, a sequence like "squats down, then stands up and moves forward" might be incorrectly generated as "squats down while moving forward."
Fortunately, this phenomenon can be addressed effectively: our method, combined with KL-divergence regularization, robustly mitigates it.
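As a rough illustration of how such a regularizer is typically attached (a sketch under assumed interfaces, not our exact objective), the per-step reward loss can be augmented with a KL-style penalty that keeps the fine-tuned denoiser close to a frozen copy of the pre-trained model. With fixed-variance Gaussian denoising steps, that KL term reduces, up to a timestep-dependent scale, to an MSE between the two noise predictions; `kl_weight` and all function names below are placeholders.

```python
import torch
import torch.nn.functional as F

def regularized_step_loss(denoiser, ref_denoiser, reward_model, scheduler_step,
                          x_t, t, cond, kl_weight=0.01):
    """Sketch of a per-step objective with a KL-style penalty toward a frozen
    reference copy of the pre-trained denoiser, intended to curb reward hacking."""
    x_in = x_t.detach()
    eps = denoiser(x_in, t, cond)                      # trainable model
    with torch.no_grad():
        eps_ref = ref_denoiser(x_in, t, cond)          # frozen pre-trained reference
    x_prev = scheduler_step(x_in, eps, t)

    reward_loss = -reward_model(x_prev, cond).mean()   # alignment term
    kl_loss = F.mse_loss(eps, eps_ref)                 # Gaussian-KL surrogate
    return reward_loss + kl_weight * kl_loss
```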
Generated motion sequences comparing original vs. EasyTune fine-tuned models.
@article{tan2025easytune,
title={EasyTune: Efficient Step-Aware Reinforcement Fine-Tuning for Diffusion-Based Motion Generation},
author={Tan, Xiaofeng and Weng, Wanjiang and Lei, Haodong and Wang, Hongsong},
journal={arXiv preprint},
year={2025}
}