EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation

TL;DR: We propose EasyTune, a fine-tuning framework for diffusion models that decouples recursive dependencies and enables (1) dense and effective optimization, (2) memory-efficient training, and (3) fine-grained alignment.

Performance -- Better Quality, Higher Efficiency


Comparison of the training costs and generation performance on HumanML3D. (a) Performance comparison of different fine-tuning methods. (b) Generalization performance across six pre-trained diffusion-based models.


Abstract

In recent years, motion generative models have advanced significantly, yet they still struggle to align with downstream objectives. Recent studies have shown that using differentiable rewards to directly align diffusion models with preferences yields promising results. However, these methods suffer from inefficient, coarse-grained optimization and high memory consumption. In this work, we first theoretically identify the fundamental reason for these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing us to perform (1) dense and effective, (2) memory-efficient, and (3) fine-grained optimization. Furthermore, the scarcity of paired preference motion data limits the training of motion reward models. To this end, we further introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and conducts preference learning on them. Extensive experiments demonstrate that EasyTune outperforms ReFL by 62.1% in MM-Dist improvement while requiring only 34.5% of its additional memory overhead.
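The abstract only names the SPL mechanism; as a hypothetical illustration (not the paper's implementation), one way to dynamically mine preference pairs is to score a batch of the model's own generations with a reward model and pair high- against low-scoring samples. The function name, `reward_model`, and the pairing rule below are all assumptions for the sketch:

```python
# Hypothetical sketch of dynamic preference-pair mining for SPL-style
# training: score self-generated samples, then pair best vs. worst.
# `reward_model` and the margin-based pairing rule are illustrative
# assumptions, not the authors' implementation.

def mine_preference_pairs(samples, reward_model, margin=0.1):
    """Return (winner, loser) pairs whose reward gap exceeds `margin`."""
    scored = sorted(samples, key=reward_model, reverse=True)
    pairs = []
    # Pair the i-th best against the i-th worst; gaps shrink inward,
    # so we can stop at the first pair below the margin.
    for hi, lo in zip(scored, reversed(scored)):
        if reward_model(hi) - reward_model(lo) > margin:
            pairs.append((hi, lo))
        else:
            break
    return pairs

# Toy usage: "samples" are scalars and the reward is the identity.
pairs = mine_preference_pairs([0.9, 0.1, 0.5, 0.4], reward_model=lambda x: x)
print(pairs)  # [(0.9, 0.1)]
```

The mined pairs could then feed any pairwise preference loss; the margin keeps near-tied generations out of the training signal.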


Framework


The framework of existing differentiable reward-based methods (left) and our proposed EasyTune (right). Existing methods backpropagate the reward model's gradients through the entire denoising process, resulting in (1) excessive memory consumption, (2) inefficient optimization, and (3) coarse-grained updates. In contrast, EasyTune optimizes the diffusion model by backpropagating gradients directly at each denoising step, overcoming these issues.
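As a minimal sketch of the per-step idea under toy assumptions (the single-linear-layer `denoiser` and scalar `reward` below are stand-ins, not the paper's models), detaching the incoming latent at each step confines the computation graph, and hence the backward pass, to one denoising step:

```python
# Minimal sketch of step-wise fine-tuning with a decoupled graph.
# `denoiser` and `reward` are toy stand-ins, not the paper's models.
import torch

torch.manual_seed(0)
denoiser = torch.nn.Linear(4, 4)                 # stand-in denoising network
reward = lambda x: -(x ** 2).sum()               # toy differentiable reward
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-2)

def easytune_step(x_t):
    """Optimize at a single denoising step.

    Detaching x_t cuts the graph from earlier steps, so backward()
    touches only this step's activations (step-wise graph storage).
    """
    x_t = x_t.detach()                           # decouple recursive dependence
    x_prev = denoiser(x_t)                       # one denoising step, with grad
    loss = -reward(x_prev)                       # dense per-step reward signal
    opt.zero_grad()
    loss.backward()
    opt.step()
    return x_prev.detach()                       # pass a leaf to the next step

x = torch.randn(1, 4)
for _ in range(5):                               # short toy trajectory
    x = easytune_step(x)
```

Because each `backward()` spans a single step, peak memory no longer grows with the trajectory length, and every step receives its own reward gradient instead of one coarse trajectory-level signal.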


Core Insight


Core insight of EasyTune. By replacing the recursive gradient in Eq. (4) with the formulation in Eq. (7), EasyTune decouples the recursive dependence of the computation graph, enabling (1) step-wise graph storage, (2) efficient per-step optimization, and (3) fine-grained parameter optimization.

Experiment Results


Comparison of SoTA fine-tuning methods on the HumanML3D dataset. The arrows ↑, ↓, and → indicate that higher, lower, and closer-to-real-motion values are better, respectively. Bold and underline highlight the best and second-best results. Percentages in subscripts indicate improvements.


Comparison of text-to-motion generation performance on the HumanML3D dataset.


Evaluation on Text-Motion Retrieval Benchmark, HumanML3D and KIT-ML. The column “Noise” indicates whether the method can handle noisy motion from the denoising process.


Performance enhancement of diffusion-based motion generation methods with our EasyTune.


Comparison of SoTA fine-tuning methods on the KIT-ML dataset.


Loss curves for EasyTune and existing fine-tuning methods (left). Comparison of winning rates (%) for diffusion models fine-tuned with and without SPL (right). In the left figure, the x-axis represents the number of generated motion batches.

Visual results on HumanML3D dataset


Visual results on HumanML3D dataset. "w/o EasyTune" refers to motions generated by the original MLD model, while "w/ EasyTune" indicates motions generated by the MLD model fine-tuned using our proposed EasyTune.

Visualizations