💡 TL;DR: We propose EasyTune, a reinforcement fine-tuning framework for diffusion models that decouples recursive dependencies and enables (1) dense and effective optimization, (2) memory-efficient training, and (3) fine-grained alignment.
Figure 1. Comparison of training costs and generation performance on HumanML3D. (a) Performance comparison of different fine-tuning methods. (b) Generalization performance across six pre-trained diffusion-based models.
In recent years, motion generative models have advanced significantly, yet aligning them with downstream objectives remains challenging. Recent studies have shown that using differentiable rewards to directly align diffusion models with preferences yields promising results. However, these methods suffer from inefficient, coarse-grained optimization and high memory consumption. In this work, we first theoretically identify the fundamental cause of these limitations: the recursive dependence between steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, enabling (1) dense, fine-grained optimization and (2) memory-efficient training. Furthermore, the scarcity of preference motion pairs limits the training of motion reward models. To address this, we introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and performs preference learning on them. Experiments show that, compared with DRaFT-50, EasyTune achieves an 8.91% improvement in the alignment metric (MM-Dist) while incurring only 31.16% of the additional memory overhead.
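The abstract does not spell out how SPL selects its preference pairs, so the snippet below is only a hypothetical PyTorch sketch of what a self-refinement update could look like under assumed details: generated motions are scored by the current reward model, pairs with a confident score gap (above an assumed `margin`) are kept as pseudo-preferences, and the reward model is refined on them with a Bradley-Terry loss. The `reward_model` call signature, the pairing scheme, and `margin` are all illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn.functional as F

def spl_refine_step(reward_model, motions, texts, optimizer, margin=0.1):
    """Hypothetical sketch of a Self-refinement Preference Learning update.

    The current reward model scores a batch of generated motions; pairs whose
    score gap exceeds `margin` are treated as pseudo-preference pairs, and the
    reward model is refined on them with a Bradley-Terry preference loss.
    """
    with torch.no_grad():
        scores = reward_model(motions, texts)               # [B] scalar scores
    half = scores.shape[0] // 2
    idx_a, idx_b = torch.arange(half), torch.arange(half, 2 * half)
    gap = scores[idx_a] - scores[idx_b]
    keep = gap.abs() > margin                               # confident pairs only
    if not keep.any():
        return None                                         # no reliable pairs in this batch
    win = torch.where(gap > 0, idx_a, idx_b)[keep]          # higher-scored sample of each pair
    lose = torch.where(gap > 0, idx_b, idx_a)[keep]         # lower-scored sample of each pair

    r_win = reward_model(motions[win], [texts[i] for i in win.tolist()])
    r_lose = reward_model(motions[lose], [texts[i] for i in lose.tolist()])
    loss = -F.logsigmoid(r_win - r_lose).mean()             # Bradley-Terry objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```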
Figure 2. The framework of existing differentiable reward-based methods (left) and our proposed EasyTune (right). Existing methods backpropagate gradients through the entire denoising process, resulting in excessive memory consumption and inefficient, coarse-grained optimization. EasyTune instead backpropagates gradients directly at each denoising step.
Existing differentiable reward-based methods suffer from inefficient, coarse-grained optimization and high memory consumption. We identify the fundamental cause: the recursive dependence between steps in the denoising trajectory.
EasyTune instead fine-tunes the diffusion model at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing (1) dense, fine-grained optimization and (2) memory-efficient training:
Figure 3. Core insight of EasyTune. By replacing the recursive gradient in Eq. (4) with the formulation in Eq. (7), EasyTune decouples the recursive dependence in the computation graph.
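To make the contrast in Figure 3 concrete, below is a minimal PyTorch-style sketch of the two optimization schemes (illustrative only, not our released implementation). `denoiser`, `reward_model`, `scheduler_step`, `cond`, and `num_steps` are placeholder names, and the reward model is assumed to be able to score noisy intermediate latents. The key difference: the trajectory-level variant backpropagates one terminal reward through every denoising step, while the step-wise variant detaches the incoming latent and backpropagates a dense reward through the current step only.

```python
import torch

def full_trajectory_update(denoiser, reward_model, scheduler_step,
                           x_T, cond, optimizer, num_steps=50):
    """Trajectory-level differentiable-reward fine-tuning (existing approach):
    a single terminal reward is backpropagated through every denoising step,
    so the entire recursive computation graph must be kept in memory."""
    x_t = x_T
    for t in reversed(range(num_steps)):
        eps = denoiser(x_t, t, cond)               # graph grows with each step
        x_t = scheduler_step(x_t, eps, t)          # x_{t-1} depends on x_t (recursive)
    loss = -reward_model(x_t, cond).mean()         # sparse reward at the final step
    optimizer.zero_grad()
    loss.backward()                                # backprop through all steps
    optimizer.step()

def step_wise_update(denoiser, reward_model, scheduler_step,
                     x_T, cond, optimizer, num_steps=50):
    """Step-wise fine-tuning in the spirit of EasyTune (sketch): the incoming
    latent is detached, so each denoising step is optimized with its own dense
    reward and only a one-step computation graph."""
    x_t = x_T
    for t in reversed(range(num_steps)):
        x_in = x_t.detach()                        # decouple the recursive dependence
        eps = denoiser(x_in, t, cond)
        x_prev = scheduler_step(x_in, eps, t)
        loss = -reward_model(x_prev, cond).mean()  # dense, per-step reward signal
        optimizer.zero_grad()
        loss.backward()                            # gradient spans a single step
        optimizer.step()
        x_t = x_prev.detach()                      # continue sampling outside the graph
```

Because the step-wise variant never stores more than one step's activations, its memory footprint stays roughly constant in the number of denoising steps, and every step receives its own reward signal instead of a single terminal one.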
Table 1. Comparison of SoTA fine-tuning methods on HumanML3D dataset. ↑/↓/→ indicate higher/lower/closer-to-real values are better. Bold and underline highlight best and second-best results.
Table 2. Comparison of text-to-motion generation performance on the HumanML3D dataset.
Table 3. Evaluation on Text-Motion Retrieval Benchmark. "Noise" indicates whether the method can handle noisy motion from the denoising process.
Table 4. Performance enhancement of diffusion-based motion generation methods with EasyTune.
Table 5. Comparison of SoTA fine-tuning methods on KIT-ML dataset.
Figure 4. Loss curves for EasyTune and existing fine-tuning methods (Left). Comparison of winning rates (%) for diffusion models fine-tuned with and without SPL (Right).
Figure 5. Visual results on HumanML3D dataset. "w/o EasyTune" refers to motions generated by the original MLD model, while "w/ EasyTune" indicates motions generated by MLD fine-tuned using EasyTune.
Reward hacking is a known challenge in reinforcement learning: continued optimization after convergence can degrade generation quality. It occurs when the model overfits to semantic alignment while neglecting realistic motion dynamics.
As illustrated in the videos below, the model may misinterpret prompts and over-emphasize specific actions. For example, the instruction "lifts their right foot" may result in continuous, excessive lifting. Similarly, a sequence like "squats down, then stands up and moves forward" might be incorrectly generated as "squats down while moving forward."
Fortunately, this phenomenon can be addressed effectively: our method, combined with KL-divergence regularization, robustly mitigates it.
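As a rough illustration of how such a regularizer is typically attached (a sketch under assumed interfaces, not our exact objective), the per-step reward loss can be augmented with a KL-style penalty that keeps the fine-tuned denoiser close to a frozen copy of the pre-trained model. With fixed-variance Gaussian denoising steps, that KL term reduces, up to a timestep-dependent scale, to an MSE between the two noise predictions; `kl_weight` and all function names below are placeholders.

```python
import torch
import torch.nn.functional as F

def regularized_step_loss(denoiser, ref_denoiser, reward_model, scheduler_step,
                          x_t, t, cond, kl_weight=0.01):
    """Sketch of a per-step objective with a KL-style penalty toward a frozen
    reference copy of the pre-trained denoiser, intended to curb reward hacking."""
    x_in = x_t.detach()
    eps = denoiser(x_in, t, cond)                      # trainable model
    with torch.no_grad():
        eps_ref = ref_denoiser(x_in, t, cond)          # frozen pre-trained reference
    x_prev = scheduler_step(x_in, eps, t)

    reward_loss = -reward_model(x_prev, cond).mean()   # alignment term
    kl_loss = F.mse_loss(eps, eps_ref)                 # Gaussian-KL surrogate
    return reward_loss + kl_weight * kl_loss
```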
Generated motion sequences comparing original vs. EasyTune fine-tuned models.
@article{tan2025easytune,
title={EasyTune: Efficient Step-Aware Reinforcement Fine-Tuning for Diffusion-Based Motion Generation},
author={Tan, Xiaofeng and Weng, Wanjiang and Lei, Haodong and Wang, Hongsong},
journal={arXiv preprint},
year={2025}
}