ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning

Xiaofeng Tan1,3,†,*, Jun Liu3,†, Yuanting Fan3, Bin-Bin Gao3, Xi Jiang2, Xiaochen Chen3, Jinlong Peng3, Chengjie Wang3, Hongsong Wang1,‡, Feng Zheng2,‡
1Southeast University  2Southern University of Science and Technology  3Tencent Youtu Lab
†Equal contribution.  ‡Corresponding authors.  *Work done during Xiaofeng Tan's internship at Tencent Youtu Lab.

Research Question

Why do visual hallucinations arise in reinforcement fine-tuning, and how can they be reduced?

TL;DR: This work analyzes the issue from two perspectives—limited exploration and trajectory imitation—and proposes Dynamic Granularity Rollout (DGR) and Consistent Policy Gradient Optimization (CPGO) to address them.

Abstract

Reinforcement Fine-Tuning (RFT) of flow-based models is crucial for preference alignment. However, existing RFT methods often introduce visual hallucinations such as over-optimized details and semantic misalignment. This work is a preliminary exploration of why these hallucinations arise and how to reduce them. We first examine RFT methods from a unified perspective and trace the core problems to two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, which leads to an over-emphasis on local details at the expense of global semantics, and (2) the trajectory-imitation process inherent in policy gradient methods, which distorts the model's foundational vector field and its cross-step consistency. Building on this analysis, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism that balances exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce Consistent Policy Gradient Optimization (CPGO), which preserves the model's consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49% for low-level and 38% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, improving over FLUX1.dev by 5.1%, whereas the baseline degrades by 0.4%.

Visualization


Motivation


Why do visual hallucinations arise? We investigate from two perspectives: (1) Limited Exploration Domain during SDE rollouts: existing methods explore only through SDE process noise, so optimization stays fine-grained and overemphasizes local details while neglecting global semantics. (2) Trajectory Imitation Optimization: policy gradient methods inadvertently push the model to imitate stochastic SDE trajectories, disrupting the velocity-field consistency that flow models rely on.
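To make the first point concrete, below is a minimal PyTorch sketch of one Euler-Maruyama step of a stochastic flow sampler. It is not the exact sampler used in the paper: the function name, the simplified drift (the learned velocity field alone, without the score-based correction some SDE samplers add), and the constant noise scale are assumptions for illustration. Once the initial noise is fixed within a rollout group, the per-step process noise drawn here is the only remaining source of exploration.

import torch

def sde_rollout_step(model, x_t, t, dt, sigma):
    # Simplified Euler-Maruyama step (illustrative, not the paper's exact sampler).
    # `model(x_t, t)` is assumed to return the learned velocity field v_theta(x_t, t).
    v = model(x_t, t)              # deterministic drift from the velocity field
    noise = torch.randn_like(x_t)  # per-step process noise: the only stochasticity
                                   # left once the initial noise is shared
    return x_t + v * dt + sigma * (dt ** 0.5) * noise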

Method


ConsistentRFT addresses hallucinations through two key components:

(1) Dynamic Granularity Rollout (DGR) targets the limited exploration domain by dynamically scheduling between fine-grained groups (same initial noise, varying process noise) and coarse-grained groups (independent random noise) to balance global semantics and local details.

(2) Consistent Policy Gradient Optimization (CPGO) resolves trajectory imitation issues by enforcing consistency through ODE-based single-step predictions, preserving the velocity field consistency learned during pre-training.

Main Results

Main results: Tables 1 and 2; Table 3.

Dynamic Granularity Rollout (DGR)

Qualitative comparison of coarse-, fine-, and dynamic-grained optimization strategies.
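As a rough illustration of the two granularities, the sketch below builds rollout groups that are either coarse-grained (independent initial noise per sample) or fine-grained (one shared initial noise per group), chosen by a scheduling probability. The helper name build_rollout_groups, its arguments, and the simple Bernoulli schedule are assumptions; how p_coarse is actually scheduled over training follows the paper's DGR mechanism and may differ from this sketch.

import torch

def build_rollout_groups(num_groups, group_size, p_coarse, latent_shape, device="cpu"):
    # Hypothetical sketch of Dynamic Granularity Rollout group construction.
    groups = []
    for _ in range(num_groups):
        if torch.rand(()) < p_coarse:
            # coarse-grained group: independent initial noise per member,
            # exploring global semantics
            init = torch.randn(group_size, *latent_shape, device=device)
        else:
            # fine-grained group: one shared initial noise; members differ
            # only through SDE process noise, exploring local details
            init = torch.randn(1, *latent_shape, device=device)
            init = init.repeat(group_size, *([1] * len(latent_shape)))
        groups.append(init)
    return groups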

Consistent Policy Gradient Optimization
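Below is a minimal sketch of the consistency term that CPGO could add to the policy-gradient objective, assuming a rectified-flow convention x_t = (1 - t) * x_0 + t * noise, under which the single-step ODE prediction of the clean sample is x_0 ≈ x_t - t * v(x_t, t). The choice of prior (a frozen copy of the pretrained model here), the prediction rule, and the loss weighting are assumptions for illustration.

import torch
import torch.nn.functional as F

def consistency_regularizer(policy, frozen_prior, x_t, t):
    # Align the current policy's single-step ODE prediction of the clean sample
    # with that of a more stable prior, so policy-gradient updates do not distort
    # the velocity field's cross-step consistency (illustrative sketch).
    pred_policy = x_t - t * policy(x_t, t)
    with torch.no_grad():
        pred_prior = x_t - t * frozen_prior(x_t, t)
    return F.mse_loss(pred_policy, pred_prior)

In practice this term would be weighted and added to the standard policy-gradient loss rather than used on its own.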

BibTeX

@article{tan2025consistentrft,
  title={ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning},
  author={Tan, Xiaofeng and Liu, Jun and Fan, Yuanting and Gao, Bin-Bin and Jiang, Xi and Chen, Xiaochen and Peng, Jinlong and Wang, Chengjie and Wang, Hongsong and Zheng, Feng},
  journal={arXiv preprint},
  year={2025}
}