UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Abstract
TL;DR: Integrating the Uniform Discrete Diffusion Model with reinforcement learning through novel optimization strategies achieves state-of-the-art performance on text-to-image tasks and an OCR benchmark.
The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base-model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
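The two ingredients the abstract names can be illustrated concretely. Below is a minimal sketch of (i) the group-relative advantage normalization at the core of GRPO and (ii) a D3PM-style uniform forward-noising step, of the kind used to reconstruct intermediate trajectory states from a final clean sample. Function names, shapes, and the uniform-transition form are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative advantages: normalize each rollout's reward by the
    # mean and std of its own rollout group (GRPO's credit signal).
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def uniform_forward_noise(x0: torch.Tensor, keep_prob: float,
                          vocab_size: int) -> torch.Tensor:
    # One uniform-discrete forward step: each token of the clean sample x0
    # is kept with probability keep_prob, otherwise resampled uniformly
    # over the vocabulary (a D3PM-style uniform transition, assumed here).
    keep = torch.rand(x0.shape) < keep_prob
    noise = torch.randint(0, vocab_size, x0.shape)
    return torch.where(keep, x0, noise)

# Toy usage: 2 prompts, a group of 4 rollouts each.
rewards = torch.tensor([[0.1, 0.4, 0.4, 0.9],
                        [0.0, 0.0, 1.0, 1.0]])
adv = grpo_advantages(rewards)          # zero-mean within each group
x0 = torch.randint(0, 1024, (2, 16))    # stand-in for token grids
xt = uniform_forward_noise(x0, keep_prob=0.7, vocab_size=1024)
```

Normalizing within the rollout group removes the need for a learned value baseline, which is what makes GRPO attractive for reward models that are non-differentiable, as in T2I scoring.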
Community
The following related papers were recommended by the Semantic Scholar API (via Librarian Bot):
- TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward (2026)
- LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models (2026)
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models (2026)
- Stepwise Credit Assignment for GRPO on Flow-Matching Models (2026)
- MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation (2026)
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling (2026)
- Diffusion Reinforcement Learning via Centered Reward Distillation (2026)
