UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Abstract
TL;DR: Integrating the Uniform Discrete Diffusion Model with reinforcement learning through novel optimization strategies achieves state-of-the-art performance on text-to-image tasks and an OCR benchmark.
The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base-model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
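The two ingredients the abstract names can be illustrated concretely. Below is a minimal sketch of (i) the group-relative advantage normalization at the core of GRPO and (ii) a D3PM-style uniform forward-noising step, of the kind used to reconstruct intermediate trajectory states from a final clean sample. Function names, shapes, and the uniform-transition form are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Group-relative advantages: normalize each rollout's reward by the
    # mean and std of its own rollout group (GRPO's credit signal).
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def uniform_forward_noise(x0: torch.Tensor, keep_prob: float,
                          vocab_size: int) -> torch.Tensor:
    # One uniform-discrete forward step: each token of the clean sample x0
    # is kept with probability keep_prob, otherwise resampled uniformly
    # over the vocabulary (a D3PM-style uniform transition, assumed here).
    keep = torch.rand(x0.shape) < keep_prob
    noise = torch.randint(0, vocab_size, x0.shape)
    return torch.where(keep, x0, noise)

# Toy usage: 2 prompts, a group of 4 rollouts each.
rewards = torch.tensor([[0.1, 0.4, 0.4, 0.9],
                        [0.0, 0.0, 1.0, 1.0]])
adv = grpo_advantages(rewards)          # zero-mean within each group
x0 = torch.randint(0, 1024, (2, 16))    # stand-in for token grids
xt = uniform_forward_noise(x0, keep_prob=0.7, vocab_size=1024)
```

Normalizing within the rollout group removes the need for a learned value baseline, which is what makes GRPO attractive for reward models that are non-differentiable, as in T2I scoring.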
Community
The following related papers were recommended by the Semantic Scholar API (via Librarian Bot):
- TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward (2026)
- LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models (2026)
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models (2026)
- Stepwise Credit Assignment for GRPO on Flow-Matching Models (2026)
- MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation (2026)
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling (2026)
- Diffusion Reinforcement Learning via Centered Reward Distillation (2026)
