f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Abstract
The paper extends preference-alignment objectives to general alignment settings via variational representations of f-divergences, introducing on-policy (f-GRPO) and hybrid on/off-policy (f-HAL) optimization methods for LLM alignment, with theoretical guarantees and empirical validation.
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA (Safety Alignment) tasks, demonstrating superior performance and flexibility compared to current methods.
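For background, the variational (Fenchel-conjugate) representation of an f-divergence that underlies this framework is a standard result: for a convex generator $f$ with convex conjugate $f^{*}(t) = \sup_{u}\{tu - f(u)\}$,

$$
D_f(P \,\|\, Q) \;=\; \sup_{T} \; \mathbb{E}_{x \sim P}\!\left[T(x)\right] \;-\; \mathbb{E}_{x \sim Q}\!\left[f^{*}(T(x))\right],
$$

where the supremum is taken over critic functions $T$. In the divergence-estimator view of alignment, $P$ and $Q$ play the role of the aligned (chosen or high-reward) and unaligned (rejected or low-reward) response distributions; how the critic $T$ is parameterized through the policy to obtain f-GRPO and f-HAL is specific to the paper and not reproduced here.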