arxiv:2602.04651

SAFE: Stable Alignment Finetuning with Entropy-Aware Predictive Control for RLHF

Published on Feb 4 · Submitted by Dipan Maity on Feb 5

Abstract

AI-generated summary

A new reinforcement learning algorithm for language model alignment that improves stability and performance over PPO through enhanced KL-divergence control and adaptive reward management.

Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. PPO performs well empirically, but it has a heuristic motivation, handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner, and suffers from reward oscillations, entropy collapse, value-function drift, and sudden policy divergence that require frequent restarts and extensive hyperparameter tuning. In this paper, we develop a new purely on-policy actor-critic RL method for the LM-RLHF setting. We present SAFE (Stable Alignment Finetuning with Entropy-Aware Predictive Control), a novel RLHF algorithm that combines a Double Soft-Min Critic for pessimistic value estimation with a multi-layer stabilization framework built on entropy-gated KL regulation and PID-controlled adaptive thresholds. Unlike standard PPO's symmetric KL penalties, SAFE distinguishes high-entropy exploration from low-entropy mode collapse and adjusts penalties dynamically based on reward velocity. Experiments on a 3B-parameter model show that SAFE achieves a 5.15% higher training-average reward than PPO (0.725 vs. 0.689), negligible reward crashes, and tighter KL control than PPO. Our method adds minimal computational overhead and provides an interpretable, crash-resistant RLHF framework that maintains aggressive learning speed while ensuring stable long-horizon optimization suitable for production deployment. Code is available at https://github.com/ryyzn9/SAFE
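
The three mechanisms named in the abstract (a pessimistic double critic, entropy-gated KL regulation, and PID-controlled thresholds) can be sketched compactly. The snippet below is a minimal, hypothetical illustration in PyTorch, not the authors' implementation (the official code is at the GitHub link above); all class names, parameter names, default values, and the omission of the reward-velocity term are assumptions.

```python
# Illustrative sketch only -- not the SAFE authors' implementation.
# All names, defaults, and simplifications here are assumptions.
import torch


class PIDKLController:
    """PID controller that steers the KL-penalty coefficient toward a target KL,
    instead of PPO's fixed or heuristically scheduled coefficient."""

    def __init__(self, target_kl=0.05, kp=0.5, ki=0.01, kd=0.1, beta_init=0.1):
        self.target_kl = target_kl
        self.kp, self.ki, self.kd = kp, ki, kd
        self.beta = beta_init        # current KL-penalty coefficient
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_kl: float) -> float:
        # Positive error means the policy drifted too far from the reference
        # model, so the coefficient is raised; negative error relaxes it.
        error = observed_kl - self.target_kl
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.beta = max(1e-4, self.beta + self.kp * error
                        + self.ki * self.integral + self.kd * derivative)
        return self.beta


def entropy_gated_kl_penalty(kl: torch.Tensor, entropy: torch.Tensor,
                             beta: float, entropy_floor: float = 0.5) -> torch.Tensor:
    """Asymmetric KL penalty: tighten when entropy drops below a floor (a sign of
    mode collapse), relax when entropy is healthy (exploration). The paper also
    conditions on reward velocity, which this sketch omits."""
    gate = torch.clamp(entropy_floor / (entropy + 1e-8), min=0.5, max=5.0)
    return beta * gate * kl


def pessimistic_value(v1: torch.Tensor, v2: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Soft-min over two critic heads: a smooth, pessimistic value estimate in the
    spirit of a 'Double Soft-Min Critic'."""
    stacked = torch.stack([v1, v2], dim=-1)
    return -tau * torch.logsumexp(-stacked / tau, dim=-1)
```

In a training loop, `PIDKLController.update` would be called once per batch with the measured policy-reference KL, and the returned coefficient fed into `entropy_gated_kl_penalty` when assembling the loss, while `pessimistic_value` would replace the single critic's value estimate in the advantage computation.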

Community

Paper submitter

An alternative to PPO for RLHF.

