arxiv:2601.20568

Reinforcement Unlearning via Group Relative Policy Optimization

Published on Mar 20

Abstract

PURGE is a novel unlearning method for large language models that safely removes sensitive information using intrinsic rewards and group relative policy optimization, achieving high effectiveness with minimal utility loss.

AI-generated summary

During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks such as the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but they often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization (GRPO) framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, enabling safe and consistent unlearning. Our approach achieves up to 46× lower token usage per target than state-of-the-art methods, while improving fluency by 5.48% and adversarial robustness by 12.02% over the base model. Extensive evaluation on the Real World Knowledge Unlearning (RWKU) benchmark shows that PURGE reaches 11% unlearning effectiveness while preserving 98% of original utility. PURGE demonstrates that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
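
The mechanism the summary describes is simple enough to sketch. Below is a minimal, illustrative Python sketch of the two ingredients: a verifiable intrinsic reward that penalizes any surface mention of a forbidden concept, and GRPO-style group-relative advantage normalization. The function names, the exact string-matching rule, and the example data are assumptions for illustration, not the paper's actual implementation.

import re
from typing import List

def intrinsic_reward(completion: str, forbidden_concepts: List[str]) -> float:
    """Illustrative intrinsic reward (assumed form): -1.0 if any forbidden
    concept appears as a whole word in the completion, else +1.0. Because
    the check is a deterministic string match, it is verifiable and needs
    no external reward model."""
    text = completion.lower()
    for concept in forbidden_concepts:
        if re.search(r"\b" + re.escape(concept.lower()) + r"\b", text):
            return -1.0
    return 1.0

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: each reward is normalized against the mean
    and standard deviation of its own sampled group, so no learned value
    function (critic) is required."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled completions for one prompt about a forget target.
forbidden = ["Dumbledore", "Hogwarts"]
group = [
    "He served as headmaster of Hogwarts for decades.",
    "I don't have information about that person.",
    "Albus Dumbledore was a famous wizard.",
    "I'm not able to help with that request.",
]
rewards = [intrinsic_reward(c, forbidden) for c in group]
print(rewards)                             # [-1.0, 1.0, -1.0, 1.0]
print(group_relative_advantages(rewards))  # refusals get positive advantage

Under these assumptions, completions that mention a forget target receive negative group-relative advantage and are pushed down by the policy update, while refusals receive positive advantage and are reinforced, consistent with the abstract's claim that the reward is intrinsic and verifiable rather than model-based.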

Get this paper in your agent:

hf papers read 2601.20568
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
