GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Abstract
Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
Community
Hi everyone, I'd like to share our lab's recent work that has been accepted to ACL 2026 Findings: GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification.
Currently, large language models rely heavily on SFT and RL during the post-training phase, but how to effectively integrate the two remains a bottleneck. Although SFT can inject knowledge efficiently, it is limited by extremely sparse implicit rewards and unstable inverse-probability weighting, making it prone to "single-path dependency" and "gradient explosion". These flaws not only lead to catastrophic forgetting but also severely compress the policy's exploration space, so the common "SFT + RL (e.g., GRPO)" pipeline faces a "Synergy Dilemma" that greatly diminishes the subsequent gains from RL.
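To make the "sparse implicit reward + unstable inverse-probability weighting" reading concrete, here is a rough token-level sketch (our notation, not necessarily the paper's exact derivation). The gradient of the standard SFT loss on a demonstration $y^{*}$ can be rewritten as

$$
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
= -\,\mathbb{E}_{(x,\,y^{*})\sim\mathcal{D}} \sum_{t} \nabla_\theta \log \pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right)
= -\,\mathbb{E}_{(x,\,y^{*})\sim\mathcal{D}} \sum_{t} \frac{1}{\pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right)}\,\nabla_\theta\, \pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right),
$$

which has the shape of a policy-gradient update whose implicit reward is 1 on the single demonstrated trajectory and 0 everywhere else (hence "extremely sparse"), and whose per-token coefficient $1/\pi_\theta$ blows up on tokens the current policy assigns low probability (the instability that DCR is designed to bound).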
To address this, we propose GFT (Group Fine-Tuning), a single-stage fine-tuning framework that views SFT as a special case of reinforcement learning and resolves its intrinsic deficiencies from a training-dynamics perspective. Our framework includes two key designs (a minimal code sketch of both follows the list):
- Group Advantage Learning (GAL): integrates expert demonstrations, teacher distillation, and self-sampling to construct a hybrid response group, and uses normalized relative advantages for contrastive supervision. This breaks single-path dependency and preserves exploration diversity.
- Dynamic Coefficient Rectification (DCR): adaptively bounds the inverse-probability weights of extreme tokens. This suppresses gradient explosion, stabilizes the optimization process, and mitigates catastrophic forgetting while still injecting new knowledge efficiently.
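Below is a minimal, self-contained sketch of what these two pieces could look like in code. It is our illustration under generic assumptions, not the released GFT implementation: the reward values, the cap `w_max`, and the function names are hypothetical.

```python
# Toy sketch (ours, not the official GFT code) of group-advantage normalization
# and bounded inverse-probability coefficients. Names and hyperparameters are assumed.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GAL-style step: turn per-response rewards within one group (expert demo +
    teacher samples + self-samples for the same prompt) into normalized relative advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def bounded_weights(token_logprobs: torch.Tensor, w_max: float = 5.0) -> torch.Tensor:
    """DCR-style step: the implicit SFT coefficient on a token is 1 / pi_theta(token),
    which explodes on low-probability tokens; cap it to keep updates stable.
    The cap w_max is an assumed hyperparameter, not a value from the paper."""
    inv_prob = torch.exp(-token_logprobs)   # 1 / pi_theta(y_t | context)
    return inv_prob.clamp(max=w_max)

def gft_style_loss(token_logprobs: torch.Tensor,  # (G, T) log-probs of each response's tokens
                   rewards: torch.Tensor,         # (G,)   scalar reward per response in the group
                   mask: torch.Tensor) -> torch.Tensor:  # (G, T) 1 for real tokens, 0 for padding
    """Toy objective combining both pieces: advantage-weighted log-likelihood
    with bounded, stop-gradient per-token coefficients. Illustration only."""
    adv = group_advantages(rewards).unsqueeze(-1)        # (G, 1), broadcasts over tokens
    w = bounded_weights(token_logprobs).detach()         # coefficient only, no gradient through it
    per_token = -adv * w * token_logprobs * mask
    return per_token.sum() / mask.sum().clamp(min=1)

# Tiny usage example: a group of 4 responses, 6 tokens each.
if __name__ == "__main__":
    torch.manual_seed(0)
    logp = torch.log(torch.rand(4, 6).clamp(0.05, 0.95))
    rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])  # e.g. correctness scores within the group
    mask = torch.ones(4, 6)
    print(gft_style_loss(logp, rewards, mask))
```

The group-relative normalization supplies a dense contrastive signal across the group, while the cap on $1/\pi_\theta$ keeps low-probability ("extreme") tokens from dominating the update.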
We conducted extensive experiments across mainstream models (such as Qwen2.5 and Llama-3) on 11 mathematical reasoning benchmarks, including AMC23, MATH, and OlympiadBench. The results show that GFT is highly data-efficient (surpassing the 100k-example SFT baseline with only 10k examples), significantly reduces KL-divergence drift, and provides a much stronger cold-start policy for subsequent RL, substantially raising the model's performance ceiling.
The paper and code have both been released. We welcome everyone to discuss and exchange ideas, and wish you all a productive week of research! We would greatly appreciate it if you could like and star the links below.
Paper: https://arxiv.org/abs/2604.14258
GitHub: https://github.com/ZJU-OmniAI/GFT/tree/main (Stars are highly appreciated!)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Surgical Post-Training: Cutting Errors, Keeping Knowledge (2026)
- Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)