GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Abstract
Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
Community
Hi everyone, I'd like to share our lab's recent work that has been accepted to ACL 2026 Findings: GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification.
Currently, large language models rely heavily on SFT and RL during the post-training phase, but how to effectively integrate the two remains a bottleneck. Although SFT can inject knowledge efficiently, it is limited by extremely sparse implicit rewards and unstable inverse-probability weighting, making it prone to "single-path dependency" and "gradient explosion". These flaws not only lead to catastrophic forgetting but also severely compress the policy's exploration space, so the common "SFT + RL (e.g., GRPO)" pipeline faces a "Synergy Dilemma" that greatly diminishes the subsequent gains from RL.
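To make the "sparse implicit reward + unstable inverse-probability weighting" reading concrete, here is a rough token-level sketch (our notation, not necessarily the paper's exact derivation). The gradient of the standard SFT loss on a demonstration $y^{*}$ can be rewritten as

$$
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
= -\,\mathbb{E}_{(x,\,y^{*})\sim\mathcal{D}} \sum_{t} \nabla_\theta \log \pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right)
= -\,\mathbb{E}_{(x,\,y^{*})\sim\mathcal{D}} \sum_{t} \frac{1}{\pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right)}\,\nabla_\theta\, \pi_\theta\!\left(y^{*}_{t}\mid x, y^{*}_{<t}\right),
$$

which has the shape of a policy-gradient update whose implicit reward is 1 on the single demonstrated trajectory and 0 everywhere else (hence "extremely sparse"), and whose per-token coefficient $1/\pi_\theta$ blows up on tokens the current policy assigns low probability (the instability that DCR is designed to bound).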
To address this, we propose GFT (Group Fine-Tuning), a single-stage fine-tuning framework that views SFT as a special case of reinforcement learning and resolves its intrinsic deficiencies from a training-dynamics perspective. Our framework includes two key designs (a minimal code sketch of both follows the list):
- Group Advantage Learning (GAL): integrates expert demonstrations, teacher distillation, and self-sampling to construct a hybrid response group, and uses normalized relative advantages for contrastive supervision. This breaks single-path dependency and preserves exploration diversity.
- Dynamic Coefficient Rectification (DCR): adaptively bounds the inverse-probability weights of extreme tokens. This suppresses gradient explosion, stabilizes the optimization process, and mitigates catastrophic forgetting while still injecting new knowledge efficiently.
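Below is a minimal, self-contained sketch of what these two pieces could look like in code. It is our illustration under generic assumptions, not the released GFT implementation: the reward values, the cap `w_max`, and the function names are hypothetical.

```python
# Toy sketch (ours, not the official GFT code) of group-advantage normalization
# and bounded inverse-probability coefficients. Names and hyperparameters are assumed.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GAL-style step: turn per-response rewards within one group (expert demo +
    teacher samples + self-samples for the same prompt) into normalized relative advantages."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def bounded_weights(token_logprobs: torch.Tensor, w_max: float = 5.0) -> torch.Tensor:
    """DCR-style step: the implicit SFT coefficient on a token is 1 / pi_theta(token),
    which explodes on low-probability tokens; cap it to keep updates stable.
    The cap w_max is an assumed hyperparameter, not a value from the paper."""
    inv_prob = torch.exp(-token_logprobs)   # 1 / pi_theta(y_t | context)
    return inv_prob.clamp(max=w_max)

def gft_style_loss(token_logprobs: torch.Tensor,  # (G, T) log-probs of each response's tokens
                   rewards: torch.Tensor,         # (G,)   scalar reward per response in the group
                   mask: torch.Tensor) -> torch.Tensor:  # (G, T) 1 for real tokens, 0 for padding
    """Toy objective combining both pieces: advantage-weighted log-likelihood
    with bounded, stop-gradient per-token coefficients. Illustration only."""
    adv = group_advantages(rewards).unsqueeze(-1)        # (G, 1), broadcasts over tokens
    w = bounded_weights(token_logprobs).detach()         # coefficient only, no gradient through it
    per_token = -adv * w * token_logprobs * mask
    return per_token.sum() / mask.sum().clamp(min=1)

# Tiny usage example: a group of 4 responses, 6 tokens each.
if __name__ == "__main__":
    torch.manual_seed(0)
    logp = torch.log(torch.rand(4, 6).clamp(0.05, 0.95))
    rewards = torch.tensor([1.0, 0.0, 0.5, 0.0])  # e.g. correctness scores within the group
    mask = torch.ones(4, 6)
    print(gft_style_loss(logp, rewards, mask))
```

The group-relative normalization supplies a dense contrastive signal across the group, while the cap on $1/\pi_\theta$ keeps low-probability ("extreme") tokens from dominating the update.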
We conducted extensive experiments across mainstream models (such as Qwen2.5 and Llama-3) on 11 mathematical reasoning benchmarks, including AMC23, MATH, and OlympiadBench. The results show that GFT is highly data-efficient (surpassing the 100k-example SFT baseline with only 10k examples), significantly reduces KL-divergence drift, and provides a much stronger cold-start policy for subsequent RL, substantially raising the model's performance ceiling.
The paper and code have both been released. We welcome everyone to discuss and exchange ideas, and wish you all a productive week of research! We would greatly appreciate it if you could like and star the links below.
Paper: https://arxiv.org/abs/2604.14258
GitHub: https://github.com/ZJU-OmniAI/GFT/tree/main (Stars are highly appreciated!)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Surgical Post-Training: Cutting Errors, Keeping Knowledge (2026)
- Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)