arxiv:2605.27028

Less is More: Early Stopping Rollout for On-Policy Distillation

Published on May 26
· Submitted by
Ziheng Zhou
on May 28

Abstract

In on-policy distillation, the teacher's scoring signal degrades on the later tokens of the student's rollout. Early Stopping Rollout mitigates this by restricting training to the initial response tokens, which improves efficiency and stability.

AI-generated summary

On-policy distillation (OPD) has recently emerged as a promising alternative to standard sequence-level imitation: it trains a student by scoring the student's own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the context is the student's earlier trajectory, which is off-policy to the teacher, so the teacher's ability to produce a corrective score decays, and the teacher may fall back to the token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first N response tokens. We show that ESR surpasses full-rollout OPD across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and identify the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works and why it sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.

Community

Early stopping method for OPD!

The paper diagnoses the "Off-policy Teacher Decay" problem in On-policy Distillation (OPD): the later positions in the trajectory are actually harmful! The student's trajectory is off-policy to the teacher, and as it gets longer, the teacher model drifts away from its original reasoning patterns and falls back to its pre-training behavior, token completion. As a result, the quality of the teacher signal decays, and training becomes unstable and can even collapse.

The fix is very simple: stop the rollout early (Early Stopping Rollout, ESR). Rolling out just the first N tokens (as few as 50 to 100) consistently outperforms full rollout across tasks, model families, sizes, and training regimes (full fine-tuning or LoRA), and can be 24x faster. Moreover, it sometimes exceeds the teacher's performance!
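To make the idea concrete, here is a minimal sketch of an early-stopped rollout followed by a per-token distillation loss. It is an illustration under stated assumptions, not the paper's implementation: the names `early_stopped_rollout`, `per_token_reverse_kl`, and `esr_loss` are hypothetical, `next_token_fn` stands in for any student sampling step, and the choice of reverse KL (student against teacher) as the scoring signal is an assumption, since the text above does not specify the exact loss.

```python
import numpy as np

def early_stopped_rollout(next_token_fn, prompt_ids, max_new_tokens, eos_id):
    """Generate at most `max_new_tokens` response tokens, then stop.

    `next_token_fn` is a hypothetical callable mapping the token ids seen
    so far to the next token id (e.g. one student sampling step).
    """
    response = []
    for _ in range(max_new_tokens):
        token = next_token_fn(prompt_ids + response)
        response.append(token)
        if token == eos_id:  # the student finished before the cap
            break
    return response

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position; inputs have shape [T, V]."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)

def esr_loss(student_logits, teacher_logits, max_rollout_tokens):
    """Mean per-token KL over the first `max_rollout_tokens` positions only."""
    if max_rollout_tokens < 1:
        raise ValueError("max_rollout_tokens must be at least 1")
    n = min(max_rollout_tokens, student_logits.shape[0])
    return per_token_reverse_kl(student_logits[:n], teacher_logits[:n]).mean()
```

In this sketch the cap saves work twice: the student generates fewer tokens, and the teacher only has to score the short prefix. How much that contributes to the speedup reported above depends on the setup and is not something this toy code measures.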

Why does it work so well? Aside from relieving the "Off-policy Teacher Decay" problem, the paper identifies two effects: 1. the "Cascading Alignment" effect, in which late-position tokens get "trained" automatically (their KL divergence drops) even when training only on the early tokens; 2. the "Sub-mode Commitment" effect, in which ESR may commit to a non-dominant mode of the teacher that is actually better, which lets it even exceed the teacher. In addition, the paper's ablations show that this position-based token selection strategy differs from KL-based or entropy-based ones and can be better than using either alone.
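The difference between position-based and score-based token selection can be sketched as follows. This is only an illustration: the helper names `position_mask`, `top_k_mask`, and `token_entropy` are hypothetical, and selecting the top-k tokens by score is an assumed stand-in for whatever KL-based or entropy-based criteria the paper actually ablates.

```python
import numpy as np

def position_mask(num_tokens, first_n):
    """Position-based selection: keep only the first `first_n` tokens."""
    return np.arange(num_tokens) < first_n

def top_k_mask(scores, k):
    """Score-based selection: keep the `k` tokens with the highest score.

    `scores` could be per-token KL or per-token entropy (both assumptions
    here); the selected positions can fall anywhere in the sequence.
    """
    k = min(k, scores.shape[0])
    mask = np.zeros(scores.shape[0], dtype=bool)
    if k > 0:
        mask[np.argsort(scores)[-k:]] = True
    return mask

def token_entropy(probs):
    """Shannon entropy of each row of a [T, V] probability matrix."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)
```

Comparing `position_mask(T, n)` with `top_k_mask(scores, n)` on the same rollout shows how much the two selections overlap. If high-KL or high-entropy tokens were always the earliest ones, the masks would coincide; the paper's claim is that position carries information these signals do not fully capture.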
