arxiv:2605.30833

Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation

Published on May 29

Abstract

On-policy distillation suffers from supervision fidelity decay as reasoning chains lengthen, but lookahead group reward and entropy-triggered tree-attention mechanisms improve performance on long-generation tasks.

On-policy distillation (OPD) transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, Supervision Fidelity Decay (SFD): as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce Lookahead Group Reward. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, Lookahead Group Reward evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. Across six math and code benchmarks, Lookahead Group Reward improves mean@8 by 2.57 points over OPD for a 7B student, with gains increasing for longer generations and reaching +4.92 points on AIME-26 at 39k tokens.
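The scoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how teacher confidence is measured, so the sketch assumes negative entropy of the teacher's next-step distribution, and it assumes group normalization means subtracting the group mean and dividing by the group standard deviation. All function names and the toy distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def teacher_confidence(next_step_probs):
    """ASSUMPTION: confidence is the negative entropy of the teacher's
    distribution at the step after the candidate token (higher = more confident).
    The paper may use a different measure."""
    return -entropy(next_step_probs)

def group_normalize(values, eps=1e-8):
    """ASSUMPTION: normalize within the candidate group as (v - mean) / std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]

def lookahead_group_reward(candidate_next_dists):
    """Map each top-K candidate token to a group-normalized reward.

    candidate_next_dists: dict from candidate token to the teacher's
    next-token distribution obtained after appending that candidate.
    """
    tokens = list(candidate_next_dists)
    confidences = [teacher_confidence(candidate_next_dists[t]) for t in tokens]
    return dict(zip(tokens, group_normalize(confidences)))

# Hypothetical toy group of K = 3 candidates over a 3-token vocabulary.
toy_group = {
    "A": [0.90, 0.05, 0.05],  # teacher stays confident after candidate A
    "B": [0.40, 0.30, 0.30],  # teacher is less confident after candidate B
    "C": [0.34, 0.33, 0.33],  # teacher is nearly uniform after candidate C
}
rewards = lookahead_group_reward(toy_group)
print(rewards)
```

In this toy group, the candidate that leaves the teacher most confident at the following step receives the highest reward, and the rewards sum to zero within the group. How the reward is combined with the reverse-KL objective, and how the entropy trigger and tree attention reduce the cost of scoring candidates, are not specified in the abstract and are left out of the sketch.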
