arxiv:2605.26844

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Published on May 26 · Submitted by Yuanyi Wang on Jun 1

Abstract

The learning value of token-level teacher signals in on-policy distillation is better predicted by teachability, a measure of local compatibility between teacher and student distributions, than by raw KL disagreement alone.

AI-generated summary

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
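
The distinction between learnable and incompatible disagreement can be illustrated with a small NumPy sketch. The abstract does not give the paper's exact teachability formula, so the function `teacher_mass_on_student_topk` below is a hypothetical stand-in: it simply sums the teacher probability that falls on the student's top-K tokens. The KL direction (student relative to teacher) and the toy distributions are likewise assumptions made for illustration.

```python
import numpy as np

def token_kl(student_probs, teacher_probs, eps=1e-12):
    """Per-position KL(student || teacher); the direction is an assumption."""
    s = np.clip(student_probs, eps, 1.0)
    t = np.clip(teacher_probs, eps, 1.0)
    return (s * (np.log(s) - np.log(t))).sum(axis=-1)

def teacher_mass_on_student_topk(student_probs, teacher_probs, k):
    """Hypothetical compatibility proxy: teacher probability on the student's top-k tokens."""
    # Indices of the k most likely tokens under the student, per position.
    topk = np.argpartition(-student_probs, k - 1, axis=-1)[..., :k]
    return np.take_along_axis(teacher_probs, topk, axis=-1).sum(axis=-1)

# Two toy positions over a 4-token vocabulary (made-up numbers).
student = np.array([[0.70, 0.20, 0.05, 0.05],
                    [0.70, 0.20, 0.05, 0.05]])
teacher = np.array([[0.20, 0.70, 0.05, 0.05],   # teacher reorders the student's top-2
                    [0.05, 0.05, 0.20, 0.70]])  # teacher mass mostly off the student's top-2
kl = token_kl(student, teacher)
mass = teacher_mass_on_student_topk(student, teacher, k=2)
```

In this toy case the second position has the larger KL, yet the teacher places only about 0.1 of its mass on the student's top-2 tokens, versus about 0.9 at the first position. Ranking by raw KL alone would therefore favor the position that the abstract would call incompatible.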

Community

Paper author Paper submitter

This work answers the question: "which token-level teacher signals in OPD are actually learnable?" Our fixed-context KL-reduction diagnostic shows that high token-level disagreement conflates learnable disagreement, where the teacher assigns corrective mass to the student’s top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student’s current support. We formalize this as Token Teachability and propose TA-OPD, which selects only high-teachability positions for OPD. Across Qwen2.5/Qwen3 settings, TA-OPD often matches or surpasses full-token OPD with only 5% retained tokens, without reward models or verifiers.

In summary, this work establishes a fine-grained view of OPD: not every token-level teacher–student disagreement is worth learning, and Token Teachability identifies which signals are actually learnable.
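
As a rough illustration of position selection with 5% retained tokens, the sketch below keeps the highest-scoring positions and averages a per-token loss over them. It is not the authors' implementation: the helper names `select_top_fraction` and `masked_mean_loss`, the per-position scores, and the rounding rule for the number of kept positions are all assumptions.

```python
import numpy as np

def select_top_fraction(scores, keep_fraction=0.05):
    """Boolean mask over a 1-D score array keeping the highest-scoring positions."""
    scores = np.asarray(scores, dtype=float)
    # Keep at least one position; rounding to the nearest count is an assumption.
    n_keep = max(1, int(round(keep_fraction * scores.size)))
    keep_idx = np.argsort(-scores, kind="stable")[:n_keep]
    mask = np.zeros(scores.shape, dtype=bool)
    mask[keep_idx] = True
    return mask

def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over the selected positions only."""
    return float(np.asarray(per_token_loss, dtype=float)[mask].mean())

# Toy rollout of 40 positions with made-up scores and losses.
scores = np.linspace(0.0, 1.0, 40)   # stand-in for a per-position teachability score
per_token_loss = np.arange(40.0)     # stand-in for a per-position OPD loss
mask = select_top_fraction(scores, keep_fraction=0.05)
loss = masked_mean_loss(per_token_loss, mask)
```

With 40 positions and a 5% budget, two positions are kept, and only their losses contribute to the average. Any score could be plugged in here, which is what makes the comparison against entropy- or divergence-based selection straightforward.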

cool~

