arxiv:2604.19295

TEMPO: Scaling Test-time Training for Large Reasoning Models

Published on Apr 21 · Submitted by qingyang zhang on Apr 22
Abstract

TEMPO is a test-time training framework that alternates policy refinement with critic recalibration to sustain performance improvements in language models without diversity collapse.

AI-generated summary

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we show that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
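The alternating procedure the abstract describes can be sketched as a toy loop. Everything here is an illustrative assumption, not the paper's implementation: the class names (ToyPolicy, ToyCritic), the recalibration interval, and the scalar "drift" model of reward miscalibration are placeholders that only show the shape of the M-step/E-step alternation.

```python
# Toy sketch of an alternating test-time-training loop: policy refinement
# on unlabeled questions (M-step), with periodic critic recalibration on a
# small labeled set (E-step). All names and dynamics are illustrative.

class ToyPolicy:
    def __init__(self):
        self.skill = 0.0
    def generate(self, question):
        # Stand-in for sampling a reasoning trace for a question.
        return question + self.skill
    def update(self, rewards):
        # Stand-in for a policy-gradient update driven by critic rewards.
        self.skill += 0.1 * sum(rewards) / len(rewards)

class ToyCritic:
    def __init__(self):
        self.bias = 0.5  # miscalibration that grows as the policy evolves
    def score(self, answer):
        self.bias += 0.05          # reward signal drifts over time
        return 1.0 - self.bias     # increasingly unreliable reward
    def recalibrate(self, labeled_set):
        self.bias = 0.0            # reset against ground-truth labels

def tempo_loop(policy, critic, unlabeled, labeled, steps=12, recalib_every=4):
    for step in range(steps):
        # M-step: refine the policy on unlabeled test questions,
        # using the critic's (drifting) reward as the training signal.
        rewards = [critic.score(policy.generate(q)) for q in unlabeled]
        policy.update(rewards)
        # E-step: periodically recalibrate the critic on labeled data
        # so the reward signal stays honest as the policy evolves.
        if (step + 1) % recalib_every == 0:
            critic.recalibrate(labeled)
    return policy, critic

policy, critic = tempo_loop(ToyPolicy(), ToyCritic(), [1.0, 2.0], [(1.0, 1.0)])
```

Without the periodic `recalibrate` call, the toy critic's bias grows monotonically and the reward signal degrades, which mirrors the drift-and-plateau failure mode the abstract attributes to prior TTT methods.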

Community

Paper author · Paper submitter

TEMPO: Scaling Test-time Training for Large Reasoning Models. Models stop learning once training ends. Test-time training (TTT) tries to change that by letting models keep improving on new problems at inference. But current approaches plateau fast: the self-generated reward signal drifts and the model collapses into repeating one reasoning pattern. TEMPO fixes this with a simple EM-style loop: periodically recalibrate the reward critic on a small labeled set, then refine the policy on unlabeled test questions. This keeps the training signal honest as the model evolves. Results: OLMO3-7B 33→51% on AIME 2024, Qwen3-14B 42→66%, still climbing at 350 steps where baselines flatline. Diversity (pass@k) stays high instead of collapsing. It also generalizes to non-math reasoning tasks.

The crux for me is framing critic recalibration as the E-step in an EM loop and letting the policy improve on unlabeled test questions in the M-step. Did you test cross-domain calibration, i.e., labeled data from a different domain than the test questions? The arXivLens breakdown helped me parse the method details: https://arxivlens.com/PaperView/Details/tempo-scaling-test-time-training-for-large-reasoning-models-3719-7ecbd8a0
It will be interesting to see how robust that step is in practice.
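For readers unfamiliar with the EM framing discussed above, the standard evidence decomposition (this is generic EM, not necessarily the paper's exact objective) is:

```latex
\log p_\theta(x) =
\underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x,z)}{q(z)}\right]}_{\mathrm{ELBO}(q,\theta)}
+ \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z \mid x)\right)
```

The E-step tightens the bound by moving $q$ toward the posterior (driving the KL term toward zero); the M-step then maximizes the ELBO over $\theta$. Under the reading above, critic recalibration plays the role of the E-step and policy refinement on unlabeled questions plays the M-step, which is why skipping recalibration leaves the bound loose.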



Get this paper in your agent:

hf papers read 2604.19295
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0

Datasets citing this paper: 0

Spaces citing this paper: 0

Cite arxiv.org/abs/2604.19295 in a model, dataset, or Space README.md to link it from this page.

Collections including this paper 3