TEMPO: Scaling Test-time Training for Large Reasoning Models
Abstract
TEMPO is a test-time training framework that alternates policy refinement with critic recalibration to sustain performance improvements in language models without diversity collapse.
Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, continuously extending capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we show that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
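The alternating procedure in the abstract — refine the policy on unlabeled questions, and periodically recalibrate the critic on a small labeled set so the reward signal does not drift — can be caricatured with a toy numeric simulation. Everything here is an illustrative assumption (function names, drift rate, gain schedule), not the paper's implementation; it only shows why skipping recalibration produces the plateau the abstract describes.

```python
def recalibrate_critic(critic_bias, labeled_set):
    # "E-step" sketch: re-anchor the critic against labeled outcomes,
    # cancelling the drift accumulated since the last recalibration.
    # (Toy model: perfect recalibration zeroes the bias.)
    return 0.0

def refine_policy(score, critic_bias):
    # "M-step" sketch: the policy improves only while the reward signal
    # stays honest; a drifted critic yields diminishing, then zero, gains.
    gain = max(0.0, 0.05 - abs(critic_bias) * 0.1)
    return score + gain

def tempo_loop(steps, recal_every):
    # score: abstract policy quality; critic_bias: reward-signal drift.
    score, critic_bias = 0.33, 0.0
    for step in range(steps):
        if recal_every and step % recal_every == 0:
            critic_bias = recalibrate_critic(critic_bias, labeled_set=None)
        score = refine_policy(score, critic_bias)
        critic_bias += 0.02  # self-generated reward drifts as the policy evolves
    return score

with_recal = tempo_loop(steps=100, recal_every=10)   # TEMPO-style loop
no_recal = tempo_loop(steps=100, recal_every=0)      # prior-TTT-style ablation
```

In this toy, the no-recalibration run stops improving once the drift swamps the signal (the plateau), while the periodically recalibrated run keeps gaining — a numeric stand-in for the "still climbing when baselines flatline" behavior reported below.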
Community
TEMPO: Scaling Test-time Training for Large Reasoning Models: models stop learning once training ends. Test-time training (TTT) tries to change that by letting models keep improving on new problems at inference. But current approaches plateau fast: the self-generated reward signal drifts and the model collapses into repeating one reasoning pattern. TEMPO fixes this with a simple EM-style loop: periodically recalibrate the reward critic on a small labeled set, then refine the policy on unlabeled test questions. This keeps the training signal honest as the model evolves. results: OLMO3-7B 33→51% on AIME 2024, Qwen3-14B 42→66%, still climbing at 350 steps when baselines flatline. Diversity (pass@k) stays high instead of collapsing. also generalizes to non-math reasoning tasks.
the crux for me is framing critic recalibration as the E-step in an EM loop and letting the policy improve on unlabeled test questions in the M-step. did you test cross-domain calibration, i.e., labeled data from a different domain than the test questions? the arXivLens breakdown helped me parse the method details: https://arxivlens.com/PaperView/Details/tempo-scaling-test-time-training-for-large-reasoning-models-3719-7ecbd8a0
will be interesting to see how robust that step is in practice.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TTVS: Boosting Self-Exploring Reinforcement Learning via Test-time Variational Synthesis (2026)
- FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (2026)
- Tool Verification for Test-Time Reinforcement Learning (2026)
- What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time (2026)
- DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning (2026)
- Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (2026)
- Compress the Easy, Explore the Hard: Difficulty-Aware Entropy Regularization for Efficient LLM Reasoning (2026)