Abstract
Scorio is a library implementing statistical ranking methods for evaluating reasoning LLMs under test-time scaling; most of its methods agree closely with a Bayesian gold standard across multiple math benchmarks.
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods including paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to N=80 trials), most full-trial rankings agree closely with the Bayesian gold standard Bayes_{U}@80 (mean Kendall's τ_b = 0.93--0.95), and 19--34 methods recover the gold-standard ordering exactly. In the single-trial regime, the best methods reach τ_b ≈ 0.86. Using greedy decoding as an empirical prior (Bayes_{R_0}@N) reduces variance at N=1 by 16--52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
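To make the evaluation setup concrete, below is a minimal sketch (plain NumPy/SciPy, not Scorio's API) of comparing a candidate ranking against a Bayesian gold standard with Kendall's τ_b. The synthetic outcomes, the bayes_uniform_scores helper, and the Beta(1,1) posterior mean standing in for the Bayes_{U}@N estimator are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch (not Scorio's API): score per-model accuracies from N
# sampled trials and compare a low-budget ranking against a full-budget
# Bayesian gold standard via Kendall's tau_b. The Beta(1,1) posterior
# mean is an assumed stand-in for the paper's Bayes_U@N estimator.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)

# Hypothetical data: 20 models x 30 problems x 80 trials of 0/1 correctness.
n_models, n_problems, n_trials = 20, 30, 80
true_skill = rng.uniform(0.2, 0.9, size=n_models)
outcomes = rng.binomial(1, true_skill[:, None, None],
                        size=(n_models, n_problems, n_trials))

def bayes_uniform_scores(x):
    """Posterior-mean accuracy under a uniform Beta(1,1) prior:
    (successes + 1) / (trials + 2), averaged over problems."""
    successes = x.sum(axis=-1)          # per model, per problem
    trials = x.shape[-1]
    return ((successes + 1) / (trials + 2)).mean(axis=-1)

# Gold standard: full-budget Bayesian scores over all 80 trials.
gold = bayes_uniform_scores(outcomes)

# Candidate: the same estimator in the single-trial regime (N=1).
candidate = bayes_uniform_scores(outcomes[:, :, :1])

tau, _ = kendalltau(gold, candidate)    # variant='b' is the default
print(f"Kendall's tau_b between N=80 gold and N=1 ranking: {tau:.3f}")
```

τ_b (rather than τ_a) is the natural choice here because it corrects for ties, which arise whenever two models achieve identical estimated accuracies at small N.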