Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts • Paper • 2606.05922 • Published 6 days ago • 47
Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph • Paper • 2511.00086 • Published Oct 29, 2025 • 42
m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models • Paper • 2504.00869 • Published Apr 1, 2025 • 10
ViLBench: A Suite for Vision-Language Process Reward Modeling • Paper • 2503.20271 • Published Mar 26, 2025 • 7
IHEval: Evaluating Language Models on Following the Instruction Hierarchy • Paper • 2502.08745 • Published Feb 12, 2025 • 20