arxiv:2603.01571

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Published on Mar 2

· Submitted by

Qiyuan Zhang on Mar 4

Tencent Hunyuan

Upvote

Authors:

Abstract

Generative Reward Models can be improved by structuring Chain-of-Thought reasoning into breadth and depth components and optimizing them through supervised fine-tuning and reinforcement learning with verifiable rewards.

AI-generated summary

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2\%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at https://huggingface.co/collections/DonJoey/mix-grm{Hugging Face}, and the code is released at https://github.com/Don-Joey/Mix-GRM{Github}.

View arXiv page View PDF Project page Add to collection

Community

DonJoey

Paper submitter about 9 hours ago

🚀 Is making CoT longer really the silver bullet for Reward Models?

As long-cot dominates the LLM landscape, the standard approach to improving Generative Reward Models (LLM-as-a-Judge) has been straightforward: just force the model to generate longer reasoning traces. But does "one size fit all"?

In our new paper, "Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models," we prove that when it comes to evaluation, structure matters just as much as length.

🔥 The Core Problem:
Real-world evaluation is fundamentally divided:

Subjective Preference (e.g., Chat): Requires Breadth (B-CoT)—evaluating multiple dimensions like tone, format, and helpfulness simultaneously.
Objective Correctness (e.g., Math/Code): Requires Depth (D-CoT)—rigorous, step-by-step deductive verification.

Forcing a model to "think longer" on a subjective chat task often just accumulates noise, while using broad aspects on a math problem misses critical logical flaws.

💡 Enter Mix-GRM & Key Discoveries:

🧠 Synergizing Structures: We designed a framework that equips the GRM with both Breadth (B-CoT) and Depth (D-CoT) reasoning capabilities.

2.⚡ "Emergent Polarization": We trained the model using Reinforcement Learning (RLVR) relying exclusively on final verdict supervision—with zero explicit routing labels. Amazingly, the model's structural alignment surged to 95%. It autonomously learned to polarize its reasoning, dynamically selecting Breadth for Preference and Depth for Correctness.

📉 Highly Compute-Efficient: Unlike length-scaling baselines (like Self-Consistency) that burn massive amounts of tokens, Mix-GRM achieves superior performance while keeping token consumption within the exact same order of magnitude as standard single-pass reasoning.