Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
Abstract
STOP is a systematic path pruning method for large reasoning models that improves efficiency and accuracy through learnable token-level pruning across different compute budgets.
Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets; for instance, STOP boosts GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under a fixed compute budget. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data, and models are available at https://bijiaxihh.github.io/STOP
Community
STOP! You might already be on the wrong reasoning path.
In parallel reasoning, many sampled trajectories are already doomed from their early prefixes, yet still consume full decoding budgets.
We propose STOP (Super Token for Pruning), a lightweight method that appends a short sequence of learnable [STOP] tokens and directly reads KV cache states to decide whether a trajectory should be continued. This enables early pruning of unpromising paths without re-encoding or external models.
STOP significantly improves reasoning performance on AIME and GPQA, while reducing token usage by over 70% in many settings.
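To make the pruning idea above concrete, here is a minimal sketch of prefix-level path pruning: each sampled trajectory gets a scalar "continue" score read from its prefix, and trajectories below a threshold are cut before they consume further decoding budget. All names (`stop_score`, `prune_paths`), the data layout, and the averaging heuristic are illustrative assumptions for this sketch, not the paper's actual implementation, which reads learnable [STOP] token states from the KV cache.

```python
# Hypothetical sketch of prefix-level path pruning for parallel reasoning.
# The scoring function is a stand-in for reading the learnable [STOP]
# token's representation from the KV cache; here we simply average
# made-up per-token confidence floats.

def stop_score(prefix_states):
    """Illustrative stand-in for the learned [STOP] scorer."""
    return sum(prefix_states) / len(prefix_states)

def prune_paths(paths, threshold=0.5):
    """Keep trajectories whose prefix score clears the threshold;
    pruned paths stop consuming decoding budget immediately."""
    kept, pruned = [], []
    for path in paths:
        if stop_score(path["prefix_states"]) >= threshold:
            kept.append(path)
        else:
            pruned.append(path)
    return kept, pruned

paths = [
    {"id": 0, "prefix_states": [0.9, 0.8, 0.85]},  # promising prefix
    {"id": 1, "prefix_states": [0.2, 0.1, 0.15]},  # likely doomed early
    {"id": 2, "prefix_states": [0.6, 0.5, 0.55]},
]
kept, pruned = prune_paths(paths, threshold=0.5)
print([p["id"] for p in kept])    # -> [0, 2]
print([p["id"] for p in pruned])  # -> [1]
```

In a real decoder this check would run once per trajectory after the prefix is generated, so the saved budget scales with how early doomed paths are identified.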
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- One-Token Verification for Reasoning Correctness Estimation (2026)
- Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement (2026)
- SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking (2026)
- Efficient Reasoning on the Edge (2026)
- Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning (2026)
- ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention (2026)
- Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR (2026)