BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
Abstract
Reinforcement learning with exact physics rewards leads a compact language model to procedural solution patterns rather than true physical reasoning, with limited generalization beyond the training domain.
Can reinforcement learning with hard, verifiable rewards teach a compact language model to reason about physics, or does it primarily learn to pattern-match toward correct answers? We study this question by training a 1.5B-parameter reasoning model on beam statics, a classic engineering problem, using parameter-efficient RLVR with binary correctness rewards from symbolic solvers and no teacher-generated reasoning traces. The best BeamPERL checkpoint achieves a 66.7% improvement in Pass@1 over the base model. However, the learned competence is anisotropic: the model generalizes compositionally (more loads) but fails under topological shifts (moved supports) that require the same equilibrium equations. Intermediate checkpoints yield the strongest reasoning; continued optimization degrades robustness while maintaining reward. These findings reveal a key limitation of outcome-level alignment: reinforcement learning with exact physics rewards induces procedural solution templates rather than internalization of the governing equations. The precision of the reward signal, even when analytically exact, does not by itself guarantee transferable physical reasoning. Our results suggest that verifiable rewards may need to be paired with structured reasoning scaffolding to move beyond template matching toward robust scientific reasoning.
Community
Does pure RL with verifiable rewards teach models physics, or just clever pattern-matching?
We explored this question in our new paper by training a compact 1.5B reasoning model on beam statics (a classic structural engineering problem). We used parameter-efficient RLVR with strict, binary correctness rewards from symbolic solvers, with no teacher-generated reasoning traces.
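As a concrete illustration (our own minimal sketch, not the paper's actual verifier), a binary correctness reward of this kind can be written in a few lines: an exact solver computes the support reactions of a simply supported beam from static equilibrium, and the model earns reward 1.0 only if its numeric answer matches within tolerance, with no partial credit. The function names and the tolerance below are our assumptions.

```python
from fractions import Fraction

def solve_reactions(span, loads):
    """Exact reactions for a simply supported beam (pin at x=0, roller at x=span)
    under point loads [(position, magnitude), ...], from static equilibrium:
    sum F_y = 0 and sum M_about_pin = 0. Uses rationals to avoid rounding."""
    total = sum(P for _, P in loads)
    # Moment balance about the pin: R_B * span = sum(P_i * a_i)
    R_B = sum(Fraction(a) * P for a, P in loads) / Fraction(span)
    R_A = total - R_B
    return R_A, R_B

def reward(model_answer, span, loads, tol=1e-6):
    """Binary verifiable reward: 1.0 iff the model's (R_A, R_B) matches the
    solver's exact output within tolerance, else 0.0. No partial credit."""
    R_A, R_B = solve_reactions(span, loads)
    ra, rb = model_answer
    ok = abs(ra - float(R_A)) <= tol and abs(rb - float(R_B)) <= tol
    return 1.0 if ok else 0.0
```

For a 10 m span with a single 10 kN load at 2.5 m, the solver gives R_A = 7.5 and R_B = 2.5, so an answer of (7.5, 2.5) scores 1.0 and anything outside tolerance scores 0.0.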
Here is what we found:
The good: The best BeamPERL checkpoint achieved a massive 66.7% improvement in Pass@1 over the base model.
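For reference, Pass@1 is typically computed with the standard unbiased pass@k estimator; the scores below are hypothetical and only illustrate how a 66.7% relative improvement arises (the paper's actual base and best values are not restated here).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, passes. For k=1 this reduces to the fraction of correct samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical scores, purely illustrative:
base = pass_at_k(100, 30, 1)           # ~0.30
best = pass_at_k(100, 50, 1)           # ~0.50
relative_gain = (best - base) / base   # ~0.667, i.e. a 66.7% relative gain
```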
The catch: The learned reasoning is highly anisotropic. The model generalizes beautifully to compositional changes (like adding more loads to the beam), but completely breaks down under topological shifts (like moving the locations of the supports), even though the underlying equilibrium equations remain exactly the same.
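To make the anisotropy concrete: the two equilibrium equations governing a statically determinate beam do not change when a support moves, only the moment arms do. A minimal sketch (our own illustrative code, not from the paper) where one solver handles both the compositional case (more loads, supports at the ends) and the topological case (a relocated roller with an overhang):

```python
def reactions(pin_x, roller_x, loads):
    """Reactions for a statically determinate beam with a pin at pin_x and a
    roller at roller_x, from the same two equations regardless of support layout:
      sum F_y = 0:          R_pin + R_roller = sum(P_i)
      sum M_about_pin = 0:  R_roller * (roller_x - pin_x) = sum(P_i * (a_i - pin_x))
    """
    total = sum(P for _, P in loads)
    R_roller = sum(P * (a - pin_x) for a, P in loads) / (roller_x - pin_x)
    R_pin = total - R_roller
    return R_pin, R_roller

# Compositional shift: three loads, supports still at the ends of a 10 m beam.
comp = reactions(0, 10, [(2, 5), (4, 5), (6, 5)])    # (9.0, 6.0)
# Topological shift: roller moved inward to x=8, load on the overhang at x=10.
topo = reactions(0, 8, [(10, 12)])                   # (-3.0, 15.0)
```

A model that has internalized sum F_y = 0 and sum M = 0 should handle both calls; a model that has memorized a "supports at the ends" template handles only the first (note the pin reaction even turns negative in the overhang case).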
Insights into training dynamics: Intermediate checkpoints actually yielded the strongest reasoning. Continued optimization degraded robustness while maintaining the reward, a pattern consistent with reward hacking.
Key takeaway: Outcome-level alignment with exact physics rewards seems to induce procedural solution templates, not a true internalization of the governing equations. The precision of a reward signal, even when analytically perfect, does not by itself guarantee transferable physical reasoning.
To move beyond template-matching toward robust scientific comprehension, verifiable rewards likely need to be paired with structured reasoning scaffolding.
We look forward to discussing these findings with the community.