arxiv:2604.14142

From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

Published on Apr 15 · Submitted by TanYuQiao on Apr 16

Abstract

AI-generated summary

PreRL applies reward-driven online updates to the marginal distribution P(y) in pre-train space, while DSRL uses an NSR-PreRL warmup to expand the reasoning horizon before standard RL fine-tuning.

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.

Community

Paper submitter

We’re excited to share our new paper: From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space. https://arxiv.org/abs/2604.14142

Most existing RL for LLM reasoning optimizes the policy conditioned on a given question, i.e., P(y|x).
In this work, we ask a different question:

Can we directly optimize reasoning trajectories themselves in pre-train space, instead of only optimizing them conditioned on a specific problem?

Our motivation is simple: if reasoning knowledge is already internalized in the model, then optimizing trajectories at the level of P(y) may provide a way to shape that internalized reasoning space more directly.
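
For intuition, the contrast can be written as two REINFORCE-style update directions. This is a sketch in our own notation, not verbatim from the paper:

    \nabla_\theta J_{\mathrm{RLVR}}  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big]
    \nabla_\theta J_{\mathrm{PreRL}} = \mathbb{E}_{y \sim \pi_\theta}\big[\, r(y)\, \nabla_\theta \log \pi_\theta(y) \,\big]

The paper's gradient-alignment result says these two directions are strongly correlated, which is what licenses treating updates to log P(y) as a surrogate for updates to log P(y|x).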

We introduce PreRL, which applies reward-driven online updates in pre-train space, and find a surprising result:

Negative Sample Reinforcement (NSR) is especially effective.
Instead of reinforcing only correct trajectories, pruning incorrect ones in pre-train space can strongly stimulate reasoning behaviors and provide a better foundation for subsequent RL.
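
To make the NSR idea concrete, here is a minimal sketch of a negative-sample update in pre-train space, assuming a Hugging Face-style causal LM; the function and variable names are ours, not from the paper's repository, and the actual PreRL objective may differ:

    import torch

    def nsr_prerl_step(model, optimizer, wrong_token_ids, attention_mask):
        """One NSR-PreRL step: push down log P(y) for incorrect trajectories.

        wrong_token_ids: (batch, seq) token ids of full reasoning trajectories
        whose final answers failed the verifier. Note there is no prompt
        conditioning: each trajectory is scored as an unconditional sequence.
        (In a real run, padding positions in labels should be set to -100.)
        """
        outputs = model(input_ids=wrong_token_ids,
                        attention_mask=attention_mask,
                        labels=wrong_token_ids)
        # outputs.loss is the mean negative log-likelihood, i.e. -log P(y) / T.
        # Negating it and descending therefore *lowers* P(y) on bad trajectories.
        loss = -outputs.loss
        optimizer.zero_grad()
        loss.backward()
        # Unbounded unlikelihood updates can diverge, so clip aggressively.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

Correct trajectories are simply left alone in this variant; the reported finding is that pruning alone already reshapes P(y) enough to stimulate reflective behavior.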

Building on this, we propose DSRL, which first performs an NSR-PreRL warmup and then switches to standard RL.
Across benchmarks, this yields better reasoning performance, stronger exploration, and improved efficiency.
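
As a rough sketch, the two-stage schedule could look like the following, where each step callable is assumed to sample its own batch and apply one optimizer update; `nsr_step` would wrap something like the NSR sketch above, and `rlvr_step` stands in for any standard verifiable-reward RL update (e.g. PPO- or GRPO-style). Neither name is taken from the paper's code:

    def train_dsrl(nsr_step, rlvr_step, warmup_steps=500, rl_steps=5000):
        """Hypothetical DSRL driver (Policy Reincarnation schedule)."""
        # Stage 1: prune incorrect reasoning subspaces of P(y) via NSR-PreRL.
        for _ in range(warmup_steps):
            nsr_step()
        # Stage 2: switch to prompt-conditioned RLVR on P(y|x)
        # for fine-grained optimization.
        for _ in range(rl_steps):
            rlvr_step()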

Code: https://github.com/Trae1ounG/Pretrain_Space_RLVR

