arxiv:2605.08472

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Published on May 8
Submitted by Aswin Ravikumar Rangsasamy Veerasamy on May 20

Abstract

Using diverse self-generated data during mid-training based on Polya's problem-solving approaches improves reinforcement learning performance in language models across mathematical reasoning and out-of-distribution tasks.

AI-generated summary

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches through self-generated data helps subsequent RL.

Community

Paper submitter

Excited to share our new paper 🚀

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
We study a simple question: Can we make RL more effective by first teaching models multiple correct ways to solve the same problem?
Instead of reinforcing a single reasoning trajectory, can we expose the model to a richer space of valid approaches before RL begins?

Our investigation is simple.
Before RL, we mid-train the model on multiple correct ways of solving the same problem, so that when RL begins, it operates over a richer set of priors rather than a single narrow reasoning mode.
Importantly, these reasoning traces are self-generated by the same base model that is later trained with RL. No human-written chains of thought, and no distillation from a stronger teacher model.

To make the solutions diverse, we use problem-solving heuristics inspired by George Pólya's How to Solve It.
For each question, the model is prompted to solve it using different approaches: analogy, working backward, decomposition, introducing auxiliary elements, logical step-by-step justification, bright ideas, and more.
This gives us structurally distinct reasoning traces for the same underlying problem.

The generated solutions are filtered in two steps.
First, rule-based verification keeps only responses with the correct final answer.
Then, a reward model scores how well the response follows the intended heuristic.
The highest-scoring correct response per (question, heuristic) pair is selected, giving us multiple correct, heuristic-specific solution traces per question.
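
The two-step filter above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Candidate` record, its field names, the exact-match answer check, and the `adherence_score` field (standing in for the reward model's heuristic-adherence score) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One self-generated response (hypothetical record layout)."""
    question_id: str
    heuristic: str          # e.g. "analogy", "working backward"
    response: str
    final_answer: str       # answer extracted from the response
    adherence_score: float  # assumed reward-model score for heuristic adherence


def select_traces(candidates, gold_answers):
    """Keep the best correct response for each (question, heuristic) pair."""
    best = {}
    for cand in candidates:
        # Step 1: rule-based verification (assumed here to be exact match).
        if cand.final_answer.strip() != gold_answers[cand.question_id].strip():
            continue
        # Step 2: among correct responses, keep the highest adherence score.
        key = (cand.question_id, cand.heuristic)
        if key not in best or cand.adherence_score > best[key].adherence_score:
            best[key] = cand
    return best
```

A question with no correct response for a given heuristic simply contributes no trace for that pair, so the number of traces per question can be smaller than the number of heuristics prompted.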

Why should this help RL?
Our theoretical view: mid-training on n correct approaches creates multiple high-probability continuations at reasoning branch points, i.e., an n-modal distribution.
Under a positive gradient, RL can meaningfully update across all n modes rather than sharpening a single one. Under a negative gradient, mass removed from the sampled approach redistributes to the remaining n-1 dominant modes, i.e., to the other valid approaches the model knows.
This is the mechanism by which RL learns to combine the approaches introduced during mid-training.
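
The negative-gradient case can be illustrated with a toy softmax policy at a single branch point. This is only a sketch under simplifying assumptions, not the paper's derivation: the logits are treated as free parameters, there are three equally likely "approach" modes plus three low-probability tail continuations, and the update is one plain policy-gradient step. All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, sampled, advantage, lr=1.0):
    """One ascent step on advantage * log p[sampled] with respect to the logits.

    For a softmax policy, d log p[sampled] / d logit[j] = 1[j == sampled] - p[j].
    """
    p = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if j == sampled else 0.0) - p[j])
        for j, z in enumerate(logits)
    ]


# Three dominant modes (indices 0-2) and three tail continuations (indices 3-5).
logits = [2.0, 2.0, 2.0, -2.0, -2.0, -2.0]
before = softmax(logits)
# The sampled approach (index 0) receives a negative advantage.
after = softmax(policy_gradient_step(logits, sampled=0, advantage=-1.0))

for j, (b, a) in enumerate(zip(before, after)):
    print(f"continuation {j}: {b:.4f} -> {a:.4f}")
```

Because each unsampled logit rises in proportion to its current probability, the two remaining modes absorb the probability removed from the sampled one in this toy setting, while the tail continuations do not gain.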

Empirically, this improves GRPO-based RL.
On Llama-3.2-3B-Instruct, models initialized with our heuristic-guided mid-training consistently outperform vanilla RL and STaR+RL across six math benchmarks, with gains becoming clearer at larger pass@k.
At pass@64, the average improves from 44.21 for vanilla RL to 48.09 with n=16. 📊
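
For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator introduced with the HumanEval benchmark. The post does not say which estimator the paper uses, so treating this as the relevant formula is an assumption. The parameter names are chosen to avoid a clash with the post's n, which counts approaches rather than samples.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Unbiased estimate of pass@k from num_samples generations per problem.

    pass@k = 1 - C(num_samples - num_correct, k) / C(num_samples, k),
    i.e. one minus the probability that k samples drawn without
    replacement are all incorrect.
    """
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    if num_samples - num_correct < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Example: 4 samples, 1 correct.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 2))  # 0.5
```

A benchmark score is then the mean of this quantity over all problems, which is why gains at larger k reflect broader coverage of correct solutions rather than higher single-sample accuracy.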

One of our most interesting findings: RL doesn't just use the individual approaches from mid-training. It composes them.
We analyze reasoning traces using an LLM-based classifier across 64 Pólya-style heuristics. At n=16, RL-trained models combine multiple problem-solving approaches in 56.7% of chains, vs. only 23.3% before RL. This composition rate grows as n increases.
Combinations like Bolzano + Decompose or Restate + Decompose + Carry-Out emerge consistently after RL, even though they were never observed together during mid-training. RL is doing the composition. 🔗
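
One plausible way to compute such a composition rate from classifier output is sketched below. The definition used here (a chain counts as composed when it is assigned at least two distinct heuristic labels) and the input layout are assumptions for illustration, and the paper's exact criterion may differ.

```python
def composition_rate(chain_labels):
    """Fraction of reasoning chains labeled with two or more distinct heuristics.

    chain_labels: one list of heuristic labels per reasoning chain, as an
    LLM-based classifier might return them (hypothetical format).
    """
    if not chain_labels:
        raise ValueError("need at least one chain")
    composed = sum(1 for labels in chain_labels if len(set(labels)) >= 2)
    return composed / len(chain_labels)


chains = [
    ["Decompose"],
    ["Bolzano", "Decompose"],
    ["Restate", "Decompose", "Carry-Out"],
    ["Decompose", "Decompose"],  # repeated label, still a single approach
]
print(composition_rate(chains))  # 0.5
```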


Four additional findings from our analysis:
Under a fixed instance-level budget, 16 approaches on 463 questions outperform 1 approach on 7,408 questions (16 × 463 = 7,408 training instances in both cases), with around 7% relative improvement after RL. During mid-training, learning more problem-solving approaches is therefore more beneficial than learning to solve more problems.

Correctness vs. diversity. Mid-training on diverse but incorrect reasoning traces falls below vanilla RL, and performance worsens further as more incorrect problem-solving approaches are added. Diversity alone is not enough; correctness is pivotal.

More diverse than distillation. Our self-generated data scores Vendi 13.81 vs. 10.95 for QwQ-32B distillation, and gives better post-RL performance despite coming from a much weaker model.

Generalizes beyond math. Despite math-centric heuristics, gains on HumanEval (code) and MuSR (narrative reasoning) show that Pólya's problem-solving approaches transfer.


Takeaway:
RL performance depends not only on the RL stage itself, but also on the distribution the model is exposed to beforehand.
Mid-training on diverse, self-generated, correct reasoning traces improves subsequent RL, and the effect is driven by RL learning to compose the approaches introduced during mid-training.
