arxiv:2605.08472

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Published on May 8

Submitted by Aswin Ravikumar Rangsasamy Veerasamy on May 20

Arizona State University

Abstract

Using diverse self-generated data during mid-training based on Polya's problem-solving approaches improves reinforcement learning performance in language models across mathematical reasoning and out-of-distribution tasks.

AI-generated summary

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches through self-generated data helps subsequent RL.

Community

rrvaswin

Paper submitter about 7 hours ago

Excited to share our new paper 🚀

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
We study a simple question: Can we make RL more effective by first teaching models multiple correct ways to solve the same problem?
Instead of reinforcing a single reasoning trajectory, can we expose the model to a richer space of valid approaches before RL begins?

Our investigation is simple.
Before RL, we mid-train the model on multiple correct ways of solving the same problem, so that when RL begins, it operates over a richer set of priors rather than a single narrow reasoning mode.
Importantly, these reasoning traces are self-generated by the same base model that is later trained with RL. No human-written chains of thought, and no distillation from a stronger teacher model.

To make the solutions diverse, we use problem-solving heuristics inspired by George Pólya's How to Solve It.
For each question, the model is prompted to solve it using different approaches: analogy, working backward, decomposition, introducing auxiliary elements, logical step-by-step justification, bright ideas, and more.
This gives us structurally distinct reasoning traces for the same underlying problem.

The generated solutions are filtered in two steps.
First, rule-based verification keeps only responses with the correct final answer.
Then, a reward model scores how well the response follows the intended heuristic.
The highest-scoring correct response per (question, heuristic) pair is selected, giving us multiple correct, heuristic-specific solution traces per question. 🧠
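
The two-step filter above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the `Candidate` record, the exact-match answer check, and the `heuristic_score` callable (standing in for the reward model) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    """One self-generated solution (hypothetical record layout)."""
    question_id: str
    heuristic: str      # e.g. "working backward"
    response: str       # full reasoning trace
    final_answer: str   # answer extracted from the trace


def select_traces(
    candidates: Iterable[Candidate],
    gold_answers: Dict[str, str],
    heuristic_score: Callable[[Candidate], float],
) -> Dict[Tuple[str, str], Candidate]:
    """Keep the best correct trace for each (question, heuristic) pair."""
    best: Dict[Tuple[str, str], Tuple[float, Candidate]] = {}
    for cand in candidates:
        # Step 1: rule-based verification (here: exact match, an assumption).
        if cand.final_answer.strip() != gold_answers[cand.question_id].strip():
            continue
        # Step 2: score how well the trace follows the intended heuristic.
        score = heuristic_score(cand)
        key = (cand.question_id, cand.heuristic)
        if key not in best or score > best[key][0]:
            best[key] = (score, cand)
    return {key: cand for key, (_, cand) in best.items()}


# Toy usage: response length stands in for a reward-model score.
pool = [
    Candidate("q1", "analogy", "short", "4"),
    Candidate("q1", "analogy", "a longer trace", "4"),
    Candidate("q1", "analogy", "the longest trace, but wrong", "5"),
    Candidate("q1", "working backward", "start from the goal", "4"),
]
selected = select_traces(pool, {"q1": "4"}, lambda c: float(len(c.response)))
```

Keying the result on the (question, heuristic) pair is what yields several correct, heuristic-specific traces per question rather than a single best answer.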

Why should this help RL?
Our theoretical view: mid-training on n correct approaches creates multiple high-probability continuations at reasoning branch points, an n-modal distribution.
Under a positive gradient, RL can meaningfully update across all n modes rather than sharpening a single one. Under a negative gradient, mass removed from the sampled approach redistributes to the remaining n-1 dominant modes, i.e., to the other valid approaches the model knows.
This is the mechanism by which RL learns to combine the approaches introduced during mid-training.
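
A toy numerical check of the negative-gradient claim, assuming a single softmax over next-step choices at one branch point. This is only an illustration under that simplification, not the paper's derivation; the logit values and learning rate are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_step(
    logits: List[float], action: int, advantage: float, lr: float = 1.0
) -> List[float]:
    """One REINFORCE-style step on the logits.

    For a softmax policy, d log pi(action) / d logit_j = 1[j == action] - pi_j.
    """
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if j == action else 0.0) - p)
        for j, (x, p) in enumerate(zip(logits, probs))
    ]


# Illustrative branch point: three high-probability "approaches" (modes)
# followed by five low-probability tail continuations.
logits = [2.0, 2.0, 2.0] + [-2.0] * 5
before = softmax(logits)

# The sampled approach (index 0) receives a negative advantage.
after = softmax(policy_gradient_step(logits, action=0, advantage=-1.0))
gain = [a - b for a, b in zip(after, before)]
```

Because each unsampled logit moves in proportion to its current probability, the other dominant modes absorb the mass removed from the penalized approach, while the tail continuations barely move. With a single mode there would be no such alternatives to absorb it.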

Empirically, this improves GRPO-based RL.
On Llama-3.2-3B-Instruct, models initialized with our heuristic-guided mid-training consistently outperform vanilla RL and STaR+RL across six math benchmarks, with gains becoming clearer at larger pass@k.
At pass@64, the average improves from 44.21 for vanilla RL to 48.09 with n=16. 📊
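
For readers unfamiliar with pass@k, here is a sketch of the commonly used unbiased estimator introduced with the HumanEval benchmark. The post does not say which estimator the paper uses, so treat this as an assumption; the parameter names are chosen to avoid clashing with the paper's n (number of approaches).

```python
from math import comb


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from num_samples generations, is correct."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Toy usage: 4 generations for one problem, 1 of them correct.
p1 = pass_at_k(num_samples=4, num_correct=1, k=1)
p2 = pass_at_k(num_samples=4, num_correct=1, k=2)
```

A benchmark-level score is then the mean of this quantity over problems. Larger k rewards models that keep several distinct solution paths alive, which matches the observation that gains become clearer at larger pass@k.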

One of our most interesting findings: RL doesn't just use the individual approaches from mid-training. It composes them.
We analyze reasoning traces using an LLM-based classifier across 64 Pólya-style heuristics. At n=16, RL-trained models combine multiple problem-solving approaches in 56.7% of chains, vs. only 23.3% before RL. This composition rate grows as n increases.
Combinations like Bolzano + Decompose or Restate + Decompose + Carry-Out emerge consistently after RL, even though they were never observed together during mid-training. RL is doing the composition. 🔗
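
A composition rate of this kind could be computed as below. This is a sketch that assumes the classifier returns a list of heuristic labels per reasoning chain; that output format and the toy labels are hypothetical, and only the heuristic names come from the post.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def composition_stats(
    chain_labels: Iterable[List[str]],
) -> Tuple[float, Counter]:
    """Return (share of chains using >= 2 distinct heuristics, combo counts)."""
    chains = [frozenset(labels) for labels in chain_labels]
    if not chains:
        raise ValueError("need at least one chain")
    # Sort labels so that order of appearance does not split a combination.
    combos = Counter(tuple(sorted(c)) for c in chains if len(c) >= 2)
    rate = sum(combos.values()) / len(chains)
    return rate, combos


# Toy usage with made-up classifier outputs.
labels = [
    ["Decompose"],
    ["Bolzano", "Decompose"],
    ["Decompose", "Bolzano"],
    ["Restate", "Decompose", "Carry-Out"],
]
rate, combos = composition_stats(labels)
```

Comparing the combinations counted after RL against those present in the mid-training data is one way to check whether a pairing is new.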

Four additional findings from our analysis:
Under a fixed instance-level budget, 16 approaches on 463 questions outperform 1 approach on 7,408 questions, a relative improvement of around 7% after RL. During mid-training, learning more problem-solving approaches is therefore more beneficial than learning to solve more problems.

Correctness vs. diversity. Diverse but incorrect reasoning traces fall below vanilla RL, and performance degrades further as the number of incorrect problem-solving approaches grows. Diversity alone is not enough; correctness is pivotal.

More diverse than distillation. Our self-generated data scores Vendi 13.81 vs. 10.95 for QwQ-32B distillation, and gives better post-RL performance despite coming from a much weaker model.

Generalizes beyond math. Despite math-centric heuristics, gains on HumanEval (code) and MuSR (narrative reasoning) show that Pólya's problem-solving approaches transfer.
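
The fixed-budget comparison in the first finding holds the number of training instances (question, solution) constant: 16 × 463 = 7,408 = 1 × 7,408. A small sketch of that bookkeeping, where the helper name is hypothetical:

```python
def questions_for_budget(instance_budget: int, approaches_per_question: int) -> int:
    """Number of questions affordable when each one contributes
    approaches_per_question training instances."""
    if approaches_per_question <= 0:
        raise ValueError("approaches_per_question must be positive")
    return instance_budget // approaches_per_question


BUDGET = 7408  # total mid-training instances in both settings from the post
configs = {n: questions_for_budget(BUDGET, n) for n in (1, 16)}
```

Both settings see the same number of training instances, so the comparison isolates breadth of approaches from breadth of questions.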

Takeaway:
RL performance depends not only on the RL stage itself, but also on the distribution the model is exposed to beforehand.
Mid-training on diverse, self-generated, correct reasoning traces improves subsequent RL, and the effect is driven by RL learning to compose the approaches introduced during mid-training.