# E2H Reasoning
Official model release of the paper "Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning".
## Paper Description
Recent RL post-trained models demonstrate strong reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is often less effective due to significant distribution gaps and sparse rewards. To address this, we introduce E2H Reasoner, a Curriculum Reinforcement Learning (CRL) approach that schedules tasks probabilistically from easy to hard, allowing Large Language Models (LLMs) to build reasoning skills gradually.
E2H Reasoner first decomposes the training data into subsets of increasing difficulty (trivial, easy, medium, hard), then applies a probabilistic training scheduler (e.g., a Gaussian or Cosine sampler) that gradually shifts sampling mass from easier to harder subsets. This lets models acquire core skills without task forgetting or overfitting to trivial patterns (reward hacking). E2H Reasoner achieves state-of-the-art performance across multiple reasoning tasks, including Blocksworld, Countdown, MATH, AQUA, and GSM8K.
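To make the scheduling idea concrete, here is a minimal illustrative sketch (not the paper's implementation; the `sigma` value and function names are assumptions) of a Gaussian-style scheduler whose sampling distribution over the four difficulty buckets slides from easy to hard as training progresses:

```python
import math
import random

DIFFICULTIES = ["trivial", "easy", "medium", "hard"]

def gaussian_weights(step, total_steps, sigma=0.75):
    """Sampling weights over difficulty buckets at a given training step.

    The Gaussian's mean slides from the easiest bucket (index 0) to the
    hardest (index 3) as training progresses, so early batches are
    dominated by trivial/easy tasks and late batches by hard ones.
    NOTE: illustrative sketch; sigma and the linear mean schedule are
    assumptions, not the paper's exact hyperparameters.
    """
    mu = (step / total_steps) * (len(DIFFICULTIES) - 1)
    w = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2))
         for i in range(len(DIFFICULTIES))]
    total = sum(w)
    return [x / total for x in w]

def sample_difficulty(step, total_steps, rng=random):
    """Draw the difficulty bucket for the next training batch."""
    weights = gaussian_weights(step, total_steps)
    return rng.choices(DIFFICULTIES, weights=weights, k=1)[0]
```

A Cosine sampler would follow the same pattern with a cosine-shaped weighting in place of the Gaussian; the key property in both cases is that the curriculum is probabilistic, so harder tasks are occasionally sampled early and easier tasks are still revisited late, which mitigates forgetting.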
## Models (DAPO Variants)
This repository exclusively hosts the DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) model variants evaluated in our paper. Our ablation studies show that E2H and DAPO are highly complementary: E2H shapes the difficulty-conditioned sampling distribution from which DAPO resamples, and combining them yields the strongest reasoning performance while reducing the fraction of zero-advantage batches during training.
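The zero-advantage reduction can be pictured with a small sketch. In DAPO-style dynamic sampling, a prompt group whose rollouts all receive the same reward (all solved or all failed) has zero group-relative advantage and contributes no gradient signal, so such groups are filtered out and resampled. The function below is an illustrative assumption, not the paper's code:

```python
def filter_zero_advantage(groups):
    """Keep only prompt groups whose rollouts got non-identical rewards.

    Each group is a list of per-rollout rewards for one prompt. If every
    rollout received the same reward, the group-relative advantage is
    zero for all rollouts, so the group is dropped (to be resampled).
    Illustrative sketch of DAPO-style dynamic sampling.
    """
    return [g for g in groups if max(g) != min(g)]
```

For example, `filter_zero_advantage([[1, 1, 1], [1, 0, 1], [0, 0, 0]])` keeps only `[[1, 0, 1]]`. E2H's curriculum keeps task difficulty matched to the model's current ability, so fewer groups are uniformly solved or uniformly failed in the first place.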
Below is the list of available DAPO-trained models. (Note: E2H-G refers to the Gaussian sampler, and E2H-C refers to the Cosine sampler.)
### Available Model Weights
| Model Architecture | Scheduler | Dataset | Hugging Face Weights |
|---|---|---|---|
| Qwen 2.5 1.5B | E2H-G | Countdown | Link Here |
| Qwen 2.5 1.5B | E2H-C | MATH | Link Here |
| Qwen 2.5 1.5B | E2H-G | MATH | Link Here |
| Qwen 2.5 1.5B | Balanced | GSM8K | Link Here |
| Qwen 2.5 1.5B | E2H-C | GSM8K | Link Here |
| Qwen 2.5 1.5B | E2H-G | GSM8K | Link Here |
| Qwen 2.5 1.5B | Balanced | AQUA | Link Here |
| Qwen 2.5 1.5B | E2H-C | AQUA | Link Here |
| Qwen 2.5 1.5B | E2H-G | AQUA | Link Here |
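Since these are Qwen 2.5 1.5B checkpoints, they should load with the standard `transformers` causal-LM API. The repo id below is a placeholder, not an actual model id; substitute the weights link for the variant you want from the table:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id -- replace with a real repo id from the table above.
model_id = "ORG/e2h-qwen2.5-1.5b-gsm8k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```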
## Citation
If you find this work or the released models useful, please consider citing our paper:
    @article{parashar2025curriculum,
      title={Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning},
      author={Parashar, Shubham and Gui, Shurui and Li, Xiner and Ling, Hongyi and Vemuri, Sushil and Olson, Blake and Li, Eric and Zhang, Yu and Caverlee, James and Kalathil, Dileep and Ji, Shuiwang},
      journal={arXiv preprint arXiv:2506.06632},
      year={2025}
    }