# E2H Reasoning
Official model release of the paper "Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning".
## Paper Description
Recent RL post-trained models demonstrate strong reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is often less effective due to significant distribution gaps and sparse rewards. To address this, we introduce E2H Reasoner, a Curriculum Reinforcement Learning (CRL) approach that schedules tasks probabilistically from easy to hard, allowing Large Language Models (LLMs) to build reasoning skills gradually.
E2H Reasoner first decomposes the training data into subsets of increasing difficulty (trivial, easy, medium, hard), then applies a probabilistic training scheduler (e.g., a Gaussian or Cosine sampler) that gradually shifts sampling mass from easier to harder subsets. This lets models acquire core skills without task forgetting or overfitting to trivial patterns (reward hacking). E2H Reasoner achieves state-of-the-art performance across multiple reasoning tasks, including Blocksworld, Countdown, MATH, AQUA, and GSM8K.
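To make the scheduling idea concrete, here is a minimal illustrative sketch (not the paper's implementation; the `sigma` value and function names are assumptions) of a Gaussian-style scheduler whose sampling distribution over the four difficulty buckets slides from easy to hard as training progresses:

```python
import math
import random

DIFFICULTIES = ["trivial", "easy", "medium", "hard"]

def gaussian_weights(step, total_steps, sigma=0.75):
    """Sampling weights over difficulty buckets at a given training step.

    The Gaussian's mean slides from the easiest bucket (index 0) to the
    hardest (index 3) as training progresses, so early batches are
    dominated by trivial/easy tasks and late batches by hard ones.
    NOTE: illustrative sketch; sigma and the linear mean schedule are
    assumptions, not the paper's exact hyperparameters.
    """
    mu = (step / total_steps) * (len(DIFFICULTIES) - 1)
    w = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2))
         for i in range(len(DIFFICULTIES))]
    total = sum(w)
    return [x / total for x in w]

def sample_difficulty(step, total_steps, rng=random):
    """Draw the difficulty bucket for the next training batch."""
    weights = gaussian_weights(step, total_steps)
    return rng.choices(DIFFICULTIES, weights=weights, k=1)[0]
```

A Cosine sampler would follow the same pattern with a cosine-shaped weighting in place of the Gaussian; the key property in both cases is that the curriculum is probabilistic, so harder tasks are occasionally sampled early and easier tasks are still revisited late, which mitigates forgetting.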
## Models (DAPO Variants)
This repository exclusively hosts the DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) model variants evaluated in our paper. Our ablation studies show that E2H and DAPO are highly complementary: E2H shapes the difficulty-conditioned sampling distribution from which DAPO resamples, and combining them yields the strongest reasoning performance while reducing the fraction of zero-advantage batches during training.
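The zero-advantage reduction can be pictured with a small sketch. In DAPO-style dynamic sampling, a prompt group whose rollouts all receive the same reward (all solved or all failed) has zero group-relative advantage and contributes no gradient signal, so such groups are filtered out and resampled. The function below is an illustrative assumption, not the paper's code:

```python
def filter_zero_advantage(groups):
    """Keep only prompt groups whose rollouts got non-identical rewards.

    Each group is a list of per-rollout rewards for one prompt. If every
    rollout received the same reward, the group-relative advantage is
    zero for all rollouts, so the group is dropped (to be resampled).
    Illustrative sketch of DAPO-style dynamic sampling.
    """
    return [g for g in groups if max(g) != min(g)]
```

For example, `filter_zero_advantage([[1, 1, 1], [1, 0, 1], [0, 0, 0]])` keeps only `[[1, 0, 1]]`. E2H's curriculum keeps task difficulty matched to the model's current ability, so fewer groups are uniformly solved or uniformly failed in the first place.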
Below is the list of available DAPO-trained models. (Note: E2H-G refers to the Gaussian sampler, and E2H-C refers to the Cosine sampler.)
### Available Model Weights
| Model Architecture | Scheduler | Dataset | Hugging Face Weights |
|---|---|---|---|
| Qwen 2.5 1.5B | E2H-G | Countdown | Link Here |
| Qwen 2.5 1.5B | E2H-C | MATH | Link Here |
| Qwen 2.5 1.5B | E2H-G | MATH | Link Here |
| Qwen 2.5 1.5B | Balanced | GSM8K | Link Here |
| Qwen 2.5 1.5B | E2H-C | GSM8K | Link Here |
| Qwen 2.5 1.5B | E2H-G | GSM8K | Link Here |
| Qwen 2.5 1.5B | Balanced | AQUA | Link Here |
| Qwen 2.5 1.5B | E2H-C | AQUA | Link Here |
| Qwen 2.5 1.5B | E2H-G | AQUA | Link Here |
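Since these are Qwen 2.5 1.5B checkpoints, they should load with the standard `transformers` causal-LM API. The repo id below is a placeholder, not an actual model id; substitute the weights link for the variant you want from the table:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id -- replace with a real repo id from the table above.
model_id = "ORG/e2h-qwen2.5-1.5b-gsm8k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```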
## Citation
If you find this work or the released models useful, please consider citing our paper:
    @article{parashar2025curriculum,
      title={Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning},
      author={Parashar, Shubham and Gui, Shurui and Li, Xiner and Ling, Hongyi and Vemuri, Sushil and Olson, Blake and Li, Eric and Zhang, Yu and Caverlee, James and Kalathil, Dileep and Ji, Shuiwang},
      journal={arXiv preprint arXiv:2506.06632},
      year={2025}
    }