
E2H Reasoning


Official model release of the paper "Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning".

📄 Paper Description

Recent RL post-trained models demonstrate strong reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is often less effective due to significant distribution gaps and sparse rewards. To address this, we introduce E2H Reasoner, a Curriculum Reinforcement Learning (CRL) approach that schedules tasks probabilistically from easy to hard, allowing Large Language Models (LLMs) to build reasoning skills gradually.

By first decomposing training datasets into subsets of increasing difficulty (trivial, easy, medium, hard) and applying probabilistic training schedulers (like Gaussian and Cosine samplers), E2H Reasoner helps models acquire core skills without suffering from task forgetting or overfitting to trivial patterns (reward hacking). E2H achieves state-of-the-art performance across multiple reasoning tasks, including Blocksworld, Countdown, MATH, AQUA, and GSM8K.
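To make the scheduling idea concrete, here is a minimal sketch of a Gaussian-style curriculum sampler over the four difficulty tiers. This is illustrative only, not the paper's implementation: the function names (`gaussian_weights`, `sample_tier`), the `sigma` value, and the linear schedule of the Gaussian mean are assumptions for the sake of the example.

```python
import math
import random

# Difficulty tiers, ordered easy to hard, as described in the paper.
DIFFICULTIES = ["trivial", "easy", "medium", "hard"]

def gaussian_weights(progress, sigma=0.6):
    """Weight each tier by a Gaussian whose mean slides from the
    easiest tier (progress=0.0) to the hardest (progress=1.0).

    `progress` is the fraction of training completed; `sigma` (assumed
    here) controls how much probability mass spills into nearby tiers,
    which keeps earlier tiers in play and mitigates task forgetting.
    """
    mean = progress * (len(DIFFICULTIES) - 1)
    w = [math.exp(-((i - mean) ** 2) / (2 * sigma ** 2))
         for i in range(len(DIFFICULTIES))]
    total = sum(w)
    return [x / total for x in w]

def sample_tier(progress, rng=random):
    """Draw a difficulty tier for the next training batch."""
    return rng.choices(DIFFICULTIES, weights=gaussian_weights(progress))[0]
```

Early in training (`progress` near 0) the sampler concentrates on trivial and easy problems; as training progresses, mass shifts toward medium and hard tiers while retaining a small probability of revisiting earlier ones.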

🚀 Models (DAPO Variants)

This repository exclusively hosts the model variants trained with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) that were evaluated in our paper. In our ablation studies, we demonstrate that E2H and DAPO are highly complementary: E2H shapes the difficulty-conditioned sampling distribution that DAPO resamples from, and combining them yields the strongest reasoning performance while reducing the fraction of batches with zero advantage during training.

Below is the list of available DAPO-trained models. (Note: E2H-G refers to the Gaussian sampler, and E2H-C refers to the Cosine sampler.)

Available Model Weights

| Model Architecture | Scheduler | Dataset   | Hugging Face Weights |
|--------------------|-----------|-----------|----------------------|
| Qwen 2.5 1.5B      | E2H-G     | Countdown | Link Here            |
| Qwen 2.5 1.5B      | E2H-C     | MATH      | Link Here            |
| Qwen 2.5 1.5B      | E2H-G     | MATH      | Link Here            |
| Qwen 2.5 1.5B      | Balanced  | GSM8K     | Link Here            |
| Qwen 2.5 1.5B      | E2H-C     | GSM8K     | Link Here            |
| Qwen 2.5 1.5B      | E2H-G     | GSM8K     | Link Here            |
| Qwen 2.5 1.5B      | Balanced  | AQUA      | Link Here            |
| Qwen 2.5 1.5B      | E2H-C     | AQUA      | Link Here            |
| Qwen 2.5 1.5B      | E2H-G     | AQUA      | Link Here            |

📚 Citation

If you find this work or the released models useful, please consider citing our paper:

@article{parashar2025curriculum,
  title={Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning},
  author={Parashar, Shubham and Gui, Shurui and Li, Xiner and Ling, Hongyi and Vemuri, Sushil and Olson, Blake and Li, Eric and Zhang, Yu and Caverlee, James and Kalathil, Dileep and Ji, Shuiwang},
  journal={arXiv preprint arXiv:2506.06632},
  year={2025}
}