DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Abstract
A novel reinforcement learning approach for large language models that addresses the exploration-exploitation trade-off through perplexity-based sample partitioning and bidirectional reward allocation mechanisms.
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on the verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
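The two mechanisms named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual formulation: the median-based perplexity threshold, the shaping magnitude `eps`, and the additive bonus/penalty rule are all assumptions made for the sketch; the paper only states that the shaping has minimal impact on the verification rewards.

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-mean token log-probability) of a sampled response.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def disentangle(samples):
    # Split the sample space at the median perplexity (assumed threshold)
    # into an exploration (high-PPL) and an exploitation (low-PPL) subspace.
    ppls = sorted(s["ppl"] for s in samples)
    threshold = ppls[len(ppls) // 2]
    explore = [s for s in samples if s["ppl"] >= threshold]
    exploit = [s for s in samples if s["ppl"] < threshold]
    return explore, exploit, threshold

def bidirectional_reward(base_reward, ppl, threshold, eps=0.05):
    # Assumed shaping rule: a small bonus for high-PPL (exploration)
    # samples and a small penalty for low-PPL (exploitation) samples,
    # bounded by eps so the verifiable reward signal still dominates.
    return base_reward + eps if ppl >= threshold else base_reward - eps
```

Keeping `eps` small relative to the verifiable reward scale is what makes the shaping a tie-breaker between the two subspaces rather than a competing objective.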
Mathematical Reasoning Results
Qwen3-4B-Base
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 9.58 | 3.75 | 48.48 | 31.48 | 25.52 | 24.82 | 23.94 |
| GRPO | 26.67 | 23.33 | 85.83 | 60.24 | 53.06 | 44.39 | 48.92 |
| DAPO | 26.25 | 23.75 | 86.43 | 61.90 | 53.88 | 44.34 | 49.43 |
| DAPO w/ EL | 26.67 | 24.58 | 86.78 | 62.95 | 54.53 | 44.53 | 50.01 |
| CDE | 26.67 | 24.17 | 85.93 | 62.35 | 52.25 | 43.11 | 49.08 |
| DiPO (ours) | 29.17 | 24.58 | 87.00 | 63.70 | 54.09 | 44.76 | 50.55 |
Qwen3-8B-Base
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 8.33 | 9.17 | 66.78 | 39.46 | 33.30 | 66.78 | 37.30 |
| GRPO | 31.67 | 24.58 | 89.08 | 69.28 | 56.20 | 48.62 | 53.24 |
| DAPO | 30.08 | 25.83 | 89.43 | 69.12 | 56.90 | 48.02 | 53.23 |
| DAPO w/ EL | 33.75 | 25.42 | 89.58 | 69.87 | 57.21 | 47.56 | 53.90 |
| CDE | 31.67 | 26.25 | 89.35 | 68.07 | 57.11 | 47.75 | 53.37 |
| DiPO (ours) | 35.00 | 27.50 | 89.55 | 71.23 | 57.73 | 47.75 | 54.79 |
Qwen2.5-7B
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 7.08 | 2.08 | 41.53 | 22.74 | 19.28 | 17.19 | 18.32 |
| GRPO | 20.42 | 15.42 | 79.15 | 58.43 | 42.42 | 36.95 | 42.13 |
| DAPO | 20.42 | 16.67 | 79.08 | 59.94 | 42.70 | 37.55 | 42.73 |
| DAPO w/ EL | 20.00 | 14.58 | 79.85 | 58.73 | 43.05 | 39.65 | 42.64 |
| CDE | 20.00 | 15.00 | 79.00 | 55.87 | 42.94 | 35.94 | 41.46 |
| DiPO (ours) | 22.92 | 16.67 | 80.35 | 60.09 | 43.72 | 37.59 | 43.56 |
BFCLv3 Function Calling Results
Qwen2.5-3B-Instruct
| Method | Non-Live Acc | Live Acc | Multi Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 42.52 | 53.96 | 1.00 | 44.44 | 82.49 | 33.04 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 60.14 | 34.08 |
| SFT400+PPO | 78.29 | 58.76 | 5.12 | 100.00 | 48.40 | 45.80 |
| SFT400+GRPO | 76.21 | 64.15 | 1.75 | 94.44 | 58.63 | 46.42 |
| PPO, Cold Start | 82.42 | 67.78 | 4.88 | 100.00 | 18.09 | 51.15 |
| ToolRL+GRPO | 81.58 | 73.78 | 3.75 | 100.00 | 56.44 | 52.98 |
| ToolRL+DAPO | 82.19 | 69.43 | 8.00 | 81.25 | 57.60 | 53.21 |
| ToolRL+DiPO | 83.42 | 73.06 | 8.62 | 100.00 | 54.16 | 55.03 |
Qwen2.5-7B-Instruct
| Method | Non-Live Acc | Live Acc | Multi Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 66.02 | 53.51 | 4.25 | 76.47 | 62.66 | 41.97 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 8.11 | 34.08 |
| SFT400+PPO | 83.90 | 51.84 | 0.25 | 100.00 | 29.66 | 42.02 |
| SFT400+GRPO | 80.69 | 46.51 | 0.25 | 100.00 | 14.19 | 39.25 |
| PPO, Cold Start | 79.33 | 63.17 | 0.38 | 88.89 | 52.92 | 46.68 |
| ToolRL+GRPO | 86.17 | 74.90 | 18.12 | 83.33 | 76.68 | 58.38 |
| ToolRL+DAPO | 87.10 | 76.31 | 19.75 | 87.50 | 67.25 | 61.06 |
| ToolRL+DiPO | 86.21 | 76.83 | 24.50 | 87.50 | 69.57 | 62.51 |