DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Abstract
A novel reinforcement learning approach for large language models that addresses the exploration-exploitation trade-off through perplexity-based sample partitioning and bidirectional reward allocation mechanisms.
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on the verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
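The two mechanisms named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual formulation: the median-based perplexity threshold, the shaping magnitude `eps`, and the additive bonus/penalty rule are all assumptions made for the sketch; the paper only states that the shaping has minimal impact on the verification rewards.

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-mean token log-probability) of a sampled response.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def disentangle(samples):
    # Split the sample space at the median perplexity (assumed threshold)
    # into an exploration (high-PPL) and an exploitation (low-PPL) subspace.
    ppls = sorted(s["ppl"] for s in samples)
    threshold = ppls[len(ppls) // 2]
    explore = [s for s in samples if s["ppl"] >= threshold]
    exploit = [s for s in samples if s["ppl"] < threshold]
    return explore, exploit, threshold

def bidirectional_reward(base_reward, ppl, threshold, eps=0.05):
    # Assumed shaping rule: a small bonus for high-PPL (exploration)
    # samples and a small penalty for low-PPL (exploitation) samples,
    # bounded by eps so the verifiable reward signal still dominates.
    return base_reward + eps if ppl >= threshold else base_reward - eps
```

Keeping `eps` small relative to the verifiable reward scale is what makes the shaping a tie-breaker between the two subspaces rather than a competing objective.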
Mathematical Reasoning Results
Qwen3-4B-Base
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 9.58 | 3.75 | 48.48 | 31.48 | 25.52 | 24.82 | 23.94 |
| GRPO | 26.67 | 23.33 | 85.83 | 60.24 | 53.06 | 44.39 | 48.92 |
| DAPO | 26.25 | 23.75 | 86.43 | 61.90 | 53.88 | 44.34 | 49.43 |
| DAPO w/ EL | 26.67 | 24.58 | 86.78 | 62.95 | 54.53 | 44.53 | 50.01 |
| CDE | 26.67 | 24.17 | 85.93 | 62.35 | 52.25 | 43.11 | 49.08 |
| DiPO (ours) | 29.17 | 24.58 | 87.00 | 63.70 | 54.09 | 44.76 | 50.55 |
Qwen3-8B-Base
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 8.33 | 9.17 | 66.78 | 39.46 | 33.30 | 66.78 | 37.30 |
| GRPO | 31.67 | 24.58 | 89.08 | 69.28 | 56.20 | 48.62 | 53.24 |
| DAPO | 30.08 | 25.83 | 89.43 | 69.12 | 56.90 | 48.02 | 53.23 |
| DAPO w/ EL | 33.75 | 25.42 | 89.58 | 69.87 | 57.21 | 47.56 | 53.90 |
| CDE | 31.67 | 26.25 | 89.35 | 68.07 | 57.11 | 47.75 | 53.37 |
| DiPO (ours) | 35.00 | 27.50 | 89.55 | 71.23 | 57.73 | 47.75 | 54.79 |
Qwen2.5-7B
| Method | AIME24 | AIME25 | MATH | AMC | OLY | MIN | AVG |
|---|---|---|---|---|---|---|---|
| Base | 7.08 | 2.08 | 41.53 | 22.74 | 19.28 | 17.19 | 18.32 |
| GRPO | 20.42 | 15.42 | 79.15 | 58.43 | 42.42 | 36.95 | 42.13 |
| DAPO | 20.42 | 16.67 | 79.08 | 59.94 | 42.70 | 37.55 | 42.73 |
| DAPO w/ EL | 20.00 | 14.58 | 79.85 | 58.73 | 43.05 | 39.65 | 42.64 |
| CDE | 20.00 | 15.00 | 79.00 | 55.87 | 42.94 | 35.94 | 41.46 |
| DiPO (ours) | 22.92 | 16.67 | 80.35 | 60.09 | 43.72 | 37.59 | 43.56 |
BFCLv3 Function Calling Results
Qwen2.5-3B-Instruct
| Method | Non-Live Acc | Live Acc | Multi Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 42.52 | 53.96 | 1.00 | 44.44 | 82.49 | 33.04 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 60.14 | 34.08 |
| SFT400+PPO | 78.29 | 58.76 | 5.12 | 100.00 | 48.40 | 45.80 |
| SFT400+GRPO | 76.21 | 64.15 | 1.75 | 94.44 | 58.63 | 46.42 |
| PPO, Cold Start | 82.42 | 67.78 | 4.88 | 100.00 | 18.09 | 51.15 |
| ToolRL+GRPO | 81.58 | 73.78 | 3.75 | 100.00 | 56.44 | 52.98 |
| ToolRL+DAPO | 82.19 | 69.43 | 8.00 | 81.25 | 57.60 | 53.21 |
| ToolRL+DiPO | 83.42 | 73.06 | 8.62 | 100.00 | 54.16 | 55.03 |
Qwen2.5-7B-Instruct
| Method | Non-Live Acc | Live Acc | Multi Turn Acc | Relevance Detection | Irrelevance Detection | Overall |
|---|---|---|---|---|---|---|
| Instruct | 66.02 | 53.51 | 4.25 | 76.47 | 62.66 | 41.97 |
| SFT400 | 69.29 | 41.40 | 0.00 | 94.44 | 8.11 | 34.08 |
| SFT400+PPO | 83.90 | 51.84 | 0.25 | 100.00 | 29.66 | 42.02 |
| SFT400+GRPO | 80.69 | 46.51 | 0.25 | 100.00 | 14.19 | 39.25 |
| PPO, Cold Start | 79.33 | 63.17 | 0.38 | 88.89 | 52.92 | 46.68 |
| ToolRL+GRPO | 86.17 | 74.90 | 18.12 | 83.33 | 76.68 | 58.38 |
| ToolRL+DAPO | 87.10 | 76.31 | 19.75 | 87.50 | 67.25 | 61.06 |
| ToolRL+DiPO | 86.21 | 76.83 | 24.50 | 87.50 | 69.57 | 62.51 |