arXiv:2603.02604

Heterogeneous Agent Collaborative Reinforcement Learning

Published on Mar 3 · Submitted by hzx on Mar 5

#2 Paper of the day

ByteDance


Authors: Zhixia Zhang, Zixuan Huang, Jianxin Li

Abstract

HACRL enables collaborative reinforcement learning where heterogeneous agents share verified rollouts during training to improve collectively while maintaining independent operation at inference time, with HACPO achieving superior performance through efficient sample utilization and cross-agent knowledge transfer.

AI-generated summary

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
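The core idea of rollout sharing can be illustrated with a minimal Python sketch. It is not HACPO: the abstract does not specify the four mechanisms, so none of them appear here. The sketch assumes only that each agent samples responses for the same prompt, that a verifier assigns a scalar reward, and that advantages are normalized over the pooled group. The names `Rollout`, `pooled_advantages`, and `training_batches` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    agent: str      # which agent sampled this response
    response: str   # the sampled text
    reward: float   # verifier score, e.g. 1.0 if the answer checks out


def pooled_advantages(rollouts, eps=1e-6):
    """Group-relative advantages over all rollouts for one prompt.

    Assumption: a plain mean/std normalization over the pooled group.
    The paper's actual estimator may differ.
    """
    rewards = [r.reward for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r.reward - mu) / (sigma + eps) for r in rollouts]


def training_batches(rollouts, agents):
    """Give every agent the full pooled batch of (source, response, advantage).

    The source tag lets a learner tell its own samples from borrowed ones,
    which is where an off-policy correction would be applied.
    """
    advs = pooled_advantages(rollouts)
    batch = [(r.agent, r.response, a) for r, a in zip(rollouts, advs)]
    return {agent: list(batch) for agent in agents}
```

With two agents sampling two responses each, every agent trains on four rollouts per prompt while paying for only two, which is the sample-utilization argument in simplified form. At inference time each agent runs alone, so nothing in this sketch needs to be deployed jointly.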


Community

hzxllll (paper author, paper submitter), about 15 hours ago

We formalize a new reinforcement learning paradigm, HACRL, that allows rollout sharing among heterogeneous agents. We also propose a new algorithm, HACPO, built on GSPO.
Project Page: https://zzx-peter.github.io/hacrl/

We are glad to share our code. Please feel free to contact us at 22376220@buaa.edu.cn