arxiv:2602.17616

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Published on Feb 19

Authors:

Abstract

Stale rollouts make the policy-gradient estimator high-variance in asynchronous reinforcement learning. VCPO addresses this by dynamically scaling the learning rate with effective sample size (ESS) and applying a closed-form minimum-variance baseline, which stabilizes training.

AI-generated summary

Asynchronous reinforcement learning has become increasingly central to scaling LLM post-training, delivering major throughput gains by decoupling rollout generation from policy updates. However, widely used policy-gradient objectives such as REINFORCE and GRPO suffer under high asynchrony: stale rollouts produce heavy-tailed importance weights, so a small number of trajectories dominate updates and the policy-gradient estimator becomes markedly higher variance. Through systematic analysis on math, reasoning, and tool-use benchmarks, we find that this increasing variance is reliably predicted by collapsing effective sample size (ESS), which prior stabilization methods largely fail to address. Motivated by this diagnosis, we introduce Variance Controlled Policy Optimization (VCPO), a method that (i) dynamically scales the learning rate with ESS to dampen unreliable updates and (ii) applies a closed-form minimum-variance baseline for off-policy settings, without a critic model and adding minimal overhead. Empirically, across math and general reasoning benchmarks, this enables robustly stable asynchronous training compared to previous stabilization and algorithmic methods, even in highly off-policy regimes (128 steps off-policy). In a long-horizon, tool-use task, VCPO matches synchronous performance while delivering a 2.5× speedup in training time. Code is available at: https://github.com/mit-han-lab/vcpo
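
To make the ESS diagnostic in the abstract concrete, the sketch below computes the standard Kish effective sample size, ESS = (Σw)² / Σw², from per-trajectory log importance ratios and uses it to shrink a learning rate. This is an illustration only: the abstract does not state the scaling rule VCPO uses, so the linear rule `base_lr * ESS / N` and the function names `effective_sample_size` and `ess_scaled_lr` are assumptions and are not taken from the linked repository.

```python
import math


def effective_sample_size(log_ratios):
    """Kish ESS = (sum w)^2 / (sum w^2), with w = exp(log_ratio).

    log_ratios: per-trajectory log(pi_current / pi_behavior).
    """
    # ESS is invariant to rescaling the weights, so subtracting the max
    # log-ratio avoids overflow without changing the result.
    m = max(log_ratios)
    weights = [math.exp(x - m) for x in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def ess_scaled_lr(base_lr, log_ratios):
    """Hypothetical rule (an assumption): scale lr by the ESS fraction."""
    n = len(log_ratios)
    return base_lr * effective_sample_size(log_ratios) / n


# On-policy batch: all weights are equal, so ESS equals the batch size.
on_policy = [0.0, 0.0, 0.0, 0.0]
# Stale batch: one trajectory dominates, so ESS collapses toward 1.
stale = [0.0, -100.0, -100.0, -100.0]

lr_on_policy = ess_scaled_lr(1e-5, on_policy)
lr_stale = ess_scaled_lr(1e-5, stale)
```

With equal weights the learning rate is left unchanged, while the stale batch, where a single trajectory carries nearly all the weight, has its step cut to roughly a quarter of the base rate. That is the qualitative behavior the abstract describes as dampening unreliable updates.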
