arxiv:2606.05922

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Published on Jun 4

· Submitted by

Wenbo Pan on Jun 10

#3 Paper of the day

Microsoft Research

Abstract

Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing the agent harness from past trajectories alone, using diverse task selection, parallel re-solving, and self-validation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
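The final selection step, choosing one harness update by pairwise self-preference, can be sketched as a round-robin comparison. This is a minimal illustration, not the authors' implementation: the function name `select_by_pairwise_preference`, the `prefer` judge callable, the both-orders comparison, and the tie-breaking rule are all assumptions introduced here. In RHO the judgment would come from the agent itself; in this sketch any callable stands in for it.

```python
from itertools import combinations
from typing import Callable, Sequence


def select_by_pairwise_preference(
    candidates: Sequence[str],
    prefer: Callable[[str, str], int],
) -> str:
    """Return the candidate with the most round-robin wins.

    `prefer(a, b)` is a hypothetical judge that returns 0 if `a` is
    preferred and 1 if `b` is preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Query both orders to dampen position bias
        # (an assumption of this sketch, not stated in the source).
        first = prefer(candidates[i], candidates[j])
        second = prefer(candidates[j], candidates[i])
        wins[i if first == 0 else j] += 1
        wins[j if second == 0 else i] += 1

    # Highest win count first; ties go to the earliest candidate.
    best = max(range(len(candidates)), key=lambda k: (wins[k], -k))
    return candidates[best]
```

With a toy judge that prefers the longer string, `select_by_pairwise_preference(["a", "abc", "ab"], judge)` picks `"abc"`. A round-robin costs a number of judge calls quadratic in the number of candidates, which is acceptable only when the candidate pool is small.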

Community

wenbopan

Paper submitter about 12 hours ago

edited about 10 hours ago

RHO (Retrospective Harness Optimization) improves an LLM agent's harness (its skills, tools, and workflows) using only the agent's own past trajectories, with no ground-truth validation set. It selects a difficulty-diverse coreset of past tasks with a determinantal point process (DPP), re-solves each task in parallel, diagnoses failures via self-validation and self-consistency, and picks among candidate harness updates by pairwise self-preference. A single optimization round improves the SWE-Bench Pro pass rate from 59% to 78% without any external grading, with consistent gains on Terminal-Bench 2 and GAIA-2.
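The coreset selection step can be illustrated with a greedy approximation. The sketch below assumes that "DPP" means a determinantal point process with a quality-diversity kernel L[i][j] = d_i * sim(i, j) * d_j, where d_i is a per-task difficulty score and sim is cosine similarity between task embeddings. The comment does not say how RHO builds its kernel or samples from it, so the inputs `embeddings` and `difficulty`, the function name `greedy_dpp_select`, and the greedy log-determinant maximization are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def greedy_dpp_select(
    embeddings: Sequence[Sequence[float]],
    difficulty: Sequence[float],
    k: int,
) -> List[int]:
    """Greedily pick up to k task indices that are both hard and dissimilar."""
    n = len(embeddings)

    # Normalize embeddings so the dot product is cosine similarity.
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])

    def kernel(i: int, j: int) -> float:
        sim = sum(a * b for a, b in zip(unit[i], unit[j]))
        return difficulty[i] * sim * difficulty[j]

    # gain[i] is the marginal determinant gain of adding task i.
    gain = [kernel(i, i) for i in range(n)]
    # chol[i] holds task i's row of an incrementally built Cholesky factor.
    chol: List[List[float]] = [[] for _ in range(n)]
    selected: List[int] = []
    remaining = set(range(n))

    while remaining and len(selected) < k:
        j = max(sorted(remaining), key=lambda i: gain[i])
        if gain[j] <= 1e-12:
            break  # everything left is redundant with the current selection
        selected.append(j)
        remaining.remove(j)
        root = math.sqrt(gain[j])
        for i in remaining:
            dot = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel(j, i) - dot) / root
            chol[i].append(e)
            gain[i] -= e * e  # similar tasks lose most of their gain

    return selected
```

For example, with embeddings `[[1, 0], [1, 0], [0, 1]]` and difficulty scores `[1.0, 0.9, 0.5]`, asking for two tasks returns indices `[0, 2]`: the second task is nearly as hard as the first but is a duplicate of it, so the easier yet distinct third task is chosen instead.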