Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
Abstract
Trace2Skill enables scalable skill generation for LLM agents by analyzing diverse execution traces in parallel and consolidating them into transferable, declarative skills without parameter updates or external modules.
Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks, yet manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable, trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions, extracts trajectory-specific lessons, and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains such as spreadsheet manipulation, VisionQA, and math reasoning show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to out-of-distribution settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills: no parameter updates, no external retrieval modules, and open-source models as small as 35B parameters.
Community
Nice one! Thx...
very interesting and insightful paper
🔥 Trace2Skill – Distilling Trajectory-Local Lessons into Transferable Agent Skills (arXiv:2603.25158)
The Problem:
Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex reasoning tasks, but manual authoring creates a severe scalability bottleneck. On the flip side, automated skill generation often produces fragile or fragmented results—either by relying on shallow parametric knowledge or by sequentially overfitting to local, non-generalizable trajectories.
The Solution: Trace2Skill
This paper introduces Trace2Skill, a framework that mimics how human experts author skills. Rather than sequentially reacting to individual agent trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to holistically analyze a diverse pool of execution experiences. It extracts trajectory-specific lessons and uses inductive reasoning to hierarchically consolidate them into a single, unified, conflict-free skill directory.
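As a rough mental model of that pipeline (a minimal sketch, not the authors' code; all function names and the toy "lesson" format are hypothetical, and the real system would make LLM calls where the stubs are):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trace(trace):
    """Stand-in for one sub-agent: extract a trajectory-local lesson (a patch)."""
    # In the real system this would be an LLM call over the full execution trace.
    return {"lesson": f"avoid `{trace['error']}`", "evidence": trace["task"]}

def consolidate(patches):
    """Stand-in for inductive consolidation: keep lessons that recur."""
    merged = {}
    for p in patches:
        merged.setdefault(p["lesson"], []).append(p["evidence"])
    # A lesson backed by several independent trajectories is treated as general;
    # one-off lessons are dropped as likely trajectory-local noise.
    return {lesson: ev for lesson, ev in merged.items() if len(ev) >= 2}

traces = [
    {"task": "t1", "error": "merged-cell ranges"},
    {"task": "t2", "error": "merged-cell ranges"},
    {"task": "t3", "error": "locale decimal separators"},
]

# A parallel fleet of analysts, then a single consolidation pass.
with ThreadPoolExecutor() as pool:
    patches = list(pool.map(analyze_trace, traces))
skill_directory = consolidate(patches)
```

The point of the sketch is the shape, not the heuristics: analysis is embarrassingly parallel per trace, and generalization happens only in the consolidation step that sees many patches at once.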
🌟 Key Highlights:
- Human-Like Skill Authoring: Builds broad prior knowledge through extensive, parallel trajectory analysis before drafting or deepening comprehensive skills.
- Massive Performance Jumps: Significantly improves upon strong baselines across challenging domains (Spreadsheets, VisionQA, Math Reasoning), even beating Anthropic's official xlsx skills.
- Cross-Model Transferability: Evolved skills generalize incredibly well across LLM scales and out-of-distribution settings! For example, declarative skills evolved purely by a 35B-parameter model (Qwen3.5-35B) improved a massive 122B agent by up to 57.65 absolute percentage points on WikiTableQuestions.
- Plug-and-Play: Achieves these results with no parameter updates and no external episodic retrieval modules needed at inference time.
🚧 Work in Progress & Future Work:
We note that this paper is currently a Work in Progress. While error-driven skill updates provide a reliable and safe learning signal, success-derived patches are far more volatile: although they can yield the highest performance gains, they drop below baseline when not filtered perfectly during the hierarchical merge. Future work will therefore focus on designing a more selective "success analyst" to filter and stabilize success-derived patches during skill distillation.
oh great, thanks for it!
is there any github repo? or any implementation guide?
Awesome work, really impressive
Since the paper notes that this is a work in progress, I thought I would jot down a few observations while the details are still fresh. I hope some of these turn out to be useful.
Merge batch size
B_merge is fixed at 32 with no ablation. This parameter controls how many patches the LLM sees at each merge step, and therefore what counts as a "recurring" pattern. It likely has a non-trivial effect on which lessons survive consolidation. Some sensitivity analysis here would strengthen the paper's claims about the consolidation mechanism.
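To make concrete what B_merge controls, here is a toy version of a batched hierarchical merge (hypothetical code, not from the paper; `merge_batch` stands in for one LLM merge call):

```python
def merge_batch(batch):
    """Stand-in for one LLM merge call: fold a batch of patches into one.
    Each patch is modeled as a set of lessons; the real merger would also
    resolve conflicts and decide which lessons count as recurring."""
    merged = set()
    for patch in batch:
        merged |= patch
    return merged

def hierarchical_merge(patches, b_merge=32):
    """Merge b_merge patches at a time until one skill directory remains.
    b_merge decides how many patches share one merge context, i.e. the
    window within which a pattern can be observed to recur."""
    assert patches and b_merge >= 2
    while len(patches) > 1:
        patches = [merge_batch(patches[i:i + b_merge])
                   for i in range(0, len(patches), b_merge)]
    return patches[0]
```

Under this framing the sensitivity concern is natural: a small b_merge means "recurring" is judged inside small windows, so globally common but locally rare lessons can be pruned before they ever co-occur in one batch.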
Stage-level attribution
When the 35B-authored skill hurts performance (DocVQA, −6.2 pp), the paper suggests a gap between task ability and reflective capacity. But the same 35B runs error diagnosis, patch writing, and patch merging. Without isolating these stages, it is unclear where the breakdown occurs. Misdiagnosis in Stage 2 would corrupt downstream merging regardless of the model's inductive ability.
The DocVQA reversal
The 35B > 122B task performance on DocVQA is interesting but appears only in this one domain, with no mention of seed averaging. The spreadsheet and math experiments do not reproduce this pattern. It may be worth tempering the broader conclusion drawn from it.
Human priors in the pipeline
The paper characterizes the system as self-contained, but the asymmetric analyst design, the merge prompt's prevalence bias rule, the programmatic guardrails, and the Anthropic style guidelines all embed significant human judgment about what good skills look like. This does not diminish the contribution; it just means the "no external teacher" framing could be stated more carefully.
That makes sense. The stage-level breakdown point is particularly intriguing. Do you think the merge step alone could account for most of the performance decline? A straightforward ablation that assesses each stage separately might make that clearer.
My guess is the opposite: Stage 2 diagnosis is probably the bigger bottleneck. The merge step just extracts commonalities from whatever patches it gets, so if those patches already contain bad causal attributions, even perfect merging will just reinforce them.
The paper's own §4.3 supports this. Single-call analysis over-attributed parse errors as root causes 57% of the time, and that was with the 122B model.
A simple cross-ablation could help clarify: feed 122B-authored patches into a 35B merger and vice versa. If 122B patches + 35B merging still produces a decent skill, that would pin the problem on diagnosis rather than merging.
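The proposed cross-ablation is just a 2x2 grid over which model authors the patches and which model merges them. A skeleton of the harness (entirely hypothetical names; the stub is where the actual Trace2Skill pipeline would plug in):

```python
from itertools import product

def run_condition(patch_author, merger):
    """Stub: author patches with one model, merge with the other,
    then evaluate the resulting skill on a held-out task set."""
    raise NotImplementedError  # plug in the actual pipeline here

models = ["qwen3.5-35b", "qwen3.5-122b"]
grid = list(product(models, models))  # (patch author, merger) conditions

# The two off-diagonal cells carry the signal: if (122B patches, 35B merger)
# yields a decent skill while (35B patches, 122B merger) does not, the
# bottleneck is diagnosis rather than merging.
off_diagonal = [(a, m) for a, m in grid if a != m]
```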
You make a valid point: if the causal attribution is wrong at the patch level, merging only amplifies the error. The cross-ablation idea is interesting; it would be worth seeing whether a stronger diagnosis alone can compensate for weaker merging across various tasks.
Get this paper in your agent:
hf papers read 2603.25158

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash