arxiv:2606.04261

Can Generalist Agents Automate Data Curation?

Published on Jun 2

Submitted by Adam Nguyen on Jun 11

Authors:

Adam Nguyen

Abstract

Generalist coding agents can run an automated data-curation loop, but they need structured scaffolding to outperform published data-selection baselines.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

Community

adamtrnguyen

Paper author Paper submitter about 2 hours ago

Hi all. Quick summary of what we think is the interesting part:

Generalist coding agents (Claude Code, Codex, OpenHands with Kimi K2.5 / Qwen3.5-397B) can already run a full data-curation loop: inspect the pool, implement a selection policy, train, evaluate, revise. They match published data-selection baselines (ICONS, ARDS) within 10 iterations, recovering ~60% of the full-data fine-tuning gain from 1.5% of LLaVA-665K. The loop is not limited to instruction tuning: the same setup works for CLIP pretraining on DataComp-Small, where the agent clearly beats the strongest filtering baseline (top-30% CLIP L/14 score).
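For readers who want the shape of that loop in code, here is a minimal sketch. It is not the Curation-Bench API: `run_curation_loop`, `make_length_policy`, `toy_propose`, and `toy_evaluate` are hypothetical names, the pool is synthetic, and the "evaluation" is a stand-in for the fixed train/evaluate pipeline. The only number taken from the post is the 1.5% budget.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Dict[str, int]
Policy = Callable[[Sequence[Example], int], List[int]]
History = List[Tuple[int, float]]


def run_curation_loop(
    pool: Sequence[Example],
    budget: int,
    propose: Callable[[History], Policy],
    evaluate: Callable[[List[int]], float],
    iterations: int = 10,
) -> Tuple[float, List[int], History]:
    """Propose a policy, select data, score it, and keep the best selection."""
    history: History = []
    best_score, best_selection = float("-inf"), []
    for step in range(iterations):
        policy = propose(history)                 # agent writes or revises a policy
        selection = policy(pool, budget)[:budget]  # never exceed the data budget
        score = evaluate(selection)               # stand-in for train + benchmark
        history.append((step, score))
        if score > best_score:
            best_score, best_selection = score, selection
    return best_score, best_selection, history


def make_length_policy(threshold: int) -> Policy:
    """Hypothetical policy: keep the first examples at or above a length threshold."""
    def policy(pool: Sequence[Example], budget: int) -> List[int]:
        keep = [i for i, ex in enumerate(pool) if ex["length"] >= threshold]
        return keep[:budget]
    return policy


def toy_propose(history: History) -> Policy:
    # Assumption: the "agent" just raises one threshold every iteration.
    return make_length_policy(threshold=10 * len(history))


# Synthetic pool; the budget is 1.5% of the pool, computed with integer math.
pool = [{"length": (i * 37) % 100} for i in range(1000)]
budget = len(pool) * 15 // 1000


def toy_evaluate(selection: List[int]) -> float:
    # Assumption: a toy score (mean normalized length), not a real benchmark.
    if not selection:
        return 0.0
    return sum(pool[i]["length"] for i in selection) / (100 * len(selection))


best_score, best_selection, history = run_curation_loop(
    pool, budget, toy_propose, toy_evaluate
)
```

Note that `toy_propose` only sweeps a single threshold, so this sketch shows the loop mechanics and nothing about how a real agent would choose its next policy.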

But trajectory analysis shows what we call the execution-research gap: agents grind local knobs (source ratios, length thresholds, random seeds) instead of exploring new method families. In a typical open-prompt run, only 2/10 iterations try something genuinely new. Strategy guides and paper references don't fix it. A scaffold requiring each iteration to cite, instantiate, and adapt a method from prior research does: the agent composed an EL2N-style top-loss + noise-filter policy, with no human design input, that beats published baselines even when those baselines get 10x its data budget.
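To make "EL2N-style top-loss + noise filter" concrete, here is a rough sketch of the general idea, not the policy the agent actually wrote (the post does not spell that out). It uses the standard classification form of EL2N, the L2 norm of the gap between predicted probabilities and the one-hot label, then drops the very highest-scoring examples as suspected noise and keeps the hardest of the rest. The function names and the `noise_fraction` parameter are assumptions for illustration.

```python
import math
from typing import List, Sequence


def el2n_scores(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """EL2N score per example: ||p - one_hot(y)||_2."""
    scores = []
    for p, y in zip(probs, labels):
        err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
        scores.append(math.sqrt(sum(e * e for e in err)))
    return scores


def select_top_with_noise_filter(
    scores: Sequence[float], budget: int, noise_fraction: float = 0.05
) -> List[int]:
    """Drop the highest-scoring fraction as likely noise, then keep the hardest rest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # hardest first
    n_drop = int(noise_fraction * len(scores))  # assumed noise-filter rule
    return order[n_drop:n_drop + budget]


# Tiny worked example: four examples, two classes, all labeled class 0.
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 0, 0, 0]
scores = el2n_scores(probs, labels)
selected = select_top_with_noise_filter(scores, budget=2, noise_fraction=0.25)
```

In this example the third example has the largest error and is treated as noise, so the selection is the next two hardest examples (indices 3 and 1).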

One more finding that intrigues us: curation search itself scales. Extending the agent budget from 10 to 50 iterations keeps improving average outcomes with no clear plateau. Agent search iterations look like a meaningful compute axis for the finite-data regime.

Environment, trajectory diagnostics, and all scaffolds are open source: https://github.com/feiyang-k/curation-bench. Happy to answer questions.
