Abstract
Generalist coding agents can automate the data-curation loop, but they need structured scaffolding to outperform strong published data-selection baselines.
Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
Community
Hi all. Quick summary of what we think is the interesting part:
Generalist coding agents (Claude Code, Codex, OpenHands with Kimi K2.5 / Qwen3.5-397B) can already run a full data-curation loop: inspect the pool, implement a selection policy, train, evaluate, revise. They match published data-selection baselines (ICONS, ARDS) within 10 iterations, recovering ~60% of the full-data fine-tuning gain from 1.5% of LLaVA-665K. The loop is not limited to instruction tuning: the same setup works for CLIP pretraining on DataComp-Small, where the agent clearly beats the strongest filtering baseline (top-30% CLIP L/14 score).
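To make the numbers above concrete, here is a minimal sketch of the budget arithmetic, one way to define "gain recovered", and a top-fraction score filter of the kind used as the DataComp-Small baseline. The function names, the nominal pool size of 665,000, and the benchmark scores in the usage lines are illustrative assumptions, not values or code from the paper.

```python
def subset_size(pool_size: int, fraction: float) -> int:
    """Number of examples in a subset covering `fraction` of the pool."""
    return round(pool_size * fraction)


def gain_recovered(base: float, subset: float, full: float) -> float:
    """Share of the full-data fine-tuning gain that the subset run achieves.

    Assumed definition: (subset - base) / (full - base), where `base` is the
    score before fine-tuning and `full` is the score after full-data tuning.
    """
    return (subset - base) / (full - base)


def top_fraction_by_score(scores: list[float], fraction: float) -> list[int]:
    """Indices of the highest-scoring `fraction` of examples, in index order."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# 1.5% of a nominal 665,000-example pool.
budget = subset_size(665_000, 0.015)

# Placeholder scores chosen only to illustrate a 60% recovery.
recovery = gain_recovered(base=50.0, subset=56.0, full=60.0)

# Keep the top 30% of ten hypothetical CLIP-similarity scores.
kept = top_fraction_by_score(
    [0.1, 0.9, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 0.0], 0.30
)
```

Under these assumptions the budget works out to roughly ten thousand examples, which is the scale at which every policy in the loop is trained and compared.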
But trajectory analysis shows what we call the execution-research gap: agents grind local knobs (source ratios, length thresholds, random seeds) instead of exploring new method families. In a typical open-prompt run, only 2/10 iterations try something genuinely new. Strategy guides and paper references don't fix it. A scaffold requiring each iteration to cite, instantiate, and adapt a method from prior research does: the agent composed an EL2N-style top-loss + noise-filter policy, with no human design input, that beats published baselines even when those baselines get 10x its data budget.
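For readers unfamiliar with this kind of policy, the sketch below shows one plausible reading of "EL2N-style top-loss + noise filter": score each example by the L2 norm of its prediction error, discard the very highest-scoring tail as suspected label noise, then keep the hardest remaining examples up to the budget. This is not the agent's actual policy; the function names, the `noise_fraction` parameter, and the classification-style scoring are assumptions made for illustration.

```python
import math


def el2n_score(probs: list[float], label: int) -> float:
    """L2 norm of (predicted probabilities - one-hot label) for one example."""
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def select_top_loss_with_noise_filter(
    scores: list[float], budget: int, noise_fraction: float = 0.05
) -> list[int]:
    """Pick the hardest examples after dropping a suspected-noise tail.

    `noise_fraction` (hypothetical parameter) is the share of highest-scoring
    examples treated as likely mislabeled and removed before selection.
    """
    # Rank examples from hardest (highest score) to easiest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Drop the extreme tail, which tends to contain noisy or corrupt samples.
    n_drop = int(len(scores) * noise_fraction)
    kept = order[n_drop:]
    # Keep the hardest remaining examples, up to the data budget.
    return kept[:budget]


# Hypothetical per-example scores; index 3 is an outlier treated as noise.
example_scores = [0.1, 0.9, 0.5, 0.99, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
selected = select_top_loss_with_noise_filter(
    example_scores, budget=3, noise_fraction=0.1
)
```

The design point is that the two steps pull in opposite directions: top-loss selection favors hard examples, while the noise filter guards against the hardest ones being hard only because they are wrong.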
One more result we find intriguing: curation search itself scales. Extending the agent budget from 10 to 50 iterations keeps improving average outcomes with no clear plateau. Agent search iterations look like a meaningful compute axis for the finite-data regime.
Environment, trajectory diagnostics, and all scaffolds are open source: https://github.com/feiyang-k/curation-bench. Happy to answer questions.