arxiv:2602.07824

Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Published on Feb 8 · Submitted by SII-Tiantian Mi on Feb 17

Abstract

AI-generated summary: Data Darwinism presents a systematic framework for data-model co-evolution through a ten-level taxonomy, demonstrating that advanced processing techniques significantly improve foundation model performance on scientific text.

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36 total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.

Community

Post from the paper submitter (SII-Tiantian Mi):

Data Darwinism

The quality of training data fundamentally determines foundation model performance, yet the field lacks systematic frameworks for data processing. We introduce Data Darwinism, a ten-level hierarchical taxonomy (L0-L9) that organizes data transformations along three axes: from selection to generation, from preservation to transformation, and from human-centric to machine-driven processing. This framework conceptualizes data as co-evolving with models: advanced models enable sophisticated processing, which in turn produces superior training data for next-generation systems.
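
To make the hierarchy concrete, here is a minimal sketch of how the levels could be represented in a processing pipeline. Only the level names stated on this page (L4 "Generative Refinement", L5 "Cognitive Completion") are taken from the paper; the data structure, the helper function, and the description of L0 as raw data are illustrative assumptions, not part of the released artifacts.

```python
# Illustrative representation of the L0-L9 hierarchy for a data pipeline.
# Unnamed levels are deliberately left as None rather than guessed.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingLevel:
    index: int         # position in the L0-L9 hierarchy
    name: str | None   # None where this page does not state a name
    description: str = ""


KNOWN_LEVELS = {
    0: ProcessingLevel(0, None, "raw, unprocessed source data"),
    4: ProcessingLevel(4, "Generative Refinement",
                       "remove noise, repair fragmentation"),
    5: ProcessingLevel(5, "Cognitive Completion",
                       "expand implicit reasoning, explicate terminology, "
                       "add pedagogical bridges"),
}


def highest_level(applied: list[int]) -> int:
    """Highest hierarchy level a document has passed through (L0 if none)."""
    return max(applied, default=0)
```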

We validate this framework on scientific literature, a conceptually dense domain that remains underutilized in open-source pre-training. We construct Darwin-Science, a 900B-token corpus implementing hierarchy levels L0-L5. Our key finding: raw scientific data suffers from a severe learnability gap, providing negligible gains despite its information density. We bridge this gap through L4 (Generative Refinement), which removes noise and repairs fragmentation, and L5 (Cognitive Completion), which uses frontier LLMs to expand implicit reasoning, explicate terminology, and add pedagogical bridges.
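
The L4 and L5 passes described above lend themselves to a simple two-stage rewriting loop over documents. The sketch below assumes an OpenAI-compatible chat client; the prompts, the model name (gpt-4o), and the function names are illustrative assumptions, not the paper's actual pipeline or prompts.

```python
# Hypothetical two-stage refinement pass (L4 then L5) over a single document.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

L4_PROMPT = (  # Generative Refinement: clean without adding new claims
    "Rewrite the following scientific text. Remove OCR noise and boilerplate, "
    "repair fragmented sentences, and keep all technical content unchanged."
)
L5_PROMPT = (  # Cognitive Completion: make implicit knowledge explicit
    "Rewrite the following scientific text for a strong graduate student. "
    "Spell out implicit reasoning steps, define specialized terminology on "
    "first use, and add brief bridging explanations between ideas."
)


def refine(text: str, prompt: str, model: str = "gpt-4o") -> str:
    """One refinement pass: the system prompt sets the transformation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


def l0_to_l5(raw_text: str) -> str:
    """Chain L4 then L5, mirroring the idea that higher levels build on lower ones."""
    return refine(refine(raw_text, L4_PROMPT), L5_PROMPT)
```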

We establish rigorous controlled experiments with Darwin-Science-Eval (150K expert-level questions) and daVinci-origin-3B/7B, models we pre-train entirely from scratch on 5.37T tokens that deliberately exclude scientific content. This substantial undertaking yields contamination-free baselines and lets us attribute gains unambiguously to data processing rather than checkpoint artifacts. After 600B tokens of continued pre-training, Darwin-Science outperforms the baselines by +2.12 (3B) and +2.95 (7B) points on 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned evaluation. Progressing through the hierarchy from L0 to L5 yields a +1.36 total gain, with L5 alone contributing +0.98, confirming that systematic ascension unlocks latent data value.
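
The aggregate "+X points" figures can be read as macro-averaged score differences between the Darwin-Science run and the contamination-free baseline. The sketch below shows that arithmetic; the benchmark names and scores are made up for illustration, and the paper's exact aggregation may differ.

```python
# Macro-average of per-benchmark deltas (illustrative only).
def mean_delta(treated: dict[str, float], baseline: dict[str, float]) -> float:
    """Average of (treated - baseline) over benchmarks present in both."""
    shared = treated.keys() & baseline.keys()
    return sum(treated[b] - baseline[b] for b in shared) / len(shared)


# Made-up numbers, not results from the paper.
baseline_scores = {"MMLU": 55.0, "GPQA": 28.0, "SciQ": 90.0}
darwin_scores = {"MMLU": 57.5, "GPQA": 31.0, "SciQ": 91.5}
print(f"average gain: +{mean_delta(darwin_scores, baseline_scores):.2f} points")
```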

We release Darwin-Science and daVinci-origin-3B/7B models to enable principled, co-evolutionary data-model development.
