HCAI-Lab/dolma3-6t-sample-500-docs
Stratified working samples (5 sizes, seed=42) plus a 100K uniform-random preconditioner sample.
Note 500 docs/bin (288K docs, 539M tokens).
Note 1K docs/bin (575K docs, 1.08B tokens).
Note 5K docs/bin (2.86M docs, 5.3B tokens).
Note 10K docs/bin (5.68M docs, 10.5B tokens) — basis for SOC-156 TrackStar.
Note 50K docs/bin (26.2M docs, 62.8B tokens).
Note 100K uniform-random preconditioner sample.
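The difference between the stratified samples and the uniform-random one can be illustrated with a minimal sketch. It is not the collection's actual sampling code. The `(doc_id, bin_id)` input layout, the function names, and the rule that a bin with fewer than the requested number of documents contributes all of them are assumptions made for illustration. Only the per-bin sizes and seed=42 come from the notes above.

```python
import random
from collections import defaultdict


def stratified_sample(docs, docs_per_bin, seed=42):
    """Draw up to `docs_per_bin` documents from every bin.

    `docs` is a list of (doc_id, bin_id) pairs (hypothetical layout).
    Bins smaller than `docs_per_bin` contribute all their documents
    (an assumption, not confirmed by the source).
    """
    by_bin = defaultdict(list)
    for doc_id, bin_id in docs:
        by_bin[bin_id].append(doc_id)

    rng = random.Random(seed)
    sample = []
    # Visit bins in sorted order so the same seed gives the same draw.
    for bin_id in sorted(by_bin):
        members = by_bin[bin_id]
        k = min(docs_per_bin, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def uniform_sample(docs, n, seed):
    """Draw `n` documents uniformly at random, ignoring bins."""
    doc_ids = [doc_id for doc_id, _ in docs]
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(n, len(doc_ids)))


# Toy corpus: 20 documents spread over 3 bins (sizes 7, 7 and 6).
toy = [(f"doc{i}", i % 3) for i in range(20)]
picked = stratified_sample(toy, docs_per_bin=4)
capped = stratified_sample(toy, docs_per_bin=7)
```

Under the cap assumption, a sample's document count can fall below the per-bin size times the number of bins once some bins run out of documents, as `capped` shows on the toy corpus.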