HCAI-Lab's Collections
Dolma3 — Working Samples + Preconditioner

updated about 8 hours ago

Stratified working samples in five sizes (seed=42), plus a 100K uniform-random preconditioner sample.
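The page does not say how the bins are defined or how the samples were drawn, so the following is only a minimal sketch of what "N docs/bin, seed=42" stratified sampling could look like. The `bin_of` function, the `bin` field, and the toy corpus are hypothetical stand-ins, not part of the published datasets.

```python
import random
from collections import defaultdict


def stratified_sample(docs, bin_of, docs_per_bin, seed=42):
    """Draw up to `docs_per_bin` documents from every bin, reproducibly."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample

    # Group documents by their bin key.
    by_bin = defaultdict(list)
    for doc in docs:
        by_bin[bin_of(doc)].append(doc)

    sample = []
    # Visit bins in sorted order so the result does not depend on input order of bins.
    for key in sorted(by_bin):
        members = by_bin[key]
        # A bin smaller than the cap contributes all of its documents.
        k = min(docs_per_bin, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy corpus with a hypothetical integer "bin" field: bins 0, 1, 2 hold 10 docs each.
corpus = [{"id": i, "bin": i % 3} for i in range(30)]
picked = stratified_sample(corpus, lambda d: d["bin"], docs_per_bin=4)
```

Capping each bin at a fixed count, rather than sampling a fixed fraction, keeps small bins represented; it also means the total document count can fall short of `docs_per_bin` times the number of bins when some bins are smaller than the cap.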

  • HCAI-Lab/dolma3-6t-sample-500-docs
    Updated about 8 hours ago • 20
    Note: 500 docs/bin (288K docs, 539M tokens).

  • HCAI-Lab/dolma3-6t-sample-1000-docs
    Updated about 8 hours ago • 538
    Note: 1K docs/bin (575K docs, 1.08B tokens).

  • HCAI-Lab/dolma3-6t-sample-5000-docs
    Updated about 8 hours ago • 11.5k
    Note: 5K docs/bin (2.86M docs, 5.3B tokens).

  • HCAI-Lab/dolma3-6t-sample-10000-docs
    Preview • Updated about 8 hours ago • 19
    Note: 10K docs/bin (5.68M docs, 10.5B tokens); basis for SOC-156 TrackStar.

  • HCAI-Lab/dolma3-6t-sample-50000-docs
    Preview • Updated about 8 hours ago • 185
    Note: 50K docs/bin (26.2M docs, 62.8B tokens).

  • HCAI-Lab/dolma3-6t-preconditioner-100k
    Updated about 8 hours ago • 258 • 1
    Note: 100K uniform-random preconditioner sample.
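
As a quick sanity check on the notes above, the rounded figures can be turned into an implied bin count (docs divided by docs/bin) and an average document length (tokens divided by docs). This is a rough sketch that uses only the rounded numbers quoted on this page, so the results are approximate; the variable and function names are illustrative.

```python
# (docs per bin, total docs, total tokens), copied from the rounded figures in the notes.
samples = {
    "dolma3-6t-sample-500-docs": (500, 288_000, 539_000_000),
    "dolma3-6t-sample-1000-docs": (1_000, 575_000, 1_080_000_000),
    "dolma3-6t-sample-5000-docs": (5_000, 2_860_000, 5_300_000_000),
    "dolma3-6t-sample-10000-docs": (10_000, 5_680_000, 10_500_000_000),
    "dolma3-6t-sample-50000-docs": (50_000, 26_200_000, 62_800_000_000),
}


def summarize(docs_per_bin, docs, tokens):
    """Return (implied number of full bins, average tokens per document)."""
    return docs / docs_per_bin, tokens / docs


summary = {name: summarize(*figures) for name, figures in samples.items()}

for name, (bins, tokens_per_doc) in summary.items():
    print(f"{name}: ~{bins:.0f} bins, ~{tokens_per_doc:.0f} tokens/doc")
```

The implied bin count drifts downward as the per-bin cap grows (from about 576 at 500 docs/bin to about 524 at 50K docs/bin). One possible reading is that some bins hold fewer documents than the larger caps, but the page does not confirm this, and rounding in the quoted totals also contributes.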
