itsnotsplat's Collections

Pretraining

updated 19 days ago

General pretraining data for training a model from scratch, totaling roughly 2.1 trillion tokens.

  • ronantakizawa/github-top-code

    Viewer • Updated Feb 23 • 1.12M • 538 • 122

  • HuggingFaceFW/fineweb-edu

    Viewer • Updated Jul 11, 2025 • 3.5B • 367k • 1.03k

  • openbmb/UltraData-Math

    Viewer • Updated 2 days ago • 181M • 8.09k • 304

  • nick007x/github-code-2025

    Viewer • Updated 16 days ago • 148M • 2.04k • 116

  • angie-chen55/python-github-code

    Viewer • Updated May 31, 2022 • 7.23M • 815 • 37

  • tiiuae/falcon-refinedweb

    Viewer • Updated Jun 20, 2023 • 968M • 42.4k • 904

  • nick007x/arxiv-papers

    Viewer • Updated 16 days ago • 2.55M • 8.04k • 180

  • hoskinson-center/proof-pile

    Viewer • Updated Aug 19, 2023 • 363k • 1.85k • 64

  • HuggingFaceTB/finemath

    Viewer • Updated Feb 6, 2025 • 48.3M • 15.8k • 358
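
One way to turn the listing above into a data mixture is to weight each dataset by its size. The sketch below is only illustrative: it assumes the first number after each "Updated ..." field is the dataset's row count (rows are a rough proxy, not token counts), and the helper names `parse_count` and `mixing_weights` are hypothetical, not part of any library.

```python
# Abbreviated counts copied from the listing above.
# ASSUMPTION: the first number after "Updated ..." is the row count.
ROW_COUNTS = {
    "ronantakizawa/github-top-code": "1.12M",
    "HuggingFaceFW/fineweb-edu": "3.5B",
    "openbmb/UltraData-Math": "181M",
    "nick007x/github-code-2025": "148M",
    "angie-chen55/python-github-code": "7.23M",
    "tiiuae/falcon-refinedweb": "968M",
    "nick007x/arxiv-papers": "2.55M",
    "hoskinson-center/proof-pile": "363k",
    "HuggingFaceTB/finemath": "48.3M",
}

# Multipliers for the suffixes used on the page.
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '1.12M' into a plain number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def mixing_weights(counts: dict) -> dict:
    """Return row-proportional sampling weights that sum to 1 (hypothetical helper)."""
    rows = {name: parse_count(value) for name, value in counts.items()}
    total = sum(rows.values())
    return {name: n / total for name, n in rows.items()}


if __name__ == "__main__":
    # Print datasets from largest to smallest share of the mixture.
    ranked = sorted(mixing_weights(ROW_COUNTS).items(), key=lambda kv: -kv[1])
    for name, weight in ranked:
        print(f"{name:40s} {weight:8.4%}")
```

Row-proportional weights let the largest web corpora dominate, so code and math sets are often upsampled in practice; that choice is left open here. Weights like these could be passed as the `probabilities` argument of `datasets.interleave_datasets` when streaming the sources together.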