Pretraining Collection: General pretraining data for training a model from scratch, totaling approximately 2.1 trillion tokens. • 9 items • Updated 15 days ago
Post-training Collection: Data used after pretraining, totaling approximately 2.8 billion tokens. • 6 items • Updated 15 days ago