Datasets
common-pile/arxiv_abstracts_filtered
• 2.5M • 336
• 6
common-pile/youtube_filtered
• 986k • 40
• 5
common-pile/wikiteam_filtered
• 10.2M • 335
common-pile/wikimedia_filtered
• 12.9M • 477
• 5
common-pile/uspto_filtered
• 14.4M • 1.08k
• 3
common-pile/usgpo_filtered
• 2.34M • 91
• 1
common-pile/uk_hansard_filtered
• 47.9k • 231
• 1
common-pile/ubuntu_irc_filtered
• 216k • 21
• 1
common-pile/stackv2_html_filtered
• 1.67M • 5
• 2
common-pile/stackv2_edu_filtered
• 57M • 1.07k
• 5
common-pile/stackexchange_filtered
• 27.5M • 687
• 7
common-pile/regulations_filtered
• 192k • 93
common-pile/python_enhancement_proposals_filtered
• 655 • 70
• 1
common-pile/pubmed_filtered
• 4.77M • 500
• 3
common-pile/public_domain_review_filtered
• 1.41k • 20
common-pile/project_gutenberg_filtered
• 57.1k • 688
common-pile/pressbooks_filtered
• 54.5k • 29
common-pile/pre_1929_books_filtered
• 122k • 183
common-pile/peS2o_filtered
• 6.09M • 609
• 1
common-pile/oercommons_filtered
• 5.25k • 26
• 1
common-pile/news_filtered
• 127k • 56
• 1
common-pile/libretexts_filtered
• 40k • 61
• 1
common-pile/library_of_congress_filtered
• 128k • 124
• 2
common-pile/github_archive_filtered
• 23.3M • 245
• 1
common-pile/foodista_filtered
• 27
• 1
common-pile/doab_filtered
• 404k • 91
• 1
common-pile/data_provenance_initiative_filtered
• 3.51M • 64
common-pile/cccc_filtered
• 10.8M • 367
• 1
common-pile/caselaw_access_project_filtered
• 5.5M • 211
• 9
common-pile/biodiversity_heritage_library_filtered
• 16.5M • 79
• 1
common-pile/arxiv_papers_filtered
• 309k • 344
• 7
togethercomputer/RedPajama-Data-V2
• 5.59k
• 397
allenai/llama-3.1-tulu-3-405b-preference-mixture
• 361k • 46
• 6
HuggingFaceFW/fineweb-edu
• 3.5B • 223k
• 987
nvidia/Llama-Nemotron-Post-Training-Dataset
• 3.91M • 2.83k
• 644
open-thoughts/OpenThoughts3-1.2M
• 1.2M • 9.05k
• 212