Smaller, more convenient versions of several major datasets.
Kris Bailey PRO
krisbailey
AI & ML interests
quantization, optimization, novel model architectures, model architecture research and development, dataset construction, Apple silicon optimizations
Recent Activity
Posted an update 1 day ago
While working on various projects, I kept wanting smaller, representative samples of some of the current large SOTA datasets, so that I wouldn't need to worry about slicing or anything else at runtime. So I created sub-datasets, making sure to keep the same ratios of data sources as the originals. Each dataset card describes what's in it.
100M token datasets:
RedPajama v2 100M
Falcon RefinedWeb 100M
Cosmopedia 100M
1B token datasets:
Fineweb-edu 1B
RedPajama v1 1B
RedPajama v2 1B (use this one)
Cosmopedia 1B
10B token datasets:
RedPajama v1 10B
Cosmopedia 10B
Collection here:
https://huggingface.co/collections/krisbailey/bite-size-data
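One way to picture the ratio-preserving subsampling described in the post is the minimal sketch below. It is not the code used to build these datasets: the function name `proportional_sample`, the record fields `source` and `n_tokens`, and the toy corpus are all hypothetical, and a real pipeline would stream from disk rather than hold everything in memory.

```python
import random
from collections import defaultdict


def proportional_sample(records, token_budget, seed=0):
    """Subsample records so each source keeps its share of the total tokens.

    `records` is a list of dicts with hypothetical keys "source" and
    "n_tokens"; these field names are assumptions for illustration only.
    """
    total_tokens = sum(rec["n_tokens"] for rec in records)
    if total_tokens == 0:
        return []

    # Group records by the data source they came from.
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    rng = random.Random(seed)
    sample = []
    for source, recs in by_source.items():
        source_tokens = sum(r["n_tokens"] for r in recs)
        # The source's slice of the budget equals its slice of the full corpus.
        quota = token_budget * source_tokens / total_tokens
        shuffled = recs[:]
        rng.shuffle(shuffled)
        used = 0
        for r in shuffled:
            # Stop at the first record that would push this source over quota.
            if used + r["n_tokens"] > quota:
                break
            sample.append(r)
            used += r["n_tokens"]
    return sample


# Toy corpus: 75% of tokens from "web", 25% from "code" (made-up data).
corpus = (
    [{"source": "web", "n_tokens": 10} for _ in range(300)]
    + [{"source": "code", "n_tokens": 10} for _ in range(100)]
)
subset = proportional_sample(corpus, token_budget=400)
```

Because every source is capped at its own quota, the subset never exceeds the token budget, and the mix of sources stays close to that of the full corpus. With documents of uneven length, each source may land slightly under its quota.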
Updated a dataset 23 days ago
krisbailey/fineweb-edu-1B
Updated a dataset 23 days ago
krisbailey/RedPajama-Data-V2-1B