While working on various projects, I kept wanting smaller, representative samples of some of the current large SOTA datasets so that I didn't have to worry about slicing or anything else at runtime. So I created sub-datasets, making sure to keep the same ratios of data sources as the originals. Each dataset card describes what's in it.
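For anyone curious what "keeping the same ratios of data sources" can look like in practice, here is a minimal sketch in plain Python. It is not the actual pipeline used to build these datasets. The function name `sample_preserving_ratios` and the `source` and `n_tokens` fields are hypothetical, and it assumes every document already carries a source label and a token count.

```python
import random
from collections import defaultdict


def sample_preserving_ratios(docs, target_tokens, seed=0):
    """Pick a subset of docs of roughly target_tokens tokens in total,
    keeping each source's share of tokens close to its share in docs.

    docs: list of dicts with hypothetical keys 'source' and 'n_tokens'.
    """
    # Group documents by the source they came from.
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc["source"]].append(doc)

    total_tokens = sum(doc["n_tokens"] for doc in docs)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []

    for source in sorted(by_source):
        group = by_source[source]
        source_tokens = sum(doc["n_tokens"] for doc in group)
        # This source's token budget is proportional to its share of the corpus.
        budget = target_tokens * source_tokens / total_tokens

        rng.shuffle(group)
        taken = 0
        for doc in group:
            if taken >= budget:
                break
            subset.append(doc)
            taken += doc["n_tokens"]

    return subset
```

Because whole documents are taken, each source can overshoot its budget by at most one document, so the result lands near the target rather than exactly on it.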
100M token datasets:
RedPajama v2 100M
Falcon RefinedWeb 100M
Cosmopedia 100M
1B token datasets:
Fineweb-edu 1B
RedPajama v1 1B
RedPajama v2 1B (use this one)
Cosmopedia 1B
10B token datasets:
RedPajama v1 10B
Cosmopedia 10B
Collection here:
https://huggingface.co/collections/krisbailey/bite-size-data