Kris Bailey
PRO
krisbailey
1 follower · 0 following
myfykris
krisbailey
AI & ML interests
quantization, optimization, novel model architectures, model architecture research and development, dataset construction, apple silicon optimizations
Recent Activity
Posted an update 1 day ago
Across various projects I kept wanting smaller, representative samples of some of the current large SOTA datasets, so that I wouldn't need to worry about slicing or anything else at runtime. So I created subsets, making sure to keep the same ratios of data sources as the originals. Each dataset card describes what's in it.

100M token datasets: RedPajama v2 100M, Falcon RefinedWeb 100M, Cosmopedia 100M
1B token datasets: Fineweb-edu 1B, RedPajama v1 1B, RedPajama v2 1B (use this one), Cosmopedia 1B
10B token datasets: RedPajama v1 10B, Cosmopedia 10B

Collection here: https://huggingface.co/collections/krisbailey/bite-size-data
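For readers curious what "keeping the same ratios of data sources" can look like in practice, here is a minimal sketch. It is not the method used to build these datasets (the dataset cards are the place to check that). It assumes each document is a hypothetical `(source, n_tokens, text)` tuple, and the function name `sample_preserving_ratios` is made up for illustration.

```python
import random
from collections import defaultdict


def sample_preserving_ratios(docs, token_budget, seed=0):
    """Pick a random subset of docs whose per-source token share
    approximates that of the full collection.

    docs: list of (source, n_tokens, text) tuples (hypothetical layout).
    token_budget: approximate total number of tokens to keep.
    """
    rng = random.Random(seed)

    # Group documents by their data source.
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc[0]].append(doc)

    total_tokens = sum(doc[1] for doc in docs)
    subset = []
    for source, items in by_source.items():
        # Each source gets a share of the budget equal to its
        # share of tokens in the full collection.
        source_tokens = sum(doc[1] for doc in items)
        quota = token_budget * source_tokens / total_tokens

        # Shuffle a copy, then take documents until the quota is met.
        shuffled = items[:]
        rng.shuffle(shuffled)
        taken = 0
        for doc in shuffled:
            if taken >= quota:
                break
            subset.append(doc)
            taken += doc[1]
    return subset
```

Because whole documents are kept, each source can overshoot its quota by up to one document, so the ratios are approximate rather than exact.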
Updated a dataset 23 days ago: krisbailey/fineweb-edu-1B
Updated a dataset 23 days ago: krisbailey/RedPajama-Data-V2-1B
Organizations
None yet
krisbailey's datasets (10)
krisbailey/fineweb-edu-1B • Viewer • Updated 23 days ago • 972k • 94
krisbailey/RedPajama-Data-V2-1B • Viewer • Updated 23 days ago • 701k • 53
krisbailey/RedPajama-Data-V2-100M • Viewer • Updated 23 days ago • 80k • 38
krisbailey/falcon-refinedweb-1B • Viewer • Updated 23 days ago • 1.69M • 46
krisbailey/falcon-refinedweb-100M • Viewer • Updated 23 days ago • 165k • 47
krisbailey/cosmopedia-10B • Viewer • Updated 23 days ago • 14.7M • 61
krisbailey/cosmopedia-1b • Viewer • Updated 23 days ago • 1.4M • 66
krisbailey/cosmopedia-100M • Viewer • Updated 23 days ago • 141k • 37
krisbailey/RedPajama-10B-Weighted • Viewer • Updated Jan 9 • 4.62M • 22
krisbailey/RedPajama-1B-Weighted • Viewer • Updated Jan 9 • 462k • 53