kaizuberbuehler's Collections: Datasets
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published • 31
CosmicMan: A Text-to-Image Foundation Model for Humans
Paper
• 2404.01294
• Published • 17
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published • 17
DataComp-LM: In search of the next generation of training sets for language models
Paper
• 2406.11794
• Published • 55
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Paper
• 2406.08973
• Published • 89
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper
• 2406.08418
• Published • 32
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper
• 2406.08451
• Published • 26
argilla/magpie-ultra-v0.1
Viewer
• Updated • 50k • 974
• 221
HuggingFaceFW/fineweb-edu
Viewer
• Updated • 3.5B • 313k
• 1.01k
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Paper
• 2409.12568
• Published • 50
RedPajama: an Open Dataset for Training Large Language Models
Paper
• 2411.12372
• Published • 58
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper
• 2411.07461
• Published • 23
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper
• 2411.04905
• Published • 127
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper
• 2501.04686
• Published • 53
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper
• 2501.18511
• Published • 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
Paper
• 2502.04235
• Published • 23
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Paper
• 2502.06589
• Published • 21
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Paper
• 2502.09082
• Published • 32
EgoLife: Towards Egocentric Life Assistant
Paper
• 2503.03803
• Published • 46
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Paper
• 2503.02951
• Published • 33
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Paper
• 2503.10582
• Published • 24
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Paper
• 2503.21332
• Published • 23
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Paper
• 2504.01943
• Published • 18
MegaMath: Pushing the Limits of Open Math Corpora
Paper
• 2504.02807
• Published • 35
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Paper
• 2504.13161
• Published • 97
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Paper
• 2504.11393
• Published • 18
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs
Paper
• 2504.14655
• Published • 21