Data generation
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper
• 2402.13064
• Published
• 50
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper
• 2402.10379
• Published
• 31
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Paper
• 2405.15613
• Published
• 17
Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets
Paper
• 2405.18952
• Published
• 10
MAmmoTH2: Scaling Instructions from the Web
Paper
• 2405.03548
• Published
• 6
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Paper
• 2406.08464
• Published
• 71
West-of-N: Synthetic Preference Generation for Improved Reward Modeling
Paper
• 2401.12086
• Published
• 1
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas
Paper
• 2501.15427
• Published
• 6
Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Paper
• 2501.12273
• Published
• 14
How to Synthesize Text Data without Model Collapse?
Paper
• 2412.14689
• Published
• 53
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
• 2404.07503
• Published
• 31