Data Designer
Data Designer is NVIDIA NeMo’s framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets.
Prerequisites
pip install data-designer
Download datasets from the Hub as seeds
Use HuggingFaceSeedSource to load datasets directly from the Hub as seed data for generation.
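The `path` value in the example in this section appears to follow a `<repo type>/<owner>/<name>/<file inside the repo>` layout. The sketch below only illustrates how such a string breaks down. `HubPath` and `split_hub_path` are hypothetical names introduced here, not part of Data Designer, and the library may resolve paths differently.

```python
from typing import NamedTuple


class HubPath(NamedTuple):
    repo_type: str     # e.g. "datasets"
    repo_id: str       # "<owner>/<name>"
    path_in_repo: str  # file path inside the repository


def split_hub_path(path: str) -> HubPath:
    """Split '<repo_type>/<owner>/<name>/<file>' into parts (illustrative helper only)."""
    # Split on the first three slashes; everything after them is the file path.
    parts = path.strip("/").split("/", 3)
    if len(parts) < 4:
        raise ValueError(f"expected '<repo_type>/<owner>/<name>/<file>', got {path!r}")
    repo_type, owner, name, path_in_repo = parts
    return HubPath(repo_type, f"{owner}/{name}", path_in_repo)


parsed = split_hub_path("datasets/gretelai/symptom_to_diagnosis/data/train.parquet")
print(parsed.repo_id)       # gretelai/symptom_to_diagnosis
print(parsed.path_in_repo)  # data/train.parquet
```

Reading the path this way makes it easier to check that the owner, dataset name, and file location match what the dataset page on the Hub shows.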
import data_designer.config as dd
from data_designer.interface import DataDesigner
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()
# Load seed data from HuggingFace
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet",
    token="hf_...",  # Optional, for private datasets
)
config_builder.with_seed_dataset(seed_source)
# Reference seed columns in prompts
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        model_alias="openai-gpt-5",
        prompt="Write notes for a patient with {{ diagnosis }}. Symptoms: {{ patient_summary }}",
    )
)
preview = data_designer.preview(config_builder, num_records=5)

Push generated datasets to the Hub
Use the built-in push_to_hub method to upload generated datasets to the Hub.
# Generate dataset
results = data_designer.create(config_builder, num_records=1000, dataset_name="my-dataset")
# Push to Hub
url = results.push_to_hub(
    repo_id="username/my-synthetic-dataset",
    description="Synthetic dataset generated with Data Designer.",
    tags=["medical", "notes"],
    private=False,
)

Resources
- Data Designer Documentation
- GitHub Repository
- Seed Datasets Guide
- Guide to using Data Designer with Inference Providers