Training
Training Config
All training is controlled by a JSON training config file and a JSON data config file.
See examples/config/ for ready-to-use configs.
The training config for Emilia is examples/config/train_config_emilia.json
The data config for Emilia is examples/config/data_config_emilia.json
Key fields in the training config file:
| Field | Description | Default |
|---|---|---|
| llm_name_or_path | Local LLM path or Hugging Face ID | Qwen/Qwen3-0.6B |
| steps | Total training steps | 300,000 |
| learning_rate | Peak learning rate | 1e-4 |
| batch_tokens | Tokens per batch on each GPU | 8192 |
| attn_implementation | Attention backend: "flex_attention" or "sdpa" | "flex_attention" |
output_dir and data_config are passed via command line (see below).
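As a rough illustration of how the defaults in the table interact with a user config, here is a minimal sketch. The `load_train_config` helper and the `DEFAULTS` dictionary are hypothetical and written only for this example; the project's real loader may behave differently.

```python
import json
import tempfile
from pathlib import Path

# Defaults copied from the table above. This dictionary is an assumption
# made for illustration; it is not taken from the project's source code.
DEFAULTS = {
    "llm_name_or_path": "Qwen/Qwen3-0.6B",
    "steps": 300_000,
    "learning_rate": 1e-4,
    "batch_tokens": 8192,
    "attn_implementation": "flex_attention",
}


def load_train_config(path):
    """Hypothetical helper: read a JSON config and fill in missing fields."""
    with open(path, encoding="utf-8") as f:
        user_config = json.load(f)
    # Values from the file override the defaults.
    return {**DEFAULTS, **user_config}


# Write a small config to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "train_config.json"
    config_path.write_text(json.dumps({"steps": 1000, "learning_rate": 5e-5}))
    config = load_train_config(config_path)
```

Fields present in the file win over the defaults, so a config that only sets steps and learning_rate still ends up with batch_tokens of 8192.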
Attention Implementation
By default, training uses flex_attention, which requires PyTorch ≥ 2.5 and a compatible GPU (e.g. NVIDIA Ampere or newer). If your environment does not support flex_attention, set attn_implementation to "sdpa" in your training config. See examples/config/train_config_finetune_sdpa.json for a ready-to-use SDPA config:
{
  "attn_implementation": "sdpa",
  "max_sample_tokens": 2000,
  "min_sample_tokens": 50,
  "max_batch_size": 64
}
"sdpa" uses PyTorch's built-in scaled dot-product attention and works on a wider range of hardware.
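The choice between the two backends can be sketched as a small check. The function below is a hypothetical helper, not part of OmniVoice. It approximates "compatible GPU" as CUDA compute capability 8.x or higher (Ampere is 8.x), which is an assumption; the real requirements of flex_attention may be stricter or looser.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '2.5.1+cu124'."""
    parts = version.split("+")[0].split(".")
    return int(parts[0]), int(parts[1])


def choose_attn_implementation(torch_version, gpu_capability_major):
    """Hypothetical helper: pick a value for attn_implementation.

    torch_version:        e.g. the string held in torch.__version__
    gpu_capability_major: e.g. torch.cuda.get_device_capability()[0],
                          or None when no CUDA GPU is available
    """
    new_enough_torch = parse_major_minor(torch_version) >= (2, 5)
    # Assumption: treat compute capability 8 (Ampere) or newer as compatible.
    new_enough_gpu = gpu_capability_major is not None and gpu_capability_major >= 8
    return "flex_attention" if new_enough_torch and new_enough_gpu else "sdpa"


recommended = choose_attn_implementation("2.5.1+cu124", 8)
fallback = choose_attn_implementation("2.4.0", 8)
```

The returned string can be written into the attn_implementation field of the training config.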
The following fields only apply when attn_implementation != "flex_attention":
| Field | Description | Default |
|---|---|---|
| max_sample_tokens | Maximum token length per sample; longer samples are dropped | 2000 |
| min_sample_tokens | Minimum token length per sample; shorter samples are dropped | 50 |
| max_batch_size | Cap on the number of samples per batch | 64 |
batch_tokens remains the primary control for memory usage: it sets the total token budget per batch. max_batch_size is a safety guard that prevents a batch of many short samples from creating an unusually large batch dimension.
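To make the interaction between these fields concrete, here is a minimal sketch of length filtering followed by token-budgeted batching. Both functions are hypothetical, and the sketch assumes the token budget counts padded tokens (batch size times the longest sample in the batch); the actual sampler may count tokens differently.

```python
def filter_by_length(lengths, min_tokens=50, max_tokens=2000):
    """Drop samples shorter than min_tokens or longer than max_tokens."""
    return [n for n in lengths if min_tokens <= n <= max_tokens]


def group_into_batches(lengths, batch_tokens=8192, max_batch_size=64):
    """Greedy batching over samples sorted by length (illustrative only)."""
    batches, current = [], []
    for n in sorted(lengths):
        size = len(current) + 1
        # Lengths are ascending, so n is the longest sample in the candidate
        # batch and size * n is its padded token count (an assumption).
        if current and (size * n > batch_tokens or size > max_batch_size):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


kept = filter_by_length([10, 60, 3000, 2000])
capped = [len(b) for b in group_into_batches([50] * 200)]
uncapped = [len(b) for b in group_into_batches([50] * 200, max_batch_size=1000)]
```

With 200 samples of 50 tokens each, the token budget alone would allow 163 samples in one batch; the cap of 64 keeps the batch dimension small.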
Batching strategy
The two backends use different batching strategies, which are selected automatically:
| Backend | Batching strategy | Batch shape | Notes |
|---|---|---|---|
| flex_attention | Sequence packing | [1, C, batch_tokens] | Multiple samples concatenated into one long sequence; document boundaries tracked via document_ids |
| sdpa | Length-grouped padding | [B, C, max_len] | Samples with similar token lengths are grouped into the same batch and padded to the local maximum length |
Why different strategies?
- With flex_attention, sequence packing is memory-efficient because a compact BlockMask (not a dense matrix) describes which tokens can attend to each other across document boundaries.
- With sdpa, length-grouped padding is used instead: samples of similar token lengths are batched together and padded to the local maximum, so a lightweight [B, 1, max_len, max_len] boolean attention mask suffices with low overhead and minimal wasted padding.
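The two masking rules can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, and the sketch assumes real tokens of the same sample attend to each other without a causal restriction. In PyTorch, a rule like `can_attend` would typically be supplied as a mask function to `torch.nn.attention.flex_attention.create_block_mask`.

```python
def pack_samples(samples):
    """Concatenate samples into one sequence and record each token's document."""
    tokens, document_ids = [], []
    for doc_id, sample in enumerate(samples):
        tokens.extend(sample)
        document_ids.extend([doc_id] * len(sample))
    return tokens, document_ids


def can_attend(document_ids, q_idx, kv_idx):
    """Packed-sequence rule: attention stays inside a single document."""
    return document_ids[q_idx] == document_ids[kv_idx]


def padding_mask(lengths):
    """Padded-batch rule: a [B, 1, max_len, max_len] boolean mask.

    True marks a pair of real tokens; any pair that involves padding is False.
    """
    max_len = max(lengths)
    return [
        [[[q < n and k < n for k in range(max_len)] for q in range(max_len)]]
        for n in lengths
    ]


tokens, document_ids = pack_samples([[1, 2, 3], [4, 5]])
mask = padding_mask([2, 3])
```

The packed form needs only one integer per token to describe the boundaries, while the padded form stores a full square mask per sample, which stays cheap when the samples in a batch have similar lengths.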
Launching Training
accelerate launch \
--gpu_ids "0,1,2,3,4,5,6,7" \
--num_processes 8 \
-m omnivoice.cli.train \
--train_config config/train_config_emilia.json \
--data_config config/data_config_emilia.json \
--output_dir exp/omnivoice_emilia
Resuming Training
Set resume_from_checkpoint in your training config to resume from an existing checkpoint:
{
  "resume_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}
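When several checkpoints exist, the most recent one can be located by step number. The sketch below assumes checkpoints are saved as directories named checkpoint-<step> inside the output directory, as the example path suggests; `latest_checkpoint` is a hypothetical helper, not part of the OmniVoice CLI.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)")
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = pattern.fullmatch(child.name)
        # Compare steps as integers: a plain string sort would rank
        # "checkpoint-50000" above "checkpoint-100000".
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Build a throwaway output directory to demonstrate the lookup.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-50000", "checkpoint-100000", "tensorboard"):
        (Path(tmp) / name).mkdir()
    latest_name = latest_checkpoint(tmp).name
    resume_fragment = {"resume_from_checkpoint": str(latest_checkpoint(tmp))}
```

The resulting path can then be placed in the resume_from_checkpoint field of the training config.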
Initializing from a Pretrained Model
To start training from a pretrained OmniVoice checkpoint (for fine-tuning), set init_from_checkpoint in your training config:
{
  "init_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}
Monitoring
Training logs are written to TensorBoard. To view them:
tensorboard --logdir exp/omnivoice_emilia/tensorboard