Training

Training Config

All training is controlled by a JSON training config file and a JSON data config file.

See examples/config/ for ready-to-use configs.

The training config for Emilia is examples/config/train_config_emilia.json.

The data config for Emilia is examples/config/data_config_emilia.json.

Key fields in the training config file:

Field	Description	Default
`llm_name_or_path`	Local LLM path or Hugging Face model ID	Qwen/Qwen3-0.6B
`steps`	Total training steps	300,000
`learning_rate`	Peak learning rate	1e-4
`batch_tokens`	Tokens per batch on each GPU	8192
`attn_implementation`	Attention backend: `"flex_attention"` or `"sdpa"`	`"flex_attention"`

output_dir and data_config are passed via command line (see below).
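To try different values for the fields above without editing a shipped config, one option is to derive a variant from it. The following is a minimal sketch using only the standard library; the helper name `derive_config` and the output filename in the usage note are hypothetical and not part of OmniVoice.

```python
import json
from pathlib import Path


def derive_config(base_path, out_path, overrides):
    """Copy a JSON config, replacing only the fields given in `overrides`.

    Hypothetical helper: fields not mentioned in `overrides` are kept
    exactly as they appear in the base config.
    """
    config = json.loads(Path(base_path).read_text(encoding="utf-8"))
    config.update(overrides)
    Path(out_path).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `derive_config("examples/config/train_config_emilia.json", "my_train_config.json", {"steps": 1000, "learning_rate": 5e-5})` would write a short-run variant that can then be passed to the training command.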

Attention Implementation

By default, training uses flex_attention, which requires PyTorch ≥ 2.5 and a compatible GPU (e.g. NVIDIA Ampere or newer). If your environment does not support flex_attention, set attn_implementation to "sdpa" in your training config. See examples/config/train_config_finetune_sdpa.json for a ready-to-use SDPA config:

{
    "attn_implementation": "sdpa",
    "max_sample_tokens": 2000,
    "min_sample_tokens": 50,
    "max_batch_size": 64
}
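If you are unsure which backend your environment supports, a rough check along the following lines may help. It is a sketch, not part of OmniVoice: the function name is hypothetical, and it only tests whether the installed PyTorch ships the `torch.nn.attention.flex_attention` module. It does not verify that your GPU is compatible.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flex_attention" if PyTorch provides the module, else "sdpa".

    Hypothetical helper. It checks for the module only; GPU support
    still has to be confirmed separately.
    """
    try:
        spec = importlib.util.find_spec("torch.nn.attention.flex_attention")
    except ImportError:
        # PyTorch is missing, or too old to have torch.nn.attention.
        spec = None
    return "flex_attention" if spec is not None else "sdpa"


print(pick_attn_implementation())
```

The returned string can be written into the `attn_implementation` field of your training config.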

"sdpa" uses PyTorch's built-in scaled dot-product attention and works on a wider range of hardware.

The following fields apply only when attn_implementation is not "flex_attention" (for example, "sdpa"):

Field	Description	Default
`max_sample_tokens`	Maximum token length per sample; longer samples are dropped	2000
`min_sample_tokens`	Minimum token length per sample; shorter samples are dropped	50
`max_batch_size`	Cap on the number of samples per batch	64

batch_tokens remains the primary control for memory usage: it sets the total token budget per batch. max_batch_size is a safety guard that prevents a batch of many short samples from producing an unusually large batch dimension.
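How these four settings interact can be illustrated with a small sketch. It is not the OmniVoice sampler: the function name is hypothetical, and it assumes the token budget is counted on the padded batch (number of samples times the longest sample), which this document does not specify.

```python
def length_grouped_batches(lengths, batch_tokens=8192, max_sample_tokens=2000,
                           min_sample_tokens=50, max_batch_size=64):
    """Group sample indices into batches of similar length.

    Assumption: a batch costs len(batch) * longest_sample padded tokens.
    """
    # Drop samples outside [min_sample_tokens, max_sample_tokens].
    kept = [(n, i) for i, n in enumerate(lengths)
            if min_sample_tokens <= n <= max_sample_tokens]
    kept.sort()  # ascending length, so neighbours have similar lengths

    batches, current = [], []
    for n, i in kept:
        # Lengths are ascending, so n would be the padded length of the batch.
        padded_cost = n * (len(current) + 1)
        if current and (padded_cost > batch_tokens or len(current) >= max_batch_size):
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches
```

With many short samples the token budget alone would allow a very large batch dimension; in this sketch max_batch_size closes the batch first.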

Batching Strategy

The two backends use different batching strategies, which are selected automatically:

Backend	Batching strategy	Batch shape	Notes
`flex_attention`	Sequence packing	`[1, C, batch_tokens]`	Multiple samples concatenated into one long sequence; document boundaries tracked via `document_ids`
`sdpa`	Length-grouped padding	`[B, C, max_len]`	Samples with similar token lengths are grouped into the same batch and padded to the local maximum length

Why different strategies?

With flex_attention, sequence packing is memory-efficient because a compact BlockMask (not a dense matrix) describes which tokens can attend to each other across document boundaries.
With sdpa, length-grouped padding is used instead: samples of similar token lengths are batched together and padded to the local maximum. Little padding is wasted, and a lightweight [B, 1, max_len, max_len] boolean attention mask is sufficient.
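The two masking rules can be sketched in plain Python. This is illustration only, with hypothetical function names. It builds dense nested lists, whereas flex_attention works from a compact BlockMask (in PyTorch, a predicate like the one below would typically be passed to `create_block_mask`). Any further constraint the model applies, such as causal masking, is left out, and masking only the padded key positions is an assumption.

```python
def packed_attention_allowed(document_ids):
    """Packing rule: a query may attend to a key only within the same document."""
    n = len(document_ids)

    def same_document(q_idx, kv_idx):
        return document_ids[q_idx] == document_ids[kv_idx]

    return [[same_document(q, k) for k in range(n)] for q in range(n)]


def padded_attention_mask(lengths):
    """Padding rule: nested lists shaped [B, 1, max_len, max_len].

    Assumption: True marks a real (non-padding) key position.
    """
    max_len = max(lengths)
    return [
        [[[k < n for k in range(max_len)] for _q in range(max_len)]]
        for n in lengths
    ]
```

For a packed sequence holding a two-token sample followed by a one-token sample, `document_ids` would be `[0, 0, 1]`, and the third token could attend only to itself.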

Launching Training

accelerate launch \
    --gpu_ids "0,1,2,3,4,5,6,7" \
    --num_processes 8 \
    -m omnivoice.cli.train \
    --train_config config/train_config_emilia.json \
    --data_config config/data_config_emilia.json \
    --output_dir exp/omnivoice_emilia

Resuming Training

Set resume_from_checkpoint in your training config to resume from an existing checkpoint:

{
    "resume_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}
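To find the most recent checkpoint to resume from, a small helper such as the following may be convenient. It is a sketch with a hypothetical function name, and it assumes checkpoints sit directly under the output directory with names of the form checkpoint-<step>, as in the example path above.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint path with the highest step number, or None.

    Assumption: entries are named "checkpoint-<step>" inside output_dir.
    """
    pattern = re.compile(r"checkpoint-(\d+)")
    best_step, best_path = -1, None
    for path in Path(output_dir).iterdir():
        match = pattern.fullmatch(path.name)
        # Compare steps numerically; a string sort would rank 9000 above 100000.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return str(best_path) if best_path is not None else None
```

The returned path can be used as the value of resume_from_checkpoint.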

Initializing from a Pretrained Model

To start training from a pretrained OmniVoice checkpoint (for fine-tuning), set init_from_checkpoint in your training config:

{
    "init_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}

Monitoring

Training logs to TensorBoard:

tensorboard --logdir exp/omnivoice_emilia/tensorboard