# Training
## Training Config
All training is controlled by a JSON training config file and a JSON data config file.
See [examples/config/](../examples/config/) for ready-to-use configs.
The training config for Emilia is [examples/config/train_config_emilia.json](../examples/config/train_config_emilia.json).
The data config for Emilia is [examples/config/data_config_emilia.json](../examples/config/data_config_emilia.json).
Key fields in the training config file:
| Field | Description | Default |
|---|---|---|
| `llm_name_or_path` | Local LLM path or Hugging Face model ID | `Qwen/Qwen3-0.6B` |
| `steps` | Total training steps | 300,000 |
| `learning_rate` | Peak learning rate | 1e-4 |
| `batch_tokens` | Tokens per batch on each GPU | 8192 |
| `attn_implementation` | Attention backend: `"flex_attention"` or `"sdpa"` | `"flex_attention"` |
`output_dir` and `data_config` are passed via command line (see below).
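
To make the defaults concrete, here is a minimal sketch of reading a training config and filling in the documented defaults. The helper name `load_train_config` is hypothetical and is not part of OmniVoice; only the field names and default values come from the table above.

```python
import json
from pathlib import Path

# Defaults copied from the table above.
DEFAULTS = {
    "llm_name_or_path": "Qwen/Qwen3-0.6B",
    "steps": 300_000,
    "learning_rate": 1e-4,
    "batch_tokens": 8192,
    "attn_implementation": "flex_attention",
}


def load_train_config(path):
    """Hypothetical helper: read a training config and apply documented defaults."""
    user_config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Values from the file override the defaults.
    config = {**DEFAULTS, **user_config}
    if config["attn_implementation"] not in {"flex_attention", "sdpa"}:
        raise ValueError(
            f"unsupported attn_implementation: {config['attn_implementation']!r}"
        )
    return config
```

A config file that only sets `steps` would therefore still resolve `batch_tokens` to 8192 and `attn_implementation` to `"flex_attention"`.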
## Attention Implementation
By default, training uses `flex_attention`, which requires PyTorch ≥ 2.5 and a compatible GPU (e.g. NVIDIA Ampere or newer). If your environment does not support `flex_attention`, set `attn_implementation` to `"sdpa"` in your training config. See [examples/config/train_config_finetune_sdpa.json](../examples/config/train_config_finetune_sdpa.json) for a ready-to-use SDPA config:
```json
{
  "attn_implementation": "sdpa",
  "max_sample_tokens": 2000,
  "min_sample_tokens": 50,
  "max_batch_size": 64
}
```
`"sdpa"` uses PyTorch's built-in scaled dot-product attention and works on a wider range of hardware.
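
If you are unsure which backend your environment supports, a check along the following lines may help. This is a heuristic sketch, not part of OmniVoice: the function name `pick_attn_implementation` is hypothetical, and treating "Ampere or newer" as CUDA compute capability 8.0 or higher is an assumption.

```python
import importlib


def pick_attn_implementation() -> str:
    """Return "flex_attention" if it looks usable here, otherwise "sdpa"."""
    try:
        torch = importlib.import_module("torch")
        # PyTorch exposes flex_attention from this module in recent releases.
        importlib.import_module("torch.nn.attention.flex_attention")
    except ImportError:
        return "sdpa"  # PyTorch is missing or too old to provide flex_attention
    if not torch.cuda.is_available():
        return "sdpa"
    # Assumption: "Ampere or newer" is approximated as compute capability >= 8.0.
    major, _minor = torch.cuda.get_device_capability()
    return "flex_attention" if major >= 8 else "sdpa"
```

The returned string can be written into the `attn_implementation` field of your training config.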
The following fields apply only when `attn_implementation` is not `"flex_attention"` (that is, with `"sdpa"`):
| Field | Description | Default |
|---|---|---|
| `max_sample_tokens` | Maximum token length per sample; longer samples are dropped | 2000 |
| `min_sample_tokens` | Minimum token length per sample; shorter samples are dropped | 50 |
| `max_batch_size` | Cap on the number of samples per batch | 64 |
`batch_tokens` remains the primary control for memory usage: it sets the total token budget per batch. `max_batch_size` is a safety guard that prevents a batch of many short samples from producing an unusually large batch dimension.
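
As a rough illustration of how these limits interact, the sketch below filters samples by length and then fills batches under the token budget. It is not OmniVoice's actual sampler: the function name `make_padded_batches`, the greedy strategy, and the choice to count padded tokens (number of samples times the longest sample) against `batch_tokens` are all assumptions.

```python
def make_padded_batches(lengths, batch_tokens=8192, max_batch_size=64,
                        min_sample_tokens=50, max_sample_tokens=2000):
    """Group sample lengths into batches (illustrative logic only)."""
    # Drop samples outside the allowed length range, then sort by length.
    kept = sorted(n for n in lengths
                  if min_sample_tokens <= n <= max_sample_tokens)
    batches, current = [], []
    for n in kept:
        # Lengths are ascending, so n would be the padded length of the batch.
        too_many_tokens = (len(current) + 1) * n > batch_tokens
        too_many_samples = len(current) >= max_batch_size
        if current and (too_many_tokens or too_many_samples):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


# 200 short samples, 5 long ones, and 4 that fall outside the length limits.
batches = make_padded_batches([10] * 3 + [60] * 200 + [2000] * 5 + [3000])
print([len(b) for b in batches])
```

With the default values, 136 samples of 60 tokens would fit in the token budget, so it is `max_batch_size` that caps those batches at 64; for the 2000-token samples the token budget is the binding limit.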
### Batching strategy
The two backends use **different batching strategies**, which are selected automatically:
| Backend | Batching strategy | Batch shape | Notes |
|---|---|---|---|
| `flex_attention` | Sequence packing | `[1, C, batch_tokens]` | Multiple samples concatenated into one long sequence; document boundaries tracked via `document_ids` |
| `sdpa` | Length-grouped padding | `[B, C, max_len]` | Samples with similar token lengths are grouped into the same batch and padded to the local maximum length |
**Why different strategies?**
- With `flex_attention`, sequence packing is memory-efficient because a compact `BlockMask` (not a dense matrix) describes which tokens can attend to each other across document boundaries.
- With `sdpa`, length-grouped padding is used instead: samples of similar token lengths are batched together and padded to the local maximum. A lightweight `[B, 1, max_len, max_len]` boolean attention mask then suffices, and grouping by length keeps wasted padding to a minimum.
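
The two layouts can be sketched in plain Python as follows. This is an illustration under assumptions, not OmniVoice's implementation: the function names are hypothetical, the packed mask only enforces "same document" (any additional masking rule is unknown here), the padded mask omits the singleton head dimension, and marking padded query rows as all `False` is just one possible convention.

```python
def pack_document_ids(lengths, batch_tokens):
    """Concatenate samples into one sequence; return one document id per token."""
    document_ids = []
    for doc, n in enumerate(lengths):
        if len(document_ids) + n > batch_tokens:
            break  # budget exhausted; remaining samples go to a later batch
        document_ids.extend([doc] * n)
    return document_ids


def same_document(document_ids, q_idx, kv_idx):
    """Packed case: a query token may attend only within its own document."""
    return document_ids[q_idx] == document_ids[kv_idx]


def padded_mask(lengths):
    """Padded case: dense [B][max_len][max_len] mask that excludes padding."""
    max_len = max(lengths)
    return [
        [[q < n and k < n for k in range(max_len)] for q in range(max_len)]
        for n in lengths
    ]


ids = pack_document_ids([3, 2, 4], batch_tokens=8)
print(ids)                       # document id per packed token
print(same_document(ids, 2, 3))  # tokens 2 and 3 lie in different documents
```

In the packed case, a predicate like `same_document` plays the role of the `mask_mod` callback from which PyTorch builds a `BlockMask`, so no dense mask over the full packed length needs to be stored. In the padded case, the dense mask stays small because `max_len` is close to every sample's real length within a length-grouped batch.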
## Launching Training
```bash
accelerate launch \
    --gpu_ids "0,1,2,3,4,5,6,7" \
    --num_processes 8 \
    -m omnivoice.cli.train \
    --train_config config/train_config_emilia.json \
    --data_config config/data_config_emilia.json \
    --output_dir exp/omnivoice_emilia
```
## Resuming Training
Set `resume_from_checkpoint` in your training config to resume from an existing checkpoint:
```json
{
  "resume_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}
```
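
If you prefer not to edit the path by hand, a small script can pick the newest checkpoint and write it into the config. This is a sketch under assumptions: the helper names are hypothetical, and it assumes checkpoints are directories named `checkpoint-<step>` directly under the output directory, as the example path above suggests.

```python
import json
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    candidates = []
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare steps numerically so that 100000 ranks above 20000.
    return max(candidates, key=lambda item: item[0])[1]


def set_resume_checkpoint(config_path, output_dir):
    """Write the newest checkpoint path into resume_from_checkpoint, if any."""
    checkpoint = latest_checkpoint(output_dir)
    if checkpoint is None:
        return None  # nothing to resume from; leave the config untouched
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    config["resume_from_checkpoint"] = checkpoint.as_posix()
    config_file.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return checkpoint
```

For example, `set_resume_checkpoint("my_train_config.json", "exp/omnivoice")` would update the (hypothetical) file `my_train_config.json` in place.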
## Initializing from a Pretrained Model
To start training from a pretrained OmniVoice checkpoint (for example, for fine-tuning), set `init_from_checkpoint` in your training config:
```json
{
  "init_from_checkpoint": "exp/omnivoice/checkpoint-100000"
}
```
## Monitoring
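Training logs are written in TensorBoard format. To view them, point TensorBoard at the `tensorboard` subdirectory of your output directory: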
```bash
tensorboard --logdir exp/omnivoice_emilia/tensorboard
```