	# Training

	## Training Config

	All training is controlled by a JSON training config file and a JSON data config file.

	See [examples/config/](../examples/config/) for ready-to-use configs.

	The training config for Emilia is [examples/config/train_config_emilia.json](../examples/config/train_config_emilia.json).

	The data config for Emilia is [examples/config/data_config_emilia.json](../examples/config/data_config_emilia.json).


	Key fields in the training config file:

	| Field | Description | Default |
	|---|---|---|
	| `llm_name_or_path` | Local LLM path or Hugging Face model ID | `Qwen/Qwen3-0.6B` |
	| `steps` | Total training steps | 300,000 |
	| `learning_rate` | Peak learning rate | 1e-4 |
	| `batch_tokens` | Tokens per batch on each GPU | 8192 |
	| `attn_implementation` | Attention backend: `"flex_attention"` or `"sdpa"` | `"flex_attention"` |

	`output_dir` and `data_config` are passed on the command line (see Launching Training below).
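As a minimal sketch, a training config with the fields above could be generated from Python. Only the five field names and their defaults come from the table; the filename `my_train_config.json` and its location are hypothetical, and it is an assumption that fields left out of the file fall back to the listed defaults.

```python
import json
import tempfile
from pathlib import Path

# Field names and default values are taken from the table above.
train_config = {
    "llm_name_or_path": "Qwen/Qwen3-0.6B",
    "steps": 300_000,
    "learning_rate": 1e-4,
    "batch_tokens": 8192,
    "attn_implementation": "flex_attention",
}

# Hypothetical location and filename; the resulting path would be the
# value given to --train_config when launching.
config_path = Path(tempfile.gettempdir()) / "my_train_config.json"
config_path.write_text(json.dumps(train_config, indent=2))

# Read the file back to confirm it is valid JSON.
loaded = json.loads(config_path.read_text())
print(loaded["attn_implementation"])  # flex_attention
```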

	## Attention Implementation

	By default, training uses `flex_attention`, which requires PyTorch ≥ 2.5 and a compatible GPU (e.g. NVIDIA Ampere or newer). If your environment does not support `flex_attention`, set `attn_implementation` to `"sdpa"` in your training config. See [examples/config/train_config_finetune_sdpa.json](../examples/config/train_config_finetune_sdpa.json) for a ready-to-use SDPA config:

	```json
	{
	  "attn_implementation": "sdpa",
	  "max_sample_tokens": 2000,
	  "min_sample_tokens": 50,
	  "max_batch_size": 64
	}
	```

	`"sdpa"` uses PyTorch's built-in scaled dot-product attention and works on a wider range of hardware.

	The following fields only apply when `attn_implementation != "flex_attention"`:

	| Field | Description | Default |
	|---|---|---|
	| `max_sample_tokens` | Maximum token length per sample; longer samples are dropped | 2000 |
	| `min_sample_tokens` | Minimum token length per sample; shorter samples are dropped | 50 |
	| `max_batch_size` | Cap on the number of samples per batch | 64 |

	`batch_tokens` remains the primary control for memory usage: it sets the total token budget per batch. `max_batch_size` is a safety guard that prevents a batch of many short samples from creating an unusually large batch dimension.
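The interaction between these fields can be illustrated with a small sketch. It is not the project's actual sampler: the function name `make_batches` is hypothetical, and it assumes that padding counts against `batch_tokens` (batch cost = number of samples × longest sample).

```python
def make_batches(lengths, batch_tokens=8192, max_batch_size=64,
                 min_sample_tokens=50, max_sample_tokens=2000):
    """Group sample indices into padded batches under a token budget."""
    # Drop samples shorter than the minimum or longer than the maximum.
    kept = [(n, i) for i, n in enumerate(lengths)
            if min_sample_tokens <= n <= max_sample_tokens]
    # Sort by length so each batch holds samples of similar length.
    kept.sort()

    batches, current = [], []
    for n, i in kept:
        # Lengths are ascending, so the incoming sample is the longest one:
        # the padded cost of the batch would be (len(current) + 1) * n.
        too_many_tokens = (len(current) + 1) * n > batch_tokens
        too_many_samples = len(current) >= max_batch_size
        if current and (too_many_tokens or too_many_samples):
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches


# 100 short samples: the token budget alone would allow 163 per batch,
# so max_batch_size is what caps the first batch.
print([len(b) for b in make_batches([50] * 100)])  # [64, 36]
```

With long samples the token budget becomes the binding limit instead, which is why `batch_tokens` is the main knob for memory.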

	### Batching strategy

	The two backends use different batching strategies, which are selected automatically:

	| Backend | Batching strategy | Batch shape | Notes |
	|---|---|---|---|
	| `flex_attention` | Sequence packing | `[1, C, batch_tokens]` | Multiple samples concatenated into one long sequence; document boundaries tracked via `document_ids` |
	| `sdpa` | Length-grouped padding | `[B, C, max_len]` | Samples with similar token lengths are grouped into the same batch and padded to the local maximum length |

	Why different strategies?

	- With `flex_attention`, sequence packing is memory-efficient because a compact `BlockMask` (not a dense matrix) describes which tokens can attend to each other across document boundaries.
	- With `sdpa`, length-grouped padding is used instead: samples of similar token length are batched together and padded to the local maximum. A lightweight `[B, 1, max_len, max_len]` boolean attention mask then suffices, and little of each batch is wasted on padding.
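The two masking rules can be sketched in plain Python with nested lists standing in for tensors. The function names are hypothetical, and the sketch assumes tokens may attend to every token in their own sample (no causal restriction) and that only padded key positions are masked; the actual implementation may differ on both points.

```python
def pack_with_document_ids(lengths):
    """Sequence packing: one document id per token of the packed sequence."""
    document_ids = []
    for doc, n in enumerate(lengths):
        document_ids.extend([doc] * n)
    return document_ids


def packed_mask(document_ids):
    """Dense view of the rule a BlockMask encodes compactly:
    query q may attend to key k only if both are in the same document."""
    return [[q == k for k in document_ids] for q in document_ids]


def padded_mask(lengths):
    """Length-grouped padding: mask shaped [B][1][max_len][max_len]
    that hides padded key positions from every query."""
    max_len = max(lengths)
    return [[[[k < n for k in range(max_len)] for _ in range(max_len)]]
            for n in lengths]


ids = pack_with_document_ids([2, 3])
print(ids)                              # [0, 0, 1, 1, 1]
print(packed_mask(ids)[0])              # [True, True, False, False, False]
print(padded_mask([2, 3])[0][0][0])     # [True, True, False]
```

A dense packed mask grows with the square of the packed length (8192 × 8192 entries at the default `batch_tokens`), which is why a compact block representation matters for packing, while the padded mask stays small when `max_len` is short.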

	## Launching Training

	```bash
	accelerate launch \
	    --gpu_ids "0,1,2,3,4,5,6,7" \
	    --num_processes 8 \
	    -m omnivoice.cli.train \
	    --train_config config/train_config_emilia.json \
	    --data_config config/data_config_emilia.json \
	    --output_dir exp/omnivoice_emilia
	```
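Because `batch_tokens` is a per-GPU budget, the number of processes scales the global batch. The arithmetic below is a rough upper bound that assumes one process per GPU, no gradient accumulation, every batch filling its budget, and the default values from the table of key fields (the Emilia config may set different ones).

```python
batch_tokens = 8192    # per-GPU token budget (default from the key-fields table)
num_processes = 8      # matches --num_processes in the launch command
steps = 300_000        # default total training steps

tokens_per_step = batch_tokens * num_processes
total_tokens = tokens_per_step * steps

print(tokens_per_step)  # 65536
print(total_tokens)     # 19660800000
```

When launching on fewer GPUs, the same calculation shows how much `batch_tokens` or `steps` would need to change to keep the total roughly constant.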

	## Resuming Training

	Set `resume_from_checkpoint` in your training config to resume from an existing checkpoint:

	```json
	{
	  "resume_from_checkpoint": "exp/omnivoice/checkpoint-100000"
	}
	```
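To find the value for `resume_from_checkpoint`, a small helper can pick the most recent checkpoint. This is a sketch that assumes checkpoints are saved as `checkpoint-<step>` directories directly under the output directory, which is inferred only from the example path above; `latest_checkpoint` is a hypothetical helper, not part of OmniVoice.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-<step> directory, or None."""
    best_step, best_path = -1, None
    for path in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare step numbers as integers: a plain string sort would rank
        # "checkpoint-5000" above "checkpoint-100000".
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("exp/omnivoice"))
```

The function returns `None` when the directory does not exist or contains no matching checkpoints.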

	## Initializing from a Pretrained Model

	To initialize training from a pretrained OmniVoice checkpoint (for example, for fine-tuning), set `init_from_checkpoint` in your training config:

	```json
	{
	  "init_from_checkpoint": "exp/omnivoice/checkpoint-100000"
	}
	```

	## Monitoring

	Training logs are written to TensorBoard. To view them:

	```bash
	tensorboard --logdir exp/omnivoice_emilia/tensorboard
	```