	# Generation Parameters

	Pass generation parameters to `model.generate(...)` either as keyword arguments or through an `OmniVoiceGenerationConfig` dataclass. The sections below list every parameter, grouped by category.

	```python
	# 1) Direct keyword arguments
	audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

	# 2) Via OmniVoiceGenerationConfig dataclass
	from omnivoice import OmniVoiceGenerationConfig

	config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
	audio = model.generate(text="Hello world", generation_config=config)
	```

	## Decoding

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `num_step` | int | 32 | Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference. |
	| `denoise` | bool | True | Prepend the `<\|denoise\|>` token to the input, which signals the model to produce cleaner speech. |
	| `guidance_scale` | float | 2.0 | Classifier-free guidance scale. |
	| `t_shift` | float | 0.1 | Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding. |
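To illustrate what `guidance_scale` controls, here is a minimal sketch of one common classifier-free guidance formulation, in which the guided logits extrapolate from the unconditional prediction toward the conditional one. The function name `apply_guidance` and the exact blending formula are assumptions for illustration; this document does not state which variant OmniVoice uses internally.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits.

    Assumed formulation: uncond + scale * (cond - uncond).
    A scale of 1.0 returns the conditional logits unchanged;
    larger values push further in the conditional direction.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


guided = apply_guidance([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
```

Under this formulation, the default of 2.0 doubles the gap between the conditional and unconditional predictions.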

	## Sampling

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `position_temperature` | float | 5.0 | Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness. |
	| `class_temperature` | float | 0.0 | Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness. |
	| `layer_penalty_factor` | float | 5.0 | Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first. |
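The "0 = greedy" convention shared by both temperature parameters can be sketched generically as follows. This is standard temperature sampling over a list of logits, not OmniVoice's actual sampler; `sample_index` is a hypothetical helper name.

```python
import math
import random


def sample_index(logits, temperature, rng=random):
    """Pick an index from logits; a temperature of 0 means greedy argmax."""
    if temperature == 0:
        # Deterministic: always take the highest-scoring entry.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale logits, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # Draw one index with probability proportional to its weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


greedy_choice = sample_index([0.1, 2.0, 0.5], temperature=0.0)
random_choice = sample_index([0.1, 2.0, 0.5], temperature=5.0, rng=random.Random(0))
```

As the temperature grows, the softmax weights flatten and the draw approaches uniform, which is why higher values increase randomness.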

	## Duration & Speed

	`duration` and `speed` each accept either a single value, which is applied to all items, or a per-item list (useful in batch mode):

	```python
	# Fixed 10-second output
	audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

	# Faster speech (1.2x faster than estimated)
	audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)
	```

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `duration` | float or list[float \| None] | None | Fixed output duration in seconds. Overrides `speed` when set. |
	| `speed` | float or list[float \| None] | None | Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when `duration` is set. Defaults to 1.0 when both are None. |

	Priority: `duration` > `speed`.

	> Note: When using `duration`, the default post-processing step may trim trailing silence, causing the actual output to be slightly shorter than the requested duration. If you need the output duration to exactly match the specified value, set `postprocess_output=False` to disable silence removal.
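The priority rule and the scalar-or-list behaviour can be sketched as below. The helpers `broadcast` and `resolve_durations` are hypothetical, and dividing the estimated duration by `speed` is an assumption consistent with "values > 1.0 produce shorter audio", not a confirmed description of the library's internals.

```python
def broadcast(value, batch_size):
    """Expand a scalar (or None) to a per-item list; validate list length."""
    if isinstance(value, list):
        if len(value) != batch_size:
            raise ValueError("per-item list must match the batch size")
        return value
    return [value] * batch_size


def resolve_durations(estimated, duration=None, speed=None):
    """Return target seconds per item, applying duration > speed > 1.0."""
    n = len(estimated)
    targets = []
    for est, dur, spd in zip(estimated, broadcast(duration, n), broadcast(speed, n)):
        if dur is not None:
            # An explicit duration wins and speed is ignored for this item.
            targets.append(float(dur))
        else:
            # Assumed scaling: the estimate shrinks as speed increases.
            targets.append(est / (spd if spd is not None else 1.0))
    return targets


# Item 0 has a fixed duration; item 1 falls back to the shared speed factor.
targets = resolve_durations([6.0, 6.0], duration=[10.0, None], speed=2.0)
```

Here the first item is pinned to 10 seconds regardless of `speed`, while the second item's 6-second estimate is halved by the speed factor of 2.0.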

	## Pre/Post Processing

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `preprocess_prompt` | bool | True | Apply preprocessing to the voice-clone prompt: remove long silences from the reference audio and add punctuation at the end of the reference text. |
	| `postprocess_output` | bool | True | Apply post-processing to the generated audio (remove long silences). |
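As a rough picture of what "remove long silences" means, the sketch below shortens every run of near-silent samples to a maximum length. The function name, the amplitude threshold, and the run-length limit are all hypothetical; the library's actual silence detection is not described in this document and likely operates on frames rather than individual samples.

```python
def remove_long_silences(samples, threshold=0.01, max_silence=4):
    """Shorten every run of quiet samples to at most max_silence samples."""
    out = []
    quiet_run = 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if quiet_run > max_silence:
                continue  # drop the excess part of a long silent run
        else:
            quiet_run = 0  # a loud sample ends the current silent run
        out.append(s)
    return out


audio = [0.5] + [0.0] * 10 + [0.4] + [0.0] * 2
trimmed = remove_long_silences(audio, max_silence=3)
```

Because a step like this shortens the signal, output produced with a fixed `duration` can end up shorter than requested unless `postprocess_output=False` is set.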

	## Long-Form Generation

	To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds `audio_chunk_threshold`. Each segment then produces approximately `audio_chunk_duration` seconds of audio. This lets the model accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `audio_chunk_duration` | float | 15.0 | Target chunk duration (seconds) when splitting long text. |
	| `audio_chunk_threshold` | float | 30.0 | Estimated audio duration (seconds) above which chunking is activated. |
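The interplay between the two parameters can be sketched as follows: text is left whole unless its estimated duration exceeds the threshold, and otherwise sentences are grouped into chunks of roughly the target duration. The character-rate estimator `estimate_duration`, the constant `CHARS_PER_SECOND`, and the sentence-based splitter are hypothetical stand-ins, not OmniVoice's actual estimator or splitting logic.

```python
import re

CHARS_PER_SECOND = 15.0  # hypothetical speaking-rate estimate, not from OmniVoice


def estimate_duration(text):
    """Crude duration estimate in seconds, based only on character count."""
    return len(text) / CHARS_PER_SECOND


def split_for_long_form(text, audio_chunk_duration=15.0, audio_chunk_threshold=30.0):
    """Return [text] for short inputs, else sentence groups of roughly chunk length."""
    if estimate_duration(text) <= audio_chunk_threshold:
        return [text]  # below the threshold: no chunking
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_duration(candidate) > audio_chunk_duration:
            # Adding this sentence would overshoot the target, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = " ".join(["This is sentence number one."] * 40)
chunks = split_for_long_form(long_text)
```

Keeping the activation threshold above the target chunk duration means moderately long inputs are still generated in one pass, and splitting begins only once the text is long enough to yield at least two reasonably sized chunks.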