Generation Parameters

Generation parameters can be passed either directly as keyword arguments to model.generate(...) or bundled in an OmniVoiceGenerationConfig dataclass. The tables below list every parameter, grouped by category.

# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig

config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)

Decoding

Parameter	Type	Default	Description
`num_step`	int	32	Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference.
`denoise`	bool	True	Prepend the `<
`guidance_scale`	float	2.0	Classifier-free guidance scale.
`t_shift`	float	0.1	Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding.
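
The table does not spell out how t_shift and guidance_scale enter the computation. The sketch below shows one common formulation of each, purely to build intuition. The helper names shifted_timesteps and apply_guidance, and both formulas, are assumptions made for illustration and are not taken from the OmniVoice source.

```python
def shifted_timesteps(num_step: int, t_shift: float) -> list[float]:
    """Warp a uniform grid on [0, 1] with t' = s*t / (1 + (s - 1)*t).

    Assumed formula (a common time-shift schedule), not confirmed for OmniVoice.
    With s < 1 the warped points cluster near t = 0; s = 1 leaves the grid unchanged.
    """
    grid = [i / num_step for i in range(num_step + 1)]
    return [t_shift * t / (1.0 + (t_shift - 1.0) * t) for t in grid]


def apply_guidance(cond: list[float], uncond: list[float], guidance_scale: float) -> list[float]:
    """Combine conditional and unconditional scores as uncond + scale * (cond - uncond).

    Assumed formulation of classifier-free guidance; other libraries define
    the scale differently (for example cond + scale * (cond - uncond)).
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Compare the default shift against an unshifted (uniform) schedule.
    print(shifted_timesteps(4, 0.1))
    print(shifted_timesteps(4, 1.0))
    print(apply_guidance([1.0, 2.0], [0.0, 1.0], 2.0))
```

Under this assumed schedule, the default t_shift of 0.1 places most grid points close to 0, which matches the description of smaller values emphasising earlier steps.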

Sampling

Parameter	Type	Default	Description
`position_temperature`	float	5.0	Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness.
`class_temperature`	float	0.0	Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness.
`layer_penalty_factor`	float	5.0	Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first.
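
Both temperature parameters follow the same convention: 0 means greedy selection, and larger values flatten the distribution. The following is a generic sketch of temperature-controlled selection over a list of scores, assuming a plain softmax sampler. The function name sample_index is hypothetical, and the model's actual position and token samplers may differ.

```python
import math
import random


def sample_index(scores: list[float], temperature: float, rng: random.Random) -> int:
    """Pick an index from `scores` using temperature-scaled softmax sampling."""
    if temperature <= 0:
        # Temperature 0: greedy, always return the highest-scoring index.
        return max(range(len(scores)), key=scores.__getitem__)
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [0.1, 2.0, 0.5]
    print(sample_index(scores, 0.0, rng))  # greedy: index of the largest score
    print([sample_index(scores, 5.0, rng) for _ in range(10)])  # varied picks
```

With the defaults in the table, this convention would make token choice deterministic (class_temperature of 0.0) while the order in which positions are unmasked stays randomised (position_temperature of 5.0).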

Duration & Speed

These accept a single value applied to all items, or a per-item list (useful in batch mode):

# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)

Parameter	Type	Default	Description
`duration`	float or list[float \| None]	None	Fixed output duration in seconds. Overrides `speed` when set.
`speed`	float or list[float \| None]	None	Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when `duration` is set. Defaults to 1.0 when both are None.

Priority: duration > speed.

Note: When using duration, the default post-processing step may trim trailing silence, causing the actual output to be slightly shorter than the requested duration. If you need the output duration to exactly match the specified value, set postprocess_output=False to disable silence removal.
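
The priority rule and the per-item list form can be summarised in a small sketch. It assumes that speed divides the model's estimated duration, which is consistent with the description above but not confirmed against the implementation. The helpers broadcast and resolve_durations are hypothetical and not part of the OmniVoice API.

```python
from typing import Optional, Union

PerItem = Union[Optional[float], list]


def broadcast(value: PerItem, batch_size: int) -> list:
    """Expand a single value to one entry per item, or validate a per-item list."""
    if isinstance(value, (list, tuple)):
        if len(value) != batch_size:
            raise ValueError("per-item list must match the batch size")
        return list(value)
    return [value] * batch_size


def resolve_durations(estimated: list[float], duration: PerItem = None, speed: PerItem = None) -> list[float]:
    """Return the target duration in seconds for each item (duration > speed)."""
    durations = broadcast(duration, len(estimated))
    speeds = broadcast(speed, len(estimated))
    resolved = []
    for est, dur, spd in zip(estimated, durations, speeds):
        if dur is not None:
            resolved.append(float(dur))  # an explicit duration wins
        else:
            # Assumption: speed scales the estimated duration; 1.0 when unset.
            resolved.append(est / (spd if spd is not None else 1.0))
    return resolved


if __name__ == "__main__":
    # Item 0 has a fixed duration; item 1 falls back to speed.
    print(resolve_durations([6.0, 6.0], duration=[10.0, None], speed=1.5))
```

In this sketch a None entry in the duration list lets that item fall back to speed, so a batch can mix fixed-length and speed-controlled items.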

Pre/Post Processing

Parameter	Type	Default	Description
`preprocess_prompt`	bool	True	Whether to apply preprocessing to the voice-clone prompt (removes long silences from the reference audio and adds punctuation at the end of the reference text).
`postprocess_output`	bool	True	Apply post-processing to generated audio (remove long silences).

Long-Form Generation

To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds audio_chunk_threshold, with each segment producing approximately audio_chunk_duration seconds of audio. This approach allows the model to accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

Parameter	Type	Default	Description
`audio_chunk_duration`	float	15.0	Target chunk duration (seconds) when splitting long text.
`audio_chunk_threshold`	float	30.0	Estimated audio duration (seconds) above which chunking is activated.
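
Reading the two parameters in the table together, chunking only starts once the estimate passes the threshold, and each chunk then targets the chunk duration. A minimal sketch of that decision follows, assuming the segment count is a simple ceiling division. The function plan_chunks is hypothetical, and the real splitter presumably also respects sentence boundaries.

```python
import math


def plan_chunks(
    estimated_duration: float,
    audio_chunk_duration: float = 15.0,
    audio_chunk_threshold: float = 30.0,
) -> int:
    """Return how many segments the text would be split into."""
    if estimated_duration <= audio_chunk_threshold:
        return 1  # short enough: generate in a single pass
    # Assumption: aim for roughly audio_chunk_duration seconds per segment.
    return math.ceil(estimated_duration / audio_chunk_duration)


if __name__ == "__main__":
    for seconds in (20.0, 30.0, 45.0, 50.0):
        print(seconds, plan_chunks(seconds))
```

Because the threshold (30 s) is larger than the chunk duration (15 s) by default, text estimated at between 15 and 30 seconds is still generated in one pass under this reading.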