Generation Parameters
Parameters can be passed as keyword arguments to model.generate(...) or via the OmniVoiceGenerationConfig dataclass. See below for the full list and which category each belongs to.
# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)
# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig
config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)
Decoding
| Parameter | Type | Default | Description |
|---|---|---|---|
num_step |
int | 32 | Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference. |
denoise |
bool | True | Prepend the `< |
guidance_scale |
float | 2.0 | Classifier-free guidance scale. |
t_shift |
float | 0.1 | Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding. |
Sampling
| Parameter | Type | Default | Description |
|---|---|---|---|
position_temperature |
float | 5.0 | Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness. |
class_temperature |
float | 0.0 | Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness. |
layer_penalty_factor |
float | 5.0 | Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first. |
Duration & Speed
These accept a single value applied to all items, or a per-item list (useful in batch mode):
# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)
# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)
| Parameter | Type | Default | Description |
|---|---|---|---|
duration |
float or list[float | None] | None | Fixed output duration in seconds. Overrides speed when set. |
speed |
float or list[float | None] | None | Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when duration is set. Defaults to 1.0 when both are None. |
Priority: duration > speed.
Note: When using
duration, the default post-processing step may trim trailing silence, causing the actual output to be slightly shorter than the requested duration. If you need the output duration to exactly match the specified value, setpostprocess_output=Falseto disable silence removal.
Pre/Post Processing
| Parameter | Type | Default | Description |
|---|---|---|---|
preprocess_prompt |
bool | True | Whether to apply preprocessing to the voice-clone prompt audio (remove long silences in reference audio, add punctuation in the end of reference text). |
postprocess_output |
bool | True | Apply post-processing to generated audio (remove long silences). |
Long-Form Generation
To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds audio_chunk_duration, with each segment producing approximately audio_chunk_duration seconds of audio. This approach allows the model to accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.
| Parameter | Type | Default | Description |
|---|---|---|---|
audio_chunk_duration |
float | 15.0 | Target chunk duration (seconds) when splitting long text. |
audio_chunk_threshold |
float | 30.0 | Estimated audio duration (seconds) above which chunking is activated. |