OmniVoice_sync_data_and_code / docs /generation-parameters.md
Abdelrahman2922's picture
Add files using upload-large-folder tool
a4d9876 verified

Generation Parameters

Parameters can be passed as keyword arguments to model.generate(...) or via the OmniVoiceGenerationConfig dataclass. See below for the full list and which category each belongs to.

# 1) Direct keyword arguments
audio = model.generate(text="Hello world", num_step=32, guidance_scale=2.0)

# 2) Via OmniVoiceGenerationConfig dataclass
from omnivoice import OmniVoiceGenerationConfig

config = OmniVoiceGenerationConfig(num_step=32, guidance_scale=2.0)
audio = model.generate(text="Hello world", generation_config=config)

Decoding

Parameter Type Default Description
num_step int 32 Number of iterative unmasking steps. Higher values improve quality but slow down generation. Use 16 for faster inference.
denoise bool True Prepend the `<
guidance_scale float 2.0 Classifier-free guidance scale.
t_shift float 0.1 Time-step shift for the noise schedule. Smaller values emphasise earlier steps in decoding.

Sampling

Parameter Type Default Description
position_temperature float 5.0 Temperature for mask-position selection. 0 = greedy (deterministic). Higher values increase randomness.
class_temperature float 0.0 Temperature for token sampling at each step. 0 = greedy (deterministic). Higher values increase randomness.
layer_penalty_factor float 5.0 Penalty applied to deeper codebook layers, encouraging earlier (lower) layers to unmask first.

Duration & Speed

These accept a single value applied to all items, or a per-item list (useful in batch mode):

# Fixed 10-second output
audio = model.generate(text="Hello, this is a test of duration control", duration=10.0)

# Faster speech (1.2x faster than estimated)
audio = model.generate(text="Hello, this is a test of duration control", speed=1.2)
Parameter Type Default Description
duration float or list[float | None] None Fixed output duration in seconds. Overrides speed when set.
speed float or list[float | None] None Speed factor. Values > 1.0 produce shorter audio (faster); values < 1.0 produce longer audio (slower). Ignored when duration is set. Defaults to 1.0 when both are None.

Priority: duration > speed.

Note: When using duration, the default post-processing step may trim trailing silence, causing the actual output to be slightly shorter than the requested duration. If you need the output duration to exactly match the specified value, set postprocess_output=False to disable silence removal.

Pre/Post Processing

Parameter Type Default Description
preprocess_prompt bool True Whether to apply preprocessing to the voice-clone prompt audio (remove long silences in reference audio, add punctuation in the end of reference text).
postprocess_output bool True Apply post-processing to generated audio (remove long silences).

Long-Form Generation

To support stable long-form speech generation with low VRAM consumption, the text is automatically split into smaller segments when the estimated duration of the generated speech exceeds audio_chunk_duration, with each segment producing approximately audio_chunk_duration seconds of audio. This approach allows the model to accept arbitrarily long text and generate arbitrarily long speech with near-constant VRAM consumption.

Parameter Type Default Description
audio_chunk_duration float 15.0 Target chunk duration (seconds) when splitting long text.
audio_chunk_threshold float 30.0 Estimated audio duration (seconds) above which chunking is activated.