Advanced Data Preparation
The advanced pipeline adds denoising and prompt noise augmentation on top of the basic tokenization workflow. Each stage is optional.
Prerequisites
- Denoising: Sidon model checkpoints (feature_extractor_cuda.pt, decoder_cuda.pt) from https://huggingface.co/sarulab-speech/sidon-v0.1/tree/main.
- Noise augmentation: noise + RIR tar shards with data.lst manifests
Pipeline Overview
Step 1 (optional): Denoise
Raw audio → Sidon denoiser → clean audio
Step 2: Tokenize (with optional noise augmentation)
Clean audio + noise augment on prefix → audio tokenizer → tokens
Denoise
Use the Sidon speech enhancement model to remove background noise from raw audio.
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.denoise_audio \
--input_jsonl data.jsonl \
--tar_output_pattern data/denoised/audios/shard-%06d.tar \
--jsonl_output_pattern data/denoised/txts/shard-%06d.jsonl \
--feature_extractor_path /path/to/sidon_feature_extractor_cuda.pt \
--decoder_path /path/to/sidon_decoder_cuda.pt \
--target_sample_rate 24000 \
--batch_duration 200.0
What it does:
- Reads your JSONL manifest
- Runs the Sidon denoiser on each audio file
- Outputs denoised audio as custom WebDataset tar/jsonl shards
- Generates a data.lst manifest in data/denoised/
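The shard output patterns in the command above contain a %06d placeholder. The sketch below shows how such a pattern expands into concrete file names. It assumes the script fills in the shard index with printf-style formatting, as WebDataset's ShardWriter does, and that numbering starts at 0. Neither is confirmed here, and shard_paths is a hypothetical helper for illustration only.

```python
# Assumption: the shard index is substituted printf-style and starts at 0.
# shard_paths is a hypothetical helper, not part of omnivoice.
def shard_paths(pattern: str, num_shards: int) -> list[str]:
    """Expand a printf-style shard pattern into concrete file names."""
    return [pattern % index for index in range(num_shards)]


tar_pattern = "data/denoised/audios/shard-%06d.tar"
jsonl_pattern = "data/denoised/txts/shard-%06d.jsonl"

# Each tar shard is paired with the jsonl shard that has the same index.
for tar_path, jsonl_path in zip(shard_paths(tar_pattern, 3),
                                shard_paths(jsonl_pattern, 3)):
    print(tar_path, jsonl_path)
```

With six zero-padded digits, the names sort lexicographically in shard order for up to one million shards.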
You can also pass --input_manifest /path/to/data.lst if your dataset is already in the custom WebDataset format. Next, pass the generated data.lst file to omnivoice.scripts.extract_audio_tokens via --input_manifest to extract tokens.
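As a rough sketch of that hand-off, the snippet below builds the argument list for the token extraction step from the manifest written by the denoiser. Only the module name and --input_manifest come from the text. build_token_extraction_cmd is a hypothetical helper, and any output or tokenizer flags the script requires are deliberately left out because they are not documented in this section.

```python
import shlex
import sys


def build_token_extraction_cmd(manifest_path: str) -> list[str]:
    """Build the argv for the token extraction step (hypothetical helper)."""
    # Only --input_manifest is taken from the documentation; add whatever
    # output and tokenizer flags your setup needs before running.
    return [
        sys.executable, "-m", "omnivoice.scripts.extract_audio_tokens",
        "--input_manifest", manifest_path,
    ]


# The denoise step writes its manifest into data/denoised/.
cmd = build_token_extraction_cmd("data/denoised/data.lst")
print(shlex.join(cmd[1:]))  # arguments after the interpreter path
```

Once the remaining flags are filled in, the list can be passed to subprocess.run(cmd, check=True) so that a failed extraction stops the pipeline.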
Tokenize with noise augmentation
This step adds environmental noise and room reverb to prompt audio during tokenization, which makes the model robust to noisy reference audio at inference time. For our model, we apply noise augmentation to only a small proportion of the data, so the model can still generate good audio from clean reference audio.
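To make the idea concrete, here is a minimal pure-Python sketch of this kind of augmentation: convolve the prompt with a room impulse response, add noise scaled to a target signal-to-noise ratio, and do so for only a fraction of the samples. It is not the implementation in extract_audio_tokens_add_noise. The function names, the 0.2 probability, the 10 dB SNR, and the order of operations are all assumptions chosen for illustration.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution, used here to apply a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech power / noise power equals snr_db, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    # zip truncates to the shorter input, so noise should cover the speech.
    return [s + scale * n for s, n in zip(speech, noise)]


def augment_prompt(speech, noise, rir, rng, prob=0.2, snr_db=10.0):
    """Augment with probability prob (hypothetical value), else stay clean."""
    if rng.random() >= prob:
        return list(speech)
    reverberant = convolve(speech, rir)[: len(speech)]  # keep original length
    return mix_at_snr(reverberant, noise, snr_db)


rng = random.Random(0)
# 0.1 s of a 220 Hz tone at 24 kHz stands in for prompt speech.
speech = [math.sin(2 * math.pi * 220 * t / 24000) for t in range(2400)]
noise = [rng.gauss(0.0, 1.0) for _ in range(2400)]
rir = [1.0, 0.0, 0.5, 0.0, 0.25]  # toy impulse response with two echoes

noisy = mix_at_snr(speech, noise, snr_db=10.0)
residual = [y - x for x, y in zip(speech, noisy)]
measured_snr_db = 10 * math.log10(power(speech) / power(residual))
print(round(measured_snr_db, 3))
```

Keeping prob small mirrors the point above: most prompts stay clean, so the model sees both clean and degraded reference audio during training.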
You need two additional datasets in WebDataset format:
- Noise recordings: environmental noise tar shards with a data.lst manifest
- Room impulse responses (RIR): RIR tar shards with a data.lst manifest
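Because a multi-GPU job is expensive to restart, it can help to confirm that both manifests exist before launching. The check below is a small sketch. missing_files is a hypothetical helper, and the two paths are the example locations used in this guide, so adjust them to your layout.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


required = [
    "data/noise_shards/data.lst",  # passed as --noise_manifest
    "data/rir_shards/data.lst",    # passed as --rir_manifest
]

missing = missing_files(required)
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All manifests found.")
```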
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.extract_audio_tokens_add_noise \
--input_jsonl data.jsonl \
--tar_output_pattern data/tokens/shard-%06d.tar \
--jsonl_output_pattern data/txts/shard-%06d.jsonl \
--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
--noise_manifest data/noise_shards/data.lst \
--rir_manifest data/rir_shards/data.lst \
--nj_per_gpu 3
You can also pass --input_manifest /path/to/data.lst if your dataset is already in the custom WebDataset format.