# Advanced Data Preparation

The advanced pipeline adds **denoising** and **prompt noise augmentation** on top of the basic tokenization workflow. Each stage is optional.

## Prerequisites

- **Denoising**: Sidon model checkpoints (`feature_extractor_cuda.pt`, `decoder_cuda.pt`) from https://huggingface.co/sarulab-speech/sidon-v0.1/tree/main.
- **Noise augmentation**: noise and RIR tar shards with `data.lst` manifests.

## Pipeline Overview

```
Step 1 (optional): Denoise
  Raw audio → Sidon denoiser → clean audio

Step 2: Tokenize (with optional noise augmentation)
  Clean audio + noise augment on prefix → audio tokenizer → tokens
```

## Denoise

Use the [Sidon](https://github.com/sarulab-speech/Sidon) speech enhancement model to remove background noise from raw audio.

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.denoise_audio \
    --input_jsonl data.jsonl \
    --tar_output_pattern data/denoised/audios/shard-%06d.tar \
    --jsonl_output_pattern data/denoised/txts/shard-%06d.jsonl \
    --feature_extractor_path /path/to/sidon_feature_extractor_cuda.pt \
    --decoder_path /path/to/sidon_decoder_cuda.pt \
    --target_sample_rate 24000 \
    --batch_duration 200.0
```

What it does:

1. Reads your JSONL manifest
2. Runs the Sidon denoiser on each audio file
3. Outputs denoised audio as custom WebDataset tar/jsonl shards
4. Generates a `data.lst` manifest in `data/denoised/`

> You can also pass `--input_manifest /path/to/data.lst` if you already have a dataset in the custom WebDataset format.

> Next, pass the generated `data.lst` file with `--input_manifest` to `omnivoice.scripts.extract_audio_tokens` for token extraction.

## Tokenize with noise augmentation

This step adds environmental noise and room reverb to the **prompt audio** during tokenization, making the model robust to noisy reference audio at inference time. In our model, noise augmentation is applied to only a small proportion of the data, so that the model can also generate good audio from clean reference audio.

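To make the idea of prompt noise augmentation concrete, here is a minimal pure-Python sketch of one common way to do it: with some probability, convolve the prompt with a room impulse response, then add noise at a randomly chosen signal-to-noise ratio. This is not the code used by `extract_audio_tokens_add_noise`; the function names, the `prob=0.1` default, and the SNR range are all hypothetical values chosen for illustration.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def convolve(signal, impulse_response):
    """Direct-form convolution, truncated to the length of the input signal."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise ratio is snr_db."""
    # Tile the noise so it covers the whole utterance, then trim it.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[: len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0.0 or noise_rms == 0.0:
        return list(speech)
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def maybe_augment_prompt(prompt, noise, rir, prob=0.1,
                         snr_range=(5.0, 20.0), rng=random):
    """Augment the prompt with probability `prob` (hypothetical default)."""
    if rng.random() >= prob:
        return list(prompt)  # most prompts stay clean
    reverberant = convolve(prompt, rir)
    return mix_at_snr(reverberant, noise, rng.uniform(*snr_range))
```

Keeping `prob` small mirrors the point made above: most prompts stay clean, so the model sees both clean and noisy reference audio during training.
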
You need two additional datasets in WebDataset format:

- **Noise recordings**: environmental noise tar shards with a `data.lst` manifest
- **Room impulse responses (RIR)**: RIR tar shards with a `data.lst` manifest

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.extract_audio_tokens_add_noise \
    --input_jsonl data.jsonl \
    --tar_output_pattern data/tokens/shard-%06d.tar \
    --jsonl_output_pattern data/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer \
    --noise_manifest data/noise_shards/data.lst \
    --rir_manifest data/rir_shards/data.lst \
    --nj_per_gpu 3
```

> You can also pass `--input_manifest /path/to/data.lst` if you already have a dataset in the custom WebDataset format.
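
The `--tar_output_pattern` and `--jsonl_output_pattern` values use a printf-style `%06d` placeholder for the shard index. As a small illustration of how such a pattern expands, the sketch below uses Python's `%` string formatting. The helper `shard_paths` is hypothetical, and starting the index at 0 is an assumption; the scripts may number shards differently.

```python
def shard_paths(tar_pattern, jsonl_pattern, num_shards, start_index=0):
    """Expand printf-style shard patterns into (tar, jsonl) path pairs."""
    return [
        (tar_pattern % i, jsonl_pattern % i)
        for i in range(start_index, start_index + num_shards)
    ]


# Patterns copied from the command above; the shard count is illustrative.
pairs = shard_paths(
    "data/tokens/shard-%06d.tar",
    "data/txts/shard-%06d.jsonl",
    num_shards=2,
)
for tar_path, jsonl_path in pairs:
    print(tar_path, jsonl_path)
# data/tokens/shard-000000.tar data/txts/shard-000000.jsonl
# data/tokens/shard-000001.tar data/txts/shard-000001.jsonl
```

Each tar shard is paired with the JSONL shard of the same index, which makes it easy to check that both output directories contain matching files after a run.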