# Advanced Data Preparation
The advanced pipeline adds **denoising** and **prompt noise augmentation** on top of the basic tokenization workflow. Each stage is optional.
## Prerequisites
- **Denoising**: Sidon model checkpoints (`feature_extractor_cuda.pt`, `decoder_cuda.pt`) from https://huggingface.co/sarulab-speech/sidon-v0.1/tree/main.
- **Noise augmentation**: noise and room impulse response (RIR) recordings packed as tar shards, each set with its own `data.lst` manifest.
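
Before launching a long multi-GPU job, it can help to confirm that these inputs are actually in place. The sketch below is a hypothetical helper, not part of OmniVoice: `require_files` is a name introduced here, and the paths are the placeholder paths used in the commands in this document.

```bash
# Hypothetical pre-flight helper (not part of omnivoice):
# returns non-zero and lists every path that is missing or empty.
require_files() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder paths; substitute your own checkpoint and manifest locations.
require_files \
  /path/to/sidon_feature_extractor_cuda.pt \
  /path/to/sidon_decoder_cuda.pt \
  data/noise_shards/data.lst \
  data/rir_shards/data.lst \
  || echo "fix the paths listed above before launching the pipeline" >&2
```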
## Pipeline Overview
```
Step 1 (optional): Denoise
Raw audio → Sidon denoiser → clean audio
Step 2: Tokenize (with optional noise augmentation)
Clean audio + noise augment on prefix → audio tokenizer → tokens
```
## Denoise
Use the [Sidon](https://github.com/sarulab-speech/Sidon) speech enhancement model to remove background noise from raw audio.
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.denoise_audio \
--input_jsonl data.jsonl \
--tar_output_pattern data/denoised/audios/shard-%06d.tar \
--jsonl_output_pattern data/denoised/txts/shard-%06d.jsonl \
--feature_extractor_path /path/to/sidon_feature_extractor_cuda.pt \
--decoder_path /path/to/sidon_decoder_cuda.pt \
--target_sample_rate 24000 \
--batch_duration 200.0
```
What it does:
1. Reads your JSONL manifest
2. Runs Sidon denoiser on each audio file
3. Outputs denoised audio as custom WebDataset tar/jsonl shards
4. Generates a `data.lst` manifest in `data/denoised/`
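
The `%06d` in the two output patterns is a printf-style placeholder. Assuming the script fills it with a running shard index (whether numbering starts at 0 is an assumption here), the resulting file names can be previewed with a small hypothetical helper:

```bash
# Hypothetical helper: expand a shard pattern for the first N indices.
# Assumption: the script substitutes %06d with a zero-based shard index.
shard_names() {
  pattern=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    # The pattern itself is the printf format string.
    printf "$pattern\n" "$i"
    i=$((i + 1))
  done
}

shard_names 'data/denoised/audios/shard-%06d.tar' 3
# data/denoised/audios/shard-000000.tar
# data/denoised/audios/shard-000001.tar
# data/denoised/audios/shard-000002.tar
```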
> You can also pass `--input_manifest /path/to/data.lst` if your dataset is already in the custom WebDataset format.
> Next, pass the generated `data.lst` file to `omnivoice.scripts.extract_audio_tokens` via `--input_manifest` to extract tokens.
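
A minimal sketch of that hand-off, assuming the denoise command above wrote its manifest to `data/denoised/data.lst`. Only `--input_manifest` is confirmed here for `omnivoice.scripts.extract_audio_tokens`; the output and tokenizer flags are assumed to mirror those of the `extract_audio_tokens_add_noise` command shown later in this document, and the `data/denoised_tokens/` output directory is a hypothetical choice. Check them against the script before running.

```bash
manifest=data/denoised/data.lst

if [ -s "$manifest" ]; then
  # Flags other than --input_manifest are assumptions copied from the
  # noise-augmentation command; the output directory is hypothetical.
  python -m omnivoice.scripts.extract_audio_tokens \
    --input_manifest "$manifest" \
    --tar_output_pattern data/denoised_tokens/audios/shard-%06d.tar \
    --jsonl_output_pattern data/denoised_tokens/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer
else
  echo "manifest not found or empty: $manifest (run the denoise step first)" >&2
fi
```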
## Tokenize with noise augmentation
This step adds environmental noise and room reverb to the **prompt audio** during tokenization, which makes the model robust to noisy reference audio at inference time. In our model, noise augmentation is applied to only a small proportion of the data, so the model can still generate good audio from clean reference audio.
You need two additional datasets in WebDataset format:
- **Noise recordings**: environmental noise tar shards with a `data.lst` manifest
- **Room impulse responses (RIR)**: RIR tar shards with a `data.lst` manifest
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.extract_audio_tokens_add_noise \
--input_jsonl data.jsonl \
--tar_output_pattern data/tokens/shard-%06d.tar \
--jsonl_output_pattern data/txts/shard-%06d.jsonl \
--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
--noise_manifest data/noise_shards/data.lst \
--rir_manifest data/rir_shards/data.lst \
--nj_per_gpu 3
```
> You can also pass `--input_manifest /path/to/data.lst` if your dataset is already in the custom WebDataset format.
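
For sizing the run: assuming `--nj_per_gpu` is the number of parallel jobs started on each visible GPU (an interpretation of the flag name, not something this document states), the total job count is the product of the two settings. The variable names below are illustrative only.

```bash
# Assumption: --nj_per_gpu means "parallel jobs per visible GPU".
devices="0,1,2,3"   # value you would export as CUDA_VISIBLE_DEVICES
nj_per_gpu=3

# Count the comma-separated device ids.
n_gpus=$(printf '%s\n' "$devices" | tr ',' '\n' | grep -c .)
total_jobs=$((n_gpus * nj_per_gpu))

echo "$n_gpus GPUs x $nj_per_gpu jobs each = $total_jobs jobs"
# 4 GPUs x 3 jobs each = 12 jobs
```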