# Advanced Data Preparation
The advanced pipeline adds **denoising** and **prompt noise augmentation** on top of the basic tokenization workflow. Each stage is optional.
## Prerequisites
- **Denoising**: Sidon model checkpoints (`feature_extractor_cuda.pt`, `decoder_cuda.pt`) from https://huggingface.co/sarulab-speech/sidon-v0.1/tree/main.
- **Noise augmentation**: noise and room impulse response (RIR) tar shards, each with a `data.lst` manifest.
## Pipeline Overview
```
Step 1 (optional): Denoise
Raw audio → Sidon denoiser → clean audio
Step 2: Tokenize (with optional noise augmentation)
Clean audio + noise augment on prefix → audio tokenizer → tokens
```
## Denoise
Use the [Sidon](https://github.com/sarulab-speech/Sidon) speech enhancement model to remove background noise from raw audio.
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.denoise_audio \
--input_jsonl data.jsonl \
--tar_output_pattern data/denoised/audios/shard-%06d.tar \
--jsonl_output_pattern data/denoised/txts/shard-%06d.jsonl \
--feature_extractor_path /path/to/sidon_feature_extractor_cuda.pt \
--decoder_path /path/to/sidon_decoder_cuda.pt \
--target_sample_rate 24000 \
--batch_duration 200.0
```
What it does:
1. Reads your JSONL manifest
2. Runs the Sidon denoiser on each audio file
3. Outputs denoised audio as custom WebDataset tar/jsonl shards
4. Generates a `data.lst` manifest in `data/denoised/`
> You can also pass `--input_manifest /path/to/data.lst` if your dataset is already in the custom WebDataset format.
> As the next step, pass the generated `data.lst` file via `--input_manifest` to `omnivoice.scripts.extract_audio_tokens` to extract tokens.
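The `%06d` in the output patterns of the command above is a printf-style placeholder for the shard index. As a minimal sketch, assuming the script expands the pattern with standard printf-style formatting (the usual WebDataset convention) and that numbering starts at 0, the resulting file names would look like this. The helper `shard_paths` is hypothetical and only illustrates the naming.

```python
# Patterns copied from the denoising command above.
tar_pattern = "data/denoised/audios/shard-%06d.tar"
jsonl_pattern = "data/denoised/txts/shard-%06d.jsonl"


def shard_paths(index: int) -> tuple[str, str]:
    """Expand both patterns for one shard index (zero-padded to 6 digits)."""
    return tar_pattern % index, jsonl_pattern % index


if __name__ == "__main__":
    # Starting at index 0 is an assumption, not confirmed by the scripts.
    for i in range(3):
        print(shard_paths(i))
```

Each audio shard is paired with a JSONL shard that carries the same index, which makes it easy to check that both outputs were written.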
## Tokenize with Noise Augmentation
This step adds environmental noise and room reverberation to the **prompt audio** during tokenization, which makes the model robust to noisy reference audio at inference time. In our model, we apply noise augmentation to only a small proportion of the data, so the model can still generate good audio from clean reference audio.
You need two additional datasets in WebDataset format:
- **Noise recordings**: environmental noise tar shards with a `data.lst` manifest
- **Room impulse responses (RIR)**: RIR tar shards with a `data.lst` manifest
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m omnivoice.scripts.extract_audio_tokens_add_noise \
--input_jsonl data.jsonl \
--tar_output_pattern data/tokens/shard-%06d.tar \
--jsonl_output_pattern data/txts/shard-%06d.jsonl \
--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
--noise_manifest data/noise_shards/data.lst \
--rir_manifest data/rir_shards/data.lst \
--nj_per_gpu 3
```
> You can also pass `--input_manifest /path/to/data.lst` if your dataset is already in the custom WebDataset format.
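To make the augmentation concrete, here is a minimal NumPy sketch of the general technique: convolve the prompt with an RIR, then mix in noise at a chosen signal-to-noise ratio, and apply this to only a fraction of prompts. It is not the implementation in `extract_audio_tokens_add_noise`; the function names, the probability `p=0.2`, and the SNR range are illustrative assumptions.

```python
import numpy as np


def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve speech with a room impulse response, keeping the input length."""
    rir = rir / (np.max(np.abs(rir)) + 1e-8)  # normalize the RIR peak to ~1
    return np.convolve(speech, rir, mode="full")[: len(speech)]


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    # Tile, then crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech**2) + 1e-8
    noise_power = np.mean(noise**2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def augment_prompt(prompt, noise, rir, rng, p=0.2, snr_range=(5.0, 20.0)):
    """With probability p, apply reverb and additive noise; otherwise return
    the prompt unchanged. p and snr_range are assumed values for illustration."""
    if rng.random() >= p:
        return prompt
    snr_db = rng.uniform(*snr_range)
    return mix_at_snr(add_reverb(prompt, rir), noise, snr_db)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = rng.standard_normal(24000)  # 1 s of placeholder audio at 24 kHz
    noise = rng.standard_normal(8000)    # placeholder noise clip
    rir = np.exp(-np.arange(2400) / 300.0) * rng.standard_normal(2400)
    out = augment_prompt(prompt, noise, rir, rng, p=1.0)
    print(out.shape)
```

Keeping `p` small mirrors the note above: most prompts stay clean, so the model sees both clean and degraded reference audio during training.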