	# Advanced Data Preparation

	The advanced pipeline adds two optional stages to the basic tokenization workflow: denoising of the raw audio and noise augmentation of the prompt audio.

	## Prerequisites

	- Denoising: Sidon model checkpoints (`feature_extractor_cuda.pt`, `decoder_cuda.pt`) from https://huggingface.co/sarulab-speech/sidon-v0.1/tree/main.
	- Noise augmentation: noise and room impulse response (RIR) tar shards, each with a `data.lst` manifest.
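
Before launching a long multi-GPU job, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of OmniVoice: the helper name `missing_prerequisites` and the example paths are hypothetical and should be replaced with your own locations.

```python
from pathlib import Path


def missing_prerequisites(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever you stored the files.
    required = [
        "/path/to/feature_extractor_cuda.pt",
        "/path/to/decoder_cuda.pt",
        "data/noise_shards/data.lst",
        "data/rir_shards/data.lst",
    ]
    for path in missing_prerequisites(required):
        print(f"missing: {path}")
```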

	## Pipeline Overview

	```
	Step 1 (optional): Denoise
	Raw audio → Sidon denoiser → clean audio

	Step 2: Tokenize (with optional noise augmentation)
	Clean audio + noise augment on prefix → audio tokenizer → tokens
	```


	## Denoise

	Use the [Sidon](https://github.com/sarulab-speech/Sidon) speech enhancement model to remove background noise from raw audio.

	```bash
	export CUDA_VISIBLE_DEVICES="0,1,2,3"
	python -m omnivoice.scripts.denoise_audio \
	--input_jsonl data.jsonl \
	--tar_output_pattern data/denoised/audios/shard-%06d.tar \
	--jsonl_output_pattern data/denoised/txts/shard-%06d.jsonl \
	--feature_extractor_path /path/to/sidon_feature_extractor_cuda.pt \
	--decoder_path /path/to/sidon_decoder_cuda.pt \
	--target_sample_rate 24000 \
	--batch_duration 200.0
	```

	What it does:
	1. Reads your JSONL manifest
	2. Runs the Sidon denoiser on each audio file
	3. Outputs denoised audio as custom WebDataset tar/jsonl shards
	4. Generates a `data.lst` manifest in `data/denoised/`

	> If your dataset is already in the custom WebDataset format, you can pass `--input_manifest /path/to/data.lst` instead of `--input_jsonl`.
	> As the next step, pass the generated `data.lst` file via `--input_manifest` to `omnivoice.scripts.extract_audio_tokens` for token extraction.
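
The `%06d` in the output patterns is a printf-style placeholder for a zero-padded, six-digit integer. Assuming the script fills it with a running shard index (the starting index is not stated here), the resulting file names can be previewed with a small sketch. The helper `shard_paths` is hypothetical and only illustrates the naming.

```python
TAR_PATTERN = "data/denoised/audios/shard-%06d.tar"
JSONL_PATTERN = "data/denoised/txts/shard-%06d.jsonl"


def shard_paths(index, tar_pattern=TAR_PATTERN, jsonl_pattern=JSONL_PATTERN):
    """Expand both patterns for one shard index using printf-style formatting."""
    return tar_pattern % index, jsonl_pattern % index


if __name__ == "__main__":
    for i in range(3):
        print(shard_paths(i))
```

Each tar shard has a matching JSONL shard with the same index, so the two output patterns should use the same placeholder width.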


	## Tokenize with Noise Augmentation

	This step adds environmental noise and room reverberation to the prompt audio during tokenization, which makes the model more robust to noisy reference audio at inference time. In our model, we apply noise augmentation to only a small proportion of the data, so that the model still generates good audio when given clean reference audio.
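
For intuition, augmentation of this kind usually means convolving the prompt with a room impulse response and then adding noise at a chosen signal-to-noise ratio, applied to only a fraction of the samples. The sketch below illustrates that general technique in plain Python. It is not the implementation used by `extract_audio_tokens_add_noise`; the function names, the `snr_db` value, and the `aug_prob` value are all assumptions chosen for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(s * s for s in x) / len(x)


def convolve_rir(speech, rir):
    """Apply reverb by direct convolution, truncated to the input length."""
    out = [0.0] * len(speech)
    for n in range(len(speech)):
        acc = 0.0
        for k in range(min(n + 1, len(rir))):
            acc += rir[k] * speech[n - k]
        out[n] = acc
    return out


def mix_at_snr(speech, noise, snr_db):
    """Add noise scaled so the speech-to-noise power ratio equals snr_db."""
    # Tile, then trim, the noise so it matches the speech length.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[: len(speech)]
    noise_power = power(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to add
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def maybe_augment(prompt, noise, rir, snr_db=15.0, aug_prob=0.2, rng=random):
    """Augment the prompt with probability aug_prob; otherwise return a copy."""
    if rng.random() >= aug_prob:
        return list(prompt)
    return mix_at_snr(convolve_rir(prompt, rir), noise, snr_db)
```

Keeping `aug_prob` well below 1 mirrors the point above: most prompts stay clean, so the model sees both clean and noisy reference audio during training.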

	You need two additional datasets in WebDataset format:
	- Noise recordings: environmental noise tar shards with a `data.lst` manifest
	- Room impulse responses (RIR): RIR tar shards with a `data.lst` manifest

	```bash
	export CUDA_VISIBLE_DEVICES="0,1,2,3"
	python -m omnivoice.scripts.extract_audio_tokens_add_noise \
	--input_jsonl data.jsonl \
	--tar_output_pattern data/tokens/shard-%06d.tar \
	--jsonl_output_pattern data/txts/shard-%06d.jsonl \
	--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
	--noise_manifest data/noise_shards/data.lst \
	--rir_manifest data/rir_shards/data.lst \
	--nj_per_gpu 3
	```

	> If your dataset is already in the custom WebDataset format, you can pass `--input_manifest /path/to/data.lst` instead of `--input_jsonl`.