# Data Preparation OmniVoice trains on a custom WebDataset format where audio data is packed into **tar shards** with paired **JSONL metadata** files. Each tar shard contains hundreds to thousands of samples (as `.npy` audio token arrays), drastically reducing disk I/O during training. The separated jsonl file allows for easier modification of metadata. This document explains the data format in detail and walks through the preparation pipeline. ## 1. Input Format Prepare a JSONL file where each line is a JSON object: ```jsonl {"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"} {"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"} ``` Fields: - `id` — unique sample identifier (used to match samples across shards and label files) - `audio_path` — absolute path to the audio file (wav/flac/mp3, will be resampled to 24 kHz) - `text` — transcript text - `language_id` — (optional) language code, used for multilingual training, can be omitted ## 2. Processing The tokenization script `extract_audio_tokens.py` converts audio into 8-layer discrete tokens and packs them into WebDataset shards. ```bash export CUDA_VISIBLE_DEVICES="0,1,2,4" # GPUs used for token extraction python -m omnivoice.scripts.extract_audio_tokens \ --input_jsonl data.jsonl \ --tar_output_pattern output/audios/shard-%06d.tar \ --jsonl_output_pattern output/txts/shard-%06d.jsonl \ --tokenizer_path eustlb/higgs-audio-v2-tokenizer \ --nj_per_gpu 3 \ --shuffle True ``` What it does: 1. Reads your JSONL manifest 2. Encodes each audio file into discrete tokens using audio tokenizer 3. Packs tokens into WebDataset tar shards with paired jsonl metadata files 4. Generates a `data.lst` manifest file
Alternative: WebDataset Input (if you already have raw-audio tar shards) Pass the `data.lst` manifest instead of `--input_jsonl`: ```bash export CUDA_VISIBLE_DEVICES="0,1,2,4" # GPUs used for token extraction python -m omnivoice.scripts.extract_audio_tokens \ --input_manifest existing_data/data.lst \ --tar_output_pattern output/audios/shard-%06d.tar \ --jsonl_output_pattern output/txts/shard-%06d.jsonl \ --tokenizer_path eustlb/higgs-audio-v2-tokenizer \ --nj_per_gpu 3 \ --shuffle True ``` The existing_data/data.lst is generated with: ```bash python -m omnivoice.scripts.jsonl_to_webdataset \ --input data.jsonl \ --output data/shards \ --sr 24000 \ --shard-size 1000 ``` This resamples audio to the target sample rate and packs FLAC files into tar shards with paired jsonl metadata files.
### Explanation of the script's options: | Option | Default | Description | |---|---|---| | `--input_manifest` | None | Path to input dataset manifest (`data.lst`), mutually exclusive with `--input_jsonl` | | `--input_jsonl` | None | Path to raw JSONL file, mutually exclusive with `--input_manifest` | | `--tar_output_pattern` | (required) | Tar shard output pattern, e.g. `output/audios/shard-%06d.tar` | | `--jsonl_output_pattern` | (required) | JSONL shard output pattern, e.g. `output/txts/shard-%06d.jsonl` | | `--tokenizer_path` | `eustlb/higgs-audio-v2-tokenizer` | HuggingFace tokenizer path or local path | | `--nj_per_gpu` | 3 | Worker processes per GPU | | `--loader_workers` | 24 | DataLoader workers for streaming `IterableDataset` | | `--shuffle` | True | Shuffle samples before sharding | | `--shuffle-seed` | 42 | Random seed for shuffling | | `--samples_per_shard` | 1000 | Max samples per tar shard | | `--min_num_shards` | 32 | Minimum number of output shards (ensures shard count >= num\_gpu × num\_workers) | | `--min_length` | 0.0 | Skip audio shorter than this (seconds) | | `--max_length` | inf | Skip audio longer than this (seconds) | | `--skip_errors` | False | Continue on processing errors instead of aborting | | `--num_machines` | 1 | Total number of machines for distributed runs | | `--machine_index` | 0 | Zero-based machine index for distributed preprocessing | ### Output Structure Output structure with the following output patterns ```bash --tar_output_pattern output/audios/shard-%06d.tar \ --jsonl_output_pattern output/txts/shard-%06d.jsonl ``` will be: ``` output/ ├── audios/ # WebDataset tar shards (audio tokens) │ ├── shard-000000.tar # Each tar packs ~1000 samples │ ├── shard-000001.tar │ └── ... ├── txts/ # Per-shard companion JSONL labels │ ├── shard-000000.jsonl # One JSON line per sample in the corresponding tar │ ├── shard-000001.jsonl │ └── ... ├── data.lst # Manifest linking tar ↔ jsonl shards └── errors.jsonl # Samples that failed processing (if any) ``` `data.lst` and `errors.jsonl` are written to the **parent directory** of `audios/` and `txts/`. ### The `data.lst` manifest Each line in `data.lst` describes one shard: ``` /path/to/shard-000000.tar /path/to/shard-000000.jsonl 1000 3600.500 /path/to/shard-000001.tar /path/to/shard-000001.jsonl 800 2880.200 ``` Format: ` ` - Paths are **absolute** - `.tar` file contains the audio tokens. - `.jsonl` file contains the metadata in the original provided JSONL file, allows easier access and modification of metadata without decompressing the tar file. - This manifest is what the training data config references. ### Inside a tar shard Each `.tar` file packs **many samples** (default 1000 per shard) into a single archive. This is the key advantage of WebDataset: instead of reading thousands of tiny files, the dataloader reads sequentially from a few large tars, drastically reducing disk I/O pressure. Each sample in the tar is a pair of files with matching keys: ``` shard-000000.tar: sample_001.npy # Audio tokens: numpy array, shape [8, T], dtype int16 sample_002.npy ... sample_1000.npy ``` ## 3. Data Config for Training After creating WebDataset shards, write a data config JSON that references them: ```json { "train": [ { "language_id": "en", "manifest_path": ["data/custom/tokens/train/data.lst"], "repeat": 1 } ], "dev": [ { "language_id": "en", "manifest_path": ["data/custom/tokens/dev/data.lst"], "repeat": 1 } ] } ``` - `manifest_path` — list of `data.lst` files (one per shard directory) - `repeat` — how many times to repeat this dataset per epoch (useful for balancing languages) - `language_id` is not used, just for a better data organization. See [examples/config/](../examples/config/) for ready-to-use data config files. > See [docs/data_preparation_advanced.md](../docs/data_preparation_advanced.md) for denoising and noise augmentation.