Data Preparation
OmniVoice trains on a custom WebDataset format in which audio data is packed into tar shards, each paired with a JSONL metadata file. A tar shard holds hundreds to thousands of samples (stored as .npy audio token arrays), which greatly reduces disk I/O during training. Keeping the metadata in a separate JSONL file makes it easy to inspect and modify without touching the tar. This document explains the data format in detail and walks through the preparation pipeline.
1. Input Format
Prepare a JSONL file where each line is a JSON object:
{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}
Fields:
- id: unique sample identifier (used to match samples across shards and label files)
- audio_path: absolute path to the audio file (wav/flac/mp3; it will be resampled to 24 kHz)
- text: transcript text
- language_id: (optional) language code, used for multilingual training
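Before running tokenization, it can help to sanity-check the manifest against the field rules above. The following is a minimal sketch, not part of OmniVoice: `validate_manifest` and `REQUIRED_FIELDS` are hypothetical names, and the checks assume only what the field list states (required `id`, `audio_path`, `text`; unique ids; absolute paths).

```python
import json
import os

# Assumption: these three fields are required; language_id is optional.
REQUIRED_FIELDS = ("id", "audio_path", "text")


def validate_manifest(lines):
    """Return a list of (line_number, message) problems found in JSONL lines."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(sample, dict):
            problems.append((lineno, "line is not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in sample]
        if missing:
            problems.append((lineno, f"missing fields: {', '.join(missing)}"))
            continue
        if sample["id"] in seen_ids:
            problems.append((lineno, f"duplicate id: {sample['id']}"))
        seen_ids.add(sample["id"])
        if not os.path.isabs(sample["audio_path"]):
            problems.append((lineno, "audio_path is not absolute"))
    return problems
```

A typical use would be `validate_manifest(open("data.jsonl", encoding="utf-8"))`, fixing any reported lines before starting token extraction.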
2. Processing
The tokenization script extract_audio_tokens.py converts audio into 8-layer discrete tokens and packs them into WebDataset shards.
export CUDA_VISIBLE_DEVICES="0,1,2,4" # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
--input_jsonl data.jsonl \
--tar_output_pattern output/audios/shard-%06d.tar \
--jsonl_output_pattern output/txts/shard-%06d.jsonl \
--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
--nj_per_gpu 3 \
--shuffle True
What it does:
- Reads your JSONL manifest
- Encodes each audio file into discrete tokens using the audio tokenizer
- Packs tokens into WebDataset tar shards with paired jsonl metadata files
- Generates a data.lst manifest file
Alternative: WebDataset Input (if you already have raw-audio tar shards)
Pass the data.lst manifest with --input_manifest instead of --input_jsonl:
export CUDA_VISIBLE_DEVICES="0,1,2,4" # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
--input_manifest existing_data/data.lst \
--tar_output_pattern output/audios/shard-%06d.tar \
--jsonl_output_pattern output/txts/shard-%06d.jsonl \
--tokenizer_path eustlb/higgs-audio-v2-tokenizer \
--nj_per_gpu 3 \
--shuffle True
A raw-audio manifest such as existing_data/data.lst is generated with:
python -m omnivoice.scripts.jsonl_to_webdataset \
--input data.jsonl \
--output data/shards \
--sr 24000 \
--shard-size 1000
This resamples audio to the target sample rate and packs FLAC files into tar shards with paired jsonl metadata files.
Options of extract_audio_tokens.py:
| Option | Default | Description |
|---|---|---|
| --input_manifest | None | Path to input dataset manifest (data.lst), mutually exclusive with --input_jsonl |
| --input_jsonl | None | Path to raw JSONL file, mutually exclusive with --input_manifest |
| --tar_output_pattern | (required) | Tar shard output pattern, e.g. output/audios/shard-%06d.tar |
| --jsonl_output_pattern | (required) | JSONL shard output pattern, e.g. output/txts/shard-%06d.jsonl |
| --tokenizer_path | eustlb/higgs-audio-v2-tokenizer | HuggingFace tokenizer path or local path |
| --nj_per_gpu | 3 | Worker processes per GPU |
| --loader_workers | 24 | DataLoader workers for streaming IterableDataset |
| --shuffle | True | Shuffle samples before sharding |
| --shuffle-seed | 42 | Random seed for shuffling |
| --samples_per_shard | 1000 | Max samples per tar shard |
| --min_num_shards | 32 | Minimum number of output shards (ensures shard count >= num_gpu × num_workers) |
| --min_length | 0.0 | Skip audio shorter than this (seconds) |
| --max_length | inf | Skip audio longer than this (seconds) |
| --skip_errors | False | Continue on processing errors instead of aborting |
| --num_machines | 1 | Total number of machines for distributed runs |
| --machine_index | 0 | Zero-based machine index for distributed preprocessing |
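The note on --min_num_shards comes down to simple arithmetic: a small dataset split at --samples_per_shard may produce fewer shards than there are dataloader workers reading them. The sketch below only illustrates that arithmetic. `enough_shards` is a hypothetical helper, and how the script itself reconciles the two options is not specified here.

```python
import math


def enough_shards(num_samples, samples_per_shard, num_gpus, workers_per_gpu):
    """Compare the shard count implied by samples_per_shard with the reader count.

    Returns (shards, readers, ok), where ok is True when every reader
    (one per GPU worker) can be given at least one shard.
    """
    shards = math.ceil(num_samples / samples_per_shard)
    readers = num_gpus * workers_per_gpu
    return shards, readers, shards >= readers


# 20,000 samples at 1000 per shard give 20 shards, fewer than 4 GPUs x 8 workers.
print(enough_shards(20_000, 1000, num_gpus=4, workers_per_gpu=8))
```

In a case like this one, a lower bound on the shard count forces smaller shards so that no worker is left without data.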
Output Structure
With the following output patterns:
--tar_output_pattern output/audios/shard-%06d.tar \
--jsonl_output_pattern output/txts/shard-%06d.jsonl
the output directory looks like this:
output/
├── audios/                  # WebDataset tar shards (audio tokens)
│   ├── shard-000000.tar     # Each tar packs ~1000 samples
│   ├── shard-000001.tar
│   └── ...
├── txts/                    # Per-shard companion JSONL labels
│   ├── shard-000000.jsonl   # One JSON line per sample in the corresponding tar
│   ├── shard-000001.jsonl
│   └── ...
├── data.lst                 # Manifest linking tar ↔ jsonl shards
└── errors.jsonl             # Samples that failed processing (if any)
data.lst and errors.jsonl are written to the parent directory of audios/ and txts/.
The data.lst manifest
Each line in data.lst describes one shard:
/path/to/shard-000000.tar /path/to/shard-000000.jsonl 1000 3600.500
/path/to/shard-000001.tar /path/to/shard-000001.jsonl 800 2880.200
Format: <tar_path> <jsonl_path> <num_samples> <total_duration_seconds>
- Paths are absolute
- The .tar file contains the audio tokens.
- The .jsonl file contains the metadata from the original input JSONL file, so metadata can be read and modified without unpacking the tar file.
- This manifest is what the training data config references.
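Because the manifest is plain text with four whitespace-separated fields, it is easy to inspect. The following is a minimal sketch that parses data.lst lines and totals samples and hours. `ShardEntry`, `parse_manifest`, and `summarize` are hypothetical names, and the parser assumes paths contain no spaces.

```python
from typing import NamedTuple


class ShardEntry(NamedTuple):
    tar_path: str
    jsonl_path: str
    num_samples: int
    duration_seconds: float


def parse_manifest(lines):
    """Parse data.lst lines: <tar_path> <jsonl_path> <num_samples> <duration>."""
    entries = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
        tar_path, jsonl_path, num_samples, duration = parts
        entries.append(
            ShardEntry(tar_path, jsonl_path, int(num_samples), float(duration))
        )
    return entries


def summarize(entries):
    """Return (total_samples, total_hours) over all shards."""
    total_samples = sum(e.num_samples for e in entries)
    total_hours = sum(e.duration_seconds for e in entries) / 3600.0
    return total_samples, total_hours
```

For example, `summarize(parse_manifest(open("output/data.lst")))` gives a quick check of dataset size before training.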
Inside a tar shard
Each .tar file packs many samples (default 1000 per shard) into a single archive. This is the key advantage of WebDataset: instead of reading thousands of tiny files, the dataloader reads sequentially from a few large tars, drastically reducing disk I/O pressure.
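Each sample in the tar is stored as one .npy file whose name (the key) is the sample id; the sample's metadata is the line with the same id in the paired JSONL shard: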
shard-000000.tar:
sample_001.npy # Audio tokens: numpy array, shape [8, T], dtype int16
sample_002.npy
...
sample_1000.npy
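To inspect a shard by hand, the standard library's tarfile module together with NumPy is enough. The following is a minimal sketch, assuming each member is a plain `<id>.npy` file as in the listing above; `iter_shard` is a hypothetical helper, not the loader OmniVoice uses for training.

```python
import io
import tarfile

import numpy as np


def iter_shard(tar_path):
    """Yield (sample_id, tokens) for every .npy member of a tar shard."""
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".npy"):
                continue
            # Read the member into memory, then decode it as a NumPy array.
            data = tar.extractfile(member).read()
            tokens = np.load(io.BytesIO(data))
            sample_id = member.name[: -len(".npy")]
            yield sample_id, tokens
```

Looping over `iter_shard("output/audios/shard-000000.tar")` and printing `tokens.shape` and `tokens.dtype` is a quick way to confirm the documented [8, T] int16 layout.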
3. Data Config for Training
After creating WebDataset shards, write a data config JSON that references them:
{
"train": [
{
"language_id": "en",
"manifest_path": ["data/custom/tokens/train/data.lst"],
"repeat": 1
}
],
"dev": [
{
"language_id": "en",
"manifest_path": ["data/custom/tokens/dev/data.lst"],
"repeat": 1
}
]
}
- manifest_path: list of data.lst files (one per shard directory)
- repeat: how many times to repeat this dataset per epoch (useful for balancing languages)
- language_id: not used by training; it only helps keep the data organized
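A config like the one above can be checked before launching training. The following is a minimal sketch that lists the manifests each split references along with their repeat counts. `list_manifests` is a hypothetical helper, and defaulting a missing `repeat` to 1 is an assumption, not documented behavior.

```python
import json


def list_manifests(config_text):
    """Map each split name to a list of (manifest_path, repeat) pairs."""
    config = json.loads(config_text)
    result = {}
    for split, datasets in config.items():
        pairs = []
        for dataset in datasets:
            repeat = dataset.get("repeat", 1)  # assumption: default to 1
            for path in dataset["manifest_path"]:
                pairs.append((path, repeat))
        result[split] = pairs
    return result
```

Checking each returned path with `os.path.exists` catches a mistyped data.lst location early.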
See examples/config/ for ready-to-use data config files.
See docs/data_preparation_advanced.md for denoising and noise augmentation.