OmniVoice_sync_data_and_code / docs /data_preparation.md
Abdelrahman2922's picture
Add files using upload-large-folder tool
a4d9876 verified

Data Preparation

OmniVoice trains on a custom WebDataset format where audio data is packed into tar shards with paired JSONL metadata files. Each tar shard contains hundreds to thousands of samples (as .npy audio token arrays), drastically reducing disk I/O during training. The separated jsonl file allows for easier modification of metadata. This document explains the data format in detail and walks through the preparation pipeline.

1. Input Format

Prepare a JSONL file where each line is a JSON object:

{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "δ½ ε₯½δΈ–η•Œ", "language_id": "zh"}

Fields:

  • id β€” unique sample identifier (used to match samples across shards and label files)
  • audio_path β€” absolute path to the audio file (wav/flac/mp3, will be resampled to 24 kHz)
  • text β€” transcript text
  • language_id β€” (optional) language code, used for multilingual training, can be omitted

2. Processing

The tokenization script extract_audio_tokens.py converts audio into 8-layer discrete tokens and packs them into WebDataset shards.

export CUDA_VISIBLE_DEVICES="0,1,2,4"  # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
    --input_jsonl data.jsonl \
    --tar_output_pattern output/audios/shard-%06d.tar \
    --jsonl_output_pattern output/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer \
    --nj_per_gpu 3 \
    --shuffle True

What it does:

  1. Reads your JSONL manifest
  2. Encodes each audio file into discrete tokens using audio tokenizer
  3. Packs tokens into WebDataset tar shards with paired jsonl metadata files
  4. Generates a data.lst manifest file
Alternative: WebDataset Input (if you already have raw-audio tar shards)

Pass the data.lst manifest instead of --input_jsonl:

export CUDA_VISIBLE_DEVICES="0,1,2,4"  # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
    --input_manifest existing_data/data.lst \
    --tar_output_pattern output/audios/shard-%06d.tar \
    --jsonl_output_pattern output/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer \
    --nj_per_gpu 3 \
    --shuffle True

The existing_data/data.lst is generated with:

python -m omnivoice.scripts.jsonl_to_webdataset \
    --input data.jsonl \
    --output data/shards \
    --sr 24000 \
    --shard-size 1000

This resamples audio to the target sample rate and packs FLAC files into tar shards with paired jsonl metadata files.

Explanation of the script's options:

Option Default Description
--input_manifest None Path to input dataset manifest (data.lst), mutually exclusive with --input_jsonl
--input_jsonl None Path to raw JSONL file, mutually exclusive with --input_manifest
--tar_output_pattern (required) Tar shard output pattern, e.g. output/audios/shard-%06d.tar
--jsonl_output_pattern (required) JSONL shard output pattern, e.g. output/txts/shard-%06d.jsonl
--tokenizer_path eustlb/higgs-audio-v2-tokenizer HuggingFace tokenizer path or local path
--nj_per_gpu 3 Worker processes per GPU
--loader_workers 24 DataLoader workers for streaming IterableDataset
--shuffle True Shuffle samples before sharding
--shuffle-seed 42 Random seed for shuffling
--samples_per_shard 1000 Max samples per tar shard
--min_num_shards 32 Minimum number of output shards (ensures shard count >= num_gpu Γ— num_workers)
--min_length 0.0 Skip audio shorter than this (seconds)
--max_length inf Skip audio longer than this (seconds)
--skip_errors False Continue on processing errors instead of aborting
--num_machines 1 Total number of machines for distributed runs
--machine_index 0 Zero-based machine index for distributed preprocessing

Output Structure

Output structure with the following output patterns

--tar_output_pattern output/audios/shard-%06d.tar \
--jsonl_output_pattern output/txts/shard-%06d.jsonl

will be:

output/
β”œβ”€β”€ audios/                    # WebDataset tar shards (audio tokens)
β”‚   β”œβ”€β”€ shard-000000.tar       # Each tar packs ~1000 samples
β”‚   β”œβ”€β”€ shard-000001.tar
β”‚   └── ...
β”œβ”€β”€ txts/                      # Per-shard companion JSONL labels
β”‚   β”œβ”€β”€ shard-000000.jsonl     # One JSON line per sample in the corresponding tar
β”‚   β”œβ”€β”€ shard-000001.jsonl
β”‚   └── ...
β”œβ”€β”€ data.lst                   # Manifest linking tar ↔ jsonl shards
└── errors.jsonl               # Samples that failed processing (if any)

data.lst and errors.jsonl are written to the parent directory of audios/ and txts/.

The data.lst manifest

Each line in data.lst describes one shard:

/path/to/shard-000000.tar /path/to/shard-000000.jsonl 1000 3600.500
/path/to/shard-000001.tar /path/to/shard-000001.jsonl 800 2880.200

Format: <tar_path> <jsonl_path> <num_samples> <total_duration_seconds>

  • Paths are absolute
  • .tar file contains the audio tokens.
  • .jsonl file contains the metadata in the original provided JSONL file, allows easier access and modification of metadata without decompressing the tar file.
  • This manifest is what the training data config references.

Inside a tar shard

Each .tar file packs many samples (default 1000 per shard) into a single archive. This is the key advantage of WebDataset: instead of reading thousands of tiny files, the dataloader reads sequentially from a few large tars, drastically reducing disk I/O pressure.

Each sample in the tar is a pair of files with matching keys:

shard-000000.tar:
  sample_001.npy    # Audio tokens: numpy array, shape [8, T], dtype int16
  sample_002.npy
  ...
  sample_1000.npy

3. Data Config for Training

After creating WebDataset shards, write a data config JSON that references them:

{
    "train": [
        {
            "language_id": "en",
            "manifest_path": ["data/custom/tokens/train/data.lst"],
            "repeat": 1
        }
    ],
    "dev": [
        {
            "language_id": "en",
            "manifest_path": ["data/custom/tokens/dev/data.lst"],
            "repeat": 1
        }
    ]
}
  • manifest_path β€” list of data.lst files (one per shard directory)
  • repeat β€” how many times to repeat this dataset per epoch (useful for balancing languages)
  • language_id is not used, just for a better data organization.

See examples/config/ for ready-to-use data config files.

See docs/data_preparation_advanced.md for denoising and noise augmentation.