# Data Preparation

OmniVoice trains on a custom WebDataset format in which audio data is packed into **tar shards** with paired **JSONL metadata** files. Each tar shard contains hundreds to thousands of samples (stored as `.npy` audio token arrays), which drastically reduces disk I/O during training. Keeping the metadata in separate JSONL files makes it easier to modify. This document explains the data format in detail and walks through the preparation pipeline.


## 1. Input Format

Prepare a JSONL file where each line is a JSON object:

```jsonl
{"id": "sample_001", "audio_path": "/data/audio/001.wav", "text": "Hello world", "language_id": "en"}
{"id": "sample_002", "audio_path": "/data/audio/002.wav", "text": "你好世界", "language_id": "zh"}
```

Fields:
- `id`: unique sample identifier (used to match samples across shards and label files)
- `audio_path`: absolute path to the audio file (wav/flac/mp3; it will be resampled to 24 kHz)
- `text`: transcript text
- `language_id`: (optional) language code, used for multilingual training
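
A manifest in this format can be assembled with a few lines of Python. The sketch below assumes a hypothetical layout in which every `.wav` file sits next to a `.txt` transcript with the same base name; the function name `build_manifest` and that layout are illustrative and not part of OmniVoice.

```python
import json
from pathlib import Path


def build_manifest(audio_dir, output_jsonl, language_id=None):
    """Write one JSON line per (wav, txt) pair found in audio_dir.

    Assumes a hypothetical layout: <name>.wav next to <name>.txt.
    Returns the number of entries written.
    """
    audio_dir = Path(audio_dir).resolve()  # audio_path must be absolute
    count = 0
    with open(output_jsonl, "w", encoding="utf-8") as out:
        for wav in sorted(audio_dir.glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip audio that has no transcript
            entry = {
                "id": wav.stem,  # must be unique across the dataset
                "audio_path": str(wav),
                "text": txt.read_text(encoding="utf-8").strip(),
            }
            if language_id is not None:
                entry["language_id"] = language_id  # optional field
            # ensure_ascii=False keeps non-Latin transcripts human-readable
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `build_manifest("/data/audio", "data.jsonl", language_id="en")` would produce a file that can be passed to `--input_jsonl` in the next step.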


## 2. Processing

The tokenization script `extract_audio_tokens.py` converts audio into 8-layer discrete tokens and packs them into WebDataset shards.

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,4"  # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
    --input_jsonl data.jsonl \
    --tar_output_pattern output/audios/shard-%06d.tar \
    --jsonl_output_pattern output/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer \
    --nj_per_gpu 3 \
    --shuffle True
```

What it does:
1. Reads your JSONL manifest
2. Encodes each audio file into discrete tokens using the audio tokenizer
3. Packs tokens into WebDataset tar shards with paired JSONL metadata files
4. Generates a `data.lst` manifest file

<details>
<summary><strong>Alternative:</strong> WebDataset Input (if you already have raw-audio tar shards)</summary>

Pass the `data.lst` manifest instead of `--input_jsonl`:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,4"  # GPUs used for token extraction
python -m omnivoice.scripts.extract_audio_tokens \
    --input_manifest existing_data/data.lst \
    --tar_output_pattern output/audios/shard-%06d.tar \
    --jsonl_output_pattern output/txts/shard-%06d.jsonl \
    --tokenizer_path eustlb/higgs-audio-v2-tokenizer \
    --nj_per_gpu 3 \
    --shuffle True
```

A raw-audio manifest such as `existing_data/data.lst` is generated with:
```bash
python -m omnivoice.scripts.jsonl_to_webdataset \
    --input data.jsonl \
    --output data/shards \
    --sr 24000 \
    --shard-size 1000
```

This resamples audio to the target sample rate and packs FLAC files into tar shards with paired jsonl metadata files.

</details>



### Script options

| Option | Default | Description |
|---|---|---|
| `--input_manifest` | None | Path to input dataset manifest (`data.lst`), mutually exclusive with `--input_jsonl` |
| `--input_jsonl` | None | Path to raw JSONL file, mutually exclusive with `--input_manifest` |
| `--tar_output_pattern` | (required) | Tar shard output pattern, e.g. `output/audios/shard-%06d.tar` |
| `--jsonl_output_pattern` | (required) | JSONL shard output pattern, e.g. `output/txts/shard-%06d.jsonl` |
| `--tokenizer_path` | `eustlb/higgs-audio-v2-tokenizer` | HuggingFace tokenizer path or local path |
| `--nj_per_gpu` | 3 | Worker processes per GPU |
| `--loader_workers` | 24 | DataLoader workers for streaming `IterableDataset` |
| `--shuffle` | True | Shuffle samples before sharding |
| `--shuffle-seed` | 42 | Random seed for shuffling |
| `--samples_per_shard` | 1000 | Max samples per tar shard |
| `--min_num_shards` | 32 | Minimum number of output shards (ensures shard count >= num\_gpu × num\_workers) |
| `--min_length` | 0.0 | Skip audio shorter than this (seconds) |
| `--max_length` | inf | Skip audio longer than this (seconds) |
| `--skip_errors` | False | Continue on processing errors instead of aborting |
| `--num_machines` | 1 | Total number of machines for distributed runs |
| `--machine_index` | 0 | Zero-based machine index for distributed preprocessing |


### Output Structure

Given the following output patterns:

```bash
--tar_output_pattern output/audios/shard-%06d.tar \
--jsonl_output_pattern output/txts/shard-%06d.jsonl
```

the output directory is laid out as follows:

```
output/
├── audios/                    # WebDataset tar shards (audio tokens)
│   ├── shard-000000.tar       # Each tar packs ~1000 samples
│   ├── shard-000001.tar
│   └── ...
├── txts/                      # Per-shard companion JSONL labels
│   ├── shard-000000.jsonl     # One JSON line per sample in the corresponding tar
│   ├── shard-000001.jsonl
│   └── ...
├── data.lst                   # Manifest linking tar ↔ jsonl shards
└── errors.jsonl               # Samples that failed processing (if any)
```

`data.lst` and `errors.jsonl` are written to the **parent directory** of `audios/` and `txts/`.


### The `data.lst` manifest

Each line in `data.lst` describes one shard:

```
/path/to/shard-000000.tar /path/to/shard-000000.jsonl 1000 3600.500
/path/to/shard-000001.tar /path/to/shard-000001.jsonl 800 2880.200
```

Format: `<tar_path> <jsonl_path> <num_samples> <total_duration_seconds>`

- Paths are **absolute**.
- The `.tar` file contains the audio tokens.
- The `.jsonl` file contains the metadata from the original input JSONL file, so metadata can be read and modified without unpacking the tar file.
- This manifest is what the training data config references.
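
Because the manifest is plain text, it is easy to inspect before training. The following is a minimal sketch of a parser for the four-field format above; the names `ShardEntry`, `parse_manifest_line`, `read_manifest`, and `summarize` are hypothetical helpers, and the code assumes that paths contain no whitespace.

```python
from typing import NamedTuple


class ShardEntry(NamedTuple):
    tar_path: str
    jsonl_path: str
    num_samples: int
    duration_seconds: float


def parse_manifest_line(line):
    """Parse '<tar_path> <jsonl_path> <num_samples> <total_duration_seconds>'."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    tar_path, jsonl_path, num_samples, duration = parts
    return ShardEntry(tar_path, jsonl_path, int(num_samples), float(duration))


def read_manifest(path):
    """Read every non-empty line of a data.lst file."""
    with open(path, encoding="utf-8") as f:
        return [parse_manifest_line(line) for line in f if line.strip()]


def summarize(entries):
    """Return (total sample count, total duration in hours)."""
    total_samples = sum(e.num_samples for e in entries)
    total_hours = sum(e.duration_seconds for e in entries) / 3600.0
    return total_samples, total_hours
```

For example, `summarize(read_manifest("output/data.lst"))` gives a quick check that the shard counts and total audio duration match what was expected from the input data.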

### Inside a tar shard

Each `.tar` file packs **many samples** (default 1000 per shard) into a single archive. This is the key advantage of WebDataset: instead of reading thousands of tiny files, the dataloader reads sequentially from a few large tars, drastically reducing disk I/O pressure.

Each sample in the tar is stored as a `.npy` file named after its sample key; the matching metadata line lives in the companion JSONL shard:

```
shard-000000.tar:
  sample_001.npy    # Audio tokens: numpy array, shape [8, T], dtype int16
  sample_002.npy
  ...
  sample_1000.npy
```
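
One way to sanity-check a shard is to compare the keys in the tar against the `id` values in its companion JSONL. The sketch below uses only the standard library and assumes that tar members are named `<id>.npy` and that each JSONL line keeps the `id` field from the input manifest; `tar_sample_ids`, `jsonl_sample_ids`, and `check_shard` are hypothetical helper names.

```python
import json
import tarfile
from pathlib import PurePosixPath


def tar_sample_ids(tar_path):
    """Return the sample keys (member names without .npy) in a tar shard."""
    with tarfile.open(tar_path, "r") as tar:
        return [
            PurePosixPath(member.name).stem
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(".npy")
        ]


def jsonl_sample_ids(jsonl_path):
    """Return the 'id' of every non-empty line in a companion JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        return [json.loads(line)["id"] for line in f if line.strip()]


def check_shard(tar_path, jsonl_path):
    """Return (ids only in the tar, ids only in the JSONL), both sorted."""
    tar_ids = set(tar_sample_ids(tar_path))
    meta_ids = set(jsonl_sample_ids(jsonl_path))
    return sorted(tar_ids - meta_ids), sorted(meta_ids - tar_ids)
```

Two empty lists mean the shard and its metadata agree. To inspect the tokens themselves, a member's bytes can be read into an in-memory buffer and passed to `numpy.load`, which should yield an array of shape `[8, T]` as described above.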

## 3. Data Config for Training

After creating WebDataset shards, write a data config JSON that references them:

```json
{
    "train": [
        {
            "language_id": "en",
            "manifest_path": ["data/custom/tokens/train/data.lst"],
            "repeat": 1
        }
    ],
    "dev": [
        {
            "language_id": "en",
            "manifest_path": ["data/custom/tokens/dev/data.lst"],
            "repeat": 1
        }
    ]
}
```
- `manifest_path`: list of `data.lst` files (one per shard directory)
- `repeat`: how many times to repeat this dataset per epoch (useful for balancing languages)
- `language_id`: not used; it is only there to keep the data organized
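
When several languages or datasets are involved, it can be convenient to generate this config from code. The following is a minimal sketch that mirrors the JSON layout shown above; `dataset_entry` and `write_data_config` are hypothetical helpers, and the manifest paths and the `zh` entry with `repeat=3` in the usage note are placeholders.

```python
import json


def dataset_entry(language_id, manifest_paths, repeat=1):
    """Build one dataset entry with the three keys used in the data config."""
    return {
        "language_id": language_id,
        "manifest_path": list(manifest_paths),  # list of data.lst files
        "repeat": repeat,
    }


def write_data_config(path, train, dev):
    """Write the {'train': [...], 'dev': [...]} config and return it."""
    config = {"train": list(train), "dev": list(dev)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4, ensure_ascii=False)
    return config
```

For instance, passing `train=[dataset_entry("en", ["data/custom/tokens/train/data.lst"]), dataset_entry("zh", ["data/zh/tokens/train/data.lst"], repeat=3)]` would repeat the smaller dataset three times per epoch to balance the two.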

See [examples/config/](../examples/config/) for ready-to-use data config files.

> See [docs/data_preparation_advanced.md](../docs/data_preparation_advanced.md) for denoising and noise augmentation.