SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision
Paper: 2512.20308
MLP-based robust speech quantizers trained with a CTC loss and iterative pseudo-labeling on augmented audio, following Algayres et al. (Interspeech 2023). Evaluated at vocabulary sizes K ∈ {100, 200, 500}.
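The recipe above alternates between assigning pseudo-labels to features and refitting the quantizer on augmented views so that clean and perturbed audio map to the same units. A minimal numpy sketch of that loop, with toy k-means standing in for the actual CTC-trained MLP quantizer (all names here are illustrative, not from the released code):

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    """Toy k-means: returns centroids and hard assignments (pseudo-labels)."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Re-estimate each centroid from its assigned frames.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(0)
    return centroids, labels

def augment(feats, rng):
    """Stand-in for audio augmentation: perturb features with noise."""
    return feats + 0.05 * rng.standard_normal(feats.shape)

def iterative_pseudo_labeling(feats, k=4, rounds=3):
    """Each round: refit the codebook on augmented views of the data while
    keeping the clean pseudo-labels as targets, then relabel the clean frames."""
    rng = np.random.default_rng(0)
    centroids, labels = kmeans(feats, k)  # round 0: initial pseudo-labels
    for _ in range(rounds):
        aug = augment(feats, rng)
        # Re-estimate centroids on augmented data under the clean labels,
        # pushing augmented views toward the same units as clean audio.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = aug[labels == j].mean(0)
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
    return labels

frames = np.random.default_rng(1).standard_normal((200, 16))
units = iterative_pseudo_labeling(frames, k=4)
print(units.shape)  # (200,)
```

This is only a conceptual sketch: the released quantizers replace k-means with an MLP trained under a CTC loss, but the alternation between labeling and robust refitting is the same shape.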
| Encoder | Checkpoint | Layer | Pre-training data |
|---|---|---|---|
| HuBERT Base | hubert-base-ls960 | 6 | LibriSpeech 960h |
| DinoSR | original + SpidR-reproduced | 5 | LibriSpeech 960h |
| SpidR | spidr-base | 6 | LibriSpeech 960h |
```python
from huggingface_hub import hf_hub_download

# Download a quantizer checkpoint (500-unit codebook, round 1).
model_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/round_1/E1_best.pt",
)

# Download the matching training configuration.
config_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/config.yaml",
)
```
Augmentations:

- Clean
- Time Stretch
- Pitch Shift
- Reverberation
- Noise
- Echo
- Random Noise
- Pink Noise
- Lowpass Filter
- Highpass Filter
- Bandpass Filter
- Smooth
- Boost Audio
- Duck Audio
- Up-Down Resample
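As a rough illustration of what a few of the listed augmentations do to a waveform, here are toy numpy versions; these are sketches under assumed parameters, not the implementations used in the training pipeline:

```python
import numpy as np

def add_noise(x, rng, snr_db=20.0):
    """Noise: mix in white noise at a target SNR (in dB)."""
    noise = rng.standard_normal(x.shape)
    scale = np.sqrt((x ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean() + 1e-12))
    return x + scale * noise

def boost(x, gain_db=6.0):
    """Boost Audio: apply positive gain (a negative gain would be Duck Audio)."""
    return x * 10 ** (gain_db / 20)

def lowpass(x, width=5):
    """Lowpass Filter, crudely: moving-average smoothing."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def chain(x, rng):
    """A 'chained' strategy applies several augmentations in sequence;
    a 'single' strategy would pick just one per example."""
    return lowpass(boost(add_noise(x, rng)))

rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
augmented = chain(wave, rng)
```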
We trained quantizers across different encoders, codebook sizes, and augmentation strategies. The augmentation configurations are:
| Encoder | Layer | Codebook | Augmentation Strategy |
|---|---|---|---|
| HuBERT | 6 | 500 | All augmentations, chained |
| HuBERT | 6 | 500 | All augmentations, single |
| HuBERT | 6 | 500 | No extra augmentations, single |
| SpidR | 6 | 256 | No extra augmentations, single |
| SpidR | 6 | 256 | All augmentations, chained |
| DinoSR (original) | 5 | 256 | All augmentations, chained |
| DinoSR (reproduced) | 5 | 256 | All augmentations, chained |
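The training grid above can be written out explicitly, for example to drive a sweep; this is a hypothetical representation, not the repository's actual config format:

```python
# One entry per trained quantizer, mirroring the table of runs.
runs = [
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "all, chained"},
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "all, single"},
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "none, single"},
    {"encoder": "SpidR", "layer": 6, "codebook": 256, "strategy": "none, single"},
    {"encoder": "SpidR", "layer": 6, "codebook": 256, "strategy": "all, chained"},
    {"encoder": "DinoSR (original)", "layer": 5, "codebook": 256, "strategy": "all, chained"},
    {"encoder": "DinoSR (reproduced)", "layer": 5, "codebook": 256, "strategy": "all, chained"},
]

# E.g. select only the HuBERT runs from the grid.
hubert_runs = [r for r in runs if r["encoder"] == "HuBERT"]
```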
Base model: facebook/hubert-base-ls960