SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision
Paper: 2512.20308
MLP-based robust speech quantizers trained with a CTC loss and iterative pseudo-labeling on augmented audio, following Algayres et al. (Interspeech 2023). Evaluated at vocabulary sizes K ∈ {100, 200, 500}.
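The recipe above alternates between assigning pseudo-labels to features and refitting the quantizer on augmented views so that clean and perturbed audio map to the same units. A minimal numpy sketch of that loop, with toy k-means standing in for the actual CTC-trained MLP quantizer (all names here are illustrative, not from the released code):

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    """Toy k-means: returns centroids and hard assignments (pseudo-labels)."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Re-estimate each centroid from its assigned frames.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(0)
    return centroids, labels

def augment(feats, rng):
    """Stand-in for audio augmentation: perturb features with noise."""
    return feats + 0.05 * rng.standard_normal(feats.shape)

def iterative_pseudo_labeling(feats, k=4, rounds=3):
    """Each round: refit the codebook on augmented views of the data while
    keeping the clean pseudo-labels as targets, then relabel the clean frames."""
    rng = np.random.default_rng(0)
    centroids, labels = kmeans(feats, k)  # round 0: initial pseudo-labels
    for _ in range(rounds):
        aug = augment(feats, rng)
        # Re-estimate centroids on augmented data under the clean labels,
        # pushing augmented views toward the same units as clean audio.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = aug[labels == j].mean(0)
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
    return labels

frames = np.random.default_rng(1).standard_normal((200, 16))
units = iterative_pseudo_labeling(frames, k=4)
print(units.shape)  # (200,)
```

This is only a conceptual sketch: the released quantizers replace k-means with an MLP trained under a CTC loss, but the alternation between labeling and robust refitting is the same shape.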
| Encoder | Checkpoint | Layer | Pre-training data |
|---|---|---|---|
| HuBERT Base | hubert-base-ls960 | 6 | LibriSpeech 960h |
| DinoSR | original + SpidR-reproduced | 5 | LibriSpeech 960h |
| SpidR | spidr-base | 6 | LibriSpeech 960h |
```python
from huggingface_hub import hf_hub_download

# Download a quantizer checkpoint (500-unit codebook, round 1).
model_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/round_1/E1_best.pt",
)

# Download the matching training configuration.
config_path = hf_hub_download(
    repo_id="iliasslasri/robust_speech_quantizer",
    filename="500_vocab_size/config.yaml",
)
```
Augmentations:

- Clean
- Time Stretch
- Pitch Shift
- Reverberation
- Noise
- Echo
- Random Noise
- Pink Noise
- Lowpass Filter
- Highpass Filter
- Bandpass Filter
- Smooth
- Boost Audio
- Duck Audio
- Up-Down Resample
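As a rough illustration of what a few of the listed augmentations do to a waveform, here are toy numpy versions; these are sketches under assumed parameters, not the implementations used in the training pipeline:

```python
import numpy as np

def add_noise(x, rng, snr_db=20.0):
    """Noise: mix in white noise at a target SNR (in dB)."""
    noise = rng.standard_normal(x.shape)
    scale = np.sqrt((x ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean() + 1e-12))
    return x + scale * noise

def boost(x, gain_db=6.0):
    """Boost Audio: apply positive gain (a negative gain would be Duck Audio)."""
    return x * 10 ** (gain_db / 20)

def lowpass(x, width=5):
    """Lowpass Filter, crudely: moving-average smoothing."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def chain(x, rng):
    """A 'chained' strategy applies several augmentations in sequence;
    a 'single' strategy would pick just one per example."""
    return lowpass(boost(add_noise(x, rng)))

rng = np.random.default_rng(0)
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz at 16 kHz
augmented = chain(wave, rng)
```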
We trained quantizers across different encoders, codebook sizes, and augmentation strategies. The augmentation configurations are:
| Encoder | Layer | Codebook | Augmentation Strategy |
|---|---|---|---|
| HuBERT | 6 | 500 | All augmentations, chained |
| HuBERT | 6 | 500 | All augmentations, single |
| HuBERT | 6 | 500 | No extra augmentations, single |
| SpidR | 6 | 256 | No extra augmentations, single |
| SpidR | 6 | 256 | All augmentations, chained |
| DinoSR (original) | 5 | 256 | All augmentations, chained |
| DinoSR (reproduced) | 5 | 256 | All augmentations, chained |
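The training grid above can be written out explicitly, for example to drive a sweep; this is a hypothetical representation, not the repository's actual config format:

```python
# One entry per trained quantizer, mirroring the table of runs.
runs = [
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "all, chained"},
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "all, single"},
    {"encoder": "HuBERT", "layer": 6, "codebook": 500, "strategy": "none, single"},
    {"encoder": "SpidR", "layer": 6, "codebook": 256, "strategy": "none, single"},
    {"encoder": "SpidR", "layer": 6, "codebook": 256, "strategy": "all, chained"},
    {"encoder": "DinoSR (original)", "layer": 5, "codebook": 256, "strategy": "all, chained"},
    {"encoder": "DinoSR (reproduced)", "layer": 5, "codebook": 256, "strategy": "all, chained"},
]

# E.g. select only the HuBERT runs from the grid.
hubert_runs = [r for r in runs if r["encoder"] == "HuBERT"]
```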
Base model: facebook/hubert-base-ls960