Spectra-AASIST3
Spectra-AASIST3 is a speech anti-spoofing model pairing a wav2vec 2.0 (XLS-R-300m) self-supervised front-end with a KAN-enhanced AASIST (KAN-AASIST) back-end. The model takes a raw speech waveform and returns a score where higher = more bona fide.
- Code / checkpoint: lab260/Spectra-AASIST3 (model.safetensors, self-contained: it bundles the SSL encoder weights)
- Paper: none. This is a pre-release / unpublished model, so it appears in the Arena's Unpublished / Proprietary tier (listed but unranked, regardless of score).
- Parameters: ~318.95 M
The exact wrapper that produced the Arena scores is in
spectra_aasist3.py; the vendored network is
spectra_aasist3_net.py (copied from the source model.py).
Architecture
- wav2vec 2.0 XLS-R-300m front-end: HF transformers Wav2Vec2Model (facebook/wav2vec2-xls-r-300m), producing 1024-d frame features (the base arch is fetched at init, then every weight is overwritten by the checkpoint).
- MLP bridge: a single-layer Linear(1024 → 128) projection (SELU, dropout 0.1).
- KAN-AASIST back-end: max-pool, a RawNet2-style residual encoder, spectral (GAT-S) and temporal (GAT-T) graph-attention layers with graph pooling, four parallel inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN) output layer.
- The 2-logit output is read at index 1 = bona fide.
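To make the shapes in the list above concrete, here is a minimal NumPy sketch of the bridge projection and the score read-out. It is not the vendored network: the weights are random placeholders, the frame count is illustrative, and the Linear → SELU ordering is an assumption (dropout is omitted because it is the identity at inference).

```python
import numpy as np

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: np.ndarray) -> np.ndarray:
    """Scaled exponential linear unit, computed without overflow for large x."""
    negative_part = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative_part)


def bridge(features: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project (frames, 1024) SSL features to (frames, 128).

    Assumed order: linear projection, then SELU. Dropout is skipped (inference).
    """
    return selu(features @ weight.T + bias)


rng = np.random.default_rng(0)
frames = rng.standard_normal((201, 1024)).astype(np.float32)            # illustrative frame count
weight = (rng.standard_normal((128, 1024)) * 0.01).astype(np.float32)  # placeholder weights
bias = np.zeros(128, dtype=np.float32)

projected = bridge(frames, weight, bias)
print(projected.shape)  # (201, 128)

# The back-end ends in two logits per utterance; index 1 is the bona fide class.
logits = np.array([[-1.2, 3.4]], dtype=np.float32)  # made-up values
score = logits[:, 1]
```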
How scores are produced
- Input: raw audio at 16 kHz mono. Preemphasis (0.97) is applied to the full waveform (matching the source README eval pipeline), then the first 64,600 samples are taken as a deterministic window (~4.04 s). Shorter clips are tile-repeated to that length; there is no random crop.
- No resampling in the wrapper (audio arrives at expected_sample_rate = 16000).
- Output: 2-class logits; the bona-fide logit (index 1) is the score.
- batch_size = 24 (throughput plateaus ~50 utt/s for bs ≥ 16 on an RTX 4070 Ti SUPER).
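The input preparation described above can be sketched in a few lines of NumPy. This is an illustration rather than the code in spectra_aasist3.py: the function names are hypothetical, and the preemphasis convention (first sample passed through unchanged) is an assumption.

```python
import numpy as np

TARGET_LEN = 64_600   # window length stated above (~4.04 s at 16 kHz)
PREEMPH_COEF = 0.97   # preemphasis coefficient stated above


def preemphasize(wave: np.ndarray, coef: float = PREEMPH_COEF) -> np.ndarray:
    """Apply y[n] = x[n] - coef * x[n-1], keeping y[0] = x[0] (assumed convention)."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.size == 0:
        return wave
    return np.append(wave[0], wave[1:] - coef * wave[:-1]).astype(np.float32)


def fixed_window(wave: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Keep the first target_len samples; tile-repeat shorter clips to fill the window."""
    wave = np.asarray(wave)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size >= target_len:
        return wave[:target_len]
    repeats = -(-target_len // wave.size)  # ceiling division
    return np.tile(wave, repeats)[:target_len]


def prepare(wave: np.ndarray) -> np.ndarray:
    """Preemphasis on the full waveform first, then the deterministic window."""
    return fixed_window(preemphasize(wave))


short_clip = np.ones(48_000, dtype=np.float32)   # 3 s at 16 kHz
long_clip = np.ones(100_000, dtype=np.float32)   # 6.25 s at 16 kHz
print(prepare(short_clip).shape, prepare(long_clip).shape)  # (64600,) (64600,)
```

Because the window always starts at sample 0 and padding is a plain repeat, the same file always yields the same model input.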
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible
Speech Anti-Spoofing Arena.
Each result is sha-pinned and reproducible from the score file via
speech-spoof-bench reproduce --scoring.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| CD-ADD | test | 0.00 | 20,786 | 0 | modern neural-TTS deepfake |
| SONAR | test | 0.44 | 3,948 | 0 | multilingual real-world deepfake |
| CVoiceFake_small | test | 0.51 | 138,136 | 0 | multilingual TTS/vocoder deepfake |
| CFAD | test | 0.71 | 62,999 | 0 | Chinese fake-audio detection |
| LibriSeVoc | test | 0.83 | 18,487 | 0 | vocoder-based deepfake |
| ASVspoof2019_LA | test | 0.97 | 71,237 | 0 | in-domain family |
| InTheWild | test | 1.20 | 31,779 | 0 | out-of-domain (real-world) |
| ASVspoof2021_DF | test | 4.30 | 611,829 | 0 | cross-dataset (deepfake) |
| ASVspoof2021_LA | test | 4.38 | 181,566 | 0 | cross-dataset (logical access) |
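For readers unfamiliar with the metric, the EER is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). The sketch below is a simple threshold sweep over "higher = more bona fide" scores. It is not the Arena's scoring code, the function name is hypothetical, and the benchmark tool may interpolate between thresholds differently.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores) -> float:
    """Return the equal error rate in percent for 'higher = more bona fide' scores."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is included.
    thresholds = np.unique(np.concatenate([bona, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A trial is accepted when score >= threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size          # bona fide rejected
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size  # spoof accepted

    # Take the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[idx] + frr[idx]) / 2.0


# Toy scores (made up): one of four trials in each class is on the wrong side.
print(compute_eer([1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.5, 2.5]))  # 25.0
```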
Usage
The wrapper loads weights from the Hub via PyTorchModelHubMixin:
import numpy as np
from spectra_aasist3 import SpectraAASIST3 # spectra_aasist3.py + spectra_aasist3_net.py
m = SpectraAASIST3()
m.load() # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0]) # higher = more bona fide
m.unload()
Internally the wrapper applies preemphasis, windows to 64,600 samples, runs the
network, and returns logits[:, 1] (class 1 = bona fide). spectra_aasist3.py
is the exact speech_spoof_bench model that produced the Arena scores.txt.
License
Apache-2.0; see the source repository.