Spectra-AASIST3

Spectra-AASIST3 is a speech anti-spoofing model that pairs a wav2vec 2.0 (XLS-R-300m) self-supervised front-end with a KAN-enhanced AASIST (KAN-AASIST) back-end. It takes a raw speech waveform and returns a single score; a higher score means the input is more likely bona fide.

Code / checkpoint: lab260/Spectra-AASIST3 (model.safetensors, self-contained — bundles the SSL encoder weights)
Paper: none — pre-release / unpublished model, so it appears in the Arena's 🔓 Unpublished / Proprietary tier (listed but unranked, regardless of score).
Parameters: ~318.95 M

The exact wrapper that produced the Arena scores is in spectra_aasist3.py; the vendored network is spectra_aasist3_net.py (copied from the source model.py).

Architecture

wav2vec 2.0 XLS-R-300m front-end — HF transformers Wav2Vec2Model (facebook/wav2vec2-xls-r-300m), producing 1024-d frame features (the base arch is fetched at init, then every weight is overwritten by the checkpoint).
MLP bridge — a single-layer Linear(1024 → 128) projection (SELU, dropout 0.1).
KAN-AASIST back-end — max-pool, a RawNet2-style residual encoder, spectral (GAT-S) and temporal (GAT-T) graph-attention layers with graph pooling, four parallel inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN) output layer.
The 2-logit output is read at index 1 = bona fide.
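To make the tensor shapes in the list above concrete, the following sketch estimates how many 1024-d frames the front-end emits for one 64,600-sample window. It assumes the default wav2vec 2.0 convolutional feature-encoder geometry (seven unpadded conv layers with the kernel sizes and strides shown); the checkpoint's actual config is not quoted in this card, so treat the constants and the helper name `num_ssl_frames` as illustrative.

```python
# Assumption: default wav2vec 2.0 feature-encoder geometry (no padding).
# These constants are not taken from the Spectra-AASIST3 checkpoint itself.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_ssl_frames(num_samples: int) -> int:
    """Return how many frames the conv feature encoder emits for a waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output-length formula for an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


frames = num_ssl_frames(64_600)
print("front-end output:", (frames, 1024))  # 1024-d frame features
print("after MLP bridge:", (frames, 128))   # Linear(1024 -> 128)
```

Under these assumptions a 64,600-sample window yields 201 frames, so the bridge would see a (201, 1024) feature matrix per utterance and hand a (201, 128) matrix to the back-end.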

How scores are produced

Input: raw audio at 16 kHz mono. Preemphasis (0.97) is applied to the full waveform (matching the source README eval pipeline), then the first 64,600 samples (~4.04 s) are kept as a deterministic window. Shorter clips are tile-repeated up to that length; no random crop is used.
No resampling in the wrapper (audio arrives at expected_sample_rate = 16000).
Output: 2-class logits; the bona-fide logit (index 1) is the score.
batch_size = 24 (throughput plateaus at about 50 utterances/s for batch sizes ≥ 16 on an RTX 4070 Ti SUPER).
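The input steps above can be sketched in NumPy as follows. This is a minimal illustration, not the wrapper's code: the function names are hypothetical, and the handling of the first sample in the preemphasis filter (passed through unchanged here) is an assumption, since the card only states the coefficient.

```python
import numpy as np

TARGET_LEN = 64_600  # window length stated in this card (~4.04 s at 16 kHz)
PREEMPH = 0.97       # preemphasis coefficient stated in this card


def preemphasize(wave, coef=PREEMPH):
    """Apply y[n] = x[n] - coef * x[n-1]; y[0] = x[0] is an assumption."""
    wave = np.asarray(wave, dtype=np.float32)
    out = wave.copy()
    out[1:] -= np.float32(coef) * wave[:-1]
    return out


def fix_length(wave, target_len=TARGET_LEN):
    """Keep the first target_len samples, tile-repeating shorter clips."""
    wave = np.asarray(wave)
    if wave.size == 0:
        raise ValueError("cannot window an empty waveform")
    if wave.size < target_len:
        repeats = -(-target_len // wave.size)  # ceiling division
        wave = np.tile(wave, repeats)
    return wave[:target_len]


def preprocess(wave):
    """Preemphasis on the full waveform first, then the deterministic window."""
    return fix_length(preemphasize(wave))


clip = np.random.randn(48_000).astype(np.float32)  # 3 s placeholder clip
print(preprocess(clip).shape)
```

Because the window always starts at sample 0 and padding is a plain repeat, the same file produces the same network input on every run, which is what makes the score files reproducible.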

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Each result is sha-pinned and reproducible from the score file via speech-spoof-bench reproduce --scoring.

Dataset	Split	EER %	Trials	Notes
CD-ADD	test	0.00	20,786	modern neural-TTS deepfake
SONAR	test	0.44	3,948	multilingual real-world deepfake
CVoiceFake_small	test	0.51	138,136	multilingual TTS/vocoder deepfake
CFAD	test	0.71	62,999	Chinese fake-audio detection
LibriSeVoc	test	0.83	18,487	vocoder-based deepfake
ASVspoof2019_LA	test	0.97	71,237	in-domain family
InTheWild	test	1.20	31,779	out-of-domain (real-world)
ASVspoof2021_DF	test	4.30	611,829	cross-dataset (deepfake)
ASVspoof2021_LA	test	4.38	181,566	cross-dataset (logical access)
ASVspoof5	test	15.09	680,774	adversarial / hardest set
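For readers unfamiliar with the metric in the table, the sketch below shows one common way to compute an equal error rate (EER) from bona fide and spoof scores, using the convention that higher means more bona fide. It is a generic illustration: the function `compute_eer` is hypothetical, and the Arena's speech-spoof-bench tool may use a different thresholding or interpolation rule, so small numerical differences from the table are possible.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return the EER as a fraction in [0, 1]; higher score = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    if bona.size == 0 or spoof.size == 0:
        raise ValueError("need at least one score per class")
    # Candidate thresholds: every distinct observed score, ascending.
    thresholds = np.unique(np.concatenate([bona, spoof]))
    # False rejection rate: bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # Take the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(frr - far)))
    return float(0.5 * (frr[idx] + far[idx]))


eer = compute_eer([2.0, 3.0, 4.0, 5.0], [-1.0, 0.0, 1.0, 2.5])
print(f"EER = {100 * eer:.2f} %")  # prints "EER = 25.00 %"
```

In this toy example one of four spoof scores (2.5) overlaps the bona fide range, so both error rates meet at 25 %. Perfectly separated score sets give 0 %.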

Usage

The wrapper loads weights from the Hub via PyTorchModelHubMixin:

import numpy as np
from spectra_aasist3 import SpectraAASIST3   # spectra_aasist3.py + spectra_aasist3_net.py

m = SpectraAASIST3()
m.load()                                          # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32) # 3 s of float32 mono noise at 16 kHz (stand-in for real speech)
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally, the wrapper applies preemphasis, windows the signal to 64,600 samples, runs the network, and returns logits[:, 1] (class 1 = bona fide). The file spectra_aasist3.py is the exact speech_spoof_bench model that produced the Arena scores.txt.
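When scoring many clips, it may help to feed them in fixed-size chunks. The helper below is a hypothetical convenience, not part of the repository: it assumes only what the usage snippet shows, namely that `score_batch` takes a list of waveforms and a list of sample rates and returns one score per clip, in order. The default chunk size of 24 mirrors the batch_size quoted in this card.

```python
from typing import List, Sequence


def score_in_batches(model, clips: Sequence, sample_rates: Sequence[int],
                     batch_size: int = 24) -> List[float]:
    """Score clips in fixed-size chunks with any object exposing score_batch."""
    if len(clips) != len(sample_rates):
        raise ValueError("clips and sample_rates must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    scores: List[float] = []
    for start in range(0, len(clips), batch_size):
        stop = start + batch_size
        # Assumed contract: one score per clip, returned in input order.
        batch_scores = model.score_batch(list(clips[start:stop]),
                                         list(sample_rates[start:stop]))
        scores.extend(float(s) for s in batch_scores)
    return scores
```

With a loaded model `m` as in the snippet above, a call would look like `score_in_batches(m, clips, [16000] * len(clips))`. Whether the wrapper already batches internally is not stated here, so this may be redundant in practice.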

License

Apache-2.0 — see the source repository.

Contact

Email: kborodin.research@gmail.com
Telegram: @korallll_ai
