Spectra-AASIST3


Spectra-AASIST3 is a speech anti-spoofing model pairing a wav2vec 2.0 (XLS-R-300m) self-supervised front-end with a KAN-enhanced AASIST (KAN-AASIST) back-end. The model takes a raw speech waveform and returns a score where higher = more bona fide.

  • Code / checkpoint: lab260/Spectra-AASIST3 (model.safetensors, self-contained: it bundles the SSL encoder weights)
  • Paper: none. This is a pre-release / unpublished model, so it appears in the Arena's 🔓 Unpublished / Proprietary tier (listed but unranked, regardless of score).
  • Parameters: ~318.95 M

The exact wrapper that produced the Arena scores is in spectra_aasist3.py; the vendored network is spectra_aasist3_net.py (copied from the source model.py).

Architecture

  1. wav2vec 2.0 XLS-R-300m front-end: HF transformers Wav2Vec2Model (facebook/wav2vec2-xls-r-300m), producing 1024-d frame features (the base arch is fetched at init, then every weight is overwritten by the checkpoint).
  2. MLP bridge: a single-layer Linear(1024 → 128) projection (SELU, dropout 0.1).
  3. KAN-AASIST back-end: max-pool, a RawNet2-style residual encoder, spectral (GAT-S) and temporal (GAT-T) graph-attention layers with graph pooling, four parallel inference branches with learnable master tokens, and a Kolmogorov-Arnold (KAN) output layer.
  4. The 2-logit output is read at index 1 = bona fide.
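To make the shape flow of step 2 concrete, here is a minimal NumPy sketch of the bridge at inference time. It assumes the SELU follows the linear projection (the text lists the components but not their order) and omits dropout, which is inactive at inference. The weights are random placeholders, not checkpoint values, and the frame count of 201 is an estimate for a 64,600-sample window.

```python
import numpy as np

# Standard SELU constants, as used by common deep-learning frameworks.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit, applied elementwise."""
    # Clamp the exponential branch to x <= 0 so large positives cannot overflow.
    negative_branch = SELU_ALPHA * np.expm1(np.minimum(x, 0.0))
    return SELU_SCALE * np.where(x > 0, x, negative_branch)


def bridge(frames, weight, bias):
    """Project (T, 1024) frame features to (T, 128), then apply SELU.

    Assumed ordering: Linear -> SELU. Dropout (p = 0.1) is skipped because
    it does nothing at inference time.
    """
    return selu(frames @ weight + bias)


rng = np.random.default_rng(0)
frames = rng.standard_normal((201, 1024)).astype(np.float32)           # stand-in SSL features
weight = (0.01 * rng.standard_normal((1024, 128))).astype(np.float32)  # placeholder weights
bias = np.zeros(128, dtype=np.float32)

out = bridge(frames, weight, bias)
print(out.shape)
```

The 128-d frames are what the KAN-AASIST back-end would consume; the real layer lives in spectra_aasist3_net.py.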

How scores are produced

  • Input: raw audio at 16 kHz mono. Preemphasis (0.97) is applied to the full waveform (matching the source README eval pipeline), then the first 64,600 samples (~4.04 s) are kept as a deterministic window; shorter clips are tile-repeated, and there is no random crop.
  • No resampling in the wrapper (audio arrives at expected_sample_rate = 16000).
  • Output: 2-class logits; the bona-fide logit (index 1) is the score.
  • batch_size = 24 (throughput plateaus at ~50 utt/s for bs ≥ 16 on an RTX 4070 Ti SUPER).
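The input preparation described above can be sketched as follows. This is an illustration under assumptions, not the code in spectra_aasist3.py: the function names are hypothetical, and passing the first sample through the preemphasis filter unchanged is one common convention that the text does not confirm.

```python
import numpy as np

WINDOW = 64_600   # samples, ~4.04 s at 16 kHz (value from the text)
PREEMPH = 0.97    # preemphasis coefficient (value from the text)


def preemphasize(x, coef=PREEMPH):
    """y[n] = x[n] - coef * x[n-1]; the first sample is kept as-is (assumed)."""
    x = np.asarray(x, dtype=np.float32)
    if x.size == 0:
        return x
    return np.concatenate([x[:1], x[1:] - coef * x[:-1]])


def fixed_window(x, length=WINDOW):
    """Keep the first `length` samples; tile-repeat shorter clips to `length`."""
    x = np.asarray(x)
    if x.size == 0:
        raise ValueError("empty waveform")
    if x.size >= length:
        return x[:length]
    repeats = -(-length // x.size)  # ceiling division
    return np.tile(x, repeats)[:length]


def prepare(x):
    """Preemphasis on the full waveform first, then the deterministic window."""
    return fixed_window(preemphasize(x))


clip = np.random.randn(48_000).astype(np.float32)  # 3 s of noise at 16 kHz
print(prepare(clip).shape)
```

Because the window always starts at sample 0 and padding is a plain repeat, the same file yields the same network input on every run.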

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Each result is sha-pinned and reproducible from the score file via speech-spoof-bench reproduce --scoring.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| CD-ADD | test | 0.00 | 20,786 | 0 | modern neural-TTS deepfake |
| SONAR | test | 0.44 | 3,948 | 0 | multilingual real-world deepfake |
| CVoiceFake_small | test | 0.51 | 138,136 | 0 | multilingual TTS/vocoder deepfake |
| CFAD | test | 0.71 | 62,999 | 0 | Chinese fake-audio detection |
| LibriSeVoc | test | 0.83 | 18,487 | 0 | vocoder-based deepfake |
| ASVspoof2019_LA | test | 0.97 | 71,237 | 0 | in-domain family |
| InTheWild | test | 1.20 | 31,779 | 0 | out-of-domain (real-world) |
| ASVspoof2021_DF | test | 4.30 | 611,829 | 0 | cross-dataset (deepfake) |
| ASVspoof2021_LA | test | 4.38 | 181,566 | 0 | cross-dataset (logical access) |
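For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate (spoofs accepted as bona fide) equals the false-rejection rate (bona fide trials rejected). The sketch below is a plain-Python illustration using the "higher = more bona fide" convention. The Arena's own scoring code may differ in tie handling and interpolation, and the function name is hypothetical.

```python
def equal_error_rate(bona_scores, spoof_scores):
    """Return the EER in percent; higher scores mean more bona fide.

    Simple O(n^2) threshold sweep for clarity, not a fast implementation.
    """
    if not bona_scores or not spoof_scores:
        raise ValueError("need at least one score per class")

    # Candidate thresholds: every observed score, plus one that rejects all trials.
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for thr in thresholds:
        # A trial is accepted as bona fide when its score is >= thr.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < thr for s in bona_scores) / len(bona_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer


# Toy scores: one spoof (0.5) outranks one bona fide trial (0.4).
bona = [0.9, 0.8, 0.4, 0.7]
spoof = [0.1, 0.5, 0.3, 0.2]
print(equal_error_rate(bona, spoof))
```

An EER of 0.00 therefore means some threshold separates every bona fide trial from every spoof trial in that test split.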

Usage

The wrapper loads weights from the Hub via PyTorchModelHubMixin:

import numpy as np
from spectra_aasist3 import SpectraAASIST3   # spectra_aasist3.py + spectra_aasist3_net.py

m = SpectraAASIST3()
m.load()                                          # from_pretrained("lab260/Spectra-AASIST3")
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally the wrapper applies preemphasis, windows to 64,600 samples, runs the network, and returns logits[:, 1] (class 1 = bona fide). spectra_aasist3.py is the exact speech_spoof_bench model that produced the Arena scores.txt.

License

Apache-2.0; see the source repository.
