license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
ResCapsGuard
ResCapsGuard is a capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in "Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry" (Borodin et al., ETASR 2024). It is the capsule-network sibling of Res2TCNGuard. The model takes a raw speech waveform and returns a single score; a higher score means the input is more likely bona fide.
- Code: https://github.com/lab260ru/ResCapsGuard
- Paper: https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- Parameters: 1,606,664 (1.607 M)
- Checkpoint: new_capsules_changed_sinc_layer.pth
This repo is self-contained for inference: the network definition is in
_net.py, a standalone scorer in evaluate.py, and
the exact wrapper used to produce the Arena scores in
rescapsguard.py.
Architecture
ResCapsGuard operates directly on the raw waveform:
- Sinc-convolution front-end (SincConv): learnable band-pass filters that turn the waveform into a time–frequency representation.
- Res2Net-style encoder: stacked Res_blocks (2-D convolutions with SELU and max-pooling) that build a deep spectro-temporal feature map.
- Primary capsules: a bank of capsule branches, each ending in channel-wise statistics pooling (ChanelWiseStats, mean + std) to produce per-capsule vectors.
- Dynamic routing (RoutingMechanism): routing-by-agreement (with the squash non-linearity) to two output capsules, bona fide vs. spoof. The bona-fide capsule activation (index 1) is the returned score.
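
The pooling and routing stages can be illustrated with a short NumPy sketch. It shows channel-wise statistics pooling, the squash non-linearity and routing-by-agreement in their standard capsule-network formulation; it is not the code in _net.py. The tensor shapes, the random transform weights `W`, the three routing iterations and the use of the vector norm as the capsule activation are all assumptions made for the example.

```python
import numpy as np


def channel_wise_stats(feature_map):
    """Pool a (channels, time) map into one vector of per-channel mean and std."""
    mean = feature_map.mean(axis=-1)
    std = feature_map.std(axis=-1)
    return np.concatenate([mean, std])


def squash(s, axis=-1, eps=1e-8):
    """Keep the direction of s but map its length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement.

    u_hat: predictions of each primary capsule for each output capsule,
           shape (n_primary, n_output, dim).
    Returns the output capsule vectors, shape (n_output, dim).
    """
    n_primary, n_output, _ = u_hat.shape
    b = np.zeros((n_primary, n_output))                 # routing logits
    for _ in range(n_iters):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)               # softmax over output capsules
        s = np.sum(c[:, :, None] * u_hat, axis=0)       # weighted sum of predictions
        v = squash(s)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # reward agreeing predictions
    return v


rng = np.random.default_rng(0)
# Four hypothetical primary capsules, each pooled from an (8, 50) feature map.
primary = np.stack([channel_wise_stats(rng.standard_normal((8, 50))) for _ in range(4)])
W = rng.standard_normal((4, 2, 16, 16)) * 0.1           # assumed transform weights
u_hat = np.einsum("ijkl,il->ijk", W, primary)           # shape (4, 2, 16)
out = dynamic_routing(u_hat)
activations = np.linalg.norm(out, axis=-1)              # assumed activation: vector length
print("spoof:", activations[0], "bona fide:", activations[1])
```

Because squash bounds every output vector's length below 1, the two activations can be read as soft class scores, which fits the card's convention of returning index 1 as the bona-fide score.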
How it was trained
- Data: the ASVspoof 2019 Logical Access (LA) dataset, following the protocol in the paper (train/validate on a single attack type, evaluate on the eval split with more advanced and unseen attacks — testing generalization to harder scenarios).
- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). During training a random segment is cut from each utterance.
- Best reported result (paper): EER = 2.25 %, min t-DCF = 0.0744.
See the training notebook for the full training and evaluation code.
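
As a rough illustration of the training-time cropping, the sketch below cuts a random 64,600-sample segment from an utterance. The function name `random_segment` is hypothetical, and tile-repeating utterances that are too short is an assumption (the card only says "cropped/padded"); the training notebook remains the authority on what was actually done.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, about 4.04 s


def random_segment(wave, target_len=TARGET_LEN, rng=None):
    """Return a random contiguous segment of exactly target_len samples."""
    wave = np.asarray(wave)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty 1-D waveform")
    if rng is None:
        rng = np.random.default_rng()
    if wave.size <= target_len:
        # Assumed padding strategy: repeat the utterance until it is long enough.
        reps = -(-target_len // wave.size)  # ceiling division
        return np.tile(wave, reps)[:target_len]
    start = rng.integers(0, wave.size - target_len + 1)
    return wave[start:start + target_len]


utterance = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
segment = random_segment(utterance, rng=np.random.default_rng(1))
print(segment.shape, segment.shape[0] / 16_000, "seconds")
```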
Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | 1.86 | 71,237 | 0 | in-domain (same dataset as training) |
| CD-ADD | test | 54.55 | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | 55.92 | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |
| ASVspoof2021_LA | test | 18.70 | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | 17.00 | 611,829 | 0 | cross-dataset generalization |
The ASVspoof2019_LA result (1.86 %) is close to the 2.25 % the paper reports on the LA eval set; the deterministic window (vs. the paper's random crop) accounts for the small difference. As with its Res2TCNGuard sibling, the model, trained only on ASVspoof2019 LA, degrades on the newer ASVspoof2021 LA and DF sets (18.70 % and 17.00 % EER) and does not generalize to the out-of-domain CD-ADD and InTheWild sets, where the EER exceeds 50 %, i.e. no better than chance. This is the cost of training on a single attack type. The ASVspoof2021_DF result (17.00 %) is nearly identical to the sibling Res2TCNGuard's 17.02 % on the same eval.
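
For readers who want to recompute an EER from their own scores, here is a minimal NumPy sketch. It assumes two arrays of scores (bona fide trials and spoof trials) with the card's convention that higher means more bona fide; the function name `compute_eer` and the synthetic scores are illustrative, and this is not necessarily the routine the Arena uses.

```python
import numpy as np


def compute_eer(bona_scores, spoof_scores):
    """Equal error rate in [0, 1], assuming higher score = more bona fide.

    A trial is accepted as bona fide when score >= threshold.
    """
    bona = np.sort(np.asarray(bona_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    # Candidate thresholds: every observed score plus one above all of them.
    thresholds = np.unique(np.concatenate([bona, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)
    # False rejection rate: fraction of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: fraction of spoof trials scoring at or above it.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(far - frr))  # point where the two rates are closest
    return float((far[idx] + frr[idx]) / 2.0)


rng = np.random.default_rng(0)
bona = rng.normal(loc=1.0, scale=1.0, size=5_000)    # synthetic scores
spoof = rng.normal(loc=-1.0, scale=1.0, size=5_000)  # synthetic scores
print(f"EER = {100 * compute_eer(bona, spoof):.2f} %")
```

With perfectly separated scores this returns 0; when the two score distributions overlap completely it approaches 0.5, which is the regime of the CD-ADD and InTheWild rows.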
Usage
The checkpoint is a state_dict for the CapsuleNet network defined in
_net.py (extracted verbatim from the source notebook). The input is
windowed to exactly 64,600 samples at 16 kHz mono with pad_fixed (first 64,600
samples, tile-repeat if shorter).
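
A minimal NumPy sketch of that windowing, assuming the behaviour is exactly "first 64,600 samples, tile-repeat if shorter". It is named `pad_fixed_sketch` to make clear that it is an illustration and not the pad_fixed shipped in this repo.

```python
import numpy as np


def pad_fixed_sketch(wave, target_len=64_600):
    """Deterministic window: truncate long inputs, tile-repeat short ones."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")
    if wave.size >= target_len:
        return wave[:target_len]            # keep only the first target_len samples
    reps = -(-target_len // wave.size)      # ceiling division
    return np.tile(wave, reps)[:target_len]


short = np.arange(48_000, dtype=np.float32)  # 3 s at 16 kHz
window = pad_fixed_sketch(short)
# The tail of the window is the start of the clip repeated.
print(window.shape, np.array_equal(window[48_000:], short[:16_600]))
```

Because no randomness is involved, the same file always yields the same window and therefore the same score.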
Score one file from the command line:
```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float> (higher = more bona fide)
```
Or from Python:
```python
import numpy as np

from evaluate import load_model, score  # _net.py + evaluate.py are in this repo

model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")

# Placeholder input: 3 s of noise; real input must be float32 mono at 16 kHz.
audio = np.random.randn(48000).astype(np.float32)
print(score(model, audio))  # higher = more bona fide
```
Internally, score runs _z, class_ = model(x, random=False, dropout=0) on the windowed
input and returns class_[:, 1] (index 1 = bona fide). rescapsguard.py
is the same logic packaged as a speech_spoof_bench model; it is the exact code that produced
the Arena scores.txt.
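
Since the scorer expects float32 mono audio at 16 kHz, recordings in other formats need converting first. The helper below is a sketch, not part of this repo: the name `to_mono_16k` is hypothetical, and it assumes multichannel audio arrives as a (frames, channels) array, which is the layout soundfile.read returns.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def to_mono_16k(audio, sample_rate, target_rate=16_000):
    """Down-mix to mono and resample to target_rate, returning float32."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                     # (frames, channels) -> (frames,)
        audio = audio.mean(axis=1)
    if sample_rate != target_rate:
        g = gcd(int(sample_rate), int(target_rate))
        # Polyphase resampling by the rational factor target_rate / sample_rate.
        audio = resample_poly(audio, target_rate // g, sample_rate // g)
    return audio.astype(np.float32)


# One second of 48 kHz stereo silence becomes one second of 16 kHz mono.
stereo_48k = np.zeros((48_000, 2), dtype=np.float32)
mono_16k = to_mono_16k(stereo_48k, 48_000)
print(mono_16k.shape, mono_16k.dtype)
```

In practice the output of `soundfile.read(path)` (data and sample rate) could be passed through this helper before calling score.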
Citation
This model / paper:
```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology \& Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```
Training dataset — ASVspoof 2019:
```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```
License
MIT — see the source repository.