| --- |
| license: mit |
| tags: |
| - audio |
| - anti-spoofing |
| - audio-deepfake-detection |
| - speech |
| - asvspoof |
| --- |
| |
| # ResCapsGuard |
|
|
|
|
| Capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in |
| *"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"* |
| (Borodin et al., ETASR 2024) — the capsule-network sibling of |
| [Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The |
| model takes a raw speech waveform and returns a score where **higher = more bona fide**. |
|
|
| - **Code:** https://github.com/lab260ru/ResCapsGuard |
| - **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906) |
| - **Parameters:** 1,606,664 (1.607 M) |
| - **Checkpoint:** [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth) |
|
|
| This repo is self-contained for inference: [`_net.py`](./_net.py) holds the network |
| definition, [`evaluate.py`](./evaluate.py) is a standalone scorer, and |
| [`rescapsguard.py`](./rescapsguard.py) is the exact wrapper used to produce the Arena |
| scores. |
|
|
| ## Architecture |
|
|
| ResCapsGuard operates directly on the raw waveform: |
|
|
| 1. **Sinc-convolution front-end** (`SincConv`) — learnable band-pass filters that turn |
| the waveform into a time–frequency representation. |
| 2. **Res2Net-style encoder** — stacked `Res_block`s (2-D convolutions with SELU and |
| max-pooling) that build a deep spectro-temporal feature map. |
| 3. **Primary capsules** — a bank of capsule branches, each ending in a channel-wise |
| statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors. |
| 4. **Dynamic routing** (`RoutingMechanism`) — routing-by-agreement (with the squash |
| non-linearity) to **two output capsules**, bona fide vs. spoof. The bona-fide |
| capsule activation (index 1) is the returned score. |
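The routing step in item 4 can be illustrated with a generic NumPy sketch of routing-by-agreement and the squash non-linearity. This is not the repo's `RoutingMechanism`. The capsule counts, vector size, three routing iterations, and the use of capsule length as the activation are all assumptions made for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def route(u_hat, n_iters=3):
    """Generic routing-by-agreement.

    u_hat: (n_in, n_out, dim) prediction vectors, one per (input, output) pair.
    Returns the (n_out, dim) output capsules.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # routing logits
    for _ in range(n_iters):
        c = softmax(b, axis=1)                     # each input splits its vote over outputs
        s = np.sum(c[:, :, None] * u_hat, axis=0)  # weighted sum per output capsule
        v = squash(s)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # reward agreeing predictions
    return v

# Illustrative sizes only: 8 primary capsules, 2 output capsules, 16-dim vectors.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 16))
v = route(u_hat)
lengths = np.linalg.norm(v, axis=-1)  # one value in [0, 1) per output capsule
```

In this sketch `lengths[1]` would play the role of the bona-fide activation; how ResCapsGuard derives its activation from the output capsule is defined in `_net.py`, not here.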
|
|
| ## How it was trained |
|
|
| - **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset, following the protocol in |
| the paper (train/validate on a single attack type, evaluate on the eval split with |
| more advanced and unseen attacks — testing generalization to harder scenarios). |
| - **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). |
| During training a random segment is cut from each utterance. |
| - **Best reported result (paper):** EER = **2.25 %**, min t-DCF = 0.0744. |
|
|
| See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb) |
| for the full training and evaluation code. |
|
|
| ## Benchmark result (Speech Anti-Spoofing Arena) |
|
|
| Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard). |
| Scores were computed with a **deterministic first-64,600-sample window** (no random |
| crop), so the numbers are exactly reproducible from the pinned score file. |
|
|
| | Dataset | Split | EER % | Trials | Skipped | Notes | |
| |---|---|---|---|---|---| |
| | ASVspoof2019_LA | test | **1.86** | 71,237 | 0 | in-domain (training data) | |
| | CD-ADD | test | **54.55** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize | |
| | InTheWild | test | **55.92** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize | |
| | ASVspoof2021_LA | test | **18.70** | 181,566 | 0 | cross-dataset generalization | |
| | ASVspoof2021_DF | test | **17.00** | 611,829 | 0 | cross-dataset generalization | |
| |
| The ASVspoof2019_LA result (1.86 %) is close to the 2.25 % the paper reports on the LA |
| eval set; the deterministic window (vs. the paper's random crop) accounts for the small |
| difference. As with its Res2TCNGuard sibling, the model was trained only on ASVspoof2019 |
| LA, so it degrades on the newer ASVspoof2021 LA and DF sets and does not generalize to |
| the out-of-domain CD-ADD and InTheWild sets, where EERs above 50 % are no better than |
| chance. This is the cost of training on a single attack type. The ASVspoof2021_DF result |
| (17.00 %) is nearly identical to the sibling Res2TCNGuard's 17.02 % on the same eval. |
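For readers who want to recompute an EER from a score file, here is a minimal sketch of the metric for scores where higher means more bona fide. The function name `compute_eer` is hypothetical, and the Arena's own implementation may differ, for example by interpolating between thresholds.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher = more bona fide."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bona, spoof]))
    # False rejection rate: fraction of bona fide trials scored below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / len(bona)
    # False acceptance rate: fraction of spoof trials scored at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / len(spoof)
    i = int(np.argmin(np.abs(frr - far)))  # threshold where the two rates are closest
    return (frr[i] + far[i]) / 2.0, thresholds[i]

# Toy scores: one bona fide trial (0.3) falls below one spoof trial (0.6).
eer, thr = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.6, 0.4])
print(f"EER = {100 * eer:.2f} %")
```

An EER near 50 %, as on CD-ADD and InTheWild, means the two score distributions overlap almost completely.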
| |
| ## Usage |
| |
| The checkpoint is a `state_dict` for the `CapsuleNet` network defined in |
| [`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is |
| windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600 |
| samples, tile-repeat if shorter). |
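The windowing described above can be sketched as follows. The functions are illustrative re-implementations written from this description, not the code in `evaluate.py`: `pad_fixed_sketch` mirrors the deterministic window, and `random_crop_sketch` mirrors the random segment used in training. Falling back to tiling when an utterance is too short for a random crop is an assumption.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, about 4.04 s

def pad_fixed_sketch(wave, target_len=TARGET_LEN):
    """Deterministic window: first target_len samples, tile-repeating short inputs."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) == 0:
        raise ValueError("empty waveform")
    if len(wave) >= target_len:
        return wave[:target_len]
    reps = -(-target_len // len(wave))  # ceiling division
    return np.tile(wave, reps)[:target_len]

def random_crop_sketch(wave, target_len=TARGET_LEN, rng=None):
    """Training-style window: a random contiguous segment of target_len samples."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) <= target_len:
        return pad_fixed_sketch(wave, target_len)  # assumed fallback for short inputs
    rng = np.random.default_rng() if rng is None else rng
    start = int(rng.integers(0, len(wave) - target_len + 1))
    return wave[start:start + target_len]

short = pad_fixed_sketch(np.arange(5), target_len=12)   # 0 1 2 3 4 0 1 2 3 4 0 1
long_ = pad_fixed_sketch(np.arange(100_000))            # samples 0 .. 64599
crop = random_crop_sketch(np.arange(100_000), rng=np.random.default_rng(0))
```

Because the deterministic window always yields the same samples for a given file, repeated runs give identical scores, which a random crop does not guarantee.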
|
|
| Score one file from the command line: |
|
|
| ```bash |
| pip install torch numpy soundfile scipy |
| python evaluate.py path/to/audio.wav |
| # -> bona-fide score: <float> (higher = more bona fide) |
| ``` |
|
|
| Or from Python: |
|
|
| ```python |
| import numpy as np |
| from evaluate import load_model, score # _net.py + evaluate.py are in this repo |
| |
| model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu") |
| audio = np.random.randn(48000).astype(np.float32) # 3 s of noise standing in for float32 mono 16 kHz audio |
| print(score(model, audio)) # higher = more bona fide |
| ``` |
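The Python example above feeds random noise. For a real recording, the array first has to be float32 mono at 16 kHz. The helper below is a hypothetical sketch of that preparation using `scipy.signal.resample_poly`; whether `evaluate.py` already resamples or downmixes on its own is not stated here, so treat it as an optional pre-processing step.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # Hz, the rate the model expects

def to_mono_16k(audio, sr):
    """Downmix (frames, channels) audio to mono and resample to 16 kHz float32."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:                 # layout returned by soundfile.read for multi-channel files
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        g = gcd(TARGET_SR, sr)
        audio = resample_poly(audio, TARGET_SR // g, sr // g)  # polyphase resampling
    return audio.astype(np.float32)

# With a real file, one would typically do:
#   import soundfile as sf
#   audio, sr = sf.read("path/to/audio.wav", dtype="float32")
# Here a silent 1 s stereo clip at 48 kHz stands in for the file.
prepared = to_mono_16k(np.zeros((48_000, 2), dtype=np.float32), sr=48_000)
```

The result can then be passed to `score(model, prepared)` as in the example above.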
|
|
| Internally `score` does `_z, class_ = model(x, random=False, dropout=0)` on the windowed |
| input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py) |
| is the same logic packaged as a `speech_spoof_bench` model — the exact code that produced |
| the Arena `scores.txt`. |
|
|
| ## Citation |
|
|
| **This model / paper:** |
|
|
| ```bibtex |
| @article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024, |
| place={Greece}, |
| title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry}, |
| volume={14}, |
| number={6}, |
| url={https://etasr.com/index.php/ETASR/article/view/8906}, |
| DOI={10.48084/etasr.8906}, |
| journal={Engineering, Technology & Applied Science Research}, |
| author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail}, |
| year={2024}, |
| month={Dec.}, |
| pages={18409--18414} |
| } |
| ``` |
|
|
| **Training dataset — ASVspoof 2019:** |
|
|
| ```bibtex |
| @article{wang2020asvspoof, |
| title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech}, |
| author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others}, |
| journal={Computer Speech \& Language}, |
| volume={64}, |
| pages={101114}, |
| year={2020}, |
| publisher={Elsevier} |
| } |
| ``` |
|
|
| ## License |
|
|
| MIT — see the [source repository](https://github.com/lab260ru/ResCapsGuard). |
|
|