---
license: mit
tags:
- audio
- anti-spoofing
- audio-deepfake-detection
- speech
- asvspoof
---
# ResCapsGuard
ResCapsGuard is a capsule-based audio anti-spoofing (voice-deepfake detection)
countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024). It is the capsule-network sibling of
[Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The
model takes a raw speech waveform and returns a score where **higher = more bona fide**.
- **Code:** https://github.com/lab260ru/ResCapsGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 1,606,664 (1.607 M)
- **Checkpoint:** [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth)
This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores in
[`rescapsguard.py`](./rescapsguard.py).
## Architecture
ResCapsGuard operates directly on the raw waveform:
1. **Sinc-convolution front-end** (`SincConv`) — learnable band-pass filters that turn
the waveform into a time–frequency representation.
2. **Res2Net-style encoder** — stacked `Res_block`s (2-D convolutions with SELU and
max-pooling) that build a deep spectro-temporal feature map.
3. **Primary capsules** — a bank of capsule branches, each ending in a channel-wise
statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors.
4. **Dynamic routing** (`RoutingMechanism`) — routing-by-agreement (with the squash
non-linearity) to **two output capsules**, bona fide vs. spoof. The bona-fide
capsule activation (index 1) is the returned score.
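
To make step 4 concrete, here is a minimal NumPy sketch of generic routing-by-agreement with the squash non-linearity. It is not the repository's `RoutingMechanism`: the function names `squash` and `route`, the tensor shapes, the three iterations, and the use of capsule length as the activation are all assumptions made for illustration.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-8):
    """Shrink each vector so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def route(u_hat, n_iters=3):
    """Generic routing-by-agreement (illustrative, not the repo's code).

    u_hat: predictions of every input capsule for every output capsule,
           shape (n_in, n_out, dim).
    Returns the output capsule vectors, shape (n_out, dim).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, start uniform
    for _ in range(n_iters):
        # Coupling coefficients: softmax over the output capsules.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        # Weighted sum of predictions, then squash.
        s = np.einsum("io,iod->od", c, u_hat)
        v = squash(s)
        # Raise the logit where a prediction agrees with the output capsule.
        b = b + np.einsum("iod,od->io", u_hat, v)
    return v


# Hypothetical sizes: 8 primary capsules, 2 output capsules, 16-dim vectors.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 16))
v = route(u_hat)
activations = np.linalg.norm(v, axis=-1)  # assumed activation: capsule length
bona_fide_score = activations[1]          # index 1 = bona fide, as stated above
```

Because of the squash, each activation stays strictly below 1, so under these assumptions the score behaves like a bounded confidence value.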
## How it was trained
- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset, following the protocol in
the paper (train/validate on a single attack type, evaluate on the eval split with
more advanced and unseen attacks — testing generalization to harder scenarios).
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
During training a random segment is cut from each utterance.
- **Best reported result (paper):** EER = **2.25 %**, min t-DCF = 0.0744.
See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb)
for the full training and evaluation code.
## Benchmark result (Speech Anti-Spoofing Arena)
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.
| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.86** | 71,237 | 0 | in-domain (same corpus as the training data) |
| CD-ADD | test | **54.55** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **55.92** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |
| ASVspoof2021_LA | test | **18.70** | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | **17.00** | 611,829 | 0 | cross-dataset generalization |
The ASVspoof2019_LA result (1.86 %) is close to the 2.25 % the paper reports on the LA
eval set; the deterministic window (vs. the paper's random crop) likely accounts for the
small difference. As with its Res2TCNGuard sibling, the model, trained only on
ASVspoof2019 LA, degrades on the newer ASVspoof2021 LA and DF sets and does not
generalize to the out-of-domain CD-ADD and InTheWild sets, where an EER above 50 % is no
better than chance. This is the cost of training on a single attack type. The
ASVspoof2021_DF result (17.00 %) is nearly identical to the sibling Res2TCNGuard's
17.02 % on the same eval.
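
For readers unfamiliar with the metric, the sketch below shows one common way to compute an equal error rate (EER) from two score arrays, assuming higher scores mean more bona fide. The function name `compute_eer` and the nearest-threshold rule (no interpolation) are assumptions; the Arena's own implementation may differ in such details.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for scores where higher = more bona fide.

    Illustrative only: picks the observed threshold where the false
    acceptance and false rejection rates are closest, without interpolation.
    """
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: fraction of bona fide trials scored below t.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # False acceptance rate: fraction of spoof trials scored at or above t.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Synthetic scores, not real model output.
rng = np.random.default_rng(0)
eer, thr = compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER = {100 * eer:.2f} % at threshold {thr:.3f}")
```

An EER of 0 % means the two score distributions are perfectly separated, while 50 % corresponds to chance-level discrimination.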
## Usage
The checkpoint is a `state_dict` for the `CapsuleNet` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is
windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600
samples, tile-repeat if shorter).
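
The windowing described above can be sketched in a few lines of NumPy. This is an illustration of "first 64,600 samples, tile-repeat if shorter", not the repository's `pad_fixed`; the name `pad_fixed_sketch` and its signature are hypothetical.

```python
import numpy as np

N_SAMPLES = 64_600  # about 4.04 s at 16 kHz, as stated above


def pad_fixed_sketch(audio, n_samples=N_SAMPLES):
    """Return exactly n_samples: truncate long input, tile-repeat short input.

    Hypothetical stand-in for the repo's `pad_fixed`.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1 or audio.size == 0:
        raise ValueError("expected a non-empty 1-D mono waveform")
    if audio.shape[0] >= n_samples:
        return audio[:n_samples]               # deterministic first window
    repeats = -(-n_samples // audio.shape[0])  # ceiling division
    return np.tile(audio, repeats)[:n_samples]


short = pad_fixed_sketch(np.arange(10), n_samples=25)  # tiled: 0..9, 0..9, 0..4
long = pad_fixed_sketch(np.arange(100), n_samples=25)  # truncated: 0..24
```

Taking the first window rather than a random one is what makes the resulting score deterministic for a given file.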
Score one file from the command line:
```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float> (higher = more bona fide)
```
Or from Python:
```python
import numpy as np
from evaluate import load_model, score # _net.py + evaluate.py are in this repo
model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32) # placeholder: 3 s of noise standing in for float32 mono 16 kHz audio
print(score(model, audio)) # higher = more bona fide
```
Internally `score` does `_z, class_ = model(x, random=False, dropout=0)` on the windowed
input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py)
is the same logic packaged as a `speech_spoof_bench` model — the exact code that produced
the Arena `scores.txt`.
## Citation
**This model / paper:**
```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
place={Greece},
title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
volume={14},
number={6},
url={https://etasr.com/index.php/ETASR/article/view/8906},
DOI={10.48084/etasr.8906},
journal={Engineering, Technology & Applied Science Research},
author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
year={2024},
month={Dec.},
pages={18409--18414}
}
```
**Training dataset — ASVspoof 2019:**
```bibtex
@article{wang2020asvspoof,
title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
journal={Computer Speech \& Language},
volume={64},
pages={101114},
year={2020},
publisher={Elsevier}
}
```
## License
MIT — see the [source repository](https://github.com/lab260ru/ResCapsGuard).