Upload README.md with huggingface_hub
README.md
ADDED
|
@@ -0,0 +1,141 @@
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
tags:
|
| 4 |
+
- audio
|
| 5 |
+
- anti-spoofing
|
| 6 |
+
- audio-deepfake-detection
|
| 7 |
+
- speech
|
| 8 |
+
- asvspoof
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
# ResCapsGuard
|
| 12 |
+
|
| 13 |
+
[Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
Capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
|
| 18 |
+
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
|
| 19 |
+
(Borodin et al., ETASR 2024) — the capsule-network sibling of
|
| 20 |
+
[Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The
|
| 21 |
+
model takes a raw speech waveform and returns a score where **higher = more bona fide**.
|
| 22 |
+
|
| 23 |
+
- **Code:** https://github.com/lab260ru/ResCapsGuard
|
| 24 |
+
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
|
| 25 |
+
- **Parameters:** 1,606,664 (1.607 M)
|
| 26 |
+
- **Checkpoint:** [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth)
|
| 27 |
+
|
| 28 |
+
This repo is self-contained for inference: the network definition is in
|
| 29 |
+
[`_net.py`](./_net.py), a standalone scorer in [`evaluate.py`](./evaluate.py), and
|
| 30 |
+
the exact wrapper used to produce the Arena scores in
|
| 31 |
+
[`rescapsguard.py`](./rescapsguard.py).
|
| 32 |
+
|
| 33 |
+
## Architecture
|
| 34 |
+
|
| 35 |
+
ResCapsGuard operates directly on the raw waveform:
|
| 36 |
+
|
| 37 |
+
1. **Sinc-convolution front-end** (`SincConv`) — learnable band-pass filters that turn
|
| 38 |
+
the waveform into a time–frequency representation.
|
| 39 |
+
2. **Res2Net-style encoder** — stacked `Res_block`s (2-D convolutions with SELU and
|
| 40 |
+
max-pooling) that build a deep spectro-temporal feature map.
|
| 41 |
+
3. **Primary capsules** — a bank of capsule branches, each ending in a channel-wise
|
| 42 |
+
statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors.
|
| 43 |
+
4. **Dynamic routing** (`RoutingMechanism`) — routing-by-agreement (with the squash
|
| 44 |
+
non-linearity) to **two output capsules**, bona fide vs. spoof. The bona-fide
|
| 45 |
+
capsule activation (index 1) is the returned score.
|
| 46 |
+
|
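The squash non-linearity and routing-by-agreement named in step 4 can be sketched in plain Python. This follows the generic formulation from the capsule-network literature (Sabour et al., 2017), not the repo's `RoutingMechanism`, whose details may differ. The names `squash`, `route`, and `length`, the three routing iterations, the toy prediction vectors, and the use of vector length as the capsule activation are all assumptions made for illustration.

```python
import math

def squash(v, eps=1e-9):
    """Rescale v so its length falls in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in v)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def route(u_hat, n_iters=3):
    """Generic routing-by-agreement (illustrative, not the repo's code).

    u_hat[i][j] is the prediction vector that input capsule i makes for
    output capsule j. Returns one squashed vector per output capsule.
    """
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, initially uniform
    v = []
    for _ in range(n_iters):
        # Coupling coefficients: softmax over output capsules, per input capsule.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        # Raise the logit where a prediction agrees (dot product) with the output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v

# Toy input: 3 input capsules, 2 output capsules (0 = spoof, 1 = bona fide).
u_hat = [
    [[0.1, 0.0], [1.0, 1.0]],
    [[0.0, 0.1], [0.9, 1.1]],
    [[0.1, 0.1], [1.1, 0.9]],
]
activations = [length(v) for v in route(u_hat)]
print(activations)  # the index-1 activation is the larger one for this toy input
```

Because the three predictions for output capsule 1 are large and point the same way, routing concentrates the coupling coefficients on it, which is the "agreement" the mechanism is named for.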
| 47 |
+
## How it was trained
|
| 48 |
+
|
| 49 |
+
- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset, following the protocol in
|
| 50 |
+
the paper (train/validate on a single attack type, evaluate on the eval split with
|
| 51 |
+
more advanced and unseen attacks — testing generalization to harder scenarios).
|
| 52 |
+
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
|
| 53 |
+
During training a random segment is cut from each utterance.
|
| 54 |
+
- **Best reported result (paper):** EER = **2.25 %**, min t-DCF = 0.0744.
|
| 55 |
+
|
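The training-time segment selection described above can be sketched as follows. `random_crop` is a hypothetical helper, not the notebook's code. How the notebook treats utterances shorter than 64,600 samples during training is not stated here, so this sketch simply returns them unchanged and leaves padding to a separate step.

```python
import random

TARGET_LEN = 64_600   # samples, as stated above
SAMPLE_RATE = 16_000  # Hz

def random_crop(samples, target_len=TARGET_LEN, rng=random):
    """Return a random contiguous segment of target_len samples.

    Assumption: inputs shorter than target_len are returned as-is.
    """
    n = len(samples)
    if n <= target_len:
        return list(samples)
    start = rng.randint(0, n - target_len)  # randint is inclusive on both ends
    return list(samples[start:start + target_len])

utterance = list(range(100_000))  # stand-in for 6.25 s of 16 kHz audio
segment = random_crop(utterance, rng=random.Random(0))
duration_s = len(segment) / SAMPLE_RATE  # 64,600 / 16,000 = 4.0375 s
print(len(segment), duration_s)
```

Passing a seeded `random.Random` instance makes the crop repeatable, which is useful when debugging a data loader.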
| 56 |
+
See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb)
|
| 57 |
+
for the full training and evaluation code.
|
| 58 |
+
|
| 59 |
+
## Benchmark result (Speech Anti-Spoofing Arena)
|
| 60 |
+
|
| 61 |
+
Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard).
|
| 62 |
+
Scores were computed with a **deterministic first-64,600-sample window** (no random
|
| 63 |
+
crop), so the numbers are exactly reproducible from the pinned score file.
|
| 64 |
+
|
| 65 |
+
| Dataset | Split | EER % | Trials | Skipped | Notes |
|
| 66 |
+
|---|---|---|---|---|---|
|
| 67 |
+
| ASVspoof2019_LA | test | **1.86** | 71,237 | 0 | in-domain (same corpus as the training data) |
|
| 68 |
+
|
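For readers unfamiliar with the EER column: the equal error rate is the operating point at which the false-acceptance rate (spoof accepted) equals the false-rejection rate (bona fide rejected). A minimal sketch under this README's "higher = more bona fide" convention is shown below. `compute_eer` is a hypothetical helper, not the Arena's scoring code, and the scores are made-up toy values. It sweeps every observed score as a threshold, which is quadratic in the number of trials and only meant for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the EER by trying each observed score as a threshold.

    A trial is accepted as bona fide when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Fraction of bona fide trials rejected at this threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # Fraction of spoof trials accepted at this threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

bonafide = [0.9, 0.8, 0.7, 0.4]  # toy scores, not real model output
spoof = [0.1, 0.2, 0.3, 0.6]
print(compute_eer(bonafide, spoof) * 100)  # 25.0 for this toy data
```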
| 69 |
+
The ASVspoof2019_LA result is close to the paper's reported 2.25 % on the LA eval
|
| 70 |
+
set; the deterministic window (vs. the paper's random crop) accounts for the small
|
| 71 |
+
difference. Cross-dataset rows (ASVspoof2021_DF/LA, CD-ADD, InTheWild) are added as
|
| 72 |
+
their submissions are merged.
|
| 73 |
+
|
| 74 |
+
## Usage
|
| 75 |
+
|
| 76 |
+
The checkpoint is a `state_dict` for the `CapsuleNet` network defined in
|
| 77 |
+
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is
|
| 78 |
+
windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600
|
| 79 |
+
samples, tile-repeat if shorter).
|
| 80 |
+
|
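The windowing rule described above (first 64,600 samples, tile-repeat if shorter) can be sketched in plain Python. `pad_fixed_sketch` is a hypothetical stand-in written for this illustration; the repo's own `pad_fixed` may differ in signature and in how it handles edge cases such as empty input.

```python
def pad_fixed_sketch(samples, target_len=64_600):
    """Keep the first target_len samples; tile-repeat shorter inputs."""
    n = len(samples)
    if n == 0:
        # Assumption: an empty signal is treated as an error.
        raise ValueError("cannot window an empty signal")
    if n >= target_len:
        return list(samples[:target_len])
    repeats = -(-target_len // n)  # ceiling division
    return (list(samples) * repeats)[:target_len]

print(pad_fixed_sketch([1, 2, 3], target_len=7))  # [1, 2, 3, 1, 2, 3, 1]
print(len(pad_fixed_sketch([0.0] * 48_000)))      # 64600
```

Since no randomness is involved, the same file always yields the same window, which is what makes the Arena scores reproducible.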
| 81 |
+
Score one file from the command line:
|
| 82 |
+
|
| 83 |
+
```bash
|
| 84 |
+
pip install torch numpy soundfile scipy
|
| 85 |
+
python evaluate.py path/to/audio.wav
|
| 86 |
+
# -> bona-fide score: <float> (higher = more bona fide)
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
Or from Python:
|
| 90 |
+
|
| 91 |
+
```python
|
| 92 |
+
import numpy as np
|
| 93 |
+
from evaluate import load_model, score # _net.py + evaluate.py are in this repo
|
| 94 |
+
|
| 95 |
+
model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")
|
| 96 |
+
audio = np.random.randn(48000).astype(np.float32)  # 3 s of noise standing in for float32 mono 16 kHz audio
|
| 97 |
+
print(score(model, audio)) # higher = more bona fide
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
Internally, `score` calls `_z, class_ = model(x, random=False, dropout=0)` on the windowed
|
| 101 |
+
input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py)
|
| 102 |
+
is the same logic packaged as a `speech_spoof_bench` model — the exact code that produced
|
| 103 |
+
the Arena `scores.txt`.
|
| 104 |
+
|
| 105 |
+
## Citation
|
| 106 |
+
|
| 107 |
+
**This model / paper:**
|
| 108 |
+
|
| 109 |
+
```bibtex
|
| 110 |
+
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
|
| 111 |
+
place={Greece},
|
| 112 |
+
title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
|
| 113 |
+
volume={14},
|
| 114 |
+
number={6},
|
| 115 |
+
url={https://etasr.com/index.php/ETASR/article/view/8906},
|
| 116 |
+
DOI={10.48084/etasr.8906},
|
| 117 |
+
journal={Engineering, Technology \& Applied Science Research},
|
| 118 |
+
author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
|
| 119 |
+
year={2024},
|
| 120 |
+
month={Dec.},
|
| 121 |
+
pages={18409--18414}
|
| 122 |
+
}
|
| 123 |
+
```
|
| 124 |
+
|
| 125 |
+
**Training dataset — ASVspoof 2019:**
|
| 126 |
+
|
| 127 |
+
```bibtex
|
| 128 |
+
@article{wang2020asvspoof,
|
| 129 |
+
title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
|
| 130 |
+
author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
|
| 131 |
+
journal={Computer Speech \& Language},
|
| 132 |
+
volume={64},
|
| 133 |
+
pages={101114},
|
| 134 |
+
year={2020},
|
| 135 |
+
publisher={Elsevier}
|
| 136 |
+
}
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
## License
|
| 140 |
+
|
| 141 |
+
MIT — see the [source repository](https://github.com/lab260ru/ResCapsGuard).
|