---
license: mit
tags:
  - audio
  - anti-spoofing
  - audio-deepfake-detection
  - speech
  - asvspoof
---

# ResCapsGuard

[![EER% 1.86 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.86%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![EER% 54.55 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-54.55%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![EER% 55.92 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-55.92%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![EER% 18.70 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-18.70%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![EER% 17.00 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-17.00%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)

Capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
*"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
(Borodin et al., ETASR 2024) — the capsule-network sibling of
[Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The
model takes a raw speech waveform and returns a score where **higher = more bona fide**.

- **Code:** https://github.com/lab260ru/ResCapsGuard
- **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
- **Parameters:** 1,606,664 (1.607 M)
- **Checkpoint:** [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth)

This repo is self-contained for inference: the network definition is in
[`_net.py`](./_net.py), a standalone scorer in [`evaluate.py`](./evaluate.py), and
the exact wrapper used to produce the Arena scores in
[`rescapsguard.py`](./rescapsguard.py).

## Architecture

ResCapsGuard operates directly on the raw waveform:

1. **Sinc-convolution front-end** (`SincConv`) — learnable band-pass filters that turn
   the waveform into a time–frequency representation.
2. **Res2Net-style encoder** — stacked `Res_block`s (2-D convolutions with SELU and
   max-pooling) that build a deep spectro-temporal feature map.
3. **Primary capsules** — a bank of capsule branches, each ending in a channel-wise
   statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors.
4. **Dynamic routing** (`RoutingMechanism`) — routing-by-agreement (with the squash
   non-linearity) to **two output capsules**, bona fide vs. spoof. The bona-fide
   capsule activation (index 1) is the returned score.
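
The routing step in item 4 can be illustrated with a minimal NumPy sketch of generic routing-by-agreement with the squash non-linearity. It is not the repo's `RoutingMechanism`: the capsule counts, the vector dimension, the three iterations, and the use of the vector norm as the capsule activation are all assumptions made for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route(u_hat, n_iters=3):
    """Generic routing-by-agreement (illustrative, not the repo's code).

    u_hat: prediction vectors of shape (n_in, n_out, dim), i.e. what each
           input capsule predicts for each output capsule.
    Returns the output capsules (n_out, dim) and the coupling
    coefficients (n_in, n_out) used to build them.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                      # routing logits
    for _ in range(n_iters):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)            # softmax over output capsules
        s = np.einsum("io,iod->od", c, u_hat)        # weighted sum per output capsule
        v = squash(s)                                # (n_out, dim)
        b = b + np.einsum("iod,od->io", u_hat, v)    # reward predictions that agree with v
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 16))      # hypothetical: 8 primary capsules, 2 output capsules
v, c = route(u_hat)
activation = np.linalg.norm(v, axis=-1)  # assumed activation: one value in [0, 1) per capsule
print(activation.shape, c.sum(axis=1))
```

Each input capsule's coupling coefficients sum to 1, so routing redistributes a fixed budget of "votes" toward the output capsule whose current estimate agrees most with that input's prediction.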

## How it was trained

- **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset, following the protocol in
  the paper (train/validate on a single attack type, evaluate on the eval split with
  more advanced and unseen attacks — testing generalization to harder scenarios).
- **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
  During training a random segment is cut from each utterance.
- **Best reported result (paper):** EER = **2.25 %**, min t-DCF = 0.0744.
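
The random-segment cropping mentioned above can be sketched as follows. The function name `random_segment` is hypothetical, and tiling utterances that are shorter than the window is an assumption borrowed from the inference-time description; the training notebook may handle short inputs differently.

```python
import numpy as np

TARGET_LEN = 64_600  # samples at 16 kHz, about 4.04 s

def random_segment(wave, target_len=TARGET_LEN, rng=None):
    """Return a random window of target_len samples (hypothetical helper)."""
    rng = rng if rng is not None else np.random.default_rng()
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if wave.size < target_len:
        # Assumption: repeat short utterances until they cover the window.
        reps = -(-target_len // wave.size)  # ceiling division
        wave = np.tile(wave, reps)
    # Upper bound is exclusive, so the last valid start is included.
    start = rng.integers(0, wave.size - target_len + 1)
    return wave[start:start + target_len]

utterance = np.arange(100_000, dtype=np.float32)  # stand-in for a 6.25 s utterance
segment = random_segment(utterance, rng=np.random.default_rng(0))
print(segment.shape)
```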

See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb)
for the full training and evaluation code.

## Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard).
Scores were computed with a **deterministic first-64,600-sample window** (no random
crop), so the numbers are exactly reproducible from the pinned score file.

| Dataset | Split | EER % | Trials | Skipped | Notes |
|---|---|---|---|---|---|
| ASVspoof2019_LA | test | **1.86** | 71,237 | 0 | in-domain (eval split of the training dataset) |
| CD-ADD | test | **54.55** | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
| InTheWild | test | **55.92** | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |
| ASVspoof2021_LA | test | **18.70** | 181,566 | 0 | cross-dataset generalization |
| ASVspoof2021_DF | test | **17.00** | 611,829 | 0 | cross-dataset generalization |

The ASVspoof2019_LA result (1.86 %) is close to the 2.25 % that the paper reports on the LA
eval set; the deterministic window (vs. the paper's random crop) likely accounts for the
small difference. As with its Res2TCNGuard sibling, the model, trained only on
ASVspoof2019 LA, degrades on the newer ASVspoof2021 LA and DF sets (18.70 % and 17.00 %
EER) and does not generalize to the out-of-domain CD-ADD and InTheWild sets, where an EER
above 50 % is no better than chance. This is the cost of training on a single attack
type. The ASVspoof2021_DF result (17.00 %) is nearly identical to the sibling
Res2TCNGuard's 17.02 % on the same eval.
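
For readers unfamiliar with the metric, the sketch below shows one common way to compute an equal error rate from scores where higher means more bona fide. The function name `compute_eer` is hypothetical, and the Arena's own implementation (tie handling, interpolation between operating points) is not described here and may differ in small details.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER in percent for scores where higher = more bona fide (illustrative)."""
    bona = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # False rejection rate: fraction of bona fide trials scored below the threshold.
    frr = np.searchsorted(bona, thresholds, side="left") / bona.size
    # False acceptance rate: fraction of spoof trials scored at or above it.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    # Take the threshold where the two error rates are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return float(100.0 * (far[i] + frr[i]) / 2.0)

bona = [0.9, 0.8, 0.7, 0.3]   # toy scores, not model output
spoof = [0.1, 0.2, 0.4, 0.6]
print(compute_eer(bona, spoof))  # 25.0
```

In this toy example one of four bona fide trials is rejected and one of four spoof trials is accepted at a threshold of 0.6, which gives 25 %.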

## Usage

The checkpoint is a `state_dict` for the `CapsuleNet` network defined in
[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is
windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600
samples, tile-repeat if shorter).
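
The windowing rule described above (first 64,600 samples, tile-repeat if shorter) can be sketched like this. `pad_fixed_sketch` is a hypothetical re-implementation written for illustration; the actual `pad_fixed` in this repo may differ in details such as input validation.

```python
import numpy as np

def pad_fixed_sketch(wave, target_len=64_600):
    """Deterministic window: truncate long inputs, tile-repeat short ones."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if wave.size >= target_len:
        return wave[:target_len]            # keep only the first target_len samples
    reps = target_len // wave.size + 1      # enough copies to cover the window
    return np.tile(wave, reps)[:target_len]

short = np.arange(5, dtype=np.float32)
print(pad_fixed_sketch(short, target_len=12).tolist())
# [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 1.0]
```

Because no random offset is involved, the same file always produces the same window, which is what makes the Arena scores reproducible.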

Score one file from the command line:

```bash
pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float>  (higher = more bona fide)
```

Or from Python:

```python
import numpy as np
from evaluate import load_model, score   # _net.py + evaluate.py are in this repo

model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32)  # placeholder: 3 s of noise; pass float32 mono 16 kHz audio
print(score(model, audio))                          # higher = more bona fide
```

Internally `score` does `_z, class_ = model(x, random=False, dropout=0)` on the windowed
input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py)
is the same logic packaged as a `speech_spoof_bench` model — the exact code that produced
the Arena `scores.txt`.

## Citation

**This model / paper:**

```bibtex
@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology \& Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}
```

**Training dataset — ASVspoof 2019:**

```bibtex
@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}
```

## License

MIT — see the [source repository](https://github.com/lab260ru/ResCapsGuard).