korallll committed on
Commit 5dd5117 · verified · 1 Parent(s): 016f686

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +141 -0
README.md ADDED
@@ -0,0 +1,141 @@
+ ---
+ license: mit
+ tags:
+ - audio
+ - anti-spoofing
+ - audio-deepfake-detection
+ - speech
+ - asvspoof
+ ---
+
+ # ResCapsGuard
+
+ [![EER% 1.86 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.86%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
+ [![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
+ [![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
+
+ ResCapsGuard is a capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
+ *"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"*
+ (Borodin et al., ETASR 2024). It is the capsule-network sibling of
+ [Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The
+ model takes a raw speech waveform and returns a score where **higher = more bona fide**.
+
+ - **Code:** https://github.com/lab260ru/ResCapsGuard
+ - **Paper:** https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
+ - **Parameters:** 1,606,664 (1.607 M)
+ - **Checkpoint:** [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth)
+
+ This repo is self-contained for inference: the network definition is in
+ [`_net.py`](./_net.py), a standalone scorer is in [`evaluate.py`](./evaluate.py), and
+ the exact wrapper used to produce the Arena scores is in
+ [`rescapsguard.py`](./rescapsguard.py).
+
+ ## Architecture
+
+ ResCapsGuard operates directly on the raw waveform:
+
+ 1. **Sinc-convolution front-end** (`SincConv`): learnable band-pass filters that turn
+ the waveform into a time–frequency representation.
+ 2. **Res2Net-style encoder**: stacked `Res_block`s (2-D convolutions with SELU and
+ max-pooling) that build a deep spectro-temporal feature map.
+ 3. **Primary capsules**: a bank of capsule branches, each ending in a channel-wise
+ statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors.
+ 4. **Dynamic routing** (`RoutingMechanism`): routing-by-agreement (with the squash
+ non-linearity) to **two output capsules**, bona fide vs. spoof. The bona-fide
+ capsule activation (index 1) is the returned score.
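
The squash non-linearity and routing-by-agreement in step 4 can be illustrated with a small NumPy sketch. It follows the standard formulation from the capsule-network literature and is not the repository's `RoutingMechanism`. The function names (`squash`, `route_by_agreement`), the shapes (8 primary capsules, 16-dimensional vectors), the three routing iterations, and the use of capsule length as the activation are all assumptions made for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vectors so their length lies in [0, 1) while keeping direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def route_by_agreement(u_hat, num_iters=3):
    """u_hat: per-input predictions, shape (num_in, num_out, dim) -> (num_out, dim)."""
    num_in, num_out, _ = u_hat.shape
    logits = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iters):
        coupling = softmax(logits, axis=1)                    # each input splits its vote
        s = np.sum(coupling[:, :, None] * u_hat, axis=0)      # weighted sum per output
        v = squash(s)                                         # output capsule vectors
        logits = logits + np.sum(u_hat * v[None], axis=-1)    # reward agreeing inputs
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 2, 16))        # illustrative: 8 primary -> 2 output capsules
v = route_by_agreement(u_hat)
activations = np.linalg.norm(v, axis=-1)   # assumed activation: capsule length
```

Under these assumptions, `activations[1]` would play the role of the bona-fide score, since the section states that index 1 is the bona-fide capsule.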
+
+ ## How it was trained
+
+ - **Data:** the ASVspoof 2019 **Logical Access (LA)** dataset, following the protocol in
+ the paper (train/validate on a single attack type, evaluate on the eval split with
+ more advanced and unseen attacks, which tests generalization to harder scenarios).
+ - **Input length:** raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
+ During training a random segment is cut from each utterance.
+ - **Best reported result (paper):** EER = **2.25 %**, min t-DCF = 0.0744.
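
The window length and the training-time random segment can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: `random_segment` and its `rng` argument are hypothetical names, and utterances shorter than the window are simply returned unchanged here, leaving padding as a separate step.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, as stated in the list above
NUM_SAMPLES = 64_600   # window length in samples, as stated in the list above

def random_segment(audio, num_samples=NUM_SAMPLES, rng=None):
    """Cut a random contiguous window of `num_samples` from a 1-D waveform."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    if audio.size <= num_samples:
        return audio  # too short to crop; padding is not handled in this sketch
    rng = np.random.default_rng() if rng is None else rng
    start = int(rng.integers(0, audio.size - num_samples + 1))  # upper bound is exclusive
    return audio[start:start + num_samples]

window_seconds = NUM_SAMPLES / SAMPLE_RATE  # 4.0375 s, i.e. about 4.04 s
utterance = np.zeros(5 * SAMPLE_RATE, dtype=np.float32)  # 5 s of silence as a stand-in
segment = random_segment(utterance, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the crop repeatable, which is convenient when debugging a data loader.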
+
+ See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb)
+ for the full training and evaluation code.
+
+ ## Benchmark result (Speech Anti-Spoofing Arena)
+
+ The model was evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard).
+ Scores were computed with a **deterministic first-64,600-sample window** (no random
+ crop), so the numbers are exactly reproducible from the pinned score file.
+
+ | Dataset | Split | EER % | Trials | Skipped | Notes |
+ |---|---|---|---|---|---|
+ | ASVspoof2019_LA | test | **1.86** | 71,237 | 0 | in-domain (same dataset as training) |
+
+ The ASVspoof2019_LA result is close to the 2.25 % EER that the paper reports on the LA eval
+ set; the deterministic window (vs. the paper's random crop) accounts for the small
+ difference. Cross-dataset rows (ASVspoof2021_DF/LA, CD-ADD, InTheWild) are added as
+ their submissions are merged.
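
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the share of spoofed trials accepted equals the share of bona fide trials rejected. A minimal NumPy sketch is below. `compute_eer` is a hypothetical helper, not the Arena's scoring code, and it picks the closest crossing among the observed thresholds without interpolation, so it may differ slightly from an official implementation.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER for scores where higher means more bona fide."""
    bonafide = np.sort(np.asarray(bonafide_scores, dtype=np.float64))
    spoof = np.sort(np.asarray(spoof_scores, dtype=np.float64))
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection rate: share of bona fide trials scoring below the threshold.
    frr = np.searchsorted(bonafide, thresholds, side="left") / bonafide.size
    # False acceptance rate: share of spoof trials scoring at or above the threshold.
    far = 1.0 - np.searchsorted(spoof, thresholds, side="left") / spoof.size
    idx = int(np.argmin(np.abs(frr - far)))  # threshold where the two rates are closest
    return float((frr[idx] + far[idx]) / 2.0)

# Toy scores: one bona fide trial (0.4) ranks below one spoof trial (0.5).
eer = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.5, 0.3, 0.2])
print(f"EER = {100 * eer:.2f} %")  # EER = 25.00 %
```

The `searchsorted` calls keep the computation at O(n log n), which matters for trial lists of the size shown in the table.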
+
+ ## Usage
+
+ The checkpoint is a `state_dict` for the `CapsuleNet` network defined in
+ [`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is
+ windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600
+ samples, tile-repeat if shorter).
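
The windowing rule described above (first 64,600 samples, tile-repeat if shorter) can be sketched in NumPy. `pad_fixed_sketch` is a hypothetical stand-in written from that description; the repository's own `pad_fixed` may differ in details such as dtype handling or how empty input is treated.

```python
import numpy as np

NUM_SAMPLES = 64_600  # window length stated above

def pad_fixed_sketch(audio, num_samples=NUM_SAMPLES):
    """Return exactly `num_samples` samples: truncate long input, tile short input."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    if audio.size == 0:
        raise ValueError("cannot window an empty waveform")  # assumption: reject empty input
    if audio.size >= num_samples:
        return audio[:num_samples]              # deterministic: keep the first window
    repeats = -(-num_samples // audio.size)     # ceiling division
    return np.tile(audio, repeats)[:num_samples]

long_clip = pad_fixed_sketch(np.arange(100_000))              # truncated to 64,600
short_clip = pad_fixed_sketch(np.array([1, 2, 3]), num_samples=7)
print(short_clip.tolist())  # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
```

Because no random offset is involved, the same file always yields the same window, which is what makes the Arena scores reproducible.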
+
+ Score one file from the command line:
+
+ ```bash
+ pip install torch numpy soundfile scipy
+ python evaluate.py path/to/audio.wav
+ # -> bona-fide score: <float> (higher = more bona fide)
+ ```
+
+ Or from Python:
+
+ ```python
+ import numpy as np
+ from evaluate import load_model, score  # _net.py + evaluate.py are in this repo
+
+ # Load the checkpoint's state_dict into the network on CPU.
+ model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")
+ # 3 s of random noise as a stand-in for real float32 mono 16 kHz audio.
+ audio = np.random.randn(48000).astype(np.float32)
+ print(score(model, audio))  # higher = more bona fide
+ ```
+
+ Internally `score` calls `_z, class_ = model(x, random=False, dropout=0)` on the windowed
+ input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py)
+ is the same logic packaged as a `speech_spoof_bench` model, and it is the exact code that produced
+ the Arena `scores.txt`.
+
+ ## Citation
+
+ **This model / paper:**
+
+ ```bibtex
+ @article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
+ place={Greece},
+ title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
+ volume={14},
+ number={6},
+ url={https://etasr.com/index.php/ETASR/article/view/8906},
+ DOI={10.48084/etasr.8906},
+ journal={Engineering, Technology \& Applied Science Research},
+ author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
+ year={2024},
+ month={Dec.},
+ pages={18409--18414}
+ }
+ ```
+
+ **Training dataset (ASVspoof 2019):**
+
+ ```bibtex
+ @article{wang2020asvspoof,
+ title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
+ author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
+ journal={Computer Speech \& Language},
+ volume={64},
+ pages={101114},
+ year={2020},
+ publisher={Elsevier}
+ }
+ ```
+
+ ## License
+
+ MIT; see the [source repository](https://github.com/lab260ru/ResCapsGuard).