	---
	license: mit
	tags:
	- audio
	- anti-spoofing
	- audio-deepfake-detection
	- speech
	- asvspoof
	---

	# ResCapsGuard

	[![EER% 1.86 on ASVspoof2019_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2019__LA-1.86%25-brightgreen)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![EER% 54.55 on CD-ADD](https://img.shields.io/badge/EER%25%20on%20CD--ADD-54.55%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![EER% 55.92 on InTheWild](https://img.shields.io/badge/EER%25%20on%20InTheWild-55.92%25-red)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![EER% 18.70 on ASVspoof2021_LA](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__LA-18.70%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![EER% 17.00 on ASVspoof2021_DF](https://img.shields.io/badge/EER%25%20on%20ASVspoof2021__DF-17.00%25-yellow)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![arena tier](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/tier.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)
	[![arena rank](https://img.shields.io/endpoint?url=https://speechantispoofingbenchmarks-speechantispoofingarena.hf.space/badge/rescapsguard/rank.json)](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard)

	Capsule-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in
	"Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry"
	(Borodin et al., ETASR 2024) — the capsule-network sibling of
	[Res2TCNGuard](https://huggingface.co/SpeechAntiSpoofingBenchmarks/Res2TCNGuard). The
	model takes a raw speech waveform and returns a score where higher = more bona fide.

	- Code: https://github.com/lab260ru/ResCapsGuard
	- Paper: https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
	- Parameters: 1,606,664 (1.607 M)
	- Checkpoint: [`new_capsules_changed_sinc_layer.pth`](./new_capsules_changed_sinc_layer.pth)

	This repo is self-contained for inference: the network definition is in
	[`_net.py`](./_net.py), a standalone scorer in [`evaluate.py`](./evaluate.py), and
	the exact wrapper used to produce the Arena scores in
	[`rescapsguard.py`](./rescapsguard.py).

	## Architecture

	ResCapsGuard operates directly on the raw waveform:

	1. Sinc-convolution front-end (`SincConv`) — learnable band-pass filters that turn
	the waveform into a time–frequency representation.
	2. Res2Net-style encoder — stacked `Res_block`s (2-D convolutions with SELU and
	max-pooling) that build a deep spectro-temporal feature map.
	3. Primary capsules — a bank of capsule branches, each ending in a channel-wise
	statistics pooling (`ChanelWiseStats`, mean + std) to produce per-capsule vectors.
	4. Dynamic routing (`RoutingMechanism`) — routing-by-agreement (with the squash
	non-linearity) to two output capsules, bona fide vs. spoof. The bona-fide
	capsule activation (index 1) is the returned score.
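
Steps 3 and 4 can be illustrated with a minimal NumPy sketch. It is not the repository's `ChanelWiseStats` or `RoutingMechanism` code (see `_net.py` for that); it follows the generic routing-by-agreement formulation. The tensor shapes, the three routing iterations, the random transform `W`, and the use of the capsule norm as the activation are all assumptions made for illustration.

```python
import numpy as np


def channel_stats(feat):
    """Mean + std pooling over time: (channels, time) -> (2 * channels,)."""
    return np.concatenate([feat.mean(axis=-1), feat.std(axis=-1)])


def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)


def route(u_hat, n_iter=3):
    """Generic routing-by-agreement (n_iter must be >= 1).

    u_hat: prediction vectors, shape (n_primary, n_out, dim).
    Returns the output capsules, shape (n_out, dim).
    """
    b = np.zeros(u_hat.shape[:2])  # routing logits, one per (primary, output) pair
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)           # softmax over output capsules
        v = squash(np.einsum("io,iod->od", c, u_hat))  # weighted sum, then squash
        b = b + np.einsum("iod,od->io", u_hat, v)      # reward agreeing predictions
    return v


rng = np.random.default_rng(0)

# Four hypothetical capsule branches, each pooling an (8 channels, 50 frames) map.
primary = np.stack([channel_stats(rng.standard_normal((8, 50))) for _ in range(4)])

# Hypothetical transform per (primary, output) pair: 16-dim input -> 8-dim prediction.
W = rng.standard_normal((4, 2, 16, 8)) * 0.1
u_hat = np.einsum("ik,iokd->iod", primary, W)

out = route(u_hat)                          # (2, 8): two output capsules
activations = np.linalg.norm(out, axis=-1)  # assumed activation: capsule norm
bona_fide_score = activations[1]            # index 1 = bona fide, as in this README
```

Because of the squash, each activation lies in [0, 1), so the two output capsules can be compared directly.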

	## How it was trained

	- Data: the ASVspoof 2019 Logical Access (LA) dataset, following the protocol in
	the paper: train/validate on a single attack type, then evaluate on the eval split,
	which contains more advanced and unseen attacks, to test generalization to harder scenarios.
	- Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s).
	During training a random segment is cut from each utterance.
	- Best reported result (paper): EER = 2.25 %, min t-DCF = 0.0744.
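
The fixed input length and the random training crop can be sketched as follows. This is an illustration only, not the notebook's code: the function name `random_segment` is hypothetical, and tile-repeating short utterances during training is an assumption (this README only states that behaviour for inference).

```python
import numpy as np

SAMPLE_RATE = 16_000
N_SAMPLES = 64_600  # 64,600 / 16,000 = 4.0375 s


def random_segment(wave, n_samples=N_SAMPLES, rng=None):
    """Return a random contiguous window of n_samples from a 1-D waveform."""
    if rng is None:
        rng = np.random.default_rng()
    if len(wave) == 0:
        raise ValueError("empty waveform")
    if len(wave) <= n_samples:
        # Assumption: short utterances are tile-repeated up to the target length.
        reps = -(-n_samples // len(wave))  # ceiling division
        return np.tile(wave, reps)[:n_samples]
    start = rng.integers(0, len(wave) - n_samples + 1)  # upper bound is exclusive
    return wave[start:start + n_samples]


# Usage: a 100,000-sample ramp makes it easy to see that the crop is contiguous.
long_wave = np.arange(100_000, dtype=np.float32)
seg = random_segment(long_wave, rng=np.random.default_rng(0))
short = random_segment(np.ones(1_000, dtype=np.float32))
```

Passing a seeded generator makes the crop repeatable, which is useful when debugging a data loader.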

	See the [training notebook](https://github.com/lab260ru/ResCapsGuard/blob/main/new_capsules_changed_sinc.ipynb)
	for the full training and evaluation code.

	## Benchmark result (Speech Anti-Spoofing Arena)

	Evaluated through the reproducible [Speech Anti-Spoofing Arena](https://huggingface.co/spaces/SpeechAntiSpoofingBenchmarks/SpeechAntiSpoofingArena?system=rescapsguard).
	Scores were computed with a deterministic first-64,600-sample window (no random
	crop), so the numbers are exactly reproducible from the pinned score file.

	| Dataset | Split | EER % | Trials | Skipped | Notes |
	|---|---|---|---|---|---|
	| ASVspoof2019_LA | test | 1.86 | 71,237 | 0 | in-domain (same dataset as training) |
	| CD-ADD | test | 54.55 | 20,786 | 0 | out-of-domain (modern neural-TTS); does not generalize |
	| InTheWild | test | 55.92 | 31,779 | 0 | out-of-domain (real-world deepfakes); does not generalize |
	| ASVspoof2021_LA | test | 18.70 | 181,566 | 0 | cross-dataset generalization |
	| ASVspoof2021_DF | test | 17.00 | 611,829 | 0 | cross-dataset generalization |

	The ASVspoof2019_LA result (1.86 %) is close to the 2.25 % the paper reports on the LA
	eval set; the deterministic window (vs. the paper's random crop) accounts for the small
	difference. As with its Res2TCNGuard sibling, the model, trained only on ASVspoof2019 LA,
	degrades on the newer ASVspoof2021 LA and DF sets (17-19 % EER) and does not generalize to
	the out-of-domain CD-ADD and InTheWild sets, where its EER exceeds 50 %. This is the cost
	of training on a single attack type. The ASVspoof2021_DF result (17.00 %) is nearly
	identical to the sibling Res2TCNGuard's 17.02 % on the same eval.
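
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the share of rejected bona fide trials equals the share of accepted spoof trials. The sketch below is a common threshold-sweep approximation, not the Arena's own implementation; the function name `compute_eer` and the toy scores are assumptions for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate EER (as a fraction) for scores where higher = more bona fide."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)), np.zeros(len(spoof_scores))])

    # Sort trials by score; each position is a candidate threshold.
    labels = labels[np.argsort(scores, kind="mergesort")]

    # False rejection rate: bona fide trials at or below the threshold.
    frr = np.concatenate([[0.0], np.cumsum(labels) / len(bonafide_scores)])
    # False acceptance rate: spoof trials above the threshold.
    far = np.concatenate([[1.0], 1.0 - np.cumsum(1.0 - labels) / len(spoof_scores)])

    idx = np.argmin(np.abs(frr - far))  # point where the two error rates cross
    return float((frr[idx] + far[idx]) / 2.0)


# Toy example: one bona fide and one spoof trial fall on the wrong side.
eer = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7])
print(f"EER = {100 * eer:.2f} %")  # EER = 25.00 %
```

With this definition, random scores give an EER near 50 %, so values above that mean the scores rank spoofed audio above bona fide audio more often than not.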

	## Usage

	The checkpoint is a `state_dict` for the `CapsuleNet` network defined in
	[`_net.py`](./_net.py) (extracted verbatim from the source notebook). The input is
	windowed to exactly 64,600 samples at 16 kHz mono with `pad_fixed` (first 64,600
	samples, tile-repeat if shorter).
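
The windowing described above can be sketched in a few lines. This is a reconstruction from the description, not the code in `evaluate.py`; the name `pad_fixed_sketch` is hypothetical and the real `pad_fixed` may differ in details such as dtype handling.

```python
import numpy as np


def pad_fixed_sketch(wave, n_samples=64_600):
    """Keep the first n_samples; tile-repeat the waveform if it is shorter."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if wave.size >= n_samples:
        return wave[:n_samples]            # deterministic: always the first window
    reps = -(-n_samples // wave.size)      # ceiling division
    return np.tile(wave, reps)[:n_samples]


# A 3 s clip (48,000 samples) wraps around to fill the 64,600-sample window.
clip = np.arange(48_000, dtype=np.float32)
window = pad_fixed_sketch(clip)
print(window.shape)  # (64600,)
```

Since no random offset is involved, the same file always yields the same window, which is what makes the Arena scores reproducible.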

	Score one file from the command line:

	```bash
	pip install torch numpy soundfile scipy
	python evaluate.py path/to/audio.wav
	# -> bona-fide score: <float> (higher = more bona fide)
	```

	Or from Python:

	```python
	import numpy as np
	from evaluate import load_model, score # _net.py + evaluate.py are in this repo

	model = load_model("new_capsules_changed_sinc_layer.pth", device="cpu")
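	# Placeholder input: 3 s of noise (48,000 samples) as float32 mono at 16 kHz.
	# Replace it with real audio; inputs shorter than 64,600 samples are tile-repeated.
	audio = np.random.randn(48000).astype(np.float32)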
	print(score(model, audio)) # higher = more bona fide
	```

	Internally `score` does `_z, class_ = model(x, random=False, dropout=0)` on the windowed
	input and returns `class_[:, 1]` (index 1 = bona fide). [`rescapsguard.py`](./rescapsguard.py)
	is the same logic packaged as a `speech_spoof_bench` model — the exact code that produced
	the Arena `scores.txt`.

	## Citation

	This model / paper:

	```bibtex
	@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
	place={Greece},
	title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
	volume={14},
	number={6},
	url={https://etasr.com/index.php/ETASR/article/view/8906},
	DOI={10.48084/etasr.8906},
	journal={Engineering, Technology \& Applied Science Research},
	author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
	year={2024},
	month={Dec.},
	pages={18409--18414}
	}
	```

	Training dataset — ASVspoof 2019:

	```bibtex
	@article{wang2020asvspoof,
	title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
	author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
	journal={Computer Speech \& Language},
	volume={64},
	pages={101114},
	year={2020},
	publisher={Elsevier}
	}
	```

	## License

	MIT — see the [source repository](https://github.com/lab260ru/ResCapsGuard).