Res2TCNGuard

TCN-based audio anti-spoofing (voice-deepfake detection) countermeasure proposed in "Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry" (Borodin et al., ETASR 2024). The model takes a raw speech waveform and returns a score where higher = more bona fide.

Code: https://github.com/lab260ru/Res2TCNGuard
Paper: https://etasr.com/index.php/ETASR/article/view/8906 (DOI: 10.48084/etasr.8906)
Parameters: 172,102 (0.172 M)
Checkpoint: best_1.495.pth

This repo is self-contained for inference: the network definition is in _net.py, a standalone scorer in evaluate.py, and the exact wrapper used to produce the Arena scores in res2tcnguard.py.

Architecture

Res2TCNGuard operates directly on the raw waveform:

Sinc-convolution front-end (SincConv_fast) — learnable band-pass filters that turn the waveform into a time–frequency representation.
Res2Net encoder — stacked Res2Blocks with multi-scale residual connections and squeeze-and-excitation (SE) attention.
Dual temporal convolutional networks — two TemporalConvNet branches model the time and spectral axes separately; their pooled features are concatenated and passed to a small linear classifier (bona fide vs. spoof).
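
As a rough illustration of the first stage, a band-pass FIR filter can be built as the difference of two windowed sinc low-pass filters. The sketch below uses NumPy only and is not the repo's SincConv_fast: the function name sinc_bandpass, the 129-tap length, the Hamming window, and the 1-2 kHz cutoffs are all assumptions chosen for illustration. In a learnable front-end of this kind, the cutoff frequencies of each filter are the quantities that would be trained.

```python
import numpy as np

def sinc_bandpass(low_hz, high_hz, num_taps=129, sample_rate=16000):
    """Band-pass FIR kernel: difference of two windowed sinc low-pass filters."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(num_taps) - (num_taps - 1) / 2

    def lowpass(cutoff_hz):
        # Ideal low-pass impulse response; np.sinc(x) is sin(pi*x) / (pi*x).
        fc = cutoff_hz / sample_rate
        return 2 * fc * np.sinc(2 * fc * n)

    kernel = lowpass(high_hz) - lowpass(low_hz)
    # The Hamming window (an assumption here) reduces ripple from truncation.
    return (kernel * np.hamming(num_taps)).astype(np.float32)

# Hypothetical filter that passes roughly 1-2 kHz at a 16 kHz sample rate.
kernel = sinc_bandpass(1000.0, 2000.0)

# Magnitude response on a dense FFT grid, to check where the filter passes energy.
response = np.abs(np.fft.rfft(kernel, 4096))
freqs = np.fft.rfftfreq(4096, d=1 / 16000)
peak_hz = freqs[np.argmax(response)]
```

Applying a bank of such kernels to the waveform with a 1-D convolution yields one output channel per band, which is the time-frequency representation the later stages consume.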

How it was trained

Data: the ASVspoof 2019 Logical Access (LA) dataset. Following the protocol in the paper, the model is trained and validated on subsets representing a single attack type and then evaluated on the eval split, which contains more advanced and unseen spoofing attacks — testing the model's ability to generalize to harder attack scenarios.
Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). During training a random segment is cut from each utterance (so reported numbers can vary slightly between runs).
Optimization: Adam (lr = 1e-4), trained for up to 70 epochs; the checkpoint with the best eval EER is kept.
Best reported result (paper): EER = 1.49 %, min t-DCF = 0.0451.
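
The random training crop described under "Input length" can be sketched as follows. This is a minimal NumPy illustration, not the code from the training notebook: the name random_crop is hypothetical, and repeating short utterances before cropping is an assumption borrowed from the inference-time windowing described later in this card.

```python
import numpy as np

NUM_SAMPLES = 64_600  # fixed input length (~4.04 s at 16 kHz)

def random_crop(wave, num_samples=NUM_SAMPLES, rng=None):
    """Cut a random fixed-length segment from a 1-D waveform."""
    rng = rng if rng is not None else np.random.default_rng()
    if len(wave) < num_samples:
        # Assumption: short utterances are tile-repeated until long enough.
        repeats = -(-num_samples // len(wave))  # ceiling division
        wave = np.tile(wave, repeats)
    # The upper bound of rng.integers is exclusive, so the last valid start is included.
    start = rng.integers(0, len(wave) - num_samples + 1)
    return wave[start:start + num_samples]

# Stand-in for a 6.25 s utterance; a ramp makes the crop position easy to inspect.
utterance = np.arange(100_000, dtype=np.float32)
segment = random_crop(utterance, rng=np.random.default_rng(0))
```

Because the start offset changes on every call, two training runs see different segments of the same utterance, which is why reported numbers can vary slightly between runs.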

See the training notebook for the full training and evaluation code.

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file.

Dataset	Split	EER %	Trials	Notes
ASVspoof2019_LA	test	1.50	71,237	in-domain (eval split of the training dataset)
ASVspoof2021_DF	test	17.02	611,829	cross-dataset generalization
ASVspoof2021_LA	test	13.67	181,566	cross-dataset generalization
CD-ADD	test	56.10	20,786	out-of-domain (modern neural-TTS); does not generalize
InTheWild	test	52.52	31,779	out-of-domain (real-world deepfakes); does not generalize

The ASVspoof2019_LA result reproduces the paper's reported 1.49 % on the LA eval set. All other rows are datasets the model never saw in training (it was trained only on ASVspoof2019 LA), so higher EERs are expected. The ASVspoof2021 LA and DF rows measure generalization to newer, unseen attacks; on CD-ADD and InTheWild the EER is above 50 %, which is no better than random guessing.
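
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the rate of spoofed trials accepted equals the rate of bona fide trials rejected. The sketch below shows one simple way to estimate it from two score arrays. It is not the Arena's scoring code: the name compute_eer is hypothetical, the scores are synthetic, and ties between scores are not treated specially.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold); higher scores mean more bona fide."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores, kind="stable")
    scores, labels = scores[order], labels[order]
    # With the threshold just above the i-th sorted score, trials 0..i are rejected.
    frr = np.cumsum(labels) / labels.sum()                  # bona fide rejected
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # spoof accepted
    idx = np.argmin(np.abs(frr - far))                      # closest crossing point
    return (frr[idx] + far[idx]) / 2, scores[idx]

# Synthetic scores for illustration only, not model output.
rng = np.random.default_rng(0)
bonafide = rng.normal(2.0, 1.0, size=5000)
spoof = rng.normal(-2.0, 1.0, size=5000)
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {100 * eer:.2f} %")
```

An EER near 50 % means the two score distributions overlap almost completely, so no threshold separates bona fide from spoofed speech.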

Usage

The checkpoint is a state_dict for the TestModel network defined in _net.py (extracted verbatim from the source notebook). The input must be exactly 64,600 samples at 16 kHz mono — the classifier head is fixed-size — so window the waveform with pad_fixed (first 64,600 samples, tile-repeat if shorter).
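
The windowing rule in parentheses above can be sketched in a few lines of NumPy. This is an illustration of the described behaviour, not the repo's pad_fixed: the name pad_fixed_sketch is hypothetical, and the cast to float32 is an assumption.

```python
import numpy as np

NUM_SAMPLES = 64_600  # fixed input length expected by the classifier head

def pad_fixed_sketch(wave, num_samples=NUM_SAMPLES):
    """Keep the first num_samples samples; tile-repeat the waveform if it is shorter."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) >= num_samples:
        return wave[:num_samples]
    # Repeat the whole clip enough times to cover the window, then truncate.
    repeats = num_samples // len(wave) + 1
    return np.tile(wave, repeats)[:num_samples]

short = pad_fixed_sketch(np.arange(48_000))    # 3 s clip: tiled, then truncated
long = pad_fixed_sketch(np.arange(100_000))    # 6.25 s clip: truncated
```

Unlike a random crop, this window is deterministic, so the same file always produces the same score.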

Score one file from the command line:

pip install torch numpy soundfile scipy
python evaluate.py path/to/audio.wav
# -> bona-fide score: <float>  (higher = more bona fide)

Or from Python:

import numpy as np
from evaluate import load_model, score   # _net.py + evaluate.py are in this repo

model = load_model("best_1.495.pth", device="cpu")
audio = np.random.randn(48000).astype(np.float32)  # placeholder: 3 s of float32 mono audio at 16 kHz
print(score(model, audio))                          # higher = more bona fide

Internally score does _, logits = model(x) on the windowed input and returns logits[:, 1] (class 1 = bona fide). res2tcnguard.py is the same logic packaged as a speech_spoof_bench model — the exact code that produced the Arena scores.txt.

Citation

This model / paper:

@article{Borodin_Kudryavtsev_Mkrtchian_Gorodnichev_2024,
  place={Greece},
  title={Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry},
  volume={14},
  number={6},
  url={https://etasr.com/index.php/ETASR/article/view/8906},
  DOI={10.48084/etasr.8906},
  journal={Engineering, Technology & Applied Science Research},
  author={Borodin, Kirill and Kudryavtsev, Vasiliy and Mkrtchian, Grach and Gorodnichev, Mikhail},
  year={2024},
  month={Dec.},
  pages={18409--18414}
}

Training dataset — ASVspoof 2019:

@article{wang2020asvspoof,
  title={ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech},
  author={Wang, Xin and Yamagishi, Junichi and Todisco, Massimiliano and Delgado, H{\'e}ctor and Nautsch, Andreas and Evans, Nicholas and Sahidullah, Md and Vestman, Ville and Kinnunen, Tomi and Lee, Kong Aik and others},
  journal={Computer Speech \& Language},
  volume={64},
  pages={101114},
  year={2020},
  publisher={Elsevier}
}

License

MIT — see the source repository.

Maintainer

Maintained by Kirill Borodin (SpeechAntiSpoofingBenchmarks).

Email: kborodin.research@gmail.com
Telegram: @korallll_ai
