ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
Abstract
ArtifactNet is a lightweight neural framework that detects AI-generated music by analyzing codec-specific artifacts in audio signals. Codec-aware training and a compact architecture let it outperform CLAM and SpecTTTra with far fewer parameters.
We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics: extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms; the residuals are then decomposed via harmonic-percussive source separation (HPSS) into 7-channel forensic features and classified by a compact CNN (0.4M parameters; 4.0M total). We introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI from 22 generators and 1,800 real from 6 diverse sources). Each track is tagged with bench_origin for fair zero-shot evaluation. On the unseen test partition (n=2,263), ArtifactNet achieves F1 = 0.9829 with FPR = 1.49%, compared to CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%) evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Delta from 0.95 to 0.16), resolving the primary codec-invariance failure mode. These results establish forensic physics, the direct extraction of codec-level artifacts, as a more generalizable and parameter-efficient paradigm for AI music detection than representation learning, using 49x fewer parameters than CLAM and 4.8x fewer than SpecTTTra.
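The headline figures in the abstract rest on a few simple formulas. The sketch below shows how F1, FPR, and the relative drift reduction can be computed. It assumes that AI-generated audio is the positive class, so FPR is the share of real tracks flagged as AI; the abstract does not state this explicitly. The function names and the confusion counts are hypothetical and chosen for illustration only; they are not taken from the paper or its code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); AI-generated is assumed to be the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): the share of real tracks wrongly flagged as AI."""
    denom = fp + tn
    return fp / denom if denom else 0.0


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Figures quoted in the abstract.
drift_cut = relative_reduction(0.95, 0.16)  # cross-codec probability drift
total_params_m = 3.6 + 0.4                  # ArtifactUNet + CNN, in millions

# Hypothetical confusion counts, for illustration only (not from the paper).
tp, fp, fn, tn = 90, 2, 3, 105

print(f"F1  = {f1_score(tp, fp, fn):.4f}")
print(f"FPR = {false_positive_rate(fp, tn):.4f}")
print(f"drift reduction = {drift_cut:.1%}")
print(f"total parameters = {total_params_m:.1f}M")
```

With the abstract's drift values, the reduction works out to roughly 83%, matching the reported figure. The component sizes likewise sum to the stated 4.0M total.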
Community
ArtifactNet detects AI-generated music by extracting irreversible RVQ
codec artifacts via a bounded-mask UNet + HPSS forensic features —
outperforming 194M-param CLAM (F1=0.758) with only 4M parameters
(F1=0.983, FPR=1.5%) on a 22-generator zero-shot benchmark.