# 🎙️ FakeShield AI Audio Forensic Lab — Complete Technical Documentation
This document is the definitive reference for the FakeShield AI Audio Lab. It explains the full audio forensic workflow, model architecture, signal detectors, output invariants, API behavior, and frontend integration.
1. Purpose
The FakeShield AI Audio Lab is designed to analyze uploaded speech and audio files to detect:
- AI-generated voice cloning or TTS
- voice conversion and speaker replacement
- spliced or edited audio segments
- synthetic artifacts from modern neural vocoders
It combines deep learning classifiers, signal processing heuristics, and robustness stress-testing to produce both a verdict and an explanation-friendly forensic report.
2. High-Level Architecture
Components
- Frontend UI: React audio lab page, upload controls, progress indicators, and forensic report renderer.
- API layer: FastAPI router for asynchronous analysis jobs.
- Backend analysis pipeline: audio loading, preprocessing, feature extraction, model scoring, fusion, timeline generation, and explanation.
- MongoDB persistence: audit logging of completed scans.
Key file locations
- Backend audio models: `backend/app/models/audio/`
- API router: `backend/app/routers/audio_router.py`
- React frontend page: `fakeshield/src/pages/AudioLab/AudioLabPage.tsx`
- Forensic report display: `fakeshield/src/pages/AudioLab/AudioReportPanel.tsx`
- Upload/poll service: `fakeshield/src/services/audioService.ts`
- Result schema: `fakeshield/src/services/api/audioAnalysis.ts`
3. End-to-End Workflow
3.1 Upload and Validation
The React Audio Lab UI accepts common audio formats:
- WAV, MP3, FLAC, OGG, M4A, MP4
Client-side validation lives in fakeshield/src/services/audioService.ts:
- file type by extension
- maximum size 50 MB
- non-empty files
The selected file is uploaded to POST /audio/analyze/async.
3.2 Job Creation and Background Processing
backend/app/routers/audio_router.py handles upload submission:
- validates MIME type and extension
- enforces a 50 MB limit
- creates a `job_id`
- runs `analyze_audio` in a background task using `asyncio.to_thread`
A separate status endpoint, GET /audio/status/{job_id}, allows the frontend to poll until the analysis is complete.
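The submit-then-poll flow above can be sketched with the standard library alone. This is a minimal illustration, not the router's actual code: the in-memory `JOBS` store, the status strings, and the placeholder `analyze_audio` body are assumptions, and the MIME-type check and FastAPI wiring are left out.

```python
import asyncio
import uuid
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB limit stated above
ALLOWED_EXT = {".wav", ".mp3", ".flac", ".ogg", ".m4a", ".mp4"}

JOBS: dict = {}     # hypothetical in-memory job store
_TASKS: set = set() # keeps references so running tasks are not garbage collected


def validate_upload(filename: str, data: bytes) -> None:
    """Raise ValueError if the upload fails the documented checks."""
    if Path(filename).suffix.lower() not in ALLOWED_EXT:
        raise ValueError("unsupported file type")
    if len(data) == 0:
        raise ValueError("empty file")
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds 50 MB")


def analyze_audio(data: bytes) -> dict:
    """Placeholder for the blocking analysis pipeline (assumption)."""
    return {"verdict": "Authentic", "bytes": len(data)}


async def run_job(job_id: str, data: bytes) -> None:
    # Run the blocking analysis in a worker thread, off the event loop.
    try:
        result = await asyncio.to_thread(analyze_audio, data)
        JOBS[job_id] = {"status": "done", "result": result}
    except Exception as exc:
        JOBS[job_id] = {"status": "error", "detail": str(exc)}


async def submit(filename: str, data: bytes) -> str:
    """Validate, register a job, and start it in the background."""
    validate_upload(filename, data)
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "processing"}
    task = asyncio.create_task(run_job(job_id, data))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return job_id


def get_status(job_id: str) -> dict:
    """What a status endpoint could return for a given job."""
    return JOBS.get(job_id, {"status": "not_found"})
```

A caller would `await submit(...)` once and then call `get_status(job_id)` repeatedly until the status leaves `"processing"`.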
3.3 Analysis Pipeline
The analysis occurs in backend/app/models/audio/audio_detector.py.
Steps:
- Load audio and normalize.
- Apply voice activity detection (VAD).
- Build 5-second audio chunks.
- Run six signal detectors in parallel.
- Compute telephony stability.
- Fuse signal outputs into a final verdict.
- Build a segment timeline.
- Generate explainability and recommended actions.
- Persist scan summary to MongoDB.
4. Data Preprocessing (backend/app/models/audio/audio_loader.py)
Responsibilities
- decode uploaded bytes into waveform data
- convert stereo to mono
- resample to 16 kHz
- normalize peak amplitude
- remove silent segments with VAD
- split audio into fixed 5-second chunks
Algorithm details
- Primary decoder: `soundfile` for lossless formats
- Fallback decoder: `librosa.load` for lossy containers via FFmpeg
- Resampling target: `TARGET_SR = 16000`
- Peak normalization: maximum amplitude set to `0.95`
- VAD threshold: `15%` of mean RMS energy
- Chunk length: `CHUNK_SEC = 5.0`
- Minimum chunk duration: `0.5` seconds
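The normalization, VAD, and chunking steps can be illustrated in plain Python. This is a sketch under stated assumptions: the 400-sample frame length and the helper names are hypothetical, decoding and resampling are omitted, and the real loader most likely works on NumPy arrays rather than lists.

```python
import math

TARGET_SR = 16000     # resampling target from the text
CHUNK_SEC = 5.0       # chunk length from the text
MIN_CHUNK_SEC = 0.5   # minimum chunk duration from the text
PEAK = 0.95           # peak normalization level from the text
VAD_RATIO = 0.15      # VAD threshold: 15% of mean RMS energy


def normalize_peak(x, peak=PEAK):
    """Scale samples so the largest absolute value equals `peak`."""
    m = max((abs(v) for v in x), default=0.0)
    return list(x) if m == 0 else [v * peak / m for v in x]


def drop_silence(x, frame_len=400):
    """Keep frames whose RMS exceeds 15% of the mean frame RMS.

    frame_len=400 (25 ms at 16 kHz) is an assumption, not from the source.
    """
    frames = [x[i:i + frame_len] for i in range(0, len(x), frame_len)]
    if not frames:
        return []
    rms = [math.sqrt(sum(v * v for v in f) / len(f)) for f in frames]
    threshold = VAD_RATIO * (sum(rms) / len(rms))
    return [v for f, r in zip(frames, rms) if r > threshold for v in f]


def split_chunks(x, sr=TARGET_SR):
    """Cut into fixed 5-second chunks, dropping a tail shorter than 0.5 s."""
    size = int(CHUNK_SEC * sr)
    min_len = int(MIN_CHUNK_SEC * sr)
    chunks = [x[i:i + size] for i in range(0, len(x), size)]
    return [c for c in chunks if len(c) >= min_len]
```

For example, 12 seconds of speech at 16 kHz would yield two full chunks plus one 2-second chunk, while a 10.2-second clip would yield only two because the 0.2-second tail falls below the minimum.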
5. Signal Detection Modules
FakeShield uses six distinct signals to cover complementary forensic indicators.
5.1 WavLM SSL Deepfake Detection (backend/app/models/audio/signal_wavlm.py)
- Primary model: `abhishtagatya/wavlm-base-960h-itw-deepfake`
- Purpose: detect "in-the-wild" neural speech synthesis
- Input: multiple 5-second chunks plus one telephony-resampled chunk
- Batching: chunks are processed in a single batched forward pass for efficiency
- Output:
- mean AI probability
- per-chunk scores
- model statistics (`max`, `var`)
5.2 AST / Wav2Vec2 Spoof Classification (backend/app/models/audio/signal_wav2vec.py)
- Primary model: `MattyB95/AST-ASVspoof5-Synthetic-Voice-Detection`
- Fallback model: `abhishtagatya/wav2vec2-base-960h-asv19-deepfake`
- Purpose: spectrogram-based deepfake and spoof detection
- Method: audio is converted into Mel spectrogram features and processed by a transformer
- Normalization: outputs are amplified with a power curve to increase contrast in high-confidence ranges
5.3 Speaker Consistency Auditor (backend/app/models/audio/signal_speaker.py)
- Purpose: detect identity drift or unnatural stability across chunks
- Technique:
- tries `pyannote/embedding`
- falls back to `facebook/wav2vec2-base` hidden states
- final fallback uses MFCC mean/std embeddings
- Features:
- mean cosine similarity across consecutive chunks
- standard deviation of similarity values
- count of identity jumps
- Suspicion logic:
- low mean similarity → voice conversion or splicing
- low similarity variance → overly constant cloned voice
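The three speaker features can be computed from per-chunk embeddings as sketched below. The embeddings are taken as given, and the `jump_threshold` value is an illustrative assumption rather than the auditor's actual setting.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def speaker_consistency(embeddings, jump_threshold=0.5):
    """Summarize similarity between consecutive chunk embeddings.

    jump_threshold is a hypothetical cut-off for counting identity jumps.
    """
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    if not sims:
        # A single chunk gives no pairs to compare.
        return {"mean_sim": 1.0, "std_sim": 0.0, "jumps": 0}
    return {
        "mean_sim": mean(sims),
        "std_sim": pstdev(sims),
        "jumps": sum(s < jump_threshold for s in sims),
    }
```

A low `mean_sim` or a non-zero `jumps` count would point toward conversion or splicing, while a `std_sim` very close to zero across many chunks would match the "overly constant" case.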
5.4 Prosody and Rhythm Auditor (backend/app/models/audio/signal_prosody.py)
- Purpose: identify unnatural pitch, timing, and pause patterns
- Main features extracted:
- pitch standard deviation in semitones
- pitch range and kurtosis
- IOI coefficient of variation
- pause duration CV
- speaking rate variance
- Suspect conditions:
- monotone pitch (`f0_std` very low)
- metronomic rhythm (`ioi_cv` low)
- overly regular pauses
- constant speaking rate
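Two of these features, `f0_std` in semitones and `ioi_cv`, reduce to short formulas once a pitch track and onset times are available. The sketch below assumes both are already extracted; the function names and the 100 Hz reference pitch are illustrative choices, not taken from `signal_prosody.py`.

```python
import math
from statistics import mean, pstdev


def f0_std_semitones(f0_hz, ref_hz=100.0):
    """Standard deviation of voiced pitch values, expressed in semitones.

    Unvoiced frames are expected to be encoded as 0 and are skipped.
    The reference pitch only shifts the values, so it does not affect the result.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    semitones = [12.0 * math.log2(f / ref_hz) for f in voiced]
    return pstdev(semitones)


def ioi_cv(onsets_sec):
    """Coefficient of variation of inter-onset intervals (IOI)."""
    iois = [b - a for a, b in zip(onsets_sec, onsets_sec[1:])]
    if len(iois) < 2:
        return 0.0
    m = mean(iois)
    return pstdev(iois) / m if m > 0 else 0.0
```

Perfectly even onsets such as `[0.0, 0.5, 1.0, 1.5]` give an `ioi_cv` of zero, which is the metronomic pattern the auditor treats as suspect.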
5.5 Spectral Artifact Auditor (backend/app/models/audio/signal_spectral.py)
- Purpose: detect vocoder and resampling artifacts in the frequency domain
- Key spectral indicators:
- MFCC delta variance
- spectral flatness variance
- high-frequency energy ratio
- mel-spectrogram frame-to-frame difference
- spectral centroid variance
- Global + chunk scoring: combines a full-waveform global score with per-chunk scores
5.6 Codec and Artifact Auditor (backend/app/models/audio/signal_codec.py)
- Purpose: catch low-level production artifacts not visible to classifiers
- Examined artifacts:
- noise floor and SNR proxy
- DC offset
- spectral ripple from resampling
- ENF power at 50/60 Hz
- near-clipping ratio
- low-pass residual variance (dithering/noise signature)
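Three of these checks are simple enough to show directly. The sketch below is an illustration only: the 0.99 near-clipping level is an assumption, and the Goertzel algorithm is swapped in here as one standard way to measure power at a single frequency such as 50 or 60 Hz; the source does not say how `signal_codec.py` measures ENF power.

```python
import math


def dc_offset(x):
    """Mean sample value; a clean recording should sit near zero."""
    return sum(x) / len(x) if x else 0.0


def near_clipping_ratio(x, level=0.99):
    """Fraction of samples at or above `level` in absolute value (level assumed)."""
    return sum(abs(v) >= level for v in x) / len(x) if x else 0.0


def goertzel_power(x, sr, freq):
    """Signal power at one frequency via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sr)
    s1 = s2 = 0.0
    for v in x:
        s0 = v + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def enf_power(x, sr):
    """Power at the two mains frequencies mentioned in the text."""
    return {50: goertzel_power(x, sr, 50.0), 60: goertzel_power(x, sr, 60.0)}
```

A recording made near mains-powered equipment would typically show a small peak at one of the two frequencies, whereas fully synthetic audio often shows none.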
6. Robustness Stress Test
Telephony stability
- The pipeline resamples the first audio chunk to `8 kHz` and back to `16 kHz`.
- It compares the original WavLM score to the telephony-resampled score.
- The resulting stability metric penalizes large score deltas.
- This catches synthetic audio that changes behavior under real-world channel distortion.
Score formula
`stability = 1.0 - min(1.0, abs(score_orig - score_telephony) / 0.40)`
- `>= 0.70`: stable
- `< 0.60`: suspicious
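The formula translates directly into code. In the sketch below, the function names are illustrative, and the `"indeterminate"` label for the 0.60 to 0.70 band is an assumption, since the text names only the two outer bands.

```python
def telephony_stability(score_orig: float, score_telephony: float) -> float:
    """Stability in [0, 1]; a score delta of 0.40 or more maps to 0.0."""
    return 1.0 - min(1.0, abs(score_orig - score_telephony) / 0.40)


def stability_label(stability: float) -> str:
    if stability >= 0.70:
        return "stable"
    if stability < 0.60:
        return "suspicious"
    return "indeterminate"  # band not named in the text (assumption)
```

For example, WavLM scores of 0.80 before and 0.70 after telephony resampling give a stability of about 0.75 (stable), while 0.90 and 0.50 give 0.0 (suspicious).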
7. Fusion Logic and Verdict Mapping (backend/app/models/audio/audio_fusion.py)
Decision hierarchy
- Strong agreement between WavLM and AST
- Strong WavLM plus supporting prosody or speaker evidence
- Speaker/prosody combined inconsistency
- Telephony instability guard
- Adaptive weighted blending
- Authenticity protection if all signals are weak
Weighted blend
- WavLM: 40%
- AST/Wav2Vec: 20%
- Prosody: 15%
- Speaker: 15%
- Spectral: 5%
- Codec: 5%
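As a rough sketch of the blend, the weights above can be applied as a weighted average. How the "adaptive" part works is not described in this document, so the renormalization over available signals shown here is an assumption, as are the dictionary keys.

```python
# Weights from the list above; the key names are illustrative.
WEIGHTS = {
    "wavlm": 0.40,
    "ast": 0.20,
    "prosody": 0.15,
    "speaker": 0.15,
    "spectral": 0.05,
    "codec": 0.05,
}


def weighted_blend(scores: dict) -> float:
    """Weighted average of signal scores in [0, 1].

    Signals that are missing or None are skipped and the remaining weights
    are renormalized (an assumed reading of "adaptive").
    """
    used = {k: w for k, w in WEIGHTS.items() if scores.get(k) is not None}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(scores[k] * w for k, w in used.items()) / total
```

With all six signals present the result is the plain weighted sum; if, say, the speaker signal failed to load, the other five weights would be rescaled to sum to one.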
Final classification
- AI-Generated: fused score >= 0.80
- AI-Generated: fused score >= 0.55
- Suspicious: fused score >= 0.48
- Authentic: fused score < 0.48
Threat mapping
- AI-Generated → CRITICAL
- Suspicious → MEDIUM
- Authentic → SAFE
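The thresholds and threat levels above amount to a small lookup. In this sketch the function names are illustrative; because both the 0.80 and 0.55 tiers carry the same label in this document, a single comparison at 0.55 covers them.

```python
THREAT = {
    "AI-Generated": "CRITICAL",
    "Suspicious": "MEDIUM",
    "Authentic": "SAFE",
}


def classify(fused: float) -> str:
    """Map a fused score in [0, 1] to the documented verdict labels."""
    if fused >= 0.55:  # also covers the >= 0.80 tier, which has the same label
        return "AI-Generated"
    if fused >= 0.48:
        return "Suspicious"
    return "Authentic"


def verdict(fused: float) -> dict:
    label = classify(fused)
    return {"verdict": label, "threat_level": THREAT[label]}
```

For example, `verdict(0.50)` returns `{"verdict": "Suspicious", "threat_level": "MEDIUM"}`.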
8. Explainability Output (backend/app/models/audio/audio_explanation.py)
The project generates a structured explanation object that contains:
- `forensic_summary`
- `recommended_action`
- `confidence`
- `primary_reasons`
- `supporting_reasons`
- `exonerating_factors`
- `stability_report`
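One way to picture this object is as a dataclass. The field names below come from the list above, but the types, defaults, and class name are assumptions, since the document does not specify them.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ForensicExplanation:
    # Field names follow the document; all types here are assumed.
    forensic_summary: str
    recommended_action: str
    confidence: float
    primary_reasons: list = field(default_factory=list)
    supporting_reasons: list = field(default_factory=list)
    exonerating_factors: list = field(default_factory=list)
    stability_report: str = ""

    def to_json(self) -> str:
        """Serialize for the API response or the JSON report download."""
        return json.dumps(asdict(self))
```

An instance could then be built with a summary, an action, and a confidence value, with reasons appended as each signal is interpreted.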
Explanation heuristics
- High WavLM scores become critical evidence.
- AST/Wav2Vec support is added when results are strong.
- Prosody evidence is translated into human-readable descriptions like "robotic pitch monotony" or "metronomic speech rhythm".
- Speaker analysis distinguishes between "identity drift" and "unnatural constancy".
- Stability results are included as an exonerating or cautionary note.
9. Timeline Construction (backend/app/models/audio/audio_segmentation.py)
A timeline is built from chunk-level scores for visualization:
- Each 5-second interval receives a risk score.
- The score is computed from WavLM/Wav2Vec, spectral, prosody, and speaker signals.
- Timeline segments include start/end timestamps and signal breakdown for UI heatmaps.
Segment output fields
- `segment`
- `start_sec`
- `end_sec`
- `ai_score`
- `level`
- `signals`
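A timeline with these fields could be assembled as sketched below. The way per-chunk signals are combined is not specified in this document, so the plain mean used here is an assumption, as are the `level` names and their cut-offs.

```python
def risk_level(score: float) -> str:
    """Hypothetical level names; cut-offs borrowed from the verdict thresholds."""
    if score >= 0.55:
        return "high"
    if score >= 0.48:
        return "medium"
    return "low"


def build_timeline(chunk_signals, chunk_sec=5.0, total_sec=None):
    """Build one segment per chunk from a list of {signal_name: score} dicts.

    The segment score is a plain mean of that chunk's signals (assumption).
    """
    segments = []
    for i, signals in enumerate(chunk_signals):
        ai_score = sum(signals.values()) / len(signals) if signals else 0.0
        start = i * chunk_sec
        end = start + chunk_sec
        if total_sec is not None:
            end = min(end, total_sec)  # the last chunk may be shorter
        segments.append({
            "segment": i,
            "start_sec": start,
            "end_sec": end,
            "ai_score": round(ai_score, 3),
            "level": risk_level(ai_score),
            "signals": dict(signals),
        })
    return segments
```

The `signals` dictionary is kept on each segment so a heatmap can show which detector drove the score for that interval.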
10. Frontend Architecture
Audio Lab page
fakeshield/src/pages/AudioLab/AudioLabPage.tsx manages:
- file selection and drag-and-drop
- progress and upload states
- analysis start/reset actions
- gauge and status rendering
- result presentation
Report panel
fakeshield/src/pages/AudioLab/AudioReportPanel.tsx renders:
- case summary and severity
- fusion rule badge
- action recommendation
- metadata cards
- primary and supporting evidence blocks
- exonerating factors list
- JSON report download
Service layer
fakeshield/src/services/audioService.ts implements:
- client-side file validation
- upload to `/audio/analyze/async`
- polling `/audio/status/{job_id}` every 2.5 seconds
- timeout after 180 seconds
- progress messaging for the UI
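The polling behavior is implemented in TypeScript in the service itself. The sketch below mirrors the same interval and timeout in Python for illustration; `fetch_status` is a hypothetical callable standing in for the HTTP request, and the `"done"` and `"error"` status strings are assumptions.

```python
import time


def poll_until_done(fetch_status, job_id, interval=2.5, timeout=180.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a job's status until it finishes or the timeout elapses.

    fetch_status(job_id) must return a dict with a "status" key.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in ("done", "error"):
            return status
        if clock() + interval > deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} s")
        sleep(interval)
```

With the documented values this allows roughly 72 status requests before the client gives up.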
Result schema
fakeshield/src/services/api/audioAnalysis.ts defines the TypeScript shape for results, including:
- signal scores
- timeline segments
- detailed signal metadata
- forensic reasons
- audio metadata
11. Deployment & Operational Notes
API health
GET /audio/health returns available signal names and dependency requirements.
Model loading
- WavLM and AST models are loaded lazily when first needed.
- The system uses `MODEL_LOAD_LOCK` to avoid concurrent model initialization races.
- Models are loaded on CPU and forced into evaluation mode.
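Lazy loading behind a lock commonly follows a double-checked pattern like the one below. Only the name `MODEL_LOAD_LOCK` comes from this document; the cache dictionary, `get_model`, and the `loader` callable are hypothetical, and the CPU placement and evaluation-mode calls are only indicated in a comment.

```python
import threading

MODEL_LOAD_LOCK = threading.Lock()
_MODELS: dict = {}  # hypothetical cache of loaded models, keyed by name


def get_model(name, loader):
    """Return a cached model, loading it at most once across threads.

    loader is a zero-argument callable that builds the model. In the real
    pipeline it would also place the model on CPU and switch it to
    evaluation mode before returning it.
    """
    model = _MODELS.get(name)
    if model is None:
        with MODEL_LOAD_LOCK:
            # Re-check inside the lock: another thread may have loaded it
            # while this one was waiting.
            model = _MODELS.get(name)
            if model is None:
                model = loader()
                _MODELS[name] = model
    return model
```

The first check keeps the common path lock-free once a model is cached, and the second check is what prevents two threads from both running the expensive load.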
Performance
- PyTorch is restricted to a single thread during analysis: `torch.set_num_threads(1)`.
- Parallel signal extraction is limited to 6 workers.
- Audio chunking is capped at the first 3 chunks (15 seconds) for speed.
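The worker and chunk limits can be pictured with a thread pool, as in the sketch below. The `run_signals` function, the shape of `detectors`, and the error handling are assumptions; only the limits of 6 workers and 3 chunks come from the text.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 6  # parallel signal extraction limit from the text
MAX_CHUNKS = 3   # first 3 chunks (15 seconds) from the text


def run_signals(detectors, chunks):
    """Run each detector on the capped chunk list in parallel.

    detectors maps a signal name to a callable taking the chunk list.
    A failing detector yields an error entry instead of aborting the scan.
    """
    capped = chunks[:MAX_CHUNKS]
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {name: pool.submit(fn, capped) for name, fn in detectors.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = {"error": str(exc)}
    return results
```

Isolating failures per signal fits the fallback-heavy design described earlier, where an unavailable model should degrade one signal rather than the whole report.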
12. Summary
The FakeShield AI Audio Lab is a mature forensic module that blends:
- deepfake speech classifiers
- spectral and codec artifact heuristics
- speaker identity drift analysis
- prosody and timing analysis
- telephony robustness evaluation
- hierarchical fusion logic
- natural language explainability
This document captures the full lab workflow across backend, API, and frontend implementation.