	# 🎙️ FakeShield AI Audio Forensic Lab — Complete Technical Documentation

	This document is the definitive reference for the FakeShield AI Audio Lab. It explains the full audio forensic workflow, model architecture, signal detectors, output invariants, API behavior, and frontend integration.

	---

	## 1. Purpose

	The FakeShield AI Audio Lab is designed to analyze uploaded speech and audio files to detect:
	* AI-generated voice cloning or TTS
	* voice conversion and speaker replacement
	* spliced or edited audio segments
	* synthetic artifacts from modern neural vocoders

	It combines deep learning classifiers, signal processing heuristics, and robustness stress-testing to produce both a verdict and an explanation-friendly forensic report.

	---

	## 2. High-Level Architecture

	### Components

	* Frontend UI: React audio lab page, upload controls, progress indicators, and forensic report renderer.
	* API layer: FastAPI router for asynchronous analysis jobs.
	* Backend analysis pipeline: audio loading, preprocessing, feature extraction, model scoring, fusion, timeline generation, and explanation.
	* MongoDB persistence: audit logging of completed scans.

	### Key file locations

	* Backend audio models: `backend/app/models/audio/`
	* API router: `backend/app/routers/audio_router.py`
	* React frontend page: `fakeshield/src/pages/AudioLab/AudioLabPage.tsx`
	* Forensic report display: `fakeshield/src/pages/AudioLab/AudioReportPanel.tsx`
	* Upload/poll service: `fakeshield/src/services/audioService.ts`
	* Result schema: `fakeshield/src/services/api/audioAnalysis.ts`

	---

	## 3. End-to-End Workflow

	### 3.1 Upload and Validation

	The React Audio Lab UI accepts common audio formats:
	* WAV, MP3, FLAC, OGG, M4A, MP4

	Client-side validation lives in `fakeshield/src/services/audioService.ts`:
	* file type by extension
	* maximum size 50 MB
	* non-empty files

	The selected file is uploaded to `POST /audio/analyze/async`.

	### 3.2 Job Creation and Background Processing

	`backend/app/routers/audio_router.py` handles upload submission:
	* validates MIME type and extension
	* enforces a 50 MB limit
	* creates a `job_id`
	* runs `analyze_audio` in a background task using `asyncio.to_thread`

	A separate status endpoint, `GET /audio/status/{job_id}`, allows the frontend to poll until the analysis is complete.
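
The submit-then-poll flow above can be sketched with the standard library alone. This is a minimal illustration, not the router's actual code: the in-memory `JOBS` store, the status strings (`processing`, `done`, `error`, `not_found`), and the stand-in `analyze_audio` body are all assumptions, and the MIME-type check is omitted.

```python
import asyncio
import os
import uuid

MAX_BYTES = 50 * 1024 * 1024  # 50 MB limit from the section above
ALLOWED_EXT = {".wav", ".mp3", ".flac", ".ogg", ".m4a", ".mp4"}

JOBS: dict = {}    # hypothetical in-memory job store: job_id -> status record
_TASKS: set = set()  # keeps references so running tasks are not garbage-collected


def analyze_audio(data: bytes) -> dict:
    """Stand-in for the real blocking analysis function."""
    return {"verdict": "Authentic", "bytes": len(data)}


def validate_upload(filename: str, data: bytes) -> None:
    """Reject unsupported extensions, empty files, and oversized files."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXT:
        raise ValueError(f"unsupported extension: {ext or '(none)'}")
    if not data:
        raise ValueError("empty file")
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds 50 MB")


async def _run_job(job_id: str, data: bytes) -> None:
    """Run the blocking analysis in a worker thread and record the outcome."""
    try:
        result = await asyncio.to_thread(analyze_audio, data)
        JOBS[job_id] = {"status": "done", "result": result}
    except Exception as exc:  # surface any failure to the polling client
        JOBS[job_id] = {"status": "error", "error": str(exc)}


async def submit(filename: str, data: bytes) -> str:
    """Validate the upload, register a job, and start it in the background."""
    validate_upload(filename, data)
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "processing"}
    task = asyncio.create_task(_run_job(job_id, data))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return job_id


def get_status(job_id: str) -> dict:
    """Return the current record for a job, as a status endpoint might."""
    return JOBS.get(job_id, {"status": "not_found"})
```

Running the analysis through `asyncio.to_thread` keeps the event loop free to answer status requests while the CPU-bound work proceeds.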

	### 3.3 Analysis Pipeline

	The analysis occurs in `backend/app/models/audio/audio_detector.py`.

	Steps:
	1. Load audio and normalize.
	2. Apply voice activity detection (VAD).
	3. Build 5-second audio chunks.
	4. Run six signal detectors in parallel.
	5. Compute telephony stability.
	6. Fuse signal outputs into a final verdict.
	7. Build a segment timeline.
	8. Generate explainability and recommended actions.
	9. Persist scan summary to MongoDB.
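
Step 4 can be illustrated with a thread pool. The sketch below assumes each detector is a callable that takes the chunk list and returns one score; the `run_detectors` name and the detector signature are hypothetical, and the real module may organize this differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Chunk = List[float]
Detector = Callable[[List[Chunk]], float]


def run_detectors(
    chunks: List[Chunk],
    detectors: Dict[str, Detector],
    max_workers: int = 6,
) -> Dict[str, float]:
    """Run every detector on the same chunks concurrently and collect scores."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, chunks) for name, fn in detectors.items()}
        # result() blocks until each detector finishes and re-raises its errors
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    # Placeholder detectors that return constants, for illustration only
    demo = {"wavlm": lambda c: 0.9, "codec": lambda c: 0.2}
    print(run_detectors([[0.0] * 10], demo))
```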

	---

	## 4. Data Preprocessing (`backend/app/models/audio/audio_loader.py`)

	### Responsibilities

	* decode uploaded bytes into waveform data
	* convert stereo to mono
	* resample to 16 kHz
	* normalize peak amplitude
	* remove silent segments with VAD
	* split audio into fixed 5-second chunks

	### Algorithm details

	* Primary decoder: `soundfile` for lossless formats
	* Fallback decoder: `librosa.load` for lossy containers via FFmpeg
	* Resampling target: `TARGET_SR = 16000`
	* Peak normalization: maximum amplitude set to `0.95`
	* VAD threshold: `15%` of mean RMS energy
	* Chunk length: `CHUNK_SEC = 5.0`
	* Minimum chunk duration: `0.5` seconds
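
The normalization, VAD, and chunking parameters above can be sketched in plain Python. This is an illustration under stated assumptions: the 400-sample frame length is invented for the example, "15% of mean RMS energy" is read as a per-frame RMS threshold relative to the mean frame RMS, and trailing chunks shorter than the minimum are dropped. A real loader would operate on NumPy arrays rather than lists.

```python
import math
from typing import List

TARGET_SR = 16000
CHUNK_SEC = 5.0
MIN_CHUNK_SEC = 0.5
PEAK = 0.95
VAD_RATIO = 0.15


def peak_normalize(x: List[float], peak: float = PEAK) -> List[float]:
    """Scale the waveform so its largest absolute sample equals `peak`."""
    max_abs = max((abs(v) for v in x), default=0.0)
    if max_abs == 0.0:
        return list(x)  # all-silent input: nothing to scale
    return [v * peak / max_abs for v in x]


def remove_silence(x: List[float], frame_len: int = 400) -> List[float]:
    """Keep frames whose RMS is at least VAD_RATIO times the mean frame RMS."""
    frames = [x[i:i + frame_len] for i in range(0, len(x), frame_len)]
    if not frames:
        return []
    rms = [math.sqrt(sum(v * v for v in f) / len(f)) for f in frames]
    threshold = VAD_RATIO * (sum(rms) / len(rms))
    voiced: List[float] = []
    for frame, energy in zip(frames, rms):
        if energy >= threshold:
            voiced.extend(frame)
    return voiced


def split_chunks(x: List[float], sr: int = TARGET_SR) -> List[List[float]]:
    """Cut fixed-length chunks and drop any chunk shorter than the minimum."""
    size = int(CHUNK_SEC * sr)
    min_len = int(MIN_CHUNK_SEC * sr)
    chunks = [x[i:i + size] for i in range(0, len(x), size)]
    return [c for c in chunks if len(c) >= min_len]
```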

	---

	## 5. Signal Detection Modules

	FakeShield uses six distinct signals to cover complementary forensic indicators.

	### 5.1 WavLM SSL Deepfake Detection (`backend/app/models/audio/signal_wavlm.py`)

	* Primary model: `abhishtagatya/wavlm-base-960h-itw-deepfake`
	* Purpose: detect "in-the-wild" neural speech synthesis
	* Input: multiple 5-second chunks plus one telephony-resampled chunk
	* Batching: chunks are processed in a single batched forward pass for efficiency
	* Output:
	  * mean AI probability
	  * per-chunk scores
	  * model statistics (`max`, `var`)

	### 5.2 AST / Wav2Vec2 Spoof Classification (`backend/app/models/audio/signal_wav2vec.py`)

	* Primary model: `MattyB95/AST-ASVspoof5-Synthetic-Voice-Detection`
	* Fallback model: `abhishtagatya/wav2vec2-base-960h-asv19-deepfake`
	* Purpose: spectrogram-based deepfake and spoof detection
	* Method: audio is converted into Mel spectrogram features and processed by a transformer
	* Normalization: outputs are amplified with a power curve to increase contrast in high-confidence ranges

	### 5.3 Speaker Consistency Auditor (`backend/app/models/audio/signal_speaker.py`)

	* Purpose: detect identity drift or unnatural stability across chunks
	* Technique:
	  * tries `pyannote/embedding`
	  * falls back to `facebook/wav2vec2-base` hidden states
	  * final fallback uses MFCC mean/std embeddings
	* Features:
	  * mean cosine similarity across consecutive chunks
	  * standard deviation of similarity values
	  * count of identity jumps
	* Suspicion logic:
	  * low mean similarity → voice conversion or splicing
	  * low similarity variance → overly constant cloned voice
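
The three features listed above can be computed from per-chunk embeddings as follows. This is a sketch only: the `jump_threshold` value of `0.5` and the returned key names are assumptions, not values taken from `signal_speaker.py`.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speaker_consistency(
    embeddings: List[Sequence[float]],
    jump_threshold: float = 0.5,  # assumed cut-off for an "identity jump"
) -> Dict[str, float]:
    """Summarize similarity between each pair of consecutive chunk embeddings."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    if not sims:  # fewer than two chunks: nothing to compare
        return {"mean_sim": 1.0, "std_sim": 0.0, "jumps": 0}
    return {
        "mean_sim": mean(sims),
        "std_sim": pstdev(sims),
        "jumps": sum(1 for s in sims if s < jump_threshold),
    }
```

A low `mean_sim` or a non-zero `jumps` count points toward splicing or voice conversion, while a very small `std_sim` alongside a high mean matches the "overly constant" pattern described above.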

	### 5.4 Prosody and Rhythm Auditor (`backend/app/models/audio/signal_prosody.py`)

	* Purpose: identify unnatural pitch, timing, and pause patterns
	* Main features extracted:
	  * pitch standard deviation in semitones
	  * pitch range and kurtosis
	  * IOI coefficient of variation
	  * pause duration CV
	  * speaking rate variance
	* Suspect conditions:
	  * monotone pitch (`f0_std` very low)
	  * metronomic rhythm (`ioi_cv` low)
	  * overly regular pauses
	  * constant speaking rate
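
Two of these features, pitch standard deviation in semitones and the inter-onset-interval (IOI) coefficient of variation, can be sketched as below. The function names are hypothetical, unvoiced frames are assumed to be marked with a non-positive F0, and both functions return `None` when there is too little data to avoid reporting a misleading zero.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional


def f0_std_semitones(f0_hz: List[float]) -> Optional[float]:
    """Standard deviation of voiced pitch values, measured in semitones."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: <= 0 marks unvoiced frames
    if len(voiced) < 2:
        return None
    ref = mean(voiced)  # any positive reference gives the same deviation
    semitones = [12.0 * math.log2(f / ref) for f in voiced]
    return pstdev(semitones)


def ioi_cv(onset_times_sec: List[float]) -> Optional[float]:
    """Coefficient of variation (std / mean) of gaps between successive onsets."""
    iois = [b - a for a, b in zip(onset_times_sec, onset_times_sec[1:])]
    if len(iois) < 2:
        return None
    avg = mean(iois)
    if avg <= 0:
        return None
    return pstdev(iois) / avg
```

Perfectly even onsets give an `ioi_cv` of zero, which corresponds to the "metronomic rhythm" condition; natural speech typically shows noticeably more variation.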

	### 5.5 Spectral Artifact Auditor (`backend/app/models/audio/signal_spectral.py`)

	* Purpose: detect vocoder and resampling artifacts in the frequency domain
	* Key spectral indicators:
	  * MFCC delta variance
	  * spectral flatness variance
	  * high-frequency energy ratio
	  * mel-spectrogram frame-to-frame difference
	  * spectral centroid variance
	* Global + chunk scoring: combines a full-waveform global score with per-chunk scores

	### 5.6 Codec and Artifact Auditor (`backend/app/models/audio/signal_codec.py`)

	* Purpose: catch low-level production artifacts not visible to classifiers
	* Examined artifacts:
	  * noise floor and SNR proxy
	  * DC offset
	  * spectral ripple from resampling
	  * ENF power at 50/60 Hz
	  * near-clipping ratio
	  * low-pass residual variance (dithering/noise signature)
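
The two simplest artifacts in this list, DC offset and near-clipping ratio, reduce to a few lines. The sketch below is illustrative: the `0.99` clipping threshold is an assumption, and it presumes the check runs on samples before the loader's peak normalization to `0.95`, since normalized audio would never reach that level.

```python
from typing import List


def dc_offset(x: List[float]) -> float:
    """Mean sample value; a clean recording should sit close to zero."""
    return sum(x) / len(x) if x else 0.0


def near_clipping_ratio(x: List[float], threshold: float = 0.99) -> float:
    """Fraction of samples whose magnitude reaches the assumed clipping threshold."""
    if not x:
        return 0.0
    return sum(1 for v in x if abs(v) >= threshold) / len(x)
```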

	---

	## 6. Robustness Stress Test

	### Telephony stability

	* The pipeline resamples the first audio chunk to `8 kHz` and back to `16 kHz`.
	* It compares the original WavLM score to the telephony-resampled score.
	* The resulting stability metric penalizes large score deltas.
	* This catches synthetic audio that changes behavior under real-world channel distortion.

	### Score formula

	```python
	stability = 1.0 - min(1.0, abs(score_orig - score_telephony) / 0.40)
	```

	* `>= 0.70`: stable
	* `< 0.60`: suspicious
	* values from `0.60` up to `0.70` are not assigned a label in this document
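
As a worked example of the formula and thresholds, the sketch below wraps them in two functions. The function names and the `indeterminate` label for the band between the two documented thresholds are assumptions made for illustration.

```python
def telephony_stability(score_orig: float, score_telephony: float) -> float:
    """Stability in [0, 1]; a score delta of 0.40 or more maps to 0.0."""
    return 1.0 - min(1.0, abs(score_orig - score_telephony) / 0.40)


def stability_label(stability: float) -> str:
    """Apply the two documented thresholds; the gap between them is unlabeled."""
    if stability >= 0.70:
        return "stable"
    if stability < 0.60:
        return "suspicious"
    return "indeterminate"  # hypothetical label: the source defines no band here


if __name__ == "__main__":
    for orig, tel in [(0.90, 0.86), (0.82, 0.62), (0.90, 0.10)]:
        s = telephony_stability(orig, tel)
        print(f"{orig:.2f} vs {tel:.2f}: stability={s:.2f} ({stability_label(s)})")
```

A delta of 0.04 yields a stability near 0.90, a delta of 0.20 yields about 0.50, and any delta of 0.40 or more saturates at 0.0.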

	---

	## 7. Fusion Logic and Verdict Mapping (`backend/app/models/audio/audio_fusion.py`)

	### Decision hierarchy

	1. Strong agreement between WavLM and AST
	2. Strong WavLM plus supporting prosody or speaker evidence
	3. Speaker/prosody combined inconsistency
	4. Telephony instability guard
	5. Adaptive weighted blending
	6. Authenticity protection if all signals are weak

	### Weighted blend

	* `WavLM`: 40%
	* `AST/Wav2Vec`: 20%
	* `Prosody`: 15%
	* `Speaker`: 15%
	* `Spectral`: 5%
	* `Codec`: 5%

	### Final classification

	* `AI-Generated`: fused score >= 0.55 (a second tier at >= 0.80 carries the same label)
	* `Suspicious`: fused score >= 0.48
	* `Authentic`: fused score < 0.48

	### Threat mapping

	* `AI-Generated` → `CRITICAL`
	* `Suspicious` → `MEDIUM`
	* `Authentic` → `SAFE`
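
The weighted blend, classification thresholds, and threat mapping can be combined into a short sketch. It covers only a fixed-weight version of the blending step: the earlier rules in the decision hierarchy and any adaptive re-weighting are not modeled, and the dictionary keys are hypothetical names for the six signals.

```python
from typing import Dict

# Weights from the "Weighted blend" list; key names are illustrative
WEIGHTS: Dict[str, float] = {
    "wavlm": 0.40,
    "ast": 0.20,
    "prosody": 0.15,
    "speaker": 0.15,
    "spectral": 0.05,
    "codec": 0.05,
}

THREAT: Dict[str, str] = {
    "AI-Generated": "CRITICAL",
    "Suspicious": "MEDIUM",
    "Authentic": "SAFE",
}


def weighted_blend(scores: Dict[str, float]) -> float:
    """Fixed-weight average of the six signal scores (each in [0, 1])."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def classify(fused: float) -> str:
    """Map a fused score to a verdict using the documented thresholds."""
    if fused >= 0.55:
        return "AI-Generated"
    if fused >= 0.48:
        return "Suspicious"
    return "Authentic"


if __name__ == "__main__":
    scores = {"wavlm": 0.9, "ast": 0.8, "prosody": 0.4,
              "speaker": 0.3, "spectral": 0.5, "codec": 0.2}
    fused = weighted_blend(scores)
    verdict = classify(fused)
    print(f"fused={fused:.3f} verdict={verdict} threat={THREAT[verdict]}")
```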

	---

	## 8. Explainability Output (`backend/app/models/audio/audio_explanation.py`)

	The project generates a structured explanation object that contains:
	* `forensic_summary`
	* `recommended_action`
	* `confidence`
	* `primary_reasons`
	* `supporting_reasons`
	* `exonerating_factors`
	* `stability_report`

	### Explanation heuristics

	* High WavLM scores become critical evidence.
	* AST/Wav2Vec support is added when results are strong.
	* Prosody evidence is translated into human-readable descriptions like "robotic pitch monotony" or "metronomic speech rhythm".
	* Speaker analysis distinguishes between "identity drift" and "unnatural constancy".
	* Stability results are included as an exonerating or cautionary note.

	---

	## 9. Timeline Construction (`backend/app/models/audio/audio_segmentation.py`)

	A timeline is built from chunk-level scores for visualization:
	* Each 5-second interval receives a risk score.
	* The score is computed from WavLM/Wav2Vec, spectral, prosody, and speaker signals.
	* Timeline segments include start/end timestamps and signal breakdown for UI heatmaps.

	### Segment output fields

	* `segment`
	* `start_sec`
	* `end_sec`
	* `ai_score`
	* `level`
	* `signals`
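
A possible shape for these segments is sketched below. The field names come from the list above; everything else is an assumption made for illustration, including the zero-based `segment` index, the plain mean used for `ai_score`, and the `high`/`medium`/`low` level names and cut-offs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    segment: int               # assumed zero-based chunk index
    start_sec: float
    end_sec: float
    ai_score: float
    level: str
    signals: Dict[str, float]  # per-signal breakdown for the heatmap


def build_timeline(
    chunk_signals: List[Dict[str, float]],
    duration_sec: float,
    chunk_sec: float = 5.0,
) -> List[Segment]:
    """Turn per-chunk signal scores into timestamped timeline segments."""
    timeline: List[Segment] = []
    for i, signals in enumerate(chunk_signals):
        start = i * chunk_sec
        end = min(start + chunk_sec, duration_sec)  # last chunk may be shorter
        # Assumption: unweighted mean; the real module may weight the signals
        score = sum(signals.values()) / len(signals) if signals else 0.0
        level = "high" if score >= 0.55 else "medium" if score >= 0.48 else "low"
        timeline.append(Segment(i, start, end, score, level, dict(signals)))
    return timeline
```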

	---

	## 10. Frontend Architecture

	### Audio Lab page

	`fakeshield/src/pages/AudioLab/AudioLabPage.tsx` manages:
	* file selection and drag-and-drop
	* progress and upload states
	* analysis start/reset actions
	* gauge and status rendering
	* result presentation

	### Report panel

	`fakeshield/src/pages/AudioLab/AudioReportPanel.tsx` renders:
	* case summary and severity
	* fusion rule badge
	* action recommendation
	* metadata cards
	* primary and supporting evidence blocks
	* exonerating factors list
	* JSON report download

	### Service layer

	`fakeshield/src/services/audioService.ts` implements:
	* client-side file validation
	* upload to `/audio/analyze/async`
	* polling `/audio/status/{job_id}` every 2.5 seconds
	* timeout after 180 seconds
	* progress messaging for the UI
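
The polling behavior can be expressed compactly. The service itself is TypeScript; the sketch below uses Python only to show the control flow, with an injected `fetch_status` callable, an assumed `processing` status string, and injectable `sleep` and `clock` functions so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Dict

POLL_INTERVAL_SEC = 2.5
TIMEOUT_SEC = 180.0


def poll_until_done(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = POLL_INTERVAL_SEC,
    timeout: float = TIMEOUT_SEC,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll the status callable until the job leaves the 'processing' state."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") != "processing":
            return status  # finished or failed: hand the record to the caller
        if clock() + interval > deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} s")
        sleep(interval)
```

Checking the deadline before sleeping means the loop never waits past the timeout just to discover that it has expired.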

	### Result schema

	`fakeshield/src/services/api/audioAnalysis.ts` defines the TypeScript shape for results, including:
	* signal scores
	* timeline segments
	* detailed signal metadata
	* forensic reasons
	* audio metadata

	---

	## 11. Deployment & Operational Notes

	### API health

	`GET /audio/health` returns available signal names and dependency requirements.

	### Model loading

	* WavLM and AST models are loaded lazily when first needed.
	* The system uses `MODEL_LOAD_LOCK` to avoid concurrent model initialization races.
	* Models are loaded on CPU and forced into evaluation mode.
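
Lazy loading behind a lock is commonly written as a double-checked cache. The sketch below assumes that pattern; only the `MODEL_LOAD_LOCK` name comes from the text above, while `get_model`, the `_MODELS` cache, and the `loader` callable are hypothetical. In practice the loader would construct the model and switch it to evaluation mode before returning it.

```python
import threading
from typing import Any, Callable, Dict

MODEL_LOAD_LOCK = threading.Lock()
_MODELS: Dict[str, Any] = {}  # hypothetical cache: model name -> loaded model


def get_model(name: str, loader: Callable[[], Any]) -> Any:
    """Return the cached model, loading it at most once across threads."""
    model = _MODELS.get(name)
    if model is None:
        with MODEL_LOAD_LOCK:
            # Re-check inside the lock: another thread may have loaded it
            model = _MODELS.get(name)
            if model is None:
                model = loader()
                _MODELS[name] = model
    return model
```

The first check avoids taking the lock on every call once the model is cached; the second check prevents two threads that both saw an empty cache from loading the model twice.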

	### Performance

	* PyTorch is restricted to a single thread during analysis: `torch.set_num_threads(1)`.
	* Parallel signal extraction is limited to 6 workers.
	* Audio chunking is capped at the first 3 chunks (15 seconds) for speed.

	---

	## 12. Summary

	The FakeShield AI Audio Lab is a forensic module that blends:
	* deepfake speech classifiers
	* spectral and codec artifact heuristics
	* speaker identity drift analysis
	* prosody and timing analysis
	* telephony robustness evaluation
	* hierarchical fusion logic
	* natural language explainability

	This document captures the full lab workflow across backend, API, and frontend implementation.