---
title: AnomalyOS
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---
# AnomalyOS 🔍

Industrial Visual Intelligence Platform

*Zero training on defects. The AI only knows normal.*

## What Is This
AnomalyOS is a five-mode industrial visual inspection platform built on PatchCore (CVPR 2022), implemented from scratch in PyTorch. The system detects defects in manufactured products using only normal training images, with no defect labels required. Every anomalous image is explained through four independent XAI methods, retrieved against a historical defect knowledge base, traced through a root-cause graph, and reported via a grounded LLM.
The 15-second demo: The AI has never seen a defective product. It only knows what normal looks like. Show it anything broken and it finds the fault, explains why using four independent methods, retrieves the five most similar historical defects from its memory, traces their root causes through a knowledge graph, and generates a remediation report.
## Architecture

```mermaid
graph LR
    A[User Image] --> B[Inspector Mode]
    B --> C[CLIP Full-Image]
    B --> D[WideResNet Patches]
    C --> E[Index 1: Category Routing]
    D --> F[Index 3: PatchCore Scoring]
    F --> G{Anomalous?}
    G -- No --> H[Normal Result]
    G -- Yes --> I[Defect Crop]
    I --> J[CLIP Crop Embedding]
    J --> K[Index 2: Similar Cases]
    K --> L[Knowledge Graph]
    L --> M[Groq LLM Report]
    F --> N[XAI Layer]
    N --> O[Heatmap + GradCAM++ + SHAP + Retrieval Trace]
```
Three FAISS indexes, three granularities (sketched below):

- Index 1: CLIP full-image, 15 vectors, category routing
- Index 2: CLIP defect-crop, 5354 vectors, historical retrieval
- Index 3: WideResNet patches, per-category coreset, anomaly scoring
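A minimal sketch of that layout. All identifiers and dimensions here are illustrative, not the repo's actual names; it assumes 512-d CLIP embeddings and 256-d reduced patch features:

```python
import faiss
import numpy as np

D_CLIP, D_PATCH = 512, 256                      # assumed dimensions

# Index 1: one L2-normalised CLIP vector per product category (15 total).
index_category = faiss.IndexFlatIP(D_CLIP)

# Index 2: CLIP embeddings of historical defect crops (5354 total).
index_defects = faiss.IndexFlatIP(D_CLIP)

# Index 3: one PatchCore coreset index per category (15 separate indexes).
index_patches = {cat: faiss.IndexFlatL2(D_PATCH)
                 for cat in ["bottle", "cable", "capsule"]}  # ... all 15

def add_normalised(index: faiss.Index, vecs: np.ndarray) -> None:
    """Add CLIP vectors so that inner-product search equals cosine similarity."""
    vecs = np.ascontiguousarray(vecs, dtype="float32")
    faiss.normalize_L2(vecs)                    # in-place normalisation
    index.add(vecs)
```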
## Five Modes

| Mode | Purpose |
|---|---|
| 🔬 Inspector | Upload image → defect detection + heatmap + report |
| 🧬 Forensics | Deep XAI on any past case (GradCAM++, SHAP, retrieval trace) |
| 📊 Analytics | Aggregated stats, Evidently drift monitoring |
| 🏟️ Arena | Competitive game: beat the AI at defect detection |
| 📚 Knowledge Base | Browse defect graph, natural language search |
## Technical Decisions

**Why PatchCore over a trained classifier?** Real manufacturing lines do not have labelled defect datasets. Defects are rare, varied, and novel. PatchCore requires only normal samples and learns the distribution of normal patch features; any deviation at inference is flagged. The scoring mechanism is a nearest-neighbour distance, which is inherently interpretable: no post-hoc XAI is required for localisation.
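A minimal sketch of the scoring step, simplified to a plain max over patch distances (the original paper adds a neighbourhood re-weighting term omitted here); `coreset_index` is a per-category FAISS L2 index over coreset patches:

```python
import numpy as np

def patchcore_score(coreset_index, patches: np.ndarray):
    """Score (N, D) test-image patches against the normal coreset."""
    patches = np.ascontiguousarray(patches, dtype="float32")
    d2, _ = coreset_index.search(patches, 1)    # nearest normal patch
    patch_scores = np.sqrt(d2[:, 0])            # FAISS L2 returns squared distances
    # The per-patch scores form the anomaly map; the most anomalous
    # patch sets the image-level score.
    return patch_scores, float(patch_scores.max())
```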
**Why hierarchical RAG over flat search?** A flat index over all 5354 images confuses product categories: a bottle scratch and a carpet scratch share visual similarities that cause cross-category retrieval noise. Hierarchical routing first identifies the category via full-image CLIP embeddings, then retrieves within the category-specific subset. Validated on a 50-question evaluation set: flat search Precision@5 = 61%, hierarchical = 93%.
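One simple way to implement the routing, as a sketch: it assumes a small per-category defect-crop index kept alongside the routing index (the repo may instead filter a single Index 2 by ID):

```python
import faiss
import numpy as np

def _unit(v: np.ndarray) -> np.ndarray:
    v = np.ascontiguousarray(v.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(v)
    return v

def retrieve_similar_cases(full_img_vec, crop_vec, index_category,
                           category_names, defect_index_by_category, k=5):
    # Step 1: route the query image to a product category (Index 1).
    _, idx = index_category.search(_unit(full_img_vec), 1)
    category = category_names[idx[0, 0]]
    # Step 2: retrieve top-k historical defects within that category only,
    # so a bottle scratch can never match a carpet scratch.
    sims, rows = defect_index_by_category[category].search(_unit(crop_vec), k)
    return category, sims[0], rows[0]
```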
**Why three FAISS indexes?** Each index operates at a different granularity and serves a different purpose. Index 1 routes at category level. Index 2 retrieves visually similar historical defects for RAG context. Index 3 *is* the PatchCore scoring mechanism: one coreset per product category, because each category has its own definition of normal.
**Why GradCAM++ over basic Grad-CAM?** Basic Grad-CAM uses only positive gradients and produces fragmented activation maps. GradCAM++ uses a weighted combination of both positive and negative gradients, resulting in more focused and spatially precise localisation maps. Implementation complexity is nearly identical, so it is a direct upgrade.
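A compact sketch of the standard GradCAM++ weighting; `model`, `layer`, and the classifier-style output indexing are placeholders, not the repo's actual hooks:

```python
import torch
import torch.nn.functional as F

def gradcam_pp(model, layer, x: torch.Tensor, class_idx: int) -> torch.Tensor:
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model(x)[0, class_idx].backward()
    h1.remove(); h2.remove()

    a, g = acts["a"], grads["g"]                          # (1, C, H, W)
    g2, g3 = g ** 2, g ** 3
    # GradCAM++ weights: alpha balances second- and third-order gradient terms.
    alpha = g2 / (2 * g2 + (a * g3).sum(dim=(2, 3), keepdim=True) + 1e-8)
    weights = (alpha * F.relu(g)).sum(dim=(2, 3), keepdim=True)  # (1, C, 1, 1)
    cam = F.relu((weights * a).sum(dim=1))                # (1, H, W)
    return cam / (cam.max() + 1e-8)
```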
**Why SHAP over LIME?** SHAP provides theoretically grounded attribution values satisfying the efficiency axiom: the attributions sum to the difference between the prediction and the baseline expectation. LIME is slower and produces less consistent results across runs. For five interpretable features, SHAP is the correct choice.
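A sketch of KernelSHAP over five tabular features. The feature names and the linear stand-in scorer are invented for illustration; the real project would plug in its own severity model:

```python
import numpy as np
import shap

FEATURES = ["area", "mean_intensity", "contrast", "depth_var", "lbp_entropy"]

def severity(X: np.ndarray) -> np.ndarray:
    # Stand-in scorer with hypothetical weights, not the project's model.
    return X @ np.array([0.4, 0.2, 0.2, 0.1, 0.1])

background = np.zeros((1, len(FEATURES)))       # all-zeros reference baseline
explainer = shap.KernelExplainer(severity, background)
sample = np.random.rand(1, len(FEATURES))
values = explainer.shap_values(sample)
# Efficiency axiom: values.sum() equals severity(sample) minus the
# expected output over the background.
```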
**Why MiDaS-small, not MiDaS-large?** The depth signal feeds a five-value statistical summary, not a pixel-level task. MiDaS-small produces near-identical summary statistics at ~80ms on CPU vs ~800ms for the large model. The architecture is model-agnostic: swapping to DPT-Large is a one-line change when GPU budget allows.
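A sketch of the depth-summary step via torch.hub (`MiDaS_small` and the `transforms` entry point are real hub names; the five statistics chosen here are an illustrative guess at the summary's contents):

```python
import numpy as np
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

@torch.no_grad()
def depth_summary(rgb: np.ndarray) -> dict:
    """Reduce an HxWx3 RGB image to a five-value relative-depth summary."""
    depth = midas(transform(rgb)).flatten()     # relative inverse depth
    return {"mean": depth.mean().item(), "std": depth.std().item(),
            "min": depth.min().item(), "max": depth.max().item(),
            "p95": depth.quantile(0.95).item()}
```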
**Why coreset subsampling?** 2.8M patch vectors across all normal training images cannot all live in RAM or be searched efficiently. The greedy k-center coreset selects M representative patches such that every original patch is within bounded distance of a centre. At a 1% coreset: 97.81% average AUROC at <5s CPU latency. At 10%: marginal AUROC gain for 10x the storage and latency.
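A sketch of the greedy k-center (farthest-point) selection, with the distance update chunked to keep memory bounded (see the Bug Log for the batched-distance detail):

```python
import numpy as np

def greedy_coreset(patches: np.ndarray, m: int, chunk: int = 10_000) -> np.ndarray:
    """Pick m centres so every patch stays close to some centre."""
    n = patches.shape[0]
    selected = [np.random.randint(n)]
    min_d = np.full(n, np.inf, dtype=np.float32)  # distance to nearest centre so far
    for _ in range(m - 1):
        centre = patches[selected[-1]]
        for s in range(0, n, chunk):              # chunked to bound peak RAM
            d = np.linalg.norm(patches[s:s + chunk] - centre, axis=1)
            np.minimum(min_d[s:s + chunk], d, out=min_d[s:s + chunk])
        selected.append(int(min_d.argmax()))      # farthest point becomes next centre
    return patches[np.asarray(selected)]
```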
**Why DagsHub over plain MLflow?** DagsHub provides free hosted MLflow tracking and DVC remote storage under one account. No self-hosted MLflow server required. All experiment runs, model weights, and FAISS indexes are versioned and reproducible from a single `dvc pull`.
## Performance

### Image AUROC per Category (PatchCore, 1% coreset)

| Category | AUROC | Category | AUROC |
|---|---|---|---|
| bottle | 1.0000 | pill | 0.9722 |
| hazelnut | 1.0000 | grid | 0.9816 |
| leather | 1.0000 | cable | 0.9828 |
| tile | 1.0000 | carpet | 0.9835 |
| metal_nut | 0.9976 | wood | 0.9877 |
| transistor | 0.9929 | capsule | 0.9813 |
| zipper | 0.9659 | screw | 0.9545 |
| toothbrush | 0.8722 | | |

Average AUROC: 0.9781 (target: ≥ 0.97, met)
Toothbrush and screw score lower across all PatchCore implementations in the literature: toothbrush has only 60 training images (thin coreset), and screw has highly regular, fine-grained thread patterns that challenge patch-level matching.
### Retrieval Quality
- Precision@5 (hierarchical): 0.9307
- Precision@5 (flat baseline): ~0.61
- Improvement: +32 percentage points from hierarchical routing
### Inference Latency (CPU, HF Spaces)
- End-to-end (excl. LLM): ~3-5s
- FAISS k-NN search: <5ms
- CLIP encoding: ~150ms
- WideResNet extraction: ~200ms
## MLOps

### Experiment Tracking (MLflow on DagsHub)

*Screenshot: DagsHub MLflow dashboard*
15+ logged runs across three experiments:
- PatchCore ablation (coreset % vs AUROC/latency)
- EfficientNet fine-tuning (10 Optuna trials)
- Retrieval quality evaluation (Precision@1, Precision@5, MRR)
### CI/CD (GitHub Actions)

Three-stage smoke test on every deploy (sketched below):

1. `GET /health` → 200 OK
2. `POST /inspect` with a 224×224 test image → valid response
3. `GET /metrics` → 200 OK
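The same three stages as a standalone Python sketch; the multipart field name and response shape are assumptions about the API, not taken from the workflow file:

```python
import io
import requests
from PIL import Image

BASE = "http://localhost:7860"

# Stage 1: liveness.
assert requests.get(f"{BASE}/health", timeout=30).status_code == 200

# Stage 2: end-to-end inspection on a blank 224x224 test image.
buf = io.BytesIO()
Image.new("RGB", (224, 224)).save(buf, format="PNG")
buf.seek(0)
r = requests.post(f"{BASE}/inspect",
                  files={"file": ("test.png", buf, "image/png")},  # field name assumed
                  timeout=120)
assert r.status_code == 200 and isinstance(r.json(), dict)

# Stage 3: metrics endpoint.
assert requests.get(f"{BASE}/metrics", timeout=30).status_code == 200
print("smoke test passed")
```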
### Data Versioning (DVC + DagsHub)

All artifacts versioned and reproducible:

```bash
dvc pull  # pulls all FAISS indexes, PCA model, thresholds, graph
```
### Drift Monitoring (Evidently AI)

Reference window: the first 200 inference records. Current window: the most recent 200 records. Metrics: anomaly score distribution and predicted category distribution. Note: drift simulation uses injected OOD records for portfolio demonstration.
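A sketch with Evidently's Report API (0.4.x-style imports; newer releases moved these modules, and the log path and column names are hypothetical):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

log = pd.read_json("inference_log.jsonl", lines=True)   # hypothetical log path
reference = log.head(200)[["anomaly_score", "predicted_category"]]
current = log.tail(200)[["anomaly_score", "predicted_category"]]

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```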
## Limitations
- Dataset bias: MVTec AD contains Austrian/European industrial products. Performance on other product types or manufacturing contexts is unknown and likely degraded.
- Category specificity: PatchCore builds one coreset per product category. A category not in the 15 MVTec classes requires retraining from scratch.
- Retrieval degradation: Index 2 retrieval precision degrades on novel defect types not present in the training set.
- LLM reports unverified: Groq Llama-3 reports are grounded in retrieved context but not verified by domain experts. Do not use for real industrial decisions.
- Drift monitoring simulated: Evidently drift reports use artificially injected OOD records. Not real production drift.
- CPU latency: 3-5s end-to-end on HF Spaces free tier (no GPU). Architecture is GPU-ready.
- Not for production use: This is a portfolio demonstration project. Not suitable for safety-critical industrial deployment under any circumstances.
## Bug Log

### Bug 1: Greedy coreset RAM explosion

What: Naive pairwise distance computation over 2.8M patch vectors caused an OOM crash during coreset construction. A single distance matrix over the 2.8M × 256 float32 vectors requires ~5.7GB RAM.
Found: Kaggle notebook killed with OOM error during first coreset build attempt.
Fixed: Batched distance computation in chunks of 10,000 vectors. Peak RAM reduced from ~6GB to ~400MB. Added to `greedy_coreset()` as `batched_l2_distance()`.
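A sketch of what such a helper can look like; the signature is a guess, not the repo's exact function:

```python
import numpy as np

def batched_l2_distance(a: np.ndarray, b: np.ndarray,
                        chunk: int = 10_000) -> np.ndarray:
    """Pairwise distances between (N, D) a and (M, D) b, chunked over rows of a."""
    b_sq = (b.astype("float32") ** 2).sum(axis=1)[None, :]          # (1, M)
    out = np.empty((a.shape[0], b.shape[0]), dtype=np.float32)
    for s in range(0, a.shape[0], chunk):
        a_chunk = a[s:s + chunk].astype("float32")
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, no (N, M, D) intermediate.
        d2 = (a_chunk ** 2).sum(axis=1)[:, None] + b_sq - 2.0 * a_chunk @ b.T
        out[s:s + chunk] = np.sqrt(np.maximum(d2, 0.0))             # clamp fp error
    return out
```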
### Bug 2: FAISS IndexFlatIP vs IndexFlatL2 for CLIP

What: Used `IndexFlatL2` for CLIP embeddings initially. On L2-normalised vectors, squared L2 distance is a monotonic function of cosine similarity (d² = 2 − 2·cos θ), so L2 search produces correct rankings, but its raw distances are not similarity scores, which confused the similarity score display.
Found: Similarity scores in Index 2 retrieval were showing values >1.0 in the UI.
Fixed: Changed Index 1 and Index 2 to `IndexFlatIP`. Inner product on L2-normalised vectors is exactly cosine similarity, with range [−1, 1] (effectively [0, 1] for CLIP image embeddings).
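A small reproduction of the distinction (pure faiss + numpy, independent of the project code):

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((100, 512)).astype("float32")
xq = rng.standard_normal((1, 512)).astype("float32")
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

ip = faiss.IndexFlatIP(512); ip.add(xb)
l2 = faiss.IndexFlatL2(512); l2.add(xb)
sim, _ = ip.search(xq, 5)   # cosine similarities, display-ready
d2, _ = l2.search(xq, 5)    # squared L2 distances, same ranking
assert np.allclose(d2, 2.0 - 2.0 * sim, atol=1e-4)
```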
### Bug 3: grayscale_lbp import error in enrichment pipeline

What: Cell 2 of notebook 01 imported `grayscale_lbp` from `skimage.feature`. This function does not exist; the correct function is `local_binary_pattern`.
Found: ImportError on Cell 2 execution.
Fixed: Replaced all `grayscale_lbp` imports with `from skimage.feature import local_binary_pattern`.
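The corrected call in context (`local_binary_pattern` is the real scikit-image API; the P/R values and histogram step are common defaults, not necessarily the notebook's):

```python
import numpy as np
from skimage.feature import local_binary_pattern

gray = np.random.rand(224, 224)                 # stand-in grayscale image
lbp = local_binary_pattern(gray, P=8, R=1.0, method="uniform")
# "uniform" LBP with P=8 yields codes 0..9, often summarised as a histogram.
hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
```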
## Setup & Reproduction

```bash
# 1. Clone
git clone https://github.com/devangmishra1424/AnomalyOS.git
cd AnomalyOS

# 2. Pull all artifacts (FAISS indexes, PCA model, thresholds, graph)
dvc pull

# 3. Install dependencies
pip install -r requirements.txt

# 4. Set environment variables
export HF_TOKEN=your_token
export GROQ_API_KEY=your_key
export DAGSHUB_TOKEN=your_token

# 5. Launch API
uvicorn api.main:app --host 0.0.0.0 --port 7860

# 6. Launch Gradio (separate terminal)
python app.py
```
## Project Structure

```text
AnomalyOS/
├── notebooks/         # Kaggle training notebooks (01-05)
├── src/               # Core ML: patchcore, orchestrator, xai, llm
├── api/               # FastAPI: endpoints, schemas, startup, logger
├── mlops/             # Evidently, Optuna, retrieval evaluation
├── tests/             # pytest suite (5 test files)
├── data/              # DVC-tracked: FAISS indexes, graph, thresholds
├── models/            # DVC-tracked: PCA model, EfficientNet weights
├── app.py             # Gradio frontend (5 tabs)
└── docker/Dockerfile  # python:3.11-slim, port 7860
```
Built by Devang Pradeep Mishra | GitHub | HuggingFace