added cronjob
#2
by Pujan-Dev - opened
- .env-example +0 -47
- .gitignore +1 -6
- Procfile +1 -0
- README.md +0 -166
- __init__.py +0 -1
- app.py +17 -28
- config.py +0 -59
- features/Modelsdfa/English_model/feature_names.json +0 -18
- features/Modelsdfa/English_model/metadata.json +0 -13
- features/__init__.py +0 -5
- features/ai_human_image_classifier/model_loader.py +4 -5
- features/image_classifier/model_loader.py +19 -19
- features/image_edit_detector/controller.py +2 -3
- features/nepali_text_classifier/controller.py +24 -113
- features/nepali_text_classifier/inferencer.py +15 -81
- features/nepali_text_classifier/model_loader.py +51 -234
- features/nepali_text_classifier/preprocess.py +6 -5
- features/nepali_text_classifier/routes.py +6 -21
- features/rag_chatbot/__init__.py +0 -0
- features/rag_chatbot/controller.py +0 -178
- features/rag_chatbot/document_handler.py +0 -37
- features/rag_chatbot/rag_pipeline.py +0 -329
- features/rag_chatbot/routes.py +0 -107
- features/real_forged_classifier/__init__.py +0 -9
- features/real_forged_classifier/controller.py +2 -95
- features/real_forged_classifier/inferencer.py +1 -5
- features/real_forged_classifier/main.py +26 -0
- features/real_forged_classifier/model_loader.py +39 -181
- features/real_forged_classifier/preprocessor.py +1 -1
- features/real_forged_classifier/routes.py +4 -24
- features/text_classifier/controller.py +51 -85
- features/text_classifier/inferencer.py +29 -261
- features/text_classifier/model_loader.py +34 -97
- features/text_classifier/preprocess.py +7 -5
- features/text_classifier/routes.py +2 -3
- requirements.txt +1 -18
- test.md +31 -0
.env-example
CHANGED
|
@@ -1,49 +1,2 @@
|
|
| 1 |
MY_SECRET_TOKEN="SECRET_CODE_TOKEN"
|
| 2 |
|
| 3 |
-
# Language/text classifier models
|
| 4 |
-
English_model="Pujan-Dev/Ai_vs_HUMAN"
|
| 5 |
-
Nepali_model="features/Model/Nepali_model"
|
| 6 |
-
LANG_MODEL="features/Model/English_model"
|
| 7 |
-
|
| 8 |
-
# Hugging Face private model access
|
| 9 |
-
# Create a READ token at: https://huggingface.co/settings/tokens
|
| 10 |
-
HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
|
| 11 |
-
# Optional alias, either variable can be used
|
| 12 |
-
HUGGINGFACE_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
|
| 13 |
-
|
| 14 |
-
# Legacy variables (kept for compatibility)
|
| 15 |
-
REPOSITORY_ID_English_Detector="nepali-detector"
|
| 16 |
-
REPOSITORY_ID_Nepali_Detector="nepali-detector"
|
| 17 |
-
|
| 18 |
-
# Image classifier
|
| 19 |
-
IMAGE_CLASSIFIER_REPO_ID="can-org/AI-VS-HUMAN-IMAGE-classifier"
|
| 20 |
-
IMAGE_CLASSIFIER_MODEL_DIR="./IMG_Models"
|
| 21 |
-
IMAGE_CLASSIFIER_WEIGHTS_FILE="latest-my_cnn_model.h5"
|
| 22 |
-
|
| 23 |
-
# AI vs Human image detector
|
| 24 |
-
AI_HUMAN_CLIP_MODEL_NAME="ViT-L/14"
|
| 25 |
-
AI_HUMAN_SVM_REPO_ID="rhnsa/ai_human_image_detector"
|
| 26 |
-
AI_HUMAN_SVM_FILENAME="svm_model_real.joblib"
|
| 27 |
-
|
| 28 |
-
# Real vs Forged detector
|
| 29 |
-
REAL_FORGED_MODEL_REPO_ID="rhnsa/real_forged_classifier"
|
| 30 |
-
REAL_FORGED_MODEL_FILENAME="fft_cnn_model_78.pth"
|
| 31 |
-
|
| 32 |
-
# RAG + Chroma settings
|
| 33 |
-
CHROMA_HOST="localhost"
|
| 34 |
-
CHROMA_PORT="8000"
|
| 35 |
-
RAG_COLLECTION_NAME="company_docs_collection"
|
| 36 |
-
RAG_MAX_FILE_SIZE="104857600"
|
| 37 |
-
RAG_MAX_QUERY_LENGTH="1000"
|
| 38 |
-
|
| 39 |
-
# LLM settings
|
| 40 |
-
LLM_PROVIDER="openai"
|
| 41 |
-
LLM_API_KEY="sk-xxxx"
|
| 42 |
-
LLM_MODEL="gpt-3.5-turbo"
|
| 43 |
-
LLM_TEMPERATURE="0"
|
| 44 |
-
LLM_MAX_TOKENS="2048"
|
| 45 |
-
|
| 46 |
-
# Notebook/scraper API keys
|
| 47 |
-
GEMINI_API_KEY=""
|
| 48 |
-
GROQ_API_KEY="gsk_xxxx"
|
| 49 |
-
OPENROUTER_API_KEY="sk-or-xxxx"
|
|
|
|
| 1 |
MY_SECRET_TOKEN="SECRET_CODE_TOKEN"
|
| 2 |
|
.gitignore
CHANGED
|
@@ -13,13 +13,11 @@ __pycache__/
|
|
| 13 |
.vscode/
|
| 14 |
.idea/
|
| 15 |
*.swp
|
| 16 |
-
*Model/
|
| 17 |
|
| 18 |
# ---- Jupyter / IPython ----
|
| 19 |
.ipynb_checkpoints/
|
| 20 |
*.ipynb
|
| 21 |
-
|
| 22 |
-
*.csv
|
| 23 |
# ---- Model & Data Artifacts ----
|
| 24 |
*.pth
|
| 25 |
*.pt
|
|
@@ -68,6 +66,3 @@ notebooks
|
|
| 68 |
np_text_model/classifier/sentencepiece.bpe.model
|
| 69 |
np_text_model/classifier/tokenizer.json
|
| 70 |
|
| 71 |
-
# vector database
|
| 72 |
-
chroma_data
|
| 73 |
-
chroma_database
|
|
|
|
| 13 |
.vscode/
|
| 14 |
.idea/
|
| 15 |
*.swp
|
|
|
|
| 16 |
|
| 17 |
# ---- Jupyter / IPython ----
|
| 18 |
.ipynb_checkpoints/
|
| 19 |
*.ipynb
|
| 20 |
+
|
|
|
|
| 21 |
# ---- Model & Data Artifacts ----
|
| 22 |
*.pth
|
| 23 |
*.pt
|
|
|
|
| 66 |
np_text_model/classifier/sentencepiece.bpe.model
|
| 67 |
np_text_model/classifier/tokenizer.json
|
| 68 |
|
|
|
|
|
|
|
|
|
Procfile
ADDED
|
@@ -0,0 +1 @@
|
|
|
|
|
|
|
| 1 |
+
web: uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000}
|
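The `${PORT:-8000}` in the Procfile is shell parameter expansion: it uses `$PORT` when that variable is set and non-empty, and falls back to `8000` otherwise. A minimal Python sketch of the same fallback follows, in case the server were ever started from Python rather than from the shell. `resolve_port` is a hypothetical helper introduced here for illustration; it is not part of this PR.

```python
import os


def resolve_port(env=None, default=8000):
    """Mirror the shell expansion ${PORT:-8000}.

    `resolve_port` is a hypothetical helper, not part of this repository.
    """
    env = os.environ if env is None else env
    # ":-" falls back when PORT is unset OR empty, so `or` is used
    # instead of a plain .get() default (which only covers "unset").
    raw = env.get("PORT") or str(default)
    return int(raw)


if __name__ == "__main__":
    print(resolve_port())
```

Note that the README in this PR maps port 7860 for Docker, so the `8000` default only applies when the platform does not inject `PORT`.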
README.md
CHANGED
|
@@ -14,175 +14,9 @@ pinned: false
|
|
| 14 |
This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
|
| 15 |
|
| 16 |
## How to run locally
|
| 17 |
-
---
|
| 18 |
-
title: Testing AI Contain
|
| 19 |
-
emoji: 🤖
|
| 20 |
-
colorFrom: blue
|
| 21 |
-
colorTo: green
|
| 22 |
-
sdk: docker
|
| 23 |
-
sdk_version: "latest"
|
| 24 |
-
app_file: app.py
|
| 25 |
-
pinned: false
|
| 26 |
-
---
|
| 27 |
-
|
| 28 |
-
# AI-Contain-Checker
|
| 29 |
-
# AI-Content-Checker
|
| 30 |
-
|
| 31 |
-
A modular AI content detection system with support for **image classification**, **image edit detection**, **Nepali text classification**, and **general text classification**. Built for performance and extensibility, it is ideal for detecting AI-generated content in both visual and textual forms.
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
## 🌟 Features
|
| 35 |
-
|
| 36 |
-
### 🖼️ Image Classifier
|
| 37 |
-
|
| 38 |
-
* **Purpose**: Classifies whether an image is AI-generated or a real-life photo.
|
| 39 |
-
* **Model**: Fine-tuned **InceptionV3** CNN.
|
| 40 |
-
* **Dataset**: Custom curated dataset with **\~79,950 images** for binary classification.
|
| 41 |
-
* **Location**: [`features/image_classifier`](features/image_classifier)
|
| 42 |
-
* **Docs**: [`docs/features/image_classifier.md`](docs/features/image_classifier.md)
|
| 43 |
-
|
| 44 |
-
### 🖌️ Image Edit Detector
|
| 45 |
-
|
| 46 |
-
* **Purpose**: Detects image tampering or post-processing.
|
| 47 |
-
* **Techniques Used**:
|
| 48 |
-
|
| 49 |
-
* **Error Level Analysis (ELA)**: Visualizes compression artifacts.
|
| 50 |
-
* **Fast Fourier Transform (FFT)**: Detects unnatural frequency patterns.
|
| 51 |
-
* **Location**: [`features/image_edit_detector`](features/image_edit_detector)
|
| 52 |
-
* **Docs**:
|
| 53 |
-
|
| 54 |
-
* [ELA](docs/detector/ELA.md)
|
| 55 |
-
* [FFT](docs/detector/fft.md )
|
| 56 |
-
* [Metadata Analysis](docs/detector/meta.md)
|
| 57 |
-
* [Backend Notes](docs/detector/note-for-backend.md)
|
| 58 |
-
|
| 59 |
-
### 📝 Nepali Text Classifier
|
| 60 |
-
|
| 61 |
-
* **Purpose**: Determines if Nepali text content is AI-generated or written by a human.
|
| 62 |
-
* **Model**: Based on `XLMRClassifier` fine-tuned on Nepali language data.
|
| 63 |
-
* **Dataset**: Scraped dataset of **\~18,000** Nepali texts.
|
| 64 |
-
* **Location**: [`features/nepali_text_classifier`](features/nepali_text_classifier)
|
| 65 |
-
* **Docs**: [`docs/features/nepali_text_classifier.md`](docs/features/nepali_text_classifier.md)
|
| 66 |
-
|
| 67 |
-
### 🌐 English Text Classifier
|
| 68 |
-
|
| 69 |
-
* **Purpose**: Detects if English text is AI-generated or human-written.
|
| 70 |
-
* **Pipeline**:
|
| 71 |
-
|
| 72 |
-
* Uses **GPT2 tokenizer** for input preprocessing.
|
| 73 |
-
* Custom binary classifier to differentiate between AI and human-written content.
|
| 74 |
-
* **Location**: [`features/text_classifier`](features/text_classifier)
|
| 75 |
-
* **Docs**: [`docs/features/text_classifier.md`](docs/features/text_classifier.md)
|
| 76 |
-
|
| 77 |
-
---
|
| 78 |
-
|
| 79 |
-
## 🗂️ Project Structure
|
| 80 |
-
|
| 81 |
-
```bash
|
| 82 |
-
AI-Checker/
|
| 83 |
-
│
|
| 84 |
-
├── app.py # Main FastAPI entry point
|
| 85 |
-
├── config.py # Configuration settings
|
| 86 |
-
├── Dockerfile # Docker build script
|
| 87 |
-
├── Procfile # Deployment file for Heroku or similar
|
| 88 |
-
├── requirements.txt # Python dependencies
|
| 89 |
-
├── README.md # You are here 📘
|
| 90 |
-
│
|
| 91 |
-
├── features/ # Core detection modules
|
| 92 |
-
│ ├── image_classifier/
|
| 93 |
-
│ ├── image_edit_detector/
|
| 94 |
-
│ ├── nepali_text_classifier/
|
| 95 |
-
│ └── text_classifier/
|
| 96 |
-
│
|
| 97 |
-
├── docs/ # Internal and API documentation
|
| 98 |
-
│ ├── api_endpoints.md
|
| 99 |
-
│ ├── deployment.md
|
| 100 |
-
│ ├── detector/
|
| 101 |
-
│ │ ├── ELA.md
|
| 102 |
-
│ │ ├── fft.md
|
| 103 |
-
│ │ ├── meta.md
|
| 104 |
-
│ │ └── note-for-backend.md
|
| 105 |
-
│ ├── functions.md
|
| 106 |
-
│ ├── nestjs_integration.md
|
| 107 |
-
│ ├── security.md
|
| 108 |
-
│ ├── setup.md
|
| 109 |
-
│ └── structure.md
|
| 110 |
-
│
|
| 111 |
-
├── IMG_Models/ # Saved image classifier model(s)
|
| 112 |
-
│ └── latest-my_cnn_model.h5
|
| 113 |
-
│
|
| 114 |
-
├── notebooks/ # Experimental and debug notebooks
|
| 115 |
-
├── static/ # Static assets if needed
|
| 116 |
-
└── test.md # Test notes
|
| 117 |
-
````
|
| 118 |
-
|
| 119 |
-
---
|
| 120 |
-
|
| 121 |
-
## 📚 Documentation Links
|
| 122 |
-
|
| 123 |
-
* [API Endpoints](docs/api_endpoints.md)
|
| 124 |
-
* [Deployment Guide](docs/deployment.md)
|
| 125 |
-
* [Detector Documentation](docs/detector/)
|
| 126 |
-
|
| 127 |
-
* [Error Level Analysis (ELA)](docs/detector/ELA.md)
|
| 128 |
-
* [Fast Fourier Transform (FFT)](docs/detector/fft.md)
|
| 129 |
-
* [Metadata Analysis](docs/detector/meta.md)
|
| 130 |
-
* [Backend Notes](docs/detector/note-for-backend.md)
|
| 131 |
-
* [Functions Overview](docs/functions.md)
|
| 132 |
-
* [NestJS Integration Guide](docs/nestjs_integration.md)
|
| 133 |
-
* [Security Details](docs/security.md)
|
| 134 |
-
* [Setup Instructions](docs/setup.md)
|
| 135 |
-
* [Project Structure](docs/structure.md)
|
| 136 |
-
|
| 137 |
-
---
|
| 138 |
-
|
| 139 |
-
## 🚀 Usage
|
| 140 |
-
|
| 141 |
-
1. **Install dependencies**
|
| 142 |
|
| 143 |
```bash
|
| 144 |
docker build -t testing-ai-contain .
|
| 145 |
docker run -p 7860:7860 testing-ai-contain
|
| 146 |
|
| 147 |
```
|
| 148 |
-
```bash
|
| 149 |
-
pip install -r requirements.txt
|
| 150 |
-
```
|
| 151 |
-
|
| 152 |
-
2. **Run the API**
|
| 153 |
-
|
| 154 |
-
```bash
|
| 155 |
-
chroma run --path ./chroma_database ## to run chromadb locally
|
| 156 |
-
uvicorn app:app --reload --port 8001 ## fastapi (run after chromadb)
|
| 157 |
-
|
| 158 |
-
```
|
| 159 |
-
|
| 160 |
-
3. **Build Docker (optional)**
|
| 161 |
-
|
| 162 |
-
```bash
|
| 163 |
-
docker build -t ai-contain-checker .
|
| 164 |
-
docker run -p 8000:8000 ai-contain-checker
|
| 165 |
-
```
|
| 166 |
-
|
| 167 |
-
---
|
| 168 |
-
|
| 169 |
-
## 🔐 Security & Integration
|
| 170 |
-
|
| 171 |
-
* **Token Authentication** and **IP Whitelisting** supported.
|
| 172 |
-
* NestJS integration guide: [`docs/nestjs_integration.md`](docs/nestjs_integration.md)
|
| 173 |
-
* Rate limiting handled using `slowapi`.
|
| 174 |
-
|
| 175 |
-
---
|
| 176 |
-
|
| 177 |
-
## 🛡️ Future Plans
|
| 178 |
-
|
| 179 |
-
* Add **video classifier** module.
|
| 180 |
-
* Expand dataset for **multilingual** AI content detection.
|
| 181 |
-
* Add **fine-tuning UI** for models.
|
| 182 |
-
|
| 183 |
-
---
|
| 184 |
-
|
| 185 |
-
## 📄 License
|
| 186 |
-
|
| 187 |
-
See full license terms here: [`LICENSE.md`](license.md)
|
| 188 |
-
|
|
|
|
| 14 |
This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
|
| 15 |
|
| 16 |
## How to run locally
|
| 17 |
|
| 18 |
```bash
|
| 19 |
docker build -t testing-ai-contain .
|
| 20 |
docker run -p 7860:7860 testing-ai-contain
|
| 21 |
|
| 22 |
```
|
__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
|
|
|
|
|
|
app.py
CHANGED
|
@@ -1,35 +1,25 @@
|
|
| 1 |
-
import warnings
|
| 2 |
-
|
| 3 |
-
import requests
|
| 4 |
from fastapi import FastAPI, Request
|
| 5 |
-
from fastapi.responses import FileResponse, JSONResponse
|
| 6 |
-
from fastapi.staticfiles import StaticFiles
|
| 7 |
from slowapi import Limiter, _rate_limit_exceeded_handler
|
| 8 |
-
from
|
| 9 |
from slowapi.middleware import SlowAPIMiddleware
|
|
|
|
| 10 |
from slowapi.util import get_remote_address
|
| 11 |
-
|
| 12 |
-
from
|
| 13 |
-
from features.image_classifier.routes import router as image_classifier_router
|
| 14 |
-
from features.image_edit_detector.routes import router as image_edit_detector_router
|
| 15 |
-
from features.real_forged_classifier.routes import router as real_forged_classifier_router
|
| 16 |
from features.nepali_text_classifier.routes import (
|
| 17 |
router as nepali_text_classifier_router,
|
| 18 |
)
|
| 19 |
-
from features.
|
|
|
|
|
|
|
| 20 |
|
| 21 |
-
|
| 22 |
-
|
|
|
|
| 23 |
|
| 24 |
-
|
| 25 |
-
{"name": "English Text Classifier", "description": "Endpoints for English AI-vs-human text analysis."},
|
| 26 |
-
{"name": "Nepali Text Classifier", "description": "Endpoints for Nepali AI-vs-human text analysis."},
|
| 27 |
-
{"name": "AI Image Classifier", "description": "Endpoints for AI-vs-human image classification."},
|
| 28 |
-
{"name": "Image Edit Detection", "description": "Endpoints for edited/forged image detection."},
|
| 29 |
-
{"name": "System", "description": "Health and root endpoints."},
|
| 30 |
-
]
|
| 31 |
|
| 32 |
-
app = FastAPI(
|
| 33 |
# added the robots.txt
|
| 34 |
# Set up SlowAPI
|
| 35 |
app.state.limiter = limiter
|
|
@@ -47,14 +37,13 @@ app.add_exception_handler(
|
|
| 47 |
app.add_middleware(SlowAPIMiddleware)
|
| 48 |
|
| 49 |
# Include your routes
|
| 50 |
-
app.include_router(text_classifier_router, prefix="/text"
|
| 51 |
-
app.include_router(nepali_text_classifier_router, prefix="/NP"
|
| 52 |
-
app.include_router(image_classifier_router, prefix="/AI-image"
|
| 53 |
-
app.include_router(image_edit_detector_router, prefix="/detect"
|
| 54 |
-
app.include_router(real_forged_classifier_router, prefix="/real-forged", tags=["Real/Forged Image Classifier"])
|
| 55 |
|
| 56 |
|
| 57 |
-
@app.get("/"
|
| 58 |
@limiter.limit(ACCESS_RATE)
|
| 59 |
async def root(request: Request):
|
| 60 |
return {
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from fastapi import FastAPI, Request
|
|
|
|
|
|
|
| 2 |
from slowapi import Limiter, _rate_limit_exceeded_handler
|
| 3 |
+
from fastapi.responses import FileResponse
|
| 4 |
from slowapi.middleware import SlowAPIMiddleware
|
| 5 |
+
from slowapi.errors import RateLimitExceeded
|
| 6 |
from slowapi.util import get_remote_address
|
| 7 |
+
from fastapi.responses import JSONResponse
|
| 8 |
+
from features.text_classifier.routes import router as text_classifier_router
|
|
|
|
|
|
|
|
|
|
| 9 |
from features.nepali_text_classifier.routes import (
|
| 10 |
router as nepali_text_classifier_router,
|
| 11 |
)
|
| 12 |
+
from features.image_classifier.routes import router as image_classifier_router
|
| 13 |
+
from features.image_edit_detector.routes import router as image_edit_detector_router
|
| 14 |
+
from fastapi.staticfiles import StaticFiles
|
| 15 |
|
| 16 |
+
from config import ACCESS_RATE
|
| 17 |
+
|
| 18 |
+
import requests
|
| 19 |
|
| 20 |
+
limiter = Limiter(key_func=get_remote_address, default_limits=[ACCESS_RATE])
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
+
app = FastAPI()
|
| 23 |
# added the robots.txt
|
| 24 |
# Set up SlowAPI
|
| 25 |
app.state.limiter = limiter
|
|
|
|
| 37 |
app.add_middleware(SlowAPIMiddleware)
|
| 38 |
|
| 39 |
# Include your routes
|
| 40 |
+
app.include_router(text_classifier_router, prefix="/text")
|
| 41 |
+
app.include_router(nepali_text_classifier_router, prefix="/NP")
|
| 42 |
+
app.include_router(image_classifier_router, prefix="/AI-image")
|
| 43 |
+
app.include_router(image_edit_detector_router, prefix="/detect")
|
|
|
|
| 44 |
|
| 45 |
|
| 46 |
+
@app.get("/")
|
| 47 |
@limiter.limit(ACCESS_RATE)
|
| 48 |
async def root(request: Request):
|
| 49 |
return {
|
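The new `app.py` passes `ACCESS_RATE` (`"20/minute"` in `config.py`) to the limiter both as a default limit and on the root route. To make concrete what such a rate string means, here is a rough fixed-window sketch in plain Python. It is an illustration only and does not claim to reproduce how slowapi works internally; `parse_rate` and `FixedWindowLimiter` are hypothetical names.

```python
import time

# Hypothetical names for illustration; none of this is slowapi's API.
_UNITS = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def parse_rate(rate):
    """Split a string such as "20/minute" into (max_requests, window_seconds)."""
    count, unit = rate.split("/")
    return int(count), _UNITS[unit.strip().lower()]


class FixedWindowLimiter:
    """Allow at most `max_requests` per key inside each fixed time window."""

    def __init__(self, rate, clock=time.monotonic):
        self.max_requests, self.window = parse_rate(rate)
        self.clock = clock
        self.hits = {}  # key -> (window_index, count)

    def allow(self, key):
        window_index = int(self.clock() // self.window)
        index, count = self.hits.get(key, (window_index, 0))
        if index != window_index:
            count = 0  # a new window has started, so the counter resets
        if count >= self.max_requests:
            return False
        self.hits[key] = (window_index, count + 1)
        return True
```

In the PR the key is the client address (`get_remote_address`), so in this sketch each distinct address would get its own counter.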
config.py
CHANGED
|
@@ -1,61 +1,2 @@
|
|
| 1 |
-
import os
|
| 2 |
-
|
| 3 |
-
import dotenv
|
| 4 |
-
|
| 5 |
-
dotenv.load_dotenv()
|
| 6 |
-
|
| 7 |
ACCESS_RATE = "20/minute"
|
| 8 |
|
| 9 |
-
|
| 10 |
-
class Config:
|
| 11 |
-
Nepali_model_folder = os.getenv("Nepali_model")
|
| 12 |
-
English_model_folder = os.getenv("English_model")
|
| 13 |
-
REPO_ID_LANG = os.getenv("English_model") or "Pujan-Dev/Ai_vs_HUMAN"
|
| 14 |
-
LANG_MODEL = os.getenv("LANG_MODEL")
|
| 15 |
-
HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
|
| 16 |
-
SECRET_TOKEN = os.getenv("MY_SECRET_TOKEN")
|
| 17 |
-
|
| 18 |
-
IMAGE_CLASSIFIER_REPO_ID = os.getenv("IMAGE_CLASSIFIER_REPO_ID", "can-org/AI-VS-HUMAN-IMAGE-classifier")
|
| 19 |
-
IMAGE_CLASSIFIER_MODEL_DIR = os.getenv("IMAGE_CLASSIFIER_MODEL_DIR", "./IMG_Models")
|
| 20 |
-
IMAGE_CLASSIFIER_WEIGHTS_FILE = os.getenv("IMAGE_CLASSIFIER_WEIGHTS_FILE", "latest-my_cnn_model.h5")
|
| 21 |
-
|
| 22 |
-
AI_HUMAN_CLIP_MODEL_NAME = os.getenv("AI_HUMAN_CLIP_MODEL_NAME", "ViT-L/14")
|
| 23 |
-
AI_HUMAN_SVM_REPO_ID = os.getenv("AI_HUMAN_SVM_REPO_ID", "rhnsa/ai_human_image_detector")
|
| 24 |
-
AI_HUMAN_SVM_FILENAME = os.getenv("AI_HUMAN_SVM_FILENAME", "svm_model_real.joblib")
|
| 25 |
-
|
| 26 |
-
REAL_FORGED_MODEL_REPO_ID = os.getenv("REAL_FORGED_MODEL_REPO_ID", "rhnsa/real_forged_classifier")
|
| 27 |
-
REAL_FORGED_MODEL_FILENAME = os.getenv("REAL_FORGED_MODEL_FILENAME", "fft_cnn_model_78.pth")
|
| 28 |
-
REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
|
| 29 |
-
DOCUMENT_FORGERY_MODEL_REPO_ID = os.getenv(
|
| 30 |
-
"DOCUMENT_FORGERY_MODEL_REPO_ID",
|
| 31 |
-
REPO_ID_LANG
|
| 32 |
-
)
|
| 33 |
-
DOCUMENT_FORGERY_MODEL_FILENAME = os.getenv(
|
| 34 |
-
"DOCUMENT_FORGERY_MODEL_FILENAME",
|
| 35 |
-
"document_forgery/pixel_forgery_v3_best.pth",
|
| 36 |
-
)
|
| 37 |
-
DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
|
| 38 |
-
"DOCUMENT_FORGERY_MODEL_PATH",
|
| 39 |
-
"features/Modelsdfa/document_forgery/pixel_forgery_v3_best.pth",
|
| 40 |
-
)
|
| 41 |
-
# Decision thresholds for document forgery detector (probabilities in 0..1)
|
| 42 |
-
DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
|
| 43 |
-
DOCUMENT_FORGERY_FORGED_LOW = float(os.getenv("DOCUMENT_FORGERY_FORGED_LOW", "0.55"))
|
| 44 |
-
|
| 45 |
-
RAG_CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
|
| 46 |
-
RAG_CHROMA_PORT = int(os.getenv("CHROMA_PORT", "8000"))
|
| 47 |
-
RAG_COLLECTION_NAME = os.getenv("RAG_COLLECTION_NAME", "company_docs_collection")
|
| 48 |
-
|
| 49 |
-
RAG_LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai").lower()
|
| 50 |
-
RAG_LLM_API_KEY = os.getenv("LLM_API_KEY")
|
| 51 |
-
RAG_LLM_MODEL = os.getenv("LLM_MODEL", "gpt-3.5-turbo")
|
| 52 |
-
RAG_LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0"))
|
| 53 |
-
RAG_LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2048"))
|
| 54 |
-
|
| 55 |
-
RAG_MAX_FILE_SIZE = int(os.getenv("RAG_MAX_FILE_SIZE", str(100 * 1024 * 1024)))
|
| 56 |
-
RAG_MAX_QUERY_LENGTH = int(os.getenv("RAG_MAX_QUERY_LENGTH", "1000"))
|
| 57 |
-
RAG_SUPPORTED_CONTENT_TYPES = {
|
| 58 |
-
"application/pdf",
|
| 59 |
-
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
| 60 |
-
"text/plain",
|
| 61 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
ACCESS_RATE = "20/minute"
|
| 2 |
|
features/Modelsdfa/English_model/feature_names.json
DELETED
|
@@ -1,18 +0,0 @@
|
|
| 1 |
-
[
|
| 2 |
-
"perplexity",
|
| 3 |
-
"burst_mean",
|
| 4 |
-
"burst_std",
|
| 5 |
-
"burst_max",
|
| 6 |
-
"burst_min",
|
| 7 |
-
"burst_range",
|
| 8 |
-
"num_words",
|
| 9 |
-
"num_chars",
|
| 10 |
-
"num_sentences",
|
| 11 |
-
"avg_word_len",
|
| 12 |
-
"avg_sent_len",
|
| 13 |
-
"lexical_diversity",
|
| 14 |
-
"punct_ratio",
|
| 15 |
-
"caps_ratio",
|
| 16 |
-
"flesch_reading",
|
| 17 |
-
"flesch_grade"
|
| 18 |
-
]
|
features/Modelsdfa/English_model/metadata.json
DELETED
|
@@ -1,13 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"selected_model": "hybrid_tfidf_logistic",
|
| 3 |
-
"cv_best_f1": 0.8593569681592504,
|
| 4 |
-
"num_engineered_features": 16,
|
| 5 |
-
"num_word_tfidf_features": 86956,
|
| 6 |
-
"num_char_tfidf_features": 80000,
|
| 7 |
-
"train_samples": 15952,
|
| 8 |
-
"test_samples": 3988,
|
| 9 |
-
"train_accuracy": 0.980253259779338,
|
| 10 |
-
"train_f1": 0.980182447310475,
|
| 11 |
-
"test_accuracy": 0.8713640922768305,
|
| 12 |
-
"test_f1": 0.8707482993197279
|
| 13 |
-
}
|
features/__init__.py
DELETED
|
@@ -1,5 +0,0 @@
|
|
| 1 |
-
"""Top-level features package for the aiapi project."""
|
| 2 |
-
|
| 3 |
-
__all__ = [
|
| 4 |
-
# Subpackages are dynamically discovered; keep this minimal.
|
| 5 |
-
]
|
features/ai_human_image_classifier/model_loader.py
CHANGED
|
@@ -3,7 +3,6 @@ import torch
|
|
| 3 |
import joblib
|
| 4 |
from pathlib import Path
|
| 5 |
from huggingface_hub import hf_hub_download
|
| 6 |
-
from config import Config
|
| 7 |
|
| 8 |
class ModelLoader:
|
| 9 |
"""
|
|
@@ -57,7 +56,7 @@ class ModelLoader:
|
|
| 57 |
print(f"Downloading SVM model from Hugging Face repo: {repo_id}")
|
| 58 |
try:
|
| 59 |
# Download the model file from the Hub. It returns the cached path.
|
| 60 |
-
model_path = hf_hub_download(repo_id=repo_id, filename=filename
|
| 61 |
print(f"SVM model downloaded to: {model_path}")
|
| 62 |
|
| 63 |
# Load the model from the downloaded path
|
|
@@ -69,9 +68,9 @@ class ModelLoader:
|
|
| 69 |
|
| 70 |
# --- Global Model Instance ---
|
| 71 |
# This creates a single instance of the models that can be imported by other modules.
|
| 72 |
-
CLIP_MODEL_NAME =
|
| 73 |
-
SVM_REPO_ID =
|
| 74 |
-
SVM_FILENAME =
|
| 75 |
|
| 76 |
# This instance will be created when the application starts.
|
| 77 |
models = ModelLoader(
|
|
|
|
| 3 |
import joblib
|
| 4 |
from pathlib import Path
|
| 5 |
from huggingface_hub import hf_hub_download
|
|
|
|
| 6 |
|
| 7 |
class ModelLoader:
|
| 8 |
"""
|
|
|
|
| 56 |
print(f"Downloading SVM model from Hugging Face repo: {repo_id}")
|
| 57 |
try:
|
| 58 |
# Download the model file from the Hub. It returns the cached path.
|
| 59 |
+
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
|
| 60 |
print(f"SVM model downloaded to: {model_path}")
|
| 61 |
|
| 62 |
# Load the model from the downloaded path
|
|
|
|
| 68 |
|
| 69 |
# --- Global Model Instance ---
|
| 70 |
# This creates a single instance of the models that can be imported by other modules.
|
| 71 |
+
CLIP_MODEL_NAME = 'ViT-L/14'
|
| 72 |
+
SVM_REPO_ID = 'rhnsa/ai_human_image_detector'
|
| 73 |
+
SVM_FILENAME = 'svm_model_real.joblib' # The name of your model file in the Hugging Face repo
|
| 74 |
|
| 75 |
# This instance will be created when the application starts.
|
| 76 |
models = ModelLoader(
|
features/image_classifier/model_loader.py
CHANGED
|
@@ -1,21 +1,27 @@
|
|
| 1 |
import os
|
| 2 |
import shutil
|
| 3 |
import logging
|
|
|
|
|
|
|
| 4 |
from huggingface_hub import snapshot_download
|
| 5 |
-
from config import Config
|
| 6 |
-
|
| 7 |
-
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "-1")
|
| 8 |
-
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
|
| 9 |
|
| 10 |
# Model config
|
| 11 |
-
REPO_ID =
|
| 12 |
-
MODEL_DIR =
|
| 13 |
-
WEIGHTS_PATH = os.path.join(MODEL_DIR,
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
# Global model reference
|
| 17 |
_model_img = None
|
| 18 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
def warmup():
|
| 20 |
global _model_img
|
| 21 |
download_model_repo()
|
|
@@ -26,7 +32,7 @@ def download_model_repo():
|
|
| 26 |
if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
|
| 27 |
logging.info("Image model already exists, skipping download.")
|
| 28 |
return
|
| 29 |
-
snapshot_path = snapshot_download(repo_id=REPO_ID
|
| 30 |
os.makedirs(MODEL_DIR, exist_ok=True)
|
| 31 |
shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
|
| 32 |
|
|
@@ -35,17 +41,11 @@ def load_model():
|
|
| 35 |
if _model_img is not None:
|
| 36 |
return _model_img
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
class Cast(tf.keras.layers.Layer):
|
| 41 |
-
def call(self, inputs):
|
| 42 |
-
return tf.cast(inputs, tf.float32)
|
| 43 |
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
WEIGHTS_PATH, custom_objects={"Cast": Cast}
|
| 48 |
-
)
|
| 49 |
print("Model input shape:", _model_img.input_shape)
|
| 50 |
return _model_img
|
| 51 |
|
|
|
|
| 1 |
import os
|
| 2 |
import shutil
|
| 3 |
import logging
|
| 4 |
+
import tensorflow as tf
|
| 5 |
+
from tensorflow.keras.layers import Layer
|
| 6 |
from huggingface_hub import snapshot_download
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
|
| 8 |
# Model config
|
| 9 |
+
REPO_ID = "can-org/AI-VS-HUMAN-IMAGE-classifier"
|
| 10 |
+
MODEL_DIR = "./IMG_Models"
|
| 11 |
+
WEIGHTS_PATH = os.path.join(MODEL_DIR, "latest-my_cnn_model.h5")
|
| 12 |
+
|
| 13 |
+
# Device info (for logging)
|
| 14 |
+
gpus = tf.config.list_physical_devices("GPU")
|
| 15 |
+
device = "cuda" if gpus else "cpu"
|
| 16 |
|
| 17 |
# Global model reference
|
| 18 |
_model_img = None
|
| 19 |
|
| 20 |
+
# Custom layer used in the model
|
| 21 |
+
class Cast(Layer):
|
| 22 |
+
def call(self, inputs):
|
| 23 |
+
return tf.cast(inputs, tf.float32)
|
| 24 |
+
|
| 25 |
def warmup():
|
| 26 |
global _model_img
|
| 27 |
download_model_repo()
|
|
|
|
| 32 |
if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
|
| 33 |
logging.info("Image model already exists, skipping download.")
|
| 34 |
return
|
| 35 |
+
snapshot_path = snapshot_download(repo_id=REPO_ID)
|
| 36 |
os.makedirs(MODEL_DIR, exist_ok=True)
|
| 37 |
shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
|
| 38 |
|
|
|
|
| 41 |
if _model_img is not None:
|
| 42 |
return _model_img
|
| 43 |
|
| 44 |
+
print(f"{'GPU detected' if device == 'cuda' else 'No GPU detected'}, loading model on {device.upper()}.")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
+
_model_img = tf.keras.models.load_model(
|
| 47 |
+
WEIGHTS_PATH, custom_objects={"Cast": Cast}
|
| 48 |
+
)
|
|
|
|
|
|
|
| 49 |
print("Model input shape:", _model_img.input_shape)
|
| 50 |
return _model_img
|
| 51 |
|
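`download_model_repo` skips the download whenever `MODEL_DIR` already exists as a directory. If an earlier copy was interrupted, the directory could exist without the weights file, and `load_model` would then fail on `WEIGHTS_PATH`. One possible hardening, shown only as a sketch, is to test for the weights file itself. `needs_download` is a hypothetical helper and is not part of this PR.

```python
import os


def needs_download(model_dir, weights_file):
    """Return True when the weights file is missing, even if the directory exists.

    Hypothetical helper; the repository's `download_model_repo` only checks
    that the directory is present.
    """
    return not os.path.isfile(os.path.join(model_dir, weights_file))
```

With the values hardcoded in this diff, the call would be `needs_download("./IMG_Models", "latest-my_cnn_model.h5")`.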
features/image_edit_detector/controller.py
CHANGED
|
@@ -7,9 +7,8 @@ from .detectors.ela import run_ela
|
|
| 7 |
from .preprocess import preprocess_image
|
| 8 |
from fastapi import HTTPException,status,Depends
|
| 9 |
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
| 10 |
-
from config import Config
|
| 11 |
security=HTTPBearer()
|
| 12 |
-
|
| 13 |
async def process_image_ela(image_bytes: bytes, quality: int=90):
|
| 14 |
image = Image.open(io.BytesIO(image_bytes))
|
| 15 |
|
|
@@ -41,7 +40,7 @@ async def process_meta_image(image_bytes: bytes) -> dict:
|
|
| 41 |
|
| 42 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 43 |
token = credentials.credentials
|
| 44 |
-
expected_token =
|
| 45 |
if token != expected_token:
|
| 46 |
raise HTTPException(
|
| 47 |
status_code=status.HTTP_403_FORBIDDEN,
|
|
|
|
| 7 |
from .preprocess import preprocess_image
|
| 8 |
from fastapi import HTTPException,status,Depends
|
| 9 |
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
|
|
|
| 10 |
security=HTTPBearer()
|
| 11 |
+
import os
|
| 12 |
async def process_image_ela(image_bytes: bytes, quality: int=90):
|
| 13 |
image = Image.open(io.BytesIO(image_bytes))
|
| 14 |
|
|
|
|
| 40 |
|
| 41 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 42 |
token = credentials.credentials
|
| 43 |
+
expected_token = os.getenv("MY_SECRET_TOKEN")
|
| 44 |
if token != expected_token:
|
| 45 |
raise HTTPException(
|
| 46 |
status_code=status.HTTP_403_FORBIDDEN,
|
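`verify_token` now reads the expected value straight from the `MY_SECRET_TOKEN` environment variable and compares it with `!=`. When the variable is unset, `os.getenv` returns `None`, so every request is rejected. The sketch below shows one way the comparison could be written with the standard library's constant-time `hmac.compare_digest`, with the unset case handled explicitly. `token_is_valid` is a hypothetical helper, not code from this PR.

```python
import hmac
import os


def token_is_valid(presented, env=None):
    """Compare a bearer token against MY_SECRET_TOKEN in constant time.

    Hypothetical helper, not the PR's code; `env` exists only to ease testing.
    """
    env = os.environ if env is None else env
    expected = env.get("MY_SECRET_TOKEN")
    if not expected:
        # Unset or empty secret: reject everything rather than compare against None.
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside the existing dependency, the `if token != expected_token:` check would become `if not token_is_valid(token):`, with the `HTTPException` left as it is.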
features/nepali_text_classifier/controller.py
CHANGED
|
@@ -1,87 +1,23 @@
|
|
| 1 |
import asyncio
|
| 2 |
-
import hashlib
|
| 3 |
-
import logging
|
| 4 |
-
import random
|
| 5 |
from io import BytesIO
|
| 6 |
from fastapi import HTTPException, UploadFile, status, Depends
|
| 7 |
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
| 8 |
-
|
| 9 |
from features.nepali_text_classifier.inferencer import classify_text
|
| 10 |
from features.nepali_text_classifier.preprocess import *
|
| 11 |
import re
|
| 12 |
|
| 13 |
security = HTTPBearer()
|
| 14 |
|
| 15 |
-
|
| 16 |
-
def parse_selected_models(models: str | None) -> list[str] | None:
|
| 17 |
-
if not models:
|
| 18 |
-
return None
|
| 19 |
-
parsed = [m.strip() for m in models.split(",") if m.strip()]
|
| 20 |
-
return parsed[:2] if parsed else None
|
| 21 |
-
|
| 22 |
def contains_english(text: str) -> bool:
|
| 23 |
# Remove escape characters
|
| 24 |
cleaned = text.replace("\n", "").replace("\t", "")
|
| 25 |
return bool(re.search(r'[a-zA-Z]', cleaned))
|
| 26 |
|
| 27 |
|
| 28 |
-
def _clamp(value: float, lower: float, upper: float) -> float:
|
| 29 |
-
return max(lower, min(upper, value))
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
def _raw_ai_score(label: str, confidence: float) -> float:
|
| 33 |
-
conf = _clamp(float(confidence), 0.0, 100.0)
|
| 34 |
-
return conf if label == "AI" else (100.0 - conf)
|
| 35 |
-
|
| 36 |
-
def _sentence_bias_strength(overall_confidence: float) -> float:
|
| 37 |
-
# Equation: beta = min(0.15, 0.05 + 0.10 * (C_doc / 100))
|
| 38 |
-
return min(0.15, 0.05 + 0.10 * (_clamp(overall_confidence, 0.0, 100.0) / 100.0))
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
def _deterministic_jitter(seed_text: str, max_jitter: float = 3.0) -> float:
|
| 42 |
-
digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
|
| 43 |
-
seed_value = int.from_bytes(digest[:8], byteorder="big", signed=False)
|
| 44 |
-
rng = random.Random(seed_value)
|
| 45 |
-
return rng.uniform(-max_jitter, max_jitter)
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
def _add_likelihood_randomness(likelihood: float, seed_text: str, max_jitter: float = 3.0) -> float:
|
| 49 |
-
jitter = _deterministic_jitter(seed_text=seed_text, max_jitter=max_jitter)
|
| 50 |
-
return _clamp(likelihood + jitter, 50.0, 99.95)
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
def _biased_sentence_result(
|
| 54 |
-
sentence_result: dict,
|
| 55 |
-
overall_confidence: float,
|
| 56 |
-
target_label: str = "Human",
|
| 57 |
-
seed_text: str = "",
|
| 58 |
-
) -> dict:
|
| 59 |
-
raw_label = sentence_result["label"]
|
| 60 |
-
raw_confidence = float(sentence_result["confidence"])
|
| 61 |
-
raw_ai = _raw_ai_score(raw_label, raw_confidence)
|
| 62 |
-
|
| 63 |
-
target_ai = 100.0 if target_label == "AI" else 0.0
|
| 64 |
-
beta = _sentence_bias_strength(overall_confidence)
|
| 65 |
-
|
| 66 |
-
# Equation: S_biased = (1 - beta) * S_raw + beta * T
|
| 67 |
-
biased_ai = _clamp((1.0 - beta) * raw_ai + beta * target_ai, 0.0, 100.0)
|
| 68 |
-
# Force final label toward overall target to ensure overall bias is applied.
|
| 69 |
-
biased_label = target_label
|
| 70 |
-
biased_confidence = biased_ai if target_label == "AI" else (100.0 - biased_ai)
|
| 71 |
-
biased_confidence = _add_likelihood_randomness(
|
| 72 |
-
biased_confidence,
|
| 73 |
-
seed_text=f"{seed_text}|{target_label}|{round(overall_confidence, 2)}",
|
| 74 |
-
)
|
| 75 |
-
|
| 76 |
-
return {
|
| 77 |
-
"biased_label": biased_label,
|
| 78 |
-
"biased_confidence": round(biased_confidence, 2),
|
| 79 |
-
}
|
| 80 |
-
|
| 81 |
-
|
| 82 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 83 |
token = credentials.credentials
|
| 84 |
-
expected_token =
|
| 85 |
if token != expected_token:
|
| 86 |
raise HTTPException(
|
| 87 |
status_code=status.HTTP_403_FORBIDDEN,
|
|
@@ -89,16 +25,15 @@ async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(secur
|
|
| 89 |
)
|
| 90 |
return token
|
| 91 |
|
| 92 |
-
async def nepali_text_analysis(text: str
|
| 93 |
end_symbol_for_NP_text(text)
|
| 94 |
words = text.split()
|
| 95 |
if len(words) < 10:
|
| 96 |
raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
|
| 97 |
-
if len(text) >
|
| 98 |
-
raise HTTPException(status_code=413, detail="Text must be less than
|
| 99 |
|
| 100 |
-
|
| 101 |
-
result = await asyncio.to_thread(classify_text, text, selected_models, 2)
|
| 102 |
|
| 103 |
return result
|
| 104 |
|
|
@@ -116,19 +51,18 @@ async def extract_file_contents(file:UploadFile)-> str:
|
|
| 116 |
else:
|
| 117 |
raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
|
| 118 |
|
| 119 |
-
async def handle_file_upload(file: UploadFile
|
| 120 |
try:
|
| 121 |
file_contents = await extract_file_contents(file)
|
| 122 |
end_symbol_for_NP_text(file_contents)
|
| 123 |
-
if len(file_contents) >
|
| 124 |
-
raise HTTPException(status_code=413, detail="Text must be less than
|
| 125 |
|
| 126 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 127 |
if not cleaned_text:
|
| 128 |
raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
|
| 129 |
|
| 130 |
-
|
| 131 |
-
result = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
|
| 132 |
return result
|
| 133 |
except Exception as e:
|
| 134 |
logging.error(f"Error processing file: {e}")
|
|
@@ -136,45 +70,34 @@ async def handle_file_upload(file: UploadFile, models: str | None = None):
|
|
| 136 |
|
| 137 |
|
| 138 |
|
| 139 |
-
async def handle_sentence_level_analysis(text: str
|
| 140 |
text = text.strip()
|
| 141 |
-
if len(text) >
|
| 142 |
-
raise HTTPException(status_code=413, detail="Text must be less than
|
| 143 |
|
| 144 |
end_symbol_for_NP_text(text)
|
| 145 |
|
| 146 |
# Split text into sentences
|
| 147 |
sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
|
| 148 |
-
selected_models = parse_selected_models(models)
|
| 149 |
-
|
| 150 |
-
overall = await asyncio.to_thread(classify_text, text, selected_models, 2)
|
| 151 |
-
overall_label = overall["label"]
|
| 152 |
-
overall_confidence = float(overall["confidence"])
|
| 153 |
|
| 154 |
results = []
|
| 155 |
for sentence in sentences:
|
| 156 |
end_symbol_for_NP_text(sentence)
|
| 157 |
-
result = await asyncio.to_thread(classify_text, sentence
|
| 158 |
-
biased = _biased_sentence_result(
|
| 159 |
-
result,
|
| 160 |
-
overall_confidence,
|
| 161 |
-
target_label=overall_label,
|
| 162 |
-
seed_text=sentence,
|
| 163 |
-
)
|
| 164 |
results.append({
|
| 165 |
"text": sentence,
|
| 166 |
-
"result":
|
| 167 |
-
"likelihood":
|
| 168 |
})
|
| 169 |
|
| 170 |
return {"analysis": results}
|
| 171 |
|
| 172 |
|
| 173 |
-
async def handle_file_sentence(file:UploadFile
|
| 174 |
try:
|
| 175 |
file_contents = await extract_file_contents(file)
|
| 176 |
-
if len(file_contents) >
|
| 177 |
-
raise HTTPException(status_code=413, detail="Text must be less than
|
| 178 |
|
| 179 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 180 |
if not cleaned_text:
|
|
@@ -183,27 +106,16 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
|
|
| 183 |
|
| 184 |
# Split text into sentences
|
| 185 |
sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
|
| 186 |
-
selected_models = parse_selected_models(models)
|
| 187 |
-
|
| 188 |
-
overall = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
|
| 189 |
-
overall_label = overall["label"]
|
| 190 |
-
overall_confidence = float(overall["confidence"])
|
| 191 |
|
| 192 |
results = []
|
| 193 |
for sentence in sentences:
|
| 194 |
end_symbol_for_NP_text(sentence)
|
| 195 |
|
| 196 |
-
result = await asyncio.to_thread(classify_text, sentence
|
| 197 |
-
biased = _biased_sentence_result(
|
| 198 |
-
result,
|
| 199 |
-
overall_confidence,
|
| 200 |
-
target_label=overall_label,
|
| 201 |
-
seed_text=sentence,
|
| 202 |
-
)
|
| 203 |
results.append({
|
| 204 |
"text": sentence,
|
| 205 |
-
"result":
|
| 206 |
-
"likelihood":
|
| 207 |
})
|
| 208 |
|
| 209 |
return {"analysis": results}
|
|
@@ -213,7 +125,6 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
|
|
| 213 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 214 |
|
| 215 |
|
| 216 |
-
def classify(text: str
|
| 217 |
-
|
| 218 |
-
return classify_text(text, selected_models, 2)
|
| 219 |
|
|
|
|
| 1 |
import asyncio
|
| 2 |
from io import BytesIO
|
| 3 |
from fastapi import HTTPException, UploadFile, status, Depends
|
| 4 |
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
| 5 |
+
import os
|
| 6 |
from features.nepali_text_classifier.inferencer import classify_text
|
| 7 |
from features.nepali_text_classifier.preprocess import *
|
| 8 |
import re
|
| 9 |
|
| 10 |
security = HTTPBearer()
|
| 11 |
|
| 12 |
def contains_english(text: str) -> bool:
|
| 13 |
# Remove escape characters
|
| 14 |
cleaned = text.replace("\n", "").replace("\t", "")
|
| 15 |
return bool(re.search(r'[a-zA-Z]', cleaned))
|
| 16 |
|
| 17 |
|
| 18 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 19 |
token = credentials.credentials
|
| 20 |
+
expected_token = os.getenv("MY_SECRET_TOKEN")
|
| 21 |
if token != expected_token:
|
| 22 |
raise HTTPException(
|
| 23 |
status_code=status.HTTP_403_FORBIDDEN,
|
|
|
|
| 25 |
)
|
| 26 |
return token
|
| 27 |
|
| 28 |
+
async def nepali_text_analysis(text: str):
|
| 29 |
end_symbol_for_NP_text(text)
|
| 30 |
words = text.split()
|
| 31 |
if len(words) < 10:
|
| 32 |
raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
|
| 33 |
+
if len(text) > 10000:
|
| 34 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
| 35 |
|
| 36 |
+
result = await asyncio.to_thread(classify_text, text)
|
|
|
|
| 37 |
|
| 38 |
return result
|
| 39 |
|
|
|
|
| 51 |
else:
|
| 52 |
raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
|
| 53 |
|
| 54 |
+
async def handle_file_upload(file: UploadFile):
|
| 55 |
try:
|
| 56 |
file_contents = await extract_file_contents(file)
|
| 57 |
end_symbol_for_NP_text(file_contents)
|
| 58 |
+
if len(file_contents) > 10000:
|
| 59 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
| 60 |
|
| 61 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 62 |
if not cleaned_text:
|
| 63 |
raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
|
| 64 |
|
| 65 |
+
result = await asyncio.to_thread(classify_text, cleaned_text)
|
|
|
|
| 66 |
return result
|
| 67 |
except Exception as e:
|
| 68 |
logging.error(f"Error processing file: {e}")
|
|
|
|
| 70 |
|
| 71 |
|
| 72 |
|
| 73 |
+
async def handle_sentence_level_analysis(text: str):
|
| 74 |
text = text.strip()
|
| 75 |
+
if len(text) > 10000:
|
| 76 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
| 77 |
|
| 78 |
end_symbol_for_NP_text(text)
|
| 79 |
|
| 80 |
# Split text into sentences
|
| 81 |
sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
|
| 82 |
|
| 83 |
results = []
|
| 84 |
for sentence in sentences:
|
| 85 |
end_symbol_for_NP_text(sentence)
|
| 86 |
+
result = await asyncio.to_thread(classify_text, sentence)
|
| 87 |
results.append({
|
| 88 |
"text": sentence,
|
| 89 |
+
"result": result["label"],
|
| 90 |
+
"likelihood": result["confidence"]
|
| 91 |
})
|
| 92 |
|
| 93 |
return {"analysis": results}
|
| 94 |
|
| 95 |
|
| 96 |
+
async def handle_file_sentence(file:UploadFile):
|
| 97 |
try:
|
| 98 |
file_contents = await extract_file_contents(file)
|
| 99 |
+
if len(file_contents) > 10000:
|
| 100 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
| 101 |
|
| 102 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 103 |
if not cleaned_text:
|
|
|
|
| 106 |
|
| 107 |
# Split text into sentences
|
| 108 |
sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
|
| 109 |
|
| 110 |
results = []
|
| 111 |
for sentence in sentences:
|
| 112 |
end_symbol_for_NP_text(sentence)
|
| 113 |
|
| 114 |
+
result = await asyncio.to_thread(classify_text, sentence)
|
| 115 |
results.append({
|
| 116 |
"text": sentence,
|
| 117 |
+
"result": result["label"],
|
| 118 |
+
"likelihood": result["confidence"]
|
| 119 |
})
|
| 120 |
|
| 121 |
return {"analysis": results}
|
|
|
|
| 125 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 126 |
|
| 127 |
|
| 128 |
+
def classify(text: str):
|
| 129 |
+
return classify_text(text)
|
|
|
|
| 130 |
|
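Review note on `verify_token` in the controller above: the check `token != expected_token` already rejects every request when `MY_SECRET_TOKEN` is unset, because a string never equals `None`. A minimal sketch of a slightly stricter variant follows. It makes the unset case explicit and swaps in a constant-time comparison. The helper name `token_is_valid` is hypothetical and not part of this PR.

```python
import hmac
from typing import Optional


def token_is_valid(presented: str, expected: Optional[str]) -> bool:
    """Return True only when a secret is configured and both tokens match."""
    if not expected:
        # Fail closed: an unset or empty secret never authenticates anyone.
        return False
    # compare_digest takes time independent of where the two values differ.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside `verify_token`, the result of such a helper could decide whether to raise the existing 403 `HTTPException`; the response code and message would stay as they are in the diff.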
features/nepali_text_classifier/inferencer.py
CHANGED
|
@@ -1,89 +1,23 @@
|
|
| 1 |
-
import
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
from .model_loader import get_default_top_models, load_artifacts
|
| 6 |
|
| 7 |
|
| 8 |
-
|
| 9 |
|
|
|
|
| 10 |
|
| 11 |
-
def normalize_nepali_text(text: str) -> str:
|
| 12 |
-
text = str(text)
|
| 13 |
-
text = re.sub(r"https?://\S+|www\.\S+", " ", text)
|
| 14 |
-
text = re.sub(r"[^\u0900-\u097F\s।!?,]", " ", text)
|
| 15 |
-
return re.sub(r"\s+", " ", text).strip()
|
| 16 |
|
| 17 |
|
| 18 |
-
def _select_models(models, model_names=None, top_k=2):
|
| 19 |
-
_ = model_names
|
| 20 |
-
ranked = [name for name in get_default_top_models(top_k=top_k) if name in models]
|
| 21 |
-
if ranked:
|
| 22 |
-
return ranked[:top_k]
|
| 23 |
-
return list(models.keys())[:top_k]
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
def classify_text(text: str, model_names="Logistic Regression", top_k: int = 1):
|
| 27 |
-
artifacts = load_artifacts()
|
| 28 |
-
models = artifacts["models"]
|
| 29 |
-
if not models:
|
| 30 |
-
return {"error": "No models available for inference"}
|
| 31 |
-
|
| 32 |
-
cleaned_text = normalize_nepali_text(text)
|
| 33 |
-
word_features = artifacts["word_vectorizer"].transform([cleaned_text])
|
| 34 |
-
char_features = artifacts["char_vectorizer"].transform([cleaned_text])
|
| 35 |
-
rich_features = artifacts["rich_transformer"].transform([cleaned_text])
|
| 36 |
-
features = hstack([word_features, char_features, csr_matrix(rich_features)])
|
| 37 |
-
|
| 38 |
-
selected_names = _select_models(models, model_names=model_names, top_k=TOP_K_MODELS)
|
| 39 |
-
dense_models = {"Linear SVC"}
|
| 40 |
-
|
| 41 |
-
per_model = []
|
| 42 |
-
ai_votes = 0
|
| 43 |
-
human_votes = 0
|
| 44 |
-
confidence_sum = 0.0
|
| 45 |
-
|
| 46 |
-
for name in selected_names:
|
| 47 |
-
model = models[name]
|
| 48 |
-
model_input = features.toarray() if name in dense_models else features
|
| 49 |
-
pred = int(model.predict(model_input)[0])
|
| 50 |
-
confidence = None
|
| 51 |
-
if hasattr(model, "predict_proba"):
|
| 52 |
-
probs = model.predict_proba(model_input)
|
| 53 |
-
confidence = float(probs[0][pred])
|
| 54 |
-
elif hasattr(model, "decision_function"):
|
| 55 |
-
score = float(model.decision_function(model_input)[0])
|
| 56 |
-
confidence = abs(score) / (1.0 + abs(score))
|
| 57 |
-
else:
|
| 58 |
-
confidence = 0.5
|
| 59 |
-
|
| 60 |
-
if pred == 1:
|
| 61 |
-
ai_votes += 1
|
| 62 |
-
label = "AI"
|
| 63 |
-
else:
|
| 64 |
-
human_votes += 1
|
| 65 |
-
label = "Human"
|
| 66 |
-
|
| 67 |
-
confidence_sum += confidence
|
| 68 |
-
per_model.append(
|
| 69 |
-
{
|
| 70 |
-
"model": name,
|
| 71 |
-
"label": label,
|
| 72 |
-
"confidence": round(confidence * 100, 2),
|
| 73 |
-
}
|
| 74 |
-
)
|
| 75 |
-
|
| 76 |
-
final_label = "AI" if ai_votes > human_votes else "Human"
|
| 77 |
-
if ai_votes == human_votes:
|
| 78 |
-
final_label = per_model[0]["label"]
|
| 79 |
-
|
| 80 |
-
avg_conf = confidence_sum / max(len(per_model), 1)
|
| 81 |
-
return {
|
| 82 |
-
"label": final_label,
|
| 83 |
-
"confidence": round(avg_conf * 100, 2),
|
| 84 |
-
# "selected_models": selected_names,
|
| 85 |
-
# "model_predictions": per_model,
|
| 86 |
-
# "votes": {"AI": ai_votes, "Human": human_votes},
|
| 87 |
-
# "available_models": list(models.keys()),
|
| 88 |
-
# "unavailable_models": artifacts["unavailable_models"],
|
| 89 |
-
}
|
|
|
|
| 1 |
+
import torch
|
| 2 |
+
from .model_loader import get_model_tokenizer
|
| 3 |
+
import torch.nn.functional as F
|
| 4 |
|
| 5 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 6 |
|
|
|
|
| 7 |
|
| 8 |
+
def classify_text(text: str):
|
| 9 |
+
model, tokenizer = get_model_tokenizer()
|
| 10 |
+
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
|
| 11 |
+
inputs = {k: v.to(device) for k, v in inputs.items()}
|
| 12 |
|
| 13 |
+
with torch.no_grad():
|
| 14 |
+
outputs = model(**inputs)
|
| 15 |
+
logits = outputs if isinstance(outputs, torch.Tensor) else outputs.logits
|
| 16 |
+
probs = F.softmax(logits, dim=1)
|
| 17 |
+
pred = torch.argmax(probs, dim=1).item()
|
| 18 |
+
prob_percent = probs[0][pred].item() * 100
|
| 19 |
|
| 20 |
+
return {"label": "Human" if pred == 0 else "AI", "confidence": round(prob_percent, 2)}
|
| 21 |
|
| 22 |
|
| 23 |
|
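Review note on the new `classify_text` above: it turns the model's two logits into a label and a percentage through softmax, argmax, and rounding. The dependency-free sketch below reproduces that arithmetic for a single row of logits, so the mapping (index 0 is "Human", anything else is "AI") can be checked without loading torch. The function name `logits_to_result` is an assumption made for illustration only.

```python
import math
from typing import Dict, List, Union


def logits_to_result(logits: List[float]) -> Dict[str, Union[str, float]]:
    """Mirror the softmax -> argmax -> percentage steps for one row of logits."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Position of the highest probability (the first one wins on ties).
    pred = probs.index(max(probs))
    # Index 0 is "Human" and any other index is "AI", as in the diff.
    label = "Human" if pred == 0 else "AI"
    return {"label": label, "confidence": round(probs[pred] * 100, 2)}
```

Because the reported confidence is the probability of the winning class, a two-class result can never fall below 50.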
features/nepali_text_classifier/model_loader.py
CHANGED
|
@@ -1,237 +1,54 @@
|
|
| 1 |
-
import
|
| 2 |
-
import pickle
|
| 3 |
-
import re
|
| 4 |
import shutil
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
import
|
| 9 |
-
import pandas as pd
|
| 10 |
from huggingface_hub import snapshot_download
|
| 11 |
-
from
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
class NepaliRichFeatures:
|
| 58 |
-
"""Burstiness + stylometry feature extractor used during model training."""
|
| 59 |
-
|
| 60 |
-
@staticmethod
|
| 61 |
-
def extract_burstiness(text: str) -> dict:
|
| 62 |
-
sentences = [s.strip() for s in re.split(r"[।!?]", str(text)) if s.strip()]
|
| 63 |
-
if not sentences:
|
| 64 |
-
return {
|
| 65 |
-
"burst_mean": 0.0,
|
| 66 |
-
"burst_std": 0.0,
|
| 67 |
-
"burst_max": 0.0,
|
| 68 |
-
"burst_min": 0.0,
|
| 69 |
-
"burst_range": 0.0,
|
| 70 |
-
}
|
| 71 |
-
lengths = [len(s.split()) for s in sentences]
|
| 72 |
-
return {
|
| 73 |
-
"burst_mean": float(np.mean(lengths)),
|
| 74 |
-
"burst_std": float(np.std(lengths)),
|
| 75 |
-
"burst_max": float(np.max(lengths)),
|
| 76 |
-
"burst_min": float(np.min(lengths)),
|
| 77 |
-
"burst_range": float(np.max(lengths) - np.min(lengths)),
|
| 78 |
-
}
|
| 79 |
-
|
| 80 |
-
@staticmethod
|
| 81 |
-
def extract_stylometry(text: str) -> dict:
|
| 82 |
-
words = str(text).split()
|
| 83 |
-
num_words = max(len(words), 1)
|
| 84 |
-
num_chars = max(len(str(text)), 1)
|
| 85 |
-
num_sentences = max(
|
| 86 |
-
len([s for s in re.split(r"[।!?]", str(text)) if s.strip()]), 1
|
| 87 |
-
)
|
| 88 |
-
avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
|
| 89 |
-
avg_sent_len = num_words / num_sentences
|
| 90 |
-
lexical_diversity = len(set(words)) / num_words
|
| 91 |
-
punct_count = (
|
| 92 |
-
str(text).count("।")
|
| 93 |
-
+ str(text).count("?")
|
| 94 |
-
+ str(text).count("!")
|
| 95 |
-
+ str(text).count(",")
|
| 96 |
-
)
|
| 97 |
-
punct_ratio = punct_count / num_chars
|
| 98 |
-
bigrams = [" ".join(words[i : i + 2]) for i in range(len(words) - 1)]
|
| 99 |
-
rep_bigram_ratio = (
|
| 100 |
-
(1.0 - len(set(bigrams)) / max(len(bigrams), 1)) if bigrams else 0.0
|
| 101 |
-
)
|
| 102 |
-
diacritic_count = sum(1 for c in str(text) if "\u093e" <= c <= "\u094d")
|
| 103 |
-
diacritic_ratio = diacritic_count / num_chars
|
| 104 |
-
return {
|
| 105 |
-
"num_words": num_words,
|
| 106 |
-
"num_chars": num_chars,
|
| 107 |
-
"num_sentences": num_sentences,
|
| 108 |
-
"avg_word_len": avg_word_len,
|
| 109 |
-
"avg_sent_len": avg_sent_len,
|
| 110 |
-
"lexical_diversity": lexical_diversity,
|
| 111 |
-
"punct_ratio": punct_ratio,
|
| 112 |
-
"rep_bigram_ratio": rep_bigram_ratio,
|
| 113 |
-
"diacritic_ratio": diacritic_ratio,
|
| 114 |
-
}
|
| 115 |
-
|
| 116 |
-
def transform(self, texts):
|
| 117 |
-
if isinstance(texts, str):
|
| 118 |
-
texts = [texts]
|
| 119 |
-
rows = []
|
| 120 |
-
for text in texts:
|
| 121 |
-
row = {**self.extract_burstiness(text), **self.extract_stylometry(text)}
|
| 122 |
-
rows.append(row)
|
| 123 |
-
return pd.DataFrame(rows).values.astype(np.float32)
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
def _repo_root() -> Path:
|
| 127 |
-
return Path(__file__).resolve().parents[2]
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
def _has_required_artifacts(path: Path) -> bool:
|
| 131 |
-
if not path.exists() or not path.is_dir():
|
| 132 |
-
return False
|
| 133 |
-
has_base = all((path / filename).exists() for filename in REQUIRED_BASE_FILES)
|
| 134 |
-
has_any_model = any((path / filename).exists() for filename in MODEL_FILES.values())
|
| 135 |
-
return has_base and has_any_model
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
def _candidate_model_dirs() -> list[Path]:
|
| 139 |
-
candidates = []
|
| 140 |
-
repo = _repo_root()
|
| 141 |
-
|
| 142 |
-
if Config.Nepali_model_folder:
|
| 143 |
-
custom = Path(Config.Nepali_model_folder)
|
| 144 |
-
candidates.extend([custom, custom / NEPALI_SUBDIR])
|
| 145 |
-
|
| 146 |
-
default_dir = repo / "features" / "Model" / "Nepali_model"
|
| 147 |
-
candidates.extend([default_dir, default_dir / NEPALI_SUBDIR])
|
| 148 |
-
candidates.append(
|
| 149 |
-
repo / "notebook" / "ai_vs_human_nepali" / "final_model" / "saved_models"
|
| 150 |
-
)
|
| 151 |
-
return candidates
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
def _download_nepali_artifacts() -> None:
|
| 155 |
-
if not REPO_ID:
|
| 156 |
-
raise ValueError("English_model repo id is not configured")
|
| 157 |
-
|
| 158 |
-
repo = _repo_root()
|
| 159 |
-
target_dir = (
|
| 160 |
-
Path(Config.Nepali_model_folder)
|
| 161 |
-
if Config.Nepali_model_folder
|
| 162 |
-
else repo / "features" / "Model" / "Nepali_model"
|
| 163 |
-
)
|
| 164 |
-
|
| 165 |
-
snapshot_path = Path(snapshot_download(repo_id=REPO_ID, token=HF_TOKEN))
|
| 166 |
-
source_dir = (
|
| 167 |
-
snapshot_path / NEPALI_SUBDIR
|
| 168 |
-
if (snapshot_path / NEPALI_SUBDIR).is_dir()
|
| 169 |
-
else snapshot_path
|
| 170 |
-
)
|
| 171 |
-
|
| 172 |
-
target_dir.mkdir(parents=True, exist_ok=True)
|
| 173 |
-
shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
def resolve_model_dir() -> Path:
|
| 177 |
-
for path in _candidate_model_dirs():
|
| 178 |
-
if _has_required_artifacts(path):
|
| 179 |
-
return path
|
| 180 |
-
|
| 181 |
-
LOGGER.info("Nepali artifacts not found locally; downloading from %s", REPO_ID)
|
| 182 |
-
_download_nepali_artifacts()
|
| 183 |
-
|
| 184 |
-
for path in _candidate_model_dirs():
|
| 185 |
-
if _has_required_artifacts(path):
|
| 186 |
-
return path
|
| 187 |
-
|
| 188 |
-
raise FileNotFoundError(
|
| 189 |
-
"Nepali model directory not found. Set Nepali_model env or add expected artifacts."
|
| 190 |
-
)
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
@lru_cache(maxsize=1)
|
| 194 |
-
def load_artifacts():
|
| 195 |
-
model_dir = resolve_model_dir()
|
| 196 |
-
LOGGER.info("Loading Nepali artifacts from %s", model_dir)
|
| 197 |
-
|
| 198 |
-
models = {}
|
| 199 |
-
unavailable = {}
|
| 200 |
-
for model_name, file_name in MODEL_FILES.items():
|
| 201 |
-
if model_name in SKIP_MODELS:
|
| 202 |
-
unavailable[model_name] = "Skipped due to large artifact size"
|
| 203 |
-
continue
|
| 204 |
-
file_path = model_dir / file_name
|
| 205 |
-
if not file_path.exists():
|
| 206 |
-
unavailable[model_name] = "Missing model file"
|
| 207 |
-
continue
|
| 208 |
-
with open(file_path, "rb") as fp:
|
| 209 |
-
models[model_name] = _patch_legacy_logistic_model(pickle.load(fp))
|
| 210 |
-
|
| 211 |
-
with open(model_dir / "word_vectorizer.pkl", "rb") as fp:
|
| 212 |
-
word_vectorizer = pickle.load(fp)
|
| 213 |
-
with open(model_dir / "char_vectorizer.pkl", "rb") as fp:
|
| 214 |
-
char_vectorizer = pickle.load(fp)
|
| 215 |
-
|
| 216 |
-
rich_transformer = NepaliRichFeatures()
|
| 217 |
-
return {
|
| 218 |
-
"model_dir": str(model_dir),
|
| 219 |
-
"models": models,
|
| 220 |
-
"unavailable_models": unavailable,
|
| 221 |
-
"word_vectorizer": word_vectorizer,
|
| 222 |
-
"char_vectorizer": char_vectorizer,
|
| 223 |
-
"rich_transformer": rich_transformer,
|
| 224 |
-
}
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
def get_available_models():
|
| 228 |
-
artifacts = load_artifacts()
|
| 229 |
-
return list(artifacts["models"].keys())
|
| 230 |
-
|
| 231 |
|
| 232 |
-
def get_default_top_models(top_k: int = 2):
|
| 233 |
-
available = set(get_available_models())
|
| 234 |
-
ranked = [name for name in DEFAULT_MODEL_RANKING if name in available]
|
| 235 |
-
if not ranked:
|
| 236 |
-
return list(available)[:top_k]
|
| 237 |
-
return ranked[: max(1, top_k)]
|
|
|
|
| 1 |
+
import os
|
| 2 |
import shutil
|
| 3 |
+
import torch
|
| 4 |
+
import torch.nn as nn
|
| 5 |
+
import torch.nn.functional as F
|
| 6 |
+
import logging
|
|
|
|
| 7 |
from huggingface_hub import snapshot_download
|
| 8 |
+
from transformers import AutoTokenizer, AutoModel
|
| 9 |
+
|
| 10 |
+
# Configs
|
| 11 |
+
REPO_ID = "can-org/Nepali-AI-VS-HUMAN"
|
| 12 |
+
BASE_DIR = "./np_text_model"
|
| 13 |
+
TOKENIZER_DIR = os.path.join(BASE_DIR, "classifier") # <- update this to match your uploaded folder
|
| 14 |
+
WEIGHTS_PATH = os.path.join(BASE_DIR, "model_95_acc.pth") # <- change to match actual uploaded weight
|
| 15 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 16 |
+
|
| 17 |
+
# Define model class
|
| 18 |
+
class XLMRClassifier(nn.Module):
|
| 19 |
+
def __init__(self):
|
| 20 |
+
super(XLMRClassifier, self).__init__()
|
| 21 |
+
self.bert = AutoModel.from_pretrained("xlm-roberta-base")
|
| 22 |
+
self.classifier = nn.Linear(self.bert.config.hidden_size, 2)
|
| 23 |
+
|
| 24 |
+
def forward(self, input_ids, attention_mask):
|
| 25 |
+
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
|
| 26 |
+
cls_output = outputs.last_hidden_state[:, 0, :]
|
| 27 |
+
return self.classifier(cls_output)
|
| 28 |
+
|
| 29 |
+
# Globals for caching
|
| 30 |
+
_model = None
|
| 31 |
+
_tokenizer = None
|
| 32 |
+
|
| 33 |
+
def download_model_repo():
|
| 34 |
+
if os.path.exists(BASE_DIR) and os.path.isdir(BASE_DIR):
|
| 35 |
+
logging.info("Model already downloaded.")
|
| 36 |
+
return
|
| 37 |
+
snapshot_path = snapshot_download(repo_id=REPO_ID)
|
| 38 |
+
os.makedirs(BASE_DIR, exist_ok=True)
|
| 39 |
+
shutil.copytree(snapshot_path, BASE_DIR, dirs_exist_ok=True)
|
| 40 |
+
|
| 41 |
+
def load_model():
|
| 42 |
+
download_model_repo()
|
| 43 |
+
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
|
| 44 |
+
model = XLMRClassifier().to(device)
|
| 45 |
+
model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
|
| 46 |
+
model.eval()
|
| 47 |
+
return model, tokenizer
|
| 48 |
+
|
| 49 |
+
def get_model_tokenizer():
|
| 50 |
+
global _model, _tokenizer
|
| 51 |
+
if _model is None or _tokenizer is None:
|
| 52 |
+
_model, _tokenizer = load_model()
|
| 53 |
+
return _model, _tokenizer
|
| 54 |
|
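Review note on `get_model_tokenizer` above: the controller calls `classify_text` through `asyncio.to_thread`, so several worker threads may reach the `_model is None` check at the same time and each start loading the weights. The sketch below shows one common way to guard a lazy cache with a lock. It assumes nothing about the real loader; `get_cached` and its `loader` argument are hypothetical names.

```python
import threading
from typing import Callable, Optional, Tuple

_lock = threading.Lock()
_cached: Optional[Tuple[object, object]] = None


def get_cached(loader: Callable[[], Tuple[object, object]]) -> Tuple[object, object]:
    """Return the cached (model, tokenizer) pair, loading it at most once."""
    global _cached
    if _cached is None:  # fast path: no locking once the pair is loaded
        with _lock:
            if _cached is None:  # re-check: another thread may have loaded it
                _cached = loader()
    return _cached
```

In the PR's module, `load_model` would play the role of `loader`. The same caution may apply to `download_model_repo`, which treats an existing `BASE_DIR` as a finished download.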
features/nepali_text_classifier/preprocess.py
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
-
|
| 2 |
import docx
|
| 3 |
from io import BytesIO
|
| 4 |
import logging
|
| 5 |
from fastapi import HTTPException
|
| 6 |
-
|
| 7 |
|
| 8 |
def parse_docx(file: BytesIO):
|
| 9 |
doc = docx.Document(file)
|
|
@@ -15,10 +15,11 @@ def parse_docx(file: BytesIO):
|
|
| 15 |
|
| 16 |
def parse_pdf(file: BytesIO):
|
| 17 |
try:
|
| 18 |
-
doc =
|
| 19 |
text = ""
|
| 20 |
-
for
|
| 21 |
-
|
|
|
|
| 22 |
return text
|
| 23 |
except Exception as e:
|
| 24 |
logging.error(f"Error while processing PDF: {str(e)}")
|
|
|
|
| 1 |
+
import fitz # PyMuPDF
|
| 2 |
import docx
|
| 3 |
from io import BytesIO
|
| 4 |
import logging
|
| 5 |
from fastapi import HTTPException
|
| 6 |
+
|
| 7 |
|
| 8 |
def parse_docx(file: BytesIO):
|
| 9 |
doc = docx.Document(file)
|
|
|
|
| 15 |
|
| 16 |
def parse_pdf(file: BytesIO):
|
| 17 |
try:
|
| 18 |
+
doc = fitz.open(stream=file, filetype="pdf")
|
| 19 |
text = ""
|
| 20 |
+
for page_num in range(doc.page_count):
|
| 21 |
+
page = doc.load_page(page_num)
|
| 22 |
+
text += page.get_text()
|
| 23 |
return text
|
| 24 |
except Exception as e:
|
| 25 |
logging.error(f"Error while processing PDF: {str(e)}")
|
features/nepali_text_classifier/routes.py
CHANGED
|
@@ -15,42 +15,27 @@ security = HTTPBearer()
|
|
| 15 |
# Input schema
|
| 16 |
class TextInput(BaseModel):
|
| 17 |
text: str
|
| 18 |
-
models: list[str] | None = None
|
| 19 |
|
| 20 |
@router.post("/analyse")
|
| 21 |
@limiter.limit(ACCESS_RATE)
|
| 22 |
async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
|
| 23 |
-
|
| 24 |
-
result = await nepali_text_analysis(data.text, selected)
|
| 25 |
return result
|
| 26 |
|
| 27 |
@router.post("/upload")
|
| 28 |
@limiter.limit(ACCESS_RATE)
|
| 29 |
-
async def upload_file(request:Request,file:UploadFile=File(...),
|
| 30 |
-
return await handle_file_upload(file
|
| 31 |
|
| 32 |
@router.post("/analyse-sentences")
|
| 33 |
@limiter.limit(ACCESS_RATE)
|
| 34 |
async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
|
| 35 |
-
|
| 36 |
-
return await handle_sentence_level_analysis(data.text, selected)
|
| 37 |
|
| 38 |
@router.post("/file-sentences-analyse")
|
| 39 |
@limiter.limit(ACCESS_RATE)
|
| 40 |
-
async def analyze_sentance_file(request: Request, file: UploadFile = File(...),
|
| 41 |
-
return await handle_file_sentence(file
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
@router.get("/models")
|
| 45 |
-
@limiter.limit(ACCESS_RATE)
|
| 46 |
-
def get_models(request: Request):
|
| 47 |
-
from .model_loader import get_available_models, get_default_top_models
|
| 48 |
-
|
| 49 |
-
available = get_available_models()
|
| 50 |
-
return {
|
| 51 |
-
"available_models": available,
|
| 52 |
-
"default_top_2": get_default_top_models(2),
|
| 53 |
-
}
|
| 54 |
|
| 55 |
|
| 56 |
@router.get("/health")
|
|
|
|
| 15 |
# Input schema
|
| 16 |
class TextInput(BaseModel):
|
| 17 |
text: str
|
|
|
|
| 18 |
|
| 19 |
@router.post("/analyse")
|
| 20 |
@limiter.limit(ACCESS_RATE)
|
| 21 |
async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
|
| 22 |
+
result = classify_text(data.text)
|
|
|
|
| 23 |
return result
|
| 24 |
|
| 25 |
@router.post("/upload")
|
| 26 |
@limiter.limit(ACCESS_RATE)
|
| 27 |
+
async def upload_file(request:Request,file:UploadFile=File(...),token:str=Depends(security)):
|
| 28 |
+
return await handle_file_upload(file)
|
| 29 |
|
| 30 |
@router.post("/analyse-sentences")
|
| 31 |
@limiter.limit(ACCESS_RATE)
|
| 32 |
async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
|
| 33 |
+
return await handle_sentence_level_analysis(data.text)
|
|
|
|
| 34 |
|
| 35 |
@router.post("/file-sentences-analyse")
|
| 36 |
@limiter.limit(ACCESS_RATE)
|
| 37 |
+
async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(security)):
|
| 38 |
+
return await handle_file_sentence(file)
|
| 39 |
|
| 40 |
|
| 41 |
@router.get("/health")
|
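Review note on the new `/analyse` route above: it calls `classify_text(data.text)` directly inside an `async` handler, while the controller's `nepali_text_analysis` validates the length and runs the same call through `asyncio.to_thread`. The sketch below illustrates the difference with a stand-in for the model call; `slow_classify` and its fixed return value are hypothetical and only simulate a blocking function.

```python
import asyncio
import time
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def slow_classify(text: str) -> Result:
    """Hypothetical stand-in for a blocking model call."""
    time.sleep(0.05)
    return {"label": "Human", "confidence": 50.0}


async def analyse(text: str) -> Result:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_classify, text)


async def main() -> List[Result]:
    # Both requests are in flight at once instead of running back to back.
    return await asyncio.gather(analyse("a"), analyse("b"))
```

Routing `/analyse` through `nepali_text_analysis`, as the removed line did, would keep the 10-word and 10,000-character checks in one place.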
features/rag_chatbot/__init__.py
DELETED
|
File without changes
|
features/rag_chatbot/controller.py
DELETED
|
@@ -1,178 +0,0 @@
|
|
| 1 |
-
import asyncio
|
| 2 |
-
import logging
|
| 3 |
-
from typing import Dict, Any
|
| 4 |
-
|
| 5 |
-
from fastapi import HTTPException, UploadFile, status, Depends
|
| 6 |
-
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
| 7 |
-
from config import Config
|
| 8 |
-
|
| 9 |
-
from .rag_pipeline import route_and_process_query, add_document_to_rag, check_system_health
|
| 10 |
-
from .document_handler import extract_text_from_file
|
| 11 |
-
|
| 12 |
-
# Configure logging
|
| 13 |
-
logging.basicConfig(level=logging.INFO)
|
| 14 |
-
logger = logging.getLogger(__name__)
|
| 15 |
-
|
| 16 |
-
security = HTTPBearer()
|
| 17 |
-
|
| 18 |
-
# Supported file types
|
| 19 |
-
SUPPORTED_CONTENT_TYPES = Config.RAG_SUPPORTED_CONTENT_TYPES
|
| 20 |
-
|
| 21 |
-
MAX_FILE_SIZE = Config.RAG_MAX_FILE_SIZE
|
| 22 |
-
MAX_QUERY_LENGTH = Config.RAG_MAX_QUERY_LENGTH
|
| 23 |
-
|
| 24 |
-
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 25 |
-
"""Verify Bearer token from Authorization header."""
|
| 26 |
-
token = credentials.credentials
|
| 27 |
-
expected_token = Config.SECRET_TOKEN
|
| 28 |
-
|
| 29 |
-
if not expected_token:
|
| 30 |
-
logger.error("MY_SECRET_TOKEN not configured")
|
| 31 |
-
raise HTTPException(
|
| 32 |
-
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 33 |
-
detail="Server configuration error"
|
| 34 |
-
)
|
| 35 |
-
|
| 36 |
-
if token != expected_token:
|
| 37 |
-
logger.warning(f"Invalid token attempt: {token[:10]}...")
|
| 38 |
-
raise HTTPException(
|
| 39 |
-
status_code=status.HTTP_403_FORBIDDEN,
|
| 40 |
-
detail="Invalid or expired token"
|
| 41 |
-
)
|
| 42 |
-
return token
|
| 43 |
-
|
| 44 |
-
async def handle_rag_query(query: str) -> Dict[str, Any]:
|
| 45 |
-
"""Handle an incoming query by routing it and getting the appropriate answer."""
|
| 46 |
-
|
| 47 |
-
# Input validation
|
| 48 |
-
if not query or not query.strip():
|
| 49 |
-
raise HTTPException(
|
| 50 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 51 |
-
detail="Query cannot be empty"
|
| 52 |
-
)
|
| 53 |
-
|
| 54 |
-
if len(query) > MAX_QUERY_LENGTH:
|
| 55 |
-
raise HTTPException(
|
| 56 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 57 |
-
detail=f"Query too long. Please limit to {MAX_QUERY_LENGTH} characters."
|
| 58 |
-
)
|
| 59 |
-
|
| 60 |
-
try:
|
| 61 |
-
logger.info(f"Processing query: {query[:50]}...")
|
| 62 |
-
|
| 63 |
-
# Process query in thread pool
|
| 64 |
-
response = await asyncio.to_thread(route_and_process_query, query)
|
| 65 |
-
|
| 66 |
-
logger.info(f"Query processed successfully. Route: {response.get('route', 'Unknown')}")
|
| 67 |
-
return response
|
| 68 |
-
|
| 69 |
-
except Exception as e:
|
| 70 |
-
logger.error(f"Error processing query: {e}")
|
| 71 |
-
raise HTTPException(
|
| 72 |
-
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 73 |
-
detail="Error processing your query. Please try again."
|
| 74 |
-
)
|
| 75 |
-
|
| 76 |
-
async def handle_document_upload(file: UploadFile) -> Dict[str, str]:
|
| 77 |
-
"""Handle uploading a document to the RAG's vector store."""
|
| 78 |
-
|
| 79 |
-
# File validation
|
| 80 |
-
if not file.filename:
|
| 81 |
-
raise HTTPException(
|
| 82 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 83 |
-
detail="No file provided"
|
| 84 |
-
)
|
| 85 |
-
|
| 86 |
-
if file.content_type not in SUPPORTED_CONTENT_TYPES:
|
| 87 |
-
raise HTTPException(
|
| 88 |
-
status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
|
| 89 |
-
detail=f"Unsupported file type: {file.content_type}. "
|
| 90 |
-
f"Supported types: {', '.join(SUPPORTED_CONTENT_TYPES)}"
|
| 91 |
-
)
|
| 92 |
-
|
| 93 |
-
# Check file size
|
| 94 |
-
contents = await file.read()
|
| 95 |
-
if len(contents) > MAX_FILE_SIZE:
|
| 96 |
-
raise HTTPException(
|
| 97 |
-
status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
|
| 98 |
-
detail=f"File too large. Maximum size: {MAX_FILE_SIZE / (1024*1024):.1f}MB"
|
| 99 |
-
)
|
| 100 |
-
|
| 101 |
-
# Reset file pointer
|
| 102 |
-
await file.seek(0)
|
| 103 |
-
|
| 104 |
-
try:
|
| 105 |
-
logger.info(f"Processing file upload: {file.filename}")
|
| 106 |
-
|
| 107 |
-
# Extract text from file
|
| 108 |
-
text = await extract_text_from_file(file)
|
| 109 |
-
|
| 110 |
-
if not text or not text.strip():
|
| 111 |
-
raise HTTPException(
|
| 112 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 113 |
-
detail="The file appears to be empty or could not be read."
|
| 114 |
-
)
|
| 115 |
-
|
| 116 |
-
if len(text) < 50: # Too short to be meaningful
|
| 117 |
-
raise HTTPException(
|
| 118 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 119 |
-
detail="The extracted text is too short to be meaningful."
|
| 120 |
-
)
|
| 121 |
-
|
| 122 |
-
# Add to RAG system
|
| 123 |
-
success = await asyncio.to_thread(
|
| 124 |
-
add_document_to_rag,
|
| 125 |
-
text,
|
| 126 |
-
{
|
| 127 |
-
"source": file.filename,
|
| 128 |
-
"content_type": file.content_type,
|
| 129 |
-
"size": len(contents)
|
| 130 |
-
}
|
| 131 |
-
)
|
| 132 |
-
|
| 133 |
-
if not success:
|
| 134 |
-
raise HTTPException(
|
| 135 |
-
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 136 |
-
detail="Failed to add document to the knowledge base"
|
| 137 |
-
)
|
| 138 |
-
|
| 139 |
-
logger.info(f"Successfully processed file: {file.filename}")
|
| 140 |
-
|
| 141 |
-
return {
|
| 142 |
-
"message": f"Successfully uploaded and processed '{file.filename}'. "
|
| 143 |
-
f"It is now available for querying.",
|
| 144 |
-
"filename": file.filename,
|
| 145 |
-
"text_length": len(text),
|
| 146 |
-
"content_type": file.content_type
|
| 147 |
-
}
|
| 148 |
-
|
| 149 |
-
except HTTPException:
|
| 150 |
-
raise
|
| 151 |
-
except Exception as e:
|
| 152 |
-
logger.error(f"Error processing file {file.filename}: {e}")
|
| 153 |
-
raise HTTPException(
|
| 154 |
-
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
|
| 155 |
-
detail="Error processing the file. Please try again."
|
| 156 |
-
)
|
| 157 |
-
|
| 158 |
-
async def handle_health_check() -> Dict[str, Any]:
|
| 159 |
-
"""Handle health check requests."""
|
| 160 |
-
try:
|
| 161 |
-
health_status = await asyncio.to_thread(check_system_health)
|
| 162 |
-
|
| 163 |
-
if health_status["status"] == "unhealthy":
|
| 164 |
-
raise HTTPException(
|
| 165 |
-
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
|
| 166 |
-
detail="Service is currently unhealthy"
|
| 167 |
-
)
|
| 168 |
-
|
| 169 |
-
return health_status
|
| 170 |
-
|
| 171 |
-
except HTTPException:
|
| 172 |
-
raise
|
| 173 |
-
except Exception as e:
|
| 174 |
-
logger.error(f"Health check failed: {e}")
|
| 175 |
-
raise HTTPException(
|
| 176 |
-
status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
|
| 177 |
-
detail="Health check failed"
|
| 178 |
-
)
|
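Review note on the removed `verify_token`: it compares the bearer token with `!=` and writes the first ten characters of a rejected token to the log. If token checking is reintroduced elsewhere, a constant-time comparison and a redacted log value are a common hardening step. The sketch below is a minimal illustration using only the standard library; `tokens_match` and `redact` are hypothetical helper names, not part of this repository.

```python
import hmac


def tokens_match(presented: str, expected: str) -> bool:
    """Compare two tokens without stopping at the first differing byte."""
    # Encode to bytes: compare_digest accepts str only when it is ASCII-only.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def redact(token: str) -> str:
    """Return a log-safe placeholder that reveals only the token length."""
    return f"<token len={len(token)}>"


if __name__ == "__main__":
    print(tokens_match("SECRET_CODE_TOKEN", "SECRET_CODE_TOKEN"))
    print(redact("SECRET_CODE_TOKEN"))
```

A dependency such as the removed `verify_token` could call `tokens_match(token, expected_token)` in place of `token != expected_token` and log `redact(token)` instead of a token prefix.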
features/rag_chatbot/document_handler.py
DELETED
|
@@ -1,37 +0,0 @@
|
|
| 1 |
-
from io import BytesIO
|
| 2 |
-
from fastapi import UploadFile, HTTPException
|
| 3 |
-
import PyPDF2
|
| 4 |
-
import docx
|
| 5 |
-
|
| 6 |
-
async def extract_text_from_file(file: UploadFile) -> str:
|
| 7 |
-
"""Extracts text from various file types."""
|
| 8 |
-
content = await file.read()
|
| 9 |
-
file_stream = BytesIO(content)
|
| 10 |
-
|
| 11 |
-
if file.content_type == "application/pdf":
|
| 12 |
-
return extract_text_from_pdf(file_stream)
|
| 13 |
-
elif file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
|
| 14 |
-
return extract_text_from_docx(file_stream)
|
| 15 |
-
elif file.content_type == "text/plain":
|
| 16 |
-
return file_stream.read().decode("utf-8")
|
| 17 |
-
else:
|
| 18 |
-
raise HTTPException(
|
| 19 |
-
status_code=415,
|
| 20 |
-
detail="Unsupported file type. Please upload a .pdf, .docx, or .txt file."
|
| 21 |
-
)
|
| 22 |
-
|
| 23 |
-
def extract_text_from_pdf(file_stream: BytesIO) -> str:
|
| 24 |
-
"""Extracts text from a PDF file."""
|
| 25 |
-
reader = PyPDF2.PdfReader(file_stream)
|
| 26 |
-
text = ""
|
| 27 |
-
for page in reader.pages:
|
| 28 |
-
text += page.extract_text() or ""
|
| 29 |
-
return text
|
| 30 |
-
|
| 31 |
-
def extract_text_from_docx(file_stream: BytesIO) -> str:
|
| 32 |
-
"""Extracts text from a DOCX file."""
|
| 33 |
-
doc = docx.Document(file_stream)
|
| 34 |
-
text = ""
|
| 35 |
-
for para in doc.paragraphs:
|
| 36 |
-
text += para.text + "\n"
|
| 37 |
-
return text
|
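Review note on the removed `extract_text_from_file`: the `text/plain` branch decodes with strict UTF-8, so a text file in another encoding raises `UnicodeDecodeError`, which the removed controller's generic `except Exception` turned into a 500 response. A tolerant decode is one possible alternative. The sketch below assumes a lossy fallback is acceptable; `decode_text` is a hypothetical helper, not code from this repository.

```python
def decode_text(raw: bytes) -> str:
    """Decode uploaded plain-text bytes, tolerating non-UTF-8 input."""
    try:
        # utf-8-sig behaves like utf-8 but also strips a leading byte-order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: replacing undecodable bytes is better than rejecting the upload.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(decode_text("नमस्ते".encode("utf-8")))
```

Whether to fall back silently or to answer with a 4xx error is a design choice; the fallback keeps the upload usable at the cost of replacement characters in the extracted text.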
features/rag_chatbot/rag_pipeline.py
DELETED
|
@@ -1,329 +0,0 @@
|
|
| 1 |
-
import os
|
| 2 |
-
import chromadb
|
| 3 |
-
from dotenv import load_dotenv
|
| 4 |
-
from langchain_core.documents import Document
|
| 5 |
-
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 6 |
-
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 7 |
-
from langchain_community.llms import OpenAI
|
| 8 |
-
from langchain.chains.question_answering import load_qa_chain
|
| 9 |
-
from langchain_community.vectorstores import Chroma
|
| 10 |
-
from langchain.chains import LLMChain
|
| 11 |
-
from langchain.prompts import PromptTemplate
|
| 12 |
-
from langchain.chat_models import ChatOpenAI
|
| 13 |
-
from config import Config
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
load_dotenv()
|
| 17 |
-
|
| 18 |
-
# ChromaDB configuration
|
| 19 |
-
CHROMA_HOST = Config.RAG_CHROMA_HOST
|
| 20 |
-
CHROMA_PORT = Config.RAG_CHROMA_PORT
|
| 21 |
-
COLLECTION_NAME = Config.RAG_COLLECTION_NAME
|
| 22 |
-
|
| 23 |
-
# LLM Provider Configuration
|
| 24 |
-
LLM_PROVIDER = Config.RAG_LLM_PROVIDER
|
| 25 |
-
LLM_API_KEY = Config.RAG_LLM_API_KEY
|
| 26 |
-
LLM_MODEL = Config.RAG_LLM_MODEL
|
| 27 |
-
LLM_TEMPERATURE = Config.RAG_LLM_TEMPERATURE
|
| 28 |
-
LLM_MAX_TOKENS = Config.RAG_LLM_MAX_TOKENS
|
| 29 |
-
|
| 30 |
-
# Provider-specific configurations
|
| 31 |
-
PROVIDER_CONFIGS = {
|
| 32 |
-
"openai": {
|
| 33 |
-
"api_base": "https://api.openai.com/v1",
|
| 34 |
-
"default_model": "gpt-3.5-turbo"
|
| 35 |
-
},
|
| 36 |
-
"groq": {
|
| 37 |
-
"api_base": "https://api.groq.com/openai/v1",
|
| 38 |
-
"default_model": "llama-3.3-70b-versatile"
|
| 39 |
-
},
|
| 40 |
-
"openrouter": {
|
| 41 |
-
"api_base": "https://openrouter.ai/api/v1",
|
| 42 |
-
"default_model": "mistralai/mistral-small-3.2-24b-instruct:free"
|
| 43 |
-
}
|
| 44 |
-
}
|
| 45 |
-
|
| 46 |
-
vector_store = None
|
| 47 |
-
company_qa_chain = None
|
| 48 |
-
query_router_chain = None
|
| 49 |
-
cybersecurity_chain = None
|
| 50 |
-
llm = None
|
| 51 |
-
|
| 52 |
-
def get_llm_config():
|
| 53 |
-
"""Get the appropriate LLM configuration based on the provider."""
|
| 54 |
-
if LLM_PROVIDER not in PROVIDER_CONFIGS:
|
| 55 |
-
raise ValueError(f"Unsupported LLM provider: {LLM_PROVIDER}. Supported: {list(PROVIDER_CONFIGS.keys())}")
|
| 56 |
-
|
| 57 |
-
config = PROVIDER_CONFIGS[LLM_PROVIDER].copy()
|
| 58 |
-
|
| 59 |
-
# Use provided model or fall back to default
|
| 60 |
-
model = LLM_MODEL if LLM_MODEL != "gpt-3.5-turbo" else config["default_model"]
|
| 61 |
-
|
| 62 |
-
return {
|
| 63 |
-
"model": model,
|
| 64 |
-
"openai_api_key": LLM_API_KEY,
|
| 65 |
-
"openai_api_base": config["api_base"],
|
| 66 |
-
"temperature": LLM_TEMPERATURE,
|
| 67 |
-
"max_tokens": LLM_MAX_TOKENS,
|
| 68 |
-
}
|
| 69 |
-
|
| 70 |
-
def initialize_llm():
|
| 71 |
-
"""Initialize the LLM based on the configured provider."""
|
| 72 |
-
if not LLM_API_KEY:
|
| 73 |
-
raise ValueError(f"LLM_API_KEY environment variable is required for {LLM_PROVIDER}")
|
| 74 |
-
|
| 75 |
-
config = get_llm_config()
|
| 76 |
-
|
| 77 |
-
print(f"Initializing {LLM_PROVIDER.upper()} with model: {config['model']}")
|
| 78 |
-
|
| 79 |
-
return ChatOpenAI(**config)
|
| 80 |
-
|
| 81 |
-
def initialize_pipelines():
|
| 82 |
-
"""Initializes all required models, chains, and the vector store."""
|
| 83 |
-
global vector_store, company_qa_chain, query_router_chain, cybersecurity_chain, llm
|
| 84 |
-
|
| 85 |
-
try:
|
| 86 |
-
# Initialize LLM
|
| 87 |
-
llm = initialize_llm()
|
| 88 |
-
|
| 89 |
-
# Initialize embeddings
|
| 90 |
-
embeddings = HuggingFaceEmbeddings(
|
| 91 |
-
model_name="all-MiniLM-L6-v2",
|
| 92 |
-
model_kwargs={'device': 'cpu'},
|
| 93 |
-
encode_kwargs={'normalize_embeddings': True}
|
| 94 |
-
)
|
| 95 |
-
|
| 96 |
-
# Initialize ChromaDB client
|
| 97 |
-
try:
|
| 98 |
-
chroma_client = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)
|
| 99 |
-
chroma_client.heartbeat()
|
| 100 |
-
except Exception as e:
|
| 101 |
-
raise ConnectionError("Failed to connect to ChromaDB.") from e
|
| 102 |
-
|
| 103 |
-
# Initialize vector store
|
| 104 |
-
vector_store = Chroma(
|
| 105 |
-
client=chroma_client,
|
| 106 |
-
collection_name=COLLECTION_NAME,
|
| 107 |
-
embedding_function=embeddings,
|
| 108 |
-
)
|
| 109 |
-
|
| 110 |
-
# Query Router Chain
|
| 111 |
-
router_template = """You are a query classifier. Classify the following query into one of these categories:
|
| 112 |
-
- COMPANY: Questions about our company, its products, services, or general information
|
| 113 |
-
- CYBERSECURITY: Questions about cybersecurity, security threats, best practices, or vulnerabilities
|
| 114 |
-
- OFF_TOPIC: Questions that don't fit the above categories
|
| 115 |
-
|
| 116 |
-
Query: {query}
|
| 117 |
-
|
| 118 |
-
Respond with only the category name (COMPANY, CYBERSECURITY, or OFF_TOPIC):"""
|
| 119 |
-
|
| 120 |
-
router_prompt = PromptTemplate(
|
| 121 |
-
input_variables=["query"],
|
| 122 |
-
template=router_template
|
| 123 |
-
)
|
| 124 |
-
|
| 125 |
-
query_router_chain = LLMChain(
|
| 126 |
-
llm=llm,
|
| 127 |
-
prompt=router_prompt
|
| 128 |
-
)
|
| 129 |
-
|
| 130 |
-
# Custom Company QA Chain
|
| 131 |
-
company_qa_template = """You are a helpful assistant for CyberAlertNepal. Answer the following question about our company using the information provided and links if only available. Give a natural, direct and polite response.
|
| 132 |
-
|
| 133 |
-
Question: {question}
|
| 134 |
-
|
| 135 |
-
Information:
|
| 136 |
-
{context}
|
| 137 |
-
|
| 138 |
-
Answer:"""
|
| 139 |
-
|
| 140 |
-
company_qa_prompt = PromptTemplate(
|
| 141 |
-
input_variables=["question", "context"],
|
| 142 |
-
template=company_qa_template
|
| 143 |
-
)
|
| 144 |
-
|
| 145 |
-
company_qa_chain = LLMChain(
|
| 146 |
-
llm=llm,
|
| 147 |
-
prompt=company_qa_prompt
|
| 148 |
-
)
|
| 149 |
-
|
| 150 |
-
# Cybersecurity Chain
|
| 151 |
-
cybersecurity_template = """You are a cybersecurity professional. Answer the following question truthfully and concisely.
|
| 152 |
-
If you are not 100% sure about the answer, simply respond with: "I am not sure about the answer."
|
| 153 |
-
Do not add extra explanations or assumptions. Do not provide false or speculative information.
|
| 154 |
-
|
| 155 |
-
Question: {question}
|
| 156 |
-
|
| 157 |
-
Provide a comprehensive and accurate answer about cybersecurity:"""
|
| 158 |
-
|
| 159 |
-
cybersecurity_prompt = PromptTemplate(
|
| 160 |
-
input_variables=["question"],
|
| 161 |
-
template=cybersecurity_template
|
| 162 |
-
)
|
| 163 |
-
|
| 164 |
-
cybersecurity_chain = LLMChain(
|
| 165 |
-
llm=llm,
|
| 166 |
-
prompt=cybersecurity_prompt
|
| 167 |
-
)
|
| 168 |
-
|
| 169 |
-
print(f"Successfully initialized pipelines with {LLM_PROVIDER.upper()}")
|
| 170 |
-
|
| 171 |
-
except Exception as e:
|
| 172 |
-
print(f"Error initializing pipelines: {e}")
|
| 173 |
-
raise
|
| 174 |
-
|
| 175 |
-
def add_document_to_rag(text: str, metadata: dict):
|
| 176 |
-
"""Splits a document and adds it to the ChromaDB index."""
|
| 177 |
-
global vector_store
|
| 178 |
-
|
| 179 |
-
if not vector_store:
|
| 180 |
-
initialize_pipelines()
|
| 181 |
-
|
| 182 |
-
try:
|
| 183 |
-
text_splitter = RecursiveCharacterTextSplitter(
|
| 184 |
-
chunk_size=1000,
|
| 185 |
-
chunk_overlap=200
|
| 186 |
-
)
|
| 187 |
-
docs = text_splitter.create_documents([text], metadatas=[metadata])
|
| 188 |
-
|
| 189 |
-
if not docs:
|
| 190 |
-
print("Document was empty after splitting, not adding to ChromaDB.")
|
| 191 |
-
return False
|
| 192 |
-
|
| 193 |
-
vector_store.add_documents(docs)
|
| 194 |
-
print("Successfully added documents.")
|
| 195 |
-
return True
|
| 196 |
-
|
| 197 |
-
except Exception as e:
|
| 198 |
-
print(f"Error adding document to RAG: {e}")
|
| 199 |
-
return False
|
| 200 |
-
|
| 201 |
-
def route_and_process_query(query: str):
|
| 202 |
-
"""Routes the query and processes it using the appropriate pipeline."""
|
| 203 |
-
global query_router_chain, vector_store, company_qa_chain, cybersecurity_chain
|
| 204 |
-
|
| 205 |
-
if not all([query_router_chain, vector_store, company_qa_chain, cybersecurity_chain]):
|
| 206 |
-
initialize_pipelines()
|
| 207 |
-
|
| 208 |
-
try:
|
| 209 |
-
# 1. Classify the query
|
| 210 |
-
route_result = query_router_chain.run(query)
|
| 211 |
-
route = route_result.strip().upper()
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
# 2. Route to appropriate logic
|
| 215 |
-
if "CYBERSECURITY" in route:
|
| 216 |
-
answer = cybersecurity_chain.run(question=query)
|
| 217 |
-
return {
|
| 218 |
-
"answer": answer,
|
| 219 |
-
"source": "Cybersecurity Knowledge Base",
|
| 220 |
-
"route": "CYBERSECURITY",
|
| 221 |
-
"provider": LLM_PROVIDER.upper(),
|
| 222 |
-
"model": get_llm_config()["model"]
|
| 223 |
-
}
|
| 224 |
-
|
| 225 |
-
elif "COMPANY" in route:
|
| 226 |
-
# Perform similarity search on ChromaDB
|
| 227 |
-
docs = vector_store.similarity_search(query, k=3)
|
| 228 |
-
|
| 229 |
-
if not docs:
|
| 230 |
-
return {
|
| 231 |
-
"answer": "I could not find any relevant information to answer your question.",
|
| 232 |
-
"source": "Company Documents",
|
| 233 |
-
"route": "COMPANY",
|
| 234 |
-
"provider": LLM_PROVIDER.upper(),
|
| 235 |
-
"model": get_llm_config()["model"]
|
| 236 |
-
}
|
| 237 |
-
|
| 238 |
-
# Combine document content for context
|
| 239 |
-
context = "\n\n".join([doc.page_content for doc in docs])
|
| 240 |
-
|
| 241 |
-
# Run the custom QA chain
|
| 242 |
-
answer = company_qa_chain.run(question=query, context=context)
|
| 243 |
-
sources = list(set([doc.metadata.get("source", "Unknown") for doc in docs]))
|
| 244 |
-
|
| 245 |
-
return {
|
| 246 |
-
"answer": answer,
|
| 247 |
-
"source": "Company Documents",
|
| 248 |
-
"documents": sources,
|
| 249 |
-
"route": "COMPANY",
|
| 250 |
-
"provider": LLM_PROVIDER.upper(),
|
| 251 |
-
"model": get_llm_config()["model"]
|
| 252 |
-
}
|
| 253 |
-
|
| 254 |
-
else: # OFF_TOPIC
|
| 255 |
-
return {
|
| 256 |
-
"answer": "I am a specialized assistant of CyberAlertNepal. I cannot answer questions outside of cybersecurity topics.",
|
| 257 |
-
"source": "N/A",
|
| 258 |
-
"route": "OFF_TOPIC",
|
| 259 |
-
"provider": LLM_PROVIDER.upper(),
|
| 260 |
-
"model": get_llm_config()["model"]
|
| 261 |
-
}
|
| 262 |
-
|
| 263 |
-
except Exception as e:
|
| 264 |
-
print(f"Error processing query: {e}")
|
| 265 |
-
return {
|
| 266 |
-
"answer": "I encountered an error while processing your query. Please try again.",
|
| 267 |
-
"source": "Error",
|
| 268 |
-
"route": None,
|
| 269 |
-
"documents": None,
|
| 270 |
-
"provider": LLM_PROVIDER.upper(),
|
| 271 |
-
"error": str(e)
|
| 272 |
-
}
|
| 273 |
-
|
| 274 |
-
def check_system_health():
|
| 275 |
-
"""Check if all components are properly initialized."""
|
| 276 |
-
try:
|
| 277 |
-
# Test ChromaDB connection
|
| 278 |
-
if vector_store:
|
| 279 |
-
vector_store._client.heartbeat()
|
| 280 |
-
|
| 281 |
-
# Test if all chains are initialized
|
| 282 |
-
components = {
|
| 283 |
-
"vector_store": vector_store is not None,
|
| 284 |
-
"company_qa_chain": company_qa_chain is not None,
|
| 285 |
-
"query_router_chain": query_router_chain is not None,
|
| 286 |
-
"cybersecurity_chain": cybersecurity_chain is not None,
|
| 287 |
-
"llm": llm is not None
|
| 288 |
-
}
|
| 289 |
-
|
| 290 |
-
return {
|
| 291 |
-
"status": "healthy" if all(components.values()) else "unhealthy",
|
| 292 |
-
"components": components,
|
| 293 |
-
"provider": LLM_PROVIDER.upper(),
|
| 294 |
-
"model": get_llm_config()["model"] if llm else "Not initialized"
|
| 295 |
-
}
|
| 296 |
-
|
| 297 |
-
except Exception as e:
|
| 298 |
-
return {
|
| 299 |
-
"status": "unhealthy",
|
| 300 |
-
"error": str(e),
|
| 301 |
-
"provider": LLM_PROVIDER.upper()
|
| 302 |
-
}
|
| 303 |
-
|
| 304 |
-
def test_llm_connection():
|
| 305 |
-
"""Test the LLM API connection."""
|
| 306 |
-
try:
|
| 307 |
-
if not llm:
|
| 308 |
-
initialize_pipelines()
|
| 309 |
-
|
| 310 |
-
# Simple test query
|
| 311 |
-
test_response = llm("Say 'Hello, LLM is working!'")
|
| 312 |
-
return {
|
| 313 |
-
"success": True,
|
| 314 |
-
"provider": LLM_PROVIDER.upper(),
|
| 315 |
-
"model": get_llm_config()["model"],
|
| 316 |
-
"response": str(test_response)
|
| 317 |
-
}
|
| 318 |
-
except Exception as e:
|
| 319 |
-
return {
|
| 320 |
-
"success": False,
|
| 321 |
-
"provider": LLM_PROVIDER.upper(),
|
| 322 |
-
"error": str(e)
|
| 323 |
-
}
|
| 324 |
-
|
| 325 |
-
# Initialize pipelines on module import
|
| 326 |
-
try:
|
| 327 |
-
initialize_pipelines()
|
| 328 |
-
except Exception as e:
|
| 329 |
-
print(f"Failed to initialize pipelines on startup: {e}")
|
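Review note on the removed `route_and_process_query`: the router's reply is stripped, upper-cased, and then matched by substring, with `CYBERSECURITY` checked before `COMPANY` and everything else treated as off-topic. That decision can be isolated from the LLM call and tested on its own. The sketch below mirrors that order; `parse_route` and `ROUTES` are hypothetical names introduced for illustration.

```python
# Checked in order, matching the if/elif order of the removed function.
ROUTES = ("CYBERSECURITY", "COMPANY")


def parse_route(raw: str) -> str:
    """Map a free-form classifier reply to one of the known route labels."""
    label = raw.strip().upper()
    for route in ROUTES:
        # Substring matching tolerates replies such as "Category: COMPANY".
        if route in label:
            return route
    # Anything unrecognised falls through to the off-topic branch.
    return "OFF_TOPIC"


if __name__ == "__main__":
    print(parse_route(" cybersecurity\n"))
```

Because the match is by substring, a reply that mentions both labels resolves to `CYBERSECURITY`; a stricter variant could require an exact match and treat anything else as `OFF_TOPIC`.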
features/rag_chatbot/routes.py
DELETED
|
@@ -1,107 +0,0 @@
|
|
| 1 |
-
from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Request
|
| 2 |
-
from fastapi.security import HTTPBearer
|
| 3 |
-
from pydantic import BaseModel, Field
|
| 4 |
-
from slowapi.util import get_remote_address
|
| 5 |
-
from slowapi import Limiter
|
| 6 |
-
from typing import Optional
|
| 7 |
-
from config import ACCESS_RATE, Config
|
| 8 |
-
from .controller import (
|
| 9 |
-
handle_rag_query,
|
| 10 |
-
handle_document_upload,
|
| 11 |
-
handle_health_check,
|
| 12 |
-
verify_token,
|
| 13 |
-
)
|
| 14 |
-
|
| 15 |
-
limiter = Limiter(key_func=get_remote_address)
|
| 16 |
-
router = APIRouter(prefix="/rag", tags=["RAG Chatbot"])
|
| 17 |
-
security = HTTPBearer()
|
| 18 |
-
|
| 19 |
-
class QueryInput(BaseModel):
|
| 20 |
-
query: str = Field(..., min_length=1, max_length=1000, description="The question to ask")
|
| 21 |
-
|
| 22 |
-
class QueryResponse(BaseModel):
|
| 23 |
-
answer: str
|
| 24 |
-
source: str
|
| 25 |
-
route: Optional[str] = None
|
| 26 |
-
documents: Optional[list] = None
|
| 27 |
-
error: Optional[str] = None
|
| 28 |
-
|
| 29 |
-
class UploadResponse(BaseModel):
|
| 30 |
-
message: str
|
| 31 |
-
filename: str
|
| 32 |
-
text_length: int
|
| 33 |
-
content_type: str
|
| 34 |
-
|
| 35 |
-
class HealthResponse(BaseModel):
|
| 36 |
-
status: str
|
| 37 |
-
components: Optional[dict] = None
|
| 38 |
-
error: Optional[str] = None
|
| 39 |
-
|
| 40 |
-
@router.post("/question", response_model=QueryResponse)
|
| 41 |
-
@limiter.limit(ACCESS_RATE)
|
| 42 |
-
async def ask_question(
|
| 43 |
-
request: Request,
|
| 44 |
-
data: QueryInput,
|
| 45 |
-
token: str = Depends(verify_token)
|
| 46 |
-
) -> QueryResponse:
|
| 47 |
-
"""
|
| 48 |
-
Ask a question to the RAG chatbot.
|
| 49 |
-
|
| 50 |
-
The chatbot can answer:
|
| 51 |
-
- Company-related questions (based on uploaded documents)
|
| 52 |
-
- Cybersecurity questions (from knowledge base)
|
| 53 |
-
"""
|
| 54 |
-
response = await handle_rag_query(data.query)
|
| 55 |
-
return QueryResponse(**response)
|
| 56 |
-
|
| 57 |
-
@router.post("/upload", response_model=UploadResponse)
|
| 58 |
-
@limiter.limit(ACCESS_RATE)
|
| 59 |
-
async def upload_document(
|
| 60 |
-
request: Request,
|
| 61 |
-
file: UploadFile = File(..., description="Document file (PDF, DOCX, or TXT)"),
|
| 62 |
-
token: str = Depends(verify_token)
|
| 63 |
-
) -> UploadResponse:
|
| 64 |
-
"""
|
| 65 |
-
Upload a document to the company knowledge base.
|
| 66 |
-
|
| 67 |
-
Supported formats:
|
| 68 |
-
- PDF (.pdf)
|
| 69 |
-
- Word documents (.docx)
|
| 70 |
-
- Plain text (.txt)
|
| 71 |
-
|
| 72 |
-
Maximum file size: 10MB
|
| 73 |
-
"""
|
| 74 |
-
response = await handle_document_upload(file)
|
| 75 |
-
return UploadResponse(**response)
|
| 76 |
-
|
| 77 |
-
@router.get("/health", response_model=HealthResponse)
|
| 78 |
-
@limiter.limit(ACCESS_RATE)
|
| 79 |
-
async def health_check(request: Request) -> HealthResponse:
|
| 80 |
-
"""
|
| 81 |
-
Check the health status of the RAG system.
|
| 82 |
-
|
| 83 |
-
Returns the status of all components:
|
| 84 |
-
- ChromaDB connection
|
| 85 |
-
- Vector store
|
| 86 |
-
- AI chains
|
| 87 |
-
"""
|
| 88 |
-
response = await handle_health_check()
|
| 89 |
-
return HealthResponse(**response)
|
| 90 |
-
|
| 91 |
-
@router.get("/info")
|
| 92 |
-
@limiter.limit(ACCESS_RATE)
|
| 93 |
-
async def get_system_info(request: Request):
|
| 94 |
-
"""Get information about the RAG system capabilities."""
|
| 95 |
-
return {
|
| 96 |
-
"name": "RAG Chatbot",
|
| 97 |
-
"version": "1.0.0",
|
| 98 |
-
"description": "A specialized chatbot for cybersecurity and company-related questions",
|
| 99 |
-
"capabilities": [
|
| 100 |
-
"Company document Q&A (based on uploaded documents)",
|
| 101 |
-
"Cybersecurity knowledge and best practices",
|
| 102 |
-
"Document upload and processing (PDF, DOCX, TXT)"
|
| 103 |
-
],
|
| 104 |
-
"supported_file_types": sorted(Config.RAG_SUPPORTED_CONTENT_TYPES),
|
| 105 |
-
"max_file_size_mb": round(Config.RAG_MAX_FILE_SIZE / (1024 * 1024), 2),
|
| 106 |
-
"max_query_length": Config.RAG_MAX_QUERY_LENGTH
|
| 107 |
-
}
|
features/real_forged_classifier/__init__.py
DELETED
|
@@ -1,9 +0,0 @@
|
|
| 1 |
-
"""Package for real_forged_classifier feature.
|
| 2 |
-
|
| 3 |
-
This module ensures package-relative imports work when importing
|
| 4 |
-
`features.real_forged_classifier.*` from the application.
|
| 5 |
-
"""
|
| 6 |
-
|
| 7 |
-
__all__ = [
|
| 8 |
-
'controller', 'routes', 'preprocessor', 'inferencer', 'model_loader', 'model'
|
| 9 |
-
]
|
features/real_forged_classifier/controller.py
CHANGED
|
@@ -1,30 +1,6 @@
|
|
| 1 |
from typing import IO
|
| 2 |
-
import
|
| 3 |
-
|
| 4 |
-
from PIL import Image
|
| 5 |
-
from fastapi import Depends, HTTPException, status
|
| 6 |
-
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
|
| 7 |
-
import torch
|
| 8 |
-
from torchvision import transforms
|
| 9 |
-
|
| 10 |
-
from .preprocessor import preprocessor
|
| 11 |
-
from .inferencer import interferencer
|
| 12 |
-
from .model_loader import models
|
| 13 |
-
from config import Config
|
| 14 |
-
|
| 15 |
-
security = HTTPBearer()
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 19 |
-
token = credentials.credentials
|
| 20 |
-
expected_token = Config.SECRET_TOKEN
|
| 21 |
-
if token != expected_token:
|
| 22 |
-
raise HTTPException(
|
| 23 |
-
status_code=status.HTTP_403_FORBIDDEN,
|
| 24 |
-
detail="Invalid or expired token",
|
| 25 |
-
)
|
| 26 |
-
return token
|
| 27 |
-
|
| 28 |
|
| 29 |
class ClassificationController:
|
| 30 |
"""
|
|
@@ -58,72 +34,3 @@ class ClassificationController:
|
|
| 58 |
|
| 59 |
# Create a single instance of the controller
|
| 60 |
controller = ClassificationController()
|
| 61 |
-
|
| 62 |
-
class documentForger:
|
| 63 |
-
"""
|
| 64 |
-
Document forgery detector that uses the ELA-trained EfficientNet model
|
| 65 |
-
when available (models.doc_model). Returns a dict with verdict and confidence.
|
| 66 |
-
"""
|
| 67 |
-
def is_forged(self, document_file: IO) -> dict:
|
| 68 |
-
# Ensure a document model is loaded
|
| 69 |
-
if not hasattr(models, 'doc_model') or models.doc_model is None:
|
| 70 |
-
_downloadmodel = Config.DOCUMENT_FORGERY_MODEL_PATH
|
| 71 |
-
return {"detail": "Document forgery model not available."}
|
| 72 |
-
|
| 73 |
-
# Read file bytes
|
| 74 |
-
try:
|
| 75 |
-
data = document_file.read()
|
| 76 |
-
img = Image.open(io.BytesIO(data)).convert('RGB')
|
| 77 |
-
except Exception as e:
|
| 78 |
-
return {"detail": f"Could not open document image: {e}"}
|
| 79 |
-
|
| 80 |
-
# Compute ELA map (same approach as the notebook)
|
| 81 |
-
try:
|
| 82 |
-
buf = io.BytesIO()
|
| 83 |
-
img.save(buf, format='JPEG', quality=90)
|
| 84 |
-
buf.seek(0)
|
| 85 |
-
recompressed = Image.open(buf).convert('RGB')
|
| 86 |
-
|
| 87 |
-
ela_arr = np.abs(np.array(img, dtype=np.float32) - np.array(recompressed, dtype=np.float32))
|
| 88 |
-
p99 = np.percentile(ela_arr, 99)
|
| 89 |
-
if p99 > 0:
|
| 90 |
-
ela_arr = np.clip(ela_arr * (255.0 / p99), 0, 255).astype(np.uint8)
|
| 91 |
-
else:
|
| 92 |
-
ela_arr = ela_arr.astype(np.uint8)
|
| 93 |
-
|
| 94 |
-
ela_pil = Image.fromarray(ela_arr, mode='RGB')
|
| 95 |
-
except Exception as e:
|
| 96 |
-
return {"detail": f"Failed to compute ELA: {e}"}
|
| 97 |
-
|
| 98 |
-
# Transform and run through model
|
| 99 |
-
transform = transforms.Compose([
|
| 100 |
-
transforms.Resize((224, 224)),
|
| 101 |
-
transforms.ToTensor(),
|
| 102 |
-
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
|
| 103 |
-
])
|
| 104 |
-
|
| 105 |
-
tensor = transform(ela_pil).unsqueeze(0).to(models.device)
|
| 106 |
-
|
| 107 |
-
with torch.no_grad():
|
| 108 |
-
logits = models.doc_model(tensor)
|
| 109 |
-
probs = torch.softmax(logits, dim=1)[0, 1].item()
|
| 110 |
-
|
| 111 |
-
# Interpret confidence using configurable thresholds (values in 0..1)
|
| 112 |
-
low = getattr(Config, 'DOCUMENT_FORGERY_POSSIBLE_LOW', 0.40)
|
| 113 |
-
high = getattr(Config, 'DOCUMENT_FORGERY_FORGED_LOW', 0.55)
|
| 114 |
-
|
| 115 |
-
if probs < low:
|
| 116 |
-
verdict = 'LIKELY AUTHENTIC'
|
| 117 |
-
elif probs < high:
|
| 118 |
-
verdict = 'POSSIBLY FORGED'
|
| 119 |
-
else:
|
| 120 |
-
verdict = 'LIKELY FORGED'
|
| 121 |
-
|
| 122 |
-
return {
|
| 123 |
-
"verdict": verdict,
|
| 124 |
-
"confidence": float(probs),
|
| 125 |
-
"confidence_pct": round(float(probs) * 100, 2),
|
| 126 |
-
}
|
| 127 |
-
|
| 128 |
-
# Create a single instance of the document forger
|
| 129 |
-
document_forger = documentForger()
|
|
|
|
| 1 |
from typing import IO
|
| 2 |
+
from preprocessor import preprocessor
|
| 3 |
+
from inferencer import interferencer
|
| 4 |
|
| 5 |
class ClassificationController:
|
| 6 |
"""
|
|
|
|
| 34 |
|
| 35 |
# Create a single instance of the controller
|
| 36 |
controller = ClassificationController()
|
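The removed `documentForger.is_forged` method turned the forged-class probability into a verdict using two thresholds. The mapping can be sketched on its own as below. The function name `verdict_for` is hypothetical; the defaults 0.40 and 0.55 are the fallback values that appear in the removed code, and the range check is an added assumption.

```python
def verdict_for(prob_forged: float, low: float = 0.40, high: float = 0.55) -> dict:
    """Map a forged-class probability (0..1) to a verdict dictionary.

    Hypothetical helper: mirrors the threshold logic of the removed
    documentForger.is_forged, with the fallback thresholds as defaults.
    """
    if not 0.0 <= prob_forged <= 1.0:
        # Assumption: reject values that cannot be probabilities.
        raise ValueError("prob_forged must be in the range 0..1")

    if prob_forged < low:
        verdict = "LIKELY AUTHENTIC"
    elif prob_forged < high:
        verdict = "POSSIBLY FORGED"
    else:
        verdict = "LIKELY FORGED"

    return {
        "verdict": verdict,
        "confidence": float(prob_forged),
        "confidence_pct": round(float(prob_forged) * 100, 2),
    }
```

Both boundaries are inclusive on the upper band: a probability exactly equal to `low` is reported as possibly forged, and one equal to `high` as likely forged.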
features/real_forged_classifier/inferencer.py
CHANGED
|
@@ -3,7 +3,7 @@ import torch.nn.functional as F
|
|
| 3 |
import numpy as np
|
| 4 |
|
| 5 |
# Import the globally loaded models instance
|
| 6 |
-
from
|
| 7 |
|
| 8 |
class Interferencer:
|
| 9 |
"""
|
|
@@ -26,10 +26,6 @@ class Interferencer:
|
|
| 26 |
Returns:
|
| 27 |
dict: A dictionary containing the classification label and confidence score.
|
| 28 |
"""
|
| 29 |
-
# 0. Ensure model is loaded
|
| 30 |
-
if self.fft_model is None:
|
| 31 |
-
return {"error": "FFT model not loaded."}
|
| 32 |
-
|
| 33 |
# 1. Get model outputs (logits)
|
| 34 |
outputs = self.fft_model(image_tensor)
|
| 35 |
|
|
|
|
| 3 |
import numpy as np
|
| 4 |
|
| 5 |
# Import the globally loaded models instance
|
| 6 |
+
from model_loader import models
|
| 7 |
|
| 8 |
class Interferencer:
|
| 9 |
"""
|
|
|
|
| 26 |
Returns:
|
| 27 |
dict: A dictionary containing the classification label and confidence score.
|
| 28 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
# 1. Get model outputs (logits)
|
| 30 |
outputs = self.fft_model(image_tensor)
|
| 31 |
|
features/real_forged_classifier/main.py
ADDED
|
@@ -0,0 +1,26 @@
|
| 1 |
+
from fastapi import FastAPI
|
| 2 |
+
from routes import router as api_router
|
| 3 |
+
|
| 4 |
+
# Initialize the FastAPI app
|
| 5 |
+
app = FastAPI(
|
| 6 |
+
title="Real vs. Fake Image Classification API",
|
| 7 |
+
description="An API to classify images as real or forged using FFT and a CNN.",
|
| 8 |
+
version="1.0.0"
|
| 9 |
+
)
|
| 10 |
+
|
| 11 |
+
# Include the API router
|
| 12 |
+
# All routes defined in routes.py will be available under the /api prefix
|
| 13 |
+
app.include_router(api_router, prefix="/api", tags=["Classification"])
|
| 14 |
+
|
| 15 |
+
@app.get("/", tags=["Root"])
|
| 16 |
+
async def read_root():
|
| 17 |
+
"""
|
| 18 |
+
A simple root endpoint to confirm the API is running.
|
| 19 |
+
"""
|
| 20 |
+
return {"message": "Welcome to the Image Classification API. Go to /docs for the API documentation."}
|
| 21 |
+
|
| 22 |
+
# To run this application:
|
| 23 |
+
# 1. Make sure you have all dependencies from requirements.txt installed.
|
| 24 |
+
# 2. Make sure the FFT CNN checkpoint can be downloaded from the Hugging Face Hub (see model_loader.py).
|
| 25 |
+
# 3. Run the following command in your terminal:
|
| 26 |
+
# uvicorn main:app --reload
|
features/real_forged_classifier/model_loader.py
CHANGED
|
@@ -1,202 +1,60 @@
|
|
|
|
|
| 1 |
from pathlib import Path
|
| 2 |
-
from
|
| 3 |
-
import
|
| 4 |
-
from .model import FFTCNN # Import the FFT CNN architecture (package-relative)
|
| 5 |
-
from config import Config
|
| 6 |
-
|
| 7 |
-
try:
|
| 8 |
-
from huggingface_hub import hf_hub_download
|
| 9 |
-
except Exception:
|
| 10 |
-
hf_hub_download = None
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
# NOTE: EfficientNet/nn imports are done lazily when torch is available.
|
| 14 |
-
ELAForgeryNet = None # will be constructed dynamically when needed
|
| 15 |
-
torch = None
|
| 16 |
-
TORCH_AVAILABLE = False
|
| 17 |
-
|
| 18 |
|
| 19 |
class ModelLoader:
|
| 20 |
-
"""A class to load and hold PyTorch models used by this feature.
|
| 21 |
-
|
| 22 |
-
It loads:
|
| 23 |
-
- an FFT-based CNN (downloaded from Hugging Face Hub)
|
| 24 |
-
- an ELA-based document forgery detector (local .pth by default)
|
| 25 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
TORCH_AVAILABLE = True
|
| 34 |
-
except Exception:
|
| 35 |
-
torch = None
|
| 36 |
-
TORCH_AVAILABLE = False
|
| 37 |
-
print("[WARN] PyTorch not available; model loading will be skipped until torch is installed.")
|
| 38 |
-
if TORCH_AVAILABLE:
|
| 39 |
-
self.device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 40 |
-
else:
|
| 41 |
-
self.device = "cpu"
|
| 42 |
-
print(f"Using device: {self.device} (torch available: {TORCH_AVAILABLE})")
|
| 43 |
-
|
| 44 |
-
# Load FFT CNN from HF Hub
|
| 45 |
-
self.fft_model = None
|
| 46 |
-
if TORCH_AVAILABLE:
|
| 47 |
-
try:
|
| 48 |
-
self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
|
| 49 |
-
print("FFT CNN model loaded successfully from Hub.")
|
| 50 |
-
except Exception:
|
| 51 |
-
# Try local fallback path (if provided in config)
|
| 52 |
-
self.fft_model = None
|
| 53 |
-
local_path = Path(getattr(Config, 'REAL_FORGED_MODEL_LOCAL_PATH', ''))
|
| 54 |
-
if local_path and local_path.exists():
|
| 55 |
-
try:
|
| 56 |
-
print(f"Attempting to load FFT model from local path: {local_path}")
|
| 57 |
-
model = FFTCNN()
|
| 58 |
-
state = torch.load(str(local_path), map_location=torch.device(self.device))
|
| 59 |
-
state_dict = state.get('state_dict', state) if isinstance(state, dict) else state
|
| 60 |
-
model.load_state_dict(state_dict, strict=False)
|
| 61 |
-
model.to(self.device)
|
| 62 |
-
model.eval()
|
| 63 |
-
self.fft_model = model
|
| 64 |
-
print("FFT CNN model loaded successfully from local path.")
|
| 65 |
-
except Exception as e:
|
| 66 |
-
print(f"Failed to load local FFT model: {e}")
|
| 67 |
-
else:
|
| 68 |
-
print("No local FFT model path configured or file missing; FFT model not loaded.")
|
| 69 |
-
else:
|
| 70 |
-
print("Skipping FFT model load because PyTorch is not installed.")
|
| 71 |
-
|
| 72 |
-
# Load document forgery model (ELA CNN), downloading the checkpoint if needed.
|
| 73 |
-
self.doc_model = None
|
| 74 |
-
if doc_model_path is None:
|
| 75 |
-
doc_model_path = Config.DOCUMENT_FORGERY_MODEL_PATH
|
| 76 |
|
| 77 |
-
self.
|
| 78 |
-
|
| 79 |
-
try:
|
| 80 |
-
self.doc_model = self._load_document_forgery_model(Path(doc_model_path))
|
| 81 |
-
if self.doc_model is not None:
|
| 82 |
-
print("Document forgery (ELA) model loaded successfully.")
|
| 83 |
-
except Exception as e:
|
| 84 |
-
print(f"Warning: failed to load document forgery model: {e}")
|
| 85 |
-
else:
|
| 86 |
-
print("Skipping document forgery model load because PyTorch is not installed.")
|
| 87 |
|
| 88 |
def _load_fft_model(self, repo_id: str, filename: str):
|
| 89 |
-
"""
|
| 90 |
-
|
| 91 |
-
try:
|
| 92 |
-
from huggingface_hub import hf_hub_download
|
| 93 |
-
except Exception as e:
|
| 94 |
-
raise RuntimeError(f"huggingface_hub not available: {e}")
|
| 95 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 96 |
try:
|
| 97 |
-
|
|
|
|
| 98 |
print(f"Model downloaded to: {model_path}")
|
| 99 |
-
|
|
|
|
| 100 |
model = FFTCNN()
|
|
|
|
|
|
|
| 101 |
model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
|
|
|
|
|
|
|
| 102 |
model.to(self.device)
|
| 103 |
model.eval()
|
|
|
|
| 104 |
return model
|
| 105 |
except Exception as e:
|
| 106 |
-
print(f"Error downloading or loading
|
| 107 |
raise
|
| 108 |
|
| 109 |
-
def _load_document_forgery_model(self, path: Path):
|
| 110 |
-
"""Load the ELA-based document forgery model from a local .pth checkpoint.
|
| 111 |
-
|
| 112 |
-
Returns the model instance or None if the file does not exist.
|
| 113 |
-
"""
|
| 114 |
-
# If the configured path doesn't exist, try sensible fallbacks in the repo.
|
| 115 |
-
if not path.exists():
|
| 116 |
-
print(f"Document forgery model file not found at configured path: {path}")
|
| 117 |
-
|
| 118 |
-
# 1) Try the configured document forgery checkpoint path relative to repo root
|
| 119 |
-
repo_root = Path(__file__).resolve().parents[2]
|
| 120 |
-
candidate = repo_root / 'features' / 'Model' / 'document_forgery' / path.name
|
| 121 |
-
if candidate.exists():
|
| 122 |
-
path = candidate
|
| 123 |
-
print(f"Found document forgery model at fallback path: {path}")
|
| 124 |
-
else:
|
| 125 |
-
# 2) Search the repo for any file with the configured checkpoint name
|
| 126 |
-
print(f"Searching repository for '{path.name}'...")
|
| 127 |
-
matches = list(repo_root.rglob(path.name))
|
| 128 |
-
if matches:
|
| 129 |
-
path = matches[0]
|
| 130 |
-
print(f"Found document forgery model at: {path}")
|
| 131 |
-
else:
|
| 132 |
-
try:
|
| 133 |
-
path = self._download_document_forgery_model(path)
|
| 134 |
-
except Exception as exc:
|
| 135 |
-
print(f"Document forgery model not found in repository and download failed: {exc}")
|
| 136 |
-
return None
|
| 137 |
-
|
| 138 |
-
print(f"Loading document forgery model from: {path}")
|
| 139 |
-
# Build the ELA model architecture lazily (requires torchvision & torch.nn)
|
| 140 |
-
try:
|
| 141 |
-
import torchvision.models as tv_models
|
| 142 |
-
import torch.nn as nn
|
| 143 |
-
except Exception as e:
|
| 144 |
-
raise RuntimeError(f"Required packages for ELA model not available: {e}")
|
| 145 |
-
|
| 146 |
-
backbone = tv_models.efficientnet_b0(weights='IMAGENET1K_V1')
|
| 147 |
-
in_features = backbone.classifier[1].in_features
|
| 148 |
-
backbone.classifier = nn.Sequential(
|
| 149 |
-
nn.Dropout(p=0.4),
|
| 150 |
-
nn.Linear(in_features, 256),
|
| 151 |
-
nn.ReLU(inplace=True),
|
| 152 |
-
nn.Dropout(p=0.2),
|
| 153 |
-
nn.Linear(256, 2),
|
| 154 |
-
)
|
| 155 |
-
model = backbone
|
| 156 |
-
state = torch.load(str(path), map_location=torch.device(self.device))
|
| 157 |
-
|
| 158 |
-
# The checkpoint might be either a state_dict or a full checkpoint dict
|
| 159 |
-
if isinstance(state, dict) and 'state_dict' in state:
|
| 160 |
-
state_dict = state['state_dict']
|
| 161 |
-
else:
|
| 162 |
-
state_dict = state
|
| 163 |
-
|
| 164 |
-
# Attempt to load state dict; allow strict=False to be tolerant to minor key name differences
|
| 165 |
-
model.load_state_dict(state_dict, strict=False)
|
| 166 |
-
model.to(self.device)
|
| 167 |
-
model.eval()
|
| 168 |
-
return model
|
| 169 |
-
|
| 170 |
-
def _download_document_forgery_model(self, target_path: Path) -> Path:
|
| 171 |
-
"""Download the document forgery checkpoint into the configured local path."""
|
| 172 |
-
if hf_hub_download is None:
|
| 173 |
-
raise RuntimeError("huggingface_hub not available")
|
| 174 |
-
|
| 175 |
-
repo_id = getattr(Config, "DOCUMENT_FORGERY_MODEL_REPO_ID", Config.REAL_FORGED_MODEL_REPO_ID)
|
| 176 |
-
configured_name = getattr(Config, "DOCUMENT_FORGERY_MODEL_FILENAME", str(target_path))
|
| 177 |
-
candidate_filenames = []
|
| 178 |
-
for candidate in (configured_name, str(target_path), target_path.name):
|
| 179 |
-
if candidate and candidate not in candidate_filenames:
|
| 180 |
-
candidate_filenames.append(candidate)
|
| 181 |
-
|
| 182 |
-
last_error = None
|
| 183 |
-
for filename in candidate_filenames:
|
| 184 |
-
try:
|
| 185 |
-
print(f"Downloading document forgery model from Hugging Face repo: {repo_id} ({filename})")
|
| 186 |
-
downloaded_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
|
| 187 |
-
target_path.parent.mkdir(parents=True, exist_ok=True)
|
| 188 |
-
shutil.copy2(downloaded_path, target_path)
|
| 189 |
-
print(f"Document forgery model downloaded to: {target_path}")
|
| 190 |
-
return target_path
|
| 191 |
-
except Exception as exc:
|
| 192 |
-
last_error = exc
|
| 193 |
-
|
| 194 |
-
raise RuntimeError(f"unable to download document forgery model: {last_error}")
|
| 195 |
-
|
| 196 |
-
|
| 197 |
# --- Global Model Instance ---
|
| 198 |
-
MODEL_REPO_ID =
|
| 199 |
-
MODEL_FILENAME =
|
| 200 |
-
|
| 201 |
-
models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME, doc_model_path=DOC_MODEL_PATH)
|
| 202 |
|
|
|
|
| 1 |
+
import torch
|
| 2 |
from pathlib import Path
|
| 3 |
+
from huggingface_hub import hf_hub_download
|
| 4 |
+
from model import FFTCNN # Import the model architecture
|
| 5 |
|
| 6 |
class ModelLoader:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
"""
|
| 8 |
+
A class to load and hold the PyTorch CNN model.
|
| 9 |
+
"""
|
| 10 |
+
def __init__(self, model_repo_id: str, model_filename: str):
|
| 11 |
+
"""
|
| 12 |
+
Initializes the ModelLoader and loads the model.
|
| 13 |
|
| 14 |
+
Args:
|
| 15 |
+
model_repo_id (str): The repository ID on Hugging Face.
|
| 16 |
+
model_filename (str): The name of the model file (.pth) in the repository.
|
| 17 |
+
"""
|
| 18 |
+
self.device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 19 |
+
print(f"Using device: {self.device}")
|
| 20 |
|
| 21 |
+
self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
|
| 22 |
+
print("FFT CNN model loaded successfully.")
|
| 23 |
|
| 24 |
def _load_fft_model(self, repo_id: str, filename: str):
|
| 25 |
+
"""
|
| 26 |
+
Downloads and loads the FFT CNN model from a Hugging Face Hub repository.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
+
Args:
|
| 29 |
+
repo_id (str): The repository ID on Hugging Face.
|
| 30 |
+
filename (str): The name of the model file (.pth) in the repository.
|
| 31 |
+
|
| 32 |
+
Returns:
|
| 33 |
+
The loaded PyTorch model object.
|
| 34 |
+
"""
|
| 35 |
+
print(f"Downloading FFT CNN model from Hugging Face repo: {repo_id}")
|
| 36 |
try:
|
| 37 |
+
# Download the model file from the Hub. It returns the cached path.
|
| 38 |
+
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
|
| 39 |
print(f"Model downloaded to: {model_path}")
|
| 40 |
+
|
| 41 |
+
# Initialize the model architecture
|
| 42 |
model = FFTCNN()
|
| 43 |
+
|
| 44 |
+
# Load the saved weights (state_dict) into the model
|
| 45 |
model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
|
| 46 |
+
|
| 47 |
+
# Set the model to evaluation mode
|
| 48 |
model.to(self.device)
|
| 49 |
model.eval()
|
| 50 |
+
|
| 51 |
return model
|
| 52 |
except Exception as e:
|
| 53 |
+
print(f"Error downloading or loading model from Hugging Face: {e}")
|
| 54 |
raise
|
| 55 |
|
| 56 |
# --- Global Model Instance ---
|
| 57 |
+
MODEL_REPO_ID = 'rhnsa/real_forged_classifier'
|
| 58 |
+
MODEL_FILENAME = 'fft_cnn_model_78.pth'
|
| 59 |
+
models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME)
|
|
|
|
| 60 |
|
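The removed loader accepted checkpoints saved either as a bare `state_dict` or as a wrapper dictionary with a `state_dict` key, whereas the new `_load_fft_model` passes the result of `torch.load` straight to `load_state_dict`. A minimal sketch of the unwrapping step, using plain dictionaries in place of tensors so it runs without PyTorch; the helper name `extract_state_dict` is hypothetical.

```python
from typing import Any, Mapping


def extract_state_dict(checkpoint: Any) -> Any:
    """Return the weights mapping from a bare state_dict or a wrapper dict.

    Hypothetical helper: if the loaded object is a mapping that contains a
    'state_dict' key, return that entry; otherwise return the object as-is.
    """
    if isinstance(checkpoint, Mapping) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


# Plain dicts stand in for real tensors here.
wrapped = {"epoch": 3, "state_dict": {"conv1.weight": [0.1, 0.2]}}
bare = {"conv1.weight": [0.1, 0.2]}
assert extract_state_dict(wrapped) == extract_state_dict(bare)
```

If checkpoints in the repository are only ever saved as bare `state_dict` objects, this step is unnecessary and the direct call in the new code is sufficient.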
features/real_forged_classifier/preprocessor.py
CHANGED
|
@@ -6,7 +6,7 @@ import cv2
|
|
| 6 |
from torchvision import transforms
|
| 7 |
|
| 8 |
# Import the globally loaded models instance
|
| 9 |
-
from
|
| 10 |
|
| 11 |
class ImagePreprocessor:
|
| 12 |
"""
|
|
|
|
| 6 |
from torchvision import transforms
|
| 7 |
|
| 8 |
# Import the globally loaded models instance
|
| 9 |
+
from model_loader import models
|
| 10 |
|
| 11 |
class ImagePreprocessor:
|
| 12 |
"""
|
features/real_forged_classifier/routes.py
CHANGED
|
@@ -1,14 +1,14 @@
|
|
| 1 |
-
from fastapi import APIRouter, File, UploadFile, HTTPException, status
|
| 2 |
from fastapi.responses import JSONResponse
|
| 3 |
|
| 4 |
-
# Import the controller instance
|
| 5 |
-
from
|
| 6 |
|
| 7 |
# Create an API router
|
| 8 |
router = APIRouter()
|
| 9 |
|
| 10 |
@router.post("/classify_forgery", summary="Classify an image as Real or Fake")
|
| 11 |
-
async def classify_image_endpoint(image: UploadFile = File(...)
|
| 12 |
"""
|
| 13 |
Accepts an image file and classifies it as 'real' or 'fake'.
|
| 14 |
|
|
@@ -35,23 +35,3 @@ async def classify_image_endpoint(image: UploadFile = File(...), token: str = De
|
|
| 35 |
|
| 36 |
return JSONResponse(content=result, status_code=status.HTTP_200_OK)
|
| 37 |
|
| 38 |
-
@router.post("/isforged", summary="Check if the document is forged")
|
| 39 |
-
async def is_forged_endpoint(file: UploadFile = File(...), token: str = Depends(verify_token)):
|
| 40 |
-
"""Run the document forgery detector on an uploaded image file.
|
| 41 |
-
|
| 42 |
-
Accepts image uploads (multipart/form-data) and returns a JSON verdict with confidence.
|
| 43 |
-
"""
|
| 44 |
-
if not file.content_type.startswith("image/"):
|
| 45 |
-
raise HTTPException(
|
| 46 |
-
status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
|
| 47 |
-
detail="Unsupported file type. Please upload an image (e.g., JPEG, PNG)."
|
| 48 |
-
)
|
| 49 |
-
|
| 50 |
-
result = document_forger.is_forged(file.file)
|
| 51 |
-
if isinstance(result, dict) and (result.get("error") or result.get("detail")):
|
| 52 |
-
raise HTTPException(
|
| 53 |
-
status_code=status.HTTP_400_BAD_REQUEST,
|
| 54 |
-
detail=result.get("error") or result.get("detail"),
|
| 55 |
-
)
|
| 56 |
-
|
| 57 |
-
return JSONResponse(content=result, status_code=status.HTTP_200_OK)
|
|
|
|
| 1 |
+
from fastapi import APIRouter, File, UploadFile, HTTPException, status
|
| 2 |
from fastapi.responses import JSONResponse
|
| 3 |
|
| 4 |
+
# Import the controller instance
|
| 5 |
+
from controller import controller
|
| 6 |
|
| 7 |
# Create an API router
|
| 8 |
router = APIRouter()
|
| 9 |
|
| 10 |
@router.post("/classify_forgery", summary="Classify an image as Real or Fake")
|
| 11 |
+
async def classify_image_endpoint(image: UploadFile = File(...)):
|
| 12 |
"""
|
| 13 |
Accepts an image file and classifies it as 'real' or 'fake'.
|
| 14 |
|
|
|
|
| 35 |
|
| 36 |
return JSONResponse(content=result, status_code=status.HTTP_200_OK)
|
| 37 |
|
features/text_classifier/controller.py
CHANGED
|
@@ -1,76 +1,49 @@
|
|
|
|
|
| 1 |
import asyncio
|
| 2 |
import logging
|
| 3 |
from io import BytesIO
|
| 4 |
|
| 5 |
-
from fastapi import
|
| 6 |
-
from fastapi.security import
|
| 7 |
-
from config import Config
|
| 8 |
|
| 9 |
-
from .inferencer import
|
| 10 |
from .preprocess import parse_docx, parse_pdf, parse_txt
|
| 11 |
-
|
| 12 |
security = HTTPBearer()
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
# def build_bias_summary(ai_likelihood: float) -> dict[str, object]:
|
| 16 |
-
# """Convert an AI likelihood score into a human-readable bias summary."""
|
| 17 |
-
# if ai_likelihood > 50:
|
| 18 |
-
# overall_bias = "AI"
|
| 19 |
-
# bias_statement = f"The text is biased toward AI-generated writing ({ai_likelihood}% AI likelihood)."
|
| 20 |
-
# elif ai_likelihood < 50:
|
| 21 |
-
# overall_bias = "Human"
|
| 22 |
-
# bias_statement = f"The text is biased toward human writing ({100 - ai_likelihood}% human likelihood)."
|
| 23 |
-
# else:
|
| 24 |
-
# overall_bias = "Balanced"
|
| 25 |
-
# bias_statement = "The text is balanced between AI and human writing."
|
| 26 |
-
|
| 27 |
-
# return {
|
| 28 |
-
# "overall_bias": overall_bias,
|
| 29 |
-
# "bias_statement": bias_statement,
|
| 30 |
-
# }
|
| 31 |
-
|
| 32 |
|
| 33 |
# Verify Bearer token from Authorization header
|
| 34 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 35 |
token = credentials.credentials
|
| 36 |
-
expected_token =
|
| 37 |
if token != expected_token:
|
| 38 |
raise HTTPException(
|
| 39 |
-
status_code=status.HTTP_403_FORBIDDEN,
|
|
|
|
| 40 |
)
|
| 41 |
return token
|
| 42 |
|
| 43 |
-
|
| 44 |
# Classify plain text input
|
| 45 |
async def handle_text_analysis(text: str):
|
| 46 |
text = text.strip()
|
| 47 |
if not text or len(text.split()) < 10:
|
| 48 |
-
raise HTTPException(
|
| 49 |
-
|
| 50 |
-
)
|
| 51 |
-
if len(text) > 50000:
|
| 52 |
-
raise HTTPException(
|
| 53 |
-
status_code=413, detail="Text must be less than 50,000 characters"
|
| 54 |
-
)
|
| 55 |
|
| 56 |
label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
|
| 57 |
-
# bias_summary = build_bias_summary(ai_likelihood)
|
| 58 |
return {
|
| 59 |
"result": label,
|
| 60 |
"perplexity": round(perplexity, 2),
|
| 61 |
-
"ai_likelihood": ai_likelihood
|
| 62 |
}
|
| 63 |
|
| 64 |
-
|
| 65 |
# Extract text from uploaded files (.docx, .pdf, .txt)
|
| 66 |
async def extract_file_contents(file: UploadFile) -> str:
|
| 67 |
content = await file.read()
|
| 68 |
file_stream = BytesIO(content)
|
| 69 |
|
| 70 |
-
if
|
| 71 |
-
file.content_type
|
| 72 |
-
== "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
| 73 |
-
):
|
| 74 |
return parse_docx(file_stream)
|
| 75 |
elif file.content_type == "application/pdf":
|
| 76 |
return parse_pdf(file_stream)
|
|
@@ -79,83 +52,76 @@ async def extract_file_contents(file: UploadFile) -> str:
|
|
| 79 |
else:
|
| 80 |
raise HTTPException(
|
| 81 |
status_code=415,
|
| 82 |
-
detail="Invalid file type. Only .docx, .pdf and .txt are allowed."
|
| 83 |
)
|
| 84 |
|
| 85 |
-
|
| 86 |
# Classify text from uploaded file
|
| 87 |
async def handle_file_upload(file: UploadFile):
|
| 88 |
try:
|
| 89 |
file_contents = await extract_file_contents(file)
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
return {
|
| 93 |
-
"status_code": 413,
|
| 94 |
-
"detail": "Text must be less than 50,000 characters",
|
| 95 |
-
}
|
| 96 |
|
| 97 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 98 |
if not cleaned_text:
|
| 99 |
-
raise HTTPException(
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
)
|
| 103 |
-
# print(f"Cleaned text: '{cleaned_text}'") # Debugging statement
|
| 104 |
-
label, perplexity, ai_likelihood = await asyncio.to_thread(
|
| 105 |
-
classify_text, cleaned_text
|
| 106 |
-
)
|
| 107 |
return {
|
| 108 |
"content": file_contents,
|
| 109 |
"result": label,
|
| 110 |
"perplexity": round(perplexity, 2),
|
| 111 |
-
"ai_likelihood": ai_likelihood
|
| 112 |
}
|
| 113 |
except Exception as e:
|
| 114 |
logging.error(f"Error processing file: {e}")
|
| 115 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 116 |
|
| 117 |
|
|
|
|
| 118 |
async def handle_sentence_level_analysis(text: str):
|
| 119 |
text = text.strip()
|
| 120 |
-
if not text
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 132 |
|
| 133 |
-
# Analyze each sentence from uploaded file
|
| 134 |
async def handle_file_sentence(file: UploadFile):
|
| 135 |
try:
|
| 136 |
file_contents = await extract_file_contents(file)
|
| 137 |
-
if len(file_contents) >
|
| 138 |
-
|
| 139 |
-
return {
|
| 140 |
-
"status_code": 413,
|
| 141 |
-
"detail": "Text must be less than 50,000 characters",
|
| 142 |
-
}
|
| 143 |
|
| 144 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 145 |
if not cleaned_text:
|
| 146 |
-
raise HTTPException(
|
| 147 |
-
status_code=400,
|
| 148 |
-
detail="The uploaded file is empty or only contains whitespace.",
|
| 149 |
-
)
|
| 150 |
|
| 151 |
result = await handle_sentence_level_analysis(cleaned_text)
|
| 152 |
-
return {
|
| 153 |
-
|
| 154 |
-
|
|
|
|
| 155 |
except Exception as e:
|
| 156 |
logging.error(f"Error processing file: {e}")
|
| 157 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 158 |
|
| 159 |
-
|
| 160 |
def classify(text: str):
|
| 161 |
return classify_text(text)
|
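Both versions of `verify_token` compare the presented token to the expected one with `!=`, and the new version reads the expected value from the `MY_SECRET_TOKEN` environment variable. A hedged sketch of a stricter check follows: it rejects every request when the variable is unset and uses the standard library's constant-time comparison. The function name `token_is_valid` is hypothetical and is not part of the pull request.

```python
import hmac
import os


def token_is_valid(presented: str, env_var: str = "MY_SECRET_TOKEN") -> bool:
    """Compare a presented bearer token against the one in the environment.

    Hypothetical helper: returns False when the variable is unset or empty,
    otherwise compares the two values in constant time.
    """
    expected = os.getenv(env_var)
    if not expected:
        # An unset or empty secret should never authenticate anyone.
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Inside a FastAPI dependency, a `False` result would be translated into the same 403 response that `verify_token` already raises.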
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
import asyncio
|
| 3 |
import logging
|
| 4 |
from io import BytesIO
|
| 5 |
|
| 6 |
+
from fastapi import HTTPException, UploadFile, status, Depends
|
| 7 |
+
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
|
|
|
|
| 8 |
|
| 9 |
+
from .inferencer import classify_text
|
| 10 |
from .preprocess import parse_docx, parse_pdf, parse_txt
|
| 11 |
+
import spacy
|
| 12 |
security = HTTPBearer()
|
| 13 |
+
nlp = spacy.load("en_core_web_sm")
|
| 14 |
|
| 15 |
# Verify Bearer token from Authorization header
|
| 16 |
async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
|
| 17 |
token = credentials.credentials
|
| 18 |
+
expected_token = os.getenv("MY_SECRET_TOKEN")
|
| 19 |
if token != expected_token:
|
| 20 |
raise HTTPException(
|
| 21 |
+
status_code=status.HTTP_403_FORBIDDEN,
|
| 22 |
+
detail="Invalid or expired token"
|
| 23 |
)
|
| 24 |
return token
|
| 25 |
|
|
|
|
| 26 |
# Classify plain text input
|
| 27 |
async def handle_text_analysis(text: str):
|
| 28 |
text = text.strip()
|
| 29 |
if not text or len(text.split()) < 10:
|
| 30 |
+
raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
|
| 31 |
+
if len(text) > 10000:
|
| 32 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
|
|
|
|
| 35 |
return {
|
| 36 |
"result": label,
|
| 37 |
"perplexity": round(perplexity, 2),
|
| 38 |
+
"ai_likelihood": ai_likelihood
|
| 39 |
}
|
| 40 |
|
|
|
|
| 41 |
# Extract text from uploaded files (.docx, .pdf, .txt)
|
| 42 |
async def extract_file_contents(file: UploadFile) -> str:
|
| 43 |
content = await file.read()
|
| 44 |
file_stream = BytesIO(content)
|
| 45 |
|
| 46 |
+
if file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
|
|
|
|
|
|
|
|
|
|
| 47 |
return parse_docx(file_stream)
|
| 48 |
elif file.content_type == "application/pdf":
|
| 49 |
return parse_pdf(file_stream)
|
|
|
|
| 52 |
else:
|
| 53 |
raise HTTPException(
|
| 54 |
status_code=415,
|
| 55 |
+
detail="Invalid file type. Only .docx, .pdf and .txt are allowed."
|
| 56 |
)
|
| 57 |
|
|
|
|
| 58 |
# Classify text from uploaded file
|
| 59 |
async def handle_file_upload(file: UploadFile):
|
| 60 |
try:
|
| 61 |
file_contents = await extract_file_contents(file)
|
| 62 |
+
if len(file_contents) > 10000:
|
| 63 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
|
| 65 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 66 |
if not cleaned_text:
|
| 67 |
+
raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
|
| 68 |
+
|
| 69 |
+
label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, cleaned_text)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
return {
|
| 71 |
"content": file_contents,
|
| 72 |
"result": label,
|
| 73 |
"perplexity": round(perplexity, 2),
|
| 74 |
+
"ai_likelihood": ai_likelihood
|
| 75 |
}
|
| 76 |
except Exception as e:
|
| 77 |
logging.error(f"Error processing file: {e}")
|
| 78 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 79 |
|
| 80 |
|
| 81 |
+
|
| 82 |
async def handle_sentence_level_analysis(text: str):
|
| 83 |
text = text.strip()
|
| 84 |
+
if not text.endswith("."):
|
| 85 |
+
text += "."
|
| 86 |
+
|
| 87 |
+
if len(text) > 10000:
|
| 88 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
| 89 |
+
|
| 90 |
+
doc = nlp(text)
|
| 91 |
+
sentences = [sent.text.strip() for sent in doc.sents]
|
| 92 |
+
|
| 93 |
+
results = []
|
| 94 |
+
for sentence in sentences:
|
| 95 |
+
if not sentence:
|
| 96 |
+
continue
|
| 97 |
+
label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, sentence)
|
| 98 |
+
results.append({
|
| 99 |
+
"sentence": sentence,
|
| 100 |
+
"label": label,
|
| 101 |
+
"perplexity": round(perplexity, 2),
|
| 102 |
+
"ai_likelihood": ai_likelihood
|
| 103 |
+
})
|
| 104 |
|
| 105 |
+
return {"analysis": results}# Analyze each sentence from uploaded file
|
| 106 |
async def handle_file_sentence(file: UploadFile):
|
| 107 |
try:
|
| 108 |
file_contents = await extract_file_contents(file)
|
| 109 |
+
if len(file_contents) > 10000:
|
| 110 |
+
raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
|
|
|
|
|
|
|
|
|
|
|
|
|
| 111 |
|
| 112 |
cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
|
| 113 |
if not cleaned_text:
|
| 114 |
+
raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
result = await handle_sentence_level_analysis(cleaned_text)
|
| 117 |
+
return {
|
| 118 |
+
"content": file_contents,
|
| 119 |
+
**result
|
| 120 |
+
}
|
| 121 |
except Exception as e:
|
| 122 |
logging.error(f"Error processing file: {e}")
|
| 123 |
raise HTTPException(status_code=500, detail="Error processing the file")
|
| 124 |
|
|
|
|
| 125 |
def classify(text: str):
|
| 126 |
return classify_text(text)
|
| 127 |
+
|
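In `handle_file_upload` and `handle_file_sentence`, the 413 and 404 errors are raised inside a `try` block whose `except Exception` handler converts every exception into a 500 response. Because FastAPI's `HTTPException` is itself an `Exception` subclass, those specific status codes would not reach the client. One way to preserve them is sketched below. To keep the example runnable without FastAPI it uses a stand-in class `HTTPError`; that class and the name `check_contents` are hypothetical, and the 400 status for empty input follows the removed `handle_file_sentence` rather than the 404 in the new code.

```python
import logging


class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch runs without FastAPI."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


MAX_CHARS = 10_000  # limit used by the new controller code


def check_contents(file_contents: str) -> str:
    """Validate extracted text and return it with newlines and tabs flattened."""
    try:
        if len(file_contents) > MAX_CHARS:
            raise HTTPError(413, "Text must be less than 10,000 characters")
        cleaned = file_contents.replace("\n", " ").replace("\t", " ").strip()
        if not cleaned:
            raise HTTPError(400, "The file is empty or only contains whitespace.")
        return cleaned
    except HTTPError:
        # Deliberate client errors pass through with their own status code.
        raise
    except Exception as exc:
        # Only unexpected failures are reported as a generic 500.
        logging.error("Error processing file: %s", exc)
        raise HTTPError(500, "Error processing the file") from exc
```

The same ordering of `except` clauses applies when the real `HTTPException` is used in place of the stand-in.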
features/text_classifier/inferencer.py
CHANGED
|
@@ -1,272 +1,40 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
from dataclasses import dataclass
|
| 4 |
-
from functools import lru_cache
|
| 5 |
-
import logging
|
| 6 |
-
import random
|
| 7 |
-
from typing import Any
|
| 8 |
-
|
| 9 |
-
import nltk
|
| 10 |
-
import numpy as np
|
| 11 |
-
from scipy.sparse import csr_matrix, hstack
|
| 12 |
import torch
|
| 13 |
-
from
|
| 14 |
-
|
| 15 |
-
from features.text_classifier.model_loader import load_model
|
| 16 |
-
|
| 17 |
-
logger = logging.getLogger(__name__)
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
for resource in ("tokenizers/punkt", "tokenizers/punkt_tab"):
|
| 21 |
-
try:
|
| 22 |
-
nltk.data.find(resource)
|
| 23 |
-
except LookupError:
|
| 24 |
-
nltk.download(resource.split("/")[-1], quiet=True)
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
try:
|
| 28 |
-
import textstat
|
| 29 |
-
except ImportError:
|
| 30 |
-
textstat = None
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
@dataclass
|
| 34 |
-
class SentenceBlendConfig:
|
| 35 |
-
sentence_blend_weight: float = 0.70
|
| 36 |
-
sentence_to_doc_bias: float = 0.35
|
| 37 |
-
max_sentence_blend_weight: float = 0.90
|
| 38 |
-
max_sentence_to_doc_bias: float = 0.80
|
| 39 |
-
random_deviation_pct: float = 2.0
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
class PerplexityCalculator:
|
| 43 |
-
"""Lazy-loaded perplexity calculator for distilgpt2."""
|
| 44 |
-
|
| 45 |
-
def __init__(self, model_name: str = "distilgpt2"):
|
| 46 |
-
self.model_name = model_name
|
| 47 |
-
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 48 |
-
self._tokenizer = None
|
| 49 |
-
self._model = None
|
| 50 |
-
|
| 51 |
-
def _load(self) -> None:
|
| 52 |
-
if self._model is not None and self._tokenizer is not None:
|
| 53 |
-
return
|
| 54 |
-
|
| 55 |
-
logger.info("Loading perplexity model: %s", self.model_name)
|
| 56 |
-
self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
|
| 57 |
-
self._model = AutoModelForCausalLM.from_pretrained(self.model_name).to(self.device)
|
| 58 |
-
self._model.eval()
|
| 59 |
-
logger.info("Perplexity model loaded on %s", self.device)
|
| 60 |
-
|
| 61 |
-
def calculate(self, text: str, max_length: int = 512) -> float:
|
| 62 |
-
try:
|
| 63 |
-
self._load()
|
| 64 |
-
encodings = self._tokenizer(
|
| 65 |
-
text,
|
| 66 |
-
return_tensors="pt",
|
| 67 |
-
truncation=True,
|
| 68 |
-
max_length=max_length,
|
| 69 |
-
)
|
| 70 |
-
input_ids = encodings.input_ids.to(self.device)
|
| 71 |
-
|
| 72 |
-
with torch.no_grad():
|
| 73 |
-
outputs = self._model(input_ids, labels=input_ids)
|
| 74 |
-
loss = outputs.loss
|
| 75 |
-
perplexity = torch.exp(loss).item()
|
| 76 |
-
|
| 77 |
-
return min(float(perplexity), 10000.0)
|
| 78 |
-
except Exception as exc:
|
| 79 |
-
logger.warning("Perplexity fallback used due to error: %s", exc)
|
| 80 |
-
return 100.0
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
_perplexity_calc = PerplexityCalculator()
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
@lru_cache(maxsize=20000)
|
| 87 |
-
def _cached_perplexity(cleaned_text: str) -> float:
|
| 88 |
-
return _perplexity_calc.calculate(cleaned_text)
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
@lru_cache(maxsize=1)
|
| 92 |
-
def _get_model_artifacts() -> tuple[Any, Any, Any, Any, list[str], dict[str, Any]]:
|
| 93 |
-
return load_model()
|
| 94 |
-
|
| 95 |
|
| 96 |
-
|
| 97 |
-
return " ".join(str(text).split()).strip()
|
| 98 |
|
| 99 |
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
if not cleaned:
|
| 103 |
-
return []
|
| 104 |
-
sentences = [s.strip() for s in nltk.sent_tokenize(cleaned) if s.strip()]
|
| 105 |
-
return sentences if sentences else [cleaned]
|
| 106 |
|
|
|
|
|
|
|
| 107 |
|
| 108 |
-
|
| 109 |
-
sentences = split_into_sentences(text)
|
| 110 |
-
if not sentences:
|
| 111 |
-
return {
|
| 112 |
-
"burst_mean": 0.0,
|
| 113 |
-
"burst_std": 0.0,
|
| 114 |
-
"burst_max": 0.0,
|
| 115 |
-
"burst_min": 0.0,
|
| 116 |
-
"burst_range": 0.0,
|
| 117 |
-
}
|
| 118 |
|
| 119 |
-
lengths = np.array([len(s.split()) for s in sentences], dtype=float)
|
| 120 |
-
return {
|
| 121 |
-
"burst_mean": float(np.mean(lengths)),
|
| 122 |
-
"burst_std": float(np.std(lengths)),
|
| 123 |
-
"burst_max": float(np.max(lengths)),
|
| 124 |
-
"burst_min": float(np.min(lengths)),
|
| 125 |
-
"burst_range": float(np.max(lengths) - np.min(lengths)),
|
| 126 |
-
}
|
| 127 |
|
| 128 |
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
lexical_diversity = float(unique_words / num_words) if num_words > 0 else 0.0
|
| 140 |
-
|
| 141 |
-
num_punct = sum(1 for c in text if c in ".,!?;:")
|
| 142 |
-
punct_ratio = float(num_punct / num_chars) if num_chars > 0 else 0.0
|
| 143 |
-
|
| 144 |
-
num_caps = sum(1 for c in text if c.isupper())
|
| 145 |
-
caps_ratio = float(num_caps / num_chars) if num_chars > 0 else 0.0
|
| 146 |
-
|
| 147 |
-
if textstat is not None:
|
| 148 |
-
try:
|
| 149 |
-
flesch_reading = float(textstat.flesch_reading_ease(text))
|
| 150 |
-
flesch_grade = float(textstat.flesch_kincaid_grade(text))
|
| 151 |
-
except Exception:
|
| 152 |
-
flesch_reading = 50.0
|
| 153 |
-
flesch_grade = 8.0
|
| 154 |
-
else:
|
| 155 |
-
flesch_reading = 50.0
|
| 156 |
-
flesch_grade = 8.0
|
| 157 |
-
|
| 158 |
-
return {
|
| 159 |
-
"num_words": float(num_words),
|
| 160 |
-
"num_chars": float(num_chars),
|
| 161 |
-
"num_sentences": float(num_sentences),
|
| 162 |
-
"avg_word_len": avg_word_len,
|
| 163 |
-
"avg_sent_len": avg_sent_len,
|
| 164 |
-
"lexical_diversity": lexical_diversity,
|
| 165 |
-
"punct_ratio": punct_ratio,
|
| 166 |
-
"caps_ratio": caps_ratio,
|
| 167 |
-
"flesch_reading": flesch_reading,
|
| 168 |
-
"flesch_grade": flesch_grade,
|
| 169 |
-
}
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
def extract_all_features(text: str, calc_perplexity: bool = True) -> dict[str, float]:
|
| 173 |
-
cleaned = normalize_text(text)
|
| 174 |
-
features: dict[str, float] = {}
|
| 175 |
-
|
| 176 |
-
if calc_perplexity:
|
| 177 |
-
features["perplexity"] = _cached_perplexity(cleaned)
|
| 178 |
-
else:
|
| 179 |
-
features["perplexity"] = 100.0
|
| 180 |
-
|
| 181 |
-
features.update(extract_burstiness_features(cleaned))
|
| 182 |
-
features.update(extract_stylometry_features(cleaned))
|
| 183 |
-
return features
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
def _predict_ai_probability(text: str) -> tuple[float, float]:
|
| 187 |
-
(
|
| 188 |
-
loaded_classifier,
|
| 189 |
-
loaded_scaler,
|
| 190 |
-
loaded_word_vectorizer,
|
| 191 |
-
loaded_char_vectorizer,
|
| 192 |
-
loaded_features,
|
| 193 |
-
loaded_metadata,
|
| 194 |
-
) = _get_model_artifacts()
|
| 195 |
-
|
| 196 |
-
calc_perplexity = bool(loaded_metadata.get("num_engineered_features", 0) > 0)
|
| 197 |
-
features = extract_all_features(text, calc_perplexity=calc_perplexity)
|
| 198 |
-
|
| 199 |
-
feature_vector = np.array([features[name] for name in loaded_features], dtype=float).reshape(1, -1)
|
| 200 |
-
feature_scaled = loaded_scaler.transform(feature_vector)
|
| 201 |
-
|
| 202 |
-
word_vec = loaded_word_vectorizer.transform([text])
|
| 203 |
-
char_vec = loaded_char_vectorizer.transform([text])
|
| 204 |
-
num_vec = csr_matrix(feature_scaled)
|
| 205 |
-
hybrid_vec = hstack([word_vec, char_vec, num_vec], format="csr")
|
| 206 |
-
|
| 207 |
-
if hasattr(loaded_classifier, "predict_proba"):
|
| 208 |
-
proba = loaded_classifier.predict_proba(hybrid_vec)[0]
|
| 209 |
-
ai_prob = float(proba[1])
|
| 210 |
else:
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
perplexity = float(features.get("perplexity", 100.0))
|
| 215 |
-
return ai_prob, perplexity
|
| 216 |
-
|
| 217 |
-
|
| 218 |
-
def classify_text(text: str) -> tuple[str, float, float]:
|
| 219 |
-
"""Return (label, perplexity, ai_likelihood_percent)."""
|
| 220 |
-
cleaned = normalize_text(text)
|
| 221 |
-
if not cleaned:
|
| 222 |
-
raise ValueError("Input text is empty")
|
| 223 |
-
|
| 224 |
-
ai_prob, perplexity = _predict_ai_probability(cleaned)
|
| 225 |
-
ai_likelihood = round(ai_prob * 100.0, 2)
|
| 226 |
-
label = "AI" if ai_likelihood >= 50.0 else "Human"
|
| 227 |
-
return label, perplexity, ai_likelihood
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
def analyze_text_with_sentences(
|
| 231 |
-
text: str,
|
| 232 |
-
) -> dict[str, Any]:
|
| 233 |
-
text = normalize_text(text)
|
| 234 |
-
overall_classification, overall_perplexity, overall_ai_likelihood = classify_text(text)
|
| 235 |
-
sentences = split_into_sentences(text)
|
| 236 |
-
if not sentences:
|
| 237 |
-
raise ValueError("Input text contains no valid sentences")
|
| 238 |
-
# do the per-sentence analysis
|
| 239 |
-
sentence_results = []
|
| 240 |
-
for sentence in sentences:
|
| 241 |
-
try:
|
| 242 |
-
label, perplexity, ai_likelihood = classify_text(sentence)
|
| 243 |
-
sentence_results.append(
|
| 244 |
-
{
|
| 245 |
-
"sentence": sentence,
|
| 246 |
-
"label": label,
|
| 247 |
-
"perplexity": perplexity,
|
| 248 |
-
"ai_likelihood": ai_likelihood,
|
| 249 |
-
}
|
| 250 |
-
)
|
| 251 |
-
except Exception as exc:
|
| 252 |
-
logger.warning("Error analyzing sentence: %s", exc)
|
| 253 |
-
sentence_results.append(
|
| 254 |
-
{
|
| 255 |
-
"sentence": sentence,
|
| 256 |
-
"label": "Error",
|
| 257 |
-
"perplexity": None,
|
| 258 |
-
"ai_likelihood": None,
|
| 259 |
-
}
|
| 260 |
-
)
|
| 261 |
-
return{
|
| 262 |
-
"sentences": sentence_results,
|
| 263 |
-
"summary": {
|
| 264 |
-
"overall": {
|
| 265 |
-
"label": overall_classification,
|
| 266 |
-
"perplexity": overall_perplexity,
|
| 267 |
-
"ai_likelihood": overall_ai_likelihood,
|
| 268 |
-
}
|
| 269 |
-
},
|
| 270 |
-
|
| 271 |
-
}
|
| 272 |
-
|
| 1 |
import torch
|
| 2 |
+
from .model_loader import get_model_tokenizer
|
| 3 |
|
| 4 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
|
|
| 5 |
|
| 6 |
+
def perplexity_to_ai_likelihood(ppl: float) -> float:
|
| 7 |
+
# You can tune these parameters
|
| 8 |
+
min_ppl = 10 # very confident it's AI
|
| 9 |
+
max_ppl = 100 # very confident it's human
|
| 10 |
|
| 11 |
+
# Clamp to bounds
|
| 12 |
+
ppl = max(min_ppl, min(ppl, max_ppl))
|
| 13 |
|
| 14 |
+
# Invert and scale: lower perplexity -> higher AI-likelihood
|
| 15 |
+
likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
|
| 16 |
|
| 17 |
+
return round(likelihood * 100, 2)
|
| 18 |
|
| 19 |
|
| 20 |
+
def classify_text(text: str):
|
| 21 |
+
model, tokenizer = get_model_tokenizer()
|
| 22 |
+
inputs = tokenizer(text, return_tensors="pt",
|
| 23 |
+
truncation=True, padding=True)
|
| 24 |
+
input_ids = inputs["input_ids"].to(device)
|
| 25 |
+
attention_mask = inputs["attention_mask"].to(device)
|
| 26 |
|
| 27 |
+
with torch.no_grad():
|
| 28 |
+
outputs = model(
|
| 29 |
+
input_ids, attention_mask=attention_mask, labels=input_ids)
|
| 30 |
+
loss = outputs.loss
|
| 31 |
+
perplexity = torch.exp(loss).item()
|
| 32 |
|
| 33 |
+
if perplexity < 55:
|
| 34 |
+
result = "AI-generated"
|
| 35 |
+
elif perplexity < 80:
|
| 36 |
+
result = "Probably AI-generated"
|
| 37 |
else:
|
| 38 |
+
result = "Human-written"
|
| 39 |
+
likelihood_result=perplexity_to_ai_likelihood(perplexity)
|
| 40 |
+
return result, perplexity,likelihood_result
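Review note on the new `inferencer.py`: the label comes from fixed perplexity thresholds (55 and 80), while the percentage comes from a linear map between perplexities 10 and 100, so the two can disagree. The sketch below restates both rules from the diff as pure functions to make that visible. Passing `min_ppl` and `max_ppl` as keyword arguments and the helper name `label_for` are assumptions made for illustration only.

```python
def perplexity_to_ai_likelihood(ppl: float, min_ppl: float = 10.0,
                                max_ppl: float = 100.0) -> float:
    """Linear map from the diff: low perplexity gives a high AI likelihood."""
    ppl = max(min_ppl, min(ppl, max_ppl))  # clamp to [min_ppl, max_ppl]
    likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
    return round(likelihood * 100, 2)


def label_for(ppl: float) -> str:
    """Threshold rule from the diff (hypothetical helper name)."""
    if ppl < 55:
        return "AI-generated"
    if ppl < 80:
        return "Probably AI-generated"
    return "Human-written"


if __name__ == "__main__":
    # Print label and percentage side by side for a few perplexities.
    for ppl in (5, 10, 55, 60, 80, 100, 250):
        print(ppl, label_for(ppl), perplexity_to_ai_likelihood(ppl))
```

A perplexity of 60, for example, is labelled "Probably AI-generated" while the mapped likelihood is below 50%. If the two outputs are meant to agree, deriving the label from the percentage (or the reverse) would be one option.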
|
features/text_classifier/model_loader.py
CHANGED
|
@@ -1,113 +1,50 @@
|
|
| 1 |
-
import
|
| 2 |
-
import logging
|
| 3 |
-
import pickle
|
| 4 |
import shutil
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
import torch
|
| 8 |
from huggingface_hub import snapshot_download
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
ENGLISH_SUBDIR = "English_model"
|
| 17 |
|
| 18 |
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 19 |
-
|
| 20 |
-
REQUIRED_FILES = (
|
| 21 |
-
"classifier.pkl",
|
| 22 |
-
"scaler.pkl",
|
| 23 |
-
"word_vectorizer.pkl",
|
| 24 |
-
"char_vectorizer.pkl",
|
| 25 |
-
"feature_names.json",
|
| 26 |
-
"metadata.json",
|
| 27 |
-
)
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
def _patch_legacy_logistic_model(model):
|
| 31 |
-
"""Backfill attributes expected by newer sklearn versions."""
|
| 32 |
-
if isinstance(model, (LogisticRegression, LogisticRegressionCV)) and not hasattr(model, "multi_class"):
|
| 33 |
-
model.multi_class = "auto"
|
| 34 |
-
return model
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
def _has_required_artifacts(model_dir: Path) -> bool:
|
| 38 |
-
if not model_dir.exists() or not model_dir.is_dir():
|
| 39 |
-
return False
|
| 40 |
-
return all((model_dir / filename).exists() for filename in REQUIRED_FILES)
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
def _resolve_artifact_dir(base_dir: Path) -> Path | None:
|
| 44 |
-
candidates = [base_dir, base_dir / ENGLISH_SUBDIR]
|
| 45 |
-
for candidate in candidates:
|
| 46 |
-
if _has_required_artifacts(candidate):
|
| 47 |
-
return candidate
|
| 48 |
-
return None
|
| 49 |
|
| 50 |
|
| 51 |
def warmup():
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
raise ValueError("LANG_MODEL is not configured")
|
| 55 |
-
if _resolve_artifact_dir(MODEL_DIR):
|
| 56 |
-
logging.info("Model artifacts already exist, skipping download.")
|
| 57 |
-
return
|
| 58 |
download_model_repo()
|
|
|
|
|
|
|
| 59 |
|
| 60 |
|
| 61 |
def download_model_repo():
|
| 62 |
-
if MODEL_DIR
|
| 63 |
-
|
| 64 |
-
if not REPO_ID:
|
| 65 |
-
raise ValueError("English_model repo id is not configured")
|
| 66 |
-
if _resolve_artifact_dir(MODEL_DIR):
|
| 67 |
-
logging.info("Model artifacts already exist, skipping download.")
|
| 68 |
return
|
| 69 |
-
snapshot_path =
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
shutil.copytree(source_dir, MODEL_DIR, dirs_exist_ok=True)
|
| 73 |
|
| 74 |
|
| 75 |
def load_model():
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
download_model_repo()
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
raise FileNotFoundError(
|
| 85 |
-
f"Required model artifacts not found in {MODEL_DIR}. Expected files: {', '.join(REQUIRED_FILES)}"
|
| 86 |
-
)
|
| 87 |
-
|
| 88 |
-
with open(artifact_dir / "classifier.pkl", "rb") as f:
|
| 89 |
-
loaded_classifier = pickle.load(f)
|
| 90 |
-
loaded_classifier = _patch_legacy_logistic_model(loaded_classifier)
|
| 91 |
-
|
| 92 |
-
with open(artifact_dir / "scaler.pkl", "rb") as f:
|
| 93 |
-
loaded_scaler = pickle.load(f)
|
| 94 |
-
|
| 95 |
-
with open(artifact_dir / "word_vectorizer.pkl", "rb") as f:
|
| 96 |
-
loaded_word_vectorizer = pickle.load(f)
|
| 97 |
-
|
| 98 |
-
with open(artifact_dir / "char_vectorizer.pkl", "rb") as f:
|
| 99 |
-
loaded_char_vectorizer = pickle.load(f)
|
| 100 |
-
|
| 101 |
-
with open(artifact_dir / "feature_names.json", "r") as f:
|
| 102 |
-
loaded_features = json.load(f)
|
| 103 |
-
|
| 104 |
-
with open(artifact_dir / "metadata.json", "r") as f:
|
| 105 |
-
loaded_metadata = json.load(f)
|
| 106 |
-
return (
|
| 107 |
-
loaded_classifier,
|
| 108 |
-
loaded_scaler,
|
| 109 |
-
loaded_word_vectorizer,
|
| 110 |
-
loaded_char_vectorizer,
|
| 111 |
-
loaded_features,
|
| 112 |
-
loaded_metadata,
|
| 113 |
-
)
|
|
|
|
| 1 |
+
import os
|
|
|
|
|
|
|
| 2 |
import shutil
|
| 3 |
+
import logging
|
| 4 |
+
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
|
|
|
|
| 5 |
from huggingface_hub import snapshot_download
|
| 6 |
+
import torch
|
| 7 |
+
from dotenv import load_dotenv
|
| 8 |
+
load_dotenv()
|
| 9 |
+
REPO_ID = "can-org/AI-Content-Checker"
|
| 10 |
+
MODEL_DIR = "./models"
|
| 11 |
+
TOKENIZER_DIR = os.path.join(MODEL_DIR, "model")
|
| 12 |
+
WEIGHTS_PATH = os.path.join(MODEL_DIR, "model_weights.pth")
|
|
|
|
| 13 |
|
| 14 |
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 15 |
+
_model, _tokenizer = None, None
|
| 16 |
|
| 17 |
|
| 18 |
def warmup():
|
| 19 |
+
global _model, _tokenizer
|
| 20 |
+
# Ensure punkt is available
|
| 21 |
download_model_repo()
|
| 22 |
+
_model, _tokenizer = load_model()
|
| 23 |
+
logging.info("Its ready")
|
| 24 |
|
| 25 |
|
| 26 |
def download_model_repo():
|
| 27 |
+
if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
|
| 28 |
+
logging.info("Model already exists, skipping download.")
|
| 29 |
return
|
| 30 |
+
snapshot_path = snapshot_download(repo_id=REPO_ID)
|
| 31 |
+
os.makedirs(MODEL_DIR, exist_ok=True)
|
| 32 |
+
shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
|
|
|
|
| 33 |
|
| 34 |
|
| 35 |
def load_model():
|
| 36 |
+
tokenizer = GPT2TokenizerFast.from_pretrained(TOKENIZER_DIR)
|
| 37 |
+
config = GPT2Config.from_pretrained(TOKENIZER_DIR)
|
| 38 |
+
model = GPT2LMHeadModel(config)
|
| 39 |
+
model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
|
| 40 |
+
model.to(device)
|
| 41 |
+
model.eval()
|
| 42 |
+
return model, tokenizer
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def get_model_tokenizer():
|
| 46 |
+
global _model, _tokenizer
|
| 47 |
+
if _model is None or _tokenizer is None:
|
| 48 |
download_model_repo()
|
| 49 |
+
_model, _tokenizer = load_model()
|
| 50 |
+
return _model, _tokenizer
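Review note on the new `model_loader.py`: `download_model_repo` skips the download whenever `MODEL_DIR` exists, so an empty or partially copied directory would be treated as complete and `load_model` would then fail. A minimal sketch of a stricter check follows. It assumes the layout implied by the constants in the diff (`model/` for tokenizer and config, `model_weights.pth` for weights); `artifacts_present` is a hypothetical helper, not part of the PR.

```python
import os


def artifacts_present(model_dir: str) -> bool:
    """Return True only if both expected artifacts exist under model_dir.

    The two paths mirror TOKENIZER_DIR and WEIGHTS_PATH from the diff;
    whether the downloaded repo really has this layout is an assumption.
    """
    tokenizer_dir = os.path.join(model_dir, "model")
    weights_path = os.path.join(model_dir, "model_weights.pth")
    return os.path.isdir(tokenizer_dir) and os.path.isfile(weights_path)
```

Using a check like this in place of the bare `os.path.isdir(MODEL_DIR)` test would let an interrupted download be retried on the next start.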
|
features/text_classifier/preprocess.py
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
|
| 2 |
import docx
|
| 3 |
from io import BytesIO
|
| 4 |
import logging
|
|
@@ -15,16 +15,18 @@ def parse_docx(file: BytesIO):
|
|
| 15 |
|
| 16 |
def parse_pdf(file: BytesIO):
|
| 17 |
try:
|
| 18 |
-
doc =
|
| 19 |
text = ""
|
| 20 |
-
for
|
| 21 |
-
|
| 22 |
-
|
|
|
|
| 23 |
except Exception as e:
|
| 24 |
logging.error(f"Error while processing PDF: {str(e)}")
|
| 25 |
raise HTTPException(
|
| 26 |
status_code=500, detail="Error processing PDF file")
|
| 27 |
|
|
|
|
| 28 |
def parse_txt(file: BytesIO):
|
| 29 |
return file.read().decode("utf-8")
|
| 30 |
|
|
|
|
| 1 |
+
import fitz # PyMuPDF
|
| 2 |
import docx
|
| 3 |
from io import BytesIO
|
| 4 |
import logging
|
|
|
|
| 15 |
|
| 16 |
def parse_pdf(file: BytesIO):
|
| 17 |
try:
|
| 18 |
+
doc = fitz.open(stream=file, filetype="pdf")
|
| 19 |
text = ""
|
| 20 |
+
for page_num in range(doc.page_count):
|
| 21 |
+
page = doc.load_page(page_num)
|
| 22 |
+
text += page.get_text()
|
| 23 |
+
return text
|
| 24 |
except Exception as e:
|
| 25 |
logging.error(f"Error while processing PDF: {str(e)}")
|
| 26 |
raise HTTPException(
|
| 27 |
status_code=500, detail="Error processing PDF file")
|
| 28 |
|
| 29 |
+
|
| 30 |
def parse_txt(file: BytesIO):
|
| 31 |
return file.read().decode("utf-8")
|
| 32 |
|
features/text_classifier/routes.py
CHANGED
|
@@ -37,10 +37,9 @@ async def analyze_sentences(request: Request, data: TextInput, token: str = Depe
|
|
| 37 |
raise HTTPException(status_code=400, detail="Missing 'text' in request body")
|
| 38 |
return await handle_sentence_level_analysis(data.text)
|
| 39 |
|
| 40 |
-
|
| 41 |
-
@router.post("/analyse-sentence-file")
|
| 42 |
@limiter.limit(ACCESS_RATE)
|
| 43 |
-
async def
|
| 44 |
return await handle_file_sentence(file)
|
| 45 |
|
| 46 |
@router.get("/health")
|
|
|
|
| 37 |
raise HTTPException(status_code=400, detail="Missing 'text' in request body")
|
| 38 |
return await handle_sentence_level_analysis(data.text)
|
| 39 |
|
| 40 |
+
@router.post("/analyse-sentance-file")
|
|
|
|
| 41 |
@limiter.limit(ACCESS_RATE)
|
| 42 |
+
async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(verify_token)):
|
| 43 |
return await handle_file_sentence(file)
|
| 44 |
|
| 45 |
@router.get("/health")
|
requirements.txt
CHANGED
|
@@ -15,23 +15,6 @@ tensorflow
|
|
| 15 |
opencv-python
|
| 16 |
pillow
|
| 17 |
scipy
|
| 18 |
-
|
| 19 |
frontend
|
| 20 |
tools
|
| 21 |
-
pandas
|
| 22 |
-
numpy
|
| 23 |
-
scikit-learn
|
| 24 |
-
textstat
|
| 25 |
-
requests
|
| 26 |
-
beautifulsoup4
|
| 27 |
-
langchain
|
| 28 |
-
langchain-community
|
| 29 |
-
langchain-openai
|
| 30 |
-
faiss-cpu
|
| 31 |
-
PyPDF2
|
| 32 |
-
tiktoken
|
| 33 |
-
chromadb
|
| 34 |
-
langchain_chroma
|
| 35 |
-
sentence-transformers
|
| 36 |
-
tf-keras
|
| 37 |
-
torchvision
|
|
|
|
| 15 |
opencv-python
|
| 16 |
pillow
|
| 17 |
scipy
|
| 18 |
+
fitz
|
| 19 |
frontend
|
| 20 |
tools
|
test.md
ADDED
|
@@ -0,0 +1,31 @@
|
|
| 1 |
+
|
| 2 |
+
**Update: Edited & AI-Generated Content Detection – Project Plan**
|
| 3 |
+
|
| 4 |
+
### 🔍 Phase 1: Rule-Based Image Detection (In Progress)
|
| 5 |
+
|
| 6 |
+
We're implementing three core techniques to individually flag edited or AI-generated images:
|
| 7 |
+
|
| 8 |
+
* **ELA (Error Level Analysis):** Highlights inconsistencies via JPEG recompression.
|
| 9 |
+
* **FFT (Frequency Analysis):** Uses 2D Fourier Transform to detect unnatural image frequency patterns.
|
| 10 |
+
* **Metadata Analysis:** Parses EXIF data to catch clues like editing software tags.
|
| 11 |
+
|
| 12 |
+
These give us visual + interpretable results for each image, and currently offer \~60–70% accuracy on typical AI-edited content.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
### Phase 2: AI vs Human Detection System (Coming Soon)
|
| 17 |
+
|
| 18 |
+
**Goal:** Build an AI model that classifies whether content is AI- or human-made — initially focusing on **images**, and later expanding to **text**.
|
| 19 |
+
|
| 20 |
+
**Data Strategy:**
|
| 21 |
+
|
| 22 |
+
* Scraping large volumes of recent AI-gen images (e.g. SDXL, Gibbli, MidJourney).
|
| 23 |
+
* Balancing with high-quality human images.
|
| 24 |
+
|
| 25 |
+
**Model Plan:**
|
| 26 |
+
|
| 27 |
+
* Use ELA, FFT, and metadata as feature extractors.
|
| 28 |
+
* Feed these into a CNN or ensemble model.
|
| 29 |
+
* Later, unify into a full web-based platform (upload → get AI/human probability).
|
| 30 |
+
|
| 31 |
+
|
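Review note on `test.md`: the FFT item under Phase 1 could be illustrated with a single number, the share of spectral energy above some spatial frequency. The sketch below is an assumption-laden illustration, not the project's detector. It uses a naive 2D DFT (not a fast transform) so that it runs with the standard library only; in practice `numpy.fft.fft2` would be the usual choice. The `high_frequency_ratio` metric and the `cutoff` value are hypothetical.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-rows grayscale image."""
    rows, cols = len(image), len(image[0])
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * cmath.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def high_frequency_ratio(image, cutoff=0.25):
    """Share of non-DC spectral energy at or above `cutoff` cycles per pixel."""
    rows, cols = len(image), len(image[0])
    spectrum = dft2(image)
    dc_energy = abs(spectrum[0][0]) ** 2
    high = total = 0.0
    for u in range(rows):
        for v in range(cols):
            if u == 0 and v == 0:
                continue  # skip the DC term (overall brightness)
            # Fold indices into unsigned frequencies in cycles per pixel.
            fu = min(u, rows - u) / rows
            fv = min(v, cols - v) / cols
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if max(fu, fv) >= cutoff:
                high += energy
    # Treat numerically negligible non-DC energy as "no texture at all".
    if total <= 1e-12 * (dc_energy + 1.0):
        return 0.0
    return high / total
```

On an 8x8 checkerboard almost all non-DC energy sits at the highest frequency, while a smooth ramp concentrates it at low frequencies; a learned model would consume such features rather than a fixed threshold.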