Files changed (37)
  1. .env-example +0 -47
  2. .gitignore +1 -6
  3. Procfile +1 -0
  4. README.md +0 -166
  5. __init__.py +0 -1
  6. app.py +17 -28
  7. config.py +0 -59
  8. features/Modelsdfa/English_model/feature_names.json +0 -18
  9. features/Modelsdfa/English_model/metadata.json +0 -13
  10. features/__init__.py +0 -5
  11. features/ai_human_image_classifier/model_loader.py +4 -5
  12. features/image_classifier/model_loader.py +19 -19
  13. features/image_edit_detector/controller.py +2 -3
  14. features/nepali_text_classifier/controller.py +24 -113
  15. features/nepali_text_classifier/inferencer.py +15 -81
  16. features/nepali_text_classifier/model_loader.py +51 -234
  17. features/nepali_text_classifier/preprocess.py +6 -5
  18. features/nepali_text_classifier/routes.py +6 -21
  19. features/rag_chatbot/__init__.py +0 -0
  20. features/rag_chatbot/controller.py +0 -178
  21. features/rag_chatbot/document_handler.py +0 -37
  22. features/rag_chatbot/rag_pipeline.py +0 -329
  23. features/rag_chatbot/routes.py +0 -107
  24. features/real_forged_classifier/__init__.py +0 -9
  25. features/real_forged_classifier/controller.py +2 -95
  26. features/real_forged_classifier/inferencer.py +1 -5
  27. features/real_forged_classifier/main.py +26 -0
  28. features/real_forged_classifier/model_loader.py +39 -181
  29. features/real_forged_classifier/preprocessor.py +1 -1
  30. features/real_forged_classifier/routes.py +4 -24
  31. features/text_classifier/controller.py +51 -85
  32. features/text_classifier/inferencer.py +29 -261
  33. features/text_classifier/model_loader.py +34 -97
  34. features/text_classifier/preprocess.py +7 -5
  35. features/text_classifier/routes.py +2 -3
  36. requirements.txt +1 -18
  37. test.md +31 -0
.env-example CHANGED
@@ -1,49 +1,2 @@
  MY_SECRET_TOKEN="SECRET_CODE_TOKEN"

- # Language/text classifier models
- English_model="Pujan-Dev/Ai_vs_HUMAN"
- Nepali_model="features/Model/Nepali_model"
- LANG_MODEL="features/Model/English_model"
-
- # Hugging Face private model access
- # Create a READ token at: https://huggingface.co/settings/tokens
- HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
- # Optional alias, either variable can be used
- HUGGINGFACE_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-
- # Legacy variables (kept for compatibility)
- REPOSITORY_ID_English_Detector="nepali-detector"
- REPOSITORY_ID_Nepali_Detector="nepali-detector"
-
- # Image classifier
- IMAGE_CLASSIFIER_REPO_ID="can-org/AI-VS-HUMAN-IMAGE-classifier"
- IMAGE_CLASSIFIER_MODEL_DIR="./IMG_Models"
- IMAGE_CLASSIFIER_WEIGHTS_FILE="latest-my_cnn_model.h5"
-
- # AI vs Human image detector
- AI_HUMAN_CLIP_MODEL_NAME="ViT-L/14"
- AI_HUMAN_SVM_REPO_ID="rhnsa/ai_human_image_detector"
- AI_HUMAN_SVM_FILENAME="svm_model_real.joblib"
-
- # Real vs Forged detector
- REAL_FORGED_MODEL_REPO_ID="rhnsa/real_forged_classifier"
- REAL_FORGED_MODEL_FILENAME="fft_cnn_model_78.pth"
-
- # RAG + Chroma settings
- CHROMA_HOST="localhost"
- CHROMA_PORT="8000"
- RAG_COLLECTION_NAME="company_docs_collection"
- RAG_MAX_FILE_SIZE="104857600"
- RAG_MAX_QUERY_LENGTH="1000"
-
- # LLM settings
- LLM_PROVIDER="openai"
- LLM_API_KEY="sk-xxxx"
- LLM_MODEL="gpt-3.5-turbo"
- LLM_TEMPERATURE="0"
- LLM_MAX_TOKENS="2048"
-
- # Notebook/scraper API keys
- GEMINI_API_KEY=""
- GROQ_API_KEY="gsk_xxxx"
- OPENROUTER_API_KEY="sk-or-xxxx"
.gitignore CHANGED
@@ -13,13 +13,11 @@ __pycache__/
  .vscode/
  .idea/
  *.swp
- *Model/

  # ---- Jupyter / IPython ----
  .ipynb_checkpoints/
  *.ipynb
- notebook/
- *.csv
+
  # ---- Model & Data Artifacts ----
  *.pth
  *.pt
@@ -68,6 +66,3 @@ notebooks
  np_text_model/classifier/sentencepiece.bpe.model
  np_text_model/classifier/tokenizer.json

- # vector database
- chroma_data
- chroma_database
Procfile ADDED
@@ -0,0 +1 @@
+ web: uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000}
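The `${PORT:-8000}` expansion in the Procfile falls back to 8000 when `PORT` is unset or empty. A minimal Python sketch of the same fallback, in case the port ever needs to be resolved inside the app instead of by the shell. `resolve_port` is a hypothetical helper and is not part of this change:

```python
import os


def resolve_port(env=None, default=8000):
    """Mirror the shell's ${PORT:-8000}: use PORT unless it is unset or empty."""
    # `env` is an assumption for testability; it defaults to the real environment.
    env = os.environ if env is None else env
    raw = env.get("PORT", "")
    # `:-` treats an empty value the same as an unset one, so test truthiness.
    return int(raw) if raw else default
```

Passing a plain dict as `env` keeps the helper easy to exercise without touching the real environment.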
README.md CHANGED
@@ -14,175 +14,9 @@ pinned: false
14
  This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
15
 
16
  ## How to run locally
17
- ---
18
- title: Testing AI Contain
19
- emoji: 🤖
20
- colorFrom: blue
21
- colorTo: green
22
- sdk: docker
23
- sdk_version: "latest"
24
- app_file: app.py
25
- pinned: false
26
- ---
27
-
28
- # AI-Contain-Checker
29
- # AI-Content-Checker
30
-
31
- A modular AI content detection system with support for **image classification**, **image edit detection**, **Nepali text classification**, and **general text classification**. Built for performance and extensibility, it is ideal for detecting AI-generated content in both visual and textual forms.
32
-
33
-
34
- ## 🌟 Features
35
-
36
- ### 🖼️ Image Classifier
37
-
38
- * **Purpose**: Classifies whether an image is AI-generated or a real-life photo.
39
- * **Model**: Fine-tuned **InceptionV3** CNN.
40
- * **Dataset**: Custom curated dataset with **\~79,950 images** for binary classification.
41
- * **Location**: [`features/image_classifier`](features/image_classifier)
42
- * **Docs**: [`docs/features/image_classifier.md`](docs/features/image_classifier.md)
43
-
44
- ### 🖌️ Image Edit Detector
45
-
46
- * **Purpose**: Detects image tampering or post-processing.
47
- * **Techniques Used**:
48
-
49
- * **Error Level Analysis (ELA)**: Visualizes compression artifacts.
50
- * **Fast Fourier Transform (FFT)**: Detects unnatural frequency patterns.
51
- * **Location**: [`features/image_edit_detector`](features/image_edit_detector)
52
- * **Docs**:
53
-
54
- * [ELA](docs/detector/ELA.md)
55
- * [FFT](docs/detector/fft.md )
56
- * [Metadata Analysis](docs/detector/meta.md)
57
- * [Backend Notes](docs/detector/note-for-backend.md)
58
-
59
- ### 📝 Nepali Text Classifier
60
-
61
- * **Purpose**: Determines if Nepali text content is AI-generated or written by a human.
62
- * **Model**: Based on `XLMRClassifier` fine-tuned on Nepali language data.
63
- * **Dataset**: Scraped dataset of **\~18,000** Nepali texts.
64
- * **Location**: [`features/nepali_text_classifier`](features/nepali_text_classifier)
65
- * **Docs**: [`docs/features/nepali_text_classifier.md`](docs/features/nepali_text_classifier.md)
66
-
67
- ### 🌐 English Text Classifier
68
-
69
- * **Purpose**: Detects if English text is AI-generated or human-written.
70
- * **Pipeline**:
71
-
72
- * Uses **GPT2 tokenizer** for input preprocessing.
73
- * Custom binary classifier to differentiate between AI and human-written content.
74
- * **Location**: [`features/text_classifier`](features/text_classifier)
75
- * **Docs**: [`docs/features/text_classifier.md`](docs/features/text_classifier.md)
76
-
77
- ---
78
-
79
- ## 🗂️ Project Structure
80
-
81
- ```bash
82
- AI-Checker/
83
-
84
- ├── app.py # Main FastAPI entry point
85
- ├── config.py # Configuration settings
86
- ├── Dockerfile # Docker build script
87
- ├── Procfile # Deployment file for Heroku or similar
88
- ├── requirements.txt # Python dependencies
89
- ├── README.md # You are here 📘
90
-
91
- ├── features/ # Core detection modules
92
- │ ├── image_classifier/
93
- │ ├── image_edit_detector/
94
- │ ├── nepali_text_classifier/
95
- │ └── text_classifier/
96
-
97
- ├── docs/ # Internal and API documentation
98
- │ ├── api_endpoints.md
99
- │ ├── deployment.md
100
- │ ├── detector/
101
- │ │ ├── ELA.md
102
- │ │ ├── fft.md
103
- │ │ ├── meta.md
104
- │ │ └── note-for-backend.md
105
- │ ├── functions.md
106
- │ ├── nestjs_integration.md
107
- │ ├── security.md
108
- │ ├── setup.md
109
- │ └── structure.md
110
-
111
- ├── IMG_Models/ # Saved image classifier model(s)
112
- │ └── latest-my_cnn_model.h5
113
-
114
- ├── notebooks/ # Experimental and debug notebooks
115
- ├── static/ # Static assets if needed
116
- └── test.md # Test notes
117
- ````
118
-
119
- ---
120
-
121
- ## 📚 Documentation Links
122
-
123
- * [API Endpoints](docs/api_endpoints.md)
124
- * [Deployment Guide](docs/deployment.md)
125
- * [Detector Documentation](docs/detector/)
126
-
127
- * [Error Level Analysis (ELA)](docs/detector/ELA.md)
128
- * [Fast Fourier Transform (FFT)](docs/detector/fft.md)
129
- * [Metadata Analysis](docs/detector/meta.md)
130
- * [Backend Notes](docs/detector/note-for-backend.md)
131
- * [Functions Overview](docs/functions.md)
132
- * [NestJS Integration Guide](docs/nestjs_integration.md)
133
- * [Security Details](docs/security.md)
134
- * [Setup Instructions](docs/setup.md)
135
- * [Project Structure](docs/structure.md)
136
-
137
- ---
138
-
139
- ## 🚀 Usage
140
-
141
- 1. **Install dependencies**
142
 
143
  ```bash
144
  docker build -t testing-ai-contain .
145
  docker run -p 7860:7860 testing-ai-contain
146
 
147
  ```
148
- ```bash
149
- pip install -r requirements.txt
150
- ```
151
-
152
- 2. **Run the API**
153
-
154
- ```bash
155
- chroma run --path ./chroma_database ## to run chromadb locally
156
- uvicorn app:app --reload --port 8001 ## fastapi (run after chromadb)
157
-
158
- ```
159
-
160
- 3. **Build Docker (optional)**
161
-
162
- ```bash
163
- docker build -t ai-contain-checker .
164
- docker run -p 8000:8000 ai-contain-checker
165
- ```
166
-
167
- ---
168
-
169
- ## 🔐 Security & Integration
170
-
171
- * **Token Authentication** and **IP Whitelisting** supported.
172
- * NestJS integration guide: [`docs/nestjs_integration.md`](docs/nestjs_integration.md)
173
- * Rate limiting handled using `slowapi`.
174
-
175
- ---
176
-
177
- ## 🛡️ Future Plans
178
-
179
- * Add **video classifier** module.
180
- * Expand dataset for **multilingual** AI content detection.
181
- * Add **fine-tuning UI** for models.
182
-
183
- ---
184
-
185
- ## 📄 License
186
-
187
- See full license terms here: [`LICENSE.md`](license.md)
188
-
 
14
  This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
15
 
16
  ## How to run locally
17
 
18
  ```bash
19
  docker build -t testing-ai-contain .
20
  docker run -p 7860:7860 testing-ai-contain
21
 
22
  ```
__init__.py DELETED
@@ -1 +0,0 @@
-
app.py CHANGED
@@ -1,35 +1,25 @@
- import warnings
-
- import requests
  from fastapi import FastAPI, Request
- from fastapi.responses import FileResponse, JSONResponse
- from fastapi.staticfiles import StaticFiles
  from slowapi import Limiter, _rate_limit_exceeded_handler
- from slowapi.errors import RateLimitExceeded
+ from fastapi.responses import FileResponse
  from slowapi.middleware import SlowAPIMiddleware
+ from slowapi.errors import RateLimitExceeded
  from slowapi.util import get_remote_address
-
- from config import ACCESS_RATE
- from features.image_classifier.routes import router as image_classifier_router
- from features.image_edit_detector.routes import router as image_edit_detector_router
- from features.real_forged_classifier.routes import router as real_forged_classifier_router
+ from fastapi.responses import JSONResponse
+ from features.text_classifier.routes import router as text_classifier_router
  from features.nepali_text_classifier.routes import (
  router as nepali_text_classifier_router,
  )
- from features.text_classifier.routes import router as text_classifier_router
+ from features.image_classifier.routes import router as image_classifier_router
+ from features.image_edit_detector.routes import router as image_edit_detector_router
+ from fastapi.staticfiles import StaticFiles

- warnings.filterwarnings("ignore")
- limiter = Limiter(key_func=get_remote_address, default_limits=[ACCESS_RATE])
+ from config import ACCESS_RATE
+
+ import requests

- openapi_tags = [
- {"name": "English Text Classifier", "description": "Endpoints for English AI-vs-human text analysis."},
- {"name": "Nepali Text Classifier", "description": "Endpoints for Nepali AI-vs-human text analysis."},
- {"name": "AI Image Classifier", "description": "Endpoints for AI-vs-human image classification."},
- {"name": "Image Edit Detection", "description": "Endpoints for edited/forged image detection."},
- {"name": "System", "description": "Health and root endpoints."},
- ]
+ limiter = Limiter(key_func=get_remote_address, default_limits=[ACCESS_RATE])

- app = FastAPI(openapi_tags=openapi_tags)
+ app = FastAPI()
  # added the robots.txt
  # Set up SlowAPI
  app.state.limiter = limiter
@@ -47,14 +37,13 @@ app.add_exception_handler(
  app.add_middleware(SlowAPIMiddleware)

  # Include your routes
- app.include_router(text_classifier_router, prefix="/text", tags=["English Text Classifier"])
- app.include_router(nepali_text_classifier_router, prefix="/NP", tags=["Nepali Text Classifier"])
- app.include_router(image_classifier_router, prefix="/AI-image", tags=["AI Image Classifier"])
- app.include_router(image_edit_detector_router, prefix="/detect", tags=["Image Edit Detection"])
- app.include_router(real_forged_classifier_router, prefix="/real-forged", tags=["Real/Forged Image Classifier"])
+ app.include_router(text_classifier_router, prefix="/text")
+ app.include_router(nepali_text_classifier_router, prefix="/NP")
+ app.include_router(image_classifier_router, prefix="/AI-image")
+ app.include_router(image_edit_detector_router, prefix="/detect")


- @app.get("/", tags=["System"])
+ @app.get("/")
  @limiter.limit(ACCESS_RATE)
  async def root(request: Request):
  return {
config.py CHANGED
@@ -1,61 +1,2 @@
- import os
-
- import dotenv
-
- dotenv.load_dotenv()
-
  ACCESS_RATE = "20/minute"

-
- class Config:
-     Nepali_model_folder = os.getenv("Nepali_model")
-     English_model_folder = os.getenv("English_model")
-     REPO_ID_LANG = os.getenv("English_model") or "Pujan-Dev/Ai_vs_HUMAN"
-     LANG_MODEL = os.getenv("LANG_MODEL")
-     HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
-     SECRET_TOKEN = os.getenv("MY_SECRET_TOKEN")
-
-     IMAGE_CLASSIFIER_REPO_ID = os.getenv("IMAGE_CLASSIFIER_REPO_ID", "can-org/AI-VS-HUMAN-IMAGE-classifier")
-     IMAGE_CLASSIFIER_MODEL_DIR = os.getenv("IMAGE_CLASSIFIER_MODEL_DIR", "./IMG_Models")
-     IMAGE_CLASSIFIER_WEIGHTS_FILE = os.getenv("IMAGE_CLASSIFIER_WEIGHTS_FILE", "latest-my_cnn_model.h5")
-
-     AI_HUMAN_CLIP_MODEL_NAME = os.getenv("AI_HUMAN_CLIP_MODEL_NAME", "ViT-L/14")
-     AI_HUMAN_SVM_REPO_ID = os.getenv("AI_HUMAN_SVM_REPO_ID", "rhnsa/ai_human_image_detector")
-     AI_HUMAN_SVM_FILENAME = os.getenv("AI_HUMAN_SVM_FILENAME", "svm_model_real.joblib")
-
-     REAL_FORGED_MODEL_REPO_ID = os.getenv("REAL_FORGED_MODEL_REPO_ID", "rhnsa/real_forged_classifier")
-     REAL_FORGED_MODEL_FILENAME = os.getenv("REAL_FORGED_MODEL_FILENAME", "fft_cnn_model_78.pth")
-     REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
-     DOCUMENT_FORGERY_MODEL_REPO_ID = os.getenv(
-         "DOCUMENT_FORGERY_MODEL_REPO_ID",
-         REPO_ID_LANG
-     )
-     DOCUMENT_FORGERY_MODEL_FILENAME = os.getenv(
-         "DOCUMENT_FORGERY_MODEL_FILENAME",
-         "document_forgery/pixel_forgery_v3_best.pth",
-     )
-     DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
-         "DOCUMENT_FORGERY_MODEL_PATH",
-         "features/Modelsdfa/document_forgery/pixel_forgery_v3_best.pth",
-     )
-     # Decision thresholds for document forgery detector (probabilities in 0..1)
-     DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
-     DOCUMENT_FORGERY_FORGED_LOW = float(os.getenv("DOCUMENT_FORGERY_FORGED_LOW", "0.55"))
-
-     RAG_CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
-     RAG_CHROMA_PORT = int(os.getenv("CHROMA_PORT", "8000"))
-     RAG_COLLECTION_NAME = os.getenv("RAG_COLLECTION_NAME", "company_docs_collection")
-
-     RAG_LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai").lower()
-     RAG_LLM_API_KEY = os.getenv("LLM_API_KEY")
-     RAG_LLM_MODEL = os.getenv("LLM_MODEL", "gpt-3.5-turbo")
-     RAG_LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0"))
-     RAG_LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2048"))
-
-     RAG_MAX_FILE_SIZE = int(os.getenv("RAG_MAX_FILE_SIZE", str(100 * 1024 * 1024)))
-     RAG_MAX_QUERY_LENGTH = int(os.getenv("RAG_MAX_QUERY_LENGTH", "1000"))
-     RAG_SUPPORTED_CONTENT_TYPES = {
-         "application/pdf",
-         "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-         "text/plain",
-     }
features/Modelsdfa/English_model/feature_names.json DELETED
@@ -1,18 +0,0 @@
- [
- "perplexity",
- "burst_mean",
- "burst_std",
- "burst_max",
- "burst_min",
- "burst_range",
- "num_words",
- "num_chars",
- "num_sentences",
- "avg_word_len",
- "avg_sent_len",
- "lexical_diversity",
- "punct_ratio",
- "caps_ratio",
- "flesch_reading",
- "flesch_grade"
- ]
features/Modelsdfa/English_model/metadata.json DELETED
@@ -1,13 +0,0 @@
- {
- "selected_model": "hybrid_tfidf_logistic",
- "cv_best_f1": 0.8593569681592504,
- "num_engineered_features": 16,
- "num_word_tfidf_features": 86956,
- "num_char_tfidf_features": 80000,
- "train_samples": 15952,
- "test_samples": 3988,
- "train_accuracy": 0.980253259779338,
- "train_f1": 0.980182447310475,
- "test_accuracy": 0.8713640922768305,
- "test_f1": 0.8707482993197279
- }
features/__init__.py DELETED
@@ -1,5 +0,0 @@
- """Top-level features package for the aiapi project."""
-
- __all__ = [
- # Subpackages are dynamically discovered; keep this minimal.
- ]
features/ai_human_image_classifier/model_loader.py CHANGED
@@ -3,7 +3,6 @@ import torch
  import joblib
  from pathlib import Path
  from huggingface_hub import hf_hub_download
- from config import Config

  class ModelLoader:
  """
@@ -57,7 +56,7 @@ class ModelLoader:
  print(f"Downloading SVM model from Hugging Face repo: {repo_id}")
  try:
  # Download the model file from the Hub. It returns the cached path.
- model_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
+ model_path = hf_hub_download(repo_id=repo_id, filename=filename)
  print(f"SVM model downloaded to: {model_path}")

  # Load the model from the downloaded path
@@ -69,9 +68,9 @@ class ModelLoader:

  # --- Global Model Instance ---
  # This creates a single instance of the models that can be imported by other modules.
- CLIP_MODEL_NAME = Config.AI_HUMAN_CLIP_MODEL_NAME
- SVM_REPO_ID = Config.AI_HUMAN_SVM_REPO_ID
- SVM_FILENAME = Config.AI_HUMAN_SVM_FILENAME
+ CLIP_MODEL_NAME = 'ViT-L/14'
+ SVM_REPO_ID = 'rhnsa/ai_human_image_detector'
+ SVM_FILENAME = 'svm_model_real.joblib' # The name of your model file in the Hugging Face repo

  # This instance will be created when the application starts.
  models = ModelLoader(
features/image_classifier/model_loader.py CHANGED
@@ -1,21 +1,27 @@
  import os
  import shutil
  import logging
+ import tensorflow as tf
+ from tensorflow.keras.layers import Layer
  from huggingface_hub import snapshot_download
- from config import Config
-
- os.environ.setdefault("CUDA_VISIBLE_DEVICES", "-1")
- os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

  # Model config
- REPO_ID = Config.IMAGE_CLASSIFIER_REPO_ID
- MODEL_DIR = Config.IMAGE_CLASSIFIER_MODEL_DIR
- WEIGHTS_PATH = os.path.join(MODEL_DIR, Config.IMAGE_CLASSIFIER_WEIGHTS_FILE)
- HF_TOKEN = Config.HF_TOKEN
+ REPO_ID = "can-org/AI-VS-HUMAN-IMAGE-classifier"
+ MODEL_DIR = "./IMG_Models"
+ WEIGHTS_PATH = os.path.join(MODEL_DIR, "latest-my_cnn_model.h5")
+
+ # Device info (for logging)
+ gpus = tf.config.list_physical_devices("GPU")
+ device = "cuda" if gpus else "cpu"

  # Global model reference
  _model_img = None

+ # Custom layer used in the model
+ class Cast(Layer):
+ def call(self, inputs):
+ return tf.cast(inputs, tf.float32)
+
  def warmup():
  global _model_img
  download_model_repo()
@@ -26,7 +32,7 @@ def download_model_repo():
  if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
  logging.info("Image model already exists, skipping download.")
  return
- snapshot_path = snapshot_download(repo_id=REPO_ID, token=HF_TOKEN)
+ snapshot_path = snapshot_download(repo_id=REPO_ID)
  os.makedirs(MODEL_DIR, exist_ok=True)
  shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)

@@ -35,17 +41,11 @@ def load_model():
  if _model_img is not None:
  return _model_img

- import tensorflow as tf
-
- class Cast(tf.keras.layers.Layer):
- def call(self, inputs):
- return tf.cast(inputs, tf.float32)
+ print(f"{'GPU detected' if device == 'cuda' else 'No GPU detected'}, loading model on {device.upper()}.")

- print("Loading image model on CPU.")
- with tf.device("/CPU:0"):
- _model_img = tf.keras.models.load_model(
- WEIGHTS_PATH, custom_objects={"Cast": Cast}
- )
+ _model_img = tf.keras.models.load_model(
+ WEIGHTS_PATH, custom_objects={"Cast": Cast}
+ )
  print("Model input shape:", _model_img.input_shape)
  return _model_img

features/image_edit_detector/controller.py CHANGED
@@ -7,9 +7,8 @@ from .detectors.ela import run_ela
  from .preprocess import preprocess_image
  from fastapi import HTTPException,status,Depends
  from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
- from config import Config
  security=HTTPBearer()
-
+ import os
  async def process_image_ela(image_bytes: bytes, quality: int=90):
  image = Image.open(io.BytesIO(image_bytes))

@@ -41,7 +40,7 @@ async def process_meta_image(image_bytes: bytes) -> dict:

  async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
  token = credentials.credentials
- expected_token = Config.SECRET_TOKEN
+ expected_token = os.getenv("MY_SECRET_TOKEN")
  if token != expected_token:
  raise HTTPException(
  status_code=status.HTTP_403_FORBIDDEN,
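In this hunk `verify_token` reads `MY_SECRET_TOKEN` with `os.getenv` on each call and compares it using `!=`. If the variable is unset, `os.getenv` returns `None`, so every request is rejected without any hint about the cause. A hedged sketch of a stricter check using only the standard library. `token_matches` is a hypothetical helper and is not part of this diff:

```python
import hmac
import os


def token_matches(presented, env=None):
    """Return True only if a secret is configured and the presented token equals it."""
    # `env` is an assumption for testability; it defaults to the real environment.
    env = os.environ if env is None else env
    expected = env.get("MY_SECRET_TOKEN")
    if not expected or not presented:
        # Refuse everything when no secret is configured or no token was sent.
        return False
    # compare_digest is designed to avoid leaking, through timing, where inputs differ.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A caller would raise the same 403 `HTTPException` whenever this returns `False`, so the endpoint behaviour stays as in the diff.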
features/nepali_text_classifier/controller.py CHANGED
@@ -1,87 +1,23 @@
1
  import asyncio
2
- import hashlib
3
- import logging
4
- import random
5
  from io import BytesIO
6
  from fastapi import HTTPException, UploadFile, status, Depends
7
  from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
8
- from config import Config
9
  from features.nepali_text_classifier.inferencer import classify_text
10
  from features.nepali_text_classifier.preprocess import *
11
  import re
12
 
13
  security = HTTPBearer()
14
 
15
-
16
- def parse_selected_models(models: str | None) -> list[str] | None:
17
- if not models:
18
- return None
19
- parsed = [m.strip() for m in models.split(",") if m.strip()]
20
- return parsed[:2] if parsed else None
21
-
22
  def contains_english(text: str) -> bool:
23
  # Remove escape characters
24
  cleaned = text.replace("\n", "").replace("\t", "")
25
  return bool(re.search(r'[a-zA-Z]', cleaned))
26
 
27
 
28
- def _clamp(value: float, lower: float, upper: float) -> float:
29
- return max(lower, min(upper, value))
30
-
31
-
32
- def _raw_ai_score(label: str, confidence: float) -> float:
33
- conf = _clamp(float(confidence), 0.0, 100.0)
34
- return conf if label == "AI" else (100.0 - conf)
35
-
36
- def _sentence_bias_strength(overall_confidence: float) -> float:
37
- # Equation: beta = min(0.15, 0.05 + 0.10 * (C_doc / 100))
38
- return min(0.15, 0.05 + 0.10 * (_clamp(overall_confidence, 0.0, 100.0) / 100.0))
39
-
40
-
41
- def _deterministic_jitter(seed_text: str, max_jitter: float = 3.0) -> float:
42
- digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
43
- seed_value = int.from_bytes(digest[:8], byteorder="big", signed=False)
44
- rng = random.Random(seed_value)
45
- return rng.uniform(-max_jitter, max_jitter)
46
-
47
-
48
- def _add_likelihood_randomness(likelihood: float, seed_text: str, max_jitter: float = 3.0) -> float:
49
- jitter = _deterministic_jitter(seed_text=seed_text, max_jitter=max_jitter)
50
- return _clamp(likelihood + jitter, 50.0, 99.95)
51
-
52
-
53
- def _biased_sentence_result(
54
- sentence_result: dict,
55
- overall_confidence: float,
56
- target_label: str = "Human",
57
- seed_text: str = "",
58
- ) -> dict:
59
- raw_label = sentence_result["label"]
60
- raw_confidence = float(sentence_result["confidence"])
61
- raw_ai = _raw_ai_score(raw_label, raw_confidence)
62
-
63
- target_ai = 100.0 if target_label == "AI" else 0.0
64
- beta = _sentence_bias_strength(overall_confidence)
65
-
66
- # Equation: S_biased = (1 - beta) * S_raw + beta * T
67
- biased_ai = _clamp((1.0 - beta) * raw_ai + beta * target_ai, 0.0, 100.0)
68
- # Force final label toward overall target to ensure overall bias is applied.
69
- biased_label = target_label
70
- biased_confidence = biased_ai if target_label == "AI" else (100.0 - biased_ai)
71
- biased_confidence = _add_likelihood_randomness(
72
- biased_confidence,
73
- seed_text=f"{seed_text}|{target_label}|{round(overall_confidence, 2)}",
74
- )
75
-
76
- return {
77
- "biased_label": biased_label,
78
- "biased_confidence": round(biased_confidence, 2),
79
- }
80
-
81
-
82
  async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
83
  token = credentials.credentials
84
- expected_token = Config.SECRET_TOKEN
85
  if token != expected_token:
86
  raise HTTPException(
87
  status_code=status.HTTP_403_FORBIDDEN,
@@ -89,16 +25,15 @@ async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(secur
89
  )
90
  return token
91
 
92
- async def nepali_text_analysis(text: str, models: str | None = None):
93
  end_symbol_for_NP_text(text)
94
  words = text.split()
95
  if len(words) < 10:
96
  raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
97
- if len(text) > 50000:
98
- raise HTTPException(status_code=413, detail="Text must be less than 50 ,000 characters")
99
 
100
- selected_models = parse_selected_models(models)
101
- result = await asyncio.to_thread(classify_text, text, selected_models, 2)
102
 
103
  return result
104
 
@@ -116,19 +51,18 @@ async def extract_file_contents(file:UploadFile)-> str:
116
  else:
117
  raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
118
 
119
- async def handle_file_upload(file: UploadFile, models: str | None = None):
120
  try:
121
  file_contents = await extract_file_contents(file)
122
  end_symbol_for_NP_text(file_contents)
123
- if len(file_contents) > 50000:
124
- raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
125
 
126
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
127
  if not cleaned_text:
128
  raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
129
 
130
- selected_models = parse_selected_models(models)
131
- result = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
132
  return result
133
  except Exception as e:
134
  logging.error(f"Error processing file: {e}")
@@ -136,45 +70,34 @@ async def handle_file_upload(file: UploadFile, models: str | None = None):
136
 
137
 
138
 
139
- async def handle_sentence_level_analysis(text: str, models: str | None = None):
140
  text = text.strip()
141
- if len(text) > 50000:
142
- raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
143
 
144
  end_symbol_for_NP_text(text)
145
 
146
  # Split text into sentences
147
  sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
148
- selected_models = parse_selected_models(models)
149
-
150
- overall = await asyncio.to_thread(classify_text, text, selected_models, 2)
151
- overall_label = overall["label"]
152
- overall_confidence = float(overall["confidence"])
153
 
154
  results = []
155
  for sentence in sentences:
156
  end_symbol_for_NP_text(sentence)
157
- result = await asyncio.to_thread(classify_text, sentence, selected_models, 2)
158
- biased = _biased_sentence_result(
159
- result,
160
- overall_confidence,
161
- target_label=overall_label,
162
- seed_text=sentence,
163
- )
164
  results.append({
165
  "text": sentence,
166
- "result": biased["biased_label"],
167
- "likelihood": biased["biased_confidence"],
168
  })
169
 
170
  return {"analysis": results}
171
 
172
 
173
- async def handle_file_sentence(file:UploadFile, models: str | None = None):
174
  try:
175
  file_contents = await extract_file_contents(file)
176
- if len(file_contents) > 50000:
177
- raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
178
 
179
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
180
  if not cleaned_text:
@@ -183,27 +106,16 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
183
 
184
  # Split text into sentences
185
  sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
186
- selected_models = parse_selected_models(models)
187
-
188
- overall = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
189
- overall_label = overall["label"]
190
- overall_confidence = float(overall["confidence"])
191
 
192
  results = []
193
  for sentence in sentences:
194
  end_symbol_for_NP_text(sentence)
195
 
196
- result = await asyncio.to_thread(classify_text, sentence, selected_models, 2)
197
- biased = _biased_sentence_result(
198
- result,
199
- overall_confidence,
200
- target_label=overall_label,
201
- seed_text=sentence,
202
- )
203
  results.append({
204
  "text": sentence,
205
- "result": biased["biased_label"],
206
- "likelihood": biased["biased_confidence"],
207
  })
208
 
209
  return {"analysis": results}
@@ -213,7 +125,6 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
213
  raise HTTPException(status_code=500, detail="Error processing the file")
214
 
215
 
216
- def classify(text: str, models: str | None = None):
217
- selected_models = parse_selected_models(models)
218
- return classify_text(text, selected_models, 2)
219
 
 
1
  import asyncio
 
 
 
2
  from io import BytesIO
3
  from fastapi import HTTPException, UploadFile, status, Depends
4
  from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
5
+ import os
6
  from features.nepali_text_classifier.inferencer import classify_text
7
  from features.nepali_text_classifier.preprocess import *
8
  import re
9
 
10
  security = HTTPBearer()
11
 
12
  def contains_english(text: str) -> bool:
13
  # Remove escape characters
14
  cleaned = text.replace("\n", "").replace("\t", "")
15
  return bool(re.search(r'[a-zA-Z]', cleaned))
16
 
17
 
18
  async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
19
  token = credentials.credentials
20
+ expected_token = os.getenv("MY_SECRET_TOKEN")
21
  if token != expected_token:
22
  raise HTTPException(
23
  status_code=status.HTTP_403_FORBIDDEN,
 
25
  )
26
  return token
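The check above compares the bearer token to `MY_SECRET_TOKEN` with `!=`. A minimal sketch of a stricter variant, assuming the same environment variable; the helper name `token_is_valid` is hypothetical and not part of this commit. It refuses to authenticate when the variable is unset and uses `hmac.compare_digest` for a constant-time comparison:

```python
import hmac
import os


def token_is_valid(token: str, env_var: str = "MY_SECRET_TOKEN") -> bool:
    """Return True only if `token` matches the secret stored in `env_var`.

    Hypothetical helper: the commit compares with `!=` inline instead.
    """
    expected = os.getenv(env_var)
    if not expected:
        # An unset or empty secret should never authenticate anyone.
        return False
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(token.encode("utf-8"), expected.encode("utf-8"))
```

Inside `verify_token`, a `False` result would map to the existing 403 response.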
27
 
28
+ async def nepali_text_analysis(text: str):
29
  end_symbol_for_NP_text(text)
30
  words = text.split()
31
  if len(words) < 10:
32
  raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
33
+ if len(text) > 10000:
34
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
35
 
36
+ result = await asyncio.to_thread(classify_text, text)
 
37
 
38
  return result
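The new limits are 10 words minimum and 10,000 characters maximum. Note that `len(text) > 10000` accepts a text of exactly 10,000 characters, while the error message says "less than". A small sketch of the same two checks in isolation; `validate_text` is a hypothetical helper and raises `ValueError` rather than `HTTPException` so it runs without FastAPI:

```python
def validate_text(text: str, min_words: int = 10, max_chars: int = 10000) -> None:
    """Raise ValueError if text has too few words or too many characters."""
    if len(text.split()) < min_words:
        raise ValueError(f"Text must contain at least {min_words} words")
    if len(text) > max_chars:
        # Same comparison as the controller: exactly max_chars is allowed.
        raise ValueError(f"Text must be at most {max_chars:,} characters")
```

A caller would translate the `ValueError` into the 400 or 413 response used in the controller.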
39
 
 
51
  else:
52
  raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
53
 
54
+ async def handle_file_upload(file: UploadFile):
55
  try:
56
  file_contents = await extract_file_contents(file)
57
  end_symbol_for_NP_text(file_contents)
58
+ if len(file_contents) > 10000:
59
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
60
 
61
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
62
  if not cleaned_text:
63
  raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
64
 
65
+ result = await asyncio.to_thread(classify_text, cleaned_text)
 
66
  return result
67
  except Exception as e:
68
  logging.error(f"Error processing file: {e}")
 
70
 
71
 
72
 
73
+ async def handle_sentence_level_analysis(text: str):
74
  text = text.strip()
75
+ if len(text) > 10000:
76
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
77
 
78
  end_symbol_for_NP_text(text)
79
 
80
  # Split text into sentences
81
  sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
 
 
 
 
 
82
 
83
  results = []
84
  for sentence in sentences:
85
  end_symbol_for_NP_text(sentence)
86
+ result = await asyncio.to_thread(classify_text, sentence)
 
 
 
 
 
 
87
  results.append({
88
  "text": sentence,
89
+ "result": result["label"],
90
+ "likelihood": result["confidence"]
91
  })
92
 
93
  return {"analysis": results}
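Sentence-level analysis splits the input on the danda (`।`) and re-appends it to every non-empty piece. A standalone sketch of that step; `split_nepali_sentences` is a hypothetical name used only for illustration:

```python
def split_nepali_sentences(text: str, danda: str = "।") -> list[str]:
    """Split text on the danda and restore it on every non-empty sentence."""
    return [part.strip() + danda for part in text.split(danda) if part.strip()]


sentences = split_nepali_sentences("म घर जान्छु। तिमी कहाँ छौ।  ")
```

A trailing fragment without a danda also gets one appended, which matches the list comprehension in the handler.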
94
 
95
 
96
+ async def handle_file_sentence(file:UploadFile):
97
  try:
98
  file_contents = await extract_file_contents(file)
99
+ if len(file_contents) > 10000:
100
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
101
 
102
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
103
  if not cleaned_text:
 
106
 
107
  # Split text into sentences
108
  sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
 
 
 
 
 
109
 
110
  results = []
111
  for sentence in sentences:
112
  end_symbol_for_NP_text(sentence)
113
 
114
+ result = await asyncio.to_thread(classify_text, sentence)
 
 
 
 
 
 
115
  results.append({
116
  "text": sentence,
117
+ "result": result["label"],
118
+ "likelihood": result["confidence"]
119
  })
120
 
121
  return {"analysis": results}
 
125
  raise HTTPException(status_code=500, detail="Error processing the file")
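The file handlers wrap their work in `try` / `except Exception` and end by raising a 500. `HTTPException` is itself an `Exception`, so unless the collapsed lines re-raise it first, the 413 and 404 errors raised inside the `try` would reach the client as 500. A sketch of the pass-through pattern, using a stand-in `HTTPError` class (an assumption, so the example runs without FastAPI) and a hypothetical `process` function:

```python
class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch runs without FastAPI."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def process(text: str, limit: int = 10000) -> str:
    try:
        if len(text) > limit:
            raise HTTPError(413, "Text too long")
        if not text.strip():
            raise HTTPError(404, "Empty text")
        return text.strip()
    except HTTPError:
        # Deliberate client errors pass through with their own status code.
        raise
    except Exception as exc:
        # Anything unexpected becomes a generic 500.
        raise HTTPError(500, "Error processing the file") from exc
```

The deleted `features/rag_chatbot/controller.py` in this same diff uses the equivalent `except HTTPException: raise` clause.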
126
 
127
 
128
+ def classify(text: str):
129
+ return classify_text(text)
 
130
 
features/nepali_text_classifier/inferencer.py CHANGED
@@ -1,89 +1,23 @@
1
- import re
 
 
2
 
3
- from scipy.sparse import csr_matrix, hstack
4
 
5
- from .model_loader import get_default_top_models, load_artifacts
6
 
 
 
 
 
7
 
8
- TOP_K_MODELS = 1
 
 
 
 
 
9
 
 
10
 
11
- def normalize_nepali_text(text: str) -> str:
12
- text = str(text)
13
- text = re.sub(r"https?://\S+|www\.\S+", " ", text)
14
- text = re.sub(r"[^\u0900-\u097F\s।!?,]", " ", text)
15
- return re.sub(r"\s+", " ", text).strip()
16
 
17
 
18
- def _select_models(models, model_names=None, top_k=2):
19
- _ = model_names
20
- ranked = [name for name in get_default_top_models(top_k=top_k) if name in models]
21
- if ranked:
22
- return ranked[:top_k]
23
- return list(models.keys())[:top_k]
24
-
25
-
26
- def classify_text(text: str, model_names="Logistic Regression", top_k: int = 1):
27
- artifacts = load_artifacts()
28
- models = artifacts["models"]
29
- if not models:
30
- return {"error": "No models available for inference"}
31
-
32
- cleaned_text = normalize_nepali_text(text)
33
- word_features = artifacts["word_vectorizer"].transform([cleaned_text])
34
- char_features = artifacts["char_vectorizer"].transform([cleaned_text])
35
- rich_features = artifacts["rich_transformer"].transform([cleaned_text])
36
- features = hstack([word_features, char_features, csr_matrix(rich_features)])
37
-
38
- selected_names = _select_models(models, model_names=model_names, top_k=TOP_K_MODELS)
39
- dense_models = {"Linear SVC"}
40
-
41
- per_model = []
42
- ai_votes = 0
43
- human_votes = 0
44
- confidence_sum = 0.0
45
-
46
- for name in selected_names:
47
- model = models[name]
48
- model_input = features.toarray() if name in dense_models else features
49
- pred = int(model.predict(model_input)[0])
50
- confidence = None
51
- if hasattr(model, "predict_proba"):
52
- probs = model.predict_proba(model_input)
53
- confidence = float(probs[0][pred])
54
- elif hasattr(model, "decision_function"):
55
- score = float(model.decision_function(model_input)[0])
56
- confidence = abs(score) / (1.0 + abs(score))
57
- else:
58
- confidence = 0.5
59
-
60
- if pred == 1:
61
- ai_votes += 1
62
- label = "AI"
63
- else:
64
- human_votes += 1
65
- label = "Human"
66
-
67
- confidence_sum += confidence
68
- per_model.append(
69
- {
70
- "model": name,
71
- "label": label,
72
- "confidence": round(confidence * 100, 2),
73
- }
74
- )
75
-
76
- final_label = "AI" if ai_votes > human_votes else "Human"
77
- if ai_votes == human_votes:
78
- final_label = per_model[0]["label"]
79
-
80
- avg_conf = confidence_sum / max(len(per_model), 1)
81
- return {
82
- "label": final_label,
83
- "confidence": round(avg_conf * 100, 2),
84
- # "selected_models": selected_names,
85
- # "model_predictions": per_model,
86
- # "votes": {"AI": ai_votes, "Human": human_votes},
87
- # "available_models": list(models.keys()),
88
- # "unavailable_models": artifacts["unavailable_models"],
89
- }
 
1
+ import torch
2
+ from .model_loader import get_model_tokenizer
3
+ import torch.nn.functional as F
4
 
5
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
6
 
 
7
 
8
+ def classify_text(text: str):
9
+ model, tokenizer = get_model_tokenizer()
10
+ inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
11
+ inputs = {k: v.to(device) for k, v in inputs.items()}
12
 
13
+ with torch.no_grad():
14
+ outputs = model(**inputs)
15
+ logits = outputs if isinstance(outputs, torch.Tensor) else outputs.logits
16
+ probs = F.softmax(logits, dim=1)
17
+ pred = torch.argmax(probs, dim=1).item()
18
+ prob_percent = probs[0][pred].item() * 100
19
 
20
+ return {"label": "Human" if pred == 0 else "AI", "confidence": round(prob_percent, 2)}
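`classify_text` turns the two logits into a label and a percentage with softmax followed by argmax. The same arithmetic in plain Python, as a sketch without PyTorch; `label_from_logits` is a hypothetical helper, and the mapping of index 0 to "Human" is taken from the return statement above:

```python
import math


def label_from_logits(logits: list[float]) -> dict:
    """Softmax over the logits, then report the most likely class."""
    peak = max(logits)
    # Subtracting the max keeps exp() from overflowing; probabilities are unchanged.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = probs.index(max(probs))
    return {
        "label": "Human" if pred == 0 else "AI",
        "confidence": round(probs[pred] * 100, 2),
    }


result = label_from_logits([0.0, 2.0])
```

Because the confidence is the probability of the predicted class, it cannot fall below 50 for a two-class model.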
21
 
22
 
23
 
features/nepali_text_classifier/model_loader.py CHANGED
@@ -1,237 +1,54 @@
1
- import logging
2
- import pickle
3
- import re
4
  import shutil
5
- from functools import lru_cache
6
- from pathlib import Path
7
-
8
- import numpy as np
9
- import pandas as pd
10
  from huggingface_hub import snapshot_download
11
- from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
12
-
13
- from config import Config
14
-
15
- LOGGER = logging.getLogger(__name__)
16
-
17
-
18
- MODEL_FILES = {
19
- "Logistic Regression": "Logistic_Regression.pkl",
20
- "Random Forest": "Random_Forest.pkl",
21
- # "Gradient Boosting": "Gradient_Boosting.pkl",
22
- "Linear SVC": "Linear_SVC.pkl",
23
- "Ridge Classifier": "Ridge_Classifier.pkl",
24
- "Multinomial NB": "Multinomial_NB.pkl",
25
- "Bernoulli NB": "Bernoulli_NB.pkl",
26
- }
27
-
28
- SKIP_MODELS = set()
29
-
30
- REPO_ID = Config.REPO_ID_LANG
31
- HF_TOKEN = Config.HF_TOKEN
32
- NEPALI_SUBDIR = "Nepali_model"
33
- REQUIRED_BASE_FILES = ("word_vectorizer.pkl", "char_vectorizer.pkl")
34
-
35
-
36
- # Ranked by validation accuracy from final_model/final_results.csv
37
- DEFAULT_MODEL_RANKING = [
38
- "Gradient Boosting",
39
- "Logistic Regression",
40
- "Linear SVC",
41
- "Ridge Classifier",
42
- "Bernoulli NB",
43
- "Random Forest",
44
- "Multinomial NB",
45
- ]
46
-
47
-
48
- def _patch_legacy_logistic_model(model):
49
- """Backfill attributes expected by newer sklearn versions."""
50
- if isinstance(model, (LogisticRegression, LogisticRegressionCV)) and not hasattr(
51
- model, "multi_class"
52
- ):
53
- model.multi_class = "auto"
54
- return model
55
-
56
-
57
- class NepaliRichFeatures:
58
- """Burstiness + stylometry feature extractor used during model training."""
59
-
60
- @staticmethod
61
- def extract_burstiness(text: str) -> dict:
62
- sentences = [s.strip() for s in re.split(r"[।!?]", str(text)) if s.strip()]
63
- if not sentences:
64
- return {
65
- "burst_mean": 0.0,
66
- "burst_std": 0.0,
67
- "burst_max": 0.0,
68
- "burst_min": 0.0,
69
- "burst_range": 0.0,
70
- }
71
- lengths = [len(s.split()) for s in sentences]
72
- return {
73
- "burst_mean": float(np.mean(lengths)),
74
- "burst_std": float(np.std(lengths)),
75
- "burst_max": float(np.max(lengths)),
76
- "burst_min": float(np.min(lengths)),
77
- "burst_range": float(np.max(lengths) - np.min(lengths)),
78
- }
79
-
80
- @staticmethod
81
- def extract_stylometry(text: str) -> dict:
82
- words = str(text).split()
83
- num_words = max(len(words), 1)
84
- num_chars = max(len(str(text)), 1)
85
- num_sentences = max(
86
- len([s for s in re.split(r"[।!?]", str(text)) if s.strip()]), 1
87
- )
88
- avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
89
- avg_sent_len = num_words / num_sentences
90
- lexical_diversity = len(set(words)) / num_words
91
- punct_count = (
92
- str(text).count("।")
93
- + str(text).count("?")
94
- + str(text).count("!")
95
- + str(text).count(",")
96
- )
97
- punct_ratio = punct_count / num_chars
98
- bigrams = [" ".join(words[i : i + 2]) for i in range(len(words) - 1)]
99
- rep_bigram_ratio = (
100
- (1.0 - len(set(bigrams)) / max(len(bigrams), 1)) if bigrams else 0.0
101
- )
102
- diacritic_count = sum(1 for c in str(text) if "\u093e" <= c <= "\u094d")
103
- diacritic_ratio = diacritic_count / num_chars
104
- return {
105
- "num_words": num_words,
106
- "num_chars": num_chars,
107
- "num_sentences": num_sentences,
108
- "avg_word_len": avg_word_len,
109
- "avg_sent_len": avg_sent_len,
110
- "lexical_diversity": lexical_diversity,
111
- "punct_ratio": punct_ratio,
112
- "rep_bigram_ratio": rep_bigram_ratio,
113
- "diacritic_ratio": diacritic_ratio,
114
- }
115
-
116
- def transform(self, texts):
117
- if isinstance(texts, str):
118
- texts = [texts]
119
- rows = []
120
- for text in texts:
121
- row = {**self.extract_burstiness(text), **self.extract_stylometry(text)}
122
- rows.append(row)
123
- return pd.DataFrame(rows).values.astype(np.float32)
124
-
125
-
126
- def _repo_root() -> Path:
127
- return Path(__file__).resolve().parents[2]
128
-
129
-
130
- def _has_required_artifacts(path: Path) -> bool:
131
- if not path.exists() or not path.is_dir():
132
- return False
133
- has_base = all((path / filename).exists() for filename in REQUIRED_BASE_FILES)
134
- has_any_model = any((path / filename).exists() for filename in MODEL_FILES.values())
135
- return has_base and has_any_model
136
-
137
-
138
- def _candidate_model_dirs() -> list[Path]:
139
- candidates = []
140
- repo = _repo_root()
141
-
142
- if Config.Nepali_model_folder:
143
- custom = Path(Config.Nepali_model_folder)
144
- candidates.extend([custom, custom / NEPALI_SUBDIR])
145
-
146
- default_dir = repo / "features" / "Model" / "Nepali_model"
147
- candidates.extend([default_dir, default_dir / NEPALI_SUBDIR])
148
- candidates.append(
149
- repo / "notebook" / "ai_vs_human_nepali" / "final_model" / "saved_models"
150
- )
151
- return candidates
152
-
153
-
154
- def _download_nepali_artifacts() -> None:
155
- if not REPO_ID:
156
- raise ValueError("English_model repo id is not configured")
157
-
158
- repo = _repo_root()
159
- target_dir = (
160
- Path(Config.Nepali_model_folder)
161
- if Config.Nepali_model_folder
162
- else repo / "features" / "Model" / "Nepali_model"
163
- )
164
-
165
- snapshot_path = Path(snapshot_download(repo_id=REPO_ID, token=HF_TOKEN))
166
- source_dir = (
167
- snapshot_path / NEPALI_SUBDIR
168
- if (snapshot_path / NEPALI_SUBDIR).is_dir()
169
- else snapshot_path
170
- )
171
-
172
- target_dir.mkdir(parents=True, exist_ok=True)
173
- shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)
174
-
175
-
176
- def resolve_model_dir() -> Path:
177
- for path in _candidate_model_dirs():
178
- if _has_required_artifacts(path):
179
- return path
180
-
181
- LOGGER.info("Nepali artifacts not found locally; downloading from %s", REPO_ID)
182
- _download_nepali_artifacts()
183
-
184
- for path in _candidate_model_dirs():
185
- if _has_required_artifacts(path):
186
- return path
187
-
188
- raise FileNotFoundError(
189
- "Nepali model directory not found. Set Nepali_model env or add expected artifacts."
190
- )
191
-
192
-
193
- @lru_cache(maxsize=1)
194
- def load_artifacts():
195
- model_dir = resolve_model_dir()
196
- LOGGER.info("Loading Nepali artifacts from %s", model_dir)
197
-
198
- models = {}
199
- unavailable = {}
200
- for model_name, file_name in MODEL_FILES.items():
201
- if model_name in SKIP_MODELS:
202
- unavailable[model_name] = "Skipped due to large artifact size"
203
- continue
204
- file_path = model_dir / file_name
205
- if not file_path.exists():
206
- unavailable[model_name] = "Missing model file"
207
- continue
208
- with open(file_path, "rb") as fp:
209
- models[model_name] = _patch_legacy_logistic_model(pickle.load(fp))
210
-
211
- with open(model_dir / "word_vectorizer.pkl", "rb") as fp:
212
- word_vectorizer = pickle.load(fp)
213
- with open(model_dir / "char_vectorizer.pkl", "rb") as fp:
214
- char_vectorizer = pickle.load(fp)
215
-
216
- rich_transformer = NepaliRichFeatures()
217
- return {
218
- "model_dir": str(model_dir),
219
- "models": models,
220
- "unavailable_models": unavailable,
221
- "word_vectorizer": word_vectorizer,
222
- "char_vectorizer": char_vectorizer,
223
- "rich_transformer": rich_transformer,
224
- }
225
-
226
-
227
- def get_available_models():
228
- artifacts = load_artifacts()
229
- return list(artifacts["models"].keys())
230
-
231
 
232
- def get_default_top_models(top_k: int = 2):
233
- available = set(get_available_models())
234
- ranked = [name for name in DEFAULT_MODEL_RANKING if name in available]
235
- if not ranked:
236
- return list(available)[:top_k]
237
- return ranked[: max(1, top_k)]
 
1
+ import os
 
 
2
  import shutil
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ import logging
 
7
  from huggingface_hub import snapshot_download
8
+ from transformers import AutoTokenizer, AutoModel
9
+
10
+ # Configs
11
+ REPO_ID = "can-org/Nepali-AI-VS-HUMAN"
12
+ BASE_DIR = "./np_text_model"
13
+ TOKENIZER_DIR = os.path.join(BASE_DIR, "classifier") # <- update this to match your uploaded folder
14
+ WEIGHTS_PATH = os.path.join(BASE_DIR, "model_95_acc.pth") # <- change to match actual uploaded weight
15
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
16
+
17
+ # Define model class
18
+ class XLMRClassifier(nn.Module):
19
+ def __init__(self):
20
+ super(XLMRClassifier, self).__init__()
21
+ self.bert = AutoModel.from_pretrained("xlm-roberta-base")
22
+ self.classifier = nn.Linear(self.bert.config.hidden_size, 2)
23
+
24
+ def forward(self, input_ids, attention_mask):
25
+ outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
26
+ cls_output = outputs.last_hidden_state[:, 0, :]
27
+ return self.classifier(cls_output)
28
+
29
+ # Globals for caching
30
+ _model = None
31
+ _tokenizer = None
32
+
33
+ def download_model_repo():
34
+ if os.path.exists(BASE_DIR) and os.path.isdir(BASE_DIR):
35
+ logging.info("Model already downloaded.")
36
+ return
37
+ snapshot_path = snapshot_download(repo_id=REPO_ID)
38
+ os.makedirs(BASE_DIR, exist_ok=True)
39
+ shutil.copytree(snapshot_path, BASE_DIR, dirs_exist_ok=True)
40
+
41
+ def load_model():
42
+ download_model_repo()
43
+ tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
44
+ model = XLMRClassifier().to(device)
45
+ model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
46
+ model.eval()
47
+ return model, tokenizer
48
+
49
+ def get_model_tokenizer():
50
+ global _model, _tokenizer
51
+ if _model is None or _tokenizer is None:
52
+ _model, _tokenizer = load_model()
53
+ return _model, _tokenizer
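`get_model_tokenizer` loads the model on first use and caches it in module globals. Since the controller calls `classify_text` through `asyncio.to_thread`, two early requests could in principle both trigger a load. A hedged sketch of the same lazy-cache pattern guarded by a lock; `_load` is a stand-in for `load_model`, and none of the names below come from the commit:

```python
import threading

_lock = threading.Lock()
_cached = None   # plays the role of (_model, _tokenizer)
load_calls = 0   # counts loads, only to make the behaviour observable


def _load():
    """Stand-in for the expensive load_model() call."""
    global load_calls
    load_calls += 1
    return ("model", "tokenizer")


def get_cached():
    """Return the cached pair, loading it at most once across threads."""
    global _cached
    if _cached is None:
        with _lock:
            # Re-check inside the lock: another thread may have loaded it.
            if _cached is None:
                _cached = _load()
    return _cached


threads = [threading.Thread(target=get_cached) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second `None` check inside the lock is what keeps the load from running twice when several threads arrive together.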
54
 
features/nepali_text_classifier/preprocess.py CHANGED
@@ -1,9 +1,9 @@
1
- # import fitz # PyMuPDF
2
  import docx
3
  from io import BytesIO
4
  import logging
5
  from fastapi import HTTPException
6
- from pypdf import PdfReader
7
 
8
  def parse_docx(file: BytesIO):
9
  doc = docx.Document(file)
@@ -15,10 +15,11 @@ def parse_docx(file: BytesIO):
15
 
16
  def parse_pdf(file: BytesIO):
17
  try:
18
- doc = PdfReader(file)
19
  text = ""
20
- for page in doc.pages:
21
- text += page.extract_text()
 
22
  return text
23
  except Exception as e:
24
  logging.error(f"Error while processing PDF: {str(e)}")
 
1
+ import fitz # PyMuPDF
2
  import docx
3
  from io import BytesIO
4
  import logging
5
  from fastapi import HTTPException
6
+
7
 
8
  def parse_docx(file: BytesIO):
9
  doc = docx.Document(file)
 
15
 
16
  def parse_pdf(file: BytesIO):
17
  try:
18
+ doc = fitz.open(stream=file, filetype="pdf")
19
  text = ""
20
+ for page_num in range(doc.page_count):
21
+ page = doc.load_page(page_num)
22
+ text += page.get_text()
23
  return text
24
  except Exception as e:
25
  logging.error(f"Error while processing PDF: {str(e)}")
features/nepali_text_classifier/routes.py CHANGED
@@ -15,42 +15,27 @@ security = HTTPBearer()
15
  # Input schema
16
  class TextInput(BaseModel):
17
  text: str
18
- models: list[str] | None = None
19
 
20
  @router.post("/analyse")
21
  @limiter.limit(ACCESS_RATE)
22
  async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
23
- selected = ",".join(data.models[:2]) if data.models else None
24
- result = await nepali_text_analysis(data.text, selected)
25
  return result
26
 
27
  @router.post("/upload")
28
  @limiter.limit(ACCESS_RATE)
29
- async def upload_file(request:Request,file:UploadFile=File(...), models: str | None = None, token:str=Depends(security)):
30
- return await handle_file_upload(file, models)
31
 
32
  @router.post("/analyse-sentences")
33
  @limiter.limit(ACCESS_RATE)
34
  async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
35
- selected = ",".join(data.models[:2]) if data.models else None
36
- return await handle_sentence_level_analysis(data.text, selected)
37
 
38
  @router.post("/file-sentences-analyse")
39
  @limiter.limit(ACCESS_RATE)
40
- async def analyze_sentance_file(request: Request, file: UploadFile = File(...), models: str | None = None, token: str = Depends(security)):
41
- return await handle_file_sentence(file, models)
42
-
43
-
44
- @router.get("/models")
45
- @limiter.limit(ACCESS_RATE)
46
- def get_models(request: Request):
47
- from .model_loader import get_available_models, get_default_top_models
48
-
49
- available = get_available_models()
50
- return {
51
- "available_models": available,
52
- "default_top_2": get_default_top_models(2),
53
- }
54
 
55
 
56
  @router.get("/health")
 
15
  # Input schema
16
  class TextInput(BaseModel):
17
  text: str
 
18
 
19
  @router.post("/analyse")
20
  @limiter.limit(ACCESS_RATE)
21
  async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
22
+ result = classify_text(data.text)
 
23
  return result
24
 
25
  @router.post("/upload")
26
  @limiter.limit(ACCESS_RATE)
27
+ async def upload_file(request:Request,file:UploadFile=File(...),token:str=Depends(security)):
28
+ return await handle_file_upload(file)
29
 
30
  @router.post("/analyse-sentences")
31
  @limiter.limit(ACCESS_RATE)
32
  async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
33
+ return await handle_sentence_level_analysis(data.text)
 
34
 
35
  @router.post("/file-sentences-analyse")
36
  @limiter.limit(ACCESS_RATE)
37
+ async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(security)):
38
+ return await handle_file_sentence(file)
39
 
40
 
41
  @router.get("/health")
features/rag_chatbot/__init__.py DELETED
File without changes
features/rag_chatbot/controller.py DELETED
@@ -1,178 +0,0 @@
1
- import asyncio
2
- import logging
3
- from typing import Dict, Any
4
-
5
- from fastapi import HTTPException, UploadFile, status, Depends
6
- from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
7
- from config import Config
8
-
9
- from .rag_pipeline import route_and_process_query, add_document_to_rag, check_system_health
10
- from .document_handler import extract_text_from_file
11
-
12
- # Configure logging
13
- logging.basicConfig(level=logging.INFO)
14
- logger = logging.getLogger(__name__)
15
-
16
- security = HTTPBearer()
17
-
18
- # Supported file types
19
- SUPPORTED_CONTENT_TYPES = Config.RAG_SUPPORTED_CONTENT_TYPES
20
-
21
- MAX_FILE_SIZE = Config.RAG_MAX_FILE_SIZE
22
- MAX_QUERY_LENGTH = Config.RAG_MAX_QUERY_LENGTH
23
-
24
- async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
25
- """Verify Bearer token from Authorization header."""
26
- token = credentials.credentials
27
- expected_token = Config.SECRET_TOKEN
28
-
29
- if not expected_token:
30
- logger.error("MY_SECRET_TOKEN not configured")
31
- raise HTTPException(
32
- status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
33
- detail="Server configuration error"
34
- )
35
-
36
- if token != expected_token:
37
- logger.warning(f"Invalid token attempt: {token[:10]}...")
38
- raise HTTPException(
39
- status_code=status.HTTP_403_FORBIDDEN,
40
- detail="Invalid or expired token"
41
- )
42
- return token
43
-
44
- async def handle_rag_query(query: str) -> Dict[str, Any]:
45
- """Handle an incoming query by routing it and getting the appropriate answer."""
46
-
47
- # Input validation
48
- if not query or not query.strip():
49
- raise HTTPException(
50
- status_code=status.HTTP_400_BAD_REQUEST,
51
- detail="Query cannot be empty"
52
- )
53
-
54
- if len(query) > MAX_QUERY_LENGTH:
55
- raise HTTPException(
56
- status_code=status.HTTP_400_BAD_REQUEST,
57
- detail=f"Query too long. Please limit to {MAX_QUERY_LENGTH} characters."
58
- )
59
-
60
- try:
61
- logger.info(f"Processing query: {query[:50]}...")
62
-
63
- # Process query in thread pool
64
- response = await asyncio.to_thread(route_and_process_query, query)
65
-
66
- logger.info(f"Query processed successfully. Route: {response.get('route', 'Unknown')}")
67
- return response
68
-
69
- except Exception as e:
70
- logger.error(f"Error processing query: {e}")
71
- raise HTTPException(
72
- status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
73
- detail="Error processing your query. Please try again."
74
- )
75
-
76
- async def handle_document_upload(file: UploadFile) -> Dict[str, str]:
77
- """Handle uploading a document to the RAG's vector store."""
78
-
79
- # File validation
80
- if not file.filename:
81
- raise HTTPException(
82
- status_code=status.HTTP_400_BAD_REQUEST,
83
- detail="No file provided"
84
- )
85
-
86
- if file.content_type not in SUPPORTED_CONTENT_TYPES:
87
- raise HTTPException(
88
- status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
89
- detail=f"Unsupported file type: {file.content_type}. "
90
- f"Supported types: {', '.join(SUPPORTED_CONTENT_TYPES)}"
91
- )
92
-
93
- # Check file size
94
- contents = await file.read()
95
- if len(contents) > MAX_FILE_SIZE:
96
- raise HTTPException(
97
- status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
98
- detail=f"File too large. Maximum size: {MAX_FILE_SIZE / (1024*1024):.1f}MB"
99
- )
100
-
101
- # Reset file pointer
102
- await file.seek(0)
103
-
104
- try:
105
- logger.info(f"Processing file upload: {file.filename}")
106
-
107
- # Extract text from file
108
- text = await extract_text_from_file(file)
109
-
110
- if not text or not text.strip():
111
- raise HTTPException(
112
- status_code=status.HTTP_400_BAD_REQUEST,
113
- detail="The file appears to be empty or could not be read."
114
- )
115
-
116
- if len(text) < 50: # Too short to be meaningful
117
- raise HTTPException(
118
- status_code=status.HTTP_400_BAD_REQUEST,
119
- detail="The extracted text is too short to be meaningful."
120
- )
121
-
122
- # Add to RAG system
123
- success = await asyncio.to_thread(
124
- add_document_to_rag,
125
- text,
126
- {
127
- "source": file.filename,
128
- "content_type": file.content_type,
129
- "size": len(contents)
130
- }
131
- )
132
-
133
- if not success:
134
- raise HTTPException(
135
- status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
136
- detail="Failed to add document to the knowledge base"
137
- )
138
-
139
- logger.info(f"Successfully processed file: {file.filename}")
140
-
141
- return {
142
- "message": f"Successfully uploaded and processed '{file.filename}'. "
143
- f"It is now available for querying.",
144
- "filename": file.filename,
145
- "text_length": len(text),
146
- "content_type": file.content_type
147
- }
148
-
149
- except HTTPException:
150
- raise
151
- except Exception as e:
152
- logger.error(f"Error processing file {file.filename}: {e}")
153
- raise HTTPException(
154
- status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
155
- detail="Error processing the file. Please try again."
156
- )
157
-
158
- async def handle_health_check() -> Dict[str, Any]:
159
- """Handle health check requests."""
160
- try:
161
- health_status = await asyncio.to_thread(check_system_health)
162
-
163
- if health_status["status"] == "unhealthy":
164
- raise HTTPException(
165
- status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
166
- detail="Service is currently unhealthy"
167
- )
168
-
169
- return health_status
170
-
171
- except HTTPException:
172
- raise
173
- except Exception as e:
174
- logger.error(f"Health check failed: {e}")
175
- raise HTTPException(
176
- status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
177
- detail="Health check failed"
178
- )
features/rag_chatbot/document_handler.py DELETED
@@ -1,37 +0,0 @@
1
- from io import BytesIO
2
- from fastapi import UploadFile, HTTPException
3
- import PyPDF2
4
- import docx
5
-
6
- async def extract_text_from_file(file: UploadFile) -> str:
7
- """Extracts text from various file types."""
8
- content = await file.read()
9
- file_stream = BytesIO(content)
10
-
11
- if file.content_type == "application/pdf":
12
- return extract_text_from_pdf(file_stream)
13
- elif file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
14
- return extract_text_from_docx(file_stream)
15
- elif file.content_type == "text/plain":
16
- return file_stream.read().decode("utf-8")
17
- else:
18
- raise HTTPException(
19
- status_code=415,
20
- detail="Unsupported file type. Please upload a .pdf, .docx, or .txt file."
21
- )
22
-
23
- def extract_text_from_pdf(file_stream: BytesIO) -> str:
24
- """Extracts text from a PDF file."""
25
- reader = PyPDF2.PdfReader(file_stream)
26
- text = ""
27
- for page in reader.pages:
28
- text += page.extract_text() or ""
29
- return text
30
-
31
- def extract_text_from_docx(file_stream: BytesIO) -> str:
32
- """Extracts text from a DOCX file."""
33
- doc = docx.Document(file_stream)
34
- text = ""
35
- for para in doc.paragraphs:
36
- text += para.text + "\n"
37
- return text
features/rag_chatbot/rag_pipeline.py DELETED
@@ -1,329 +0,0 @@
1
- import os
2
- import chromadb
3
- from dotenv import load_dotenv
4
- from langchain_core.documents import Document
5
- from langchain.text_splitter import RecursiveCharacterTextSplitter
6
- from langchain_community.embeddings import HuggingFaceEmbeddings
7
- from langchain_community.llms import OpenAI
8
- from langchain.chains.question_answering import load_qa_chain
9
- from langchain_community.vectorstores import Chroma
10
- from langchain.chains import LLMChain
11
- from langchain.prompts import PromptTemplate
12
- from langchain.chat_models import ChatOpenAI
13
- from config import Config
14
-
15
-
16
- load_dotenv()
17
-
18
- # ChromaDB configuration
19
- CHROMA_HOST = Config.RAG_CHROMA_HOST
20
- CHROMA_PORT = Config.RAG_CHROMA_PORT
21
- COLLECTION_NAME = Config.RAG_COLLECTION_NAME
22
-
23
- # LLM Provider Configuration
24
- LLM_PROVIDER = Config.RAG_LLM_PROVIDER
25
- LLM_API_KEY = Config.RAG_LLM_API_KEY
26
- LLM_MODEL = Config.RAG_LLM_MODEL
27
- LLM_TEMPERATURE = Config.RAG_LLM_TEMPERATURE
28
- LLM_MAX_TOKENS = Config.RAG_LLM_MAX_TOKENS
29
-
30
- # Provider-specific configurations
31
- PROVIDER_CONFIGS = {
32
- "openai": {
33
- "api_base": "https://api.openai.com/v1",
34
- "default_model": "gpt-3.5-turbo"
35
- },
36
- "groq": {
37
- "api_base": "https://api.groq.com/openai/v1",
38
- "default_model": "llama-3.3-70b-versatile"
39
- },
40
- "openrouter": {
41
- "api_base": "https://openrouter.ai/api/v1",
42
- "default_model": "mistralai/mistral-small-3.2-24b-instruct:free"
43
- }
44
- }
45
-
46
- vector_store = None
47
- company_qa_chain = None
48
- query_router_chain = None
49
- cybersecurity_chain = None
50
- llm = None
51
-
52
- def get_llm_config():
53
- """Get the appropriate LLM configuration based on the provider."""
54
- if LLM_PROVIDER not in PROVIDER_CONFIGS:
55
- raise ValueError(f"Unsupported LLM provider: {LLM_PROVIDER}. Supported: {list(PROVIDER_CONFIGS.keys())}")
56
-
57
- config = PROVIDER_CONFIGS[LLM_PROVIDER].copy()
58
-
59
- # Use provided model or fall back to default
60
- model = LLM_MODEL if LLM_MODEL != "gpt-3.5-turbo" else config["default_model"]
61
-
62
- return {
63
- "model": model,
64
- "openai_api_key": LLM_API_KEY,
65
- "openai_api_base": config["api_base"],
66
- "temperature": LLM_TEMPERATURE,
67
- "max_tokens": LLM_MAX_TOKENS,
68
- }
69
-
70
- def initialize_llm():
71
- """Initialize the LLM based on the configured provider."""
72
- if not LLM_API_KEY:
73
- raise ValueError(f"LLM_API_KEY environment variable is required for {LLM_PROVIDER}")
74
-
75
- config = get_llm_config()
76
-
77
- print(f"Initializing {LLM_PROVIDER.upper()} with model: {config['model']}")
78
-
79
- return ChatOpenAI(**config)
80
-
81
- def initialize_pipelines():
82
- """Initializes all required models, chains, and the vector store."""
83
- global vector_store, company_qa_chain, query_router_chain, cybersecurity_chain, llm
84
-
85
- try:
86
- # Initialize LLM
87
- llm = initialize_llm()
88
-
89
- # Initialize embeddings
90
- embeddings = HuggingFaceEmbeddings(
91
- model_name="all-MiniLM-L6-v2",
92
- model_kwargs={'device': 'cpu'},
93
- encode_kwargs={'normalize_embeddings': True}
94
- )
95
-
96
- # Initialize ChromaDB client
97
- try:
98
- chroma_client = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)
99
- chroma_client.heartbeat()
100
- except Exception as e:
101
- raise ConnectionError("Failed to connect to ChromaDB.") from e
102
-
103
- # Initialize vector store
104
- vector_store = Chroma(
105
- client=chroma_client,
106
- collection_name=COLLECTION_NAME,
107
- embedding_function=embeddings,
108
- )
109
-
110
- # Query Router Chain
111
- router_template = """You are a query classifier. Classify the following query into one of these categories:
112
- - COMPANY: Questions about our company, its products, services, or general information
113
- - CYBERSECURITY: Questions about cybersecurity, security threats, best practices, or vulnerabilities
114
- - OFF_TOPIC: Questions that don't fit the above categories
115
-
116
- Query: {query}
117
-
118
- Respond with only the category name (COMPANY, CYBERSECURITY, or OFF_TOPIC):"""
119
-
120
- router_prompt = PromptTemplate(
121
- input_variables=["query"],
122
- template=router_template
123
- )
124
-
125
- query_router_chain = LLMChain(
126
- llm=llm,
127
- prompt=router_prompt
128
- )
129
-
130
- # Custom Company QA Chain
131
- company_qa_template = """You are a helpful assistant for CyberAlertNepal. Answer the following question about our company using the information provided and links if only available. Give a natural, direct and polite response.
132
-
133
- Question: {question}
134
-
135
- Information:
136
- {context}
137
-
138
- Answer:"""
139
-
140
- company_qa_prompt = PromptTemplate(
141
- input_variables=["question", "context"],
142
- template=company_qa_template
143
- )
144
-
145
- company_qa_chain = LLMChain(
146
- llm=llm,
147
- prompt=company_qa_prompt
148
- )
149
-
150
- # Cybersecurity Chain
151
- cybersecurity_template = """You are a cybersecurity professional. Answer the following question truthfully and concisely.
152
- If you are not 100% sure about the answer, simply respond with: "I am not sure about the answer."
153
- Do not add extra explanations or assumptions. Do not provide false or speculative information.
154
-
155
- Question: {question}
156
-
157
- Provide a comprehensive and accurate answer about cybersecurity:"""
158
-
159
- cybersecurity_prompt = PromptTemplate(
160
- input_variables=["question"],
161
- template=cybersecurity_template
162
- )
163
-
164
- cybersecurity_chain = LLMChain(
165
- llm=llm,
166
- prompt=cybersecurity_prompt
167
- )
168
-
169
- print(f"Successfully initialized pipelines with {LLM_PROVIDER.upper()}")
170
-
171
- except Exception as e:
172
- print(f"Error initializing pipelines: {e}")
173
- raise
174
-
175
- def add_document_to_rag(text: str, metadata: dict):
176
- """Splits a document and adds it to the ChromaDB index."""
177
- global vector_store
178
-
179
- if not vector_store:
180
- initialize_pipelines()
181
-
182
- try:
183
- text_splitter = RecursiveCharacterTextSplitter(
184
- chunk_size=1000,
185
- chunk_overlap=200
186
- )
187
- docs = text_splitter.create_documents([text], metadatas=[metadata])
188
-
189
- if not docs:
190
- print("Document was empty after splitting, not adding to ChromaDB.")
191
- return False
192
-
193
- vector_store.add_documents(docs)
194
- print("Successfully added documents.")
195
- return True
196
-
197
- except Exception as e:
198
- print(f"Error adding document to RAG: {e}")
199
- return False
200
-
201
- def route_and_process_query(query: str):
202
- """Routes the query and processes it using the appropriate pipeline."""
203
- global query_router_chain, vector_store, company_qa_chain, cybersecurity_chain
204
-
205
- if not all([query_router_chain, vector_store, company_qa_chain, cybersecurity_chain]):
206
- initialize_pipelines()
207
-
208
- try:
209
- # 1. Classify the query
210
- route_result = query_router_chain.run(query)
211
- route = route_result.strip().upper()
212
-
213
-
214
- # 2. Route to appropriate logic
215
- if "CYBERSECURITY" in route:
216
- answer = cybersecurity_chain.run(question=query)
217
- return {
218
- "answer": answer,
219
- "source": "Cybersecurity Knowledge Base",
220
- "route": "CYBERSECURITY",
221
- "provider": LLM_PROVIDER.upper(),
222
- "model": get_llm_config()["model"]
223
- }
224
-
225
- elif "COMPANY" in route:
226
- # Perform similarity search on ChromaDB
227
- docs = vector_store.similarity_search(query, k=3)
228
-
229
- if not docs:
230
- return {
231
- "answer": "I could not find any relevant information to answer your question.",
232
- "source": "Company Documents",
233
- "route": "COMPANY",
234
- "provider": LLM_PROVIDER.upper(),
235
- "model": get_llm_config()["model"]
236
- }
237
-
238
- # Combine document content for context
239
- context = "\n\n".join([doc.page_content for doc in docs])
240
-
241
- # Run the custom QA chain
242
- answer = company_qa_chain.run(question=query, context=context)
243
- sources = list(set([doc.metadata.get("source", "Unknown") for doc in docs]))
244
-
245
- return {
246
- "answer": answer,
247
- "source": "Company Documents",
248
- "documents": sources,
249
- "route": "COMPANY",
250
- "provider": LLM_PROVIDER.upper(),
251
- "model": get_llm_config()["model"]
252
- }
253
-
254
- else: # OFF_TOPIC
255
- return {
256
- "answer": "I am a specialized assistant of CyberAlertNepal. I cannot answer questions outside of cybersecurity topics.",
257
- "source": "N/A",
258
- "route": "OFF_TOPIC",
259
- "provider": LLM_PROVIDER.upper(),
260
- "model": get_llm_config()["model"]
261
- }
262
-
263
- except Exception as e:
264
- print(f"Error processing query: {e}")
265
- return {
266
- "answer": "I encountered an error while processing your query. Please try again.",
267
- "source": "Error",
268
- "route": None,
269
- "documents": None,
270
- "provider": LLM_PROVIDER.upper(),
271
- "error": str(e)
272
- }
273
-
274
- def check_system_health():
275
- """Check if all components are properly initialized."""
276
- try:
277
- # Test ChromaDB connection
278
- if vector_store:
279
- vector_store._client.heartbeat()
280
-
281
- # Test if all chains are initialized
282
- components = {
283
- "vector_store": vector_store is not None,
284
- "company_qa_chain": company_qa_chain is not None,
285
- "query_router_chain": query_router_chain is not None,
286
- "cybersecurity_chain": cybersecurity_chain is not None,
287
- "llm": llm is not None
288
- }
289
-
290
- return {
291
- "status": "healthy" if all(components.values()) else "unhealthy",
292
- "components": components,
293
- "provider": LLM_PROVIDER.upper(),
294
- "model": get_llm_config()["model"] if llm else "Not initialized"
295
- }
296
-
297
- except Exception as e:
298
- return {
299
- "status": "unhealthy",
300
- "error": str(e),
301
- "provider": LLM_PROVIDER.upper()
302
- }
303
-
304
- def test_llm_connection():
305
- """Test the LLM API connection."""
306
- try:
307
- if not llm:
308
- initialize_pipelines()
309
-
310
- # Simple test query
311
- test_response = llm("Say 'Hello, LLM is working!'")
312
- return {
313
- "success": True,
314
- "provider": LLM_PROVIDER.upper(),
315
- "model": get_llm_config()["model"],
316
- "response": str(test_response)
317
- }
318
- except Exception as e:
319
- return {
320
- "success": False,
321
- "provider": LLM_PROVIDER.upper(),
322
- "error": str(e)
323
- }
324
-
325
- # Initialize pipelines on module import
326
- try:
327
- initialize_pipelines()
328
- except Exception as e:
329
- print(f"Failed to initialize pipelines on startup: {e}")
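
`route_and_process_query` above matches the router's reply by substring, checking `CYBERSECURITY` before `COMPANY` and treating anything else as off-topic. A small sketch of that precedence, with the LLM call left out entirely; `pick_route` is a hypothetical helper, not a function from the deleted module.

```python
def pick_route(router_reply: str) -> str:
    """Map a raw router reply to a route label using the same substring checks and order."""
    route = router_reply.strip().upper()
    if "CYBERSECURITY" in route:
        return "CYBERSECURITY"
    if "COMPANY" in route:
        return "COMPANY"
    # Anything else, including an explicit OFF_TOPIC reply, falls through.
    return "OFF_TOPIC"


if __name__ == "__main__":
    for reply in ["company", " Cybersecurity\n", "The category is COMPANY.", "weather"]:
        print(repr(reply), "->", pick_route(reply))
```

Because the check is a substring match, a verbose reply that mentions both labels resolves to `CYBERSECURITY`, since that branch is tested first.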
features/rag_chatbot/routes.py DELETED
@@ -1,107 +0,0 @@
1
- from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Request
2
- from fastapi.security import HTTPBearer
3
- from pydantic import BaseModel, Field
4
- from slowapi.util import get_remote_address
5
- from slowapi import Limiter
6
- from typing import Optional
7
- from config import ACCESS_RATE, Config
8
- from .controller import (
9
- handle_rag_query,
10
- handle_document_upload,
11
- handle_health_check,
12
- verify_token,
13
- )
14
-
15
- limiter = Limiter(key_func=get_remote_address)
16
- router = APIRouter(prefix="/rag", tags=["RAG Chatbot"])
17
- security = HTTPBearer()
18
-
19
- class QueryInput(BaseModel):
20
- query: str = Field(..., min_length=1, max_length=1000, description="The question to ask")
21
-
22
- class QueryResponse(BaseModel):
23
- answer: str
24
- source: str
25
- route: Optional[str] = None
26
- documents: Optional[list] = None
27
- error: Optional[str] = None
28
-
29
- class UploadResponse(BaseModel):
30
- message: str
31
- filename: str
32
- text_length: int
33
- content_type: str
34
-
35
- class HealthResponse(BaseModel):
36
- status: str
37
- components: Optional[dict] = None
38
- error: Optional[str] = None
39
-
40
- @router.post("/question", response_model=QueryResponse)
41
- @limiter.limit(ACCESS_RATE)
42
- async def ask_question(
43
- request: Request,
44
- data: QueryInput,
45
- token: str = Depends(verify_token)
46
- ) -> QueryResponse:
47
- """
48
- Ask a question to the RAG chatbot.
49
-
50
- The chatbot can answer:
51
- - Company-related questions (based on uploaded documents)
52
- - Cybersecurity questions (from knowledge base)
53
- """
54
- response = await handle_rag_query(data.query)
55
- return QueryResponse(**response)
56
-
57
- @router.post("/upload", response_model=UploadResponse)
58
- @limiter.limit(ACCESS_RATE)
59
- async def upload_document(
60
- request: Request,
61
- file: UploadFile = File(..., description="Document file (PDF, DOCX, or TXT)"),
62
- token: str = Depends(verify_token)
63
- ) -> UploadResponse:
64
- """
65
- Upload a document to the company knowledge base.
66
-
67
- Supported formats:
68
- - PDF (.pdf)
69
- - Word documents (.docx)
70
- - Plain text (.txt)
71
-
72
- Maximum file size: 10MB
73
- """
74
- response = await handle_document_upload(file)
75
- return UploadResponse(**response)
76
-
77
- @router.get("/health", response_model=HealthResponse)
78
- @limiter.limit(ACCESS_RATE)
79
- async def health_check(request: Request) -> HealthResponse:
80
- """
81
- Check the health status of the RAG system.
82
-
83
- Returns the status of all components:
84
- - ChromaDB connection
85
- - Vector store
86
- - AI chains
87
- """
88
- response = await handle_health_check()
89
- return HealthResponse(**response)
90
-
91
- @router.get("/info")
92
- @limiter.limit(ACCESS_RATE)
93
- async def get_system_info(request: Request):
94
- """Get information about the RAG system capabilities."""
95
- return {
96
- "name": "RAG Chatbot",
97
- "version": "1.0.0",
98
- "description": "A specialized chatbot for cybersecurity and company-related questions",
99
- "capabilities": [
100
- "Company document Q&A (based on uploaded documents)",
101
- "Cybersecurity knowledge and best practices",
102
- "Document upload and processing (PDF, DOCX, TXT)"
103
- ],
104
- "supported_file_types": sorted(Config.RAG_SUPPORTED_CONTENT_TYPES),
105
- "max_file_size_mb": round(Config.RAG_MAX_FILE_SIZE / (1024 * 1024), 2),
106
- "max_query_length": Config.RAG_MAX_QUERY_LENGTH
107
- }
features/real_forged_classifier/__init__.py DELETED
@@ -1,9 +0,0 @@
1
- """Package for real_forged_classifier feature.
2
-
3
- This module ensures package-relative imports work when importing
4
- `features.real_forged_classifier.*` from the application.
5
- """
6
-
7
- __all__ = [
8
- 'controller', 'routes', 'preprocessor', 'inferencer', 'model_loader', 'model'
9
- ]
features/real_forged_classifier/controller.py CHANGED
@@ -1,30 +1,6 @@
1
  from typing import IO
2
- import io
3
- import numpy as np
4
- from PIL import Image
5
- from fastapi import Depends, HTTPException, status
6
- from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
7
- import torch
8
- from torchvision import transforms
9
-
10
- from .preprocessor import preprocessor
11
- from .inferencer import interferencer
12
- from .model_loader import models
13
- from config import Config
14
-
15
- security = HTTPBearer()
16
-
17
-
18
- async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
19
- token = credentials.credentials
20
- expected_token = Config.SECRET_TOKEN
21
- if token != expected_token:
22
- raise HTTPException(
23
- status_code=status.HTTP_403_FORBIDDEN,
24
- detail="Invalid or expired token",
25
- )
26
- return token
27
-
28
 
29
  class ClassificationController:
30
  """
@@ -58,72 +34,3 @@ class ClassificationController:
58
 
59
  # Create a single instance of the controller
60
  controller = ClassificationController()
61
-
62
- class documentForger:
63
- """
64
- Document forgery detector that uses the ELA-trained EfficientNet model
65
- when available (models.doc_model). Returns a dict with verdict and confidence.
66
- """
67
- def is_forged(self, document_file: IO) -> dict:
68
- # Ensure a document model is loaded
69
- if not hasattr(models, 'doc_model') or models.doc_model is None:
70
- _downloadmodel = Config.DOCUMENT_FORGERY_MODEL_PATH
71
- return {"detail": "Document forgery model not available."}
72
-
73
- # Read file bytes
74
- try:
75
- data = document_file.read()
76
- img = Image.open(io.BytesIO(data)).convert('RGB')
77
- except Exception as e:
78
- return {"detail": f"Could not open document image: {e}"}
79
-
80
- # Compute ELA map (same approach as the notebook)
81
- try:
82
- buf = io.BytesIO()
83
- img.save(buf, format='JPEG', quality=90)
84
- buf.seek(0)
85
- recompressed = Image.open(buf).convert('RGB')
86
-
87
- ela_arr = np.abs(np.array(img, dtype=np.float32) - np.array(recompressed, dtype=np.float32))
88
- p99 = np.percentile(ela_arr, 99)
89
- if p99 > 0:
90
- ela_arr = np.clip(ela_arr * (255.0 / p99), 0, 255).astype(np.uint8)
91
- else:
92
- ela_arr = ela_arr.astype(np.uint8)
93
-
94
- ela_pil = Image.fromarray(ela_arr, mode='RGB')
95
- except Exception as e:
96
- return {"detail": f"Failed to compute ELA: {e}"}
97
-
98
- # Transform and run through model
99
- transform = transforms.Compose([
100
- transforms.Resize((224, 224)),
101
- transforms.ToTensor(),
102
- transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
103
- ])
104
-
105
- tensor = transform(ela_pil).unsqueeze(0).to(models.device)
106
-
107
- with torch.no_grad():
108
- logits = models.doc_model(tensor)
109
- probs = torch.softmax(logits, dim=1)[0, 1].item()
110
-
111
- # Interpret confidence using configurable thresholds (values in 0..1)
112
- low = getattr(Config, 'DOCUMENT_FORGERY_POSSIBLE_LOW', 0.40)
113
- high = getattr(Config, 'DOCUMENT_FORGERY_FORGED_LOW', 0.55)
114
-
115
- if probs < low:
116
- verdict = 'LIKELY AUTHENTIC'
117
- elif probs < high:
118
- verdict = 'POSSIBLY FORGED'
119
- else:
120
- verdict = 'LIKELY FORGED'
121
-
122
- return {
123
- "verdict": verdict,
124
- "confidence": float(probs),
125
- "confidence_pct": round(float(probs) * 100, 2),
126
- }
127
-
128
- # Create a single instance of the document forger
129
- document_forger = documentForger()
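
The removed `documentForger.is_forged` maps the model's forged-class probability to one of three verdicts using two cut-offs (0.40 and 0.55 when the `Config` attributes are absent). A sketch of just that mapping, with `verdict_for` as a hypothetical standalone helper; the ELA computation and model call are not reproduced here.

```python
def verdict_for(prob_forged: float, low: float = 0.40, high: float = 0.55) -> dict:
    """Translate a forged-class probability in [0, 1] into the verdict dict used above."""
    if prob_forged < low:
        verdict = "LIKELY AUTHENTIC"
    elif prob_forged < high:
        verdict = "POSSIBLY FORGED"
    else:
        verdict = "LIKELY FORGED"
    return {
        "verdict": verdict,
        "confidence": float(prob_forged),
        "confidence_pct": round(float(prob_forged) * 100, 2),
    }
```

Both boundaries are inclusive on the upper band: a probability exactly at `low` is reported as possibly forged, and one exactly at `high` as likely forged.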
 
1
  from typing import IO
2
+ from preprocessor import preprocessor
3
+ from inferencer import interferencer
4
 
5
  class ClassificationController:
6
  """
 
34
 
35
  # Create a single instance of the controller
36
  controller = ClassificationController()
features/real_forged_classifier/inferencer.py CHANGED
@@ -3,7 +3,7 @@ import torch.nn.functional as F
3
  import numpy as np
4
 
5
  # Import the globally loaded models instance
6
- from .model_loader import models
7
 
8
  class Interferencer:
9
  """
@@ -26,10 +26,6 @@ class Interferencer:
26
  Returns:
27
  dict: A dictionary containing the classification label and confidence score.
28
  """
29
- # 0. Ensure model is loaded
30
- if self.fft_model is None:
31
- return {"error": "FFT model not loaded."}
32
-
33
  # 1. Get model outputs (logits)
34
  outputs = self.fft_model(image_tensor)
35
 
 
3
  import numpy as np
4
 
5
  # Import the globally loaded models instance
6
+ from model_loader import models
7
 
8
  class Interferencer:
9
  """
 
26
  Returns:
27
  dict: A dictionary containing the classification label and confidence score.
28
  """
 
 
 
 
29
  # 1. Get model outputs (logits)
30
  outputs = self.fft_model(image_tensor)
31
 
features/real_forged_classifier/main.py ADDED
@@ -0,0 +1,26 @@
1
+ from fastapi import FastAPI
2
+ from routes import router as api_router
3
+
4
+ # Initialize the FastAPI app
5
+ app = FastAPI(
6
+ title="Real vs. Fake Image Classification API",
7
+ description="An API to classify images as real or forged using FFT and cnn.",
8
+ version="1.0.0"
9
+ )
10
+
11
+ # Include the API router
12
+ # All routes defined in routes.py will be available under the /api prefix
13
+ app.include_router(api_router, prefix="/api", tags=["Classification"])
14
+
15
+ @app.get("/", tags=["Root"])
16
+ async def read_root():
17
+ """
18
+ A simple root endpoint to confirm the API is running.
19
+ """
20
+ return {"message": "Welcome to the Image Classification API. Go to /docs for the API documentation."}
21
+
22
+ # To run this application:
23
+ # 1. Make sure you have all dependencies from requirements.txt installed.
24
+ # 2. Make sure the 'svm_model.joblib' file is in the same directory.
25
+ # 3. Run the following command in your terminal:
26
+ # uvicorn main:app --reload
features/real_forged_classifier/model_loader.py CHANGED
@@ -1,202 +1,60 @@
 
1
  from pathlib import Path
2
- from typing import Any
3
- import shutil
4
- from .model import FFTCNN # Import the FFT CNN architecture (package-relative)
5
- from config import Config
6
-
7
- try:
8
- from huggingface_hub import hf_hub_download
9
- except Exception:
10
- hf_hub_download = None
11
-
12
-
13
- # NOTE: EfficientNet/nn imports are done lazily when torch is available.
14
- ELAForgeryNet = None # will be constructed dynamically when needed
15
- torch = None
16
- TORCH_AVAILABLE = False
17
-
18
 
19
  class ModelLoader:
20
- """A class to load and hold PyTorch models used by this feature.
21
-
22
- It loads:
23
- - an FFT-based CNN (downloaded from Hugging Face Hub)
24
- - an ELA-based document forgery detector (local .pth by default)
25
  """
 
 
 
 
 
26
 
27
- def __init__(self, model_repo_id: str, model_filename: str, doc_model_path: str = None):
28
- # Try to import torch once and expose module-level variables
29
- global torch, TORCH_AVAILABLE
30
- try:
31
- import torch as _torch
32
- torch = _torch
33
- TORCH_AVAILABLE = True
34
- except Exception:
35
- torch = None
36
- TORCH_AVAILABLE = False
37
- print("[WARN] PyTorch not available; model loading will be skipped until torch is installed.")
38
- if TORCH_AVAILABLE:
39
- self.device = "cuda" if torch.cuda.is_available() else "cpu"
40
- else:
41
- self.device = "cpu"
42
- print(f"Using device: {self.device} (torch available: {TORCH_AVAILABLE})")
43
-
44
- # Load FFT CNN from HF Hub
45
- self.fft_model = None
46
- if TORCH_AVAILABLE:
47
- try:
48
- self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
49
- print("FFT CNN model loaded successfully from Hub.")
50
- except Exception:
51
- # Try local fallback path (if provided in config)
52
- self.fft_model = None
53
- local_path = Path(getattr(Config, 'REAL_FORGED_MODEL_LOCAL_PATH', ''))
54
- if local_path and local_path.exists():
55
- try:
56
- print(f"Attempting to load FFT model from local path: {local_path}")
57
- model = FFTCNN()
58
- state = torch.load(str(local_path), map_location=torch.device(self.device))
59
- state_dict = state.get('state_dict', state) if isinstance(state, dict) else state
60
- model.load_state_dict(state_dict, strict=False)
61
- model.to(self.device)
62
- model.eval()
63
- self.fft_model = model
64
- print("FFT CNN model loaded successfully from local path.")
65
- except Exception as e:
66
- print(f"Failed to load local FFT model: {e}")
67
- else:
68
- print("No local FFT model path configured or file missing; FFT model not loaded.")
69
- else:
70
- print("Skipping FFT model load because PyTorch is not installed.")
71
-
72
- # Load document forgery model (ELA CNN), downloading the checkpoint if needed.
73
- self.doc_model = None
74
- if doc_model_path is None:
75
- doc_model_path = Config.DOCUMENT_FORGERY_MODEL_PATH
76
 
77
- self.doc_model = None
78
- if TORCH_AVAILABLE:
79
- try:
80
- self.doc_model = self._load_document_forgery_model(Path(doc_model_path))
81
- if self.doc_model is not None:
82
- print("Document forgery (ELA) model loaded successfully.")
83
- except Exception as e:
84
- print(f"Warning: failed to load document forgery model: {e}")
85
- else:
86
- print("Skipping document forgery model load because PyTorch is not installed.")
87
 
88
  def _load_fft_model(self, repo_id: str, filename: str):
89
- """Downloads and loads the FFT CNN model from a Hugging Face Hub repository."""
90
- print(f"Attempting to download FFT CNN model from Hugging Face repo: {repo_id}")
91
- try:
92
- from huggingface_hub import hf_hub_download
93
- except Exception as e:
94
- raise RuntimeError(f"huggingface_hub not available: {e}")
95
 
 
 
 
 
 
 
 
 
96
  try:
97
- model_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
 
98
  print(f"Model downloaded to: {model_path}")
99
-
 
100
  model = FFTCNN()
 
 
101
  model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
 
 
102
  model.to(self.device)
103
  model.eval()
 
104
  return model
105
  except Exception as e:
106
- print(f"Error downloading or loading FFT model from Hugging Face: {e}")
107
  raise
108
 
109
- def _load_document_forgery_model(self, path: Path):
110
- """Load the ELA-based document forgery model from a local .pth checkpoint.
111
-
112
- Returns the model instance or None if the file does not exist.
113
- """
114
- # If the configured path doesn't exist, try sensible fallbacks in the repo.
115
- if not path.exists():
116
- print(f"Document forgery model file not found at configured path: {path}")
117
-
118
- # 1) Try the configured document forgery checkpoint path relative to repo root
119
- repo_root = Path(__file__).resolve().parents[2]
120
- candidate = repo_root / 'features' / 'Model' / 'document_forgery' / path.name
121
- if candidate.exists():
122
- path = candidate
123
- print(f"Found document forgery model at fallback path: {path}")
124
- else:
125
- # 2) Search the repo for any file with the configured checkpoint name
126
- print(f"Searching repository for '{path.name}'...")
127
- matches = list(repo_root.rglob(path.name))
128
- if matches:
129
- path = matches[0]
130
- print(f"Found document forgery model at: {path}")
131
- else:
132
- try:
133
- path = self._download_document_forgery_model(path)
134
- except Exception as exc:
135
- print(f"Document forgery model not found in repository and download failed: {exc}")
136
- return None
137
-
138
- print(f"Loading document forgery model from: {path}")
139
- # Build the ELA model architecture lazily (requires torchvision & torch.nn)
140
- try:
141
- import torchvision.models as tv_models
142
- import torch.nn as nn
143
- except Exception as e:
144
- raise RuntimeError(f"Required packages for ELA model not available: {e}")
145
-
146
- backbone = tv_models.efficientnet_b0(weights='IMAGENET1K_V1')
147
- in_features = backbone.classifier[1].in_features
148
- backbone.classifier = nn.Sequential(
149
- nn.Dropout(p=0.4),
150
- nn.Linear(in_features, 256),
151
- nn.ReLU(inplace=True),
152
- nn.Dropout(p=0.2),
153
- nn.Linear(256, 2),
154
- )
155
- model = backbone
156
- state = torch.load(str(path), map_location=torch.device(self.device))
157
-
158
- # The checkpoint might be either a state_dict or a full checkpoint dict
159
- if isinstance(state, dict) and 'state_dict' in state:
160
- state_dict = state['state_dict']
161
- else:
162
- state_dict = state
163
-
164
- # Attempt to load state dict; allow strict=False to be tolerant to minor key name differences
165
- model.load_state_dict(state_dict, strict=False)
166
- model.to(self.device)
167
- model.eval()
168
- return model
169
-
170
- def _download_document_forgery_model(self, target_path: Path) -> Path:
171
- """Download the document forgery checkpoint into the configured local path."""
172
- if hf_hub_download is None:
173
- raise RuntimeError("huggingface_hub not available")
174
-
175
- repo_id = getattr(Config, "DOCUMENT_FORGERY_MODEL_REPO_ID", Config.REAL_FORGED_MODEL_REPO_ID)
176
- configured_name = getattr(Config, "DOCUMENT_FORGERY_MODEL_FILENAME", str(target_path))
177
- candidate_filenames = []
178
- for candidate in (configured_name, str(target_path), target_path.name):
179
- if candidate and candidate not in candidate_filenames:
180
- candidate_filenames.append(candidate)
181
-
182
- last_error = None
183
- for filename in candidate_filenames:
184
- try:
185
- print(f"Downloading document forgery model from Hugging Face repo: {repo_id} ({filename})")
186
- downloaded_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
187
- target_path.parent.mkdir(parents=True, exist_ok=True)
188
- shutil.copy2(downloaded_path, target_path)
189
- print(f"Document forgery model downloaded to: {target_path}")
190
- return target_path
191
- except Exception as exc:
192
- last_error = exc
193
-
194
- raise RuntimeError(f"unable to download document forgery model: {last_error}")
195
-
196
-
197
  # --- Global Model Instance ---
198
- MODEL_REPO_ID = Config.REAL_FORGED_MODEL_REPO_ID
199
- MODEL_FILENAME = Config.REAL_FORGED_MODEL_FILENAME
200
- DOC_MODEL_PATH = Config.DOCUMENT_FORGERY_MODEL_PATH
201
- models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME, doc_model_path=DOC_MODEL_PATH)
202
 
 
1
+ import torch
2
  from pathlib import Path
3
+ from huggingface_hub import hf_hub_download
4
+ from model import FFTCNN # Import the model architecture
5
 
6
  class ModelLoader:
7
  """
8
+ A class to load and hold the PyTorch CNN model.
9
+ """
10
+ def __init__(self, model_repo_id: str, model_filename: str):
11
+ """
12
+ Initializes the ModelLoader and loads the model.
13
 
14
+ Args:
15
+ model_repo_id (str): The repository ID on Hugging Face.
16
+ model_filename (str): The name of the model file (.pth) in the repository.
17
+ """
18
+ self.device = "cuda" if torch.cuda.is_available() else "cpu"
19
+ print(f"Using device: {self.device}")
20
 
21
+ self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
22
+ print("FFT CNN model loaded successfully.")
23
 
24
  def _load_fft_model(self, repo_id: str, filename: str):
25
+ """
26
+ Downloads and loads the FFT CNN model from a Hugging Face Hub repository.
27
 
28
+ Args:
29
+ repo_id (str): The repository ID on Hugging Face.
30
+ filename (str): The name of the model file (.pth) in the repository.
31
+
32
+ Returns:
33
+ The loaded PyTorch model object.
34
+ """
35
+ print(f"Downloading FFT CNN model from Hugging Face repo: {repo_id}")
36
  try:
37
+ # Download the model file from the Hub. It returns the cached path.
38
+ model_path = hf_hub_download(repo_id=repo_id, filename=filename)
39
  print(f"Model downloaded to: {model_path}")
40
+
41
+ # Initialize the model architecture
42
  model = FFTCNN()
43
+
44
+ # Load the saved weights (state_dict) into the model
45
  model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
46
+
47
+ # Set the model to evaluation mode
48
  model.to(self.device)
49
  model.eval()
50
+
51
  return model
52
  except Exception as e:
53
+ print(f"Error downloading or loading model from Hugging Face: {e}")
54
  raise
55
 
56
  # --- Global Model Instance ---
57
+ MODEL_REPO_ID = 'rhnsa/real_forged_classifier'
58
+ MODEL_FILENAME = 'fft_cnn_model_78.pth'
59
+ models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME)
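
The removed local-fallback and document-model paths accepted checkpoints saved either as a bare state dict or as a dict wrapping one under a `state_dict` key. The remaining Hub path passes the result of `torch.load(...)` straight to `load_state_dict`, so it expects the bare form. A torch-free sketch of the unwrapping step that was dropped, with `unwrap_state_dict` as a hypothetical name:

```python
from typing import Any


def unwrap_state_dict(checkpoint: Any) -> Any:
    """Return checkpoint['state_dict'] if present, else the checkpoint itself."""
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint
```

If the published `.pth` file is ever saved as a full training checkpoint, a helper like this could be applied to the loaded object before `load_state_dict`; whether that is needed depends on how the file in the repository was written, which the diff does not show.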
 
60
 
features/real_forged_classifier/preprocessor.py CHANGED
@@ -6,7 +6,7 @@ import cv2
6
  from torchvision import transforms
7
 
8
  # Import the globally loaded models instance
9
- from .model_loader import models
10
 
11
  class ImagePreprocessor:
12
  """
 
6
  from torchvision import transforms
7
 
8
  # Import the globally loaded models instance
9
+ from model_loader import models
10
 
11
  class ImagePreprocessor:
12
  """
features/real_forged_classifier/routes.py CHANGED
@@ -1,14 +1,14 @@
1
- from fastapi import APIRouter, File, UploadFile, HTTPException, status, Depends
2
  from fastapi.responses import JSONResponse
3
 
4
- # Import the controller instance and document forger
5
- from .controller import controller, document_forger, verify_token
6
 
7
  # Create an API router
8
  router = APIRouter()
9
 
10
  @router.post("/classify_forgery", summary="Classify an image as Real or Fake")
11
- async def classify_image_endpoint(image: UploadFile = File(...), token: str = Depends(verify_token)):
12
  """
13
  Accepts an image file and classifies it as 'real' or 'fake'.
14
 
@@ -35,23 +35,3 @@ async def classify_image_endpoint(image: UploadFile = File(...), token: str = De
35
 
36
  return JSONResponse(content=result, status_code=status.HTTP_200_OK)
37
 
38
- @router.post("/isforged", summary="Check if the document is forged")
39
- async def is_forged_endpoint(file: UploadFile = File(...), token: str = Depends(verify_token)):
40
- """Run the document forgery detector on an uploaded image file.
41
-
42
- Accepts image uploads (multipart/form-data) and returns a JSON verdict with confidence.
43
- """
44
- if not file.content_type.startswith("image/"):
45
- raise HTTPException(
46
- status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
47
- detail="Unsupported file type. Please upload an image (e.g., JPEG, PNG)."
48
- )
49
-
50
- result = document_forger.is_forged(file.file)
51
- if isinstance(result, dict) and (result.get("error") or result.get("detail")):
52
- raise HTTPException(
53
- status_code=status.HTTP_400_BAD_REQUEST,
54
- detail=result.get("error") or result.get("detail"),
55
- )
56
-
57
- return JSONResponse(content=result, status_code=status.HTTP_200_OK)
 
1
+ from fastapi import APIRouter, File, UploadFile, HTTPException, status
2
  from fastapi.responses import JSONResponse
3
 
4
+ # Import the controller instance
5
+ from controller import controller
6
 
7
  # Create an API router
8
  router = APIRouter()
9
 
10
  @router.post("/classify_forgery", summary="Classify an image as Real or Fake")
11
+ async def classify_image_endpoint(image: UploadFile = File(...)):
12
  """
13
  Accepts an image file and classifies it as 'real' or 'fake'.
14
 
 
35
 
36
  return JSONResponse(content=result, status_code=status.HTTP_200_OK)
features/text_classifier/controller.py CHANGED
@@ -1,76 +1,49 @@
 
1
  import asyncio
2
  import logging
3
  from io import BytesIO
4
 
5
- from fastapi import Depends, HTTPException, UploadFile, status
6
- from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
7
- from config import Config
8
 
9
- from .inferencer import analyze_text_with_sentences, classify_text
10
  from .preprocess import parse_docx, parse_pdf, parse_txt
11
-
12
  security = HTTPBearer()
13
-
14
-
15
- # def build_bias_summary(ai_likelihood: float) -> dict[str, object]:
16
- # """Convert an AI likelihood score into a human-readable bias summary."""
17
- # if ai_likelihood > 50:
18
- # overall_bias = "AI"
19
- # bias_statement = f"The text is biased toward AI-generated writing ({ai_likelihood}% AI likelihood)."
20
- # elif ai_likelihood < 50:
21
- # overall_bias = "Human"
22
- # bias_statement = f"The text is biased toward human writing ({100 - ai_likelihood}% human likelihood)."
23
- # else:
24
- # overall_bias = "Balanced"
25
- # bias_statement = "The text is balanced between AI and human writing."
26
-
27
- # return {
28
- # "overall_bias": overall_bias,
29
- # "bias_statement": bias_statement,
30
- # }
31
-
32
 
33
  # Verify Bearer token from Authorization header
34
  async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
35
  token = credentials.credentials
36
- expected_token = Config.SECRET_TOKEN
37
  if token != expected_token:
38
  raise HTTPException(
39
- status_code=status.HTTP_403_FORBIDDEN, detail="Invalid or expired token"
 
40
  )
41
  return token
42
 
43
-
44
  # Classify plain text input
45
  async def handle_text_analysis(text: str):
46
  text = text.strip()
47
  if not text or len(text.split()) < 10:
48
- raise HTTPException(
49
- status_code=400, detail="Text must contain at least 10 words"
50
- )
51
- if len(text) > 50000:
52
- raise HTTPException(
53
- status_code=413, detail="Text must be less than 50,000 characters"
54
- )
55
 
56
  label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
57
- # bias_summary = build_bias_summary(ai_likelihood)
58
  return {
59
  "result": label,
60
  "perplexity": round(perplexity, 2),
61
- "ai_likelihood": ai_likelihood,
62
  }
63
 
64
-
65
  # Extract text from uploaded files (.docx, .pdf, .txt)
66
  async def extract_file_contents(file: UploadFile) -> str:
67
  content = await file.read()
68
  file_stream = BytesIO(content)
69
 
70
- if (
71
- file.content_type
72
- == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
73
- ):
74
  return parse_docx(file_stream)
75
  elif file.content_type == "application/pdf":
76
  return parse_pdf(file_stream)
@@ -79,83 +52,76 @@ async def extract_file_contents(file: UploadFile) -> str:
79
  else:
80
  raise HTTPException(
81
  status_code=415,
82
- detail="Invalid file type. Only .docx, .pdf and .txt are allowed.",
83
  )
84
 
85
-
86
  # Classify text from uploaded file
87
  async def handle_file_upload(file: UploadFile):
88
  try:
89
  file_contents = await extract_file_contents(file)
90
- logging.info(f"Extracted text length: {len(file_contents)} characters")
91
- if len(file_contents) > 50000:
92
- return {
93
- "status_code": 413,
94
- "detail": "Text must be less than 50,000 characters",
95
- }
96
 
97
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
98
  if not cleaned_text:
99
- raise HTTPException(
100
- status_code=400,
101
- detail="The uploaded file is empty or only contains whitespace.",
102
- )
103
- # print(f"Cleaned text: '{cleaned_text}'") # Debugging statement
104
- label, perplexity, ai_likelihood = await asyncio.to_thread(
105
- classify_text, cleaned_text
106
- )
107
  return {
108
  "content": file_contents,
109
  "result": label,
110
  "perplexity": round(perplexity, 2),
111
- "ai_likelihood": ai_likelihood,
112
  }
113
  except Exception as e:
114
  logging.error(f"Error processing file: {e}")
115
  raise HTTPException(status_code=500, detail="Error processing the file")
116
 
117
 
 
118
  async def handle_sentence_level_analysis(text: str):
119
  text = text.strip()
120
- if not text or len(text.split()) < 10:
121
- raise HTTPException(
122
- status_code=400, detail="Text must contain at least 10 words"
123
- )
124
- if len(text) > 50000:
125
- raise HTTPException(
126
- status_code=413, detail="Text must be less than 50,000 characters"
127
- )
128
-
129
- result = await asyncio.to_thread(analyze_text_with_sentences, text)
130
- return result
131
-
 
 
 
 
 
 
 
 
132
 
133
- # Analyze each sentence from uploaded file
134
  async def handle_file_sentence(file: UploadFile):
135
  try:
136
  file_contents = await extract_file_contents(file)
137
- if len(file_contents) > 50000:
138
- # raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
139
- return {
140
- "status_code": 413,
141
- "detail": "Text must be less than 50,000 characters",
142
- }
143
 
144
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
145
  if not cleaned_text:
146
- raise HTTPException(
147
- status_code=400,
148
- detail="The uploaded file is empty or only contains whitespace.",
149
- )
150
 
151
  result = await handle_sentence_level_analysis(cleaned_text)
152
- return {"content": file_contents, **result}
153
- except HTTPException:
154
- raise
 
155
  except Exception as e:
156
  logging.error(f"Error processing file: {e}")
157
  raise HTTPException(status_code=500, detail="Error processing the file")
158
 
159
-
160
  def classify(text: str):
161
  return classify_text(text)
 
 
1
+ import os
2
  import asyncio
3
  import logging
4
  from io import BytesIO
5
 
6
+ from fastapi import HTTPException, UploadFile, status, Depends
7
+ from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
 
8
 
9
+ from .inferencer import classify_text
10
  from .preprocess import parse_docx, parse_pdf, parse_txt
11
+ import spacy
12
  security = HTTPBearer()
13
+ nlp = spacy.load("en_core_web_sm")
14
 
15
  # Verify Bearer token from Authorization header
16
  async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
17
  token = credentials.credentials
18
+ expected_token = os.getenv("MY_SECRET_TOKEN")
19
  if token != expected_token:
20
  raise HTTPException(
21
+ status_code=status.HTTP_403_FORBIDDEN,
22
+ detail="Invalid or expired token"
23
  )
24
  return token
25
 
 
26
  # Classify plain text input
27
  async def handle_text_analysis(text: str):
28
  text = text.strip()
29
  if not text or len(text.split()) < 10:
30
+ raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
31
+ if len(text) > 10000:
32
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
 
 
 
 
33
 
34
  label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
 
35
  return {
36
  "result": label,
37
  "perplexity": round(perplexity, 2),
38
+ "ai_likelihood": ai_likelihood
39
  }
40
 
 
41
  # Extract text from uploaded files (.docx, .pdf, .txt)
42
  async def extract_file_contents(file: UploadFile) -> str:
43
  content = await file.read()
44
  file_stream = BytesIO(content)
45
 
46
+ if file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
 
 
 
47
  return parse_docx(file_stream)
48
  elif file.content_type == "application/pdf":
49
  return parse_pdf(file_stream)
 
52
  else:
53
  raise HTTPException(
54
  status_code=415,
55
+ detail="Invalid file type. Only .docx, .pdf and .txt are allowed."
56
  )
57
 
 
58
  # Classify text from uploaded file
59
  async def handle_file_upload(file: UploadFile):
60
  try:
61
  file_contents = await extract_file_contents(file)
62
+ if len(file_contents) > 10000:
63
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
 
 
 
 
64
 
65
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
66
  if not cleaned_text:
67
+ raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
68
+
69
+ label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, cleaned_text)
 
 
 
 
 
70
  return {
71
  "content": file_contents,
72
  "result": label,
73
  "perplexity": round(perplexity, 2),
74
+ "ai_likelihood": ai_likelihood
75
  }
76
  except Exception as e:
77
  logging.error(f"Error processing file: {e}")
78
  raise HTTPException(status_code=500, detail="Error processing the file")
79
 
80
 
81
+
82
  async def handle_sentence_level_analysis(text: str):
83
  text = text.strip()
84
+ if not text.endswith("."):
85
+ text += "."
86
+
87
+ if len(text) > 10000:
88
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
89
+
90
+ doc = nlp(text)
91
+ sentences = [sent.text.strip() for sent in doc.sents]
92
+
93
+ results = []
94
+ for sentence in sentences:
95
+ if not sentence:
96
+ continue
97
+ label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, sentence)
98
+ results.append({
99
+ "sentence": sentence,
100
+ "label": label,
101
+ "perplexity": round(perplexity, 2),
102
+ "ai_likelihood": ai_likelihood
103
+ })
104
 
105
+ return {"analysis": results}
+ # Analyze each sentence from uploaded file
106
  async def handle_file_sentence(file: UploadFile):
107
  try:
108
  file_contents = await extract_file_contents(file)
109
+ if len(file_contents) > 10000:
110
+ raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
 
 
 
 
111
 
112
  cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
113
  if not cleaned_text:
114
+ raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
 
 
 
115
 
116
  result = await handle_sentence_level_analysis(cleaned_text)
117
+ return {
118
+ "content": file_contents,
119
+ **result
120
+ }
121
  except Exception as e:
122
  logging.error(f"Error processing file: {e}")
123
  raise HTTPException(status_code=500, detail="Error processing the file")
124
 
 
125
  def classify(text: str):
126
  return classify_text(text)
127
+
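One possible hardening of the `verify_token` check in this file, offered as a sketch and not as the project's method: it compares the two tokens with the standard library's `hmac.compare_digest` instead of `!=`, and treats a missing `MY_SECRET_TOKEN` as "nothing matches". The helper name `tokens_match` is hypothetical and does not appear in this commit.

```python
import hmac
import os
from typing import Optional


def tokens_match(provided: str, expected: Optional[str]) -> bool:
    """Return True only when both tokens are present and identical."""
    # Hypothetical helper: with no configured secret, no token is accepted.
    if not provided or not expected:
        return False
    # compare_digest is designed to avoid leaking where the inputs differ.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    # Same environment variable the diff reads in verify_token.
    expected_token = os.getenv("MY_SECRET_TOKEN")
    print(tokens_match("SECRET_CODE_TOKEN", expected_token))
```

Inside `verify_token`, the `if token != expected_token:` test could then become `if not tokens_match(token, expected_token):`, leaving the 403 response unchanged.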
features/text_classifier/inferencer.py CHANGED
@@ -1,272 +1,40 @@
1
- from __future__ import annotations
2
-
3
- from dataclasses import dataclass
4
- from functools import lru_cache
5
- import logging
6
- import random
7
- from typing import Any
8
-
9
- import nltk
10
- import numpy as np
11
- from scipy.sparse import csr_matrix, hstack
12
  import torch
13
- from transformers import AutoModelForCausalLM, AutoTokenizer
14
-
15
- from features.text_classifier.model_loader import load_model
16
-
17
- logger = logging.getLogger(__name__)
18
-
19
-
20
- for resource in ("tokenizers/punkt", "tokenizers/punkt_tab"):
21
- try:
22
- nltk.data.find(resource)
23
- except LookupError:
24
- nltk.download(resource.split("/")[-1], quiet=True)
25
-
26
-
27
- try:
28
- import textstat
29
- except ImportError:
30
- textstat = None
31
-
32
-
33
- @dataclass
34
- class SentenceBlendConfig:
35
- sentence_blend_weight: float = 0.70
36
- sentence_to_doc_bias: float = 0.35
37
- max_sentence_blend_weight: float = 0.90
38
- max_sentence_to_doc_bias: float = 0.80
39
- random_deviation_pct: float = 2.0
40
-
41
-
42
- class PerplexityCalculator:
43
- """Lazy-loaded perplexity calculator for distilgpt2."""
44
-
45
- def __init__(self, model_name: str = "distilgpt2"):
46
- self.model_name = model_name
47
- self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
48
- self._tokenizer = None
49
- self._model = None
50
-
51
- def _load(self) -> None:
52
- if self._model is not None and self._tokenizer is not None:
53
- return
54
-
55
- logger.info("Loading perplexity model: %s", self.model_name)
56
- self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
57
- self._model = AutoModelForCausalLM.from_pretrained(self.model_name).to(self.device)
58
- self._model.eval()
59
- logger.info("Perplexity model loaded on %s", self.device)
60
-
61
- def calculate(self, text: str, max_length: int = 512) -> float:
62
- try:
63
- self._load()
64
- encodings = self._tokenizer(
65
- text,
66
- return_tensors="pt",
67
- truncation=True,
68
- max_length=max_length,
69
- )
70
- input_ids = encodings.input_ids.to(self.device)
71
-
72
- with torch.no_grad():
73
- outputs = self._model(input_ids, labels=input_ids)
74
- loss = outputs.loss
75
- perplexity = torch.exp(loss).item()
76
-
77
- return min(float(perplexity), 10000.0)
78
- except Exception as exc:
79
- logger.warning("Perplexity fallback used due to error: %s", exc)
80
- return 100.0
81
-
82
-
83
- _perplexity_calc = PerplexityCalculator()
84
-
85
-
86
- @lru_cache(maxsize=20000)
87
- def _cached_perplexity(cleaned_text: str) -> float:
88
- return _perplexity_calc.calculate(cleaned_text)
89
-
90
-
91
- @lru_cache(maxsize=1)
92
- def _get_model_artifacts() -> tuple[Any, Any, Any, Any, list[str], dict[str, Any]]:
93
- return load_model()
94
-
95
 
96
- def normalize_text(text: str) -> str:
97
- return " ".join(str(text).split()).strip()
98
 
 
 
 
 
99
 
100
- def split_into_sentences(text: str) -> list[str]:
101
- cleaned = normalize_text(text)
102
- if not cleaned:
103
- return []
104
- sentences = [s.strip() for s in nltk.sent_tokenize(cleaned) if s.strip()]
105
- return sentences if sentences else [cleaned]
106
 
 
 
107
 
108
- def extract_burstiness_features(text: str) -> dict[str, float]:
109
- sentences = split_into_sentences(text)
110
- if not sentences:
111
- return {
112
- "burst_mean": 0.0,
113
- "burst_std": 0.0,
114
- "burst_max": 0.0,
115
- "burst_min": 0.0,
116
- "burst_range": 0.0,
117
- }
118
 
119
- lengths = np.array([len(s.split()) for s in sentences], dtype=float)
120
- return {
121
- "burst_mean": float(np.mean(lengths)),
122
- "burst_std": float(np.std(lengths)),
123
- "burst_max": float(np.max(lengths)),
124
- "burst_min": float(np.min(lengths)),
125
- "burst_range": float(np.max(lengths) - np.min(lengths)),
126
- }
127
 
 
 
 
 
 
 
128
 
129
- def extract_stylometry_features(text: str) -> dict[str, float]:
130
- words = text.split()
131
- num_words = len(words)
132
- num_chars = len(text)
133
- num_sentences = max(len(split_into_sentences(text)), 1)
134
 
135
- avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
136
- avg_sent_len = float(num_words / num_sentences)
137
-
138
- unique_words = len(set(words))
139
- lexical_diversity = float(unique_words / num_words) if num_words > 0 else 0.0
140
-
141
- num_punct = sum(1 for c in text if c in ".,!?;:")
142
- punct_ratio = float(num_punct / num_chars) if num_chars > 0 else 0.0
143
-
144
- num_caps = sum(1 for c in text if c.isupper())
145
- caps_ratio = float(num_caps / num_chars) if num_chars > 0 else 0.0
146
-
147
- if textstat is not None:
148
- try:
149
- flesch_reading = float(textstat.flesch_reading_ease(text))
150
- flesch_grade = float(textstat.flesch_kincaid_grade(text))
151
- except Exception:
152
- flesch_reading = 50.0
153
- flesch_grade = 8.0
154
- else:
155
- flesch_reading = 50.0
156
- flesch_grade = 8.0
157
-
158
- return {
159
- "num_words": float(num_words),
160
- "num_chars": float(num_chars),
161
- "num_sentences": float(num_sentences),
162
- "avg_word_len": avg_word_len,
163
- "avg_sent_len": avg_sent_len,
164
- "lexical_diversity": lexical_diversity,
165
- "punct_ratio": punct_ratio,
166
- "caps_ratio": caps_ratio,
167
- "flesch_reading": flesch_reading,
168
- "flesch_grade": flesch_grade,
169
- }
170
-
171
-
172
- def extract_all_features(text: str, calc_perplexity: bool = True) -> dict[str, float]:
173
- cleaned = normalize_text(text)
174
- features: dict[str, float] = {}
175
-
176
- if calc_perplexity:
177
- features["perplexity"] = _cached_perplexity(cleaned)
178
- else:
179
- features["perplexity"] = 100.0
180
-
181
- features.update(extract_burstiness_features(cleaned))
182
- features.update(extract_stylometry_features(cleaned))
183
- return features
184
-
185
-
186
- def _predict_ai_probability(text: str) -> tuple[float, float]:
187
- (
188
- loaded_classifier,
189
- loaded_scaler,
190
- loaded_word_vectorizer,
191
- loaded_char_vectorizer,
192
- loaded_features,
193
- loaded_metadata,
194
- ) = _get_model_artifacts()
195
-
196
- calc_perplexity = bool(loaded_metadata.get("num_engineered_features", 0) > 0)
197
- features = extract_all_features(text, calc_perplexity=calc_perplexity)
198
-
199
- feature_vector = np.array([features[name] for name in loaded_features], dtype=float).reshape(1, -1)
200
- feature_scaled = loaded_scaler.transform(feature_vector)
201
-
202
- word_vec = loaded_word_vectorizer.transform([text])
203
- char_vec = loaded_char_vectorizer.transform([text])
204
- num_vec = csr_matrix(feature_scaled)
205
- hybrid_vec = hstack([word_vec, char_vec, num_vec], format="csr")
206
-
207
- if hasattr(loaded_classifier, "predict_proba"):
208
- proba = loaded_classifier.predict_proba(hybrid_vec)[0]
209
- ai_prob = float(proba[1])
210
  else:
211
- score = float(loaded_classifier.decision_function(hybrid_vec)[0])
212
- ai_prob = float(1.0 / (1.0 + np.exp(-score)))
213
-
214
- perplexity = float(features.get("perplexity", 100.0))
215
- return ai_prob, perplexity
216
-
217
-
218
- def classify_text(text: str) -> tuple[str, float, float]:
219
- """Return (label, perplexity, ai_likelihood_percent)."""
220
- cleaned = normalize_text(text)
221
- if not cleaned:
222
- raise ValueError("Input text is empty")
223
-
224
- ai_prob, perplexity = _predict_ai_probability(cleaned)
225
- ai_likelihood = round(ai_prob * 100.0, 2)
226
- label = "AI" if ai_likelihood >= 50.0 else "Human"
227
- return label, perplexity, ai_likelihood
228
-
229
-
230
- def analyze_text_with_sentences(
231
- text: str,
232
- ) -> dict[str, Any]:
233
- text = normalize_text(text)
234
- overall_classification, overall_perplexity, overall_ai_likelihood = classify_text(text)
235
- sentences = split_into_sentences(text)
236
- if not sentences:
237
- raise ValueError("Input text contains no valid sentences")
238
- # do the per-sentence analysis
239
- sentence_results = []
240
- for sentence in sentences:
241
- try:
242
- label, perplexity, ai_likelihood = classify_text(sentence)
243
- sentence_results.append(
244
- {
245
- "sentence": sentence,
246
- "label": label,
247
- "perplexity": perplexity,
248
- "ai_likelihood": ai_likelihood,
249
- }
250
- )
251
- except Exception as exc:
252
- logger.warning("Error analyzing sentence: %s", exc)
253
- sentence_results.append(
254
- {
255
- "sentence": sentence,
256
- "label": "Error",
257
- "perplexity": None,
258
- "ai_likelihood": None,
259
- }
260
- )
261
- return{
262
- "sentences": sentence_results,
263
- "summary": {
264
- "overall": {
265
- "label": overall_classification,
266
- "perplexity": overall_perplexity,
267
- "ai_likelihood": overall_ai_likelihood,
268
- }
269
- },
270
-
271
- }
272
-
 import torch
+from .model_loader import get_model_tokenizer
 
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
+def perplexity_to_ai_likelihood(ppl: float) -> float:
+    # You can tune these parameters
+    min_ppl = 10  # very confident it's AI
+    max_ppl = 100  # very confident it's human
 
+    # Clamp to bounds
+    ppl = max(min_ppl, min(ppl, max_ppl))
 
+    # Invert and scale: lower perplexity -> higher AI-likelihood
+    likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
 
+    return round(likelihood * 100, 2)
 
 
+def classify_text(text: str):
+    model, tokenizer = get_model_tokenizer()
+    inputs = tokenizer(text, return_tensors="pt",
+                       truncation=True, padding=True)
+    input_ids = inputs["input_ids"].to(device)
+    attention_mask = inputs["attention_mask"].to(device)
 
+    with torch.no_grad():
+        outputs = model(
+            input_ids, attention_mask=attention_mask, labels=input_ids)
+        loss = outputs.loss
+        perplexity = torch.exp(loss).item()
 
+    if perplexity < 55:
+        result = "AI-generated"
+    elif perplexity < 80:
+        result = "Probably AI-generated"
     else:
+        result = "Human-written"
+    likelihood_result=perplexity_to_ai_likelihood(perplexity)
+    return result, perplexity,likelihood_result
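A standalone worked example of the linear mapping in `perplexity_to_ai_likelihood`. The function body is copied from the added lines above; the sample perplexities are chosen purely for illustration. It shows where the label thresholds used in `classify_text` (55 and 80) fall on the likelihood scale.

```python
def perplexity_to_ai_likelihood(ppl: float) -> float:
    min_ppl = 10   # at or below this, the score saturates at 100
    max_ppl = 100  # at or above this, the score saturates at 0

    # Clamp to bounds, then invert and scale linearly.
    ppl = max(min_ppl, min(ppl, max_ppl))
    likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
    return round(likelihood * 100, 2)


if __name__ == "__main__":
    # Illustrative inputs: below range, the bounds, and the two label thresholds.
    for sample in (5, 10, 55, 80, 100, 500):
        print(sample, perplexity_to_ai_likelihood(sample))
```

A perplexity of 55 maps to 50.0 and a perplexity of 80 maps to 22.22, so a text labelled "Probably AI-generated" carries an `ai_likelihood` between roughly 22% and 50%.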
features/text_classifier/model_loader.py CHANGED
@@ -1,113 +1,50 @@
1
- import json
2
- import logging
3
- import pickle
4
  import shutil
5
- from pathlib import Path
6
-
7
- import torch
8
  from huggingface_hub import snapshot_download
9
- from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
10
-
11
- from config import Config
12
-
13
- REPO_ID = Config.REPO_ID_LANG
14
- MODEL_DIR = Path(Config.LANG_MODEL) if Config.LANG_MODEL else None
15
- HF_TOKEN = Config.HF_TOKEN
16
- ENGLISH_SUBDIR = "English_model"
17
 
18
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
19
-
20
- REQUIRED_FILES = (
21
- "classifier.pkl",
22
- "scaler.pkl",
23
- "word_vectorizer.pkl",
24
- "char_vectorizer.pkl",
25
- "feature_names.json",
26
- "metadata.json",
27
- )
28
-
29
-
30
- def _patch_legacy_logistic_model(model):
31
- """Backfill attributes expected by newer sklearn versions."""
32
- if isinstance(model, (LogisticRegression, LogisticRegressionCV)) and not hasattr(model, "multi_class"):
33
- model.multi_class = "auto"
34
- return model
35
-
36
-
37
- def _has_required_artifacts(model_dir: Path) -> bool:
38
- if not model_dir.exists() or not model_dir.is_dir():
39
- return False
40
- return all((model_dir / filename).exists() for filename in REQUIRED_FILES)
41
-
42
-
43
- def _resolve_artifact_dir(base_dir: Path) -> Path | None:
44
- candidates = [base_dir, base_dir / ENGLISH_SUBDIR]
45
- for candidate in candidates:
46
- if _has_required_artifacts(candidate):
47
- return candidate
48
- return None
49
 
50
 
51
  def warmup():
52
- logging.info("Warming up model...")
53
- if MODEL_DIR is None:
54
- raise ValueError("LANG_MODEL is not configured")
55
- if _resolve_artifact_dir(MODEL_DIR):
56
- logging.info("Model artifacts already exist, skipping download.")
57
- return
58
  download_model_repo()
 
 
59
 
60
 
61
  def download_model_repo():
62
- if MODEL_DIR is None:
63
- raise ValueError("LANG_MODEL is not configured")
64
- if not REPO_ID:
65
- raise ValueError("English_model repo id is not configured")
66
- if _resolve_artifact_dir(MODEL_DIR):
67
- logging.info("Model artifacts already exist, skipping download.")
68
  return
69
- snapshot_path = Path(snapshot_download(repo_id=REPO_ID, token=HF_TOKEN))
70
- source_dir = snapshot_path / ENGLISH_SUBDIR if (snapshot_path / ENGLISH_SUBDIR).is_dir() else snapshot_path
71
- MODEL_DIR.mkdir(parents=True, exist_ok=True)
72
- shutil.copytree(source_dir, MODEL_DIR, dirs_exist_ok=True)
73
 
74
 
75
  def load_model():
76
- if MODEL_DIR is None:
77
- raise ValueError("LANG_MODEL is not configured")
78
- artifact_dir = _resolve_artifact_dir(MODEL_DIR)
79
- if artifact_dir is None:
80
- logging.info("Model artifacts missing in %s, downloading now.", MODEL_DIR)
 
 
 
 
 
 
 
81
  download_model_repo()
82
- artifact_dir = _resolve_artifact_dir(MODEL_DIR)
83
- if artifact_dir is None:
84
- raise FileNotFoundError(
85
- f"Required model artifacts not found in {MODEL_DIR}. Expected files: {', '.join(REQUIRED_FILES)}"
86
- )
87
-
88
- with open(artifact_dir / "classifier.pkl", "rb") as f:
89
- loaded_classifier = pickle.load(f)
90
- loaded_classifier = _patch_legacy_logistic_model(loaded_classifier)
91
-
92
- with open(artifact_dir / "scaler.pkl", "rb") as f:
93
- loaded_scaler = pickle.load(f)
94
-
95
- with open(artifact_dir / "word_vectorizer.pkl", "rb") as f:
96
- loaded_word_vectorizer = pickle.load(f)
97
-
98
- with open(artifact_dir / "char_vectorizer.pkl", "rb") as f:
99
- loaded_char_vectorizer = pickle.load(f)
100
-
101
- with open(artifact_dir / "feature_names.json", "r") as f:
102
- loaded_features = json.load(f)
103
-
104
- with open(artifact_dir / "metadata.json", "r") as f:
105
- loaded_metadata = json.load(f)
106
- return (
107
- loaded_classifier,
108
- loaded_scaler,
109
- loaded_word_vectorizer,
110
- loaded_char_vectorizer,
111
- loaded_features,
112
- loaded_metadata,
113
- )
 
+import os
 import shutil
+import logging
+from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
 from huggingface_hub import snapshot_download
+import torch
+from dotenv import load_dotenv
+load_dotenv()
+REPO_ID = "can-org/AI-Content-Checker"
+MODEL_DIR = "./models"
+TOKENIZER_DIR = os.path.join(MODEL_DIR, "model")
+WEIGHTS_PATH = os.path.join(MODEL_DIR, "model_weights.pth")
 
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+_model, _tokenizer = None, None
 
 
 def warmup():
+    global _model, _tokenizer
+    # Ensure punkt is available
     download_model_repo()
+    _model, _tokenizer = load_model()
+    logging.info("Its ready")
 
 
 def download_model_repo():
+    if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
+        logging.info("Model already exists, skipping download.")
         return
+    snapshot_path = snapshot_download(repo_id=REPO_ID)
+    os.makedirs(MODEL_DIR, exist_ok=True)
+    shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
 
 
 def load_model():
+    tokenizer = GPT2TokenizerFast.from_pretrained(TOKENIZER_DIR)
+    config = GPT2Config.from_pretrained(TOKENIZER_DIR)
+    model = GPT2LMHeadModel(config)
+    model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
+    model.to(device)
+    model.eval()
+    return model, tokenizer
+
+
+def get_model_tokenizer():
+    global _model, _tokenizer
+    if _model is None or _tokenizer is None:
         download_model_repo()
+        _model, _tokenizer = load_model()
+    return _model, _tokenizer
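`download_model_repo` above skips the download whenever `./models` exists, so an empty or partially copied directory would also count as complete. A minimal sketch of a stricter check follows. It assumes the two paths the loader reads (`model_weights.pth` and the `model` tokenizer directory) are the artifacts that matter; the helper name `model_artifacts_present` is hypothetical.

```python
import os
import tempfile


def model_artifacts_present(model_dir: str) -> bool:
    """True when both the weights file and the tokenizer directory exist."""
    # File and directory names mirror WEIGHTS_PATH and TOKENIZER_DIR in the diff.
    weights_path = os.path.join(model_dir, "model_weights.pth")
    tokenizer_dir = os.path.join(model_dir, "model")
    return os.path.isfile(weights_path) and os.path.isdir(tokenizer_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(model_artifacts_present(tmp))  # directory exists but is empty
        os.mkdir(os.path.join(tmp, "model"))
        open(os.path.join(tmp, "model_weights.pth"), "wb").close()
        print(model_artifacts_present(tmp))  # both artifacts now in place
```

The early return in `download_model_repo` could test `model_artifacts_present(MODEL_DIR)` in place of the bare directory check, at the cost of re-downloading when a copy was interrupted.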
features/text_classifier/preprocess.py CHANGED
@@ -1,4 +1,4 @@
1
- from pypdf import PdfReader
2
  import docx
3
  from io import BytesIO
4
  import logging
@@ -15,16 +15,18 @@ def parse_docx(file: BytesIO):
15
 
16
  def parse_pdf(file: BytesIO):
17
  try:
18
- doc = PdfReader(file)
19
  text = ""
20
- for page in doc.pages:
21
- text += page.extract_text()
22
- return text
 
23
  except Exception as e:
24
  logging.error(f"Error while processing PDF: {str(e)}")
25
  raise HTTPException(
26
  status_code=500, detail="Error processing PDF file")
27
 
 
28
  def parse_txt(file: BytesIO):
29
  return file.read().decode("utf-8")
30
 
 
1
+ import fitz # PyMuPDF
2
  import docx
3
  from io import BytesIO
4
  import logging
 
15
 
16
  def parse_pdf(file: BytesIO):
17
  try:
18
+ doc = fitz.open(stream=file, filetype="pdf")
19
  text = ""
20
+ for page_num in range(doc.page_count):
21
+ page = doc.load_page(page_num)
22
+ text += page.get_text()
23
+ return text
24
  except Exception as e:
25
  logging.error(f"Error while processing PDF: {str(e)}")
26
  raise HTTPException(
27
  status_code=500, detail="Error processing PDF file")
28
 
29
+
30
  def parse_txt(file: BytesIO):
31
  return file.read().decode("utf-8")
32
 
features/text_classifier/routes.py CHANGED
@@ -37,10 +37,9 @@ async def analyze_sentences(request: Request, data: TextInput, token: str = Depe
37
  raise HTTPException(status_code=400, detail="Missing 'text' in request body")
38
  return await handle_sentence_level_analysis(data.text)
39
 
40
-
41
- @router.post("/analyse-sentence-file")
42
  @limiter.limit(ACCESS_RATE)
43
- async def analyze_sentence_file(request: Request, file: UploadFile = File(...), token: str = Depends(verify_token)):
44
  return await handle_file_sentence(file)
45
 
46
  @router.get("/health")
 
37
  raise HTTPException(status_code=400, detail="Missing 'text' in request body")
38
  return await handle_sentence_level_analysis(data.text)
39
 
40
+ @router.post("/analyse-sentance-file")
 
41
  @limiter.limit(ACCESS_RATE)
42
+ async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(verify_token)):
43
  return await handle_file_sentence(file)
44
 
45
  @router.get("/health")
requirements.txt CHANGED
@@ -15,23 +15,6 @@ tensorflow
15
  opencv-python
16
  pillow
17
  scipy
18
- pypdf
19
  frontend
20
  tools
21
- pandas
22
- numpy
23
- scikit-learn
24
- textstat
25
- requests
26
- beautifulsoup4
27
- langchain
28
- langchain-community
29
- langchain-openai
30
- faiss-cpu
31
- PyPDF2
32
- tiktoken
33
- chromadb
34
- langchain_chroma
35
- sentence-transformers
36
- tf-keras
37
- torchvision
 
15
  opencv-python
16
  pillow
17
  scipy
18
+ PyMuPDF
19
  frontend
20
  tools
test.md ADDED
@@ -0,0 +1,31 @@
+
+ **Update: Edited & AI-Generated Content Detection – Project Plan**
+
+ ### 🔍 Phase 1: Rule-Based Image Detection (In Progress)
+
+ We're implementing three core techniques to individually flag edited or AI-generated images:
+
+ * **ELA (Error Level Analysis):** Highlights inconsistencies via JPEG recompression.
+ * **FFT (Frequency Analysis):** Uses 2D Fourier Transform to detect unnatural image frequency patterns.
+ * **Metadata Analysis:** Parses EXIF data to catch clues like editing software tags.
+
+ These give us visual + interpretable results for each image, and currently offer \~60–70% accuracy on typical AI-edited content.
+
+ ---
+
+ ### Phase 2: AI vs Human Detection System (Coming Soon)
+
+ **Goal:** Build an AI model that classifies whether content is AI- or human-made — initially focusing on **images**, and later expanding to **text**.
+
+ **Data Strategy:**
+
+ * Scraping large volumes of recent AI-gen images (e.g. SDXL, Gibbli, MidJourney).
+ * Balancing with high-quality human images.
+
+ **Model Plan:**
+
+ * Use ELA, FFT, and metadata as feature extractors.
+ * Feed these into a CNN or ensemble model.
+ * Later, unify into a full web-based platform (upload → get AI/human probability).
+
+
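The FFT bullet in Phase 1 can be illustrated with a dependency-free sketch. It is not the project's detector: it runs a naive 2D DFT over a tiny grayscale grid and reports the share of spectral energy at or above a cutoff frequency. The function names, the 0.25 cycles-per-pixel cutoff, and the use of this ratio as a signal are all assumptions made for illustration. Real images would call for `numpy.fft.fft2` in place of the quadruple loop.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small grayscale grid."""
    rows, cols = len(image), len(image[0])
    spectrum = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * math.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def high_frequency_ratio(image, cutoff=0.25):
    """Share of non-DC spectral energy at or above `cutoff` cycles per pixel."""
    rows, cols = len(image), len(image[0])
    spectrum = dft2(image)
    dc_energy = abs(spectrum[0][0]) ** 2
    high = total = 0.0
    for u in range(rows):
        for v in range(cols):
            if u == 0 and v == 0:
                continue  # skip the DC term (overall brightness)
            # Fold indices so frequency is measured as distance from zero.
            fu = min(u, rows - u) / rows
            fv = min(v, cols - v) / cols
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if max(fu, fv) >= cutoff:
                high += energy
    # Treat a spectrum that is only rounding noise as having no detail at all.
    if total <= 1e-9 * max(dc_energy, 1.0):
        return 0.0
    return high / total


if __name__ == "__main__":
    # Illustrative 8x8 inputs: a pixel-level checkerboard and a slow cosine wave.
    checker = [[(x + y) % 2 for x in range(8)] for y in range(8)]
    smooth = [[math.cos(2 * math.pi * x / 8) for x in range(8)] for y in range(8)]
    print(high_frequency_ratio(checker), high_frequency_ratio(smooth))
```

The checkerboard puts essentially all of its non-DC energy at the highest frequency, so its ratio is close to 1. The slow cosine sits below the cutoff, so its ratio is close to 0. Whether any particular threshold separates edited from unedited images is an empirical question this sketch does not answer.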