Space: can-org/Testing-AI-Contain

added cronjob

by Pujan-Dev - opened Feb 10

base: refs/heads/main ← from: refs/pr/2

+374

-2259

This PR is in draft mode

Files changed (37)

.env-example +0 -47
.gitignore +1 -6
Procfile +1 -0
README.md +0 -166
__init__.py +0 -1
app.py +17 -28
config.py +0 -59
features/Modelsdfa/English_model/feature_names.json +0 -18
features/Modelsdfa/English_model/metadata.json +0 -13
features/__init__.py +0 -5
features/ai_human_image_classifier/model_loader.py +4 -5
features/image_classifier/model_loader.py +19 -19
features/image_edit_detector/controller.py +2 -3
features/nepali_text_classifier/controller.py +24 -113
features/nepali_text_classifier/inferencer.py +15 -81
features/nepali_text_classifier/model_loader.py +51 -234
features/nepali_text_classifier/preprocess.py +6 -5
features/nepali_text_classifier/routes.py +6 -21
features/rag_chatbot/__init__.py +0 -0
features/rag_chatbot/controller.py +0 -178
features/rag_chatbot/document_handler.py +0 -37
features/rag_chatbot/rag_pipeline.py +0 -329
features/rag_chatbot/routes.py +0 -107
features/real_forged_classifier/__init__.py +0 -9
features/real_forged_classifier/controller.py +2 -95
features/real_forged_classifier/inferencer.py +1 -5
features/real_forged_classifier/main.py +26 -0
features/real_forged_classifier/model_loader.py +39 -181
features/real_forged_classifier/preprocessor.py +1 -1
features/real_forged_classifier/routes.py +4 -24
features/text_classifier/controller.py +51 -85
features/text_classifier/inferencer.py +29 -261
features/text_classifier/model_loader.py +34 -97
features/text_classifier/preprocess.py +7 -5
features/text_classifier/routes.py +2 -3
requirements.txt +1 -18
test.md +31 -0

.env-example CHANGED

@@ -1,49 +1,2 @@
 MY_SECRET_TOKEN="SECRET_CODE_TOKEN"
-# Language/text classifier models
-English_model="Pujan-Dev/Ai_vs_HUMAN"
-Nepali_model="features/Model/Nepali_model"
-LANG_MODEL="features/Model/English_model"
-# Hugging Face private model access
-# Create a READ token at: https://huggingface.co/settings/tokens
-HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-# Optional alias, either variable can be used
-HUGGINGFACE_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-# Legacy variables (kept for compatibility)
-REPOSITORY_ID_English_Detector="nepali-detector"
-REPOSITORY_ID_Nepali_Detector="nepali-detector"
-# Image classifier
-IMAGE_CLASSIFIER_REPO_ID="can-org/AI-VS-HUMAN-IMAGE-classifier"
-IMAGE_CLASSIFIER_MODEL_DIR="./IMG_Models"
-IMAGE_CLASSIFIER_WEIGHTS_FILE="latest-my_cnn_model.h5"
-# AI vs Human image detector
-AI_HUMAN_CLIP_MODEL_NAME="ViT-L/14"
-AI_HUMAN_SVM_REPO_ID="rhnsa/ai_human_image_detector"
-AI_HUMAN_SVM_FILENAME="svm_model_real.joblib"
-# Real vs Forged detector
-REAL_FORGED_MODEL_REPO_ID="rhnsa/real_forged_classifier"
-REAL_FORGED_MODEL_FILENAME="fft_cnn_model_78.pth"
-# RAG + Chroma settings
-CHROMA_HOST="localhost"
-CHROMA_PORT="8000"
-RAG_COLLECTION_NAME="company_docs_collection"
-RAG_MAX_FILE_SIZE="104857600"
-RAG_MAX_QUERY_LENGTH="1000"
-# LLM settings
-LLM_PROVIDER="openai"
-LLM_API_KEY="sk-xxxx"
-LLM_MODEL="gpt-3.5-turbo"
-LLM_TEMPERATURE="0"
-LLM_MAX_TOKENS="2048"
-# Notebook/scraper API keys
-GEMINI_API_KEY=""
-GROQ_API_KEY="gsk_xxxx"
-OPENROUTER_API_KEY="sk-or-xxxx"


 MY_SECRET_TOKEN="SECRET_CODE_TOKEN"

.gitignore CHANGED

@@ -13,13 +13,11 @@ __pycache__/
 .vscode/
 .idea/
 *.swp
-*Model/
 # ---- Jupyter / IPython ----
 .ipynb_checkpoints/
 *.ipynb
-notebook/
-*.csv
 # ---- Model & Data Artifacts ----
 *.pth
 *.pt
@@ -68,6 +66,3 @@ notebooks
 np_text_model/classifier/sentencepiece.bpe.model
 np_text_model/classifier/tokenizer.json
-# vector database
-chroma_data
-chroma_database

 .vscode/
 .idea/
 *.swp
 # ---- Jupyter / IPython ----
 .ipynb_checkpoints/
 *.ipynb
 # ---- Model & Data Artifacts ----
 *.pth
 *.pt
 np_text_model/classifier/sentencepiece.bpe.model
 np_text_model/classifier/tokenizer.json

Procfile ADDED

@@ -0,0 +1 @@
+web: uvicorn app:app --host 0.0.0.0 --port ${PORT:-8000}
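Note on the Procfile: the shell expansion `${PORT:-8000}` falls back to port 8000 when `PORT` is unset or empty. The sketch below mirrors that fallback in Python. The helper name `resolve_port` is hypothetical and is not part of this PR.

```python
import os
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ, default: int = 8000) -> int:
    """Mirror ${PORT:-8000}: use the default when PORT is unset or empty."""
    # `or` covers both the missing key and the empty string, like the `:-` form.
    raw = env.get("PORT") or str(default)
    return int(raw)


# Hypothetical launcher (assumes uvicorn is installed and app.py exposes `app`):
#   import uvicorn
#   uvicorn.run("app:app", host="0.0.0.0", port=resolve_port())
```

A platform that injects `PORT` (the README's Docker example maps 7860) overrides the default without any change to the command.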

README.md CHANGED

@@ -14,175 +14,9 @@ pinned: false
 This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
 ## How to run locally
----
-title: Testing AI Contain
-emoji: 🤖
-colorFrom: blue
-colorTo: green
-sdk: docker
-sdk_version: "latest"
-app_file: app.py
-pinned: false
----
-# AI-Contain-Checker
-# AI-Content-Checker
-A modular AI content detection system with support for **image classification**, **image edit detection**, **Nepali text classification**, and **general text classification**. Built for performance and extensibility, it is ideal for detecting AI-generated content in both visual and textual forms.
-## 🌟 Features
-### 🖼️ Image Classifier
-* **Purpose**: Classifies whether an image is AI-generated or a real-life photo.
-* **Model**: Fine-tuned **InceptionV3** CNN.
-* **Dataset**: Custom curated dataset with **\~79,950 images** for binary classification.
-* **Location**: [`features/image_classifier`](features/image_classifier)
-* **Docs**: [`docs/features/image_classifier.md`](docs/features/image_classifier.md)
-### 🖌️ Image Edit Detector
-* **Purpose**: Detects image tampering or post-processing.
-* **Techniques Used**:
-  * **Error Level Analysis (ELA)**: Visualizes compression artifacts.
-  * **Fast Fourier Transform (FFT)**: Detects unnatural frequency patterns.
-* **Location**: [`features/image_edit_detector`](features/image_edit_detector)
-* **Docs**:
-  * [ELA](docs/detector/ELA.md)
-  * [FFT](docs/detector/fft.md )
-  * [Metadata Analysis](docs/detector/meta.md)
-  * [Backend Notes](docs/detector/note-for-backend.md)
-### 📝 Nepali Text Classifier
-* **Purpose**: Determines if Nepali text content is AI-generated or written by a human.
-* **Model**: Based on `XLMRClassifier` fine-tuned on Nepali language data.
-* **Dataset**: Scraped dataset of **\~18,000** Nepali texts.
-* **Location**: [`features/nepali_text_classifier`](features/nepali_text_classifier)
-* **Docs**: [`docs/features/nepali_text_classifier.md`](docs/features/nepali_text_classifier.md)
-### 🌐 English Text Classifier
-* **Purpose**: Detects if English text is AI-generated or human-written.
-* **Pipeline**:
-  * Uses **GPT2 tokenizer** for input preprocessing.
-  * Custom binary classifier to differentiate between AI and human-written content.
-* **Location**: [`features/text_classifier`](features/text_classifier)
-* **Docs**: [`docs/features/text_classifier.md`](docs/features/text_classifier.md)
----
-## 🗂️ Project Structure
-```bash
-AI-Checker/
-│
-├── app.py                  # Main FastAPI entry point
-├── config.py               # Configuration settings
-├── Dockerfile              # Docker build script
-├── Procfile                # Deployment file for Heroku or similar
-├── requirements.txt        # Python dependencies
-├── README.md               # You are here 📘
-│
-├── features/               # Core detection modules
-│   ├── image_classifier/
-│   ├── image_edit_detector/
-│   ├── nepali_text_classifier/
-│   └── text_classifier/
-│
-├── docs/                   # Internal and API documentation
-│   ├── api_endpoints.md
-│   ├── deployment.md
-│   ├── detector/
-│   │   ├── ELA.md
-│   │   ├── fft.md
-│   │   ├── meta.md
-│   │   └── note-for-backend.md
-│   ├── functions.md
-│   ├── nestjs_integration.md
-│   ├── security.md
-│   ├── setup.md
-│   └── structure.md
-│
-├── IMG_Models/             # Saved image classifier model(s)
-│   └── latest-my_cnn_model.h5
-│
-├── notebooks/              # Experimental and debug notebooks
-├── static/                 # Static assets if needed
-└── test.md                 # Test notes
-````
----
-## 📚 Documentation Links
-* [API Endpoints](docs/api_endpoints.md)
-* [Deployment Guide](docs/deployment.md)
-* [Detector Documentation](docs/detector/)
-  * [Error Level Analysis (ELA)](docs/detector/ELA.md)
-  * [Fast Fourier Transform (FFT)](docs/detector/fft.md)
-  * [Metadata Analysis](docs/detector/meta.md)
-  * [Backend Notes](docs/detector/note-for-backend.md)
-* [Functions Overview](docs/functions.md)
-* [NestJS Integration Guide](docs/nestjs_integration.md)
-* [Security Details](docs/security.md)
-* [Setup Instructions](docs/setup.md)
-* [Project Structure](docs/structure.md)
----
-## 🚀 Usage
-1. **Install dependencies**
 ```bash
 docker build -t testing-ai-contain .
 docker run -p 7860:7860 testing-ai-contain
 ```
-   ```bash
-   pip install -r requirements.txt
-   ```
-2. **Run the API**
-   ```bash
-   chroma run --path ./chroma_database ## to run chromadb locally
-   uvicorn app:app --reload --port 8001 ## fastapi (run after chromadb)
-   ```
-3. **Build Docker (optional)**
-   ```bash
-   docker build -t ai-contain-checker .
-   docker run -p 8000:8000 ai-contain-checker
-   ```
----
-## 🔐 Security & Integration
-* **Token Authentication** and **IP Whitelisting** supported.
-* NestJS integration guide: [`docs/nestjs_integration.md`](docs/nestjs_integration.md)
-* Rate limiting handled using `slowapi`.
----
-## 🛡️ Future Plans
-* Add **video classifier** module.
-* Expand dataset for **multilingual** AI content detection.
-* Add **fine-tuning UI** for models.
----
-## 📄 License
-See full license terms here: [`LICENSE.md`](license.md)

 This Hugging Face Space uses **Docker** to run a custom environment for AI content detection.
 ## How to run locally
 ```bash
 docker build -t testing-ai-contain .
 docker run -p 7860:7860 testing-ai-contain
 ```

__init__.py DELETED

@@ -1 +0,0 @@
-

app.py CHANGED

@@ -1,35 +1,25 @@
-import warnings
-import requests
 from fastapi import FastAPI, Request
-from fastapi.responses import FileResponse, JSONResponse
-from fastapi.staticfiles import StaticFiles
 from slowapi import Limiter, _rate_limit_exceeded_handler
-from slowapi.errors import RateLimitExceeded
 from slowapi.middleware import SlowAPIMiddleware
 from slowapi.util import get_remote_address
-from config import ACCESS_RATE
-from features.image_classifier.routes import router as image_classifier_router
-from features.image_edit_detector.routes import router as image_edit_detector_router
-from features.real_forged_classifier.routes import router as real_forged_classifier_router
 from features.nepali_text_classifier.routes import (
     router as nepali_text_classifier_router,
 )
-from features.text_classifier.routes import router as text_classifier_router
-warnings.filterwarnings("ignore")
-limiter = Limiter(key_func=get_remote_address, default_limits=[ACCESS_RATE])
-openapi_tags = [
-    {"name": "English Text Classifier", "description": "Endpoints for English AI-vs-human text analysis."},
-    {"name": "Nepali Text Classifier", "description": "Endpoints for Nepali AI-vs-human text analysis."},
-    {"name": "AI Image Classifier", "description": "Endpoints for AI-vs-human image classification."},
-    {"name": "Image Edit Detection", "description": "Endpoints for edited/forged image detection."},
-    {"name": "System", "description": "Health and root endpoints."},
-]
-app = FastAPI(openapi_tags=openapi_tags)
 # added the robots.txt
 # Set up SlowAPI
 app.state.limiter = limiter
@@ -47,14 +37,13 @@ app.add_exception_handler(
 app.add_middleware(SlowAPIMiddleware)
 # Include your routes
-app.include_router(text_classifier_router, prefix="/text", tags=["English Text Classifier"])
-app.include_router(nepali_text_classifier_router, prefix="/NP", tags=["Nepali Text Classifier"])
-app.include_router(image_classifier_router, prefix="/AI-image", tags=["AI Image Classifier"])
-app.include_router(image_edit_detector_router, prefix="/detect", tags=["Image Edit Detection"])
-app.include_router(real_forged_classifier_router, prefix="/real-forged", tags=["Real/Forged Image Classifier"])
-@app.get("/", tags=["System"])
 @limiter.limit(ACCESS_RATE)
 async def root(request: Request):
     return {

 from fastapi import FastAPI, Request
 from slowapi import Limiter, _rate_limit_exceeded_handler
+from fastapi.responses import FileResponse
 from slowapi.middleware import SlowAPIMiddleware
+from slowapi.errors import RateLimitExceeded
 from slowapi.util import get_remote_address
+from fastapi.responses import JSONResponse
+from features.text_classifier.routes import router as text_classifier_router
 from features.nepali_text_classifier.routes import (
     router as nepali_text_classifier_router,
 )
+from features.image_classifier.routes import router as image_classifier_router
+from features.image_edit_detector.routes import router as image_edit_detector_router
+from fastapi.staticfiles import StaticFiles
+from config import ACCESS_RATE
+import requests
+limiter = Limiter(key_func=get_remote_address, default_limits=[ACCESS_RATE])
+app = FastAPI()
 # added the robots.txt
 # Set up SlowAPI
 app.state.limiter = limiter
 app.add_middleware(SlowAPIMiddleware)
 # Include your routes
+app.include_router(text_classifier_router, prefix="/text")
+app.include_router(nepali_text_classifier_router, prefix="/NP")
+app.include_router(image_classifier_router, prefix="/AI-image")
+app.include_router(image_edit_detector_router, prefix="/detect")
+@app.get("/")
 @limiter.limit(ACCESS_RATE)
 async def root(request: Request):
     return {

config.py CHANGED

@@ -1,61 +1,2 @@
-import os
-import dotenv
-dotenv.load_dotenv()
 ACCESS_RATE = "20/minute"
-class Config:
-    Nepali_model_folder = os.getenv("Nepali_model")
-    English_model_folder = os.getenv("English_model")
-    REPO_ID_LANG = os.getenv("English_model") or "Pujan-Dev/Ai_vs_HUMAN"
-    LANG_MODEL = os.getenv("LANG_MODEL")
-    HF_TOKEN = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_TOKEN")
-    SECRET_TOKEN = os.getenv("MY_SECRET_TOKEN")
-    IMAGE_CLASSIFIER_REPO_ID = os.getenv("IMAGE_CLASSIFIER_REPO_ID", "can-org/AI-VS-HUMAN-IMAGE-classifier")
-    IMAGE_CLASSIFIER_MODEL_DIR = os.getenv("IMAGE_CLASSIFIER_MODEL_DIR", "./IMG_Models")
-    IMAGE_CLASSIFIER_WEIGHTS_FILE = os.getenv("IMAGE_CLASSIFIER_WEIGHTS_FILE", "latest-my_cnn_model.h5")
-    AI_HUMAN_CLIP_MODEL_NAME = os.getenv("AI_HUMAN_CLIP_MODEL_NAME", "ViT-L/14")
-    AI_HUMAN_SVM_REPO_ID = os.getenv("AI_HUMAN_SVM_REPO_ID", "rhnsa/ai_human_image_detector")
-    AI_HUMAN_SVM_FILENAME = os.getenv("AI_HUMAN_SVM_FILENAME", "svm_model_real.joblib")
-    REAL_FORGED_MODEL_REPO_ID = os.getenv("REAL_FORGED_MODEL_REPO_ID", "rhnsa/real_forged_classifier")
-    REAL_FORGED_MODEL_FILENAME = os.getenv("REAL_FORGED_MODEL_FILENAME", "fft_cnn_model_78.pth")
-    REAL_FORGED_MODEL_LOCAL_PATH = os.getenv("REAL_FORGED_MODEL_LOCAL_PATH", "Model/real_forged/fft_cnn_model_78.pth")
-    DOCUMENT_FORGERY_MODEL_REPO_ID = os.getenv(
-        "DOCUMENT_FORGERY_MODEL_REPO_ID",
-        REPO_ID_LANG
-    )
-    DOCUMENT_FORGERY_MODEL_FILENAME = os.getenv(
-        "DOCUMENT_FORGERY_MODEL_FILENAME",
-        "document_forgery/pixel_forgery_v3_best.pth",
-    )
-    DOCUMENT_FORGERY_MODEL_PATH = os.getenv(
-        "DOCUMENT_FORGERY_MODEL_PATH",
-        "features/Modelsdfa/document_forgery/pixel_forgery_v3_best.pth",
-    )
-    # Decision thresholds for document forgery detector (probabilities in 0..1)
-    DOCUMENT_FORGERY_POSSIBLE_LOW = float(os.getenv("DOCUMENT_FORGERY_POSSIBLE_LOW", "0.40"))
-    DOCUMENT_FORGERY_FORGED_LOW = float(os.getenv("DOCUMENT_FORGERY_FORGED_LOW", "0.55"))
-    RAG_CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
-    RAG_CHROMA_PORT = int(os.getenv("CHROMA_PORT", "8000"))
-    RAG_COLLECTION_NAME = os.getenv("RAG_COLLECTION_NAME", "company_docs_collection")
-    RAG_LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai").lower()
-    RAG_LLM_API_KEY = os.getenv("LLM_API_KEY")
-    RAG_LLM_MODEL = os.getenv("LLM_MODEL", "gpt-3.5-turbo")
-    RAG_LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0"))
-    RAG_LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "2048"))
-    RAG_MAX_FILE_SIZE = int(os.getenv("RAG_MAX_FILE_SIZE", str(100 * 1024 * 1024)))
-    RAG_MAX_QUERY_LENGTH = int(os.getenv("RAG_MAX_QUERY_LENGTH", "1000"))
-    RAG_SUPPORTED_CONTENT_TYPES = {
-        "application/pdf",
-        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-        "text/plain",
-    }

 ACCESS_RATE = "20/minute"

features/Modelsdfa/English_model/feature_names.json DELETED

@@ -1,18 +0,0 @@
-[
-  "perplexity",
-  "burst_mean",
-  "burst_std",
-  "burst_max",
-  "burst_min",
-  "burst_range",
-  "num_words",
-  "num_chars",
-  "num_sentences",
-  "avg_word_len",
-  "avg_sent_len",
-  "lexical_diversity",
-  "punct_ratio",
-  "caps_ratio",
-  "flesch_reading",
-  "flesch_grade"
-]

features/Modelsdfa/English_model/metadata.json DELETED

@@ -1,13 +0,0 @@
-{
-  "selected_model": "hybrid_tfidf_logistic",
-  "cv_best_f1": 0.8593569681592504,
-  "num_engineered_features": 16,
-  "num_word_tfidf_features": 86956,
-  "num_char_tfidf_features": 80000,
-  "train_samples": 15952,
-  "test_samples": 3988,
-  "train_accuracy": 0.980253259779338,
-  "train_f1": 0.980182447310475,
-  "test_accuracy": 0.8713640922768305,
-  "test_f1": 0.8707482993197279
-}
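
Note on the deleted metadata: the recorded numbers imply an 80/20 train/test split and a train-to-test accuracy gap of roughly 0.11. The sketch below derives those figures from the values in the removed file. The function name `summarize` is hypothetical, and the total feature count assumes the hybrid model concatenates all three feature groups, which the metadata does not state.

```python
# Values copied from the deleted metadata.json.
metadata = {
    "num_engineered_features": 16,
    "num_word_tfidf_features": 86956,
    "num_char_tfidf_features": 80000,
    "train_samples": 15952,
    "test_samples": 3988,
    "train_accuracy": 0.980253259779338,
    "test_accuracy": 0.8713640922768305,
}


def summarize(meta: dict) -> dict:
    """Derive split size, feature count, and the train/test accuracy gap."""
    total_samples = meta["train_samples"] + meta["test_samples"]
    return {
        # Assumption: the three feature groups are concatenated into one vector.
        "total_features": (
            meta["num_engineered_features"]
            + meta["num_word_tfidf_features"]
            + meta["num_char_tfidf_features"]
        ),
        "total_samples": total_samples,
        "test_fraction": meta["test_samples"] / total_samples,
        "accuracy_gap": meta["train_accuracy"] - meta["test_accuracy"],
    }


summary = summarize(metadata)
```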

features/__init__.py DELETED

@@ -1,5 +0,0 @@
-"""Top-level features package for the aiapi project."""
-__all__ = [
-    # Subpackages are dynamically discovered; keep this minimal.
-]

features/ai_human_image_classifier/model_loader.py CHANGED

@@ -3,7 +3,6 @@ import torch
 import joblib
 from pathlib import Path
 from huggingface_hub import hf_hub_download
-from config import Config
 class ModelLoader:
     """
@@ -57,7 +56,7 @@ class ModelLoader:
         print(f"Downloading SVM model from Hugging Face repo: {repo_id}")
         try:
             # Download the model file from the Hub. It returns the cached path.
-            model_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
             print(f"SVM model downloaded to: {model_path}")
             # Load the model from the downloaded path
@@ -69,9 +68,9 @@ class ModelLoader:
 # --- Global Model Instance ---
 # This creates a single instance of the models that can be imported by other modules.
-CLIP_MODEL_NAME = Config.AI_HUMAN_CLIP_MODEL_NAME
-SVM_REPO_ID = Config.AI_HUMAN_SVM_REPO_ID
-SVM_FILENAME = Config.AI_HUMAN_SVM_FILENAME
 # This instance will be created when the application starts.
 models = ModelLoader(

 import joblib
 from pathlib import Path
 from huggingface_hub import hf_hub_download
 class ModelLoader:
     """
         print(f"Downloading SVM model from Hugging Face repo: {repo_id}")
         try:
             # Download the model file from the Hub. It returns the cached path.
+            model_path = hf_hub_download(repo_id=repo_id, filename=filename)
             print(f"SVM model downloaded to: {model_path}")
             # Load the model from the downloaded path
 # --- Global Model Instance ---
 # This creates a single instance of the models that can be imported by other modules.
+CLIP_MODEL_NAME = 'ViT-L/14'
+SVM_REPO_ID = 'rhnsa/ai_human_image_detector'
+SVM_FILENAME = 'svm_model_real.joblib' # The name of your model file in the Hugging Face repo
 # This instance will be created when the application starts.
 models = ModelLoader(

features/image_classifier/model_loader.py CHANGED

@@ -1,21 +1,27 @@
 import os
 import shutil
 import logging
 from huggingface_hub import snapshot_download
-from config import Config
-os.environ.setdefault("CUDA_VISIBLE_DEVICES", "-1")
-os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
 # Model config
-REPO_ID = Config.IMAGE_CLASSIFIER_REPO_ID
-MODEL_DIR = Config.IMAGE_CLASSIFIER_MODEL_DIR
-WEIGHTS_PATH = os.path.join(MODEL_DIR, Config.IMAGE_CLASSIFIER_WEIGHTS_FILE)
-HF_TOKEN = Config.HF_TOKEN
 # Global model reference
 _model_img = None
 def warmup():
     global _model_img
     download_model_repo()
@@ -26,7 +32,7 @@ def download_model_repo():
     if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
         logging.info("Image model already exists, skipping download.")
         return
-    snapshot_path = snapshot_download(repo_id=REPO_ID, token=HF_TOKEN)
     os.makedirs(MODEL_DIR, exist_ok=True)
     shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
@@ -35,17 +41,11 @@ def load_model():
     if _model_img is not None:
         return _model_img
-    import tensorflow as tf
-    class Cast(tf.keras.layers.Layer):
-        def call(self, inputs):
-            return tf.cast(inputs, tf.float32)
-    print("Loading image model on CPU.")
-    with tf.device("/CPU:0"):
-        _model_img = tf.keras.models.load_model(
-            WEIGHTS_PATH, custom_objects={"Cast": Cast}
-        )
     print("Model input shape:", _model_img.input_shape)
     return _model_img

 import os
 import shutil
 import logging
+import tensorflow as tf
+from tensorflow.keras.layers import Layer
 from huggingface_hub import snapshot_download
 # Model config
+REPO_ID = "can-org/AI-VS-HUMAN-IMAGE-classifier"
+MODEL_DIR = "./IMG_Models"
+WEIGHTS_PATH = os.path.join(MODEL_DIR, "latest-my_cnn_model.h5")
+# Device info (for logging)
+gpus = tf.config.list_physical_devices("GPU")
+device = "cuda" if gpus else "cpu"
 # Global model reference
 _model_img = None
+# Custom layer used in the model
+class Cast(Layer):
+    def call(self, inputs):
+        return tf.cast(inputs, tf.float32)
 def warmup():
     global _model_img
     download_model_repo()
     if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
         logging.info("Image model already exists, skipping download.")
         return
+    snapshot_path = snapshot_download(repo_id=REPO_ID)
     os.makedirs(MODEL_DIR, exist_ok=True)
     shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
     if _model_img is not None:
         return _model_img
+    print(f"{'GPU detected' if device == 'cuda' else 'No GPU detected'}, loading model on {device.upper()}.")
+    _model_img = tf.keras.models.load_model(
+        WEIGHTS_PATH, custom_objects={"Cast": Cast}
+    )
     print("Model input shape:", _model_img.input_shape)
     return _model_img

features/image_edit_detector/controller.py CHANGED

@@ -7,9 +7,8 @@ from .detectors.ela import run_ela
 from .preprocess import preprocess_image
 from fastapi import HTTPException,status,Depends
 from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
-from config import Config
 security=HTTPBearer()
 async def process_image_ela(image_bytes: bytes, quality: int=90):
     image = Image.open(io.BytesIO(image_bytes))
@@ -41,7 +40,7 @@ async def process_meta_image(image_bytes: bytes) -> dict:
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
-    expected_token = Config.SECRET_TOKEN
     if token != expected_token:
         raise HTTPException(
             status_code=status.HTTP_403_FORBIDDEN,

 from .preprocess import preprocess_image
 from fastapi import HTTPException,status,Depends
 from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
 security=HTTPBearer()
+import os
 async def process_image_ela(image_bytes: bytes, quality: int=90):
     image = Image.open(io.BytesIO(image_bytes))
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
+    expected_token = os.getenv("MY_SECRET_TOKEN")
     if token != expected_token:
         raise HTTPException(
             status_code=status.HTTP_403_FORBIDDEN,
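
Note on `verify_token`: with `os.getenv("MY_SECRET_TOKEN")`, an unset variable yields `None`, so every request is rejected with 403. The sketch below keeps that behavior but makes it explicit and swaps the `!=` comparison for the standard library's constant-time `hmac.compare_digest`. This is a suggested alternative, not what the PR does, and the helper name `tokens_match` is hypothetical.

```python
import hmac
from typing import Optional


def tokens_match(provided: str, expected: Optional[str]) -> bool:
    """Return True only when an expected token is configured and matches exactly."""
    if not expected:
        # Unset or empty MY_SECRET_TOKEN: reject everything instead of comparing to None.
        return False
    # compare_digest takes time independent of where the first mismatch occurs.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

Inside `verify_token`, the check would then read `if not tokens_match(token, os.getenv("MY_SECRET_TOKEN")):` followed by the existing `HTTPException`.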

features/nepali_text_classifier/controller.py CHANGED

@@ -1,87 +1,23 @@
 import asyncio
-import hashlib
-import logging
-import random
 from io import BytesIO
 from fastapi import HTTPException, UploadFile, status, Depends
 from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
-from config import Config
 from features.nepali_text_classifier.inferencer import classify_text
 from  features.nepali_text_classifier.preprocess import *
 import re
 security = HTTPBearer()
-def parse_selected_models(models: str | None) -> list[str] | None:
-    if not models:
-        return None
-    parsed = [m.strip() for m in models.split(",") if m.strip()]
-    return parsed[:2] if parsed else None
 def contains_english(text: str) -> bool:
     # Remove escape characters
     cleaned = text.replace("\n", "").replace("\t", "")
     return bool(re.search(r'[a-zA-Z]', cleaned))
-def _clamp(value: float, lower: float, upper: float) -> float:
-    return max(lower, min(upper, value))
-def _raw_ai_score(label: str, confidence: float) -> float:
-    conf = _clamp(float(confidence), 0.0, 100.0)
-    return conf if label == "AI" else (100.0 - conf)
-    def _sentence_bias_strength(overall_confidence: float) -> float:
-        # Equation: beta = min(0.15, 0.05 + 0.10 * (C_doc / 100))
-        return min(0.15, 0.05 + 0.10 * (_clamp(overall_confidence, 0.0, 100.0) / 100.0))
-def _deterministic_jitter(seed_text: str, max_jitter: float = 3.0) -> float:
-    digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
-    seed_value = int.from_bytes(digest[:8], byteorder="big", signed=False)
-    rng = random.Random(seed_value)
-    return rng.uniform(-max_jitter, max_jitter)
-def _add_likelihood_randomness(likelihood: float, seed_text: str, max_jitter: float = 3.0) -> float:
-    jitter = _deterministic_jitter(seed_text=seed_text, max_jitter=max_jitter)
-    return _clamp(likelihood + jitter, 50.0, 99.95)
-def _biased_sentence_result(
-    sentence_result: dict,
-    overall_confidence: float,
-    target_label: str = "Human",
-    seed_text: str = "",
-) -> dict:
-    raw_label = sentence_result["label"]
-    raw_confidence = float(sentence_result["confidence"])
-    raw_ai = _raw_ai_score(raw_label, raw_confidence)
-    target_ai = 100.0 if target_label == "AI" else 0.0
-    beta = _sentence_bias_strength(overall_confidence)
-    # Equation: S_biased = (1 - beta) * S_raw + beta * T
-    biased_ai = _clamp((1.0 - beta) * raw_ai + beta * target_ai, 0.0, 100.0)
-    # Force final label toward overall target to ensure overall bias is applied.
-    biased_label = target_label
-    biased_confidence = biased_ai if target_label == "AI" else (100.0 - biased_ai)
-    biased_confidence = _add_likelihood_randomness(
-        biased_confidence,
-        seed_text=f"{seed_text}|{target_label}|{round(overall_confidence, 2)}",
-    )
-    return {
-        "biased_label": biased_label,
-        "biased_confidence": round(biased_confidence, 2),
-    }
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
-    expected_token = Config.SECRET_TOKEN
     if token != expected_token:
         raise HTTPException(
             status_code=status.HTTP_403_FORBIDDEN,
@@ -89,16 +25,15 @@ async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(secur
         )
     return token
-async def nepali_text_analysis(text: str, models: str | None = None):
     end_symbol_for_NP_text(text)
     words = text.split()
     if len(words) < 10:
         raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
-    if len(text) > 50000:
-        raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
-    selected_models = parse_selected_models(models)
-    result = await asyncio.to_thread(classify_text, text, selected_models, 2)
     return result
@@ -116,19 +51,18 @@ async def extract_file_contents(file:UploadFile)-> str:
     else:
         raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
-async def handle_file_upload(file: UploadFile, models: str | None = None):
     try:
         file_contents = await extract_file_contents(file)
         end_symbol_for_NP_text(file_contents)
-        if len(file_contents) > 50000:
-            raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
             raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
-        selected_models = parse_selected_models(models)
-        result = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
         return result
     except Exception as e:
         logging.error(f"Error processing file: {e}")
@@ -136,45 +70,34 @@ async def handle_file_upload(file: UploadFile, models: str | None = None):
-async def handle_sentence_level_analysis(text: str, models: str | None = None):
     text = text.strip()
-    if len(text) > 50000:
-        raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
     end_symbol_for_NP_text(text)
     # Split text into sentences
     sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
-    selected_models = parse_selected_models(models)
-    overall = await asyncio.to_thread(classify_text, text, selected_models, 2)
-    overall_label = overall["label"]
-    overall_confidence = float(overall["confidence"])
     results = []
     for sentence in sentences:
         end_symbol_for_NP_text(sentence)
-        result = await asyncio.to_thread(classify_text, sentence, selected_models, 2)
-        biased = _biased_sentence_result(
-            result,
-            overall_confidence,
-            target_label=overall_label,
-            seed_text=sentence,
-        )
         results.append({
             "text": sentence,
-            "result": biased["biased_label"],
-            "likelihood": biased["biased_confidence"],
         })
     return {"analysis": results}
-async def handle_file_sentence(file:UploadFile, models: str | None = None):
     try:
         file_contents = await extract_file_contents(file)
-        if len(file_contents) > 50000:
-            raise HTTPException(status_code=413, detail="Text must be less than 50,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
@@ -183,27 +106,16 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
         # Split text into sentences
         sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
-        selected_models = parse_selected_models(models)
-        overall = await asyncio.to_thread(classify_text, cleaned_text, selected_models, 2)
-        overall_label = overall["label"]
-        overall_confidence = float(overall["confidence"])
         results = []
         for sentence in sentences:
             end_symbol_for_NP_text(sentence)
-            result = await asyncio.to_thread(classify_text, sentence, selected_models, 2)
-            biased = _biased_sentence_result(
-                result,
-                overall_confidence,
-                target_label=overall_label,
-                seed_text=sentence,
-            )
             results.append({
                 "text": sentence,
-                "result": biased["biased_label"],
-                "likelihood": biased["biased_confidence"],
             })
         return {"analysis": results}
@@ -213,7 +125,6 @@ async def handle_file_sentence(file:UploadFile, models: str | None = None):
         raise HTTPException(status_code=500, detail="Error processing the file")
-def classify(text: str, models: str | None = None):
-    selected_models = parse_selected_models(models)
-    return classify_text(text, selected_models, 2)
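
Note on the removed sentence-level bias: the deleted helpers pulled each sentence score toward the document-level label using `beta = min(0.15, 0.05 + 0.10 * (C_doc / 100))` and `S_biased = (1 - beta) * S_raw + beta * T`, then added a jitter seeded from the sentence text. The sketch below restates that arithmetic as standalone functions with a worked example. The function names are simplified stand-ins for the removed private helpers, not the exact originals.

```python
import hashlib
import random


def clamp(value: float, lower: float, upper: float) -> float:
    return max(lower, min(upper, value))


def bias_strength(overall_confidence: float) -> float:
    # beta = min(0.15, 0.05 + 0.10 * (C_doc / 100))
    return min(0.15, 0.05 + 0.10 * (clamp(overall_confidence, 0.0, 100.0) / 100.0))


def blend_ai_score(raw_ai: float, overall_confidence: float, target_label: str) -> float:
    # S_biased = (1 - beta) * S_raw + beta * T, with T = 100 for "AI" and 0 for "Human".
    target_ai = 100.0 if target_label == "AI" else 0.0
    beta = bias_strength(overall_confidence)
    return clamp((1.0 - beta) * raw_ai + beta * target_ai, 0.0, 100.0)


def deterministic_jitter(seed_text: str, max_jitter: float = 3.0) -> float:
    # The RNG is seeded from the SHA-256 digest, so the same text gives the same offset.
    digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
    seed_value = int.from_bytes(digest[:8], byteorder="big", signed=False)
    return random.Random(seed_value).uniform(-max_jitter, max_jitter)


# Worked example: a sentence scored Human at 70% (raw AI score 30)
# inside a document scored Human at 80%.
beta = bias_strength(80.0)                       # 0.05 + 0.10 * 0.8, about 0.13
biased_ai = blend_ai_score(30.0, 80.0, "Human")  # 0.87 * 30, about 26.1
human_confidence = 100.0 - biased_ai             # about 73.9, before jitter
```

In the removed code the final label was always forced to the document-level label, and the jittered confidence was clamped to the range 50 to 99.95, so the reported sentence results never disagreed with the overall verdict. After this PR, the controller calls `classify_text` with the text alone and this post-processing no longer runs.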

 import asyncio
 from io import BytesIO
 from fastapi import HTTPException, UploadFile, status, Depends
 from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
+import os
 from features.nepali_text_classifier.inferencer import classify_text
 from  features.nepali_text_classifier.preprocess import *
 import re
 security = HTTPBearer()
 def contains_english(text: str) -> bool:
     # Remove escape characters
     cleaned = text.replace("\n", "").replace("\t", "")
     return bool(re.search(r'[a-zA-Z]', cleaned))
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
+    expected_token = os.getenv("MY_SECRET_TOKEN")
     if token != expected_token:
         raise HTTPException(
             status_code=status.HTTP_403_FORBIDDEN,
         )
     return token
+async def nepali_text_analysis(text: str):
     end_symbol_for_NP_text(text)
     words = text.split()
     if len(words) < 10:
         raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
+    if len(text) > 10000:
+        raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
+    result = await asyncio.to_thread(classify_text, text)
     return result
     else:
         raise HTTPException(status_code=415,detail="Invalid file type. Only .docx,.pdf and .txt are allowed")
+async def handle_file_upload(file: UploadFile):
     try:
         file_contents = await extract_file_contents(file)
         end_symbol_for_NP_text(file_contents)
+        if len(file_contents) > 10000:
+            raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
             raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
+        result = await asyncio.to_thread(classify_text, cleaned_text)
         return result
     except Exception as e:
         logging.error(f"Error processing file: {e}")
+async def handle_sentence_level_analysis(text: str):
     text = text.strip()
+    if len(text) > 10000:
+        raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
     end_symbol_for_NP_text(text)
     # Split text into sentences
     sentences = [s.strip() + "।" for s in text.split("।") if s.strip()]
     results = []
     for sentence in sentences:
         end_symbol_for_NP_text(sentence)
+        result = await asyncio.to_thread(classify_text, sentence)
         results.append({
             "text": sentence,
+            "result": result["label"],
+            "likelihood": result["confidence"]
         })
     return {"analysis": results}
+async def handle_file_sentence(file:UploadFile):
     try:
         file_contents = await extract_file_contents(file)
+        if len(file_contents) > 10000:
+            raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
         # Split text into sentences
         sentences = [s.strip() + "।" for s in cleaned_text.split("।") if s.strip()]
         results = []
         for sentence in sentences:
             end_symbol_for_NP_text(sentence)
+            result = await asyncio.to_thread(classify_text, sentence)
             results.append({
                 "text": sentence,
+                "result": result["label"],
+                "likelihood": result["confidence"]
             })
         return {"analysis": results}
         raise HTTPException(status_code=500, detail="Error processing the file")
+def classify(text: str):
+    return classify_text(text)
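Review note: both sentence-level handlers above split only on the danda (`।`) and re-append it to every non-empty piece, so a trailing fragment without a danda gains one, and `?` or `!` never end a sentence. The sketch below isolates that step so it can be tested without the model. `split_nepali_sentences` and `MAX_CHARS` are illustrative names, not part of the PR, and `ValueError` stands in for the `HTTPException` the handlers raise.

```python
DANDA = "।"  # Devanagari full stop, the only terminator the handlers split on
MAX_CHARS = 10_000  # mirrors the `len(text) > 10000` guard in the diff


def split_nepali_sentences(text: str) -> list[str]:
    """Split on the danda and re-attach it to every non-empty piece."""
    if len(text) > MAX_CHARS:
        # The handlers raise HTTP 413 here; a plain exception keeps the sketch stdlib-only.
        raise ValueError("Text must be at most 10,000 characters")
    return [part.strip() + DANDA for part in text.split(DANDA) if part.strip()]


if __name__ == "__main__":
    for sentence in split_nepali_sentences("म घर जान्छु। तिमी कहाँ छौ।  "):
        print(sentence)
```

The guard accepts a text of exactly 10,000 characters, so "at most" describes the check more precisely than the "less than 10,000" wording in the response detail.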

features/nepali_text_classifier/inferencer.py CHANGED

@@ -1,89 +1,23 @@
-import re
-from scipy.sparse import csr_matrix, hstack
-from .model_loader import get_default_top_models, load_artifacts
-TOP_K_MODELS = 1
-def normalize_nepali_text(text: str) -> str:
-    text = str(text)
-    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
-    text = re.sub(r"[^\u0900-\u097F\s।!?,]", " ", text)
-    return re.sub(r"\s+", " ", text).strip()
-def _select_models(models, model_names=None, top_k=2):
-    _ = model_names
-    ranked = [name for name in get_default_top_models(top_k=top_k) if name in models]
-    if ranked:
-        return ranked[:top_k]
-    return list(models.keys())[:top_k]
-def classify_text(text: str, model_names="Logistic Regression", top_k: int = 1):
-    artifacts = load_artifacts()
-    models = artifacts["models"]
-    if not models:
-        return {"error": "No models available for inference"}
-    cleaned_text = normalize_nepali_text(text)
-    word_features = artifacts["word_vectorizer"].transform([cleaned_text])
-    char_features = artifacts["char_vectorizer"].transform([cleaned_text])
-    rich_features = artifacts["rich_transformer"].transform([cleaned_text])
-    features = hstack([word_features, char_features, csr_matrix(rich_features)])
-    selected_names = _select_models(models, model_names=model_names, top_k=TOP_K_MODELS)
-    dense_models = {"Linear SVC"}
-    per_model = []
-    ai_votes = 0
-    human_votes = 0
-    confidence_sum = 0.0
-    for name in selected_names:
-        model = models[name]
-        model_input = features.toarray() if name in dense_models else features
-        pred = int(model.predict(model_input)[0])
-        confidence = None
-        if hasattr(model, "predict_proba"):
-            probs = model.predict_proba(model_input)
-            confidence = float(probs[0][pred])
-        elif hasattr(model, "decision_function"):
-            score = float(model.decision_function(model_input)[0])
-            confidence = abs(score) / (1.0 + abs(score))
-        else:
-            confidence = 0.5
-        if pred == 1:
-            ai_votes += 1
-            label = "AI"
-        else:
-            human_votes += 1
-            label = "Human"
-        confidence_sum += confidence
-        per_model.append(
-            {
-                "model": name,
-                "label": label,
-                "confidence": round(confidence * 100, 2),
-            }
-        )
-    final_label = "AI" if ai_votes > human_votes else "Human"
-    if ai_votes == human_votes:
-        final_label = per_model[0]["label"]
-    avg_conf = confidence_sum / max(len(per_model), 1)
-    return {
-        "label": final_label,
-        "confidence": round(avg_conf * 100, 2),
-        # "selected_models": selected_names,
-        # "model_predictions": per_model,
-        # "votes": {"AI": ai_votes, "Human": human_votes},
-        # "available_models": list(models.keys()),
-        # "unavailable_models": artifacts["unavailable_models"],
-    }

+import torch
+from .model_loader import get_model_tokenizer
+import torch.nn.functional as F
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+def classify_text(text: str):
+    model, tokenizer = get_model_tokenizer()
+    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
+    inputs = {k: v.to(device) for k, v in inputs.items()}
+    with torch.no_grad():
+        outputs = model(**inputs)
+        logits = outputs if isinstance(outputs, torch.Tensor) else outputs.logits
+        probs = F.softmax(logits, dim=1)
+        pred = torch.argmax(probs, dim=1).item()
+        prob_percent = probs[0][pred].item() * 100
+    return {"label": "Human" if pred == 0 else "AI", "confidence": round(prob_percent, 2)}
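Review note: the new `classify_text` reduces the two-class logits to a label and a percentage with softmax and argmax. A dependency-free sketch of that mapping is shown below, using plain lists in place of tensors. `softmax` and `to_result` are illustrative helpers, not functions from the PR; the index-to-label convention (0 is "Human", anything else is "AI") is taken from the added return statement.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_result(logits: list[float]) -> dict:
    """Map two-class logits to the {"label", "confidence"} shape used in the diff."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return {
        "label": "Human" if pred == 0 else "AI",
        "confidence": round(probs[pred] * 100, 2),
    }


if __name__ == "__main__":
    print(to_result([2.0, -1.0]))
```

Because the reported confidence is the probability of the winning class, it can never fall below 50 for a two-class model.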

features/nepali_text_classifier/model_loader.py CHANGED

@@ -1,237 +1,54 @@
-import logging
-import pickle
-import re
 import shutil
-from functools import lru_cache
-from pathlib import Path
-import numpy as np
-import pandas as pd
 from huggingface_hub import snapshot_download
-from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
-from config import Config
-LOGGER = logging.getLogger(__name__)
-MODEL_FILES = {
-    "Logistic Regression": "Logistic_Regression.pkl",
-    "Random Forest": "Random_Forest.pkl",
-    # "Gradient Boosting": "Gradient_Boosting.pkl",
-    "Linear SVC": "Linear_SVC.pkl",
-    "Ridge Classifier": "Ridge_Classifier.pkl",
-    "Multinomial NB": "Multinomial_NB.pkl",
-    "Bernoulli NB": "Bernoulli_NB.pkl",
-}
-SKIP_MODELS = set()
-REPO_ID = Config.REPO_ID_LANG
-HF_TOKEN = Config.HF_TOKEN
-NEPALI_SUBDIR = "Nepali_model"
-REQUIRED_BASE_FILES = ("word_vectorizer.pkl", "char_vectorizer.pkl")
-# Ranked by validation accuracy from final_model/final_results.csv
-DEFAULT_MODEL_RANKING = [
-    "Gradient Boosting",
-    "Logistic Regression",
-    "Linear SVC",
-    "Ridge Classifier",
-    "Bernoulli NB",
-    "Random Forest",
-    "Multinomial NB",
-]
-def _patch_legacy_logistic_model(model):
-    """Backfill attributes expected by newer sklearn versions."""
-    if isinstance(model, (LogisticRegression, LogisticRegressionCV)) and not hasattr(
-        model, "multi_class"
-    ):
-        model.multi_class = "auto"
-    return model
-class NepaliRichFeatures:
-    """Burstiness + stylometry feature extractor used during model training."""
-    @staticmethod
-    def extract_burstiness(text: str) -> dict:
-        sentences = [s.strip() for s in re.split(r"[।!?]", str(text)) if s.strip()]
-        if not sentences:
-            return {
-                "burst_mean": 0.0,
-                "burst_std": 0.0,
-                "burst_max": 0.0,
-                "burst_min": 0.0,
-                "burst_range": 0.0,
-            }
-        lengths = [len(s.split()) for s in sentences]
-        return {
-            "burst_mean": float(np.mean(lengths)),
-            "burst_std": float(np.std(lengths)),
-            "burst_max": float(np.max(lengths)),
-            "burst_min": float(np.min(lengths)),
-            "burst_range": float(np.max(lengths) - np.min(lengths)),
-        }
-    @staticmethod
-    def extract_stylometry(text: str) -> dict:
-        words = str(text).split()
-        num_words = max(len(words), 1)
-        num_chars = max(len(str(text)), 1)
-        num_sentences = max(
-            len([s for s in re.split(r"[।!?]", str(text)) if s.strip()]), 1
-        )
-        avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
-        avg_sent_len = num_words / num_sentences
-        lexical_diversity = len(set(words)) / num_words
-        punct_count = (
-            str(text).count("।")
-            + str(text).count("?")
-            + str(text).count("!")
-            + str(text).count(",")
-        )
-        punct_ratio = punct_count / num_chars
-        bigrams = [" ".join(words[i : i + 2]) for i in range(len(words) - 1)]
-        rep_bigram_ratio = (
-            (1.0 - len(set(bigrams)) / max(len(bigrams), 1)) if bigrams else 0.0
-        )
-        diacritic_count = sum(1 for c in str(text) if "\u093e" <= c <= "\u094d")
-        diacritic_ratio = diacritic_count / num_chars
-        return {
-            "num_words": num_words,
-            "num_chars": num_chars,
-            "num_sentences": num_sentences,
-            "avg_word_len": avg_word_len,
-            "avg_sent_len": avg_sent_len,
-            "lexical_diversity": lexical_diversity,
-            "punct_ratio": punct_ratio,
-            "rep_bigram_ratio": rep_bigram_ratio,
-            "diacritic_ratio": diacritic_ratio,
-        }
-    def transform(self, texts):
-        if isinstance(texts, str):
-            texts = [texts]
-        rows = []
-        for text in texts:
-            row = {**self.extract_burstiness(text), **self.extract_stylometry(text)}
-            rows.append(row)
-        return pd.DataFrame(rows).values.astype(np.float32)
-def _repo_root() -> Path:
-    return Path(__file__).resolve().parents[2]
-def _has_required_artifacts(path: Path) -> bool:
-    if not path.exists() or not path.is_dir():
-        return False
-    has_base = all((path / filename).exists() for filename in REQUIRED_BASE_FILES)
-    has_any_model = any((path / filename).exists() for filename in MODEL_FILES.values())
-    return has_base and has_any_model
-def _candidate_model_dirs() -> list[Path]:
-    candidates = []
-    repo = _repo_root()
-    if Config.Nepali_model_folder:
-        custom = Path(Config.Nepali_model_folder)
-        candidates.extend([custom, custom / NEPALI_SUBDIR])
-    default_dir = repo / "features" / "Model" / "Nepali_model"
-    candidates.extend([default_dir, default_dir / NEPALI_SUBDIR])
-    candidates.append(
-        repo / "notebook" / "ai_vs_human_nepali" / "final_model" / "saved_models"
-    )
-    return candidates
-def _download_nepali_artifacts() -> None:
-    if not REPO_ID:
-        raise ValueError("English_model repo id is not configured")
-    repo = _repo_root()
-    target_dir = (
-        Path(Config.Nepali_model_folder)
-        if Config.Nepali_model_folder
-        else repo / "features" / "Model" / "Nepali_model"
-    )
-    snapshot_path = Path(snapshot_download(repo_id=REPO_ID, token=HF_TOKEN))
-    source_dir = (
-        snapshot_path / NEPALI_SUBDIR
-        if (snapshot_path / NEPALI_SUBDIR).is_dir()
-        else snapshot_path
-    )
-    target_dir.mkdir(parents=True, exist_ok=True)
-    shutil.copytree(source_dir, target_dir, dirs_exist_ok=True)
-def resolve_model_dir() -> Path:
-    for path in _candidate_model_dirs():
-        if _has_required_artifacts(path):
-            return path
-    LOGGER.info("Nepali artifacts not found locally; downloading from %s", REPO_ID)
-    _download_nepali_artifacts()
-    for path in _candidate_model_dirs():
-        if _has_required_artifacts(path):
-            return path
-    raise FileNotFoundError(
-        "Nepali model directory not found. Set Nepali_model env or add expected artifacts."
-    )
-@lru_cache(maxsize=1)
-def load_artifacts():
-    model_dir = resolve_model_dir()
-    LOGGER.info("Loading Nepali artifacts from %s", model_dir)
-    models = {}
-    unavailable = {}
-    for model_name, file_name in MODEL_FILES.items():
-        if model_name in SKIP_MODELS:
-            unavailable[model_name] = "Skipped due to large artifact size"
-            continue
-        file_path = model_dir / file_name
-        if not file_path.exists():
-            unavailable[model_name] = "Missing model file"
-            continue
-        with open(file_path, "rb") as fp:
-            models[model_name] = _patch_legacy_logistic_model(pickle.load(fp))
-    with open(model_dir / "word_vectorizer.pkl", "rb") as fp:
-        word_vectorizer = pickle.load(fp)
-    with open(model_dir / "char_vectorizer.pkl", "rb") as fp:
-        char_vectorizer = pickle.load(fp)
-    rich_transformer = NepaliRichFeatures()
-    return {
-        "model_dir": str(model_dir),
-        "models": models,
-        "unavailable_models": unavailable,
-        "word_vectorizer": word_vectorizer,
-        "char_vectorizer": char_vectorizer,
-        "rich_transformer": rich_transformer,
-    }
-def get_available_models():
-    artifacts = load_artifacts()
-    return list(artifacts["models"].keys())
-def get_default_top_models(top_k: int = 2):
-    available = set(get_available_models())
-    ranked = [name for name in DEFAULT_MODEL_RANKING if name in available]
-    if not ranked:
-        return list(available)[:top_k]
-    return ranked[: max(1, top_k)]

+import os
 import shutil
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import logging
 from huggingface_hub import snapshot_download
+from transformers import AutoTokenizer, AutoModel
+# Configs
+REPO_ID = "can-org/Nepali-AI-VS-HUMAN"
+BASE_DIR = "./np_text_model"
+TOKENIZER_DIR = os.path.join(BASE_DIR, "classifier")  # <- update this to match your uploaded folder
+WEIGHTS_PATH = os.path.join(BASE_DIR, "model_95_acc.pth")  # <- change to match actual uploaded weight
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Define model class
+class XLMRClassifier(nn.Module):
+    def __init__(self):
+        super(XLMRClassifier, self).__init__()
+        self.bert = AutoModel.from_pretrained("xlm-roberta-base")
+        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)
+    def forward(self, input_ids, attention_mask):
+        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
+        cls_output = outputs.last_hidden_state[:, 0, :]
+        return self.classifier(cls_output)
+# Globals for caching
+_model = None
+_tokenizer = None
+def download_model_repo():
+    if os.path.exists(BASE_DIR) and os.path.isdir(BASE_DIR):
+        logging.info("Model already downloaded.")
+        return
+    snapshot_path = snapshot_download(repo_id=REPO_ID)
+    os.makedirs(BASE_DIR, exist_ok=True)
+    shutil.copytree(snapshot_path, BASE_DIR, dirs_exist_ok=True)
+def load_model():
+    download_model_repo()
+    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
+    model = XLMRClassifier().to(device)
+    model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
+    model.eval()
+    return model, tokenizer
+def get_model_tokenizer():
+    global _model, _tokenizer
+    if _model is None or _tokenizer is None:
+        _model, _tokenizer = load_model()
+    return _model, _tokenizer
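Review note: `get_model_tokenizer` caches in module globals without a lock, while the controller calls `classify_text` through `asyncio.to_thread`. Two first requests arriving together could therefore both see `None` and load the weights twice. One possible guard is double-checked locking, sketched below. `_expensive_load` is a hypothetical stand-in for `load_model()`, and `get_cached_pair` is an illustrative name.

```python
import threading

_lock = threading.Lock()
_cached = None  # holds the (model, tokenizer) pair once loaded


def _expensive_load():
    """Hypothetical stand-in for load_model(); returns placeholder objects."""
    return object(), object()


def get_cached_pair(loader=_expensive_load):
    """Return the cached pair, running the loader at most once across threads."""
    global _cached
    if _cached is None:  # fast path: no locking once the pair exists
        with _lock:
            if _cached is None:  # re-check: another thread may have loaded meanwhile
                _cached = loader()
    return _cached
```

Loading the model once at application startup would avoid the problem as well; the lock is only one option.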

features/nepali_text_classifier/preprocess.py CHANGED

@@ -1,9 +1,9 @@
-# import fitz  # PyMuPDF
 import docx
 from io import BytesIO
 import logging
 from fastapi import HTTPException
-from pypdf import PdfReader
 def parse_docx(file: BytesIO):
     doc = docx.Document(file)
@@ -15,10 +15,11 @@ def parse_docx(file: BytesIO):
 def parse_pdf(file: BytesIO):
     try:
-        doc = PdfReader(file)
         text = ""
-        for page in doc.pages:
-            text += page.extract_text()
         return text
     except Exception as e:
         logging.error(f"Error while processing PDF: {str(e)}")

+import fitz  # PyMuPDF
 import docx
 from io import BytesIO
 import logging
 from fastapi import HTTPException
 def parse_docx(file: BytesIO):
     doc = docx.Document(file)
 def parse_pdf(file: BytesIO):
     try:
+        doc = fitz.open(stream=file, filetype="pdf")
         text = ""
+        for page_num in range(doc.page_count):
+            page = doc.load_page(page_num)
+            text += page.get_text()
         return text
     except Exception as e:
         logging.error(f"Error while processing PDF: {str(e)}")

features/nepali_text_classifier/routes.py CHANGED

@@ -15,42 +15,27 @@ security = HTTPBearer()
 # Input schema
 class TextInput(BaseModel):
     text: str
-    models: list[str] | None = None
 @router.post("/analyse")
 @limiter.limit(ACCESS_RATE)
 async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
-    selected = ",".join(data.models[:2]) if data.models else None
-    result = await nepali_text_analysis(data.text, selected)
     return result
 @router.post("/upload")
 @limiter.limit(ACCESS_RATE)
-async def upload_file(request:Request,file:UploadFile=File(...), models: str | None = None, token:str=Depends(security)):
-    return await handle_file_upload(file, models)
 @router.post("/analyse-sentences")
 @limiter.limit(ACCESS_RATE)
 async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
-    selected = ",".join(data.models[:2]) if data.models else None
-    return await  handle_sentence_level_analysis(data.text, selected)
 @router.post("/file-sentences-analyse")
 @limiter.limit(ACCESS_RATE)
-async def analyze_sentance_file(request: Request, file: UploadFile = File(...), models: str | None = None, token: str = Depends(security)):
-    return await handle_file_sentence(file, models)
-@router.get("/models")
-@limiter.limit(ACCESS_RATE)
-def get_models(request: Request):
-    from .model_loader import get_available_models, get_default_top_models
-    available = get_available_models()
-    return {
-        "available_models": available,
-        "default_top_2": get_default_top_models(2),
-    }
 @router.get("/health")

 # Input schema
 class TextInput(BaseModel):
     text: str
 @router.post("/analyse")
 @limiter.limit(ACCESS_RATE)
 async def analyse(request: Request, data: TextInput, token: str = Depends(security)):
+    result = classify_text(data.text)
     return result
 @router.post("/upload")
 @limiter.limit(ACCESS_RATE)
+async def upload_file(request:Request,file:UploadFile=File(...),token:str=Depends(security)):
+    return await handle_file_upload(file)
 @router.post("/analyse-sentences")
 @limiter.limit(ACCESS_RATE)
 async def upload_file(request:Request,data:TextInput,token:str=Depends(security)):
+    return await  handle_sentence_level_analysis(data.text)
 @router.post("/file-sentences-analyse")
 @limiter.limit(ACCESS_RATE)
+async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(security)):
+    return await handle_file_sentence(file)
 @router.get("/health")
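Review note: in the added routes, `/analyse` calls `classify_text(data.text)` directly inside an `async def` handler, whereas the controller functions wrap the same call in `asyncio.to_thread`. A direct call keeps the event loop busy for the whole inference. The sketch below shows the offloading pattern with the standard library only. `slow_classify` is a hypothetical blocking stand-in for `classify_text`, and the `chars` field exists only to make the demo observable.

```python
import asyncio
import time


def slow_classify(text: str) -> dict:
    """Hypothetical blocking stand-in for classify_text."""
    time.sleep(0.05)  # simulates model inference
    return {"label": "Human", "confidence": 50.0, "chars": len(text)}


async def analyse(text: str) -> dict:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(slow_classify, text)


async def main() -> list[dict]:
    # Both requests are in flight at once; gather preserves argument order.
    return await asyncio.gather(analyse("क"), analyse("कख"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Routing `/analyse` through `nepali_text_analysis` would also restore the 10-word minimum and 10,000-character checks that the direct call skips.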

features/rag_chatbot/__init__.py DELETED

File without changes

features/rag_chatbot/controller.py DELETED

@@ -1,178 +0,0 @@
-import asyncio
-import logging
-from typing import Dict, Any
-from fastapi import HTTPException, UploadFile, status, Depends
-from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
-from config import Config
-from .rag_pipeline import route_and_process_query, add_document_to_rag, check_system_health
-from .document_handler import extract_text_from_file
-# Configure logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-security = HTTPBearer()
-# Supported file types
-SUPPORTED_CONTENT_TYPES = Config.RAG_SUPPORTED_CONTENT_TYPES
-MAX_FILE_SIZE = Config.RAG_MAX_FILE_SIZE
-MAX_QUERY_LENGTH = Config.RAG_MAX_QUERY_LENGTH
-async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
-    """Verify Bearer token from Authorization header."""
-    token = credentials.credentials
-    expected_token = Config.SECRET_TOKEN
-    if not expected_token:
-        logger.error("MY_SECRET_TOKEN not configured")
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail="Server configuration error"
-        )
-    if token != expected_token:
-        logger.warning(f"Invalid token attempt: {token[:10]}...")
-        raise HTTPException(
-            status_code=status.HTTP_403_FORBIDDEN,
-            detail="Invalid or expired token"
-        )
-    return token
-async def handle_rag_query(query: str) -> Dict[str, Any]:
-    """Handle an incoming query by routing it and getting the appropriate answer."""
-    # Input validation
-    if not query or not query.strip():
-        raise HTTPException(
-            status_code=status.HTTP_400_BAD_REQUEST,
-            detail="Query cannot be empty"
-        )
-    if len(query) > MAX_QUERY_LENGTH:
-        raise HTTPException(
-            status_code=status.HTTP_400_BAD_REQUEST,
-            detail=f"Query too long. Please limit to {MAX_QUERY_LENGTH} characters."
-        )
-    try:
-        logger.info(f"Processing query: {query[:50]}...")
-        # Process query in thread pool
-        response = await asyncio.to_thread(route_and_process_query, query)
-        logger.info(f"Query processed successfully. Route: {response.get('route', 'Unknown')}")
-        return response
-    except Exception as e:
-        logger.error(f"Error processing query: {e}")
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail="Error processing your query. Please try again."
-        )
-async def handle_document_upload(file: UploadFile) -> Dict[str, str]:
-    """Handle uploading a document to the RAG's vector store."""
-    # File validation
-    if not file.filename:
-        raise HTTPException(
-            status_code=status.HTTP_400_BAD_REQUEST,
-            detail="No file provided"
-        )
-    if file.content_type not in SUPPORTED_CONTENT_TYPES:
-        raise HTTPException(
-            status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
-            detail=f"Unsupported file type: {file.content_type}. "
-                   f"Supported types: {', '.join(SUPPORTED_CONTENT_TYPES)}"
-        )
-    # Check file size
-    contents = await file.read()
-    if len(contents) > MAX_FILE_SIZE:
-        raise HTTPException(
-            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
-            detail=f"File too large. Maximum size: {MAX_FILE_SIZE / (1024*1024):.1f}MB"
-        )
-    # Reset file pointer
-    await file.seek(0)
-    try:
-        logger.info(f"Processing file upload: {file.filename}")
-        # Extract text from file
-        text = await extract_text_from_file(file)
-        if not text or not text.strip():
-            raise HTTPException(
-                status_code=status.HTTP_400_BAD_REQUEST,
-                detail="The file appears to be empty or could not be read."
-            )
-        if len(text) < 50:  # Too short to be meaningful
-            raise HTTPException(
-                status_code=status.HTTP_400_BAD_REQUEST,
-                detail="The extracted text is too short to be meaningful."
-            )
-        # Add to RAG system
-        success = await asyncio.to_thread(
-            add_document_to_rag,
-            text,
-            {
-                "source": file.filename,
-                "content_type": file.content_type,
-                "size": len(contents)
-            }
-        )
-        if not success:
-            raise HTTPException(
-                status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-                detail="Failed to add document to the knowledge base"
-            )
-        logger.info(f"Successfully processed file: {file.filename}")
-        return {
-            "message": f"Successfully uploaded and processed '{file.filename}'. "
-                      f"It is now available for querying.",
-            "filename": file.filename,
-            "text_length": len(text),
-            "content_type": file.content_type
-        }
-    except HTTPException:
-        raise
-    except Exception as e:
-        logger.error(f"Error processing file {file.filename}: {e}")
-        raise HTTPException(
-            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-            detail="Error processing the file. Please try again."
-        )
-async def handle_health_check() -> Dict[str, Any]:
-    """Handle health check requests."""
-    try:
-        health_status = await asyncio.to_thread(check_system_health)
-        if health_status["status"] == "unhealthy":
-            raise HTTPException(
-                status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
-                detail="Service is currently unhealthy"
-            )
-        return health_status
-    except HTTPException:
-        raise
-    except Exception as e:
-        logger.error(f"Health check failed: {e}")
-        raise HTTPException(
-            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
-            detail="Health check failed"
-        )

features/rag_chatbot/document_handler.py DELETED

@@ -1,37 +0,0 @@
-from io import BytesIO
-from fastapi import UploadFile, HTTPException
-import PyPDF2
-import docx
-async def extract_text_from_file(file: UploadFile) -> str:
-    """Extracts text from various file types."""
-    content = await file.read()
-    file_stream = BytesIO(content)
-    if file.content_type == "application/pdf":
-        return extract_text_from_pdf(file_stream)
-    elif file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
-        return extract_text_from_docx(file_stream)
-    elif file.content_type == "text/plain":
-        return file_stream.read().decode("utf-8")
-    else:
-        raise HTTPException(
-            status_code=415,
-            detail="Unsupported file type. Please upload a .pdf, .docx, or .txt file."
-        )
-def extract_text_from_pdf(file_stream: BytesIO) -> str:
-    """Extracts text from a PDF file."""
-    reader = PyPDF2.PdfReader(file_stream)
-    text = ""
-    for page in reader.pages:
-        text += page.extract_text() or ""
-    return text
-def extract_text_from_docx(file_stream: BytesIO) -> str:
-    """Extracts text from a DOCX file."""
-    doc = docx.Document(file_stream)
-    text = ""
-    for para in doc.paragraphs:
-        text += para.text + "\n"
-    return text

features/rag_chatbot/rag_pipeline.py DELETED

@@ -1,329 +0,0 @@
-import os
-import chromadb
-from dotenv import load_dotenv
-from langchain_core.documents import Document
-from langchain.text_splitter import RecursiveCharacterTextSplitter
-from langchain_community.embeddings import HuggingFaceEmbeddings
-from langchain_community.llms import OpenAI
-from langchain.chains.question_answering import load_qa_chain
-from langchain_community.vectorstores import Chroma
-from langchain.chains import LLMChain
-from langchain.prompts import PromptTemplate
-from langchain.chat_models import ChatOpenAI
-from config import Config
-load_dotenv()
-# ChromaDB configuration
-CHROMA_HOST = Config.RAG_CHROMA_HOST
-CHROMA_PORT = Config.RAG_CHROMA_PORT
-COLLECTION_NAME = Config.RAG_COLLECTION_NAME
-# LLM Provider Configuration
-LLM_PROVIDER = Config.RAG_LLM_PROVIDER
-LLM_API_KEY = Config.RAG_LLM_API_KEY
-LLM_MODEL = Config.RAG_LLM_MODEL
-LLM_TEMPERATURE = Config.RAG_LLM_TEMPERATURE
-LLM_MAX_TOKENS = Config.RAG_LLM_MAX_TOKENS
-# Provider-specific configurations
-PROVIDER_CONFIGS = {
-    "openai": {
-        "api_base": "https://api.openai.com/v1",
-        "default_model": "gpt-3.5-turbo"
-    },
-    "groq": {
-        "api_base": "https://api.groq.com/openai/v1",
-        "default_model": "llama-3.3-70b-versatile"
-    },
-    "openrouter": {
-        "api_base": "https://openrouter.ai/api/v1",
-        "default_model": "mistralai/mistral-small-3.2-24b-instruct:free"
-    }
-}
-vector_store = None
-company_qa_chain = None
-query_router_chain = None
-cybersecurity_chain = None
-llm = None
-def get_llm_config():
-    """Get the appropriate LLM configuration based on the provider."""
-    if LLM_PROVIDER not in PROVIDER_CONFIGS:
-        raise ValueError(f"Unsupported LLM provider: {LLM_PROVIDER}. Supported: {list(PROVIDER_CONFIGS.keys())}")
-    config = PROVIDER_CONFIGS[LLM_PROVIDER].copy()
-    # Use provided model or fall back to default
-    model = LLM_MODEL if LLM_MODEL != "gpt-3.5-turbo" else config["default_model"]
-    return {
-        "model": model,
-        "openai_api_key": LLM_API_KEY,
-        "openai_api_base": config["api_base"],
-        "temperature": LLM_TEMPERATURE,
-        "max_tokens": LLM_MAX_TOKENS,
-    }
-def initialize_llm():
-    """Initialize the LLM based on the configured provider."""
-    if not LLM_API_KEY:
-        raise ValueError(f"LLM_API_KEY environment variable is required for {LLM_PROVIDER}")
-    config = get_llm_config()
-    print(f"Initializing {LLM_PROVIDER.upper()} with model: {config['model']}")
-    return ChatOpenAI(**config)
-def initialize_pipelines():
-    """Initializes all required models, chains, and the vector store."""
-    global vector_store, company_qa_chain, query_router_chain, cybersecurity_chain, llm
-    try:
-        # Initialize LLM
-        llm = initialize_llm()
-        # Initialize embeddings
-        embeddings = HuggingFaceEmbeddings(
-            model_name="all-MiniLM-L6-v2",
-            model_kwargs={'device': 'cpu'},
-            encode_kwargs={'normalize_embeddings': True}
-        )
-        # Initialize ChromaDB client
-        try:
-            chroma_client = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)
-            chroma_client.heartbeat()
-        except Exception as e:
-            raise ConnectionError("Failed to connect to ChromaDB.") from e
-        # Initialize vector store
-        vector_store = Chroma(
-            client=chroma_client,
-            collection_name=COLLECTION_NAME,
-            embedding_function=embeddings,
-        )
-        # Query Router Chain
-        router_template = """You are a query classifier. Classify the following query into one of these categories:
-- COMPANY: Questions about our company, its products, services, or general information
-- CYBERSECURITY: Questions about cybersecurity, security threats, best practices, or vulnerabilities
-- OFF_TOPIC: Questions that don't fit the above categories
-Query: {query}
-Respond with only the category name (COMPANY, CYBERSECURITY, or OFF_TOPIC):"""
-        router_prompt = PromptTemplate(
-            input_variables=["query"],
-            template=router_template
-        )
-        query_router_chain = LLMChain(
-            llm=llm,
-            prompt=router_prompt
-        )
-        # Custom Company QA Chain
-        company_qa_template = """You are a helpful assistant for CyberAlertNepal. Answer the following question about our company using the information provided and links if only available. Give a natural, direct and polite response.
-Question: {question}
-Information:
-{context}
-Answer:"""
-        company_qa_prompt = PromptTemplate(
-            input_variables=["question", "context"],
-            template=company_qa_template
-        )
-        company_qa_chain = LLMChain(
-            llm=llm,
-            prompt=company_qa_prompt
-        )
-        # Cybersecurity Chain
-        cybersecurity_template = """You are a cybersecurity professional. Answer the following question truthfully and concisely.
-If you are not 100% sure about the answer, simply respond with: "I am not sure about the answer."
-Do not add extra explanations or assumptions. Do not provide false or speculative information.
-Question: {question}
-Provide a comprehensive and accurate answer about cybersecurity:"""
-        cybersecurity_prompt = PromptTemplate(
-            input_variables=["question"],
-            template=cybersecurity_template
-        )
-        cybersecurity_chain = LLMChain(
-            llm=llm,
-            prompt=cybersecurity_prompt
-        )
-        print(f"Successfully initialized pipelines with {LLM_PROVIDER.upper()}")
-    except Exception as e:
-        print(f"Error initializing pipelines: {e}")
-        raise
-def add_document_to_rag(text: str, metadata: dict):
-    """Splits a document and adds it to the ChromaDB index."""
-    global vector_store
-    if not vector_store:
-        initialize_pipelines()
-    try:
-        text_splitter = RecursiveCharacterTextSplitter(
-            chunk_size=1000,
-            chunk_overlap=200
-        )
-        docs = text_splitter.create_documents([text], metadatas=[metadata])
-        if not docs:
-            print("Document was empty after splitting, not adding to ChromaDB.")
-            return False
-        vector_store.add_documents(docs)
-        print("Successfully added documents.")
-        return True
-    except Exception as e:
-        print(f"Error adding document to RAG: {e}")
-        return False
-def route_and_process_query(query: str):
-    """Routes the query and processes it using the appropriate pipeline."""
-    global query_router_chain, vector_store, company_qa_chain, cybersecurity_chain
-    if not all([query_router_chain, vector_store, company_qa_chain, cybersecurity_chain]):
-        initialize_pipelines()
-    try:
-        # 1. Classify the query
-        route_result = query_router_chain.run(query)
-        route = route_result.strip().upper()
-        # 2. Route to appropriate logic
-        if "CYBERSECURITY" in route:
-            answer = cybersecurity_chain.run(question=query)
-            return {
-                "answer": answer,
-                "source": "Cybersecurity Knowledge Base",
-                "route": "CYBERSECURITY",
-                "provider": LLM_PROVIDER.upper(),
-                "model": get_llm_config()["model"]
-            }
-        elif "COMPANY" in route:
-            # Perform similarity search on ChromaDB
-            docs = vector_store.similarity_search(query, k=3)
-            if not docs:
-                return {
-                    "answer": "I could not find any relevant information to answer your question.",
-                    "source": "Company Documents",
-                    "route": "COMPANY",
-                    "provider": LLM_PROVIDER.upper(),
-                    "model": get_llm_config()["model"]
-                }
-            # Combine document content for context
-            context = "\n\n".join([doc.page_content for doc in docs])
-            # Run the custom QA chain
-            answer = company_qa_chain.run(question=query, context=context)
-            sources = list(set([doc.metadata.get("source", "Unknown") for doc in docs]))
-            return {
-                "answer": answer,
-                "source": "Company Documents",
-                "documents": sources,
-                "route": "COMPANY",
-                "provider": LLM_PROVIDER.upper(),
-                "model": get_llm_config()["model"]
-            }
-        else:  # OFF_TOPIC
-            return {
-                "answer": "I am a specialized assistant of CyberAlertNepal. I cannot answer questions outside of cybersecurity topics.",
-                "source": "N/A",
-                "route": "OFF_TOPIC",
-                "provider": LLM_PROVIDER.upper(),
-                "model": get_llm_config()["model"]
-            }
-    except Exception as e:
-        print(f"Error processing query: {e}")
-        return {
-            "answer": "I encountered an error while processing your query. Please try again.",
-            "source": "Error",
-            "route": None,
-            "documents": None,
-            "provider": LLM_PROVIDER.upper(),
-            "error": str(e)
-        }
-def check_system_health():
-    """Check if all components are properly initialized."""
-    try:
-        # Test ChromaDB connection
-        if vector_store:
-            vector_store._client.heartbeat()
-        # Test if all chains are initialized
-        components = {
-            "vector_store": vector_store is not None,
-            "company_qa_chain": company_qa_chain is not None,
-            "query_router_chain": query_router_chain is not None,
-            "cybersecurity_chain": cybersecurity_chain is not None,
-            "llm": llm is not None
-        }
-        return {
-            "status": "healthy" if all(components.values()) else "unhealthy",
-            "components": components,
-            "provider": LLM_PROVIDER.upper(),
-            "model": get_llm_config()["model"] if llm else "Not initialized"
-        }
-    except Exception as e:
-        return {
-            "status": "unhealthy",
-            "error": str(e),
-            "provider": LLM_PROVIDER.upper()
-        }
-def test_llm_connection():
-    """Test the LLM API connection."""
-    try:
-        if not llm:
-            initialize_pipelines()
-        # Simple test query
-        test_response = llm("Say 'Hello, LLM is working!'")
-        return {
-            "success": True,
-            "provider": LLM_PROVIDER.upper(),
-            "model": get_llm_config()["model"],
-            "response": str(test_response)
-        }
-    except Exception as e:
-        return {
-            "success": False,
-            "provider": LLM_PROVIDER.upper(),
-            "error": str(e)
-        }
-# Initialize pipelines on module import
-try:
-    initialize_pipelines()
-except Exception as e:
-    print(f"Failed to initialize pipelines on startup: {e}")
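
Review note: the removed `route_and_process_query` chose its branch with substring tests on the router LLM's reply, checking `CYBERSECURITY` before `COMPANY` and treating everything else as off-topic. The sketch below isolates that dispatch step so the behaviour being deleted is easy to see. `parse_route` and the `Route` alias are hypothetical names introduced for illustration; they are not part of this PR.

```python
from typing import Literal

# Hypothetical alias for the three categories named in the removed router prompt.
Route = Literal["CYBERSECURITY", "COMPANY", "OFF_TOPIC"]


def parse_route(raw_reply: str) -> Route:
    """Map a router reply to a category using the same substring checks
    as the removed code (strip, upper-case, then test in order)."""
    route = raw_reply.strip().upper()
    if "CYBERSECURITY" in route:
        return "CYBERSECURITY"
    if "COMPANY" in route:
        return "COMPANY"
    # Anything else, including an explicit OFF_TOPIC reply, falls through here.
    return "OFF_TOPIC"


if __name__ == "__main__":
    print(parse_route("  company\n"))
```

Because the checks are substring-based and ordered, a reply that mentions both categories resolves to `CYBERSECURITY`.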

features/rag_chatbot/routes.py DELETED

@@ -1,107 +0,0 @@
-from fastapi import APIRouter, Depends, HTTPException, UploadFile, File, Request
-from fastapi.security import HTTPBearer
-from pydantic import BaseModel, Field
-from slowapi.util import get_remote_address
-from slowapi import Limiter
-from typing import Optional
-from config import ACCESS_RATE, Config
-from .controller import (
-    handle_rag_query,
-    handle_document_upload,
-    handle_health_check,
-    verify_token,
-)
-limiter = Limiter(key_func=get_remote_address)
-router = APIRouter(prefix="/rag", tags=["RAG Chatbot"])
-security = HTTPBearer()
-class QueryInput(BaseModel):
-    query: str = Field(..., min_length=1, max_length=1000, description="The question to ask")
-class QueryResponse(BaseModel):
-    answer: str
-    source: str
-    route: Optional[str] = None
-    documents: Optional[list] = None
-    error: Optional[str] = None
-class UploadResponse(BaseModel):
-    message: str
-    filename: str
-    text_length: int
-    content_type: str
-class HealthResponse(BaseModel):
-    status: str
-    components: Optional[dict] = None
-    error: Optional[str] = None
-@router.post("/question", response_model=QueryResponse)
-@limiter.limit(ACCESS_RATE)
-async def ask_question(
-    request: Request,
-    data: QueryInput,
-    token: str = Depends(verify_token)
-) -> QueryResponse:
-    """
-    Ask a question to the RAG chatbot.
-    The chatbot can answer:
-    - Company-related questions (based on uploaded documents)
-    - Cybersecurity questions (from knowledge base)
-    """
-    response = await handle_rag_query(data.query)
-    return QueryResponse(**response)
-@router.post("/upload", response_model=UploadResponse)
-@limiter.limit(ACCESS_RATE)
-async def upload_document(
-    request: Request,
-    file: UploadFile = File(..., description="Document file (PDF, DOCX, or TXT)"),
-    token: str = Depends(verify_token)
-) -> UploadResponse:
-    """
-    Upload a document to the company knowledge base.
-    Supported formats:
-    - PDF (.pdf)
-    - Word documents (.docx)
-    - Plain text (.txt)
-    Maximum file size: 10MB
-    """
-    response = await handle_document_upload(file)
-    return UploadResponse(**response)
-@router.get("/health", response_model=HealthResponse)
-@limiter.limit(ACCESS_RATE)
-async def health_check(request: Request) -> HealthResponse:
-    """
-    Check the health status of the RAG system.
-    Returns the status of all components:
-    - ChromaDB connection
-    - Vector store
-    - AI chains
-    """
-    response = await handle_health_check()
-    return HealthResponse(**response)
-@router.get("/info")
-@limiter.limit(ACCESS_RATE)
-async def get_system_info(request: Request):
-    """Get information about the RAG system capabilities."""
-    return {
-        "name": "RAG Chatbot",
-        "version": "1.0.0",
-        "description": "A specialized chatbot for cybersecurity and company-related questions",
-        "capabilities": [
-            "Company document Q&A (based on uploaded documents)",
-            "Cybersecurity knowledge and best practices",
-            "Document upload and processing (PDF, DOCX, TXT)"
-        ],
-        "supported_file_types": sorted(Config.RAG_SUPPORTED_CONTENT_TYPES),
-        "max_file_size_mb": round(Config.RAG_MAX_FILE_SIZE / (1024 * 1024), 2),
-        "max_query_length": Config.RAG_MAX_QUERY_LENGTH
-    }

features/real_forged_classifier/__init__.py DELETED

@@ -1,9 +0,0 @@
-"""Package for real_forged_classifier feature.
-This module ensures package-relative imports work when importing
-`features.real_forged_classifier.*` from the application.
-"""
-__all__ = [
-    'controller', 'routes', 'preprocessor', 'inferencer', 'model_loader', 'model'
-]

features/real_forged_classifier/controller.py CHANGED

@@ -1,30 +1,6 @@
 from typing import IO
-import io
-import numpy as np
-from PIL import Image
-from fastapi import Depends, HTTPException, status
-from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
-import torch
-from torchvision import transforms
-from .preprocessor import preprocessor
-from .inferencer import interferencer
-from .model_loader import models
-from config import Config
-security = HTTPBearer()
-async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
-    token = credentials.credentials
-    expected_token = Config.SECRET_TOKEN
-    if token != expected_token:
-        raise HTTPException(
-            status_code=status.HTTP_403_FORBIDDEN,
-            detail="Invalid or expired token",
-        )
-    return token
 class ClassificationController:
     """
@@ -58,72 +34,3 @@ class ClassificationController:
 # Create a single instance of the controller
 controller = ClassificationController()
-class documentForger:
-    """
-    Document forgery detector that uses the ELA-trained EfficientNet model
-    when available (models.doc_model). Returns a dict with verdict and confidence.
-    """
-    def is_forged(self, document_file: IO) -> dict:
-        # Ensure a document model is loaded
-        if not hasattr(models, 'doc_model') or models.doc_model is None:
-            _downloadmodel = Config.DOCUMENT_FORGERY_MODEL_PATH
-            return {"detail": "Document forgery model not available."}
-        # Read file bytes
-        try:
-            data = document_file.read()
-            img = Image.open(io.BytesIO(data)).convert('RGB')
-        except Exception as e:
-            return {"detail": f"Could not open document image: {e}"}
-        # Compute ELA map (same approach as the notebook)
-        try:
-            buf = io.BytesIO()
-            img.save(buf, format='JPEG', quality=90)
-            buf.seek(0)
-            recompressed = Image.open(buf).convert('RGB')
-            ela_arr = np.abs(np.array(img, dtype=np.float32) - np.array(recompressed, dtype=np.float32))
-            p99 = np.percentile(ela_arr, 99)
-            if p99 > 0:
-                ela_arr = np.clip(ela_arr * (255.0 / p99), 0, 255).astype(np.uint8)
-            else:
-                ela_arr = ela_arr.astype(np.uint8)
-            ela_pil = Image.fromarray(ela_arr, mode='RGB')
-        except Exception as e:
-            return {"detail": f"Failed to compute ELA: {e}"}
-        # Transform and run through model
-        transform = transforms.Compose([
-            transforms.Resize((224, 224)),
-            transforms.ToTensor(),
-            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
-        ])
-        tensor = transform(ela_pil).unsqueeze(0).to(models.device)
-        with torch.no_grad():
-            logits = models.doc_model(tensor)
-            probs = torch.softmax(logits, dim=1)[0, 1].item()
-        # Interpret confidence using configurable thresholds (values in 0..1)
-        low = getattr(Config, 'DOCUMENT_FORGERY_POSSIBLE_LOW', 0.40)
-        high = getattr(Config, 'DOCUMENT_FORGERY_FORGED_LOW', 0.55)
-        if probs < low:
-            verdict = 'LIKELY AUTHENTIC'
-        elif probs < high:
-            verdict = 'POSSIBLY FORGED'
-        else:
-            verdict = 'LIKELY FORGED'
-        return {
-            "verdict": verdict,
-            "confidence": float(probs),
-            "confidence_pct": round(float(probs) * 100, 2),
-        }
-# Create a single instance of the document forger
-document_forger = documentForger()

 from typing import IO
+from preprocessor import preprocessor
+from inferencer import interferencer
 class ClassificationController:
     """
 # Create a single instance of the controller
 controller = ClassificationController()
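
Review note: the removed `documentForger.is_forged` turned the model's forged-class probability into a three-way verdict using two thresholds (defaults 0.40 and 0.55, overridable through `Config`). A minimal sketch of just that mapping follows, assuming a hypothetical standalone function `forgery_verdict`; the model call and ELA preprocessing are left out.

```python
def forgery_verdict(prob_forged: float, low: float = 0.40, high: float = 0.55) -> dict:
    """Bucket a forged-class probability (0..1) the way the removed code did.

    `low` and `high` default to the fallback values used in the removed
    getattr(Config, ...) lookups.
    """
    if prob_forged < low:
        verdict = "LIKELY AUTHENTIC"
    elif prob_forged < high:
        verdict = "POSSIBLY FORGED"
    else:
        verdict = "LIKELY FORGED"
    return {
        "verdict": verdict,
        "confidence": float(prob_forged),
        "confidence_pct": round(float(prob_forged) * 100, 2),
    }


if __name__ == "__main__":
    print(forgery_verdict(0.47))
```

Both thresholds are inclusive on the upper bucket: a probability exactly equal to `low` is reported as possibly forged, and one equal to `high` as likely forged.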

features/real_forged_classifier/inferencer.py CHANGED

@@ -3,7 +3,7 @@ import torch.nn.functional as F
 import numpy as np
 # Import the globally loaded models instance
-from .model_loader import models
 class Interferencer:
     """
@@ -26,10 +26,6 @@ class Interferencer:
         Returns:
             dict: A dictionary containing the classification label and confidence score.
         """
-        # 0. Ensure model is loaded
-        if self.fft_model is None:
-            return {"error": "FFT model not loaded."}
         # 1. Get model outputs (logits)
         outputs = self.fft_model(image_tensor)

 import numpy as np
 # Import the globally loaded models instance
+from model_loader import models
 class Interferencer:
     """
         Returns:
             dict: A dictionary containing the classification label and confidence score.
         """
         # 1. Get model outputs (logits)
         outputs = self.fft_model(image_tensor)

features/real_forged_classifier/main.py ADDED

@@ -0,0 +1,26 @@
+from fastapi import FastAPI
+from routes import router as api_router
+# Initialize the FastAPI app
+app = FastAPI(
+    title="Real vs. Fake Image Classification API",
+    description="An API to classify images as real or forged using FFT and a CNN.",
+    version="1.0.0"
+)
+# Include the API router
+# All routes defined in routes.py will be available under the /api prefix
+app.include_router(api_router, prefix="/api", tags=["Classification"])
+@app.get("/", tags=["Root"])
+async def read_root():
+    """
+    A simple root endpoint to confirm the API is running.
+    """
+    return {"message": "Welcome to the Image Classification API. Go to /docs for the API documentation."}
+# To run this application:
+# 1. Make sure you have all dependencies from requirements.txt installed.
+# 2. Make sure the model weights ('fft_cnn_model_78.pth') can be downloaded from the Hugging Face Hub.
+# 3. Run the following command in your terminal:
+#    uvicorn main:app --reload

features/real_forged_classifier/model_loader.py CHANGED

@@ -1,202 +1,60 @@
 from pathlib import Path
-from typing import Any
-import shutil
-from .model import FFTCNN # Import the FFT CNN architecture (package-relative)
-from config import Config
-try:
-    from huggingface_hub import hf_hub_download
-except Exception:
-    hf_hub_download = None
-# NOTE: EfficientNet/nn imports are done lazily when torch is available.
-ELAForgeryNet = None  # will be constructed dynamically when needed
-torch = None
-TORCH_AVAILABLE = False
 class ModelLoader:
-    """A class to load and hold PyTorch models used by this feature.
-    It loads:
-      - an FFT-based CNN (downloaded from Hugging Face Hub)
-      - an ELA-based document forgery detector (local .pth by default)
     """
-    def __init__(self, model_repo_id: str, model_filename: str, doc_model_path: str = None):
-        # Try to import torch once and expose module-level variables
-        global torch, TORCH_AVAILABLE
-        try:
-            import torch as _torch
-            torch = _torch
-            TORCH_AVAILABLE = True
-        except Exception:
-            torch = None
-            TORCH_AVAILABLE = False
-            print("[WARN] PyTorch not available; model loading will be skipped until torch is installed.")
-        if TORCH_AVAILABLE:
-            self.device = "cuda" if torch.cuda.is_available() else "cpu"
-        else:
-            self.device = "cpu"
-        print(f"Using device: {self.device} (torch available: {TORCH_AVAILABLE})")
-        # Load FFT CNN from HF Hub
-        self.fft_model = None
-        if TORCH_AVAILABLE:
-            try:
-                self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
-                print("FFT CNN model loaded successfully from Hub.")
-            except Exception:
-                # Try local fallback path (if provided in config)
-                self.fft_model = None
-                local_path = Path(getattr(Config, 'REAL_FORGED_MODEL_LOCAL_PATH', ''))
-                if local_path and local_path.exists():
-                    try:
-                        print(f"Attempting to load FFT model from local path: {local_path}")
-                        model = FFTCNN()
-                        state = torch.load(str(local_path), map_location=torch.device(self.device))
-                        state_dict = state.get('state_dict', state) if isinstance(state, dict) else state
-                        model.load_state_dict(state_dict, strict=False)
-                        model.to(self.device)
-                        model.eval()
-                        self.fft_model = model
-                        print("FFT CNN model loaded successfully from local path.")
-                    except Exception as e:
-                        print(f"Failed to load local FFT model: {e}")
-                else:
-                    print("No local FFT model path configured or file missing; FFT model not loaded.")
-        else:
-            print("Skipping FFT model load because PyTorch is not installed.")
-        # Load document forgery model (ELA CNN), downloading the checkpoint if needed.
-        self.doc_model = None
-        if doc_model_path is None:
-            doc_model_path = Config.DOCUMENT_FORGERY_MODEL_PATH
-        self.doc_model = None
-        if TORCH_AVAILABLE:
-            try:
-                self.doc_model = self._load_document_forgery_model(Path(doc_model_path))
-                if self.doc_model is not None:
-                    print("Document forgery (ELA) model loaded successfully.")
-            except Exception as e:
-                print(f"Warning: failed to load document forgery model: {e}")
-        else:
-            print("Skipping document forgery model load because PyTorch is not installed.")
     def _load_fft_model(self, repo_id: str, filename: str):
-        """Downloads and loads the FFT CNN model from a Hugging Face Hub repository."""
-        print(f"Attempting to download FFT CNN model from Hugging Face repo: {repo_id}")
-        try:
-            from huggingface_hub import hf_hub_download
-        except Exception as e:
-            raise RuntimeError(f"huggingface_hub not available: {e}")
         try:
-            model_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
             print(f"Model downloaded to: {model_path}")
             model = FFTCNN()
             model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
             model.to(self.device)
             model.eval()
             return model
         except Exception as e:
-            print(f"Error downloading or loading FFT model from Hugging Face: {e}")
             raise
-    def _load_document_forgery_model(self, path: Path):
-        """Load the ELA-based document forgery model from a local .pth checkpoint.
-        Returns the model instance or None if the file does not exist.
-        """
-        # If the configured path doesn't exist, try sensible fallbacks in the repo.
-        if not path.exists():
-            print(f"Document forgery model file not found at configured path: {path}")
-            # 1) Try the configured document forgery checkpoint path relative to repo root
-            repo_root = Path(__file__).resolve().parents[2]
-            candidate = repo_root / 'features' / 'Model' / 'document_forgery' / path.name
-            if candidate.exists():
-                path = candidate
-                print(f"Found document forgery model at fallback path: {path}")
-            else:
-                # 2) Search the repo for any file with the configured checkpoint name
-                print(f"Searching repository for '{path.name}'...")
-                matches = list(repo_root.rglob(path.name))
-                if matches:
-                    path = matches[0]
-                    print(f"Found document forgery model at: {path}")
-                else:
-                    try:
-                        path = self._download_document_forgery_model(path)
-                    except Exception as exc:
-                        print(f"Document forgery model not found in repository and download failed: {exc}")
-                        return None
-        print(f"Loading document forgery model from: {path}")
-        # Build the ELA model architecture lazily (requires torchvision & torch.nn)
-        try:
-            import torchvision.models as tv_models
-            import torch.nn as nn
-        except Exception as e:
-            raise RuntimeError(f"Required packages for ELA model not available: {e}")
-        backbone = tv_models.efficientnet_b0(weights='IMAGENET1K_V1')
-        in_features = backbone.classifier[1].in_features
-        backbone.classifier = nn.Sequential(
-            nn.Dropout(p=0.4),
-            nn.Linear(in_features, 256),
-            nn.ReLU(inplace=True),
-            nn.Dropout(p=0.2),
-            nn.Linear(256, 2),
-        )
-        model = backbone
-        state = torch.load(str(path), map_location=torch.device(self.device))
-        # The checkpoint might be either a state_dict or a full checkpoint dict
-        if isinstance(state, dict) and 'state_dict' in state:
-            state_dict = state['state_dict']
-        else:
-            state_dict = state
-        # Attempt to load state dict; allow strict=False to be tolerant to minor key name differences
-        model.load_state_dict(state_dict, strict=False)
-        model.to(self.device)
-        model.eval()
-        return model
-    def _download_document_forgery_model(self, target_path: Path) -> Path:
-        """Download the document forgery checkpoint into the configured local path."""
-        if hf_hub_download is None:
-            raise RuntimeError("huggingface_hub not available")
-        repo_id = getattr(Config, "DOCUMENT_FORGERY_MODEL_REPO_ID", Config.REAL_FORGED_MODEL_REPO_ID)
-        configured_name = getattr(Config, "DOCUMENT_FORGERY_MODEL_FILENAME", str(target_path))
-        candidate_filenames = []
-        for candidate in (configured_name, str(target_path), target_path.name):
-            if candidate and candidate not in candidate_filenames:
-                candidate_filenames.append(candidate)
-        last_error = None
-        for filename in candidate_filenames:
-            try:
-                print(f"Downloading document forgery model from Hugging Face repo: {repo_id} ({filename})")
-                downloaded_path = hf_hub_download(repo_id=repo_id, filename=filename, token=Config.HF_TOKEN)
-                target_path.parent.mkdir(parents=True, exist_ok=True)
-                shutil.copy2(downloaded_path, target_path)
-                print(f"Document forgery model downloaded to: {target_path}")
-                return target_path
-            except Exception as exc:
-                last_error = exc
-        raise RuntimeError(f"unable to download document forgery model: {last_error}")
 # --- Global Model Instance ---
-MODEL_REPO_ID = Config.REAL_FORGED_MODEL_REPO_ID
-MODEL_FILENAME = Config.REAL_FORGED_MODEL_FILENAME
-DOC_MODEL_PATH = Config.DOCUMENT_FORGERY_MODEL_PATH
-models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME, doc_model_path=DOC_MODEL_PATH)

+import torch
 from pathlib import Path
+from huggingface_hub import hf_hub_download
+from model import FFTCNN # Import the model architecture
 class ModelLoader:
     """
+    A class to load and hold the PyTorch CNN model.
+    """
+    def __init__(self, model_repo_id: str, model_filename: str):
+        """
+        Initializes the ModelLoader and loads the model.
+        Args:
+            model_repo_id (str): The repository ID on Hugging Face.
+            model_filename (str): The name of the model file (.pth) in the repository.
+        """
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        print(f"Using device: {self.device}")
+        self.fft_model = self._load_fft_model(repo_id=model_repo_id, filename=model_filename)
+        print("FFT CNN model loaded successfully.")
     def _load_fft_model(self, repo_id: str, filename: str):
+        """
+        Downloads and loads the FFT CNN model from a Hugging Face Hub repository.
+        Args:
+            repo_id (str): The repository ID on Hugging Face.
+            filename (str): The name of the model file (.pth) in the repository.
+        Returns:
+            The loaded PyTorch model object.
+        """
+        print(f"Downloading FFT CNN model from Hugging Face repo: {repo_id}")
         try:
+            # Download the model file from the Hub. It returns the cached path.
+            model_path = hf_hub_download(repo_id=repo_id, filename=filename)
             print(f"Model downloaded to: {model_path}")
+            # Initialize the model architecture
             model = FFTCNN()
+            # Load the saved weights (state_dict) into the model
             model.load_state_dict(torch.load(model_path, map_location=torch.device(self.device)))
+            # Set the model to evaluation mode
             model.to(self.device)
             model.eval()
             return model
         except Exception as e:
+            print(f"Error downloading or loading model from Hugging Face: {e}")
             raise
 # --- Global Model Instance ---
+MODEL_REPO_ID = 'rhnsa/real_forged_classifier'
+MODEL_FILENAME = 'fft_cnn_model_78.pth'
+models = ModelLoader(model_repo_id=MODEL_REPO_ID, model_filename=MODEL_FILENAME)
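
Review note: the removed loader accepted two checkpoint layouts, either a bare `state_dict` or a wrapper dict with a `state_dict` key, while the new `_load_fft_model` passes the result of `torch.load` straight to `load_state_dict` and so appears to assume the bare layout. The sketch below shows the unwrapping step on plain dictionaries; `extract_state_dict` is a hypothetical helper name, and no PyTorch is needed to follow it.

```python
from collections.abc import Mapping
from typing import Any


def extract_state_dict(checkpoint: Any) -> Any:
    """Return the weights mapping from either checkpoint layout.

    - wrapper layout: {"state_dict": {...}, "epoch": ...} -> the inner mapping
    - bare layout:    {...}                               -> returned unchanged
    """
    if isinstance(checkpoint, Mapping) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


if __name__ == "__main__":
    wrapped = {"state_dict": {"conv.weight": [0.1, 0.2]}, "epoch": 3}
    print(extract_state_dict(wrapped))
```

If `fft_cnn_model_78.pth` is known to be a bare `state_dict`, the simpler code in this PR is sufficient and the helper is unnecessary.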

features/real_forged_classifier/preprocessor.py CHANGED

@@ -6,7 +6,7 @@ import cv2
 from torchvision import transforms
 # Import the globally loaded models instance
-from .model_loader import models
 class ImagePreprocessor:
     """

 from torchvision import transforms
 # Import the globally loaded models instance
+from model_loader import models
 class ImagePreprocessor:
     """

features/real_forged_classifier/routes.py CHANGED

@@ -1,14 +1,14 @@
-from fastapi import APIRouter, File, UploadFile, HTTPException, status, Depends
 from fastapi.responses import JSONResponse
-# Import the controller instance and document forger
-from .controller import controller, document_forger, verify_token
 # Create an API router
 router = APIRouter()
 @router.post("/classify_forgery", summary="Classify an image as Real or Fake")
-async def classify_image_endpoint(image: UploadFile = File(...), token: str = Depends(verify_token)):
     """
     Accepts an image file and classifies it as 'real' or 'fake'.
@@ -35,23 +35,3 @@ async def classify_image_endpoint(image: UploadFile = File(...), token: str = De
     return JSONResponse(content=result, status_code=status.HTTP_200_OK)
-@router.post("/isforged", summary="Check if the document is forged")
-async def is_forged_endpoint(file: UploadFile = File(...), token: str = Depends(verify_token)):
-    """Run the document forgery detector on an uploaded image file.
-    Accepts image uploads (multipart/form-data) and returns a JSON verdict with confidence.
-    """
-    if not file.content_type.startswith("image/"):
-        raise HTTPException(
-            status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
-            detail="Unsupported file type. Please upload an image (e.g., JPEG, PNG)."
-        )
-    result = document_forger.is_forged(file.file)
-    if isinstance(result, dict) and (result.get("error") or result.get("detail")):
-        raise HTTPException(
-            status_code=status.HTTP_400_BAD_REQUEST,
-            detail=result.get("error") or result.get("detail"),
-        )
-    return JSONResponse(content=result, status_code=status.HTTP_200_OK)

+from fastapi import APIRouter, File, UploadFile, HTTPException, status
 from fastapi.responses import JSONResponse
+# Import the controller instance
+from controller import controller
 # Create an API router
 router = APIRouter()
 @router.post("/classify_forgery", summary="Classify an image as Real or Fake")
+async def classify_image_endpoint(image: UploadFile = File(...)):
     """
     Accepts an image file and classifies it as 'real' or 'fake'.
     return JSONResponse(content=result, status_code=status.HTTP_200_OK)
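
Review note: the new side of this file drops `Depends(verify_token)` from `/classify_forgery`, so the endpoint no longer requires a Bearer token. Where the check is kept elsewhere in the app, the removed implementation compared tokens with `!=`. A constant-time comparison is one possible hardening; the sketch below assumes a hypothetical `token_matches` helper and uses only the standard library.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens without short-circuiting on the first differing byte.

    hmac.compare_digest accepts bytes of any length, so both values are
    encoded to UTF-8 first.
    """
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    print(token_matches("SECRET_CODE_TOKEN", "SECRET_CODE_TOKEN"))
```

In a FastAPI dependency, a `False` result would map to the same 403 response the removed `verify_token` raised.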

features/text_classifier/controller.py CHANGED

@@ -1,76 +1,49 @@
 import asyncio
 import logging
 from io import BytesIO
-from fastapi import Depends, HTTPException, UploadFile, status
-from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
-from config import Config
-from .inferencer import analyze_text_with_sentences, classify_text
 from .preprocess import parse_docx, parse_pdf, parse_txt
 security = HTTPBearer()
-# def build_bias_summary(ai_likelihood: float) -> dict[str, object]:
-#     """Convert an AI likelihood score into a human-readable bias summary."""
-#     if ai_likelihood > 50:
-#         overall_bias = "AI"
-#         bias_statement = f"The text is biased toward AI-generated writing ({ai_likelihood}% AI likelihood)."
-#     elif ai_likelihood < 50:
-#         overall_bias = "Human"
-#         bias_statement = f"The text is biased toward human writing ({100 - ai_likelihood}% human likelihood)."
-#     else:
-#         overall_bias = "Balanced"
-#         bias_statement = "The text is balanced between AI and human writing."
-#     return {
-#         "overall_bias": overall_bias,
-#         "bias_statement": bias_statement,
-#     }
 # Verify Bearer token from Authorization header
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
-    expected_token = Config.SECRET_TOKEN
     if token != expected_token:
         raise HTTPException(
-            status_code=status.HTTP_403_FORBIDDEN, detail="Invalid or expired token"
         )
     return token
 # Classify plain text input
 async def handle_text_analysis(text: str):
     text = text.strip()
     if not text or len(text.split()) < 10:
-        raise HTTPException(
-            status_code=400, detail="Text must contain at least 10 words"
-        )
-    if len(text) > 50000:
-        raise HTTPException(
-            status_code=413, detail="Text must be less than 50,000 characters"
-        )
     label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
-    # bias_summary = build_bias_summary(ai_likelihood)
     return {
         "result": label,
         "perplexity": round(perplexity, 2),
-        "ai_likelihood": ai_likelihood,
     }
 # Extract text from uploaded files (.docx, .pdf, .txt)
 async def extract_file_contents(file: UploadFile) -> str:
     content = await file.read()
     file_stream = BytesIO(content)
-    if (
-        file.content_type
-        == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
-    ):
         return parse_docx(file_stream)
     elif file.content_type == "application/pdf":
         return parse_pdf(file_stream)
@@ -79,83 +52,76 @@ async def extract_file_contents(file: UploadFile) -> str:
     else:
         raise HTTPException(
             status_code=415,
-            detail="Invalid file type. Only .docx, .pdf and .txt are allowed.",
         )
 # Classify text from uploaded file
 async def handle_file_upload(file: UploadFile):
     try:
         file_contents = await extract_file_contents(file)
-        logging.info(f"Extracted text length: {len(file_contents)} characters")
-        if len(file_contents) > 50000:
-            return {
-                "status_code": 413,
-                "detail": "Text must be less than 50,000 characters",
-            }
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
-            raise HTTPException(
-                status_code=400,
-                detail="The uploaded file is empty or only contains whitespace.",
-            )
-        # print(f"Cleaned text: '{cleaned_text}'")  # Debugging statement
-        label, perplexity, ai_likelihood = await asyncio.to_thread(
-            classify_text, cleaned_text
-        )
         return {
             "content": file_contents,
             "result": label,
             "perplexity": round(perplexity, 2),
-            "ai_likelihood": ai_likelihood,
         }
     except Exception as e:
         logging.error(f"Error processing file: {e}")
         raise HTTPException(status_code=500, detail="Error processing the file")
 async def handle_sentence_level_analysis(text: str):
     text = text.strip()
-    if not text or len(text.split()) < 10:
-        raise HTTPException(
-            status_code=400, detail="Text must contain at least 10 words"
-        )
-    if len(text) > 50000:
-        raise HTTPException(
-            status_code=413, detail="Text must be less than 50,000 characters"
-        )
-    result = await asyncio.to_thread(analyze_text_with_sentences, text)
-    return result
-# Analyze each sentence from uploaded file
 async def handle_file_sentence(file: UploadFile):
     try:
         file_contents = await extract_file_contents(file)
-        if len(file_contents) > 50000:
-            # raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
-            return {
-                "status_code": 413,
-                "detail": "Text must be less than 50,000 characters",
-            }
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
-            raise HTTPException(
-                status_code=400,
-                detail="The uploaded file is empty or only contains whitespace.",
-            )
         result = await handle_sentence_level_analysis(cleaned_text)
-        return {"content": file_contents, **result}
-    except HTTPException:
-        raise
     except Exception as e:
         logging.error(f"Error processing file: {e}")
         raise HTTPException(status_code=500, detail="Error processing the file")
 def classify(text: str):
     return classify_text(text)

+import os
 import asyncio
 import logging
 from io import BytesIO
+from fastapi import HTTPException, UploadFile, status, Depends
+from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
+from .inferencer import classify_text
 from .preprocess import parse_docx, parse_pdf, parse_txt
+import spacy
 security = HTTPBearer()
+nlp = spacy.load("en_core_web_sm")
 # Verify Bearer token from Authorization header
 async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
     token = credentials.credentials
+    expected_token = os.getenv("MY_SECRET_TOKEN")
     if token != expected_token:
         raise HTTPException(
+            status_code=status.HTTP_403_FORBIDDEN,
+            detail="Invalid or expired token"
         )
     return token
 # Classify plain text input
 async def handle_text_analysis(text: str):
     text = text.strip()
     if not text or len(text.split()) < 10:
+        raise HTTPException(status_code=400, detail="Text must contain at least 10 words")
+    if len(text) > 10000:
+        raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
     label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, text)
     return {
         "result": label,
         "perplexity": round(perplexity, 2),
+        "ai_likelihood": ai_likelihood
     }
 # Extract text from uploaded files (.docx, .pdf, .txt)
 async def extract_file_contents(file: UploadFile) -> str:
     content = await file.read()
     file_stream = BytesIO(content)
+    if file.content_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
         return parse_docx(file_stream)
     elif file.content_type == "application/pdf":
         return parse_pdf(file_stream)
     else:
         raise HTTPException(
             status_code=415,
+            detail="Invalid file type. Only .docx, .pdf and .txt are allowed."
         )
 # Classify text from uploaded file
 async def handle_file_upload(file: UploadFile):
     try:
         file_contents = await extract_file_contents(file)
+        if len(file_contents) > 10000:
+            raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
+            raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
+        label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, cleaned_text)
         return {
             "content": file_contents,
             "result": label,
             "perplexity": round(perplexity, 2),
+            "ai_likelihood": ai_likelihood
         }
     except Exception as e:
         logging.error(f"Error processing file: {e}")
         raise HTTPException(status_code=500, detail="Error processing the file")
 async def handle_sentence_level_analysis(text: str):
     text = text.strip()
+    if not text.endswith("."):
+        text += "."
+    if len(text) > 10000:
+        raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
+    doc = nlp(text)
+    sentences = [sent.text.strip() for sent in doc.sents]
+    results = []
+    for sentence in sentences:
+        if not sentence:
+            continue
+        label, perplexity, ai_likelihood = await asyncio.to_thread(classify_text, sentence)
+        results.append({
+            "sentence": sentence,
+            "label": label,
+            "perplexity": round(perplexity, 2),
+            "ai_likelihood": ai_likelihood
+        })
+    return {"analysis": results}
+# Analyze each sentence from uploaded file
 async def handle_file_sentence(file: UploadFile):
     try:
         file_contents = await extract_file_contents(file)
+        if len(file_contents) > 10000:
+            raise HTTPException(status_code=413, detail="Text must be less than 10,000 characters")
         cleaned_text = file_contents.replace("\n", " ").replace("\t", " ").strip()
         if not cleaned_text:
+            raise HTTPException(status_code=404, detail="The file is empty or only contains whitespace.")
         result = await handle_sentence_level_analysis(cleaned_text)
+        return {
+            "content": file_contents,
+            **result
+        }
     except Exception as e:
         logging.error(f"Error processing file: {e}")
         raise HTTPException(status_code=500, detail="Error processing the file")
 def classify(text: str):
     return classify_text(text)
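
One thing worth checking in the new handlers: the 413 and 404 `HTTPException`s are raised inside a `try` whose only handler is `except Exception`, and the old `except HTTPException: raise` clause is gone, so those responses would likely surface as 500s. The sketch below reproduces that control flow with a stand-in exception class so it runs without FastAPI; `check_length`, `handler_masking` and `handler_passthrough` are hypothetical names, not functions from this PR.

```python
import logging


class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException (assumption: same status_code/detail fields)."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def check_length(text: str, limit: int = 10000) -> str:
    """Raise a 413 when the text exceeds the limit, as the handlers in the diff do."""
    if len(text) > limit:
        raise HTTPException(413, "Text must be less than 10,000 characters")
    return text


def handler_masking(text: str) -> str:
    """Same shape as the new handlers: the broad except also catches the 413."""
    try:
        return check_length(text)
    except Exception as e:
        logging.error("Error processing file: %s", e)
        raise HTTPException(500, "Error processing the file")


def handler_passthrough(text: str) -> str:
    """Re-raise deliberate HTTP errors before the catch-all handler."""
    try:
        return check_length(text)
    except HTTPException:
        raise
    except Exception as e:
        logging.error("Error processing file: %s", e)
        raise HTTPException(500, "Error processing the file")


def status_of(handler, text: str) -> int:
    """Return the status code a handler would produce for the given text."""
    try:
        handler(text)
        return 200
    except HTTPException as exc:
        return exc.status_code


too_long = "x" * 10001
masked_status = status_of(handler_masking, too_long)
passthrough_status = status_of(handler_passthrough, too_long)
```

With an over-long input, the first handler reports the generic server error while the second keeps the intended client error, which is the behaviour the removed clause used to provide in `handle_file_sentence`.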

features/text_classifier/inferencer.py CHANGED

@@ -1,272 +1,40 @@
-from __future__ import annotations
-from dataclasses import dataclass
-from functools import lru_cache
-import logging
-import random
-from typing import Any
-import nltk
-import numpy as np
-from scipy.sparse import csr_matrix, hstack
 import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from features.text_classifier.model_loader import load_model
-logger = logging.getLogger(__name__)
-for resource in ("tokenizers/punkt", "tokenizers/punkt_tab"):
-    try:
-        nltk.data.find(resource)
-    except LookupError:
-        nltk.download(resource.split("/")[-1], quiet=True)
-try:
-    import textstat
-except ImportError:
-    textstat = None
-@dataclass
-class SentenceBlendConfig:
-    sentence_blend_weight: float = 0.70
-    sentence_to_doc_bias: float = 0.35
-    max_sentence_blend_weight: float = 0.90
-    max_sentence_to_doc_bias: float = 0.80
-    random_deviation_pct: float = 2.0
-class PerplexityCalculator:
-    """Lazy-loaded perplexity calculator for distilgpt2."""
-    def __init__(self, model_name: str = "distilgpt2"):
-        self.model_name = model_name
-        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-        self._tokenizer = None
-        self._model = None
-    def _load(self) -> None:
-        if self._model is not None and self._tokenizer is not None:
-            return
-        logger.info("Loading perplexity model: %s", self.model_name)
-        self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
-        self._model = AutoModelForCausalLM.from_pretrained(self.model_name).to(self.device)
-        self._model.eval()
-        logger.info("Perplexity model loaded on %s", self.device)
-    def calculate(self, text: str, max_length: int = 512) -> float:
-        try:
-            self._load()
-            encodings = self._tokenizer(
-                text,
-                return_tensors="pt",
-                truncation=True,
-                max_length=max_length,
-            )
-            input_ids = encodings.input_ids.to(self.device)
-            with torch.no_grad():
-                outputs = self._model(input_ids, labels=input_ids)
-                loss = outputs.loss
-                perplexity = torch.exp(loss).item()
-            return min(float(perplexity), 10000.0)
-        except Exception as exc:
-            logger.warning("Perplexity fallback used due to error: %s", exc)
-            return 100.0
-_perplexity_calc = PerplexityCalculator()
-@lru_cache(maxsize=20000)
-def _cached_perplexity(cleaned_text: str) -> float:
-    return _perplexity_calc.calculate(cleaned_text)
-@lru_cache(maxsize=1)
-def _get_model_artifacts() -> tuple[Any, Any, Any, Any, list[str], dict[str, Any]]:
-    return load_model()
-def normalize_text(text: str) -> str:
-    return " ".join(str(text).split()).strip()
-def split_into_sentences(text: str) -> list[str]:
-    cleaned = normalize_text(text)
-    if not cleaned:
-        return []
-    sentences = [s.strip() for s in nltk.sent_tokenize(cleaned) if s.strip()]
-    return sentences if sentences else [cleaned]
-def extract_burstiness_features(text: str) -> dict[str, float]:
-    sentences = split_into_sentences(text)
-    if not sentences:
-        return {
-            "burst_mean": 0.0,
-            "burst_std": 0.0,
-            "burst_max": 0.0,
-            "burst_min": 0.0,
-            "burst_range": 0.0,
-        }
-    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
-    return {
-        "burst_mean": float(np.mean(lengths)),
-        "burst_std": float(np.std(lengths)),
-        "burst_max": float(np.max(lengths)),
-        "burst_min": float(np.min(lengths)),
-        "burst_range": float(np.max(lengths) - np.min(lengths)),
-    }
-def extract_stylometry_features(text: str) -> dict[str, float]:
-    words = text.split()
-    num_words = len(words)
-    num_chars = len(text)
-    num_sentences = max(len(split_into_sentences(text)), 1)
-    avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
-    avg_sent_len = float(num_words / num_sentences)
-    unique_words = len(set(words))
-    lexical_diversity = float(unique_words / num_words) if num_words > 0 else 0.0
-    num_punct = sum(1 for c in text if c in ".,!?;:")
-    punct_ratio = float(num_punct / num_chars) if num_chars > 0 else 0.0
-    num_caps = sum(1 for c in text if c.isupper())
-    caps_ratio = float(num_caps / num_chars) if num_chars > 0 else 0.0
-    if textstat is not None:
-        try:
-            flesch_reading = float(textstat.flesch_reading_ease(text))
-            flesch_grade = float(textstat.flesch_kincaid_grade(text))
-        except Exception:
-            flesch_reading = 50.0
-            flesch_grade = 8.0
-    else:
-        flesch_reading = 50.0
-        flesch_grade = 8.0
-    return {
-        "num_words": float(num_words),
-        "num_chars": float(num_chars),
-        "num_sentences": float(num_sentences),
-        "avg_word_len": avg_word_len,
-        "avg_sent_len": avg_sent_len,
-        "lexical_diversity": lexical_diversity,
-        "punct_ratio": punct_ratio,
-        "caps_ratio": caps_ratio,
-        "flesch_reading": flesch_reading,
-        "flesch_grade": flesch_grade,
-    }
-def extract_all_features(text: str, calc_perplexity: bool = True) -> dict[str, float]:
-    cleaned = normalize_text(text)
-    features: dict[str, float] = {}
-    if calc_perplexity:
-        features["perplexity"] = _cached_perplexity(cleaned)
-    else:
-        features["perplexity"] = 100.0
-    features.update(extract_burstiness_features(cleaned))
-    features.update(extract_stylometry_features(cleaned))
-    return features
-def _predict_ai_probability(text: str) -> tuple[float, float]:
-    (
-        loaded_classifier,
-        loaded_scaler,
-        loaded_word_vectorizer,
-        loaded_char_vectorizer,
-        loaded_features,
-        loaded_metadata,
-    ) = _get_model_artifacts()
-    calc_perplexity = bool(loaded_metadata.get("num_engineered_features", 0) > 0)
-    features = extract_all_features(text, calc_perplexity=calc_perplexity)
-    feature_vector = np.array([features[name] for name in loaded_features], dtype=float).reshape(1, -1)
-    feature_scaled = loaded_scaler.transform(feature_vector)
-    word_vec = loaded_word_vectorizer.transform([text])
-    char_vec = loaded_char_vectorizer.transform([text])
-    num_vec = csr_matrix(feature_scaled)
-    hybrid_vec = hstack([word_vec, char_vec, num_vec], format="csr")
-    if hasattr(loaded_classifier, "predict_proba"):
-        proba = loaded_classifier.predict_proba(hybrid_vec)[0]
-        ai_prob = float(proba[1])
     else:
-        score = float(loaded_classifier.decision_function(hybrid_vec)[0])
-        ai_prob = float(1.0 / (1.0 + np.exp(-score)))
-    perplexity = float(features.get("perplexity", 100.0))
-    return ai_prob, perplexity
-def classify_text(text: str) -> tuple[str, float, float]:
-    """Return (label, perplexity, ai_likelihood_percent)."""
-    cleaned = normalize_text(text)
-    if not cleaned:
-        raise ValueError("Input text is empty")
-    ai_prob, perplexity = _predict_ai_probability(cleaned)
-    ai_likelihood = round(ai_prob * 100.0, 2)
-    label = "AI" if ai_likelihood >= 50.0 else "Human"
-    return label, perplexity, ai_likelihood
-def analyze_text_with_sentences(
-    text: str,
-) -> dict[str, Any]:
-    text = normalize_text(text)
-    overall_classification, overall_perplexity, overall_ai_likelihood = classify_text(text)
-    sentences = split_into_sentences(text)
-    if not sentences:
-        raise ValueError("Input text contains no valid sentences")
-    #  do the per-sentence analysis
-    sentence_results = []
-    for sentence in sentences:
-        try:
-            label, perplexity, ai_likelihood = classify_text(sentence)
-            sentence_results.append(
-                {
-                    "sentence": sentence,
-                    "label": label,
-                    "perplexity": perplexity,
-                    "ai_likelihood": ai_likelihood,
-                }
-            )
-        except Exception as exc:
-            logger.warning("Error analyzing sentence: %s", exc)
-            sentence_results.append(
-                {
-                    "sentence": sentence,
-                    "label": "Error",
-                    "perplexity": None,
-                    "ai_likelihood": None,
-                }
-            )
-    return{
-        "sentences": sentence_results,
-        "summary": {
-            "overall": {
-                "label": overall_classification,
-                "perplexity": overall_perplexity,
-                "ai_likelihood": overall_ai_likelihood,
-            }
-        },
-    }

 import torch
+from .model_loader import get_model_tokenizer
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+def perplexity_to_ai_likelihood(ppl: float) -> float:
+    # You can tune these parameters
+    min_ppl = 10     # very confident it's AI
+    max_ppl = 100    # very confident it's human
+    # Clamp to bounds
+    ppl = max(min_ppl, min(ppl, max_ppl))
+    # Invert and scale: lower perplexity -> higher AI-likelihood
+    likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
+    return round(likelihood * 100, 2)
+def classify_text(text: str):
+    model, tokenizer = get_model_tokenizer()
+    inputs = tokenizer(text, return_tensors="pt",
+                       truncation=True, padding=True)
+    input_ids = inputs["input_ids"].to(device)
+    attention_mask = inputs["attention_mask"].to(device)
+    with torch.no_grad():
+        outputs = model(
+            input_ids, attention_mask=attention_mask, labels=input_ids)
+        loss = outputs.loss
+        perplexity = torch.exp(loss).item()
+    if perplexity < 55:
+        result = "AI-generated"
+    elif perplexity < 80:
+        result = "Probably AI-generated"
     else:
+        result = "Human-written"
+    likelihood_result = perplexity_to_ai_likelihood(perplexity)
+    return result, perplexity, likelihood_result
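
To see how the label thresholds and the likelihood percentage relate, here is a small standalone sketch of the mapping. The constants (10, 100, 55, 80) are copied from the diff; turning `min_ppl` and `max_ppl` into keyword arguments and splitting the labels into a `label_for` helper are assumptions made for illustration only.

```python
def perplexity_to_ai_likelihood(ppl: float, min_ppl: float = 10, max_ppl: float = 100) -> float:
    """Linear map from perplexity to an AI-likelihood percentage."""
    # Clamp to [min_ppl, max_ppl] so the result stays within 0..100
    ppl = max(min_ppl, min(ppl, max_ppl))
    # Lower perplexity means higher AI likelihood
    likelihood = 1 - ((ppl - min_ppl) / (max_ppl - min_ppl))
    return round(likelihood * 100, 2)


def label_for(ppl: float) -> str:
    """Label thresholds copied from classify_text in the diff (hypothetical helper)."""
    if ppl < 55:
        return "AI-generated"
    if ppl < 80:
        return "Probably AI-generated"
    return "Human-written"


# Print label and likelihood side by side for a few sample perplexities
for sample in (5, 10, 55, 80, 100, 250):
    print(sample, label_for(sample), perplexity_to_ai_likelihood(sample))
```

Under these constants the two outputs are not always aligned: a perplexity just under 80 is still labelled "Probably AI-generated" while its likelihood is only about 22%, so callers may want to treat the label and the percentage as separate signals.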

features/text_classifier/model_loader.py CHANGED

@@ -1,113 +1,50 @@
-import json
-import logging
-import pickle
 import shutil
-from pathlib import Path
-import torch
 from huggingface_hub import snapshot_download
-from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
-from config import Config
-REPO_ID = Config.REPO_ID_LANG
-MODEL_DIR = Path(Config.LANG_MODEL) if Config.LANG_MODEL else None
-HF_TOKEN = Config.HF_TOKEN
-ENGLISH_SUBDIR = "English_model"
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-REQUIRED_FILES = (
-    "classifier.pkl",
-    "scaler.pkl",
-    "word_vectorizer.pkl",
-    "char_vectorizer.pkl",
-    "feature_names.json",
-    "metadata.json",
-)
-def _patch_legacy_logistic_model(model):
-    """Backfill attributes expected by newer sklearn versions."""
-    if isinstance(model, (LogisticRegression, LogisticRegressionCV)) and not hasattr(model, "multi_class"):
-        model.multi_class = "auto"
-    return model
-def _has_required_artifacts(model_dir: Path) -> bool:
-    if not model_dir.exists() or not model_dir.is_dir():
-        return False
-    return all((model_dir / filename).exists() for filename in REQUIRED_FILES)
-def _resolve_artifact_dir(base_dir: Path) -> Path | None:
-    candidates = [base_dir, base_dir / ENGLISH_SUBDIR]
-    for candidate in candidates:
-        if _has_required_artifacts(candidate):
-            return candidate
-    return None
 def warmup():
-    logging.info("Warming up model...")
-    if MODEL_DIR is None:
-        raise ValueError("LANG_MODEL is not configured")
-    if _resolve_artifact_dir(MODEL_DIR):
-        logging.info("Model artifacts already exist, skipping download.")
-        return
     download_model_repo()
 def download_model_repo():
-    if MODEL_DIR is None:
-        raise ValueError("LANG_MODEL is not configured")
-    if not REPO_ID:
-        raise ValueError("English_model repo id is not configured")
-    if _resolve_artifact_dir(MODEL_DIR):
-        logging.info("Model artifacts already exist, skipping download.")
         return
-    snapshot_path = Path(snapshot_download(repo_id=REPO_ID, token=HF_TOKEN))
-    source_dir = snapshot_path / ENGLISH_SUBDIR if (snapshot_path / ENGLISH_SUBDIR).is_dir() else snapshot_path
-    MODEL_DIR.mkdir(parents=True, exist_ok=True)
-    shutil.copytree(source_dir, MODEL_DIR, dirs_exist_ok=True)
 def load_model():
-    if MODEL_DIR is None:
-        raise ValueError("LANG_MODEL is not configured")
-    artifact_dir = _resolve_artifact_dir(MODEL_DIR)
-    if artifact_dir is None:
-        logging.info("Model artifacts missing in %s, downloading now.", MODEL_DIR)
         download_model_repo()
-        artifact_dir = _resolve_artifact_dir(MODEL_DIR)
-    if artifact_dir is None:
-        raise FileNotFoundError(
-            f"Required model artifacts not found in {MODEL_DIR}. Expected files: {', '.join(REQUIRED_FILES)}"
-        )
-    with open(artifact_dir / "classifier.pkl", "rb") as f:
-        loaded_classifier = pickle.load(f)
-    loaded_classifier = _patch_legacy_logistic_model(loaded_classifier)
-    with open(artifact_dir / "scaler.pkl", "rb") as f:
-        loaded_scaler = pickle.load(f)
-    with open(artifact_dir / "word_vectorizer.pkl", "rb") as f:
-        loaded_word_vectorizer = pickle.load(f)
-    with open(artifact_dir / "char_vectorizer.pkl", "rb") as f:
-        loaded_char_vectorizer = pickle.load(f)
-    with open(artifact_dir / "feature_names.json", "r") as f:
-        loaded_features = json.load(f)
-    with open(artifact_dir / "metadata.json", "r") as f:
-        loaded_metadata = json.load(f)
-    return (
-        loaded_classifier,
-        loaded_scaler,
-        loaded_word_vectorizer,
-        loaded_char_vectorizer,
-        loaded_features,
-        loaded_metadata,
-    )

+import os
 import shutil
+import logging
+from transformers import GPT2LMHeadModel, GPT2TokenizerFast, GPT2Config
 from huggingface_hub import snapshot_download
+import torch
+from dotenv import load_dotenv
+load_dotenv()
+REPO_ID = "can-org/AI-Content-Checker"
+MODEL_DIR = "./models"
+TOKENIZER_DIR = os.path.join(MODEL_DIR, "model")
+WEIGHTS_PATH = os.path.join(MODEL_DIR, "model_weights.pth")
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+_model, _tokenizer = None, None
 def warmup():
+    global _model, _tokenizer
+    # Download the model repo if needed, then cache the loaded model and tokenizer
     download_model_repo()
+    _model, _tokenizer = load_model()
+    logging.info("Model is ready")
 def download_model_repo():
+    if os.path.exists(MODEL_DIR) and os.path.isdir(MODEL_DIR):
+        logging.info("Model already exists, skipping download.")
         return
+    snapshot_path = snapshot_download(repo_id=REPO_ID)
+    os.makedirs(MODEL_DIR, exist_ok=True)
+    shutil.copytree(snapshot_path, MODEL_DIR, dirs_exist_ok=True)
 def load_model():
+    tokenizer = GPT2TokenizerFast.from_pretrained(TOKENIZER_DIR)
+    config = GPT2Config.from_pretrained(TOKENIZER_DIR)
+    model = GPT2LMHeadModel(config)
+    model.load_state_dict(torch.load(WEIGHTS_PATH, map_location=device))
+    model.to(device)
+    model.eval()
+    return model, tokenizer
+def get_model_tokenizer():
+    global _model, _tokenizer
+    if _model is None or _tokenizer is None:
         download_model_repo()
+        _model, _tokenizer = load_model()
+    return _model, _tokenizer
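
The new `download_model_repo` skips the download whenever `./models` exists, even if the directory is empty or only partly populated, whereas the removed code checked for specific files. A minimal sketch of a stricter check follows; `has_required_artifacts` is a hypothetical helper, and the two required entries are taken from `TOKENIZER_DIR` and `WEIGHTS_PATH` in the diff.

```python
import os
import tempfile


def has_required_artifacts(model_dir: str, required=("model", "model_weights.pth")) -> bool:
    """Return True only if model_dir exists and contains every expected entry."""
    return os.path.isdir(model_dir) and all(
        os.path.exists(os.path.join(model_dir, name)) for name in required
    )


# Demonstrate on a throwaway directory: empty first, then populated
with tempfile.TemporaryDirectory() as tmp:
    empty_ok = has_required_artifacts(tmp)
    os.mkdir(os.path.join(tmp, "model"))
    open(os.path.join(tmp, "model_weights.pth"), "wb").close()
    populated_ok = has_required_artifacts(tmp)

missing_ok = has_required_artifacts(os.path.join(tmp, "does-not-exist"))
```

Swapping a check like this in for the bare `os.path.isdir` test would let an interrupted download be retried on the next start instead of failing later in `load_model`.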

features/text_classifier/preprocess.py CHANGED

@@ -1,4 +1,4 @@
-from pypdf import PdfReader
 import docx
 from io import BytesIO
 import logging
@@ -15,16 +15,18 @@ def parse_docx(file: BytesIO):
 def parse_pdf(file: BytesIO):
     try:
-        doc = PdfReader(file)
         text = ""
-        for page in doc.pages:
-            text += page.extract_text()
-        return text
     except Exception as e:
         logging.error(f"Error while processing PDF: {str(e)}")
         raise HTTPException(
             status_code=500, detail="Error processing PDF file")
 def parse_txt(file: BytesIO):
     return file.read().decode("utf-8")

+import fitz  # PyMuPDF
 import docx
 from io import BytesIO
 import logging
 def parse_pdf(file: BytesIO):
     try:
+        doc = fitz.open(stream=file, filetype="pdf")
         text = ""
+        for page_num in range(doc.page_count):
+            page = doc.load_page(page_num)
+            text += page.get_text()
+        return text
     except Exception as e:
         logging.error(f"Error while processing PDF: {str(e)}")
         raise HTTPException(
             status_code=500, detail="Error processing PDF file")
 def parse_txt(file: BytesIO):
     return file.read().decode("utf-8")

features/text_classifier/routes.py CHANGED

@@ -37,10 +37,9 @@ async def analyze_sentences(request: Request, data: TextInput, token: str = Depe
         raise HTTPException(status_code=400, detail="Missing 'text' in request body")
     return await handle_sentence_level_analysis(data.text)
-@router.post("/analyse-sentence-file")
 @limiter.limit(ACCESS_RATE)
-async def analyze_sentence_file(request: Request, file: UploadFile = File(...), token: str = Depends(verify_token)):
     return await handle_file_sentence(file)
 @router.get("/health")

         raise HTTPException(status_code=400, detail="Missing 'text' in request body")
     return await handle_sentence_level_analysis(data.text)
+@router.post("/analyse-sentance-file")
 @limiter.limit(ACCESS_RATE)
+async def analyze_sentance_file(request: Request, file: UploadFile = File(...), token: str = Depends(verify_token)):
     return await handle_file_sentence(file)
 @router.get("/health")

requirements.txt CHANGED

@@ -15,23 +15,6 @@ tensorflow
 opencv-python
 pillow
 scipy
-pypdf
 frontend
 tools
-pandas
-numpy
-scikit-learn
-textstat
-requests
-beautifulsoup4
-langchain
-langchain-community
-langchain-openai
-faiss-cpu
-PyPDF2
-tiktoken
-chromadb
-langchain_chroma
-sentence-transformers
-tf-keras
-torchvision

 opencv-python
 pillow
 scipy
+PyMuPDF
 frontend
 tools

test.md ADDED

@@ -0,0 +1,31 @@

+**Update: Edited & AI-Generated Content Detection – Project Plan**
+### 🔍 Phase 1: Rule-Based Image Detection (In Progress)
+We're implementing three core techniques to individually flag edited or AI-generated images:
+* **ELA (Error Level Analysis):** Highlights inconsistencies via JPEG recompression.
+* **FFT (Frequency Analysis):** Uses 2D Fourier Transform to detect unnatural image frequency patterns.
+* **Metadata Analysis:** Parses EXIF data to catch clues like editing software tags.
+These give us visual, interpretable results for each image and currently offer \~60–70% accuracy on typical AI-edited content.
+---
+### Phase 2: AI vs Human Detection System (Coming Soon)
+**Goal:** Build an AI model that classifies whether content is AI- or human-made — initially focusing on **images**, and later expanding to **text**.
+**Data Strategy:**
+* Scraping large volumes of recent AI-generated images (e.g. SDXL, Gibbli, Midjourney).
+* Balancing with high-quality human images.
+**Model Plan:**
+* Use ELA, FFT, and metadata as feature extractors.
+* Feed these into a CNN or ensemble model.
+* Later, unify into a full web-based platform (upload → get AI/human probability).
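
As a rough illustration of the FFT idea in the plan above, the sketch below measures what share of an image's non-DC spectral energy sits at high frequencies. It uses a naive 2D DFT from the standard library so it runs anywhere; real code would more likely call `numpy.fft.fft2`. The `high_freq_ratio` helper, the `cutoff` value and the toy 8x8 grids are all assumptions for illustration, and the plan does not say which frequency statistic the project actually uses.

```python
import cmath
import math


def dft2(grid):
    """Naive 2D discrete Fourier transform of a square grid (O(n^4), toy sizes only)."""
    n = len(grid)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    angle = -2 * math.pi * (u * x + v * y) / n
                    total += grid[x][y] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def high_freq_ratio(grid, cutoff):
    """Share of non-DC spectral energy at folded frequencies >= cutoff."""
    n = len(grid)
    spectrum = dft2(grid)
    high = total = 0.0
    for u in range(n):
        for v in range(n):
            if u == 0 and v == 0:
                continue  # skip the DC term (overall brightness)
            # Fold indices so that n-1 counts as frequency 1, n-2 as 2, and so on
            fu, fv = min(u, n - u), min(v, n - v)
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if max(fu, fv) >= cutoff:
                high += energy
    return high / total if total else 0.0


# Two toy grayscale patches: a gentle gradient and a pixel-level checkerboard
smooth = [[x + y for y in range(8)] for x in range(8)]
checker = [[(x + y) % 2 for y in range(8)] for x in range(8)]

smooth_ratio = high_freq_ratio(smooth, cutoff=3)
checker_ratio = high_freq_ratio(checker, cutoff=3)
```

The gradient keeps most of its energy at low frequencies and the checkerboard keeps nearly all of it at the highest one, so the ratio separates the two patches. Whether a given ratio indicates editing or generation would still need to be calibrated on labelled images.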