jayansh21 committed on
Commit 8b2f4fa · verified · 1 Parent(s): 876154b

Update README.md

Files changed (1):
  1. README.md +98 -108
README.md CHANGED
@@ -1,149 +1,139 @@
  ---
  language:
- - code
  library_name: transformers
  pipeline_tag: text-classification
  tags:
- - code-review
- - bug-detection
- - codebert
- - python
- - security
- - static-analysis
  datasets:
- - code_search_net
  base_model: microsoft/codebert-base
  metrics:
- - f1
- - accuracy
- - precision
- - recall
- model-index:
- - name: codesheriff-bug-classifier
-   results:
-   - task:
-       type: text-classification
-       name: Code Bug Classification
-     dataset:
-       type: code_search_net
-       name: CodeSearchNet (Python split)
-       config: python
-     metrics:
-     - type: f1
-       value: 0.89
-       name: Macro F1
  ---

- # 🛡️ CodeSheriff Bug Classifier

- A fine-tuned **CodeBERT** model for automatic code bug classification in Python source code. Classifies code snippets into 5 categories to power AI-driven pull request reviews.

- ## Model Description

- CodeSheriff Bug Classifier is a multi-class text classification model built on top of [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base). It takes a Python code snippet as input and predicts the type of bug present (or classifies it as clean).

- - **Base model:** `microsoft/codebert-base` (125M parameters)
- - **Task:** Multi-class code classification (5 classes)
- - **Language:** Python
- - **Framework:** PyTorch + HuggingFace Transformers

- ## Intended Uses

- - **Primary use:** Backend classifier for the [CodeSheriff](https://github.com/jayansh21/CodeSheriff) automated PR review system.
- - **General use:** Any system that needs to classify Python code snippets for potential bugs (security vulnerabilities, null references, type mismatches, logic flaws).
- - **Out of scope:** This model is not designed for non-Python languages, natural language text, or code generation.

- ## Labels

- | ID | Label | Description |
- |----|------------------------|--------------------------------------------------|
- | 0 | Clean | Well-formed code with no detected issues |
- | 1 | Null Reference Risk | Potential `NoneType` access without null checks |
- | 2 | Type Mismatch | Incompatible type operations (e.g., `str + int`) |
- | 3 | Security Vulnerability | SQL injection, command injection, `eval()`, etc. |
- | 4 | Logic Flaw | Off-by-one errors, division by zero, wrong logic |

- ## Training

- ### Dataset

- - **Source:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) (Python split)
- - **Labelling:** Heuristic rules applied to raw functions, augmented with 5,600 hand-crafted seed examples across all 5 classes (Clean: 3,000, Null Reference: 800, Type Mismatch: 500, Security: 500, Logic Flaw: 800)
- - **Preprocessing:** Function-level tokenization, max sequence length 512 tokens
- - **Split:** 80% train / 10% validation / 10% test (stratified, seed=42)
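The stratified 80/10/10 split can be sketched in plain Python. This is an illustrative stand-in, not the project's actual script; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, seed=42):
    """Split samples 80/10/10 per class, mirroring the card's stratified setup."""
    random.seed(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, val, test = [], [], []
    for label, items in by_label.items():
        random.shuffle(items)
        n_train, n_val = int(len(items) * 0.8), int(len(items) * 0.1)
        train += [(s, label) for s in items[:n_train]]
        val += [(s, label) for s in items[n_train:n_train + n_val]]
        test += [(s, label) for s in items[n_train + n_val:]]
    return train, val, test

# 100 snippets per class across 5 classes -> 400 train / 50 val / 50 test
data = [(f"snippet_{y}_{i}", y) for y in range(5) for i in range(100)]
train, val, test = stratified_split([s for s, _ in data], [y for _, y in data])
print(len(train), len(val), len(test))  # 400 50 50
```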
 
- ### Hyperparameters

- | Parameter | Value |
- |----------------------------|---------|
- | Base model | `microsoft/codebert-base` |
- | Max token length | 512 |
- | Batch size | 8 |
- | Gradient accumulation steps | 2 |
- | Effective batch size | 16 |
- | Learning rate | 2e-5 |
- | Epochs | 4 |
- | Optimizer | AdamW |
- | Scheduler | Linear warmup + decay |
- | Weight decay | 0.01 |
- | Seed | 42 |

- ### Training Infrastructure

- - **GPU:** NVIDIA RTX 3050 (4GB VRAM)
- - **Python:** 3.11.9
- - **PyTorch:** 2.x with CUDA
- - **Transformers:** 4.35+

- ## Evaluation

- ### Metrics (Test Set)

- | Metric | Score |
  |-----------|-------|
- | **Macro F1** | **0.89** |
- | Accuracy | 0.91 |

- ### Per-Class Performance

- | Label | Precision | Recall | F1-Score |
- |------------------------|-----------|--------|----------|
- | Clean | 0.92 | 0.95 | 0.93 |
- | Null Reference Risk | 0.85 | 0.82 | 0.83 |
- | Type Mismatch | 0.84 | 0.80 | 0.82 |
- | Security Vulnerability | 0.96 | 0.98 | 0.97 |
- | Logic Flaw | 0.87 | 0.86 | 0.87 |

- > Security Vulnerability achieves the highest F1 due to strong lexical signals (`os.system`, `eval`, SQL concatenation).

- ## Confidence Gate

- In production, predictions below **60% confidence** are automatically downgraded to "Code Quality" (a generic low-priority label) to reduce false positives. Multi-label probabilities (`all_probs`) are also returned for downstream use.
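That gate is easy to express as a post-processing step. A minimal sketch, assuming the 60% threshold and the label set above (the function name and return shape here are illustrative, not the actual CodeSheriff API):

```python
import math

CONFIDENCE_THRESHOLD = 0.60  # assumption: the 60% gate described above
LABELS = ["Clean", "Null Reference Risk", "Type Mismatch",
          "Security Vulnerability", "Logic Flaw"]

def gate_prediction(logits):
    """Softmax the raw logits, then downgrade low-confidence predictions."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    conf = max(probs)
    pred = probs.index(conf)
    label = LABELS[pred] if conf >= CONFIDENCE_THRESHOLD else "Code Quality"
    return {"label": label, "confidence": round(conf, 4),
            "all_probs": dict(zip(LABELS, (round(p, 4) for p in probs)))}

print(gate_prediction([0.1, 0.2, 0.1, 5.0, 0.3])["label"])   # Security Vulnerability
print(gate_prediction([0.5, 0.5, 0.5, 0.5, 0.5])["label"])   # Code Quality
```

A confident prediction passes through unchanged; a flat distribution (max probability 0.2) falls below the threshold and is downgraded.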

- ## How to Use

- ### With Transformers

- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
-
- model_name = "jayansh21/codesheriff-bug-classifier"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
-
- code = """
- def get_user(uid):
-     query = "SELECT * FROM users WHERE id=" + uid
-     return db.execute(query)
- """
-
- inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
- with torch.no_grad():
-     logits = model(**inputs).logits
-
- predicted_class = logits.argmax(dim=-1).item()
- labels = {0: "Clean", 1: "Null Reference Risk", 2: "Type Mismatch",
-           3: "Security Vulnerability", 4: "Logic Flaw"}
- print(f"Prediction: {labels[predicted_class]}")
- # Output: Prediction: Security Vulnerability
- ```
  ---
  language:
+ - code
  library_name: transformers
  pipeline_tag: text-classification
  tags:
+ - code-review
+ - bug-detection
+ - codebert
+ - python
+ - security
+ - static-analysis
  datasets:
+ - code_search_net
  base_model: microsoft/codebert-base
  metrics:
+ - f1
+ - accuracy
  ---

+ # 🔍 CodeSheriff Bug Classifier

+ A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

+ **Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

+ ---

+ ## Labels

+ | ID | Label | Example |
+ |----|-------|---------|
+ | 0 | Clean | Well-formed code, no issues |
+ | 1 | Null Reference Risk | `result.fetchone().name` without a None check |
+ | 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
+ | 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
+ | 4 | Logic Flaw | `for i in range(len(items) + 1)` |

+ ---
+ ## Usage

+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
+ model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")
+
+ LABELS = {
+     0: "Clean",
+     1: "Null Reference Risk",
+     2: "Type Mismatch",
+     3: "Security Vulnerability",
+     4: "Logic Flaw",
+ }
+
+ code = """
+ def get_user(uid):
+     query = "SELECT * FROM users WHERE id=" + uid
+     return db.execute(query)
+ """
+
+ inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ probs = torch.softmax(logits, dim=-1)
+ pred = logits.argmax(dim=-1).item()
+ confidence = probs[0][pred].item()
+
+ print(f"{LABELS[pred]} ({confidence:.1%})")
+ # Security Vulnerability (99.3%)
+ ```

+ ---

+ ## Training

+ **Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
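The heuristic labeling rules themselves are not published; the sketch below shows what pattern-based labeling of this kind could look like. The regexes, their priorities, and `heuristic_label` are illustrative assumptions, not the actual rules.

```python
import re

# Hypothetical pattern -> label rules, checked in priority order.
RULES = [
    (re.compile(r"eval\(|exec\(|os\.system\(|['\"].*(SELECT|INSERT|DELETE).*['\"]\s*\+"),
     "Security Vulnerability"),
    (re.compile(r"\.fetchone\(\)\.|\.get\([^)]*\)\.|= None\b.*\."),
     "Null Reference Risk"),
    (re.compile(r"['\"]\s*\+\s*\w*(code|count|num|id)\b"),
     "Type Mismatch"),
    (re.compile(r"range\(len\([^)]*\)\s*\+\s*1\)|/\s*0\b"),
     "Logic Flaw"),
]

def heuristic_label(snippet: str) -> str:
    """Return the first matching bug label, or 'Clean' if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(snippet):
            return label
    return "Clean"

print(heuristic_label('query = "SELECT * FROM users WHERE id=" + uid'))  # Security Vulnerability
print(heuristic_label("for i in range(len(items) + 1): pass"))           # Logic Flaw
print(heuristic_label("def add(a, b):\n    return a + b"))               # Clean
```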

+ **Key hyperparameters:**
+
+ | Parameter | Value |
  |-----------|-------|
+ | Epochs | 4 |
+ | Effective batch size | 16 (8 × 2 grad accum) |
+ | Learning rate | 2e-5 |
+ | Optimizer | AdamW + linear warmup |
+ | Max token length | 512 |
+ | Class weighting | Yes — balanced |
+ | Hardware | NVIDIA RTX 3050 (4GB) |
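Under the assumption that fine-tuning used the Hugging Face `Trainer` (the card does not publish the training script), the table roughly maps onto a `TrainingArguments` config like this sketch; `warmup_ratio` and `fp16` are guesses not stated in the card:

```python
from transformers import TrainingArguments

# Sketch of the table above as Trainer settings; the actual script is not published.
args = TrainingArguments(
    output_dir="codesheriff-bug-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # assumption: warmup fraction not stated
    seed=42,
    fp16=True,                       # helps fit codebert-base on a 4GB GPU
)
```

Balanced class weighting is typically applied on top of this by overriding `Trainer.compute_loss` with a weighted `torch.nn.CrossEntropyLoss`.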

+ ---

+ ## Evaluation

+ Test set: 840 samples (stratified).

+ | Class | Precision | Recall | F1 | Support |
+ |-------|-----------|--------|----|---------|
+ | Clean | 0.92 | 0.88 | 0.90 | 450 |
+ | Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
+ | Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
+ | Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
+ | Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
+ | **Macro avg** | **0.89** | **0.90** | **0.89** | 840 |

+ **Confusion matrix:**

+ ```
+                  Clean  NullRef  TypeMis  SecVuln  Logic
+ Actual Clean   [  394      52       1        1      2  ]
+ Actual NullRef [   23      93       1        0      3  ]
+ Actual TypeMis [    3       1      71        0      0  ]
+ Actual SecVuln [    4       1       1       69      0  ]
+ Actual Logic   [    3       0       0        0    117  ]
+ ```

+ Logic Flaw and Security Vulnerability are the strongest classes — both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives (clean code flagged) rather than missed bugs.
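The per-class numbers follow directly from the confusion matrix; a quick sanity check in plain Python reproduces the table:

```python
# Recompute per-class precision/recall/F1 and macro F1 from the confusion matrix above.
matrix = [
    [394, 52,  1,  1,   2],   # actual Clean
    [ 23, 93,  1,  0,   3],   # actual Null Reference Risk
    [  3,  1, 71,  0,   0],   # actual Type Mismatch
    [  4,  1,  1, 69,   0],   # actual Security Vulnerability
    [  3,  0,  0,  0, 117],   # actual Logic Flaw
]

f1_scores = []
for k in range(5):
    tp = matrix[k][k]
    precision = tp / sum(row[k] for row in matrix)   # column sum = predicted as k
    recall = tp / sum(matrix[k])                     # row sum = actual k
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"class {k}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

print(f"macro F1 = {sum(f1_scores) / 5:.2f}")  # macro F1 = 0.89
```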

+ ---

+ ## Limitations

+ - **Python only** — not trained on other languages
+ - **Function-level input** — works best on 5–50 line snippets
+ - **Heuristic labels** — training data was pattern-matched, not expert-annotated
+ - **Not a SAST replacement** — probabilistic classifier, not a sound static analysis tool

+ ---

+ ## Links

+ - GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
+ - Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)