jayansh21 committed on
Commit 8b2f4fa · verified · 1 Parent(s): 876154b

Update README.md

Files changed (1):
  1. README.md +98 -108
README.md CHANGED
@@ -1,149 +1,139 @@
  ---
  language:
- - code
  library_name: transformers
  pipeline_tag: text-classification
  tags:
- - code-review
- - bug-detection
- - codebert
- - python
- - security
- - static-analysis
  datasets:
- - code_search_net
  base_model: microsoft/codebert-base
  metrics:
- - f1
- - accuracy
- - precision
- - recall
- model-index:
- - name: codesheriff-bug-classifier
-   results:
-   - task:
-       type: text-classification
-       name: Code Bug Classification
-     dataset:
-       type: code_search_net
-       name: CodeSearchNet (Python split)
-       config: python
-     metrics:
-     - type: f1
-       value: 0.89
-       name: Macro F1
  ---

- # 🛡️ CodeSheriff Bug Classifier

- A fine-tuned **CodeBERT** model for automatic code bug classification in Python source code. Classifies code snippets into 5 categories to power AI-driven pull request reviews.

- ## Model Description

- CodeSheriff Bug Classifier is a multi-class text classification model built on top of [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base). It takes a Python code snippet as input and predicts the type of bug present (or classifies it as clean).

- - **Base model:** `microsoft/codebert-base` (125M parameters)
- - **Task:** Multi-class code classification (5 classes)
- - **Language:** Python
- - **Framework:** PyTorch + HuggingFace Transformers

- ## Intended Uses

- - **Primary use:** Backend classifier for the [CodeSheriff](https://github.com/jayansh21/CodeSheriff) automated PR review system.
- - **General use:** Any system that needs to classify Python code snippets for potential bugs (security vulnerabilities, null references, type mismatches, logic flaws).
- - **Out of scope:** This model is not designed for non-Python languages, natural language text, or code generation.

- ## Labels

- | ID | Label | Description |
- |----|------------------------|--------------------------------------------------|
- | 0 | Clean | Well-formed code with no detected issues |
- | 1 | Null Reference Risk | Potential `NoneType` access without null checks |
- | 2 | Type Mismatch | Incompatible type operations (e.g., `str + int`) |
- | 3 | Security Vulnerability | SQL injection, command injection, `eval()`, etc. |
- | 4 | Logic Flaw | Off-by-one errors, division by zero, wrong logic |

- ## Training

- ### Dataset

- - **Source:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) (Python split)
- - **Labelling:** Heuristic rules applied to raw functions, augmented with 5,600 hand-crafted seed examples across all 5 classes (Clean: 3,000, Null Reference: 800, Type Mismatch: 500, Security: 500, Logic Flaw: 800)
- - **Preprocessing:** Function-level tokenization, max sequence length 512 tokens
- - **Split:** 80% train / 10% validation / 10% test (stratified, seed=42)
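The stratified 80/10/10 split can be sketched in plain Python. This is an illustrative stand-in, not the project's actual script; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, seed=42):
    """Split samples 80/10/10 per class, mirroring the card's stratified setup."""
    random.seed(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, val, test = [], [], []
    for label, items in by_label.items():
        random.shuffle(items)
        n_train, n_val = int(len(items) * 0.8), int(len(items) * 0.1)
        train += [(s, label) for s in items[:n_train]]
        val += [(s, label) for s in items[n_train:n_train + n_val]]
        test += [(s, label) for s in items[n_train + n_val:]]
    return train, val, test

# 100 snippets per class across 5 classes -> 400 train / 50 val / 50 test
data = [(f"snippet_{y}_{i}", y) for y in range(5) for i in range(100)]
train, val, test = stratified_split([s for s, _ in data], [y for _, y in data])
print(len(train), len(val), len(test))  # 400 50 50
```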
 
- ### Hyperparameters

- | Parameter | Value |
- |----------------------------|---------|
- | Base model | `microsoft/codebert-base` |
- | Max token length | 512 |
- | Batch size | 8 |
- | Gradient accumulation steps | 2 |
- | Effective batch size | 16 |
- | Learning rate | 2e-5 |
- | Epochs | 4 |
- | Optimizer | AdamW |
- | Scheduler | Linear warmup + decay |
- | Weight decay | 0.01 |
- | Seed | 42 |

- ### Training Infrastructure

- - **GPU:** NVIDIA RTX 3050 (4GB VRAM)
- - **Python:** 3.11.9
- - **PyTorch:** 2.x with CUDA
- - **Transformers:** 4.35+

- ## Evaluation

- ### Metrics (Test Set)

- | Metric | Score |
  |-----------|-------|
- | **Macro F1** | **0.89** |
- | Accuracy | 0.91 |

- ### Per-Class Performance

- | Label | Precision | Recall | F1-Score |
- |------------------------|-----------|--------|----------|
- | Clean | 0.92 | 0.95 | 0.93 |
- | Null Reference Risk | 0.85 | 0.82 | 0.83 |
- | Type Mismatch | 0.84 | 0.80 | 0.82 |
- | Security Vulnerability | 0.96 | 0.98 | 0.97 |
- | Logic Flaw | 0.87 | 0.86 | 0.87 |

- > Security Vulnerability achieves the highest F1 due to strong lexical signals (`os.system`, `eval`, SQL concatenation).

- ## Confidence Gate

- In production, predictions below **60% confidence** are automatically downgraded to "Code Quality" (a generic low-priority label) to reduce false positives. Multi-label probabilities (`all_probs`) are also returned for downstream use.
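That gate is easy to express as a post-processing step. A minimal sketch, assuming the 60% threshold and the label set above (the function name and return shape here are illustrative, not the actual CodeSheriff API):

```python
import math

CONFIDENCE_THRESHOLD = 0.60  # assumption: the 60% gate described above
LABELS = ["Clean", "Null Reference Risk", "Type Mismatch",
          "Security Vulnerability", "Logic Flaw"]

def gate_prediction(logits):
    """Softmax the raw logits, then downgrade low-confidence predictions."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    conf = max(probs)
    pred = probs.index(conf)
    label = LABELS[pred] if conf >= CONFIDENCE_THRESHOLD else "Code Quality"
    return {"label": label, "confidence": round(conf, 4),
            "all_probs": dict(zip(LABELS, (round(p, 4) for p in probs)))}

print(gate_prediction([0.1, 0.2, 0.1, 5.0, 0.3])["label"])   # Security Vulnerability
print(gate_prediction([0.5, 0.5, 0.5, 0.5, 0.5])["label"])   # Code Quality
```

A confident prediction passes through unchanged; a flat distribution (max probability 0.2) falls below the threshold and is downgraded.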

- ## How to Use

- ### With Transformers

- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
-
- model_name = "jayansh21/codesheriff-bug-classifier"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)
-
- code = """
- def get_user(uid):
-     query = "SELECT * FROM users WHERE id=" + uid
-     return db.execute(query)
- """
-
- inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
- with torch.no_grad():
-     logits = model(**inputs).logits
-
- predicted_class = logits.argmax(dim=-1).item()
- labels = {0: "Clean", 1: "Null Reference Risk", 2: "Type Mismatch",
-           3: "Security Vulnerability", 4: "Logic Flaw"}
- print(f"Prediction: {labels[predicted_class]}")
- # Output: Prediction: Security Vulnerability
- ```
  ---
  language:
+ - code
  library_name: transformers
  pipeline_tag: text-classification
  tags:
+ - code-review
+ - bug-detection
+ - codebert
+ - python
+ - security
+ - static-analysis
  datasets:
+ - code_search_net
  base_model: microsoft/codebert-base
  metrics:
+ - f1
+ - accuracy
  ---

+ # 🔍 CodeSheriff Bug Classifier

+ A fine-tuned **CodeBERT** model that classifies Python code snippets into five bug categories. Built as the classification engine inside [CodeSheriff](https://github.com/jayansh21/CodeSheriff), an AI system that automatically reviews GitHub pull requests.

+ **Base model:** `microsoft/codebert-base` · **Task:** 5-class sequence classification · **Language:** Python

+ ---

+ ## Labels

+ | ID | Label | Example |
+ |----|-------|---------|
+ | 0 | Clean | Well-formed code, no issues |
+ | 1 | Null Reference Risk | `result.fetchone().name` without a None check |
+ | 2 | Type Mismatch | `"Error: " + error_code` where `error_code` is an int |
+ | 3 | Security Vulnerability | `"SELECT * FROM users WHERE id = " + user_id` |
+ | 4 | Logic Flaw | `for i in range(len(items) + 1)` |

+ ---
+ ## Usage

+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("jayansh21/codesheriff-bug-classifier")
+ model = AutoModelForSequenceClassification.from_pretrained("jayansh21/codesheriff-bug-classifier")
+
+ LABELS = {
+     0: "Clean",
+     1: "Null Reference Risk",
+     2: "Type Mismatch",
+     3: "Security Vulnerability",
+     4: "Logic Flaw",
+ }
+
+ code = """
+ def get_user(uid):
+     query = "SELECT * FROM users WHERE id=" + uid
+     return db.execute(query)
+ """
+
+ inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ probs = torch.softmax(logits, dim=-1)
+ pred = logits.argmax(dim=-1).item()
+ confidence = probs[0][pred].item()
+
+ print(f"{LABELS[pred]} ({confidence:.1%})")
+ # Security Vulnerability (99.3%)
+ ```

+ ---

+ ## Training

+ **Dataset:** [CodeSearchNet](https://huggingface.co/datasets/code_search_net) Python split with heuristic labeling, augmented with seed templates for underrepresented classes. Final training set: 4,600 balanced samples across all five classes. Stratified 80/10/10 train/val/test split.
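The heuristic labeling rules themselves are not published; the sketch below shows what pattern-based labeling of this kind could look like. The regexes, their priorities, and `heuristic_label` are illustrative assumptions, not the actual rules.

```python
import re

# Hypothetical pattern -> label rules, checked in priority order.
RULES = [
    (re.compile(r"eval\(|exec\(|os\.system\(|['\"].*(SELECT|INSERT|DELETE).*['\"]\s*\+"),
     "Security Vulnerability"),
    (re.compile(r"\.fetchone\(\)\.|\.get\([^)]*\)\.|= None\b.*\."),
     "Null Reference Risk"),
    (re.compile(r"['\"]\s*\+\s*\w*(code|count|num|id)\b"),
     "Type Mismatch"),
    (re.compile(r"range\(len\([^)]*\)\s*\+\s*1\)|/\s*0\b"),
     "Logic Flaw"),
]

def heuristic_label(snippet: str) -> str:
    """Return the first matching bug label, or 'Clean' if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(snippet):
            return label
    return "Clean"

print(heuristic_label('query = "SELECT * FROM users WHERE id=" + uid'))  # Security Vulnerability
print(heuristic_label("for i in range(len(items) + 1): pass"))           # Logic Flaw
print(heuristic_label("def add(a, b):\n    return a + b"))               # Clean
```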

+ **Key hyperparameters:**
+
+ | Parameter | Value |
  |-----------|-------|
+ | Epochs | 4 |
+ | Effective batch size | 16 (8 × 2 grad accum) |
+ | Learning rate | 2e-5 |
+ | Optimizer | AdamW + linear warmup |
+ | Max token length | 512 |
+ | Class weighting | Yes — balanced |
+ | Hardware | NVIDIA RTX 3050 (4GB) |
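Under the assumption that fine-tuning used the Hugging Face `Trainer` (the card does not publish the training script), the table roughly maps onto a `TrainingArguments` config like this sketch; `warmup_ratio` and `fp16` are guesses not stated in the card:

```python
from transformers import TrainingArguments

# Sketch of the table above as Trainer settings; the actual script is not published.
args = TrainingArguments(
    output_dir="codesheriff-bug-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                # assumption: warmup fraction not stated
    seed=42,
    fp16=True,                       # helps fit codebert-base on a 4GB GPU
)
```

Balanced class weighting is typically applied on top of this by overriding `Trainer.compute_loss` with a weighted `torch.nn.CrossEntropyLoss`.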

+ ---

+ ## Evaluation

+ Test set: 840 samples (stratified).

+ | Class | Precision | Recall | F1 | Support |
+ |-------|-----------|--------|----|---------|
+ | Clean | 0.92 | 0.88 | 0.90 | 450 |
+ | Null Reference Risk | 0.63 | 0.78 | 0.70 | 120 |
+ | Type Mismatch | 0.96 | 0.95 | 0.95 | 75 |
+ | Security Vulnerability | 0.99 | 0.92 | 0.95 | 75 |
+ | Logic Flaw | 0.96 | 0.97 | 0.97 | 120 |
+ | **Macro avg** | **0.89** | **0.90** | **0.89** | 840 |

+ **Confusion matrix:**

+ ```
+                  Clean  NullRef  TypeMis  SecVuln  Logic
+ Actual Clean   [  394      52       1        1      2  ]
+ Actual NullRef [   23      93       1        0      3  ]
+ Actual TypeMis [    3       1      71        0      0  ]
+ Actual SecVuln [    4       1       1       69      0  ]
+ Actual Logic   [    3       0       0        0    117  ]
+ ```

+ Logic Flaw and Security Vulnerability are the strongest classes — both have clear lexical patterns. Null Reference Risk is the weakest (precision 0.63) because null-risk code closely resembles clean code structurally. Most misclassifications there are false positives (clean code flagged) rather than missed bugs.
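The per-class numbers follow directly from the confusion matrix; a quick sanity check in plain Python reproduces the table:

```python
# Recompute per-class precision/recall/F1 and macro F1 from the confusion matrix above.
matrix = [
    [394, 52,  1,  1,   2],   # actual Clean
    [ 23, 93,  1,  0,   3],   # actual Null Reference Risk
    [  3,  1, 71,  0,   0],   # actual Type Mismatch
    [  4,  1,  1, 69,   0],   # actual Security Vulnerability
    [  3,  0,  0,  0, 117],   # actual Logic Flaw
]

f1_scores = []
for k in range(5):
    tp = matrix[k][k]
    precision = tp / sum(row[k] for row in matrix)   # column sum = predicted as k
    recall = tp / sum(matrix[k])                     # row sum = actual k
    f1 = 2 * precision * recall / (precision + recall)
    f1_scores.append(f1)
    print(f"class {k}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

print(f"macro F1 = {sum(f1_scores) / 5:.2f}")  # macro F1 = 0.89
```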

+ ---

+ ## Limitations

+ - **Python only** — not trained on other languages
+ - **Function-level input** — works best on 5–50 line snippets
+ - **Heuristic labels** — training data was pattern-matched, not expert-annotated
+ - **Not a SAST replacement** — probabilistic classifier, not a sound static analysis tool

+ ---

+ ## Links

+ - GitHub: [jayansh21/CodeSheriff](https://github.com/jayansh21/CodeSheriff)
+ - Live demo: [huggingface.co/spaces/jayansh21/CodeSheriff](https://huggingface.co/spaces/jayansh21/CodeSheriff)