---
license: apache-2.0
library_name: transformers
base_model: huggingface/CodeBERTa-small-v1
tags:
- security
- vulnerability-detection
- code-analysis
- multi-label-classification
- graphcodebert
- owasp
- cwe
- static-analysis
language:
- en
- code
pipeline_tag: text-classification
datasets:
- ayshajavd/code-security-vulnerability-dataset
- bstee615/bigvul
- CyberNative/Code_Vulnerability_Security_DPO
- lemon42-ai/Code_Vulnerability_Labeled_Dataset
model-index:
- name: graphcodebert-vuln-classifier
results:
- task:
type: text-classification
name: Multi-label Vulnerability Classification
dataset:
type: ayshajavd/code-security-vulnerability-dataset
name: Code Security Vulnerability Dataset
split: test
metrics:
- type: f1
value: 0.8648
name: Weighted F1
- type: f1
value: 0.4575
name: Micro F1
- type: f1
value: 0.9501
name: F1 (safe class)
- type: recall
value: 0.5018
name: Macro Recall
---
# GraphCodeBERT Vulnerability Classifier
A multi-label code vulnerability detection model that identifies **31 vulnerability classes** (30 CWEs + safe) mapped to the **OWASP Top 10 2021** categories. Fine-tuned from [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) on 175K+ labeled code samples.
## Quick Start
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "ayshajavd/graphcodebert-vuln-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
code = """
import sqlite3

def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""
inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze()
# Get predictions above threshold
TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94",
"CWE-119", "CWE-125", "CWE-190", "CWE-200", "CWE-264", "CWE-269", "CWE-276",
"CWE-284", "CWE-287", "CWE-310", "CWE-327", "CWE-330", "CWE-352", "CWE-362",
"CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
"CWE-787", "CWE-798", "CWE-918"]
threshold = 0.5
for cwe, prob in zip(TARGET_CWES, probs):
    if prob > threshold:
        print(f"{cwe}: {prob:.3f}")
```
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden, 83.5M params) |
| **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
| **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
| **Labels** | 31 (30 CWE categories + "safe") |
| **Max Sequence Length** | 512 tokens |
| **Recommended Threshold** | 0.5 (balanced precision/recall) or 0.3 (high recall, security-first) |
## Supported Languages
Python, JavaScript, Java, C, C++, PHP, Go
The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).
## Evaluation Results (Test Set – 5,000 samples)
### Threshold Comparison
| Threshold | Macro F1 | Micro F1 | Weighted F1 | Macro Precision | Macro Recall |
|-----------|----------|----------|-------------|-----------------|--------------|
| 0.2 | 0.066 | 0.301 | 0.859 | 0.048 | 0.562 |
| **0.3** | **0.081** | **0.458** | **0.865** | **0.057** | **0.502** |
| 0.4 | 0.101 | 0.626 | 0.870 | 0.070 | 0.439 |
| **0.5** | **0.125** | **0.739** | **0.870** | **0.088** | **0.366** |
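The sweep above can be reproduced from per-class probabilities with scikit-learn. A minimal sketch, where `labels` and `probs` are random stand-ins for the true multi-hot test labels and model sigmoid outputs:

```python
import numpy as np
from sklearn.metrics import f1_score

# Stand-in data: in practice, run the model over the test split to get
# `probs` (sigmoid outputs) and load the multi-hot ground-truth `labels`.
rng = np.random.default_rng(0)
n_samples, n_labels = 100, 31
labels = (rng.random((n_samples, n_labels)) > 0.9).astype(int)
probs = rng.random((n_samples, n_labels))

for threshold in (0.2, 0.3, 0.4, 0.5):
    preds = (probs > threshold).astype(int)
    macro = f1_score(labels, preds, average="macro", zero_division=0)
    micro = f1_score(labels, preds, average="micro", zero_division=0)
    weighted = f1_score(labels, preds, average="weighted", zero_division=0)
    print(f"t={threshold:.1f}  macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

Raising the threshold trades recall for precision, which is why micro F1 climbs while macro recall falls in the table.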
### Per-Class Performance (threshold=0.3)
#### OWASP A01:2021 – Broken Access Control
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-22 | Path Traversal | 2 | 0.000 | 0.000 | 0.000 |
| CWE-200 | Information Exposure | 30 | 0.063 | 0.800 | 0.117 |
| CWE-264 | Permissions/Privileges | 23 | 0.025 | 0.696 | 0.049 |
| CWE-269 | Improper Privilege Mgmt | 1 | 0.000 | 0.000 | 0.000 |
| CWE-276 | Incorrect Permissions | 0 | – | – | – |
| CWE-284 | Access Control | 5 | 0.000 | 0.000 | 0.000 |
| CWE-352 | CSRF | 1 | 0.000 | 0.000 | 0.000 |
| CWE-601 | Open Redirect | 0 | – | – | – |
#### OWASP A02:2021 – Cryptographic Failures
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-310 | Cryptographic Issues | 5 | 0.000 | 0.000 | 0.000 |
| CWE-327 | Broken Crypto Algorithm | 1 | 0.000 | 0.000 | 0.000 |
| CWE-330 | Insufficient Randomness | 1 | 0.000 | 0.000 | 0.000 |
#### OWASP A03:2021 – Injection
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-20 | Input Validation | 69 | 0.023 | **0.957** | 0.046 |
| CWE-78 | Command Injection | 1 | 0.011 | **1.000** | 0.021 |
| CWE-79 | XSS | 16 | 0.084 | **0.750** | 0.151 |
| CWE-89 | SQL Injection | 15 | 0.096 | **1.000** | 0.174 |
| CWE-94 | Code Injection | 27 | 0.123 | **1.000** | 0.220 |
| CWE-119 | Buffer Overflow | 118 | 0.088 | **0.898** | 0.160 |
| CWE-125 | Out-of-bounds Read | 35 | 0.048 | **0.829** | 0.091 |
| CWE-190 | Integer Overflow | 14 | 0.033 | **1.000** | 0.064 |
| CWE-401 | Memory Leak | 2 | 0.022 | **1.000** | 0.044 |
| CWE-416 | Use After Free | 20 | 0.048 | 0.400 | 0.086 |
| CWE-476 | NULL Pointer Deref | 30 | 0.032 | **0.867** | 0.061 |
| CWE-787 | Out-of-bounds Write | 46 | 0.052 | **0.891** | 0.099 |
#### OWASP A04:2021 – Insecure Design
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-362 | Race Condition | 11 | 0.035 | 0.636 | 0.065 |
| CWE-399 | Resource Management | 21 | 0.008 | **0.857** | 0.015 |
| CWE-434 | File Upload | 0 | – | – | – |
#### OWASP A07–A10
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-287 | Authentication | 0 | – | – | – |
| CWE-798 | Hardcoded Credentials | 0 | – | – | – |
| CWE-502 | Deserialization | 10 | 0.056 | **1.000** | 0.106 |
| CWE-918 | SSRF | 0 | – | – | – |
### Key Metric: Safe Code Detection
| Class | Support | Precision | Recall | F1 |
|-------|---------|-----------|--------|-----|
| **safe** | **4,496** | **0.927** | **0.975** | **0.950** |
### Model Strengths
- **Excellent recall** on many vulnerability classes (0.75β1.0 for SQL injection, buffer overflow, XSS, code injection, etc.)
- **Strong safe code detection** (F1=0.95) – reliably identifies secure code
- **High sensitivity** – at threshold 0.3, catches most real vulnerabilities (macro recall=0.50)
### Model Limitations
- **Low precision on rare classes** – many false positives, especially on CWEs with few training examples
- Precision can be improved by using **threshold=0.5** (macro F1 improves to 0.125 but recall drops)
- Classes with 0 test support cannot be evaluated
> **Design choice:** For security applications, we prioritize recall (catching real vulnerabilities) over precision (reducing false positives). Missing a real vulnerability (false negative) is worse than flagging safe code (false positive).
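That recall-first policy can be expressed as a small triage helper: flag a snippet if *any* vulnerability class clears the low threshold, and call it safe only when nothing does. A minimal sketch — the `triage` helper and the probability values are illustrative, not part of the released model:

```python
SECURITY_THRESHOLD = 0.3  # recall-first setting recommended above

def triage(probs: dict) -> list:
    """Return flagged CWE labels, sorted by descending probability.

    `probs` maps each label (including "safe") to the model's sigmoid
    probability for that class. An empty list means "treat as safe".
    """
    flagged = [(cwe, p) for cwe, p in probs.items()
               if cwe != "safe" and p > SECURITY_THRESHOLD]
    return sorted(flagged, key=lambda kv: -kv[1])

# Illustrative probabilities for one snippet:
probs = {"safe": 0.12, "CWE-89": 0.91, "CWE-20": 0.34, "CWE-79": 0.05}
print(triage(probs))  # [('CWE-89', 0.91), ('CWE-20', 0.34)]
```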
## Training Data
The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples), combining:
1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)** – 265K C/C++ vulnerable functions from real CVEs
2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)** – Balanced CWE-labeled subset
3. **[Code Vulnerability Labeled](https://huggingface.co/datasets/lemon42-ai/Code_Vulnerability_Labeled_Dataset)** – Multi-language (Python, JS, Java, PHP, Go)
4. **[CyberNative DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)** – Vulnerable/secure code pairs
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Epochs | 2 |
| Batch Size | 8 |
| Learning Rate | 5e-5 |
| Scheduler | Cosine with warmup (50 steps) |
| Loss | BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x) |
| Training Subset | 20K balanced samples |
| Optimizer | AdamW (fused) |
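The class-weighted loss in the table can be sketched as follows: derive a per-class `pos_weight` from label frequencies, then clip it to 30x so extremely rare classes do not dominate the gradient. Variable names and the frequency computation are assumptions, not the exact training code:

```python
import torch

def make_loss(label_matrix: torch.Tensor, clip: float = 30.0):
    """label_matrix: (n_samples, n_labels) multi-hot float tensor."""
    pos = label_matrix.sum(dim=0)              # positives per class
    neg = label_matrix.shape[0] - pos          # negatives per class
    # Ratio of negatives to positives, clipped to `clip` (30x here)
    pos_weight = (neg / pos.clamp(min=1)).clamp(max=clip)
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Toy multi-hot labels for 31 classes (roughly 5% positive rate)
labels = (torch.rand(1000, 31) > 0.95).float()
loss_fn = make_loss(labels)
loss = loss_fn(torch.randn(8, 31), (torch.rand(8, 31) > 0.9).float())
```

Clipping keeps the loss from over-rewarding positive predictions on classes with only a handful of examples, which is one reason recall stays high while precision suffers on rare CWEs.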
## Limitations
1. **Class imbalance**: Many rare CWE types have very few training examples, leading to high false positive rates
2. **Sequence length**: Limited to 512 tokens – long functions may be truncated
3. **Language bias**: Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
4. **Single-function analysis**: Analyzes individual functions, not cross-function or cross-file vulnerabilities
5. **Not a replacement**: Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
## Interactive Demo
Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer) – paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.
## Citation
```bibtex
@misc{graphcodebert-vuln-classifier,
title={GraphCodeBERT Vulnerability Classifier: Multi-label CWE Detection Mapped to OWASP Top 10},
author={ayshajavd},
year={2025},
url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
}
```