---
license: apache-2.0
library_name: transformers
base_model: huggingface/CodeBERTa-small-v1
tags:
- security
- vulnerability-detection
- code-analysis
- multi-label-classification
- graphcodebert
- owasp
- cwe
- static-analysis
language:
- en
- code
pipeline_tag: text-classification
datasets:
- ayshajavd/code-security-vulnerability-dataset
- bstee615/bigvul
- CyberNative/Code_Vulnerability_Security_DPO
- lemon42-ai/Code_Vulnerability_Labeled_Dataset
model-index:
- name: graphcodebert-vuln-classifier
results:
- task:
type: text-classification
name: Multi-label Vulnerability Classification
dataset:
type: ayshajavd/code-security-vulnerability-dataset
name: Code Security Vulnerability Dataset
split: test
metrics:
- type: f1
value: 0.8648
name: Weighted F1
- type: f1
value: 0.4575
name: Micro F1
- type: f1
value: 0.9501
name: F1 (safe class)
- type: recall
value: 0.5018
name: Macro Recall
---
# GraphCodeBERT Vulnerability Classifier
A multi-label code vulnerability detection model that identifies **31 vulnerability classes** (30 CWEs + safe) mapped to the **OWASP Top 10 2021** categories. Fine-tuned from [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) on 175K+ labeled code samples.
## Quick Start
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "ayshajavd/graphcodebert-vuln-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

code = """
import sqlite3

def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""

inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze()

# Class names in label order (30 CWEs + "safe")
TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94",
               "CWE-119", "CWE-125", "CWE-190", "CWE-200", "CWE-264", "CWE-269", "CWE-276",
               "CWE-284", "CWE-287", "CWE-310", "CWE-327", "CWE-330", "CWE-352", "CWE-362",
               "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
               "CWE-787", "CWE-798", "CWE-918"]

# Report every class whose probability exceeds the threshold
threshold = 0.5
for cwe, prob in zip(TARGET_CWES, probs):
    if prob > threshold:
        print(f"{cwe}: {prob:.3f}")
```
## Model Details
| Property | Value |
|----------|-------|
| **Architecture** | RobertaForSequenceClassification (6 layers, 768 hidden, 83.5M params) |
| **Base Model** | [CodeBERTa-small-v1](https://huggingface.co/huggingface/CodeBERTa-small-v1) |
| **Task** | Multi-label classification (BCEWithLogitsLoss with class weights) |
| **Labels** | 31 (30 CWE categories + "safe") |
| **Max Sequence Length** | 512 tokens |
| **Recommended Threshold** | 0.5 (balanced precision/recall) or 0.3 (high recall, security-first) |
## Supported Languages
Python, JavaScript, Java, C, C++, PHP, Go
The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).
## Evaluation Results (Test Set – 5,000 samples)
### Threshold Comparison
| Threshold | Macro F1 | Micro F1 | Weighted F1 | Macro Precision | Macro Recall |
|-----------|----------|----------|-------------|-----------------|--------------|
| 0.2 | 0.066 | 0.301 | 0.859 | 0.048 | 0.562 |
| **0.3** | **0.081** | **0.458** | **0.865** | **0.057** | **0.502** |
| 0.4 | 0.101 | 0.626 | 0.870 | 0.070 | 0.439 |
| **0.5** | **0.125** | **0.739** | **0.870** | **0.088** | **0.366** |
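The averages in the table can be reproduced from raw predictions with scikit-learn. The sketch below uses toy arrays (not the model's actual test-set outputs) to show how binarizing at different thresholds shifts micro and macro F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label ground truth and predicted probabilities (illustrative only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
probs = np.array([[0.9, 0.2, 0.6],
                  [0.1, 0.7, 0.4]])

def f1_at_threshold(y_true, probs, threshold, average):
    """Binarize probabilities at `threshold`, then average F1 per `average` mode."""
    y_pred = (probs >= threshold).astype(int)
    return f1_score(y_true, y_pred, average=average, zero_division=0)

for t in (0.3, 0.5, 0.7):
    print(t,
          f1_at_threshold(y_true, probs, t, "micro"),
          f1_at_threshold(y_true, probs, t, "macro"),
          f1_at_threshold(y_true, probs, t, "weighted"))
```

Lowering the threshold flags more classes per sample, which raises recall-driven averages and lowers precision-driven ones, exactly the trade-off visible in the table.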
### Per-Class Performance (threshold=0.3)
#### OWASP A01:2021 – Broken Access Control
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-22 | Path Traversal | 2 | 0.000 | 0.000 | 0.000 |
| CWE-200 | Information Exposure | 30 | 0.063 | 0.800 | 0.117 |
| CWE-264 | Permissions/Privileges | 23 | 0.025 | 0.696 | 0.049 |
| CWE-269 | Improper Privilege Mgmt | 1 | 0.000 | 0.000 | 0.000 |
| CWE-276 | Incorrect Permissions | 0 | – | – | – |
| CWE-284 | Access Control | 5 | 0.000 | 0.000 | 0.000 |
| CWE-352 | CSRF | 1 | 0.000 | 0.000 | 0.000 |
| CWE-601 | Open Redirect | 0 | – | – | – |
#### OWASP A02:2021 – Cryptographic Failures
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-310 | Cryptographic Issues | 5 | 0.000 | 0.000 | 0.000 |
| CWE-327 | Broken Crypto Algorithm | 1 | 0.000 | 0.000 | 0.000 |
| CWE-330 | Insufficient Randomness | 1 | 0.000 | 0.000 | 0.000 |
#### OWASP A03:2021 – Injection
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-20 | Input Validation | 69 | 0.023 | **0.957** | 0.046 |
| CWE-78 | Command Injection | 1 | 0.011 | **1.000** | 0.021 |
| CWE-79 | XSS | 16 | 0.084 | **0.750** | 0.151 |
| CWE-89 | SQL Injection | 15 | 0.096 | **1.000** | 0.174 |
| CWE-94 | Code Injection | 27 | 0.123 | **1.000** | 0.220 |
| CWE-119 | Buffer Overflow | 118 | 0.088 | **0.898** | 0.160 |
| CWE-125 | Out-of-bounds Read | 35 | 0.048 | **0.829** | 0.091 |
| CWE-190 | Integer Overflow | 14 | 0.033 | **1.000** | 0.064 |
| CWE-401 | Memory Leak | 2 | 0.022 | **1.000** | 0.044 |
| CWE-416 | Use After Free | 20 | 0.048 | 0.400 | 0.086 |
| CWE-476 | NULL Pointer Deref | 30 | 0.032 | **0.867** | 0.061 |
| CWE-787 | Out-of-bounds Write | 46 | 0.052 | **0.891** | 0.099 |
#### OWASP A04:2021 – Insecure Design
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-362 | Race Condition | 11 | 0.035 | 0.636 | 0.065 |
| CWE-399 | Resource Management | 21 | 0.008 | **0.857** | 0.015 |
| CWE-434 | File Upload | 0 | – | – | – |
#### OWASP A07–A10
| CWE | Name | Support | Precision | Recall | F1 |
|-----|------|---------|-----------|--------|-----|
| CWE-287 | Authentication | 0 | – | – | – |
| CWE-798 | Hardcoded Credentials | 0 | – | – | – |
| CWE-502 | Deserialization | 10 | 0.056 | **1.000** | 0.106 |
| CWE-918 | SSRF | 0 | – | – | – |
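When turning per-CWE predictions into a report, it helps to roll hits up to their OWASP category. A minimal mapping sketch following the groupings above (partial and illustrative; the A08/A10 assignments for CWE-502 and CWE-918 follow the standard OWASP 2021 mapping, and the dict would need extending to all 30 CWEs):

```python
# OWASP Top 10 2021 category per CWE (partial, illustrative mapping)
CWE_TO_OWASP = {
    "CWE-22":  "A01:2021 Broken Access Control",
    "CWE-200": "A01:2021 Broken Access Control",
    "CWE-327": "A02:2021 Cryptographic Failures",
    "CWE-79":  "A03:2021 Injection",
    "CWE-89":  "A03:2021 Injection",
    "CWE-362": "A04:2021 Insecure Design",
    "CWE-502": "A08:2021 Software and Data Integrity Failures",
    "CWE-918": "A10:2021 Server-Side Request Forgery",
}

def group_by_owasp(predicted_cwes):
    """Bucket predicted CWE labels by OWASP Top 10 category."""
    report = {}
    for cwe in predicted_cwes:
        category = CWE_TO_OWASP.get(cwe, "Unmapped")
        report.setdefault(category, []).append(cwe)
    return report
```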
### Key Metric: Safe Code Detection
| Class | Support | Precision | Recall | F1 |
|-------|---------|-----------|--------|-----|
| **safe** | **4,496** | **0.927** | **0.975** | **0.950** |
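One way to exploit the strong safe-class score is to gate reports on it: only surface CWE hits when the model is not confident the snippet is safe. A sketch, assuming the label list puts "safe" at index 0 (as in the quick-start example) and using illustrative, untuned thresholds:

```python
def triage(probs, labels, safe_threshold=0.9, vuln_threshold=0.3):
    """Return flagged CWE labels, or an empty list when the safe class dominates.

    probs: per-class probabilities aligned with `labels`; labels[0] is "safe".
    Thresholds are illustrative defaults, not tuned values.
    """
    if probs[0] >= safe_threshold:
        return []
    return [label for label, p in zip(labels[1:], probs[1:]) if p >= vuln_threshold]

LABELS = ["safe", "CWE-89", "CWE-79"]  # toy label list; the real model has 31 classes
print(triage([0.95, 0.80, 0.10], LABELS))  # safe dominates, nothing flagged
print(triage([0.20, 0.80, 0.10], LABELS))  # safe is low, CWE-89 is flagged
```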
### Model Strengths
- **Excellent recall** on many vulnerability classes (0.75–1.0 for SQL injection, buffer overflow, XSS, code injection, etc.)
- **Strong safe code detection** (F1=0.95) – reliably identifies secure code
- **High sensitivity** – at threshold 0.3, catches most real vulnerabilities (macro recall=0.50)
### Model Limitations
- **Low precision on rare classes** – many false positives, especially on CWEs with few training examples
- Raising the threshold to 0.5 trades recall for precision: macro F1 improves to 0.125, but macro recall drops from 0.50 to 0.37
- Classes with 0 test support cannot be evaluated
> **Design choice:** For security applications, we prioritize recall (catching real vulnerabilities) over precision (reducing false positives). Missing a real vulnerability (false negative) is worse than flagging safe code (false positive).
## Training Data
The model was trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset) (175,419 samples), combining:
1. **[BigVul](https://huggingface.co/datasets/bstee615/bigvul)** – 265K C/C++ vulnerable functions from real CVEs
2. **[CWE-enriched BigVul/PrimeVul](https://huggingface.co/datasets/mahdin70/cwe_enriched_balanced_bigvul_primevul)** – Balanced CWE-labeled subset
3. **[Code Vulnerability Labeled](https://huggingface.co/datasets/lemon42-ai/Code_Vulnerability_Labeled_Dataset)** – Multi-language (Python, JS, Java, PHP, Go)
4. **[CyberNative DPO](https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO)** – Vulnerable/secure code pairs
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Epochs | 2 |
| Batch Size | 8 |
| Learning Rate | 5e-5 |
| Scheduler | Cosine with warmup (50 steps) |
| Loss | BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x) |
| Training Subset | 20K balanced samples |
| Optimizer | AdamW (fused) |
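The class-weighted loss in the table can be set up roughly as follows: compute a per-class positive weight from label frequencies, clip it at 30x, and hand it to `BCEWithLogitsLoss`. This is a reconstruction from the table, not the actual training code:

```python
import torch
from torch import nn

def make_weighted_bce(labels, max_weight=30.0):
    """Build BCEWithLogitsLoss with per-class pos_weight = negatives/positives,
    clipped to `max_weight` (a sketch of the recipe in the table above)."""
    n = labels.shape[0]
    pos = labels.sum(dim=0).clamp(min=1.0)  # avoid divide-by-zero on empty classes
    pos_weight = ((n - labels.sum(dim=0)) / pos).clamp(max=max_weight)
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight), pos_weight

# Toy multi-hot label matrix: 100 samples, 3 classes
labels = torch.zeros(100, 3)
labels[:50, 0] = 1.0   # common class -> weight (100-50)/50 = 1
labels[0, 1] = 1.0     # rare class   -> raw weight 99, clipped to 30
loss_fn, w = make_weighted_bce(labels)
```

Up-weighting rare positives this way pushes the model toward the high-recall, low-precision behavior seen in the per-class results.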
## Limitations
1. **Class imbalance**: Many rare CWE types have very few training examples, leading to high false positive rates
2. **Sequence length**: Limited to 512 tokens – long functions may be truncated
3. **Language bias**: Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
4. **Single-function analysis**: Analyzes individual functions, not cross-function or cross-file vulnerabilities
5. **Not a replacement**: Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
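For the 512-token limit, one workaround is to split long inputs into overlapping windows (fast tokenizers support this via `stride` with `return_overflowing_tokens=True`), score each window, and max-pool the per-class probabilities so a vulnerability visible in any window survives. The pooling step, sketched with a toy tensor:

```python
import torch

def pool_window_probs(window_probs):
    """Max-pool per-class probabilities across overlapping windows:
    a class counts as detected if any window detects it."""
    return window_probs.max(dim=0).values

# Sketch of producing the per-window probabilities with the tokenizer:
#   enc = tokenizer(code, max_length=512, truncation=True, stride=64,
#                   return_overflowing_tokens=True, return_tensors="pt")
#   window_probs = torch.sigmoid(model(input_ids=enc["input_ids"],
#                                      attention_mask=enc["attention_mask"]).logits)
window_probs = torch.tensor([[0.1, 0.9],
                             [0.8, 0.2]])
print(pool_window_probs(window_probs))  # highest score per class across windows
```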
## Interactive Demo
Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer) – paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.
## Citation
```bibtex
@misc{graphcodebert-vuln-classifier,
title={GraphCodeBERT Vulnerability Classifier: Multi-label CWE Detection Mapped to OWASP Top 10},
author={ayshajavd},
year={2025},
url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
}
```