license: apache-2.0
library_name: transformers
base_model: huggingface/CodeBERTa-small-v1
tags:
  - security
  - vulnerability-detection
  - code-analysis
  - multi-label-classification
  - graphcodebert
  - owasp
  - cwe
  - static-analysis
language:
  - en
  - code
pipeline_tag: text-classification
datasets:
  - ayshajavd/code-security-vulnerability-dataset
  - bstee615/bigvul
  - CyberNative/Code_Vulnerability_Security_DPO
  - lemon42-ai/Code_Vulnerability_Labeled_Dataset
model-index:
  - name: graphcodebert-vuln-classifier
    results:
      - task:
          type: text-classification
          name: Multi-label Vulnerability Classification
        dataset:
          type: ayshajavd/code-security-vulnerability-dataset
          name: Code Security Vulnerability Dataset
          split: test
        metrics:
          - type: f1
            value: 0.8648
            name: Weighted F1
          - type: f1
            value: 0.4575
            name: Micro F1
          - type: f1
            value: 0.9501
            name: F1 (safe class)
          - type: recall
            value: 0.5018
            name: Macro Recall

GraphCodeBERT Vulnerability Classifier

A multi-label code vulnerability detection model that predicts 31 classes (30 CWEs + safe), with each CWE mapped to an OWASP Top 10 2021 category. Fine-tuned from CodeBERTa-small-v1 on a 20K balanced subset of a labeled dataset of 175K+ code samples.

Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "ayshajavd/graphcodebert-vuln-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Example input: a query built by string interpolation (SQL injection, CWE-89)
code = """
import sqlite3
def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""

# Inputs longer than 512 tokens are truncated
inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label head: apply a sigmoid per label rather than a softmax over labels
probs = torch.sigmoid(logits).squeeze(0).tolist()

# Label names, in the same order as the model's 31 outputs
TARGET_CWES = ["safe", "CWE-20", "CWE-22", "CWE-78", "CWE-79", "CWE-89", "CWE-94",
    "CWE-119", "CWE-125", "CWE-190", "CWE-200", "CWE-264", "CWE-269", "CWE-276",
    "CWE-284", "CWE-287", "CWE-310", "CWE-327", "CWE-330", "CWE-352", "CWE-362",
    "CWE-399", "CWE-401", "CWE-416", "CWE-434", "CWE-476", "CWE-502", "CWE-601",
    "CWE-787", "CWE-798", "CWE-918"]

# Print every label whose probability exceeds the threshold
threshold = 0.5
for cwe, prob in zip(TARGET_CWES, probs):
    if prob > threshold:
        print(f"{cwe}: {prob:.3f}")
```

Model Details

Property	Value
Architecture	RobertaForSequenceClassification (6 layers, 768 hidden, 83.5M params)
Base Model	CodeBERTa-small-v1
Task	Multi-label classification (BCEWithLogitsLoss with class weights)
Labels	31 (30 CWE categories + "safe")
Max Sequence Length	512 tokens
Recommended Threshold	0.5 (balanced precision/recall) or 0.3 (high recall, security-first)
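
The two recommended thresholds trade precision against recall on the same sigmoid outputs. The following is a minimal sketch of that selection step in plain Python, so it runs without loading the model. The helper name `select_labels` and the logit values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, applied independently to each label's logit."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability exceeds the threshold."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    return [(label, p) for label, p in scored if p > threshold]


# Hypothetical logits for three of the 31 labels (illustration only)
labels = ["safe", "CWE-79", "CWE-89"]
logits = [-2.0, -0.5, 1.5]

balanced = select_labels(logits, labels, threshold=0.5)
high_recall = select_labels(logits, labels, threshold=0.3)

print("threshold 0.5:", [label for label, _ in balanced])
print("threshold 0.3:", [label for label, _ in high_recall])
```

With these made-up logits, lowering the threshold from 0.5 to 0.3 adds CWE-79 to the reported labels, which is the recall-for-precision trade the table describes.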

Supported Languages

Python, JavaScript, Java, C, C++, PHP, Go

The model was trained on a diverse multi-language dataset. Performance is strongest on C/C++ (largest training subset from BigVul) and Python/JavaScript (from the multi-language datasets).

Evaluation Results (Test Set — 5,000 samples)

Threshold Comparison

Threshold	Macro F1	Micro F1	Weighted F1	Macro Precision	Macro Recall
0.2	0.066	0.301	0.859	0.048	0.562
0.3	0.081	0.458	0.865	0.057	0.502
0.4	0.101	0.626	0.870	0.070	0.439
0.5	0.125	0.739	0.870	0.088	0.366
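
The large gap between macro F1 and weighted F1 in this table comes from how the two averages are formed. Macro F1 gives every class equal weight, while weighted F1 weights each class by its support, so the safe class (4,496 test samples) dominates. The sketch below illustrates this with only two classes, using the precision, recall, and support values reported for `safe` and CWE-89 at threshold 0.3. It does not reproduce the full 31-class figures, and the function names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean: every class counts equally."""
    return sum(per_class_f1) / len(per_class_f1)


def weighted_f1(per_class_f1, supports):
    """Support-weighted mean: frequent classes dominate."""
    total = sum(supports)
    return sum(f * s for f, s in zip(per_class_f1, supports)) / total


# Values taken from the per-class results at threshold 0.3
safe_f1 = f1_score(0.927, 0.975)   # support 4,496
sqli_f1 = f1_score(0.096, 1.000)   # CWE-89, support 15

scores = [safe_f1, sqli_f1]
supports = [4496, 15]

macro = macro_f1(scores)
weighted = weighted_f1(scores, supports)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Even in this two-class toy case the macro average (about 0.56) sits far below the weighted average (about 0.95). With 30 low-precision CWE classes in the mix, the macro figure falls much further.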

Per-Class Performance (threshold=0.3)

OWASP A01:2021 — Broken Access Control

CWE	Name	Support	Precision	Recall	F1
CWE-22	Path Traversal	2	0.000	0.000	0.000
CWE-200	Information Exposure	30	0.063	0.800	0.117
CWE-264	Permissions/Privileges	23	0.025	0.696	0.049
CWE-269	Improper Privilege Mgmt	1	0.000	0.000	0.000
CWE-276	Incorrect Permissions	0	—	—	—
CWE-284	Access Control	5	0.000	0.000	0.000
CWE-352	CSRF	1	0.000	0.000	0.000
CWE-601	Open Redirect	0	—	—	—

OWASP A02:2021 — Cryptographic Failures

CWE	Name	Support
CWE-310	Cryptographic Issues	5
CWE-327	Broken Crypto Algorithm	1
CWE-330	Insufficient Randomness	1

OWASP A03:2021 — Injection

CWE	Name	Support	Precision	Recall	F1
CWE-20	Input Validation	69	0.023	0.957	0.046
CWE-78	Command Injection	1	0.011	1.000	0.021
CWE-79	XSS	16	0.084	0.750	0.151
CWE-89	SQL Injection	15	0.096	1.000	0.174
CWE-94	Code Injection	27	0.123	1.000	0.220
CWE-119	Buffer Overflow	118	0.088	0.898	0.160
CWE-125	Out-of-bounds Read	35	0.048	0.829	0.091
CWE-190	Integer Overflow	14	0.033	1.000	0.064
CWE-401	Memory Leak	2	0.022	1.000	0.044
CWE-416	Use After Free	20	0.048	0.400	0.086
CWE-476	NULL Pointer Deref	30	0.032	0.867	0.061
CWE-787	Out-of-bounds Write	46	0.052	0.891	0.099

OWASP A04:2021 — Insecure Design

CWE	Name	Support	Precision	Recall	F1
CWE-362	Race Condition	11	0.035	0.636	0.065
CWE-399	Resource Management	21	0.008	0.857	0.015
CWE-434	File Upload	0	—	—	—

OWASP A07–A10

CWE	Name	Support	Precision	Recall	F1
CWE-287	Authentication	0	—	—	—
CWE-798	Hardcoded Credentials	0	—	—	—
CWE-502	Deserialization	10	0.056	1.000	0.106
CWE-918	SSRF	0	—	—	—
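
The grouping used in the tables above can be captured as a lookup table for post-processing predictions. This is a sketch that mirrors the section headings of this card. The `A07-A10` group is kept as a single bucket because the card does not split it further, and `group_by_owasp` is a hypothetical helper, not part of the released model.

```python
from collections import defaultdict

# CWE groups exactly as listed in the per-class tables of this card
OWASP_GROUPS = {
    "A01:2021 Broken Access Control": [
        "CWE-22", "CWE-200", "CWE-264", "CWE-269",
        "CWE-276", "CWE-284", "CWE-352", "CWE-601",
    ],
    "A02:2021 Cryptographic Failures": ["CWE-310", "CWE-327", "CWE-330"],
    "A03:2021 Injection": [
        "CWE-20", "CWE-78", "CWE-79", "CWE-89", "CWE-94", "CWE-119",
        "CWE-125", "CWE-190", "CWE-401", "CWE-416", "CWE-476", "CWE-787",
    ],
    "A04:2021 Insecure Design": ["CWE-362", "CWE-399", "CWE-434"],
    "A07-A10": ["CWE-287", "CWE-798", "CWE-502", "CWE-918"],
}

# Invert the grouping into a CWE -> OWASP lookup
CWE_TO_OWASP = {cwe: group for group, cwes in OWASP_GROUPS.items() for cwe in cwes}


def group_by_owasp(predictions):
    """Group (cwe, probability) pairs by OWASP category, skipping the 'safe' label."""
    grouped = defaultdict(list)
    for cwe, prob in predictions:
        if cwe == "safe":
            continue
        grouped[CWE_TO_OWASP.get(cwe, "unmapped")].append((cwe, prob))
    return dict(grouped)


# Hypothetical predictions, for illustration only
report = group_by_owasp([("CWE-89", 0.9), ("safe", 0.6), ("CWE-22", 0.4)])
print(report)
```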

Key Metric: Safe Code Detection

Class	Support	Precision	Recall	F1
safe	4,496	0.927	0.975	0.950

Model Strengths

High recall on most vulnerability classes with at least 10 test samples (0.75–1.00 for SQL injection, buffer overflow, XSS, and code injection)
Strong safe-code detection (F1 = 0.95, precision 0.927, recall 0.975)
High sensitivity: at threshold 0.3, macro recall is 0.50 across all classes, and recall on classes with at least 10 test samples ranges from 0.64 to 1.00, with CWE-416 (0.40) as the exception

Model Limitations

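Low precision on every vulnerability class (at most 0.123 at threshold 0.3), so most flagged findings are false positives, especially on CWEs with few training examples
Raising the threshold to 0.5 improves macro precision from 0.057 to 0.088 and macro F1 from 0.081 to 0.125, but macro recall drops from 0.502 to 0.366
Classes with 0 test support (CWE-276, CWE-287, CWE-434, CWE-601, CWE-798, CWE-918) cannot be evaluated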

Design choice: For security applications, we prioritize recall (catching real vulnerabilities) over precision (reducing false positives). Missing a real vulnerability (false negative) is worse than flagging safe code (false positive).
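
To make the cost of this trade-off concrete, the reported precision, recall, and support can be turned into approximate counts of true and false positives. The sketch below is back-of-the-envelope arithmetic on the published per-class figures at threshold 0.3. Because those figures are rounded to three decimals, the counts are estimates, and `estimate_counts` is an illustrative helper.

```python
def estimate_counts(support: int, precision: float, recall: float):
    """Estimate (true positives, false positives) from rounded metrics.

    recall    = TP / support          ->  TP = recall * support
    precision = TP / (TP + FP)        ->  FP = TP / precision - TP
    """
    tp = recall * support
    if precision == 0:
        return round(tp), 0  # no usable precision figure; FP cannot be estimated
    fp = tp / precision - tp
    return round(tp), round(fp)


# CWE-89 (SQL Injection): support 15, precision 0.096, recall 1.000
sqli_tp, sqli_fp = estimate_counts(15, 0.096, 1.000)

# CWE-119 (Buffer Overflow): support 118, precision 0.088, recall 0.898
bof_tp, bof_fp = estimate_counts(118, 0.088, 0.898)

print(f"CWE-89:  ~{sqli_tp} true positives, ~{sqli_fp} false positives")
print(f"CWE-119: ~{bof_tp} true positives, ~{bof_fp} false positives")
```

On the 5,000-sample test set this works out to roughly 141 false alarms for 15 real SQL injection cases, which indicates how much triage effort the recall-first setting implies.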

Training Data

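The training data comes from the code-security-vulnerability-dataset (175,419 samples), of which a 20K balanced subset was used for fine-tuning (see Training Configuration). The dataset combines: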

BigVul — 265K C/C++ vulnerable functions from real CVEs
CWE-enriched BigVul/PrimeVul — Balanced CWE-labeled subset
Code Vulnerability Labeled — Multi-language (Python, JS, Java, PHP, Go)
CyberNative DPO — Vulnerable/secure code pairs

Training Configuration

Parameter	Value
Epochs	2
Batch Size	8
Learning Rate	5e-5
Scheduler	Cosine with warmup (50 steps)
Loss	BCEWithLogitsLoss (class-weighted, pos_weight clipped to 30x)
Training Subset	20K balanced samples
Optimizer	AdamW (fused)
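
The card states that the loss is class-weighted with `pos_weight` clipped to 30x, but it does not give the weighting formula. One common choice, assumed here, is the ratio of negative to positive samples per class, capped at the stated maximum. The sketch below shows that computation. The class counts are made up, and the fallback for classes with no positives is also an assumption.

```python
def clipped_pos_weights(positive_counts, total_samples, max_weight=30.0):
    """Per-class positive weights: negatives / positives, capped at max_weight.

    Assumption: this ratio is one common convention, not necessarily the
    formula used to train this model.
    """
    weights = []
    for pos in positive_counts:
        if pos == 0:
            # No positive examples: fall back to the cap (assumption)
            weights.append(max_weight)
        else:
            neg = total_samples - pos
            weights.append(min(neg / pos, max_weight))
    return weights


# Hypothetical positive counts for three classes in a 20,000-sample subset
weights = clipped_pos_weights([12000, 800, 50], total_samples=20000)
print(weights)
```

A list like this could then be converted to a tensor and passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. The cap keeps a very rare class (here 50 positives, raw ratio 399) from dominating the gradient.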

Limitations

Class imbalance: Many rare CWE types have very few training examples, leading to high false positive rates
Sequence length: Limited to 512 tokens — long functions may be truncated
Language bias: Strongest on C/C++ due to BigVul's dominance. Go and PHP performance may be lower
Single-function analysis: Analyzes individual functions, not cross-function or cross-file vulnerabilities
Not a replacement: Should complement manual review and established SAST tools (Semgrep, CodeQL, etc.)
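
One possible workaround for the 512-token limit is to score a long function in overlapping windows and keep, for each label, the highest probability seen in any window. The sketch below shows only the windowing and pooling logic on plain lists. It was not part of the evaluation reported in this card, the function names and stride are hypothetical, and a real pipeline would also need to leave room for the tokenizer's special tokens in each window.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


def max_pool_probs(per_window_probs):
    """For each label, keep the highest probability across all windows."""
    return [max(column) for column in zip(*per_window_probs)]


# Hypothetical 1,000-token input
tokens = list(range(1000))
windows = sliding_windows(tokens)
print(len(windows), [len(w) for w in windows])

# Hypothetical per-window probabilities for two labels
pooled = max_pool_probs([[0.1, 0.9], [0.4, 0.2]])
print(pooled)
```

Max pooling favors recall, in line with the card's stated priority, at the cost of further false positives. Hugging Face tokenizers can also produce overlapping windows directly through `return_overflowing_tokens=True` together with `stride`.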

Interactive Demo

Try the model in our Code Security Analyzer Space — paste any code and get a full security report with OWASP mapping, severity scores, attack chain analysis, and suggested fixes.

Citation

@misc{graphcodebert-vuln-classifier,
  title={GraphCodeBERT Vulnerability Classifier: Multi-label CWE Detection Mapped to OWASP Top 10},
  author={ayshajavd},
  year={2025},
  url={https://huggingface.co/ayshajavd/graphcodebert-vuln-classifier}
}