Security Vulnerability Datasets Collection
A comprehensive collection of 50+ datasets and repositories for training security classifiers and validating code security policies.
Directory Structure
security-datasets/
├── 01-missing-rls/            # Row-Level Security misconfigurations
├── 02-public-data-exposure/   # IDOR and access control vulnerabilities
├── 03-open-endpoints/         # API security and authentication bypass
├── 04-secrets-exposed/        # Credential detection and secret scanning
├── 05-client-side-auth/       # Frontend authorization bypass
├── 06-storage-exposure/       # Cloud storage misconfigurations (S3, Firebase)
├── 07-definer-rpc-bypass/     # SECURITY DEFINER and RPC vulnerabilities
├── 08-info-leakage/           # Information disclosure and error messages
├── 09-input-validation/       # SQL injection and XSS
└── 10-general-benchmarks/     # Cross-cutting vulnerability datasets
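A quick way to confirm that a local checkout matches this layout is to compare the directories on disk with the expected names. The sketch below is only illustrative: it assumes the collection sits in a folder named `security-datasets/` relative to the working directory, and the helper names `missing_categories` and `scan_checkout` are hypothetical.

```python
from pathlib import Path

# Category directories, copied from the directory structure in this README.
EXPECTED_CATEGORIES = [
    "01-missing-rls",
    "02-public-data-exposure",
    "03-open-endpoints",
    "04-secrets-exposed",
    "05-client-side-auth",
    "06-storage-exposure",
    "07-definer-rpc-bypass",
    "08-info-leakage",
    "09-input-validation",
    "10-general-benchmarks",
]


def missing_categories(present_names):
    """Return the expected category names absent from present_names, in order."""
    present = set(present_names)
    return [name for name in EXPECTED_CATEGORIES if name not in present]


def scan_checkout(root="security-datasets"):
    """Report which category directories are missing under root.

    Assumption: root is the local path of the checkout. If root does not
    exist, every category is reported as missing.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return list(EXPECTED_CATEGORIES)
    present = [p.name for p in root_path.iterdir() if p.is_dir()]
    return missing_categories(present)
```

Calling `scan_checkout()` from the parent directory returns an empty list when all ten categories are present.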
Quick Start by Use Case
For ML Classifier Training (Labeled Data)
| Category | Dataset | Size | Location |
|---|---|---|---|
| SQL Injection | IEEE DataPort SQLi | 244K queries | Requires registration |
| XSS | IEEE DataPort XSS | 1.8M samples | Requires registration |
| Secret Detection | CredData | 73K labeled lines | 04-secrets-exposed/CredData/ |
| General Vulnerabilities | DiverseVul | 350K functions | 10-general-benchmarks/diversevul/ |
| General Vulnerabilities | MegaVul | 355K functions | 10-general-benchmarks/MegaVul/ |
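Before training on any of these sets, it helps to check the label balance. The following is a minimal sketch that assumes the data is stored as JSON Lines with one function per record; the field names `func` and `target` are assumptions made for illustration, so check each dataset's own README for its actual schema.

```python
import json
from collections import Counter


def iter_labeled_functions(lines, code_key="func", label_key="target"):
    """Yield (code, label) pairs from JSON Lines text, skipping unusable rows.

    code_key and label_key are assumed field names, not a documented schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed rows instead of aborting
        if not isinstance(record, dict):
            continue
        if code_key in record and label_key in record:
            yield record[code_key], int(record[label_key])


def label_balance(pairs):
    """Count how many samples carry each label."""
    return Counter(label for _, label in pairs)


# Tiny in-memory stand-in for a real file; with a file on disk, pass the
# open file object instead of this list.
sample = [
    '{"func": "int f(char *s) { char b[8]; strcpy(b, s); return 0; }", "target": 1}',
    '{"func": "int g(void) { return 0; }", "target": 0}',
    "not json",
]
balance = label_balance(iter_labeled_functions(sample))
```

Vulnerability datasets are often heavily skewed toward non-vulnerable samples, so the resulting counts are a useful input when choosing class weights or a resampling strategy.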
For Security Tool Benchmarking
| Purpose | Dataset | Location |
|---|---|---|
| SAST/DAST tools | OWASP Benchmark | 03-open-endpoints/BenchmarkJava/ |
| Secret scanners | CredData benchmarks | 04-secrets-exposed/CredData/ |
| Static analyzers | CodeQL test cases | 10-general-benchmarks/codeql/ |
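When comparing a tool's output against labeled ground truth, precision, recall, and F1 are a common summary. The sketch below is a generic illustration, not the official scoring method of any benchmark listed here; findings are represented as hypothetical `(file, line)` pairs.

```python
def score_findings(reported, ground_truth):
    """Compare reported findings with ground truth and return basic metrics.

    Both arguments are iterables of hashable finding identifiers,
    for example (file, line) tuples.
    """
    reported, ground_truth = set(reported), set(ground_truth)
    tp = len(reported & ground_truth)   # correctly flagged
    fp = len(reported - ground_truth)   # flagged but not real
    fn = len(ground_truth - reported)   # real but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example data: three true secrets, of which the tool finds two
# and additionally raises one false alarm.
truth = {("config.py", 3), ("deploy.sh", 12), ("app.env", 1)}
found = {("config.py", 3), ("app.env", 1), ("README.md", 7)}
result = score_findings(found, truth)
```

Exact-match comparison on `(file, line)` is a simplifying assumption; some benchmarks match at the file or test-case level instead.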
For Hands-On Testing
| Purpose | Application | Location |
|---|---|---|
| Web vulnerabilities | DVWA, Juice Shop | 05-client-side-auth/ |
| API security | VAmPI, crAPI | 03-open-endpoints/ |
| Secrets testing | leaky-repo | 04-secrets-exposed/leaky-repo/ |
Total Statistics
- GitHub Repositories Cloned: 52+
- Total Labeled Functions: 700K+ (DiverseVul + MegaVul, vulnerable and non-vulnerable combined)
- CWE Types Covered: 270+
- Payload Collections: 10M+ payloads (SecLists, PayloadsAllTheThings)
Gated Datasets (Registration Required)
The following high-value datasets require registration:
IEEE DataPort (ieee-dataport.org)
- SQL Injection Detection Dataset (244K queries)
- Large-Scale XSS Dataset (1.8M samples)
NIST SARD (samate.nist.gov/SARD)
- Juliet Test Suite (81K+ test cases)
- CWE-specific test suites
SecretBench (MSR 2023)
- Requires data protection agreement
- 15,084 manually verified secrets
Zenodo
- LLM Secret Detection Dataset
- OWASP ModSecurity WAF Dataset
Category-Specific READMEs
Each subdirectory contains a detailed README with:
- List of downloaded repositories
- Gated resources with registration links
- Usage examples and code snippets
- CWE mappings
- Dataset creation recommendations
License Information
Individual datasets have their own licenses:
- Most OWASP projects: MIT/Apache 2.0
- CredData: Apache 2.0
- SecLists: MIT
- Academic datasets: Research use (check specific terms)
Contributing
To add new datasets:
- Clone the repository into the appropriate category directory
- Update the category README with a description
- Add it to this master README if it is significant
Underserved Categories
The following categories would benefit from dedicated dataset creation:
- MISSING_RLS - No dedicated ML datasets exist
- DEFINER_OR_RPC_BYPASS - Limited to documentation patterns
- CLIENT_SIDE_AUTH - Requires extraction from vulnerable apps
Recommended approach: Generate labeled data from vulnerable-by-design applications combined with documented anti-patterns.
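As one illustration of turning a documented anti-pattern into labeled data, the sketch below labels each table in a PostgreSQL migration as `MISSING_RLS` when no matching `ALTER TABLE ... ENABLE ROW LEVEL SECURITY` statement appears in the same text. It is a rough regex heuristic under stated assumptions, not a parser: the `OK` label, the helper names, and the sample migration are all hypothetical.

```python
import re

# Matches: CREATE TABLE [IF NOT EXISTS] name
CREATE_TABLE = re.compile(
    r'create\s+table\s+(?:if\s+not\s+exists\s+)?([\w."]+)',
    re.IGNORECASE,
)
# Matches: ALTER TABLE [IF EXISTS] [ONLY] name ENABLE ROW LEVEL SECURITY
ENABLE_RLS = re.compile(
    r'alter\s+table\s+(?:if\s+exists\s+)?(?:only\s+)?([\w."]+)'
    r'\s+enable\s+row\s+level\s+security',
    re.IGNORECASE,
)


def normalize(name):
    """Strip identifier quotes and lowercase, so "Users" and users compare equal."""
    return name.replace('"', "").lower()


def label_tables(sql_text):
    """Return one labeled sample per CREATE TABLE statement in sql_text."""
    created = [normalize(n) for n in CREATE_TABLE.findall(sql_text)]
    protected = {normalize(n) for n in ENABLE_RLS.findall(sql_text)}
    return [
        {"table": name, "label": "OK" if name in protected else "MISSING_RLS"}
        for name in created
    ]


# Hypothetical migration: one table with RLS enabled, one without.
migration = """
CREATE TABLE public.profiles (id uuid primary key, email text);
ALTER TABLE public.profiles ENABLE ROW LEVEL SECURITY;
CREATE TABLE public.invoices (id uuid primary key, owner uuid);
"""
samples = label_tables(migration)
```

The heuristic has known gaps: it treats schema-qualified and unqualified names as different tables, ignores RLS enabled in a separate migration file, and does not check whether any policy is actually defined. Samples produced this way should be reviewed before they are used as training labels.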