Security Vulnerability Datasets Collection

A comprehensive collection of 50+ datasets and repositories for training security classifiers and validating code security policies.

Directory Structure

security-datasets/
├── 01-missing-rls/          # Row-Level Security misconfigurations
├── 02-public-data-exposure/ # IDOR and access control vulnerabilities
├── 03-open-endpoints/       # API security and authentication bypass
├── 04-secrets-exposed/      # Credential detection and secret scanning
├── 05-client-side-auth/     # Frontend authorization bypass
├── 06-storage-exposure/     # Cloud storage misconfigurations (S3, Firebase)
├── 07-definer-rpc-bypass/   # SECURITY DEFINER and RPC vulnerabilities
├── 08-info-leakage/         # Information disclosure and error messages
├── 09-input-validation/     # SQL injection and XSS
└── 10-general-benchmarks/   # Cross-cutting vulnerability datasets
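
To confirm that a local checkout matches the layout above, a small check along these lines may help. It is a minimal sketch: the directory names are taken from the tree above, while the function name `missing_categories` and the idea of passing the root path as an argument are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# Category directories as listed in the tree above.
EXPECTED_CATEGORIES = [
    "01-missing-rls",
    "02-public-data-exposure",
    "03-open-endpoints",
    "04-secrets-exposed",
    "05-client-side-auth",
    "06-storage-exposure",
    "07-definer-rpc-bypass",
    "08-info-leakage",
    "09-input-validation",
    "10-general-benchmarks",
]


def missing_categories(root):
    """Return the expected category directories that are absent under root.

    Hypothetical helper: `root` is wherever security-datasets/ was checked out.
    """
    root = Path(root)
    return [name for name in EXPECTED_CATEGORIES if not (root / name).is_dir()]
```

Calling `missing_categories("security-datasets")` returns an empty list when every category directory is present.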

Quick Start by Use Case

For ML Classifier Training (Labeled Data)

Category                 Dataset             Size               Location
SQL Injection            IEEE DataPort SQLi  244K queries       Requires registration
XSS                      IEEE DataPort XSS   1.8M samples       Requires registration
Secret Detection         CredData            73K labeled lines  04-secrets-exposed/CredData/
General Vulnerabilities  DiverseVul          350K functions     10-general-benchmarks/diversevul/
General Vulnerabilities  MegaVul             355K functions     10-general-benchmarks/MegaVul/
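
Before training on any of these, it is usually worth checking the class balance, since vulnerability datasets tend to contain far more negative than positive examples. The sketch below counts label values in a JSON Lines file. The file format and the label field name `target` are assumptions made for illustration; check each dataset's own README for its actual schema.

```python
import json
from collections import Counter


def label_counts(jsonl_path, label_key="target"):
    """Count how many records carry each label value in a JSON Lines file.

    Assumes one JSON object per line; `label_key` is a hypothetical default
    and must be set to whatever field the dataset really uses.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get(label_key)] += 1
    return counts
```

Records that lack the label field are counted under `None`, which makes schema mismatches easy to spot.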

For Security Tool Benchmarking

Purpose           Dataset              Location
SAST/DAST tools   OWASP Benchmark      03-open-endpoints/BenchmarkJava/
Secret scanners   CredData benchmarks  04-secrets-exposed/CredData/
Static analyzers  CodeQL test cases    10-general-benchmarks/codeql/

For Hands-On Testing

Purpose              Application       Location
Web vulnerabilities  DVWA, Juice Shop  05-client-side-auth/
API security         VAmPI, crAPI      03-open-endpoints/
Secrets testing      leaky-repo        04-secrets-exposed/leaky-repo/

Total Statistics

  • GitHub Repositories Cloned: 52+
  • Total Labeled Functions: 700K+ (DiverseVul + MegaVul combined, vulnerable and non-vulnerable)
  • CWE Types Covered: 270+
  • Payload Collections: 10M+ payloads (SecLists, PayloadsAllTheThings)

Gated Datasets (Registration Required)

The following high-value datasets require registration:

  1. IEEE DataPort (ieee-dataport.org)

    • SQL Injection Detection Dataset (244K queries)
    • Large-Scale XSS Dataset (1.8M samples)
  2. NIST SARD (samate.nist.gov/SARD)

    • Juliet Test Suite (81K+ test cases)
    • CWE-specific test suites
  3. SecretBench (MSR 2023)

    • Requires data protection agreement
    • 15,084 manually verified secrets
  4. Zenodo

    • LLM Secret Detection Dataset
    • OWASP ModSecurity WAF Dataset

Category-Specific READMEs

Each subdirectory contains a detailed README with:

  • List of downloaded repositories
  • Gated resources with registration links
  • Usage examples and code snippets
  • CWE mappings
  • Dataset creation recommendations

License Information

Individual datasets have their own licenses:

  • Most OWASP projects: MIT/Apache 2.0
  • CredData: Apache 2.0
  • SecLists: MIT
  • Academic datasets: Research use (check specific terms)

Contributing

To add new datasets:

  1. Clone the repository into the appropriate category directory
  2. Update the category README with a description of the dataset
  3. Add it to this master README if it is significant
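
The first step can be scripted. The following is a minimal sketch, assuming `git` is available on the PATH. The helper names `clone_destination` and `clone_into_category` and the use of a shallow clone (`--depth 1`) are illustrative choices, not a convention this repository prescribes.

```python
import subprocess
from pathlib import Path


def clone_destination(root, category, repo_url):
    """Return the path a repository would be cloned to inside a category directory."""
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    return Path(root) / category / repo_name


def clone_into_category(root, category, repo_url):
    """Shallow-clone repo_url into root/category/ (requires git on PATH)."""
    destination = clone_destination(root, category, repo_url)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # --depth 1 skips history to save disk space; drop it if history is needed.
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, str(destination)],
        check=True,
    )
    return destination
```

The README updates in steps 2 and 3 remain manual, since they require a human-written description.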

Underserved Categories

The following categories would benefit from dedicated dataset creation:

  1. MISSING_RLS - No dedicated ML datasets exist
  2. DEFINER_OR_RPC_BYPASS - Limited to documentation patterns
  3. CLIENT_SIDE_AUTH - Requires extraction from vulnerable apps

Recommended approach: Generate labeled data from vulnerable-by-design applications combined with documented anti-patterns.
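
As one illustration of this approach for MISSING_RLS, the sketch below assigns weak labels to tables in a PostgreSQL migration script by checking whether each `CREATE TABLE` is paired with an `ALTER TABLE ... ENABLE ROW LEVEL SECURITY` statement. It is a heuristic under stated assumptions: the function name, the `SAFE` label, and the record layout are hypothetical, and enabling RLS says nothing about whether the policies themselves are correct, so the output should be reviewed before use as training data.

```python
import re

# Matches: CREATE TABLE [IF NOT EXISTS] <name>
CREATE_TABLE = re.compile(
    r"create\s+table\s+(?:if\s+not\s+exists\s+)?([\w.\"]+)",
    re.IGNORECASE,
)

# Matches: ALTER TABLE [IF EXISTS] [ONLY] <name> ENABLE ROW LEVEL SECURITY
ENABLE_RLS = re.compile(
    r"alter\s+table\s+(?:if\s+exists\s+)?(?:only\s+)?([\w.\"]+)"
    r"\s+enable\s+row\s+level\s+security",
    re.IGNORECASE,
)


def _normalize(name):
    """Strip quotes and lowercase so "Public"."Notes" and public.notes compare equal."""
    return name.replace('"', "").lower()


def label_migration(sql):
    """Return one weak-label record per table created in the SQL text."""
    created = [_normalize(name) for name in CREATE_TABLE.findall(sql)]
    protected = {_normalize(name) for name in ENABLE_RLS.findall(sql)}
    return [
        {"table": table, "label": "SAFE" if table in protected else "MISSING_RLS"}
        for table in created
    ]
```

Running this over the migrations of a vulnerable-by-design application yields candidate positives, which can then be paired with documented anti-patterns and hand-verified.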
