Cybersecurity NER Model v8

Named Entity Recognition (NER) model for cybersecurity domain text, trained with spaCy v3.8 on custom training data.

Model Description

Fine-tuned NER model for extracting 13 cybersecurity entity types from technical documentation, CVs, job descriptions, threat reports, and compliance documents.

Performance

Test Results (v8):

  • Pass Rate: 94% (62/66 tests)
  • Dev F1 Score: 98.58%
  • Precision: 98.71%
  • Recall: 98.46%
  • Training Steps: 11,500 (early stopping)
  • Training Data: 2,223 examples
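
The headline figures above can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. A minimal sketch, using only the numbers reported in this list:

```python
# Reported dev-set metrics from the list above, as percentages.
precision = 98.71
recall = 98.46

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Test pass rate: 62 of 66 test cases passed.
pass_rate = 62 / 66 * 100

print(f"F1: {f1:.2f}%")                # F1: 98.58%
print(f"Pass rate: {pass_rate:.1f}%")  # Pass rate: 93.9%
```

The pass rate of 93.9% is what the card rounds to 94%.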

Entity Type Performance:

| Entity Type | Test Pass Rate | Dev Set F1 |
|---|---|---|
| CVE | 100% (3/3) | 100.00% |
| AUDIT_TERM | 75% (3/4) | 100.00% |
| SECURITY_TOOL | 100% (4/4) | 100.00% |
| CERTIFICATION | 100% (4/4) | 98.73% |
| SECURITY_ROLE | 100% (4/4) | 98.11% |
| FRAMEWORK | 100% (4/4) | 93.88% |
| TECHNICAL_SKILL | 100% (4/4) | 100.00% |
| ACRONYM | 100% (4/4) | 100.00% |
| SECURITY_DOMAIN | 100% (4/4) | 100.00% |
| ATTACK_TECHNIQUE | 75% (3/4) | 98.70% |
| THREAT_TYPE | 75% (3/4) | 95.24% |
| REGULATION | 75% (3/4) | 96.55% |
| CONTROL_ID | 100% (4/4) | - |

Entity Types

  1. CVE - CVE identifiers (e.g., CVE-2024-1234)
  2. CERTIFICATION - Security certifications (CISSP, OSCP, CEH, CISM, Security+)
  3. FRAMEWORK - Security frameworks (NIST CSF, ISO 27001, MITRE ATT&CK, CIS Controls)
  4. ATTACK_TECHNIQUE - Attack methods (SQL injection, XSS, CSRF, buffer overflow)
  5. TECHNICAL_SKILL - Technical skills (Incident Response, Forensics, Penetration Testing)
  6. AUDIT_TERM - Audit/compliance terms (Risk assessment, Compliance audit, Security review)
  7. SECURITY_ROLE - Job roles (CISO, SOC Analyst, Security Engineer, Pentester)
  8. THREAT_TYPE - Threat types (APT, ransomware, phishing, DDoS, malware)
  9. ACRONYM - Security acronyms (SIEM, EDR, SOAR, IDS/IPS, WAF, DLP)
  10. SECURITY_DOMAIN - Security domains (Cloud Security, Network Security, Application Security)
  11. REGULATION - Regulations (GDPR, HIPAA, PCI-DSS, SOX, CCPA)
  12. SECURITY_TOOL - Security tools (Splunk, Metasploit, Burp Suite, Nmap, Wireshark)
  13. CONTROL_ID - Control identifiers (ISO 27001 A.5.1, NIST CSF PR.AC-1, CIS Control 1.1)
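
Some of these types have a fixed surface form, which makes a cheap sanity check on model output possible. The sketch below validates CVE spans against the public CVE ID syntax (`CVE-`, a four-digit year, then four or more digits). The helper `is_valid_cve` is hypothetical and not part of the model; it is only one way a downstream consumer might filter predictions.

```python
import re

# CVE IDs have the form CVE-<4-digit year>-<sequence of 4 or more digits>.
# IGNORECASE is an assumption here, to tolerate lower-cased input text.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def is_valid_cve(span_text: str) -> bool:
    """Return True if the whole string is a well-formed CVE identifier."""
    return CVE_PATTERN.fullmatch(span_text.strip()) is not None

print(is_valid_cve("CVE-2024-1234"))  # True
print(is_valid_cve("CVE-24-1"))       # False
```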

Usage

import spacy

# Load model
nlp = spacy.load("path/to/model")

# Extract entities
text = "CISSP certified professional with experience in Splunk and Metasploit"
doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

Output:

CISSP -> CERTIFICATION
Splunk -> SECURITY_TOOL
Metasploit -> SECURITY_TOOL
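
For use cases such as CV skill extraction it is often handier to have the entities grouped by label. A minimal sketch, with `group_by_label` as a hypothetical helper that works on plain `(text, label)` pairs so it runs without the model:

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) pairs into {label: [texts...]}, preserving order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# These pairs mirror the output shown above. With a loaded model they
# would come from: [(ent.text, ent.label_) for ent in doc.ents]
pairs = [
    ("CISSP", "CERTIFICATION"),
    ("Splunk", "SECURITY_TOOL"),
    ("Metasploit", "SECURITY_TOOL"),
]
print(group_by_label(pairs))
# {'CERTIFICATION': ['CISSP'], 'SECURITY_TOOL': ['Splunk', 'Metasploit']}
```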

Training Data

Sources:

  • v7 merged data: 1,448 examples
  • v8 generated: 1,347 examples with multi-entity patterns and case variants
  • Manually curated: 100 examples
  • Final dataset: 2,223 unique examples (from 2,895 combined, after validation and deduplication)

v8 Improvements:

  • Multi-entity "X and Y" patterns (50 examples per entity type)
  • Case variants: upper, lower, and title case (CISSP, cissp, Cissp)
  • Comma-separated list patterns
  • AUDIT_TERM edge cases (Compliance audit)
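
The generation script for these patterns is not included in this card. The sketch below only illustrates the idea under stated assumptions: `case_variants` and `pair_example` are hypothetical helpers, and the output uses character-offset annotations of the kind spaCy's `Example.from_dict` accepts under the `"entities"` key.

```python
def case_variants(term):
    """Upper-, lower-, and title-case forms of a term, without duplicates."""
    seen = []
    for variant in (term.upper(), term.lower(), term.title()):
        if variant not in seen:
            seen.append(variant)
    return seen

def pair_example(first, second, label):
    """Build an "X and Y" example as (text, {"entities": [(start, end, label)]})."""
    text = f"{first} and {second}"
    second_start = len(first) + len(" and ")
    entities = [
        (0, len(first), label),
        (second_start, second_start + len(second), label),
    ]
    return text, {"entities": entities}

print(case_variants("CISSP"))
# ['CISSP', 'cissp', 'Cissp']
print(pair_example("HIPAA", "PCI-DSS", "REGULATION"))
# ('HIPAA and PCI-DSS', {'entities': [(0, 5, 'REGULATION'), (10, 17, 'REGULATION')]})
```

Annotating each side of the conjunction as its own span, rather than one span covering "X and Y", is the behaviour the "X and Y" test cases check for.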

Entity Distribution (2,636 annotated entities in total; percentages are shares of that total):

  • AUDIT_TERM: 326 (12.4%)
  • CERTIFICATION: 295 (11.2%)
  • SECURITY_TOOL: 293 (11.1%)
  • ATTACK_TECHNIQUE: 282 (10.7%)
  • THREAT_TYPE: 263 (10.0%)
  • TECHNICAL_SKILL: 228 (8.6%)
  • REGULATION: 222 (8.4%)
  • CVE: 182 (6.9%)
  • FRAMEWORK: 165 (6.3%)
  • SECURITY_ROLE: 153 (5.8%)
  • ACRONYM: 142 (5.4%)
  • SECURITY_DOMAIN: 85 (3.2%)
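
The percentages in this list are shares of all annotated entities, not of the 2,223 examples, and CONTROL_ID does not appear in it. A short check that recomputes the shares from the counts as listed:

```python
# Entity counts copied from the list above.
counts = {
    "AUDIT_TERM": 326,
    "CERTIFICATION": 295,
    "SECURITY_TOOL": 293,
    "ATTACK_TECHNIQUE": 282,
    "THREAT_TYPE": 263,
    "TECHNICAL_SKILL": 228,
    "REGULATION": 222,
    "CVE": 182,
    "FRAMEWORK": 165,
    "SECURITY_ROLE": 153,
    "ACRONYM": 142,
    "SECURITY_DOMAIN": 85,
}

total = sum(counts.values())
# Share of each label among all annotated entities, in percent.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)                      # 2636
print(shares["AUDIT_TERM"])       # 12.4
print(shares["SECURITY_DOMAIN"])  # 3.2
```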

Training Configuration

  • Framework: spaCy 3.8
  • Architecture: tok2vec + TransitionBasedParser
  • GPU: NVIDIA RTX 4090
  • Training steps: 11,500 (early stopping)
  • Patience: 5,000 steps
  • Learning rate: 3e-05
  • Dropout: 0.25
  • Batch size: 1,000
  • Train/dev split: 85/15
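
Patience-based early stopping ends training once a set number of steps has passed without a better evaluation score. If training stopped at step 11,500 only because the 5,000-step patience ran out, the best checkpoint would sit near step 6,500; that is an inference from the two numbers above, not something the card states. The sketch below illustrates the rule with a hypothetical `stopping_step` function and a synthetic score curve, not the model's real training log.

```python
def stopping_step(scores, patience):
    """Return (stop_step, best_step) for patience-based early stopping.

    `scores` is a non-empty list of (step, score) pairs in evaluation order.
    Training stops at the first evaluation where `patience` or more steps
    have passed since the best score so far.
    """
    best_step, best_score = None, float("-inf")
    for step, score in scores:
        if score > best_score:
            best_step, best_score = step, score
        if step - best_step >= patience:
            return step, best_step
    # Patience never ran out within the given evaluations.
    return scores[-1][0], best_step

# Synthetic curve, evaluated every 500 steps: improves until step 6,500,
# then plateaus. The evaluation interval is an assumption.
scores = [(s, min(s, 6500) / 6500) for s in range(500, 20001, 500)]
print(stopping_step(scores, patience=5000))  # (11500, 6500)
```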

Version History

v8 (Current):

  • 94% pass rate (62/66)
  • Multi-entity extraction improved
  • Title case support added
  • AUDIT_TERM edge cases fixed

v7:

  • 86% pass rate (57/66)
  • CVE detection restored
  • SECURITY_ROLE improved to 100%
  • IDS/IPS and DDoS fixed

v6:

  • 74% pass rate (49/66)
  • CVE detection regressed (CVE entities missed)
  • AUDIT_TERM and SECURITY_ROLE issues

Known Limitations

v8 has 4 remaining test failures:

  1. Multi-entity extraction in specific contexts ("APT group using ransomware")
  2. Span boundary issues with conjunctions ("XSS and CSRF mitigated")
  3. Specific "X and Y" patterns ("HIPAA and PCI-DSS standards")
  4. "Gap analysis" edge case
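
The test suite behind the pass rate is not published with this card. As a rough sketch of how such a pass rate could be measured, the hypothetical `pass_rate` function below counts a test as passed only when the predicted set of `(text, label)` pairs matches the expected set exactly. The stub predictor and its merged-span output are invented for illustration.

```python
def pass_rate(test_cases, predict):
    """Return (passed, total); a case passes only on an exact match.

    `test_cases` is a list of (text, expected) pairs, where `expected` is a
    set of (entity_text, label) tuples. `predict` maps a text to such a set.
    """
    passed = sum(1 for text, expected in test_cases if predict(text) == expected)
    return passed, len(test_cases)

test_cases = [
    ("CISSP certified", {("CISSP", "CERTIFICATION")}),
    ("XSS and CSRF mitigated",
     {("XSS", "ATTACK_TECHNIQUE"), ("CSRF", "ATTACK_TECHNIQUE")}),
]

def stub_predict(text):
    # Hypothetical predictor that merges the conjunction into one span,
    # standing in for a span boundary error.
    canned = {
        "CISSP certified": {("CISSP", "CERTIFICATION")},
        "XSS and CSRF mitigated": {("XSS and CSRF", "ATTACK_TECHNIQUE")},
    }
    return canned[text]

passed, total = pass_rate(test_cases, stub_predict)
print(f"{passed}/{total} passed")  # 1/2 passed
```

With a loaded model, `predict` could be `lambda text: {(ent.text, ent.label_) for ent in nlp(text).ents}`.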

Use Cases

  • CV/resume skill extraction
  • Job description analysis
  • Threat intelligence reports
  • Compliance documentation
  • Security audit reports
  • Technical documentation
  • Security training materials

License

MIT

Citation

@misc{cybersecurity-ner,
  title={Cybersecurity NER Model},
  author={PKI},
  year={2026},
  url={https://huggingface.co/pki/cybersecurity-ner}
}

Contact

For issues or questions, please open an issue on GitHub.
