---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
tags:
- code
- security
- qwen
- securecode
- owasp
- vulnerability-detection
datasets:
- scthornton/securecode-v2
language:
- en
library_name: transformers
pipeline_tag: text-generation
arxiv: 2512.18542
---

# Qwen 2.5-Coder 7B - SecureCode Edition

<div align="center">

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) | [Dataset: SecureCode v2](https://huggingface.co/datasets/scthornton/securecode-v2) | [Base: Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) | [perfecXion.ai](https://perfecxion.ai)

**A best-in-class code model, fine-tuned for security without sacrificing its exceptional code understanding**

[Paper](https://arxiv.org/abs/2512.18542) | [Model Card](https://huggingface.co/scthornton/qwen-coder-7b-securecode) | [Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [perfecXion.ai](https://perfecxion.ai) | [Security Research](https://perfecxion.ai/security)

</div>

---

## What is This?

This is **Qwen 2.5-Coder 7B Instruct** fine-tuned on the **SecureCode v2.0 dataset**. The base model is widely regarded as one of the strongest code models in the 7B parameter class; this edition adds production-grade security knowledge on top of it.

Unlike standard code models, which frequently generate vulnerable code, this model combines Qwen's exceptional code understanding with training specifically designed to:

✅ **Recognize security vulnerabilities** across 11 programming languages
✅ **Generate secure implementations** with defense-in-depth patterns
✅ **Explain complex attack vectors** with concrete exploitation examples
✅ **Provide operational guidance** including SIEM integration, logging, and monitoring

**The result:** one of the most capable security-aware code models under 10B parameters.

**Why Qwen 2.5-Coder?** The base model was pre-trained on **5.5 trillion tokens** of code data, giving it:
- **Superior code completion** - Best-in-class for completing partial code
- **Deep code understanding** - Exceptional at analyzing complex codebases
- **92 programming languages** - Broader language support than competitors
- **128K context window** - Can analyze entire files and multi-file contexts
- **Fast inference** - Optimized for production deployment

---

## The Problem This Solves

**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). Standard code models excel at syntax but lack security awareness.

**Real-world costs:**
- Equifax breach (unpatched Apache Struts, CVE-2017-5638): **$425 million** settlement
- Capital One (SSRF attack): **100 million** customer records exposed
- SolarWinds (supply-chain compromise): **18,000** organizations received the trojanized update

Qwen 2.5-Coder SecureCode Edition is built to help prevent these classes of flaws by combining world-class code generation with security expertise.

## Key Features

### Best Code Understanding in Class

**Qwen 2.5-Coder** posts leading scores on code benchmarks:
- HumanEval: **88.2%** pass@1
- MBPP: **75.8%** pass@1
- LiveCodeBench: **35.1%** pass@1
- Reported to outperform CodeLlama 34B and approach GPT-4 on these benchmarks

Now with **1,209 security-focused examples** adding vulnerability awareness.

### Security-First Code Generation

Trained on real-world security incidents, including:
- **224 examples** of Broken Access Control vulnerabilities
- **199 examples** of Authentication Failures
- **125 examples** of Injection attacks (SQL, Command, XSS)
- **115 examples** of Cryptographic Failures
- Complete coverage of **OWASP Top 10:2025**

### Multi-Language Security Expertise

Fine-tuned on security examples across:
- Python (Django, Flask, FastAPI)
- JavaScript/TypeScript (Express, NestJS, React)
- Java (Spring Boot)
- Go (Gin framework)
- PHP (Laravel, Symfony)
- C# (ASP.NET Core)
- Ruby (Rails)
- Rust (Actix, Rocket)
- **Plus 84 more languages from Qwen's base training**

### Comprehensive Security Context

Every response includes:
1. **Vulnerable implementation** showing what NOT to do
2. **Secure implementation** with industry best practices
3. **Attack demonstration** proving the vulnerability is real
4. **Defense-in-depth guidance** for production deployment

---

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | Qwen/Qwen2.5-Coder-7B-Instruct |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
| **Dataset Size** | 841 training examples |
| **Training Epochs** | 3 |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Learning Rate** | 2e-4 |
| **Quantization** | 4-bit (bitsandbytes) |
| **Trainable Parameters** | 40.4M (0.53% of 7.6B total) |
| **Total Parameters** | 7.6B |
| **Context Window** | 128K tokens (inherited from base) |
| **GPU Used** | NVIDIA A100 40GB |
| **Training Time** | ~90 minutes (estimated) |

### Training Methodology

**LoRA (Low-Rank Adaptation)** preserves Qwen's exceptional code abilities while adding security knowledge:
- Trains only 0.53% of model parameters
- Maintains the base model's code generation quality
- Adds security-specific knowledge without catastrophic forgetting
- Enables deployment with minimal memory overhead

**4-bit quantization** enables efficient training while maintaining model quality; a configuration sketch follows below.
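
For reference, here is a minimal sketch of a QLoRA-style setup consistent with the table above. The exact training script was not published, so the target modules, dropout, and other specifics are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base model, matching the training table
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

# r and alpha from the table; target_modules and dropout are assumptions
# (attention + MLP projections are a common choice for Qwen2-style models)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # expect roughly 0.5% trainable
```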

**Extended context:** Qwen's 128K context window allows analyzing entire source files, making it ideal for security audits of large codebases.

---

## Usage

### Quick Start

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = "Qwen/Qwen2.5-Coder-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode LoRA adapter
model = PeftModel.from_pretrained(model, "scthornton/qwen-coder-7b-securecode")

# Generate secure code
prompt = """### User:
Review this Python Flask authentication code for security vulnerabilities:

```python
@app.route('/login', methods=['POST'])
def login():
    username = request.form['username']
    password = request.form['password']
    query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    user = db.execute(query).fetchone()
    if user:
        session['user_id'] = user['id']
        return redirect('/dashboard')
    return 'Invalid credentials'
```

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
````
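
Alternatively, Qwen instruct models ship a chat template, so you can let the tokenizer build the prompt. A minimal sketch, reusing the `model` and `tokenizer` loaded above:

```python
# Prompting via the tokenizer's built-in chat template
messages = [{
    "role": "user",
    "content": "Review this Flask login handler for SQL injection and show a fixed version.",
}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```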

### Run on Consumer Hardware (4-bit)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on a 16GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(base_model, "scthornton/qwen-coder-7b-securecode")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct", trust_remote_code=True)

# Now runs on an RTX 3090/4080
```

### Code Review Use Case

````python
# Security audit of an entire file
with open("app.py", "r") as f:
    code_to_review = f.read()

prompt = f"""### User:
Perform a comprehensive security review of this application code. Identify all OWASP Top 10 vulnerabilities.

```python
{code_to_review}
```

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=32768).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.3)  # lower temperature for precise analysis
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
````

---

## Use Cases

### 1. **Automated Security Code Review**
Qwen's superior code understanding makes it ideal for reviewing complex codebases:
```
Analyze this 500-line authentication module for security vulnerabilities
```

### 2. **Multi-File Security Analysis**
With the 128K context window, analyze entire projects (see the sketch after this example):
```
Review these 3 related files for security issues: auth.py, middleware.py, models.py
```
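
A minimal multi-file sketch, assuming the model and tokenizer from Quick Start are loaded and the file names from the example prompt above exist locally:

```python
# Concatenate related files into one labeled prompt for cross-file analysis
files = ["auth.py", "middleware.py", "models.py"]

sections = []
for path in files:
    with open(path, "r") as f:
        # Label each file so findings can reference it by name
        sections.append(f"# File: {path}\n{f.read()}")

prompt = (
    "### User:\n"
    "Review these related files for security issues. "
    "Report each finding with its file name and OWASP category.\n\n"
    + "\n\n".join(sections)
    + "\n\n### Assistant:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```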

### 3. **Advanced Vulnerability Explanation**
Qwen excels at explaining complex attack chains:
```
Explain how an attacker could chain SSRF with authentication bypass in this microservices architecture
```

### 4. **Production Security Architecture**
Get architectural security guidance:
```
Design a secure authentication system for a distributed microservices platform handling 100K requests/second
```

### 5. **Multi-Language Security Refactoring**
Works across Qwen's 92 supported languages:
```
Refactor this Java Spring Boot controller to fix authentication vulnerabilities
```

---

## ⚠️ Limitations

### What This Model Does Well
✅ Exceptional code understanding and completion
✅ Multi-language security analysis (92 languages)
✅ Large context window for file/project analysis
✅ Detailed vulnerability explanations with examples
✅ Complex attack chain analysis

### What This Model Doesn't Do
❌ **Not a security scanner** - Use tools like Semgrep, CodeQL, or Snyk
❌ **Not a penetration testing tool** - Cannot perform active exploitation
❌ **Not legal/compliance advice** - Consult security professionals
❌ **Not a replacement for security experts** - Critical systems need professional review

### Known Issues
- May generate verbose responses (trained on detailed security explanations)
- Best for common vulnerability patterns (OWASP Top 10) rather than novel 0-days
- Requires a 16GB+ GPU for optimal performance (4-bit quantization)

---

## Performance Benchmarks

### Hardware Requirements

**Minimum:**
- 16GB RAM
- 12GB GPU VRAM (with 4-bit quantization)

**Recommended:**
- 32GB RAM
- 16GB+ GPU (RTX 3090, A5000, etc.)

**Inference Speed (on RTX 3090 24GB):**
- ~40 tokens/second with 4-bit quantization
- ~60 tokens/second with bfloat16 (unquantized)
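
To sanity-check throughput on your own hardware, a rough timing sketch (reusing the model and tokenizer from Quick Start; results vary with prompt length and generation settings):

```python
import time

prompt = "Write a secure password-hashing helper in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt
generated = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated / elapsed:.1f} tokens/second")
```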

### Code Generation Benchmarks (Base Qwen 2.5-Coder)

| Benchmark | Score | Rank |
|-----------|-------|------|
| HumanEval | 88.2% | #1 in 7B class |
| MBPP | 75.8% | #1 in 7B class |
| LiveCodeBench | 35.1% | Top 3 overall |
| MultiPL-E | 78.9% | Best multi-language |

**Security benchmarks coming soon** - community contributions welcome!

---

## Dataset Information

This model was trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**, a production-grade security dataset with:

- **1,209 total examples** (841 train / 175 validation / 193 test)
- **100% incident grounding** - every example tied to real CVEs or security breaches
- **11 vulnerability categories** - complete OWASP Top 10:2025 coverage
- **11 programming languages** - from Python to Rust
- **4-turn conversational structure** - mirrors real developer-AI workflows
- **100% expert validation** - reviewed by independent security professionals

See the [full dataset card](https://huggingface.co/datasets/scthornton/securecode-v2) for complete details.
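
A minimal loading sketch with the `datasets` library; the split names are assumed from the train/validation/test counts above, and field names should be checked against the dataset card:

```python
from datasets import load_dataset

ds = load_dataset("scthornton/securecode-v2")
print(ds)              # expect ~841 train / 175 validation / 193 test rows
print(ds["train"][0])  # inspect one example's structure before fine-tuning
```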

---

## About perfecXion.ai

[perfecXion.ai](https://perfecxion.ai) is dedicated to advancing AI security through research, datasets, and production-grade security tooling.

**Connect:**
- Website: [perfecxion.ai](https://perfecxion.ai)
- Research: [perfecxion.ai/research](https://perfecxion.ai/research)
- GitHub: [@scthornton](https://github.com/scthornton)
- HuggingFace: [@scthornton](https://huggingface.co/scthornton)

---

## License

**Model License:** Apache 2.0 (commercial use permitted)
**Dataset License:** CC BY-NC-SA 4.0

---

## Citation

```bibtex
@misc{thornton2025securecode-qwen7b,
  title={Qwen 2.5-Coder 7B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/qwen-coder-7b-securecode},
  note={Fine-tuned on SecureCode v2.0}
}
```

---

## Acknowledgments

- **Alibaba Cloud & Qwen Team** for the exceptional Qwen 2.5-Coder base model
- **OWASP Foundation** for maintaining the Top 10 vulnerability taxonomy
- **MITRE Corporation** for the CVE database
- **Hugging Face** for infrastructure

---

## Related Models in the SecureCode Collection

- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
- **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B)
- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Established brand (13B)
- **[starcoder2-15b-securecode](https://huggingface.co/scthornton/starcoder2-15b-securecode)** - Multi-language specialist (15B)

View the complete collection: [SecureCode Models](https://huggingface.co/collections/scthornton/securecode)

---

<div align="center">

**Built with ❤️ for secure software development**

[perfecXion.ai](https://perfecxion.ai) | [Research](https://perfecxion.ai/research) | [Contact](mailto:scott@perfecxion.ai)

</div>