scthornton committed (verified) · Commit c3fd001 · 1 Parent(s): f66452e

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +184 -37

README.md CHANGED
@@ -1,60 +1,207 @@
  ---
- library_name: peft
  license: llama2
  base_model: codellama/CodeLlama-13b-Instruct-hf
  tags:
- - base_model:adapter:codellama/CodeLlama-13b-Instruct-hf
- - lora
- - transformers
+ - security
+ - cybersecurity
+ - secure-coding
+ - ai-security
+ - owasp
+ - code-generation
+ - qlora
+ - lora
+ - fine-tuned
+ - securecode
+ datasets:
+ - scthornton/securecode
+ library_name: peft
  pipeline_tag: text-generation
- model-index:
- - name: codellama-13b-securecode
-   results: []
+ language:
+ - code
+ - en
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # codellama-13b-securecode
 
- This model is a fine-tuned version of [codellama/CodeLlama-13b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) on the None dataset.
 
- ## Model description
 
- More information needed
 
- ## Intended uses & limitations
 
- More information needed
 
- ## Training and evaluation data
 
- More information needed
 
- ## Training procedure
 
- ### Training hyperparameters
 
- The following hyperparameters were used during training:
- - learning_rate: 0.0002
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 16
- - optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 100
- - num_epochs: 3
 
- ### Training results
 
- ### Framework versions
 
- - PEFT 0.18.1
- - Transformers 5.1.0
- - Pytorch 2.7.1+cu128
- - Datasets 2.21.0
- - Tokenizers 0.22.2
 
+ # CodeLlama 13B SecureCode
+
+ <div align="center">
+
+ ![Parameters](https://img.shields.io/badge/params-13B-blue.svg)
+ ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
+ ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
+ ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
+
+ **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
+
+ [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
+
+ </div>
+
+ ---
+
+ ## What This Model Does
+
+ This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations, as 45% of AI-generated code does, it:
+
+ - Identifies the security risks in common coding patterns
+ - Provides vulnerable *and* secure implementations side by side
+ - Explains how attackers would exploit the vulnerability
+ - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
+
+ The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025). The sketch below illustrates the vulnerable-versus-secure contrast the training data teaches.
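+
+ For instance, in the Injection category the contrast looks like this (an illustrative snippet written for this card, not an excerpt from the dataset or from model output):
+
+ ```python
+ import sqlite3
+
+ def get_user_vulnerable(conn: sqlite3.Connection, username: str):
+     # VULNERABLE: attacker-controlled input is spliced into the SQL string,
+     # so a username like "x' OR '1'='1" matches every row (SQL injection).
+     return conn.execute(f"SELECT * FROM users WHERE name = '{username}'").fetchall()
+
+ def get_user_secure(conn: sqlite3.Connection, username: str):
+     # SECURE: a parameterized query keeps data out of the SQL text; the
+     # driver binds the value, so the same payload is just a literal string.
+     return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
+ ```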
+
+ ## Model Details
+
+ | | |
+ |---|---|
+ | **Base Model** | [CodeLlama 13B Instruct](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) |
+ | **Parameters** | 13B |
+ | **Architecture** | Llama 2 |
+ | **Tier** | Tier 3: Large Model |
+ | **Method** | QLoRA (4-bit NormalFloat quantization) |
+ | **LoRA Rank** | 16 (alpha=32) |
+ | **Target Modules** | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (7 modules) |
+ | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
+ | **Hardware** | NVIDIA A100 40GB |
+
+ CodeLlama is Meta's code-specialized Llama 2 variant; at 13B parameters it pairs strong code understanding with deeper security reasoning than the smaller tiers in this collection.
+
+ ## Quick Start
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ import torch
+
+ # Load with 4-bit quantization (matches training)
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLlama-13b-Instruct-hf",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained("scthornton/codellama-13b-securecode")
+ model = PeftModel.from_pretrained(base_model, "scthornton/codellama-13b-securecode")
+
+ # Ask a security-relevant coding question
+ messages = [
+     {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
+ ]
+
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+ # do_sample=True so the temperature setting actually takes effect
+ outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
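+
+ If you want a standalone checkpoint for serving, the LoRA adapter can be folded into the base weights. A minimal sketch, assuming the base model was loaded in bf16 rather than 4-bit (merging into quantized weights is lossy or unsupported depending on the PEFT version):
+
+ ```python
+ # Assumes `model` is the PeftModel from above, built on a bf16 base.
+ merged = model.merge_and_unload()
+ merged.save_pretrained("./codellama-13b-securecode-merged")     # example output dir
+ tokenizer.save_pretrained("./codellama-13b-securecode-merged")
+ ```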
+
+ ## Training Details
+
+ ### Dataset
+
+ Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
+
+ - **2,185 total examples** (1,435 web security + 750 AI/ML security)
+ - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
+ - **12+ programming languages** and **49+ frameworks**
+ - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance (see the loading sketch after this list)
+ - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
+
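+ A quick way to inspect that 4-turn structure is to pull the dataset with the `datasets` library (the column names are not documented here, so treat the access pattern as a sketch and check the dataset card for the real schema):
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("scthornton/securecode", split="train")
+ print(len(ds))          # expected to report 2,185 examples
+
+ # Print the schema instead of assuming field names; the four conversational
+ # turns live in whichever column the dataset card documents.
+ print(ds.column_names)
+ print(ds[0])
+ ```
+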
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | LoRA rank | 16 |
+ | LoRA alpha | 32 |
+ | LoRA dropout | 0.05 |
+ | Target modules | 7 linear layers |
+ | Quantization | 4-bit NormalFloat (NF4) |
+ | Learning rate | 2e-4 |
+ | LR scheduler | Cosine with 100-step warmup |
+ | Epochs | 3 |
+ | Per-device batch size | 2 |
+ | Gradient accumulation | 8x |
+ | Effective batch size | 16 |
+ | Max sequence length | 2048 tokens |
+ | Optimizer | paged_adamw_8bit |
+ | Precision | bf16 |
+
+ **Notes:** The maximum sequence length was reduced to 2048 tokens so training fits in the A100's 40GB of memory. The resulting model is strong at multi-turn security reasoning.
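+
+ For reference, the table above corresponds roughly to the following PEFT/bitsandbytes configuration (a sketch reconstructed from the table, not the actual training script):
+
+ ```python
+ from peft import LoraConfig
+ from transformers import BitsAndBytesConfig
+ import torch
+
+ lora_config = LoraConfig(
+     r=16,                       # LoRA rank
+     lora_alpha=32,
+     lora_dropout=0.05,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                     "gate_proj", "up_proj", "down_proj"],  # 7 linear layers
+     task_type="CAUSAL_LM",
+ )
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat
+     bnb_4bit_compute_dtype=torch.bfloat16,   # bf16 compute precision
+ )
+ ```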
+
+ ## Security Coverage
+
+ ### Web Security (1,435 examples)
+
+ OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
+
+ Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
+
+ ### AI/ML Security (750 examples)
+
+ OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
+
+ Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
+
+ ## SecureCode Model Collection
+
+ This model is part of the **SecureCode** collection of 8 security-specialized models:
+
+ | Model | Base | Size | Tier | HuggingFace |
+ |-------|------|------|------|-------------|
+ | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
+ | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
+ | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
+ | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
+ | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
+ | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
+ | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
+ | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
+
+ Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
+
+ ## SecureCode Dataset Family
+
+ | Dataset | Examples | Focus | Link |
+ |---------|----------|-------|------|
+ | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
+ | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
+ | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
+
+ ## Intended Use
+
+ **Use this model for:**
+ - Training AI coding assistants to write secure code
+ - Security education and training
+ - Vulnerability research and secure code review
+ - Building security-aware development tools
+
+ **Do not use this model for:**
+ - Offensive exploitation or automated attack generation
+ - Circumventing security controls
+ - Any activity that violates the base model's license
+
+ ## Citation
+
+ ```bibtex
+ @misc{thornton2026securecode,
+   title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
+   author={Thornton, Scott},
+   year={2026},
+   publisher={perfecXion.ai},
+   url={https://huggingface.co/datasets/scthornton/securecode},
+   note={arXiv:2512.18542}
+ }
+ ```
+
+ ## Links
+
+ - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
+ - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
+ - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
+ - **Author**: [perfecXion.ai](https://perfecxion.ai)
+
+ ## License
+
+ This model is released under the **llama2** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.