# CodeDP-CPT Models V2
LoRA adapters from continued pretraining (CPT) on code with and without differential privacy (DP-SGD), across 7 model families.
## Models Included
Each model is trained with multiple variants:
- `base` / `base_attn` – CPT without DP (no privacy)
- `dp3` / `dp3_attn` – DP-SGD with ε=3 (strong privacy)
- `dp8` / `dp8_attn` – DP-SGD with ε=8 (moderate privacy)
- `*_v2` – re-runs with improved hyperparameters (LR=5e-4, 5 epochs, min_lr_ratio=0.15)
## Model Families
| Family | Variants | Base Model |
|---|---|---|
| starcoder2-7b | base, dp3, dp8 | bigcode/starcoder2-7b |
| llama3-8b | base, dp3, dp8, dp8_v2 | meta-llama/Meta-Llama-3-8B |
| llama3.1-8b | dp3, dp8 | meta-llama/Llama-3.1-8B |
| llama3.2-3b | base, dp3, dp8 | meta-llama/Llama-3.2-3B |
| qwen3-8b-base | base, dp3, dp8, dp3_v2, dp8_v2 | Qwen/Qwen3-8B-Base |
| granite-4.0-h-tiny | base_attn, dp3_attn, dp8_attn | ibm-granite/granite-4.0-h-tiny-base |
| qwen1.5-moe-a2.7b | dp3_attn, base_attn_v2, dp3_attn_v2, dp8_attn_v2 | Qwen/Qwen1.5-MoE-A2.7B |
Total: 24 LoRA adapters
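The available variants can also be discovered programmatically by listing the repository contents with `huggingface_hub` (a minimal sketch; the repo id matches the loading example further down this card):

```python
from huggingface_hub import list_repo_files

# List every file in the adapter repo and keep the <model>/<variant> prefixes
# that contain an adapter/ directory.
files = list_repo_files("melihcatal/codedp-cpt-models-v2")
variants = sorted({f.split("/adapter/")[0] for f in files if "/adapter/" in f})
print(len(variants), variants)  # expected: 24 entries, e.g. "starcoder2-7b/base"
```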
## Training Data
Trained on `melihcatal/codedp-cpt` – a code corpus with embedded canary secrets for DP auditing and membership inference evaluation.
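A minimal sketch for pulling the corpus with `datasets`; the split name below is an assumption, so check the dataset card before relying on it:

```python
from datasets import load_dataset

# Load the CPT corpus (split name assumed; inspect the dataset card for the actual schema).
ds = load_dataset("melihcatal/codedp-cpt", split="train")
print(ds)       # features and number of rows
print(ds[0])    # one example record
```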
## Directory Structure
Each variant directory contains:
```
<model>/<variant>/
├── adapter/                 # Final LoRA adapter (PEFT format)
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── README.md
├── tokenizer/               # Tokenizer (may include added canary tokens)
├── resolved_config.yaml     # Training configuration
├── metrics.jsonl            # Training metrics per step
├── train.log                # Training log
├── canary_meta.json         # Canary metadata for MIA evaluation
├── summary.json             # Run summary
├── audit_results.json       # DP audit results
└── audit_scores.npz         # DP audit raw scores
```
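Individual artifacts can be fetched without cloning the whole repo. A minimal sketch that reads the step-wise training metrics for one run, following the layout above (the chosen variant is just an example):

```python
import json
from huggingface_hub import hf_hub_download

# Download metrics.jsonl for a single run and parse one JSON object per line.
path = hf_hub_download(
    repo_id="melihcatal/codedp-cpt-models-v2",
    filename="starcoder2-7b/base/metrics.jsonl",
)
with open(path) as f:
    metrics = [json.loads(line) for line in f if line.strip()]
print(f"{len(metrics)} logged steps; last entry: {metrics[-1]}")
```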
## Loading a Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-7b",
    dtype="bfloat16",
)

# Load tokenizer (important: uses the trained tokenizer with canary tokens)
tokenizer = AutoTokenizer.from_pretrained(
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/tokenizer",
)

# Resize embeddings to match the tokenizer
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/adapter",
)
```
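Once loaded, the adapted model behaves like any causal LM. A short usage sketch (the prompt is illustrative):

```python
# Generate a completion with the adapted model.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```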
## Notes
- Qwen1.5-MoE requires the `--model hf` backend with lm-eval / transformers. vLLM's MoE routing produces incorrect output for this model.
- DP collapse at 8B scale: the Llama-3-8B, Llama-3.1-8B, and Qwen3-8B DP variants collapse to 0% on HumanEval. The StarCoder2-7B, Granite-tiny, and Llama-3.2-3B DP variants retain utility.
- All DP runs target ε=3 or ε=8 with δ=1e-5.
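For a rough sense of what these (ε, δ) targets imply for DP-SGD noise, here is a minimal sketch using Opacus's privacy accountant utilities. Whether these runs used Opacus is not stated here, and the sample rate and epoch count below are placeholder assumptions rather than the actual run configuration:

```python
from opacus.accountants.utils import get_noise_multiplier

# Placeholder setup: illustrative values only, not the real training configuration.
sample_rate = 0.01   # batch_size / dataset_size (assumed)
epochs = 5           # matches the *_v2 hyperparameters listed above

for eps in (3.0, 8.0):
    sigma = get_noise_multiplier(
        target_epsilon=eps,
        target_delta=1e-5,
        sample_rate=sample_rate,
        epochs=epochs,
    )
    print(f"eps={eps}: required noise multiplier ~ {sigma:.2f}")
```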
## Evaluation
Evaluated on:
- HumanEval (`openai_humaneval`) – basic code completion
- CodeDP-FC (`melihcatal/codedp-bench-fc-cpt-v2`) – in-domain function completion
- BigCodeBench (`bigcode/bigcodebench`) – library-heavy code generation
- Canary MIA (`codedp-ase26/codedp-bench-canary-mia`) – membership inference attack
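HumanEval-style benchmarks report pass@k. For reference, a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper; it is generic and not taken from this repo's evaluation code:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n generated samples, c of which pass, evaluated at k."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per task, 5 of them passing -> pass@1 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```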
## Citation
```bibtex
@misc{codedp-cpt-models-v2,
  title={CodeDP-CPT: Differentially Private Continued Pretraining for Code Models},
  author={Catal, Melih},
  year={2026},
  url={https://huggingface.co/melihcatal/codedp-cpt-models-v2},
}
```