CodeDP-CPT Models V2

LoRA adapters from continued pretraining (CPT) on code, with and without differential privacy (DP-SGD), across 7 model families.

Models Included

Each model family is trained in multiple variants:

  • base / base_attn — CPT without DP (non-private baseline)
  • dp3 / dp3_attn — DP-SGD with ε=3 (strong privacy)
  • dp8 / dp8_attn — DP-SGD with ε=8 (moderate privacy)
  • *_v2 — re-runs with improved hyperparameters (LR=5e-4, 5 epochs, min_lr_ratio=0.15)
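
As a point of reference, here is a minimal sketch of the cosine schedule these hyperparameters suggest, assuming min_lr_ratio means the decay floor as a fraction of the peak learning rate (an assumption; each run's resolved_config.yaml records the actual schedule):

import math

# Assumed semantics: cosine decay from peak_lr down to min_lr_ratio * peak_lr.
# The real schedule is recorded in each run's resolved_config.yaml.
def cosine_lr(step, total_steps, peak_lr=5e-4, min_lr_ratio=0.15):
    min_lr = peak_lr * min_lr_ratio
    progress = min(step / max(total_steps, 1), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))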

Model Families

Family              Variants                                           Base Model
starcoder2-7b       base, dp3, dp8                                     bigcode/starcoder2-7b
llama3-8b           base, dp3, dp8, dp8_v2                             meta-llama/Meta-Llama-3-8B
llama3.1-8b         dp3, dp8                                           meta-llama/Llama-3.1-8B
llama3.2-3b         base, dp3, dp8                                     meta-llama/Llama-3.2-3B
qwen3-8b-base       base, dp3, dp8, dp3_v2, dp8_v2                     Qwen/Qwen3-8B-Base
granite-4.0-h-tiny  base_attn, dp3_attn, dp8_attn                      ibm-granite/granite-4.0-h-tiny-base
qwen1.5-moe-a2.7b   dp3_attn, base_attn_v2, dp3_attn_v2, dp8_attn_v2   Qwen/Qwen1.5-MoE-A2.7B

Total: 24 LoRA adapters
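
The adapters can be enumerated programmatically by listing the repository's files and looking for PEFT adapter_config.json markers (a short sketch using the huggingface_hub API):

from huggingface_hub import list_repo_files

# Each <model>/<variant>/adapter directory contains an adapter_config.json.
files = list_repo_files("melihcatal/codedp-cpt-models-v2")
adapters = sorted(
    f.rsplit("/adapter/adapter_config.json", 1)[0]
    for f in files
    if f.endswith("/adapter/adapter_config.json")
)
print(len(adapters))   # expected: 24
print(adapters[:3])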

Training Data

Trained on melihcatal/codedp-cpt — a code corpus with embedded canary secrets for DP auditing and membership inference evaluation.
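
The corpus can be loaded with the datasets library; a minimal sketch (the split name is an assumption, and the column schema should be inspected on the dataset card):

from datasets import load_dataset

# "train" split is an assumption; check the dataset card for the actual splits.
ds = load_dataset("melihcatal/codedp-cpt", split="train")
print(ds)  # inspect the features before use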

Directory Structure

Each variant directory contains:

<model>/<variant>/
├── adapter/              # Final LoRA adapter (PEFT format)
│   ├── adapter_config.json
│   ├── adapter_model.safetensors
│   └── README.md
├── tokenizer/            # Tokenizer (may include added canary tokens)
├── resolved_config.yaml  # Training configuration
├── metrics.jsonl         # Training metrics per step
├── train.log             # Training log
├── canary_meta.json      # Canary metadata for MIA evaluation
├── summary.json          # Run summary
├── audit_results.json    # DP audit results
└── audit_scores.npz      # DP audit raw scores
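
Individual run artifacts can be fetched without cloning the full repository; for example, reading one run's per-step metrics (a sketch; the JSONL field names depend on what the training loop logged):

import json
from huggingface_hub import hf_hub_download

# Paths follow the <model>/<variant>/ layout shown above.
path = hf_hub_download(
    "melihcatal/codedp-cpt-models-v2",
    filename="starcoder2-7b/base/metrics.jsonl",
)
with open(path) as f:
    metrics = [json.loads(line) for line in f if line.strip()]
print(metrics[0].keys())  # see which fields were logged per step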

Loading a Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-7b",
    torch_dtype=torch.bfloat16,
)

# Load tokenizer (important: uses trained tokenizer with canary tokens)
tokenizer = AutoTokenizer.from_pretrained(
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/tokenizer",
)

# Resize embeddings to match tokenizer
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "melihcatal/codedp-cpt-models-v2",
    subfolder="starcoder2-7b/base/adapter",
)
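
The wrapped model then behaves like any causal LM; for example (the prompt is illustrative):

# Generate a completion with the adapted model.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))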

Notes

  • Qwen1.5-MoE requires the Transformers backend (--model hf in lm-eval); vLLM's MoE routing produces incorrect output for this model.
  • DP collapse at 8B scale: Llama-3-8B, Llama-3.1-8B, and Qwen3-8B DP variants collapse to 0% on HumanEval. StarCoder2-7B, Granite-tiny, and Llama-3.2-3B DP variants retain utility.
  • All DP runs target ε=3 or ε=8 with δ=1e-5.
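
For a sense of what these budgets translate to in DP-SGD terms, here is a sketch using Opacus's accountant utilities to back out a noise multiplier for a target (ε, δ); the sample rate and step count below are placeholders, not the values used in training:

from opacus.accountants.utils import get_noise_multiplier

# Placeholder geometry; the actual batch/dataset sizes live in each run's
# resolved_config.yaml and are not reproduced here.
sigma = get_noise_multiplier(
    target_epsilon=8.0,
    target_delta=1e-5,
    sample_rate=0.01,  # assumption: batch_size / dataset_size
    steps=2000,        # assumption
)
print(f"noise multiplier for (eps=8, delta=1e-5): {sigma:.2f}")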

Evaluation

Evaluated on:

  • HumanEval (openai_humaneval) — basic code completion
  • CodeDP-FC (melihcatal/codedp-bench-fc-cpt-v2) — in-domain function completion
  • BigCodeBench (bigcode/bigcodebench) — library-heavy code generation
  • Canary MIA (codedp-ase26/codedp-bench-canary-mia) — membership inference attack
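
A minimal sketch of running one of these benchmarks through lm-evaluation-harness's Python API; the task name and kwargs are assumptions against a recent lm-eval version, and the canary-tokenizer embedding resize from the loading example above is omitted here:

import lm_eval
from huggingface_hub import snapshot_download

# Download one adapter locally, since PEFT repo ids do not encode subfolders.
local = snapshot_download(
    "melihcatal/codedp-cpt-models-v2",
    allow_patterns=["starcoder2-7b/base/adapter/*"],
)
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=bigcode/starcoder2-7b,"
        f"peft={local}/starcoder2-7b/base/adapter"
    ),
    tasks=["humaneval"],
    confirm_run_unsafe_code=True,  # humaneval executes generated code
)
print(results["results"])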

Citation

@misc{codedp-cpt-models-v2,
  title={CodeDP-CPT: Differentially Private Continued Pretraining for Code Models},
  author={Catal, Melih},
  year={2026},
  url={https://huggingface.co/melihcatal/codedp-cpt-models-v2},
}