Qwen3.5-DeltaCoder-9B
Reliable tool-calling for agentic coding — LoRA fine-tune of Qwen3.5-9B v1.1-DPO released — DPO alignment improves code correctness and self-verification. If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.
Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls — the kind that coding agents like OpenCode, Pi, and Cline depend on.
v1.1-DPO adds Direct Preference Optimization to further improve code correctness — the model now self-corrects its own bugs rather than submitting wrong answers.
Downloads
| Format | Link | Size |
|---|---|---|
| GGUF Q4_K_M (recommended) | HuggingFace | ~5.5 GB |
| GGUF Q5_K_M | HuggingFace | ~6.5 GB |
| GGUF BF16 | HuggingFace | ~17.9 GB |
| DPO LoRA adapter | HuggingFace | ~700 MB |
The Problem
Jackrong's Qwen3.5-9B reasoning distill scores 53.7% on HumanEval — best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:
tool=edit, error=JSON Parse error: Property name must be a string literal
tool=bash, error=JSON Parse error: Expected '}'
DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.
What's New in v1.1-DPO
- Self-correcting behavior — detects and fixes its own bugs during agentic tasks
- Improved code correctness — trained on 4,519 preference pairs from AceCode-V2-122K
- Two-stage merge — v1 SFT tool-calling improvements + DPO code quality improvements combined
- 13 GGUF quants — from Q2_K to BF16, covering all VRAM configurations
Training Details
v1 — SFT (Tool-Call Reliability)
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-9B (hybrid GDN architecture) |
| Method | LoRA (r=64, alpha=32) |
| Dataset | CoderForge-Preview filtered_reward1 (50K subset) |
| Sequence length | 4096 |
| Effective batch size | 16 |
| Learning rate | 1e-4 (cosine) |
| Epochs | 1 |
| Hardware | NVIDIA H200 140GB (Vast.ai) |
| Training time | ~10 hours |
| Final loss | ~0.94 |
v1.1 — DPO (Code Correctness)
| Parameter | Value |
|---|---|
| Method | DPO (Direct Preference Optimization) |
| Dataset | AceCode-V2-122K — 4,519 preference pairs |
| Pair generation | 10K problems × 8 samples, keep if ≥1 pass AND ≥1 fail (45% keep rate) |
| Beta | 0.1 |
| Loss type | sigmoid |
| Learning rate | 5e-6 (cosine) |
| Effective batch size | 16 |
| Hardware | NVIDIA H100 80GB (Vast.ai) |
| Training time | ~3.7 hours |
| Final loss | 0.538 |
| Rewards/margins (final) | ~1.0 |
| Rewards/accuracies (final) | ~80% |
LoRA Target Modules
All major weight matrices adapted across the hybrid architecture:
- Full Attention (8/32 layers):
q_proj,k_proj,v_proj,o_proj - Gated Delta Net (24/32 layers):
in_proj_qkv,in_proj_z,in_proj_b,in_proj_a,out_proj - MLP (all 32 layers):
gate_proj,up_proj,down_proj
Usage
Ollama
ollama create deltacoder -f Modelfile
llama.cpp / ik_llama.cpp
./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja
With PEFT (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
"Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")
Benchmarks
| Model | HumanEval | HumanEval+ | Terminal-Bench Easy |
|---|---|---|---|
| Jackrong Qwen3.5-9B-v2 (base) | 53.7% | — | — |
| DeltaCoder-9B v1 (temp=0.6) | 50.6% | 49.4% | 2/4 (50%) |
| DeltaCoder-9B v1.1-DPO (temp=0.6) | TBD | TBD | 2/4 (50%)* |
*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly — behavioral improvement confirmed, re-evaluating with extended timeout.
Recommended Sampling Settings
| Parameter | Value |
|---|---|
| temperature | 0.6 |
| top_k | 20 |
| top_p | 0.95 |
| min_p | 0.0 |
| presence_penalty | 0.0 |
| repeat_penalty | 1.0 |
Do not use temperature below 0.5 — low temperatures cause deterministic looping in multi-turn agentic use.
KV Cache Quantization
| Context Length | KV Cache | VRAM (Q4_K_M) | Generation Speed |
|---|---|---|---|
| 102,400 | f16/q4_0 | ~8.5 GB | ~111 tok/s |
| 131,072 | f16/q4_0 | ~9.1 GB | ~110 tok/s |
Key Findings
Qwen3.5 is a VLM — Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).
Do not use
flash_attention_2with sample packing on Qwen3.5 — training loss goes to 0. Useattn_implementation="eager"instead.
- Qwen3.5 uses Gated Delta Networks — include
in_proj_qkv,in_proj_z,in_proj_b,in_proj_a,out_projin LoRA target modules or 75% of attention layers are untrained - DPO pairs generated on-policy using
Qwen/Qwen3.5-9Bbase with vLLM async inference (32 concurrent requests) - Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)
Project Structure
scripts/
train_unsloth.py # v1 SFT training
train_dpo.py # v1.1 DPO training (HF + PEFT + TRL)
generate_dpo_pairs.py # Async on-policy pair generation
merge_and_export_dpo.py # Two-stage merge + GGUF export
Status
- v1 SFT fine-tune (CoderForge, H200, ~10hrs)
- GGUF export (all quants Q2_K → BF16)
- HumanEval benchmarking (50.6% / 49.4%)
- Terminal-Bench evaluation (2/4 easy tasks)
- DPO pair generation (4,519 pairs from AceCode-V2-122K)
- v1.1-DPO training (H100, ~3.7hrs)
- v1.1-DPO GGUF export + HuggingFace release
- v1.1-DPO HumanEval benchmarking
- v1.1-DPO Terminal-Bench extended timeout evaluation
Acknowledgements
- Unsloth for Qwen3.5 SFT training support
- Together AI for the CoderForge dataset
- TIGER Lab for AceCode-V2-122K
- Jackrong for the reasoning distillation
- Qwen for the base model