Qwen3.5-DeltaCoder-9B

Reliable tool-calling for agentic coding — LoRA fine-tune of Qwen3.5-9B v1.1-DPO released — DPO alignment improves code correctness and self-verification. If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.

License: Apache 2.0 Base Model HuggingFace LoRA

Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls — the kind that coding agents like OpenCode, Pi, and Cline depend on.

v1.1-DPO adds Direct Preference Optimization to further improve code correctness — the model now self-corrects its own bugs rather than submitting wrong answers.

Downloads

Format Link Size
GGUF Q4_K_M (recommended) HuggingFace ~5.5 GB
GGUF Q5_K_M HuggingFace ~6.5 GB
GGUF BF16 HuggingFace ~17.9 GB
DPO LoRA adapter HuggingFace ~700 MB

The Problem

Jackrong's Qwen3.5-9B reasoning distill scores 53.7% on HumanEval — best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:

tool=edit, error=JSON Parse error: Property name must be a string literal
tool=bash, error=JSON Parse error: Expected '}'

DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.

What's New in v1.1-DPO

  • Self-correcting behavior — detects and fixes its own bugs during agentic tasks
  • Improved code correctness — trained on 4,519 preference pairs from AceCode-V2-122K
  • Two-stage merge — v1 SFT tool-calling improvements + DPO code quality improvements combined
  • 13 GGUF quants — from Q2_K to BF16, covering all VRAM configurations

Training Details

v1 — SFT (Tool-Call Reliability)

Parameter Value
Base model Qwen3.5-9B (hybrid GDN architecture)
Method LoRA (r=64, alpha=32)
Dataset CoderForge-Preview filtered_reward1 (50K subset)
Sequence length 4096
Effective batch size 16
Learning rate 1e-4 (cosine)
Epochs 1
Hardware NVIDIA H200 140GB (Vast.ai)
Training time ~10 hours
Final loss ~0.94

v1.1 — DPO (Code Correctness)

Parameter Value
Method DPO (Direct Preference Optimization)
Dataset AceCode-V2-122K — 4,519 preference pairs
Pair generation 10K problems × 8 samples, keep if ≥1 pass AND ≥1 fail (45% keep rate)
Beta 0.1
Loss type sigmoid
Learning rate 5e-6 (cosine)
Effective batch size 16
Hardware NVIDIA H100 80GB (Vast.ai)
Training time ~3.7 hours
Final loss 0.538
Rewards/margins (final) ~1.0
Rewards/accuracies (final) ~80%

LoRA Target Modules

All major weight matrices adapted across the hybrid architecture:

  • Full Attention (8/32 layers): q_proj, k_proj, v_proj, o_proj
  • Gated Delta Net (24/32 layers): in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
  • MLP (all 32 layers): gate_proj, up_proj, down_proj

Usage

Ollama

ollama create deltacoder -f Modelfile

llama.cpp / ik_llama.cpp

./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja

With PEFT (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")

Benchmarks

Model HumanEval HumanEval+ Terminal-Bench Easy
Jackrong Qwen3.5-9B-v2 (base) 53.7%
DeltaCoder-9B v1 (temp=0.6) 50.6% 49.4% 2/4 (50%)
DeltaCoder-9B v1.1-DPO (temp=0.6) TBD TBD 2/4 (50%)*

*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly — behavioral improvement confirmed, re-evaluating with extended timeout.

Recommended Sampling Settings

Parameter Value
temperature 0.6
top_k 20
top_p 0.95
min_p 0.0
presence_penalty 0.0
repeat_penalty 1.0

Do not use temperature below 0.5 — low temperatures cause deterministic looping in multi-turn agentic use.

KV Cache Quantization

Context Length KV Cache VRAM (Q4_K_M) Generation Speed
102,400 f16/q4_0 ~8.5 GB ~111 tok/s
131,072 f16/q4_0 ~9.1 GB ~110 tok/s

Key Findings

Qwen3.5 is a VLM — Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).

Do not use flash_attention_2 with sample packing on Qwen3.5 — training loss goes to 0. Use attn_implementation="eager" instead.

  • Qwen3.5 uses Gated Delta Networks — include in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj in LoRA target modules or 75% of attention layers are untrained
  • DPO pairs generated on-policy using Qwen/Qwen3.5-9B base with vLLM async inference (32 concurrent requests)
  • Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)

Project Structure

scripts/
  train_unsloth.py          # v1 SFT training
  train_dpo.py              # v1.1 DPO training (HF + PEFT + TRL)
  generate_dpo_pairs.py     # Async on-policy pair generation
  merge_and_export_dpo.py   # Two-stage merge + GGUF export

Status

  • v1 SFT fine-tune (CoderForge, H200, ~10hrs)
  • GGUF export (all quants Q2_K → BF16)
  • HumanEval benchmarking (50.6% / 49.4%)
  • Terminal-Bench evaluation (2/4 easy tasks)
  • DPO pair generation (4,519 pairs from AceCode-V2-122K)
  • v1.1-DPO training (H100, ~3.7hrs)
  • v1.1-DPO GGUF export + HuggingFace release
  • v1.1-DPO HumanEval benchmarking
  • v1.1-DPO Terminal-Bench extended timeout evaluation

Acknowledgements

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for danielcherubini/Qwen3.5-DeltaCoder-9B

Finetuned
Qwen/Qwen3.5-9B
Adapter
(85)
this model
Quantizations
1 model

Datasets used to train danielcherubini/Qwen3.5-DeltaCoder-9B