prism-coder:32b β€” Tool Routing Model (Desktop Quality Tier)

Fine-tuned Qwen3-30B-A3B (MoE) for 6-tool routing in the Prism AAC system. Quality escalation tier in the desktop cascade: 14B β†’ 32B β†’ cloud Claude.

v5 (May 2026): Switched base from dense Qwen3-32B to Qwen3-30B-A3B (MoE). Same accuracy, 9 GB smaller, ~4Γ— faster inference (only ~3B params active per token).

BFCL Routing Benchmark β€” v7 (Current)

Mean: 100.0% PERFECT (3-seed average, seeds 2027/2028/2029, 102 cases each)

Category Count Description Accuracy
aac 12 AAC phrase requests β†’ plain text 100%
cmpct 6 Ledger compaction 100%
edge 6 Multi-step / compound requests 100%
hand 8 Agent handoff / relay 100%
info 5 General facts β†’ plain text 100%
irrel 10 Irrelevant / live queries β†’ plain text 100%
know 7 Knowledge base search 100%
load 9 Session context loading 100%
pred 8 Factual / knowledge queries β†’ plain text 100%
save 13 Session ledger save 100%
smem 12 Session memory search 100%
tran 6 Translation requests β†’ plain text 100%

All 12 categories at 100%. No remaining failures.

Eval: MLX inference + thinking, temperature=0, 3-seed mean. Gate: β‰₯90% = deploy.

Full Cascade Benchmark (May 2026)

Individual BFCL scores (MLX, 3 seeds):

Model BFCL Size Tier
prism-coder:8b v36 100.0% PERFECT 4.7 GB Desktop / Mobile tier
prism-coder:14b v36 100.0% PERFECT 8.4 GB Desktop primary tier
prism-coder:32b v7 100.0% PERFECT 16 GB Desktop quality tier

Cascade eval: 14b β†’ 32b β†’ Claude Opus (102 cases Γ— 3 seeds)

Metric Result
Cascade accuracy 100.0% (mean, 3 seeds)
Opus-solo etalon 98.3%
Ξ” vs Opus +1.7%
Traffic served by 14b 99% (101/102 cases avg)
Traffic escalated to 32b 1% (1/102 avg) β€” catches save live state β†’ handoff edge case
Traffic reaching Opus API 0%

Fine-tuned cascade outperforms Claude Opus on edge (+16.7%) and know (+14.3%).

Version History

Version Base BFCL Notes
v7 (current) Qwen3-30B-A3B MoE 100.0% PERFECT Fixed: "what do I know + search memory" compound β†’ knowledge_search
v6 Qwen3-30B-A3B MoE 99.0% Fixed MoE merge (BF16 safetensors + correct MLX→HF key mapping)
v5 Qwen3-30B-A3B MoE 97.1% 18Γ— density fix; 9GB smaller, 4Γ— faster vs dense
v4 Qwen3-30B-A3B MoE 92.2% rank=32 experiment β€” regressed vs v3
v3 Qwen3-30B-A3B MoE 92.5% 20Γ— reps + LR=1e-5 β€” hit rank bottleneck
v2 Qwen3-30B-A3B MoE 92.5% v34 corpus + 1400 iters
v33 (dense) Qwen3-32B dense 99.0% Prior generation β€” larger/slower

Tools

The model routes between exactly 6 tools:

  1. session_load_context β€” load/fetch/resume project context
  2. session_save_ledger β€” note/log/remember/record progress
  3. session_save_handoff β€” handoff/relay to next agent/session
  4. session_compact_ledger β€” compact/archive/shrink ledger
  5. session_search_memory β€” recall past sessions/conversations
  6. knowledge_search β€” search stored notes/knowledge base

Files

File Size Use
qwen3-30b-a3b-v7-iq4nl.gguf 16 GB Current β€” recommended
qwen3-30b-a3b-v6-iq4nl.gguf 17 GB Previous (99.0%)
qwen3-30b-a3b-v5-iq4nl.gguf 17 GB Previous (97.1%)
qwen3-32b-v33-q6k.gguf 25 GB Dense predecessor (99.0%, legacy)

Usage (Ollama)

ollama run dcostenco/prism-coder:32b

Training

  • Base: Qwen/Qwen3-30B-A3B (HF BF16, ~57 GB)
  • Adapters: v6 LoRA (rank=8, scale=10, 8 layers, LR=1e-5)
  • Merge: Direct safetensors merge on HF BF16 base; delta = (scale/rank) Γ— B^T A^T for attn/gate; delta[i] = (scale/rank) Γ— B[i] A[i] for MoE experts (128 experts stacked)
  • Key fix: v5 merge used wrong base (MLX 4-bit, can't apply float LoRA delta) and uppercase regex lora_[AB] vs actual lowercase lora_a/lora_b adapter keys
  • Hardware: Apple Silicon (M-series, 64 GB RAM)
Downloads last month
3,426
GGUF
Model size
31B params
Architecture
qwen3moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for dcostenco/prism-coder-32b

Quantized
(114)
this model