How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dcostenco/prism-coder-8b",
	filename="",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

prism-coder:8b โ€” Tool Routing Model (iOS / Edge Tier)

Fine-tuned Qwen3-8B for 6-tool routing in the Prism AAC system. Primary deployment: iOS and edge devices via llama.cpp GGUF.

BFCL Routing Benchmark โ€” v36 (Current)

Mean: 100.0% (3-seed average, seeds 2027/2028/2029, 102 cases each)

Category Count Description Accuracy
aac 12 AAC phrase requests โ†’ plain text 100%
cmpct 6 Ledger compaction 100%
edge 6 Multi-step / compound requests 100%
hand 8 Agent handoff / relay 100%
info 5 General facts โ†’ plain text 100%
irrel 10 Irrelevant / live queries โ†’ plain text 100%
know 7 Knowledge base search 100%
load 9 Session context loading 100%
pred 8 Factual / knowledge queries โ†’ plain text 100%
save 13 Session ledger save 100%
smem 12 Session memory search 100%
tran 6 Translation requests โ†’ plain text 100%

Eval: MLX inference + thinking, temperature=0, 3-seed mean. Gate: โ‰ฅ90% = deploy.

Cascade Benchmark (May 2026)

Full desktop cascade: 14b โ†’ 32b โ†’ Claude Opus (102 cases ร— 3 seeds)

Metric Result
Cascade accuracy 100.0% (mean, 3 seeds)
Opus-solo etalon 98.3%
ฮ” vs Opus +1.7%
Traffic served by 14b 99% (101/102 cases avg)
Traffic escalated to 32b 1% (1/102 avg)
Traffic reaching Opus API 0%

Fine-tuned cascade outperforms Claude Opus on edge (+16.7%) and know (+14.3%).

Version History

Version BFCL Notes
v36 100.0% Fixed: smem "BFCL v4 notes" and "training loss" โ†’ session_search_memory
v35 98.0% Proper safetensors merge โ€” fixes mlx_lm.fuse LoRA loss
v32 98.0% Routing corpus v32_8b, direct safetensors merge
v31 95.1% Surgical smem/know boundary fix
v30 ~93% Baseline 8B routing

Tools

The model routes to exactly 6 tools:

Tool Trigger
session_load_context Load/resume project context
session_save_ledger Note/log/record/remember something
session_save_handoff Pass state to next agent/session
session_compact_ledger Shrink/prune ledger (no relay)
session_search_memory Recall prior session discussions
knowledge_search Search stored knowledge base

Plain text (no tool) for: AAC phrases, translations, weather, general facts, code, math.

Model Details

  • Base: Qwen/Qwen3-8B
  • Format: GGUF Q4_K_M (~4.9 GB)
  • Context: 32,768 tokens
  • Training: MLX LoRA, rank=16, 16 layers, 1000 iters, LR=2e-6, v36 corpus (806 examples)
  • Merge: mlx_lm.fuse โ†’ llama.cpp convert โ†’ Q4_K_M quantization

Usage

ollama pull dcostenco/prism-coder-8b
ollama run prism-coder:8b

Or in the Prism Coder IDE โ€” set model to prism-coder:8b in Settings.

Downloads last month
2,499
GGUF
Model size
8B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for dcostenco/prism-coder-8b

Finetuned
Qwen/Qwen3-8B
Quantized
(283)
this model