## quant-eval Agent Arena — Now Live
After several months of building, the quant-eval Agent Arena is live: pbhappliedsystems/quant-eval-agent-arena
**What it is:** A side-by-side ReAct agent comparison platform running 9 independently evaluated GGUF models. Select any two models, pick an agent template, submit a query, and watch both agents reason through it in real time — with quant_eval v7.21 behavioral scores displayed alongside every response.
**Three agent templates:**
- 〔R〕 Reasoning & Analysis
- 〔D〕 Document Intelligence
- 〔C〕 Code & Automation
**The models (all Q4_K_M GGUF):**
- Qwen2.5-3B / 7B / 14B-Instruct-1M / 32B
- Ministral-3-14B-Instruct-2512
- Ministral-3-14B-Reasoning-2512
- Phi-4-reasoning-plus
- Mistral-Nemo-Instruct-2407
- Qwen3.6-27B
**What quant_eval v7.21 measures:** 42 fixture cases across 8 task families: json_multistep, stateful_followup, toolcall_only, mixed_brief_json, toolcall, json, fuzz, mcq. Every model is evaluated at both F16 and Q4_K_M precision where hardware permits, and the difference between the two sets of scores is the quantization impact report.
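To make the F16 vs Q4_K_M delta concrete, here is a minimal sketch of how a per-family difference could be computed. It is not the quant_eval implementation: the `(family, passed)` record shape, the helper names `pass_rates` and `quant_delta`, and the sample outcomes are all assumptions made up for illustration.

```python
from collections import defaultdict

def pass_rates(results):
    """Map each task family to its pass rate, given (family, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for family, passed in results:
        totals[family] += 1
        passes[family] += int(passed)
    return {family: passes[family] / totals[family] for family in totals}

def quant_delta(f16_results, q4_results):
    """Per-family pass-rate change from F16 to Q4_K_M (negative = regression)."""
    f16 = pass_rates(f16_results)
    q4 = pass_rates(q4_results)
    # Only compare families that were run at both precisions.
    return {family: q4[family] - f16[family] for family in f16 if family in q4}

# Placeholder outcomes for illustration only, not real quant_eval results.
f16_run = [("json", True), ("json", True), ("mcq", True), ("mcq", False)]
q4_run = [("json", True), ("json", False), ("mcq", True), ("mcq", False)]

delta = quant_delta(f16_run, q4_run)
for family, change in delta.items():
    print(f"{family}: {change:+.2f}")
```

A negative value means the quantized build passed fewer cases in that family than the F16 build; families run at only one precision are skipped, which matches the "where hardware permits" caveat.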
**Stack:** Gradio + llama-cpp-python (GGUF, CUDA) + custom lightweight ReAct loop + ZeroGPU (H200)
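For readers unfamiliar with the pattern, a lightweight ReAct loop can be sketched in a few lines. This is an illustrative sketch, not the Arena's code: the `Action: tool[arg]` and `Final Answer:` text formats, the `react_loop` signature, the `word_count` tool, and the scripted stand-in model are all assumptions. In a real deployment `generate` would wrap a model call (for example through llama-cpp-python) rather than a hard-coded function.

```python
import re

# Assumed text conventions for this sketch; the Arena's actual format may differ.
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)

def react_loop(generate, tools, question, max_steps=5):
    """Alternate model output and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = generate(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), transcript
        action = ACTION_RE.search(reply)
        if action:
            name, arg = action.groups()
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool: {name}"
            transcript += f"Observation: {observation}\n"
    # Step budget exhausted without a final answer.
    return None, transcript

def word_count(text):
    """Hypothetical tool: count whitespace-separated words."""
    return str(len(text.split()))

def scripted_model(transcript):
    """Stand-in for an LLM: acts once, then answers after seeing an observation."""
    if "Observation:" not in transcript:
        return ("Thought: I should count the words.\n"
                "Action: word_count[the quick brown fox]")
    return "Thought: I have the count.\nFinal Answer: 4"

answer, trace = react_loop(
    scripted_model,
    {"word_count": word_count},
    "How many words are in 'the quick brown fox'?",
)
print(trace)
```

Keeping the model behind a plain `generate(transcript) -> str` callable is one way to run the same loop against two different models side by side, since only the callable changes.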
All 18 model cards with full evaluation data are published at: @pbhappliedsystems
Feedback welcome — especially from anyone running evaluations on open-weight quantized models. This is the public-facing surface of a consulting and evaluation practice; the full agent demo is at https://pbhappliedsystems.com/assistant.html