Patrick Hill PRO

pbhappliedsystems

AI & ML interests

PBH Applied Systems publishes evaluated open-weight GGUF models for practical AI deployment, with an emphasis on quantized inference, agentic workflows, structured outputs, tool use, and production reliability.

Every model published under this organization is converted, evaluated, and documented by PBH Applied Systems using its proprietary `quant_eval` framework. The evaluation process compares full-precision and quantized variants across agent-adjacent task families including structured JSON output, tool dispatch, multi-turn state retention, mixed natural language plus JSON responses, multiple-choice extraction, fuzz-style constraint adherence, and multi-step planning.

These model cards are designed to support deployment decisions, not just model discovery. Each card documents practical behavior, quantization trade-offs, failure modes, recommended use cases, hardware requirements, and guardrails for production use.

Try the live PBH Applied Systems AI Agent Demo: https://pbhappliedsystems.com/assistant.html. The demo lets visitors interact with evaluated quantized open-weight models across reasoning, document intelligence, and code automation workflows running on private GPU infrastructure.

Posts 2

Post

🚀 New from Veritruct — and an argument about what a dataset card should be. Most synthetic datasets on the Hub ship row counts, a license, and little else. De-identification corpora are worse: their span labels are usually unverifiable. We published the opposite — two datasets, two record types, one quality-gated methodology.

① Construction-true de-identification: labels are correct by construction, because every PII value is injected at an offset the pipeline records. The spans are ground truth, not the output of a fallible tagger.
👉 pbhappliedsystems/veritruct-cloud-regulated-deid-1k
1,053 records · 7 domains · 2,452 ground-truth spans · 17 identifier types. A hybrid regex+GLiNER2 detector scores micro-F1 0.905 against the known-correct labels — and we report the free-text precision gaps, not just the wins.
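
To make "correct by construction" concrete, here is a minimal sketch of the idea, not the Veritruct pipeline itself. The `{LABEL}` slot syntax, the `Span` type, and the exact-match scoring rule are assumptions made for illustration; the cards do not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


def inject(template: str, values: dict[str, str]) -> tuple[str, list[Span]]:
    """Fill {LABEL} slots left to right and record where each value lands."""
    parts: list[str] = []
    spans: list[Span] = []
    pos = 0      # read cursor in the template
    length = 0   # number of characters written to the output so far
    while True:
        open_idx = template.find("{", pos)
        if open_idx == -1:
            parts.append(template[pos:])
            break
        close_idx = template.index("}", open_idx)
        literal = template[pos:open_idx]
        parts.append(literal)
        length += len(literal)
        label = template[open_idx + 1:close_idx]
        value = values[label]
        # The offset is known at write time, so the label cannot be wrong.
        spans.append(Span(length, length + len(value), label))
        parts.append(value)
        length += len(value)
        pos = close_idx + 1
    return "".join(parts), spans


def micro_f1(pairs: list[tuple[list[Span], list[Span]]]) -> float:
    """Micro-F1 over (gold, predicted) pairs; counts are pooled across records.

    Assumption: a prediction counts only on an exact (start, end, label) match.
    """
    tp = fp = fn = 0
    for gold, predicted in pairs:
        g, p = set(gold), set(predicted)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


values = {"NAME": "Jane Roe", "PHONE": "555-0100"}
text, gold_spans = inject("Patient {NAME} called from {PHONE}.", values)
```

Because the gold spans come from the injection step, a detector's output can be scored against them without a human or model annotator in the loop.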

② Regulated-domain instruction — same gates, same provenance.
👉 pbhappliedsystems/veritruct-studio-regulated-instruct-1K
1,014 records · 90.4% yield · MATTR 0.790 · 0.0% residual PII leak.
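
For readers unfamiliar with MATTR (moving-average type-token ratio), the sketch below shows the standard definition: the type-token ratio averaged over every sliding window of a fixed length. The window size of 50 and the whitespace tokenization are assumptions; the card's actual settings are not stated here.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Mean type-token ratio over all sliding windows of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


sample = "the quick brown fox jumps over the lazy dog".split()
diversity = mattr(sample, window=4)
```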

Every record cleared a documented cascade — dual-signal hallucination gate, template-leak gate, layered PII masking — and every number on each card is a field in the evaluation_report.json shipped beside the data. Rejections ship too, each tagged with its failing gate. Two substrates: Studio (local GPU) · Cloud (Modal + vLLM).
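
The cascade described above can be sketched as an ordered list of gates in which the first failing gate tags the rejection. This is an illustrative outline only: the two toy gates, the field names, and the report keys are hypothetical stand-ins, not the real Veritruct gates or the actual schema of evaluation_report.json.

```python
import re
from typing import Callable

# A gate is a (name, predicate) pair; the predicate returns True when the record passes.
Gate = tuple[str, Callable[[str], bool]]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

# Hypothetical toy gates for illustration only.
GATES: list[Gate] = [
    ("template_leak", lambda text: "{" not in text and "}" not in text),
    ("pii_leak", lambda text: EMAIL_RE.search(text) is None),
]


def run_cascade(records: list[str], gates: list[Gate]):
    """Split records into accepted and rejected; tag each rejection with its failing gate."""
    accepted: list[str] = []
    rejected: list[dict[str, str]] = []
    for text in records:
        failing = next((name for name, passes in gates if not passes(text)), None)
        if failing is None:
            accepted.append(text)
        else:
            rejected.append({"text": text, "failed_gate": failing})
    report = {
        "total": len(records),
        "accepted": len(accepted),
        "yield": len(accepted) / len(records) if records else 0.0,
    }
    return accepted, rejected, report


accepted, rejected, report = run_cascade(
    ["Clean summary.", "Dear {NAME}, hello.", "Contact a@b.com today."],
    GATES,
)
```

Keeping the rejected records with their gate name is what lets a reader audit the yield figure instead of taking it on trust.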

📄 Whitepaper: https://pbhappliedsystems.com/Veritruct_Quality-Gated_Synthetic_Data_Generation_for_Regulated_Industries.pdf
🔎 Overview: https://pbhappliedsystems.com/veritruct.html

CC BY 4.0 — commercial use welcome, just credit it. Need defensible synthetic data at scale? Let's talk.
— Patrick Hill, PBH Applied Systems

Post

## quant-eval Agent Arena — Now Live

After several months of building, the quant-eval Agent Arena is live: pbhappliedsystems/quant-eval-agent-arena

**What it is:** A side-by-side ReAct agent comparison platform running 9 independently evaluated GGUF models. Select any two models, pick an agent template, submit a query, and watch both agents reason through it in real time — with quant_eval v7.21 behavioral scores displayed alongside every response.

**Three agent templates:**
- 〔R〕 Reasoning & Analysis
- 〔D〕 Document Intelligence
- 〔C〕 Code & Automation

**The models (all Q4_K_M GGUF):**
- Qwen2.5-3B / 7B / 14B-Instruct-1M / 32B
- Ministral-3-14B-Instruct-2512
- Ministral-3-14B-Reasoning-2512
- Phi-4-reasoning-plus
- Mistral-Nemo-Instruct-2407
- Qwen3.6-27B

**What quant_eval v7.21 measures:** 42 fixture cases across 8 task families: json_multistep, stateful_followup, toolcall_only, mixed_brief_json, toolcall, json, fuzz, mcq. Every model is evaluated at both F16 and Q4_K_M precision where hardware permits. The difference between the two sets of scores is the quantization impact report.
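
As a rough illustration of the F16 versus Q4_K_M comparison, the sketch below computes a per-family score difference. It assumes the score is a simple pass rate over fixture cases, which may not match how quant_eval actually scores; the function names are hypothetical and the sample runs are placeholder data, not real results.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-family pass rate from (task_family, passed) fixture results."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for family, ok in results:
        totals[family] += 1
        passed[family] += int(ok)
    return {family: passed[family] / totals[family] for family in totals}


def quantization_delta(
    f16_results: list[tuple[str, bool]],
    q4_results: list[tuple[str, bool]],
) -> dict[str, float]:
    """Q4_K_M pass rate minus F16 pass rate, for families present in both runs."""
    f16 = pass_rates(f16_results)
    q4 = pass_rates(q4_results)
    return {family: q4[family] - f16[family] for family in f16 if family in q4}


# Placeholder data for illustration only; these are not measured results.
f16_run = [("json", True), ("json", True), ("mcq", True), ("mcq", False)]
q4_run = [("json", True), ("json", False), ("mcq", True), ("mcq", False)]
delta = quantization_delta(f16_run, q4_run)
```

A negative value means the quantized variant passed fewer fixtures in that family than the full-precision one.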

**Stack:** Gradio + llama-cpp-python (GGUF, CUDA) + custom lightweight ReAct loop + ZeroGPU (H200)
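
For readers new to the pattern, a lightweight ReAct loop can be as small as the sketch below. This is not the Arena's implementation: the `Action: tool[input]` and `Final Answer:` text protocol, the function names, and the scripted stand-in model are all assumptions chosen so the example runs without a GPU or a model file.

```python
import re
from typing import Callable

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)


def react(
    llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: no valid action found\n"
            continue
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return "no answer within step budget"


def scripted_llm(transcript: str) -> str:
    """Stand-in for a real model: calls one tool, then reports its observation."""
    if "Observation:" not in transcript:
        return "Thought: I should add the numbers.\nAction: add[2,3]"
    last = transcript.rstrip().rsplit("Observation: ", 1)[1]
    return f"Final Answer: {last}"


tools = {"add": lambda arg: str(sum(int(x) for x in arg.split(",")))}
answer = react(scripted_llm, tools, "What is 2 + 3?")
```

In a real deployment the `llm` callable would wrap a model call, for example a llama-cpp-python `Llama` instance, and the step budget guards against a quantized model that never emits a final answer.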

All 18 model cards with full evaluation data are published at: @pbhappliedsystems

Feedback welcome — especially from anyone running evaluations on open-weight quantized models. This is the public-facing surface of a consulting and evaluation practice; the full agent demo is at https://pbhappliedsystems.com/assistant.html