CRAFTFramework posted an update 1 day ago
# A measurable QA layer for LLM working sessions

Hallucination is treated as an inherent LLM failure mode, but most production workflows respond to it with "be careful, double-check things." That doesn't scale past a handful of sessions, and it doesn't catch the failure mode that does the most damage: confident reconstruction in late-session context.

CRAFT for Cowork takes a structural approach. The QA framework runs verification at four levels — individual claims, recipe execution, file integrity, and cross-session consistency — and treats trust as a measurable property rather than a vibe.

**The four-gate verification sub-routine** (RCP-CWK-024) runs before any recipe reports a result; a sketch follows the list:

1. *File-pointability* — is the claim traceable to a specific file?
2. *Read-vs-reconstructed* — was the data actually read this session, or reconstructed from memory?
3. *Lessons-Learned conflict* — does the claim contradict documented prior truth?
4. *Untested assumption* — has the claim been verified, or merely assumed?
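
A minimal sketch of how these four gates might compose, in Python; the `Claim` fields and the `four_gate_check` helper are illustrative names I'm assuming, not identifiers from the framework:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    source_file: Optional[str]      # file the claim points to, if any
    read_this_session: bool         # data read this session vs. reconstructed
    conflicts_with_lessons: bool    # contradicts documented prior truth
    verified: bool                  # verified vs. merely assumed

def four_gate_check(claim: Claim) -> list[str]:
    """Return the gates a claim fails; an empty list means it may be reported."""
    failures = []
    if claim.source_file is None:          # Gate 1: file-pointability
        failures.append("not traceable to a specific file")
    if not claim.read_this_session:        # Gate 2: read vs. reconstructed
        failures.append("reconstructed, not read this session")
    if claim.conflicts_with_lessons:       # Gate 3: Lessons-Learned conflict
        failures.append("conflicts with documented prior truth")
    if not claim.verified:                 # Gate 4: untested assumption
        failures.append("assumed, never verified")
    return failures
```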

**Confidence scoring** grades every factual claim 0-100 against a source hierarchy: evidence read from files (80-100), tool-output observation (50-79), design intent (30-49), pure reasoning (0-29). A 10-point penalty applies past 70% token usage to correct for late-session reliability decay.
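
To make the scoring rule concrete, here is a hedged sketch: the band boundaries and the 10-point penalty come from the description above, while the function name, the `strength` parameter, and the band keys are my assumptions:

```python
# Source-hierarchy bands from the post: file evidence 80-100,
# tool-output observation 50-79, design intent 30-49, pure reasoning 0-29.
BANDS = {
    "file_evidence": (80, 100),
    "tool_output": (50, 79),
    "design_intent": (30, 49),
    "pure_reasoning": (0, 29),
}

def confidence_score(source_kind: str, strength: float, token_usage: float) -> int:
    """Grade a claim 0-100: place it within its source band by `strength`
    (0.0-1.0), then apply the 10-point penalty past 70% token usage."""
    lo, hi = BANDS[source_kind]
    score = lo + strength * (hi - lo)
    if token_usage > 0.70:          # late-session reliability decay
        score -= 10
    return max(0, min(100, round(score)))
```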

**Cross-session consistency** is enforced by a longitudinal audit recipe (RCP-CWK-036) run every 5-10 sessions. It has caught ~40% drift in tracking-file state tables — drift that would otherwise propagate as silent ground truth.
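
One way such an audit could detect drift is to diff each tracking file's state table against the last audited snapshot. A sketch under the assumption that state tables are serialized as flat key-value JSON (the actual recipe's file format isn't specified here):

```python
import json
from pathlib import Path

def find_drift(snapshot_path: Path, current_path: Path) -> dict[str, tuple]:
    """Compare the current state table to the last audited snapshot and
    return every key whose value changed, appeared, or disappeared."""
    snapshot = json.loads(snapshot_path.read_text())
    current = json.loads(current_path.read_text())
    drift = {}
    for key in snapshot.keys() | current.keys():
        old, new = snapshot.get(key), current.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift
```

Dividing the number of drifted keys by the total tracked keys gives a drift ratio; a result near 0.4 would correspond to the ~40% figure above.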

**Concrete result:** A factual-claim validation pass caught nine pre-publication content files referencing the framework with an incorrect license descriptor. Single pass, all nine corrected.
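
That kind of pass can be approximated with a plain string scan. A sketch with placeholder descriptor strings, since the post doesn't say which descriptor was wrong:

```python
from pathlib import Path

WRONG = "old-license-descriptor"    # placeholder: the incorrect descriptor
CORRECT = "new-license-descriptor"  # placeholder: the corrected descriptor

def fix_license_descriptors(content_dir: Path) -> list[Path]:
    """Scan content files and rewrite any stale license descriptor."""
    fixed = []
    for path in content_dir.rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        if WRONG in text:
            path.write_text(text.replace(WRONG, CORRECT), encoding="utf-8")
            fixed.append(path)
    return fixed
```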

This is week 5 of an 8-week capability spotlight. CRAFT for Cowork is a free public beta.

Repository: https://github.com/CRAFTFramework/craft-framework
License: https://craftframework.ai/craft-license/ (Spec under BSL 1.1, converts to Apache 2.0 on Jan 1, 2029; content proprietary)

"I wrote this because I was tired of LLMs 'confidently lying' to me during long working sessions—especially when those errors started propagating into my ground-truth files.

The four-gate verification pattern has been a game-changer for my workflow, but I'm curious: How are you all handling AI reliability in your own projects? Do you use any specific 'sanity check' prompts or external validation layers, or are you mostly relying on manual code reviews?"