# A measurable QA layer for LLM working sessions
Hallucination is treated as an inherent LLM failure mode, but most production workflows respond to it with "be careful, double-check things." That doesn't scale past a handful of sessions, and it doesn't catch the failure mode that does the most damage: confident reconstruction in late-session context.
CRAFT for Cowork takes a structural approach. The QA framework runs verification at four levels — individual claims, recipe execution, file integrity, and cross-session consistency — and treats trust as a measurable property rather than a vibe.
**The four-gate verification sub-routine** (RCP-CWK-024) runs before any recipe reports a result:
1. *File-pointability*: can the claim be traced to a specific file?
2. *Read-vs-reconstructed*: was the data actually read this session, or reconstructed?
3. *Lessons-Learned conflict*: does the claim contradict documented prior truth?
4. *Untested assumption*: has the claim been verified, or only assumed?
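The four gates above can be sketched as a simple checklist over a claim record. This is a minimal illustration, not the RCP-CWK-024 implementation: the `Claim` fields and the `failed_gates` helper are hypothetical names chosen for this example, and the real recipe may gather this information differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """Hypothetical record of one factual claim (field names are assumptions)."""
    text: str
    source_file: Optional[str] = None    # file the claim points to, if any
    read_this_session: bool = False      # was that file actually read this session?
    conflicts_with_lessons: bool = False # contradicts a documented prior finding?
    verified: bool = False               # tested, as opposed to assumed


def failed_gates(claim: Claim) -> List[str]:
    """Return the names of the gates the claim fails, in gate order."""
    failures = []
    if claim.source_file is None:
        failures.append("file-pointability")
    if not claim.read_this_session:
        failures.append("read-vs-reconstructed")
    if claim.conflicts_with_lessons:
        failures.append("lessons-learned-conflict")
    if not claim.verified:
        failures.append("untested-assumption")
    return failures
```

A claim with no recorded source, such as `Claim("the config uses port 8080")`, fails gates 1, 2, and 4 in this sketch. A recipe would report its result only when `failed_gates` returns an empty list.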
**Confidence scoring** grades every factual claim 0-100 against a source hierarchy: evidence read from files (80-100), tool-output observation (50-79), design intent (30-49), pure reasoning (0-29). A 10-point penalty applies past 70% token usage to correct for late-session reliability decay.
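As a rough sketch of how the bands and the late-session penalty combine: the band boundaries, the 10-point penalty, and the 70% threshold come from the description above, while the function name, the source labels, and the choice to let a penalized score fall below its band (floored at 0) are assumptions made for illustration.

```python
# Score bands per source type, as listed in the post.
SOURCE_BANDS = {
    "file_evidence": (80, 100),
    "tool_output": (50, 79),
    "design_intent": (30, 49),
    "pure_reasoning": (0, 29),
}

LATE_SESSION_THRESHOLD = 0.70  # fraction of the token budget used
LATE_SESSION_PENALTY = 10      # points subtracted past the threshold


def score_claim(source: str, base_score: int, token_usage: float) -> int:
    """Validate base_score against its source band, then apply the penalty.

    Assumption: the penalty applies strictly past 70% usage, and the
    penalized score may drop below the band but never below 0.
    """
    low, high = SOURCE_BANDS[source]
    if not low <= base_score <= high:
        raise ValueError(f"{source} scores must fall in {low}-{high}")
    if token_usage > LATE_SESSION_THRESHOLD:
        base_score -= LATE_SESSION_PENALTY
    return max(0, base_score)
```

Under these assumptions, `score_claim("file_evidence", 90, 0.50)` returns 90, while the same claim made at 75% token usage returns 80.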
**Cross-session consistency** is enforced by a longitudinal audit recipe (RCP-CWK-036) run every 5-10 sessions. It has caught ~40% drift in tracking-file state tables — drift that would otherwise propagate as silent ground truth.
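The post does not say how drift is measured, so the following is only one plausible reading: the share of state-table entries whose recorded value no longer matches what is observed on disk. The `drift_ratio` function and the dictionary representation of a state table are hypothetical.

```python
_MISSING = object()  # sentinel so a missing entry never equals a stored None


def drift_ratio(recorded: dict, observed: dict) -> float:
    """Fraction of entries that differ between the tracking file and reality.

    An entry counts as drifted if its value differs or if it is present
    on only one side. Returns 0.0 when both tables are empty.
    """
    keys = set(recorded) | set(observed)
    if not keys:
        return 0.0
    drifted = [
        k for k in keys
        if recorded.get(k, _MISSING) != observed.get(k, _MISSING)
    ]
    return len(drifted) / len(keys)
```

For example, with `recorded = {"A": "done", "B": "open", "C": "open", "D": "done", "E": "open"}` and `observed = {"A": "done", "B": "done", "C": "open", "D": "done"}`, entries B and E have drifted, so `drift_ratio(recorded, observed)` returns 0.4.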
**Concrete result:** A factual claim validation pass caught nine pre-publication content files referencing the framework with an incorrect license descriptor. Single pass, all nine corrected.
This is week 5 of an 8-week capability spotlight. CRAFT for Cowork is a free public beta.
Repository: https://github.com/CRAFTFramework/craft-framework
License: https://craftframework.ai/craft-license/ (Spec under BSL 1.1, converts to Apache 2.0 on Jan 1, 2029; content proprietary)