	---
	license: mit
	language:
	- en
	tags:
	- text-classification
	- document-classification
	- binary-classification
	- legal-documents
	- hoa
	- property-management
	- ccr
	- declaration-of-covenants
	- sklearn
	- logistic-regression
	pipeline_tag: text-classification
	library_name: scikit-learn
	metrics:
	- f1
	- accuracy
	- roc_auc
	model-index:
	- name: ccr-binary-logreg
	  results:
	  - task:
	      type: text-classification
	      name: Document Binary Classification
	    metrics:
	    - type: f1
	      value: 0.940
	      name: Test F1
	    - type: roc_auc
	      value: 0.955
	      name: Test ROC AUC
	    - type: accuracy
	      value: 0.908
	      name: Test Accuracy
	---

	# CCR Binary Classifier (LogReg over OpenAI Embeddings)

	Document-level binary classifier for distinguishing actual Declarations of Covenants, Conditions & Restrictions (CC&Rs) from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).

	> Note: This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg

	## Model Description

	This is a scikit-learn `LogisticRegression` trained on mean-pooled OpenAI `text-embedding-3-small` embeddings (1536 dimensions) of substantive document pages. It serves as the "second-opinion" gate upstream of the CCR report pipeline: it decides whether a document that the XGBoost page classifier tagged as CCR is actually a Declaration of Covenants worth dispatching CCR extraction on.

	### Why this model exists

	The XGBoost page classifier (`GoverningDocs/xgboost-page-classifier`) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on garbage input and produces fabricated red flags or degraded reports.

	This binary classifier sits AFTER the page classifier and asks: "Yes the page classifier flagged some pages as CCR — but is this DOCUMENT actually a Declaration of Covenants?" If no, the CCR pipeline never runs on that document.

	### Architecture

	- Embedding model: OpenAI `text-embedding-3-small` (1536 dims) — same encoder used by the production vectorstore (`langchain_pg_embedding`)
	- Aggregation: mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
	- Classifier head: sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)`
	- Operating threshold: 0.436 (F1-maximizing on validation set)
	- Output: `P(is_actual_declaration_of_covenants)` ∈ [0, 1]
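
The aggregation and classifier head listed above can be sketched on synthetic data as follows. This is a minimal illustration, not the training script: `pool_document` and `fake_document` are hypothetical helpers, and the random vectors only stand in for real `text-embedding-3-small` output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EMBED_DIM = 1536        # dimensionality of text-embedding-3-small
MAX_PAGES_PER_DOC = 20  # page cap from the Architecture list


def pool_document(page_vectors, max_pages=MAX_PAGES_PER_DOC):
    """Mean-pool up to `max_pages` page embeddings into one document vector."""
    pages = np.asarray(page_vectors, dtype=np.float64)[:max_pages]
    if pages.ndim != 2 or pages.shape[0] == 0:
        raise ValueError("expected a non-empty (n_pages, dim) array")
    return pages.mean(axis=0)


# Synthetic stand-in for real embeddings (assumption): two shifted Gaussians.
rng = np.random.default_rng(0)


def fake_document(is_declaration):
    n_pages = int(rng.integers(5, 30))
    shift = 0.05 if is_declaration else -0.05
    return rng.normal(loc=shift, scale=1.0, size=(n_pages, EMBED_DIM))


labels = np.array([0, 1] * 30)
X = np.stack([pool_document(fake_document(y)) for y in labels])

# Same hyperparameters as the classifier head described above.
clf = LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)
clf.fit(X, labels)

# Column 1 of predict_proba is P(is_actual_declaration_of_covenants).
scores = clf.predict_proba(X)[:, 1]
```

Because pooling happens before classification, the classifier only ever sees one fixed-size vector per document, regardless of page count.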

	Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both reached nearly the same validation F1 at their best thresholds (0.894 vs 0.892). LogReg won on the simplicity tiebreak (smaller artifact, direct probabilistic outputs, no GPU needed); note that its raw probabilities are still miscalibrated, as described under Limitations. Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value.

	## Training Data

	Trained on 7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents, produced by a multi-signal corpus relabeling pipeline:

	| Signal | Coverage |
	|---|---|
	| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
	| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
	| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |

	### Page-level label distribution

	| Class | Count | % |
	|---|---|---|
	| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
	| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
	| BOILERPLATE | 2,564 | 35.9% |

	### Document-level binary labels

	A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content.
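
The labeling rule can be sketched as a small function over a document's page labels. The function name and the handling of all-boilerplate documents (returning `None`) are assumptions for illustration; the source does not say how that edge case is treated.

```python
from collections import Counter

DECLARATION = "DECLARATION_OF_COVENANTS"
AUXILIARY = "AUXILIARY_HOA_DOC"
BOILERPLATE = "BOILERPLATE"


def is_declaration_of_covenants(page_labels, min_fraction=0.5):
    """Return 1 if DECLARATION pages are at least `min_fraction` of the
    non-BOILERPLATE pages, else 0. Returns None when the document has no
    substantive pages (assumption: the ratio is undefined in that case)."""
    counts = Counter(page_labels)
    substantive = sum(n for label, n in counts.items() if label != BOILERPLATE)
    if substantive == 0:
        return None
    return int(counts[DECLARATION] / substantive >= min_fraction)


# A composite PDF: 12 Declaration pages, 8 Bylaws pages, 5 signature pages.
composite = [DECLARATION] * 12 + [AUXILIARY] * 8 + [BOILERPLATE] * 5
label = is_declaration_of_covenants(composite)  # 12 / 20 = 0.6, so label is 1
```

Boilerplate pages are excluded from the denominator, so long signature or recording-stamp sections do not dilute the ratio.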

	Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.

	## Performance

	### Test set (held-out, 65 documents, 47 positives)

	| Metric | Value |
	|---|---|
	| F1 | 0.940 |
	| ROC AUC | 0.955 |
	| Accuracy | 0.908 |
	| Confusion matrix | `[[12 TN, 6 FP], [0 FN, 47 TP]]` |
	| Recall | 100% (no real Declaration missed on this test split) |
	| Precision | 88.7% |
	| Brier score | 0.134 |
	| ECE (10-bin) | 0.278 |
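
The threshold-dependent metrics in this table follow directly from the confusion matrix. The arithmetic below is a worked check using only the four counts reported above; specificity is not listed in the table and is derived here from the same counts.

```python
# Counts from the test-set confusion matrix above.
tn, fp, fn, tp = 12, 6, 0, 47

precision = tp / (tp + fp)                          # 47 / 53
recall = tp / (tp + fn)                             # 47 / 47
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 59 / 65
specificity = tn / (tn + fp)                        # 12 / 18

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f} specificity={specificity:.3f}")
```

The specificity of 12/18 shows that one in three auxiliary documents in this split still scores above the 0.436 threshold, which is why the perfect recall should be read together with the precision figure.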

	### Validation set (75 documents, 39 positives)

	| Metric | Value |
	|---|---|
	| F1 | 0.894 |
	| ROC AUC | 0.875 |
	| Brier score | 0.191 |
	| ECE | 0.207 |

	The val/test difference reflects different class composition across the stratified-by-document splits: val landed on a harder subset (52% positive) and test on an easier one (72% positive). Both are held-out estimates on small samples (75 and 65 documents), so read the two together as a range rather than relying on either alone.

	### Cross-architecture comparison

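	All three Phase 1 candidates reached nearly the same validation F1 at their best thresholds:

	| Model | Val F1 | Test F1 | Test AUC |
	|---|---|---|---|
	| LogReg (default 0.5) | 0.85 | n/a | 0.875 |
	| LogReg (threshold 0.436) | 0.894 | 0.940 | 0.955 |
	| LogReg + Platt-calibrated | 0.894 | n/a | 0.874 |
	| MLP (1×512 ReLU) at best threshold | 0.892 | n/a | 0.847 |

	ROC AUC does not depend on the decision threshold, so the two plain LogReg rows cannot differ in AUC on the same split. The 0.875 in the first row matches the validation AUC reported above, which suggests that the AUC values in rows without a Test F1 are validation-set figures rather than test-set figures.

	LogReg wins on the simplicity tiebreak.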

	## Intended Use

	### Primary use case

	Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. The decision bands below were recalibrated empirically: the original `(0.30, 0.85)` plan-time bands left FAST_PASS empty in production because real Declarations score 0.45-0.70 raw.

	- Score < 0.25: confident NOT-CCR. Skip CCR pipeline entirely. Removes the document from CCR dispatch.
	- Score >= 0.55: confident IS-CCR. Trust the classifier, fast-path bypasses the more expensive agentic `detect_ccr` validator.
	- 0.25 <= Score < 0.55: ambiguous. Escalate to agentic `detect_ccr` for a deeper look.

	### Out-of-scope use

	- Per-page CCR detection (this is a document-level model)
	- Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that)
	- Standalone use without the embedding pipeline (requires OpenAI text-embedding-3-small inference)

	## Limitations

	### Calibration

	The raw LogReg artifact has an ECE of 0.207 on validation and 0.278 on test, so its predicted probabilities are systematically miscalibrated. The decision-band thresholds `(0.25, 0.55)` above are empirically tuned on the production score distribution, not probability-calibrated.

	A separate isotonic calibrator artifact (`ccr_binary_isotonic_calibrator.joblib`) ships in the same repo and reduces test-set ECE from 0.278 to 0.087 (3.2x improvement). It is purely additive metadata — the production gate still consumes raw scores. Use the calibrator if you need probability-calibrated outputs for drift monitoring, signal combination with other classifiers, or user-facing confidence display. See the "Calibration Support" section below for details.
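
For readers who want to reproduce calibration figures on their own scores, here is a minimal sketch of a 10-bin ECE and a Brier score. It assumes equal-width bins over the positive-class probability; the source does not state which ECE variant was used, so results may not match the reported numbers exactly.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE on the positive-class probability (assumed variant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1 and a score
    # of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |observed positive rate - mean predicted probability| in the bin,
            # weighted by the fraction of samples that fall in the bin.
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 label."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


# Toy case: every document scores 0.6 but all are positives, so the
# model is under-confident by 0.4 in its single occupied bin.
toy_ece = expected_calibration_error([1, 1, 1, 1, 1], [0.6] * 5)
toy_brier = brier_score([1, 1, 1, 1, 1], [0.6] * 5)
```

The toy case mirrors the pattern described in Intended Use, where real Declarations cluster at raw scores of 0.45-0.70: systematically under-confident scores on true positives produce a large ECE even when ranking quality is good.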

	### Sample size

	Trained on 465 unique documents (325 train / 75 val / 65 test). Reasonable for binary classification on dense semantic embeddings, but generalization to extreme-out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown.

	### Geographic distribution

	The training corpus is weighted heavily toward Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as a gate, not the only safety net: downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.

	### Composite documents

	The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section + large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those.

	## How to Use

	### Inference

	```python
	import joblib
	import numpy as np
	from huggingface_hub import hf_hub_download
	from langchain_openai import OpenAIEmbeddings

	# Load model + threshold + config
	model_path = hf_hub_download(
	    repo_id="GoverningDocs/ccr-binary-logreg",
	    filename="ccr_binary_logreg_tuned.joblib",
	)
	artifact = joblib.load(model_path)
	model = artifact["model"]
	threshold = artifact["threshold"]  # 0.436 (Phase 1 artifact; production uses the bands below)
	cfg = artifact["config"]

	# Compute mean-pooled embedding for a document's first N substantive pages
	embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
	page_texts = [...]  # placeholder: replace with the first ~5-20 substantive (non-boilerplate) page texts
	page_vectors = embeddings.embed_documents(page_texts)
	doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)

	# Predict P(is_actual_declaration_of_covenants)
	score = model.predict_proba(doc_vector)[0, 1]

	# Three-band decision (recalibrated production bands)
	if score < 0.25:
	    decision = "REJECT"  # confident not a Declaration; skip CCR pipeline
	elif score >= 0.55:
	    decision = "FAST_PASS"  # confident Declaration; bypass agentic validator
	else:
	    decision = "ESCALATE"  # ambiguous; run agentic detect_ccr
	```

	### Calibration Support

	Optional isotonic calibrator (`ccr_binary_isotonic_calibrator.joblib`) maps raw scores to probability-calibrated outputs.

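	```python
	# Reuses joblib, hf_hub_download, doc_vector, and score from the inference snippet above.
	calibrator_path = hf_hub_download(
	    repo_id="GoverningDocs/ccr-binary-logreg",
	    filename="ccr_binary_isotonic_calibrator.joblib",
	)
	cal_artifact = joblib.load(calibrator_path)
	calibrator = cal_artifact["calibrator"]

	# Apply isotonic to a raw score (cv="prefit" + method="isotonic" + binary
	# fits on raw predict_proba outputs, so we can apply directly to a float).
	# CalibratedClassifierCV calibrates on decision_function when the base
	# estimator exposes one, so cross-check this shortcut once against
	# calibrator.predict_proba(doc_vector)[0, 1] before relying on it.
	inner = calibrator.calibrated_classifiers_[0].calibrators[0]
	calibrated = float(inner.predict([score])[0])
	```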

	Caveats:
	- The shipped isotonic was fit on a small (~70-doc) validation split and produces approximately 3 plateau outputs (0.737, 0.833, 1.000). Treat calibrated scores as 3-level (low / med / high) confidence rather than fine-grained probabilities.
	- The calibrator's `shipped_model_filename` field MUST match the model file you loaded. Cross-check before use to guard against artifact mismatch.
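
Both caveats can be enforced in code. The sketch below is illustrative only: `check_calibrator_pairing` and `confidence_level` are hypothetical helpers, and the mapping of the three plateau values to low / med / high (in ascending order) is an assumption based on the description above.

```python
# Plateau outputs reported for the shipped isotonic calibrator.
# Assumption: they map to low / med / high in ascending order.
PLATEAUS = {"low": 0.737, "med": 0.833, "high": 1.000}


def check_calibrator_pairing(cal_artifact, loaded_model_filename):
    """Raise if the calibrator was fit against a different model artifact."""
    expected = cal_artifact.get("shipped_model_filename")
    if expected is None:
        raise KeyError("calibrator artifact has no 'shipped_model_filename' field")
    if expected != loaded_model_filename:
        raise ValueError(
            f"calibrator is paired with {expected!r}, "
            f"but the loaded model is {loaded_model_filename!r}"
        )
    return True


def confidence_level(calibrated):
    """Snap a calibrated score to the nearest plateau's coarse level."""
    return min(PLATEAUS, key=lambda name: abs(PLATEAUS[name] - calibrated))


# Usage with an in-memory stand-in for the loaded calibrator dict.
demo_artifact = {"shipped_model_filename": "ccr_binary_logreg_tuned.joblib"}
check_calibrator_pairing(demo_artifact, "ccr_binary_logreg_tuned.joblib")
level = confidence_level(0.833)
```

Running the pairing check immediately after `joblib.load` keeps a stale calibrator from silently being applied to scores from a retrained model.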

	### Files in this repo

	- `ccr_binary_logreg_tuned.joblib` — pickled dict containing `model` (sklearn LogisticRegression) and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags). The `threshold` field (0.436) is a Phase 1 artifact; production uses bands, not a single threshold.
	- `ccr_binary_isotonic_calibrator.joblib` — pickled dict containing `calibrator` (sklearn `CalibratedClassifierCV` with `cv="prefit"`, `method="isotonic"`), `shipped_model_filename` (paired model artifact), and ECE before/after metadata.
	- `config.json` — JSON-readable summary of the model configuration, decision bands, and calibrator metadata.

	## Training Procedure

	### Data preparation

	1. Source: `setfit_experiments` PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents.
	2. Phase 0 corpus relabeling (4 stages):
	   - Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
	   - Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
	   - Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified
	   - Curation: 7,129 high-confidence pages retained
	3. Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)`
	4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
	5. Mean-pool embeddings of first 5-20 substantive pages → 1536-dim doc vector

	### Training

	- Stratified split by `document_id`: nominally 70% train / 15% val / 15% test (325 / 75 / 65 docs, i.e. about 70% / 16% / 14%)
	- `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)`
	- Training time: ~1 second on CPU
	- Threshold selection: F1-maximizing on validation set → 0.436
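
The split, fit, and threshold-selection steps can be sketched end to end on synthetic document vectors. This is not the training script referenced under Reproducibility: the data is random, the vectors use 32 dimensions for speed, `best_f1_threshold` is a hypothetical helper, and `train_test_split` with these fractions yields 325 / 70 / 70 rows rather than the 325 / 75 / 65 reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one row per document (assumption), 32 dims instead of 1536.
rng = np.random.default_rng(42)
n_docs, dim = 465, 32
y = (rng.random(n_docs) < 0.42).astype(int)
X = rng.normal(size=(n_docs, dim)) + 0.4 * y[:, None]

# One row per document means no document can land in two splits.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)
clf.fit(X_train, y_train)


def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the F1-maximizing one."""
    candidates = np.unique(scores)
    f1s = [f1_score(y_true, (scores >= t).astype(int)) for t in candidates]
    best = int(np.argmax(f1s))
    return float(candidates[best]), float(f1s[best])


val_scores = clf.predict_proba(X_val)[:, 1]
threshold, val_f1 = best_f1_threshold(y_val, val_scores)
```

Selecting the threshold on validation scores only, and leaving the test split untouched until the end, is what keeps the reported test metrics free of threshold-tuning leakage.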

	### Reproducibility

	Training script: https://github.com/governingdocs/backend (path: `experiments/setfit_ccr_binary/scripts/train_tier1_v2.py`)

	Phase 0 relabeling pipeline: same repo, `experiments/setfit_ccr_binary/scripts/opus_verify.py` and `experiments/setfit_ccr_binary/scripts/heuristics/`

	Findings document with full Phase 0 + Phase 1 results: `experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md`

	## Citation

	If you use this model in research or production, please cite:

	```
	@misc{ccr-binary-logreg-2026,
	  title = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
	  author = {GoverningDocs Engineering},
	  year = {2026},
	  publisher = {HuggingFace},
	  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
	}
	```

	## Versioning

	Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility.

	## Maintenance

	This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See `plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md` (v2.2.1, Completed) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in.

	Calibrator artifact added per `plans/CCR_BINARY_ISOTONIC_RECALIBRATION_PLAN.md` (v1.4.0). Phase 1 findings: `experiments/setfit_ccr_binary/ISOTONIC_CALIBRATION_FINDINGS.md`.