# CCR Binary Classifier (LogReg over OpenAI Embeddings)
Document-level binary classifier for distinguishing actual Declarations of Covenants, Conditions & Restrictions (CC&Rs) from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).
Note: This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg
## Model Description
This is a scikit-learn LogisticRegression trained on mean-pooled OpenAI text-embedding-3-small embeddings (1536 dimensions) of substantive document pages. It serves as the upstream "second-opinion" gate in the CCR report pipeline: it decides whether a document tagged as CCR by the upstream XGBoost page classifier is actually a Declaration of Covenants worth dispatching CCR extraction on.
### Why this model exists
The XGBoost page classifier (GoverningDocs/xgboost-page-classifier) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on garbage input, producing fabricated red flags or degraded reports.
This binary classifier sits AFTER the page classifier and asks: "Yes, the page classifier flagged some pages as CCR, but is this DOCUMENT actually a Declaration of Covenants?" If no, the CCR pipeline never runs on that document.
## Architecture
- Embedding model: OpenAI `text-embedding-3-small` (1536 dims), the same encoder used by the production vectorstore (`langchain_pg_embedding`)
- Aggregation: mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
- Classifier head: sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)`
- Operating threshold: 0.436 (F1-maximizing on validation set)
- Output: `P(is_actual_declaration_of_covenants)` ∈ [0, 1]
Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both converged to the same operating point at their best thresholds. LogReg won on the simplicity tiebreak (smaller artifact, natively calibrated probabilities, no GPU needed). Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value.
## Training Data
Trained on 7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents, produced by a multi-signal corpus relabeling pipeline:
| Signal | Role |
|---|---|
| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |
### Page-level label distribution
| Class | Count | % |
|---|---|---|
| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
| BOILERPLATE | 2,564 | 35.9% |
### Document-level binary labels
A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content.
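The aggregation rule can be sketched as follows. This is an illustrative helper, not the project's actual script; the function and variable names are hypothetical.

```python
def doc_binary_label(page_labels: list[str]) -> int:
    """Document is positive iff DECLARATION pages make up at least half
    of the substantive (non-BOILERPLATE) pages."""
    substantive = [l for l in page_labels if l != "BOILERPLATE"]
    declaration = sum(l == "DECLARATION_OF_COVENANTS" for l in substantive)
    return int(declaration / len(substantive) >= 0.5)
```

For example, a bundle whose substantive pages are one Declaration page and three Bylaws pages falls below the 0.5 ratio and is labeled negative.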
Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.
## Performance
### Test set (held-out, 65 documents, 47 positives)
| Metric | Value |
|---|---|
| F1 | 0.940 |
| ROC AUC | 0.955 |
| Accuracy | 0.908 |
| Confusion matrix | [[12 TN, 6 FP], [0 FN, 47 TP]] |
| Recall | 100% (never misses a real Declaration in this composition) |
| Precision | 88.7% |
| Brier score | 0.134 |
| ECE (10-bin) | 0.278 |
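As a sanity check, the headline metrics follow directly from the confusion matrix above:

```python
# Test-set confusion matrix: [[12 TN, 6 FP], [0 FN, 47 TP]]
tn, fp, fn, tp = 12, 6, 0, 47

precision = tp / (tp + fp)                          # 47/53
recall = tp / (tp + fn)                             # 47/47
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)          # 59/65
```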
### Validation set (75 documents, 39 positives)
| Metric | Value |
|---|---|
| F1 | 0.894 |
| ROC AUC | 0.875 |
| Brier score | 0.191 |
| ECE | 0.207 |
The val/test difference reflects class composition across the stratified-by-document splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). Both are unbiased estimates of performance on their respective mixes.
### Cross-architecture comparison
All three Phase 1 candidates converged to the same operating point:
| Model | Val F1 | Test F1 | Test AUC |
|---|---|---|---|
| LogReg (default 0.5) | 0.85 | – | 0.875 |
| LogReg (threshold 0.436) | 0.894 | 0.940 | 0.955 |
| LogReg + Platt-calibrated | 0.894 | – | 0.874 |
| MLP (1×512 ReLU) at best threshold | 0.892 | – | 0.847 |
LogReg wins on the simplicity tiebreak.
## Intended Use
### Primary use case
Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision bands (recalibrated empirically; the original (0.30, 0.85) plan-time bands left FAST_PASS empty in production because real Declarations score 0.45-0.70 raw):
- Score < 0.25: confident NOT-CCR. Skip the CCR pipeline entirely; the document is removed from CCR dispatch.
- Score >= 0.55: confident IS-CCR. Trust the classifier; the fast path bypasses the more expensive agentic `detect_ccr` validator.
- 0.25 <= Score < 0.55: ambiguous. Escalate to agentic `detect_ccr` for a deeper look.
### Out-of-scope use
- Per-page CCR detection (this is a document-level model)
- Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that)
- Standalone use without the embedding pipeline (requires OpenAI `text-embedding-3-small` inference)
## Limitations
### Calibration
The raw LogReg artifact has ECE 0.19-0.28 on validation/test: predicted probabilities are systematically miscalibrated. The decision-band thresholds (0.25, 0.55) above are empirically tuned on the production score distribution, not probability-calibrated.
A separate isotonic calibrator artifact (`ccr_binary_isotonic_calibrator.joblib`) ships in the same repo and reduces test-set ECE from 0.278 to 0.087 (3.2x improvement). It is purely additive metadata; the production gate still consumes raw scores. Use the calibrator if you need probability-calibrated outputs for drift monitoring, signal combination with other classifiers, or user-facing confidence display. See the "Calibration Support" section below for details.
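For reference, the 10-bin ECE reported here can be computed as follows. This is a standard equal-width-bin implementation, not necessarily the project's exact evaluation script.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin
    |mean predicted probability - empirical positive rate|, weighted by bin size."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to an equal-width bin over [0, 1]
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece
```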
### Sample size
Trained on 465 unique documents (325 train / 75 val / 65 test). Reasonable for binary classification on dense semantic embeddings, but generalization to extreme-out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown.
### Geographic distribution
Training corpus is heavily Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as the gate, not the only safety net; downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.
### Composite documents
The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section + large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those.
## How to Use
### Inference
```python
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
from langchain_openai import OpenAIEmbeddings

# Load model + threshold + config
model_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_logreg_tuned.joblib",
)
artifact = joblib.load(model_path)
model = artifact["model"]
threshold = artifact["threshold"]  # 0.436
cfg = artifact["config"]

# Compute mean-pooled embedding for a document's first N substantive pages
embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
page_texts = [...]  # first ~5-20 substantive (non-boilerplate) pages
page_vectors = embeddings.embed_documents(page_texts)
doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)

# Predict
score = model.predict_proba(doc_vector)[0, 1]

# Three-band decision (recalibrated production bands)
if score < 0.25:
    decision = "REJECT"      # confident not a Declaration; skip CCR pipeline
elif score >= 0.55:
    decision = "FAST_PASS"   # confident Declaration; bypass agentic validator
else:
    decision = "ESCALATE"    # ambiguous; run agentic detect_ccr
```
## Calibration Support
An optional isotonic calibrator (`ccr_binary_isotonic_calibrator.joblib`) maps raw scores to probability-calibrated outputs.
```python
calibrator_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_isotonic_calibrator.joblib",
)
cal_artifact = joblib.load(calibrator_path)
calibrator = cal_artifact["calibrator"]

# Apply isotonic to a raw score (cv="prefit" + method="isotonic" + binary
# fits on raw predict_proba outputs, so we can apply directly to a float)
inner = calibrator.calibrated_classifiers_[0].calibrators[0]
calibrated = float(inner.predict([score])[0])
```
Caveats:
- The shipped isotonic was fit on a small (~70-doc) validation split and produces approximately 3 plateau outputs (0.737, 0.833, 1.000). Treat calibrated scores as 3-level (low / med / high) confidence rather than fine-grained probabilities.
- The calibrator's `shipped_model_filename` field MUST match the model file you loaded. Cross-check before use to guard against artifact mismatch.
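A minimal guard for the pairing check might look like this, assuming only the artifact keys described above; the function name is hypothetical.

```python
def check_calibrator_pairing(cal_artifact: dict, model_filename: str) -> None:
    """Refuse to apply a calibrator that was fit against a different model artifact."""
    expected = cal_artifact.get("shipped_model_filename")
    if expected != model_filename:
        raise RuntimeError(
            f"Calibrator was fit against {expected!r}, not {model_filename!r}; "
            "refusing to apply mismatched calibration."
        )
```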
## Files in this repo
- `ccr_binary_logreg_tuned.joblib`: pickled dict containing `model` (sklearn LogisticRegression) and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags). The `threshold` field (0.436) is a Phase 1 artifact; production uses bands, not a single threshold.
- `ccr_binary_isotonic_calibrator.joblib`: pickled dict containing `calibrator` (sklearn `CalibratedClassifierCV` with `cv="prefit"`, `method="isotonic"`), `shipped_model_filename` (paired model artifact), and ECE before/after metadata.
- `config.json`: JSON-readable summary of the model configuration, decision bands, and calibrator metadata.
## Training Procedure
### Data preparation
- Source: `setfit_experiments` PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents.
- Phase 0 corpus relabeling (4 stages):
  1. Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
  2. Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
  3. Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified
  4. Curation: 7,129 high-confidence pages retained
- Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)`
- Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
- Mean-pool embeddings of first 5-20 substantive pages → 1536-dim doc vector
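The sampling and pooling steps above reduce to a few lines. This is an illustrative sketch with hypothetical names, not the training script itself.

```python
import numpy as np

def doc_vector(page_vectors, page_labels, max_pages=20):
    """Mean-pool embeddings of up to `max_pages` non-BOILERPLATE pages
    into a single document vector."""
    substantive = [v for v, l in zip(page_vectors, page_labels) if l != "BOILERPLATE"]
    return np.mean(np.asarray(substantive[:max_pages]), axis=0)
```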
### Training
- Stratified split by `document_id`: 70% train / 15% val / 15% test (325 / 75 / 65 docs)
- `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)`
- Training time: ~1 second on CPU
- Threshold selection: F1-maximizing on validation set → 0.436
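F1-maximizing threshold selection amounts to a sweep over candidate cutoffs on the validation scores; a sketch (not the exact Phase 1 script):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_true, scores):
    """Return the candidate threshold with the highest F1 on (y_true, scores)."""
    candidates = np.unique(scores)
    f1s = [f1_score(y_true, scores >= t) for t in candidates]
    return float(candidates[int(np.argmax(f1s))])
```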
## Reproducibility
- Training script: https://github.com/governingdocs/backend (path: experiments/setfit_ccr_binary/scripts/train_tier1_v2.py)
- Phase 0 relabeling pipeline: same repo, experiments/setfit_ccr_binary/scripts/opus_verify.py and experiments/setfit_ccr_binary/scripts/heuristics/
- Findings document with full Phase 0 + Phase 1 results: experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md
## Citation
If you use this model in research or production, please cite:

```bibtex
@misc{ccr-binary-logreg-2026,
  title = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
  author = {GoverningDocs Engineering},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
}
```
## Versioning
Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility.
## Maintenance
This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md (v2.2.1, Completed) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in.
Calibrator artifact added per plans/CCR_BINARY_ISOTONIC_RECALIBRATION_PLAN.md (v1.4.0). Phase 1 findings: experiments/setfit_ccr_binary/ISOTONIC_CALIBRATION_FINDINGS.md.