CCR Binary Classifier (LogReg over OpenAI Embeddings)

Document-level binary classifier for distinguishing actual Declarations of Covenants, Conditions & Restrictions (CC&Rs) from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).

Note: This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg

Model Description

This is a scikit-learn LogisticRegression trained on mean-pooled OpenAI text-embedding-3-small embeddings (1536 dimensions) of substantive document pages. It serves as the upstream "second-opinion" gate in the CCR report pipeline: it decides whether a document tagged as CCR by the upstream XGBoost page classifier is actually a Declaration of Covenants worth dispatching CCR extraction on.

Why this model exists

The XGBoost page classifier (GoverningDocs/xgboost-page-classifier) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on garbage input, producing fabricated red flags or degraded reports.

This binary classifier sits AFTER the page classifier and asks: "Yes, the page classifier flagged some pages as CCR, but is this DOCUMENT actually a Declaration of Covenants?" If not, the CCR pipeline never runs on that document.

Architecture

  • Embedding model: OpenAI text-embedding-3-small (1536 dims), the same encoder used by the production vectorstore (langchain_pg_embedding)
  • Aggregation: mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
  • Classifier head: sklearn LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)
  • Operating threshold: 0.436 (F1-maximizing on validation set)
  • Output: P(is_actual_declaration_of_covenants) ∈ [0, 1]
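
The aggregation step above can be sketched in a few lines. This is an illustrative helper, not the shipped code; the function name and the boolean boilerplate flags are assumptions, with the flags presumed to come from the upstream page-level labeling:

```python
import numpy as np

def doc_vector(page_embeddings, is_boilerplate, max_pages=20):
    """Mean-pool up to `max_pages` substantive page embeddings into a single
    document vector, skipping pages flagged as boilerplate. (Sketch.)"""
    substantive = [
        emb for emb, skip in zip(page_embeddings, is_boilerplate) if not skip
    ][:max_pages]
    return np.mean(np.asarray(substantive, dtype=np.float32), axis=0)
```

With 1536-dim page embeddings, the result is the single 1536-dim doc vector fed to the LogReg head.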

Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both converged to the same operating point at their best thresholds. LogReg won the simplicity tiebreak (smaller artifact, native probability outputs, no GPU needed). Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value.

Training Data

Trained on 7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents, produced by a multi-signal corpus relabeling pipeline:

| Signal | Coverage |
|---|---|
| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |

Page-level label distribution

| Class | Count | % |
|---|---|---|
| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
| BOILERPLATE | 2,564 | 35.9% |

Document-level binary labels

A document is labeled is_declaration_of_covenants = 1 if count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content.
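
The labeling rule above can be sketched as follows (the function name is illustrative; the label strings match the page-level classes in the table above):

```python
def label_document(page_labels, min_fraction=0.5):
    """Binary document label: 1 if DECLARATION pages make up at least
    `min_fraction` of the substantive (non-BOILERPLATE) pages, else 0."""
    substantive = [lbl for lbl in page_labels if lbl != "BOILERPLATE"]
    if not substantive:
        return 0
    frac = substantive.count("DECLARATION_OF_COVENANTS") / len(substantive)
    return int(frac >= min_fraction)
```

For example, a composite PDF with 3 Declaration pages, 5 Bylaws pages, and 4 boilerplate pages scores 3/8 and is labeled 0, even though it contains real Declaration content.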

Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.

Performance

Test set (held-out, 65 documents, 47 positives)

| Metric | Value |
|---|---|
| F1 | 0.940 |
| ROC AUC | 0.955 |
| Accuracy | 0.908 |
| Confusion matrix | [[12 TN, 6 FP], [0 FN, 47 TP]] |
| Recall | 100% (no real Declaration missed in this split) |
| Precision | 88.7% |
| Brier score | 0.134 |
| ECE (10-bin) | 0.278 |

Validation set (75 documents, 39 positives)

| Metric | Value |
|---|---|
| F1 | 0.894 |
| ROC AUC | 0.875 |
| Brier score | 0.191 |
| ECE | 0.207 |

The val/test difference reflects differing class composition across the stratified-by-document splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). Both are unbiased estimates over differently composed samples.

Cross-architecture comparison

All three Phase 1 candidates converged to the same operating point:

| Model | Val F1 | Test F1 | Test AUC |
|---|---|---|---|
| LogReg (default 0.5) | 0.85 | – | 0.875 |
| LogReg (threshold 0.436) | 0.894 | 0.940 | 0.955 |
| LogReg + Platt-calibrated | 0.894 | – | 0.874 |
| MLP (1×512 ReLU) at best threshold | 0.892 | – | 0.847 |

LogReg wins on the simplicity tiebreak.

Intended Use

Primary use case

Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision bands (recalibrated empirically; the original (0.30, 0.85) plan-time bands left FAST_PASS empty in production because real Declarations score 0.45-0.70 raw):

  • Score < 0.25: confident NOT-CCR. Skip CCR pipeline entirely. Removes the document from CCR dispatch.
  • Score >= 0.55: confident IS-CCR. Trust the classifier, fast-path bypasses the more expensive agentic detect_ccr validator.
  • 0.25 <= Score < 0.55: ambiguous. Escalate to agentic detect_ccr for a deeper look.

Out-of-scope use

  • Per-page CCR detection (this is a document-level model)
  • Multi-class document categorization (use GoverningDocs/xgboost-page-classifier for that)
  • Standalone use without the embedding pipeline (requires OpenAI text-embedding-3-small inference)

Limitations

Calibration

The raw LogReg artifact has ECE 0.19-0.28 on validation/test: predicted probabilities are systematically miscalibrated. The decision-band thresholds (0.25, 0.55) above are empirically tuned on the production score distribution, not probability-calibrated.
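
For reference, the ECE figures quoted here follow the standard binned definition: partition predictions into equal-width probability bins and take the coverage-weighted mean gap between each bin's accuracy and its mean confidence. A minimal sketch (illustrative helper, not the evaluation script):

```python
import numpy as np

def ece(y_true, probs, n_bins=10):
    """Expected calibration error: coverage-weighted mean |accuracy - confidence|
    over equal-width probability bins. (Sketch.)"""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Bin index per prediction; clip so probs == 1.0 fall in the top bin.
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        conf = probs[mask].mean()   # mean predicted probability in bin
        acc = y_true[mask].mean()   # empirical positive rate in bin
        total += mask.mean() * abs(acc - conf)
    return total
```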

A separate isotonic calibrator artifact (ccr_binary_isotonic_calibrator.joblib) ships in the same repo and reduces test-set ECE from 0.278 to 0.087 (3.2x improvement). It is purely additive metadata; the production gate still consumes raw scores. Use the calibrator if you need probability-calibrated outputs for drift monitoring, signal combination with other classifiers, or user-facing confidence display. See the "Calibration Support" section below for details.

Sample size

Trained on 465 unique documents (325 train / 75 val / 65 test). Reasonable for binary classification on dense semantic embeddings, but generalization to extreme-out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown.

Geographic distribution

Training corpus is heavily Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as a gate, not the only safety net; downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.

Composite documents

The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section + large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those.

How to Use

Inference

```python
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
from langchain_openai import OpenAIEmbeddings

# Load model + threshold + config
model_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_logreg_tuned.joblib",
)
artifact = joblib.load(model_path)
model = artifact["model"]
threshold = artifact["threshold"]  # 0.436
cfg = artifact["config"]

# Compute mean-pooled embedding for a document's first N substantive pages
embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
page_texts = [...]  # first ~5-20 substantive (non-boilerplate) pages
page_vectors = embeddings.embed_documents(page_texts)
doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)

# Predict
score = model.predict_proba(doc_vector)[0, 1]

# Three-band decision (recalibrated production bands)
if score < 0.25:
    decision = "REJECT"  # confident not a Declaration; skip CCR pipeline
elif score >= 0.55:
    decision = "FAST_PASS"  # confident Declaration; bypass agentic validator
else:
    decision = "ESCALATE"  # ambiguous; run agentic detect_ccr
```

Calibration Support

Optional isotonic calibrator (ccr_binary_isotonic_calibrator.joblib) maps raw scores to probability-calibrated outputs.

```python
calibrator_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_isotonic_calibrator.joblib",
)
cal_artifact = joblib.load(calibrator_path)
calibrator = cal_artifact["calibrator"]

# The calibrator was fit with cv="prefit" and method="isotonic" on raw
# predict_proba outputs, so the inner isotonic regressor can be applied
# directly to a float score.
inner = calibrator.calibrated_classifiers_[0].calibrators[0]
calibrated = float(inner.predict([score])[0])
```

Caveats:

  • The shipped isotonic calibrator was fit on a small (~70-doc) validation split and collapses scores onto roughly three plateau values (0.737, 0.833, 1.000). Treat calibrated scores as three-level (low / med / high) confidence rather than fine-grained probabilities.
  • The calibrator's shipped_model_filename field MUST match the model file you loaded. Cross-check before use to guard against artifact mismatch.
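
A minimal guard for that second caveat might look like this. The dict keys follow the artifact layout described in this card; the function name itself is illustrative:

```python
def check_artifact_pair(cal_artifact: dict, model_filename: str) -> None:
    """Raise if the calibrator was fit against a different model artifact
    than the one actually loaded. (Sketch.)"""
    shipped = cal_artifact.get("shipped_model_filename")
    if shipped != model_filename:
        raise ValueError(
            f"calibrator was fit for {shipped!r}, but {model_filename!r} is loaded"
        )
```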

Files in this repo

  • ccr_binary_logreg_tuned.joblib: pickled dict containing model (sklearn LogisticRegression) and config (dict with embedding_model, max_pages_per_doc, skip_boilerplate flags). The threshold field (0.436) is a Phase 1 artifact; production uses bands, not a single threshold.
  • ccr_binary_isotonic_calibrator.joblib: pickled dict containing calibrator (sklearn CalibratedClassifierCV with cv="prefit", method="isotonic"), shipped_model_filename (paired model artifact), and ECE before/after metadata.
  • config.json: JSON-readable summary of the model configuration, decision bands, and calibrator metadata.

Training Procedure

Data preparation

  1. Source: setfit_experiments PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents.
  2. Phase 0 corpus relabeling (4 stages):
    • Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
    • Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
    • Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified
    • Curation: 7,129 high-confidence pages retained
  3. Page → document aggregation: is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)
  4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
  5. Mean-pool embeddings of the first 5-20 substantive pages → 1536-dim doc vector

Training

  • Stratified split by document_id: 70% train / 15% val / 15% test (325 / 75 / 65 docs)
  • LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)
  • Training time: ~1 second on CPU
  • Threshold selection: F1-maximizing on validation set → 0.436
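
The threshold-selection step can be sketched as a simple F1 sweep over validation scores. This is an illustrative reimplementation under the definitions above, not the shipped training script:

```python
import numpy as np

def best_f1_threshold(y_true, scores, grid=None):
    """Return the (threshold, F1) pair maximizing F1 on a validation set.
    Sweeps the unique score values by default. (Sketch.)"""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    if grid is None:
        grid = np.unique(scores)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (scores >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```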

Reproducibility

Training script: https://github.com/governingdocs/backend (path: experiments/setfit_ccr_binary/scripts/train_tier1_v2.py)

Phase 0 relabeling pipeline: same repo, experiments/setfit_ccr_binary/scripts/opus_verify.py and experiments/setfit_ccr_binary/scripts/heuristics/

Findings document with full Phase 0 + Phase 1 results: experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md

Citation

If you use this model in research or production, please cite:

```bibtex
@misc{ccr-binary-logreg-2026,
  title  = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
  author = {GoverningDocs Engineering},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
}
```

Versioning

Model artifacts are versioned via HuggingFace commit history. config.json includes the corpus snapshot commit hash for reproducibility.

Maintenance

This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md (v2.2.1, Completed) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in.

Calibrator artifact added per plans/CCR_BINARY_ISOTONIC_RECALIBRATION_PLAN.md (v1.4.0). Phase 1 findings: experiments/setfit_ccr_binary/ISOTONIC_CALIBRATION_FINDINGS.md.
