| license: mit | |
| language: | |
| - en | |
| tags: | |
| - text-classification | |
| - document-classification | |
| - binary-classification | |
| - legal-documents | |
| - hoa | |
| - property-management | |
| - ccr | |
| - declaration-of-covenants | |
| - sklearn | |
| - logistic-regression | |
| pipeline_tag: text-classification | |
| library_name: scikit-learn | |
| metrics: | |
| - f1 | |
| - accuracy | |
| - roc_auc | |
| model-index: | |
| - name: ccr-binary-logreg | |
|   results: | |
|   - task: | |
|       type: text-classification | |
|       name: Document Binary Classification | |
|     metrics: | |
|     - type: f1 | |
|       value: 0.940 | |
|       name: Test F1 | |
|     - type: roc_auc | |
|       value: 0.955 | |
|       name: Test ROC AUC | |
|     - type: accuracy | |
|       value: 0.908 | |
|       name: Test Accuracy | |
| # CCR Binary Classifier (LogReg over OpenAI Embeddings) | |
| Document-level binary classifier for distinguishing actual **Declarations of Covenants, Conditions & Restrictions (CC&Rs)** from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies). | |
| > **Note:** This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg | |
| ## Model Description | |
| This is a **scikit-learn `LogisticRegression`** trained on **mean-pooled OpenAI `text-embedding-3-small` embeddings** (1536 dimensions) of substantive document pages. It serves as the "second-opinion" gate ahead of CCR extraction in the CCR report pipeline: it decides whether a document tagged as CCR by the XGBoost page classifier is actually a Declaration of Covenants worth dispatching CCR extraction on. | |
| ### Why this model exists | |
| The XGBoost page classifier (`GoverningDocs/xgboost-page-classifier`) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on garbage input, producing fabricated red flags or degraded reports. | |
| This binary classifier sits AFTER the page classifier and asks: "Yes, the page classifier flagged some pages as CCR, but is this DOCUMENT actually a Declaration of Covenants?" If not, the CCR pipeline never runs on that document. | |
| ### Architecture | |
| - **Embedding model:** OpenAI `text-embedding-3-small` (1536 dims), the same encoder used by the production vectorstore (`langchain_pg_embedding`) | |
| - **Aggregation:** mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector | |
| - **Classifier head:** sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)` | |
| - **Operating threshold:** 0.436 (F1-maximizing on validation set) | |
| - **Output:** `P(is_actual_declaration_of_covenants)` ∈ [0, 1] | |
| Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both converged to the same operating point at their best thresholds. LogReg won on the simplicity tiebreak (smaller artifact, native `predict_proba` outputs, no GPU needed). Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value. | |
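The aggregation step from the Architecture list can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the production code: the helper name `mean_pool` and the constant `MAX_PAGES_PER_DOC` are hypothetical, and the toy vectors have 3 dimensions instead of 1536. The result is equivalent to `np.mean(page_vectors, axis=0)` over the selected pages.

```python
from typing import List, Sequence

MAX_PAGES_PER_DOC = 20  # hypothetical name; the cap of 20 pages comes from the Architecture list


def mean_pool(page_vectors: Sequence[Sequence[float]],
              max_pages: int = MAX_PAGES_PER_DOC) -> List[float]:
    """Average up to `max_pages` page embeddings into one document vector."""
    selected = list(page_vectors[:max_pages])
    if not selected:
        raise ValueError("need at least one substantive page embedding")
    dim = len(selected[0])
    if any(len(vec) != dim for vec in selected):
        raise ValueError("all page embeddings must share one dimensionality")
    # zip(*selected) walks the vectors one embedding dimension at a time.
    return [sum(column) / len(selected) for column in zip(*selected)]


# Toy 3-dimensional vectors stand in for real 1536-dimensional embeddings.
toy_pages = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]]
doc_vector = mean_pool(toy_pages)
print(doc_vector)  # [2.0, 1.0, 2.0]
```

Boilerplate pages are assumed to have been filtered out before this function is called; the sketch only handles the cap and the averaging.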
| ## Training Data | |
| Trained on **7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents**, produced by a multi-signal corpus relabeling pipeline: | |
| | Signal | Coverage | | |
| |---|---| | |
| | Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases | | |
| | Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence | | |
| | Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning | | |
| ### Page-level label distribution | |
| | Class | Count | % | | |
| |---|---|---| | |
| | DECLARATION_OF_COVENANTS | 3,014 | 42.3% | | |
| | AUXILIARY_HOA_DOC | 1,551 | 21.8% | | |
| | BOILERPLATE | 2,564 | 35.9% | | |
| ### Document-level binary labels | |
| A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content. | |
| Final binary class balance: 42.3% positive / 57.7% negative across 465 documents. | |
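The document-level labeling rule above can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the label strings are taken from the page-level distribution table, and returning 0 for a document with no substantive pages is an assumption the card does not specify.

```python
from collections import Counter
from typing import Iterable


def is_declaration_of_covenants(page_labels: Iterable[str], min_share: float = 0.5) -> int:
    """Return 1 when DECLARATION pages make up at least `min_share` of non-BOILERPLATE pages."""
    counts = Counter(page_labels)
    substantive = sum(n for label, n in counts.items() if label != "BOILERPLATE")
    if substantive == 0:
        # Assumption: a document with no substantive pages is treated as negative.
        return 0
    return int(counts["DECLARATION_OF_COVENANTS"] / substantive >= min_share)


# A bundled PDF: 4 Declaration pages, 6 Bylaws-style pages, 5 boilerplate pages.
composite = (["DECLARATION_OF_COVENANTS"] * 4
             + ["AUXILIARY_HOA_DOC"] * 6
             + ["BOILERPLATE"] * 5)
print(is_declaration_of_covenants(composite))  # 0, because 4 / 10 < 0.5
```

Boilerplate pages are excluded from the denominator, so a Declaration padded with recording stamps and signature blocks is still labeled positive.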
| ## Performance | |
| ### Test set (held-out, 65 documents, 47 positives) | |
| | Metric | Value | | |
| |---|---| | |
| | **F1** | **0.940** | | |
| | ROC AUC | 0.955 | | |
| | Accuracy | 0.908 | | |
| | Confusion matrix | `[[12 TN, 6 FP], [0 FN, 47 TP]]` | | |
| | **Recall** | **100%** (no real Declaration missed in this test split) | | |
| | Precision | 88.7% | | |
| | Brier score | 0.134 | | |
| | ECE (10-bin) | 0.278 | | |
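As a consistency check, the precision, recall, F1, and accuracy figures in the test-set table follow directly from the reported confusion matrix. The snippet below only redoes that arithmetic; it does not touch the model.

```python
# Counts taken from the test-set confusion matrix reported above.
tn, fp, fn, tp = 12, 6, 0, 47

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.887 recall=1.000 f1=0.940 accuracy=0.908
```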
| ### Validation set (75 documents, 39 positives) | |
| | Metric | Value | | |
| |---|---| | |
| | F1 | 0.894 | | |
| | ROC AUC | 0.875 | | |
| | Brier score | 0.191 | | |
| | ECE | 0.207 | | |
| The val/test gap reflects different class composition across the document-level splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). Each is an estimate under a different class mix, so neither should be read alone as the headline number. | |
| ### Cross-architecture comparison | |
| All three Phase 1 candidates converged to the same operating point: | |
| | Model | Val F1 | Test F1 | Test AUC | | |
| |---|---|---|---| | |
| | LogReg (default 0.5) | 0.85 | n/a | 0.875 | | |
| | **LogReg (threshold 0.436)** | **0.894** | **0.940** | **0.955** | | |
| | LogReg + Platt-calibrated | 0.894 | n/a | 0.874 | | |
| | MLP (1×512 ReLU) at best threshold | 0.892 | n/a | 0.847 | | |
| LogReg wins on simplicity-tiebreak. | |
| ## Intended Use | |
| ### Primary use case | |
| Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision bands (recalibrated empirically; the original `(0.30, 0.85)` plan-time bands left FAST_PASS empty in production because real Declarations score 0.45-0.70 raw): | |
| - **Score < 0.25**: confident NOT-CCR. Skip CCR pipeline entirely. Removes the document from CCR dispatch. | |
| - **Score >= 0.55**: confident IS-CCR. Trust the classifier, fast-path bypasses the more expensive agentic `detect_ccr` validator. | |
| - **0.25 <= Score < 0.55**: ambiguous. Escalate to agentic `detect_ccr` for a deeper look. | |
| ### Out-of-scope use | |
| - Per-page CCR detection (this is a document-level model) | |
| - Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that) | |
| - Standalone use without the embedding pipeline (requires OpenAI text-embedding-3-small inference) | |
| ## Limitations | |
| ### Calibration | |
| The raw LogReg artifact has an ECE of 0.207 on validation and 0.278 on test, so its predicted probabilities are systematically miscalibrated. The decision-band thresholds `(0.25, 0.55)` above are **empirically tuned on the production score distribution, not probability-calibrated**. | |
| A separate isotonic calibrator artifact (`ccr_binary_isotonic_calibrator.joblib`) ships in the same repo and reduces test-set ECE from 0.278 to 0.087 (3.2x improvement). It is **purely additive metadata**: the production gate still consumes raw scores. Use the calibrator if you need probability-calibrated outputs for drift monitoring, signal combination with other classifiers, or user-facing confidence display. See the "Calibration Support" section below for details. | |
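For readers who want to reproduce an ECE figure on their own scored documents, here is a minimal sketch of a 10-bin expected calibration error. It assumes equal-width bins and compares the mean predicted positive-class probability with the observed positive rate in each bin; the card does not state which ECE variant was used, so treat the function (and its name) as illustrative rather than as the exact metric behind the numbers above.

```python
from typing import Sequence


def expected_calibration_error(y_true: Sequence[int],
                               y_prob: Sequence[float],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE for positive-class probabilities (assumed variant)."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index].append((label, prob))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        # Weight each bin's calibration gap by its share of the samples.
        ece += len(members) / total * abs(observed - predicted)
    return ece


# Synthetic scores for illustration only; not data from this model.
ece = expected_calibration_error([1, 1, 0, 1], [0.65, 0.65, 0.65, 0.95])
print(round(ece, 3))  # 0.025
```

With splits of 65 to 75 documents, most bins hold only a handful of samples, so ECE values computed this way are noisy.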
| ### Sample size | |
| Trained on 465 unique documents (325 train / 75 val / 65 test). Reasonable for binary classification on dense semantic embeddings, but generalization to extreme-out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown. | |
| ### Geographic distribution | |
| Training corpus is heavily Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as the gate, not the only safety net: downstream agentic validation plus the output-layer grounding filter (PR-B) catch edge cases. | |
| ### Composite documents | |
| The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section + large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those. | |
| ## How to Use | |
| ### Inference | |
| ```python | |
| import joblib | |
| import numpy as np | |
| from huggingface_hub import hf_hub_download | |
| from langchain_openai import OpenAIEmbeddings | |
| # Load model + threshold + config (only load pickle files from sources you trust) | |
| model_path = hf_hub_download( | |
|     repo_id="GoverningDocs/ccr-binary-logreg", | |
|     filename="ccr_binary_logreg_tuned.joblib", | |
| ) | |
| artifact = joblib.load(model_path) | |
| model = artifact["model"] | |
| threshold = artifact["threshold"]  # 0.436 (Phase 1 artifact; production uses the bands below) | |
| cfg = artifact["config"] | |
| # Compute mean-pooled embedding for a document's first N substantive pages | |
| # (OpenAIEmbeddings reads the OPENAI_API_KEY environment variable) | |
| embeddings = OpenAIEmbeddings(model=cfg["embedding_model"]) | |
| page_texts = [...]  # placeholder: replace with the first ~5-20 substantive (non-boilerplate) page strings | |
| page_vectors = embeddings.embed_documents(page_texts) | |
| doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1) | |
| # Predict the positive-class score | |
| score = model.predict_proba(doc_vector)[0, 1] | |
| # Three-band decision (recalibrated production bands) | |
| if score < 0.25: | |
|     decision = "REJECT"  # confident not a Declaration; skip CCR pipeline | |
| elif score >= 0.55: | |
|     decision = "FAST_PASS"  # confident Declaration; bypass agentic validator | |
| else: | |
|     decision = "ESCALATE"  # ambiguous; run agentic detect_ccr | |
| ``` | |
| ### Calibration Support | |
| Optional isotonic calibrator (`ccr_binary_isotonic_calibrator.joblib`) maps raw scores to probability-calibrated outputs. | |
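| ```python | |
| import joblib | |
| from huggingface_hub import hf_hub_download | |
| calibrator_path = hf_hub_download( | |
|     repo_id="GoverningDocs/ccr-binary-logreg", | |
|     filename="ccr_binary_isotonic_calibrator.joblib", | |
| ) | |
| cal_artifact = joblib.load(calibrator_path)  # only load pickle files from sources you trust | |
| calibrator = cal_artifact["calibrator"] | |
| # Guard against artifact mismatch (assumes the field stores the bare model filename) | |
| assert cal_artifact["shipped_model_filename"] == "ccr_binary_logreg_tuned.joblib" | |
| # Use the public API on the doc_vector built in the Inference snippet. | |
| # CalibratedClassifierCV can be fit on decision_function outputs when the base estimator has one | |
| # (LogisticRegression does), so this avoids guessing which input the inner isotonic map expects. | |
| calibrated = float(calibrator.predict_proba(doc_vector)[0, 1]) | |
| ``` | |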
| **Caveats:** | |
| - The shipped isotonic was fit on a small (~70-doc) validation split and produces approximately 3 plateau outputs (0.737, 0.833, 1.000). Treat calibrated scores as 3-level (low / med / high) confidence rather than fine-grained probabilities. | |
| - The calibrator's `shipped_model_filename` field MUST match the model file you loaded. Cross-check before use to guard against artifact mismatch. | |
| ### Files in this repo | |
| - `ccr_binary_logreg_tuned.joblib`: pickled dict containing `model` (sklearn LogisticRegression) and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags). The `threshold` field (0.436) is a Phase 1 artifact; production uses bands, not a single threshold. | |
| - `ccr_binary_isotonic_calibrator.joblib`: pickled dict containing `calibrator` (sklearn `CalibratedClassifierCV` with `cv="prefit"`, `method="isotonic"`), `shipped_model_filename` (paired model artifact), and ECE before/after metadata. | |
| - `config.json`: JSON-readable summary of the model configuration, decision bands, and calibrator metadata. | |
| ## Training Procedure | |
| ### Data preparation | |
| 1. Source: `setfit_experiments` PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents. | |
| 2. Phase 0 corpus relabeling (4 stages): | |
|    - Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative) | |
|    - Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need | |
|    - Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified | |
|    - Curation: 7,129 high-confidence pages retained | |
| 3. Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)` | |
| 4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out | |
| 5. Mean-pool embeddings of first 5-20 substantive pages → 1536-dim doc vector | |
| ### Training | |
| - Stratified split by `document_id`: 70% train / 15% val / 15% test (325 / 75 / 65 docs) | |
| - `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)` | |
| - Training time: ~1 second on CPU | |
| - Threshold selection: F1-maximizing on validation set → 0.436 | |
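The threshold-selection step can be illustrated with a plain sweep over candidate cut-offs. This is a sketch, not the training script: the function names are hypothetical, the scores are synthetic, predictions use `score >= threshold` (matching the `>=` convention in the inference snippet), and ties are broken toward the lower threshold, which is an assumption.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 when every score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Try each observed score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    # Ties on F1 go to the lower threshold (assumption: favor recall for a gate).
    best = max(candidates, key=lambda t: (f1_at_threshold(y_true, scores, t), -t))
    return best, f1_at_threshold(y_true, scores, best)


# Synthetic validation labels and scores for illustration only.
val_labels = [0, 0, 1, 0, 1, 1]
val_scores = [0.10, 0.30, 0.40, 0.45, 0.60, 0.80]
threshold, f1 = best_f1_threshold(val_labels, val_scores)
print(threshold, round(f1, 3))  # 0.4 0.857
```

Because the cut-off is tuned on the validation split, the F1 it achieves there is optimistic; the held-out test split is what gives an independent read.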
| ### Reproducibility | |
| Training script: https://github.com/governingdocs/backend (path: `experiments/setfit_ccr_binary/scripts/train_tier1_v2.py`) | |
| Phase 0 relabeling pipeline: same repo, `experiments/setfit_ccr_binary/scripts/opus_verify.py` and `experiments/setfit_ccr_binary/scripts/heuristics/` | |
| Findings document with full Phase 0 + Phase 1 results: `experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md` | |
| ## Citation | |
| If you use this model in research or production, please cite: | |
| ``` | |
| @misc{ccr-binary-logreg-2026, | |
| title = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions}, | |
| author = {GoverningDocs Engineering}, | |
| year = {2026}, | |
| publisher = {HuggingFace}, | |
| howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}} | |
| } | |
| ``` | |
| ## Versioning | |
| Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility. | |
| ## Maintenance | |
| This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See `plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md` (v2.2.1, Completed) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in. | |
| Calibrator artifact added per `plans/CCR_BINARY_ISOTONIC_RECALIBRATION_PLAN.md` (v1.4.0). Phase 1 findings: `experiments/setfit_ccr_binary/ISOTONIC_CALIBRATION_FINDINGS.md`. | |