# jim-crow-laws-claude-code
A binary text classifier that flags whether a North Carolina session-law section
(1866–1967) is a Jim Crow law. Fine-tuned from `answerdotai/ModernBERT-base`
on `biglam/on_the_books`, the labeled training set from UNC Chapel Hill Libraries'
On the Books: Jim Crow and Algorithms of Resistance project.
## Intended use
- Surface candidate Jim Crow laws within historical NC session-law corpora to support archival, library, and digital-humanities work.
- Reproduce / extend the On the Books methodology on related corpora.
- Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant text classification.
The original On the Books project trained a classifier on this data and ran it over the full ~century corpus. This model is a re-training of that idea with a modern long-context encoder (ModernBERT) and is intended to be applied the same way: as a retrieval / triage tool whose flagged outputs are then reviewed by domain experts.
## Out-of-scope / limitations
- Jurisdiction: trained on North Carolina session laws only. Patterns will not transfer cleanly to other states without adaptation.
- Period: 1866–1967 legal language. Modern statutes differ substantially.
- OCR noise: training texts contain period-OCR errors; expect degraded performance on cleaner or differently-OCR'd inputs.
- Label scope: the negative class means "not flagged by the project's labeling process" — laws with discriminatory effect that the source compilations did not catalogue may be present in the negatives. Treat model predictions as candidates for review, not ground truth.
- Class imbalance: training data is ~29% positive; trained with inverse-frequency class weights to compensate.
Per the dataset's authors, the texts include slurs and dehumanising language present in the historical record. Downstream users should preserve the project's framing and not strip the historical context.
## How to use

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section
print(clf(text))
# [{'label': 'jim_crow', 'score': 0.99}]
```

Labels: `no_jim_crow` (0) and `jim_crow` (1).
## Training data

- Dataset: `biglam/on_the_books` (1,785 rows; single `train` split).
- Input field used: `section_text` (the OCR text of the labeled section). `chapter_text` and `source` were ignored; `source` would leak the label (`paschal` is 100% positive, `murray` is 92% positive).
- Split: stratified 80/20 train/eval split (seed 42): 1,428 train / 357 eval, preserving the ~29% positive rate in both.
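The split described above can be sketched in plain Python. The per-class totals (512 positive / 1,273 negative) are inferred from the reported row counts and positive rate, not stated explicitly in the card:

```python
# Sketch: stratified 80/20 split (seed 42), reproducing the reported
# row counts. Per-class totals (512 positive / 1,273 negative) are inferred.
import random

rows = [{"id": i, "label": 1 if i < 512 else 0} for i in range(1785)]
rng = random.Random(42)

train, evaluation = [], []
for label in (0, 1):
    group = [r for r in rows if r["label"] == label]
    rng.shuffle(group)
    cut = round(len(group) * 0.8)  # 80% of each class goes to train
    train += group[:cut]
    evaluation += group[cut:]

print(len(train), len(evaluation))  # -> 1428 357
```

Stratifying per class keeps the ~29% positive rate identical in both splits, which is what makes the eval metrics comparable to the train distribution.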
## Training procedure

- Base model: `answerdotai/ModernBERT-base` (~150M params, 8K context).
- Max sequence length: 1024 tokens (covers the ~95th percentile of `section_text` token lengths; the long tail is truncated).
- Loss: cross-entropy with inverse-frequency class weights computed from the training split (`[0.701, 1.741]`) to handle class imbalance.
- Hardware: a single L4 GPU via `hf jobs uv run`.
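The quoted class weights are consistent with inverse frequencies normalised by the number of classes. A minimal sketch; the per-class train counts (1,018 negative / 410 positive) are inferred from the reported weights, not stated in the card:

```python
# Sketch: inverse-frequency class weights, as described above.
# Per-class train counts (1,018 negative / 410 positive) are inferred.
counts = {0: 1018, 1: 410}       # class -> number of train examples
n_total = sum(counts.values())   # 1,428 train rows
n_classes = len(counts)

# weight_c = N / (num_classes * n_c): the rarer class gets the larger weight
weights = [n_total / (n_classes * counts[c]) for c in sorted(counts)]
print([round(w, 3) for w in weights])  # -> [0.701, 1.741]
```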
### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on jim_crow class |
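The table above maps onto a `transformers.TrainingArguments` configuration. A sketch, not the exact training script: `output_dir` is illustrative, and argument names (e.g. `eval_strategy`) assume a recent Transformers release:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameter table as a TrainingArguments config.
args = TrainingArguments(
    output_dir="jim-crow-classifier",     # illustrative
    optim="adamw_torch_fused",            # AdamW (fused), default betas/eps
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                     # 10% warmup
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    bf16=True,
    seed=42,
    eval_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_jim_crow",  # F1 on the jim_crow class
)
```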
## Training results

Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):
| Metric | Value |
|---|---|
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |
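As a quick arithmetic check (not from the card), the reported `jim_crow` F1 is the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.9352, 0.9902
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # -> 0.9619
```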
Per-epoch eval:
| Training Loss | Epoch | Step | Val Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 macro | ROC AUC |
|---|---|---|---|---|---|---|---|---|---|
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |
Held-out eval is small (357 rows; 102 positive). Treat differences in the fourth decimal as noise.
## Citation
Please cite the original On the Books project for the data and methodology:
On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44
## Framework versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2