# jim-crow-laws-claude-code
A binary text classifier that flags whether a North Carolina session-law section
(1866–1967) is a Jim Crow law. Fine-tuned from `answerdotai/ModernBERT-base`
on `biglam/on_the_books`, the labeled training set from UNC Chapel Hill Libraries'
On the Books: Jim Crow and Algorithms of Resistance project.
## Intended use
- Surface candidate Jim Crow laws within historical NC session-law corpora to support archival, library, and digital-humanities work.
- Reproduce / extend the On the Books methodology on related corpora.
- Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant text classification.
The original On the Books project trained a classifier on this data and ran it over the full ~century corpus. This model is a re-training of that idea with a modern long-context encoder (ModernBERT) and is intended to be applied the same way: as a retrieval / triage tool whose flagged outputs are then reviewed by domain experts.
## Out-of-scope / limitations
- Jurisdiction: trained on North Carolina session laws only. Patterns will not transfer cleanly to other states without adaptation.
- Period: 1866–1967 legal language. Modern statutes differ substantially.
- OCR noise: training texts contain period-OCR errors; expect degraded performance on cleaner or differently-OCR'd inputs.
- Label scope: the negative class means "not flagged by the project's labeling process" — laws with discriminatory effect that the source compilations did not catalogue may be present in the negatives. Treat model predictions as candidates for review, not ground truth.
- Class imbalance: training data is ~29% positive; trained with inverse-frequency class weights to compensate.
Per the dataset's authors, the texts include slurs and dehumanising language present in the historical record. Downstream users should preserve the project's framing and not strip the historical context.
## How to use

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section
print(clf(text))
# [{'label': 'jim_crow', 'score': 0.99}]
```

Labels: `no_jim_crow` (0) and `jim_crow` (1).
## Training data

- Dataset: `biglam/on_the_books` (1,785 rows; single `train` split).
- Input field used: `section_text` (the OCR text of the labeled section). `chapter_text` and `source` were ignored; `source` would leak the label (`paschal` is 100% positive, `murray` is 92% positive).
- Split: stratified 80/20 train/eval split (seed 42): 1,428 train / 357 eval, preserving the ~29% positive rate in both.
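The split described above can be sketched in plain Python. The per-class totals (512 positive / 1,273 negative) are inferred from the reported row counts and positive rate, not stated explicitly in the card:

```python
# Sketch: stratified 80/20 split (seed 42), reproducing the reported
# row counts. Per-class totals (512 positive / 1,273 negative) are inferred.
import random

rows = [{"id": i, "label": 1 if i < 512 else 0} for i in range(1785)]
rng = random.Random(42)

train, evaluation = [], []
for label in (0, 1):
    group = [r for r in rows if r["label"] == label]
    rng.shuffle(group)
    cut = round(len(group) * 0.8)  # 80% of each class goes to train
    train += group[:cut]
    evaluation += group[cut:]

print(len(train), len(evaluation))  # -> 1428 357
```

Stratifying per class keeps the ~29% positive rate identical in both splits, which is what makes the eval metrics comparable to the train distribution.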
## Training procedure

- Base model: `answerdotai/ModernBERT-base` (~150M params, 8K context).
- Max sequence length: 1024 tokens (covers the ~95th percentile of `section_text` token lengths; the long tail is truncated).
- Loss: cross-entropy with inverse-frequency class weights computed from the training split (`[0.701, 1.741]`) to handle class imbalance.
- Hardware: a single L4 GPU via `hf jobs uv run`.
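The quoted class weights are consistent with inverse frequencies normalised by the number of classes. A minimal sketch; the per-class train counts (1,018 negative / 410 positive) are inferred from the reported weights, not stated in the card:

```python
# Sketch: inverse-frequency class weights, as described above.
# Per-class train counts (1,018 negative / 410 positive) are inferred.
counts = {0: 1018, 1: 410}       # class -> number of train examples
n_total = sum(counts.values())   # 1,428 train rows
n_classes = len(counts)

# weight_c = N / (num_classes * n_c): the rarer class gets the larger weight
weights = [n_total / (n_classes * counts[c]) for c in sorted(counts)]
print([round(w, 3) for w in weights])  # -> [0.701, 1.741]
```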
### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on jim_crow class |
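The table above maps onto a `transformers.TrainingArguments` configuration. A sketch, not the exact training script: `output_dir` is illustrative, and argument names (e.g. `eval_strategy`) assume a recent Transformers release:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameter table as a TrainingArguments config.
args = TrainingArguments(
    output_dir="jim-crow-classifier",     # illustrative
    optim="adamw_torch_fused",            # AdamW (fused), default betas/eps
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                     # 10% warmup
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    bf16=True,
    seed=42,
    eval_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_jim_crow",  # F1 on the jim_crow class
)
```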
## Training results

Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):
| Metric | Value |
|---|---|
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |
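As a quick arithmetic check (not from the card), the reported `jim_crow` F1 is the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.9352, 0.9902
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # -> 0.9619
```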
Per-epoch eval:
| Training Loss | Epoch | Step | Val Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 macro | ROC AUC |
|---|---|---|---|---|---|---|---|---|---|
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |
Held-out eval is small (357 rows; 102 positive). Treat differences in the fourth decimal as noise.
## Citation
Please cite the original On the Books project for the data and methodology:
On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44
## Framework versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2