jim-crow-laws-claude-code

A binary text classifier that flags whether a North Carolina session-law section (1866–1967) is a Jim Crow law. Fine-tuned from answerdotai/ModernBERT-base on biglam/on_the_books, the labeled training set from UNC Chapel Hill Libraries' On the Books: Jim Crow and Algorithms of Resistance project.

Intended use

  • Surface candidate Jim Crow laws within historical NC session-law corpora to support archival, library, and digital-humanities work.
  • Reproduce / extend the On the Books methodology on related corpora.
  • Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant text classification.

The original On the Books project trained a classifier on this data and ran it over the full ~century corpus. This model revisits that approach with a modern long-context encoder (ModernBERT) and is meant to be applied the same way: as a retrieval / triage tool whose flagged outputs are then reviewed by domain experts.

Out-of-scope / limitations

  • Jurisdiction: trained on North Carolina session laws only. Patterns will not transfer cleanly to other states without adaptation.
  • Period: 1866–1967 legal language. Modern statutes differ substantially.
  • OCR noise: training texts contain period-OCR errors; expect degraded performance on cleaner or differently-OCR'd inputs.
  • Label scope: the negative class means "not flagged by the project's labeling process" — laws with discriminatory effect that the source compilations did not catalogue may be present in the negatives. Treat model predictions as candidates for review, not ground truth.
  • Class imbalance: the training data is ~29% positive; the model was trained with inverse-frequency class weights to compensate.

Per the dataset's authors, the texts include slurs and dehumanising language present in the historical record. Downstream users should preserve the project's framing and not strip the historical context.

How to use

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section
print(clf(text))
# [{'label': 'jim_crow', 'score': 0.99}]
```

Labels: no_jim_crow (0) and jim_crow (1).
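By default the pipeline returns only the top label; passing `top_k=None` returns scores for both labels, which is more useful for triage. A minimal sketch of a review-queue filter over such outputs — the `flag_candidates` helper and the 0.5 threshold are illustrative choices, not part of the model release:

```python
# Triage helper: keep the indices of sections whose jim_crow score
# clears a review threshold. Operates on pipeline outputs obtained
# with top_k=None (one list of label/score dicts per input).
def flag_candidates(results, threshold=0.5):
    """results: list of [{'label': ..., 'score': ...}, ...], one per input."""
    flagged = []
    for i, scores in enumerate(results):
        jc = next(s["score"] for s in scores if s["label"] == "jim_crow")
        if jc >= threshold:
            flagged.append((i, jc))
    return flagged

# Example with mocked pipeline outputs:
mock = [
    [{"label": "jim_crow", "score": 0.97}, {"label": "no_jim_crow", "score": 0.03}],
    [{"label": "no_jim_crow", "score": 0.88}, {"label": "jim_crow", "score": 0.12}],
]
print(flag_candidates(mock))  # [(0, 0.97)]
```

Lowering the threshold trades precision for recall, which suits the intended use: flagged sections go to expert review, so missing a candidate is costlier than a false positive.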

Training data

  • Dataset: biglam/on_the_books (1,785 rows; single train split).
  • Input field used: section_text (the OCR text of the labeled section). chapter_text and source were ignored — source would leak the label (paschal is 100% positive, murray is 92% positive).
  • Split: stratified 80/20 train/eval split (seed 42) — 1,428 train / 357 eval, preserving the ~29% positive rate in both.
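A minimal sketch of the stratified split described above, using scikit-learn on toy data — the real split operates on the dataset's `section_text` and label columns; the toy counts below simply mirror the ~29% positive rate:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataset: 100 sections, ~29% positive.
texts = [f"section {i}" for i in range(100)]
labels = [1] * 29 + [0] * 71

# Stratified 80/20 split with a fixed seed, as in the card.
X_tr, X_ev, y_tr, y_ev = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_tr), len(X_ev))  # 80 20
```

Stratification preserves the positive rate in both splits, which matters here because the positive class is the one being evaluated.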

Training procedure

  • Base model: answerdotai/ModernBERT-base (~150M params, 8K context).
  • Max sequence length: 1024 tokens (covers ~95th percentile of section_text token lengths; long-tail truncated).
  • Loss: cross-entropy with inverse-frequency class weights computed from the training split ([0.701, 1.741]) to handle class imbalance.
  • Hardware: trained on a single L4 GPU via hf jobs uv run.
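The inverse-frequency weights above follow w_c = n_total / (n_classes · n_c). A quick check, with class counts inferred from the reported 1,428-row train split and ~29% positive rate (approximations, not the exact release counts):

```python
# Inverse-frequency class weights: w_c = n_total / (n_classes * n_c).
# Class counts below are inferred from the card, not release artifacts.
n_neg, n_pos = 1018, 410
n_total = n_neg + n_pos

w_neg = n_total / (2 * n_neg)
w_pos = n_total / (2 * n_pos)
print(round(w_neg, 3), round(w_pos, 3))  # 0.701 1.741
```

In training, such weights would typically be passed to the loss (e.g. the `weight` argument of `torch.nn.CrossEntropyLoss`) so errors on the rarer jim_crow class count ~2.5× more.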

Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on jim_crow class |
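For reproduction, the hyperparameters above map onto the Hugging Face `Trainer` configuration roughly as follows — a hedged sketch, where `output_dir` and the metric key `"f1_jim_crow"` are illustrative assumptions rather than published settings:

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments matching the table above.
args = TrainingArguments(
    output_dir="jim-crow-classifier",    # placeholder path
    optim="adamw_torch_fused",
    learning_rate=3e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    bf16=True,
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_jim_crow",  # assumed metric key
    greater_is_better=True,
)
```

The class-weighted loss is not a `TrainingArguments` field; it would require subclassing `Trainer` to override `compute_loss`.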

Training results

Best checkpoint selected by f1_jim_crow on the held-out eval split (epoch 3):

| Metric | Value |
| --- | --- |
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |

Per-epoch eval:

| Training loss | Epoch | Step | Val loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 (macro) | ROC AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |

Held-out eval is small (357 rows; 102 positive). Treat differences in the fourth decimal as noise.
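A rough sense of that noise: with only 102 positives, the binomial standard error of the reported recall is about one percentage point — a back-of-envelope check, not a formal confidence interval:

```python
import math

# Binomial standard error of recall estimated on 102 positives:
# se = sqrt(p * (1 - p) / n)
p, n = 0.9902, 102
se = math.sqrt(p * (1 - p) / n)
print(round(se, 4))  # 0.0098
```

So differences smaller than roughly ±0.01 in the positive-class metrics are within sampling noise on this eval set.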

Citation

Please cite the original On the Books project for the data and methodology:

On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44

Framework versions

  • Transformers 5.7.0
  • PyTorch 2.11.0+cu130
  • Datasets 4.8.5
  • Tokenizers 0.22.2