# Havelock Orality Regressor
ModernBERT-based regression model that scores text on the oral–literate spectrum (0–1), grounded in Walter Ong's Orality and Literacy (1982).
Given a passage of text, the model outputs a continuous score where higher values indicate greater orality (spoken, performative, additive discourse) and lower values indicate greater literacy (analytic, subordinative, abstract discourse).
## Model Details
| Property | Value |
|---|---|
| Base model | answerdotai/ModernBERT-base |
| Architecture | HavelockOralityRegressor (custom, mean pooling → linear) |
| Task | Single-value regression (MSE loss) |
| Output range | Continuous (not clamped) |
| Max sequence length | 512 tokens |
| Best MAE | 0.0791 |
| R² (at best MAE) | 0.748 |
| Parameters | ~149M |
## Usage

```python
import os

# Disable torch.compile before importing torch (avoids compile overhead for one-off inference)
os.environ["TORCH_COMPILE_DISABLE"] = "1"

import warnings

warnings.filterwarnings("ignore", message="Flash Attention 2 only supports")

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "HavelockAI/bert-orality-regressor"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Autocast only on CUDA. The head is not clamped, so clip the score to [0, 1] downstream.
with torch.no_grad(), torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
    score = model(**inputs).logits.squeeze().item()

print(f"Orality score: {max(0.0, min(1.0, score)):.3f}")
```
## Score Interpretation
| Score | Register |
|---|---|
| 0.8–1.0 | Highly oral — epic poetry, sermons, rap, oral storytelling |
| 0.6–0.8 | Oral-dominant — speeches, podcasts, conversational prose |
| 0.4–0.6 | Mixed — journalism, blog posts, dialogue-heavy fiction |
| 0.2–0.4 | Literate-dominant — essays, expository prose |
| 0.0–0.2 | Highly literate — academic papers, legal texts, philosophy |
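The band boundaries above can be expressed as a small helper. This is an illustrative sketch, not part of the released package; the `register_for` name is our own.

```python
def register_for(score: float) -> str:
    """Map a raw model output to the register bands in the table above.

    Scores are clamped to [0, 1] first, since the regression head is unbounded.
    """
    s = max(0.0, min(1.0, score))
    if s >= 0.8:
        return "Highly oral"
    if s >= 0.6:
        return "Oral-dominant"
    if s >= 0.4:
        return "Mixed"
    if s >= 0.2:
        return "Literate-dominant"
    return "Highly literate"
```

For example, `register_for(0.87)` returns `"Highly oral"`, and out-of-range outputs such as `1.4` are clamped into the top band.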
## Training

### Data
The model was trained on a curated corpus of documents annotated with orality scores using a multi-pass scoring system. Scores were originally on a 0–100 scale and normalized to 0–1 for training. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages, representing a range of registers from highly oral to highly literate.
An 80/20 train/test split was used (random seed 42).
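The normalization and split described above amount to the following; this is a minimal sketch with illustrative names (`normalize_and_split` is not the actual pipeline), assuming a plain shuffled split.

```python
import random

def normalize_and_split(docs, seed=42, test_frac=0.2):
    """Scale 0-100 annotator scores to 0-1 and make a seeded 80/20 split.

    `docs` is a list of (text, score_0_100) pairs.
    """
    scaled = [(text, score / 100.0) for text, score in docs]
    rng = random.Random(seed)
    rng.shuffle(scaled)
    cut = int(len(scaled) * (1 - test_frac))
    return scaled[:cut], scaled[cut:]
```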
### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 20 |
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01) |
| LR schedule | Cosine with warmup (10% of total steps) |
| Gradient clipping | 1.0 |
| Loss | MSE |
| Mixed precision | FP16 |
| Regularization | Mixout (p=0.1) |
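The LR schedule in the table has this shape: a linear ramp over the first 10% of steps, then cosine decay to zero. A plain-Python sketch of the curve (Hugging Face's `get_cosine_schedule_with_warmup` behaves similarly; the function name here is ours):

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Learning rate at a given step under cosine decay with linear warmup."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        # Linear ramp from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup)
    # Cosine decay from base_lr to 0 over the remaining steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```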
### Training Metrics
| Epoch | Loss | MAE | R² |
|---|---|---|---|
| 1 | 0.3496 | 0.1173 | 0.476 |
| 2 | 0.0286 | 0.0992 | 0.593 |
| 3 | 0.0215 | 0.0872 | 0.704 |
| 4 | 0.0144 | 0.0879 | 0.714 |
| 5 | 0.0169 | 0.0865 | 0.712 |
| 6 | 0.0117 | 0.0853 | 0.700 |
| 7 | 0.0096 | 0.0922 | 0.691 |
| 8 | 0.0094 | 0.0850 | 0.722 |
| 9 | 0.0086 | 0.0822 | 0.745 |
| 10 | 0.0064 | 0.0841 | 0.723 |
| 11 | 0.0054 | 0.0921 | 0.682 |
| 12 | 0.0050 | 0.0840 | 0.720 |
| 13 | 0.0044 | 0.0806 | 0.744 |
| 14 | 0.0037 | 0.0805 | 0.740 |
| 15 | 0.0034 | 0.0791 | 0.748 |
| 16 | 0.0033 | 0.0807 | 0.738 |
| 17 | 0.0031 | 0.0803 | 0.742 |
| 18 | 0.0026 | 0.0797 | 0.745 |
| 19 | 0.0027 | 0.0803 | 0.742 |
| 20 | 0.0029 | 0.0805 | 0.741 |
Best checkpoint selected at epoch 15 by lowest MAE.
## Architecture

Custom `HavelockOralityRegressor` with mean pooling (ModernBERT exposes no pooler output):

```
ModernBERT (answerdotai/ModernBERT-base)
└── Mean pooling over non-padded tokens
    └── Dropout (p=0.1)
        └── Linear (hidden_size → 1)
```
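The pooling step can be sketched as follows (in NumPy for clarity; the released module does the equivalent in PyTorch before dropout and the linear head, and its exact code ships with the checkpoint):

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Mean over non-padded token embeddings.

    hidden: (batch, seq, dim) last hidden states.
    attention_mask: (batch, seq), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # avoid div-by-zero
    return summed / counts
```

Padding tokens contribute nothing to the sum and are excluded from the count, so a padded batch pools to the same vectors as unpadded single examples.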
### Regularization
- Mixout (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
- Weight decay (0.01) via AdamW
- Gradient clipping (max norm 1.0)
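The Mixout step above can be sketched per weight tensor; this is an illustrative NumPy version (actual training applied it inside the PyTorch backbone's linear layers), including the inverted-dropout-style rescaling from Lee et al. (2020) that keeps the expected weight unbiased.

```python
import numpy as np

def mixout(weight, pretrained, p=0.1, rng=None):
    """Replace each weight element with its pretrained value with prob. p,
    then rescale so E[output] == weight (analogous to inverted dropout)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(weight.shape) < p          # True -> revert to pretrained
    mixed = np.where(mask, pretrained, weight)
    return (mixed - p * pretrained) / (1.0 - p)  # unbiased rescale
```

Note that if the current weights equal the pretrained weights, the output is unchanged for any p, and p=0 reduces to the identity.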
## Limitations
- No sigmoid clamping: The model can output values outside [0, 1]. Consumers should clamp if needed.
- Domain coverage: Training corpus skews historical/literary. Performance on modern social media, code-switched text, or non-English text is untested.
- Document length: Texts longer than 512 tokens are truncated. The model sees only the first ~400 words, which may not be representative of longer documents.
- Regression target subjectivity: Orality scores involve human judgment; inter-annotator agreement bounds the ceiling for model performance.
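One way to mitigate the truncation limitation is to score overlapping chunks and average. A hedged sketch with a stand-in scorer (`score_fn` represents the model call from the Usage section; the window sizes are heuristics chosen to stay under the 512-token limit, not values from the training setup):

```python
def score_long_text(text, score_fn, chunk_words=350, stride=300):
    """Score a long document by averaging scores of overlapping word windows.

    score_fn: a function from a text chunk to a float orality score.
    """
    words = text.split()
    if len(words) <= chunk_words:
        return score_fn(text)
    scores = []
    for start in range(0, len(words) - chunk_words + stride, stride):
        chunk = " ".join(words[start:start + chunk_words])
        scores.append(score_fn(chunk))
    return sum(scores) / len(scores)
```

Averaging treats the document as a mixture of registers; a max or per-chunk profile may be more informative for documents that switch register mid-way.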
## Theoretical Background
The oral–literate spectrum follows Ong's framework, which characterizes oral discourse as additive, aggregative, redundant, agonistic, empathetic, and situational, while literate discourse is subordinative, analytic, abstract, distanced, and context-free. The model learns to place text along this continuum from document-level annotations informed by 72 specific rhetorical markers (36 oral, 36 literate).
## Citation

```bibtex
@misc{havelock2026regressor,
  title={Havelock Orality Regressor},
  author={Havelock AI},
  year={2026},
  url={https://huggingface.co/HavelockAI/bert-orality-regressor}
}
```
## References
- Ong, Walter J. Orality and Literacy: The Technologizing of the Word. Routledge, 1982.
- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
- Warner, A. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.
Trained: February 2026