	---
	license: mit
	tags:
	- token-classification
	- modernbert
	- orality
	- linguistics
	- multi-label
	language:
	- en
	metrics:
	- f1
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: token-classification
	library_name: transformers
	datasets:
	- custom
	---

	# Havelock Orality Token Classifier

	ModernBERT-based token classifier for detecting oral and literate markers in text, based on Walter Ong's "Orality and Literacy" (1982).

	This model performs multi-label span-level detection of 53 rhetorical marker types, where each token independently carries B/I/O labels per type — allowing overlapping spans (e.g. a token that is simultaneously part of a concessive and a nested clause).

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base model | `answerdotai/ModernBERT-base` |
	| Task | Multi-label token classification (independent B/I/O per type) |
	| Marker types | 53 (22 oral, 31 literate) |
	| Test macro F1 | 0.378 (per-type detection, binary positive = B or I) |
	| Training | 20 epochs, fp16 |
	| Regularization | Mixout (p=0.1): stochastic L2 anchor to pretrained weights |
	| Loss | Per-type focal loss (γ=2.0) with inverse-frequency OBI and type weights |
	| Min examples | 150 (types below this threshold excluded) |

	## Usage
	```python
	import json
	import torch
	from transformers import AutoModel, AutoTokenizer
	from huggingface_hub import hf_hub_download

	model_name = "HavelockAI/bert-token-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
	model.eval()

	# Load the marker type map (type name -> index) and invert it
	type_map_path = hf_hub_download(model_name, "type_to_idx.json")
	with open(type_map_path) as f:
	    type_to_idx = json.load(f)
	idx_to_type = {v: k for k, v in type_to_idx.items()}

	text = "Tell me, O Muse, of that ingenious hero who travelled far and wide"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

	with torch.no_grad():
	    logits = model(**inputs)  # (1, seq_len, num_types, 3)
	preds = logits.argmax(dim=-1)  # (1, seq_len, num_types)

	# Print every token that carries at least one B or I label
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	for i, token in enumerate(tokens):
	    active = [
	        f"{idx_to_type[t]}={'OBI'[v]}"
	        for t, v in enumerate(preds[0, i].tolist())
	        if v > 0
	    ]
	    if active:
	        print(f"{token:15} {', '.join(active)}")
	```

	> Note: This model uses a custom architecture (`HavelockTokenClassifier`) with independent B/I/O heads per marker type, enabling overlapping span detection. Loading requires `trust_remote_code=True`.
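
The usage example prints labels token by token. To work with spans instead, the per-type O/B/I grid can be collapsed into `(type, start, end)` triples. The helper below is a minimal sketch, not part of the released code: `decode_spans` is a hypothetical name, the label indices (O=0, B=1, I=2) are inferred from the `'OBI'[v]` lookup above, and treating a stray I as the start of a span is one lenient choice among several.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # index order inferred from the 'OBI'[v] lookup in the usage example


def decode_spans(
    preds: List[List[int]], idx_to_type: Dict[int, str]
) -> List[Tuple[str, int, int]]:
    """Collapse a (seq_len, num_types) grid of O/B/I ids into spans.

    Returns (type_name, start, end) triples with `end` exclusive.
    Each type is decoded independently, so spans of different types may overlap.
    An I with no open span of the same type starts a new span (lenient decoding).
    """
    spans: List[Tuple[str, int, int]] = []
    num_types = len(preds[0]) if preds else 0
    for t in range(num_types):
        start = None  # start index of the currently open span, if any
        for i, row in enumerate(preds):
            label = row[t]
            if label == B or (label == I and start is None):
                if start is not None:  # a new B closes the previous span
                    spans.append((idx_to_type[t], start, i))
                start = i
            elif label == O and start is not None:
                spans.append((idx_to_type[t], start, i))
                start = None
        if start is not None:  # span runs to the end of the sequence
            spans.append((idx_to_type[t], start, len(preds)))
    return spans


# Toy grid: 4 tokens, 2 types, with the two spans overlapping at token 1
toy_preds = [[B, O], [I, B], [O, I], [O, O]]
toy_types = {0: "oral_vocative", 1: "oral_imperative"}
toy_spans = decode_spans(toy_preds, toy_types)
```

With the model output from the usage example, the call would be `decode_spans(preds[0].tolist(), idx_to_type)`. The indices refer to tokenizer positions, including any special tokens the tokenizer adds.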

	## Training Data

	- Sources: Project Gutenberg, textfiles.com, Reddit, Wikipedia talk pages
	- Types with fewer than 150 annotated spans are excluded from training
	- Multi-label BIO annotation: tokens can carry labels for multiple overlapping marker types simultaneously
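
To make the multi-label BIO scheme concrete, the sketch below turns overlapping annotated spans into a per-token, per-type label grid. It illustrates the labeling idea only and is not the project's preprocessing code: `encode_spans`, the toy `type_to_idx` mapping, and the span format `(type, start, end)` with exclusive `end` are all assumptions.

```python
from typing import Dict, List, Tuple

O, B, I = 0, 1, 2  # same index order as the 'OBI' lookup in the usage example


def encode_spans(
    seq_len: int,
    spans: List[Tuple[str, int, int]],
    type_to_idx: Dict[str, int],
) -> List[List[int]]:
    """Build a (seq_len, num_types) grid of O/B/I ids from annotated spans.

    Each type owns its own column, so spans of different types can overlap
    without conflict.
    """
    labels = [[O] * len(type_to_idx) for _ in range(seq_len)]
    for marker, start, end in spans:
        t = type_to_idx[marker]
        labels[start][t] = B  # first token of the span
        for i in range(start + 1, end):
            labels[i][t] = I  # remaining tokens of the span
    return labels


# Hypothetical two-type mapping and two overlapping spans over 5 tokens
toy_type_to_idx = {"literate_concessive": 0, "literate_nested_clauses": 1}
toy_labels = encode_spans(
    5,
    [("literate_concessive", 0, 4), ("literate_nested_clauses", 2, 5)],
    toy_type_to_idx,
)
```

In this toy grid, token 2 is labeled I for `literate_concessive` and B for `literate_nested_clauses` at the same time, which a single-label BIO tagger could not represent.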

	## Marker Types (53)

	### Oral Markers (22 types)

	Characteristics of oral tradition and spoken discourse:

	| Category | Markers |
	|----------|---------|
	| Address & Interaction | vocative, imperative, second_person, inclusive_we, rhetorical_question, phatic_check, phatic_filler |
	| Repetition & Pattern | anaphora, parallelism, tricolon, lexical_repetition, antithesis |
	| Conjunction | simple_conjunction |
	| Formulas | discourse_formula, intensifier_doubling |
	| Narrative | named_individual, specific_place, temporal_anchor, sensory_detail, embodied_action, everyday_example |
	| Performance | self_correction |

	### Literate Markers (31 types)

	Characteristics of written, analytical discourse:

	| Category | Markers |
	|----------|---------|
	| Abstraction | nominalization, abstract_noun, conceptual_metaphor, categorical_statement |
	| Syntax | nested_clauses, relative_chain, conditional, concessive, temporal_embedding, causal_explicit |
	| Hedging | epistemic_hedge, probability, evidential, qualified_assertion, concessive_connector |
	| Impersonality | agentless_passive, agent_demoted, institutional_subject, objectifying_stance |
	| Scholarly apparatus | citation, cross_reference, metadiscourse, definitional_move |
	| Technical | technical_term, technical_abbreviation, enumeration, list_structure |
	| Connectives | contrastive, additive_formal |
	| Setting | concrete_setting, aside |

	## Evaluation

	Per-type detection F1 on test set (binary: B or I = positive, O = negative):

	<details><summary>Click to show per-marker precision/recall/F1/support</summary>
	```
	Type Prec Rec F1 Sup
	========================================================================
	literate_abstract_noun 0.190 0.325 0.240 381
	literate_additive_formal 0.246 0.556 0.341 27
	literate_agent_demoted 0.404 0.368 0.386 304
	literate_agentless_passive 0.575 0.607 0.591 1133
	literate_aside 0.379 0.429 0.403 436
	literate_categorical_statement 0.267 0.146 0.189 514
	literate_causal_explicit 0.227 0.279 0.251 190
	literate_citation 0.639 0.556 0.595 372
	literate_conceptual_metaphor 0.310 0.364 0.335 415
	literate_concessive 0.499 0.470 0.484 502
	literate_concessive_connector 0.455 0.408 0.430 49
	literate_concrete_setting 0.241 0.125 0.165 407
	literate_conditional 0.369 0.630 0.466 760
	literate_contrastive 0.310 0.428 0.360 341
	literate_cross_reference 0.386 0.524 0.444 42
	literate_definitional_move 0.395 0.185 0.252 81
	literate_enumeration 0.495 0.483 0.489 775
	literate_epistemic_hedge 0.421 0.481 0.449 445
	literate_evidential 0.625 0.360 0.457 472
	literate_institutional_subject 0.332 0.326 0.329 282
	literate_list_structure 0.338 0.523 0.411 86
	literate_metadiscourse 0.140 0.393 0.206 135
	literate_nested_clauses 0.091 0.246 0.133 1169
	literate_nominalization 0.499 0.612 0.549 991
	literate_objectifying_stance 0.635 0.365 0.464 167
	literate_probability 0.432 0.593 0.500 27
	literate_qualified_assertion 0.143 0.100 0.118 40
	literate_relative_chain 0.382 0.507 0.436 1424
	literate_technical_abbreviation 0.667 0.711 0.688 225
	literate_technical_term 0.280 0.375 0.321 715
	literate_temporal_embedding 0.228 0.259 0.242 526
	oral_anaphora 0.800 0.028 0.054 287
	oral_antithesis 0.249 0.238 0.243 412
	oral_discourse_formula 0.340 0.408 0.371 557
	oral_embodied_action 0.280 0.391 0.326 425
	oral_everyday_example 0.333 0.156 0.212 404
	oral_imperative 0.591 0.662 0.625 293
	oral_inclusive_we 0.516 0.632 0.568 622
	oral_intensifier_doubling 0.680 0.200 0.309 85
	oral_lexical_repetition 0.404 0.254 0.312 173
	oral_named_individual 0.441 0.749 0.556 770
	oral_parallelism 0.741 0.110 0.191 182
	oral_phatic_check 0.611 0.733 0.667 30
	oral_phatic_filler 0.174 0.409 0.244 93
	oral_rhetorical_question 0.509 0.692 0.586 905
	oral_second_person 0.576 0.552 0.564 811
	oral_self_correction 0.158 0.235 0.189 51
	oral_sensory_detail 0.285 0.169 0.212 461
	oral_simple_conjunction 0.179 0.102 0.130 98
	oral_specific_place 0.556 0.705 0.622 424
	oral_temporal_anchor 0.410 0.559 0.473 546
	oral_tricolon 0.299 0.119 0.171 553
	oral_vocative 0.652 0.747 0.696 158
	========================================================================
	Macro avg (types w/ support) 0.378
	```

	</details>

	Missing labels (test set): 0/53 — all types detected at least once.

	Notable patterns:
	- Strong performers (F1 ≥ 0.5): vocative (0.696), technical_abbreviation (0.688), phatic_check (0.667), imperative (0.625), specific_place (0.622), citation (0.595), agentless_passive (0.591), rhetorical_question (0.586), inclusive_we (0.568), second_person (0.564), named_individual (0.556), nominalization (0.549), probability (0.500)
	- Weak performers (F1 < 0.2): anaphora (0.054), qualified_assertion (0.118), simple_conjunction (0.130), nested_clauses (0.133), concrete_setting (0.165), tricolon (0.171), categorical_statement (0.189), self_correction (0.189), parallelism (0.191)
	- Precision-recall tradeoff: For most types, precision and recall fall within about 0.2 of each other. The sharpest exceptions are `anaphora` (0.800 precision / 0.028 recall), `parallelism` (0.741 / 0.110), and `intensifier_doubling` (0.680 / 0.200), which are high-precision but very low-recall: the model rarely predicts them, but is usually right when it does.
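
The per-type scores above can be reproduced from predictions with a few lines of code. The sketch below assumes the counts are taken per token, with any B or I label counted as positive, and that the macro average is the unweighted mean of per-type F1. The function names are illustrative, not taken from the evaluation script.

```python
from typing import Dict, List, Tuple


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 for a single marker type.

    Labels use 0=O, 1=B, 2=I; any non-zero label counts as positive.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g > 0 and p > 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p > 0)
    fn = sum(1 for g, p in zip(gold, pred) if g > 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


def macro_f1(per_type_f1: Dict[str, float]) -> float:
    """Unweighted mean of per-type F1 scores."""
    return sum(per_type_f1.values()) / len(per_type_f1)


# Sanity check against two rows of the table above
vocative_f1 = f1_from_pr(0.652, 0.747)   # table reports 0.696
anaphora_f1 = f1_from_pr(0.800, 0.028)   # table reports 0.054
```

Because the macro average weights every type equally, low-support types such as `probability` (27 test examples) move the headline number as much as high-support types such as `relative_chain` (1424).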

	## Architecture

	Custom `MultiLabelTokenClassifier` with independent B/I/O heads per marker type:
	```
	ModernBERT (answerdotai/ModernBERT-base)
	└── Dropout (p=0.1)
	    └── Linear (hidden_size → num_types × 3)
	        └── Reshape to (batch, seq, num_types, 3)
	```

	Each marker type gets an independent 3-way O/B/I classification, so a token can simultaneously carry labels for multiple overlapping marker types. Types share the full backbone representation but make independent predictions.
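
The head described above is small enough to sketch directly. The snippet below is an illustration under assumptions, not the released implementation: the class name `MultiLabelTokenHead` is hypothetical, a random tensor stands in for the ModernBERT output, and `hidden_size=768` is the hidden size of ModernBERT-base.

```python
import torch
from torch import nn


class MultiLabelTokenHead(nn.Module):
    """Dropout -> Linear -> reshape, giving one 3-way O/B/I decision per type."""

    def __init__(self, hidden_size: int, num_types: int, dropout: float = 0.1):
        super().__init__()
        self.num_types = num_types
        self.dropout = nn.Dropout(dropout)
        # One linear layer produces all num_types * 3 logits at once
        self.classifier = nn.Linear(hidden_size, num_types * 3)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = hidden_states.shape
        logits = self.classifier(self.dropout(hidden_states))
        # (batch, seq, num_types * 3) -> (batch, seq, num_types, 3)
        return logits.view(batch, seq_len, self.num_types, 3)


head = MultiLabelTokenHead(hidden_size=768, num_types=53).eval()
hidden = torch.randn(2, 16, 768)   # stand-in for backbone output
with torch.no_grad():
    logits = head(hidden)          # (2, 16, 53, 3)
preds = logits.argmax(dim=-1)      # (2, 16, 53), values in {0, 1, 2}
```

Taking the argmax over the last axis only means each type's O/B/I choice is made without reference to the other types, which is what permits overlapping spans.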

	### Regularization

	- Mixout (p=0.1): During training, each backbone weight element has a 10% chance of being replaced by its pretrained value per forward pass, acting as a stochastic L2 anchor that prevents representation drift (Lee et al., 2020)
	- Per-type focal loss (γ=2.0): Focuses learning on hard examples, reducing the contribution of easy negatives
	- Inverse-frequency type weights: Rare marker types receive higher loss weighting
	- Inverse-frequency OBI weights: B and I classes upweighted relative to dominant O class
	- Weighted random sampling: Examples containing rarer markers sampled more frequently
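
The loss terms listed above can be written out for a single token and a single marker type. This is a minimal pure-Python sketch: the focal loss follows the standard form `-w_c * (1 - p_t)^γ * log(p_t)`, while the weight normalization in `inverse_frequency_weights` (the "balanced" convention, `total / (n_classes * count)`) is an assumption, since the exact scheme used in training is not stated here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(counts: List[int]) -> List[float]:
    """Weight each class by total / (n_classes * count); rarer classes weigh more."""
    total = sum(counts)
    return [total / (len(counts) * c) for c in counts]


def focal_loss(
    logits: List[float], target: int, class_weights: List[float], gamma: float = 2.0
) -> float:
    """Weighted focal loss for one 3-way O/B/I decision."""
    p_t = softmax(logits)[target]  # probability assigned to the true class
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative O/B/I token counts (made up for the example)
obi_weights = inverse_frequency_weights([900, 40, 60])

confident_logits = [4.0, 0.0, 0.0]  # model strongly predicts O
easy = focal_loss(confident_logits, target=0, class_weights=obi_weights)  # true O
hard = focal_loss(confident_logits, target=1, class_weights=obi_weights)  # true B
```

With γ=2.0 the confidently correct O token contributes almost nothing, while the missed B token dominates, and the class weight amplifies that gap further. Setting `gamma=0` reduces the function to ordinary weighted cross-entropy.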

	### Initialization

	Fine-tuned from `answerdotai/ModernBERT-base`. Backbone linear layers wrapped with Mixout during training (frozen pretrained copy used as anchor). The classification head is randomly initialized:
	```
	backbone.* layers → loaded from pretrained, anchored via Mixout
	classifier.weight → randomly initialized
	classifier.bias → randomly initialized
	```
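
The Mixout anchoring can be illustrated on a flat list of weights. The sketch below is not the training code: it applies the element-wise replacement described above and then the rescaling `(mixed - p * u) / (1 - p)`, which is the form given by Lee et al. to keep the expected weight equal to the current weight. Whether this model's training code rescales in exactly this way is an assumption, and `mixout` is a hypothetical helper name.

```python
import random
from typing import List


def mixout(
    weights: List[float],
    pretrained: List[float],
    p: float = 0.1,
    rng=random,
) -> List[float]:
    """Return a Mixout-perturbed copy of `weights` for one forward pass.

    Each element is swapped for its pretrained value with probability p,
    then rescaled so the expectation equals the current weight.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    out = []
    for w, u in zip(weights, pretrained):
        mixed = u if rng.random() < p else w   # stochastic replacement
        out.append((mixed - p * u) / (1.0 - p))  # rescale to keep E[out] == w
    return out


# With p=0 nothing is replaced and the weights pass through unchanged
unchanged = mixout([0.5, -1.0], [0.0, 0.0], p=0.0)

# Empirical check of the expectation on many copies of one weight
sample = mixout([1.0] * 20000, [0.0] * 20000, p=0.1, rng=random.Random(0))
sample_mean = sum(sample) / len(sample)
```

Unlike dropout, which pulls masked elements toward zero, the replaced elements here fall back to the pretrained value, which is why the technique acts as an anchor to the pretrained weights.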

	## Limitations

	- Very-low-recall types: `anaphora` (0.028 recall), `qualified_assertion` (0.100), `simple_conjunction` (0.102), `parallelism` (0.110), and `tricolon` (0.119) are rarely detected despite being present in training data
	- Low-precision types: `nested_clauses` (0.091), `metadiscourse` (0.140), and `qualified_assertion` (0.143) have precision below 0.15, meaning most predictions for those types are false positives
	- Context window: 128 tokens max; longer inputs are truncated, so markers past the cutoff go undetected and spans that cross it may be cut short
	- Domain: Trained primarily on historical/literary texts; may underperform on modern social media
	- Subjectivity: Some marker boundaries are inherently ambiguous
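
One common workaround for the 128-token limit is to run the model over overlapping windows and merge the results. This is not something the model card prescribes. The sketch below only computes the window boundaries: `sliding_windows` is a hypothetical helper, the step of 64 tokens is an arbitrary choice, and the budget ignores any special tokens the tokenizer adds.

```python
from typing import List, Tuple


def sliding_windows(
    n_tokens: int, max_length: int = 128, step: int = 64
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges covering a sequence, `end` exclusive.

    Consecutive windows overlap by max_length - step tokens, so a span cut
    off at one window's edge can appear whole in the next.
    """
    if max_length <= 0 or not 0 < step <= max_length:
        raise ValueError("need max_length > 0 and 0 < step <= max_length")
    if n_tokens == 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        windows.append((start, end))
        if end >= n_tokens:  # last window reaches the end of the text
            break
        start += step
    return windows


long_doc_windows = sliding_windows(300)
short_doc_windows = sliding_windows(100)
```

Hugging Face tokenizers can produce such chunks directly via `return_overflowing_tokens=True` together with `stride`, where `stride` is the number of overlapping tokens rather than the step. How predictions in the overlap should be merged (for example, preferring the window in which a token sits furthest from an edge) is left open here.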

	## Citation
	```bibtex
	@misc{havelock2026token,
	  title={Havelock Orality Token Classifier},
	  author={Havelock AI},
	  year={2026},
	  url={https://huggingface.co/HavelockAI/bert-token-classifier}
	}
	```

	## References

	- Ong, Walter J. Orality and Literacy: The Technologizing of the Word. Routledge, 1982.
	- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
	- Warner, B. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.

	---

	Trained: February 2026